Integration: Unstructured File Converter
A component for the Haystack (2.x) LLM framework that converts files and directories into Documents using the Unstructured API.
Unstructured provides a set of tools for ETL (extract, transform, load) for LLMs. This component calls the Unstructured API, which extracts text and other information from a wide range of file formats. See supported file types.
Installation
pip install unstructured-fileconverter-haystack
Hosted API
If you plan to use the hosted version of the Unstructured API, you only need the (free) Unstructured API key. You can get it by signing up here.
Local API (Docker)
To run your own local instance of the Unstructured API, you need Docker; you can find instructions here.
In short, this should work:
docker run -p 8000:8000 -d --rm --name unstructured-api quay.io/unstructured-io/unstructured-api:latest --port 8000 --host 0.0.0.0
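Once the container is running, the converter has to be pointed at it rather than at the hosted service. The sketch below assumes that the container serves requests under the path /general/v0/general and that UnstructuredFileConverter accepts an api_url argument; neither is stated above, so check both against the version you have installed. The helper names local_api_url and make_local_converter are hypothetical.

```python
def local_api_url(host: str = "localhost", port: int = 8000) -> str:
    """Build the endpoint URL for a locally running Unstructured API.

    Assumption: the container serves requests under /general/v0/general.
    """
    return f"http://{host}:{port}/general/v0/general"


def make_local_converter(host: str = "localhost", port: int = 8000):
    """Create a converter that targets the local container (hypothetical helper)."""
    # Imported lazily so that local_api_url() works without the integration installed.
    from haystack_integrations.components.converters.unstructured import (
        UnstructuredFileConverter,
    )

    # Assumption: the constructor accepts an `api_url` keyword argument.
    return UnstructuredFileConverter(api_url=local_api_url(host, port))


# The port matches the one published by the docker command above.
print(local_api_url())
```

Calling make_local_converter() then gives a converter that can be used exactly like the default one in the Usage section.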
Usage
If you plan to use the hosted version of the Unstructured API, set the Unstructured API key as the environment variable UNSTRUCTURED_API_KEY:
export UNSTRUCTURED_API_KEY=your_api_key
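A missing or empty key is easy to overlook until the first request fails, so it can help to check for it up front. The following is a minimal sketch using only the standard library; the function name get_unstructured_api_key and the error message are hypothetical and not part of the integration.

```python
import os


def get_unstructured_api_key(env=None) -> str:
    """Return the Unstructured API key, or raise if it is missing or blank.

    `env` defaults to os.environ; passing a plain dict makes the check easy to test.
    """
    env = os.environ if env is None else env
    key = env.get("UNSTRUCTURED_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "UNSTRUCTURED_API_KEY is not set; export it before using the hosted API."
        )
    return key


# Example with an explicit mapping instead of the real environment.
print(get_unstructured_api_key({"UNSTRUCTURED_API_KEY": "your_api_key"}))
```

Running this check at program start gives a clear error before any file is sent for conversion.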
In isolation
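from haystack_integrations.components.converters.unstructured import UnstructuredFileConverter
# With the hosted API, the key is taken from the UNSTRUCTURED_API_KEY environment variable
converter = UnstructuredFileConverter()
# Paths may point to individual files or to whole directories
documents = converter.run(paths=["a/file/path.pdf", "a/directory/path"])["documents"]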
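To get a quick look at what the converter returned, the resulting list can be summarized one line per Document. This sketch assumes each Document exposes a content string and a meta dictionary, and that the source path is stored under the meta key "file_path"; the key name is an assumption, which is why it is read with a fallback. The summarize function is hypothetical, and stand-in objects are used here so the example runs without calling the API.

```python
from types import SimpleNamespace


def summarize(documents, preview_chars: int = 40):
    """Return one summary line per document: source, length, and a text preview."""
    lines = []
    for doc in documents:
        # Assumption: the converter records the source path under "file_path".
        source = doc.meta.get("file_path", "<unknown>")
        text = doc.content or ""
        lines.append(f"{source}: {len(text)} chars, starts with {text[:preview_chars]!r}")
    return lines


# Stand-ins for converted Documents, used only for illustration.
fake_documents = [
    SimpleNamespace(content="Hello world", meta={"file_path": "a/file/path.pdf"}),
    SimpleNamespace(content=None, meta={}),
]
for line in summarize(fake_documents):
    print(line)
```

With real output, pass the documents list from converter.run(...) in place of fake_documents.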
In a Haystack Pipeline
from haystack import Pipeline
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_integrations.components.converters.unstructured import UnstructuredFileConverter
document_store = InMemoryDocumentStore()
# Indexing pipeline: convert files to Documents, then write them to the store
indexing = Pipeline()
indexing.add_component("converter", UnstructuredFileConverter())
indexing.add_component("writer", DocumentWriter(document_store))
# Send the converter's output to the writer's input
indexing.connect("converter", "writer")
indexing.run({"converter": {"paths": ["a/file/path.pdf", "a/directory/path"]}})