Maintained by deepset

Integration: Unstructured File Converter

Component to easily convert files and directories into Documents using the Unstructured API

Authors
deepset

A component for the Haystack (2.x) LLM framework that converts files and directories into Documents using the Unstructured API.

Unstructured provides a set of tools for doing ETL for LLMs. This component calls the Unstructured API, which extracts text and other information from a wide range of file formats. See supported file types.

Installation

pip install unstructured-fileconverter-haystack

Hosted API

If you plan to use the hosted version of the Unstructured API, you only need the (free) Unstructured API key. You can get it by signing up here.
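To avoid hard-coding the key in source files, one option is to read it from an environment variable. The following is a minimal sketch using only the standard library; the helper name `load_api_key` is hypothetical and not part of the integration, and the variable name `UNSTRUCTURED_API_KEY` is an assumption borrowed from the placeholder used in the usage examples.

```python
import os


def load_api_key(env_var: str = "UNSTRUCTURED_API_KEY") -> str:
    """Return the API key stored in `env_var`, or raise a clear error.

    Hypothetical helper: the integration itself does not ship this function.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Unstructured API key."
        )
    return key
```

The returned string can then be passed wherever the examples on this page use the `api_key` argument.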

Local API (Docker)

To run your own local instance of the Unstructured API, you need Docker. You can find instructions here.

In short, this should work:

docker run -p 8000:8000 -d --rm --name unstructured-api quay.io/unstructured-io/unstructured-api:latest --port 8000 --host 0.0.0.0
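Because the container is started in detached mode (`-d`), it may take a moment before it accepts connections. As a rough sketch, a script could wait for the published port before sending files. The helper `wait_for_port` is hypothetical, and the defaults (`localhost`, port `8000`) simply mirror the `docker run` command above. An open port only shows that something is listening, not that the API is fully ready.

```python
import socket
import time


def wait_for_port(host: str = "localhost", port: int = 8000, timeout: float = 30.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or `timeout` seconds pass.

    Hypothetical helper; returns True if the port accepted a connection in time.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means a process is listening on the port.
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            # Not reachable yet: back off briefly, then retry.
            time.sleep(0.5)
    return False
```

A script might call `wait_for_port()` once after starting the container and stop with an error message if it returns `False`.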

Usage

In isolation

from haystack_integrations.components.converters.unstructured import UnstructuredFileConverter

# "UNSTRUCTURED_API_KEY" is a placeholder: substitute your own key.
converter = UnstructuredFileConverter(api_key="UNSTRUCTURED_API_KEY")

# `paths` may mix individual files and directories.
documents = converter.run(paths=["a/file/path.pdf", "a/directory/path"])["documents"]
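If only some of the files in a directory should be converted, the list given to `paths` can be built beforehand. The sketch below uses a hypothetical helper, `collect_files`, that walks a directory recursively and keeps files with selected extensions. The default suffixes are purely illustrative and are not the list of file types Unstructured supports.

```python
from pathlib import Path
from typing import List, Sequence


def collect_files(root: str, suffixes: Sequence[str] = (".pdf", ".docx", ".txt")) -> List[str]:
    """Return sorted paths of files under `root` whose extension is in `suffixes`.

    Hypothetical helper; the extension match is case-insensitive.
    """
    wanted = {s.lower() for s in suffixes}
    return sorted(
        str(p)
        for p in Path(root).rglob("*")  # recurse into subdirectories
        if p.is_file() and p.suffix.lower() in wanted
    )
```

The resulting list of strings can be passed as the `paths` argument in place of a directory path.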

In a Haystack Pipeline

from haystack import Pipeline
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_integrations.components.converters.unstructured import UnstructuredFileConverter

document_store = InMemoryDocumentStore()

# Indexing pipeline: convert the files, then write the Documents to the store.
indexing = Pipeline()
# "UNSTRUCTURED_API_KEY" is a placeholder: substitute your own key.
indexing.add_component("converter", UnstructuredFileConverter(api_key="UNSTRUCTURED_API_KEY"))
indexing.add_component("writer", DocumentWriter(document_store))
indexing.connect("converter", "writer")

# `paths` may mix individual files and directories.
indexing.run({"converter": {"paths": ["a/file/path.pdf", "a/directory/path"]}})