Integration: AstraDB
A Document Store for storing documents in and retrieving them from AstraDB, built for Haystack 2.0.
Overview
DataStax Astra DB is a serverless vector database built on Apache Cassandra that supports vector-based search and auto-scaling. You can deploy it on AWS, GCP, or Azure and expand to one or more regions within those clouds for multi-region availability, low-latency data access, data sovereignty, and to avoid cloud vendor lock-in. For more information, see the DataStax documentation.
This integration allows you to use AstraDB for document storage and retrieval in your Haystack 2.0 pipelines. This page provides instructions on how to initialize an AstraDB instance and connect it to Haystack.
Components
- AstraDocumentStore: This component serves as a persistent data store for your Haystack documents, and supports a number of embedding models and vector dimensions.
- AstraEmbeddingRetriever: This is an embedding-based Retriever compatible with the Astra Document Store.
Initialization
First, sign up for a free DataStax account. Follow these instructions for creating an AstraDB database in the DataStax console. Make sure you create a collection, a keyspace name, and an access token, since you’ll need those later.
Installation
pip install astra-haystack
Usage
This package includes Astra Document Store and Astra Retriever classes that integrate with Haystack 2.0, allowing you to easily perform document retrieval or RAG with AstraDB, and include those functions in Haystack pipelines.
To connect AstraDB with Haystack, you’ll need these pieces of information from your DataStax console:
- AstraDB ID
- region
- Astra collection name
- Astra keyspace name
- access token
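Before constructing the document store, it can help to check that all five values are actually set. The following is a minimal sketch of such a check, not part of astra-haystack: the helper name `missing_astra_settings` is hypothetical, and the environment variable names are assumed to match the ones used in the usage examples on this page.

```python
import os

# Environment variable names as used in the usage examples on this page.
REQUIRED_VARS = (
    "ASTRA_DB_ID",
    "ASTRA_DB_REGION",
    "ASTRA_DB_APPLICATION_TOKEN",
    "COLLECTION_NAME",
    "KEYSPACE_NAME",
)


def missing_astra_settings(env=None):
    """Return the names of required settings that are unset or empty.

    Hypothetical helper for illustration; not part of astra-haystack.
    `env` defaults to the process environment but accepts any mapping.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Example: report which settings still need to be exported.
missing = missing_astra_settings()
if missing:
    print("Missing settings:", ", ".join(missing))
```

Note that the usage examples fall back to default values for the region, collection, and keyspace, so a stricter check like this one is only useful if you want to fail early instead of relying on those defaults.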
How to use the AstraDocumentStore
import os

from haystack import Document
from haystack_integrations.document_stores.astra import AstraDocumentStore
astra_id = os.getenv("ASTRA_DB_ID", "")
astra_region = os.getenv("ASTRA_DB_REGION", "us-east-2")
astra_application_token = os.getenv("ASTRA_DB_APPLICATION_TOKEN", "")
collection_name = os.getenv("COLLECTION_NAME", "haystack_integration")
keyspace_name = os.getenv("KEYSPACE_NAME", "astra_haystack_test")
document_store = AstraDocumentStore(
    astra_id=astra_id,
    astra_region=astra_region,
    astra_collection=collection_name,
    astra_keyspace=keyspace_name,
    astra_application_token=astra_application_token,
)
document_store.write_documents([
    Document(content="This is first"),
    Document(content="This is second"),
])
print(document_store.count_documents())
How to use the AstraEmbeddingRetriever
import os

from haystack import Document, Pipeline
from haystack.components.embedders import SentenceTransformersTextEmbedder, SentenceTransformersDocumentEmbedder
from haystack_integrations.components.retrievers.astra import AstraEmbeddingRetriever
from haystack_integrations.document_stores.astra import AstraDocumentStore
astra_id = os.getenv("ASTRA_DB_ID", "")
astra_region = os.getenv("ASTRA_DB_REGION", "us-east-2")
astra_application_token = os.getenv("ASTRA_DB_APPLICATION_TOKEN", "")
collection_name = os.getenv("COLLECTION_NAME", "haystack_integration")
keyspace_name = os.getenv("KEYSPACE_NAME", "astra_haystack_test")
document_store = AstraDocumentStore(
    astra_id=astra_id,
    astra_region=astra_region,
    astra_collection=collection_name,
    astra_keyspace=keyspace_name,
    astra_application_token=astra_application_token,
)
# Name the variable model_name_or_path so it matches the embedder calls below.
model_name_or_path = "sentence-transformers/all-mpnet-base-v2"
documents = [
    Document(content="There are over 7,000 languages spoken around the world today."),
    Document(content="Elephants have been observed to behave in a way that indicates a high level of self-awareness, such as recognizing themselves in mirrors."),
    Document(content="In certain parts of the world, like the Maldives, Puerto Rico, and San Diego, you can witness the phenomenon of bioluminescent waves."),
]
document_embedder = SentenceTransformersDocumentEmbedder(model=model_name_or_path)
document_embedder.warm_up()
documents_with_embeddings = document_embedder.run(documents)
document_store.write_documents(documents_with_embeddings.get("documents"))
query_pipeline = Pipeline()
query_pipeline.add_component("text_embedder", SentenceTransformersTextEmbedder(model=model_name_or_path))
query_pipeline.add_component("retriever", AstraEmbeddingRetriever(document_store=document_store))
query_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
query = "How many languages are there?"
result = query_pipeline.run({"text_embedder": {"text": query}})
print(result['retriever']['documents'][0])
Note:
The current version of the Astra JSON API does not support the following operators: $lt, $lte, $gt, $gte, $nin, $not, $neq. Filtering on None values is also unsupported: documents are stored as JSON, which does not store None values, so these are never inserted.
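To catch these limitations before a query reaches AstraDB, you could pre-check a filter on the client side. The sketch below is a hypothetical helper, not part of astra-haystack. It assumes filters are written as nested dictionaries with `$`-prefixed operator keys, which may not match the exact filter syntax your version of the integration accepts.

```python
# Operators listed in the note above as unsupported by the Astra JSON API.
UNSUPPORTED_OPERATORS = {"$lt", "$lte", "$gt", "$gte", "$nin", "$not", "$neq"}


def find_filter_problems(filters, path=""):
    """Return human-readable descriptions of unsupported parts of a filter.

    Hypothetical helper for illustration. Walks nested dicts and lists,
    flagging unsupported operator keys and None values.
    """
    problems = []
    if isinstance(filters, dict):
        for key, value in filters.items():
            here = f"{path}.{key}" if path else str(key)
            if key in UNSUPPORTED_OPERATORS:
                problems.append(f"unsupported operator {key} at {here}")
            problems.extend(find_filter_problems(value, here))
    elif isinstance(filters, (list, tuple)):
        for index, item in enumerate(filters):
            problems.extend(find_filter_problems(item, f"{path}[{index}]"))
    elif filters is None:
        problems.append(f"None value at {path or '<root>'}")
    return problems


# Example: a range comparison is flagged, a membership test is not.
print(find_filter_problems({"year": {"$gt": 2020}}))
print(find_filter_problems({"lang": {"$in": ["en", "de"]}}))
```

Returning a list of problems rather than raising on the first one lets the caller decide whether to reject the filter outright or rewrite it.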
License
astra-haystack is distributed under the terms of the Apache-2.0 license.