Integrate Unstructured.io with Astra DB Serverless
The integration between Unstructured.io and Astra DB Serverless enables you to quickly convert common document types into LLM-ready vector data for highly relevant GenAI similarity searches.
In this Python tutorial, you will build a Retrieval Augmented Generation (RAG) pipeline powered by Astra DB Serverless. The code builds an LLM-based query engine and retrieves parsed data to provide contextual insights for users.
The Unstructured.io Python client library for document parsing is also included in RAGStack. It’s for enterprises that want a curated, supported, out-of-the-box GenAI stack for enterprise RAG applications, leveraging LangChain and LlamaIndex. For details, see RAG with Unstructured and Astra DB.
Prerequisites
The code samples on this page assume the following:
- You have an active Astra account.
- You have created a Serverless (Vector) database.
- You have created an application token with the Database Administrator role.
- You have installed Python 3.8+ and pip 23.0+.
Install the integration
Unstructured.io provides a simple, powerful developer function, partition(), which takes over a dozen different document types as input, including PDFs, HTML, CSVs, and more, and returns plain text suitable for indexing into Astra DB Serverless.
Only a file path is required; Unstructured.io handles the rest. The Python library has a number of optional dependencies to support parsing the various formats. If you want it to handle everything, run:
pip install "unstructured[all-docs]"
You can now parse documents with Unstructured.io. Next, add support for the Astra DB Serverless destination connector by installing the appropriate library. A destination connector is a vector store, or another persistent or in-memory store, that houses the embeddings and associated data parsed from the input documents. The low-latency, high-performance Astra DB Serverless vector database is an optimal choice. Install the connector, along with an embeddings model for LlamaIndex that you will use later:
pip install "unstructured[astra]"
pip install llama-index-embeddings-huggingface
Now, you are ready to set up the RAG pipeline, using Unstructured.io powered by Astra DB Serverless.
How Unstructured.io parses the document
Unstructured.io supports over a dozen document formats, everything from PDFs to images to CSV files. The first step in the process is taking a file or set of files and producing a parsed text-based document. You can use the Astra DB Pricing Page as an example. Here’s a code snippet. The full Python sample is shown later in this topic.
from unstructured.partition.html import partition_html
url = "https://www.datastax.com/pricing/astra-db"
elements = partition_html(url=url)
print("\n\n".join([str(el) for el in elements]))
By providing the web URL, you can parse the HTML page into text suitable for storing in Astra DB Serverless. Many other document formats are supported, but this tutorial uses HTML for illustration. Now you can use the Astra DB Serverless integration to index the documents, generate embeddings, and perform RAG operations.
Set environment variables
Before running the Python sample, create a .env file in the directory where you’ll run the sample Python code.
In the .env file, assign values for the ASTRA_DB_APPLICATION_TOKEN and ASTRA_DB_API_ENDPOINT environment variables:
ASTRA_DB_APPLICATION_TOKEN=TOKEN
ASTRA_DB_API_ENDPOINT=ENDPOINT
The integration steps and sample Python code, as described in this topic, do not require you to specify an Unstructured.io API key.
In Astra Portal, get the relevant values for your vector database from its Overview tab, under Database Details. If you haven’t already, generate an application token and copy its value. Keep it secret, keep it safe.
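Before running the pipeline, it can help to confirm that both required variables are actually set, so failures surface early with a clear message. The helper below is a hypothetical sketch, not part of the Unstructured.io or Astra DB libraries:

```python
import os

# Hypothetical helper: verify the required Astra DB credentials are set
# before running the pipeline, so a missing variable fails fast and clearly.
REQUIRED_VARS = ("ASTRA_DB_APPLICATION_TOKEN", "ASTRA_DB_API_ENDPOINT")

def check_astra_env() -> dict:
    missing = [name for name in REQUIRED_VARS if not os.getenv(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: os.environ[name] for name in REQUIRED_VARS}
```

Call check_astra_env() once at startup, after loading the .env file.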
The sample Python code loads and uses these defined credentials, and sets two additional variables inline, as shown in this code snippet:
# ...
def get_writer() -> Writer:
    return AstraWriter(
        connector_config=SimpleAstraConfig(
            access_config=AstraAccessConfig(
                api_endpoint=os.getenv("ASTRA_DB_API_ENDPOINT"),
                token=os.getenv("ASTRA_DB_APPLICATION_TOKEN"),
            ),
            collection_name=os.getenv("ASTRA_DB_COLLECTION_NAME", "unstructured"),
            embedding_dimension=os.getenv("ASTRA_DB_EMBEDDING_DIMENSION", 384),
        ),
        write_config=AstraWriteConfig(batch_size=20),
    )
# ...
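The batch_size=20 setting controls how many parsed records go into each write to Astra DB Serverless, so each request stays bounded in size. Conceptually, batching works like this stdlib-only sketch (an illustration, not the connector's actual implementation):

```python
# Illustrative sketch of batched writes: group records into batches of at
# most `batch_size` items, as the Astra writer does with its parsed records.
def batched(records, batch_size=20):
    for start in range(0, len(records), batch_size):
        yield records[start : start + batch_size]

# 45 records with batch_size=20 yields three batches: 20, 20, and 5 items.
batches = list(batched(list(range(45)), batch_size=20))
```

Larger batches mean fewer round trips; smaller batches keep individual requests light. The default of 20 is a reasonable middle ground.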
If you haven’t already, install python-dotenv.
Command:
python -m pip install python-dotenv
Result:
Collecting python-dotenv
Using cached python_dotenv-1.0.1-py3-none-any.whl.metadata (23 kB)
Using cached python_dotenv-1.0.1-py3-none-any.whl (19 kB)
Installing collected packages: python-dotenv
Successfully installed python-dotenv-1.0.1
Build the pipeline
Using the Unstructured.io parsing functionality, you can build an end-to-end pipeline going directly from a web URL (or a set of web URLs), to an Astra DB Serverless collection suitable for RAG applications. Here’s a Python example.
import os

from dotenv import load_dotenv

from unstructured.ingest.runner.writers.base_writer import Writer
from unstructured.ingest.runner.writers.astra import AstraWriter
from unstructured.partition.html import partition_html

load_dotenv()

url = "https://www.datastax.com/pricing/astra-db"
elements = partition_html(url=url)

if not os.path.exists("local-input-to-astra"):
    os.makedirs("local-input-to-astra")

for elem in elements:
    # Write the text to local txt files
    with open(f"local-input-to-astra/{elem.id}.txt", "w") as f:
        f.write(elem.text)

from unstructured.ingest.connector.local import SimpleLocalConfig
from unstructured.ingest.connector.astra import (
    AstraAccessConfig,
    AstraWriteConfig,
    SimpleAstraConfig,
)
from unstructured.ingest.interfaces import (
    ChunkingConfig,
    EmbeddingConfig,
    PartitionConfig,
    ProcessorConfig,
    ReadConfig,
)
from unstructured.ingest.runner import LocalRunner

def get_writer() -> Writer:
    return AstraWriter(
        connector_config=SimpleAstraConfig(
            access_config=AstraAccessConfig(
                api_endpoint=os.getenv("ASTRA_DB_API_ENDPOINT"),
                token=os.getenv("ASTRA_DB_APPLICATION_TOKEN"),
            ),
            collection_name=os.getenv("ASTRA_DB_COLLECTION_NAME", "unstructured"),
            embedding_dimension=os.getenv("ASTRA_DB_EMBEDDING_DIMENSION", 384),
        ),
        write_config=AstraWriteConfig(batch_size=20),
    )

writer = get_writer()

runner = LocalRunner(
    processor_config=ProcessorConfig(
        verbose=True,
        output_dir="local-output-to-astra",
        num_processes=2,
    ),
    connector_config=SimpleLocalConfig(
        input_path="local-input-to-astra",
    ),
    read_config=ReadConfig(),
    partition_config=PartitionConfig(),
    chunking_config=ChunkingConfig(chunk_elements=True),
    embedding_config=EmbeddingConfig(
        provider="langchain-huggingface",
    ),
    writer=writer,
    writer_kwargs={},
)
runner.run()
The HTML document is automatically parsed into a set of chunks, and each chunk is stored in Astra DB Serverless.
As a result of running the Python code, you now have an Astra DB Serverless collection named unstructured, with embeddings generated, that is suitable for a simple RAG pipeline. You can view the resulting collection in Astra Portal for your database.
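The ChunkingConfig(chunk_elements=True) step groups the parsed elements into chunks before embedding. Conceptually, chunking packs consecutive element texts under a size budget, as in this stdlib-only sketch (an illustration, not Unstructured.io's actual algorithm):

```python
# Conceptual sketch of element chunking: greedily pack consecutive element
# texts into chunks that stay under a character budget, joining with blank lines.
def chunk_texts(texts, max_chars=500):
    chunks, current = [], ""
    for text in texts:
        if current and len(current) + len(text) + 2 > max_chars:
            chunks.append(current)  # budget exceeded: start a new chunk
            current = text
        else:
            current = f"{current}\n\n{text}" if current else text
    if current:
        chunks.append(current)
    return chunks
```

Chunking keeps each embedded unit small enough to be semantically focused, which improves the relevance of similarity search over the collection.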
Query the vector data store
You can use a toolkit to perform queries against this Astra DB Serverless vector data store. In this example, you use LlamaIndex to connect to the newly created store and, with OpenAI’s GPT-4 model, ask questions about the indexed page. LlamaIndex’s OpenAI integration requires the OPENAI_API_KEY environment variable to be set.
Here’s a Python example.
import os

from llama_index.core import VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.vector_stores.astra import AstraDBVectorStore

astra_db_store = AstraDBVectorStore(
    token=os.getenv("ASTRA_DB_APPLICATION_TOKEN"),
    api_endpoint=os.getenv("ASTRA_DB_API_ENDPOINT"),
    collection_name=os.getenv("ASTRA_DB_COLLECTION_NAME", "unstructured"),
    embedding_dimension=os.getenv("ASTRA_DB_EMBEDDING_DIMENSION", 384),
)

index = VectorStoreIndex.from_vector_store(
    vector_store=astra_db_store,
    embed_model=HuggingFaceEmbedding(
        model_name="BAAI/bge-small-en-v1.5"
    ),
)

query_engine = index.as_query_engine()
response = query_engine.query(
    "how much is the astra db free tier?"
)
print(response.response)
Here’s a sample response:
The Astra DB free tier provides $25 monthly credit for the first three months, allowing users to explore the service without incurring costs during this initial period.
Next steps
See these related topics: