Integrate Unstructured.io with Astra DB Serverless


The integration between Unstructured.io and Astra DB Serverless enables you to quickly convert common document types into LLM-ready vector data for highly relevant GenAI similarity searches.

In this Python tutorial, you will build a Retrieval Augmented Generation (RAG) pipeline powered by Astra DB Serverless. The code builds an LLM-based query engine and retrieves parsed data to provide contextual insights for users.

The Unstructured.io Python client library for document parsing is also included in RAGStack. RAGStack is aimed at enterprises that want a curated, supported, out-of-the-box GenAI stack for enterprise RAG applications, leveraging LangChain and LlamaIndex. For details, see RAG with Unstructured and Astra DB.

Prerequisites

The code samples on this page assume the following:

Install the integration

Unstructured.io provides a simple, powerful developer function, partition(), which accepts more than a dozen document types as input, including PDFs, HTML, and CSVs, and returns plain text suitable for indexing into Astra DB Serverless.

Only a file path is required; Unstructured.io handles the rest. The Python library has a number of optional dependencies to support parsing the various formats. If you want it to handle everything, run:

pip install "unstructured[all-docs]"

You can now parse documents using Unstructured.io. Next, add support for the Astra DB Serverless destination connector by installing the appropriate library. A destination connector is a vector store, or some other persistent or in-memory store, that houses embeddings and the associated data parsed from the input documents. Astra DB Serverless, a low-latency, high-performance vector database, is an optimal choice. Install it, along with an embeddings model for LlamaIndex that you will use later:

pip install "unstructured[astra]"
pip install llama-index-embeddings-huggingface

The pip install example with [astra] requires Unstructured.io 0.12.5 or later. The Unstructured.io with Astra DB Serverless integration uses the latest release. See the Unstructured.io GitHub repository for information about its tagged releases.

Now you are ready to set up the RAG pipeline, using Unstructured.io and Astra DB Serverless.

How Unstructured.io parses the document

Unstructured.io supports over a dozen document formats, everything from PDFs to images to CSV files. The first step in the process is taking a file or set of files and producing a parsed text-based document. You can use the Astra DB Pricing Page as an example. Here’s a code snippet. The full Python sample is shown later in this topic.

from unstructured.partition.html import partition_html

url = "https://www.datastax.com/pricing/astra-db"
elements = partition_html(url=url)
print("\n\n".join([str(el) for el in elements]))

By providing the web URL, you can parse the HTML page into text suitable for storing in Astra DB Serverless. Unstructured.io supports a wide array of other document formats, but this tutorial uses HTML for illustration. Next, you use the Astra DB Serverless integration to index the documents, generate embeddings, and perform RAG operations.

Set environment variables

Before running the Python sample, create a .env file. Put it in the directory where you’ll run the sample Python code.

In the .env file, assign values for the ASTRA_DB_APPLICATION_TOKEN and ASTRA_DB_API_ENDPOINT environment variables:

ASTRA_DB_APPLICATION_TOKEN=TOKEN
ASTRA_DB_API_ENDPOINT=ENDPOINT

The integration steps and sample Python code, as described in this topic, do not require you to specify an Unstructured.io API key.
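Before running the sample, it can help to confirm the credentials are actually present so the pipeline fails fast with a clear message rather than deep inside a connector call. This is a minimal sketch; the check_env helper is illustrative, not part of the integration:

```python
import os

REQUIRED_VARS = ("ASTRA_DB_APPLICATION_TOKEN", "ASTRA_DB_API_ENDPOINT")

def check_env(env=None):
    """Raise a RuntimeError naming any missing credential variables."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
```

You would call check_env() immediately after load_dotenv() in the sample script.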

In Astra Portal, get the relevant values for your vector database from its Overview tab, under Database Details. If you haven’t already, generate an auth token, and copy its value. Keep it secret, keep it safe.

The sample Python code loads and uses these defined credentials, and sets two additional variables inline, as shown in this code snippet:

# ...

def get_writer() -> Writer:
   return AstraWriter(
       connector_config=SimpleAstraConfig(
           access_config=AstraAccessConfig(
               api_endpoint=os.getenv("ASTRA_DB_API_ENDPOINT"),
               token=os.getenv("ASTRA_DB_APPLICATION_TOKEN"),
           ),
           collection_name=os.getenv("ASTRA_DB_COLLECTION_NAME", "unstructured"),
           embedding_dimension=int(os.getenv("ASTRA_DB_EMBEDDING_DIMENSION", "384")),
       ),
       write_config=AstraWriteConfig(batch_size=20),
   )

# ...

If you haven’t already, install python-dotenv.

Command:

python -m pip install python-dotenv

Result:

Collecting python-dotenv
  Using cached python_dotenv-1.0.1-py3-none-any.whl.metadata (23 kB)
Using cached python_dotenv-1.0.1-py3-none-any.whl (19 kB)
Installing collected packages: python-dotenv
Successfully installed python-dotenv-1.0.1

Build the pipeline

Using the Unstructured.io parsing functionality, you can build an end-to-end pipeline going directly from a web URL (or a set of web URLs), to an Astra DB Serverless collection suitable for RAG applications. Here’s a Python example.

import os

from dotenv import load_dotenv

from unstructured.partition.html import partition_html

load_dotenv()

url = "https://www.datastax.com/pricing/astra-db"
elements = partition_html(url=url)

if not os.path.exists("local-input-to-astra"):
   os.makedirs("local-input-to-astra")

for elem in elements:
   # Write the text to local txt files
   with open(f"local-input-to-astra/{elem.id}.txt", "w") as f:
       f.write(elem.text)

from unstructured.ingest.connector.local import SimpleLocalConfig
from unstructured.ingest.connector.astra import (
   AstraAccessConfig,
   AstraWriteConfig,
   SimpleAstraConfig,
)
from unstructured.ingest.interfaces import (
   ChunkingConfig,
   EmbeddingConfig,
   PartitionConfig,
   ProcessorConfig,
   ReadConfig,
)
from unstructured.ingest.runner import LocalRunner
from unstructured.ingest.runner.writers.base_writer import Writer
from unstructured.ingest.runner.writers.astra import (
   AstraWriter,
)

def get_writer() -> Writer:
   return AstraWriter(
       connector_config=SimpleAstraConfig(
           access_config=AstraAccessConfig(
               api_endpoint=os.getenv("ASTRA_DB_API_ENDPOINT"),
               token=os.getenv("ASTRA_DB_APPLICATION_TOKEN"),
           ),
           collection_name=os.getenv("ASTRA_DB_COLLECTION_NAME", "unstructured"),
           embedding_dimension=int(os.getenv("ASTRA_DB_EMBEDDING_DIMENSION", "384")),
       ),
       write_config=AstraWriteConfig(batch_size=20),
   )

writer = get_writer()
runner = LocalRunner(
   processor_config=ProcessorConfig(
       verbose=True,
       output_dir="local-output-to-astra",
       num_processes=2,
   ),
   connector_config=SimpleLocalConfig(
       input_path="local-input-to-astra",
   ),
   read_config=ReadConfig(),
   partition_config=PartitionConfig(),
   chunking_config=ChunkingConfig(chunk_elements=True),
   embedding_config=EmbeddingConfig(
       provider="langchain-huggingface",
   ),
   writer=writer,
   writer_kwargs={},
)
runner.run()

The pipeline automatically parses the HTML document into a set of chunks, generates an embedding for each chunk, and stores each chunk in Astra DB Serverless.
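The ChunkingConfig(chunk_elements=True) setting combines parsed elements into larger chunks before embedding. The real implementation has more options, but conceptually it resembles this simplified sketch; the chunk_texts function is illustrative, not the library's API:

```python
def chunk_texts(texts, max_characters=500):
    """Greedily pack element texts into chunks of at most max_characters."""
    chunks, current = [], ""
    for text in texts:
        # Start a new chunk when adding this element would overflow the limit.
        if current and len(current) + len(text) + 2 > max_characters:
            chunks.append(current)
            current = text
        else:
            current = f"{current}\n\n{text}" if current else text
    if current:
        chunks.append(current)
    return chunks

# Example: three short elements pack into a single chunk under the default limit.
print(len(chunk_texts(["Astra DB", "pricing", "free tier"])))  # → 1
```

Chunking keeps each stored record small enough to embed and retrieve precisely, while grouping related elements so that context is not lost.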

As a result of running the Python code, you now have an Astra DB Serverless collection named unstructured, with embeddings generated, that is suitable for a simple RAG pipeline. You can view the result in Astra Portal. Here’s an example:

Astra Portal displays the collection named unstructured

Query the vector data store

You can use a toolkit to perform queries against this Astra DB Serverless vector data store. In this example, you use LlamaIndex to connect to the newly created store and, with OpenAI’s GPT-4 model, ask questions about the indexed page. Because LlamaIndex’s default query engine calls OpenAI, this step requires an OPENAI_API_KEY environment variable.

Here’s a Python example.

import os

from llama_index.core import VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.vector_stores.astra import AstraDBVectorStore

astra_db_store = AstraDBVectorStore(
   token=os.getenv("ASTRA_DB_APPLICATION_TOKEN"),
   api_endpoint=os.getenv("ASTRA_DB_API_ENDPOINT"),
   collection_name=os.getenv("ASTRA_DB_COLLECTION_NAME", "unstructured"),
   embedding_dimension=int(os.getenv("ASTRA_DB_EMBEDDING_DIMENSION", "384")),
)

index = VectorStoreIndex.from_vector_store(
   vector_store=astra_db_store,
   embed_model=HuggingFaceEmbedding(
       model_name="BAAI/bge-small-en-v1.5"
   )
)

query_engine = index.as_query_engine()
response = query_engine.query(
   "how much is the astra db free tier?"
)

print(response.response)

Here’s a sample response:

The Astra DB free tier provides $25 monthly credit for the first three months, allowing users to explore the service without incurring costs during this initial period.
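Under the hood, the query engine embeds the question with the same model used at ingest time and ranks the stored chunks by vector similarity. Here is a simplified, self-contained sketch of that ranking step; the toy vectors and the top_k helper are illustrative, not LlamaIndex's API:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vector, documents, k=1):
    """documents: list of (text, vector) pairs; returns the k closest texts."""
    ranked = sorted(
        documents,
        key=lambda doc: cosine_similarity(query_vector, doc[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]

# Toy 2-dimensional vectors standing in for the 384-dimensional embeddings.
docs = [("pricing chunk", [1.0, 0.0]), ("unrelated chunk", [0.0, 1.0])]
print(top_k([0.9, 0.1], docs))  # → ['pricing chunk']
```

The retrieved chunks are then passed to the LLM as context, which is how the engine grounds its answer in the indexed page rather than in the model’s training data alone.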
