• Glossary
  • Support
  • Downloads
  • DataStax Home
Get Live Help
Expand All
Collapse All

DataStax Astra DB Serverless Documentation

    • Overview
      • Release notes
      • Astra DB FAQs
      • Astra DB Architecture FAQ
      • Astra DB glossary
      • Get support
    • Getting Started
      • Astra Vector Search Quickstart
      • Create your database
      • Grant a user access
      • Load and retrieve data
        • Use DSBulk to load data
        • Use Data Loader in Astra Portal
      • Connect a driver
      • Build sample apps
    • Planning
      • Plan options
      • Database regions
    • Securing
      • Security highlights
      • Security guidelines
      • Default user permissions
      • Change your password
      • Reset your password
      • Authentication and Authorization
      • Astra DB Plugin for HashiCorp Vault
    • Connecting
      • Connecting private endpoints
        • AWS Private Link
        • Azure Private Link
        • GCP Private Endpoints
        • Connecting custom DNS
      • Connecting Change Data Capture (CDC)
      • Connecting CQL console
      • Connect the Spark Cassandra Connector to Astra
      • Drivers for Astra DB
        • Connecting C++ driver
        • Connecting C# driver
        • Connecting Java driver
        • Connecting Node.js driver
        • Connecting Python driver
        • Connecting Legacy drivers
        • Drivers retry policies
      • Get Secure Connect Bundle
    • Migrating
      • Components
      • FAQs
      • Preliminary steps
        • Feasibility checks
        • Deployment and infrastructure considerations
        • Create target environment for migration
        • Understand rollback options
      • Phase 1: Deploy ZDM Proxy and connect client applications
        • Set up the ZDM Proxy Automation with ZDM Utility
        • Deploy the ZDM Proxy and monitoring
        • Configure Transport Layer Security
        • Connect client applications to ZDM Proxy
        • Leverage metrics provided by ZDM Proxy
        • Manage your ZDM Proxy instances
      • Phase 2: Migrate and validate data
        • Cassandra Data Migrator
        • DSBulk Migrator
      • Phase 3: Enable asynchronous dual reads
      • Phase 4: Change read routing to Target
      • Phase 5: Connect client applications directly to Target
      • Troubleshooting
        • Troubleshooting tips
        • Troubleshooting scenarios
      • Glossary
      • Contribution guidelines
      • Release Notes
    • Managing
      • Managing your organization
        • User permissions
        • Pricing and billing
        • Audit Logs
        • Delete an account
        • Bring Your Own Key
          • BYOK AWS Astra Portal
          • BYOK GCP Astra Portal
          • BYOK AWS DevOps API
          • BYOK GCP DevOps API
        • Configuring SSO
          • Configure SSO for Microsoft Azure AD
          • Configure SSO for Okta
          • Configure SSO for OneLogin
      • Managing your database
        • Create your database
        • View your databases
        • Database statuses
        • Use DSBulk to load data
        • Use Data Loader in Astra Portal
        • Monitor your databases
        • Export metrics to third party
          • Export metrics via Astra Portal
          • Export metrics via DevOps API
        • Manage access lists
        • Manage multiple keyspaces
        • Using multiple regions
        • Terminate your database
      • Managing with DevOps API
        • Managing database lifecycle
        • Managing roles
        • Managing users
        • Managing tokens
        • Managing BYOK AWS
        • Managing BYOK GCP
        • Managing access list
        • Managing multiple regions
        • Get private endpoints
        • AWS PrivateLink
        • Azure PrivateLink
        • GCP Private Service
    • Integrations
    • Astra CLI
    • Astra Vector Search
      • Quickstarts
      • Examples
      • Create a serverless database with Vector Search
      • Query Vector Data with CQL
        • Using analyzers
      • Data modeling
      • Working with embeddings
    • Astra Block
      • Quickstart
      • FAQ
      • Data model
      • About NFTs
    • API QuickStarts
      • JSON API QuickStart
      • Document API QuickStart
      • REST API QuickStart
      • GraphQL CQL-first API QuickStart
    • Developing with APIs
      • Developing with JSON API
      • Developing with Document API
      • Developing with REST API
      • Developing with GraphQL API
        • Developing with GraphQL API (CQL-first)
        • Developing with GraphQL API (Schema-first)
      • Developing with gRPC API
        • gRPC Rust Client
        • gRPC Go Client
        • gRPC Node.js Client
        • gRPC Java Client
      • Developing with CQL API
      • Tooling Resources
      • Node.js Document Collection Client
      • Node.js REST Client
    • API References
      • Astra DB JSON API v1
      • Astra DB REST API v2
      • Astra DB Document API v2
      • Astra DB DevOps API v2
  • DataStax Astra DB Serverless Documentation
  • Astra Vector Search
  • Data modeling

Data Modeling

As you develop AI and Machine Learning (ML) applications using Astra Vector Search, here are some data modeling considerations. These factors help effectively leverage vector search to produce accurate and efficient search responses within your application.

Data representation

Vector search relies on representing data points as high-dimensional vectors. The choice of vector representation depends on the nature of the data.

For data that consists of text documents, techniques like word embeddings (e.g., Word2Vec) or document embeddings (e.g., Doc2Vec) can be used to convert text into vectors. More complex models can also be used to generate embeddings using Large Language Models (LLMs) like OpenAI GPT-4 or Meta LLaMA 2. Word2Vec is a relatively simple model that uses a shallow neural network to learn embeddings for words based on their context. The key concept is that Word2Vec generates a single fixed vector for each word, regardless of the context in which the word is used. LLMs are much more complex models that use deep neural networks, specifically transformer architectures, to learn embeddings for words based on their context. Unlike Word2Vec, these models generate contextual embeddings, meaning the same word can have different embeddings depending on the context in which it is used.

Images can be represented using deep learning techniques like convolutional neural networks (CNNs) or pre-trained models such as Contrastive Language Image Pre-training (CLIP). Select a vector representation that captures the essential features of the data.

DataStax Astra Vector Search offers a quickstart to uses some of these techniques.

Dataset dimensions

A vector search only works when the vectors have the same dimensions, because vector operations like dot product and cosine similarity require vectors to have the same number of dimensions.

For vector search, it is crucial that all embeddings are created in the same vector space. This means that the embeddings should follow the same principles and rules to enable proper comparison and analysis. Using the same embedding library guarantees this compatibility because the library consistently transforms data into vectors in a specific, defined way. For example, comparing Word2Vec embeddings with BERT (an LLM) embeddings could be problematic because these models have different architectures and create embeddings in fundamentally different ways.

Preprocessing embeddings vectors

Normalizing is about scaling the data so it has a length of one. This is typically done by dividing each element in a vector by the vector’s length.

Standardizing is about shifting (subtracting the mean) and scaling (dividing by the standard deviation) the data so it has a mean of zero and a standard deviation of one.

It is important to note that standardizing and normalizing in the context of embedding vectors are not the same. The correct preprocessing method (standardizing, normalizing, or even something else) depends on the specific characteristics of your data and what you are trying to achieve with your machine learning model. Preprocessing steps may involve cleaning and tokenizing text, resizing and normalizing images, or handling missing values.

Normalizing embedding vectors

Normalizing embedding vectors is a process that ensures every embedding vector in your vector space has a length (or norm) of one. This is done by dividing each element of the vector by the vector’s length (also known as its Euclidean norm or L2 norm).

For example, look at the embedding vectors from the CQL for Vector Search and their normalized counterparts, where a consistent length has been used for all the vectors:

  • Original

  • Normalized

[0.1, 0.15, 0.3, 0.12, 0.05]
[0.45, 0.09, 0.01, 0.2, 0.11]
[0.1, 0.05, 0.08, 0.3, 0.6]
[0.27, 0.40, 0.80, 0.32, 0.13]
[0.88, 0.18, 0.02, 0.39, 0.21]
[0.15, 0.07, 0.12, 0.44, 0.88]

The primary reason you would normalize vectors when working with embeddings is that it makes comparisons between vectors more meaningful. By normalizing, you ensure that comparisons are not affected by the scale of the vectors, and are solely based on their direction. This is particularly useful to calculate the cosine similarity between vectors, where the focus is on the angle between vectors (directional relationship), not their magnitude.

Normalizing embedding vectors is a way of standardizing your high-dimensional data so that comparisons between different vectors are more meaningful and less affected by the scale of the original vectors. Since dot product and cosine are equivalent for normalized vectors, but the dot product algorithm is 50% faster, it is recommended that developers use dot product for the similarity function.

However, if embeddings are NOT normalized, then dot product silently returns meaningless query results. Therefore, dot product is not set as the default similarity function in Vector Search.

When you use OpenAI, PaLM, or Simsce to generate your embeddings, they are normalized by default. If you use a different library, you want to normalize your vectors and set the similarity function to dot product. See how to set the similarity function in CQL for Vector Search.

Normalization is not required for all vector search examples.

Standardizing embedding vectors

Standardizing embedding vectors typically refers to a process similar to that used in statistics where data is standardized to have a mean of zero and a standard deviation of one. The goal of standardizing is to transform the embedding vectors so they have properties of a standard normal Gaussian distribution.

If you are using a machine learning model that uses distances between points (like nearest neighbors or any model that uses Euclidean distance or cosine similarity), standardizing can ensure that all features contribute equally to the distance calculations. Without standardization, features on larger scales can dominate the distance calculations.

In the context of neural networks, for example, having input values that are on a similar scale can help the network learn more effectively, because it ensures that no particular feature dominates the learning process simply because of its scale.

Indexing and Storage

SAI indexing and storage mechanisms are tailored for large datasets like vector search. Currently, SAI uses Hierarchical Navigable Small World (HNSW), an algorithm for Approximate Nearest Neighbor (ANN) search.

The goal of ANN search algorithms like HNSW is to find the data points in a dataset that are closest (or most similar) to a given query point. However, finding the exact nearest neighbors can be computationally expensive, particularly when dealing with high-dimensional data. Therefore, ANN algorithms aim to find the nearest neighbors approximately, prioritizing speed and efficiency over exact accuracy.

HNSW achieves this goal by creating a hierarchy of graphs, where each level of the hierarchy corresponds to a small world graph that is navigable. For any given node (data point) in the graph, it is easy to find a path to any other node. The higher levels of the hierarchy have fewer nodes and are used for coarse navigation, while the lower levels have more nodes and are used for fine navigation. Such indexing structures enable fast retrieval by narrowing down the search space to potential matches.

Similarity Metric

Vector search relies on computing the similarity or distance between vectors to identify relevant matches. Choosing an appropriate similarity metric is crucial, as different metrics may be more suitable for specific types of data. Common similarity metrics include cosine similarity, Euclidean distance, or Jaccard similarity. The choice of metric should align with the characteristics of the data and the desired search behavior.

Astra DB Vector Search supports three similarity metrics: cosine similarity, dot product, and Euclidean distance. The default similarity algorithm for the Astra DB Vector Search indexes is cosine similarity. DataStax recommends using dot product on normalized embeddings for most applications, because dot product is 50% faster than cosine similarity.

Scalability and Performance

Scalability is a critical consideration as your dataset expands. Vector search algorithms should be designed to handle large-scale datasets efficiently. Your serverless database using Astra Vector Search efficiently distributes data and accesses that data with parallel processing to enhance performance.

Evaluation and iteration

Continuously evaluate and iterate the data to refine search results against known truth and user feedback. This also helps identify areas for improvement. Iteratively refining the vector representations, similarity metrics, indexing techniques, or preprocessing steps can lead to better search performance and user satisfaction.

Using analyzers Working with embeddings

General Inquiries: +1 (650) 389-6000 info@datastax.com

© DataStax | Privacy policy | Terms of use

DataStax, Titan, and TitanDB are registered trademarks of DataStax, Inc. and its subsidiaries in the United States and/or other countries.

Apache, Apache Cassandra, Cassandra, Apache Tomcat, Tomcat, Apache Lucene, Apache Solr, Apache Hadoop, Hadoop, Apache Pulsar, Pulsar, Apache Spark, Spark, Apache TinkerPop, TinkerPop, Apache Kafka and Kafka are either registered trademarks or trademarks of the Apache Software Foundation or its subsidiaries in Canada, the United States and/or other countries.

Kubernetes is the registered trademark of the Linux Foundation.

landing_page landingpage