Working with embeddings
Embeddings are vectors, often generated by machine learning models, that capture semantic relationships between concepts or objects by placing related objects near each other in the embedding space. An embedding model maps each piece of data to a list of numeric values based on its attributes, and those values determine where the data sits relative to other data in the embedding space.
You can embed almost any kind of data and retrieve good results with vector search. As models continue to improve, the quality of results will also continue to improve.
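As a rough illustration of "related objects near each other", the sketch below compares hand-written toy vectors with cosine similarity. The three vectors and their labels are made up for this example and do not come from any real embedding model; real embeddings usually have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hand-written toy embeddings (assumption: not produced by a real model).
toy_embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "kitten": [0.85, 0.9, 0.15],
    "truck": [0.1, 0.2, 0.95],
}

cat_vs_kitten = cosine_similarity(toy_embeddings["cat"], toy_embeddings["kitten"])
cat_vs_truck = cosine_similarity(toy_embeddings["cat"], toy_embeddings["truck"])

# Related concepts score closer to 1.0 than unrelated ones.
print(f"cat vs kitten: {cat_vs_kitten:.3f}")
print(f"cat vs truck:  {cat_vs_truck:.3f}")
```

With these toy values, the "cat" and "kitten" vectors point in nearly the same direction, while "cat" and "truck" do not.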
Considerations
- Schema: When creating a table or updating a schema, vector columns use the VECTOR type with a fixed dimensionality. The dimensionality is the number of floats in the vector; for example, a 768-dimensional vector column is declared as VECTOR<FLOAT, 768>. The dimension value is determined by the embedding model you use. Some machine learning libraries will tell you the dimension value, but you must still define it yourself so that it matches the embedding model. In Astra DB, you can add multiple vector columns to each table.
- Encoding: Select an embedding model for your dataset that creates good structure by placing related objects near each other in the embedding space. Determine the dimension of your embeddings and set it in your VECTOR type. You may need to test different embedding models. Use the same embedding model to generate both the stored embeddings and the query vector used for the Approximate Nearest Neighbor (ANN) search. Use only one embedding model for a single vector search, not multiple models.
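The schema and encoding considerations above can be enforced in application code before any data reaches the database. The following is a minimal sketch, not an Astra DB API: the `VectorColumn` class, the column name `item_vector`, and the model label `example-model-768` are all hypothetical names chosen for illustration. It builds the `VECTOR<FLOAT, N>` type string from the model's dimension and rejects vectors that come from a different model or have the wrong length.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VectorColumn:
    """Describes one vector column and the embedding model that fills it."""

    name: str        # hypothetical column name
    model_name: str  # label for the embedding model (assumption)
    dimension: int   # number of floats the model produces

    def cql_type(self) -> str:
        """Return the fixed-dimension vector type string for this column."""
        if self.dimension <= 0:
            raise ValueError("dimension must be a positive integer")
        return f"VECTOR<FLOAT, {self.dimension}>"

    def validate(self, vector, model_name):
        """Check that a vector came from the same model and has the right size."""
        if model_name != self.model_name:
            raise ValueError(
                f"vector was produced by {model_name!r}, "
                f"but column {self.name!r} expects {self.model_name!r}"
            )
        if len(vector) != self.dimension:
            raise ValueError(
                f"expected {self.dimension} values, got {len(vector)}"
            )
        return list(vector)


column = VectorColumn(name="item_vector", model_name="example-model-768", dimension=768)

# Fragment that could go inside a CREATE TABLE statement.
column_definition = f"{column.name} {column.cql_type()}"
print(column_definition)

# Stored embeddings and query vectors pass through the same check.
query_vector = column.validate([0.0] * 768, "example-model-768")
```

Running both stored embeddings and query vectors through the same `validate` call is one simple way to catch a mismatched model or dimension early.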
Use cases
Vector databases with well-optimized embeddings allow for new ways to search and associate data, generating results that were not previously possible with traditional databases.
Examples:
- Search for items that are similar to a given item, without needing to know the exact item name or ID
- Retrieve documents based on similarity of context and content rather than exact string or keyword matches
- Expand search results across dissimilar items, such as searching for a product and retrieving contextually similar products from a different category
- Run word similarity searches, and suggest ways for users to rephrase queries or passages
- Encode text, images, audio, or video as queries and retrieve media that are conceptually, visually, audibly, or contextually similar to the input
- Reduce time spent on metadata and curation by automatically generating associations for data
- Improve data quality by automatically identifying and removing duplicates
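Two of the use cases above, similar-item search and duplicate detection, can be sketched with a small in-memory example. This is an exact brute-force comparison over toy vectors, used here as a stand-in for the approximate (ANN) search a vector database performs. The catalog IDs, vectors, and the 0.99 threshold are assumptions made for illustration.

```python
import math
from itertools import combinations


def normalize(vector):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    length = math.sqrt(sum(x * x for x in vector))
    if length == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / length for x in vector]


def similarity(a, b):
    """Cosine similarity computed as the dot product of unit vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def most_similar(query, items, k=2):
    """Return the IDs of the k items most similar to the query (exact search)."""
    scored = [(similarity(query, vec), item_id) for item_id, vec in items.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item_id for _, item_id in scored[:k]]


def near_duplicates(items, threshold=0.99):
    """Return pairs of IDs whose vectors are at least `threshold` similar."""
    return [
        (id_a, id_b)
        for (id_a, vec_a), (id_b, vec_b) in combinations(items.items(), 2)
        if similarity(vec_a, vec_b) >= threshold
    ]


# Toy catalog (assumption: hand-written vectors, hypothetical IDs).
catalog = {
    "sku-1": [0.9, 0.1, 0.0],
    "sku-2": [0.9, 0.1, 0.0],  # same vector as sku-1
    "sku-3": [0.7, 0.6, 0.1],
    "sku-4": [0.0, 0.1, 0.9],
}

similar_ids = most_similar([0.8, 0.5, 0.0], catalog, k=2)
duplicate_pairs = near_duplicates(catalog)
print(similar_ids)
print(duplicate_pairs)
```

The query never names an item; it only supplies a vector, and the closest catalog entries are returned by similarity alone. A suitable duplicate threshold depends on the embedding model and the data, so it would need tuning in practice.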
Best practices
- Store relevant metadata about a vector in other columns in your table. For example, if your vector is the embedding of an image, store the original image in the same table.
- Select a pre-trained model based on the queries you will need to make to your database.
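The first practice above can be pictured as a row that carries both the embedding and the data it describes. The sketch below is a plain-Python illustration, not a database client: the `ImageRow` fields, file names, and two-dimensional vectors are hypothetical. It finds the closest row by Euclidean distance and then reads the readable columns from that same row.

```python
import math
from dataclasses import dataclass


@dataclass
class ImageRow:
    """One table row: an embedding plus the metadata stored alongside it."""

    image_id: str
    file_name: str      # readable metadata (hypothetical column)
    caption: str        # readable metadata (hypothetical column)
    image_bytes: bytes  # the original image, kept in the same row
    embedding: list     # vector produced by the embedding model


def closest_row(query_vector, rows):
    """Return the row whose embedding is nearest to the query vector."""
    return min(rows, key=lambda row: math.dist(query_vector, row.embedding))


# Toy rows with placeholder image bytes and hand-written vectors.
rows = [
    ImageRow("img-1", "beach.png", "Waves on a sandy beach", b"<png bytes>", [0.1, 0.9]),
    ImageRow("img-2", "city.png", "Skyline at night", b"<png bytes>", [0.9, 0.2]),
]

match = closest_row([0.2, 0.8], rows)
print(match.file_name, "-", match.caption)
```

Because the metadata lives next to the vector, the search result can be shown to a user without a second lookup.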
Limitations
While vector embeddings can replace or augment some functions of a traditional database, they are not a replacement for other data types. Embeddings are best applied as a supplement to existing data because of the following limitations:
- Vector embeddings are not human-readable. Embeddings are not recommended when you need to retrieve data directly from a table.
- The model might not be able to capture all relevant information from the data, leading to incorrect or incomplete results.