DataStax Astra DB Serverless Documentation


Using analyzers with CQL

Analyzers process the text in a column to enable term matching on large strings. Combined with vector search, analyzers make it easier to find relevant information in large datasets: instead of returning only a list of semantically similar results, you can filter on specific terms while the results remain ordered by similarity to your query vector.

For example, if you ask an LLM “Tell me about available shoes” and this query is done with vector search against your Astra DB table, you would get a list of several shoes with a variety of features.

SELECT * from products
ORDER BY vector ANN OF [6.0,2.0, ... 3.1,4.0]
LIMIT 10;

Alternatively, you can use the analyzer search to specify a keyword, such as hiking:

SELECT * from products
WHERE val : 'hiking'
ORDER BY vector ANN OF [6.0,2.0, ... 3.1,4.0]
LIMIT 10;

An analyzed index is one where the stored values are derived from the raw column values. The stored values depend on the analyzer configuration options, which can include tokenization, filtering, and character filtering.
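The idea of deriving stored values from raw column values can be sketched in plain Python. This is only an illustration of the three stages named above (character filtering, tokenization, token filtering); the function names are hypothetical and the regular expressions are simplified stand-ins, not the Lucene components that Astra DB uses.

```python
import re

def strip_html(text: str) -> str:
    """Char filter stand-in: drop anything that looks like an HTML tag."""
    return re.sub(r"<[^>]+>", "", text)

def tokenize(text: str) -> list[str]:
    """Tokenizer stand-in: split on runs of non-alphanumeric characters."""
    return [t for t in re.split(r"[^0-9A-Za-z]+", text) if t]

def lowercase(tokens: list[str]) -> list[str]:
    """Token filter stand-in: lowercase every token."""
    return [t.lower() for t in tokens]

def analyze(raw: str) -> list[str]:
    """Run the raw value through char filter, tokenizer, then token filter."""
    return lowercase(tokenize(strip_html(raw)))

# The terms below are what an analyzed index would store instead of the raw string.
print(analyze("<b>Hiking</b> Shoes, waterproof"))
```

The raw string never reaches the index unchanged; only the derived terms do, which is why a query for a single word can match a longer column value.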

Analyzer Operator for CQL

To enable analyzer operations using CQL on Astra serverless databases with Vector Search, a : operator is available in Cassandra Query Language. This operator can search indexed columns in Storage Attached Indexes (SAI) that are analyzed.

Example

  1. Create a table:

    CREATE TABLE vsearch.products
    (id text PRIMARY KEY,
    val text);
  2. Create an SAI index with the index_analyzer option and stemming enabled:

    CREATE CUSTOM INDEX vsearch_products_val_idx 
    ON vsearch.products(val)
    USING 'org.apache.cassandra.index.sai.StorageAttachedIndex' WITH OPTIONS = { 
    'index_analyzer': '{
    "tokenizer" : {"name" : "standard"},
    "filters" : [{"name" : "porterstem"}]
    }'};
  3. Insert sample rows:

    INSERT INTO vsearch.products (id, val)
    VALUES ('1', 'soccer cleats');
    
    INSERT INTO vsearch.products (id, val)
    VALUES ('2', 'running shoes');
    
    INSERT INTO vsearch.products (id, val)
    VALUES ('3', 'hiking shoes');
  4. Query to retrieve your data.

    To get data from the row with id = '2':

    SELECT * FROM vsearch.products
    WHERE val : 'running';

    The analyzer splits the text into case-independent terms.

    To get the row with id = '3', match on both terms. The analyzer standardizes each query term before performing the match:

    SELECT * FROM vsearch.products
    WHERE val : 'hiking' AND val : 'shoes';
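The matching behavior in this example can be approximated outside the database. The following Python sketch is a simplified model, not how SAI is implemented: `toy_stem` is a hypothetical stand-in for the porterstem filter (it only strips a trailing "ing" or "s"), lowercasing models the case-independent terms described above, and `search` treats several terms as an AND.

```python
import re

def toy_stem(token: str) -> str:
    # Hypothetical stand-in for porterstem: strip a trailing "ing" or "s".
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def analyze(text: str) -> set[str]:
    # Split into alphanumeric tokens, lowercase them, then stem each one.
    return {toy_stem(t.lower()) for t in re.findall(r"[A-Za-z0-9]+", text)}

def search(rows: dict[str, str], *query_terms: str) -> list[str]:
    # A row matches when every analyzed query term appears among its analyzed terms.
    wanted = set().union(*(analyze(q) for q in query_terms))
    return sorted(rid for rid, val in rows.items() if wanted <= analyze(val))

rows = {"1": "soccer cleats", "2": "running shoes", "3": "hiking shoes"}

print(search(rows, "running"))          # val : 'running'
print(search(rows, "hiking", "shoes"))  # val : 'hiking' AND val : 'shoes'
```

Because the query terms pass through the same analysis as the stored values, a stemmed or differently cased query term can still match the indexed text.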

Restrictions

  • Only SAI indexes support the : operator.

  • The analyzed column cannot be part of the primary key, including the partition key and clustering columns.

  • The : operator can be used only with SELECT statements.

  • The : operator cannot be used with lightweight transactions, such as in a condition for an IF clause.

Configuration options

When querying with an analyzer, you must configure the index_analyzer for your Storage Attached Index (SAI). This analyzer determines how a column value is analyzed before it is indexed. The same analyzer is applied to the query search term, too.

The index_analyzer takes a single string or a JSON object as a value. The JSON objects are elements used to configure the analyzers. Each object configures a tokenizer, filter, or charFilter. Exactly one tokenizer should be configured.
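When the JSON form is generated from application code, it can help to build the value programmatically. The sketch below is one possible approach in Python, using only the keys shown on this page ("tokenizer", "filters", "charFilters"); the helper name `build_index_analyzer` is hypothetical and not part of any DataStax driver.

```python
import json

def build_index_analyzer(tokenizer, filters=(), char_filters=()):
    """Return the JSON string for 'index_analyzer' with exactly one tokenizer."""
    if not isinstance(tokenizer, dict) or "name" not in tokenizer:
        raise ValueError("exactly one tokenizer with a 'name' is required")
    config = {
        "tokenizer": tokenizer,
        "filters": list(filters),
        "charFilters": list(char_filters),
    }
    return json.dumps(config)

value = build_index_analyzer({"name": "standard"}, filters=[{"name": "porterstem"}])

# CQL string literals use single quotes; a literal single quote is escaped by doubling it.
options = "OPTIONS = {'index_analyzer': '" + value.replace("'", "''") + "'}"
print(options)
```

Serializing with a JSON library avoids hand-written quoting mistakes, since the JSON object is itself embedded in a single-quoted CQL string.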

The following built-in non-tokenizing filters are also available:

  • normalize - normalizes input using Normalization Form C (NFC)

  • case_sensitive - when set to false, lowercases all input.

  • ascii - same as the ASCIIFoldingFilter from Lucene; converts non-ASCII characters to their ASCII equivalents, where one exists

Examples

Analyzer

To configure a built-in analyzer, add the analyzer to the OPTIONS of your index definition:

OPTIONS = {'index_analyzer':'STANDARD'}

Tokenizer

An ngram tokenizer splits the given text into contiguous sequences of n characters to capture linguistic patterns and context. N-grams are widely used in natural language processing (NLP) tasks.

To configure an ngram tokenizer that also lowercases all tokens, ensure the “tokenizer” key specifies the Lucene tokenizer. The remaining key-value pairs in that JSON object configure the tokenizer:

OPTIONS = {
  'index_analyzer':
  '{
    "tokenizer" : {
      "name" : "ngram",
      "args" : {
        "minGramSize":"2",
        "maxGramSize":"3"
      }
    },
    "filters" : [
      {
        "name" : "lowercase",
        "args": {}
      }
    ],
    "charFilters" : []
  }'
}
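To see what a configuration like this produces, the following Python sketch generates character n-grams with a minimum size of 2 and a maximum size of 3, lowercased. It is an illustration only: the function name is hypothetical, and the order in which grams are emitted is not guaranteed to match the Lucene tokenizer.

```python
def ngrams(text: str, min_gram: int = 2, max_gram: int = 3) -> list[str]:
    """Return every contiguous character sequence of length min_gram..max_gram."""
    # Lowercasing first gives the same grams as lowercasing each gram afterwards.
    text = text.lower()
    grams = []
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(text):
                grams.append(text[start:start + size])
    return grams

print(ngrams("Shoe"))
# ['sh', 'sho', 'ho', 'hoe', 'oe']
```

Indexing these fragments is what allows a partial word in a query to match a longer word in the column, at the cost of storing many more terms per value.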

Non-tokenizing analyzers

This example shows non-tokenizing analyzers:

OPTIONS = {'case_sensitive': false} // default is true
OPTIONS = {'normalize': true} // default is false
OPTIONS = {'ascii': true} // default is false

These analyzers can be mixed to build a pipeline of filters:

OPTIONS = {'normalize': true, 'case_sensitive': false}
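The effect of these three options on a single value can be approximated with the Python standard library. This sketch makes several assumptions: the function and its `fold_ascii` parameter are hypothetical names, the order in which the filters run is chosen for illustration, and the ASCII folding step is a rough approximation (decompose, then drop combining marks) rather than the full Lucene ASCIIFoldingFilter mapping.

```python
import unicodedata

def apply_filters(value: str, normalize: bool = False,
                  case_sensitive: bool = True, fold_ascii: bool = False) -> str:
    """Apply the non-tokenizing options to one value; defaults mirror the page."""
    if normalize:
        # Normalization Form C: compose base characters with combining marks.
        value = unicodedata.normalize("NFC", value)
    if not case_sensitive:
        value = value.lower()
    if fold_ascii:
        # Approximate ASCII folding: decompose, then remove combining marks.
        decomposed = unicodedata.normalize("NFKD", value)
        value = "".join(c for c in decomposed if not unicodedata.combining(c))
    return value

# "e" followed by a combining acute accent becomes the single character U+00E9.
print(apply_filters("Cafe\u0301", normalize=True) == "Caf\u00e9")
print(apply_filters("Caf\u00e9", case_sensitive=False))
print(apply_filters("Caf\u00e9", fold_ascii=True))
```

Since no tokenizer is involved, each option transforms the whole column value, so matches are still made against the entire string rather than individual words.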

Built-in analyzers

Several built-in analyzers from the Lucene project are available, including the following:

Table 1. Built-in analyzer types

Generic analyzers

standard, simple, whitespace, stop, lowercase

Language-specific analyzers

Arabic, Armenian, Basque, Bengali, Brazilian, Bulgarian, Catalan, CJK, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Greek, Hindi, Hungarian, Indonesian, Irish, Italian, Latvian, Lithuanian, Norwegian, Persian, Portuguese, Romanian, Russian, Sorani, Spanish, Swedish

Tokenizers

standard, korean, hmmChinese, openNlp, japanese, wikipedia, letter, keyword, whitespace, classic, pathHierarchy, edgeNGram, nGram, simplePatternSplit, simplePattern, pattern, thai, uax29UrlEmail, icu

CharFilters

htmlstrip, mapping, persian, patternreplace

TokenFilters

apostrophe, arabicnormalization, arabicstem, bulgarianstem, bengalinormalization, bengalistem, brazilianstem, cjkbigram, cjkwidth, soraninormalization, soranistem, commongrams, commongramsquery, dictionarycompoundword, hyphenationcompoundword, decimaldigit, lowercase, stop, type, uppercase, czechstem, germanlightstem, germanminimalstem, germannormalization, germanstem, greeklowercase, greekstem, englishminimalstem, englishpossessive, kstem, porterstem, spanishlightstem, persiannormalization, finnishlightstem, frenchlightstem, frenchminimalstem, irishlowercase, galicianminimalstem, galicianstem, hindinormalization, hindistem, hungarianlightstem, hunspellstem, indonesianstem, indicnormalization, italianlightstem, latvianstem, minhash, asciifolding, capitalization, codepointcount, concatenategraph, daterecognizer, delimitedtermfrequency, fingerprint, fixbrokenoffsets, hyphenatedwords, keepword, keywordmarker, keywordrepeat, length, limittokencount, limittokenoffset, limittokenposition, removeduplicates, stemmeroverride, protectedterm, trim, truncate, typeassynonym, worddelimiter, worddelimitergraph, scandinavianfolding, scandinaviannormalization, edgengram, ngram, norwegianlightstem, norwegianminimalstem, patternreplace, patterncapturegroup, delimitedpayload, numericpayload, tokenoffsetpayload, typeaspayload, portugueselightstem, portugueseminimalstem, portuguesestem, reversestring, russianlightstem, shingle, fixedshingle, snowballporter, serbiannormalization, classic, standard, swedishlightstem, synonym, synonymgraph, flattengraph, turkishlowercase, elision

Astra DB doesn’t support these Elastic filters:

  • synonymgraph

  • synonym

  • commongrams

  • stop

  • snowballporter
