Build a text-to-SQL generator

Estimated time: 60 minutes

Text-to-SQL is an application of LLMs that converts natural language queries into SQL. People can use text-to-SQL to query their databases, even if they don’t know SQL.

To improve the accuracy of text-to-SQL, you can use a technique called dynamic few-shot prompting. This tutorial shows how to implement dynamic few-shot prompting with Astra DB Serverless.

Prerequisites

To complete this tutorial, you’ll need the following:

  * An OpenAI API key.
  * An active Astra DB Serverless database, plus an application token and the database’s API endpoint.
  * An Amazon Redshift cluster and credentials that can create tables and run queries against it.
  * A Python environment with the openai, astrapy, redshift-connector, tenacity, datasets, pandas, and tqdm packages installed.

Initialize the clients

First, initialize the OpenAI, Astra DB, and Amazon Redshift clients.

  1. Initialize the OpenAI client with your API key.

    # Initialize the OpenAI client
    import os
    from typing import List
    
    from getpass import getpass
    import openai
    from tenacity import retry, wait_exponential
    
    if "OPENAI_API_KEY" not in os.environ:
        os.environ["OPENAI_API_KEY"] = getpass("OpenAI API Key: ")
    
    client = openai.OpenAI()
    
    # Retry with exponential backoff in case the OpenAI API is rate limited or returns a transient error
    @retry(wait=wait_exponential(multiplier=1.2, min=4, max=30))
    def chat_gpt_completion(prompt: str) -> str:
        return client.chat.completions.create(
            messages=[{
                "role": "user",
                "content": prompt,
            }],
            model="gpt-3.5-turbo",
        ).choices[0].message.content
    
    # Despite its legacy name, this helper calls the text-embedding-3-small model (1536 dimensions)
    @retry(wait=wait_exponential(multiplier=1.2, min=4, max=30))
    def ada_embedding(text: str) -> List[float]:
        return client.embeddings.create(
            input=text,
            model="text-embedding-3-small",
            timeout=10,
        ).data[0].embedding
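
    As a quick, optional sanity check, you can call both helpers directly (text-embedding-3-small returns 1536-dimensional vectors):

    # Optional sanity check for the helpers above
    print(chat_gpt_completion("Say hello in one word."))
    print(len(ada_embedding("How many singers do we have?")))  # 1536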
  2. Initialize the Astra client with your application token and API endpoint.

    # Initialize the Astra client
    import os
    from getpass import getpass
    
    from astrapy.db import AstraDB
    
    # Prompt for credentials only if the environment variables are unset
    # (passing getpass() as a default to os.getenv would prompt unconditionally)
    ASTRA_TOKEN = os.getenv("ASTRA_DB_TOKEN") or getpass("Astra DB Token: ")
    ASTRA_API_ENDPOINT = os.getenv("ASTRA_DB_API_ENDPOINT") or input("Astra DB Endpoint: ")
    
    astra_db = AstraDB(token=ASTRA_TOKEN, api_endpoint=ASTRA_API_ENDPOINT)
    Response
    Astra DB Token:  ········
    Astra DB Endpoint:  https://3a6953df-27bf-4272-a584-31f7b4dcaa80-us-east1.apps.astra.datastax.com
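
    Optionally, confirm connectivity by listing the database’s collections (assuming your astrapy version exposes get_collections):

    # Optional: confirm the client can reach the database
    print(astra_db.get_collections())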
  3. Initialize the Amazon Redshift connector with your credentials.

    To connect to Amazon Redshift using other forms of authorization, see Connecting to Amazon Redshift. A quick connectivity check appears after the sample response below.

    import pandas as pd
    import redshift_connector
    
    # Prompt for credentials only if the environment variables are unset
    RS_HOST_URL = os.getenv("REDSHIFT_HOST_URL") or getpass("Redshift Host URL: ")
    RS_USER = os.getenv("REDSHIFT_USER") or input("Redshift User: ")
    RS_PASSWORD = os.getenv("REDSHIFT_PASSWORD") or getpass("Redshift Password: ")
    RS_DB_NAME = os.getenv("REDSHIFT_DB_NAME") or input("Redshift DB Name: ")
    
    def rs_execute(*statements, return_as_df: bool = True, commit: bool = True) -> pd.DataFrame:
        """Execute RedShift statements and return the result of the final query"""
        with redshift_connector.connect(
            host=RS_HOST_URL,
            database=RS_DB_NAME,
            user=RS_USER,
            password=RS_PASSWORD,
        ) as conn:
            with conn.cursor() as cursor:
                for statement in statements:
                    cursor.execute(statement)
    
                try:
                    if return_as_df:
                        res = cursor.fetch_dataframe()
                    else:
                        res = cursor.fetchall()
                except redshift_connector.ProgrammingError:
                    # No result set (final statement was not a SELECT)
                    res = None
    
                if commit:
                    conn.commit()
    
                return res
    Response
    Redshift Host URL:  ········
    Redshift User:  test_redshift
    Redshift Password:  ········
    Redshift DB Name:  dev
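
    To confirm everything is wired up, run a trivial query (optional):

    # Optional: a trivial round trip through Redshift
    rs_execute("SELECT 1 AS ping")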

Load the Spider dataset

The Spider dataset is commonly used to evaluate text-to-SQL performance. It consists of (question, query) pairs that show the ideal SQL query to generate from a given natural language question.

  1. Use the datasets package to load the Spider dataset and schema.

    from datasets import load_dataset
    
    spider = load_dataset("spider", split="validation")
    spider_df = spider.to_pandas()
    spider_schema = load_dataset("richardr1126/spider-schema", split="train")
    spider_schema_df = spider_schema.to_pandas()
  2. Show the first few rows of the dataset.

    spider_df.head(5)
    Response
      | db_id | query | question | query_toks | query_toks_no_value | question_toks
    0 | concert_singer | SELECT count(*) FROM singer | How many singers do we have? | [SELECT, count, (, *, ), FROM, singer] | [select, count, (, *, ), from, singer] | [How, many, singers, do, we, have, ?]
    1 | concert_singer | SELECT count(*) FROM singer | What is the total number of singers? | [SELECT, count, (, *, ), FROM, singer] | [select, count, (, *, ), from, singer] | [What, is, the, total, number, of, singers, ?]
    2 | concert_singer | SELECT name , country , age FROM singer ORDER BY name | Show name, country, age for all singers ordered by name | [SELECT, name, ,, country, ,, age, FROM, singe… | [select, name, ,, country, ,, age, from, singe… | [Show, name, ,, country, ,, age, for, all, sin…
    3 | concert_singer | SELECT name , country , age FROM singer ORDER BY name | What are the names, countries, and ages for every singer ordered by name? | [SELECT, name, ,, country, ,, age, FROM, singe… | [select, name, ,, country, ,, age, from, singe… | [What, are, the, names, ,, countries, ,, and, …
    4 | concert_singer | SELECT avg(age) , min(age) , max(age) FROM singer | What is the average, minimum, and maximum age of all singers? | [SELECT, avg, (, age, ), ,, min, (, age, ), ,,… | [select, avg, (, age, ), ,, min, (, age, ), ,,… | [What, is, the, average, ,, minimum, ,, and, m…
  3. Set aside a few questions for testing. These questions come from two schemas in the Spider dataset. One schema contains tables about singers and where they’ve performed, and the other schema contains tables about pets and which students own those pets.

    # Hold out a few questions to test the pipeline with
    test_ndxs = [21, 28, 60, 63, 75]  # A mix of queries the baseline prompt gets right and wrong
    ndxs_to_drop = test_ndxs.copy()
    test_df = spider_df.loc[test_ndxs]
    
    for ndx in test_ndxs:
        row = spider_df.loc[ndx]
        print(f"Question ({row['db_id']}):\n    {row['question']}")
        print(f"Gold Query SQL:\n    {row['query']}")
        print("=" * 80)
    
        # Remove any rows that have the exact same query as the answer
        # This prevents data leakage so that the dynamic few-shot approach can't see the correct query
        same_query_rows = spider_df[spider_df['query'] == row['query']]
        for same_query_ndx in same_query_rows.index:
            ndxs_to_drop.append(same_query_ndx)
    
    spider_df = spider_df.drop(ndxs_to_drop)
    Response
    Question (concert_singer):
        How many concerts occurred in 2014 or 2015?
    Gold Query SQL:
        SELECT count(*) FROM concert WHERE YEAR  =  2014 OR YEAR  =  2015
    ================================================================================
    Question (concert_singer):
        Show the stadium names without any concert.
    Gold Query SQL:
        SELECT name FROM stadium WHERE stadium_id NOT IN (SELECT stadium_id FROM concert)
    ================================================================================
    Question (pets_1):
        What are the students' first names who have both cats and dogs as pets?
    Gold Query SQL:
        SELECT T1.Fname FROM student AS T1 JOIN has_pet AS T2 ON T1.stuid  =  T2.stuid JOIN pets AS T3 ON T3.petid  =  T2.petid WHERE T3.pettype  =  'cat' INTERSECT SELECT T1.Fname FROM student AS T1 JOIN has_pet AS T2 ON T1.stuid  =  T2.stuid JOIN pets AS T3 ON T3.petid  =  T2.petid WHERE T3.pettype  =  'dog'
    ================================================================================
    Question (pets_1):
        Find the id of students who do not have a cat pet.
    Gold Query SQL:
        SELECT stuid FROM student EXCEPT SELECT T1.stuid FROM student AS T1 JOIN has_pet AS T2 ON T1.stuid  =  T2.stuid JOIN pets AS T3 ON T3.petid  =  T2.petid WHERE T3.pettype  =  'cat'
    ================================================================================
    Question (pets_1):
        Find the first name and age of students who have a pet.
    Gold Query SQL:
        SELECT DISTINCT T1.fname ,  T1.age FROM student AS T1 JOIN has_pet AS T2 ON T1.stuid  =  T2.stuid
    ================================================================================

Set up a SQL database

Store the structured data from the Spider dataset in a SQL-compatible database.

If you have an existing database with usage data formatted as (question, query) pairs, you can connect to that database instead of using the Spider dataset for the few-shot examples below.

  1. Set up the concert_singer database. Create tables matching the Spider schema and insert fake data for each test question.

    # Define concert_singer database
    CREATE_TABLES_SQL = """
    -- Create the stadium table
    CREATE TABLE stadium (
        Stadium_ID INT PRIMARY KEY,
        Location TEXT,
        Name TEXT,
        Capacity INT,
        Highest INT,
        Lowest INT,
        Average INT
    );
    
    -- Create the singer table
    CREATE TABLE singer (
        Singer_ID INT PRIMARY KEY,
        Name TEXT,
        Country TEXT,
        Song_Name TEXT,
        Song_release_year TEXT,
        Age INT,
        Is_male BOOLEAN
    );
    
    -- Create the concert table
    CREATE TABLE concert (
        concert_ID INT PRIMARY KEY,
        concert_Name TEXT,
        Theme TEXT,
        Stadium_ID INT,
        Year TEXT,
        FOREIGN KEY (Stadium_ID) REFERENCES stadium(Stadium_ID)
    );
    
    -- Create the singer_in_concert table
    CREATE TABLE singer_in_concert (
        concert_ID INT,
        Singer_ID INT,
        PRIMARY KEY (concert_ID, Singer_ID),
        FOREIGN KEY (concert_ID) REFERENCES concert(concert_ID),
        FOREIGN KEY (Singer_ID) REFERENCES singer(Singer_ID)
    );
    """
    
    POPULATE_DATA_SQL = """
    -- Populate the stadium table
    INSERT INTO stadium (Stadium_ID, Location, Name, Capacity, Highest, Lowest, Average) VALUES
    (1, 'New York, USA', 'Liberty Stadium', 50000, 1000, 500, 750),
    (2, 'London, UK', 'Royal Arena', 60000, 1500, 600, 900),
    (3, 'Tokyo, Japan', 'Sunshine Dome', 55000, 1200, 550, 800),
    (4, 'Sydney, Australia', 'Ocean Field', 40000, 900, 400, 650),
    (5, 'Berlin, Germany', 'Eagle Grounds', 45000, 1100, 450, 700);
    
    -- Populate the singer table
    INSERT INTO singer (Singer_ID, Name, Country, Song_Name, Song_release_year, Age, Is_male) VALUES
    (1, 'John Doe', 'USA', 'Freedom Song', '2018', 28, TRUE),
    (2, 'Emma Stone', 'UK', 'Rolling Hills', '2019', 25, FALSE),
    (3, 'Haruki Tanaka', 'Japan', 'Tokyo Lights', '2020', 30, TRUE),
    (4, 'Alice Johnson', 'Australia', 'Ocean Waves', '2021', 27, FALSE),
    (5, 'Max Müller', 'Germany', 'Berlin Nights', '2017', 32, TRUE);
    
    -- Populate the concert table
    INSERT INTO concert (concert_ID, concert_Name, Theme, Stadium_ID, Year) VALUES
    (1, 'Freedom Fest', 'Pop', 1, '2021'),
    (2, 'Rock Mania', 'Rock', 2, '2022'),
    (3, 'Electronic Waves', 'Electronic', 3, '2020'),
    (4, 'Jazz Evenings', 'Jazz', 3, '2019'),
    (5, 'Classical Mornings', 'Classical', 5, '2023');
    
    -- Populate the singer_in_concert table
    INSERT INTO singer_in_concert (concert_ID, Singer_ID) VALUES
    (1, 1),
    (1, 2),
    (2, 3),
    (3, 4),
    (4, 5),
    (5, 1),
    (2, 2),
    (3, 3),
    (4, 4),
    (5, 5);
    """
    
    statements = [
        statement.strip() for statement in (CREATE_TABLES_SQL + POPULATE_DATA_SQL).split(";")
        if len(statement.strip()) > 0
    ]
    rs_execute(*statements)
  2. Set up the pets_1 database.

    # Define pets_1 database
    CREATE_TABLES_SQL = """
    -- Create the Student table
    CREATE TABLE Student (
        StuID INT PRIMARY KEY,
        LName VARCHAR(255),
        Fname VARCHAR(255),
        Age INT,
        Sex VARCHAR(10),
        Major INT,
        Advisor INT,
        city_code VARCHAR(50)
    );
    
    -- Create the Pets table
    CREATE TABLE Pets (
        PetID INT PRIMARY KEY,
        PetType VARCHAR(255),
        pet_age INT,
        weight FLOAT  -- the sample data below uses fractional weights
    );
    
    -- Create the Has_Pet table
    CREATE TABLE Has_Pet (
        StuID INT,
        PetID INT,
        FOREIGN KEY (StuID) REFERENCES Student(StuID),
        FOREIGN KEY (PetID) REFERENCES Pets(PetID)
    );
    """
    
    POPULATE_DATA_SQL = """
    -- Populate the Student table
    INSERT INTO Student (StuID, LName, Fname, Age, Sex, Major, Advisor, city_code) VALUES
    (101, 'Smith', 'John', 20, 'M', 501, 301, 'NYC'),
    (102, 'Johnson', 'Emma', 22, 'F', 502, 302, 'LAX'),
    (103, 'Williams', 'Michael', 21, 'M', 503, 303, 'CHI'),
    (104, 'Brown', 'Sarah', 23, 'F', 504, 304, 'HOU'),
    (105, 'Jones', 'David', 19, 'M', 505, 305, 'PHI');
    
    -- Populate the Pets table
    INSERT INTO Pets (PetID, PetType, pet_age, weight) VALUES
    (201, 'dog', 3, 20.5),
    (202, 'cat', 5, 10.2),
    (203, 'dog', 2, 8.1),
    (204, 'parrot', 4, 0.5),
    (205, 'hamster', 1, 0.7);
    
    -- Populate the Has_Pet table
    INSERT INTO Has_Pet (StuID, PetID) VALUES
    (101, 201),
    (101, 202),
    (105, 203),
    (103, 204),
    (104, 205),
    (105, 201);
    """
    
    statements = [
        statement.strip() for statement in (CREATE_TABLES_SQL + POPULATE_DATA_SQL).split(";")
        if len(statement.strip()) > 0
    ]
    rs_execute(*statements)
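
    Optionally, spot-check the fake data before moving on:

    # Optional spot checks on the inserted data
    print(rs_execute("SELECT count(*) AS concert_count FROM concert"))  # expect 5
    print(rs_execute("SELECT count(*) AS pet_count FROM Pets"))         # expect 5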
  3. Define an evaluation function that accepts a list of generated queries. The function should report whether each query runs successfully against the fake data and whether it returns the same results as the corresponding gold SQL query.

    def eval_generated_queries(queries: List[str]) -> pd.DataFrame:
        """Evaluate the given queries against the test set, and report the performance of each row"""
        report = []
    
        for ndx, pred_sql in enumerate(queries):
            row = test_df.iloc[ndx]
            gold_sql = row["query"]
            gold_results = rs_execute(gold_sql)
    
            try:
                pred_results = rs_execute(pred_sql)
                err = None
    
                # Figure out correctness by deciding between set matching (no order) and list matching (ordered)
                correct = len(gold_results) == len(pred_results)
                if correct:
                    # Sort columns so the arrays match up
                    pred_results = pred_results.reindex(sorted(pred_results.columns), axis=1)
                    gold_results = gold_results.reindex(sorted(gold_results.columns), axis=1)
                    if "order by" in gold_sql.lower():
                        # All row orderings must match
                        correct = (gold_results.values == pred_results.values).all()
                    else:
                        for col in gold_results.columns:
                            if set(gold_results[col].tolist()) != set(pred_results[col].tolist()):
                                correct = False
    
            except redshift_connector.ProgrammingError as e:
                pred_results = None
                err = e
                correct = False
    
            report.append({
                "DB ID": row["db_id"],
                "Question": row["question"],
                "Correct": correct,
                "Error": err,
            })
    
        return pd.DataFrame(report)
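
    As a sanity check, the gold queries themselves should all evaluate as correct:

    # The gold queries should score 5/5 against themselves
    eval_generated_queries(test_df["query"].tolist())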

Use dynamic few-shot prompting to generate SQL from text

Traditional methods like zero-shot prompting or few-shot prompting with fixed examples are a good starting point for a text-to-SQL implementation. However, these methods only get 3 of 5 answers correct on our test dataset. To improve performance, we can use dynamic few-shot prompting with Astra DB Serverless.

Dynamic few-shot prompting is a variation on few-shot prompting that provides examples tailored to each question. If you generate embeddings for a corpus of example questions, you can run a vector search to find the stored questions most similar to the one being asked, then use those (question, query) pairs to construct an improved few-shot prompt for the model.

  1. Create a vector collection in Astra DB Serverless to hold the rest of the Spider dataset. The collection’s dimension matches the 1536-dimensional vectors produced by text-embedding-3-small.

    # Create a collection for vector search
    collection = astra_db.create_collection("text2sql_examples_test1", dimension=1536)
    Response
    Astra DB Collection[name="text2sql_examples_test1", endpoint="https://3a6953df-27bf-4272-a584-31f7b4dcaa80-us-east1.apps.astra.datastax.com"]
  2. Generate embeddings for all questions in the training set.

    from tqdm.auto import tqdm
    
    tqdm.pandas()
    
    spider_embeddings = spider_df["question"].progress_apply(lambda question: ada_embedding(question))
    spider_df["question_embedding"] = spider_embeddings
    Response
      0%|          | 0/1028 [00:00<?, ?it/s]
  3. Upload the training set and their embeddings to Astra DB Serverless.

    def insert_doc(row):
        return collection.insert_one({
            "_id": row.name,
            "question": row['question'],
            "gold_sql": row['query'],
            "db_id": row["db_id"],
            "$vector": row["question_embedding"],
        })
    
    spider_df.progress_apply(insert_doc, axis=1)
    Response
      0%|          | 0/1028 [00:00<?, ?it/s]
    0          {'status': {'insertedIds': [0]}}
    1          {'status': {'insertedIds': [1]}}
    2          {'status': {'insertedIds': [2]}}
    3          {'status': {'insertedIds': [3]}}
    4          {'status': {'insertedIds': [4]}}
                           ...
    1029    {'status': {'insertedIds': [1029]}}
    1030    {'status': {'insertedIds': [1030]}}
    1031    {'status': {'insertedIds': [1031]}}
    1032    {'status': {'insertedIds': [1032]}}
    1033    {'status': {'insertedIds': [1033]}}
    Length: 1028, dtype: object
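
    If the row-by-row upload is slow, your astrapy version may support batched writes. A hedged sketch, assuming AstraDBCollection.insert_many is available and batching conservatively to stay within Data API request limits:

    # Hypothetical batched upload (assumes insert_many is available in your astrapy version)
    docs = [
        {
            "_id": ndx,
            "question": row["question"],
            "gold_sql": row["query"],
            "db_id": row["db_id"],
            "$vector": row["question_embedding"],
        }
        for ndx, row in spider_df.iterrows()
    ]
    for i in tqdm(range(0, len(docs), 20)):  # conservative batch size
        collection.insert_many(docs[i : i + 20])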
  4. Create a prompt function that generates a few-shot prompt based on the most similar questions to the one being asked.

    def dynamic_few_shot_prompt_fn(db_id: str, question: str) -> str:
        # Embed the incoming question and retrieve the two most similar stored questions
        emb = ada_embedding(question)
        docs = collection.vector_find(emb, limit=2)
        closest_q_ids = [doc["_id"] for doc in docs]
    
        return few_shot_prompt(db_id, question, indices=closest_q_ids)
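
    This function delegates to few_shot_prompt, and the next step calls generate_sql; both helpers come from the earlier part of the notebook this tutorial builds on. If you’re following along without that code, here is a minimal sketch of what they might look like. The schema_description helper and its use of spider_schema_df column names are assumptions; the exact prompt template it must reproduce is visible in the response in the next step.

    def schema_description(db_id: str) -> str:
        """Hypothetical helper: serialize the schema blocks for one database.

        Assumes spider_schema_df carries the bracketed schema fields shown in
        the prompt output; adapt the column names to the actual dataset.
        """
        row = spider_schema_df[spider_schema_df["db_id"] == db_id].iloc[0]
        return "\n\n".join(
            f"[{field}]: {row[field]}"
            for field in spider_schema_df.columns
            if field != "db_id"
        )

    def few_shot_prompt(db_id: str, question: str, indices: List[int]) -> str:
        """Build a few-shot prompt from stored (question, query) examples."""
        parts = []
        for ndx in indices:
            ex = spider_df.loc[ndx]
            parts.append(
                "Here is an example: Convert text to SQL:\n\n"
                f"{schema_description(ex['db_id'])}\n\n"
                f"[Q]: {ex['question']}\n\n[SQL]: {ex['query']}\n"
            )
        parts.append(
            "Here is the test question to be answered: Convert text to SQL:\n\n"
            f"{schema_description(db_id)}\n\n[Q]: {question}\n\n[SQL]:"
        )
        return "\n".join(parts)

    def generate_sql(question: str, db_id: str, prompt_fn, debug_prompt: bool = False) -> str:
        """Generate SQL for one question using the supplied prompt builder."""
        prompt = prompt_fn(db_id, question)
        if debug_prompt:
            print(f"Prompting ChatGPT with:\n{prompt}")
        return chat_gpt_completion(prompt)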
  5. Generate SQL from the test questions using the dynamic few-shot prompt.

    # Include the prompt in the output to show the model's context
    sql1 = generate_sql(
        question="How many concerts occurred in 2014 or 2015?",
        db_id="concert_singer",
        prompt_fn=dynamic_few_shot_prompt_fn,
        debug_prompt=True,
    )
    print(sql1)
    Response
    Prompting ChatGPT with:
    Here is an example: Convert text to SQL:
    
    [Schema : (values)]: | concert_singer | stadium : Stadium_ID , Location , Name , Capacity , Highest , Lowest , Average | singer : Singer_ID , Name , Country , Song_Name , Song_release_year , Age , Is_male | concert : concert_ID , concert_Name , Theme , Stadium_ID , Year | singer_in_concert : concert_ID , Singer_ID ;
    
    [Column names (type)]: stadium : Stadium_ID (number) | stadium : Location (text) | stadium : Name (text) | stadium : Capacity (number) | stadium : Highest (number) | stadium : Lowest (number) | stadium : Average (number) | singer : Singer_ID (number) | singer : Name (text) | singer : Country (text) | singer : Song_Name (text) | singer : Song_release_year (text) | singer : Age (number) | singer : Is_male (others) | concert : concert_ID (number) | concert : concert_Name (text) | concert : Theme (text) | concert : Stadium_ID (text) | concert : Year (text) | singer_in_concert : concert_ID (number) | singer_in_concert : Singer_ID (text) ;
    
    [Primary Keys]: stadium : Stadium_ID | singer : Singer_ID | concert : concert_ID | singer_in_concert : concert_ID
    
    [Foreign Keys]: concert : Stadium_ID equals stadium : Stadium_ID | singer_in_concert : Singer_ID equals singer : Singer_ID | singer_in_concert : concert_ID equals concert : concert_ID
    
    [Q]: What are the names and locations of the stadiums that had concerts that occurred in both 2014 and 2015?
    
    [SQL]: SELECT T2.name ,  T2.location FROM concert AS T1 JOIN stadium AS T2 ON T1.stadium_id  =  T2.stadium_id WHERE T1.Year  =  2014 INTERSECT SELECT T2.name ,  T2.location FROM concert AS T1 JOIN stadium AS T2 ON T1.stadium_id  =  T2.stadium_id WHERE T1.Year  =  2015
    
    Here is an example: Convert text to SQL:
    
    [Schema : (values)]: | concert_singer | stadium : Stadium_ID , Location , Name , Capacity , Highest , Lowest , Average | singer : Singer_ID , Name , Country , Song_Name , Song_release_year , Age , Is_male | concert : concert_ID , concert_Name , Theme , Stadium_ID , Year | singer_in_concert : concert_ID , Singer_ID ;
    
    [Column names (type)]: stadium : Stadium_ID (number) | stadium : Location (text) | stadium : Name (text) | stadium : Capacity (number) | stadium : Highest (number) | stadium : Lowest (number) | stadium : Average (number) | singer : Singer_ID (number) | singer : Name (text) | singer : Country (text) | singer : Song_Name (text) | singer : Song_release_year (text) | singer : Age (number) | singer : Is_male (others) | concert : concert_ID (number) | concert : concert_Name (text) | concert : Theme (text) | concert : Stadium_ID (text) | concert : Year (text) | singer_in_concert : concert_ID (number) | singer_in_concert : Singer_ID (text) ;
    
    [Primary Keys]: stadium : Stadium_ID | singer : Singer_ID | concert : concert_ID | singer_in_concert : concert_ID
    
    [Foreign Keys]: concert : Stadium_ID equals stadium : Stadium_ID | singer_in_concert : Singer_ID equals singer : Singer_ID | singer_in_concert : concert_ID equals concert : concert_ID
    
    [Q]: Which year has most number of concerts?
    
    [SQL]: SELECT YEAR FROM concert GROUP BY YEAR ORDER BY count(*) DESC LIMIT 1
    
    Here is the test question to be answered: Convert text to SQL:
    
    [Schema : (values)]: | concert_singer | stadium : Stadium_ID , Location , Name , Capacity , Highest , Lowest , Average | singer : Singer_ID , Name , Country , Song_Name , Song_release_year , Age , Is_male | concert : concert_ID , concert_Name , Theme , Stadium_ID , Year | singer_in_concert : concert_ID , Singer_ID ;
    
    [Column names (type)]: stadium : Stadium_ID (number) | stadium : Location (text) | stadium : Name (text) | stadium : Capacity (number) | stadium : Highest (number) | stadium : Lowest (number) | stadium : Average (number) | singer : Singer_ID (number) | singer : Name (text) | singer : Country (text) | singer : Song_Name (text) | singer : Song_release_year (text) | singer : Age (number) | singer : Is_male (others) | concert : concert_ID (number) | concert : concert_Name (text) | concert : Theme (text) | concert : Stadium_ID (text) | concert : Year (text) | singer_in_concert : concert_ID (number) | singer_in_concert : Singer_ID (text) ;
    
    [Primary Keys]: stadium : Stadium_ID | singer : Singer_ID | concert : concert_ID | singer_in_concert : concert_ID
    
    [Foreign Keys]: concert : Stadium_ID equals stadium : Stadium_ID | singer_in_concert : Singer_ID equals singer : Singer_ID | singer_in_concert : concert_ID equals concert : concert_ID
    
    [Q]: How many concerts occurred in 2014 or 2015?
    
    [SQL]:
    
    
    SELECT COUNT(*) FROM concert WHERE Year IN (2014, 2015)
    sql2 = generate_sql(
        question="Show the stadium names without any concert.",
        db_id="concert_singer",
        prompt_fn=dynamic_few_shot_prompt_fn,
    )
    print(sql2)
    Response
    SELECT name
    FROM stadium
    WHERE stadium_id NOT IN (SELECT stadium_id FROM concert)
    sql3 = generate_sql(
        question="What are the students' first names who have both cats and dogs as pets?",
        db_id="pets_1",
        prompt_fn=dynamic_few_shot_prompt_fn,
    )
    print(sql3)
    Response
    SELECT DISTINCT T1.Fname FROM student AS T1
    JOIN has_pet AS T2 ON T1.stuid = T2.stuid
    JOIN pets AS T3 ON T3.petid = T2.petid
    WHERE T3.pettype = 'cat' AND T1.stuid IN (
        SELECT DISTINCT T1.stuid FROM student AS T1
        JOIN has_pet AS T2 ON T1.stuid = T2.stuid
        JOIN pets AS T3 ON T3.petid = T2.petid
        WHERE T3.pettype = 'dog'
    )
    sql4 = generate_sql(
        question="Find the id of students who do not have a cat pet.",
        db_id="pets_1",
        prompt_fn=dynamic_few_shot_prompt_fn,
    )
    print(sql4)
    Response
    SELECT stuID FROM Student WHERE stuID NOT IN (SELECT stuID FROM Has_Pet JOIN Pets ON Has_Pet.petID = Pets.petID WHERE Pets.PetType = 'cat')
    sql5 = generate_sql(
        question="Find the first name and age of students who have a pet.",
        db_id="pets_1",
        prompt_fn=dynamic_few_shot_prompt_fn,
    )
    print(sql5)
    Response
    SELECT DISTINCT T1.fname, T1.age
    FROM student AS T1
    JOIN has_pet AS T2 ON T1.stuid = T2.stuid
    JOIN pets AS T3 ON T3.petid = T2.petid
  6. Evaluate all five queries and see how the LLM did.

    sqls = [sql1, sql2, sql3, sql4, sql5]
    report = eval_generated_queries(sqls)
    display(report)
    Response

      | DB ID | Question | Correct | Error
    0 | concert_singer | How many concerts occurred in 2014 or 2015? | True | None
    1 | concert_singer | Show the stadium names without any concert. | True | None
    2 | pets_1 | What are the students' first names who have both cats and dogs as pets? | True | None
    3 | pets_1 | Find the id of students who do not have a cat pet. | True | None
    4 | pets_1 | Find the first name and age of students who have a pet. | True | None

The model answered all five test questions correctly, up from three of five with fixed examples. Using Astra DB Serverless as a dynamic few-shot prompt store is a robust, low-effort way to improve your text-to-SQL applications in production.
