Prompt Engineering for Better SQL Code Generation With LLMs

May 29, 2024

Raghid Bsat

Picture yourself as a marketing executive tasked with optimising advertising strategies to target different customer segments effectively. You are required to answer the following question: “Which customer segments yield the highest return on investment for our advertising campaign?”. You might want to look at customer segment profitability and compare it with Customer Lifetime Value (CLV) to answer such questions. The following SQL query could be the answer:

with customer_segment_profitability as (
    select
        customer_segment,
        avg(order_total) as avg_order_value,
        count(distinct customer_id) as num_customers,
        sum(order_total) as total_sales,
        sum(advertising_cost) as total_ad_cost,
        (sum(order_total) - sum(advertising_cost)) / count(distinct customer_id) as segment_profit
    from
        orders
    group by
        customer_segment
),
customer_clv as (
    select
        customer_segment,
        sum(order_total) as total_spent,
        count(distinct customer_id) as num_customers,
        datediff(max(order_date), min(order_date)) as lifespan_days,
        sum(order_total) / datediff(max(order_date), min(order_date)) as clv_per_day,
        (sum(order_total) / datediff(max(order_date), min(order_date))) * 365 as clv_yearly
    from
        orders
    where
        customer_id is not null
    group by
        customer_segment
)
select
    csp.customer_segment,
    csp.avg_order_value,
    csp.num_customers as segment_customers,
    csp.total_sales,
    csp.total_ad_cost,
    csp.segment_profit,
    cc.total_spent as clv_total_spent,
    cc.num_customers as clv_num_customers,
    cc.clv_yearly
from
    customer_segment_profitability csp
join
    customer_clv cc on csp.customer_segment = cc.customer_segment
order by
    csp.segment_profit desc;

Crafting such a complex query demands expertise and context beyond the scope of many business professionals, and takes time even for those who have it. There is often a database-language barrier between business leaders and the data-driven decisions they are required to make: that data typically has to be queried in SQL, which many business professionals find daunting or hardly practical. The rise of GenAI has opened the door for AI assistants to overcome this barrier, as Large Language Models (LLMs) have demonstrated the ability to understand and generate natural language. Using LLMs to create SQL statements from natural language can ease data access for non-technical profiles.

As of today, there have been many attempts at creating a state-of-the-art LLM SQL query generator. One such example is MotherDuck's "AI that Quacks, DuckDB NSQL-7B". That blogpost describes the efforts of MotherDuck and Numbers Station to create a text-to-SQL LLM: they built an open-source model on CodeLlama's base model and fine-tuned it on a custom training set, a modified version of NSText2SQL, reporting exceptionally good performance on the SQL benchmarks used. While this tool sounds amazing, and training your own open-source model on a custom SQL training dataset can be a significant leap forward compared to simply using pre-existing models, it is still important to provide additional context to the model. This means crafting precise and contextually appropriate prompts to elicit the desired responses from the language model; a suitable prompt gives the LLM better context for creating error-free queries. In this blogpost, I will address the challenge of prompt engineering and explore the scenarios where a Retrieval-Augmented Generation (RAG) solution is advantageous compared to including all information directly in the prompt.

How to build an effective prompt?

  1. Include database schema

The main challenge in using an LLM to generate SQL code is letting the model know the schema of the tables we are trying to query. That information has to fit in the model's 'context window': the number of tokens (~words) the model can take as input when generating a response. For SQL generation, we can include our database schema in the context before asking our question. The catch is that LLMs come with a limited context window. For example, GPT-4 comes with a window of 128,000 tokens (~96,000 words). If your database is small enough, you can provide the full schema as part of the prompt.
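
Before sending anything, it is worth checking that the assembled prompt actually fits. Here is a minimal sketch in Python, assuming the tiktoken library; fits_in_context is a hypothetical helper, not an existing API:

import tiktoken

def fits_in_context(schema_ddl: str, question: str, limit: int = 128_000) -> bool:
    # Build the prompt the same way we would send it to the model.
    prompt = (
        "You are a helpful assistant specialising in data analysis.\n"
        f"Here is the relevant context of the database:\n{schema_ddl}\n"
        f"This is the question you are required to answer:\n{question}"
    )
    # Count tokens with the GPT-4 tokenizer and compare against the window size.
    encoding = tiktoken.encoding_for_model("gpt-4")
    return len(encoding.encode(prompt)) <= limit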

This is an example of a prompt with the database schema included:

You are a helpful assistant specialising in data analysis in a snowflake warehouse.
Answer the questions by providing SQL code that is compatible with the snowflake environment.
This is the question you are required to answer: 
What is the total number of customers in the Chocolate Haven branch?

Here is the relevant context of the database:
create or replace TABLE CUSTOMER_DETAILS (
    CUSTOMER_ID NUMBER(38,0) NOT NULL,
    FIRST_NAME VARCHAR(255),
    LAST_NAME VARCHAR(255),
    EMAIL VARCHAR(255),
    PHONE VARCHAR(20),
    ADDRESS VARCHAR(255),
    primary key (CUSTOMER_ID)
);

Testing this out on the Snowflake sample dataset (TPCH_SF1), I noticed that, in general, the easy queries were answered correctly. These queries are characterised by simple filters and multiple conditions:

-- Give me all the customers where the marketing segment is 'MACHINERY'
select *
  from customer
where 
  c_mktsegment = 'MACHINERY';
-- Give me all the orders where the order priority is '3-MEDIUM' and the total price is above 100 000 euros
select *
  from orders
where 
  o_orderpriority = '3-MEDIUM' and o_totalprice > 100000;
-- Give me all the order keys where the order priority is '5-LOW' and the total price is below 250 000 euros and where the status is not 'F'
select o_orderkey
  from orders
where 
  o_orderpriority = '5-LOW' and o_totalprice <= 250000 and o_orderstatus <> 'F';

  2. Include user feedback (session history)

In the case of complex queries with multiple joins and aggregations, the LLM was not able to produce accurate results. There were two main issues the LLM faced:

  • Semantic Complexity between tables: Complex queries often involve specific relationships between tables, nested subqueries, and conditional logic. LLMs may struggle to grasp the full semantic meaning of these complex structures, leading to inaccuracies in understanding and reproducing the intended query semantics.

  • Business Vagueness: There can be ambiguous references, such as column aliases or ambiguous table references. Resolving these ambiguities requires contextual understanding and inference capabilities, which may be challenging for LLMs. In the case of the third query below, the LLM assumed that some columns already exist in the schema, which produced an invalid SQL statement.

-- What is for each customer the total amount spent and the average order amount? Show our best customers (= most-paying) first.
-- This query has a join/group by that the LLM failed to identify.
select
  min(c_name) as customer_name,
  sum(coalesce(o_totalprice, 0)) as total_spent,
  avg(o_totalprice) as avg_order
from customer
  left join orders on c_custkey = o_custkey
group by 
  c_custkey
order by 
  total_spent desc;
-- Show me all the customers that have placed more than 40 orders
select
  min(c_name) as customer_name,
  count(*) as num_orders
from customer
  join 
    orders on c_custkey = o_custkey
group by 
  c_custkey
having 
  count(*) > 40;
-- For each nation, show me our top-3 highest-revenue customers in that nation. Don’t consider discounts.
-- There is a subquery to find the rank_in_nation the LLM failed to reproduce. It gave an inaccurate schema 
-- by assuming that rank already exists in the table.
with
customer_with_rank as (
  select
    c_name,
    c_nationkey,
    sum(o_totalprice) as total_revenue,
    rank() over (
      partition by c_nationkey
      order by sum(o_totalprice) desc
    ) as rank_in_nation
  from customer
  join orders on c_custkey = o_custkey
  join nation on c_nationkey = n_nationkey
  group by c_name, c_nationkey
)
select *
from customer_with_rank
join nation on c_nationkey = n_nationkey
where rank_in_nation <= 3;

User feedback can ‘push’ the LLM in the right direction, resulting in a correct query after a failed attempt. This is the next step of prompt engineering: adding relevant context beyond the database schema can be crucial for optimised SQL code generation. You can include the session history as part of the prompt like this:

You are a helpful assistant specialising in data analysis in a snowflake warehouse.
Answer the questions by providing SQL code that is compatible with the snowflake environment.
This is the question you are required to answer: 
What is the total number of customers in the Chocolate Haven branch?

Here is the relevant context of the database:
create or replace TABLE CUSTOMER_DETAILS (
    CUSTOMER_ID NUMBER(38,0) NOT NULL,
    FIRST_NAME VARCHAR(255),
    LAST_NAME VARCHAR(255),
    EMAIL VARCHAR(255),
    PHONE VARCHAR(20),
    ADDRESS VARCHAR(255),
    primary key (CUSTOMER_ID)
);

The user has reported the following feedback:
You should not be looking in the orders table
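
In code, this amounts to accumulating feedback per session and re-injecting it on every retry. A minimal sketch in Python, where build_prompt and the in-memory session list are hypothetical helpers:

def build_prompt(question: str, schema_ddl: str, feedback: list[str]) -> str:
    # Mirror the prompt template shown above, appending feedback when present.
    prompt = (
        "You are a helpful assistant specialising in data analysis in a snowflake warehouse.\n"
        "Answer the questions by providing SQL code that is compatible with the snowflake environment.\n"
        f"This is the question you are required to answer:\n{question}\n\n"
        f"Here is the relevant context of the database:\n{schema_ddl}\n"
    )
    if feedback:
        prompt += "\nThe user has reported the following feedback:\n" + "\n".join(feedback)
    return prompt

customer_details_ddl = "create or replace TABLE CUSTOMER_DETAILS (...);"  # DDL from above, abbreviated

# Each failed attempt adds a correction that is carried into the next prompt.
session_feedback: list[str] = []
session_feedback.append("You should not be looking in the orders table")
prompt = build_prompt(
    "What is the total number of customers in the Chocolate Haven branch?",
    customer_details_ddl,
    session_feedback,
)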

  3. Include business information

For more in-depth business context: large organisations tend to have relevant data scattered everywhere, in different formats and sources. I suggest splitting that data into three categories (plus a bonus one):

  • Domain-specific knowledge: This is information that can be relevant to the question. For example, we can include statements such as:

| The Belgian branch is called the “Chocolate Haven” branch
| Benelux region is made up of Luxembourg, Netherlands, and Belgium branches;
| Marketing campaigns in the “Ardennes” region have historically shown higher conversion rates.

  • Historically Used Queries: These are the go-to queries that business analysts often use for common analytical needs: monthly sales reports by region, customer churn analysis by product category, top-selling products by quarter, and profitability analysis by customer segment.

  • Data Transformation and Enrichment Logic: This category includes the logic and rules applied to transform and enrich raw data into meaningful business insights. Some examples include: calculations for derived metrics such as customer lifetime value (CLV), average order value (AOV), or conversion rates; time-based calculations for trend analysis, seasonality adjustments, or cohort analysis.

  • Bonus Category: Business Rules and Policies: While this is still experimental and there is no established solution yet for handling data governance with GenAI, I would generally handle it with care. This category includes rules such as access-control policies that determine who within the organisation has permission to view or modify certain tables, or data-retention policies. Rather than putting these in the prompt, I would push them into an SQL-sanitising module that checks the query after the LLM returns an answer.

The final prompt in action:

You are a helpful assistant specialising in data analysis in a snowflake warehouse.
Answer the questions by providing SQL code that is compatible with the snowflake environment.
This is the question you are required to answer: 
What is the total number of customers in the Chocolate Haven branch?

Here is the relevant context of the database:
create or replace TABLE CUSTOMER_DETAILS (
    CUSTOMER_ID NUMBER(38,0) NOT NULL,
    FIRST_NAME VARCHAR(255),
    LAST_NAME VARCHAR(255),
    EMAIL VARCHAR(255),
    PHONE VARCHAR(20),
    ADDRESS VARCHAR(255),
    primary key (CUSTOMER_ID)
);

The user has reported the following feedback:
You should not be looking in the orders table

This is some business context that is relevant to the question:
1. The Belgian branch is called the Chocolate Haven branch.
2. Benelux region is made up of Luxembourg, Netherlands, 
and Belgium branches
 
Historically, business analysts have used these queries 
for answering similar questions:
Query to analyse profitability by customer segment:
WITH customer_profitability AS (
 SELECT
        .... 
)

If certain aggregations or data derived metrics are needed
to answer the user's question, these are the most relevant
pieces of information needed:
Rule to calculate Customer Lifetime Value (CLV):
SELECT
    SUM(total_sales) AS total_spent,
    COUNT(DISTINCT order_id) AS total_orders,
      ...
ORDER BY
    clv_yearly DESC;
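
Putting it all together programmatically: here is a sketch of how such a prompt could be assembled from these categories and sent to a model, assuming the OpenAI Python client; assemble_prompt and the placeholder contents are illustrative, not a fixed API:

from openai import OpenAI

def assemble_prompt(question, schema_ddl, feedback, domain_facts, example_queries, metric_rules):
    # Stack the sections in the same order as the prompt shown above.
    parts = [
        "You are a helpful assistant specialising in data analysis in a snowflake warehouse.",
        "Answer the questions by providing SQL code that is compatible with the snowflake environment.",
        f"This is the question you are required to answer:\n{question}",
        f"Here is the relevant context of the database:\n{schema_ddl}",
    ]
    if feedback:
        parts.append("The user has reported the following feedback:\n" + "\n".join(feedback))
    if domain_facts:
        parts.append("This is some business context that is relevant to the question:\n"
                     + "\n".join(f"{i + 1}. {fact}" for i, fact in enumerate(domain_facts)))
    if example_queries:
        parts.append("Historically, business analysts have used these queries for similar questions:\n"
                     + "\n\n".join(example_queries))
    if metric_rules:
        parts.append("Relevant derived-metric definitions:\n" + "\n\n".join(metric_rules))
    return "\n\n".join(parts)

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
response = client.chat.completions.create(
    model="gpt-4o",  # any chat model would do
    messages=[{"role": "user", "content": assemble_prompt(
        question="What is the total number of customers in the Chocolate Haven branch?",
        schema_ddl="create or replace TABLE CUSTOMER_DETAILS (...);",
        feedback=["You should not be looking in the orders table"],
        domain_facts=["The Belgian branch is called the Chocolate Haven branch."],
        example_queries=[],
        metric_rules=[],
    )}],
)
print(response.choices[0].message.content)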

Take your LLM-SQL bot to the next level with data engineering:

For a potential future client, we created a proof of concept that showcases what LLMs can do for SQL code generation on top of their data warehouse. We created a simple user interface where business professionals can interact with a chatbot, generate SQL statements, and print out the resulting table.

In the case of the demo I worked on, the database contained only 8 tables, with schema lengths varying between 3 and 18 columns. The Data Definition Language (DDL) commands (CREATE TABLE …) for all of them add up to about 1k tokens. In this small example, I had enough room within the token limit to provide everything in the same context. However, that might not be the case for large organisations, which tend to have a few hundred tables in their databases. This is where the power of Retrieval-Augmented Generation (RAG) comes in.

I relied on the architecture proposed by Kaarthik in his blogpost. His main idea was to use a RAG that retrieves information from a vector database built from the table schemas. The result of the retrieval is then engineered into the prompt given to the LLM through an API. In summary, this is the architecture that was followed:

By using a RAG, you do not need to include the full context in the prompt, but only the most relevant parts. The idea of a RAG is to rely on pre-existing knowledge or data to provide accurate responses: the relevant information is retrieved and inserted into the prompt so the LLM can generate better answers. In the case of this demo, that knowledge was the table schemas (the DDL queries), assembled as readme files with some context for each table and added to the vector database. Relevant schemas are retrieved with a similarity search before being engineered into the prompt and passed to the LLM in an API call. The input query is embedded as a numerical vector and compared against the vectors stored in the database; the retrieval function returns the N most similar records. You can read more about how to store and retrieve similarities between vector embeddings here.
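
As an illustration, here is a minimal sketch of that retrieval step, assuming an in-memory store and a toy embed() function standing in for a real embedding model; a production setup would use an actual vector database:

import numpy as np

def embed(text: str) -> np.ndarray:
    # Toy stand-in for a real embedding model (OpenAI, sentence-transformers, ...):
    # a character-frequency vector, just enough to make this sketch runnable.
    vec = np.zeros(128)
    for ch in text.lower():
        vec[ord(ch) % 128] += 1.0
    return vec

# One readme-style document per table: the DDL plus a short description.
table_docs = {
    "CUSTOMER_DETAILS": "create or replace TABLE CUSTOMER_DETAILS (...); -- customer master data",
    "ORDERS": "create or replace TABLE ORDERS (...); -- one row per customer order",
}
doc_vectors = {name: embed(doc) for name, doc in table_docs.items()}

def retrieve(question: str, n: int = 3) -> list[str]:
    # Embed the question and return the N most similar schema documents (cosine).
    q = embed(question)
    def cosine(v: np.ndarray) -> float:
        return float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
    ranked = sorted(table_docs, key=lambda name: cosine(doc_vectors[name]), reverse=True)
    return [table_docs[name] for name in ranked[:n]]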

Having the business-context categories, database schema, and user feedback stored and indexed efficiently in a vector database is where the magic of data engineering happens: whether the data sources are PDFs, Word documents, text files, or databases, they can all be ingested into a centralised database where they are structured, indexed, and made searchable.

Domain-specific knowledge can be stored in a knowledge graph, where entities and their relationships are represented as nodes and edges. Historically used queries can be stored in a query table, where they are categorised and tagged based on their purpose and usage. Data transformation and enrichment logic can be stored in a rule-book database. As data engineers, we strive to build systems that are not only efficient but also flexible and responsive to change. By using the power of ETL (Extract, Transform, Load) pipelines, we can create dynamic processes that automatically update our business-knowledge databases as new insights emerge and business contexts evolve. This can include a monitoring ETL pipeline that continuously observes these different data sources, applies the necessary transformations, and updates the related databases.
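
A sketch of what such a monitoring step could look like, assuming a hypothetical vector-store client exposing get_hash() and upsert(), and reusing the embed() stand-in from the sketch above:

import hashlib

def refresh_knowledge_base(sources: dict[str, str], store) -> None:
    # `sources` maps a document id (file path, query name, ...) to its current text.
    # `store` is a hypothetical vector-store client with get_hash() and upsert().
    for doc_id, text in sources.items():
        digest = hashlib.sha256(text.encode()).hexdigest()
        if store.get_hash(doc_id) != digest:  # only re-embed documents that changed
            store.upsert(doc_id, vector=embed(text), text=text, content_hash=digest)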

Large Context Windows

Google’s Gemini 1.5 comes with a 128k-token context window, and a limited number of developers are already able to try a refined version with a window of up to 1 million tokens. This raises the question: is a RAG really needed at this point? With such a huge window, these models can capture roughly ten times more information within one prompt, without the need to retrieve relevant information from a vector database. Cutting out the RAG middle man can also simplify the pipeline, since the model directly accesses all the important context.

However, I would argue that a RAG would still be relevant in this case: the goal is to thoroughly understand which context the LLM considered “relevant” when generating a query. Magically generating a query might not be enough for a business analyst; the retrieved context can be returned alongside the query to explain how it was generated, so that a human can evaluate it.

Conclusion:

In conclusion, my recommendation for an LLM-SQL bot is to train your own open-source model (similar to DuckDB NSQL) and also to include the business-context categories mentioned in this blogpost. This ensures that the model is aware of the important context when generating responses. MotherDuck’s NSQL-7B does a good job of validating whether the generated SQL statement works on the provided schema. This validation process essentially acts as a sanitisation step for your query statements, helping to catch potential errors or mismatches before they cause problems in your database. Depending on the volume of data within the organisation, one can either include all information directly in the prompt or create a RAG solution.

It is equally essential to emphasise the importance of clean and meaningful data, along with a well-crafted prompt that captures the relevant business context. If organisations are set on creating their own LLM-SQL bot, my advice is to prioritise data quality control, data collection, and data cleaning. Normalising the data into a structured format also makes it considerably easier for the LLM to understand.
