Prompt Engineering for Better SQL Code Generation With LLMs
May 29, 2024 • Raghid Bsat
Picture yourself as a marketing executive tasked with optimising advertising strategies to target different customer segments effectively. You are required to answer the following question: “Which customer segments yield the highest return on investment for our advertising campaign?”. You might want to look at customer segment profitability and compare it with Customer Lifetime Value (CLV) to answer such questions. The following SQL query could be the answer:
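(A sketch of one possible shape for such a query; the tables customer_segments, orders, and ad_spend are invented for illustration.)

```sql
-- Illustrative only: customer_segments, orders, and ad_spend are assumed tables.
WITH segment_revenue AS (
    SELECT cs.segment_name,
           SUM(o.order_amount) AS total_revenue,
           SUM(o.order_amount) / COUNT(DISTINCT o.customer_id) AS avg_clv
    FROM customer_segments cs
    JOIN orders o ON o.customer_id = cs.customer_id
    GROUP BY cs.segment_name
),
segment_spend AS (
    SELECT segment_name,
           SUM(spend_amount) AS total_ad_spend
    FROM ad_spend
    GROUP BY segment_name
)
SELECT r.segment_name,
       r.avg_clv,
       (r.total_revenue - s.total_ad_spend) / s.total_ad_spend AS roi
FROM segment_revenue r
JOIN segment_spend s ON s.segment_name = r.segment_name
ORDER BY roi DESC;
```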
Crafting such a complex query demands expertise and context beyond the scope of many business professionals, and it takes time to write. There is often a database-language barrier between business leaders and the data-driven decisions they are required to make: the data is typically only reachable through SQL, which many business professionals find daunting or hardly practical. The rise of GenAI has opened the door for AI assistants to overcome this barrier, as Large Language Models (LLMs) have demonstrated the ability to understand and generate natural language. Using LLMs to create SQL statements from natural language can ease the issue of data access for non-technical profiles.
As of today, there have been many attempts at creating a state-of-the-art LLM SQL query generator. One such example is MotherDuck’s “AI that Quacks, DuckDB NSQL-7B”. That blogpost explains the effort by MotherDuck and Numbers Station to create a text-to-SQL LLM: they built an open-source model on top of CodeLlama and fine-tuned it on a custom training set, a modified version of NSText2SQL, and they report exceptionally good performance on the SQL benchmarks used. While this tool sounds amazing, and training your own open-source model on a custom SQL training dataset can be a significant leap forward compared to simply using pre-existing models, it is just as important to provide additional context to the model. This involves crafting precise and contextually appropriate prompts to elicit the desired responses: a suitable prompt gives the LLM the context it needs to create error-free queries. In this blogpost, I will address the challenge of prompt engineering and explore the scenarios where leveraging a Retrieval-Augmented Generation (RAG) solution is advantageous compared to including all information directly in the prompt.
…
How to build an effective prompt?
Include database schema
The main challenge with using an LLM to generate SQL code is making sure the model knows the schema of the tables we are trying to query. Such information can be given through the ‘context window’, which is the number of tokens (~words) the model can take as input when generating responses. In the case of generating SQL code, we can include our database schema in the context before asking our question. The catch is that LLMs come with a limited context window. For example, GPT-4 comes with a window of 128,000 tokens (~96,000 words). If your database is small enough, you can provide the full schema as part of the prompt.
This is an example of a prompt with the database schema included:
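(A sketch of one common prompt format: the DDL statements followed by the question as a comment, shown here with two abbreviated TPC-H tables.)

```sql
-- You are an expert Snowflake SQL analyst.
-- Given the database schema below, answer the question with a single SQL query.

CREATE TABLE customer (
    c_custkey    INTEGER,
    c_name       VARCHAR,
    c_nationkey  INTEGER,
    c_acctbal    DECIMAL(12, 2),
    c_mktsegment VARCHAR
);

CREATE TABLE orders (
    o_orderkey   INTEGER,
    o_custkey    INTEGER,
    o_orderdate  DATE,
    o_totalprice DECIMAL(12, 2)
);

-- Question: How many customers have a negative account balance?
-- Answer (SQL only):
```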
Testing this out on the Snowflake sample dataset (TPCH_SF1), I noticed that, in general, the easy queries were answered correctly. These queries are characterised by simple filters and multiple conditions:
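For example, a question like “List the customers in the BUILDING market segment with an account balance above 1000” maps to a query of this shape (a sketch against the TPC-H CUSTOMER table):

```sql
-- Simple filters, multiple conditions: the kind of query the LLM got right.
SELECT c_name,
       c_acctbal
FROM customer
WHERE c_mktsegment = 'BUILDING'
  AND c_acctbal > 1000
ORDER BY c_acctbal DESC;
```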
Include user feedback (session history)
In the case of complex queries with multiple joins and aggregations, the LLM was not able to produce accurate results. There were two main issues the LLM faced:
Semantic Complexity between tables: Complex queries often involve specific relationships between tables, nested subqueries, and conditional logic. LLMs may struggle to grasp the full semantic meaning of these complex structures, leading to inaccuracies in understanding and reproducing the intended query semantics.
Business Vagueness: There can be ambiguous references, such as column aliases or table references that the question leaves unresolved. Resolving these ambiguities requires contextual understanding and inference capabilities, which may be challenging for LLMs. In the case of query 2, the LLM assumed that certain columns already existed in the schema, which produced an invalid SQL statement.
User feedback can ‘push’ the LLM in the right direction and lead to a correct query after a failed attempt. This is the next step of prompt engineering: adding relevant context beyond the database schema can be crucial for optimised SQL code generation. You can include the session history as part of the prompt like this:
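(A sketch; the failed attempt, database error, and user correction below are all illustrative.)

```sql
-- Previous attempt (failed):
--   SELECT customer_segment, SUM(o_totalprice) FROM orders GROUP BY customer_segment;
-- Database error:
--   invalid identifier 'CUSTOMER_SEGMENT'
-- User feedback:
--   The segment is not stored on ORDERS; join CUSTOMER and use C_MKTSEGMENT instead.

-- Question: What is the total revenue per market segment?
-- Corrected answer (SQL only):
```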
Include business information
For more in-depth business context, large organisations tend to have relevant data scattered across different formats and sources. I suggest splitting this data into three categories:
Domain-specific knowledge: information about the business that is relevant to the question. For example, we can include statements such as:
| The Belgian branch is called the “Chocolate Haven” branch.
| The Benelux region is made up of the Luxembourg, Netherlands, and Belgium branches.
| Marketing campaigns in the “Ardennes” region have historically shown higher conversion rates.
Historically Used Queries: queries that business analysts reach for often and that serve as go-to queries for common analytical needs: monthly sales report by region, customer churn analysis by product category, top-selling products by quarter, profitability analysis by customer segment.
Data Transformation and Enrichment Logic: the logic and rules applied to transform and enrich raw data into meaningful business insights. Examples include calculations for derived metrics such as customer lifetime value (CLV), average order value (AOV), or conversion rates, and time-based calculations for trend analysis, seasonality adjustments, or cohort analysis (see the sketch after this list).
Bonus Category: Business Rules and Policies: While this is still experimental and there is no established solution for handling data governance with GenAI, I would generally handle it with care. This category includes rules such as access control policies that determine who within the organisation has permission to view or modify certain tables, or data retention policies. I would instead push this into an SQL-sanitising module that runs after the LLM returns an answer.
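As an illustration of the second and third categories, a historical query and a piece of enrichment logic could be kept as reusable SQL snippets like these (table and column names invented for the example):

```sql
-- Historically used query: monthly sales report by region (illustrative schema).
SELECT b.region,
       DATE_TRUNC('month', o.order_date) AS order_month,
       SUM(o.order_amount)               AS total_sales
FROM orders o
JOIN branches b ON b.branch_id = o.branch_id
GROUP BY b.region, DATE_TRUNC('month', o.order_date);

-- Enrichment logic: average order value (AOV) per customer.
CREATE OR REPLACE VIEW customer_aov AS
SELECT customer_id,
       SUM(order_amount) * 1.0 / COUNT(*) AS avg_order_value
FROM orders
GROUP BY customer_id;
```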
The final prompt in action:
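(A sketch of the assembled prompt, combining schema, domain knowledge, a historical query, and the user’s question; all names and facts are illustrative.)

```sql
-- You are an expert Snowflake SQL analyst. Use only the schema below.

CREATE TABLE branches  (branch_id INTEGER, branch_name VARCHAR, region VARCHAR);
CREATE TABLE campaigns (campaign_id INTEGER, branch_id INTEGER, spend DECIMAL(12, 2));
CREATE TABLE orders    (order_id INTEGER, branch_id INTEGER, order_date DATE, order_amount DECIMAL(12, 2));

-- Domain knowledge:
--   * The Belgian branch is called the "Chocolate Haven" branch.
--   * The Benelux region is made up of the Luxembourg, Netherlands, and Belgium branches.

-- Historically used query (monthly sales report by region):
--   SELECT b.region, DATE_TRUNC('month', o.order_date), SUM(o.order_amount)
--   FROM orders o JOIN branches b ON b.branch_id = o.branch_id
--   GROUP BY 1, 2;

-- Session history: none.

-- Question: What is the return on advertising spend per branch in the Benelux region?
-- Answer (SQL only):
```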
…
Take your LLM-SQL bot to the next level with data engineering:
For a potential future client, we created a proof-of-concept that showcases what LLM-based SQL code generation can do on top of their data warehouse: a simple user interface where business professionals can interact with a chatbot, generate SQL statements, and print out the resulting table. In the case of the demo I worked on, the database contained only 8 tables, with schemas ranging from 3 to 18 columns. The Data Definition Language (DDL) commands (CREATE TABLE …) for all of them add up to roughly 1k tokens, so in this small example I had enough room within the token limit to provide everything in the same context. However, that might not be the case for large organisations, which tend to have a few hundred tables in their databases. This is where the power of Retrieval-Augmented Generation (RAG) comes in. I relied on the architecture proposed by Kaarthik in his blogpost: use a RAG that retrieves information from a vector database built from the table schemas, then engineer the result of the retrieval into the prompt sent to the LLM through an API. In summary, this is the architecture followed:
By using a RAG, you do not need to include the full context in the prompt, but only the most relevant parts. The idea of a RAG is to rely on pre-existing knowledge or data to provide accurate responses: the relevant information is retrieved and inserted into the prompt of the LLM to generate better answers. In the case of this demo, the table schemas (data definition language queries) were assembled as readme files, with some context added for each table, and stored in the vector database. Relevant schemas are retrieved with a similarity search before they are engineered into the prompt and passed to the LLM as an API call. The input question is embedded as a numerical vector and compared to the vectors stored in the database; the retrieval function returns the N most similar records. You can read more about how to store and retrieve similarities between vector embeddings here.
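If the schema documents were stored in, say, Postgres with the pgvector extension (an assumption for illustration; the demo’s actual vector store is not detailed here), the retrieval step could look like this:

```sql
-- schema_docs holds one row per table: its DDL, a short description, and an embedding.
-- CREATE TABLE schema_docs (table_name TEXT, ddl TEXT, embedding VECTOR(1536));

SELECT table_name,
       ddl
FROM schema_docs
ORDER BY embedding <=> :question_embedding  -- pgvector cosine distance to the embedded question
LIMIT 5;                                    -- return the N = 5 most similar schemas
```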
Having the business context categories, database schema, and user feedback stored and indexed efficiently in a vector database is where the magic of data engineering should happen: whether the data sources are PDFs, Word documents, text files, or databases, they can be ingested into a centralised database where they are structured, indexed, and made searchable.
Domain-specific knowledge can be stored in a knowledge graph, where entities and their relationships are represented as nodes and edges. Historically used queries can be stored in a query table, where they are categorised and tagged based on their purpose and usage. Data transformation and enrichment logic can be stored in a rule-book database. As data engineers, we strive to build systems that are not only efficient but also flexible and responsive to change. By using the power of ETL (Extract, Transform, Load) pipelines, we can create dynamic processes that automatically update our business knowledge databases as new insights emerge and business contexts evolve. This can include a monitoring ETL pipeline that continuously observes these different data sources, applies the necessary transformations, and updates the related databases.
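A minimal sketch of what such a query table could look like (the columns are assumptions, not a prescribed design):

```sql
-- One row per historically used query, tagged so it can be retrieved and reused.
CREATE TABLE historical_queries (
    query_id     INTEGER,
    title        VARCHAR,        -- e.g. 'Monthly sales report by region'
    tags         VARCHAR,        -- e.g. 'sales,region,monthly'
    sql_text     VARCHAR,        -- the reusable SQL statement itself
    last_used_at TIMESTAMP
);
```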
Large Context Windows
Google’s Gemini 1.5 comes with a 128k-token context window, and a limited number of developers are already able to try a refined version with a window of up to 1 million tokens. This begs the question: is a RAG really needed at this point? With such a huge window, these models can capture roughly ten times more information within one prompt, without the need to retrieve relevant information from a vector database. This can lead to a more efficient inference and training process, since the RAG middleman is cut out and the model can directly access all the important context.
However, I would argue that a RAG would still be relevant in this case: the goal is to thoroughly understand which context the LLM considered “relevant” when generating a query. Magically generating a query might not be enough for a business analyst; the retrieved context can be returned alongside the query to explain how it was generated, so that a human can evaluate it.
…
Conclusion:
My recommendation for an LLM-SQL bot is to train your own open-source model (similar to DuckDB NSQL) and also include the business categories mentioned in this blogpost. This ensures that the model is well aware of important context when generating responses. MotherDuck’s NSQL-7B does a good job of validating whether the generated SQL statement works on the provided schema or not. This validation essentially acts as a sanitisation step for your query statements, helping to catch potential errors or mismatches before they cause problems in your database. Depending on the volume of data within the organisation, one could opt to either include all information directly in the prompt or create a RAG solution.
It is equally essential to emphasise the importance of clean and meaningful data alongside a well-crafted prompt that captures the relevant business context. If organisations set out to create their own LLM-SQL bot, my advice is to prioritise data quality control, data collection, and data cleaning. Normalising the data into a structured format has also proven to be easier for the LLM to understand.