
The AI Developer's Handbook: Practical guide to evaluate AI-generated SQL

Author

  • Utkarsh Ohm (Guest), @utkarshohm on Twitter. Utkarsh is Director of AI & ML Engineering at ThoughtSpot Inc. He builds Sage, a Natural Language Q&A engine for analytics. He has been building AI applications and engineering teams for the last decade in startups.

SQL has one of the simplest grammars among programming languages. LLMs have taken the world of text-to-SQL by storm. The new programming language for information retrieval and data analytics is Natural Language. Techniques on top of GPT-4 are approaching human accuracy on benchmark datasets like Spider (~90%), and model builders report 60% accuracy on real-world datasets. However, when I spoke with 20+ AI teams, they all complained about how hard it is to compare model-generated SQL with the expected SQL, or to compare the outputs of two models. Most of them were still using spreadsheets, given the myriad visualization, filtering, labeling and diffing needs.

In this blog, I will describe the complexities in evaluating SQL, share my opinion on what constitutes a best-in-class evaluation framework and discuss what concrete methods you can use to build your evaluation system.

Generating SQL code is easy but evaluating it is hard

What makes comparing SQL statements so hard?

Two very different-looking outputs, both correct

SQL is a versatile language. A question can be answered with a number of different-looking SQL expressions. Consider these two equivalent statements that look very different: one is generated by GPT-3.5T, the other is defined as the expected SQL in the Spider dataset for the question "How many car makers are there in each continents? List the continent name and the count".

SELECT T1.Continent, count(*) 
FROM CONTINENTS AS T1 
JOIN COUNTRIES AS T2 ON T1."Cont Id" = T2.continent 
JOIN car_makers AS T3 ON T2."Country Id" = T3.Country 
GROUP BY T1.Continent;

SELECT c.continent, COUNT(cm.id) AS num_car_makers 
FROM countries c 
JOIN car_makers cm ON c."country id" = cm.country 
JOIN continents con ON c.continent = con."cont id" 
GROUP BY c.continent;

Here is another example

SELECT COUNT(*) 
FROM flights 
WHERE airline = (
		SELECT airline_id 
		FROM airlines 
		WHERE airline_name = 'United Airlines') 
	AND destination_airport = 'ASY';

SELECT count(*) 
FROM AIRLINES AS T1 
JOIN FLIGHTS AS T2 ON T2.Airline  =  T1.uid 
WHERE T1.Airline  =  "United Airlines" AND T2.DestAirport  =  "ASY"

Diff-ing two SQL statements is non-trivial. You may have to shuffle the order of columns and joins, do a case-insensitive match for some parts and a case-sensitive match for others, interpret the join expression, or even execute the statements and compare the data retrieved from the SQL engine (which, by the way, has an even lower return on investment, as discussed later in this blog). We will discuss methods of comparison later in this blog to solve this problem.

Input context consists of many different entities

Reasoning about the correctness of a SQL statement requires close review of multiple entities: the question, tables, columns, data types, data values, joins, keys and so on. The list of columns and values may be long as well. If you are evaluating your AI app end-to-end, you may even need to audit whether the user was authorized to access all the entities used in the request.

To evaluate the above SQL queries on cars you would have to review the schema

CREATE TABLE continents (
    "cont id" NUMERIC PRIMARY KEY, /* 1, 2, 3 */
    continent TEXT /* 'america', 'europe', 'asia' */
);

CREATE TABLE countries (
    "country id" NUMERIC PRIMARY KEY, /* 1, 2, 3 */
    "country name" TEXT, /* 'usa', 'germany', 'france' */
    continent NUMERIC, /* 1, 2, 3 */
    FOREIGN KEY (continent) REFERENCES continents("cont id")
);

CREATE TABLE car_makers (
    id NUMERIC PRIMARY KEY, /* 1, 2, 3 */
    maker TEXT, /* 'amc', 'bmw', 'honda' */
    "full name" TEXT, /* 'American Motor Company', 'BMW', 'Honda' */
    country TEXT, /* 1, 2, 3 */
    FOREIGN KEY (country) REFERENCES countries("country id")
);

For the example on flights, you would have to review some rows of data in the columns for airline name and destination airport. The RAG evaluation method covered below can help reduce the cognitive load of parsing through all these entities.

Drawing patterns from hundreds of SQLs is very hard

Drawing patterns across a number of SQL (structured text) statements generated by the same model is hard. Is the syntax of the generated SQL mostly correct? Are data values in WHERE clauses hallucinated often? Are column names generated correctly? Does the model have a natural tendency to use a sub-query or a join when a question can be answered by either? Are metric aggregation functions like sum, average, count and unique count generated correctly when they should be?

Detecting the errors made in the SQL or the clauses used, and labeling the SQL statements with them, enables you to draw patterns. The semantic and syntactic comparison methods covered below address this.

5 Methods to evaluate or compare SQL

It’s important to distinguish between comparison (when you can evaluate a SQL statement only by comparing it with another model's output or with the golden truth) and evaluation (when you can evaluate the statement in isolation).

When comparing two SQLs, start with quick-to-implement methods that find the most obvious, simple differences and progressively increase the sophistication to capture more and more subtle differences. The key insight is to implement these methods one by one in increasing order of RoI.

Data comparison

This is the most used methodology and is referred to as execution accuracy in benchmark leaderboards. Defog has published a good benchmark detailing this.

  1. Create the schema and load the data into a database like sqlite or postgres.
  2. Add deterministic ORDER BY clauses to both SQL statements for all selected columns so that data from two SQLs is ordered in the same way.
  3. Execute the SQLs to get two output tables.
  4. Do a quick comparison to avoid step 5 wherever possible: compare the number of rows and a hash of each row.
  5. Do a deeper comparison when the quick check doesn’t match: one by one, match each column from one output table with the columns from the other by formatting consistently based on data type (e.g., for floats, truncate to the same precision) and comparing hashes of values. Once all columns of one table are matched, the two SQLs are considered equivalent. Unmatched columns in the second table can be considered a benign difference. (A minimal sketch of these steps follows this list.)
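
Here is a minimal sketch of the flow above on an in-memory SQLite database. The tiny schema and the two queries are illustrative stand-ins, and only the quick row-hash check from step 4 is implemented; a production harness would add the column-by-column fallback from step 5.

# Minimal sketch: execute two SQL statements on SQLite and compare the results.
import sqlite3

def fetch_normalized(conn, sql, float_precision=4):
    # Execute a query and normalize each row (truncate floats, stringify values)
    # so that hashing and comparison behave consistently across queries.
    rows = conn.execute(sql).fetchall()
    return [
        tuple(round(v, float_precision) if isinstance(v, float) else str(v) for v in row)
        for row in rows
    ]

def quick_match(rows_a, rows_b):
    # Step 4: cheap check - compare row counts and a hash of the sorted row sets.
    return len(rows_a) == len(rows_b) and hash(tuple(sorted(rows_a))) == hash(tuple(sorted(rows_b)))

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE countries (country_id INTEGER PRIMARY KEY, continent TEXT);
INSERT INTO countries VALUES (1, 'europe'), (2, 'europe'), (3, 'asia');
""")

generated = "SELECT continent, COUNT(*) FROM countries GROUP BY continent ORDER BY continent"
expected  = "SELECT continent, count(*) FROM countries GROUP BY continent ORDER BY continent"
print(quick_match(fetch_normalized(conn, generated), fetch_normalized(conn, expected)))  # True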

However, data comparison has two big drawbacks. Both lead to a low return on investment.

  • High investment. It requires significant human effort: you have to verify that the schema matches, check data quality if data already exists in the tables, generate some rows of data if you don’t have access to the data, set up a database, normalize the SQL query so it aligns with the dialect of your chosen database (SQLite, Postgres and MySQL all have differences), and write a resilient query orchestrator.
  • Low return. It doesn’t give you actionable next steps to iterate on the prompt. The result of data comparison is binary (pass/fail). For queries that fail, you don’t know where in the SQL the mistake was. It is like an end-to-end test, whereas the other methods are more like unit tests and hence more actionable for the developer. For example, you can catch syntax errors or hallucinations through easier-to-implement methods that we discuss in the following sections.

You will eventually need this for those classes of queries that notoriously can’t be evaluated or compared by any other method, but on its own it is not nearly enough.

Textual comparison

This is referred to as matching accuracy in academic circles. Exact-match or F1 scores are used as metrics to do a like-for-like comparison.

  1. Normalize all SQL statements through operations like standardizing formatting, spacing and casing, handling special characters like \n or " or ' that models generate inconsistently, alphabetically sorting phrases separated by commas, and so on (sketched after the example below).
  2. Compare the normalized SQL strings. This will still classify many equivalent statements as string mismatches. You can make it more robust by comparing the SQL strings one phrase at a time.
  3. Split each SQL statement into phrases and categorize them into types like SQL clauses (select, where, order by, having, group by, join etc.), SQL functions (measure aggregations, date functions, operators etc.), data entities (table, column, column value, table/column alias) and so on.
  4. Compare phrases of the same type with each other; e.g., a filter phrase should only be compared with other filter phrases.

The following statements could be deemed equivalent with this method. SQL generated by some models like Gemini is inconsistently prefixed with a ```sql fence. It's an important detail when integrating the model with an app, but it can be ignored when comparing models.

```sql
SELECT T1.Continent, count(T1.id) FROM COUNTRIES AS T1 JOIN CONTINENTS AS T2 ON T2."Cont Id" = T1.continent JOIN car_makers AS T3 ON T1."Country Id" = T3.Country GROUP BY T1.Continent

```sql
SELECT c.continent, COUNT(cm.id) AS num_car_makers 
FROM countries c 
JOIN car_makers cm ON c."country id" = cm.country 
JOIN continents con ON c.continent = con."cont id" 
GROUP BY c.continent;
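
As a minimal sketch of the normalization in step 1, the function below strips the markdown fence, drops the trailing semicolon, collapses whitespace and lowercases the string. The rules and regexes are illustrative (lowercasing string literals is itself a simplification), and statements as different as the pair above would still need the phrase-level comparison of steps 3 and 4.

# Minimal sketch of normalizing SQL text before string comparison.
import re

def normalize(sql: str) -> str:
    s = sql.strip()
    s = re.sub(r"^```(sql)?|```$", "", s).strip()  # drop markdown fences some models emit
    s = s.rstrip(";")                              # trailing semicolon is a benign difference
    s = re.sub(r"\s+", " ", s)                     # collapse newlines and repeated spaces
    return s.lower()                               # simplification: also lowercases string literals

a = "```sql\nSELECT  count(*) FROM airlines;\n```"
b = "select COUNT(*)\nfrom AIRLINES"
print(normalize(a) == normalize(b))  # True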

Syntactic evaluation

  1. Normalize the SQL as described in the first section.
  2. Parse the SQL using a library like sqlglot. If parsing throws an error, mark the statement as syntactically incorrect.
  3. If you want to learn the common mistakes made by the model, you can label the SQL with the error code/message. However, this requires more effort than meets the eye. Often a SQL statement has multiple syntax errors, and a single parsing pass reports only the first error encountered; more serious errors may be revealed only after you fix that one, so you may get a more favorable evaluation of the outputs than is warranted. Do a couple more passes of parsing to fix this bias: prompt a state-of-the-art model (like GPT-4, not the model you are evaluating) to refine the SQL given the error message to get an AI-fixed SQL, then restart from step 1.
  4. Stop this loop if you get the same error consecutively, or after you get 3 unique errors. Why unique? Because if a model messed up one column name containing two words, it likely messed up other such column names as well.

GPT-3.5T answered the question "Find the first name and age of students who have a pet" with this SQL, which can benefit from multi-pass syntactic evaluation (a single-pass sketch follows the example).

SELECT student.first name, student.age 
FROM student 
JOIN has pet ON student.student id = has pet.student id;
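
Below is a minimal single-pass sketch using sqlglot; the multi-pass loop from step 3 would feed the reported error back to a stronger model for a fix and re-parse, which is not shown here. Note that sqlglot is more lenient than a real engine, so some mistakes may only surface at execution time.

# Minimal sketch of syntax checking with sqlglot (single pass).
import sqlglot
from sqlglot.errors import ParseError

sql = """SELECT student.first name, student.age
FROM student
JOIN has pet ON student.student id = has pet.student id;"""

try:
    sqlglot.parse_one(sql, read="sqlite")
    print("parsed; a lenient parser may still miss mistakes a database would reject")
except ParseError as err:
    # Label the record with the first reported error; after an AI-suggested fix,
    # a second parsing pass may surface further errors (step 3 above).
    print("syntax error:", err.errors[0]["description"])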

RAG evaluation

Generating SQL leverages 3 emergent capabilities of LLMs:

  • Retrieval Augmented Generation
  • Understanding analytical intent and
  • Generating syntactically correct SQL

Syntactic evaluation tests the SQL on the third, whereas textual comparison evaluates the first two; however, textual comparison requires a ground-truth SQL to compare with. You can evaluate the RAG quality of a SQL statement without the ground truth.

Did the model get the column and table names right in the output? Did it hallucinate a column that didn’t exist in the schema because it couldn’t find a relevant column for the metric in the question? You can answer these questions by following the steps below; a minimal sketch follows the list.

  1. Extract all the data entities from the SQL output, like table, column, values, join column. You can do this by building an Abstract Syntax Tree for it. Classify them into one of table, column or value if you can.

  2. Extract all the data entities from the schema.

  3. For each entity in SQL output, verify that it is present in the list of entities from schema. For values, you may simply verify the data type as seen in the schema.
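
Here is a minimal sketch of step 3 for the cars example using sqlglot's AST. The schema entity sets are hand-listed for illustration; in practice you would extract them from the DDL or the database's information schema.

# Minimal sketch: check identifiers in the generated SQL against the schema.
import sqlglot
from sqlglot import exp

SCHEMA_TABLES = {"continents", "countries", "car_makers"}
SCHEMA_COLUMNS = {"cont id", "continent", "country id", "country name",
                  "id", "maker", "full name", "country"}

sql = (
    'SELECT c.continent, COUNT(cm.id) FROM countries c '
    'JOIN car_makers cm ON c."country id" = cm.country GROUP BY c.continent'
)

tree = sqlglot.parse_one(sql, read="sqlite")
tables = {t.name.lower() for t in tree.find_all(exp.Table)}
columns = {c.name.lower() for c in tree.find_all(exp.Column)}

print("hallucinated tables:", tables - SCHEMA_TABLES)     # set()
print("hallucinated columns:", columns - SCHEMA_COLUMNS)  # set()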

Semantic comparison

Do the two SQL statements have the same analytical intent, no matter how different they look? Can I conclude that without comparing data?

  1. Normalize the SQL as described in the first section.

  2. Cast each SQL statement into an Abstract Syntax Tree.

  3. Compare the trees and find the subgraphs with differences. You can label the SQL with an error category: the SQL clause of the root node of the differing subgraph (a diff sketch follows this list).

  4. Compare the SQL of the subgraphs using AI and human in the loop. You can define classes of SQL phrases that are equivalent or near-equivalent. For example,

    SELECT id, maker
    FROM car_makers
    WHERE id IN (
        SELECT maker
        FROM model_list
        GROUP BY maker
        HAVING count(*) > 3
    );
    
    SELECT T1.id, T1.maker
    FROM car_makers AS T1
    JOIN model_list AS T2 ON T1.id = T2.maker 
    GROUP BY T1.id 
    HAVING count(*) > 3;
    
  5. The above classes need not be exhaustive. You can build an evaluation UX that grows this set of semantic equivalence classes over time. Such a UX:

    1. allows a human evaluator to select a phrase in each of two SQL statements
    2. adds them to the repository of equivalent SQL phrases
    3. finds other instances of this phrase pair in the dataset
    4. seeks confirmation from the evaluator that the equivalence definition can be applied to them, then applies it.
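
For steps 2 and 3, sqlglot ships a tree diff that can localize where two statements disagree; the equivalence classes of steps 4 and 5 still need the human/AI loop described above. A minimal sketch:

# Minimal sketch: diff two SQL ASTs and report the nodes that changed.
from sqlglot import parse_one
from sqlglot.diff import Keep, diff

generated = "SELECT continent, count(*) FROM continents GROUP BY continent"
expected  = "SELECT continent, count(*) FROM continents GROUP BY continent ORDER BY continent"

edit_script = diff(parse_one(generated), parse_one(expected))
for edit in edit_script:
    if not isinstance(edit, Keep):
        # Each non-Keep edit (Insert/Remove/Update/Move) points at a subtree;
        # its SQL clause can be used as the error label from step 3.
        print(type(edit).__name__, getattr(edit, "expression", None))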

Guide to building best-in-class SQL evaluation

For an evaluation framework to be robust to real-world data variations and to the evolving needs of your app, it needs to be built on these principles.

Be practical instead of comprehensive

There are multiple ways to evaluate structured output like SQL; implement them one by one, assessing which ones will be effective on your dataset. Some records may be easily evaluated using a deterministic heuristic, another set may be better evaluated with a probabilistic approach, and some will need human evaluation no matter what. There is no one best methodology for comparison.

Consider these SQLs for the question "What is the code of airport that has the highest number of flights?"

/*GPT-3.5T*/
SELECT source_airport 
FROM flights 
GROUP BY source_airport 
ORDER BY COUNT(*) DESC 
LIMIT 1;
/*Gemini-pro*/
SELECT airport_code
FROM Airports
WHERE number_of_flights = (
	SELECT MAX(number_of_flights) 
	FROM Airports
);
/*expected*/
SELECT T1.AirportCode 
FROM AIRPORTS AS T1 
JOIN FLIGHTS AS T2 ON T1.AirportCode  =  T2.DestAirport OR T1.AirportCode  =  T2.SourceAirport 
GROUP BY T1.AirportCode 
ORDER BY count(*) DESC 
LIMIT 1

Which automated method should we use to compare these SQLs? An easy answer is data comparison, but it is also high effort to implement. Look closely and you can find cheaper alternatives. You can flag the Gemini-Pro SQL through RAG evaluation, because number_of_flights is a hallucinated column. You can detect that the GPT-3.5T SQL is syntactically correct but semantically incorrect in the JOIN clause by comparing it with the expected SQL. The semantic error label is a lot more valuable than a 0/1 label from data comparison, even if they take the same time to implement.

Leverage AI for evaluation in a way that improves efficacy non-linearly with human effort

It’s important not only to implement your evaluation system with both AI and a human in the loop, but to integrate them in a way that lets the AI learn from the human and become more proficient at evaluation. If a human evaluates ten examples, your system should be able to generalize some learning from them so that it can evaluate ten more similar examples on its own. It’s important to distinguish two approaches here:

  1. AI evaluates, human reviews it. This is what you may have seen most often on the internet. You prompt a model to evaluate your app first, then ask a human to review the AI’s evaluation or a sample of it. GPT-4 is often used as an evaluator of the output of other models. However, human effort does not compound here, because the AI doesn’t get the opportunity to learn from the human evaluation.
  2. The paradigm of human first, then AI, in a continuous loop is less common but can be more effective. For example, in SQL generation, a human can annotate phrases in two SQLs that are semantically equivalent. Feed 2-3 such examples (counter-examples work too, i.e. similar-looking phrases that aren’t equivalent) to GPT-4, ask it to generate code to compare SQL phrases or to reason step by step using those examples as unit tests, execute that code/prompt on your whole dataset, review a sample of the outputs, and commit the code/prompt change to the evaluation system. You will likely find more such phrases with high accuracy because you focused on specific phrases instead of the entire statement. For example, if you annotate a few examples of count(*) being equivalent to count(<any column in table>), or <> to !=, the model can learn that when comparing SQLs. The same works for the equivalence of a sub-query and a join query. A minimal sketch of such an equivalence repository follows.
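
Here is a minimal sketch of such a repository, applied as canonicalization rules before phrase comparison. The two rules mirror the annotations above and are purely illustrative; in the loop described here, GPT-4 would propose new rules from annotated pairs and a human would review them before they are committed to the evaluation system.

# Minimal sketch: phrase-equivalence rules seeded by human annotations.
import re

EQUIVALENCE_RULES = [
    (r"<>", "!="),                            # operators annotated as equivalent
    (r'count\(\s*[\w."]+\s*\)', "count(*)"),  # count(col) ~ count(*), valid when col is never NULL
]

def canonicalize(phrase: str) -> str:
    out = re.sub(r"\s+", " ", phrase.lower())
    for pattern, replacement in EQUIVALENCE_RULES:
        out = re.sub(pattern, replacement, out)
    return out

# Phrases from two SQLs that a human would otherwise have to eyeball:
print(canonicalize("COUNT(T1.id) > 3") == canonicalize("count(*)  >  3"))  # True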

Be simple enough for builders across the technical spectrum

It should be usable not just by ML engineers and software developers but also by QA engineers, PMs and domain-knowledgeable but less technically skilled human evaluators.

This is a guest post. Utkarsh leads engineering for ThoughtSpot Sage, a Natural Language Q&A engine for analytics. ThoughtSpot's mission is to bring data analytics to business users. Utkarsh has been building AI applications and engineering teams for the last decade, as a programmer, entrepreneur and startup tech lead. He enjoys working on high-risk high-reward unsolved engineering & ML problems, and more recently building with GenAI.