
The AI Developer's Handbook: Comparing Models

Author: Utkarsh Ohm (@utkarshohm on Twitter), guest author

Utkarsh is Director of AI & ML Engineering at ThoughtSpot Inc. He builds Sage, a Natural Language Q&A engine for analytics. He has been building AI applications and engineering teams for the last decade in startups.

Building software in 2024 is vastly different from building software in 2022, thanks to the advancement of AI and the democratization of access to it. Since 2010 the application-building experience kept getting simpler as cloud providers created increasingly higher-level abstractions for serving applications. Suddenly, in 2023, it got more complex with the introduction of two new entities in building applications - Datasets and Models.

Evolution of app development

As builders we are all up-skilling ourselves on datasets and models, and looking for tools to help us build with them. The journey of building an application now closely involves evaluating datasets, models and your application’s AI functionality. You will have to build your own dataset to ship with confidence. If foundation models aren't good enough at your task, or cost is a concern, then you may build smaller, better models for your application.

The dev loop now includes evaluating datasets and comparing models

When you start building your AI app, you may pick the de facto standard model (GPT-3.5T at the time of writing) or compare a few models against each other on public datasets to choose the best suited among them. In parallel, you also eyeball multiple public datasets to get a sense of their quality and fit for your use case. Once you start building your app, the dev loop kicks off. You are continuously building your app, evaluating it on golden datasets, and continuously improving those golden datasets. As your app gets more usage, your ability to curate the golden datasets keeps increasing, with the obvious investment from you and other domain-knowledgeable humans in the loop.

The AI dev loop

Traditional ML and NLP metrics like precision, recall and F1 score do not work well here because they were created for blobs of text, not for structured data or executable code. So how do you solve the problem?
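To make that concrete, here is a minimal sketch (the toy schema and queries are my own, not from any benchmark) of how a string-overlap score penalizes two SQL statements that are semantically identical, while simply executing both against a small SQLite database shows they agree:

```python
import difflib
import sqlite3

# Two semantically equivalent queries with different surface forms.
golden_sql = "SELECT region, SUM(amount) AS revenue FROM sales GROUP BY region"
generated_sql = (
    "SELECT s.region, SUM(s.amount) AS revenue "
    "FROM sales AS s GROUP BY s.region"
)

# A string-similarity score (a stand-in for token-level precision/recall/F1)
# reports a sizable difference even though the queries mean the same thing.
string_score = difflib.SequenceMatcher(None, golden_sql, generated_sql).ratio()
print(f"string similarity: {string_score:.2f}")  # noticeably below 1.0

# Executing both against a toy database shows they are equivalent.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 10.0), ("east", 5.0), ("west", 7.5)],
)
golden_rows = sorted(conn.execute(golden_sql).fetchall())
generated_rows = sorted(conn.execute(generated_sql).fetchall())
print("results match:", golden_rows == generated_rows)  # True
```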

Journey of evaluation

When you are building an AI app, there is so much experimentation and building to do that you cannot build out a complete evaluation framework at the outset. You build incrementally, increasing the sophistication of your framework as your app gets more usage and your needs change.

| Stage | Goal | Activity (text-to-SQL use case as an example) | Eval metric for output | Level of automation |
| --- | --- | --- | --- | --- |
| Prototyping | Which model should I build with? | Compare SQLs, rank which is better | 0 or 1 | Human eval |
| Early Access | Am I ready to ship the app? | Is generated SQL reasonably good? If yes, add it to golden list | 0 or 1 | Human eval + string matching |
| Pre-GA | What mistakes is the model making in the wild? For what inputs does it make them? | Classify diff between generated and golden SQL into categories. Classify question and schema complexity | Labels for input and output - dimensions to analyze by | Mostly human eval with some data engineering and AI eval |
| Post-GA | Report accuracy & coverage metric that the organization can rally behind | Build quantitative evaluation framework that compares SQL via multiple methods | Float between 0 and 1 | Automated with human in the loop |

If you are building a text-to-sql app, you probably started out by eyeballing SQL statements and categorizing some of them as golden output.
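A minimal sketch of what that early human-in-the-loop curation can look like, assuming a simple list of question/SQL pairs and a JSONL golden file (the field names and file name are illustrative):

```python
import json

# Each record pairs a user question with the SQL the model generated for it.
records = [
    {"question": "total revenue by region",
     "generated_sql": "SELECT region, SUM(amount) FROM sales GROUP BY region"},
    {"question": "top 5 customers by spend",
     "generated_sql": "SELECT customer, SUM(amount) FROM sales "
                      "GROUP BY customer ORDER BY 2 DESC LIMIT 5"},
]

golden = []
for record in records:
    print(record["question"])
    print(record["generated_sql"])
    # Eyeball the SQL; a simple yes/no keeps the barrier to curation low.
    if input("Is this SQL reasonably good? [y/n] ").strip().lower() == "y":
        golden.append(record)

# Persist the curated pairs as the golden dataset for later regression runs.
with open("golden.jsonl", "w") as f:
    for record in golden:
        f.write(json.dumps(record) + "\n")
```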

As you see more and more of your users’ intents and data, you start to understand the complexities your app needs to support. At this point, the evaluation evolves from binary to qualitative. You start to label the dataset, both manually and through automation.
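One low-effort way to bootstrap the automated side of that labeling is keyword heuristics over the SQL itself; the label names below are illustrative, not a standard taxonomy:

```python
import re

def label_sql_complexity(sql: str) -> list[str]:
    """Attach coarse complexity labels that later become filter dimensions."""
    text = sql.upper()
    labels = []
    if "JOIN" in text:
        labels.append("has_join")
    if "GROUP BY" in text:
        labels.append("has_aggregation")
    if re.search(r"\(\s*SELECT", text):
        labels.append("has_subquery")
    if "WHERE" not in text:
        labels.append("no_filter")
    return labels or ["simple"]

print(label_sql_complexity(
    "SELECT region, SUM(amount) FROM sales WHERE year = 2023 GROUP BY region"
))
# ['has_aggregation']
```

Even crude labels like these become the dimensions you later filter and aggregate by.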

As your evaluation dataset grows large, you feel the need for a single score that represents how good your application is so that everyone in your team has a shared sense of progress. This is when you build a more sophisticated SQL evaluation framework.
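Under the hood, such a framework can be sketched as tiered checks that each resolve to a partial score and roll up into the single float the Post-GA row above calls for. The specific weights and the in-memory SQLite execution here are assumptions for illustration:

```python
import sqlite3

def normalize(sql: str) -> str:
    """Cheap normalization: lowercase and collapse whitespace."""
    return " ".join(sql.lower().split())

def execution_match(generated: str, golden: str, conn: sqlite3.Connection) -> bool:
    """Compare result sets, ignoring row order; execution errors count as a miss."""
    try:
        return (sorted(conn.execute(generated).fetchall())
                == sorted(conn.execute(golden).fetchall()))
    except sqlite3.Error:
        return False

def score(generated: str, golden: str, conn: sqlite3.Connection) -> float:
    if normalize(generated) == normalize(golden):
        return 1.0  # textually identical after normalization
    if execution_match(generated, golden, conn):
        return 0.9  # different text, same results
    return 0.0      # neither matches

# Dataset-level accuracy is then the mean score across the golden dataset, e.g.
# overall = sum(score(g, ref, conn) for g, ref in pairs) / len(pairs)
```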

The many comparisons in the AI dev loop

You are in your happy place - the dev loop, iterating every week on your app and improving accuracy and user experience - when you read on Twitter (yes, we still call it that) that Mistral is giving state-of-the-art results on the tasks you care about. It may be much cheaper to direct some of your simpler tasks to it than to GPT-3.5T. The next week you hear from your co-founder that Google Cloud is offering a sweet deal on Gemini Pro pricing if you switch to it from OpenAI’s models and co-author a blog about it.

The AI dev loop can get disrupted by advancements in models and datasets. Periodically, you have to compare outputs of another model with those from the model that you have shipped your app with.
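In practice, that comparison usually boils down to running both models over the same golden dataset and tabulating where they diverge. A rough sketch, where `generate_sql` is a hypothetical wrapper around whichever model APIs you call and `score` is an evaluation function like the one sketched earlier:

```python
from collections import Counter

def compare_models(golden_dataset, model_a: str, model_b: str, generate_sql, score):
    """Tally which model wins on each golden record; ties are tracked separately."""
    outcome = Counter()
    for record in golden_dataset:
        sql_a = generate_sql(model_a, record["question"])
        sql_b = generate_sql(model_b, record["question"])
        score_a = score(sql_a, record["golden_sql"])
        score_b = score(sql_b, record["golden_sql"])
        if score_a > score_b:
            outcome[model_a] += 1
        elif score_b > score_a:
            outcome[model_b] += 1
        else:
            outcome["tie"] += 1
    return outcome

# e.g. compare_models(golden, "gpt-3.5-turbo", "mixtral", generate_sql, score)
```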

The many LLM comparisons in AI dev loop

At ThoughtSpot, we started integrating LLMs into our patented search architecture and marquee Natural Language analytics product, Sage, when text-davinci-003 was state-of-the-art. As gpt-3.5-turbo, gpt-4, gpt-4-turbo and gemini were released one by one, we had to compare them with each other to decide which use cases should use which models. With the deprecation of text-davinci-003 in Jan 2024, all early-mover AI teams were forced to migrate to more recent models. I was recently exchanging notes with another startup that has successfully fine-tuned llama-v2 to get GPT-comparable accuracy for their app. They wanted to evaluate Mixtral. It is significantly more effort to host or fine-tune than llama-v2, given it’s a mixture of experts. So before diverting their limited engineering resources to this endeavour, they wanted to compare the two models and validate that the comparison was favorable.

How to compare two models?

Given a dataset, comparing outputs from two models is a highly cognitive exercise. It requires you to go deep on certain input-output pairs as well as glean patterns across inputs, instructions and outputs. If the dataset were a table with the above entities as rows, it would require you to scroll left to right as well as top to bottom, while continuously filtering and sorting. Based on my and my team's experience of building such comparison tools, there are three key activities that a developer does (sketched in code after the list):

  1. The horizontal scroll: Eyeball and diff the output, one record at a time
  2. The vertical scroll: Filter, label, fuzzy search or cluster records in one-click
  3. The full picture: Classify inputs and score outputs
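Here is a rough sketch of those three activities over a list of comparison records; the record fields (`label`, `score_a`, `score_b`) are illustrative stand-ins for whatever your evaluation pipeline produces:

```python
import difflib

records = [
    {"question": "revenue by region", "label": "has_aggregation",
     "sql_a": "SELECT region, SUM(amount) FROM sales GROUP BY region",
     "sql_b": "SELECT region, COUNT(*) FROM sales GROUP BY region",
     "score_a": 1.0, "score_b": 0.0},
    # ... one record per row of the golden dataset
]

# 1. Horizontal scroll: diff the two outputs for a single record.
record = records[0]
print("\n".join(difflib.unified_diff(
    [record["sql_a"]], [record["sql_b"]], lineterm="")))

# 2. Vertical scroll: filter records down to one slice of the dataset.
aggregation_records = [r for r in records if r["label"] == "has_aggregation"]

# 3. The full picture: aggregate scores per label to see where each model wins.
by_label = {}
for r in records:
    a, b, n = by_label.get(r["label"], (0.0, 0.0, 0))
    by_label[r["label"]] = (a + r["score_a"], b + r["score_b"], n + 1)
for label, (a, b, n) in by_label.items():
    print(f"{label}: model A {a / n:.2f} vs model B {b / n:.2f}")
```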

That's a lot! That’s why it may take a few days for you to meaningfully compare two models on a thousand-row dataset, instead of hours or minutes. As a developer, PM, QA or data scientist, you will either need to build or buy a comparison tool that has these superpowers. Empirical is building such a dev tool and I am excited to see how it evolves.

This is a guest post. Utkarsh leads engineering for ThoughtSpot Sage, a Natural Language Q&A engine for analytics. ThoughtSpot's mission is to bring data analytics to business users. Utkarsh has been building AI applications and engineering teams for the last decade, as a programmer, entrepreneur and startup tech lead. He enjoys working on high-risk high-reward unsolved engineering & ML problems, and more recently building with GenAI.