
Introducing Empirical: a playground for LLM comparisons


We started Empirical with the goal of enabling all developers to build high-quality LLM applications. As we shared earlier, our plan is to build tools for “fast vibe checking”, so that developers have a tight iteration loop between prompting and evaluating results.

We love the simplicity of playgrounds (like ChatGPT) that enable this loop today. You can try a scenario or two, but you're limited to anecdotal proof that your prompt works. Real-world usage is more diverse and requires trying hundreds of scenarios. Our approach combines the accessibility of a playground with the power of bulk prompting.

I’m excited to share our first external release today. This release focuses on our UI for productive vibe checking in the context of comparing two LLMs.

The text-to-SQL use-case

To demo these capabilities today, we have chosen a well-known problem: converting natural language queries into SQL (the “text-to-SQL” problem). In the examples below, we’ll compare five popular off-the-shelf LLMs – from OpenAI, Google, and Mistral – on a subset of the Spider dataset.

The Spider dataset consists of multiple SQLite databases and natural language questions for each of them. In our experiment, we pass the database schema (table structure, column names) and the question to the LLM, and ask it to return an SQL query.
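To make the setup concrete, here's a rough sketch of how such a request could be assembled. The prompt wording and the `query_llm` call are illustrative placeholders, not Empirical's actual implementation.

```python
import sqlite3

def build_prompt(db_path: str, question: str) -> str:
    """Assemble a text-to-SQL prompt from a SQLite schema and a question."""
    conn = sqlite3.connect(db_path)
    # Pull the CREATE TABLE statements, which capture table structure and column names.
    rows = conn.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'table' AND sql IS NOT NULL"
    ).fetchall()
    conn.close()
    schema = "\n\n".join(r[0] for r in rows)
    return (
        "Given the following SQLite schema:\n\n"
        f"{schema}\n\n"
        f"Question: {question}\n\n"
        "Return only the SQL query that answers the question."
    )

# Hypothetical usage: send the prompt to a model of your choice.
# prompt = build_prompt("concert_singer.sqlite", "How many singers do we have?")
# sql = query_llm(model="gpt-3.5-turbo", prompt=prompt)
```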

Spider also includes the expected SQL output for each question (as ground truth), which means we can compare an LLM's output against the expected query.

Enough context setting. Let's see how Empirical works today.

Choose LLMs to compare

We start our journey by choosing two LLMs to compare on the homepage, for example, GPT-3.5 Turbo vs GPT-4. Empirical loads up results from these LLMs and shows us how the prompt was configured.

Eyeball and diff the output

You can eyeball all rows of the dataset to see the question and the schema that were sent to the LLM. This is especially helpful when you want to verify the generated query against the schema, for correctness of column names and data types.

You can also diff one output against another to assist with eyeballing, surfacing differences like casing or trailing semicolons that might affect your post-processing logic.
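Under the hood this is plain text diffing; if you want to reproduce the idea outside the UI, Python's difflib gives a similar view. The two model outputs below are made up for illustration.

```python
import difflib

output_a = "SELECT COUNT(*) FROM singer;"
output_b = "select count(*) from singer"

# unified_diff surfaces differences like casing or a trailing semicolon.
diff = difflib.unified_diff(
    [output_a], [output_b], fromfile="gpt-3.5-turbo", tofile="gpt-4", lineterm=""
)
print("\n".join(diff))
```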

Scores to guide you

I hear you – eyeballing 200 samples doesn't sound productive. Empirical uses "scores" to check your outputs for you, so that you can filter for rows where scoring fails.

We’ve currently added scores that are appropriate for the text-to-SQL problem: syntax and semantic. The syntax score checks whether the output is valid SQL. The semantic score checks whether the output SQL is semantically equivalent to the expected (ground truth) query.
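As a rough illustration of what such checks can look like (not Empirical's internal scorers), a syntax check can be approximated by parsing the output, and a semantic check by executing both queries against the sample database and comparing result sets:

```python
import sqlite3
import sqlglot
from sqlglot.errors import ParseError

def syntax_score(sql: str) -> bool:
    """True if the output parses as valid SQLite."""
    try:
        sqlglot.parse_one(sql, read="sqlite")
        return True
    except ParseError:
        return False

def semantic_score(db_path: str, output_sql: str, expected_sql: str) -> bool:
    """True if both queries return the same rows on the sample database."""
    conn = sqlite3.connect(db_path)
    try:
        got = conn.execute(output_sql).fetchall()
        want = conn.execute(expected_sql).fetchall()
    except sqlite3.Error:
        return False
    finally:
        conn.close()
    # Ignore row order; comparing executed results is one common approximation
    # of semantic equivalence, not a complete check.
    return sorted(got, key=repr) == sorted(want, key=repr)
```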

These scores can be used to filter samples. For example, when we compare two Mistral models, mistral-tiny (Mistral 7B) and mistral-small (8x7B MoE), we find that mistral-tiny returns natural language messages before the query. This means it's not great at following the instructions we defined in the prompt ("return only the SQL"), and might require post-processing or additional prompt engineering.
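For reference, this is the kind of post-processing we mean: a small, admittedly brittle helper that strips prose or code fences to recover the query. It's purely illustrative, not part of Empirical.

```python
import re

def extract_sql(output: str) -> str:
    """Pull the SQL statement out of a response that may include prose or code fences."""
    # Prefer a fenced code block if the model returned one.
    fence = re.search(r"`{3}(?:sql)?\s*(.*?)`{3}", output, re.DOTALL | re.IGNORECASE)
    if fence:
        return fence.group(1).strip()
    # Otherwise, take everything from the first SQL keyword onwards.
    keyword = re.search(r"\b(SELECT|INSERT|UPDATE|DELETE|WITH)\b", output, re.IGNORECASE)
    return output[keyword.start():].strip() if keyword else output.strip()
```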

Annotate and filter samples

Not all rows are the same. Some scenarios are higher priority than others ("must work"). Or you might want to segment your dataset based on your business use-case. For instance, I'm more interested in seeing how LLMs perform on more complicated tasks, since they're already pretty good at simple queries.

To achieve this, Empirical lets you annotate the dataset and filter rows based on those annotations. This allows me to filter for sub-query tasks while comparing Gemini Pro vs GPT-3.5 Turbo.

What’s coming: make Empirical your own

Public datasets like Spider might not capture the nuances of your internal scenarios, so we're investing in making Empirical work with your dataset, your prompts, and your models. Our subsequent releases will make Empirical more interactive and configurable.

We're also going to expose scores and annotations as abstractions that can work for your use-case.


We'd love to prioritise your use-case and get you set up with Empirical. Book a meeting with me directly to take that forward.

Or, if you prefer to follow our progress asynchronously, you can join our Discord or subscribe to our email updates from our homepage.