Introduction
When GPT-4 came out in March 2023, I found myself in the company of countless founders, execs, product leaders and engineers across the world, staring at the ChatGPT console slightly agog and all facing the same question - how do I integrate AI into my product?
Cordless was built from the start to bring insights into phone conversations, so the use case for us was clear. The question was not “what to do” but rather “how to do it”. Playing around in a chat window or with the OpenAI API is one thing. Wrangling ChatGPT into doing something useful, reliably, with production data is quite another.
First we dipped our toes in the AI waters with a couple of simple features, like transcript summaries (even before ChatGPT!). For that feature, the answers from AI are free-form, don’t have to follow a specific format and, crucially, don’t have to be aggregated. With that under our belt, we decided to venture into more interesting problems.
AskAI
AskAI is a Cordless feature that allows you to ask any question about the content of your calls and get an answer without listening to hours of calls or reading pages and pages of transcripts.
To emphasize - you can ask any arbitrary question about your transcripts. For example:
- Did the agent authenticate the customer?
- Did the customer mention filing a complaint?
- Did the agent handle objections in a certain way?
- Was the customer’s issue resolved on this call?
This feature is a game changer for customer support managers, but straight away we ran into challenges. How do we know if the answers are accurate? How do we know which version of the prompt performs better? Questions intimately familiar to ML scientists and data engineers, none of whom are present on our team 🫠
This is the first in a series of blog posts on how Cordless set up an experiment workflow for working with Large Language Models. The purpose of these experiments is to assess the performance of LLMs on business-specific use cases. The use case I’m going to describe is answering questions about call transcripts.
Who’s this for?
For engineers with little to no machine learning and data processing background.
If you are looking for a structured way to figure out if generative AI can solve your particular business problem but don’t know where to start, this is for you.
👉 Start here for the technical stuff.
The Experiment
Let’s say we have 1000 call transcripts between ACME Co. and their customers. The Head of CS at ACME wants to answer the following questions about their calls:
- Did the agent mention at the start of the conversation that the calls are recorded?
- Was the customer issue resolved by the end of the call?
- If a customer mentioned that the cost of products is a problem, did the agent mention that they can offer a 10% discount for bulk orders?
We need to determine if an LLM can reliably answer these questions about ACME’s calls with good enough accuracy.
We’ll talk about what “good enough” is in later posts.
To achieve this, we needed a way to experiment with different prompts and strategies (one-shot prompt, chain of thought etc.) and determine which ones perform the best.
To determine which prompts and strategies perform best, we needed a way to assess and compare the accuracy of responses.
To compare the accuracy, we needed to have the “right” answers to compare against. Which brings us to the first task.
Data labelling
First, we need a human to answer these questions for a portion of the calls. In an ideal world, your customers will have run this process without AI for a while and can provide you with the “correct” answers.
In our case, ACME didn’t answer these questions at all, so we needed to find an alternative. The two main questions were:
- Who should do it?
- What format should you store it in?
Who should do it?
Data labelling is a very time-consuming process. It can be tempting to do it in-house, especially if the dataset is relatively small, but the hours add up, and it’s not a good use of anyone’s time on a small team.
Tip: Hire a freelancer to label your data. We found a person on Upwork who labelled about 100 questions in roughly a week for under £200.
What format should you store it in?
A CSV would do just fine. We gave the freelancer a spreadsheet to fill in, along with clear instructions on what we were looking for.
We included call IDs, so it’s easier to work with the data afterwards.
After the data was labelled, we exported the spreadsheet as a CSV file and committed it alongside the code in the experiment repo.
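To make that concrete, here’s a minimal sketch of the kind of labels file you might end up with and how to load it. The column names and values are illustrative, not our exact schema:

```python
import csv

# labels.csv - hypothetical columns; one row per (call, question) pair
# filled in by the freelancer:
#
#   call_id,question_id,answer
#   c_001,recording_disclosed,yes
#   c_001,issue_resolved,no
#   c_002,recording_disclosed,yes

def load_labels(path="labels.csv"):
    """Return {(call_id, question_id): answer} for easy lookup later."""
    labels = {}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            key = (row["call_id"], row["question_id"])
            labels[key] = row["answer"].strip().lower()  # "yes" / "no"
    return labels
```

Keeping the answers normalised to plain “yes”/“no” strings makes the comparison with AI answers later on much simpler.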
Storing data for the experiment
The second question was: how do we work with production data for this experiment?
Tip: Don’t let experiment code touch production systems directly - work with a copy of the data instead.
One thing we wanted to avoid is for the experiment repo to read directly from the production database. In fact, the code for the experiment should be as isolated as possible from the production environment.
Cordless’ entire tech stack is hosted on Google Cloud Platform. For the purpose of experimenting with AI we created a separate GCP project with its own service account.
We then downloaded all relevant call transcripts from the production DB, pre-processed them and uploaded them to a Google Cloud Storage bucket in the experiment project.
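As a rough sketch (the project ID, bucket and object names below are made up, not our real ones), copying pre-processed transcripts into the experiment project with the google-cloud-storage client looks something like this:

```python
from google.cloud import storage

# Runs under the experiment project's service account, e.g. via
# GOOGLE_APPLICATION_CREDENTIALS pointing at its key file.
client = storage.Client(project="acme-ai-experiments")      # hypothetical project ID
bucket = client.bucket("acme-ai-experiments-transcripts")    # hypothetical bucket name

def upload_transcript(call_id: str, text: str) -> None:
    """Store one pre-processed transcript as a text blob."""
    bucket.blob(f"transcripts/{call_id}.txt").upload_from_string(text)

def download_transcript(call_id: str) -> str:
    """Fetch a transcript back for an experiment run."""
    return bucket.blob(f"transcripts/{call_id}.txt").download_as_text()
```

Once the transcripts live in the experiment bucket, the experiment code never needs production credentials at all.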
Running the experiment
Local or remote?
The main question we had to answer was: can the experiments run locally on a dev laptop?
The answer depends on the nature of the experiment. In our case, we were just making API calls to OpenAI and comparing the results with a CSV file, which is not particularly computationally intensive.
However, the OpenAI API has quite strict rate limits. Depending on the size of the dataset, a single experiment can take anywhere from a couple of minutes to several hours to run.
While trying out different prompts and prompting strategies, we kept individual runs short - 10-20 transcripts - and then expanded to the whole set of labelled transcripts to check the accuracy. Individual runs took a few minutes each, so we kept everything local.
The “final run” with 10 questions and ~100 transcripts took around 4 hours though, so adjust accordingly.
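To give a feel for what a single run involves, here’s a simplified sketch: one plain zero-shot prompt per (question, transcript) pair, with a crude exponential backoff on rate-limit errors. The model name, prompt wording and retry settings are illustrative, not what we actually settled on:

```python
import time
from openai import OpenAI, RateLimitError

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT_TEMPLATE = (
    "You are analysing a customer support call transcript.\n"
    "Answer the question with a single word: yes or no.\n\n"
    "Question: {question}\n\nTranscript:\n{transcript}"
)

def ask(question: str, transcript: str, retries: int = 5) -> str:
    """Ask one yes/no question about one transcript, backing off on rate limits."""
    for attempt in range(retries):
        try:
            response = client.chat.completions.create(
                model="gpt-4",  # illustrative; use whichever model you're evaluating
                messages=[{
                    "role": "user",
                    "content": PROMPT_TEMPLATE.format(
                        question=question, transcript=transcript),
                }],
                temperature=0,
            )
            return response.choices[0].message.content.strip().lower()
        except RateLimitError:
            time.sleep(2 ** attempt)  # simple exponential backoff
    raise RuntimeError("Gave up after repeated rate-limit errors")
```

Trying a different prompting strategy (one-shot examples, chain of thought, etc.) mostly means swapping out the prompt template and re-running the same loop.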
Which programming language to use?
Python.
Jupyter notebooks
At first, we considered using Google’s Vertex AI to run experiments in shared Jupyter notebooks.
While it might be a good solution for workloads that require a lot of resources, it wasn’t the right choice for us.
One of the most important problems we were solving while building this experiment workflow was versioning. Put simply, we wanted to be able to look back at the prompts and the code months later and tell which version performed better.
Storing well-organised Python scripts in a private repo alongside clear documentation solves this problem very well.
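Purely as an illustration (this isn’t our exact structure), a repo laid out along these lines keeps prompts, code, labels and results versioned together:

```
llm-experiments/
├── data/
│   └── labels.csv            # human labels from the freelancer
├── prompts/
│   ├── v1_zero_shot.txt
│   └── v2_chain_of_thought.txt
├── run_experiment.py         # loads transcripts, calls OpenAI, saves answers
├── evaluate.py               # compares answers against labels.csv
└── results/
    └── issue_resolved_v2.csv
```

Every change to a prompt or a script is just a commit, so “which version performed better?” becomes a question you can answer with git history.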
OpenAI API keys
Tip: Create a separate OpenAI account (or several) for your experiments to avoid sharing rate limits with production.
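In practice that just means the experiment code picks up its own key - for example (the environment variable name here is hypothetical):

```python
import os
from openai import OpenAI

# A key from the experiments-only OpenAI account, kept separate from the
# production key so experiment runs never eat into production rate limits.
experiment_client = OpenAI(api_key=os.environ["OPENAI_API_KEY_EXPERIMENTS"])
```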
Measuring the results
We’ve got the labelled data, run the experiment and got the results back. How do we measure whether the results are any good?
The obvious answer is to calculate the percentage of results where AI agrees with the manual label.
So if AI answered the question correctly for 5 out of 10 transcripts, the accuracy would be 50%. However, accuracy alone doesn’t give us any insight into why the model performed this way.
This is where precision and recall come in handy.
Here’s an example of the results for one of the questions we asked:
Question: Was the customer issue resolved by the end of the call?
Accuracy: 0.62
Precision: 0.95
Recall: 0.39
This tells us that AI agreed with manual labels 62% of the time. But why did it disagree in the remaining 38%? Did the human mark more transcripts “yes” than the AI? Or the other way around?
Precision measures how many of the transcripts the model labelled as positive are actually positive - in other words, how trustworthy a “yes” from the model is.
Precision is calculated as: Precision = True Positives / (True Positives + False Positives)
In this case, precision is 0.95, which means that when AI said that yes, the customer issue was resolved, it was almost always right. It marked very few transcripts “yes” in error - the number of false positives is very low.
Recall (or sensitivity) measures how many of the actual positives the model captures by labelling them as positive. It’s the metric to prioritise when there is a high cost associated with false negatives.
Recall is calculated as: Recall = True Positives / (True Positives + False Negatives)
In this example, the recall is quite low - 0.39. This means that AI only captured about 40% of the “yes” answers.
So looking at these three metrics, we can say that AI leaned towards “no” for this question a lot more than a human. It thinks that for a lot of transcripts, the issue was not resolved by the end of the call, while the human who did the labelling has a more optimistic outlook.
Armed with this information, we can dig into individual answers and figure out why AI thinks that. We can rephrase the question, adjust the prompting strategy or simply decide that AI is right and it’s the human labeller who’s mistaken. We’ll talk more about this in later blog posts.
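Here’s a minimal sketch of computing all three metrics with scikit-learn, assuming the AI answers and human labels have been normalised to “yes”/“no” and lined up by call ID. The toy data below is made up, but it reproduces the same pattern as the example above: high precision, lower recall.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical example: human labels vs AI answers for the same transcripts,
# in the same order and already normalised to "yes"/"no".
human = ["yes", "yes", "no", "yes", "no", "yes", "no", "yes", "no", "yes"]
ai    = ["yes", "no",  "no", "no",  "no", "yes", "no", "no",  "no", "yes"]

print("Accuracy: ", accuracy_score(human, ai))                       # 0.7
print("Precision:", precision_score(human, ai, pos_label="yes"))     # 1.0 - no false "yes"
print("Recall:   ", recall_score(human, ai, pos_label="yes"))        # 0.5 - misses half the real "yes"
```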
Conclusion
To experiment with LLMs at Cordless in an efficient and reproducible manner we:
- Downloaded call transcripts into a separate environment to avoid accessing production data from non-production code.
- Ran the experiments locally as well-organised Python scripts that we committed to a private repository.
- Hired a contractor from Upwork to manually label a large enough sample of transcripts, and saved the labels to a CSV file alongside the code.
- Compared AI answers with human answers using accuracy, precision and recall metrics.
In the next blog post in the series, we’ll cover how we actually talk to the OpenAI API. Specifically, we’ll cover LangChain and whether you should use it.