Why are we here?
In the previous blog post we set out to integrate AI into Cordless - a telephony platform for customer support with built-in conversational intelligence.
The use case we picked was the ability to ask free-form “yes” or “no” questions about call transcripts, e.g.:
- Did the agent mention at the start of the conversation that the calls are recorded?
- Was the customer issue resolved by the end of the call?
- If a customer mentioned that the cost of products is a problem, did the agent mention that they can offer a 10% discount for bulk orders?
In this noble endeavour we immediately ran into a problem - how do we know if the answers AI gives us are any good? To solve this problem we set up The Experiment.
The experiment
The purpose of the experiment is to systematically assess the accuracy of answers we get from AI. The experiment consists of a few steps:
- First, we labelled some data with the help of a human labeller.
- Then we wrote some code that talks to the OpenAI API and gets answers for the transcripts in the test dataset (the list of calls for which we have manual answers).
- After this we measured the accuracy, precision and recall of AI-supplied answers compared to the human answers.
This blog post focuses on step 2 of this process - writing some code. Before assessing the answers from ChatGPT, we need to make sure we ask the right questions and get the answers back in the right format.
Btw, there’s a really good post from Honeycomb.io talking about their challenges integrating AI into their product.
Questions too big
In its most basic form, this is what we want to ask ChatGPT:
This is a transcript of a phone call: { transcript } Answer this question about the transcript: { question } Answer “yes” or “no” only.
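To make this concrete, here's roughly what that single-prompt version looks like with the OpenAI Python SDK. The model name is just an example, and ask_about_transcript is a helper name we made up for illustration:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ask_about_transcript(transcript: str, question: str) -> str:
    prompt = (
        f"This is a transcript of a phone call: {transcript}\n"
        f"Answer this question about the transcript: {question}\n"
        'Answer "yes" or "no" only.'
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # example model name
        messages=[{"role": "user", "content": prompt}],
    )
    # Normalise the answer so "Yes.", "yes" etc. compare equal downstream.
    return response.choices[0].message.content.strip().lower()
```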
So far so good, but some phone calls are quite long, which makes their transcripts quite large. When this is the case, the prompt might exceed the model’s context window.
For more details on what tokens are, see OpenAI's FAQ article.
So what should we do when the transcript is too large? We split it into chunks.
Transcript chunks
When splitting the input into chunks we need to consider three variables:
- token_limit - the context window size of the model we’re using.
- prompt_size - the total length of the prompt, including “This is a transcript of a phone call…” and the length of the question.
- output_size - the number of tokens we want to leave for the output. If we want the model to answer “yes” or “no” only, the number of tokens in the output is 1.
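For example, counting tokens and splitting a transcript into token-sized chunks can be sketched with the tiktoken library. The encoding choice is an example, and a real implementation would probably split on utterance boundaries rather than raw token offsets:

```python
import tiktoken

# Pick the encoding that matches the model we're calling (example model name).
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")


def count_tokens(text: str) -> int:
    return len(encoding.encode(text))


def split_into_chunks(transcript: str, max_tokens: int) -> list[str]:
    # Naive split on raw token offsets; good enough for a sketch.
    tokens = encoding.encode(transcript)
    return [
        encoding.decode(tokens[start:start + max_tokens])
        for start in range(0, len(tokens), max_tokens)
    ]
```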
Naive “map-reduce” prompting strategy
The workflow of asking a question about a transcript will then look like this:
- Split the transcript into chunks of size token_limit - output_size - prompt_size.
- For each chunk: send the API request to the model with the prompt that includes the transcript chunk and the question.
- Collect the answers and come up with the final answer for the whole transcript.
Let’s say the question was “Did the agent mention at the start of the conversation that the calls are recorded?” If the transcript was split into 3 chunks, we might get responses like this:
[
"yes", // First part of the conversation is where the agent mentions the recording
"no", // No mention of the recording in the second part, as expected
"No." // Also no, but in a different format
]
All that’s left to do is to define a strategy of when we say the final answer was “yes” or “no”.
The nature of the question “Did the agent mention at the start of the conversation that the calls are recorded?” implies “at least once”. So if we get even one “yes” answer for one of the chunks we can safely ignore the remaining “no”s.
We can even stop sending requests to the LLM’s API after we get a “yes” response.
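For illustration, here's roughly what that early-stopping loop looks like, reusing the ask_about_transcript, count_tokens and split_into_chunks helpers sketched above. The token limit is illustrative, and the prompt_size calculation is only an approximation of the template plus the question:

```python
def naive_map_reduce(transcript: str, question: str) -> str:
    # Rough estimate of the fixed prompt template plus the question.
    prompt_size = count_tokens(
        "This is a transcript of a phone call: "
        "Answer this question about the transcript: "
        'Answer "yes" or "no" only.'
    ) + count_tokens(question)

    token_limit = 4096  # illustrative context window
    output_size = 1     # we only want "yes" or "no" back
    chunk_size = token_limit - output_size - prompt_size

    for chunk in split_into_chunks(transcript, chunk_size):
        answer = ask_about_transcript(chunk, question)
        if answer.startswith("yes"):
            return "yes"  # "at least once" semantics: stop at the first yes
    return "no"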
The naive part
Of course, this only works for questions where you can “stop after the first yes”.
What if the question is “Was the agent friendly throughout the conversation?” In this case the logic is the opposite - if there’s even one “no” for one of the chunks, the answer to the whole question should be “no”.
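If we were aggregating the chunk answers ourselves, the two cases boil down to any versus all over the per-chunk answers (chunk_answers here is a placeholder list of normalised “yes”/“no” strings):

```python
# "Did the agent mention ... at least once?" -> yes if ANY chunk says yes
final = "yes" if any(a == "yes" for a in chunk_answers) else "no"

# "Was the agent friendly throughout?"       -> yes only if ALL chunks say yes
final = "yes" if all(a == "yes" for a in chunk_answers) else "no"
```

The catch is that we'd have to know which rule applies to each free-form question.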
This problem can be solved by another prompting strategy, but first a quick aside about the LLM outputs.
Wrangling the output instructions
It’s already pretty well known that asking an LLM to reason about its answer leads to better results.
This gave rise to strategies like chain-of-thought prompting and tree-of-thought prompting.
What we didn’t know before we started experimenting with different prompting strategies, though, is this:
We had a few prompts that ended like this:
… Explain the reasoning for your answer, then answer “yes” or “no” only.
This led to all sorts of messed-up outputs. Sometimes the model gave an explanation that clearly indicated the “yes” answer and then ended it with “no”, contradicting itself. Sometimes it just ignored the “answer only with ‘yes’ or ‘no’” instruction altogether.
Through trial and error we concluded that the best way to get a decent answer is to explain the reasoning and return a formatted answer in two separate steps.
Map-reduce prompting strategy
The two problems above - collating responses to multiple chunks into one answer and the output instructions - led us to come up with a more reasonable map-reduce prompting strategy.
Here’s how it works:
- Split the transcript into chunks. For each chunk, send the following prompt to the LLM:
This is a part of a transcript of a phone call: { transcript_chunk } Answer this question about the transcript: { question } Explain your reasoning.
- Collect all the explanations, then send them to the LLM with the following prompt:
I have sent you multiple parts of a transcript of a phone conversation. I have asked you this question about each part: { question } Below are your answers: { answers } Compile your previous responses into an answer to this question: { question } Answer only “yes” or “no”.
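Put together, here's a rough sketch of this strategy in Python, reusing the client and split_into_chunks helpers from the earlier snippets (the model name is again just an example):

```python
def map_reduce_answer(transcript: str, question: str, chunk_size: int) -> str:
    # Map step: ask for a reasoned answer about each chunk.
    explanations = []
    for chunk in split_into_chunks(transcript, chunk_size):
        map_prompt = (
            f"This is a part of a transcript of a phone call: {chunk}\n"
            f"Answer this question about the transcript: {question}\n"
            "Explain your reasoning."
        )
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",  # example model name
            messages=[{"role": "user", "content": map_prompt}],
        )
        explanations.append(response.choices[0].message.content)

    # Reduce step: compile the per-chunk explanations into a single yes/no.
    reduce_prompt = (
        "I have sent you multiple parts of a transcript of a phone conversation. "
        f"I have asked you this question about each part: {question}\n"
        "Below are your answers:\n"
        + "\n\n".join(explanations)
        + f"\n\nCompile your previous responses into an answer to this question: {question}\n"
        'Answer only "yes" or "no".'
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": reduce_prompt}],
    )
    return response.choices[0].message.content.strip().lower()
```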
This format performs much better* than the “naive” map-reduce approach.
* It still ignores the output formatting instructions sometimes. We solved this by using functions to format the output, which we’ll cover in a separate blog post.
Conclusion
To get accurate and well-formatted answers from ChatGPT, you should:
- Split the long inputs into chunks that fit the context window.
- Ask ChatGPT to explain its reasoning when providing an answer.
- Ask ChatGPT to summarise its previous long answers into a “yes” or “no” answer.
There’s a lot more to prompting and structuring API calls, so stay tuned for the next post!
P.S. Book a demo if you need telephony and want us to keep writing about cool AI stuff.