How Large Language Models Are Changing The Game For Voice-Enabled AI

Most Americans have had a relationship with some form of voice assistant (VA) for several years and have a good sense of the limitations of such tools. So much so that “Hey Siri”, “Alexa…” and “Hey Google” have become part of the zeitgeist. We’ve embraced the convenience of hands-free interaction, on the go or in the kitchen, and come to accept that sometimes they just don’t get us.

Enter ChatGPT, powered by one of several leading Large Language Models (LLMs), and with it a new era of voice. One where the abilities of voice assistants, for both comprehension and recommendation, are going through a step-change. What does this mean for end users?

More assistants, more applications, better user experiences and more value.

The path to this destination will rely on the pace with which developers and enablers in the voice and AI landscape embrace LLMs in their product development cycles.

We are already seeing LLMs as a game-changer for the way voice services are developed as an interface to intelligence – true conversational AI – both improving performance and increasing speed of development, while reducing cost.

What Are LLMs?

An LLM is a machine learning model that can process natural language, extract intent from free text and then respond in a conversational manner.

Some of the more prominent LLMs have billions of parameters from which to learn and evolve in their ability to predict the next token or word in a sentence, based on the context around it.

The model repeats this task over and over until accuracy is optimized, before moving on to the next word.
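To make the idea concrete, here is a minimal sketch of that predict-append-repeat loop, using a small open model (GPT-2) via the Hugging Face transformers library. The model choice and greedy decoding are illustrative assumptions, not how any particular commercial LLM is run.

```python
# A minimal sketch of next-token prediction: score every possible next token,
# pick the most likely one, append it, and repeat from the longer context.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "Hey assistant, play some classic"
input_ids = tokenizer(text, return_tensors="pt").input_ids

for _ in range(5):
    with torch.no_grad():
        logits = model(input_ids).logits          # scores for every vocabulary token
    next_id = logits[0, -1].argmax()              # greedy pick of the most likely token
    input_ids = torch.cat([input_ids, next_id.reshape(1, 1)], dim=1)

print(tokenizer.decode(input_ids[0]))
```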

Even though the technology has been around for years and continues to improve, the trigger for today’s hype around LLMs has been the introduction of easier-to-access interfaces embedded in consumer-facing tools: OpenAI’s ChatGPT, Microsoft’s update to its Bing search engine (based on ChatGPT) and Google’s Bard.

Additionally, the models themselves have become larger and more capable.

What Aren’t LLMs?

LLMs are terrific at summarizing text in natural language and generating new data from an existing data set, but they are not “killer apps” on their own, and have limited ability to perform tasks. LLMs are also prone to “hallucination”, where the model generates incorrect or contradictory information based on the multitude of sources it pulls from.

This will change quickly: the “plugin” or “extension” ecosystem is still relatively immature but growing rapidly, giving LLMs the ability to access up-to-date information, run computations and leverage third-party services. As these connections with live APIs grow, so will the range of use cases that conversational AI assistants based on LLMs will be able to execute.

Today, for example, ChatGPT is integrated with Expedia so that users can plan and book a travel itinerary, or chat with Instacart to order groceries. Similarly, Google Bard is rolling out “extensions”, which enable users to shop, create custom images or listen to music from category-leading brands.
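Under the hood, these integrations follow a common tool-calling pattern: the application describes a catalog of callable services, the model picks one and supplies arguments, and the application executes the call. The sketch below illustrates that pattern with invented tool names and a stubbed model call so the example runs end to end; it is not any particular vendor’s plugin API.

```python
# A hedged sketch of the "plugin"/tool-calling pattern. `call_llm` stands in for
# any chat-completion API that can return structured JSON; the tools are invented.
import json

def search_flights(destination: str, date: str) -> str:
    return f"3 flights found to {destination} on {date}"      # placeholder for a live travel API

def order_groceries(items: list) -> str:
    return f"Order placed for: {', '.join(items)}"            # placeholder for a grocery API

TOOLS = {"search_flights": search_flights, "order_groceries": order_groceries}

def call_llm(user_request: str) -> str:
    # In a real system this is an LLM call with the tool catalog in the prompt;
    # here we return a canned decision so the sketch is runnable.
    return json.dumps({"tool": "search_flights",
                       "arguments": {"destination": "Lisbon", "date": "2024-06-01"}})

decision = json.loads(call_llm("Find me a flight to Lisbon in early June"))
result = TOOLS[decision["tool"]](**decision["arguments"])
print(result)
```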

So, How Do LLMs Help Developers Of Conversational AI Services?

Voice services like Siri and Alexa rely on Automatic Speech Recognition (ASR) to translate voice commands from end users into text strings for processing. Nothing changes here; these tools are usually above 90% accurate.

But what does change, with the addition of LLMs, is the ability to infer the meaning of phrases based on context and more accurately extract the intent of the user. Greater accuracy means better responses.

Think about the last time you battled with a call center IVR trying to navigate a cryptic menu structure to reach the help you need.

Now imagine placing a takeout order for a family that likes to customize food and drink options and change the order partway through… no problem for LLMs; just see Wendy’s recent shift to embrace the technology in this context.

A new framework is emerging that uses LLMs, vector search, vector sensors, and language models trained to decide which APIs to call in response to a command delivered by ASR.

The model autonomously decides which APIs to call to obtain information useful for completing an instruction or request; the response is then delivered to the user in natural language using an LLM and converted to audio using a text-to-speech engine.

The LLMs can even be employed to automatically generate code for accessing data from APIs.
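Put together, the flow looks roughly like the sketch below, where `choose_api`, `phrase_reply` and `speak` are hypothetical stand-ins for the LLM routing step, the LLM response-generation step and a text-to-speech engine.

```python
# A hedged sketch of the framework described above: ASR text in, API choice,
# natural-language reply out, spoken via TTS. All three model/engine calls are stubbed.
def choose_api(command: str) -> tuple:
    # LLM routing step (stubbed): decide which backend service answers this request.
    return ("weather_api", {"city": "Seattle"})

def weather_api(city: str) -> dict:
    return {"city": city, "forecast": "light rain", "high_f": 58}   # placeholder live API

def phrase_reply(command: str, api_result: dict) -> str:
    # LLM generation step (stubbed): summarize structured data conversationally.
    return (f"Expect {api_result['forecast']} in {api_result['city']}, "
            f"with a high of {api_result['high_f']} degrees.")

def speak(text: str) -> None:
    print("[TTS]", text)                                            # placeholder text-to-speech engine

command = "what's the weather like in Seattle today"                # output of ASR
api_name, args = choose_api(command)
result = {"weather_api": weather_api}[api_name](**args)
speak(phrase_reply(command, result))
```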

For use cases that require finding the best result within a large data set, vector search is very effective. Vector search improves quality by assigning numerical representations to text (“word embeddings”, such as those from OpenAI), making it easier to capture the relationships between concepts and producing more accurate results than searches that rely on keyword matching.

This is more accurate than prior approaches built on keyword matching, where intents and entities are identified through specific programming and training, along with synonyms or a lookup table (for example, recognizing “LA” as the city name Los Angeles). Instead, the LLM draws on a swathe of similar requests and billions of data points to understand how to respond, and even retains memory from prior interactions to refine the results.
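As a rough illustration, the snippet below performs a tiny vector search with open-source sentence embeddings; the model name and station catalog are assumptions for the example, and OpenAI’s embedding models would slot in the same way. Texts become vectors, and the query is matched by cosine similarity rather than keyword overlap.

```python
# A small vector-search sketch: embed a catalog and a query, then rank by cosine similarity.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

stations = [
    "Classic and progressive rock station serving the Seattle and Snoqualmie area",
    "Smooth jazz and blues station in Portland",
    "Top 40 pop hits station in Los Angeles",
]
station_vecs = model.encode(stations, normalize_embeddings=True)

query = "music that sounds like Pink Floyd, Led Zeppelin and Tool near Snoqualmie"
query_vec = model.encode([query], normalize_embeddings=True)[0]

scores = station_vecs @ query_vec          # cosine similarity (vectors are normalized)
print(stations[int(scores.argmax())])      # best match even though the query never says "rock"
```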

Vector search can deliver pinpoint, nuanced accuracy in results. If a user were to ask, “I want to listen to music that sounds like Pink Floyd, Led Zeppelin and Tool and comes from a radio station in the Snoqualmie area”, we can expect the response to locate a local rock station in the area and return the result instantly.

It is the LLM’s ability to know that certain artists often appear near certain genres, and certain locations appear near other locations, that enables it to return the result. The vector database doesn’t need to be structured along these dimensions for the utterance to be converted into a format the API can use, and this is what sets LLMs apart.

Vector sensors are utilized to categorize requests, such as determining whether a user is inquiring about weather or wants to listen to media. They can also be used to determine which API to call for a particular request. This eliminates much of the manual training a voice assistant previously required during development.

The “sensor” here looks for the sentiment underlying a request; combined with word embedding vectors chained together, it can determine the relative weighting of different parts of a sentence and respond correctly to requests containing multiple actions, for example, “play classic rock and increase the volume to level 5.”
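A simple way to picture a vector sensor: give each request category a prototype embedding and route an incoming utterance to whichever category it sits closest to in vector space. The sketch below does exactly that, with invented categories and prototype phrases, and splits a compound request so each clause is routed separately.

```python
# A hedged sketch of a "vector sensor": categorize requests by embedding similarity
# to category prototypes, removing much of the manual intent training.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

prototypes = {
    "weather": "what is the forecast, will it rain, how hot is it today",
    "media":   "play a song, put on some music, listen to a radio station",
    "device":  "turn up the volume, dim the lights, set the thermostat",
}
names = list(prototypes)
proto_vecs = model.encode(list(prototypes.values()), normalize_embeddings=True)

def categorize(utterance: str) -> str:
    vec = model.encode([utterance], normalize_embeddings=True)[0]
    scores = proto_vecs @ vec                  # cosine similarity to each category
    return names[int(scores.argmax())]

# A compound request can be split so each clause is routed to the right handler.
for clause in "play classic rock and increase the volume to level 5".split(" and "):
    print(clause, "->", categorize(clause))
```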

What Does This Mean For Developers?

LLMs have the ability to analyze complex problems, generate or review intricate coding, troubleshoot errors and debug within seconds – the ultimate programmers’ assistant (not to mention documentation quality).

As recently as last year, we’ve seen that writing code, generating data and optimizing the intelligence of a new voice assistant could take more than six months, and performance optimization was an outcome of multiple test-and-refine cycles.

By employing the approaches above, we are seeing the same outcomes achieved in mere weeks and delivering 80 to 90% task success to enable earlier launches and faster iteration.

So What Comes Next?

ChatGPT is the fastest growing consumer application in history, with over 100 million users in its first two months of being publicly available.

A tectonic shift in consumer awareness is underway already and as rapidly as LLMs are adopted, so will consumer expectations grow…”My mebot just planned my wedding in 74 minutes – awesome – but it scheduled the rehearsal dinner for 3 PM. How frustrating!!”

For businesses, LLMs are enablers of speed to innovate, meaning more voice-enabled brands will come to market sooner with voice services that can search and understand faster and respond more accurately to nuanced requests.

With the embrace of LLMs, we can expect voice-enabled AIs to grow in number and relevance such that voice/natural language can become a primary interface, leaving the disappointment of the early years of general assistants behind. The “everything assistant” is closer than we think, and the race to create and distribute voice-enabled AI is on.

Author Bio: John Goscha, Founder & CEO of Native Voice.
