The Path to a More Human Voice Interface


The world is changing fast. Everything is getting connected. The connected home – the smart fridge, the smart thermostat, the smart TV – will soon be a mainstay of a connected world, an era quickly being ushered in by 5G, AI on the edge, and a slew of other enabling technologies. And I believe the central controller of that everything will be voice, the ultimate human-machine interface.

But for that to happen, voice interfaces need to catch up. Fast.

The voice tsunami is upon us, with demand for voice-activated systems, voice-enabled devices, and voice-driven virtual assistants set to increase exponentially over the next decade. And not just for the smart home. Industry wants in, too, including healthcare, financial services, and automotive. Why? Because voice is human. It’s the most natural way we have of interacting with each other and – as we develop and advance the capability – with our machines.

Yet the challenges are significant. For people to interact naturally with assistants like Alexa, Siri, and Google Assistant – and with their voice-controlled smartphones, smart homes, and smart cars – they will need intelligent voice interfaces that can do all of the following:

1. Work effectively in high-noise and far-field environments.
2. Use biometrics to verify identity and enhance security.
3. Track and separate a voice of interest from other voices.
4. Intelligently adjust to – and predict – changing audio environments.
5. Operate without a priori information.
6. Automatically activate a user’s profile settings by recognizing their voice.
7. Enable on-edge computations for always-on applications.
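Capabilities 2, 3, and 6 all rest on the same basic idea: reduce a voice to a compact "voiceprint" embedding and compare embeddings for closeness. The sketch below is purely illustrative – the function names, the 4-dimensional toy vectors, and the 0.8 threshold are all invented for this example, and real systems use learned embeddings hundreds of dimensions wide with thresholds tuned on large datasets.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def verify_speaker(enrolled, incoming, threshold=0.8):
    """Accept the speaker if the incoming voiceprint is close enough to
    the enrolled one. The threshold trades false accepts against false
    rejects and would be tuned on real enrollment data."""
    return cosine_similarity(enrolled, incoming) >= threshold

# Toy 4-dimensional "embeddings" standing in for real voiceprint vectors.
alice = [0.9, 0.1, 0.3, 0.2]
alice_again = [0.85, 0.15, 0.28, 0.25]  # same speaker, slightly different utterance
mallory = [0.1, 0.9, 0.2, 0.7]          # a different speaker

print(verify_speaker(alice, alice_again))  # same speaker -> True
print(verify_speaker(alice, mallory))      # different speaker -> False
```

The same comparison that verifies identity can also key a device to the right profile: match the incoming voiceprint against each enrolled user and load the settings of the closest one above threshold.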

The Role of AI and Unsupervised Learning

While each of these capabilities could be the subject of its own white paper, I want to focus instead on their broader implications. First, these seven “musts” of intelligent voice interfaces show where the innovation in voice is happening today and where the competitive skirmishes are most fierce.

For example, progress in high-noise and far-field environments has driven 20-30x performance improvements over the norm of just a year or two ago – but we’ve already learned that such boosts are just pieces in the larger puzzle.

Incremental improvement across each of these categories, from wake-word recognition in a noisy room to biometric “yes, that’s you” familiarity in any scenario, is essential to improving voice interfaces today. But the breakthrough that transforms the human voice into the master controller of tomorrow’s big “C” – Connected world – is something different altogether. That will require the always-on, and always improving, interplay among all these capabilities – in short, a solution that can learn, and do so on its own.

Unsupervised learning is the catalyst of a more intuitive, higher functioning voice interface. And that learning requires stimuli. Less noise and fewer variables may work well for some rote applications, but reacting to the human voice in its endless subtlety and variation, which is where the voice interface of the Connected world needs to be, is anything but a rote exercise.
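What does learning without labels look like in practice? One family of techniques is streaming clustering: the system groups incoming acoustic feature vectors by similarity as they arrive, with no one ever telling it what the groups mean. The class below is a toy sketch of that idea – not Yobe's method, and the radius and learning-rate values are arbitrary choices for the example.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class OnlineClusterer:
    """Streaming clustering: each incoming feature vector either joins
    its nearest cluster (nudging that centroid toward it) or founds a
    new cluster if nothing is close. No labels are ever supplied."""

    def __init__(self, radius=1.0, learning_rate=0.1):
        self.radius = radius              # max distance to join a cluster
        self.learning_rate = learning_rate
        self.centroids = []

    def observe(self, vector):
        """Fold one observation in; return the index of its cluster."""
        if self.centroids:
            idx = min(range(len(self.centroids)),
                      key=lambda i: euclidean(self.centroids[i], vector))
            if euclidean(self.centroids[idx], vector) <= self.radius:
                c = self.centroids[idx]
                self.centroids[idx] = [ci + self.learning_rate * (vi - ci)
                                       for ci, vi in zip(c, vector)]
                return idx
        self.centroids.append(list(vector))
        return len(self.centroids) - 1

clusterer = OnlineClusterer(radius=1.0)
# Two loose groups of 2-D points standing in for distinct acoustic conditions.
for point in [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (0.1, 0.2)]:
    clusterer.observe(point)

print(len(clusterer.centroids))  # -> 2
```

The point of the sketch is the shape of the loop, not the math: the system refines its own picture of the world with every observation, which is exactly why noisier, more varied input makes it better rather than worse.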

The Noisier the Better

As such, it’s time to throw voice interfaces into the deep end – the noisier, messier, and more unpredictable the better – because that’s the real sonic landscape humans navigate daily. These variations also provide context for advanced algorithm-driven systems to learn and grow, helping them better attend to a voice of interest by understanding what is happening around it.
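One minimal way a system can "understand what is happening around" a voice is to keep a running estimate of the ambient noise floor and judge each new frame against it. The sketch below is a deliberately simple, assumed example – an exponential moving average over frame energies, with invented `alpha` and `ratio` parameters – standing in for the far more sophisticated adaptation real voice interfaces perform.

```python
class NoiseAdaptiveDetector:
    """Toy energy-based detector: track the ambient noise floor with an
    exponential moving average and flag frames that rise well above it.
    Purely illustrative of adapting to a changing audio environment."""

    def __init__(self, alpha=0.05, ratio=3.0):
        self.alpha = alpha        # how quickly the noise estimate adapts
        self.ratio = ratio        # energy multiple that counts as "voice"
        self.noise_floor = None

    def is_voice(self, frame_energy):
        if self.noise_floor is None:
            self.noise_floor = frame_energy
            return False
        voice = frame_energy > self.ratio * self.noise_floor
        if not voice:
            # Only fold quiet frames into the noise estimate, so speech
            # does not inflate the floor it is being compared against.
            self.noise_floor += self.alpha * (frame_energy - self.noise_floor)
        return voice

detector = NoiseAdaptiveDetector()
energies = [1.0, 1.1, 0.9, 1.0, 6.0, 5.5, 1.0]  # hum, hum, speech, hum
flags = [detector.is_voice(e) for e in energies]
print(flags)  # -> [False, False, False, False, True, True, False]
```

Because the floor keeps adapting, the same detector that works in a quiet study keeps working when the dishwasher starts – a small instance of the environment itself teaching the system.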

Our industry is on the cusp of this breakthrough now. And when it happens, voice interfaces will undergo a step change from passive decoders of sound to active listeners in tune with their operator’s intent – a leap approaching cognizance.

Look again at the list above. Features 1-5? Those are things we do as humans every day without thinking. That’s how our brains work; it’s how we hear. When music is playing and we need to talk over it, we raise our voices a bit. When a friend calls on the phone, we know it’s them by the sound of their voice. It’s automatic – humans are wired for sound understanding.

Voice interfaces have tough competition – human expectations. We want our systems to perform as intuitively and successfully as we do. The quality of our connected experiences will hinge on it.

Yet AI is proving up to the challenge, and advances in machine, contextual, and unsupervised learning are quickly bridging any gaps. Look across the must-have capabilities for intelligent voice interfaces, and you’ll see each improving because of AI. Now realize that AI is also looking across those capabilities, orchestrating them, and fine-tuning them through experience to make them more human.

Speakers – both human and electronic – should like the sound of that.

Ken Sutton is co-founder, president, and CEO of Yobe, Inc., a Boston, MA-based software startup at the forefront of using AI for more intuitive, adaptive, and personalized voice technologies. Follow @kenmsutton
