I was recently asked by my friend Bret Kinsella from voicebot.ai for my predictions on AI and voice. You can find my two cents in the post 2017 Predictions From Voice-first Industry Leaders.
In that contribution, I mentioned the concept of speech metadata, which I'd like to detail here.
As a Voice App developer, when you deal with voice input coming from an Amazon Echo or a Google Home, the best you can get today is a transcription of the text spoken by the user.
While it's great to finally have access to efficient speech-to-text engines, it's a shame that so much valuable information is lost in the process!
The reality of a conversational input is much more than a sequence of words. It's also about:
- the people — is it John or Emma speaking?
- the emotions — is Emma happy? angry? excited? tired? laughing?
- the environment — is she walking on a beach or stuck in a traffic jam?
- local sounds — a door slam? a fire alarm? some birds tweeting?
Now imagine the possibilities, the intelligence of the conversations, if we had access to all this information: huge!
And we could go even further.
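To make the idea concrete, here is a minimal sketch in Python of what an enriched voice request could look like. Every field name here is invented for illustration — no current Echo or Google Home API exposes this metadata; today you would only receive the `transcript` part.

```python
# Hypothetical speech-metadata payload: what a voice platform *could*
# return alongside the transcript. All field names are invented.
speech_input = {
    "transcript": "turn on the lights",
    "speaker": {"id": "emma", "confidence": 0.92},      # who is speaking?
    "emotion": {"label": "tired", "confidence": 0.71},  # how does she sound?
    "environment": "traffic_jam",                       # acoustic scene
    "sounds": ["car_horn", "engine_noise"],             # detected sound events
}

def describe(request):
    """Summarize the request using the metadata, not just the words."""
    speaker = request["speaker"]["id"]
    emotion = request["emotion"]["label"]
    scene = request["environment"]
    return f'{speaker} ({emotion}, in {scene}): "{request["transcript"]}"'

print(describe(speech_input))
# → emma (tired, in traffic_jam): "turn on the lights"
```

With this kind of payload, a Voice App could adapt its answer to who is asking, how they feel, and where they are — instead of reacting to the bare words alone.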
It’s a known fact in communication that while interacting with someone, non-verbal communication is as important as verbal communication.
So why are we sticking to the verbal side of the conversation while interacting with Voice Apps?
Speech metadata is all about this non-verbal information, which is in my opinion the submerged part of the iceberg and thus the most interesting to explore!
A good example of speech metadata is the combination of vision and voice processing in the movie Her.
With the addition of a camera, new conversations can happen, such as discussing the beauty of a sunset, the origin of an artwork or the composition of a chocolate bar!
Asteria is one of the many startups starting to offer this kind of rich interaction.
I think this is the way to go, and that a tremendous number of innovative apps will be unleashed by the availability of conversational metadata.
In particular, I hope Amazon, Google & Microsoft will release some of this data in 2017, so that we developers can work on fully context-aware conversational agents.