In July 2005, voice actor Susan Bennett spent four hours a day inside a recording booth, reading page after page of strange phrases for an unnamed project. Scripts like this were common for GPS devices and company telephone systems, so she didn’t give it much thought.
“There were some real sentences, but a lot of the phrases were just created solely for sound and so a lot of times didn't make very much sense,” said Susan. “Like, ‘Say: Mug wamp blue egg today.’ Or, ‘Say: Cow horn bore hide today.’ There wasn't much room for any kind of creativity or any kind of changing of the pacing or the tone or anything. It had to be very consistent.”
Susan did the work, then moved on to other projects. It wasn’t until six years later that she discovered the project had made her voice one of the most recognizable in the world. She had become Siri.
Apple launched Siri in 2011 with the iPhone 4S. As the first interactive automated personal assistant, Siri revolutionized how people interacted with their smartphones and changed our sense of what technology could do. Siri instantly became an iconic part of the smartphone experience: part navigator, part encyclopedia, part task-master… and even part jokester!
But how do you take a series of unrelated phrases from a voice actor like Susan and turn them into the intelligent technology employed by Apple? For the answer, we turn to Dr. Andrew Breen, Director of Speech-to-Text Technology for Nuance, the company rumored to have worked on Siri in the beginning.
“In principle, it's very simple,” says Dr. Breen. “Just record a phrase, then extract the individual sounds. We'll do that laboriously for several thousands of phrases. We then go and search in the database and pull out those sounds and then stick them together using very basic simple processing to smooth out the joints.”
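To make Dr. Breen's description concrete, here is a minimal Python sketch of the "look up the sounds, then smooth the joins" idea. It is a toy illustration only, not Nuance's or Apple's method: the `UNIT_DB` dictionary, its labels, and the sample values are hypothetical stand-ins for a real database of recorded sounds, and the smoothing is assumed to be a plain linear crossfade. A real system would also have to choose among many candidate recordings of each sound, which this sketch skips.

```python
# Toy concatenative synthesis: look up recorded units, then crossfade the joins.
# UNIT_DB, its labels, and its sample values are hypothetical placeholders.

def crossfade_concat(a, b, overlap):
    """Join two sample lists, linearly blending the last `overlap` samples
    of `a` with the first `overlap` samples of `b`."""
    overlap = min(overlap, len(a), len(b))
    if overlap == 0:
        return a + b
    blended = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight shifts gradually from a to b
        blended.append((1 - w) * a[len(a) - overlap + i] + w * b[i])
    return a[:len(a) - overlap] + blended + b[overlap:]


def synthesize(labels, unit_db, overlap=2):
    """Pull each requested unit from the database and stitch them together."""
    out = []
    for label in labels:
        unit = unit_db[label]  # raises KeyError if the sound was never recorded
        out = crossfade_concat(out, unit, overlap) if out else list(unit)
    return out


# Hypothetical database: each label maps to a short list of audio samples.
UNIT_DB = {
    "HH": [0.0, 0.1, 0.2, 0.1],
    "AY": [0.5, 0.6, 0.5, 0.4],
}

samples = synthesize(["HH", "AY"], UNIT_DB)
```

With an overlap of two samples, the two four-sample units above join into a six-sample result, with the two middle samples blended so the signal does not jump abruptly at the seam.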
In practice, it’s a complicated process, but the resulting voice is clear, albeit a bit robotic. Dr. Breen says the future of this technology lies in giving synthetic voices like Siri more expressivity.
“The nuances of emotions that we are able to present to somebody on a phone is incredible: a pause of the right duration on a phone line and you'll get the message that I'm not happy, or a subtle expression in my voice will give you an indication of the meaning and emotion that's behind it. We want to move away from recordings and move to the generation of sound. That's where we want to be.”
And if they can perfect that aspect of humanity in the technology, the possibilities are endless.
Music used in this episode
"Know How" by Skeewiff Feat Siri
"Neighbors" by Steven Gutheinz
"Never Wanna Grow Up (Instrumental)" by Katrina Stone
"Vona" by Moncrief