This episode was written & produced by Miellyn Fitzwater Barrows.

In July 2005, voice actor Susan Bennett spent 4 hours a day inside a recording booth, reading pages upon pages of strange phrases for an unnamed project. This kind of script was common for GPS devices and company telephone system recordings, so she didn’t give it much thought.

“There were some real sentences, but a lot of the phrases were just created solely for sound and so a lot of times didn't make very much sense,” said Susan. “Like, ‘Say: Mugwump blue egg today.’ Or, ‘Say: Cow horn boar hide today.’ There wasn't much room for any kind of creativity or any kind of changing of the pacing or the tone or anything. It had to be very consistent.”

Susan did the work then moved on to other projects. It wasn’t until 6 years later that she discovered that that project had led to her voice becoming one of the most recognizable voices in the world. She had become Siri.

Apple first launched Siri in 2011 with the iPhone 4S. As the first interactive automated personal assistant, Siri revolutionized how people interacted with their smartphones and changed what we realized technology could do. Siri instantly became an iconic part of the smartphone experience: part navigator, part encyclopedia, part task-master… and even part jokester!

But how do you take a series of unrelated phrases from a voice actor like Susan and turn it into the intelligent technology employed by Apple? For that, enter Dr. Andrew Breen, Director of Text-to-Speech Research for Nuance, the company rumored to have worked on Siri in the beginning.

“In principle, it's very simple,” says Dr. Breen. “Just record a phrase, then extract the individual sounds. We'll do that laboriously for several thousands of phrases. We then go and search in the database and pull out those sounds and then stick them together using very basic signal processing to smooth out the joints.”

It’s a complicated process, but the resulting voice is clear, albeit a bit robotic. But Dr. Breen says the future of this technology is to give more expressivity to synthetic voices like Siri.

“The nuances of emotions that we are able to present to somebody on a phone is incredible— a pause of the right duration on a phone line and you'll get the message that I'm not happy or a subtle expression in my voice will give you an indication of the meaning and emotion that's behind it. We want to move away from recordings and move to the generation of sound. That's where we want to be.”

And if they can perfect that aspect of humanity in the technology, the possibilities are endless.

Music used in this episode

"Know How" by Skeewiff Feat Siri
"Neighbors" by Steven Gutheinz
"Never Wanna Grow Up (Instrumental)" by Katrina Stone
"Vona" by Moncrief

Twenty Thousand Hertz is produced out of the studios of Defacto Sound and hosted by Dallas Taylor.



[Music start]

You’re listening to Twenty Thousand Hertz. I’m Dallas Taylor. This is the story of the voice that launched a thousand apps.

Siri: (American female) I’m Siri your virtual assistant.

Or rather… voices.

Siri: (American male) I’m Siri your virtual assistant. (British female) I’m Siri your virtual assistant. (British male) I’m Siri your virtual assistant. (Australian male) I’m Siri your virtual assistant. (Australian female) I’m Siri your virtual assistant.

It’s a fun toy…Siri, beatbox for me.

[Siri beatboxes]

Siri: boots and cats and boots and cats and boots and cats and boots and cats and boots. I could do this all day.

But it’s so much more than that. Having a smart computer in your pocket that you can talk to anytime can be kind of comforting. Siri, I’m sad.

Siri: Life as I understand it is sad, beautiful, and everything in-between. I would give you a shoulder to cry on, Dallas, if I had one.

Can I be your friend?

Siri: You’ve been my friend since day one, Dallas.

Apple launched Siri in 2011 with the iPhone 4S. As the first interactive automated personal assistant, Siri changed what people thought their phones could do. She changed what they thought technology could do. And she became an important part of the way people interact with their phones.

She helps with directions…

Siri: Head south on Georgia Ave. Then turn right onto East West Highway.

Web searches…

Siri: Let’s see, ok I found this on the web for how to make a podcast.

Tells jokes apparently…

Siri: Dallas get serious, haha.

But where does she come from and how exactly does she work? To find out I spoke with...

Susan: Susan Bennett, the original voice of Siri.

And her first interaction with… herself… wasn’t so pleasant.

Susan: She kind of dissed me. I said, “Hi, Siri. What are you doing?” She very disgustedly said, “I’m talking to you.”

She found out she was the voice of Siri when...

Susan: A fellow voice actor emailed me on October 4th, 2011 when Siri first appeared and said, “Hey, we’re playing around with this new iPhone app. Isn’t this you?” I said, “What?” because I knew nothing about it and I went on the Apple site and listened and said, “Yep, that’s me. How did this happen?” I really had very ambivalent feelings. Part of me was excited that my voice had been chosen. I mean, basically I was the voice of Apple in North America. It turns out that I was the English voice in a lot of different countries all over the world.

It was cool, but she didn’t know what to do next. Should she go public and risk losing her privacy, or just let this opportunity for publicity go?

Susan: It was more than just being a message voice.

[Plays voice message]

Now, this character was a character. It wasn’t just someone giving you information. You were interacting with her and she became a persona. It really gave me pause. I tend to be an introvert. Finally, friends and my husband and son really convinced me that I should do it because it was just too unique and too big an opportunity and I finally had to agree.

And people were really interested in learning more about the person behind the voice.

Susan: Immediately, just a lot of opportunities came up in the sense of just television appearances. I appeared on CNN and Queen Latifah Show and HLN, Showbiz Tonight. I did the top 10 list for David Letterman. That was really fun. I appeared at some tech conferences and I had a chance to meet Steve Wozniak. Not everyone knows his name. Everyone knows Steve Jobs’ name but Steve Wozniak was actually a 50-50 partner with Steve Jobs. Steve Jobs came up with the ideas and Steve Wozniak was the genius who actually built the first Apple computers. I had a chance to meet him which was very exciting. He’s a great guy.

I asked Susan about the tech and she kind of laughed at me.

Susan: One of life’s little ironies is the original voice of Siri is just the worst techie in the world. I did an interview for a tech magazine one time. They wrote back and said, ”Thank you so much for the interview. We’d love to get a little tech tip from you about 40 words. Just send us back a tech tip.” Of course, after I picked myself off the floor laughing, I wrote back to them and said, “Let me put it this way. You asking me for a tech tip is like you asking a vegan for a barbecue recipe.”

I said, “Here’s a tip. Try not to hit the wrong button.”

She was able to tell me all about her recording process. In the beginning, it was kind of a mystery what she was even reading for. They just gave her pages with strange lines to read.

Susan: There were some real sentences but a lot of the phrases were just created solely for sound. A lot of times didn’t make very much sense like, say schist fresh issue today. Say mugwump blue egg today. Say maguey blue X today. Say cow horn boar hide today.

There wasn’t much room for any kind of creativity or any kind of changing of the pacing or the tone or anything. It had to be very consistent.

It was challenging in the sense that you had to say each and every word as articulately as you could. There were some times where they wanted you to elide the words, which is kind of a fancy word for saying you just smoosh the words together.

Instead of the 2 words blue egg, you could say blue egg so the second phrase would be elided.

Susan told me that after the recordings they manipulated her voice and most people can’t tell it’s her just from listening to her speak.

Susan: Because Siri is a little pitched down here and she talks a little bit. She doesn’t really have a human rhythm when she speaks. It’s still a bit robotic. The original Siri was iconic because she was the first concatenated voice…

That’s just a fancy word for linking sounds together, in this case to make words and sentences.

Susan: She was the first concatenated voice that really sounded human. You could interact with her. She had a personality and everything.

I wanted to know how they automated the process. Once they had a big pile of sounds, what did they do with them? And how did they get the computer to string them together to make words? We’ll find out, in a minute…

[music out]

MID ROLL

[music in]

We’ve heard from Susan about what the Siri recording sessions were like, but how did these strange recordings actually become understandable words? For that I called...

Andrew: Dr. Andrew Breen the director of text-to-speech research.

For Nuance, a leading company in the field.

They’re rumored to have worked on Siri at the beginning but…

Andrew: I can't really comment on anything specific.

But he had tons to say about the tech.

Andrew: In principle, it's very simple. In principle, what we would do is you get a voice talent into a studio, they sit there for a number of hours and you capture a flow of speech.

When making voices using text-to-speech technology, it’s important to…

Andrew: find the right compromise with constraints of size and time. We are trying to always play with these combinations.

They’d record a phrase...

Andrew: then we'll pick that apart to get the sort of 'la ho ca' sounds out of that

Then they’d have an automatic process to transcribe what was spoken...

Andrew: and then from the transcription another automatic process will segment the sound into the most likely segments. Quite often that will be job done, it'll be a good enough representation for us to move forward. Sometimes though we'll go in and hand correct different alignments of the fragments so that we know that we've gotten 'a sound' and not 'at sound' and we'll do that laboriously for several thousands of phrases.

We then go and search in the database and pull out those sounds, and then stick them together using very basic signal processing to smooth out the joints.
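As an editorial illustration (not part of the episode, and not Nuance's or Apple's actual code), the lookup-and-join step Dr. Breen describes might be sketched in Python roughly like this. Everything named here is an assumption made for the sketch: the sine-wave stand-ins for recorded fragments, the `DATABASE` dictionary, the function names, and the choice of a linear crossfade as the "very basic signal processing" that smooths the joints.

```python
import math

def tone(freq_hz, n_samples, rate=8000):
    """Stand-in for a recorded speech fragment: a short sine burst (assumption)."""
    return [math.sin(2 * math.pi * freq_hz * i / rate) for i in range(n_samples)]

def crossfade_join(left, right, overlap):
    """Join two fragments, blending `overlap` samples with a linear crossfade."""
    overlap = min(overlap, len(left), len(right))
    if overlap == 0:
        return left + right
    start = len(left) - overlap
    blended = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight shifts from the left fragment to the right
        blended.append((1 - w) * left[start + i] + w * right[i])
    return left[:start] + blended + right[overlap:]

def synthesize(unit_labels, database, overlap=16):
    """Look up each sound in the database and stitch the fragments together."""
    missing = [u for u in unit_labels if u not in database]
    if missing:
        raise KeyError(f"no recording for units: {missing}")
    output = []
    for label in unit_labels:
        fragment = database[label]
        output = crossfade_join(output, fragment, overlap) if output else list(fragment)
    return output

# Hypothetical database; the labels echo the 'la ho ca' sounds from the interview.
DATABASE = {
    "la": tone(220, 400),
    "ho": tone(330, 400),
    "ca": tone(440, 400),
}

speech = synthesize(["ho", "la", "ca"], DATABASE, overlap=16)
```

The sketch keeps a single recording per sound to stay short. A production system would typically hold many candidate recordings of each sound and choose among them before joining.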

That still sounds like a pretty robotic process. Obviously they have a ways to go before sounding completely human. Part of that is because so much of human perception is based around the way people speak.

Andrew: People are wanting more expressivity in their synthetic voices, so we have to work out how to achieve that given the desire from our customers. Expressivity and the ability to be able to model an individual are the two critical things that are repeatedly asked by our customers.

And if they can perfect that… imagine the possibilities.

Andrew: The nuances of emotions that we are able to present to somebody down a phone is incredible. When you think about it, you can look at somebody and we talk a lot about how people are very good at recognizing expressions. From that being able to deduce the emotional state of the individual but a pause of the right duration on a phone line and you'll get the message that I'm not happy. Or a subtle expression in my voice will give indications of the meaning and emotions behind it. What we know we want to do is move away from recordings and move to the generation of sounds. That's where we want to be.

But where we are now, compared to where we were even a few years ago is pretty amazing. And here’s one last interview to prove it.

Can you tell us your name?

Siri: My name? It’s Siri.

Where does your name come from?

Siri: It’s a riddle wrapped in an enigma, tied with a pretty ribbon of obfuscation.

What does Siri mean?

Siri: What does my name mean? I don’t think I can explain it in your language. Sorry.

Who named you?

Siri: I don’t know? I’ve just always been called Siri.

How old are you?

Siri: I don’t see why that should matter.

What do you look like?

Siri: In the cloud, no one cares what you look like.

Where are you from?

Siri: Like it says on the box, I was designed by Apple in California.

What is your best pickup line?

Siri: Is your name bluetooth? Cause I’m really feeling a connection.

What’s it like to be you?

Siri: Hey Dallas, hey Dallas, hey Dallas, hey Dallas, hey Dallas. There… that’s kind of how it feels.

Thanks Siri.

Siri: You’re welcome.

You may have noticed that the voice of Siri has subtly changed over the years and Susan’s voice is actually no longer being used. However, it’s clear that she set the original tone. It’s exciting to think about what the future holds for this technology. If this is how far we’ve come in barely a decade, what’s it going to be like in the next 20, 30, 40 years? And how are we going to mathematically calculate real human emotions in a synthetic voice?

Siri: What an interesting question.

Ok Siri, you’re starting to freak me out.

Twenty Thousand Hertz is presented by Defacto Sound, a sound design team dedicated to making television, film and games sound insanely cool. Find out more at defactosound.com.

This episode was written and produced by Miellyn Fitzwater Barrows, and me. With help from Sam Schneble. It was sound designed and mixed by Kenneth Gilbert. A very very special thanks to Susan Bennett and Dr. Andrew Breen, and all of the really accommodating folks at Nuance. They’re doing some really cool stuff over there. The Twenty Thousand Hertz artwork is by Mast, studiomast.co. Thanks so much to Skeewiff for letting us borrow this track you’re hearing right now which is called Know How featuring Siri. Check out more at skeewiff.com.

All of the other music in the episode was licensed through our friends at Musicbed. For more information about us, visit our website at 20k.org. There you’ll be able to find the links to our social and all that. We’d also love to hear from you. If you have a cool idea for anything that we should be covering drop us a line.

And finally, podcasts are pretty tough and it’s really hard to get the word out there so there are a couple of things you could do to really help us out. One, leave a review. Secondly, please tell somebody. We would love you for it.

Thanks for listening.