Speech is more than spoken text

Original article was published by Catherine Breslin on Artificial Intelligence on Medium

Words carry meaning, but there’s much more to spoken language

Since the launch of Alexa, Siri, and Google Assistant, we’re all becoming much more used to talking to our devices. Beyond these virtual assistants, voice technology and conversational AI have increased in popularity over the last decade and are used in many applications.

One use of Natural Language Processing (NLP) technology is to analyse and gain insight from the written transcripts of audio — whether from voice assistants or from other scenarios like meetings, interviews, call centres, lectures or TV shows. Yet when we speak, things are more complicated than a simple text transcription suggests. This post talks about some of the differences between written and spoken language, especially in the context of conversation.

To understand conversation, we need data. Transcribed conversational data is harder to come by than written text data, but some good sources are available. One example is the CallHome set, which consists of 120 unscripted 30-minute telephone conversations between native speakers of English and is available to browse online. Here's a snippet of one of the transcriptions:

Part of a CallHome transcription (6785.cha)

Transcripts contain mistakes

“That’s one small step for (a) man. One giant leap for mankind.”

In 1969, Neil Armstrong stepped onto the surface of the moon and spoke the now famous line “That’s one small step for man, one giant leap for mankind”. Later, Armstrong insisted the line had been misheard. He had not said “for man”, but rather “for a man”.

The poor-quality audio and the particular phrase mean that Armstrong's words remain ambiguous. But it's clear that mistakes are made when transcribing audio — both people and machines are guilty of this. Another example is in the hand-transcribed CallHome excerpt above: about half-way through, the word 'weather' is mistakenly written as 'whether'.

Exactly how many transcription mistakes are made depends on the type of audio. Is the speech clear, or is there a lot of background noise? Is the speaker clearly enunciating, or speaking informally? Is the topic general enough to easily transcribe, or is there a lot of unfamiliar vocabulary?

Attempts have been made to quantify the human transcription error rate on conversational telephone speech. Switchboard is another dataset of transcribed telephone calls, containing about 260 hours of speech. One team measured the human word error rate (WER) on this set at around 5.9%, using two expert transcribers: the first transcribed the audio, and the second validated the transcription, correcting any mistakes they found. The same paper estimated a human error rate of 11.3% on the CallHome set.

A subsequent paper from a different team used three transcribers plus a fourth to verify. They reported human error rates for the three transcribers of 5.6%, 5.1% and 5.2% on Switchboard, and 7.8%, 6.8% and 7.6% on CallHome. The same paper reported their best automatic speech recognition (ASR) error rate as 5.5% on Switchboard and 10.3% on CallHome. So interestingly, while ASR performance is in the ballpark of human performance on Switchboard, it's a few percent worse than human transcription on CallHome.

Analysis of human transcription shows that people mis-recognise common words far more frequently than rare words, and that they are also poor at recognising repetitions. On these specific conversational telephone speech tasks, where both human and computer error rates are low, people and machines make similar kinds of transcription error.
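The error rates quoted above are word error rates. As a rough sketch of how the metric works (plain Python, not the scoring pipeline any of these papers actually used), WER is the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words, via dynamic programming
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# Armstrong's disputed line: dropping 'a' is one deletion in six reference words
print(round(wer("one small step for a man", "one small step for man"), 3))  # 0.167
```

Note that because insertions count as errors, WER can exceed 100% — one reason the human-versus-machine comparisons above are made on the same test sets with the same reference transcripts.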

ASR systems make mistakes (source: https://knockhundred.com/news/when-english-subtitles-go-wrong)

Errors crop up in written text too, in the form of typos and incorrect word choices. Yet the kinds of errors found in written text differ from those made when transcribing speech.

The penguin jumped?

Once we accurately know the words someone said, the meaning can be affected by altering how we say them — the prosody. We can turn a statement into a question, or a question into an exclamation. “The penguin jumped?” and “The penguin jumped!” are spoken differently. Written text uses punctuation — ? and ! — to show this difference, but punctuation is often unreliable or missing in transcripts of audio.

Emphasising different words changes the meaning too — stressing ‘penguin’ in “The penguin jumped?” asks a different question from stressing ‘jumped’, and each would elicit a different reply.

Additional information, too, is conveyed by our tone. We might sound uncertain, nervous, happy, or excited while we speak. How we choose to speak may also convey a different emotion from how we’re really feeling. Understanding emotion in speech is an increasingly popular research topic, though in practice it typically reduces emotion down to a small set of categories that are easy to separate. For example, one dataset, RAVDESS, uses the categories neutral, calm, happy, sad, angry, fearful, surprise, and disgust. This doesn’t come close to capturing the full range of emotion that can be expressed.

Of course, words carry emotion and meaning too. But in analysing only the words spoken, we run the risk of missing much of the meaning behind what people are saying. That, we can only get from how they say it.

When in agreement with someone, we’re usually quick to voice it. Sometimes, so quick that we overlap the beginning of our speech with the end of theirs. When disagreeing, though, we aren’t always so quick off the mark. Pauses in conversation are often a precursor to disagreement, or are used before saying something unexpected. Anything upwards of half a second of silence can indicate an upcoming disagreement. Elizabeth Stokoe’s book ‘Talk’ has many examples of where conversations go wrong and how an unexpectedly long silence is often the first sign.

I mean-er-I want to-you know-say something…

We stumble over our words all the time and may barely even notice. Take this line from one conversation:

“So what I would say is that, you know, the survey is — the survey instructs the consumers”

The speaker first inserts a filler (‘you know’), and then goes on, within the same utterance, to correct herself (‘the survey is — the survey instructs’).

You might think that fillers and corrections are characteristics of informal speech, but this example is taken from a much more formal setting — the US Supreme Court arguments. Both audio and transcript of these are available to browse online. The sessions have a mix of pre-prepared remarks from both sides of the argument, and some back-and-forth discussion.

The ‘um’s and the ‘er’s and ‘you know’s may seem random, but they serve very specific purposes during speech. One time we use them is when we want to eke out some extra time to get our thoughts together, without signalling to others that we’ve finished talking (otherwise known as ‘holding the floor’). They can also help in communicating uncertainty or discomfort, or let us speak more indirectly to appear more polite.

In the CallHome conversation snippet at the beginning of this post, you can see speaker A saying ‘mhm’ and ‘yeah’ while speaker B is talking. Like the ‘you know’ from the Supreme Court example, ‘mhm’ and ‘yeah’ don’t convey information here, but are ‘backchannels’. Speaker A is simply letting speaker B know that they are still paying attention. Backchannels often overlap the speech of the other person, in a way that doesn’t interrupt their flow.

In conversation, we talk over each other all the time. Some of this is the backchannels we use to show that we’re paying attention, sometimes we start our turn naturally before the other person has finished theirs, and sometimes we jump in to interrupt (‘take the conversational floor’) before the other person has finished talking. The amount of overlap varies a lot between scenarios and speakers. A staged interview between two participants might have very little overlap, but a meeting where the participants are excited about the ideas being discussed might have much more overlapping speech. The AMI dataset is a set of meeting recordings, and the amount of overlapping speech in its meetings varies between 1% and 25%.
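One way to quantify overlap like this is a sweep over diarised segment boundaries, counting how much spoken time has two or more speakers active at once. A minimal sketch, using made-up segments rather than the AMI corpus's actual annotation format:

```python
def overlap_fraction(segments):
    """Fraction of total spoken time where 2+ speakers talk at once.

    segments: list of (speaker, start, end) tuples, times in seconds.
    """
    # Sweep line: +1 when a speaker starts, -1 when they stop
    events = []
    for _, start, end in segments:
        events.append((start, 1))
        events.append((end, -1))
    events.sort()  # ties put -1 first, so abutting turns don't count as overlap

    active = 0          # speakers currently talking
    prev_t = None
    speech = overlap = 0.0
    for t, delta in events:
        if prev_t is not None and active > 0:
            speech += t - prev_t
            if active > 1:
                overlap += t - prev_t
        active += delta
        prev_t = t
    return overlap / speech if speech else 0.0

# Hypothetical snippet: B starts before A finishes, then A backchannels over B
segments = [("A", 0.0, 5.0), ("B", 4.0, 8.0), ("A", 7.5, 9.0)]
print(round(overlap_fraction(segments), 3))  # 0.167
```

Measured this way, the denominator is spoken time rather than recording length; either convention is defensible, and reported overlap figures depend on which one is used.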

Despite unconsciously using these conversational phenomena when talking with others, people tend to use fewer such hesitations when talking to a computer. Perhaps they intuitively know that computers will struggle with these aspects of conversation. Still, these patterns of speech are important when building any technology that analyses conversations between people.