Automatic Speech Recognition: Setting, Benefits and Limitations

Audio is an important modality for criminal investigations. This is true for many kinds of audio and in particular for spoken content. Whether as part of files, streams or posts accessible on the web and social media, as recordings collected from mobile devices during investigations and interviews, or as conversations captured via intercepted telephone calls, speech can be encountered in many environments. Audio data by itself, however, is unstructured information: without access to the spoken content, no further analysis or processing is possible. Enter Automatic Speech Recognition (ASR).

ASR systems are key to tapping into this rich content and can substantially speed up the transcription process. For pre-recorded audio, they can run faster than real time, e.g. transcribing one hour of audio in 30 minutes or less. Furthermore, they can be employed in a continuous manner to provide instantaneous transcripts even while a conversation is still taking place.
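To put this speed claim in concrete terms, processing speed is commonly expressed as a real-time factor. The minimal sketch below (in Python) only restates the example above; the function name and numbers are illustrative assumptions.

```python
# Illustrative sketch: expressing ASR processing speed as a real-time
# factor (RTF), i.e. processing time divided by audio duration.
# Values below mirror the example in the text; everything else is assumed.

def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF < 1 means the system runs faster than real time."""
    return processing_seconds / audio_seconds

# One hour of audio transcribed in 30 minutes -> RTF of 0.5,
# i.e. twice as fast as real time.
print(real_time_factor(30 * 60, 60 * 60))  # 0.5
```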

 

The main elements of an ASR system are an acoustic model (AM) representing the acoustic properties of speech, and a language model (LM) representing the words and their usage. The basic element of the AM is the phoneme. Phonemes are the distinct sounds occurring in a particular language and are commonly described in terms of the International Phonetic Alphabet (IPA). Different languages typically use different phoneme inventories. An AM contains models for these phonemes, their contexts and their interactions. A single word may have one or multiple pronunciations (“either”, “tomato”), and different words may share the same pronunciation (“Wright”, “write”, “right”). The LM, on the other hand, models the usage of words and sequences of words. The LM is typically a statistical model (and not a grammar), which provides the necessary flexibility in modelling word sequences; after all, we do not always speak in grammatically complete and correct sentences, and such utterances should also be captured by the model.
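To make these components more concrete, the sketch below (in Python) shows one possible toy representation of a pronunciation lexicon and a bigram LM. The phoneme symbols, entries and probabilities are invented for illustration and do not come from any real system.

```python
# Toy illustration of two ASR components described above.
# All entries and probabilities are invented for demonstration.

# One word may map to several pronunciations ("either"), and different
# words may share the same pronunciation ("write" / "right").
pronunciation_lexicon = {
    "either": [["IY", "DH", "ER"], ["AY", "DH", "ER"]],
    "write":  [["R", "AY", "T"]],
    "right":  [["R", "AY", "T"]],
}

# A bigram LM assigns a probability to a word given its predecessor,
# estimated from text data rather than defined by a grammar.
bigram_lm = {
    ("turn", "right"): 0.12,
    ("turn", "write"): 0.0001,
    ("to",   "write"): 0.05,
}

def lm_prob(prev_word: str, word: str) -> float:
    """Return P(word | prev_word), with a small floor for unseen pairs."""
    return bigram_lm.get((prev_word, word), 1e-6)

print(pronunciation_lexicon["either"])  # two alternative pronunciations
print(lm_prob("turn", "right"))         # context makes "right" more likely than "write"
```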

 

At runtime, the ASR system extracts acoustic features from the audio, which are then passed to the recognizer, the component that combines the different sources of information. The recognizer produces a set of alternative phonemes with associated probabilities (likelihoods). These phonemes are then joined together, via the pronunciation model, into units corresponding to words. The LM finally assigns probabilities to sequences of words. Scores produced by the AM and the LM are combined to define the likelihood of a path (a sequence of words) in a large network of alternatives. The decoding algorithm selects the best (most likely) word sequence for the input audio.
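The following minimal sketch (in Python) illustrates the last step on a toy scale: ranking two competing hypotheses by a combined AM and LM score, as a real decoder does over a much larger network of alternatives. The scores and the LM weight are invented assumptions, not values from any actual system.

```python
# Illustrative only: choosing between two competing word sequences by
# combining acoustic and language-model scores. All numbers are invented.

LM_WEIGHT = 10.0  # language-model scale factor, a typical tunable parameter

hypotheses = [
    # (word sequence, acoustic log-likelihood, LM log-probability)
    (["turn", "right", "here"], -250.0, -6.0),
    (["turn", "write", "here"], -249.0, -12.5),
]

def combined_score(am_logprob: float, lm_logprob: float) -> float:
    """Combine AM and LM scores into a single path score."""
    return am_logprob + LM_WEIGHT * lm_logprob

best = max(hypotheses, key=lambda h: combined_score(h[1], h[2]))
print(" ".join(best[0]))  # the LM pulls the decision towards "right"
```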

 

Qualitative and quantitative measures can be employed to assess the accuracy of an ASR system. Accuracy is typically measured by WER (word error rate, or for some languages, character error rate). This measure captures the correspondence between the produced sequence of words and the actual (reference) sequence and takes into account three types of errors: insertion, deletion and substitution. An insertion refers to the case where a word has been recognized by mistake although no such word is found in the reference. A deletion refers to the case where a word in the reference has been missed (nothing has been recognized). A substitution refers to the case where a word A has been recognized instead of another word B. The sum of these errors is divided by the total number of words in the reference, i.e. WER = (substitutions + deletions + insertions) / number of reference words. It should be noted that the WER is indifferent to the types of words in the transcription. For example, a word like “impracticability” has the same weight on the WER as a word like “the”. As such, it provides only one view of the performance of an ASR system and should not be used in isolation. For practical purposes, especially for keyword search, additional measures (such as precision and recall) should be applied.
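The sketch below (in Python) shows a minimal WER computation via edit distance, assuming whitespace-tokenised reference and hypothesis transcripts; the example sentences are invented for illustration.

```python
# Illustrative sketch: computing WER from a reference and a hypothesis
# transcript via edit distance (insertions, deletions, substitutions).

def word_error_rate(reference: str, hypothesis: str) -> float:
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = minimum edit cost to turn the first i reference words
    # into the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j          # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("sat" -> "sit") and one deletion ("the") out of
# six reference words gives a WER of 2/6 ≈ 0.33.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```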

 

Despite its advantages, an ASR system remains a statistical system, which learns how to transcribe speech given input data, a model and potentially a set of rules. As such, it is bound to make errors. In fact, an ASR system can be very sensitive to many factors: accuracy can degrade quickly when the conditions at hand do not match the conditions the system was trained for. In the remainder of this blog post we will examine the factors which influence the performance of an ASR system.

 

 

Acoustic/recording conditions: The closer (more similar) the acoustic conditions are between the training and evaluation data, the better the accuracy of an ASR system will be. These conditions include the microphone used to record the audio (e.g. far-field vs. close-talking, directional vs. omnidirectional), the recording environment (e.g. open air, reverberation and echo), the properties of the transmission medium (e.g. channel noise, crosstalk), and ambient effects (e.g. background noise, music, vehicles, gunfire).

 

Domain mismatch: The domain, or more specifically the topic, of the speech should match the domain the ASR system was trained for as closely as possible. A system built for media analysis (TV, radio) may not work well with telephone audio, and vice versa. Similarly, a medical transcription system would likely be less accurate on legalese. Most out-of-the-box commercial systems are geared towards individual consumers, covering natural conversational speech, and fall short on technical terms and jargon.


Vocabulary and jargon: An ASR system is typically a statistical system using a fixed vocabulary. This means that a word which does not exist in the system’s vocabulary can never be recognized correctly. Such words are referred to as out-of-vocabulary (OOV) words and form a major source of errors for ASR. For instance, a word like “COVID-19”, which did not exist before, cannot be recognized even though it may be commonplace in speech. An OOV word typically leads to more than one error, because it also affects (and is affected by) its neighbouring words. However, simply adding a word to the vocabulary is often not enough: the LM also needs to be updated and trained to reflect how the word is typically used in context. End-users such as LEAs who aim to recognize such words correctly need to ensure that the words and jargon they are interested in are indeed present in the vocabulary and that the LM is well trained with these words in context.
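As a minimal illustration of the first step of such a check, the hypothetical sketch below (in Python) compares a list of case-relevant terms against an ASR vocabulary. The one-word-per-line vocabulary file, its name and the example terms are assumptions made purely for illustration; real systems expose their vocabularies in system-specific ways.

```python
# Hypothetical sketch: checking case-relevant terms against an ASR
# system's vocabulary to find out-of-vocabulary (OOV) words.
# File name, file format and example terms are assumptions.

def load_vocabulary(path: str) -> set[str]:
    """Read one word per line from a plain-text vocabulary file."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def find_oov(terms: list[str], vocabulary: set[str]) -> list[str]:
    """Return the terms the ASR system could never output correctly."""
    return [t for t in terms if t.lower() not in vocabulary]

vocab = load_vocabulary("asr_vocabulary.txt")    # assumed file name
case_terms = ["covid-19", "fentanyl", "hawala"]  # example jargon terms
print(find_oov(case_terms, vocab))
# Any term reported here must be added to the vocabulary, and the LM
# must be retrained with text that shows the term in context.
```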

 

Accented/emotional speech: Standard ASR systems are trained with audio corresponding to the official standard of a language. If a speaker has a strong accent or speaks a dialect, the words may not be recognizable by the ASR system. In such cases, either the pronunciation model should be adapted or the complete AM should be retrained from scratch. Similarly, emotions embedded in speech may change the vocal characteristics of how certain phones are uttered. It is typically more difficult to recognize words that are shouted or cried than words spoken in a neutral manner, or words spoken while the speaker is sick, exhausted or under stress.

 

Very short utterances and backchanneling: Function words, fillers (hesitations such as “uhm”) and backchannels are often difficult to recognize, as they can appear in many contexts and are often uttered only rudimentarily (i.e. the audio contains only traces of them).

 

Semantics and punctuation: An ASR system is typically meant to transcribe exactly what is being said. It does so word by word and without knowledge about the meaning of these words. Said another way, an ASR model operates on levels below the semantic level. This is why, especially with rare words, the final sentence may be grammatically correct but semantically incorrect. In a similar manner, an ASR system does not possess any knowledge about punctuation. Punctuation models exist and can be added as a post-processing step, but since they also depend on semantics (question marks, exclamation marks), one should not expect an ASR system to position punctuation marks correctly in the transcribed text.

 

The SAIL LABS Media Mining Indexer is the software component used to transcribe spoken content in the ROXANNE Platform. It provides real-time processing and transcription in more than 25 languages and dialects and offers end-users the option to re-train the vocabulary and LM according to their particular needs and interests[1]. This feature allows the ASR component to be tailored to specific topics and domains and applied to investigative data.

 

[1] Dikici, Erinc, Gerhard Backfried, and Jürgen Riedler. "The SAIL LABS Media Mining Indexer and the CAVA Framework." INTERSPEECH. 2019.