Overview of LTEC Voice Databases & ASR System Training

The main problem in applying automatic speaker recognition (ASR) systems in forensics is the accuracy and reliability of their results. The accuracy of identification methods depends on a number of factors that cannot always be assessed. Because it is very difficult to assess the impact of every factor encountered in forensic speaker examinations, the performance of such systems is best determined using voice databases built from audio recordings submitted for examination. Although many voice databases have been created that attempt to capture voices under a variety of conditions, forensic investigations still encounter factors whose impact on an automatic speaker recognition system is often unknown. This ROXANNE blog post briefly presents the characteristics of two voice databases of the Forensic Science Centre of Lithuania (LTEC), BALSAS_LTv1.1 and BALSAS-200LT, and describes the training performance of the ASR system VOICE on BALSAS_LTv1.1. In the ROXANNE project, LTEC will use the described voice databases (BALSAS_LT200 and BALSAS_LTv1.1) to assess the recognition accuracy of the existing ASR systems VOICE and BATVOX, as well as of those developed internally.

An ASR system must be trained before it can be used. We currently have two voice databases at the Forensic Science Centre of Lithuania (LTEC): BALSAS_LTv1.1 and BALSAS-200LT.

The voice database BALSAS_LTv1.1 was created in cooperation with the Institute of the Lithuanian Language at Vilnius University and UAB BALTIJOS KOMPIUTERIŲ CENTRAS. The voices of 120 speakers, both male and female, aged between 25 and 55, were recorded for this database. All audio recordings were made using professional audio recording equipment. Each speaker had 5 sessions with one recording per session, and each recording was at least half an hour long. The recordings of the last two sessions were made during mobile phone conversations: session 4 with a voice recorder and session 5 via a telephone exchange. These recordings are sufficiently long, 20-30 minutes. Both monologues and conversations were recorded; the same person spoke both Lithuanian and Russian; and each person talked about different things, so the recordings are text-independent. Recordings were made in almost all regions of Lithuania and in different parts of the country (Vilnius, Panevėžys, Kaunas, etc.).

The voice-recorder recordings were made with a professional voice recorder at a sampling frequency of 16000 Hz and 16 bits per sample (PCM format). The mobile phone recordings were made using a standard mobile phone over the GSM network.

These audio recordings are used to train existing ASR systems. Only male voice recordings were used for training and examinations. Two sets of voice recordings, MOBREF20_D and MOBTRAIN20_D, were derived from this database so that both sets contain voice recordings of the same persons. Each set consists of the voice recordings of 160 different men. Each person has at least one voice recording in both MOBREF20_D and MOBTRAIN20_D, and some have several. Each voice recording is at least 20 seconds and at most 3 minutes long.
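The format and duration constraints described above can be checked automatically before a recording is added to a set. The following is a minimal sketch, assuming the recordings are stored as PCM WAV files (the source states the sampling frequency and bit depth, but not the file container); the function name check_recording and its parameters are hypothetical.

```python
import wave

def check_recording(path, expected_rate=16000, min_sec=20.0, max_sec=180.0):
    """Check a PCM WAV file against assumed format and duration limits.

    Returns (ok, details). The defaults mirror the figures quoted in the
    text (16000 Hz, 16 bit, 20 s to 3 min); the WAV container is an assumption.
    """
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        bits = 8 * wav.getsampwidth()          # bytes per sample -> bits
        duration = wav.getnframes() / rate     # seconds
    ok = rate == expected_rate and bits == 16 and min_sec <= duration <= max_sec
    return ok, {"rate_hz": rate, "bits": bits, "duration_s": duration}
```

For 8 kHz material, the same check can be reused by passing expected_rate=8000.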

The existing ASR system VOICE was trained using MOBREF20_D and MOBTRAIN20_D. These voice recording sets will also be used in the future for training the ROXANNE systems. After training VOICE with MOBREF20_D and MOBTRAIN20_D, the following characteristics describing the training of the system were obtained:

EER = 0.23%,

C_LLR = 0.02.
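For readers less familiar with these two measures, the sketch below shows how EER and C_LLR are commonly computed from the LLR scores of same-speaker and different-speaker comparisons. It is not the code of the VOICE system. The function names, the base-10 logarithm for the LLR values and the simple threshold sweep used for the EER are assumptions made for illustration.

```python
import math

def cllr(same_llr, diff_llr, base=10.0):
    """Log-likelihood-ratio cost for LLRs given in log base `base` (assumed 10)."""
    ln_base = math.log(base)

    def log2_1p_exp(x):
        # log2(1 + e^x), written to stay stable for large |x|
        return (max(x, 0.0) + math.log1p(math.exp(-abs(x)))) / math.log(2.0)

    # Same-speaker pairs are penalised by log2(1 + 1/LR),
    # different-speaker pairs by log2(1 + LR).
    cost_same = sum(log2_1p_exp(-s * ln_base) for s in same_llr) / len(same_llr)
    cost_diff = sum(log2_1p_exp(s * ln_base) for s in diff_llr) / len(diff_llr)
    return 0.5 * (cost_same + cost_diff)

def eer(same_llr, diff_llr):
    """Coarse equal error rate: sweep observed scores as thresholds and
    return the mean of FRR and FAR where the two are closest."""
    best = None
    for th in sorted(set(same_llr) | set(diff_llr)):
        frr = sum(s < th for s in same_llr) / len(same_llr)    # false rejects
        far = sum(s >= th for s in diff_llr) / len(diff_llr)   # false accepts
        gap = abs(frr - far)
        if best is None or gap < best[0]:
            best = (gap, 0.5 * (frr + far))
    return best[1]
```

A system that always outputs LLR = 0 gives C_LLR = 1, so values close to 0 indicate well-separated and well-calibrated scores.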

The DET curve, CMC and LLR distribution are shown in Figures 1, 2 and 3, respectively, and FRR/FAR in Figure 4.

As shown by the obtained results, the training of the system fully meets the requirements of the ENFSI guidelines [4], i.e. C_LLR does not exceed 0.6, and the number of different voices is not less than 40.

Figure 1. DET curve. Examination of the accuracy of VOICE system identification using MOBREF20_D and MOBTRAIN20_D voice recording sets.

Figure 2. CMC. Assessment of VOICE system training accuracy using voice recording sets MOBREF20_D and MOBTRAIN20_D.

Figure 3. LLR distribution of VOICE system results for training using voice recording sets MOBREF20_D and MOBTRAIN20_D.

As the results in Figure 4 show, some of the voices overlap, meaning that the system makes mistakes: unrelated voices are attributed to some of the voice recordings of one and the same person, even though the EER is very small. It is therefore very important to identify the range of uncertainty, i.e. to set the limits within which the voices overlap. For this purpose, Figure 4 shows the LLR distribution as curves, with the normalized LLR frequency distribution marked in red for the same person's voice and in blue for a false (unrelated) voice.

Figure 4. VOICE system's LLR error distribution characteristics - FAR/FRR, when using voice recording sets MOBREF20_D and MOBTRAIN20_D for training.
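The overlap region and the error rates can be read off the two score sets directly. The sketch below assumes that the overlap is bounded by the lowest same-speaker LLR and the highest different-speaker LLR observed in training; the text does not state how the limits were actually derived, and the function names are hypothetical.

```python
def overlap_limits(same_llr, diff_llr):
    """Return (lower, upper) of the LLR region where both classes occur,
    or None if the two score sets do not overlap."""
    lower = min(same_llr)   # lowest LLR seen for a same-speaker pair
    upper = max(diff_llr)   # highest LLR seen for a different-speaker pair
    return (lower, upper) if lower <= upper else None

def error_rates(same_llr, diff_llr, threshold):
    """FRR and FAR when 'same speaker' is declared for LLR >= threshold."""
    frr = sum(s < threshold for s in same_llr) / len(same_llr)
    far = sum(s >= threshold for s in diff_llr) / len(diff_llr)
    return frr, far
```

Below the lower limit only different-speaker scores were observed, and above the upper limit only same-speaker scores, which is what makes the region between them the range of uncertainty.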

In ASR, the most serious error is considered to be the type II error (TYPE-II), in which a comparative voice is matched, with a very high degree of coincidence (high positive LLR), to the voice of another person recorded in the examined audio recording. This type of error must be avoided in forensic science, so further investigations are needed to prevent it. A type I error (TYPE-I) occurs when the searched voice is present in the examined recording, i.e. the comparative voice corresponds to the voice in the examined recording, but the comparison produces a high negative LLR value and the voice is therefore rejected. In this case, the decision is that no voice in the set of examined recordings corresponds to the comparative recording. This type of error is acceptable in forensic science, although further investigations are also needed to avoid it. Errors mostly occur for recordings taken from actual forensic examinations. This can be explained as follows. The recordings in the voice database BALSAS_LTv1.1 were made using the same recording equipment, so they are not distorted or clipped, the speakers are not walking and do not significantly change the position of the phone relative to the mouth, and the recordings themselves are at least 1 min long. Recordings taken from actual examinations, by contrast, are often clipped and have a varying recording level, different noise levels, etc.

The training results of the ASR system VOICE can now be used to determine the limits of the distribution of these results (LLR), i.e. the range of uncertainty (UNCT). The false accept (FA, TYPE-II) limit is LLR_MAX = +1.5, and the false reject (FR, TYPE-I) limit is LLR_MIN = -4.5. All voices whose overlapping values fall within this range are classed as uncertain (UNCT), and their overlap needs to be verified with additional investigations. All voices with LLR < -4.5 are false voices, i.e. voices of another person, whereas those with LLR > +1.5 are voices of the same person. If the obtained LLR is positive and lies outside the range of uncertainty, it can be stated with a very high probability that it is the voice of the same person; if it is negative (< -4.5), it is the voice of another person (very strong evidence (VSE) to support, or against, the prosecution hypothesis). Such cases essentially require minimal further investigation, and specific conclusions are drawn in the case of operational investigations. If the result of the comparison falls within the range of uncertainty and no further investigations are performed, conclusions are possible only with a very low degree of probability (limited evidence to support the prosecution hypothesis).

In other words, when testing (assessing recognition accuracy), an LLR below -4.5 indicates with high probability the voice of another person, and an LLR above +1.5 indicates the voice of the same person. These are the lower (-4.5) and upper (+1.5) limits, respectively. Note that comparison results will respect these limits only if the examined recordings were made under conditions and with equipment similar to those of the recordings used for system training, i.e. they are not distorted, clipped, etc. Unfortunately, recordings made in real conditions differ for many reasons from those used to train the system, and they are difficult to assess because the circumstances in which they were made are unknown. Since in many cases the recording conditions are not known and the quality of the examined recordings differs from that of the training recordings, it is necessary either to change these limits or, as some researchers suggest, to train on voice databases that match the quality of the examined recordings. The latter, however, is quite difficult to implement during examinations.
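The three-way interpretation described above can be written as a small decision rule. This is only an illustrative sketch: the function name and the returned labels are hypothetical, and the default limits are the values reported here for the VOICE system, passed as parameters because they would need to change for other systems or recording conditions.

```python
def interpret_llr(llr, lower=-4.5, upper=1.5):
    """Map an LLR to one of three outcomes using the training-derived limits.

    Values exactly on a limit are treated as uncertain, since the text
    uses strict inequalities (LLR < lower, LLR > upper).
    """
    if llr > upper:
        return "same speaker"        # above the false-accept limit
    if llr < lower:
        return "different speaker"   # below the false-reject limit
    return "uncertain"               # inside the range of uncertainty (UNCT)
```

A result of "uncertain" corresponds to the case that needs further verification with additional investigations.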

The above test limits only apply to the VOICE ASR system. With the implementation of the ROXANNE project and by using the same voice database BALSAS_LTv1.1, we plan to train the BATVOX and ROXANNE ASR systems and set their result distribution limits in order to compare the results.

The voice database BALSAS-200LT, composed of real recordings submitted for examination, will be used for testing the ASR (FASR) systems (assessment of recognition accuracy). It consists of three sets of audio recordings:

BALSAS_LT200. This set consists of 200 audio recordings of various durations taken from real phonoscopic examinations. The recordings were made using a GSM standard phone and are in Windows PCM format (8 kHz, 16 bit, mono). They are marked as follows: e.g., 00274_0, T_21_3apl, etc. Each recording includes two speakers. The average duration of the recordings is 50 sec;
BALSAS_LT200SGM. This set consists of 203 audio recordings obtained from the BALSAS_LT200 recordings. Each recording here includes the voice of only one person. The average duration of these segmented voice recordings is 20 sec;
CMR5. This set consists of 10 comparative voice recordings of five known individuals, marked as follows: AV, MM, VB, ZIG, Z. There are two recordings per person: one of GSM standard and one made using a voice recorder. The duration of each comparative recording is at least 80 sec.

During the ROXANNE project, we will use the described voice databases (BALSAS_LT200 and BALSAS_LTv1.1) to assess the recognition accuracy of the existing ASR systems VOICE and BATVOX, as well as of the ROXANNE platform under development. BALSAS_LT200 will also be used to assess the performance of the diarisation module of ROXANNE.

Author:

Dr. Bernardas Šalna

Forensic Science Centre of Lithuania (LTEC)