# Probabilistic modelling for speaker recognition in criminal networks

**Speaker recognition for criminal networks**

Speaker recognition is a technology that uses computer algorithms to analyze speech patterns and determine the identity of the speaker in a recording. Speaker recognition is an integral part of the ROXANNE platform because the identities of speakers in recordings from criminal investigations are usually not known. However, the speaker recognition scenario in criminal investigations differs from most other speaker recognition scenarios because, in addition to the audio of the recordings, we also have access to information about how the recordings are related. In particular, information about who has talked to whom is important. This information forms a *network structure* between the recordings.

Naturally, it is important for the investigation to accurately uncover the network structure. Interestingly, using *prior knowledge* about the *expected network structure* in criminal networks may also improve the accuracy of speaker recognition. For example, assume that there has been verified telephone communication between Alice and Donald, and also between Bob and Donald. In other words, both Alice and Bob know Donald. Since *Alice and Bob are linked via Donald*, there is a reasonable chance that they know each other. Accordingly, if there is a phone call between Alice and an unknown person, we should be more inclined to believe that the unknown person is in fact Bob than we would be if there were no link between Alice and Bob. Indeed, this kind of link analysis has been shown to improve speaker recognition [1, 2].
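The Alice, Bob and Donald example can be illustrated with a small sketch. It is only a heuristic under our own assumptions and not the method used in [1, 2]: the function names, the extra people (Carol, Erin) and the "one plus number of shared contacts" weighting are all hypothetical choices made for illustration.

```python
from collections import defaultdict

def build_contacts(verified_calls):
    """Map each person to the set of people they have verifiably talked to."""
    contacts = defaultdict(set)
    for a, b in verified_calls:
        contacts[a].add(b)
        contacts[b].add(a)
    return contacts

def link_prior(contacts, known_person, candidates):
    """Hypothetical heuristic: weight each candidate by 1 + number of
    contacts shared with known_person, then normalise to probabilities."""
    weights = {
        c: 1 + len(contacts[known_person] & contacts[c]) for c in candidates
    }
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}

# Hypothetical verified calls; Carol and Erin are added for contrast.
verified_calls = [("Alice", "Donald"), ("Bob", "Donald"), ("Carol", "Erin")]
contacts = build_contacts(verified_calls)

# Who is the unknown person talking to Alice? Bob shares Donald with her.
prior = link_prior(contacts, "Alice", ["Bob", "Carol"])
print(prior)
```

Because Bob shares one contact with Alice and Carol shares none, Bob receives the larger prior probability in this toy setting.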

**Probabilistic modelling**

Arguably, the most principled and useful approach to pattern recognition tasks such as speaker recognition is through *probabilistic modelling*. This approach provides a clear mathematical framework for how to combine different pieces of information as well as how to interpret the output from the pattern recognizer.

In the context of speaker recognition, we can illustrate probabilistic modelling and the combination of information with a simple speaker verification scenario. Given two recordings, we would like to know the probability that the speakers in the two recordings are the same. We will denote this probability as *P(same)*. To determine this, we use two pieces of information:

- Our degree of belief that the two speakers are the same before observing the data, expressed as an a priori probability.
- The probability that two recordings would sound as they do if they are from the same speaker and the probability that two recordings would sound as they do if they are from different speakers. These probabilities are provided by our standard speaker recognition model.

Using the rules of probability theory, these pieces of information can then be combined to give us *P(same)*.
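The combination referred to here is Bayes' rule. A minimal sketch follows, assuming the speaker recognition model gives us the two likelihoods directly; the function name and all numbers are hypothetical and chosen only for illustration.

```python
def posterior_same(prior_same, lik_same, lik_diff):
    """Return P(same | data) using Bayes' rule.

    prior_same: a priori probability that the two speakers are the same
    lik_same:   probability of the observed recordings if same speaker
    lik_diff:   probability of the observed recordings if different speakers
    """
    if not 0.0 <= prior_same <= 1.0:
        raise ValueError("prior_same must be a probability in [0, 1]")
    numerator = prior_same * lik_same
    denominator = numerator + (1.0 - prior_same) * lik_diff
    if denominator == 0.0:
        raise ValueError("both hypotheses give zero probability to the data")
    return numerator / denominator

# Hypothetical values: a 10% prior, and audio that is three times more
# likely under the same-speaker hypothesis than under the alternative.
p = posterior_same(prior_same=0.1, lik_same=0.03, lik_diff=0.01)
print(round(p, 4))
```

Note that only the ratio of the two likelihoods matters for the result, which is why speaker recognition systems are often described in terms of likelihood ratios.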

The output from a probabilistic pattern recognizer is easy to interpret. For example, assume that we test our system many times (e.g. 1,000,000) and that in 1000 of the tests we get *P(same)* = 0.75. This means that our system believes that for these cases there is a 75% chance that the speaker is the same, i.e., we expect the speakers in the two recordings to be the same in approximately 750 of the 1000 test cases. A probabilistic pattern recognizer that behaves this way is said to be *well calibrated*. Clearly, this is a desirable behavior of a pattern recognizer, and the most important evaluations of speaker recognition technology, organized by the National Institute of Standards and Technology (NIST), require speaker recognition systems to behave in this way [3].
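One simple way to check this property is to group test trials by predicted probability and compare the average prediction in each group with the fraction of trials that really were same-speaker. The sketch below assumes we have ground-truth labels for the test trials; the function name, the choice of ten bins, and the toy data are our own assumptions, and this is not the metric used in the NIST evaluations.

```python
from collections import defaultdict

def calibration_table(probs, labels, n_bins=10):
    """Group trials into equal-width probability bins and return, per bin,
    (mean predicted P(same), observed fraction of same-speaker trials, count)."""
    bins = defaultdict(list)
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    table = []
    for idx in sorted(bins):
        items = bins[idx]
        mean_p = sum(p for p, _ in items) / len(items)
        frac_same = sum(y for _, y in items) / len(items)
        table.append((mean_p, frac_same, len(items)))
    return table

# Toy data mirroring the text: 1000 trials scored 0.75,
# of which 750 are truly same-speaker (label 1).
probs = [0.75] * 1000
labels = [1] * 750 + [0] * 250
table = calibration_table(probs, labels)
for mean_p, frac_same, count in table:
    print(f"predicted {mean_p:.2f}  observed {frac_same:.2f}  trials {count}")
```

For a well calibrated system the predicted and observed columns should be close in every bin that contains enough trials.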

**Probabilistic modelling of networks**

Given the inherent network structure of recordings from criminal investigations and the clear benefits of probabilistic modelling in speaker recognition tasks, a natural goal for the ROXANNE project is to develop approaches for probabilistic modelling of networks and their combination with speaker recognition systems.

Probabilistic modelling of networks is not a new idea. One of the best-known examples is Google's PageRank algorithm. This algorithm measures how important a web page is by estimating the probability that a user ends up on the page by randomly clicking on links on the internet. Thus a web page will receive a high rank if there are many links pointing towards it, especially if the links are from other pages with high rank. Compared to earlier algorithms that ranked web pages based solely on their content, the link analysis added by PageRank was a game changer in internet search.
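The random-clicking idea can be sketched as a short power iteration. This is a simplified textbook version and not Google's production system; the damping factor of 0.85 (the chance that the user follows a link instead of jumping to a random page), the handling of pages without outgoing links, and the four-page toy web are assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=100):
    """links maps each page to the list of pages it links to.
    Returns a dict of page -> estimated visit probability."""
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Random jump: every page gets a small baseline probability.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly over its outgoing links.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Page without outgoing links: spread its rank over all pages.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank

# Hypothetical toy web: A, B and D all link to C, and C links back to A.
links = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

In this toy example C ranks highest because three pages point to it, and A ranks above B and D because its single incoming link comes from the highly ranked C.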

In the context of speaker recognition on criminal networks, we can use similar approaches to form a *prior belief* about the network structure, for example, who in the criminal network is the likely receiver of a phone call. As discussed earlier, such a prior belief can help improve the accuracy of a speaker recognition system. In principle, our prior belief can be much more complex. For example, it could consider:

- What is the probability that the receiver of a phone call is another person in the criminal network and what is the probability that it is an unrelated person such as a family member?
- What is the probability that a person calls someone he/she has called before?
- How likely is it that a *low-level* guy in the network calls the highest leader?
- What is the probability that two people call each other more than five times in one day?
- Any type of meta-information available, for example “What is the probability that person A will call person B at a given time of day?” or “What is the probability that this call was from person A if it was made from a different country?”
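To make this concrete, the following sketch shows how a prior over candidate receivers could be combined with the output of a speaker recognition model. It is a toy illustration under our own assumptions: the candidate names, the prior weights (imagined here as coming from call history, with an "outsider" category for people unrelated to the network), and the likelihood values are all hypothetical.

```python
def candidate_posteriors(prior_weights, likelihoods):
    """Combine a network-based prior with speaker-model likelihoods.

    prior_weights: candidate -> non-negative weight (need not sum to 1)
    likelihoods:   candidate -> probability of the audio if that
                   candidate is the speaker
    Returns candidate -> posterior probability.
    """
    total_weight = sum(prior_weights.values())
    if total_weight <= 0:
        raise ValueError("prior weights must have a positive sum")
    priors = {c: w / total_weight for c, w in prior_weights.items()}
    joint = {c: priors[c] * likelihoods[c] for c in priors}
    evidence = sum(joint.values())
    if evidence == 0:
        raise ValueError("no candidate can explain the audio")
    return {c: j / evidence for c, j in joint.items()}

# Hypothetical prior: the caller has often called Bob before, rarely Carol,
# and sometimes people outside the network.
prior_weights = {"Bob": 6, "Carol": 1, "outsider": 3}
# Hypothetical likelihoods: the audio alone slightly favours Carol.
likelihoods = {"Bob": 0.02, "Carol": 0.03, "outsider": 0.01}

posterior = candidate_posteriors(prior_weights, likelihoods)
for name, prob in posterior.items():
    print(name, round(prob, 3))
```

Here the audio on its own points to Carol, but once the prior is taken into account Bob becomes the most probable receiver. This is the kind of effect the network prior is meant to have.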

In the longer term, the ROXANNE project also hopes to develop a framework that lets law enforcement officers input their own prior beliefs based on other clues in the criminal investigation.

Needless to say, there are challenges in this research direction, both in formulating reasonable models and in developing computationally efficient algorithms. The potential impact, however, is huge, not only on criminal investigations but also on tasks such as multimedia search or business analytics.

**References**

[1] Ning Gao, Gregory Sell, Douglas W. Oard, and Mark Dredze, *Leveraging Side Information for Speaker Identification with the Enron Conversational Telephone Speech Collection*, 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), no. 2, pp. 577–583, 2017

[2] Mael Fabien, Seyyed Saeed Sarfjoo, Petr Motlicek, and Srikanth Madikeri, *Graph2Speak: Improving Speaker Identification using Network Knowledge in Criminal Conversational Data*, https://arxiv.org/abs/2006.02093

[3] https://sre.nist.gov