Research work in ROXANNE

Throughout the course of the project, the partners have explored many research topics, and many of the results have already been published in scientific journals. Here we provide a brief summary of the research topics on which partners are currently considering submitting new journal publications.


The Consortium put considerable effort into preparing the project’s own research dataset, called the ROXANNE Simulated Dataset (ROXSD). In May 2022, the partners met in Munich, Germany, for the third and final round of data collection and extended the dataset not only with more conversational speech but also with new modalities such as video, text and geolocation. In its final version, v3.0, ROXSD contains around 18.5 hours of intercepted telephone conversations in 13 different languages, 1.5 hours of video recordings, and more than 400 text messages, contributed by 104 speakers and 42 authors. Ground-truth annotations such as anonymized telephone and IMEI numbers, speaker labels, age and gender, transcriptions of the calls, and the date and time of the calls and messages are provided together with the raw data. ROXSD aims to serve as a multimodal and multilingual dataset of communication in organized crime, based on a fictional but realistic story that takes into account the constraints and challenges of a real investigation.

Speaker recognition

The speaker recognition research and development in ROXANNE has focused on several issues that arise when applying such systems to data from criminal networks.

  • Speaker enrollment with mono recordings: When using speaker recognition technologies, the user typically wants to register (a.k.a. enroll) some speakers in the system. This means that a model for each speaker to enroll is created from one or more example recordings of the speaker’s voice. A common difficulty with speech data from real criminal cases is that it is often stored as mono, i.e., the two speakers in a call are mixed into one recording. A technique called “speaker diarization” can separate the two speakers. However, in order to create a speaker model, we need to know which of the two obtained speakers is our target speaker. Of course, this can be checked manually, but when the number of speakers to enroll is large, this becomes time-consuming for the user. Intuitively, this problem can be solved if we have more than one recording with the target speaker and the partner speakers are different in at least two of those recordings. For example, we can run diarization on each call and compare the obtained speakers across calls: the speakers that are most similar to each other across the calls correspond to the target speaker. Within the project, we investigated several approaches along these lines.
  • Clustering with network structure: The speaker recognition scenario in criminal investigations differs from most other speaker recognition scenarios because, in addition to the audio of the recordings, we also have access to information about how the recordings are related. In particular, information about who has talked to whom is important. This information forms a network structure between recordings. Within the project, we have investigated approaches that take into account the expected structure of the social network in speaker identification and clustering.
  • Speaker recognition with linguistic features: Standard speaker recognition uses acoustic features for comparing speakers, i.e., it analyses the sound of the speakers’ voices. As an alternative and complement, we have explored text-based speaker recognition, which is based on vocabulary usage. In the context of criminal investigations, text-based speaker recognition has the advantage that it works even if the speakers use voice conversion to confuse speaker recognition systems.
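The idea behind enrollment from mono recordings can be illustrated with a minimal sketch. It assumes that diarization has already produced one embedding vector per speaker in each call; the function names, the toy three-dimensional embeddings and the exhaustive search are illustrative assumptions, not the approaches investigated in the project.

```python
import math
from itertools import product


def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def pick_target_speakers(calls):
    """Pick one diarized speaker per call so that the picked speakers
    are as similar to each other as possible across calls.

    calls: list of calls, each a list of per-speaker embeddings.
    Returns the index of the chosen speaker in each call.
    Exhaustive search, so only suitable for a handful of calls.
    """
    best_choice, best_score = None, -math.inf
    for choice in product(*(range(len(c)) for c in calls)):
        picked = [calls[i][k] for i, k in enumerate(choice)]
        # Sum of pairwise similarities between the picked speakers.
        score = sum(
            cosine(picked[i], picked[j])
            for i in range(len(picked))
            for j in range(i + 1, len(picked))
        )
        if score > best_score:
            best_choice, best_score = choice, score
    return list(best_choice)


# Toy data (invented): the target speaker appears in all three calls,
# each time with a different partner.
calls = [
    [[0.9, 0.1, 0.0], [0.0, 1.0, 0.1]],  # target, partner A
    [[0.1, 0.0, 1.0], [1.0, 0.2, 0.1]],  # partner B, target
    [[0.8, 0.0, 0.1], [0.5, 0.5, 0.7]],  # target, partner C
]
print(pick_target_speakers(calls))
```

Because the partners differ between calls, only the target speaker's embeddings agree with each other across all recordings, which is what the pairwise similarity score rewards.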


Speech recognition 

The research in speech recognition has mainly focused on boosting the chances that certain important words occur in the output of the speech recognition system. There are two reasons to do this, namely:

  1.  The word is more common in the test domain than in the training domain. Boosting then corrects this mismatch to some extent.
  2.  A false rejection (failure to detect that the word is present) is much more serious than a false acceptance (incorrectly detecting the word even though it is not there). This is often the case when the output of the speech recognition system is processed further by, e.g., a named entity recognition system, because such a system cannot recognize a word as an entity if the word is not present in the transcript.
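One simple way to picture word boosting is to rescore an n-best list of hypotheses and add a bonus for each occurrence of an important word. The sketch below is only an illustration under that assumption: the text above does not specify how boosting is implemented in ROXANNE, and the function name, the scores and the bonus value are hypothetical.

```python
def rescore(nbest, boost_words, bonus=1.0):
    """Return the best hypothesis after boosting.

    nbest: list of (text, log_score) pairs from a recognizer.
    boost_words: set of lowercase words whose presence should be rewarded.
    bonus: value added to the log score per boosted word occurrence.
    """
    rescored = []
    for text, score in nbest:
        hits = sum(1 for word in text.lower().split() if word in boost_words)
        rescored.append((text, score + bonus * hits))
    # Highest boosted score wins.
    return max(rescored, key=lambda pair: pair[1])[0]


# Invented example: without boosting, the recognizer prefers a
# hypothesis that misses the location name.
nbest = [
    ("the package is in burn oh", -10.2),
    ("the package is in brno", -10.9),
]
print(rescore(nbest, boost_words={"brno"}))
```

A larger bonus lowers the risk of a false rejection at the cost of more false acceptances, which matches the trade-off described in the second reason above.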


Natural language processing (NLP)

This research has focused on three closely related tasks.

  • Named entity recognition (NER): This technology aims to detect mentions of entities, such as names of persons and places, in text. Within the project we have focused on the entities PERSON and LOCATION, but with the training pipeline developed by the ROXANNE partners one can easily add new entities to the system.
  • Mention disambiguation: The persons mentioned in a call can either be third parties (when the speakers talk about a person not taking part in the call) or one of the parties in the call (such mentions usually appear when the speakers greet each other). An important step is therefore to disambiguate these mentions into Third Party or Party before the Phone Network is modified. We call this mention disambiguation. The ROXANNE partners have developed a method for this based on so-called co-reference resolution. Co-reference resolution is the task of linking all linguistic expressions that refer to the same entity, for example a name and its corresponding pronoun. If, for example, the name “Carl” is linked to the pronoun “he”, the system can infer that Carl is a third party. This approach has been further combined with a rule-based approach for better accuracy.
  • Relation extraction: This technology operates on the entities detected by the NER system and detects whether there is a relation between them. For example, if the sentence is “Carl is in Brno where he eats spaghetti”, the NER system detects that Carl is a PERSON and Brno is a LOCATION, and the relation extraction system detects that their relation is current location, i.e., Brno is the current location of Carl.
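The mention disambiguation step can be sketched as follows, assuming a co-reference system has already grouped each name with the expressions that refer to it. The third-person rule follows the “Carl”/“he” example above; the second-person rule, the pronoun lists and the function name are assumptions made for illustration and are not the project's actual rules.

```python
THIRD_PERSON = {"he", "she", "him", "her", "his", "hers", "they", "them", "their"}
SECOND_PERSON = {"you", "your", "yours"}  # assumed cue for a party in the call


def disambiguate(clusters):
    """Label each name as 'Third Party', 'Party' or 'Unknown'.

    clusters: dict mapping a name to the list of mentions that a
    co-reference system (assumed, not shown) linked to that name.
    """
    labels = {}
    for name, mentions in clusters.items():
        words = {mention.lower() for mention in mentions}
        if words & THIRD_PERSON:
            labels[name] = "Third Party"
        elif words & SECOND_PERSON:
            labels[name] = "Party"
        else:
            labels[name] = "Unknown"
    return labels


# Invented utterance: "Hi Anna, did you talk to Carl? He is in Brno."
clusters = {
    "Anna": ["Anna", "you"],
    "Carl": ["Carl", "He"],
}
print(disambiguate(clusters))
```

Names left as "Unknown" are the cases where additional rules, such as greeting patterns, would be needed.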


Network analysis

As a crime typically involves an offender and a target and often occurs at a specific place and time, predictive policing techniques should answer the question of who will commit a crime and who will be targeted. We mainly concentrated on answering this question through advanced machine-learning approaches. In particular, our recent research aims to address the following research questions: (RQ1) Knowing a network of offenders and their previous collaborations, can we predict potential future burglary attempts by the existing offenders in the network? (RQ2) Knowing the history of crimes and their offenders, can we narrow down the investigation of a new case to a list of potential offenders?


Figure 1 - An Example of Burglary Offenders Network

To this end, we use a comprehensive burglary dataset of over 30,000 real-life case reports, which we transform into a bipartite graph of offenders and criminal cases to ultimately build a network of criminals based on the information collected. We then determine the co-offense likelihood for known criminals. By proposing different machine learning methods, we aim to contribute to predictive policing and crime linkage research with a general, parsimonious, automated approach.
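The transformation from case reports to a network of offenders can be sketched as a projection of the bipartite graph onto the offenders. In the sketch below, the Jaccard score over shared co-offenders is a simple stand-in heuristic for the co-offense likelihood, not one of the machine learning methods proposed in the project, and the case data and function names are invented.

```python
from collections import defaultdict
from itertools import combinations


def co_offender_network(case_offenders):
    """Project a bipartite case/offender graph onto offenders.

    case_offenders: dict mapping a case id to the set of offender ids.
    Returns a dict mapping each offender to the set of co-offenders.
    """
    network = defaultdict(set)
    for offenders in case_offenders.values():
        for a, b in combinations(sorted(offenders), 2):
            network[a].add(b)
            network[b].add(a)
    return dict(network)


def jaccard_score(network, a, b):
    """Share of co-offenders that a and b have in common (0.0 to 1.0)."""
    neighbours_a = network.get(a, set()) - {b}
    neighbours_b = network.get(b, set()) - {a}
    union = neighbours_a | neighbours_b
    return len(neighbours_a & neighbours_b) / len(union) if union else 0.0


# Toy case reports (invented): each case lists the offenders involved.
cases = {
    "c1": {"A", "B"},
    "c2": {"B", "C"},
    "c3": {"A", "D"},
    "c4": {"C", "D"},
    "c5": {"D", "E"},
}
network = co_offender_network(cases)
print(jaccard_score(network, "A", "C"))
print(jaccard_score(network, "A", "E"))
```

Pairs of offenders who have never been recorded together but share many co-offenders receive a high score, which is the kind of signal a learned model for RQ1 could build on.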



  • Ethical and legal partners in ROXANNE have been conducting extensive research into ethical issues that could affect technologies like ROXANNE. This includes a global survey of law enforcement agencies (LEAs) via the INTERPOL network. The survey asked volunteer respondents how the use of biometric analysis technologies is subject to oversight and how evidence is presented in court. The responses have been analysed, and ethical and legal partners are exploring how the results can be merged with other research outputs to develop recommendations for LEAs that have procured, or are considering procuring, AI technologies to use in investigations.


Naturally, it is very challenging to write papers in the final phase of the project, since there are many other activities. Also, suitable evaluation data is often scarce, which raises concerns about the statistical significance of the results. Nevertheless, we believe the chances of publishing on many of the above topics are good. We look forward to sharing our results with you in the future.