ROXSD: a Simulated Dataset of Communication in Organized Crime



The latest ROXSD dataset is available by contacting the data controller (Idiap). Please send an email to: petr dot motlicek at idiap dot ch

The license and data processing agreement will be shared with the interested partner for signature. ROXSD data can only be used for research activities related to the security domain (in the EU). The agreement requires citing the reference paper:

ROXSD: The ROXANNE Multimodal and Simulated Dataset for Advancing Criminal Investigations

Petr Motlicek (1,3), Erinc Dikici (2), Srikanth Madikeri (1), Pradeep Rangappa (1), Miroslav Janosik (2), Gerhard Backfried (2),
Dorothea Thomas-Aniola (2), Maximilian Schurz (2), Johan Rohdin (3), Petr Schwarz (3), Marek Kovac (4), Kvetoslav Maly (4),
Dominik Bobos (4), Mathias Leibiger (5), Costas Kalogiros (6), Andreas Alexopoulos (6), Daniel Kudenko (7), Zahra Ahmadi (7),
Hoang H. Nguyen (7), Aravind Krishnan (8), Dawei Zhu (8), Dietrich Klakow (8), Maria Jofre (9), Francesco Calderoni (9),
Denis Marraud (10), Nikolaos Koutras (11), Nikos Nikolau (12), Christiana Aposkiti (13), Panagiotis Douris (13),
Konstantinos Gkountas (13), Eleni Sergidou (14), Wauter Bosma (14), Joshua Hughes (15), Hellenic Police Team (16)

1 Idiap Research Institute, Martigny, Switzerland
2 HENSOLDT Analytics GmbH, Austria
3 Brno University of Technology, Czech Republic
4 Phonexia, Czech Republic
5 ZITiS, Germany
6 Aegis IT Research, Germany
7 University of Hannover, Germany
8 University of Saarland, Germany
9 Transcrime, Universita Cattolica del Sacro Cuore di Milano, Italy
10 AIRBUS Defence and Space, France
11 ADDITESS, Cyprus
12 ITML, Greece
13 KEMEA, Greece
14 Netherlands Forensic Institute
15 Trilateral Research Ltd, England
16 Hellenic Police Team, Greece

Odyssey 2024, The Speaker and Language Recognition Workshop, Quebec, Canada, 2024. Conference link.

Paper is available here.

Suggested citation:
Petr Motlicek, et al., "ROXSD: The ROXANNE Multimodal and Simulated Dataset for Advancing Criminal Investigations",
in Proceedings of Odyssey 2024: The Speaker and Language Recognition Workshop, Quebec, Canada, 2024.

Previous versions:

First version from 2021:

The first version of the ROXSD description was released in 2021 as a submission to the SPSC symposium. The paper is available here.


Second version (status 5/2023):

This version of the data is described in the D4.3 document, available here.


A short description of ROXSD:

ROXSD audio:

In its latest version, v3.0, the ROXSD calls subset contains 432 intercepted telephone conversations recorded in 481 audio files, encoded as 8 kHz, 16-bit, stereo WAV. The dataset comprises different types of calls: standard phone calls, in which the caller dials the receiver’s telephone number; teleconference calls, in which the caller calls a third person while already talking to the receiver; and calls made to a web conferencing service (Zoom, Webex), where the callers dial a common telephone number (the service’s dial-in number) in order to talk to each other.
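The stated audio format can be checked per file with Python's standard `wave` module. The following is a minimal sketch, not part of the ROXSD tooling: the function names and the `EXPECTED` dictionary are illustrative assumptions, and the code assumes the files are plain PCM WAV files readable by `wave`.

```python
import wave

# Format stated in the dataset description: 8 kHz, 16-bit (2 bytes), stereo.
EXPECTED = {"sample_rate_hz": 8000, "sample_width_bytes": 2, "channels": 2}


def describe_wav(path):
    """Read the header of a WAV file and return its basic format parameters."""
    with wave.open(path, "rb") as wav:
        return {
            "sample_rate_hz": wav.getframerate(),
            "sample_width_bytes": wav.getsampwidth(),
            "channels": wav.getnchannels(),
            "duration_s": wav.getnframes() / wav.getframerate(),
        }


def matches_expected_format(info):
    """Return True if the sample rate, sample width, and channel count match."""
    return all(info[key] == value for key, value in EXPECTED.items())
```

Calling `matches_expected_format(describe_wav(path))` on each recording would flag any file that deviates from the described encoding.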

The difference between the number of calls (432) and the number of recordings (481) arises because some calls were intercepted multiple times, on different sides of the conversation, which is a consequence of the variety in call types: 270 calls are intercepted only on the caller’s side, 111 only on the receiver’s side, and 45 on both sides. There is an additional teleconference call which was intercepted a total of 10 times. As a result, some of the recordings are very similar in content. However, they are not exact copies of each other, for the following reasons:

(i) The interception begins on the caller’s side as soon as the caller finishes dialing the receiver’s telephone number. Hence, the ringing tone, as well as any sounds or speech which the caller’s phone picks up before the connection is established, are captured in the recording intercepted on the caller’s side. For the same reason, the receiver’s intercepted recording is a few seconds shorter than the caller’s. There are also cases where, although both sides are intercepted, the receiver’s phone is not reachable, so there is no recording from the receiver (in such cases, either the receiver’s voice box message or the operator’s out-of-reach message can be heard in the caller’s recording).

(ii) For teleconference calls involving three (or more) parties, a new interception is initiated when the caller calls a third (fourth, ...) person in order to connect them to the existing conversation.

(iii) For web conferencing, where multiple parties call the same (operator) telephone number, each party’s interception begins when they join the conference room.

(iv) The audibility of speech can differ between the two recordings due to background or microphone noise introduced by one of the parties, or issues with their interception equipment.

These inexact copies of the same phone conversation are intentionally left in the dataset, as these artefacts closely reflect the nature of interception in the real world.
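The recording count can be reproduced from the interception figures given above. This is a small arithmetic sketch under one stated assumption: each call intercepted on both sides is counted as contributing two recordings. The variable names are illustrative.

```python
# Interception counts as given in the description.
caller_only = 270                    # one recording each
receiver_only = 111                  # one recording each
both_sides = 45                      # assumed: two recordings each
teleconference_interceptions = 10    # one call, intercepted 10 times

recordings = (
    caller_only
    + receiver_only
    + 2 * both_sides
    + teleconference_interceptions
)
print(recordings)  # 481
```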

Criminal network structure in the ROXSD calls subset. Each individual who took part in the calls subset is represented with a person icon together with their gender, speaker ID and story name. The silhouettes with a headphone indicate voices containing automatic intercept messages of the telecom provider, which are also intercepted by the system, and the badges represent "unknown" persons whose names are mentioned in a conversation. The black lines show the telephone calls between two individuals, with the arrow pointing to the receiver of the call (bidirectional if both parties called one another at different times), and the dotted green lines connect each mentioned person to the parties who referred to them in a call, indicating a common acquaintance.
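The directed call edges described in the caption can be derived from a list of (caller, receiver) pairs. The sketch below is illustrative only: the function names and the speaker IDs in the usage note are hypothetical and do not reflect the actual ROXSD metadata schema.

```python
from collections import defaultdict


def build_call_graph(calls):
    """Count calls per directed (caller, receiver) edge."""
    edges = defaultdict(int)
    for caller, receiver in calls:
        edges[(caller, receiver)] += 1
    return dict(edges)


def bidirectional_pairs(edges):
    """Return the pairs of individuals who called each other in both directions."""
    return {
        tuple(sorted(pair))
        for pair in edges
        if (pair[1], pair[0]) in edges
    }
```

For example, with the hypothetical calls `[("S01", "S02"), ("S02", "S01"), ("S01", "S03")]`, only the pair `("S01", "S02")` would be drawn with a bidirectional arrow.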


ROXSD video:

In order to illustrate the value of exploiting the image modality, ROXSD was complemented with images and videos representative of files which may be found on a seized smartphone or computer, or grabbed from the internet. These are mainly selfie images or videos in which various people are heard and/or seen while observing certain objects or locations. The captured images and videos enable the evaluation of face and scene matching technologies used in the Autocrime platform to enrich the speaker network with additional nodes and edges (for instance, an edge is added between two speakers’ nodes when both persons are found, either through their voice or face, in the same video).
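The edge-enrichment rule mentioned above (link two persons found in the same video) can be sketched as a simple co-occurrence computation. This is not the Autocrime implementation: the function name, the input layout, and the IDs in the usage note are assumptions made for illustration.

```python
from itertools import combinations


def cooccurrence_edges(detections):
    """Build undirected edges between persons detected in the same video.

    detections maps a video ID to the person IDs found in it,
    whether through voice or face matching.
    """
    edges = set()
    for people in detections.values():
        # Deduplicate and sort so each pair is stored once, in a stable order.
        for a, b in combinations(sorted(set(people)), 2):
            edges.add((a, b))
    return edges
```

With the hypothetical input `{"v1": ["S02", "S01", "S02"], "v2": ["S03"]}`, the only edge produced is `("S01", "S02")`; a video with a single detected person adds no edge.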

ROXHOOD - social media:

The ROXHOOD dataset extends ROXSD by adding social media communications.



Modalities covered: audio (voice), video, text (including social media), metadata (speakers, devices, telephone numbers, location, network of people).


Technologies that can benefit from the ROXSD data:

Audio: multilingual speech recognition, speaker identification (open set), speaker clustering, language recognition, voice activity detection, word boosting

Text: multilingual entity recognition, multilingual topic detection, co-reference resolution, relation extraction

Video: face characterisation, scene characterisation

Network analysis: social influence, outlier detection, community detection, link prediction, cross-network analysis


Accessing the ROXSD database:

The ROXSD dataset is part of the project's foreground and will therefore be made available to other bodies and institutions (as required by the Grant Agreement) for further research and development in security-related areas.

For more information, please contact: petr dot motlicek at idiap dot ch