Datasets

This is a list of datasets that use Freesound content, sorted alphabetically. Do you have a dataset that uses Freesound and does not appear here? Please send us an email at freesound@freesound.org!

  • ARCA23K

    ARCA23K is a dataset of labelled sound events created to investigate real-world label noise. It contains 23,727 audio clips originating from Freesound, and each clip belongs to one of 70 classes taken from the AudioSet ontology. The dataset was created using an entirely automated process with no manual verification of the data. For this reason, many clips are expected to be labelled incorrectly.

    In addition to ARCA23K, this release includes a companion dataset called ARCA23K-FSD, which is a single-label subset of the FSD50K dataset. ARCA23K-FSD contains the same sound classes as ARCA23K and the same number of audio clips per class. As it is a subset of FSD50K, each clip and its label have been manually verified. Note that only the ground truth data of ARCA23K-FSD is distributed in this release. To download the audio clips, please visit the Zenodo page for FSD50K.

    The source code used to create the datasets is available: https://github.com/tqbl/arca23k-dataset


    Turab Iqbal, Yin Cao, Andrew Bailey, Mark D. Plumbley, Wenwu Wang


    University of Surrey, Centre for Vision, Speech and Signal Processing, Audio Research Group


  • Clotho Dataset

    Clotho is a novel audio captioning dataset, consisting of 4981 audio samples, and each audio sample has five captions (a total of 24,905 captions). Audio samples are of 15 to 30 s duration and captions are eight to 20 words long. Audio samples are collected from Freesound.


    Konstantinos Drossos, Samuel Lipping, Tuomas Virtanen


    Tampere University


  • Dataset used in COALA: Co-Aligned Autoencoders for Learning Semantically Enriched Audio Representations

    This dataset consists of two HDF5 files that contain pre-computed log-mel spectrograms used to train audio embedding models. The dataset is split into a training set and a validation set containing 170,793 and 19,103 spectrogram patches respectively, with their accompanying multi-hot encoded tags from a vocabulary of 1000 tags provided by Freesound users (a minimal loading sketch is given at the end of this entry).

    More details can be found in “COALA: Co-Aligned Autoencoders for Learning Semantically Enriched Audio Representations” by X. Favory, K. Drossos, T. Virtanen, and X. Serra. (arXiv)


    Xavier Favory, Konstantinos Drossos, Tuomas Virtanen, Xavier Serra


    Music Technology Group (MTG), Universitat Pompeu Fabra

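    The snippet below is a minimal, hypothetical sketch of how such HDF5 files could be read with h5py; the file name and the internal dataset keys ("spectrograms", "tags") are assumptions made for illustration and may differ from the actual release.

      import h5py

      # Hypothetical file name and dataset keys; adjust to the actual release.
      with h5py.File("coala_training_patches.h5", "r") as f:
          specs = f["spectrograms"][:32]  # assumed: (N, mel_bands, frames) log-mel patches
          tags = f["tags"][:32]           # assumed: (N, 1000) multi-hot tag vectors

      print(specs.shape, tags.shape)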

  • DBR Dataset

    The DBR dataset is an environmental audio dataset created for the Bachelor’s Seminar in Signal Processing at Tampere University of Technology. The samples in the dataset were collected from the online audio database Freesound. The dataset consists of three classes, each containing 50 samples, and the classes are ‘dog’, ‘bird’, and ‘rain’ (hence the name DBR).


    Ville-Veikko Eklund


    Tampere University


  • DCASE2019 Task 4 - Synthetic data

    The synthetic set is composed of 10-second audio clips generated with Scaper. The foreground events are obtained from a subset of the FSD dataset from Freesound. Each event audio clip was verified manually to ensure that the sound quality and the event-to-background ratio were sufficient for it to be used as an isolated event. We also verified that the event was actually dominant in the clip and checked that the event onset and offset were present in the clip.


    Turpault Nicolas, Serizel Romain, Salamon Justin, Shah Ankit Parag


    Université de Lorraine, CNRS, Inria, Language Technologies Institute, Carnegie Mellon University, Adobe Research


  • DESED Synthetic

    This dataset is the synthetic part of the DESED dataset. It allows creating new mixtures from isolated foreground sounds and background sounds, and was used in the DCASE 2019 and DCASE 2020 Task 4 challenges.


    Turpault, Nicolas, Serizel, Romain, Salamon, Justin, Shah, Ankit, Wisdom, Scott, Hershey, John, Erdogan, Hakan


    Université de Lorraine, CNRS, Inria, Language Technologies Institute, Carnegie Mellon University, Adobe Research, Google, Inc


  • Divide and Remaster (DnR)

    Divide and Remaster (DnR) is a source separation dataset for training and testing algorithms that separate a monaural audio signal into speech, music, and sound effects/background stems. The dataset is composed of artificial mixtures using audio from LibriSpeech, the Free Music Archive (FMA), and the Freesound Dataset 50k (FSD50K). We introduce it as part of the Cocktail Fork Problem paper: https://arxiv.org/abs/2110.09958


    Darius Petermann, Gordon Wichern, Zhong-Qiu Wang, Jonathan Le Roux


    Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA, USA, Indiana University, Department of Intelligent Systems Engineering, Bloomington, IN, USA


  • Emo-Soundscapes

    We propose a dataset of audio samples called Emo-Soundscapes and two evaluation protocols for machine learning models. We curated 600 soundscape recordings from Freesound.org and mixed 613 audio clips from combinations of these. The Emo-Soundscapes dataset contains 1213 6-second Creative Commons licensed audio clips. We collected the ground truth annotations of perceived emotion in the 1213 soundscape recordings using a crowdsourced listening experiment, in which 1182 annotators from 74 different countries ranked the audio clips according to perceived valence/arousal. This dataset allows studying soundscape emotion recognition and how the mixing of various soundscape recordings influences their perceived emotion.


    Jianyu Fan, Miles Thorogood, Philippe Pasquier


    Simon Fraser University


  • ESC-50: Dataset for Environmental Sound Classification

    The ESC-50 dataset is a labeled collection of 2000 environmental audio recordings suitable for benchmarking methods of environmental sound classification. The dataset consists of 5-second-long recordings organized into 50 semantic classes (with 40 examples per class) loosely arranged into 5 major categories. Clips in this dataset have been manually extracted from public field recordings gathered from Freesound (a brief usage sketch is given at the end of this entry).


    Piczak, Karol J.


    Warsaw University of Technology

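    As a usage illustration (not part of the dataset description above), the sketch below assumes the esc50.csv metadata file distributed with the dataset, with filename, fold and category columns, and builds a simple train/test split by fold; the path and column names are assumptions.

      import pandas as pd

      # Assumed location and layout of the metadata file distributed with ESC-50.
      meta = pd.read_csv("ESC-50-master/meta/esc50.csv")

      test_fold = 5
      train_df = meta[meta["fold"] != test_fold]
      test_df = meta[meta["fold"] == test_fold]

      print(len(train_df), "train clips,", len(test_df), "test clips")
      print(sorted(train_df["category"].unique())[:5])  # a few of the 50 classes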

  • freefield1010

    This dataset contains 7690 10-second audio files in a standardised format, extracted from contributions on the Freesound archive which were labelled with the “field-recording” tag (an illustrative tag query is sketched at the end of this entry). Note that the original tagging (as well as the audio submission) is crowdsourced, so the dataset is not guaranteed to consist purely of “field recordings” as might be defined by practitioners. The intention is to represent the content of an archive collection on such a topic, rather than to represent a controlled definition of such a topic.


    Dan Stowell


    Centre for Digital Music (C4DM), Queen Mary University of London

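    The request below is an illustrative sketch of this kind of tag-based query against the Freesound APIv2 text search endpoint; it is not the original collection code, and the API token is a placeholder.

      import requests

      API_TOKEN = "YOUR_FREESOUND_API_TOKEN"  # placeholder

      resp = requests.get(
          "https://freesound.org/apiv2/search/text/",
          params={
              "filter": 'tag:"field-recording"',
              "fields": "id,name,duration,license",
              "page_size": 50,
              "token": API_TOKEN,
          },
      )
      resp.raise_for_status()
      for sound in resp.json()["results"]:
          print(sound["id"], sound["name"], sound["duration"])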

  • Freesound Loop Dataset

    This dataset contains 9,455 loops from Freesound.org and the corresponding annotations. These loops have tempo, key, genre and instrumentation annotations made by MIR researchers.

    More details can be found in “The Freesound Loop Dataset and Annotation Tool” by António Ramires et al. (Paper)


    António Ramires, Frederic Font, Dmitry Bogdanov, Jordan B. L. Smith, Yi-Hsuan Yang, Joann Ching, Bo-Yu Chen, Yueh-Kao Wu, Hsu Wei-Han, Xavier Serra


    Music Technology Group (MTG), Universitat Pompeu Fabra, TikTok, Research Center for IT Innovation, Academia Sinica


  • Freesound Loops 4k (FSL4)

    This dataset contains ~4000 user-contributed loops uploaded to Freesound. Loops were selected by searching Freesound for sounds with the query terms loop and bpm, and then automatically parsing the returned sound filenames, tags and textual descriptions to identify tempo annotations made by users. For example, a sound containing the tag 120bpm is considered to have a ground truth of 120 BPM (see the parsing sketch at the end of this entry).


    Frederic Font


    Music Technology Group (MTG), Universitat Pompeu Fabra

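    The sketch below illustrates the kind of pattern matching described above; it is not the authors' exact parsing code, just a minimal example of mapping a tag or filename such as 120bpm to a tempo ground truth.

      import re

      BPM_PATTERN = re.compile(r"(\d{2,3})\s*bpm", re.IGNORECASE)

      def extract_bpm(text):
          """Return the first plausible BPM annotation found in `text`, or None."""
          match = BPM_PATTERN.search(text)
          if match:
              bpm = int(match.group(1))
              if 40 <= bpm <= 300:  # discard implausible tempo values
                  return bpm
          return None

      print(extract_bpm("techno_drum_loop_120bpm.wav"))  # -> 120
      print(extract_bpm("ambient pad loop"))             # -> None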

  • Freesound One-Shot Percussive Sounds

    This dataset contains 10254 one-shot (single event) percussive sounds from Freesound and the corresponding timbral analysis. These were used to train the generative model for Neural Percussive Synthesis Parameterised by High-Level Timbral Features.


    António Ramires, Pritish Chandna, Xavier Favory, Emilia Gómez, Xavier Serra


    Music Technology Group (MTG), Universitat Pompeu Fabra


  • FSD-FS

    FSD-FS is a publicly available database of human-labelled sound events for few-shot learning. It spans 143 classes obtained from the AudioSet Ontology and contains 43,030 raw audio files collected from FSD50K. FSD-FS is curated at the Centre for Digital Music, Queen Mary University of London (an illustrative episode-sampling sketch is given at the end of this entry).


    Jinhua Liang, Huy Phan, Emmanouil Benetos


    Centre for Digital Music, Queen Mary University of London

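    As an illustration of the few-shot setting (not code from the FSD-FS release), the sketch below samples an N-way K-shot episode from a hypothetical mapping of class labels to audio file paths.

      import random

      def sample_episode(files_by_class, n_way=5, k_shot=5, n_query=5):
          """Sample support and query sets for one N-way K-shot episode."""
          classes = random.sample(list(files_by_class), n_way)
          support, query = [], []
          for label in classes:
              files = random.sample(files_by_class[label], k_shot + n_query)
              support += [(f, label) for f in files[:k_shot]]
              query += [(f, label) for f in files[k_shot:]]
          return support, query

      # files_by_class is a hypothetical dict, e.g. {"Bark": ["1234.wav", ...], ...}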

  • FSD50K

    FSD50K is an open dataset of human-labeled sound events containing 51,197 Freesound clips unequally distributed in 200 classes drawn from the AudioSet Ontology. Please check the Zenodo FSD50K page for more information about the dataset, download and citation information.


    Eduardo Fonseca, Xavier Favory, Jordi Pons, Frederic Font, Xavier Serra, All the annotators!!!


    Music Technology Group (MTG), Universitat Pompeu Fabra


  • FSDKaggle2018

    FSDKaggle2018 is an audio dataset containing 11,073 audio files annotated with 41 labels of the AudioSet Ontology. FSDKaggle2018 has been used for the DCASE Challenge 2018 Task 2, which was run as a Kaggle competition titled Freesound General-Purpose Audio Tagging Challenge.


    Eduardo Fonseca, Xavier Favory, Jordi Pons, Frederic Font, Manoj Plakal, Daniel P. W. Ellis, Xavier Serra


    Music Technology Group (MTG), Universitat Pompeu Fabra, Google Research’s Machine Perception Team


  • FSDKaggle2019

    FSDKaggle2019 is an audio dataset containing 29,266 audio files annotated with 80 labels of the AudioSet Ontology. FSDKaggle2019 has been used for the DCASE Challenge 2019 Task 2, which was run as a Kaggle competition titled Freesound Audio Tagging 2019. The dataset allows development and evaluation of machine listening methods in conditions of label noise, minimal supervision, and real-world acoustic mismatch. FSDKaggle2019 consists of two train sets and one test set. One train set and the test set consist of manually labeled data from Freesound, while the other train set consists of noisily labeled web audio data from Flickr videos taken from the YFCC dataset.


    Eduardo Fonseca, Manoj Plakal, Frederic Font, Daniel P. W. Ellis, Xavier Serra


    Music Technology Group (MTG), Universitat Pompeu Fabra, Google Research’s Machine Perception Team


  • FSDnoisy18k

    FSDnoisy18k is an audio dataset collected with the aim of fostering the investigation of label noise in sound event classification. It contains 42.5 hours of audio across 20 sound classes, including a small amount of manually-labeled data and a larger quantity of real-world noisy data.


    Eduardo Fonseca, Mercedes Collado, Manoj Plakal, Daniel P. W. Ellis, Frederic Font, Xavier Favory, Xavier Serra


    Music Technology Group (MTG), Universitat Pompeu Fabra, Google Research’s Machine Perception Team


  • GISE-51

    GISE-51 is an open dataset of 51 isolated sound events based on the FSD50K dataset. The release also includes the GISE-51-Mixtures subset, a dataset of 5-second soundscapes with up to three sound events synthesized from GISE-51. The GISE-51 release attempts to address some of the shortcomings of recent sound event datasets, providing an open, reproducible benchmark for future research and the freedom to adapt the included isolated sound events for domain-specific applications, which was not possible using existing large-scale weakly labelled datasets. The release also includes accompanying code for baseline experiments, which can be found at https://github.com/SarthakYadav/GISE-51-pytorch.


    Sarthak Yadav, Mary Ellen Foster


    University of Glasgow, UK


  • Good-sounds dataset

    This dataset contains monophonic recordings of musical instruments playing two kinds of exercises: single notes and scales. The recordings were made in the Universitat Pompeu Fabra / Phonos recording studio by 15 different professional musicians, all of them holding a music degree and having some expertise in teaching, and were uploaded to Freesound. Twelve different instruments were recorded using between one and four different microphones (depending on the recording session). For each instrument, the whole set of playable semitones is recorded several times with different tonal characteristics. Each note is recorded into a separate mono .flac audio file at 48 kHz and 32 bits. The tonal characteristics are explained in the dataset documentation and the related publication.


    Oriol Romani Picas, Hector Parra Rodriguez, Dara Dabiri, Xavier Serra


    Music Technology Group (MTG), Universitat Pompeu Fabra


  • LAION-Audio-630K

    LAION-Audio-630K is the largest publicly available audio-text dataset, an order of magnitude larger than previous audio-text datasets (as of 2022-11-05). Notably, it combines eight distinct datasets, including 3000 hours of audio content from Freesound.


    LAION Team, Yusong Wu, Ke Chen, Tianyu Zhang, Marianna Nezhurina, Yuchen Hui


    LAION


  • LibriFSD50K

    LibriFSD50K is a dataset that combines speech data from the LibriSpeech dataset and noise data from FSD50K (see the mixing sketch at the end of this entry). More information can be found in the paper: https://arxiv.org/pdf/2105.04727.pdf


    Efthymios Tzinis, Jonah Casebeer, Zhepei Wang, Paris Smaragdis


    University of Illinois at Urbana-Champaign, Department of Computer Science, Urbana, IL, USA, Adobe Research, San Jose, CA, USA

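    The function below is a generic sketch of mixing a speech signal with a noise signal at a target SNR, the basic operation implied by combining LibriSpeech speech with FSD50K noise; it is not the authors' generation code.

      import numpy as np

      def mix_at_snr(speech, noise, snr_db):
          """Mix `speech` and `noise` (1-D float arrays) at the given SNR in dB."""
          n = min(len(speech), len(noise))
          speech, noise = speech[:n], noise[:n]
          speech_power = np.mean(speech ** 2) + 1e-12
          noise_power = np.mean(noise ** 2) + 1e-12
          # Scale the noise so that 10*log10(speech_power / scaled_noise_power) == snr_db.
          gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
          return speech + gain * noise

      # Example: mixture = mix_at_snr(speech_waveform, noise_waveform, snr_db=5.0)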

  • MAESTRO Synthetic - Multi-Annotator Estimated Strong Labels

    MAESTRO synthetic contains 20 synthetic audio files created using Scaper, each of them 3 minutes long. The dataset was created for studying annotation procedures for strong labels using crowdsourcing. The audio files contain sounds from the following classes: car_horn, children_voices, dog_bark, engine_idling, siren, street_music. Audio files contain excerpts of recordings uploaded to Freesound. Please see FREESOUNDCREDITS.txt for an attribution list.

    Audio files are generated using Scaper, with small changes to the synthesis procedure: sounds are placed at random intervals, controlling for a maximum polyphony of 2. Intervals between two consecutive events are selected at random, but limited to 2-10 seconds. Event classes and event instances are chosen uniformly and mixed with a signal-to-noise ratio (SNR) randomly selected between 0 and 20 dB over a Brownian noise background. Having two overlapping events from the same class is avoided (a simplified scheduling sketch is given at the end of this entry).


    Irene Martin Morato, Manu Harju, Annamaria Mesaros


    Machine Listening Group, Tampere University

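    The sketch below is a simplified illustration of the scheduling rules described above (random 2-10 s gaps between event onsets, uniform class choice, per-event SNR drawn between 0 and 20 dB); it is not the actual Scaper-based generation script, and polyphony capping and same-class overlap avoidance are omitted for brevity.

      import random

      CLASSES = ["car_horn", "children_voices", "dog_bark",
                 "engine_idling", "siren", "street_music"]

      def schedule_events(scene_duration=180.0, event_duration=5.0):
          """Place events with random 2-10 s gaps and per-event SNR in [0, 20] dB."""
          events, t = [], 0.0
          while True:
              t += random.uniform(2.0, 10.0)          # gap before the next onset
              if t + event_duration > scene_duration:
                  break
              events.append({
                  "label": random.choice(CLASSES),    # uniform class choice
                  "onset": round(t, 2),
                  "snr_db": round(random.uniform(0.0, 20.0), 1),
              })
          return events

      print(schedule_events()[:3])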

  • SimSceneTVB Learning

    This is a dataset of 600 simulated sound scenes of 45 s each representing urban sound environments, simulated using the simScene Matlab library. The dataset is divided into two parts, a train subset (400 scenes) and a test subset (200 scenes), for the development of learning-based models. Each scene is composed of three main sources (traffic, human voices and birds) according to an original scenario, which is composed semi-randomly conditioned on five ambiances: park, quiet street, noisy street, very noisy street and square. Separate channels for the contribution of each source are available. The base audio files used for simulation are obtained from Freesound and LibriSpeech. The sound scenes are scaled according to a playback sound level in dB, which is drawn randomly but remains plausible according to the ambiance.


    Felix Gontier, Mathieu Lagrange, Pierre Aumond, Catherine Lavandier, Jean-François Petiot


    Ecole Centrale de Nantes, University of Paris Seine, University of Cergy-Pontoise, ENSEA


  • SimSceneTVB Perception

    This is a corpus of 100 sound scenes of 45s each representing urban sound environments, including: 6 scenes recorded in Paris; 19 scenes simulated using simScene to replicate recorded scenarios, including the 6 recordings in this corpus; 75 scenes simulated using simScene with diverse new scenarios, containing traffic, human voices and bird sources. The base audio files used for simulation are obtained from Freesound and LibriSpeech. The sound scenes are scaled according to a playback sound level in dB, which is drawn randomly but remains plausible according to the ambiance.


    Felix Gontier, Mathieu Lagrange, Pierre Aumond, Catherine Lavandier, Jean-François Petiot


    Ecole Centrale de Nantes, University of Paris Seine, University of Cergy-Pontoise, ENSEA


  • Sound Events for Surveillance Applications

    The Sound Events for Surveillance Applications (SESA) dataset files were obtained from Freesound. The dataset is divided between train (480 files) and test (105 files) folders. All audio files are mono WAV, 16 kHz, 8-bit, with durations of up to 33 seconds. The classes are: 0 - Casual (not a threat), 1 - Gunshot, 2 - Explosion, 3 - Siren (also contains alarms).


    Tito Spadini


  • TUT Rare Sound Events 2017

    TUT Rare Sound Events 2017 is a dataset used in the DCASE Challenge 2017 Task 2, focused on the detection of rare sound events in artificially created mixtures. It consists of isolated sound events for each target class and recordings of everyday acoustic scenes to serve as background. The target sound event categories are: Baby crying, Glass breaking, Gunshot. The background audio is part of the TUT Acoustic Scenes 2016 dataset, and the isolated sound examples were collected from Freesound. Selection of sounds from Freesound was based on the exact label, selecting examples with a sampling frequency of 44.1 kHz or higher.


    Aleksandr Diment, Annamaria Mesaros, Toni Heittola


    Tampere University of Technology


  • Urban Sound 8K

    This dataset contains 8732 labeled sound excerpts (<=4s) of urban sounds from 10 classes: air_conditioner, car_horn, children_playing, dog_bark, drilling, engine_idling, gun_shot, jackhammer, siren, and street_music. The classes are drawn from the urban sound taxonomy. All excerpts are taken from field recordings uploaded to Freesound.


    Justin Salamon, Christopher Jacoby, Juan Pablo Bello


    Music and Audio Research Laboratory (MARL), New York University, Center for Urban Science and Progress (CUSP), New York University


  • USM-SED

    USM-SED is a dataset for polyphonic sound event detection in urban sound monitoring use-cases. Based on isolated sounds taken from the FSD50K dataset, 20,000 polyphonic soundscapes are synthesized, with sounds randomly positioned in the stereo panorama at different loudness levels (a minimal panning and mixing sketch is given at the end of this entry).


    Jakob Abeßer


    Semantic Music Technologies Group, Fraunhofer IDMT

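    The sketch below illustrates the two operations mentioned above, random stereo positioning and per-sound loudness scaling, using a constant-power panning law; it is an illustration, not the USM-SED synthesis code.

      import numpy as np

      def pan_and_scale(mono, pan, gain_db):
          """Place a mono signal in the stereo panorama (pan in [-1, 1]) at a given gain."""
          theta = (pan + 1.0) * np.pi / 4.0      # constant-power panning law
          gain = 10 ** (gain_db / 20.0)
          left = gain * np.cos(theta) * mono
          right = gain * np.sin(theta) * mono
          return np.stack([left, right], axis=0)  # shape: (2, n_samples)

      rng = np.random.default_rng(0)
      stereo = pan_and_scale(rng.standard_normal(16000),
                             pan=rng.uniform(-1, 1),
                             gain_db=rng.uniform(-12, 0))
      print(stereo.shape)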

  • Vocal Imitation Set v1.1.3

    The VocalImitationSet is a collection of crowd-sourced vocal imitations of a large set of diverse sounds collected from Freesound, which were curated based on Google’s AudioSet ontology. We expect that this dataset will help research communities obtain a better understanding of human vocal imitation and build machines that understand imitations as humans do.


    Bongjun Kim, Bryan Pardo


    Dept. of Electrical Engineering and Computer Science, Northwestern University


  • WavCaps

    WavCaps is the first large-scale weakly-labelled audio captioning dataset, comprising approximately 400k audio clips with paired captions. We sourced audio clips and their raw descriptions from web sources and a sound event detection dataset (with 260k clips sourced from the Freesound website). However, the online-harvested raw descriptions are highly noisy and unsuitable for direct use in tasks such as automated audio captioning. To overcome this issue, we propose a three-stage processing pipeline for filtering noisy data and generating high-quality captions, in which ChatGPT, a large language model, is leveraged to filter and transform raw descriptions automatically (a toy caption-cleaning sketch is given at the end of this entry).


    Xinhao Mei, Chutong Meng, Haohe Liu, Qiuqiang Kong, Tom Ko, Chengqi Zhao, Mark D. Plumbley, Yuexian Zou, Wenwu Wang


    Centre for Vision, Speech, and Signal Processing (CVSSP), University of Surrey, Johns Hopkins University, ByteDance, School of Electronic and Computer Engineering, Peking University

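    As a toy illustration of the ChatGPT-based cleaning step (not the authors' actual three-stage pipeline or prompts), the sketch below sends a raw Freesound-style description to the OpenAI chat API and asks for a single concise caption; the model name and prompt wording are assumptions.

      from openai import OpenAI

      client = OpenAI()  # expects OPENAI_API_KEY in the environment

      def clean_description(raw_description):
          """Ask a chat model to turn a noisy description into one short caption."""
          response = client.chat.completions.create(
              model="gpt-4o-mini",  # assumed model name
              messages=[{
                  "role": "user",
                  "content": ("Rewrite the following sound description as one short, "
                              "factual audio caption. Remove URLs, recording gear and "
                              "licensing details:\n" + raw_description),
              }],
          )
          return response.choices[0].message.content.strip()

      # Example: clean_description("Recorded with a Zoom H4n by the lake, CC-BY, see my blog ...")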