Datasets

This is a list of datasets that use Freesound content, sorted alphabetically. Do you have a dataset that uses Freesound and does not appear here? Please send us an email at freesound@freesound.org!

  • ARCA23K

    ARCA23K is a dataset of labelled sound events created to investigate real-world label noise. It contains 23,727 audio clips originating from Freesound, and each clip belongs to one of 70 classes taken from the AudioSet ontology. The dataset was created using an entirely automated process with no manual verification of the data. For this reason, many clips are expected to be labelled incorrectly.

    In addition to ARCA23K, this release includes a companion dataset called ARCA23K-FSD, which is a single-label subset of the FSD50K dataset. ARCA23K-FSD contains the same sound classes as ARCA23K and the same number of audio clips per class. As it is a subset of FSD50K, each clip and its label have been manually verified. Note that only the ground truth data of ARCA23K-FSD is distributed in this release. To download the audio clips, please visit the Zenodo page for FSD50K.

    The source code used to create the datasets is available: https://github.com/tqbl/arca23k-dataset


    Turab Iqbal, Yin Cao, Andrew Bailey, Mark D. Plumbley, Wenwu Wang


    University of Surrey, Centre for Vision, Speech and Signal Processing, Audio Research Group


  • Clotho Dataset

    Clotho is a novel audio captioning dataset, consisting of 4981 audio samples, and each audio sample has five captions (a total of 24,905 captions). Audio samples are of 15 to 30 s duration and captions are eight to 20 words long. Audio samples are collected from Freesound.


    Konstantinos Drossos, Samuel Lipping, Tuomas Virtanen


    Tampere University


  • Dataset used in COALA: Co-Aligned Autoencoders for Learning Semantically Enriched Audio Representations

    This dataset consists of two HDF5 files that contain pre-computed log-mel spectrograms used to train audio embedding models. The dataset is split into a training set and a validation set containing 170,793 and 19,103 spectrogram patches respectively, with their accompanying multi-hot encoded tags from a vocabulary of 1000 tags provided by Freesound users (a minimal loading sketch is given at the end of this entry).

    More details can be found in “COALA: Co-Aligned Autoencoders for Learning Semantically Enriched Audio Representations” by X. Favory, K. Drossos, T. Virtanen, and X. Serra. (arXiv)


    Xavier Favory, Konstantinos Drossos, Tuomas Virtanen, Xavier Serra


    Music Technology Group (MTG), Universitat Pompeu Fabra

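    The snippet below is a minimal, hypothetical sketch of how such HDF5 files could be read with h5py; the file name and the internal dataset keys ("spectrograms", "tags") are assumptions made for illustration and may differ from the actual release.

      import h5py

      # Hypothetical file name and dataset keys; adjust to the actual release.
      with h5py.File("coala_training_patches.h5", "r") as f:
          specs = f["spectrograms"][:32]  # assumed: (N, mel_bands, frames) log-mel patches
          tags = f["tags"][:32]           # assumed: (N, 1000) multi-hot tag vectors

      print(specs.shape, tags.shape)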

  • DBR Dataset

    The DBR dataset is an environmental audio dataset created for the Bachelor’s Seminar in Signal Processing at Tampere University of Technology. The samples in the dataset were collected from the online audio database Freesound. The dataset consists of three classes, each containing 50 samples, and the classes are ‘dog’, ‘bird’, and ‘rain’ (hence the name DBR).


    Ville-Veikko Eklund


    Tampere University


  • DCASE2019 Task 4 - Synthetic data

    The synthetic set is composed of 10-second audio clips generated with Scaper. The foreground events are obtained from a subset of the FSD dataset from Freesound. Each event audio clip was verified manually to ensure that the sound quality and the event-to-background ratio were sufficient for it to be used as an isolated event. We also verified that the event was actually dominant in the clip and checked that the event onset and offset were present in the clip.


    Turpault Nicolas, Serizel Romain, Salamon Justin, Shah Ankit Parag


    Université de Lorraine, CNRS, Inria, Language Technologies Institute, Carnegie Mellon University, Adobe Research


  • DESED Synthetic

    This dataset is the synthetic part of the DESED dataset. It allows creating new mixtures from isolated foreground sounds and background sounds, and was used in the DCASE 2019 and DCASE 2020 Task 4 challenges.


    Turpault, Nicolas, Serizel, Romain, Salamon, Justin, Shah, Ankit, Wisdom, Scott, Hershey, John, Erdogan, Hakan


    Université de Lorraine, CNRS, Inria, Language Technologies Institute, Carnegie Mellon University, Adobe Research, Google, Inc


  • Divide and Remaster (DnR)

    Divide and Remaster (DnR) is a source separation dataset for training and testing algorithms that separate a monaural audio signal into speech, music, and sound effects/background stems. The dataset is composed of artificial mixtures using audio from LibriSpeech, the Free Music Archive (FMA), and the Freesound Dataset 50k (FSD50K). We introduce it as part of the Cocktail Fork Problem paper: https://arxiv.org/abs/2110.09958


    Darius Petermann, Gordon Wichern, Zhong-Qiu Wang, Jonathan Le Roux


    Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA, USA, Indiana University, Department of Intelligent Systems Engineering, Bloomington, IN, USA


  • Emo-Soundscapes

    We propose a dataset of audio samples called Emo-Soundscapes and two evaluation protocols for machine learning models. We curated 600 soundscape recordings from Freesound.org and mixed 613 audio clips from combinations of these. The Emo-Soundscapes dataset contains 1213 6-second Creative Commons licensed audio clips. We collected the ground truth annotations of perceived emotion in the 1213 soundscape recordings using a crowdsourced listening experiment, in which 1182 annotators from 74 different countries ranked the audio clips according to perceived valence/arousal. This dataset allows studying soundscape emotion recognition and how the mixing of various soundscape recordings influences their perceived emotion.


    Jianyu Fan, Miles Thorogood, Philippe Pasquier


    Simon Fraser University


  • ESC-50: Dataset for Environmental Sound Classification

    The ESC-50 dataset is a labeled collection of 2000 environmental audio recordings suitable for benchmarking methods of environmental sound classification. The dataset consists of 5-second-long recordings organized into 50 semantic classes (with 40 examples per class) loosely arranged into 5 major categories. Clips in this dataset have been manually extracted from public field recordings gathered from Freesound (a brief usage sketch is given at the end of this entry).


    Piczak, Karol J.


    Warsaw University of Technology

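    As a usage illustration (not part of the dataset description above), the sketch below assumes the esc50.csv metadata file distributed with the dataset, with filename, fold and category columns, and builds a simple train/test split by fold; the path and column names are assumptions.

      import pandas as pd

      # Assumed location and layout of the metadata file distributed with ESC-50.
      meta = pd.read_csv("ESC-50-master/meta/esc50.csv")

      test_fold = 5
      train_df = meta[meta["fold"] != test_fold]
      test_df = meta[meta["fold"] == test_fold]

      print(len(train_df), "train clips,", len(test_df), "test clips")
      print(sorted(train_df["category"].unique())[:5])  # a few of the 50 classes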

  • freefield1010

    This dataset contains 7690 10-second audio files in a standardised format, extracted from contributions on the Freesound archive which were labelled with the “field-recording” tag (an illustrative tag query is sketched at the end of this entry). Note that the original tagging (as well as the audio submission) is crowdsourced, so the dataset is not guaranteed to consist purely of “field recordings” as might be defined by practitioners. The intention is to represent the content of an archive collection on such a topic, rather than to represent a controlled definition of such a topic.


    Dan Stowell


    Centre for Digital Music (C4DM), Queen Mary University of London

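    The request below is an illustrative sketch of this kind of tag-based query against the Freesound APIv2 text search endpoint; it is not the original collection code, and the API token is a placeholder.

      import requests

      API_TOKEN = "YOUR_FREESOUND_API_TOKEN"  # placeholder

      resp = requests.get(
          "https://freesound.org/apiv2/search/text/",
          params={
              "filter": 'tag:"field-recording"',
              "fields": "id,name,duration,license",
              "page_size": 50,
              "token": API_TOKEN,
          },
      )
      resp.raise_for_status()
      for sound in resp.json()["results"]:
          print(sound["id"], sound["name"], sound["duration"])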

  • Freesound Loop Dataset

    This dataset contains 9,455 loops from Freesound.org and the corresponding annotations. These loops have tempo, key, genre and instrumentation annotations made by MIR researchers.

    More details can be found in “The Freesound Loop Dataset and Annotation Tool” by António Ramires et al. (Paper)


    António Ramires, Frederic Font, Dmitry Bogdanov, Jordan B. L. Smith, Yi-Hsuan Yang, Joann Ching, Bo-Yu Chen, Yueh-Kao Wu, Hsu Wei-Han, Xavier Serra


    Music Technology Group (MTG), Universitat Pompeu Fabra, TikTok, Research Center for IT Innovation, Academia Sinica


  • Freesound Loops 4k (FSL4)

    This dataset contains ~4000 user-contributed loops uploaded to Freesound. Loops were selected by searching Freesound for sounds with the query terms loop and bpm, and then automatically parsing the returned sound filenames, tags and textual descriptions to identify tempo annotations made by users. For example, a sound containing the tag 120bpm is considered to have a ground truth of 120 BPM (see the parsing sketch at the end of this entry).


    Frederic Font


    Music Technology Group (MTG), Universitat Pompeu Fabra

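    The sketch below illustrates the kind of pattern matching described above; it is not the authors' exact parsing code, just a minimal example of mapping a tag or filename such as 120bpm to a tempo ground truth.

      import re

      BPM_PATTERN = re.compile(r"(\d{2,3})\s*bpm", re.IGNORECASE)

      def extract_bpm(text):
          """Return the first plausible BPM annotation found in `text`, or None."""
          match = BPM_PATTERN.search(text)
          if match:
              bpm = int(match.group(1))
              if 40 <= bpm <= 300:  # discard implausible tempo values
                  return bpm
          return None

      print(extract_bpm("techno_drum_loop_120bpm.wav"))  # -> 120
      print(extract_bpm("ambient pad loop"))             # -> None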

  • Freesound One-Shot Percussive Sounds

    This dataset contains 10254 one-shot (single event) percussive sounds from Freesound and the corresponding timbral analysis. These were used to train the generative model for Neural Percussive Synthesis Parameterised by High-Level Timbral Features.


    António Ramires, Pritish Chandna, Xavier Favory, Emilia Gómez, Xavier Serra


    Music Technology Group (MTG), Universitat Pompeu Fabra


  • FSD-FS

    FSD-FS is a publicly available database of human-labelled sound events for few-shot learning. It spans 143 classes obtained from the AudioSet Ontology and contains 43,030 raw audio files collected from FSD50K. FSD-FS is curated at the Centre for Digital Music, Queen Mary University of London (an illustrative episode-sampling sketch is given at the end of this entry).


    Jinhua Liang, Huy Phan, Emmanouil Benetos


    Centre for Digital Music, Queen Mary University of London

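    As an illustration of the few-shot setting (not code from the FSD-FS release), the sketch below samples an N-way K-shot episode from a hypothetical mapping of class labels to audio file paths.

      import random

      def sample_episode(files_by_class, n_way=5, k_shot=5, n_query=5):
          """Sample support and query sets for one N-way K-shot episode."""
          classes = random.sample(list(files_by_class), n_way)
          support, query = [], []
          for label in classes:
              files = random.sample(files_by_class[label], k_shot + n_query)
              support += [(f, label) for f in files[:k_shot]]
              query += [(f, label) for f in files[k_shot:]]
          return support, query

      # files_by_class is a hypothetical dict, e.g. {"Bark": ["1234.wav", ...], ...}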

  • FSD50K

    FSD50K is an open dataset of human-labeled sound events containing 51,197 Freesound clips unequally distributed in 200 classes drawn from the AudioSet Ontology. Please check the Zenodo FSD50K page for more information about the dataset, download and citation information.


    Eduardo Fonseca, Xavier Favory, Jordi Pons, Frederic Font, Xavier Serra, All the annotators!!!


    Music Technology Group (MTG), Universitat Pompeu Fabra


  • FSDKaggle2018

    FSDKaggle2018 is an audio dataset containing 11,073 audio files annotated with 41 labels of the AudioSet Ontology. FSDKaggle2018 has been used for the DCASE Challenge 2018 Task 2, which was run as a Kaggle competition titled Freesound General-Purpose Audio Tagging Challenge.


    Eduardo Fonseca, Xavier Favory, Jordi Pons, Frederic Font, Manoj Plakal, Daniel P. W. Ellis, Xavier Serra


    Music Technology Group (MTG), Universitat Pompeu Fabra, Google Research’s Machine Perception Team


  • FSDKaggle2019

    FSDKaggle2019 is an audio dataset containing 29,266 audio files annotated with 80 labels of the AudioSet Ontology. FSDKaggle2019 has been used for the DCASE Challenge 2019 Task 2, which was run as a Kaggle competition titled Freesound Audio Tagging 2019. The dataset allows development and evaluation of machine listening methods in conditions of label noise, minimal supervision, and real-world acoustic mismatch. FSDKaggle2019 consists of two train sets and one test set. One train set and the test set consist of manually labeled data from Freesound, while the other train set consists of noisily labeled web audio data from Flickr videos taken from the YFCC dataset.


    Eduardo Fonseca, Manoj Plakal, Frederic Font, Daniel P. W. Ellis, Xavier Serra


    Music Technology Group (MTG), Universitat Pompeu Fabra, Google Research’s Machine Perception Team


  • FSDnoisy18k

    FSDnoisy18k is an audio dataset collected with the aim of fostering the investigation of label noise in sound event classification. It contains 42.5 hours of audio across 20 sound classes, including a small amount of manually-labeled data and a larger quantity of real-world noisy data.


    Eduardo Fonseca, Mercedes Collado, Manoj Plakal, Daniel P. W. Ellis, Frederic Font, Xavier Favory, Xavier Serra


    Music Technology Group (MTG), Universitat Pompeu Fabra, Google Research’s Machine Perception Team


  • GISE-51

    GISE-51 is an open dataset of 51 isolated sound events based on the FSD50K dataset. The release also includes the GISE-51-Mixtures subset, a dataset of 5-second soundscapes with up to three sound events synthesized from GISE-51. The GISE-51 release attempts to address some of the shortcomings of recent sound event datasets, providing an open, reproducible benchmark for future research and the freedom to adapt the included isolated sound events for domain-specific applications, which was not possible using existing large-scale weakly labelled datasets. The release also includes accompanying code for baseline experiments, which can be found at https://github.com/SarthakYadav/GISE-51-pytorch.


    Sarthak Yadav, Mary Ellen Foster


    University of Glasgow, UK


  • Good-sounds dataset

    This dataset contains monophonic recordings of musical instruments playing two kinds of exercises: single notes and scales. The recordings were made in the Universitat Pompeu Fabra / Phonos recording studio by 15 different professional musicians, all of them holding a music degree and having some expertise in teaching, and were uploaded to Freesound. Twelve different instruments were recorded using between one and four different microphones (depending on the recording session). For each instrument, the whole set of playable semitones is recorded several times with different tonal characteristics. Each note is recorded into a separate mono .flac audio file at 48 kHz and 32 bits. The tonal characteristics are explained in the dataset documentation and the related publication.


    Oriol Romani Picas, Hector Parra Rodriguez, Dara Dabiri, Xavier Serra


    Music Technology Group (MTG), Universitat Pompeu Fabra


  • LAION-Audio-630K

    LAION-Audio-630K is the largest publicly available audio-text dataset, an order of magnitude larger than previous audio-text datasets (as of 2022-11-05). Notably, it combines eight distinct datasets, including 3000 hours of audio content from Freesound.


    LAION Team, Yusong Wu, Ke Chen, Tianyu Zhang, Marianna Nezhurina, Yuchen Hui


    LAION


  • LibriFSD50K

    LibriFSD50K is a dataset that combines speech data from the LibriSpeech dataset and noise data from FSD50K (see the mixing sketch at the end of this entry). More information can be found in the paper: https://arxiv.org/pdf/2105.04727.pdf


    Efthymios Tzinis, Jonah Casebeer, Zhepei Wang, Paris Smaragdis


    University of Illinois at Urbana-Champaign, Department of Computer Science, Urbana, IL, USA, Adobe Research, San Jose, CA, USA

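    The function below is a generic sketch of mixing a speech signal with a noise signal at a target SNR, the basic operation implied by combining LibriSpeech speech with FSD50K noise; it is not the authors' generation code.

      import numpy as np

      def mix_at_snr(speech, noise, snr_db):
          """Mix `speech` and `noise` (1-D float arrays) at the given SNR in dB."""
          n = min(len(speech), len(noise))
          speech, noise = speech[:n], noise[:n]
          speech_power = np.mean(speech ** 2) + 1e-12
          noise_power = np.mean(noise ** 2) + 1e-12
          # Scale the noise so that 10*log10(speech_power / scaled_noise_power) == snr_db.
          gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
          return speech + gain * noise

      # Example: mixture = mix_at_snr(speech_waveform, noise_waveform, snr_db=5.0)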

  • MAESTRO Synthetic - Multi-Annotator Estimated Strong Labels

    MAESTRO synthetic contains 20 synthetic audio files created using Scaper, each of them 3 minutes long. The dataset was created for studying annotation procedures for strong labels using crowdsourcing. The audio files contain sounds from the following classes: car_horn, children_voices, dog_bark, engine_idling, siren, street_music. Audio files contain excerpts of recordings uploaded to Freesound. Please see FREESOUNDCREDITS.txt for an attribution list.

    Audio files are generated using Scaper, with small changes to the synthesis procedure: sounds are placed at random intervals, controlling for a maximum polyphony of 2. Intervals between two consecutive events are selected at random, but limited to 2-10 seconds. Event classes and event instances are chosen uniformly and mixed with a signal-to-noise ratio (SNR) randomly selected between 0 and 20 dB over a Brownian noise background. Having two overlapping events from the same class is avoided (a simplified scheduling sketch is given at the end of this entry).


    Irene Martin Morato, Manu Harju, Annamaria Mesaros


    Machine Listening Group, Tampere University

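    The sketch below is a simplified illustration of the scheduling rules described above (random 2-10 s gaps between event onsets, uniform class choice, per-event SNR drawn between 0 and 20 dB); it is not the actual Scaper-based generation script, and polyphony capping and same-class overlap avoidance are omitted for brevity.

      import random

      CLASSES = ["car_horn", "children_voices", "dog_bark",
                 "engine_idling", "siren", "street_music"]

      def schedule_events(scene_duration=180.0, event_duration=5.0):
          """Place events with random 2-10 s gaps and per-event SNR in [0, 20] dB."""
          events, t = [], 0.0
          while True:
              t += random.uniform(2.0, 10.0)          # gap before the next onset
              if t + event_duration > scene_duration:
                  break
              events.append({
                  "label": random.choice(CLASSES),    # uniform class choice
                  "onset": round(t, 2),
                  "snr_db": round(random.uniform(0.0, 20.0), 1),
              })
          return events

      print(schedule_events()[:3])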

  • SimSceneTVB Learning

    This is a dataset of 600 simulated sound scenes of 45 s each representing urban sound environments, simulated using the simScene Matlab library. The dataset is divided into two parts, a train subset (400 scenes) and a test subset (200 scenes), for the development of learning-based models. Each scene is composed of three main sources (traffic, human voices and birds) according to an original scenario, which is composed semi-randomly conditioned on five ambiances: park, quiet street, noisy street, very noisy street and square. Separate channels for the contribution of each source are available. The base audio files used for simulation are obtained from Freesound and LibriSpeech. The sound scenes are scaled according to a playback sound level in dB, which is drawn randomly but remains plausible according to the ambiance.


    Felix Gontier, Mathieu Lagrange, Pierre Aumond, Catherine Lavandier, Jean-François Petiot


    Ecole Centrale de Nantes, University of Paris Seine, University of Cergy-Pontoise, ENSEA


  • SimSceneTVB Perception

    This is a corpus of 100 sound scenes of 45s each representing urban sound environments, including: 6 scenes recorded in Paris; 19 scenes simulated using simScene to replicate recorded scenarios, including the 6 recordings in this corpus; 75 scenes simulated using simScene with diverse new scenarios, containing traffic, human voices and bird sources. The base audio files used for simulation are obtained from Freesound and LibriSpeech. The sound scenes are scaled according to a playback sound level in dB, which is drawn randomly but remains plausible according to the ambiance.


    Felix Gontier, Mathieu Lagrange, Pierre Aumond, Catherine Lavandier, Jean-François Petiot


    Ecole Centrale de Nantes, University of Paris Seine, University of Cergy-Pontoise, ENSEA


  • Sound Events for Surveillance Applications

    The Sound Events for Surveillance Applications (SESA) dataset files were obtained from Freesound. The dataset is divided between train (480 files) and test (105 files) folders. All audio files are mono WAV, 16 kHz, 8-bit, with durations of up to 33 seconds. The classes are: 0 - Casual (not a threat), 1 - Gunshot, 2 - Explosion, 3 - Siren (also contains alarms).


    Tito Spadini


  • TUT Rare Sound Events 2017

    TUT Rare Sound Events 2017 is a dataset used in the DCASE Challenge 2017 Task 2, focused on the detection of rare sound events in artificially created mixtures. It consists of isolated sound events for each target class and recordings of everyday acoustic scenes to serve as background. The target sound event categories are: Baby crying, Glass breaking, Gunshot. The background audio is part of the TUT Acoustic Scenes 2016 dataset, and the isolated sound examples were collected from Freesound. Selection of sounds from Freesound was based on the exact label, selecting examples with a sampling frequency of 44.1 kHz or higher.


    Aleksandr Diment, Annamaria Mesaros, Toni Heittola


    Tampere University of Technology


  • Urban Sound 8K

    This dataset contains 8732 labeled sound excerpts (<=4s) of urban sounds from 10 classes: air_conditioner, car_horn, children_playing, dog_bark, drilling, engine_idling, gun_shot, jackhammer, siren, and street_music. The classes are drawn from the urban sound taxonomy. All excerpts are taken from field recordings uploaded to Freesound.


    Justin Salamon, Christopher Jacoby, Juan Pablo Bello


    Music and Audio Research Laboratory (MARL), New York University, Center for Urban Science and Progress (CUSP), New York University


  • USM-SED

    USM-SED is a dataset for polyphonic sound event detection in urban sound monitoring use-cases. Based on isolated sounds taken from the FSD50K dataset, 20,000 polyphonic soundscapes are synthesized, with sounds randomly positioned in the stereo panorama at different loudness levels (a minimal panning and mixing sketch is given at the end of this entry).


    Jakob Abeßer


    Semantic Music Technologies Group, Fraunhofer IDMT

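    The sketch below illustrates the two operations mentioned above, random stereo positioning and per-sound loudness scaling, using a constant-power panning law; it is an illustration, not the USM-SED synthesis code.

      import numpy as np

      def pan_and_scale(mono, pan, gain_db):
          """Place a mono signal in the stereo panorama (pan in [-1, 1]) at a given gain."""
          theta = (pan + 1.0) * np.pi / 4.0      # constant-power panning law
          gain = 10 ** (gain_db / 20.0)
          left = gain * np.cos(theta) * mono
          right = gain * np.sin(theta) * mono
          return np.stack([left, right], axis=0)  # shape: (2, n_samples)

      rng = np.random.default_rng(0)
      stereo = pan_and_scale(rng.standard_normal(16000),
                             pan=rng.uniform(-1, 1),
                             gain_db=rng.uniform(-12, 0))
      print(stereo.shape)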

  • Vocal Imitation Set v1.1.3

    The VocalImitationSet is a collection of crowd-sourced vocal imitations of a large set of diverse sounds collected from Freesound, which were curated based on Google’s AudioSet ontology. We expect that this dataset will help research communities obtain a better understanding of human vocal imitation and build machines that understand imitations as humans do.


    Bongjun Kim, Bryan Pardo


    Dept. of Electrical Engineering and Computer Science, Northwestern University


  • WavCaps

    WavCaps is the first large-scale weakly-labelled audio captioning dataset, comprising approximately 400k audio clips with paired captions. We sourced audio clips and their raw descriptions from web sources and a sound event detection dataset (with 260k clips sourced from the Freesound website). However, the online-harvested raw descriptions are highly noisy and unsuitable for direct use in tasks such as automated audio captioning. To overcome this issue, we propose a three-stage processing pipeline for filtering noisy data and generating high-quality captions, in which ChatGPT, a large language model, is leveraged to filter and transform raw descriptions automatically (a toy caption-cleaning sketch is given at the end of this entry).


    Xinhao Mei, Chutong Meng, Haohe Liu, Qiuqiang Kong, Tom Ko, Chengqi Zhao, Mark D. Plumbley, Yuexian Zou, Wenwu Wang


    Centre for Vision, Speech, and Signal Processing (CVSSP), University of Surrey, Johns Hopkins University, ByteDance, School of Electronic and Computer Engineering, Peking University

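    As a toy illustration of the ChatGPT-based cleaning step (not the authors' actual three-stage pipeline or prompts), the sketch below sends a raw Freesound-style description to the OpenAI chat API and asks for a single concise caption; the model name and prompt wording are assumptions.

      from openai import OpenAI

      client = OpenAI()  # expects OPENAI_API_KEY in the environment

      def clean_description(raw_description):
          """Ask a chat model to turn a noisy description into one short caption."""
          response = client.chat.completions.create(
              model="gpt-4o-mini",  # assumed model name
              messages=[{
                  "role": "user",
                  "content": ("Rewrite the following sound description as one short, "
                              "factual audio caption. Remove URLs, recording gear and "
                              "licensing details:\n" + raw_description),
              }],
          )
          return response.choices[0].message.content.strip()

      # Example: clean_description("Recorded with a Zoom H4n by the lake, CC-BY, see my blog ...")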