Jongpil Lee (KAIST), Nicholas J. Bryan (Adobe Research), Justin Salamon (Adobe Research), Zeyu Jin (Adobe Research), Juhan Nam (KAIST)
Music similarity search is useful for a variety of creative tasks, such as replacing one music recording with another that has a similar “feel”, a common task in video editing. For this task, it is typically necessary to define a similarity metric to compare one recording to another. Music similarity, however, is hard to define and depends on multiple simultaneous notions of similarity (e.g., genre, mood, instrument, tempo). While prior work ignores this issue, we embrace this idea and introduce the concept of multidimensional similarity, unifying both global and specialized similarity metrics into a single, semantically disentangled multidimensional similarity metric.
The dim-sim dataset is a collection of user-annotated music similarity triplet ratings used to evaluate music similarity search and related algorithms. Our similarity ratings are linked to the Million Song Dataset (MSD).
To collect our data, we randomly sampled 4,000 triplets of 3-second clips (i.e., anchor, song 1, song 2) from the MSD and asked people to annotate which track sounded more similar to the anchor (i.e., song 1 or song 2). Each triplet was annotated by 5-12 people, resulting in 39,440 raw human annotations. We then calculated the annotator agreement per triplet, defined as the ratio between the majority vote and the total number of annotations, and filtered out triplets where the agreement was below 0.9, yielding 879 high-agreement, human-annotated triplets (see the sketch below). We have released both the raw and clean versions of the dataset in multiple formats discussed below.
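For concreteness, here is a minimal sketch of the agreement computation and 0.9 filter described above. This is our illustration, not official dataset tooling; it assumes raw annotations are available as per-triplet vote counts under the `song1_vote`/`song2_vote` keys, and the function names are ours.

```python
def agreement(song1_votes: int, song2_votes: int) -> float:
    """Ratio of the majority vote to the total number of annotations."""
    total = song1_votes + song2_votes
    return max(song1_votes, song2_votes) / total

def clean_triplets(raw_triplets, threshold=0.9):
    """Keep only triplets whose annotator agreement is at least `threshold`."""
    return [
        t for t in raw_triplets
        if agreement(t["song1_vote"], t["song2_vote"]) >= threshold
    ]

# Example: 8 of 9 annotators agree -> agreement of ~0.889, so this
# triplet would be filtered out at the 0.9 threshold.
assert agreement(8, 1) < 0.9
```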
The dataset can be downloaded from Zenodo: https://zenodo.org/record/3889149.
We have released both CSV and JSON versions of the data for both the raw (`raw-dim-sim`) and clean (`clean-dim-sim`) annotations as described above. For a given triplet rating, the following data is provided (an illustrative record follows the list):
- `triplet_id`
- `anchor_id`
- `anchor_start_seconds`
- `anchor_start_samples`
- `song1_id`
- `song1_start_seconds`
- `song1_start_samples`
- `song2_id`
- `song2_start_seconds`
- `song2_start_samples`
- `sampling_rate`
- `clip_lengths_seconds`
- `clip_lengths_samples`
- `song1_vote`
- `song2_vote`
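As an illustrative example, a single clean-annotation record might look like the following. All values below, including the MSD track IDs, offsets, and sampling rate, are placeholders, not real dataset entries.

```python
# Hypothetical clean-dim-sim record; every value is a placeholder.
example_triplet = {
    "triplet_id": 42,
    "anchor_id": "TRAAAAW128F429D538",  # MSD track ID (placeholder)
    "anchor_start_seconds": 30.0,
    "anchor_start_samples": 661500,
    "song1_id": "TRAAABD128F429CF47",   # MSD track ID (placeholder)
    "song1_start_seconds": 12.5,
    "song1_start_samples": 275625,
    "song2_id": "TRAAADZ128F9348C2E",   # MSD track ID (placeholder)
    "song2_start_seconds": 45.0,
    "song2_start_samples": 992250,
    "sampling_rate": 22050,
    "clip_lengths_seconds": 3.0,
    "clip_lengths_samples": 66150,
    "song1_vote": 1,  # clean version: 0 or 1
    "song2_vote": 0,
}
```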
For the raw versions, `song1_vote` and `song2_vote` give the total number of annotators who voted for each song, respectively. For the clean versions, the values of `song1_vote` and `song2_vote` are set to 0 or 1. All clips used were exactly 3 seconds long. The `anchor_id`, `song1_id`, and `song2_id` fields denote the corresponding MSD track IDs.
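As a minimal usage sketch, the clean CSV can be loaded and turned into (anchor, more-similar, less-similar) triplets as shown below. The filename is hypothetical and should be replaced with the actual CSV from the Zenodo archive.

```python
import pandas as pd

# Hypothetical filename; substitute the actual CSV from the Zenodo archive.
df = pd.read_csv("clean-dim-sim.csv")

# In the clean version, song1_vote/song2_vote are 0 or 1, so the
# majority-preferred song can be read off directly.
for _, row in df.iterrows():
    more_similar = row["song1_id"] if row["song1_vote"] == 1 else row["song2_id"]
    less_similar = row["song2_id"] if row["song1_vote"] == 1 else row["song1_id"]
    print(row["anchor_id"], more_similar, less_similar)
```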
The dim-sim dataset is licensed under Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0).
When dim-sim is used for academic research, we would greatly appreciate it if scientific publications of works based in part on the dim-sim dataset cite the following publication:
@inproceedings{Lee2019MusicSimilarity,
  title={Disentangled Multidimensional Metric Learning For Music Similarity},
  author={Lee, Jongpil and Bryan, Nicholas J. and Salamon, Justin and Jin, Zeyu and Nam, Juhan},
  booktitle={Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2020},
  organization={IEEE}
}