Disentangled Multidimensional Metric Learning For Music Similarity
ICASSP 2020 (Oral Presentation)
Jongpil Lee

KAIST

Nicholas J. Bryan

Adobe Research

Justin Salamon

Adobe Research

Zeyu Jin

Adobe Research

Juhan Nam

KAIST


Abstract

Music similarity search is useful for a variety of creative tasks, such as replacing one music recording with another recording that has a similar "feel" — a common task in video editing. For this task, it is typically necessary to define a similarity metric to compare one recording to another. Music similarity, however, is hard to define and depends on multiple simultaneous notions of similarity (e.g., genre, mood, instrument, tempo). While prior work ignores this issue, we embrace this idea, introduce the concept of multidimensional similarity, and unify both global and specialized similarity metrics into a single, semantically disentangled multidimensional similarity metric.

Read the Full Paper   (ICASSP) (arXiv)


Dim-Sim Dataset


The dim-sim dataset is a collection of user-annotated music similarity triplet ratings used to evaluate music similarity search and related algorithms. Our similarity ratings are linked to the Million Song Dataset (MSD).

About

To collect our data, we randomly sampled 4,000 3-second triplets (i.e., anchor, song 1, song 2) from the MSD and asked people to annotate which track sounded more similar to the anchor (i.e., song 1 or song 2). Each triplet was annotated by 5-12 people, resulting in 39,440 raw human annotations. We then calculated the annotator agreement per triplet, defined as the ratio between the majority vote and total number of annotations, and filtered out triplets where the agreement was below 0.9, to create 879 high-agreement cleaned, human-annotated triplets. We have released both the raw and clean versions of the dataset in multiple formats discussed below.
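The agreement computation above can be sketched as follows. This is a minimal illustration, not the authors' released code; it assumes the raw annotations are in a table whose per-song vote columns match the field list released with the dataset (`song1_vote`, `song2_vote`):

```python
import pandas as pd

def filter_high_agreement(raw: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Keep triplets whose annotator agreement meets the threshold.

    Agreement is the ratio between the majority vote and the total
    number of annotations for that triplet, as described above.
    """
    votes = raw[["song1_vote", "song2_vote"]]
    total = votes.sum(axis=1)        # total annotations per triplet
    majority = votes.max(axis=1)     # size of the majority vote
    agreement = majority / total
    return raw[agreement >= threshold]
```

For example, a triplet with votes 9 vs. 1 has agreement 0.9 and is kept, while a 3 vs. 3 split has agreement 0.5 and is filtered out.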

Download

The dataset can be downloaded from Zenodo: https://zenodo.org/record/3889149#.XuovcxMzbyV.

Formats

We have released both CSV and JSON versions of the data for both the raw (raw-dim-sim) and clean (clean-dim-sim) annotations as described above. For a given triplet rating, the following data is provided:

triplet_id
anchor_id
anchor_start_seconds
anchor_start_samples
song1_id
song1_start_seconds
song1_start_samples
song2_id
song2_start_seconds
song2_start_samples
sampling_rate
clip_lengths_seconds
clip_lengths_samples
song1_vote
song2_vote

For the raw version, song1_vote and song2_vote give the total number of users that voted for each song, respectively. For the clean version, song1_vote and song2_vote are set to 0 or 1. All clips used are exactly 3 seconds long. The anchor_id, song1_id, and song2_id fields denote the corresponding MSD track IDs.

License

The dim-sim dataset is licensed under Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0).

Acknowledgement

If you use the dim-sim dataset in academic research, we would highly appreciate it if your publications cited the following paper:

@inproceedings{Lee2019MusicSimilarity,
  title={Disentangled Multidimensional Metric Learning For Music Similarity},
  author={Lee, Jongpil and Bryan, Nicholas J. and Salamon, Justin and Jin, Zeyu and Nam, Juhan},
  booktitle={Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2020},
  organization={IEEE}
}