Multi-Level and Multi-Scale Feature Aggregation Using Pre-trained Convolutional Neural Networks for Music Auto-tagging


Music auto-tagging is often handled in a similar manner to image classification by treating the 2D audio spectrogram as image data. However, music auto-tagging differs from image classification in that the tags are highly diverse and have different levels of abstraction. To address this, we propose a convolutional neural network (CNN)-based feature aggregation method that embraces multi-level and multi-scale features.

Motivation: Hierarchy in music.

Motivation: Music tags have various levels of abstraction.

Motivation: Encode music using only the CNN's last hidden layer?

Motivation: Let's use the intermediate-layer features.

Motivation: We can also consider multi-scale features by using several pre-trained CNNs with different input sizes.

Pre-training: Let's make feature extractors.

Feature extraction: To better capture local characteristics, the frame-wise (time) dimension of each layer's output is max-pooled to 1. Average pooling is then applied across all segments of a song, so the feature size of each layer equals the number of filters in that layer.
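
The pooling described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name `aggregate_layer`, the array layout `(n_segments, n_frames, n_filters)`, and the example sizes are all assumptions made for the sketch.

```python
import numpy as np

def aggregate_layer(activations: np.ndarray) -> np.ndarray:
    """Pool one layer's activations into a single song-level vector.

    activations: assumed shape (n_segments, n_frames, n_filters),
    i.e. the layer output for every segment of one song.
    """
    # Max-pool the frame-wise (time) axis down to length 1 per segment.
    per_segment = activations.max(axis=1)   # (n_segments, n_filters)
    # Average-pool over all segments of the song.
    return per_segment.mean(axis=0)         # (n_filters,)

# Hypothetical sizes: 10 segments, 27 frames, 64 filters.
rng = np.random.default_rng(0)
acts = rng.standard_normal((10, 27, 64))
song_feature = aggregate_layer(acts)
print(song_feature.shape)  # (64,)
```

Whatever the number of frames or segments, the output length depends only on the number of filters, which is why songs of different durations end up with features of the same size.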

Feature extraction: Another feature aggregation.

Feature extraction: Another feature aggregation at a different scale.

Rich features: Song-level aggregated features are now obtained.
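
One plausible way to combine the per-layer, per-scale vectors into a single song-level feature is plain concatenation. The sketch below assumes that choice; the helper `concat_song_features`, the model names, and the layer sizes are hypothetical and only serve to show the bookkeeping.

```python
import numpy as np

def concat_song_features(features_by_model):
    """Concatenate per-layer vectors from several pre-trained CNNs.

    features_by_model: dict mapping a model name (one per input scale)
    to a list of 1-D per-layer feature vectors for one song.
    """
    parts = []
    # Sort model names so the layout of the final vector is deterministic.
    for model_name in sorted(features_by_model):
        parts.extend(features_by_model[model_name])
    return np.concatenate(parts)

# Hypothetical example: two CNNs trained on different input sizes.
features = {
    "cnn_small_input": [np.zeros(64), np.zeros(128)],
    "cnn_large_input": [np.ones(64), np.ones(128), np.ones(256)],
}
song_vector = concat_song_features(features)
print(song_vector.shape)  # (640,)
```

The resulting vector length is the sum of the filter counts over all selected layers and all scales, so adding a layer or a scale only widens the classifier input.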

Classification: Classify using DNNs.

Transfer learning: Since our method consists of two stages, feature extraction and classification, transfer learning is easily applied.

Datasets: Million Song Dataset (MSD), Tagtraum, MagnaTagATune (MTAT), GTZAN.

Results: Comparisons.

Analysis: We can see that some tags are indeed located at different levels and scales.

Check out the paper for more info.