Sample-level Deep Convolutional Neural Networks for Music Auto-tagging Using Raw Waveforms

We propose sample-level deep convolutional neural networks that learn representations from very small grains of waveforms (e.g. 2 or 3 samples), going beyond typical frame-level input representations. In addition, we visualize the filters learned by the sample-level DCNN at each layer to identify hierarchically learned features, and show that they become sensitive to log-scaled frequency along the layers, similar to the mel-frequency spectrogram that is widely used in music classification systems.

Background: Learning from raw data in the image domain.

Background: Learning from raw data in the text domain.

Background: Learning from raw data in the audio domain.

Frame-level Mel-spectrogram model: Mel-spectrograms are powerful, but the feature-extraction process is separate from the training phase.
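As a rough sketch of this separation (the file name, sampling rate, and mel parameters below are illustrative assumptions, not values from the paper), the mel-spectrogram is computed offline and only the CNN built on top of it is trained:

```python
import librosa

# Feature extraction happens offline, before any model training.
y, sr = librosa.load("song.wav", sr=22050)  # hypothetical audio file
mel = librosa.feature.melspectrogram(
    y=y, sr=sr,
    n_fft=1024,        # frame size (assumed)
    hop_length=512,    # frame stride (assumed)
    n_mels=128,        # number of mel bins (assumed)
)
log_mel = librosa.power_to_db(mel)  # log compression

# The fixed 2-D log-mel representation is then fed to a CNN; the mel filter
# bank itself is never updated by back-propagation.
```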

Frame-level Raw waveform model: Previous studies have attempted to replace the mel-spectrogram stage with a single CNN layer using large filters.
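A minimal sketch of such a front end in PyTorch (the filter length of 512, stride of 256, and 128 channels are assumptions for illustration):

```python
import torch
import torch.nn as nn

# Frame-level raw waveform model: one convolution layer with large filters
# acts as a learned replacement for the mel-spectrogram front end.
frame_level_frontend = nn.Sequential(
    nn.Conv1d(in_channels=1, out_channels=128, kernel_size=512, stride=256),
    nn.BatchNorm1d(128),
    nn.ReLU(),
)

waveform = torch.randn(1, 1, 59049)        # (batch, channel, samples)
features = frame_level_frontend(waveform)  # time-frequency-like feature map
print(features.shape)                      # torch.Size([1, 128, 229])
```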

Sample-level Raw waveform model: Let's use a deeper CNN with small filters.
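A simplified sample-level DCNN sketch in PyTorch (channel widths, depth, and the 50-tag output are our assumptions; the key point is the filter size of 3 and the repeated reduction of the time axis by a factor of 3):

```python
import torch
import torch.nn as nn

def sample_block(in_ch, out_ch):
    # Basic sample-level building block: filter size 3, then max-pool by 3,
    # so each block shrinks the time axis by a factor of 3.
    return nn.Sequential(
        nn.Conv1d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
        nn.BatchNorm1d(out_ch),
        nn.ReLU(),
        nn.MaxPool1d(3),
    )

model = nn.Sequential(
    nn.Conv1d(1, 128, kernel_size=3, stride=3),   # sample-level first layer
    nn.BatchNorm1d(128),
    nn.ReLU(),
    *[sample_block(128, 128) for _ in range(9)],  # 3^10 = 59049-sample field
    nn.Conv1d(128, 256, kernel_size=1),
    nn.AdaptiveAvgPool1d(1),
    nn.Flatten(),
    nn.Linear(256, 50),                           # 50 tags (assumed)
    nn.Sigmoid(),                                 # multi-label auto-tagging
)

waveform = torch.randn(1, 1, 59049)  # roughly 2.7 s of audio at 22050 Hz
print(model(waveform).shape)         # torch.Size([1, 50])
```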

Comparison: Results of the three models above.

Trend 1: Small filters and small strides in the first convolution layer.

Trend 2: Deeper models (about 10 layers or more).

Trend 3: Network input of 1-5 seconds.
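For example, at a 22050 Hz sampling rate, a 59049-sample (3^10) input like the one sketched above corresponds to roughly 2.7 seconds of audio, squarely inside this range.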

Gradient Ascent Method: Convolution operation.

Gradient Ascent Method: The loss is set to the activation of the target filter, so that gradient ascent maximizes it to estimate the filter shape.

Gradient Ascent Method: Add the back-propagated gradient to the input noise.

Gradient Ascent Method: Repeat these steps several times.

Gradient Ascent Method: Run the same steps for different filters.

Gradient Ascent Method: The estimated filter shape is obtained at the input signal.
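A sketch of this procedure in PyTorch (the hook-based implementation, step size, and number of iterations are our assumptions; `model` is a trained network such as the one sketched above and `layer` is the convolution layer whose filters we want to visualize):

```python
import torch

def estimate_filter_shape(model, layer, filter_index,
                          input_length=729, steps=100, step_size=0.1):
    """Gradient ascent: find an input that maximally activates one filter."""
    model.eval()

    # Start from a small random-noise waveform.
    x = (0.01 * torch.randn(1, 1, input_length)).requires_grad_(True)

    # Capture the target layer's activation with a forward hook.
    activation = {}
    handle = layer.register_forward_hook(
        lambda module, inputs, output: activation.update(out=output))

    for _ in range(steps):
        model(x)
        # The loss is the activation of the target filter ...
        loss = activation["out"][0, filter_index].mean()
        # ... which is back-propagated and added to the input.
        loss.backward()
        with torch.no_grad():
            x += step_size * x.grad / (x.grad.norm() + 1e-8)
            x.grad.zero_()

    handle.remove()
    # The estimated filter shape is the optimized input signal itself.
    return x.detach().squeeze()

# Running the same steps for different filter indices gives one estimated
# waveform per filter, e.g. for filter 0 of the first convolution layer:
# shape = estimate_filter_shape(model, model[0], filter_index=0)
```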

Filter Visualization: To show the spectra effectively, we use a typical frame-size input (e.g. 729 samples).

Filter Visualization: We can see that the filters become sensitive to log-scaled frequency along the layers, similar to the mel-frequency spectrogram that is widely used in music classification systems.
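One way such a plot could be produced (a sketch; sorting filters by their peak-frequency bin and the 22050 Hz sampling rate are our assumptions):

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_sorted_spectra(estimated_filters, sr=22050):
    """estimated_filters: (num_filters, 729) array of waveforms obtained
    from the gradient-ascent procedure above."""
    spectra = np.abs(np.fft.rfft(estimated_filters, axis=1))
    # Sort filters by the frequency bin of their strongest response so a
    # mel-like, log-scaled frequency trend becomes visible across filters.
    order = np.argsort(spectra.argmax(axis=1))
    plt.imshow(spectra[order].T, aspect="auto", origin="lower",
               extent=[0, len(estimated_filters), 0, sr / 2])
    plt.xlabel("filter index (sorted by peak frequency)")
    plt.ylabel("frequency (Hz)")
    plt.title("Spectra of estimated filter shapes")
    plt.show()
```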

Check out the paper for more info.