Classification-Based Singing Melody Extraction Using Deep Convolutional Neural Networks

Singing melody extraction is the task of identifying the melody pitch contour of the singing voice in polyphonic music. Most traditional melody extraction algorithms are based on calculating salient pitch candidates or separating the melody source from the mixture. Recently, the classification-based approach based on deep learning has drawn much attention. In this paper, we present a classification-based singing melody extraction model using deep convolutional neural networks. The proposed model consists of a singing pitch extractor (SPE) and a singing voice activity detector (SVAD). The SPE is trained to predict a high-resolution pitch label of the singing voice from a short segment of spectrogram. This allows the model to predict highly continuous curves. The melody contour is smoothed further by post-processing the output of the melody extractor. The SVAD is trained to determine whether a long segment of mel-spectrogram contains a singing voice. This often produces voice false alarm errors around the boundaries of singing segments. We reduce them by exploiting the output of the SPE. Finally, we evaluate the proposed melody extraction model on several public datasets. The results show that the proposed model is comparable to state-of-the-art algorithms.


Introduction
Melody extraction is the task of estimating the fundamental frequency that corresponds to the melodic line of polyphonic music. Since melody is the essence of music from which listeners can identify the piece, melody extraction has been applied to various music information retrieval tasks such as query-by-humming [1] and cover song identification [2]. In popular music, melody is usually performed by singers, and so the melody extraction task is often recast as detecting the presence of singing voice and estimating the dominant voice pitch in the presence of background accompaniment.
The expressive nature of the singing melody has been utilized for explaining characteristics of different music genres [3] or singers [4] as well. Furthermore, the continuous pitch curves have been incorporated in source separation algorithms to take vocals and background music apart [5].
A number of melody extraction algorithms, some of them particularly for singing voice, have been proposed so far. They can be broadly classified into three categories according to the approach type: salience-based, source separation-based, and classification-based [6]. While the majority of previous work is associated with the first two approaches, the data-driven approach based on classification, which predicts a finite set of pitch labels from audio features, has rarely been explored. An early work used a support vector machine classifier to predict a pitch label from spectrogram [7]. Since then, there was no attempt until Bittner et al. proposed a random forest classifier that predicts pitch contours from pitch salience features [8].
The lack of classification-based approaches can be attributed to the following reasons. First, melodic pitch is a physically measurable value, as opposed to the abstract labels defined in high-level tasks such as genre or mood classification. Thus, it is more intuitive to directly leverage time-frequency representations where the patterns for pitch estimation are observable, as in the salience-based or source separation-based approaches. Second, in the classification-based approach, the melodic pitch must be quantized to a certain resolution (e.g. a semitone in [7]). While this discrete pitch may be useful for applications that require a MIDI-level pitch notation, it loses detailed information about singing styles such as vibrato or note-to-note transition patterns. Third, the classification-based approach typically requires a sufficient amount of labeled data to achieve good performance. Manual extraction of melodic pitch at the frame level is highly tedious labor, particularly for mixed tracks. This has hindered the availability of labeled datasets.
In the recent past, however, there have been important changes that have encouraged the classification-based approach. First, multi-track audio recording data including singing voice as a separate track have become more available [9][10][11]. With the multi-track datasets, the melody labels can be obtained more easily by applying a monophonic pitch detector to the isolated vocal track. Second, deep learning, the powerful data-driven learning algorithm based on neural networks, has emerged and tremendously advanced, achieving a remarkable series of state-of-the-art results in numerous tasks. An indispensable element in the success of deep learning is the availability of large-scale labeled datasets.
Leveraging the datasets and recent advances in deep learning, several classification-based methods using neural networks have recently been attempted. Rigaud and Radenen proposed to use two types of neural networks. One is for detecting singing voice activity, built with 3 Bidirectional Long Short-Term Memory (BLSTM) layers following [12]. The other is for extracting singing pitch, composed of 2 hidden fully connected layers and a softmax layer that discriminates up to an eighth of a semitone [13]. Compared to the state-of-the-art salience-based system, Melodia [14], they showed significantly improved results. Kum and Nam proposed multi-column deep neural networks (MC-DNN), where each column network is trained to predict a pitch label with a different pitch resolution and the outputs are combined [15]. The results showed that the ensemble method achieves better performance than a single model and also returns a high pitch resolution. Park and Yoo presented an LSTM-based melody classification algorithm where they added a harmonic sum loss to the objective function to incorporate the harmonic structure in melodic tone [16]. They showed that the harmonic sum loss makes the model more robust to octave mismatch and interference from background music.
In this work, we propose a singing melody extraction model using deep convolutional neural networks (DCNN). While DCNN-based models have been shown to achieve state-of-the-art results in many music information retrieval (MIR) tasks including singing voice detection [17], polyphonic piano transcription [18], chord recognition [19] and music auto-tagging [20], to the best of our knowledge, they have not been applied to singing melody extraction yet. We use two DCNNs, one as a singing pitch extractor (SPE) and the other as a singing voice activity detector (SVAD). In particular, we investigate the importance of pitch resolution in the SPE. Also, we suggest using the output of pitch prediction in the SPE as a means to suppress voice false alarm errors in the result of the SVAD. Using several public datasets, we show that the proposed method significantly outperforms our previous work and that the overall results are comparable to state-of-the-art methods.

Proposed Methods
The proposed melody extraction method is illustrated in Figure 1. It is composed of two main parts. The SPE extracts melody features and predicts their pitch from a short segment of spectrogram (11 frames). Then, the output is temporally smoothed by a hidden Markov model (HMM) based post-processing. The SVAD serves to distinguish singing voice frames from a long segment of mel-spectrogram (115 frames) and removes melodic contours of the non-voice segments. In addition, the voice false alarm detector reduces the false positives in the SVAD results by exploiting the output of the SPE.

DCNN Model Configuration for the Singing Pitch Extractor
The architecture of the SPE is summarized in Table 1. The SPE is configured with four convolutional blocks and one fully connected layer. Each block contains two convolutional layers and two pooling layers except the last one. The convolution filters have a filter size of 3×3, and the number of filters in the convolutional blocks gradually increases as 64, 128, 256 and up to 512. Then, average-pooling is applied to the time axis and max-pooling to the frequency axis. The intuition behind this setting is that singing pitch is typically continuous, and so temporal smoothing by averaging maintains the pitch information better than max-pooling. Experimentally, we confirmed that this actually worked better than using max-pooling on both axes. We apply batch normalization on each convolutional layer and use the Leaky ReLU as an activation function for the non-linearity. We include dropout at the end of each block and use the softmax activation function for the output layer. The pitch labels cover from D2 (73.416 Hz) to B5 (987.77 Hz). We quantized the pitch labels on the MIDI scale but with high resolutions.
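The asymmetric pooling described above (averaging over time, taking the maximum over frequency) can be sketched as follows. This is a minimal numpy illustration of the idea, not the exact layer configuration of the paper; `mixed_pool` and the pool size of 2 are assumptions for illustration.

```python
import numpy as np

def mixed_pool(feature_map, pool=2):
    """Average-pool along the time axis, then max-pool along the
    frequency axis, over non-overlapping windows.

    feature_map: 2-D array of shape (time, frequency).
    """
    t, f = feature_map.shape
    t2, f2 = t // pool, f // pool
    x = feature_map[:t2 * pool, :f2 * pool]
    # average over non-overlapping windows on the time axis
    x = x.reshape(t2, pool, -1).mean(axis=1)
    # max over non-overlapping windows on the frequency axis
    x = x.reshape(t2, f2, pool).max(axis=2)
    return x
```

Averaging on the time axis smooths short fluctuations of a continuous pitch trajectory, while max-pooling on the frequency axis keeps the most salient spectral peak in each band.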
R_t denotes the resolution in 1/t semitone units. For example, R_1 indicates pitch resolution in semitone units. R_2, R_4, R_8, R_16 and R_32 indicate progressively higher resolutions than a semitone by factors of 2.
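The quantization scheme can be sketched as follows, assuming the label range D2 (MIDI 38) to B5 (MIDI 83) stated above and a simple linear class layout (the exact indexing, and whether a separate non-voice class is appended, are assumptions):

```python
import numpy as np

MIDI_LO, MIDI_HI = 38, 83   # D2 (73.416 Hz) to B5 (987.77 Hz)

def hz_to_midi(f):
    """Standard Hz-to-MIDI conversion with A4 = 440 Hz."""
    return 69.0 + 12.0 * np.log2(f / 440.0)

def quantize_pitch(f_hz, t):
    """Map a frequency to a pitch-class index at resolution R_t
    (1/t semitone), clipped to the D2-B5 range."""
    m = np.clip(hz_to_midi(f_hz), MIDI_LO, MIDI_HI)
    return int(round((m - MIDI_LO) * t))

def n_classes(t):
    """Number of pitch classes at resolution R_t."""
    return (MIDI_HI - MIDI_LO) * t + 1
```

For instance, at R_1 there are 46 pitch classes, while at R_32 there are 1441, which illustrates why higher resolutions enlarge the output layer.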
We used spectrogram as input for the SPE. We first resampled audio clips to 8 kHz and merged stereo channels into mono. We then computed the spectrogram with a Hann window of 1024 samples and a hop size of 80 samples. We compressed the magnitude of the spectrogram in a log scale and used 513 bins from 0 Hz to 4000 Hz. As in the previous work [15], we took multiple frames of spectrogram as input to capture contextual information from neighboring frames and used the pitch label at the center position of the context window. We also experimented with different sizes of input frames and obtained the best results at 11 frames in the SPE as well. Thus, we fix the input size to 11 frames for all experiments.
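The input pipeline above can be sketched in numpy. This is a simplified sketch under the stated parameters (Hann window of 1024, hop of 80, 513 bins, 11-frame context); the paper's exact framing and normalization may differ.

```python
import numpy as np

SR, N_FFT, HOP = 8000, 1024, 80

def log_spectrogram(y):
    """Log-compressed magnitude spectrogram of a mono 8 kHz signal.
    Returns an array of shape (n_frames, 513)."""
    win = np.hanning(N_FFT)
    n_frames = 1 + (len(y) - N_FFT) // HOP
    frames = np.stack([y[i * HOP:i * HOP + N_FFT] * win
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1))   # 513 bins, 0-4000 Hz
    return np.log1p(mag)

def context_windows(spec, size=11):
    """Slice 11-frame context windows; each window is paired with
    the pitch label of its center frame."""
    half = size // 2
    return np.stack([spec[i - half:i + half + 1]
                     for i in range(half, len(spec) - half)])
```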

DCNN Model Configuration for the Singing Voice Activity Detector
The architecture of the SVAD is summarized in Table 2. Unlike the SPE, it uses global average pooling instead of a fully connected layer. The architecture was inspired by the Network In Network model [21], which has the advantage of avoiding the problem of overfitting and greatly reducing the amount of computation without degrading performance. After the DCNN predicts the output, we used a median filter of 110 ms as the final step to perform temporal smoothing [17].
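The final median smoothing step can be sketched as follows. With a hop of 160 samples at 16 kHz (10 ms per frame, as described below), an 11-frame window approximates the 110 ms filter; the window width and edge padding are assumptions of this sketch.

```python
import numpy as np

def median_smooth(p, width=11):
    """Median-filter a frame-wise voicing posterior sequence.
    11 frames at 10 ms per frame approximate a 110 ms window."""
    half = width // 2
    padded = np.pad(p, half, mode='edge')
    return np.array([np.median(padded[i:i + width])
                     for i in range(len(p))])
```

Median filtering removes isolated single-frame voicing flips without blurring the boundaries of longer voiced segments the way a mean filter would.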
The SVAD takes 115 frames of mel-spectrogram as input to capture contextual information over a long time span, following [17]. We resampled audio signals to 16 kHz and merged stereo channels to mono. We extracted the mel-spectrogram with 80 triangular filters between 0 and 8 kHz, a frame length of 1024 samples and a hop size of 160 samples. We compressed the magnitude in a log scale.

Voice False Alarm Detection for the SVAD
The SVAD takes a long segment (115 frames, or 1.15 seconds) as input. We observed that this setting often produces false positive errors around the boundaries of melody contours or short pauses between two melody contours. This is because long input frames taken from the boundary regions contain singing voice in part, and so the SVAD is likely to predict voice presence even if there is no voice at the center position. Therefore, we need an additional means of minimizing the false positive errors that takes a smaller input size. For this purpose, a method of reducing the errors by detecting sub-semitone fluctuations has been previously attempted [22]. In this work, we propose a novel method that utilizes the output of the SPE.
We empirically found that, when the SPE takes non-voiced frames as input and predicts the pitch, the output is not dominant at a particular class and tends to have a low probability for each class. This is probably because the model was trained only with voiced frames, and so it cannot make a prediction with high confidence for the unseen input. Exploiting this observation, we add a voice false alarm detector (VFAD) based on the SPE as follows:

S_VFAD(n) = 1 if max(y_SPE(n)) > θ, and 0 otherwise,

where y_SPE(n) is the softmax output of the SPE at frame n and θ is a threshold. We obtain the final result of singing voice activity, S(n), by incorporating the VFAD into the SVAD:

S(n) = S_SVAD(n) · S_VFAD(n),

where S_SVAD(n) is the result of the SVAD, which returns one for voiced frames and zero for unvoiced frames.
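The two decision rules above can be written directly in numpy. This is a straightforward transcription of the gating scheme, assuming `y_spe` is a (frames × classes) softmax matrix and `s_svad` a binary vector:

```python
import numpy as np

def vfad(y_spe, theta=0.03):
    """Voice false alarm detector: a frame counts as voiced only if
    the maximum softmax output of the SPE exceeds the threshold."""
    return (y_spe.max(axis=1) > theta).astype(int)

def voicing_decision(s_svad, y_spe, theta=0.03):
    """Final voicing S(n) = S_SVAD(n) * S_VFAD(n)."""
    return np.asarray(s_svad) * vfad(y_spe, theta)
```

With many pitch classes, a confident voiced frame concentrates most of its softmax mass on one class, whereas an unvoiced frame spreads it thinly, so even a small θ separates the two cases.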

Temporal Smoothing by HMM
After predicting the output in the SPE, we conduct temporal smoothing of the frame-wise pitch prediction. The procedure is basically borrowed from the Viterbi decoding based on HMM in [7].
Each HMM state corresponds to one of the melody pitch values, and the prior probabilities and the transition matrix are computed from the ground truth of the training set. As posterior probabilities, the prediction from the combined output of the SPE is used. To generate the prior and transition probabilities, we counted the number of occurrences and all pitch-to-pitch transitions for each pitch label, respectively.
In addition, we normalized the transition matrix by replacing each element with the average of its corresponding diagonal. This alleviates the sparsity problem in the transition matrix obtained from a limited training set by assuming that all adjacent pitch transitions depend only on their interval rather than the absolute pitch value.
However, even with the normalization, the diagonal components of the transition matrix are still dominant. Thus, when the pitch difference between consecutive melodies is small, the result of smoothing tends to keep the same pitch. This leads to the loss of detailed changes in the pitch contours.
To deal with this problem, we add more weight to the off-diagonal elements by applying a penalty to the transition matrix as follows:

T' = T + λD,

where D = T − I ⊙ T is the off-diagonal part of the transition matrix T (i.e., T with its diagonal elements set to zero) and I is the identity matrix. By increasing the value of the off-diagonal components, λ adjusts the sensitivity to small pitch changes during the smoothing process.
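The whole post-processing pipeline (interval-based normalization, off-diagonal penalty, Viterbi decoding) can be sketched in numpy. The penalty form T' = T + λD is a reconstruction of the equation above and may differ from the paper's exact formulation; row re-normalization after the penalty is also an assumption of this sketch.

```python
import numpy as np

def normalize_by_interval(T):
    """Replace each element of the transition matrix with the mean of
    its diagonal, so transitions depend only on the pitch interval."""
    n = T.shape[0]
    out = np.zeros_like(T, dtype=float)
    for k in range(-(n - 1), n):
        mean_k = np.diagonal(T, k).mean()
        np.fill_diagonal(out[max(0, -k):, max(0, k):], mean_k)
    return out / out.sum(axis=1, keepdims=True)

def penalize_diagonal(T, lam=1.0):
    """Boost off-diagonal transitions: T' = T + lam * D, where D is
    T with its diagonal zeroed. Rows are re-normalized afterwards."""
    D = T - np.diag(np.diag(T))
    Tp = T + lam * D
    return Tp / Tp.sum(axis=1, keepdims=True)

def viterbi(log_post, log_prior, log_T):
    """Standard Viterbi decoding over frame-wise log-posteriors."""
    n_frames, n_states = log_post.shape
    dp = log_prior + log_post[0]
    back = np.zeros((n_frames, n_states), dtype=int)
    for t in range(1, n_frames):
        scores = dp[:, None] + log_T          # (from_state, to_state)
        back[t] = scores.argmax(axis=0)
        dp = scores.max(axis=0) + log_post[t]
    path = [int(dp.argmax())]
    for t in range(n_frames - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```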

Training Datasets
We used the RWC and MedleyDB datasets to train the SPE. To train the SVAD model, we used the Jamendo dataset in addition to the two. We divided them into training and validation splits to tune the network parameters. To avoid overfitting and select the best performing model, we chose songs such that genre and gender are evenly distributed over the splits and such that songs by the same singer are not divided over the splits.
• RWC [23]: 80 Japanese popular songs and 20 American popular songs with singing voice melody annotations. We divided the dataset into two splits, 85 songs for training and the remaining 15 songs for validation. The total length of the dataset is 407 minutes.
• MedleyDB [10]: 122 songs with a variety of musical genres, 70 of them including vocals with melody annotations. Among them, we chose 60 songs that are dominated by vocal melody. We divided the dataset into two splits, 47 songs for training and the remaining 13 songs for validation.
The total length of the dataset is about 200 minutes.
• Jamendo [24]: 93 songs designed for the evaluation of singing voice detection. The training, validation and test splits are designated as 61, 16 and 16 songs, respectively. The total length of the dataset used for training is about 360 minutes.
We also augmented the three datasets to obtain more generalized models. Pitch shifting has proven to be an effective way to augment data and improve results for singing voice activity detection [17] and melody extraction [15] as well. To this end, instead of resampling, which modifies the pitch and length of audio clips at the same time [25], we used a phase vocoder method that conducts pitch-shifting independent of time-stretching [26]. We augmented the training set by applying pitch-shifting by ±1 and ±2 semitones, thereby increasing the data size by five times.

Test Datasets
To evaluate the proposed model, we use publicly available datasets: ADC2004, LabROSA, MIR-1K and iKala. Synthesized sounds or instrument sounds (e.g. 'train13MIDI.wav' in LabROSA or 'midi1.wav' in the ADC04 dataset) were excluded from the evaluation so that both the SPE and SVAD focus on singing voice in polyphonic music. Thus, among the whole datasets, we used only songs containing singing voice as test data.
• LabROSA 1: 13 excerpts that contain rock, R&B, pop, and jazz songs, as well as audio generated from a MIDI file. We evaluated our algorithm using 9 songs out of the total of 13 songs.
• ADC2004 1: 20 excerpts of 20 seconds that contain pop, jazz and opera songs, as well as synthesized singing and audio from MIDI files. Jazz and MIDI songs were excluded from the evaluation.

Evaluation
We evaluated the proposed method in terms of five metrics, including overall accuracy (OA), raw pitch accuracy (RPA), raw chroma accuracy (RCA), voicing detection rate (VR) and voicing false alarm rate (VFA), as detailed in [6]. We compute them using mir_eval, a Python library designed for objective evaluation in MIR tasks [27]. The evaluation consists of two main parts: voice detection, determining whether a voice is included in a particular time frame (VR and VFA), and pitch detection, determining the most accurate melody pitch for each time frame (RPA, RCA, and OA). We convert the pitch labels, which were quantized to the MIDI scale, back to the frequency scale (Hz) to compare them with the ground truth.

1 We obtained the LabROSA dataset from http://labrosa.ee.columbia.edu/projects/melody/. It was used for part of the 2005 MIREX melody extraction task. In our previous work [15], we referred to it as MIREX05.
The conversion follows the standard MIDI-to-frequency relation, f = 440 · 2^((m−69)/12), where m is the estimated pitch label. The pitch of a frame is considered correct if the difference between the estimated pitch frequency and the ground truth is within ±50 cents (0.5 semitone). In addition, we progressively reduced the pitch tolerance to ±25 and ±12.5 cents and report the results in order to show the performance under stricter conditions.
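The conversion and the tolerance check can be sketched as follows; both formulas are standard (A4 = 440 Hz reference, 100 cents per semitone):

```python
import numpy as np

def midi_to_hz(m):
    """Convert a (possibly fractional) MIDI pitch label to Hz."""
    return 440.0 * 2.0 ** ((m - 69.0) / 12.0)

def within_tolerance(est_hz, ref_hz, cents=50.0):
    """A frame is correct if the estimate is within +/- `cents`
    of the reference (50 cents = half a semitone)."""
    diff = 1200.0 * np.abs(np.log2(est_hz / ref_hz))
    return diff <= cents
```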

Experiments
Given the SPE and SVAD models and training data, we conducted several experiments to figure out the effect of different settings in the models. In the following, we describe the experimental setup.

Training Details of DCNNs
We randomly initialized the network parameters using He uniform initialization [28] and trained them with stochastic gradient descent with Nesterov momentum, which was set to 0.9. We iterated over all the training data for up to 100 epochs. The initial learning rate was set to 0.02. To prevent overfitting, we applied a dropout ratio of 0.3 after all max-pooling layers. As an early-stopping strategy, if the validation accuracy did not increase after 20 iterations, we reset the learning rate to 1/2 of the initial learning rate and repeated the training. We iterated this process five times. For fast computing, we ran the code using Keras [29], a deep learning library in Python, on a computer with two GPUs.

Pitch resolution and ensemble model
Our first experiment is to figure out the maximum pitch resolution of the SPE. High pitch resolution allows the SPE to predict continuous pitch curves, mitigating the pitch quantization problem that the classification-based approach has intrinsically. In our previous work [15], we progressively increased the pitch resolution and observed that the performance saturates before R_8. With the DCNN-based model, we conduct the same experiment and find the pitch resolution that provides the best performance.
We also combine multiple neural networks with different pitch resolutions, as we did in [15]. We denote by SC-SPE_r a single-column DCNN with a pitch resolution R_r and by MC-SPE_r an ensemble model that combines SC-SPE_r, SC-SPE_r/2 and SC-SPE_r/4. We evaluated all the models on two test sets (ADC2004, LabROSA) and compared the accuracy.

HMM-based Postprocessing
We conducted temporal smoothing of the pitch prediction using the Viterbi decoding. The prior probabilities and transition matrix were estimated from the ground truth of the training set. To increase the value of the off-diagonal components, we set λ according to Equation 2. We tried a set of λ values and empirically found that λ = 1 yielded the best results.

Singing voice activity detector with VFAD
As mentioned in Section 2.3, we use the VFAD to reduce false positive frames after the SVAD.
If the maximum softmax output of the SPE does not exceed a specific threshold θ, the frame is assumed not to contain singing voice. To find a proper threshold, we varied θ between 0 and 0.05 on the songs from the ADC04 and LabROSA datasets and evaluated the performance in terms of VR, VFA, precision and F1 score [30]. We also compared the performance of the SVAD with that of state-of-the-art algorithms, reporting the results on the Jamendo dataset as unseen test data.
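The frame-level voicing metrics used here can be sketched as follows. The definitions of VR, VFA, precision and F1 are the standard ones for binary voicing decisions; how ties and empty classes are handled is an assumption of this sketch.

```python
import numpy as np

def voicing_metrics(est, ref):
    """Frame-level voicing metrics for binary voicing decisions:
    recall (VR), voicing false alarm rate (VFA), precision and F1."""
    est, ref = np.asarray(est, bool), np.asarray(ref, bool)
    tp = np.sum(est & ref)          # voiced frames correctly detected
    fp = np.sum(est & ~ref)         # unvoiced frames marked voiced
    fn = np.sum(~est & ref)         # voiced frames missed
    vr = tp / max(tp + fn, 1)
    vfa = fp / max(np.sum(~ref), 1)
    precision = tp / max(tp + fp, 1)
    f1 = 2 * precision * vr / max(precision + vr, 1e-12)
    return vr, vfa, precision, f1
```

Sweeping θ trades VR against VFA: a larger θ removes more false alarms but also drops borderline voiced frames, which is exactly the trade-off examined in Table 3.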

DCNN Models of Melody Extraction
Figure 2 shows the RCA on the two test datasets and the classification accuracy on the validation set with varying pitch resolution. We conducted the experiment by increasing the pitch resolution until the accuracy became saturated. The result shows that the higher the pitch resolution, the lower the classification accuracy. This is because it is harder to predict the exact class corresponding to the reference pitch as the melody extractor has more classes to predict. On the other hand, the RPA on the test datasets tends to increase as the pitch resolution gets higher. Compared to the DNN model, which saturated at R_4 [15], the DCNN model achieves the best results at R_32. This indicates that the DCNN is more capable of handling high pitch resolutions. However, we should note that higher resolutions require more network parameters.
Figure 3 shows that the MC-SPE models perform better than the SC-SPE models in general, validating that combining multiple models with different pitch resolutions is more effective [15].
However, the effect of using the multi-column models becomes less significant as the pitch resolution increases. This is clearly indicated by ∆RPA, the difference in RPA between the two models. For R_32 on LabROSA, the SC-SPE model is even better than the MC-SPE model, achieving the best accuracy.
Thus, considering that the ensemble model multiplies the number of parameters by the number of combined networks, SC-SPE is a more practical choice in the DCNN-based approach.

HMM-based Post-processing
Figure 4 shows that both RPA and RCA increase by more than 1% on both datasets after the temporal smoothing. Comparing the difference between RPA and RCA, we can observe that the octave error decreases significantly. This indicates that abrupt rises and falls of the pitch contours are suppressed.

Singing Voice Activity Detector for Melody Extraction
We compared the performance of the SVAD with the VFAD on ADC04 and LabROSA. Table 3 shows the evaluation metrics when θ is 0, 0.03 and 0.05. The larger the value of θ, the smaller the VFA; however, if the threshold is set too high, the F1 score is lowered. Using the VFAD does not greatly reduce the VFA, because the number of frames it removes is relatively small compared to the voiced frames detected by the SVAD. However, this process makes it possible to provide more natural melody contours. As shown in Figure 5, the effect of the VFAD can be seen by comparing the blue line (obtained from the SVAD) with the yellow line (obtained by further using the VFAD). Based on the results in Table 3, we set θ to 0.03 as a trade-off.

Table 4 compares the SVAD results on the Jamendo test dataset with two state-of-the-art algorithms: one is based on [31], and the Leglaive algorithm is based on BLSTM-RNN [12]. We did not include other state-of-the-art algorithms using DCNN, for example [17], due to different evaluation metrics. The proposed method has higher precision and lower voice recall than the two. This conservative behavior in detecting voice activity is attributed to the VFAD. However, in terms of the F1 score, the proposed method slightly outperforms the two compared algorithms.

In melody extraction, the proposed method significantly outperforms our previous multi-column DNN model [15] on three datasets. The overall accuracy of the proposed model is above 80% for all datasets except MIR-1K. This might be because the audio files in MIR-1K have poor recording quality. Compared to the results from MIREX, the proposed method achieved better accuracy except for those from Dressler on ADC04 [32]. A notable result is that the proposed method has a significantly low voice false alarm rate. This may be attributed to the proposed SVAD supported by the VFAD.

Melodia vs. Proposed Method
In general, the classification-based approach to melody extraction produces discrete pitch contours, losing detailed singing information. However, the proposed method can generate nearly continuous curves by increasing the output resolution up to R_32. Therefore, it preserves natural singing styles such as vibrato or note-to-note transition patterns. In order to confirm the continuity, we obtained the evaluation results with the pitch tolerance reduced to ±25 and ±12.5 cents and compared them with the results from Melodia, a salience-based algorithm that generates continuous pitch curves [14].
Figure 6 shows that the proposed method (SC-SPE) achieves 5 to 10% higher accuracy than Melodia, although the performance becomes worse for smaller tolerances. Figure 7 compares an example of pitch contours from Melodia and the proposed method. This illustrates more intuitively that the proposed method produces highly continuous curves that are similar to the ground truth in Hz.

Conclusions
We proposed a novel melody extraction algorithm composed of a singing pitch extractor and a singing voice activity detector using deep convolutional neural networks. We have shown that the SPE can effectively extract melody features and classify pitch classes. Since the pitch can be predicted with a high resolution, the classification-based algorithm can produce nearly continuous curves. The multi-column method for predicting pitches at various resolutions can improve performance in the DCNN, but the effect becomes less significant as the pitch resolution gets higher. We also proposed a high-performance SVAD with a VFAD that minimizes false positive errors. Finally, we compared our melody extraction model to previous state-of-the-art methods on several public test datasets and showed that the results are comparable to those from the best.

Supplementary Materials:
A demo of the proposed melody extraction method is available at http://mac-bach.kaist.ac.kr/keums/melodyExtraction.

Figure 1 .
Figure 1. The diagram of our architecture for melody extraction, including the singing voice activity detector.

Figure 5 .
Figure 5. An example of singing voice activity detection with the VFAD: (1) The SPE predicts the pitch over all frames. (2) The SVAD determines the singing voice frames (blue box at the bottom) and removes non-vocal melody contours (dotted black). However, some melody lines are misidentified as singing voice (blue line). (3) To reduce the false alarm errors, the VFAD determines non-singing voice frames (red box at the bottom). (4) Finally, we obtain more elaborate melody contours (yellow line). (5) The ground truth (black line) is plotted 100 Hz below the prediction for visual comparison.

Figure 6 .
Figure 6. Comparison of evaluation metrics on LabROSA according to pitch tolerance. The tolerance value used in the MIREX melody extraction task is 50 cents.

Figure 7 .
Figure 7. Comparison of pitch contours for 'opera_fem4.wav' of ADC04. The reference pitch is plotted 140 Hz below for visual comparison.

Table 2. The SVAD is configured with four convolutional blocks and a 1×1 convolutional layer. Each convolutional block contains two 3×3 convolutional layers followed by batch normalization. The number of channels in the blocks gradually increases as 64, 128, 256 and up to 512. 3×3 max-pooling layers with 2×2 stride are used at the end of each convolutional block. At the final stage of the DCNN model, we used 1×1 convolution and global average pooling.

Table 2 .
Configuration of the DCNN for the Singing Voice Activity Detector.

Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 3 November 2017 doi:10.20944/preprints201711.0027.v1
Comparison of raw pitch accuracy between SC-SPE and MC-SPE as the pitch resolution increases. RPA_SC and RPA_MC correspond to the RPA of SC-SPE and MC-SPE, respectively. MC-SPE_R is an ensemble model that combines SC-SPE_R, SC-SPE_R/2 and SC-SPE_R/4.

Table 3 .
Comparison of the proposed SVAD performance with the VFAD according to the threshold θ.

Table 4 .
Comparison of SVAD results on the Jamendo test dataset

Table 5 compares the proposed method with state-of-the-art algorithms on the four test datasets.

Table 5 .
Comparison of Melody Extraction Results