Music fingerprinting based on Bhattacharyya distance for song and cover song recognition

Received Jul 24, 2018; Revised Nov 27, 2018; Accepted Dec 25, 2018

People often have trouble recognizing a song, especially if it is performed by someone other than the original artist, which is called a cover song. Hence, an identification system can help to recognize a song or to detect copyright violation. In this study, we recognize songs and cover songs by using a fingerprint of the song built from features extracted with MPEG-7. The fingerprint of a song is represented by Audio Signature Type, while the fingerprint of a cover song is represented by Audio Spectrum Flatness and Audio Spectrum Projection. Furthermore, we propose a sliding algorithm and k-Nearest Neighbor (k-NN) with Bhattacharyya distance for song recognition and cover song recognition. The results of this experiment show that the proposed fingerprint technique has an accuracy of 100% for song recognition and an accuracy of 85.3% for cover song recognition.


INTRODUCTION
Music is a popular aspect of human life. Many people make music or listen to it while working, studying, or relaxing. In public places, songs are played to entertain visitors, who are often curious about what song they are hearing. In addition, a music recognition system can be used to detect illegally copied music. One problem that is difficult to handle is cover song detection. Cover songs are sung by different singers, and differences in the singer's gender, vocal timbre, and improvised tones make them harder to detect. Several studies have addressed song recognition using a variety of methods. In previous studies, song recognition was done with the Multi-band Sum of Spectrogram, with 91% accuracy [1], and with the PCA algorithm [2]. Cover song recognition has been done using the raw signal as input [3], [4]. From each segment of the signal, the chroma is obtained. This chroma can be used to recognize a cover song with 62% accuracy [5]-[7]. Other experiments used pitch [8]-[10], information-theoretic measures of similarity [11], music structure segmentation [12], and the 2D Fourier transform [13] to recognize cover songs. In a previous experiment, fingerprinting was used to identify the title of a song based on the raw signal of the music. In another experiment, the fingerprint was obtained and processed in spectrogram form. In our experiment, we used features from MPEG-7. The music features extracted with MPEG-7 take the form of a two-dimensional matrix, n x m, and MPEG-7 has subband values for each feature. In a previous experiment, song recognition using fingerprinting was done by applying a sliding algorithm on a one-dimensional matrix [14]. Our experiment instead kept the matrix obtained from MPEG-7 extraction. This activity is known as music information retrieval (MIR). MIR means processing the music spectrogram to obtain useful information [15].
Song recognition (fingerprinting) and cover song detection are part of MIR. Other MIR tasks, such as music recommendation using collaborative filtering and music genre classification, have also been proposed [16], [17]. The fingerprint of a song is a characteristic of a particular piece of music. There are many ways to obtain the fingerprint of a music piece, such as using the scikit-learn library or analyzing its spectrogram.
In this paper, we propose an audio fingerprinting technique for song and cover song recognition based on MPEG-7 features. MPEG-7 has been reported to be successful in music mood detection [18], [19] and tempo classification [3]. A cover song is a song performed by a singer other than the artist who originally performed it [20]. Besides the extraction process, the difference between this experiment and previous experiments is in how the fingerprint of the music was obtained. We used MPEG-7 extraction because it yields a number of features that are useful for obtaining information from a music piece. MPEG-7 extraction can produce 17 features [21]. However, in this experiment, we used only 3 out of these 17 features. For song recognition we used Audio Signature Type, while we used Audio Spectrum Projection and Audio Spectrum Flatness for cover song recognition. MPEG-7 extraction produces an XML file containing the 17 features from the MPEG-7 DDL scheme. To get the features stored in the XML file, XQuery needs to be applied. The selected features are then processed by a sliding algorithm and the k-NN algorithm with Bhattacharyya distance.
The remainder of this paper is arranged as follows: Section 2 explains the research methods, including MPEG-7, Bhattacharyya distance, the sliding algorithm, k-NN, discrete wavelet transform, the song recognition method, the cover song recognition method, the system architecture, and the dataset. Section 3 describes the results obtained from the experiment. Section 4 concludes the paper.

RESEARCH METHOD
This section explains MPEG-7 and the features required for this experiment. It also describes the proposed k-NN combined with the sliding algorithm, Bhattacharyya distance, discrete wavelet transform, and the dataset.

Song recognition method
The goal of this method is to recognize the title of a song. The song, saved in "wav" format, is processed by the MPEG-7 feature extractor. The extraction result is the Audio Signature Type feature, which is then passed to the sliding algorithm and k-NN using Bhattacharyya distance. In this experiment, k-NN was used because it has been reported to perform favorably in non-stationary signal processing [17], [22], [23]. The details of this process are depicted in Figure 1.

Cover song recognition method
The goal of this method is to identify the title of a cover song. Cover song recognition is an extension of the song recognition method. The song, saved in "wav" format, is processed by the MPEG-7 feature extractor. The difference from song recognition is the number of required features. The extraction process of the cover song recognition method produces Audio Spectrum Flatness and Audio Spectrum Projection. These features are then processed by the 2D discrete wavelet transform, and the result is passed to the sliding algorithm and k-NN with Bhattacharyya distance. The specifics of this process are shown in Figure 2.

Feature extraction
MPEG-7 is a standard for describing multimedia content, defined in ISO/IEC 15938 [24]. Multimedia content includes images, music (sound), and video; this study focuses only on music. In performing feature extraction, MPEG-7 produces a number of features called low-level descriptors. The extracted music features are metadata stored in matrix form, with size n×m. An example of such a matrix can be seen in Figure 3. The m value is called the subband metadata of the music and depends on the feature used. The value of n depends on the duration and size of the sound source: the longer the sound, the greater the value of n obtained by MPEG-7 feature extraction. The collection of all MPEG-7 features is stored in an XML document with a specific scheme, called the MPEG-7 DDL scheme. To get features from the MPEG-7 extraction process, XQuery has to be applied to the XML document. MPEG-7 has 17 features, but we used only three of them in this experiment. For song recognition, we used Audio Signature Type, while for cover song recognition we used Audio Spectrum Projection and Audio Spectrum Flatness. Audio Signature Type represents the "identity" of an audio signal; based on MPEG-7 extraction, this feature is the fingerprint of a piece of music. The number of subbands, m, of Audio Signature Type is 16 based on Figure 1. Audio Signature Type is obtained by running the MPEG-7 feature extractor on an audio file saved in "wav" format. It takes the form of a matrix whose values are in the range [-1, 1]. It is this matrix that represents the characteristics of the music, and it has to be processed to perform song recognition.
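As a rough illustration of the feature lookup described above, the sketch below reads one n×m feature matrix out of an XML document. The paper uses XQuery; this sketch substitutes Python's standard `xml.etree.ElementTree` module. The element and attribute names (`Features`, `Feature`, `name`, `Row`) and the function `read_feature_matrix` are a hypothetical, simplified layout chosen for illustration, not the actual MPEG-7 DDL scheme.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified XML layout (NOT the real MPEG-7 DDL scheme).
SAMPLE_XML = """
<Features>
  <Feature name="AudioSignatureType">
    <Row>0.10 -0.20 0.30</Row>
    <Row>0.05 0.00 -0.15</Row>
  </Feature>
  <Feature name="AudioSpectrumFlatness">
    <Row>0.9 0.8</Row>
  </Feature>
</Features>
"""


def read_feature_matrix(xml_text, feature_name):
    """Return the rows of one named feature as a list of float lists."""
    root = ET.fromstring(xml_text.strip())
    # ElementTree supports a limited XPath subset, including attribute tests.
    node = root.find(f"./Feature[@name='{feature_name}']")
    if node is None:
        raise KeyError(f"feature not found: {feature_name}")
    return [[float(v) for v in row.text.split()] for row in node.findall("Row")]


signature = read_feature_matrix(SAMPLE_XML, "AudioSignatureType")
n, m = len(signature), len(signature[0])  # n rows (frames), m subbands
```

Selecting one feature by name keeps the rest of the pipeline independent of how many descriptors the extractor writes to the file.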
In the cover song recognition experiment we used two features from MPEG-7. Audio Spectrum Projection (ASP) is a feature derived from independent component analysis (ICA) and singular value decomposition (SVD). ASP is used to represent a low-dimensional feature of the spectrum after projection. It is a spectrogram that is used for sound classification from various sources. Several steps are required to get the value of ASP. First, apply (1) to the Audio Spectrum Envelope (AE) feature, where l is the index of an AE logarithmic frequency range (AE_cb(k, l)) and k is the frame index. After that, apply (2) and (3), whose goal is to normalize each decibel-scale spectral vector with its root mean square (RMS), where L is the number of AE spectral coefficients and K is the total number of frames.
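The first two steps described above (decibel conversion and per-frame normalization) can be sketched as follows. Equations (1) to (3) are not reproduced in this text, so the sketch makes two assumptions: the decibel scale is 10·log10 of each envelope coefficient, and each frame is divided by the square root of its summed squares over the L coefficients. The function names and the `floor` parameter are hypothetical, and the later ICA/SVD projection is not shown.

```python
import math


def to_decibel(envelope, floor=1e-12):
    """Convert each spectral coefficient to a decibel scale: 10*log10(x)."""
    return [[10.0 * math.log10(max(x, floor)) for x in frame] for frame in envelope]


def normalize_frames(frames_db):
    """Divide every frame by the square root of its summed squares."""
    normalized = []
    for frame in frames_db:
        norm = math.sqrt(sum(x * x for x in frame))
        normalized.append([x / norm for x in frame] if norm > 0 else list(frame))
    return normalized


# Toy envelope: K = 2 frames, L = 3 spectral coefficients per frame.
envelope = [[1.0, 10.0, 100.0], [100.0, 100.0, 100.0]]
frames_db = to_decibel(envelope)
unit_frames = normalize_frames(frames_db)
```

After normalization every frame has unit length, so the later projection depends on spectral shape rather than on overall loudness.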
Audio Spectrum Flatness is the second MPEG-7 feature used in cover song recognition. Audio Spectrum Flatness (ASF) reflects the flatness properties of a signal. The flatness properties are defined by a specified number of frequency bands. ASF shows how the power spectrum of a signal deviates from the frequencies of a flat shape. The term "flat shape" describes the noise or impulse in a signal. This feature is used to measure the similarity between one audio signal and another. The first step of ASF extraction is calculating the power spectrum of the signal [21]. Then, the spectrum is divided into frequency bands bounded by loSign and hiSign.
The values of loSign and hiSign are determined by (4) and (5), respectively, where the value of Y determines the lower band edge. The minimum value for loSign is recommended to be 250 Hz, so n=-8, and D is the desired number of frequency bands. Defining the frequency bands with hiSign and loSign makes ASF too sensitive to the sampling frequency. Hence, the definition needs to be modified so that all bands slightly overlap each other. The modification of hiSign and loSign can be seen in (6) and (7). Then, for each frequency band d, ASF can be calculated with (8).
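The band layout and the flatness measure can be sketched as below. Equations (4) to (8) are not reproduced in this text, so the sketch assumes the commonly cited MPEG-7 definitions: quarter-octave bands whose lowest edge is 2^(0.25n) × 1 kHz, a 5% widening on each side so that neighbouring bands overlap, and flatness as the ratio of the geometric mean to the arithmetic mean of the power coefficients in a band. The function names and the `overlap` parameter are hypothetical.

```python
import math


def band_edges(n=-8, num_bands=4, overlap=0.05):
    """Quarter-octave band edges in Hz, widened by `overlap` on each side."""
    lo_edge = (2.0 ** (0.25 * n)) * 1000.0  # n = -8 gives 250 Hz
    edges = []
    for d in range(1, num_bands + 1):
        lo = (1.0 - overlap) * lo_edge * 2.0 ** (0.25 * (d - 1))
        hi = (1.0 + overlap) * lo_edge * 2.0 ** (0.25 * d)
        edges.append((lo, hi))
    return edges


def spectral_flatness(power):
    """Geometric mean divided by arithmetic mean of power coefficients."""
    if any(p <= 0 for p in power):
        raise ValueError("power coefficients must be positive")
    log_mean = sum(math.log(p) for p in power) / len(power)
    return math.exp(log_mean) / (sum(power) / len(power))


edges = band_edges()
flat = spectral_flatness([2.0, 2.0, 2.0, 2.0])  # perfectly flat band
peaky = spectral_flatness([1.0, 4.0])           # geometric 2.0 / arithmetic 2.5
```

A value near 1 indicates a noise-like (flat) band, while a value near 0 indicates that the power is concentrated in a few coefficients.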

Bhattacharyya distance
Bhattacharyya distance is not a distance in the usual geometric sense. Unlike Euclidean distance, it measures the similarity between two distributions [25]. It is based on the Bhattacharyya coefficient, which is commonly used to quantify the relative closeness of two vectors. The formula of Bhattacharyya distance can be seen in (9), where D_B(r, s) is the distance between the distributions or classes r and s.
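A minimal sketch of the distance is given below, assuming the discrete form D_B(r, s) = -ln Σ_i sqrt(r_i s_i), where r and s are non-negative vectors normalized to sum to 1. Equation (9) is not reproduced in this text, so whether the paper uses this discrete form or a Gaussian form is an assumption; the function names are hypothetical.

```python
import math


def normalize(values):
    """Scale non-negative values so they sum to 1 (a discrete distribution)."""
    total = sum(values)
    if total <= 0 or any(v < 0 for v in values):
        raise ValueError("values must be non-negative with a positive sum")
    return [v / total for v in values]


def bhattacharyya_distance(r, s):
    """D_B(r, s) = -ln(sum_i sqrt(r_i * s_i)) for two discrete distributions."""
    if len(r) != len(s):
        raise ValueError("distributions must have the same length")
    coefficient = sum(math.sqrt(a * b) for a, b in zip(normalize(r), normalize(s)))
    if coefficient == 0.0:
        return math.inf  # no overlap at all
    return -math.log(min(coefficient, 1.0))  # clamp guards against rounding above 1


same = bhattacharyya_distance([1, 2, 3], [2, 4, 6])  # identical after normalizing
apart = bhattacharyya_distance([1, 0], [0, 1])       # disjoint support
```

Because the inputs are normalized first, two vectors that differ only by a scale factor have a distance of approximately zero, which is one reason the measure behaves differently from Euclidean distance.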

Discrete wavelet transform (DWT)
Discrete wavelet transform (DWT) is widely used in signal processing. DWT is a method that reconstructs a signal while still retaining the characteristics of the original signal. It has also been applied successfully to electronic nose signals [22], [26], [27]. A signal to which the wavelet transform has been applied is shorter than the original. Two filters are used when applying the wavelet transform to an audio signal: a low-pass filter and a high-pass filter. The low-pass filter yields the approximation coefficients, which represent the voice of the singer (in this case women and men have almost the same values), and the high-pass filter yields the detail coefficients, which represent the sound of the instruments. In this study, only the approximation coefficients were taken from the DWT. DWT is highly dependent on the level of decomposition used: the higher the level, the higher the risk that the signal will be damaged; conversely, the lower the level, the more susceptible the signal is to noise [28]. Decomposition up to level 4 with the Haar wavelet family was used in this experiment. All levels are obtained through the low-pass filter so that they can be classified. The 2-D DWT is applied to each feature, which is still in matrix form. Figure 4 shows an example of an audio signal reconstructed by DWT.
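The approximation-only decomposition can be sketched in plain Python as follows. For the orthonormal Haar wavelet, one 2-D approximation coefficient is the sum of a 2×2 block divided by 2. Padding odd-sized matrices by repeating the last row or column is an assumption made here for illustration, and the function names are hypothetical; a wavelet library may handle borders differently.

```python
def haar_approx_2d(matrix):
    """One level of 2-D Haar DWT, keeping only the approximation band."""
    rows = [list(r) for r in matrix]
    if len(rows) % 2:            # pad an odd row count by repeating the last row
        rows.append(list(rows[-1]))
    if len(rows[0]) % 2:         # pad an odd column count the same way
        rows = [r + [r[-1]] for r in rows]
    out = []
    for i in range(0, len(rows), 2):
        out.append([
            (rows[i][j] + rows[i][j + 1] + rows[i + 1][j] + rows[i + 1][j + 1]) / 2.0
            for j in range(0, len(rows[0]), 2)
        ])
    return out


def haar_approx_levels(matrix, levels=4):
    """Apply the approximation step repeatedly, returning every level's output."""
    results = []
    current = matrix
    for _ in range(levels):
        current = haar_approx_2d(current)
        results.append(current)
    return results


# Toy 16 x 16 feature matrix filled with ones.
feature = [[1.0] * 16 for _ in range(16)]
levels = haar_approx_levels(feature, levels=4)
shapes = [(len(a), len(a[0])) for a in levels]
```

Each level halves both dimensions, which illustrates why a transformed signal is shorter than the original and why high decomposition levels leave little of the signal to work with.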

k-NN combined with sliding algorithm
As mentioned above, the features of MPEG-7 are in the form of a matrix, and k-NN is used as the classification method. Common k-NN methods take training data and testing data and measure the distance between them using Euclidean distance. In this experiment, however, we used Bhattacharyya distance. k-NN expects both training data and testing data to be vectors, but the feature extraction result of MPEG-7 is a two-dimensional matrix. In order to keep the matrix form, k-NN is combined with a modified sliding algorithm.
In this experiment, we did not "slide" each subband (row) individually. Bhattacharyya distance allows us to use the average value of each matrix, because it measures the distance between vectors, not between points. The average value of each matrix is slid and the Bhattacharyya distance is calculated. This method is used for both song recognition and cover song recognition. In this experiment, the k parameter of k-NN is 5. The difference between song recognition and cover song recognition is the type and number of features used.

Figure 5 illustrates the details of the system architecture used in this experiment. The process flow is as follows:
1. A song is recorded through a mobile device application. The recorded song should be in "wav" format.
2. The recorded song is sent to a server.
3. Features are extracted from the recorded song using MPEG-7.
4. The result of the extraction process is an XML document that contains several features. XQuery is then used to obtain the required features.
5. The required features are passed to the preprocessing method.
6. The result of the preprocessing stage is used in the classification stage, according to the method being executed.
7. The result of the processing is shown in the mobile device application.
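The sliding k-NN classification described in this subsection can be sketched as below. This is one possible reading of the description, not the authors' implementation: each window of rows is summarized by its column-wise mean, the distance to a training item is the smallest Bhattacharyya distance over all window positions, and the k nearest items vote. It assumes non-negative feature values (values in [-1, 1] would need to be shifted first) and training items at least as long as the query. All function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def column_means(rows):
    """Average a block of rows into one vector (one value per column)."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def bhattacharyya(r, s):
    """-ln of the Bhattacharyya coefficient of two non-negative vectors."""
    pr, ps = sum(r), sum(s)
    bc = sum(math.sqrt((a / pr) * (b / ps)) for a, b in zip(r, s))
    return math.inf if bc == 0 else -math.log(min(bc, 1.0))


def sliding_distance(train, query):
    """Smallest distance over all windows of len(query) rows inside train."""
    q = column_means(query)
    width = len(query)
    return min(
        bhattacharyya(column_means(train[i:i + width]), q)
        for i in range(len(train) - width + 1)
    )


def knn_predict(training, query, k=5):
    """Vote among the k training items with the smallest sliding distance."""
    ranked = sorted((sliding_distance(m, query), label) for label, m in training)
    votes = Counter(label for _, label in ranked[:k])
    best = max(votes.values())
    # Ties are broken in favour of the nearest item among the tied labels.
    return next(label for _, label in ranked if votes[label] == best)


# Toy data: three items per title, six rows each, three subbands.
training = [
    ("A", [[5, 1, 1]] * 6), ("A", [[4, 1, 1]] * 6), ("A", [[5, 2, 1]] * 6),
    ("B", [[1, 1, 5]] * 6), ("B", [[1, 1, 4]] * 6), ("B", [[1, 2, 5]] * 6),
]
query = [[5, 1, 1]] * 2
predicted = knn_predict(training, query, k=5)
```

Taking the minimum over window positions lets a short excerpt match a full-length item regardless of where in the song the excerpt was cut.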

Dataset
The audio dataset was obtained from YouTube. The audio dataset has to be in "wav" format. Each item of training data for song recognition and cover song recognition was one whole song with a duration of 3-4 minutes. After extraction, the signal had a length of 300-400 subbands (rows). Each item of testing data was randomly cut to a length of 30 seconds. After extraction, the signal had a length of 20-30 subbands (rows). The subbands of all training data were then slid along the subbands of the testing data. The result of this sliding is the distance between the testing data and the training data. The shortest distance is then selected. The training data and testing data scheme is shown in Figure 6.
For the song recognition experiment, thirty-three randomly selected songs were used. Cover song recognition used fifty songs with five unique song titles. For each title there were five cover songs with a male singer and five cover songs with a female singer.

RESULTS AND ANALYSIS
This section discusses the results of the experiment, including song and cover song recognition.

Song recognition result
The training data for fingerprint processing consisted of 33 songs. The songs were recorded using a mobile device. The Audio Signature Type feature of each song needs to be saved in a database. In this experiment, we used a whole song as training data. The testing data were 30-second excerpts randomly cut from the training data. The accuracy of the song recognition method was calculated using (11):

$$\text{Accuracy} = \frac{\text{TRUE}}{\text{DATA}} \times 100\% \quad (11)$$

where TRUE is the total number of valid predictions and DATA is the total number of testing data. The result of song recognition testing using Bhattacharyya distance is shown in Table 1. Each song title denotes the training data and testing data used in this experiment. In this experiment, all songs were recognized correctly. According to Table 2 and (11), the testing scenario for song recognition using Bhattacharyya distance had a recognition accuracy of 100%.

Cover song recognition result
To calculate the accuracy of cover song recognition, we used k-fold cross validation. In k-fold cross validation, all entries in the original training data set are used for both training and validation, and each entry is used for validation exactly once. K-fold cross validation estimates the accuracy of the machine learning model, in this case k-NN, by averaging the accuracies derived from all k cases of cross validation.
In this experiment, the value of k used was 3. Hence, there were ten iterations to calculate the accuracy of our model. The result of cross validation for cover song recognition is shown in Table 3. From Table 3, we can see that the accuracy of cover song recognition was 85.3%. The standard deviation of cover song recognition was 8.18%, which indicates that the classification was quite stable.
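The cross validation procedure can be sketched as follows, under stated assumptions: folds are contiguous and unshuffled, the toy 1-nearest-neighbour function stands in for the sliding k-NN classifier, and the spread is reported as a population standard deviation (the paper does not say which form it uses). All names and the toy data are hypothetical.

```python
import statistics


def k_fold_indices(n_items, k):
    """Split range(n_items) into k contiguous folds of near-equal size."""
    base, extra = divmod(n_items, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(items, labels, k, train_and_predict):
    """Return per-fold accuracies; each item is validated exactly once."""
    accuracies = []
    for fold in k_fold_indices(len(items), k):
        held_out = set(fold)
        train = [(items[i], labels[i]) for i in range(len(items)) if i not in held_out]
        correct = sum(train_and_predict(train, items[i]) == labels[i] for i in fold)
        accuracies.append(100.0 * correct / len(fold))
    return accuracies


def nearest_label(train, x):
    """Toy 1-nearest-neighbour on numbers, standing in for the real classifier."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]


items = [0.0, 0.1, 5.0, 5.1, 0.2, 5.2]
labels = ["a", "a", "b", "b", "a", "b"]
scores = cross_validate(items, labels, k=3, train_and_predict=nearest_label)
mean_accuracy = statistics.mean(scores)
spread = statistics.pstdev(scores)
```

In practice the data would usually be shuffled or stratified by title before splitting, so that every fold contains examples of each song.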

CONCLUSION
In order to recognize a song, the MPEG-7 feature Audio Signature Type was used as the fingerprint or "identity" of a song. In cover song recognition, the MPEG-7 features Audio Spectrum Projection and Audio Spectrum Flatness were used. For both song recognition and cover song recognition, a k-NN classifier was used. Because the data are originally in matrix form, the k-NN classifier was modified by combining it with a sliding algorithm and Bhattacharyya distance. The accuracy for song recognition in this experiment was 100%. For cover song recognition, the accuracy calculated using cross validation was 85.3%. These results indicate that the proposed method can achieve favorable performance in recognizing songs and cover songs.