Impulsive spike enhancement on gamelan audio using harmonic percussive separation

ABSTRACT


Int J Elec & Comp Eng
ISSN: 2088-8708  Impulsive spike enhancement on gamelan audio using harmonic percussive separation (Solekhan) 1701

Several earlier studies are relevant to this work. Amart Sulong, Teddy Surya Gunawan, Othman O. Khalifa, Mira Kartiwi and Hassan Dao [7] addressed speech audio signal enhancement, and Kayode Francis Akingbade and Isiaka Ajewale Alimi [8] explained a separation process for music audio using the Least-Mean-Square (LMS) adaptive algorithm. J. Driedger, M. Müller and S. Ewert [9] and H. Tachibana, N. Ono and S. Sagayama [10] explained the separation of the harmonic and percussive components of an audio signal. Spikes, or impulsiveness, tend to occur only in the percussive component.

Figure 1. Spikes on audio
This research proposes a spike enhancement process using HPSS, with the median filter method serving as the comparison spike enhancement method. This paper is organized as follows: the research method is presented in section 2, the results and discussion in section 3, and the conclusion in section 4.

RESEARCH METHOD

Spike enhancement
Spike enhancement is the process of enhancing spike signals by reducing or amplifying them. Some spike noise filters can be used to improve quality and reduce noise from signal interference. The median filter is one method that enhances a signal by reducing the presence of spikes. It replaces each sample with the median of all the samples in a surrounding range.
The median of a set of samples is obtained by sorting the samples in ascending or descending order and then choosing the center value. In median filtering, a window slides sequentially over the signal, and the middle sample in the window is replaced by the median of all the samples in the window [6], as illustrated in Figure 2. Some experiments have shown that the median filter reduces spike signals well. The median filter output xmed(n) for an input x(n) and a median window of 2K+1 samples can be written as in (1),

$$x_{med}(n) = \mathrm{median}\big(x(n-K), \ldots, x(n+K)\big) \tag{1}$$

where the median is the middle value of the given one-dimensional data, n is the discrete time index, x is the source signal, K is an integer, and xmed is the result of the median filter.

The purpose of Harmonic Percussive Source Separation (HPSS) is to separate the input audio signal into one signal that contains the harmonic component and one that contains the percussive component [6]. In general, sounds can be classified into two groups: harmonic and percussive. A harmonic sound is a pitched sound or melody, while a percussive sound is a transient sound such as that of percussive musical instruments. The algorithm is based on the fact that audio has both components: in the spectrogram, the harmonic component tends to form horizontal structures, while the percussive component tends to form vertical structures.
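As a concrete illustration of the median filter in (1), the following minimal Python sketch applies a sliding window of 2K+1 samples to a one-dimensional signal. The function name `median_filter_1d` and the edge handling (repeating the first and last samples) are assumptions made for this example; the text does not state how the signal boundaries are treated.

```python
import numpy as np


def median_filter_1d(x, K):
    """Return x_med(n) = median(x(n-K), ..., x(n+K)) for every n.

    Assumption: the signal is extended at both ends by repeating
    the edge samples so that the output has the same length as x.
    """
    x = np.asarray(x, dtype=float)
    padded = np.pad(x, K, mode="edge")
    # One row per output sample, each row holding 2K+1 neighbours.
    windows = np.lib.stride_tricks.sliding_window_view(padded, 2 * K + 1)
    return np.median(windows, axis=1)


# A single isolated spike is removed by a 3-sample window (K = 1).
signal = np.array([0.0, 0.0, 0.0, 5.0, 0.0, 0.0, 0.0])
filtered = median_filter_1d(signal, K=1)
```

Note that the filter can only suppress a spike; it has no way to make a weak spike stronger.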
The separation process yields the harmonic and percussive signal components. Spikes are the impulsive component of percussive musical audio, so separating the harmonic and percussive components makes it easier to enhance the spike component (by reducing, amplifying or removing it). The spike enhancement is carried out on the percussive component, and the result is then combined with the harmonic component to form the output. When the percussive component needs to be eliminated entirely, this is done simply by taking the harmonic component only.

Cosine similarity
Cosine similarity (CS) [11], [12] is a measure based on the angle between two signals. For two signals x and x̂, the cosine similarity is described by (2),

$$CS(x, \hat{x}) = \frac{x \cdot \hat{x}}{\lVert x \rVert \, \lVert \hat{x} \rVert} \tag{2}$$

where CS is the cosine similarity, x is the source signal, and x̂ is the new signal derived from x. The similarity ranges from -1, which means exactly opposite, to 1, which means exactly the same; 0 usually indicates that the signals are independent. Cosine similarity can also be seen as a method of normalizing document length during comparison; here the normalization removes the effect of the signal magnitudes.

Cosine Distance (CD) is derived from the cosine similarity through the relationship in (3).

$$CD(x, \hat{x}) = 1 - CS(x, \hat{x}) \tag{3}$$

This test is also used to measure the similarity between two vectors of music audio signals [12]. CD ranges from 0 to 2, where zero means exactly the same and two means exactly opposite.
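The two measures can be sketched in a few lines of Python. This is an illustrative example only; the function names are assumptions, and both signals are assumed to be non-zero vectors of equal length.

```python
import numpy as np


def cosine_similarity(x, x_hat):
    """Cosine of the angle between two equal-length, non-zero signals."""
    x = np.asarray(x, dtype=float)
    x_hat = np.asarray(x_hat, dtype=float)
    return float(np.dot(x, x_hat) / (np.linalg.norm(x) * np.linalg.norm(x_hat)))


def cosine_distance(x, x_hat):
    """Cosine distance, defined as 1 minus the cosine similarity."""
    return 1.0 - cosine_similarity(x, x_hat)


x = np.array([1.0, 2.0, 3.0])
cd_same = cosine_distance(x, x)        # identical signals: close to 0
cd_opposite = cosine_distance(x, -x)   # inverted signal: close to 2
```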

Noise ratio
The mean squared error (MSE) measures the average of the squares of the errors or deviations, as in (4),

$$MSE = \frac{1}{N} \sum_{m=0}^{N-1} \big(x(m) - \hat{x}(m)\big)^2 \tag{4}$$

where MSE is the mean squared error, m is the discrete time index, N is the length of the data, x is the source signal, and x̂ is the new signal obtained from x after spike enhancement. The MSE is a measure of the quality of an estimator; it is always non-negative, and values closer to zero are better [14].
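A minimal Python sketch of the MSE computation follows. The function name is an assumption, and the two signals are assumed to have the same length N.

```python
import numpy as np


def mean_squared_error(x, x_hat):
    """Average squared difference between source x and enhanced x_hat."""
    x = np.asarray(x, dtype=float)
    x_hat = np.asarray(x_hat, dtype=float)
    return float(np.mean((x - x_hat) ** 2))


# Only the last sample differs (by 2), so the MSE is 2**2 / 3.
mse = mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```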

Perceptual evaluation of audio quality (PEAQ)
Perceptual Evaluation of Audio Quality (PEAQ) is a standard that describes how to test audio quality objectively. It measures the difference between the original audio data and the audio data under test. The PEAQ measurement scale is shown in Table 7, based on ITU-R BS.1387 [15].
PEAQ is estimated by mapping the Model Output Variables (MOVs) to a single number using an artificial neural network with one hidden layer. The activation function of the neural network is an asymmetric sigmoid. The inputs are mapped to a distortion index (DI) as in (5), where the network uses I inputs and J nodes in the hidden layer; I is the number of MOVs (11 for the Basic version, 5 for the Advanced version). For the Basic version, the MOVs are input to a neural network with 11 input nodes, 1 hidden layer with 3 nodes and a single output. In the Advanced version, the scaled and shifted MOVs are input to a neural network with 5 input nodes, 1 hidden layer with 5 nodes and a single output.
The Basic version uses only the FFT-based ear model and 11 MOVs. These 11 MOVs are mapped to a single quality index using a neural network with three nodes in the hidden layer. The parameters of the mapping are given in Tables 1 to 3: the weights for the input nodes (win) of the Basic version are in Table 2, and the weights for the output nodes (wout) are in Table 3. For the Advanced version, the weights for the input nodes are in Table 5 and the weights for the output nodes are in Table 6. The objective difference grade (ODG) is given by (6), where bmin and bmax are the scaling factors for the output; bmin sets the minimum output value. The ODG is obtained from the MOVs through the neural network and is interpreted on the scale shown in Table 7.
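For reference, the mapping from MOVs to DI and ODG is commonly written as follows in descriptions of ITU-R BS.1387. This is a sketch in assumed notation rather than a transcription of (5) and (6): the scaling bounds $a_{min}$ and $a_{max}$, the bias terms $w_{in}(I,j)$ and $w_{out}(J)$, and the explicit sigmoid form are assumptions that are not stated in the text above.

```latex
% Sigmoid activation
\mathrm{sig}(z) = \frac{1}{1 + e^{-z}}

% Distortion index from I scaled MOVs and J hidden nodes
DI = w_{out}(J) + \sum_{j=0}^{J-1} w_{out}(j)\,
     \mathrm{sig}\!\left( w_{in}(I,j) + \sum_{i=0}^{I-1} w_{in}(i,j)\,
     \frac{\mathrm{MOV}(i) - a_{min}(i)}{a_{max}(i) - a_{min}(i)} \right)

% Objective difference grade, scaled to the output range
ODG = b_{min} + (b_{max} - b_{min})\,\mathrm{sig}(DI)
```

In this form the ODG is bounded between bmin and bmax, because the sigmoid output lies between 0 and 1.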

Enhance spike using HPSS method
In this process, the input audio signal is first transformed using the STFT. The spectrogram (Y) is then median filtered in the horizontal direction to obtain (Yh) and in the vertical direction to obtain (Yp). From these, the harmonic mask (Mh) and the percussive mask (Mp) are determined. Spike enhancement using HPSS is described in Figure 3.

Figure 3. Enhancement spike using HPSS [5]
After harmonic and percussive masking, the harmonic signal (xh) and the percussive signal (xp) are obtained through the inverse STFT. The percussive signal is then reduced (xep) and merged back with the harmonic component (xes).
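The signal x is assumed to contain a harmonic component (xh) and a percussive component (xp), as in (8).

$$x = x_h + x_p \tag{8}$$

To perform the separation, the first step is to compute the short-time Fourier transform (STFT) X of the signal x, as shown in (9),

$$X(m,k) = \sum_{n=0}^{N-1} x(n + mH)\, w(n)\, e^{-2\pi i k n / N} \tag{9}$$

with m ∈ [0 : M − 1] and k ∈ [0 : N − 1], where M is the number of frames, N is the frame size, w : [0 : N − 1] → ℝ is the window function and H is the hop size. From X, the spectrogram Y is obtained as in (10).

$$Y(m,k) = |X(m,k)|^2 \tag{10}$$

The next step is to apply a median filter to the spectrogram both horizontally and vertically, as in (11) and (12),

$$Y_h(m,k) = \mathrm{median}\big(Y(m-K,k), \ldots, Y(m+K,k)\big) \tag{11}$$

$$Y_p(m,k) = \mathrm{median}\big(Y(m,k-K), \ldots, Y(m,k+K)\big) \tag{12}$$

where Yh is the horizontally filtered spectrogram of Y, Yp is the vertically filtered spectrogram of Y, m is the index in the horizontal (time) direction, k is the index in the vertical (frequency) direction, and K is an integer. The binary masks are obtained with (13) and (14),

$$M_h(m,k) = \begin{cases} 1, & Y_h(m,k) \ge Y_p(m,k) \\ 0, & \text{otherwise} \end{cases} \tag{13}$$

$$M_p(m,k) = \begin{cases} 1, & Y_p(m,k) > Y_h(m,k) \\ 0, & \text{otherwise} \end{cases} \tag{14}$$

where Mh and Mp are the harmonic and percussive binary masks respectively. The harmonic and percussive components are then obtained using (15) and (16),

$$X_h(m,k) = X(m,k)\, M_h(m,k) \tag{15}$$

$$X_p(m,k) = X(m,k)\, M_p(m,k) \tag{16}$$

where Xh is the harmonic component in the frequency domain and Xp is the percussive component. By performing the inverse STFT, the harmonic and percussive signals are obtained as in (17) and (18).

$$x_h = \mathrm{ISTFT}(X_h) \tag{17}$$

$$x_p = \mathrm{ISTFT}(X_p) \tag{18}$$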
With xp the percussive component and xh the harmonic component of signal x, the percussive component is then reduced or amplified to give xep, as shown in (19),

$$x_{ep} = ef \cdot x_p \tag{19}$$

with ef as the enhance factor. In (20), a new signal xes is defined as the combination of the scaled percussive component and the harmonic component.

$$x_{es} = x_h + x_{ep} \tag{20}$$

The result is a new combined signal in which the percussive component has been enhanced.
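The steps from (8) to (20) can be sketched end to end in Python with SciPy. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function name `hpss_enhance`, the Hann window with 50% overlap (SciPy's defaults), the frame size of 1024 samples, the median length K = 8 and the synthetic test signal are all choices made for this example.

```python
import numpy as np
from scipy.ndimage import median_filter
from scipy.signal import istft, stft


def hpss_enhance(x, ef, K=8, nperseg=1024):
    """Scale the percussive part of x by ef and recombine with the harmonic part.

    Assumed parameters: K (median half-length) and nperseg (frame size).
    Returns the harmonic signal, the percussive signal and the enhanced signal.
    """
    x = np.asarray(x, dtype=float)
    # STFT; SciPy returns X with shape (frequency bins, time frames).
    _, _, X = stft(x, nperseg=nperseg)
    Y = np.abs(X) ** 2                                  # spectrogram
    size = 2 * K + 1
    Yh = median_filter(Y, size=(1, size))               # along time (horizontal)
    Yp = median_filter(Y, size=(size, 1))               # along frequency (vertical)
    Mh = Yh >= Yp                                       # harmonic binary mask
    Mp = ~Mh                                            # percussive binary mask
    _, xh = istft(X * Mh, nperseg=nperseg)              # harmonic signal
    _, xp = istft(X * Mp, nperseg=nperseg)              # percussive signal
    xh, xp = xh[: len(x)], xp[: len(x)]                 # drop any STFT padding
    x_es = xh + ef * xp                                 # recombine
    return xh, xp, x_es


# Synthetic example (an assumption): a sine tone with one added impulse.
n = np.arange(8192)
x = np.sin(2 * np.pi * 440 * n / 44100)
x[4000] += 5.0
xh, xp, x_reduced = hpss_enhance(x, ef=0.7)      # ef < 1 reduces the spike
_, _, x_amplified = hpss_enhance(x, ef=1.3)      # ef > 1 amplifies the spike
```

Because the two binary masks are complementary, xh + xp reconstructs the input up to numerical error, so ef = 1 leaves the signal essentially unchanged.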

RESULTS AND DISCUSSION
The steps of spike enhancement testing using the HPSS method are as follows. The experiment is applied to a signal that contains a spike, as shown in Figure 4. The signal is then transformed to form the spectrogram pattern shown in Figure 5. Median filters are applied to the spectrogram horizontally and vertically to obtain the binary masks as in (13) and (14). The separated signals are converted back using the inverse STFT ((17) and (18)), as shown in Figures 8 and 9, and the percussive signal component is then enhanced, as shown in Figure 10. The final step is to recombine the enhanced percussive signal and the harmonic signal, as shown in Figure 11.
The music audio recordings are taken from ITS gamelan instruments, and the experiment and testing are conducted in the B303 laboratory. The experiment is first performed using 132 audio signals to examine spike enhancement using the median filter. In (1), the value of K ranges from 1 to 6.
The results of this experiment are tested using Cosine Distance (CD), Mean Square Error (MSE) and Objective Difference Grade (ODG). The test results for spike enhancement using a median filter with index K from 1 to 6 are shown in Table 8. From Table 8, it is found that, for spike enhancement by reduction using the median filter method, a smaller range is better, with CD and MSE values close to zero. This indicates that median filters can be used for spike reduction in gamelan music audio.
In some circumstances, the spikes in a gamelan audio recording have low magnitude and need to be amplified, but spike enhancement using the median filter method cannot amplify a spike. The HPSS method can be used to solve this issue.
The experiment and testing of spike enhancement using the HPSS method are performed using 132 audio signals, with the Enhance Factor (EF) ranging from 0.7 to 1.3. An EF of 1 means neither amplification nor reduction, so it is not used; an EF less than one (EF < 1) reduces the spike, and an EF greater than one (EF > 1) amplifies it. The results of this experiment are tested with Cosine Distance (CD), Mean Square Error (MSE) and Objective Difference Grade (ODG); the results for HPSS with EF from 0.7 to 1.3 can be seen in Table 9.
From Table 8 and Table 9, it is found that spike enhancement with the proposed method (HPSS) has a smaller CD value, with an average of 0.0004, while the median filter method has a CD value of 0.0012. The MSE of the HPSS method has a very small average value of 0.0004, while that of the median filter method is 0.0027. In the Perceptual Evaluation of Audio Quality (PEAQ) test, the median filter has an average ODG value of -1.59 (perceptible but not annoying), as shown in Table 8, while HPSS has an average ODG value of -0.24 (imperceptible), as shown in Table 9. The PEAQ result for spike enhancement using the HPSS method is therefore better than that for spike enhancement using the median filter method. Spike enhancement with the HPSS method can amplify weak spikes and reduce high spikes, while spike enhancement using the median filter method can only reduce them.

CONCLUSION
The tests performed on gamelan audio show that the median filter method can be applied to spike enhancement by reducing impulsive spikes in gamelan music audio, with an average Cosine Distance (CD) of 0.0012, a Mean Square Error (MSE) of 0.0027, and an Objective Difference Grade (ODG) of -1.59 (perceptible but not annoying). The HPSS method gives an average Cosine Distance (CD) of 0.0004, a Mean Square Error (MSE) of 0.0004, and an Objective Difference Grade (ODG) of -0.24 (imperceptible). The HPSS method can therefore be used to enhance spikes in either reduction or amplification mode.