High clarity speech separation using synchroextracting transform

Degenerate unmixing estimation technique (DUET) is among the most suitable blind source separation (BSS) methods for underdetermined conditions, in which the number of sources exceeds the number of mixtures. Estimation of the mixing parameters, which is the most critical step in the DUET algorithm, is based on the sparseness of speech signals in the time frequency (TF) domain. Hence, DUET relies on the clarity of the time frequency representation (TFR), and even the slightest interference in the TF plane is detrimental to the unmixing performance. In the conventional DUET algorithm, the short time Fourier transform (STFT) is used to extract the TFR of speech signals. However, STFT can provide only limited sharpness to the TFR due to its inherent conceptual limitations, and this worsens under noise contamination. This paper presents the application of post-processing techniques, namely the synchrosqueezed transform (SST) and the synchroextracting transform (SET), to the DUET algorithm in order to improve the TF resolution. The performance enhancement is evaluated both qualitatively and quantitatively, by visual inspection, by the Renyi entropy of the TFR and by objective measures of the speech signals. The results show enhanced TF resolution and high clarity signal reconstruction. The method also provides adequate robustness to noise contamination.


INTRODUCTION
The underdetermined BSS method called sparse component analysis has been widely used in audio source separation. It involves transforming audio mixtures into a sparse domain where manipulation and separation of these mixtures become easier. Various techniques have been used for audio source separation in the sparse domain. The first approach in this direction was proposed in [1,2]. These papers demonstrated that mixtures of source signals could be separated without knowledge of the underlying source signals or the mixing procedure. These methods assumed an instantaneous mixing procedure and a scenario where the number of mixtures exceeds the number of sources. Speech separation in underdetermined cases is difficult as it does not have a linear solution. The first efforts in this direction are presented in [3,4]. The first practical algorithm for separation of an arbitrary number of speech signals from two anechoic mixtures was proposed in [5], further explored in [6], and is known as the DUET algorithm.

The DUET algorithm works well for convolutive mixtures. The speech mixtures are first converted to the TF domain using STFT, where speech is assumed to be sparse. The algorithm then partitions the time-frequency domain into regions corresponding to individual sources. The region assigned to each source depends on the closeness of the TF coefficients to the estimated delay and amplitude parameters, and each source is then demixed by synthesising the estimated coefficients in its region. This technique relies mainly on the correct estimation of the amplitude and delay parameters of each source, which in turn depends on the correct estimation of the TF coefficients. However, the TF resolution is restricted by Heisenberg's uncertainty principle, which limits how accurately time-varying information can be captured over short time intervals. This results in 'blurring' or 'smearing out' of the TFR regardless of the analysis tool used, which leads to wrong estimation of the TF coefficients and hence reduces the clarity of the separated speech.
Much research has been carried out on designing high resolution TF techniques that also retain the invertibility needed to recover the original time series signal. Usually the STFT, the Wigner-Ville distribution and wavelet transforms are used to convert speech to the TF domain. However, the TF resolutions of these transforms are poor.
STFT converts a one-dimensional time series signal into a two-dimensional TFR in which both the time and the frequency content of the signal can be seen. However, a band limited window function is used in STFT, which causes an energy-blurred spectrogram. Various post processing techniques on STFT have been proposed to improve TF resolution. These include the reassignment method (RM) [7], the synchrosqueezing transform (SST) [8][9][10], the parametric time frequency analysis (PTFA) method [11][12][13][14] and demodulated time frequency analysis (DTFA) [9,15]. The ultimate aim of these methods is to improve TF resolution by developing an ideal time frequency analysis (ITFA) method [16] and to obtain an ideal TF representation (ITFR), which is of the form

$$ITFR(t,\omega)=\sum_{k} A_k(t)\,\delta\big(\omega-\varphi_k'(t)\big) \quad (1)$$

where A_k(t) is the time varying amplitude, δ is the Dirac distribution function, and φ_k(t) is the time varying phase of the signal. φ_k'(t) is the derivative of φ_k(t) and is the instantaneous frequency (IF). An ITFR is thus one that produces an impulse at the IF of the signal and is zero elsewhere. This can be achieved by squeezing or reassigning the TF coefficients so that the signal energy appears only on the IF trajectory, which results in good time frequency resolution and good anti-noise properties.
Though RM gives a sharper TFR for speech mixtures, it is based on the absolute TFR, which leads to loss of the signal reconstruction ability. PTFA and DTFA are not suitable for processing signals that contain multiple components with distinct frequency modulation laws. Here we use SET and SST as post processing techniques of STFT. They are considered promising TFR methods, as they enhance TF resolution and at the same time allow perfect signal reconstruction, particularly in the case of noisy speech mixtures.
The paper is organized as follows. Section 2 describes the background of the proposed SET and SST DUET method. In Section 3 we present the SET and SST enhanced DUET algorithm. Section 4 includes results and discussion, and Section 5 gives a summary. The major acronyms used in the following text are: TFR: time frequency representation, TF: time frequency, DUET: degenerate unmixing estimation technique, STFT: short time Fourier transform, SET: synchroextracting transform, SST: synchrosqueezed transform.

PROPOSED METHOD
DUET is a well-established method for multichannel source separation and localization. It is used in various applications such as separation of EEG and ECG signals from medical sensors, separation of radio signals in telecommunication, and audio applications such as hearing aids and demixing of stereo recordings. The technique is not bound to any particular type of signal, but it performs extremely well when used for separating speech signals, owing to the sparseness of speech in the TF domain.

DUET algorithm
Given an audio mixture recorded using two omnidirectional microphones in an anechoic room, if a source j has a distinct spatial position then that source possesses a magnitude parameter αj and a phase delay δj that are unique to it [6]. Provided the audio source signals have sparse and disjoint time-frequency characteristics, the mixture can be partitioned based on these spatial characteristics. However, the DUET algorithm fails to provide exact partitioning in real world situations. One reason is that the mixing parameters cannot be characterized exactly in the TFR, and this effect becomes more prominent in the presence of noise. Though STFT is used to convert speech mixtures to the TF domain, it gives a blurred TFR due to the limitations imposed by Heisenberg's uncertainty principle. This leads to wrong partitioning of sources based on the mixing parameters. Various techniques have been proposed to improve the clarity of separated speech [17]. In this paper SET, which adopts the reassignment approach of SST and the theory of ITFA, is proposed for obtaining a sharper TFR. This helps in accurate estimation of the mixing parameters belonging to each source and hence gives nearly perfect speech separation, especially in noisy environments.

Problem formulation
In real world scenarios a time domain mixing model is depicted as

$$X_i(t)=\sum_{j=1}^{k} a_{ij}\,s_j(t)+n(t),\quad i=1,2 \quad (2)$$

where aij are the mixing coefficients, sj(t) are the source signals, n(t) is the noise and Xi(t) are the resultant audio mixtures obtained from the system shown in Figure 1. The main aim is to use a better TFR that helps in recovering the original sources from their mixtures with utmost clarity. The mixing model consists of a room with 2 microphones and k sources, where k is the number of speakers in the room; the positions of the speakers and the microphones are as shown in Figure 1. The speakers are assumed to be stationary. Each speaker was randomly assigned to one of the positions shown in Figure 1, and 50 different recordings were made. The performance evaluation for the given arrangement can be done with a larger number of sources, provided that the minimum angle between two consecutive microphones is 30°.
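The mixing model can be illustrated with a minimal Python sketch that builds two mixtures from three synthetic sources, so that the number of sources exceeds the number of mixtures. The function name mix_sources, the tone frequencies, the values in the mixing matrix and the use of independent white noise on each microphone are assumptions made only for illustration; they are not taken from the recordings described above.

```python
import numpy as np

def mix_sources(sources, mixing, noise_std=0.0, seed=0):
    """Instantaneous mixing: x_i(t) = sum_j a_ij * s_j(t) + noise (hypothetical helper)."""
    sources = np.asarray(sources, dtype=float)   # shape (k, T): k sources
    mixing = np.asarray(mixing, dtype=float)     # shape (2, k): 2 microphones
    rng = np.random.default_rng(seed)
    # Assumption: independent white Gaussian noise on each microphone.
    noise = noise_std * rng.standard_normal((mixing.shape[0], sources.shape[1]))
    return mixing @ sources + noise

fs = 8000                                        # assumed sampling rate in Hz
t = np.arange(fs) / fs
s = np.vstack([np.sin(2 * np.pi * 200 * t),      # three illustrative sources
               np.sin(2 * np.pi * 450 * t),
               np.sin(2 * np.pi * 900 * t)])
A = np.array([[1.0, 1.0, 1.0],                   # a_1j: gains at microphone 1
              [0.6, 1.0, 1.5]])                  # a_2j: gains at microphone 2
x = mix_sources(s, A)                            # noise-free mixtures, shape (2, T)
```

With three sources and two mixtures the matrix A has no inverse, which is why the separation has to rely on sparseness in the TF domain rather than on a linear solution.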

From STFT to SST and SET
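The STFT of a multicomponent signal s(t) with k modes is given by

$$S(t,\omega)=\sum_{k} A_k(t)\,e^{i\varphi_k(t)}\,\hat{w}\big(\omega-\varphi_k'(t)\big) \quad (3)$$

where A_k, φ_k and φ_k' denote the instantaneous amplitude, instantaneous phase and instantaneous frequency (IF) of the kth mode, respectively, and ŵ(·) denotes the Fourier transform of the Hamming window function. STFT gives smeared time frequency energy [18]; hence it is impossible to identify time-varying features accurately. From (3) the instantaneous frequency is given by

$$\omega_0(t,\omega)=-i\,\frac{\partial_t S(t,\omega)}{S(t,\omega)} \quad (4)$$

According to [4], SST congregates the STFT coefficients with identical frequency and location, as given by (5):

$$SS(t,\eta)=\int_{-\infty}^{+\infty} S(t,\omega)\,\delta\big(\eta-\omega_0(t,\omega)\big)\,d\omega \quad (5)$$

where $\int_{-\infty}^{+\infty}\delta(\eta-\omega_0(t,\omega))\,d\omega$ is the synchrosqueezing operator (SSO). Here the TF coefficients are squeezed into the IF region η=ω0, resulting in a new TF plane SS(t,η) instead of the original TF plane S(t,ω). The reassignment of TF coefficients takes place only in the frequency direction, i.e. from (t,ω) to (t,η). SET, in contrast, removes trivial interference and smeared time frequency energy, and keeps only the TF information most associated with the TF attributes of the target signal, by means of the synchroextracting operator (SEO) given by (6):

$$SEO(t,\omega)=\delta\big(\omega-\omega_0(t,\omega)\big) \quad (6)$$

Hence SET is formulated as (7):

$$Te(t,\omega)=S(t,\omega)\cdot\delta\big(\omega-\omega_0(t,\omega)\big) \quad (7)$$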
Thus it is clear from equations (5) and (7) that SST reassigns the coefficients around the IF trajectory, while SET extracts the TF coefficients on the IF trajectory with the SEO and removes the rest of the TF coefficients. SET is therefore more energy concentrated than SST. By taking only the TF coefficients that are most energy concentrated, SET removes most of the smeared time frequency energy and gives a high clarity TF representation, whereas SST reassigns the smeared energy coefficients to a new point in the TF plane according to their instantaneous frequency. Also, though both SST and SET need to know the instantaneous frequency trajectories for reconstruction, SST needs additional information about the integration regions [8]. Compared to SST, SET provides a sharpened and focused representation of coefficients in the TF plane, especially in noisy environments, and our aim is to find out which technique is better when applied to DUET.
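The difference between squeezing and extracting can be made concrete with a minimal NumPy sketch. It is a discrete illustration under stated assumptions, not the authors' implementation: it uses a Gaussian window instead of the Hamming window so that the window derivative has a closed form, it estimates the IF from the ratio of the STFT taken with the window derivative to the ordinary STFT, and it realises the Dirac delta of the SEO as "within half a frequency bin of the IF estimate". The names stft_pair and set_and_sst and all parameter values are hypothetical.

```python
import numpy as np

def stft_pair(x, nfft=256, sigma=16.0, half_len=64, hop=16):
    """STFT with a Gaussian window g and with its derivative g' (phase referenced to frame centre)."""
    m = np.arange(-half_len, half_len + 1)
    g = np.exp(-0.5 * (m / sigma) ** 2)
    dg = -(m / sigma ** 2) * g                       # derivative of g in samples
    centers = np.arange(half_len, len(x) - half_len, hop)
    frames = x[centers[:, None] + m[None, :]]        # (n_frames, window length)
    k = np.arange(nfft // 2 + 1)
    kernel = np.exp(-2j * np.pi * np.outer(m, k) / nfft)
    return ((frames * g) @ kernel).T, ((frames * dg) @ kernel).T

def set_and_sst(x, nfft=256, **kwargs):
    """Return (STFT, SST, SET) arrays of shape (n_bins, n_frames)."""
    S, Sd = stft_pair(x, nfft=nfft, **kwargs)
    dw = 2 * np.pi / nfft                            # frequency step in rad/sample
    omega = dw * np.arange(S.shape[0])[:, None]
    valid = np.abs(S) > 1e-8 * np.abs(S).max()       # avoid dividing by ~0
    ratio = np.zeros_like(S)
    ratio[valid] = Sd[valid] / S[valid]
    omega0 = omega - ratio.imag                      # IF estimate per TF point
    # SET: keep only the coefficients lying on the IF trajectory.
    seo = valid & (np.abs(omega - omega0) < dw / 2)
    Te = np.where(seo, S, 0)
    # SST: move every coefficient to the bin nearest its IF estimate.
    SS = np.zeros_like(S)
    target = np.rint(omega0 / dw).astype(int)
    ok = valid & (target >= 0) & (target < S.shape[0])
    rows, cols = np.nonzero(ok)
    np.add.at(SS, (target[rows, cols], cols), S[rows, cols])
    return S, SS, Te

n = np.arange(2048)
tone = np.cos(2 * np.pi * 0.2 * n)                   # 0.2 cycles/sample = 51.2 bins
S, SS, Te = set_and_sst(tone)
```

For this single tone the STFT spreads energy over roughly a dozen bins around the tone, whereas both SST and SET concentrate it in the bin nearest the IF. SST does so by summing the neighbouring coefficients into that bin; SET simply discards them.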

RESEARCH METHOD
Among BSS techniques, DUET has been accepted as one of the most effective ways of separating signals, especially when the number of sources is greater than the number of sensors. However, the TFR, which is the core part of DUET, fails to localize and separate individual signals when STFT is used, because of the fixed spectral resolution imposed by the predetermined window width. Methods such as SET and SST have been used to increase the sharpness of the TFR in various applications such as identifying power quality disturbances [19], fault diagnosis in rolling bearings [20], hydrocarbon detection [21], model based deep learning [22][23] and seismic time-frequency analysis [24]. Hence SST and SET are well suited to improving the TF resolution in DUET, which ultimately results in high quality speech separation, especially in noisy conditions.
Step 4.
    for i=1:numfreq/2
        for j=1:numtime
            if abs(real(Xe12(t,ω)/Xe11(t,ω))) > λ
                IF(i,j)=1
            end
        end
    end
Step 5. Tef1=Xe11(t,ω)*IF(i,j);
Step 6. Repeat steps 4 to 5 for mixture X2(t) and obtain Tef2.
Step 7. For each TF point given by Tef1 and Tef2, find the mixing parameters (α(t,ω),δ(t,ω)), where α(t,ω) and δ(t,ω) are the instantaneous estimates of the relative attenuation and delay of the sources [6], respectively, as observed by X1(t) and X2(t).
Step 8. Construct a high resolution histogram and smooth it.
Step 9. Locate the peaks in the histogram. There will be N peaks (one for each source), with peak locations approximately equal to the true mixing parameter pairs.
Step 10. For the N mixing parameter pairs, construct TF masks using ML partitioning [6] and apply these masks to one of the mixtures to get an estimate of the TF representation of the original sources.
Step 11. Apply the inverse SET to convert each source back to the time domain.

Here numfreq is the number of frequency components per time point, numtime is the number of time components per frequency point, and Tef1 holds the SET TF points. The threshold is λ=10^-8 in noise free conditions; under noise it is

$$\lambda=\sqrt{2\log_2 N}\,\sigma$$

where N is the signal length and

$$\sigma=\frac{\mathrm{median}\big(\,|Xe11(t,\omega)-\mathrm{median}(Xe11(t,\omega))|\,\big)}{0.6745}$$

[25].

SST-DUET Algorithm
All the steps of the SST-DUET algorithm are the same as those of the SET-DUET algorithm except step 4. In SST the IF region is found and all the TF coefficients are squeezed into that region along the frequency direction. We find the IF trajectory ω(i,j), and step 4 of the algorithm becomes the squeezing operation of (5). Here the coefficients are squeezed into the IF trajectory to obtain a new TFR given by Ts(i,η).
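The parameter estimation and histogram steps of the algorithm can be sketched as follows. This is a minimal illustration under stated assumptions, not the proposed implementation: for brevity it uses a plain Hann-window STFT where the method above would use the SET coefficients Tef1 and Tef2, it skips the histogram smoothing, and it replaces ML partitioning with a simple peak search. The helper names, the band-limited noise sources, the attenuation values (0.6 and 1.5) and the one-sample delays are all hypothetical.

```python
import numpy as np

def stft(x, nfft=256, hop=64):
    """Plain Hann-window STFT, shape (n_bins, n_frames)."""
    win = np.hanning(nfft)
    starts = np.arange(0, len(x) - nfft + 1, hop)
    frames = x[starts[:, None] + np.arange(nfft)[None, :]] * win
    return np.fft.rfft(frames, axis=1).T

def duet_parameters(X1, X2, nfft=256):
    """Per-point relative attenuation alpha and delay delta (in samples) from X2/X1."""
    k = np.arange(1, X1.shape[0] - 1)                # skip DC and Nyquist
    omega = (2 * np.pi * k / nfft)[:, None]
    A, B = X1[k], X2[k]
    keep = np.abs(A) > 1e-3 * np.abs(A).max()        # ignore near-empty TF points
    ratio = B[keep] / A[keep]
    alpha = np.abs(ratio)
    delta = -np.angle(ratio) / np.broadcast_to(omega, A.shape)[keep]
    weight = np.abs(A[keep] * B[keep])               # energy weighting
    return alpha, delta, weight

def histogram_peaks(alpha, delta, weight, n_sources, bins=60):
    """Weighted 2-D histogram over (alpha, delta); return the n_sources largest peaks."""
    H, a_edges, d_edges = np.histogram2d(
        alpha, delta, bins=bins, range=[[0, 3], [-3, 3]], weights=weight)
    peaks = []
    for _ in range(n_sources):
        i, j = np.unravel_index(np.argmax(H), H.shape)
        peaks.append((0.5 * (a_edges[i] + a_edges[i + 1]),
                      0.5 * (d_edges[j] + d_edges[j + 1])))
        H[max(i - 3, 0):i + 4, max(j - 3, 0):j + 4] = 0   # suppress this peak
    return peaks

rng = np.random.default_rng(0)
T = 16384

def band_noise(lo, hi):
    """White noise restricted to [lo, hi] cycles/sample, so the sources are TF-disjoint."""
    spec = np.fft.rfft(rng.standard_normal(T))
    f = np.fft.rfftfreq(T)
    spec[(f < lo) | (f > hi)] = 0
    return np.fft.irfft(spec, T)

s1, s2 = band_noise(0.05, 0.15), band_noise(0.25, 0.35)
x1 = s1 + s2
x2 = 0.6 * np.roll(s1, 1) + 1.5 * np.roll(s2, -1)    # delays of +1 and -1 sample
alpha, delta, weight = duet_parameters(stft(x1), stft(x2))
peaks = sorted(histogram_peaks(alpha, delta, weight, n_sources=2))
```

In this toy case the two histogram peaks should fall near the (attenuation, delay) pairs used to build x2. A sharper TFR narrows the spread of the per-point estimates around those pairs, which is the effect the SET and SST variants aim for.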

RESULTS AND DISCUSSION
Here we analyze the influence of SET and SST on the characteristics of signal mixtures in the TF domain. For the purpose of assessing the performance of SET and SST, STFT is taken as the reference tool. The performance of the TF sharpening tools in the DUET algorithm is numerically validated using a known mixture of five sinusoidal signals under both noisy and noise free conditions, and is further evaluated experimentally using speech mixtures. The clarity of the TF domain is evaluated using Renyi entropy. For speech mixtures, the quality of demixing, the noise robustness and the signal reconstruction ability are evaluated using the BSS-Eval toolbox [25].

Evaluation in TF domain
We evaluate the performance of the proposed method in two different situations: a synthetic mixture of five sinusoidal signals and real speech mixtures. The synthetic mixtures of sinusoidal signals are modeled as

$$X_1(t)=\sin(2\pi\cdot 6t)+0.8\sin(2\pi\cdot 10t)+\ldots$$

The TFR obtained with the three transforms for real speech mixtures is shown in Figure 3. In general, speech signals contain multiple frequencies which in most cases overlap heavily. In this situation, the TFR of SET shows a much clearer separation between the component frequencies, with little overlap and hence less blurring than the other two transforms. Further, we investigate the performance of these transforms in terms of the histogram obtained for estimation of the mixing parameters in DUET. Figure 4(a) and Figure 4(b) show the histograms obtained using the SET TFR and the STFT TFR, respectively. The peaks in Figure 4(a) are more concentrated than those in Figure 4(b), which uses the STFT TFR.

Quantitative analysis

Renyi entropy
In order to evaluate the efficiency of the proposed method in providing clearly separated speech sources from their mixtures, and to compare it with conventional methods, we carry out a quantitative evaluation using Renyi entropy. In any demixing task, the window length is a crucial parameter that influences the clarity of the separated sources. To investigate how the choice of window length influences the estimation, we evaluate the Renyi entropy [18,19] of the TFR of the three transforms for varying window length. Figure 5 shows the variation of Renyi entropy with window length in noise free and noisy conditions. From Figure 5(a) it can be inferred that, for short window lengths, SET provides lower Renyi entropy than SST and STFT, and the Renyi entropy decreases as the window length decreases. For longer windows the performances of all three transforms tend to be similar, as indicated by the similar Renyi entropy values in Figure 5. The performance of SET and SST in noisy conditions for varying window length is investigated using the Renyi entropy of the STFT, SST and SET TFRs of the speech mixtures, which is given in Figure 5(b). It is clear from this figure that the Renyi entropy of SET is much lower than that of SST and STFT in noisy conditions, for shorter as well as longer windows. Unlike SST, which gathers all coefficients onto the corresponding IF trajectories, SET keeps only those coefficients on the IF trajectories which have maximum energy [17,18]. Thus SET generates a novel TFR in which the effect of noise is minimized, thereby resulting in lower entropy. These results indicate the advantage of SET over SST in speech separation under noisy conditions. Table 1 shows the time required for computation of STFT, SET and SST for speech mixtures. On comparing them we find that SET and SST require almost twice the computational time of STFT.
As SET requires evaluation of every IF trajectory in addition to the TF points, it is natural that its computation time is higher than that of STFT. The slightly higher computation time demanded by such transforms is justified by the sharper TFR they provide, which in turn contributes to highly efficient demixing. Of the two transforms, SET proves to be the better choice considering the increased sharpness of the TFR and the slightly lower computation time.
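The Renyi entropy used in the comparison above can be computed with a few lines of NumPy. The sketch below assumes the common definition for a TFR, in which the squared magnitude is normalised to sum to one and the entropy of order 3 is taken in bits; the order and the normalisation actually used for Figure 5 are not stated in the text, so both are assumptions, and the function name renyi_entropy is hypothetical.

```python
import numpy as np

def renyi_entropy(tfr, order=3):
    """Renyi entropy (in bits) of a TFR; lower values mean a more concentrated TFR."""
    p = np.abs(np.asarray(tfr)) ** 2
    p = p / p.sum()                       # normalise energy to a distribution
    return float(np.log2(np.sum(p ** order)) / (1 - order))

# Energy spread evenly over 64 TF cells versus energy in a single cell.
spread = np.ones((8, 8))
sharp = np.zeros((8, 8))
sharp[3, 4] = 1.0
h_spread = renyi_entropy(spread)
h_sharp = renyi_entropy(sharp)
```

A TFR whose energy is spread evenly over M cells has entropy log2(M), while one that puts all its energy in a single cell has entropy zero, which is why a lower value is read as a sharper representation.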

Objective measures
We compare the proposed method with existing DUET techniques using objective measures, namely signal to distortion ratio (SDR), signal to artifact ratio (SAR) and signal to interference ratio (SIR), given by equations (8), (9) and (10). The signal is decomposed into a source part s_target, along with error terms such as interference e_inter and algorithmic artifacts e_artifact [25]:

$$SDR=10\log_{10}\frac{\|s_{target}\|^2}{\|e_{inter}+e_{artifact}\|^2} \quad (8)$$

$$SAR=10\log_{10}\frac{\|s_{target}+e_{inter}\|^2}{\|e_{artifact}\|^2} \quad (9)$$

$$SIR=10\log_{10}\frac{\|s_{target}\|^2}{\|e_{inter}\|^2} \quad (10)$$

The intelligibility of the estimated sources is also evaluated. Table 2 shows the values of SDR, SIR and SAR for the DUET algorithm enhanced with SET and SST, and for the conventional DUET algorithm which uses STFT TFRs. A high value of SIR is the basic requirement for an efficient speech separation algorithm, while the other two measures can be allowed to remain at a relatively moderate level. From Table 2 it can be inferred that both SST and SET are effective in enhancing the speech separation performance of the conventional DUET algorithm. The reconstruction ability, indicated by the correlation values of the extracted speech signals, also indicates the efficiency of SET and SST compared to the traditional DUET algorithm. A comparison of SET and SST in terms of the objective measures and the correlation values clearly indicates the relatively better efficiency of SET.
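As a small worked example of the three objective measures, the following sketch evaluates them from an already decomposed estimate. It assumes the standard energy-ratio definitions used by BSS-Eval without a separate noise term, and it takes the decomposition into target, interference and artifact components as given; computing that decomposition is the job of the toolbox and is not reproduced here. The function name bss_ratios and the toy vectors are hypothetical.

```python
import numpy as np

def bss_ratios(s_target, e_inter, e_artifact):
    """Return (SDR, SIR, SAR) in dB from a given decomposition of an estimated source."""
    def energy(v):
        return float(np.sum(np.abs(v) ** 2))
    sdr = 10 * np.log10(energy(s_target) / energy(e_inter + e_artifact))
    sir = 10 * np.log10(energy(s_target) / energy(e_inter))
    sar = 10 * np.log10(energy(s_target + e_inter) / energy(e_artifact))
    return sdr, sir, sar

# Toy decomposition with mutually orthogonal components (illustrative values only).
s_target = np.array([10.0, 0.0, 0.0])
e_inter = np.array([0.0, 1.0, 0.0])
e_artifact = np.array([0.0, 0.0, 1.0])
sdr, sir, sar = bss_ratios(s_target, e_inter, e_artifact)
```

Here the target energy is 100 and each error term has energy 1, so SIR is 20 dB, while SDR is lower because it counts both error terms together.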
Noise robustness is another crucial factor that affects the reconstruction efficiency of any speech separation algorithm. To evaluate the performance of SET and SST under different noise levels, white noise with SNRs from 1 dB to 80 dB was added to the speech mixture before applying the demixing procedure. Figure 6 shows the average correlation values for the speech signals under these noise levels for SET and SST. From the figure it can be observed that for high SNR values SET and SST exhibit similar performance, whereas for low SNR values below 40 dB the reconstruction ability of SET is much better than that of SST. The better performance of SET at low SNR can be attributed to its specific approach of removing the most smeared TF coefficients, which reduces the effect of noise on the IF trajectory. Thus SET reconstruction shows the best match between estimated and original sources in low SNR cases. Hence we can conclude that SET reconstruction is more robust to noise than SST in highly noisy speech mixtures.
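The noise robustness test can be set up along the following lines. This is a minimal sketch assuming white Gaussian noise scaled to an exact target SNR and the Pearson correlation coefficient as the similarity measure; the helper names, the test tone and the SNR grid are hypothetical, and the demixing stage that would sit between the two steps is omitted.

```python
import numpy as np

def add_white_noise(x, snr_db, seed=0):
    """Add white Gaussian noise so that the resulting SNR equals snr_db."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(x.shape)
    scale = np.sqrt(np.mean(x ** 2) / (np.mean(noise ** 2) * 10 ** (snr_db / 10)))
    return x + scale * noise

def correlation(a, b):
    """Pearson correlation between a reference signal and an estimate."""
    return float(np.corrcoef(a, b)[0, 1])

t = np.arange(8000) / 8000.0
clean = np.sin(2 * np.pi * 220 * t)                 # stand-in for a speech source
snrs = [1, 20, 40, 80]                              # dB, illustrative grid
scores = [correlation(clean, add_white_noise(clean, snr)) for snr in snrs]
```

In the experiment described above the correlation would be taken between each original source and the source recovered by SET-DUET or SST-DUET from the noisy mixture, then averaged over sources for each SNR.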

CONCLUSION
Here we present an investigation of the effectiveness of the post processing reassignment techniques SET and SST in improving the TF resolution of the DUET method under different noise levels. These approaches make use of IFs to further process the TF points, improving readability in both the frequency and the time direction. The performance of the two techniques is assessed qualitatively and quantitatively. Qualitative analysis is carried out by visual inspection of the TFR and of the histogram peaks used for estimation of the mixing parameters in the conventional DUET algorithm. Quantitative evaluation uses the Renyi entropy of the TFR and objective measures of the speech mixtures. Renyi entropy, which is used as a performance indicator, has lower values for SET than for SST at different noise levels. Also, SDR, SIR and SAR give much better results for SET compared to SST. The efficiency of extraction of the original signals is estimated using correlation values between the extracted signals and the original source signals, and is better for SET than for the other two transforms. Thus SET enhances the performance of the DUET algorithm in terms of accuracy of source estimation from speech mixtures.
The present results indicate that SET is much better than SST, as it requires fewer parameters for reconstruction of the signal, while SST demands information about the regions of integration, which is hard to obtain in the case of strong FM signals. SST squeezes every TF coefficient into a specific IF trajectory, which is carried out only in the frequency direction, whereas SET extracts specific TF coefficients into a specific IF trajectory in both the time and the frequency direction. The objective measures obtained are promising and encourage the use of SET as an efficient post processing technique for speech signals in underdetermined conditions. Thus SET can prove to be an efficient technique in application areas requiring a sharp TFR, such as image processing, speech processing and various other signal processing fields.