Parallel implementation of pulse compression method on a multi-core digital signal processor

Received Feb 26, 2020 Revised May 30, 2020 Accepted Jun 15, 2020

The pulse compression algorithm is widely used in radar applications. It requires substantial processing power to be executed in real time, so its processing must be distributed across multiple processing units. The present paper proposes a real-time platform based on the multi-core digital signal processor (DSP) C6678 from Texas Instruments (TI). The objective of this paper is to optimize the parallel implementation of the pulse compression algorithm over the eight cores of the C6678 DSP. Two parallelization approaches were implemented. The first approach is based on the open multi-processing (OpenMP) programming interface, a software interface that helps to execute different sections of a program on a multi-core processor. The second approach is an optimized method that we have proposed in order to distribute the processing and to synchronize the eight cores of the C6678 DSP. The proposed method gives the best performance: a parallel efficiency of 94% was obtained when the eight cores were activated.


INTRODUCTION
The pulse compression algorithm is widely used in radar applications, such as pulse Doppler radar [1], ground-moving target indicator (GMTI) [2], and synthetic aperture radar (SAR) [3]. It is carried out on the acquired signal in order to extract the distance of the target from the radar with high precision. Its major constraint is that it requires high computing power. Consequently, a single processing element cannot handle its processing in real time. One solution is to use multiple computing cores working together, each one of them executing a small portion of the processing.
This paper presents the C6678 DSP from TI as a processing platform. It provides high-performance floating-point calculation with low power consumption. It contains eight independent C66x cores, each running at a frequency of 1 GHz, and provides a maximum performance of 128 GFLOPS for single-precision floating-point calculation [4]. In addition, several research communities have developed high-performance computing systems using the C6678 DSP [3,5-9].
Embedded systems based on DSPs have proved their efficiency in executing a large number of signal processing algorithms in real time. They have been used by a large scientific community to build real-time embedded systems. Abdelkareem et al. [10] have developed high performance software that requires real-time embedded systems for emerging technology areas like 5G Wireless and software defined [11][12][13][14][15] have developed an embedded system based on the C6678 DSP for beef meat freshness evaluation.
In our previous work [1], we presented a real-time parallel implementation of the pulse Doppler radar signal processing chain, including beam forming, pulse compression and Doppler filtering, on a parallel machine with two C6678 DSP boards (a total of 16 processing cores). A straightforward model was used and optimized as the processing parallelization strategy. All communications between the DSP cores, including data exchange and synchronization, go through the Serial RapidIO (SRIO) inter-processor communication bus, whose use we have optimized [16,17]. The major result obtained is a parallel efficiency of about 90%.
Huang et al. [18] have proposed a parallel implementation of the beam forming algorithm, which is widely used in radar applications, on the TI-based Tomahawk platform containing six DSP cores. They used the OpenMP interface [19] to distribute the processing over the six DSP cores. Results show a maximum speedup of about 3.7. Mego et al. [20] have evaluated the performance of the parallelization of basic signal processing algorithms, such as the finite impulse response (FIR) filter, the discrete Fourier transform (DFT) and the fast Fourier transform (FFT), on the C6678 DSP. In their study, the authors used the OpenMP interface to distribute the processing over the eight DSP cores. The results show that the relative speedup is highly dependent on the algorithm and on the amount of processed data, with a maximum speedup of about 6. Yu et al. [21] have implemented the pulse Doppler radar signal processing chain on a computing platform based on the C6678 DSP. The studied algorithm includes three steps: beam forming, pulse compression and Doppler filtering. They used the OpenMP framework for the parallel implementation. The results show that multi-threaded execution is less efficient than single-threaded execution. According to the authors, this difference is explained by the highly non-linear memory accesses required by the FFT and the inverse fast Fourier transform (IFFT). Wang et al. [3] have implemented and optimized SAR algorithms on the eight cores of the C6678 DSP. The studied algorithm includes two pulse compression steps (range compression and azimuth compression), range cell migration correction (RCMC) and corner turn. The OpenMP framework was used to instantiate individual threads across the eight cores. The results show that the time required for range compression and azimuth compression scales very well as the number of operational cores increases. However, the remaining RCMC and corner turn steps saturate at around four cores. For the total execution time, the acceleration factor with eight cores relative to a single core is equal to 5.6.
In all the research works presented above, OpenMP has been successfully tested to distribute many signal-processing algorithms over multi-core DSP platforms. However, the obtained parallel efficiency does not exceed 70% in the best cases. In this paper, an optimized method is proposed as an alternative to the OpenMP method in order to improve performance.
The major contribution of this paper is the distribution of the pulse compression algorithm over the eight processing cores of the C6678 DSP. We have implemented two parallelization approaches. The first one is based on OpenMP, a shared-memory application programming interface (API) whose features are based on prior efforts to facilitate shared-memory parallel programming. As the C6678 DSP integrates two levels of memory shared between the eight cores, namely the internal multi-core shared memory (MSM) and the external DDR memory, OpenMP is well suited to it. The second approach is an optimized method that we have proposed to distribute the processing of the pulse compression algorithm over the eight cores. The performances of the two parallelization methods are compared with each other based on the speedup and parallel efficiency indicators.
This paper is organized as follows. Section 2 presents an overview of the pulse compression method, the experimental platform, and the metrics used for evaluating parallel processing performance. It also presents the proposed method to distribute the pulse compression algorithm over multiple cores. Section 3 provides the experimental results of the parallel implementation of pulse compression using the OpenMP API and the proposed approach. Finally, a conclusion is provided in section 4.

RESEARCH METHOD
2.1. Pulse compression algorithm
A convolution operation between the transmitted and the received pulse is performed in order to detect radar targets [22]. Two closely spaced targets are fully merged when the wave sent by the radar is a sinusoidal signal, as shown in Figure 1. To improve the detection accuracy for closely spaced targets, the transmitted wave undergoes a linear frequency modulation operation, shown in Figure 2(b). The obtained signal is called a 'chirp' and is shown in Figure 2(a).
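To optimize the processing of the pulse compression, the convolution operation is realized in the frequency domain. It is carried out by performing the point-wise product of the FFT of the input data with the pulse compression coefficients, followed by an IFFT that generates the output data [23,24].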
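The frequency-domain processing described above can be sketched as follows. This is a minimal illustration, not the authors' DSP code: the function names, the 32-sample chirp, the 256-sample window and the delay of 100 samples are all hypothetical, and the coefficients are assumed to be the conjugate FFT of the zero-padded reference chirp (a matched filter). A small pure-Python radix-2 FFT is used only to keep the sketch self-contained.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out


def ifft(x):
    """Inverse FFT via conjugation: ifft(x) = conj(fft(conj(x))) / n."""
    n = len(x)
    y = fft([v.conjugate() for v in x])
    return [v.conjugate() / n for v in y]


def chirp(n_samples, bandwidth_fraction=0.5):
    """Unit-amplitude complex linear-FM pulse (assumed waveform)."""
    rate = bandwidth_fraction / n_samples  # sweep rate, cycles per sample^2
    half = n_samples / 2
    return [cmath.exp(1j * math.pi * rate * (t - half) ** 2)
            for t in range(n_samples)]


def pulse_compress(received, coeffs):
    """FFT of the input, point-wise product with the coefficients, then IFFT."""
    spectrum = fft(received)
    product = [s * c for s, c in zip(spectrum, coeffs)]
    return ifft(product)


N = 256                      # processing window (hypothetical size)
pulse = chirp(32)            # transmitted chirp (hypothetical length)

# Assumed coefficients: conjugate spectrum of the zero-padded reference chirp.
coeffs = [v.conjugate() for v in fft(pulse + [0j] * (N - len(pulse)))]

# Simulated echo: a copy of the chirp delayed by 100 samples.
delay = 100
received = [0j] * N
for i, v in enumerate(pulse):
    received[delay + i] = v

compressed = pulse_compress(received, coeffs)
peak = max(range(N), key=lambda i: abs(compressed[i]))
```

Under these assumptions the compressed output peaks at the sample index equal to the echo delay, which is how the target distance is read from the result.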

C6678 DSP overview
The experimental platform consists of one EVM6678 development board, shown in Figure 4, which integrates one C6678 DSP and 512 MB of DDR3 memory [25,26]. The multi-core C6678 DSP provided by TI is a high-performance, low-power computing system. It contains eight independent DSP cores; each core runs at a frequency of 1 GHz and has a peak performance of 16 GFLOPS for single-precision floating-point calculation. The C66x DSP core is based on a very long instruction word (VLIW) architecture. The instruction set also includes single instruction, multiple data (SIMD) operations on vectors of up to 128 bits [4].
The C6678 DSP integrates three levels of memory. Each core has 32 KB of level 1 program memory (L1P) and 32 KB of level 1 data memory (L1D). Level 1 is the closest to the core and is usually used as cache memory. In addition, each core has a local level 2 memory of 512 KB, which is slower than level 1. The level 3 memory, or MSM, is shared and can be accessed concurrently by the eight cores; its size is 4 MB. Furthermore, the eight DSP cores can also access the external DDR memory simultaneously.
For code development, the Code Composer Studio (CCS) integrated development environment (IDE) has been used with the C6000 compiler version v8.3.5. All optimization options provided by the compiler have been activated. The compiler also supports OpenMP 3.0, which allows rapid porting of existing multi-threaded codes to the multi-core DSP. TI's C66x compiler translates the OpenMP directives into multi-threaded code with calls to a custom runtime library. The OpenMP framework was employed to instantiate individual threads across multiple cores. The pulse compression coefficients and the input/output data have been allocated in MSM memory in order to be shared between all cores, while the L1 memory has been fully configured as cache.

Metrics for evaluating parallel processing performance
There are two metrics to evaluate performance of parallel processing: speedup (1) and parallel efficiency (2) [19]. An ideal parallel implementation leads to a speedup equal to the number of cores and to a parallel efficiency of 100%.
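As a worked illustration of these two metrics, the sketch below assumes the usual definitions: speedup is the single-core execution time divided by the execution time on p cores, and parallel efficiency is the speedup divided by p, expressed as a percentage. The function names are hypothetical; the numeric values are the speedups quoted in this paper.

```python
def speedup(t_single, t_parallel):
    """Speedup: single-core execution time over multi-core execution time."""
    return t_single / t_parallel


def parallel_efficiency(speedup_value, n_cores):
    """Parallel efficiency in percent: speedup divided by the number of cores."""
    return 100.0 * speedup_value / n_cores


# Speedups quoted in the text, both measured with eight cores.
proposed = parallel_efficiency(7.5, 8)    # 93.75, reported as about 94%
wang_et_al = parallel_efficiency(5.6, 8)  # about 70%
```

With these definitions, an ideal implementation on eight cores has a speedup of 8 and an efficiency of 100%.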

Proposed approach
The proposed approach aims to distribute the processing over the eight cores of the C6678 DSP. It relies on the MSM memory shared between all cores. We have placed the pulse compression coefficients and the input and output data in MSM memory so that they are accessible to all cores at the same time. We have also reserved seven memory boxes for synchronization, one dedicated to each of the seven cores other than the master. During the initialization phase, the master core (core 0) resets all these memory boxes. On arriving at the start of the parallel region, the master core sets all boxes to one and begins processing its portion of the data. Each of the other cores starts processing its own data portion as soon as its memory box is set to one, and returns its box to zero when it has finished. When the master core ends its processing, it examines the states of the seven boxes and waits until all of them have returned to zero, which means that the other cores have also finished their processing. A diagram that illustrates the proposed method is presented in Figure 5.
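The synchronization scheme can be mimicked on a desktop with threads standing in for DSP cores. The sketch below is only an analogy under stated assumptions: Python threads replace the eight cores, a plain list replaces the boxes in MSM memory, the names flags, output, process, worker, master and run are hypothetical, and the per-iteration work is a placeholder rather than the FFT, multiplication and IFFT.

```python
import threading
import time

NUM_CORES = 8          # core 0 is the master, cores 1..7 are the other cores
NUM_ITERATIONS = 256   # use case size quoted in the paper

# Stand-ins for data placed in shared MSM memory (hypothetical names).
flags = [0] * (NUM_CORES - 1)    # one synchronization box per non-master core
output = [0] * NUM_ITERATIONS    # shared output buffer


def process(core_id):
    """Process this core's contiguous share of the iterations."""
    chunk = NUM_ITERATIONS // NUM_CORES
    start = core_id * chunk
    stop = NUM_ITERATIONS if core_id == NUM_CORES - 1 else start + chunk
    for i in range(start, stop):
        output[i] = core_id + 1  # placeholder for FFT, multiply, IFFT


def worker(core_id):
    box = core_id - 1
    while flags[box] == 0:       # wait until the master sets this box to one
        time.sleep(0)
    process(core_id)
    flags[box] = 0               # return the box to zero to signal completion


def master():
    for box in range(len(flags)):
        flags[box] = 1           # start of the parallel region
    process(0)                   # the master processes its own portion
    while any(flags):            # wait until every box is back to zero
        time.sleep(0)


def run():
    threads = [threading.Thread(target=worker, args=(c,))
               for c in range(1, NUM_CORES)]
    for t in threads:
        t.start()
    master()
    for t in threads:
        t.join()
    return output
```

Calling run() fills the shared output buffer, with each block of 32 entries written by a different simulated core. Each box has a single writer at any given time (the master before the region starts, the owning core at the end), so no lock is needed in this sketch.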

Parallel implementation based on OpenMP
As described in section 2.1, the pulse compression algorithm consists of three operations: an FFT on the input data, a point-wise vector multiplication with the pulse compression coefficients, and finally an IFFT to generate the output data. These three operations must be applied to all beams and pulses in the case of pulse Doppler and GMTI applications, and to all pulses in the case of SAR applications. In this work, a use case of 256 iterations was chosen. The pulse compression software therefore consists of an outer for loop that repeats the three operations on all input data. OpenMP provides three scheduling techniques to control the manner in which loop iterations are distributed over the multiple cores, so the scheduling method could have a major impact on performance. These methods are static, guided and dynamic [19]. Experimental results are presented in Figure 6. Wang et al. [3] have obtained exactly the same result; however, Yu et al. [21] have obtained a lower speedup, equal in the best case to 1.
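To make the three scheduling techniques concrete, the sketch below lists the iteration ranges each one would produce for the outer loop. It is a simplified model, not the TI OpenMP runtime: the function names are hypothetical, and the guided chunk size (remaining iterations divided by the number of threads, rounded up) is one common formulation, since the exact sizes are implementation-defined.

```python
def static_chunks(n_iter, n_threads):
    """Static scheduling without a chunk size: one near-equal block per thread."""
    base, extra = divmod(n_iter, n_threads)
    chunks, start = [], 0
    for t in range(n_threads):
        size = base + (1 if t < extra else 0)
        chunks.append((start, start + size))
        start += size
    return chunks


def dynamic_chunks(n_iter, chunk=1):
    """Dynamic scheduling: fixed-size chunks handed to whichever thread is free."""
    return [(s, min(s + chunk, n_iter)) for s in range(0, n_iter, chunk)]


def guided_chunks(n_iter, n_threads, min_chunk=1):
    """Guided scheduling (assumed model): chunks shrink as work runs out."""
    chunks, start = [], 0
    while start < n_iter:
        remaining = n_iter - start
        size = max(min_chunk, -(-remaining // n_threads))  # ceiling division
        size = min(size, remaining)
        chunks.append((start, start + size))
        start += size
    return chunks


static_8 = static_chunks(256, 8)   # eight blocks of 32 iterations
static_6 = static_chunks(256, 6)   # four blocks of 43, two blocks of 42
guided_8 = guided_chunks(256, 8)   # first chunk is 32 iterations, then smaller
```

When every iteration costs the same, as in a loop that applies the same three operations to equally sized inputs, all three policies end up giving each core a similar amount of work; they differ mainly in how many scheduling decisions are made at run time.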

Parallel implementation using the proposed method
The proposed method presented in section 2.4 has been used to distribute the processing of the pulse compression algorithm over multiple cores of the C6678 DSP. Experimental results are presented in Figure 7. The results show that the speedup scales very well as the number of operational cores increases, with a small performance degradation when six or seven cores are activated. This degradation comes from the number of iterations, which is not a multiple of six or seven in our use case. A well-chosen number of iterations will lead to better performance. When the eight cores are activated, the speedup reaches 7.5, with a corresponding parallel efficiency of 94%. Compared to Wang et al. [3] and to our previous research work [1], the proposed method gives the best performance. Figure 8 presents a comparison between the results obtained using the OpenMP framework and those obtained using the proposed method. The proposed method leads to a gain of one core when the number of activated cores is equal to five or seven, and a gain of two cores when the eight cores are activated. Therefore, our proposed method could be used as an alternative to the OpenMP framework to distribute signal-processing algorithms over multi-core DSPs. Radar applications are a good example.
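The effect of the iteration count can be illustrated with a simple load-balance bound. The sketch below assumes that every iteration costs the same and that the busiest core determines the total time; the function name is hypothetical, and the bound ignores memory contention and synchronization cost, so it is not meant to reproduce the measured curve in Figure 7.

```python
def imbalance_speedup_bound(n_iter, n_cores):
    """Best possible speedup when the busiest core runs ceil(n_iter / n_cores) iterations."""
    busiest = -(-n_iter // n_cores)  # ceiling division on integers
    return n_iter / busiest


bound_6 = imbalance_speedup_bound(256, 6)  # 256 / 43, just under 6
bound_7 = imbalance_speedup_bound(256, 7)  # 256 / 37, just under 7
bound_8 = imbalance_speedup_bound(256, 8)  # 256 / 32 = 8.0, perfectly balanced
```

With 256 iterations, eight cores each receive exactly 32 iterations, whereas six or seven cores leave some cores idle while the busiest one finishes its extra iteration.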

CONCLUSION
Pulse compression is the main processing step in several radar applications, such as pulse Doppler radar, GMTI and SAR. Its processing is based on cross-correlation. In order to optimize its processing, the cross-correlation was performed in the frequency domain. We proposed the multi-core C6678 DSP, which integrates eight independent cores with a shared memory, as a real-time computing platform. The goal of this paper was to evaluate the OpenMP framework and to propose an optimized approach to distribute the processing over multiple cores. The proposed method consists of using shared memory to store the synchronization flags and the input and output data. Three scheduling techniques of the OpenMP framework have been tested: static, guided and dynamic. These three techniques give the same performance, with a maximum parallel efficiency of about 70% when the eight cores are activated. The results obtained using the proposed method show a speedup of about 7.5 and a parallel efficiency of about 94%, which is better than the 70% found in previous works and obtained using the OpenMP framework.