Visual victim detection and quadrotor-swarm coordination control in search and rescue environment

We propose a distributed victim-detection algorithm based on visual information from quadrotors using convolutional neural networks (CNNs) in a search and rescue environment. We first describe the navigation algorithm, which allows the quadrotors to avoid collisions. Second, when one quadrotor detects a possible victim, its closest neighbors disconnect from the main swarm and form a new sub-swarm around the victim to validate the victim's status; a formation control that permits information acquisition is then performed based on the well-known rendezvous consensus algorithm. Finally, images are processed using a CNN to identify potential victims in the area. Given the uncertainty of the victim-detection measurements across the quadrotors' cameras, estimation consensus (EC) and max-estimation consensus (M-EC) algorithms are proposed to reach agreement on the victim-detection estimate. We show that M-EC delivers better results than EC in scenarios with poor visibility and uncertainty produced by fire and smoke. The algorithm shows that a distributed approach yields a more accurate decision on whether or not there is a victim, remaining robust under uncertainty and erroneous measurements compared with a single quadrotor performing the mission. The algorithm is evaluated in a simulation carried out in V-Rep.

INTRODUCTION Nowadays, one of the main areas in which the robotics research community is working is the assistance of search and rescue (SAR) missions through the use of mobile robots in disaster zones to safeguard as many lives as possible. Earthquakes, floods, hurricanes, and fires are just some of the most frequent scenarios that endanger human lives. On one hand, current advances in technology have brought robots to the fore as a possible solution or improvement to SAR tasks; there are many fields where robotics could intervene to optimize the performance of human SAR tasks, such as mapping of the environment, detection of victims, and deployment of first aid. On the other hand, the development of robotics theory has improved the algorithms applied to robots in natural disaster zones, increasing the probability of finding survivors, as observed in [1]. In the same way, [2] explains why it is vital to act on the affected area within 48 hours of the incident, as the probability of finding survivors is significantly reduced after that period. Bearing in mind that time is crucial, applying robotics to SAR missions is advantageous, relying on the robots' ability to carry out tasks efficiently in conditions that might be adverse for human rescuers, as presented in [3]. Likewise, works such as [4,5] show the possibility of employing robot swarms capable of behaving cooperatively with humans, improving task performance. Particularly in the robotic SAR area, it is possible to perform the exploration and mapping of the affected area with robots in a shorter time than a conventional rescue team would take [6], allowing victims to be detected and the area to be navigated faster.
Although several works tackle the problem, multiple challenges remain to be solved, such as navigation through non-convex spaces, the implementation of robust victim detection and estimation algorithms, and distributed SLAM, among others. Some of the most popular techniques used in these tasks are bio-inspired or vision-based approaches, as in [7][8][9]. However, real disaster environments are difficult to access because they are usually unpredictable and dangerous, making it difficult to test algorithms. In contrast, virtual environments are easily accessible thanks to the development of simulation engines such as V-Rep and Gazebo; these must capture as many features as possible from real disaster environments, allowing algorithms to be tested and evaluated, as shown in [2]. Another important task that robots can perform in a SAR mission is victim detection, since the number of robots increases the resiliency and robustness of the detection compared with the human rescuers usually available. It is important here that the implemented algorithms diminish the effects of the gap between the visual information sensed by the robot cameras in virtual and in real environments.
The difficulty of finding victims in a SAR scenario increases when the coordination and synchronization of a robot swarm are considered, as in [10]. Coordination is also needed when the victims may move inside the risk zone: it is important to be able to track dynamic targets and maintain communication within the swarm, as in the task-allocation consensus algorithm developed in [11]. Given the popularity of drone technology, other ideas include using drones to cover areas of interest, such as crowded areas where many mobile devices can provide rich information, or deploying drones to monitor climate anywhere and at any time, as seen in [12,13] respectively. Regarding the detection process itself, there are multiple ways to proceed with different types of sensors, such as radar or cameras. When using radar systems, aspects such as processing speed, instrumental accuracy, and the need for sophisticated algorithms are relevant to achieving acceptable operation under adverse conditions [9,14]. In contrast, a camera sensor, as in [7], allows the system to acquire a great deal of information from the environment; the challenge then becomes detecting people in images, taking into account that the victims in a catastrophe are not all in the same position, nor do they share features such as shape, clothing color, size, rotation, and occlusion. Regardless of the sensor used, one of the approaches that has become relevant for reading and interpreting sensor information is the convolutional neural network (CNN). The CNN technique, derived from learning algorithms and artificial intelligence, is a tool that allows the system to be trained to recognize objects in an image by finding target features. Once the CNN is trained, the sensor captures information about the environment and recognizes relevant objects, allowing the system to develop behaviors according to the information perceived, as shown in [15].
The challenge in developing control algorithms based on well-functioning neural networks lies in computational expense and processing time, so that actions are executed correctly and in the shortest possible time [16]. Other authors have worked on victim identification in SAR environments: for instance, [8] uses histograms of oriented gradients based on skin-color detection, which can be sensitive to lighting conditions. In [17] the problem of lighting conditions is tackled by generating a robust Viola-Jones algorithm that detects human victims. In the same way, [18] uses a transformation space through a Gabor approach to detect humans based on skin color in an RGB image, processed frame by frame from a captured video. In contrast, our approach focuses on detecting a victim robustly based on the many features that a CNN can select, considering that SAR environments are highly exposed to occlusion and that human skin might be altered by other components of the environment. On the other hand, some works use sophisticated robots, such as [19], where a robot is capable of identifying a victim using an infrared camera and a lidar sensor. Although it is necessary to have redundancy in the sensors to determine whether or not the robot is detecting a victim, such a robot can be expensive, and since robots are exposed to dropping out due to attrition, it might be better to consider cheaper robots and deploy many of them, providing greater robustness to the system. We assume that communication through the multi-agent network and with the base system is already established; our research is not focused on the communication issue, which was dealt with in [20,21]. Instead, we focus on robot-swarm navigation [22] and victim detection through multi-agent consensus.

Int J Elec & Comp Eng, ISSN: 2088-8708

In this paper we extend the work in [23], where an aerial multi-quadrotor platform capable of navigating in a virtual SAR environment with obstacles was developed. The contribution of this paper is threefold. First, we consider the nonlinear dynamics of the quadrotor instead of the linear ones, which allows the algorithm to generate non-smooth trajectories. Second, we equip each quadrotor with a camera to acquire virtual images from the V-Rep simulator, with the inclusion of fire that occludes the victim. Finally, we show that the max-estimation consensus brings a better outcome than the well-known consensus algorithm when there is occlusion in the images. We use the cameras to acquire information about the virtual environment. A dataset of 25,000 pictures was used to train a CNN that allows the system to identify victims in the virtual environment; the trained CNN is then given to each robot in the swarm. Once each robot in the swarm is capable of detecting victims, the robots that detect a victim in the same place break their communication with the main swarm and generate new links forming a sub-swarm. These sub-swarms let the main swarm keep navigating while they perform a consensus formation algorithm to navigate around the victim, while at the same time performing either an estimation consensus (EC) or a max-estimation consensus (M-EC) to determine whether or not there is a victim, making the system robust in contrast with a single robot trying to detect the victim.
The remainder of this paper is organized as follows. Section 2, swarm navigation and consensus algorithms, presents the quadrotor model, how the quadrotors navigate in the environment until they detect a possible victim, the generation of sub-swarms, and the formation control applied around the victim. Section 3, visual victim detection, shows how the CNN in each quadrotor identifies the victim through its camera, followed by an estimation consensus. Section 4 presents simulations and results that validate the algorithms. Finally, section 5 gives the conclusions and some future work alternatives.

SWARM NAVIGATION AND CONSENSUS ALGORITHM
We consider a set of quadrotors $\mathcal{N} = \{1, 2, \dots, n\}$ whose interactions are modeled via a graph $\mathcal{G} = (\mathcal{N}, \mathcal{E})$, where $\mathcal{E}$ represents the communication between quadrotors. Each quadrotor $i \in \mathcal{N}$ has a corresponding state variable $x_i \in \mathbb{R}^3$, which is the location of the quadrotor along each axis of the space $S \in \mathbb{R}^3$. Note that $S$ is a non-convex space, meaning it is composed of areas free for the quadrotors to move through, $S_f \subset \mathbb{R}^3$, and areas with obstacles, $S_o \subset \mathbb{R}^3$, which the quadrotors cannot traverse, with $S = S_f \cup S_o$.

Robot swarm navigation
The movement of the quadrotors is similar to that shown in [23], where each quadrotor target position is determined by the single-integrator dynamics $\dot{x}^d_i = u_i$, where $\dot{x}^d_i$ is the linear target velocity of each quadrotor and $u_i$ is the control signal to be designed. The approach used to give the desired target positions to the quadrotors is artificial potential functions, which emulate the attraction and repulsion behaviors present in nature, as in Reynolds' rules, allowing quadrotor $i \in \mathcal{N}$ to maintain a comfortable distance to obstacles and to neighboring quadrotors $j \in \mathcal{N}_i$, where $\mathcal{N}_i$ is the neighborhood of the $i$th quadrotor. The control signal is the sum of an attraction force, scaled by an established constant $k_{a_i} \in \mathbb{R}_{>0}$, and a repulsion force that acts once the comfortable distance is breached, scaled by an established repulsion constant, where $r_s \in \mathbb{R}_{>0}$ is the security radius within which the quadrotor avoids collisions, which depends on the obstacle size. Given the generation of desired locations to reach, the objective becomes to navigate in a known space, being attracted to points of interest while avoiding collisions with both other quadrotors and obstacles. The complete swarm navigation behavior is shown in Figure 1, where the swarm navigates keeping the distance among quadrotors while avoiding obstacles.
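A minimal numerical sketch of this attraction-repulsion controller follows. The specific force expressions, gains, and the name `potential_control` are illustrative assumptions for exposition; only the structure (attraction toward a point of interest plus repulsion inside a security radius $r_s$) is taken from the text.

```python
import numpy as np

def potential_control(x_i, x_target, neighbors, obstacles, k_a=1.0, k_r=2.0, r_s=1.5):
    """Single-integrator control u_i = attraction + repulsion (illustrative form)."""
    u = -k_a * (x_i - x_target)              # attraction toward the point of interest
    for x_o in list(neighbors) + list(obstacles):
        d = x_i - x_o
        dist = np.linalg.norm(d)
        if 0.0 < dist < r_s:                 # repel only inside the security radius
            u += k_r * (1.0 / dist - 1.0 / r_s) * d / dist**3
    return u
```

With no obstacle in range the control points toward the target; with an obstacle between the quadrotor and the target, the repulsive term dominates near the obstacle and pushes the quadrotor away, which is the qualitative behavior shown in Figure 1.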
The quadrotor dynamics are modeled using the Newton-Euler approach, based on [24,25], in which the equations of motion can be written as in (1), where $\{X_w, Y_w, Z_w\}$ are unit vectors along the axes of the inertial reference frame $\{W\}$, $\{X_{B_i}, Y_{B_i}, Z_{B_i}\}$ are unit vectors along the axes of the $i$th quadrotor frame $\{B_i\}$ with respect to $\{W\}$, $v_i \in \mathbb{R}^3$ is the linear velocity of the $i$th quadrotor, $m_{q_i}$ is the mass of each quadrotor, $g$ represents gravity, $J_i \in \mathbb{R}^{3\times 3}$ is the inertia matrix of the $i$th quadrotor with respect to $\{B_i\}$, $R_i \in \mathbb{R}^{3\times 3}$ is the rotation matrix that relates $\{B_i\}$ to $\{W\}$, $\Omega_i \in \mathbb{R}^3$ is the angular velocity of the $i$th quadrotor in $\{B_i\}$, $F_i \in \mathbb{R}$ is the total thrust produced by the $i$th quadrotor, $M_i \in \mathbb{R}^3$ is the moment produced by the $i$th quadrotor, and the hat map $\hat{\cdot} : \mathbb{R}^3 \to \mathfrak{so}(3)$ is the skew-symmetric matrix operator, as explained in [26], such that $\hat{x}y = x \times y$ for all $x, y \in \mathbb{R}^3$. In (1) it is clear that the control inputs are $F_i$ and $M_i$; the control laws are found through the geometric control depicted in [24], where $F_i = F^{des}_i \cdot Z_{B_i}$ controls the altitude dynamics, $F^{des}_i$ depends on the position and velocity errors with proportional gains $k_p, k_v \in \mathbb{R}_{>0}$, and the attitude dynamics are controlled through $M_i$ with corresponding proportional gains. These control inputs guarantee that the quadrotor position tends to the desired position, $x_i \to x^d_i$; the proof is shown in [27].
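The hat-map identity $\hat{x}y = x \times y$ used in the model above can be checked numerically; this small sketch is for verification only and is not part of the paper's controller.

```python
import numpy as np

def hat(x):
    """Hat map R^3 -> so(3): skew-symmetric matrix with hat(x) @ y == cross(x, y)."""
    return np.array([[0.0,  -x[2],  x[1]],
                     [x[2],  0.0,  -x[0]],
                     [-x[1], x[0],  0.0]])

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])
assert np.allclose(hat(x) @ y, np.cross(x, y))   # the defining identity
assert np.allclose(hat(x).T, -hat(x))            # skew-symmetry
```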

Sub-swarm formation around a possible victim
Once navigation with collision avoidance and connectivity maintenance is guaranteed, the next important behavior is to create a sub-swarm when a victim is found in the environment. When at least one of the quadrotors detects a victim, it hovers over it, affecting the behavior of the whole swarm, which is why that robot has to break its communication links with the rest of the quadrotors, allowing the main swarm to move freely while leaving a reduced group of quadrotors out of the graph. The quadrotors decide to leave the main swarm through the $k$-nearest neighbors approach, which behaves as a classifier by selecting, as its name indicates, the $k$ quadrotors closest to the first quadrotor that detected a possible victim. Once the neighborhood $\mathcal{N}_{ss}$ has been chosen, we create a sub-graph $\mathcal{G}_{ss} \subset \mathcal{G}$, with $\mathcal{G}_{ss} = (\mathcal{N}_{ss}, \mathcal{E}_{ss})$, which is disconnected from the main graph $\mathcal{G}$. Here $\mathcal{E}_{ss}$ is generated taking into account a weight function $W_{ss} : \mathcal{E}_{ss} \to \mathbb{R}$ based on the distance between $x^d_f$, the position of the first quadrotor that detects the victim, and the positions $x^d_j$ of its neighbors. In this way, the closer quadrotors have more relevance when classifying whether the $i$th quadrotor joins the sub-swarm or remains in the main swarm.
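The $k$-nearest-neighbor selection of the sub-swarm can be sketched as below; `select_subswarm` is a hypothetical name, and the only assumption taken from the text is that membership is decided by Euclidean distance to the first detector.

```python
import numpy as np

def select_subswarm(positions, detector_idx, k):
    """Return the indices of the k quadrotors closest to the first detector,
    plus the detector itself (the candidate members of the sub-swarm)."""
    positions = np.asarray(positions, dtype=float)
    d = np.linalg.norm(positions - positions[detector_idx], axis=1)
    order = np.argsort(d)                 # detector itself is at distance 0, hence first
    return sorted(order[:k + 1].tolist())
```

For example, with quadrotors at x-coordinates 0, 1, 5, 0.5, and 10, a detection by quadrotor 0 with $k = 2$ selects quadrotors 0, 1, and 3 for the sub-swarm, leaving the rest in the main graph.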

Formation control
When the sub-swarm is formed, all of the quadrotors that belong to it start to sense the victim through their cameras. Each quadrotor obtains its own percentage value of victim recognition at all times. Here we perform a formation control based on consensus, which gives the quadrotors the possibility of acquiring as much information as possible about the victim from different perspectives, allowing them to determine more precisely whether it is a victim or not. Hence, the desired formation control considers the following dynamics only for the quadrotors that belong to the sub-swarm, $\dot{x}^d_{ss_i} = -\sum_{j \in \mathcal{N}_{ss}} \nabla_{x^d_i} \psi_{ij}$, where the $\psi_{ij}$ are potential functions that guarantee connectivity maintenance and collision avoidance while achieving the established formation. These potential functions are defined in terms of $\rho_2$, the connectivity radius, and $\rho_1$, the minimum radius to avoid collisions. The resulting formation is shown in Figure 2, where the objective is to locate all the quadrotors equally distributed around the victim.

VISUAL VICTIM DETECTION
In order to perform the task of saving victims in a disaster zone, not only the navigation algorithm is important but also the system that allows the robot to detect and localize victims. Thus, it is important for each agent to recognize its environment and detect when a victim is nearby. Each robot performs the detection task through visual information analysis using a CNN. Image acquisition is based on the robots' cameras, which are the main sensors in this approach.
Once the robots that form the sub-swarm are determined, the consensus formation is performed until every quadrotor reaches its final position. During this time, every quadrotor uses the CNN to determine its individual certainty value of victim detection. While each quadrotor belonging to the sub-swarm produces its victim-estimation measurement, the EC algorithm is applied to those values to compute an agreed victim-detection level. In this way, the sub-swarm provides the rescuers with more accurate information about the existence or not of a victim in the nearby zone. With this in mind, some concepts of CNNs and of the sensing consensus are briefly described below.
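Assuming the final arrangement is a circle of evenly spaced hover points around the victim, as Figure 2 suggests, the target positions of the sub-swarm can be sketched as follows; `formation_targets` and its parameters are hypothetical names for illustration.

```python
import numpy as np

def formation_targets(victim_xy, n, radius, altitude):
    """Evenly spaced hover points on a circle of given radius around the victim,
    all at the same altitude (illustrative geometry for the final formation)."""
    angles = 2.0 * np.pi * np.arange(n) / n
    xy = np.asarray(victim_xy, float) + radius * np.column_stack([np.cos(angles),
                                                                  np.sin(angles)])
    return np.column_stack([xy, np.full(n, float(altitude))])
```

Each of the $n$ quadrotors then views the victim from a distinct bearing, which is what lets the sub-swarm accumulate complementary visual evidence.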

Basic concepts of convolutional neural networks
A CNN is a special class of artificial neural network (ANN), originally proposed in [28], used to process digital images in classification or identification tasks. These networks employ different convolutional filters with linear rectifiers intended to extract multiple features of interest from the image, such as borders, corners, or specific shapes. After the convolutional filtering, the resulting images are down-sampled in the so-called pooling process, reducing the size of the image while preserving most of the relevant information. These two steps (convolution and pooling) are repeated several times, each time yielding a larger number of images with smaller dimensions. Finally, the values of the resulting images are given as input to a traditional neural network with fully connected layers, whose weights are adjusted in a supervised training process fed with a large number of images properly labeled according to the human victim detection goal, as shown in Figure 3.
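The convolution, rectification, and pooling steps described above can be illustrated with a minimal NumPy sketch (naive loops, no deep-learning framework); it is a didactic example of the operations, not the paper's trained network.

```python
import numpy as np

def conv2d_valid(img, kernel):
    """Naive 'valid' 2-D convolution (cross-correlation, as in most CNN libraries)."""
    kh, kw = kernel.shape
    H, W = img.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Linear rectifier applied element-wise after the filtering."""
    return np.maximum(x, 0.0)

def max_pool2(x):
    """2x2 max pooling with stride 2: halves each dimension, keeps the strongest response."""
    H, W = x.shape
    return x[:H // 2 * 2, :W // 2 * 2].reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))
```

A 6x6 image filtered with a 3x3 kernel gives a 4x4 response map, which pooling reduces to 2x2, showing how each convolution-pooling round shrinks the spatial dimensions while preserving the dominant features.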

Consensus applied to a multi-sensor network for victim identification
Once the sub-swarm that will identify the victim is determined, as shown in section 2.2, a distributed estimation consensus guarantees robustness to uncertainties or malfunctioning of any sensor or quadrotor. The estimation model considered is based on distributed linear least squares in the presence of uncertainties, $\sigma_i = H_i \beta + \varepsilon_i$, where $\beta$ is the quantity to be estimated, affected by uncertainties $\varepsilon_i$, $\sigma_i$ is the measurement channel of the $i$th sensor, and $H_i$ is a variable that ensures the measurements are not entirely redundant. Since the aim of least squares is to minimize the error, isolating the error term yields a function $f_i(\beta_i)$ that depends only on $\beta_i$. The purpose of the algorithm can then be described by the minimization of the sum of the $f_i : \mathbb{R}^q \to \mathbb{R}$, which are convex functions; as a consequence, the optimal point is found at the average of the individual cost functions, $f^* = \frac{1}{n}\sum_{j=1}^n f_j$. Considering where this minimum is attained, the distributed estimation takes the form of (2), which is exactly a distributed average consensus. It converges if the following conditions are met (the convergence proof is shown in [29]): the sensor network must be connected and $\rho(L_w(\mathcal{G})) < \frac{2}{\Delta}$, where $\rho(L_w(\mathcal{G}))$ is the maximum eigenvalue, in absolute value, of the weighted graph Laplacian.
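The distributed average consensus can be sketched as a discrete-time Laplacian iteration; the step size `epsilon` and iteration count below are illustrative assumptions, and, as stated above, convergence requires the graph to be connected with a sufficiently small step relative to the Laplacian spectrum.

```python
import numpy as np

def estimation_consensus(measurements, adjacency, epsilon=0.1, iters=500):
    """EC sketch: each node repeatedly nudges its estimate toward its neighbors'
    estimates; on a connected graph all estimates converge to the network average."""
    A = np.asarray(adjacency, dtype=float)
    L = np.diag(A.sum(axis=1)) - A           # graph Laplacian L_w(G)
    beta = np.asarray(measurements, dtype=float).copy()
    for _ in range(iters):
        beta = beta - epsilon * L @ beta     # beta_i += eps * sum_j a_ij (beta_j - beta_i)
    return beta
```

For instance, four quadrotors on a ring with individual detection values 0.2, 0.8, 0.4, and 0.6 all converge to the average 0.5, so a single poor measurement no longer dictates the outcome.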
Given that the consensus is applied to a multi-sensor system whose aim is human-victim detection, it is essential to avoid uncertain measurements produced by the different objects in the disaster scene or by situations proper to the emergency area, such as fire, smoke, or debris, among others. Because of this possible measurement bias and the relevance of human life, we propose computing the consensus using the best detection measurement acquired up to the current moment by any sensor of the network, as depicted in (4), where $k = 0, 1, 2, \dots, t$. This consensus calculation is called M-EC, and it is used to maximize the victim-detection level in scenarios with high uncertainty in the sensor measurements.
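A sketch of M-EC under the same illustrative assumptions as the EC example: each quadrotor first takes the running maximum of its own VDL history up to time $t$, and the averaging step then proceeds exactly as in EC.

```python
import numpy as np

def m_ec(history, adjacency, epsilon=0.1, iters=500):
    """M-EC sketch: consensus over each quadrotor's best (maximum) VDL observed so far.
    history[i] is the sequence of VDL samples recorded by quadrotor i."""
    sigma_max = np.max(np.asarray(history, dtype=float), axis=1)  # per-quadrotor running max
    A = np.asarray(adjacency, dtype=float)
    L = np.diag(A.sum(axis=1)) - A
    beta = sigma_max.copy()
    for _ in range(iters):
        beta = beta - epsilon * L @ beta                          # same averaging step as EC
    return beta
```

Feeding the best measurement seen so far, rather than the instantaneous one, is what keeps a momentary occlusion by fire or smoke from dragging the agreed VDL down.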

SIMULATION AND RESULTS
In order to evaluate the navigation, sub-swarm generation, consensus formation, and visual victim-detection algorithms, experiments were performed in a virtual scenario with trees, human victims, fire, and quadrotors on uneven terrain. The virtual scenario was developed using a combination of MATLAB, Python, and V-Rep. MATLAB was used to implement the mathematical models for the navigation and consensus algorithms; Python ran the CNN model and was in charge of human-victim detection; and V-Rep is a virtual robotics environment used to build a virtual disaster scenario where quadrotor models can be used in SAR operations. The principal reason for performing these experiments in a virtual simulation is the difficulty of having a real disaster scenario where this kind of experiment is possible, as discussed previously in [2]. Two cases were run to show the benefits of using EC and M-EC. The objective of this simulation is to illustrate the improvement that M-EC provides in cases where the environment generates high levels of uncertainty in the sensor measurements. The first case is a disaster scenario with occlusion points generated by objects of the scene, such as debris or trees, as shown in Figure 4. In the second case, the same scene is used with the addition of fire and smoke, which increases the sensing uncertainty and makes the victim-detection process more difficult, as depicted in Figure 5.

Victim detection performed by a single quadrotor
As explained in section 3, every quadrotor is equipped with cameras as its sensing system, whose principal aim is human-victim detection. The image-processing task is performed by a CNN, which is in charge of image analysis focused on identifying potential human victims. As shown in [30], victims in urban search and rescue environments are typically under rubble, or some object obstructs part of the victim, so it is important to train a CNN with this kind of data in order to get better detection; however, obtaining enough images with these characteristics is a hard task in itself. In that paper the authors create a set of 570 images, which can be considered a small dataset. Transfer learning is a quite useful method for object detection, in which a feature extractor is reused and the top classification layers are fine-tuned for the task at hand. Since a dataset that generalizes the desired concept can be difficult to acquire, a virtual environment can be used instead, as it is more flexible: labels or bounding boxes can be extracted automatically, and models of any existing object can be obtained if needed. In [31] the aim is to combine these two methods, using transfer learning and a virtual dataset to obtain a pedestrian detector; the results show that high performance can be achieved on real-world datasets when training is done on a virtual dataset and a small set of real images is then used for fine-tuning. Finally, in [32] a CNN is trained solely on a virtual dataset of three classes and then tested on real images. The architecture of that CNN was taken as the starting point for this paper; after iterations and pruning we arrived at the model presented here, whose training time was lower because the network is shallow, without lowering the accuracy.
For this simulation, the CNN topology consists of three convolutional layers and two fully connected layers. The convolutional layers have 256 filters each, with kernel sizes of 5 × 5, 3 × 3, and 3 × 3, respectively. The fully connected layers have 512 and 256 neurons, respectively, with a linear rectifier as the activation function. Finally, the output layer has a single neuron in charge of providing the certainty level of human-victim detection, as depicted in Figure 4, where three cases of victim detection are shown with their respective victim detection level (VDL).
Once the quadrotors are assigned to the sub-swarm, all of them perform a formation consensus in which each one navigates through the area near the potential victim, covering the largest possible area around it, as shown in Figure 2. While the formation consensus is performed, the VDL changes according to the visibility and the proximity of the quadrotor to the potential victim, as depicted in Figures 4 and 5. Figure 5a shows the path followed by the six quadrotors that form the sub-swarm, and Figures 5b and 5c show the victim detection performed by quadrotors D1 and D2, respectively, during the formation consensus.
As shown in Figure 6, the measurements provided by the different quadrotors about the existence of victims in the nearby area can be confusing and dissimilar. For example, Figure 6a depicts the six different quadrotor sensor cases: D1 starts its navigation with a total lack of evidence of victim detection, but its detection improves considerably while the formation consensus is performed; D4 shows a low and constant detection level the whole time; and D5 loses visual contact with the victim but recovers some VDL over time. This is just one example of the wide variety of cases that can arise in a real disaster scenario.

Victim detection performed by a sub-swarm
As shown in section 4.1, the measurement provided by a single quadrotor shows considerable discrepancy when compared with the measurements provided by the rest of the sub-swarm agents. This discrepancy is only logical considering that a disaster site is a chaotic environment where a measurement can be affected by fire, debris, electromagnetic interference, and landslides, among others. Figure 6 shows the VDL values of each sub-swarm quadrotor along its trajectory during the formation consensus; Figures 6a and 6b correspond to the two cases described in the introduction of section 4. Figure 6a shows the VDL of the six quadrotors in a scenario with occasional occlusions, and Figure 6b shows the same scenario with the addition of fire and smoke, which increases the uncertainty of the VDL and makes victim detection more difficult. Both cases are evaluated in Table 1, where different statistical descriptors are computed from the VDL acquired while the formation consensus runs: mean, standard deviation, maximum, and final value. These descriptors show how fire and smoke affect victim detection. The VDL average decreases for each quadrotor in the presence of fire and smoke, and the standard deviation increases slightly in the second case, showing that fire reduces the VDL and increases the uncertainty of the measurements. The maximum and final VDL values vary slightly because they also depend on the path each quadrotor follows during the consensus formation and on the presence of fire and smoke along that path. This possible lack of agreement among the sub-swarm quadrotors demonstrates the need for an EC in order to have an agreed VDL. As previously shown, many factors can increase the sensing uncertainty in a disaster area and at the same time make it difficult to detect victims.
For this reason, we propose the implementation of an EC to agree on an official value of victim detection among the different sub-swarm agents. Additionally, we propose a consensus based on the maximum values of VDL, aimed at offsetting the effect of adverse factors in the disaster area such as fire, smoke, debris, and collapse, among many others. Figure 7 shows the contrast between EC and M-EC in both simulation scenarios, with EC and M-EC drawn as red and blue lines, respectively; Figures 7a and 7b depict the values of EC and M-EC for the two disaster-zone simulations. Figure 7 shows how, in the first case, M-EC slightly improves the final value of victim detection, going from 61.483% to 66.36%, an increase of approximately 5%, which is not significantly large. However, in the second case, where there are adverse factors such as fire, the EC value decreases considerably, reaching 44.5%, as Table 1 previously suggested. In this case, M-EC considerably improves the estimation, increasing the consensus value to 63.342%, an improvement close to 20%, despite the difficulties proper to a disaster zone, such as fire and smoke.

CONCLUSIONS AND FUTURE WORK
As expected, the artificial potential functions work well for navigating in a non-convex environment, allowing the quadrotors to maintain communication connectivity while avoiding collisions with both other robots and obstacles. The sub-swarm generation keeps the swarm from being stalled by the quadrotors that detect the victim and, on the contrary, keeps it navigating. The sub-swarm that detects the victim breaks communication with the main swarm because it switches task, from navigating to acquiring more information to improve the accuracy of the victim determination. Every quadrotor was equipped with cameras for victim detection; the visual system includes a CNN in charge of localizing the possible victim and providing an estimation value called the victim detection level (VDL). It was evident that the performance of visual detection can be affected by external factors proper to the disaster zone, such as fire, smoke, visual occlusions, and debris, among others, which can generate mistakes in the VDL.
Taking into account that lives are the main objective in SAR missions and mistakes must be diminished, EC and M-EC were introduced, whose main objective is to agree on an estimation from the different measurements provided by each sub-swarm quadrotor. The basic EC proved more effective at detecting victims than a single quadrotor's sensing system; however, it showed a reduction of the victim-detection consensus level in environments with fire and smoke. M-EC, instead, proved robust in environments with visual occlusions, fire, and smoke, since if one quadrotor fails to identify a victim, the distributed approach ensures that this mismeasurement will not cause a real victim in the environment to be missed. As future work we consider the use of different kinds of sensors within the same estimation network, which makes the graph heterogeneous and changes its dynamics. Additionally, when different kinds of sensors are used in the estimation, they might have different accuracy levels, which can be modeled in the graph as weighted links, indicating which sensors are more reliable than others.