Quality grading of soybean seeds using image analysis

ABSTRACT


INTRODUCTION
Soybeans are an important agricultural crop that is widely consumed because it is an exceptional source of nutrients, with a high protein content and a very high oil content [1]. Soybean quality affects the price of the grain and its suitability for planting and consumption. Soybean diseases greatly reduce the economic value of soybean products and cause losses for both the soybean industry and farmers. Thus, a rapid and reliable method of assessing the appearance quality of soybeans is of great significance to soybean farmers and the soybean industry [2]. Various soybean diseases affect the appearance of the seeds in terms of size, shape, and color. Disease-affected soybeans may appear as purple seeds, green seeds, wrinkled seeds, or small/split seeds. Because soybean productivity depends largely on grain quality, quality grading is a very important process for the soybean industry and soybean farmers.
A grading machine is used to classify soybeans according to their quality. This machine can separate only foreign material and seeds of non-standard size; it cannot identify low-grade seeds such as dried seeds, green seeds, purple seeds, and contaminants that are the same size as regular soybean seeds. Therefore, skilled workers are needed to carry out the quality grading of soybean grain. However, manual grading requires many workers, is time-consuming, and is prone to human error.
Another quality grading approach uses image processing techniques and machine learning algorithms to evaluate the quality of agricultural products and food. In recent years, such methods have become widespread [3][4][5][6][7][8]. Mebatsion et al. classified cereal grains, namely barley, oat, rye, and wheat, using morphological and RGB color features with an identification accuracy of 99.6% [9]. Olgun et al. developed an automated system for wheat grain classification that achieves an accuracy of 88.33% using dense scale-invariant feature transform (SIFT) features evaluated by a support vector machine (SVM) classifier [10]. Another approach, proposed for disease recognition on soybean leaves, is based on local image descriptors and Bag of Visual Words, which are robust to occlusion [11]. Tan et al. [12] proposed a method that uses back propagation (BP) artificial neural networks to recognize soybean seed diseases such as soybean frogeye, mildewed soybeans, worm-eaten soybeans, and damaged soybeans, with an accuracy of 90% for heterogeneous soybean seeds with several diseases. The performance of that system was limited by color differences and shadow noise under natural light. Therefore, the method requires a light source that distributes light evenly.
In a recent study, Wei-Zhen et al. [13] developed a model to estimate the percent defoliation of soybean leaves from RGB images using the leaf area and the number of edge pixels. Momin et al. [14] developed a machine vision system that uses all components of the HSI color model, extracted from front-lit and back-lit images, to detect materials other than grain (MOGs), such as split beans, contaminated beans, defective beans, and stem/pod materials, in soybean harvesting. Their method achieved accuracies of 96%, 75%, and 98% for split beans, contaminated beans, and both defective beans and stems/pods, respectively. However, its performance depends on controlling the front and back lighting of the web camera, because illumination changes affect the HSI color components.
Although some existing image processing and machine learning techniques have been adapted to the quality grading of soybean seeds, several critical problems remain. First, shadow noise can occur when the camera angle or the natural lighting conditions change. This noise reduces the accuracy of image segmentation and affects the detection of soybean seed boundaries. Second, color differences can degrade the classification power of soybean seed quality grading. Current color feature extraction methods are not robust to illumination changes because the light source was designed for each dataset. Therefore, color features should be extracted from a color model and components that are insensitive to illuminance changes in order to improve the classification results. Third, shape variance within each class of soybean seeds is an obstacle to soybean seed categorization. For this reason, combining color features with other features that are not based on shape can improve the efficiency of classification for each class of soybean seeds.
To this end, this paper proposes a framework for generating a new classifier model for the quality grading of soybean seeds that addresses these three challenges. Our approach consists of processing steps that are robust to shadow noise, illumination changes, and shape variance. Our investigation focuses on diseases of soybean seeds such as dried soybeans, green soybeans, purple soybeans, wrinkled soybeans, and small/split soybeans.

PROPOSED METHOD
This work focuses on the recognition and classification of soybean seeds for quality grading. The major steps are image segmentation, seed cropping, feature extraction, model construction, selection of a suitable classifier model, and accuracy assessment, as presented in Figure 1.

Image data set
This study used 1,320 soybean seeds from the Seed Research and Development Center in Phitsanulok Province. Specifically, this study used the Chiang Mai 60 variety because it is the most popular soybean seed in Thailand. The samples consisted of 400 normal seeds and 920 infected seeds, the latter comprising 275 purple seeds, 215 green seeds, 220 wrinkled seeds, and 210 other seeds. The soybean seed samples are presented in Figure 2.
The hardware used for image acquisition consisted of a color camera (EOS 700D, Canon) with a zoom lens with a focal length of 18-55 mm (EF-S18-55 IS STM), a light source, and black A3 paper (11.7 x 16.5 in). The camera was mounted onto a stand above a light with an 11 W lamp. Each group of soybean samples was placed on the black paper and photographed at a distance of 12 in and at an angle of 60° to 90° to the paper. A photograph of the black paper without soybean seeds was captured before the seeds were added. Each image contained approximately 10 to 30 seeds. The images were captured at a size of 800 x 600 pixels and a resolution of 72 pixels/in.

Image segmentation and seed cropping
In the captured image, each soybean seed was segmented using background subtraction [15], an approach widely used to separate foreground objects from the background in videos captured by static cameras. In this study, we applied this method to still-image segmentation. We used the frame difference algorithm because it is the simplest background subtraction method, is appropriate for still images, and is capable of handling lighting changes. Let I be a captured image that contains soybean seeds, and let P[I] denote the intensity of the pixels in I. Likewise, let P[B] denote the intensity of the pixels in the background image B. The frame difference P[F] is computed by (1), in which P[B] is subtracted from P[I] at the corresponding pixel positions.
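As a sketch of the frame difference in (1), assuming the common absolute-difference form applied at each pixel position (x, y) and in each color channel c, the computation can be written as below. The absolute value and the index notation are assumptions for illustration and are not taken from the source.

```latex
P[F](x, y, c) = \left| \, P[I](x, y, c) - P[B](x, y, c) \, \right|, \qquad c \in \{R, G, B\}
```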
The frame difference between the background image and the captured image was calculated in each component of the RGB color space. The result identifies the regions of the captured image that contain soybean seeds, and it helps to reduce the problem of shadows appearing in the captured image. Subsequently, the difference image is converted to a binary image by locally adaptive thresholding [16], which computes a threshold for each pixel from the local mean intensity in the neighborhood of that pixel. Holes in the binary image indicate noise, and they are removed by a morphological operation [17]. Then, the region boundaries of each soybean seed are traced using the Moore-Neighbor tracing algorithm. Each soybean seed is represented by a connected component of the region boundaries. The separated soybean seeds are shown in Figure 3. Finally, each soybean seed is cropped to the extent of its region boundary, as shown in Figure 4.
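The segmentation steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the per-channel differences are combined by taking their maximum, the threshold rule is "pixel exceeds local mean plus a fixed offset", and a flood-fill connected-component search stands in for Moore-Neighbor tracing. The function names, `radius`, `offset`, and the tiny synthetic image are all hypothetical.

```python
from collections import deque

def frame_difference(image, background):
    """Absolute RGB difference per pixel, reduced to the largest channel difference."""
    return [
        [max(abs(a - b) for a, b in zip(pix, bg)) for pix, bg in zip(row, bg_row)]
        for row, bg_row in zip(image, background)
    ]

def local_mean_threshold(gray, radius=2, offset=5):
    """Foreground where a pixel exceeds the mean of its neighborhood plus an offset."""
    h, w = len(gray), len(gray[0])
    mask = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ys = range(max(0, y - radius), min(h, y + radius + 1))
            xs = range(max(0, x - radius), min(w, x + radius + 1))
            window = [gray[j][i] for j in ys for i in xs]
            mask[y][x] = gray[y][x] > sum(window) / len(window) + offset
    return mask

def bounding_boxes(mask):
    """(top, left, bottom, right) of each 8-connected foreground region, in scan order."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if not mask[y][x] or seen[y][x]:
                continue
            queue = deque([(y, x)])
            seen[y][x] = True
            top, left, bottom, right = y, x, y, x
            while queue:
                cy, cx = queue.popleft()
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes

# Synthetic 8 x 8 example: dark background with two 2 x 2 "seeds".
background = [[(10, 10, 10)] * 8 for _ in range(8)]
image = [row[:] for row in background]
for y in (1, 2):
    for x in (1, 2):
        image[y][x] = (200, 180, 60)    # yellowish seed
for y in (5, 6):
    for x in (5, 6):
        image[y][x] = (120, 60, 130)    # purplish seed

diff = frame_difference(image, background)
mask = local_mean_threshold(diff)
boxes = bounding_boxes(mask)            # one crop rectangle per seed
```

Each returned box can be used to crop one seed from the captured image. On real photographs, the window radius would need to be chosen relative to the seed size, and the hole-filling morphological step described above is still required.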

Feature extraction
This study combined color and texture characteristics to improve classification efficiency in the presence of shape variance within each class of soybean seeds. The color histogram algorithm [18] and the Gray Level Co-occurrence Matrix (GLCM) [19] were used to extract the color and texture features, respectively. The HSI model is used for the color histogram because it is more closely related to how humans perceive color. The components of this model are hue, saturation, and intensity. The captured images are converted from the RGB model to the HSI model using the color transformation [17] expressed by (2). Figure 5 shows the colors in the H (hue) component for all soybean seeds in the dataset. Differences between the classes can be observed in the hue values of the HSI model. The maximum, minimum, and mean hue values are listed in Table 1. The hue values of each soybean seed are transformed into a color histogram that represents the color distribution of the image. To build the histogram, the color space is divided into a number of color bins, and the pixels falling in each bin are counted. Given a color space with k color bins, the color histogram of an image I is defined as H(I) = [h(1), h(2), ..., h(k)], where h(i) is the histogram value for color bin i calculated by Equation (5), n_i is the number of pixels in color bin i, and N is the total number of pixels in the image.
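The hue histogram described above can be sketched as follows. The hue formula is the widely used RGB-to-HSI conversion, assumed here to be what (2) denotes, and the histogram is assumed to be normalized as h(i) = n_i / N. The function names and the sample pixel values are hypothetical.

```python
import math

def rgb_to_hue(r, g, b):
    """Hue in degrees, in [0, 360), from the standard RGB-to-HSI conversion.

    theta = acos( 0.5 * ((R - G) + (R - B)) / sqrt((R - G)^2 + (R - B) * (G - B)) )
    H = theta if B <= G, otherwise 360 - theta
    """
    num = 0.5 * ((r - g) + (r - b))
    den = math.sqrt((r - g) ** 2 + (r - b) * (g - b))
    if den == 0:
        return 0.0   # gray pixel: hue is undefined, 0 is used here by convention
    # Clamp the ratio to guard against rounding slightly outside [-1, 1].
    theta = math.degrees(math.acos(max(-1.0, min(1.0, num / den))))
    return theta if b <= g else 360.0 - theta

def hue_histogram(pixels, k):
    """Normalized k-bin hue histogram, assuming h(i) = n_i / N."""
    counts = [0] * k
    for r, g, b in pixels:
        bin_index = int(rgb_to_hue(r, g, b) / 360.0 * k) % k
        counts[bin_index] += 1
    total = len(pixels)
    return [count / total for count in counts]

# Four sample pixels: two orange (hue near 30), one yellow-green (near 90),
# one violet (near 270). With k = 6, each bin spans 60 degrees.
sample_pixels = [(255, 128, 0), (255, 128, 0), (128, 255, 0), (128, 0, 255)]
hist = hue_histogram(sample_pixels, 6)
```

Because only the hue component is binned, the feature vector has k dimensions regardless of image size, which is what allows the comparison of different bin counts in the experiments.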
The GLCM is a statistical method of examining texture that analyzes the spatial relationship of pixels by calculating how often different combinations of gray levels occur between pixel pairs in an image. Because the GLCM describes texture rather than shape, it can be used to address the problem of shape variance within each class of soybean seeds. Haralick [21] proposed 14 types of GLCM statistics to describe texture features. In this study, we used four of these second-order statistics, namely contrast, energy, correlation, and entropy [20], to extract the texture features of soybean seeds. The texture features of the soybean seeds are shown in Table 2. The equations for calculating these features are presented in Table 2. Contrast measures the local variations in the gray-level co-occurrence matrix. Entropy measures the randomness or disorder of the image area. Correlation measures how correlated a pixel is to its neighbor over the entire image. Finally, energy measures the textural uniformity of the image area.
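To make the four statistics concrete, the following sketch builds a GLCM for a small gray-level image and computes contrast, energy, entropy, and correlation using their common textbook definitions. The horizontal offset of one pixel, the symmetric counting, the base-2 logarithm, and the definition of energy as the sum of squared entries are assumptions; the source gives its exact formulas only in a table. The function names and the 4 x 4 sample patch are hypothetical.

```python
import math

def glcm(image, levels, dx=1, dy=0):
    """Normalized, symmetric co-occurrence matrix for the pixel offset (dx, dy)."""
    counts = [[0] * levels for _ in range(levels)]
    height, width = len(image), len(image[0])
    for y in range(height):
        for x in range(width):
            ny, nx = y + dy, x + dx
            if 0 <= ny < height and 0 <= nx < width:
                a, b = image[y][x], image[ny][nx]
                counts[a][b] += 1
                counts[b][a] += 1   # count the pair in both directions
    total = sum(sum(row) for row in counts)
    return [[count / total for count in row] for row in counts]

def glcm_features(p):
    """Contrast, energy, entropy, and correlation of a normalized GLCM."""
    size = len(p)
    cells = [(i, j, p[i][j]) for i in range(size) for j in range(size)]
    contrast = sum((i - j) ** 2 * v for i, j, v in cells)
    energy = sum(v * v for _, _, v in cells)
    entropy = -sum(v * math.log2(v) for _, _, v in cells if v > 0)
    mean_i = sum(i * v for i, _, v in cells)
    mean_j = sum(j * v for _, j, v in cells)
    var_i = sum((i - mean_i) ** 2 * v for i, _, v in cells)
    var_j = sum((j - mean_j) ** 2 * v for _, j, v in cells)
    if var_i > 0 and var_j > 0:
        cov = sum((i - mean_i) * (j - mean_j) * v for i, j, v in cells)
        correlation = cov / math.sqrt(var_i * var_j)
    else:
        correlation = 0.0   # undefined for a constant patch; 0.0 is used here
    return {"contrast": contrast, "energy": energy,
            "entropy": entropy, "correlation": correlation}

# 4 x 4 patch quantized to four gray levels (0 to 3).
patch = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]
features = glcm_features(glcm(patch, levels=4))
```

In practice the cropped seed image would first be converted to gray scale and quantized to a fixed number of levels, and the four values would be appended to the hue histogram to form the combined feature vector.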

Model construction
A support vector machine (SVM) is a supervised learning algorithm that is widely used for visual pattern recognition and image classification [22,23]. An SVM classifier [24] seeks an optimal hyperplane that separates the training samples of two classes. For a multi-class SVM classifier, the two-class separation is applied in a one-against-all manner. The resulting classifier model assigns one of the classes to each test sample. In this study, the SVM classifier with a polynomial kernel was used to classify the images of diseased soybean seeds from their color and texture features.
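For reference, a polynomial-kernel SVM with one-against-all decisions can be written in the following general form. This is a sketch of the standard formulation rather than the configuration used in the study: the kernel degree d, the scale \gamma, the offset r, and all symbol names are assumptions, since the source does not report them. Here x_i are the training feature vectors, y_i^{(k)} is +1 if sample i belongs to class k and -1 otherwise, and \alpha_i^{(k)} and b_k are the learned parameters of the k-th binary classifier.

```latex
K(\mathbf{x}, \mathbf{z}) = \left( \gamma \, \mathbf{x}^{\top} \mathbf{z} + r \right)^{d}

f_k(\mathbf{x}) = \sum_{i} \alpha_i^{(k)} \, y_i^{(k)} \, K(\mathbf{x}_i, \mathbf{x}) + b_k

\hat{y}(\mathbf{x}) = \arg\max_{k} \; f_k(\mathbf{x})
```

With five seed classes, five binary classifiers f_k are trained, and a test sample is assigned to the class whose classifier returns the largest score.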

RESULTS AND ANALYSIS
In this section, we present the results of experiments using the color histogram and the GLCM statistics to evaluate the accuracy of the proposed soybean seed disease classification method for quality grading. To evaluate robustness to lighting changes, the brightness of each image in the training dataset was increased and decreased, giving five brightness levels per image. The experimental dataset therefore contains 6,600 images. Figure 6 shows soybean seeds at the five brightness levels. For classification, we use a multi-class SVM classifier with 10-fold cross-validation, in which the dataset is partitioned into 10 folds. One fold is used for testing, while the remaining folds are used for training. The SVM classifier is executed 10 times, and each fold is used for testing exactly once. The reported accuracy is the average over the 10 executions. This study defined five classes for the classification of soybean seed diseases: normal seeds, purple seeds, green seeds, wrinkled seeds, and other seeds.
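The 10-fold procedure described above can be sketched as an index-splitting routine. The shuffling, the fixed seed, and the function name are assumptions for illustration. The source does not state whether the five brightness variants of one seed were kept in the same fold; this sketch shuffles all images independently.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) so that each fold is the test set once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k nearly equal folds
    for i, test_fold in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test_fold

# 1,320 seeds at five brightness levels give 6,600 images.
n_images = 1320 * 5
splits = list(k_fold_indices(n_images, k=10))
```

A classifier would be trained on each `train` list and scored on the matching `test` list, and the ten scores would then be averaged to obtain the reported accuracy.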
The performance was measured in terms of precision, recall, and the F1-measure [25] by comparison with a ground-truth dataset generated by users. We compared color histograms with different numbers of color bins for the HSI model and the RGB model. The HSI model is a cylindrical geometry whose angular dimension ranges from 0° to 360° and represents hue, starting with red at 0°, followed by yellow at 60°, green at 120°, cyan at 180°, blue at 240°, and magenta at 300°, and then wrapping back to red at 360°. Therefore, we separated the H (hue) component into 6, 36, 72, 144, 288, and 360 color bins. The S (saturation) component was separated into 4 bins, and the I (intensity) component was separated into 4 bins. Each channel of the RGB model was separated into 2, 3, 4, 5, 6, 7, and 8 bins. The best color feature was then combined with the GLCM statistics. The classifier models were developed using the SVM classifier with a polynomial kernel, and the results are shown in Figure 7. The results indicate that the color feature based on the H component of the HSI model gives higher classification accuracy than the color feature based on all channels of the RGB model. The H component of the HSI model is therefore robust to lighting changes and requires the fewest color feature dimensions.
Table 4 shows the classification accuracy obtained with the two feature sets and an SVM classifier. An F-measure of 0.866 was obtained with the GLCM-based texture features alone, whereas an F-measure of 0.992 was obtained with the combined color and texture features, which gave the best classifier model. This result shows that the classifier based on color and texture features identifies soybean seeds with the highest accuracy and improves on the classification performance obtained with the color feature alone. The classification performance for each soybean class using the best classifier model is shown in Table 5.
The F-measures for the classification of normal seeds, purple seeds, green seeds, wrinkled seeds, and other seeds were 0.979, 1, 1, 0.981, and 1, respectively. Purple seeds and green seeds were classified with high accuracy because of their distinct colors. Other seeds were also classified with high accuracy because they consist of frogeye seeds, dried seeds, and damaged seeds, whose colors and textures differ from those of the other soybean classes. In contrast, normal seeds and wrinkled seeds have a similar yellow color, and their F-measures were slightly lower. Overall, the best classifier model classifies soybean seeds with an average accuracy of 99.2%.
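The precision, recall, and F1-measure used above can be computed per class from true and predicted labels as in the sketch below, which uses the standard definitions. The function name and the six toy labels are hypothetical; only the five per-class F-measures in the last lines are taken from the results reported above.

```python
def per_class_metrics(y_true, y_pred, labels):
    """Return {label: (precision, recall, f1)} using the standard definitions."""
    metrics = {}
    pairs = list(zip(y_true, y_pred))
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        metrics[label] = (precision, recall, f1)
    return metrics

# Toy example: one normal seed is mistaken for a wrinkled seed.
y_true = ["normal", "normal", "normal", "wrinkled", "wrinkled", "purple"]
y_pred = ["normal", "normal", "wrinkled", "wrinkled", "wrinkled", "purple"]
toy_metrics = per_class_metrics(y_true, y_pred, ["normal", "wrinkled", "purple"])

# Unweighted mean of the per-class F-measures reported in the text.
reported_f = [0.979, 1.0, 1.0, 0.981, 1.0]
mean_f = sum(reported_f) / len(reported_f)
```

The unweighted mean of the five reported F-measures is 0.992, which matches the overall figure of 99.2% quoted for the best classifier model.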

CONCLUSION
We presented a framework for soybean quality grading that has three main advantages. First, the use of background subtraction reduces the problem of shadows that appear in the captured image when the camera angle or the natural lighting conditions change, which improves the results of seed segmentation and cropping. Second, the color feature is extracted from the H component of the HSI model, which is robust to illumination changes. Third, texture features are added to address shape variance within each class of soybean seeds. The experimental results show that combining a color histogram of the H component of the HSI model with the GLCM statistics improves classification accuracy compared with using the color feature only. Future work will focus on designing a comprehensive classifier for various soybean seed types and on integrating the proposed method with a harvester machine.