Functional magnetic resonance imaging-based brain decoding with visual semantic model

Received Dec 7, 2019; Revised May 31, 2020; Accepted Jun 15, 2020

Brain activity patterns can be used to identify the object a person has in mind. Functional magnetic resonance imaging (fMRI) is the most widely used modality for this kind of brain decoding. However, the accuracy of fMRI-based brain decoders remains restricted by the limited number of training samples. Existing decoders work around this limitation by designing semantic features for the labels and training a model to predict those features for a particular label; it remains unclear, however, what kind of semantic features best explain the neural activity patterns. In this work, we propose a new computational model that learns decoding labels consistent with fMRI activity responses. Experiments demonstrate the success of the proposed label representation in terms of accuracy when decoded from brain activity patterns, compared with the conventional text-derived feature technique. In addition, we present a training model based on multi-task learning to reduce the problem of limited training data sets. The multi-task learning model is more efficient than state-of-the-art computational methods, and its decoding features can be obtained easily.


INTRODUCTION
Research on understanding brain function builds on the relationship between thought and stimulation of the brain: functional patterns in the brain can be evoked by stimuli such as sound, smell, light, or objects. Several medical analysis tools can acquire data from the brain, including computed tomography (CT), electroencephalography (EEG), and magnetoencephalography (MEG). Functional magnetic resonance imaging (fMRI) is a technique that makes it possible to distinguish between states of mind; fMRI images are generated from the blood-oxygen-level dependent (BOLD) response of brain cells. Most research on brain decoding has been done with fMRI, with applications such as the brain-computer interface (BCI), or neural-control interface (NCI), a direct communication pathway between the wired or enhanced brain and an external device [1]; lie detection, which exposes deliberate deception by evaluating the content of messages or accompanying identifiers [2]; and the assessment of brain injury or traumatic brain injury from externally measurable signals [3].
The 3-dimensional image obtained from fMRI consists of brain points called voxels (volume + pixel). A voxel reflects brain activity, indicating which areas of the brain are working; each voxel covers a small part of the brain containing on the order of a million brain cells. Brain activity is captured by identifying differently activated voxels. Early approaches to fMRI-based brain decoding emphasized the linear relationship between the mental state and the measurements through statistical tests. The general linear model (GLM) approach was proposed in [4]; however, it examines activated voxels in isolation. Multi-voxel pattern analysis (MVPA) [5] is a stronger technique that was proposed to detect patterns across fMRI voxels. Support vector machines (SVM) and the linear discriminant analysis (LDA) algorithm are generally used to train models on a subset of the fMRI data for decoding. Although machine learning algorithms have made MVPA effective, accuracy remains low due to the lack of training examples: decoding is confined to the classes represented in the training set, while classes absent from training cannot be decoded. This limitation of the trained model motivates the search for new methods that design features enabling the decoding of concepts outside the limited training set. Pioneering work [6] showed that brain activity patterns can be predicted using semantic relationships between nouns and 25 verbs. Palatucci et al. [7] investigated a 218-dimensional representation obtained from volunteers who answered 218 questions related to object categories, such as "Is it cold?", "Is it hot?", or "Can you walk on it?". Other studies have proposed mechanisms for the automatic extraction of meaning from language resources such as Wikipedia [8] and WordNet [9].
Existing research has used salient features extracted from text to decode the brain, which has been found to be an effective method. In this research, we present visual features associated with object classes, which increase accuracy. The method selects images related to objects of the same class that differ in characteristics such as image size, brightness, and orientation; we therefore consider mid- and high-level image properties for encoding many objects, and we also measure the performance of visual features such as color histograms and correlograms. Images and their tags are taken from the online image library ImageNet. We extract features from these images for a model that predicts the brain activity recorded in fMRI studies, based on multi-task learning together with deep learning for feature creation, including transfer learning with deep networks such as ResNet50 and VGG16. We use 150 images per concept from ImageNet to create the features, and we design a model that predicts brain activity from visual semantic features and fMRI data with multi-task learning over 60 concepts. The results of this research show the model's efficiency and predictive capability compared with the state-of-the-art. The main contributions are as follows:
- To decode the brain from fMRI using images: we explore salient features of images related to the target objects to improve accuracy and compare with previous research.
- To present models that decode brain activity using multi-task regression and to demonstrate their suitability for predicting the properties of objects absent from the training set.
The remainder of this document is structured as follows. Section 2 reviews the brain decoding problem, along with representation models for brain decoding. Section 3 presents the image features. The proposed model is described in Section 4.
Section 5 presents the experimental setup; the results and discussion appear in Section 6. Section 7 summarizes and concludes.

BRAIN DECODING

Understanding brain decoding
In this section, we describe the brain decoding problem using the 3D images obtained from the fMRI dataset. Let the fMRI data be a matrix X ∈ R^{n×d} whose i-th row x_i is the d-dimensional voxel vector of the i-th fMRI image, and let y be the label (or semantic feature value) to predict. In general, the model can be expressed as y = f(x) + ε, where ε is the estimation error and f is parameterized by a weight vector w. The difficulty is to find w that decreases the loss L(w) = Σ_i (y_i − f(x_i))^2, and the solution to this optimization problem can be defined as

w* = argmin_w L(w) + λ Ω(w),

where L(w) symbolizes the empirical loss of the task and Ω(w) regularizes the model. In the typical case, some possible objects are excluded from the training set, constraining the outputs available for decoding. Because it is hard to acquire an fMRI image for every possible object, past research could only decipher the classes seen in training. The solution proposed here addresses this problem for data never seen before and supports extension to new classes.
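As a minimal sketch of this formulation (with synthetic data and hypothetical dimensions of 60 training images and 500 voxels, and a squared-L2 choice for the regularizer Ω), the regularized least-squares weights can be computed in closed form:

```python
import numpy as np

# Synthetic stand-in for the fMRI design: n images, d voxels (both assumed).
rng = np.random.default_rng(0)
n, d = 60, 500
X = rng.standard_normal((n, d))                  # rows x_i: voxel vectors
w_true = rng.standard_normal(d)
y = X @ w_true + 0.1 * rng.standard_normal(n)    # y = f(x) + eps

# argmin_w ||y - Xw||^2 + lam * ||w||^2  (ridge form of L(w) + lam * Omega(w))
lam = 1.0
w_hat = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
y_pred = X @ w_hat
```

Any other regularizer Ω (such as the multi-task penalty introduced later) changes only the optimization, not this overall shape of the problem.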

Representation models of brain decoding
Brain decoding can be performed using semantic features to obtain the label set for new input data. The semantic features, typically obtained from behavioral data or text corpora, describe each concept so that the most likely concept can be selected for a given brain response.

IMAGE FEATURES
People naturally ask themselves questions such as "Is it cold?", "Is it hot?", "Can you walk on it?", or "How do you feel when viewing the image?", and are thus able to identify possible objects based on common visual features. In this section, we construct semantic features from images to describe the activities that occur in the brain, by examining various methods of describing the images related to a concept; these features are then used to build a model that predicts the type of activity in the brain.

Hierarchical visual features
We employ hierarchical feature learning with convolutional neural networks (CNNs), which are commonly used for object classification [11]. A CNN has a spatial structure consisting of layers of hidden units [12], including convolution, pooling, and fully connected layers. The convolution layer acts as a feature extractor from the given input image. A pooling layer performs down-sampling over the spatial area. Finally, the fully connected layer operates as a classifier that predicts the class of the input picture. The combination of these layers enables the network to discover a hierarchical representation of the input picture. Here, we aim at choosing an appropriate CNN to increase the accuracy of the brain decoding model by employing features from the fully connected layer. For this purpose, three advanced CNNs are used, VGG16, ResNet50, and Xception, to learn feature vectors for the images associated with each concept.
- VGG16 [13]: VGG16 is a popular CNN model. The network achieves 92.7% top-5 test accuracy on ImageNet for image classification. In this model, 3x3 convolutional layers are stacked on top of each other, followed by two fully connected layers with 4,096 nodes each, as shown in Figure 2 (the VGG16 architecture).

- ResNet50 [14]: ResNet won first place in ILSVRC 2015 for image classification; its skip connection is shown in Figure 3. The ResNet50 architecture consists of 50 deep convolutional layers. The network contains a total of 16 residual blocks, each built from a three-layer feed-forward stack.

Figure 3. Skip connection in ResNet50
- Xception [15]: Xception, by Google, stands for the extreme version of Inception. With its modified depthwise separable convolutions, it outperforms Inception-v3 [16] (also by Google, first runner-up in ILSVRC 2015) on both the ImageNet ILSVRC and JFT datasets. The Xception architecture has 36 convolutional layers forming the feature extraction base of the network, followed by fully connected layers before the logistic regression layer. The 36 convolutional layers are structured into 14 modules, all of which have linear residual connections around them, except for the first and last modules. Figure 4 demonstrates the Xception architecture. To extract image features, we use VGG16, ResNet50, and Xception directly with weights pretrained on a larger dataset, i.e., ImageNet, adding a new layer before the output (softmax) layer that acts as a feature extractor for the input picture. The new part of the network is retrained on a fresh set of pictures while the earlier layers are frozen.
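The concept-feature construction described above can be sketched as follows. Here `cnn_features` is a hypothetical placeholder for a forward pass through a frozen pretrained network (it returns random activations in this sketch), the 4,096-dimensional output matches VGG16's fully connected layer, and the 150-image count follows the setup described in the introduction:

```python
import numpy as np

# Sketch: each of a concept's images is passed through a frozen, pretrained
# CNN and the activations are averaged into one concept vector.
def cnn_features(image_name, dim=4096):
    # Placeholder for the real forward pass; deterministic random activations.
    seed = abs(hash(image_name)) % (2**32)
    return np.random.default_rng(seed).standard_normal(dim)

images_for_concept = [f"dog_{i}.jpg" for i in range(150)]   # hypothetical file names
concept_vector = np.mean([cnn_features(im) for im in images_for_concept], axis=0)
```

In a real pipeline, `cnn_features` would read the image and return the fully connected layer's activations from VGG16, ResNet50, or Xception.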

Low-level visual descriptors
There are many prevalent methods for creating an image representation. We use low-level descriptors to capture general recognition features (such as color, texture, structure, and edges). These basic features can be extracted directly and easily from the image. In this analysis, we work with the following visual properties:

Simple color histogram (CH) [17]
A color histogram is a low-level feature that records the distribution of the number of pixels of the image falling into each bin. We use a color histogram with 64 bins covering parts of the color spectrum: starting from the default RGB color values, we quantize each RGB component into four levels, which is the simplest scheme, giving 4x4x4 = 64 bins.
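A minimal sketch of this 64-bin histogram (4 quantization levels per RGB channel, on a synthetic image):

```python
import numpy as np

def color_histogram(img):
    """64-bin RGB histogram: each channel quantized into 4 levels (4x4x4)."""
    q = img // 64                                        # uint8 values -> levels 0..3
    idx = q[..., 0] * 16 + q[..., 1] * 4 + q[..., 2]     # joint bin index 0..63
    hist = np.bincount(idx.ravel(), minlength=64).astype(float)
    return hist / hist.sum()                             # normalize to a distribution

img = np.random.default_rng(1).integers(0, 256, (32, 32, 3), dtype=np.uint8)
h = color_histogram(img)
```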

Edge histogram (EH) [18]
An edge histogram encodes the spatial distribution of edge directions. Specifically, the edge directions are quantized into 72 bins, each covering a 5-degree interval. In this article, we use the Canny filter for edge detection and the Sobel operator to measure the direction from the gradient at every edge point. The resulting semantic feature has 72 dimensions.
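A simplified sketch of the idea (central-difference gradients stand in for the Sobel operator, and a magnitude threshold stands in for Canny edge selection; 72 bins of 5 degrees cover 0-360):

```python
import numpy as np

def edge_histogram(gray):
    """72-bin edge-orientation histogram over strong-gradient pixels."""
    gx = np.zeros_like(gray); gy = np.zeros_like(gray)
    gx[:, 1:-1] = gray[:, 2:] - gray[:, :-2]      # horizontal gradient
    gy[1:-1, :] = gray[2:, :] - gray[:-2, :]      # vertical gradient
    mag = np.hypot(gx, gy)
    ang = np.degrees(np.arctan2(gy, gx)) % 360    # orientation in [0, 360)
    strong = mag > mag.mean()                     # crude stand-in for Canny edges
    hist = np.bincount((ang[strong] // 5).astype(int), minlength=72)[:72]
    return hist / max(hist.sum(), 1)

gray = np.random.default_rng(2).random((64, 64))
h = edge_histogram(gray)
```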

Color correlogram (CORR) [19]
A color correlogram encodes the spatial relationships of colors. It is a three-dimensional histogram indexed by the colors of a pair of pixels and their spatial distance. The color autocorrelogram is defined as

γ_c^(k)(I) = Pr_{p1 ∈ I_c, p2 ∈ I} [ p2 ∈ I_c | |p1 − p2| = k ],

where |p1 − p2| is the distance between pixels p1 and p2, D is the number of distance intervals, k ∈ {1, 2, …, D}, m is the number of color bins, and c ∈ {1, 2, …, m}. We quantize the RGB values into 36 bins and set the distance metric to 4 odd intervals, k ∈ {1, 3, 5, 7}. Thus, the color correlogram has 144 dimensions (36x4).
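A toy sketch of the autocorrelogram on an already color-quantized image (for brevity, only horizontal pixel offsets are used as the distance, rather than a full 2-D neighborhood):

```python
import numpy as np

def autocorrelogram(q, n_colors=36, dists=(1, 3, 5, 7)):
    """For each color c and distance k, estimate the probability that a pixel
    k columns away from a c-colored pixel is also colored c."""
    feats = np.zeros((n_colors, len(dists)))
    for j, k in enumerate(dists):
        a, b = q[:, :-k], q[:, k:]            # pixel pairs k columns apart
        for c in range(n_colors):
            mask = a == c
            if mask.any():
                feats[c, j] = (b[mask] == c).mean()
    return feats.ravel()                       # 36 x 4 = 144 dimensions

q = np.random.default_rng(3).integers(0, 36, (32, 32))   # quantized color indices
feat = autocorrelogram(q)
```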

Scale-invariant feature transform (SIFT) [20]
SIFT is one of the most useful object identification algorithms in computer vision (CV) and has been used extensively. For any object in a picture, SIFT key points can be extracted, and a specific descriptor is computed for the small image region around each key point on the object [21]. Because the SIFT method produces a large number of descriptors per image, we use the bag-of-features scheme. For each picture, we first compute the SIFT descriptors over the local regions around the key points. We then quantize the SIFT descriptors to build a visual vocabulary using k-means clustering. In this analysis, we created 500 clusters, so the visual feature representing an image has 500 dimensions.

Wavelet transform (WT) [22]
An image representation can also rely on texture analysis of the image, where the wavelet transform detects the image texture effectively. Wavelet transforms are applied to pictures through repeated filtering and sub-sampling. At each level, the picture is divided into 4 sub-bands: LL, LH, HL, and HH, where L stands for low frequencies and H for high frequencies. There are two types of wavelet transform: the pyramid-structured wavelet transform (PWT) and the tree-structured wavelet transform (TWT). PWT decomposes the LL band repeatedly, while TWT also decomposes the other bands to preserve relevant information that appears in the middle frequency channels. After the decomposition, the feature vector is extracted using the mean and standard deviation of the energy distribution of each sub-band at each level. In this study, we used three levels of decomposition to create the feature vector of each object image. PWT yields a 24-dimensional feature (3x4x2), while TWT yields a visual semantic feature of size 104 (52x2); the total wavelet feature therefore has 128 dimensions.
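The bag-of-features quantization step described for SIFT can be sketched as follows (random vectors stand in for both the 128-D SIFT descriptors and the 500 k-means centroids, which would normally come from clustering descriptors across the image collection):

```python
import numpy as np

rng = np.random.default_rng(4)
vocab = rng.standard_normal((500, 128))   # stand-in for k-means visual vocabulary
desc = rng.standard_normal((200, 128))    # stand-in for one image's SIFT descriptors

# Squared distances via the expansion ||a-b||^2 = ||a||^2 - 2 a.b + ||b||^2
d2 = (desc**2).sum(1)[:, None] - 2 * desc @ vocab.T + (vocab**2).sum(1)[None, :]
words = d2.argmin(axis=1)                 # nearest visual word per descriptor
bow = np.bincount(words, minlength=500) / len(desc)   # 500-D image representation
```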

MULTITASK VISUAL SPACE LEARNING
We introduce multi-task Lasso learning (MTL) [23, 24], in which many learning tasks are optimized at the same time. This method produces accurate predictions and is highly effective. In classification, MTL aims to increase the efficiency of multiple classification tasks through cooperative learning; a spam filter, for instance, can be cast as a set of related classification tasks. The motivation here is that learning a native fMRI representation for each class separately may not be sufficient, because the training examples are small in number and hard to collect. The aim is instead to learn common fMRI representations of properties that are shared across many classes. A significant part of the learning problem is finding linear transformations of the fMRI image data, formulated as follows. Consider a set of T separate tasks, where task t learns a model for estimating the t-th semantic feature value.
Here x_i ∈ R^d denotes the i-th training fMRI image composed of d voxels, and y_i^t its t-th feature value. The training fMRI data matrix is X = (x_1, …, x_n)^T, and task t learns a weight vector w_t ∈ R^d, where d is the number of voxels. We presume that the data are normalized so that constant terms can be dropped, i.e., X and y^t have mean 0 and ‖x_i‖_2 = 1, where ‖·‖_2 is the Euclidean L2 norm. Let w^j = (w_j1, …, w_jT) be the vector of all coefficients for the j-th voxel across the tasks. To achieve a compact and discriminative representation, the multi-task Lasso is formulated as the solution to the optimization problem

min_W Σ_{t=1}^T ‖y^t − X w_t‖_2^2 + λ Σ_{j=1}^d ‖w^j‖_∞,

where ‖w^j‖_∞ = max_t |w_jt| is the sup-norm. This penalty has the effect of "grouping" the coefficients of each voxel so that they reach zero simultaneously across tasks. After training the models, we still have to define a decision rule that selects the most likely class for a given fMRI image. For a given fMRI image x, the predicted feature values f_t(x) are obtained using the learned coefficients, where the mapping f_t reconstructs the t-th feature value from x. Finally, Pearson's correlation coefficient [25] is used to measure the association between the predicted feature vector of x and the feature vector of each target class C; the class with the highest correlation is the decoded concept.
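A sketch of this decision rule follows, with synthetic data: a random weight matrix stands in for the trained multi-task coefficients, and the two candidate concepts ("dog", "hammer") and the 218-task dimension are hypothetical. One candidate is deliberately constructed to be close to the prediction so the selection is visible:

```python
import numpy as np

rng = np.random.default_rng(5)
d, T = 500, 218                     # voxels, semantic feature tasks (assumed)
W = rng.standard_normal((d, T))     # stand-in for trained multi-task coefficients
x = rng.standard_normal(d)          # test fMRI voxel vector
pred = x @ W                        # predicted feature value per task

def pearson(a, b):
    a, b = a - a.mean(), b - b.mean()
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

candidates = {
    "dog": rng.standard_normal(T),                    # unrelated feature vector
    "hammer": pred + 0.1 * rng.standard_normal(T),    # near the prediction
}
best = max(candidates, key=lambda c: pearson(pred, candidates[c]))
```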

EXPERIMENTAL SETUP

Datasets
In this research, we use 3D fMRI data from Carnegie Mellon University, collected from 9 right-handed adult volunteers. Each volunteer viewed 60 line drawings with their noun labels, six times each, giving 6x60 = 360 fMRI images per volunteer. Each fMRI image is converted into a voxel vector with approximately 20,000 voxels; the vector is then reduced to 500 voxels using the searchlight method [24]. As in [6] and [7], the fMRI data are thus reduced to 500 features.

Text-based features
The performance of the proposed model is compared with the state-of-the-art text-based semantic features, described as follows:
- Verb25 [6] represents each concept (a noun) by its co-occurrence vector with 25 verbs such as "run", "push", and "eat". These common verbs often accompany nouns in English sentences and were selected by linguists. In our investigation, Verb25 serves as a practical baseline on this dataset.
- Human218 [7] represents each concept with 218 attributes. The feature vector is obtained from 218 questions, such as "Is it cold?", "Is it hot?", "Can you walk on it?", etc. Linguists created these questions, and the answers were collected through crowdsourcing and averaged per concept.

Image-based features
In this research, we selected visual properties from pictures for the design of the experiments, using image features from low to high levels. The color histogram (CH), color correlogram (CORR), edge histogram (EH), wavelet transform (WT), and BoW+SIFT features follow research on the NUS-WIDE [26] database; NUS-WIDE holds 269,648 pictures compiled from the Flickr online photo database. To adapt these features to the fMRI image concepts, we also use features from the CNNs, namely VGG16, ResNet50, and Xception, and average the properties over all images of a concept to create a feature vector describing that concept.

RESULTS AND DISCUSSION
This section compares the efficiency of the features obtained from text and from images using the fMRI images of each of the nine volunteers: of the 360 images covering 60 concepts, 300 images are used for training and 60 for testing, to measure the effectiveness of the models and of the features extracted with the different methods. Table 1 shows the concept prediction accuracy (%) of the LR model for the 9 fMRI volunteers and the proposed feature sets (i.e., WT, EDH, ResNet50, and Xception). It can be observed that the proposed visual features significantly enhance the performance of brain decoding; moreover, the best models are obtained using features extracted with VGG16, ResNet50, and Xception. The multi-task learning (MTL) method is also compared with a linear regression (LR) method. As shown in Table 1 and Table 2, all the MTL models significantly outperform the LR models; these results emphasize the benefit of the MTL approach for improving generalization performance. The leave-two-out cross-validation technique [6, 7] was employed to evaluate the effectiveness of the visual features for decoding novel (unseen) concepts: the model is trained on all the fMRI images of 58 concepts and tested on the fMRI images of the remaining 2 concepts. The comparison results for predicting the 2 unseen concepts are given in Table 2. As recorded in Table 2, both VGG16 and ResNet50 significantly outperform Verb25 and Human218, and ResNet50 using 106 voxels of the whole brain shows the highest accuracy. Figure 5 demonstrates that the MTL model exceeds the LR model for decoding V1 and V2 over all the semantic features. As a result, high-level visual features identify brain activity patterns more accurately.
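A minimal sketch of the leave-two-out scoring idea (synthetic predicted and true feature vectors; cosine similarity stands in for the correlation-based matching): for every held-out pair, the decoder counts as correct when matching each prediction to its own concept's feature vector scores higher than the swapped matching.

```python
import numpy as np
from itertools import combinations

def cos(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def leave_two_out_accuracy(preds, feats):
    """preds/feats: dicts concept -> feature vector; score pairwise matchings."""
    hits, total = 0, 0
    for c1, c2 in combinations(preds, 2):
        correct = cos(preds[c1], feats[c1]) + cos(preds[c2], feats[c2])
        swapped = cos(preds[c1], feats[c2]) + cos(preds[c2], feats[c1])
        hits += correct > swapped
        total += 1
    return hits / total

rng = np.random.default_rng(6)
feats = {f"c{i}": rng.standard_normal(50) for i in range(10)}      # true vectors
preds = {c: v + 0.3 * rng.standard_normal(50) for c, v in feats.items()}  # noisy predictions
acc = leave_two_out_accuracy(preds, feats)
```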

CONCLUSION
In this paper, we have proposed a novel fMRI brain decoding model using image-derived features. Many candidate visual characteristics were examined, from low- to high-level features, to make the fMRI decoding model precise. The study also presented a compact and discriminative pattern-learning scheme for the relationship between voxel activation patterns and image properties to decode concepts. The results show the effectiveness of our model and its advantage over state-of-the-art fMRI brain decoding.