Customized mask region-based convolutional neural networks for non-uniform shape text detection and text recognition

In scene images, text carries high-level information that helps to analyze and understand the particular environment. In this paper, we adapt the image mask and the original detection branch of the mask region-based convolutional neural network (R-CNN) to allow recognition at three semantic levels: sequence, holistic, and pixel level. In particular, holistic- and pixel-level semantics are used to recognize texts and define text shapes, respectively. In detection and masking, we segment and recognize both character and word instances, and we implement text detection through the outcome of instance segmentation on a 2-D feature space. To tackle small and blurry texts, we perform text recognition at the sequence level with an attention-based optical character recognition (OCR) model built on the mask R-CNN; the OCR module estimates the character sequence from the feature maps of word instances in a sequence-to-sequence manner. Finally, we propose a fine-grained learning technique that trains a more accurate and robust model by learning from datasets annotated at the word level. Our proposed approach is evaluated on the popular benchmark datasets ICDAR 2013 and ICDAR 2015.


INTRODUCTION
Text is one of the most expressive means of communication and can be embedded into documents or scenes to convey information. Text plays a crucial role in our daily lives, and reading it from videos and images is valuable for plentiful real-world applications such as scene text image retrieval and recognition [1], office automation, assistance for blind people, and geolocation, all of which rely on the convenient semantics text provides for understanding the world. Scene text reading offers a rapid, automatic way to access the textual data embodied in natural scenes, which is the most powerful representation given by scene text detection and recognition. Text spotting aims to localize and recognize text in natural images and has been studied in prior techniques [2]. Techniques following the traditional pipeline [3] treat text detection and recognition individually: a text detector is trained first, and its output is then fed into a text recognition model. This approach looks straightforward and simple but may lead to sub-optimal performance for both detection and recognition, since the two tasks are highly relevant and complementary to one another. On the one hand, recognition outcomes depend heavily on detected text; on the other hand, recognition outcomes are very helpful for eliminating false-positive (FP) detections. Lately, He et al. [4] integrated text detection and recognition with the help of an end-to-end trainable network that contains two models: a detection network for extracting text instances and a sequence-to-sequence (Seq2Seq) network for estimating sequential labels for every single text instance. The main performance improvements in text spotting are gained by such techniques, indicating that the detection and recognition models are complementary when trained in an end-to-end learning manner.
Text spotting is commonly performed in two phases: a detector is used to localize the text in an image, and text recognition is then applied to the detected regions. The main drawbacks of such techniques are the time cost and the ignorance of the correlation between text detection and recognition. Thus, many techniques [5] were introduced to combine horizontal and oriented text detection and recognition in an end-to-end manner. However, scene texts often appear in arbitrary shapes. Instead of describing text with rectangles or quadrangles, TextSnake [6] describes text with a series of local units that behave like a snake. That work focuses mainly on text detection in an image, whereas the mask text spotter detects and recognizes arbitrary text shapes by applying character segmentation on top of a text detector [7], so character-level annotations are needed at training time. This paper presents a unified technique for arbitrary shape text spotting (ASTS) that detects and recognizes arbitrary shapes by considering three semantic levels. As with the image mask text spotter and mask region-based convolutional neural networks (R-CNN) [8], we formulate text detection as an instance segmentation task. Unlike the image mask text spotter, we adapt the image mask and the original mask R-CNN detection to allow recognition at three levels: sequence, pixel, and holistic-level semantics. In particular, holistic- and pixel-level semantics are used to recognize texts and define text shapes, respectively. In detection and masking, we segment and recognize both character and word instances, and we implement text detection through the instance segmentation result on a 2-D feature space.
Also, to tackle small and blurry texts, we include text recognition with an attention-based optical character recognition (OCR) model built on the mask R-CNN to consider sequence-level semantics. The OCR module estimates the character sequence from the feature maps of the word instances in a Seq2Seq manner, which is implemented on a 1-D feature space. Lastly, we combine the two recognition outcomes based on the edit distance between the outcomes and a given lexicon. In our technique, texts can be analyzed at three semantic levels: the sequence level for the recognition mask, the holistic level for the detection mask, and the pixel level for the mask itself.
Finally, we propose a fine-grained learning technique that trains a more accurate and robust model by learning from datasets annotated at the word level. First, we train our model within a unified framework. Afterward, the trained model is applied to the word-level annotated images to discover character samples, and the word-level annotations are exploited to filter out false characters. Finally, the discovered character samples can be used as character-level annotations and combined with the word-level annotations to train a stronger model. The remainder of this paper is organized as follows: section 2 reviews prior work related to text spotting, detection, and recognition; section 3 presents the proposed method and analyzes each module.

LITERATURE SURVEY
Text detection in image
Text detection plays an eminent part in text spotting systems, and several techniques have been introduced to detect scene text [9]. Neumann and Matas [10] utilize edge boxes to create proposals and refine the candidate boxes with regression. By modifying the single shot detector (SSD) and R-CNN, Ren et al. [11] proposed to detect horizontal words. Recently, multi-oriented text detection has become an important topic.
Yao et al. [12] detected multi-oriented scene text with the help of semantic segmentation. Zhang et al. [13] introduced techniques that detect text segments and join them into text instances through link predictions and spatial relationships, whereas Tian et al. [14] regressed the text boxes from segmentation maps. Shi et al. [15] proposed to detect and link text segments to form text instances.

PROPOSED METHOD
The architecture of the proposed method
Figure 1 represents the architecture of our proposed method. First, the image is fed into a shared network that utilizes a backbone network to extract feature maps from the input image; these maps are shared by three subsequent modules through the region proposal network (RPN) and the region-of-interest alignment (RoI-align) operation. Afterward, text detection examines holistic-level semantics to classify and recognize the character instances, refining the rectified region proposals with a context refinement network. Meanwhile, the image text mask evaluates pixel-level semantics for the detected instances to conduct segmentation. The image text mask is then used to define the character masks, and the image text shape is applied for primary recognition. Since blurred and small characters are prone to failures, we feed the 2-D feature maps of the word instances into text recognition to discover sequence-level semantics for exact recognition. Lastly, the primary recognition outcome and the recognition outputs are combined based on the edit distance between the outcomes and a given lexicon.

Shared network
This network contains the region proposal network (RPN) and the backbone network. Image features are extracted by the shared backbone and passed to the subsequent networks. Feature pyramid networks (FPNs) use a top-down architecture to construct high-level feature maps at all scales from each input image at marginal cost. Hence, we utilize an FPN with ResNet as the backbone network.
The RPN is used to create regions of various sizes and aspect ratios for the image mask, the subsequent R-CNN, and the mask R-CNN. In our method, one RPN constructs the regions for all three subsequent modules. We allocate regions to different pyramid stages based on their sizes, and various aspect ratios are used in every single stage. RoI pooling quantizes floating-point coordinates onto the discrete feature map, which leads to a misalignment issue. We therefore apply the RoI-align operation: it computes image features at floating-point coordinates via bilinear interpolation, delivering accurate region alignment and beneficial features to the subsequent modules.
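The key ingredient of RoI-align is bilinear interpolation at floating-point coordinates. As a minimal illustrative sketch (a hypothetical helper, not the actual batched multi-channel implementation), the following samples a 2-D feature map at a fractional location:

```python
def bilinear_sample(fmap, y, x):
    """Sample a 2-D feature map (list of rows) at a floating-point (y, x).

    Assumes non-negative coordinates inside the map, as RoI-align does
    after mapping the RoI onto the feature map.
    """
    y0, x0 = int(y), int(x)                 # top-left integer neighbor
    y1 = min(y0 + 1, len(fmap) - 1)         # clamp at the border
    x1 = min(x0 + 1, len(fmap[0]) - 1)
    dy, dx = y - y0, x - x0                 # fractional offsets
    # Weighted average of the four surrounding grid values.
    return (fmap[y0][x0] * (1 - dy) * (1 - dx)
            + fmap[y0][x1] * (1 - dy) * dx
            + fmap[y1][x0] * dy * (1 - dx)
            + fmap[y1][x1] * dy * dx)

# Sampling the center of a 2x2 map averages all four values:
# bilinear_sample([[0.0, 1.0], [2.0, 3.0]], 0.5, 0.5) -> 1.5
```

In the full operation, several such samples are taken inside each output bin and averaged, which avoids the quantization error of RoI pooling.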

Text detection in image
Text detection aims to refine the region proposals constructed by the RPN by estimating their classes and performing bounding-box regression to adjust their coordinates. This process examines holistic-level semantics of the region proposals. It usually requires estimating two classes, text and non-text. Following prior work [27], our text detection recognizes both word and character instances. In addition to text and non-text, we use the character data to train the detection module, which helps it learn discriminative representations and improves detection performance. Based on the spatial relationship between character and word instances, the outcome of character detection can be used for text detection, which is equivalent to applying a character-based text detection technique while conducting text detection.
However, if proposals only partially overlap with a true word, existing refinement techniques may suffer refinement failures, since such proposals do not carry sufficient data for perceiving the holistic object. Furthermore, the OCR module is sensitive to the outcome of text localization. Therefore, we present a context refinement module to augment the existing refinement in text detection. In context refinement, the context data gathered from the surrounding regions is merged into a unified context representation to improve detection performance.
The text detection loss function is defined as (1):

$L_{det} = L_{cls}(p, c) + \lambda L_{reg}(v, v^{*})$ (1)

where the log loss for the true class $c$ is $L_{cls}(p, c) = -\log p_c$, and $p_c$ is the estimated confidence score for class $c$. $v$ is the tuple of true bounding-box regression targets, and $v^{*}$ is the estimated tuple. The text mask aims to conduct semantic segmentation and examine pixel-level semantics for character and word instances. The predicted word masks give more exact shapes and locations of a word than the detection module. Furthermore, the character masks are applied for recognition from a 2-D perspective. As represented in Figure 2, we implement a fully convolutional network (FCN) similar to the mask branch in mask R-CNN. Given an RoI constructed by the detection module, RoI-align extracts the corresponding fixed-size RoI features from the feature maps. We define the image text mask loss as the average binary cross-entropy loss, following mask R-CNN.
$L_{mask}$ can be calculated as (2):

$L_{mask} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log \sigma(x_i) + (1 - y_i)\log(1 - \sigma(x_i))\right]$ (2)

where $N$ is the number of pixels, $y_i$ is the pixel label ($y_i \in \{0, 1\}$), $x_i$ is the estimated output, and $\sigma(\cdot)$ is the sigmoid function.
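A direct reading of the average binary cross-entropy mask loss can be sketched in plain Python (a toy version over a flat list of per-pixel logits; real implementations operate on tensors):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mask_loss(logits, labels):
    """Average binary cross-entropy over N pixels.

    logits: raw per-pixel network outputs x_i
    labels: binary ground-truth pixel labels y_i in {0, 1}
    """
    assert len(logits) == len(labels)
    n = len(logits)
    total = 0.0
    for x, y in zip(logits, labels):
        s = sigmoid(x)
        total += y * math.log(s) + (1 - y) * math.log(1 - s)
    return -total / n

# A logit of 0 gives sigmoid 0.5, so the loss per pixel is log(2):
# mask_loss([0.0], [1]) -> 0.6931...
```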

Text recognition in image
Text recognition aims to construct an OCR module that discovers sequence-level semantics and estimates the character sequences from the feature maps of word instances. We treat text recognition as a Seq2Seq task. Moreover, attention-based methods have established effective modeling of dependencies regardless of distance in an input sequence; therefore, we develop an attention-based OCR module for text recognition. The text recognition architecture contains an FCN and the OCR module. As in the image mask branch, the FCN contains four convolutional layers and a deconvolution layer to scale up the feature maps.
We need to convert the input 2-D feature map into a feature sequence before estimating the character sequence, so we apply RoI-rotate before feeding the features into the OCR module. For each RoI, we obtain a rectangle with a rotation angle by computing the minimum-area rectangle of the mask produced by the image mask. Based on this rotation angle, RoI-rotate transforms the RoI of the feature map to horizontal. The feature maps are then scaled to a fixed height with the aspect ratio unchanged, using padding for the width. Although this inevitably distorts some image features, we observe minimal impact on performance since the same operation is used for both training and testing. A short, fixed-size feature map also reduces oscillation of the network during training and makes it converge faster. We repeated the experiment with an unconstrained aspect ratio, but the performance was poor.
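The fixed-height rescaling with a preserved aspect ratio and width padding can be sketched as a small planning helper (the function name and parameters are illustrative; resizing the actual feature map would use an interpolation op):

```python
def resize_plan(h, w, target_h, max_w):
    """Plan a fixed-height resize that preserves the aspect ratio.

    Returns (output height, scaled width, padding added on the right
    so every word instance ends up max_w wide).
    """
    scale = target_h / h                      # uniform scale factor
    new_w = min(int(round(w * scale)), max_w) # clip very long words
    pad = max_w - new_w                       # right padding to max_w
    return target_h, new_w, pad

# A 16x64 RoI scaled to height 8 keeps the 1:4 ratio (width 32)
# and is padded by 32 columns: resize_plan(16, 64, 8, 64) -> (8, 32, 32)
```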
Following the classical Seq2Seq model, our OCR module contains an encoder and a decoder. The encoder converts the feature map into a feature sequence (FS), and the decoder estimates the character sequence from the FS. Our encoder is similar to the feature-sequence extraction network in the convolutional recurrent neural network (CRNN), except that its input is the feature map rather than the input image used in CRNN.
The encoder translates the feature map into the FS via seven convolution layers and four max-pooling layers. Afterward, the FS is fed into a multilayer bidirectional long short-term memory (LSTM) network to capture long-range dependencies within the FS. Lastly, the encoder outputs the FS as context. Based on the Seq2Seq model, we build the decoder with an RNN that estimates the output character sequence, again using LSTM [28]. Let $y = \{y_0, y_1, y_2, \ldots, y_{T+1}\}$ be the ground truth (GT) for a word instance and $o = \{o_0, o_1, o_2, \ldots, o_{T+1}\}$ be the corresponding decoder output sequence. The recognition loss can be calculated as (3):

$L_{rec} = -\frac{1}{T}\sum_{t=1}^{T} \log p(y_t)$ (3)

where $T$ is the number of texts to be trained and $p(y_t)$ is the estimated output probability of $y_t$ at the $t$-th step.
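A toy version of the recognition loss, assuming the decoder's probability p(y_t) for each ground-truth character has already been computed:

```python
import math

def recognition_loss(probs):
    """Average negative log-likelihood of the ground-truth characters.

    probs: list of decoder probabilities p(y_t) assigned to the
    ground-truth character at each decoding step t = 1..T.
    """
    t = len(probs)
    return -sum(math.log(p) for p in probs) / t

# Perfect predictions give zero loss; a 50% guess costs log(2) per step:
# recognition_loss([1.0, 1.0]) -> 0.0
# recognition_loss([0.5]) -> 0.6931...
```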
We integrate the detection loss defined in (1), the image mask loss defined in (2), and the recognition loss defined in (3). The full multi-task loss function is computed as (4):

$L = L_{rpn} + L_{det} + L_{mask} + L_{rec}$ (4)

where $L_{rpn}$ is the RPN loss. The unified framework can detect and recognize texts at three semantic levels, namely sequence, holistic, and pixel-level semantics, and gives two recognition outcomes. One is text spotting through the image mask and text detection, implemented from the 2-D perspective; the other is the text recognition outcome from the 1-D perspective, which is robust to inaccurate localization in text detection. Therefore, to improve recognition, we select the word with the smallest edit distance to the lexicon as the final recognition outcome.

Fine-grained learning method
The proposed method achieves promising text spotting performance on the basis of weak supervision. We propose a fine-grained learning method that aims to train an accurate and robust text spotting model by acquiring fine-grained samples with a trained model. First, we train a model under weak supervision, i.e., with only word annotations; the proposed method can also be trained fully supervised with both character and word annotations. Each image in the word-level dataset $D$ has a weak annotation defined as a polygon set $G = \{g_1, g_2, \ldots, g_n\}$. Fine-grained learning aims to find character samples in the dataset $D$.
The proposed fine-grained learning is represented in Figure 3. First, we apply the trained model to the word-annotated dataset $D$, from which we obtain a candidate character sample set $R$.
where $c_i$, $b_i$, $w_i$, $s_i$, and $m_i$ denote the predicted category, bounding box, recognition outcome, confidence score, and image mask outcome of the $i$-th candidate character $r_i$. Positive character samples are then obtained using the word-level annotations and a confidence score threshold, where $C$ represents the set of character categories to be extracted, $T_s$ is the confidence score threshold used to select positive character samples, $r_i \cap g$ represents the intersection overlap of the candidate character with the word-level GT polygon $g$, and $T_o$ is the overlap threshold for choosing positive character samples. The confidence score threshold can be set low because of the constraints given by the word-level annotations, which also helps preserve the diversity of character samples. Lastly, the discovered positive character samples can be used as character annotations and combined with the word-level annotations to train a more accurate and robust text spotting model.
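The positive-sample selection can be sketched as follows. For simplicity, this hypothetical version approximates the word-level GT polygons by axis-aligned boxes and measures overlap as the fraction of the candidate's box inside a word box; the names and threshold values are illustrative:

```python
def intersection_ratio(box, word_box):
    """Fraction of box (x1, y1, x2, y2) covered by word_box."""
    ix = max(0.0, min(box[2], word_box[2]) - max(box[0], word_box[0]))
    iy = max(0.0, min(box[3], word_box[3]) - max(box[1], word_box[1]))
    area = (box[2] - box[0]) * (box[3] - box[1])
    return (ix * iy) / area if area > 0 else 0.0

def select_positive_chars(candidates, word_boxes,
                          score_thr=0.3, overlap_thr=0.8):
    """Keep candidates that are confident enough AND lie inside a word box.

    candidates: list of (category, box, confidence_score) tuples.
    The low score threshold is tolerable because the word-level
    constraint filters out false characters.
    """
    positives = []
    for cat, box, score in candidates:
        if score < score_thr:
            continue
        if any(intersection_ratio(box, wb) >= overlap_thr
               for wb in word_boxes):
            positives.append((cat, box, score))
    return positives
```

A candidate outside every word region is rejected even with a high score, which is the constraint that lets the confidence threshold stay low.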

RESULTS AND DISCUSSION
In this section, we discuss the results and analysis of our proposed approach, which is evaluated on the popular benchmark datasets ICDAR 2013 [29] and ICDAR 2015 [30]. The proposed model is compared with state-of-the-art methods. The experiments were performed on an Intel i5 7th-generation processor with 6 GB RAM, a 1 TB solid-state drive, and an NVIDIA GTX2080Ti GPU, using the PyTorch platform. The network is optimized by stochastic gradient descent; images are resized to 640 x 640 with a batch size of 12. The initial learning rate is set to 10^-3 and is reduced to 10^-4 after 100 epochs.
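The step learning-rate schedule described above (10^-3, dropped by a factor of 10 after 100 epochs) can be expressed as a simple function; the helper name and keyword defaults are our own:

```python
def learning_rate(epoch, base_lr=1e-3, drop_epoch=100, drop_factor=0.1):
    """Step schedule: base_lr for the first drop_epoch epochs, then
    base_lr * drop_factor afterwards."""
    return base_lr * drop_factor if epoch >= drop_epoch else base_lr

# learning_rate(0)   -> 0.001
# learning_rate(100) -> 0.0001
```

In PyTorch, the same behavior is usually obtained by wrapping the SGD optimizer in a step scheduler rather than setting the rate by hand.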

ICDAR dataset, detection, and recognition protocol
Two datasets are used: ICDAR 2013 and ICDAR 2015. ICDAR 2013 contains 229 training images (with 849 words) and 233 testing images (with 1,095 words). In this dataset, the texts are horizontal and annotated with word-level rectangles. ICDAR 2015 was released for the ICDAR 2015 challenge "Robust reading competition for incidental scene text detection"; it contains 1,000 training images (with 11,886 words) and 500 testing images (with 5,230 words), and the text areas are annotated by four quadrangle vertices. Here, we describe the evaluation protocols for text detection and recognition. Text detection is evaluated by the ICDAR protocol, and text recognition is generally evaluated using an end-to-end recognition procedure or word recognition accuracy.
The best possible match $m(r, R)$ for a rectangle $r$ against the rectangle set $R$ can be written as (7):

$m(r, R) = \max_{r' \in R} \frac{Area(r \cap r')}{Area(B(r, r'))}$ (7)

where the match between two text rectangles is computed as the intersection area divided by the area of the minimum bounding box $B(r, r')$ containing both rectangles. The evaluation parameters precision, recall, and F-score are then computed as (8):

$Precision = \frac{\sum_{r_z \in Z} m(r_z, Y)}{|Z|}$, $Recall = \frac{\sum_{r_y \in Y} m(r_y, Z)}{|Y|}$, $F = \frac{1}{\alpha / Precision + (1 - \alpha) / Recall}$ (8)

where $Z$ and $Y$ represent the set of estimated rectangles and the ground-truth set, respectively, and $\alpha$ is a weight parameter. Text recognition performance is measured by word recognition accuracy (WRA), which is simply the percentage of words recognized correctly, as in (11):

$WRA = \frac{\text{number of correctly recognized words}}{\text{total number of words}} \times 100\%$ (11)
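The match function and the derived precision, recall, and F-score can be sketched for axis-aligned rectangles (x1, y1, x2, y2). This is an illustrative reimplementation under that simplifying assumption, not the official evaluation script:

```python
def match_score(r, rects):
    """Best match m(r, R): intersection area over the area of the
    minimum axis-aligned rectangle containing both rectangles."""
    best = 0.0
    for r2 in rects:
        ix = max(0.0, min(r[2], r2[2]) - max(r[0], r2[0]))
        iy = max(0.0, min(r[3], r2[3]) - max(r[1], r2[1]))
        inter = ix * iy
        # minimum bounding rectangle enclosing both r and r2
        bx = max(r[2], r2[2]) - min(r[0], r2[0])
        by = max(r[3], r2[3]) - min(r[1], r2[1])
        if bx * by > 0:
            best = max(best, inter / (bx * by))
    return best

def precision_recall_f(estimated, ground_truth, alpha=0.5):
    """ICDAR-style precision, recall, and weighted harmonic F-score."""
    p = sum(match_score(r, ground_truth) for r in estimated) / len(estimated)
    r = sum(match_score(g, estimated) for g in ground_truth) / len(ground_truth)
    f = 1.0 / (alpha / p + (1 - alpha) / r) if p > 0 and r > 0 else 0.0
    return p, r, f
```

With alpha = 0.5 the F-score reduces to the usual harmonic mean of precision and recall.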

CONCLUSION
Scene text detection and recognition have received cumulative attention in various fields due to their potential. In this paper, we proposed an optimized detection and recognition approach: text detection in an image identifies character and word instances, and text recognition in an image operates on the segmented outcome of text detection. In addition, we introduced a fine-grained learning method to achieve optimized outcomes. For evaluation, the popular benchmark datasets ICDAR 2013 and ICDAR 2015 are considered, and the proposed model is compared with state-of-the-art methods based on recall, precision, F-score, and recognition performance. The proposed model performed better than the state-of-the-art techniques. In future work, we will apply our detection and recognition framework to video datasets.