ATLAS: Adaptive Text Localization Algorithm in High Color Similarity Background

One of the major problems in the text localization process is color similarity between the text and the background image. The limitation of localization algorithms under high color similarity is highlighted in several research papers. Hence, this research focuses on improving text localizing capability on images with high text-background color similarity by introducing an adaptive text localization algorithm (ATLAS). ATLAS is an edge-based text localization algorithm that consists of two parts. The Text-Background Similarity Index (TBSI), the first part of ATLAS, measures the similarity index of every text region, while the second, Multi Adaptive Threshold (MAT), calculates multiple adaptive thresholds using size filtration and degree deviation to locate possible text regions. In this research, ATLAS is verified and compared with other localization techniques based on two parameters: localizing strength and precision. The experiment has been implemented and verified using two datasets: a generated text color spectrum dataset and the International Conference on Document Analysis and Recognition (ICDAR) dataset. The results show that ATLAS achieves significant improvement in localizing strength and slight improvement in precision compared with other localization algorithms on images with high text-background color similarity.


Introduction
Text in an image exists in two forms: caption text or scene text [1]. Both forms are important sources for describing the semantic content of images [2], for example in geo-location applications, in obtaining object information, and in indexing, categorizing and searching processes [3]. Text extraction is an important research area [4], which comprises three stages [5]: text localization, text segmentation and text recognition. Text localization locates the positions of the various texts in the image, while text segmentation separates text pixels from background pixels. The text pixels are further converted to soft (machine-readable) text in the final stage, text recognition. Text localization, as the main element in the text extraction framework, is the first process and affects the overall accuracy of the text extraction result. It has been taken seriously by researchers in order to produce high-accuracy text localization algorithms that can localize text in various image conditions, whether caption or scene text. Caption text (also known as graphic text, superimposed text or artificial text) is text that is post-added or created through image editing tools. This type of text is commonly seen in advertising and informative images such as blog headers, brochures and logos. On the other hand, scene text is original text in the image. It can be seen in most natural images captured by digital devices. Figure 1 shows examples of caption text and scene text.
Both caption text and scene text have three important properties that can affect the text localization result [6]: text geometry, text color and text effect. Text geometry refers to the relative shape and position of the text, which includes sizes, fonts, alignments, directions and distances between characters. Text color simply refers to the integer value in each color channel for each text pixel. Finally, text effect refers to additional ornament or decoration on the text, for example shadowed, sharpened or blurred effects. The combination of these properties creates unpredictable text models and uncertain image backgrounds, which produce a very challenging environment for text localization.
Among these three properties, text color has a simple root cause but requires complex techniques to handle. The most common situation is the existence of text with a color almost similar to its background. Text localization algorithms generate high false positive errors when locating text with high color similarity. The main reason for this localization error is the small difference between the color values of text and background, which prevents most algorithms from distinguishing the color spaces. Hence, it is very challenging to localize text, especially text with a similar background color. Yet, these types of images are quite common in the real environment, so there is a need for an algorithm that overcomes the color similarity issue. Section 2 describes related work on text localization algorithms. The details of ATLAS are explained in Section 3, followed by the experimental process, the experimental results and concluding remarks in Sections 4 and 5.

Related Works
Text localization algorithms can be categorized into three categories: connected-component (CC) based, texture-based and edge-based algorithms.
CC-based algorithms analyze the color value of every image pixel and group nearby pixels of similar color to form regions that are used to differentiate between text regions and background regions [7][8][9]. On the other hand, texture-based algorithms employ machine learning techniques to analyze unique patterns that appear in the text regions of the image. The algorithm examines a specified color distribution, either in the spatial domain or in the frequency domain, that matches the features of ground truth text regions [10][11][12]. Edge-based algorithms implement a different strategy: instead of looking for regions of similar color, they detect sudden changes of color value in an image region and define the locations of sharp change as edges. The edges act as barriers that separate text regions from background regions. A pre-determined threshold is used as the minimum value for evaluating sharp changes in pixel color; any change above the threshold is identified as an edge [13][14].
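As an illustration of the fixed-threshold idea, the following is a minimal sketch in Python with NumPy (a plain Sobel gradient rather than the full Canny pipeline, and all names are ours). Pixels whose gradient magnitude exceeds the pre-determined threshold are marked as edges, so a text-background pair whose color difference is too small produces no edges at all:

```python
import numpy as np

def gradient_edges(image, threshold):
    """Mark pixels whose Sobel gradient magnitude exceeds a fixed threshold.
    A simplified stand-in for Canny: no smoothing, thinning or hysteresis."""
    img = image.astype(float)
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)  # horizontal gradient
    ky = kx.T                                                          # vertical gradient
    h, w = img.shape
    mag = np.zeros_like(img)
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            patch = img[y - 1:y + 2, x - 1:x + 2]
            mag[y, x] = np.hypot(np.sum(patch * kx), np.sum(patch * ky))
    return mag > threshold

# Strong contrast: a square of value 200 on a background of 50 is found.
low_sim = np.full((20, 20), 50, dtype=np.uint8)
low_sim[5:15, 5:15] = 200

# High color similarity: the same square at value 60 yields gradient
# magnitudes below the threshold, so the detector misses it entirely.
high_sim = np.full((20, 20), 50, dtype=np.uint8)
high_sim[5:15, 5:15] = 60
```

The second case is exactly the failure mode discussed in this paper: the edge barrier between text and background never forms when the color difference falls under the fixed threshold.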
Looking for the edges requires an edge detector algorithm. Several edge detector algorithms have been introduced, including the Sobel [15], Roberts [16], Laplacian [17], Genetic-Ant Colony [18] and Canny [19] edge detectors. It has been recognized [20][21][22] that the Canny edge detector produces higher accuracy and a better edge image, granted by its edge thinning algorithm and heuristic threshold, compared with the others. However, the original purpose of the Canny edge detector is to extract object features from images. Implementing it in a text localization algorithm requires additional filtration and enhancement processes to locate the correct text edges and eliminate unnecessary ones. Enhancements to Canny edge based text localization, as highlighted in [23][24][25], increase the accuracy of localizing text in images.
Liu and Wang [23] proposed stroke-like edge detection based on contours to remove noise edges from complex images. They then locate the text regions based on the distribution of edges and corners. Finally, they perform segmentation on the text regions and identify the text pixels by looking for the largest frequency bin in the color histogram. Yi and Tian [24] introduced an edge-based solution for text localization with three steps: first, cluster the edge boundaries based on bigram color uniformity; second, segment strokes by assigning the mean color-pair in each boundary layer; and third, use Gabor-based text features to determine the correct text regions among the candidates. Lee and Kim [25] proposed an efficient edge-based text localization algorithm using a two-stage conditional random field, which utilizes both an edge map and a saliency map to find the optimal configuration of text regions. The limitations of these algorithms [23][24][25] are mainly due to low image resolution, multiple colors in the text and complex backgrounds, which can be generalized as the problem of color similarity between text pixels and background pixels. For the mentioned edge-based algorithms, the core function is the Canny edge detector, which discovers the edges inside the image. The detected edges are then filtered and enhanced so that what remains are the edges of the text. However, a certain amount of difference in color value between text pixels and background pixels is required before the text can be identified and located by the algorithm. Hence, the smaller the difference between color values (or the higher the color similarity), the harder it is for the localization algorithm to locate the position of the text. Localization is likely to fail when it attempts to deal with texts that have high color similarity with their background, such as engraved text, text on a complex background and text under light exposure.
Some example images are shown in Figure 2. Implementation of an adaptive threshold can solve this problem. Rong et al. [26] introduced an adaptive threshold algorithm based on the mean and standard deviation of the image gradient. In another example, Li et al. [27] applied the Mean Shift algorithm to the Canny edge detector to extract weak objects. However, both algorithms focus on enhancing the Canny edge detector for general purposes rather than specifically for text localization. There are also other enhancement works on the detection of weak edges in other fields, for example medical images [28] and radar images [29]. Existing text localization algorithms that use an adaptive threshold on the edge detector are relatively few. The most related work was done by Hsia and Ho [30], but they focus on localizing text in video scenes using the Roberts edge detector instead of the Canny edge detector.
Given that text images contain texts with different positions and different color similarities, a single adaptive threshold is limited. Thus, a multi-region adaptive threshold is needed to ensure that all texts with different color similarities in an image are localized. The advantage of a multi-region adaptive threshold is that it allows a low adaptive threshold value to be applied only on the regions with high color similarity instead of the entire image. In some cases, light exposure affects only a small region of the image; if the adaptive threshold takes the entire image into consideration (i.e., the mean of the entire image), the small region with very high color similarity will be neutralized by the mean value, and the algorithm will omit the region with light exposure. To form the regions with different color similarity, candidate text regions are first formed, and then the possible regions of omitted characters are estimated. A suitable threshold value is calculated based on the similarity index in each region to further extract the missing edges and complete the text localization process. In summary, this paper focuses on enhancing the Canny edge detector to solve the color similarity issue in text localization. A new multi-region adaptive threshold is proposed for the Canny edge detector, with the purpose of localizing text that has high color similarity with its background.

Proposed Algorithm
The proposed algorithm intends to solve the aforementioned challenges by implementing a multi-region adaptive threshold over various types of text similarity. In order to deal with the uncertainty of color differences between text pixels and background pixels, a measurement of the similarity index is needed. Based on previous research, there is so far no standard measurement available to define color similarity. Hence, this paper first proposes a measurement method for the color similarity index, and then proposes a multi-region adaptive threshold for the Canny edge detector that refers to this similarity index. We name it the Adaptive Text Localizing Algorithm, or ATLAS. The algorithm is summarized in Table 1, Table 2 and Figure 3.

Text-Background Similarity Index
To quantify the color similarity between text pixels and background pixels, a measurement index called the Text-Background Similarity Index (TBSI) is introduced, defined as the degree of likeness between the pixel values of text and background in an image region.
To calculate the TBSI of an image, it is first converted into a grayscale image $I$; the conversion to grayscale is intended to simplify the calculation. Each ground truth text region is manually marked up with a rectangular box. Let $G_i$ be the $i$-th marked ground truth region and $N$ the total number of text regions in $I$. TBSI estimation depends on the ground truth regions; however, if the ground truth regions are unknown, they can be replaced with regions of interest. Next, for each region $G_i$, Otsu's binarization algorithm [31] is applied to obtain an approximate segmentation between text pixels and background pixels, and the color difference $D(G_i)$ is taken as the difference between the mean gray value of the text pixels and that of the background pixels. The average color difference between text and background, $D(G)$, for the image $I$ is determined by averaging the differences over all regions:

$$D(G) = \frac{1}{N}\sum_{i=1}^{N} D(G_i)$$

The purpose of calculating the average color difference between text and background is to express it as a proportion of the maximum possible difference. Finally, the TBSI for the image can be computed using the formula:

$$TBSI = 1 - \frac{D(G)}{D_{max}}$$

where $D_{max}$ refers to the maximum possible difference between text and background pixel values. In this research, grayscale images are used, hence $D_{max} = 255$ (the maximum difference scenario is two pixels with values 0, black, and 255, white). TBSI ranges between zero and one, where a higher value represents higher similarity between text pixels and background pixels. It is considered invalid when $TBSI = 1$, i.e. $D(G) = 0$, as this represents no difference between text pixels and background pixels; this situation occurs when there is only one color in the ground truth text region.
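A minimal sketch of the TBSI computation in Python with NumPy, under the description above (the Otsu implementation, region format and function names are ours; text/background assignment inside each box simply follows the Otsu split):

```python
import numpy as np

def otsu_threshold(gray):
    """Otsu's method: gray level maximizing the between-class variance."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    total = hist.sum()
    cum = np.cumsum(hist)
    cum_mean = np.cumsum(hist * np.arange(256))
    best_t, best_var = 0, -1.0
    for t in range(256):
        w0 = cum[t] / total
        w1 = 1.0 - w0
        if w0 == 0 or w1 == 0:
            continue
        mu0 = cum_mean[t] / cum[t]
        mu1 = (cum_mean[-1] - cum_mean[t]) / (total - cum[t])
        var = w0 * w1 * (mu0 - mu1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t

def tbsi(gray, regions, d_max=255.0):
    """TBSI = 1 - D(G)/d_max, where D(G) averages, over the ground-truth
    boxes (y0, y1, x0, x1), the difference of mean gray values between the
    Otsu foreground and background. Returns None when D(G) = 0 (invalid)."""
    diffs = []
    for (y0, y1, x0, x1) in regions:
        roi = gray[y0:y1, x0:x1]
        t = otsu_threshold(roi)
        fg, bg = roi[roi > t], roi[roi <= t]
        if fg.size == 0 or bg.size == 0:
            continue
        diffs.append(abs(fg.mean() - bg.mean()))
    d_g = sum(diffs) / len(diffs)
    return None if d_g == 0 else 1.0 - d_g / d_max

# Near-similar text (120) on background (100): high similarity index.
img = np.full((30, 30), 100, dtype=np.uint8)
img[10:20, 10:20] = 120
similarity = tbsi(img, [(5, 25, 5, 25)])  # 1 - 20/255, about 0.92
```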

Multi Adaptive Threshold
TBSI measures the degree of likeness between text pixels and background pixels. Hence, it is suitable to use it to derive adaptive thresholds for the Canny edge detector, applying a low threshold value on high-TBSI images and a high threshold value on low-TBSI images. Different from other approaches, this research implements multi adaptive thresholds on each of the possible text regions in an image to ensure that the Canny edge detector does not omit any possible text region with high color similarity. Before calculating the threshold values, a simple analysis needs to be done to find the possible text regions in the image.
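The similarity-to-threshold idea can be sketched as follows; the linear scaling, the base values and the names below are our illustrative assumptions, not the paper's exact rule:

```python
def adaptive_canny_thresholds(tbsi_value, t_low_base=100.0, high_ratio=2.0):
    """Scale a base lower threshold by (1 - TBSI): a region with high
    similarity (TBSI near 1) gets a low threshold so its weak edges
    survive, while a low-similarity region keeps a high threshold.
    The linear form and the constants are illustrative assumptions."""
    t_low = t_low_base * (1.0 - tbsi_value)
    t_high = high_ratio * t_low  # Canny's upper hysteresis threshold
    return t_low, t_high

# A high-similarity region receives much lower thresholds than a
# low-similarity one, so faint text edges are not discarded.
weak_region = adaptive_canny_thresholds(0.9)
strong_region = adaptive_canny_thresholds(0.1)
```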
The proposed algorithm begins with an image $I$, and the Canny edge detector is applied to obtain an initial binary edge image using the overall TBSI of $I$. Since the ground truth regions and text candidates are unknown at the initial stage, the overall TBSI is simply calculated by taking the full input image as the region of interest. Next, the edge pixels are divided into several groups $e_i$, and each group is enclosed by the region of minimum surface area covering its edge pixels. Any region which is too small or too big, or which has an imbalanced width-to-height ratio, is eliminated. Figure 4 represents the step-by-step workflow of ATLAS up to obtaining the initial edge pixel groups. As shown in Figure 4, light exposure in the original image increases the color similarity between the text and the background. Hence, after applying the Canny edge detector, the edges inside the region with high color similarity are missing once filtration is done. To solve this problem, ATLAS utilizes the remaining edge information in the image and forms edge pixel groups to predict the position of the missing region. Since broken edges often exist in the result of the Canny edge detector, many small regions are created, which increases the complexity. Hence, any regions that overlap each other are merged to form a new region by re-adjusting the minimum and maximum coordinates. Figure 5(b) shows the result after the first merging. After the merging step, each of the leftover regions is assumed to contain either a single character or a word. Therefore, regions which are close to each other are assumed to belong to the same group (either text or noise). For each region, every nearby region is located by searching the surrounding area within a distance of $J$ horizontally and $K$ vertically. Two regions are then merged again if they are both inside each other's searching zone. Figure 5(c) shows the result of the second merging.
$J$ and $K$ denote the average width and the average height of a character respectively, and they can be obtained by averaging the widths $W_i$ and heights $H_i$ of the $n$ candidate regions:

$$J = \frac{1}{n}\sum_{i=1}^{n} W_i, \qquad K = \frac{1}{n}\sum_{i=1}^{n} H_i$$

In the failure cases under high color similarity, the mislocalized characters lie mostly in the middle of the text, as shown in Figure 4(d). Hence, any two regions which have the same alignment and the same direction are assumed to belong to the same region but with missing text in the middle. A degree deviation $\theta_{dev}$ is proposed and calculated for such regions before merging, referring to the deviation of the angles $\theta_{min}$ and $\theta_{max}$ formed by the minimum and maximum $x$, $y$ coordinates of the two regions. Figure 6 depicts an instance where $\tan\theta_{min}$ and $\tan\theta_{max}$ are obtained from these corner coordinates, from which $\theta_{dev}$ can be calculated. If $\theta_{dev}$ is small, i.e. $\theta_{dev} \leq T_{\theta}$, both regions are merged into one by readjusting the maximum and minimum $x$, $y$ coordinates. $T_{\theta}$ is the maximum limit of deviation allowed. In the proposed system, a loose strategy of $T_{\theta} = 9^{\circ}$ is taken, which is equivalent to 10% of the maximum possible deviation of $90^{\circ}$. After the merging process, some regions might still contain unwanted features (noise); hence region filtration is performed to filter out the regions which are more likely to contain noise. After filtration, the new set of edge pixel groups $E'$ is produced. Let $T_{low}$ refer to the original lower threshold used at the first stage and $\alpha$ refer to a markup factor that reserves the upper limit of growth from the original threshold. In this research, the threshold is marked up by 50%, i.e. $\alpha = 0.5$. Other regions which do not fall in $E'$ are not processed and retain the original edge result. Figure 7 gives an illustration of such an instance. The final edge image reveals the localized text in the regions of $E'$, where the filtration of Equations (4)-(8) is re-implemented and the edge pixels are clustered.
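The degree-deviation merge can be sketched as follows; the box representation and the exact way the two corner angles combine are our assumptions (we take the larger of the two corner-line angles relative to the horizontal), since the precise construction depends on Figure 6:

```python
import math

def degree_deviation(box_a, box_b):
    """Angles of the lines joining the min (top-left) corners and the max
    (bottom-right) corners of two boxes (x0, y0, x1, y1), measured against
    the horizontal. Taking the larger of the two is our assumption."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    theta_min = abs(math.degrees(math.atan2(by0 - ay0, bx0 - ax0)))
    theta_max = abs(math.degrees(math.atan2(by1 - ay1, bx1 - ax1)))
    return max(theta_min, theta_max)

def merge_if_aligned(box_a, box_b, t_theta=9.0):
    """Merge two boxes into their common bounding box when the deviation
    is within T_theta (9 degrees, 10% of the maximum 90, as in the paper);
    otherwise return None."""
    if degree_deviation(box_a, box_b) > t_theta:
        return None
    return (min(box_a[0], box_b[0]), min(box_a[1], box_b[1]),
            max(box_a[2], box_b[2]), max(box_a[3], box_b[3]))

# Two character boxes on the same text line, with a gap (a possibly
# missed character) between them, are merged into one region:
merged = merge_if_aligned((0, 0, 10, 20), (30, 0, 40, 20))
```

A vertically offset box fails the 9-degree test and is left unmerged, which matches the intent of keeping only same-alignment, same-direction regions.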

Experiment and Discussion
This section details the experimental process carried out to evaluate the efficiency of ATLAS in terms of strength and accuracy. To show the robustness of the results, ATLAS was tested with two different image datasets. The first dataset, which consists of images with different text-background color similarity, was used to evaluate the ideal localizing strength of the algorithms. The second dataset, a common image dataset, was used to evaluate the precision as well as the actual localizing strength, which indicates the usability of ATLAS.
The first dataset (a self-generated dataset) requires images with different ranges of TBSI values. In order to achieve a comprehensive result, the full dataset is self-generated. The text to be localized is positioned at the center of each generated image, which is considered the easiest position for localization. The localizing algorithm is limited to grayscale images, hence a total of 65,280 (256 × 255) grayscale images were generated, comprising all possible combinations of text and background gray values except those where the text pixels and background pixels have the same value (invalid by the TBSI definition). Figure 8 shows some examples of the self-generated image dataset and the corresponding TBSI values. The second dataset evaluates the actual localizing strength, which is obtained by calculating the average localizing strength over the images. Hence, the experimental process employed the public image dataset from ICDAR 2011 [32], which comprises commonly seen text images. Moreover, this dataset is widely used for text localization and text recognition analysis in this research field. Figure 9 shows sample text images from the ICDAR 2011 dataset. The related information for the first and second experiments is summarized in Table 3. The first experiment, on the self-generated dataset, evaluates the ideal localizing strength of ATLAS with reference to the other algorithms; it can also serve as a reference for calculating the average similarity of the public dataset. Ideal localizing strength reviews the capability of an algorithm to localize text regardless of the degree of color similarity.
The self-generated dataset uses images with a clear and unique color for the text pixels and the background pixels; it eliminates all uncertainties that can affect the localizing result except for the color similarity on which this paper focuses. All the generated images have the same text and position but different colors for text pixels and background pixels. To show the feasibility of TBSI, all probable color differences between text pixels and background pixels are generated, amounting to 65,280 images. These images are further divided into 255 subgroups categorized by their TBSI values. In this experiment, ATLAS is compared with three other algorithms in terms of ideal localizing strength: Liu and Wang's stroke-like edge based algorithm [23], Yi and Tian's boundary clustering based algorithm [24] and Lee and Kim's two-stage conditional random field algorithm [25]. All four algorithms were run on the self-generated dataset on the same computer with an Intel Core i7 2.00 GHz processor and 16 GB of memory. The experimental results are summarized in Table 4 and Figure 10. Text localization is relatively easier for the machine when the color difference between text pixels and background pixels is big (i.e., the TBSI is small). The precision as a function of the similarity index, p(TBSI), shows an ideal curve with a sharp decrease at a certain level of TBSI (see Figure 11). This simplifies the calculation of localizing strength, which can be obtained by directly taking the value at the sharp-decrease point. Localizing strength is normalized to range between zero and one, and is used for judging the ability of an algorithm to localize images with different TBSI (more precisely, images with high TBSI). Figure 11 shows that the precision of algorithm [23] decreases sharply at a TBSI of 0.545. Hence, the value 0.545 becomes a boundary or limitation for that algorithm and can be seen as its ideal localizing strength. Figure 11 shows the precision of similarity index function for [23], [24], [25] and the proposed algorithm.
Further, ATLAS is also evaluated in a second experiment using the common image dataset in order to evaluate its efficiency in real-world situations. Similar to the previous experiment, the efficiency of a localization algorithm is evaluated by its localizing strength, calculated from the surface area covered by the precision of similarity index function p(TBSI). The precision P can be calculated using the formula $P = N_c / N_d$, where $N_c$ is the total number of correctly localized text regions and $N_d$ is the total number of detected text regions. A text region is considered correct only if the localized text region covers at least 80% of the corresponding ground truth region. Different from the self-generated dataset, the public dataset is built up of discrete TBSI values, whereas the self-generated dataset covers a continuous range of TBSI. The calculation of the actual localizing strength (ALS) is then expressed by the following formula, with the images sorted in ascending order of TBSI:

$$ALS = \sum_{i=1}^{N-1} \frac{P_i + P_{i+1}}{2}\,(TBSI_{i+1} - TBSI_i) \qquad (12)$$

The symbol $N$ in the formula represents the total number of images in the common image dataset, while $P_i$ and $TBSI_i$ represent the precision and the similarity index of the $i$-th image respectively.
Formula (12) is derived from the formula for the area of a trapezium, with the precision P as the parallel sides and the TBSI difference as the height. The precision-to-similarity function for the public dataset is uncertain, mainly because different TBSI values exist within a single image; the experiment therefore considers the average TBSI of each image when measuring the actual localizing strength of the tested algorithms. The experiment compared ATLAS with the same three algorithms as the first experiment [23][24][25], and the results, sorted by actual localizing strength, are shown in Table 5.
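Under the trapezium reading of formula (12), the accumulation can be sketched as follows (function and variable names are ours):

```python
def actual_localizing_strength(precisions, tbsis):
    """Sum the trapezium areas formed by consecutive (TBSI, precision)
    pairs after sorting by TBSI, as formula (12) describes: the precision
    values form the parallel sides and the TBSI gap is the height."""
    pairs = sorted(zip(tbsis, precisions))
    area = 0.0
    for (t0, p0), (t1, p1) in zip(pairs, pairs[1:]):
        area += 0.5 * (p0 + p1) * (t1 - t0)
    return area

# Perfect precision across the whole TBSI range gives the maximum
# strength of 1.0; precision that collapses at high TBSI scores lower.
strength = actual_localizing_strength([1.0, 1.0, 1.0], [0.0, 0.5, 1.0])
```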
In Table 5, the Actual Localizing Strength (ALS) is calculated using formula (12), while the average precision is calculated by averaging the precision over all the image results that contribute to the ALS. The ground truth positions of the text in the image dataset were obtained from the text files attached to the dataset: the x-y coordinates, together with the width and height of each bounding box, are given in the ground truth files. The overall precision and recall are obtained by averaging the precision and recall values over every image in the dataset.
According to Table 5, ATLAS achieved an actual localizing strength of 0.63, the highest among the observed algorithms. Compared with the second-highest result, ATLAS shows an increment of 0.02, or 3.282% relatively. Similarly, the average precision of ATLAS is 0.68, which is also the highest among the observed algorithms; ATLAS shows an average precision improvement of 0.01, or 1.49% relatively. The higher localizing strength of ATLAS is mainly due to its high precision rate on the majority of high-TBSI images, which contributes a large amount of the cumulated strength difference between ATLAS and the other published algorithms. A decrease in the localizing strength of ATLAS was observed between the self-generated and the ICDAR 2011 dataset, which can be attributed to the degree of noise in the ICDAR 2011 dataset.

Conclusion
In this paper, a measurement of the similarity index between text and background pixels in an image is introduced, and an enhanced text localization algorithm, ATLAS, is proposed. ATLAS localizes texts that have high color similarity with their background. The paper presents a technique that applies a multi-region adaptive threshold with the Canny edge detector on possible text regions, evaluating the text-background similarity index of each region to improve the localizing strength of the algorithm. Two experiments (on a self-generated and a public dataset) were conducted to evaluate the ideal localizing strength and the actual localizing strength of the proposed algorithm. The experimental results show that ATLAS performed relatively better than the other observed algorithms. There are several possible enhancements for ATLAS as well as for the TBSI algorithm. Future work can consider replacing the TBSI algorithm with a lighter-weight algorithm such as the Otsu algorithm. The ATLAS algorithm can also be improved through extra footage, which can reduce the rate of false localization.