Traffic sign detection optimization using color and shape segmentation as pre-processing system

One of the performance indicators of an autonomous vehicle (AV) is its ability to accommodate rapid changes in its environment, and the performance of its traffic sign detection (TSD) system is part of that ability. A low TSD frame rate leads to late decision making and may cause a fatal accident. Meanwhile, adding a GPU to the TSD system significantly increases its cost and can make the vehicle unaffordable. This paper proposes a pre-processing system for TSD which implements color and shape segmentation to increase the system speed. These segmentation stages filter the input frames so that fewer frames are sent to the artificial intelligence (AI) system. As a result, the workload of the AI system decreases and its frame rate increases. An HSV threshold is used in color segmentation to filter out frames without a desired color. This algorithm ignores the saturation when performing color detection. Further, an edge detection feature is employed in shape segmentation to count the total contours of an object. Using the German Traffic Sign Recognition Benchmark dataset as the model, the pre-processing system filters 97% of frames with no traffic sign objects and has an accuracy of 88%. The proposed TSD system allows a frame rate improvement up to 32 frames per second (FPS) when the You Only Look Once (YOLO) algorithm is used.


INTRODUCTION
In the industrial revolution 4.0 era, research on autonomous vehicles (AVs) is extensively encouraged in order to build modern and safer transportation. An AV is equipped with a large number of sensors (infrared, camera, proximity, etc.) which let it sense changes in its surroundings and provide maximum safety for humans. Since it is controlled by computational devices without a human driver, an AV is claimed to lower the accident rate by 90% [1].
One of the crucial parts of an AV is its detection system, which has to be very responsive to any change in its surroundings, including traffic signs. The processing speed of the detection system, in terms of frame rate, determines whether its decisions are correct or incorrect. A low frame rate indicates a slow response of the traffic sign detection (TSD) system to external changes, including the presence of traffic signs. Since 2015, a number of accidents involving AVs have been reported, and a low frame rate was responsible for most of them. Most of the accidents were caused by the failure of the AV system to make a fast and accurate decision [2,3]. Therefore, a good TSD system must provide a frame rate of at least 25 frames per second (FPS) in order to react instantly while still preserving its accuracy. A high frame rate TSD can be obtained in two ways: by using an additional graphics processing unit (GPU) or by improving its algorithm. An additional GPU instantly boosts the frame rate of TSD since the detection process is performed using additional processors and memory. However, a GPU extension is costly and, as a consequence, the vehicle price will not be affordable. In contrast, improving the algorithm of the current system may produce a higher speed TSD at a lower cost.
ISSN: 1693-6930
As one of the detection systems, the TSD system requires extensive computational resources at the server and produces a poor frame rate. Meanwhile, not all frames sent to the detection system for processing contain traffic sign objects. Therefore, a preliminary system which is able to determine which frames to pass to the artificial intelligence (AI) system and which ones to reject will significantly increase the TSD frame rate. Since traffic signs have specific characteristics in color and shape, color and shape segmentation are good options for frame filtering in the pre-processing system and allow a lower workload for the AI system [3].

2. PREVIOUS WORK
Traffic sign detection is typically classified into three fundamental categories: color-based, shape-based and learning-based methods [4]. Meanwhile, Liu [5] adds two more categories: color and shape based methods, and light detection and ranging (LIDAR) based methods. Among these methods, learning-based detection performs best because it achieves a detection rate of more than 90%. Using a deep learning method such as a convolutional neural network (CNN) even increases the detection rate to more than 98% [6]. However, the existing algorithms are still far from real-time classification [4]. Balado [7] proposed an algorithm to achieve a better detection rate, but it still could not reach a real-time processing frame rate. Research on traffic sign detection based on CNN proposed by Wang [8] produced a detection speed of 5 FPS, which is far from the real-time processing standard. A multi-scale cascaded R-CNN is proposed by Zhang [9] and Liu [10] to overcome the problem of false detection when traffic signs are small. Moreover, a real-time object detector named YOLO and its modifications are often used to tackle real-time TSD problems [9,11,12]. Sphurti More [13] found that YOLO showed better accuracy than a support vector machine (SVM) based algorithm, although it ran slower.
On the other hand, classification using color-based and shape-based algorithms is usually fast, but these algorithms have a lower detection rate, which is not suitable for an AV system. Color-based segmentation algorithms can be applied to TSD since traffic signs have specific characteristic colors: red, blue and yellow [14]. Red is used for signs which represent danger or prohibition, yellow for construction signs and blue for information [14]. Color thresholding, region growing, color indexing and dynamic pixel aggregation are the most popular color-based detection methods [15]. However, color variation due to the age of the signs and to changes in lighting makes color-based segmentation algorithms produce low detection rates [4]. Some research on color segmentation algorithms shows that using HSV as the color space produces better results, although it requires a longer processing time [16,17,18].
Meanwhile, shape-based segmentation algorithms are commonly used to identify objects. In a TSD system, there are four specific shapes to identify, since traffic signs can be classified into rectangular, circular, octagonal and triangular signs. Wali [15] listed a number of shape-based methods, such as the Hough transformation, similarity detection, distance transform matching, edge detection features and Haar-like features. The combination of an adaptive color threshold segmentation method and a shape symmetry detection algorithm was proposed in [19] to get a higher detection rate in TSD.

3. PROPOSED SYSTEM
Based on the performance of the color-based, shape-based and learning-based methods discussed earlier, we propose a system which increases the detection rate of learning-based TSD by applying a frame filtering system. In a real implementation of TSD, the frames without any traffic sign object typically outnumber the ones with traffic signs. Therefore, processing only the frames which contain traffic signs will significantly increase the overall detection rate. The proposed system consists of two important processes: pre-processing and the AI system. Figure 1 shows the diagram of the proposed TSD system. The pre-processing part applies a color segmentation and a shape segmentation algorithm. They are used to pass only the candidate frames which may contain a traffic sign to the AI system. Then, the system continues the processing to detect whether the candidate frame actually contains a traffic sign by applying a learning-based algorithm (AI).
TELKOMNIKA Telecommun Comput El Control

Figure 1. The proposed AV system diagram
Input from the camera is resized using a bilinear interpolation algorithm, which performs linear interpolation twice in different directions. This method uses the four nearest points to calculate the value of a point lying between them [20,8]. Resizing is used to reduce the number of iterations when performing convolution on a frame. Bilinear interpolation is chosen because it produces a smoother result than the other resizing methods [21,22]. Since traffic signs have specific characteristic shapes and colors, we limit the detection of shapes to square, triangle, circle and octagon, and the detection of colors to red, blue and yellow. A frame which has none of these shapes and colors can be classified as a frame without any traffic sign and can then be excluded from further processing.
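The two-pass interpolation described above can be sketched in plain Python as follows. This is an illustrative sketch only, not the implementation used in this work: the function name `bilinear_resize`, the list-of-rows grayscale image and the corner-aligned coordinate mapping are assumptions made for the example, and a library resize routine may map coordinates differently.

```python
def bilinear_resize(image, new_h, new_w):
    """Resize a 2-D grayscale image (list of rows) with bilinear interpolation.

    Assumption for this sketch: output corners map exactly onto input corners.
    """
    old_h, old_w = len(image), len(image[0])
    scale_y = (old_h - 1) / (new_h - 1) if new_h > 1 else 0.0
    scale_x = (old_w - 1) / (new_w - 1) if new_w > 1 else 0.0
    out = []
    for i in range(new_h):
        y = i * scale_y
        y0 = int(y)
        y1 = min(y0 + 1, old_h - 1)
        dy = y - y0
        row = []
        for j in range(new_w):
            x = j * scale_x
            x0 = int(x)
            x1 = min(x0 + 1, old_w - 1)
            dx = x - x0
            # First pass: interpolate along x on the two neighbouring rows.
            top = image[y0][x0] * (1 - dx) + image[y0][x1] * dx
            bottom = image[y1][x0] * (1 - dx) + image[y1][x1] * dx
            # Second pass: interpolate along y between the two results.
            row.append(top * (1 - dy) + bottom * dy)
        out.append(row)
    return out
```

Each output value is a weighted average of the four nearest input pixels, with weights given by the distances dx and dy to those pixels.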

Color segmentation
In this work, we use an HSV threshold for color segmentation to identify the desired colors available in a frame. Before being processed by the HSV threshold, the frame is filtered using a Gaussian blur to smooth it and enhance its color features. The blur removes unwanted salt noise from the image and makes the color detection easier and more accurate. Gaussian blur uses a kernel window to blur an image, and the blur level can be adjusted by changing the kernel size. A bigger kernel size produces a more strongly blurred image, in which objects are unlikely to be detected correctly [23]. Figure 2 shows the 5x5 Gaussian blur kernel window. The HSV color space is used because a color can be detected according to its hue and value without considering its saturation and, according to previous works, the HSV color space shows a better performance than RGB. The HSV threshold is set to the desired colors: red, blue and yellow. The upper and lower bounds of the thresholds are chosen based on the experiment done by Adrian from PyImageSearch [25]. Among these bounds, [190,255,255] is for the red color [25]. Algorithm 1 shows an implementation of the color segmentation method.
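A hue-and-value test of the kind described above can be sketched with the Python standard library. The hue ranges, the brightness floor `MIN_VALUE` and the function names below are hypothetical placeholders chosen for illustration; they are not the thresholds taken from [25], and the Gaussian blur step is omitted.

```python
import colorsys

# Hypothetical hue ranges in degrees; placeholders, not the paper's thresholds.
HUE_RANGES = {
    "red": [(0.0, 10.0), (340.0, 360.0)],  # red wraps around the hue circle
    "yellow": [(40.0, 70.0)],
    "blue": [(200.0, 260.0)],
}
MIN_VALUE = 0.3  # hypothetical brightness floor on a 0..1 scale


def pixel_color(rgb):
    """Return the desired color name matched by an (R, G, B) pixel, or None."""
    r, g, b = (channel / 255.0 for channel in rgb)
    # Saturation is computed but deliberately not tested.
    hue, _saturation, value = colorsys.rgb_to_hsv(r, g, b)
    if value < MIN_VALUE:
        return None
    hue_deg = hue * 360.0
    for name, ranges in HUE_RANGES.items():
        if any(low <= hue_deg <= high for low, high in ranges):
            return name
    return None


def frame_has_desired_color(frame, min_pixels=1):
    """Return True once at least min_pixels pixels match a desired color."""
    matches = 0
    for row in frame:
        for rgb in row:
            if pixel_color(rgb) is not None:
                matches += 1
                if matches >= min_pixels:
                    return True
    return False
```

Because saturation is not tested, bright gray or white pixels, whose hue `colorsys` reports as 0, also match the red range in this sketch; a practical filter would probably need an additional rule for such pixels.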

Shape segmentation
In this work, we use an edge detection feature method for shape segmentation of the frames. A morphological filter is employed to reduce the noise in a frame and to add pixels between important features. This technique uses a nonlinear operation [26]. There are five morphological filter operations: dilation, erosion, opening, closing and the hit-or-miss transform. Dilation and erosion are the main operations of the morphological filter, while the opening, closing and hit-or-miss transforms are formed by combining dilation and erosion [27]. Dilation is an operation which thickens the structure of an object in a frame by adding extra pixels between important features. Equation (1) is used for the dilation operation [28]. Erosion is the opposite of dilation: the structure of an object is reduced by eliminating pixels around the object. Equation (2) shows the erosion operation [28].
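The relation between the basic and the combined operations can be illustrated on a small binary image. The following plain-Python sketch assumes a square structuring element and ignores pixels outside the frame; both choices, as well as the function names, are assumptions made for the example rather than details taken from this work.

```python
def _window(img, y, x, r):
    """Yield the in-frame pixels under a (2r+1) x (2r+1) window centred at (y, x)."""
    h, w = len(img), len(img[0])
    for yy in range(max(0, y - r), min(h, y + r + 1)):
        for xx in range(max(0, x - r), min(w, x + r + 1)):
            yield img[yy][xx]


def dilate(img, k=3):
    """Binary dilation: a pixel becomes 1 if any pixel under the window is 1."""
    r = k // 2
    return [[int(any(_window(img, y, x, r))) for x in range(len(img[0]))]
            for y in range(len(img))]


def erode(img, k=3):
    """Binary erosion: a pixel stays 1 only if every pixel under the window is 1."""
    r = k // 2
    return [[int(all(_window(img, y, x, r))) for x in range(len(img[0]))]
            for y in range(len(img))]


def opening(img, k=3):
    """Erosion followed by dilation: removes specks smaller than the kernel."""
    return dilate(erode(img, k), k)


def closing(img, k=3):
    """Dilation followed by erosion: fills gaps smaller than the kernel."""
    return erode(dilate(img, k), k)
```

In this sketch, opening removes an isolated noise pixel while leaving a 3x3 block unchanged, and closing fills a one-pixel hole inside an object.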
These operations are used to enhance the features of an object inside the frame. The morphological filter makes the objects in an image easier to detect. After the frame has been processed by the morphological filter, the contour of each object in the frame is detected and counted. The Douglas-Peucker algorithm is used to count the contour of an object. It reduces the number of curves in an image by replacing each of them with a straight line between two points. As a result, the contour of an object can be identified more easily [29]. If the number of contours is 4, then the object is classified as a rectangle. Totals of 3, 15 and 8 represent a triangle, a circle and an octagon, respectively. A frame which is classified as having one of the desired shapes is processed by the AI system. Algorithm 2 shows an implementation of the shape segmentation method.

Algorithm 1. Color segmentation
[...]
    Get a frame from the input camera
    Apply resizing to the current frame
    Apply a Gaussian blur filter to the frame
    if (the frame contains at least one desired color) then
        Send the current frame to the shape segmentation system
[...]

Algorithm 2. Shape segmentation
[...]
    Get a frame from the color segmentation system
    Apply the morphological filter to the current frame
    Apply the Douglas-Peucker algorithm to the frame
    if (the number of contours matches one of the desired shapes) then
        Send the current frame to the AI system
[...]
        Get the next frame
    end if
end while
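The contour-counting step can be sketched as follows, under the assumption that the count refers to the vertices left after Douglas-Peucker simplification of a closed contour. The function names, the tolerance `epsilon` and the way the closed contour is split into two open halves are choices made for this example; only the mapping from counts to shapes comes from the text.

```python
import math

# Counts taken from the text: 3 triangle, 4 rectangle, 8 octagon, 15 circle.
SHAPES = {3: "triangle", 4: "rectangle", 8: "octagon", 15: "circle"}


def _distance_to_line(p, a, b):
    """Perpendicular distance from point p to the line through a and b."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    norm = math.hypot(dx, dy)
    if norm == 0:
        return math.hypot(p[0] - a[0], p[1] - a[1])
    return abs(dy * (p[0] - a[0]) - dx * (p[1] - a[1])) / norm


def douglas_peucker(points, epsilon):
    """Simplify an open polyline, keeping points farther than epsilon from the chord."""
    if len(points) < 3:
        return list(points)
    index, dmax = 0, 0.0
    for i in range(1, len(points) - 1):
        d = _distance_to_line(points[i], points[0], points[-1])
        if d > dmax:
            index, dmax = i, d
    if dmax <= epsilon:
        return [points[0], points[-1]]
    left = douglas_peucker(points[:index + 1], epsilon)
    right = douglas_peucker(points[index:], epsilon)
    return left[:-1] + right  # the split point appears in both halves


def count_vertices(contour, epsilon=1.0):
    """Count the vertices of a closed contour (first point not repeated at the end)."""
    n = len(contour)
    # The point farthest from any given point of a polygon is one of its corners,
    # so both split points below are real corners.
    a = max(range(n), key=lambda i: math.dist(contour[i], contour[0]))
    rotated = contour[a:] + contour[:a]
    b = max(range(n), key=lambda i: math.dist(rotated[i], rotated[0]))
    first = douglas_peucker(rotated[:b + 1], epsilon)
    second = douglas_peucker(rotated[b:] + [rotated[0]], epsilon)
    return len(first) + len(second) - 2  # both split points are shared


def classify(contour, epsilon=1.0):
    """Return the shape name for a closed contour, or None if no shape matches."""
    return SHAPES.get(count_vertices(contour, epsilon))
```

A frame would be forwarded to the AI system when `classify` returns a shape name for at least one contour. The result depends on `epsilon`: a tolerance that is too large merges real corners, while one that is too small keeps noise points as vertices.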

Artificial intelligence (AI) system
A frame which successfully passes the pre-processing system is processed by the AI system for detection and classification. In the AI system, the frame is resized to a smaller size to reduce the amount of neural network processing. In this work, the You Only Look Once (YOLO) algorithm is employed in the AI system. YOLO is an open source system which has several versions: YOLO, YOLOv2, YOLOv3 and Tiny YOLO. YOLOv2 is used because it is faster and accurate enough for a TSD system [30]. The network has a feature extractor with 24 convolutional layers and 2 fully connected layers [31]. A convolutional layer is used to extract the important features of a frame by applying a convolution with a 5x5 mask. This mask keeps moving through the available positions in a frame to get its important features [32]. The output of this process is a smaller matrix with the important feature pixels. Figure 3 describes the process in a convolution layer. The output of the layer is a list of neurons which includes all the important values from the layer. This list is connected in a fully connected layer and becomes one final feature map. The final feature map is compared to the feature maps of the model obtained in the training phase. If the feature map of the frame matches one of the feature maps in the model, then the system succeeds in detecting and classifying the particular object [33].

Figure 3. The convolution process [32]

Max pooling is performed after the convolution process. Max pooling takes the biggest value in the matrix, which is assumed to be the feature being searched for. Similar to the convolutional layer, max pooling employs a matrix which moves throughout the frame and compares the values in the matrix [34,35]. Max pooling is chosen because of the condition of the dataset, where most of the images have bright colors.
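The sliding-mask and pooling operations described above can be sketched in plain Python. The sketch assumes stride 1 with no padding for the convolution and non-overlapping windows for the pooling, and it uses the sliding sum of products that CNN layers commonly call convolution; it illustrates the two operations only and is not the YOLO implementation.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (stride 1, no padding) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[y + i][x + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for x in range(out_w)
        ]
        for y in range(out_h)
    ]


def max_pool(feature_map, size=2):
    """Keep the largest value of each non-overlapping size x size window."""
    out_h = len(feature_map) // size
    out_w = len(feature_map[0]) // size
    return [
        [
            max(feature_map[y * size + i][x * size + j]
                for i in range(size) for j in range(size))
            for x in range(out_w)
        ]
        for y in range(out_h)
    ]
```

With a 5x5 mask, a 6x6 input yields a 2x2 output, which shows why the matrix produced by the convolution is smaller than the input frame.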

RESULT AND DISCUSSION
In this work, we use the German Traffic Sign Recognition Benchmark (GTSRB) dataset to construct our model. Figure 4 shows sample images from the dataset. We divide our experiments into two parts: the pre-processing part and the full system part. The pre-processing experiments are done with 5 different configurations. Meanwhile, the full system experiments measure the frame rate of the entire system with and without pre-processing.

Pre-processing
The first part of the experiments involves 43 traffic signs of the GTSRB dataset. This experiment is run with different combinations of parameter settings, including the size of the open kernel, the size of the close kernel and the size of the Gaussian blur. In this experiment, still images from the dataset are used as the input of the system. Figure 5 shows an example of the pre-processing of an image, where the color segmentation system catches a yellow object and the shape segmentation identifies a rectangular sign. Table 1 shows the percentage of successful detection and filtering in the experiment using 43 traffic signs with various parameter values. The successful detection value represents how well the system can recognize traffic signs by using the shape and color of the signs. Meanwhile, the successful filter value shows the number of frames successfully passed to the next processing stage compared to the total number of input frames. As the results show, changes in the parameters do affect the accuracy of detection and filtering of the frames. In experiment 1, the open and close kernels were set to 5x5 and the Gaussian blur was set to 3x3. In this configuration, a lot of salt noise was found in the frame, which produces inefficient frame filtering. In experiment 2, the open kernel size was set to 11x11 and the Gaussian blur was set to 3x3. Using this configuration, we obtain a better result than in the first experiment since the open kernel can filter more frames. However, the detection accuracy in experiment 2 decreased to 77%. This degradation was caused by the size of the open kernel, which is too big and therefore eliminates some of the important features in the frame.
In experiment 3, the close kernel was set to 11x11 and the open kernel to 5x5. The result with this configuration is slightly higher than in experiment 2, with a detection accuracy of up to 81%. It is better because the close kernel is capable of closing the pixel gaps in the frame, which enhances the features of a traffic sign in the frame. In the next experiment, the Gaussian blur, open kernel and close kernel sizes were all set to 5x5. These parameter values enable the system to reach a higher percentage in both detection accuracy and filtering. The bigger Gaussian blur size makes the features in a frame easier to detect. In the last experiment, the size of the open kernel was increased to 7x7, while the close kernel and the Gaussian blur were set to 5x5 and 3x3, respectively. The result with these parameter values was not as expected: the frame filtering performance was lower than in experiment 4. Since the optimal results are reached in experiment 4, its parameters are used for the further experiments. Table 2 shows the average time needed to process a single frame in the pre-processing system.

Traffic sign detection (TSD) system
In the next experiment, we use some videos as inputs of the TSD system. Figures 6(a) and 6(c) show an example of processing a video taken from a car dashboard. The processing time of every processing step and the total number of frames processed are displayed in Figures 6(b) and 6(d). In the next step, we measure the average time needed to process a single frame. The processing time includes the pre-processing system, the file transfer and the artificial intelligence system, and is shown in Table 3. It shows that, with the pre-processing system, the TSD system takes 1.012 seconds on average to process a single frame. Meanwhile, without the pre-processing system, it takes 1.0003 seconds on average, which includes the file transfer. This means that the pre-processing system increases the processing time of the TSD system by up to 1.17% when a candidate traffic sign object is identified. On the other hand, when no candidate traffic sign object is identified, the system takes 0.0117 seconds, which is the pre-processing time only, since the file transfer and AI processing are not required. Using a 10-frame video in which 10% of the frames contain a traffic sign object, we summarize the processing time of the system in Table 4. The processing time is reduced from 10.003 to 1.12 seconds (88%) and, as a consequence, the frame rate increases. The result in Figure 6 shows that the FPS values vary from 6.8 FPS to 105.64 FPS. A low frame rate indicates that the corresponding frame contains a candidate traffic sign object. On the other hand, the system gains a high frame rate when no candidate traffic sign object is detected. The experimental results show that the proposed algorithm with the pre-processing system can boost the speed of the TSD system to an average of 32 FPS.
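The arithmetic behind the 10-frame example can be reproduced with a short calculation. The two per-frame times are the ones reported above; the function name and the assumption of an ideal filter, which forwards exactly the frames that contain a sign, are made for this sketch.

```python
T_PRE = 0.0117   # pre-processing time per frame in seconds (from the text)
T_AI = 1.0003    # file transfer plus AI time per frame in seconds (from the text)


def total_time(n_frames, sign_fraction, use_preprocessing):
    """Estimated processing time for a video, assuming an ideal frame filter."""
    if not use_preprocessing:
        return n_frames * T_AI
    forwarded = n_frames * sign_fraction  # frames passed on to the AI system
    return n_frames * T_PRE + forwarded * T_AI


without_pre = total_time(10, 0.1, use_preprocessing=False)
with_pre = total_time(10, 0.1, use_preprocessing=True)
reduction = 1 - with_pre / without_pre

print(f"without pre-processing: {without_pre:.3f} s")
print(f"with pre-processing:    {with_pre:.3f} s")
print(f"reduction:              {reduction:.1%}")
```

Under this assumption, every frame pays the 0.0117 s pre-processing cost and only one frame in ten pays the additional 1.0003 s, which gives about 1.12 s in total against 10.003 s without pre-processing.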

CONCLUSION AND FUTURE WORK
Using color and shape segmentation as the pre-processing system allows a frame rate improvement of the traffic sign detection system. Using open kernel, close kernel and Gaussian blur matrix size of 5x5 pixels, this system is able to filter up to 97.18% of the frames and pass less than 3% for processing at the AI system. On average, the filtering system reaches 88% detection accuracy. As the number of frames sent to the AI system is reduced, the workload of the artificial intelligence system is decreased and as a result, the frame rate of this detection system increases up to 32 FPS without using any graphics processing unit (GPU). In the future, refinement of color and shape segmentation algorithms can be used to increase the accuracy of pre-processing and therefore improve accuracy of the whole system.