Well-Known Brand Recognition by Automated Classifiers Using Local and Global Features

Advertising experts define a "brand" as a "name, term, sign, symbol or design, or a combination of them, intended to identify a company's products or services." Brands communicate meaning through the language of color and shape. Brand and color are inseparable because color is the visual component people remember most about a brand, followed closely by shapes and symbols, so these features can be used for recognition. Scene classification is a very active topic in computer vision research, and much work has been done on it; the work specific to brand classification and recognition, however, still leaves much room for further discovery.

Features are information extracted from images in the form of numerical values that are difficult for humans to interpret and correlate directly. Depending on the application, two types of features are extracted from images: local and global features, also referred to as descriptors. For this paper, we selected two approaches to classify a limited set of brands in real-life images: a local-feature classifier and a global-feature classifier, both of which are used for object recognition.
Global image features describe an image as a whole, whereas local features represent image patches. A global feature summarizes the whole image with a single vector, while local features are computed over multiple distinct structures such as points, blobs, or edges, making them more robust. Global features include contour representations, shape descriptors, and texture features. Global texture features and local features provide different information about the image because of the spatial support over which the texture is computed. Generally, global features are used for low-level tasks such as object detection and classification, and local features for higher-level tasks such as object recognition. Combining global and local features improves recognition accuracy at the cost of additional computational overhead.
For this study, we used the SIFT method to extract local image features from the training set. A bag-of-words model then quantizes these features by running the k-means clustering algorithm on the descriptors, where each cluster defines a specific feature, or "word". To achieve higher accuracy, we also use a GIST descriptor, which summarizes the scene with a single vector and, with suitable parameters, can distinguish, for example, indoor from outdoor images.
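The quantization step above can be sketched as follows. This is a minimal toy k-means illustration, not the VLFeat implementation used in our experiments; the function names and the toy descriptors are ours.

```python
import numpy as np

def build_vocabulary(descriptors, k, iters=10):
    """Toy k-means over descriptor rows; each centroid is one visual 'word'."""
    # deterministic init: pick k evenly spaced descriptors as starting centers
    idx = np.linspace(0, len(descriptors) - 1, k).astype(int)
    centers = descriptors[idx].astype(float)
    for _ in range(iters):
        # assign every descriptor to its nearest centroid, then re-estimate
        dists = np.linalg.norm(descriptors[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = descriptors[labels == j].mean(axis=0)
    return centers

def bag_of_words(descriptors, centers):
    """L1-normalised histogram of nearest-word assignments for one image."""
    dists = np.linalg.norm(descriptors[:, None] - centers[None], axis=2)
    hist = np.bincount(dists.argmin(axis=1), minlength=len(centers)).astype(float)
    return hist / hist.sum()
```

Each image is thereby reduced to a fixed-length histogram regardless of how many key points it contains, which is what makes the representation usable by a linear classifier.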
ABSTRACT

From color and type to patterns and illustrations, brands are designed to be recognizable and to convey their values and personality. Patterns and color are key elements here, as they can play a vital role in brand recognition. The images used for brand classification were handpicked and collectively named HKDataset. We explore various feature extractors used for classification and use an automated classifier, the linear SVM, tuning the model parameters to achieve optimal performance. We observe that the Support Vector Machine performs better when using GIST descriptors combined with bag-of-SIFT features. We hope to apply deep learning and other sophisticated classifiers to much-expanded categories of brands in the future.

A. Literature Review
Previous work in logo recognition has focused on discovering design marks in historical documents [1] or similarity checks for trademarks [2]. Other work attempts classification in real-world scenes. For logos in sports videos, Zhi Li [3] assumes rigid planar backgrounds and utilizes line-based intensity profiles. As local interest point descriptors have become popular, there have been approaches that use point correspondences for matching templates. Bagdanov et al. [4] employ normalized matching using a bag of local features and logo localization in sports videos; this method can scale to more than just a few logos. Fakhar et al. [5] use the Apriori algorithm to identify frequent spatial configurations of local features extracted on a spatial pyramid. Here, mining requires a large amount of training data and is computationally costly. Kalantidis, Yannis, et al. [6] extend the bag-of-words model, incorporate Delaunay triangulation, represent triangles by signatures for each class, and use these signatures for classification. However, there are limitations regarding image resolution. Romberg, Stefan, et al. [7] used a HOG feature extractor to recognize manually labeled logo images, complemented with non-logo images. In contrast, we use non-labeled logo images handpicked from search engines for the learning process. This paper opts for brand classification/recognition in real-world images using a well-known classifier, the Support Vector Machine. It was a state-of-the-art algorithm of its time, is still preferred in medical image processing [8] - [20], and has fewer training parameters, which leads to higher stability [21]. Moreover, the bag of Scale-Invariant Feature Transform (SIFT) feature extraction model and the GIST descriptor model are used. Since much work has already presented new approaches to image classification on the same old datasets, [22] motivated us to create our own dataset.
The aim is to apply these algorithms to the new dataset and compare the results to see whether these longstanding algorithms are still among the best. Previous research by Lou et al. found image classification using the bag-of-features model to be more efficient than other algorithm models [23]. Some researchers [24] - [25] also reported that image classification using bag of features with SVM as a classifier could significantly improve classification accuracy, as used in [26] to detect review spam on a labeled social media dataset. Others also recommend categorizing the scenes, as in aircraft detection [27] - [28], where texture characteristics and the Gabor filtering method are used to reduce detection processing and computational complexity while improving detection effectiveness by 90%.

B. Feature Extraction Model
In this study, two feature extraction techniques are used: SIFT and the GIST descriptor. The details are presented in the following description.

SIFT Descriptor
SIFT is a method that is invariant to scale, illumination, and rotation. For our research objective, this method is appropriate for extracting local features that help distinguish different brand types with almost the same characteristics. It mainly consists of the following steps: a. First, each pixel is compared with 26 neighboring values, the eight surrounding pixels in its own scale and nine in each adjacent scale, to find extrema that locate key points in the image. b. An orientation is computed for each key point from the gradient directions of its surrounding pixels, and the most prominent direction is chosen as the key point's orientation.
Finally, the SIFT feature vector is generated. Lowe advised characterizing each key point using a 4×4 grid of seed points, each with an 8-bin orientation histogram, to boost matching stability. Thus, a 128-dimensional SIFT vector is obtained for each key point (shown in Fig. 1).
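The 26-neighbour extremum test in step (a) can be illustrated with a short sketch over a difference-of-Gaussians stack; the array layout and function name below are illustrative, not part of any SIFT library.

```python
import numpy as np

def is_scale_space_extremum(dog, s, y, x):
    """True if dog[s, y, x] is strictly larger or strictly smaller than all
    26 neighbours: 8 in its own scale plus 9 in each adjacent scale."""
    cube = dog[s - 1:s + 2, y - 1:y + 2, x - 1:x + 2]  # 3x3x3 neighbourhood
    centre = dog[s, y, x]
    others = np.delete(cube.ravel(), 13)  # flattened index 13 is the centre
    return bool((centre > others).all() or (centre < others).all())
```

Only pixels passing this test become key-point candidates; the subsequent orientation and 128-dimensional descriptor computations are applied to these candidates.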

GIST Descriptor
The GIST feature is a global image feature that captures the dominant spatial structure of a scene. Oliva and Torralba first proposed the GIST feature, which was used in place and object recognition [29]. They showed that it is possible to categorize different scenes such as buildings, streets, trees, sky, cars, streetlights, and persons using the GIST descriptor. Our current work shows that it is possible to use GIST for brand recognition in typical outdoor environments (shown in Fig. 2). Given an input image, a GIST descriptor is computed in the following stages: a. Convolve the image with 32 Gabor filters at 4 scales and 8 orientations, producing 32 feature maps of the same size as the input image. b. Divide each feature map into 16 regions (a 4×4 grid) and average the feature values within each region. c. Concatenate the 16 averaged values of all 32 feature maps, resulting in a 16×32 = 512-dimensional GIST descriptor.
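The three stages can be sketched as follows. This is a simplified GIST-like descriptor, not the reference implementation; the Gabor parameters (envelope widths, carrier frequencies) are illustrative choices.

```python
import numpy as np

def gabor_bank(size, scales=4, orients=8):
    """Spatial-domain Gabor-like kernels: 4 scales x 8 orientations = 32."""
    ys, xs = np.mgrid[-size // 2:size // 2, -size // 2:size // 2]
    bank = []
    for s in range(scales):
        sigma = 2.0 * (s + 1)      # envelope width grows with scale
        freq = 0.25 / (s + 1)      # carrier frequency shrinks with scale
        for o in range(orients):
            theta = np.pi * o / orients
            xr = xs * np.cos(theta) + ys * np.sin(theta)
            env = np.exp(-(xs ** 2 + ys ** 2) / (2 * sigma ** 2))
            bank.append(env * np.cos(2 * np.pi * freq * xr))
    return bank

def gist_descriptor(img, grid=4):
    """Filter with each kernel, average |response| over a grid x grid
    partition, concatenate -> grid*grid*32 = 512 values for grid=4."""
    h, w = img.shape
    feats = []
    for kern in gabor_bank(min(h, w) // 2):
        # frequency-domain convolution (wrap-around boundaries)
        resp = np.abs(np.fft.ifft2(np.fft.fft2(img, (h, w)) *
                                   np.fft.fft2(kern, (h, w))))
        for gy in range(grid):
            for gx in range(grid):
                cell = resp[gy * h // grid:(gy + 1) * h // grid,
                            gx * w // grid:(gx + 1) * w // grid]
                feats.append(cell.mean())
    return np.array(feats)
```

The grid averaging is what makes GIST global: precise key-point locations are discarded, and only the coarse spatial layout of oriented energy survives.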
C. Machine Learning Algorithm

Support Vector Machine (SVM) is a supervised machine-learning algorithm that can be used for both classification and regression tasks. In this algorithm, each data item is plotted as a point in n-dimensional space (where n is the number of features), with the value of each feature being the value of a particular coordinate. Classification is then performed by finding the hyperplane that best separates the classes (Fig. 3).
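As an illustration of the underlying objective, the following sketch minimizes the regularized hinge loss with Pegasos-style sub-gradient steps. It is a toy stand-in for the solver actually used in our experiments, and the parameter names and defaults are ours.

```python
import numpy as np

def train_linear_svm(X, y, lam=0.1, steps=2000):
    """Sub-gradient descent on lam/2*||w||^2 + hinge loss (Pegasos-style).
    Labels must be -1 or +1; a bias term is folded in as a constant feature."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w = np.zeros(Xb.shape[1])
    for t in range(1, steps + 1):
        i = t % len(Xb)                   # cycle through the training points
        eta = 1.0 / (lam * t)             # decaying learning rate
        margin = y[i] * Xb[i].dot(w)
        w *= (1.0 - eta * lam)            # shrink from the regularizer
        if margin < 1:                    # hinge is active for this sample
            w += eta * y[i] * Xb[i]
    return w

def svm_predict(w, X):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return np.sign(Xb.dot(w))
```

The regularization weight lam here plays the role of the lambda parameter tuned in our experiments: larger values penalize the margin violations more heavily.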
The local context information in traditional SIFT and the global scene information in GIST complement each other; combining them overcomes the shortcomings of both. Using these combined features, the brand classification task is performed with a linear SVM. When the influence of the combined features on classification accuracy is examined, the results are even better.
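A minimal sketch of the combination step, assuming each descriptor is L2-normalized before concatenation (an assumption of this sketch; the exact normalization in our pipeline may differ):

```python
import numpy as np

def combine_features(sift_hist, gist_vec):
    """L2-normalise each descriptor, then concatenate, so that neither
    feature type dominates the linear SVM by sheer magnitude."""
    def l2(v):
        n = np.linalg.norm(v)
        return v / n if n > 0 else v
    return np.concatenate([l2(sift_hist), l2(gist_vec)])
```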

D. K-Fold Cross-Validation Technique
K-fold cross-validation is used to evaluate or compare learning algorithms. The data set is divided into k subsets, and the holdout method is repeated k times: each time, one of the k subsets is used as the test set, and the other k−1 subsets together form the training set. Here we use 5-fold cross-validation (k = 5), the most common choice; since accuracy is averaged across all k trials, the exact value of k matters little, and the result is similar in most cases.
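The splitting scheme can be sketched as follows; the interleaved assignment of indices to folds is one simple choice among many.

```python
def k_fold_indices(n_items, k=5):
    """Split indices 0..n_items-1 into k interleaved, near-equal folds;
    each fold serves once as the test set, the rest as the training set."""
    folds = [list(range(i, n_items, k)) for i in range(k)]
    splits = []
    for held_out in range(k):
        test = folds[held_out]
        train = [i for f in range(k) if f != held_out for i in folds[f]]
        splits.append((train, test))
    return splits
```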

E. Local Features used for classification
SIFT features were extracted using the open-source algorithms from VLFeat [30]. Seven one-vs-rest SVM classifiers were trained, one for each class, in each fold. All seven SVM classifiers are run on each test image, and the classification with the highest average score is returned for each fold.
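The one-vs-rest decision rule can be sketched as below, where each per-class scorer stands in for a trained SVM's decision function; the class names are examples from our dataset and the function name is ours.

```python
def predict_one_vs_rest(scorers, x):
    """Run every per-class scorer on the sample and return the class
    whose classifier is most confident (largest decision value)."""
    best_class, best_score = None, float("-inf")
    for label, scorer in scorers.items():
        s = scorer(x)
        if s > best_score:
            best_class, best_score = label, s
    return best_class
```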

F. Global Features used for classification
Although SIFT features are quite descriptive, many other feature representations can be useful for classifying a brand. The GIST feature vector extracted from each image is normalized and used as the feature vector in place of the SIFT histogram.

A. Results
Intuitively, as lambda grows, fewer wrongly classified examples are allowed. Different lambda values were tried to determine a good value; the best results, shown in "Table. 2", were obtained using SIFT features at lambda = 0.01, GIST features at lambda = 0.0001, and combined (GIST+SIFT) features at lambda = 0.001. With suitable parameters for the best SVM classification, the accuracy rate is highest when using the combined GIST and SIFT features [31]; next comes SIFT, achieving better overall values than GIST alone.

B. Performance Measure
Precision is the number of true positives divided by the total number of elements labeled as the positive class. Recall is defined as the number of true positives divided by the total number of elements that actually belong to the positive class. Details can be found in Fig. 4, which shows the average PR curve obtained by averaging the per-class PR curves over each fold.
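These two definitions translate directly into code; the label values below are illustrative.

```python
def precision_recall(y_true, y_pred, positive):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if p != positive and t == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```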

C. Evaluation
For a realistic evaluation of brand recognition, an extensive collection of photos from real-world environments is required. A few logo datasets are available, but we feel they cannot evaluate an approach based on local and global features: they contain many images with very small logos, and they evaluate the recognition process rather than the visual features themselves. Therefore, we built a new dataset, "HKDataset" [32], containing photos depicting brands. The collection consists of logo, signage, mascot, and product images of six different classes, downloaded from a Web service with appropriate queries. The six well-known brand classes are Coca-Cola, Starbucks, Michelin, Heineken, Agip, and Gorilla.
To help the classifier achieve high precision, evaluating its sensitivity to non-object images is very important. Therefore, non-logo images were downloaded from a Web service with the queries "building", "nature", "roads", and "urban buildings". These images are unlikely to contain brands and complete our dataset. A summary of the images per category is shown in "Table. 2".

IV. Conclusion
In this paper, we opted for brand classification/recognition in real-world images using a well-known classifier, the Support Vector Machine, a state-of-the-art algorithm of its time. For brand recognition, we used the 5-fold cross-validation technique to validate the bag of Scale-Invariant Feature Transform (SIFT) feature extraction model and the GIST descriptor model. We first applied classification using SIFT and GIST features separately. When SIFT was combined with GIST features, however, the result was a higher accuracy rate under parameter tuning, as well as a better precision-recall curve, attaining high precision and better recall. We hope to apply deep learning and other robust new algorithms to much-expanded brand classes in the future.