Batik image retrieval using convolutional neural network

This paper presents a simple technique for Batik image retrieval using a Convolutional Neural Network (CNN) approach. Two CNN models, one trained with supervised learning and one with unsupervised learning, are considered to perform end-to-end feature extraction that describes the content of a Batik image. Distance metrics measure the similarity between the query and target images in the database based on the features generated by the CNN architecture. As reported in the experimental section, the proposed supervised CNN model achieves better performance than the unsupervised CNN in the Batik image retrieval system. In addition, the image feature composed from the proposed CNN model yields better performance than handcrafted feature descriptors. These results demonstrate the superior performance of the deep learning-based approach in the Batik image retrieval system.


Introduction
The brain is an amazing organ in the human body. With our brains, we can understand what we see, smell, taste, hear, and touch. An infant's brain weighs only about half a kilogram, yet it can solve problems that even supercomputers cannot. Within several months of birth, a baby can recognize the faces of its parents, discern discrete objects from the background, and begin to speak. Within one year, the baby has an intuition about natural objects, can follow objects, and understands the meaning of sounds. As children, they understand grammar and have thousands of words in their vocabulary.
Building machines with intelligence like our brains is not easy. To make machines with artificial intelligence, we have to solve very complex computing problems, problems that we ourselves struggle to formalize but that our brains solve in a matter of seconds. To overcome this, we need a different way to program computers, using techniques largely developed over the past decade. This need gave rise to an active field of artificial intelligence commonly called deep learning [1].
Nowadays, artificial intelligence (AI) is developing very rapidly and is used in many fields of research. In computer vision, Content-Based Image Retrieval (CBIR) has been developed in multi-level schemes ranging from low-level to high-level features. The Convolutional Neural Network (CNN) has been used successfully as an effective feature descriptor that gives accurate results. In general, the features gained by deep learning methods are trained to mimic human perception through operations such as convolution and pooling. Deep learning features have become better descriptors than low-level features. Although the CNN is now the state of the art in computer vision, this does not guarantee that features taken from the highest level always give the best performance [2].
A Content-Based Image Retrieval system aims to provide an effective way to browse, retrieve, and search for desired images stored in an image database. The image database contains many images that are stored and arranged on a storage device. Usually the database is very large, so searching for specific images manually takes a lot of time and is inconvenient for the user. For example, Batik is a cultural heritage of the Indonesian archipelago that has high value and a blend of art, laden with philosophical meanings and meaningful symbols that show the way of thinking of the people who make it. Batik is a craft that has long been part of Indonesian culture, especially Javanese culture. Batik has many motifs, patterns, and colors, so retrieving a specific Batik image from a database is very challenging [3].
This paper offers a solution that uses convolutional neural networks to carry out the CBIR task and address the problems that occur in retrieving Batik images. The method aims to produce effective image descriptors from the CNN architecture. These feature descriptors are very important for content-based retrieval systems. The image feature is used to improve performance and to solve problems in existing Batik retrieval systems.

ISSN: 1693-6930 ◼ Batik image retrieval using convolutional neural network (Heri Prasetyo)

Content-based Image Retrieval System
Image retrieval is the task of searching for and retrieving a specific image in a large image database. The classical approach depends on metadata such as text, keywords, or descriptions embedded in an image, so retrieval is performed with that text or those keywords as the search key. This technique is inefficient because manual image annotation is a time-consuming and exhausting process. Even though many automatic image annotation methods have been proposed in the literature [4], an image retrieval system based on content annotation still cannot deliver satisfactory results.
CBIR is a computer application dealing with the search problem over large-scale image databases. CBIR, also known as Query By Image Content (QBIC) and Content-Based Visual Information Retrieval (CBVIR), differs from the metadata-based approach. CBIR analyzes the image content rather than metadata such as image keywords, tags, or image descriptions [5].
In this paper, the use of the CNN model is extended to the CBIR task. The main reason is the superior performance of the CNN model over handcrafted features in computer vision and recognition tasks. The CNN, or deep learning network, achieved outstanding performance in the ImageNet challenge [6]. The CNN model has inspired other deep learning-based approaches, such as AlexNet [7], VGGNet [8], GoogLeNet [9], and Microsoft ResNet [10], which address the limitations of handcrafted features in the image retrieval domain.
The CNN model receives a three-dimensional image of size $h \times w \times d$, where $h$ and $w$ are the spatial dimensions and $d$ is the number of channels. This image is processed through a CNN architecture consisting of several convolution, max-pooling, and activation layers to perform end-to-end image feature generation. Let $x_{ij}$ be the data vector located at spatial position $(i, j)$ in a specific layer. The CNN computes the new data $y_{ij}$ as follows:

$$y_{ij} = f_{ks}\left(\{x_{si+\delta i,\, sj+\delta j}\}_{0 \le \delta i,\, \delta j < k}\right)$$

where $k$ and $s$ denote the kernel size and stride, respectively. The function $f_{ks}$ is determined by the layer type: a matrix multiplication for convolutional layers, a spatial maximum for max-pooling layers, an elementwise nonlinearity for activation functions, and so on for other types of layers. This functional form is preserved under composition, with the kernel size and stride of the composed layers following the corresponding transformation rule.
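As a minimal sketch of how a layer with kernel size k and stride s maps a window of input positions to one output position, the following pure-Python function applies max pooling to a single-channel feature map. The function name `max_pool2d` and the toy input are illustrative assumptions and are not taken from the proposed system.

```python
def max_pool2d(x, k, s):
    """Max pooling on a 2-D list `x` with kernel size k and stride s.

    Output position (i, j) is the maximum over the k x k window whose
    top-left corner sits at input position (s * i, s * j).
    """
    h, w = len(x), len(x[0])
    out_h = (h - k) // s + 1
    out_w = (w - k) // s + 1
    return [
        [
            max(x[s * i + di][s * j + dj] for di in range(k) for dj in range(k))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# Toy 4 x 4 feature map (hypothetical values).
feature_map = [
    [1, 3, 2, 0],
    [4, 6, 5, 1],
    [7, 2, 9, 8],
    [0, 1, 3, 4],
]

pooled = max_pool2d(feature_map, k=2, s=2)
print(pooled)  # [[6, 5], [7, 9]]
```

With k = 2 and s = 2 the windows do not overlap and the spatial size is halved, which is the usual setting for a max-pooling layer.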
While a general network computes a general nonlinear function, a network with only layers of this form computes a nonlinear filter, which we call a deep filter or fully convolutional network (FCN). An FCN naturally operates on an input of any size and produces an output of the corresponding spatial dimensions. A real-valued loss function composed with an FCN defines a task. If the loss function is a sum over the spatial dimensions of the final layer, $\ell(x; \theta) = \sum_{ij} \ell'(x_{ij}; \theta)$, its parameter gradient is the sum of the parameter gradients of its spatial components. Thus, stochastic gradient descent on $\ell$ computed on whole images is the same as stochastic gradient descent on $\ell'$, taking all of the final-layer receptive fields as a minibatch. Because these receptive fields overlap and would otherwise be computed repeatedly, forward and backward propagation are more efficient when carried out layer by layer over the whole image than when computed patch by patch on parts of the image. An illustration of a CNN operation can be seen in Figure 1.
The proposed CNN model constructs the feature descriptor from a Batik image. This feature descriptor is used to measure the similarity between the query and target images in the database under the K-Nearest Neighbors (KNN) [11] strategy, which performs similarity matching with a distance score criterion. This paper investigates two CNN models in the training stage, i.e. with supervised and unsupervised learning approaches. Figure 1 illustrates an example of the proposed supervised CNN architecture for Batik image retrieval. The term supervised refers to the use of the class label, whereas the unsupervised approach ignores the image label in the training process. The autoencoder is a simple example of an unsupervised CNN method, which compresses the data features into a smaller size and then recovers the original data [12].

Method
This section presents two methods for generating the feature descriptor in the Batik image retrieval system. We first explain the supervised CNN model. The unsupervised CAE model [13] is described afterwards.

Supervised Learning
The CNN model is a supervised deep learning-based approach commonly used in image classification [14], prediction [15], segmentation, analysis [16], etc. The supervised CNN model consists of several types of layers, such as convolutional layers and max-pooling layers. These layers are repeated several times and then fed into the fully connected layers at the end of the network [17]. Our proposed image retrieval system employs a CNN architecture with six convolutional layers and two fully connected layers to generate the Batik feature descriptor. Table 1 summarizes the CNN architecture used in our proposed method.

After six convolution and max-pooling operations, an input image of size 128 × 128 × 3 is converted into a new representation with dimensionality 2 × 2 × 256. This new representation is then flattened into one-dimensional data of size 1 × 1 × 1024. The flattened data is subsequently processed and trained with a Multi-Layer Perceptron (MLP). Herein, the MLP receives the 1024 input features and feeds them into 1024 input neurons. The hidden and output layers are set to 256 and 97 neurons, respectively. The value of 97 in the output layer equals the number of target classes, i.e. the number of Batik image classes used in the proposed image retrieval system.
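The sizes quoted above can be checked with a few lines of arithmetic. The sketch below assumes that each of the six blocks halves the spatial size (a size-preserving convolution followed by 2 × 2 max pooling) and that the MLP uses plain dense layers with biases. Neither detail is stated explicitly in the text, and the helper name `spatial_size_after_blocks` is hypothetical.

```python
def spatial_size_after_blocks(size, num_blocks, pool=2):
    """Spatial side length after `num_blocks` blocks that each divide it by `pool`."""
    for _ in range(num_blocks):
        size //= pool
    return size


# 128 -> 64 -> 32 -> 16 -> 8 -> 4 -> 2 under the halving assumption.
side = spatial_size_after_blocks(128, num_blocks=6)

# Flattening the 2 x 2 x 256 representation gives the MLP input size.
final_channels = 256
flat_features = side * side * final_channels

# MLP layer widths as described in the text: input, hidden, output.
mlp_layers = [flat_features, 256, 97]

# Number of dense-layer parameters (weights plus biases), assuming plain
# fully connected layers.
mlp_params = sum(
    n_in * n_out + n_out for n_in, n_out in zip(mlp_layers, mlp_layers[1:])
)

print(side, flat_features, mlp_params)  # 2 1024 287329
```

The 97-dimensional output is the vector later used as the image feature for retrieval.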

CAE Unsupervised Learning
This paper also considers another CNN model, namely the Convolutional Auto-Encoder (CAE), for generating the image feature. The CAE is an unsupervised deep learning-based method, i.e. the image label is not required in the training process. To generate the image feature, this technique learns and captures information directly from the input data without any class label.
The CAE involves two parts, i.e. the encoder and decoder blocks. The encoder block processes the sample data $X$, consisting of $n$ samples and $m$ features, to yield the output $H$. On the opposite side, the decoder aims to reconstruct the original sample data $X$ from $H$. Let $X'$ be the reconstructed data produced at the decoder side. The main goal of the CAE is to minimize the difference between the original data $X$ and the reconstructed version $X'$. Specifically, the encoder maps the input $x$ into a new representation $h$ with the help of the function $f$. This process can be formulated as follows:

$$h = f(x) = s_f(Wx + b_h)$$

where $s_f$ denotes the nonlinear activation function on the encoder side. The CAE performs a linear operation if the identity function is used for $s_f$. $W$ and $b_h \in \mathbb{R}^{d_h}$ are the encoder parameters, referring to the weight matrix and bias vector, respectively, with $d_h$ the dimensionality of $h$. In contrast, the decoder reconstructs $x'$ from the representation $h$ by means of the function $g$. This process can be written as:

$$x' = g(h) = s_g(W'h + b_x)$$

where $s_g$ represents the activation function on the decoder side. $b_x$ and $W'$ are the bias vector and weight matrix, respectively, which form the decoder parameters.
Strictly speaking, the CAE model searches for the global or near-optimum parameters $\theta = (W, b_h, b_x)$ in the training process. This task is equivalent to minimizing the loss function over the whole dataset $X$ under the following objective function:

$$\theta^{*} = \arg\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} L\left(x_i, g(f(x_i))\right)$$

where $L(\cdot,\cdot)$ denotes the auto-encoder loss function. In this paper, we simply use the $L_2$ reconstruction error for the loss function, commonly referred to as the Mean Squared Error (MSE) [18]. This loss function is formally defined as:

$$L(x_i, x_i') = \| x_i - x_i' \|_2^2, \qquad x_i' = g(h_i), \qquad h_i = f(x_i)$$

where $x_i \in X$, $x_i' \in X'$, and $h_i \in H$ denote the original input data, the reconstructed data, and the new compact representation of the input data, respectively.
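The encoder, decoder, and reconstruction loss described above can be illustrated with a tiny dense auto-encoder in pure Python. This is a simplified sketch, not the convolutional model used in the paper: the layers are fully connected, the decoder weight matrix is taken as the transpose of the encoder matrix purely for the example, and all function names and numbers are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


def identity(values):
    """Identity activation; with it the auto-encoder is purely linear."""
    return list(values)


def encode(x, weights, bias_h, activation=identity):
    """Encoder: activation(W x + b_h)."""
    return activation([a + b for a, b in zip(matvec(weights, x), bias_h)])


def decode(h, weights_dec, bias_x, activation=identity):
    """Decoder: activation(W' h + b_x)."""
    return activation([a + b for a, b in zip(matvec(weights_dec, h), bias_x)])


def mean_squared_error(samples, reconstructions):
    """Mean over samples of the squared L2 reconstruction error."""
    total = sum(
        sum((a - b) ** 2 for a, b in zip(x, x_rec))
        for x, x_rec in zip(samples, reconstructions)
    )
    return total / len(samples)


# Toy setup: 3 input features compressed to a 2-dimensional code.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
W_dec = transpose(W)           # assumption made only for this example
b_h = [0.0, 0.0]
b_x = [0.0, 0.0, 0.0]

X = [[1.0, 2.0, 3.0]]
H = [encode(x, W, b_h) for x in X]
X_rec = [decode(h, W_dec, b_x) for h in H]

print(H[0])                            # [1.0, 2.0]
print(X_rec[0])                        # [1.0, 2.0, 0.0]
print(mean_squared_error(X, X_rec))    # 9.0
```

The third feature cannot pass through the 2-dimensional code, so it is lost in the reconstruction and accounts for the whole error. Training would adjust the parameters to reduce this loss over the dataset.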
In this paper, the CAE architecture is built with four encoding blocks and four decoding blocks, forming a stacked Convolutional Auto-Encoder. The CAE architecture is summarized in Table 2. Suppose that an input image is of size 128 × 128 × 3. As can be inferred from Table 2, this image is convolved four times to obtain a simpler and more compact representation. This process can also be considered as repeated encoding. Herein, the new representation is regarded as the neural code, with dimensionality 4 × 4 × 128. By using the backward approach and decoding process, this neural code can be recovered to yield a reconstructed image of the original size 128 × 128 × 3. This reverse process performs deconvolution and unpooling operations. The CAE neural code can be further utilized as the feature descriptor in the proposed Batik image retrieval system.

ISSN: 1693-6930 TELKOMNIKA Vol. 17, No. 6, December 2019: 3010-3018

Learning Process and Hyperparameter Tuning
The CNN model is very sensitive to hyperparameter changes in the learning process, since it uses the Rectified Linear Unit (ReLU), $f(x) = \max(0, x)$, as its activation function. Because ReLU is unbounded for positive inputs, training with gradient descent can be less stable than with the bounded tanh and sigmoid activation functions. On the other hand, ReLU speeds up learning: in [7], a network with ReLU reaches a 25% training error rate several times faster than an equivalent network with tanh units.
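The contrast between ReLU and the saturating tanh and sigmoid functions can be seen directly from their values and derivatives. The sketch below is illustrative only; the function names and the sample inputs are assumptions.

```python
import math


def relu(x):
    """ReLU activation: max(0, x)."""
    return max(0.0, x)


def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)


def tanh_grad(x):
    return 1.0 - math.tanh(x) ** 2


# For a large positive input, ReLU keeps growing and its gradient stays at 1,
# whereas sigmoid and tanh saturate and their gradients become very small.
for value in (-2.0, 0.5, 10.0):
    print(value, relu(value), relu_grad(value), sigmoid_grad(value), tanh_grad(value))
```

The non-vanishing gradient is one reason ReLU networks tend to train quickly, while the unbounded output is one reason the learning rate has to be chosen with care.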
In the training process of our proposed image retrieval system, we split the image dataset into two parts, i.e. 75% for training and 25% for testing. Adaptive Moment Estimation (Adam) [19] is used as the CNN optimizer with a learning rate of 0.0001. We employ the Mean Squared Error (MSE) [20] as the loss function. To avoid overfitting and to deal with the small size of the dataset, the proposed system uses data augmentation to increase the variation of the data. The training and testing processes are conducted on an Intel Core i5 2010 processor. In our experiment, the supervised CNN and CAE models require around 10 hours and 3 days, respectively, for the training process. At the end of training, the two deep learning-based models produce a set of image features that can be used as descriptors in Batik image retrieval. These image features are obtained from the last layer of the supervised CNN model and from the neural code layer of the CAE model.
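A 75%/25% split can be sketched as follows. The text does not say how the split is drawn, so this example assumes a per-class (stratified) split with a fixed random seed. The function name `split_per_class` is hypothetical, and the label list mirrors the dataset layout of 97 classes with 16 images each.

```python
import random


def split_per_class(labels, train_fraction=0.75, seed=0):
    """Split sample indices into train and test sets, class by class."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label][:]
        rng.shuffle(indices)
        cut = round(train_fraction * len(indices))
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return train, test


# 97 classes with 16 images each, i.e. 1552 images in total.
labels = [class_id for class_id in range(97) for _ in range(16)]
train_idx, test_idx = split_per_class(labels)

print(len(train_idx), len(test_idx))  # 1164 388
```

Under this assumption each class contributes 12 training images and 4 test images.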

Experimental Study
Extensive experiments were carried out to examine the performance of the proposed method in the Batik image retrieval system. First, we give a brief description of the image dataset used in the experiment. The effectiveness of the proposed method is then observed under visual investigation. Finally, objective performance comparisons are evaluated to examine the effect of different distance metrics and the superiority of the proposed method over former competing schemes.

Dataset
This experiment utilizes a set of Batik images, referred to as the Batik image dataset, covering various patterns, colors, and motifs. This image database consists of 1552 images and is divided into 97 image classes. Each class contains a set of images that are similar in their motifs and content appearance. Each image class contains 16 images, and all images belonging to the same class are considered similar. Figure 2 gives several examples of Batik images from the dataset.

Practical Application on Batik Image Retrieval
This sub-section evaluates the performance of the proposed method under visual investigation. The proposed method uses the image features obtained from the CNN and CAE approaches to perform Batik image retrieval. The correctness of the proposed method is judged by whether the system returns a set of retrieved images that belong to the same class as the query.
Figure 3 displays the retrieved images returned by the proposed image retrieval system using the CNN and CAE image features. We only show sixteen retrieved images, arranged in ascending order of their distance score. The similarity criterion is measured using the distance score, which is given at the top of each image. A smaller distance value indicates a higher similarity between the query and the target image in the database. As shown in this figure, the proposed method with the CNN feature returns all retrieved images correctly. By contrast, the proposed method with the CAE feature returns only six correct images.

Comparison of Proposed Methods with Different Distance Metrics
This sub-section reports the effect of different distance metrics on the proposed method. In this experiment, three distance metrics, namely Euclidean [21], Manhattan [22], and Bray-Curtis distance [23], are examined over two performance criteria, i.e. the precision and recall rates. These two scores are formally defined as:

$$P(I_q) = \frac{n_q}{L}, \qquad R(I_q) = \frac{n_q}{N_q}$$

where $P(I_q)$ and $R(I_q)$ denote the precision and recall rates, respectively, when image $I_q$ is used as the query image. The symbols $L$ and $N_q$ represent the number of retrieved images and the total number of images in the database that are relevant to image $I_q$, respectively. $n_q$ is the number of images relevant to the query image $I_q$ among the $L$ retrieved images.
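The precision and recall rates for a single query can be computed as in the following sketch. An image is treated as relevant when its class label equals the label of the query. The function name and the example labels are hypothetical.

```python
def precision_recall(retrieved_labels, query_label, num_relevant):
    """Precision and recall for one query.

    retrieved_labels: class labels of the retrieved images
    query_label:      class label of the query image
    num_relevant:     number of database images relevant to the query
    """
    num_retrieved = len(retrieved_labels)
    num_hits = sum(1 for label in retrieved_labels if label == query_label)
    return num_hits / num_retrieved, num_hits / num_relevant


# Four retrieved images, three of them from the query class "kawung";
# the class holds 16 images in total (hypothetical example).
precision, recall = precision_recall(
    ["kawung", "kawung", "parang", "kawung"], "kawung", num_relevant=16
)
print(precision, recall)  # 0.75 0.1875
```

When the number of retrieved images equals the number of relevant images, as with 16 retrieved images and 16 images per class, precision and recall take the same value.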
Figure 4 shows the performance comparison over various distance metrics in terms of precision and recall scores. All images in the database are chosen in turn as the query image. The number of retrieved images is set as $L = \{1, 2, \ldots, 16\}$. In most cases, the Bray-Curtis distance yields the best retrieval performance among the distance metrics for both the CNN and CAE image features. In the Batik image retrieval system, the Bray-Curtis distance is therefore a good candidate for measuring the similarity between the query and target images in the database.
Table 3 gives a more complete comparison of the proposed image retrieval system using the CNN and CAE features over the various distance metrics. This comparison is evaluated in terms of the average recall rate with the number of retrieved images set to $L = 16$. Herein, all images in the database are used in turn as the query image. As reported in this table, the proposed method with the supervised CNN delivers better performance than the CAE technique. The image feature obtained from the proposed supervised CNN method is more suitable for the Batik image retrieval task.
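The three distance metrics and the ranking step can be sketched in pure Python as follows. The Bray-Curtis distance is written in its common form, the sum of absolute differences divided by the sum of absolute sums. The function names, the `retrieve` helper, and the toy feature vectors are assumptions made for illustration.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def bray_curtis(a, b):
    numerator = sum(abs(x - y) for x, y in zip(a, b))
    denominator = sum(abs(x + y) for x, y in zip(a, b))
    return numerator / denominator if denominator else 0.0


def retrieve(query, database, distance, top_k=16):
    """Return the top_k (distance, index) pairs in ascending distance order."""
    scored = sorted((distance(query, feature), index)
                    for index, feature in enumerate(database))
    return scored[:top_k]


# Toy feature vectors (hypothetical); index 1 is identical to the query.
database = [[2, 4, 6], [1, 2, 3], [10, 0, 0]]
query = [1, 2, 3]

print(retrieve(query, database, manhattan, top_k=2))  # [(0, 1), (6, 0)]
print(bray_curtis([1, 2, 3], [2, 4, 6]))
```

Swapping the `distance` argument is enough to compare the metrics on the same set of features.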

Comparison against Former Methods
This sub-section summarizes the performance comparison between the proposed supervised CNN method and former existing schemes on the Batik image retrieval system. This comparison is conducted in terms of the Average Precision Recall (APR) score. The APR is formally defined as:

$$\mathrm{APR} = \frac{1}{N_t} \sum_{q=1}^{N_t} R(I_q)$$

where $R(I_q)$ and $N_t$ are the recall rate for query image $I_q$ and the total number of images in the database, respectively. Herein, all images in the database are used in turn as the query image, so that $N_t = 1552$. Thus, the APR value is the average over all query images. The number of retrieved images is set to 16, i.e. $L = 16$. To make a fair comparison, this experiment also investigates the dimensionality of the image feature. Table 4 reports the performance comparison in terms of feature dimensionality and APR value. As shown in this table, the proposed supervised CNN yields the best performance in comparison with the other competing schemes. It is noteworthy that the proposed method requires the lowest feature dimensionality, with the exception of the LBP [20] scheme. This lower dimensionality indicates a faster KNN search for an effective Batik image retrieval system. Thus, the proposed method can be considered for implementing a Batik image retrieval and classification system.

Method                     Feature dimensionality    APR (%)
LBP [24]                   59                        92.57
LTP [25]                   118                       95.65
CLBP [26]                  118                       95.17
LDP [27]                   236                       93.52
Gabor Filter [28]          144                       96.55
ODBTC+PSO [3]              384                       97.68
Proposed Supervised CNN    97                        99.47
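The averaging over all query images can be sketched as below. Every image is used once as the query, the database is ranked by distance, and the recall over the top retrieved images is averaged. The sketch assumes that the query image itself stays in the database and counts as a relevant result, which the text does not state. The function names and the toy data are hypothetical.

```python
def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def average_recall(features, labels, distance, top_k):
    """Average recall over all images, each used once as the query."""
    total = 0.0
    for query_feature, query_label in zip(features, labels):
        ranked = sorted(range(len(features)),
                        key=lambda i: distance(query_feature, features[i]))
        hits = sum(1 for i in ranked[:top_k] if labels[i] == query_label)
        total += hits / labels.count(query_label)
    return total / len(features)


# Toy database: two classes with two well-separated images each.
features = [[0.0], [0.1], [5.0], [5.1]]
labels = [0, 0, 1, 1]

print(average_recall(features, labels, manhattan, top_k=2))  # 1.0
print(average_recall(features, labels, manhattan, top_k=1))  # 0.5
```

In the toy data every query retrieves its own class first, so the score reaches 1.0 once the number of retrieved images equals the class size.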

Conclusions
A new content-based image retrieval system has been presented in this paper. On the Batik image database, the system achieves retrieval accuracies of 99.47% and 76.54% when the image feature is constructed from the CNN and CAE deep learning-based architectures, respectively. The CNN outperforms the former existing schemes in terms of retrieval accuracy. In addition, its feature dimensionality of 97 is lower than that of all the other compared methods except LBP. For future work, the CAE model can be modified slightly by adding fully connected layers before and after the neural code section. This modification may reduce the dimensionality of the image feature and, at the same time, improve the performance of Batik image retrieval.

Figure 2. Some image samples in the Batik dataset

Figure 4. Performance comparisons in terms of precision and recall rates over various distance metrics with the image features from: (a) CNN, and (b) CAE method

Table 2. The CAE Architecture for Batik Image Retrieval System

Table 3. APR of CNN and CAE

Table 4. APR Comparison with Former Methods