Human Re-identification with Global and Local Siamese Convolution Neural Network

Human re-identification is an important task in surveillance systems: determining whether the same person re-appears across multiple cameras with disjoint views. Appearance-based approaches are mostly used for this task because they are less constrained than biometric-based approaches. Most research works apply hand-crafted feature extractors followed by simple matching methods. However, designing a robust and stable feature requires expert knowledge, and tuning the features takes time. In this paper, we propose a global and local structure of Siamese Convolution Neural Network which automatically extracts features from input images to perform the human re-identification task. In addition, most current single-shot approaches to human re-identification do not consider the occlusion issue because tracking information is lacking. Therefore, we apply a decision fusion technique to combine global and local features for occlusion cases in single-shot approaches.

work accurately. Farenzena et al. [8] extracted three types of features, namely weighted color histograms, maximally stable color regions (MSCR) and recurrent high-structured patches (RHSP), to model human appearance. In addition, they exploited the symmetry property of human images to minimize the effect of view variation. Although their method achieves a certain robustness, extracting these three features is quite time consuming. To deal with pose variations, Gheissari et al. [12] fit a decomposable triangulated graph to capture the spatial distribution of the local descriptors over time. However, the drawback of their approach is similar to that of [3]: it is only applicable to humans seen from similar viewpoints.
Besides hand-crafted feature design, metric learning is another research direction for the human re-identification task. In metric learning based methods, simple features are extracted and the similarity distance is measured using a distance metric. The goal is to minimize the similarity distance for similar pairs and maximize the distance for dissimilar pairs. Dikmen et al. [10] proposed the Large Margin Nearest Neighbor (LMNN-R) algorithm to learn the most effective metric. Their metric learning method typically requires an enormous number of labeled target pairs. Zheng et al. [11] presented a Probabilistic Relative Distance Comparison (PRDC) model which focuses on maximizing the probability that a similar pair has a smaller distance. Nevertheless, their method does not take the noise in the features into consideration.
In a nutshell, designing and learning a robust and stable feature remains an open problem in human re-identification. One direction is to use deep learning methods, which integrate both feature extraction and metric learning in a single framework. To date, only a few papers address the re-identification task using deep learning methods. Li et al. [18] proposed a deep filter pairing neural network (FPNN) to handle photometric and geometric transformations. Their method was the first work to apply deep learning to the human re-identification task. Another deep learning work, which uses a Siamese Convolution Neural Network (SCNN), is presented in [19]. The SCNN architecture consists of two sub-networks which are connected by similarity layers. Three overlapping body parts are used to extract features and compute similarity metrics separately. The final similarity score is obtained by summing these three similarity metrics.

Research Approach
A Convolution Neural Network (CNN) consists of multiple layers that combine convolution layers, pooling layers and fully connected layers. A convolution layer detects the same features at different locations in the input image. An activation function is applied to the feature maps computed by the convolution layer. A pooling layer reduces the spatial resolution of each input feature map to achieve a certain degree of invariance to shift, distortion and small transformations. The fully connected layer is applied after the convolution and pooling layers. A CNN is one kind of deep neural network. It is based on three basic ideas: local receptive fields, shared weights and pooling. It has fewer parameters than a fully connected network because of its shared weights and local receptive fields. Therefore, training a CNN is faster than training a fully connected network. The advantage of a CNN is that it can automatically extract features from 2D input images.
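The parameter saving from shared weights and local receptive fields can be illustrated with a small count. The sketch below compares one convolution layer with a fully connected layer producing an output of the same size. The layer sizes (30 maps, 5x5 filters, a 128x48 input) are borrowed from the architecture described in this paper; the 3 input channels, stride 1 and absence of padding are assumptions made only for illustration.

```python
# Parameter counts for one convolution layer versus a fully connected layer
# that would produce an output of the same size.
# Assumptions (not stated in the text): 3 input channels, stride 1, no padding.

def conv_params(in_channels, out_maps, kernel):
    """Shared weights: one kernel x kernel x in_channels filter (plus a bias) per map."""
    return out_maps * (kernel * kernel * in_channels + 1)


def dense_params(n_inputs, n_outputs):
    """No sharing: every output unit has its own weight per input, plus a bias."""
    return n_outputs * (n_inputs + 1)


height, width, channels = 128, 48, 3
out_maps, kernel = 30, 5

# Spatial size of each output map for a stride-1 convolution without padding.
out_h, out_w = height - kernel + 1, width - kernel + 1

conv = conv_params(channels, out_maps, kernel)
dense = dense_params(height * width * channels, out_maps * out_h * out_w)

print(f"convolution layer:          {conv:,} parameters")
print(f"fully connected equivalent: {dense:,} parameters")
```

Under these assumptions the convolution layer needs a few thousand parameters, while the fully connected equivalent needs billions.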
Subjects in the training set are generally different from those in the testing set; therefore, SCNN is used to cast the person re-identification task as binary classification in a "sample pair with label" mode, as shown in Figure 1. In this paper, we propose an SCNN with global and local structures to perform the person re-identification task. In our proposed architecture, the whole image of 128x48 pixels is used for the global representation, while for the local representation the input image is divided into 4 horizontal stripes of 32x48 pixels and 2 vertical stripes of 128x24 pixels, as shown in Figure 2. Each part is used as input to a CNN. An image pair from different cameras is passed through the SCNN, which consists of 2 identical CNNs with common parameters. A contrastive cost function is applied to decrease the distance for similar pairs and increase the distance for dissimilar pairs. In the end, seven features are extracted from the global and local parts. These features are used to compute similarity metrics between the probe and gallery sets.
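The part layout can be sketched in a few lines. The example below lists the seven regions (1 global, 4 horizontal, 2 vertical) as pixel boxes and crops them from a placeholder image. The function names, the part names and the top-to-bottom, left-to-right ordering are hypothetical; the stripe sizes follow the text, which implies non-overlapping stripes since 4 x 32 = 128 and 2 x 24 = 48.

```python
def part_boxes(height=128, width=48, n_horizontal=4, n_vertical=2):
    """Return (name, row_start, row_end, col_start, col_end) for every part."""
    boxes = [("global", 0, height, 0, width)]
    stripe_h = height // n_horizontal          # 32 rows per horizontal stripe
    for i in range(n_horizontal):
        boxes.append((f"horizontal_{i + 1}", i * stripe_h, (i + 1) * stripe_h, 0, width))
    stripe_w = width // n_vertical             # 24 columns per vertical stripe
    for j in range(n_vertical):
        boxes.append((f"vertical_{j + 1}", 0, height, j * stripe_w, (j + 1) * stripe_w))
    return boxes


def crop(image, box):
    """Crop one part from an image stored as a list of pixel rows."""
    _, r0, r1, c0, c1 = box
    return [row[c0:c1] for row in image[r0:r1]]


# Placeholder 128x48 single-channel image filled with zeros.
image = [[0] * 48 for _ in range(128)]
parts = {box[0]: crop(image, box) for box in part_boxes()}

for name, part in parts.items():
    print(f"{name}: {len(part)}x{len(part[0])}")
```

Each of the seven crops would then be fed to its own CNN to obtain one feature vector per part.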

Convolution Neural Network
Figure 3 illustrates the proposed CNN architecture for the global part. It is composed of 7 layers: 3 convolution layers (C1, C3 and C5), 3 pooling layers (S2, S4 and S6) and 1 fully connected layer (F7). The numbers of feature maps in C1, C3 and C5 are 30, 60 and 100, respectively. Each pooling layer has exactly the same number of feature maps as the preceding convolution layer, but the size of each map is reduced by half. The fully connected layer has 120 output neurons, which form the final feature vector. The filter sizes for C1, C3 and C5 are 5x5, 7x7 and 3x3, respectively.
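To make the layer sizes concrete, the sketch below traces the feature-map dimensions through C1 to S6 for the 128x48 global input. The filter sizes, map counts and the halving by pooling come from the text; stride-1 convolution without padding and 2x2 pooling with stride 2 are assumptions, so the resulting numbers are illustrative rather than the exact sizes in Figure 3.

```python
def conv_out(size, kernel):
    """Output length of a stride-1 convolution without padding (assumed)."""
    return size - kernel + 1


def pool_out(size):
    """Output length after pooling that halves the size (2x2, stride 2 assumed)."""
    return size // 2


# (layer name, number of feature maps, filter size) as given in the text.
conv_layers = [("C1", 30, 5), ("C3", 60, 7), ("C5", 100, 3)]

h, w = 128, 48
shapes = []
for index, (name, maps, kernel) in enumerate(conv_layers):
    h, w = conv_out(h, kernel), conv_out(w, kernel)
    shapes.append((name, maps, h, w))
    h, w = pool_out(h), pool_out(w)
    shapes.append((f"S{2 * index + 2}", maps, h, w))

for name, maps, height, width in shapes:
    print(f"{name}: {maps} maps of {height}x{width}")

# Inputs to the fully connected layer F7, which outputs 120 values.
flattened = shapes[-1][1] * shapes[-1][2] * shapes[-1][3]
print(f"F7: {flattened} inputs -> 120 outputs")
```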

Siamese Convolution Neural Network
SCNN is applied on image pairs as shown in Figure 5. Both images pass through the same CNN architecture with common parameters to obtain their feature vectors. The L1 norm (Manhattan distance) is used to compute the similarity metric, as in Equation 1:

$$E_W(X_1, X_2) = \lVert G_W(X_1) - G_W(X_2) \rVert_1 \qquad (1)$$

where $G_W(X_1)$ and $G_W(X_2)$ are the feature vectors obtained after passing the two images through the CNN, $\lVert \cdot \rVert_1$ is the similarity measure (the Manhattan distance), and $E_W$ corresponds to the energy between the images. For a similar image pair the energy is low; for a dissimilar image pair the energy is high.
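As a small illustration of the energy between two images, the sketch below computes the Manhattan (L1) distance between two feature vectors. The function name and the toy vectors are hypothetical; in the proposed method the vectors would be the 120-dimensional outputs of the two weight-sharing CNN branches.

```python
def energy(features_a, features_b):
    """Manhattan (L1) distance between two feature vectors of equal length."""
    if len(features_a) != len(features_b):
        raise ValueError("feature vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(features_a, features_b))


# Toy 3-dimensional vectors standing in for CNN outputs.
probe = [1.0, 2.0, 3.0]
close_match = [1.0, 2.0, 4.0]
far_match = [5.0, 0.0, 7.0]

print(energy(probe, close_match))  # small energy, as expected for a similar pair
print(energy(probe, far_match))    # larger energy, as expected for a dissimilar pair
```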

Cost Function
The contrastive cost function shown in Equation 2 is applied to discriminate between similar and dissimilar pairs. Minimizing it decreases the distance between similar pairs and increases the distance between dissimilar pairs.
In Equation 2, Q represents the total number of output neurons and Y is the label of the image pair: Y = 1 when the image pair is a similar pair and Y = 0 when it is a dissimilar pair.
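The behaviour of a contrastive cost can be sketched as follows. This example assumes the exponential form introduced by Chopra, Hadsell and LeCun for Siamese networks, with the label convention flipped so that Y = 1 marks a similar pair as in this paper. Whether Equation 2 has exactly this form is an assumption, and the function name is hypothetical. Q is set to 120 in the usage lines, following the description of Q as the number of output neurons.

```python
import math


def contrastive_cost(energy, label, q):
    """Contrastive cost for one pair (assumed form, see the lead-in).

    energy: distance between the two feature vectors (non-negative)
    label:  1 for a similar pair, 0 for a dissimilar pair
    q:      positive constant
    """
    similar_term = (2.0 / q) * energy ** 2                       # grows with distance
    dissimilar_term = 2.0 * q * math.exp(-2.77 * energy / q)     # shrinks with distance
    return label * similar_term + (1 - label) * dissimilar_term


q = 120  # number of output neurons of F7
for e in (1.0, 10.0, 100.0):
    print(f"energy={e:6.1f}  similar cost={contrastive_cost(e, 1, q):8.3f}  "
          f"dissimilar cost={contrastive_cost(e, 0, q):8.3f}")
```

With this form, lowering the cost pulls similar pairs together (their cost rises with the energy) and pushes dissimilar pairs apart (their cost falls as the energy grows).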

The Proposed Decision Fusion
One set of feature vectors is computed from positive pairs and another set from negative pairs. These feature vectors are stored in a database. Decision fusion is applied after the feature vectors have been formed for the global and local parts. It involves three steps: distance computation, weight computation, and a combined decision that yields the final distances.
In the distance computation step, distances between the feature vectors of the training and testing sets are computed using the Euclidean distance; the minimum distances for the similar class and the dissimilar class are then determined by Equation 3.
In Equation 3, the distance measure is evaluated between the v-th feature vector of the local parts for the testing image and the v-th feature vector of the local parts for the training images. The distance matrix is formed based on Equation 4.
In the weight computation step, the weight of the v-th local feature vector is computed using Equation 5. Equation 5 refers to the two distance values in each row of the distance matrix: the lower distance value and the other distance value. The weight represents the degree of importance of the v-th local feature vector: the higher the weight, the lower the importance of the v-th local feature. This strategy ensures that less discriminant feature vectors are assigned higher weights. The weights are then normalized. Finally, the final distances for the similar class and the dissimilar class are computed using Equation 6.
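The three steps of decision fusion can be sketched as below. Only the overall flow (minimum Euclidean distance per class, a weight per part derived from the lower and the other distance in each row, normalization, combined final distances) is taken from the text. The concrete weight formula (lower distance divided by the other distance), the normalization to a sum of 1, the weighted sum for the final distances, and all function names are assumptions chosen for illustration, not the paper's Equations 5 and 6.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def min_distances(test_vec, similar_train, dissimilar_train):
    """One row of the distance matrix: minimum distance to each class."""
    d_sim = min(euclidean(test_vec, t) for t in similar_train)
    d_dis = min(euclidean(test_vec, t) for t in dissimilar_train)
    return [d_sim, d_dis]


def fuse(distance_matrix, eps=1e-12):
    """Weight each part and combine the per-part distances (assumed formulas)."""
    # ASSUMPTION: weight = lower distance / other distance. A ratio near 1
    # means the part barely separates the two classes (less discriminant).
    raw = [min(row) / (max(row) + eps) for row in distance_matrix]
    total = sum(raw)
    if total > 0:
        weights = [r / total for r in raw]          # ASSUMPTION: weights sum to 1
    else:
        weights = [1.0 / len(raw)] * len(raw)
    # ASSUMPTION: final class distances are weighted sums over the parts.
    final_sim = sum(w * row[0] for w, row in zip(weights, distance_matrix))
    final_dis = sum(w * row[1] for w, row in zip(weights, distance_matrix))
    return weights, final_sim, final_dis


# Toy example with two parts and 2-dimensional feature vectors.
rows = [
    min_distances([0.0, 0.0], [[1.0, 0.0]], [[4.0, 0.0]]),   # discriminant part
    min_distances([0.0, 0.0], [[0.0, 2.0]], [[2.0, 0.0]]),   # ambiguous part
]
weights, final_sim, final_dis = fuse(rows)
print(rows, weights, final_sim, final_dis)
```

The pair would be labelled similar when the final distance for the similar class is the smaller of the two.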

Results and Analysis
The Viewpoint Invariant Pedestrian Recognition (VIPeR) dataset is used in our experiments. VIPeR is the most widely used benchmark in the field of human re-identification because it is quite challenging: it suffers from viewpoint and illumination changes between the two cameras, which have disjoint views. Owing to constraints on the available computation power, we train on 100 pairs of human images and test on 20 pairs of human images. Of the 100 training pairs, 50 are positive training pairs.

Implementation Details
Input images undergo simple pre-processing steps before being passed to the SCNN architecture. A color equalization method is applied to the raw image to reduce illumination variations between different cameras. The equalized image is then normalized to the range 0 to 1. Many activation functions can be used in a CNN, such as sigmoid, tanh, hypertanh and the Rectified Linear Unit (ReLU). In our proposed work, we chose ReLU as the activation function because it does not suffer from the vanishing gradient problem in the way the sigmoid and tanh functions do. ReLU is among the most popular activation functions in deep networks nowadays.
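The pre-processing steps can be sketched as follows. The text does not specify which color equalization method is used, so this example assumes plain histogram equalization applied to each color channel independently, followed by division by 255 to reach the range 0 to 1. The function names are hypothetical.

```python
def equalize_channel(channel, levels=256):
    """Histogram-equalize one channel (list of rows of ints in [0, levels - 1])."""
    flat = [p for row in channel for p in row]
    hist = [0] * levels
    for p in flat:
        hist[p] += 1
    # Cumulative distribution of pixel intensities.
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)
    cdf_min = next(c for c in cdf if c > 0)
    denom = len(flat) - cdf_min
    if denom == 0:                      # constant channel: nothing to equalize
        return [row[:] for row in channel]
    lut = [max(0, round((c - cdf_min) / denom * (levels - 1))) for c in cdf]
    return [[lut[p] for p in row] for row in channel]


def normalize(channel, max_value=255):
    """Scale integer intensities to floats in [0, 1]."""
    return [[p / max_value for p in row] for row in channel]


def preprocess(channels):
    """Equalize, then normalize, every color channel of an image."""
    return [normalize(equalize_channel(ch)) for ch in channels]


# Toy 2x2 single-channel image with a narrow intensity range.
red = [[50, 50], [100, 200]]
prepared = preprocess([red])[0]
print(prepared)
```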
A pre-trained model is learned using the CUHK-02 dataset. The parameters of the deep network are initialized from the pre-trained model, and the whole network is then fine-tuned by backpropagation on the VIPeR dataset.

Experimental Results
The Cumulative Matching Characteristic (CMC) curve and the normalized Area Under Curve (nAUC) score of the CMC curve are used to evaluate the performance of our proposed method. The CMC represents the expectation of finding the correct match within the top n matches, while the nAUC summarizes how well the method performs overall.
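A minimal sketch of both measures is given below. It assumes that each probe has exactly one correct match in the gallery and that the nAUC is the mean of the CMC values over all ranks; the paper's exact normalization is not stated, so this is one plausible choice. The function names and the toy distances are hypothetical.

```python
def match_rank(distances, true_index):
    """1-based rank of the correct gallery item when sorted by ascending distance."""
    return 1 + sum(1 for d in distances if d < distances[true_index])


def cmc_curve(ranks, gallery_size):
    """CMC(n): fraction of probes whose correct match is within the top n."""
    total = len(ranks)
    return [sum(1 for r in ranks if r <= n) / total for n in range(1, gallery_size + 1)]


def nauc(cmc):
    """Normalized area under the CMC curve (assumed: mean CMC value over ranks)."""
    return sum(cmc) / len(cmc)


# Toy example: 4 probes, gallery of 4; each row holds distances to the gallery,
# and the correct match of probe i is gallery item i.
distance_rows = [
    [0.1, 0.9, 0.8, 0.7],
    [0.6, 0.2, 0.9, 0.8],
    [0.3, 0.9, 0.5, 0.7],
    [0.1, 0.2, 0.3, 0.9],
]
ranks = [match_rank(row, i) for i, row in enumerate(distance_rows)]
cmc = cmc_curve(ranks, gallery_size=4)
print(ranks, cmc, nauc(cmc))
```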
In Table 1, the nAUC for the global and local SCNN is 95.75%, which is better than using the local-part SCNN (95.50%) or the global-part SCNN (91%) alone. In the occlusion case, the nAUC for the global and local SCNN (77.5%) is still better than that of the local SCNN (75.50%) and the global SCNN (76.5%) (Table 2). The global SCNN provides feature vectors extracted from the whole image, while the local SCNN provides feature vectors extracted from parts of the image. The feature vectors extracted by the global and local SCNN are complementary to each other.
Table 3 shows that the nAUC of our proposed method (95.75%) is better than that of the DML method (82%) when there is no occlusion. In occlusion cases, the nAUC of our proposed method (77.5%) is still significantly better than that of the DML method (47.50%). Our proposed method performs better under occlusion because of the decision fusion of the global and local SCNN. Decision fusion fuses the global and local features, and a weighting scheme is applied to balance the importance of global and local information. When there is occlusion in the image, the feature vectors extracted from the occluded part are less discriminant and are therefore assigned higher weights. More reliable, discriminant feature vectors are assigned lower weights.

Conclusion
In a nutshell, our proposed method, which integrates the global and local structures of SCNN with the proposed decision fusion, has shown good results on the human re-identification task. Our experiments have shown that fusing the global and local SCNN is better than applying the local part or the global part alone. The proposed method was also evaluated in the case of occlusion and performed better than existing work.

Figure 1. Example of similar and dissimilar pairs used for the training set; the test set is formed in the same way for binary classification

Figure 2. Global and local parts of the input image

Figure 3. CNN architecture for global part

Figure 4. CNN architecture for local part

Figure 5. Illustration of Siamese Convolution Neural Network proposed in this work

TELKOMNIKA
ISSN: 1693-6930  Human Re-identification with Global and Local Siamese Convolution Neural… (K.B. Low)

The positive training pairs are images of the same human captured with the same cameras. The 50 negative training pairs are formed by randomly pairing images of different humans between the two cameras.

Table 1. The performance of our proposed method when there is no occlusion.

Table 2. The performance of our proposed method when occlusion occurs.

Table 3. Comparison between the proposed method and the DML method.