Deep hypersphere embedding for real-time face recognition

ABSTRACT


INTRODUCTION
With the emerging development of robots capable of human-computer interaction, recognizing people has attracted immense attention in computer vision and pattern recognition, since it has wide applications in finance, the military, public security and daily life. Among the various biometrics used for person recognition, the face is one of the most popular, since this ubiquitous biometric can be acquired in unconstrained environments while providing strong discriminative features for recognition [1]. Over the years, many breakthroughs have contributed to the success of face recognition technology, helped by advanced network architectures [2][3][4][5] and discriminative approaches [2]. Face recognition begins by extracting the coordinates of features such as the width of the mouth, the width of the eyes and the pupils, then compares the result with the measurements (facial metrics) stored in the database and returns the closest record [3]. A large body of research has sought to improve local descriptors, feature transformations and pre-processing in face recognition, such as linear subspaces [4], manifolds [5,6] and sparse representation [5]. However, each of these approaches targets only one aspect of the constraints on facial features, and they have improved face recognition accuracy slowly [1]. Furthermore, illumination, expression and pose are the three best-known challenges in face recognition (FR).
In recent years, the research landscape in face recognition has been significantly reshaped by breakthroughs in deep learning such as the DeepFace method. Deep learning applies multiple processing layers to learn representations of data with multiple levels of feature extraction [7]. The most popular deep learning architecture is the convolutional neural network (CNN), which addresses significant problems in computer vision such as image classification, segmentation and object detection [8]. Many face recognition applications seek a low-dimensional representation that generalizes well to new faces that the neural network was not trained on, but the representation is only a consequence of training a network for high-accuracy classification on its training data. One challenge of this approach is that the representation is difficult to use, because faces of the same person are not necessarily clustered in a way that classification algorithms can take advantage of [9].
This paper discusses the computer vision of robots involving a face recognition process that incorporates FaceNet, a unified embedding for face recognition and clustering that learns to cluster representations of the same person. This alleviates training difficulties and can significantly improve FR accuracy. The surveillance system is implemented in Python. We propose a system and a method for target identification using artificial neural networks integrated into robotic vision. The contributions of this paper are summarized as follows:
a. We present a security surveillance system that authenticates a person seen by the robot's camera.
b. We provide a method for an equivalent virtual instrument with the same capability and functionality, containing (1) a digital filter used for image processing and (2) a machine learning algorithm that uses artificial neural networks for target identification by means of face vector identification.
c. We utilize a method with a unified face image representation, which is necessary for better recognition of face images.
d. The system can be adapted to any existing surveillance system, requires low-cost memory storage, has data logging features and needs little maintenance.

RESEARCH METHOD
The input of the system comes from the feeds of the wireless cameras embedded in the robots, which are processed and examined by the system. Once a face image is detected in a camera feed, the system decides whether the detected face is recognized. If the face is recognized, the system logs the date, time and camera number; otherwise, the system still logs the date and time, and activates the alarm and notification system.
Algorithm 1. Algorithm of the system
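The decision flow described above can be sketched in Python as follows. This is a minimal illustration rather than the implemented system: the names `LogEntry` and `handle_detection`, the record layout and the boolean `recognized` input are assumptions introduced for this sketch.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class LogEntry:
    """One data-logging record (hypothetical layout)."""
    timestamp: datetime      # date and time of the detection
    camera: Optional[int]    # camera number, logged for recognized faces
    alarm: bool              # True when the alarm and notification are triggered


def handle_detection(recognized: bool, camera: int, now: datetime) -> LogEntry:
    """Apply the logging rule to one detected face."""
    if recognized:
        # Recognized face: log date, time and camera number.
        return LogEntry(timestamp=now, camera=camera, alarm=False)
    # Unrecognized face: still log date and time, and raise the alarm.
    return LogEntry(timestamp=now, camera=None, alarm=True)
```

Keeping the rule in one pure function makes it easy to check independently of the camera and the neural networks.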

Signal conditioning
Figure 1 shows the general overview of the proposed system. To analyze, measure and manipulate the data feeds from the camera footage, the analog signals must be converted into digital signals using digital signal processing theory. The analog-to-digital converter (ADC) is responsible for sampling, quantizing and encoding the continuous-amplitude analog signal into a digital signal that is discrete in both time and amplitude.
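The three ADC steps (sampling, quantizing and encoding) can be illustrated with a short sketch. It assumes a uniform quantizer over a fixed input range; the function name `sample_and_quantize` and all parameter values are hypothetical and are not taken from the hardware used in this work.

```python
import math


def sample_and_quantize(signal, duration_s, sample_rate_hz, bits,
                        v_min=-1.0, v_max=1.0):
    """Sample a continuous function of time and map each sample to an integer code."""
    levels = 2 ** bits
    step = (v_max - v_min) / (levels - 1)          # width of one quantization step
    n_samples = int(duration_s * sample_rate_hz)
    codes = []
    for n in range(n_samples):
        t = n / sample_rate_hz                     # sampling: discrete time
        v = min(max(signal(t), v_min), v_max)      # clip to the converter's range
        codes.append(round((v - v_min) / step))    # quantizing and encoding
    return codes


# Example: a 1 Hz sine wave sampled at 8 Hz by a 3-bit converter.
codes = sample_and_quantize(lambda t: math.sin(2 * math.pi * t), 1.0, 8, 3)
```

Each returned code is an integer between 0 and 2^bits - 1, so the output is discrete in both time and amplitude.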
Variable bit-rate data streams from the different wireless cameras are integrated into one constant-capacity signal by a time division multiplexer (TDM), which provides a higher bit-rate flow of data [10]. Subsequently, the signals are digitally filtered through digital signal processing (DSP) to prepare the images for face detection and data logging using artificial neural networks.
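As an illustration of time division multiplexing, the sketch below interleaves frames from several camera streams into fixed, repeating time slots. It is a simplified software model under the assumption of one frame per slot; `tdm_multiplex`, `tdm_demultiplex` and the idle marker are hypothetical names, not part of the described hardware.

```python
def tdm_multiplex(streams, idle=None):
    """Interleave frames from several streams into fixed, repeating time slots."""
    n_slots = max((len(s) for s in streams), default=0)
    output = []
    for slot in range(n_slots):
        for channel, stream in enumerate(streams):
            # A channel with nothing to send fills its slot with an idle marker,
            # which keeps the output capacity constant.
            frame = stream[slot] if slot < len(stream) else idle
            output.append((channel, frame))
    return output


def tdm_demultiplex(multiplexed, n_channels, idle=None):
    """Recover the per-channel streams from the interleaved sequence."""
    streams = [[] for _ in range(n_channels)]
    for channel, frame in multiplexed:
        if frame is not idle:
            streams[channel].append(frame)
    return streams
```

With three cameras, every cycle carries exactly three slots regardless of how many frames each camera currently has to send.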

Image processing
Multithreading and GPU-based processing were used to perform the image processing. The image processing architecture used in this research is shown in Figure 2, and the processing is detailed in the following sections.

Multi-task cascaded CNN
In the face detection phase, our method is based on the multi-task cascaded CNN (MTCNN) for joint face detection and face alignment [11], which detects faces within the camera footage in real time. It first resizes the image to different scales to build an image pyramid. MTCNN then proceeds in three stages: the proposal network (P-Net) obtains candidate facial windows and their bounding box regression vectors; the refine network (R-Net) rejects a large number of false candidates, performs calibration with bounding box regression and conducts non-maximum suppression (NMS); and lastly, the output network (O-Net) produces the final bounding box and the facial landmark positions [12]. This last stage aims to identify face regions with more supervision. Additionally, MTCNN runs in multiple threads and can detect faces effectively over a range of distances from the camera, which makes it a good fit for our application.
TELKOMNIKA Telecommun Comput El Control. Deep hypersphere embedding for real-time face recognition (Ryann Alimuin)
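Two of the MTCNN steps, building the image pyramid and non-maximum suppression, can be sketched in plain Python. This is an illustrative sketch only: the 12-pixel window, the minimum face size of 20 pixels, the scale factor of 0.709 and the IoU threshold of 0.5 are assumed values commonly seen in MTCNN implementations, not settings reported in this paper.

```python
def pyramid_scales(width, height, min_face=20, cell=12, factor=0.709):
    """Scales at which the image is resized until it is smaller than the window."""
    scales = []
    scale = cell / min_face
    side = min(width, height) * scale
    while side >= cell:
        scales.append(scale)
        scale *= factor
        side *= factor
    return scales


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Keep the best-scoring box of each overlapping group; return kept indices."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```

The greedy loop keeps a candidate only if it does not overlap too strongly with any higher-scoring box that was already kept.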

a. Training
In training, the CNN detector leverages the following three tasks.
- Face classification. It uses the cross-entropy loss for each sample $x_i$.
- Bounding box regression. The learning objective is formulated as a regression problem, and the method uses the Euclidean loss for each sample $x_i$.
- Facial landmark localization. As with the bounding box, facial landmark detection is formulated as a regression problem and minimizes the Euclidean loss.
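The two kinds of loss named in the three tasks above can be written out as a minimal per-sample sketch. It assumes that the Euclidean loss is the squared Euclidean distance between prediction and target; the function names are hypothetical.

```python
import math


def face_classification_loss(p_face, is_face):
    """Cross-entropy for one sample; p_face is the predicted probability of a face."""
    eps = 1e-12                                   # guards against log(0)
    p = min(max(p_face, eps), 1.0 - eps)
    return -(is_face * math.log(p) + (1 - is_face) * math.log(1.0 - p))


def squared_euclidean_loss(predicted, target):
    """Squared Euclidean distance, usable for box offsets or landmark coordinates."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target))
```

The same regression loss serves both the bounding box and the landmark tasks; only the length of the target vector differs.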
Figure 1. General overview of the system
Figure 2. Image processing architecture

Face recognition
The above-mentioned problem of the invariance of face representation over time can be addressed by finding asymptotic packing bounds, that is, how many non-overlapping regions can fit within a face representation space or hypersphere. The geometrical structure can be described as follows: the lower bound represents the low-dimensional population manifold embedded in a high-dimensional space, located on the upper-bound hyper-ellipsoid, which is clustered into class-specific hyper-ellipsoids [13]. The invariance of the face representation is determined by the number of identities packed per hyper-ellipsoid.
a. FaceNet
In this paper, we integrated a method called FaceNet for the face recognition phase. FaceNet is a unified embedding for face recognition and clustering that directly learns a mapping from face images to a compact Euclidean space in which distances correspond to face similarity. The system used FaceNet to map face features from the input images taken from the cameras into a 512-dimensional Euclidean space vector [15]. The embedding is represented by (5), with an input image $x$ embedded into the $d$-dimensional Euclidean space $\mathbb{R}^d$, and is constrained as in (6) [11].
where $x_i^a$ represents an anchor image of a specific person, $x_i^p$ is a positive representation of the same person, $x_i^n$ is a negative representation of any other person, and $\alpha$ is the margin between the positive and negative pairs. The triplet loss minimizes the squared distance between the anchor and the positive while maximizing the squared distance between the anchor and the negative [17].
c. Harmonic embedding and triplet loss
The system also provides a function to compare a new training dataset with the existing datasets in the gallery. This function is suited to large-scale, divergent datasets that require retraining the subject multiple times.
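The triplet loss described earlier in this section can be sketched for a single triplet as follows. This is a minimal illustration assuming the usual hinge form of the loss and embeddings normalized to unit length; the margin value of 0.2 and the function names are assumptions, not values reported in this paper.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so that it lies on the hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_distance(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the negative is farther from the anchor than the positive by the margin."""
    d_pos = squared_distance(anchor, positive)
    d_neg = squared_distance(anchor, negative)
    return max(d_pos - d_neg + margin, 0.0)
```

A triplet contributes to training only while the loss is positive, that is, while the negative is not yet separated from the anchor by at least the margin.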

Serial communication
A USB-to-serial adapter, also referred to as a USB serial converter or RS232 adapter, was used for serial communication as the interface between the camera and the computer [18]. It is a small electronic device that converts a USB signal to serial RS232 data signals [19], the type of signal found on many older PCs and referred to as a serial COM port. A USB-to-serial adapter typically converts between USB and RS232, RS485, RS422 or TCP signals; however, some adapters have other special conversion features such as custom baud rates or high-speed operation [20,21].

IMPLEMENTATION
The following are the components of the whole system:
a. Wireless camera
b. NVR
c. USB serial converter or RS232
d. Laptop computer
This research uses a camera with the specifications shown in Table 1 and a computer with the specifications shown in Table 2.

TECHNICAL RESULT
In this section, we evaluate the effectiveness and performance of the proposed system.

Graphical user interface (GUI)
The user interface of the system shows the area where the feeds of the 4 cameras are displayed. It also provides preview, database, cloud storage, local storage, about and data logging sections. The data logging section displays the detections from the cameras and whether the recognized faces are authorized or unauthorized, together with the identity of the detected person. A sample of the 12x12 pixel face datasets is shown in Figure 3.

Training data
The training set used in the experiment consists of face images taken from sample subjects, whose facial features are captured multiple times. To minimize face variations, each subject is captured without expression, then asked to tilt their face to the left and slowly to the right, and to move their face slowly upward and downward. The acquisition process produces 200 sets of 12x12 pixel face images per person for the training set and an additional 10 images as samples for testing.

Detection, extraction and recognition time
Detection is the process in which the system searches for faces within an image and returns their coordinates [22]. Extraction is the process in which the system filters the detected face to remove unnecessary details of the image [23]. Lastly, recognition is the process in which the system identifies the detected face based on the datasets [24]. Over 10 iterations, we measured the detection, extraction and recognition times. The average detection time is 100.8 ms, the extraction time is 91.7 ms and the recognition time is 0.8 ms.
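One way to obtain such per-stage averages is sketched below. It is a hypothetical measurement helper, not the procedure used in the experiment: `average_time_ms` and the `stage` callable (standing in for detection, extraction or recognition) are assumptions.

```python
import time


def average_time_ms(stage, frame, iterations=10):
    """Run one pipeline stage repeatedly and return its mean wall-clock time in ms."""
    total = 0.0
    for _ in range(iterations):
        start = time.perf_counter()
        stage(frame)                               # the stage under test
        total += time.perf_counter() - start
    return 1000.0 * total / iterations
```

Timing each stage separately on the same frame shows which stage dominates the end-to-end latency.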

Percent accuracy per person
Figure 4 shows the accuracy of the system when 5 different people were tested one at a time. The system classifies accurately, with an average accuracy of 86%.

Evaluation of the system's performance
We tested the system's performance and limitations by varying the distance of the subjects from the camera from 1 to 7 ft in 1 ft increments, while also varying the number of people being recognized simultaneously. Table 3 shows the resulting accuracy values. The experiment shows that increasing the number of subjects recognized simultaneously greatly affects the performance of the system, while increasing the distance of the subject from the camera has a minimal effect. Figure 5 shows an average accuracy of 50% for multiple faces at varied distances.

Confusion matrix
To evaluate the performance of the classifier, a confusion matrix was used. It visualizes where the classifier is confused when making predictions as training subjects are added to the dataset. Figure 6 shows the classifier's true positive rate and misclassification rate, obtained by temporarily splitting the dataset into 33% test and 67% training sets [25]. It shows the normalized confusion matrices for 3 and 4 known persons, respectively.
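A normalized confusion matrix of this kind can be computed as in the sketch below. It is a plain-Python illustration with hypothetical function names and toy labels; it does not reproduce the data behind Figure 6.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Count predictions: rows are true classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        matrix[index[t]][index[p]] += 1
    return matrix


def normalize_rows(matrix):
    """Divide each row by its total so the diagonal holds per-class true positive rates."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized
```

After row normalization, each diagonal entry is the true positive rate of one person and the off-diagonal entries of that row are the misclassification rates.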

Comparison of the system's performance with other deep learning algorithms
Varying the algorithm used in the system produces minimal variation in its performance. Figure 7 shows the system's performance in terms of accuracy, sensitivity and specificity using DeepFace, SphereFace and MTCNN. MTCNN with FaceNet stands out compared with the other two algorithms.
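The three metrics can be computed from the counts of true and false positives and negatives using their standard definitions. The sketch below is illustrative; the function name and the example counts are hypothetical and are not results from this study.

```python
def binary_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity and specificity from confusion counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,   # true positive rate
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,   # true negative rate
    }


# Hypothetical counts for illustration only.
metrics = binary_metrics(tp=8, fp=1, tn=9, fn=2)
```

Reporting sensitivity and specificity alongside accuracy separates missed authorized faces from falsely accepted unauthorized ones.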

CONCLUSION
Identifying a person with a surveillance system embedded in robots offers significant security advantages compared with a traditional surveillance system. It can save a large amount of storage, and the corresponding costs, by storing only the frames containing face images detected by the system. It offers a more secure environment, as it sends alerts about burglaries as they happen in real time. It serves as a biometric record of a person's identity when entering and leaving an establishment, and can be very useful in locating and identifying criminals around the city. The experiments uncover the system's limitation in detecting and identifying multiple persons simultaneously at a specified distance. The system achieved an average accuracy of only 50% when dealing with multiple persons, compared with 86% when identifying different faces one at a time. For future designers who wish to pursue this study, we highly recommend improving the system's accuracy by trying other types of algorithms, since many options are available. We also recommend using a camera and computer with better specifications than those mentioned in section 3.

Figure 7. System's performance test with other deep learning algorithms



Table 2. Laptop computer specs

Table 3. Accuracy testing