Comparative Study of VGG16 and MobileNetV2 for Masked Face Recognition

Indonesia is one of the countries affected by the coronavirus pandemic, which has claimed many lives. The pandemic forces us to wear masks daily, especially at work, to break the chain of transmission of the coronavirus. Before the pandemic, face recognition for attendance used the entire face as input, so the results were accurate. During the pandemic, however, all employees wear masks, including during attendance, which can reduce recognition accuracy. In this research, we use a deep learning technique to recognize masked faces. We propose using pre-trained transfer learning models to perform feature extraction and classification on masked face images. Transfer learning is used because of the small amount of available data. We analyzed two transfer learning models, VGG16 and MobileNetV2, and evaluated each using the batch size and number of epochs as parameters. The best model for each architecture was obtained with a batch size of 32 and 50 epochs. The results show that the MobileNetV2 model was more accurate than VGG16, with an accuracy of 95.42%. This study provides an overview of the use of transfer learning techniques for masked face recognition.


INTRODUCTION
As of March 2021, more than 1,390,000 people in Indonesia had been infected with the coronavirus, which originated in Wuhan, China, and more than 37,000 had died [1]. The Indonesian government has taken numerous measures to limit the spread of the coronavirus, including mask-wearing, hand washing, and social distancing. The use of masks is essential in preventing the coronavirus's transmission, protecting ourselves and others [2]. In offices, schools, and universities, employees who are about to start work always check their body temperature and wear masks correctly. A suitable mask is worn to protect oneself from other people's coughing and sneezing, as well as from the surrounding air, which may contain viruses in the form of aerosols that remain suspended for an extended period [3].
Air contaminated with the coronavirus is dangerous for people who move around in it, especially for workers. It is a particular concern in offices, where one of the first activities of the workday is recording attendance. Some agencies and offices record attendance using facial biometric data [4] [5]. Biometric data are biological data that humans possess from birth. Biometric data can provide information about an individual's identity based on physical characteristics that distinguish one person from another [6] [7]. Facial data are popular for identity recognition because a face can easily be recognized visually. Facial characteristics often used for identity recognition are face shape, skin color, and facial components such as the nose, mouth, and eyes [8]. However, the coronavirus pandemic requires employees to carry out activities wearing masks that occlude some facial features when identity recognition is performed. Masks therefore make identity recognition systems based on facial biometric data less accurate.

RESEARCH METHOD
Fig. 1 shows the architecture of the masked face recognition system. The system begins with the training dataset stage. This study uses training and validation data consisting of images of people's faces and the corresponding name identities. The name identity serves as the label used in the testing phase on other images of the same person. Before the dataset is processed with transfer learning, the images are resized to match the input size of the transfer learning models, which were previously trained on ImageNet data. Because the amount of input data is small, data augmentation is applied to increase it. The augmented data are trained using VGG16 or MobileNetV2 transfer learning. These two models were chosen because previous studies on other cases reported accuracies above 93% for VGG16 [17] and above 96% for MobileNetV2 [18] [19]. The training result is an h5 model, which is evaluated to find the best model based on loss and accuracy values. The testing phase uses other face images of the same people to evaluate system accuracy. The test images are resized to the same size as the training dataset. The next stage is feature matching, which generates the facial identity. The recognition results are evaluated to test the performance of the system.

Data Acquisition
We used primary data in the form of facial images representing 35 distinct identities in our experiment. Each identity has ten facial images, so the total amount of data used in this study is 350. For each identity, 80% of the images are used for training and 20% for validation. Each image is composed of three color channels: red, green, and blue (RGB). Fig. 2 shows an example of the data and labels used in this study. Before transfer learning is performed, each face image is resized to 224×224 pixels to match the input resolution of the VGG16 and MobileNetV2 pre-trained models.
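The per-identity 80/20 split described above can be sketched in plain Python. This is a minimal illustration, not the authors' code; the identity names and file names below are hypothetical placeholders.

```python
import random

def split_per_identity(images_by_identity, train_frac=0.8, seed=42):
    """Split each identity's images into training and validation sets (80/20)."""
    rng = random.Random(seed)
    train, val = [], []
    for identity, images in images_by_identity.items():
        imgs = list(images)
        rng.shuffle(imgs)
        cut = int(len(imgs) * train_frac)  # 8 of the 10 images per identity
        train += [(identity, p) for p in imgs[:cut]]
        val += [(identity, p) for p in imgs[cut:]]
    return train, val

# Hypothetical dataset: 35 identities x 10 images = 350 samples in total
dataset = {f"person_{i:02d}": [f"person_{i:02d}_{j}.jpg" for j in range(10)]
           for i in range(35)}
train, val = split_per_identity(dataset)
print(len(train), len(val))  # 280 70
```

Splitting within each identity (rather than over the pooled 350 images) guarantees that every person appears in both the training and validation sets.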

Fig. 2. Samples of data and labels (identities such as Abdur, Adam, Arif, Izmi, and Aldhan)

Data Augmentation
This study uses data augmentation to increase the size of the dataset. This stage was carried out because the original training set contains only eight images per identity. Data augmentation in this study uses the ImageDataGenerator class from the Keras library, which is often used for geometric augmentation [20]. The training data are augmented with a rescaling factor of 1./255. The rotation_range parameter rotates the image by up to 45 degrees. The width_shift_range and height_shift_range parameters control the amount of horizontal and vertical shift, set to 0.3 each. The horizontal_flip parameter flips inputs horizontally. Finally, the fill_mode parameter is set to nearest; it determines how newly created pixels are filled when a transformation moves image content beyond its original boundaries. The validation data are only rescaled by 1./255.
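The augmentation settings listed above map directly onto ImageDataGenerator arguments. The following is a minimal sketch assuming TensorFlow/Keras; the random input array merely stands in for the resized 224×224 RGB face images.

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augmentation parameters taken from the text above (training data)
train_gen = ImageDataGenerator(
    rescale=1.0 / 255,
    rotation_range=45,
    width_shift_range=0.3,
    height_shift_range=0.3,
    horizontal_flip=True,
    fill_mode="nearest",
)
# Validation data: rescaling only, no geometric augmentation
val_gen = ImageDataGenerator(rescale=1.0 / 255)

# Demonstrate on a random batch shaped like the resized dataset (224x224 RGB)
x = np.random.randint(0, 256, size=(8, 224, 224, 3)).astype("float32")
batch = next(train_gen.flow(x, batch_size=8, shuffle=False))
print(batch.shape)  # (8, 224, 224, 3), pixel values rescaled to ~[0, 1]
```

In practice the generator would be pointed at the image folders (e.g. via flow_from_directory) rather than an in-memory array.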

Deep Learning
Object recognition in computer vision aims to make decisions about actual physical objects based on images obtained from sensors such as cameras. Object recognition often uses machine learning. Machine learning is a technique that imitates how humans learn and generalize through a training process on data, drawing conclusions based on the training data [21]. As technology has developed, machine learning has evolved into deep learning. Deep learning is an artificial neural network method that processes input data through many hidden layers; the result is a non-linear transformation of the input used to calculate the output value [22]. The deep learning architecture most often used to process image data is the Convolutional Neural Network. Deep learning is usually applied to large amounts of data, but the masked face recognition data in this research are limited. One technique for working with little data is transfer learning [23], in which the model has been previously trained on other data [14]. Examples of transfer learning models are VGG16 and MobileNetV2.

VGG16
K. Simonyan and A. Zisserman of the University of Oxford proposed the VGG16 convolutional neural network model [24]. The VGG16 architecture is a Convolutional Neural Network (CNN) that achieved top results in the 2014 ImageNet Large Scale Visual Recognition Challenge (ILSVRC). It is widely regarded as one of the most exemplary vision model architectures to date. The most distinguishing feature of VGG16 is that, rather than introducing a vast number of hyper-parameters, its authors concentrated on 3×3 convolution layers with stride 1 and the same padding, together with 2×2 max-pooling layers with stride 2. This ordering of convolution and max-pooling layers is maintained throughout the architecture. Finally, it includes two fully connected (FC) layers followed by a softmax output. The 16 in VGG16 refers to its sixteen weighted layers. It is a vast network with approximately 138 million parameters [25]. The VGG16 architecture is shown in Fig. 3.
Fig. 3. VGG16 architecture [26]
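Using VGG16 as a frozen feature extractor with a new classification head for the 35 identities can be sketched as follows. This is an illustrative sketch, not the authors' exact architecture; the head layers (Flatten plus a 256-unit Dense layer) are assumptions, and in practice weights="imagenet" would be used (weights=None here only avoids the large download).

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

# Pre-trained convolutional base; use weights="imagenet" in practice
# (weights=None in this sketch only to avoid downloading the weights).
base = VGG16(weights=None, include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # freeze the pre-trained feature extractor

# New classification head for the 35 identities (assumed layer sizes)
model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dense(35, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
print(model.output_shape)  # (None, 35)
```

After training, the model would be saved in h5 format (model.save("model.h5")) as described in the method section.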

MobileNetV2
MobileNetV2 is the 2018 successor to MobileNetV1, which debuted in April 2017. Like MobileNetV1, MobileNetV2 employs depthwise and pointwise convolution. MobileNetV2 adds two new CNN layers: an inverted residual layer and a linear bottleneck layer. In other words, it introduces two new capabilities: 1) linear bottlenecks and 2) shortcut connections between bottlenecks [27]. The bottlenecks carry the model's input and output, while the inner layers encapsulate the model's ability to transform lower-level concepts (i.e., pixels) into higher-level descriptors (i.e., image categories).
Finally, as with residual connections in classical CNNs, shortcuts between bottlenecks enable faster training and increased accuracy. As shown in Table 1, the MobileNetV2 network is mainly constructed from the inverted residual layer presented in [28]. Additionally, the model can be adapted for object detection and semantic segmentation by using the network as a feature extractor.
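Swapping the base model from VGG16 to MobileNetV2 leaves the overall transfer learning recipe unchanged, as the sketch below illustrates. The GlobalAveragePooling2D head is an assumption, and weights="imagenet" would be used in practice (weights=None here only avoids the download).

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import MobileNetV2

# Pre-trained MobileNetV2 base (weights="imagenet" in practice)
base = MobileNetV2(weights=None, include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # freeze the feature extractor

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(35, activation="softmax"),  # one output per identity
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
print(base.count_params() < 4_000_000)  # True: far smaller than VGG16's ~138M
```

The small parameter count is why MobileNetV2 is attractive for deployment-oriented tasks such as attendance systems.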

System Evaluation
Research on masked face recognition is discussed in terms of training and validation accuracy. All training and validation results are saved for each combination of epoch count and batch size. After the model is trained, we evaluate the system's accuracy. Classification accuracy is a simple concept that quantifies the proportion of testing examples whose class a classifier identifies correctly. The accuracy of a classifier is calculated using the formula shown in (1).

A = r / n (1)
where A is the accuracy (correct rate), r is the number of testing images recognized correctly, and n is the total number of testing images.
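Equation (1) is straightforward to compute; the figures below (167 of 175 test images recognized correctly) are taken from the testing results reported later in this paper.

```python
def accuracy(n_correct, n_total):
    """Equation (1): fraction of test images whose identity is recognized correctly."""
    return n_correct / n_total

# 175 test images, 8 misidentified by MobileNetV2 -> 167 correct
print(round(accuracy(167, 175), 4))  # 0.9543
```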

Training Result Analysis
In this study, several metrics related to batch size variations and the number of epochs were recorded, as shown in Table 2 and Table 3. The metrics observed during the training process were the training and validation accuracy values. We used batch sizes of 2, 4, 8, 16, and 32 in the batch size experiment. A batch size of 8, for example, means that the dataset is divided into batches of eight samples each for Neural Network training.
Based on Table 2, the greater the batch size, the better the training and validation accuracy. On MobileNetV2 with a batch size of 4, the training and validation accuracy values already reach 100%. Using the VGG16 model, a maximum accuracy of 98% is obtained with a batch size of 32. Across all variations of the experiment, the best batch size is 32 for both transfer learning models; therefore, in the epoch experiments, we used a batch size of 32. An epoch is one complete pass of the entire dataset through the Neural Network. Based on Table 3, the training and validation accuracy of the MobileNetV2 model reaches 100% at epoch 20, whereas the VGG16 model has not reached 100% by epoch 50. These results show that the best model is obtained using the pre-trained MobileNetV2 model. Fig. 4 shows that the VGG16 model overfits slightly because, in certain epochs, the training accuracy is greater than the validation accuracy; for example, between epochs 20 and 30, the validation accuracy is smaller than the training accuracy. In the training and validation loss chart, overfitting or underfitting is less visible because the difference between the training and validation loss values is not significant. Meanwhile, the training and validation accuracy graphs of the MobileNetV2 model show neither overfitting nor underfitting, and the same holds for its training and validation loss. Based on these results, for testing with other images of the same identities, we use a batch size of 32 and a maximum of 50 epochs. We also used an early stopping function to prevent continued overfitting: if the validation loss does not decrease for seven epochs, training is stopped.
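The early stopping rule described above (stop when the validation loss fails to improve for seven consecutive epochs) corresponds to the Keras EarlyStopping callback. A minimal sketch, assuming Keras; restore_best_weights is an added convenience, not something the text specifies.

```python
from tensorflow.keras.callbacks import EarlyStopping

# Stop training when validation loss has not improved for 7 consecutive
# epochs, keeping the weights from the best epoch seen so far.
early_stop = EarlyStopping(monitor="val_loss", patience=7,
                           restore_best_weights=True)

# Would be passed to training as:
# model.fit(train_data, validation_data=val_data,
#           epochs=50, batch_size=32, callbacks=[early_stop])
print(early_stop.monitor, early_stop.patience)  # val_loss 7
```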

Testing Result Analysis
In this research, the proposed system is tested to measure the accuracy obtained on different images of the same identities. Table 4 shows the accuracy results for the proposed system. We used 175 testing images covering the same 35 identities as the training data. The VGG16 model produced 13 false identities, while the MobileNetV2 model produced only eight. Fig. 5 shows an example of matching test data against the trained model. The training and validation accuracy was also very high for MobileNetV2, reaching 100%. These results prove that transfer learning techniques can be used to recognize masked faces. Future research could measure processing time when using video data for testing. The results of this study were better than those of a previous study [12], which achieved an accuracy of 94.5%; in other words, the accuracy increased by 0.92% to 95.42%. The previous research also used a combination of two deep learning techniques, RetinaFace and VGGFace2, whereas the proposed approach uses only one deep learning model based on MobileNetV2 transfer learning.
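The identity matching step in the testing phase amounts to taking the argmax of the model's softmax output and comparing it against the true label. The sketch below illustrates this with NumPy; the identity names and probability values are hypothetical, not taken from the paper's data.

```python
import numpy as np

# Hypothetical subset of the 35 identity labels
labels = ["Abdur", "Adam", "Arif"]

# Hypothetical softmax outputs for 4 test images (each row sums to 1)
probs = np.array([
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.20, 0.30, 0.50],
    [0.60, 0.30, 0.10],  # misidentification: true identity is Adam
])
true_ids = ["Abdur", "Adam", "Arif", "Adam"]

# Predicted identity = label with the highest softmax probability
pred_ids = [labels[i] for i in np.argmax(probs, axis=1)]
n_false = sum(p != t for p, t in zip(pred_ids, true_ids))
print(pred_ids, n_false)  # ['Abdur', 'Adam', 'Arif', 'Abdur'] 1
```

Counting such false identities over all 175 test images is what yields the 13 (VGG16) and 8 (MobileNetV2) errors reported in Table 4.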

CONCLUSION
This research focuses on the face recognition of a person wearing a mask. The coronavirus pandemic forces us to follow health protocols, namely wearing masks when out and about, especially at work. We propose the use of transfer learning techniques for facial feature extraction and classification according to identity. The results showed that the MobileNetV2 transfer learning model was better than VGG16, with an accuracy of 95.42%. These results indicate that the MobileNetV2 model is better at classifying the identities of 35 different people. This study also uses only one deep learning model, which is more efficient than previous studies that used two, and the accuracy increased by 0.92% compared to previous work. Future research can extend this work to real-time video data so that processing speed can be evaluated.