Hello! My name is Charlie, and I completed an internship on Arbeon's AI Team!
In this post, I would like to introduce Hand Segmentation, a project I worked on during my internship. I hope you enjoy it. Let's begin :)
Tale of the magic trick where the hand disappears
On Hand Segmentation
Background and Purpose
The first gateway to the Arbeon App is recognizing an object in an image taken with a cellphone. To do that, the primary task is to develop a good model that identifies and classifies objects. However, computers are not people, and even small obstacles can significantly hurt their performance in identifying an object. Therefore, besides the process of recognizing objects, we also need a process that removes obstacles other than the target object.
What would be the most frequently used tool when people take a picture of an object? The answer would be the "hands" of the people. We only need a hand to take a picture in any frame we want. Hands are quite literally the handiest tool for people. However, while hands are so convenient for us, they are the biggest obstacles for Arbeon App. 😱
What I essentially did was remove obstacles for identifying an object. The most prevalent obstacle is the hand, so we needed to solve the problem of hand segmentation. Now, we will move on to the topic of how we could remove hands and what exactly "segmentation" is.
What is segmentation?
One of the most familiar AI models for people is the model that identifies whether or not there is a cat in a picture. This process is called image classification. The process identifies what object is in a picture after looking at the picture. However, this is not enough. Classification is a very coarse prediction. It sees the overall picture and only tells the class of object.
Image inference tasks have moved from coarse prediction to finer and finer prediction, in the following order. First, object localization indicates the location of an object with a rectangular box. Next, semantic segmentation predicts the class of every pixel, which gives both the location and the type of an object at the pixel level. Finally, instance segmentation goes one step further and separates individual objects even when they belong to the same class.
Then, what would be the most efficient way to "delete hands from a picture"? Classification does not help at all: it can merely tell whether or not there is a hand in an image. Object localization is a little better. The hand is recognized, and its location is marked with a box. However, the box also contains areas that are not part of the hand. We would lose too much information if we assumed that every part of the box is a hand and removed it from the picture.
How about semantic segmentation? It estimates the area where there is a hand precisely down to pixels. Then, you can accurately remove the part where there is a hand. Therefore, it best suits our objective. What we want to do is to precisely take out hands.
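To make the difference concrete, here is a minimal sketch that removes a "hand" from a toy image in both ways. The 6x6 image, the diagonal hand mask, and all function names are made up for illustration; they are not taken from the Arbeon code.

```python
# Toy 6x6 "image": every pixel is 9, so removed pixels (set to 0) are easy to spot.
H, W = 6, 6
image = [[9] * W for _ in range(H)]

# Hypothetical hand mask (1 = hand pixel): a diagonal streak, like a finger crossing the frame.
hand_mask = [[1 if r == c and 1 <= r <= 4 else 0 for c in range(W)] for r in range(H)]


def bounding_box(mask):
    """Smallest axis-aligned box (top, left, bottom, right) that contains every 1 in the mask."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    return rows[0], cols[0], rows[-1], cols[-1]


def remove_by_box(img, mask):
    """Object-localization style: erase everything inside the hand's bounding box."""
    top, left, bottom, right = bounding_box(mask)
    return [[0 if top <= r <= bottom and left <= c <= right else px
             for c, px in enumerate(row)] for r, row in enumerate(img)]


def remove_by_mask(img, mask):
    """Semantic-segmentation style: erase only the pixels labelled as hand."""
    return [[0 if mask[r][c] else px for c, px in enumerate(row)]
            for r, row in enumerate(img)]


def count_removed(img):
    return sum(px == 0 for row in img for px in row)


box_removed = count_removed(remove_by_box(image, hand_mask))
mask_removed = count_removed(remove_by_mask(image, hand_mask))
print(box_removed, mask_removed)  # prints "16 4"
```

In this toy case the box throws away 16 pixels to remove a hand that covers only 4, which is exactly the information loss that pixel-level masks avoid.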
We could also perform instance segmentation. However, what we need from this technology is neither distinguishing individual hands nor determining whether a hand is a right hand or a left hand. We only need to find any hand. So, while we could use instance segmentation, it does more than we require, and it would only make our model more complex.
We now know that hand semantic segmentation is exactly what we require. No more, no less, just right. Now that we have defined our objective clearly, let's talk about how we found hands in a picture.
There are many ways, but we achieved this task by developing AI models.
There are several models for semantic segmentation within the realm of AI. Among these, I used a model called HGR-Net, which applies ASPP to a ResNet model, as well as models that incorporate the transformer, which has recently become widely used not only in NLP but also in vision. In this section, I will introduce the basic structure of each model and its core concepts.
1. HGR-Net model
The first model leverages CNN, well-known in image recognition. In addition, ResNet (residual network) and ASPP (Atrous Spatial Pyramid Pooling) methods were applied.
Before we get into ResNet, we need a brief explanation of deep learning. In essence, a deep learning model is composed of several layers, and stacked layers become a deep neural network. It is called deep because the layers are stacked deeply. We want to stack more and more layers for better predictive performance, but as the number of layers grows, we may face the vanishing gradient problem: the gradient shrinks as it is propagated back through many layers, so the early layers barely learn. ResNet is used to address this issue. It adds a block's original input to the output that has passed through the block's layers (a skip connection). Surprisingly, this simple addition largely resolves the vanishing gradient problem that comes with stacking many layers. Thanks to ResNet, we can stack far more layers than before.
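The effect of the skip connection can be sketched with a toy example. The `layer` function below is a hypothetical stand-in for a learned layer (a fixed scaling followed by ReLU), chosen only so that the numbers are easy to follow; it is not how HGR-Net or ResNet layers are actually defined.

```python
def layer(x, weight):
    """Hypothetical stand-in for a learned layer: scale each value, then apply ReLU."""
    return [max(0.0, weight * v) for v in x]


def plain_block(x, weight):
    """Two stacked layers with no skip connection: F(x)."""
    return layer(layer(x, weight), weight)


def residual_block(x, weight):
    """The same two layers plus a skip connection: F(x) + x."""
    fx = layer(layer(x, weight), weight)
    return [f + v for f, v in zip(fx, x)]


x = [1.0, 2.0, 3.0]
plain, residual = x, x
for _ in range(20):  # stack 20 blocks
    plain = plain_block(plain, 0.5)
    residual = residual_block(residual, 0.5)

# For these positive inputs each block is linear, so the factor by which the
# signal is scaled is also the derivative of the output with respect to the input.
plain_factor = plain[0] / x[0]        # (0.5 * 0.5) ** 20, vanishingly small
residual_factor = residual[0] / x[0]  # (0.5 * 0.5 + 1) ** 20, stays well above zero
print(plain_factor, residual_factor)
```

Without the skip connection, the factor shrinks toward zero as blocks are stacked. With it, every block contributes at least the identity, so the signal (and the gradient flowing back along the same path) survives the depth.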
However, the key to this model above all else is ASPP. People consider the surroundings when recognizing an object or identifying its boundaries, and ASPP lets the model do the same. Some may question why ASPP is needed as a separate step at all, since a CNN already extracts features from a specified area of an image. The difference between an ordinary convolution and ASPP lies in dilation. Whereas the kernel of an ordinary CNN is a contiguous grid of cells, the kernel used in ASPP has the structure shown in the figure above, with empty spaces between the cells. Therefore, the same number of parameters can cover a wider area.
Convolution with this structure is called atrous (dilated) convolution. ASPP applies several atrous convolutions with different dilation rates in parallel and combines their outputs. Through this process, we are equipped with a wider perspective to see not only the trees but also the forest.
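As a rough illustration of dilation, the sketch below implements a 1D atrous convolution and a toy ASPP in plain Python. It is a simplification under stated assumptions: real ASPP works on 2D feature maps and each branch has its own learned kernel, whereas here one hand-written 3-tap kernel is reused for every rate. The function names are illustrative only.

```python
def atrous_conv1d(signal, kernel, rate):
    """1D convolution whose kernel taps are `rate` samples apart.

    rate=1 is an ordinary convolution. As in deep learning frameworks, the
    kernel is not flipped. Positions outside the signal count as zero, so
    the output has the same length as the input.
    """
    center = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        total = 0.0
        for k, w in enumerate(kernel):
            j = i + (k - center) * rate  # the gap between taps grows with the rate
            if 0 <= j < len(signal):
                total += w * signal[j]
        out.append(total)
    return out


def receptive_field(kernel_size, rate):
    """Number of input positions spanned by one dilated kernel."""
    return (kernel_size - 1) * rate + 1


def aspp1d(signal, kernel, rates):
    """Toy ASPP: run the kernel at several rates and stack the results per position."""
    branches = [atrous_conv1d(signal, kernel, r) for r in rates]
    return [list(values) for values in zip(*branches)]


signal = [0, 0, 0, 0, 1, 0, 0, 0, 0]  # a single impulse at index 4
kernel = [1.0, 1.0, 1.0]              # 3 parameters, whatever the rate
features = aspp1d(signal, kernel, rates=[1, 2, 4])

print([receptive_field(3, r) for r in (1, 2, 4)])  # prints [3, 5, 9]
print(features[0])  # prints [0.0, 0.0, 1.0]
```

At position 0, only the rate-4 branch can "see" the impulse four samples away, even though all three branches use the same three parameters. Stacking the branches gives each position both a near and a far view.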
2. ViT model
The second model I used is a segmentation model based on ViT (Vision Transformer), which has recently gained recognition. Its structure is similar to the original ViT, with two differences: it does not use a class token, and it adds a dense layer and a reshape layer after the transformer to create a segmentation map. Simply put, these layers are what you need to turn the 1D token vectors back into a 2D image.
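The dense-then-reshape step can be sketched as follows. This assumes each output token is projected to one score per pixel of its patch and that patches are laid out in row-major order; the exact layer sizes of the model described here are not given, so `dense`, `tokens_to_map`, and the toy weights are illustrative only.

```python
def dense(token, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * x for w, x in zip(row, token)) + b
            for row, b in zip(weights, bias)]


def tokens_to_map(token_scores, grid_h, grid_w, patch):
    """Reshape per-token score vectors (patch*patch values each) into a 2D map.

    Token t is assumed to belong to the patch at row t // grid_w and
    column t % grid_w of the patch grid.
    """
    out = [[0.0] * (grid_w * patch) for _ in range(grid_h * patch)]
    for t, scores in enumerate(token_scores):
        grid_r, grid_c = divmod(t, grid_w)
        for p, score in enumerate(scores):
            pr, pc = divmod(p, patch)
            out[grid_r * patch + pr][grid_c * patch + pc] = score
    return out


# Four transformer output tokens with a 1-dimensional embedding (toy values).
tokens = [[0.0], [1.0], [2.0], [3.0]]
weights = [[1.0]] * 4  # hypothetical dense weights: 1 input -> 2*2 pixel scores
bias = [0.0] * 4

scores = [dense(t, weights, bias) for t in tokens]
seg_map = tokens_to_map(scores, grid_h=2, grid_w=2, patch=2)
for row in seg_map:
    print(row)
```

Each token ends up filling its own 2x2 corner of the 4x4 map, which is the "1D back to 2D" recovery described above.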
As in NLP, a transformer model splits the input data into tokens and makes a prediction based on the relevance between the tokens. Therefore, in addition to seeing a wider area than a CNN, this model can also learn the relevance between areas.
After receiving an image, ViT first divides it into small areas called patches. Each patch becomes a token that is fed into the transformer, where it goes through a process called MSA (Multi-head Self-Attention).
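Here is a minimal sketch of the patch-splitting step on a tiny single-channel image. It only shows how pixels are grouped into tokens; the linear projection and position embeddings that a real ViT applies afterwards are left out, and `image_to_patches` is an illustrative name.

```python
def image_to_patches(image, patch):
    """Split a 2D image into non-overlapping patch x patch tiles.

    Each tile is flattened into one token (a flat list of pixels).
    Tokens are returned in row-major order of the patch grid.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tokens.append([image[top + r][left + c]
                           for r in range(patch) for c in range(patch)])
    return tokens


# 4x4 image whose pixel value encodes its position: row * 4 + column.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
tokens = image_to_patches(image, patch=2)
print(len(tokens))  # prints 4
print(tokens[0])    # prints [0, 1, 4, 5]
```

A 4x4 image with 2x2 patches yields 4 tokens of 4 values each, and the first token holds the top-left corner of the image.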
Inside MSA, Q (Query), K (Key), and V (Value) values are computed for each input token, and the results are calculated through operations on these Q, K, and V values. The outputs then pass through a layer called "feedforward" and become the inputs to the next stage. Please see the ViT content previously written by Kelly for a more detailed explanation of ViT!
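The Q, K, V computation can be sketched as a single attention head in plain Python. This uses the standard scaled dot-product form; the identity projection matrices and the three toy tokens are assumptions made for readability, and a multi-head layer would run several such heads with different learned projections and concatenate their outputs.

```python
import math


def matvec(matrix, vec):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens, w_q, w_k, w_v):
    """One attention head: every token attends to every token, itself included."""
    queries = [matvec(w_q, t) for t in tokens]
    keys = [matvec(w_k, t) for t in tokens]
    values = [matvec(w_v, t) for t in tokens]
    scale = math.sqrt(len(keys[0]))
    outputs, attention = [], []
    for q in queries:
        # Relevance of each token to this query: scaled dot product of Q and K.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Output: the V vectors averaged with those relevance weights.
        mixed = [sum(w * v[d] for w, v in zip(weights, values))
                 for d in range(len(values[0]))]
        attention.append(weights)
        outputs.append(mixed)
    return outputs, attention


identity = [[1.0, 0.0], [0.0, 1.0]]             # hypothetical projection weights
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]   # tokens 0 and 2 are identical
outputs, attention = self_attention(tokens, identity, identity, identity)
print(attention[0])
```

Token 0 gives more weight to itself and to token 2, which looks the same, than to token 1. This is the "relevance between areas" that the transformer learns, here with fixed weights instead of trained ones.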
So far, I have explained the models I used. Before we move on to the results, I will explain which types of augmentation I applied to improve model performance.
Every model requires data for training. However, there was less data for hand segmentation than I had imagined. I worked hard to collect data, but the entire set contained only about 25,000 pictures. Also, the pictures within each data source were similar to one another, so there was a risk of bias and overfitting. Therefore, I performed data augmentation to increase the amount of data and reduce such bias.
1. Black Box Augmentation
Stated more precisely, our goal was to "remove a hand holding an object for the Arbeon App." However, many of the images I collected contained only hands, whereas no one would take a picture of their hand alone while using the Arbeon App. Therefore, I located the area representing the hand and added a black box at its center, so that the image would resemble a hand holding something.
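One way this augmentation could look is sketched below on a toy grayscale image. Several details are assumptions, since the post does not specify them: the box size (`box_fraction` is a hypothetical parameter), the use of the mask's bounding box to find the center, and the choice to also clear the mask under the box so that the simulated object is not labelled as hand.

```python
def hand_bounding_box(mask):
    """(top, left, bottom, right) of the hand pixels, or None if the mask is empty."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    return rows[0], cols[0], rows[-1], cols[-1]


def black_box_augment(image, mask, box_fraction=0.5):
    """Paint a black box at the center of the hand area.

    Assumption: the mask is cleared under the box as well, treating the box
    as a held object rather than as part of the hand.
    """
    new_image = [row[:] for row in image]
    new_mask = [row[:] for row in mask]
    box = hand_bounding_box(mask)
    if box is None:
        return new_image, new_mask
    top, left, bottom, right = box
    hand_h, hand_w = bottom - top + 1, right - left + 1
    box_h = max(1, int(hand_h * box_fraction))
    box_w = max(1, int(hand_w * box_fraction))
    r0 = top + (hand_h - box_h) // 2   # center the box inside the hand area
    c0 = left + (hand_w - box_w) // 2
    for r in range(r0, r0 + box_h):
        for c in range(c0, c0 + box_w):
            new_image[r][c] = 0
            new_mask[r][c] = 0
    return new_image, new_mask


# 6x6 gray image with a 4x4 "hand" in the middle.
image = [[200] * 6 for _ in range(6)]
mask = [[1 if 1 <= r <= 4 and 1 <= c <= 4 else 0 for c in range(6)] for r in range(6)]
aug_image, aug_mask = black_box_augment(image, mask)
for row in aug_image:
    print(row)
```

The function returns copies, so the original image and mask stay available for training alongside the augmented pair.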
The following shows the results of training a model with this augmentation. Can you see that the model now recognizes a watch or washcloth on the hand, which the ground-truth mask did not identify?!
2. Other Augmentations
Aside from that, I applied zoom, translation, and rotation, which are frequently used for image augmentation. Furthermore, hands were often not recognized well when external lighting changed the saturation, brightness, and color of the image. Therefore, I also applied augmentation on saturation, brightness, and color.
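For segmentation, one detail matters when combining these augmentations: a geometric change such as translation has to be applied to the image and its mask together, while a lighting change such as brightness touches only the image. The sketch below shows this on a toy grayscale image. The parameter ranges and function names are assumptions, and saturation or color changes would need an RGB image, which is left out here.

```python
import random


def adjust_brightness(image, factor):
    """Scale pixel values and clamp them to the 0..255 range."""
    return [[min(255, max(0, round(px * factor))) for px in row] for row in image]


def translate(grid, dx, dy, fill=0):
    """Shift a 2D grid by dx columns and dy rows, filling uncovered cells."""
    height, width = len(grid), len(grid[0])
    out = [[fill] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            nr, nc = r + dy, c + dx
            if 0 <= nr < height and 0 <= nc < width:
                out[nr][nc] = grid[r][c]
    return out


def augment(image, mask, rng):
    """Apply one random translation and one random brightness change."""
    dx, dy = rng.randint(-1, 1), rng.randint(-1, 1)  # hypothetical shift range
    factor = rng.uniform(0.7, 1.3)                   # hypothetical brightness range
    new_image = adjust_brightness(translate(image, dx, dy), factor)
    new_mask = translate(mask, dx, dy)  # same shift as the image, no brightness change
    return new_image, new_mask


image = [[100, 100, 100], [100, 180, 100], [100, 100, 100]]
mask = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
aug_image, aug_mask = augment(image, mask, random.Random(0))

print(translate([[1, 2], [3, 4]], dx=1, dy=0))  # prints [[0, 1], [0, 3]]
print(adjust_brightness([[100, 200]], 1.5))     # prints [[150, 255]]
```

Passing a seeded `random.Random` keeps the augmentation reproducible, which helps when comparing models trained on the same augmented data.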
The pictures below show the effect of these augmentation processes. The second and third columns represent the images before and after the augmentation process, respectively. Can you see a clear difference?
After training the models, I tested them by taking pictures around and inside the office, and compared each model.
While there were some differences between the models, both the ViT models and the ResNet models were able to identify hands. In particular, the ResNet model showed a similar level of performance to the ViT model with far fewer parameters. While we can still see room for improvement here and there, isn't it amazing that we can exclude our hands from a picture now?
Today, we learned about hand segmentation. The past two months as an intern at Arbeon were invaluable to me, as I researched problems and came up with solutions. I definitely learned a lot. Please show your interest in and support for "Arbeon," which is filled with shining potential for becoming a unicorn.
Thank you for reading such a long post!
- IET Comput. Vis., 2019, Vol. 13 Iss. 8, pp. 700-707
- Liang-Chieh Chen et al., arXiv:1606.00915v2 [cs.CV] 12 May 2017