[Tech +] The ViT and Its Meaning

Hello, everyone! I am Kelly, a developer in charge of AI and Vision in Arbeon!

Aside from the AI team (to which I belong), there are many other teams in Arbeon, including AR, Client, Server, Metaverse, and Technical Artist. These experts in each field work together to develop a unique technology that defines Arbeon. They will share stories on technologies and their jobs in future posts. So please show us a lot of support.

And now, for the first topic in Tech content, I will talk about the Vision Transformer (ViT). 

Let’s get started right away! 😀

Background behind this ViT article  

I originally planned to develop a model that classifies food, and I decided to build it with ViT.

The standard approach to image classification has long been CNN-based models, so the relatively unfamiliar concept of ViT intrigued me. I became curious about how ViT came to be, and about its architecture and mechanism.

Before using ViT, I wanted to understand what it actually is. So I researched it and decided to write an article about the topic.


Conventionally, most of the vision field has been centered on convolutional (CNN) architectures.

The Transformer architecture was such a huge success in NLP, even achieving SOTA results, that researchers decided to apply the standard Transformer to vision as well. That’s how ViT came to be.


· CNN Architecture

  The Convolutional Neural Network (CNN) architecture is built around convolutional layers.

· Convolutional layer

  - It refers to the layer that extracts the features of an image.

  - This layer contains many filters (n×n matrices). Each filter slides over the input image, 
    and at every position the overlapping values are multiplied element-wise and summed, 
    extracting a feature map.
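As a minimal illustration of the sliding-filter idea above, here is a toy NumPy sketch (not code from any real CNN library; real frameworks add padding, stride, and multiple channels):

```python
import numpy as np

def conv2d(image, kernel):
    """Slide an n x n kernel over the image; at each position,
    multiply element-wise and sum to produce one output value."""
    n = kernel.shape[0]
    h, w = image.shape
    out = np.zeros((h - n + 1, w - n + 1))
    for i in range(h - n + 1):
        for j in range(w - n + 1):
            out[i, j] = np.sum(image[i:i + n, j:j + n] * kernel)
    return out

# A 3x3 vertical-edge filter applied to a 5x5 toy image
image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[1.0, 0.0, -1.0]] * 3)
features = conv2d(image, kernel)
print(features.shape)  # (3, 3)
```

Each output value is one "feature" measured at one location; a real CNN learns the kernel values during training instead of fixing them by hand.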

· Transformer Architecture

  - Since the Transformer was originally used in NLP, its input was a sequence of word embeddings.

    + Word embedding: a method of representing words as vectors

  - But ViT is a Transformer architecture for vision. 
    This means its input is a sequence of patches that the image has been split into.

  - For a more detailed architecture, please refer to the illustration in ViT 101.

ViT 101  

The process of making the input

· Generally, when inputs are processed in the Transformer, the following process is done: 
  Sequence Data → Embedding → Positional Encoding.

· The image is split into patches; each patch is flattened and embedded (turned into a vector) 
  at an n-dimensional level. Where NLP works with "sentence : word," ViT works with "image : patch."

· The goal is to train on the larger whole (a sentence, or a full image), but in both cases 
  the method is the same: split it into small pieces and train on those.

· That seems to be the reasoning behind the word → patch analogy.

· A special class token, used later to predict the classification, is prepended to the sequence.

· Since the image is split into patches, position embedding information is added to each patch 
  to prevent the mishap of losing the image location information.
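The input pipeline described above (split into patches → flatten → embed → prepend class token → add position embeddings) can be sketched in a few lines of NumPy. All sizes here are made-up toy values, and the "learned" matrices are just random placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy settings (hypothetical sizes, not the ones from the ViT paper):
# a 32x32 grayscale image split into 8x8 patches, embedded in 16 dims.
img_size, patch, dim = 32, 8, 16
num_patches = (img_size // patch) ** 2  # 16 patches

image = rng.standard_normal((img_size, img_size))

# 1) Split the image into patches and flatten each one.
patches = image.reshape(img_size // patch, patch, img_size // patch, patch)
patches = patches.transpose(0, 2, 1, 3).reshape(num_patches, patch * patch)

# 2) Linearly embed each flattened patch (learned in practice; random here).
W = rng.standard_normal((patch * patch, dim))
tokens = patches @ W                          # (16, 16)

# 3) Prepend the class token used later for classification.
cls_token = rng.standard_normal((1, dim))
tokens = np.concatenate([cls_token, tokens])  # (17, 16)

# 4) Add position embeddings so patch-location information is kept.
pos_embed = rng.standard_normal((num_patches + 1, dim))
tokens = tokens + pos_embed
print(tokens.shape)  # (17, 16)
```

The resulting 17×16 token sequence is what enters the Transformer Encoder.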


· Repeat the Transformer Encoder L times → 
  The output value which has the same size as the input value can be obtained.

· The output of the Encoder likewise consists of the class token and the patch vectors. 
   The MLP head can be built from the class token alone.

· The final classification through the MLP Head is also possible.


· The input processed above gets into the Transformer Encoder as embedded patches.

· The Transformer Encoder refers to the architecture where multi-head attention and an 
  MLP are connected through residual (skip) connections, with layer normalization applied to each block.

· What does “Attention” mean here?

   - The dictionary definition of the word “Attention” is to focus on a certain point.

   - And so, the attention mechanism refers to going over the data as a whole and deciding 
     which point to pay attention to.

   - You can use this mechanism to focus only on the object that falls within the class you want to train.

· Attention is computed from a query (Q), a key (K), and a value (V).

· Norm: The normalization takes place for each feature through layer normalization.

· Multi-Head Attention:

  - It performs the attention computation in several heads in parallel.

  - The query (Q), key (K), and value (V) of every head are computed at once.
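The Q/K/V mechanics above can be sketched as follows. This is a hedged NumPy sketch of scaled dot-product multi-head attention: random matrices stand in for the learned projection weights, and the shapes, not the values, are the point:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, num_heads, rng):
    """Scaled dot-product attention computed in parallel over several heads.
    Projection matrices are random here; in a real model they are learned."""
    seq_len, dim = x.shape
    head_dim = dim // num_heads
    Wq, Wk, Wv = (rng.standard_normal((dim, dim)) for _ in range(3))

    # Project to Q, K, V and split into heads: (heads, seq, head_dim)
    def split(W):
        return (x @ W).reshape(seq_len, num_heads, head_dim).transpose(1, 0, 2)

    Q, K, V = split(Wq), split(Wk), split(Wv)
    # Attention weights: how much each token "pays attention" to the others.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(head_dim)
    weights = softmax(scores)          # each row sums to 1
    out = weights @ V                  # (heads, seq, head_dim)
    # Concatenate the heads back into one vector per token.
    return out.transpose(1, 0, 2).reshape(seq_len, dim)

rng = np.random.default_rng(0)
x = rng.standard_normal((17, 16))   # 1 class token + 16 patch tokens
y = multi_head_attention(x, num_heads=4, rng=rng)
print(y.shape)  # (17, 16)
```

Note that the output has the same shape as the input, which is what lets the Encoder block be stacked L times.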


· MLP = Multi-Layer Perceptron

· This is a multi-layer neural network with several perceptron neurons stacked over several layers.

· It’s composed of the input layer + hidden layer + output layer.

· It contains 2 fully connected layers with GELU as the activation function (normally, ReLU 
  is used as the activation function, but GELU is used here).

  - Why is that?

     - GELU scales the input x in proportion to how large it is relative to other inputs 
        → This allows for a probabilistic interpretation.
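To make the ReLU-vs-GELU difference concrete, here is a small sketch using the exact erf-based definition of GELU, x·Φ(x) with Φ the standard normal CDF (many implementations use a tanh approximation instead):

```python
import numpy as np
from math import erf

def relu(x):
    # Hard gate: negatives are zeroed outright.
    return np.maximum(0.0, x)

def gelu(x):
    """GELU scales the input by Phi(x), the probability that a standard
    normal variable falls below x, instead of ReLU's hard 0/1 gate."""
    phi = 0.5 * (1.0 + np.vectorize(erf)(x / np.sqrt(2.0)))
    return x * phi

xs = np.array([-1.0, 0.0, 1.0])
print(relu(xs))   # [0. 0. 1.]
print(gelu(xs))   # small negatives pass through partially, e.g. gelu(-1) ≈ -0.159
```

So where ReLU kills every negative input, GELU lets slightly negative values through in proportion to how likely they are under a standard normal, which is the probabilistic interpretation mentioned above.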

MLP Head

· In the final output of the Transformer Encoder, only the class token is used for the classification.

· Additionally, MLP is used for the classification in the last part.
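A toy sketch of the MLP head idea: only the first (class) token of the encoder output feeds the classifier. The single random linear layer and all sizes here are simplifications; a real head is learned and may include a hidden layer:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical encoder output: 17 tokens of dim 16, class token first.
encoder_out = rng.standard_normal((17, 16))
num_classes = 101  # e.g. the number of Food-101 classes

# Only the class token is used; the patch tokens are discarded here.
cls = encoder_out[0]                            # (16,)
W = rng.standard_normal((16, num_classes))      # learned in practice
b = np.zeros(num_classes)
logits = cls @ W + b                            # (101,)
prediction = int(np.argmax(logits))             # predicted class index
print(logits.shape, prediction)
```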


· We have gone over the background of ViT, the input and output systems, and the role of the overall components!

· Up next, I will train ViT on the Food-101 dataset and perform food classification. So stay tuned!




Contact us

General inquiry – official@arbeon.com

Investment inquiry – ir@arbeon.com
Press release – media@arbeon.com
Partnership inquiry – partnership@arbeon.com

3F, 211, Hakdong-ro, Gangnam-gu, Seoul, Republic of Korea