[Tech +] The Story of Monocular Depth Estimation

Hi! My name is Andrew, and I recently completed a 6-week internship with Arbeon's AI Team. 😀

While I have not been with Arbeon for long, the experience gave me a lasting new perspective on technology. Moreover, every crew member, including the developers, was truly passionate and down to earth!

During my time here, I was assigned to one of the projects related to Vision, and I would like to share the process and the outcome with you. Then, without further ado, let us begin.

Monocular Depth Estimation


Today I would like to talk about monocular depth estimation, or MDE. People judge distance using two eyes, the right and the left. Likewise, a camera normally needs images taken from two different angles to estimate the distance between the objects in a scene and the camera. With deep learning, however, a depth map that expresses the depth of each pixel can be estimated from just one photo with relatively high accuracy, and we put one such solution to the test.

MDE Background

MDE models are trained like other deep learning problems: the estimated depth map is compared with the actual depth map, a loss is calculated, and the model's parameters are updated to minimize it. The key point here is understanding which metrics guide the loss, since that tells us how to measure which model performs with better accuracy.

Among existing publications, the most prevalently used loss is a weighted average of three indicators: the structural similarity index (SSIM), point-wise depth (L1 loss), and smoothness loss. Among them, SSIM is the most important; it calculates the similarity between two images in terms of luminance, contrast, and structure.

- Luminance: quantifies the similarity between the overall brightness levels of each depth map.
- Contrast: lets us know how the brightness levels of the two different depth maps change within each map.
- Structure: compares the edges of the two depth maps and quantifies their similarity.

SSIM, in a nutshell, is the multiplication of these three values.
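To make this concrete, here is a minimal NumPy sketch of SSIM computed over the whole image at once. It is illustrative only: real implementations (including the framework ops used in practice) apply SSIM over a sliding window, and the constants `c1`, `c2` are the conventional stabilizers, not values from our project.

```python
import numpy as np

def ssim_global(x, y, c1=1e-4, c2=9e-4):
    """Simplified, whole-image SSIM: the product of the luminance,
    contrast, and structure terms (production code uses a windowed version)."""
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    c3 = c2 / 2.0

    # The three components described above, then their product.
    luminance = (2 * mu_x * mu_y + c1) / (mu_x**2 + mu_y**2 + c1)
    contrast = (2 * np.sqrt(var_x) * np.sqrt(var_y) + c2) / (var_x + var_y + c2)
    structure = (cov_xy + c3) / (np.sqrt(var_x) * np.sqrt(var_y) + c3)
    return luminance * contrast * structure

# Identical depth maps score ~1.0; dissimilar maps score lower.
d = np.random.rand(64, 64)
print(ssim_global(d, d))  # ~1.0
```

Each of the three factors is 1 when the two maps agree perfectly on that aspect, so a perfect match gives SSIM of 1.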

In addition, point-wise depth is an indicator of the per-pixel difference in value between the two depth maps. Smoothness loss compares the "flow" of the image pixels using second-order derivatives of the pixel values. When working out the weighted average, the weights are normally 0.8 for SSIM and 0.1 each for the L1 loss and the smoothness loss.
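Putting the three terms together, the combined loss can be sketched as below. This is a self-contained NumPy illustration of the weighting scheme, not our training code: the SSIM here is a simplified whole-image version, and a real pipeline would use the framework's windowed SSIM op instead.

```python
import numpy as np

def mde_loss(pred, true, w_ssim=0.8, w_l1=0.1, w_smooth=0.1):
    """Weighted MDE loss: 0.8 * SSIM term + 0.1 * L1 + 0.1 * smoothness."""
    # Point-wise depth (L1) term: mean absolute per-pixel difference.
    l1 = np.abs(pred - true).mean()

    # Simplified whole-image SSIM, mapped to a loss in [0, 1] via (1 - SSIM) / 2.
    c1, c2 = 1e-4, 9e-4
    mx, my = pred.mean(), true.mean()
    vx, vy = pred.var(), true.var()
    cov = ((pred - mx) * (true - my)).mean()
    ssim = ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx**2 + my**2 + c1) * (vx + vy + c2))
    ssim_loss = (1.0 - ssim) / 2.0

    # Smoothness term: compare second-order differences of the two maps.
    smooth = np.abs(np.diff(pred, n=2, axis=0) - np.diff(true, n=2, axis=0)).mean() \
           + np.abs(np.diff(pred, n=2, axis=1) - np.diff(true, n=2, axis=1)).mean()

    return w_ssim * ssim_loss + w_l1 * l1 + w_smooth * smooth
```

A perfect prediction drives all three terms, and therefore the total, to zero.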

Through this process, we can identify three factors that a successful MDE model requires.

- First, and most important, the model should be able to pick out the different objects within a picture, as in segmentation problems.

- Second, it needs to be able to precisely pinpoint the edges of these objects.

- Third, the overall brightness balance needs to be well established.

So far, the SOTA models that provide the best results mostly employ the Vision Transformer (ViT). In a nutshell, the transformer picks out the overall properties of an image more effectively by leveraging multi-head attention and context.
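The core operation behind that attention mechanism can be sketched in a few lines. This toy NumPy example (shapes and values are made up for illustration) shows scaled dot-product attention, which lets every image patch weigh its relationship to every other patch:

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention: the building block of the multi-head
    attention a ViT uses to relate each image patch to all the others."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Softmax over the patch axis: each row becomes a weighting over patches.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Toy self-attention: 9 patch embeddings of dimension 8 (q = k = v).
patches = np.random.rand(9, 8)
out = attention(patches, patches, patches)
```

Each output row is a weighted mixture of all patches, which is how the model gathers context from the whole image rather than from a local window.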

Among these, BinsFormer and DepthFormer, modified ViT models that exhibited the highest accuracy, recorded losses of around 0.1. However, these models have disadvantages as well. They are fairly complex in structure, so they take longer to train, and high accuracy cannot be expected unless a large volume of data is provided during training. That is a critical weakness for MDE tasks, which require LiDAR-scanned depth maps for training, making it challenging to produce a high volume of training data.

Cantrell et al. suggest that performance on par with what we just described can be achieved without the transformer, using only a U-Net, in a shorter time and with much greater memory efficiency. Therefore, we pursued the task with a U-Net next.


Beyond the MDE problem, the models that perform well across various Vision problems, such as segmentation, which identifies a certain object within an image, and verification problems such as face recognition, share a common element: they are built on a high-efficiency encoder and decoder.

Among these, the convolutional neural network (CNN) provides excellent encoding performance. In deep learning on images, it is important to identify features from regions of the image rather than handling each pixel independently. Each convolutional layer scans the entire image section by section with a kernel of a specified size, which makes it proficient at tasks such as edge detection. In addition, depending on the kernel configuration, the features of a sample can be expanded after encoding.
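The kernel-scanning idea is easy to see in a small NumPy sketch. The Sobel kernel below is a classic hand-crafted edge detector; a trained convolutional layer learns kernels that play a similar role (the image and sizes here are made up for illustration):

```python
import numpy as np

def conv2d(img, kernel):
    """Naive 'valid' 2-D convolution: the kernel scans the image
    section by section, exactly as a convolutional layer does."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (img[i:i + kh, j:j + kw] * kernel).sum()
    return out

# A Sobel kernel responds strongly to vertical edges.
sobel_x = np.array([[1, 0, -1], [2, 0, -2], [1, 0, -1]], dtype=float)
img = np.zeros((8, 8))
img[:, 4:] = 1.0                 # a vertical edge at column 4
edges = conv2d(img, sobel_x)     # non-zero only around the edge
```

Away from the edge the response is exactly zero, while the columns straddling the brightness change light up, which is the behavior a depth model relies on to find object boundaries.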

U-Net is a deep learning network based on CNNs: it trains on images using CNN downsampling as the encoder and CNN upsampling as the decoder. During downsampling, the ReLU activation is applied after each CNN layer, followed by max pooling.

U-Net also has skip connections at each layer. During upsampling, it does not rely solely on the encoded samples; it also concatenates the corresponding samples from before encoding. CNN encoding is fast and efficient, but several successive encoding layers can cause overfitting and hurt performance. The skip connections reintroduce overall features that were lost during encoding, making the decoded output smoother.
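The data flow described above can be sketched as a skeleton in NumPy. This is only the routing logic: the convolutions are omitted and replaced with a simple average, and the pooling/upsampling are the plainest possible versions, so it shows where the skip connections plug in rather than how a real U-Net is implemented.

```python
import numpy as np

def down(x):
    """2x2 max pooling: the downsampling step of the U-Net encoder."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def up(x):
    """Nearest-neighbour upsampling: the decoder's counterpart."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def unet_pass(x, depth=2):
    """Skeleton of a U-Net forward pass: encode while saving features,
    then decode while concatenating the saved skip features per level."""
    skips = []
    for _ in range(depth):
        skips.append(x)                  # saved for the skip connection
        x = down(x)
    for _ in range(depth):
        skip = skips.pop()
        x = np.stack([up(x), skip])      # concat decoded + pre-encoding features
        x = x.mean(axis=0)               # stand-in for the decoder convolutions
    return x

out = unet_pass(np.random.rand(16, 16))  # output keeps the input resolution
```

Because every decoder level mixes in the matching encoder features, fine detail that pooling discarded can still reach the output, which is exactly the property that matters for sharp depth edges.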

As we see it, the U-Net approach seems especially efficient for the MDE problem: the convolutional layers are very proficient at edge detection and object differentiation, and the skip connections retain features concerning the overall depth of the image.


The most commonly used open datasets for MDE problems are KITTI, captured with a LiDAR scanner during outdoor vehicle rides, and NYU-Depth, which captures indoor spaces and objects. We decided to use the NYU-Depth dataset, as it is more suitable for the Arbeon App.

The following tech stack was used to perform this project.

- Deep Learning Framework: TensorFlow
- Docker Image: tensorflow/tensorflow
- Arbeon GPU

You can click here to view a more detailed list of packages and libraries.

Learning results

First, one training run of 15 epochs took about 3-4 hours on a single GPU, with each step taking 0.6 seconds on average. We tried 15, 20, and 30 epochs, and all three runs recorded a loss of around 0.2 and an accuracy of about 75%. The validation error and training error converged similarly.

This result was unsatisfactory compared with the results presented by Cantrell et al., and it left a lot of room for improvement compared with the SOTA ViT models (0.1 loss, 99% accuracy).

A case study of the outputs showed that images with simple, clear distinctions between front and back produced comparatively better results.

Inference on the left, ground-truth depth map captured with LiDAR in the middle, and the actual image on the right

We found two major issues.
The first was that edge detection was not clean, as in the image below: edges were inferred blurrily.

The other issue was that when an image is complex and has many layers, the foreground and background could not be properly distinguished, resulting in a reversed depth map.

Besides the open dataset, we also ran inference on pictures taken by another intern, Charlie, and me, which yielded the following outcomes.

1. Relatively successful cases

2. Poor edge detection

3. Recognition problem for transparent objects

4. Reverse depth order

Looking for a model with better accuracy

We expected that the CNN and skip connections would solve edge detection and depth ordering, but the outcome did not meet our expectations. From these results, we conclude that we need models like ViT that are more complex but stronger. Since a transformer network can pinpoint the overall features of an image using context and attention, we can expect an improved outcome.

In addition, among the SOTA models, the BinsFormer and DepthFormer architectures provided the best performance not only on NYU-Depth but also on the KITTI dataset. That is all I wanted to share today. The references below will help you understand a little better. Thank you for reading!


About U-Net: https://www.scitepress.org/Papers/2020/97818/97818.pdf

U-Net Code: https://github.com/siddinc/monocular_depth_estimation

ViT BinsFormer: https://arxiv.org/pdf/2204.00987.pdf

ViT DepthFormer: https://arxiv.org/pdf/2203.14211.pdf

Contact us

General inquiry - official@arbeon.com

Investment inquiry - ir@arbeon.com
Press release - media@arbeon.com
Partnership inquiry - partnership@arbeon.com

3F, 211, Hakdong-ro, Gangnam-gu, Seoul, Republic of Korea