Real-Time High Resolution Background Matting/Segmentation - Issue #2
Real-Time High-Resolution Background Matting, Fine-Grained Segmentation
Ever wondered what the state of the art is for applying neural networks to automatically segment interesting (salient) objects from their backgrounds? Two relatively recent papers provide useful insights. Let's take a look!
Real-Time High-Resolution Background Matting | Paper Review (CVPR 2021)
In this paper, the authors[1] propose a two-stage deep neural network model for real-time segmentation of subjects from the background. Their approach achieves 60 FPS on HD images (1920×1080) and 30 FPS on 4K images (3840×2160), measured on a GPU.
Download paper here.
Pros
The proposed model handles hair and subject-boundary details much better than current approaches (think of how Zoom might crop out parts of your hair, or fail when your hand is close to your face or there is some other occlusion).
They improve the speed/latency state of the art for processing large images. Previous approaches that attempt fine-grained segmentation achieve 8 FPS on 512×512 images (pretty much unusable). Their approach achieves 60 FPS on HD images (1920×1080) and 30 FPS on 4K images (3840×2160).
They achieve these speed gains by using a two-stage network. The first network downsamples the image and outputs matte predictions plus an error prediction map at low resolution. The second network (a refinement network) uses the low-resolution result and the original image to generate high-resolution output (fine-grained detail) for only select regions of the image (see the sketch after this list).
They compare their approach with several existing approaches and create a Zoom plugin that pipes the model's output into Zoom.
They provide sample code to reproduce their results and experiments via notebooks.
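To make the two-stage idea above concrete, here is a minimal PyTorch sketch of the flow. This is not the authors' code: the class names, layer sizes, patch size, and top-k patch selection are all simplified assumptions. It only illustrates the structure: a coarse network that consumes a downsampled source + background pair and emits a low-resolution alpha matte plus an error map, and a refiner that re-predicts alpha at full resolution only for the patches with the highest predicted error.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoarseMattingNet(nn.Module):
    """Stage 1 (sketch): low-resolution alpha matte + error map from a
    downsampled source/background pair (6 input channels)."""
    def __init__(self, downsample=0.25):
        super().__init__()
        self.downsample = downsample
        self.backbone = nn.Sequential(           # toy stand-in for the paper's encoder-decoder
            nn.Conv2d(6, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.alpha_head = nn.Conv2d(32, 1, 1)    # coarse alpha matte
        self.error_head = nn.Conv2d(32, 1, 1)    # where the coarse matte is likely wrong

    def forward(self, src, bgr):
        x = torch.cat([src, bgr], dim=1)
        x = F.interpolate(x, scale_factor=self.downsample,
                          mode="bilinear", align_corners=False)
        feat = self.backbone(x)
        return torch.sigmoid(self.alpha_head(feat)), torch.sigmoid(self.error_head(feat))

class PatchRefiner(nn.Module):
    """Stage 2 (sketch): re-predict alpha for one full-resolution patch
    (source RGB + upsampled coarse alpha = 4 channels)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 16, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 1, 3, padding=1),
        )

    def forward(self, patch):
        return torch.sigmoid(self.net(patch))

def matte(src, bgr, coarse_net, refiner, patch=16, k=64):
    """Coarse pass, then refine only the k patches with the highest predicted
    error; everything else keeps the upsampled coarse alpha."""
    coarse_alpha, error_map = coarse_net(src, bgr)
    H, W = src.shape[-2:]
    alpha = F.interpolate(coarse_alpha, size=(H, W), mode="bilinear", align_corners=False)
    error = F.interpolate(error_map, size=(H, W), mode="bilinear", align_corners=False)

    # Mean error per non-overlapping patch, then pick the k worst patches.
    err_patches = F.avg_pool2d(error, patch)
    flat = err_patches.flatten(1)
    idx = flat.topk(min(k, flat.shape[1]), dim=1).indices

    pw = W // patch
    for b in range(src.shape[0]):
        for i in idx[b].tolist():
            y, x = (i // pw) * patch, (i % pw) * patch
            crop = torch.cat([src[b:b+1, :, y:y+patch, x:x+patch],
                              alpha[b:b+1, :, y:y+patch, x:x+patch]], dim=1)
            alpha[b:b+1, :, y:y+patch, x:x+patch] = refiner(crop)
    return alpha

# Usage: a single HD frame and its background plate.
src = torch.rand(1, 3, 1080, 1920)
bgr = torch.rand(1, 3, 1080, 1920)
with torch.no_grad():
    alpha = matte(src, bgr, CoarseMattingNet(), PatchRefiner())
print(alpha.shape)  # torch.Size([1, 1, 1080, 1920])
```

The key design point this mimics is that the expensive full-resolution work is spent only on the small fraction of patches (hair, object boundaries) where the coarse matte is predicted to be wrong, which is what makes real-time HD/4K processing feasible.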
Cons
The system requires specifying a background image to work well. This is not a huge issue, but it introduces an additional step (capturing/selecting the background image) that might interfere with usability.
The reported results (60 FPS on HD images and 30 FPS on 4K images) are measured on a GPU (an NVIDIA RTX 2080 Ti). This suggests the model might still be unusable on CPUs (the majority of user environments).
See the full post here.
[1] Lin, S., Ryabtsev, A., Sengupta, S., Curless, B., Seitz, S., & Kemelmacher-Shlizerman, I. (2020). Real-Time High-Resolution Background Matting. arXiv preprint arXiv:2012.07810. CVPR 2021.
U2-Net: Going Deeper with Nested U-Structure for Salient Object Detection
In this work, the authors[2] propose a deep UNet-like model (pretty much a UNet, but made up of UNet-ish blocks) for salient object detection (the task of segmenting the most visually salient objects in an image). Here are some reasons I found this paper interesting:
Highly reproducible. The authors provide a PyTorch implementation and pretrained model weights, which simply work! This drastically reduces the amount of time and effort needed to replicate their approach. 🤘.
Provide pretrained models from their experiments and sample code for running inference on custom data.
Provide a significantly smaller/faster version of their model, along with a performance evaluation.
Perhaps the most interesting contribution of this paper is the introduction of residual U-blocks (RSU) and the ablation studies that show they indeed improve performance metrics. Their intuition is that the residual connection within each UNet block enables focus on local details, while the overall nested U-Net architecture fuses these local details with global (multi-scale) contextual information (see the sketch after this list).
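To illustrate the residual U-block idea, here is a simplified PyTorch sketch. It is my own toy version, not the authors' implementation: the class name RSUBlock, the depth (two encoder levels), and the channel counts are assumptions. It shows the essential structure: a small encoder-decoder (a mini UNet) that gathers multi-scale context inside the block, whose output is added back to the block's input as a residual.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn_relu(in_ch, out_ch, dilation=1):
    """3x3 conv + BN + ReLU: the basic unit inside the block."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=dilation, dilation=dilation),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class RSUBlock(nn.Module):
    """Toy residual U-block: a tiny UNet whose output is added back to the
    block's input, fusing local detail (residual path) with multi-scale
    context (U path) inside every block."""
    def __init__(self, in_ch, mid_ch, out_ch):
        super().__init__()
        self.conv_in = conv_bn_relu(in_ch, out_ch)     # residual branch input
        # Encoder: two downsampling levels plus a dilated bottom layer.
        self.enc1 = conv_bn_relu(out_ch, mid_ch)
        self.enc2 = conv_bn_relu(mid_ch, mid_ch)
        self.bottom = conv_bn_relu(mid_ch, mid_ch, dilation=2)
        # Decoder: mirror the encoder, concatenating skip connections.
        self.dec2 = conv_bn_relu(mid_ch * 2, mid_ch)
        self.dec1 = conv_bn_relu(mid_ch * 2, out_ch)

    def forward(self, x):
        xin = self.conv_in(x)
        e1 = self.enc1(xin)
        e2 = self.enc2(F.max_pool2d(e1, 2))
        b = self.bottom(F.max_pool2d(e2, 2))
        d2 = self.dec2(torch.cat(
            [F.interpolate(b, size=e2.shape[-2:], mode="bilinear", align_corners=False), e2],
            dim=1))
        d1 = self.dec1(torch.cat(
            [F.interpolate(d2, size=e1.shape[-2:], mode="bilinear", align_corners=False), e1],
            dim=1))
        return d1 + xin   # the residual connection that gives the block its name

# Usage: one block applied to a 64-channel feature map.
x = torch.rand(1, 64, 128, 128)
out = RSUBlock(64, 32, 64)(x)
print(out.shape)  # torch.Size([1, 64, 128, 128])
```

In the full U2-Net, blocks like this are themselves arranged in an outer UNet (hence "nested U-structure"); the sketch only covers the inner block, which is where the ablations in the paper attribute the performance gains.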
See the full post here.
[2] Qin, Xuebin, et al. "U2-Net: Going deeper with nested U-structure for salient object detection." Pattern Recognition 106 (2020): 107404.