In Background Matting: The World is Your Green Screen, Sengupta et al. train a machine learning model to extract figures in the foreground of photos and videos and collage them onto new backgrounds. Traditional methods for this kind of "background matting" require a green screen or a handmade trimap to build the matte, a per-pixel annotation of foreground color and alpha (opacity). This new model instead requires two versions of the source photo or video: one with the person/subject in the foreground, and one without, showing just the background. Below I show some examples of how this works and how wandb can help analyze results and compare different models on this task.
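For reference, the compositing step that a matte makes possible is the standard per-pixel blend I = alpha * F + (1 - alpha) * B. Here is a minimal NumPy sketch of that blend. The function and variable names are my own illustrations, not taken from the authors' code.

```python
import numpy as np

def composite(foreground, alpha, new_background):
    """Blend a foreground onto a new background with a per-pixel alpha matte.

    foreground, new_background: float arrays of shape (H, W, 3) in [0, 1].
    alpha: float array of shape (H, W) in [0, 1]; 1 means fully foreground.
    """
    a = alpha[..., np.newaxis]  # add a channel axis so alpha broadcasts over RGB
    return a * foreground + (1.0 - a) * new_background

# Tiny 1x2 image: the left pixel is fully foreground, the right is half transparent.
fg = np.array([[[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]])  # red "subject" (illustrative)
bg = np.array([[[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]])  # blue replacement background
alpha = np.array([[1.0, 0.5]])
out = composite(fg, alpha, bg)
```

Pixels with fractional alpha, such as hair or motion blur, end up as a mix of subject and new background. This is why small errors in the matte show up as halos or noise at the edges.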
I loaded a saved model, ran it on the sample videos (fixed camera), and logged to wandb with wandb.Video(). The pretrained model is very impressive and doesn't get confused by the other humans or similar colors in the new substitute background for the video. Below I do the same for photos, logging every stage of the process, which can be very useful for debugging.
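Logging a clip this way can be sketched roughly as follows. This assumes the output frames are available as (H, W, 3) uint8 NumPy arrays. The helper names, project name, log key, and frame rate are hypothetical, and as far as I know, encoding a NumPy array as a GIF also requires extra packages such as moviepy.

```python
import numpy as np

def to_wandb_video_array(frames):
    """Stack (H, W, 3) frames into the (time, channel, height, width)
    layout that wandb.Video expects for NumPy input."""
    video = np.stack(frames).astype(np.uint8)  # (T, H, W, 3)
    return video.transpose(0, 3, 1, 2)         # (T, 3, H, W)

def log_matting_video(frames, key="composite_video", fps=4,
                      project="background-matting-demo"):
    """Hypothetical helper: log one clip to a new W&B run."""
    import wandb  # imported here so the array helper works without wandb installed
    run = wandb.init(project=project)
    run.log({key: wandb.Video(to_wandb_video_array(frames), fps=fps, format="gif")})
    run.finish()

# Placeholder frames; in practice these would be the model's composited outputs.
frames = [np.full((4, 6, 3), i * 60, dtype=np.uint8) for i in range(4)]
video = to_wandb_video_array(frames)
```

Calling `log_matting_video(frames)` would then create a run with a playable clip in the media panel.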
The existing code for this project enables several tasks:
All of these could be interesting to explore in Weights & Biases. The core photo-matting model requires the Adobe synthetic-composite dataset of 45.5K train and 1K test images extracted from a simple background and composited onto a new one, with accompanying alpha masks. This dataset is not immediately available for download, though there is a contact email. Fortunately, the authors provide a download link for the saved model, plus a notebook for you to run the model on your own images.
The video-matting model is a self-supervised generative adversarial network trained on frames extracted from real unlabeled videos (with a new dataset provided by the authors). This training finetunes from a previously trained network, i.e. the photo-matting model trained on the static Adobe images. With each image, the video-matting model also takes in an automatically-generated soft semantic segmentation map as an initial estimate of the foreground (these masks are precomputed prior to training). Below I finetune a few different versions of the video-matting GAN to see the effect of various hyperparameters and explore the performance on sample photos.
The top row in both panels isn't a perfect matte, but it looks very reasonable aside from slight noise around hair and shadows (and the light wood paneling at the bottom of the image being parsed as foreground). After many rounds of testing, I assumed the indoor setting and bright lighting conditions fell outside the range of reasonable inputs for the model, until I confirmed the following:
The first column shows the original training image frames. The second is the automatically-computed soft segmentation mask for the foreground figure. The third is the predicted alpha: you can see that this column makes color-based mistakes, parsing a black table top as foreground and a white ID badge as background. The last column is alpha supervision, correcting the fine details like fingers and hair. The two rightmost columns below show the prediction and supervision for the foreground.
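A panel like this can be assembled by logging each stage under its own key with wandb.Image. Below is a minimal sketch, assuming each stage is available as a NumPy array with alpha mattes as floats in [0, 1]. The stage names and both helper functions are hypothetical and do not come from the project's code.

```python
import numpy as np

def alpha_to_uint8(alpha):
    """Convert a float alpha matte in [0, 1] to an 8-bit grayscale image."""
    return np.rint(np.clip(alpha, 0.0, 1.0) * 255).astype(np.uint8)

def log_stages(run, frame, soft_mask, alpha_pred, alpha_target, step):
    """Hypothetical helper: log every stage for one frame so the stages
    line up side by side in the W&B media panel."""
    import wandb  # imported here so alpha_to_uint8 works without wandb installed
    run.log({
        "input_frame": wandb.Image(frame, caption="original frame"),
        "soft_segmentation": wandb.Image(alpha_to_uint8(soft_mask), caption="soft mask"),
        "alpha_prediction": wandb.Image(alpha_to_uint8(alpha_pred), caption="predicted alpha"),
        "alpha_supervision": wandb.Image(alpha_to_uint8(alpha_target), caption="alpha supervision"),
    }, step=step)

# Quick check of the conversion on a toy matte; values outside [0, 1] are clipped.
toy = alpha_to_uint8(np.array([[0.0, 0.2, 1.0, 1.7]]))
```

Keeping the prediction and its supervision under separate keys makes it easy to spot where they disagree, such as the table top and ID badge mentioned above.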
The video-matting GAN is well-tuned, showing fast-dropping loss curves that are well-balanced between generator and discriminator. The experiments below train for a few epochs on a random subsample of the full dataset of 13,000+ video frames to explore the effect of different hyperparameters. To see more detail in the charts below, zoom into a subregion of the x-axis by clicking on its left endpoint and dragging right.
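Drawing a reproducible random subsample of frames can be sketched as below. The file names, the 10% fraction, and the seed are illustrative assumptions, not the values used in these experiments.

```python
import random

def subsample_frames(frame_paths, fraction, seed=0):
    """Return a reproducible random subset of frame paths."""
    rng = random.Random(seed)  # local RNG so the global random state is untouched
    k = max(1, round(len(frame_paths) * fraction))
    return sorted(rng.sample(frame_paths, k))

# Hypothetical frame list standing in for the extracted video frames.
all_frames = [f"frame_{i:05d}.png" for i in range(13000)]
subset = subsample_frames(all_frames, fraction=0.1)
```

Fixing the seed means every hyperparameter variant trains on the same frames, so differences in the loss curves come from the hyperparameters and not from the sample.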
I tried changing some of the hyperparameters of the GAN to see how much they would affect convergence. The baseline is shown in black below, training on the full dataset, and you can select one or both tabs to compare different model variants. I'm zooming in on the very start of training because it converges very quickly.