Understand a dashboard scene with semantic segmentation

A self-driving car must functionally understand the road and its environment the way a human would from the driver's seat. One promising computer vision approach is semantic segmentation: parsing visual scenes from a car dashboard camera into relevant objects (cars, pedestrians, traffic signs), foreground (road, sidewalk), and background (sky, buildings). Semantic segmentation annotates an image with object types, labeling each meaningful subregion as a tree, bus, cyclist, etc. For a given dashboard photo, this means labeling every pixel as belonging to one of these subregions.

Below you can see two columns of examples, each showing the raw image, the model's prediction, and the ground truth (correct labeling). Buildings are orange, cars are pink, road is cobalt blue, and pedestrians are beige. In the left column, the model can't differentiate between a pedestrian and a rider on a bicycle (magenta and cyan in the ground truth, beige in the prediction). Note how the hazy conditions in the right column make the model's predictions blurry around the boundaries between dashboard and road, or vehicle and road.

Example segmentation maps

Reproduce & extend existing work

**Objective**

Train a supervised model to annotate dashboard-camera scenes at the per-pixel level into 20 relevant categories like "car", "road", "person", "traffic light".

**Code: U-Net in fast.ai**

I follow an excellent existing W&B project on semantic segmentation, summarized in this blog post by Boris Dayma. The starting model is a U-Net trained using fast.ai.
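For reference, here is a minimal sketch of how such a model might be set up with the fast.ai v2 API; the dataset path, mask-filename pattern, class order, and training settings are placeholders, not the project's actual code.

```python
# A minimal sketch, assuming the fast.ai v2 API and a local copy of the BDD100K
# segmentation subset; the paths and mask-filename pattern below are placeholders
# to adapt to your local layout.
from fastai.vision.all import *

path = Path("bdd100k/seg")  # hypothetical dataset location

# Cityscapes-style label set used by BDD100K (19 classes plus "void");
# the order must match the integer ids in the label masks.
codes = ["road", "sidewalk", "building", "wall", "fence", "pole",
         "traffic light", "traffic sign", "vegetation", "terrain", "sky",
         "person", "rider", "car", "truck", "bus", "train",
         "motorcycle", "bicycle", "void"]

def label_func(fn):
    # Assumes masks live in a "labels" folder with a "_train_id.png" suffix
    return path / "labels" / f"{fn.stem}_train_id.png"

dls = SegmentationDataLoaders.from_label_func(
    path, get_image_files(path / "images"), label_func,
    codes=codes, bs=8, item_tfms=Resize((360, 640)))

learn = unet_learner(dls, resnet18, metrics=[foreground_acc])
learn.fine_tune(5, base_lr=1e-4, wd=1e-2)
```

fast.ai's `unet_learner` builds the U-Net decoder automatically on top of the chosen encoder backbone, which is what the encoder comparisons further down vary.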

**Dataset: Berkeley Deep Drive 100K**

This model is trained on the Berkeley Deep Drive 100K dataset (BDD100K). For semantic segmentation, it contains 7K train, 1K validation, and 2K test labeled photos.

**Findings so far**

How do we compare vehicles to people?

Great on cars — but humans are incredibly hard to detect

Factoring out the accuracies per class (car, traffic sign/light, human) shows us how well the model identifies different components of a driving scene. While it performs well on cars and traffic signs/lights, it detects barely any humans, especially when measuring by mean accuracy (percentage of human-containing pixels correctly identified) as opposed to mean IoU. One thing to try next is filtering BDD100K to train/test only on examples that contain humans.
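For concreteness, here is a minimal sketch of the per-class accuracy described above, computed from integer-coded prediction and ground-truth masks; this is an illustration, not the project's logging code.

```python
import numpy as np

def per_class_accuracy(pred, target, num_classes):
    """For each class c: the fraction of pixels whose true label is c
    that the model also predicts as c (per-class recall over pixels)."""
    accs = np.full(num_classes, np.nan)
    for c in range(num_classes):
        mask = target == c
        if mask.any():
            accs[c] = (pred[mask] == c).mean()
    return accs  # nan for classes absent from the ground truth
```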

Intersection over Union

For all of the original model variants tried, the human (pedestrian or rider) accuracy is a flat line at 0. It's possible that the computed mean is extremely low because humans take up such a small fraction of the pixels in an image. To get more signal, I tried logging a common metric in semantic segmentation: intersection over union (see this good definition and clearer intuition, though here we are dealing with any contiguous collection of pixels as subregions of the image, not strictly rectangular boxes). A perfect IoU is 1: the correct pixel subregion and the predicted pixel subregion match exactly, so their intersection equals their union. In the later model variants, IoU reveals which models are detecting humans, with the highest human IoU so far reaching 0.01758, compared to the highest mean IoU of 0.7997 and best overall accuracy of 0.8873.
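A minimal sketch of IoU on pixel masks, assuming integer-coded prediction and ground-truth arrays (again an illustration rather than the project's implementation):

```python
import numpy as np

def class_iou(pred, target, cls):
    """IoU for one class: intersection over union of the predicted and
    ground-truth pixel masks for that class."""
    pred_mask, target_mask = pred == cls, target == cls
    union = np.logical_or(pred_mask, target_mask).sum()
    if union == 0:
        return float("nan")  # class absent from both prediction and ground truth
    return np.logical_and(pred_mask, target_mask).sum() / union

def mean_iou(pred, target, num_classes):
    """Average IoU over the classes that appear in prediction or ground truth."""
    ious = [class_iou(pred, target, c) for c in range(num_classes)]
    return np.nanmean(ious)
```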

IoU more helpful and perhaps less biased than accuracy

Below you can see that in four of the best models (one color each), car detection accuracy (all the solid lines) is generally better than traffic sign/light/pole accuracy (all the dashed lines). The overall average accuracy (all the dotted lines) measures every object class except "void", for 19 total. It is better than traffic accuracy but worse than car accuracy, likely because cars are some of the most frequent and largest objects while traffic poles and signs are much smaller and less frequent. Although the "best human iou so far" model is less impressive based on accuracy, optimizing for human IoU appears to yield a more balanced model across classes: the variance across the car and traffic prediction metrics is half of what it is for the other three models.

Comparing per-class accuracies

Encoders: Resnet too broad, Alexnet too detailed or too blocky

The panel below shows the difference between two early variants of the model based on the U-Net encoder: Resnet-18 (representative predictions in the left column of the example panel on the right) and an Alexnet variant (right column) tried in a hyperparameter sweep.

First encoders

The Alexnet model picks up on too many details, parsing the individual windows on the buildings and shadow segments on the car as separate object classes. The Resnet model is generally more accurate, but it makes mistakes in broader patches, such as merging car and truck identifications in the bottom right, or hallucinating patches of car and building in the overpass (note that this pale blue is labeled as "void", i.e. not a class of interest like "wall" or "building", in the ground truth). Note that other differences between the models (namely learning rate and number of training stages) could explain this discrepancy.
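For reference, here is a hedged sketch of how the two encoders might be swapped in fast.ai's `unet_learner`, reusing the hypothetical `dls` from the earlier sketch; the sweep's actual configuration may differ.

```python
# Sketch (not the sweep's actual code): swapping the U-Net encoder in fast.ai,
# reusing the hypothetical `dls` defined in the earlier training sketch.
from fastai.vision.all import *
from torchvision.models import alexnet

learn_resnet = unet_learner(dls, resnet18)  # Resnet-18 encoder
learn_alex = unet_learner(dls, alexnet)     # Alexnet encoder variant
```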

Alexnet after tuning: Finds humans but blocky

From this naive Alexnet, I ran a longer hyperparameter sweep using Bayesian optimization with IoU as the objective metric. I also tracked human-specific IoU in these runs. The top-performing runs by this metric are mostly Alexnets, generally with lower learning rates and more training than the first ones I tried. While these have much higher IoU and accuracy than the initial Alexnets, they parse the image in a blocky pattern (see the "Improved Encoders" section below). This yields unrealistic segmentation for most regions (straight lines where there should be curves). However, these blocks seem much better for actually finding humans, as illustrated below. On human IoU specifically, Alexnet outperforms Resnet by an order of magnitude, though of course at the expense of precision. The ideal encoder would balance human recall with precision (crisp outlines instead of big vague blocks) and requires further tuning.
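A small sketch of how a combined human IoU could be tracked per run with W&B, reusing the `class_iou` helper sketched above; the class indices are assumptions about the label map, not the dataset's documented ids.

```python
# Sketch: logging a combined "person" + "rider" IoU to W&B each evaluation step.
import numpy as np
import wandb

HUMAN_CLASSES = (11, 12)  # hypothetical indices for "person" and "rider"

def log_human_iou(pred, target, step=None):
    ious = [class_iou(pred, target, c) for c in HUMAN_CLASSES]
    wandb.log({"human_iou": float(np.nanmean(ious))}, step=step)
```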

alexnet example

First Encoders: Resnet vs Naive Alexnet

Improved Encoders: Alexnet vs Resnet with High Human IOU

First experiments: increase weight decay, decrease learning rate

After cloning the repo and verifying that the code runs, I tried varying weight decay and learning rate. These are grouped as "First manual sweep".

The initial experiments were run on a tiny fraction of the data (1%) and may not be representative. Select "All manual runs" below to see the effect of increasing the fraction to 20% and 100%. Note that the accuracy can vary by over 10% for a fixed pairing of learning rate and weight decay.
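A sketch of what one such manual run might look like with fast.ai and W&B, reusing the hypothetical `dls` from the first sketch; the project name and hyperparameter values here are placeholders, not the settings used in this report.

```python
# Sketch of a single manual run varying learning rate and weight decay.
import wandb
from fastai.vision.all import *
from fastai.callback.wandb import WandbCallback

wandb.init(project="semantic-segmentation", config={"lr": 1e-4, "wd": 1e-2})
learn = unet_learner(dls, resnet18, wd=wandb.config.wd, cbs=WandbCallback())
learn.fit_one_cycle(5, lr_max=wandb.config.lr)
wandb.finish()
```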

First Experiments: Increase Weight Decay, Decrease Learning Rate

Hyperparameter Sweep Insights

Learning rate: decrease

Lower learning rates appear to correlate with higher accuracy; this could be investigated in more detail.

Training stages: keep low

Increasing the number of training stages beyond 2-3 doesn't seem to help much (check the starter code).

Weight decay: inconclusive

Initially, increasing the weight decay 5X improved the accuracy by 9%. However, increasing it 200X yields the same amount of improvement.

Hyperparameter Sweep Insights

Insights from Sweeps

I compared the average accuracy of my manual sweep (purple) with two random search sweeps (blue and red) and a Bayesian sweep trying to maximize human detection accuracy (gray). The automated sweeps have higher variance, finding lots of inferior combinations but also a few surprisingly superior ones.

Objective metric matters

I tried a sweep with Bayesian optimization to maximize IoU (green). This ran the longest and yielded some of the best new models and ideas to try. In particular, it found hyperparameter combinations that started to detect humans, and suggested that Alexnet was worth revisiting.
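For illustration, here is a minimal sketch of a Bayesian sweep definition in W&B; the metric name, parameter ranges, and encoder choices are assumptions, not this report's exact sweep file.

```python
# Sketch of a W&B Bayesian sweep configuration maximizing IoU.
import wandb

sweep_config = {
    "method": "bayes",
    "metric": {"name": "iou", "goal": "maximize"},
    "parameters": {
        "learning_rate": {"distribution": "log_uniform_values", "min": 1e-5, "max": 1e-2},
        "weight_decay": {"distribution": "log_uniform_values", "min": 1e-6, "max": 1e-1},
        "encoder": {"values": ["resnet18", "alexnet"]},
        "training_stages": {"values": [1, 2, 3]},
    },
}

sweep_id = wandb.sweep(sweep_config, project="semantic-segmentation")
# wandb.agent(sweep_id, function=train, count=50)  # `train` would wrap the fast.ai loop
```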

Note: the dataset size varies substantially across sweeps (the gray and green sweeps used ~1400 training examples; the others used ~100).

Next steps

Comparing manual and automated sweeps