We are going to use a U-Net-style network.
The main idea is to replace the encoder with a ResNet, which is efficient at recognizing features. This also lets us use pre-trained networks for the encoder.
We start by running several variants of the architecture on smaller images (256 x 256) to get a first idea of what works well.
We then look at all the runs and try to group them to understand which ones are the most valuable.
A few conclusions emerge from this first batch of runs:
Finally, a reasonable target to aim for when moving to larger images seems to be 0.90 accuracy.
This report is a saved snapshot of Boris' research. He's published this example so you can see how to use W&B to visualize training and keep track of your work. Feel free to add a visualization, click on graphs and data, and play with features. Your edits won't overwrite his work.
Boris uses various approaches to parse street scenes. He's using a U-shaped network and varying the encoders, weight decay, learning rate, pre-training approach, and more.
We then train a few models that keep the correct image aspect ratio but at a reduced size of 320 x 180.
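Keeping the aspect ratio just means scaling both dimensions by the same factor. A quick illustrative helper (not from the original code; the 1280 x 720 source size is an assumption for the example):

```python
# Hypothetical helper: compute a reduced size that preserves the source
# aspect ratio, given only a target width.
def reduced_size(src_w, src_h, target_w):
    scale = target_w / src_w
    return target_w, round(src_h * scale)

# A 16:9 source (e.g. an assumed 1280 x 720 frame) reduced to width 320
# keeps the ratio and lands exactly on 320 x 180:
print(reduced_size(1280, 720, 320))  # (320, 180)
```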
However, the results are not as good.
Let's go to a higher resolution!
For our final model, we use images of 640 x 360. Our goal is to get more than 90% accuracy.
While we quickly reach 89% accuracy, it is very difficult to get to 90%.
We finally succeed in getting above 90%, mainly by adjusting the learning rate (with both ResNet-34 and ResNet-18). While those runs are longer, we actually reach our target early and never improve much afterwards.
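The report doesn't give the exact learning rates Boris tried, but one common way to search when training plateaus is to scan candidate rates on a log scale, a sketch under that assumption:

```python
import math

def log_spaced_lrs(lo, hi, n):
    """n candidate learning rates evenly spaced on a log scale in [lo, hi]."""
    step = (math.log10(hi) - math.log10(lo)) / (n - 1)
    return [10 ** (math.log10(lo) + i * step) for i in range(n)]

# Illustrative bounds only; launch one run per candidate and compare curves.
candidates = log_spaced_lrs(1e-5, 1e-2, 4)
```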
A minor improvement (<1%) also comes from running the training in two phases.
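The report doesn't detail the two phases, but with a pre-trained encoder a common split is: train only the decoder with the encoder frozen, then unfreeze everything and fine-tune at a lower learning rate. A sketch of that assumed scheme, with stand-in layers:

```python
import torch
import torch.nn as nn

# Stand-in model: the first layer plays the pre-trained "encoder",
# the second the randomly initialized "decoder".
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1),  # "encoder"
    nn.Conv2d(8, 4, 3, padding=1),  # "decoder"
)
encoder, decoder = model[0], model[1]

# Phase 1: freeze the encoder, optimize only the decoder
for p in encoder.parameters():
    p.requires_grad = False
opt = torch.optim.Adam(decoder.parameters(), lr=1e-3)
# ... train for a few epochs ...

# Phase 2: unfreeze and fine-tune the whole network more gently
for p in encoder.parameters():
    p.requires_grad = True
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
# ... continue training ...
```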
The quality of the predictions changes a lot with barely a 1% difference in accuracy!