Are CNN-generated images hard to distinguish from real images?

CNNDetection shows that a classifier trained to detect images generated by only one GAN can detect those generated by many other models.

To learn more about the paper and the GANs featured in this report, check out the accompanying 2 Minute Papers video.

Performance

The Premise

In this paper, the authors ask whether it is possible to create a 'universal' detector for telling apart real images from those generated by a CNN, regardless of the architecture or dataset used. To test this, they collected a dataset of fake images produced by 11 different CNN-based image generators, chosen to span the space of commonly used architectures today (ProGAN, StyleGAN, BigGAN, CycleGAN, StarGAN, GauGAN, DeepFakes, cascaded refinement networks, implicit maximum likelihood estimation, second-order attention super-resolution, and seeing-in-the-dark).

Can the detector generalize?

The authors then demonstrate that, with careful pre- and post-processing and data augmentation, a standard image classifier trained on images from only one specific CNN generator (ProGAN) generalizes surprisingly well to unseen architectures, datasets, and training methods (including the just-released StyleGAN2).
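To make the recipe concrete, here is a minimal PyTorch sketch of this kind of setup: a binary real-vs-fake classifier (the paper uses an ImageNet-pretrained ResNet-50) trained on ProGAN fakes and real images, with random blur and JPEG compression applied as augmentation. The dataset path, hyperparameters, and augmentation ranges below are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of a real-vs-fake detector in the spirit of CNNDetection.
# Dataset layout, hyperparameters, and augmentation ranges are illustrative.
import io
import random

import torch
import torch.nn as nn
from PIL import Image, ImageFilter
from torchvision import datasets, models, transforms


def blur_jpeg(img: Image.Image, p: float = 0.5) -> Image.Image:
    """Randomly blur and JPEG-compress a PIL image, each with probability p."""
    if random.random() < p:
        img = img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0.0, 3.0)))
    if random.random() < p:
        buf = io.BytesIO()
        img.save(buf, format="JPEG", quality=random.randint(30, 95))
        buf.seek(0)
        img = Image.open(buf).convert("RGB")
    return img


train_tf = transforms.Compose([
    transforms.Lambda(blur_jpeg),        # augmentation applied only at training time
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# Expects a folder with two subdirectories, e.g. progan_train/real and progan_train/fake.
train_set = datasets.ImageFolder("data/progan_train", transform=train_tf)
loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True, num_workers=4)

# ImageNet-pretrained ResNet-50 with a single output logit for the real-vs-fake decision
# (ImageFolder assigns the 0/1 labels from the subfolder names).
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, 1)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.BCEWithLogitsLoss()

model.train()
for images, labels in loader:
    logits = model(images).squeeze(1)
    loss = criterion(logits, labels.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```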

This is because all of these generators are built from the same foundational element: the convolutional neural network building blocks common to every one of these techniques.

Their findings suggest the intriguing possibility that today's CNN-generated images share some common systematic flaws, preventing them from achieving realistic image synthesis.

In the sections below, we dive deeper into how the detector, which was trained specifically on ProGAN, generalizes to other GANs. We also look at some of the predictions the detector makes and examples it gets wrong.

Misclassified Predictions

In this section, we can see some of the misclassified images generated by each GAN. You can use the step slider to scroll through more examples.

The confusion matrix to the left shows that the detector from the paper was quite good at distinguishing fake images from real ones. When it made errors, they were more often false negatives (fake images not flagged as fake) than false positives (real images incorrectly flagged as fake).

The accuracy graphs below confirm this – in general the detector was better at recognizing real images than fake images.

We can also see that it achieves the best performance on ProGAN, the model it was trained on. The generalization performance is remarkably good for most of the other algorithms, with images generated by SAN and BigGAN being the hardest to detect.
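If you want to reproduce this kind of breakdown from your own detector's outputs, the snippet below sketches how a confusion matrix and separate real/fake accuracies can be computed from predicted probabilities. The scores and labels shown are placeholder values, not outputs from the paper's model.

```python
# Sketch: confusion matrix and per-class accuracy from detector outputs.
# `probs` and `labels` are placeholders for the detector's predicted fake-probabilities
# and the ground-truth labels (1 = fake, 0 = real).
import numpy as np

probs = np.array([0.92, 0.10, 0.75, 0.40, 0.05, 0.88])   # hypothetical scores
labels = np.array([1, 0, 1, 1, 0, 0])                    # hypothetical ground truth

preds = (probs >= 0.5).astype(int)          # threshold the fake-probability at 0.5

tp = np.sum((preds == 1) & (labels == 1))   # fakes correctly flagged
tn = np.sum((preds == 0) & (labels == 0))   # reals correctly passed
fp = np.sum((preds == 1) & (labels == 0))   # reals flagged as fake (false positives)
fn = np.sum((preds == 0) & (labels == 1))   # fakes missed (false negatives)

confusion = np.array([[tn, fp],
                      [fn, tp]])            # rows: true class, columns: predicted class

real_acc = tn / (tn + fp)                   # accuracy on real images
fake_acc = tp / (tp + fn)                   # accuracy on fake images
print(confusion, real_acc, fake_acc)
```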

All Predictions

In this section, we look at all the predictions made by the detector for images generated by each GAN. You can use the zoom slider at the bottom to look at more examples from each algorithm.

As we learnt earlier, fake images generated by SAN (Second Order Attention Network) are the hardest to detect. The authors found that the data augmentation applied during training causes this drop in performance: SAN is a super-resolution model, so the differences between its outputs and real images lie almost entirely in high-frequency components. Removing those cues at training time (e.g. by blurring) is likely what reduces performance.
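To see what that means in practice, the sketch below isolates the high-frequency residual that Gaussian blurring removes from an image; for a super-resolution model like SAN, most of the real-vs-fake evidence lives in exactly this residual. The file path and blur radius are placeholders.

```python
# Sketch: the high-frequency residual that blurring discards.
# "image.png" and the blur radius are placeholders, not values from the paper.
import numpy as np
from PIL import Image, ImageFilter

img = Image.open("image.png").convert("L")                 # grayscale for simplicity
blurred = img.filter(ImageFilter.GaussianBlur(radius=3))   # low-pass version

high_freq = np.asarray(img, np.float32) - np.asarray(blurred, np.float32)

# For super-resolution outputs such as SAN's, the subtle real-vs-fake cues sit in
# this residual; blurring fake images during training erases that signal.
print("mean high-frequency energy:", float(np.mean(high_freq ** 2)))
```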

To overcome this, they suggest keeping the augmentation but applying it at a reduced rate: blur and JPEG compression are each applied with 10% probability rather than 50% (the Blur+JPEG (0.1) variant).
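In code, the difference between Blur+JPEG (0.5) and Blur+JPEG (0.1) amounts to changing the augmentation probability. The sketch below reuses the hypothetical blur_jpeg helper from the training sketch above with p = 0.1; everything else stays the same.

```python
# Reduced-rate augmentation: the same blur/JPEG corruption, applied with probability
# 0.1 instead of 0.5. Reuses the blur_jpeg helper defined in the training sketch above.
from torchvision import transforms

train_tf_reduced = transforms.Compose([
    transforms.Lambda(lambda img: blur_jpeg(img, p=0.1)),   # Blur+JPEG (0.1)
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
```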

You can explore images generated by SAN and more GANs below.