Synthesis Blog
December 2, 2020
Smart Augmentations: Driving Model Performance with Synthetic Data II
Last time, I started a new series of posts, devoted to different ways of improving model performance with synthetic data. In the first post of the series, we discussed probably the simplest and most widely used way to generate synthetic data: geometric and color data augmentation applied to real training data. Today, we take the idea of data augmentation much further. We will discuss several different ways to construct “smart augmentations” that make much more involved transformations of the input but still change the labeling only in predictable ways.
Here, the resulting image and segmentation mask are the result of the following chain of transformations:
The next step was taken in the work called “AutoAugment: Learning Augmentation Strategies from Data” by Cubuk et al. (2019). This is a work from Google Brain, from the group led by Quoc V. Le that has been working wonders with neural architecture search (NAS), a technique that automatically searches for the best architectures in a class that can be represented by computational graphs. With NAS, this group has already improved over the state of the art in basic convolutional architectures with NASNet and the EfficientNet family, in object detection architectures with the EfficientDet family, and even in such a basic field as activation functions for individual neural units: the Swish activation functions were found with NAS.
So what did they do with augmentation techniques? As usual, they frame this problem as a reinforcement learning task where the agent (controller) has to find a good augmentation strategy based on the rewards obtained by training a child network with this strategy:
The controller is trained by proximal policy optimization, a rather involved reinforcement learning algorithm that I’d rather not get into (Schulman et al., 2017). The point is, they successfully learned augmentation strategies that significantly outperform other, “naive” strategies. They were even able to achieve improvements over state of the art in classical problems on classical datasets:
Here is a sample augmentation policy for ImageNet found by Cubuk et al.:
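
To make the structure of such a policy more concrete, here is a minimal sketch in Python with NumPy. As far as I understand the paper, a policy is a set of sub-policies, each sub-policy consists of two operations, and each operation comes with a probability and a magnitude; one sub-policy is picked at random for every image. The three toy operations (`brightness`, `invert_above`, `shift_right`), the `apply_policy` helper, and all numeric values below are my own illustrative assumptions, not the operations or values found by Cubuk et al.

```python
import numpy as np

# Toy operations on a float image in [0, 1]; magnitude is a level from 0 to 9.
# These are hypothetical stand-ins, not the operations from the paper.
def brightness(img, magnitude):
    return np.clip(img + 0.05 * magnitude, 0.0, 1.0)

def invert_above(img, magnitude):
    """Solarize-like: invert pixels above a threshold that drops with magnitude."""
    threshold = 1.0 - 0.1 * magnitude
    return np.where(img >= threshold, 1.0 - img, img)

def shift_right(img, magnitude):
    """Cyclically shift the image by `magnitude` pixels along the width axis."""
    return np.roll(img, shift=magnitude, axis=1)

# A policy is a list of sub-policies; each sub-policy holds two
# (operation, probability, magnitude) triples. Values are illustrative only.
POLICY = [
    [(brightness, 0.8, 3), (shift_right, 0.4, 2)],
    [(invert_above, 0.6, 5), (brightness, 0.2, 7)],
    [(shift_right, 0.5, 4), (invert_above, 0.9, 2)],
]

def apply_policy(img, policy, rng):
    sub_policy = policy[rng.integers(len(policy))]  # one sub-policy per image
    for op, prob, magnitude in sub_policy:
        if rng.random() < prob:                     # each op fires with its own probability
            img = op(img, magnitude)
    return img

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))
augmented = apply_policy(image, POLICY, rng)
```

The search problem that the controller solves is then a discrete one: which operations, probabilities, and magnitudes to put into the triples.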
The natural question is, of course: why do we care? How can it help us when we are not Google Brain and cannot run this pipeline (it does take a lot of computation)? Cubuk et al. note that the resulting augmentation strategies can indeed transfer across a wide variety of datasets and network architectures; on the other hand, this transferability is far from perfect, so I have not seen the results of AutoAugment pop up in other works as often as the authors would probably like.
Still, these works prove the basic point: augmentations can be composed together for better effect. The natural next step would be to have some even “smarter” functions. And that is exactly what we will see next.
The difference here is that Network A does not simply choose from a predefined set of strategies but operates as a generative network that can, for instance, blend two different training set examples into one in a smart way:
In particular, Lemley et al. tested their approach on datasets of human faces (a topic quite close to our hearts here at Synthesis AI, but I guess I will talk about it in more detail later). So their Network A was able to, e.g., compose two different images of the same person into a blended combination (on the left):
Note that this is not simply a blend of two images but a more involved combination that makes good use of facial features. Here is an even better example:
This kind of “smart augmentation” borders on synthetic data generation: the resulting images are nothing like the originals. But before we turn to actual synthetic data (in subsequent posts), there are other interesting ideas one could apply even at the level of augmentation.
The blended label does not change either the network architecture or the training process: cross-entropy trivially generalizes to target discrete distributions instead of target one-hot vectors. To borrow an illustration from Ferenc Huszar’s blog post, here is what mixup does to a single data point, constructing convex combinations with other points in the dataset:
And here is what happens when we label a lot of points uniformly in the data:
As you can see, the resulting labeled data covers a much more robust and continuous distribution, which helps the model generalize. Zhang et al. report especially significant improvements in training GANs:
By now, the idea of mixup has become an important part of the deep learning toolbox: you can often see it as an augmentation strategy, especially in the training of modern GAN architectures.
In the YOLO family, new releases usually obtain a much better object detection quality by combining a lot of small improvements, bringing together everything that researchers in the field have found to work well since the previous YOLO version. YOLOv4 is no exception, and it outlines several different ways to add new tricks to the pipeline.
What we are interested in now is their “Bag of Freebies”, the set of tricks that do not add any cost to the object detection framework at inference time, adding complexity only at the training stage. It is very characteristic that most items in this bag turn out to be various kinds of data augmentation. In particular, Bochkovskiy et al. introduce a new “mosaic” geometric augmentation that works well for object detection:
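
The gist of mosaic augmentation, as I read the YOLOv4 paper, is to combine four training images into one, so that the detector sees objects in unusual contexts and at different positions. Below is a deliberately simplified sketch in Python with NumPy: it tiles four equally sized images into a fixed 2×2 grid and shifts the bounding boxes accordingly. The `simple_mosaic` helper is hypothetical, and the real implementation also randomizes the position of the center point and rescales and crops the images, which I skip here.

```python
import numpy as np

def simple_mosaic(images, boxes_per_image):
    """Tile four HxWx3 images into a 2Hx2Wx3 canvas and shift their boxes.

    Boxes are rows of [x_min, y_min, x_max, y_max] in pixel coordinates.
    Simplified: fixed 2x2 grid, no random center, no rescaling or cropping.
    """
    h, w = images[0].shape[:2]
    canvas = np.zeros((2 * h, 2 * w, 3), dtype=images[0].dtype)
    offsets = [(0, 0), (w, 0), (0, h), (w, h)]  # (dx, dy) of each quadrant
    shifted = []
    for img, boxes, (dx, dy) in zip(images, boxes_per_image, offsets):
        canvas[dy:dy + h, dx:dx + w] = img
        # Move every box by the offset of the quadrant its image landed in.
        shifted.append(np.asarray(boxes, dtype=float) + [dx, dy, dx, dy])
    return canvas, np.concatenate(shifted, axis=0)

rng = np.random.default_rng(0)
images = [rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8) for _ in range(4)]
boxes = [[[4, 4, 20, 20]], [[8, 2, 30, 16]], [[0, 10, 12, 28]], [[5, 5, 25, 25]]]
mosaic, mosaic_boxes = simple_mosaic(images, boxes)
```

Note that, unlike color augmentations, this one changes the labels, but only in a fully predictable way: each box just moves together with its image.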
But the most interesting part comes next. YOLOv4 is trained with self-adversarial training (SAT), an augmentation technique that actually incorporates adversarial examples into the training process. Remember this famous picture?
It turns out that for most existing artificial neural architectures, one can modify input images with small amounts of noise in such a way that the result looks to us humans completely indistinguishable from the originals but the network is very confident that it is something completely different; see, e.g., this OpenAI blog post for more information.
In the simplest case, such adversarial examples are produced by the following procedure:
YOLOv4 is an important example for us: it represents a significant improvement in the state of the art in object detection in 2020… and much of the improvement comes directly from better and more complex augmentation techniques! This makes us even more optimistic about taking data augmentation further, to the realm of synthetic data.

Automating Augmentations: Finding the Best Strategy
Last time, we discussed the various ways in which modern data augmentation libraries such as Albumentations can transform an unsuspecting input image. Let me remind you of one example from last time:
- take a random crop from a predefined range of sizes;
- shift, scale, and rotate the crop to match the original image dimension;
- apply a (randomized) color shift;
- add blur;
- add Gaussian noise;
- add a randomized elastic transformation for the image;
- perform mask dropout, removing parts of the segmentation mask and replacing them with black cutouts on the image.
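
The key property of such a chain is that geometric steps must be applied to the image and the mask together, while color and noise steps touch only the image. Here is a minimal sketch of this idea in plain NumPy. It is not the Albumentations API: the function names (`random_crop`, `horizontal_flip`, `color_shift`, `gaussian_noise`, `augment`) and all parameter values are my own assumptions, and the chain covers only a few of the steps listed above.

```python
import numpy as np

def random_crop(image, mask, size, rng):
    """Geometric step: cut the same window out of the image and its mask."""
    h, w = mask.shape
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    return (image[top:top + size, left:left + size],
            mask[top:top + size, left:left + size])

def horizontal_flip(image, mask, rng):
    """Geometric step: mirror image and mask together, half of the time."""
    if rng.random() < 0.5:
        return image[:, ::-1], mask[:, ::-1]
    return image, mask

def color_shift(image, mask, rng, max_shift=20):
    """Color step: shift each channel by its own offset; the mask is untouched."""
    shift = rng.integers(-max_shift, max_shift + 1, size=(1, 1, image.shape[2]))
    shifted = np.clip(image.astype(np.int16) + shift, 0, 255).astype(np.uint8)
    return shifted, mask

def gaussian_noise(image, mask, rng, sigma=5.0):
    """Noise step: perturb pixel values only; the mask is untouched."""
    noise = rng.normal(0.0, sigma, size=image.shape)
    noisy = np.clip(image.astype(np.float32) + noise, 0, 255).astype(np.uint8)
    return noisy, mask

def augment(image, mask, rng, crop_size=48):
    """Chain the steps; every call yields a new labeled training example."""
    image, mask = random_crop(image, mask, crop_size, rng)
    image, mask = horizontal_flip(image, mask, rng)
    image, mask = color_shift(image, mask, rng)
    image, mask = gaussian_noise(image, mask, rng)
    return image, mask

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
mask = np.zeros((64, 64), dtype=np.uint8)
mask[16:48, 16:48] = 1  # a square "object" in the middle
aug_image, aug_mask = augment(image, mask, rng)
```

In practice you would, of course, use a library such as Albumentations, which handles masks, bounding boxes, and keypoints for you.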




Smart Augmentation: Blending Input Images
In the previous section, we discussed how to chain together standard transformations of input images in the best possible ways. But what if we take one step further and allow augmentations to produce more complex combinations of input data points? In 2017, this idea was put forward in the work titled “Smart Augmentation: Learning an Optimal Data Augmentation Strategy” by Irish researchers Lemley et al. Their basic idea is to have two networks, “Network A” that implements an augmentation strategy and “Network B” that actually trains on the resulting augmented data and solves the end task:



Mixup: Blending the Labels
In smart augmentations, the input data is produced as a combination of several images with the same label: two different images of the same person can be “interpolated” in a way that respects the facial features and expands the data distribution. Mixup, a technique introduced by MIT and FAIR researchers Zhang et al. (2018), looks at the problem from the opposite side: what if we mix the labels together with the training samples? This is implemented in a very straightforward way: for two labeled input data points, Zhang et al. construct a convex combination of both the inputs and the labels:
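
In formulas, for two labeled points (x_i, y_i) and (x_j, y_j) with one-hot labels, mixup produces x̃ = λ·x_i + (1 − λ)·x_j and ỹ = λ·y_i + (1 − λ)·y_j, where the weight λ is drawn from a Beta(α, α) distribution. Here is a minimal NumPy sketch of this for a whole batch, together with a cross-entropy loss that accepts such blended targets. The helper names `mixup_batch` and `soft_cross_entropy`, the toy data, and the value α = 0.4 are my assumptions for illustration; pairing every example with a random partner from the same batch is one common way to implement it.

```python
import numpy as np

def mixup_batch(x, y_onehot, alpha, rng):
    """Blend each example with a randomly chosen partner from the same batch."""
    lam = rng.beta(alpha, alpha)      # mixing weight in [0, 1]
    perm = rng.permutation(len(x))    # partner index for every example
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return x_mix, y_mix, lam

def soft_cross_entropy(logits, targets):
    """Cross-entropy against a target distribution (one-hot is a special case)."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-(targets * log_probs).sum(axis=1).mean())

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))           # toy batch: 4 inputs with 8 features
y = np.eye(3)[[0, 1, 2, 1]]           # one-hot labels for 3 classes
x_mix, y_mix, lam = mixup_batch(x, y, alpha=0.4, rng=rng)

# Random logits stand in for the output of a real network here.
loss = soft_cross_entropy(rng.normal(size=(4, 3)), y_mix)
```

Every row of `y_mix` still sums to one, so it is a valid target distribution, and nothing in the network itself has to change.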



Self-Adversarial Training
To get to the last idea of today’s post, I will use YOLOv4, a recently presented object detection architecture (Bochkovskiy et al., 2020). YOLOv4 is a direct successor to the famous YOLO family of object detectors, improving significantly over the previous YOLOv3. We are, by the way, witnessing an interesting controversy in object detection because YOLOv5 followed less than two months later, from a completely different group of researchers, and without a paper to explain the new ideas (but with the code so it is not a question of reproducing the results)… Very interesting stuff, but discussing it would take us very far from the topic at hand, so let’s get back to YOLOv4. It boasts impressive performance, with the same detection quality as the above-mentioned EfficientDet at half the cost:


- you have a network and an input x that you want to make adversarial; suppose you want to turn a panda into a gibbon;
- formally, it means that you want to increase the “gibbon” component of the network’s output vector (at the expense of the “panda” component);
- so you fix the weights of the network and start regular gradient ascent, but with respect to x rather than the weights! This is the key idea for finding adversarial examples; it does not explain why they exist (it’s not an easy question) but if they do, it’s really not so hard to find them.
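
The steps above can be sketched on a toy model. Below, a two-class linear softmax classifier with random fixed weights stands in for the real network (class 0 plays the role of “panda” and class 1 of “gibbon”), and `make_adversarial` is a hypothetical helper that runs gradient ascent on the log-probability of the target class with respect to the input. This only illustrates the mechanics: in such a tiny model the perturbation is not small at all, and the surprising part about real deep networks is precisely that an imperceptible change is enough.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # for numerical stability
    e = np.exp(z)
    return e / e.sum()

def make_adversarial(x, W, b, target, step=0.1, n_steps=200):
    """Gradient ascent on log p(target | x) with respect to the input x."""
    x_adv = x.copy()
    onehot = np.zeros(len(b))
    onehot[target] = 1.0
    for _ in range(n_steps):
        p = softmax(W @ x_adv + b)
        grad_x = W.T @ (onehot - p)  # d log p[target] / dx for a linear softmax model
        x_adv += step * grad_x       # only x changes; W and b stay fixed
    return x_adv

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 16))         # toy "network": class 0 = panda, class 1 = gibbon
b = np.zeros(2)
x = W[0] - W[1]                      # an input the model confidently calls a panda

p_before = softmax(W @ x + b)
x_adv = make_adversarial(x, W, b, target=1)
p_after = softmax(W @ x_adv + b)
```

For a deep network, the only thing that changes is how the gradient with respect to x is computed: backpropagation gives it to you essentially for free.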
