AI Safety IV: Sparks of Misalignment
This is the fourth and final post in our series...
Here, the resulting image and segmentation mask are produced by the following chain of transformations:
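In code, such a chain can be written down, for instance, with the albumentations library; the transforms and parameters below are a generic illustration of the idea rather than the exact chain shown above:

```python
# A generic augmentation chain applied jointly to an image and its
# segmentation mask; the specific transforms here are illustrative.
import albumentations as A
import cv2

augment = A.Compose([
    A.HorizontalFlip(p=0.5),                    # geometric: mirror the image
    A.ShiftScaleRotate(shift_limit=0.05,        # geometric: small shift, scale, rotation
                       scale_limit=0.1,
                       rotate_limit=15, p=0.7),
    A.RandomBrightnessContrast(p=0.5),          # photometric: brightness/contrast jitter
    A.GaussNoise(p=0.3),                        # photometric: sensor-like noise
])

image = cv2.imread("frame.png")
mask = cv2.imread("frame_mask.png", cv2.IMREAD_GRAYSCALE)

# Geometric transforms are applied identically to the image and the mask,
# so the segmentation labels stay consistent with the augmented image.
out = augment(image=image, mask=mask)
aug_image, aug_mask = out["image"], out["mask"]
```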
The next step was taken in the work called “AutoAugment: Learning Augmentation Strategies from Data” by Cubuk et al. (2019). This is a work from Google Brain, from the group led by Quoc V. Le that has been working wonders with neural architecture search (NAS), a technique that automatically searches for the best architectures in a class that can be represented by computational graphs. With NAS, this group has already improved over the state of the art in basic convolutional architectures with NASNet and the EfficientNet family, in object detection architectures with the EfficientDet family, and even in such a basic field as activation functions for individual neural units: the Swish activation functions were found with NAS.
So what did they do with augmentation techniques? As usual, they frame this problem as a reinforcement learning task where the agent (controller) has to find a good augmentation strategy based on the rewards obtained by training a child network with this strategy:
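Schematically, this search loop looks roughly as follows; everything in this sketch is a placeholder stub (the function names are mine), meant only to show the structure of the loop rather than a working AutoAugment implementation:

```python
import random

NUM_SUBPOLICIES, SEARCH_STEPS = 5, 100

def sample_policy(controller_state):
    # Stub: the real controller is a recurrent network that emits the
    # operations, probabilities, and magnitudes of each sub-policy.
    return [f"subpolicy_{random.randint(0, 15)}" for _ in range(NUM_SUBPOLICIES)]

def train_child_and_evaluate(policy):
    # Stub: a small "child" network is trained with this augmentation
    # policy, and its validation accuracy becomes the reward.
    return random.random()

def update_controller(controller_state, policy, reward):
    # Stub: the controller is updated so that high-reward policies
    # become more likely to be sampled again.
    return controller_state

controller_state = {}
for step in range(SEARCH_STEPS):
    policy = sample_policy(controller_state)      # controller proposes a policy
    reward = train_child_and_evaluate(policy)     # child network provides the reward
    controller_state = update_controller(controller_state, policy, reward)
```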
The controller is trained with proximal policy optimization, a rather involved reinforcement learning algorithm that I’d rather not get into (Schulman et al., 2017). The point is, they successfully learned augmentation strategies that significantly outperform other, “naive” strategies. They were even able to achieve improvements over the state of the art in classical problems on classical datasets:
Here is a sample augmentation policy for ImageNet found by Cubuk et al.:
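In code, such a policy boils down to a list of sub-policies, each a short sequence of (operation, probability, magnitude) triples, with one randomly chosen sub-policy applied per image. The operations and values below are illustrative placeholders, not the actual policy found by Cubuk et al.:

```python
import random
from PIL import Image, ImageEnhance, ImageOps

# Illustrative sub-policies; a real AutoAugment policy contains many more
# (25 sub-policies in the final ImageNet policy).
SUB_POLICIES = [
    [("posterize", 0.4, 4), ("rotate", 0.6, 10)],
    [("solarize", 0.6, 128), ("autocontrast", 0.6, 0)],
    [("color", 0.4, 1.5), ("equalize", 0.6, 0)],
]

OPS = {
    "posterize":    lambda img, m: ImageOps.posterize(img, int(m)),
    "rotate":       lambda img, m: img.rotate(m),
    "solarize":     lambda img, m: ImageOps.solarize(img, int(m)),
    "autocontrast": lambda img, m: ImageOps.autocontrast(img),
    "color":        lambda img, m: ImageEnhance.Color(img).enhance(m),
    "equalize":     lambda img, m: ImageOps.equalize(img),
}

def apply_policy(img: Image.Image) -> Image.Image:
    # Pick one sub-policy at random and apply each of its operations
    # with the corresponding probability and magnitude.
    for op_name, prob, magnitude in random.choice(SUB_POLICIES):
        if random.random() < prob:
            img = OPS[op_name](img, magnitude)
    return img
```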
The natural question is, of course: why do we care? How can it help us when we are not Google Brain and cannot run this pipeline (it does take a lot of computation)? Cubuk et al. note that the resulting augmentation strategies can indeed transfer across a wide variety of datasets and network architectures; on the other hand, this transferability is far from perfect, so I have not seen the results of AutoAugment pop up in other works as often as the authors would probably like.
Still, these works prove the basic point: augmentations can be composed together for better effect. The natural next step would be to use even “smarter” augmentation functions. And that is exactly what we will see next.
The difference here is that Network A does not simply choose from a predefined set of strategies but operates as a generative network that can, for instance, blend two different training set examples into one in a smart way:
In particular, Lemley et al. tested their approach on datasets of human faces (a topic quite close to our hearts here at Synthesis AI, but I guess I will talk about it in more detail later). So their Network A was able to, e.g., compose two different images of the same person into a blended combination (on the left):
Note that this is not simply a blend of two images but a more involved combination that makes good use of facial features. Here is an even better example:
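In a very simplified PyTorch sketch, the two-network setup looks roughly like the following; the architectures, loss weighting, and names here are placeholders of mine rather than the exact setup from Lemley et al.:

```python
# Smart-augmentation-style training: network A learns to merge two same-class
# samples into a new training sample for network B (the classifier).
import torch
import torch.nn as nn

class NetworkA(nn.Module):
    """Takes two images of the same class (stacked along channels) and
    produces one blended image of the same shape (values in [0, 1])."""
    def __init__(self, channels=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(2 * channels, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, channels, 3, padding=1), nn.Sigmoid(),
        )
    def forward(self, img1, img2):
        return self.body(torch.cat([img1, img2], dim=1))

class NetworkB(nn.Module):
    """An ordinary classifier trained on both real and blended images."""
    def __init__(self, channels=1, num_classes=10):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, num_classes),
        )
    def forward(self, x):
        return self.body(x)

net_a, net_b = NetworkA(), NetworkB()
opt = torch.optim.Adam(list(net_a.parameters()) + list(net_b.parameters()), lr=1e-3)
ce, mse = nn.CrossEntropyLoss(), nn.MSELoss()

def training_step(img1, img2, labels):
    # img1, img2: two images of the same person/class; labels: their shared label.
    blended = net_a(img1, img2)
    # The classification loss on the blended sample trains both networks jointly...
    loss = ce(net_b(blended), labels)
    # ...while a reconstruction-style term keeps the blend close to a real sample.
    loss = loss + 0.3 * mse(blended, img1)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```

The key design choice is that the classification loss flows back through Network A, so it is rewarded for producing blends that actually make the classifier better, not merely blends that look plausible.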
This kind of “smart augmentation” borders on synthetic data generation: the resulting images are nothing like the originals. But before we turn to actual synthetic data (in subsequent posts), there are other interesting ideas one could apply even at the level of augmentation.
The blended label does not change either the network architecture or the training process: the cross-entropy loss trivially generalizes from one-hot targets to soft target distributions. To borrow an illustration from Ferenc Huszar’s blog post, here is what mixup does to a single data point, constructing convex combinations with other points in the dataset:
And here is what happens when we apply the same procedure to many points across the dataset:
As you can see, the resulting labeled data covers a much smoother and more continuous distribution, which improves generalization. Zhang et al. report especially significant improvements in training GANs:
By now, the idea of mixup has become an important part of the deep learning toolbox: you can often see it as an augmentation strategy, especially in the training of modern GAN architectures.
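It helps that mixup itself takes only a few lines of code; here is a minimal sketch assuming one-hot (or already soft) label tensors:

```python
import numpy as np
import torch

def mixup_batch(images, labels_onehot, alpha=0.2):
    """Return convex combinations of a batch with a shuffled copy of itself."""
    lam = np.random.beta(alpha, alpha)        # mixing coefficient lambda ~ Beta(alpha, alpha)
    perm = torch.randperm(images.size(0))     # random partner for each sample
    mixed_images = lam * images + (1 - lam) * images[perm]
    mixed_labels = lam * labels_onehot + (1 - lam) * labels_onehot[perm]
    return mixed_images, mixed_labels

# The soft labels plug straight into the usual cross-entropy, now computed
# against a target distribution instead of a single class index.
def soft_cross_entropy(logits, soft_targets):
    return -(soft_targets * torch.log_softmax(logits, dim=1)).sum(dim=1).mean()
```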
In the YOLO family, new releases usually achieve much better object detection quality by combining many small improvements, bringing together everything that researchers in the field have found to work well since the previous YOLO version. YOLOv4 is no exception, and it outlines several different ways to add new tricks to the pipeline.
What we are interested in now is their “Bag of Freebies”, the set of tricks that do not change the performance of the object detection framework during inference, adding complexity only at the training stage. Tellingly, most items in this bag turn out to be various kinds of data augmentation. In particular, Bochkovskiy et al. introduce a new “mosaic” geometric augmentation that works well for object detection:
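In a simplified form, mosaic can be sketched as follows: four training images are resized and tiled into the quadrants of a single canvas, and their bounding boxes are shifted accordingly. The real YOLOv4 version also randomizes the split point and crops the tiles; the fixed 2x2 grid below only illustrates the principle:

```python
import numpy as np
import cv2

def mosaic(images, boxes_per_image, out_size=640):
    """images: a list of four HxWx3 arrays; boxes_per_image: a list of four
    arrays of [x1, y1, x2, y2] boxes in pixel coordinates of each image."""
    half = out_size // 2
    canvas = np.zeros((out_size, out_size, 3), dtype=np.uint8)
    offsets = [(0, 0), (half, 0), (0, half), (half, half)]  # top-left corner of each quadrant
    all_boxes = []
    for img, boxes, (ox, oy) in zip(images, boxes_per_image, offsets):
        h, w = img.shape[:2]
        canvas[oy:oy + half, ox:ox + half] = cv2.resize(img, (half, half))
        if len(boxes):
            scaled = np.asarray(boxes, dtype=np.float32).copy()
            scaled[:, [0, 2]] *= half / w    # rescale x coordinates to the tile size
            scaled[:, [1, 3]] *= half / h    # rescale y coordinates to the tile size
            scaled[:, [0, 2]] += ox          # shift into the corresponding quadrant
            scaled[:, [1, 3]] += oy
            all_boxes.append(scaled)
    boxes = np.concatenate(all_boxes) if all_boxes else np.zeros((0, 4), dtype=np.float32)
    return canvas, boxes
```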
But the most interesting part comes next. YOLOv4 is trained with self-adversarial training (SAT), an augmentation technique that actually incorporates adversarial examples into the training process. Remember this famous picture?
It turns out that for most existing artificial neural architectures, one can modify input images with small amounts of noise in such a way that the result looks completely indistinguishable from the original to a human, yet the network becomes highly confident that it is something entirely different; see, e.g., this OpenAI blog post for more information.
In the simplest case, such adversarial examples are produced by the following procedure:
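A minimal sketch of one such procedure, the fast gradient sign method (FGSM) of Goodfellow et al., which takes a single step along the sign of the loss gradient with respect to the input:

```python
import torch
import torch.nn.functional as F

def fgsm(model, images, labels, epsilon=8 / 255):
    """Produce adversarial versions of a batch of images for a classifier."""
    images = images.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(images), labels)
    loss.backward()
    # Step in the direction that increases the loss, then clip to the valid range.
    adversarial = images + epsilon * images.grad.sign()
    return adversarial.clamp(0, 1).detach()
```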
YOLOv4 is an important example for us: it represents a significant improvement in the state of the art in object detection in 2020… and much of the improvement comes directly from better and more complex augmentation techniques! This makes us even more optimistic about taking data augmentation further, to the realm of synthetic data.