Synthesis Blog
Synthetic-to-Real Refinement: Driving Model Performance with Synthetic Data V

We continue the series on synthetic data as it is used in machine learning today. This is the fifth part of an already pretty long series (part 1, part 2, part 3, part 4), and it’s far from over, but I try to keep each post more or less self-contained. Today, however, we pick up from last time, so if you have not read Part 4 yet I suggest going through it first. In that post, we discussed synthetic-to-real refinement for gaze estimation, which turned out to teach us a lot about modern GAN-based architectures. But eye gaze is still a relatively small problem without much variability, so let’s see how well synthetic data does in other computer vision applications. Again, expect a lot of GANs and at least a few formulas for the loss functions.

PixelDA: An Early Work in Refinement

First of all, I have to constrain this post too. There are whole domains of applications where synthetic data is very often used for computer vision, such as outdoor scene segmentation for autonomous driving. But this would require a separate discussion, one that I hope to get to in the future. Today I will show a few examples where refinement techniques work on standard “unconstrained” computer vision problems such as object detection and segmentation for common objects, although, as we will see, most of these problems in fact turn out to be quite constrained.

We begin with an early work in refinement, parallel to (Shrivastava et al., 2017), which was done by Google researchers Bousmalis et al. (2017). They train a GAN-based architecture for pixel-level domain adaptation, which they call PixelDA. In essence, PixelDA is a basic style transfer GAN: the generator G, the task classifier T, and the discriminator D are trained by alternating optimization steps in the objective

$$\min_{G,T}\max_{D}\;\alpha L_{\mathrm{domain}}(D, G) + \beta L_{\mathrm{task}}(G, T) + \gamma L_{\mathrm{content}}(G),$$

where


  • the first term is the domain loss, a standard adversarial loss in which the discriminator tries to tell refined synthetic images from real ones:
  $$L_{\mathrm{domain}}(D, G) = \mathbb{E}_{x^r}\left[\log D(x^r)\right] + \mathbb{E}_{x^s, z}\left[\log\left(1 - D(G(x^s, z))\right)\right],$$
  where $x^r$ is a real image, $x^s$ a synthetic one, and $z$ a noise vector;
  • the second term is the task-specific loss, which in (Bousmalis et al., 2017) was the image classification cross-entropy loss provided by a classifier T that was also trained as part of the model and applied to both the original synthetic image and its refined version:
  $$L_{\mathrm{task}}(G, T) = \mathbb{E}_{x^s, y, z}\left[-y^\top \log T(G(x^s, z)) - y^\top \log T(x^s)\right],$$
  where $y$ is the one-hot class label;
  • and the third term is the content similarity loss, intended to make the generator preserve the foreground objects (that would later need to be classified) with a masked pairwise mean squared error:
  $$L_{\mathrm{content}}(G) = \mathbb{E}_{x^s, z}\left[\frac{1}{k}\left\|\left(x^s - G(x^s, z)\right) \circ m\right\|_2^2 - \frac{1}{k^2}\left(\left(x^s - G(x^s, z)\right)^\top m\right)^2\right],$$
  where m is a segmentation mask for the foreground object extracted from the synthetic data renderer and k is the number of foreground pixels; note that this loss does not “insist” on preserving pixel values in the object but rather encourages the model to change object pixels in a consistent way, preserving their pairwise differences.
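To make the content similarity loss concrete, here is a minimal NumPy sketch of a masked pairwise mean squared error of this kind (the function name and toy data are mine, for illustration only):

```python
import numpy as np

def masked_pmse(x, x_fake, mask):
    """Masked pairwise mean squared error: penalizes inconsistent changes
    to foreground pixels while allowing a uniform shift of all their values.

    x, x_fake : float arrays of the same shape (an image and its refinement)
    mask      : binary array, 1 on the foreground object, 0 elsewhere
    """
    d = (x - x_fake) * mask          # differences restricted to the object
    k = mask.sum()                   # number of foreground pixels
    return (d ** 2).sum() / k - (d.sum() / k) ** 2

# A uniform brightness shift of the whole object incurs (almost) no penalty,
# because pairwise differences between object pixels are preserved.
x = np.random.rand(8, 8)
mask = np.zeros((8, 8))
mask[2:6, 2:6] = 1.0
shifted = x + 0.3 * mask             # shift every object pixel equally
print(masked_pmse(x, shifted, mask)) # ~0: pairwise differences preserved
```

A plain MSE would have penalized the brightness shift above just as much as an arbitrary distortion; the pairwise version only cares about relative changes within the object.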

Bousmalis et al. applied this GAN to the Synthetic Cropped LineMod dataset, a synthetic version of a small object classification dataset, doing both classification and pose estimation for the objects. The images in this dataset are quite cluttered and complex, but small in terms of pixel size:

The generator accepts as input a synthetic image of a 3D model of a corresponding object in a random pose together with the corresponding depth map and tries to output a realistic image in a cluttered environment while leaving the object itself in place. Note that the segmentation mask for the central object is also given by the synthetic model. The discriminator also looks at the depth map when it distinguishes between reals and fakes:

Here are some sample results of PixelDA for the same object classes as above but in different poses and with different depth maps:

Hardly the images that you would sell to a stock photo site, but that’s not the point. The point is to improve the classification and object pose estimation quality after training on refined synthetic images. And indeed, Bousmalis et al. reported improved results in both metrics compared to both training on purely synthetic data (for many tasks, this version fails entirely) and a number of previous approaches to domain adaptation.

But these are still rather small images. Can we make synthetic-to-real refinement work on a larger scale? Let’s find out.

CycleGAN for Synthetic-to-Real Refinement: GeneSIS-RT

In the previous post, we discussed the general CycleGAN idea and structure: if you want to do something like style transfer, but don’t have a paired dataset where the same content is depicted in two different styles, you can close the loop by training two generators at once. This is a very natural setting for synthetic-to-real domain adaptation, so many modern approaches to synthetic data refinement include the ideas of CycleGAN.
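As a quick reminder of the key ingredient, here is a minimal sketch of the cycle consistency loss with two generators; the one-line “generators” and all names here are mine, standing in for real networks:

```python
import numpy as np

def cycle_consistency_loss(G, F, x, y):
    """CycleGAN cycle consistency: translating to the other domain and back
    should reproduce the input, ||F(G(x)) - x||_1 + ||G(F(y)) - y||_1."""
    return np.abs(F(G(x)) - x).mean() + np.abs(G(F(y)) - y).mean()

# Toy "generators" that are exact inverses of each other.
G = lambda x: 2.0 * x + 1.0    # synthetic -> "real"
F = lambda y: (y - 1.0) / 2.0  # "real" -> synthetic
x = np.random.rand(4, 4)       # unpaired samples from the two domains
y = np.random.rand(4, 4)
print(cycle_consistency_loss(G, F, x, y))  # ~0 (up to floating point)
```

This is exactly what lets CycleGAN train on unpaired data: neither `x` nor `y` needs a counterpart in the other domain, only the round trip is constrained.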

Probably the most direct application is the GeneSIS-RT framework by Stein and Roy (2017) that refines synthetic data directly with a CycleGAN trained on unpaired datasets of synthetic and real images. Their basic pipeline, shown in the picture below, sums up the straightforward approach to synthetic-to-real refinement perfectly:

Their results are pretty typical for the basic CycleGAN: some of the straight lines can become wiggly, and the textures have artifacts, but generally the images definitely become closer to reality:

But, again, picture quality is not the main point here. Stein and Roy show that a training set produced by image-to-image translation learned by CycleGAN improves the results of training machine learning systems for real-world tasks such as obstacle avoidance and semantic segmentation.

Here are some sample segmentation results that compare a DeepLab-v2 segmentation network trained on synthetic data and on the same synthetic data refined by GeneSIS-RT; the improvement is quite clear:

CycleGAN Evolved: T2Net

As an example of a more involved application, let’s consider T2Net by Zheng et al. (2018), who apply synthetic-to-real refinement to improve depth estimation from a single image. By the way, if you google their paper in 2021, as I just did, do not confuse it with T2-Net by Zhang et al. (2020) (yes, a one-symbol difference in both first author and model names!), a completely different deep learning model for turbulence forecasting…

T2Net also uses the general ideas of CycleGAN with a translation network (generator) that makes images more realistic. The new idea here is that T2Net asks the synthetic-to-real generator G not only to translate one specific domain (synthetic data) to another (real data) but also to work across a number of different input domains, making the input image “more realistic” in every case. Here is the general architecture:

In essence, this means that G aims to learn the minimal transformation necessary to make an image realistic. In particular, it should not change real images much. In total, T2Net has the following loss function for the generator, a weighted sum of five terms (hope you are getting used to these):


  • the first term is the usual GAN loss for synthetic-to-real transfer with a discriminator $D_T$:
  $$L_{\mathrm{GAN}} = \mathbb{E}_{x_r}\left[\log D_T(x_r)\right] + \mathbb{E}_{x_s}\left[\log\left(1 - D_T(G(x_s))\right)\right];$$
  • the second term is the feature-level GAN loss for features extracted from translated and real images, with a different discriminator $D_f$;
  • the third term $L_r$ is the reconstruction loss for real images, simply an $L_1$ norm that says that T2Net is not supposed to change real images at all;
  • the fourth term $L_t$ is the task loss for depth estimation on synthetic images, namely the $L_1$-norm of the difference between the predicted depth map for a translated synthetic image and the original ground truth synthetic depth map; this loss ensures that translation does not change the depth map;
  • finally, the fifth term $L_s$ is the task loss for depth estimation on real images,
  $$L_s = \sum_{i,j} \left|\partial_x d_{i,j}\right| + \left|\partial_y d_{i,j}\right|,$$
  that is, the sum of image gradients of the predicted depth map $d$; since ground truth depth maps are not available now, this regularizer is a locally smooth loss intended to optimize object boundaries, a common tool in depth estimation models that we won’t go into too much detail about.
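The last term is easy to sketch. Here is a minimal NumPy version of a sum-of-gradients smoothness regularizer for a predicted depth map (a simplified sketch with my own names; real models often also downweight depth gradients at image edges of the input):

```python
import numpy as np

def depth_smoothness_loss(depth):
    """Locally smooth regularizer for a predicted depth map: the mean of
    the absolute gradients of the depth map in both directions. With no
    ground truth depth available, this at least keeps predictions from
    varying wildly between neighboring pixels."""
    dd_x = np.abs(np.diff(depth, axis=1))  # horizontal depth gradients
    dd_y = np.abs(np.diff(depth, axis=0))  # vertical depth gradients
    return dd_x.mean() + dd_y.mean()

flat = np.ones((6, 6))                 # perfectly flat depth map
print(depth_smoothness_loss(flat))     # 0.0: no gradients to penalize
noisy = np.random.rand(6, 6)           # wildly varying depth map
print(depth_smoothness_loss(noisy))    # > 0: gradients get penalized
```

Used alone this loss would collapse predictions to a constant depth, which is why in T2Net it only appears as a regularizer alongside the supervised task loss on synthetic images.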

Zheng et al. show that T2Net can produce realistic images from synthetic ones, even for quite varied domains such as house interiors from the SUNCG synthetic dataset:

But again, the most important conclusions deal with the depth estimation task. Zheng et al. conclude that end-to-end training of the translation network and depth estimation network is preferable to training them separately. They show that T2Net can achieve good results for depth estimation with no access to real paired data, even outperforming some (but not all) supervised approaches.

Synthetic Data Refinement for Vending Machines

This is already getting to be quite a long read, so let me wrap up with just one more example that will bring us to 2019. Wang et al. (2019) consider synthetic data generation and domain adaptation for object detection in smart vending machines. Actually, we have already discussed their problem setting and synthetic data in a previous post, titled “What’s in the Fridge?”. So please see that post for a refresher on their synthetic data generation pipeline, and today we will concentrate specifically on their domain adaptation approach.

Wang et al. refine rendered images with virtual-to-real style transfer done by a CycleGAN-based architecture. The novelty here is that Wang et al. separate foreground and background losses, arguing that style transfer needed for foreground objects is very different from (much stronger than) the style transfer for backgrounds. So their overall architecture is even more involved than in previous examples; here is what it looks like:

The overall generator loss function is also a bit different:

  • $L_{\mathrm{GAN}}(G, D, X, Y)$ is the standard adversarial loss for generator G mapping from domain X to domain Y and discriminator D distinguishing real images from fake ones in domain Y;
  • $L_{\mathrm{cyc}}(G, F)$ is the cycle consistency loss as used in CycleGAN and as we have already discussed several times;
  • $L_{bg}$ is the background loss, which is the cycle consistency loss computed only for the background part of the images as defined by the mask $m_{bg}$;
  • $L_{fg}$ is the foreground loss, similar to $L_{bg}$ but computed only for the hue channel in the HSV color space (the authors argue that color and profile are the most critical for recognition and thus need to be preserved the most).
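The background loss is easy to illustrate: it is just a cycle consistency loss restricted by a mask. Here is a minimal NumPy sketch (names and toy data are mine; the foreground loss would apply the same computation to the hue channel only):

```python
import numpy as np

def masked_cycle_loss(x, x_cycled, mask):
    """L1 cycle-consistency loss restricted to the region where mask = 1.
    With a background mask this penalizes changes to the background after
    a round trip through both generators, while leaving the generators
    free to restyle the foreground object."""
    mask = mask.astype(float)
    return (np.abs(x_cycled - x) * mask).sum() / max(mask.sum(), 1.0)

x = np.random.rand(8, 8)
m_bg = np.ones((8, 8))
m_bg[2:6, 2:6] = 0.0                            # hole where the object is
x_cycled = x.copy()
x_cycled[3:5, 3:5] += 0.5                       # change only the object
print(masked_cycle_loss(x, x_cycled, m_bg))     # 0.0: background untouched
print(masked_cycle_loss(x, x_cycled, 1 - m_bg)) # > 0 on the foreground
```

Separating the two regions like this lets the model apply a much stronger style transfer to the objects than to the (fixed, well-known) backgrounds.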

Segmentation into foreground and background is done automatically in synthetic data and is made easy in this case for real data since the camera position is fixed, and the authors can collect a dataset of real background templates from the vending machines they used in the experiments and then simply subtract the backgrounds to get the foreground part.
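Here is a minimal sketch of this kind of background subtraction with NumPy (the function name and threshold value are my choices for illustration):

```python
import numpy as np

def foreground_mask(image, background_template, threshold=0.1):
    """Extract a binary foreground mask by subtracting a known background
    template, which is possible here because the camera position is fixed
    and real background templates have been collected in advance."""
    diff = np.abs(image.astype(float) - background_template.astype(float))
    if diff.ndim == 3:                 # color image: max over channels
        diff = diff.max(axis=-1)
    return (diff > threshold).astype(np.uint8)

bg = np.zeros((8, 8))                  # background template (empty shelf)
img = bg.copy()
img[3:6, 3:6] = 0.8                    # an "object" placed on the shelf
mask = foreground_mask(img, bg)
print(mask.sum())                      # 9 foreground pixels
```

In practice one would add some morphological cleanup for sensor noise, but with a fixed camera even this naive subtraction goes a long way.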

Here are some sample results of their domain adaptation architecture, with original synthetic images on the left and refined results on the right:

As a result, Wang et al. report significantly improved results when using hybrid datasets of real and synthetic data for all three tested object detection architectures: PVANET, SSD, and YOLOv3. Even more importantly, they report a comparison between basic and refined synthetic data with clear gains achieved by refinement across all architectures.


By this point, you are probably already pretty tired of CycleGAN variations. Naturally, there are plenty more examples of this kind of synthetic-to-real style transfer in the literature; I have just picked a few to illustrate the general ideas and show how they can be applied to specific use cases.

I hope the last couple of posts have managed to convince you that synthetic-to-real refinement is a valid approach that can improve the performance at the end task even if the actual refined images do not look all that realistic to humans: some of the examples above look pretty bad, but training on them still improves performance for downstream tasks.

Next time, we will discuss an interesting variation of this idea: what if we reverse the process and try to translate real data into synthetic? And why would anyone want to do such a thing if we are always interested in solving downstream tasks on real data rather than synthetic? The answers to these questions will have to wait until the next post. See you!

Sergey Nikolenko
Head of AI, Synthesis AI