Author: @snikolenko

Variational Autoencoders (VAEs): Generative AI I

It might seem like generative models are going through new phases every couple of years: we heard about Transformers, then flow-based models were all the rage, then diffusion-based models… But in fact, new ideas build on top of older ones. Following our overview post, today we start an in-depth dive into generative AI. We consider the variational autoencoder (VAE), an idea introduced in 2013, if not earlier, but still very relevant and still underlying state-of-the-art generative models such as Stable Diffusion. We will not consider all the gory mathematical details, but I hope to explain the necessary intuition.

Continue reading
Generative AI Models in Image Generation: Overview

Some of the most widely publicized results in machine learning in recent years have been related to image generation. You heard about DALL-E a year ago, and by now you have also heard of DALL-E 2, Midjourney, and Stable Diffusion, right? With this post, I’m starting a new series where I will explain the inner workings of these models, what their differences are, and how they fit into the general scheme of deep generative models. Today, we begin with a general overview.

Continue reading
Facial Landmark Detection with Synthetic Data: Case Study

Today we have something very special for you: fresh results of our very own machine learning researchers! We discuss a case study that would be impossible without synthetic data: learning to recognize facial landmarks (keypoints on a human face) in unprecedented numbers and with unprecedented accuracy. We will begin by discussing why facial landmarks are important, show why synthetic data is inevitable here, and then proceed to our recent results.

Why Facial Landmarks?

Facial landmarks are certain key points on a human face that define the main facial features: nose, eyes, lips, jawline, and so on. Detecting such key points on in-the-wild photographs is a basic computer vision problem that could help considerably with a number of face-related applications. For example:

head pose estimation, that is, finding out which way the head is turned and where the person is looking right now;

gaze estimation, a problem important for mobile and wearable devices that we discussed recently;

recognizing emotions (that are reflected in moving landmarks) and other conditions; in particular, systems recognizing driver fatigue often rely on facial landmarks in preprocessing.

There are several different approaches to defining facial landmarks; here is an illustration of no fewer than eight such approaches from Sagonas et al. (2016), who introduced yet another standard:

Their standard became one of the most widely used in industry. Named iBug68, it consists of 68 facial landmarks defined as follows (the left part shows the definitions, and the right part shows the variance of landmark points as captured by human annotators):

The iBug68 standard was introduced together with the “300 Faces in the Wild” dataset; true to its name, it contains 300 faces with landmarks labeled by agreement of several human annotators. The authors also released a semi-automated annotation tool that was supposed to help researchers label other datasets—and it does quite a good job.

All this happened back in 2013-2014, and numerous deep learning models have been developed for facial landmark detection since then. So what’s the problem? Can we assume that facial landmark detection is either solved or at least does not suffer from the problems that synthetic data could alleviate?

Synthetic Landmarks: Number Does Matter

Not quite. As often happens in machine learning, the problem is more quantitative than qualitative: existing landmark datasets can be insufficient for certain tasks. 68 landmarks are enough to get good head pose estimation, but definitely not enough to, say, obtain a full 3D reconstruction of a human head and face, a problem that we discussed very recently and deemed very important for 3D avatars, the Metaverse, and other related applications.

For such problems, it would be very helpful to move from datasets of several dozen landmarks to datasets of at least several hundred landmarks that would outline the entire face oval and densely cover the most important lines on a human face. Here is a sample face with 68 iBug landmarks on the left and 243 landmarks on the right:

And we don’t have to stop there: we can move on to 460 points (left) or even 1001 points (right):

The more the merrier! Being able to detect hundreds of keypoints on a face would significantly improve the accuracy of 3D face reconstruction and help with many other computer vision problems.

However, by now you probably already realize the main problem of these extended landmark standards: there are no datasets, and there is little hope of ever getting them. It was hard enough to label 68 points by hand (the original dataset had only 300 photos); labeling several hundred points at a scale sufficient to train models to recognize them would certainly be prohibitive.

This sounds like a case for synthetic data, right? Indeed, the face shown above is not real, it is a synthetic data point produced by our very own Human API. When you have a synthetic 3D face in a 3D scene that you control, it is absolutely no problem to have as many landmarks as you wish. What’s even more important, you can easily play with the positions of these landmarks and choose which set of points gives you better results in downstream tasks—imagine how hard it would be if you had to get updated human annotations every time you changed landmark locations!

So at this point, we have a source of unlimited synthetic facial landmark datasets. It only remains to find out whether they can indeed help train better models.

Training on Synthetic Landmarks

We have several sets of synthetic landmarks, and we want to see how well we are able to predict them. As the backbone for our deep learning model we used HourglassNet, a venerable convolutional architecture that has been used for pose estimation and similar problems since it was introduced by Newell et al. (2016):

The input here is an image, and the output is a tensor that specifies all landmarks.
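To make this more concrete, here is a minimal decoding sketch, assuming the standard heatmap representation typically used with hourglass networks (one heatmap per landmark, with the landmark located at the hottest pixel). The function and shapes below are illustrative assumptions, not our exact production code.

```python
import torch

def heatmaps_to_landmarks(heatmaps: torch.Tensor) -> torch.Tensor:
    """Decode (batch, n_landmarks, H, W) heatmaps into (batch, n_landmarks, 2) coordinates.

    Each channel is treated as a score map for one landmark; we simply take
    the position of its maximum. Illustrative sketch, not production code.
    """
    b, n, h, w = heatmaps.shape
    flat = heatmaps.view(b, n, -1)                             # (b, n, H*W)
    idx = flat.argmax(dim=-1)                                  # hottest pixel per landmark
    ys = torch.div(idx, w, rounding_mode="floor").float()      # row index
    xs = (idx % w).float()                                     # column index
    return torch.stack([xs, ys], dim=-1)                       # (b, n, 2), heatmap pixel coords

# Hypothetical usage for a model predicting 243 landmarks on 64x64 heatmaps:
# heatmaps = model(images)                  # (batch, 243, 64, 64)
# landmarks = heatmaps_to_landmarks(heatmaps)
```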

To train on synthetic data, we couple this backbone with the discriminator-free adversarial learning (DALN) approach introduced very recently by Chen et al. (2022); this is actually another paper from CVPR 2022 so you can consider this part a continuation of our CVPR ‘22 series.

Usually, unsupervised domain adaptation (UDA) works in an adversarial way by

either training a discriminator to distinguish between features extracted from source domain inputs (synthetic data) and target domain inputs (real data), while training the model to make this discriminator fail,

or learning a source domain classifier and a target domain classifier at the same time, training them to perform identically on the source domain and as differently as possible on the target domain, while training the model to keep the classification results on the target domain similar.

DALN suggested a third option for adversarial UDA: it trains only one classifier, with no additional discriminators, and reuses the classifier itself as a discriminator. The resulting loss function is a combination of a regular classification loss on the source domain and a special adversarial loss on the target domain that the feature extractor minimizes and the classifier maximizes with respect to its own weights.
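To illustrate the structure of this min-max game (and only the structure: the actual DALN adversarial term is built differently from the classifier’s prediction matrices, and all names below are hypothetical), here is a schematic PyTorch sketch with a generic classification task; in our pipeline the supervised term is, of course, a landmark loss rather than cross-entropy.

```python
import torch
import torch.nn.functional as F

def adversarial_term(logits_target: torch.Tensor) -> torch.Tensor:
    # Stand-in adversarial score computed from target-domain predictions only.
    # A simple prediction-entropy term is used here just to show the structure;
    # the real DALN objective is constructed differently.
    p = logits_target.softmax(dim=-1)
    return -(p * p.clamp_min(1e-8).log()).sum(dim=-1).mean()

def training_step(backbone, classifier, opt_b, opt_c, x_src, y_src, x_tgt, lam=0.1):
    # (1) Classifier step: keep the supervised loss low on the source (synthetic)
    #     batch, but *maximize* the adversarial term on the target (real) batch,
    #     i.e. act as its own discriminator. The backbone is frozen via detach().
    feat_src, feat_tgt = backbone(x_src).detach(), backbone(x_tgt).detach()
    loss_c = F.cross_entropy(classifier(feat_src), y_src) \
             - lam * adversarial_term(classifier(feat_tgt))
    opt_c.zero_grad()
    loss_c.backward()
    opt_c.step()

    # (2) Backbone step: minimize both the supervised loss and the adversarial
    #     term, pulling target features toward what the classifier handles well.
    feat_src, feat_tgt = backbone(x_src), backbone(x_tgt)
    loss_b = F.cross_entropy(classifier(feat_src), y_src) \
             + lam * adversarial_term(classifier(feat_tgt))
    opt_b.zero_grad()
    loss_b.backward()
    opt_b.step()
```

The point of the sketch is only the direction of the two updates: the same prediction head plays the roles of both task predictor and discriminator.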

We have found that this approach works very well, but we had an additional complication. Our synthetic datasets have more landmarks than iBug68. This means that we cannot use real data to help train the models in a regular “mix-and-match” fashion, simply adding it in some proportion together with the synthetic data. We could pose our problem as pure unsupervised domain adaptation, but that would mean throwing away perfectly good real labels, which also does not sound like a good idea.

To use the available real data, we introduced the idea of label transfer on top of DALN: our model outputs a tensor of landmarks as they appear in synthetic data, and then an additional small network is trained to convert this tensor into iBug68 landmarks. As a result, we get the best of both worlds: most of our training comes from synthetic data, but we can also fine-tune the model with real iBug68 datasets through this additional label transfer network.
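Here is a sketch of what such a label transfer network could look like; the shapes and architecture are purely illustrative (as discussed below, the variant we actually settled on is a small U-Net-like network rather than this simple coordinate-based version).

```python
import torch
import torch.nn as nn

class LabelTransferHead(nn.Module):
    """Maps dense predicted landmarks (n_dense points) to the 68 iBug points.

    Illustrative coordinate-based sketch; the variant we actually settled on
    is a small U-Net-like network operating on heatmaps, not this MLP.
    """
    def __init__(self, n_dense: int = 243, n_ibug: int = 68, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * n_dense, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, 2 * n_ibug),
        )

    def forward(self, dense_landmarks: torch.Tensor) -> torch.Tensor:
        # dense_landmarks: (batch, n_dense, 2) -> (batch, 68, 2)
        b = dense_landmarks.shape[0]
        return self.net(dense_landmarks.reshape(b, -1)).reshape(b, -1, 2)

# Hypothetical use on a real iBug batch: the backbone predicts dense landmarks,
# and the transfer head converts them so that real labels can supervise training.
# dense_pred = heatmaps_to_landmarks(model(real_images))
# loss_real = torch.nn.functional.l1_loss(transfer_head(dense_pred), ibug_landmarks)
```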

Finally, another question arises: okay, we know how to train on real data via an auxiliary network, but how do we test the model? We don’t have any labeled real data with our newly designed extra-dense landmarks, and labeling even a test set by hand is very problematic. There is no perfect answer here, but we found two good ones: we either test only on the predicted points that should coincide exactly with iBug landmarks (if such points exist), or train a small auxiliary network to convert our predictions into iBug landmarks, fix it, and evaluate the rest of the pipeline through it. Both approaches show that the resulting model is able to predict dense landmarks, and that both synthetic and real data are useful for the model even though the real part only has iBug landmarks.

Quantitative Results

At this point, we understand the basic qualitative ideas that are behind our case study. It’s time to show the money, that is, the numbers!

First of all, we need to set the metric for evaluation. We are comparing sets of points that have a known 1-to-1 correspondence so the most straightforward way would be to calculate the average distance between corresponding points. Since we want to be able to measure quality on real datasets, we need to use iBug landmarks in the metric, not extended synthetic sets of landmarks. And, finally, different images will have faces shown at different scales, so it would be a bad idea to measure the distances directly in pixels or fractions of image size. This brings us to the evaluation metric computed as

$$\mathrm{NME} = \frac{100}{n} \sum_{i=1}^{n} \frac{\| y_i - f(x_i) \|_2}{\| p_{\mathrm{left}} - p_{\mathrm{right}} \|_2},$$

where $n$ is the number of landmarks, $y_i$ are the ground truth landmark positions, $f(x_i)$ are the landmark positions predicted by the model, and $p_{\mathrm{left}}$ and $p_{\mathrm{right}}$ are the positions of the left and right pupils of the face in question (the interpupillary distance serves as the normalization coefficient). The coefficient 100 is introduced simply to make the numbers easier to read.
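For reference, here is a straightforward NumPy sketch of this metric, under the assumption (consistent with the formula above) that the normalization is the interpupillary distance:

```python
import numpy as np

def normalized_mean_error(pred, gt, pupil_left, pupil_right):
    """Normalized mean error between predicted and ground truth landmarks.

    pred, gt: arrays of shape (n_landmarks, 2);
    pupil_left, pupil_right: arrays of shape (2,).
    Per-landmark distances are normalized by the interpupillary distance and
    averaged; the factor of 100 only makes the numbers easier to read.
    """
    per_point = np.linalg.norm(pred - gt, axis=1)        # distance for every landmark
    norm = np.linalg.norm(pupil_left - pupil_right)      # interpupillary distance
    return 100.0 * float(np.mean(per_point / norm))
```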

With this, we can finally show the evaluation tables! The full results will have to wait for an official research paper, but here are some of the best results we have now.

Both tables below show evaluations on our synthetic test set and on the iBug dataset. The iBug data is split into four parts: the common training set (a metric you only watch to make sure you are not overfitting), the common test set (the main benchmark), a specially crafted subset of challenging examples, and a private test set from the associated competition. Below we report the synthetic test set, the common test set from iBug, and their challenging test set, the latter serving as a proxy for generalization to a different use case.

To get the predictions in the first table, we predict all synthetic landmarks (490 keypoints in this case!) and then choose the subset of them that most closely corresponds to iBug landmarks and evaluate on it.

| Model | Synthetic test set | Common test set | Challenging test set |
|---|---|---|---|
| Trained on synthetic data only | 3.61 | 8.438 | 20.127 |
| Trained on syn+real data with label adaptation | 3.278 | 5.033 | 9.595 |
| Unsupervised model with a discriminator | 3.755 | 6.642 | 14.947 |
| Supervised model trained on real data only | n/a | 3.07 | 6.525 |

The table above shows only a select sample of our results, but it already compares quite a few variations of our basic model described above:

the model trained on purely synthetic data; as you can see, this variation loses significantly to all other ways to train, so using real data definitely helps;

the model trained on a mix of labeled real and synthetic data with the help of label adaptation as we have described above; we have investigated several different variations of label adaptation networks and finally settled on a small U-Net-like architecture;

the model trained adversarially in an unsupervised way; “unsupervised” here means that the model never sees any labels on real data: it uses labeled synthetic data and unlabeled real data with an extra discriminator that ensures that the same features are extracted in both domains; again, we considered several different ways to organize unsupervised domain adaptation and show only the best one here.

But wait, what’s that bottom line in the table and how come it shows by far the best results? This is the most straightforward approach: train the model on real data from the iBug dataset (common train) and don’t use synthetic data at all. While the model shows some signs of overfitting, it still outperforms every other model very significantly.

One possible way to sweep this model under the rug would be to say that this model doesn’t count because it is not able to show us any landmarks other than iBug’s, so it can’t provide the 490 or 1001 landmarks that other models do. But still — why does it win so convincingly? How can it be that adding extra (synthetic) data hurts performance in all scenarios and variations?

The main reason here is that iBug landmarks are not quite the same as the landmarks that we predict, so even the nearest corresponding points introduce some bias that shows in all rows of the table. Therefore, we have also introduced another evaluation setting: let’s predict synthetic landmarks and then use a separate small model (a multilayer perceptron) to convert the predicted landmarks into iBug landmarks, in a procedure very similar to the label adaptation we used to train the models. We trained this MLP on the same common train set.

The table below shows the new results.

| Model | Synthetic test set | Common test set | Challenging test set |
|---|---|---|---|
| Trained on synthetic data only | 3.61 | 4.777 | 13.115 |
| Trained on syn+real data with label adaptation | 3.278 | 3.825 | 7.53 |
| Unsupervised model with a discriminator | 3.755 | 5.008 | 11.834 |

As you can see, this test-time label adaptation has improved the results across the board, and very significantly! However, they still don’t quite match the supervised model, so some further research into better label adaptation is in order. The relative order of the models has remained more or less the same, and the mixed syn+real model with label adaptation (done with a small U-Net-like architecture) wins again, quite convincingly, although with a smaller margin than before.

Conclusion

We have obtained significant improvements in facial landmark detection, but most importantly, we have been able to train models to detect dense collections of hundreds of landmarks that have never been labeled before. And all this has been made possible with synthetic data: manual labeling would never allow us to have a large dataset with so many landmarks. This short post is just a summary: we hope to prepare a full-scale paper about this research soon.

Kudos to our ML team, especially Alex Davydow and Daniil Gulevskiy, for making this possible! And see you next time!

Sergey Nikolenko
Head of AI, Synthesis AI

P.S. Have you noticed the cover images today? They were produced by the recently released Stable Diffusion model, with prompts related to facial landmarks. Consider it a teaser for a new series of posts to come…

Continue reading
CVPR ‘22, Part IV: Synthetic Data Generation

We continue the long series of reviews for CVPR 2022 papers related to synthetic data. We’ve had three installments so far, devoted to new datasets, use cases for synthetic data, and a very special use case: digital humans. Today, we will discuss papers that can help with generating synthetic data, so expect a lot of 3D model reconstruction, new generative models, especially in 3D, and generally a lot of CGI-related goodness (image generated by DALL-E-Mini by craiyon.com with the prompt “robot designer making a 3D mesh”).

Continue reading
CVPR ‘22, Part III: Digital Humans

Last time, we talked about new use cases for synthetic data, from crowd counting to fractal-based synthetic images for pretraining large models. But there is a large set of use cases that we did not talk about, united by their relation to digital humans: human avatars, virtual try-on for clothes, machine learning for improving animations in synthetic humans, and much more. Today, we talk about the human side of CVPR 2022, considering two primary applications: conditional generation for applications such as virtual try-on and learning 3D avatars from 2D images (image generated by DALL-E-Mini by craiyon.com with the prompt “virtual human in the metaverse”).

Continue reading
AI Interviews: Victor Lempitsky

Meet our distinguished guest for the third interview: Professor Victor Lempitsky. Prof. Lempitsky is among the best researchers in machine learning, placing especially highly in the field of computer vision (here is his Google Scholar account). Currently Victor is leading the Computer Vision Group at Skoltech (Skolkovo Institute of Science and Technology) and is the VR project leader at Yandex.

Read more: AI Interviews: Victor Lempitsky

Foreword. Before we begin, I have to say that this interview was composed before February 24, 2022. In fact, it was finalized on February 22, so by now it is almost half a year old. This is the reason why Q6 may look a little strange these days—we were not dancing around the elephant in the room, it simply had not entered yet. By now, Victor has left both positions mentioned in the preamble and is currently working on a new startup in the AR/VR field.

Q1. Hello Victor, and welcome to our interview! Computer vision is your major focus, so let me start off immediately with the obligatory question for our blog: what is your general view on synthetic data for computer vision? Do you agree that synthetic data, understood as artificially generated labeled data used to train machine learning models, can be a feasible way out of the data problem for computer vision? Or do you place more faith in other possible approaches that we’ve previously discussed on this blog: augmentations, mixup and self-adversarial training, few- and zero-shot learning, adding unlabeled data, and others?

I do believe in synthetic data, and several recent projects I was involved with have seen clear benefits from using synthetic data. However, most useful synthetic data are modeled from the real world. Such modeling can benefit strongly from unsupervised learning. So, in the end, there is no dichotomy: I believe in the usefulness of synthetic data, which is enriched/created from real unlabeled data. Augmentations, mixups, adversarial training can all be used as ways to generate useful synthetic data from real data, even though people do not always think about augmentations in this way.

Q2. Much of your most recent work is devoted to image generation. You have created GANs that work without convolutions or self-attention, neural renderers that can dress 3D avatars and generate semi-transparent objects, GANs that generate timelapse videos of landscapes, and much more. In particular, you often work on 3D generation—generating meshes, textures, point clouds—which is the obvious next step after learning to generate flat images. 3D generation is only starting to work well enough for practical applications, but still, the rate of progress in this field is spectacular. I usually show this picture in my lectures on GANs:

Do you expect 3D generation to undergo similarly explosive growth in the near future? Or are there conceptual difficulties that need to be resolved before we get the virtual reality Metaverse generated on the fly with GANs?

The picture you show is indeed very telling, and it reflects and conflates several trends: improvements in algorithms, improvements in computational resources, and improvements in datasets. 

Given how many bright people are now working on 3D data synthesis, I believe that fast progress in algorithms is inevitable. Neural renderers such as PyTorch3D or nvdiffrast are certainly one piece of the puzzle. Computational resources are trickier and a lot of progress will be bottlenecked on them, so I naturally expect that main breakthroughs will come from the “big four” of NVidia, Google/DeepMind, Meta, and Microsoft (all four have brilliant researchers but also huge computational resources). This was to a large degree true even for 2D image generation, and will likely remain even more true for 3D. Note that I am not saying that everybody else should either join those corporations or work on something else. Just like StyleGAN(s) from NVidia created a whole vibrant ecosystem of researchers from different institutes building on top of it, the same will likely happen with 3D.

The main bottleneck for progress in 3D data synthesis, however, is (and will be) datasets. Here things are very different from 2D. With 2D, once algorithms and resources were ready, finding good enough datasets for learning was relatively easy. Note that here I am talking about 2D static image generation; good datasets of HD videos are much harder to get: say, YouTube is largely not HD quality, and it is quite a challenge to scrape video datasets of objects or people in high resolution from YouTube. Getting good and large 3D datasets is much harder, especially if we are talking about “full 3D” and not just 2.5D (i.e. color + depth) or toyish 3D models. Currently, quite a few researchers are trying to bypass this lack of datasets and to learn 3D synthesis by matching 2D images. To this end, they insert 2D projections into their generation learning pipelines. This is surely interesting and could be fruitful, but is inevitably much harder. Just imagine someone trying to learn StyleGAN-like image synthesis while only having access to a dataset of 1D projections such as row sums or one-pixel slices.

To sum up, I think that the rate of progress in 3D data synthesis will be limited and conditioned on the quality of 3D datasets. Hence, it will be a harder and longer story than with 2D (but no less interesting!)

Q3. Let us continue from the last question, taking generative models yet further into the realm of speculation. I have always viewed image and 3D generation as an inherently finite task. It has not been easy to scale GANs up, but it seems like progress is inevitable. And human eyes have a finite resolution after all (be it 8K, 32K, or 256K), so the models will sooner or later reach this resolution with photorealistic quality, and there will be no point to move any further. 

Do you agree with this view, and if yes, when do you expect image and 3D scene generation to hit this ceiling and provide a perfectly immersive experience? (Let’s limit this question to vision, I understand that full immersion will require other senses as well.)

Let me start by noting that the story with 2D image generation is far from over, even if one can generate very realistic human faces. First of all, GANs still have limited diversity and mode coverage (otherwise we would not have dozens of interesting papers on StyleGAN inversion, and very simple approaches would do the job). Diffusion models are better than GANs in covering the whole distribution but are still extremely slow. Furthermore, even though GAN samples for faces are realistic, GAN samples for full-body human images or, say, for full-body cats are either significantly less realistic or significantly less diverse (or both). Finally, for 2D video synthesis, we as a community are very far from truly realistic results (at least in the unconditional setting).

Regarding 3D, the situation is even harder for the reasons I discussed in the answer to the previous question, so I do not expect perfect photorealism there for quite a few years.

Q4. Now let me ask a (slightly more) technical question that I’ve been interested in for a long time. Your two most cited papers according to Google Scholar are “Unsupervised domain adaptation by backpropagation” (joint work with Yaroslav Ganin) and its continuation and extension, “Domain-adversarial training of neural networks” (with a lot of people including, e.g., Hugo Larochelle). They are also, in my opinion, some of the most relevant for synthetic data because they present a simple and ingenious domain adaptation method.

We have just discussed the basic idea of Ganin and Lempitsky (2015) on this blog, so I’ll be very brief in explaining it. The idea goes as follows: suppose you want to have a model that works for both synthetic and real data (or any two domains, really). You want to train a feature extractor that will extract features independently of the domain, so that, say, a synthetic face will have the same features extracted as its real counterpart, and models trained with these features on synthetic data can be applied to real data. To achieve this, you add a domain classifier that predicts whether it was a synthetic or a real image based on the features extracted. You want that classifier to fail, just like you want the discriminator to fail in GANs. So you train it as another head of your network, but the gradients for the classification error function are reversed, optimizing it in the opposite direction. In the illustration below (taken from your papers), the classifier wants to minimize its loss L_d, but by the time it gets to the feature extractor, the loss is inverted, and the extractor is actually maximizing it.
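For readers who have not seen it in code, the gradient reversal layer itself is only a few lines; here is a minimal PyTorch re-statement of the idea (the original implementation predates PyTorch, so this is a sketch of the technique, not the authors’ code):

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; multiplies the gradient by -lambda on the way back."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) the gradient flowing toward the feature extractor;
        # the second returned value is the gradient w.r.t. lam (not needed).
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam: float = 1.0):
    return GradReverse.apply(x, lam)

# The label predictor sees the features directly, while the domain classifier
# sees them through the reversal layer: minimizing the domain loss L_d w.r.t.
# the classifier then maximizes it w.r.t. the feature extractor.
# domain_logits = domain_classifier(grad_reverse(features, lam))
```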

My question here is two-fold. First, I explained your idea in terms of synthetic and real images, and the actual papers also present examples of synthetic-to-real transfer, but only for small images. Have there been attempts to apply this to larger-scale domain adaptation, especially synthetic-to-real, and how successful have they been?

Second, domain-adversarial training sounds like a very general idea that could actually be applicable more widely than just to domain adaptation. One cannot say this idea is not widely known: both papers have thousands of citations, including foundational works on GANs. But why haven’t GANs switched to gradient reversal instead of alternating training between the generator and discriminator? Are there some hidden problems here that are not evident in the basic idea?

On your first question, indeed the approach has become popular, and there has been a lot of follow-up work including applications to large images. Just as with small images, the approach there works somewhat but without miracles. I.e., it usually beats the no-adaptation baseline quite confidently, but, of course, does not solve the domain gap problem completely. For the second question, indeed almost all GANs separate the steps for the generator and the discriminator updates and do not reuse the gradient. The main reason, I believe, is that most modern GANs use slightly different functionals as objectives for the generator and the discriminator. In particular, it turns out that to get the best GAN performance, it is useful to have some form of the so-called non-saturating objective for the discriminator, and also to regularize the discriminator quite strongly with a proper regularizer (and details of such regularization matter a lot). So, when your generator and discriminator are trying to optimize slightly different functionals, gradient reuse becomes highly non-trivial and is therefore not used. 

Just to clarify, for me the difference between gradient reversal and GANs is not a big deal. Actually, we learned about the GAN arxiv report halfway through the project, and by that time we had settled on the idea and the language of “gradient reversal”. This is why we explained our approach in a slightly different way in our paper, and perhaps connected it to GANs in a less clear way than we should have done (but back in early 2015 it was way less obvious that GANs would become such a dominating idea).

Q5. Another recent work of yours introduces Cloud Transformers, special architectures for processing point clouds that use ideas similar to self-attention blocks, with excellent results in point cloud segmentation, inpainting, and reconstruction tasks.

Since their inception in 2017, Transformers have taken deep learning by storm. They started by basically replacing all other embeddings in natural language processing and serving as the basis for the very best language models, but now they are all over computer vision as well, ever expanding their reach as your own work suggests. It looks a bit like how deep learning was gradually taking over every field in the early 2010s.

Do you have an explanation for this success? I understand how a Transformer works mathematically, but is there any explanation why self-attention proves to be such a good idea in practice?

Or maybe it’s just an umbrella term for a specific useful trick, and otherwise modern Transformers are very different from each other? In your paper, you keep using words such as “variant” or “reminiscent”, and the architecture indeed doesn’t look much like Vaswani’s original. What is that core idea that makes an architecture a Transformer, and again, why, in your opinion, does it work so well?

Well, it is hard to argue against the claim that transformers are the most exciting and impactful thing that has happened in deep learning in recent years. What is most exciting about transformers is their universality. True, we are still witnessing the competition between vision transformer variants and ConvNet architectures for the title of “the king of ImageNet”. But what is remarkable and makes many people excited is that very similar Transformer architectures can solve very different tasks across very different modalities (images, audio, text, action planning, etc.) with near state-of-the-art quality. Certainly, it feels like the right thing, as our brains also have remarkable plasticity and can repurpose different parts between modalities.

Our cloud transformers paper will obviously be far less impactful compared to the original transformers, but I still like it very much. Our architecture is similar to “classical” transformers in some ways. E.g. it treats individual points as elements within an unordered set, and our key layer uses multiple processing heads. There are also differences (our equivalent of attention is sparse, and we use convolutions). Still, what I liked about our results is that essentially the same architecture is able to solve very different point cloud processing tasks. This is again reminiscent of the general transformer idea. 

Q6. And finally, a (slightly) more personal question. Anyone who knows you personally or at least follows you knows that you feel strongly about the ethical use of AI. There is a growing trend in the computer vision community of discussing the ethical usage of CV technologies. For instance, the creator of the YOLO object detectors, Joseph Redmon, quit computer vision in early 2020 and famously explained his decision as follows: “I stopped doing CV research because I saw the impact my work was having. I loved the work but the military applications and privacy concerns eventually became impossible to ignore.”

What is your view on the ethical concerns that arise in modern computer vision? Are researchers responsible for potentially unethical uses of their results? I suppose there is no way to stop progress, but do you think there may be ways to ensure that progress works for the benefit of humanity and not against it? What would you advise to work on if one wanted to achieve this goal?

I had a small project on person re-identification (mostly from surveillance cameras) with my PhD student back in 2016, and after one year or so we stopped. I do not think we pushed state-of-the-art in video surveillance that much, and the reviewers for the submissions we made on the subject concurred with that :). It is the only example where, in retrospect, I sleep slightly better because my work did not make an impact. 

Having said that, some of the good and well-meaning people that I know still work on face recognition and camera-based surveillance, and I do not want to judge them. After all, the camera-based surveillance technology is double-edged. It will most likely benefit strong democratic societies by making life there safer and more convenient, but it will make life in authoritarian and totalitarian societies considerably worse, which we are already starting to witness in Russia and other countries. The same actually goes for AI and automation issues. The net effect will be strongly positive, people will live more meaningful and productive lives with more interesting occupations, but the dystopian scenarios will also materialize in some societies. 

Like always, stopping the progress is impossible, even if many strong researchers including Joe Redmon quit the area. Progress in AI-based surveillance and automation “simply” calls for better and stronger political institutions. And the faster the progress, the more urgent the call. I know this all sounds like I am trying to push the responsibility from AI researchers to others (civil society and politicians), but I am just being honest and realistic. The best thing that we (researchers) can and must do is to inform the general public about the current state-of-the-art and reasonable projections for the future.

Victor, thank you very much for your answers! And you, dear reader, stay tuned for our next interviews!

Sergey Nikolenko
Head of AI, Synthesis AI

Continue reading
CVPR ‘22, Part II: New Use Cases for Synthetic Data

Last time, we started a new series of posts: an overview of papers from CVPR 2022 that are related to synthetic data. This year’s CVPR has over 2000 accepted papers, and many of them touch upon our main topic on this blog. In today’s installment, we look at papers that make use of synthetic data to advance a number of different use cases in computer vision, along with a couple of very interesting and novel ideas that extend the applicability of synthetic data in new directions. We will even see some fractals as synthetic data! (image source)

Continue reading
CVPR ‘22, Part I: New Synthetic Datasets

CVPR 2022, the largest and most prestigious conference in computer vision and one of the most important ML venues in general, has just finished in New Orleans. With over 2000 accepted papers, reviewing the contributions of this year’s CVPR appears to be a truly gargantuan task. Over the next series of blog posts, we will attempt to go over the most interesting papers directly related to our main topic: synthetic data. Today, I present the first but definitely not the last installment devoted to papers from CVPR 2022.

Continue reading
Driving Model Performance with Synthetic Data VII: Model-Based Domain Adaptation

After a long hiatus, we return from interviews to long forms, continuing (and hopefully finishing) our series on how synthetic data is used in machine learning and how machine learning models can adapt to using synthetic data. This is our seventh installment in the series (part 1, part 2, part 3, part 4, part 5, part 6), but, as usual, this post is (I hope!) sufficiently self-contained. We will discuss how a model trained on synthetic data can work well without explicitly making that data more realistic, doing the domain adaptation work at the level of features or of the model itself.

Continue reading
AI Interviews: Serge Belongie

Hi all! Today we begin a new series of posts here in the Synthesis AI blog. We will talk to the best researchers and practitioners in the field of machine learning, discussing different topics but, obviously, trying to circle back to our main focus of synthetic data every once in a while.

Today we have our first guest, Professor Serge Belongie. He is a Professor of Computer Science at the University of Copenhagen (DIKU) and the Director of the Pioneer Centre for Artificial Intelligence. Previously he was the Andrew H. and Ann R. Tisch Professor at Cornell Tech and in the Computer Science Department at Cornell University, and an Associate Dean at Cornell Tech.

Continue reading
