AI Interviews: Serge Belongie

September 13, 2021

Hi all! Today we begin a new series of posts here on the Synthesis AI blog. We will talk to the best researchers and practitioners in the field of machine learning, discussing a range of topics but, naturally, circling back to our main focus of synthetic data every once in a while.

Today we have our first guest, Professor Serge Belongie. He is a Professor of Computer Science at the University of Copenhagen (DIKU) and the Director of the Pioneer Centre for Artificial Intelligence. Previously he was the Andrew H. and Ann R. Tisch Professor at Cornell Tech and in the Computer Science Department at Cornell University, and an Associate Dean at Cornell Tech.

Over his distinguished career, Prof. Belongie has been highly successful in both academia and business. He co-founded several successful startups, including Digital Persona, Inc., which first brought a fingerprint identification device to the mass market, and two computer vision startups, Anchovi Labs and Orpix. The MIT Technology Review included him on their list of Innovators Under 35 for 2004, and in 2015 he received the ICCV Helmholtz Prize. Google Scholar gives Prof. Belongie a spectacular h-index of 96, built on dozens of papers, each with hundreds of citations, that have become fundamental for computer vision and other fields. And, to be honest, I got most of this off Prof. Belongie’s Wikipedia page, which means that this is just barely scratching the surface of his achievements.

Q1. Hello Professor, and welcome to our interview! Your list of achievements is so impressive that we definitely cannot do it justice in this format. But let’s try to add at least one little bit to this Wikipedia dump above. What is the one thing, maybe the one new idea that you are most proud of in your career? You know, the idea that makes you feel the warmest and fuzziest once you remember how you had it?

Prof. Belongie: Thank you for inviting me! I’m excited about Synthesis AI’s vision, so I’m happy to help get out the word to the CV/ML community.

This is a timely question, since I recently started a “Throwback Thursday” series on my lab’s Twitter account. Each week over this past summer, my former students and I had a fun time looking back on the journey behind our publications since I became a professor a couple of decades ago. The ideas of which I feel most proud have rarely appeared in highly cited papers. One example is the grid-based comparisons in our 2015 paper “Cost-Effective HITs for Relative Similarity Comparisons.” As my students from that time will recall, I was captivated by the idea of triplet-based comparisons for measuring perceptual similarity (“is a more similar to b than to c?”), but the cubic complexity of such approaches limited their practical adoption. Then it occurred to us that humans have excellent parallel visual processing abilities, which means we could fill a screen with 4×4 or 5×5 grids of images, and through some simple UI trickery, we could harvest large batches of triplet constraints in one shot, using a HIT (human intelligence task) that was both less expensive to run and more entertaining to complete for the participants. While this approach and the related SNaCK approach we published the following year have not gotten much traction in the literature, I’m convinced that this concept will eventually get its day in the sun.
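To make the grid idea concrete, here is a minimal sketch of how one grid answer could be unpacked into triplets. It assumes a UI in which the worker sees a probe image next to the grid and marks the images most similar to it; the function name and image identifiers are hypothetical, and this is an illustration rather than the code behind the paper.

```python
from itertools import product


def triplets_from_grid(probe, grid, selected):
    """Unpack one grid answer into (probe, near, far) triplets.

    Each triplet reads: "probe is more similar to near than to far".
    `grid` lists the images shown, `selected` the ones the worker marked.
    """
    chosen = set(selected)
    unknown = chosen - set(grid)
    if unknown:
        raise ValueError(f"selected images not in grid: {sorted(unknown)}")
    near = [img for img in grid if img in chosen]
    far = [img for img in grid if img not in chosen]
    # Every marked image is judged closer to the probe than every unmarked one.
    return [(probe, a, b) for a, b in product(near, far)]


# A 4x4 grid with 4 marked images yields 4 * 12 = 48 triplets from one task.
grid = [f"img_{i}" for i in range(16)]
triplets = triplets_from_grid("probe", grid, ["img_0", "img_3", "img_7", "img_9"])
print(len(triplets))
```

With k marked images in a grid of n, a single answer yields k(n - k) constraints, compared to one constraint per question when triplets are asked one at a time.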

Q2. Now for the obligatory question: what is your view on the importance of synthetic data for modern computer vision? Here at Synthesis AI, we believe that synthetic data can become one of the solutions to the data problem; do you agree? What other solutions do you see and how, in your opinion, does synthetic data fit into the landscape of computer vision of the future?

Prof. Belongie: I am in complete agreement with this view. When pilots learn to fly, they must log thousands of hours of flight time in simulated and real flight environments. That is an industry that, over several decades, has found the right balance of real vs. synthetic for the best instructional outcome. Our field is now confronting an analogous problem, with the key difference that the student is a machine. With that difference in mind, we will again need to find the right balance. As my PhD advisor [Jitendra Malik] used to tell us in the late 90s, nature has a way of detecting a hack, so we must be careful about overstating what’s possible with purely synthetic environments. But when you think about the Cartesian product of all the environmental factors that can influence, say, the appearance of city streets in the context of autonomous driving, it seems foolish not to build upon our troves of real data with clever synthesis and augmentation approaches to give our machines a gigantic head start before tackling the real thing.

Q3. Among all your influential papers with hundreds of citations, the one that looks to me most directly relevant to synthetic data is the paper where Xun Huang and yourself introduced adaptive instance normalization (AdaIN), a very simple style transfer approach that still works wonders. We recently talked about AdaIN on this blog, and in our experiments we have never seen a more complex synthetic-to-real refinement pipeline, even based on your own later work, MUNIT, outperform the basic AdaIN. What has worked best for synthetic-to-real style transfer for you? Do you maybe have more style transfer techniques in store for us, to appear in the near future?

Prof. Belongie: Good ol’ AdaIN indeed works surprisingly well in a wide variety of cases. The situation gets more nuanced, however, in fine-grained settings such as the iNat challenges or NeWT downstream tasks. In these cases, even well-intentioned style transfer methods can trample over the subtle differences that distinguish tightly related species; as the saying goes, “one person’s signal is another person’s noise.” In this context, we’ve been reflecting on the emerging practice of augmentation engineering. Ever since deep learning burst onto the scene around 2011, it hasn’t been socially acceptable to fiddle with feature design manually, but no one complains if you fiddle with augmentation functions. The latter can be thought of as a roundabout way to scratch the same itch. It’s likely that in fine-grained domains, e.g., plant pathology, we’ll need to return to the old – and in my opinion, good – practices of working closely with domain experts to cultivate domain-appropriate geometric and photometric transformations.
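As a reminder of what AdaIN does, here is a minimal NumPy sketch of the operation itself: each channel of the content features is normalized and then rescaled to the per-channel mean and standard deviation of the style features. In the full method the inputs are encoder feature maps and a trained decoder turns the result back into an image; in this sketch random arrays stand in for the features, and the function name and `eps` value are illustrative assumptions.

```python
import numpy as np


def adain(content, style, eps=1e-5):
    """Adaptive instance normalization for feature maps of shape (C, H, W).

    Content and style may differ in spatial size; only the channel
    count C has to match.
    """
    # Per-channel statistics over the spatial dimensions.
    c_mean = content.mean(axis=(1, 2), keepdims=True)
    c_std = np.sqrt(content.var(axis=(1, 2), keepdims=True) + eps)
    s_mean = style.mean(axis=(1, 2), keepdims=True)
    s_std = np.sqrt(style.var(axis=(1, 2), keepdims=True) + eps)
    # Normalize the content, then apply the style's scale and shift.
    return s_std * (content - c_mean) / c_std + s_mean


rng = np.random.default_rng(0)
content = rng.normal(0.0, 1.0, size=(3, 8, 8))   # stand-in content features
style = rng.normal(5.0, 2.0, size=(3, 6, 6))     # stand-in style features
out = adain(content, style)
print(out.mean(axis=(1, 2)), style.mean(axis=(1, 2)))
```

The output keeps the spatial layout of the content while its channel statistics match those of the style. Those statistics are also where subtle fine-grained cues can be overwritten.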

In terms of what’s coming next in style transfer, I’m excited about our recent work in the optical see-through (OST) augmented reality setting. In conventional style transfer, you have total control over the values of every pixel. In the OST setting, however, you can only add light; you can’t subtract it. So what can be done about this? We tackle this question in our recent Stay Positive work, focusing on the nonnegative image synthesis problem, and leveraging quirks of the human visual system’s processing of brightness and contrast.
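The add-only constraint is easy to see in a toy model. The sketch below assumes an idealized display in which the perceived image is simply the background light plus the emitted light, with intensities in arbitrary linear units. It shows a naive clipped-residual baseline, not the Stay Positive method, and both function names are hypothetical.

```python
import numpy as np


def perceived(background, emitted):
    """Idealized additive display: the eye sees background plus emitted light."""
    # The display cannot emit negative light, so negative requests are dropped.
    return background + np.clip(emitted, 0.0, None)


def naive_emission(background, target):
    """Naive baseline: emit the difference to the target wherever it is positive."""
    return np.clip(target - background, 0.0, None)


background = np.array([0.5, 0.5])
target = np.array([0.8, 0.2])   # first pixel brighter, second darker than background
emitted = naive_emission(background, target)
print(perceived(background, emitted))
```

The first pixel reaches its target, while the second stays at the background level because darkening would require negative light. That gap is what nonnegative image synthesis has to work around.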

Q4. Continuing from the last question, one of the latest papers to come out of your group is titled “Single Image Texture Translation for Data Augmentation”. In it, you propose a new data augmentation technique that translates textures between objects from single images (as a brief reminder for the readers, we have talked about what data augmentation is previously on this blog). The paper also includes a nice graphical overview of modern data augmentation methods that I can’t help but quote here:

[Figure: graphical overview of modern data augmentation methods, from “Single Image Texture Translation for Data Augmentation”]

Looking at this picture makes me excited. What is your opinion on the limits of data augmentation? Combined with neural style transfer and all other techniques shown here, how far do you think this can take us? How do you see these techniques potentially complementing synthetic data approaches (in the sense of making 3D models and rendering images), and are there, in your opinion, unique advantages of synthetic data that augmentation of real data cannot provide?

Prof. Belongie: When it comes to generic, coarse-grained settings, I would say the sky’s the limit in terms of what data augmentation can accomplish. Here I’m referring to supplying modern machine learning pipelines with sufficiently realistic augmentations, such as adding rain to a street or stubble to a face. The bar is, of course, somewhat higher if the goal is to cross the uncanny valley for human observers. And as I hinted earlier, fine-grained visual categorization (FGVC) also presents some tough challenges for the data augmentation movement. FGVC problems are characterized by the need for specialized domain knowledge, the kind that is possessed by very few human experts. In that sense, knowing how to tackle the data augmentation problem for FGVC is tantamount to bottling that knowledge in the form of a family of image manipulations. That strikes me as a daunting task.

Q5. A slightly personal question here. Your group at UCSD used to be called SO(3) in honor of the group of three-dimensional rotations, and your group at Cornell now is called SE(3), after the special Euclidean group in three dimensions. This brings back memories of how I used to work in algebra a little bit back when I was an undergrad. I realize the group’s title probably doesn’t mean much but still: do you see a way for modern algebra and/or geometry to influence machine learning? What is your opinion of current efforts in geometric deep learning: would you advise current math undergrads to go there?

Prof. Belongie: Geometric deep learning provides an interesting framework for incorporating prior knowledge into traditional deep learning settings. Personally, I find it exciting because a new generation of students is talking about topics like graph Laplacians again. I don’t know if I’d point industry-focused ML engineers at geometric deep learning, but I do think it’s a rich landscape for research-oriented undergrads to explore, with an inspiring synthesis of old and new ideas.

Q6. And, if you don’t mind, let us finish with another personal question. Turns out SO3 is not just your computer vision research group’s title but also your band name! I learned about it from this profile article about you that lists quite a few cool things you’ve done, including a teaching gig in Brazil “inspired by Richard Feynman”.

So I guess it’s safe to say that Richard Feynman has been one of your heroes. Who else has been an influence? How did you turn to computer science? And are there maybe some other biographies or popular books that you can recommend for our readers who are choosing their path right now?

Prof. Belongie: Ah, I see you’ve done your research! The primary influences in my career have been my undergrad and grad school advisors, Pietro Perona and Jitendra Malik, who are both towering figures in the field. From them I gained a deep appreciation of ideas outside of computer science and engineering, including human vision, experimental psychology, art history, and neuroscience. I find myself quoting, paraphrasing, or channeling them on a regular basis when meeting with my students. In terms of turning to computer science, that was a matter of practicality. I started out in electrical engineering, focusing on digital signal processing, and as my interests coalesced around image recognition, I naturally gravitated to where the action was circa the late 90s, i.e., computer science.

As far as what I’d recommend now, that’s a tough question. My usual diet is based on the firehose of arXiv preprints that match my group’s keywords du jour. But this can be draining and even demoralizing, since you’ll start to feel like it’s all been done. So if you want something to inspire you, read an old paper by Don Geman, like this one about searching for mental pictures. Or better yet, after you’re done with your week’s quota of @ak92501-recommended papers, go for a long drive or walk and listen to a Rick Beato “What Makes this Song Great” playlist. It doesn’t matter if you know music theory, or if some of the genres he covers aren’t your thing. His passion for music – diving into it, explaining it, making the complex simple – is infectious, and he will inspire you to do great things in whatever domain you’ve chosen as your focus.

Dear Professor, thank you very much for your answers! And thank you, the reader, for your attention! Next time, we will return with an interview with another important figure in machine learning. Stay tuned!

Sergey Nikolenko
Head of AI, Synthesis AI
