LLMs for Programming: Is Coding Dying?
It is time to discuss some applications. Today, I...
Today, we are talking about the Metaverse, a bold vision for the next iteration of the Internet consisting of interconnected virtual spaces. The Metaverse is a buzzword that had sounded entirely fantastical for a very long time. But lately, it looks like technology is catching up, and we may live to see the Metaverse in the near future. In this post, we discuss how modern artificial intelligence, especially computer vision, is enabling the Metaverse, and how synthetic data is enabling the relevant parts of computer vision.
The Metaverse is far from a new idea. Anyone familiar with the cyberpunk genre will immediately recognize the concept of a virtual reality that characters of William Gibson’s Neuromancer (1984) inhabit. The term itself was coined in Neal Stephenson’s novel Snow Crash (1992), and this virtual reality-based Internet 2.0 has seen many fictionalized adaptations ever since, including The Matrix, Ready Player One, a recent Amazon series Upload, and many more.
While the Metaverse has long been the subject of sci-fi, by now many visionaries believe that developments in VR, AR, and related fields may soon enable similar experiences in real life… I mean, in virtual life, but real virtual life… you know what I mean. One of the sources that got me thinking about the Metaverse recently was a long interview with Mark Zuckerberg. He talks about “the successor to the mobile internet… an embodied internet, where instead of just viewing content — you are in it… present with other people as if you were in other places”. It sounds like Facebook believes in the VR and AR technology and sees the clunkiness of current generation devices as the main obstacle: right now hardly anybody would want to do their jobs in a VR helmet. As soon as wearable technology becomes miniature and light enough, the Metaverse will be upon us.
Mark Zuckerberg motivates this vision, in particular, with mobile workstations: “…you can walk into a Starbucks… and kind of wave your hands and you can have basically as many monitors as you want, all set up, whatever size you want them to be… and you can just bring that with you wherever you want.” Facebook calls this idea the “infinite office.” But in my opinion, it is almost inevitable that entertainment will be the main driving force behind the Metaverse: imagine that you don’t need large screens to have an immersive cinematic experience, imagine your friends on social networks (well, maybe one social network in particular) streaming their experiences through AR glasses, imagine immersive 3D games that enable real human-to-human personal interaction… Well, I’m sure you’ve heard pitches for the VR technology many times, but this time it sounds like it really has a chance of coming through and becoming the next big thing. Others are beginning to build their own vision for the Metaverse including Epic Games, Roblox, Unity, and more.
But we need more than just smaller VR helmets and AR glasses to build the Metaverse. This hardware has to be supported by software that makes the transition between the real and virtual worlds seamless—and this would be impossible without state of the art computer vision. Let me make just a few examples.
First, the obvious: VR helmets and controllers need to be positioned in space very accurately, and this tracking is usually done with visual information from cameras, either installed separately in base stations or embedded into the helmet itself. This is a basic computer vision problem of simultaneous localization and mapping problem (SLAM). VR helmet technology has recently undergone an important shift: earlier models tended to require base stations (“outside-in” tracking), and latest helmets can localize controllers accurately with embedded cameras (“inside-out” tracking) so you don’t need any special setup in the room (image source):
This is a result of progress in computer vision, the cameras themselves have not improved that much.
This problem becomes harder if we are talking about augmented reality: AR software also needs to understand its position in the world, but it needs a far more detailed and accurate 3D map of the environment in order to be able to augment it for the user. Check out our latest AI interview with Andrew Rabinovich, who was the Director of Deep Learning at Magic Leap, the startup that tried to do exactly this.
Second, we have already talked many times about gaze estimation, i.e., finding out where a person is looking by the picture of their face and eyes. This is also a crucial problem for AR and VR. In particular, current VR relies upon foveated rendering, a technique where the image in the center of our field of view is rendered in high resolution and high detail, and it becomes progressively worse on the periphery; for an overview see, e.g., Patney et al. (2016). This is, by the way, exactly how we ourselves see things; we see only a very small portion of the field of view clearly and in full detail, and peripheral vision is increasingly blurry (illustration by Rooney et al., 2017):
Foveated rendering is important for VR because VR has an order of magnitude larger field of view than flat screens, and requires a high resolution to support the illusion of immersive virtual reality, so rendering it all in this resolution would be far beyond consumer hardware.
Third, when you enter virtual reality, you need an avatar to represent you; current VR applications usually provide stock avatars or forgo them entirely (many VR games represent the player as just a head and a pair of hands), but an immersive virtual social experience would need photorealistic virtual avatars that represent real people and can capture their poses, . Constructing such an avatar is a very hard computer vision problem, but people are making good progress on it. For instance, a recent work by Victor Lempitsky’s team introduced textured full-body avatars able to capture poses in real time by visual data streaming from several cameras:
We are still not quite there, especially when it comes to faces and emotions, but we are getting better, and the Metaverse will definitely make use of this technology.
These are only a few of the computer vision problems that arise along the way to the Metaverse; for a more, pardon the pun, immersive experience just look at the list of talks on the recent IEEE VR Conference, where you will see all of these topics and much more.
Our long-time readers have no doubt already recognized where this blog post is going. Indeed, as we have discussed many times before (e.g., here or here), modern computer vision is requiring increasingly large datasets, and manual labeling simply stops working at some point. At Synthesis AI, we are proposing a solution to this problem in the form of synthetic data: artificially generated images and/or 3D scenes that can be used to train machine learning models.
I chose the three examples above because they each illustrate different uses of synthetic data in machine learning. Let us go over them again.
First, SLAM is an example where synthetic data can be used in a straightforward way: construct a 3D scene and use it to render training set images with pixel-perfect labels of any kind you would like, including segmentation, depth maps, and more. We have talked about simulated environments on this blog before, and SLAM is a practical problem where segmentation and depth estimation arise as important parts. Modern synthetic datasets provide a wide range of cameras and modalities; for example, here is an overview of a recently released dataset intended specifically for SLAM (Wang et al. 2020):
Second, gaze estimation is an interesting problem where real data may be hard to come by, and synthetic data comes to the rescue. I have already used gaze estimation on this blog as a go-to example for domain adaptation, i.e., the process of modifying the training data and/or machine learning models so that the model can work on data from a different domain. Gaze estimation works with relatively small input images, so this was an early success for GANs for synthetic-to-real refinement, where synthetic images were made more realistic with specially trained generative models. Recent developments include a large real dataset, MagicEyes, that was created specifically for augmented reality applications (Wu et al., 2020); in fact, it was released by Magic Leap, and we discussed it with Andrew last time:
Third, virtual avatars touch upon synthetic data from the opposite direction: now the question is about using machine learning to generate synthetic data. We talked about capturing the pose and/or emotions from a real human model, but there is actually a rising trend in machine learning models that are able to create realistic avatars from scratch. Instagram is experiencing a new phenomenon: virtual influencers, accounts that have a personality but do not have a human actually realizing this personality. Here is Lil Miquela, one of the most popular virtual influencers:
From a research perspective,, this requires state of the art generative models that are supplemented with synthetic data in the classical sense: you need to create a highly realistic 3D environment, place a high-quality human model inside, and then use a generative model (usually a style transfer model) to make the resulting image even more realistic. In this direction, it is still a long way to go before we can have fully photorealistic 3D avatars ready for the Metaverse, but the field is developing very rapidly, and this long way may be traversed in much less time than we have ever expected.
The Metaverse is an ambitious vision straight out of science fiction, but it looks like the Metaverse is becoming increasingly realistic. It is quite possible that you and I will live to see an actual Metaverse, be it a social-centric Facebook 2.0 envisioned by Mark Zuckerberg, massively multiplayer OASIS out of Ready Player One, or, God forbid, the all-encompassing Matrix. But before we get there, there are still many research problems to be solved. Most of them lie in the field of computer vision, and this is exactly where synthetic data is especially effective for machine learning. Join us next time for another installment on synthetic data!