- By MALLIKARJUN B R and AYUSH TEWARI, MPI for Informatics, SIC, Germany and 9 others
Paper Link
Abstract
Photorealistic editing of head portraits is a challenging task, as humans are
very sensitive to inconsistencies in faces. The paper presents an approach for high-quality, intuitive editing of the camera viewpoint and scene illumination
(parameterised with an environment map) in a portrait image. This requires
our method to capture and control the full reflectance field of the person in
the image. Most editing approaches rely on supervised learning using training data captured with setups such as light and camera stages. Such datasets
are expensive to acquire, not readily available and do not capture all the
rich variations of in-the-wild portrait images. In addition, most supervised
approaches only focus on relighting and do not allow camera viewpoint
editing.
The paper presents a method that learns
from limited supervised training data. The training images only include
people in a fixed neutral expression with eyes closed, without much hair or
background variation. Each person is captured under 150 one-light-at-a-time conditions and under 8 camera poses. Instead of training directly in
the image space, we design a supervised problem that learns transformations in the latent space of StyleGAN. This combines the best of supervised
learning and generative adversarial modelling. We show that the StyleGAN prior allows for generalisation to different expressions, hairstyles and backgrounds. This produces high-quality photorealistic results for in-the-wild
images and significantly outperforms existing methods. Our approach can
edit the illumination and pose simultaneously and runs at interactive rates.
Paper Contributions
- We combine the strengths of supervised learning and generative adversarial modelling in a new way to develop a technique
for high-quality editing of scene illumination and camera pose
in portrait images. Both properties can be edited simultaneously.
- Our novel formulation allows for generalisation to in-the-wild
images with significantly higher quality results than related
methods. It also allows for training with a limited amount of
supervision.
The method allows for editing the scene illumination 𝐸𝑡 and camera pose 𝜔𝑡 in an input source image 𝐼𝑠. We learn to map the StyleGAN latent code 𝐿𝑠 of the source image, estimated using pSpNet, to the latent code 𝐿𝑡 of the output image. StyleGAN is then used to synthesise the final output 𝐼𝑡. Our method is trained in a supervised manner using a light-stage dataset with multiple cameras and light
sources. For training, a latent loss and a perceptual loss defined using a pretrained network 𝜙 are used. Supervised learning in the latent space of StyleGAN
allows for high-quality editing which can generalise to in-the-wild images.
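As a rough illustration of how such a supervised objective in latent space might look, the sketch below combines a latent loss with a perceptual loss computed by a pretrained feature network 𝜙 on the decoded image. The network architecture, dimensions and loss weights are assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class LatentMapper(nn.Module):
    """Hypothetical MLP that maps the source latent code L_s, conditioned on the
    target environment map E_t and camera pose w_t, to the edited code L_t.
    Dimensions are placeholders, not the paper's exact architecture."""
    def __init__(self, latent_dim=512, env_dim=512, pose_dim=3, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + env_dim + pose_dim, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, latent_dim),
        )

    def forward(self, L_s, E_t, w_t):
        return self.net(torch.cat([L_s, E_t, w_t], dim=-1))


def supervised_latent_loss(mapper, stylegan, phi, L_s, E_t, w_t, L_t_gt, I_t_gt,
                           w_latent=1.0, w_perceptual=0.1):
    """Latent-space loss plus a perceptual loss on the synthesised image.
    The loss weights here are illustrative guesses."""
    L_t_pred = mapper(L_s, E_t, w_t)
    latent_loss = torch.mean((L_t_pred - L_t_gt) ** 2)
    I_t_pred = stylegan(L_t_pred)          # frozen, pretrained StyleGAN generator
    perceptual_loss = torch.mean((phi(I_t_pred) - phi(I_t_gt)) ** 2)
    return w_latent * latent_loss + w_perceptual * perceptual_loss
```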
The method takes as input an in-the-wild portrait image, target
illumination and the target camera pose. The output is a portrait
image of the same identity, synthesised with the target camera
and lit by the target illumination. Given a light-stage dataset of
multiple independent illumination sources and viewpoints, a naive
approach would be to learn the transformations directly in image
space. Instead, we propose to learn the mapping in the latent space
of StyleGAN. We show that learning using this
latent representation helps in generalisation to in-the-wild images with high photorealism. StyleGAN2 is used in the implementation
and is referred to as StyleGAN throughout for simplicity.
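Putting the pieces together, a minimal sketch of the inference flow might look as follows, assuming pretrained callables for the pSp encoder, the latent mapper from the sketch above, and the StyleGAN generator; the function and argument names are placeholders, not the authors' released code.

```python
import torch

@torch.no_grad()
def edit_portrait(image, E_t, w_t, psp_encoder, mapper, stylegan):
    """Sketch of the editing pipeline: encode -> map latent -> decode."""
    L_s = psp_encoder(image)      # project the in-the-wild image into StyleGAN's latent space
    L_t = mapper(L_s, E_t, w_t)   # apply the target illumination and camera pose in latent space
    return stylegan(L_t)          # synthesise the edited portrait
```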
Data Preparation
We evaluate our approach on portrait images
captured in the wild. All data
in our work (including the training data) are cropped and preprocessed as described in Karras et al. The images are resized to
a resolution of 1024x1024. Since we need the ground truth images
for quantitative evaluations, we use the test portion of our light-stage dataset, composed of images of 41 identities unseen during
training. We create two test sets: Set1 has the input and ground
truth pairs captured from the same viewpoint, while Set2 includes
pairs captured from different viewpoints. The HDR environment
maps, randomly sampled from the Laval Outdoor and Laval Indoor
datasets, are used
to synthesise the pairs with natural illumination conditions. Viewpoints are randomly sampled from the 8 cameras of the light-stage
setup. The input and ground truth images are computed using the
same environment map in Set2 for evaluating the viewpoint editing.
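The evaluation pairs with natural illumination can be synthesised by image-based relighting, i.e. a weighted sum of the one-light-at-a-time (OLAT) captures, where each weight comes from resampling the HDR environment map at the light-stage light directions. The sketch below illustrates this idea; the function name and the assumption that per-light RGB weights have already been precomputed are ours, not the paper's.

```python
import numpy as np

def relight_from_olat(olat_images, light_weights):
    """Image-based relighting: a weighted sum of one-light-at-a-time captures.

    olat_images   : array of shape (150, H, W, 3), linear-radiance OLAT images
                    of one subject from one viewpoint
    light_weights : array of shape (150, 3), RGB weights obtained by resampling
                    the HDR environment map at the light-stage light directions
    """
    olat = np.asarray(olat_images, dtype=np.float64)
    weights = np.asarray(light_weights, dtype=np.float64)
    # Sum over the 150 lights; each OLAT image is scaled by its per-channel weight.
    return np.einsum('lhwc,lc->hwc', olat, weights)
```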
High-Fidelity Appearance Editing
Fig. 5 shows simultaneous viewpoint and illumination
editing results of our method for various subjects. We also show the
StyleGAN projection of the input images estimated by Richardson et al. Our approach produces high-quality photorealistic
results and synthesises the full portrait, including hair, eyes, mouth,
torso and the background, while preserving the identity, expression and other properties (such as facial hair). Additionally, the
results show that our method can preserve a variety of reflectance
properties, resulting in effects such as specularities and subsurface
scattering. Please note the view-dependent effects such as specularities in the results (e.g., on the nose and forehead). Our method can synthesise results even under high-frequency lighting conditions that produce shadows, even though the StyleGAN network is trained on a dataset
of natural images. In Fig. 5 we show more detailed editing results. As can be noted, the relighting preserves the input pose and
identity. Our method can also change the viewpoint under a fixed
environment map (third row for each subject).
Comparisons to Related Methods
We compare our method with several state-of-the-art portrait editing
approaches. We evaluate qualitatively on in-the-wild data, as well
as quantitatively on the test set of the light-stage data. We compare
with the following approaches:
• The relighting approach of Sun et al., which is a data-driven technique trained on a light-stage dataset. It can only
edit the scene illumination.
• The relighting approach of Zhou et al., which is trained
on synthetic data. It can also only edit the scene illumination.
• PIE is a method that computes a StyleGAN embedding used to edit the image. It can edit the head pose and scene illumination sequentially (unlike ours, which
can perform the edits simultaneously). It is trained without
supervised image pairs.
• StyleFlow, like PIE, can edit images by
projecting them onto the StyleGAN latent space. It is also
trained without supervised image pairs. Please note that this
work is concurrent with ours (not counted as prior art). However,
we provide comparisons for completeness.
Conclusion
We presented PhotoApp, a method for editing the scene illumination and camera pose in head portraits. Our method exploits the
advantages of both supervised learning and generative adversarial
modeling. By designing a supervised learning problem in the latent
space of StyleGAN, we achieve high-quality editing results which
generalise to in-the-wild images with significantly more diversity
than the training data. Through extensive evaluations, we demonstrated that our method outperforms all related techniques, both in
terms of realism and editing accuracy. We further demonstrated that
our method can learn from very limited supervised data, achieving
high-quality results when trained with as few as 3 identities captured in a single expression. While several limitations still exist, we
hope that our contributions inspire future work on using generative
representations for synthesis applications.