🎠 MagicPony: Learning Articulated 3D Animals in the Wild

Shangzhe Wu*, Ruining Li*, Tomas Jakab*, Christian Rupprecht, Andrea Vedaldi
Visual Geometry Group, University of Oxford
(* equal contribution)
CVPR 2023

Our method learns single-image 3D reconstruction models of articulated animal categories from online photo collections alone, without any 3D geometric supervision or template shapes.

Abstract

We consider the problem of learning a function that can estimate the 3D shape, articulation, viewpoint, texture, and lighting of an articulated animal like a horse, given a single test image. We present a new method, dubbed MagicPony, that learns this function purely from in-the-wild single-view images of the object category, with minimal assumptions about the topology of deformation. At its core is an implicit-explicit representation of articulated shape and appearance, combining the strengths of neural fields and meshes. In order to help the model understand an object's shape and pose, we distil the knowledge captured by an off-the-shelf self-supervised vision transformer and fuse it into the 3D model. To overcome common local optima in viewpoint estimation, we further introduce a new viewpoint sampling scheme that comes at no added training cost. Compared to prior works, we show significant quantitative and qualitative improvements on this challenging task. The model also demonstrates excellent generalisation in reconstructing abstract drawings and artefacts, despite the fact that it is only trained on real images.
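To make the viewpoint sampling scheme concrete, below is a minimal, self-contained sketch of the underlying idea in PyTorch. It assumes the viewpoint network outputs a score per rotation hypothesis that is trained to predict that hypothesis's reconstruction loss; the names, numbers, and exact objective here are illustrative simplifications, not the authors' implementation.

import torch

def sample_viewpoint(scores, temperature=1.0):
    # Pick one of K viewpoint hypotheses, favouring those whose
    # predicted reconstruction loss (the score) is low.
    probs = torch.softmax(-scores / temperature, dim=0)
    return torch.multinomial(probs, num_samples=1).item()

# Toy example: four hypotheses (e.g. one per azimuth quadrant), each with a
# learned score estimating the reconstruction loss under that viewpoint.
scores = torch.tensor([0.9, 0.3, 1.2, 0.7])

k = sample_viewpoint(scores)

# Only the sampled hypothesis is rendered in a given iteration, so the scheme
# adds no rendering cost; its score is then regressed towards the observed loss.
observed_loss = torch.tensor(0.25)  # stand-in for the real rendering loss
score_loss = (scores[k] - observed_loss) ** 2

Sampling a hypothesis, rather than always taking the best-scoring one, keeps all viewpoints explored early in training, which helps escape the local optima in viewpoint estimation mentioned above.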

Video

Single-Image 3D Reconstruction

Real Horse Images and Realistic Paintings

After training, given a single image of a new instance, the model reconstructs the instance's articulated 3D shape and appearance, which can then be animated and re-rendered from arbitrary viewpoints.


Generalisation to Abstract Horse Drawings and Artefacts

The model also generalises to abstract drawings and artefacts, despite being trained only on real images.


Other Animal Categories: Giraffes, Zebras and Cows

After finetuning, our model also generalises to other animal categories with markedly different underlying shapes.


Frame-by-Frame Reconstruction on Videos

We run the model on videos frame by frame and obtain temporally consistent reconstructions.
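Concretely, the per-frame procedure amounts to the loop sketched below, where model stands in for the trained single-image reconstruction network and no temporal information is shared across frames. This is an assumed illustration, not the released inference script.

import cv2  # assumes opencv-python is available

def reconstruct_video(video_path, model):
    # Apply the single-image reconstruction model to each frame independently.
    capture = cv2.VideoCapture(video_path)
    results = []
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # OpenCV loads BGR
        results.append(model(frame))  # `model` is a hypothetical stand-in
    capture.release()
    return results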


BibTeX

@InProceedings{wu2023magicpony,
  author    = {Shangzhe Wu and Ruining Li and Tomas Jakab and Christian Rupprecht and Andrea Vedaldi},
  title     = {{MagicPony}: Learning Articulated 3D Animals in the Wild},
  booktitle = {CVPR},
  year      = {2023}
}

Acknowledgements

We would like to thank the authors of nvdiffrec for open-sourcing the code for DMTet and rendering. We are also grateful to Tengda Han, Shu Ishida, Dylan Campbell, Eldar Insafutdinov, Luke Melas-Kyriazi, Ragav Sachdeva and Sagar Vaze for insightful discussions, and Guanqi Zhan and Jaesung Huh for proofreading.

Shangzhe Wu is supported by Meta Research. Tomas Jakab is supported by ERC-CoG UNION 101001212. Christian Rupprecht is supported by VisualAI EP/T028572/1 and ERC-CoG UNION 101001212. Andrea Vedaldi is supported by ERC-CoG UNION 101001212.