IAPR Invited Talks
July 26: Building World Foundation Models for Unlocking Physical AI
IAPR Invited Speaker
Ming-Yu Liu
NVIDIA
Abstract
This talk introduces world foundation models and their pivotal role in enabling Physical AI. I will begin by tracing the evolution from Generative AI to Physical AI, highlighting the unique challenges that arise in this transition. Two categories of world foundation models will be presented: synthesis models, which predict future world states, and analysis models, which interpret current states to guide action planning. I will showcase how these models are applied in Physical AI systems, review the current state of the art, and conclude with a discussion on the limitations of existing approaches and promising directions for future research.
July 27: Learning World Simulators from Data
IAPR Invited Speaker
Katerina Fragkiadaki
Carnegie Mellon University
Biography
Katerina Fragkiadaki is the JPMorgan Chase Associate Professor in the Machine Learning Department at Carnegie Mellon University. She received her undergraduate degree in Electrical and Computer Engineering from the National Technical University of Athens and her Ph.D. from the University of Pennsylvania. She subsequently held postdoctoral positions at UC Berkeley and Google Research. Her research focuses on enabling few-shot and continual learning for perception, action, and language grounding. Her work has been recognized with a Best Ph.D. Thesis Award, an NSF CAREER Award, AFOSR and DARPA Young Investigator Awards, and faculty research awards from Google, Toyota Research Institute, Amazon, NVIDIA, UPMC, and Sony. She served as a Program Chair for ICLR 2024.
Abstract
Modern foundation models have achieved superhuman performance on many logic and mathematical reasoning tasks by learning to think step by step. However, their ability to understand videos, and consequently to control embodied agents, lags behind. They often misrecognize simple activities and hallucinate when generating videos. This raises a fundamental question: what is the equivalent of thinking step by step for visual recognition and prediction?
In this talk, we argue that step-by-step visual reasoning has much to do with inverting a physics simulator, that is, mapping raw video pixels back to a structured, 3D-like neural representation of the world. This involves inferring 3D representations of objects and their parts, their 3D motion and appearance trajectories, camera movements, 3D scene structure, and physical properties. We will discuss methods for automatically extracting such 3D neural representations from images and videos using generative model priors and end-to-end feed-forward models. We will present methods that inject this knowledge of camera motion and 3D scene structure into modern VLMs, and show that it improves their ability to ground language and control robot manipulators.
How can we scale up annotations for such simulator inversion? We will discuss methods that use generative models of language and vision to automate the development of 3D simulations in physics engines, as well as our efforts to develop faster and more general physics engines. Integrating physics engines with generative models aims to automate the replication of real physical environments within the simulator, enabling more accurate and scalable world simulation for sim-to-real learning of 3D perception and robotics. We believe that such real-to-sim and sim-to-real learning, serving as a universal data engine for robotics, is central to democratizing and advancing the state of the art in robot learning.
July 28: Making Sense of the Real World via 3D Computer Vision
IAPR Invited Speaker
Yasuyuki Matsushita
MSRA Tokyo
Biography
Yasuyuki Matsushita has been a Senior Director at Microsoft Research Asia - Tokyo since 2024. He received his B.S., M.S., and Ph.D. degrees in EECS from the University of Tokyo in 1998, 2000, and 2003, respectively. From April 2003 to March 2015, he was with the Visual Computing Group at Microsoft Research Asia, and from April 2015 to September 2024, he was a Professor at Osaka University. His research areas include computer vision, machine learning, and optimization. He is an Editor-in-Chief of the International Journal of Computer Vision (IJCV) and has served on the editorial boards of IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), The Visual Computer, IPSJ Transactions on Computer Vision and Applications (CVA), and the Encyclopedia of Computer Vision. He served as a Program Co-Chair of PSIVT 2010, 3DIMPVT 2011, ACCV 2012, and ICCV 2017, and as a General Co-Chair of ACCV 2014 and ICCV 2021. He won the Osaka Science Prize in 2022. He is a Fellow of the IEEE and a member of the IPSJ.
Abstract
3D Computer Vision is crucial for understanding and interpreting the spatial aspects of real-world scenes. It is particularly important for the emerging field of Embodied AI, in which machines need to interact with and understand their surroundings. Sensing plays a central role in this context, because it produces rich, multidimensional data that enhances AI's understanding of the world and elevates its perceptual capabilities. This talk discusses two approaches to the problem of real-world sensing, namely learning-based and model-based approaches, and advocates for their synergy. In particular, we discuss the case of photometric 3D reconstruction, where we have access to reliable physics-based models yet data-driven methods remain beneficial.