Tutorials (July 27)
13:15–14:45 Tutorial 1
Markov Decision Processes and Imitation Learning for Vision-based Human Activity Understanding
Abstract
Human activity can in many ways be modeled as an optimal control problem, in which a person makes decisions based on an underlying cost/reward function and selects actions by reasoning about how they might affect the future. However, such models have been under-utilized in computer vision. In this tutorial, we will cover some of the basic theory of reinforcement learning and imitation learning, and show how ideas from robotics and control theory can be used to build robust models for human activity understanding in computer vision.
Tutor

		Dr. Kris M. Kitani
		Associate Research Professor
		Carnegie Mellon University, USA
Biography
Dr. Kris M. Kitani is an associate research professor and director of the MS in Computer Vision program of the Robotics Institute at Carnegie Mellon University. He received his BS at the University of Southern California and his MS and PhD at the University of Tokyo. His research projects span the areas of computer vision, machine learning and human-computer interaction. In particular, his research interests lie at the intersection of first-person vision, human activity modeling and inverse reinforcement learning. His work has been awarded the Marr Prize honorable mention at ICCV 2017, best paper honorable mention at CHI 2017 and CHI 2020, best paper at W4A 2017 and 2019, best application paper at ACCV 2014 and best paper honorable mention at ECCV 2012.
15:00–16:30 Tutorial 2
Cross-modal Retrieval
Abstract
Cross-modal retrieval (CMR) aims to retrieve relevant samples across different modalities, and it has received increasing attention in recent years because of the need to process rapidly growing amounts of multi-modal data. A major challenge underlying this problem is the discrepancy between the feature representations extracted from different modalities. With advanced deep learning techniques, the modality gap can be significantly reduced, and recent works have shown quite promising retrieval performance. This tutorial will introduce recent advances in cross-modal retrieval. Specifically, several classical approaches, including attention-based cross-modal learning and cross-modal generative adversarial networks, as well as recent cross-modal pre-training models, will be discussed.
Tutor

		Dr. Jingjing Chen
		Pre-tenured Associate Professor
		Fudan University, China
Biography
Dr. Jingjing Chen is a pre-tenured associate professor at the School of Computer Science, Fudan University. Before joining Fudan University, she was a postdoctoral research fellow at the School of Computing at the National University of Singapore. She received her Ph.D. degree in Computer Science from the City University of Hong Kong in 2018. Her research interests include multimedia information retrieval, image/video content analysis, and adversarial learning. She won the best student paper award at ACM Multimedia 2016 and Multimedia Modeling 2017.
16:45–18:15 Tutorial 3
Generative Image Models
Abstract
Generative models have improved rapidly in recent years and can now generate high-quality images that are nearly indistinguishable from paintings or photographs. The results provide new solutions for computer vision tasks such as image inpainting, deblurring, and super-resolution. The methods also enable novel ways to edit images, which is starting to change the workflow of creative industries. In this tutorial, I will give an overview of this rapidly evolving research area, including network architectures such as variations of GANs and transformers. We will cover recent highlights, explore hands-on examples, and discuss open research challenges.
Tutor

		Dr. Björn Stenger
		Group Leader
		Rakuten Institute of Technology, Japan
Biography
Dr. Björn Stenger leads the vision program at the Rakuten Institute of Technology. He received the diploma (~M.Sc.) in computer science from the University of Bonn, Germany, in 2000 and his Ph.D. from the University of Cambridge, UK, in 2004. His thesis on hand tracking won the BMVA Sullivan Thesis Prize. From 2004 to 2006 he worked as a Research Fellow at the Toshiba R&D Center. In 2006 he joined the Computer Vision Group of Toshiba Research Europe, where he worked on human-computer interfaces, 3D capture, and object recognition. He joined the Rakuten Institute of Technology in 2015. His current research interests include image and video understanding, image enhancement, and generative AI.