Wenhao Chai

Wenhao Chai is an incoming Ph.D. student in Computer Science at Princeton University, working with Prof. Zhuang Liu. He received his master's degree from the University of Washington in 2025 and his bachelor's degree from Zhejiang University in 2023. He was previously a research intern at Stanford University in the summer of 2024, working with Prof. Christopher D. Manning, and a visiting scholar at the University of Illinois Urbana-Champaign in the spring and summer of 2022. He has interned at Pika Labs and Microsoft Research Asia. His research spans a wide range of topics in computer vision and machine learning, and his previous work primarily covers video understanding and generative models. He leads the development of MovieChat, the first Large Multi-Modal Model for hour-long video understanding. He has published research papers in top-tier conferences and journals such as ICLR, CVPR, ICCV, ECCV, and AAAI, and has co-organized workshops and challenges on video understanding at CVPR 2024 and 2025.

Publications

SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory

Cheng-Yen Yang, Hsiang-Wei Huang, Wenhao Chai, Zhongyu Jiang, Jenq-Neng Hwang

LLaVA-Ultra: Large Chinese Language and Vision Assistant for Ultrasound

Ego3DT: Tracking Every 3D Object in Ego-centric Videos

PAD: Personalized Alignment of LLMs at Decoding-Time

AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark

AGLLDiff: Guiding Diffusion Models Towards Unsupervised Training-free Real-world Low-light Image Enhancement

RT-Pose: A 4D Radar Tensor-based 3D Human Pose Estimation and Localization Benchmark

NTIRE 2024 Image Shadow Removal Challenge Report

STEVE Series: Step-by-Step Construction of Agent Systems in Minecraft

Learning Diffusion Texture Priors for Image Restoration

CityCraft: A Real Crafter for 3D City Generation

Boosting Online 3D Multi-Object Tracking through Camera-Radar Cross Check

MovieChat+: Question-aware Sparse Memory for Long Video Question Answering

MonoTAKD: Teaching Assistant Knowledge Distillation for Monocular 3D Object Detection

Do We Really Need a Complex Agent System? Distill Embodied Agent into a Single Model

Random bridge generator as a platform for developing computer vision-based structural inspection algorithms

VersaT2I: Improving Text-to-Image Models with Versatile Reward

Exploring Learning-based Motion Models in Multi-Object Tracking

Hierarchical Auto-Organizing System for Open-Ended Multi-Agent Navigation

Unsupervised Domain Adaptation Approach for Vision-Based Semantic Understanding of Bridge Inspection Scenes without Manual Annotations

User-Aware Prefix-Tuning is a Good Learner for Personalized Image Captioning

CityGen: Infinite and Controllable 3D City Layout Generation

See and Think: Embodied Agent in Virtual Environment

UniHPE: Towards Unified Human Pose Estimation via Contrastive Learning

Efficient Domain Adaptation via Generative Prior for 3D Infant Pose Estimation

Sequential Affinity Learning for Video Restoration

Devil in the Number: Towards Robust Multi-modality Data Filter

Chasing Consistency in Text-to-3D Generation from a Single Image

UniAP: Towards Universal Animal Perception in Vision via Few-shot Learning

StableVideo: Text-driven Consistency-aware Diffusion Video Editing

PoSynDA: Multi-Hypothesis Pose Synthesis Domain Adaptation for Robust 3D Human Pose Estimation

MovieChat: From Dense Token to Sparse Memory for Long Video Understanding

A Survey of Deep Learning in Sports Applications: Perception, Comprehension, and Decision

Back to Optimization: Diffusion-based Zero-Shot 3D Human Pose Estimation

MPM: A Unified 2D-3D Human Pose Representation via Masked Pose Modeling

Image Reference-guided Fashion Design with Structure-aware Transfer by Diffusion Models

Five A+ Network: You Only Need 9K Parameters for Underwater Image Enhancement

Global Adaptation meets Local Generalization: Unsupervised Domain Adaptation for 3D Human Pose Estimation

Blind Inpainting with Object-Aware Discrimination for Artificial Marker Removal

Deep Learning Methods for Small Molecule Drug Discovery: A Survey

DiffFashion: Reference-Based Fashion Design With Structure-Aware Transfer by Diffusion Models

Automatic Spinal Ultrasound Image Segmentation and Deployment for Real-time Spine Volumetric Reconstruction

Weakly Supervised Two-Stage Training Scheme for Deep Video Fight Detection Model

Deep Vision Multimodal Learning: Methodology, Benchmark, and Trend

Diffusion-based Zero-Shot 3D Human Pose Estimation

Supplement Material: Global Adaptation meets Local Generalization: Unsupervised Domain Adaptation for 3D Human Pose Estimation

PAD: Personalized Alignment at Decoding-Time