Ting Yao

About Me

Ting Yao (姚霆) Google Scholar tingyao.ustc@gmail.com

Ting Yao is currently the Co-Founder and CTO of HiDream.ai, working to make it the leading generative artificial intelligence company in China. Previously, he was a Principal Researcher and the Managing Director of the Computer Vision and Multimedia Lab at JD.com in Beijing, China, and a Researcher at Microsoft Research Asia in Beijing, China. Dr. Yao has co-authored more than 150 peer-reviewed papers in top-tier conferences and journals. His seminal work on the Pseudo-3D network has become one of the standard 3D convolutional neural networks for spatiotemporal data analysis, and his video-to-text dataset (MSR-VTT) has been used by more than 500 institutes worldwide. His research has led to several commercial products at Microsoft, JD.com, and HiDream.ai, with millions of daily active users.

Dr. Yao currently serves as an associate editor of the IEEE Transactions on Pattern Analysis and Machine Intelligence, IEEE Transactions on Image Processing, IEEE Transactions on Multimedia, ACM Transactions on Multimedia Computing, Communications, and Applications, Pattern Recognition Letters, and Multimedia Systems, and has frequently served as an area chair and keynote/tutorial speaker at numerous conferences. He has organized more than 15 workshops and challenges at flagship conferences. His work has received many awards, including the 2015 ACM SIGMM Outstanding Ph.D. Thesis Award, the 2019 ACM SIGMM Rising Star Award, the 2019 IEEE Computer Society TCMC Rising Star Award, the 2022 IEEE ICME Multimedia Star Innovator Award, and the 2022 Chinese Intelligent Computing Technology Innovators, as well as 10 championships in international multimedia analytics competitions.

Dr. Yao received a B.Sc. degree in theoretical and applied mechanics, a B.Eng. double degree in electronic information engineering, and an M.Eng. degree in signal and information processing, all from the University of Science and Technology of China, Hefei, China. He completed a Ph.D. in computer science (2014) at the City University of Hong Kong, advised by Prof. Chong-Wah Ngo.

Distinctions

Awards

HiDream Multimodal Foundation Models, https://huggingface.co/collections/HiDream-ai/hidream-i1, 1.8M+ Hugging Face downloads, 100+ derivative models on Civitai, and 170+ tutorials on YouTube, 2025.
Best Demo Award, "Talk, Imagine, Evolve: A Unified Multimodal Agent for Seamless Visual Generation and Editing," ACM International Conference on Multimedia (ACM MM), 2025.
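AI Frontier Innovation Award (Zu Chongzhi Award), Annual Outstanding Achievement Award, "Research, Development, and Demonstration Applications of Generative Visual Multimodal Large Models," New Generation Artificial Intelligence Industry Technology Innovation Strategic Alliance (AITISA), 2025.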
Top 3% Paper Award, "Visual-Aware Text-to-Speech," IEEE International Conference on Acoustics, Speech, and Signal Processing (IEEE ICASSP), 2023.
Chinese Intelligent Computing Technology Innovators, 2022.
First Grade Scientific and Technology Prize of the China Society of Image and Graphics (CSIG), "Key Technologies and Applications of Ultrafine Image Recognition," 2022.
IEEE ICME Multimedia Star Innovator Award, "for outstanding innovative contribution in the area of Multimedia Intelligence," 2022.
Nicolas D. Georganas Best Paper Award, "Smart Director: An Event-Driven Directing System for Live Broadcasting," ACM Transactions on Multimedia Computing, Communications, and Applications, 2022.
IEEE Computer Society TCMC Rising Star Award, "for contributions in video content recognition and description generation," 2019.
ACM SIGMM Rising Star Award, "for contributions in activity recognition and video captioning," 2019.
ACM SIGMM Outstanding Ph.D. Thesis Award, "Multimedia Search by Self, External, and Crowdsourcing Knowledge," 2015.
Best Open Source Award, "X-modaler: A Versatile and High-performance Codebase for Cross-modal Analytics," ACM International Conference on Multimedia (ACM MM), 2021.
Outstanding Associate Editor, IEEE Transactions on Multimedia, 2021.
Second Place Best Demo Award, "Animating Your Life: Real-Time Video-to-Animation Translation," ACM International Conference on Multimedia (ACM MM), 2019.
Best Paper Finalist, "Action Recognition by Learning Deep Multi-Granular Spatio-Temporal Video Representation," ACM International Conference on Multimedia Retrieval (ICMR), 2016.
Best Paper Award, "Click-boosting Random Walk for Image Search Reranking," International Conference on Internet Multimedia Computing and Service (ICIMCS), 2013.

Top-performing Systems in International Competitions

Rank 1, No Interaction Track, the first workshop on Generalizable Policy Learning in the Physical World with International Conference on Learning Representations (ICLR), 2022.
Rank 1, No Restriction Track, the first workshop on Generalizable Policy Learning in the Physical World with International Conference on Learning Representations (ICLR), 2022.
Rank 1, Open-set Image Classification task, Open World Vision Challenge, with IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
Rank 2, Supervised Track of Kinetics-700 Action Recognition, International Challenge on Activity Recognition (ActivityNet), with IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
Rank 1, Multi-Source Domain Adaptation Track, Visual Domain Adaptation Challenge, with IEEE International Conference on Computer Vision (ICCV), 2019.
Rank 2, Semi-Supervised Domain Adaptation Track, Visual Domain Adaptation Challenge, with IEEE International Conference on Computer Vision (ICCV), 2019.
Rank 1, Trimmed Activity Recognition (Kinetics), International Challenge on Activity Recognition (ActivityNet), with IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
Rank 3, Activities in Extended Video (NIST ActEV), International Challenge on Activity Recognition (ActivityNet), with IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
Rank 1, Open-set Classification Track, Visual Domain Adaptation Challenge, with European Conference on Computer Vision (ECCV), 2018.
Rank 1, Detection Track, Visual Domain Adaptation Challenge, with European Conference on Computer Vision (ECCV), 2018.
Rank 2, Dense-Captioning Events in Videos, International Challenge on Activity Recognition (ActivityNet), with IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
Rank 2, Temporal Action Localization, International Challenge on Activity Recognition (ActivityNet), with IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
Rank 2, Trimmed Activity Recognition (Kinetics), International Challenge on Activity Recognition (ActivityNet), with IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
Rank 1, Segmentation Track, Visual Domain Adaptation Challenge, with IEEE International Conference on Computer Vision (ICCV), 2017.
Rank 1, Dense-Captioning Events in Videos, International Challenge on Activity Recognition (ActivityNet), with IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
Rank 2, Temporal Action Proposals, International Challenge on Activity Recognition (ActivityNet), with IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
Rank 1, COCO Image Captioning Challenge, 2017.
Rank 3, Untrimmed Video Classification, International Challenge on Activity Recognition (ActivityNet), with IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
Rank 2, Action Classification, THUMOS Challenge, with IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

Professional Activities

Service in Professional Organizations

Elected Member, IEEE Signal Processing Society: Image, Video, and Multidimensional Signal Processing Technical Committee (IVMSP-TC), 2023 – present
Member, IEEE Computer Society: Technical Committee on Multimedia Computing (TCMC), 2019 – present
Member, IEEE Computer Society: Technical Committee on Pattern Analysis and Machine Intelligence (PAMI-TC), 2020 – present

Journal Editorship

Associate Editor, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025/04-present
Associate Editor, IEEE Transactions on Image Processing, 2024/09-present
Associate Editor, IEEE Transactions on Multimedia, 2019/11-2023/11
Associate Editor, ACM Transactions on Multimedia Computing, Communications, and Applications, 2024/11-present
Associate Editor, Pattern Recognition Letters, 2022/02-present
Associate Editor, Multimedia Systems, 2018/10-present
Guest Editor, ACM Transactions on Multimedia Computing, Communications, and Applications, Special Issue on "Deep Learning for Intelligent Multimedia Analytics," 2019.

Conference/Workshop/Challenge Organizer

Grand Challenge Co-Chair, IEEE International Conference on Acoustics, Speech, and Signal Processing (IEEE ICASSP), 2027.
Industry Co-Chair, IEEE Image, Video, and Multidimensional Signal Processing Workshop, 2026.
Program Co-Chair, Identity-Preserving Video Generation Challenge, with ACM Multimedia (ACM MM), 2026 & 2025.
Technical Demo and Video Program Chair, ACM International Conference on Multimedia (ACM MM), 2023.
Program Co-Chair, Conversational Head Generation Challenge, with ACM International Conference on Multimedia (ACM MM), 2023 & 2022.
Program Co-Chair, Pre-training for Video Understanding Challenge, with ACM International Conference on Multimedia (ACM MM), 2022 & 2021.
Program Co-Chair, the first International Workshop on Theories, Applications, and Cross Modality for Self-Supervised Learning Models, with International Conference on Pattern Recognition (ICPR), 2022.
Program Co-Chair, the first International Workshop on Deep Learning for Human Centric Activity Understanding, with International Conference on Pattern Recognition (ICPR), 2020.
Program Co-Chair, Pre-training for Video Captioning Challenge, with ACM International Conference on Multimedia (ACM MM), 2020.
Program Co-Chair, the first Workshop and Challenge on Conceptual Captions, with IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
Program Co-Chair, MSR Video to Language Challenge, with ACM International Conference on Multimedia (ACM MM), 2017 & 2016.
Program Co-Chair, AI Technology for Visual Fashion Computing Workshop, with IEEE International Conference on Multimedia & Expo (ICME), 2019.
Program Co-Chair, Deep Learning for Intelligent Multimedia Analytics Workshop, with IEEE International Conference on Multimedia & Expo (ICME), 2017.
Program Co-Chair, Multimedia Computing for Intelligent Life Workshop, with International Conference on Multimedia Modeling (MMM), 2017.

Keynote/Tutorial Speaker

Keynote Speaker, "Multimodal Content Creation, Consumption and Distribution," Industry Expert Talks, with ACM International Conference on Multimedia (ACM MM), 2025.
Keynote Speaker, "Multimodal Foundation Models and Agents: Empowering the Future of Content Creation," AI Large Model Technology Summit, Chinese Association for Artificial Intelligence (CAAI), 2025.
Keynote Speaker, "Multimodal Content Generation: Unleashing the Infinite Possibilities of Future Creativity," Forum on AI-Generated Art, with Chinese Congress on Image and Graphics (CCIG), 2025.
Keynote Speaker, "Multimodal Content Generation: Unleashing the Infinite Possibilities of Future Creativity," Forum on Multimodal Foundation Model, with Conference on Application of Image and Graphics Technology (IGTA), 2024.
Keynote Speaker, "Cross-Modal Vision-and-Language Intelligence: Methodologies and Applications," 3D Multimedia Analytics, Search and Generation Workshop, with IEEE International Conference on Multimedia & Expo (ICME), 2023.
Keynote Speaker, "Key Technologies and Applications of Ultrafine Image Recognition," CSIG Award Forum, with Chinese Congress on Image and Graphics (CCIG), 2023.
Keynote Speaker, "Deep Spatiotemporal Visual Representation Learning and Applications," Forum on Video Action Detection and Recognition, with Chinese Conference on Pattern Recognition and Computer Vision (PRCV), 2022.
Keynote Speaker, "From Visual Representation Learning to Visual-Language Intelligence," Forum on Intelligent Computing Techniques for Vision and Language, with China Multimedia (ChinaMM), 2022.
Keynote Speaker, "Trustworthy Visual Understanding: Generic Representation Learning and Explainable Interpretation," the first International Workshop on Trustworthy AI for Multimedia Computing, with ACM International Conference on Multimedia (ACM MM), 2021.
IEEE TCMC Award Talks Speaker, "Vision to Language: from Independency, Interaction, to Symbiosis," IEEE International Conference on Multimedia Information Processing and Retrieval (MIPR), 2021.
Keynote Speaker, "Vision to Language: from Independency, Interaction, to Symbiosis," the Second Workshop on Multimodal Natural Language Processing, with the CCF International Conference on Natural Language Processing and Chinese Computing (NLPCC), 2021.
Keynote Speaker, "Vision to Language: from Independency, Interaction, to Symbiosis," the International Workshop on Multi-Modal Deep Learning: Challenges and Applications, with International Conference on Pattern Recognition (ICPR), 2020.
ACM SIGMM Award Talks Speaker, "Deep Video Understanding: Action Recognition and Language Generation," ACM International Conference on Multimedia (ACM MM), 2019.
Tutorial Speaker, "Vision and Text: Search, Generation and Translation," IEEE International Conference on Image Processing (ICIP), 2019.
Tutorial Speaker, "Human Behavior Understanding: From Human-Oriented Analysis to Action Recognition," IEEE International Conference on Multimedia & Expo (ICME), 2019.
Tutorial Speaker, "Human Behavior Understanding: From Action Recognition to Complex Event Detection," ACM International Conference on Multimedia (ACM MM), 2018.
Keynote Speaker, "Describing Multimedia by Localization and Generation," 1st Person in Context (PIC) Workshop and Challenge, with European Conference on Computer Vision (ECCV), 2018.
Keynote Speaker, "Describing Multimedia by Localization and Generation," 1st Vision and Learning Seminar (VALSE) Workshop on Vision and Language, 2018.
ACM SIGMM Award Talks Speaker, "Bridging Vision and Text for Multimedia Search," ACM International Conference on Multimedia (ACM MM), 2015.

Area Chair / Senior PC

IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024, 2025, 2026.
IEEE International Conference on Computer Vision (ICCV), 2025.
European Conference on Computer Vision (ECCV), 2026.
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, 2024, 2025, 2026.
IEEE International Conference on Image Processing (ICIP), 2019, 2020, 2021, 2022, 2023, 2024, 2025, 2026.
IEEE International Conference on Multimedia & Expo (ICME), 2018, 2019.
International Conference on Pattern Recognition (ICPR), 2020, 2022.
ACM International Conference on Multimedia (ACM MM), 2018, 2019, 2023.
International Joint Conference on Artificial Intelligence (IJCAI), 2019, 2020, 2021.
AAAI Conference on Artificial Intelligence (AAAI), 2020, 2021, 2022, 2026.

Selected Publications (Full List)

HiDream-I1: A high-efficient image generative foundation model with sparse diffusion transformer

HiDream Foundation Model Group
arXiv preprint arXiv:2505.22705, 2025

Denoising token prediction in masked autoregressive models

T. Yao, Y. Li, Y. Pan, Z. Qiu, T. Mei
ICCV, 2025

Talk, Imagine, Evolve: A unified multimodal agent for seamless visual generation and editing

Z. Qiu, Z. Gong, Y. Pan, T. Yao, T. Mei
ACM Multimedia, 2025 (Best Demo Award)

VideoStudio: Generating consistent-content and multi-scene videos

F. Long, Z. Qiu, T. Yao, T. Mei
ECCV, 2024

HIRI-ViT: Scaling vision transformer with high resolution inputs

T. Yao, Y. Li, Y. Pan, T. Mei
IEEE TPAMI, 2024

Dual vision transformer

T. Yao, Y. Li, Y. Pan, Y. Wang, X.P. Zhang, T. Mei
IEEE TPAMI, 2023

Bi-calibration networks for weakly-supervised video representation learning

F. Long, T. Yao, Z. Qiu, X. Tian, J. Luo, T. Mei
International Journal of Computer Vision, 2023

Control3d: Towards controllable text-to-3d generation

Y. Chen, Y. Pan, Y. Li, T. Yao, T. Mei
ACM Multimedia, 2023

Wave-vit: Unifying wavelet and transformers for visual representation learning

T. Yao, Y. Pan, Y. Li, C.-W. Ngo, T. Mei
ECCV, 2022

Contextual transformer networks for visual recognition

Y. Li, T. Yao, Y. Pan, T. Mei
IEEE TPAMI, 2022

Seco: Exploring sequence supervision for unsupervised representation learning

T. Yao, Y. Zhang, Z. Qiu, Y. Pan, T. Mei
AAAI, 2021

Smart director: An event-driven directing system for live broadcasting

Y. Pan, Y. Chen, Q. Bao, N. Zhang, T. Yao, J. Liu, T. Mei
ACM TOMM, 2021 (Nicolas D. Georganas Best Paper 2022)

X-modaler: A versatile and high-performance codebase for cross-modal analytics

Y. Li, Y. Pan, J. Chen, T. Yao, T. Mei
ACM Multimedia, 2021 (Best Open Source Award)

X-linear attention networks for image captioning

Y. Pan, T. Yao, Y. Li, T. Mei
CVPR, 2020

Joint contrastive learning with infinite possibilities

Q. Cai, Y. Wang, Y. Pan, T. Yao, T. Mei
NeurIPS, 2020

Learning spatio-temporal representation with local and global diffusion

Z. Qiu, T. Yao, C.-W. Ngo, X. Tian, T. Mei
CVPR, 2019

Hierarchy parsing for image captioning

T. Yao, Y. Pan, Y. Li, T. Mei
ICCV, 2019

Gaussian temporal awareness networks for action localization

F. Long, T. Yao, Z. Qiu, X. Tian, J. Luo, T. Mei
CVPR, 2019

Exploring visual relationship for image captioning

T. Yao, Y. Pan, Y. Li, T. Mei
ECCV, 2018

Fully convolutional adaptation networks for semantic segmentation

Y. Zhang, Z. Qiu, T. Yao, D. Liu, T. Mei
CVPR, 2018

Learning spatio-temporal representation with pseudo-3d residual networks

Z. Qiu, T. Yao, T. Mei
ICCV, 2017

Boosting image captioning with attributes

T. Yao, Y. Pan, Y. Li, Z. Qiu, T. Mei
ICCV, 2017

Video captioning with transferred semantic attributes

Y. Pan, T. Yao, H. Li, T. Mei
CVPR, 2017

Incorporating copying mechanism in image captioning for learning novel objects

T. Yao, Y. Pan, Y. Li, T. Mei
CVPR, 2017

To create what you tell: Generating videos from captions

Y. Pan, Z. Qiu, T. Yao, H. Li, T. Mei
ACM Multimedia, 2017

MSR-VTT: A large video description dataset for bridging video and language

J. Xu, T. Mei, T. Yao, Y. Rui
CVPR, 2016

Jointly modeling embedding and translation to bridge video and language

Y. Pan, T. Mei, T. Yao, H. Li, Y. Rui
CVPR, 2016

Highlight detection with pairwise deep ranking for first-person video summarization

T. Yao, T. Mei, Y. Rui
CVPR, 2016

Action recognition by learning deep multi-granular spatio-temporal video representation

Q. Li, Z. Qiu, T. Yao, T. Mei, Y. Rui, J. Luo
ACM ICMR, 2016 (the Best Paper Finalist)

Why fly? when you can walk on the water!
