Shaping Futures: Microsoft Research Asia Lecture Series

Date: Wednesday, 4 June 2025

Time: 2:30pm - 4:00pm

Venue: Lecture Theater G (Chow Tak Sin Lecture Theater), near lift 25/26, HKUST

Talk 1: Exploring Multimodal Large Models: From Text to Audio and Video

Abstract:

With the breakthroughs of Large Language Models (LLMs) in text processing, their applications have expanded to other modalities, including audio and video processing. This talk provides an overview of our research on multimodal large models, focusing on audio and video. We explore advancements in pre-trained models for speech and audio, such as WavLM and BEATs. Additionally, we discuss LLM-based audio generation models like VALL-E and VALL-E 2. Our work also includes the development of the speech large model WavLLM. Furthermore, we present ARLON, an LLM-based video generation model. These models leverage extensive datasets and sophisticated techniques to achieve high performance across a range of tasks. Through these efforts, we aim to push the boundaries of what is possible with LLMs in multimodal data processing, and we will also apply these techniques to advance medical AI, enhancing diagnostic and therapeutic capabilities.

Speaker:

Shujie Liu is a Principal Researcher at Microsoft Research Asia (Hong Kong). His research interests include natural language processing, speech processing, and deep learning technologies. He has published over 100 papers in top journals and conferences in the fields of natural language and speech processing, co-authored the book "Machine Translation," and contributed to the book "Introduction to Artificial Intelligence." His research findings have been widely applied in key Microsoft products such as Microsoft Translator, Skype Translator, Microsoft IME, and Microsoft Speech Services, including speech synthesis, speech separation, and speech recognition.

Talk 2: MedEd: AI-Transformed Medical Education

Abstract:

This presentation introduces our MedEd project, which develops an agentic ecosystem integrating large language models (LLMs) and vision-language models (VLMs) to advance medical education. It aims to help students build a strong theoretical foundation and gain practical experience through realistic simulations, while supporting educators by reducing teaching workloads and improving delivery efficiency. A multi-agent system simulates professional educators and realistic patients, enabling students to strengthen both their medical expertise and their patient-care communication skills. The system is also optimized for agent coordination, allowing agents to collaborate on diagnosis and treatment planning and to serve as virtual educators that support self-directed learning. Built on our medical foundation models, the agents leverage LLMs, external knowledge, and inter-agent communication to perform specialized tasks, while distilling knowledge back into the models to continuously refine them. The ecosystem also emphasizes human-agent interaction, focusing on usability and effectiveness. Together, these components form MedEd, a comprehensive platform for transforming medical education through simulation and intelligent coordination.

Speaker:

Jinglu Wang is a Senior Researcher in the Media Computing Group at Microsoft Research Asia (MSRA). She earned her PhD in Computer Science and Engineering from The Hong Kong University of Science and Technology and her Bachelor's degree in Computer Science from Fudan University. Her research focuses on 3D reconstruction, video understanding, and multimodal LLMs. Jinglu has published over 40 papers in leading conferences and journals, including CVPR, ICCV, PAMI, TVCG, NeurIPS, AAAI, ACL, and EMNLP, contributing to advancements in computer vision, machine learning, and natural language processing. Her current work centers on AI for healthcare, with a focus on medical foundation models and multi-agent systems. Beyond her research, Jinglu has applied her expertise to Microsoft products, driving innovations such as video enhancement and Together Mode in Microsoft Teams to improve media experiences for users.