Ziyang Wang

I am a second-year CS Ph.D. student at The University of North Carolina, Chapel Hill, advised by Prof. Mohit Bansal, and I also work closely with Prof. Gedas Bertasius. My current research interest is multimodal learning, with a special focus on video-language understanding. I am affiliated with the UNC-NLP group.

I join the Meta FAIR Perception team as a research intern in summer 2024. Previously, I was an Applied Scientist Intern at Amazon Alexa AI, working with Heba Elfardy, Kevin Small, and Markus Dreyer. I also interned at Tsinghua AIR, working with Prof. Jingjing Liu. I finished my undergraduate studies at UESTC, where I was advised by Prof. Jingjing Li.

My email address is ziyangw at cs . unc . edu. If you have any questions, feel free to contact me!

Google Scholar  /  Curriculum Vitae  /  GitHub  /  LinkedIn


In general, I am interested in the fundamental challenges in video-language understanding.

VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos

Ziyang Wang*, Shoubin Yu*, Elias Stengel-Eskin*, Jaehong Yoon, Feng Cheng, Gedas Bertasius, Mohit Bansal
arXiv, 2024
arxiv / code

We introduce VideoTree, a query-adaptive and hierarchical framework for long-video understanding with LLMs. Specifically, VideoTree dynamically extracts query-related information from the input video and builds a tree-based video representation for LLM reasoning.

DAM: Dynamic Adapter Merging for Continual Video QA Learning

Feng Cheng*, Ziyang Wang*, Yi-Lin Sung, Yan-Bo Lin, Mohit Bansal, Gedas Bertasius
arXiv, 2024
arxiv / code

In this work, we investigate the challenging and relatively unexplored problem of rehearsal-free domain-incremental VidQA learning. Our proposed DAM framework outperforms the existing state of the art by 9.1% while exhibiting 1.9% less forgetting on a benchmark with six distinct video domains.

Unified Embeddings for Multimodal Retrieval via Frozen LLMs

Ziyang Wang, Heba Elfardy, Markus Dreyer, Kevin Small, Mohit Bansal
Findings of EACL, 2024

In this work, we present Unified Embeddings for Multimodal Retrieval (UNIMUR), a simple but effective approach that embeds multimodal inputs and retrieves visual and textual outputs via frozen Large Language Models (LLMs). Specifically, UNIMUR jointly retrieves multimodal outputs via a unified multimodal embedding and applies dual alignment training to account for both visual and textual semantics.

A Simple LLM Framework for Long-Range Video Question-Answering

Ce Zhang, Taixi Lu, Md Mohaiminul Islam, Ziyang Wang, Shoubin Yu, Mohit Bansal, Gedas Bertasius
arXiv, 2024
arxiv / code

We present LLoVi, a language-based framework for long-range video question-answering (LVQA). Unlike prior long-range video understanding methods, which are often costly and require specialized long-range video modeling design (e.g., memory queues or state-space layers), our approach uses a frame/clip-level visual captioner coupled with a Large Language Model (GPT-3.5, GPT-4), leading to a simple yet surprisingly effective LVQA framework.

Unified Coarse-to-Fine Alignment for Video-Text Retrieval

Ziyang Wang, Yi-Lin Sung, Feng Cheng, Gedas Bertasius, Mohit Bansal
ICCV, 2023
arxiv / code

UCoFiA captures cross-modal similarity information at different granularity levels (video-sentence, frame-sentence, pixel-word) and unifies multi-level alignments for video-text retrieval.

Language-Augmented Pixel Embedding for Generalized Zero-shot Learning

Ziyang Wang, Yunhao Gou, Jingjing Li, Lei Zhu, Heng Tao Shen
IEEE Transactions on Circuits and Systems for Video Technology, 2022

In this paper, we propose a novel GZSL framework named Language-Augmented Pixel Embedding (LAPE), which directly maps the image pixels to the semantic attributes with cross-modal guidance.

Region Semantically Aligned Network for Zero-Shot Learning

Ziyang Wang*, Yunhao Gou*, Jingjing Li, Yu Zhang, Yang Yang
CIKM (long oral), 2021
arxiv

We propose a novel ZSL framework named Region Semantically Aligned Network (RSAN), which transfers region-attribute alignment from seen classes to unseen classes.


I am a die-hard Arsenal and Tar Heel fan.

Design and source code from Leonid Keselman's website, thanks!