Rongkun Zheng
I am a third-year Ph.D. student (since 2022) at the University of Hong Kong, supervised by Prof. Hengshuang Zhao. Before that, I received my B.Eng. from Tsinghua University in 2022. I have done internships at SenseTime and Shlab.
My research interests lie in deep learning and computer vision. I have published multiple research works on video perception, open-world multi-modal learning, and video understanding.
My current interests include multimodal large language models (MLLMs) and reinforcement learning.
Update: I am on the job market now! Drop me an email if you are interested.
Email / CV / Scholar / GitHub
ViLLa: Video Reasoning Segmentation with Large Language Model
Rongkun Zheng, Lu Qi, Xi Chen, Yi Wang, Kun Wang, Yu Qiao, Hengshuang Zhao
ICCV, 2025
pdf /
code
ViLLa is an effective and efficient LMM that segments and tracks objects with reasoning capabilities. It handles complex video reasoning segmentation tasks, such as (a) segmenting objects with complex interactions, (b) segmenting objects with complex motion, and (c) segmenting objects in long videos with occlusions.
DisCo: Towards Distinct and Coherent Visual Encapsulation in Video MLLMs
Jiahe Zhao, Rongkun Zheng, Yi Wang, Helin Wang, Hengshuang Zhao
ICCV, 2025
pdf /
code
We introduce DisCo, a novel visual encapsulation method designed to yield semantically distinct and temporally coherent visual tokens for video MLLMs.
SyncVIS: Synchronized Video Instance Segmentation
Rongkun Zheng, Lu Qi, Xi Chen, Yi Wang, Kun Wang, Yu Qiao, Hengshuang Zhao
NeurIPS, 2024
pdf /
code
In this work, we analyze the limitations of current video instance segmentation solutions and propose to conduct synchronized modeling via a new framework named SyncVIS.
TMT-VIS: Taxonomy-aware Multi-dataset Joint Training for Video Instance Segmentation
Rongkun Zheng, Lu Qi, Xi Chen, Yi Wang, Kun Wang, Yu Qiao, Hengshuang Zhao
NeurIPS, 2023
pdf /
code
In this work, we show that providing extra taxonomy information helps models concentrate on specific taxonomies, and we propose TMT-VIS, a taxonomy-aware multi-dataset joint training framework for video instance segmentation.
Academic Service
Reviewer
- CVPR (2023, 2024, 2025)
- ICCV (2023)
- ECCV (2024)
- NeurIPS (2023, 2024, 2025)
- ICLR (2023)
Teaching Assistant
- DASC7606: Deep Learning (Graduate course @ HKU) (2023, 2024, 2025)