Kaijing Ma

“The future is already here – it’s just not evenly distributed.”

Hi there! ☺️

My name is Kaijing. I am currently interning at the Embodied AI Center at Shanghai AI Laboratory, working with Prof. Tong He. We are a young team focusing on robotics. I strongly believe in developing foundation embodied AI models with zero-shot cross-embodiment capabilities.

Back in 2022, I enrolled in a Master's Industry Co-Op Education Program between Xi’an Jiaotong University and TeleAI. At TeleAI, I focus on building state-of-the-art multimodal understanding and generation models, such as Video Temporal Grounding and controllable text-to-image models. For example, we have launched TeleImage (aka 星辰多模态大模型), an advanced text-to-image generation product for the public. Our team is led by Prof. Xuelong Li, who is also the CTO and Chief Scientist of China Telecom.

News

Dec 10, 2024	Our paper TUNED has been accepted by AAAI-25 🙂. Feel free to access the paper and code.
Sep 21, 2024	Received the ‘Excellence in Engineering Award (Student Category)’ from the National School for Engineers at Xi’an Jiaotong University🥳
Sep 14, 2024	We released the code of SRAM🔥
Jul 15, 2024	Our paper about controllable text-to-image generation has been accepted by ACM MM 2024!
Mar 15, 2024	Our paper about moment retrieval has been accepted as an oral paper by ICME 2024.
Jul 15, 2023	Our paper about moment retrieval has been accepted by ICCV Workshop 2023.

Selected publications

ICCVW

LLaViLo: Boosting Video Moment Retrieval via Adapter-Based Multimodal Modeling

Kaijing Ma^*, Xianghao Zang^*, Zerun Feng, Han Fang, Chao Ban, Yuhan Wei, Zhongjiang He, Yongxiang Li, and Hao Sun^†

In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, 2023


@inproceedings{ma2023llavilo,
  title = {{LLaViLo}: Boosting Video Moment Retrieval via Adapter-Based Multimodal Modeling},
  author = {Ma, Kaijing and Zang, Xianghao and Feng, Zerun and Fang, Han and Ban, Chao and Wei, Yuhan and He, Zhongjiang and Li, Yongxiang and Sun, Hao},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops},
  pages = {2798--2803},
  year = {2023},
}

arXiv

Beyond Uncertainty: Evidential Deep Learning for Robust Video Temporal Grounding

Kaijing Ma^*, Haojian Huang^*, Jin Chen^*, Haodong Chen, Pengliang Ji, Xianghao Zang, Han Fang, Chao Ban, Hao Sun, Mulin Chen, and others

arXiv preprint arXiv:2408.16272, 2024


@article{ma2024beyond,
  title = {Beyond Uncertainty: Evidential Deep Learning for Robust Video Temporal Grounding},
  author = {Ma, Kaijing and Huang, Haojian and Chen, Jin and Chen, Haodong and Ji, Pengliang and Zang, Xianghao and Fang, Han and Ban, Chao and Sun, Hao and Chen, Mulin and others},
  journal = {arXiv preprint arXiv:2408.16272},
  year = {2024},
}

ACM MM

GOAL: Grounded text-to-image Synthesis with Joint Layout Alignment Tuning

Yaqi Li, Han Fang, Zerun Feng, Kaijing Ma, Chao Ban, Xianghao Zang, LanXiang Zhou, Zhongjiang He, Jingyan Chen, Jiani Hu, and others

In ACM Multimedia, 2024


@inproceedings{li2024goal,
  title = {{GOAL}: Grounded text-to-image Synthesis with Joint Layout Alignment Tuning},
  author = {Li, Yaqi and Fang, Han and Feng, Zerun and Ma, Kaijing and Ban, Chao and Zang, Xianghao and Zhou, LanXiang and He, Zhongjiang and Chen, Jingyan and Hu, Jiani and others},
  booktitle = {ACM Multimedia 2024},
  year = {2024},
}

AAAI

Trusted Unified Feature-Neighborhood Dynamics for Multi-View Classification

Haojian Huang, Chuanyu Qin, Zhe Liu, Kaijing Ma, Jin Chen, Han Fang, Chao Ban, Hao Sun, and Zhongjiang He^†

In Proceedings of the AAAI Conference on Artificial Intelligence, 2025


@inproceedings{huang2024trusted,
  title = {Trusted Unified Feature-Neighborhood Dynamics for Multi-View Classification},
  author = {Huang, Haojian and Qin, Chuanyu and Liu, Zhe and Ma, Kaijing and Chen, Jin and Fang, Han and Ban, Chao and Sun, Hao and He, Zhongjiang},
  booktitle = {Proceedings of the AAAI Conference on Artificial Intelligence},
  year = {2025},
}

arXiv

BoViLA: Bootstrapping Video-Language Alignment via LLM-Based Self-Questioning and Answering

Jin Chen, Kaijing Ma, Haojian Huang, Jiayu Shen, Han Fang, Xianghao Zang, Chao Ban, Zhongjiang He, Hao Sun, and Yanmei Kang

arXiv preprint arXiv:2410.02768, 2024


@article{chen2024bovila,
  title = {{BoViLA}: Bootstrapping Video-Language Alignment via {LLM}-Based Self-Questioning and Answering},
  author = {Chen, Jin and Ma, Kaijing and Huang, Haojian and Shen, Jiayu and Fang, Han and Zang, Xianghao and Ban, Chao and He, Zhongjiang and Sun, Hao and Kang, Yanmei},
  journal = {arXiv preprint arXiv:2410.02768},
  year = {2024}
}