Kaijing Ma

Intern @Shanghai AI Lab,
Master's student @XJTU with @TeleAI,
Incoming PhD Student @SomeWhere📢

“The future is already here – it’s just not evenly distributed.”

— William Gibson

As a master’s student, I am enrolled in a joint training program at Xi’an Jiaotong University and TeleAI. My master’s studies are supervised collaboratively by Professor Xingsong Hou from XJTU and Professor Hao Sun from TeleAI.

At TeleAI, I focus on building state-of-the-art multimodal understanding and generation models, covering tasks such as video temporal grounding and controllable AIGC. For example, we have released 星辰多模态大模型 (the Xingchen multimodal large model), a competitive text-to-image generation product, to the public. Our team is led by IEEE Fellow Xuelong Li, who is also the CTO and Chief Scientist of China Telecom.

I am currently interning at Shanghai AI Laboratory, focusing on embodied intelligence, a field with transformative potential for the future of AI. I am committed to a long-term career in robotics, aiming to develop a groundbreaking autonomous robot system that can interact with the physical world as naturally as ChatGPT interacts with language.

📢 I am seeking a PhD position for the fall of 2025.🥺😭

News

Sep 21, 2024 Received the ‘Excellence in Engineering Award (Student Category)’ from the National School for Engineers at Xi’an Jiaotong University🥳
Sep 14, 2024 We released the code for SRAM🔥
Jul 15, 2024 Our paper about controllable text-to-image generation was accepted to ACM MM 2024!
Mar 15, 2024 Our paper about moment retrieval was accepted as an oral paper at ICME 2024.
Jul 15, 2023 Our paper about moment retrieval was accepted to the ICCV Workshop 2023.

Selected publications

  1. ICCVW
    LLaViLo: Boosting Video Moment Retrieval via Adapter-Based Multimodal Modeling
    Kaijing Ma*, Xianghao Zang*, Zerun Feng, Han Fang, Chao Ban, Yuhan Wei, Zhongjiang He, Yongxiang Li, and Hao Sun
    In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, 2023
  2. arXiv
    Beyond Uncertainty: Evidential Deep Learning for Robust Video Temporal Grounding
    Kaijing Ma*, Haojian Huang*, Jin Chen*, Haodong Chen, Pengliang Ji, Xianghao Zang, Han Fang, Chao Ban, Hao Sun, Mulin Chen, and others
    arXiv preprint arXiv:2408.16272, 2024
  3. ACM MM
    GOAL: Grounded text-to-image Synthesis with Joint Layout Alignment Tuning
    Yaqi Li, Han Fang, Zerun Feng, Kaijing Ma, Chao Ban, Xianghao Zang, LanXiang Zhou, Zhongjiang He, Jingyan Chen, Jiani Hu, and others
    In ACM Multimedia, 2024
  4. arXiv
    Trusted Unified Feature-Neighborhood Dynamics for Multi-View Classification
    Haojian Huang, Chuanyu Qin, Zhe Liu, Kaijing Ma, Jin Chen, Han Fang, Chao Ban, Hao Sun, and Zhongjiang He
    arXiv preprint arXiv:2409.00755, 2024
  5. arXiv
    BoViLA: Bootstrapping Video-Language Alignment via LLM-Based Self-Questioning and Answering
    Jin Chen, Kaijing Ma, Haojian Huang, Jiayu Shen, Han Fang, Xianghao Zang, Chao Ban, Zhongjiang He, Hao Sun, and Yanmei Kang
    arXiv preprint arXiv:2410.02768, 2024