Automated Construction of Interpretable and Multi-Hop Video Workloads via Knowledge Graph Traversal
A novel pipeline that positions knowledge graphs as the central engine for automated generation of medical VideoQA datasets, bridging the gap between scalability and interpretability.
Medical Video Question Answering (VideoQA) faces a dichotomy: manual annotation ensures structural rigor but does not scale, while end-to-end automatic generation scales but sacrifices control over the reasoning process. We propose Med-CRAFT, a pipeline that traverses cross-modal knowledge graphs to generate M3-Med-Auto, a large-scale dataset in which every question is grounded in explicit visual evidence. This approach preserves the structural interpretability of manual methods while matching the scalability of automatic ones.
Full automation from entity extraction to question generation.
Uses KGs as the unifying abstraction for controllable reasoning.
Generates complex queries with explicit reasoning traces.
A large-scale benchmark grounded in visual evidence.
Extracts entities from multi-modal signals (ASR, OCR) and grounds them visually.
Builds a Cross-modal Knowledge Graph capturing spatial, temporal, and logical relations.
Traverses the graph to generate multi-hop questions with precise temporal answers.
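The three steps above can be sketched in miniature. The snippet below is an illustrative toy, not the Med-CRAFT implementation: the entities, relation labels, timestamps, and question template are invented for demonstration. It shows the core idea of the final step, walking a multi-hop path through a relation-labeled graph and emitting a question whose answer is the temporal span of the last entity, together with an explicit reasoning trace.

```python
import networkx as nx

def build_demo_graph():
    """Toy stand-in for the Cross-modal Knowledge Graph.

    All node names, spans, and relations here are hypothetical examples,
    not entities from M3-Med-Auto.
    """
    g = nx.DiGraph()
    # Each node carries the video segment where the entity is visually grounded.
    g.add_node("suture needle", span=(12.0, 18.5))
    g.add_node("wound edge", span=(18.5, 25.0))
    g.add_node("knot tying", span=(25.0, 40.0))
    g.add_edge("suture needle", "wound edge", relation="passes through")
    g.add_edge("wound edge", "knot tying", relation="precedes")
    return g

def multi_hop_question(g, start, hops=2):
    """Walk `hops` edges from `start`, chaining relations into a question
    whose answer is the temporal span of the final entity on the path."""
    path, node = [start], start
    for _ in range(hops):
        successors = list(g.successors(node))
        if not successors:
            break
        node = successors[0]  # a real pipeline would sample or enumerate paths
        path.append(node)
    relations = [g.edges[a, b]["relation"] for a, b in zip(path, path[1:])]
    # Explicit reasoning trace: entity [relation] entity [relation] ...
    trace = " -> ".join(
        f"{a} [{r}] {b}" for (a, b), r in zip(zip(path, path[1:]), relations)
    )
    question = (
        f"After the {path[0]} {relations[0]} the {path[1]}, "
        f"when does the {path[-1]} occur?"
    )
    answer = g.nodes[path[-1]]["span"]  # precise temporal answer
    return question, answer, trace
```

Because each edge is labeled and each node is time-stamped, the generated question comes with both a machine-checkable answer (a segment) and a human-readable trace of the hops that produced it.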
Due to YouTube's Terms of Service and user privacy agreements, we do not provide direct download links for the raw video files. The dataset contains only YouTube video IDs and timestamps; researchers must download the videos independently using the provided tools/scripts.
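As one possible way to fetch videos from their IDs (this is a generic sketch using the third-party `yt-dlp` tool, not the project's own script), a helper might reconstruct the watch URL and shell out to the downloader:

```python
import subprocess

def youtube_url(video_id: str) -> str:
    # Standard YouTube watch URL; IDs come from the dataset annotations.
    return f"https://www.youtube.com/watch?v={video_id}"

def download_video(video_id: str, out_dir: str = "videos") -> None:
    """Download one video with yt-dlp (must be installed separately).

    The -o output template names the file after its video ID.
    """
    subprocess.run(
        ["yt-dlp", "-o", f"{out_dir}/%(id)s.%(ext)s", youtube_url(video_id)],
        check=True,
    )
```

Downloaded files can then be trimmed to the annotated timestamps with any standard video tool; users remain responsible for complying with YouTube's Terms of Service.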
Beijing Institute of Technology · The Hong Kong Polytechnic University · Shenzhen Institute of Advanced Technology, CAS
@article{medcraft2026,
title={Med-CRAFT: Automated Construction of Interpretable and Multi-Hop Video Workloads via Knowledge Graph Traversal},
author={Liu, Shenxi and Li, Kan and Zhao, Mingyang and Tian, Yuhang and Zhou, Shoujun and Li, Bin},
journal={PVLDB},
volume={14},
number={1},
year={2026}
}