Project 1
CVPR paper with UCLA Complex Network Group:
Title: Seizure-Semiology-Bench: Benchmarking Multimodal Large Language Models on Fine-Grained Seizure Semiology Understanding
Abstract: Multimodal Large Language Models (MLLMs) have achieved impressive performance on general video understanding, yet their capabilities in high-stakes, knowledge-intensive domains remain largely untested. Existing video benchmarks, which primarily focus on common, goal-oriented activities in everyday scenarios, fail to represent the dense, involuntary, and spatio-temporally evolving motor phenomena characteristic of seizures, where laterality and temporal progression are diagnostically crucial. We introduce Seizure-Semiology-Bench, the first large-scale, expert-annotated dataset for evaluating MLLMs on clinical seizure understanding. It comprises 438 seizure recordings from 116 patients, paired with over 35,000 expert annotations spanning 20 ILAE-defined semiological features and 46 expert-curated prompts. Building on this dataset, we propose a seven-task hierarchical evaluation framework that probes MLLMs’ abilities from basic perception to high-level clinical reasoning, and introduce a clinically grounded metric, the Report Quality Index for Seizure Semiology (Seizure-RQI), which measures factual accuracy, temporal coherence, and structural integrity in clinical narratives rather than superficial text similarity. We evaluate 11 state-of-the-art open-source MLLMs and perform seizure-specialized fine-tuning. Our analysis reveals a pronounced performance gap: general-purpose models recognize salient motor signs but fail at temporal sequencing, fine-grained characterization, laterality judgment, and clinically coherent narrative generation. Seizure-specialized models, by contrast, show consistent gains across all tasks, demonstrating both the promise of and the need for further work on domain-aware MLLM development. Finally, a two-stage strategy (extracting structured semiological features, then applying a lightweight classifier to distinguish epileptic from non-epileptic seizures) significantly surpasses all end-to-end MLLM reasoning baselines.
Seizure-Semiology-Bench therefore serves as both a dataset and a benchmarking framework, highlighting the fundamental limits of current MLLMs and guiding the development of clinically reliable, domain-adaptive multimodal intelligence.
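The two-stage strategy mentioned in the abstract (structured feature extraction followed by a lightweight classifier) can be illustrated with a minimal sketch. This is not the paper's code: the feature names, keyword-matching extractor, toy reports, and labels below are all hypothetical stand-ins, and the hand-rolled logistic regression stands in for whatever lightweight classifier the paper actually uses.

```python
# Hypothetical sketch of a two-stage pipeline: Stage 1 maps a narrative
# seizure report to a structured binary feature vector; Stage 2 applies a
# lightweight linear classifier. All names and data here are illustrative.
import math

FEATURES = ["tonic posturing", "automatisms", "head version", "eye deviation"]

def extract_features(report):
    """Stage 1: keyword matching as a stand-in for MLLM-based extraction
    of structured semiological features from free text."""
    text = report.lower()
    return [1 if f in text else 0 for f in FEATURES]

def train_logreg(X, y, lr=0.5, epochs=200):
    """Stage 2: fit a tiny logistic-regression classifier with plain SGD."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            g = 1 / (1 + math.exp(-z)) - yi  # log-loss gradient w.r.t. logit
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b

def predict(w, b, report):
    """Classify a report: 1 = epileptic seizure, 0 = non-epileptic event."""
    x = extract_features(report)
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return int(1 / (1 + math.exp(-z)) >= 0.5)

# Toy labelled reports (illustrative only).
train = [
    ("tonic posturing with head version to the left", 1),
    ("head version and eye deviation, then oral automatisms", 1),
    ("sudden collapse with eyes closed, rapid recovery", 0),
    ("irregular trembling of all limbs, responsive throughout", 0),
]
X = [extract_features(r) for r, _ in train]
y = [label for _, label in train]
w, b = train_logreg(X, y)
```

The design point is the decoupling: once perception is reduced to a structured feature vector, the diagnostic decision becomes a small, well-posed classification problem rather than an end-to-end free-text reasoning task, which is consistent with the abstract's finding that this pipeline surpasses end-to-end MLLM baselines.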