Merlin: a computed tomography vision-language foundation model and dataset - PubMed

Merlin is a 3D vision-language model (VLM) designed for automated analysis of abdominal CT scans.
It learns from volumetric CT scans, electronic health records, and radiology reports without requiring additional manual annotations.
It was trained on a high-quality dataset of over 6 million images from 15,331 CT scans, 1.8 million diagnosis codes, and 6 million tokens of radiology reports.
It was evaluated on 6 task types and 752 individual tasks, including diagnostic, prognostic, and quality-related tasks.
It generalized well across institutions and anatomies, outperforming 2D VLMs and CT foundation models.
The authors released the trained models, code, and a dataset of 25,494 pairs of abdominal CT scans and radiology reports.
Potential applications include assisting radiologists, biomarker discovery, and disease risk stratification.
