HOPPR™ MC Chest Radiography Narrative Model

Written By: Jean-Benoit Delbrouck, PhD

HOPPR has introduced the MC Chest Radiography Narrative Model (MC CXR Narrative), a vision-language model that translates chest X-ray images into structured narrative text for radiology reporting workflows. It is deployed as a foundational software component through HOPPR Forward Deployed Services. To learn more, visit hoppr.ai.

At HOPPR, we curate images and reports sample by sample, with radiologist review of edge cases, to build a training set we can stand behind. For model comparison against the state of the art, however, we rely on automated metrics.

Evaluating a radiology report generator is not trivial: n-gram metrics like ROUGE-L or BLEU reward wording overlap but say little about clinical correctness. We therefore focus on four complementary metrics: GREEN [8], an LLM-as-judge trained to mirror radiologist error counts; Disease F1 [11], which runs a CheXbert classifier on generated and reference reports and compares pathology labels; RadGraph-F1, which matches entities and relations extracted by a named-entity recognition system trained on human-annotated radiology reports; and BERTScore [10], which captures semantic fidelity beyond lexical overlap. Baselines are the most recent peer-reviewed entries from leading venues and journals (Nature Communications, NEJM AI, EMNLP, ICLR).
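To make the entity-and-relation matching behind RadGraph-F1 concrete, the sketch below computes set-level F1 over entities and relations extracted from a generated and a reference report. The extract_graph argument is a hypothetical placeholder for a trained extractor such as a RadGraph-style model; the scoring logic is a simplified illustration, not the official implementation.

```python
from typing import Callable, Set, Tuple

def set_f1(pred: Set, ref: Set) -> float:
    # Harmonic mean of precision and recall over two sets of items.
    if not pred and not ref:
        return 1.0
    if not pred or not ref:
        return 0.0
    tp = len(pred & ref)
    precision, recall = tp / len(pred), tp / len(ref)
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

def graph_f1(pred_report: str, ref_report: str, extract_graph: Callable) -> Tuple[float, float]:
    # extract_graph is a hypothetical placeholder for a trained extractor that
    # returns (entities, relations) for a report, where entities are
    # (text, label) pairs and relations are (head, type, tail) triples.
    pred_entities, pred_relations = extract_graph(pred_report)
    ref_entities, ref_relations = extract_graph(ref_report)
    return (set_f1(set(pred_entities), set(ref_entities)),
            set_f1(set(pred_relations), set(ref_relations)))
```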

Benchmark Results

[Benchmark chart: MC CXR Narrative compared with published baselines on the key metrics GREEN, Disease F1, and BERTScore.]

GREEN [8]

An LLM-based evaluation metric that identifies and explains clinically significant errors in generated radiology reports. Unlike lexical overlap metrics, GREEN produces scores aligned with radiologist expert preferences and provides human-interpretable error explanations, making it one of the most clinically meaningful automated metrics available.
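For intuition, the sketch below shows one way a judge's outputs could be aggregated into a single score: the share of reference findings the candidate report matches, discounted by clinically significant errors. This is an illustrative aggregation in the spirit of GREEN, not the official implementation, which should be used for any reported numbers.

```python
def green_style_score(matched_findings: int, significant_errors: int) -> float:
    # Illustrative only: fraction of reference findings the candidate matches,
    # discounted by clinically significant errors counted by the LLM judge.
    # Use the official GREEN implementation for reported numbers.
    denominator = matched_findings + significant_errors
    return matched_findings / denominator if denominator > 0 else 0.0

# Example: 5 matched findings and 1 clinically significant error -> ~0.833
print(green_style_score(5, 1))
```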

Disease F1 [11]

A CheXbert-based text classifier is applied to both the ground-truth and predicted reports to extract disease presence labels; macro-averaged F1 is then computed across all classes. Standard benchmarks evaluate on 14 CheXpert labels, while HOPPR uses an extended set of 27 pathology labels, providing broader coverage of findings encountered in real-world clinical data.
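The sketch below illustrates this computation under simple assumptions: label_report is a hypothetical placeholder for a CheXbert-style labeler that maps a report to a binary vector over pathology labels, and scikit-learn's macro-averaged F1 is computed over those vectors.

```python
import numpy as np
from sklearn.metrics import f1_score

def disease_f1(pred_reports, ref_reports, label_report, n_labels=27):
    # label_report is a hypothetical placeholder for a CheXbert-style labeler that
    # maps a report string to a binary vector of length n_labels
    # (1 = pathology labeled as present, 0 = otherwise).
    y_pred = np.array([label_report(r) for r in pred_reports])  # shape (N, n_labels)
    y_true = np.array([label_report(r) for r in ref_reports])   # shape (N, n_labels)
    assert y_pred.shape[1] == y_true.shape[1] == n_labels
    # Macro average: F1 is computed per pathology label and averaged with equal
    # weight, so rare findings count as much as common ones.
    return f1_score(y_true, y_pred, average="macro", zero_division=0)
```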

BERTScore [10]

Measures contextual embedding similarity between generated and reference reports using pretrained language representations. By operating at the token-embedding level rather than surface-level n-gram overlap, BERTScore captures semantic equivalence and paraphrase robustness that traditional metrics like BLEU or ROUGE-L miss.
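A minimal example with the open-source bert-score package is shown below; the two reports are illustrative only, and rescaling with the baseline is optional but makes scores easier to interpret.

```python
# pip install bert-score
from bert_score import score

candidates = ["No focal consolidation. Mild cardiomegaly is unchanged."]
references = ["Heart size is mildly enlarged and stable. No consolidation identified."]

# Precision, recall, and F1 are computed from contextual token embeddings;
# rescale_with_baseline maps scores to a more interpretable range.
P, R, F1 = score(candidates, references, lang="en", rescale_with_baseline=True)
print(f"BERTScore F1: {F1.mean().item():.3f}")
```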

Across the evaluated metrics, the MC CXR Narrative model performs particularly well on semantic fidelity (BERTScore) and pathology label agreement (Disease F1) and is competitive on clinical error evaluation (GREEN). Together, these results suggest that the model is capturing relevant findings and clinical relationships reflected in the reference reports rather than simply reproducing surface-level phrasing. Since each metric evaluates a distinct aspect of report generation, results are best interpreted collectively rather than through any single benchmark in isolation.

Taken together, GREEN assesses clinical factual correctness, Disease F1 measures pathology detection accuracy, and BERTScore captures the overall semantic fidelity of the generated narrative, providing a well-rounded view of report quality.

Conclusion

This model reflects HOPPR's broader approach to medical imaging AI by combining imaging-specific foundation models, tailored datasets, and transparent evaluation methods to support downstream development. As VLMs continue to evolve, we believe scalable AI will depend on flexible foundation models and rigorous benchmarking. We expect this shift to help developers build and evaluate AI systems more efficiently across a wider range of applications.

By integrating state-of-the-art evaluation metrics and comparing MC CXR Narrative with strong baselines from leading academic institutions and research groups, HOPPR aims to help advance the standard for how radiology AI systems are developed, evaluated, and responsibly deployed.

References

  1. Bannur, S. et al. "MAIRA-2: Grounded Radiology Report Generation." Microsoft Research Technical Report MSR-TR-2024-18, 2024. arxiv.org/abs/2406.04449
  2. Yang, L. et al. "Advancing Multimodal Medical Capabilities of Gemini." arXiv:2405.03162, 2024. arxiv.org/abs/2405.03162
  3. Sellergren, A. et al. "MedGemma 1.5 Technical Report." arXiv:2604.05081, 2026. arxiv.org/abs/2604.05081
  4. Liu, Q. et al. "Scaling Medical Imaging Report Generation with Multimodal Reinforcement Learning." arXiv:2601.17151, 2026. arxiv.org/abs/2601.17151
  5. Chen, Z. et al. "A Vision-Language Foundation Model to Enhance Efficiency of Chest X-ray Interpretation." arXiv:2401.12208, 2024. arxiv.org/abs/2401.12208
  6. Wu, C. et al. "Towards Generalist Foundation Model for Radiology by Leveraging Web-Scale 2D & 3D Medical Data." Nature Communications 16, 7866, 2025. doi.org/10.1038/s41467-025-62385-7
  7. Zhou, H.-Y. et al. "MedVersa: A Generalist Foundation Model for Diverse Medical Imaging Tasks." NEJM AI 3(4), 2026.
  8. Ostmeier, S. et al. "GREEN: Generative Radiology Report Evaluation and Error Notation." Findings of EMNLP 2024, pp. 374–390. aclanthology.org/2024.findings-emnlp.21
  9. ReXrank: Chest X-ray Interpretation Leaderboard. rexrank.ai
  10. Zhang, T. et al. "BERTScore: Evaluating Text Generation with BERT." International Conference on Learning Representations (ICLR), 2020. openreview.net/forum?id=SkeHuCVFDr
  11. Smit, A. et al. "Combining Automatic Labelers and Expert Annotations for Accurate Radiology Report Labeling Using BERT." Proceedings of EMNLP 2020, pp. 1500–1519. aclanthology.org/2020.emnlp-main.117
  12. Xu, J. et al. "RadEval: A Framework for Radiology Text Evaluation." Proceedings of EMNLP 2025: System Demonstrations, pp. 546–557. aclanthology.org/2025.emnlp-demos.40