Abstract:
Pulmonary edema assessment is critical to the treatment of acute congestive heart failure (CHF). Multimodal masked autoencoders for vision-language pre-training have been shown to effectively fuse information from chest radiographs and pulmonary edema radiology reports, improving the accuracy of pulmonary edema quantification. However, existing methods mask image patches and text tokens at random; this unstable operation can cause the model to overlook image lesions and textual keywords, hindering the fusion and alignment of multimodal information and ultimately degrading quantification accuracy. To address these problems, this study designs a masked vision-language distillation model that, for the first time, introduces self-distillation into medical vision-language pre-training, yielding more stable and reliable medical image and language representations, and optimizes the cross-modal attention fusion mechanism so that the model better fuses and aligns multimodal information. Compared with residual network 101 (ResNet101), vision transformer (ViT)-B/16, joint modeling of chest radiographs and radiology reports for pulmonary edema assessment (JMC3R), and multimodal masked autoencoders for medical vision and language pre-training (M3AE), our method achieves higher pulmonary edema quantification accuracy on the pulmonary edema assessment dataset (PEAD).