Abstract:
Pulmonary edema assessment is critical to the treatment of acute congestive heart failure (CHF). Multimodal masked autoencoders for vision-language pre-training have been shown to effectively fuse information from chest radiographs and pulmonary edema radiology reports, improving the accuracy of pulmonary edema quantification. However, existing methods mask image patches and text tokens at random; this unstable operation can cause the model to overlook image lesions and textual keywords, hindering the fusion and alignment of multimodal information and ultimately degrading quantification accuracy. To address these problems, this study designs a masked vision-language distillation model that, for the first time, introduces self-distillation into medical vision-language pre-training, yielding more stable and reliable medical image and language representations, and optimizes the cross-modal attention fusion mechanism so that the model better fuses and aligns multimodal information. Compared with residual network 101 (ResNet101), vision transformer (ViT)-B/16, joint modeling of chest radiographs and radiology reports for pulmonary edema assessment (JMC3R), and multimodal masked autoencoders for medical vision and language pre-training (M3AE), our method achieves higher pulmonary edema quantification accuracy on the pulmonary edema assessment dataset (PEAD).