Abstract:
To address the inconsistency between generated images and their text descriptions and the low image resolution of existing text-to-face generation methods, this paper proposes a cross-modal text-to-face image generation network framework. First, a pretrained CLIP model extracts features from the input text, and a conditioning augmentation module enhances the text semantic features to produce a latent vector; the latent vector is then projected by a mapping network into the latent space of the pretrained StyleGAN model to obtain a disentangled latent code, which is fed into the StyleGAN generator to synthesize a high-resolution face image; finally, a text reconstruction module regenerates text from the face image, and a semantic alignment loss between the reconstructed text and the input text is computed and used as semantic supervision to guide network training. Training and testing are performed on two datasets, Multi-Modal CelebA-HQ and CelebAText-HQ, and the experimental results show that, compared with other methods, the proposed method generates high-resolution face images that are more consistent with the text description.
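The sketch below illustrates the pipeline outlined above as a minimal PyTorch forward pass; it is an assumption-laden illustration, not the paper's implementation. All names and dimensions (TEXT_DIM, W_DIM, ConditioningAugmentation, MappingNetwork, semantic_alignment_loss) are hypothetical, and the pretrained CLIP text encoder, StyleGAN generator, and text reconstruction module are stubbed out with simple linear layers.

```python
# Minimal sketch of the described pipeline; module names and sizes are assumed,
# and pretrained components (CLIP, StyleGAN, captioner) are replaced by stubs.
import torch
import torch.nn as nn
import torch.nn.functional as F

TEXT_DIM, W_DIM = 512, 512  # assumed feature dimensions


class ConditioningAugmentation(nn.Module):
    """Sample an augmented latent around the text embedding (reparameterisation)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.fc = nn.Linear(in_dim, out_dim * 2)

    def forward(self, text_feat):
        mu, logvar = self.fc(text_feat).chunk(2, dim=-1)
        eps = torch.randn_like(mu)
        return mu + eps * torch.exp(0.5 * logvar)


class MappingNetwork(nn.Module):
    """Project the augmented text latent into StyleGAN's disentangled W space."""
    def __init__(self, in_dim, w_dim, depth=4):
        super().__init__()
        layers = []
        for i in range(depth):
            layers += [nn.Linear(in_dim if i == 0 else w_dim, w_dim), nn.LeakyReLU(0.2)]
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        return self.net(z)


def semantic_alignment_loss(recon_text_feat, input_text_feat):
    """Cosine-similarity loss between reconstructed and input text features."""
    return 1.0 - F.cosine_similarity(recon_text_feat, input_text_feat, dim=-1).mean()


# Stubs standing in for the pretrained components described in the abstract.
clip_text_encoder = nn.Linear(77, TEXT_DIM)          # stub for the CLIP text encoder
stylegan_generator = nn.Linear(W_DIM, 3 * 64 * 64)   # stub for the StyleGAN generator
image_to_text = nn.Linear(3 * 64 * 64, TEXT_DIM)     # stub for the text reconstruction module

tokens = torch.randn(4, 77)                          # stand-in for tokenized captions
t = clip_text_encoder(tokens)                        # 1) text features
z = ConditioningAugmentation(TEXT_DIM, W_DIM)(t)     # 2) augmented latent vector
w = MappingNetwork(W_DIM, W_DIM)(z)                  # 3) map into StyleGAN's W space
img = stylegan_generator(w)                          # 4) generate the face image
t_rec = image_to_text(img)                           # 5) reconstruct text features from the image
loss = semantic_alignment_loss(t_rec, t)             # 6) semantic supervision signal
```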