Abstract:
To address the inconsistency between generated images and their text descriptions and the low image resolution of existing text-to-face generation methods, this paper proposes a cross-modal text-to-face image generation network framework. First, a pretrained CLIP model extracts features from the input text, and a conditioning augmentation module enhances the text semantic features to produce a latent vector; the latent vector is then projected by a mapping network into the latent space of the pretrained StyleGAN model to obtain a disentangled latent code, which is fed into the StyleGAN generator to synthesize a high-resolution face image; finally, a text reconstruction module regenerates text from the face image, and a semantic alignment loss between the reconstructed text and the input text is computed and used as semantic supervision to guide network training. Training and testing are performed on two datasets, Multi-Modal CelebA-HQ and CelebAText-HQ, and the experimental results show that, compared with other methods, the proposed method generates high-resolution face images that are more consistent with the text description.
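The sketch below illustrates the pipeline outlined above as a minimal PyTorch forward pass; it is an assumption-laden illustration, not the paper's implementation. All names and dimensions (TEXT_DIM, W_DIM, ConditioningAugmentation, MappingNetwork, semantic_alignment_loss) are hypothetical, and the pretrained CLIP text encoder, StyleGAN generator, and text reconstruction module are stubbed out with simple linear layers.

```python
# Minimal sketch of the described pipeline; module names and sizes are assumed,
# and pretrained components (CLIP, StyleGAN, captioner) are replaced by stubs.
import torch
import torch.nn as nn
import torch.nn.functional as F

TEXT_DIM, W_DIM = 512, 512  # assumed feature dimensions


class ConditioningAugmentation(nn.Module):
    """Sample an augmented latent around the text embedding (reparameterisation)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.fc = nn.Linear(in_dim, out_dim * 2)

    def forward(self, text_feat):
        mu, logvar = self.fc(text_feat).chunk(2, dim=-1)
        eps = torch.randn_like(mu)
        return mu + eps * torch.exp(0.5 * logvar)


class MappingNetwork(nn.Module):
    """Project the augmented text latent into StyleGAN's disentangled W space."""
    def __init__(self, in_dim, w_dim, depth=4):
        super().__init__()
        layers = []
        for i in range(depth):
            layers += [nn.Linear(in_dim if i == 0 else w_dim, w_dim), nn.LeakyReLU(0.2)]
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        return self.net(z)


def semantic_alignment_loss(recon_text_feat, input_text_feat):
    """Cosine-similarity loss between reconstructed and input text features."""
    return 1.0 - F.cosine_similarity(recon_text_feat, input_text_feat, dim=-1).mean()


# Stubs standing in for the pretrained components described in the abstract.
clip_text_encoder = nn.Linear(77, TEXT_DIM)          # stub for the CLIP text encoder
stylegan_generator = nn.Linear(W_DIM, 3 * 64 * 64)   # stub for the StyleGAN generator
image_to_text = nn.Linear(3 * 64 * 64, TEXT_DIM)     # stub for the text reconstruction module

tokens = torch.randn(4, 77)                          # stand-in for tokenized captions
t = clip_text_encoder(tokens)                        # 1) text features
z = ConditioningAugmentation(TEXT_DIM, W_DIM)(t)     # 2) augmented latent vector
w = MappingNetwork(W_DIM, W_DIM)(z)                  # 3) map into StyleGAN's W space
img = stylegan_generator(w)                          # 4) generate the face image
t_rec = image_to_text(img)                           # 5) reconstruct text features from the image
loss = semantic_alignment_loss(t_rec, t)             # 6) semantic supervision signal
```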