Abstract:
In feature extraction for multimodal depression models, existing approaches suffer from weak correlation between sentences, arbitrary feature fusion across modalities, and a lack of validation of model generalization on Chinese datasets. By analyzing audio, text, and visual features related to depression, this paper proposes STCMN (Sentence-level Temporal Convolutional Memory Network), a multimodal depression recognition model based on an improved TCN, and applies it to the auxiliary diagnosis of clinical depression. First, a fusion module composed of a residual block, a GRU, and Self-Attention extracts sentence-level features for each modality, strengthening contextual links between sentences. Then, a TCN extracts the global features of each modality, and Cross-Attention fuses these per-modality global features into multimodal fusion features. Finally, the depression recognition results are obtained through a LogSoftmax layer. On the public DAIC-WOZ dataset, the proposed method reaches an accuracy of 91.3%, a precision of 93.6%, and a recall of 89.7% for depression recognition, outperforming comparison methods on these metrics and better meeting the needs of clinical medicine. On the private Chinese dataset MMD2022, STCMN again achieves the best recognition results, indicating that the model generalizes well to Chinese depression recognition tasks.
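The pipeline described above can be sketched as follows. This is a minimal illustrative implementation, not the paper's actual code: it assumes PyTorch, arbitrary layer widths (a 128-dimensional feature space, 4 attention heads), a single dilated Conv1d standing in for the full TCN stack, and mean pooling before and after the Cross-Attention stage, none of which are specified in the abstract.

```python
# Illustrative sketch of the STCMN pipeline, assuming PyTorch and placeholder sizes.
import torch
import torch.nn as nn


class SentenceEncoder(nn.Module):
    """Sentence-level feature extraction: residual block + GRU + Self-Attention."""

    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.residual = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.gru = nn.GRU(dim, dim, batch_first=True)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                  # x: (batch, sentences, dim)
        x = x + self.residual(x)           # residual block refines per-sentence features
        x, _ = self.gru(x)                 # GRU carries context across sentences
        out, _ = self.attn(x, x, x)        # Self-Attention links related sentences
        return out


class STCMN(nn.Module):
    """Three-modality sketch: per-modality TCN stage + Cross-Attention fusion."""

    def __init__(self, dim: int = 128, heads: int = 4, classes: int = 2):
        super().__init__()
        self.encoders = nn.ModuleList(SentenceEncoder(dim, heads) for _ in range(3))
        # One dilated Conv1d per modality approximates the TCN global-feature stage.
        self.tcns = nn.ModuleList(
            nn.Conv1d(dim, dim, kernel_size=3, padding=2, dilation=2) for _ in range(3)
        )
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Sequential(nn.Linear(dim, classes), nn.LogSoftmax(dim=-1))

    def forward(self, audio, text, visual):    # each: (batch, sentences, dim)
        feats = []
        for enc, tcn, x in zip(self.encoders, self.tcns, (audio, text, visual)):
            h = enc(x)                                                   # sentence-level
            h = tcn(h.transpose(1, 2)).transpose(1, 2)[:, : x.size(1)]   # global (TCN)
            feats.append(h.mean(dim=1, keepdim=True))                    # pool sentences
        fused = torch.cat(feats, dim=1)          # (batch, 3 modalities, dim)
        # Cross-Attention lets each modality attend to the other modalities' features.
        fused, _ = self.cross_attn(fused, fused, fused)
        return self.classifier(fused.mean(dim=1))    # log-probabilities per class


# Usage: random tensors stand in for real audio/text/visual sentence embeddings.
model = STCMN()
a, t, v = (torch.randn(2, 10, 128) for _ in range(3))
print(model(a, t, v).shape)  # torch.Size([2, 2])
```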