This study reviews the performance of existing work on multimodal emotion recognition and proposes a model that fuses two modalities, taking the speaker's text and voice signals as input to detect depression. Based on the DAIC-WOZ dataset, voice features are extracted with a CNN, text features are extracted with a Transformer, and the two modalities are fused through a tensor fusion network. An LSTM in the final layer then classifies whether the speaker is depressed. This study suggests that access to mental illness diagnosis can be broadened by enabling patients to screen for depression on their own during daily conversations. If the proposed model is further developed and connected to a voice conversation system, patients who cannot visit a hospital regularly, or who are reluctant to do so, will find it easier to check their condition and pursue recovery. Furthermore, the approach can be extended to multi-label classification of various mental illnesses and used as a simple self-diagnosis tool.
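The fusion step described above, a tensor fusion network combining the CNN audio features and Transformer text features, can be sketched as below. This is a minimal illustration of the tensor-fusion operation (outer product of modality vectors each augmented with a constant 1); the embedding dimensions and variable names are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

# Hypothetical per-utterance embeddings: in the proposed model these would
# come from a CNN over the voice signal and a Transformer over the
# transcript (DAIC-WOZ); here they are random stand-ins.
audio_feat = np.random.rand(32)  # assumed CNN audio embedding
text_feat = np.random.rand(64)   # assumed Transformer text embedding

def tensor_fusion(a, t):
    """Tensor fusion of two modality vectors: append a constant 1 to each,
    then take the outer product, so the flattened result keeps both the
    unimodal features and all pairwise cross-modal interaction terms."""
    a1 = np.concatenate([a, [1.0]])
    t1 = np.concatenate([t, [1.0]])
    return np.outer(a1, t1).flatten()

fused = tensor_fusion(audio_feat, text_feat)
print(fused.shape)  # (33 * 65,) -> (2145,)
```

The fused vector would then feed the downstream classifier (here, the LSTM layer that outputs the depressed/not-depressed decision).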