Version 1: Received: 22 March 2024 / Approved: 26 March 2024 / Online: 27 March 2024 (06:11:04 CET)
How to cite:
Kim, J.-H.; Kim, N.; Hong, M.; Won, C. CCA-Transformer: Cascaded Cross-Attention Based Transformer for Facial Analysis in Multi-modal Data. Preprints 2024, 2024031629. https://doi.org/10.20944/preprints202403.1629.v1
APA Style
Kim, J. H., Kim, N., Hong, M., & Won, C. (2024). CCA-Transformer: Cascaded Cross-Attention Based Transformer for Facial Analysis in Multi-modal Data. Preprints. https://doi.org/10.20944/preprints202403.1629.v1
Chicago/Turabian Style
Kim, J.-H., N. Kim, M. Hong, and C. Won. 2024. "CCA-Transformer: Cascaded Cross-Attention Based Transformer for Facial Analysis in Multi-modal Data." Preprints. https://doi.org/10.20944/preprints202403.1629.v1
Abstract
One of the most crucial elements in understanding humans on a psychological level is manifested through facial expressions. The analysis of human behavior can be informed by facial expressions, making it essential to employ indicators such as expression (Expr), valence-arousal (VA), and action units (AU). In this paper, we introduce the method we proposed for the Challenge of the 6th Workshop and Competition on Affective Behavior Analysis in-the-wild (ABAW) at CVPR 2024. Our method utilizes the multi-modal Aff-Wild2 dataset, which is split into spatial and audio modalities. For the spatial data, we extract features using a SimMIM model pre-trained on a diverse set of facial expression data. For the audio data, we extract features using a Wav2Vec model. To fuse the extracted spatial and audio features, we employ the cascaded cross-attention mechanism of a transformer.
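The fusion step described above can be sketched in PyTorch. This is a minimal illustration of cascaded cross-attention between two modality streams, not the authors' implementation: the module name, dimensions, number of heads, and the exact cascade order (spatial attends to audio, then audio attends to the fused result) are assumptions for the example.

```python
import torch
import torch.nn as nn

class CascadedCrossAttentionFusion(nn.Module):
    """Hypothetical two-stage cascaded cross-attention fusion.

    Stage 1: spatial queries attend to audio keys/values.
    Stage 2: audio queries attend to the stage-1 fused features.
    Feature dim and head count are illustrative assumptions.
    """
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.cross1 = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross2 = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, spatial: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # spatial: (B, Ts, D) e.g. SimMIM frame features
        # audio:   (B, Ta, D) e.g. Wav2Vec features projected to D
        fused1, _ = self.cross1(query=spatial, key=audio, value=audio)
        fused1 = self.norm1(fused1 + spatial)  # residual connection
        fused2, _ = self.cross2(query=audio, key=fused1, value=fused1)
        return self.norm2(fused2 + audio)      # (B, Ta, D)

fusion = CascadedCrossAttentionFusion()
out = fusion(torch.randn(2, 8, 256), torch.randn(2, 6, 256))
print(out.shape)  # torch.Size([2, 6, 256])
```

The cascade lets each stage condition on the output of the previous one, rather than fusing both modalities in a single joint attention pass.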
Keywords
face analysis; expression; valence-arousal; action unit
Subject
Computer Science and Mathematics, Artificial Intelligence and Machine Learning
Copyright:
This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.