Unimodal Feature-level improvement on Multimodal CMU-MOSEI Dataset: Uncorrelated and Convolved Feature Sets

Daniel Mora Melanchthon


This study investigates unimodal features (BERT embeddings for text, eGeMAPS for acoustics, and the OpenFace feature set for visuals) used on the multimodal CMU-MOSEI dataset for Emotion Recognition, seeking unimodal feature-level improvements. Two approaches are investigated: feature selection by hierarchically clustering each feature set according to its Spearman correlation values, and the use of Convolutional Neural Network (CNN) models as emotion feature extractors. Experiments are performed with Random Forest (RF) classifiers. The main results show, firstly, that using uncorrelated feature sets tends not to change model performance, while reducing trainable parameters, training time, and storage requirements. Secondly, the direct use of CNN embeddings with RF models yields improvements for the acoustic modality, which suggests that further gains could be sought through embedding acoustic features.
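The correlation-based feature selection described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the linkage method (average), the correlation threshold (0.8), and the rule of keeping one feature per cluster are all assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.stats import spearmanr


def select_uncorrelated_features(X, threshold=0.8):
    """Hierarchically cluster features by Spearman correlation and
    keep one representative per cluster.

    A sketch of the feature-selection idea; linkage method and
    threshold are illustrative assumptions, not the paper's values.
    """
    # Pairwise Spearman correlation between feature columns.
    corr, _ = spearmanr(X)
    corr = np.atleast_2d(corr)
    # Distance = 1 - |rho|: strongly correlated features are "close".
    dist = 1.0 - np.abs(corr)
    np.fill_diagonal(dist, 0.0)
    # Condensed upper-triangular distances for scipy's linkage.
    condensed = dist[np.triu_indices_from(dist, k=1)]
    Z = linkage(condensed, method="average")
    # Cut the dendrogram so clusters group features with |rho| >= threshold.
    labels = fcluster(Z, t=1.0 - threshold, criterion="distance")
    # Keep the first feature encountered in each cluster.
    keep, seen = [], set()
    for idx, lab in enumerate(labels):
        if lab not in seen:
            seen.add(lab)
            keep.append(idx)
    return keep
```

The reduced index list can then be used to slice the original feature matrix before training the downstream RF model, shrinking its input dimensionality.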
