Abstract:
The analysis of affect (e.g. emotions or mood), traits (e.g. personality), and social signals (e.g. frustration, disagreement) is of increasing interest in human-computer interaction, as it can drive human-machine communication closer to human-human communication. It has manifold applications, ranging from intelligent tutoring systems to affect-sensitive robots, and from smart call centers to patient telemonitoring. The study of computational paralinguistics, which covers the analysis of speaker states and traits, faces the real-life challenges of inter-speaker and inter-corpus variability. In this thesis, machine learning methods addressing these challenges are targeted. Automatic model selection methods are explored for modeling high-dimensional paralinguistic data; these approaches can deal with different sources of variability in a parametric manner. To provide statistical models and classifiers with a compact set of potent features, novel feature selection methods based on discriminative projections are introduced. In addition, multimodal fusion techniques are sought for robust affective computing in the wild. The proposed methods and approaches are validated on a set of recent challenge corpora, including the INTERSPEECH Computational Paralinguistics Challenge (2013-2015), the Audio-Visual Emotion Challenge (2013/2014), and the Emotion Recognition in the Wild Challenge 2014. The methods proposed in this thesis advance the state of the art on most of these corpora and yield competitive results on the others, while enjoying parsimony and computational efficiency.