2017
Enhancing acoustic-based human trait recognition using intermediate features
Title: Enhancing acoustic-based human trait recognition using intermediate features
Speaker: Nithin Thomas
Date: 28-11-2017, 11 AM
Building/Room: Eichleitnerstraße 30 / 207
Contact: Master's thesis supervised by Dr. Hesam Sagha and Prof. Dr. habil. Björn Schuller at the University of Passau
Feature Set Optimisation for Multi-Lingual Emotion Recognition
Title: Feature Set Optimisation for Multi-Lingual Emotion Recognition
Speaker: Revathi Sadanand
Date: 28-11-2017, 11 AM
Building/Room: Eichleitnerstraße 30 / 207
Contact: Master's thesis supervised by Dr. Hesam Sagha and Prof. Dr. habil. Björn Schuller at the University of Passau
End-to-End Audio Laughter Detection
Title: End-to-End Audio Laughter Detection
Speaker: Muhammad Mashood Tanveer
Date: 28-11-2017, 11 AM
Building/Room: Eichleitnerstraße 30 / 207
Contact: Master's thesis supervised by Dr. Hesam Sagha and Prof. Dr. habil. Björn Schuller at the University of Passau
DE-ENIGMA 'Advancing Humanoid Robotics for Children on the Autism Spectrum'
There are over 5 million people with autism in the European Union. If their families are included, autism touches the lives of over 20 million Europeans. It affects the way a person communicates, understands and relates to others. People with autism often have difficulty using and understanding verbal and non-verbal language, which makes it hard for them to understand others and interact with them. Getting the right support and therapies makes a substantial difference to people with autism. The overall aim of the DE-ENIGMA project is to realise robust, context-sensitive, multimodal and naturalistic human-robot interaction (HRI) aimed at enhancing the social imagination skills of children with autism. This goes considerably beyond the current state of the art in machine analysis of the facial, bodily, vocal and verbal behaviour used in (commercially and otherwise) available human-centric HRI applications.
Title: DE-ENIGMA 'Advancing Humanoid Robotics for Children on the Autism Spectrum'
Speaker: Ms. Alice Baird
Date: 21-11-2017
Building/Room: Eichleitnerstraße 30 / 207
Contact: University of Augsburg
An Image-based Deep Spectrum Feature Representation for the Recognition of Emotional Speech
The outputs of the higher layers of deep pre-trained convolutional neural networks (CNNs) have consistently been shown to provide a rich representation of an image for use in recognition tasks. This study explores the suitability of such an approach for speech-based emotion recognition tasks. First, we detail a new acoustic feature representation, denoted deep spectrum features, derived by feeding spectrograms through a very deep image classification CNN and forming a feature vector from the activations of the last fully connected layer. We then compare the performance of our novel features with standardised brute-force and bag-of-audio-words (BoAW) acoustic feature representations for 2- and 5-class speech-based emotion recognition in clean, noisy and denoised conditions. The presented results show that image-based approaches are a promising avenue of research for speech-based recognition tasks. Key results indicate that deep spectrum features are comparable in performance with the other tested acoustic feature representations under matched-noise-type train-test conditions; however, the BoAW paradigm is better suited to cross-noise-type train-test conditions.
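A minimal sketch of the deep spectrum idea described above: render a spectrogram as an image, pass it through a pre-trained image CNN, and keep the activations of the last fully connected layer as the acoustic feature vector. The library choices (librosa, torchvision's VGG16) and parameters here are illustrative assumptions, not details taken from the talk.

```python
# Sketch: spectrogram -> pre-trained image CNN -> last fully connected layer.
import librosa
import numpy as np
import torch
import torchvision.models as models
import torchvision.transforms as T

def deep_spectrum_features(wav_path):
    y, sr = librosa.load(wav_path, sr=16000)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
    mel_db = librosa.power_to_db(mel, ref=np.max)
    # Scale to a 3-channel uint8 "image" so the image CNN accepts it.
    scaled = (mel_db - mel_db.min()) / (mel_db.max() - mel_db.min() + 1e-8)
    img = np.stack([(255 * scaled).astype(np.uint8)] * 3, axis=-1)

    vgg = models.vgg16(weights="IMAGENET1K_V1").eval()
    # Drop the final classification layer; keep the last fully connected layer.
    vgg.classifier = torch.nn.Sequential(*list(vgg.classifier.children())[:-1])

    preprocess = T.Compose([T.ToPILImage(), T.Resize((224, 224)), T.ToTensor(),
                            T.Normalize(mean=[0.485, 0.456, 0.406],
                                        std=[0.229, 0.224, 0.225])])
    with torch.no_grad():
        feats = vgg(preprocess(img).unsqueeze(0))   # shape: (1, 4096)
    return feats.squeeze(0).numpy()
```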
Title: An Image-based Deep Spectrum Feature Representation for the Recognition of Emotional Speech
Speaker: Shahin Amiriparian
Date: 21-11-2017
Building/Room: Eichleitnerstraße 30 / 207
Contact: University of Augsburg / TUM
Sequence to Sequence Autoencoders for Unsupervised Representation Learning from Audio
This paper describes our contribution to the Acoustic Scene Classification task of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE 2017). We propose a system for this task using a recurrent sequence-to-sequence autoencoder for unsupervised representation learning from raw audio files. First, we extract mel-spectrograms from the raw audio files. Second, we train a recurrent sequence-to-sequence autoencoder on these spectrograms, which are treated as sequences of time-dependent frequency vectors. Then, we extract the learnt representations of the spectrograms from a fully connected layer between the decoder and encoder units and use them as feature vectors for the corresponding audio instances. Finally, we train a multilayer perceptron neural network on these feature vectors to predict the class labels. An accuracy of 88.0% is achieved on the official development set of the challenge – a relative improvement of 17.7% over the challenge baseline.
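A minimal PyTorch sketch of a recurrent sequence-to-sequence autoencoder whose bottleneck (a fully connected layer between encoder and decoder) yields one fixed-length representation per spectrogram. The dimensions, single-layer GRUs and teacher forcing are illustrative assumptions, not the exact setup of the talk.

```python
import torch
import torch.nn as nn

class Seq2SeqAutoencoder(nn.Module):
    def __init__(self, n_mels=128, hidden=256, latent=128):
        super().__init__()
        self.encoder = nn.GRU(n_mels, hidden, batch_first=True)
        self.to_latent = nn.Linear(hidden, latent)     # representation layer
        self.from_latent = nn.Linear(latent, hidden)
        self.decoder = nn.GRU(n_mels, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_mels)

    def forward(self, x):                  # x: (batch, time, n_mels)
        _, h = self.encoder(x)             # h: (1, batch, hidden)
        z = self.to_latent(h[-1])          # learnt representation, (batch, latent)
        h0 = self.from_latent(z).unsqueeze(0)
        # Teacher forcing: the decoder reconstructs the input frame sequence.
        dec_in = torch.zeros_like(x)
        dec_in[:, 1:] = x[:, :-1]
        y, _ = self.decoder(dec_in, h0)
        return self.out(y), z

model = Seq2SeqAutoencoder()
spec = torch.randn(4, 100, 128)             # a toy batch of mel-spectrograms
recon, representation = model(spec)
loss = nn.functional.mse_loss(recon, spec)  # train to reconstruct; use z downstream
```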
Title: Sequence to Sequence Autoencoders for Unsupervised Representation Learning from Audio
Speaker: Shahin Amiriparian
Date: 14-11-2017
Building/Room: Eichleitnerstraße 30 / 207
Contact: University of Augsburg / TUM
Feature Selection in Multimodal Continuous Emotion Prediction
Advances in affective computing have been made by combining information from different modalities, such as audio, video, and physiological signals. However, increasing the number of modalities also increases the dimensionality of the associated feature vectors, leading to higher computational cost and possibly lower prediction performance. In this regard, we present a comparative study of feature reduction methodologies for continuous emotion recognition. We compare dimensionality reduction by principal component analysis, filter-based feature selection using canonical correlation analysis and correlation-based feature selection, as well as wrapper-based feature selection with sequential forward selection and competitive swarm optimisation. These approaches are evaluated on the AV+EC-2015 database using support vector regression. Our results demonstrate that the wrapper-based approaches typically outperform the other methodologies while pruning a large number of irrelevant features.
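As an illustrative sketch of two of the compared strategies (not the talk's actual experiments), the snippet below contrasts PCA-based dimensionality reduction with wrapper-based sequential forward selection, both wrapped around support vector regression in scikit-learn. The data are synthetic placeholders, not the AV+EC-2015 features.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))                            # 200 frames, 50 features
y = X[:, :5].sum(axis=1) + 0.1 * rng.normal(size=200)     # only 5 are relevant

pca_model = make_pipeline(PCA(n_components=10), SVR())
sfs_model = make_pipeline(
    SequentialFeatureSelector(SVR(), n_features_to_select=10, direction="forward"),
    SVR(),
)

for name, model in [("PCA", pca_model), ("SFS wrapper", sfs_model)]:
    score = cross_val_score(model, X, y, cv=3, scoring="r2").mean()
    print(f"{name}: mean R^2 = {score:.3f}")
```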
Title: Feature Selection in Multimodal Continuous Emotion Prediction
Speaker: Shahin Amiriparian
Date: 14-11-2017
Building/Room: Eichleitnerstraße 30 / 207
Contact: University of Augsburg / TUM
CAST a database: Rapid targeted large-scale big data acquisition via small-world modelling of social media platforms
The adage that there is no data like more data is not new in affective computing; however, with recent advances in deep learning technologies, such as end-to-end learning, the need for extracting big data is greater than ever. Multimedia resources available on social media represent a wealth of data more than large enough to satisfy this need. However, an often prohibitive amount of effort has been required to source and label such instances. As a solution, we introduce Cost-efficient Audio-visual Acquisition via Social-media Small-world Targeting (CAS2T) for efficient large-scale big data collection from online social media platforms. Our system is based on a unique combination of small-world modelling, unsupervised audio analysis, and semi-supervised active learning. Such an approach facilitates rapid training on entirely new tasks sourced in their entirety from social multimedia. We demonstrate the high capability of our methodology via the collection of original datasets containing a range of naturalistic, in-the-wild examples of human behaviours.
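Purely as an illustration of the small-world property the approach relies on (this is not the talk's crawler), the snippet below expands outward from a few seed nodes of a Watts-Strogatz small-world graph and shows how quickly neighbour-of-neighbour links cover the network, which is what makes rapid targeted acquisition feasible.

```python
import networkx as nx

graph = nx.watts_strogatz_graph(n=10000, k=10, p=0.1, seed=1)  # small-world model
seeds = [0, 1, 2]                                              # a few seed accounts

frontier, visited = set(seeds), set(seeds)
for hop in range(1, 6):
    frontier = {nbr for node in frontier for nbr in graph[node]} - visited
    visited |= frontier
    print(f"hop {hop}: reached {len(visited)} of {graph.number_of_nodes()} nodes")
```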
Title: CAST a database: Rapid targeted large-scale big data acquisition via small-world modelling of social media platforms
Speaker: Shahin Amiriparian
Date: 14-11-2017
Building/Room: Eichleitnerstraße 30 / 207
Contact: University of Augsburg / TUM
The SEILS dataset: Symbolically Encoded Scores in Modern Ancient Notation for Computational Musicology
The automatic analysis of notated Renaissance music is restricted by a shortfall in codified repertoire. Thousands of scores have been digitised by music libraries across the world, but the absence of symbolically codified information makes these inaccessible for computational evaluation. Optical Music Recognition (OMR) has made great progress in addressing this issue; however, early notation is still an ongoing challenge for OMR. To this end, we present the Symbolically Encoded “Il Lauro Secco” (SEILS) dataset, a new dataset of codified scores for use within computational musicology. We focus on a collection of Italian madrigals from the 16th century, a polyphonic secular a cappella genre characterised by strong musical-linguistic synergies. Thirty madrigals for five unaccompanied voices are presented in modern and early notation, considering a variety of digital formats: Lilypond, MusicXML, MIDI, and Finale (a total of 150 symbolically codified scores). Given the musical and poetic value of the chosen repertoire, we aim to promote synergies between computational musicology and linguistics.
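As a small sketch of how such symbolically encoded scores could be analysed computationally, the snippet below loads a MusicXML file with music21; the library choice and the file name are assumptions made for illustration, not part of the dataset's documentation.

```python
from music21 import converter

score = converter.parse("madrigal_in_musicxml.xml")   # hypothetical SEILS file
print(len(score.parts), "voices")                      # the madrigals have five
for part in score.parts:
    notes = part.flatten().notes
    print(part.partName, "-", len(notes), "notes")
```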
Title: The SEILS dataset: Symbolically Encoded Scores in Modern Ancient Notation for Computational Musicology
Speaker: Emilia Parada-Cabaleiro
Date: 07-11-2017
Building/Room: Eichleitnerstraße 30 / 207
Contact: University of Augsburg
Web- and mobile-based intervention to enhance health
Title: Web- and mobile-based intervention to enhance health
Speaker: Dr. Eva Maria Rathner
Date: 03-11-2017, 15:00
Building/Room: Eichleitnerstraße 30 / 207
Contact: Department of Clinical Psychology and Psychotherapy, University of Ulm
Wavelets Revisited for the Classification of Acoustic Scenes
We investigate the effectiveness of wavelet features for acoustic scene classification as a contribution to the subtask of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE2017). On the back-end side, gated recurrent neural networks (GRNNs) are compared against traditional support vector machines (SVMs). We observe that the proposed wavelet features perform comparably to the typically used temporal and spectral features in the classification of acoustic scenes. Further, a late fusion of models trained on wavelets and on typical acoustic features reaches the best averaged 4-fold cross-validation accuracies of 83.2% with SVMs and 82.6% with GRNNs; both significantly outperform the baseline (74.8%) on the official development set (p < 0.001, one-tailed z-test).
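A minimal sketch of the wavelet front end feeding an SVM back end, on placeholder data; the wavelet family, decomposition level and per-band statistics are assumptions for illustration, not necessarily those compared in the talk.

```python
import numpy as np
import pywt
from sklearn.svm import SVC

def wavelet_features(frame, wavelet="db4", level=5):
    # Discrete wavelet decomposition; summarise each sub-band by two statistics.
    coeffs = pywt.wavedec(frame, wavelet, level=level)
    feats = []
    for band in coeffs:
        feats.append(np.log(np.sum(band ** 2) + 1e-10))  # sub-band log-energy
        feats.append(np.std(band))                       # sub-band spread
    return np.array(feats)

rng = np.random.default_rng(0)
frames = rng.normal(size=(100, 4096))              # placeholder audio frames
labels = rng.integers(0, 15, size=100)             # 15 acoustic scene classes
X = np.vstack([wavelet_features(f) for f in frames])
clf = SVC(kernel="linear").fit(X, labels)          # SVM back end
```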
Title: Wavelets Revisited for the Classification of Acoustic Scenes
Speaker: Kun Qian
Date: 24-10-2017
Building/Room: Eichleitnerstraße 30 / 207
Contact: University of Augsburg / TUM
Deep Sequential Image Features for Acoustic Scene Classification
For the Acoustic Scene Classification task of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE2017), we propose a novel method to classify 15 different acoustic scenes using deep sequential learning, based on features extracted from the Short-Time Fourier Transform and scalograms of the audio scenes using convolutional neural networks. To the best of our knowledge, this is the first investigation of bump and Morse scalograms for acoustic scene classification in this context. First, segmented audio waves are transformed into a spectrogram and two types of scalograms; then, 'deep features' are extracted from these using the pre-trained VGG16 model by probing at the fully connected layer. These representations are fed separately into gated recurrent neural networks for classification. Predictions from the three systems are finally combined by a margin sampling value strategy. On the official development set of the challenge, the best accuracy on a four-fold cross-validation setup is 80.9%, an increase of 6.1% absolute over the official baseline (p < .001, one-tailed z-test).
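A sketch of the margin-sampling fusion step mentioned above: for each test instance, keep the decision of whichever system has the largest margin between its two highest class probabilities. The probability arrays below are random placeholders standing in for the three system outputs (spectrogram, bump and Morse scalogram streams).

```python
import numpy as np

def fuse_by_margin(prob_list):
    """prob_list: list of (n_instances, n_classes) probability arrays."""
    probs = np.stack(prob_list)                        # (n_systems, n, c)
    sorted_p = np.sort(probs, axis=-1)
    margins = sorted_p[..., -1] - sorted_p[..., -2]    # top-1 minus top-2
    best_system = margins.argmax(axis=0)               # most confident system per instance
    n = probs.shape[1]
    return probs[best_system, np.arange(n)].argmax(axis=-1)

rng = np.random.default_rng(0)
three_systems = [rng.dirichlet(np.ones(15), size=8) for _ in range(3)]
print(fuse_by_margin(three_systems))                   # fused class per instance
```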
Title: Deep Sequential Image Features for Acoustic Scene Classification
Speaker: Zhao Ren
Date: 24-10-2017
Building/Room: Eichleitnerstraße 30 / 207
Contact: University of Augsburg
Computer Vision for Human Facial Expression Analysis
Title: Computer Vision for Human Facial Expression Analysis
Speaker: Michel Valstar
Date: 18-10-2017
Building/Room: Eichleitnerstraße 30 / 306
Contact: University of Nottingham
VoicePlay – An Affective Sports Game Operated by Speech Emotion Recognition based on the Component Process Model
Title: VoicePlay – An Affective Sports Game Operated by Speech Emotion Recognition based on the Component Process Model
Speaker: Gerhard Hagerer
Date: 17-10-2017
Building/Room: Eichleitnerstraße 30 / 306
Contact: University of Augsburg
Sentiment Analysis Using Image-based Deep Spectrum Features
We test the suitability of our novel deep spectrum feature representation for performing speech-based sentiment analysis. Deep spectrum features are formed by passing spectrograms through a pre-trained image convolutional neural network (CNN) and have been shown to capture useful emotion information in speech; however, their usefulness for sentiment analysis is yet to be investigated. Using a dataset of movie reviews collected from YouTube, we compare deep spectrum features combined with the bag-of-audio-words (BoAW) paradigm against a state-of-the-art Mel-Frequency Cepstral Coefficient (MFCC) based BoAW system on a binary sentiment classification task. Key results indicate the suitability of both features for the proposed task. The deep spectrum features achieve an unweighted average recall of 74.5%. The results provide further evidence for the effectiveness of deep spectrum features as a robust feature representation for speech analysis.
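A minimal sketch of the bag-of-audio-words paradigm referred to above: learn a codebook over frame-level descriptors (e.g. MFCC or deep spectrum frames), describe each clip by a normalised histogram of its assigned codewords, and train a linear classifier. The frame features below are random placeholders, not real descriptors.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
clips = [rng.normal(size=(rng.integers(80, 120), 39)) for _ in range(40)]
labels = rng.integers(0, 2, size=40)                   # binary sentiment labels

codebook = KMeans(n_clusters=64, n_init=10, random_state=0)
codebook.fit(np.vstack(clips))                         # codebook over all frames

def boaw_histogram(frames):
    words = codebook.predict(frames)                   # assign frames to codewords
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / hist.sum()                           # normalise by clip length

X = np.vstack([boaw_histogram(c) for c in clips])
clf = LinearSVC().fit(X, labels)
```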
Title: Sentiment Analysis Using Image-based Deep Spectrum Features
Speaker: Shahin Amiriparian
Date: 17-10-2017
Building/Room: Eichleitnerstraße 30 / 207
Contact: University of Augsburg / TUM
From Hard to Soft: Towards more Human-like Emotion Recognition by Modelling the Perception Uncertainty
Over the last decade, automatic emotion recognition has become well established. The gold-standard target is thereby usually calculated from multiple annotations by different raters. All related efforts assume that the emotional state of a human subject can be identified by a 'hard' category or a unique value. This assumption attempts to average out the subjectivity of human observers when judging patterns such as the emotional state of others. However, as the number of annotators cannot be infinite, uncertainty remains in the emotion target even when it is calculated from several, yet few, human annotators. The common procedure of using this same emotion target in the learning process thus inevitably introduces noise in the form of an uncertain learning target. In this light, we propose a 'soft' prediction framework to provide a more human-like and comprehensive prediction of emotion. In contrast to the traditional framework, which merely produces one single prediction (category or value), our novel framework provides an additional target indicating the uncertainty of human perception, based on the inter-rater disagreement level. To exploit the dependency between the emotional state and the newly introduced perception uncertainty, we implement a multi-task learning strategy. To evaluate the feasibility and effectiveness of the proposed soft prediction framework, we perform extensive experiments on a time- and value-continuous spontaneous audiovisual emotion database, including late fusion results. We show that the soft prediction framework with multi-task learning of the emotional state and its perception uncertainty significantly outperforms the individual tasks in both the arousal and valence dimensions.
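A minimal PyTorch sketch of the multi-task idea: a shared trunk with two regression heads, one predicting the emotion gold standard (mean over raters) and one the perception uncertainty (here the standard deviation across raters). The architecture, loss weighting and toy targets are assumptions for illustration, not the talk's exact model.

```python
import torch
import torch.nn as nn

class SoftEmotionModel(nn.Module):
    def __init__(self, n_features=88):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU())
        self.emotion_head = nn.Linear(64, 1)       # e.g. arousal or valence
        self.uncertainty_head = nn.Linear(64, 1)   # inter-rater disagreement

    def forward(self, x):
        h = self.trunk(x)
        return self.emotion_head(h), self.uncertainty_head(h)

model = SoftEmotionModel()
features = torch.randn(32, 88)
ratings = torch.randn(32, 6)                       # 6 annotators (toy values)
target_emotion = ratings.mean(dim=1, keepdim=True)
target_uncertainty = ratings.std(dim=1, keepdim=True)

pred_e, pred_u = model(features)
loss = nn.functional.mse_loss(pred_e, target_emotion) \
       + 0.5 * nn.functional.mse_loss(pred_u, target_uncertainty)
loss.backward()                                    # joint multi-task update
```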
Title: From Hard to Soft: Towards more Human-like Emotion Recognition by Modelling the Perception Uncertainty
Speaker: Jing Han
Date: 17-10-2017
Building/Room: Eichleitnerstraße 30 / 207
Contact: University of Augsburg
The Perception of Emotion in the Singing Voice
With the increased usage of internet-based services and the mass of digital content now available online, the organisation of such content has become a major topic of interest both commercially and within academic research. Emotional understanding of this content is a relevant parameter not only for music classification within digital libraries but also for improving users' experiences via services such as automated music recommendation. Despite the singing voice being well known for the natural communication of emotion, it is still unclear which specific musical characteristics of this signal are involved in such affective expressions. The presented study investigates which musical parameters of singing relate to emotional content by evaluating the perception of emotion in electronically manipulated a cappella audio samples. A group of 24 individuals participated in a perception test evaluating the emotional dimensions of arousal and valence of 104 sung instances. Key results indicate that the rhythmic-melodic contour is potentially related to the perception of arousal, whereas musical syntax and tempo can alter the perception of valence.
Title: The Perception of Emotion in the Singing Voice
Speaker: Emilia Parada-Cabaleiro
Date: 17-10-2017
Building/Room: Eichleitnerstraße 30 / 207
Contact: University of Augsburg