In human-to-human communication, speech signals carry rich emotional cues that are further emphasized by affect-expressive gestures. In this regard, automatic synthesis and animation of gestures accompanying affective verbal communication can help to create more naturalistic virtual agents in human-computer interaction systems. Speech-driven gesture synthesis can map emotional cues of the speech signal to affect-expressive gestures by modeling the complex variability and timing relationships between speech and gesture. In ...
View details for https://www.sciencedirect.com/science/article/pii/S0167639319301980
Vocal tract (VT) contour detection in real-time MRI (rtMRI) is a preliminary stage for many speech production related applications such as articulatory analysis and synthesis. In this work, we present an algorithm for robust detection of keypoints on the vocal tract in rtMRI sequences using fully convolutional networks (FCN) via a heatmap regression approach. We also introduce a spatio-temporal stabilization scheme based on a combination of Principal Component Analysis (PCA) and a Kalman filter (KF) to extract landmarks that are stable in space and time. The ...
View details for https://ieeexplore.ieee.org/abstract/document/9054332/
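As an illustration of the stabilization step, the sketch below applies a constant-velocity Kalman filter to a single noisy keypoint coordinate. The trajectory is synthetic and the filter is a simplified stand-in for the combined PCA and KF scheme, not the paper's exact implementation.

```python
import numpy as np

def kalman_smooth_1d(z, q=1e-3, r=1e-2):
    """Constant-velocity Kalman filter over one keypoint coordinate z[t]."""
    F = np.array([[1.0, 1.0], [0.0, 1.0]])   # state transition (position, velocity)
    H = np.array([[1.0, 0.0]])               # we observe position only
    Q = q * np.eye(2)                        # process noise
    R = np.array([[r]])                      # measurement noise
    x = np.array([z[0], 0.0])                # initial state
    P = np.eye(2)
    out = []
    for zt in z:
        # predict
        x = F @ x
        P = F @ P @ F.T + Q
        # update with the new measurement
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)
        x = x + K @ (np.array([zt]) - H @ x)
        P = (np.eye(2) - K @ H) @ P
        out.append(x[0])
    return np.array(out)

# toy example: noisy x-coordinate of one keypoint over 100 rtMRI frames
t = np.arange(100)
noisy_x = np.sin(t / 10.0) + 0.05 * np.random.randn(100)
stable_x = kalman_smooth_1d(noisy_x)
```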
We present a novel method for training a social robot to generate backchannels during human-robot interaction. We address the problem within an off-policy reinforcement learning framework, and show how a robot may learn to produce non-verbal backchannels like laughs, when trained to maximize the engagement and attention of the user. A major contribution of this work is the formulation of the problem as a Markov decision process (MDP) with states defined by the speech activity of the user and rewards generated by ...
View details for https://arxiv.org/abs/1908.01618
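A minimal tabular Q-learning sketch of the MDP formulation: a hypothetical three-state speech-activity discretization, a two-action space (do nothing or laugh) and a placeholder engagement reward. None of these particular choices are taken from the paper; they only show the shape of the off-policy update.

```python
import numpy as np

# Hypothetical discretization: 0 = user silent, 1 = user speaking, 2 = user paused
N_STATES, N_ACTIONS = 3, 2            # actions: 0 = do nothing, 1 = laugh backchannel
Q = np.zeros((N_STATES, N_ACTIONS))
alpha, gamma, eps = 0.1, 0.9, 0.2

def fake_engagement_reward(state, action):
    # placeholder for an engagement/attention estimate from the real interaction
    return 1.0 if (state == 2 and action == 1) else 0.0

rng = np.random.default_rng(0)
state = 0
for step in range(5000):
    action = rng.integers(N_ACTIONS) if rng.random() < eps else int(Q[state].argmax())
    reward = fake_engagement_reward(state, action)
    next_state = rng.integers(N_STATES)   # stand-in for the next observed speech activity
    # off-policy Q-learning update
    Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
    state = next_state

print(Q)   # the learned values favor laughing when the user pauses
```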
The ability of an agent to generate appropriate verbal and nonverbal backchannels during human-robot interaction greatly enhances the interaction experience. Backchannels are particularly important in applications like tutoring and counseling, which require constant attention and engagement of the user. We present here a method for training a robot for backchannel generation during human-robot interaction within the reinforcement learning (RL) framework, with the goal of maintaining a high level of engagement. Since online learning ...
View details for https://ieeexplore.ieee.org/abstract/document/8925443/
Abstract — Emotion recognition from speech has recently become an important research area ... the recording, the original emotion annotations corresponding to the recording, and the emotion predictions ... the user ... Taking into account that changes in emotional state are not abrupt, larger filters ...
View details for https://ieeexplore.ieee.org/abstract/document/8806402/
(Yeşil, 2005). By nature, humans use body language more intensively than verbal communication when interacting (Borg, 2009) ... need to give importance to communication (Miller, 1988). The teacher's behavior in classroom communication ... Baltaş and Baltaş (2005) ... face-to-face dyadic ...
In this paper we present a data-driven vocal tract area function (VTAF) estimation using Deep Neural Networks (DNN). We approach the VTAF estimation problem based on sequence-to-sequence learning neural networks, where regression over a sliding window is used to learn an arbitrary non-linear one-to-many mapping from the input feature sequence to the target articulatory sequence. We propose two schemes for efficient estimation of the VTAF: (1) a direct estimation of the area function values and (2) an indirect estimation via ...
View details for https://ieeexplore.ieee.org/abstract/document/8639582/
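A rough sketch of the sliding-window regression idea, mapping a window of acoustic frames to per-frame area-function values. The data are random placeholders and the scikit-learn MLP is a shallow stand-in for the paper's DNN.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
T, n_feat, n_sections, win = 2000, 13, 32, 5     # frames, acoustic dims, VTAF sections, window

acoustic = rng.standard_normal((T, n_feat))      # placeholder acoustic feature sequence
area_fn = rng.standard_normal((T, n_sections))   # placeholder target area functions

# build sliding-window inputs: each frame sees +/- (win // 2) neighbouring frames
pad = win // 2
padded = np.pad(acoustic, ((pad, pad), (0, 0)), mode="edge")
X = np.stack([padded[t:t + win].ravel() for t in range(T)])

model = MLPRegressor(hidden_layer_sizes=(256, 128), max_iter=100)  # short training, sketch only
model.fit(X, area_fn)                            # one-to-many regression over the window
pred = model.predict(X[:10])                     # predicted area-function values
```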
In this paper we present a deep learning multimodal approach for speech-driven generation of face animations. Training a speaker-independent model capable of generating different emotions of the speaker is crucial for realistic animations. Unlike previous approaches, which use either acoustic features or phoneme label features to estimate the facial movements, we utilize both modalities to generate natural-looking, speaker-independent lip animations synchronized with affective speech. A phoneme-based model qualifies ...
View details for https://ieeexplore.ieee.org/abstract/document/8659713/
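A toy sketch of the multimodal fusion idea: continuous acoustic features are concatenated with one-hot phoneme labels and regressed onto lip landmark coordinates. The data, dimensions and the shallow MLP are placeholders standing in for the deep model.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
T, n_acoustic, n_phones, n_lip = 1500, 26, 40, 16   # frames, dims, phoneme set, lip coords

acoustic = rng.standard_normal((T, n_acoustic))     # placeholder acoustic features
phonemes = rng.integers(0, n_phones, size=T)        # placeholder frame-level phoneme labels
lips = rng.standard_normal((T, n_lip))              # placeholder lip landmark targets

# fuse both modalities: continuous acoustics + one-hot phoneme identity
phone_onehot = np.eye(n_phones)[phonemes]
X = np.hstack([acoustic, phone_onehot])

model = MLPRegressor(hidden_layer_sizes=(128, 64), max_iter=100)
model.fit(X, lips)
predicted_lips = model.predict(X[:5])               # lip coordinates for the first frames
```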
Automated recognition of an infant's cry from audio can be considered a preliminary step for applications like remote baby monitoring. In this paper, we implement a recently introduced deep learning topology called the capsule network (CapsNet) for the cry recognition problem. A capsule in the CapsNet, defined as a new representation, is a group of neurons whose activity vector represents the probability that the entity exists. Active capsules at one level make predictions, via transformation matrices, for the parameters of ...
View details for https://www.isca-speech.org/archive/Interspeech_2018/pdfs/2187.pdf
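For reference, a small NumPy sketch of the capsule non-linearity: the squash function keeps a capsule's orientation and maps its length into [0, 1), so the length can be read as an existence probability. The capsule sizes here are arbitrary and unrelated to the paper's architecture.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """CapsNet squashing non-linearity: keeps direction, maps length into [0, 1)."""
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    scale = sq_norm / (1.0 + sq_norm) / np.sqrt(sq_norm + eps)
    return scale * s

# toy capsule outputs: 5 capsules, each an 8-dimensional activity vector
caps = np.random.randn(5, 8)
v = squash(caps)
existence_prob = np.linalg.norm(v, axis=-1)   # vector length ~ probability the entity exists
print(existence_prob)                         # each value lies in [0, 1)
```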
Head-nods and turn-taking both contribute significantly to the conversational dynamics in dyadic interactions. Timely prediction and use of these events is quite valuable for dialog management systems in human-robot interaction. In this study, we present an audio-visual prediction framework for head-nod and turn-taking events that can also be utilized in real-time systems. Prediction systems based on Support Vector Machines (SVM) and Long Short-Term Memory Recurrent Neural Networks (LSTM-RNN) are trained on human-human ...
View details for https://iui.ku.edu.tr/wp-content/uploads/2018/06/is2018_cameraReady.pdf
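A compact PyTorch sketch of an LSTM-based event predictor over windowed audio-visual feature sequences. The data, dimensions and training loop below are synthetic placeholders rather than the paper's setup, and the SVM branch is omitted.

```python
import torch
import torch.nn as nn

# synthetic feature windows: batch of sequences, each frame holds audio + visual features
B, T, D = 64, 50, 20
x = torch.randn(B, T, D)
y = torch.randint(0, 2, (B,)).float()        # 1 = head-nod (or turn-taking) event follows

class EventPredictor(nn.Module):
    def __init__(self, feat_dim, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, x):
        _, (h, _) = self.lstm(x)             # h: final hidden state, shape (1, B, hidden)
        return self.out(h[-1]).squeeze(-1)   # one logit per sequence

model = EventPredictor(D)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()
for epoch in range(5):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()
```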
Food intake analysis is a crucial step in developing an automated dietary monitoring system. Processing of eating sounds delivers important cues for food intake monitoring. Recent studies on detection of eating activity generally utilize multimodal data from multiple sensors with conventional feature engineering techniques. In this study, we aim to develop a methodology for detection of ingestion sounds, namely swallowing and chewing, from food intake sounds recorded during a meal. Our methodology relies on feature learning in ...
View details for https://ieeexplore.ieee.org/abstract/document/8551492/
This paper addresses the problem of evaluating the engagement of the human participant by combining verbal and nonverbal behaviour along with contextual information. This study is carried out on four different corpora. Four different systems, designed to explore essential and complementary aspects of the JOKER system in terms of paralinguistic/linguistic inputs, were used for the data collection. An annotation scheme dedicated to the labeling of verbal and non-verbal behavior has been designed. From our ...
View details for https://ieeexplore.ieee.org/abstract/document/8373903/
In this paper, we analyze the role of the hidden bias in the representational efficiency of Gaussian-Bipolar Restricted Boltzmann Machines (GBPRBMs), which are similar to the widely used Gaussian-Bernoulli RBMs. Our experiments show that the hidden bias plays an important role in shaping the probability density function of the visible units. We define hidden entropy and propose it as a measure of representational efficiency of the model. By using this measure, we investigate the effect of the hidden bias on the hidden entropy and ...
View details for https://www.sciencedirect.com/science/article/pii/S0893608018301849
We address the problem of continuous laughter detection over audio-facial input streams obtained from naturalistic dyadic conversations. We first present a meticulous annotation of laughter, cross-talk and environmental noise in an audio-facial database with explicit 3D facial mocap data. Using this annotated database, we rigorously investigate the utility of facial information, head movement and audio features for laughter detection. We identify a set of discriminative features using mutual information-based criteria, and show how they ...
View details for https://ieeexplore.ieee.org/abstract/document/8046102/
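A brief sketch of mutual-information-based feature ranking with scikit-learn. The audio-facial feature matrix and laughter labels below are random placeholders; only the ranking mechanics are shown.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n_frames, n_features = 5000, 60                  # placeholder audio-facial feature matrix
X = rng.standard_normal((n_frames, n_features))
y = rng.integers(0, 2, n_frames)                 # 1 = laughter frame, 0 = non-laughter

mi = mutual_info_classif(X, y, random_state=0)   # MI between each feature and the label
ranking = np.argsort(mi)[::-1]                   # most discriminative features first
selected = ranking[:20]                          # keep the top-k features for the detector
```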
In human-to-human communication, gesture and speech co-exist in time with a tight synchrony, and gestures are often utilized to complement or to emphasize speech. In human–computer interaction systems, natural, affective and believable use of gestures would be a valuable key component in adopting and emphasizing human-centered aspects. However, natural and affective multimodal data, for studying computational models of gesture and speech, is limited. In this study, we introduce the JESTKOD database, which consists of ...
View details for https://link.springer.com/article/10.1007/s10579-016-9377-0
Wearable sensor systems can deliver promising solutions for automatic monitoring of ingestive behavior. This study presents an on-body sensor system and related signal processing techniques to classify different types of food intake sounds. A piezoelectric throat microphone is used to capture food consumption sounds from the neck. The recorded signals are first segmented and decomposed using empirical mode decomposition (EMD) analysis. EMD has been a widely implemented tool to analyze non-stationary and ...
View details for https://dl.acm.org/doi/abs/10.1145/3132635.3132640
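A small sketch of empirical mode decomposition on a synthetic signal, assuming the PyEMD package (installed from PyPI as EMD-signal); the throat-microphone signal here is a hand-made placeholder, and the resulting intrinsic mode functions would feed the subsequent feature extraction.

```python
import numpy as np
from PyEMD import EMD   # pip install EMD-signal

fs = 16000
t = np.arange(0, 2.0, 1.0 / fs)
# placeholder for a throat-microphone chunk: a low-frequency swallow-like component
# mixed with higher-frequency chewing-like content
signal = np.sin(2 * np.pi * 5 * t) + 0.3 * np.sin(2 * np.pi * 400 * t)

imfs = EMD().emd(signal)   # intrinsic mode functions, highest-frequency mode first
print(imfs.shape)          # (n_imfs, len(signal)); IMFs feed later feature extraction
```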
Knowledge about the dynamic shape of the vocal tract is the basis of many speech production applications such as articulatory analysis, modeling and synthesis. Vocal tract airway tissue boundary segmentation in the mid-sagittal plane is necessary as an initial step for extraction of the cross-sectional area function. This segmentation problem is, however, challenging due to the poor resolution of real-time speech MRI, grainy noise and the rapidly varying vocal tract shape. We present a novel approach to vocal tract airway tissue ...
View details for https://www.isca-speech.org/archive/Interspeech_2017/pdfs/1016.PDF
We explore the effect of laughter perception and response in terms of engagement in human-robot interaction. We designed two distinct experiments in which the robot has two modes: laughter responsive and laughter non-responsive. In responsive mode, the robot detects laughter using a multimodal real-time laughter detection module and invokes laughter as a backchannel to users accordingly. In non-responsive mode, the robot does not use laughter detection and thus provides no feedback. In the experimental design, we use a straightforward ...
View details for https://188.166.204.102/archive/Interspeech_2017/pdfs/1395.PDF
Dyadic interactions encapsulate rich emotional exchange between interlocutors suggesting a multimodal, cross-speaker and cross-dimensional continuous emotion dependency. This study explores the dynamic inter-attribute emotional dependency at the cross-subject level with implications to continuous emotion recognition based on speech and body motion cues. We propose a novel two-stage Gaussian Mixture Model mapping framework for the continuous emotion recognition problem. In the first stage, we perform continuous emotion ...
View details for https://pdfs.semanticscholar.org/bc87/50893faa247bbdf5a0ce752cee8cf7a45b8e.pdf
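To illustrate GMM-based mapping, the sketch below fits a joint GMM over stacked feature/target vectors and computes the standard GMR conditional mean. It is a single-stage simplification of the two-stage framework, run on random placeholder features and continuous emotion targets.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def gmr_predict(gmm, x, dx):
    """Conditional mean E[y | x] from a GMM fit on joint [x, y] vectors."""
    means, covs, weights = gmm.means_, gmm.covariances_, gmm.weights_
    resp, cond_means = [], []
    for k in range(gmm.n_components):
        mx, my = means[k][:dx], means[k][dx:]
        Sxx, Syx = covs[k][:dx, :dx], covs[k][dx:, :dx]
        resp.append(weights[k] * multivariate_normal.pdf(x, mean=mx, cov=Sxx))
        cond_means.append(my + Syx @ np.linalg.solve(Sxx, x - mx))
    resp = np.array(resp) / np.sum(resp)
    return np.sum(resp[:, None] * np.array(cond_means), axis=0)

rng = np.random.default_rng(0)
X = rng.standard_normal((3000, 10))          # placeholder speech/body-motion features
Y = X[:, :2] @ rng.standard_normal((2, 2))   # placeholder activation/valence targets
gmm = GaussianMixture(n_components=8, covariance_type="full").fit(np.hstack([X, Y]))
print(gmr_predict(gmm, X[0], dx=10))         # predicted continuous emotion attributes
```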
The aim of this paper is to track Parkinson's disease (PD) progression based on its symptoms in the vocal system using the Unified Parkinson's Disease Rating Scale (UPDRS). We utilize a standard speech signal feature set, which contains 6373 static features as functionals of low-level descriptor (LLD) contours, and select the most informative ones using the maximal relevance and minimal redundancy based on correlations (mRMR-C) criterion. Then, we evaluate the performance of Gaussian mixture regression (GMR) and support ...
View details for https://ieeexplore.ieee.org/abstract/document/8037685/
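A simplified sketch of correlation-based relevance-minus-redundancy selection followed by support vector regression. The greedy criterion below is an illustrative approximation of mRMR-C rather than the paper's exact formulation, and the feature matrix and UPDRS scores are synthetic.

```python
import numpy as np
from sklearn.svm import SVR

def mrmr_corr(X, y, k):
    """Greedy relevance-minus-redundancy selection using absolute correlations."""
    relevance = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    selected = [int(np.argmax(relevance))]
    while len(selected) < k:
        scores = []
        for j in range(X.shape[1]):
            if j in selected:
                scores.append(-np.inf)
                continue
            redundancy = np.mean([abs(np.corrcoef(X[:, j], X[:, s])[0, 1]) for s in selected])
            scores.append(relevance[j] - redundancy)
        selected.append(int(np.argmax(scores)))
    return selected

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 100))                            # placeholder acoustic functionals
y = X[:, :3].sum(axis=1) + 0.1 * rng.standard_normal(500)      # placeholder UPDRS scores

feats = mrmr_corr(X, y, k=10)
model = SVR().fit(X[:, feats], y)      # support vector regression on the selected features
print(model.score(X[:, feats], y))
```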
Natural and affective handshakes of two participants define the course of dyadic interaction. Affective states of the participants are expected to be correlated with the nature of the dyadic interaction. In this paper, we extract two classes of dyadic interaction based on temporal clustering of affective states. We use k-means temporal clustering to define the interaction classes, and utilize a support vector machine based classifier to estimate the interaction class type from multimodal speech and motion features. Then, we investigate ...
View details for https://ieeexplore.ieee.org/abstract/document/7952683/
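A minimal two-stage sketch: clustering of affective annotations defines two interaction classes, and an SVM then predicts the class from multimodal features. Plain (non-temporal) k-means and random data are used here purely for illustration, not the paper's temporal clustering.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_windows = 400
affect = rng.standard_normal((n_windows, 2))       # placeholder activation/valence annotations
features = rng.standard_normal((n_windows, 30))    # placeholder speech + motion features

# stage 1: derive two interaction classes by clustering the affective annotations
interaction_class = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(affect)

# stage 2: estimate the interaction class from the multimodal features
clf = SVC(kernel="rbf").fit(features, interaction_class)
print(clf.score(features, interaction_class))
```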
Lips deliver visually active cues for speech articulation. Affective states define how humans articulate speech; hence, they also change the articulation of lip motion. In this paper, we investigate the effect of phonetic classes on affect recognition from lip articulations. The affect recognition problem is formulated over discrete activation, valence and dominance attributes. We use the symmetric Kullback-Leibler divergence (KLD) to rate phonetic classes with larger discrimination across different affective states. We perform experimental evaluations using ...
View details for https://ieeexplore.ieee.org/abstract/document/7952593/
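A short sketch of the symmetric Kullback-Leibler divergence computed from shared histograms of a lip-motion feature under two affective conditions. The samples and the binning are arbitrary placeholders; a larger value indicates a more discriminative feature or phonetic class.

```python
import numpy as np
from scipy.stats import entropy

def symmetric_kld(x_a, x_b, bins=30):
    """Symmetric KL divergence between two feature samples via shared histograms."""
    lo, hi = min(x_a.min(), x_b.min()), max(x_a.max(), x_b.max())
    p, _ = np.histogram(x_a, bins=bins, range=(lo, hi), density=True)
    q, _ = np.histogram(x_b, bins=bins, range=(lo, hi), density=True)
    p, q = p + 1e-8, q + 1e-8                 # avoid empty bins
    return entropy(p, q) + entropy(q, p)      # KL(P||Q) + KL(Q||P)

rng = np.random.default_rng(0)
lip_feat_high_act = rng.normal(0.0, 1.0, 1000)   # lip-motion feature under high activation
lip_feat_low_act = rng.normal(0.5, 1.2, 1000)    # same feature under low activation
print(symmetric_kld(lip_feat_high_act, lip_feat_low_act))
```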
We propose a framework for joint analysis of speech prosody and arm motion towards automatic synthesis and realistic animation of beat gestures from speech prosody and rhythm. In the analysis stage, we first segment motion capture data and speech audio into gesture phrases and prosodic units via temporal clustering, and assign a class label to each resulting gesture phrase and prosodic unit. We then train a discrete hidden semi-Markov model (HSMM) over the segmented data, where gesture labels are hidden states with ...
View details for https://www.sciencedirect.com/science/article/pii/S0167639315300170
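As a toy illustration of the quantities a discrete HSMM needs, the sketch below counts emission probabilities P(prosodic-unit label | gesture-phrase label) and gesture-duration statistics from tiny hand-made label sequences. It shows the bookkeeping only, not the paper's training or synthesis procedure.

```python
import numpy as np
from collections import defaultdict

# placeholder aligned label sequences (one label per analysis unit):
# hidden gesture-phrase labels and observed prosodic-unit labels
gestures = [0, 0, 1, 1, 1, 2, 0, 2, 2, 1]
prosody  = [3, 1, 2, 2, 0, 1, 3, 1, 1, 2]
n_g, n_p = 3, 4

# emission probabilities P(prosody label | gesture label)
emission = np.zeros((n_g, n_p))
for g, p in zip(gestures, prosody):
    emission[g, p] += 1
emission /= emission.sum(axis=1, keepdims=True)

# duration statistics: how long each gesture label persists (the "semi-Markov" part)
durations = defaultdict(list)
run_label, run_len = gestures[0], 1
for g in gestures[1:]:
    if g == run_label:
        run_len += 1
    else:
        durations[run_label].append(run_len)
        run_label, run_len = g, 1
durations[run_label].append(run_len)
print(emission, dict(durations))
```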
Natural and affective handshakes of two participants define the course of dyadic interaction. Affective states of the participants are expected to be correlated with the nature or type of the dyadic interaction. In this study, we investigate the relationship between affective attributes and the nature of dyadic interaction. In this investigation we use the JESTKOD database, which consists of speech and full-body motion capture data recordings of dyadic interactions under agreement and disagreement scenarios. The dataset also has affective annotations in ...
View details for https://www.isca-speech.org/archive/Interspeech_2016/pdfs/0407.PDF
In human-to-human communication, gesture and speech co-exist in time with a tight synchrony, where we tend to use gestures to complement or to emphasize speech. In this study, we investigate the roles of vocal and gestural cues in identifying a dyadic interaction as agreement or disagreement. In this investigation we use the JESTKOD database, which consists of speech and full-body motion capture data recordings of dyadic interactions under agreement and disagreement scenarios. Spectral features of the vocal channel and ...
View details for https://ieeexplore.ieee.org/abstract/document/7472180/
In studies on artificial bandwidth extension (ABE), there is a lack of international coordination in subjective tests between multiple methods and languages. Here we present the design of absolute category rating listening tests evaluating 12 ABE variants of six approaches in multiple languages, namely in American English, Chinese, German, and Korean. Since the number of ABE variants caused a higher-than-recommended length of the listening test, ABE variants were distributed into two separate listening tests per language ...
View details for https://ieeexplore.ieee.org/abstract/document/7472812/