|
Abstract: Research on
communication dynamics has shown that human beings, across nations
and cultures, ‘talk’ to each other using facial and voiced
expressions, gestures, body language and, of course, by listening. In
fact, facial expressions can account for as much as 70% of the information
(content) that is transmitted and received. Further, it’s likely
that the ‘state-of-being’, ‘state-of-mind’ or ‘emotional state’ is
expressed via communication; that is, facial and voiced expressions
can directly represent human emotions. Most of us ‘understand’
facial and voiced expressions effortlessly. With the recent
accessibility of digital images and audio, the speaker and his
students saw an opportunity for a natural foray into the social sciences using
‘tools’ well developed in digital signal processing engineering.
The ‘team’ developed an initial multi-modal emotion recognition
system using cues from digitally recorded facial images and voiced
recordings. This is achieved by extracting features from each of the
modalities using signal processing techniques, and then classifying
these features with the help of artificial neural networks (ANN).
The features extracted from the face are the eyes, eyebrows, mouth
and nose; these are located using image processing techniques such as the
seeded region growing algorithm and particle swarm optimization, together with
general properties of the feature being extracted. In contrast,
features of interest in speech are pitch, frequencies and spectra
along with some statistical properties and also the rate of change
of these properties. These features are extracted using techniques
such as the Fourier transform. In the course of this research, the team
developed a toolbox that can read an audio and/or video file and
‘perform emotion recognition’ on the face in the video and speech in
the audio channel. The features extracted from the face and voice
are classified independently into emotions using two separate
feed-forward ANNs. The toolbox then presents the output of the
artificial neural networks from one or both modalities on a
synchronized time scale. One interesting result of this research
is the consistent misclassification of facial expressions between two
databases (one European, one Asian), suggesting a cultural basis for
this misinterpretation. Adding the voice component has been shown
to partially improve classification. |
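
As a rough illustration of the speech-feature step mentioned in the abstract (spectral properties obtained with a Fourier transform, plus their rate of change), the sketch below computes two simple per-frame features and their frame-to-frame differences. It is not the team's toolbox: the function names (`dft_magnitudes`, `spectral_features`, `feature_deltas`, `sine_frame`) are hypothetical, the dominant spectral peak is only a crude stand-in for pitch, and the spectral centroid is chosen here as one example of a statistical property of the spectrum. A plain DFT is used so the example needs only the Python standard library.

```python
import cmath
import math


def dft_magnitudes(frame):
    """Magnitudes of the non-negative-frequency DFT bins of a real-valued frame."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        mags.append(abs(acc))
    return mags


def spectral_features(frame, sample_rate):
    """Dominant frequency and spectral centroid (both in Hz) for one frame."""
    mags = dft_magnitudes(frame)
    n = len(frame)
    freqs = [k * sample_rate / n for k in range(len(mags))]
    # Skip the DC bin (k = 0) when searching for the dominant peak.
    peak_bin = max(range(1, len(mags)), key=lambda k: mags[k])
    total = sum(mags)
    centroid = sum(f * m for f, m in zip(freqs, mags)) / total if total > 0 else 0.0
    return {"dominant_hz": freqs[peak_bin], "centroid_hz": centroid}


def feature_deltas(values):
    """Frame-to-frame differences, a simple proxy for 'rate of change'."""
    return [b - a for a, b in zip(values, values[1:])]


def sine_frame(freq_hz, sample_rate, n):
    """Synthetic test frame: a pure sine tone (assumed input, not real speech)."""
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate) for t in range(n)]


if __name__ == "__main__":
    rate, size = 8000, 256
    frames = [sine_frame(250, rate, size), sine_frame(500, rate, size)]
    feats = [spectral_features(f, rate) for f in frames]
    print(feats)
    print(feature_deltas([f["dominant_hz"] for f in feats]))
```

In a pipeline like the one described, per-frame values of this kind, together with their deltas, would form the feature vector passed to the speech classifier; real recordings would also need windowing and a more robust pitch estimator than a single spectral peak.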