SLMC //projects: Cue Integration//

Project Title
Bayesian Multisensory Scene Understanding
Abstract
We investigate a solution to the problem of multi-sensor perception and tracking by formulating it in the framework of Bayesian model selection. Humans robustly associate multi-sensory data as appropriate, but previous theoretical work has focused largely on purely integrative cases, leaving segregation unaccounted for and unexploited by machine perception systems. We illustrate a unifying, Bayesian solution to multisensor perception and tracking which accounts for both integration and segregation by explicit probabilistic reasoning about data association in a temporal context. Explicit inference of multisensory data association may also be of intrinsic interest for higher level understanding of multisensory data. We illustrate this using a probabilistic model of audio-visual data in which unsupervised learning and inference provide automatic audio-visual detection and tracking of two human subjects, speech segmentation, and association of each conversational segment with the speaking person.
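To make the model-selection idea concrete, here is a minimal sketch (a simplified, static 1D illustration in Python, not the model from the paper; all noise and prior parameters below are assumed for illustration). It computes the posterior probability that an audio location cue and a visual location cue arose from a common source rather than from independent sources; a high posterior favours integrating the cues, a low one favours segregating them.

# Minimal sketch: Bayesian model selection between a "common source" hypothesis
# (integrate the audio and video cues) and an "independent sources" hypothesis
# (segregate them). Parameter values are illustrative assumptions only.
import numpy as np
from scipy.stats import multivariate_normal, norm

def association_posterior(x_audio, x_video,
                          sigma_a=15.0,      # audio (TDOA-derived) localisation noise, pixels
                          sigma_v=3.0,       # visual localisation noise, pixels
                          mu_p=60.0,         # prior mean location (centre of a 120-pixel axis)
                          sigma_p=40.0,      # prior spread of source locations, pixels
                          prior_common=0.5): # prior probability of a common source
    """Posterior probability that the audio and video cues share a common source."""
    # Common source: marginalising the shared location l ~ N(mu_p, sigma_p^2)
    # leaves a correlated joint Gaussian over (x_audio, x_video).
    cov_common = np.array([[sigma_p**2 + sigma_a**2, sigma_p**2],
                           [sigma_p**2, sigma_p**2 + sigma_v**2]])
    lik_common = multivariate_normal.pdf([x_audio, x_video],
                                         mean=[mu_p, mu_p], cov=cov_common)
    # Independent sources: each cue is generated from its own location draw.
    lik_indep = (norm.pdf(x_audio, mu_p, np.sqrt(sigma_p**2 + sigma_a**2)) *
                 norm.pdf(x_video, mu_p, np.sqrt(sigma_p**2 + sigma_v**2)))
    evidence_common = lik_common * prior_common
    return evidence_common / (evidence_common + lik_indep * (1.0 - prior_common))

# Nearby cues favour integration; widely separated cues favour segregation.
print(association_posterior(55.0, 60.0))   # close cues -> probably the same speaker
print(association_posterior(20.0, 95.0))   # distant cues -> probably different sources

The full model described in the publications below performs this kind of structure inference over richer audio-visual observations, in a temporal context, and with unsupervised learning of the observation models.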
Related Publications
  • Timothy Hospedales, Joel Cartwright and Sethu Vijayakumar,
    Structure Inference for Bayesian Multisensory Perception and Tracking,
    International Joint Conference on Artificial Intelligence (IJCAI '07). [pdf]
  • Timothy Hospedales and Sethu Vijayakumar,
    Structure Inference for Bayesian Multisensory Scene Understanding,
    Submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence
Schematic
Graphical model used for inference
Sample Result
Results: Speaker Association and Tracking
Sample Data and Results

Description: Video files are encoded with the DivX codec and carry stereo sound. From the stereo time difference of arrival (TDOA) and the video data, the model learns to track sources audio-visually. Depending on the experiment, some datasets use separate train and test sequences while others use a single sequence.
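One common way to estimate the stereo TDOA from a frame of audio is generalised cross-correlation with the phase transform (GCC-PHAT). The sketch below is a generic Python illustration, not the code used for these datasets; the 16 kHz sample rate and the synthetic test signal are assumptions.

# Generic sketch: stereo TDOA estimation with GCC-PHAT.
import numpy as np

def gcc_phat_tdoa(left, right, fs, max_tau=None):
    """Estimated delay (seconds) of the right channel relative to the left."""
    n = len(left) + len(right)
    L = np.fft.rfft(left, n=n)
    R = np.fft.rfft(right, n=n)
    cross = R * np.conj(L)
    cross /= np.abs(cross) + 1e-12          # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(max_shift, int(max_tau * fs))
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / float(fs)

# Synthetic check: white noise delayed by 12 samples on the right channel.
fs = 16000
rng = np.random.default_rng(0)
sig = rng.standard_normal(1600)
delay = 12
left = sig
right = np.concatenate((np.zeros(delay), sig[:-delay]))
print(gcc_phat_tdoa(left, right, fs))       # ~ 12 / 16000 = 0.75 ms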
Multiuser: Test data sequence and manually labelled ground truth are provided.
Ground truth files are in MATLAB format with three fields, each with the same number of entries as there are frames in the video file:
  • l[1,2]: horizontal-axis location of user 1 & 2 {1-120 pixels}
  • z[1,2]: visibility of user 1 & 2 {0: invisible, 1: visible, 2: partially visible}
  • w[1,2]: audibility of user 1 & 2 {0: user silent, 1: user audible}
Results: The annotated video is fairly self-explanatory. The 'fixed stereo' version uses the inferred speaker association to put the speech of users 1 & 2 on channels 1 & 2 respectively. The 'moving stereo' version uses both the inferred speaker association and the inferred user positions to put each user's speech on the channel appropriate to their location. (Try listening to both of these with headphones!)
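As an illustration of how the ground truth might be used for evaluation, the sketch below loads a ground-truth file with scipy and scores a hypothetical per-frame location estimate for user 1. The exact field names (l1, z1, w1) and the file names are assumptions based on the description above, so check them against the actual .mat files.

# Sketch: load ground truth and score a per-frame location estimate for user 1.
import numpy as np
from scipy.io import loadmat

gt = loadmat('multiparty70_groundtruth.mat', squeeze_me=True)  # hypothetical file name

l1 = np.asarray(gt['l1'])   # horizontal location of user 1, 1-120 pixels
z1 = np.asarray(gt['z1'])   # visibility of user 1: 0 invisible, 1 visible, 2 partial
w1 = np.asarray(gt['w1'])   # audibility of user 1: 0 silent, 1 audible

est_l1 = np.load('my_tracker_output_user1.npy')   # hypothetical per-frame estimates
assert len(est_l1) == len(l1), "expect one entry per video frame"

# Score tracking error only on frames where user 1 is fully visible.
visible = (z1 == 1)
rmse = np.sqrt(np.mean((est_l1[visible] - l1[visible]) ** 2))
print(f"RMSE on visible frames: {rmse:.1f} pixels "
      f"({visible.sum()} of {len(l1)} frames; user 1 audible in {int(w1.sum())})")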
Mechanical: The easiest sample has maximum volume, spectrum width, and lighting; the hardest sample has minimum volume, spectrum width, and lighting.

Dataset                     | Train Data         | Test Data            | Ground Truth | Our Results
Multiparty Conversation #70 | U1 Train, U2 Train | Test                 | Ground Truth | Fixed Stereo, Moving Stereo
Multiparty Conversation #78 | U1 Train, U2 Train | Test                 | Ground Truth | Fixed Stereo, Moving Stereo
Single User #19             | Train              | Test (same as train) | Ground Truth | Results
Mechanical #1 (Easy)        | Train              | Test                 | -            | -
Mechanical #24 (Hard)       | Train              | Test                 | -            | -