Project Title
Bayesian Multisensory Scene Understanding
Abstract
We investigate a solution to the problem of multisensory perception and
tracking by formulating it in the framework of Bayesian model selection.
Humans robustly associate multisensory data as appropriate, but previous
theoretical work has focused largely on purely integrative cases, leaving
segregation unaccounted for and unexploited by machine perception systems.
We illustrate a unifying Bayesian solution to multisensory perception and
tracking that accounts for both integration and segregation by explicit
probabilistic reasoning about data association in a temporal context.
Explicit inference of multisensory data association may also be of intrinsic
interest for higher-level understanding of multisensory data.
We demonstrate this with a probabilistic model of audio-visual data in
which unsupervised learning and inference provide automatic audio-visual
detection and tracking of two human subjects, speech segmentation, and
association of each conversational segment with the speaking person.
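As a rough illustration of the idea (a sketch only, with illustrative variable names rather than the exact model from the publications below), data association can be cast as Bayesian model selection between an "integrate" hypothesis, in which audio observations x_a and visual observations x_v arise from a common source location l, and a "segregate" hypothesis with independent sources l_a and l_v:

    P(M \mid x_a, x_v) \propto P(x_a, x_v \mid M)\, P(M)

    P(x_a, x_v \mid M{=}1) = \int P(x_a \mid l)\, P(x_v \mid l)\, P(l)\, dl                                  (integrate)

    P(x_a, x_v \mid M{=}0) = \int P(x_a \mid l_a)\, P(l_a)\, dl_a \cdot \int P(x_v \mid l_v)\, P(l_v)\, dl_v  (segregate)

The posterior over M decides whether the cues are fused or treated separately; as described in the abstract, the full model extends this reasoning with a temporal context for tracking.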
Related Publications
- Timothy Hospedales, Joel Cartwright and Sethu Vijayakumar,
Structure Inference for Bayesian Multisensory Perception and Tracking,
International Joint Conference on Artificial Intelligence (IJCAI '07).
[pdf]
- Timothy Hospedales and Sethu Vijayakumar,
Structure Inference for Bayesian Multisensory Scene Understanding,
Submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence.
Schematic

[Figure: Graphical model used for inference]

Results: Speaker Association and Tracking
Sample Data and Results
Description: Video files use the DivX codec with stereo sound. From the stereo TDOA and the video data, the model learns to track the source audio-visually. Depending on the experiment, some datasets use separate training and test data and some use just one sequence.
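To make the audio cue concrete, here is a minimal, self-contained sketch (in Python, not the project's code) of estimating the stereo TDOA for one frame of audio by cross-correlating the two channels; the sample rate and the synthetic frame below are illustrative only:

    import numpy as np

    def estimate_tdoa(left, right, fs):
        """Estimate the time difference of arrival (in seconds) between two
        audio channels from the peak of their cross-correlation.
        A positive value means the left channel lags the right."""
        corr = np.correlate(left, right, mode='full')
        lag = np.argmax(corr) - (len(right) - 1)   # lag in samples
        return lag / fs

    # Synthetic check: delay one channel by 5 samples at 16 kHz.
    fs = 16000
    rng = np.random.default_rng(0)
    src = rng.standard_normal(1024)
    left = np.concatenate([np.zeros(5), src])
    right = np.concatenate([src, np.zeros(5)])
    print(estimate_tdoa(left, right, fs))          # approximately 5 / 16000 s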
Multiuser: A test data sequence and manually labelled ground truth are provided. Ground-truth files are in MATLAB format with three fields, each of which should have the same number of entries as there are frames in the video file (a minimal loading sketch is given after this description):
- l[1,2]: horizontal-axis location of users 1 and 2 (1-120 pixels)
- z[1,2]: visibility of users 1 and 2 (0: invisible, 1: visible, 2: partially visible)
- w[1,2]: audibility of users 1 and 2 (0: user silent, 1: user audible)
Results: The annotated video is fairly self-explanatory. The 'fixed stereo' version uses the inference of which user is speaking to put the speech of users 1 and 2 on channels 1 and 2 respectively. The 'moving stereo' version uses inference of both which user is speaking and where they are, placing each user's speech on the appropriate channel according to their position. (Try listening to both of these with headphones!)
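As a minimal sketch for inspecting one of these ground-truth files with Python/SciPy (the field names l, z, w and the filename are assumptions based on the description above; check the actual file for its exact layout):

    from scipy.io import loadmat

    # Hypothetical filename; the fields 'l', 'z', 'w' are assumed from the
    # description above, each with one row per video frame and one column per user.
    gt = loadmat('multiuser_groundtruth.mat')

    locations  = gt['l']   # horizontal location of users 1 & 2 (1-120 pixels)
    visibility = gt['z']   # 0: invisible, 1: visible, 2: partially visible
    audibility = gt['w']   # 0: user silent, 1: user audible

    print(locations.shape[0], 'annotated frames')  # should equal the video frame count
    print('visibility values:', sorted(set(visibility.ravel().tolist())))
    print('audibility values:', sorted(set(audibility.ravel().tolist())))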
Mechanical: The easiest sample has maximum volume, spectrum width, and lighting; the hardest sample has minimum volume, spectrum width, and lighting.