Abstract
A class of audio-visual data (fiction entertainment: movies and TV series) is segmented into scenes that contain dialogs, using a novel hidden Markov model (HMM)-based method. Each shot is classified using both the audio track (via classification into speech, silence, and music) and the visual content (face and location information). The result of this shot-based classification is an audio-visual token that the HMM state diagram uses to perform scene analysis. Simulations with circular and left-to-right HMM topologies show that both perform very well with multi-modal inputs. Moreover, for the circular topology, comparisons between different training and observation sets show that audio and face information together give the most consistent results across observation sets.
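The token-then-decode idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the two-state model (dialog, non-dialog), the token encoding, and every probability below are hypothetical values chosen for illustration, since the abstract does not specify them. The fully connected transition matrix loosely stands in for a circular topology; a left-to-right topology would instead set the backward transitions to zero.

```python
import math
from itertools import product

AUDIO = ("speech", "silence", "music")

def shot_token(audio, face, location_change):
    """Map one shot's labels to an integer token in 0..11 (hypothetical encoding)."""
    return AUDIO.index(audio) * 4 + int(face) * 2 + int(location_change)

def emission_row(p_audio, p_face, p_loc):
    """Build P(token | state), treating the three cues as independent (an assumption)."""
    row = []
    for a, f, l in product(range(3), (0, 1), (0, 1)):
        row.append(p_audio[a]
                   * (p_face if f else 1 - p_face)
                   * (p_loc if l else 1 - p_loc))
    return row

def viterbi(tokens, start, trans, emit):
    """Return the most likely state sequence using log-domain Viterbi decoding."""
    n_states = len(start)
    log = lambda p: math.log(p) if p > 0 else float("-inf")
    score = [log(start[s]) + log(emit[s][tokens[0]]) for s in range(n_states)]
    back = []
    for tok in tokens[1:]:
        prev, score, ptr = score, [], []
        for s in range(n_states):
            best = max(range(n_states), key=lambda r: prev[r] + log(trans[r][s]))
            ptr.append(best)
            score.append(prev[best] + log(trans[best][s]) + log(emit[s][tok]))
        back.append(ptr)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]

# Hypothetical model: state 0 = dialog, state 1 = non-dialog.
START = [0.5, 0.5]
TRANS = [[0.8, 0.2],
         [0.2, 0.8]]
EMIT = [
    emission_row((0.8, 0.1, 0.1), p_face=0.8, p_loc=0.1),  # dialog
    emission_row((0.2, 0.3, 0.5), p_face=0.3, p_loc=0.5),  # non-dialog
]

shots = [
    ("music", False, True),
    ("speech", True, False),
    ("speech", True, False),
    ("speech", True, False),
    ("music", False, True),
]
tokens = [shot_token(*s) for s in shots]
labels = viterbi(tokens, START, TRANS, EMIT)
print(labels)
```

Consecutive shots decoded as state 0 would be grouped into one dialog scene under this sketch. In practice the transition and emission probabilities would be estimated from labeled training sequences rather than set by hand.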
| Original language | English (US) |
|---|---|
| Pages (from-to) | 137-151 |
| Number of pages | 15 |
| Journal | Multimedia Tools and Applications |
| Volume | 14 |
| Issue number | 2 |
| State | Published - Jun 2001 |
All Science Journal Classification (ASJC) codes
- Software
- Media Technology
- Hardware and Architecture
- Computer Networks and Communications
Keywords
- Content-based indexing
- Dialog scene analysis
- Hidden Markov models
- Multi-modal analysis