Below is the complete dataset that was used for the research described in our NIPS 2007 paper. Since the files are rather large, please be considerate and save bandwidth by downloading only the parts that are really of interest to you.
This dataset is provided free of charge and with no warranty, neither expressed nor implied, and subject to the following conditions:
- If redistributed, the files must come with this notice, which must not be modified in any way.
- Any publications using this dataset should cite this paper:
G. Englebienne, T. F. Cootes and M. Rattray. A probabilistic model for generating realistic lip movements from speech. In Advances in Neural Information Processing Systems 21, 2008
About the data
The data set contains 803 video sequences of a talking head. The sequences were manually cut out of freely available MP4-encoded broadcasts of Democracy Now!, an American news show, and are grouped into directories according to the particular instance of the show they were taken from. Each directory was compressed separately in tar.bz2 format. Each individual sequence is stored in its own subdirectory, which contains the following files:
- An ECMAScript file generated by avidemux2, the program used to cut out the sequence. It contains the date of the show and the indices of the sequence's start and end frames within that show, and could be used to modify the sequence based on the original data.
- audio.wav: The sound of the sequence, decompressed and extracted from the video file.
- hifi.wav: The same sound as audio.wav, but extracted from the corresponding radio show. Since the radio show is available in uncompressed CD quality, this data is of higher quality than audio.wav, even though the sample rate is lower.
- The MFCC coefficients of hifi.wav, computed at 100 Hz with the HTK toolkit and stored in HTK's .mfc file format.
- transcript.txt: The textual transcription of the sequence.
- align.lab: The phonetic equivalent of transcript.txt according to CMUDict v0.6, aligned to the MFCC samples of the sound by computing the Viterbi path through an unrolled HMM. Where the dictionary lists multiple pronunciations for a word, this alignment selects the most likely pronunciation.
- Same as align.lab, but downsampled to match the samples in mouth.mfc
- mouth.mfc: The parameters of the Active Appearance Model (AAM) fitted to the video frames. The AAM has 32 parameters, and there is one sample per video frame. Again, the results are stored in HTK's .mfc file format.
- Same as mouth.mfc, but augmented with delta features.
- Contains one JPEG-compressed image for each frame of the sequence. Each frame was converted to greyscale and cropped to the face of the person talking.
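Several of the files above use HTK's binary parameter-file format. As a rough illustration of how to load them — the reader function and the NumPy dependency are my own, not something shipped with the dataset — such a file can be parsed like this:

```python
import struct

import numpy as np

def read_htk(path):
    """Read an HTK binary parameter file (e.g. a .mfc file).

    The 12-byte big-endian header holds: number of samples (int32),
    sample period in 100 ns units (int32), bytes per sample (int16)
    and the parameter-kind code (int16); big-endian float32 samples
    follow the header.
    """
    with open(path, "rb") as f:
        n_samples, samp_period, samp_size, parm_kind = struct.unpack(
            ">iihh", f.read(12))
        data = np.frombuffer(f.read(n_samples * samp_size), dtype=">f4")
    # One row per sample, one column per coefficient (4 bytes each).
    return data.reshape(n_samples, samp_size // 4), samp_period, parm_kind
```

For the 100 Hz MFCC files, the sample period read from the header should come out as 100000 (i.e. 10 ms in 100 ns units).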
Extraction of hifi.wav
The audio in the MP4 stream is AAC-compressed. This yields audio that sounds quite good to human ears; however, early tests with this data (performed on RealMedia- rather than MP4-encoded video) showed that HMMs used for speech recognition performed markedly worse on the compressed sound than on CD-quality sound. Fortunately, Democracy Now! exists both as a television show and as a radio show, and the radio show is available online in CD quality. Both shows contain the same audio track of the presenter talking, but seem to be organised slightly differently. We therefore took the audio sequence extracted from the video and searched for the same sequence in the radio show.
The corresponding sequence was found by minimising the sum-squared error between the envelopes of the two soundwaves. The resulting sequences were checked manually. There may be a misalignment of up to 5 ms between the two sequences (due to the resolution at which the envelope was computed), but then again, the alignment of sound and video within the MP4 is not exact either.
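The matching step can be sketched as follows. This is a minimal illustration, not the original implementation: the function names, the moving-average envelope, and the window length are assumptions.

```python
import numpy as np

def envelope(x, win=441):
    """Amplitude envelope: rectify, then smooth with a moving average.

    win=441 is roughly 10 ms at 44.1 kHz; the actual resolution used
    for the dataset is not specified beyond the 5 ms error bound.
    """
    return np.convolve(np.abs(x), np.ones(win) / win, mode="same")

def find_offset(segment, recording):
    """Offset into `recording` that minimises the sum-squared error
    between the envelope of `segment` and the matching window of the
    envelope of `recording`."""
    e_seg = envelope(segment)
    e_rec = envelope(recording)
    n = len(e_seg)
    errors = [np.sum((e_rec[i:i + n] - e_seg) ** 2)
              for i in range(len(e_rec) - n + 1)]
    return int(np.argmin(errors))
```

An exhaustive scan like this is O(N·M); searching at a coarser envelope resolution (which is presumably where the 5 ms bound comes from) or using FFT-based cross-correlation makes it practical for hour-long shows.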
Processing of the frames
Each individual frame was stored as a separate JPEG file, after it was cropped to the region of interest --- the face of the person talking --- and converted to greyscale. An Active Appearance Model was then fitted to each individual frame, and the parameters of that model were stored in a separate file, mouth.mfc.
Gwenn Englebienne, August 2007.