Speech database
The procedure outlined here uses utilities in the Berkeley Phonetics Machine to produce time-aligned TextGrids for recorded utterances in English.
Transcript
Create a text transcript of the words spoken in the sound file. Most plain-text editors will be fine for this.
One potential complication is that plain text on Windows may be saved with "end of line" characters that are not compatible with make_text_grids, the Perl utility that parses transcripts and passes them to pyalign.py. You can use the Unix utility d2u to convert the line endings (d2u is a separate download into your instance of the BPM).
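If d2u is not installed, the same conversion is easy to do by hand. Here is a minimal sketch in Python (the function name is mine, not part of the BPM):

```python
# Convert Windows (CRLF) line endings to Unix (LF) in place,
# as a stand-in for the d2u utility.
from pathlib import Path

def crlf_to_lf(path):
    """Rewrite a text file in place with Unix line endings."""
    data = Path(path).read_bytes()
    Path(path).write_bytes(data.replace(b"\r\n", b"\n"))
```

Working on the raw bytes avoids any guessing about the file's text encoding.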
If there is a portion of the audio file that you don't want to transcribe (and thus won't include in the tagged database), you can add a "skip region" line to the transcript: a line that starts with the "#" character. For example, the transcript below says to skip the first 0.3 seconds (from time 0 to time 0.3), align the utterance "word, word, word" to the chunk of audio from 0.3 to 1 second, skip from 1 to 1.7, align "sentence sentence sentence" to the audio from time 1.7 to 4, and then skip to time 200, aligning no further text to the audio.
# 0,0.3
word, word, word
# 1,1.7
sentence sentence sentence
# 4,200
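A minimal sketch of how a transcript in this format could be parsed into (start, end, text) chunks. This mirrors the job make_text_grids does, but it is an illustration, not that script's actual code:

```python
def parse_transcript(lines):
    """Split a transcript into (start, end, text) chunks.

    Lines beginning with '#' mark skip regions as '# start,end'.
    The text between two skip regions is aligned to the audio
    between the end of the first region and the start of the next.
    """
    chunks = []
    prev_end = 0.0   # end of the most recent skip region
    buffer = []      # transcript lines for the current chunk
    for line in lines:
        line = line.strip()
        if line.startswith("#"):
            start, end = (float(t) for t in line[1:].split(","))
            if buffer:
                chunks.append((prev_end, start, " ".join(buffer)))
                buffer = []
            prev_end = end
        elif line:
            buffer.append(line)
    if buffer:  # trailing text with no closing skip region
        chunks.append((prev_end, None, " ".join(buffer)))
    return chunks
```

Run on the example transcript above, this yields two chunks: "word, word, word" over 0.3 to 1, and "sentence sentence sentence" over 1.7 to 4.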
Align transcript to wave
The Perl script make_text_grids reads the transcription file and parses the skip regions, calling pyalign.py for each chunk of audio that has a transcript. So in the example above, make_text_grids would call pyalign.py with a start time of 0.3 and an end time of 1 for the transcript "word, word, word". After pyalign.py has run for each chunk, concat_pyalign_textgrids is called to combine the separate TextGrids into one that corresponds to the whole audio file.
>make_text_grids -h
>make_text_grids transcript.txt audio.wav
>ls audio.*
audio.wav  audio.TextGrid