Forced alignment

From Phonlab

The aligner is an implementation of the Penn forced aligner (by Jiahong Yuan), which is based on the HTK speech recognition toolkit. It produces a Praat TextGrid file with word and phone boundaries for the speech in a wav file that you give it. A BIG time saver. We used this system in the "voices of Berkeley" project to find vowel midpoints and take formant measurements automatically.

It runs on the Dept of Linguistics server using sox and the HTK library of automatic speech recognition software. You may be able to set this up on your home computer, but most people will find it easier to use the server.

How to use the aligner

  1. Your .wav file. The aligner uses sox to create a copy of your wav file that has all of the properties that HTK needs. One thing to keep in mind is that if you specify the 16 kHz acoustic models but pass an 11.025 kHz file to the aligner, performance will be degraded. Just be sure that the sampling rate of your wav file is at least as high as that of the acoustic models you specify.
  2. Your transcript file. The aligner needs to know what words are spoken in the .wav file and the order in which they are spoken (and may also need to know about disfluencies, laughter, etc., if they are present). The transcript file is a plain text document containing a transcript of the words spoken in the wav file.
  3. Words you can use in the transcript file. By default, the aligner uses the pronouncing dictionary that you can see at /opt/f2fa/model/dict. It converts your transcript to all caps before looking up words in the dictionary. If you need a project-specific dictionary (which might include, for example, a set of nonwords, or a set of words in a language other than English), you can create a file named "dict.local" that has the same format as /opt/f2fa/model/dict but includes your project-specific vocabulary. pyalign consults both the default dictionary and dict.local to find transcriptions of the words in your transcript file.
  4. The unix command (the Penn tool is named align.py; pyalign is just a simple wrapper that makes align.py easier to call in the context of our server):
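The dictionary lookup described in steps 2 and 3 can be sketched in Python. This is only a sketch, not pyalign's actual code: the CMU-style dictionary format (one entry per line, WORD followed by its phone symbols) and the assumption that dict.local is consulted before the default dictionary are inferences from the description above.

```python
def load_dict(path):
    """Read a pronouncing dictionary assumed to be in CMU style:
    one entry per line, WORD followed by its phone symbols."""
    entries = {}
    with open(path) as f:
        for line in f:
            parts = line.split()
            if parts:
                # A word may have several pronunciation variants.
                entries.setdefault(parts[0], []).append(parts[1:])
    return entries

def transcribe(transcript_words, default_dict, local_dict):
    """Uppercase each transcript word (as the aligner does) and look it
    up, trying the project-specific dict.local before the default dictionary."""
    out = []
    for word in transcript_words:
        w = word.upper()
        prons = local_dict.get(w) or default_dict.get(w)
        if prons is None:
            raise KeyError(f"{w} is not in the dictionary; add it to dict.local")
        out.append((w, prons))
    return out
```

If a word in your transcript is in neither dictionary, the alignment fails, which is why nonwords and foreign-language items need dict.local entries.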

Command-line usage:

> pyalign [options] wave_file transcript_file output_file

where options may include:

 -r sampling_rate -- override which sample rate model to use, one of 8000, 11025, or 16000
 -s start_time    -- start of portion of wavfile to align (in seconds, default 0)
 -e end_time      -- end of portion of wavfile to align (in seconds, default: end of file)


The -r option determines which set of acoustic models to use (I would recommend that you use 16000). Your sound file should have a sampling rate that is equal to or greater than the acoustic model sampling rate.
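As a quick sanity check before running pyalign, you can inspect a wav file's sampling rate with Python's standard wave module. The helper below is hypothetical (not part of the aligner); it just flags files whose rate falls below the model rate chosen with -r.

```python
import wave

def check_sample_rate(wav_path, model_rate=16000):
    """Warn when a wav file's sampling rate is below the rate of the
    acoustic models chosen with pyalign's -r option (8000, 11025, or 16000)."""
    with wave.open(wav_path, "rb") as w:
        rate = w.getframerate()
    if rate < model_rate:
        print(f"Warning: {wav_path} is {rate} Hz; the {model_rate} Hz "
              "models will perform poorly on it.")
    return rate
```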

The output file is a text file that can be read into Praat as a TextGrid. From there you can use Praat scripting to extract phonetic measurements, or you can read the TextGrid in a python script (see meas_formants for an example) and use the ESPS unix command-line acoustic analysis package to extract phonetic measurements.
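Reading the TextGrid in Python can be as simple as pulling out interval boundaries and labels. The sketch below is a minimal illustration, not the meas_formants code: it assumes Praat's long ("text") TextGrid format and ignores tier names. A vowel midpoint, as used in the "voices of Berkeley" workflow, is then just the average of an interval's endpoints.

```python
import re

def read_intervals(textgrid_text):
    """Extract (xmin, xmax, label) triples from the interval tiers of a
    Praat TextGrid given as a string. Minimal sketch: a real parser
    should also track tier names and handle point tiers."""
    pattern = re.compile(r'xmin = ([\d.]+)\s+xmax = ([\d.]+)\s+text = "([^"]*)"')
    return [(float(a), float(b), lab) for a, b, lab in pattern.findall(textgrid_text)]

def midpoint(interval):
    """Temporal midpoint of an (xmin, xmax, label) interval, e.g. the
    point at which to take a vowel's formant measurements."""
    xmin, xmax, _ = interval
    return (xmin + xmax) / 2
```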