Forced alignment
Latest revision as of 11:00, 30 November 2018
== Goal and Scope ==
Forced alignment refers to the process by which orthographic transcriptions are aligned to audio recordings to automatically generate phone-level segmentation. While automatic alignment does not yet rival manual alignment, for many projects the time saved through forced alignment is well worth the small decrease in accuracy.
Forced alignment works best on recordings which
* have one speaker speaking at a time
* have little environmental noise
but other types of recordings can also be processed successfully.
== Aligning on the BPM ==
The aligner is an implementation of the [http://www.ling.upenn.edu/phonetics/p2fa/ Penn forced aligner] (Jiahong Yuan), which is based on the [http://htk.eng.cam.ac.uk/ HTK speech recognition toolkit]. It produces a Praat textgrid file that has word and phone boundaries for the speech in a wav file that you give to the aligner. We used this system in the "voices of Berkeley" project to find vowel midpoints and take formant measurements automatically.

It is implemented on the PhonLab BPM using sox and the HTK library of automatic speech recognition software. You may be able to set this up on your home computer, but most people will find it easier to run it through the BPM. Either way, you will need to register to use the HTK toolkit at http://htk.eng.cam.ac.uk.
=== Getting started with pyalign ===
For simple alignments involving a single utterance you can call <code>pyalign</code> directly. The <code>multi_align</code> command is used for more complicated situations involving multiple utterances, multiple speakers, or multiple input channels. '''''You should familiarize yourself with <code>pyalign</code> even if you intend to use <code>multi_align</code>, since <code>multi_align</code> is just a convenient way to call <code>pyalign</code> iteratively for the individual labels in a TextGrid.'''''

The <code>pyalign</code> command has three required arguments:
# Your '''.wav file'''. The aligner uses sox to create a copy of your wav file that has all of the properties that are needed for HTK. Keep in mind that if you specify the 16 kHz acoustic models but pass an 11.025 kHz file to the aligner, performance will be degraded. Be sure that the sampling rate of your wav file is at least as high as that of the acoustic models you specify.
# Your '''transcript file'''. The aligner needs to know what words are spoken in the .wav file and the order in which they are spoken (and may also need to know about disfluencies, laughter, etc. if they are there). Your transcription must include every single utterance, including false starts and filled pauses such as “um,” “uh,” or any other sort of hesitation. Transcript files may be either .txt files or .TextGrid files (see below).
# The '''output file'''. This is a text file that can be read into Praat as a textgrid. Praat scripting can then be used to extract phonetic measurements, or you can read the textgrid in a python script ([[meas_formants|<code>meas_formants</code>]] for an example) and use the ESPS unix command-line acoustic analysis package to extract phonetic measurements. TextGrid files use the extension .TextGrid.
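The sampling-rate caveat above can be checked programmatically before you align. A minimal sketch using Python's standard-library <code>wave</code> module; the function name <code>check_rate</code> is just for illustration:

```python
import wave

def check_rate(wav_path, model_rate=16000):
    """Return True if the wav file's sampling rate is at least model_rate."""
    with wave.open(wav_path, "rb") as w:
        rate = w.getframerate()
    if rate < model_rate:
        print(f"Warning: {wav_path} is {rate} Hz, below the {model_rate} Hz model rate")
        return False
    return True
```

If the check fails, either choose a lower-rate model with <code>-r</code> or record at a higher rate; upsampling a low-rate file will not recover the missing high frequencies.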
==== .txt Transcripts and <code>pyalign</code> ====
Use the <code>pyalign</code> command to do forced alignment. (The Penn tool is named <code>align.py</code>, and <code>pyalign</code> is a simple wrapper that makes <code>align.py</code> easier to call in the context of the BPM.)
Command-line usage:

 > pyalign [options] wave_file transcript_file output_file

where options may include:

 -r sampling_rate -- override which sample rate model to use, one of 8000, 11025, and 16000
 -s start_time    -- start of portion of wavfile to align (in seconds, default 0)
 -e end_time      -- end of portion of wavfile to align (in seconds, defaults to end)
The -r option determines which set of acoustic models to use (I would recommend that you use 16000). Your sound file should have a sampling rate that is equal to or greater than the acoustic model sampling rate.
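When you have many wav/transcript pairs, you can loop over them and call <code>pyalign</code> once per pair. A minimal Python sketch using <code>subprocess</code>; the command construction follows the usage line above, and the naming convention (a <code>.txt</code> transcript beside each <code>.wav</code>) is an assumption for illustration:

```python
import subprocess
from pathlib import Path

def pyalign_cmd(wav, txt, tg, rate=16000):
    """Build the pyalign command line for one wav/transcript pair."""
    return ["pyalign", "-r", str(rate), str(wav), str(txt), str(tg)]

def align_all(directory):
    """Align every .wav in directory that has a matching .txt transcript."""
    for wav in sorted(Path(directory).glob("*.wav")):
        txt = wav.with_suffix(".txt")
        if txt.exists():
            tg = wav.with_suffix(".TextGrid")
            subprocess.run(pyalign_cmd(wav, txt, tg), check=True)
```

<code>check=True</code> makes the loop stop on the first failed alignment rather than silently continuing.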
=== Adding missing words to the dictionary ===
Every word in your transcript must exactly match a word in the master dictionary, which is in the file <code>/opt/p2fa/model/dict</code> in the BPM (from the CMU Pronouncing Dictionary). If a word is missing, the aligner does not have the pronunciation information it requires to complete alignment. You can create your own file named <code>dict.local</code> that contains pronunciations of any missing words.
Refer to <code>/opt/p2fa/model/dict</code> as a model for how to create your <code>dict.local</code> file. The format for each entry line is 1) the orthographic word (in upper case); 2) two space characters; 3) a space-separated list of phones ('''''in upper case'''''). Use the same ARPAbet phoneme set as is used in the [http://www.speech.cs.cmu.edu/cgi-bin/cmudict CMU Pronouncing Dictionary], and include stress markings for all vowels.
 DOG  D AO1 G
 CAT  K AE1 T
'''''Finally, ensure the last entry is terminated with a line break.'''''
Place the <code>dict.local</code> file in the current working directory when you run <code>pyalign</code> so that the aligner will find it and include its contents.
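The entry format above is easy to get subtly wrong (a single space instead of two, lower-case phones). A minimal Python sketch of a formatter and checker; the function names are just for illustration, and checking phones against the full ARPAbet inventory is left out:

```python
def dict_local_entry(word, phones):
    """Format one dict.local line: UPPER-CASE word, two spaces, space-separated phones."""
    return word.upper() + "  " + " ".join(p.upper() for p in phones)

def validate_entry(line):
    """Check a dict.local line: word and phones separated by exactly two spaces, all upper case."""
    parts = line.split("  ")
    return (len(parts) == 2
            and parts[0] == parts[0].upper()
            and all(p == p.upper() for p in parts[1].split(" ")))
```

When writing the file, end with a newline so the last entry is terminated, e.g. <code>f.write("\n".join(entries) + "\n")</code>.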
=== .TextGrid transcripts and <code>multi_align</code> ===
TextGrid transcript files may be used with [https://github.com/rsprouse/ucblingmisc/blob/master/python/multi_align <code>multi_align</code>]. Using TextGrid transcripts allows you to align recordings with multiple speakers and gives you greater control over which intervals are aligned.
In the BPM execute

 multi_align --help

to see <code>multi_align</code>'s available options. See also the [[multi_align examples]] page.
=== Sharing a <code>dict.local</code> with a Google Drive spreadsheet ===
For groups of people working together, it can be convenient to maintain <code>dict.local</code> in a Google Drive spreadsheet and pull it in with a script, so that the group can collectively maintain a supplemental dictionary. Here is an example of how to do it, based on Ling113 in spring 2015, using the BPM:
==== Set up the spreadsheet ====
# Create a Google spreadsheet and share it with everyone in your group as an editor.
# Also add share rights so that anyone with the link can view the spreadsheet. If you prefer, make the spreadsheet public on the web.
# Add records to the spreadsheet by putting the transcription of a word in the first column and the pronunciation in the second. See the Ling113 example.
# Open the spreadsheet and look at the URL in your browser's location bar. The Ling113 example looks like this: <code>https://docs.google.com/a/berkeley.edu/spreadsheets/d/1WwGgZxk5RoU0TAOoJlKPUsoEgZEYjEgucD7zrK3n6Xo/edit#gid=0</code>.
# Notice the long alphanumeric string after <code>/d/</code> in your URL. This is the file key.
# Also notice the <code>gid</code> value in your URL. This will probably be '0', but if you have added multiple sheets it might be different. Make sure your current view is the sheet with the records you want to export.
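Picking the file key and <code>gid</code> out of the URL by eye is error-prone; it can also be sketched in code. A small Python illustration (the function name is hypothetical):

```python
import re

def sheet_ids(url):
    """Extract (file_key, gid) from a Google Sheets URL; gid defaults to '0'."""
    key = re.search(r"/d/([A-Za-z0-9_-]+)", url)
    gid = re.search(r"gid=(\d+)", url)
    return (key.group(1) if key else None,
            gid.group(1) if gid else "0")
```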
==== Create a download script ====
# Choose a name for your script. In our example here we'll call it <code>get_dict_local</code>. In some cases it might be sensible to make it specific to a project, e.g. <code>get_dict_local_myproject</code>.
# Create and edit a script file in your path. This works in BCE: <code>sudo gedit /usr/local/bin/get_dict_local</code>. Use the script name you chose in the first step.
# Use the Ling113 example script as a base for your download script. Just copy and paste into your editor.
# Delete the value of the <code>FILEKEY</code> variable in the Ling113 script (the part between quotation marks) and replace it with the file key you found in your spreadsheet's URL.
# Delete the value of the <code>GID</code> variable and replace it with your gid value.
# It's a good idea to update the comments in the file to remove references to Ling113 and replace them with your project name.
# Save the changes you made to the script and exit the editor.
# Make sure your script is executable. This works in BCE: <code>sudo chmod +x /usr/local/bin/get_dict_local</code>. Make sure you use the script name you chose if it is different from <code>get_dict_local</code>.
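The Ling113 example script itself is not reproduced here, but the idea behind it can be sketched. A hedged Python illustration, assuming the standard Google Sheets CSV export URL pattern (<code>.../export?format=csv&gid=...</code>) and a two-column sheet of word/pronunciation pairs with no header row; all names here are illustrative, not the actual Ling113 script:

```python
import csv
import io
import urllib.request

FILEKEY = "your-file-key-here"   # the long string after /d/ in your sheet's URL
GID = "0"                        # the gid value from the URL

def rows_to_dict_local(rows):
    """Turn (word, pronunciation) rows into dict.local text, upper-cased, two-space separated."""
    lines = []
    for row in rows:
        if len(row) >= 2 and row[0].strip():
            lines.append(row[0].strip().upper() + "  " + row[1].strip().upper())
    return "\n".join(lines) + "\n"   # final newline terminates the last entry

def fetch_dict_local(filekey=FILEKEY, gid=GID):
    """Download the sheet as CSV and write dict.local in the current directory."""
    url = f"https://docs.google.com/spreadsheets/d/{filekey}/export?format=csv&gid={gid}"
    with urllib.request.urlopen(url) as resp:
        text = resp.read().decode("utf-8")
    with open("dict.local", "w") as f:
        f.write(rows_to_dict_local(csv.reader(io.StringIO(text))))

if __name__ == "__main__":
    fetch_dict_local()
```

If your sheet has a header row, drop it before formatting (e.g. skip the first row from <code>csv.reader</code>).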
==== Using the script ====
Using the script is easy. You simply call your script by name at the command line, e.g. <code>get_dict_local</code>, and the <code>dict.local</code> file will be created or updated in your current working directory from the contents of your Google spreadsheet.
=== Troubleshooting ===
==== Word not in dictionary ====
One of the most common errors occurs when a word does not exist in the default dictionary. If this happens, "SKIPPING WORD X" will print in the terminal, where X is the word. The alignment will still run, but a skipped word will likely cause other words to be aligned incorrectly, so it is important to ensure that the aligner does not skip any words. If you need a project-specific dictionary (which might include, for example, a set of nonwords, or a set of words in a language other than English), create a file named <code>dict.local</code> that has the same format as <code>/opt/p2fa/model/dict</code> but includes your project-specific vocabulary. <code>pyalign</code> looks at both the default dictionary and <code>dict.local</code> to find transcriptions of the words in your transcript file.
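Rather than discovering skipped words one alignment at a time, you can check a transcript up front. A minimal sketch, assuming a plain .txt transcript and dictionary files in the two-space format described above; the punctuation handling is deliberately crude and the function names are illustrative:

```python
import re

def load_dict_words(*dict_paths):
    """Collect the word field (before the two-space separator) from each dictionary file."""
    words = set()
    for path in dict_paths:
        with open(path) as f:
            for line in f:
                if line.strip():
                    words.add(line.split("  ")[0].strip().upper())
    return words

def missing_words(transcript_text, known_words):
    """Return transcript words (upper-cased, punctuation stripped) not in known_words."""
    tokens = re.findall(r"[A-Za-z']+", transcript_text)
    return sorted({t.upper() for t in tokens} - known_words)
```

For example, <code>missing_words(open("utt.txt").read(), load_dict_words("/opt/p2fa/model/dict", "dict.local"))</code> lists exactly the words you still need to add to <code>dict.local</code>.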
==== SyntaxError: invalid syntax ====
If your attempt to align ends with the error message <code>SyntaxError: invalid syntax</code>, it probably indicates that you attempted to run <code>align.py</code> directly. Use the <code>pyalign</code> wrapper instead so that the correct Python version interprets the script.