Speech database

The procedure outlined here uses utilities in the Berkeley Phonetic Machine to produce time-aligned TextGrids with "words" and "phones" tiers for recorded utterances in English.

Transcript

Create text transcripts of the words spoken in a sound file. Most plain text editors will be fine for this.

One potential complication is that plain text in Windows may be saved with "end of line" characters that are not compatible with make_text_grids, the perl utility that parses transcripts and passes them to pyalign. You can use the unix utility d2u to change the line endings (this is a separate download into your instance of the BPM).
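
If d2u isn't available, the same conversion takes only a few lines of python. This is a minimal sketch, not part of the BPM toolchain, that rewrites a transcript in place:

import sys

# Convert Windows (CRLF) line endings to Unix (LF) in place.
# Usage: python fix_line_endings.py transcript.txt
path = sys.argv[1]
with open(path, 'rb') as f:
    data = f.read()
with open(path, 'wb') as f:
    f.write(data.replace(b'\r\n', b'\n'))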

If there is a portion of the audio file that you don't want to transcribe, and thus won't include in the tagged database, you can add a "skip region" to the transcript with a line that starts with the "#" character. For example, the transcript below says to skip the first 0.3 seconds (from time 0 to time 0.3), then align the utterance "word, word, word" to the chunk of audio from 0.3 to 1 second, skip from 1 to 1.7, then align "sentence sentence sentence" to the audio from time 1.7 to 4, and then skip to time 200, aligning no further text to the audio.

# 0,0.3
word, word, word
# 1,1.7
sentence sentence 
sentence
# 4,200


Align transcript to wave

The perl script make_text_grids (File:Make_text_grids.txt) reads the transcription file and parses the skip regions, calling pyalign for each chunk of audio that has a transcript. So in the example above, make_text_grids would call pyalign with a start time of 0.3 and an end time of 1 for the transcript "word, word, word". After calling pyalign for each chunk, concat_pyalign_textgrids is called to combine the separate TextGrids into one that corresponds to the audio file.

>make_text_grids -h
>make_text_grids audio.wav transcript.txt audio.TextGrid
>ls audio.*
   audio.wav    audio.TextGrid
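
In outline, the chunking logic works something like the python sketch below. This is only an illustration of the idea (the real utility is written in perl), and the command-line options passed to pyalign and concat_pyalign_textgrids here are assumptions, not documented usage:

import subprocess
import sys

# Illustration only: the real make_text_grids is perl, and the pyalign
# and concat_pyalign_textgrids invocations below are assumed forms.
wav, transcript, out_grid = sys.argv[1], sys.argv[2], sys.argv[3]

# Collect (start, end, text) for each transcribed chunk between skip regions.
chunks = []
start = 0.0
text = []
for line in open(transcript):
    line = line.strip()
    if line.startswith('#'):                  # skip region: "# start,end"
        skip_start, skip_end = [float(t) for t in line[1:].split(',')]
        if text:                              # close out the current chunk
            chunks.append((start, skip_start, ' '.join(text)))
            text = []
        start = skip_end                      # next chunk begins after the skip
    elif line:
        text.append(line)
if text:                                      # transcript ended inside a chunk
    chunks.append((start, None, ' '.join(text)))

# Align each chunk separately, then stitch the pieces into one TextGrid.
grids = []
for i, (t0, t1, words) in enumerate(chunks):
    txt = 'chunk%d.txt' % i
    grid = 'chunk%d.TextGrid' % i
    with open(txt, 'w') as f:
        f.write(words + '\n')
    cmd = ['pyalign', '-b', str(t0)]          # hypothetical start/end options
    if t1 is not None:
        cmd += ['-e', str(t1)]
    subprocess.run(cmd + [wav, txt, grid])
    grids.append(grid)
subprocess.run(['concat_pyalign_textgrids'] + grids + [out_grid])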

Here's an example of using make_text_grids in a python script to produce a database full of aligned transcriptions. The script is called align_all.py. It assumes a particular directory structure:

corpus:transcripts
      :DAT*    (data directories -- usually audio files for one talker, DAT01, DAT02, etc.)

align_all.py (File:Align_all.txt) looks in the DAT directories for .wav files for which (1) there exists a corresponding transcript (first looking in the DAT* directory and then in the transcripts directory), and (2) there does not yet exist a corresponding TextGrid file. make_text_grids is then called with the transcript and audio file names, and the TextGrid is created.
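
A minimal python sketch of that scan, with the directory layout above assumed, might look like this (the real align_all.py also writes the "to do" file described below):

import glob
import os
import subprocess

# Sketch of the align_all.py scan; paths follow the layout shown above.
corpus = '.'
for wav in glob.glob(os.path.join(corpus, 'DAT*', '*.wav')):
    base = os.path.splitext(wav)[0]
    grid = base + '.TextGrid'
    if os.path.exists(grid):                  # already aligned; leave it alone
        continue
    name = os.path.basename(base) + '.txt'
    local = os.path.join(os.path.dirname(wav), name)    # local (edited) copy
    shared = os.path.join(corpus, 'transcripts', name)  # shared master copy
    txt = local if os.path.exists(local) else shared
    if os.path.exists(txt):
        subprocess.run(['make_text_grids', wav, txt, grid])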

One way that align_all is used is this. We have a transcript of what each speaker was asked to say in each audio file. However, sometimes they make a mistake, have trouble with words, repeat themselves, or otherwise don't say what the transcript says they should. In this case we delete the TextGrid and make a local copy of the transcript (in the data directory), which is edited to add skip regions and word changes as necessary to match what the person actually said. Then align_all.py is run again, and new TextGrids are created only for the files where the old TextGrid was deleted. align_all.py also creates a time-stamped "to do" file that lists all of the new TextGrids created in a particular run of the script.

Hand-correct the alignments

In this step of the process a skilled human must visually inspect the time alignments produced by the automatic process. We use Praat to do this, inspecting alignments in windows about 2 seconds long and changing the labels and boundaries as needed. To speed this process, we use a Praat script to cycle through the TextGrid files and keep track of the hand-correction progress.

The script initialize.py is run for each data directory in the corpus. This script compiles a list of wav/TextGrid pairs that should be hand-corrected.
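
A sketch of what that compilation step could look like in python follows; the name of the output list file is an assumption, not what initialize.py actually writes:

import glob
import os

# Sketch only: list the wav/TextGrid pairs in one data directory that are
# ready for hand correction. The output file name is an assumption.
datadir = 'DAT01'
with open(os.path.join(datadir, 'hand_correct_list.txt'), 'w') as out:
    for wav in sorted(glob.glob(os.path.join(datadir, '*.wav'))):
        grid = os.path.splitext(wav)[0] + '.TextGrid'
        if os.path.exists(grid):              # only pairs with an alignment
            out.write('%s,%s\n' % (wav, grid))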