Klatt Synthesizer Parameters

From Phonlab

Below is a list of parameters that can be altered to create different types of synthetic utterances using the Klatt Synthesizer.

Constant Parameters

Sound file duration (du)

The duration, in msec, of the utterance to be synthesized. This number will be rounded up to the nearest multiple of 'ui', the number of milliseconds in a parameter update time interval.
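As a small illustration of the rounding rule, the sketch below rounds a requested duration up to the next multiple of the update interval. The function name `round_up_duration` is hypothetical and not part of the synthesizer; only 'du' and 'ui' come from the parameter list.

```python
def round_up_duration(du_ms: int, ui_ms: int = 5) -> int:
    """Round du_ms up to the nearest multiple of ui_ms (both in msec)."""
    if ui_ms <= 0:
        raise ValueError("ui must be a positive number of milliseconds")
    frames = -(-du_ms // ui_ms)  # ceiling division on integers
    return frames * ui_ms


print(round_up_duration(502, 5))  # 505
print(round_up_duration(500, 5))  # 500
```

A duration that is already a multiple of 'ui' is left unchanged.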

Update interval (ui)

The number of msec of waveform generated between times when parameter values are updated. The default value of 5 ms is frequent enough to mimic most rapid parameter changes that occur in speech (in fact, 10 ms updates may be often enough).

Number of cascade formants (nf)

Specifies how many formants, counting from F1 up to a maximum of F8, are actually in the cascade vocal tract. The default value is 5, which is an appropriate number if the sampling rate is 10,000 samples/sec and the speaker has a vocal tract length of 17 cm.
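The default of five formants can be motivated with a back-of-the-envelope calculation. The sketch below assumes that the vocal tract behaves like a uniform tube closed at the glottis and open at the lips, and that the speed of sound is about 35,000 cm/sec. Both are simplifying assumptions, not statements from this page. Under them, the resonances fall at odd multiples of c/4L, and only those below half the sampling rate can be represented.

```python
def tube_formants_below_nyquist(length_cm=17.0, sample_rate=10000,
                                c_cm_per_s=35000.0):
    """Resonances (Hz) of an idealized uniform tube below the Nyquist frequency."""
    nyquist = sample_rate / 2
    formants = []
    n = 1
    while True:
        # Quarter-wavelength resonances: (2n - 1) * c / (4 * L)
        freq = (2 * n - 1) * c_cm_per_s / (4 * length_cm)
        if freq >= nyquist:
            break
        formants.append(freq)
        n += 1
    return formants


formants = tube_formants_below_nyquist()
print(len(formants))
print([round(f) for f in formants])
```

With a shorter assumed tract or a higher sampling rate, the count changes, which is why 'nf' is left adjustable.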

Source select (ss)

A switch that determines which of two voicing source waveforms is used for synthesis. The default value, 1, causes a low-pass filtered impulse train to be generated, while the value 2 causes a more natural waveform with a definite sharp closing time to be invoked.

Output select (os)

Determines which waveform is saved in the output file. If 'os' has the default value of zero, the normal final output of synthesis is saved. Other output options are given below:

1 -- Voicing periodic component alone

2 -- Aspiration alone

3 -- Frication alone

4 -- Glottal source (voicing, turbulence, and aspiration)

5 -- Glottal source sent to parallel vocal tract (AP) + radiation characteristic

6 -- Cascade vocal tract, output of nasal zero resonator

7 -- Cascade vocal tract, output of nasal pole resonator

8 -- Cascade vocal tract, output of fifth formant

9 -- Cascade vocal tract, output of fourth formant

10 -- Cascade vocal tract, output of third formant

11 -- Cascade vocal tract, output of second formant

12 -- Cascade vocal tract, output of first formant

13 -- Parallel vocal tract, output of sixth formant alone

14 -- Parallel vocal tract, output of fifth formant alone

15 -- Parallel vocal tract, output of fourth formant alone

16 -- Parallel vocal tract, output of third formant alone

17 -- Parallel vocal tract, output of second formant alone

18 -- Parallel vocal tract, output of first formant alone

19 -- Parallel vocal tract, output of nasal formant alone

20 -- Parallel vocal tract, output of bypass path alone

Random Seed (rs)

Seed value given to the random number generator routine. Any number from 0 to 99999 can be specified. Each seed produces a quite different random number sequence, and therefore frication and aspiration noises that differ from those used to generate previous stimuli.
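To illustrate why the seed matters, the sketch below uses Python's standard `random` module as a stand-in. The synthesizer's own random number generator is not described here and will differ; the helper `noise_samples` is hypothetical.

```python
import random


def noise_samples(seed, n=5):
    """Return n pseudo-random noise samples in [-1, 1] for a given seed."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]


same_a = noise_samples(1)
same_b = noise_samples(1)
other = noise_samples(2)

print(same_a == same_b)  # the same seed reproduces the same noise
print(same_a == other)   # a different seed gives different noise
```

Reusing a seed makes a stimulus reproducible, while changing it gives a fresh noise token with otherwise identical parameters.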

Overall gain (g0)

An overall gain control is included to permit the user to adjust the output level without having to modify each source amplitude time function.

Delta of formant bandwidth (db) and delta of formant frequencies (dF)

These parameters are obscure and rarely used. They control a degree of flutter in the values of F1 and b1 as a function of the glottal state, which may improve the naturalness of the voice.

Variable Parameters

Voicing

Amplitude of voicing (av)

Amplitude in dB of the voicing source waveform sent through the cascade vocal tract.
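Because the amplitude parameters are expressed in dB, a change in a parameter value corresponds to a multiplicative change in waveform amplitude. The sketch below applies the standard decibel relation for amplitudes, ratio = 10^(dB/20). It says nothing about the synthesizer's internal reference level or scaling, which are not specified here, and the function name is hypothetical.

```python
def db_to_amplitude_ratio(delta_db):
    """Amplitude ratio corresponding to a level difference of delta_db decibels."""
    return 10 ** (delta_db / 20)


print(round(db_to_amplitude_ratio(6), 3))    # about a doubling of amplitude
print(round(db_to_amplitude_ratio(-20), 3))  # one tenth of the amplitude
```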

Fundamental frequency of voicing (f0)

Rate at which the vocal folds are currently vibrating in Hz.

Amplitude of turbulence (at)

Amplitude in dB of turbulence noise generated at the glottis during the open phase of a glottal vibration.

Voice spectral tilt (tl)

The (additional) downward tilt of the spectrum of the voicing source, in dB, as realized by a soft one-pole low-pass filter. A value of zero has no effect on the source spectrum, while a value of 24 tilts the spectrum down gradually such that frequency components above about 3 kHz are attenuated by about 24 dB relative to a more normal source spectrum.
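One way to picture the tilt filter is as a one-pole low-pass of the form y[n] = (1 - a) x[n] + a y[n-1], with the pole chosen so that the gain at 3 kHz is 'tl' dB below the gain at 0 Hz. The sketch below solves for that pole. It is an illustration under those assumptions, not the synthesizer's actual filter design, and the function names are hypothetical.

```python
import cmath
import math


def tilt_filter_pole(tl_db, sample_rate=10000, ref_hz=3000.0):
    """Pole `a` of y[n] = (1 - a) * x[n] + a * y[n-1] giving tl_db of loss at ref_hz."""
    if tl_db <= 0:
        return 0.0  # no tilt: the filter passes the input unchanged
    g = 10 ** (-tl_db / 10)  # target power gain at ref_hz
    cos_w = math.cos(2 * math.pi * ref_hz / sample_rate)
    p = 1 - g * cos_w
    q = 1 - g
    # Solve (1 - g) a^2 - 2 (1 - g cos w) a + (1 - g) = 0 for the root below 1.
    return (p - math.sqrt(p * p - q * q)) / q


def gain_db(a, freq_hz, sample_rate=10000):
    """Magnitude response in dB of the one-pole filter at freq_hz."""
    z_inv = cmath.exp(-2j * math.pi * freq_hz / sample_rate)
    h = (1 - a) / (1 - a * z_inv)
    return 20 * math.log10(abs(h))


a = tilt_filter_pole(24)
print(round(gain_db(a, 0), 2))     # gain at 0 Hz, in dB
print(round(gain_db(a, 3000), 2))  # gain at 3 kHz, in dB
```

The response falls off smoothly between 0 Hz and 3 kHz, which matches the description of a gradual tilt rather than a sharp cutoff.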

Voice skew (sk)

The number of 25 microsecond increments to be added to and subtracted from successive fundamental period durations in order to simulate the tendency for alternate periods to be more similar in duration than adjacent periods, one aspect of vocal fry.
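The effect of 'sk' on period durations can be sketched as follows. The helper `skewed_periods_ms` is hypothetical, and the choice to lengthen the first period and shorten the second is an assumption made for illustration.

```python
def skewed_periods_ms(f0_hz, sk, n_periods=4):
    """Alternately lengthen and shorten the nominal period by sk * 25 microseconds."""
    nominal_ms = 1000.0 / f0_hz
    delta_ms = sk * 0.025  # 25 microseconds = 0.025 msec
    return [nominal_ms + (delta_ms if i % 2 == 0 else -delta_ms)
            for i in range(n_periods)]


# With f0 = 100 Hz (10 msec period) and sk = 4, periods alternate
# between roughly 10.1 msec and 9.9 msec.
print(skewed_periods_ms(100, 4))
```

The average period, and therefore the average fundamental frequency, is unchanged; only the alternation between long and short periods is added.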

Open Quotient (oq)

A nominal indicator of the width of the glottal pulse when using the default impulse train glottal source, and the exact number of samples in the open period when using the natural voicing source (ss = 2). A value of oq = 50, the default value, corresponds to a 5 msec open portion of the fundamental period at the default sampling rate (10000 samples/sec) and default F0 (100 Hz).
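The arithmetic behind the default can be checked with a short sketch. It treats 'oq' as a count of samples, as described above for ss = 2; the function name `open_phase` is hypothetical.

```python
def open_phase(oq_samples=50, sample_rate=10000, f0_hz=100.0):
    """Return (open duration in msec, open fraction of the fundamental period)."""
    open_ms = 1000.0 * oq_samples / sample_rate
    period_ms = 1000.0 / f0_hz
    return open_ms, open_ms / period_ms


open_ms, fraction = open_phase()
print(open_ms)   # 5.0
print(fraction)  # 0.5
```

At the defaults the glottis is open for half of each 10 msec period. If F0 is raised while 'oq' is held fixed, the open portion takes up a larger share of the period.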

Cascade

Formant frequency (F1, F2, F3, F4, F5, f6)

The "formant frequency" variables determine the frequency in Hz of up to six resonators of the cascade vocal tract model, and the frequency in Hz of each of six additional parallel formant resonators. Normally, the cascade branch of 'nf'=5 formants is used to generate voiced and aspirated sounds, while the parallel branches are used to generate fricatives and plosive bursts.

Formant bandwidth (b1, b2, b3, b4, b5, b6)

The "formant bandwidth" variables determine the bandwidths of resonators in the cascade vocal tract model. Since formant bandwidths depend in part on source impedance, and turbulence sources contribute more losses, the synthesizer provides separate control of bandwidths 'p1' 'p2' 'p3' 'p4' 'p5' 'p6' for the parallel formants.
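A formant frequency and bandwidth together define one second-order digital resonator. The sketch below uses the difference equation y[n] = A x[n] + B y[n-1] + C y[n-2] with the coefficient formulas published in Klatt's 1980 description of the synthesizer. This page does not give those formulas, so treat the code as an illustration rather than the exact implementation; the function names are hypothetical.

```python
import math


def resonator_coefficients(freq_hz, bw_hz, sample_rate=10000):
    """Coefficients (A, B, C) of a two-pole resonator with unity gain at 0 Hz."""
    t = 1.0 / sample_rate
    c = -math.exp(-2 * math.pi * bw_hz * t)
    b = 2 * math.exp(-math.pi * bw_hz * t) * math.cos(2 * math.pi * freq_hz * t)
    a = 1.0 - b - c
    return a, b, c


def resonate(samples, freq_hz, bw_hz, sample_rate=10000):
    """Filter samples through y[n] = A x[n] + B y[n-1] + C y[n-2]."""
    a, b, c = resonator_coefficients(freq_hz, bw_hz, sample_rate)
    y1 = y2 = 0.0
    out = []
    for x in samples:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


impulse = [1.0] + [0.0] * 199
narrow = resonate(impulse, freq_hz=500, bw_hz=60)   # cascade-style bandwidth
wide = resonate(impulse, freq_hz=500, bw_hz=200)    # wider, parallel-style bandwidth
print(abs(narrow[-1]) > abs(wide[-1]))
```

A wider bandwidth makes the resonator ring for a shorter time, which is one way to think about the extra losses that the parallel bandwidths are meant to model.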

Nasal pole frequency (fp) and nasal zero frequency (fz)

The variable 'fp', "frequency nasal pole", in consort with the variable 'fz', "frequency nasal zero", can mimic the primary spectral effects of nasalization in vowel-like spectra. In a typical nasalized vowel, the first formant is split into peak-valley-peak (pole-zero-pole) such that 'fp' is at about 300 Hz, 'F1' is higher than it would be if the vowel were non-nasalized, and 'fz' is at a frequency approximately halfway between 'fp' and 'F1'. When returning to a non-nasalized vowel, 'fz' is moved down gradually to a frequency exactly the same as 'fp'. The nasal pole and nasal zero then cancel each other out, and it is as if they were not present in the cascade vocal tract model.
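The cancellation described above can be demonstrated numerically. The sketch below models the nasal pole as a two-pole resonator and the nasal zero as its exact inverse (an antiresonator), using the coefficient formulas from Klatt's 1980 description. Those formulas and the order in which the two filters are applied are assumptions not stated on this page, and the function names are hypothetical.

```python
import math


def coefficients(freq_hz, bw_hz, sample_rate=10000):
    """Resonator coefficients (A, B, C) for a given frequency and bandwidth."""
    t = 1.0 / sample_rate
    c = -math.exp(-2 * math.pi * bw_hz * t)
    b = 2 * math.exp(-math.pi * bw_hz * t) * math.cos(2 * math.pi * freq_hz * t)
    return 1.0 - b - c, b, c


def nasal_pole(samples, fp, bp):
    """Resonator: y[n] = A x[n] + B y[n-1] + C y[n-2]."""
    a, b, c = coefficients(fp, bp)
    y1 = y2 = 0.0
    out = []
    for x in samples:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def nasal_zero(samples, fz, bz):
    """Antiresonator: y[n] = (x[n] - B x[n-1] - C x[n-2]) / A."""
    a, b, c = coefficients(fz, bz)
    x1 = x2 = 0.0
    out = []
    for x in samples:
        out.append((x - b * x1 - c * x2) / a)
        x2, x1 = x1, x
    return out


signal = [math.sin(0.3 * n) for n in range(50)]

# Same frequency and bandwidth: the pole and zero cancel.
restored = nasal_zero(nasal_pole(signal, fp=300, bp=90), fz=300, bz=90)
max_error = max(abs(s - r) for s, r in zip(signal, restored))
print(max_error < 1e-9)

# Zero moved above the pole: the pair now reshapes the signal.
nasalized = nasal_zero(nasal_pole(signal, fp=300, bp=90), fz=450, bz=90)
print(max(abs(s - r) for s, r in zip(signal, nasalized)) > 1e-3)
```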

Nasal pole bandwidth (bp) and nasal zero bandwidth (bz)

The variables 'bp', "bandwidth nasal pole", and 'bz', "bandwidth nasal zero", are set to default values of 90 Hz. It is difficult to determine appropriate synthesis bandwidths for individual nasalized vowels, but, fortunately, one can achieve good synthesis results without changing these default values in most cases.

Parallel

Amplitudes of parallel formants (a1, a2, a3, a4, a5, a6, ab)

These variables determine the spectral shape of a fricative or plosive burst. The bypass path amplitude (ab) is used when the vocal tract resonance effects are negligible because the cavity in front of the main fricative constriction is too short, as in [f], [v], [θ], [ð], [p], [b].

Amplitude of frication (af)

Determines the level of frication noise sent to the various parallel formants and bypass path. The variable should be turned on gradually for fricatives (e.g. straight line from 0 to 60 dB in 90 msec), and abruptly to about 60 dB for plosive bursts.
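The straight-line onset mentioned above can be written out as one 'af' value per update interval. The sketch below assumes the default 'ui' of 5 msec and a ramp length that is a whole number of intervals; the helper `linear_ramp` is hypothetical and not part of the synthesizer.

```python
def linear_ramp(start_db, end_db, ramp_ms, ui_ms=5):
    """One value per update interval, rising linearly from start_db to end_db."""
    if ramp_ms % ui_ms != 0:
        raise ValueError("ramp length must be a multiple of the update interval")
    steps = ramp_ms // ui_ms
    return [start_db + (end_db - start_db) * i / steps for i in range(steps + 1)]


# Fricative onset: 0 to 60 dB over 90 msec, one value every 5 msec.
fricative_af = linear_ramp(0, 60, 90)
print(len(fricative_af))
print(fricative_af[0], fricative_af[9], fricative_af[-1])

# Plosive burst: an abrupt jump from 0 to about 60 dB in one update.
burst_af = [0, 60]
```

The fricative track has 19 values because both endpoints of the 90 msec ramp are included.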

Amplitude of aspiration (ah)

The amplitude in dB of the aspiration noise sound source that is combined with periodic voicing, if present ('av'>0), to constitute the glottal sound source that is sent to the cascade vocal tract. A value of zero turns off the aspiration source, while a value of 60 results in an output aspirated speech sound with levels in formants above F1 roughly equal to the levels obtained by setting 'av' to 60.

Amplitude of parallel voicing (ap)

The amplitude, in dB, of voiced excitation of the parallel vocal tract. Normally, this would be allowed to remain at the default value of zero since the cascade vocal tract would be used for generating the voicing component of all voiced sounds (even voicebars and voiced fricatives).

Amplitude of nasal formant (an)

This variable is normally not used. However, when employing the parallel vocal tract to synthesize vowels, as discussed above, 'an' can be used to simulate the effects of nasalization on vowels and nasal murmurs.

Bandwidth of parallel formants (p1, p2, p3, p4, p5, p6)

These variables are set to default values that are wider than the bandwidths used in the cascade vocal tract model. It is difficult to measure formant bandwidths accurately in noise spectra, even when a fairly long sustained fricative is available for analysis. However, these default values can be used in most situations. The only adjustment is then made to the parallel formant amplitudes in order to match details in a natural frication spectrum.