Quantitative Methods in Linguistics


Keith Johnson
University of California, Berkeley

Data Sets and Scripts


This page provides links to the data sets and scripts that are used as examples in the book Quantitative Methods in Linguistics.

1. Fundamentals of Quantitative Analysis

2. Patterns and Tests
Script: Figure 2.1
Script: The central limit function from a uniform distribution (central.limit.unif).
Script: The central limit function from a skewed distribution (central.limit).
Script: The central limit function from a normal distribution.
Script: Figure 2.5
Script: Figure 2.6 (shade.tails)
Data: Male and female F1 frequency data (F1_data.txt).
Script: Explore the chi-square distribution (chisq).

3. Phonetics
Data: Cherokee voice onset times (cherokeeVOT.txt).
Data: The tongue shape data (chaindata.txt).
Script: Commands to calculate and plot the first principal component of tongue shape (principal_components).
Script: Explore the F distribution (shade.tails.df)
Data: Made-up regression example (regression.txt)

4. Psycholinguistics

Data: One observation of phonological priming per listener from Pitt & Shoaf's (2002) study.
Data: One observation per listener from two groups (overlap versus no overlap) from Pitt & Shoaf's study.
Data: Hypothetical data to illustrate repeated measures analysis.
Data: The full Pitt & Shoaf data set.
Data: Reaction time data on perception of flap, /d/, and eth by Spanish-speaking and English-speaking listeners.
Data: Luka & Barsalou (2005) "by subjects" data.
Data: Luka & Barsalou (2005) "by items" data.
Data: Boomershine's dialect identification data for exercise 5.

5. Sociolinguistics

Data: Robin Dodsworth's preliminary data on /l/ vocalization in Worthington, Ohio.
Data: Data from David Durian's rapid anonymous survey on /str/ in Columbus, Ohio.
Data: Hope Dawson's Sanskrit data.

6. Historical Linguistics

Script: A script that draws Figure 6.1
Data: Dyen et al.'s (1984) distance matrix for 84 Indo-European languages based on the percentage of cognate words between languages.
Data: A (rather arbitrary) subset of the Dyen et al. (1984) data coded as input to the Phylip program "pars".
Data: IE-lists.txt: A version of the Dyen et al. word lists that is readable in the scripts below.
Script: make_dist: This Perl script tabulates all of the letters used in the Dyen et al. word lists.
Script: get_IE_distance: This Perl script implements the "spelling distance" metric that was used to calculate distances between words in the Dyen et al. list.
Script: make_matrix: Another Perl script. This one takes the output of get_IE_distance and writes it back out as a matrix that R can easily read.
Data: A distance matrix produced from the spellings of words in the Dyen et al. (1984) dataset.
Data: Distance matrix for eight Bantu languages from the Tanzanian Language Survey.
Data: A phonetic distance matrix of Bantu languages from Ladefoged, Glick & Criper (1971).
Data: The TLS Bantu data arranged as input for phylogenetic parsimony analysis using the Phylip program pars.

7. Syntax

Data: Results from a magnitude estimation study.
Data: Verb argument data from CoNLL-2005.
Script: Cross-validation of linear mixed effects models.
Data: Bresnan et al.'s dative alternation data.