The Reconstruction Engine

The Reconstruction Engine:
a computer implementation of the comparative method


The Reconstruction Engine (RE) models the comparative method for establishing genetic affiliation among a group of languages. The program is a research tool designed to aid the linguist in evaluating specific hypotheses, by calculating the consequences of a set of postulated sound changes (proposed by the linguist) on complete (machine-readable) lexicons of several languages. It divides the lexicons into a phonologically regular part, and a part which deviates from the sound laws. RE is bi-directional: given words in modern languages, it can propose cognate sets (with reconstructions); given reconstructions, it can project the modern forms which would result from regular changes. RE operates either interactively, allowing word-by-word evaluation of hypothesized sound changes and semantic shifts, or in a "batch" mode, processing entire multilingual lexicons en masse.

In our papers about the program, we describe the parsing and combinatorial techniques used to make projections upstream or downstream (in the sense of time), the procedures for creating and consolidating cognate sets based on these projections, and the ad hoc techniques developed for handling the semantic component of the comparative method.

Some results from a study of the Tamang languages of Nepal (a subgroup of the Tibeto-Burman family) are presented, and data from these languages are used throughout for exemplification of the operation of the program.

Finally, we discuss features of RE which make it possible to handle the complex and sometimes imprecise representations of lexical items, and speculate on possible directions for future research.

Also available are:

The authors would like to communicate with others interested in computational historical linguistics. You may contact us at:

John Lowe
Linguistics department
2337 Dwinelle Hall
University of California at Berkeley
Berkeley, CA 94720-2650
voice: (510) 643-5623 / fax: (510) 643-9911

Martine Mazaudon
44 rue de l'Amiral Mouchez
75014 Paris, France


The following are edited excerpts from :

"The reconstruction engine: a computer implementation of the comparative method," Association for Computational Linguistics Special Issue in Computational Phonology 20.3 (September 1994) pp. 381-417.


The essential step in historical reconstruction is the arrangement of related words in different languages into sets of cognates and the specification of the regular phonological correspondences which support that arrangement; the well-known means for carrying out this arrangement and specification is the comparative method (see for example Meillet 1966; Hoenigswald 1950, 1960; Watkins 1989; Baldi 1990). Words which are not demonstrably related (via regular sound change) are explained by reference to other diachronic processes which are beyond the scope of the comparative method and of this paper. Sound change is first to be explained as a rule-governed process and other explanations (which invoke more sporadic and less predictable processes) offered when it is clear that non-phonological forces are at work, as illustrated in Figure 1. There will always be a number of lexical items for which no scientific explanation can be advanced: not all words are entitled to an etymology (Meillet 1966:58).


Figure 1: The "Sieve" of Explanation in historical linguistics


RE implements 1) a set of algorithms which generate possible reconstructions given word forms in modern languages (and vice-versa as well) and 2) a set of algorithms which arrange input modern forms into possible cognate sets based on those reconstructions. The first set implements a simple bottom-up parser; the second automates database management chores, such as reading multiple input files and sorting, merging, and indexing the parser's output.

The core functions of RE compute all possible ancestor forms (using a TABLE OF CORRESPONDENCES and a phonotactic description, a SYLLABLE CANON, both described in section 3.1) and makes sets of those modern forms which share the same reconstructions. Tools for further dividing of the computer-proposed cognate sets based on semantic distinctions are also provided. The linguist (that is, the user) collects and inputs the source data, prepares the table of correspondences and phonotactic description (syllable canon), and verifies the semantics of the output of the phonologically-based reconstruction process. RE, qua "linguistic bookkeeper," makes the projections and keeps track of several competing hypotheses as the research progresses. Specifically, the linguist provides as input to the program:

The parsing algorithm implemented in RE is bi-directional (in the sense of time): the "upstream"[3] process involves projecting each modern form backward in time and merging the sets of possible ancestors generated thereby to see which, if any, are identical. Conversely, given a protoform, the program computes the expected regular reflexes in the daughter languages.

Figure 2: Input-Output diagram of RE's basic projection functions

The process can be done interactively (as illustrated in Figure 3 below) or in batch using machine-readable lexicons prepared for the purpose.

Figure 3: A simple example of interactive "upstream" computation (transcription and languages exemplified are described in section 1.1)

Figure 3 is a representation of the contents of the computer screen after the user has entered three modern words (1). The program has generated the reconstructions from which these forms might derive (2) The list of numbers (called the analysis) following the reconstruction refers to the row numbers in the table of correspondences which were used by the program in generating the reconstructions. In two cases reflexes have more than one possible ancestor. The program has then proposed the two cognate sets which result from computing the set intersection of the possible ancestors (3). The proposed sets are listed in descending order by population of supporting forms.[4]

Conversely, given a protoform, RE will predict (actually "postdict") the regular reflexes in each of the daughter languages. Figure 4 reproduces the results on the computer screen of performing such a "downstream" calculation.

Figure 4: The expected outcomes of *Abap (a "downstream" computation)

Here the etymon entered by the user (1) produced reflexes (2) through two different syllabic analyses (numbered 1. and 2. in the "Reflexes of ..." window): Abap as initial /b-/ plus vowel /-a-/ plus final /-p/, and as initial /b-/ followed by rhyme /-ap/. The algorithms used in this process are described in detail in