The phonological inventories that make up the SAPhon dataset were compiled either by harvesting them from extant publications or by obtaining unpublished inventories from linguists who have worked on the languages in question. Although this process is simple in principle, several complications arise. Our goal here is to describe these complications and how we handle them, since our judgments in such cases may introduce errors into the dataset. We encourage specialists who notice errors in the dataset to contact us to propose corrections.
First, multiple incompatible phonological inventories may have been proposed for a given language in different publications. Since in the majority of cases we are not specialists in the given language or language family, it is impossible for us to directly assess the accuracy of the conflicting analyses, and we must instead rely on proxies to evaluate which inventory is most likely to be accurate. Principal among these are: 1) the level of linguistic training of the author of the inventory; 2) the explicitness of the evidence and argumentation presented in support of the inventory; 3) the typological plausibility of the inventory and its consistency with the inventories given for closely related languages (if any exist); and 4) the recency of the publication from which the inventory is drawn. None of these criteria is infallible, of course, which makes contributions from specialists in particular languages and language families all the more important.
Second, there may be ambiguities in the way the phonological inventory is described or analyzed. In some cases, for example, the symbol used to represent a phoneme is inconsistent with the prose description of that phoneme's phonological features, judged against modern, widely recognized international usage of the symbol in question. In some rare cases, there appear to be errors in the phonological argumentation advanced to support a given inventory, suggesting that certain segments should be added to, or removed from, that inventory.
Note that we have not attempted to create phonological inventories for languages for which no phonological analysis has been carried out. This means that there are languages for which lexical resources of some kind exist, such as 19th-century wordlists collected by naturalists, ethnographers, missionaries, or explorers, but which are not represented in the SAPhon dataset, since no phonological analysis based on such lexical data has been carried out. In cases where philological work on such data has been carried out and a phonological inventory has been proposed, we have included the inventory in the dataset.