30 May 2013

The Water Story

When during the First World War the Czech orientalist Bedřich Hrozný was copying cuneiform inscriptions from the Hittite royal archive, deposited at the Imperial Ottoman Museum in Constantinople, it suddenly dawned on him that the still enigmatic language was Indo-European. One of the first words that he was able to interpret was wa-a-tar (wātar) ‘water’. Hrozný already knew that the preceding clause meant something like ‘and you will eat bread...’, so ‘drink water’ certainly made sense as a continuation. Of course even the occurrence of a familiar-looking word in the right context doesn’t mean much by itself, but the newly excavated Hittite corpus was sizeable and Hrozný was soon able to understand large fragments of the texts and identify (not always correctly) more Indo-European material in them – from pronouns and sentence particles to verbs, nouns and adjectives.

The similarity of wātar to words for ‘water’ in other branches of IE is not accidental, and the word is inherited from a common ancestor rather than borrowed. We can say so with confidence not simply because the sound correspondences look fine. The ‘water’ word is declined in Hittite, with inflectional endings familiar from elsewhere. What’s more, the declension of wātar is irregular in an interesting way: the stem has the variant witen- in the oblique cases (such as the gen.sg. witenas), and its nom./acc.pl. is witār. Those Hittite alternations can be traced back to a reconstructed pattern like *wódr̥, *wedén-os, *wedṓr – with vowel substitutions, accent shifts, and a characteristic *r/n alternation in the suffix, found also in neuter stems in other morphologically conservative IE languages.

Hittite preserves a unique variety of stem variants in one paradigm. Other IE languages have levelled them out at least partly:
  • Greek has húdōr, gen.sg. húdatos, nom./acc.pl. húdata. The a of the suffix in the oblique cases and in the plural reflects a pre-Greek syllabic nasal (*ud-n̥-t-os, *ud-n̥-t-ah₂, with an extra -t- that is a Greek innovation), which means that the *r/n alternation is indirectly reflected there, but the root syllable has a fixed shape (the full vowel *e/o was deleted, leaving */wd-/ = *ud-); also the accent is fixed on the initial syllable.
  • In Vedic, only a few isolated case forms of the word survived (loc.sg. udán ~ udáni, gen.sg. udnás, nom./acc.pl. udā́), with alternations restricted to the suffix, as in Greek, but with the word accent anywhere but on the root, quite unlike Greek.
  • In Germanic, the root syllable has the same full vowel throughout (*wat-, reflecting older *wod-); the *r/n alternation is still visible, but the variants with *r and *n are segregated among different Germanic languages (cf. Old English wæter vs. ON vatn, both remodelled as vowel-final stems: *wat-r-a- vs. *wat-n-a-). Gothic, in which the stem remained consonantal, generalised the nasal variant at the expense of *r: nom.sg. wato, gen.sg. watins, dat.pl. watnam (as if from pre-Gmc. *wod-ōn, *wod-en-, *wod-n-).
  • In Baltic the suffix has a nasal, but there is also another nasal, curiously infixed in the root, presumably due to the generalisation of anticipated nasality: *wod-n-/*ud-n- > vand-/und-, as in Lithuanian vanduõ, acc.sg. vándenį, Latvian ûdens, Old Prussian wundan, unds.
  • Some of the other IE languages also preserve traces of the noun (Slavic *voda, Umbrian utur, abl.sg. une, etc.), and numerous words derived from the stem *w(V)d-(V)n/r- appear even in those languages in which the primary noun has been lost, cf. Latin unda ‘wave’ < *ud-n-ah₂ (with a metathesis common in Latin and convergent with what we see in the declension of the Baltic ‘water’ word).
Otter < OE oter < PGmc. *utraz < *ud-r-o-s
It’s a nice jumble of forms, not even quite compatible across related languages because of the independent fixation of different innovations along different branches of the family tree. It took the efforts and accumulated insight of several generations of Indo-Europeanists (culminating in the work of late 20th-century scholars such as Jochem Schindler) to explain their complicated evolution in detail. 

The hypothetical common starting point is an “acrostatic” neuter noun with an *o/e alternation (see here for a similar case): nom./acc. sg. *wód-r̥, oblique *wéd-n-, collective pl. *wéd-ōr (from a still earlier *wéd-or-h₂, where *h₂ was a collective ending, lost already in PIE after a stem-final liquid or nasal but causing the compensatory lengthening of the vowel of the stem-forming suffix). The *e in the root syllable was the “weak” counterpart of the “strong” grade *o. But PIE *e was ambiguous, because it could also represent the strong grade of some roots, whose weak variants lacked the vowel. On the analogy of such roots, new weak stems were created, with the *e deleted and the accent shifted to another syllable: collective *udṓr, oblique *udén- (especially in the loc.sg.) or *udn- (followed by an accented inflectional ending). The collective plural (‘waters’ = ‘a vast quantity of water’) was occasionally reinterpreted as a singular mass noun. Its declension was then remade as follows: nom./acc.sg. *wédōr or *udṓr, gen.sg. *udn-és.

Such analogical remodelling must have taken place already in the common ancestor of all the IE languages, and was continued after the breakup of Indo-European unity. The state of affairs visible in Hittite is archaic, but only in a relative sense. In the ancestor of Hittite the noun became accentually mobile – for example, the old gen.sg. *wéd-n̥-s  was replaced by *wed-én-(o)s on the analogy of nouns with a shifting accent – but no new weak grade was generated in the process.

[Table: development of variant paradigms. Surviving labels: acrostatic → mobile; collective → singular; nom./acc. coll.]

The emergence of new variant paradigms is schematically shown in the table above. The forms on the left are the oldest ones; those in the last column illustrate some post-PIE developments (as  reflected e.g. in Germanic). It is important to realise that there must have been considerable variation (rather than a single paradigm) already in the most recent common ancestor of the known IE languages. That variation supplied the raw material for later developments, which could be compared to independent attempts to assemble a new vase from the scattered fragments of several broken ones. Alternative PIE paradigms, each of them too complex to survive in the long run, were mixed up, reorganised, and independently simplified in the daughter languages. We understand the process rather well because the ‘water’ word fits into a more general pattern together with other words of a similar structure, and their evolution is part of a still grander model of inflection and phonological alternation in PIE nouns. Despite its complexity it’s not an arbitrary just-so story but a coherent and well-constrained theory explaining a large segment of PIE grammar. We don’t know everything about its prehistory. For example, we are still very much in the dark about the origin of the *o/e alternation in acrostatic roots. We take the left-hand column of the table as the point of departure because it represents the earliest stage we can safely reach given our current understanding of Proto-Indo-European.

To sum up, the fact that Hittite wātar is similar to English water is interesting but not particularly impressive as an isolated observation. Similarities can be found between any languages chosen at random. It’s far more significant that the inflectional pattern visible in Hittite helps us to understand the origin of the diversity displayed by cognate ‘water’ words elsewhere in the IE family and is part of the evidence used in the reconstruction of the PIE morphological system. It’s those pervasive shared patterns that demonstrate the membership of Hittite in the IE family.

But wait a minute... I promised to discuss the global etymon ʔAQ’WA, right? Why am I talking of PIE *wódr̥ instead? Well, because it’s the best-attested IE word for ‘water’, supported by a wide array of comparative evidence. Anyone trying to establish a genetic relationship between IE and other language families had better keep this in mind. But surely there are other ‘water’ words in IE that are possible candidates for PIE status and could be of interest to long-rangers? Perhaps, but they’ll be discussed in a separate post. We are taking a roundabout route to ʔAQ’WA, but we'll eventually get there.

26 May 2013

Water, Water Everywhere: Back to Global Etymologies

The Eurasiatic interlude was longer than I had originally planned. It’s time to return to Proto-World and “global etymologies”. Few things are more instructive than a nicely dissected example, so I shall compare different approaches to analysing genetic relationships and illustrate them with real data.

No matter how severely we criticise the long-range reconstructions of Nostratic/Eurasiatic, they are proposed by scholars who respect the standard comparative method and appreciate its importance for separating signal from noise. According to the mainstream approach, it is not enough to observe that numerous pairs of words across two languages are similar in form and meaning. One ought to analyse the similarities carefully in order to decide whether they are more likely the consequence of common ancestry than of non-genetic factors such as horizontal diffusion (borrowing), functional convergence (onomatopoeia, etc.), or blind chance. Attempts to meet the accepted standards in inter-family comparison may fail, but at least there are people courageous enough to accept the challenge.

M. C. Escher, Rippled surface (1950)
But there is also a different approach, called multilateral comparison (a.k.a. mass comparison), according to which genetic relationships can be (and indeed have always been) established without assembling regular sound correspondences and reconstructions. To classify a set of languages (the larger the better) one only needs a collection of tabulated data (a list of basic vocabulary and grammatical morphemes for each language will suffice), a good eye for spotting patterns, and some general linguistic training (as opposed to the expert knowledge of some of the languages being compared). It doesn’t really matter if the evidence is partly corrupt or incomplete: as long as there’s plenty of it, its cumulative weight makes errors cancel out. Finding lexical matches across a large number of languages requires no analytic skills or painstaking detective work: enough evidence leaps out at you from the printed page as you eyeball it. Classificatory conclusions can be drawn simply from inspecting the data, with a confidence approaching certainty.

The best-known advocate of multilateral comparison was Joseph H. Greenberg (1915-2001), who used it famously to classify all the languages of Africa into four genetic stocks, and then to hypothesise that all the native languages of the New World with the exception of the Eskimo-Aleut and Na-Dene families formed one vast macrofamily, dubbed “Amerind”. He was also the original proponent of “Eurasiatic” – a hypothetical genetic grouping similar to the older concept of “Nostratic”, though not identical with it. Greenberg’s successors have boldly extended his methodology to the study of the world’s languages, not only grouping them into one global phylogeny, but also arriving at twenty-seven examples of “global etymologies” labelled with approximate reconstructions (Bengtson & Ruhlen 1998). This is quite surprising, since according to their own principles comparative reconstruction is a separate technical task, not required for a correct classification. Nevertheless, mass-comparatists often propose impressionistic reconstructions, and even compile etymological dictionaries where hundreds of such reconstructions are offered (cf. Greenberg & Ruhlen 2007). They may be marked with an asterisk just like the legal products of the comparative method – a practice bound to confuse a non-specialist by creating the impression that some actual reconstructive work has been done.

In the posts to follow I shall focus on Bengtson & Ruhlen’s Global Etymology #27, ʔAQ’WA ‘water’. I intend to show, first, how Indo-European words meaning ‘water’ are analysed with the help of the standard comparative method; then, how Nostratic linguists handle data extracted from several families (including IE) to reconstruct a putative common proto-word at the macrofamily level; and finally, how mass-comparatists identify a global etymology (and restore the form of the corresponding word).

20 May 2013

A Special Question on Quechua

One of the Quechua words cited in Table 1 in my Inca Connection post is genuinely related to something Indo-European (though not to what the Eurasiatic pseudo-cognate would imply). Which one? Please post your suggestions as comments below. I shall discuss the answers (if any) about this time tomorrow.
Oops, please ignore this challenge. I thought kuchuy 'cut' was somehow back-formed from kuchillu 'knife' (Spanish cuchillo < Lat. cultellus), but it seems that I was wrong: the verb is native and the similarity is deceptive, as in the case of English cook vs. cookie.

18 May 2013

The Inca Connection: A Quechua Word Game

Gather round and I’ll show you a magic trick. Watch my hands, but first look at Table 1 below. It is based on a 200-word Swadesh list for Southern Quechua and the Tower of Babel “Eurasiatic” etymologies:

Table 1

*ma, *ʔVnV
mana... chu
ama... chu
not (negation)
not (prohibition)
what (interr.)
what (interr.)
bark (of a tree)
bark (of a tree)
bark, skin
bark, skin
far, next
thick, dense
tongue, speak
feather, tail
(a kind of) fish
thick, swell

There are only twenty-two matches because I got bored too soon, but it’s an easy game. One can even formulate some preliminary “regular correspondences” (supported by a few cognate pairs each!). For example, Eurasiatic liquids (laterals and rhotics) generally merge in Quechua, yielding /r/ (8, 11, 12, 14, 20, 22), but before certain consonants (laryngeals and semivowels) liquids are reflected as palatal /ʎ/, spelt ll (13, 17). Eurasiatic affricates are generally preserved as such, yielding Quechua ch /tʃ/ (4, 16, 17), but we also have one example of a velar stop palatalised and affricated before a front vowel (20) and possibly one more (1) if chu is related to PIE *kʷe (but I can’t say at this stage why the *e is reflected as /u/). Before the low vowel /a/ Eurasiatic dorsals become uvular /q/ in Quechua (6, 7, 8, 13). There are sporadic exceptions (2, 11) and one occurrence of a uvular before /u/ (22), but come on, folks, you can’t expect me to solve all problems in one fell swoop with so little material.

No comment (Aaarrrrrgh!)
I think I have already demonstrated beyond reasonable doubt that the Quechua people are a lost Nostratic tribe. Note that the semantic matches are impeccable and the similarity of the words is quite obvious to any open-minded observer. Indeed, the matches are much better than many of those in the LWED. The quality of examples 1, 2, 3, 4, 5, 6, and 9, in particular, is guaranteed by the fact that they represent statistically certified ultraconserved Eurasiatic vocabulary (Pagel et al. 2013). The famous items ‘mother’, ‘bark’, and ‘worm’ are among them. In many Eurasiatic languages the words for ‘bark’ and ‘skin’ are the same or look related (6, 7). This seems to be true of Quechua as well, but just in order to probe every possibility, I can offer an alternative etymology of qara ‘skin’ (8, from a different Eurasiatic root), in which case its homophony with qara ‘bark’ must be accidental. A nice match either way.

But there is more to Quechua than just its Eurasiatic affinities. It seems to be particularly close to Proto-Indo-European. Compare the Quechua numerals pichqa ‘5’ and suqta ‘6’ = PIE *penkʷe, *sweḱs, clearly a common Indo-Quechuan innovation not shared with any other Eurasiatic group. I can’t reveal too much at present, but mark my words: you’ll read about it in Nature one day – or Science, perhaps, or PNAS.

17 May 2013

A Sad Loss: Jens Elmegård Rasmussen (1944-2013)

Jens Elmegård Rasmussen, a brilliant and original Indo-Europeanist, the spiritus movens of the Copenhagen school of IE studies, has passed away. I am shocked by the news. Jens was a great scholar, but also a kind man and a wonderful host. We had known each other for years as fellow members of Internet linguistic forums before we first met in real life. It was tremendously generous of Jens to share his learning with other people in public discussions. My own debt to him is enormous. My condolences go first and foremost to his wife Birgit, and to all my friends and colleagues in Copenhagen. Jens's ideas will continue to inspire others. Great thoughts never die.

16 May 2013

Eurasiatic: A Wild Pursuit (2)

The only content word cited by Pagel et al. (2013) with a putative cognate class size of more than four is ‘give’. The proposed Eurasiatic reconstruction is *dwV[H]V, with an optional “laryngeal” and two wildcard vowels. The reconstruction of a labial-velar glide is not justified explicitly (one can only guess that without it the Altaic initial would be different). PIE *deh₃- ‘give’ (phonetically *doh₃-, with laryngeal colouring) is a widely attested root aorist. Proto-Uralic *toɣe ‘bring’ is a very nice match and indeed is frequently quoted as possible evidence of “Indo-Uralic” kinship (alternatively, it could be an old loan from IE). Ironically, if Pagel et al. had really observed their “identical meaning” requirement as strictly as declared, they would have been forced to disqualify this match as not quite exact: the Uralic meaning is primarily ‘bring’, not ‘give’. The other cognates cannot be taken seriously: the Altaic ones (apart from formal problems) have to do more with feasting than with giving, the Eskimo one means ‘sell’ rather than ‘give’, and the Dravidian one means ‘bring’ in most languages of that family. In both Eskimo and Dravidian, unlike Uralic, the only potential sound correspondence involves the initial (a dental stop, one of the most common classes of sound cross-linguistically). To claim that we are dealing with an item represented in five families out of seven is a massive overstatement. It’s really more like two families (and even there some semantic leeway must be excused).

All the remaining content words in the list are said to have cognate class sizes of four (whatever their estimated universal frequency of use). Even they, however, present various problems. For example, the ‘mother’ word (presumably *ʔVmV ‘mother, woman’ in the database, with a glottal stop that is nowhere attested, and vowels that may be present or absent at will) is a fancy cover reconstruction for a collection of obvious nursery noises coopted as kinship terms. ‘Mother’ words like /mama/, /ma/, /eme/, /ana/, /aɲa/, /aja/ etc. are likely to be re-created independently in different languages. They are common on all inhabited continents for reasons discussed already by Roman Jakobson in 1959 and by innumerable linguists since. Though babbling sequences involving /m/ preferentially refer to ‘mother’ (note the almost self-explanatory fact that Latin mamma means ‘nipple’), they may also occasionally be applied to other members of the family, like Manchu ama or Georgian mama, both meaning ‘father’. Linguists routinely disqualify nursery words as comparative evidence (unless a root that might be of such an origin is extended with non-iconic suffixes elevating it to the status of a “regular” word).

Thors goin wormin
The meaning ‘worm’ is too imprecise to be useful in inter-family comparison. English worm (like bug) may be applied to innumerable invertebrates. It was still vaguer in the early Germanic languages, where it could describe anything that crept or crawled, including legendary dragons and serpents (such as the mighty Miðgarðsormr of the outer Ocean, shown on the left). The comparative material under Eurasiatic *ḲorV- ‘worm’ in the database is aswarm with miscellaneous vermin: houseflies, gadflies, maggots and other larvae, fleas, crickets, wasps, spiders, leeches, and even eels – words referring mostly to particular taxa rather than “worms” in general. Despite this semantic latitude the formal correspondences are far from impressive. One is reminded of the apocryphal definition of etymology as une science où les voyelles ne font rien et les consonnes fort peu de chose.

There would be little point in analysing the remaining items one by one. If anyone is interested in discussing them (or any other sins against linguistics committed in the article), it can easily be done in the Comments section. The examples above were chosen more or less at random, and are not necessarily the most problematic ones; they merely illustrate some of the obvious problems. What is quite evident is that the database does not allow one to define well-bounded cognate sets, and that the size of those sets is very easy to inflate by relaxing formal or semantic criteria even minimally in a way not controlled in the study. The “data” so concocted are simply unusable. If a statistical method applied to them seems to confirm the researchers’ expectations, it’s probably because the expectations are already encoded in the reconstructions used as data (which is part of the reason why reconstructions are not supposed to be so used!). The authors of the article consider such a possibility but do not really take any precautions against it. With such input, the only signal a statistical analysis will detect comes from confirmation bias.

15 May 2013

Eurasiatic: A Wild Pursuit (1)

The aim of Pagel et al. (2013) is to lend support to the Eurasiatic superfamily proposal by showing that the sizes of the suggested cognate classes of size four or more (i.e. of Proto-Eurasiatic words reflected in at least four out of the seven families making up Eurasiatic) correlate with the frequency of use of the corresponding meanings to a statistically significant degree. In other words, the frequency of use turns out to be a good predictor of cognate class sizes because words used more frequently are less likely to undergo lexical replacement and in general retain the form-meaning association more faithfully (see Pagel et al. (2007), where the mean frequencies in question are established for the Indo-European family and found to be correlated with the linguistic half-lives of words). High-frequency words are therefore likelier to survive in a larger number of lineages descended from a common ancestor.

Since the size of the cognate classes discussed in the article can only equal 4, 5, 6 or 7, and the total number of meanings is only 23, it is clear that any inaccuracy in determining the number of cognates can have a significant effect on the statistics. It really matters if a given word is judged to have a cognate class of 7 or of 5, for example. How are those numbers determined? The authors accept the cognacy judgements from the Nostratic/Eurasiatic list of etymologies in the LWED database. The actual “cognates” are not listed, but the reader can track them down in the database using the information from table 1 in the article. Cognacy is accepted, according to the authors, if the putative cognates have precisely the same meaning in different families (or rather the same reconstructed meaning in their respective protolanguages). It’s a conservative requirement, but does it really reduce the danger of spurious cognacy? Let us have a look at some of the 23 cognate classes, beginning with the one with the most impressive score – 7 cognates in 7 families. The most successful  Eurasiatic replicator is the 2sg. pronoun ‘thou’ (see this link).
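To get a feeling for how fragile a statistic based on 23 small integers can be, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the frequency ranks and the class sizes are invented toy values (not the figures from the article), and a plain Pearson correlation stands in for whatever test the authors actually ran. It only shows that re-judging a single word from 7 to 5 visibly moves the coefficient.

```python
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# HYPOTHETICAL toy data, not taken from Pagel et al.:
# 23 meanings ranked by frequency of use (23 = most frequent) ...
freq_rank = list(range(1, 24))
# ... and invented cognate class sizes restricted to 4, 5, 6 or 7.
claimed_sizes = [4] * 16 + [5] * 5 + [6, 7]

# Re-judge a single item: the top-ranked word drops from 7 to 5.
recounted_sizes = claimed_sizes[:-1] + [5]

r_claimed = pearson(freq_rank, claimed_sizes)
r_recounted = pearson(freq_rank, recounted_sizes)
print(f"claimed:   r = {r_claimed:.3f}")
print(f"recounted: r = {r_recounted:.3f}")
```

With so few data points the direction of the shift depends on which item is re-judged and by how much, so the sketch should be read as a demonstration of sensitivity, not as a re-analysis of the study.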

The Proto-Eurasiatic pronoun is reconstructed as *ṭ[u] with an emphatic (ejective?) coronal stop in the Moscow dialect of Eurasiatic. It is clearly designed to yield the familiar IE stem *tu-, and so it does. Thanks to the *u being optional, it can be easily matched with the Uralic 2sg. pronoun, whose forms begin with *t-, though it has to be noted that a match based on a single segment (and a dental stop at that) is not particularly impressive. But Uralic, like IE, also has a 1sg. pronoun with initial *m-, which reinforces the impression that the pronominal systems of the two families are somehow related. And here easy comparison ends.

In view of 1sg. *kǝm in Chukotko-Kamchatkan (and the parallel 1pl. *muri, 2pl. *turi) the final dental in 2sg. *kǝð may plausibly reflect an original second-person marker, which would make Chukotko-Kamchatkan another “M/T family”. The Eskimo(-Aleut) 2sg. forms, cited but not discussed in the database, are much harder to analyse and their final *-n (*-nt?) cannot be unambiguously connected with the M/T system.

Left: Mars in 1894. Right: Mars now.
Dravidian has no matching form (the 2sg. pronoun there begins with *n-, uncontroversially attested throughout the family, including the outlying Brahui language). The editors of the database, however, are evidently prepared to bend over backwards to find a trace of the “real” Proto-Dravidian pronoun, so they list the verb ending  *-ti as a match, and Pagel et al. accept that despite their sworn insistence on strict semantic identity! Perhaps they didn’t know (because the database didn’t mention it) that the conjugational ending in question wasn’t even reconstructible to Proto-Dravidian; but at any rate, a verb ending instead of a pronoun does not constitute a match. There goes one member of the cognate class.

Altaic (assuming for the sake of the argument that it counts as a bona fide family) has a 2sg. pronoun in *s-, which can’t be derived from an earlier stop, emphatic or otherwise, by the Moscow School Nostraticists’ own rules. Mongolic, however, has an aberrant form which could be a possible reflex of *ṭ- (no *u again, but remember that the vowel is conveniently optional), so Mongolic evidence overrules the testimony of the remaining four subfamilies of Altaic and another match is arbitrarily declared.

Kartvelian 2sg. *sen isn’t a promising cognate, so a plural form is cited instead (as if the Proto-Kartvelian 2pl. *(ś)tkwen really looked like a plausible relative of IE *tu-, but then at least it contains a *t beside four other consonants). This alleged match flies in the face of the evidence and should have been disqualified by Pagel et al.

To sum up, we have possible cognates in three families (Indo-European, Uralic, Chukotko-Kamchatkan), one extremely doubtful case (Eskimo-Aleut), and three rather clear negatives (Dravidian, Altaic, Kartvelian). An optimistic but conservative determination of the cognate class size should be 3, certainly not 7, for the supposedly most successful Eurasiatic replicator. And even so, the “matches” involve a single consonant with the most common place of articulation. Such evidence would be easy to dismiss if it were not for the accompanying 1sg. m-pronouns. Nevertheless, ‘thou’ should really have been removed from the list rather than promoted to its top.
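The tally can be written down as a small Python sketch. The verdict labels are my own shorthand for the judgements argued above, and the function name is invented for illustration; nothing here comes from the article or the database.

```python
# Verdict for the 2sg. pronoun in each of the seven families,
# following the discussion above (labels are shorthand, not LWED data).
VERDICTS = {
    "Indo-European": "possible",
    "Uralic": "possible",
    "Chukotko-Kamchatkan": "possible",
    "Eskimo-Aleut": "doubtful",
    "Dravidian": "negative",
    "Altaic": "negative",
    "Kartvelian": "negative",
}

def class_size(verdicts, accepted=("possible",)):
    """Count the families whose verdict falls in the accepted set."""
    return sum(1 for v in verdicts.values() if v in accepted)

conservative = class_size(VERDICTS)                                   # possible only
generous = class_size(VERDICTS, accepted=("possible", "doubtful"))    # plus Eskimo-Aleut
claimed = len(VERDICTS)                                               # all seven, as in the article

print(conservative, generous, claimed)
```

Even the generous count, which admits the doubtful Eskimo-Aleut form, stops at four out of the seven claimed.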