Eight steps to Universal Grammar
Noam Chomsky (2020, 2022) argues for one mutation giving the modern Faculty of Language, FL, by a single function, ‘Merge’, combining two elements or ‘formatives’, with one as the head of the combined expression. Applying equally to any natural language, whether spoken or signed, Merge defines Universal Grammar or UG. By this generative model of FL, Merge has now fixated and become a defining genomic character of our species, anatomically-modern Homo sapiens. In its finiteness, FL is unlike any other cognition and has no corresponding reflex in non-humans. Its properties are not reducible to the needs of communication or social interaction. By this model, there could not have been any external input because by its very nature, Merge is a strictly-internal, cognitive function. But I propose here that there was an irreducibly-necessary, ordered SEQUENCE of eight steps in the evolution of FL, each originally proposed in one form or another by Chomsky, but not as a sequence.
By the proposal here:
- These steps could and did become part of the human genome by virtue of the fact that each step was defined on the same, biologically-encoded ‘spine’. The notion of a ‘spine’ was originally proposed by Martina Wiltschko (2014). The term is now widely used.
- The functionality which Chomsky characterises as Merge evolved by the combined effect of these steps. Here I follow Bart de Boer, Bill Thompson, Andrea Ravignani & Cedric Boeckx (2020) on the point that a series of steps is more biologically plausible than one.
- In modern children’s acquisition of language, these steps are normally taken in about two years, roughly between the ages of one and two and three quarters, with the acquisition process as a whole continuing until ten or so.
In the framework of Chomsky’s (1995) Minimalist Program (taken as the basis of the proposal here), this spine is a decision tree. The decision-making starts at the bottom. It maps onto the formal structure of FL, and is broadly reflected in language acquisition. The decision-making structure of FL is quite different from the recognition of other speakers and other points of view in discourse. By the argument of Shigeru Miyagawa (2010), discourse and FL interact. But they are separate domains.
By Wiltschko’s proposal, linguistic derivation is defined by the form of the branching or branchedness rather than by the eventual utility, such as the contrast between statements and questions or grammatical constructions such as passives, as in “I was hit by a falling tree”. Wiltschko characterises this as the ‘Universal Spine Hypothesis’. This is, of course, the exact opposite of what might be imagined.
Unlike the particularities of grammar, a spine-based branchedness is expressible biologically. It can be encoded mathematically. Unlike other theories of UG, the USH could have evolved, and thus become part of the human genome. By the proposal here, this was indeed the case.
From paleoanthropology, it is possible to define an earliest possible beginning to the evolution of speech and language – no earlier, I submit, than the first stone tools – and a last possible end point, when anatomically modern Homo sapiens spread across first the Old World, then Australia, then the New World, and then the Pacific and New Zealand. This gives a total period of speech and language evolution of at most three million years, much longer than the period envisaged by Chomsky, but still brief for a change in the genome of such significance.
This linguistic evolution can only be defined on general, i.e. non-linguistic, principles. It thus cannot directly reference any categories which would only come into existence by virtue of the evolution. It must encompass the essential apparatus which yields the complete set of linguistic categories. A category may seem to occur only very rarely or even in only one of the six or seven thousand known languages. Some categories are idiosyncratic. But if a category occurs at all, the learnability space must be configured accordingly.
Evolution thus defines UG, yielding an apparatus which allows the categories, evidenced by particular languages, to be gradually redefined in ways which seem to follow a logic of their own. One instance of uncommon idiosyncrasy is the set of English auxiliaries, including do, will, would and might in “Do we like tea?” “We do not like tea” “Wouldn’t she like tea?” and so on. Do has grown into its modern use and meaning over the past few hundred years, more or less since the time of Shakespeare, by a time scale much shorter than the time scale of language evolution. In the limit case, English grammar allows “Mightn’t you have been being deceived?” On this point, English grammar may represent just one, outlier case of a particular character of Indo-European languages. Many of these languages turn a particular sequence of verbal forms into a part of the grammar, each form expressing a particular category, tense, possibility, relevance to the present, and so on. As shown by Aikhenvald (2006), what she calls ‘serial verbs’ are common across the languages of the world. English just seems to push the serialisation of auxiliaries into an extreme, grammaticalised form, as in “Mightn’t you have been being deceived?” where the complexity exceeds anything any child or adult is ever likely to hear. But whether experienced or not, it is understandable. The natural process of acquisition extrapolates beyond direct experience.
More pervasive, and arguably universal, are ‘Case theory’, the ‘feature’ of ‘person’, ‘agreement’, and ‘displacement’. Case is evident in the contrasts between I and me, he and him, she and her, we and us, and so on. Person involves either the speaker or the listener, or neither – the last by what is known as the ‘third person’. Agreement is manifest in the contrasts between “I like him” and “He likes me” or “We like them” and “They like us”, where like and likes alternate, with the S of likes agreeing with the singularity and third person of he or she. Displacement is especially and most clearly manifest in the special case of questions with more than one Wh word, as in “What did he eat when?” or “When did he eat what?” with only one, but not both, of the Wh words displaced from where it is interpreted to where it is pronounced – on the left edge of the structure. It is evident from the form of the answer – plausibly “He ate two sandwiches yesterday” – that this displacement is from right to left rather than the other way round.
Shigeru Miyagawa (2010) asks: How and why are properties such as agreement and displacement seemingly universal? All human languages appear to have at least some version of these functionalities. He argues that at least some of these universalities are by the exigencies of discourse. Properties move and harmonise for the sake of good understanding. It is by virtue of these universalities that it makes sense to speak of ‘Universal Grammar’, or UG.
Similarly, there is a contrast between ‘ellipsis’, as in “I believe in peace and democracy, and so does my sister” with “believes in peace and democracy” understood as the predicate of the second part of the sentence, but not pronounced, and ‘islands’ from which unpronounced structure is prohibited, as in “What do you believe in peace and?” ungrammatical in English, but significantly with corresponding and equivalent expressions which are also ungrammatical across a wide sample of unrelated languages. Here the island is peace and X, where X is any ‘noun phrase’ and the whole expression is sometimes known as a ‘conjoined noun phrase’. What cannot be understood as a question about X.
Ellipsis, allowing unpronounced structure, and islands disallowing it, seem to be part of every language. Like case, person, agreement, and displacement, they are part of UG and FL. But exactly what can and can’t be understood without being pronounced and how case, person, agreement, and displacement are expressed all vary from language to language, and are thus part of the learnability space.
Ellipsis was named and studied in the classical period. Islandhood was first identified by John Ross in 1967.
By the proposal here – setting out some implications:
- The first step involved a relation between physical and semantic features (as opposed to an unordered stack of feature attributes in a single entity). The physical aspect may have been either gestural or vocal. If primordially there was a bias towards gesture, this bias must have disappeared as language evolution progressed, or there would be sign languages used natively by normally-hearing populations (but see Michael Tomasello and Josep Call (1997) and Michael Tomasello (2010) for a quite different point of view on this);
- Each step was expressed in a form, reducible to the effects of a headed branchedness and features, commonly known as the ‘spine’, as the basis of Universal Grammar, UG, entirely defined by the terms of its evolution, making it biologically ‘encodable’, and such that it could be entered into a computation. As is often pointed out, the more restricted the formulation of UG, the greater the descriptive challenge;
- Each step must have offered a decisive biological advantage in terms of communication – with effects detectable by potential mates, leading to a consistent bias in mate selection. In terms of statistical dynamics, the advantage here may have been slight. But a slight bias applying consistently over thousands of generations can effect an epigenetic change in the genome;
- The single function of Merge is broken down into a sequence of steps, each minimal, as by the canons of evolutionary biology noted by de Boer et al (2020);
- Universal Grammar, UG, could not have evolved its precise character other than by a precise sequence of discrete steps, each offering one or more grammatical particularisations, exploited in language-specific ways. The particularities of any given language must be ultimately reducible to variations defined on the spine, and derivations from them and interactions between them. These variations are part of the learnability space – what has to be learnt in slightly different ways according to the language being learnt. The steps thus provided an apparatus around which grammatical structures could develop. The language learning child has to navigate the learnability space without any guidance about the points of variation;
- UG is the foundation of language acquisition;
- All humans alive today must be descended from one African stem. The ancestry may be from more than one point on the stem, which may have migrated and introgressed (See Chris Stringer (2016), Aaron Ragsdale et al (2023) for a different point of emphasis). By the last of these points of descent, modern UG was necessarily complete;
- The evolution of speech and language began earlier and proceeded more gradually and for longer than by Chomsky’s proposal;
- The totality of the steps gave speech as well as language;
- The steps proposed here are by evolution and evidenced in the acquisition of language and disorders, the differences between languages, new language formation, particularly creoles, signed languages, the commonalities of unrelated languages, the detailed examination of any one language – for our purposes here, English, and the special case of Nicaraguan Sign Language, as a language which developed in a single school generation in very special, probably world-unique circumstances. In other words, there is empirical evidence in relation to the substantial issues here. While there is no reason for assuming that modern speech and language acquisition exactly replicate their evolution, there is every reason to expect significant parallels, as in other areas of comparative biology;
- The proposal here involves what is sometimes known as ‘early complexity’, on the understanding here, complexity as early as possible, but no earlier and no later. This is to say that as one evolution is built on by another, the earlier evolution cannot be amended. So evolved properties are plausible only at the original point of evolution. By the same token, in the process of acquisition by modern children, UG is established early, on the evidence here, typically between one and three, with the full acquisition of a target language typically continuing until ten or so.
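The point above about a slight but consistent bias in mate selection can be given rough numbers. The sketch below is not part of the proposal. It assumes the simplest textbook case: a single heritable variant, a very large population, no chance effects, and a constant relative advantage `s` for carriers. The function names and the values of `s`, `start` and `target` are all hypothetical, chosen only for illustration.

```python
def next_frequency(p: float, s: float) -> float:
    """Frequency of the variant after one generation.

    Assumption: carriers leave (1 + s) times as many offspring as
    non-carriers, so their share of the next generation is reweighted.
    """
    return p * (1 + s) / (1 + p * s)


def generations_to_reach(start: float, target: float, s: float) -> int:
    """Count generations until the variant's frequency reaches `target`."""
    if s <= 0:
        raise ValueError("s must be positive, or the variant never spreads")
    if not 0 < start < target < 1:
        raise ValueError("need 0 < start < target < 1")
    p = start
    generations = 0
    while p < target:
        p = next_frequency(p, s)
        generations += 1
    return generations


if __name__ == "__main__":
    # Hypothetical values: a 1% advantage, starting in 1 carrier per 1000.
    print(generations_to_reach(start=0.001, target=0.99, s=0.01))
```

Under these assumptions, a one per cent advantage takes a variant from one carrier in a thousand to near-universality in a little over a thousand generations. Real populations are small and subject to chance, so this shows only the order of magnitude.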
That is the proposal which I am exploring and testing in my research. At the moment, this is on the basis of two brothers. Next I shall be turning to the records of three unrelated children in Edinburgh, two girls and a boy, observed in a different way.
But language, as characterised by FL, contrasts with discourse, or the way language is used to achieve results, express emotions, interest, entertain or deceive. Even at the very beginning of the evolution, it is possible to imagine discourse functions as soon as there are distinct meanings in particular expressions. As UG evolves, some of these functions, such as commanding and questioning, become part of FL. Modern language is both used for discourse and structured in such a way that meanings can be shared.
Following six precursor cognitions, some of the seven specifically linguistic steps which I propose can be exemplified very approximately in the language development of a modern child – except that what the child is hearing is a fully-developed, modern language, and the child is the inheritor of a corresponding genomic capacity, albeit one that is initially undeveloped.
A tale of two children
The examples given here are from the diaries kept by my wife and myself of the development of our two sons, Joe and Frank, about fifteen thousand observations in all, filling nine cathedral analysis note books. The observations continued until Joe, the older of the two, was almost ten and a half. We tried to make our observations as accurate as possible, as soon as possible after the event. Obviously we must have missed many developmentally significant occasions.
Following the convention established by Jean Piaget, ages are given in the form 10; 4 (25), meaning ten years, four months, and twenty five days. That was Joe’s age at the last observation, the only instance of a question with two Wh words. This degree of precision is useful and appropriate. Some developments happened overnight or over a few days. Each of the cases exampled here was the first to satisfy some particular criterion.
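For anyone wanting to sort or compare observations by age, the notation can be read mechanically. The following is a minimal sketch, assuming the ‘years; months (days)’ form used in this text, with or without a space after the semicolon. The names `Age` and `parse_age` are hypothetical and are not taken from the diaries or from any existing tool.

```python
import re
from typing import NamedTuple


class Age(NamedTuple):
    """An age as years, months and days; tuples compare in that order."""
    years: int
    months: int
    days: int


# Accepts "10; 4 (25)" as well as the tighter spacing "1;7 (30)".
AGE_PATTERN = re.compile(r"^\s*(\d+)\s*;\s*(\d+)\s*\((\d+)\)\s*$")


def parse_age(text: str) -> Age:
    """Parse an age written as 'years; months (days)'."""
    match = AGE_PATTERN.match(text)
    if match is None:
        raise ValueError(f"not in 'years; months (days)' form: {text!r}")
    years, months, days = (int(group) for group in match.groups())
    return Age(years, months, days)


if __name__ == "__main__":
    ages = [parse_age(a) for a in ["1;7 (30)", "0; 11 (23)", "10; 4 (25)"]]
    # Sorting puts the observations in developmental order.
    print(sorted(ages))
```

Because an `Age` is a tuple of years, months and days, ordinary comparison already puts earlier observations first, without converting to a count of days.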
It is only on coming back to these records for analysis, forty years after they were made, that I am coming to fully appreciate what they reveal. One cannot listen too carefully to the details and nuances of what children say. They can say more than they seem to on a first listening.
Listening to Joe and Frank talking with their friends and peers, there was nothing obviously singular about their speech and language. The structures exampled here seem to be typical of those of children from a liberally-minded, middle-class family, going to a neighbourhood, non-denominational, local authority school, catering for children from a wide variety of social and ethnic backgrounds.
The ordering of the steps is by conceptual necessity. The examples, and the fact that they are in a corresponding sequence, are evidence for the proposal here.
1. Label of sound and meaning UNLIKENESS in forms
By the proposal here, the grammatical apparatus originated in forms, each defined on a combination of unlike elements, one sensori-motor or articulatory and perceptual, the other pragmatic and semantic, in one arbitrary symbol, defined by the KNOWN RELATION between the elements. In other words, there had to be some awareness of the relation, no matter how vague or imprecise, or it could not have developed and fixated across the species as a property of the genome. This awareness was effectively a cognitive label. The evolutionary cognitive genius was in the awareness of the unlikeness between the two elements of the physical / semantic relation, and that this was different from any shriek or howl, or the call systems of modern vervet monkeys with different alarm calls for particular predators, snakes, eagles, and leopards, or the richer but seemingly less specific systems of chimpanzees. (The recent discovery that these signals are categorised by the direction of the threat is irrelevant here).
One articulatory / perceptual feature would be hard to detect, and easily over-looked. But if used consistently it would be detectable. And that may have been crucial to its initial spread.
By the simplest, most conservative, most Darwinian assumption, the initiative here is most likely to have been by giving a one-term meaning, in the simplest of dictionary senses, to some perceptible element, whether spoken or signed.
Ferdinand de Saussure (2016) referred to the relation here as one between the ‘signifier’ and the ‘signified’. For Saussure, signifier and signified were fully-developed, modern words. Saussure emphasised the complete arbitrariness of this relation. Some linguists have drawn attention to the way particular feature combinations, supposedly onomatopoeic, have related connotations – like ASH in bash, mash, smash, crash, all denoting some degree of violence, tick tock and clip clop denoting the sounds of a clock or a horse, and so on. The case of iconicity in sign language is similar. But in no modern language are such relations anything but marginal.
Supposing that the semantics might have been emotional, this is shown below by a woman’s face with the label arbitrarily at the top. The label is not a repetition of what it represents, but a definition of the relation between the lip gesture and the meaning, shown here by the outline text and the diagonal strike through.
Primordially, this might have been represented by a mere closure of the lips.
The lips are just one point at which the vocal tract can be closed or constricted with an acoustic effect. The lips, just one of many humanly-possible articulators, are utilised in almost all languages.
This does not prohibit the continuation of a system which may have been used by our common ancestors at the point of differentiation between the two lineages, humans and chimpanzees. These early ancestors may have had a system of calls of any degree of acoustic length and complexity, and used for any purpose. But such a system could be supplemented by one or more proto-words, to be articulated and understood, and labeled accordingly. The label may or may not have been overtly signaled.
How far the first inventor or inventors were CONSCIOUSLY aware of what they had invented is obviously impossible to say. All we know is that from the beginning, speech and language were noticed. Or the capacity could not have spread, and eventually fixated.
By the Minimalist Program, there are what are known as ‘uninterpretable features’ defining the ‘case’ and ‘subjecthood’ of he, she, and we in “He fell over”, “She fell over” and “We fell over”. One way or another, both of these categories, the case and the subjecthood, have to be explained – categorially by classical grammars or featurally by biolinguistic-type theories. Logically, there are two possibilities: Either the uninterpretable features were at least fore-shadowed from day one, or they evolved at some later point in the evolution of language. The former is the more plausible. Suppose, in the limit case, that the first proto-word with one acoustic feature and one semantic feature was just one item in a repertoire of hoots and shrieks. It would have been significant to users and anyone capable of understanding and processing it. It would plainly be quite fanciful to imagine that such a complex notion as case or subjecthood had any application in the context of isolated, primordial proto-words. But it is much less fanciful to imagine that the primordial gestures were noticed and cognitively labeled as such, and that this labeling laid the basis for subsequent elaboration.
The primordial system, perhaps with just one acoustic and one semantic feature, yielded only a small stock of expressions. If such proto-words just paired single features of sound and meaning, this would allow a lexicon or vocabulary no larger than the number of features of either sort, semantic or acoustic. This maximised pressure on the feature sets. The more features there were, the more it was possible to say. But even just one suitable expression could be very appealing in a would-be partner. The adaptation here conferred a significant degree of Darwinian fitness. Inheritors of the adaptation had a greater chance of mating and thus of passing the adaptation on.
This step may not be fully represented in modern children’s acquisition. In a modern, speaking community it would not be recognised by adult speakers as speech. The only fossil may be in the principle of labelling. It may be that in modern child language acquisition there are forms even simpler than those which are taken to be the first ‘words’ which are just not recognised as speech. Such primitive expressions may be missed or overlooked by the parent / observers. Primitive developmental expressions are hard to spot. But alternatively it may be that in modern acquisition the labelling is bypassed by collapsing it with the next step, yielding an exponential increase in the lexicon or vocabulary.
2. Another feature
The scope of the system is exponentially increased if more than one like feature can be assembled together. On the simplest possible assumption that the primordial relation was between just two unlike features, one semantic, the other articulatory / perceptual, there is an exponential growth in the lexicon by allowing more than one such feature of each sort. This can effect either an increase in the overall number or an increase in the specificity, in the case below by generalising the reference to ‘parents’ and by increasing the salience of the message by doubling the features to include two gestures, one with the lips or otherwise and one involving or not involving the nose.
Of the faces in the image below, the only certainties are that they were African and that hair-dressing and barbering were not yet fully developed.
The label represents the increase in the number of features, sensory-motor on the right, and semantic (in this case) on the left.
This is the smallest logically-possible such increase by a whole number. And the communicative advantage may have been slight. But formally there is a significant increase here nevertheless.
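The formal increase can be counted directly. The sketch below assumes, purely for illustration, that a form is any non-empty bundle of features of one sort, and that every such bundle is usable. The feature names and the function `bundles` are hypothetical.

```python
from itertools import combinations


def bundles(features):
    """List every non-empty combination of the given features.

    Assumption: any subset of features can be assembled into one form.
    With n features there are 2**n - 1 such bundles.
    """
    return [
        frozenset(combo)
        for size in range(1, len(features) + 1)
        for combo in combinations(features, size)
    ]


if __name__ == "__main__":
    one = ["lip closure"]
    two = ["lip closure", "nasal airflow"]
    three = ["lip closure", "nasal airflow", "voicing"]
    for features in (one, two, three):
        # Single features give len(features) forms; bundling gives more.
        print(len(features), "features:", len(bundles(features)), "bundles")
```

One feature gives one form, two give three, three give seven, and ten would give 1023. Without bundling, the count stays equal to the number of features. The step from one feature to two is small in absolute terms, but it is the step at which the count starts to outrun the number of features.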
Consider some (poor) examples – poor because they all go beyond the first two steps, as proposed here.
Joe at 0; 11 (23) “O” (echo of hello), and at 0; 11 (26) “ba ba” (echo of bye bye)
Frank at 0; 9 (4) “Mum”, at 0; 9 (7) “Allo” (echo), at 0; 10 (7) “Mama”, 0; 10 (10) “bo” (echo of bottle), 0; 11 (10) “rabbit” (echo)…. “bus”.
Similarly, oh, ah, ooh, mmm, each a modern English phoneme with some internal complexity. But both aspects of the system, both the sensory-motor aspect and the semantic-pragmatic aspect, may have evolved with the system itself. The primordial system may have quickly evolved more power.
Modern language exploits the complete resources of a fully-developed, feature system. Apart from tut tut, every expression in every language, even those consisting of just one speech sound or phoneme, uses more complex combinations of features. The primordial system has been supplanted by both evolution and what is known as grammaticalisation. But features are critical to all aspects of modern language, in the sound system, the way words are formed, the way they are put together, and in the structures of meaning, at each point expressing the duality of meaning and physical expression.
There are what may be fossils of the primordial sound / meaning relation in:
- English yes and no and what are known as ‘modal particles’ or ‘discourse markers’ in many languages, curses and greetings;
- Expressions like “Sh” as a call for silence, “Ah” for pleased surprise, “Eh?” as a query, and tut tut for disapproval;
- What are known as ‘imperatives’, commonly, as in English, by the ‘root’ form of a word, such as come and go, sometimes for the sake of saving life, sometimes greatly elaborated by more complex grammar, as “For goodness sake, just go”;
- In language acquisition, there is a long period of just single words, as greetings, as “more”, “hello”, “bye bye” and so on.
These modern expressions are unlike the primordial forms proposed here in that they exploit combinations of features by later steps. But they are used in a way more characteristic of the primordial system.
At the lowest, logically-possible point of grammatical structure there were combinations, by aspects of meaning, increasing the lexicon, by the properties of sounds yielding greater clarity, and by words yielding an increase in what could be said. At some point this must have evolved. Or there would be no such thing as language.
Acoustically, a closure at the lips could be combined with an open airway through the nose and a bringing together of the vocal cords to yield a first step towards modern M. With just two features involved, this would have been barely more perceptible than a single feature. But the combinatoric was significant.
Here, I use the term ‘Pair’, slightly modifying Chomsky’s (2020) term, which, it seems to me, risks confusion. Pair does not involve any internal structure between the elements.
For example, Joe at 1; 5 (23), six months after his first ‘words’, said “Bye, doggy”.
Frank, now much more precocious than Joe, at 1; 2 (22) said “Bye bye, Daddy”.
This seems to be just discourse with no internal structure. “Doggy, bye” and “Daddy, bye bye” would not seem to involve any change in meaning.
Using the terminology of John Langshaw Austin (1957), the ‘force’ is analytically and cognitively indeterminate. A grammar by Pair is massively ambiguous.
- The senses of small or young can be combined with the sense person to yield the sense of child;
- Or a comment can be made about an event, as by modern “Nice”;
- Or the elements of sound structure can be put together as by modern bye and Ma, with bye bye and doggy or daddy, by jumping ahead.
It may be that in modern acquisition, with fully competent modelling as the learner’s only input, the sound structuring is ahead of all other aspects of UG. Pair allows the features of modern speech to combine. But at the point of evolution, Pair may have been not powerful enough to yield even the simplest modern syllable, a highly constructed entity. Modern equivalents are equivalent only very approximately. Showing an explicitly discourse element by a wiggly line, Pair thus relates two variables, a1, b1 and a2, b2 of AB.
Both bye and doggy are aspects of discourse, doggy seemingly addressed to a toy dog. But doggy exploits the combinatorial power of UG more fully than bye. Unlike bye, doggy is potentially referential, though seemingly not in this instance. Equivalently, with reference to the observation of Joe at 1; 5 (23):
This primordial system is reflected in modern language by fossils such as:
- Something close to reduplication in adult speech by hip hop, chit chat or pow wow;
- Expressions standardly by root forms combined in ways falling outside the terms of the grammar, kill joy, go between, go slow and so on, noted by Ljiljana Progovac (2015);
- Possibly adverbs, or some adverbs, as the only sort of word which can appear in different positions in English, albeit with some subtle changes in meaning, as in “Sadly he is going to die”, “He is sadly going to die”, “He is going sadly to die”, “He is going to sadly die”, “He is going to die sadly” – all grammatical, at least for those who allow infinitives to be split.
The modern infant generally puts his or her first two ‘words’ together around the point when the vocabulary reaches around fifty items. There is no reason for assuming that Pair evolved at the point when some number of items became accessible.
Primordially, both the forms, not yet truly words, and the sounds are likely to have displayed only a small part of the complexity of modern speech and language, as in “Bye, love” or “Hi, bro”, by several applications of the principle here, to form the modern sound structure.
By another step, one element becomes the head of the combined expression. The relation is asymmetric between two contrasting elements, neither such that it can in all cases stand on its own, in other words, not bye bye or hello. In the framework here, the non-head is known as the ‘complement’. Dominance is thus built into the system.
For example, Joe at 1;7 (30) said “In er car…. In car”.
And Frank at 1; 3 (2) said “Open door”.
Not yet a spine, but a relation which could be duplicated, as the first step towards building a spine.
“In car” has various possible interpretations, depending at least partly on the circumstances. But the interpretations are different from those of the words uttered in isolation.
- With the branching applying to just two elements, sisters in the framework here, headship is thus essentially a relation between defined elements, both with parts. In the modern child’s process of acquisition, by the simplest possible interpretation, the structure of “In car” involves two elements, car, essentially noun-like, and in, as a step towards a preposition. “Open door” contrasts noun-like door and verb-like open;
- Head signals the first step towards distinctive ‘parts of speech’ as these are called by traditional grammar, nouns like car, Mummy, Daddy, verbs like want and like, prepositions like in. The items become differentiated, as the only sorts of expression on which grammatical operations can be defined. On the simplest plausible readings, open and in are plainly heads. Both elements can now express formal relations, as head and complement, to the expression as a whole. There is a grammatical relation between them, each with an irreducibly necessary, structural role. But it seems premature to regard such elements as fully-defined nouns, verbs and prepositions;
- Head defines the first step towards the language universal spine;
- In terms of sound structure, as a physical, articulated expression, there could be a vocalic head and a consonantal onset, a formal differentiation between the elements, such as Consonant Vowel or CV in what thus becomes a syllable;
- The elements have features which define the interaction, as opposed to some purely accidental relation, as by ooh, eh, ah, yes, hello, good bye, and so on, all independent from the grammar because they can stand on their own, sometimes adjoined to it, but not by Head.
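The formal difference between Pair and Head can be put in a few lines. This is only a sketch of the contrast as described here, not a claim about how either is encoded biologically. The names `pair`, `Headed` and `label` are hypothetical.

```python
from dataclasses import dataclass


def pair(a: str, b: str) -> tuple:
    """Pair: two forms combined with no internal structure.

    Sorting erases the order of utterance, so "bye" + "doggy" and
    "doggy" + "bye" come out as the same object.
    """
    return tuple(sorted((a, b)))


@dataclass(frozen=True)
class Headed:
    """Head: an asymmetric relation between a head and its complement."""
    head: str
    complement: str

    @property
    def label(self) -> str:
        # Assumption for illustration: the head names the whole expression.
        return self.head


if __name__ == "__main__":
    print(pair("bye", "doggy") == pair("doggy", "bye"))    # True: order is irrelevant
    print(Headed("in", "car") == Headed("car", "in"))      # False: the roles differ
    print(Headed("open", "door").label)                    # open
```

With Pair, swapping the elements changes nothing, as with “Bye, doggy” and “Doggy, bye”. With Head, swapping them gives a different object, and only the headed form carries a label which a further step could build on.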
5. Array and Project
Building on the featural and combinatorial properties by Pair and Head, from an array of lexical entries, Array and Project lists all the forms in a structure, as it is being built, making it possible to select or project a particular, suitable item which has already featured in a previous step of the derivation. Typically, as in English, it is then not pronounced at its point of origin.
For example, Joe at 1;10 (3), says “Daddy upstairs”, and at 1; 10 (27) “Where Daddy?”
At 1; 4 (27) Frank is asked, “Who wants some chips?” And he replies “Me”. And at 1; 5 (9) he asks “Where chicken?”
Array defines what in the framework here is characterised as the ‘specifier’. The sister of the head ‘specifies’ the expression. The functionality allows:
- Questions with a Wh word to be asked or understood. This allows a relationship between where and upstairs in “Where Daddy” and “Daddy upstairs”. Words like where can be effectively copied from one position in the structure to another. By virtue of this two-step process, an element effectively moves (as it indeed moved by earlier versions of generative grammar);
- Structures, with what traditional grammar calls ‘subjects’, to be expressed on noun-like elements. Thus I has the special role of expressing a subject, a seemingly universal property of sentences. The subject role is purely grammatical or syntactic, as in “There is food on the table” and “It is a shame that you’re ill” where neither there nor it has any semantic role;
- The beginning of what are known as ‘thematic roles’ including ownership, location, benefit, destination, or experience;
- The first versions of the way what is known as ‘Case’ is expressed, as by the difference between he and him, she and her, as what are known as ‘arguments’ in one of various relationships (essentially who is doing what to who). Some aspects of Case have a plain relation to thematic roles. In “I am waking them up” or “They are waking me up”, the references of I, me, they and them change as roles change or speakers take turns to talk;
- Projected structure to be projected again, i.e. recursively;
- Across the system of phonemes, relativities in contrasts such as those between P and B, defined on a difference in the delay between the release of a closure and the onset of ‘voicing’ by bringing the vocal cords together;
- Stress between syllables, as in Mummy and Daddy.
In “Daddy upstairs” by two branchings, one defective, Daddy is the specifier, in this case, the subject.
The contrast between the elements expresses the simplest possible structure with a definable spine, in this case with Daddy dominating X upstairs, where X is an unrealised abstract element.
In Joe’s “Where Daddy?” at 1; 10 (27), where is pronounced on the left and interpreted on the right of the structure (shown in grey) from where it has been copied.
Here “Upstairs” might be a plausible child’s answer, taking “Daddy is upstairs” or “Upstairs” as plausible adult-type answers, except that upstairs is treated here as a bare marker of location, questioned by where.
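The relation between “Daddy upstairs” and “Where Daddy?” can be sketched in code. This is only an illustration, under the simplifying assumption that a structure is a flat, ordered list rather than a branching spine. The names Item, copy_and_project and spell_out are hypothetical labels introduced for the sketch, not part of the proposal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """One element in the array; 'pronounced' is False for a silent copy."""
    form: str
    pronounced: bool = True


def copy_and_project(array, index, wh_form):
    """Assumed two-step process: leave a silent copy of the Wh form at the
    point of interpretation, and project a pronounced copy to the left edge."""
    lower = [
        Item(wh_form, pronounced=False) if i == index else item
        for i, item in enumerate(array)
    ]
    return [Item(wh_form)] + lower


def spell_out(structure):
    """Pronounce only the elements that are not silent copies."""
    return " ".join(item.form for item in structure if item.pronounced)


declarative = [Item("Daddy"), Item("upstairs")]
# 'where' questions the location, so it replaces 'upstairs' at index 1.
question = copy_and_project(declarative, 1, "where")

print(spell_out(declarative))  # Daddy upstairs
print(spell_out(question))     # where Daddy
```

The silent copy stays at the right of the structure, where the location is interpreted, while the pronounced copy appears on the left, as in Joe’s question.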
In modern acquisition, between a week and three months after two words are put together, a question is asked or answered involving a question word relating to one of the items in the two word combination, particularly what, where, and so on, signalling points of curiosity, as a key factor in discourse.
Every structural element has an irreducible semantic role, from the verbal head of an expression like open in “Open door” to the Wh question form where in “Where chicken”. But at a given point, the semantic role becomes the element itself – known as a ‘functor’. The first functor to appear expresses what is known as ‘tense’, as in the word is or its contracted form, written ‘s, that is, the fact that the reference is to an event/situation in the here and now.
Showing the new functional projection in bold.
Joe at 1;11 (12) asks “Who’s that?” On the same day, looking at a picture book together, his mother asks: Where’s the bus? Joe replies: There’s bus.
At 1;11 (14) Joe asks: “Where’s man tractor” It was not clear if he meant “Where is the man’s tractor?” or “Where is the man for the tractor?” or something else. The point is the articulation of the ‘s form.
Frank at 1; 5 (29) asks: “What is that?” with the observation that the is form was clearly detectable.
These, the first uses of an is or ‘s form by these children, are the first instances of what is known as ‘inflection’ where an element, significantly in both of these cases the verb be, is inflected with the property of tense. There is no reason for thinking that there is any contrastive intent here. But the use of the form is a place-holder for the tense category as this becomes accessible to consciousness. And in a broader sense, the form signals the accessibility of elements which are purely functional, with their own corresponding projections.
7. Work space
The recursive power of Array and Project can be exploited indefinitely. Such a grammar strains both processing and production, and is patently impossible to learn under the condition of finite learnability. Only a small minority may have been able to master the complete apparatus, with wide variations in mastery, as in all other areas of human skill from musicality, to art, to athleticism of all sorts, and in a way sometimes thought to be controversial, cognition.
Work space compares a minimal degree of dominance to any equal or greater degree. This relation characterises numerous phenomena in the grammar of unrelated languages, in English including pronouns such as I, you, he and she, what are traditionally known as ‘reflexives’, as in “I hurt myself”, and negatives by not and its reduced form written as n’t, only appearing immediately after the auxiliary element expressing what was once the tense in doesn’t, didn’t, can’t, won’t, couldn’t, wouldn’t and so on. The scope of operations, each doing just one thing at a time, is restricted by comparing and measuring just two degrees of dominance.
For example, at 2; 4 (26) Joe said “Mog doesn’t like that” with the negative n’t next to the auxiliary does; at 2; 5 (5) “doggy licking hisself”; at 2; 5 (7) “I saw lorry pulling car”; and at 2; 5 (11) “I took picture of milkman”.
A week later at 2; 5 (5) he says “doggy licking hisself” with the reflexive one level down from what is known as its ‘antecedent’ – what ‘comes before’ – in this case doggy.
As far as pronouns are concerned the key data for English is in contrasts like the one between “She says Mummy feels tired” and “Mummy says she feels tired.” In the second but not the first, she could be Mummy. Such relations are common across languages, raising the obvious question: Why should this be? Ever since a seminal (1976) work on the issue by Tanya Reinhart, this has been a hot and continuing topic of debate. All approaches since that of Reinhart have focused on the small size of the domain. The most evolvable is by unifying the notions of the spine and the Work space.
By a labeled spine, universal grammar allows these things to be encoded, but in ways that vary across languages. Each of these extensions increases the dominance. By this seventh advance, degrees of dominance relations are measured and compared. Work space limits these operations to just two adjacent levels.
Work space imposes a ceiling on specified relations, abstractly A and B, at the top of the spine at a given point in the derivation.
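The ceiling can be pictured with a toy model. The sketch below assumes, purely for illustration, that the spine is a bottom-up list of level labels and that the Work space is a window over the top two of them. The labels and the function names work_space and can_relate are hypothetical, chosen only for the sketch.

```python
def work_space(spine):
    """Return the two topmost adjacent levels of the spine at this point
    in the derivation (assumed to be the only levels open to comparison)."""
    return spine[-2:]


def can_relate(spine, a, b):
    """Two elements can enter a specified relation only if both are
    inside the current Work space window."""
    window = work_space(spine)
    return a in window and b in window


# The spine is listed from the bottom up; the labels are placeholders.
early = ["verb", "tense"]
later = ["verb", "tense", "nominative"]

print(can_relate(early, "verb", "tense"))        # True
print(can_relate(later, "tense", "nominative"))  # True
print(can_relate(later, "verb", "nominative"))   # False
```

As the derivation grows, the window moves up with it, so a relation is only ever stated between two adjacent levels at the top.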
This allowed a special relationship between I and am and between she and is, one denoting what was traditionally known as the ‘nominative’ case of the subject and the other denoting the most immediate aspect of the here and now in the discourse. In most languages including English, the key aspect of the here and now is related to time, represented as the tense of the verb, as in the differences between I am and I was and I have and I had. Nominative case is purely grammatical, with no thematic role or obvious relation to the here and now or the needs of communication.
For example, Joe at 2; 5 (7) said “I saw lorry pulling car” and on 2; 5 (11) “I took picture of milkman” with tense and case overtly marked in a sisterhood relation.
English tense is marked either as -ED, as in sorted, or -D as in lied, or T as in spilt, or by a change in what is known as the rime as in ate, saw, took, or by the whole form of the verb as in was and went. This marking of tense is separate from the verb itself.
In a way that is cross-linguistically typical, English marks tense in every sentence, apart from a few special cases, such as “The bigger the better”, arguably not sentences, but aspects of discourse. The marking of tense on the verb is almost, though not completely, universal. Universally, these two functionalities, tense and nominative Case, are defined at the top of the projection chain. This is reflected in the way both are expressed as the left most elements in “I might have been being deceived”. But the way it works in English is complex and hard to learn.
In “I may seem to be asleep” the thematic role of I is plainly not a function of the main verb, seem, but of the embedded verb, be. “I may seem to be asleep” means the same thing as “It may seem that I am asleep”, but the structure is quite different. I with its marking of Case and the tense of be get shunted upwards or ‘raised’ by successive steps of projection, each step by a separate process, shown here by the arrow in a simplified diagrammatic form of tree diagram.
The process can be continued as in “I may seem to want to be asleep” with a different meaning, but still with I immediately followed by the tense bearing may. The sense of the tense-bearing element has been lost in the history of English. “I might seem to want to be asleep” means almost the same thing. But without may or might in “I seem to want to be asleep” or “I seemed to want to be asleep” the tense difference is clear and overt, again immediately next to I.
Tense and nominative case constitute the two most sharply contrasting sorts of elements within the hierarchy.
The expression of these levels of the hierarchy, for noun and verb like elements, varies from language to language. These are things the language learner has to learn. They fall within the learnability space. By the proposal here, Work space is universal. But expressed in terms of the spine it is biologically encodable, and thus readable within the human genome.
8. Phase and Complementiser (or Sentence) – limiting the whole expression
By this step, still in the process of conceptualisation by the ‘Minimalist Program’ as updated by Chomsky (2001) and later work, the grammatical apparatus is factored into two components. This follows a consistent approach from Chomsky’s first widely circulated (1957) work factoring the grammar into two sorts of rule, phrase structure rules and transformational rules, then a division between deep structure and surface structure by Chomsky (1965), then the effect of a barrier with respect to what was at the time considered to be the ‘movement’ of elements such as what and where and other phenomena by Chomsky (1986), then with the ‘spelling out’ of the minimally and irreducibly necessary and relevant information in separate ‘Phases’ by Juan Uriagereka (1999) and Chomsky (2001). This information is of two sorts, information about the sound structure and information about the meaning. The totality of this information has to be both detailed and complete. This is in the context of an ongoing approach, motivated by the core task of explaining WHY the incredibly rich structure of a language is the way it is and HOW it is reliably learnt by all normally developing children in the way it is despite the infinite variations in children’s experiences of language.
By ‘phase-based’ syntax, at least by the original conception which I am upholding here, the first phase spells out the referential and propositional content; the second phase spells out what John Langshaw Austin (1957) called the ‘illocutionary force’ of statements, commands, questions, pleas, and so on.
Discriminating between the two phases is difficult, both in theory and for the language learner. Necessarily, it is ordered last. The first phase is mainly grammatical and the second phase often has a significant discourse element. Neither makes sense without the other. As Carol Chomsky showed in (1969), many ten year olds are still misunderstanding sentences like “I’m asking you what to feed the dog” as “I’m telling you what to feed the dog”.
As soon as a phase has been completed, most of its structure becomes inaccessible to the ongoing process of derivation. This allows the derivation to proceed in small, manageable parcels. A parcel may add only one word. But that word can have a special analytic status, as in the case of the Wh words.
Showing an accessible chunk in bold, and with a lower, earlier, inaccessible chunk lighter, with only its head and edge accessible.
There are thus two phases in most clauses, even if the second phase is not represented by any overt structure, but just by the fact that a ‘simple’ proposition is also a statement of fact, which may be contradicted in jest or irony. The truth or untruth of the proposition is represented by the second phase.
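The way a completed phase closes off its interior can be sketched as a small data structure. This is a minimal illustration only, assuming that a phase can be reduced to an edge, a head and an interior list. The class name Phase, its fields, and the assignment of what, ‘s and happening to those fields are assumptions made for the sketch.

```python
from dataclasses import dataclass


@dataclass
class Phase:
    """A toy phase: an edge (for instance a Wh word), a head, and an interior."""
    edge: str
    head: str
    interior: list
    complete: bool = False

    def accessible(self):
        """Before completion everything can be seen; afterwards only
        the edge and the head remain visible to the derivation."""
        if self.complete:
            return [self.edge, self.head]
        return [self.edge, self.head, *self.interior]


# Hypothetical analysis of the embedded clause "what's happening".
lower = Phase(edge="what", head="'s", interior=["happening"])
print(lower.accessible())  # ['what', "'s", 'happening']

lower.complete = True      # the phase is spelled out
print(lower.accessible())  # ['what', "'s"]
```

Once the phase is marked complete, later steps of the derivation can work only with the edge and the head, which keeps each parcel small.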
By the first seven steps proposed here, the force of a structure was an accident of the structure itself and the circumstances in which it was uttered. Such a grammar was most likely prone to deep and frequent misunderstandings. By the eighth step, an expanded notion of force was defined as ‘Complementiser’ or C, as the topmost level of the spine, replacing the traditional notion of a ‘sentence’. C provided a host for what in expressions like “What did you say you thought I said?” with what as the complement of I said at the opposite end of the structure.
In simple, declarative main clauses, C is not expressed in English. But it is the destination or landing site of words like what, where and when and expressions with which in questions seeking particular items of information. By the 1997 proposal of Luigi Rizzi, the ‘force’ of the structure, as a statement or a question, whether the agency of the subject is diminished by passivisation or in some other way, is expressed as a property of C.
So for example, Joe at 2; 9 (4) asked “When’s Daddy coming back?” with the Wh morpheme when projected onto the uppermost C level and the contracted auxiliary ‘S stuck on its right edge as what is known as a ‘clitic’.
By characterising this level as that of C, every level from the bottom of the structure to the top is defined in the same way, rather than by giving the sentence a special status of its own, one that is hard to define other than in a purely circular way.
Three weeks after “When’s Daddy coming back”, at 2; 9 (28) Joe produces his first sentence with multiple embeddings, in this case three, with two phases at the lowermost level in “what’s happening”, with thus eight phases in all, with all structures fully specified – in “I want to stand on the chair to see what’s happening”. Simplifying slightly:
At the lowermost level, the contracted auxiliary ‘s is projected to form a tensed structure, and then what is projected to form what is traditionally characterised as an ‘interrogative clause’, in the framework here, now specified by what.
At 2; 10 (21) Frank says “I want to sit where Joe’s been sitting.”
These are the two children’s first, almost identical, sentences with multiple embeddings, in this case three, with two phases at the lowermost level, and six phases in all. Accidentally or otherwise, both are fully grammatical. The exactness of the similarity between two utterances in two children two and a half years apart, only noticed forty years later, would seem to suggest that there is great significance in such structures with a Wh word specifying an embedded clause.
Again without any interrogative intent, at 2; 11 (17), Joe defines the word clock. Again the main action is in the embedded clause where there are two phases. In the first phase, the expression what time is projected, forming a second phase, with the first phase now becoming inaccessible. The matrix elements, a clock tells you, then constitute a second first phase.
Looking at the structure in more detail, again without any interrogative intent, at 2; 11 (23) Joe says: He don’t know how to get down (looking at his toddler brother warily contemplating two steps).
Here the Wh word, in this case how, is overtly projected from the position on the right where it is interpreted (shown in grey here) to where it is pronounced as the complementiser of the embedded clause. Here again there are two clauses with two phases overtly represented in the embedded clause.
As by the examples above, Phase allows the derivation to proceed in parcels. But the full application of the principle here takes years to learn. At 9; 9 (13) Joe said “We don’t know whether I’m going to be picked up by who” (of the rather complicated child care arrangements we had in place at the time – to allow this research to start while I was still working full time for the NHS). Joe’s sentence is slightly anomalous in as much as who seeks particular information and whether seeks only a truth value. But the structure of two Wh words in the same clause calls up the Phase functionality in an interesting way.
By this eighth step:
- In “There is food on the table”, is on the table is formed in one phase and there by a later phase;
- Building the derivation in phases limits how much of the derivation can be seen and manipulated at any one point, reducing to the minimum the speaker’s and language learner’s tasks in constructing a derivation, allowing complexity to be distributed across it;
- Information is sent to the articulatory system to be pronounced and to the semantic / conceptual system for the meaning to be analysed. English marks the point of sending articulatory information much earlier than languages like Turkish, Mohawk, and many others. So this point necessarily falls within the learnability space;
- FL becomes knowable, and metalinguistic awareness is brought into being;
- Fantasy, fiction, non-fiction, irony, fun, comedy, contracts, become parts of everyday life.
It seems a reasonable conjecture that Phase critically reduces the learnability space, making speech and language finitely learnable in what Eric Lenneberg in 1967 called the ‘critical period’ for language acquisition, normally ending around the age of ten.
Phase may have only fixated across the ancestral stem of anatomically modern Homo sapiens between 100,000 and 200,000 years ago. In competition for scarce resources, this reliability of communication would have given a decisive advantage to those having it in relation to any group not having it.
Summary of eight steps
These eight hypothetical steps can be summarised as follows (for the sake of simplicity, assuming that the steps were primordially vocal, an assumption which may be wrong for at least the first advances):
- Labelling unlikeness – sound and meaning – in forms not yet properly words;
- Another feature – combining at most two LIKE features – sensory-motor or semantic;
- Pair – of elements, one such that it cannot stand on its own;
- Head – two contrasting elements, neither such that it can in all cases stand on its own – one such that it heads the expression;
- Array and Project – listing a set of elements in order, allowing a previously selected element to be selected again to form a more complete expression;
- Inflection – allowing tense as a place holder for other inflectional projections;
- Work space – comparing some minimal level of dominance at some point in the derivation to some greater level, limiting specified projections;
- Phase – splitting the derivation into successive parts, one such that it becomes unreachable as the next proceeds, limiting the scope of the grammar by any one phase of the derivation.
In the case of two randomly selected, normally developing children, these steps all occur in the space of 22 months. But seven years after the last, the whole edifice of grammar is still not complete.
These steps were taken by a species which had forsaken the safety of the trees for a much more dangerous life on the ground, after making at least six significant precursor adaptations. This was a population which plainly lived on its wits – or died. The population may have remained very small until the discovery of farming, but ranged across Africa while what is now the Sahara desert was forested and well-watered. Within this population, by the proposal here, individuals or groups of individuals must have started restructuring some of their expressions in detectably advantageous ways, but by no more than one term at a time, so that, over the course of thousands of generations, the innovation could diffuse across the population, and (separately) become part of the genome.
The totality of this evolution was most likely over at least a million years, and possibly even three million. This evolution exploits, but goes far beyond, any life-support cognitions, such as those of making stone tools.
If the proposal here is on the right lines, one or more of the last evolutionary steps may have occurred after the divergence between the main line of modern human ancestors and Neanderthals and before more or less anatomically modern humans appeared in what is now Western Morocco around 300,000 years ago. Inheritors of the epigenetic changes by the last step would have learnt to talk faster, more accurately, more reliably, and crucially more completely. The difference may have been critical, with Neanderthal mastery of speech and language uneven, and with no expectation of common understanding. Neanderthals may have been stuck at the point when only a fortunate minority had a full mastery of their linguistic inheritance, whatever that may have been, and the rest of the population had only varying degrees of competence and little or no metalinguistic ability.
Crucially this evolution provided an apparatus which was, and is:
- One small step at a time, like other biological advances, as opposed to one great leap;
- Independent or autonomous from any sort of physical skill from which it may originally have been extrapolated;
- Freely used in assembling words together and in the building of speech sounds, in ways that the child has to learn;
- Commonly over-used in the process of speech acquisition so that children often use devices in the building of words which should be used only in the assembling of words into sentences;
- Highly visible to the point of being (potentially, at least) biologically advantageous;
- Biologically-encoded on a spine, with the effect that particular aspects of the grammar can be expressed as finitely varying points of variation known as ‘parameters’, defining the child’s learnability space;
- Such that speech-disordered children from different generations or parts of a family often have recognisably similar issues;
- Structured so as to proceed in one, mathematically-consistent way from the abstraction of features to the last advance, by Phase;
- Available in parts, so that questions can be asked and answered in a rudimentary way, so a child of two and three quarters can say “A clock tells you what time it is” displaying the first evidence of Phase long before the full functionality of the grammar has emerged, as it normally has around seven years later;
- Such that it could not plausibly have happened more than once, no matter whether the genomic fixation was by a series of discrete steps, as proposed here, or in some other way, as suggested by Berwick and Chomsky (2016).
A theory of speech and language acquisition?
The proposal here is NOT a theory of speech and language acquisition. There are many of these, some involving various psycholinguistic considerations such as auditory memory, processing load, and more. It is just assumed here that the domain is immensely large and complex, and that the steps of mastering it are not likely to be achieved by ticking off the achievements one by one and repeating them reliably, or not so reliably, from that point on. Rather, the steps by the proposal here are more like profound insights which are grasped first tentatively, followed by more failures than successes, and then gradually more and more confidently. Instances of these insights may only appear very occasionally in any sort of longitudinal record.
The motivation here is primarily biological. But I take this to mean that any postulated genomic content should be such that it can be genetically encoded – on the assumption that any adaptation enhancing the fitness of the organism is necessarily both accidental and simple. This rejects any sort of analogy with the primary and secondary dentition, as by Ken Wexler (1996). The evolutionary time scales are different by an order of magnitude. And in every case it is necessary to link the genomic content with a process which can plausibly be encoded – by the spine by the proposal here.
What is explained by an evolved UG applying to the whole of speech and language
- The characteristic multifactoriality of speech and language disorders, so a history of delayed or disordered speech is often co-morbid with literacy problems;
- The converse specificities of common errors, by misapplying what should be syntactic processes in the phonology;
- The fact that many characteristics of child speech seem reducible to the lack of any proper definition of phonemes, syllables, words, and so on;
- The characteristically poor metalinguistics of children with speech and language disorders.
By this proposal – consequentially:
- As the genome evolved, the effects of the steps interacted with one another, giving the complex variations which Roberts (2022) characterises as ‘building blocks’;
- An abstract Universal Grammar UG is derived from the evolution of FL, available across the human species. But there is cross-linguistic variation in how it is used – for example in which parts of the sentence structure are projected where – with global effects on word order. While a language may not express one or more parts of UG, all languages, spoken and signed, are built from it;
- It is possible in principle for parts of UG to be incompletely specified in some individuals;
- There are fewer grounds for postulating a series of separate disorders. Many apparent disorders, even some with names in popular speech, may fall out from a fully worked out theory of speech and language evolution;
- There must have been a series of protolanguages, following Progovac (2015), each likely to have left fossils.
From evolution to acquisition
In the absence of any evidence that acquisition proceeds differently from evolution, acquisition and new language formation may provide the closest approach to direct evidence of the possible, probable, or necessary course of speech and language evolution. It thus seems significant that most modern children ask or respond appropriately to a Wh question such as “Where Daddy” only after producing a declarative structure involving two corresponding elements, such as “Daddy upstairs”, and not in the opposite order. This holds of the two children, Joe and Frank, whose records I am currently examining in detail, and also in the cases of the three Edinburgh children, whose records I have quickly examined on this point, and to which I shall be turning next.
Stems and time scale
The proposal here makes no claim about how the cognitive evolution of speech and language connected up with cognition itself or with any of the physical changes. We just have to note that these changes were in one species, and they would seem likely to have complemented one another.
The steps and precursor steps proposed here were not events. If it took modern human ancestors at least 3 or 4 million years to learn to make a sharp edge or point, it would seem reasonable to assume that the incomparably more subtle process of encoding UG on a spine must have been similarly challenging. But the proposal here has nothing to say about the possible time scale for each step. All that can be said is that the encoding is far removed from the communicative advantages. For instance, the convenience of using pronouns has no obvious relation to the unobvious adjacency of levels on the spine. The difficulty of the translation here would suggest that this may have taken many thousands of generations. But as soon as the spine relation was established, the translation may have been simpler and faster, perhaps very greatly so.
From the fact that there are corresponding phenomena in all languages, it is reasonable to suppose that the eight specifically linguistic steps were made separately, as an evolutionary sequence, in a population from which all humans alive today descend. At each point when a necessarily very obvious and visible step was taken, this was valued throughout and across a population. It had to be, or it wouldn’t have diffused and fixated.
But while the capacity for speech and language clearly distinguishes humans from any other animal, and mostly develops naturally without any active intervention, this is obviously not the case for all, with 1 child in 10 having minor problems with speech and language, 1 in 1,000 having major problems, and perhaps 1 in 100,000 being unintelligible in adulthood other than to close family and friends, if at all.
In relation to less than fully competent speech and language, the proposal here effects a conceptual economy. By the proposal here, most disorders are by the effect of failures in the specification of a genomically defined UG which makes it possible for humans to learn to talk the way they do without needing to be helped, other than to learn what not to say. This makes it unnecessary to postulate a corresponding series of specific malformations.
There is clinical utility from the study of the apparatus and the structures that can be derived from it. It makes many common issues in speech and language disorders more accurately definable. And it broadens the range of plausible interventions into areas that would otherwise be at least hard to treat. For instance, many children have difficulty with both case and tense, as in “She loves me” where the S in loves expresses agreement with the singular property in She. Such children may go on saying things like “Love me” many years after most children have learnt that in a statement, both the she and the S in loves are forced in English.
Nunes (2023) argues that to help such children it may be useful to allow them to discover the ‘sisterhood’ relation between the subject marking of she, known as ‘nominative case’ and the S in loves, known as ‘third person singular’, the formal device by Work space.
Nunes (2002) notes that other children sometimes say monopoly as OPOLI. If such errors persist they can lead to stigma or mockery. If OPOLI is part of a broader pattern the child probably needs help. But for what exactly? If monopoly as OPOLI is by the deletion of the first three sounds, what are they? The first two are the first, unstressed syllable, sometimes confusingly called the ‘pre-tonic syllable.’ But the N is the onset of the stressed syllable. What sort of a thing is the whole of one syllable and the beginning of the next? But there is another way of looking at this. OPOLI is the domain of stress. (The initial N is irrelevant in this domain.) The child may be treating the stress domain as though it was the word, an easy mistake for a first language learner of English. By the framework here, this suggests a treatment approach targeting the separateness of the word from the stress domain.
The limits of a proposal
This page sets out a proposal. It is what I am working on, initially on the basis of the observations of two children. It is essentially a question. There may have been more steps. Or Merge may have just developed out of the blue by a single macro-mutation, or what Chomsky calls ‘a minor rearrangement of neurones’. The proposal here just seems to me a biologically well-motivated way of reconciling the evidence of human speech and language, as they currently are, biology, neurology, archeology, paleo-anthropology, genetics, delays, disorders, and the random cases of two individual children. The fact that the two children were brothers, living in the same family home, may have influenced the areas of their attention and interest. But it cannot have had any bearing on whatever part genetics may have played in their focus on the formalities of Universal Grammar, as documented here.
As set out here, this proposal is only a sketch. There is at least another ten years of work ahead.