
Personal proposal

What makes us human: seven connected steps

Children learn to talk with seemingly random variations in what they hear. But they end up with a common understanding of what counts as their target language and what doesn’t and what means more than one thing. How do they do this? What exactly are they learning? How did this capacity evolve? Noam Chomsky and others (2023) postulate one evolutionary mutation giving the function, ‘Merge’. This combines pairs of atoms, one as the head of the combined expression which can then combine with others, and so on, indefinitely.

I say ‘atoms’ rather than words because the combinations are in one sense more than words and in another sense less than words – whatever words are.

Chomsky’s proposal has been called ‘The Great Leap Forward’. Rejecting the idea of one great leap forward, I propose instead that:

  • The necessary evolution here should be broken down into a sequence of irreducibly-necessary, discrete steps, most of them originally proposed in one form or another by Chomsky;
  • At least in a general way, the course of children’s speech and language development is likely to follow the broad course of this evolution;
  • Where this process goes wrong, as it sometimes does, this is most likely to be because of an error in the wiring of this inheritance. I seem to be one of those affected.

Here I follow Bart de Boer and others (2020) on the point that a finite number of steps is more biologically plausible than one – seven by my proposal here.

The properties defined by the proposed sequence here are abstract. But they are no more abstract than the straightness of the line between footfalls which the child is learning to walk along. The gait becomes more efficient and evenly balanced. More of the energy is used to propel the body forwards, and less to stay upright. So humans can run for longer than faster-running prey. But no one, apart from trainers in athletics, thinks about the straightness of the footfalls. The straightness just defines how we run and walk. Abstractness is useful for speech and language too.

If a child’s speech and language are not developing, why should this be? Consider the child of three who seems to have just one word, and does not say it often. What does this mean? What is the likelihood of the child learning to talk normally? Have the parents done something wrong? Is the child likely to grow into an adult with a ‘communication problem’? Is there a way of reducing the chances of this? What can be done to help? To answer such questions, it is worth sharpening our understanding of the Faculty of Language, FL. By the framework here, this faculty has the extraordinary power of allowing us to both say and reliably understand an INFINITY of sentences.

Now there may seem to be a very obvious growth towards this infinity. Children slowly become able to join one another and adults in conversations about the best team in the world, or the best player, or the best dressed skater, and so on. At first children seem to be adding the various complexities one by one.

Bus.

Ride bike.

Daddy drive car.

Lady getting on train.

Man is flying the plane.

And so on. Essentially the developmental path seems to be a succession of small achievements, until at some point, a long way down the pathway from bus, FL seems to have this capacity of infinite creativity. But by the proposal here, the basis for this infinity is already there in the normally developing one-year-old’s bus.

Universalities and discourse

Language, as characterised by FL, contrasts with discourse, or the way language is used to relate utterances to the context in which they are uttered, to express emotions, interest, to entertain, or to be ironic by reversing the overt sense of an utterance. Even at the very beginning of the evolution, it is possible to imagine discourse functions as soon as there are distinct meanings in particular expressions. As UG evolves, some of these functions, such as commanding and questioning, become part of FL. Modern language is both used for discourse and structured in such a way that meanings can be both shared and defined.

Shigeru Miyagawa (2010) asks: How and why do all human languages seem to have some very particular, abstract functionalities? (I shall discuss them below). Miyagawa argues that at least some of the universalities are by the exigencies of discourse – for the sake of good understanding. But the structure of FL is quite different from the recognition of other speakers and other points of view in discourse. Discourse and FL are separate domains. Neither makes sense without the other. By the proposal here, both are separately articulated in relation to grammar.

The main points of the proposal here:

  • Children learn to talk the way they do because of how FL evolved in the human species, step by step, seven steps in all, each one unconscious and implemented at a tempo faster than by deliberate action, but each one discrete;
  • By conceptual necessity there are at least two interfaces, one involving physical expression (by speech or by sign), the other involving the analysis of meaning. Information has to be sent to these interfaces in suitable, necessarily different forms. On a pathway from the first detectable indications to the competence of the fully mature speaker-listener, these interfaces have to be assumed from the first step;
  • By each step, by evolution and by modern acquisition, sensitivity grows to a particular sort of abstract property in the linguistic expressions which is manifest in the language as this is expressed;
  • In ways to be explained below, the first step, Decompose and Recompose first decomposes, then recomposes, two sorts of unlike feature, two atoms. Putting two atoms together is the simplest logically-possible, set-theoretic operation. But without the decomposition, there can be no composition. This gives a species-specific and species-universal starting point for this aspect of individual growth or development. The fifth step anchors the expression in the discourse. The last step factors the derivation into phases, each defined on a minimal chunk of the derivation, greatly reducing the infinity, and making the grammar finitely learnable;
  • The ancestral idea of Decompose and Recompose is Merge from Chomsky (1995). Two atoms are put together. And one is projected as the head. The current version of Merge, adopted here, is more general. Chomsky calls it ‘Maximal Merge’. But on any version of this thinking, an atom is mostly, but not always, what is commonly regarded as a ‘word’. In the expression “Eat apples”, apple is ‘merged’ with S and eat with apples, with apple and eat as the heads by successive instances of the same procedure. By applying to its own output, or ‘recursively’, as by the example here, Merge is the one and only operation of the syntax – what is commonly known as the ‘grammar’. This operation has now fixated and become a defining genomic character of our species, anatomically-modern Homo sapiens;
  • Decompose and Recompose involves the relation between physical and semantic features common to all natural languages. It defines the beginning of a human-specific pathway, different from any sort of alarm calls of vervet monkeys or prairie dogs, or the richer but seemingly less specific systems of chimpanzees, or the marking of individual and group identity by dolphins. Such non-human calls are not compositional. They cannot be combined with other calls to some infinite degree, known as ‘discrete infinity’. Nor can they be ‘decomposed’ into separate articulatory / perceptual and semantic / pragmatic elements, as in games like the French Verlan, reversing the order of the syllables in l’envers, the French for backwards. In evolution, the physical aspect may have been either gestural or vocal. The rich communication system of chimpanzees mixes physical and vocal gestures together. If primordially there was a bias towards physical gesture, this bias must have disappeared as language evolution progressed, or there would be sign languages used natively by normally-hearing populations. The proposal here is neutral on the extent to which the first decompositions and recompositions were vocal tract gestures, as opposed to manual gestures;
  • Decompose and Recompose allowed what is known as Universal Grammar, UG, to start evolving. This began earlier and proceeded more gradually, for longer, and more completely, than by Chomsky’s proposal, but still brief for a change in the genome of such significance. From paleoanthropology, it is possible to define an earliest possible beginning to the evolution of speech and language – no earlier, I submit, than the first manufactured tools and a last possible end point – when anatomically modern Homo sapiens spread across first the Old World, then Australia, then the New World, and then the Pacific and New Zealand. This sets upper and lower bounds to the possible period of speech and language evolution. Involving all the steps proposed here, UG applies to any natural language, whether spoken or signed;
  • Necessarily, every atom to be Decomposed and Recomposed has to be extracted from a store of similarly defined atoms. By the act of extraction, a copy of the symbol is ‘inscribed’ in the ‘work space’. The extraction is a complex act. If a question is formed by what John Langshaw Austin (1962) called an ‘illocutionary act’, as in “What did you say?” what is being asked and what is being asked about are defined on separate atoms of form and meaning. By this structuring, it is possible for two people to agree that they are saying the same thing even if they have never met or heard each other speak. This can be a single word or a sentence of any length and complexity.
  • Necessarily, Decompose and Recompose has to be definable in a way that can be encoded mathematically and expressed biologically, as set out by Matilde Marcolli and others (2023), or it could not have developed and fixated across the species as a property of the genome;
  • Resurrecting the approach of Chomsky and Halle (1968), the totality of the steps here, including Decompose and Recompose, gives speech as well as language. Decompose and Recompose applies more directly to the physical expression (no matter whether this is by sign or by speech) than it does to the meaning of expressions. Systematically, at least by the examples here, the form of children’s expressions is in advance of what they are expressing. Newborns are sensitive only to certain global properties of the expression. By teenage they have an adult-like command of language. By each step, the infinity is reduced, but the structure becomes more easily learnable;
  • By the proposal of Chomsky (1999 & 2001), the grammar is partitioned into ‘phases’, each giving one component of what is known as the ‘derivation’, the first (very roughly) defining its propositional content, the second defining its ‘illocutionary force’ in Austin’s terminology, as a statement, question, entreaty, and so on. By the proposal of Martina Wiltschko (2014), the grammar is defined on a ‘spine’ or a headed decision-tree, with binary branches, rather than on a set of grammatical functionalities such as passives, as in “She was hit by a falling tree”. Both the notion of a phase-based grammar and the term ‘spine’ are now widely accepted. By the proposal here, the spine itself is phased. This makes it biologically ‘encodable’ and such that it could be entered into a computation, structured in a way common to all living humans;
  • UG could not have evolved its precise character other than by steps, each contributing to language-specific variations, defined on derivations from the spine, interactions between these derivations, and the ways that these things are implemented in speech. These variations are part of the learnability space – what has to be learnt in different ways according to the language being learnt, as finitely varying points of variation known as ‘parameters’;
  • The steps thus provided an apparatus around which grammatical structures could develop. UG is thus the foundation of language acquisition. But the child is not born with a ready-made UG. It is only very partially expressed at birth, just as the abilities of particular bird species to hover, stoop, dive and soar, are expressed only as the fledgling develops. By the evidence here, the ‘initial state’ is thus a complex transition from the time when the embryo is still in his or her mother’s womb and listening to the sounds of language to the time when, around three, the building blocks of UG are in place. But, in a way far more complex than the particularities of bird flight, the full integration of UG elements with one another continues until at least ten or so for most children.
  • Each step must have been noticed and recognised by potential mates for what it was, offering a decisive advantage, leading to a consistent bias in mate selection, ensuring that it eventually became inheritable. In terms of statistical dynamics, the advantage here may have been slight. But a slight bias applying consistently over one or more thousands of generations can effect a change in the genome;
  • The criterion of heritability sharply constrains the form of the steps. Biology does not, cannot, operate with any properties defined solely on linguistics, such as consonants, vowels, or sentences;
  • All humans alive today must be descended from one African stem. The ancestry may be from more than one point on the stem, which may have migrated and introgressed (see Chris Stringer (2016), Aaron Ragsdale et al (2023) for a different point of emphasis). But at a given point of descent, modern UG was necessarily complete. Or there would be groups of humans genetically incapable of ever learning one another’s languages;
  • By the terms of this evolution, it could not plausibly have happened more than once;
  • The evidence for the evolutionary sequence proposed here is from the acquisition of language, language disorders, the differences between languages, new language formation, particularly creoles, signed languages, the commonalities of unrelated languages, the detailed examination of any one language – for our purposes here, English, and the special case of Nicaraguan Sign Language, as a language which developed in a single school generation in very special, probably world-unique circumstances. In other words, there is empirical evidence here. While there is no reason for assuming that modern speech and language acquisition exactly replicate their evolution, there is every reason to expect significant parallels, as in other areas of comparative biology.
  • In acquisition, the steps proposed here are taken over a period of months with the child putting a number of different elements together. In evolution, this may have taken a hundred thousand years or more, several thousand generations;
  • The proposal here involves what is sometimes known as ‘early complexity’, on the understanding here, complexity as early as possible, but no earlier and no later. This is to say that as one evolution is built on by another, the earlier evolution cannot be amended. So evolved properties are plausible only at the original point of evolution. Decompose and Recompose provides the basis for an evolutionary and developmental sequence and a resulting apparatus. It cannot be jettisoned on a ‘Use it or lose it’ basis without fatally compromising the rest of the apparatus;
  • The ordering of the steps here is by conceptual necessity. The examples given, in many cases very similar to one another, and the fact that they are in a matching sequence, are evidence for the proposal here.
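The point in the list above about a slight but consistent bias in mate selection can be illustrated numerically. What follows is only a minimal sketch, assuming the simplest textbook model of selection on a single heritable variant, in which the variant's odds grow by a factor of (1 + advantage) in each generation. The function name and the figures used (a starting frequency of one in a thousand, a one per cent advantage) are illustrative assumptions, not estimates drawn from the evidence discussed here.

```python
def generations_to_spread(start_freq: float, target_freq: float, advantage: float) -> int:
    """Count generations for a variant to rise from start_freq to target_freq.

    Assumed model (illustrative only): carriers leave (1 + advantage) times as
    many offspring as non-carriers, so the frequency p becomes
    p * (1 + advantage) / (1 + p * advantage) in the next generation.
    """
    if not (0.0 < start_freq < target_freq < 1.0):
        raise ValueError("need 0 < start_freq < target_freq < 1")
    if advantage <= 0.0:
        raise ValueError("advantage must be positive")

    p = start_freq
    generations = 0
    while p < target_freq:
        p = p * (1.0 + advantage) / (1.0 + p * advantage)
        generations += 1
    return generations


if __name__ == "__main__":
    # Hypothetical figures: 1 carrier in 1000, a 1% advantage, spread to 99%.
    print(generations_to_spread(0.001, 0.99, 0.01))
```

Under these assumed figures the count comes out on the order of a thousand generations, which is the scale mentioned in the list. A smaller advantage lengthens the time roughly in proportion.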

Misunderstanding

The proposal here is commonly misunderstood in relation to a now out-dated position from Hauser, Chomsky, and Fitch (2002). By that much cited article, the distinctively human property in language is recursion, or the embedding of one element of a message inside another instance of the same element. In “I think you’re right”, “you’re right” is one clause embedded in another clause. By the current position, from work by Chomsky and others over the past 20 years, the human distinctiveness is in the property of Merge or what I am calling Decompose and Recompose rather than in recursion, as the consequence of Merge. The distinctiveness crucially involves both aspects of the unlikeness relation, allowing the system to become compositional. If the gesturing system of a non-human displays recursion, this says nothing about the compositionality, or in every known, non-human case, the complete lack of this, and the exclusive use of communication for a narrow range of functions such as warning or self-identification. The most widely cited example of this misunderstanding is by Daniel Everett (2009, 2013, 2018), who claims that because he was not able to hear any instances of embedding in the language of the 500 members of the Amazonian tribe which he studied, recursion can’t be universal. He misses various possible explanations.

The autonomy of grammar

No version of Merge or Decompose and Recompose is reducible to the needs of communication or social interaction. There could not have been any external input because by their very nature, the properties here are strictly-internal to cognition. A phase-based spine can only be defined on general, i.e. non-linguistic, principles. It cannot directly reference any categories which would only come into existence by virtue of the evolution. One example of this is the time relation between the utterance and the context in which it is uttered, known as ‘tense’, as by the actually quite complex difference between “I clean the floor” and “I cleaned the floor”. Thus the grammar must encompass the entire apparatus which yields the linguistic categories. A category may seem to occur only very rarely or even in only one of the six or seven thousand known languages. Some categories are idiosyncratic. But if a category occurs at all, the learnability space must be configured accordingly.

Precursor cognitions

Following six precursor cognitions, some of the seven specifically linguistic steps which I propose can be exemplified very approximately in the language development of a modern child – except that what the child is hearing is a fully-developed, modern language, and the child is the inheritor of a corresponding genomic capacity, albeit with only the gestural half of the first step, characterised here as Decompose and Recompose, manifested in babbling.

Two children

At the moment, this proposal, which I am exploring and testing, is based on the diaries kept by my wife and myself of the development of our two sons, Joe and Frank, two and a half years apart in age, comprising about fifteen thousand observations in all, filling nine cathedral analysis note books. The observations continued until Joe, the older of the two, was almost ten and a half. We tried to make our observations as accurate as possible, as soon as possible after the event. Obviously we must have missed many developmentally significant occasions. The observations exampled here exhibit commonalities across the two boys. Because of the age difference, it is not plausible that the older of the two was significantly influencing the younger, other than on matters like loyalties to one or another football team. Generalisations across the two boys are likely to be significant. Listening to them talking with their friends and peers there was nothing obviously singular about their speech and language. The structures exampled here seem to be typical of children from a liberally-minded, middle-class family, going to a neighbourhood, non-denominational, local authority school, catering for children from a wide variety of social and ethnic backgrounds. It is only coming back to these records for analysis forty years after they were made that it is becoming clear how much they reveal. One cannot listen too carefully to the details and nuances of what children say. They can say more than they seem to on a first listening.

Following the convention established by Jean Piaget, ages are given in the form 10; 4 (25), for ten years, four months, and twenty-five days. This was Joe’s age at the date of the last observation, the only instance of a question with two Wh words. This degree of precision is useful and appropriate. Some developments happened overnight or over a few days. In every one of the cases exampled here, the observations were the first satisfying some particular grammatical criterion.
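For anyone wanting to put diary observations of this kind into chronological order by machine, the age notation can be read mechanically. The following is a minimal sketch only. The names `Age` and `parse_age`, and the limits placed on months and days, are assumptions made for the illustration and are not part of the diaries themselves.

```python
import re
from typing import NamedTuple


class Age(NamedTuple):
    """An age in the years; months (days) notation, e.g. 10; 4 (25)."""
    years: int
    months: int
    days: int


# Accepts forms such as "10; 4 (25)" or "0;9(4)", ignoring extra spaces.
AGE_PATTERN = re.compile(r"^\s*(\d+)\s*;\s*(\d+)\s*\(\s*(\d+)\s*\)\s*$")


def parse_age(text: str) -> Age:
    """Turn a string like '1; 0 (13)' into an Age(years, months, days)."""
    match = AGE_PATTERN.match(text)
    if match is None:
        raise ValueError(f"not in 'years; months (days)' form: {text!r}")
    years, months, days = (int(part) for part in match.groups())
    # Assumed sanity limits: months run 0-11, days 0-31.
    if months > 11 or days > 31:
        raise ValueError(f"months or days out of range: {text!r}")
    return Age(years, months, days)


if __name__ == "__main__":
    ages = [parse_age(s) for s in ["1; 0 (13)", "0; 9 (4)", "10; 4 (25)"]]
    # Tuples compare field by field, so sorting gives chronological order.
    print(sorted(ages))
```

Because an `Age` is a tuple of years, months and days in that order, ordinary comparison and sorting place observations in order of the child's age.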

1. Decompose and compose (or recompose) the sound and meaning of expressions

By the proposal here, the apparatus of grammar originated in forms defined on a new sort of relation between unlike sorts of atom. One sort involved sensori-motor or articulatory and perceptual gestures, which could be seen or heard, implemented with the hands, face or head, or with the vocal tract. The other sort of atom was a meaning, either pragmatic, such as a greeting or a curse, or semantic like a reference to some individual. The relation here was essentially arbitrary. The expression did not look or sound like what it symbolised. It was defined just by the relation between the two atomic elements. In this respect, the relation was no different from those by other communication systems between animate beings. But the novelty of the step proposed here was that the relation was simultaneously broken and remade, or decomposed and recomposed. The atomic elements had first to be reduced to their simplest, logically possible forms – in modern terms by single features. On the articulatory perceptual side, one defined acoustic or articulatory properties. On the semantic side, there could be a definition of mood or reference. This decomposition into single features could then be reversed, allowing the features to be freely recombined or recomposed. Without the decomposition, the recomposition could not have evolved. There would be no speech or language.

The evolutionary cognitive genius was in the freedom of the relation between the two atomic elements, physical and semantic. Such a relation could form part of a lexicon. In a formal sense, this was quite different from any shriek or howl.

Such a decomposition is diagrammed below.

The lips are just one point at which the vocal tract can be closed or constricted with an acoustic effect. Another is the tongue tip. These are just two of many humanly-possible articulations, utilised in almost all modern languages. Primordial articulations may have been different in any number of ways, in all probability using features that could be extracted from an existing system of shrieks, hoots, grunts, or howls.

By the simple reversal of the direction here, the primordial relation between two sorts of unlike feature, semantic, and articulatory / perceptual, is recomposed as something more like a modern speech sound or sign language gesture.

As explained in the framework here, the stepwise structure of speech and language necessarily starts from a branched structure or what is known as a ‘decision-tree’, with the ‘root’ at the top and the decisions starting at the bottom. The first decision is to combine two atoms or ‘leaves’ as these are known in maths and computer science. Then, by another decision, another atom is combined with the first two. And so on. The output of one instance of such a composition can be composed with another. The branching can be repeated any number of times. This is what is known as ‘recursion’. Crucially there is no limit here. This creates the basis for infinity. On standard assumptions of simplicity, the recomposition must have been based on binary sets, sets with only two elements. Any greater number would have been exponentially more complex. By the framework here, the modern system with its capacity to generate an infinite number of sentences could not have evolved in any other way. I sketch this below.
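For readers who think more easily in code than in trees, the bottom-up, binary, repeatable composition just described can be written out as follows. This is a minimal sketch under stated assumptions and not a formalisation of Merge from the literature. The names `Atom`, `Node`, `merge`, `depth` and `leaves` are hypothetical, and the choice to record a head as one member of each two-element set is a simplification made for the illustration.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Atom:
    """A leaf of the tree (hypothetical stand-in for an atom)."""
    label: str


@dataclass(frozen=True)
class Node:
    """A binary set of two members, one of which is marked as the head."""
    head: "Tree"
    members: frozenset


Tree = Union[Atom, Node]


def merge(head: Tree, other: Tree) -> Node:
    """Combine exactly two distinct items into an unordered two-element set."""
    if head == other:
        raise ValueError("a binary set needs two distinct members")
    return Node(head=head, members=frozenset({head, other}))


def depth(tree: Tree) -> int:
    """Number of composition steps on the longest path down to a leaf."""
    if isinstance(tree, Atom):
        return 0
    return 1 + max(depth(member) for member in tree.members)


def leaves(tree: Tree) -> set:
    """Labels of all the atoms at the bottom of the tree."""
    if isinstance(tree, Atom):
        return {tree.label}
    found = set()
    for member in tree.members:
        found |= leaves(member)
    return found


if __name__ == "__main__":
    first = merge(Atom("eat"), Atom("apples"))   # first decision: two leaves
    second = merge(Atom("to"), first)            # output composed again
    print(depth(second), sorted(leaves(second)))
```

Nothing in `merge` limits how often its own output can be fed back in, which is the sense of recursion intended in the text. Equally, nothing in it says which atoms may be paired with which.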

But this is a rather useless sort of infinity. There is no way of knowing which features go where. Anything can be said. But there may be no way of telling or guessing what it means.

For whatever reason, children exploit the recursive potential here in only the most minimal of ways.

What makes the proposal here different from Chomsky’s Merge is that the process here is divided into two complementary aspects, first decomposing the constituent features to one of each, both the semantic or pragmatic features and the gestural features, and only then composing or recomposing them. The motivation of this change is just to allow it to apply to the evolutionary beginning.

By the simplest, most conservative, most Darwinian assumption, the initiative here is most likely to have been by giving meanings, in the simplest of dictionary senses, to perceptible elements, whether spoken or signed, with the meaning having priority over the physical expression.

Ferdinand de Saussure (2016) referred to the relation here as one between the ‘signifier’ and the ‘signified’. For Saussure, signifier and signified were fully-developed, modern words. Saussure emphasised the complete arbitrariness of this relation. Some linguists have drawn attention to the way particular feature combinations, supposedly onomatopoeic, have related connotations – like ASH in bash, mash, smash, crash, all denoting some degree of violence, tick tock and clip clop denoting the sounds of a clock or a horse, and so on. The case of iconicity in sign language is similar. But in no language are such relations anything but marginal.

The first instances of the relation here could have been either referential or related to discourse. But for the sake of generality, it seems more likely that such primordial expressions involved a degree of reference, as by a recognition of identity in an act of welcome. Reference is universal across languages. It is reference which marks human language apart from non-human systems. Dolphin calls of group or individual identity seem to be meaningful only in the presence of another group-member. It is only a step towards reference. Reference allows an entity to be called up just because it happens to be in a speaker’s mind.

The decomposition may have been to a single articulatory feature such as a mere closure of the lips, and semantically to a reference to mother.

So, for example:

Joe at 1; 0 (13) seeing the swings in the playground said something which his mother heard as “See saw, margery door” – as DEE DAW, DEE BAW, DEE DAW. The commas divide this into three utterances. Here all the consonants are what are known as ‘voiced stops’, stops because the airstream is completely blocked in the mouth, and voiced because the action is only momentary with the buzzing sound from the larynx beginning as soon as the closure is released. The variation between D and B is by the action of the tongue tip in D as opposed to the action of the lips in B. The partial closure by S is replaced by a complete closure. And the nasal airstream by M is replaced, again by a complete closure.

And similarly, Frank at 0; 9 (4) “Mum”, at 0; 9 (7), at 0; 10 (7) “Mama”, at 0; 11 (10) “bus”.

All of these forms seem to be referential, at least to some degree.

With no limits on what can go where, it may be impossible to process any more than a single step of the potentially recursive process. At this point in the acquisition of speech and language, with the exploitation of infinity essentially unordered, it is often unclear what is being represented. Such speech is very hard to understand.

So the infinite potential is not exploited by the one year old saying his or her first word. By the proposal here, the reason this potential is not initially exploited is that there is no definition of what can be paired with what or of which atom can or should constitute the ‘head’ of the expression.

By the proposal here, this structure is both primordial and general across all sorts of linguistic structure, defining the physical expression and all aspects of meaning.

In the diagram below, three articulatory features are represented, defining three physical aspects of a single expression, the involvement of the lips, the involvement of the airstream through the nose, and the complete stopping of the airstream through the mouth. The decomposition gives these three features. The recomposition gives their assembly into some part of a phoneme or speech sound. But primordially and in modern children’s early speech and language development, the recomposition is not defined. A stopping gesture may be composed with the lips. And the output of this composition may be composed with a nasal airstream. The order of this composition is accidental. But accidental or otherwise, the order is acoustically significant. Composing the air stream gesture with the other two features sounds different from any of the other logical possibilities here. This is shown in the diagram below by the dotting of the lines. The nose represents an airstream through the nose. The cork represents the complete closure of the mouth. And the lips represent the closure by the lips. All that is defined is that there is a relation between these three features. So faint versions of two features are shown for each leaf in addition to one clear version. And the dotting represents the essentially accidental nature of the structure.
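The ‘logical possibilities’ for composing three features two at a time can be counted mechanically. The following is a minimal sketch, assuming that each composition is an unordered pairing and that the three features are simply labelled `lips`, `nasal` and `stop`. The function names are hypothetical and the labels are a convenience for the illustration, not a phonetic analysis.

```python
from itertools import combinations

FEATURES = ("lips", "nasal", "stop")  # assumed labels for the three features


def bracketings(features):
    """Return every way of composing the features by unordered binary steps.

    Each result is a nested frozenset: a pair whose members are either a
    feature label or another such pair.
    """
    items = frozenset(features)
    if len(items) == 1:
        return {next(iter(items))}

    ordered = sorted(items)
    first, rest = ordered[0], ordered[1:]
    results = set()
    # Decide which of the other features share a branch with `first`;
    # at least one feature must be left over for the other branch.
    for size in range(len(rest)):
        for companions in combinations(rest, size):
            left = frozenset((first,) + companions)
            right = items - left
            for left_tree in bracketings(left):
                for right_tree in bracketings(right):
                    results.add(frozenset({left_tree, right_tree}))
    return results


def show(tree) -> str:
    """Render a nested pairing as a bracketed string."""
    if isinstance(tree, str):
        return tree
    return "(" + " ".join(sorted(show(member) for member in tree)) + ")"


if __name__ == "__main__":
    for tree in sorted(bracketings(FEATURES), key=show):
        print(show(tree))
```

For three features there are exactly three such structures, differing only in which feature is composed last. With nothing to choose between them, a listener has to learn which one a particular child happens to use.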

Significantly, in modern acquisition, children at this early point in their speech and language development are often understandable only by those who know them best, who are used to the particular way they happen to order the recomposition. The relevant features may be inserted into the structure in any logically possible way.

Take for example what seemed to be a child’s first word – bus. Let us apply the same schema, simplifying the complex, but undefined structure of the sound here by three representations of features and the whole vocal tract. And let us relate this to the notion of a (London) bus. By the schema here, there is no certainty about the derivational structure. And we show this by the dotted lines.

The adult listener interprets the form here as having the status of a syllable, even though this is not a possible syllable of English. English requires at least two elements in what is known as the ‘rime’ of a lexical monosyllable (a noun, verb, preposition, adjective, or adverb). The rime is the nucleus or vowel and any consonants after it. The nucleus can be one of the short vowels in hit, heck, hat, hut, hock and hoot. All the other vowels in English are long or tense, and fill two ‘slots’, or branches by the model here, and are thus able to constitute a rime on their own without needing a consonant to complete the structure of a syllable. On both criteria, both in terms of their structure and in terms of how the structure is filled, the two commonest words in English, a and the, escape the exclusion.

On this basis, bus as a first word, pronounced as BUH without the S, might be represented as follows:

It might seem that there is a much simpler way of representing what is going on here. We could just say that the child says bus. Or we could just label the atoms from left to right, one by one, as they occur, the onset, or initial consonant, and the rime with its two constituent parts, the nucleus or vowel, and the coda or final consonant. But neither of these apparently simpler statements captures the fact that the child is rather clearly and obviously at the beginning of a pathway to speech and language. And a tree makes it easy to describe how children’s speech most commonly develops, often omitting the coda, but seldom omitting the onset, which is higher in the tree and ‘safer’ on account of that.
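The tree of onset and rime, with the rime divided into nucleus and coda, and the requirement of at least two elements in the rime of a lexical monosyllable, can be stated as a small check. This is a minimal sketch only. The class and function names are hypothetical, vowel length is simply supplied by hand as an assumption, and the spellings used for sounds are rough stand-ins and not a transcription.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rime:
    nucleus: str            # the vowel
    nucleus_is_long: bool   # assumed to be supplied by the analyst
    coda: str = ""          # any consonants after the vowel; empty if none


@dataclass(frozen=True)
class Syllable:
    onset: str              # initial consonant(s), higher in the tree
    rime: Rime


def rime_slots(rime: Rime) -> int:
    """Count filled slots: a short vowel fills one, a long vowel two,
    and a coda, if present, fills one more."""
    vowel_slots = 2 if rime.nucleus_is_long else 1
    coda_slots = 1 if rime.coda else 0
    return vowel_slots + coda_slots


def is_possible_lexical_monosyllable(syllable: Syllable) -> bool:
    """Apply the two-element requirement on the rime described in the text."""
    return rime_slots(syllable.rime) >= 2


if __name__ == "__main__":
    bus = Syllable(onset="b", rime=Rime(nucleus="u", nucleus_is_long=False, coda="s"))
    buh = Syllable(onset="b", rime=Rime(nucleus="u", nucleus_is_long=False))
    bee = Syllable(onset="b", rime=Rime(nucleus="ee", nucleus_is_long=True))
    for name, syllable in [("bus", bus), ("buh", buh), ("bee", bee)]:
        print(name, is_possible_lexical_monosyllable(syllable))
```

On these assumptions the adult form bus passes, a long-vowelled form such as bee passes without a coda, and the child's BUH, with a short vowel and no coda, does not. The check deliberately says nothing about function words such as a and the, which fall outside the requirement.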

Using the same schema, the meaning of bus as a first word might be represented as follows:

Modern language exploits the complete resources of a system which is continually changing in how things are pronounced, in how words are put together, in what they mean, and so on. But that being said, it is now in a sense fully evolved. The primordial system has been supplanted by both evolution and what is known as grammaticalisation, the churning over of the grammatical apparatus under various pressures over tens of thousands of years, some pulling in opposite directions.

There are what may be fossils of the primordial sound / meaning relation in:

  • The long period in language acquisition of just single words, as greetings, as “more”, “hello”, “bye bye” and so on;
  • English yes and no and what are known as ‘modal particles’ or ‘discourse markers’ in many languages, curses and greetings;
  • Expressions like sh as a call for silence, Ah for pleased surprise, Eh as a query, and tut tut for disapproval;
  • What are known as ‘imperatives’, commonly expressed, as in English, by the ‘root’ form of a word, such as come and go, sometimes for the sake of saving life, sometimes greatly elaborated by more complex grammar, as in “For goodness sake, just go”;
  • Expressions like “genius” in response to a performance.

These expressions are unlike the primordial forms proposed here in that they exploit combinations of features by later steps. But they are used in a way more characteristic of the primordial system. Note that these modern expressions are mostly, but not all, matters of discourse. Reference is mainly embedded into structures which are meaningful by virtue of their syntax.

This does not prohibit the continuation of a system which may have been used by our common ancestors at the point of differentiation between the two lineages, humans and chimpanzees. These early ancestors may have had a system of calls of any degree of acoustic length and complexity, and used for any purpose. But such a system could be supplemented by one or more forms, distinctively represented by the sort of branched structure diagrammed above, to be articulated and understood accordingly.

How far the first inventor or inventors were CONSCIOUSLY aware of what they had invented is obviously impossible to say. Even just one suitable expression could be very appealing. The adaptation here conferred a significant degree of Darwinian fitness. Inheritors of the adaptation had a greater chance of mating and thus of passing the adaptation on. All we know is that from the beginning, speech and language were noticed. Or the capacity could not have spread, and eventually fixated.

By the approach to grammar developed over the past 40 plus years and assumed by Chomsky and others (2023), there are what are known as ‘uninterpretable features’ defining the ‘case’ and ‘subjecthood’ of he, she, and we in “He fell over”, “She fell over” and “We fell over”. One way or another, the mechanisms of case and subjecthood have to be explained – categorially by classical grammars or featurally by approaches like the one assumed here, by the logic of evolvability. Logically, there are two possibilities: either the uninterpretable features were at least fore-shadowed from day one, or they evolved at some later point in the evolution of language. The former is the more plausible. Suppose, in the limit case, that the first proto-word was just one item in a repertoire of hoots and shrieks. It would still have been significant to users and anyone capable of understanding and processing it. It would plainly be quite fanciful to imagine that such a complex notion as case or subjecthood had any application in the context of isolated, primordial proto-words. But it is much less fanciful to imagine that the first decomposed and recomposed gestures were ear-marked in a way defining their singularity in a system consisting mainly in shrieks and howls, and that it was this marking of singularity which allowed the ability to recognise and learn them to become heritable.

2. Pair

At the lowest, logically-possible point of compositionality, contrasting elements are put together. This reduces the infinity by Decompose and Recompose.

For example, Joe at 1; 5 (23), six months after his first ‘words’, said “Bye, doggy”.

Frank, still more precocious than Joe, at 1; 2 (22) said “Bye bye, Daddy”.

Bye and bye bye are clearly aspects of discourse. Daddy and doggy, the latter addressed to a toy dog, are seemingly referential. So there is discourse and reference, but with no internal structure. “Daddy, bye” and “Bye bye, Daddy” do not involve any change in meaning. The ‘force’ here is analytically and cognitively indeterminate.

Showing an explicitly discourse element by a wiggly, red line, with reference to the observation of Joe at 1; 5 (23):

This primordial system is reflected in modern language in various ways:

  • Something close to reduplication in adult speech by hip hop, chit chat or pow wow;
  • Expressions standardly by root forms combined in ways falling outside the terms of the grammar, kill joy, go between, go slow and so on, noted by Ljiljana Progovac (2015);
  • Possibly at least some adverbs, as the only sort of word which can appear in different positions in English, albeit with some subtle changes in meaning, as with sadly in any of the six logically possible positions in “He is going to die” – all grammatical, at least for those who allow infinitives to be split;
  • Lexically, the contrast is between terms of reference and discourse items, including ah, ey, uh, oh;
  • In terms of sound structure, as a physical, articulated expression, possibly by evolution, but certainly by modern acquisition, syllables with a vocalic nucleus and a consonantal onset, in what is thus a syllable. Or, in a way that may be out of sync with the elements of meaning, there is what is known as ‘foot structure’ or stress between the syllables, as in Daddy.

The modern infant generally puts his or her first two ‘words’ together around the point when the vocabulary reaches around fifty items. There is no reason for assuming that Pair evolved at the point when some particular number of items became accessible.

3. Head

By another step, one element becomes the head of the combined expression.

For example, Joe at 1;7 (30) said “In er car…. In car”.

And Frank at 1; 3 (2) said “Open door”.

When the phrases “In car” and “Open door” were uttered it seemed probable what they meant, but not certain. The relation is asymmetric between two contrasting elements, neither such that it can in all cases stand on its own. In other words, a headed structure can’t include bye bye or hello, both stand-alone elements in discourse. But however they should be understood, both of these phrases have well defined heads, in and open, contrasting with the clearly referential elements in car and door. In the framework here, the non-head is known as the ‘complement’. By the diagram above, dominance is thus built into the system in a way that is now well-defined.

  • With the branching applying to just two elements, sisters in the framework here, headship is thus essentially a relation between defined elements, both with parts. In the modern child’s process of acquisition, by the simplest possible interpretation, the structure of “In car” involves two elements, car, essentially noun-like, and in, as a step towards a preposition. “Open door” contrasts noun-like door and verb-like open;
  • Head signals the first step towards distinctive ‘parts of speech’ as these are called by traditional grammar, nouns like car, Mummy, Daddy, verbs like want and like, prepositions like in. The items become differentiated, as the only sorts of expression on which grammatical operations can be defined. On the simplest plausible readings, open and in are plainly heads. Both elements can now express formal relations, as head and complement, to the expression as a whole. There is a grammatical relation between them, each with an irreducibly necessary, structural role. But it seems premature to regard such elements as fully-defined nouns, verbs and prepositions;
  • The elements have features which define the interaction, as opposed to some purely accidental relation, as by ooh, eh, ah, yes, hello, good bye, and so on, all independent from the grammar because they can stand on their own, sometimes adjoined to it, but not by Head.
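The difference between Pair and Head can be sketched as two ways of combining atoms: one leaves the combination unheaded, the other marks one element as head and excludes stand-alone discourse items. The class names, the `kind` labels and both function names are hypothetical, chosen only for this illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Atom:
    form: str
    kind: str  # hypothetical labels: "discourse", "referential" or "relational"


@dataclass(frozen=True)
class Combination:
    left: Atom
    right: Atom
    head: Optional[Atom]  # None for a Pair, one of the two atoms for a headed structure


def pair(a: Atom, b: Atom) -> Combination:
    # Pair: two contrasting elements put together, with no internal structure.
    return Combination(a, b, head=None)


def head_merge(head: Atom, complement: Atom) -> Combination:
    # Head: an asymmetric relation. Stand-alone discourse items are excluded.
    if "discourse" in (head.kind, complement.kind):
        raise ValueError("a headed structure cannot include a stand-alone discourse item")
    return Combination(head, complement, head=head)


bye_doggy = pair(Atom("bye", "discourse"), Atom("doggy", "referential"))
in_car = head_merge(Atom("in", "relational"), Atom("car", "referential"))
open_door = head_merge(Atom("open", "relational"), Atom("door", "referential"))

print(bye_doggy.head)
print(in_car.head.form, open_door.head.form)
```

The sketch treats in and open alike as "relational" heads, which reflects the point that it is premature to regard them as fully-defined prepositions and verbs.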

4. The squeeze

In English, as in most languages, questions beginning with words like where (or their equivalents in languages other than English) are asked with where mostly on the left of the structure, as in “Where are you going?” But in a full sentence answer, the requested information is on the right, as in “I’m going to the fair.” Why should this complexity be common, not universal, but common across languages? By ancestral grammars from the 1950s to the 1980s, this was represented as ‘movement’. More recent grammars postulate another sort of device, one that avoids the notion of movement.

Building on the featural and combinatorial properties by Decompose and Recompose, Pair and Head, from an array of lexical entries, Squeeze lists all the forms in a structure, as it is being built, making it possible to squeeze out a particular, suitable item which has already featured in a previous step of the derivation. Chomsky calls the functionality here ‘Internal Merge’. But the notion of squeezing the last drops seems to me more plausible as an evolutionary step, even if this is entirely unconscious, implemented much too fast to be conceivable as a conscious act. By the notion of ‘Internal Merge’ or Squeeze, the notion of movement is no longer necessary. And the infinity is reduced.

Necessarily in English, the force of the question in “Where are you going?” determines a reversal of the sequence of you and are from the sequence which would be followed in the statement “You are going to the fair” or I and am in “I am going to the fair”.

Typically, as in English, a squeezed item, B in the diagram below, is then not pronounced at its point of origin.
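The idea that an item already in the derivation is selected again, and then pronounced only at its higher position, can be sketched with a flat list standing in for the tree. This is a deliberate simplification using the child question “Where Daddy?” discussed in this section; the function names `squeeze` and `pronounce` are hypothetical.

```python
def squeeze(structure, item):
    """Select again an item already in the structure and place the copy at the left edge."""
    if item not in structure:
        raise ValueError("only an item already in the derivation can be squeezed out")
    return [item] + structure


def pronounce(structure):
    """Pronounce each item only at its leftmost (highest) occurrence."""
    seen = set()
    spoken = []
    for item in structure:
        if item not in seen:
            spoken.append(item)
            seen.add(item)
    return " ".join(spoken)


base = ["Daddy", "where"]            # where first selected with a notion of location
question = squeeze(base, "where")    # selected again, with the force of a question

print(question)             # both copies are present in the structure
print(pronounce(question))  # only the higher copy is pronounced
```

Nothing moves in this sketch: the lower copy stays where it was for interpretation, and only the pronunciation step skips it.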

For example, Joe at 1;10 (3), says “Daddy upstairs” where Daddy seems to be the ‘subject’ in traditional terminology, and at 1; 10 (27) “Where Daddy?” with where seeming to define a clear question. In the first, the B element is only extracted once. But in “Where Daddy” it is extracted with a notion of location, and then squeezed out once again with the force of a question.

At 1; 4 (27) Frank is asked, “Who wants some chips?” And he replies “Me”. And at 1; 5 (9) he asks “Where chicken?” Both the appropriate answer to a who question and the where question 12 days later suggest a grammar capable of making two uses of the same element, once to define an identity or location, and then, by Squeeze, with the force of a question.

Here where and who are fulfilling a special role as the sisters of the head of the A B structure. This role is commonly known as a ‘specifier’. In the framework here, it is a universal.

Significantly, English also allows forms like “Daddy is where” or “The chicken is where”, typically with where heavily stressed, no longer with the force of a question, but as statements of surprise or astonishment.

In “Daddy upstairs” by two branchings, one defective, Daddy is the specifier, in this case, the subject.

The contrast between the elements expresses the simplest possible structure with a definable spine, in this case with Daddy dominating X upstairs, where X is an unrealised abstract element.

In Joe’s “Where Daddy?” at 1; 10 (27), where is pronounced on the left and interpreted on the right of the structure (shown in grey) from where it has been copied.

Here “Upstairs” might be a plausible child’s answer, taking “Daddy is upstairs” or “Upstairs” as plausible adult-type answers, except that upstairs is treated here as a bare marker of location, questioned by where.

In modern acquisition, between a week and three months after two words are put together, a question is asked or answered involving a question word relating to one of the items in the two word combination, particularly what, where, and so on, signalling points of curiosity, as a key factor in discourse.

By this step:

  • Questions with a Wh word can be asked or understood;
  • Structures, with what traditional grammar calls ‘subjects’, can be expressed on noun-like elements. Thus I has the special role of expressing a subject, a seemingly universal property of sentences. The subject role is purely grammatical or syntactic, as in “There is food on the table” and “It is a shame that you’re ill” where neither there nor it has any semantic role. This expresses what in the framework here is characterised as the ‘specifier’. The sister of the head ‘specifies’ the structure;
  • What are known as ‘thematic roles’, including agency, ownership, location, benefit, destination, or experience, can be expressed;
  • The first versions of what is known as ‘Case’ can be expressed, as by the difference between he and him, she and her, as what are known as ‘arguments’ in one of various relationships, essentially who is doing what to who. Some aspects of Case have a plain relation to thematic roles. In “I am waking them up” or “They are waking me up”, the references of I, me, they and them change as roles change or speakers take turns to talk;
  • Elements of composed structure can be recomposed at a higher level;
  • Across the system of phonemes, relativities can be marked in contrasts such as those between P and B, defined on a difference in the delay between the release of a closure and the onset of ‘voicing’ by bringing the vocal cords together;
  • The cognitive load of searching for and extracting items from the lexicon is greatly reduced. At this point in language acquisition, the lexicon is expanding rapidly. Reducing the onus of Search is important.

5. Anchor

Every element has a role, from the verbal head of an expression like open in “Open door” to the Wh question form where in “Where chicken”. But at a given point, the semantics gets lost, and the only role is the role itself. Such forms are known as ‘functors’, and marked in English in very obvious ways, easily detected by the child learner. They can be displaced like where, shunted to the left edge of the sentence, in a way marking illocutionary significance. Or their sound structure can be reduced, by losing their vowel, for instance. Another sort of functor anchors the sentence to the most salient aspect of the discourse context. In English, as in most languages, this is what is known as ‘tense’. The most obvious example is the word is or its contracted form, written as ‘s, with reference to an immediately present event or situation. English past tense is marked either as -ED, as in sorted, or -D as in lied, or -T as in spilt, or by a change in what is known as the rime as in ate, saw, took, or by the whole form of the verb as in was and went. By the profound insight of Chomsky (1957), this marking of tense is separate from the verb itself. In “Did you tell the truth”, tense is expressed on the word did, known as the ‘auxiliary’.

Joe at 1;11 (12) asks “Who’s that?” On the same day, looking at a picture book together, his mother asks: “Where’s the bus?” Joe replies: “There’s bus.” At 1;11 (14) he asks: “Where’s man tractor?” It was not clear if he meant “Where is the man’s tractor?” or “Where is the man for the tractor?” or something else. The point is the articulation of the ‘s form.

Frank at 1; 5 (29) asks: “What is that?” with the observation that the is form was clearly detectable.

Showing the new functional projection in bold.


These are the first uses of an is or ‘s form by these children, known as ‘inflection’. Here an element is inserted into the structure, anchoring it to the here and now of the utterance. To a degree this reflects an aspect of the discourse. There is no reason for thinking that there is any contrastive intent here. He is not also asking things like “What was that?” But the use of the is form is a place-holder for the tense category as this becomes accessible to consciousness. And in a broader sense, the form signals the accessibility of elements which are purely functional, with their own corresponding projections.

6. Measure and compare

Extraction and Anchor are mathematically powerful devices. Applied to the output of one another, they can be exploited indefinitely. Such a grammar strains both processing and production. It is patently impossible to learn under the condition of finite learnability. Only a small minority may have been able to master the complete apparatus, with wide and significant variations in the mastery, as in all other areas of human skill from musicality, to art, to athleticism of all sorts, and in a way sometimes thought to be controversial, cognition.

By Measure and compare, a minimal degree of dominance by one head is compared to that of another to any equal or greater degree. This relation characterises numerous phenomena in the grammar of unrelated languages, in English including pronouns such as I, you, he and she, what are traditionally known as ‘reflexives’, as in “I hurt myself” and negatives by not and its reduced form written as n’t. The negative form only appears immediately after the form expressing the tense in doesn’t, didn’t, can’t, won’t, and so on. The scope of operations, each doing just one thing at a time, is restricted by comparing and measuring just two degrees of dominance.

These phenomena are obviously very abstract. But that, I propose, does not make them implausible.

At 1; 9 (22) Frank said “I hurt self” with I and self with the same reference. At 1; 10 (27) he said “I found this” with the ‘nominative’ pronoun I next to the past tense found. At 1; 11 (4) he said “I don’t like it” with the negative n’t next to the auxiliary do. At 1; 11 (13) he said “I’m making lorry” with ‘agreement’ between the first person pronoun I and the auxiliary am. At 2; 0 (20) he said “I need more that” with more as the head of a complex phrase. All of these cases involve the measurement and comparison of dominance.

Going through the same process a little later, but faster, at 2; 4 (26) Joe said “Mog doesn’t like that”.


A week later at 2; 5 (5) he says “doggy licking hisself” with the reflexive one level down from what is known as its ‘antecedent’ – in this case doggy.

As far as pronouns are concerned the key data for English is in contrasts like the one between “She says Mummy feels tired” and “Mummy says she feels tired.” In the second, she could be Mummy. But in “She says Mummy feels tired”, she can’t be Mummy. Such relations are common across languages, raising the obvious question: Why should this be? Ever since a seminal (1976) work on the issue by Tanya Reinhart, this has been a hot and continuing topic of debate. All approaches since that of Reinhart have focused on the small size of the domain, as illustrated in the diagram above.
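The contrast between the two sentences can be sketched with the standard textbook relation of c-command, commonly associated with Reinhart's work: a pronoun cannot share reference with a name that it c-commands. How exactly this maps onto Measure and compare is left open here. The nested-tuple trees, the bracketing and the function names are all assumptions made for the illustration.

```python
def find_path(tree, word, path=()):
    """Return the sequence of branch choices leading to a word, or None if absent."""
    if isinstance(tree, str):
        return path if tree == word else None
    for index, child in enumerate(tree):
        found = find_path(child, word, path + (index,))
        if found is not None:
            return found
    return None


def c_commands(tree, a, b):
    """True if the sister of a contains b (a and b are distinct words in the tree)."""
    path_a, path_b = find_path(tree, a), find_path(tree, b)
    if path_a is None or path_b is None:
        raise ValueError("both words must occur in the tree")
    if not path_a:
        return False  # a is the whole tree and has no sister
    parent = path_a[:-1]
    return path_b[:len(parent)] == parent and path_b[:len(path_a)] != path_a


def may_corefer(tree, pronoun, name):
    # Simplified condition: coreference is blocked when the pronoun c-commands the name.
    return not c_commands(tree, pronoun, name)


# Hypothetical binary bracketings of the two sentences.
she_first = ("she", ("says", ("Mummy", ("feels", "tired"))))
mummy_first = ("Mummy", ("says", ("she", ("feels", "tired"))))

print(may_corefer(she_first, "she", "Mummy"))
print(may_corefer(mummy_first, "she", "Mummy"))
```

The check only ever compares two positions on the tree, which is one way of picturing the small size of the domain involved.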

By a spine-based universal grammar, these things can be encoded in ways that vary across languages, but using the same, universal template. By this sixth step, degrees of dominance are measured and compared.

Measure and compare imposes a ceiling on specified relations, abstractly A and B, at the top of the spine at a given point in the derivation.

This allows a special relationship between I and am and between she and is, one denoting what was traditionally known as the ‘nominative’ case of the subject and the other denoting the most immediate aspect of the here and now in the discourse. In most languages including English, the key aspect of the here and now is related to time, represented as the tense of the verb, as in the differences between I am and I was and I have and I had. Nominative case is purely grammatical, with no thematic role or obvious relation to the here and now or the needs of communication.

For example, Joe at 2; 5 (7) said “I saw lorry pulling car” and on 2; 5 (11) “I took picture of milkman”. In both of these structures, tense and case are overtly represented in a sisterhood relation.

The marking of tense on the verb and the anchoring of the sentence to the context of the utterance is almost, though not completely, universal. But Measure and Compare restricts the anchoring to the edge of the hierarchy. The infinity is further reduced. Universally, these two functionalities, tense and nominative Case, are defined at the top of the projection chain. This is reflected in the way both are expressed as the left most elements in “I might have been being deceived”. But the way it works in English is complex and hard to learn.

In “I may seem to be asleep” the thematic role of I is plainly not a function of the main verb, seem, but of the embedded verb, be. “I may seem to be asleep” means the same thing as “It may seem that I am asleep”, but the structure is quite different. I with its marking of Case and the tense of be get shunted upwards or ‘raised’ by successive steps of projection, each step by a separate process, shown here by the arrow in a simplified diagrammatic form of tree diagram.

The process can be continued as in “I may seem to want to be asleep” with a different meaning, but still with I immediately followed by the tense-bearing may. The sense of the tense-bearing element has been lost in the history of English. “I might seem to want to be asleep” means almost the same thing. But without may or might in “I seem to want to be asleep” or “I seemed to want to be asleep” the tense difference is clear and overt, again immediately next to I.

Tense and nominative case constitute the two most sharply contrasting sorts of elements within the hierarchy.

The expression of these levels of the hierarchy, for noun and verb like elements, varies from language to language. These are things the language learner has to learn. They fall within the learnability space. By the proposal here, Measure and compare is universal. But expressed in terms of the spine it is biologically encodable, and thus readable within the human genome. By the proposal here, it is precisely the abstractness of Measure and compare which makes it both universal and a heritable aspect of UG.

7. Phase and Complementiser (or Sentence) – limiting the whole expression

By this final step, the grammatical apparatus AS A WHOLE is factored into the smallest possible components. This follows a consistent approach from Chomsky’s first widely circulated (1957) work factoring the grammar into two sorts of rule, phrase structure rules and transformational rules, then a division between deep structure and surface structure by Chomsky (1965), then the effect of a barrier with respect to what was at the time considered to be the ‘movement’ of elements such as what and where and other phenomena by Chomsky (1986), then by Chomsky (1999 and 2001) with the ‘spelling out’ of the minimally and irreducibly necessary and relevant information in separate ‘Phases’. The motivation for these conceptual or theoretical changes is to explain WHY the incredibly rich structure of a language is the way it is and HOW it is reliably learnt by all normally developing children in the way it is despite the infinite variations in children’s experiences of language. A phase is defined by the fact that as soon as it has been completed, most of its structure becomes inaccessible to the ongoing process of derivation. This allows the derivation to proceed in small, manageable parcels. A phase may be expressed by only one word. But it can have a special analytic status, as in the case of the Wh words. As by the primordial Decompose and recompose step, the information that is sent to the two interfaces (for physical expression and understanding) has to be both detailed and complete.

Here the phases are shown as alternating light black and heavy red branchings, at least two for each clause, the light black roughly representing the propositional content, and the heavy red roughly representing the ‘force’. These are not the steps proposed here, but the effects of the final step.

By this seventh step, the spine maps onto the formal structure of UG. The phasing becomes detectable only by the last of the seven steps.

The novelty, by a phased approach to syntax, is that the factoring into two phases is repeated clause by clause. At least by the original conception (which I am assuming here), the first phase spells out the referential and propositional content, the second phase spells out what John Langshaw Austin (1957) called the ‘illocutionary force’ of statements, commands, questions, pleas, and so on.

By the proposal here, both in the evolution of language and in the acquisition by modern children, the factoring of the grammar by phases is (necessarily) ordered last.

Making sense of the structure of what children hear said is not a simple matter. As Carol Chomsky showed in (1969), many ten year olds are still misunderstanding sentences like “I’m asking you what to feed the dog” as “I’m telling you what to feed the dog”. She suspects that some individuals may not proceed to a full understanding of this point.

Showing accessible elements in bold red, and with lower, earlier, inaccessible elements lighter, with only the head and edge accessible.

There are thus two phases in most clauses, even if the second phase is not represented by any overt structure, but just by the fact that a ‘simple’ proposition is also a statement of fact, which may be contradicted in jest or irony, as represented by the second phase. Thus the first phase is mainly grammatical and the second phase often has a significant discourse element.

By the first six steps proposed here, the force of a structure was an accident of the structure itself and the circumstances in which it was uttered. Such a grammar was most likely prone to deep and frequent misunderstandings. By the seventh step, an expanded notion of force was defined as ‘Complementiser’ or C, as the topmost level of the spine, and replacing the traditional notion of a ‘sentence’. C provided a host for what in expressions like “What did you say you thought I said?” with what as the complement of I said at the opposite end of the structure.

In simple, declarative main clauses, C is not expressed in English. But it is the destination or landing site of words like what, where and when and expressions with which in questions seeking particular items of information. By the 1997 proposal of Luigi Rizzi, the ‘force’ of the structure is expressed as a property of C. This applies no matter whether the structure is a statement or a question, or whether the agency of the subject is diminished by passivisation or in some other way.

So for example, Joe at 2; 9 (4) asked “When’s Daddy coming back?” with the Wh morpheme when projected onto the uppermost C level and the contracted auxiliary ‘S stuck on its right edge as what is known as a ‘clitic’.

By characterising this level as that of C, every level from the bottom of the structure to the top is defined in the same way, rather than by giving the sentence a special status of its own, one that is hard to define other than in a purely circular way.

Three weeks after “When’s Daddy coming back”, at 2; 9 (28) Joe produces his first sentence with multiple embeddings, in this case three, with two phases at the lowermost level in “what’s happening”, with thus eight phases in all, with all structures fully specified – in “I want to stand on the chair to see what’s happening”. Simplifying slightly:

At the lowermost level, the contracted auxiliary ‘s is projected to form a tensed structure, and then what is projected to form what is traditionally characterised as an ‘interrogative clause’, in the framework here, now specified by what.

Up until this point, Frank’s language has been slightly more precocious than Joe’s. But now, a month after Joe, at 2; 10 (21) Frank says “I want to sit where Joe’s been sitting.”

These are the two children’s first sentences with multiple embeddings, almost identical, accidentally or otherwise, and fully grammatical. The exactness of the similarity between two utterances in two children two and a half years apart, only noticed forty years later, would seem to suggest that there is significance in such structures with a Wh word specifying an embedded clause, and not forming a question.

As by the examples above, Phase allows the derivation to proceed in steps, as by the process of evolution. But the full application of the principle here takes years to learn. At 9; 9 (13) Joe said “We don’t know whether I’m going to be picked up by who” (of the rather complicated child care arrangements we had in place at the time). Joe’s sentence is anomalous in as much as who seeks particular information and whether seeks only a truth value. But the structure of two Wh words in the same clause calls up the Phase functionality in an interesting way.

By this seventh step:

  • Building the derivation in phases limits how much of the derivation can be manipulated at any one point, reducing to the minimum both Search and the speaker’s and language learner’s tasks in constructing a derivation, allowing complexity to be distributed across it;
  • Information is sent bit by bit to the articulatory system to be pronounced and to the semantic / conceptual system for the meaning to be analysed. English marks the point of sending articulatory information much earlier than languages like Turkish, Mohawk, and many others. So this point necessarily falls within the learnability space;
  • UG becomes knowable, and metalinguistic awareness is brought into being;
  • Fantasy, fiction, non-fiction, irony, fun, comedy, contracts, become parts of everyday life.
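The way a completed phase limits what the ongoing derivation can reach can be sketched as follows, assuming the common formulation on which only the head and the edge of a completed phase stay accessible. The class name `Phase`, its fields and the division of “what’s happening” into edge, head and complement are hypothetical choices made for the illustration, not an analysis.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Phase:
    edge: List[str]        # material at the edge, such as a Wh word
    head: str
    complement: List[str]  # the rest of the phase
    complete: bool = False

    def spell_out(self) -> None:
        # Completing the phase sends its content on to be pronounced and interpreted.
        self.complete = True

    def accessible(self) -> List[str]:
        # After completion only the edge and the head can still be reached.
        if self.complete:
            return self.edge + [self.head]
        return self.edge + [self.head] + self.complement


lower = Phase(edge=["what"], head="'s", complement=["happening"])
before = lower.accessible()
lower.spell_out()
after = lower.accessible()

print(before)
print(after)
```

The amount of structure that has to be searched at any one point is thus bounded by the size of a single phase plus the edges and heads of earlier ones.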

It seems a reasonable conjecture that Phase critically reduces the learnability space, making speech and language finitely learnable in what Eric Lenneberg in 1967 called the ‘critical period’ for language acquisition, normally ending around the age of ten.

Phase may have only fixated across the ancestral stem of anatomically modern Homo sapiens between 100,000 and 200,000 years ago. The proposal here says nothing about the exact time scale here or about how quickly the Phase step in language evolution allowed the first indications of modern culture in jewelry, wall-paintings, and musical instruments. The necessary skills may have taken generations to develop, and then been lost as many ancient skills are being lost today. In competition for scarce resources, reliability of communication would have given a decisive advantage to those having it in relation to any group not having it. Finite learnability has to have been hugely beneficial.

Summary of seven steps

In this way, the infinity by Merge or by Decompose and Recompose is reduced step by step. It can be summarised as follows (for the sake of simplicity, assuming that the steps were primordially vocal, an assumption which may be wholly or partly wrong for at least the first steps):

  1. Decompose and recompose – decomposing and recomposing UNLIKE features – sensory-motor or semantic-pragmatic – the most extreme possible sort of contrast – generating a high level of infinity – in forms not yet properly words;
  2. Pair – of elements, only one of which can stand on its own, as a greeting or such like;
  3. Head – one of two contrasting elements, neither such that it can in all cases stand on its own – one such that it heads the expression;
  4. Squeeze (extraction) – listing a set of elements for selection, not disallowing a previously selected element to be selected again;
  5. Anchor – allowing the most salient aspect of the here-and-now to feature in the formal structure of the grammar;
  6. Measure and compare – comparing and measuring the least degree of dominance at some point in the derivation to some equal or greater degree, thus limiting at least some specified projections;
  7. Phase – splitting the derivation into successive phases, each such that its internal structure becomes unreachable as the next proceeds, limiting the scope of the grammar by any one phase of the derivation.

In the case of two randomly selected, normally developing children, these steps all occur in the space of 22 months. But even at ten and a half, the whole edifice of grammar is still not absolutely complete.

A good start

Given the simple principle of binary branching and the acoustic phenomenon of sounds dying away as the energy is absorbed by the atmosphere, the system just consists in the step-wise reduction of infinity.

In relation to the sound system, English just happens to pursue this branchedness further than most languages. But some languages, Polish for example, go further with even more structure before the onset. All of this falls within the learnability space, and is often problematic in children’s speech development.

The tree can just develop, adding branches, up to some limit, as by the structure in strange. Here the long vowel is shown as AE, where the two elements are separated in the spelling. The final GE by the spelling is shown as a single J, representing the fact that this is just one sound. But it is also a sound with two halves, known as an ‘affricate’, beginning with a complete closure, shown here with a D, and ending with a fractional release of the closure, shown here as ZH, like the sound at the end of beige and rouge. Respecting the binary branching, the initial S is shown as a dependent off the left edge of the syllable.
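The binary branching described for strange can be sketched as nested pairs. This is only an illustration: the segment symbols (S, T, R, A, E, N, D, ZH, J) come from the description above, but the node labels and the exact grouping of the branches are assumptions made for the sketch, not a reproduction of the tree itself.

```python
# A toy binary-branching tree for "strange".
# A leaf is a string (one segment). A branch is a tuple: (label, left, right).
# ASSUMPTION: the labels "core", "onset", "rhyme", "nucleus" and "coda", and the
# way the branches are grouped, are hypothetical choices for illustration only.
STRANGE = (
    "syllable",
    "S",                                  # dependent off the left edge
    ("core",
        ("onset", "T", "R"),
        ("rhyme",
            ("nucleus", "A", "E"),        # the long vowel, shown as AE
            ("coda", "N", ("J", "D", "ZH")))),  # J is one sound with two halves
)

def leaves(node):
    """Read the segments off the tree from left to right."""
    if isinstance(node, str):
        return [node]
    _label, *children = node
    return [segment for child in children for segment in leaves(child)]

def is_binary(node):
    """Check that no branch has more than two daughters."""
    if isinstance(node, str):
        return True
    _label, *children = node
    return len(children) <= 2 and all(is_binary(child) for child in children)

print(leaves(STRANGE))     # ['S', 'T', 'R', 'A', 'E', 'N', 'D', 'ZH']
print(is_binary(STRANGE))  # True
```

Adding a branch means replacing a leaf with a new pair, so the tree can grow only by one binary split at a time.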

Competitive advantage

These steps were taken by a species which had forsaken the safety of the trees for a much more dangerous life on the ground, after making at least six significant precursor adaptations. This was a population which plainly lived on its wits – or died. The population may have remained very small until the discovery of farming, but ranged across Africa while what is now the Sahara desert was forested and well-watered. Within this population, by the proposal here, individuals or groups of individuals must have started restructuring some of their expressions in detectably advantageous ways, but by no more than one term at a time, so that, over the course of thousands of generations, the innovation could diffuse across the population, and (separately) become part of the genome.

The totality of this evolution most likely extended over at least a million years, exploiting, but going far beyond, any life-support cognitions, such as those of making stone tools.

If the proposal here is on the right lines, one or more of the last evolutionary steps may have occurred after the divergence between the main line of modern human ancestors and Neanderthals and before more or less anatomically modern humans appeared in what is now Western Morocco around 300,000 years ago. Inheritors of the epigenetic changes by the last step would have learnt to talk faster, more accurately, more reliably, and crucially more completely. The difference may have been critical, with Neanderthal mastery of speech and language uneven, with no expectation of common understanding. Neanderthals may have been stuck at the point when only a fortunate minority had a full mastery of their linguistic inheritance, whatever that may have been, and the rest of the population had only varying degrees of competence and little or no metalinguistic ability.

An apparatus

Crucially this evolution provided an apparatus which was, and is:

  • Freely used in assembling words together and in the building of speech sounds, in ways that the child has to learn;
  • Commonly over-used in the process of speech acquisition so that children often use devices in the building of words which should be used only in the assembling of words into sentences;
  • Such that speech-disordered children from different generations or parts of a family often have recognisably similar issues;
  • Available in parts, so that questions can be asked and answered in a rudimentary way, so a child of two and three quarters can say “A clock tells you what time it is” displaying the first evidence of Phase long before the full functionality of the grammar has emerged, as it normally has around seven years later.

A theory of speech and language acquisition?

The proposal here is NOT a theory of speech and language acquisition. There are many of these, some involving various psycholinguistic considerations such as auditory memory, processing load, and more. It is just assumed here that the domain is immensely large and complex, and that the steps of mastering it are not likely to be achieved by ticking off the achievements one by one and repeating them reliably, or not so reliably, from that point on. Rather, the steps by the proposal here are more like profound insights which are grasped first tentatively, followed by more failures than successes, and then gradually more and more confidently. Instances of these insights may only appear very occasionally in any sort of longitudinal record.

The motivation here is primarily biological. But I take this to mean that any postulated genomic content should be such that it can be genetically encoded – on the assumption that any adaptation enhancing the fitness of the organism is necessarily both accidental and simple. This rejects any sort of analogy with the primary and secondary dentition, as by Ken Wexler (1996). The evolutionary time scales are different by an order of magnitude. And in every case it is necessary to link the genomic content with a process which can plausibly be encoded – by the spine proposal here.

What is explained by an evolved UG applying to the whole of speech and language

  • The characteristic multifactoriality of speech and language disorders, so a history of delayed or disordered speech is often co-morbid with literacy problems;
  • The converse specificities of common errors, by misapplying what should be syntactic processes in the phonology;
  • The fact that many characteristics of child speech seem reducible to the lack of any proper definition of phonemes, syllables, words, and so on;
  • The characteristically poor metalinguistics of children with speech and language disorders.

By this proposal – consequentially:

  • As the genome evolved, the effects of the steps interacted with one another, giving the complex variations which Roberts (2022) characterises as ‘building blocks’;
  • An abstract Universal Grammar UG is derived from the evolution of the human species. But there is cross-linguistic variation in how it is used – for example in which parts of the sentence structure are projected where – with global effects on word order and other aspects of what is commonly characterised as ‘grammar’. While a language may not express one or more parts of UG, all languages, spoken and signed, are built from it;
  • It is possible in principle for parts of UG to be incompletely specified in some individuals;
  • Rather than postulating a series of separate disorders, many apparent disorders, even some with names in popular speech, such as ‘lisping’, may fall out from a fully worked out theory of speech and language evolution;
  • Following Progovac (2015), there must have been a series of protolanguages, each likely to have left fossils.

From evolution to acquisition

In the absence of any evidence that acquisition proceeds differently from evolution, acquisition and new language formation may provide the closest approach to direct evidence of the possible, probable, or necessary course of speech and language evolution. It thus seems significant that most modern children ask or respond appropriately to a Wh question such as “Where Daddy” only after producing a declarative structure involving two corresponding elements, such as “Daddy upstairs”, and not in the opposite order. This holds of the two children, Joe and Frank, from whose records the examples here are drawn.

Stems and time scale

The proposal here makes no claim about how the cognitive evolution of speech and language connected up with cognition itself or with any of the physical changes. We just have to note that these changes were in one species, and they would seem likely to have complemented one another.

The steps and precursor steps proposed here were not events. If it took modern human ancestors at least 3 or 4 million years to learn to make a sharp edge or point, it would seem reasonable to assume that the incomparably more subtle process of encoding UG on a spine must have been similarly challenging. But the proposal here has nothing to say about the possible time scale for each step. All that can be said is that the encoding is far removed from the communicative advantages. For instance, the convenience of using pronouns has no obvious relation to the unobvious adjacency of levels on the spine. The difficulty of the translation here would suggest that this may have taken many thousands of generations. But as soon as the spine relation was established, the translation may have been simpler and faster, perhaps greatly so.

From the fact that there are corresponding phenomena in all languages, it is reasonable to suppose that the seven specifically linguistic steps were made separately, as an evolutionary sequence, in a population from which all humans alive today descend. At each point when a necessarily very obvious and visible step was taken, this was valued throughout and across a population. It had to be, or it wouldn’t have diffused and fixated.

But while the capacity for speech and language clearly distinguishes humans from any other animal, and mostly develops naturally without any active intervention, this is obviously not the case for all, with 1 child in 10 having minor problems with speech and language, 1 in 1,000 having major problems, and perhaps 1 in 100,000 being unintelligible in adulthood other than to close family and friends, if at all.

Clinical utility

In relation to less than fully competent speech and language, the proposal here effects a conceptual economy. By the proposal here, most disorders are by the effect of failures in the specification of a genomically defined UG which makes it possible for humans to learn to talk the way they do without needing to be helped, other than to learn what not to say. This makes it unnecessary to postulate a corresponding series of specific malformations.

There is clinical utility from the study of the apparatus and the structures that can be derived from it. It makes many common issues in speech and language disorders more accurately definable. And it broadens the range of plausible interventions into areas that would otherwise be at least hard to treat. For instance, many children have difficulty with both case and tense, as in “She loves me” where the S in loves expresses agreement with the singular property in She. Such children may go on saying things like “Love me” many years after most children have learnt that in a statement, both the she and the S in loves are forced in English.

Nunes (2023) argues that to help such children it may be useful to allow them to discover the ‘sisterhood’ relation between the subject marking of she, known as ‘nominative case’, and the S in loves, known as ‘third person singular’, the formal device by Work space.

Nunes (2002) notes that another group of children sometimes say monopoly as OPOLI. If such errors persist they can lead to stigma or mockery. If OPOLI is part of a broader pattern the child probably needs help. But for what exactly? If monopoly as OPOLI is by the deletion of the first three sounds, what are they? The first two are the first, unstressed syllable, sometimes confusingly called the ‘pre-tonic syllable.’ But the N is the onset of the stressed syllable. What sort of a thing is the whole of one syllable and the beginning of the next? But there is another way of looking at this. OPOLI is the domain of stress. (The N is irrelevant here). The child may be treating the stress domain as though it was the word, an easy mistake for a first language learner of English. By the framework here, this suggests a treatment approach targeting the separateness of the word from the stress domain.
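The two ways of describing OPOLI can be compared in a small sketch. It assumes a toy transcription of monopoly in the capital-letter notation used on this page, divided into four syllables, each split into an onset and the rest. The syllable division, the names `MONOPOLY`, `STRESSED`, `stress_domain` and `material_before_domain` are all hypothetical, introduced only for illustration.

```python
# Toy transcription of "monopoly": one (onset, rest) pair per syllable.
# ASSUMPTION: this syllable division and the onset/rest split are illustrative only.
MONOPOLY = [("M", "O"), ("N", "O"), ("P", "O"), ("L", "I")]
STRESSED = 1  # position of the stressed syllable, NO (counting from 0)

def stress_domain(syllables, stressed):
    """The stressed vowel and everything after it, ignoring the stressed onset."""
    _onset, rest = syllables[stressed]
    following = "".join(o + r for o, r in syllables[stressed + 1:])
    return rest + following

def material_before_domain(syllables, stressed):
    """Everything before the stress domain: earlier syllables plus the stressed onset."""
    earlier = "".join(o + r for o, r in syllables[:stressed])
    return earlier + syllables[stressed][0]

print(stress_domain(MONOPOLY, STRESSED))           # OPOLI
print(material_before_domain(MONOPOLY, STRESSED))  # MON
```

On this sketch the stress domain is a single unit that can be named directly, whereas the 'deleted' material, MON, can only be described as one whole syllable plus part of the next.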

The limits of a proposal

This page sets out a proposal. It is what I am working on, initially on the basis of the observations of two children. It is essentially a question. There may have been more steps. Or Merge may have just developed out of the blue by a single macro-mutation, or what Chomsky calls ‘a minor rearrangement of neurones’. The proposal here just seems to me a biologically well-motivated way of reconciling the evidence of human speech and language, as they currently are, biology, neurology, archeology, paleo-anthropology, genetics, delays, disorders, and the random cases of two individual children. The fact that the two children were brothers, living in the same family home, may have influenced the areas of their attention and interest. But it cannot have had any bearing on whatever part genetics may have played in their focus on the formalities of Universal Grammar, as documented here.

As set out here, this proposal is only a sketch. There is at least another ten years of work ahead.
