3. The nuts and bolts

A research hypothesis – necessary steps

Echoing Dobzhansky (1937) and upholding an idea from Cedric Boeckx (2021), I propose that evolutionary considerations are critical to any understanding of how language works. By my proposal, speech and language must have evolved by an ordered sequence of steps giving the bare, minimally necessary structures for speech and language. I shall refer to these steps as Point, Mimic, Glue, Head, Case, Fit, Honour, Move and Phase, with Point and Mimic as precursors, one of them, Case, crucially involved in the truth or falsity of a proposition, and Phase crucial to the fact that the very complex resulting system is confidently mastered by the overwhelming majority of children in about ten years without needing a jot of explicit instruction.

The eight steps proposed here gave an advantage to a fragile and vulnerable population, making it easier to construct defences and survive, to discuss and develop techniques, to plot, groom, befriend, sympathise, and romance, eventually coming to define the whole surviving population.

At a societal level, the functionality of modern language is the wherewithal of every joint venture from a hunt to a start-up. Whether the issue is how to understand a broken twig or a false financial statement, “I think you might be mistaken” is reliably understandable. In hunting or in business, life or fortune may depend on the warning being clear and reliably understandable.

I propose the overtly linguistic steps from Glue to Move must have recruited quite general cognitions because there is no other way that they could have emerged. This must have happened over a period which may have lasted for two million years. The diverse forms of modern human language, English, French, Arabic, Chinese, rework the framework which emerges from this hypothesis.

None of the steps proposed here is evidenced any longer in its primordial form. But from the fact that there seem to be corresponding phenomena in all languages, it is reasonable to hypothesise that there has to have been a time when they first started to be used.

I shall point to evidence that these evolutionary steps are very approximately recapitulated in child speech.

Each step was essentially simple, but abstract. It could be applied both to the sound structures and to the assembly of words and parts of words, and in different ways, but not in an infinite number of ways. So, for example, there is no principle which inserts an element into the middle of a structure. A word like Corona can have the stress on the middle syllable, but only because English word stress happens to be counted from right to left, ignoring the rightmost syllable. And the principles refer to elements, structures, levels of structure, degrees of domination, including phrases and parts of sounds, known as ‘features’, rather than the more obvious speech sounds or words.

At each point when a step was taken, it was copied throughout a population. It had to be, or it wouldn’t have diffused and fixated. It had to be highly visible in at least one respect.

The process of fixation across the population is demonstrably not yet complete. The learning process does not work for all children, as shown by the various developmental speech and language defects that appear in some children in all cultures, marginally for perhaps 10% of children and proportionately more for a smaller percentage. And for some individuals, some problems endure. The fact that some problems are commonly recognised as such by small children shows that they are defined with respect to a well-defined ‘competence’, to use the term proposed by Noam Chomsky (1965).

It is evidently possible in principle that at least some defects of speech and language involve failures with respect to the necessary learning process at any one of a number of points. One aspect of a step can be missed. Or it can be misapplied. By my proposal here, this is indeed so. The more recent the evolution the greater the likely vulnerability.

The sequence I am postulating is quite different in kind from one or more ‘macro-mutations’. It unfolded within a small and marginal population in the process of becoming human which could easily have become extinct if at any point the evolutionary dice had been cast very slightly differently.  As the population became more and more human, it reacted more and more strongly to the effects of the steps as they occurred. This made the process quite unlike that of most, if not all, non-human evolutions.

Humans were becoming more and more conscious of themselves, their partners, children, other members of their families, neighbours, and competitors. Very occasionally non-humans display something like consciousness of humans as members of another species without any sort of training. Individual cetaceans, primates, and canids have shown or invited empathy with humans and insights about one another, both in the wild and in captivity. I once had a complete manicure of both hands from a female Celebes crested macaque. She reached through the bars of her cage to use the nails of her thumb and middle finger on the 40 or 50 hang nails which I had at the time. But humans display such consciousness every day, not as occasional, one-to-one exchanges. And human consciousness is recursive. We are not only aware of the consciousness of others – as well as their likely or obvious needs – but of their consciousness of our own consciousness. And so on.

The steps that I propose in the evolution of speech and language were not events. But seen as an evolutionary accomplishment they define a path by which each step involves a complexly co-ordinated set of adjustments. These adjustments are separate, according to the requirements of going from one individual step to the next. There was a time before each step, and a time after. The adjustments define the differences between languages, known as ‘parameters’.

These steps are inferred from the evidence of what are now known as the ‘universals’ of modern speech and language. For instance, in every one of the six or seven thousand currently spoken languages there are

• Words used only to pick out items of interest in the real world, categorised as ‘nouns’.

• Words like die, generally categorised as ‘verbs’.

• Words like that, as in “I see that you are here” with functions in relation to the structure of phrases, expressions and sentences, sometimes known as ‘functors’.

It is possible to imagine languages without any such categories or with categories which do not fit into the schemas of natural language. The brackets, slashes, and underscores of various notations used in computing are obvious examples. But the particular categorisations of natural human language seem to stand on their own with no analogue either in nature or the world of human invention.

The hypothesis here concerns the number, form, and sequence of the steps that I am proposing. There are other hypotheses in the relevant literature. The hypothesis here is informed by the evidence of child-speech and the way this is sometimes affected by developmental disorders of various sorts, as well as the evidence of what is known as ‘typology’ from the study of the world’s languages. There is also evidence from a vast amount of research in paleo-anthropology, psychology, biology, genetics.


My interest here is in us – known as ‘anatomically modern humans’ or ‘homo sapiens’ (‘wise man’ – a term which some may find offensive for obvious reasons) or just ‘modern humans’ and the fact that we can talk. Because there is no way in which a thought could be directly expressed by a sound, it is reasonable to assume some interfaces between these things. There are some useful distinctions to be made:

Language – The specific linguistic endowment of the speech and language faculty and its acquisition;

Articulation – What can be articulated by the vocal tract or discriminated perceptually, now often known as the ‘Articulatory – Perceptual (A/P) Interface’ by which speech is pronounced and understood;

Understanding – The assigning of meaning to a structure taking account of the words, the internal composition, and the interacting scopes of the operators within the structure;

Appropriateness – The sense of what it is appropriate to say or talk about in some situation and when to remain silent;

Biology – The neuro-anatomy and neuro-physiology of the brain with ten billion or so neurones each connecting with five thousand or so other neurones;

Mind – The human mind, human reason, and the recursion of human consciousness;

Care – The duty of care and the survival of the elderly and disabled long past the point at which they could not survive independently;

Invention – A wide range of human-specific skills from fire-making, to building, to music, to visual representation.

All of these things have evolved as species-specific characters. They all emerge in the process of child development. But it is neither simple nor obvious how to stack them up either in development or in the plotting of human evolution. My focus here is on speech and language.

There is now abundant evidence that Charles Darwin was correct on the point that humans evolved in Africa. It is suggested by David Reich (2018) that some of this evolution may have been in Eurasia after an initial diaspora from Africa before a later diaspora from Africa. But this only marginally qualifies the claim that most of the evolution happened in Africa. There is recent evidence from Søren Besenbacher and colleagues (2019) that the divergence between the last common ancestors of modern humans and chimpanzees was about 10 million years ago, somewhat longer ago than by previous estimates on account of a newly discovered slowing in the rate of human mutations. But Reich is more confident that this divergence was by a long goodbye, with a lot of occasional pairings, rather than anything more cut and dried.

Over this time humans have evolved larger brains, flatter, more vertical faces, a highly-domed forehead without craggy eyebrows, a smooth top of the skull, a pointed chin, and crucially a relatively long distance between the larynx and the lips, smaller teeth, longer legs and feet and differently articulated hands and arms. The changes in the lower limbs made it easier, first to walk long distances, and then to run. The changes in the upper limb made it easier to grasp and throw – and less easy to swing and climb. It would seem that it was hunting capacity that was selected for, rather than gathering.

By the final step of this process, perhaps between around 300,000 and 200,000 years ago, our ancestors could contrast the EE and AH vowels, universal across modern languages. This last step, giving us a long vocal tract, may have taken up to 100,000 years to fixate across the species. But crucially, by my proposal here, by the last step, all modern humans, all descendants from that lineage, now share an equal access to language. So modern language cannot have emerged any later than this. This gives an overall time frame for the development of modern language ending at least 200,000 years ago. By this point our last common ancestors were not the whooping, bone crunching savages from the opening scene of the movie, 2001, but indistinguishable from other modern humans alive today, fully capable of becoming musicians, go-players, programmers, cosmologists, or even linguists. Crucially, all are expected to become competent native speakers of a modern language.

But the direct evidence about these things is from the partial remains of skeletons from that time which happen to have survived and the tell-tale traces of DNA. The only scientific evidence available right now is indirect – from modern language and how children learn it.

I hypothesise that the first step towards modern language could not plausibly have got off the ground before the first object of value and worthy of a name – giving prestige and pride to the possessor – perhaps two million years ago. This might have been a club from the burnt trunk and rootball of a small tree, a highly lethal weapon in the hands of a skilled, brave thrower. But since wood does not fossilise, archeology has no record of this.


There have been several human diasporas from Africa. The last major diaspora from Africa is commonly thought to have taken place between 65,000 and 80,000 years ago. This may have been preceded by smaller scale diasporas also by modern humans, some possibly unsuccessful. There were certainly earlier human diasporas, not by modern humans. The last diasporans eventually met and sometimes mated with humans whose ancestors had left Africa several hundred thousand years earlier. Some of these, known as ‘Neanderthals’ from where their remains were first found, were thick set with powerful biting muscles. Others from a still earlier diaspora are known as ‘Denisovans’, from the name of a cave in Siberia. There may have been other such mixings. Some of the off-spring from these unions then mated with the more modern humans from the last diaspora. So those descended from that diaspora have some pre-modern inheritance. Neanderthal inheritance is mainly in Europe. Denisovan inheritance is mainly in Asia and Australia. But Neanderthals and Denisovans were different species from modern humans. One of the differences is with respect to a gene called FOXP2, which seems to be involved in the delicate co-ordinations of speech. If the off-spring from these pairings had a FOXP2 deficiency, that may have impacted on their speech.

From where Neanderthal and Denisovan remains have been found, it may have been that they tried to avoid meeting modern humans, just as some Amazonian tribes today prefer not to have contact with the modern world. But by my proposal here, by the time modern humans first encountered the descendants of older human lineages, the evolution of modern speech and language had been completed.

The focus here

My initial focus is on the point where the evidence is strongest and clearest – the last stages of speech and language acquisition in children today and in something which connects both speech and language, the fact that the overwhelming majority of human beings reach a stage which Noam Chomsky used to call ‘linguistic competence’ with no special help or guidance within an approximately ten-year window before the age of puberty. Some developments happen overnight. This is obviously by an order of magnitude faster than evolution.

One of these stages, in children from five to ten, was studied by the late Carol Chomsky, Noam Chomsky’s wife, in 1969. She looked at how children learnt the special grammar of words like promise, ask and tell, what are now known as ‘control verbs’.

In my own research for my 2002 PhD, I looked at how children learnt to say words like calculator and monopoly, both characteristically mispronounced in particular ways. For most children the errors naturally resolve themselves by around eight.

There are three obvious questions which this raises:

• Is there a commonality between what is happening in respect of this very narrowly circumscribed area of grammar and how children learn to pronounce some of the most hard-to-say words of the language? (Almost all languages have something like control verbs and something like tongue-twisters.)

• If there is a commonality here, does this define what children don’t know about their language at five and know by ten?

• Does this knowledge have some global effect? Does it, for instance, make the very complex apparatus of speech and language reliably learnable by the overwhelming majority in a very limited period of time?


By a notion often referred to as ‘behavioral modernity’ there is decisive evidence from achievements in flint-tool making, jewelry, cave painting, sculpture, flute making, mostly from within the last 100,000 years, that there must have been a key evolutionary step in that time, and that if the last step of language evolution had been any earlier, there would be evidence in corresponding cultural and scientific achievements. But by my proposal here, this reasoning is wrong. I propose that the last major evolutionary step was at least 200,000 years ago, not 100,000 (See Sally McBrearty & Alison S. Brooks, 2000).

Doubtless, modern language was the precondition for behavioral modernity. But the expectation of rapid progress from language to modernity underestimates the difficulty of first discovery. The paintings at Lascaux were executed with a skill which requires long training over generations by a continuous, unbroken infrastructure. The corresponding skills are easily lost. Skills that were alive in living memory have now been lost. Skills are fragile. There is an economic cost to maintaining skills, even in the modern world.

I contend that the emergence of modern language cannot be estimated from the first evidence of the modernity to which it gave rise. The absence of evidence is not evidence of absence.

Contradicting the assumption of recent linguistic modernity, human biological diversity is greater across Africa than across the rest of the world. As shown by David Reich (2018), if the last linguistic speciation was significantly less than 200,000 years ago we would expect to find African populations not benefitting from it. But no such population has been found.

An ancient bias

Wherever there is any sort of general question about a child’s development, distinctions need to be made between all the relevant factors.

There is an ancient bias towards the physical. In many languages, there is one word; a language is a tongue. As little as 150 years ago some surgeons convinced themselves and at least some of their patients that a stammer could be cured by cutting out a section of the tongue and sewing together the remaining parts. And even now there are speech and language pathologists who propound that the only issue in many speech disorders is one of control and co-ordination. The neuro-anatomy, the mind and meaning, the linguistics, and evolution are all treated as relatively trivial. Attention is focused almost exclusively on what are called the ‘planning’ and ‘execution’ of the motoric gestures of speech, mainly within the most peripheral part of the vocal tract, the tongue and the lips. From this perspective, it is assumed by Apraxia Kids, Healthline, and many others that the A/P interface is singularly critical, that for every articulatory action there has to be a plan of some sort. This is taken to an extreme by proponents of ‘Oral Motor Therapy’. Caroline Bowen (2005) and Greg Lof (2006) show there is little or no evidence to justify this. Nunes (2019) points out that there is no reason for assuming that common difficulties are reducible to muscular co-ordination or any aspect of what is often referred to as ‘praxis’, as by the common diagnosis of ‘apraxia’ or ‘dyspraxia’. There may be no such thing as ‘planning’ in the sense assumed by proponents of ‘motor planning’.

But holding a question to this ancient bias, it is reasonable to ask whether all the related skills evolved together, separately, or one after the other. Such connections are known to occur in biology. During what may have been a critical period between 300 and 200,000 years ago, the shape of the face was changing, with the chin becoming more pointed. Many co-relations are possible in principle. I shall propose one here, bearing on stammering.

Chomsky has long insisted on the autonomy of the language faculty. It has properties which are demonstrably not reducible to any other item on this list, like the building of language on abstract elements. One was the notion of ‘Tense’ in the final D or ED of did, had, should, and played, moving to a particular position in the structure, as in “Did you have a nice time?”. On this conception the same element was just as much present in “I love you” or “I love football” even though there is no overt marking for Tense, in this case present tense, rather than past.

Another such element, only to be formulated as this ‘generative’ conception of language developed, was the notion of ‘empty categories’ like the ‘trace’ or ‘copy’ of what was left behind by a word like what in “What did they see?” or “What do you think it is likely they saw?” where what is interpreted on the right of the structure, as by an answer like “They saw a saviour”. By Chomsky’s 1957 analysis, what moves from one position to the other.

In about half the languages of the world, words like he, she and I are not pronounced, other than for special emphasis. But in languages where this applies most strongly, like most varieties of Greek and Italian, in the equivalent of “He is making himself a cup of tea”, there is nothing equivalent to the word he, but himself agrees with the fact that he is singular and male. And in English “I want to make a pot of tea” we know that I will be doing the making by another such empty category.

In most varieties of English, the -G in young and long is pronounced only in younger, youngest, longer and longest. In these varieties of the language it remains unpronounced in the more obvious root cases. It is no less real for that.

But by the proposal here, the notion of linguistic autonomy does not force us to disregard all the evidence with regard to biology, mind, care, and invention. Speech and language evolved in a society which was demonstrably evolving in all of the ways from A) to H) at the same time. It seems to me reasonable to assume that there were some very significant connections. But I shall propose that the process of making these connections involved a mathematical decomposition, ripping out irrelevancies and leaving behind only a set-theoretic essence.

The Minimalist Program

By what has come to be known as the ‘Minimalist Program’, the simplest logically possible starting point is the pairing of two ‘formatives’, extracted from the lexicon. From this, using the same device, a structure is assembled, and then ‘sent’ to two interfaces, for pronunciation as speech and for understanding as meaning, by what is known as ‘Spell out’. This ‘architecture’ as it is known, leaves the question open as to whether Spell out is a single simultaneous event, or whether it is broken down into more than one part.
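The pairing device described above can be sketched in a few lines of Python. This is only an illustrative toy under stated assumptions, not the formal definition used in the Minimalist literature: the function names merge and depth are hypothetical, a formative is represented as a plain string, and the words chosen are just examples. The sketch leaves out labels, Spell out, and both interfaces.

```python
def merge(a, b):
    """Pair two syntactic objects into the unordered set {a, b}.

    An object is either a formative (a string from a toy lexicon, an
    assumption made for illustration) or the result of an earlier merge.
    """
    return frozenset({a, b})


def depth(obj):
    """Count levels of embedding: 0 for a bare formative."""
    if isinstance(obj, str):
        return 0
    return 1 + max(depth(part) for part in obj)


# The same device, applied twice, assembles a small structure.
inner = merge("see", "what")      # {see, what}
clause = merge("they", inner)     # {they, {see, what}}
```

Because the result is a set, the pairing itself carries no left-to-right order; in this toy, any ordering would have to be imposed later, at pronunciation.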

Some find this quite implausible, urging what might seem a more plausible, common-sense conception: first there is an idea, what somebody wants to say. Then there is the process of assembling it into words and sounds. Finally it is transmitted as a spoken utterance. But it is difficult to define ideas in such a way that the process of assembling them into sentences is both internally consistent on the one hand and not implausibly complex on the other.

Chomsky characterises the Minimalist architecture as a ‘perfect system’, something almost unique in biology. While the proposal here is based on a minimalist architecture, it abandons the notion of a perfect system, seemingly invoked just to defend the notion of autonomy – a line of argument which seems to me both wrong and unnecessary.


The evidence for the sequence that I am postulating is from the fully competent speech and language of adults, history and typology, and the less competent speech and language of children, some from the experiments carried out by Carol Chomsky (1969), some from children in Kingston and Wimbledon, some from my own children (See Nunes, 2002), some from children in Edinburgh (See Nunes, 1976), some from other studies.

A faculty


By a now commonly held belief (though one not held by all authorities), speech and language are what they are thanks to a faculty which is universal across the human species. The first and still leading exponent of this belief was Chomsky (The idea is implicit in his PhD thesis from 1955, only published in 1975, though it becomes increasingly explicit as his work develops). By this idea, there is an object of scientific study in speech and language as a faculty, with all natural languages equally learnable by children without any sort of active, adult intervention. The faculty is independent of actual use – or what is sometimes called ‘linguistic behaviour’. There are various sorts of evidence for the universality.

One sort of evidence is from ambiguities, seemingly found in all languages, as in “The rabbit is ready to eat” where the logical subject of eat varies between the rabbit and some understood other, potentially us, or in fractionally different structures like “Who would you like to play with?” and “Who would you like to play with you?” where the presence of you at the end of the structure dictates a fundamental change in the understanding of play. In both sorts of case, the meaning is determined by the structure in ways to be explicated below.

One is from the language of those who for physical reasons have no way of speaking, as shown by the life, works and career of Christy Brown, the author of My Left Foot. There is more evidence from languages with a known origin. Most languages have no known origin. They just have histories. But some languages have a known and identifiable origin, like those which came into existence when slaves who were deliberately isolated from other speakers of the same language looked for and found ways of talking to others in the same situation as themselves. These are generally known as ‘pidgins’. They are a great triumph of the human spirit, in my view, and of resistance against the great historical crime of slavery (See Mufwene, 2008). The slaves’ children then in one generation turned the pidgin of their parents into a language, generally known as a ‘creole’ or ‘krio’. Interestingly and revealingly, creoles tend to have properties in common with one another, rather than with the language of the owners, traders and slave masters, known as the ‘lexifier’. There is another special case in Nicaraguan Sign Language. In 1979, the Sandinista government introduced education for the deaf on the basis of lip-reading and spoken Spanish. What happened was that, against the planning and all expectations, within six years a completely new sign language had developed in the playground and on the school bus (See Ann Senghas and Marie Coppola, 2001, and Ana Mineiro and colleagues, 2021). In different ways and on different time scales, creoles and Nicaraguan Sign Language are all living laboratories of language evolution.

If the thesis of Chomskyan universality was wrong, there would be no plays on words or agreements with something like legal force. Without this commonality of understanding there would be a monstrous discrimination in the assumption that the law is known to all, just as there is a criminal contradiction in the notion of justice where anyone believed to be limited in his or her understanding is punishable by pain or even death. But there are plays on words and expectations of contracts being met in every known society and culture. So the thesis of universality is supported by abundant and obvious evidence. It is not a claim or speculation.

But as noted above, as is obvious, there are exceptions.


By the proposal here, against Chomskyan doctrine, evolution grabs cognitions from the world of skilled work. Like some aspects of consciousness, some core cognitions appear to be not human-specific. But those we share with non-humans tend to be exercised more adeptly by humans. The use and making of tools has been observed in an increasing variety of animals in the wild. The termite-fishing sticks made by chimpanzees and the drum-beating sticks made by Australian parrots do not compare with human stone tools.

It seems fanciful to conjecture which cognition contributed to which step in relation to speech and language or when this may have happened. But we can be sure of the cognitions. These include:

• Aiming a throw at a target, where the throw is judged by its accuracy;

• Route-mapping;

• Drilling a hard twig into a larger piece of softer wood, eventually getting a red glow hot enough to light small shavings, necessarily distinguishing between different sorts of wood;

• Sharpening or pointing a stone-tool for a particular purpose;

• Shaping the grip of a stone tool to fit the hand;

• Glueing or binding a tip to a spear;

• Planning an activity as a series of steps;

• Enhancing the flavour of meat or leaves with salt, thinking of different sorts and sources of food supply and methods of cookery;

• Building a home in more than one part, where one part is the roof, a floor, or a stairway;

• Preparing for some life-threateningly dangerous procedure, only to be undertaken in a quite extreme situation;

• Responding to life-threatening danger with extreme strength and/or speed of reaction;

• Treating injury, pain, sickness or the loss of teeth.

Unlike a new word for some new discovery or invention or perspective on life, each of these cultural steps was potentially global in its effects. Here I am deliberately excluding steps towards modern culture such as wall painting, body-paint, jewelry, or music, all first evidenced within the last 130,000 years. I am including only those skills most critical in the preservation of life at some much earlier point.

As an example of one cognition, in the British Museum there is a flint tool from around 500,000 years ago with a scraping edge and a sharp point. This was after the time when, by most estimates, Neanderthals had diverged from modern human ancestors, and at least 200,000 years before modern humans had evolved. This tool fits comfortably into the right hand in two orientations, each at roughly 90 degrees to the other. One way it could be a knife to scrape meat off a bone; the other way it could be a chisel to split a bone in half to expose the tasty and nutritious marrow by a heavy, well-aimed blow. The maker must have had a clear idea of the impress of a tightly-clenched right hand, and how to make the opposite fit for scraping. No such tool is made today. But for a modern carver with modern tools, to reproduce this shape would be quite challenging.

Another example is the pitching of a roof to keep out the sun or the rain. At the most basic level, poles or rafters have to meet at a ridge, each at an angle to the wall on which it sits. The whole structure then has to be covered. If this is with tiles or leaves, the covering starts with the largest at the bottom and finishes at the ridge with the smallest. The first computation involves trigonometry. Then the hypotenuse has to be divided by the sizes of the covering. The same cognition is involved in the building of a stairwell. For each flight all the treads and risers have to be in the same relations to one another, equally distanced between floors and landings. The end result has to be kept in mind from the beginning. This is quite different from the building of a nest by adding one twig or branch at a time, with no thought of any relation between one set of twigs or branches and another.
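The arithmetic behind the roof and the stairwell can be made concrete with a small Python sketch. It is a deliberate simplification under stated assumptions: the function names and the sample measurements are hypothetical, the roof is treated as a symmetrical pitch over a known half-span, and every course of covering is given the same exposed length, whereas the coverings described above grow smaller towards the ridge.

```python
import math


def rafter_length(half_span, pitch_degrees):
    """Length of the rafter (the hypotenuse) from wall to ridge."""
    return half_span / math.cos(math.radians(pitch_degrees))


def courses_needed(rafter, exposure):
    """Number of rows of covering, assuming equal exposure per row."""
    return math.ceil(rafter / exposure)


def riser_height(total_rise, steps):
    """Equal riser height for a flight climbing total_rise in `steps` steps."""
    return total_rise / steps


# Illustrative figures only: a 3 m half-span pitched at 60 degrees.
rafter = rafter_length(3.0, 60)
rows = courses_needed(rafter, 0.5)
```

In each case the whole has to be fixed first (the span and pitch, or the total rise) before any single part can be sized, which is the point of contrast with adding one twig at a time.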

A quite special cognition is required when it is necessary to weigh up the risks and select the least risky. If this involves carrying a baby across a river, the advance plan probably involves having one hand for the baby and another to hold another adult wrist to wrist and considering how to respond if either adult starts to lose his or her footing. This can be done with or without words, but better with words, which make it more likely that life will be saved, not lost. In the modern world, astronauts are selected for what seems to be the likelihood that they will be able to step up.

All of these cognitions can be represented mathematically, as the last will have to be if humans are ever to land on Mars, with the landing necessarily controlled by an intelligent, automatic system, learning as it goes along, because time does not allow anything else.

The cognitions involved non-trivial, life-preserving, or at least life-enhancing, discoveries. Important aspects of early human culture and economy involved distinguishing different sorts of function in the mind.

Such decompositions and reassemblies are quite different from usages like those described by Guy Deutscher (2005). By what are known as ‘usage based’ theories of language, it is possible to explain properties of grammar by habit, convenience, clarity, communicative need and so on. Such theories offer no insight into the subtle asymmetries and overlaps between the ways words are formed and meanings are expressed or of the unpronounced elements which Jason Merchant (2016) nicely compares to ‘black holes’, detectable only by their effects on what is pronounced. By the proposal here, usage had a role. But it was distilled over time. And the reassembly into language was such as to allow various expressions from one language to another, generally known as ‘parameters’, such as those defining the position of the verb in the sentence.

By my proposal, these cognitive recruitments were one by one, each over a time-scale of tens or hundreds of thousands of years, each unique and different from all its predecessors. The sequence is either by conceptual necessity or by the logical scope of the formalisation. Necessarily, each cognition was small, but advantageous. If it wasn’t small, it could not have been expressed biologically. If it wasn’t advantageous, it wouldn’t have diffused.

How many?

Ray Jackendoff (2003) suggests that there have been 15 steps in the evolution of language. Steven Pinker (2010) suggests that there have been many more, without counting them exactly. Others, like Berwick and Chomsky (2016), propose that there has just been one step, necessarily a macro-mutation, hinting at a ‘look-ahead’ functionality, contradicting the essence of Darwin’s thinking. Without committing myself to any particular phylogeny, interesting as that may be, by my proposal here, there have probably been between five and ten steps, each well-defined, but abstract, and none a macro-mutation. I follow Darwin and most biologists in assuming that evolution just takes small, random variations, and favours those which lead to some advantage, leveraging the effect of each advantageous variation. Even small advantages are significant over time – perhaps by ten thousand or more generations. But in the case of speech and language, the changes by those steps must be visible within a community of speakers. Or they would not be copied and diffuse. Because these biological steps were accessible to humans in terms of their effects and potential, the leveraging was enormous. The biological untypicality of the evolution, on which Berwick and Chomsky rightly dwell, is thanks to the leveraging, rather than the power or scope of the evolutionary steps themselves, all quite humble, I submit, except possibly the last. At this point, about 200,000 years ago, a last common ancestor is identifiable.

No matter how many steps there were, this exceedingly complex process was completed before anatomically modern Homo sapiens first migrated out of Africa. There is controversy about exactly when this was. It is possible that the early emigrants were involved in this fixation. But the possibility of this stepwise process having happened more than once is so small that it is hardly worth thinking about. If any of the necessary turning points postulated here had been missed by some ancestral population, there would be living populations without some corresponding property in their language.



By a traditional conception of language, as taught in school in the 1950s, elements were organised in line, with nouns as subjects or objects separated by verbs. But this conception afforded no insight into any historic relations between languages or of what children actually say. One of my children was heard to say to his friend, “When my animals saw my dad’s bum they laughed so much they couldn’t stand up”. To treat this as a sequence of nouns and verbs misses the point that it could not plausibly have been heard by the speaker. The structure was generated by a system of some sophistication.

In 1957 Chomsky proposed a new conception of language, as a potential which was realised by the decisions of children or adults, rather than as observable behaviours. This turned his 1955 dissertation into a lecture course. Take “Don’t you like being tickled?” On Chomsky’s analysis this was by a series of operations on a basic string or ‘kernel’ including the ‘verb phrase’, tickle you, on one level and you like on a higher level. But the relations between tickle you and you like cannot in principle be represented as points on a line. They can only be represented on two dimensions, with one dimension defining something other than linear sequence. This is usually characterised as ‘dominance’ in a special sense not reflecting any relations in the real world. This dominance is now commonly represented as height on a diagonal ‘spine’, with linear sequence represented from left to right.

In order to raise such questions for investigation and analysis, Chomsky proposed a new model of representation. In the overwhelming majority of languages, there is a default word order by which subjects precede objects, as by “John ate breakfast”, as opposed to “Breakfast ate John”. Here English ate breakfast is represented as a ‘verb phrase’ with the verb ate as its head, and John is on a higher level of structure, dominating the verb phrase. Like the relations between tickle you and you like, the relations between ate and breakfast and between John and ate breakfast cannot in principle be represented as points on a line, but only on two dimensions, with dominance represented as height and linear sequence from left to right.

Genetic epistemology

There are various mechanisms by which these steps may have occurred. By my proposal, these operations developed out of reassemblies of separate cognitions, all by small accidental changes in style originating in one individual or small group, none originally involving a direct change in the DNA, by what has become known as the ‘Baldwin Effect’ from the work of James Mark Baldwin (1896 & 1897). As members of a population try to pass on to the next generation some newly-learnt, useful skill, this contributes to breeding success. Very slowly this complements the genome.

Tinkering, not mutating

The cognitions may have evolved independently. Or the technique may have been bootstrapped by language. But a utilisation in physical technique cannot plausibly have begun with the linguistic incorporation of a corresponding cognition. These underlying cognitions did not necessarily involve any sort of formal organisation such as a group or set. In order to be used as a device for speech and language, each cognition had to be decomposed into its raw essence, perhaps couched in mathematical logic. And in order to integrate this into modern speech and language and the natural process of speech and language acquisition by successive generations of children, this decomposition had to operate on a different timescale. In the case of acquisition some of the ordering is reversed. The time-scale is quite different from what conscious introspection may allow.

There is no reason to suppose that these decompositions or their reassemblies and incorporations into language were instantaneous. These processes must almost necessarily have been slow and complex – over thousands of human generations.

The most important evidence for these steps is in the entire structure of modern speech and language. As with biology, the evidence of evolution is in the form of every living entity. What we know as fossils are the remains of chance accidents when some plant or animal died in some very specific circumstances, allowing some part or aspect of its remains to become converted over millions of years into stone and thus preserved. But only over a very long period and never the whole plant or animal.

By my proposal here, this entire structure of speech and language displays the products of evolution – from what we know as nouns and verbs to the ways we ask questions to what we represent by letters of the alphabet and punctuation marks. Like the anatomy of modern plants and animals, language contains a record of its origins within itself. In my view, if we look down carefully enough, the evidence is everywhere. The question is how to tease this out, and trace the various separate histories.

In numerous cultures, ‘Shhh’ is a call for silence. It may be primordial.

Evidence pointing two ways

A puzzle: order in disorder

One of the many aspects of speech that linguistic theory has not yet elucidated is the non-random asymmetries in children’s errors. Errors should be random, not organised into any sort of pattern. Order in disorder is a nonsense, but it is observable for as long as speech and language are developing. In most children this goes on until around the age of eight, plus or minus a year or so. It goes on correspondingly later in children whose speech or language is delayed or disordered. And there are well-evidenced ‘co-morbidities’, or significant overlaps between the children with speech and language disorders and those with a poorly developed sense of what is called ‘metalinguistics’, or the awareness that there is a commonality between the real word hippopotamus and the nonsense word HETTAPUTAMUS.

The anomaly of the asymmetry and the patterning of the co-morbidities both demand an explanation. Is there a significant commonality between these things?

Just as developmental incompetence is asymmetric, there are words which many adults who think of themselves as ‘competent’ speakers find hard to say, words like anomaly, obliterate, monogamy. In 1994 the linguist, Clare Galloway, called the resulting mispronunciations ‘cloth ear errors’ – mild to the point that they are not generally counted as aspects of disorder.

If all the incompetences were in one direction or polarity this would point to some external factor. Some incompetences may indeed result from such a factor. In children’s speech, the P in pea tends towards B as in bee, and the B in cub tends towards P as in cup. But not the other way round. As John Locke (1983) has pointed out, this particular asymmetry follows from acoustics. The ‘best’ sort of syllable has the sonority concentrated before the vowel, and dying away after it, by the difference between voiced and voiceless consonants, as explained in Sounds and bits of sounds.

But not all of the asymmetry can be explained in this kind of way. As pointed out by Alan Cruttenden, in early child speech the D in doggy is often replaced by G, leading to a common realisation as GOGI. The tip of the tongue articulation in the D at the beginning of the stressed syllable gets matched to the back of the tongue articulation of the G at the beginning of the less salient unstressed syllable. This matching is commonly described as ‘assimilation’. Assimilation the opposite way round, as DODI, in favour of the tongue tip articulator, is almost unattested. But at the same approximate age or stage in speech development, by one of the commonest errors in child speech, the K in key is replaced in speech production by the T in tea. This is known by speech pathologists as ‘fronting’ because T is articulated further forward in the mouth than K. Fronting is perhaps a hundred times as common as ‘backing’ with the replacement going the other way round, with K replacing T. Then a few years later, again in both normal and disordered development, in four words, three of which are not generally expected from small children, we suddenly find what seem like assimilations with tongue tip articulations favoured.

calculator as KALTALATOR

cardigan as KARDIDAN

hippopotamus as HITOPOTAMUS

archeopteryx as ARTIOPTERIKS

But seemingly no child says KALCALAKOR or anything like it in any other word. Interestingly, the contrasts between articulators are adjusted only where this is the only contrast, where there is no contrast between the stress of the affected syllables, and where there is another instance of both of the critical ‘features’.

The word is almost ready to say when the tongue-tip articulator replaces another articulator – either the lips or the back of the tongue. 

In other words, it seems that the process of speech development first biases the inventory of speech sounds one way and the pattern of assimilation the opposite way and then when speech has been almost completely mastered, reverses the polarity of assimilation towards the bias by the early inventory.  

By the proposal here, at least some of the asymmetries in the error distribution are best explained by the last two of the evolutionary steps which I am postulating here. Both are very powerful devices. The last but one gives a great freedom. The last restricts that freedom. But the combination is not easy to learn, though they are easy to misapply. And this leads to predictable error patterns during what Eric Lenneberg in 1967 called the ‘critical period’ – from birth to puberty.

Pinch points

All languages have things which are hard for children to learn – pinch points. As shown by the late Carol Chomsky (1969), Annette Karmiloff-Smith (1979), and Aubrey Nunes (2002), fully adult skills are not mastered until some point between 8 and 10. By Chomsky’s study, interesting changes are starting to become evident soon after the fifth birthdays of two normally developing siblings, both closely observed by both parents. It seems that this age represents a critical turning point. There is a conscious awareness of grammar. Toy animals are imagined speaking in a babyish way: “I say: Me live in countryside. Me underground. Me run faster than fox.” Allowing for the possibility that some errors are insignificant, there still seem to be some unresolved issues in real errors. While these are not likely to lead to any confusion, they represent a system being used to its maximum limits.

• Controlling meanings from one clause into a dominated clause

“It needs to undo the knot” (The subject of undo needs to be pronounced, as in “It needs someone to undo the knot”)

“I’m going to tell my animals what they want for their breakfast. What do you want elephant? I want burnt toast.” Here tell is used instead of ask. The meanings of the two words are confused.

• That used next to the trace of a constituent that has been moved, in a way not grammatical in English

“Whatever he sees that I want, he copies” 

“I want whatever book that Frank has… I want the book that Frank has” 

Both of the original sentences would be correct without that

• Predicate forms wrongly used to extend a noun-phrase.

“It’s on a too high shelf” 

“It’s my first balsa wood model I’ve ever done”

“That looks like a cracked open radio to a mouse.”

“My starfish has got a chopped off leg.”

• Conjoined forms

“Are you Joe and me’s mummy?”

A conjoined form wrongly treated as a noun phrase

“It was longer than the kitchen onto the living room together”

A prepositional phrase wrongly replacing a conjoined form

• Negation

“No fat people are not allowed in here”

Negation wrongly doubled.

“There were two of the same sort and one of not the same sort”

Negative marker should precede the whole phrase

• Movement of categories headed by a Wh element – known as ‘pied piping’

“Which model do you think I made with Joe is the best?”

Complete failure of pied piping – meaning: “Which model that I made with Joe do you think is the best?”

“Look how big stone we’ve found.” missing an overtly pronounced determiner as in “what a big stone”

“I’ll tell you who’s going to be me of my plastic pets: Panda.”

Pied piping failure and wrong Wh element, meaning “I’ll tell you which of my plastic pets is going to be me: Panda.”

• Clause structure

“So when I grow up I know a way so it doesn’t hurt my ears” 

Replacing a whole clause by so, and wrongly doubling the so, meaning something like “When I grow up I know a way to make sure that it doesn’t hurt my ears” 

• Relativisation

“I can’t cut out lots of shapes of ones that are together”

Wrongly using a relative introduced by that when a conditional is meant, as by “I can’t cut out shapes when there are lots of them together”

• Reversible relations

“I wish Joe could be my boss. I want him to do what I say.” 

Where Joe was then the older brother, meaning “I wish I could be Joe’s boss. I want him to do what I say.” 

“I dressed my badger in the clothes that I don’t fit”

Meaning: “I dressed my badger in the clothes that don’t fit me”

• Tense as expressed by an auxiliary form (in traditional grammar)

“I showing it as if you were holding it”

Showing a picture at a conjectured orientation, missing the necessary tensed form, in this case am.

• Limit on what elements can be combined in a single structure

“Made up dinosaurs can have anything the children who made it…. Made up dinosaurs can have anything the children want it to have”

Using a device common in early language where structures are built, first with one combination missing out one element, then with a different combination, realising the missing element, but missing out something present the first time round. In this case the variable elements are the relative clause “who made it” and the control verb want and the controlled phrase “want it to have”.

At this point, the two children’s systems allowed abstraction, explanation and nuance. But all the operations were carried out one by one, as needs occur. The operations could not seemingly be multiplied against one another as in the case of a relative clause and a control verb.

At five and a half, six months later, a subtle change is occurring. One of the speakers is starting to use unpronounced structures as in

“I couldn’t eat that sausage. Neither could you”

Here the whole verb phrase “eat that sausage” is unpronounced. For this speaker this seems to be the first such case.

This system is a major component of English ‘grammar’ or ‘syntax’ – assembling words and parts of words into structures with well-defined meanings. Another pinch point is the English syllable.

Every syllable in English (and most languages) has what is known as a ‘nucleus’, like the AY in May day, and one or more consonants before it, like STR, or SPR or SCR, as in stray, spry, screw. There are even greater complexities at the other end of the syllable in what is known as the coda, getting complicated in length and strength with the G sounding as a K in my English, with the TH showing that these forms are being used as nouns. And as tabulated in the inventory here, there are at least 18 vowels, and on one possible count, 28, in comparison to the five vowels in many languages or just three in some. There may be around 5,000 possible syllables in English.

The problem is that there is no certainty of the child learning English hearing even one example of every possible syllable or combination of auxiliaries or of any of the other, various pinch points during the whole of his or her childhood.

A logical problem

The lack of any certainty about the child learner’s experience leads to what is known as ‘the logical problem of language acquisition’.

No child knows what language he or she is learning or when he or she has heard everything he or she needs to hear in order to have learnt the language. Taking English as just one of several thousand odd languages in the world, the learner has no guidance, no ‘privileged information’ as it is called by learnability theorists, on where his or her target language lies with respect to all the possible variations, like how many syllables there might be. Might there be one more that he or she has yet to hear?

This is in what Marlys Macken, a specialist in child speech, in 1995 nicely called the ‘learnability space’ – the space in which all the various pinch points have to be traversed.

Against this background, it is relevant that there is a much discussed language, Tashlhiyt, spoken in North Africa, sometimes called ‘Berber’ to the great dislike of the speakers. Tashlhiyt allows any consonant to constitute the nucleus of a syllable. It thus allows sentences consisting entirely of consonants like P, T, K, and S, and not a single vowel. Like Arabic, Tashlhiyt has three vowels. It has been the subject of various experiments. Now in English a syllable can end in CT as in pict, duct, tact, but only after short vowels. In “I walked there yesterday” and “He talked the talk and got the job” walk and talk have a tense or long vowel. So the T sound of the ED can’t be part of the word, but might, from the learner’s perspective, be a separate self-standing element, as, in a sense, it is. But Tashlhiyt goes a number of steps further in allowing words and even complete sentences without a single vowel. It may be world-unique in this respect. This is one sort of uniqueness.

Robert Dixon reports a game in the Australian language, Arrernte, with an initial consonant and nucleus in one part of the syllable and a final consonant in the other part. With its syllable structure this way round, Arrernte may also be world-unique.

Arrernte and Tashlhiyt may represent the limit cases of markedness in the sound system, what is known as the ‘phonology’.

At the opposite extreme of familiarity, there are English auxiliaries. Take the sentences “He takes sugar” and in the past ‘tense’ as this is known, “He took sugar.” By what was once known as ‘Do support’ the category Tense is moved to the left in negatives and questions and realised in a form of the verb do in “Does he take sugar?” or “He didn’t take sugar”. No other widely studied language has anything quite like Do support (although there may be something like it in one Italian dialect). If Do support was only known from the last elderly inhabitants of a small island, reported by one investigator, the more typical case might be regarded as a universal, and the seemingly dying language of the elderly islanders as a response to the knowledge that one of them will one day be the only speaker. But English is, for now, a world language with one major aspect of its grammar very highly ‘marked’ (in linguistic parlance).

The big deal


On the timescale of evolution, at a point of genetic divergence there are inheritors and non-inheritors, beneficiaries and non-beneficiaries. For a while the two populations may co-exist, but with one having a better chance of breeding success. Inheritors may have magnified the effect by killing non-inheritors – what is now known as ‘ethnic cleansing’. Neanderthals and Denisovans may have rightly feared this. Non-inheritors may find some way of hiding, avoiding confrontation, competition or compensating. But eventually the only survivors may be the inheritors. Something like this must have happened with speech and language. Or there would be modern human populations without some part of the linguistic apparatus. But, other than by the claims of Gill and Everett, no such populations have been found.

At least one of these steps must have involved a speciation, differentiating modern Homo sapiens from Neanderthals.

But while the entire linguistic inheritance has diffused across the whole of the human population, it is still developmentally vulnerable. It is not inherited completely by some individuals. Some children have speech defects. And these are heritable.

Universal nuance and precision

Since the 1969 work of Robert Allen and Beatrix Gardner on teaching chimpanzees 200 or so signs from American Sign Language, a number of other apes have been taught to use various signing systems to a similar level. Sue Savage-Rumbaugh (1996) believes that she has taught a bonobo, Kanzi, to a significantly higher level. But to me and many others, while Kanzi is doubtless an outstanding student, his understanding is qualitatively less than that of the normally developing human child of two and a half. Human language allows not just precision, but nuance. A four year old on his or her first day at school is expected to understand the difference between “I don’t believe you” and “Could you say that more politely?” or between “Who do you want to play with?” and “Who do you want to play with you?” But such differences would be quite beyond Kanzi.

For this, there are some quite abstract properties, universally available to all speakers of every language – what are often called ‘universals’. Such as:

• The distinction in all languages between ‘content words’ like hand, sleep, cold, and at, and ‘functors’ like that in “You believe that it’s true?” Functors add little or nothing to meaning on their own. They contribute to meaning by their relation to other elements in the structures of language. They often behave in recognisably distinct ways, losing or changing sounds or hopping over one another, making themselves quite obvious by their grossness;

• A principle known as ‘markedness’ by which linguistic phenomena divide unevenly, with biases and gaps – with most functors featuring the tongue tip articulator, as in T, D, S, N, and in words ending in NK or NG, as in, ping, bung, bang, and pong, with just those four vowels, not those in beng or boong – with the vowels in bet and foot.

Universals are not express forms in the grammar, like the fact that English has “I love you” for what languages like Italian, Spanish and Greek have as “Love you”. The genome just provides the underlying basis for the variation here. This is now known as a ‘Principle’ of ‘Universal Grammar’ – that all languages have what are known as ‘subjects’, but languages vary in whether they require all subjects to be pronounced, as in English, or often let them go unpronounced, as in Italian. This Principle is common to both sorts of language, whether or not the subject I or its equivalent is invariably pronounced.

The point at issue here is plainly not easy for all children. Some children learning English have difficulty on this point. Those who are not sufficiently aware of the indications of a variation one way or the other, those likely to be diagnosed as suffering from language delay or language disorder, often go on leaving out words like I long past the age at which other children have stopped doing this.

In 1967, before the notion of universal principles had been developed, the late Catherine Renfrew published her first version of the Renfrew Action Picture Test, which was highly sensitive to this point of variation across children.

Time scale

Each of the steps I am postulating here may have taken a thousand or tens of thousands of generations to diffuse across the population, or fixate. The timescale is unknown. But it must have been slower by an order of magnitude than the process by which older speakers notice what seem like errors in the speech of younger speakers, not realising that speech itself has just changed. Elderly speakers may struggle to understand younger speakers who have adopted some new grammaticalisation or lexicalisation like isn’t it, as in “That could look good on you, innit?”

Orthodoxy and novelty

By most theories, in the formation of single speech sounds or ‘phonemes’, such as English K in key, there is no significant ordering other than for the sake of intrinsic necessity. The phoneme K precedes the vowel in key, car, and cow. But K is problematic for approximately one English child in ten. It is defined by a set of gestures involving:

• A closure and opening of the airstream in the mouth – making it what is known as a ‘stop’;

• The briefness of the closure;

• The articulation by the back of the tongue against the soft palate or velum;

• An audible pause after the release of the closure;

• Relatively low acoustic sonority;

• The definition of this as a consonant by its position in the syllable – in languages where this is relevant, as it is in almost all languages.

By the proposal here, the sequence of steps with respect to K varies fractionally from language to language. And this has to be learnt.

In a corresponding way, many children miss the correct tongue posture for the airstream in S and say sea as TEE, by what is commonly known as ‘stopping’ because the airstream is incorrectly stopped.
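
The gestures listed above for K, and the error patterns known as fronting and stopping, can be pictured as changes to a single value in a bundle of features. The sketch below is illustrative only. The feature names, the eight-phoneme inventory and the two functions are assumptions made here for the purpose, not an established feature system and not the analysis proposed in this book.

```python
from typing import Optional

# Hypothetical, much simplified feature bundles: (articulator, manner, voiced).
PHONEMES = {
    "P": ("lips", "stop", False),
    "B": ("lips", "stop", True),
    "T": ("tip", "stop", False),
    "D": ("tip", "stop", True),
    "K": ("back", "stop", False),
    "G": ("back", "stop", True),
    "S": ("tip", "fricative", False),
    "Z": ("tip", "fricative", True),
}

# Reverse lookup from a feature bundle to its symbol.
BY_FEATURES = {features: symbol for symbol, features in PHONEMES.items()}


def fronting(symbol: str) -> Optional[str]:
    """Replace a back-of-tongue articulation by a tongue-tip one (K to T)."""
    articulator, manner, voiced = PHONEMES[symbol]
    if articulator == "back":
        articulator = "tip"
    return BY_FEATURES.get((articulator, manner, voiced))


def stopping(symbol: str) -> Optional[str]:
    """Replace a fricative by the corresponding stop (S to T)."""
    articulator, manner, voiced = PHONEMES[symbol]
    if manner == "fricative":
        manner = "stop"
    return BY_FEATURES.get((articulator, manner, voiced))


if __name__ == "__main__":
    print(fronting("K"), stopping("S"))
```

On this picture key said as TEE and sea said as TEE each differ from the target in exactly one value.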

Very uncommonly, a child of three and a half with seriously disordered speech said watch as BOP, glove as DUD, finger as DINDER, milk as GIK, with all three articulators, the lips, the tip of the tongue, and the back of the tongue, all seeming to assimilate to one another. There seemed to be effectively a template allowing only one articulator in the syllable or word. But there were six additional steps. The speech was incomprehensible to a most careful, insightful, attentive mother who struggled not to make the problem more apparent to the child than it already was. Such speech is not easily accountable.

By an even greater degree of incompetence, the speech is not readily recognised as speech.

Two perspectives

The study of the apparatus proposed here began from what seemed like two opposite perspectives. In 1955 John Langshaw Austin delivered the lectures later published as How to Do Things with Words, laying the basis for what is known as ‘pragmatics’ or the study of how language is used to reach particular objectives – doing things. In 1957, taking English as one arbitrarily selected language, Noam Chomsky proposed two components within the grammar, a ‘phrase structure grammar’, assembling ‘kernel’ sentences like “The man hit the ball”, and a ‘Transformational component’ going step-wise through a set of transformations defining:

• Negatives as in “The man did not hit the ball” by ‘Do support’ with did as a form of the word do appearing before not between the man and hit

• Questions, as in “Did the man hit the ball?” with Do support moving the word did to the beginning of the sentence.

• Negative questions, as in “Did the man not hit the ball?”

• Negative questions with a contraction as in “Didn’t the man hit the ball?” hopping not across the man, losing the vowel, and glueing it onto did.

• What are known as ‘passives’ as in “The ball was hit by the man”, moving the ball leftwards, changing the man into a phrase with by (and changing the form of the verb, other than in special cases such as hit);

• Questions beginning with a word like what as in “What did the man hit?” inviting a response like “The ball” or “The man hit the ball” with what moved to the beginning of the sentence linking to an element at the end – even as far away as in “What do you think she said the man hit?”
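
The step-wise character of the transformations listed above can be illustrated with a toy sketch. It is emphatically not Chomsky’s formalism: each function is a fixed string template applied to a kernel, limited to the past tense, and the names Kernel, PAST_DO and the function names are hypothetical labels introduced here. What it does show is tense being carried by did, separately from the verb.

```python
from dataclasses import dataclass

PAST_DO = "did"  # assumption: tense is realised on a form of 'do' (Do support)


@dataclass(frozen=True)
class Kernel:
    subject: str     # e.g. "the man"
    verb: str        # bare verb stem, e.g. "hit"
    obj: str         # e.g. "the ball"
    participle: str  # form used in the passive, e.g. "hit"


def _cap(text: str) -> str:
    """Capitalise the first letter of a sentence."""
    return text[0].upper() + text[1:]


def negative(k: Kernel) -> str:
    return f"{_cap(k.subject)} {PAST_DO} not {k.verb} {k.obj}."


def question(k: Kernel) -> str:
    return f"{_cap(PAST_DO)} {k.subject} {k.verb} {k.obj}?"


def negative_question(k: Kernel) -> str:
    return f"{_cap(PAST_DO)} {k.subject} not {k.verb} {k.obj}?"


def contracted_negative_question(k: Kernel) -> str:
    # 'not' hops across the subject, loses its vowel and attaches to 'did'.
    return f"{_cap(PAST_DO)}n't {k.subject} {k.verb} {k.obj}?"


def passive(k: Kernel) -> str:
    return f"{_cap(k.obj)} was {k.participle} by {k.subject}."


def wh_question(k: Kernel) -> str:
    # 'what' stands in for the object and moves to the front.
    return f"What {PAST_DO} {k.subject} {k.verb}?"


if __name__ == "__main__":
    kernel = Kernel("the man", "hit", "the ball", "hit")
    for rule in (negative, question, negative_question,
                 contracted_negative_question, passive, wh_question):
        print(rule(kernel))
```

A real grammar would have to compose these operations and handle every tense and auxiliary, which is where the complexity discussed next comes from.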

The grammatical apparatus here is extraordinarily complex. Chomsky’s analysis was original in a number of ways. The rules were explicit and applied one by one. In a word such as had, did or might, the reference to time, known as ‘tense’, was treated separately from the word containing it. All and only grammatical sentences were generated by the two components, including “Mightn’t the ball have been being examined by the umpire?” Previous grammars had omitted any distinction between what was generated and what wasn’t. So Chomsky’s proposal was a ‘generative grammar’.

Generative grammar is often represented as neutral. But it isn’t entirely neutral. Questions and negatives diminish authority. The passive reduces agency. “You might be mistaken” can be read as disrespectful.

In 1967, E. Mark Gold showed that the class of grammars then being developed to explain not just English questions, negatives and passives, but their equivalents in other languages, was unlearnable. Gold’s critique applies if the critical variation across languages is by the ordering of devices. The critique is avoided by a single function with expressions according to the conditions. Generative grammar has since developed accordingly. Most aspects of Chomsky’s 1957 analysis have been superseded by reanalyses by Chomsky himself and others. But the notion of derivation from an origin to a point of pronunciation has been widely retained, as by Chomsky’s 1995 Minimalist Program.

But Chomsky’s and Austen’s projects were less orthogonal to one another than they first appeared. From a 1997 proposal by Luigi Rizzi, breaking the left edge of the sentence down into an ordered set of elements comprising the main, pragmatic aspects of the utterance, it has emerged that there may be a way of reconciling Chomsky’s and Austen’s seemingly contrasting perspectives.

I seek to exploit this in two of the steps I am postulating here, Fit and Move.

What learners have to go on

For the child learning English, some things are relatively easy. Others, like the auxiliary system consisting of words like can and do, with Do support and related phenomena, as by Chomsky’s 1957 analysis, are not easy at all. In relation to the learnability space, among the more obvious and uncontroversial characteristics of English are:

• A very complex auxiliary system;

• A relatively large inventory of phonemes, 18 vowels and 24 consonants, as set out in An inventory of Sounds;

• A relatively simple system of stops, ‘voiced’ in the case of B, D, G with the lips, tongue tip, and back of the tongue, contrasting with the ‘voiceless’ or ‘unvoiced’ stops, P, T, K with the same articulators, but a significant delay before the vocal cords are brought together allowing them to vibrate in what is perceived as a vowel. This contrasts with much more complex systems with three, four, or five settings in many South East Asian languages;

• Great complexity in the vowel system, with six short vowels, in him, hem, ham, hum, hod, hood, the long vowels in he, hark, hawk, who, what are known as ‘diphthongs’ with the tongue moving in what is known as the ‘vowel space’ in hay, high, hoy, hoe, how, both long vowels and diphthongs articulated with a degree of ‘tension’, the vowel known as ‘schwa’ at the beginning and end of agenda, a long equivalent in her, and what are sometimes taken as an extra series combining a long vowel or diphthong with schwa in our, ire, coir, truer, all written with an R, and pronounced with an R in ‘rhotic’ varieties of English;

• Relatively complex syllables or ‘phonotactics’ – with up to three segments before the vowel or nucleus, as in spring and string, up to two vocalic elements in the vowel or ‘nucleus’ in my, tense or long in me, up to three segments after the nucleus – in glimpse, next and length (in many pronunciations, at least), and, outside the frame of the dictionary word, T glued on the right edge for past tense in glimpsed and S for plurality in lengths;

• One complexity in the consonantal system when a stop is released so as to effect a sudden blast of airstream which is then released to produce a brief moment of high frequency noise in what is known as an ‘affricate’, as in the first sounds in chew and jew. This contrasts with Russian with affricates in two groups, one like English with a stop before what is known as a ‘fricative’ with the airstream flowing through the space left by the partial release of the closure, the other with just one member with the closure in the middle of a fricative. There is much greater complexity in many African languages.

• Syllabic nuclei which do not invariably contain a vowel, as in the second syllables of little and table – always unstressed in English – in a way that seems to be very problematic for most learners, with the tongue tip articulation of the T and D in little and middle characteristically lost until three or four;

• A relatively complex system of word stress with one primary stress on the left branch of the ‘foot’ in ladder, in the left branch of the rightmost foot in belladonna, and discounting one rime with a short nucleus on the right edge, as in hippopotamus.

• In many varieties of English, R is added between vowels, for some speakers in withdrawal as WITHDRAW R AL and for other speakers where sentences are connected in sense, as in “I went to Australia R and I fell in love”, but not where there is no connection as in “I went to Australia. And you still owe me that money.”

• Pitch or ‘tone’ used only to mark the difference between questions and other sorts of sentences and for various sorts of ‘pragmatic’ effects or what we can do with words, but not to distinguish words from one another. Here English contrasts with the languages of China, and about half the languages in the world.
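The syllable template described in the list above can be sketched as a simple check. This is only a minimal illustration, not a phonological analysis. The function name `fits_template` is hypothetical, the segmentations are hand-made and approximate, and the limits of three segments before the nucleus, two elements in the nucleus and three segments after it are taken from the description above.

```python
# Hypothetical sketch of the English syllable template described above:
# up to 3 segments before the nucleus, up to 2 elements in the nucleus,
# up to 3 segments after it (not counting a glued-on T or S).
MAX_ONSET, MAX_NUCLEUS, MAX_CODA = 3, 2, 3


def fits_template(onset, nucleus, coda):
    """Return True if a hand-segmented syllable fits the template."""
    return (
        len(onset) <= MAX_ONSET
        and 1 <= len(nucleus) <= MAX_NUCLEUS
        and len(coda) <= MAX_CODA
    )


# Approximate, hand-made segmentations (assumptions for illustration only).
SYLLABLES = {
    "spring": (["s", "p", "r"], ["i"], ["ng"]),
    "string": (["s", "t", "r"], ["i"], ["ng"]),
    "my": (["m"], ["a", "i"], []),
    "glimpse": (["g", "l"], ["i"], ["m", "p", "s"]),
    "next": (["n"], ["e"], ["k", "s", "t"]),
}

results = {word: fits_template(*parts) for word, parts in SYLLABLES.items()}

# The T of glimpsed falls outside the template, in line with the
# description of it as glued on the right edge of the dictionary word.
glimpsed_inside_template = fits_template(["g", "l"], ["i"], ["m", "p", "s", "t"])
```

On this sketch every listed word fits, while glimpsed only fits if the final T is treated as lying outside the frame of the dictionary word.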

Although, by comparison with other languages, English has a very complex vowel system and an only averagely complex consonant system, there are far more developmental problems with respect to the latter than with respect to the former. It is worth asking why this might be so. By the proposal here, even a cross-linguistically average consonant system is intrinsically more complex than a cross-linguistically complex vowel system. In most languages, vowels differ with respect to the position of the tongue in the mouth, their length, and the configuration of the lips. Consonants differ in all of these respects and more. And this would seem to go back to the original formation of speech sounds by what I am calling Mimic and Glue at the very beginning of speech and language evolution.


A dialect or variety of a language can be characterised at least in part by the way some words are said. In some cases, the words are uncommon. The data available to the learner is often uneven.

In many varieties of English, including the now disappearing Cockney, the T in little is pronounced not with the tongue tip but by a closure of the vocal cords, known as the ‘glottal stop’. This is unmissable.

But in the variety of English which Daniel Jones quaintly characterised as ‘Received Pronunciation’, now mostly known as RP, in Jones’s inimitable style as ‘the speech of men educated in one of the great English public schools’, the T in little is not released. But the T in huntsman, ointment and appointment is glottal stopped, as in Cockney little. The unusual configuration – between an N and an M with no vowel – blocks the realisation of the tongue tip gesture, forcing a realisation with the larynx. But even if the learner never hears any examples of this, there are other similar cases as in gentle and gentleman where the N is followed by an L, functioning as a stand-alone syllable, and like N and M also what is known as a ‘sonorant’. In these cases too, the only involvement of the tongue tip is in the articulation of the L. The learner has to generalise from what may be very limited information.

Evolution or development?

It is possible to reconstruct stages in the development of modern European languages, from what is known as Proto-Indo-European or PIE, as spoken perhaps 6,000 years ago, before the building of Stonehenge. There is key evidence from what is known as ‘grammaticalisation’, with words like le, la, un and une, in modern French having developed from the Latin for this, that, and one. From this evidence, some conclude that PIE was characteristic of a more primitive stage in the evolution of human language than its modern descendants. The development from Latin to modern French happened over less than a thousand years. It is not known if there is any pressure in this direction. It seems that it just happens – or doesn’t. Like it hasn’t happened in Russian which has done without words for the and a from the time of the first records over a thousand years ago.

There is more direct evidence of grammaticalisation in the linguistic relics of slavery, that historic crime for which there has not, so far, been any remotely adequate atonement. When African slaves were dispersed in the New World, efforts were made to ensure that they were all separated by language, obviously obstructing any sort of resistance. The only common language was that of the new masters. So pidgins developed with words from English, French, Spanish, Portuguese. But a pidgin was not a language. Only the most basic meanings could be expressed. The slaves were encouraged to have children who would become slaves in their turn. But within a generation the children turned the pidgin of their parents into languages, known as ‘creoles’ or ‘patois’. These languages characteristically lack those parts of linguistic structure used in languages like English and French to distinguish nine from ninth. So the traditional Caribbean ceremony after a death is mostly pronounced as nine night without the TH. But this lack of the TH is not in my view any sort of mark or evidence of primitiveness. It is just a consequence of the criminal circumstances in which these languages developed and the assertion of humanity by the speakers. Within a thousand years, if human society lasts that long, some of these languages are likely to develop something equivalent to the TH from a form which somehow captures the ordinality.

In my view the remarkable thing about grammaticalisation in creoles is that it evolved at all, even in the horror and misery of slavery, not that it can take a few dozen or hundreds of generations to take its full effect, to reproduce something equivalent to the whole 5,000 year development of Indo-European languages from PIE to English, Russian, Greek, Gaelic, French, German, etc.

The scope of the inquiry

It is sometimes thought that the main focus in the analysis of child speech should be on what they most often get wrong, as by fronting and stopping. These are accordingly characterised as ‘processes’. But that does not answer the questions: Why do children get wrong what they do, not just individually, but generally?

The first step to an answer, I propose, is to consider what learners HAVE to attend to. This includes many subtleties, such as those involving time. At the end of the syllable, the voiceless stops in P, T, and K are kept apart from the voiced stops in B, D, and G, mainly by the length of the vowel. So to keep the G in hog apart from the K in hock, the O vowel in hog is almost as long as a long vowel. The learner must be attending to this sort of thing in order to progress to being a competent native speaker.

Children are directed towards the relevant aspects of what they hear said by the effects of universal grammar, emerging from the steps I am postulating here.

It is sometimes assumed that subtleties like the representation of voicing in a final stop by the length of the preceding segment are defined by values on scales with an infinite number of possible settings. But if so, learners have to attend to two different sorts of thing: what contrasts with what and scalar variables of time. The learner’s task is simplified if there is just one sort of task, as there is by the proposal here with all variation, what the learner has to learn, defined by orderings in time. Obviously the task is harder if there is more than one first language. But by the proposal here, the task is the same for every language. 

In the study of speech and language there is an obvious tension between the externality of what is heard and the inner structure of what can be given a meaning, like the ordering of the two words in “Something good” and the almost complete uninterpretability of “Good something”. On one approach, the external expression and the inner structure are quite separate. On another they are part of one system.

In 1968 Chomsky and his then colleague, the late Morris Halle, published the Sound Pattern of English, arguably the most talked about book in the history of linguistics. It is one of the few books to have been republished in paperback after both authors had discarded all of its main theses. By one of these theses both speech and language are derived by a process which to the greatest extent possible uses a common apparatus in both areas, and because it is using the same apparatus cyclically, does this as economically as possible. Both Chomsky and Halle, for different reasons, changed their minds on the desirability of a common apparatus. The proposal here upholds the common apparatus view from 1968. This is partly motivated by the typical multifactoriality of speech and language disorders, partly by the strange and anomalous distribution of English children’s speech errors (limited evidence suggests similar distributions for children learning other languages) and partly by the conceptual necessities of evolution.

But going several steps further, by the proposal here, the whole process of speech and language evolution has been from particular cognitions which were recruited and adapted for a special purpose, to run at very high speed, and possibly to have parallel effects in human thought generally – with great consequences for the process of acquisition and thus for child speech. The general principle is to do as much as possible with as little apparatus as possible (stretching the derivation, increasing the scope and range of contrasts, minimising the phoneme inventory). If the learnability space is broken down in this way, and if the parameters thus have to be set in small groups, the combinatorics of parameters are reduced to a point of manageability.
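The reduction in combinatorics can be illustrated with a small calculation. This is a sketch under stated assumptions rather than a claim about the actual number of parameters: the figure of 30 binary parameters and the group size of 3 are arbitrary, the function names are hypothetical, and the groups are assumed to be independent and settled one after another.

```python
def hypotheses_all_at_once(n_params):
    """Candidate grammars if all binary parameters are open together."""
    return 2 ** n_params


def hypotheses_in_groups(n_params, group_size):
    """Candidate settings if parameters are settled in small groups,
    one group after another (assumed independent)."""
    full_groups, remainder = divmod(n_params, group_size)
    total = full_groups * 2 ** group_size
    if remainder:
        # A final, smaller group holds the leftover parameters.
        total += 2 ** remainder
    return total


# Arbitrary illustrative figures, not taken from the text.
N_PARAMS = 30
GROUP_SIZE = 3

all_at_once = hypotheses_all_at_once(N_PARAMS)            # 2 ** 30
in_groups = hypotheses_in_groups(N_PARAMS, GROUP_SIZE)    # 10 groups * 2 ** 3
```

Under these assumptions the learner faces 80 candidate settings rather than more than a thousand million candidate grammars.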

By a variety of other constructions either the topic or the focus of a sentence is moved to the left edge, as in “The chocolate, we stole because we were hungry” or preserving the order of the words in the clauses and adding a clause, by a process known as ‘clefting’ as in “It was because we were hungry that we stole the chocolate” or by a different sort of cleft “What we stole was the chocolate, because we were hungry.”


What the learner has to learn is what can be Glued and when, by the definitions of Heads, whether Case is overt or abstract as in English, how the elements are Fitted together, what can be Moved where, and how the elements of the derivation are assembled into Phases.

In the 1980s, particularly from work by Noam Chomsky (1981) and Hagit Borer (1984), many linguists came to think of language learning in terms of choices about the functors. For instance, there are many languages like Italian, Greek and Spanish in which the equivalents of I and you, known as ‘pronouns’, can be routinely dropped, with the equivalent of “I love you” said as “Love you” without the I. According to whether the target language drops this sort of pronoun or not, the language learner has to do the equivalent of throwing a mental switch one way or the other. In English something like this happens in questions with verbs of perception like “See what I mean?” or “Mind if I come in?” addressed to a person, or in negatives where the speaker is speaking for himself or herself, as in “Don’t mind if I do.” But such marginal cases don’t make English a language like Italian.

The points around which these choices or settings are made are known as ‘parameters’.

The idea has been explored mainly in syntax – how to use pronouns, how words are put together to form negatives, questions, and so on. But some work has been done applying the idea to phonology – how sounds or phonemes are put together to form words, and how stress is organised differently in those languages which use stress.

Languages like English contrast below and bellow with contrasting levels of stress on the two syllables. This is with stress represented by the length, pitch and volume of a particular syllable’s rime. Most of the languages of Western Europe use stress, but in different ways from English. Chinese languages, on the other hand, contrast different tones on single syllables. A child learning a European-type language has to set a corresponding parameter one way. From work by Paula Fikkert (1994), it seems that this starts to happen around two and a quarter. From work by Yuen Ren Chao (1973), children exposed to a Chinese-type language start to set the same parameter the opposite way at the same age, rather suggesting that there is one parameter here.

From work by Nina Hyams (1986), children exposed to English on the one side or Italian on the other are throwing the switch opposing ways at the same interesting age, around two and a quarter. 

The notion of parameters was a great advance on previous notions of rules in systems of great complexity. Whenever we think of anything at all, a picture, a symphony, a piece of legislation, a speech, or a joke, we do so using an apparatus whose operations are exclusively binary. A neurone either fires, or it doesn’t. Neurones don’t express matters of degree or shades of grey. Reducing the grammar to a series of binary values made it tractable by the human mind.
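The idea of a grammar as a series of binary values can be sketched very simply. What follows is only an illustration. The two parameter names are hypothetical labels for the two choices discussed above (dropping subject pronouns, and using tone rather than stress on syllables), and nothing here is meant as a model of how the switches are actually set.

```python
from itertools import product

# Hypothetical labels for two of the choices discussed above.
PARAMETERS = ("drops_subject_pronouns", "uses_tone_on_syllables")


def all_settings(parameters):
    """Enumerate every combination of on/off values for the parameters."""
    return [
        dict(zip(parameters, values))
        for values in product((False, True), repeat=len(parameters))
    ]


def differing_switches(a, b):
    """List the parameters on which two sets of settings disagree."""
    return [name for name in a if a[name] != b[name]]


# Settings as described in the text for English and Italian.
ENGLISH_TYPE = {"drops_subject_pronouns": False, "uses_tone_on_syllables": False}
ITALIAN_TYPE = {"drops_subject_pronouns": True, "uses_tone_on_syllables": False}

possible = all_settings(PARAMETERS)
difference = differing_switches(ENGLISH_TYPE, ITALIAN_TYPE)
```

With two switches there are four possible combinations, and on this sketch English and Italian differ by a single switch.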


Mimic, Glue, Head, Case, Fit, Honour, Move and Phase evolved one by one, providing the apparatus for language.

Languages vary in which parts of the available apparatus they use. In French, honour is marked by the use of the plural, in Spanish by a separate term, originally from Arabic, in traditional varieties of Russian by the patronymic from the name of the father. The English auxiliary system has been changing since the time of Chaucer. Might was once the past tense of may. But it now encodes possibility. For Charles Dickens “We are going to dine well” would have suggested that the diner expected to go from one house or room to another. Now it suggests a future by some sort of human agency, as in “Perhaps we are going to die” by a diner frightened of being poisoned, as opposed to “We will die” as an expression of a general certainty. The change to the auxiliary system is ongoing. Many English speakers under the age of 40 say things like “That really suits you, innit?”. For older speakers this is almost uninterpretable. To express continuity, English glues –ING on to the right edge of a verb, but uses words like maybe, allegedly, to express evidentiality. French does the opposite, using the equivalent of in the course of for doing, and a special form of the verb for evidentiality.

While a language may not express one or more parts of the total apparatus, all are available, with most languages having ways of expressing most of it, with no language lacking any term completely. All languages, including creoles and recently developed sign languages, have access to all the resources by all the speciations postulated here.

So languages vary in the use which they make of the apparatus. But there are possible uses and impossible ones. The possible uses are defined by what are known as ‘Principles’. The variations are by ‘parameters’. Nothing can be moved until there is structure in a position from which movement is possible and sense can be made of an appropriate result.

A genome

All learners of any language can expect to converge on a single grammar, and, in the case of English, agree that:

• “You believe that it’s true?” and “You believe it’s true?” mean the same thing, both with one clause embedded in another. By the theory here, even if the word that is not pronounced, it is still part of the structure.

• “The rabbit is ready to eat” is ambiguous, with the same words having different meanings according to the structures which are assigned to them. By one structure, the rabbit is the one who does the eating. By the other, the rabbit is less lucky, and the eaters are unnamed. By the theory here, the eaters are specified by an unpronounced ‘empty category’ which gets its meaning from the rest of the structure.

• “Mightn’t the ball that won the match that the bookie keeps talking about have been being examined by the umpire?” is meaningful, no matter how improbable the sentence. By the theory here, both the ball and the contracted form of not are pronounced and interpreted at opposite ends of the sentence.

Some children will hear examples of all of these structures. But others may not. How do the less lucky ones know that some categories may be empty – of all the strange things that there are to be known about the world? How does language have the property of being learnable to this degree, ensuring that empty categories are easily and reliably recognised for what they are, without learners being told? This is sometimes called ‘the Logical Problem of Language Acquisition’.

By the simplest answer, the possibility of a category being empty is given by a general property of what is known as ‘Universal Grammar’, as one distinctive character of the human genome.

It should be borne in mind that not all linguists agree that there is any such thing as Universal Grammar. The Research Program here assumes that there is such a thing, but on the basis that the goal is to give this idea some biological content – following an idea from Cedric Boeckx (2021).

By the research program here, Universal Grammar evolved from the steps by which speech and language evolved. I propose that these steps must have included those referred to here as Point, Mimic, Glue, Shout, Label, Fit, Move, and Sherlock. The informal rules suggested here are just that – informal. But were there actually more steps, or fewer? How did they all work together? And how did they evolve in such a way that their effect became heritable?

Before the beginning

It is not assumed here that the last common ancestors of modern humans and modern chimpanzees were just ancient chimpanzees, as Darwin’s first critics mocked. It is possible that these ancient common ancestors were gentle, loving, caring, thoughtful, considerate vegetarians. Or they may have been more chimpanzee-like, capable of killing one another or smaller animals, shrieking, hooting, panting, roaring, in various situations, as documented by Jane Goodall (2009). They may have differentiated between predators such as leopards, eagles and snakes, with corresponding alert and alarm calls for each one. We still use a variety of calls to flag up humour, pleasure or sorrow, fear, agony, or triumph. But apart from laughter, the communication is hit and miss. While any such differentiation is clearly a step forward from any lesser degree of differentiation, it cannot signal what the thoughts might be. It indicates, but does not refer.

An end in sight

However the speech and language faculty is acquired, no matter how many steps or stages this involves, the process eventually comes to an end. After this point, we tend to become less and less good at accommodating to the changes in speech and language of younger speakers. And only a very small proportion of us can learn to speak a foreign language well enough to pass for native speakers. But we can go on learning new words for as long as we retain what lawyers and doctors call our ‘mental capacity’. The learning of words is evidently separate from the faculty that we use to pronounce and put them together in the meaningful structures we know as ‘sentences’. This includes learning about the natural variations exhibited by some particular target language, such as the fact that English has “We are at home” whereas Japanese has something like “We home at are”. These are what are known as ‘parameters’, for which the correct settings have to be learnt. Some parameters, like the word order difference between English and Japanese, seem to be quite easy. Children mostly get this right almost from the first words because the relevant data is so clear and obvious. But some variations are much less clear. A lot of children say hippopotamus as HITAPOTAMUS with a P becoming a T and hospital as HOSTIPU with the P changing places with a T by what is often known as ‘metathesis’. But very few children say hippopotamus as HIPATOTAMUS. And not one child I have ever known says hospital as HOSTITU. So it seems reasonable to think that these are not the accidents of individual pathways of speech development, but something to do with the pathway itself.

Most people eventually sort out their HITAPOTOMASES, though not all do. Such phenomena are regarded as errors rather than individual eccentricities because speech and language are built on the basis of what is and what isn’t. “Something good” is English, but not “Good something”. At some point the process of parameter setting comes to an end, typically around the age of ten. This closes what Eric Lenneberg (1967) called the ‘Critical Period’. This requires an acquisition process quite different from learning to paint or play an instrument or throw darts. Some are good at these things, others less so. But there is no such thing as a wrong throw other than accidentally pinning an opponent to the board. The learning process is just incremental. Practice does not make perfect because there is no such thing as perfection, but only improvement. But speech and language are acquired on the basis of what is known technically as ‘discrete infinity’. A competent speaker has a finite resource of materials with which he or she can produce an infinite set of results. How is this so?

That is the question I address in my 2002 PhD thesis. There I propose a dedicated device, necessarily by what is known as a ‘macro-mutation’, a genetic magic-wand. But continuing the same line of research I have become convinced that this cannot be correct, that an account not requiring any such biological magic is intrinsically better. Hence the research program here, postulating a sequence of evolutionary steps. But either each step separately contains some mechanism ensuring that it is finitely learnable, or the last one has this as its special responsibility. Given that the second is intrinsically simpler than the first, by Occam’s Razor it is to be preferred.

So this research is a bit like the search for a proof of Fermat’s last theorem, with an end result known in advance, but no sure way of getting there. By the end result here, at least by the end of any process of evolutionary steps, the whole process was finitely learnable.
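The notion of ‘discrete infinity’ mentioned above, a finite resource giving an unbounded set of results, can be illustrated with a toy sketch. It is not a grammar of English. The two embedding frames and the core sentence are borrowed from examples used earlier in this chapter, and the names `FRAMES`, `CORE` and `sentences` are hypothetical.

```python
from itertools import islice

# A finite stock of materials (illustrative only).
FRAMES = ["she said", "you think"]
CORE = "the man hit the ball"


def sentences():
    """Yield ever-longer sentences by re-applying one embedding step."""
    sentence = CORE
    depth = 0
    while True:
        yield sentence
        # Embed the sentence so far under the next frame, cycling
        # through the finite list of frames.
        sentence = f"{FRAMES[depth % len(FRAMES)]} {sentence}"
        depth += 1


first_three = list(islice(sentences(), 3))
first_fifty = list(islice(sentences(), 50))
```

Each pass through the loop yields a new and longer sentence, so there is no longest one, even though the stock of words and the single embedding step are both finite.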


By the first step postulated here, a population of modern human ancestors must have found regular occasion to point out some individual, entity, event, or circumstance, a friend, family member, potential prey or predator, as something specially interesting, as a hint of interest, care, love, or fear.

Before this they may have gazed or touched. Gazing must have been important because human ancestors lost the pigmentation in the whites of the eyes, making the direction of the gaze much easier to read. But at least among chimpanzees, staring is a threat. Between modern humans it is more ambiguous, as anything from a threat to a hint of sexual interest. Touch is also ambiguous. It can be an invasion of personal space or worse, or an expression of tenderness. And as death approaches, as all the other senses are lost, touch can be the last sense to survive. But on a hunt where a living prey is larger and more powerful than the hunter, touching is not an option. And gazing does not communicate to other hunters. But there is a readable, symbolic act by pointing.

Pointing has to have been a cognitive innovation by a human ancestor after the divergence from the ancestors of chimpanzees. But pointing is seemingly not understandable to any non-human. The understandability is robust, surviving as the standard icon for clickability on the internet.

Informal rule.

§ Project a bearing, X, in space aligned with the outstretched fore-finger; Within X there is a point of interest.

Point suffers from two limitations. It is imprecise. X may be hiding behind a tree, only giving itself away by the tip of a tail. And pointing only works where X is in sight.

By the proposal here, although pointing is not reference, pointing was a step on the way.

The outstretched forefinger is adaptable. It can wiggle from side to side to help point out a snake, making the reference clearer, going halfway to mimicry.


As pointed out by Merlin Donald (1991), some group of human ancestors must have started to use mimicry with sounds or gestures to pick out individuals as individuals or as members of a group, class or set, even when they are out of sight. Mimicry is a first step to overt expression. It overcomes both of the main limitations of Point.

Mimicry can be squealing like some species of primate or opening and closing the forefinger and thumb like the beak of a bird. For the sake of clarity and actual communication, the mimicry had to be accurate.

By the most minute and subtle observations of his three children, Jean Piaget (1951) showed how mimicry develops from imitation. One of his children was watching a cat on a wall, and then mimicked the movement with a match box. The symbol formation here is inconceivable without a sense of what Piaget (1954) called ‘object permanence’ and the sense of a valuable object, something to be proud of and worth having. By object permanence Piaget meant that if we lose sight of something and then see it again where we might have expected to see it, it is probably the same thing. This is often around the age of 18 months or at any rate just before the first word combinations. 

Informal rule:

§ For some entity, X, take some obvious, distinguishing characteristic, x of X; Mimic x as accurately as possible; Adopt or copy x as a standard, conventional way of picking out X.

The same forefinger, so useful for mimicking characteristic motions, can be co-opted for more abstract expressions, to jab as a mark of naked aggression, or to go face up and curl to beckon.

It is argued by the late Michael Corballis (2003) and Mike Tomasello (2010) that the first precursors of language were all with the hands, that the mimicry was all signed. But from the behaviour of modern humans with no shared language, such need-driven communication would seem likely to have used whatever was most convenient – either sign or vocalisation. While there are communities with a high rate of congenital deafness which are bi-lingual between signed and spoken language, there are no hearing communities with only signed language. And as Maggie Tallerman (2012) and others have noted, the onus is on those who propose that sign language came first to show how signing was dropped so completely in favour of speech. If speech evolved from a signed system to a vocal one it is necessary to postulate a step by which this came about.

Just this first part of the human ontogeny proceeds over more than a year. The phylogeny may have taken 100,000 years or more.

Mimicry has been observed at least once in chimpanzees – by an observant member of a television crew. An alpha male, who was limping from an injury, was leading his group in line. A younger male started copying the limp of the alpha male. Then the alpha male turned round. And the younger male promptly reverted to his normal gait. Teasing. Often not enjoyed by whoever is being teased. But from the rarity of the observation, clearly not part of the everyday chimpanzee repertoire. Teasing does not make sense without a theory of the mind of whoever is being teased. But as noted above, there is no claim that whatever theory of mind there may be in non-humans has the crucial property of recursivity.

Not onomatopoeia

Because speech is by a generative system, competent speakers tend to think they can use it playfully in what is known as ‘onomatopoeia’ – from the Greek for name and composition – to create new words that are thought to ‘sound like’ what they are supposed to represent. Or they can set aside the rules of sound and word formation for their language – what is known as the ‘phonotactics’ – for the sake of better mimicry. If these efforts are on the right track, the language of anyone making the effort should make no difference. It should be obvious what is being referred to. In practice, the mimicry is often abandoned in favour of a conventionalised representation in which the phonotactics are preserved – at least more or less. And the supposed onomatopoeia is not understandable other than to native speakers of the language. But there are interesting clues here as to how the structure of speech may have evolved from mimicry.

In the mimicry of chickens, at least by most speakers of English, the lips are initially closed, not allowing the airstream to pass through the nose, but allowing the vocal cords to vibrate as soon as the lip closure is released through a mouth space as open as possible – almost a B and an AH sound. Then the airstream is closed by the back of the tongue, almost a K. Then this is repeated – only faster.

In the mimicry of cows, we use the lips, the tongue, and the opening to the nasal cavity, in much longer gestures.

In English, oink for pig accommodates to what is perceived as a sound of nature. There are no dictionary words with any sort of long vowel before NK or NG.

What are conventionalised for these sounds by French speakers are interestingly similar. In French meuh for cow, the vowel of feu for fire is prolonged in a way that violates French phonotactics, as shown by the H in the conventional spelling. In French groin for the sound of a pig, the initial G and the R both involve the back of the tongue. And the final N opens the passage to the nose, increasing the resonance. In both cases, the accommodations to mimicry involve similar or the same gestures. But the accommodations are language-specific. French speakers do not relate English moo to cows. Nor do English speakers relate French groin to pigs. Neither French nor English speakers understand the other language’s onomatopoeia for the most familiar farmyard animals.

But these accommodations are by trained users of an evolved speech system. While most modern humans can mimic some sounds from nature, there is wide variation in this skill. For the most skilled exponents this becomes an entertaining party trick, a music hall performance, a military deception, or part of a hunter’s repertoire. But mimicry for reference is lost.

But the imperfect onomatopoeia of oink partially recapitulates what may have been the first step on the pathway to language. To get from mimicry to language, the hisses, clicks, grunts, sighs, moans, of mimicry had to be reorganised into the acoustic elements of sonority, resonance, harmonics, and distributions of aperiodic noise that characterise modern human speech.

A primordial taxonomy

For all the onomatopoeic imperfections, there are telling cross-linguistic commonalities. When French and English speakers mimic pigs they use what phoneticians call an ‘ingressive airstream’ with the air drawn into the lungs and the back of the tongue raised to the point that the soft palate vibrates. English oink accommodates to this by the closure between the back of the tongue and the soft palate. French does something similar with the back of the tongue gesture of French R. Both English and French accommodate to the mimicry by the opening of the passage to the nasal cavity by the N. Similarly, both languages accommodate to the mimicry of cows by an initial lip gesture with the nasal cavity open – M – followed by a long round vowel. In French meuh the length is achieved by a device falling outside French phonotactics, as shown in the written form by the H. In both cases, in English and French, the onomatopoeic representation is much shorter than by true mimicry. The natural sound of the cow is much slower.

Development and vestiges

Reference by mimicry alone is very limiting. Only a very small proportion of all the funny, interesting, or important things there are to refer to can be defined understandably. Tastes, feelings, preferences, surprises, are just a fraction of what is excluded. Jane Goodall (2009) explains when and why chimpanzees use each and every one of 60 or so whoops, seemingly none involving mimicry.

For a social species carving out a new and precarious niche in dangerous territory, just starting to use mimicry as a form of reference, there are clear and obvious advantages from every expansion of the inventory of mimicked items. This could have been done by varying one aspect of the mimicry. But with a system defined only by mimicry, it is impossible to track the variations unless the gestural properties are varied one by one.

The first variations may have been by any of the properties available to the human mimic, the length in time, the pitch, the point in the vocal tract at which closure is effected, whether the airstream passes through the nose, as by M, or not, as by B. But perhaps the most critical variation was with respect to the overall resonance of segments, what differentiates vowels and consonants from one another.

No variation will get off the ground in relation to a communication system unless it can be reproduced across the community. As one property is identified, it can be combined with another. This is taking a small step towards the system known as ‘features’, as described in Sounds and Bits of sounds.

Thus mimicry may have developed by reassembling the elements of auditory perception and vocal articulation one by one into an organised and memorable acoustic system, defined not on particular animals or any other focus of reference, but on contrasts. Thus abandoning the defining property of mimicry for the sake of more coverage.

We still use the functionality by mimicry when there is no dictionary to hand, or none exists, or there is no mobile phone for Google Translate, if communication is important enough, and we choose to run the risk of sounding silly.

Or we laugh at a gait mimic walking behind some unaware stranger accurately mimicking his or her gait. Primordial mimicry is very much alive and kicking.


By a mechanism, generally known as ‘Merge’, now associated with the Minimalist Program, two elements are brought together and glued as ‘branches’ in a structure. Here, for the sake of evolutionary plausibility, I break the idea down into parts, Merge 1, 2, 3, and 4. Ljiljana Progovac (2015) calls the first stage ‘Proto merge’. I call the first part of this Glue – to resurrect a notion proposed by the linguist and educational pioneer, Wilhelm von Humboldt (1836). By Glue on my definition, the only ordering is by the assembly itself.

Informal rule:

§ Take two entities, X and Y, and join them.

By my proposal here, Glue did not just involve the assembly of words. It must logically have involved the parts of sounds from which the words were built, the primordial features by gestures of the tongue and lips, and their combination into syllables consisting of something like a consonant (C) and a vowel (V). By the formulation here, Glue gives a single syllable with a primordial consonant and a primordial vowel, such as Pa, known as a simple open or CV syllable. Two such CV syllables could then be strung together – as by modern high way and no go. This does not exclude the possibility that X and Y are the same, as by No no, perhaps for emphasis. But it is hard to find a plausible range of modern examples, either contrasting or the same.
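The informal rule for Glue can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a claim about how the primordial system worked: the function names glue and spell are hypothetical, and a glued pair is represented as a plain tuple whose only ordering is the order of assembly.

```python
def glue(x, y):
    """Join two entities. Nothing is marked as a head, and the only
    ordering is the order in which the two were assembled."""
    return (x, y)


def spell(structure):
    """Read a glued structure out from left to right as one string."""
    if isinstance(structure, str):
        return structure
    return "".join(spell(part) for part in structure)


# A primordial consonant and vowel glued into one open (CV) syllable.
pa = glue("P", "a")

# Two syllables glued together; X and Y may be the same.
papa = glue(pa, pa)

print(spell(pa))
print(spell(papa))
```

The same function serves for sounds, syllables, and words, which is the point of the proposal: one operation, applied at every level.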

Phonetically, this allows the attributes of sounds to be multiplied against one another even within the four-celled matrix of consonants which are articulated with or without allowing the airstream to pass through the nose, like N and M in opposition to D and B, or those articulated with a constriction by the lips like M and B in opposition to N and D with an equivalent constriction by the tip of the tongue. By the proposal here, this conceptual matrix was first worked out by a classical scholar, in Four Clever Letters.
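The four-celled matrix can be written out explicitly. The sketch below assumes two two-way choices, named here airstream and constriction for convenience; those names, and the dictionary CELLS, are illustrative labels rather than terms from the text.

```python
from itertools import product

# Two two-way choices, as described in the paragraph above.
AIRSTREAM = ("nasal", "oral")            # through the nose, or not
CONSTRICTION = ("lips", "tongue tip")    # where the closure is made

# The consonant named in the text for each cell of the matrix.
CELLS = {
    ("nasal", "lips"): "M",
    ("oral", "lips"): "B",
    ("nasal", "tongue tip"): "N",
    ("oral", "tongue tip"): "D",
}


def matrix():
    """Return every combination of the two choices with its consonant."""
    return {cell: CELLS[cell] for cell in product(AIRSTREAM, CONSTRICTION)}


for (airstream, constriction), consonant in matrix().items():
    print(consonant, airstream, constriction)
```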

As Robbins Burling (2005) points out, there could be no words without sounds as their components. Or there would have been nothing to join.

The components of sound were no longer the attributes of actions of mimicry, but self-standing, independent elements. They could start to evolve towards what are known as ‘features’, as described in Sounds and Bits of sounds, differentiating EE with the tongue at the front of the mouth, OO with the tongue at the back, AH with the tongue lowered, and differentiating B with the airstream completely stopped for a brief moment from M with the airstream through the nose. The simplest possible system would contrast just two features, but under the rubric of a system.

It was the abstract formulation that made it possible to unconsciously incorporate any of the sounds which humans can produce into a fundamental mechanism for speech and language.

Primordial structures

In terms of syntax, Glue gives a single concatenation. “Pa see” and “See pa” just suggest a relation between see and Pa, without specifying who could see who or what. The meaning could only be figured out from the context. Ljiljana Progovac (2015) suggests that there is still a vestige of this in expressions like tell tale, cut throat and skin flint. Where people are involved, the vestiges are mostly derogatory. But these expressions combine verbs and nouns in a way quite unlike the simplest possible primordial forms.

Pursuing to the limit the notion of simplicity, Glue is the first step of the development towards words and sentences, consisting of nothing more than sounds and associated meaning, just allowing elements to be concatenated with one another in any logically possible combination – with no significant ordering or difference between them. In the first instance, there is no reason to assume anything other than the gross contrast of sonority between consonants, briefly closing the vocal tract, and vowels leaving the airflow unimpaired. And structures such as “Pa see” and “See pa” would seem equivalent.

Glued structures can announce, comment, insult, joke about the delights and disasters of family life, and more. But at the time this step was taken there is no reason for assuming that its limitations were obvious or even detectable.


The contrast and meaning of a structure can be amplified by defining one of two elements as a head, identifying and defining roles, including the role of referring. In the case of “See Pa”, Pa is referential, and see is non-referential. The active non-referential element becomes the head of the structure, tying the elements to one another. By the marking of one element as a head, the structure can now be glued to another, as by Ma gluing to “See Pa” giving “Ma see Pa”. It is still unclear who sees who. But there are now two logically possible propositions.

In relation to the sound structure, it is obvious that M, B and P all involve a complete closure by the lips. The connection between M and B was seemingly detected by an ancient scholar. But only M is articulated with the airway through the nose left open. To implement this, M is specified accordingly, and B and P are specified the opposite way. In a novel way, it is proposed here that these specifications, like others including the involvement of the larynx, are ordered or ranked in relation to one another. By this proposal, speech sounds are built by a strictly ordered assembly. By the ordering, there is a difference between a nasal sound with the lips closed and a lip closure sound with the airstream through the nose. Each multiplication by the primordial features has an exponential effect. Multiplying two two-way variables gives four combinations. Multiplying this by one other variable gives eight. And the greater the separation by the ranking, the greater the differentiation.
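The arithmetic here can be checked directly. The sketch below counts the settings of one, two, and three two-way features. The second function is only one possible reading of what ranking adds, namely that each order of assembly counts as a distinct sound; that reading, and both function names, are assumptions made for illustration.

```python
from itertools import product
from math import factorial


def unordered_specifications(n_features):
    """All settings of n two-way features, ignoring any ranking."""
    return list(product((True, False), repeat=n_features))


def ranked_count(n_features):
    """Assumed reading of ranking: the same settings assembled in a
    different order count as different, so orders multiply settings."""
    return factorial(n_features) * 2 ** n_features


for n in (1, 2, 3):
    print(n, len(unordered_specifications(n)), ranked_count(n))
```

With two features there are four unordered settings, and with three there are eight, as in the text. On the assumed reading of ranking, the count grows faster still.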

Informal rule:

§ Of X and Y as X Y, Mark X as the projecting head of X Y, and Project X.

Head gives the basis of what are known as ‘X-bar theory’ and a version of what is known as ‘feature geometry’. Various notations have been used for X-bar theory. One was to mark the head with an over-bar. Hence the terminology. Conventional feature geometries do not recognise any ordering, and thus do not make it possible to encode the same degree of detail.
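The step from Glue to Head can be sketched as adding one piece of information to a glued pair: which element projects. The class Headed and the functions label and glue_with_head are hypothetical names, and the head_is_left flag is an assumption added so that “Ma see Pa” can keep see as its head.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Headed:
    head: str        # the label projected from the head element
    left: "Node"
    right: "Node"


Node = Union[str, Headed]


def label(node):
    """A bare word is its own label; a structure carries its head's label."""
    return node if isinstance(node, str) else node.head


def glue_with_head(x, y, head_is_left=True):
    """Glue x and y, mark one of them as the head, and project its label."""
    head = label(x) if head_is_left else label(y)
    return Headed(head=head, left=x, right=y)


see_pa = glue_with_head("see", "Pa")                            # headed by see
ma_see_pa = glue_with_head("Ma", see_pa, head_is_left=False)    # still headed by see

print(label(see_pa), label(ma_see_pa))
```

Because the larger structure carries the label of its head, it can be glued again without losing track of what it is a structure of.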

Syllables could be doubled with some contrast in length or volume as by modern Mama and Papa in English and French, or combined with one another, as in Moonie and Barney. Vowels could become more internally complex towards something like the near limit case represented by English. Halo and hello could be contrasted by what is commonly known as ‘word stress’, an intrinsic and inalienable aspect of every word, phrase, and sentence in English. This area of variation is sometimes known as ‘prosody’ or ‘supra-segmental’. Phonemes could merge the properties of vowels and consonants, as in W, Y, L and R. Diphthongs with the tongue beginning in one position in the mouth and ending in another could be built for greater contrast, as in the cases of English eye and owe. Consonants could be clustered as by English FL and FR in fly and fry.

By this increase in the combinatorial power of the representation, new ideas with no obvious ostensive expression could be defined, each step exceeding by a larger margin the possibilities available to chimpanzees or cetaceans.

The relation between heading and being headed is not one of equals. By this rather undemocratic idea, there is hardly any such thing as an equal relationship. As soon as two elements pair, one typically becomes the head of the resulting pairing.

There are still traces of language with minimal structure beyond Head, in “Good morning”, “Happy Birthday”, in which the head is the act of greeting. But these are not true examples of the primordial structure with only the most minimal attributes defining the headship.

In modern English we distinguish two meanings of a ‘woman novelist’, according to whether woman or novelist carries the main stress. This scaling of importance, which seems to be universal across modern languages, may have been the first step beyond unordered concatenation – by marking one of the two elements as a ‘head’. But here again, the subtlety is not primordial, but just a consequence of the primordial simplicity.


If by Glue and Head there was an irreducible ambiguity in structures like “Boy see baby”, the ambiguity could be removed in either of two ways, by marking the different semantic roles or by understanding different positions within the structure in a corresponding way, or both. There are numerous theories of what are known as ‘theta roles’. These include:

Agent: People in “People kill game” where kill game is a verb phrase headed by kill and glued to people.

Patient: Game in “People kill game”

Experiencer: People in “People fear pain”, where fear pain is a verb phrase glued to people

Theme: Pain in “People fear pain” by the same structure

And more

Very approximately, theta roles can be encoded by gluing abstract elements nominally corresponding to them. In English, the process is almost completely abstract.

Informal rule:

§ Where X has a theta role Y, Glue y of Y to X.

The y element is known as Case.

The system is cumbersome and arbitrary. But the marking of theta roles in this way is enough to eliminate most of the propositional ambiguity in a structure like “People kill game”.
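How Case removes the ambiguity can be sketched as follows. The marker names case-A and case-P are hypothetical stand-ins for the abstract y element, and the two functions are illustrative only. The point of the sketch is that once each word carries its marker, the order of the words no longer decides who does what to whom.

```python
# Hypothetical abstract Case markers, one per theta role (illustrative only).
CASE_FOR_ROLE = {"agent": "case-A", "patient": "case-P"}


def mark_case(word, role):
    """Glue the Case element for a theta role onto a word."""
    return (word, CASE_FOR_ROLE[role])


def who_did_what(marked_words, verb):
    """Recover the proposition from the Case marks, whatever the order."""
    by_case = {case: word for word, case in marked_words}
    return f"{by_case['case-A']} {verb} {by_case['case-P']}"


one_order = [mark_case("people", "agent"), mark_case("game", "patient")]
other_order = [mark_case("game", "patient"), mark_case("people", "agent")]

print(who_did_what(one_order, "kill"))
print(who_did_what(other_order, "kill"))
```

Both calls recover the same proposition, which is roughly the situation in languages where Case is pronounced on the noun.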

In languages like classical Latin and modern Russian, Case clearly distinguishes between the agent and the patient in the equivalent of “people kill game” – if the context does not make this obvious.

In languages like English, Case is expressed abstractly as a property of the structure, where by traditional grammar in “People kill game” and “People fear pain” people is the ‘subject’ and game or pain is the ‘object’ or ‘complement’. So it is possible to kill a flower, an animal, time, an idea, but not a house or water, where the relation is expressed by the first step of Glueing, and the action of killing is on the next level of structure by a different, looser, less semantically-restrictive relation.

The advantage of stating Case in this abstract way is that it gives a single analysis of English and Russian-type systems.

From the decisive perspective of evolutionary advantage, Case gives a proposition its character as true, false, half-true, self-evident, misleading, ironic, mockery, and so on.


By the realities of discourse, a proposition is stated as an event in time, with some purpose – to shift the focus of attention, to remove a doubt, to get some information, to correct an error, to reset a time scale, to make a commitment, to make some necessary assertion or flippant comment.

Functors are like the skeleton to which the muscles of ‘content words’ are anchored. Primordial functors may have been not, as in “Not Pa!” and Tense, like -ED in “Dropped baby”.

In “What did they love?” there is a still higher relation between what and the rest of the structure, defining the question force, by my proposal here, at yet another angle.

There is a similar contrast between different aspects of the speech sound, some given by the shaping of the vocal tract, by the point of greatest narrowing, whether this is by the lips or the back of the tongue, and another given by relative timings of different actions. Essentially, these are different sorts of timing.

Informal rule:

§ For an element X, if X has no fixed meaning, Align X to Y.

What Fit does

Fit helps a labelled structure fit some need or situation. The difference between the planes allows the cartography to adjust the relative saliences between elements, to glue two elements indissolubly together, as for a sailing boat or aeroplane, or just lightly as for a violin. The indissoluble glueing is what von Humboldt probably had in mind when he invented the term ‘agglutination’, as between the n’t and do in don’t, as opposed to the much looser connection between do and you in “Do you do that?”, which can be separated (though not conventionally for the vows) by really, and so on, and the loosest of all possible connections where the overt pronunciation of an element is optional and the only definition is of its exact or approximate position in sequence. (The approximation applies only in those languages like classical Latin and modern Russian with much looser definitions of word order than allowed in English). English is only very mildly agglutinative. Strong glueing applies only to contracted elements like n’t and whatever they get glued to. The native languages of North America are much more strongly agglutinative. According to Barbara Mithun who was tasked with writing a dictionary of Mohawk, one can ask a Mohawk speaker: “What was the last word you said?” And the answer is what would be a sentence in English. The difference is in the tight way that Mohawk glues words together.

What and not are both archetypal functors, outside the system of syntactic categories. Both exploit the possibilities by Fit.

Fit exploits the salience of particular structural positions. The most salient of these positions is the left edge or what is known as the ‘left periphery’. I say the left edge because words, phrases, and sentences all have left edges.
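A very rough sketch of the informal rule for Fit, under the assumption that aligning a functor means placing it at one edge of an existing structure. The function name fit and the edge parameter are hypothetical, and a structure is simplified to a flat list of words.

```python
def fit(functor, structure, edge="left"):
    """Align a functor to one edge of a structure (a list of words).

    The left edge is the default because the text treats it as the
    most salient position.
    """
    if edge == "left":
        return [functor] + list(structure)
    if edge == "right":
        return list(structure) + [functor]
    raise ValueError("edge must be 'left' or 'right'")


print(" ".join(fit("not", ["Pa"])))
print(" ".join(fit("what", ["did", "you", "say"])))
```

The flat list leaves out everything about planes and degrees of glueing. It shows only the alignment to an edge.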

Fit allows recursion to proceed indefinitely. So there is no maximum limit on the length of words or sentences. But very long words come at an obvious price. Languages tend to keep the most important, most commonly used words, short and simple.

• The most salient of left edges is the left edge of the entire structure. This allows requests for some specific information by one of a special set of words or phrases, in English who, what, which, when, where, why, how, how many, how much, how often, and so on, in English, as in many languages, mostly on the leftmost periphery, as in “What did you say?”

• In almost all modern languages, the marking of edges helps to distinguish between positive and negative statements of fact, denials or affirmations of truth, commands, questions, by forms such as not. More simply than modern English, a now archaic English just had not after the verb as in “I think not” and “I know not”. By contrast in modern English, not is sequenced on the end of an auxiliary element such as can. The contracted form n’t always appears indissolubly glued to the right edge, without the vowel, as n’t, exploiting the difference between the planes, as in “I can’t see you” from “I can not see you.”

• The functor that alternates between being pronounced and not being pronounced in “I say that he died” and “I say he died”, again on the left edge.

• The marking of edges also helps to define the statement of possibility, permission, compulsion, relevance to the present, what is known as ‘evidentiality’. All of these are expressed in English, as in almost all languages, on elements of the verb. But English does this in an uncommonly complex way, up to the limit represented by “Couldn’t Father have been being honoured?”. Evidentiality defines the contrast between fact and uncertainty. Many languages, such as French and Italian, use a form of the verb known as the ‘subjunctive’ to allow this to be encoded. For this, an abstract form has to be entered on the left periphery. The subjunctive is becoming archaic in English, but it is still interpretable by most speakers in “I demand that she be admitted”. Here be is a relic of a subjunctive once much more widely used. In such a case, the subjunctive encodes a degree of doubt or uncertainty about whether she will be admitted or not. English happens to be poorly stocked with ways of expressing evidentiality, and makes do with words like allegedly and expressions like “It is alleged that ….” I once heard one of my children say, “I think I might have misunderstoodended that.” What is known as ‘multiple marking’ – in this case by the four elements in stood, -en, -D and -ED. Misunderstoodended may have been an attempt to encode evidentiality more definitively than English allows.

• Respect or deference, sometimes known as ‘register’, is expressed in most languages by one or more special terms for English you. Familiarity is shown by tu in French, du in German, etc., equivalent to thee and thou. Most modern English dialects don’t have a word for marking respect. There is a special vocabulary for marking disrespect in words like lurk, amble, shamble, ramble, toddle, witter, babble, but this does not fill the gap left by the disappearance of thee and thou. Normally, children only start learning this aspect of language between three and four. On the basis of an insightful observation by my youngest son at the age of nine, respect is expressed indirectly in English, by not referring to a third party by their relationship to the addressee. So to say “I just saw your colleague / wife” implies equality or superiority. How this is learnt is not obvious.

• The forms, generally known as ‘personal pronouns’, I, you, he, she, etc., refer according to the relations between the participants in the conversation. Mostly a shorthand for fuller reference, pronouns are dropped by ‘Pro-drop’ in Italian-type languages in which “I love you” is said as the equivalent of “Love you”. In English “Mind if I come in”, both the pronoun you and the verb do go unpronounced. In English, as in almost all languages, commands are mostly issued with a bare ‘verb phrase’ like “Come in”, “Go to sleep”, “Put your coat on”.

• The two commonest words in English, a and the, relate a referent to the history of the conversation or the immediate world of the speaker and listener or listeners. In “A woman laughed” we may know nothing about the woman. But in “The woman laughed” we know who she is.

• The perception of some entity as animate or of one sex, as something unique like a child, partner, lover, parent, or grandparent, in some large numbers like the fruit on a tree, in every case restricting the scope of reference. A category known as ‘number’ expresses the difference between single and plural entities, in modern English regularly written as -S or -ES.

By my proposal here, in a novel way, Fit is no less powerful in relation to the sound system or phonology.

• On time scales over a hundred years or more, vowels can be changed. Take name and tide, pronounced in the time of Chaucer with an AH vowel in name, and an EE vowel in tide, and a short AY in the second syllable. By the effects of what Otto Jespersen called the ‘Great English vowel shift’, now often called the GVS, both AH and EE vowels became diphthongs with the tongue moving up towards the front of the mouth in the course of articulating them, in name from a mid point in the mouth and in tide from down low. This change happened over a number of generations. But speakers who were making the fractional and mostly imperceptible changes in their speech were, by my proposal here, exploiting the functionality of Fit.

• By what is known as ‘lenition’, the T in little and between the N and the M in huntsman is replaced by a glottal gesture, effecting an increase in the acoustic contrast with the two closest or most similar sounds, P and K in supple and nickel.

• By the seemingly opposite devices known as ‘assimilation’, as in phrases like “good morning” as GOOB MORNING and “ten girls” as TENG GIRLS, and ‘dissimilation’, as in little as LIKU (where these processes have the effect of changing what is already there) Fit adjusts levels of contrast. In the much commoner assimilatory cases, a tongue tip articulation in the D in good is lost to the labial articulation of the M in morning, becoming a B. And the N in ten is lost to the back-of-the-tongue articulation of G in girls. Here Fit diminishes contrast. In the cross-linguistically rare case of dissimilation (rare across the world’s languages, but available from the apparatus), Fit increases contrast. This occurs mainly in child speech. It is, in my view, hugely underestimated in studies of child speech, and often not even recognised or listed. Assuming that the small child correctly identifies the final L segment in little as being by a tongue tip articulation, the childish pronunciation as LIKU involves a sharpening of the contrast between the T and the L. Although the LE syllable is a full part of the root in modern English little, this was plainly not always the case, as shown by settle, as a place to sit for a period, fettle, to fit the handle onto a cup or jug, ladle a deep spoon for loading a soup or stew from one container to another, handle as part of an object doing the job of a hand. And in the case of little it seems to have originated in Proto-Indo-European as leud meaning small, where the LE may have originally emphasised smallness or perhaps dearness. And this may have been enhanced by the dissimilatory change of articulator if LIKU once persisted longer than it does in modern acquisition.

• A set of words including that in English is used to define what is known as a ‘subordinate clause’, as in “I know that I am right” or “I know that you think that I am wrong.” The functor that marks the embedding or subordination of one clause within another, exemplifying recursion. Its role is emphasised by the fact that it is alternately pronounced and left unpronounced.

• By what is known as ‘reduplication’ or doubling the whole structure, as in chop chop, Fit can either mark the structure as non-literal or emphasise it, as in very, very good, or for some speakers, to denote an extreme example, as in “I went to a school school. If you looked out of the window you got tied to the chair.”
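The assimilation cases in the list above, GOOB MORNING and TENG GIRLS, can be sketched as a small lookup. The sounds are written with the same capital-letter labels as in the text. The tables PLACE and ASSIMILATED and the function assimilate are hypothetical names. The entries for N before a lip sound and D before a back-of-the-tongue sound are not in the text and are added by analogy.

```python
# Where the following consonant is articulated (rough and illustrative).
PLACE = {
    "M": "lips", "B": "lips", "P": "lips",
    "G": "back of tongue", "K": "back of tongue",
}

# What a tongue-tip consonant becomes when it takes on that place.
ASSIMILATED = {
    ("D", "lips"): "B",              # good morning -> GOOB MORNING
    ("N", "back of tongue"): "NG",   # ten girls -> TENG GIRLS
    ("N", "lips"): "M",              # added by analogy, not from the text
    ("D", "back of tongue"): "G",    # added by analogy, not from the text
}


def assimilate(final_sound, following_sound):
    """Return the final sound as pronounced before the following sound."""
    place = PLACE.get(following_sound)
    return ASSIMILATED.get((final_sound, place), final_sound)


print(assimilate("D", "M"))
print(assimilate("N", "G"))
print(assimilate("D", "S"))
```

Where the following sound is not in the table, the final sound is left as it is. In the text’s terms, contrast is diminished only where the two articulations compete.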

By my proposal here, again in a novel way, by the different angles, it is possible to define different sorts of entity, such as contrasting relations of time, and long distance immediacies:

• The presence or absence of a pause in what are known as ‘restricted’ and ‘unrestricted’ relative clauses, as by the different readings of “I don’t like butch men who abuse women”. With a pause between men and who, the inference is that all butch men abuse women. Without the pause the inference is that only some do.

• The difference between syllables and words, a sequence which is sometimes reduced within the phrase as in “Good morning” in English, almost lost as in a pinta milk, and between functors as in Don’t, or increased as in “Australia R or England” as in English English but not in Scottish English;

• Length variations between long vowels like the EE in me where the tongue is squeezed and tensed forwards and upwards in the mouth, shorter, less far into the corner, without tensing in him, with length in the vowel used to mark the voicing of the B in rib, and in diphthongs like OY in boy where the tongue moves forwards and upwards throughout the articulation;

• The contrasting edges of the affricates at the beginnings of Chay and Joe, and the on-glides and off-glides in the AY and OE vowels or nuclei;

• The variable delay between the release of the stops in tie and die;

• The doubling or ‘gemination’ of N in words like unknown and unnerving.

• The finest steps of a derivation, unfolding too fast for conscious human perception.

Fit helps to ‘do things with words’ precisely, in expressions of pain, surprise and apology, in questions by intonation alone, as in “Eating seaweed?” Significantly, the rising intonation here is almost universal across languages. It would thus seem possible that speakers found this way of asking questions before Fit had evolved. But when it evolved, Fit was hugely beneficial, taking a step towards reconciling the pragmatics and the syntax.

Fit makes it possible to mark non-propositional aspects of meaning like politeness and delicacy, in contrast to propositional aspects. The contrast here is crucial to the property first identified by Noam Chomsky in 1986 and updated and developed in 1981, by which all grammatical effects are strictly local, over a short distance of derivational structure.


It is noted that quite a number of children on the autism spectrum have problems with the pronouns, I, you, he, she, we, they. These pronouns are just part of a vastly more complex indexical apparatus. It would seem possible that either this element of the Fit apparatus or the Fit apparatus itself is selectively vulnerable in human development.


The only relations which are systematically expressed in language and the lexicon are all related to honour or the lack of it, power and authority or importance. In 1978 Bernard Comrie invented the term honorific for this. Apart from words like darling, affection and love don’t get a look in. One social dimension is systematically singled out.

Languages vary in how they encode honour. It is often said that English does not do this at all. In most other languages, including all the other languages of Western Europe, there are different words for you, according to the degree of respect which is thought to be owed to the addressee. But this is misleading in four ways.

First, reference to anyone by their relation to the addressee implies either equality or superiority. It would seem presumptuous in a conventional work setting for a junior to say to a director, “I was just talking to your colleague”, but not the other way round.

Second, there are words like donate which are treated differently by the grammar from the close synonym, give. With give, there is an alternation between “I gave the library a present” and “I gave a present to the library”. With donate there is only “I donated a present to the library”. The difference seems to hinge on the fact that the lexical representation of library seems to refer to its significance as an institution.

Third, there are many verbs which imply a lack of respect for the subject – witter, jabber, amble, lurk.

Fourth, as Ljiljana Progovac points out, terms like tell tale, pick pocket, skin flint, are almost all pejorative.

Languages encode honour and respect in their own ways. For instance in Zulu, in discussion with an elder, respect is indicated by just not referring directly, but by finding a work around – Gugulethu (2019, personal communication). Some languages, like Japanese, have forms of the verb reserved as marks of respect.

Since this is the only social dimension which is encoded in language, honour cannot be one of the many dimensions encoded by Fit. It can only have emerged as a separate step, narrowing down the options by Fit, necessarily later.


By what Chomsky and most other interpreters of the Minimalist Program now call ‘Internal merge’, a functor like who, what or where is COPIED to the left edge. By my proposal here, it is MOVED there, but leaving a copy in the original position in a way quite different to any primitive concatenation. By Move, I am resurrecting Chomsky’s terminology from the 1950s. By my proposal here, Move represents the benefit of a different, more evolved cognition, seeking to increase the pragmatic salience of a particular position in the structure by moving an element to it, rather than just finding it there.

The value of separating the function by Move from the positioning by what the Minimalist Program calls ‘External Merge’ is recognised by the syntactician Cedric Boeckx and a number of biologists (2019). It gives a better account of evolution.

Informal rule:

§ Take a functor X; Move X to a C-commanded position.

By the current theory of Move, it involves copying an element and either leaving it in its original position or leaving it, but with a marker ensuring that it is not pronounced. Within the Minimalist framework here the marking could be by disconnecting the connection from the spine.
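The copy-and-silence idea just described can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: a structure is treated as a flat list of words rather than a tree, Do support is ignored, and the names Element, move_to_left_edge and spell_out are hypothetical, not part of the proposal.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Element:
    """One word in a flat structure, with a flag for whether it is pronounced."""
    word: str
    pronounced: bool = True


def move_to_left_edge(structure, index):
    """Put the element at `index` on the left edge, leaving a silent copy behind."""
    original = structure[index]
    # The copy stays in the original position but is marked as unpronounced.
    silent_copy = replace(original, pronounced=False)
    rest = structure[:index] + [silent_copy] + structure[index + 1:]
    return [original] + rest


def spell_out(structure):
    """Join only the pronounced elements, in order."""
    return " ".join(e.word for e in structure if e.pronounced)


in_place = [Element(w) for w in "the man hit what".split()]
moved = move_to_left_edge(in_place, 3)
print(spell_out(in_place))  # the surprise form, nothing moved
print(spell_out(moved))     # the moved form, with the original copy silent
```

The moved structure is one element longer than the original, because the silent copy is kept rather than deleted. That is what lets the two forms remain related.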

By Move:

• The left periphery becomes a container for the pragmatic aspects of the sentence, as in “What did the man hit?” There is evidence of the movement in the fact that it also makes sense to ask “The man hit what?” where the force, to use a term from Austin, is one of surprise, rather than a request for information. But in English, as in most languages which have this movement, it only happens once. Hence a doctor might ask about a child showing symptoms of poisoning “What did the child eat when?” or “When did the child eat what?” As shown by Tom Roeper and colleagues, normally developing children mostly understand such multiple WH questions correctly despite their rarity. But this is not so for many children with language issues. Not all languages move the equivalent of Wh words. Some languages like Japanese and at least most Chinese languages don’t move them at all. Other languages, like Hindi and Kurmanji, only move some of them.

• By a mechanism nowadays mostly called ‘agreement’, properties like singularity, plurality, person, gender can be moved or copied from a noun element to a verb element. So in “There are two cups on the table”, the plural form of are copies the plurality of cups, just as the singular form of is copies the singularity of cup in “There is a cup on the table.” And in “I am talking” and “We are talking”, am and are copy the singularity and plurality of I and we in the subject. English makes relatively little use of agreement, much less than most Western European languages. Many languages have more than one form of this. But for the purposes of morpho-phonology, English has three forms of the S sound for singularity in verbs and plurality in nouns, as a plain S in pats, as a Z in pads, and as a syllable in patches, wedges and messes.

• A tensed element associated with the verb is copied to its left by Do support to point up a question in “Did the man hit the ball?”

• In spoken modern English, the negative form not, or a shortened form, is moved to the right edge of whichever element of the verb is used to bear tense, often do, does or did, as in “I do not drink”, or “I don’t drink”, or “I didn’t drink”, always on the left edge of the overall structure of the verb.

• The irreducibly necessary sub-structure traditionally referred to as the ‘subject’ is replaced by an element Moved from a position elsewhere, in “The question was batted aside by the politician” reflecting an interest in what happened to the question rather than the politician’s words. English happens to make extensive use of this device, known as ‘passive’. Not all languages have it.

• By another device an element which would traditionally have been characterised as the ‘object’ is moved left to emphasise its significance as a topic or point of focus, as in “Questions from that journalist I don’t like at all”. This is rather less used in English than in other languages, in English mainly to emphasise some significant judgement.
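The three forms of the S ending mentioned under agreement above follow a simple selection rule, which can be sketched as follows. This is a minimal illustration, assuming that the final sound of the stem is supplied as a rough symbol (not a spelling). The function name and the two sets of symbols are hypothetical.

```python
# Hypothetical sketch: choose the form of the English S ending from the
# final SOUND of the stem, given as a rough symbol rather than a spelling.

# Hissing and hushing sounds, as at the ends of mess, buzz, wish, rouge, patch, wedge.
SIBILANT = {"s", "z", "sh", "zh", "ch", "j"}
# The remaining voiceless sounds, as at the ends of cup, pat, back, cliff, moth.
VOICELESS = {"p", "t", "k", "f", "th"}


def s_form(final_sound: str) -> str:
    """Return 'syllable', 'S' or 'Z' for a stem ending in `final_sound`."""
    if final_sound in SIBILANT:
        return "syllable"  # patches, wedges, messes
    if final_sound in VOICELESS:
        return "S"         # pats
    return "Z"             # pads, and after vowels


for word, final in [("pat", "t"), ("pad", "d"), ("patch", "ch"),
                    ("wedge", "j"), ("mess", "s")]:
    print(word, "->", s_form(final))
```

The order of the checks matters: the sibilant case has to be tested first, since S, SH and CH are themselves voiceless.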

The same Move functionality is involved in the formation of a class of expressions moving elements within the structure of existing words:

• By extracting from some word its stressed vowel and everything else on the right of that, what is known as the ‘stress domain’, doubling the stress domain on the left, and adding an H as the initial consonant, as in hodge podge, hurdy gurdy, helter skelter, higgledy piggledy

• By doing the opposite and doubling two forms with a match only between the initial consonants as in trick or treat

• By stringing together items differing only in the stressed vowel, where the first is always short I and the second is either a short A or short O, as in chit chat, knick knack, riff raff, pitter patter, flip flop, tip top, hip hop.
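The first and third of these patterns are regular enough to be sketched mechanically. The sketch below is a spelling-based approximation only: it assumes that the stressed vowel is the first written vowel of the base word, and the function names are hypothetical.

```python
# Hypothetical, spelling-based sketch of two of the patterns listed above.
# Assumption: the stress falls on the first written vowel of the base word.

VOWELS = "aeiou"


def first_vowel_index(word: str) -> int:
    """Index of the first written vowel, taken here as the stressed vowel."""
    for i, letter in enumerate(word):
        if letter in VOWELS:
            return i
    raise ValueError(f"no vowel found in {word!r}")


def h_reduplicate(base: str) -> str:
    """Copy the stress domain to the left of the base, with an initial H."""
    stress_domain = base[first_vowel_index(base):]
    return f"h{stress_domain} {base}"


def vowel_alternate(base: str) -> str:
    """Copy the base to the left, replacing its stressed vowel with short I."""
    i = first_vowel_index(base)
    return f"{base[:i]}i{base[i + 1:]} {base}"


print(h_reduplicate("podge"))
print(h_reduplicate("skelter"))
print(vowel_alternate("chat"))
print(vowel_alternate("flop"))
```

In both functions the base is the right-hand member, and the left-hand member is derived from it. That matches the description of the stress domain being doubled on the left.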

It is possible to devise a grammar which generates words like where on the left periphery, i.e. without copying or moving anything anywhere. But it is only by copying or moving that it is possible to capture the relation between “Where are you going?” as a simple question and “You’re going where?” as an expression of surprise.

Move gives particular functors particular roles, sharpening clarity, strengthening contrasts. The possibility of copying and moving labeled elements made language more reliably understandable – but at the expense of learnability. The mechanisms by Glue, Label, Wall and Move may have been complex and hard to learn, with the effect that the learner’s best hope of converging on the grammar was by guess work and hoping that the guesses were mostly correct. There was no guarantee of all learners progressing to a point of competence. There was just a wide spectrum of approximation – from good talkers to less good, in a way similar to Basil Bernstein’s theory of restricted and elaborated codes. Competence in the full apparatus of grammar may have been the privilege of only a small élite.


Almost complete modern grammars

The steps up to this point, Glue, Head, Case, Fit, Honour and Move, give the building blocks for words and putting them meaningfully together up to a given level of complexity. But how far the individual learner progresses towards a complete mastery of the system – what Chomsky (1965) characterised as competence – remains an accident. At an evolutionary level this left speakers with a spectrum of different competences – analogous to different levels of skill at ball games or woodwork. But at a societal level in the modern world, this is demonstrably not the case.

Here I am postulating a final step, Phase. But there must be a point before this has started to become evident, as there must have been in human development.

In an arbitrarily observed modern system, only partially developed, it was still a complete accident how much of the system could be generated in any one spoken structure. There was no evidence of any guarantee of any complete system having been learnt.

There was no reason to expect that all speakers had undergone the same apprenticeship.

An almost complete ancient grammar

Unless human language evolved ready-formed by a macro-mutation, there must have been a point before the last step in its evolution, a penultimate point. There is no way of estimating what such an ancient grammar might have generated, other than by speculation. The most we can do is to look at what is generated by modern versions of something equivalent. These include versions of phenomena like the movement of Wh forms like who, what, which, and where, which mostly move to a dominating position on the left edge of a clause in English as in most languages, as in “Who is going to be me?” and in English in a cross-linguistically rather less common way, dragging along the rest of any associated structure “Which of my animals is going to be me?” by a process known as ‘pied piping’. Both the Wh movement and the pied-piping are by ways which have been found of exploiting the general abstract device of Move, and set parametrically in the course of development. As shown in the previous section, these settings are still shaky at five. But there has to have been a time before devices such as pied piping had emerged, or perhaps, could have emerged. At this penultimate point in human evolution, the apparatus may have functioned much more primordially, generating structures like “Need undo knot” without the definiteness by the in “The knot”, with no clear specification of who is expected to do the undoing, the speaker or someone else, and with no to defining this lack of what is called ‘finiteness’. This penultimate evolutionary stage may have been characterised by any degree of competence from “Need undo knot” to “It needs to undo the knot” or perhaps some slightly greater degree of what Chomsky (1965) characterised as ‘linguistic competence’. But in the light of Carol Chomsky’s results, modern children’s vagueness with respect to the controller seems telling.

Premodern human language is likely to have been at a similar penultimate point, both premodern humans and modern children being limited in their language by the fact that there was nothing to guarantee that a point of complete competence had been reached. By contrast, for the overwhelming majority of modern adults, all the relevant parameters of language variation get correctly set for a given target language in some finite amount of time – what is known as ‘finite learnability’. This point is reached confidently, and seemingly effortlessly. But for a premodern population as for modern children at the penultimate stage of speech and language the skills may have been as unequally spread across the population as in singing, athleticism, craftsmanship, and so on. In the modern world, pedagogues try to ensure that their students don’t split their infinitives, as “To rightfully enter a country”, or begin sentences with and or but, or end sentences with prepositions, and nowadays always make sure to derive isn’t it by matching the auxiliary with the person and tense of the main verb with a reversed polarity. But the evidence of language history suggests that their efforts are as futile as Canute telling the tide to stop rising. The population reaches competence without pedagogic hindrance or any other sort of ‘help’. Lack of skill may have been as stigmatised as deafness once was, and to a degree, still is. Neanderthal speech and language were probably limited in this way – adequate for instruction up to a given level of dexterity, but not reliable in the heat of battle or at critical moments.

It seems likely that there was one significant difference between premodern humans and modern children, if ever they could have been compared. With the separate accessing of the various functions, the premodern language is likely to have been significantly slow in a way that the speech of modern children is evidently not.

There was potentially great advantage for the population in ensuring that speech and language skills were more evenly spread. But evolution is not driven by the interests of populations. As Dawkins points out, the driving force is from the lowermost point of advantage – that of single individuals. But it is not obvious how finite learnability could have been reached. Let us suppose for the sake of a thought experiment that some individual was able to progress to an uncommonly complete mastery of all the functions by Glue, Head, Sort, Fit, Honour and Move. Such an individual may have been precociously advanced as a child language learner and progressed to uncommon proficiency as an adult. But how might this chance exceptionality be both visible and defined in such a way that it was likely to be selected for?

By the proposal here, this exceptionality was defined on the rather subtle function which is generally characterised as Phase.

Originating in a 1986 proposal by Chomsky, Phase ensures that all the assembly functions take their effect in compact packages. By conceptual necessity there are two interfaces, one where a structure is organised into a sequence of consonants and vowels with particular timings and intonations for the purpose of pronunciation and the other where it is interpreted or understood. By what is known as the ‘architecture’ of the system, a structure is ‘spelt out’ or ‘despatched’ to these two interfaces. Interestingly, one of these compact packages defines the propositionality, the point of truth or falsity. Another defines the ‘force’ of the structure in a conversational sense, at the point of use as a statement or a question, with attention directed to one element or another. By this conception, each Phase of Spell encapsulates a chunk of what has been defined up to that point. Obviously, as soon as this happens, whatever is being shipped around the various organs of the vocal tract for pronunciation or taken apart for semantic analysis, it can’t be processed any more.

An innovation with two components

By the proposal here, Phase has two components, the bringing together of the operations by Glue, Head, Sort, Fit, Honour and Move, and narrow locality of the separate instances of Spell out.


§ Of X and Y in Z, if there is more than one exemplar of X, xi and xj omit xi

Phase runs at three tempos, at high speed for speech by an order of magnitude faster than by introspection, an order of magnitude slower for the purpose of acquisition, and hundreds of orders of magnitude slower still by the process of evolution.

Phase discards the core proposal of Nunes (2002) which, like the notion of Merge, is by a single step. Either necessarily entails a macro-mutation. I replace this by the notion of a (small) series of (large) steps, with Sherlock as the last.

Introspection may capture the mindset of someone trying to execute some difficult and potentially dangerous task, in a world without modern infrastructures, preparing to ford a rapid as a whole family, or by an everyday modern routine, carrying out a pre-flight check on an airliner.

At the opposite end of the scale, Sherlock reduces what the child learner hears as a first language to parametric values, starting with the simplest things, like whether the language has “In bed” like English or “Bed in” like Japanese, then at around two and a quarter deciding whether to say “I fell down” or just the equivalent of “Fell down” where the target language is Italian. And so on, for the next ten years or so, finishing with what are known as ‘control verbs’ like promise. The eight year old child may think that “I promised the cat to be good” means “I got the cat to promise that she would be good”.

Linguistically, Phase relates to both meanings and forms. It reinforces the role of unpronounced elements in grammar, and seeks to remove any possible doubts about relations within a structure. It checks for incompleteness of all sorts including the triggers, sources and destinations of changes. It either leaves or uncovers any relics of movement. And it fills in gaps, as in the case of most stops, like T, P, and K, where there is none of the stridency which defines the right edge of patch and badge. This is not a waste of time. Stridency is sometimes relevant – by the insight of Rubach (1994).

It may be that children have only a vague idea of these things, or they may have no idea at all, until they incorporate this step into their developing grammars. In a different sense of the word step, Sherlock proceeds in steps as given by the terms of the derivation.

But how these steps should be formulated and fit together in a biologically readable way is the main research question here.

It seems to me that there are currently three relevant areas of linguistic research – in, around, and sometimes against the Minimalist Program.

The second area of research concerns the compactness of the relation between semantics and syntax, the limits on the scope of prepositions and other elements by what is known as ‘binding’, the special roles of reference and event structure with words like want, ask, tell and promise, by Hagit Borer (2005), Gillian Ramchand (2008), Cedric Boeckx, Norbert Hornstein, and Jairo Nunes (2010), Idan Landau (2013), Maria Ines Corbalan and Giulia Terzian (2021).

The third area of research concerns what is known as ‘ellipsis’ as in the “I do” of the marriage vows and “Me too”, as a slogan. Ellipsis allows pithiness and brevity. When somebody volunteers, “I am ready to go,” one possible response would be, “Do you really want to?” not pronouncing the word go. But the ellipsis leaves a trace. Without the ellipsis the response might be “D’ya really wanna go?” With the ellipsis, the word to has to be pronounced, and can’t be reduced, as by “I really want to”. The reduction is blocked next to the trace.

In some combination these three areas of research seem capable of yielding the final step of linguistic evolution.

• By encapsulating the syntax and semantics, reducing the combinatorics to the point that speech and language learning could proceed reliably to a finite conclusion.

• Preserving the exact history and logic of a derivation, including unpronounced ‘empty categories’ like the trace residues of Move.

• The pithiness resulting from ellipsis is highly visible and thus advantageous in the mating game.

But however this functionality works, it works by orders of magnitude faster than by some archaic cognition and slower than by modern ontogeny.

Jason Merchant (2016) notes that ellipsis is reflected in different ways in different languages and cultures, but to some degree in all languages and cultures.

The complexity is illustrated by cases like these (characterising one aspect of the moment of truth in capitals), and spelling out the antecedent below:

DISOBEDIENCE I told you to paint yourself. Not your neighbour.

I told you to paint yourself. I did not tell you to paint your neighbour.

By the ellipsis, the speaker elides the subject and verb phrase of one clause and the verb of a subordinate clause.

GREED I see he helped himself. I wish I had too.

I see he helped himself. I wish I had helped myself too.

By the interpretation that the speaker regrets not having helped him or herself, the person of the reflexive himself in the antecedent is set aside.

There is a powerful device at work here. But as Merchant notes, the search for an exhaustive generalisation is still very much on.

Phase and speech

The checking effect of Sherlock is particularly significant in relation to the ontogenetic development of speech. By the effect of Glue and Label, as I am calling the two most basic functionalities of Minimalist Merge, the full inventory of features can be deployed. One feature, which Chomsky and Halle (1968) call ‘Strident’, defines and characterises Greek theta, Castilian Spanish cerveza, and English three. Contra Chomsky and Halle, the feature seems to be best defined by an extreme lowness in the aperiodic noise, critically lower with the tongue between the teeth than with the lower lip against the upper teeth. Greek and some varieties of English and Spanish opt to add it to the feature matrix of some phonemes. Although it is possible to distinguish the four sorts of English consonants which use a continuous airstream by the ordering of the settings with respect to this property, the articulator, and the tightness of the closure, the acoustic contrasts may be insufficient to avoid confusion. The feature Strident sharpens the contrast. Link checks the feature matrix of a segment, and detects the absence of one member.

In the case of T, the least marked consonant in English and most languages other than those with very small inventories, the particular pronunciations are given by separate steps, defining the articulation by the tip of the tongue, the short period of closure and complete release, the complete closure of the nose, the relation in time between this action and an adjacent vowel. This process of ‘derivation’ builds the phoneme as pronounced. But at a given point in the process of sound formation, each phoneme is defined. This goes on all the time in productive speech and in reverse in processing speech, but so fast that speakers are unaware of any such process. 

Take the affricates at the beginnings and ends of the words church and judge. Languages are likely to differ in how such segments are defined. Phonological specificities surface in:

• The differences between English R, the Russian and Scottish burred R, French and German back-of-the-mouth R, and Spanish R with what is known as a ‘tap’;

• The difference between Russian T and D and T and D in the languages of Western Europe;

• The differences in the timing of ‘unvoiced’ or ‘voiceless’ stops in the North of England and Ireland, London, Paris, and Southern France.

There are only so many logically possible sequences in the building of phonological features. Some are easily heard and articulated with only minor variations, with the effect that neighbouring languages like English and French can and typically do have very similar settings for phonemes like T – though not exactly the same, with Russian T more different (articulated with the tongue tip against the points of the teeth). From the work of the Russian linguist, Nikolai Trubetskoy, these combinations of settings are often characterised as ‘unmarked’.

An angle on stammering

There is negative evidence of Spell out from the disorder of stammering. In 1994, in the opening paper at the International Child Language Seminar, I proposed a way of reconciling psychological and physical aspects of stammering by a defect with respect to a buffer, already proposed by other authors. The notion of a defective buffer helps to explain three things. First, stammering is never attested on the first words, but only when language is already in full development, most often between the ages of three and six, but sometimes later. Second, the speech of normal speakers is massively disrupted by hearing themselves speak over a given delay. In the terminology of those who treat stammering, they block violently. For most speakers, the greatest effect is from a delay of around a third of a second. This is known as Delayed Auditory Feedback, or DAF. But for most stammerers, the effect is reversed: they become more fluent. Third, there is the equivalent of stammering amongst native users of American Sign Language, albeit with only one tenth of the prevalence of stammering in those using voice (America is the only country with enough native signers to get reliable statistics about a phenomenon which only occurs at a rate of one or two per thousand signers).

The greatest utility of such a buffer is in relation to the left periphery, evidenced in language after language, allowing Wh words to be correctly understood as in sentences like “Where do you think they said we might have put the car keys?” The buffer stores a left shifted element until it can be interpreted.

The timing of the buffer is hard-wired, a special adaptation for language with the power of moving elements leftwards. It does not vary across languages. For stammerers, DAF restores the normal effect of the buffer. But the buffer itself, as an expression of Spell out, is universal.

Making us human

By far the most significant effect of Sherlock was on learnability. By virtue of the checking, all the variables become reliably visible, making language finitely learnable. Thus by the proposal here, it was Sherlock which made humans what we are today, at least almost uniformly smart when it comes to learning to talk.

For better or for worse, over a period possibly greater than two million years, advances in our primary means of communication have contributed significantly to the growth and success of humankind.

By a conjecture here, the most visible effect of Sherlock was on ellipsis, making it possible for pithy speech to be understood, sexually attractive today, as it presumably was 200,000 years ago.

At this point in human development, with Sherlock ensuring that there was such a thing as a clear and generally agreed meaning or set of meanings, it made sense to think of shared entitlement and responsibility, in a way which would not have made sense previously. But the functionality which makes speech and language finitely learnable does not resolve all of the issues. It does not work perfectly for everyone.


If there was, as I contend here, a sequence of evolutionary events leading up to modern speech and language, this forces the conclusion that there must have been corresponding stages of ‘protolanguage’. But while the entire modern apparatus is shared by the whole of the modern human population, it is still developmentally vulnerable.

The usual course of events

First there is a long period during which the child says only single words or what Martin Braine called ‘holophrases’ – expressions which sound like they might contain more than one word – but not occurring on their own – like Ozah, as an expression of apparent curiosity, possibly modelled on “What’s that?”. Then, typically sometime between 18 and 21 months, words start to be put together. The child says something like “duck bath” with two elements relating to two significant entities in the child’s universe, such as, in this case, duck, and the duck’s place, in this case bath, Glued together, in a primitive prototype of a phrase or sentence. Then as I discovered when I was doing the research for my MA in 1976, between a week and two months after saying something like “Duck bath” most naturally interpreted as a simple ‘declarative’, as such structures are known, commenting on some apparent relation between the duck and the bath, but crucially with duck as the head, the child either asks a question like “Where duck?” or answers a fully formed corresponding question by an adult like “Where’s your duck?” by an appropriate and plausible reference to place, possibly by a single word. But never in the opposite order. In other words, two word declaratives always precede one word answers to questions, even though the one word answer might seem simpler. In a period of up to two months, sometimes only a week, the child starts to understand questions. This happens to the ‘WH’ forms, who, where, and so on, correctly applied to the right edge of the respective structures.

But this is only the beginning. In “Duck bath” there is no evidence of any awareness of definiteness in the references to either the duck or the bath. And in “Where duck?” there is no evidence of a copy on the right edge, but just a destination position by Move.

Most of the apparatus for syntax and phonology is known by most children by the age of five. But as Carol Chomsky showed in 1969, there are subtle aspects of the grammar which are not mastered until nine or so. In 2002, I showed that for most children, the final stages of phonological acquisition are still normally in process at eight or so.

The sense of ordered disorder

Characteristic incompetences, commonly described as the ‘processes’ of child speech, are mostly with respect to Sherlock, with derivations concluding too early in fronting, stopping, and too late in calculator as KALTALATOR. In cardigan as KARDINTON, there is indeed a sequence of steps, changing the G to a D, losing the voicing of the stop, and copying the nasality of the final consonant, one syllable to the left. But these steps are all very late, at the end of the derivation, inappropriate additions to it, all involving tongue tip articulations, after the stress has been assigned.

In non-pathological speech by many normally developing children of five, six and seven, hospital is commonly mispronounced as HOSTIPU and spaghetti as BASKETI or PSKETI. In all of these cases, elements of the phonemic structure are copied incorrectly.

In hospital competently pronounced, the tongue tip T at the beginning of the final syllable contrasts with whatever is left of the L sound. This is often characterised as ‘syllabic’ because it works as a stand-alone syllable without an independent vowel. What is left consists mainly in a lip rounding gesture similar to the vowel in pull, put or book. But the native speaker knows that the origin of this is a tongue tip L, as evidenced in hospitalise with the L now at the beginning of a syllable.

In hospital as HOSTIPU, the T and P are reversed by what is known as ‘metathesis’. The tongue tip gesturing of the L is partially or completely lost in favour of a lip gesture, triggering a matching change at the beginning of the syllable. There are three steps here. First the lip-rounding of the final syllable is exaggerated. Second the lip action of the P is copied rightwards to the onset of the final syllable. Third, what is left behind at the start of the second syllable is a stop without a defined articulator. This is then said as a T.

In spaghetti, the child’s system may reject the SP cluster at the beginning of an unstressed syllable before the stressed syllable on the grounds that there is no other such word in the child’s vocabulary in contrast to the numerous cases like spy, spare, spit. So the S moves to the beginning of the stressed syllable, and the G loses what is known as its ‘voicing’ to match that of its new neighbour, becoming a K. The structure is now more familiar except that the P has been left behind. It usually becomes voiced as a B. But in some children’s speech, as the S is moved, the initial unstressed vowel is left unrealised. And an initial cluster of PSK is formed in a pronunciation as PSKETI. Nobody would call this a natural way of making the word easy to say. But it has an easy derivation by one incorrect application of Move.
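The two derivations just described can be laid out step by step. The following is only a sketch, using rough spelling-like symbols rather than a phonetic alphabet and a fixed division into three syllables. The function name and the two small lookup tables are hypothetical.

```python
# Hypothetical sketch of the two child derivations of "spaghetti" described
# above, using rough spelling-like symbols rather than a phonetic alphabet.

DEVOICE = {"g": "k"}  # the G loses its voicing next to the S
VOICE = {"p": "b"}    # the P left behind is usually voiced


def derive_spaghetti(drop_first_vowel: bool = False) -> str:
    # Syllables as lists of sounds; the second syllable carries the stress.
    unstressed, stressed, final = ["s", "p", "a"], ["g", "e"], ["t", "i"]

    # Step 1: the S moves from the unstressed syllable to the start of the
    # stressed syllable.
    unstressed.remove("s")
    stressed.insert(0, "s")

    # Step 2: the G loses its voicing to match its new neighbour.
    stressed[1] = DEVOICE[stressed[1]]

    if drop_first_vowel:
        # Variant: the unstressed vowel is left unrealised, giving PSK.
        unstressed.remove("a")
    else:
        # Usual outcome: the P left behind becomes voiced as a B.
        unstressed[0] = VOICE[unstressed[0]]

    return "".join(unstressed + stressed + final).upper()


print(derive_spaghetti())                       # the BASKETI form
print(derive_spaghetti(drop_first_vowel=True))  # the PSKETI form
```

Both outcomes share the first two steps and differ only in what happens to the material left behind in the first syllable.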

On an alternative ‘process account’ BASKETI is commonly described in terms of ‘migration’. But this assumes a ‘process’ with only one common exemplar. It is more parsimonious to postulate a general Move functionality which is justified independently in competent speech and language. On such reasoning, a process account is rejected here.

Many incompetences involve incorrect adjustments of contrast within a structure, typically a word or phrase. ‘Assimilation’ reduces the contrast between two elements, as in “Good morning” in normally competent adult speech as GOOB MORNING. ‘Dissimilation’, much less common in competent speech, increases the contrast.

Many listings of supposed ‘processes’ in child speech do not mention dissimilation even though it is in fact quite common, represented here or there in the speech of most normally developing children between four and eight years old, as I showed in Nunes (2002).

Many normally developing five year olds mispronounce magnet with an apparent assimilation as MAGNIK. Here the back of the tongue articulation of the G in what is known as the ‘coda’ of the stressed syllable is copied into the tongue tip T coda of the final unstressed syllable, without being lost at the point of origin. The two codas contrast in their ‘voicing’, or the time relation between the release of the closure by the tongue. The effect is one of assimilation.

In the speech of normally developing children, in little and middle as LIKU and MIGU, even though there may be no overt tongue tip gesture, there is nothing in the child’s experience of English to suggest that there could be a word ending with the vowel in full or pull. The presence of the L is signalled in forms like fully and pulling in which it is at the beginning of a second syllable. And the child’s system retraces the history of the final U sound back to its origin as L, and increases the contrast by moving the tongue articulation back to K or G, with a dissimilatory effect in other words.

There is a similar dissimilatory effect in the speech of children of seven or eight, who mispronounce monopoly as MONOKOLI. Here the environment is very narrowly defined with a lip action M before a tongue tip N, a lip action P after a stressed vowel with lip-rounding, and an L in the final syllable, capable of becoming a rounded vowel in other circumstances. Here the replacement of P by K has the same dissimilatory effect as small children’s LIKU.

Even some apparently complex errors common in children’s speech can be reduced to the effect of minor misstatements of general, independently justified evolutionary principles, giving step by step increases in the possible complexity of syllables and forms of stress contour, both unusually complex in English. Failure is possible at any point along this pathway which gets increasingly intricate as it proceeds.


The natural focus of treatment is to guide the child’s discovery of whatever is for him or herself the missing or not well-enough defined section of the pathway.

The child is given a sequence of nonsense words to repeat, each different enough from the last to avoid confusion, but similar enough to make the sequence logical. The degree of difference will, of course, vary from child to child. But in a very significant way, children with any sort of difficulty in the formation of words and sentences tend to have a very marked difficulty in detecting any sort of relations between the sound patterns of words – between, say, hippopotamus and HEPPOPUTAMUS, both with the obvious stress contour, with primary stress on the third syllable and a secondary stress on the first syllable.

For a child with a difficulty, such a sequence might be anything from 30 to 100 items long. The clinical aim is to minimise any stress for the child by ensuring that as far as possible every item is said correctly. This means starting somewhere ‘below’ the point at which it is thought that the difficulty may be kicking in, which requires some informed guesswork.

Many children have a small difficulty saying cardigan, typically replacing the G by a D, in a way that is easily understood. Interestingly, many children who say cardigan this way have no difficulty with crocodile, where the K sound in the middle and the D differ with respect to what is known as the ‘voicing’. That small difference seems to be critical, making crocodile easier to say. But now suppose the child says cardigan not with a D at the beginning of the last syllable but with a sound vaguely like an L and the final sound vaguely like the final sound in ring. Here the G is clearly influencing the sound at the end of the syllable, which is itself influencing the sound at the beginning. Even though the syllable is unstressed, such a pronunciation can be hard or impossible to understand unless the hearer happens to know in advance or be able to guess what the word is supposed to be.

Taking the difficulty a long way down, it might be possible to pronounce the sequence K _ N _ K at the beginnings of the syllables, with stress on the second, with no final N at the end of the last – as something like KENOKER, follow that with KELARKY, make various similar substitutions and then try a variety of forms with the stress on the first syllable, and only then start adding a final N, as CORDERKIN, perhaps. Then after maybe 30 or 40 trials we might have a go at cardigan. Sometimes this is then completely correct – presumably for the first time. Sometimes this becomes correct only a few days or a week later – with no effort at a follow-up or home practice. Typically as soon as such a criterion has been achieved, it is maintained. Very occasionally the child is aware of the change, sometimes commenting something to the effect of “Gosh, I can say that now”.

Then the process is repeated with other words, all the while varying stress pattern and the sound structure in controlled steps.
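The requirement that each item be different enough from the last, but still similar enough to keep the sequence logical, can be illustrated with a small sketch. This is only an illustration under loud assumptions and is not part of the clinical procedure: it measures difference by a crude spelling-based edit distance (Levenshtein distance), it ignores stress and featural structure entirely, and the function names and the min_step and max_step thresholds are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two spellings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # delete ca
                            curr[j - 1] + 1,       # insert cb
                            prev[j - 1] + cost))   # substitute or match
        prev = curr
    return prev[-1]


def check_sequence(items, min_step=1, max_step=3):
    """For each adjacent pair, report the distance and whether it is within bounds.

    min_step and max_step are hypothetical thresholds; in practice the
    acceptable degree of difference would vary from child to child.
    """
    report = []
    for prev_item, next_item in zip(items, items[1:]):
        d = edit_distance(prev_item.upper(), next_item.upper())
        report.append((prev_item, next_item, d, min_step <= d <= max_step))
    return report


if __name__ == "__main__":
    # Items taken from the example sequence in the text, leading up to cardigan.
    sequence = ["KENOKER", "KELARKY", "CORDERKIN", "CARDIGAN"]
    for row in check_sequence(sequence, max_step=5):
        print(row)
```

A pair flagged as out of bounds would suggest that an intermediate item is needed. A measure closer to clinical practice would compare sound structure and stress pattern rather than spelling.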

It might seem that this promises only one, two or perhaps three newly correct pronunciations per clinical session. Not very satisfactory progress with a child of four, five or six whose speech is mostly incomprehensible, with thousands of mispronunciations needing to be corrected. But the progress goes beyond the work that is done in the clinic. By working around a constellation of what are known as ‘featural’ and ‘prosodic’ structures, the child explores for him or herself the freedoms which exist. The newly resulting skills are highly transferable. And the speech increases exponentially – faster than by working on the ‘processes’ like the exchange of features between the G and the N in that child’s pronunciation of cardigan.

Unbeknownst to me, Shulamith Chiat (1983) was looking at the relation between word stress and the replacement of T by K at the same time that I was starting to develop the approach to treatment described here – what I call ‘Possible words treatment’. Brett, Chiat and Pilcher (1987) compare the realisation of common words and nonsense words with various stress patterns, concluding that the unfamiliarity of the unfamiliar words does not seem to bear on the pronunciation. Chiat (1987) identifies the key variable as the ‘foot’ in the first two syllables of accident and hospital.

This is quite different from the dogmatic approach by ReST, also using nonsense, polysyllabic forms, but with the focus on feedback and correction, rather than discovery by the child. (See Ballard et al (2010), Murray et al (2012), Thomas et al (2014), claiming all the discovery and invention for themselves with no reference to the previous work by Chiat and her colleagues and myself.)

Particular vulnerabilities

Because speech and language sit on a highly structured genomic component, they are vulnerable in corresponding ways, with the greatest vulnerabilities with respect to the most recent evolution, less stable across the population than those components by earlier evolutionary steps. Most common speech disorders are with respect to Sherlock and the misuse of Move by the step before that. Stammering appears to involve a buffer acting as the neurological correspondence of Sherlock. Autism appears to compromise the pragmatic apparatus and thus the pronominal system by Fit, more phylogenetically primitive and correspondingly more resistant to intervention.


Against the proposal here, it might seem possible in principle that speech and language evolved separately on different continents like the evolutions of flight in insects, fish, reptiles, birds, and mammals. But this is vanishingly unlikely. All modern languages show similar residues of the evolutionary steps proposed here. On the simplest assumption, speech and language evolved just once.

Against my proposals here, there are claims of languages having been found that falsify one or more aspects of linguistic universality, one in Indonesia, and two in Brazil. David Gil claims that in what he calls Riau Indonesian there is no clear distinction between nouns and verbs. This would be consistent with the language not having progressed beyond the step by which structures set a syntactic label. But from the limited age range of his subjects, it may be that what Gil is observing is the self-styling of a sub-culture, rather than a language.

Daniel Everett claims that in 25 years studying the language known as Pirahã, spoken by one previously uncontacted Brazilian tribe of fewer than 500 people, he has never heard a sentence like “I know you think I’m wrong” or “I think that you know I’m right”. Such sentences are built out of clauses ‘embedded’ inside one another. In these cases, “I’m wrong” and “I’m right” are embedded in “You think I’m wrong” and “You know I’m right”. And these are embedded recursively in another, yet higher level of structure. Everett takes his failure to observe this as evidence for his claim that such sentences cannot be formed in principle because Pirahã does not allow recursion. If so, recursion is not an intrinsic property of human language. And cultures vary in whether it is possible to discuss doubt, error, or suspicion with any precision. But Andrew Nevins et al (2009) have found alternative analyses of Everett’s data, suggesting that Everett may have been mistaken in his conclusions. I personally find it unimaginable that any culture or society could exist without being able to say precisely who is right or wrong about what.

Other than claims such as those by Gil and Everett, mistaken in my view, there is no evidence of any language not evidencing the whole succession of evolutionary steps I am postulating here, with part of the inheritance biological, and no modern human population without this inheritance.

The hypothesis here is quite different from the straw-man idea of human beings being born knowing how to talk, which is sometimes advanced to rubbish the idea of any sort of biological inheritance.

Making things easy?

Not really, but the complexity is just the way things are. Speech and language don’t fossilise. Some aspects of proto-language at some stage of its evolution may be detectable in modern language. But these are likely to be highly mediated by modern language.

Recordings go back no further than the late 19th century. These betray considerable change over the last 130 years. Reconstructing how English sounded at the time of Dickens, Shakespeare, Chaucer or King Alfred, 150, 400, 600 or 1,100 years ago, relies entirely on the written word, errors, local variations, and the analysis by contemporaries or near contemporaries.

Reconstructing the development of stages of proto-language is incomparably harder. The proposals here are just the first stage of a corresponding research program.


Abels, Klaus (2012) Phases: An Essay on Cyclicity in Syntax. Berlin: Walter de Gruyter

Abramova, Ekaterina (2018) The role of pantomime in gestural language evolution, its cognitive bases and an alternative. Journal of Language Evolution, 3, 1, 26–40.

Archangeli, Diana (1984) Underspecification in Yawelmani Phonology and Morphology, MIT, PhD Dissertation.

Austin, John Langshaw (1955) How to Do Things with Words. Harvard University Press

Baldwin, James Mark (1896). A New Factor in Evolution. The American Naturalist. 30 (354) 441–451

Baldwin, James Mark (1897). Organic Selection. Science. 5 (121): 634–636.

Ballard, Kirrie, Don Robin, Tricia McCabe & Jeanie McDonald (2010). A Treatment for Dysprosody in Childhood Apraxia of Speech. Journal of Speech, Language and Hearing Research (Online), 53(5), 1227–1245

Bergström, Anders, Chris Stringer, Mateja Hajdinjak, Eleanor Scerri & Pontus Skoglund (2021) Origins of modern human ancestry. Nature 590. 229–237.

Berwick, Robert & Noam Chomsky (2016) Why Only Us: Language and Evolution. MIT Press

Besenbacher, Søren, Christina Hvilsom, Tomas Marques-Bonet, Thomas Mailund, Mikkel Heide Schierup. (2019) Direct estimation of mutations in great apes reconciles phylogenetic dating. Nature Ecology & Evolution

Boeckx, Cedric (2021) Reflections on language evolution: From minimalism to pluralism, Berlin: Language Science Press.

Boeckx, Cedric, Norbert Hornstein & Jairo Nunes (2010) Control as Movement. Cambridge: Cambridge University Press

Borer, Hagit (2005) In Name Only. Oxford: Oxford University Press

Borer, Hagit (2005) The Normal Course of Events. Oxford: Oxford University Press

Bowen, Caroline (2005). What is the Evidence for Oral Motor Therapy? Acquiring Knowledge in Speech, Language and Hearing, Speech Pathology Australia, 7, 3, 144-147.

Braine, Martin (1962). On Learning the Grammatical Order of Words. Psychological Review 70. 323–48.

Brenner, Sydney (1995) Loose end. Current Biology 5(1). 94.

Brett, Linda, Shulamith Chiat & Christine Pilcher (1987) Stages and units in output processing: Some evidence from Voicing and Fronting Processes in Children. Language and Cognitive Processes: 3.4: 165–177

Brown, Christy (1954) My Left Foot. Vintage

Brown, Penelope & Stephen Levinson (1978). Universals in language usage: politeness phenomena. In: Esther Goody (ed.). Questions and politeness. Cambridge University Press.

Burling, Robbins (2005) The Talking Ape: How Language Evolved. Oxford: Oxford University Press

Chiat, Shulamith (1983) Why Mikey is right and My Key is wrong: The significance of stress and word boundaries in a child’s output system. Cognition 14: 275-300

Chiat, Shulamith (1989) The relation between prosodic structure, syllabification, and segmental realisation: Evidence from a child with fricative stopping. Clinical Linguistics and Phonetics: 3.3. 223–242

Chomsky, Carol (1969) The Acquisition of Syntax in Children from 5 to 10. Cambridge, Massachusetts: MIT Press

Chomsky, Noam (1955/1975) The Logical Structure of Linguistic Theory. Plenum Press.

Chomsky, Noam (1957) Syntactic Structures

Chomsky, Noam (1959) A Review of B. F. Skinner’s Verbal Behavior. Language, 35, 1, 26-58.

Chomsky, Noam (1965) Aspects of the Theory of Syntax

Chomsky, Noam (1986) Barriers. Linguistic Inquiry Monograph

Chomsky, Noam (1995) The Minimalist Program. The MIT Press

Chomsky, Noam (1999) Derivation by Phase. MIT Occasional Papers in Linguistics. No. 18.

Chomsky, Noam & Morris Halle (1968) The Sound Pattern of English, New York, Harper and Row

Clements, George Nicholas (1985) The geometry of Phonological Features. Phonology Yearbook 2. 225–252.

Comrie, Bernard (1976) Linguistic Politeness Axes: Speaker-Addressee, Speaker-Referent, Speaker-Bystander. Pragmatics Microfiche 1.7: A3. Department of Linguistics. Cambridge: Univ. of Cambridge

Corbalan, Maria Ines & Giulia Terzian (2021) Simplicity of what? A case study from generative linguistics. Synthese 198 (10): 9427-9452.

Corballis, M. C. (2003) From Hand to Mouth, The Origins of Language. Princeton University Press

Cruttenden, Alan (1978) Assimilation in Child Language and Elsewhere. Journal of Child Language 5, 373–378.

De Boer, B., Thompson, B., Ravignani, A., & Boeckx, C. (2020). Evolutionary Dynamics Do Not Motivate a Single-Mutant Theory of Human Language. Scientific Reports, 10(451), 1-9.

Dediu, Dan & Stephen C. Levinson. (2013) On the antiquity of language: The reinterpretation of Neandertal linguistic capacities and its consequences. Frontiers in psychology 4. 397.

Dediu, Dan & Stephen C. Levinson. (2018) Neanderthal language revisited: Not only us. Current Opinion in Behavioral Sciences 21. 49–55.

Dobzhansky, Theodosius (1937) Genetics and the Origin of Species

Dugatkin, Lee Allen & Lyudmila Trut (2017) How to Tame a Fox: And Build a Dog. Chicago: University of Chicago Press

Fisher, Simon (2014) Translating the genome in human neuroscience. In Gary Marcus & Jeremy Freeman (eds.), The future of the brain, 149–158. Princeton: Princeton University Press.

Fisher, Simon & Sonja Vernes (2015) Genetics and the language sciences. Annual Review of Linguistics 1(1). 289–310.

Gardner, Robert & Beatrice Gardner (1969) Teaching Sign Language to a Chimpanzee. Science 165, 3894: 664-672

Goodall, Jane (2009) Sowing the seeds of hope https://www.youtube.com/watch?v=Vr350j7Ya5E

Goodluck, Helen (1991) Language Acquisition: A Linguistic Introduction. Oxford: Blackwell Publishers.

Graf, Thomas (2014) Beyond the apparent: Cognitive parallels between syntax and phonology. In Carson T. Schütze & Linnaea Stockall (eds.) Connectedness: Papers by and for Sarah van Wagenen, vol. 18 UCLA Working Papers in Linguistics: 161–174.

Hickok, Greg (2021) Beyond Broca: Architecture and Evolution of a Dual Speech Control Model

Johansson, Sverker The Talking Neanderthals: What Do Fossils, Genetics, and Archeology Say? Biolinguistics

Karmiloff-Smith, Annette (1979) A Functional Approach to Child Language: A study of Determiners and Reference. Cambridge University Press

Landau, Idan (2013) Control in generative grammar: A research companion. Cambridge: Cambridge University Press

Lenneberg, Eric (1967) Biological foundations of language. New York: Wiley.

Lof, Greg (2006) Against Non-Speech Oral Motor Exercises. ASHA Convention

McBrearty, Sally & Alison S. Brooks (2000) The revolution that wasn’t: a new interpretation of the origin of modern human behavior. Journal of Human Evolution. 39, 5, 453-563

Macken, Marlys (1995) Phonological Acquisition. In John Goldsmith (Ed) The Handbook of Phonological Theory. Oxford: Blackwell

Martins, Pedro Tiago & Cedric Boeckx (2019) Language evolution and complexity considerations: The no half-merge fallacy. PLoS biology 17(11). e3000389.

Martins, Pedro Tiago & Cedric Boeckx (2020) Clarifications on the no half-Merge fallacy. Lingbuzz. 005432.

Mendívil Giró, José Luis. (2019) Did language evolve through language change? On language change, language evolution and grammaticalization theory. Glossa 4(1). 124.

Merchant, Jason (2016) Ellipsis: A survey of analytical approaches. In Jeroen van Craenenbroeck and Tanja Temmerman, (Eds) A handbook of ellipsis, Oxford: Oxford University Press.

Mineiro, Ana, Inmaculada Concepción Báez-Montero, Mara Moita, Isabel Galhano-Rodrigues & Alexandre Castro-Caldas (2021) Disentangling Pantomime From Early Sign in a New Sign Language: Window Into Language Evolution Research. Frontiers in Psychology.

Mufwene, Salikoko (2008). Language Evolution: Contact, Competition and Change. Continuum International Publishing Group

Murray, Elizabeth, Tricia McCabe & Kirrie Ballard (2012). A comparison of two treatments for childhood apraxia of speech: methods and treatment protocol for a parallel group randomised control trial. BMC Pediatrics.

Nevins, Andrew, David Pesetsky & Cilene Rodrigues (2009) Evidence and Argumentation: A Reply to Everett. Language. 85

Nunes, Aubrey (1976) The period of Base Rule Acquisition. MA Dissertation; University of Essex

Nunes, Aubrey (1994) The emergence of disfluency in children and the skeleton/root distinction in phonology Paper at the International Child Language Seminar. University of Bangor

Nunes, Aubrey (2002) The Price of a Perfect System: Learnability and the Distribution of Errors in the Speech of Children Learning English as a First Language. University of Durham, PhD Dissertation.

Paradis, Carole & Jean-François Prunet (Eds 1991a) Phonetics and Phonology. Vol. 2. The Special Status of Coronals: Internal and External Evidence. Academic Press, Inc., San Diego, California

Paradis, Carole & Jean-François Prunet (1991b) Introduction: Asymmetry and Visibility in Consonant Articulations.

Piaget, Jean (1951) Play, Dreams and Imitation in Childhood. W. W. Norton.

Piaget, Jean (1954) The Construction of Reality in the Child. Ballantyne Books.

Piaget, Jean (1971) Biology and knowledge: An Essay on the Relations between Organic Regulations and Cognitive Processes. Edinburgh: Edinburgh University Press

Ramchand, Gillian (2008) Verb meaning and the lexicon: A first-phase syntax. Cambridge: Cambridge University Press

Reich, David (2018) Who We Are and How We Got Here: Ancient DNA and the New Science of the Human Past. New York: Pantheon

Renfrew, Catherine (1967) Action Picture Test. Routledge.

Ridouane, Rachid (2008) Syllables without vowels: Phonetic and phonological evidence from Tashlhiyt Berber. Edmund Gussmann (Ed) Phonology, Cambridge University Press, 2008, 321-359

Rubach, Jerzy (1994) Affricates as Strident Stops in Polish. Linguistic Inquiry: 25,1. 119-143

Sapir, Edward. (1921) Language. An Introduction to the Study of Speech. New York: Harcourt Brace.

Savage-Rumbaugh, Sue & Taylor J. Talbot (1998) Apes, Language, and the Human Mind. Oxford: Oxford University Press.

Scerri, Eleanor M. L., Mark G. Thomas, Andrea Manica, Philipp Gunz, Jay T. Stock, Chris Stringer, Matt Grove, Huw S. Groucutt, Axel Timmermann, G. Philip Rightmire, et al. 2018. Did our species evolve in subdivided populations across Africa, and why does it matter? Trends in ecology & evolution 33(8). 582–594.

Senghas, Ann & Marie Coppola (2001) Children Creating Language: How Nicaraguan Sign Language Acquired a Spatial Grammar

Stringer, Chris (2016) The origin and evolution of Homo sapiens. Philosophical Transactions of the Royal Society Biological Sciences 371

Tallerman, Maggie (2012) Protolanguage. In Maggie Tallerman & Kathleen Gibson (Eds) The Oxford Handbook of Language Evolution. Oxford: Oxford University Press.

Thomas, Donna, Tricia McCabe & Kirrie Ballard (2014). Rapid Syllable Transitions (ReST) Treatment for Childhood Apraxia of Speech: The Effect of Lower Dose Frequency. Journal of Communication Disorders. 51, 29-42.

Tomasello, Michael. (2010) Origins of Human Communication. MIT Press

Tomasello, Michael & Josep Call (1997) Primate Cognition. Oxford University Press