1. Impairment and evolution

It seems to me that where there are problems with respect to a child’s speech and language it makes sense to investigate how speech and language may have evolved in the human species. By my proposal here, as the first stage of a research program, both speech and language have evolved by an ordered sequence of seven steps. Ray Jackendoff suggests that there have been 15 steps. Others, like Noam Chomsky, propose that there has just been one step. However, it seems to me that seven steps are minimally necessary.

I call these steps Point, Mimic, Glue, Label, Wall, Move and Capsule. I call them what I do for the sake of consistency with one another and for the sake of emphasising how they emerged from more general cognitions, transforming each one in the process.

In separating these steps, in particular, Glue and Move, and emphasising more general cognitions, I am departing from many interpretations of the Minimalist Program initiated by Chomsky and others in the 1990s. The main thrust of the Minimalist Program was to do as much explanation as possible with as little theoretical baggage as possible, goals to which Chomsky has subscribed from his first work in the 1950s. In particular the Minimalist Program dispenses with the contrast between Deep Structure and Surface Structure as ontological realities, although derivation is as crucial as it ever was. The motivation of the new direction here is a better account of speech and language acquisition and how these are sometimes problematic, marginally for perhaps 10% of children and proportionately more so for a smaller percentage. But despite this departure from aspects of the Minimalist Program, my intentions here are still broadly minimalist.

I am not committing myself to any particular course of that phylogeny, interesting as that may be. The cognitions may have evolved independently before their rehearsal in skills such as the attachment of a blade to a handle or the construction of a roof for a dwelling. Or the technique may have been bootstrapped by language. Inventions are helped by talking things through. But for the incorporation of a cognition into language it has to be processed much faster than conscious introspection allows. A utilisation in physical technique cannot plausibly have begun with the linguistic incorporation of a corresponding cognition.

I am therefore proposing that no step on the linguistic pathway required a specific mutation. The skills that were reworked for this go back long before there was any sort of recorded history because there wasn’t the language to record it with.

The most important evidence for these steps is in the ‘fossils’, to use a term from Ray Jackendoff, which these evolutions leave in modern speech and language.

Each step may have taken a thousand or more generations to diffuse across the population, or fixate. The timescale is unknown. But it must have been at least a hundred times slower than the process by which speech and language change in the modern world with older speakers often complaining about the speech and language of younger speakers, not realising that speech and language have just changed in their own lifetimes.

This process of fixation across the population is demonstrably not yet complete, or developmental speech and language problems would not be as common as they are.

But each of these steps of long term development was so advantageous that eventually it would come to define the whole surviving population by ‘speciation’. The key mechanism here is what has become known as the ‘Baldwin Effect’. As members of a population pass on some newly-learnt, useful skill, this contributes to breeding success. Very slowly this becomes part of the genome. These are not forms in the grammar, like the fact that English has “I love you” for what many languages have as “Love you”. These are the underlying basis for such forms, common to both sorts of language, whether or not the I or its equivalent is pronounced in actual speech.

For some reason I don’t understand, there has been much more research into the evolution of language than into the evolution of speech. My proposal here tries to redress the imbalance a very small amount.

Cognitions for skills

While it seems fanciful to conjecture which cognition contributed to which step in relation to speech and language or when this may have happened (because there just isn’t the evidence), we can be sure of some of the skills. Some of these are not human-specific. But those which we share with some non-human species are far more highly developed amongst humans.

• Aiming a throw at a target, where the throw is judged by its accuracy;

• Route-mapping;

• Stone-tool-making;

• Planning an activity as a series of steps;

• Building a home in more than one part, where one part is the roof, a floor, or a stairway.

Each cognition had to be decomposed into its raw essence before it could be re-assembled as a device for speech and language. There is no reason to suppose that this was an instantaneous process. It may have taken generations. Similarly, there is no reason to suppose that the incorporations into language followed immediately from the cognitions. This process of decomposition and reassembly was necessarily slow and complex. Biology does not move fast.

The sequence that I am postulating is based on the way the effects are organised in modern language, and on the way they appear in the speech and language of small children.


By virtue of the Baldwin effect and the speciations, each of the seven steps proposed tentatively here was heritable. Each gave a communication advantage to a fragile and vulnerable population, making it easier to construct defences and survive, to discuss and develop techniques, to plot, groom, befriend, sympathise, and romance.

There is no reason to suppose that this exceedingly complex process happened more than once. It involved:

• Achieving some effect or being able to talk to oneself; 

• Logic and conceptual understanding;

• The underlying basis of the ‘grammar’ in the broadest sense, like the fact that in all languages sentences have subjects; 

• What can be articulated by the vocal tract or discriminated perceptually. 

The changes I am postulating were anatomical, neurological, and cognitive. This does not say anything about how the anatomy, the neurology, and the cognition connected up with one another. Such connections are known to occur in biology. When Dmitri Belyaev bred foxes for docility, the shape of their ears and faces changed too. It is not yet understood how this happens. Of course, in the human lineage there was no deliberate breeding. But the shape of the face was changing very slightly during what may have been a critical period between 300,000 and 200,000 years ago. The forehead had already become more upright, but the chin was becoming more pointed. Chinless is not a flattering way to describe anyone. We are very sensitive to this physiognomic character. And the best talkers may have been fancied. The gift of the gab is still an asset when it comes to finding a mate. Once those two characters of a decisive chin and the gift of the gab may have been more closely connected than they are today.

By 200,000 years ago our last common ancestors were not the whooping, bone-crunching savages from the opening scene of the movie, 2001, but indistinguishable from other modern humans alive today, fully capable of becoming musicians, go-players, programmers, cosmologists, or even linguists – and, of course, being able to become competent native speakers of a modern language.

At a societal level, the functionality of modern language is the wherewithal of every joint venture from a hunt to a start-up. Whether the issue is how to understand a broken twig or a false financial statement, “You might be mistaken” is reliably understandable. In hunting or in business, life or fortune may depend on the warning being clear and reliably understandable.

This view is sometimes challenged. It is possible to reconstruct stages in the development of modern European languages, from what is known as Proto-Indo-European or PIE, as spoken perhaps 5,000 years ago, certainly before the building of Stonehenge. There is key evidence from what is known as ‘grammaticalisation’, with words like le, la, un and une, in modern French having developed from the Latin for this, that, and one. From this evidence, some conclude that PIE was characteristic of a more primitive evolutionary stage in the evolution of human language than its modern descendants. The development from Latin to modern French happened over less than a thousand years. It is not known if there is any pressure in this direction. It seems that it just happens – or doesn’t. It hasn’t happened in Russian, for example, which has done without words for the and a from the time of the first records over a thousand years ago.

There is more direct evidence of grammaticalisation in the linguistic relics of slavery, that historic crime for which there has not, so far, been any remotely adequate atonement. When African slaves were dispersed in the New World, efforts were made to ensure that they were all separated by language, obviously obstructing any sort of resistance. The only common language was that of the new masters. So pidgins developed with words from English, French, Spanish, Portuguese. But a pidgin was not a language. Only the most basic meanings could be expressed. The slaves were encouraged to have children who would become slaves in their turn. But within a generation the children turned the pidgin of their parents into languages, known as ‘creoles’ or ‘patois’. These languages characteristically lack those parts of linguistic structure used in languages like English and French to distinguish nine from ninth. So the traditional Caribbean ceremony after a death is mostly pronounced as nine night without the TH.

But this lack of the TH is not in my view any sort of mark or evidence of primitiveness. It is just a consequence of the criminal circumstances in which these languages developed and the assertion of humanity by the first speakers. Within a thousand years, if human society lasts that long, some of these languages are likely to develop that TH from a form which somehow captures the ordinality.

In my view the remarkable thing about grammaticalisation is that it happens at all, even in the horror and misery of slavery, not that it can take a few dozen generations to take its fullest effect, as we have over the whole 5,000 year development of Indo-European languages from PIE to English, Russian, Greek, Gaelic, French, German, etc.

The evolution of speech and language in the human species is by a process many hundreds of times slower, over hundreds of thousands of years, perhaps one or two million years or more since the beginning of the process when the most our ancestors could do was to howl or yowl, or hoot or pant, to warn of a potentially dangerous predator or to announce a tasty meal. Since the 1969 work of R. Allen Gardner and Beatrix Gardner on teaching chimpanzees 200 or so signs from American Sign Language, a number of other apes have been taught to use various signing systems to a similar level. Sue Savage-Rumbaugh believes that she has taught a bonobo, Kanzi, to a significantly higher level. But to me and many others, while Kanzi is doubtless an outstanding student, his understanding is qualitatively less than that of the normally developing human child of two and a half. Human language is just different from the signing of the most articulate primate. To a human utterance we can respond “I don’t believe you” or “Could you say that more politely?” Human language allows not just precision, but nuance. A four year old in his or her day at school is expected to understand such things. But they would be quite beyond Kanzi.

For this, there are some quite abstract, universally available, tools in the toolbox, available to all speakers of every language – what are often called ‘universals’. Such as: 

• The distinction in all languages between ‘content words’ like hand, sleep, cold, and at, and ‘functors’ like that in “You believe that it’s true?” Functors add little or nothing to meaning on their own. They contribute to meaning by their relation to other elements in the structures of language. They often behave in recognisably distinct ways, losing or changing sounds or hopping over one another, making themselves quite obvious by their grossness;

• A principle known as ‘markedness’ by which linguistic phenomena divide unevenly, with biases and gaps – with most functors featuring the tongue tip articulator, as in T, D, S, N, and with words ending in NK or NG, as in ping, bung, bang, and pong, having just those four vowels, not those in beng or boong – with the vowels in bet and foot. This is a gap.

A puzzle: order in disorder

One of the many aspects of speech that linguistic theory has not yet elucidated is the non-random asymmetry in children’s errors. Errors should be random, not organised into any sort of pattern. Order in disorder is a nonsense, but it is observable for as long as speech and language are developing. In most children this goes on until around the age of eight, plus or minus a year or so. It goes on correspondingly later in children whose speech or language is delayed or disordered. And there are well-evidenced ‘co-morbidities’, or significant overlaps between the children with speech and language disorders and those with a poorly developed sense of what is called ‘metalinguistics’, or the awareness that there is a commonality between the real word hippopotamus and the nonsense word HETTAPUTAMUS.

The anomaly of the asymmetry and the patterning of the co-morbidities both demand an explanation. Is there a significant commonality between these things?

Just as developmental incompetence is asymmetric, there are words which many adults who think of themselves as ‘competent’ speakers find hard to say, words like anomaly, obliterate, monogamy. The linguist, Clare Galloway, called the resulting mispronunciations ‘cloth ear errors’. These are just the mildest effects of the imperfections, so mild that they are not generally counted as aspects of disorder.

If all the incompetences were in one direction or polarity this would point to some external factor. Some incompetences may indeed result from such a factor. In children’s speech, the P in pea tends towards B as in bee, and the B in cub tends towards P as in cup. But not the other way round. As John Locke has pointed out, this particular asymmetry follows from acoustics. The ‘best’ sort of syllable has the sonority concentrated before the vowel, and dying away after it, by the difference between voiced and voiceless consonants, as explained in Sounds and bits of sounds.

But not all of the asymmetry can be explained in this kind of way. As pointed out by Alan Cruttenden, in early child speech the D in doggy is often replaced by G, leading to a common realisation as GOGI. The tip of the tongue articulation in the D at the beginning of the stressed syllable (where it should be obvious) gets matched to the back of the tongue articulation of the G at the beginning of the less salient unstressed syllable. This matching is commonly described as ‘assimilation’. Assimilation the opposite way round, as DODI, in favour of the tongue tip articulator, is almost unattested. But at the same approximate age or stage in speech development, by one of the commonest errors in child speech, the K in key is replaced in speech production by the T in tea. This is known by speech pathologists as ‘fronting’ because T is articulated further forward in the mouth than K. Fronting is perhaps a hundred times as common as ‘backing’ with the replacement going the other way round, with K replacing T. Then a few years later, again in both normal and disordered development, in four words, three of which are not generally expected from small children, we suddenly find what seem like assimilations with tongue tip articulations favoured.

calculator as KALTALATOR

cardigan as KARDIDAN

hippopotamus as HITOPOTAMUS

archeopteryx as ARTIOPTERIKS

But at this level of speech development no child says KALCALAKOR or anything like it in any other word. Interestingly, where this happens there is more than one instance of both the tongue tip-articulation and the vulnerable articulator. The contrasts between articulators are being adjusted, but only where this is the only contrast, between the affected segment and another instance of what ends up being pronounced, where there is no contrast between the stress of the affected syllables, and where there is another instance of both in the word, or at least of the critical ‘features’.

The word is almost ready to say when the tongue-tip articulator replaces another articulator – either the lips or the back of the tongue. 

In other words, it seems that the process of speech development first biases the inventory of speech sounds one way and the pattern of assimilation the opposite way and then when speech has been almost completely mastered, reverses the polarity of assimilation towards the bias by the early inventory. Such a history seems more like playing silly games than normal human development. It does not plausibly reduce to what is known as the ‘articulatory/perceptual’ or AP interface.  
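The two early patterns described above, fronting (key as TEE) and assimilation towards the back of the tongue (doggy as GOGI), can be written down as simple rewriting rules. What follows is only an illustrative sketch: it works on the rough capital-letter spellings used in this section rather than on a proper phonetic transcription, it covers only the stops K, G, T and D, and the function names are invented for the example.

```python
# Rough capital-letter spellings as used in the text, not a phonetic transcription.
VELAR_TO_TIP = {"K": "T", "G": "D"}   # back of the tongue -> tongue tip
TIP_TO_VELAR = {"T": "K", "D": "G"}   # tongue tip -> back of the tongue


def fronting(form: str) -> str:
    """Replace every back-of-the-tongue stop with its tongue-tip counterpart."""
    return "".join(VELAR_TO_TIP.get(ch, ch) for ch in form)


def velar_assimilation(form: str) -> str:
    """Match an initial tongue-tip stop to a later back-of-the-tongue stop.

    Applies only when the word begins with T or D and contains K or G later on;
    otherwise the form is returned unchanged.
    """
    if form and form[0] in TIP_TO_VELAR and any(ch in VELAR_TO_TIP for ch in form[1:]):
        return TIP_TO_VELAR[form[0]] + form[1:]
    return form


print(fronting("KEE"))             # key said as tea
print(velar_assimilation("DOGI"))  # doggy said as GOGI
```

The sketch makes the puzzle visible: the first rule favours the tongue tip, while the second rule, at the same stage, favours the back of the tongue.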

By the proposal here, at least some of the asymmetries in the error distribution are best explained by the last two of the evolutionary steps which I am postulating here. Both are very powerful devices. The last but one gives a great freedom. The last restricts that freedom. But the combination is not easy to learn, though they are easy to misapply. And this leads to predictable error patterns during what Eric Lenneberg in 1967 called the ‘critical period’ – from birth to puberty.

A tax for children

All languages have things which are hard for children to learn – like special taxes for them. The special tax for children learning English is the English syllable. This mostly has a rime or an element which can be rhymed, characteristically beginning with a vowel, what is known as the ‘nucleus’, like the AY in May day, and one or more consonants before the rime, like TR in Trick or treat, or STR, or SPR or SCR, as in stray, spry, screw. There are even greater complexities at the other end of the syllable in what is known as the coda, getting complicated in length and strength with the G as a K in my English, with the TH showing that these forms are forcibly being used as nouns, and even more so in lengths and strengths. And as tabulated in the inventory here, there are at least 18 vowels, and on one possible count, 28, in comparison to the five vowels in many languages. There are many restrictions and what may be accidental gaps. As already noted, some nuclei and codas do not occur together. For reasons I won’t go into here, it is hard to work out how many possible syllables there are in English. But estimates tend to cluster around 5,000. The problem is that there is no certainty that the child learning English hears even one example of every possible syllable during the whole of his or her childhood.
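The division of the syllable described above, into the consonants before the rime, the nucleus, and the coda, can be illustrated with a very crude sketch. It assumes that ordinary spelling is a good enough stand-in for sound, which it often is not, and that the letters a, e, i, o, u and y mark the nucleus. The function name is invented for the example, and words beginning with a consonantal Y, such as yes, would be split wrongly.

```python
import re

# Crude assumption: a run of the letters a, e, i, o, u, y stands for the nucleus.
SYLLABLE = re.compile(r"^([^aeiouy]*)([aeiouy]+)(.*)$")


def split_syllable(word: str):
    """Split a one-syllable word into (onset, nucleus, coda) by its spelling."""
    match = SYLLABLE.match(word.lower())
    if match is None:
        raise ValueError(f"no vowel letter found in {word!r}")
    return match.groups()


for word in ["stray", "spry", "length", "strengths"]:
    onset, nucleus, coda = split_syllable(word)
    print(f"{word}: onset={onset!r} nucleus={nucleus!r} coda={coda!r}")
```

Even this toy version shows how much heavier the English onset and coda can be than the single consonant allowed in many languages.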

A logical problem

The problem is actually a logical one, what might be called ‘the logical problem of speech acquisition’.

No child knows what language he or she is learning or when he or she has heard everything he or she needs to hear in order to have learnt the language. Taking English as just one of several thousand odd languages in the world, the learner has no guidance, no ‘privileged information’ as it is called by learnability theorists, on where his or her target language lies with respect to all the possible variations, like how many syllables there might be. Might there be one more, that he or she has yet to hear?

This is in what Marlys Macken, a specialist in child speech, in 1995 nicely called the ‘learnability space’ – the space of consonants and vowels and nouns and verbs and more, in which speech and language has to be learnt. A strange sort of space, you might say.

Against this background, it is relevant that there is a much discussed language, Tashlhiyt, spoken in North Africa, which allows any consonant to constitute the nucleus of a syllable. It thus allows sentences consisting entirely of consonants like P, T, K, and S, and not a vowel in sight. Like Arabic it has three vowels. It has been the subject of various experiments. Now in English a syllable can end in CT as in pict, duct, tact, but only after short vowels. In “I walked there yesterday” and “He talked the talk and got the job” walk and talk have a tense or long vowel. So the T sound of the ED can’t be part of the word, but might, from the learner’s perspective, be a separate self-standing element, as, in a sense, it is. But Tashlhiyt goes a number of steps further in allowing words and even complete sentences without a single vowel. It may be world-unique in this respect. This is one sort of uniqueness.

Robert Dixon reports a game in the Australian language, Arrernte, with an initial consonant and nucleus in one part of the syllable and a final consonant in the other part. With its syllable structure this way round, Arrernte may also be world-unique.

Arrernte and Tashlhiyt may represent the limit cases of markedness in the sound system, what is known as the ‘phonology’.

At the opposite extreme of familiarity, there is English with its system of ‘auxiliaries’ or words like would and shall. This system is a major component of English ‘grammar’ or ‘syntax’ – assembling words and parts of words into structures with well-defined meanings. Take the sentences “He takes sugar” and in the past ‘tense’ as this is known, “He took sugar.” By what was once known as ‘Do support’ the category Tense is moved to the left in negatives and questions and realised in a form of the verb do in “Does he take sugar?” or “He didn’t take sugar”. No other widely studied language has anything quite like Do support. If Do support was only known from the last very elderly inhabitants of a small island off the coast of Cornwall, reported by one investigator, the more typical case might be regarded as a universal, and the seemingly dying language of the elderly islanders as a response to the knowledge that one of them will one day be the only speaker. (This sad situation happens to some unfortunate person somewhere in the world about every two weeks.) But English is, for now, a world language with one major aspect of its grammar very highly ‘marked’ (in linguistic parlance).
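The pattern of Do support described above can be sketched as three small rules: in a plain statement the tense is carried by the main verb, while in questions and negatives it is carried by a form of do and the main verb appears bare. This is only an illustration under simplifying assumptions: it handles third person singular subjects only, the tiny verb table is made up for the example, and the function names are not taken from any grammar formalism.

```python
# Hypothetical mini-lexicon: tensed forms for a third person singular subject.
VERB_FORMS = {"take": {"present": "takes", "past": "took"}}
DO_FORMS = {"present": "does", "past": "did"}


def statement(subject: str, verb: str, obj: str, tense: str) -> str:
    """Plain statement: tense is realised on the main verb."""
    return f"{subject.capitalize()} {VERB_FORMS[verb][tense]} {obj}."


def question(subject: str, verb: str, obj: str, tense: str) -> str:
    """Question: tense moves left onto do, and the main verb stays bare."""
    return f"{DO_FORMS[tense].capitalize()} {subject} {verb} {obj}?"


def negative(subject: str, verb: str, obj: str, tense: str) -> str:
    """Negative: tense is realised on do plus n't, and the main verb stays bare."""
    return f"{subject.capitalize()} {DO_FORMS[tense]}n't {verb} {obj}."


print(statement("he", "take", "sugar", "past"))
print(question("he", "take", "sugar", "present"))
print(negative("he", "take", "sugar", "past"))
```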

While speech and language clearly distinguish humans from any other animal, and most children learn to talk naturally without any active intervention, this is clearly not the case for all children, as shown by developmental speech defects of various sorts. But from the mere fact that speech defects are recognised for what they are even by children, it is clear that there is a well-defined benchmark of normal competence (normal in the broad, everyday sense, not the narrow, statistical sense). If the language faculty was not universal, it would not be the case that in every known society and culture there are plays on words and agreements with something like legal force. The ability to learn at least one language to this level of competence is generally expected. Without this commonality of understanding there would be a monstrous discrimination in the assumption that the law is known to all, just as there is a criminal contradiction in the notion of justice where anyone believed to be limited in his or her understanding is punished by pain or even death.

If any of the seven turning points postulated here had been missed by some ancestral population, there would be living populations without some corresponding property in their language. Against my claim in the last sentence, two linguists claim to have found such populations, one in Indonesia, and two in Brazil. David Gil claims that in what he calls Riau Indonesian there is no clear distinction between nouns and verbs. This would be consistent with the language not having progressed beyond the fourth of the seven steps, identified here, by which structures are labelled. But from the limited age range of his subjects, it may be that what Gil is observing is the self-styling of a sub-culture rather than a language.

Daniel Everett claims that in 25 years studying the language, known as Pirahã, spoken by one previously uncontacted Brazilian tribe of fewer than 500 people, he has never heard a sentence like “I know you think I’m wrong” or “I think that you know I’m right”. Such sentences are built out of clauses ‘embedded’ inside one another. In these cases, “I’m wrong” and “I’m right” are embedded in “You think I’m wrong” and “You know I’m right”. And these are embedded again in another, yet higher level of structure. This successive process is known as ‘recursion’. Everett takes his failure to observe this as evidence for his claim that such sentences cannot be formed in principle because Pirahã does not allow recursion. Without recursion there are only so many grammatical sentences that can be formed in the language, a large number, but a finite one nevertheless. If so, recursion is not an intrinsic property of human language. And cultures vary in whether it is possible to discuss doubt, error, or suspicion with any precision. But David Pesetsky and Andrew Nevins have found alternative analyses of Everett’s data, suggesting that Everett may have been mistaken in his conclusions. I personally find it unimaginable that any culture or society could exist without being able to say precisely who is right or wrong about what.
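The embedding described above can be illustrated with a short recursive function. It is a sketch only: the clause and the framing phrases are taken from the examples in this section, the function name is invented, and nothing here is a claim about how any speaker actually builds such sentences.

```python
def embed(clause: str, frames: list[str]) -> str:
    """Wrap a clause inside each frame in turn, innermost frame first.

    The function calls itself on its own output, which is what makes the
    process recursive: there is no fixed limit on the depth of embedding.
    """
    if not frames:
        return clause
    return embed(f"{frames[0]} {clause}", frames[1:])


print(embed("I'm wrong", ["you think", "I know"]))
print(embed("I'm right", ["you know", "I think that"]))
```

Because the list of frames can be of any length, the set of sentences that can be built this way has no upper bound, which is the property at issue in the debate over Pirahã.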

This is not the only uniqueness which Everett claims to have found. By what is generally believed to be another universal, in languages like English with rhythms inside words of two syllables or more, there are ‘feet’, one in coffee with two syllables, two in cappuccino with four syllables. In coffee the F is both the end of the first syllable and the beginning of the second, as reflected by the spelling. But in all cases, the syllables are inside the feet. In 1994, Everett presented data on a Brazilian language which, he claimed, had syllables with a consonant and vowel in one foot and a final consonant in another foot. Such a language must have developed its foot-structure in some unique way, different from all the thousands of other languages with feet.

Other than by claims, such as those by Gil and Everett, all mistaken in my view, there is no evidence of any language not evidencing the whole succession of evolutionary steps I am postulating here, with part of the inheritance biological, and no modern human population without this inheritance.

This is quite different from the absurd, straw-horse idea of human beings being born knowing how to talk, that is sometimes trotted out to rubbish the idea of any sort of biological inheritance.

There is a proposal that modern human language is by virtue of just one evolutionary step with its main benefit being to support thought, discovery and invention. This is sometimes known as the ‘Minimal UG hypothesis’ – UG for Universal Grammar. It assumes the most macro of macro-mutations. But a macro-mutation is implicitly implausible. My proposal follows Darwin and most biologists in assuming that evolution does not look forwards or backwards, but just takes random variations, and favours those which lead to some breeding advantage. Even small advantages can be enormously significant over enough generations – a thousand or more in this case. So the Minimal UG hypothesis is rejected here. 

By my proposal here, for better or for worse, for at least the last million or so years, advances in speech and language have contributed significantly to the growth and success of humankind.


Humans have been evolving separately for perhaps about 250,000 generations or between six and eight million years. By a process that happened mostly in Africa, although with some very complex ingressions, human ancestors evolved larger brains, smaller teeth, longer legs and feet and differently articulated hands and arms. The changes in the lower limbs made it easier to walk long distances, and then to run. The changes in the upper limb made it easier to grasp and throw – and less easy to swing and climb. It would seem that it was hunting capacity that was selected for, rather than gathering.

The evolutionary steps towards modern speech and language proposed here were mainly during the later part of this process. The first step in this direction could not plausibly have got off the ground before the first object of value and worthy of a name – giving prestige and pride to the possessor – somewhere between one and two million years ago, perhaps a club from the burnt trunk and rootball of a small tree, a highly lethal weapon in the hands of a skilled, brave thrower. But since wood does not fossilise, archeology has no record of this.

By the final step of this process, perhaps between around 300,000 and 200,000 years ago, our ancestors evolved a pointed chin, a highly-domed forehead without craggy eyebrows, and a smooth top of the skull. This last step may have taken up to 100,000 years to fixate across the species. But crucially, by my proposal here, by the last step, all modern humans, all descendants from that lineage, now share an equal access to language. So modern language cannot have emerged any later than this. This gives an overall time frame for the development of modern language from 800,000 years to 1,800,000 years. The biology of speech and language is thus the product of a very recent and rapid evolution.
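The arithmetic behind this time frame can be set out explicitly. The sketch below uses only figures given in the text: a first step somewhere between one and two million years ago, and a last step complete by 200,000 years ago. Taking 200,000 rather than 300,000 as the end point is an assumption made here because it is the value that reproduces the stated range. The variable names are invented for the example.

```python
# Figures from the text, in years before the present.
FIRST_STEP_EARLIEST = 2_000_000   # earliest proposed date for the first step
FIRST_STEP_LATEST = 1_000_000     # latest proposed date for the first step
LAST_STEP_COMPLETE = 200_000      # assumed end point: the last step is complete


def duration(start_years_ago: int, end_years_ago: int) -> int:
    """Length of the period between two dates given in years before present."""
    return start_years_ago - end_years_ago


shortest = duration(FIRST_STEP_LATEST, LAST_STEP_COMPLETE)
longest = duration(FIRST_STEP_EARLIEST, LAST_STEP_COMPLETE)
print(f"{shortest:,} to {longest:,} years")
```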

Between 65,000 and 80,000 years ago, one or more groups of modern humans left Africa, eventually meeting and sometimes mating with Neanderthals and others known as ‘Denisovans’, whose ancestors had left Africa several hundred thousand years earlier. Some of the offspring from these unions then mated with modern humans. So those descended from that diaspora between 60 and 80 thousand years ago have some pre-modern inheritance, with the Neanderthal inheritance mainly in Europe and the Denisovan inheritance mainly in the East and Australia. But Neanderthals and Denisovans were different species from modern humans. One of the differences is with respect to a gene called FOXP2, which seems to be involved in the delicate co-ordinations of speech. If the offspring from these pairings had a FOXP2 deficiency, that may have impacted on their speech.

From where Neanderthal and Denisovan remains have been found, it may have been that they tried to avoid meeting modern humans, never mind mating with them, just as some Amazonian tribes today prefer not to have contact with the modern world.

But by the time modern humans first encountered the descendants of older human lineages, the evolution of modern speech and language had been completed.

Inheritors and non-inheritors

At a point of genetic divergence there are inheritors and non-inheritors, beneficiaries and non-beneficiaries. For a while the two populations may co-exist. But if the difference is significant, inheritors have a better chance of breeding success – by whatever means. Non-inheritors may find some way of hiding or avoiding confrontation or competition or compensating. But eventually the only survivors may be the inheritors. Something like this story must have played out with speech and language. Or there would be modern human populations not benefiting from some part of the apparatus. But, other than by the claims of Gil and Everett, no such populations have been found.

Inheritors may have speeded this process by killing non-inheritors – what is now known as ‘ethnic cleansing’ – what Neanderthals and Denisovans may have rightly feared.

But while the entire linguistic inheritance has diffused across the whole of the human population, it is still developmentally vulnerable. It is not inherited completely by some individuals. Some children have speech defects. And these are heritable.

Orthodoxy and novelty

By most theories, in the formation of single speech sounds or ‘phonemes’, such as English K in key, there is no significant ordering other than for the sake of intrinsic necessity. The phoneme K precedes the vowel in key, car, and cow. But K is problematic for approximately one English child in ten. It is defined by a set of gestures involving:

• A closure and opening of the airstream in the mouth – making it what is known as a ‘stop’;

• The briefness of the closure;

• The articulation by the back of the tongue against the soft palate or velum;

• An audible pause after the release of the closure;

• Relatively low acoustic sonority;

• The definition of this as a consonant by its position in the syllable – in languages where this is relevant, as it is in almost all languages.

By the proposal here, the sequence of steps with respect to K varies fractionally from language to language. And this has to be learnt.

In a corresponding way, many children miss the correct tongue posture for the airstream in S and say sea as TEE, by what is commonly known as ‘stopping’ because the airstream is incorrectly stopped.

Much less commonly, one child of three and a half with seriously disordered speech said watch as BOP, glove as DUD, finger as DINDER, milk as GIK, with the three articulators, the lips, the tip of the tongue, and the back of the tongue, all seeming to assimilate to one another. There seemed to be effectively a template allowing only one articulator in the syllable or word. But there were six additional steps. The speech was incomprehensible even to a most careful, insightful, attentive mother who struggled not to make the problem more apparent to the child than it already was. Such speech is not easily accountable.

Where the degree of incompetence is greater still, the output is not readily recognised as speech at all.

Two perspectives

The study of the apparatus proposed here began from what seemed like two opposite perspectives. In 1955 John Langshaw Austin delivered the lectures later published as How to Do Things with Words, laying the basis for what is known as ‘pragmatics’ or the study of how language is used to reach particular objectives, doing things. In 1957, taking English as one arbitrarily selected language, Noam Chomsky proposed two components within the grammar, a ‘phrase structure grammar’, assembling ‘kernel’ sentences like “The man hit the ball”, and a ‘transformational component’ going step-wise through a set of transformations defining:

• Negatives, as in “The man did not hit the ball”, by ‘Do support’, with did as a form of the word do appearing before not between the man and hit;

• Questions, as in “Did the man hit the ball?”, with Do support moving the word did to the beginning of the sentence;

• Negative questions, as in “Did the man not hit the ball?”;

• Negative questions with a contraction, as in “Didn’t the man hit the ball?”, hopping not across the man, losing the vowel, and glueing it onto did;

• What are known as ‘passives’, as in “The ball was hit by the man”, moving the ball leftwards, changing the man into a phrase with by (and changing the form of the verb, other than in special cases such as hit);

• Questions beginning with a word like what, as in “What did the man hit?”, inviting a response like “The ball” or “The man hit the ball”, with what moved to the beginning of the sentence linking to an element at the end, even as far away as in “What do you think she said the man hit?”

The grammatical apparatus here is extraordinarily complex. Chomsky’s analysis was original in a number of ways. The rules were explicit and applied one by one. In a word such as had, did or might, the reference to time, known as ‘tense’, was treated separately from the word containing it. All and only grammatical sentences were generated by the two components, including “Mightn’t the ball have been being examined by the umpire?” Previous grammars had omitted any distinction between what was generated and what wasn’t. So Chomsky’s proposal was a ‘generative grammar’.
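As a rough illustration of rules being explicit and applied one by one, the transformations listed above can be sketched in Python as operations on a single kernel sentence. This is a toy under stated assumptions, not Chomsky’s 1957 formalism: the names Kernel, VERB_FORMS and all the functions are hypothetical, only the past tense is modelled, and each function builds its output string directly rather than deriving it from a phrase structure.

```python
from dataclasses import dataclass

# Hypothetical toy lexicon: base verb -> (past form, past participle).
VERB_FORMS = {"hit": ("hit", "hit"), "see": ("saw", "seen")}


@dataclass(frozen=True)
class Kernel:
    """A kernel sentence with tense held apart from the verb (past only)."""
    subject: str  # e.g. "the man"
    verb: str     # base form, e.g. "hit"
    obj: str      # e.g. "the ball"


def sentence(words, end="."):
    """Join words, capitalise the first letter, add final punctuation."""
    text = " ".join(words)
    return text[0].upper() + text[1:] + end


def declarative(k):
    past, _ = VERB_FORMS[k.verb]
    return sentence([k.subject, past, k.obj])


def negative(k):
    # Do support: tense is carried by "did", which precedes "not".
    return sentence([k.subject, "did", "not", k.verb, k.obj])


def question(k):
    # "did" moves to the beginning of the sentence.
    return sentence(["did", k.subject, k.verb, k.obj], "?")


def negative_question(k, contracted=False):
    if contracted:
        # "not" hops across the subject and is glued onto "did".
        return sentence(["didn't", k.subject, k.verb, k.obj], "?")
    return sentence(["did", k.subject, "not", k.verb, k.obj], "?")


def passive(k):
    # Object moves leftwards; subject becomes a phrase with "by".
    _, participle = VERB_FORMS[k.verb]
    return sentence([k.obj, "was", participle, "by", k.subject])


def wh_question(k):
    # "what" at the beginning links to the emptied object position.
    return sentence(["what", "did", k.subject, k.verb], "?")


if __name__ == "__main__":
    k = Kernel("the man", "hit", "the ball")
    for rule in (declarative, negative, question,
                 negative_question, passive, wh_question):
        print(rule(k))
```

Each function stands for one transformation, so the outputs for the kernel above reproduce the example sentences in the list. The sketch leaves out everything that makes the real system hard, such as the other auxiliaries and the interaction of transformations.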

Generative grammar is often represented as neutral. But it isn’t entirely neutral. Questions and negatives diminish authority. The passive reduces agency. “You might be mistaken” can be read as disrespectful.

In 1967, E. Mark Gold showed that the class of grammars then being developed to explain not just English questions, negatives and passives but their equivalents in other languages was unlearnable. Gold’s critique applies if the critical variation across languages is by the ordering of devices. The critique is avoided by a single function with expressions according to the conditions. Generative grammar has since developed accordingly. Most aspects of Chomsky’s 1957 analysis have been superseded by reanalyses by Chomsky himself and others. But the notion of derivation from an origin to a point of pronunciation has been widely retained, as in Chomsky’s 1995 Minimalist Program.

But Chomsky’s and Austin’s projects were less orthogonal to one another than they first appeared. From a 1997 proposal by Luigi Rizzi, breaking the left edge of the sentence down into an ordered set of elements comprising the main, pragmatic aspects of the utterance, it has emerged that there may be a way of reconciling Chomsky’s and Austin’s seemingly contrasting perspectives.

I seek to exploit this in two of the steps I am postulating here, Fit and Move.

Data for learners

For the child learning English, some things are relatively easy. Others, like the auxiliary system consisting of words like can and do, with Do support and related phenomena, as by Chomsky’s 1957 analysis, are not easy at all. In relation to the learnability space, among the more obvious and uncontroversial characteristics of English are:

• A very complex auxiliary system;

• A relatively large inventory of phonemes, 18 vowels and 24 consonants, as set out in An inventory of Sounds;

• A relatively simple system of stops, ‘voiced’ in the case of B, D, G with the lips, tongue tip, and back of the tongue, contrasting with the ‘voiceless’ or ‘unvoiced’ stops, P, T, K with the same articulators, but with a significant delay before the vocal cords are brought together allowing them to vibrate in what is perceived as a vowel. This contrasts with much more complex systems with three, four, or five settings in many South East Asian languages;

• Great complexity in the vowel system, with six short vowels, in him, hem, ham, hum, hod, hood; the long vowels in he, hark, hawk, who; what are known as ‘diphthongs’, with the tongue moving in what is known as the ‘vowel space’, in hay, high, hoy, hoe, how; both long vowels and diphthongs articulated with a degree of ‘tension’; the vowel known as ‘schwa’ at the beginning and end of agenda; a long equivalent in her; and what are sometimes taken as an extra series combining a long vowel or diphthong with schwa in our, ire, coir, truer, all written with an R, and pronounced with an R in ‘rhotic’ varieties of English.

• Relatively complex syllables or ‘phonotactics’ – with up to three segments before the vowel or nucleus, as in spring and string, up to two vocalic elements in the vowel or ‘nucleus’ in my, tense or long in me, up to three segments after the nucleus – in glimpse, next and length (in many pronunciations, at least), and, outside the frame of the dictionary word, T glued on the right edge for past tense in glimpsed and S for plurality in lengths;

• One complexity in the consonantal system when a stop is released so as to effect a sudden blast of airstream which is then released to produce a brief moment of high frequency noise in what is known as an ‘affricate’, as in the first sounds in chew and Jew. This contrasts with Russian with affricates in two groups, one like English with a stop before what is known as a ‘fricative’ with the airstream flowing through the space left by the partial release of the closure, the other with just one member with the closure in the middle of a fricative. There is much greater complexity in many African languages.

• Syllabic nuclei which do not invariably contain a vowel, as in the second syllables of little and table – always unstressed in English – in a way that seems to be very problematic for most learners, with the tongue tip articulation of the T and D in little and middle characteristically lost until three or four;

• A relatively complex system of word stress with one primary stress on the left branch of the ‘foot’ in ladder, on the left branch of the rightmost foot in belladonna, and discounting one rime with a short nucleus on the right edge, as in hippopotamus.

• In many varieties of English, R is added between vowels, for some speakers in withdrawal as WITHDRAW R AL and for other speakers where sentences are connected in sense, as in “I went to Australia R and I fell in love”, but not where there is no connection as in “I went to Australia. And you still owe me that money.”

• Pitch or ‘tone’ used only to mark the difference between questions and other sorts of sentences and for various sorts of ‘pragmatic’ effects or what we can do with words, but not to distinguish words from one another. Here English contrasts with the languages of China, and about half the languages in the world.
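The syllable frame described in the list above, with up to three segments before the nucleus, up to two elements in the nucleus, up to three after it, and T or S glued on outside the frame, can be sketched as a simple check. This is a minimal sketch under stated assumptions: the function name fits_template and the constants are hypothetical, segments are written in a rough ASCII transcription (“ng”, “th”), and only the number of segments is checked, not which segments may combine.

```python
# Limits taken from the description of the English syllable frame.
MAX_ONSET = 3     # segments before the nucleus, as in spring, string
MAX_NUCLEUS = 2   # vocalic elements in the nucleus, as in my
MAX_CODA = 3      # segments after the nucleus, as in glimpse, next, length

# Segments that can be glued on outside the frame (assumption: only
# the T of the past tense and the S of the plural are modelled).
APPENDIX = {"t", "s"}


def fits_template(onset, nucleus, coda, appendix=()):
    """Return True if the segment counts fit the syllable frame."""
    return (
        len(onset) <= MAX_ONSET
        and 1 <= len(nucleus) <= MAX_NUCLEUS
        and len(coda) <= MAX_CODA
        and all(segment in APPENDIX for segment in appendix)
    )


if __name__ == "__main__":
    examples = {
        "spring": (["s", "p", "r"], ["i"], ["ng"], []),
        "glimpsed": (["g", "l"], ["i"], ["m", "p", "s"], ["t"]),
        "lengths": (["l"], ["e"], ["ng", "k", "th"], ["s"]),
    }
    for word, parts in examples.items():
        print(word, fits_template(*parts))
```

A fuller model would also have to say which segments can occupy which position, which is where most of the complexity for the learner lies.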

Although, by comparison with other languages, English has a very complex vowel system and an only averagely complex consonant system, there are far more developmental problems with respect to the latter than with respect to the former. It is worth asking why this might be so. By the proposal here, even a cross-linguistically average consonant system is intrinsically more complex than a cross-linguistically complex vowel system. In most languages, vowels differ with respect to the position of the tongue in the mouth, their length, and the configuration of the lips. Consonants differ in all of these respects and more. And this would seem to go back to the original formation of speech sounds by what I am calling Mimic and Merge at the very beginning of speech and language evolution.

Uneven exposure to data

A dialect or variety of a language can be characterised at least in part by the way some words are said. In some cases, the words are uncommon. The data available to the learner is often uneven.

In many varieties of English, including the now disappearing Cockney, the T in little is pronounced not with the tongue tip but by a closure of the vocal cords, known as the ‘glottal stop’. This is unmissable.

But in the variety of English which Daniel Jones quaintly characterised as ‘Received Pronunciation’, now mostly known as RP, in Jones’s inimitable style as ‘the speech of men educated in one of the great English public schools’, the T in little is not released. But the T in huntsman, ointment and appointment is glottal stopped, as in Cockney little. The unusual configuration, between an N and an M with no vowel, blocks the realisation of the tongue tip gesture, forcing a realisation with the larynx. But even if the learner never hears any examples of this, there are other similar cases as in gentle and gentleman where the N is followed by an L, functioning as a stand-alone syllable, and like N and M also what is known as a ‘sonorant’. In these cases too, the only involvement of the tongue tip is in the articulation of the L. The learner has to generalise from what may be very limited information.

The proper focus of analysis

It is sometimes thought that the main focus in the analysis of child speech should be on what children most often get wrong, as by fronting and stopping. These are accordingly characterised as ‘processes’. But that does not answer the question: Why do children get wrong what they do, not just individually, but generally?

The first step to an answer, I propose, is to consider what learners HAVE to attend to. This includes many subtleties, such as those involving time. At the end of the syllable, the voiceless stops in P, T, and K are kept apart from the voiced stops in B, D, and G, mainly by the length of the vowel. So to keep the G in hog apart from the K in hock, the O vowel in hog is almost as long as a long vowel. The learner must be attending to this sort of thing in order to progress to being a competent native speaker.

It is sometimes assumed that subtleties like the representation of voicing in a final stop by the length of the preceding segment are defined by values on scales with an infinite number of possible settings. But if so, learners have to attend to two different sorts of thing: what contrasts with what and scalar variables of time. The learner’s task is simplified if there is just one sort of task, as there is by the proposal here with all variation, what the learner has to learn, defined by orderings in time. Obviously the task is harder if there is more than one first language. But by the proposal here, the task is the same for every language. 


In the 1980s, from work by Noam Chomsky and Hagit Borer, many linguists came to think of language learning in terms of choices about the functors. For instance, there are many languages like Italian, Greek and Spanish in which the equivalents of I and you, known as ‘pronouns’, can be routinely dropped, with the equivalent of “I love you” said as “Love you” without the I. According to whether the target language drops this sort of pronoun or not, the language learner has to do the equivalent of throwing a mental switch one way or the other. In English something like this happens in questions with verbs of perception like “See what I mean?” or “Mind if I come in?” addressed to a person, or in negatives where the speaker is speaking for himself or herself, as in “Don’t mind if I do.” But such marginal cases don’t make English a language like Italian.

The points around which these choices or settings are made are known as ‘parameters’.

The idea has been explored mainly in syntax – how to use pronouns, how words are put together to form negatives, questions, and so on. But some work has been done applying the idea to phonology – how sounds or phonemes are put together to form words, and how stress is organised differently in those languages which use stress.

Languages like English contrast below and bellow with contrasting levels of stress on the two syllables. This is with stress represented by the length, pitch and volume of a particular syllable’s rime. Most of the languages of Western Europe use stress in this way. Chinese-type languages, on the other hand, contrast different tones on single syllables. A child learning a European-type language has to set a corresponding parameter one way. From work by Paula Fikkert, it seems that this starts to happen around two and a quarter. From work by Yuen Ren Chao, children exposed to a Chinese-type language start to set the same parameter the opposite way at the same age, rather suggesting that there is one parameter here.

From work by Nina Hyams, children exposed to English on the one side or Italian on the other are throwing the switch opposite ways at the same interesting age, around two and a quarter.

The notion of parameters was a great advance on previous notions of rules in systems of great complexity. Whenever we think of anything at all, a picture, a symphony, a piece of legislation, a speech, or a joke, we do so using an apparatus whose operations are exclusively binary. A neurone either fires, or it doesn’t. Neurones don’t express matters of degree or shades of grey. Reducing the grammar to a series of binary values made it tractable by the human mind.

But there are three main sorts of problems with the notion of parameters. First, the combinatorics of the settings are implausibly large. In relation to syntax, Guglielmo Cinque lists 40 sorts of English functor, each potentially associated with a parameter. If each parameter went just two ways, that would give two to the power 40 logically possible combinations of settings, something over a trillion. Trying every logically possible combination at one a second, without pausing for food or sleep, the parameter setter would still be less than a quarter of one per cent of the way to the target of a grammar at the end of an average lifetime. Even more settings are needed for phonology. Second, the two-way valuations don’t exhaust all the possibilities. Many languages which drop pronoun subjects don’t do this all the time or in every variety of the language. Third, there is a chicken and egg problem. If language learning is entirely by parameters, how did language evolve before the first language learner came to guess that it might be organised parametrically?
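The arithmetic behind the first of these problems can be checked directly. The following is a minimal sketch in Python, assuming 40 two-way parameters, one combination tried every second, and an average lifetime of 80 years; the lifetime figure and the variable names are assumptions made here for illustration.

```python
PARAMETERS = 40                           # one per sort of functor
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60
LIFETIME_YEARS = 80                       # assumed average lifetime

# Every parameter goes two ways, so the settings multiply.
combinations = 2 ** PARAMETERS

# Trying one combination a second, with no pauses at all.
years_needed = combinations / SECONDS_PER_YEAR

# Share of the search space covered in one lifetime.
fraction_covered = LIFETIME_YEARS / years_needed

print(f"{combinations:,} combinations")
print(f"{years_needed:,.0f} years to try them all")
print(f"{fraction_covered:.2%} covered in {LIFETIME_YEARS} years")
```

On these assumptions the exhaustive search takes roughly 35,000 years, so a single lifetime covers only a small fraction of one per cent of the possibilities.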

By a new development on which Chomsky and others have been working mostly since 2001, the syntax is organised in phases, one similar to what was once called a ‘kernel sentence’ with a subject, a verb and an object, and a later phase with the pragmatic force of the sentence fully specified. By a very different approach, Borer has all the variation between languages and within languages in the properties of what she calls the ‘exoskeleton’, an abstract framework in which all the properties of five or six variables are organised. One set of variables applies to referential items, i.e. nouns. Another set of variables applies to elements bearing on events, i.e. verbs. But the nouns and verbs are not managed directly, but by the elements which make them nouns and verbs.

Both proposals encapsulate what has been the bread and butter of generative syntax for the past 50 years. Both group continuous sequences of steps within the derivation together. But they do this in different ways, Phase by the form of the derivation itself, and the exoskeleton by what has become known as the ‘cartography’, essentially a graphical plan of the derivation, with the universality in the terms and the variation in the values they get given.

It seems to me that Phase is less easily expressed biologically than the exoskeleton. With her customary wit, Borer calls her model the XS model, where XS stands for exoskeletal, not excess. By the proposal here, the most important research tasks are to extend the model and to find a biological way of expressing the variations by XS variations.