
Proposal

What makes us human: Seven connected steps

For the overwhelming majority, very early in human life, a special sensitivity starts to develop, a sensitivity which has no equivalent in any non-human species. The baby starts paying close attention to the structures of speech and language, and learning to talk. This sensitivity lasts throughout childhood, but not beyond.

This sensitivity defines both an infinite capacity and significant commonalities across all members of the human species, irrespective of the language that is spoken in some particular group, no matter whether it is numbered in the thousands or in the millions. Speech and language are human specific. And children mostly master this infinity and commonality without special help or instruction. But not always.

The Faculty of Language or FL is plainly not expressed at birth, but it develops throughout childhood, by a process known as ‘acquisition’. The fact that this process is possible is known as ‘learnability’. The baby hears a partial instance of the fully evolved capacity from competent adult speakers, even though much of what is heard is in bits and pieces as the speakers change their minds about what they are trying to say. As David Adger (2019) points out, the child is soon able to say and understand things which have never been said before in the whole course of human history, and to do this despite the randomness of what he or she happens to hear. This is known as the logical problem of language acquisition.

Assuming (conventionally) that FL should be broken down into its simplest possible elements, I propose here that:

  • It is likely that FL evolved by steps or saltations, by my proposal here, seven, and this is reflected in learnability and acquisition;
  • Where the acquisition process goes wrong, as it plainly does sometimes, this may be due to an error in the ‘wiring’ of the genetic inheritance or the genome.

These two principles naturally bear on the central questions of clinical linguistics.

The properties of the proposed sequence here are abstract. But they are no more abstract than the straightness of the line between footfalls which the child is learning to walk along. The straighter the line, the more efficient the gait becomes. More energy is used to propel the body forwards, and less to stay upright. This gait allows humans to run for longer on two legs than faster-running prey on four. But few, apart from trainers in athletics, think about the straightness of the footfalls.

Abstractness is useful in relation to speech and language, as well as for athletics. When the acquisition process goes wrong, some abstractness can usefully guide the process of clinical investigation. What questions should the clinician ask when? This is what the proposal here is about.

Consider the child of three who seems to have just one word. What does this mean? What is the likelihood of the child learning to talk normally? Have the parents done something wrong? Is the child likely to grow into an adult with a ‘communication problem’? Is there a way of reducing the chances of this? What can be done to help? To answer such questions, it is worth sharpening our understanding of FL and, I submit, its possible evolution.

A prohibition and the inspiration here

There are many proposals about the evolution of speech and language. In 1866, the Linguistic Society of Paris banned all discussion of the topic. There may have been a suspicion that research would point towards an African origin of human language, undermining the assumption by most Western intellectuals at the time of white European superiority. But whatever the motivations of the ban, it held for over 100 years.

The main inspiration for the proposal here is Chomsky’s 1995 Minimalist Program and its core notion of Merge, combining pairs or sets of atoms, one as the head of the combined expression which can then combine with others, and so on, indefinitely. This is commonly represented as ‘branching’ denoting instances of Merge. The characterisation of this by seven steps treats the steps as part of one evolutionary sequence. This stepwise treatment of Merge is motivated by the questions above, by the evidence of language acquisition, and by the evidence of what is known as ‘Universal Grammar’ or UG. This departs in significant ways from Chomsky’s ‘Strong Minimalist Thesis’, as set out by Chomsky et al. (2023). Without abandoning the empirical progress there and elsewhere, I assume, like many critics of minimalism, that language represents a most useful adaptation, and did not evolve entirely by means of a single mutation in Homo sapiens in the past 300,000 years. It may be possible to resolve the obvious tension here between apparently contradictory notions of minimalism if it is acknowledged, as it is here, that Merge is at the centre of the evolutionary process, which can be usefully extended.

Subjects and data

The subjects here represent the smallest sample for any logically possible generalisation, two. The subjects were the youngest two of three children, Joe and a younger brother initially called Frank, both developing normally, from a liberally-minded, middle-class family, interested in books, museums, galleries, history, art and politics. They went to a neighbourhood, non-denominational, local authority school, catering for children from a wide variety of social and ethnic backgrounds. The observations are from diaries kept by my wife and myself. They comprise about fifteen thousand observations, filling nine cathedral analysis note books. The observations continued until Joe was almost ten and a half.

We tried to make our observations as accurate as possible, as soon as possible after the event. Obviously we must have missed many developmentally significant occasions. The observations exampled here exhibit commonalities across the two boys. Because of the age difference, it is not plausible that the older of the two was significantly influencing the younger, other than on matters of interest to two small boys. Listening to them talking with their friends and peers there was nothing obviously singular about their speech and language. Generalisations across the two are thus likely to be significant.

It is only on coming back to these records for analysis, forty years after they were made, that it is becoming clear how much they reveal. One cannot listen too carefully to the details and nuances of what children say. They can say more than they seem to on a first listening.

The next stage of the research here will be either to add more subjects or to amend the methodology.

Both speech and language are enormously complex. The first efforts are, by virtue of this rather obvious fact, hard to understand – to the point that it is often hard to decide exactly what is being said. But on this basis, any degree of structure understandable as speech may be intrinsically significant.

Following the convention established by Jean Piaget, ages are given as years; months (days), so that 0; 11 (10) means an age of eleven months and ten days. For our purposes here, this degree of precision is useful. Some developments happen over a few days, or overnight, or less. In the cases exampled here, the observations were the first cases of utterances satisfying some particular grammatical criterion.
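For anyone wanting to work with such ages mechanically, the notation can be read by a few lines of Python. This is only a sketch, reading the notation as years; months (days). The names Age, parse_age, approx_days and approx_gap_in_days are invented for the illustration, and the conversion to days assumes a month of 30 days, so any gap it reports is approximate.

```python
import re
from typing import NamedTuple


class Age(NamedTuple):
    years: int
    months: int
    days: int


# Accepts the spacing variants used in the text, e.g. "0; 11 (10)" and "1;10 (3)".
_AGE_PATTERN = re.compile(r"^\s*(\d+)\s*;\s*(\d+)\s*\(\s*(\d+)\s*\)\s*$")


def parse_age(text: str) -> Age:
    """Read an age written as years; months (days)."""
    match = _AGE_PATTERN.match(text)
    if match is None:
        raise ValueError(f"not a years; months (days) age: {text!r}")
    years, months, days = (int(part) for part in match.groups())
    return Age(years, months, days)


def approx_days(age: Age, days_per_month: int = 30) -> int:
    # Assumption: every month is counted as 30 days, so the result is approximate.
    return (age.years * 12 + age.months) * days_per_month + age.days


def approx_gap_in_days(earlier: str, later: str) -> int:
    """Approximate number of days between two ages in the same notation."""
    return approx_days(parse_age(later)) - approx_days(parse_age(earlier))
```

On the 30-day assumption, approx_gap_in_days("1; 4 (27)", "1; 5 (9)") gives 12.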

Steps

All of the steps proposed here are given in terms which make no reference to any sort of linguistic category. Otherwise they could not be encoded in a genome, capable of being transmitted from one generation to the next. The steps are as follows.

1. Lexicon

Any system of communication based on discrete meanings and external symbolisations entails a branching between two unlikenesses, a strictly binary relation. By the proposal here, the human innovation was to evolve a new sort of relation between the unlikenesses where the relation is itself part of the definition. This happened by first reducing the external elements of the expression to their simplest, logically possible, perceptibly distinct forms, contrasting sorts of featural element, and then reversing the decomposition, recombining elements for the sake of clear articulation. The decomposition and recomposition is diagrammed below in terms of feature a and feature b.

What made this different from a chimpanzee’s hoot or shriek or warble was the sequence of steps. The novelty of the step involved both. The recomposition gives their assembly into some prototypical part of a phoneme or speech sound. But primordially and in modern children’s early speech and language development, the recomposition may be uncertain or not fully defined. So there is no reason to expect that primordial expressions sounded like modern EE, AH, TOO, COO, DEE or DAW by modern phonemic or syllabic structure. Modern chimpanzees’ apparent inability to copy human speech tells its own story. Primordial articulations may have been different in any number of ways, in all probability using features that could be drawn from an existing system of shrieks, hoots, grunts, or howls.

By virtue of the double branching, as by the diagram above, the first human language was different from any sort of alarm calls of vervet monkeys or prairie dogs, or the richer but seemingly less specific systems of chimpanzees, or the marking of individual and group identity by dolphins. Dolphin calls of group or individual identity may be meaningful only in the presence of another group-member, although there is implicit reference by a recognition of identity in an act of welcome. Such non-human calls are not compositional. They cannot be freely combined with other calls to some infinite degree, known as ‘discrete infinity’. Nor can they be ‘decomposed’ into separate articulatory / perceptual and semantic / pragmatic elements, as in games like the French Verlan, wittily reversing the order of the syllables in l’envers, the French for backwards.
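The decomposition and recomposition at work in Verlan can be sketched in a few lines of Python. This is only an illustration, not part of the proposal. The function name and the rough two-syllable rendering of l’envers are assumptions made for the example, and the word has to be divided into syllables by hand.

```python
def reverse_syllables(syllables: list[str]) -> str:
    """Recompose a word from its syllables taken in reverse order."""
    return "".join(reversed(syllables))


# Assumption: a rough two-syllable rendering of l'envers as it is pronounced.
print(reverse_syllables(["lan", "ver"]))  # verlan
```

The point is only that the game needs an expression with separable parts. A call with no such parts gives the function nothing to reorder.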

In evolution, the physical aspect may have been either gestural or vocal. If primordially there was a bias towards physical gesture, this bias must have disappeared as language evolution progressed, or there would be sign languages used natively by normally-hearing populations. The proposal here is neutral on whether the first decompositions and recompositions were vocal or manual, about the content of the features, and about the modernity of the contrast.

But by the proposal here, the externalisations always had separate elements, the prototypes of modern features or the simplest sort of modern syllables, known as CV or consonant-vowel syllables, widely thought to occur in every language. If (improbably) CV syllables were primordial, the contrast between the two elements may have been with respect to the greater openness or sonority or resonance of vowels. More probably, the first step on the pathway to modern speech and language, the initial prototype, is not likely to have differed greatly from ape-like hoots as a result of the new process. But there must have been a perceptible difference in order for the evolutionary novelty to spread. It is reasonable to suppose that the fitness advantage of this more complex arrangement was that it enabled forms to be classified in a prototype dictionary, a lexicon of comparable items, differentiating features more narrowly, by defining them more precisely.

In modern acquisition, one child, Joe, at 1; 0 (14), seeing the swings in the playground said something which his mother heard as “See saw” – as DEE DAW, with the consonants as what are known as ‘voiced stops’, stops because the airstream is completely blocked in the mouth, and voiced because the action is only momentary with the buzzing sound from the larynx beginning as soon as the closure is released. The partial closure by S is replaced by complete closure. Joe’s younger brother, Frank, said “Bu” on seeing a bus, with no discernible S, at 0; 11 (10). All of these forms seem to be referential, at least to some degree.

On what might seem to be the simplest possible analysis, BU and DEE DAW are formed by a simple chaining or ‘concatenation’ of elements, as unanalysed structures, parrot-like mimicries of speech, so that there are no grounds for any further analysis.

But there is evidence that the first phonemes are themselves defined on constituent features. The evidence is from one rare sort of disorder, with the features treated as properties of words rather than their constituent phonemes. For instance, one child I saw at the age of two had very little speech; but of the few words he had, he could say ‘more’, with a lip articulation in the M and corresponding lip rounding in the vowel, and ‘knee’, with a tongue tip articulation in the N and a corresponding high front articulation in the vowel, but in ‘me’ and ‘gnaw’ the M and N sounds were seemingly unpronounceable. There is a featural analysis here. But one of the branchings is skipped. So for the sake of a plausible pathway, a featural analysis is more plausible than an apparently simpler phonemic analysis. By a featural analysis, even with only a small subset of the features of fully competent speech, the features can be cross-multiplied in various ways, allowing the lexicon to be expanded, and forms to be extracted at will, at any moment, and entered into a ‘work space’.

On a completely different time scale from modern acquisition, the lexicon could grow exponentially. One form could be compared with another. New forms could be added, each distinctively represented by branched structures, to be articulated and understood accordingly.

Both by evolution and by the fully evolved modern system, such a ‘lexicon’, or store of words or signs, could be freely supplemented throughout life. Lexical items are thus quite different from any shriek or howl in response to some situation. The modern lexicon evolved from the two steps of decomposition and recomposition.

The relation between physical and semantic features and the classification of entries in a lexicon are common to all naturally spoken, modern languages. This breakpoint between human and chimpanzee-style communication cannot have been by a gradual transition, but by a reconfiguration of the relation between the two sorts of feature. The relation between the decomposed and recomposed aspects of the externalisation allows the expression to be classified. This reconfiguration was the evolutionary cognitive genius. Without the decomposition and recomposition, there is no way that the primordial hoots and shrieks could have evolved into what they became.

Anna Maria di Sciullo and Edwin Williams (1987) suggest that the lexicon is a place of lawlessness and unpredictability. But by the phonology and the combination of different underspecification theories proposed here, and by the next but one step to be proposed here, defining what Chomsky and numerous others characterise as Merge, an irreducibly necessary aspect of human language, the keys to the cells of the lexicon are strictly organised. They are highly compacted. Although the principles of this are common to all languages, the detailed implementation is language-specific, different for every language, necessarily within the child’s learnability space. Nunes (2002) shows that this learning is normally still in progress at the age of eight, at least for the hardest words.

The principle of compacting is plainly not learnt. It has to be available to the child from day one. It has to have evolved to be able to work the way it does. This evolution cannot be an aspect of human culture or it would vary from culture to culture. It has to be biological. Hence the current notion of ‘Biolinguistics’.

If the proposal here is on the right lines, Lexicon defines the beginning of a human-specific pathway.

How far the first inventor or inventors were CONSCIOUSLY aware of what they had ‘invented’ is obviously impossible to say. But it seems reasonable to speculate that even just one suitable expression could be very appealing, that there was a selective advantage. More mental investment in the form may have allowed more attention to other aspects of the expression. The adaptation here would seem likely to have conferred a degree of advantageous fitness. Inheritors had a greater chance of mating and thus of passing the adaptation on. All we know is that it was noticed. Or it could not have spread.

Now it might seem that there was a simpler analysis by the mere fact of the expression and the meaning being combined, or equivalently that the first single words are in principle no different from a chimpanzee-like hoot, as many psychologists and primatologists have suggested, implicitly or explicitly, or that it is more parsimonious to delay for as long as possible the point at which we postulate any sort of powerful capacity, perhaps until the grammar is generating structures with some degree of complexity, such that the infinite capacity is obvious. But this would be to postulate two pathways, one leading to a finite set of outputs and the other leading to an infinite set. The greater parsimony is by a single pathway. It would also be to underestimate the complexity of the child’s first word, whether this is understood as referential, calling up an entity not actually present at the point of the utterance, as by mum, mummy, dad, daddy, cat, pussy, dog, doggy, horse, horsey, bird, birdie, car, bus, lorry, or as a standalone expression of discourse like hello, hi, or goodbye.

The first word is not just more sophisticated than a hoot. It is something else entirely. It is a first step on a pathway.

Even if the first word or syllable is imprecisely articulated and hard even to identify, it still has some rudimentary structure, typically with an initial consonant followed by some sort of vowel. There may even be a vestige of a less highly stressed second syllable. Semantically, this may be referential or based entirely on discourse. Significantly, there are only a handful of discourse expressions in contrast to those entering the system of meaningful combinations, known as syntax. Reference is universal across languages, allowing an entity to be called up just because it happens to be in a speaker’s mind.

By the analysis of Matilde Marcolli (2023), Decompose and compose are invoked to explain two aspects of Merge. By the proposal here, Decompose and compose should be seen as the basis of the process by which linguistic atoms are put together to form the basis of the lexicon. It is both primordial and the first step on the modern acquisition pathway.

By the framework here, it is likely that both primordially and in modern acquisition, definitions were and are imprecise, making the speech hard to understand. The structure (or lack of it) is both primordial and characteristic of early development.

Modern language exploits the complete resources of a system which is continually changing in how things are pronounced, in how words are put together, in what they mean, and so on. But that being said, it is now in a sense fully evolved. The primordial system has been supplanted by both evolution and what is known as grammaticalisation, the churning over of the grammatical apparatus under various pressures over tens of thousands of years, some pulling in opposite directions.

There are fossils of the primordial sound / meaning relation, as expressed by the modern lexicon, in:

  • The lexicon itself;
  • The sending of information of two sorts to two quite different sorts of interface, both conceptually necessary, represented in primordial human expression, and as significant in modern speech and language as at the point of evolution;
  • The long period in normal language acquisition, of anything between two and six months, of just single words;
  • English yes and no and what are known as ‘modal particles’ or ‘discourse markers’ including Ah for pleased surprise, Eh as a query, tut tut for disapproval, hello, bye bye, curses, and so on;
  • Expressions like sh as a call for silence;
  • What are known as ‘imperatives’, commonly, as in English, by the ‘root’ form of a word, such as come and go, sometimes for the sake of saving life;
  • Expressions like “genius” in response to a performance.

The single words of modern one year olds are unlike the primordial forms proposed here in that they exploit combinations of features by later steps. But they are used in a way characteristic of the primordial system.

2. Recognise and address

There is a reduction of the number of elements involved by pairing the simplest sort of element such as a ‘standalone expression’, such as bye-bye, with a reference to some animate entity. The contrast is between elements such that one relates only to the discourse itself and the other is a prototype of what will become an element of syntax, including nouns.

Joe at 1; 5 (23), six months after his first ‘words’, said “Bye, doggy”.

Frank at 1; 2 (22) said “Bye bye, Daddy”.

Bye or bye bye are clearly aspects of discourse, as are ah, ey, uh, oh in adult language. Doggy, addressed to a toy dog, and Daddy, are seemingly referential. So there is discourse and reference, but with no internal structure other than the contrast.

Representing the externalisation schematically as two features, a and b:

This primordial system is reflected in modern language in at least two ways:

  • Expressions standardly by root forms combined in ways falling outside the terms of the grammar, kill joy, go between, go slow and so on, noted by Ljiljana Progovac (2015);
  • Possibly at least some adverbs, as the only sort of word which can appear in different positions in English, albeit with some subtle changes in meaning, as with sadly in any of the logically possible positions in “Sadly, he is going to die”, “He sadly is going to die”, “He is sadly going to die”, “He is going sadly to die”, “He is going to sadly die”, “He is going to die sadly” – all grammatical, at least for those who allow infinitives to be split.
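The freedom of position of an adverb like sadly can be made concrete by listing every slot it could occupy in the five-word sentence. The following Python sketch is only an illustration. The function name is invented for the example, and the code lists the logically possible orders without saying anything about their grammaticality or their shades of meaning.

```python
def adverb_placements(words: list[str], adverb: str) -> list[str]:
    """Return one word sequence for each slot the adverb could occupy."""
    placements = []
    for slot in range(len(words) + 1):
        sequence = words[:slot] + [adverb] + words[slot:]
        placements.append(" ".join(sequence))
    return placements


for sentence in adverb_placements(["he", "is", "going", "to", "die"], "sadly"):
    print(sentence)
```

Five words give six slots, matching the six sentences with sadly quoted above.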

The modern infant generally puts his or her first two ‘words’ together around the point when the vocabulary reaches around fifty items. There is no reason for assuming that Recognise and address evolved at the point when some particular number of items became accessible, but it is clearly possible that in evolution Recognise and address was triggered by the growth of the lexicon.

3. External Merge (to use a Minimalist term because there is no other) and headship

There is a further restriction of the number of elements in the computation by limiting them to different types, as some sort of prototypical categories, with one becoming the head of a combined expression.

For example Joe at 1;7 (30) said “In er car…. In car”.

And Frank at 1; 3 (2) said “Open door”.

In these examples, only one element is referential. The other is a prototype of some sort of syntactic object, in these two cases a preposition and a verb. Neither is such that it can stand on its own. Discourse elements are thus excluded. A headed structure can’t include bye bye or hello.

When the phrases “In car” and “Open door” were uttered, their meaning seemed probable, but not certain. The relation is asymmetric between two contrasting elements. But however they should be understood, both of these phrases have well defined heads, in and open, contrasting with the clearly referential elements car and door. In the framework here, the non-head is known as the ‘complement’. By the diagram above, dominance is thus built into the system.
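The asymmetry between head and complement can be sketched as a small data structure in Python. This is a simplification for illustration only, not the formal definition of Merge. The class names, the category labels such as "noun-like", and the head-first order of pronunciation (as in the two English examples) are all assumptions made for the sketch.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Item:
    form: str
    category: str  # hypothetical label such as "noun-like"; not the author's notation


@dataclass(frozen=True)
class Phrase:
    head: Item
    complement: "Node"

    @property
    def label(self) -> str:
        # The head alone determines the label of the combined expression.
        return self.head.category


Node = Union[Item, Phrase]


def external_merge(head: Item, complement: Node) -> Phrase:
    """Combine two distinct elements, with one of them as the head."""
    return Phrase(head=head, complement=complement)


def spell_out(node: Node) -> str:
    # Assumption: the head is pronounced before its complement.
    if isinstance(node, Item):
        return node.form
    return f"{spell_out(node.head)} {spell_out(node.complement)}"


in_car = external_merge(Item("in", "preposition-like"), Item("car", "noun-like"))
open_door = external_merge(Item("open", "verb-like"), Item("door", "noun-like"))
print(spell_out(in_car), "->", in_car.label)
print(spell_out(open_door), "->", open_door.label)
```

Because a Phrase can itself serve as a complement, the same operation can apply to its own output, which is the sense in which the combination can continue indefinitely.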

In “Daddy upstairs” by two branchings, one defective, Daddy is defined by a second branching, in this case, as the subject.

The contrast between the elements expresses the simplest possible structure with a definable spine, in this case with Daddy dominating X upstairs, where X is an unrealised abstract element.

  • With the branching applying to just two elements, sisters in the framework here, headship is thus essentially a relation between prototype syntactic elements, both with parts. In the modern child’s process of acquisition, by the simplest interpretation, “In car” involves two elements, car, essentially noun-like, and in, as a step towards a preposition. “Open door” contrasts noun-like door and verb-like open.
  • External Merge signals the first step towards distinctive ‘parts of speech’ as these are called by traditional grammar, nouns like car, Mummy, Daddy, verbs like want and like, prepositions like in. The items become differentiated, as the only sorts of expression on which grammatical operations can be defined. On the simplest plausible readings, open and in are plainly heads. Both elements can now express formal relations, as head and complement, to the expression as a whole. There is a grammatical relation between them, each with an irreducibly necessary, structural role. But it seems premature to regard such elements as fully-defined nouns, verbs and prepositions;
  • The elements have features which define the interaction, as opposed to some purely accidental relation, as by ooh, eh, ah, yes, hello, goodbye, and so on, all independent from the grammar because they can stand on their own, sometimes adjoined to it, but not by External Merge;
  • By virtue of the headship role of one element by each branching, and developing an idea from Nunes (2002), it is possible to combine different sorts of ‘underspecification’, reducing the lexical storage to the minimum, expanding it only for the purpose of clear pronunciation.

4. Internal Merge (to use another Minimalist term)

By a further reduction of the elements, these can be drawn from the set of elements that have already been selected.

In English, questions beginning with words like where (or their equivalents in most languages other than English) are asked with where mostly on the left of the structure, as in “Where are you going?” But in a full-sentence answer, the requested information is on the right, as in “I’m going to the shops.”

Building on the featural and combinatorial properties by Lexicon, Recognise and address and External Merge, Internal Merge builds an array of a given set of lexical entries, from which it is possible to extract a particular, suitable item which has already featured in a previous step of the derivation.

By the first three steps proposed here, what John Langshaw Austin (1962) called the ‘force’ of a structure was an accident of the structure itself and the circumstances in which it was uttered. Such a grammar was most likely prone to deep and frequent misunderstandings.

Necessarily in English, the force of the question in “Where are you going?” determines a reversal of the sequence of you and are from the sequence which would be followed in the response statement “You are going to the shops” or I and am in “I am going to the shops”.

Typically, as in English, an internally merged item, B in the diagram below, is then not pronounced at its point of origin.

For example, Joe at 1;10 (3), says “Daddy upstairs” where Daddy seems to be the ‘subject’ in traditional terminology, and at 1; 10 (27) “Where Daddy?” with where seeming to define a clear question. In the first, the B element is only extracted once. But in “Where Daddy” it is extracted with a notion of location, and then extracted once again with the force of a question.

At 1; 4 (27) Frank is asked, “Who wants some chips?” And he replies “Me”. And at 1; 5 (9) he asks “Where chicken?” Both the appropriate answer to a who question and the where question 12 days later suggest a grammar capable of making two uses of the same element, once to define an identity or location, and then, by Internal Merge, with the force of a question.

Here where and who are fulfilling a special role as the sister or ‘specifier’ of the head of the A B structure. In the framework here, it is a universal.

In Joe’s “Where Daddy?” at 1; 10 (27), where is pronounced on the left and interpreted on the right of the structure (shown in grey).
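The idea that an item is pronounced in one place and interpreted in another can be sketched in Python. This is a deliberate simplification: a flat list stands in for the branching structure, and the names Copy, internal_merge and pronounce are invented for the illustration. The starting order, with where on the right, is taken from the description of where it is interpreted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    form: str
    pronounced: bool = True


def internal_merge(structure: list[Copy], index: int) -> list[Copy]:
    """Re-merge the element at `index` at the left edge of the structure.

    The lower copy stays in place for interpretation but is marked silent.
    """
    moved = structure[index]
    lower = Copy(moved.form, pronounced=False)
    remainder = structure[:index] + [lower] + structure[index + 1:]
    return [Copy(moved.form)] + remainder


def pronounce(structure: list[Copy]) -> str:
    """Spell out only the copies that are marked as pronounced."""
    return " ".join(copy.form for copy in structure if copy.pronounced)


# Assumption: a flat list stands in for the branching structure.
base = [Copy("Daddy"), Copy("where")]
question = internal_merge(base, index=1)
print(pronounce(question))  # where Daddy
```

The resulting structure holds where twice, once at the left edge and once, silently, at its point of origin, so the same element is used twice without being said twice.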

Here “Upstairs” might be a plausible child’s answer, taking “Daddy is upstairs” or “Upstairs” as plausible adult-type answers, except that upstairs is treated here as a bare marker of location, questioned by where.

In modern acquisition, between a week and three months after two words are put together, a question is asked or answered involving a question word relating to one of the items in the two word combination, particularly what, where, and so on, signalling points of curiosity, as a key factor in discourse. So children ask or respond appropriately to a Wh question such as “Where Daddy” only after producing a declarative structure involving two corresponding elements, not in the opposite order.

By this step:

  • Structures with what traditional grammar calls ‘subjects’ can be expressed on noun-like elements. Thus I has the special role of expressing a subject, a seemingly universal property of sentences. The subject role is purely grammatical or syntactic, as in “There is food on the table” and “It is a shame that you’re ill” where neither there nor it has any semantic role;
  • What are known as ‘thematic roles’, including agency, ownership, location, benefit, destination, or experience, can be expressed;
  • Elements of composed structure can be recomposed at a higher level;
  • Across the system of phonemes, relativities can be marked in contrasts such as those between P and B, defined on a difference in the delay between the release of a closure and the onset of ‘voicing’ by bringing the vocal cords together;
  • The cognitive load of searching for and extracting items from the lexicon is greatly reduced. At this point in language acquisition, the lexicon is expanding rapidly. Simplifying the task of searching for and extracting a word is thus a valuable increase in fitness. This is easily and obviously detectable because questions are now clearly defined as such. Questions with a Wh word can be asked or understood.

Significantly, English also allows forms like “Daddy is where” or “The chicken is where”, typically with where heavily stressed, no longer with the force of a question, but as statements of surprise or astonishment.

5. Projection, Inflection, Tense

While Internal Merge extracts elements from a work space, Projection, Inflection, Tense projects them upwards – necessarily because that is the only possible direction. Such elements, known as ‘functors’, are only definable by their relation to another element within the structure or to the structure itself. Functors are marked in English in obvious ways, easily detected by the child learner. Their sound structure can be reduced, by changing or losing their vowel, or by losing one of the consonants.

What is known as ‘tense’ defines a relation in time to an immediately present or everyday event or situation. English past tense is marked either as -ED, as in sorted, or -D as in lied, or -T as in spilt, or by a change in what is known as the rime as in ate, saw, took, or by the whole form of the verb as in was and went. By the profound insight of Chomsky (1957), this marking of tense is separate from the verb itself. In “Did you tell the truth”, tense is expressed on the word did, known as the ‘auxiliary’. In relation to existence or situation, the present tense of Be is expressed by the word is or its contracted form, written as ’s.
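The separateness of tense from the verb can be illustrated with a small lookup table in Python. This is only a sketch. The table holds just the past forms cited above, the names PAST_FORMS and express_past are invented for the example, and the auxiliary option ignores the fact that be does not combine with did.

```python
# Past-tense forms taken from the examples in the text, with the kind of marking.
PAST_FORMS = {
    "sort": ("sorted", "-ED suffix"),
    "lie": ("lied", "-D suffix"),
    "spill": ("spilt", "-T suffix"),
    "eat": ("ate", "change in the rime"),
    "see": ("saw", "change in the rime"),
    "take": ("took", "change in the rime"),
    "be": ("was", "whole form of the verb"),
    "go": ("went", "whole form of the verb"),
}


def express_past(verb: str, use_auxiliary: bool = False) -> list[str]:
    """Return the word or words that carry past tense together with the verb.

    With an auxiliary, tense is carried by "did" and the verb stays in its
    root form. Without one, the verb must be listed in PAST_FORMS.
    """
    if use_auxiliary:
        # Assumption: the sketch does not exclude "be", which takes no "did".
        return ["did", verb]
    past_form, _marking = PAST_FORMS[verb]
    return [past_form]


print(express_past("go"))                        # ['went']
print(express_past("tell", use_auxiliary=True))  # ['did', 'tell']
```

In the second call the verb itself is unchanged and the tense sits entirely on did, which is the pattern in “Did you tell the truth”.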

The use of grammatical form to talk about time is often a key step in language acquisition. Joe at 1;11 (12) asks “Who’s that?” On the same day, looking at a picture book together, his mother asks: “Where’s the bus?” Joe replies: “There’s bus.” At 1;11 (14) he asks: “Where’s man tractor”. It was not clear whether he meant “Where is the man’s tractor?” or “Where is the man for the tractor?” or something else. The point here is the articulation of the ’s form, as a contracted form of is. The functor here does double duty, marking both the relation to the present and the fact that the question is about a single entity. The fact that it is expressed by a contraction makes it highly visible and thus easy to identify and learn.

Frank at 1; 5 (29) asks: “What is that?” His mother, who made the observation, noted that the is form was clearly articulated.

Showing the new functional projection in bold.


These are the first uses of an is or ‘s form by these children, known as ‘inflection’. Here an element is inserted into the structure and anchored to the here and now of the utterance, reflecting an aspect of the discourse. There is no reason for thinking that there is any contrastive intent here. The child is not also asking things like “What was that?” But the use of the is form is a place-holder for the tense category as this becomes accessible to consciousness. And in a broader sense, the form signals the accessibility of elements which are purely functional, with their own corresponding projections.

6. Measure and compare

By Measure and compare, there is measurement and comparison of degrees of dominance, where any degree of branching implies a corresponding degree of dominance. This is an obviously abstract functionality. But the relation characterises numerous phenomena in the grammar. In English this relation applies to:

  • Successions of verbs as in want to go, where want ranks above go in the structure. Verbs like want are known as ‘control verbs’ because they control the tense, person and number, i.e. all the variable features of the lower ranked verb;
  • Complexity in ‘noun phrases’ defining possession and qualities, as in my little crisp and Mummy’s fingers;
  • What is known as ‘Case’, expressed by the difference between he and him, she and her, as ‘arguments’ in one of various relationships, essentially who is doing what to who, but changing as roles change or speakers take turns to talk. The relation covers pronouns such as I, you, he and she, and what are traditionally known as ‘reflexives’, as in “I hurt myself”. The special role of the ‘subject’ of the sentence is expressed by what was traditionally known as the ‘nominative’ case of I, he, she, we and they;
  • ‘Agreement’ between a nominative subject and the form of a top-ranked verb, as between I and am, or its contracted form ‘m;
  • Passive forms such as broken in broken by a hammer or just broken;
  • Negatives by not and its reduced form written as ‘nt, where the negative form only appears immediately after the form expressing the tense in doesn’t, didn’t, can’t, won’t, and so on.

The scope of operations, each doing just one thing at a time, is restricted by comparing and measuring degrees of dominance in the same hierarchy, dominance to any minimal degree and to any particular, or greater degree.

At this point in their development, in terms of their chronological age, Frank, the younger of the two, was some months ahead of Joe.

At 1; 11 (1) Frank said “Want sit lap” with two root forms of the verbs, want and sit. At 2; 3 (26), Joe said “Want help daddy” where it seemed that he wanted to help his father. In Frank’s case want and sit, in Joe’s case, want and help were separate verbs, one at a higher position in the structure than the other. Want is known as a ‘control verb’ in as much as it controls the tense, number, and person of the lower ranked verb, in these cases help and sit.

At 1; 9 (22) Frank said “I hurt self”, meaning I hurt myself, with I and self having the same reference, and with the reflexive self-form one level down from what is known as its ‘antecedent’ – in this case I. And at 1; 10 (27) he said “I found this” with the ‘nominative pronoun’ I next to the past tense found. At 2; 5 (5) Joe said “doggy licking hisself” with the self form picking up the third person of the antecedent.

At 1; 11 (4)  Frank said “I don’t like it” with the negative n’t  next to the auxiliary do. At 2; 4 (26) Joe said “Mog doesn’t like that”.

At 1; 11 (13) Frank said “I’m making lorry” with ‘agreement’ between the first person pronoun I and the auxiliary am. At 2; 0 (20) he said “I need more that” with more as the head of a complex phrase. Joe at 2; 5 (7) said “I saw lorry pulling car” and on 2; 5 (11) “I took picture of milkman”. In both of these cases there is significant embedded structure, a clause in ‘lorry pulling car’ and a noun dominating another noun in the substructure of ‘picture of milkman’. In all three cases tense and the nominative case defining the subject role are overtly represented in a sisterhood relation. In “I eated that chocolate”, the tense in the verb is manifest in the mistake in eated.

All of these cases involve the measurement and comparison of dominance. The marking of tense on the verb and linking this core element of the structure to the context of the utterance is almost, though not completely, universal. Tense and nominative case constitute the two most sharply contrasting sorts of elements within the hierarchy. Measure and Compare puts these two functionalities at the top of the projection chain.

Going beyond the acquisition data here, these phenomena are reflected in the way both are expressed as the leftmost elements in “I might have been being deceived”. In restricting the measuring and comparing to the edge of the hierarchy, the set of elements involved in the computation is reduced one degree further than by previous steps.

As far as pronouns are concerned, the key data for English is in contrasts like the one between “She says Mummy feels tired” and “Mummy says she feels tired.” In the second, she could be Mummy. But in “She says Mummy feels tired”,  she can’t be Mummy. Such relations are common across languages, raising the obvious questions: How do children learn this? And why should this be? Ever since a seminal (1976) work on the issue by Tanya Reinhart, this has been a hot and continuing topic of debate. All approaches since that of Reinhart have focused on the small size of the domain, as illustrated in the diagram above. But by all of these approaches, to the question: How do children learn this? the answer is that they don’t. And to the question: Why should this be? the answer is that by measuring and comparing two levels of dominance, minimally different in the examples here because the elements are sisters in the structure, the structure is significantly reduced. What the child has to absorb is not a highly arcane restriction on the reference of she and Mummy in the examples above, but a much more immediately familiar principle of measurement and comparison.
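The contrast between the two sentences can be made concrete with a small sketch. This is not a formalisation of Measure and compare as proposed here. It swaps in the textbook relation of c-command as a stand-in for comparing degrees of dominance, and the tree shapes, the nested-tuple representation and all the function names are assumptions made purely for illustration.

```python
# Binary-branching trees as nested 2-tuples; leaves are words.
# Both structures are heavily simplified and assumed for illustration only.
SHE_FIRST = ("she", ("says", ("Mummy", ("feels", "tired"))))
MUMMY_FIRST = ("Mummy", ("says", ("she", ("feels", "tired"))))


def path_to(tree, word, path=()):
    """Return the branch path (0 = left, 1 = right) from the root to a leaf."""
    if isinstance(tree, str):
        return path if tree == word else None
    for index, child in enumerate(tree):
        found = path_to(child, word, path + (index,))
        if found is not None:
            return found
    return None


def depth(tree, word):
    """Number of branchings above the word: a crude measure of dominance."""
    return len(path_to(tree, word))


def c_commands(tree, a, b):
    """True if the sister of a is b, or dominates b (standard c-command)."""
    path_a, path_b = path_to(tree, a), path_to(tree, b)
    mother = path_a[:-1]
    inside_mother = path_b[:len(mother)] == mother
    inside_a = path_b[:len(path_a)] == path_a
    return inside_mother and not inside_a


def may_corefer(tree, pronoun, name):
    """Hypothetical rule: a pronoun cannot share reference with a name it c-commands."""
    return not c_commands(tree, pronoun, name)


print(may_corefer(MUMMY_FIRST, "she", "Mummy"))  # she can be Mummy
print(may_corefer(SHE_FIRST, "she", "Mummy"))    # she cannot be Mummy
```

On these assumed structures, the only thing that differs between the two sentences is which of she and Mummy sits higher, so the restriction falls out of a comparison of positions in one hierarchy rather than from a separate rule about reference.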

By a spine-based universal grammar, these things can be encoded in ways that vary across languages, but using the same, universal template. By this sixth step, degrees of  dominance are measured and compared, imposing a ceiling on specified relations, abstractly A and B, at the top of the spine at a given point in the derivation.

This allows a special relationship between I and am and between she and is, one denoting the nominative case of the subject and the other denoting the most immediate aspect of the here and now in the discourse. In most languages including English, the key aspect of the here and now is related to time, represented as the tense of the verb, as in the differences between I am and I was and I have and I had. Nominative case is purely grammatical, with no thematic role or relation to the here and now or the needs of communication.

In “I may seem to be asleep” the thematic role of I is plainly not a function of the main verb, seem, but of the embedded verb, be. “I may seem to be asleep” means the same thing as “It may seem that I am asleep”, but the structure is quite different. I with its marking of Case and the tense of be get shunted upwards or ‘raised’ by successive steps of projection, each step by a separate process, shown here by the arrow in a simplified diagram of the tree.

The process can be continued as in “I may seem to want to be asleep” with a different meaning, but still with I immediately followed by the tense bearing may. The sense of the tense-bearing element has been lost in the history of English. “I might seem to want to be asleep” means almost the same thing. But without may or might, in “I seem to want to be asleep” or “I seemed to want to be asleep”, the tense difference is clear and overt, again immediately next to I.

The expression of these levels of the hierarchy, for noun and verb like elements, varies from language to language. These are things the language learner has to learn. They fall within the learnability space. By the proposal here, Measure and compare is universal. But expressed in terms of the spine it is biologically encodable, and thus readable within the human genome. It is precisely the abstractness of Measure and compare which makes it both universal and a heritable aspect of UG. But despite the universality of the  core principle here, the way it works in English is complex and hard to learn. Internal Merge, Projection, Inflection, Tense and Measure and Compare are mathematically powerful devices. Applied to the output of one another, they can be exploited indefinitely. Such a grammar strains both processing and production. It may not be completely learned under the condition of finite learnability. There may have been wide variations in the mastery of the grammar, as in all other areas of human skill from musicality, to art, to athleticism of all sorts.

7. Phase and Complementiser (or Sentence)

By this final step, the grammatical apparatus is factored into the smallest possible components, further reducing the set of elements at any point in the derivation.

This follows a consistent approach from Chomsky’s first widely circulated (1957) work factoring the grammar into two sorts of rule, then a division between deep structure and surface structure by Chomsky (1965), then the effect of a barrier with respect to what was at the time considered to be the ‘movement’ of elements such as what and where by Chomsky (1986), then by Chomsky (2000) with the ‘spelling out’ of the minimally necessary information to two interfaces in separate ‘Phases’.

A phase is defined by the fact that as soon as it has been completed, most of its structure becomes inaccessible to the ongoing process of derivation. A phase may be expressed by only one word. But it can have a special analytic status, as in the case of the Wh words. As by the primordial step, decomposing and recomposing the elements of the lexicon, the information that is sent to the two interfaces (for physical expression and understanding) has to be detailed and complete.

Here the phases are shown as alternating light black and heavy red branchings, at least two for each clause, the light black roughly representing the propositional content, and the heavy red roughly representing the ‘force’. These are not the steps proposed here, but the effects of the final step.

By this seventh step, the spine maps onto the formal structure of UG. The phasing becomes detectable only by the last of the seven steps.

The novelty, by a phased approach to syntax, is that the factoring into two phases is often repeated. The first phase often spells out the referential and propositional content, the second phase spells out the full complement of what Austin called the ‘illocutionary force’ of statements, commands, questions, pleas, and so on.

By the proposal here, both in the evolution of language and in the acquisition of language by modern children, the factoring of the grammar by phases is (necessarily) ordered last.

Showing accessible elements in bold red, and with lower, earlier, inaccessible elements lighter, with only the head and edge accessible.

There are thus two phases in most clauses, even if the second phase is not represented by any overt structure, but just by the fact that a ‘simple’ proposition is also a statement of fact, which may be contradicted in jest or irony, as represented by the second phase. Thus the first phase is mainly grammatical and the second phase often has a significant discourse element.
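The accessibility effect of completing a phase can be sketched in a few lines. This is a minimal illustration under stated assumptions, not an implementation of the proposal here: the `Phase` record, the `complete` and `can_target` functions, and the way “I know what you want” is divided into edge, head and complement are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    label: str         # hypothetical label, e.g. "proposition" or "force"
    edge: tuple        # elements at the edge, still visible from above
    head: str          # the head of the phase, also still visible
    complement: tuple  # the rest, handed over to the two interfaces


def complete(phase):
    """Close a phase: return what stays accessible and what is spelled out."""
    accessible = phase.edge + (phase.head,)
    spelled_out = phase.complement
    return accessible, spelled_out


def can_target(phase, element):
    """A later step may only see the edge and head of a completed phase."""
    accessible, _ = complete(phase)
    return element in accessible


# The embedded clause of "I know what you want", very roughly, as one phase
# with "what" at its edge and an unpronounced head written here as "C".
embedded = Phase(label="force", edge=("what",), head="C",
                 complement=("you", "want"))

print(complete(embedded))
print(can_target(embedded, "what"))  # the edge remains accessible
print(can_target(embedded, "want"))  # the complement does not
```

The design point is only that each completed phase hands on a small, fixed amount of material, so the set of elements open to the computation does not grow with the length of the sentence.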

By the seventh step, an expanded notion of force was defined as ‘Complementiser’ or C, as the topmost level of the spine, and replacing the traditional notion of a ‘sentence’. C provided a host for what in expressions like “What did you say you thought I said?” with what as the complement of I said at the opposite end of the structure.

In simple, declarative main clauses, C is not expressed in English. But it is the destination or landing site of words like what, where and when and expressions with which in questions seeking particular items of information. By the 1997 proposal of Luigi Rizzi, the ‘force’ of the structure is expressed as a property of C. This applies no matter whether the structure is a statement or a question, or whether the agency of the subject is diminished by passivisation or in some other way.

So for example, Joe at 2; 9 (4) asked “When’s Daddy coming back?” with the Wh morpheme when projected onto the uppermost level and the contracted auxiliary ‘s stuck on its right edge as what is known as a ‘clitic’.

By characterising this level as that of C, every level from the bottom of the structure to the top is defined in the same way, rather than by giving the sentence a special status of its own, one that is hard to define other than in a purely circular way.

Three weeks after “When’s Daddy coming back”, at 2; 9 (28) Joe produces his first sentence with multiple embeddings and a Wh word not introducing a question, in “I want to stand on the chair to see what’s happening”. Crucially, there is no search for information here. Simplifying slightly:

The other child, Frank, up until this point the more precocious of the two, at 2; 10 (21) says “I want to sit where Joe’s been sitting.”

Taking the two children together, these are almost identical sentences with multiple embeddings, accidentally or otherwise, and with full grammaticality. The exactness of the similarity between two utterances in two children two and a half years apart, only noticed forty years later, would seem to suggest that there is significance in such structures with a Wh word specifying an embedded clause, and not forming a question.

As by the examples above, Phase allows the derivation to proceed in steps, as by the process of evolution. But the full application of the principle here takes years to learn. At 9; 9 (13) Joe said “We don’t know whether I’m going to be picked up by who” (of the rather complicated child care arrangements we had in place at the time to allow me to go to a university 70 miles away one day a week). Joe’s sentence is anomalous in as much as who seeks particular information and whether seeks only a truth value. But the structure of two Wh words in the same clause calls up the Phase functionality in a significant way.

By this seventh step:

  • Building the derivation in phases allows clause structure to develop, while at the same time limiting how much of the derivation can be manipulated at any one point, reducing to the minimum both Search and the speaker’s and language learner’s tasks in constructing a derivation, allowing complexity to be distributed across it;
  • No matter whether Wh words like where and what are used in questions like “What do you want?” or as introducing an embedded clause in a statement like “I know what you want”, they are crucial to the system of Phase, the last step in the evolution and development of Universal Grammar;
  • Information is sent bit by bit to the articulatory system to be pronounced and to the semantic / conceptual system for the meaning to be analysed. English marks the point of sending articulatory information much earlier than ‘agglutinative’ languages like Turkish, and many others, with what seem to the speaker of a language like English to be hugely-complex ‘words’. So this point necessarily falls within the learnability space;
  • Hypothetically it is Phase which makes speech and language finitely learnable, at least for the overwhelming majority, giving humans a unique capacity among all species alive on the planet;
  • A commonly shared competence can be assumed across the whole population – a huge advantage for a small and highly vulnerable species;
  • Metalinguistic awareness is brought into being;
  • Fantasy, fiction, non-fiction, irony, fun, comedy, contracts, all become parts of everyday life.

At the very latest, the last of these steps, like the rest of the apparatus for modern speech and language, must have been completed by around 70,000 years ago when modern humans started spreading across the planet. Otherwise there would be modern populations without the same faculty. However, it seems more likely that the modern, fully developed faculty was completed much earlier in the time span of Homo sapiens. By common consent at the First Conference on Biolinguistics in 2022, this point is most likely to have been reached around 150,000 years ago. By around 130,000 years ago the first indications of modern culture started to appear, as documented by Marean and others (2017). Coincidentally, it was around the same time that humans almost became extinct, becoming reduced to a population which may have consisted of only 1,000 or so individuals. The enhanced cooperation by Phase may thus have saved the human race from extinction.

The first step, allowing a prototype lexicon to start to develop, must have been correspondingly much earlier.

Fifteen points

  1. Pathway. There is an evolutionary pathway from the first human words to the competence needed to explain to an apprentice some aspect of professional skill (such as the subtleties of flint knapping) and a corresponding developmental pathway from the modern child’s first words to his or her fully-mature, adult speaker’s competence ten or so years later. The evidence here confirms Darwin’s hunch that modern humans share a common ancestor with African apes. The evidence of DNA shows human ancestors diverging from chimpanzees at some point between six and ten million years ago. (See Søren Besenbacher and others (2019) for a recent contribution. But the exact timing of the divergence is irrelevant here.) It is assumed here that at the point when the two ancestral populations diverged, they shared a system of communication essentially similar to that of modern chimpanzees. Some chimpanzee calls, like the shriek of pain as an individual is attacked, vary in intensity according to the severity of the attack, and crucially, are understood that way by other chimpanzees. But an infinity of calls by the grading of fear, pain and distress is quite different from any of the infinities considered here. It may be that in some species, particularly birds, different calls can be combined, sometimes one inside another. Observations to this effect have sometimes been interpreted as countering the claim that recursivity in the system is human-specific. Similar interpretations have been made of variations in the calls by chimpanzees, vervet monkeys and others. It may be that some species, particularly chimpanzees, are better able to remember a large repertoire of calls. And other species are better able to understand them separately and to combine them. A combination of two calls goes beyond a mapping. But it is limited to the square of the number of calls. The output is finite.
The interpretation that either of these evolutions is a step on the evolutionary path to human speech and language seems to me to miss the point that the end-state of linguistic competence involves the simultaneous variation of both form and meaning. It is this, I propose, which allows the human-specific property of free compositionality and a similarly human-specific pathway to linguistic competence in normally developed human adults, taken for granted in all human cultures. But this leaves the core question, how did this human specificity emerge? By the proposal here, it can only have emerged by minimal steps in a population with a well developed communication system mapping calls to meanings, but making a fundamental cognitive break by defining the relation here. In the newly diverged human ancestors, the number of calls may have increased in response to the extreme dangers and opportunities of a ground-based environment. If that happened, the capacity to remember and understand different calls may have been pushed to the limit, perhaps beyond the limit. At some point at least one call was reconfigured in such a way that it could constitute the first step to modern speech and language. In a way not possible from a mapping, both sides of the form / meaning relation have to be reconfigured. From the fact that there are corresponding phenomena in all languages, it is reasonable to suppose that the seven specifically linguistic steps were made separately, as an evolutionary sequence, in a population from which all humans alive today descend. At each point when a necessarily very obvious and visible step was taken, this was valued throughout and across a population. It had to be, or it wouldn’t have diffused and fixated.
But while the capacity for speech and language clearly distinguishes humans from any other animal, and mostly develops naturally without any active intervention, this is obviously not the case for all, with 1 child in 10 having minor problems with speech and language, 1 in 1,000 having major problems, and perhaps 1 in 100,000 being unintelligible in adulthood other than to close family and friends, if at all.
  2. Interfaces. By conceptual necessity there are at least two interfaces, one involving physical expression (either by speech or by sign), the other involving the analysis of meaning. Information has to be sent to these interfaces in suitable, necessarily-different forms. The distinctively human lexicon is defined by the way this relation is defined in the brain at the point of acquisition. Both interfaces are limited by general factors of human cognition and the physical universe, including the acoustic phenomenon of sounds dying away as the energy is absorbed by the atmosphere or the fact that once a sign is replaced by another, the first is gone forever. The proposal here makes no claim about how the cognitive evolution of speech and language connected up with cognition itself or with any of the physical changes. We just have to note that these changes were in one species, and they would seem likely to have complemented one another.
  3. Seven steps. The Faculty of Language, FL, as it currently exists, could not have evolved its precise character other than by steps. Children learn to talk the way they do by discrete, necessarily ordered steps, each one unconscious, most of them originally proposed by Chomsky, by which FL evolved in the human species. This is understanding FL in a very broad way, distinguishing all the various uses to which language is put, including the case of irony where the intended meaning is the exact opposite of the literal meaning. The steps proposed here provide a framework for the much more narrowly defined structures of what is known as Universal Grammar, UG, and a foundation for language acquisition. The seven steps postulated here were taken by a species which had forsaken the safety of the trees for a much more dangerous life on the ground, after making at least six significant precursor adaptations. This was a population which plainly lived on its wits – or died. The population remained very small, almost dying out at one point, but ranged across Africa while what is now the Sahara desert was forested and well-watered. Within this population, individuals or groups of individuals must have started restructuring some of their expressions in detectably advantageous ways, but by no more than one term at a time, so that, over the course of thousands of generations, the innovation could diffuse across the population, and (separately) become part of the genome. Following Progovac (2015), there must have been a series of protolanguages, each likely to have left fossils. As the linguistic genome evolved, the effects of the steps interacted with one another, giving the complex variations which Roberts (2022) characterises as ‘building blocks’. An abstract Universal Grammar UG is derived from the evolution of the human species. 
But there is cross-linguistic variation in how it is used – for example in which parts of the sentence structure are projected where – with global effects on word order and other aspects of what is commonly characterised as ‘grammar’. While a language may not express one or more parts of UG, all languages, spoken and signed, are built from it. The universality here is only very partially expressed at birth, just as the abilities of particular bird species to hover, stoop, dive and soar, are expressed only as the fledgling develops. But all of these steps are entirely unconscious, implemented too fast to be conceivable as conscious acts. In a way more complex than the particularities of bird flight, the full integration of UG elements into FL continues until at least ten or so for most children. The evidence for the evolutionary sequence proposed here is from the acquisition of language, language disorders, the differences between languages, creoles, signed languages, the commonalities of unrelated languages, the detailed examination of any one language – for our purposes here, English, and the special, possibly world-unique, case of Nicaraguan Sign Language. The acquisition evidence is from the similarities between the examples given and the fact that they occur in a matching sequence across all members of a sample of children. While there is no reason for assuming that modern speech and language acquisition exactly replicate their evolution, there is every reason to expect significant parallels, as in other areas of comparative biology.
  4. Timescales, tools and talk. The evolution of speech and language began earlier and proceeded more gradually than by Berwick and Chomsky’s proposal, but still very briefly for a change of such complexity and significance. From paleoanthropology, it is possible to define an earliest plausible beginning to the evolution of speech and language – no earlier, I submit, than the first manufactured tools intended to last (by the results of Sonia Harmand and others (2015) about 3.3 million years ago), and no later than the point when anatomically modern Homo sapiens developed modern skills like being able to sail across several hundred miles of open sea to Australia about 70 or 65 thousand years ago. I propose that human style stone tool making must have preceded the evolution of language because knapping flint requires an awareness of geometry, quite different from and cognitively far beyond the various skills exhibited by chimpanzees and other non-humans. The geometry of knapping relates to the position and direction of the first blow and the angle of the shockwave which breaks the flint. This was plainly an enormously difficult cognitive step, taking anything from three to six million years from the point at which human ancestors diverged from chimpanzees. In every case over this period of at most 3.1 million years, the likely evolutionary time scale of the steps proposed here is by hundreds of thousands of years or thousands or tens of thousands of generations, in contrast to normal, modern-child development over months and single years. In the absence of any evidence that acquisition proceeds differently from evolution, acquisition and new language formation may provide the closest approach to direct evidence of the possible, probable, or even necessary course of speech and language evolution. This says nothing about the exact time scale here for each step or about how quickly the steps in language evolution fixated across the ancestral population.
If it took modern human ancestors at least 3 million years and possibly twice that long to learn the essential geometry of knapping, it would seem reasonable to assume that the incomparably more subtle process of encoding UG on a spine must have been similarly challenging.
  5. Encodability. By each step, by evolution and by modern acquisition, the human organism’s sensitivity grows to a particular, mathematically defined, degree of infinity manifest in language. The sequencing of the steps is by five necessary factors, by the internal logic of the steps themselves, the dictates of discourse and conversation, general human cognition, the criterion of heritability, and the mathematical representation of biology. The last two factors are justified by the clear evidence of biological factors in disorders of all sorts, including stammering and problems with the articulation of words and putting them together in grammatical structures. All of these things run in families in ways not accountable by immediate contact. For instance, a child can sound like an uncle or aunt at the same age or a close relative brought up in another language. Such sorts of genetic evidence are found in around 30 percent of all disorders. But biology does not, cannot, operate with any properties defined solely on linguistics, such as consonants, vowels, or sentences. Nor can it be defined on graphics like upside down versions of a child’s image of a bird in flight or of pyramids without bottoms. The binary branchedness assumed here (on the basis of 40 years of research on this point) has to be definable in a way that can be encoded mathematically, as set out by Matilde Marcolli (2022). This biological ‘encodability’ allows linguistic structures to be entered into a computation in a way applying to any natural language, whether spoken or signed, now fixated as a defining genomic character of our species, anatomically-modern Homo sapiens. As shown by Sandiway Fong (2023), this genomic factor is limited by neuro-physiology; synapses take around a millisecond to transmit from one nerve cell to another, and much longer to recover. Given the complexity of what has to be transmitted, this is slow.
  6. The reverse of discourse. Language, as characterised by FL, is defined on structures, which are put together so as to lay the basis for a potentially infinite output. FL contrasts with discourse, anchored in the here and now of conversation, expressing the use of language to relate utterances to the context in which they are uttered, to express emotions, to interest, to entertain, to elicit information, or to be ironic by reversing the overt sense of an utterance. Even at the very beginning of the evolution, it is possible to imagine discourse functions as soon as there are distinct meanings in particular expressions. Modern language is both used for discourse and subject to FL. FL is structured in such a way that meanings can be both shared and defined. But the structure of FL is quite different from the recognition of other speakers and other points of view in discourse. Discourse and FL are separate domains. Neither makes sense without the other. By the proposal here, both are separately articulated in relation to grammar. There is interplay between the two systems in both directions, with syntactic expressions becoming curses and attitudinal expressions getting turned into words. Plainly, the first words are not exclusively defined by either system. As a child’s language develops, these articulations of discourse and FL become increasingly well-defined.
  7. Reducing the infinity. In a rather surprising way, each of the steps postulated here in the evolution of FL with its infinite generative capacity, involves the step-wise reduction of infinities. The first step first decomposes, then recomposes, two sorts of unlike atom, one expressive, the other semantic, and defines this relation for what it is. Decompose and Recompose gives a starting point for this aspect of evolution and ontogeny. It allowed what is known as Universal Grammar, UG, to start evolving. By the proposal of Martina Wiltschko (2014), UG is defined on a ‘spine’ or a headed decision-tree, with binary branches, rather than on a set of grammatical functionalities such as passives, as in “She was hit by a falling tree”. Language-specific variations, such as the form of the passive, are defined on derivations from the spine, interactions between these derivations, and the ways that these things are implemented in speech. These variations are part of the learnability space – what has to be learnt in different ways according to the language being learnt, as finitely varying points of variation known as ‘parameters’. By the proposal of Chomsky (2000) and much subsequent work by Chomsky and others, the derivation is factored into ‘phases’, each defined on a minimal set of elements, such that at any given phase, much of the content of previous phases is no longer accessible to the computation. For example, the ‘illocutionary force’ of a structure in the terminology of John Langshaw Austin (1962), as a statement, question, entreaty, and so on, is defined on a phase higher than the phase or phases defining its propositional content. In “What did you say?” the word, what, and the whole sentence are different sorts of syntactic object, with what having a special status in relation to the illocutionary act. By the proposal here, the spine is itself a mathematical structure.
This reduction of the infinity arguably makes the grammar finitely learnable, as it patently is. By conceptual necessity, this has to be the last of the steps proposed here. Both the notion of a phase-based grammar and the term ‘spine’ are now widely accepted. By the proposal here, the spine itself is phased.
  8. The recognition of fitness. Each step must have been noticed and recognised by potential mates for what it was, a greater fitness, leading to a consistent bias in mate selection, ensuring that it eventually became inheritable. In terms of statistical dynamics, the greater fitness may have been marginal. But a slight bias applying consistently over one or more thousands of generations can effect a change in the genome.
  9. A single sequence of steps conferring a single faculty. By the terms of this evolution, the sequence of steps was necessarily over an extended period. Resurrecting the approach of Chomsky and Halle (1968), the seven steps here give speech as well as language. Crucially this evolution provided a grammatical apparatus which was, and is, freely used in assembling words together and in the building of speech sounds, in ways that the child has to learn. Parts of this apparatus may be over-used in the process of speech acquisition so that children often use devices in the building of words which should be used only in the assembling of words into sentences. The apparatus is such that speech-disordered children from different generations or parts of a family often have recognisably similar issues. By virtue of the sequence, the grammar becomes available in parts, so that questions can be asked and answered in a rudimentary way. A normally developing child of two and three quarters can say “A clock tells you what time it is”, displaying the first evidence of Phase long before the full functionality of the grammar has emerged, as it normally has around seven years later. All that can be said is that the encoding is separate from the communicative advantages. For instance the convenience of using pronouns has no obvious relation to the unobvious adjacency of levels on the spine. The difficulty of the translation here would suggest that this may have taken many thousands of generations. But as soon as the spine relation was established, the translation may have been simpler and faster, perhaps greatly so. But if the proposal here is on the right lines, one or more of the last evolutionary steps may have occurred after the divergence between the main line of modern human ancestors and Neanderthals and before more or less anatomically modern humans appeared in what is now Western Morocco around 300,000 years ago.
Inheritors of the epigenetic changes by the last step in particular would have learnt to talk faster, more accurately, more reliably, and crucially more completely. Referring to something slightly different, we often refer to someone having ‘the gift of the gab’ as a characteristic talent of chat show hosts and comics, uncommonly able to spot and develop a double entendre and more. The first inheritors of Phase would have sounded even more talented, standing out even more sharply in competition for mates. It seems a reasonable conjecture that Phase reached fixation across the ancestral stem of anatomically modern Homo sapiens between 300,000 and 150,000 years ago somewhere in the northern part of Africa, as one cognitive capacity of the new species. It seems likely that Phase critically reduces the learnability space, making speech and language finitely learnable for the overwhelming majority in what Eric Lenneberg in 1967 called the ‘critical period’ for language acquisition, normally ending around the age of ten. This linguistic advance would have marked Homo sapiens apart from any pre-existing human species, including Neanderthals already established in Europe and central Asia. It thus may be that it is Phase which makes language finitely learnable, as it demonstrably is for the overwhelming majority; that language was not finitely learnable without it; and that without Phase, there was a wide range of linguistic competence across the ancestral population, with only a minority having access to anything resembling the complexity of modern grammar. How small this minority may have been, how grievous the effects of relative incompetence were, and in what proportions, where the grammatical defects may have appeared, are all impossible to guess. As Phase and conjecturally finite learnability spread across the population, communication between conspecifics sharing this faculty became critically more reliable.
In dealing with everyday emergencies, at critical points in hunting dangerous prey, in disseminating advances in technique and technology, reliable communication became a decisive asset. Finite learnability and the consequential reliability of communication between conspecifics would seem to have given a great advantage at the point of population survival to those having a phase-based grammar in relation to any group not having it. The difference may have been critical, with Neanderthal mastery of speech and language uneven, with no expectation of common understanding. Neanderthals may have been stuck at the point when only a fortunate minority had a full mastery of their linguistic inheritance, whatever that may have been, and the rest of the population had only varying degrees of competence and little or no metalinguistic ability. In competition for scarce resources, the reliability of communication between conspecifics may have enabled modern Homo sapiens to prevail decisively over the established Neanderthal population in a few thousand years, soon developing the first indications of modern culture in jewelry, wall-paintings, sculpture, musical instruments, not to mention stone tools. While there is a learning process which is normally completed across the whole population, this is not so for all. As Carol Chomsky (1969) showed, many ten-year-olds are still misunderstanding sentences like “I’m asking you what to feed the dog” as “I’m telling you what to feed the dog”. She suspected that some individuals may not proceed to a full understanding of this point. On a phase-theoretic analysis of the error here, the subject of the ‘feed the dog’ phrase is incorrectly not projected up to the topmost phase, represented in this case by the first person pronoun, I.
  10. Infinite generative capacity. Despite the seemingly obvious progress towards the infinite productive capacity of FL by a succession of small increments, the basis for the infinity is already there in the normally developing one-year-old’s first word. The initial merging is very abstract and across infinities so high that the infinity is not obvious. But step by step the infinity becomes more and more tightly defined, and at the same time more and more obviously an apparatus with an infinite productive potential.
  11. Clinical effects. In relation to less than fully competent speech and language, diagnosed as delayed or disordered, the proposal here effects a conceptual economy. Rather than postulating a series of separate disorders, it is possible in principle for parts of UG to be incompletely specified in some individuals. Most developmental disorders involving those aspects of speech and language which are necessarily learned, phonology, syntax and morphology, are the effect of failures in the specification of a genomically defined UG which makes it possible for humans to learn to talk the way they do without needing to be helped, other than to learn what not to say. This makes it unnecessary to postulate a corresponding series of specific malformations. Many common issues are more accurately definable. There are useful points of measurement and focus points for intervention. And the range of plausible interventions is increased. For instance, many children have difficulty with both case and tense, as in “She loves me” where the S in loves expresses both present tense and agreement with the singular property in She. Such children may go on saying things like “Love me” many years after most children have learnt that in a statement, both the she and the S in loves are forced in English. To help children with the common developmental issue here, it may be useful to allow them to discover the ‘sisterhood’ relation between the subject marking of she, known as ‘nominative case’, and the S in loves, known as ‘third person singular’. And to do this, it may be useful, as argued in more detail in Nunes (2023), to focus on the basis of that relation. A history of delayed or disordered speech is likely to be co-morbid with literacy problems. The characteristic multifactoriality of speech and language disorders is predictable. There are likely to be speech errors by misapplying what should be syntactic processes in the phonology.
Many characteristics of child speech are likely to be reducible to the lack of any proper definition of phonemes, syllables, words, and so on. Children with speech and language disorders are likely to have characteristically poor metalinguistics. Many apparent disorders, even some with names in popular speech, such as ‘lisping’, may fall out from a fully worked out theory of speech and language evolution. Another group of children sometimes say monopoly as OPOLI. If such errors persist they can lead to stigma or mockery. Monopoly as OPOLI involves the non-pronunciation of the first three sounds, with only the domain of stress pronounced. The child may be treating the stress domain as the word.
  12. The autonomy of grammar. No version of any of the steps postulated here is reducible to the needs of communication or social interaction. There could not have been any external input because by their very nature, the properties here are strictly internal to cognition. A phase-based spine can only be defined on general, i.e. non-linguistic, principles. It cannot directly reference any categories which would only come into existence by virtue of the evolution. The grammar must encompass the entire apparatus which yields the linguistic categories. A category may seem to occur only very rarely or even in only one of the six or seven thousand known languages. Some categories are idiosyncratic. But if a category occurs at all, the learnability space must be configured accordingly.
  13. A buffer. A system by which linguistic structures of all sorts were derived in real time would seem to have favoured the secondary evolution of a buffer between the derivation and the articulation of speech. Such a buffer is both contingent on the formation of these structures and developmentally vulnerable. By the proposal of Nunes (1994), an incorrect specification of the buffer characteristically leads to stammering. Stammering occurs at a rate of between one and two percent in all human populations. If the functionality commonly characterised as ‘Merge’ has triggered a separate adaptation, the buffer by the proposal here, this pushes the evolution of at least some of the steps proposed here back in time to a point significantly before anatomically modern Homo sapiens started spreading first across Africa and then across the rest of the world.
  14. One stem. All humans alive today must be descended from one African stem. The ancestry may be from more than one point on the stem, which may have migrated and introgressed (see Chris Stringer (2016) and Aaron Ragsdale et al. (2023) for a different point of emphasis). But at a given point of descent, modern UG was necessarily complete. Otherwise there would be groups of humans genetically incapable of ever learning one another’s languages.
  15. Early complexity. The proposal here involves what is sometimes known as ‘early complexity’, on the understanding here, complexity as early as possible, but no earlier and no later. This is to say that as one evolution is built on by another, the earlier evolution cannot be amended. So evolved properties are plausible only at a given point of evolution. No property evolved in this way can be jettisoned on a ‘Use it or lose it’ basis without fatally compromising the rest of the apparatus.
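
The point made in item 8 above, that a marginal fitness advantage applied consistently over thousands of generations can effect a change in the genome, can be illustrated with a minimal sketch. It assumes a textbook one-locus haploid selection model with no random drift, which is not part of the proposal itself. The function name, the selection coefficient s and the starting frequency are all hypothetical values chosen only for illustration.

```python
def variant_frequency(p0: float, s: float, generations: int) -> float:
    """Frequency of a favoured variant under deterministic haploid selection.

    Assumption (not from the text): carriers leave (1 + s) offspring for
    every 1 left by non-carriers, and random drift is ignored.
    """
    p = p0
    for _ in range(generations):
        # Standard recursion for one generation: p' = p(1 + s) / (1 + s p)
        p = p * (1 + s) / (1 + s * p)
    return p


if __name__ == "__main__":
    # Hypothetical numbers: a 1% advantage, starting in 1 individual per 1,000.
    for g in (0, 500, 1000, 2000):
        print(g, round(variant_frequency(0.001, 0.01, g), 4))
```

On these assumed numbers the variant stays rare for several hundred generations and then spreads through most of the population, which is the sense in which a slight but consistent bias may be enough. A model with drift or with diploid inheritance would give different timings.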

A buffer and stammering

Developing a proposal by Nunes (1994), updated by Sandiway Fong (2021, 2023), the evolution of speech and language was bottlenecked by the slowness of transitions across the synapse in contrast to the extreme sensitivity of both visual and auditory perception. Evolution addressed this bottleneck in two ways, first by the successive reductions in the infinity by recursive Merge, and second by the development of a buffer allowing finite time for the apparatus by these steps. Like the specification of the steps, the buffer is developmentally vulnerable. By the proposal of Nunes (1994), an incorrect specification of the buffer characteristically surfaces in speech as a stammer. Putting this differently, a stammer has a necessary neurolinguistic component. Familial experience and self-definition are not enough on their own to fully characterise the disorder.

This conclusion is evidenced by the following:

  • Stammers occur in all known human populations at a rate of between one and two percent;
  • In all the vast clinical literature about stammering, there are no reports of stammering on first words;
  • By a series of discoveries in the early 1950s, reactions to Delayed Auditory Feedback (listening to one’s own speech played back with a delay) are quite different in normal speakers and stammerers; the normal speakers stammer and the stammerers stop stammering, or stammer much less.

A good start

Given the simple principle of binary branching, in relation to the phonology of phonemes, English just happens to pursue this branchedness further than most languages. But some languages, Polish for example, go further with even more structure before the onset. All of this falls within the learnability space, and is often problematic in children’s speech development.

The tree can just develop, adding branches, up to some limit, as by the structure in strange. Here the long vowel is shown as AE, where the two elements are separated in the spelling. The final GE by the spelling is shown as a single J, representing the fact that this is just one sound. But it is also a sound with two halves, known as an ‘affricate’, beginning with a complete closure, shown here with a D, and ending with a fractional release of the closure, shown here as ZH, like the sound at the end of beige and rouge. Respecting the binary branching, the initial S is shown as a dependent off the left edge of the syllable.
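
The structure just described for strange can be sketched as a nested data structure. The labels S, AE, J, D and ZH come from the description above, but the exact bracketing below (how onset, nucleus and coda are grouped) is an assumption made for illustration, and the names used are hypothetical.

```python
from typing import Union

# A node is either a leaf label or a pair (left, right): binary branching only.
Tree = Union[str, tuple]

# Hypothetical bracketing of "strange"; the grouping is an assumption.
ONSET = ("T", "R")
NUCLEUS = ("A", "E")        # long vowel AE, two elements
AFFRICATE_J = ("D", "ZH")   # J: complete closure D plus fractional release ZH
CODA = ("N", AFFRICATE_J)
SYLLABLE = (ONSET, (NUCLEUS, CODA))
STRANGE = ("S", SYLLABLE)   # initial S as a dependent off the left edge


def is_binary(tree: Tree) -> bool:
    """True if every non-leaf node has exactly two daughters."""
    if isinstance(tree, str):
        return True
    return len(tree) == 2 and all(is_binary(daughter) for daughter in tree)


def leaves(tree: Tree) -> list:
    """Leaf labels read from left to right."""
    if isinstance(tree, str):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])


if __name__ == "__main__":
    print(is_binary(STRANGE))
    print(leaves(STRANGE))
```

The check in is_binary makes the constraint explicit: a flat three-way grouping such as S, T, R under one node would fail it, which is why the initial S is attached outside the syllable in this sketch.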

A theory of speech and language acquisition?

The proposal here is NOT a theory of speech and language acquisition. It does not take account of psycholinguistic considerations such as auditory memory and auditory discrimination. It is just assumed here that the domain of speech and language is large and complex, and that the process of mastering it is likely to begin with only occasional successes and much more common failures.

The steps by the proposal here are like insights which are grasped at first tentatively and only gradually with confidence. Advances towards adult competence are likely to be occasional, hence the recommendation here about keeping a diary.

Precursor cognitions and follow ons

Following six precursor cognitions, some of the seven specifically linguistic steps which I propose can be exemplified very approximately in the language development of a modern child – except that what the child is hearing is a fully-developed, modern language, and the child is the inheritor of a corresponding genomic capacity, albeit with only the gestural half of the first step, characterised here as Lexicon, manifested in babbling.

There may have been more steps. Or there may have been, as Chomsky suggests, a single, out-of-the-blue mutation of quite extraordinary power, defined by Matilde Marcolli’s Hopf algebra, by what Chomsky calls ‘a minor rearrangement of neurones’, as the singular cognitive achievement of modern Homo sapiens. But my proposal extends Chomsky’s by an evolution over a much longer time scale with steps before and after Chomsky’s. This just seems to me a biologically well-motivated way of reconciling the evidence of human speech and language, as they currently are, biology, neurology, archeology, paleo-anthropology, genetics, delays, disorders, and the random cases of two individual children. The fact that the two children were brothers, living in the same family home, may have influenced the areas of their attention and interest. But it cannot have had any bearing on whatever part genetics may have played in their focus on the formalities of Universal Grammar, as documented here.

The proposal here is a sketch. Like all research proposals it has to be developed further – in this case by at least another ten years of work.
