

What makes us human: seven connected steps

While a baby is still in the womb, a special sensitivity starts to develop, one which has no equivalent in any non-human species. As Peter Jusczyk (2000) discovered, against the loud noise of the mother’s heartbeat and digestion, the baby becomes selectively attentive to the sounds of speech, especially the mother’s, which is especially loud because it is heard by direct conduction rather than through the air between the mother’s mouth and her abdomen. This selective sensitivity becomes a general sensitivity to the structures of speech and language which lasts throughout childhood.

Children learn to talk with plainly random variations in what they hear said. But they end up with a common understanding of what counts as interpretable in the language of their family and its environment, what doesn’t count as interpretable, and what means more than one thing. They can and do say things which have never been said before, drawing, in other words, on an INFINITY of sentences. So the learning of speech and language cannot be just by copying what has been heard.

This is a very cursory statement of what is commonly known as ‘the logical problem of language acquisition.’

Three questions:

  • What defines the Faculty of Language or FL?
  • How did FL evolve?
  • How did humans evolve the capacity to learn FL in the way they plainly do, implicitly solving the logical problem of language acquisition?

I propose here that:

  • The evolution of FL should be broken down into a sequence of seven connected steps or saltations;
  • Children’s speech and language acquisition is likely to follow this evolution, at least in broad terms;
  • Where this process goes wrong, as it sometimes does, this may be, and often is, because of an error in the ‘wiring’ of the genetic inheritance.

The properties defined by the sequence proposed here are abstract. But they are no more abstract than the straightness of the line between footfalls which the child is learning to walk along. The straighter the line the more efficient the gait becomes. More energy is used to propel the body forwards, and less to stay upright. This gait allows humans to run for longer on two legs than faster-running prey on four. But few, apart from trainers in athletics, think about the straightness of the footfalls. The straightness just defines how we run and walk. Abstractness is useful in relation to speech and language, as well as for athletics. When the acquisition process goes wrong, some abstractness can usefully guide the process of clinical investigation. What questions should the clinician ask when?

Consider the child of three who seems to have just one word. What does this mean? What is the likelihood of the child learning to talk normally? Have the parents done something wrong? Is the child likely to grow into an adult with a ‘communication problem’? Is there a way of reducing the chances of this? What can be done to help? To answer such questions, it is worth sharpening our understanding of FL and, I submit, its possible evolution.

Inspiration from Minimalism

There are many proposals about the evolution of speech and language. In 1866, the Linguistic Society of Paris banned all discussion of the topic. There may have been a suspicion that research would point towards an African origin, undermining the assumption by most Western intellectuals at the time of white European superiority. But whatever the motivations of the ban, it held for over 100 years.

The inspiration for the proposal here is Chomsky’s 1995 Minimalist Program and its core notion of Merge, combining pairs of atoms, one as the head of the combined expression which can then combine with others, and so on, indefinitely. An atom is mostly, but not always, what is commonly regarded as a ‘word’. (The combinations are in one sense more than words and in another sense less than words.) In the expression “Eat apples”, apple is ‘merged’ with S and eat with apples, with apple and eat as the heads by successive instances of the same procedure. By applying to its own output, Merge can generate an infinity of structures. It applies ‘recursively’. Chomsky and others (2023) argue that Merge is the one and only operation of the syntax – what is commonly known as the ‘grammar’. Robert Berwick and Noam Chomsky (2016) argue strenuously for the notion of Merge by a single, recent, evolutionary mutation. In subsequent discussion, they argue just as strenuously that it is indivisible.
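The recursive character of Merge can be sketched in a few lines of code. This is only an illustration under stated assumptions, not a formalisation from the Minimalist literature: the names `Atom`, `Merged`, `merge`, `show` and `depth` are hypothetical, the head is simply passed as the first argument, and the plural S is written as the atom `-s`.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Atom:
    """A minimal unit, often but not always a 'word' (hypothetical class)."""
    label: str


@dataclass(frozen=True)
class Merged:
    """The result of merging exactly two objects, one of which is the head."""
    head: "SyntacticObject"
    other: "SyntacticObject"

    @property
    def label(self) -> str:
        # Assumption: the combined expression takes its label from its head.
        return self.head.label


SyntacticObject = Union[Atom, Merged]


def merge(head: SyntacticObject, other: SyntacticObject) -> Merged:
    """Combine two objects; the output can itself be merged again."""
    return Merged(head, other)


def show(obj: SyntacticObject) -> str:
    """Render an object as a labelled bracketing, head first."""
    if isinstance(obj, Atom):
        return obj.label
    return f"[{obj.label} {show(obj.head)} {show(obj.other)}]"


def depth(obj: SyntacticObject) -> int:
    """Count how many times merge has applied along the deepest branch."""
    if isinstance(obj, Atom):
        return 0
    return 1 + max(depth(obj.head), depth(obj.other))


# "Eat apples": apple is merged with S, then eat is merged with the result.
apples = merge(Atom("apple"), Atom("-s"))
eat_apples = merge(Atom("eat"), apples)
print(show(eat_apples))  # [eat eat [apple apple -s]]
```

Because `merge` accepts its own output as input, nothing in the sketch limits how many times it can apply, which is the sense in which the operation is recursive.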

Crucially, by this framework, as by all other versions of generative thinking since Chomsky’s first work in 1951, published in 1979, the structures of speech and language are DERIVED. The words in a sentence are not assembled in line from first to last, but according to a binary-branched structure. One aspect of this structure is shown by the equivalence between “Can’t you just do it?” and “Can you not just do it?” with the position of the negative form seeming to vary according to whether it has its vowel or not.

The core motivation of this approach is to explain WHY the structure of a language is the way it is and HOW it is reliably learnt despite the infinite, random variations in children’s language experiences.

The proposal here adopts this derivational approach, and takes the singularity of Merge as its inspiration. The characterisation of this by seven steps treats the steps as part of one sequence. This stepwise treatment of Merge is motivated by the questions above and by the evidence of language acquisition.

The puzzle of infinity

There may seem to be a very obvious progress towards the infinite productive capacity of FL. Children progress from single words, often around the age of one, to being able to join one another and adults in conversations about the best team in the world, or the best player, or the best dressed skater, or the most convincing speaker, and so on, around the age of ten. At first children seem to be adding the various complexities one by one.

  • Bus.
  • Ride bike.
  • Daddy drive car.
  • Lady getting on train.
  • A man is flying the plane…
  • There are bound to be some who think it easy for a man to learn to fly a plane.

Essentially the developmental path seems to be a succession of small increments, until at some point, a long way down the pathway from bus, FL somehow accumulates this capacity of infinite creativity, as by the last example.

But by the proposal here, the basis for the infinity is already there in the normally developing one-year-old’s first word, such as bus. The initial merging is very abstract and across infinities so high that the infinity is not obvious. But step by step the infinity becomes more and more tightly defined, and at the same time more and more obviously an apparatus with an infinite productive potential.

12 points by the proposal here:

  1. Pathway. There is an evolutionary pathway from the first human words to the competence needed to explain to an apprentice some aspect of professional skill (such as controlling the temperature of a fire over a period of hours) and a corresponding developmental pathway from the modern child’s first words to his or her fully-mature, adult speaker’s competence ten or so years later.
  2. Interfaces. By conceptual necessity there are at least two interfaces, one involving physical expression (either by speech or by sign), the other involving the analysis of meaning. Information has to be sent to these interfaces in suitable, necessarily-different forms.
  3. Seven steps. The Faculty of Language, FL, as it currently exists, could not have evolved its precise character other than by steps. Children learn to talk the way they do because of seven discrete, necessarily ordered steps, each one unconscious, most of them originally proposed by Chomsky, by which FL evolved in the human species. This is understanding FL in a very broad way, distinguishing all the various uses to which language is put, including the case of irony where the intended meaning is the exact opposite of the literal meaning. The steps proposed here provide a framework for the much more narrowly defined structures of what is known as Universal Grammar, UG, and a foundation for language acquisition. But UG is only very partially expressed at birth, just as the abilities of particular bird species to hover, stoop, dive and soar, are expressed only as the fledgling develops. By the evidence here, the ‘initial state’ is thus a transition from the time when the embryo is still in his or her mother’s womb to the time when, around three, the building blocks of UG are in place. But all of these steps are entirely unconscious, implemented too fast to be conceivable as conscious acts. In a way more complex than the particularities of bird flight, the full integration of UG elements into FL continues until at least ten or so for most children. The evidence for the evolutionary sequence proposed here is from the acquisition of language, language disorders, the differences between languages, creoles, signed languages, the commonalities of unrelated languages, the detailed examination of any one language – for our purposes here, English, and the special, possibly world-unique, case of Nicaraguan Sign Language. The acquisition evidence is from the similarities between the examples given and the fact that they occur in a matching sequence across all members of a sample of children. 
     While there is no reason for assuming that modern speech and language acquisition exactly replicate their evolution, there is every reason to expect significant parallels, as in other areas of comparative biology.
  4. Tools and talk. The evolution of speech and language began earlier and proceeded more gradually than by Berwick and Chomsky’s proposal, but still very briefly for a change of such complexity and significance. From paleoanthropology, it is possible to define an earliest plausible beginning to the evolution of speech and language – no earlier, I submit, than the first manufactured tools intended to last (by the results of Sonia Harmand and others (2015) about 3.3 million years ago), and no later than the point when anatomically modern Homo sapiens started developing modern skills like the annealing of flint (by the work of Curtis Marean and others (2007), about 130,000 years ago). In every case over this period of roughly 3.2 million years, the likely evolutionary time scale of the steps proposed here is in hundreds of thousands of years, or thousands or tens of thousands of generations, in contrast to normal, modern-child development over months and single years. In the absence of any evidence that acquisition proceeds differently from evolution, acquisition and new language formation may provide the closest approach to direct evidence of the possible, probable, or even necessary course of speech and language evolution.
  5. Encodability. By each step, by evolution and by modern acquisition, the human organism’s sensitivity grows to a particular, mathematically defined, degree of infinity manifest in language. The sequencing of the steps is by five necessary factors: the internal logic of the steps themselves, the dictates of discourse and conversation, general human cognition, the criterion of heritability, and the mathematical representation of biology. The last two factors are justified by the clear evidence of biological factors in disorders of all sorts, including stammering and problems with the articulation of words and putting them together in grammatical structures. All of these things run in families in ways not accountable by immediate contact. For instance, a child can sound like an uncle or aunt at the same age or a close relative brought up in another language. Such sorts of genetic evidence are found in around 30 percent of all disorders. But biology does not, cannot, operate with any properties defined solely on linguistics, such as consonants, vowels, or sentences. It has to be definable in a way that can be encoded mathematically, as set out by Matilde Marcolli (2022). This biological ‘encodability’ allows the structures to be entered into a computation in a way applying to any natural language, whether spoken or signed, now fixated as a defining genomic character of our species, anatomically-modern Homo sapiens. As shown by Sandiway Fong (2023), this genomic factor is limited by neuro-physiology; synapses take around a millisecond to transmit from one nerve cell to another, and much longer to recover. Given the complexity of what has to be transmitted, this is slow.
  6. The reverse of discourse. Language, as characterised by FL, is defined on structures, which are put together so as to lay the basis for a potentially infinite output. FL contrasts with discourse, anchored in the here and now of conversation, expressing the use of language to relate utterances to the context in which they are uttered, to express emotions, to interest, to entertain, to elicit information, or to be ironic by reversing the overt sense of an utterance. Even at the very beginning of the evolution, it is possible to imagine discourse functions as soon as there are distinct meanings in particular expressions. Modern language is both used for discourse and subject to FL structuring in such a way that meanings can be both shared and defined. But the structure of FL is quite different from the recognition of other speakers and other points of view in discourse. Discourse and FL are separate domains. Neither makes sense without the other. By the proposal here, both are separately articulated in relation to grammar. There is interplay between the two systems in both directions, with syntactic expressions becoming curses and attitudinal expressions getting turned into words. Plainly, the first words are not exclusively defined by either system. As a child’s language develops, these articulations of discourse and FL become increasingly well-defined.
  7. Reducing the infinity. In a way that may seem paradoxical, each derivational step has the effect of reducing the infinity. The first step first decomposes, then recomposes, two sorts of unlike atom, one expressive, the other semantic, and defines this relation for what it is. Decompose and Recompose gives a starting point for this aspect of evolution and ontogeny. It allowed what is known as Universal Grammar, UG, to start evolving. By the proposal of Martina Wiltschko (2014), UG is defined on a ‘spine’ or a headed decision-tree with binary branches, rather than on a set of grammatical functionalities such as passives, as in “She was hit by a falling tree”. Language-specific variations, such as the form of the passive, are defined on derivations from the spine, interactions between these derivations, and the ways that these things are implemented in speech. These variations are part of the learnability space – what has to be learnt in different ways according to the language being learnt, as finitely varying points of variation known as ‘parameters’. By the proposal of Chomsky (2000) and much subsequent work by Chomsky and others, the derivation is factored into ‘phases’, each defined on a minimal set of elements, the first (very roughly) defining its propositional content, the second defining its ‘illocutionary force’ in the terminology of John Langshaw Austin (1962), as a statement, question, entreaty, and so on. In “What did you say?” the word, what, and the whole sentence are different sorts of syntactic object, with what having a special status in relation to the illocutionary act. This reduction of the infinity arguably makes the grammar finitely learnable, as it patently is. By conceptual necessity, this has to be the last of the steps proposed here. Both the notion of a phase-based grammar and the term ‘spine’ are now widely accepted. By the proposal here, the spine itself is phased.
  8. The recognition of fitness. Each step must have been noticed and recognised by potential mates for what it was, a greater fitness, leading to a consistent bias in mate selection, ensuring that it eventually became inheritable. In terms of statistical dynamics, the greater fitness may have been marginal. But a slight bias applying consistently over one or more thousands of generations can effect a change in the genome.
  9. One event. By the terms of this evolution, it was necessarily one event. Resurrecting the approach of Chomsky and Halle (1968), the steps here give speech as well as language.
  10. A buffer. A system by which linguistic structures of all sorts were derived in real time would seem to have favoured the secondary evolution of a buffer between the derivation and the articulation of speech. Such a buffer is both contingent on the formation of these structures and developmentally vulnerable. By the proposal of Nunes (1994), an incorrect specification of the buffer characteristically leads to stammering. Stammering occurs at a rate of between one and two percent in all human populations. If the functionality commonly characterised as ‘Merge’ has triggered a separate adaptation, the buffer by the proposal here, this pushes the evolution of at least some of the steps proposed here back in time to a point significantly before anatomically modern Homo sapiens started spreading first across Africa and then across the rest of the world.
  11. One stem. All humans alive today must be descended from one African stem. The ancestry may be from more than one point on the stem, which may have migrated and introgressed (see Chris Stringer (2016) and, for a different point of emphasis, Aaron Ragsdale et al. (2023)). But at a given point of descent, modern UG was necessarily complete. Otherwise there would be groups of humans genetically incapable of ever learning one another’s languages.
  12. Early complexity. The proposal here involves what is sometimes known as ‘early complexity’, on the understanding here, complexity as early as possible, but no earlier and no later. This is to say that as one evolution is built on by another, the earlier evolution cannot be amended. So evolved properties are plausible only at a given point of evolution. No property evolved in this way can be jettisoned on a ‘Use it or lose it’ basis without fatally compromising the rest of the apparatus.
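The claim in point 8, that a slight but consistent bias can change the genome over thousands of generations, can be illustrated with a very simple calculation. The sketch below is not a model from the sources cited here. It assumes a deterministic, haploid selection model with a constant advantage `s`, ignores chance effects and the details of mate choice, and uses a hypothetical function name and illustrative starting and target frequencies.

```python
from typing import Optional


def generations_to_spread(
    s: float,
    start: float = 0.001,
    target: float = 0.99,
    max_generations: int = 1_000_000,
) -> Optional[int]:
    """Return the number of generations for a variant with relative
    advantage s to rise from frequency `start` to frequency `target`,
    or None if the target is not reached within max_generations.
    """
    p = start
    for generation in range(1, max_generations + 1):
        # Carriers leave (1 + s) descendants for every 1 left by non-carriers,
        # so the new frequency is the carriers' share of the next generation.
        p = p * (1 + s) / (1 + p * s)
        if p >= target:
            return generation
    return None


# A one percent advantage, applied consistently every generation.
print(generations_to_spread(0.01))
```

Under these assumptions a one percent advantage carries a rare variant to near-universality in somewhat over a thousand generations, and a smaller advantage takes proportionally longer, which is at least consistent with the time scale suggested in point 8.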

Following the convention established by Jean Piaget, ages are given in the form years; months (days), so that 0; 11 (10) means the tenth day of the eleventh month. For our purposes here, this degree of precision is useful. Some developments happen over a few days, or overnight, or even less. In the cases given as examples here, the observations were the first satisfying some particular grammatical criterion.
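For anyone tabulating observations recorded in this notation, the ages can be read mechanically. The following is a minimal sketch, assuming the three fields are years, months and days in that order and that the day count in brackets may be omitted; the names `Age`, `AGE_PATTERN` and `parse_age` are hypothetical.

```python
import re
from typing import NamedTuple


class Age(NamedTuple):
    """An age split into its three fields, comparable field by field."""
    years: int
    months: int
    days: int


# Accepts forms such as "0; 11 (10)", "1;0(13)" or "0; 9".
AGE_PATTERN = re.compile(r"^\s*(\d+)\s*;\s*(\d+)\s*(?:\(\s*(\d+)\s*\))?\s*$")


def parse_age(text: str) -> Age:
    """Parse an age string; a missing day field is treated as zero."""
    match = AGE_PATTERN.match(text)
    if match is None:
        raise ValueError(f"not a recognised age: {text!r}")
    years, months, days = match.groups()
    return Age(int(years), int(months), int(days or 0))


# Ages sort chronologically because tuples compare field by field.
observations = ["1; 0 (13)", "0; 9 (4)", "0; 11 (10)", "0; 10 (7)"]
print(sorted(observations, key=parse_age))
```

Keeping the fields as a tuple, rather than converting to a single number of days, avoids having to assume a fixed length for a month.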

The sample of subjects

The subjects here represent the smallest possible sample, two brothers, two and a half years apart in age, both able bodied, Joe and a younger brother initially called Frank. The next stage of the research here will be to add more subjects.


The steps proposed here were as follows.

1. Decompose and compose (or recompose) the sound and meaning of expressions – and the origin of ‘Merge’

By the proposal here, the apparatus of grammar originated in forms defined on a new sort of relation between unlike features, where the relation is itself part of the definition. One sort of feature involved sensori-motor gestures, which could be seen or heard, implemented with the hands, face or head, or with the vocal tract. The other sort of feature involved a meaning. The meaning could be pragmatic, like a greeting or a curse, or semantic, like a reference to some individual. The relation between the two atomic elements of expression and meaning was essentially arbitrary. The expression did not look or sound like what it symbolised. It was defined just by the relation. The atomic elements had first to be reduced to their simplest, logically possible forms, both on the sensori-motor side and on the semantic side. This decomposition into single features could then be reversed, allowing the features to be freely recombined or recomposed, as something more like a modern speech sound or sign language gesture.

What makes this different from Chomsky’s Merge is that, by the proposal here, the process involves first decomposing the constituent features to one of each, both the semantic or pragmatic features and the gestural features, then recomposing them, and then defining the relation. The novelty of the step proposed here is by all three factors.

Now it might seem that there was a simpler analysis by the mere fact of the expression and the meaning being combined, or equivalently that the first single words are in principle no different from a chimpanzee-like hoot, as many psychologists and primatologists have suggested, implicitly or explicitly. But this would be to underestimate the complexity of the child’s first word, whether this is understood as mum, mummy, dad, daddy, cat, pussy, dog, doggy, horse, horsey, bird, birdie, car, bus, lorry, or whatever. Even if the first word or syllable is imprecisely articulated and hard to even identify, it still has some rudimentary structure, typically with an initial consonant followed by some sort of vowel. There may be even a vestige of a less highly stressed second syllable. And semantically, there is at least the possibility that this is referential, calling up an entity not actually present at the point of the utterance. This linkage of two sorts of feature is defined by the fact that it is both understandable and such that it can be classified as one of any number of other items. The understanding may be by another child of similar age with no emotional investment in the milestone. For these reasons, the first word is not just more sophisticated than a hoot; it is something else entirely. It is a first step on a pathway.

By the analysis of Matilde Marcolli (2023), Decompose and compose is invoked to explain two aspects of what Chomsky and numerous others characterise as Merge – as an irreducibly necessary aspect of human language. By the proposal here, Decompose and compose is primordial. The decomposition and recomposition should be seen as the basis of the process by which linguistic atoms are put together.

The relation between physical and semantic features and the classification of entries in a lexicon are common to all naturally spoken, modern languages. This gives the breakpoint between the hoots and coos of chimpanzee-style communication and the beginning of human speech and language – not by a gradual transition, but by a reconfiguration as the relation between the two sorts of feature becomes a defining part of the expression. This reconfiguration was the evolutionary cognitive genius. Such a decomposition is diagrammed below by a relation between a gesture with the lips and an association with close kinship – as a partner or parent. The two sets of features are defined by their unlikeness. They are thus not variables. The relation is the reconfiguration and what is represented in the top left of the diagram.

The lips are just one point at which the vocal tract can be completely closed or constricted with a particular acoustic effect. Another is the tongue tip. These are just two of many humanly-possible articulations, utilised in modern languages, each effecting an acoustic contrast. Primordial articulations may have been different in any number of ways, in all probability using features that could be drawn from an existing system of shrieks, hoots, grunts, or howls. But given that almost all modern languages utilise complete constrictions with the tongue and the lips, such articulations were in all probability primordial.

This relation between unlikenesses is reflected in the first stone tools, involving the geometry of at least one broken edge and another edge or tangent, the notion of sharpness, the long term utility of this, and the hardness of a particular material – a four-way relation. But a four-way relation is mathematically clumsy. On standard assumptions of simplicity, the recomposition must have been based on sets with only two elements. Any number greater than two would have been exponentially more complex. There is a binary relation in nervous systems. A neurone either fires, or it doesn’t. There is nothing in between. So the step-wise structure of speech and language necessarily starts from a binary branched structure or what is known as a ‘decision-tree’, standardly shown with the ‘root’ at the top and the decisions at the bottom. The first decision is to combine two atoms or ‘leaves’. The expression is then specified, allowing, by another decision, that another atom is combined with the first two. And so on.

The relation is essentially asymmetric. The specifier is like an index. By the specification, the relation could form part of a lexicon of comparable items, any of which could be extracted at will for any purpose, at any moment, and entered into an expressive ‘work space’. Such an expression is quite different from any shriek or howl in response to some situation. In principle, the output of one combination can be combined with another. The branching can be repeated any number of times. It is ‘recursive’.

This is shown below as a tree, by the standard convention in mathematics and linguistics with the root at the top and the leaves at the bottom.

The representation above creates the basis for an infinity of outputs. By definition, Feature 1 and Feature 2 are different sorts of features, involving any meaning and any sort of expression. Jumping ahead, subsequent development of the system involves progressive reductions in the infinity.

It might seem from a set of similar items that it is more parsimonious to delay for as long as possible the point at which we postulate such a powerful capacity, perhaps until the grammar is generating structures such as “I want to stand on the chair to see what’s happening” (exampled below) with multiple levels of embedding, with every element such that it could be replaced by something similar, or with further levels of embedding. At that point, the infinite capacity is obvious. But this would be to postulate two pathways, one leading to a finite set of outputs and the other leading to an infinite set. The greater parsimony is by a single pathway.

By the simplest, most conservative, most Darwinian assumption, the initiative here is most likely to have been by giving meanings, in the simplest of dictionary senses, to perceptible elements, whether spoken or signed, with the meaning having priority over the physical expression.

Ferdinand de Saussure (1916) referred to the relation here as one between the ‘signifier’ and the ‘signified’. For Saussure, signifier and signified were fully-developed, modern words. Saussure emphasised the complete arbitrariness of this relation. Some linguists have drawn attention to the way particular feature combinations, supposedly onomatopoeic, have related connotations – like ASH in bash, mash, smash, crash, all denoting some degree of violence, tick tock and clip clop denoting the sounds of a clock or a horse, and so on. The case of iconicity in sign language is similar. But in no language are such relations anything but marginal.

The first instances of the relation here could have been either referential or related to discourse. But for the sake of generality, it seems more likely that such primordial expressions involved a degree of reference, as by a recognition of identity in an act of welcome. Reference is universal across languages. It is reference which marks human language apart from non-human systems. Dolphin calls of group or individual identity seem to be meaningful only in the presence of another group-member. This is only a step towards reference. Reference allows an entity to be called up just because it happens to be in a speaker’s mind.

So, for example:

Joe at 1; 0 (13) seeing the swings in the playground said something which his mother heard as “See saw, Margery Daw” – as DEE DAW, DEE BAW, DEE DAW. The commas divide this into three utterances. Here all the consonants are what are known as ‘voiced stops’, stops because the airstream is completely blocked in the mouth, and voiced because the action is only momentary with the buzzing sound from the larynx beginning as soon as the closure is released. The variation between D and B is by the action of the tongue tip in D as opposed to the action of the lips in B. The partial closure by S is replaced by complete closure. And the nasal airstream by M is replaced, again, by a complete closure.

And similarly, Frank at 0; 9 (4) “Mum”, at 0; 9 (7), at 0; 10 (7) “Mama”,  at 0; 11 (10) “bus”.

All of these forms seem to be referential, at least to some degree, bus more clearly than Mum or Mama.

By the framework here, the modern system with its capacity to generate an infinite number of sentences could not have evolved in any other way. But if there is nothing to define which atom goes where in the structure or how the specification fits, the relation is imprecise.  There is no way of telling or guessing what a given structure means. With no limits on what can go where, it may be impossible to process any more than one step of the potentially recursive process. At this point in the acquisition of speech and language, with the exploitation of infinity essentially unordered, it is often unclear what is being represented. There is no clear definition of the structure. Such speech is hard to understand. Or for whatever reason, children exploit the recursive potential here in only the most minimal of ways, perhaps because of the lack of definition. By the proposal here, this structure (or lack of structure) is both primordial and characteristic of early development.

In the diagram below, three articulatory features are represented, defining three physical aspects of a single expression, the involvement of the lips, the involvement of the airstream through the nose, and the complete stopping of the airstream through the mouth. The decomposition gives these three features. The recomposition gives their assembly into some part of a phoneme or speech sound. But primordially and in modern children’s early speech and language development, the recomposition is not defined. A stopping gesture may be composed with the lips. And the output of this composition may be composed with a nasal airstream. The order of this composition is accidental. But accidental or otherwise, the order is acoustically significant. Composing the air stream gesture with the other two features sounds different from any of the other logical possibilities here. This is shown in the diagram below by the dotting of the lines. The term, ‘Labial’ represents an articulation by the lips. The term, ‘Nasal’, represents an airstream through the nose. The term ‘Stop’ represents a complete closure of the airstream through the mouth. All that is defined is there is a relation between these three features. So faint versions of two features are shown for each leaf in addition to one clear version. All three features are shown faint at the root since none are defined. And the dotting represents the essentially accidental nature of the structure.

Significantly, in modern acquisition, children at this early point in their speech and language development are often understandable only by those who know them best, who are used to the particular way they happen to order the recomposition. The relevant features may be inserted into the structure in any logically possible way.
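The ‘logically possible’ ways of composing the three features can be counted explicitly. The sketch below is an illustration only, under the assumption that a composition is a binary-branched tree with each feature used exactly once; the function names are hypothetical. It counts the trees in two ways: once treating the left-to-right order within each pair as significant, and once treating it as irrelevant.

```python
from itertools import permutations

FEATURES = ("Labial", "Nasal", "Stop")


def binary_compositions(features):
    """Yield every binary bracketing of the features in the order given."""
    if len(features) == 1:
        yield features[0]
        return
    for split in range(1, len(features)):
        for left in binary_compositions(features[:split]):
            for right in binary_compositions(features[split:]):
                yield (left, right)


def all_structures(features):
    """List every binary tree over every ordering of the features."""
    return [
        tree
        for order in permutations(features)
        for tree in binary_compositions(order)
    ]


def unordered(tree):
    """Discard left-to-right order, keeping only what is grouped with what."""
    if isinstance(tree, str):
        return tree
    return frozenset(unordered(branch) for branch in tree)


structures = all_structures(FEATURES)
print(len(structures))                             # 12
print(len({unordered(t) for t in structures}))     # 3
```

The case described above, a stopping gesture composed with the lips and the output composed with a nasal airstream, appears in the list as `(("Stop", "Labial"), "Nasal")`. With order ignored, only three groupings remain, one for each choice of which two features are composed first.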

Take for example what seemed to be a child’s first word without any previous cue – bus.  Let us apply the same schema, simplifying the complex, but undefined structure of the sound, and relate this to the notion of a bus. By the schema here, there is no certainty about the derivational structure. On this basis, bus, pronounced as BUH without the S, might be represented thus:

The adult listener interprets the form here as having the status of a syllable, even though this is not a possible syllable of English.

It might seem that there is a much simpler way of representing what is going on here. We could just label the sounds from left to right, as they occur, and say that the last gets deleted. But this does not capture the fact that the child is rather clearly at the beginning of a pathway to speech and language. And a tree makes it easy to describe how children’s speech most commonly develops, often omitting the coda, but seldom omitting the onset, which is higher in the tree and ‘safer’ on account of that.

Using the same schema, the meaning of bus as a first word might be represented as follows:

Modern language exploits the complete resources of a system which is continually changing in how things are pronounced, in how words are put together, in what they mean, and so on. But that being said, it is now in a sense fully evolved. The primordial system has been supplanted by both evolution and what is known as grammaticalisation, the churning over of the grammatical apparatus under various pressures over tens of thousands of years, some pulling in opposite directions.

There are fossils of the primordial sound / meaning relation in:

  • The long period in language acquisition of just single words, as names or categorisations such as “bus”, or greetings, as “more”, “hello”, “bye bye” and so on;
  • English yes and no and what are known as ‘modal particles’ or ‘discourse markers’ in many languages;
  • Curses;
  • Expressions like sh as a call for silence, Ah for pleased surprise, Eh as a query, and tut tut for disapproval;
  • What are known as ‘imperatives’, commonly expressed, as in English, by the ‘root’ form of a word, such as come and go, sometimes for the sake of saving life, sometimes greatly elaborated by more complex grammar, as in “For goodness sake, just go”;
  • Expressions like “genius” in response to a performance.

These expressions are unlike the primordial forms proposed here in that they exploit combinations of features by later steps. But they are used in a way characteristic of the primordial system. Note that these modern expressions are mostly, but not all, matters of discourse.

This does not prohibit the continuation of a system which may have been used by our common ancestors at the point of differentiation between the two lineages, humans and chimpanzees. These early ancestors may have had a system of calls of any degree of acoustic length and complexity, and used for any purpose. But such a system could be supplemented by one or more forms, distinctively represented by the sort of branched structure diagrammed above, to be articulated and understood accordingly.

How far the first inventor or inventors were CONSCIOUSLY aware of what they had ‘invented’ is obviously impossible to say. Even just one suitable expression could be very appealing. The adaptation here conferred a significant degree of Darwinian fitness. Inheritors of the adaptation had a greater chance of mating and thus of passing the adaptation on. All we know is that from the beginning, speech and language were noticed. Otherwise the capacity could not have spread and eventually become fixed.

Decompose and Recompose defines the beginning of a human-specific pathway, different from any sort of alarm calls  of vervet monkeys or prairie dogs, or the richer but seemingly less specific systems of chimpanzees, or the marking of individual and group identity by dolphins. Such non-human calls are not compositional. They cannot be combined with other calls to some infinite degree, known as ‘discrete infinity’. Nor can they be ‘decomposed’ into separate articulatory / perceptual and semantic / pragmatic elements, as in games like the French Verlan, reversing the order of the syllables in l’envers, the French for backwards. In evolution, the physical aspect may have been either gestural or vocal. If primordially there was a bias towards physical gesture, this bias must have disappeared as language evolution progressed, or there would be sign languages used natively by normally-hearing populations. The proposal here is neutral on whether the first decompositions and recompositions were vocal or manual.

The necessity here is that information of two sorts is sent to two quite different sorts of interface, both conceptually necessary, represented in primordial human expression, and as significant in modern speech and language as at the point of evolution.

2. Pair

Obviously, there was no point in history at which a group of one or more human ancestors realised that an infinite regress did not provide a good basis for a system of communication. But there has to have been a point at which the infinite regress here was limited in such a way that the limit could be expressed mathematically, making it possible for this limit to become part of what would become the human genome. Thus at the lowest, logically-possible point of compositionality, contrasting elements are put together as a Pair. The contrast is between elements such that one relates only to the discourse itself and the other is a prototype of what will become an element of syntax, including nouns. This greatly reduces the infinity by Decompose and Recompose.

Pair is like the improbable relation between Jonah and some species of cetacean who rescued him. Here I am doing what journalists do and trying to make sense of the story. The Hebrew text talks about a huge fish. The writer obviously didn’t know the sea. The name Jonah was originally pronounced with an initial Y sound. The J sound is by a misconstrual of the letter, by the original German version in lower case, with just the direction of the hook reversed. In Turkish and one of the neighbouring Kurdish languages, YUNUS is the word for a dolphin. It may thus be that Jonah fell overboard from a ship and was found by a dolphin who decided to rescue him. Various species of mammals – dogs, big cats, primates, bears, cetaceans – do occasionally cross various species boundaries to offer or seek protection or just make friends. In Jonah’s case, the story about him being swallowed may be an embroidery by Jonah himself or a biographer. But either way the story of a man owing his life to a dolphin would be the talk of the town today as it may have been three thousand years ago, giving Jonah his name as the most obvious nickname. The theory that the story is a satire just misses the point. Stories about improbable relations are sometimes true – or almost true.

Getting back to child language, Joe at 1; 5 (23), six months after his first ‘words’, said “Bye, doggy”.

Frank at 1; 2 (22) said “Bye bye, Daddy”.

Bye or bye bye are clearly aspects of discourse, as are ah, ey, uh, oh in adult language. Doggy, addressed to a toy dog, and Daddy, are seemingly referential. So there is discourse and reference, but with no internal structure other than the contrast. “Daddy, bye” and “Bye bye, Daddy” do not involve any change in meaning. Both expressions mean much the same thing, irrespective of the order. By the framework here, left to right, linear ordering is the preserve of physical expression and discourse. Meaning is defined on structural relations.

Showing explicitly discourse objects in red and connected by a line of round, red dots and the prototype of syntax in blue, with reference to the observation of Joe at 1; 5 (23):

This primordial system is reflected in modern language in various ways:

  • Expressions standardly by root forms combined in ways falling outside the terms of the grammar, kill joy, go between, go slow and so on, noted by Ljiljana Progovac (2015);
  • Possibly at least some adverbs, as the only sort of word which can appear in different positions in English, albeit with some subtle changes in meaning, as with sadly in any of the logically possible positions in “Sadly, he is going to die”, “He sadly is going to die”, “He is sadly going to die”, “He is going sadly to die”, “He is going to sadly die”, “He is going to die sadly” – all grammatical, at least for those who allow infinitives to be split;
  • In terms of sound structure, as a physical, articulated expression, possibly by evolution, but certainly by modern acquisition, syllables with a vocalic nucleus and a consonantal onset, in what is thus a syllable. Or, in a way that may be out of sync with the elements of meaning, there is what is known as ‘foot structure’ or stress between the syllables, as in Daddy.
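The six placements of sadly listed above are exactly the gaps in a five-word string, including both edges. As a minimal illustration only (the function name `adverb_placements` is hypothetical, and nothing here models the subtle changes in meaning), they can be generated as follows:

```python
def adverb_placements(words, adverb):
    """Insert `adverb` at every gap in `words`, including both edges."""
    return [words[:i] + [adverb] + words[i:] for i in range(len(words) + 1)]

sentence = ["he", "is", "going", "to", "die"]
for variant in adverb_placements(sentence, "sadly"):
    print(" ".join(variant))
```

A string of n words has n + 1 gaps, so five words give the six variants in the list.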

The modern infant generally puts his or her first two ‘words’ together around the point when the vocabulary reaches around fifty items. There is no reason for assuming that Pair evolved at the point when some particular number of items became accessible, but it is clearly possible that in evolution Pair was triggered by the growth of the lexicon.

3. Head

By another step, one element, as one of two syntactic objects, becomes the head of a combined expression, where the contrast is by only one element being referential and the other being a prototype of some other sort of syntactic object, such as a verb or preposition, but neither is such that it can in all cases stand on its own. Discourse elements are firmly excluded.

Headship is thus the opposite of democracy, and the selection of two syntactic objects is the opposite of diversity. If the theory of language had a designer, he or she needed a year of serious diversity training.

For example Joe at 1;7 (30) said “In er car…. In car”.

And Frank at 1; 3 (2) said “Open door”.

When the phrases “In car” and “Open door” were uttered it seemed probable what they meant, but not certain. The relation is asymmetric between two contrasting elements. In other words, a headed structure can’t include bye bye or hello, both stand-alone elements in discourse. But however they should be understood, both of these phrases have well-defined heads, in and open, contrasting with the clearly referential elements car and door. In the framework here, the non-head is known as the ‘complement’. By the diagram above, dominance is thus built into the system in a way that is now well-defined.

  • With the branching applying to just two elements, sisters in the framework here, headship is thus essentially a relation between defined elements, both with parts. In the modern child’s process of acquisition, by the simplest possible interpretation, the structure of “In car” involves two elements, car, essentially noun-like, and in, as a step towards a preposition. “Open door” contrasts noun-like door and verb-like open.
  • Head signals the first step towards distinctive ‘parts of speech’ as these are called by traditional grammar, nouns like car, Mummy, Daddy, verbs like want and like, prepositions like in. The items become differentiated, as the only sorts of expression on which grammatical operations can be defined. On the simplest plausible readings, open and in are plainly heads. Both elements can now express formal relations, as head and complement, to the expression as a whole. There is a grammatical relation between them, each with an irreducibly necessary, structural role. But it seems premature to regard such elements as fully-defined nouns, verbs and prepositions;
  • The elements have features which define the interaction, as opposed to some purely accidental relation, as by ooh, eh, ah, yes, hello, good bye, and so on, all independent from the grammar because they can stand on their own, sometimes adjoined to it, but not by Head.

The relation here is clearly one by a simple version of Merge.
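The head and complement relation can be pictured as a small data structure. This is only a sketch under stated assumptions: the category labels (“N-like”, “V-like”, “P-like”, “discourse”) and the names `Item`, `Phrase` and `merge` are illustrative, not part of the proposal, and the only properties modelled are that the head projects its category and that discourse elements are excluded.

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Item:
    form: str
    category: str  # hypothetical labels: "N-like", "V-like", "P-like", "discourse"

@dataclass(frozen=True)
class Phrase:
    head: Item
    complement: Union[Item, "Phrase"]

    @property
    def category(self) -> str:
        # The head projects: the phrase takes the head's category.
        return self.head.category

def merge(head: Item, complement: Union[Item, Phrase]) -> Phrase:
    """Combine a head with its complement; discourse elements are excluded."""
    for element in (head, complement):
        if element.category == "discourse":
            raise ValueError(f"{element!r} is a discourse element, not available to Head")
    return Phrase(head, complement)

in_car = merge(Item("in", "P-like"), Item("car", "N-like"))
open_door = merge(Item("open", "V-like"), Item("door", "N-like"))
print(in_car.category, open_door.category)
```

On this sketch, trying to merge bye bye with Daddy raises an error, reflecting the point that such combinations fall under Pair rather than Head.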

4. Extraction

In English, questions beginning with words like where (or their equivalents in most  languages other than English) are asked with where mostly on the left of the structure, as in “Where are you going?” But in a full-sentence answer, the requested information is on the right, as in “I’m going to the shops.” Why should this complexity be common, not universal, but common across languages? By ancestral grammars from the 1950s to the 1980s, this was represented as ‘movement’. More recent grammars avoid the notion of movement.

Building on the featural and combinatorial properties by Decompose and Recompose, Pair and Head, a further step builds an array of a given set of lexical entries. From this Extraction makes it possible to extract a particular, suitable item which has already featured in a previous step of the derivation. By this notion, which Chomsky calls ‘Internal merge’, the notion of movement is no longer necessary. And the infinity is reduced.
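The idea of re-using an item already in the derivation, rather than moving it, can be sketched as follows. This is a deliberately flat simplification: the structure is a list rather than a tree, the names `Copy`, `extract` and `pronounce` are hypothetical, and the convention that only the highest copy is pronounced is taken from the English pattern discussed in this section.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Copy:
    """One occurrence of a lexical item; only pronounced copies are spoken."""
    form: str
    pronounced: bool = True

def extract(structure, form):
    """Re-merge `form` at the left edge, silencing its original occurrence.

    The item must already have featured in the derivation: nothing is
    brought in from outside.
    """
    if not any(c.form == form and c.pronounced for c in structure):
        raise ValueError(f"{form!r} has not featured in the derivation")
    silenced = [Copy(c.form, False) if c.form == form else c for c in structure]
    return [Copy(form)] + silenced

def pronounce(structure):
    return " ".join(c.form for c in structure if c.pronounced)

base = [Copy("Daddy"), Copy("where")]
question = extract(base, "where")
print(pronounce(question))  # where Daddy
```

The lower copy of where stays in the structure, where it can be interpreted, but it is not pronounced there.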

As revealed by tip of the tongue phenomena, the extraction of items from the lexicon is subtle and complex. Extraction is like a game of hide and seek where it is not known who is hiding and how many of them there are.

Necessarily in English, the force of the question in “Where are you going?” determines a reversal of the sequence of you and are from the sequence which would be followed in the response statement “You are going to the shops” or I and am in “I am going to the shops”.

Typically, as in English, an extracted item, B in the diagram below, is then not pronounced at its point of origin.

For example, Joe at 1;10 (3), says “Daddy upstairs” where Daddy seems to be the ‘subject’ in traditional terminology, and at 1; 10 (27) “Where Daddy?” with where seeming to define a clear question. In the first, the B element is only extracted once. But in “Where Daddy” it is extracted with a notion of location, and then extracted once again with the force of a question.

At 1; 4 (27) Frank is asked, “Who wants some chips?” And he replies “Me”. And at 1; 5 (9) he asks “Where chicken?” Both the appropriate answer to a who question and the where question 12 days later suggest a grammar capable of making two uses of the same element, once to define an identity or location, and then, by Extraction, with the force of a question.

Here where and who are fulfilling a special role as the sister or ‘specifier’ of the head of the A B  structure. In the framework here, it is a universal.

Significantly, English also allows forms like “Daddy is where” or “The chicken is where”, typically with where heavily stressed, no longer with the force of a question, but as statements of surprise or astonishment.

In “Daddy upstairs” by two branchings, one defective, Daddy is the specifier, in this case, the subject.

The contrast between the elements expresses the simplest possible structure with a definable spine, in this case with Daddy dominating X upstairs, where X is an unrealised abstract element.

In Joe’s “Where Daddy?” at 1; 10 (27), where is pronounced on the left and interpreted on the right of the structure (shown in grey).

Here “Upstairs” might be a plausible child’s answer, taking “Daddy is upstairs” or “Upstairs” as plausible adult-type answers, except that upstairs is treated here as a bare marker of location, questioned by where.

In modern acquisition, between a week and three months after two words are put together, a question is asked or answered involving a question word relating to one of the items in the two word combination, particularly what, where, and so on, signalling points of curiosity, as a key factor in discourse. So children ask or respond appropriately to a Wh question such as “Where Daddy” only after producing a declarative structure involving two corresponding elements, not in the opposite order.

By this step:

  • Every element has a role, from the verbal head of an expression like open in “Open door” to the Wh question form where in “Where chicken”;
  • Structures, with what traditional grammar calls ‘subjects’, can be expressed on noun-like elements. Thus I has the special role of expressing a subject, a seemingly universal property of sentences. The subject role is purely grammatical or syntactic, as in “There is food on the table” and “It is a shame that you’re ill” where neither there nor it has any semantic role;
  • What are known as ‘thematic roles’, including agency, ownership, location, benefit, destination, or experience, can be expressed;
  • Elements of composed structure can be recomposed at a higher level;
  • Across the system of phonemes, relativities can be marked in contrasts such as those between P and B, defined on a difference in the delay between the release of a closure and the onset of ‘voicing’ by bringing the vocal cords together;
  • The cognitive load of searching for and extracting items from the lexicon is greatly reduced. At this point in language acquisition, the lexicon is expanding rapidly. Simplifying the task of searching for and extracting a word is thus a valuable increase in fitness. This is easily and obviously detectable because questions are now clearly defined as such. Questions with a Wh word can be asked or understood.

5. Hitch

The thematic relations of who did what to who, where, when, why and how are obviously important. And there is no reason for thinking that they were ever anything else. But they are discussed and argued about in discourse, where there are separate issues of time, identity, and interpersonal relations. In ways that vary quite widely across languages, at least some of these issues are captured by ‘functors’, or elements within the structure which are only definable by their relation to another element within the structure. Functors thus get hitched to hosts which may be other functors.

Knowing the various hitches and being able to do them fast was once a sailor’s pride. For instance, a clove hitch held a rope to a spar. A rolling hitch joined one rope to another already under tension. Both could be quickly and easily undone.

Functors are marked in English in obvious ways, easily detected by the child learner. They can be displaced like where, shunted to the left edge of the sentence, in a way marking their illocutionary significance. Or their sound structure can be reduced, by losing their vowel, for instance. One sort of functor hitches the sentence as a whole to the most salient aspect of the discourse. In English, as in most languages, this is what is known as ‘tense’, as by the word is or its contracted form, written as ‘s, with reference to an immediately present event / situation. English past tense is marked either as -ED, as in sorted, or -D as in lied, or -T as in spilt, or by a change in what is known as the rime as in ate, saw, took, or by the whole form of the verb as in was and went. By the profound insight of Chomsky (1957), this marking of tense is separate from the verb itself. In “Did you tell the truth”, tense is expressed on the word did, known as the ‘auxiliary’.

Joe at 1;11 (12) asks “Who’s that?” On the same day, looking at a picture book together, his mother asks: “Where’s the bus?” Joe replies: “There’s bus.” At 1;11 (14) he asks: “Where’s man tractor”. It was not clear whether he meant “Where is the man’s tractor?” or “Where is the man for the tractor?” or something else. The point is the articulation of the ‘s form, as a contracted form of is. The functor here does double duty, marking both the relation to the present and the fact that the question is about a single entity. The fact that it is expressed by a contraction makes it highly visible and thus easy to identify and learn.

Frank at 1; 5 (29) asks: “What is that?” His mother, who made the observation, noted that the is form was clearly articulated.

Showing the new functional projection in bold.

These are the first uses of an is or ‘s form by these children, known as ‘inflection’. Here an element is inserted into the structure and anchored to the here and now of the utterance, reflecting an aspect of the discourse. There is no reason for thinking that there is any contrastive intent here. The child is not also asking things like “What was that?” But the use of the is form is a place-holder for the tense category as this becomes accessible to consciousness. And in a broader sense, the form signals the accessibility of elements which are purely functional, with their own corresponding projections.

6. Measure and compare

Extraction and Hitch are mathematically powerful devices. Applied to the output of one another, they can be exploited indefinitely. Such a grammar strains both processing and production. It is patently impossible to learn under the condition of finite learnability. Only a small minority may have been able to master the complete apparatus, with wide and significant variations in the mastery, as in all other areas of human skill from musicality, to art, to athleticism of all sorts.

By Measure and compare, a minimal degree of dominance by one head is compared to that of another in the same hierarchy to any equal or greater degree. This is an obviously abstract functionality. But the relation characterises numerous phenomena in the grammar of unrelated languages. In English this relation applies to pronouns such as I, you, he and she, to what are traditionally known as ‘reflexives’, as in “I hurt myself”, and to negatives by not and its reduced form, written as n’t. The negative form only appears immediately after the form expressing the tense in doesn’t, didn’t, can’t, won’t, and so on. The scope of operations, each doing just one thing at a time, is restricted by comparing and measuring just two degrees of dominance.

The first versions of what is known as ‘Case’ can be expressed, as by the difference between he and him, she and her, as what are known as ‘arguments’ in one of various relationships, essentially who is doing what to who, but changing as roles change or speakers take turns to talk.

At 1; 9 (22) Frank said “I hurt self” with I and self with the same reference. At 1; 10 (27) he said “I found this” with the ‘nominative’ pronoun I next to the past tense found. At 1; 11 (4) he said “I don’t like it” with the negative n’t next to the auxiliary do. At 1; 11 (13) he said “I’m making lorry” with ‘agreement’ between the first person pronoun I and the auxiliary am. At 2; 0 (20) he said “I need more that” with more as the head of a complex phrase. All of these cases involve the measurement and comparison of dominance.

Going through the same process six months later, but faster, at 2; 4 (26) Joe said “Mog doesn’t like that”.

A week later at 2; 5 (5) he says “doggy licking hisself” with the reflexive one level down from what is known as its ‘antecedent’ – in this case doggy.

Two days after saying “Doggy licking hisself”, Joe at 2; 5 (7) said “I saw lorry pulling car” and on 2; 5 (11) “I took picture of milkman”. In both of these cases there is significant embedded structure, a clause in ‘lorry pulling car’ and a noun dominating another noun in the substructure of ‘picture of milkman’. In both cases tense and case are overtly represented in a sisterhood relation. In “I eated that chocolate”, the tense in the verb is manifest in the mistake in eated.

The marking of tense on the verb and linking this core element of the structure to the context of the utterance is almost, though not completely, universal. This puts two functionalities, tense and nominative Case, at the top of the projection chain. This is reflected in the way both are expressed as the leftmost elements in “I might have been being deceived”. In restricting the measuring and comparing to the edge of the hierarchy, the infinity is reduced one degree further than by previous steps.

Going beyond the acquisition data here, as far as pronouns are concerned, the key data for English is in contrasts like the one between “She says Mummy feels tired” and “Mummy says she feels tired.” In the second, she could be Mummy. But in “She says Mummy feels tired”,  she can’t be Mummy. Such relations are common across languages, raising the obvious questions: How do children learn this? And why should this be? Ever since a seminal (1976) work on the issue by Tanya Reinhart, this has been a hot and continuing topic of debate. All approaches since that of Reinhart have focused on the small size of the domain, as illustrated in the diagram above. But by all of these approaches, to the question: How do children learn this? the answer is that they don’t. And to the question: Why should this be? the answer is that by measuring and comparing two levels of dominance, minimally different in the examples here because the elements are sisters in the structure, the infinity of the structure is significantly reduced. What the child has to absorb is not a highly arcane restriction on the reference of she and Mummy in the examples above, but a much more immediately familiar principle of measurement and comparison, as applied in an argument between siblings about who should have the bigger ice-cream.
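The contrast between the two sentences can be illustrated with a toy comparison of depth in a tree. This is a loudly simplified sketch: the bracketing of the two sentences is assumed for illustration, depth of embedding stands in for degree of dominance, and the check is cruder than Reinhart’s structural condition. It is meant to cover these two examples only, and the names `depth` and `may_corefer` are hypothetical.

```python
def depth(tree, word, level=0):
    """Return how deeply `word` is embedded in a nested-tuple tree, or None."""
    if tree == word:
        return level
    if isinstance(tree, tuple):
        for branch in tree:
            found = depth(branch, word, level + 1)
            if found is not None:
                return found
    return None

def may_corefer(tree, pronoun, name):
    """Toy check: the pronoun must not sit higher in the tree than the name."""
    return depth(tree, pronoun) >= depth(tree, name)

# Assumed bracketings, right-branching throughout.
she_first = ("she", ("says", ("Mummy", ("feels", "tired"))))
mummy_first = ("Mummy", ("says", ("she", ("feels", "tired"))))

print(may_corefer(she_first, "she", "Mummy"))
print(may_corefer(mummy_first, "she", "Mummy"))
```

Only two numbers are compared in each case, which is the sense in which the child’s task is one of measurement and comparison rather than the mastery of an arcane restriction.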

By a spine-based universal grammar, these things can be encoded in ways that vary across languages, but using the same, universal template. By this sixth step, degrees of  dominance are measured and compared, imposing a ceiling on specified relations, abstractly A and B, at the top of the spine at a given point in the derivation.

This allows a special relationship between I and am and between she and is, one denoting what was traditionally known as the ‘nominative’ case of the subject and the other denoting the most immediate aspect of the here and now in the discourse. In most languages including English, the key aspect of the here and now is related to time, represented as the tense of the verb, as in the differences between I am and I was and I have and I had. Nominative case is purely grammatical, with no thematic role or relation to the here and now or the needs of communication.

In “I may seem to be asleep” the thematic role of I is plainly not a function of the main verb, seem, but of the embedded verb, be. “I may seem to be asleep” means the same thing as “It may seem that I am asleep”, but the structure is quite different. I with its marking of Case and the tense of be get shunted upwards or ‘raised’ by successive steps of projection, each step by a separate process, shown here by the arrow in a simplified diagram of the tree.

The process can be continued as in “I may seem to want to be asleep” with a different meaning, but still with I immediately followed by the tense-bearing may. The sense of the tense-bearing element has been lost in the history of English. “I might seem to want to be asleep” means almost the same thing. But without may or might in “I seem to want to be asleep” or “I seemed to want to be asleep” the tense difference is clear and overt, again immediately next to I.

Tense and nominative case constitute the two most sharply contrasting sorts of elements within the hierarchy.

The expression of these levels of the hierarchy, for noun and verb like elements, varies from language to language. These are things the language learner has to learn. They fall within the learnability space. By the proposal here, Measure and compare is universal. But expressed in terms of the spine it is biologically encodable, and thus readable within the human genome. By the proposal here, it is precisely the abstractness of Measure and compare which makes it both universal and a heritable aspect of UG. But despite the universality of the  core principle here, the way it works in English is complex and hard to learn.

7. Phase and Complementiser (or Sentence) – limiting the whole expression

By this final step, the grammatical apparatus is factored into the smallest possible components, further reducing the infinity.

This follows a consistent approach from Chomsky’s first widely circulated (1957) work factoring the grammar into two sorts of rule, then a division between deep structure and surface structure by Chomsky (1965), then the effect of a barrier with respect to what was at the time considered to be the ‘movement’ of elements such as what and where by Chomsky (1986), then by Chomsky (2000) with the ‘spelling out’ of the minimally necessary information to two interfaces in separate ‘Phases’.

A phase is defined by the fact that as soon as it has been completed, most of its structure becomes inaccessible to the ongoing process of derivation. A phase may be expressed by only one word. But it can have a special analytic status, as in the case of the Wh words. As by the primordial Decompose and recompose step, the information that is sent to the two interfaces (for physical expression and understanding) has to be detailed and complete.
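The way a completed phase closes off most of its structure can be sketched as a small class. This is an illustration only: the class name `Phase`, the division into `edge`, `head` and `complement`, and the assignment of the words of “what’s happening” to those slots are assumptions made for the sketch, with the head and the edge taken as the part that remains accessible.

```python
from dataclasses import dataclass

@dataclass
class Phase:
    edge: list        # e.g. a Wh word
    head: str
    complement: list  # the rest of the phase
    complete: bool = False

    def accessible(self):
        """Items still visible to the ongoing derivation."""
        if self.complete:
            # Once the phase is completed, only the edge and head remain visible.
            return self.edge + [self.head]
        return self.edge + [self.head] + self.complement

embedded = Phase(edge=["what"], head="'s", complement=["happening"])
print(embedded.accessible())
embedded.complete = True
print(embedded.accessible())
```

Before completion everything in the phase can still be manipulated. Afterwards only the edge and head are left for higher steps of the derivation to work with.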

Here the phases are shown as alternating light black and heavy red branchings, at least two for each clause, the light black roughly representing the propositional content, and the heavy red roughly representing the ‘force’. These are not the steps proposed here, but the effects of the final step.

By this seventh step, the spine maps onto the formal structure of UG. The phasing becomes detectable only by the last of the seven steps.

The novelty, by a phased approach to syntax, is that the factoring into two phases is repeated clause by clause. At least by the original conception (which I am assuming here), the first phase spells out the referential and propositional content, the second phase spells out what John Langshaw Austin (1957) called the ‘illocutionary force’ of statements, commands, questions, pleas, and so on.

By the proposal here, both in the evolution of language and in the acquisition by modern children, the factoring of the grammar by phases is (necessarily) ordered last.

Showing accessible elements in bold red, and with a lower, earlier, inaccessible elements lighter, with only the head and edge accessible.

There are thus two phases in most clauses, even if the second phase is not represented by any overt structure, but just by the fact that a ‘simple’ proposition is also a statement of fact, which may be contradicted in jest or irony, as represented by the second phase. Thus the first phase is mainly grammatical and the second phase often has a significant discourse element.

By the first six steps proposed here, the force of a structure was an accident of the structure itself and the circumstances in which it was uttered. Such a grammar was most likely prone to deep and frequent misunderstandings. By the seventh step, an expanded notion of force was defined as ‘Complementiser’ or C, as the topmost level of the spine, and replacing the traditional notion of a ‘sentence’. C provided a host for what in expressions like “What did you say you thought I said?” with what as the complement of I said at the opposite end of the structure.

In simple, declarative main clauses, C is not expressed in English. But it is the destination or landing site of words like what, where and when and expressions with which in questions seeking particular items of information. By the 1997 proposal of Luigi Rizzi, the ‘force’ of the structure is expressed as a property of C. This applies no matter whether the structure is a statement or a question, or whether the agency of the subject is diminished by passivisation or in some other way.

So for example, Joe at 2; 9 (4) asked “When’s Daddy coming back?” with the Wh morpheme when projected onto the uppermost C level and the contracted auxiliary ‘S stuck on its right edge as what is known as a ‘clitic’.

By characterising this level as that of C, every level from the bottom of the structure to the top is defined in the same way, rather than by giving the sentence a special status of its own, one that is hard to define other than in a purely circular way.

Three weeks after “When’s Daddy coming back”, at 2; 9 (28) Joe produces his first sentence with multiple embeddings, in this case three, with two phases at the lowermost level in “what’s happening”, with thus eight phases in all, with all structures fully specified – in “I want to stand on the chair to see what’s happening”. Simplifying slightly:

At the lowermost level, the contracted auxiliary ‘s is projected to form a tensed structure, and then what is projected to form what is traditionally characterised as an ‘interrogative clause’, in the framework here, now specified by what. But crucially there is no search for information here. In a deeply embedded clause, the word what does not define a question.

Turning to the other child, Frank, up until this point, his language has been slightly more precocious than Joe’s. But now, a month after Joe, at 2; 10 (21) Frank says “I want to sit where Joe’s been sitting.”

Looking at the two children together, both produced almost identical sentences with multiple embeddings, accidentally or otherwise, with full grammaticality. The exactness of the similarity between two utterances by two children two and a half years apart, only noticed forty years later, would seem to suggest that there is significance in such structures, with a Wh word specifying an embedded clause and not forming a question.

As by the examples above, Phase allows the derivation to proceed in steps, as by the process of evolution. But the full application of the principle here takes years to learn. At 9; 9 (13) Joe said “We don’t know whether I’m going to be picked up by who” (of the rather complicated child care arrangements we had in place at the time). Joe’s sentence is anomalous in as much as who seeks particular information and whether seeks only a truth value. But the structure of two Wh words in the same clause calls up the Phase functionality in an interesting way.

By this seventh step:

  • Building the derivation in phases limits how much of the derivation can be manipulated at any one point, reducing to the minimum both Search and the speaker’s and language learner’s tasks in constructing a derivation, allowing complexity to be distributed across it;
  • Information is sent bit by bit to the articulatory system to be pronounced and to the semantic / conceptual system for the meaning to be analysed. English marks the point of sending articulatory information much earlier than languages like Turkish, Mohawk, and many others, with what seem to the speaker of a language like English to be hugely complex ‘words’. So this point necessarily falls within the learnability space;
  • UG becomes knowable, and Metalinguistic awareness is brought into being;
  • Fantasy, fiction, non-fiction, irony, fun, comedy, contracts, become parts of everyday life.
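
The first point above can be made concrete with a toy sketch in Python. It is only an illustration under loud assumptions: the chunking of Joe’s sentence into five pieces, the rule that only the leftmost item of a completed phase stays accessible, and the names `derive_in_phases`, `workspace` and `sealed` are all invented here and are not the analysis proposed in this essay. The sketch shows only that sealing each phase keeps the amount of material open to manipulation small, however long the whole sentence is.

```python
# Toy model of phase-by-phase derivation. The chunk boundaries are
# hypothetical placeholders, not a linguistic analysis.

def derive_in_phases(chunks):
    """Build bottom-up; after each phase keep only its leftmost item open."""
    workspace = []   # material still open to manipulation
    sealed = []      # interiors already sent off, no longer reachable
    peak = 0         # largest workspace seen at any one point
    for chunk in chunks:
        workspace = list(chunk) + workspace   # new material joins on the left
        peak = max(peak, len(workspace))
        edge, interior = workspace[:1], workspace[1:]
        sealed.append(interior)               # interior becomes unreachable
        workspace = edge                      # only the edge is carried up
    return peak, sealed, workspace

# Lowest chunk first (assumed chunking of Joe's sentence).
chunks = [
    ["what", "'s", "happening"],
    ["to", "see"],
    ["on", "the", "chair"],
    ["to", "stand"],
    ["I", "want"],
]

total_words = sum(len(c) for c in chunks)
peak, sealed, remaining = derive_in_phases(chunks)
print(total_words, peak)   # 12 4
```

In this toy run the sentence has twelve items, but no more than four are ever open at once.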

Summary of seven steps

In this way, the infinity by Merge or by Decompose and Recompose is reduced step by step. It can be summarised as follows (for the sake of simplicity, assuming that the steps were primordially vocal, an assumption which may be wholly or partly wrong for at least the first steps):

  1. Decompose and recompose – decomposing and recomposing UNLIKE features – sensory-motor or semantic-pragmatic – the most extreme possible sort of contrast – generating a high level of infinity – in forms not yet properly words;
  2. Pair – of elements, only one of which can stand on its own, as a greeting or such like;
  3. Head – one of two contrasting elements, neither such that it can in all cases stand on its own – one such that it heads the expression;
  4. Extraction – listing a set of elements for selection, not disallowing a previously selected element to be selected again;
  5. Hitch – a functor to the appropriate host;
  6. Measure and compare – comparing and measuring the least degree of dominance at some point in the derivation to some equal or greater degree, thus limiting at least some specified projections;
  7. Phase – splitting the derivation into successive phases, each such that its internal structure becomes unreachable as the next proceeds, limiting the scope of the grammar by any one phase of the derivation.

In the case of two randomly selected, normally developing children, these steps all occur in the space of 22 months. But even at ten and a half, the whole edifice of grammar is still not absolutely complete.

An apparatus

Crucially this evolution provided an apparatus which was, and is:

  • Freely used in assembling words together and in the building of speech sounds, in ways that the child has to learn;
  • Commonly over-used in the process of speech acquisition so that children often use devices in the building of words which should be used only in the assembling of words into sentences;
  • Such that speech-disordered children from different generations or parts of a family often have recognisably similar issues;
  • Available in parts, so that questions can be asked and answered in a rudimentary way, so a child of two and three quarters can say “A clock tells you what time it is” displaying the first evidence of Phase long before the full functionality of the grammar has emerged, as it normally has around seven years later.

By this proposal – consequentially

  • A history of delayed or disordered speech is likely to be co-morbid with literacy problems. The characteristic multifactoriality of speech and language disorders is predictable;
  • There are likely to be speech errors by misapplying what should be syntactic processes in the phonology;
  • Many characteristics of child speech are likely to be reducible to the lack of any proper definition of phonemes, syllables, words, and so on;
  • Children with speech and language disorders are likely to have characteristically poor metalinguistics;
  • It is possible in principle for parts of UG to be incompletely specified in some individuals;
  • Rather than postulating a series of separate disorders, many apparent disorders, even some with names in popular speech, such as ‘lisping’, may fall out from a fully worked out theory of speech and language evolution;
  • Following Progovac (2015), there must have been a series of protolanguages, each likely to have left fossils;
  • As the genome evolved, the effects of the steps interacted with one another, giving the complex variations which Roberts (2022) characterises as ‘building blocks’;
  • An abstract Universal Grammar UG is derived from the evolution of the human species. But there is cross-linguistic variation in how it is used – for example in which parts of the sentence structure are projected where – with global effects on word order and other aspects of what is commonly characterised as ‘grammar’. While a language may not express one or more parts of UG, all languages, spoken and signed, are built from it.

A buffer and stammering

Developing a proposal by Nunes (1994), updated by Sandiway Fong (2021, 2023), the evolution of speech and language was bottlenecked by the slowness of transitions across the synapse in contrast to the extreme sensitivity of both visual and auditory perception. Evolution addressed this bottleneck in two ways, first by the successive reductions in the infinity by recursive Merge, and second by the development of a buffer allowing finite time for the apparatus by these steps. Like the specification of the steps, the buffer is developmentally vulnerable. By the proposal of Nunes (1994), an incorrect specification of the buffer characteristically surfaces in speech as a stammer. Putting this differently, a stammer has a necessary neurolinguistic component. Familial experience and self definition are not enough on their own to fully characterise the disorder.

This conclusion is evidenced by the following:

  • Stammers occur in all known human populations at a rate of between one and two percent;
  • In all the vast clinical literature about stammering, there are no reports of stammering on first words;
  • By a series of discoveries in the early 1950s, reactions to Delayed Auditory Feedback (listening to one’s own speech played back with a delay) are quite different in normal speakers and stammerers: the normal speakers stammer, and the stammerers stop stammering or stammer much less.

A good start

Given the simple principle of binary branching and the acoustic phenomenon of sounds dying away as the energy is absorbed by the atmosphere, the system just consists in the step-wise reduction of infinity.

In relation to the sound system, English just happens to pursue this branchedness further than most languages. But some languages, Polish for example, go further with even more structure before the onset. All of this falls within the learnability space, and is often problematic in children’s speech development.

The tree can just develop, adding branches, up to some limit, as by the structure in strange. Here the long vowel is shown as AE, where the two elements are separated in the spelling. The final GE by the spelling is shown as a single J, representing the fact that this is just one sound. But it is also a sound with two halves, known as an ‘affricate’, beginning with a complete closure, shown here with a D, and ending with a fractional release of the closure, shown here as ZH, like the sound at the end of beige and rouge. Respecting the binary branching, the initial S is shown as a dependent off the left edge of the syllable.
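
As a rough sketch of what binary branching means for strange, the structure can be written as nested pairs in Python. The particular bracketing below, and the names `STRANGE`, `is_binary` and `leaves`, are assumptions made for illustration only; the labels S, T, R, A, E, N, D and ZH follow the description above.

```python
# Assumed bracketing for "strange": every node is either a leaf (a string)
# or a pair of exactly two sub-trees.
STRANGE = (
    "S",                                  # dependent off the left edge
    (
        ("T", "R"),                       # onset
        (
            ("A", "E"),                   # long vowel, two elements
            ("N", ("D", "ZH")),           # N plus the affricate J = D + ZH
        ),
    ),
)

def is_binary(node):
    """True if every branching node has exactly two daughters."""
    if isinstance(node, str):
        return True
    return len(node) == 2 and all(is_binary(child) for child in node)

def leaves(node):
    """Return the leaves from left to right."""
    if isinstance(node, str):
        return [node]
    return [leaf for child in node for leaf in leaves(child)]

print(is_binary(STRANGE))   # True
print(leaves(STRANGE))      # ['S', 'T', 'R', 'A', 'E', 'N', 'D', 'ZH']
```

Adding a branch means replacing a leaf with a new pair, so the tree can grow without ever breaking the two-daughter limit.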

The data here

The proposal here is on the basis of the diaries kept by my wife and myself of the development of our two sons, Joe and Frank, two and a half years apart in age, comprising about fifteen thousand observations in all, filling nine cathedral analysis note books. The observations continued until Joe, the older of the two, was almost ten and a half. We tried to make our observations as accurate as possible, as soon as possible after the event. Obviously we must have missed many developmentally significant occasions. The observations exampled here exhibit commonalities across the two boys. Because of the age difference, it is not plausible that the older of the two was significantly influencing the younger, other than on matters like loyalties to one or to football teams. Generalisations across the two boys are likely to be significant. Listening to them talking with their friends and peers there was nothing obviously singular about their speech and language. The structures exampled here seem to be typical of children from a liberally-minded, middle-class family, going to a neighbourhood, non-denominational, local authority school, catering for children from a wide variety of social and ethnic backgrounds. It is only coming back to these records for analysis forty years after they were made that it is becoming clear how much they reveal. One cannot listen too carefully to the details and nuances of what children say. They can say more than they seem to on a first listening.

A theory of speech and language acquisition?

The proposal here is NOT a theory of speech and language acquisition. It does not take account of psycholinguistic considerations such as auditory memory. It is just assumed here that the domain is large and complex, and that the process of mastering it is likely to begin with only occasional successes and much more common failures.

The steps by the proposal here are like insights which are grasped first tentatively and only gradually with confidence. Instances of the insights are likely to be scarce.

The motivation here is biological. Any adaptation capable of enhancing the fitness of the organism has to be simple, accidental, and such that it can be genetically encoded.

The autonomy of grammar

No version of Merge or Decompose and Recompose is reducible to the needs of communication or social interaction. There could not have been any external input because by their very nature, the properties here are strictly internal to cognition. A phase-based spine can only be defined on general, i.e. non-linguistic, principles. It cannot directly reference any categories which would only come into existence by virtue of the evolution. One example of this is the time relation between the utterance and the context in which it is uttered, known as ‘tense’, as by the actually quite complex difference between “I clean the floor” and “I cleaned the floor”. Thus the grammar must encompass the entire apparatus which yields the linguistic categories. A category may seem to occur only very rarely or even in only one of the six or seven thousand known languages. Some categories are idiosyncratic. But if a category occurs at all, the learnability space must be configured accordingly.

Steps, stems and time scales

The proposal here makes no claim about how the cognitive evolution of speech and language connected up with cognition itself or with any of the physical changes. We just have to note that these changes were in one species, and they would seem likely to have complemented one another.

I am assuming here that genetics confirms Darwin’s hunch that modern humans share a common ancestor with African apes, with human ancestors diverging from chimpanzees at some point between six and ten million years ago. (See Søren Besenbacher and others (2019) for a recent contribution. But the exact timing of the divergence is irrelevant here.)

I am also assuming that at the point when the two ancestral populations diverged, they shared a system of communication essentially similar to that of modern chimpanzees. According to Katie Slocombe and others (2022), reviewing studies of wild chimpanzees across different African populations, there are approximately 44 or 45 calls in use. Some calls, like the shriek of pain as an individual is attacked, vary in intensity according to the severity of the attack, and crucially, are understood that way by other chimpanzees. But an infinity of calls by the grading of fear, pain and distress is quite different from any of the infinities considered here.

It may be that in some species, particularly birds, different calls can be combined, sometimes one inside another. Observations to this effect have sometimes been interpreted as countering the claim that recursivity in the system is human-specific. Similar interpretations have been made of variations in the calls by chimpanzees, vervet monkeys and others. It may be that some species, particularly chimpanzees, are better able to remember a large repertoire of calls. And other species are better able to understand them separately and to combine them. A combination of two calls goes beyond a mapping. But it is limited to the square of the number of calls. The output is finite. The interpretation that either of these evolutions is a step on the evolutionary path to human speech and language seems to me to miss the point that the end-state of linguistic competence involves the simultaneous variation of both form and meaning. It is this, I propose, which allows the human-specific property of free compositionality and a similarly human-specific pathway to linguistic competence in normally developed human adults, taken for granted in all human cultures.
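
The arithmetic behind this point can be checked with a short Python sketch. It assumes a repertoire of 44 calls, the lower of the two figures cited above; the call names are placeholders, and `expressions` is a hypothetical helper that counts unrestricted recombination of a given length purely for contrast.

```python
from itertools import product

calls = [f"call_{i}" for i in range(1, 45)]   # 44 placeholder calls

# Every ordered combination of two calls: a fixed, finite set.
pairs = list(product(calls, repeat=2))
print(len(pairs))   # 1936

def expressions(n_items, length):
    """Count the sequences of a given length over n_items elements."""
    return n_items ** length

# With no bound on length, the count keeps growing without limit.
print([expressions(44, k) for k in (1, 2, 3, 4)])
```

A system capped at two-call combinations tops out at 1,936 signals, however large that may seem; only an unbounded combinatorial operation escapes the cap.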

But to rephrase the core question of this essay, how did this human specificity emerge? By the proposal here, it can only have emerged by a minimal step in a population with a well developed communication system mapping calls to meanings.

In the newly diverged human ancestors, the number of calls may have increased in response to the extreme dangers and opportunities of a ground-based environment. If that happened, the capacity to remember and understand different calls may have been pushed to the limit, perhaps beyond the limit. At some point at least one call was reconfigured in such a way that it could constitute the first step to modern speech and language. In a way not possible from a mapping, both sides of the form / meaning relation have to be reconfigured. By the proposal here, this was by Decompose and recompose.

The steps and precursor steps proposed here were not events. If it took modern human ancestors at least 3 or 4 million years to learn to make a sharp edge or point, it would seem reasonable to assume that the incomparably more subtle process of encoding UG on a spine must have been similarly challenging. But the proposal here has nothing to say about the possible time scale for each step. All that can be said is that the encoding is far removed from the communicative advantages. For instance the convenience of using pronouns has no obvious relation to unobvious adjacency of levels on the spine. The difficulty of the translation here would suggest that this may have taken many thousands of generations. But as soon as the spine relation was established, the translation may have been simpler and faster, perhaps greatly so.

From the fact that there are corresponding phenomena in all languages, it is reasonable to suppose that the seven specifically linguistic steps were made separately, as an evolutionary sequence, in a population from which all humans alive today descend. At each point when a necessarily very obvious and visible step was taken, this was valued throughout and across a population. It had to be, or it wouldn’t have diffused and fixated.

But while the capacity for speech and language clearly distinguishes humans from any other animal, and mostly develops naturally without any active intervention, this is obviously not the case for all, with 1 child in 10 having minor problems with speech and language, 1 in 1,000 having major problems, and perhaps 1 in 100,000 being unintelligible in adulthood other than to close family and friends, if at all.

A conjecture about finite learnability

The seven steps here were taken by a species which had forsaken the safety of the trees for a much more dangerous life on the ground, after making at least six significant precursor adaptations. This was a population which plainly lived on its wits – or died. The population remained very small, almost dying out at one point, but ranged across Africa while what is now the Sahara desert was forested and well-watered. Within this population, by the proposal here, individuals or groups of individuals must have started restructuring some of their expressions in detectably advantageous ways, but by no more than one term at a time, so that, over the course of thousands of generations, the innovation could diffuse across the population, and (separately) become part of the genome.

The totality of this evolution was most likely over at least a million years, exploiting, but going far beyond, any life-support cognitions, such as those of making stone tools.

It seems likely that Phase critically reduces the learnability space, making language finitely learnable in what Eric Lenneberg in 1967 called the ‘critical period’ for language acquisition, normally ending around the age of ten. While there is a learning process which is normally completed across the whole population, this may not be so for all individuals. As Carol Chomsky (1969) showed, many ten year olds are still misunderstanding sentences like “I’m asking you what to feed the dog” as “I’m telling you what to feed the dog”. She suspects that some individuals may not proceed to a full understanding of this point. On a phase-theoretic analysis of the error here, the subject of the ‘feed the dog’ phrase is incorrectly not projected up to the topmost phase, represented in this case by the first person pronoun, I.

It thus may be that it is Phase which makes language finitely learnable, as it demonstrably is for the overwhelming majority; that language was not finitely learnable without it; and that, without Phase, there was a wide range of linguistic competence across the ancestral population, with only a minority having access to anything resembling the complexity of modern grammar. How small this minority may have been, how grievous the effects of relative incompetence were, and in what proportions, where the grammatical defects may have appeared, are all impossible to guess. As Phase and conjecturally finite learnability spread across the population, communication between conspecifics sharing this faculty became critically more reliable. In dealing with everyday emergencies, at critical points in hunting dangerous prey, in disseminating advances in technique and technology, reliable communication became a decisive asset. Finite learnability and the consequential reliability of communication between conspecifics would seem to have given a great advantage at the point of population survival to those having a phase-based grammar in relation to any group not having it.

This says nothing about the exact time scale here or about how quickly the steps in language evolution fixated across the ancestral population. But if the proposal here is on the right lines, one or more of the last evolutionary steps may have occurred after the divergence between the main line of modern human ancestors and Neanderthals and before more or less anatomically modern humans appeared in what is now Western Morocco around 300,000 years ago. Inheritors of the epigenetic changes by the last step in particular would have learnt to talk faster, more accurately, more reliably, and crucially more completely. It thus seems a reasonable conjecture that Phase fixated across the ancestral stem of anatomically modern Homo sapiens between 150,000 and 300,000 years ago somewhere in the Northern part of Africa, as one cognitive capacity of the new species, marking it apart from any pre-existing human species, including Neanderthals already established in Europe and central Asia. The difference may have been critical, with Neanderthal mastery of speech and language uneven, with no expectation of common understanding. Neanderthals may have been stuck at the point when only a fortunate minority had a full mastery of their linguistic inheritance, whatever that may have been, and the rest of the population had only varying degrees of competence and little or no metalinguistic ability. In competition for scarce resources, it was conjecturally the reliability of communication between conspecifics which enabled modern Homo sapiens to prevail so decisively over the established Neanderthal population in a few thousand years, soon developing the first indications of modern culture in jewelry, wall-paintings, sculpture, and musical instruments.

Precursor cognitions

Following six precursor cognitions, some of the seven specifically linguistic steps which I propose can be exemplified very approximately in the language development of a modern child – except that what the child is hearing is a fully-developed, modern language, and the child is the inheritor of a corresponding genomic capacity, albeit with only the gestural half of the first step, characterised here as Decompose and Recompose, manifested in babbling.

Clinical utility

In relation to less than fully competent speech and language, the proposal here effects a conceptual economy. By the proposal here, most disorders are by the effect of failures in the specification of a genomically defined UG which makes it possible for humans to learn to talk the way they do without needing to be helped, other than to learn what not to say. This makes it unnecessary to postulate a corresponding series of specific malformations.

There is clinical utility from the study of the apparatus and the structures that can be derived from it. It makes many common issues in speech and language disorders more accurately definable. And it broadens the range of plausible interventions into areas that would otherwise be at least hard to treat. If the proposal here is on the right lines, with seven abstractly-defined steps in the pathway to Universal Grammar, these are all useful points of measurement and focus points for intervention. For instance, many children have difficulty with both case and tense, as in “She loves me” where the S in loves expresses agreement with the singular property in She. Such children may go on saying things like “Love me” many years after most children have learnt that in a statement, both the she and the S in loves are forced in English.

Nunes (2023) argues that to help such children it may be useful to allow them to discover the ‘sisterhood’ relation between the subject marking of she, known as ‘nominative case’ and the S in loves, known as ‘third person singular’, the formal device by Work space.

Nunes (2002) notes that another group of children sometimes say monopoly as OPOLI. If such errors persist they can lead to stigma or mockery. If OPOLI is part of a broader pattern the child probably needs help. But for what exactly? If monopoly as OPOLI is by the deletion of the first three sounds, what are they? The first two are the first, unstressed syllable, sometimes confusingly called the ‘pre-tonic syllable.’ But the N is the onset of the stressed syllable. What sort of a thing is the whole of one syllable and the beginning of the next? But there is another way of looking at this. OPOLI is the domain of stress. (The N is irrelevant here). The child may be treating the stress domain as though it was the word, an easy mistake for a first language learner of English. By the framework here, this suggests a treatment approach targeting the separateness of the word from the stress domain.
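
The second way of looking at the error can be sketched in Python. The letter-by-letter transcription of monopoly, the `*` marking the stressed vowel, and the function name `stress_domain` are assumptions made here for illustration, not a clinical tool.

```python
# Rough transcription of "monopoly"; "*" marks the stressed vowel (assumed notation).
MONOPOLY = ["M", "O", "N", "O*", "P", "O", "L", "I"]

def stress_domain(segments):
    """Return everything from the stressed vowel to the end of the word."""
    for i, segment in enumerate(segments):
        if segment.endswith("*"):
            return [s.rstrip("*") for s in segments[i:]]
    return []   # no stressed vowel marked

print("".join(stress_domain(MONOPOLY)))   # OPOLI
```

On this view the child’s form is not the word minus three arbitrary sounds but the stress domain taken whole.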

A speculation about syntax, discourse, neurophysiological transmission rates, and the Faculty of language

By the proposal of this essay, a significant distinction is drawn between objects of discourse (ooh, ah, oh, eh, yes, no, OK, greetings, curses, expressions of religious devotion), sometimes switched around, commonly at the end of sentences in English, in some cases in various other positions, and objects of syntax, very rigidly ordered in English. This distinction is reflected in the fact that in some cases where speech and language are lost as a result of an accident or a stroke, only some discourse elements are preserved, at the expense of syntax, never the other way round. Conversely, it is common for syntactic objects, particularly names, to be felt to be ‘on the tip of the tongue’. But the experience of a swear word on the tip of the tongue seems to be unattested. So there is good reason for thinking that discourse objects are represented separately in the brain from syntactic objects. But this seemingly categorial distinction does not prevent items being shifted in both directions from one category to the other. Historically, modern goodbye was God abide with thee. And the conjunction, oohs and arghs, has become an object of syntax. It seems to be an everyday operation of speech and language to recategorise items in either direction.

Sandiway Fong contrasts the relatively slow rate of neurophysiological transmission with the hypersensitivity of the auditory and visual receptors. It thus seems possible that the speed of neurophysiological transmission is a bottleneck and a critical variable. While the relation between semantic and articulatory / perceptual factors is a matter of conceptual necessity across the whole range of speech and language evolution and development, this says nothing about any ordering or priority between them. So it is possible that syntactic and discourse objects are differentiated on this basis. Syntactic objects may be given some sort of priority and perhaps associated faster and earlier. And the difference may be critical.

The Faculty of Language, FL, involves two interplays, one between syntax and discourse, and the other within syntax between lexical words, comprising nouns, verbs, adjectives, and prepositions, not all invariably represented in a language, and functors like a and the, definable only on the structures in which they occur. Used in questions, words like where and what involve both discourse and the functional projection. Both sorts of interplay are thus central to the formal characterisation of FL, and thus on how it can be mathematically encoded in a way that is legible to the biology.

No matter whether Wh words like where and what are used in questions like “What do you want?” or as introducing an embedded clause in a statement like “I know what you want”, they are crucial to the system of Phase, by the proposal here, the last step in the evolution and development of Universal Grammar. Hypothetically it is Phase which makes speech and language finitely learnable. It is thus the key step in what has allowed the unique status of humans among all species alive on the planet. The syntax / discourse interplay is a key part of this.

The limits of a proposal

This page sets out a proposal. It is what I am working on, initially on the basis of the observations of two children. It is essentially a question. There may have been more steps. Or there may have been, as Chomsky suggests, a single out-of-the-blue mutation of quite extraordinary power, corresponding to Matilde Marcolli’s Hopf algebra. By the evidence here this would have to be more abstract than Merge, and such that it can only be implemented by a long period of acquisition or an even longer period of evolution. This could be by what Chomsky calls ‘a minor rearrangement of neurones’, as the singular cognitive achievement of modern Homo sapiens. But by my proposal here, the formulation evolved over a much longer evolutionary time scale, with the paradoxical effect that the infinity was reduced in steps, while remaining an infinity. This just seems to me a biologically well-motivated way of reconciling the evidence of human speech and language, as they currently are, biology, neurology, archeology, paleo-anthropology, genetics, delays, disorders, and the random cases of two individual children. The fact that the two children were brothers, living in the same family home, may have influenced the areas of their attention and interest. But it cannot have had any bearing on whatever part genetics may have played in their focus on the formalities of Universal Grammar, as documented here.

As soon as possible, I shall be expanding the sample to include three more children, unrelated to one another, observed in a more experimental setting.

As set out here, this proposal is only a sketch. There is at least another ten years of research ahead.
