The human instrument

And exapting it for speech

The apparatus we use for speech evolved for the quite different purposes of breathing and feeding, and doing both at the same time without getting food into the lungs. The mechanism fulfilled these functions for hundreds of millions of years before human ancestors started evolving in their own direction. All living humans have the same apparatus. We use it a bit like an instrument with a source for the voice and one resonator in the mouth and another in the nose. The various exaptations here are not entirely cost-free. The resulting mechanisms are all vulnerable in particular ways.

The exaptation for speech involves raising the tempo of use in feeding by about 100 times. Whereas in feeding both sides of the mouth are used, in speech the aistream is symmetrical and in one or two straight lines, mostly one. Arguably, this is a quite different sort of use from feeding.

The apparatus can be manipulated in ways which we start learning even before birth These manipulations vary across languages. The engineering and the way humans learn to use it are extraordinary. But it is easy to exaggerate the role of the two most visible elements – the tongue and the lips.

In all cases, the entire apparatus here has to interface with a cognitive system in order for communication to occur.

The vocal tract

The instrument here was first identified as what it is, the vocal tract, by William Holder (1669). He correctly identified the lungs, the airway, the larynx or voice box, the tongue, the nose, the palate, rigid at the front and flexible at the back, as a single system as far as speech is concerned.

In Holder’s day there was no idea of evolutionary pathways or of the time scales involved.

Other exaptations

The larynx has clearly been exapted from its original function. There seem to have been at least three other exaptations for the special purpose of speech:

In a way very relevant to the understanding of stammering, the development of a buffer as a temporary holding place for speech and language information by the proposal of Nunes (1994);
The lowering of the larynx;
The pointing of the chin in homo sapiens.

Both the second and third of these adaptations increase the length of the effective part of the tract in a way critical for vowels like EE – almost universal across languages.

A comparison

There is an obvious, but perhaps not very useful, comparison with the closest man-made equivalent, a wind instrument – with the variation from one note to the next by

The airflow;
The length of the column of vibrating air;
The contact with the strings in a string instrument or the ’embouchure’ or the action of the lips in a wind instrument.

Musical notes are almost universally organised in scales of unequal values, mostly counted as eight in the West or, by a different way of counting, as five in much of the East. And in music there are values of timing in the rhythm.

But in speech the pitches and rhythms that we perceive are (mostly) relative to one another, not absolute. The absolutes that we perceive are the sounds of speech, the phonemes. Speech adjusts the rhythm and the pitch, but a whole lot more as well:

The pitch by raising or lowering the structure containing the vocal cords, known as the ‘larynx’, and by adjusting the tension of the cords;
The shape of the column by squeezing it slightly or completely closing it at any of a number of points;
The size of the resonator – from just the space in the mouth to this and the internal space of the nose by squeezing or relaxing a ring of muscle around the soft palate;
The timing of any vocal cord action by bringing the cords to almost touch to vibrate against one another like the fluttering of a flag.

Speech and language involve vowels and consonants. But without a thought about either of these, we adjust the muscles according to the sounds we want – as though there were keys for each sound.

For both singing and speech by untrained voices, the system has a range of around an octave, with the vocal cords vibrating twice as fast at the top as at the bottom of the range.

Co-ordination

The working of the apparatus has to be co-ordinated with the action of the lungs, and adjusted, taking account of the feedback, of which there are three sorts:

Hearing;
The sense of where the tongue and the lips are in relation to the vocal tract, by what is known as ‘proprioception’;
The sense of which mobile part of the vocal tract is touching which immobile part.

The feedback allows fractional adjustments to be made.

For speech at a normal tempo between two and three syllables per second, the actions of the 60 or so muscles in the vocal tract are at the speed of virtuosic music making,

A muscle is activated from the brain by a signal passing along a nerve. The greater the distance, the longer this takes. Long nerves have to be initiated before short ones. Almost all speech is on breathing out. The main muscles of breathing are the diaphragm and the muscles between the ribs, known as ‘intercostals’. So the messages to start breathing out have to be sent out before the first message for the first full-of-breath word. All of the different parts of this musculature have to be co-ordinated with one another irrespective of the different lengths of paths in order to achieve a particular result.

The co-ordination of instructions and feedback is obviously intricate. The disruption of this feedback is obvious after a pain-killing injection in the mouth or by trying to talk if the sound of the speech is artificially delayed, as sometimes happens in a studio or on a mobile phone.

In some medically well-defined conditions ike cerebral palsy, the chain of instructions is disrupted.

Some children’s issues seem to be exclusively motoric without any medical diagnosis, as, for example, where the activity of a muscle in the tongue triggers a corresponding activity in a muscle in the face. But in the absence of any known medical factor, such exclusively motoric issues seem to me to be much rarer than commonly thought.

Developmentally

The child has to work out that there is a difference between a word, with a rhythmic series of feet, and syllables within the feet, and the domain over which stress is computed, consisting of a vowel, and one or two syllables to the right. Crucially the consonant or consonants before the stress domain have nothing to do with the stress. And there may be a whole syllable before this which is not counted either. Hence banana as NAHNA or BAHNA, and later on monopoly as OPOLI or NOPOLI. In the study of poetry this is called scansion. In English, the scansion is from right to left. In modern Scottish Gaelic and many other languages, the scansion is from left to right. The direction of the scansion is something which has to be learnt – within the learnability space. Children learning English seem to have little difficulty learning the directionality here. Banana as BANAH, monopoly as MONOPOL – leaving out the final syllable or final vowel, seem to be unattested. There may be a child doing this, but such a child is rare.

Not behaviour

Although practise and familiarity are obviously relevant, by the framework here, speech and language are plainly not behaviour in the ordinary sense of this word.

The contrasting case of sign language

Every sensory system (we have a lot more than five senses) has its own degree of resolution. Humans belong to the order of primates which first evolved as small, nocturnal, tree-living grub catchers, using good hand-eye co-ordination to catch their prey. The sounds of the grubs scratching might betray their presence. But they had to be seen to be caught. Primate hearing and eyesight evolved to suit this way of life. On this time-scale, the evolution of speech and language is a brief after-thought.

So we can see more things happening at once than we can hear. Although sign language is called what it is, the action of the hands is only one component of the message. There is also the direction of the look, left or right, up or down, the action of the tongue and lips, the orientation of the head. The signing resolves into hand configurations in various positions and orientations separately or together. Spoken language cannot match sign language in these respects.

Do you have an enquiry?

Contact Me