Since a 1998 proposal by Noam Chomsky, it has been widely assumed that the operations of ‘syntax’, or what many understand as grammar, happen in narrowly defined packages or ‘phases’ of operations, one after another. This notion of Phase is taken to be specific to the syntax. I propose here that Phase applies also to the sound system or ‘phonology’, with huge consequences for the process of acquisition and thus for children’s speech. The first phase is the construction of the speech sounds or ‘phonemes’ from more elemental ‘features’.

Phase and speech acquisition

Patterning in speech errors is odd. It shouldn’t occur. If it does occur there must be a reason. And that reason may be important.

These patterns are observable from the first words up until the age at which full adult competence is reached around the age of nine, plus or minus a year or so. Even in children with significant delays and disorders of speech and language development, there is much of the same non-randomness in the patterning of the errors. If all the errors were in one direction or polarity, this would point to some external factor, either articulatory or auditory. But many errors are demonstrably non-random. At least in the case of normally developing children, the numbers are large, the data is easy to collect, and there are no ethical issues of observation. The data can be plotted taking account of only chronological age from first words to the point when the longest and most difficult words can be pronounced – around eight.

The fact that this non-randomness occurs, and the way it is organised, are hard to explain other than by a factor applying to the whole structure of both words and the sounds within them, in what Marlys Macken in 1995 called the ‘learnability space’ during what Eric Lenneberg in 1967 called the ‘critical period’.

I propose that the reason for this non-random patterning in the error distribution is the phasing. The first phase is what in 1984 Diana Archangeli called ‘alphabet formation’.

If all aspects of speech are phased, and if there are commonly observed defects or patterns in the acquisition of the system, it is likely that these defects will be in operations which are either phased together or at the same point in a number of different phases.

Making a complex system learnable

Taking English as just one of the six or seven thousand odd languages in the world, the learner has no guidance on where his or her target language lies with respect to all the possible variations. Some of the most crucial characteristics are not obvious at all. But among the more obvious characteristics, in comparison to most languages English has:

• A relatively large inventory of sounds (no matter how this is counted);

• Relatively complex syllables or ‘phonotactics’ (with up to three segments before the vowel or nucleus, up to two segments in the nucleus, which does not always include a vowel, and up to three segments after the nucleus – in glimpse, next and length);

• Relatively complex metricality with one primary stress, any number of secondary stresses, and one light ‘rime’ (the nucleus and a following consonant if there is one) on the rightmost edge, which is discounted in the calculation of stress, as in hippopotamus;

• Pitch or ‘tone’ used only to mark the difference between questions and other sorts of sentences and for various sorts of ‘pragmatic’ effects or what we can do with words.

That is just a small fragment of what Marlys Macken in 1995 calls the ‘learnability space’.

A new understanding

Conventionally, the properties of the phoneme are defined by a simultaneity of gestures within the vocal tract. The totality of these gestures gives an approximation to what is known as a ‘stop’ like T or D, or as a ‘sonorant’ like R. By a standard understanding about the inventory, T involves the tongue-tip, a complete closure which is suddenly and completely released, an audible pause after the release of the closure, and the definition of this as a consonant by its position in the syllable and its relatively low acoustic sonorance. All of these specifications are thought to apply simultaneously.

But by the hypothesis here, the functionality of Phase applies not just to the syntax or grammar, but also to the sound system or phonology. All of the cortical commands which define the gestures of speech are actually sequential. Within the most local phase the sequence is too fast to be humanly detectable. The ordering of this command-structure is language-specific and thus within the learnability-space. This space is greatly reduced by its being ‘chunked’ into time scales or phases or degrees of locality. And this is what makes speech and language reliably learnable in a finite period of time, in a way quite unlike other human skills.

The sequencing and the phasing are subtle and hard to learn. Even with respect to the ‘simplest’ sounds, such as English T, child speech can be nearly competent, but not completely.

By the proposal here, there is more than one phase in the building of the English syllable, and similarly in relation to English metricality, and in respect of the English phonemic inventory.

The general principles of Phase are:

• To do as much as possible as late as possible (stretching the derivation, increasing the scope and range of contrasts);

• At the boundaries between phases, as elements are glued together, to make the past history of the derivation ‘invisible’;

• To use the smallest possible amount of apparatus (minimising the size of the inventory of sounds);

• To ensure that the derivation remains recoverable (applying Fit only as and when necessary).

For the purposes of speech, Glue first applies only within the most local, shortest, most immediate Phase, allowing the time scale of phoneme building to be compressed to the point that actually sequential events are perceived as simultaneous. This applies to ‘short’ vowels, as in hat and hit, and to the bare, simple stops, like T or D. But even here there is significant time, even though this is often not noticed. English hod is differentiated from hot mainly by the length of the vowel. And Southern British English tea is differentiated from Irish English tea partly by the fact that in the latter there is a much longer delay between the release of the tongue tip closure for the T and the bringing together of the vocal cords so that they vibrate against one another, forming a vowel. There is a different order of time scale in long vowels like the vowel in tea, where the tongue squeezes itself to fill the top front-most corner of the mouth, staying in more or less the same position, and in toy, where the tongue glides up and frontwards throughout the articulation. There is another time scale where N is repeated or ‘geminated’ in words like unknown, and yet another time scale when a stop is released not so as to effect a sudden blast of airstream along the top of the mouth, but gradually, to produce a brief moment of high frequency noise in what is known as an affricate, as in the first sounds in chew and Jew (differentiated by the relative timing of what is going on down in the voice box).

Phase and speech variation

In the D and B of Dee and bee, the vocal cords are set close together to allow them to vibrate spontaneously only a brief moment after the release of a complete closure of the airstream by the tongue tip or the lips. In the T and P of tea and pea there is a much longer delay. If the delay is reduced, this is perceived as what is known as ‘voicing’ in D and B. If the delay is fractionally longer, this is perceived as ‘voicelessness’ in T and P. But perceptually the delay is ignored. T, P, D and B are perceived as categorially different phonemes, rather than different by virtue of different positions on a scale of timing and different articulators.
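The step from a continuous delay to a categorial percept can be illustrated with a toy model. This is only a sketch under stated assumptions: the function name `perceive_stop`, the articulator labels and, above all, the 30 millisecond boundary are hypothetical values chosen for illustration, not measurements reported here.

```python
# Hypothetical category boundary in milliseconds. This number is an
# illustrative assumption, not a value taken from the text.
BOUNDARY_MS = 30.0

# Each percept is fixed by an articulator plus a voiced/voiceless category.
STOPS = {
    ("tongue_tip", True): "D",
    ("tongue_tip", False): "T",
    ("lips", True): "B",
    ("lips", False): "P",
}


def perceive_stop(articulator, voicing_delay_ms):
    """Map a continuous voicing delay onto one of two discrete categories."""
    voiced = voicing_delay_ms < BOUNDARY_MS
    return STOPS[(articulator, voiced)]


# The delay varies smoothly, but only the category label survives.
for delay in (5, 15, 25, 40, 60, 80):
    print(delay, perceive_stop("tongue_tip", delay))
```

Every delay on one side of the boundary yields the same label, which is the sense in which the delay itself is ignored perceptually.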

There is a phoneme like T in London English, Irish English, Russian, and so on. They are all articulated by a closure with the tongue tip and with particular timing relations between what is happening with the tongue and in the larynx. There are four features centrally involved here: non-continuance defining the stops, the articulator or the position in the vocal tract at which it is closed, the voicing or the relative timing of the release and the approximation of the vocal cords, and whether or not this is accentuated by ‘aspiration’. By the hypothesis here, these settings are made in different sequences in different languages. But other features may be involved. So these phonemes are not the same. The language specificities surface in:

• The differences between English R, the Russian and Scottish burred R, French and German back-of-the-mouth R, and Spanish R with what is known as a ‘tap’;

• The difference between Russian T and D and T and D in the languages of Western Europe;

• The differences in the relative timings of the bringing together of vocal cords to start them vibrating against one another and the release of the closure in the mouth between P, T, and K in the North of England and Ireland, London, Paris, and Southern France;

• Subtle, barely-noticed, generational differences in the pronunciation of various vowels in both Britain and North America.

There are more easily perceptible points on the time scale here, as in the sequence of the initial stop and following fricative elements of affricates as in chew and jaw, and the ‘onglide’ and ‘offglide’ of diphthongs – in English, as in most other languages, with the tongue higher in the ‘vowel space’ at the end than at the beginning, as in high and now. But even this relatively gross time scale tends to be disregarded in favour of the categorisation by phonemes.

A single theory of speech delay and disorder

The most characteristic incompetences are with respect to Phase, too early in some cases, but mostly too late, as in the cases of early fronting, stopping, and what looks like, but isn’t, tongue tip assimilation in calculator as KALTALATOR in normally developing children of seven or eight.

What are commonly taken to be the characteristic ‘processes’ of child speech arise mainly from the misapplication of a fitting function, with the child’s system trying to do too much too late, going too far.

In one common, non-pathological case, hospital is said by many normally developing children of five, six and seven, as HOSTIPAL, with the T and P reversed by what is commonly known as ‘metathesis’, though nothing quite like this ever happens in competent speech. By the notion of phases, there are two steps here. First the tongue tip gesturing of the L is detected, and this triggers an increase in the contrast between this and the adjacent tongue tip gesture of the T. The most easily available way to increase the contrast is to copy the labiality of the P in the next syllable to the left. Then by a later phase, the copying is detected, what was originally a P segment has lost its defined articulator, leaving only the default articulator, in English, as in most languages, the tongue tip. And the word is then said as HOSTIPAL.
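The two steps can be traced in a toy derivation. This is a minimal sketch of one reading of the paragraph above, not a general model: the segment representation, the function names `phase_one` and `phase_two`, and the intermediate form it produces are all assumptions made for illustration, and the positions of the two stops are hard-wired to this one word.

```python
DEFAULT_ARTICULATOR = "tongue_tip"
STOP_SYMBOL = {"labial": "P", "tongue_tip": "T"}


def stop(articulator):
    # A stop carries an articulator and a flag recording that it was copied from.
    return {"articulator": articulator, "copied": False}


def target_word():
    # Adult target: other segments are plain letters, the two stops are records.
    return ["H", "O", "S", stop("labial"), "I", stop("tongue_tip"), "A", "L"]


def clone(word):
    # Copy the word so that each phase leaves its input untouched.
    return [dict(seg) if isinstance(seg, dict) else seg for seg in word]


def phase_one(word):
    """Copy the labiality of the P onto the stop next to the L."""
    word = clone(word)
    stops = [seg for seg in word if isinstance(seg, dict)]
    source, target = stops[0], stops[1]
    target["articulator"] = source["articulator"]
    source["copied"] = True
    return word


def phase_two(word):
    """The copied-from stop loses its articulator and takes the default."""
    word = clone(word)
    for seg in word:
        if isinstance(seg, dict) and seg["copied"]:
            seg["articulator"] = DEFAULT_ARTICULATOR
            seg["copied"] = False
    return word


def spell(word):
    return "".join(
        STOP_SYMBOL[seg["articulator"]] if isinstance(seg, dict) else seg
        for seg in word
    )


step_one = phase_one(target_word())
step_two = phase_two(step_one)
print(spell(target_word()), spell(step_one), spell(step_two))
```

On this reading the apparent swap of T and P falls out of a copy followed by a default, with neither step being a reversal in itself.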

In clearly pathological cases, as in cardigan as KARDINTON, there is indeed a sequence of steps, changing the G to a D, losing the voicing of the stop, and copying the nasality of the final consonant, one syllable to the left. But these steps are all very late, at the end of the derivation, appropriate additions to it, all involving tongue tip articulations.

In the most severe cases, as with watch as BOP, glove as DUD, finger as DINDER, milk as GIK, with all three articulators seeming to harmonise with one another, the derivation ‘conspires’ to eliminate all but one articulator in the syllable, in a sequence beyond any easy or reasonable unscrambling. In that particular case of a child of three and a half, there were, I believe, six steps in the derivation of these familiar words of one or two syllables. The speech was incomprehensible to one of the most careful, insightful and attentive of mothers doing her level best not to make the problem more apparent to the child than it already was.

A precondition

In order for Phase to work as I am suggesting here there has to be a very high order acoustic discrimination. There is simple, but not obvious, evidence of this sort of discrimination in how the urban cyclist navigates in traffic travelling faster than he or she can ride. Cars, buses, motorbikes and lorries are all potentially lethal. The most dangerous are those coming from behind. As a matter of life and death, the cyclist has to determine the angle at any moment in time between his or her position and the nearest motorised vehicle to the rear, ten degrees, twenty degrees, and so on. And cyclists do this routinely, or they don’t venture onto roundabouts where precisely this scenario happens routinely. In order to determine where a sound is coming from it is necessary to compute the timing difference between a sound wave hitting the left and right ears, or more precisely the left and right cochleas. In the case of a single modulation from the sound of a car or bus at ten degrees off the centre line, this is an extraordinarily precise discrimination – with the transmission of sound through air at sea level at around 332 metres per second and the cochleas about 25 centimetres apart, the timing difference is around one ten thousandth of a second, a much shorter interval than we can access consciously.
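The arithmetic behind that figure can be checked with a short calculation. The sketch below assumes a simple far-field, straight-line geometry in which the extra path to the far ear is the ear separation times the sine of the angle; that simplification, and the function name, are assumptions for illustration. The speed of sound, the separation and the angle are the figures given above.

```python
import math


def interaural_time_difference(angle_deg, separation_m=0.25, speed_m_per_s=332.0):
    """Arrival-time difference between the two ears, in seconds.

    Assumes a distant source and a straight-line path, so the extra
    distance to the far ear is separation * sin(angle).
    """
    extra_path_m = separation_m * math.sin(math.radians(angle_deg))
    return extra_path_m / speed_m_per_s


for angle in (10, 20):
    delay_s = interaural_time_difference(angle)
    print(f"{angle} degrees: {delay_s * 1e6:.0f} microseconds")
```

At ten degrees this comes to roughly 130 microseconds, a little over one ten thousandth of a second, in line with the figure in the text.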

Accuracy of acoustic angle discrimination was important for nocturnal primate ancestors fifty million or so years ago, dependent on catching bugs and grubs in trees. But on recent biological evidence humans seem to have uncommonly well-tuned auditory systems. This fine-tuning may have been driven by the special requirements of learning to speak.

A complicated way of making things easy?

Perhaps so, but this appears to be a necessary price for the finite learnability of a complex system, similar to the price we pay for having a large brain and a constant, internally-adjusted, temperature.