Basic issues in speech perception
As already mentioned, human auditory perception is especially well tuned to speech sounds. Our hearing is most sensitive to sounds in the frequency range in which most speech sounds are found, i.e. between 600 Hz and 4000 Hz.
It is also the case that the human perceptual system streams language and non-language signals, i.e. treats them as separate inputs, thereby reducing the distracting effect of non-speech signals on speech perception. This has been shown through phoneme restoration effects (Samuel, 1990). When listeners hear words in which a speech sound (phoneme) has been replaced by a non-speech sound such as a cough, they are highly likely to report the word as intact, i.e. the cough is treated as part of a separate stream. There is a clear linguistic influence here, as the restoration effect is stronger with real words than with nonsense words. It has also been shown that when the word-level information is ambiguous, the word that is restored is one which matches the sentence context. For example, the sequence /#il/, where # indicates a non-speech sound replacing or overlaid on a consonant, could represent many possible words (deal, feel, heal, etc.). In the different conditions shown in (7.2), a word will be reported that is appropriate to the context shown by the final word in the utterance (originally demonstrated by Warren & Warren, 1970). So if that word is orange, then peel is reported; if it is table, then meal; and so on. Linguistic effects on perception are powerful, and will be discussed in more detail later in this chapter.

The language-specific nature of streaming effects is indicated by anecdotal evidence from students who are asked to listen to recordings from a language with a very different sound inventory from their own, and who experience some of the sounds as non-speech sounds external to the speech stream. A good example of this is when English-speaking students first listen to recordings of a click language, with many students reporting the click consonants as a tapping or knocking sound happening separately from the speech.
Despite the evidence that listeners segregate speech and non-speech signals, it is also clear that the perceptual system will integrate these if at all plausible. That is, if a non-speech sound could be part of the simultaneous speech signal, we will generally perceive it as such. For example, if the final portion of the /s/ sound in the word slit is replaced by silence, then this silence is interpreted as a /p/ sound, resulting in the word split being heard (see the exercises at the end of this chapter). A stretch of silence is one of several cues to a voiceless plosive such as /p/ – the silence results from the closure of the lips with no simultaneous voicing noise, and is sufficient in this context to result in the percept of a speech sound.
Frequently there are multiple cues to a speech sound, or to the distinction between that speech sound and a very similar one. The voiceless bilabial plosive /p/ sound is cued not just by the silence during the lip closure portion of that consonant, but also by changes that take place in the formant structure of any preceding vowel as the lips come together to make the closure, by the duration of a preceding vowel (voiceless stops in English tend to be preceded by shorter variants of a vowel than voiced stops such as /b/), by several properties of the burst noise as the lips are opened, and so on. While some of these cues may be more important or more reliable than others, it is clear that the perception of an individual sound depends on cue integration, involving a range of cues that distinguish this sound from others in the sound inventory of the language.
A fascinating instance of cue integration comes from studies of speech perception that involve visual cues. We are often able to see the people we are listening to, and their faces tell us much about what they are saying. A particular set of cues comes from the shape and movements of the mouth. For a bilabial plosive /b/ or /p/ there will be a visible lip closure gesture; for an alveolar plosive /d/ or /t/ it might be possible to see the tongue making a closure at the front of the mouth, just behind the top teeth; for a velar plosive /g/ or /k/ the closure towards the back of the mouth will be visually less evident. Normally, these visual cues will be compatible with the auditory cues from the speech signal, and therefore will supplement them. If, however, the visual cues and the auditory cues have been experimentally manipulated so that they are no longer compatible, then they can converge on a percept that is different from that signalled by either set of cues on their own. This is known as the McGurk effect, after one of the early researchers to identify the phenomenon (McGurk & MacDonald, 1976). For instance, if the auditory information indicates a /ba/ syllable, but the visual information is from a /ga/ syllable, showing no lip closure, then the interpretation is that the speaker has said /da/. Examples of this effect are available on the website for this book (see also the exercises at the end of this chapter).
Another, and at first glance somewhat bizarre, cue integration effect has been reported in what have been referred to as the ‘puff of air’ experiments. In these experiments participants listen to stimulus syllables that are ambiguous between, say, /ba/ and /pa/. For speakers of English and many other languages, one of several characteristics that distinguish the /b/ and /p/ sounds in these syllables is that a stronger puff of air accompanies the /p/ than the /b/. In phonetics terminology, the /p/ is aspirated and the /b/ is unaspirated. In the experiments, it was found that participants were more likely to report the ambiguous stimulus as /pa/ if they also felt a puff of air presented simultaneously with the speech signal. The effect was found whether the puff of air was directed at the hand or at the neck (Gick & Derrick, 2009), or even at the ankle (Derrick & Gick, 2010).
It has also been shown that cue trading is involved in speech perception. For instance, if the release burst of a /p/ is unclear, perhaps because of some non-speech sound that happened at the same time, then the listener may assign greater perceptual significance to other cues such as the relative duration of the preceding vowel and movements in the formants at the end of that vowel.
These cues in the formant movements are a result of coarticulation – the articulation of one sound is influenced by the articulation of a neighbouring sound. It appears that our perceptual system is so used to the phenomenon of coarticulation that it will compensate for it in the perception of sounds. For instance, Elman and McClelland (1988) asked participants to identify a word as capes or tapes. Their experiment hinged on the fact that a /k/ is pronounced further forward in the mouth, so closer to a /t/, when it follows /s/ (as in Christmas capes) than when it follows /ʃ/ (as in foolish capes). This is because /s/ is itself further forward in the mouth than /ʃ/ and there is coarticulation of the following /k/ towards the place of articulation of the /s/. Elman and McClelland manipulated the first speech sound in capes or tapes to make it sound more /k/-like or more /t/-like. One of the cues to the difference between the /k/ and /t/ sounds is the height in the frequency scale of the burst of noise that is emitted when the plosive is released – it is higher for front sounds like /t/ than it is for back sounds like /k/. In the experiment, the noise burst of the initial consonant in tapes or capes was manipulated to produce a range of values that were intermediate between the target values for /t/ and /k/. Participants heard tokens from this range of tapes/capes stimuli after either Christmas or foolish, and had to report whether they heard the word as tapes or capes. The results were very clear – after the word Christmas, tokens on the /t/–/k/ continuum were more likely to be heard as /k/ than when the same tokens followed foolish (see Figure 7.4). That is, the participants expected the coarticulation effect to lead to a ‘fronted’ /k/ after /s/, and compensated for this in their interpretation of the frequency level of the burst noise.

 
Our perception and comprehension of speech is also affected by signal continuity. That is, listeners are better able to follow a stream of speech if it sounds like it comes in a continuous fashion from one source. This lies behind the cocktail party effect, where we are able to follow one speaker in a crowded room full of conversation despite other talk around us (Arons, 1992). This effect can be demonstrated in various ways. In one task, participants hear two voices over stereo headphones, and are asked to focus on what is being said on just one of the headphone channels, the left channel for example. If the voice on the left channel switches to the right part way through the recording, then participants find that their attention follows the voice to the right channel. They then report at least some of what is then said on the right channel, despite the instructions to focus on the left channel (Treisman, 1960). The strength of this effect is reduced if the utterance prosody is disrupted at the switch. The importance of signal continuity is also demonstrated by the relative unnaturalness of some computer-generated or concatenated speech, such as that found in the automated speech of some phone-in banking systems.
Active and passive speech perception
There have been numerous attempts to frame aspects of speech perception in models or theories (some of these are reviewed by Klatt, 1989). One distinction that has been made between different models concerns the degree of involvement of the listener as speaker, characterised as a difference between active and passive perception processes.
Passive models of speech perception assume that we have a stored system of patterns or recognition units, against which we match the speech that we hear. Depending on the specific claims of the model, these stored patterns might be phonetic features or perhaps templates for phonemes or diphone sequences, and so on. A phoneme-based perception model might for example include a template for the /i/ phoneme that shares some of the common characteristics of the spectrogram slices shown in Figure 7.3. A feature-based model might include a voicing detector that examines the input for the presence of the regular repetition of speech waves that corresponds to vocal cord vibration, and would have similar detectors for other features that define a speech sound.
Incoming speech data is matched against the templates, and a score is given for how well the data matches each template. These scores are evaluated and a best match determined. Many automatic speech recognition systems operate like this – they have templates for each recognition unit and match slices of the input speech data against these templates. Such systems perform best when they have had some training, usually requiring the user to repeat some standard phrases so that the speech processing system can develop appropriate templates.
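To make the template-matching idea concrete, the following is a minimal sketch, not taken from any actual recognition system or from this book, of how input slices might be scored against stored patterns. The phoneme labels, the feature vectors and the best_match helper are all hypothetical illustrations.

```python
# Toy template matcher of the kind described above: each recognition unit
# (here, a phoneme) has a stored spectral template; an incoming slice of
# speech is scored against every template and the best match wins.
import numpy as np

# Hypothetical stored templates: phoneme label -> averaged spectral envelope
# (in a real recogniser these would be learned from training phrases).
templates = {
    "/i/": np.array([0.9, 0.2, 0.7, 0.1]),
    "/a/": np.array([0.8, 0.6, 0.2, 0.1]),
    "/s/": np.array([0.1, 0.2, 0.5, 0.9]),
}

def score(slice_features: np.ndarray, template: np.ndarray) -> float:
    """Higher score = better match (negative Euclidean distance)."""
    return -float(np.linalg.norm(slice_features - template))

def best_match(slice_features: np.ndarray) -> tuple[str, float]:
    """Evaluate the input slice against every template and pick the best."""
    scores = {label: score(slice_features, t) for label, t in templates.items()}
    label = max(scores, key=scores.get)
    return label, scores[label]

# Example: a slice of input whose spectrum most resembles the /i/ template.
print(best_match(np.array([0.85, 0.25, 0.65, 0.15])))
```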
Active models of speech perception argue that our perception is influenced by or depends on our capabilities as producers of speech. One model of this type involves analysis-by-synthesis. Here, the listener matches the incoming speech data not against a stored template for input units of speech, but against the patterns that would result from the listener’s own speech production, i.e. the listener synthesises an output (or a series of alternative outputs) and matches that against the analysis of the input.
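The analysis-by-synthesis loop can be sketched in the same toy style. The sketch below is purely illustrative and assumes a hypothetical synthesise function standing in for the listener's own production knowledge; the candidate syllables and patterns are invented for the example.

```python
# Toy analysis-by-synthesis: generate the pattern each candidate interpretation
# would produce, compare it with the analysed input, and keep the best fit.
import numpy as np

def synthesise(candidate: str) -> np.ndarray:
    """Hypothetical production model: candidate interpretation -> expected pattern."""
    toy_patterns = {
        "ba": np.array([0.2, 0.8, 0.1]),
        "da": np.array([0.5, 0.5, 0.2]),
        "ga": np.array([0.8, 0.2, 0.3]),
    }
    return toy_patterns[candidate]

def analysis_by_synthesis(input_features: np.ndarray, candidates: list[str]) -> str:
    """Return the candidate whose synthesised pattern best matches the input."""
    errors = {c: float(np.linalg.norm(input_features - synthesise(c)))
              for c in candidates}
    return min(errors, key=errors.get)

# Example: an input pattern closest to what producing 'da' would yield.
print(analysis_by_synthesis(np.array([0.45, 0.55, 0.18]), ["ba", "da", "ga"]))
```

The contrast with the passive sketch above is that the stored knowledge here is a production model rather than a set of input templates.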