[Short-answer question]

Speech Recognition Systems

Automatic recognition of speech by machine [1] has been a goal of research for more than four decades and has inspired such science fiction wonders as the computer HAL in Stanley Kubrick's famous movie 2001: A Space Odyssey [2] and the robot R2D2 in the George Lucas classic Star Wars [3] series of movies. However, in spite of the glamour of designing an intelligent machine that can recognize the spoken word and comprehend its meaning, and in spite of the enormous research efforts spent in trying to create such a machine, we are far from [4] achieving the desired goal of a machine that can understand spoken discourse on any subject by all speakers in all environments. Thus, an important question is, What do we mean by "speech recognition by machine"? Another important question is, How can we build a series of bridges that will enable us to advance both our knowledge and the capabilities of modern speech-recognition systems, so that the "holy grail" [5] of conversational speech recognition and understanding by machine is attained?

Because we do not know how to solve the ultimate challenge of speech recognition, our goal here is to give a series of presentations on the fundamental principles of most modern, successful speech-recognition systems so as to provide a framework from which researchers can expand the frontier. We will attempt to avoid making absolute judgments on the relative merits of various approaches to particular speech-recognition problems. Instead, we will provide the theoretical background and justification for each topic discussed so that the reader is able to understand why the techniques have proved valuable and how they can be used to advantage in practical situations.

One of the most difficult aspects of performing research in speech recognition by machine is its interdisciplinary nature [6], and the tendency of most researchers to apply a monolithic approach to individual problems.
Consider the disciplines that have been applied to one or more speech-recognition problems:

1. signal processing: the process of extracting relevant information from the speech signal in an efficient, robust manner. Included in signal processing is the form of spectral analysis used to characterize the time-varying properties of the speech signal, as well as various types of signal preprocessing (and postprocessing) to make the speech signal robust to the recording environment (signal enhancement).
2. physics (acoustics): the science of understanding the relationship between the physical speech signal and the physiological mechanisms (the human vocal tract mechanism) [7] that produced the speech and with which the speech is perceived (the human hearing mechanism).
3. pattern recognition: the set of algorithms used to cluster data to create one or more prototypical patterns of a data ensemble, and to match a pair of patterns on the basis of feature measurements of the patterns.
4. communication and information theory: the procedures for estimating parameters of statistical models, the methods for detecting the presence of particular speech patterns, and the set of modern coding and decoding algorithms used to search a large but finite grid for a best path corresponding to a "best" recognized sequence of words.
5. linguistics: the relationships between sounds, words in a language, meaning of spoken words, and sense derived from meaning. Included within this discipline is the methodology of grammar and language parsing.
6. physiology: understanding of the higher-order mechanisms within the human central nervous system that account for speech production and perception in human beings. Many modern techniques try to embed this type of knowledge within the framework of artificial neural networks (which depend heavily on several of the above disciplines).
7. computer science: the study of efficient algorithms for implementing, in software or hardware, the various methods used in a practical speech-recognition system.
8. psychology: the science of understanding the factors that enable technology to be used by human beings in practical tasks.

Successful speech-recognition systems require knowledge and expertise from a wide range of disciplines, a range far larger than any single person can possess [8]. Therefore, it is especially important for a researcher to have a good understanding of the fundamentals of speech recognition (so that a range of techniques can be applied to a variety of problems), without necessarily having to be an expert in each aspect of the problem. The purpose here is to provide this expertise by giving in-depth discussions of a number of fundamental topics in speech-recognition research.

A general model for speech recognition begins with a user creating a speech signal (speaking) to accomplish a given task. The spoken output is first recognized, in that the speech signal is decoded into a series of words that are meaningful according to the syntax, semantics, and pragmatics [9] of the recognition task. The meaning of the recognized words is then obtained by a higher-level processor that uses a dynamic knowledge representation to modify the syntax, semantics, and pragmatics according to the context of what it has previously recognized. In this manner, things such as non sequiturs are omitted from consideration, at the risk of misunderstanding but at the gain of minimizing errors for sequentially meaningful inputs. The feedback from the higher-level processor reduces the complexity of the recognition model by limiting the search for valid input sentences (speech) from the user. The recognition system responds to the user in the form of a voice output or, equivalently, in the form of the requested action being performed, with the user being prompted for more input.
A Brief History of Speech-Recognition Research

Research in automatic speech recognition by machine has been done for almost four decades. To gain an appreciation for the amount of progress achieved over this period, it is worthwhile to briefly review some research highlights [10]. The reader is cautioned that such a review is cursory, at best, and must therefore suffer from errors of judgment as well as omission.

The earliest attempts to devise systems for automatic speech recognition by machine were made in the 1950s at Bell Laboratories. Davis, Biddulph, and Balashek built a system for isolated digit recognition for a single speaker. The system relied heavily on measuring spectral resonances during the vowel region of each digit. Another effort of note in this period was the vowel recognizer of Forgie and Forgie, constructed at MIT Lincoln Laboratories [11] in 1959, in which 10 vowels embedded in a /b/-vowel-/t/ format were recognized in a speaker-independent manner. A filter bank analyzer was used to provide spectral information, and a time-varying estimate of the vocal tract resonances was made to decide which vowel was spoken.

In the 1960s several fundamental ideas in speech recognition surfaced and were published. However, the decade started with several Japanese laboratories entering the recognition arena and building special-purpose hardware as part of their systems. In the 1960s three key research projects were initiated that have had major implications for the research and development of speech recognition over the past 20 years. The first of these projects was the effort of Martin and his colleagues at RCA [12] Laboratories, beginning in the late 1960s, to develop realistic solutions to the problems associated with nonuniformity of time scales in speech events. At about the same time, in the Soviet Union, Vintsyuk proposed the use of dynamic programming methods for time aligning a pair of speech utterances.
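The idea of time aligning two utterances by dynamic programming can be illustrated with a minimal sketch. The version below is a generic textbook form of dynamic time warping over one-dimensional feature sequences, not a reconstruction of Vintsyuk's or any other historical algorithm. The function name `dtw_distance`, the absolute-difference local cost, and the three allowed step directions are all assumptions made for illustration.

```python
def dtw_distance(a, b):
    """Return the cumulative cost of the best time alignment of a and b.

    a and b are sequences of numbers standing in for per-frame features
    (an illustrative simplification; real systems use feature vectors).
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] holds the best cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Local mismatch between the two frames being paired.
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor cells:
            # advance in a only, advance in b only, or advance in both.
            cost[i][j] = local + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m]
```

Under these assumptions, `dtw_distance([1, 2, 3], [1, 2, 2, 3])` is 0.0: the repeated middle frame of the second sequence is absorbed by the alignment, which is the kind of time-scale nonuniformity the method is meant to handle.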
Although the essence of the concepts of dynamic time warping, as well as rudimentary versions of the algorithms for connected word recognition, were embodied in Vintsyuk's work, it was largely unknown in the West and did not come to light until the early 1980s; this was long after the more formal methods were proposed and implemented by others. A final achievement of note in the 1960s was the pioneering research of Reddy in the field of continuous speech recognition by dynamic tracking of phonemes. Reddy's research eventually spawned a long and highly successful speech-recognition research program at Carnegie Mellon University, which, to this day, remains a world leader in continuous-speech-recognition systems.

In the 1970s speech-recognition research achieved a number of significant milestones. The first was in the area of isolated word, or discrete utterance, recognition. The Japanese research showed how dynamic programming methods could be successfully applied, and the American research showed how the ideas of linear predictive coding (LPC) [13], which had already been successfully used in low-bit-rate speech coding, could be extended to speech-recognition systems through the use of an appropriate distance measure based on LPC spectral parameters. Another milestone of the 1970s was the beginning of a longstanding, highly successful group effort in large vocabulary speech recognition at IBM, in which researchers studied three distinct tasks: simple database queries, the laser patent text language for transcribing laser patents, and the office correspondence task, called Tangora, for dictation of simple memos.

Speech research in the 1980s was characterized by a shift in technology from template-based approaches to statistical modeling methods, especially the hidden Markov model approach.
Although the methodology of hidden Markov modeling (HMM) [14] was well known and understood in a few laboratories, it was not until widespread publication of the methods and theory of HMMs, in the mid-1980s, that the technique became widely applied in virtually every speech-recognition research laboratory in the world. Another "new" technology that was reintroduced in the late 1980s was the idea of applying neural networks to pr