As a key human-machine interface technology, speech recognition has significant research importance and broad application value. This paper reviews the development of speech recognition technology, explains its basic concepts, principles, and acoustic modeling methods, and briefly introduces its applications in various fields.
Language is the most common, effective, and convenient form of human communication, and speech is the acoustic expression of language. Communicating with machines has long been a human dream. With the rapid development of computer technology, speech recognition has achieved breakthroughs, and dialogue between humans and machines in natural language is gradually becoming a reality. Speech recognition is used in a wide range of applications: it appears in many aspects of daily life and also plays an extremely important role in the military field. It is a key technology in the information society's move toward intelligent and automated systems, making information processing and access more convenient and thereby improving work efficiency.
1 Development of speech recognition technology
Speech recognition technology began in the 1950s. During this period, the study of speech recognition focused on the recognition of vowels, consonants, numbers, and isolated words.
In the 1960s, speech recognition research made substantial progress. Linear predictive analysis and dynamic programming were proposed in this period. Linear predictive coding (LPC) provided a model of speech signal generation and an effective means of extracting speech features, while dynamic programming addressed the problem of comparing speech signals of unequal length.
In the 1970s, speech recognition technology made breakthrough progress. Dynamic Time Warping (DTW), which is based on dynamic programming, became largely mature, and the theories of Vector Quantization (VQ) and the Hidden Markov Model (HMM) were proposed.
In the 1980s, speech recognition tasks began to shift from isolated-word and connected-word recognition to large-vocabulary, speaker-independent, continuous speech recognition. Recognition algorithms also shifted from traditional methods based on standard template matching to methods based on statistical models. For acoustic modeling, the HMM came to be widely used in Large Vocabulary Continuous Speech Recognition (LVCSR) because it describes both the time-varying nature and the short-term stationarity of speech well. For language modeling, statistical language models, represented by the N-gram model, began to be widely used in speech recognition systems. At this stage, speech modeling methods based on HMM/VQ, HMM/Gaussian mixture models, and HMM/artificial neural networks were widely used in LVCSR systems, and speech recognition technology made new breakthroughs.
From the 1990s onward, as speech recognition systems entered practical use, great progress was made in refined model design, parameter extraction and optimization, and system adaptation. At the same time, researchers paid more attention to topics such as speaker adaptation, auditory models, fast search algorithms, and further language model research. In addition, speech recognition began to be combined with related technologies from other fields to improve recognition accuracy and to facilitate its commercialization.
2 Speech recognition basics
2.1 Speech recognition concept
Speech recognition is the process of converting human speech signals into words or instructions. It takes speech as its object of study, and it is both an important research direction in speech signal processing and a branch of pattern recognition. Research on speech recognition draws on many disciplines, including computer technology, artificial intelligence, digital signal processing, pattern recognition, acoustics, linguistics, and cognitive science. It is a multidisciplinary, comprehensive research field.
Different research areas have emerged from research tasks under different constraints. By the required speaking style, systems are divided into isolated-word, connected-word, and continuous speech recognition systems. By the degree of dependence on the speaker, they are divided into speaker-dependent and speaker-independent systems. By vocabulary size, they are divided into small-vocabulary, medium-vocabulary, large-vocabulary, and unlimited-vocabulary systems.
2.2 Basic principles of speech recognition
From the perspective of speech recognition models, mainstream speech recognition theory is based on statistical pattern recognition. The goal of speech recognition is to convert the input sequence of speech feature vectors X = x1, x2, ..., xT into the word sequence W = w1, w2, ..., wN, using phonetic and linguistic information. The speech recognition model based on the maximum a posteriori probability is as follows:

$$
W^{*} = \arg\max_{W} P(W|X) = \arg\max_{W} \frac{P(X|W)\,P(W)}{P(X)} = \arg\max_{W} P(X|W)\,P(W) = \arg\max_{W} \left[ \log P(X|W) + \lambda \log P(W) \right]
$$

The above equation shows that the word sequence to be searched for should maximize the product of P(X|W) and P(W). The denominator P(X) can be dropped because it does not depend on W. P(X|W) is the conditional probability of the feature vector sequence X given W, and it is determined by the acoustic model. P(W) is the prior probability of W, which is independent of the speech feature vectors and is determined by the language model. Since taking the logarithm does not affect the selection of W, the fourth equality holds. log P(X|W) and log P(W) are the acoustic score and the language score, calculated by the acoustic model and the language model, respectively. λ is the weight that balances the acoustic model against the language model; the equality is exact for λ = 1, and in practice λ is tuned.
From the perspective of system structure, a complete speech recognition system includes feature extraction, an acoustic model, a language model, and a search algorithm. A speech recognition system is essentially a multi-dimensional pattern recognition system. Different systems use different specific recognition methods and techniques, but the basic principle is the same: the collected speech signal is sent to the feature extraction module for processing, the resulting speech feature parameters are sent to the model library module, and the pattern matching module identifies the speech against the model library to obtain the recognition result.
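The decision rule described in this section can be illustrated with a minimal Python sketch. It assumes that a small list of candidate word sequences and their log scores is already available; the candidate strings, the probabilities, and the function name `best_hypothesis` are hypothetical and serve only to show how the acoustic score and the weighted language score are combined.

```python
import math


def best_hypothesis(candidates, lm_weight=1.0):
    """Return the word sequence W maximizing log P(X|W) + lm_weight * log P(W).

    candidates: list of (words, acoustic_log_prob, language_log_prob) tuples.
    lm_weight:  the balancing weight between acoustic and language scores.
    """
    def combined_score(item):
        _, acoustic_logp, language_logp = item
        return acoustic_logp + lm_weight * language_logp

    words, _, _ = max(candidates, key=combined_score)
    return words


# Hypothetical scores for a single utterance (illustrative numbers only).
candidates = [
    ("recognize speech", math.log(1e-5), math.log(1e-3)),
    ("wreck a nice beach", math.log(2e-5), math.log(1e-6)),
]

with_lm = best_hypothesis(candidates, lm_weight=1.0)
acoustic_only = best_hypothesis(candidates, lm_weight=0.0)
```

With these made-up numbers, the second candidate has the slightly better acoustic score, but the language score makes the first candidate win once the two are combined. A real system searches a far larger hypothesis space rather than a fixed list.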
The basic block diagram of the speech recognition system is shown in Figure 1. The preprocessing module filters out secondary information and background noise from the original speech signal and digitizes it; its steps include anti-aliasing filtering, pre-emphasis, analog-to-digital conversion, and automatic gain control. The feature extraction module analyzes the acoustic parameters of the speech and extracts speech feature parameters to form a feature vector sequence. The feature parameters commonly used in speech recognition systems include short-term average amplitude, short-term average energy, linear predictive coding coefficients, and the short-term spectrum. Feature extraction and selection are key to building a system and are extremely important for recognition.
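Two of the steps named above, pre-emphasis and the short-term average energy and amplitude, can be sketched as follows. This is a minimal illustration, not the implementation of any particular system: it assumes the signal is already digitized into a list of floats, and the first-order filter form and the coefficient 0.97 are common choices assumed here, not values taken from the text.

```python
def pre_emphasis(samples, alpha=0.97):
    """First-order high-pass filter y[n] = x[n] - alpha * x[n-1].

    alpha = 0.97 is an assumed, commonly used value. The first sample
    is passed through unchanged because it has no predecessor.
    """
    if not samples:
        return []
    return [samples[0]] + [
        samples[n] - alpha * samples[n - 1] for n in range(1, len(samples))
    ]


def short_term_energy(frame):
    """Average of squared sample values within one non-empty frame."""
    return sum(s * s for s in frame) / len(frame)


def short_term_amplitude(frame):
    """Average of absolute sample values within one non-empty frame."""
    return sum(abs(s) for s in frame) / len(frame)


emphasized = pre_emphasis([1.0, 1.0, 1.0], alpha=0.5)
energy = short_term_energy([1.0, -1.0, 2.0, -2.0])
amplitude = short_term_amplitude([1.0, -1.0, 2.0, -2.0])
```

Pre-emphasis attenuates the constant (low-frequency) part of the input, which is why a constant signal shrinks after the first sample.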
Since speech signals are essentially non-stationary, current speech analysis relies on the assumption that speech is stationary over short intervals. Under this short-term stationarity assumption, features are extracted from short speech segments obtained by windowing the signal. These short segments are called frames, and the sequence of per-frame features constitutes the input of the speech recognition system. Because Mel-frequency cepstral coefficients (MFCC) and perceptual linear prediction (PLP) coefficients describe the speech signal well from the perspective of human auditory characteristics, they have become the mainstream speech features. To compensate for the assumption that frames are independent of each other, systems that use MFCC or PLP features usually append their first-order and second-order differences to capture the dynamic characteristics of the features.
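The framing, windowing, and difference steps can be sketched as below. This is an illustrative sketch under stated assumptions: the Hamming window and the simple two-point difference formula are choices made here (the text does not specify either), and the per-frame features are assumed to be plain lists of floats produced by some earlier MFCC or PLP computation that is not shown.

```python
import math


def frame_signal(samples, frame_len, hop):
    """Split samples into overlapping frames; a trailing partial frame is dropped."""
    return [
        samples[start:start + frame_len]
        for start in range(0, len(samples) - frame_len + 1, hop)
    ]


def hamming(frame_len):
    """Hamming window coefficients (requires frame_len > 1)."""
    return [
        0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
        for n in range(frame_len)
    ]


def apply_window(frame, window):
    """Multiply a frame by the window, sample by sample."""
    return [s * w for s, w in zip(frame, window)]


def delta(features):
    """Difference over time: d[t] = (c[t+1] - c[t-1]) / 2, with edges clamped."""
    last = len(features) - 1
    result = []
    for t in range(len(features)):
        prev_vec = features[max(t - 1, 0)]
        next_vec = features[min(t + 1, last)]
        result.append([(b - a) / 2 for a, b in zip(prev_vec, next_vec)])
    return result


frames = frame_signal(list(range(10)), frame_len=4, hop=2)
windowed = [apply_window(f, hamming(4)) for f in frames]
static = [[0.0], [1.0], [4.0]]          # hypothetical per-frame features
first_order = delta(static)             # first-order differences
second_order = delta(first_order)       # second-order differences
```

Applying `delta` twice gives the second-order differences; the static, first-order, and second-order values for each frame would then be concatenated into one feature vector.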
The acoustic model is one of the most important parts of a speech recognition system. Acoustic modeling involves many aspects, such as modeling unit selection, model state clustering, and model parameter estimation. Current LVCSR systems generally adopt context-dependent models as the basic modeling unit in order to capture the coarticulation found in continuous speech. Once context is taken into account, the number of acoustic models increases sharply, so LVCSR systems usually use state clustering to reduce the number of acoustic parameters and simplify model training. During training, the system preprocesses a set of training utterances and obtains feature vector sequences through feature extraction. The feature modeling module then builds a reference pattern library from the training speech.
Search is the process of finding the optimal word sequence in a specified space according to a given optimization criterion. The essence of search is problem solving, and it is widely used in fields of artificial intelligence and pattern recognition such as speech recognition and machine translation. Using the acquired knowledge (acoustic knowledge, phonetic knowledge, dictionary knowledge, language model knowledge, etc.), it finds the optimal state sequence in a state space composed, from top to bottom, of words, acoustic models, and HMM states. The final word sequence is an optimal description of the input speech signal under the chosen criterion. In the recognition phase, the feature vectors of the input speech are compared with the patterns in the reference pattern library obtained by training, and the category of the pattern with the highest similarity is output as an intermediate candidate result. To improve recognition accuracy, the candidate results are further processed in the post-processing module, for example by rescoring the lattice with higher-level language models and by assessing the reliability of the recognition result with confidence measures. Finally, by adding these constraints, a more reliable recognition result is obtained.
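At the level of HMM states, the search for an optimal state sequence is commonly carried out with the Viterbi algorithm. The text does not name a specific algorithm, so the sketch below is offered as one standard instance, assuming a small HMM with known log probabilities. The two-state model and all numbers are hypothetical.

```python
import math


def safe_log(p):
    """Logarithm that maps probability 0 to negative infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(log_init, log_trans, log_emit):
    """Find the most likely state sequence.

    log_init[s]:     log probability of starting in state s
    log_trans[r][s]: log probability of moving from state r to state s
    log_emit[t][s]:  log probability of the frame at time t in state s
    Returns (best_path, best_log_score).
    """
    n_states = len(log_init)
    score = [log_init[s] + log_emit[0][s] for s in range(n_states)]
    back_pointers = []
    for t in range(1, len(log_emit)):
        new_score, pointers = [], []
        for s in range(n_states):
            # Best predecessor of state s at time t.
            best_prev = max(range(n_states), key=lambda r: score[r] + log_trans[r][s])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + log_trans[best_prev][s] + log_emit[t][s])
        score = new_score
        back_pointers.append(pointers)
    # Trace back from the best final state.
    last = max(range(n_states), key=lambda s: score[s])
    path = [last]
    for pointers in reversed(back_pointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[last]


# Hypothetical two-state left-to-right model and three frames.
log_init = [safe_log(p) for p in (1.0, 0.0)]
log_trans = [[safe_log(p) for p in row] for row in ((0.5, 0.5), (0.0, 1.0))]
log_emit = [[safe_log(p) for p in row] for row in ((0.9, 0.1), (0.2, 0.8), (0.1, 0.9))]

best_path, best_score = viterbi(log_init, log_trans, log_emit)
```

Working in the log domain turns products of probabilities into sums, which avoids numerical underflow on long utterances. Large-vocabulary systems additionally prune unlikely partial paths, which this sketch omits.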
2.3 Acoustic modeling method
Commonly used acoustic modeling methods fall into three types: dynamic time warping (DTW), which is based on pattern matching; the hidden Markov model (HMM); and the artificial neural network (ANN).
DTW is an early pattern matching method. It is based on the idea of dynamic programming and solves the problem of matching templates whose feature parameter sequences differ in length, as arises in isolated-word recognition. In practical applications, DTW computes the similarity between the preprocessed, framed speech signal and each reference template according to a chosen distance measure, and selects the best alignment path.
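The dynamic programming recurrence behind DTW can be sketched as follows. This is a minimal version assuming one scalar feature per frame and an absolute-difference distance; real systems compare feature vectors, typically with a Euclidean or similar distance. The template labels and values are hypothetical.

```python
def dtw_distance(seq_a, seq_b, dist=lambda a, b: abs(a - b)):
    """Accumulated cost of the best alignment between two sequences.

    cost[i][j] holds the minimal accumulated distance for aligning the
    first i elements of seq_a with the first j elements of seq_b.
    """
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = dist(seq_a[i - 1], seq_b[j - 1])
            # Extend the cheapest of the three allowed predecessor cells.
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def recognize(features, templates):
    """Return the label of the reference template with the smallest DTW distance."""
    return min(templates, key=lambda label: dtw_distance(features, templates[label]))


# Hypothetical one-dimensional reference templates.
templates = {"up": [1, 2, 3], "down": [3, 2, 1]}
label = recognize([1, 2, 2, 3], templates)
```

The input `[1, 2, 2, 3]` is a time-stretched version of the "up" template, so its DTW distance to that template is zero even though the two sequences differ in length.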
The HMM is a statistical model built on the time series structure of speech signals. It was developed from the Markov chain and is a statistical recognition method based on a parametric model. An HMM can imitate the human speech process and can be regarded as a doubly stochastic process: one process is a Markov chain with a finite number of states, which models the hidden changes in the statistical characteristics of the speech signal, and the other is the stochastic process of observation sequences associated with each state of the Markov chain.
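The doubly stochastic structure can be made concrete by computing the probability that a given HMM produces an observation sequence, using the standard forward algorithm. The sketch below assumes discrete observation symbols for simplicity (speech systems usually use continuous densities), and the two-state model is hypothetical.

```python
def forward_likelihood(init, trans, emit, observations):
    """Probability of an observation sequence under a discrete HMM.

    init[s]:     probability of starting in hidden state s
    trans[r][s]: probability of moving from state r to state s (the Markov chain)
    emit[s][o]:  probability that state s produces observation symbol o
    """
    n_states = len(init)
    # alpha[s] = P(observations so far, current state = s)
    alpha = [init[s] * emit[s][observations[0]] for s in range(n_states)]
    for obs in observations[1:]:
        alpha = [
            sum(alpha[r] * trans[r][s] for r in range(n_states)) * emit[s][obs]
            for s in range(n_states)
        ]
    return sum(alpha)


# Hypothetical two-state model with two observation symbols (0 and 1).
init = [1.0, 0.0]
trans = [[0.5, 0.5], [0.0, 1.0]]
emit = [[0.9, 0.1], [0.2, 0.8]]

likelihood = forward_likelihood(init, trans, emit, [0, 1])
```

The hidden states never appear in the result: they are summed out, leaving only the probability of what was observed. For long sequences the values become very small, so practical implementations scale `alpha` or work with logarithms.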
The ANN uses mathematical models to simulate neuron activity. It brings to speech recognition the massively parallel, distributed arrangement of neurons in artificial neural networks, their efficient learning algorithms, and their ability to imitate the human cognitive system. Recognition algorithms that combine neural networks with hidden Markov models overcome the shortcomings of the ANN in describing the temporal dynamics of speech signals, and further improve the robustness and accuracy of speech recognition. The successful approach is to use the ANN in place of the Gaussian mixture model, estimating the posterior probability of the phoneme or state in the hybrid model. In 2011, Microsoft replaced the multi-layer perceptron with a deep neural network to improve the accuracy of speech recognition.
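In a hybrid model of this kind, the network outputs state posteriors P(s|x), while HMM decoding expects likelihood-like scores. A commonly used conversion, assumed here and not spelled out in the text, divides each posterior by the state prior P(s), which by Bayes' rule gives a value proportional to P(x|s). The sketch below illustrates this with hypothetical network outputs and priors.

```python
import math


def softmax(logits):
    """Convert raw network outputs into posterior probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]   # subtract peak for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def scaled_log_likelihoods(logits, state_priors):
    """log P(s|x) - log P(s) for each state s, proportional to log P(x|s)."""
    posteriors = softmax(logits)
    return [math.log(p) - math.log(q) for p, q in zip(posteriors, state_priors)]


# Hypothetical outputs for one frame over two HMM states.
logits = [0.0, 0.0]                 # network is undecided between the states
priors = [0.8, 0.2]                 # state 0 is much more frequent in training data
scores = scaled_log_likelihoods(logits, priors)
```

In this example the network assigns equal posteriors, so after dividing by the priors the rarer state receives the higher score. The resulting values can replace the Gaussian mixture likelihoods as per-frame emission scores during decoding.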
3 Speech recognition applications
Speech recognition technology has a very wide range of applications and strong market prospects. In voice input and control systems, it frees people from the keyboard: the system responds correctly by recognizing requests, commands, or queries in the speech, which overcomes the slow speed and error-proneness of manual input. It also shortens the system's reaction time and makes communication between human and computer easier, as in voice dialing systems, voice-activated intelligent toys, and smart home appliances. In intelligent dialogue and inquiry systems, people can conveniently query and retrieve information from remote database systems through voice commands and enjoy natural, friendly retrieval services, such as information network inquiry, medical services, and banking services. Speech recognition can also be applied to automatic spoken language translation: by combining spoken language recognition, machine translation, speech synthesis, and related technologies, speech input in one language can be translated into speech output in another language to enable cross-language communication.
Speech recognition technology also has extremely important application value and very broad scope in the military field. Some speech recognition technologies were developed with a focus on military activities and were first applied in the military field, where initial results have been achieved. Military applications place higher requirements on recognition accuracy, response time, and robustness in harsh environments. At present, speech recognition technology has been applied in the automation of military command and control. For example, applying speech recognition to aircraft flight control can improve operational efficiency and reduce the workload of pilots. Pilots use voice input instead of traditional manual operation to control various switches and devices and to rearrange the information shown on displays. This allows the pilot to focus time and energy on assessing the target and performing other operations, so as to obtain information faster and exploit tactical advantages.
4 Conclusion
Research on speech recognition has far-reaching significance for the development of the information society and the improvement of people's living standards. With the continuing development of computer and information technology, speech recognition technology will make further major breakthroughs, and research on speech recognition systems will deepen and find broader room for development.