What is the basic principle of the CTC method in speech recognition?

Early speech models required a label for every frame of the training data. Those frame-level labels were usually produced by a traditional HMM-GMM system, and the neural network was then trained on the labeled frames. The end-to-end approach removes this non-neural processing stage: the speech model is trained directly with CTC (Connectionist Temporal Classification) and an RNN, with no frame-level labels and no help from an HMM or GMM.

In the traditional pipeline, the text has to be strictly aligned with the speech before training can start. This requirement has its disadvantages. Although mature open-source alignment tools are available, the rise of deep learning prompted a natural question: can the network learn the alignment by itself? That is how CTC came about.

Why does CTC not need aligned speech and text? Because CTC lets the network predict a label at any time step, with only one requirement: the output sequence as a whole must be correct. The text and the speech therefore do not have to be strictly aligned. CTC also outputs the labels for the whole sequence directly, with no separate post-processing stage. The following figure shows an example of using CTC and text alignment for a piece of audio:
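To make the "no strict alignment" idea concrete, here is a minimal Python sketch of the collapsing rule from the standard CTC formulation, which the text above does not spell out: the network emits one label per frame, including a special blank label, and the final text is obtained by merging consecutive repeats and then dropping blanks. The blank symbol `_`, the function names `ctc_collapse` and `greedy_decode`, and the toy probabilities are all assumptions made for illustration, not part of any particular toolkit.

```python
from itertools import groupby

BLANK = "_"  # assumed symbol for the CTC blank label (illustrative choice)


def ctc_collapse(frame_labels, blank=BLANK):
    """Map a frame-level label path to text: merge repeats, then drop blanks."""
    merged = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in merged if label != blank)


def greedy_decode(frame_probs, alphabet, blank=BLANK):
    """Pick the most probable label in each frame, then collapse the path."""
    best_path = [
        alphabet[max(range(len(probs)), key=probs.__getitem__)]
        for probs in frame_probs
    ]
    return ctc_collapse(best_path, blank)


# Two different frame-level paths give the same text, so the training data
# only needs the text "cat", not the position of each letter in the audio.
print(ctc_collapse("cc_aa_t"))    # cat
print(ctc_collapse("_c_a_ttt_"))  # cat

# A blank between repeated letters keeps them from being merged.
print(ctc_collapse("hel_lo"))     # hello

# Toy per-frame probabilities over the hypothetical alphabet [_, a, c, t].
alphabet = ["_", "a", "c", "t"]
frame_probs = [
    [0.10, 0.10, 0.70, 0.10],  # c
    [0.10, 0.10, 0.70, 0.10],  # c
    [0.60, 0.20, 0.10, 0.10],  # blank
    [0.10, 0.80, 0.05, 0.05],  # a
    [0.10, 0.10, 0.10, 0.70],  # t
]
print(greedy_decode(frame_probs, alphabet))  # cat
```

During training, the CTC loss sums the probability of every frame-level path that collapses to the target text, which is why any placement of the labels in time is acceptable as long as the collapsed sequence is correct. The greedy decoder here is only the simplest possible choice; practical systems often use beam search instead.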