Artificial Intelligence Science | What is the principle of voice wake-up technology?

Name: Zhang Lu

Student number: 19021210845

Introduction: Many students have AI smart speakers at home, such as Tmall Genie, Xiao Ai, and Xiaodu. These smart speakers not only make daily life more convenient, but also bring users a lot of joy with their witty or amusing answers.

Keywords: AI smart speaker, voice wake-up

Question: What is the principle of voice wake-up technology?

Main text

"Tmall Genie." "Hey, I'm here, go ahead."

"Xiao Ai, set an alarm for 8 am tomorrow." "Okay, I've set your alarm for 8 am tomorrow."

Many students have AI smart speakers like these at home. They not only make our daily life more convenient, but also bring users a lot of joy with their witty or amusing answers.

An important AI capability behind these smart products is called voice wake-up.

First, the device powers on and automatically loads its resources, after which it enters a sleep state. Then, when the user utters a specific wake word, the device wakes up and switches to a working state, waiting for the user's next command.

This process requires no touch at all: the user operates the device directly by voice. And because of the wake-up mechanism, the device does not have to stay in a working state around the clock, which saves energy.
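The sleep → wake → command flow described above can be sketched as a tiny state machine. Everything here is an illustrative assumption, not a real system: the class name, the reply strings, and especially the substring check standing in for a real wake-word detector.

```python
# Minimal sketch of the sleep/wake cycle; the wake word and replies are
# illustrative assumptions, and substring matching stands in for a detector.
SLEEPING, AWAKE = "sleeping", "awake"

class WakeWordDevice:
    def __init__(self, wake_word):
        self.wake_word = wake_word.lower()
        self.state = SLEEPING            # after boot, resources loaded, device sleeps

    def hear(self, utterance):
        """Feed one utterance; return the device's reply (None while asleep)."""
        if self.state == SLEEPING:
            # Asleep: only the tiny wake-word detector runs, saving power.
            if self.wake_word in utterance.lower():
                self.state = AWAKE
                return "I'm here, go ahead."
            return None                  # everything else is ignored
        # Awake: hand the utterance to the full recognizer, then sleep again.
        reply = f"Executing: {utterance}"
        self.state = SLEEPING
        return reply
```

A usage pass would be: `hear("play music")` while asleep is ignored, `hear("Tmall Genie")` wakes the device, and the next utterance is executed as a command.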

Voice wake-up has a wide range of applications: robots, cell phones, wearables, smart homes, cars, and so on. Almost any device with voice capability needs voice wake-up as the starting point, or entrance, of human-machine interaction. Different products use different wake words, and to wake a device the user must say its specific wake word.

Definition

Voice wake-up is known academically as Keyword Spotting (KWS), defined as detecting, in real time, a specific speech segment from the speaker within a continuous audio stream.

It is important to note that the "real-time" nature of detection is a key point. The purpose of waking up is to bring the device from a dormant state to a running state, so the user experience is better when the wake word is detected immediately after it is uttered.

So how do we evaluate the effectiveness of voice wake-up? Four metrics are common: wake-up rate, false wake-ups, response time, and power consumption:

- Wake-up rate: the success rate when the user tries to wake the device; the technical term is recall.

- False wake-ups: the probability that the device wakes without any user interaction, usually stated per day, e.g. at most once a day.

- Response time: the interval between the user finishing the wake word and the device giving feedback.

- Power consumption: how much power the wake-up system draws. Many smart devices run on batteries and must offer long battery life, so they care greatly about power consumption.
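Given a hypothetical test log, the first three metrics above can be computed directly. The function name and the log format (a list of per-trial successes, an idle-listening duration with its false-alarm count, and per-trial latencies) are assumptions made for illustration:

```python
# Sketch: computing wake-up metrics from a hypothetical evaluation log.
def wake_metrics(trials, idle_hours, false_alarms, latencies_ms):
    """trials: list of bools, one per wake attempt (True = device woke)."""
    recall = sum(trials) / len(trials)              # wake-up rate (recall)
    false_per_day = false_alarms / idle_hours * 24  # false wake-ups per 24 h
    avg_latency = sum(latencies_ms) / len(latencies_ms)
    return recall, false_per_day, avg_latency
```

For example, 98 successes out of 100 attempts gives a 98% wake-up rate, and 2 false alarms over 48 idle hours corresponds to 1 false wake-up per day.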

Technical routes of voice wake-up

After a long period of development, the technical routes of voice wake-up can be roughly grouped into three generations, with the following characteristics:

First generation: template-matching-based KWS

Training and testing are both relatively simple. In training, features are extracted from enrolled (template) speech to build a template. In testing, features are extracted from the incoming speech to form a feature sequence, the distance between this sequence and the template sequence is computed, and the device decides whether to wake based on that distance.
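A minimal sketch of this first-generation approach, assuming one-dimensional feature sequences (real systems compare sequences of MFCC vectors) and a plain dynamic-time-warping (DTW) distance; the threshold value is arbitrary:

```python
# First-generation template matching, sketched with DTW over 1-D features.
def dtw_distance(template, test):
    """Minimum-cost alignment distance between two feature sequences."""
    n, m = len(template), len(test)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(template[i - 1] - test[j - 1])
            # Best of: skip a test frame, skip a template frame, or match both.
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

def is_wake_word(template, test, threshold=5.0):
    # Wake only if the aligned distance to the enrolled template is small.
    return dtw_distance(template, test) <= threshold
```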

Second generation: HMM-GMM based KWS

This converts the wake-up task into a two-class recognition task, whose outputs are "keyword" and "non-keyword".
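The two-class idea can be illustrated with a toy likelihood-ratio test. Single Gaussians stand in here for the full HMM-GMM keyword and filler (non-keyword) models, and all parameter values are made up for illustration:

```python
import math

# Toy second-generation detector: score frames under a "keyword" model and a
# "filler" model, wake if the keyword model explains the audio better.
# Single Gaussians stand in for the real HMM-GMM state models.

def gauss_loglik(x, mean, var):
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def detect(frames, kw_model=(1.0, 0.5), filler_model=(0.0, 2.0), margin=0.0):
    kw = sum(gauss_loglik(f, *kw_model) for f in frames)
    filler = sum(gauss_loglik(f, *filler_model) for f in frames)
    return kw - filler > margin  # log-likelihood ratio test
```

Frames clustered near the keyword model's mean trigger a wake; scattered frames are better explained by the broad filler model and are rejected.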

Third generation: neural network based scheme

Neural network schemes can be subdivided into three categories. The first is HMM-based KWS, which keeps the structure of the second-generation scheme but replaces the GMM acoustic model with a neural network. The second combines template matching with neural networks, using the network as a feature extractor. The third is an end-to-end scheme: the input is speech, the output is the probability of each wake word, and a single model solves the whole task.
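In miniature, the end-to-end idea reduces to a model that maps acoustic features straight to a wake probability. The one-layer "network" below uses hand-picked weights purely for illustration; a real system would learn them from data and use a much deeper model:

```python
import math

# Toy end-to-end detector: features in, wake-word probability out.
# Weights and bias are illustrative stand-ins for trained parameters.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def wake_probability(features, weights, bias):
    # One linear layer + sigmoid: P(wake | features)
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

def should_wake(features, weights, bias, threshold=0.5):
    # Compare the posterior against an operating threshold; raising the
    # threshold trades wake-up rate for fewer false wake-ups.
    return wake_probability(features, weights, bias) >= threshold
```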

Difficulties of voice wake-up

The main difficulty of voice wake-up is the tension between the demand for low power and the demand for high accuracy.

On the one hand, many smart devices currently use low-end chips and run on batteries, which requires that waking consume as little energy as possible.

On the other hand, users expect an ever-better experience. Voice wake-up is currently used mainly in consumer (C-end) products, with a broad user base and many far-field interactions, which places high demands on wake-up performance.

To resolve this tension: for the low-power demand, a deep model-compression strategy reduces model size while keeping the accuracy loss controllable; for the high-accuracy demand, closed-loop optimization of the model is generally used. A starter model with usable accuracy is shipped first, then iteratively updated in a closed loop as users interact with it; the whole process is automated, with no human involvement.
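One common piece of a model-compression strategy is weight quantization. The sketch below shows the idea (mapping float weights to 8-bit integers, roughly a 4x size reduction over float32, with a small controllable accuracy loss) in pure Python; it is not tied to any particular toolkit:

```python
# Sketch of symmetric 8-bit weight quantization for model compression.
def quantize(weights, bits=8):
    qmax = 2 ** (bits - 1) - 1                       # 127 for int8
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [round(w / scale) for w in weights]          # small ints, ~4x smaller
    return q, scale

def dequantize(q, scale):
    # Recover approximate float weights at inference time.
    return [x * scale for x in q]
```

The round-trip error is bounded by half a quantization step, which is the "controllable" accuracy cost mentioned above.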

Typical applications of voice wake-up

The application areas of voice wake-up are very wide, mainly consumer products such as robots, speakers, and cars. The more representative interaction modes are as follows:

- Traditional voice interaction: wake the device first, wait for its feedback (a beep or a light) confirming it is awake, then issue the voice command. The drawback is the long interaction time.

- One-shot: the wake word and the command are spoken together, e.g. "Dingdong dingdong, I want to listen to Jay Chou's song"; the client starts the recognition and semantic-understanding services directly after waking, which shortens the interaction.

- Zero-shot: commonly used commands, or high-frequency prefixes of them, are themselves set as wake words, e.g. saying "Navigate to KUNAI" directly to the car, so the user does not perceive a wake-up step at all.

- Multi-wake-up: giving the device several names, mainly to meet users' personalized needs.

- What you see is what you say: a new AIUI interaction mode. For example, after the user says "Navigate to Haijilao" to the car, the screen displays options such as "Zhixincheng Haijilao" and "Yintai City Haijilao", and the user simply says the option shown on screen to select it.
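The one-shot mode in the list above can be sketched as follows. Parsing by string prefix is a stand-in for real streaming recognition, and the wake word follows the article's "Dingdong dingdong" example:

```python
# Sketch of one-shot interaction: wake word and command in one utterance,
# so the command text goes straight to recognition after detection.
def one_shot(utterance, wake_word="dingdong dingdong"):
    text = utterance.lower().strip()
    if not text.startswith(wake_word):
        return None                       # no wake word: stay asleep
    command = text[len(wake_word):].lstrip(" ,")
    return command or None                # command follows the wake word directly
```

Compared with the traditional mode, there is no pause for device feedback between the wake word and the command, which is exactly what shortens the interaction time.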