As 5G mobile and internet technologies develop, the requirements for human-machine interaction keep growing. For personalized, convenient, fast, and user-friendly operation, neither the keyboard nor the touchscreen can match voice control. Voice is the most natural and convenient way for people to communicate and obtain information. Control of home devices has evolved from ordinary push-button remote controls to Bluetooth voice remote controls, and then to intelligent voice control that supports sound pick-up. Voice technology will free users' eyes and hands and become the best human-machine interaction mode for a variety of service scenarios.
Through smart speakers, users can check real-time traffic conditions: before going out, they can ask by voice about traffic and road information for the route they plan to take. Through smart home control panels, users can also check real-time weather conditions by voice before going out, including the day's temperature, wind, and probability of rain. They can also quickly and easily make video calls to family members through far-field AI set-top boxes (STBs). The user just says "video call" followed by the name of the person to call, and the STB turns on the TV, launches the video call client, selects the number in the phone book, and places the video call.
HD Video and Audio Entertainment
If users want to watch Jackie Chan's movies, they only need to say "Jackie Chan's movies" to the far-field AI STB, which quickly returns a list of movies starring Jackie Chan for them to choose from. To switch TV channels, they just say the name of the channel. Watching a missed program is now just as easy: for example, if they missed CCTV News on July 20, they just say, "I want to watch CCTV News on July 20".
If users need to get up early to catch a train, they just say "set the alarm for 7 o'clock in the morning", and the smart voice alarm clock at the bedside will go off at 7 o'clock the next morning. Users can also set multiple reminders, such as the due dates for water, electricity, and gas bills and for credit card repayments.
Smart Home Control
Users can use the voice control panel to control the TV and smart home devices such as lights and curtains in real time by voice, or use voice to set the time or conditions under which these devices are turned on.
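As an illustration of time-based and condition-based control, the following minimal sketch models each voice-created rule as a small record. The `Rule` class, the `lux` sensor name, and the sample rules are all hypothetical and are not taken from any product described here.

```python
from dataclasses import dataclass
from datetime import time
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Rule:
    """Hypothetical automation rule created from a voice command."""
    device: str
    action: str
    at: Optional[time] = None  # time trigger, if any
    condition: Optional[Callable[[Dict[str, float]], bool]] = None  # sensor trigger, if any

    def should_fire(self, now: time, sensors: Dict[str, float]) -> bool:
        # A rule fires when every trigger it defines is satisfied.
        time_ok = self.at is None or (now.hour, now.minute) == (self.at.hour, self.at.minute)
        cond_ok = self.condition is None or self.condition(sensors)
        return time_ok and cond_ok


def due_actions(rules: List[Rule], now: time, sensors: Dict[str, float]) -> List[Tuple[str, str]]:
    """Return the (device, action) pairs whose rules fire right now."""
    return [(r.device, r.action) for r in rules if r.should_fire(now, sensors)]


# "Open the curtains at 7 o'clock" and "turn on the lights when it gets dark".
rules = [
    Rule("curtains", "open", at=time(7, 0)),
    Rule("lights", "on", condition=lambda s: s.get("lux", 0.0) < 50.0),
]
```

Calling `due_actions(rules, time(7, 0), {"lux": 300.0})` selects only the curtain rule, because the room is still bright enough to keep the lights off.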
To implement these applications in 5G smart home scenarios, the AI voice recognition technology must support key technologies such as far-field sound pick-up, instant-on, multi-turn dialogue interaction, and voiceprint recognition.
Far-Field Sound Pick-Up
Far-field sound pick-up uses a microphone array: several microphones operating in tandem to sample and process the spatial features of the sound field. A microphone array is used instead of a single microphone to ensure that voice commands are properly received even when the user is far from the smart voice terminal.
Once it starts working, the microphone array is always in the sound pick-up state. It continuously samples and quantizes sound signals and performs basic signal processing. The collected voice signals are then processed by more complex voice signal algorithms to obtain clean voice signals, which are transmitted to the remote voice cloud platform for the actual voice interaction.
Microphone arrays can be linear, circular, or spherical; circular and linear arrays are the most commonly used. At present, the mainstream solution is a 6-Mic array, and there are also 2-Mic and 4-Mic products. The microphone array works together with front-end sound processing technologies such as beamforming, noise suppression, echo cancellation, dereverberation, automatic gain control, and sound source localization.
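The sampling and quantization step can be illustrated with a short sketch. The 16 kHz sampling rate and 16-bit depth below are assumptions chosen for illustration; the article does not state which values the terminals use.

```python
import math

SAMPLE_RATE_HZ = 16000  # assumed sampling rate, not specified in the article
FULL_SCALE = 32767      # largest value of a signed 16-bit sample


def sample_tone(freq_hz, n_samples, rate_hz=SAMPLE_RATE_HZ):
    """Sample a unit-amplitude sine tone at discrete time steps."""
    return [math.sin(2 * math.pi * freq_hz * n / rate_hz) for n in range(n_samples)]


def quantize_pcm16(samples):
    """Clip samples to [-1.0, 1.0] and map them to 16-bit integers."""
    pcm = []
    for x in samples:
        x = max(-1.0, min(1.0, x))  # clip so overloads saturate instead of wrapping
        pcm.append(int(round(x * FULL_SCALE)))
    return pcm


tone = sample_tone(1000, 16)   # one period of a 1 kHz tone at 16 kHz
pcm = quantize_pcm16(tone)
```

Clipping before scaling is a deliberate choice: a loud input saturates at full scale rather than overflowing the integer range.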
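Of the front-end technologies listed, beamforming is the easiest to sketch. The example below shows delay-and-sum beamforming, the simplest variant, under the assumption that the per-microphone delays are already known as whole numbers of samples. Real products typically estimate these delays through sound source localization and use more elaborate filters; the function name and test signal are illustrative only.

```python
def delay_and_sum(channels, delays):
    """Align each channel by its integer sample delay, then average.

    channels: list of equal-length lists of float samples, one per microphone
    delays:   arrival delay in samples of each microphone (>= 0)
    """
    out_len = len(channels[0]) - max(delays)
    out = []
    for t in range(out_len):
        # Microphone i hears the source d samples late, so read it at t + d.
        acc = sum(ch[t + d] for ch, d in zip(channels, delays))
        out.append(acc / len(channels))
    return out


# Two-microphone example: the second microphone hears the source 2 samples late.
source = [0.0, 1.0, 0.0, -1.0] * 4
mic0 = list(source)
mic1 = [0.0, 0.0] + source[:-2]

aligned = delay_and_sum([mic0, mic1], [0, 2])
unaligned = [(a + b) / 2 for a, b in zip(mic0, mic1)]
```

In this toy signal the two-sample delay equals half a period, so averaging without alignment cancels the tone, while the aligned average recovers it exactly.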
The wake-up module is a small voice recognition engine. Because wake-up word detection is a single-target recognition task, it needs only small acoustic and language models, which have a small footprint and can run locally on the device. Wake-up words usually contain three to five Chinese characters. It is recommended to choose words that are not commonly used in everyday speech, have open vowel sounds, and contain many syllables.
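One common way to turn a small local model into a wake-up decision is to smooth its per-frame scores and compare the average with a threshold. The sketch below assumes a hypothetical model that emits a wake-word probability for each audio frame; the window size and threshold are illustrative values, not figures from the article.

```python
from collections import deque


class WakeWordDetector:
    """Trigger when the moving average of frame scores crosses a threshold."""

    def __init__(self, window=5, threshold=0.8):
        self.scores = deque(maxlen=window)  # keeps only the most recent frames
        self.threshold = threshold

    def push(self, frame_score):
        """Feed one per-frame wake-word probability; return True on trigger."""
        self.scores.append(frame_score)
        full = len(self.scores) == self.scores.maxlen
        if full and sum(self.scores) / len(self.scores) >= self.threshold:
            self.scores.clear()  # avoid re-triggering on the same utterance
            return True
        return False


detector = WakeWordDetector(window=3, threshold=0.8)
frame_scores = [0.1, 0.95, 0.2, 0.9, 0.9, 0.9, 0.1]  # made-up model outputs
triggers = [detector.push(s) for s in frame_scores]
```

Averaging over a window means a single spurious high score (the 0.95 above) does not wake the device; only a sustained run of high scores does.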
Multi-Turn Dialogue Interaction
Continuous interaction means that after waking the intelligent voice assistant with the wake-up words, a user can interact with it multiple times without repeating the wake-up words. Once the voice interaction exceeds the specified time, the user needs to say the wake-up words again.
The user's input passes through the natural language understanding (NLU) module and enters the dialogue management system, which identifies the current dialogue state and determines the next dialogue action. The next dialogue action is decided by a general model and a domain model: the former handles general interaction logic, while the latter handles domain-specific interaction logic.
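The wake-once, talk-many-times behavior can be sketched as a small session object. This sketch assumes the specified time is an idle timeout measured from the last accepted utterance; the article does not say whether the limit applies to idle time or to the whole session. The wake-up words "hello box" and the 10-second value are hypothetical.

```python
class VoiceSession:
    """Accept follow-up commands without the wake-up words until a timeout."""

    def __init__(self, wake_word, timeout_s=10.0):
        self.wake_word = wake_word
        self.timeout_s = timeout_s
        self.last_active = None  # time of the last accepted utterance

    def handle(self, utterance, now):
        """Return the command to process, or None if nothing should be done."""
        awake = self.last_active is not None and now - self.last_active <= self.timeout_s
        text = utterance.strip().lower()
        if text.startswith(self.wake_word):
            self.last_active = now
            rest = text[len(self.wake_word):].strip(" ,")
            return rest or None  # wake-up words alone carry no command
        if awake:
            self.last_active = now  # each accepted turn extends the session
            return text
        return None  # asleep: ignore speech that lacks the wake-up words


session = VoiceSession("hello box", timeout_s=10.0)
```

Timestamps are passed in explicitly so that the behavior is easy to test; a device would supply readings from a monotonic clock instead.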
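The split between general and domain logic can be sketched as two policies consulted in turn. Everything in this example is an assumption made for illustration: the intent names, the required slots, the action strings, and the shape of the NLU result as a dictionary with `intent` and `slots` keys.

```python
# Hypothetical slots each task needs before it can be executed.
REQUIRED_SLOTS = {
    "set_alarm": ["time"],
    "play_program": ["program", "date"],
}


class DialogueManager:
    """Track dialogue state and choose the next dialogue action."""

    def __init__(self):
        self.intent = None  # task currently in progress
        self.slots = {}     # slot values collected so far

    def general_policy(self, nlu):
        """General interaction logic that applies in every domain."""
        if nlu["intent"] == "cancel":
            self.intent, self.slots = None, {}
            return "ack_cancel"
        if nlu["intent"] == "unknown" and self.intent is None:
            return "ask_rephrase"
        return None  # defer to the domain policy

    def domain_policy(self, nlu):
        """Domain-specific logic: fill slots, then execute the task."""
        if nlu["intent"] in REQUIRED_SLOTS and nlu["intent"] != self.intent:
            self.intent, self.slots = nlu["intent"], {}  # a new task starts
        if self.intent is None:
            return "ask_rephrase"
        self.slots.update(nlu.get("slots", {}))
        missing = [s for s in REQUIRED_SLOTS[self.intent] if s not in self.slots]
        if missing:
            return "request_" + missing[0]
        action = "execute_" + self.intent
        self.intent, self.slots = None, {}  # task done, reset the state
        return action

    def step(self, nlu):
        return self.general_policy(nlu) or self.domain_policy(nlu)


dm = DialogueManager()
first = dm.step({"intent": "play_program", "slots": {"program": "CCTV News"}})
second = dm.step({"intent": "inform", "slots": {"date": "July 20"}})
```

Here the first turn yields `request_date` because the date is missing, and the second turn completes the task without the user restating the program name, which is the essence of multi-turn interaction.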
Security is a major concern for home voice control in the voice interaction era. Voiceprint recognition falls into two categories: speaker identification and speaker verification. In a home scenario, voiceprint recognition is a speaker identification task. The speaker's voiceprint is modeled first, and the voiceprint features are then matched during voice interaction to provide a personalized service experience based on the speaker's role (Fig. 2).
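The model-then-match flow can be sketched with fixed-length voiceprint vectors compared by cosine similarity. This is a simplified illustration, not the method used by any particular product: the three-dimensional embeddings, the role names, and the 0.75 threshold are made up, and a real system would obtain embeddings from a trained speaker model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def enroll(samples):
    """Model a speaker by averaging several enrollment embeddings."""
    n = len(samples)
    return [sum(column) / n for column in zip(*samples)]


def identify(embedding, profiles, threshold=0.75):
    """Return the best-matching enrolled role, or None for an unknown voice."""
    best_name, best_score = None, -1.0
    for name, voiceprint in profiles.items():
        score = cosine(embedding, voiceprint)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None


# Enrollment: made-up embeddings for two family roles.
profiles = {
    "parent": enroll([[0.9, 0.1, 0.0], [1.0, 0.0, 0.1]]),
    "child": enroll([[0.0, 0.2, 1.0], [0.1, 0.0, 0.9]]),
}
speaker = identify([0.05, 0.1, 1.0], profiles)
```

Returning `None` below the threshold lets the device fall back to a default experience for voices that were never enrolled, instead of forcing a match to the nearest family member.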