CN110070863A - A kind of sound control method and device - Google Patents

A kind of sound control method and device Download PDF

Info

Publication number
CN110070863A
CN110070863A (application CN201910181787.9A)
Authority
CN
China
Prior art keywords
voice
electronic device
electronic equipment
speech
control instruction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910181787.9A
Other languages
Chinese (zh)
Inventor
王永超
魏建宾
王丰欣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN201910181787.9A
Publication of CN110070863A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M2201/00 Electronic components, circuits, software, systems or apparatus used in telephone systems
    • H04M2201/40 Electronic components, circuits, software, systems or apparatus used in telephone systems using speech recognition
    • H04M2250/00 Details of telephonic subscriber devices
    • H04M2250/74 Details of telephonic subscriber devices with voice recognition means

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephone Function (AREA)

Abstract

The embodiments of the present application disclose a voice control method and device, relating to the technical field of speech recognition, which can reduce the false wake-up rate of a smart device. The method includes: an electronic device obtains a first voice of a user and determines the type of the obtained first voice, where the type of the first voice includes a first type, the first type being that the content ratio of unvoiced sound in the voice is greater than a first threshold. If it is determined that the first voice belongs to the first type, then in response to the first voice, a first action corresponding to the first voice is executed.

Description

Voice control method and device
Technical Field
The present application relates to the field of speech recognition technologies, and in particular, to a speech control method and apparatus.
Background
With the development of voice technology, many smart devices can interact with users through voice: the voice interaction system of the smart device recognizes the user's speech and carries out the user's instruction.
In conventional voice interaction, a user typically manually activates a voice function, such as pressing a wake-up key, to enable voice interaction. This operation is cumbersome and inconvenient.
To let a user start speaking more naturally, the voice wake-up function imitates how people call each other's name at the start of a conversation: the user first speaks a wake-up word to wake the smart device, and voice interaction can then proceed. In some scenarios, the user may not speak the wake-up word, and the smart device may directly perform the corresponding action according to the user's voice. In daily life, however, the smart device may be woken up by mistake when the user has no need for voice interaction; for example, the smart device may mistake other speech of the user for a wake-up word. Frequent false wake-ups result in a poor user experience.
Disclosure of Invention
The embodiments of the application provide a voice control method that can reduce the false wake-up rate of a smart device.
In a first aspect, an embodiment of the present application provides a voice control method, which may include: an electronic device obtains a first voice of a user; if the first voice is determined to be of a first type, the electronic device responds to the first voice by executing a first action corresponding to the first voice; the first type means that the content ratio of unvoiced sound in the voice is greater than a first threshold.
In this method, speech belonging to the first type may be speech that is rarely produced in everyday life. On the one hand, because such speech is rare in daily life, it is unlikely to be triggered by accident, which reduces the probability that the electronic device is woken up by mistake; on the other hand, control-instruction matching is performed only when the speech belongs to the first type, which reduces the number of matching attempts, the occupancy of system resources, and the energy consumption.
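The first-type test described above (the unvoiced-content ratio compared against a first threshold) can be sketched as follows. This is an illustrative heuristic, not the patent's actual detector: it treats frames with a high zero-crossing rate and low energy as unvoiced, since unvoiced speech is produced without vocal-cord vibration. The frame length and all cutoff values are assumed placeholders.

```python
def frame_signal(samples, frame_len=400):
    """Split a list of samples into non-overlapping frames."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def is_unvoiced(frame, zcr_cutoff=0.25, energy_cutoff=0.01):
    """Heuristic: unvoiced frames show a high zero-crossing rate and low energy."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    zcr = crossings / max(len(frame) - 1, 1)
    energy = sum(s * s for s in frame) / len(frame)
    return zcr > zcr_cutoff and energy < energy_cutoff

def is_first_type(samples, first_threshold=0.5, frame_len=400):
    """First type: the unvoiced-content ratio exceeds the first threshold."""
    frames = frame_signal(samples, frame_len)
    if not frames:
        return False
    unvoiced = sum(1 for f in frames if is_unvoiced(f))
    return unvoiced / len(frames) > first_threshold
```

A hiss-like signal (rapid sign changes, small amplitude) would be classified as first type, while a loud, slowly oscillating voiced-like signal would not.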
With reference to the first aspect, in one possible design manner, the executing, by the electronic device, the first action corresponding to the first voice includes: if the first voice is determined to contain the control instruction, the electronic equipment executes a first action corresponding to the control instruction. In this mode, a control instruction is preset in the electronic device; after receiving the voice of the user, the electronic device can directly execute the corresponding action according to the control instruction corresponding to the first type of voice.
With reference to the first aspect, in one possible design manner, the executing, by the electronic device, the first action corresponding to the first voice includes: the electronic equipment acquires the semantics of the first voice; the electronic device performs a first action corresponding to the semantics of the first speech. In this manner, the electronic device may recognize the semantic meaning of the first voice, or the electronic device may obtain the semantic meaning of the first voice through the server and execute the corresponding action according to the semantic meaning of the first voice.
With reference to the first aspect, in one possible design manner, the executing, by the electronic device, the first action corresponding to the first voice includes: and if the first voice is determined to contain the first keyword, the electronic equipment acquires a second voice of the user, and responds to the second voice to execute a second action corresponding to the second voice. In this manner, the electronic device may be awakened in response to the first type of voice.
With reference to the first aspect, in a possible design manner, if it is determined that the first voice is of a second type, the electronic device obtains a second voice of the user and, in response to the second voice, executes a second action corresponding to the second voice; the second type means that the content ratio of unvoiced sound in the voice is less than or equal to the first threshold. In this method, after receiving a first voice, the electronic device first judges its type: if the first voice is of the second type and includes the second keyword, the second action corresponding to the second voice may be performed; if the first voice is of the first type, the first action corresponding to the first voice may be performed. Actions corresponding to speech that is easily mis-triggered in daily life can therefore be bound to first-type voices, which reduces the false wake-up rate of the electronic device without impairing the convenience of voice control for the user.
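The two-branch behavior described in the first aspect can be summarized as a small dispatch routine. This is a control-flow sketch only; the type test, keyword detector, and action names are stand-ins supplied by the caller, not APIs from the patent.

```python
FIRST_TYPE = "first"    # unvoiced-content ratio > first threshold
SECOND_TYPE = "second"  # unvoiced-content ratio <= first threshold

def handle_first_voice(voice_type, contains_second_keyword,
                       get_second_voice, execute):
    """Dispatch on the type of the first voice.

    First type: execute the first action directly.
    Second type: proceed only when the wake word (second keyword) is
    present; then obtain a second voice and execute the second action.
    """
    if voice_type == FIRST_TYPE:
        return execute("first action")
    if voice_type == SECOND_TYPE and contains_second_keyword:
        second_voice = get_second_voice()
        return execute("second action for: " + second_voice)
    return None  # normal speech without a wake word is ignored
```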
With reference to the first aspect, in a possible design manner, determining that the first voice belongs to the first type specifically includes: the DSP of the electronic device determines that the first voice is of a first type.
With reference to the first aspect, in a possible design manner, determining that the first voice includes the control instruction specifically includes: and the DSP of the electronic equipment extracts the voice characteristics of the first voice, matches the first voice according to the extracted voice characteristics and the control instruction detection model, and determines that the first voice contains the control instruction.
With reference to the first aspect, in a possible design manner, determining that the first voice includes the control instruction specifically includes: the DSP of the electronic equipment transmits the first voice to the AP of the electronic equipment; and the AP extracts the voice characteristics of the first voice, matches the first voice according to the extracted voice characteristics and the control instruction detection model, and determines that the first voice contains the control instruction.
With reference to the first aspect, in a possible design manner, determining that the first voice includes the control instruction specifically includes: the DSP of the electronic equipment extracts voice features of the first voice and transmits the extracted voice features to the AP of the electronic equipment; and the AP performs matching according to the extracted voice characteristics and the control instruction detection model, and determines that the first voice contains the control instruction.
With reference to the first aspect, in a possible design manner, obtaining semantics of the first speech specifically includes: the DSP of the electronic equipment transmits the first voice to the AP of the electronic equipment; and the AP extracts the voice characteristic value of the first voice and obtains the semantic meaning of the first voice according to the voice characteristic value.
With reference to the first aspect, in a possible design manner, obtaining semantics of the first speech specifically includes: the DSP of the electronic equipment extracts a voice characteristic value of the first voice and transmits the voice characteristic value to the AP of the electronic equipment; and the AP acquires the semantics of the first voice according to the voice characteristic value.
With reference to the first aspect, in a possible design manner, determining that the first speech includes the first keyword specifically includes: and the DSP of the electronic equipment extracts the voice features of the first voice, matches the first voice according to the extracted voice features and the keyword detection model, and determines that the first voice contains the first keyword.
With reference to the first aspect, in a possible design manner, determining that the first speech includes the first keyword specifically includes: the DSP of the electronic equipment transmits the first voice to the AP of the electronic equipment; and the AP extracts the voice features of the first voice, matches the first voice according to the extracted voice features and the keyword detection model, and determines that the first voice contains the first keyword.
With reference to the first aspect, in a possible design manner, determining that the first speech includes the first keyword specifically includes: the DSP of the electronic equipment extracts voice features of the first voice and transmits the extracted voice features to the AP of the electronic equipment; and the AP performs matching according to the extracted voice characteristics and the keyword detection model, and determines that the first voice contains a first keyword.
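The design options above split the work between the DSP (extracting voice features) and the AP (matching the features against a detection model). The sketch below is a simplified stand-in for that pipeline: per-frame log-energy serves as the "voice feature", and nearest-template cosine similarity serves as the "detection model". A real system would more likely use MFCCs or a neural keyword-spotting network; the frame length, threshold, and template values here are illustrative.

```python
import math

def extract_features(samples, frame_len=200):
    """DSP side: reduce raw samples to a short per-frame log-energy vector."""
    features = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        features.append(math.log(energy + 1e-10))
    return features

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def match_instruction(features, templates, threshold=0.95):
    """AP side: return the best-matching control instruction, or None."""
    best_name, best_score = None, threshold
    for name, template in templates.items():
        score = cosine_similarity(features, template)
        if score > best_score:
            best_name, best_score = name, score
    return best_name
```

Whether the features are extracted on the DSP and matched on the AP, or both steps run on one processor, the interface between the two functions stays the same, which mirrors the alternative designs listed above.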
In a second aspect, an embodiment of the present application provides an electronic device that can implement the voice control method described in the first aspect, through software, through hardware, or through hardware executing corresponding software. In one possible design, the electronic device may include a processor and a memory. The processor is configured to enable the electronic device to perform the corresponding functions of the method of the first aspect. The memory is configured to couple with the processor and stores the program instructions and data necessary for the electronic device.
In a third aspect, embodiments of the present application provide a computer storage medium, which includes computer instructions, and when the computer instructions are run on an electronic device, the electronic device is caused to perform a voice control method according to any one of the above aspects and possible design manners.
In a fourth aspect, embodiments of the present application provide a computer program product, which when run on a computer, causes the computer to execute the voice control method according to any one of the above aspects and possible designs thereof.
For technical effects brought by the electronic device of the second aspect, the computer storage medium of the third aspect, and the computer program product of the fourth aspect, reference may be made to the technical effects brought by the first aspect and the different design manners thereof, and details are not described here.
Drawings
Fig. 1 is a schematic view of a scene example of a voice control method according to an embodiment of the present application;
fig. 2 is a first schematic diagram illustrating a hardware structure of an electronic device according to an embodiment of the present disclosure;
fig. 3 is a schematic composition diagram of a hardware structure of an electronic device according to an embodiment of the present disclosure;
fig. 4 is a schematic composition diagram three of a hardware structure of an electronic device according to an embodiment of the present disclosure;
fig. 5 is a flowchart of a voice control method according to an embodiment of the present application;
fig. 6 is a schematic view of an example display interface of an electronic device according to an embodiment of the present disclosure;
fig. 7 is a flowchart of a voice control method according to an embodiment of the present application;
fig. 8 is a schematic structural component diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The terminology used in the embodiments of the present application is for the purpose of describing particular embodiments only and is not intended to limit the application. As used in the specification of the present application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. The electronic device in the following embodiments may be a smart home device (e.g., a smart speaker, smart refrigerator, or smart television), a portable device (e.g., a mobile phone), a notebook computer, a personal computer (PC), a wearable electronic device (e.g., a smart watch, smart glasses, or a smart helmet), a tablet computer, an augmented reality (AR)/virtual reality (VR) device, an in-vehicle computer, and so on; the following embodiments do not particularly limit the specific form of the electronic device. It should be noted that "electronic device" and "smart device" have the same meaning in the embodiments of the present application and may be used interchangeably herein.
The embodiment of the application provides a voice control method, and intelligent equipment can respond to one type of voice of a user and execute corresponding actions; the voice is not easy to be triggered by mistake, the false wake-up rate of the intelligent device can be reduced, and the occupancy rate of system resources is reduced.
Illustratively, as shown in fig. 1, if the smart device recognizes the speech "piupiu", it executes the action corresponding to "piupiu": turning on the air conditioner. By contrast, if the smart device recognizes the voice "turn on the air conditioner", no action is performed. If the smart device recognizes a whistle blown by the user, it executes the action corresponding to the whistle: turning on the light; if it recognizes the voice "turn on the light", no action is performed. Similarly, if the smart device recognizes the whispered voice "sleep", it executes the corresponding action: switching to sleep mode.
The voice "piupiu", a whistle, the whispered voice "sleep", and the like are special voices; voices such as "turn on the air conditioner" and "turn on the light" are normal voices. A normal voice is speech whose semantics can be recognized and whose production vibrates the vocal cords. Special speech is any speech distinguished from normal speech: for example, speech produced without vocal-cord vibration (i.e., unvoiced speech), or speech without semantics.
According to the voice control method provided by the embodiments of the application, the smart device can respond to a received special voice by executing the corresponding action, while performing no action when it receives normal voice. Special voices are recognized with a high recognition rate and a low false-trigger rate; the smart device can therefore execute the corresponding action quickly, improving response speed, while frequent false triggering is avoided and the false wake-up rate is reduced.
In other embodiments, the smart device may perform an action corresponding to the special speech in response to the received special speech, and may also perform an action corresponding to the normal speech in response to the normal speech.
For example, the smart device may be woken up in response to a wake-up word. After the smart device receives a voice, if it determines that the voice is normal voice and includes a wake-up word, it can execute the corresponding action in response. For example, as shown in fig. 1, the smart device recognizes the voice "Xiaoyi", determines that it is a normal voice and a wake-up word, and is then woken up. The smart device may then execute a corresponding action according to the user's next voice. For example, a user who wants to use the voice interaction system of a smart TV to switch to a sports channel first speaks a wake-up word, such as "hello TV", to wake the smart TV, then says "watch sports channel", and the smart TV switches the channel to the sports channel accordingly.
In some scenarios, the user may not speak the wake-up word, and the smart device may directly perform a corresponding action according to the normal speech of the user. For example, when the smart device recognizes the "previous" or "next" speech of the user while playing music, the smart device may perform corresponding actions. For example, if the intelligent device recognizes that the speech "navigate to library" includes the keyword "navigate", the corresponding action of "navigate to library" is executed.
With the voice control method provided by the embodiments of the application, actions corresponding to voices that are easily mis-triggered in daily life can be bound to special voices. For example, the smart device may be set to turn on the light in response to a whistle: when the smart device receives the normal voice "turn on the light", it does not turn on the light, but when it receives a whistle from the user, it does. Voices that are unlikely to be mis-triggered in daily life can be set as wake-up words or keywords; for example, "Xiaoyi" may be set as a wake-up word, so that the smart device is woken up when it receives the normal voice "Xiaoyi". In this way, the false wake-up rate of the electronic device is reduced without impairing the convenience of controlling the smart device by voice.
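The configuration just described (easily mis-triggered commands bound to special voices, with a normal-speech wake-up word kept for everything else) amounts to a simple lookup. The dictionary keys and action strings below mirror the examples in the text but are illustrative labels; the recognizer that produces them is out of scope here.

```python
# Illustrative mapping from recognized special voices to direct actions.
SPECIAL_VOICE_ACTIONS = {
    "piupiu": "turn on the air conditioner",
    "whistle": "turn on the light",
    "whisper:sleep": "switch to sleep mode",
}
WAKE_WORDS = {"Xiaoyi"}  # normal speech kept as a wake-up word

def react(recognized_voice, is_special):
    """Return the action for a recognized voice, or None to ignore it."""
    if is_special:
        return SPECIAL_VOICE_ACTIONS.get(recognized_voice)  # direct action
    if recognized_voice in WAKE_WORDS:
        return "wake up"  # then wait for the user's next voice
    return None  # normal speech such as "turn on the light" is ignored
```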
Embodiments of the present application will be described in detail below with reference to the accompanying drawings.
Referring to fig. 2, a schematic structural diagram of an electronic device 100 according to an embodiment is shown. The electronic device 100 may be the smart device described in this embodiment.
As shown in fig. 2, the electronic device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a Universal Serial Bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, a sensor module 180, a button 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a Subscriber Identity Module (SIM) card interface 195, and the like. The sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, and the like.
In the present application, the microphone 170C may be used to collect the user's voice and convert the sound signal into an electrical signal. The microphone 170C transmits the voice signal to the processor 110, which performs voice recognition, determines whether the voice input by the user matches a quick control instruction, and triggers the corresponding action if it does.
It is to be understood that the illustrated structure of the present embodiment does not constitute a specific limitation to the electronic apparatus 100. In other embodiments of the present application, electronic device 100 may include more or fewer components than shown, or some components may be combined, some components may be split, or a different arrangement of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
Processor 110 may include one or more processing units, such as: the processor 110 may include an Application Processor (AP), a modem processor, a Graphics Processor (GPU), an Image Signal Processor (ISP), a controller, a memory, a video codec, a Digital Signal Processor (DSP), a baseband processor, and/or a neural-Network Processing Unit (NPU), etc. The different processing units may be separate devices or may be integrated into one or more processors.
The controller may be a neural center and a command center of the electronic device 100, and is a decision maker instructing each component of the electronic device 100 to work in coordination according to instructions. The controller can generate an operation control signal according to the instruction operation code and the timing signal to complete the control of instruction fetching and instruction execution.
A memory may also be provided in processor 110 for storing instructions and data. In some embodiments, the memory in the processor 110 is a cache memory. The memory may hold instructions or data that have just been used or recycled by the processor 110. If the processor 110 needs to reuse the instruction or data, it can be called directly from the memory. Avoiding repeated accesses reduces the latency of the processor 110, thereby increasing the efficiency of the system.
In some embodiments, processor 110 may include one or more interfaces. The interface may include an integrated circuit (I2C) interface, an integrated circuit built-in audio (I2S) interface, a Pulse Code Modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a Mobile Industry Processor Interface (MIPI), a general-purpose input/output (GPIO) interface, a Subscriber Identity Module (SIM) interface, and/or a Universal Serial Bus (USB) interface, etc.
The I2C interface is a bi-directional synchronous serial bus that includes a serial data line (SDA) and a Serial Clock Line (SCL). In some embodiments, processor 110 may include multiple sets of I2C buses. The processor 110 may be coupled to the touch sensor 180K, the charger, the flash, the camera 193, etc. through different I2C bus interfaces, respectively. For example: the processor 110 may be coupled to the touch sensor 180K via an I2C interface, such that the processor 110 and the touch sensor 180K communicate via an I2C bus interface to implement the touch functionality of the electronic device 100.
The I2S interface may be used for audio communication. In some embodiments, processor 110 may include multiple sets of I2S buses. The processor 110 may be coupled to the audio module 170 via an I2S bus to enable communication between the processor 110 and the audio module 170. In some embodiments, the audio module 170 may communicate audio signals to the wireless communication module 160 via the I2S interface, enabling answering of calls via a bluetooth headset.
The PCM interface may also be used for audio communication, sampling, quantizing and encoding analog signals. In some embodiments, the audio module 170 and the wireless communication module 160 may be coupled by a PCM bus interface. In some embodiments, the audio module 170 may also transmit audio signals to the wireless communication module 160 through the PCM interface, so as to implement a function of answering a call through a bluetooth headset. Both the I2S interface and the PCM interface may be used for audio communication.
The UART interface is a universal serial data bus used for asynchronous communications. The bus may be a bidirectional communication bus. It converts the data to be transmitted between serial communication and parallel communication. In some embodiments, a UART interface is generally used to connect the processor 110 with the wireless communication module 160. For example: the processor 110 communicates with a bluetooth module in the wireless communication module 160 through a UART interface to implement a bluetooth function. In some embodiments, the audio module 170 may transmit the audio signal to the wireless communication module 160 through a UART interface, so as to realize the function of playing music through a bluetooth headset.
MIPI interfaces may be used to connect processor 110 with peripheral devices such as display screen 194, camera 193, and the like. The MIPI interface includes a Camera Serial Interface (CSI), a display screen serial interface (DSI), and the like. In some embodiments, processor 110 and camera 193 communicate through a CSI interface to implement the capture functionality of electronic device 100. The processor 110 and the display screen 194 communicate through the DSI interface to implement the display function of the electronic device 100.
The GPIO interface may be configured by software. The GPIO interface may be configured as a control signal and may also be configured as a data signal. In some embodiments, a GPIO interface may be used to connect the processor 110 with the camera 193, the display 194, the wireless communication module 160, the audio module 170, the sensor module 180, and the like. The GPIO interface may also be configured as an I2C interface, an I2S interface, a UART interface, a MIPI interface, and the like.
The USB interface 130 is an interface conforming to the USB standard specification, and may specifically be a Mini USB interface, a Micro USB interface, a USB Type-C interface, or the like. The USB interface 130 may be used to connect a charger to charge the electronic device 100, to transmit data between the electronic device 100 and peripheral devices, or to connect earphones and play audio through them.
It should be understood that the connection relationship between the modules according to the embodiment of the present invention is only illustrative, and is not limited to the structure of the electronic device 100. In other embodiments of the present application, the electronic device 100 may also adopt different interface connection manners or a combination of multiple interface connection manners in the above embodiments.
The charging management module 140 is configured to receive charging input from a charger. The charger may be a wireless charger or a wired charger. In some wired charging embodiments, the charging management module 140 may receive charging input from a wired charger via the USB interface 130. In some wireless charging embodiments, the charging management module 140 may receive a wireless charging input through a wireless charging coil of the electronic device 100. The charging management module 140 may also supply power to the electronic device through the power management module 141 while charging the battery 142.
The power management module 141 is used to connect the battery 142, the charging management module 140 and the processor 110. The power management module 141 receives input from the battery 142 and/or the charge management module 140 and provides power to the processor 110, the internal memory 121, the external memory, the display 194, the camera 193, the wireless communication module 160, and the like. The power management module 141 may also be used to monitor parameters such as battery capacity, battery cycle count, battery state of health (leakage, impedance), etc. In some other embodiments, the power management module 141 may also be disposed in the processor 110. In other embodiments, the power management module 141 and the charging management module 140 may be disposed in the same device.
The wireless communication function of the electronic device 100 may be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, a modem processor, a baseband processor, and the like.
The antennas 1 and 2 are used for transmitting and receiving electromagnetic wave signals. Each antenna in the electronic device 100 may be used to cover a single or multiple communication bands. Different antennas can also be multiplexed to improve the utilization of the antennas. For example: the antenna 1 may be multiplexed as a diversity antenna of a wireless local area network. In other embodiments, the antenna may be used in conjunction with a tuning switch.
The mobile communication module 150 may provide a solution for wireless communication including 2G/3G/4G/5G applied to the electronic device 100. The mobile communication module 150 may include at least one filter, a switch, a power amplifier, a low noise amplifier (LNA), and the like. The mobile communication module 150 may receive electromagnetic waves through the antenna 1, perform processing such as filtering and amplification on the received electromagnetic waves, and transmit them to the modem processor for demodulation. The mobile communication module 150 may also amplify the signal modulated by the modem processor and convert it into electromagnetic waves for radiation through the antenna 1. In some embodiments, at least some of the functional modules of the mobile communication module 150 may be disposed in the processor 110. In some embodiments, at least some of the functional modules of the mobile communication module 150 may be disposed in the same device as at least some of the modules of the processor 110.
The modem processor may include a modulator and a demodulator. The modulator is used for modulating a low-frequency baseband signal to be transmitted into a medium-high frequency signal. The demodulator is used for demodulating the received electromagnetic wave signal into a low-frequency baseband signal. The demodulator then passes the demodulated low frequency baseband signal to a baseband processor for processing. The low frequency baseband signal is processed by the baseband processor and then transferred to the application processor. The application processor outputs a sound signal through an audio device (not limited to the speaker 170A, the receiver 170B, etc.) or displays an image or video through the display screen 194. In some embodiments, the modem processor may be a stand-alone device. In other embodiments, the modem processor may be provided in the same device as the mobile communication module 150 or other functional modules, independent of the processor 110.
The wireless communication module 160 may provide a solution for wireless communication applied to the electronic device 100, including Wireless Local Area Networks (WLANs) (e.g., wireless fidelity (Wi-Fi) networks), bluetooth (bluetooth, BT), Global Navigation Satellite System (GNSS), Frequency Modulation (FM), Near Field Communication (NFC), Infrared (IR), and the like. The wireless communication module 160 may be one or more devices integrating at least one communication processing module. The wireless communication module 160 receives electromagnetic waves via the antenna 2, performs frequency modulation and filtering processing on electromagnetic wave signals, and transmits the processed signals to the processor 110. The wireless communication module 160 may also receive a signal to be transmitted from the processor 110, perform frequency modulation and amplification on the signal, and convert the signal into electromagnetic waves through the antenna 2 to radiate the electromagnetic waves.
In some embodiments, the antenna 1 of the electronic device 100 is coupled to the mobile communication module 150, and the antenna 2 is coupled to the wireless communication module 160, so that the electronic device 100 can communicate with networks and other devices through wireless communication technologies. The wireless communication technologies may include global system for mobile communications (GSM), general packet radio service (GPRS), code division multiple access (CDMA), wideband code division multiple access (WCDMA), time-division code division multiple access (TD-SCDMA), long term evolution (LTE), BT, GNSS, WLAN, NFC, FM, and/or IR technologies, and the like. The GNSS may include a global positioning system (GPS), a global navigation satellite system (GLONASS), a BeiDou navigation satellite system (BDS), a quasi-zenith satellite system (QZSS), and/or a satellite based augmentation system (SBAS).
The electronic device 100 implements display functions via the GPU, the display screen 194, and the application processor. The GPU is a microprocessor for image processing, and is connected to the display screen 194 and an application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. The processor 110 may include one or more GPUs that execute program instructions to generate or alter display information.
The display screen 194 is used to display images, videos, and the like. The display screen 194 includes a display panel. The display panel may be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini-LED, a Micro-LED, a Micro-OLED, a quantum dot light-emitting diode (QLED), or the like. In some embodiments, the electronic device 100 may include 1 or N display screens 194, N being a positive integer greater than 1.
The electronic device 100 may implement a shooting function through the ISP, the camera 193, the video codec, the GPU, the display 194, the application processor, and the like.
The ISP is used to process the data fed back by the camera 193. For example, when a photo is taken, the shutter is opened, light is transmitted to the camera photosensitive element through the lens, the optical signal is converted into an electrical signal, and the camera photosensitive element transmits the electrical signal to the ISP for processing and converting into an image visible to naked eyes. The ISP can also carry out algorithm optimization on the noise, brightness and skin color of the image. The ISP can also optimize parameters such as exposure, color temperature and the like of a shooting scene. In some embodiments, the ISP may be provided in camera 193.
The camera 193 is used to capture still images or video. The object generates an optical image through the lens and projects the optical image to the photosensitive element. The photosensitive element may be a Charge Coupled Device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor. The light sensing element converts the optical signal into an electrical signal, which is then passed to the ISP where it is converted into a digital image signal. And the ISP outputs the digital image signal to the DSP for processing. The DSP converts the digital image signal into image signal in standard RGB, YUV and other formats. In some embodiments, the electronic device 100 may include 1 or N cameras 193, N being a positive integer greater than 1.
The digital signal processor is used to process digital signals; in addition to digital image signals, it can process other digital signals. For example, when the electronic device 100 selects a frequency point, the digital signal processor is used to perform a Fourier transform or the like on the frequency point energy.
Video codecs are used to compress or decompress digital video. The electronic device 100 may support one or more video codecs. In this way, the electronic device 100 may play or record video in a variety of encoding formats, such as: moving Picture Experts Group (MPEG) 1, MPEG2, MPEG3, MPEG4, and the like.
The NPU is a neural-network (NN) computing processor. By drawing on the structure of biological neural networks, for example the transfer mode between neurons of the human brain, it processes input information quickly and can also learn continuously by itself. Applications such as intelligent cognition of the electronic device 100, for example image recognition, face recognition, speech recognition, and text understanding, can be implemented through the NPU.
The external memory interface 120 may be used to connect an external memory card, such as a Micro SD card, to extend the memory capability of the electronic device 100. The external memory card communicates with the processor 110 through the external memory interface 120 to implement a data storage function. For example, files such as music, video, etc. are saved in an external memory card.
The internal memory 121 may be used to store computer-executable program code, which includes instructions. The processor 110 executes various functional applications and data processing of the electronic device 100 by running the instructions stored in the internal memory 121. The internal memory 121 may include a program storage area and a data storage area. The program storage area may store an operating system, an application required by at least one function (such as a sound playing function or an image playing function), and the like. The data storage area may store data (such as audio data and a phone book) created during use of the electronic device 100, and the like. In addition, the internal memory 121 may include a high-speed random access memory, and may further include a nonvolatile memory, such as at least one magnetic disk storage device, a flash memory device, a universal flash storage (UFS), and the like.
The electronic device 100 may implement audio functions via the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the headphone interface 170D, and the application processor. Such as music playing, recording, etc.
The audio module 170 is used to convert digital audio information into an analog audio signal output and also to convert an analog audio input into a digital audio signal. The audio module 170 may also be used to encode and decode audio signals. In some embodiments, the audio module 170 may be disposed in the processor 110, or some functional modules of the audio module 170 may be disposed in the processor 110.
The speaker 170A, also called a "loudspeaker", is used to convert an audio electrical signal into a sound signal. The electronic device 100 can play music or answer a hands-free call through the speaker 170A.
The receiver 170B, also called an "earpiece", is used to convert an audio electrical signal into a sound signal. When the electronic device 100 answers a call or plays voice information, the user can listen to the voice by holding the receiver 170B close to the ear.
The microphone 170C, also called a "mic" or "mike", is used to convert a sound signal into an electrical signal. When making a call or sending voice information, the user can input a sound signal to the microphone 170C by speaking with the mouth close to the microphone 170C. The electronic device 100 may be provided with at least one microphone 170C. In other embodiments, the electronic device 100 may be provided with two microphones 170C, which can implement a noise reduction function in addition to collecting sound signals. In other embodiments, the electronic device 100 may further be provided with three, four, or more microphones 170C to collect sound signals, reduce noise, identify sound sources, implement directional recording, and so on.
The headphone interface 170D is used to connect a wired headphone. The headphone interface 170D may be the USB interface 130, or may be a 3.5 mm open mobile terminal platform (OMTP) standard interface or a cellular telecommunications industry association of the USA (CTIA) standard interface.
The pressure sensor 180A is used to sense a pressure signal and convert the pressure signal into an electrical signal. In some embodiments, the pressure sensor 180A may be disposed on the display screen 194. There are many types of pressure sensors 180A, such as a resistive pressure sensor, an inductive pressure sensor, and a capacitive pressure sensor. A capacitive pressure sensor may include at least two parallel plates made of conductive material. When a force acts on the pressure sensor 180A, the capacitance between the electrodes changes, and the electronic device 100 determines the strength of the pressure from the change in capacitance. When a touch operation acts on the display screen 194, the electronic device 100 detects the intensity of the touch operation through the pressure sensor 180A. The electronic device 100 may also calculate the touch position from the detection signal of the pressure sensor 180A. In some embodiments, touch operations that act on the same touch position but with different intensities may correspond to different operation instructions. For example, when a touch operation whose intensity is less than a first pressure threshold acts on the Messages application icon, an instruction to view the message is executed; when a touch operation whose intensity is greater than or equal to the first pressure threshold acts on the Messages application icon, an instruction to create a new message is executed.
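The intensity-dependent dispatch in the example above can be sketched as follows; the function name, instruction strings, and threshold value are illustrative assumptions, not taken from the patent.

```python
FIRST_PRESSURE_THRESHOLD = 0.5  # illustrative value of the first pressure threshold

def message_icon_action(touch_intensity):
    """Map the intensity of a touch on the Messages application icon to an
    operation instruction, following the example in the text.
    Names and the threshold value are illustrative assumptions."""
    if touch_intensity < FIRST_PRESSURE_THRESHOLD:
        return "view_message"        # lighter press: view the message
    return "create_new_message"      # firmer press: create a new message
```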
The gyroscope sensor 180B may be used to determine the motion posture of the electronic device 100. In some embodiments, the angular velocity of the electronic device 100 about three axes (i.e., the x, y, and z axes) may be determined by the gyroscope sensor 180B. The gyroscope sensor 180B may be used for image stabilization during shooting. For example, when the shutter is pressed, the gyroscope sensor 180B detects the shake angle of the electronic device 100, calculates the distance the lens module needs to compensate according to the shake angle, and lets the lens counteract the shake of the electronic device 100 through reverse movement, thereby achieving image stabilization. The gyroscope sensor 180B may also be used in navigation and somatosensory gaming scenarios.
The air pressure sensor 180C is used to measure air pressure. In some embodiments, electronic device 100 calculates altitude, aiding in positioning and navigation, from barometric pressure values measured by barometric pressure sensor 180C.
The magnetic sensor 180D includes a Hall sensor. The electronic device 100 may detect the opening and closing of a flip holster using the magnetic sensor 180D. In some embodiments, when the electronic device 100 is a flip phone, the electronic device 100 may detect the opening and closing of the flip cover according to the magnetic sensor 180D. Features such as automatic unlocking upon flip opening can then be set according to the detected open or closed state of the holster or of the flip cover.
The acceleration sensor 180E may detect the magnitude of acceleration of the electronic device 100 in various directions (typically along three axes). The magnitude and direction of gravity can be detected when the electronic device 100 is stationary. The acceleration sensor can also be used to recognize the posture of the electronic device, and is applied in applications such as landscape/portrait switching and pedometers.
The distance sensor 180F is used to measure distance. The electronic device 100 may measure distance by infrared light or laser. In some embodiments, in a shooting scenario, the electronic device 100 may use the distance sensor 180F to measure distance for fast focusing.
The proximity light sensor 180G may include, for example, a light emitting diode (LED) and a light detector such as a photodiode. The light emitting diode may be an infrared light emitting diode. The electronic device 100 emits infrared light outward through the light emitting diode and uses the photodiode to detect infrared light reflected from nearby objects. When sufficient reflected light is detected, the electronic device 100 can determine that there is an object nearby; when insufficient reflected light is detected, it can determine that there is no object nearby. The electronic device 100 can use the proximity light sensor 180G to detect that the user is holding the electronic device 100 close to the ear for a call, so as to automatically turn off the screen to save power. The proximity light sensor 180G may also be used in holster mode and pocket mode to automatically unlock and lock the screen.
The ambient light sensor 180L is used to sense the ambient light level. Electronic device 100 may adaptively adjust the brightness of display screen 194 based on the perceived ambient light level. The ambient light sensor 180L may also be used to automatically adjust the white balance when taking a picture. The ambient light sensor 180L may also cooperate with the proximity light sensor 180G to detect whether the electronic device 100 is in a pocket to prevent accidental touches.
The fingerprint sensor 180H is used to collect a fingerprint. The electronic device 100 can utilize the collected fingerprint characteristics to unlock the fingerprint, access the application lock, photograph the fingerprint, answer an incoming call with the fingerprint, and so on.
The temperature sensor 180J is used to detect temperature. In some embodiments, the electronic device 100 implements a temperature handling strategy using the temperature detected by the temperature sensor 180J. For example, when the temperature reported by the temperature sensor 180J exceeds a threshold, the electronic device 100 reduces the performance of a processor located near the temperature sensor 180J, so as to reduce power consumption and implement thermal protection. In other embodiments, when the temperature is below another threshold, the electronic device 100 heats the battery 142 to avoid an abnormal shutdown caused by low temperature. In still other embodiments, when the temperature is below a further threshold, the electronic device 100 boosts the output voltage of the battery 142 to avoid an abnormal shutdown caused by low temperature.
The touch sensor 180K is also referred to as a "touch panel". The touch sensor 180K may be disposed on the display screen 194, and the touch sensor 180K and the display screen 194 form a touch screen, which is also called a "touch screen". The touch sensor 180K is used to detect a touch operation applied thereto or nearby. The touch sensor can communicate the detected touch operation to the application processor to determine the touch event type. Visual output associated with the touch operation may be provided through the display screen 194. In other embodiments, the touch sensor 180K may be disposed on a surface of the electronic device 100, different from the position of the display screen 194.
The bone conduction sensor 180M may acquire a vibration signal. In some embodiments, the bone conduction sensor 180M may acquire the vibration signal of the bone block vibrated by the human voice part. The bone conduction sensor 180M may also contact the human pulse to receive a blood-pressure pulsation signal. In some embodiments, the bone conduction sensor 180M may also be disposed in a headset to form a bone conduction headset. The audio module 170 may parse out a voice signal based on the vibration signal, acquired by the bone conduction sensor 180M, of the bone block vibrated by the voice part, thereby implementing a voice function. The application processor may parse out heart rate information based on the blood-pressure pulsation signal acquired by the bone conduction sensor 180M, thereby implementing a heart rate detection function.
The keys 190 include a power key, a volume key, and the like. The keys 190 may be mechanical keys or touch keys. The electronic device 100 may receive key inputs and generate key signal inputs related to user settings and function control of the electronic device 100.
The motor 191 may generate a vibration prompt. The motor 191 may be used for incoming-call vibration prompts as well as touch vibration feedback. For example, touch operations applied to different applications (e.g., photographing, audio playing) may correspond to different vibration feedback effects. The motor 191 may also produce different vibration feedback effects for touch operations applied to different areas of the display screen 194. Different application scenarios (such as time reminders, receiving information, alarm clocks, and games) may also correspond to different vibration feedback effects. The touch vibration feedback effect may also support customization.
Indicator 192 may be an indicator light that may be used to indicate a state of charge, a change in charge, or a message, missed call, notification, etc.
The SIM card interface 195 is used to connect a SIM card. A SIM card can be brought into or out of contact with the electronic device 100 by being inserted into or pulled out of the SIM card interface 195. The electronic device 100 may support 1 or N SIM card interfaces, N being a positive integer greater than 1. The SIM card interface 195 may support a Nano SIM card, a Micro SIM card, a standard SIM card, and the like. Multiple cards can be inserted into the same SIM card interface 195 at the same time; the types of the cards may be the same or different. The SIM card interface 195 may also be compatible with different types of SIM cards and with external memory cards. The electronic device 100 interacts with the network through the SIM card to implement functions such as calls and data communication. In some embodiments, the electronic device 100 employs an eSIM, that is, an embedded SIM card. The eSIM card can be embedded in the electronic device 100 and cannot be separated from it.
Continuing to refer to fig. 3, for example, the smart device described in this embodiment may also have a structure shown in the electronic device 200 in fig. 3. The electronic device 200 includes a main control module 210, a voice module 220, and a power module 230.
The main control module 210 may include a processor 211 and a memory 212. The processor 211 is used for data operation and processing, and the processor 211 may be a Central Processing Unit (CPU), an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application, for example: one or more Digital Signal Processors (DSPs), or one or more Field Programmable Gate Arrays (FPGAs). The memory 212 is used to store instructions and data. Memory 212 may be a read-only memory (ROM) or other type of static storage device that may store static information and instructions, a Random Access Memory (RAM) or other type of dynamic storage device that may store information and instructions. For example, the memory 212 may be a double data rate synchronous dynamic random access memory (DDR SDRAM) or a flash memory (flash). The memory 212 may exist separately from the processor 211, i.e., the memory 212 may be a memory external to the processor 211, or the memory 212 may be integrated with the processor 211, i.e., the memory 212 may be an internal memory of the processor 211, and may be used for temporarily storing some data and instruction information, etc.
The voice module 220 includes a microphone 221 and an analog-to-digital converter (ADC) 222. The microphone 221 is used to collect sound signals; in some embodiments, the microphone 221 may be a microphone (mic) array. The ADC 222 is used to convert the sound signal into a digital signal. In some embodiments, the ADC 222 may also be integrated in the microphone 221.
The power module 230 is used for supplying power to the electronic device 200 and various components thereof.
In one example, the electronic device 100 or the electronic device 200 may include the hardware structure shown in fig. 4. As shown in fig. 4, a system on a chip (SoC) may include a central processing unit (CPU), a DSP, a memory, and the like. The CPU is mainly used to interpret computer instructions, process data, and control and manage the entire electronic device, for example: timing control, digital system control, radio frequency control, power-saving control, human-computer interface control, and the like. For example, the CPU may be used to control the periodic collection of sound signals in the environment. The DSP may be used to implement the functions of the DSP in the processor 110 of the electronic device 100 shown in fig. 2. For example, in this embodiment of the present application, the DSP may be configured to perform noise reduction processing, speech recognition, and the like on the received speech signal. The memory is used to store computer instructions, data, and the like. Of course, the SoC may also include other parts, such as an audio chip, a video chip, an NPU, a GPU, an ISP, a baseband chip, and a radio frequency chip. The audio chip is configured to process audio-related functions, for example, to implement the functions of the audio module 170 in fig. 2. The video chip is used to process video-related functions, for example, to implement the video codec functions of the processor 110 in fig. 2. The NPU, the GPU, and the ISP are respectively used to implement the functions of the NPU, the GPU, and the ISP in the processor 110 of the electronic device 100 shown in fig. 2. The baseband chip is mainly used for signal processing and protocol processing to implement mobile-communication-related functions; for example, the baseband chip may be used to implement the functions of the mobile communication module 150 in fig. 2 and of the modem processor, the baseband processor, and the like in the processor 110; for example, the baseband chip may also implement the functions of the wireless communication module 160 in fig. 2. The radio frequency chip is used for radio frequency transceiving, frequency synthesis, power amplification, and the like; for example, it may implement the functions of the antenna 1 in fig. 2 and part of the functions of the mobile communication module 150, and may also implement the functions of the antenna 2 in fig. 2 and part of the functions of the wireless communication module 160.
The embodiment of the present application provides a voice control method, which can be applied to the electronic device 100 in fig. 2 or the electronic device 200 in fig. 3. As shown in fig. 5, the method may include:
S101, the electronic device acquires a first voice of a user.
The electronic device may periodically capture sounds in the environment. For example, the processor 110 of the electronic device in fig. 2 controls the microphone 170C to capture the sound in the environment according to a set period.
Each time the electronic device collects sound in the environment, it may preprocess the sound to obtain a voice signal. For example, in fig. 2, the microphone 170C collects the sound and transmits the sound signal to the DSP of the processor 110, and the DSP preprocesses the acquired sound signal.
In one implementation, the pre-processing is noise reduction processing, and the electronic device performs noise reduction processing on the received sound.
For example, the electronic device performs wind noise determination based on the spectral continuity of wind noise and its characteristic of starting from low frequencies. Illustratively, the electronic device scans the sound signal from its low-frequency starting point toward high frequencies and judges whether the current frequency point of the current frame is in a high-energy region. If not, the current frequency point of the current frame is determined not to be wind noise. If so, the electronic device judges whether the current frequency point is below the wind-noise low-frequency threshold; if it is, the current frequency point of the current frame is determined to be wind noise. If it is not, the electronic device checks whether the low-frequency point adjacent to the current frequency point is wind noise: if so, the current frequency point of the current frame is determined to be wind noise; if not, it is determined not to be wind noise. In this way, the wind noise in each frame of the sound signal can be determined and then removed from the collected sound signal. For details, see patent CN104637489B, "method and apparatus for processing sound signals".
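The per-frequency-point decision chain above can be sketched as follows. This is a minimal illustration of the published logic, not the patented implementation: the magnitude representation, threshold values, and function name are assumptions.

```python
def detect_wind_noise_bins(frame_mags, high_energy_thresh, wind_low_freq_bin):
    """Classify each frequency point of one frame as wind noise or not.

    frame_mags: spectral magnitudes of the frame, index 0 = lowest frequency.
    high_energy_thresh: magnitude above which a point counts as high-energy.
    wind_low_freq_bin: index of the wind-noise low-frequency threshold.
    All names and thresholds are illustrative assumptions.
    """
    is_wind = [False] * len(frame_mags)
    for k, mag in enumerate(frame_mags):       # scan from low to high frequency
        if mag <= high_energy_thresh:          # not in a high-energy region
            is_wind[k] = False                 # -> not wind noise
        elif k < wind_low_freq_bin:            # below the wind-noise low-frequency threshold
            is_wind[k] = True                  # -> wind noise
        else:
            # otherwise inherit the decision of the adjacent lower frequency point
            is_wind[k] = k > 0 and is_wind[k - 1]
    return is_wind
```

Points flagged as wind noise can then be attenuated or removed frame by frame.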
For example, the electronic device treats the received sound signal as a pure speech signal and a pure noise signal input through two channels (a speech channel and a noise channel), simulates the effect of human ears by using computational auditory scene analysis (CASA), and separates the pure speech signal from the pure noise signal according to the time difference (ITD) and the intensity difference (ILD) with which the signals arrive at the two channels. First, the noise envelope and the speech envelope of the auditory spectrum of the jth frame at the ith frequency of the sound signal are calculated from the number of sampling points in each frame and the time-domain amplitude of the sound signal. Second, the cross-correlation function of the noise channel and the speech channel is calculated from the number of sampling points in each frame and the characteristic time delay τ of speech and noise. Third, the ITD and ILD of the noise channel and the speech channel are calculated from their cross-correlation function. Fourth, the cross-correlation functions over all frames and all frequency channels are summed, and the delay at which this sum reaches its extremum is the characteristic time delay τ of speech and noise. The value of τ then indicates which of the two channels carries the speech signal and which carries the noise signal: when τ is negative, the signal of the first channel is pure speech; when τ is positive, the signal of the second channel is pure speech.
Fifth, CASA is used to estimate the mask information of each time-frequency unit in the time-frequency domain from the ITD and the ILD, and the noise region and the pure-speech region in the time-frequency domain are determined according to the mask information of the time-frequency units. Speech synthesis is then performed on the pure-speech region to obtain a pure speech signal. For details, see patent CN104064196B, "a method for improving accuracy of speech recognition based on noise cancellation of speech front end".
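The use of the sign of the characteristic delay τ to pick the speech channel can be sketched as follows. This is a deliberate simplification under stated assumptions: the channels are equal-length sample lists, and a single full cross-correlation stands in for the patent's sum of cross-correlations over all frames and frequency channels.

```python
def speech_channel_from_delay(ch1, ch2):
    """Decide which of two equal-length channels carries the pure speech
    signal from the sign of the delay tau at which their cross-correlation
    peaks (illustrative simplification of the CASA step in the text)."""
    n = len(ch1)
    best_tau, best_val = 0, float("-inf")
    for tau in range(-(n - 1), n):             # candidate characteristic delays
        val = sum(ch1[i + tau] * ch2[i]
                  for i in range(n) if 0 <= i + tau < n)
        if val > best_val:                     # keep the delay with the peak correlation
            best_tau, best_val = tau, val
    # tau negative: the first channel is pure speech; otherwise the second
    return 1 if best_tau < 0 else 2
```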
Further, the electronic device determines whether the preprocessed sound is valid. In one implementation, the preprocessed sound is determined to be valid if its energy is greater than a first threshold, and invalid if its energy is less than or equal to the first threshold.
If the preprocessed sound is determined to be valid, the electronic device determines that a voice has been obtained; if the preprocessed sound is determined to be invalid, the electronic device may discard the sound without further processing.
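The validity check above amounts to a simple energy gate, which can be sketched as follows (the sum-of-squares energy measure and the function name are illustrative assumptions):

```python
def is_valid_speech(samples, first_threshold):
    """Return True when the energy of the preprocessed sound exceeds the
    first threshold, i.e., the sound is treated as valid speech.
    The sum-of-squares energy measure is an illustrative assumption."""
    energy = sum(s * s for s in samples)
    return energy > first_threshold
```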
For example, the electronic device obtains a first voice of a user.
S102, the electronic device determines the category of the acquired first voice. If the first voice is determined to be of the first type, S103 is performed.
In some embodiments, the electronic device performs voice feature extraction on the acquired first voice, and determines the category of the first voice according to the voice feature of the first voice. The categories of speech may include a first type and a second type. Illustratively, the first type may be special speech; the second type may be normal speech.
In one implementation, the electronic device determines the category of speech based on the proportion of unvoiced and voiced content in the speech. Unvoiced sound is speech produced without vocal-cord vibration; voiced sound is speech produced with vocal-cord vibration. If the electronic device determines that the proportion of unvoiced content in the speech is greater than a first threshold, the speech is determined to be of the first type. If the electronic device determines that the proportion of unvoiced content in the speech is less than or equal to the first threshold, the speech is determined to be of the second type. Illustratively, the first threshold is 80%.
For example, the user does not vibrate the vocal cords while uttering the voice "piupiupiu". The proportion of unvoiced content in the voice "piupiupiu" is greater than 80%, so the electronic device determines that the voice "piupiupiu" is of the first type. When the user utters the voice "turn on the air conditioner", the vocal cords vibrate. The proportion of unvoiced content in the voice "turn on the air conditioner" is less than 80%, so the electronic device determines that the voice "turn on the air conditioner" is of the second type.
For another example, when a user utters the voice "sleep" in whisper mode, the vocal cords do not vibrate; the electronic device determines that the proportion of unvoiced content in the voice "sleep" is greater than 80%, and determines that the voice "sleep" belongs to the first type (the voice "sleep" is a voice in whisper mode). When a user utters the voice "sleep" with vibrating vocal cords, the electronic device determines that the proportion of unvoiced content in the voice "sleep" is less than 80%, and determines that the voice "sleep" belongs to the second type.
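The first-threshold rule above can be sketched directly. The type labels and the 80% default follow the text; the per-frame unvoiced flags are assumed to come from a separate detector:

```python
FIRST_TYPE, SECOND_TYPE = "special speech", "normal speech"

def classify_by_unvoiced_ratio(frame_is_unvoiced, first_threshold=0.8):
    """First type if the proportion of unvoiced frames exceeds the
    first threshold (80% in the example), otherwise second type."""
    ratio = sum(frame_is_unvoiced) / len(frame_is_unvoiced)
    return FIRST_TYPE if ratio > first_threshold else SECOND_TYPE
```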
In one example, the electronic device can perform unvoiced-sound detection on the voice signal according to the mid-to-high-frequency characteristics of unvoiced sound, and so determine the category of the voice. For example, the electronic device determines the total energy of the low-frequency signal and the total energy of the mid-to-high-frequency signal in the current frame of the voice signal. If the ratio of the total energy of the mid-to-high-frequency signal to the total energy of the low-frequency signal in the current frame is greater than a second threshold, the current frame is determined to contain unvoiced sound; if the ratio is less than or equal to the second threshold, the current frame is determined to contain no unvoiced sound. The voice signal is determined to be of the first type if the proportion of frames containing unvoiced sound among all frames of the voice signal is greater than the first threshold. For details, see patent CN104637489B, "Method and apparatus for processing sound signals".
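The per-frame energy-ratio test can be sketched with an FFT split at an assumed 1 kHz boundary. The split frequency and the second threshold are illustrative values, not the patent's:

```python
import numpy as np

def frame_has_unvoiced(frame, sample_rate, split_hz=1000.0, second_threshold=1.5):
    """Decide unvoiced per frame by comparing mid/high-band energy with
    low-band energy (split frequency and threshold are illustrative)."""
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    low = spectrum[freqs < split_hz].sum()
    mid_high = spectrum[freqs >= split_hz].sum()
    return mid_high > second_threshold * low
```

A 3 kHz tone (energy concentrated above the split) is flagged as unvoiced, while a 200 Hz tone is not.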
In yet another example, in a first step, the electronic device can construct an unvoiced/voiced decision question set, which contains questions for deciding the type. For example: (1) speech information about the phoneme to which the speech frame belongs, such as whether it is a vowel, a plosive, a fricative, a nasal, a stressed sound, or a specific phoneme, and whether its tone is yin-ping, yang-ping, rising, or falling; (2) speech information about the preceding phoneme of the speech frame in the sentence; (3) speech information about the following phoneme of the speech frame in the sentence; (4) which state of the phoneme the speech frame is in (a phoneme is usually divided into 5 states), the tone of the phoneme to which the speech frame belongs, whether that phoneme is retroflexed, and so on.
In a second step, an unvoiced/voiced decision model with a binary decision tree structure is trained using the speech training data and the unvoiced/voiced decision question set. Non-leaf nodes in the binary decision tree are questions from the question set, and leaf nodes are unvoiced/voiced decision results.
For example, a binary decision tree model is adopted; the training data are speech frames, and the accompanying information includes: fundamental frequency information, the phoneme of the frame and the phonemes before and after it, the state ordinal of the frame within the phoneme (i.e., which of the phoneme's states the frame occupies), and so on.
During training, for each question in the designed question set, the proportion of voiced frames is calculated separately over the training data answering "yes" and "no", and the question with the largest difference between the two voiced proportions is selected as the question of the node. The training data are then split. Conditions for stopping the splitting can be preset (for example, a node's training data falling below a certain number of frames, or the voiced-proportion difference for further splitting falling below a certain threshold); the unvoiced/voiced decision of a leaf node is then determined by the proportion of voiced frames in its training data. If a frame is judged to be voiced, its fundamental frequency value is predicted by a trained hidden Markov model.
In a third step, the trained unvoiced/voiced decision model is used to judge whether the speech data is unvoiced or voiced.
In a fourth step, if the proportion of unvoiced content in the voice signal is determined to be greater than the first threshold, the voice signal is determined to be of the first type.
For details, see patent CN104143342B, "Method and apparatus for unvoiced/voiced decision, and speech synthesis system".
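The training loop described in these steps — pick the question with the largest gap in voiced-frame proportion, split, stop on small nodes, label leaves by majority — can be sketched in plain Python. The question set, training frames, and stopping thresholds here are hypothetical stand-ins, not the patent's actual data:

```python
from dataclasses import dataclass

@dataclass
class Node:
    question: int = None    # index into the question list; None for a leaf
    yes: "Node" = None
    no: "Node" = None
    voiced: bool = None     # leaf decision

def voiced_ratio(frames):
    return sum(f["voiced"] for f in frames) / len(frames)

def build_tree(frames, questions, min_frames=4, min_gap=0.1):
    """Greedily pick the question maximizing the voiced-proportion gap;
    stop splitting when a side is too small or no question separates well."""
    best, best_gap = None, min_gap
    for qi, q in enumerate(questions):
        yes = [f for f in frames if q(f)]
        no = [f for f in frames if not q(f)]
        if len(yes) < min_frames or len(no) < min_frames:
            continue
        gap = abs(voiced_ratio(yes) - voiced_ratio(no))
        if gap > best_gap:
            best, best_gap = qi, gap
    if best is None:                       # leaf: majority vote on voiced frames
        return Node(voiced=voiced_ratio(frames) > 0.5)
    q = questions[best]
    return Node(question=best,
                yes=build_tree([f for f in frames if q(f)], questions, min_frames, min_gap),
                no=build_tree([f for f in frames if not q(f)], questions, min_frames, min_gap))

def classify(node, frame, questions):
    while node.question is not None:
        node = node.yes if questions[node.question](frame) else node.no
    return node.voiced
```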
For example, after the DSP of the electronic device 100 in fig. 2 acquires the speech, the DSP extracts a speech feature value of the speech, and determines the category of the speech according to the content ratio of unvoiced speech and voiced speech.
In another implementation, the electronic device determines the category of the speech based on whether its semantics can be recognized. If the electronic device determines that the semantics of the speech can be recognized, the speech is determined to be of the second type; if the semantics cannot be recognized, the speech is determined to be of the first type.
For example, the electronic device determines that the semantics of the voice "piupiupiu" cannot be recognized, and determines that the voice "piupiupiu" belongs to the first type. The electronic device determines that the semantics of the voice "turn on the air conditioner" can be recognized, and determines that the voice "turn on the air conditioner" belongs to the second type.
In an example, after the DSP of the electronic device 100 in fig. 2 acquires the voice signal, the DSP extracts a voice feature value of the voice signal, and transmits the extracted voice feature value to the AP. For example, the AP may perform semantic recognition according to the speech feature value; alternatively, the AP may transmit the voice feature value to a server (the server may be a cloud server in charge of voice processing in a communication network in which the electronic device is located), perform semantic recognition by the server, and receive a semantic recognition result from the server. In some embodiments, the speech is determined to be of the second type if the AP determines that the semantic recognition is successful, and the speech is determined to be of the first type if the AP determines that the semantic recognition is unsuccessful; further, the AP may send the recognition result to the DSP. In other embodiments, the AP sends the semantic recognition result to the DSP, determines that the speech is of the second type if the DSP determines that the semantic recognition is successful, and determines that the speech is of the first type if the DSP determines that the semantic recognition is unsuccessful. And then, the DSP can continue to collect the sound in the environment according to a set period to obtain a voice signal, extract the voice characteristic value of the voice signal and transmit the voice characteristic value to the AP for semantic recognition.
In another example, after acquiring the voice signal, the DSP of the electronic device 100 in fig. 2 transmits the voice signal to the AP, and the AP extracts the voice feature value of the voice. For example, the AP may perform semantic recognition according to the speech feature value; alternatively, the AP may transmit the voice feature value to a server (the server may be a cloud server in charge of voice processing in a communication network in which the electronic device is located), perform semantic recognition by the server, and receive a semantic recognition result from the server. In some embodiments, the speech is determined to be of the second type if the AP determines that the semantic recognition is successful, and the speech is determined to be of the first type if the AP determines that the semantic recognition is unsuccessful; further, the AP may send the recognition result to the DSP. In other embodiments, the AP sends the semantic recognition result to the DSP, determines that the speech is of the second type if the DSP determines that the semantic recognition is successful, and determines that the speech is of the first type if the DSP determines that the semantic recognition is unsuccessful. And then, the DSP can continue to collect the sound in the environment according to a set period to obtain a voice signal, and the voice signal is transmitted to the AP for semantic recognition.
Of course, the electronic device may also determine the category according to other voice features of the voice, which is not limited in this embodiment of the application.
It should be noted that S102 may be an optional step. That is, after the electronic device acquires the first voice, it may determine whether the first voice contains a control instruction without determining the category of the first voice; both voices belonging to the first type and voices belonging to the second type are then checked for control instructions.
S103, the electronic equipment responds to the first voice and executes a first action corresponding to the first voice.
In some embodiments, the electronic device determines whether the first voice contains a control instruction. If the first voice is determined to contain a control instruction, the first action corresponding to that control instruction is executed; if the first voice is determined not to contain a control instruction, no action is performed.
In one implementation, a set of control instructions is configured within an electronic device. After the electronic equipment receives the first voice, the first voice is matched in the control instruction set, and if the matching is successful, the control instruction corresponding to the first voice is determined.
The control instruction set includes one or more control instructions, each of which is a voice belonging to the first type. Illustratively, the control instruction set includes three control instructions: the voice "piupiupiu", a whistle, and the voice "sleep" in whisper mode.
The electronic device is also provided with a control-instruction-and-action relation table, which contains the correspondence between each control instruction and one execution action. For example, the control-instruction-and-action relation table is shown in Table 1.
TABLE 1

Control instruction | Action
"piupiupiu" | Turn on the air conditioner
Whistle | Turn on the light
Voice "sleep" in whisper mode | Switch to sleep mode
The control instruction set may take the form of a control instruction table or another format, and the control-instruction-and-action relation table may take the form of a set or another format; the two may be combined into one set or table, or kept as separate sets or tables. This is not limited in the embodiments of the present application.
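A minimal sketch of the control instruction set and Table 1 as in-memory structures; the key names (including the whisper-mode encoding) are hypothetical:

```python
# Hypothetical in-device tables; the patent allows them to be one
# combined structure or two separate ones.
CONTROL_INSTRUCTIONS = {"piupiupiu", "whistle", "sleep(whisper)"}

ACTION_TABLE = {  # Table 1
    "piupiupiu": "turn on the air conditioner",
    "whistle": "turn on the light",
    "sleep(whisper)": "switch to sleep mode",
}

def handle_first_voice(matched_instruction):
    """Return the action for a matched control instruction, or None
    when matching failed (in which case no action is performed)."""
    if matched_instruction not in CONTROL_INSTRUCTIONS:
        return None
    return ACTION_TABLE[matched_instruction]
```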
In some examples, the set of control instructions and the table of control instruction and action relationships are pre-set within the electronic device.
In other examples, the control instruction in the control instruction set and the corresponding relationship between the control instruction and the execution action may be set according to an operation of a user.
Exemplarily, taking the electronic device as the mobile phone 300 shown in fig. 6 as an example, a process of setting a control instruction by the electronic device according to a user operation will be described:
the handset 300 may receive a user's click operation (e.g., a single click operation) on the "set" application icon. In response to a user's click operation on the "settings" application icon, the cellular phone 300 may display a settings interface 301 shown in (a) in fig. 6. The settings interface 301 may include an "airplane mode" option, a "WLAN" option, a "bluetooth" option, a "mobile network" option, and a "smart assist" option 302, among others. Specific functions of the "flight mode" option, the "WLAN" option, the "bluetooth" option, and the "mobile network" option may refer to specific descriptions in the conventional technology, and are not described herein again in this embodiment of the present application.
The mobile phone 300 may receive a user's click operation (e.g., a single click operation) on the "smart assistance" option 302. In response to the user clicking the "smart assistance" option 302, the mobile phone 300 may display the smart assistance interface 303 shown in (b) of fig. 6. The smart assistance interface 303 includes a "gesture control" option 304, a "voice control" option 305, and the like. The "gesture control" option 304 is used to manage user gestures that trigger the mobile phone 300 to perform corresponding events, and the "voice control" option 305 is used to manage the voice interaction functions of the mobile phone 300. Specifically, the mobile phone 300 may receive a user's click operation on the "voice control" option 305, and the mobile phone 300 may then display the voice control interface 306 shown in (c) of fig. 6. The voice control interface 306 includes a "voice wake-up" option 307 and a "control instructions" option 308. The "voice wake-up" option 307 is used to set a wake-up word for the voice wake-up function, and the "control instructions" option 308 is used to set control instructions.
The mobile phone 300 may receive a user's click operation (e.g., a single click operation) on the "control instructions" option 308. In response to the user clicking the "control instructions" option 308, the mobile phone 300 may display the custom control instruction interface 309 shown in (d) of fig. 6. The custom control instruction interface 309 may include a "turn on air conditioner" option 30a, a "turn on light" option 30b, and a "switch sleep mode" option 30c. In response to the user clicking the "turn on air conditioner" option 30a, the "turn on light" option 30b, or the "switch sleep mode" option 30c, the control instruction corresponding to that option may be configured. For example, the mobile phone 300 may receive a user's click operation (e.g., a single click operation) on the "turn on air conditioner" option 30a. In response, the mobile phone 300 may display the "turn on air conditioner" control instruction setting interface 401 shown in (e) of fig. 6.
The "turn on air conditioner" control instruction setting interface 401 may include: a recording progress bar 402, a "microphone" option 403, and a recording prompt 404. The "microphone" option 403 is used to trigger the mobile phone 300 to start recording voice data as a control instruction. The recording progress bar 402 is used to display the recording progress of the mobile phone 300. The recording prompt 404 is used to prompt the user to customize the control instruction. For example, the recording prompt 404 may read: "Please set an instruction; tap the microphone button below to start. To avoid accidental triggering, please choose a voice that is uncommon in daily life."
In response to the user clicking the "microphone" option 403, the mobile phone 300 may begin recording voice data entered by the user. After receiving the voice data input by the user, the mobile phone 300 may set the voice data as the control instruction corresponding to the action "turn on the air conditioner". The "turn on air conditioner" control instruction setting interface 401 further includes a "cancel" button 405 and an "OK" button 406. The "OK" button 406 is used to trigger the mobile phone 300 to save the recorded control instruction; the "cancel" button 405 is used to trigger the mobile phone 300 to cancel the setting of the control instruction. For example, if the voice input by the user is "popopo", the electronic device adds the voice "popopo" to the control instruction set and deletes the voice "piupiupiu" from it; the control instruction corresponding to the action "turn on the air conditioner" is set to "popopo", and the control-instruction-and-action relation table is updated.
It should be noted that the above is only an exemplary illustration, different electronic devices have different designs, and the above interfaces and options may also be other names. For example, the intelligent assistance may be referred to as an assistance function and the voice control may be referred to as a voice assistant. And, the way that the user triggers the electronic device to display the customized control instruction interface includes, but is not limited to, the operation of "setup-intelligent assistance-voice control-control instruction" of the user in the electronic device. For example, the manner in which the user triggers the electronic device to display the custom control command interface may also be "settings-voice assistant-voice control-control commands".
After receiving the first voice, the electronic device matches the first voice against the control instruction set. If the matching succeeds, the electronic device determines the control instruction corresponding to the first voice, determines the first action corresponding to that control instruction according to the control-instruction-and-action relation table, and executes the first action. If the matching is determined to fail, no action is performed; for example, the first voice may be dropped.
Illustratively, the electronic device receives the voice "piupiupiu" and determines that "piupiupiu" is of the first type. According to Table 1, the action corresponding to "piupiupiu" is: turn on the air conditioner. The electronic device therefore executes the action of turning on the air conditioner.
For example, the method for the electronic device to match the received voice against the control instruction set may include: the electronic device builds a control instruction detection model from the control instruction set. After receiving the voice, the electronic device performs voice feature extraction on it and matches the extracted voice features against the control instruction detection model. If, according to the model, the voice features of the received voice match one control instruction in the control instruction set, the matching succeeds; if they match no control instruction in the set, the matching fails.
The electronic device may adopt a method in the prior art to construct the control instruction detection model according to the control instruction set, and details of a specific construction method are not described herein. For example, the control instruction detection model may be a hidden markov model. When constructing a hidden markov model, a large number of different voices may be set as preset words. Each preset word can be expanded into a hidden markov chain, namely a preset word state chain. The hidden markov model includes a plurality of hidden markov chains.
The electronic device may use the existing acoustic model evaluation to perform speech feature extraction on the received speech signal, and details of a specific speech feature extraction method are not described herein. The speech features may be frequency spectra or cepstral coefficients, among others.
Take hidden markov model as an example. And the electronic equipment adopts an acoustic model for evaluation, and performs preset word confirmation on each hidden Markov chain in the hidden Markov model for the extracted voice characteristics to obtain the score of the hidden Markov chain. For example, the extracted voice features are compared with the state of each hidden markov chain to obtain the score of the hidden markov chain. The score of the hidden Markov chain represents the similarity of the current input voice and the preset word of the hidden Markov chain, and the higher the score is, the higher the similarity is. And determining whether the preset word corresponding to the hidden Markov chain with the highest score is a control instruction in the control instruction set.
If the preset word corresponding to the highest-scoring hidden Markov chain is determined to be a control instruction in the control instruction set, the matching succeeds, and the correspondence between the received voice and that control instruction is confirmed. If the preset word corresponding to the highest-scoring hidden Markov chain is determined not to be a control instruction in the set, the matching fails, and the received voice is determined not to be a control instruction. For the specific control instruction matching method, see patent CN105654943A, "Voice wake-up method, apparatus and system".
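As a loose illustration of the highest-score-then-membership check (not the hidden Markov scoring itself), the following sketch replaces the per-chain HMM score with a simple distance between hypothetical feature vectors:

```python
import numpy as np

# Hypothetical preset-word templates: word -> feature vector
TEMPLATES = {
    "piupiupiu": np.array([0.9, 0.1, 0.8]),
    "popopo": np.array([0.2, 0.9, 0.1]),
    "hello": np.array([0.5, 0.5, 0.5]),
}
CONTROL_INSTRUCTIONS = {"piupiupiu", "popopo"}

def score(features, template):
    """Stand-in for the per-chain HMM score: higher = more similar."""
    return -float(np.linalg.norm(features - template))

def match_control_instruction(features):
    """Take the top-scoring preset word, then accept it only if it is
    a member of the control instruction set."""
    best = max(TEMPLATES, key=lambda w: score(features, TEMPLATES[w]))
    return best if best in CONTROL_INSTRUCTIONS else None
```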
For example, in fig. 2, the DSP of the electronic device 100 determines that the received voice is "piupiupiu", performs voice feature extraction on it, and matches the extracted voice features against the control instruction detection model. If the matching succeeds, the voice "piupiupiu" is determined to be a control instruction. Alternatively, the DSP may hand the received voice to the AP, which extracts the voice features and performs the matching against the control instruction detection model; or the DSP may extract the voice features and pass them to the AP, which performs the matching. This is not limited in the embodiments of the present application.
In some embodiments, after receiving the first voice, the electronic device obtains the semantic meaning of the first voice and executes a first action corresponding to the semantic meaning of the first voice.
For example, after receiving the voice "sleep" in whisper mode, the electronic device determines that the voice belongs to the first type and obtains its semantic meaning, "sleep". Further, the electronic device performs the action corresponding to the semantic meaning "sleep": switching to the sleep mode.
In one implementation, a DSP of an electronic device obtains semantics of a first voice from a server and performs a corresponding action according to the semantics of the first voice. In an example, after the DSP of the electronic device 100 in fig. 2 acquires the voice signal, the DSP extracts a voice feature value of the voice signal, and transmits the extracted voice feature value to the AP. For example, the AP may perform semantic recognition according to the speech feature value; alternatively, the AP may transmit the voice feature value to a server (the server may be a cloud server in charge of voice processing in a communication network in which the electronic device is located), perform semantic recognition by the server, and receive a semantic recognition result from the server and further transmit the semantic recognition result to the DSP.
In another example, after acquiring the voice signal, the DSP of the electronic device 100 in fig. 2 transmits the voice signal to the AP, and the AP extracts the voice feature value of the voice. For example, the AP may perform semantic recognition according to the speech feature value; alternatively, the AP may transmit the voice feature value to a server (the server may be a cloud server in charge of voice processing in a communication network in which the electronic device is located), perform semantic recognition by the server, and receive a semantic recognition result from the server and further transmit the semantic recognition result to the DSP.
In some embodiments, after receiving the first voice, the electronic device sends the first voice to the server. After receiving the first voice, the server determines the semantic meaning of the first voice and executes a first action corresponding to the semantic meaning of the first voice.
For example, the electronic device receives the voice "query tomorrow's weather" in whisper mode, determines that the voice belongs to the first type, and sends the voice to the server. After the server acquires the voice and recognizes that its semantic meaning is "query tomorrow's weather", it executes the corresponding action: querying tomorrow's weather. Further, the server may send the query result to the electronic device, and the electronic device presents it; for example, the electronic device may display the query result on a display screen, or output it as a sound signal through an audio device (e.g., a speaker or a receiver).
In some embodiments, after receiving the first voice, the electronic device determines whether the first voice contains a first keyword.
For example, a first keyword set is preset in the electronic device, and the first keyword set includes one or more first keywords. Illustratively, the first set of keywords comprises "pangpang" and "pengpeng". The electronic equipment builds a keyword detection model according to the first keyword set. After receiving the voice, the electronic equipment performs voice feature extraction on the received voice. And the electronic equipment performs matching according to the extracted voice features and the keyword detection model. And if the voice characteristics of the received voice can be matched with one first keyword in the first keyword set according to the keyword detection model, the matching is successful, and the received voice is determined to contain the first keyword. And if the voice characteristics of the received voice cannot be matched with any first keyword in the first keyword set according to the keyword detection model, the matching is failed, and the received voice is determined not to contain the first keyword. The electronic device may use a method in the prior art to construct a keyword detection model according to the first keyword set, and use a method in the prior art to perform speech feature extraction, which is not described herein again.
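The open-set keyword test can be sketched similarly. The keyword names "pangpang" and "pengpeng" follow the text, while the feature vectors and distance threshold are invented for illustration:

```python
import numpy as np

# Hypothetical first-keyword templates: keyword -> feature vector
FIRST_KEYWORDS = {
    "pangpang": np.array([1.0, 0.0, 1.0]),
    "pengpeng": np.array([0.0, 1.0, 1.0]),
}
MATCH_THRESHOLD = 0.5  # hypothetical distance bound for a successful match

def contains_first_keyword(features):
    """True if the extracted voice features are close enough to any
    first keyword in the set (matching succeeds), else False."""
    return any(np.linalg.norm(features - tpl) < MATCH_THRESHOLD
               for tpl in FIRST_KEYWORDS.values())
```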
If the first voice is determined to contain the first keyword, the electronic device is awakened and can receive the next voice of the user. Further, the electronic device acquires a second voice of the user, and in response to the second voice, executes a second action corresponding to the second voice. Wherein the second speech may be of the first type or of the second type.
Illustratively, a first keyword "pangpang" is preset in the electronic device. The electronic device receives the voice "pangpang" to determine that the voice "pangpang" belongs to the first type, and in response to the voice "pangpang", the electronic device wakes up to receive the next voice of the user. The electronic equipment further receives the voice of the user to turn on the air conditioner, and then corresponding actions are executed in response to the voice to turn on the air conditioner: the air conditioner is turned on.
For example, the DSP of the electronic device 100 in fig. 2 determines that the received speech is "pangpang", performs speech feature extraction on the speech "pangpang", and performs matching according to the extracted speech feature and the keyword detection model. If the match is successful, it is determined that the voice "pangpang" includes the first keyword. For example, the DSP may also hand the received speech to the AP, extract speech features of the received speech by the AP, and perform matching according to the extracted speech features and the keyword detection model. For another example, the DSP may also extract the voice feature of the voice, transfer the voice feature of the voice to the AP, and perform matching according to the keyword detection model by the AP according to the voice feature. This is not limited in the embodiments of the present application.
According to the voice control method provided in the embodiments of the present application, after the electronic device receives a voice, if the voice is determined to be of the first type, a corresponding action can be executed in response to it. Voices belonging to the first type may be voices rarely used in daily life, such as the special speech described in the embodiments of the present application. On one hand, because special speech is rarely used in daily life, it is not easily triggered by mistake, which reduces the probability that the electronic device is woken up accidentally; on the other hand, control instruction matching is performed only when the voice belongs to the first type, which reduces the number of matching operations, the occupancy of system resources, and the energy consumption.
Further, if, at S102, the electronic device determines that the first voice is of the second type, the voice control method provided in the embodiments of the present application may further include:
S104, the electronic device determines whether the first voice belonging to the second type includes a second keyword. If the electronic device determines that the first voice includes the second keyword, S105 is executed: the electronic device acquires a second voice of the user and, in response to the second voice, executes a second action corresponding to the second voice. If the electronic device determines that the first voice does not include the second keyword, the electronic device does not execute any action in response to the voice.
In one implementation, a second keyword set is preset in the electronic device. The second keyword set includes one or more second keywords, for example, "Xiao Yi" and "Hello Xiao Yi". The electronic device constructs a wake-up word detection model according to the second keyword set. After receiving the first voice, the electronic device performs speech feature extraction on the first voice and matches the extracted speech features against the wake-up word detection model. If the speech features of the first voice can be matched with a second keyword in the second keyword set according to the wake-up word detection model, it is determined that the first voice includes the second keyword; if they cannot be matched with any second keyword in the set, it is determined that the first voice does not include the second keyword. The electronic device may use a method in the prior art to construct the wake-up word detection model according to the second keyword set, and use a method in the prior art to perform speech feature extraction, which are not described herein again.
In another implementation, a second keyword set is preset in the electronic device. The second keyword set includes one or more second keywords, for example, "weather", "navigation", and the like. After receiving the first voice, the electronic device recognizes the semantics of the first voice and determines, according to the semantics, whether the first voice includes a second keyword.
S105, the electronic device acquires the second voice of the user and, in response to the second voice, executes a second action corresponding to the second voice.
In one implementation, the second keyword is a wake-up word: the electronic device is awakened in response to the first voice and can receive the voice the user outputs next. Acquiring the second voice of the user means receiving the voice output after the first voice, and the second action corresponding to that second voice is executed in response to it. For example, if the smart device in fig. 1 receives the voice "xiaozhi" and determines that the voice belongs to the second type and includes the second keyword "xiaozhi", the smart device is awakened in response to the voice "xiaozhi" and can receive the user's next voice. Similarly, if a smart TV receives the voice "hello tv" and determines that the voice belongs to the second type and includes the second keyword "hello tv", the smart TV is awakened in response to the voice "hello tv" and can receive the user's next voice. Further, when the smart TV receives the voice "watch sports channel", it switches to the sports channel in response to that voice.
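This wake-then-command flow can be sketched as a two-state machine. The wake words, the one-command-per-wake policy, and the use of text transcripts in place of recognized speech are all illustrative assumptions, not the patent's implementation.

```python
class WakeWordAssistant:
    """Two-state sketch: asleep until a wake word arrives, then the next
    voice is executed as a command. Wake words are assumed examples."""

    WAKE_WORDS = {"xiaozhi", "hello tv"}

    def __init__(self):
        self.awake = False
        self.executed = []  # record of executed commands, for illustration

    def on_voice(self, text: str):
        text = text.strip().lower()
        if not self.awake:
            if text in self.WAKE_WORDS:
                self.awake = True   # woken up; wait for the next voice
            return                  # non-wake speech is ignored while asleep
        self.executed.append(text)  # second voice: execute the matching action
        self.awake = False          # go back to sleep after one command
```

A command heard while the device is asleep is ignored, which is exactly the false-trigger behavior the first-type path is meant to avoid for rarely spoken sounds.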
For a specific method for the electronic device to execute a corresponding action in response to the voice, reference may be made to specific descriptions in the conventional technology, which is not described herein again in this embodiment of the present application.
In another implementation, the first voice includes the second voice: acquiring the second voice of the user means extracting the second voice from the first voice, and the second action corresponding to the second voice is executed in response to it. For example, if the smart device receives the voice "navigate to library" and determines that the voice belongs to the second type and includes the second keyword "navigate", it determines that the second voice is "navigate to library" and starts navigation to the library in response to that voice. As another example, if the smart device receives the voice "Xiao Yi, turn on the light" and determines that the voice belongs to the second type and includes the second keyword "Xiao Yi", it determines that the second voice is "turn on the light" and turns on the light in response to that voice.
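The two cases above (an action keyword that stays part of the command versus a name keyword that is stripped away) can be sketched as follows. The keyword set, the per-keyword modes, and the use of plain text in place of recognized speech are illustrative assumptions.

```python
# Hypothetical second-keyword set. "strip" keywords name the assistant and are
# removed from the command; "keep" keywords are action words that stay in it.
SECOND_KEYWORDS = {"xiao yi": "strip", "navigate": "keep"}

def second_voice(first_voice: str):
    """Return the second voice embedded in the first voice, or None."""
    text = " ".join(first_voice.lower().split())  # normalize case/whitespace
    for keyword, mode in SECOND_KEYWORDS.items():
        if keyword in text:
            if mode == "strip":
                return text.replace(keyword, "").strip()
            return text  # the keyword is part of the command itself
    return None  # no second keyword: no action is taken
```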
According to the voice control method provided by the embodiments of the present application, after the first voice is received, the category of the first voice is determined. If the first voice is determined to be of the second type and includes the second keyword, a second action corresponding to the second voice may be executed; if the first voice is determined to be of the first type, a first action corresponding to the first voice may be executed. In this way, an action whose trigger phrase is easily spoken by mistake in daily life can instead be bound to a voice belonging to the first type; for example, a smart speaker may be configured to turn on a light in response to a whistle. This reduces the false wake-up rate of the electronic device without affecting the convenience of voice control for the user.
In some embodiments, the voice control method provided in the embodiments of the present application may also be used to provide customized services. For example, the electronic device may be unlocked according to a special voice input by the user, or a special voice input by the user may serve as a password for logging in to the electronic device, allowing different users to log in to their own accounts. In S103, the electronic device may, according to user operations, set the special voice input by the user as a control instruction in the control instruction set and establish the correspondence between the control instruction and the action to be executed. In this way, the playability and privacy of the electronic device can be improved.
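A minimal sketch of such user-defined commands, mirroring the control-instruction/action correspondence of S103: a plain string "template" stands in for a stored voice feature template of the user's recorded special voice, and all names are assumptions rather than the patent's implementation.

```python
class CustomVoiceCommands:
    """Sketch of user-defined special-voice commands: the user's recorded
    voice (here a string template) is stored as a control instruction and
    bound to an action name."""

    def __init__(self):
        self.table = {}  # control instruction template -> action name

    def register(self, template: str, action: str):
        """Store the user's special voice as a control instruction."""
        self.table[template] = action

    def execute(self, template: str):
        """Return the bound action if the voice matches, else None."""
        return self.table.get(template)
```

For instance, a user's recorded whistle template could be bound to an "unlock" action, giving each user a private, voice-based credential.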
In some embodiments, as shown in fig. 7, a voice control method provided in an embodiment of the present application may include:
The electronic device performs voice capture. For example, the electronic device periodically collects sounds in the environment and preprocesses the collected sounds to obtain the user's voice signal.
The electronic device determines whether the captured voice is a wake-up word. The electronic device may make this determination using a method in the prior art, which is not described herein again.
If the captured voice is determined to be a wake-up word, the electronic device enters the wake-up processing flow: in response to the user's voice, the electronic device wakes up and can receive the user's next voice. The electronic device then receives the user's next voice and executes the corresponding action in response to it.
If the captured voice is determined not to be a wake-up word, the electronic device determines whether the captured voice is normal speech according to the ratio of unvoiced to voiced content in the voice. If the electronic device determines that the unvoiced content ratio in the captured voice is greater than a first threshold, the voice is determined to be a special voice; if the unvoiced content ratio is less than or equal to the first threshold, the voice is determined to be normal speech.
In one possibility, the electronic device determines that the captured voice is normal speech, does not execute any action in response to it, and continues voice capture.
In one possibility, the electronic device determines that the captured voice is a special voice, matches the special voice one by one against the voice feature models of the instruction set, and confirms whether the voice corresponds to an instruction. For example, an instruction set is configured in the electronic device, and voice feature models are established for the instruction set. After receiving the special voice, the electronic device matches it one by one against the voice feature models of the instruction set. If the matching fails, it is confirmed that the special voice does not correspond to any instruction, and the electronic device continues voice capture. If the matching succeeds, it is confirmed that the special voice corresponds to an instruction, and the action corresponding to that instruction is executed. For example, instruction 1 corresponds to action 1, instruction 2 to action 2, instruction 3 to action 3, and so on; if the captured special voice is confirmed to correspond to instruction 1, the electronic device executes action 1.
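The one-by-one matching against the instruction set can be sketched as follows. The instruction names, the actions, and the string-equality "feature match" are illustrative placeholders for real voice feature models, not the patent's implementation.

```python
def build_instruction_set():
    """Assumed instruction set: feature 'model' -> action. String equality
    stands in for matching against a real voice feature model."""
    performed = []
    table = {
        "whistle": lambda: performed.append("turn on light"),   # instruction 1
        "hiss":    lambda: performed.append("turn off light"),  # instruction 2
    }
    return table, performed

def handle_special_voice(features, table):
    """Match the special voice one by one against the instruction set."""
    for instruction_model, action in table.items():
        if features == instruction_model:  # placeholder feature match
            action()                       # execute the mapped action
            return True
    return False  # no instruction matched; caller keeps capturing audio
```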
It is understood that, in order to realize the above functions, the electronic device includes corresponding hardware structures and/or software modules for performing each function. Those skilled in the art will readily appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented as hardware or a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends on the particular application and design constraints of the technical solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementations should not be considered beyond the scope of the embodiments of the present application.
In the embodiments of the present application, the electronic device may be divided into functional modules according to the above method examples; for example, each functional module may correspond to one function, or two or more functions may be integrated into one processing module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. It should be noted that the division of modules in the embodiments of the present application is schematic and is merely a logical function division; there may be other division manners in actual implementation.
In the case of an integrated unit, fig. 8 shows a schematic view of a possible configuration of the electronic device involved in the above-described embodiment. The electronic device 800 includes: a processing unit 801 and a storage unit 802.
The processing unit 801 is configured to control and manage operations of the electronic device 800. For example, it can be used to execute the processing steps of S101, S102, S103, S104, and S105 in fig. 5; or, the method may be used to execute the relevant steps in fig. 7, such as receiving the voice of the user, determining the type of the voice, waking up word matching, instruction matching, executing a wake-up processing flow, and executing the action corresponding to the instruction; and/or other processes for the techniques described herein.
The storage unit 802 is used to store program codes and data of the electronic device 800. For example, it may be used to store instruction sets, control instruction and action relationship tables, first keyword sets, second keyword sets, and the like.
Of course, the unit modules in the electronic device 800 include, but are not limited to, the processing unit 801 and the storage unit 802. For example, an audio unit, a communication unit, and the like may also be included in the electronic device 800. The audio unit is used for collecting voice sent by a user and playing the voice. The communication unit is used to support communication between the electronic device 800 and other apparatuses.
The processing unit 801 may be a processor or a controller, such as a central processing unit (CPU), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. The processor may include an application processor and a baseband processor, and may implement or execute the various illustrative logical blocks, modules, and circuits described in connection with this disclosure. The processor may also be a combination of computing devices, for example, a combination of one or more microprocessors, or of a DSP and a microprocessor. The storage unit 802 may be a memory. The audio unit may include a microphone, a speaker, a receiver, and the like. The communication unit may be a transceiver, a transceiver circuit, a communication interface, or the like.
For example, the processing unit 801 is a processor (such as the processor 110 shown in fig. 2 or the processor 211 shown in fig. 3), and the storage unit 802 may be a memory (such as the internal memory 121 shown in fig. 2 or the memory 212 shown in fig. 3). The audio unit may include a microphone (e.g., the microphone 170C shown in fig. 2 or the microphone 221 shown in fig. 3), a speaker (e.g., the speaker 170A shown in fig. 2), and a receiver (e.g., the receiver 170B shown in fig. 2). The communication unit includes a mobile communication module, such as the mobile communication module 150 shown in fig. 2, and a wireless communication module, such as the wireless communication module 160 shown in fig. 2. The mobile communication module and the wireless communication module may be collectively referred to as a communication interface. The electronic device 800 provided by the embodiment of the application may be the electronic device 100 shown in fig. 2 or the electronic device 200 shown in fig. 3. Wherein the processor, the memory, the communication interface, etc. may be coupled together, for example by a bus connection.
The embodiments of the present application further provide a computer storage medium in which computer program code is stored; when a processor executes the computer program code, the electronic device executes the relevant method steps in fig. 5 or fig. 7 to implement the methods in the foregoing embodiments.
The embodiments of the present application also provide a computer program product, which when run on a computer causes the computer to execute the relevant method steps in fig. 5 or fig. 7 to implement the method in the above embodiments.
In addition, the electronic device 800, the computer storage medium, or the computer program product provided in the embodiment of the present application are all configured to execute the corresponding method provided above, so that the beneficial effects achieved by the electronic device 800, the computer storage medium, or the computer program product may refer to the beneficial effects in the corresponding method provided above, and are not described herein again.
Through the above description of the embodiments, it is clear to those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional modules is merely used as an example, and in practical applications, the above function distribution may be completed by different functional modules according to needs, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the above described functions.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described device embodiments are merely illustrative, and for example, the division of the modules or units is only one logical functional division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another device, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may be one physical unit or a plurality of physical units, that is, may be located in one place, or may be distributed in a plurality of different places. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a readable storage medium. Based on such an understanding, the technical solutions of the embodiments of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for enabling a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk.
The above description is only an embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions within the technical scope of the present disclosure should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (12)

1. A method for voice control, the method comprising:
the method comprises the steps that electronic equipment obtains a first voice of a user;
if the first voice is determined to be of a first type, executing, by the electronic device in response to the first voice, a first action corresponding to the first voice; wherein the first type means that the ratio of unvoiced content in the voice is greater than a first threshold.
2. The method of claim 1, wherein the electronic device performs a first action corresponding to the first voice, comprising:
and if the first voice is determined to contain the control instruction, the electronic equipment executes a first action corresponding to the control instruction.
3. The method of claim 1, wherein the electronic device performs a first action corresponding to the first voice, comprising:
the electronic equipment acquires the semantics of the first voice;
the electronic device performs a first action corresponding to the semantics of the first speech.
4. The method of claim 1, wherein the electronic device performs a first action corresponding to the first voice, comprising:
if the first voice is determined to contain the first keyword, the electronic equipment acquires a second voice of the user; and, in response to the second voice, performing a second action corresponding to the second voice.
5. The method according to any one of claims 1-4, further comprising:
if the first voice is determined to be of a second type, acquiring, by the electronic device, a second voice of the user, and executing, in response to the second voice, a second action corresponding to the second voice; wherein the second type means that the ratio of unvoiced content in the voice is less than or equal to the first threshold.
6. The method according to any one of claims 1 to 5, wherein the determining that the first speech is of a first type specifically comprises:
a digital signal processor DSP of the electronic device determines that the first voice is of a first type.
7. The method according to claim 2, wherein the determining that the first speech includes a control instruction specifically includes:
a Digital Signal Processor (DSP) of the electronic equipment extracts voice features of the first voice, matches the extracted voice features according to a control instruction detection model, and determines that the first voice contains a control instruction; or,
a Digital Signal Processor (DSP) of the electronic equipment transmits the first voice to an Application Processor (AP) of the electronic equipment; the AP extracts voice features of the first voice, matches the first voice according to the extracted voice features and a control instruction detection model, and determines that the first voice contains a control instruction; or,
a Digital Signal Processor (DSP) of the electronic equipment performs voice feature extraction on the first voice and transmits the extracted voice feature to an Application Processor (AP) of the electronic equipment; and the AP performs matching according to the extracted voice features and a control instruction detection model, and determines that the first voice contains a control instruction.
8. The method according to claim 3, wherein the obtaining the semantics of the first speech specifically comprises:
a Digital Signal Processor (DSP) of the electronic equipment transmits the first voice to an Application Processor (AP) of the electronic equipment; the AP extracts a voice characteristic value of the first voice and obtains the semantics of the first voice according to the voice characteristic value; or,
a Digital Signal Processor (DSP) of the electronic equipment extracts a voice characteristic value of the first voice and transmits the voice characteristic value to an Application Processor (AP) of the electronic equipment; and the AP acquires the semantics of the first voice according to the voice characteristic value.
9. The method according to claim 4, wherein the determining that the first speech includes a first keyword specifically includes:
a Digital Signal Processor (DSP) of the electronic equipment extracts voice features of the first voice, matches the extracted voice features according to a keyword detection model and determines that the first voice contains a first keyword; or,
a Digital Signal Processor (DSP) of the electronic equipment transmits the first voice to an Application Processor (AP) of the electronic equipment; the AP extracts voice features of the first voice, matches the first voice according to the extracted voice features and a keyword detection model, and determines that the first voice contains a first keyword; or,
a Digital Signal Processor (DSP) of the electronic equipment performs voice feature extraction on the first voice and transmits the extracted voice feature to an Application Processor (AP) of the electronic equipment; and the AP performs matching according to the extracted voice characteristics and a keyword detection model, and determines that the first voice contains a first keyword.
10. An electronic device, characterized in that the electronic device comprises: a processor and a memory; the memory is coupled with the processor; the memory for storing computer program code; the computer program code comprising computer instructions which, when executed by the processor, cause the electronic device to perform the method of any of claims 1-9.
11. A computer storage medium comprising computer instructions that, when executed on an electronic device, cause the electronic device to perform the method of any one of claims 1-9.
12. A computer program product, characterized in that, when the computer program product is run on a computer, the computer program product causes the computer to execute the method according to any one of claims 1-9.
CN201910181787.9A 2019-03-11 2019-03-11 A kind of sound control method and device Pending CN110070863A (en)


Publications (1)

Publication Number Publication Date
CN110070863A true CN110070863A (en) 2019-07-30


Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112399004A (en) * 2019-08-14 2021-02-23 原相科技股份有限公司 Sound output adjusting method and electronic device for executing adjusting method
CN112532266A (en) * 2020-12-07 2021-03-19 苏州思必驰信息科技有限公司 Intelligent helmet and voice interaction control method of intelligent helmet
CN112882394A (en) * 2021-01-12 2021-06-01 北京小米松果电子有限公司 Device control method, control apparatus, and readable storage medium
CN113516977A (en) * 2021-03-15 2021-10-19 南京每深智能科技有限责任公司 Keyword recognition method and system
CN114071539A (en) * 2020-08-10 2022-02-18 中国电信股份有限公司 Voice quality evaluation method and device
WO2022151651A1 (en) * 2021-01-13 2022-07-21 神盾股份有限公司 Speech assistant system
US11513767B2 (en) 2020-04-13 2022-11-29 Yandex Europe Ag Method and system for recognizing a reproduced utterance
CN115562054A (en) * 2022-09-28 2023-01-03 北京小米移动软件有限公司 Equipment control method, device, readable storage medium and chip
US11610596B2 (en) 2020-09-17 2023-03-21 Airoha Technology Corp. Adjustment method of sound output and electronic device performing the same
CN116386676A (en) * 2023-06-02 2023-07-04 北京探境科技有限公司 Voice wake-up method, voice wake-up device and storage medium
US11915711B2 (en) 2021-07-20 2024-02-27 Direct Cursus Technology L.L.C Method and system for augmenting audio signals
CN118471223A (en) * 2024-07-09 2024-08-09 深圳市天趣星空科技有限公司 Intelligent glasses voice control awakening system and method
CN120412566A (en) * 2024-08-31 2025-08-01 华为技术有限公司 Voice recognition method and electronic device

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103595869A (en) * 2013-11-15 2014-02-19 华为终端有限公司 Terminal voice control method and device and terminal
CN103956164A (en) * 2014-05-20 2014-07-30 苏州思必驰信息科技有限公司 Sound awakening method and system
CN104064196A (en) * 2014-06-20 2014-09-24 哈尔滨工业大学深圳研究生院 Method for improving speech recognition accuracy on basis of voice leading end noise elimination
CN104112446A (en) * 2013-04-19 2014-10-22 华为技术有限公司 Breathing voice detection method and device
CN104143342A (en) * 2013-05-15 2014-11-12 腾讯科技(深圳)有限公司 Voiceless sound and voiced sound judging method and device and voice synthesizing system
CN104969289A (en) * 2013-02-07 2015-10-07 苹果公司 Voice triggers for digital assistants
CN105654943A (en) * 2015-10-26 2016-06-08 乐视致新电子科技(天津)有限公司 Voice wakeup method, apparatus and system thereof
CN105763698A (en) * 2016-03-31 2016-07-13 深圳市金立通信设备有限公司 Voice processing method and terminal
US9852278B2 (en) * 2012-02-24 2017-12-26 Samsung Electronics Co., Ltd. Method and apparatus for controlling lock/unlock state of terminal through voice recognition
CN107600075A (en) * 2017-08-23 2018-01-19 深圳市沃特沃德股份有限公司 The control method and device of onboard system
CN107783508A (en) * 2016-08-26 2018-03-09 佛山市顺德区美的电热电器制造有限公司 Household electrical appliance and its control method and device
CN108320733A (en) * 2017-12-18 2018-07-24 上海科大讯飞信息科技有限公司 Voice data processing method and device, storage medium, electronic equipment
CN108417008A (en) * 2017-02-09 2018-08-17 深圳市轻生活科技有限公司 Infrared control method based on speech recognition and system

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9852278B2 (en) * 2012-02-24 2017-12-26 Samsung Electronics Co., Ltd. Method and apparatus for controlling lock/unlock state of terminal through voice recognition
CN104969289A (en) * 2013-02-07 2015-10-07 苹果公司 Voice triggers for digital assistants
CN104112446A (en) * 2013-04-19 2014-10-22 华为技术有限公司 Breathing voice detection method and device
CN104143342A (en) * 2013-05-15 2014-11-12 腾讯科技(深圳)有限公司 Voiceless sound and voiced sound judging method and device and voice synthesizing system
CN103595869A (en) * 2013-11-15 2014-02-19 华为终端有限公司 Terminal voice control method and device and terminal
CN103956164A (en) * 2014-05-20 2014-07-30 苏州思必驰信息科技有限公司 Sound awakening method and system
CN104064196A (en) * 2014-06-20 2014-09-24 哈尔滨工业大学深圳研究生院 Method for improving speech recognition accuracy on basis of voice leading end noise elimination
CN105654943A (en) * 2015-10-26 2016-06-08 乐视致新电子科技(天津)有限公司 Voice wakeup method, apparatus and system thereof
CN105763698A (en) * 2016-03-31 2016-07-13 深圳市金立通信设备有限公司 Voice processing method and terminal
CN107783508A (en) * 2016-08-26 2018-03-09 佛山市顺德区美的电热电器制造有限公司 Household electrical appliance and its control method and device
CN108417008A (en) * 2017-02-09 2018-08-17 深圳市轻生活科技有限公司 Infrared control method based on speech recognition and system
CN107600075A (en) * 2017-08-23 2018-01-19 深圳市沃特沃德股份有限公司 The control method and device of onboard system
CN108320733A (en) * 2017-12-18 2018-07-24 上海科大讯飞信息科技有限公司 Voice data processing method and device, storage medium, electronic equipment

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112399004B (en) * 2019-08-14 2024-05-24 达发科技股份有限公司 Sound output adjustment method and electronic device for executing the adjustment method
CN112399004A (en) * 2019-08-14 2021-02-23 原相科技股份有限公司 Sound output adjusting method and electronic device for executing adjusting method
US11513767B2 (en) 2020-04-13 2022-11-29 Yandex Europe Ag Method and system for recognizing a reproduced utterance
CN114071539A (en) * 2020-08-10 2022-02-18 中国电信股份有限公司 Voice quality evaluation method and device
US11610596B2 (en) 2020-09-17 2023-03-21 Airoha Technology Corp. Adjustment method of sound output and electronic device performing the same
CN112532266A (en) * 2020-12-07 2021-03-19 苏州思必驰信息科技有限公司 Intelligent helmet and voice interaction control method of intelligent helmet
US11862158B2 (en) 2021-01-12 2024-01-02 Beijing Xiaomi Pinecone Electronics Co., Ltd. Method and apparatus for controlling device, and readable storage medium
CN112882394A (en) * 2021-01-12 2021-06-01 北京小米松果电子有限公司 Device control method, control apparatus, and readable storage medium
CN112882394B (en) * 2021-01-12 2024-08-13 北京小米松果电子有限公司 Equipment control method, control device and readable storage medium
WO2022151651A1 (en) * 2021-01-13 2022-07-21 神盾股份有限公司 Speech assistant system
CN113516977A (en) * 2021-03-15 2021-10-19 南京每深智能科技有限责任公司 Keyword recognition method and system
US11915711B2 (en) 2021-07-20 2024-02-27 Direct Cursus Technology L.L.C Method and system for augmenting audio signals
CN115562054A (en) * 2022-09-28 2023-01-03 北京小米移动软件有限公司 Equipment control method, device, readable storage medium and chip
US12555568B2 (en) 2022-09-28 2026-02-17 Beijing Xiaomi Mobile Software Co., Ltd. Device control method and apparatus, readable storage medium and chip
CN116386676A (en) * 2023-06-02 2023-07-04 北京探境科技有限公司 Voice wake-up method, voice wake-up device and storage medium
CN116386676B (en) * 2023-06-02 2023-08-29 北京探境科技有限公司 Voice wake-up method, voice wake-up device and storage medium
CN118471223A (en) * 2024-07-09 2024-08-09 深圳市天趣星空科技有限公司 Intelligent glasses voice control awakening system and method
CN118471223B (en) * 2024-07-09 2024-10-22 深圳市天趣星空科技有限公司 Intelligent glasses voice control awakening system and method
CN120412566A (en) * 2024-08-31 2025-08-01 华为技术有限公司 Voice recognition method and electronic device

Similar Documents

Publication Publication Date Title
CN110070863A (en) A kind of sound control method and device
CN112712803B (en) A voice wake-up method and electronic device
CN110134316B (en) Model training method, emotion recognition method, and related device and equipment
CN111819533B (en) Method for triggering electronic equipment to execute function and electronic equipment
CN111742361B (en) Method for updating wake-up voice of voice assistant by terminal and terminal
CN112289313A (en) A voice control method, electronic device and system
CN112334977B (en) Voice recognition method, wearable device and system
WO2022033556A1 (en) Electronic device and speech recognition method therefor, and medium
US12557026B2 (en) Voice wake-up method, electronic device, wearable device, and system
CN113838478B (en) Abnormal event detection method, device and electronic device
CN114242037A (en) Virtual character generation method and device
CN112527093A (en) Gesture input method and electronic equipment
WO2022199405A1 (en) Voice control method and apparatus
CN114521878A (en) Sleep evaluation method, electronic device and storage medium
WO2022022585A1 (en) Electronic device and audio noise reduction method and medium therefor
CN114765026A (en) Voice control method, device and system
CN118450296A (en) Equipment operation method and device and electronic equipment
CN113506566B (en) Sound detection model training method, data processing method and related device
WO2022161077A1 (en) Speech control method, and electronic device
CN114125143B (en) Voice interaction method and electronic equipment
CN114120987B (en) Voice wake-up method, electronic equipment and chip system
CN113380240B (en) Voice interaction method and electronic equipment
CN113572798B (en) Device control method, system, device, and storage medium
CN115731923A (en) Command word response method, control equipment and device
CN109285563B (en) Voice data processing method and device in online translation process

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190730
