CN117095683A - Speech recognition processing method, system, device and storage medium - Google Patents

Speech recognition processing method, system, device and storage medium

Info

Publication number
CN117095683A
Authority
CN
China
Prior art keywords
online
command
voice
offline
state
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311096724.6A
Other languages
Chinese (zh)
Inventor
姚光乐
王宇
王璞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Yunhai Guochuang Cloud Computing Equipment Industry Innovation Center Co Ltd
Original Assignee
Shandong Yunhai Guochuang Cloud Computing Equipment Industry Innovation Center Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Yunhai Guochuang Cloud Computing Equipment Industry Innovation Center Co Ltd
Priority to CN202311096724.6A
Publication of CN117095683A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/28 Constructional details of speech recognition systems
    • G10L15/34 Adaptation of a single recogniser for parallel processing, e.g. by use of multiple processors or cloud computing
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/1822 Parsing for meaning understanding
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Mathematical Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The application relates to a speech recognition processing method, system, device, and storage medium, and belongs to the field of speech processing. The method comprises: performing offline speech recognition and semantic understanding on voice information to obtain an offline command; sending the voice information to a cloud speech recognition server for online speech recognition to obtain a text form of the voice information, and sending the text form to a semantic understanding server for semantic analysis and understanding to obtain an online command; and arbitrating the received online and offline commands according to whether the network is online or offline, so as to control the online and offline speech processing and obtain a final command. Because the switching and cooperation between online speech recognition and semantic processing and offline speech recognition and semantic understanding are decided dynamically and in real time, recognition accuracy is preserved and the efficiency of speech recognition processing under a weak network is improved.

Description

Speech recognition processing method, system, device and storage medium
Technical Field
The present application relates to the field of speech recognition processing technologies, and in particular, to a speech recognition processing method, system, device, and storage medium.
Background
With the continuous development of Internet of Things, artificial intelligence, and speech technologies, speech recognition has become one of the key technologies for making terminal devices intelligent. Traditional speech recognition processing falls into two modes: in the first, speech is sent only to a cloud server for online speech recognition and semantic understanding; in the second, speech is recognized offline using only a local recognition engine. Each mode has its own drawback. In the first mode, the device cannot be controlled by voice when it is not connected to a network. In the second mode, the local recognition engine is limited by memory and cannot support flexible voice commands; it can recognize only offline command words in a few sentence patterns, has low applicability, and cannot meet the needs of intelligent chat. To avoid the drawbacks of both modes, the prior art, for example application 202111659197.6, chooses between online and offline speech recognition depending on whether the network is connected. However, intelligent terminal devices such as automobiles and home appliances are prone to network instability and network congestion, which place the device in a weak-network environment where the voice response time becomes very long; when a network fault occurs between the device and the server, there may be no response at all. How to apply different speech recognition modes to devices in different network states, including good-network, no-network, and weak-network scenarios, so as to improve the accuracy and response speed of speech recognition as the network conditions of the device change dynamically, is a problem that urgently needs to be solved.
Disclosure of Invention
To solve, or at least partially solve, the above technical problems, the present application provides a speech recognition processing method, system, device, and storage medium.
In a first aspect, the present application provides a speech recognition processing method, including: acquiring voice information; performing offline speech recognition and semantic understanding on the voice information to obtain an offline command; sending the voice information to a cloud speech recognition server for online speech recognition to obtain a text form of the voice information, and sending the text form to a semantic understanding server for semantic analysis and understanding to obtain an online command; and arbitrating the received online and offline commands according to whether the network is online or offline, so as to control the online speech processing and the offline speech processing and obtain a final command.
Further, arbitrating the received online and offline commands according to whether the network is online or offline and according to the stage-one weak-network signal, the stage-one stop-interaction signal, the stage-two weak-network signal, and the stage-two stop-interaction signal generated during online speech processing, so as to control the online and offline speech processing and obtain the final command, comprises the following steps:
if the command output receives the offline command and the network is online, the offline command is temporarily stored; if the network is offline, it is further determined whether the state machine is in the wake-up state or the VAD processing state, and if so the offline command is executed, otherwise the current state is kept;
if the command output receives the stage-one weak-network signal, it is determined whether an offline command has been received; if so, the offline command is executed and the idle state is then entered, otherwise the current state is kept;
if the command output receives the stage-one stop-interaction signal, it is determined whether the state machine is in the online speech recognition state; if so, the idle state is entered, otherwise the current state is kept;
if the command output receives the online command, it is determined whether the state machine is in the online semantic understanding state; if so, the online command is executed and the idle state is then entered;
if the command output receives the stage-two weak-network signal, it is determined whether an offline command has been received; if so, the offline command is executed and the idle state is then entered, otherwise the current state is kept;
if the command output receives the stage-two stop-interaction signal, it is determined whether the state machine is in the online semantic understanding state; if so, the idle state is entered, otherwise the current state is kept.
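The six arbitration rules above can be read as an event handler on the command output. The following Python sketch is illustrative only and is not taken from the application: the names `State`, `Event`, `CommandOutput`, and `handle` are hypothetical, commands are modeled as plain strings, and clearing the stored offline command after it runs is an added assumption. The source does not name a follow-up state for the branch in which an offline command is executed while the network is offline, so the sketch leaves the state unchanged there.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import List, Optional


class State(Enum):
    IDLE = auto()
    WAKE = auto()
    VAD = auto()
    ONLINE_ASR = auto()   # online speech recognition state
    ONLINE_NLU = auto()   # online semantic understanding state


class Event(Enum):
    OFFLINE_COMMAND = auto()
    ONLINE_COMMAND = auto()
    STAGE1_WEAK = auto()
    STAGE1_STOP = auto()
    STAGE2_WEAK = auto()
    STAGE2_STOP = auto()


@dataclass
class CommandOutput:
    state: State
    network_online: bool
    stored_offline: Optional[str] = None
    executed: List[str] = field(default_factory=list)

    def _run(self, command: Optional[str]) -> None:
        # Stand-in for actually executing a command on the device.
        self.executed.append(command)

    def handle(self, event: Event, command: Optional[str] = None) -> None:
        if event is Event.OFFLINE_COMMAND:
            if self.network_online:
                # Network online: hold the offline command and wait for the online one.
                self.stored_offline = command
            elif self.state in (State.WAKE, State.VAD):
                # Network offline: execute directly (next state not given in the source).
                self._run(command)
        elif event in (Event.STAGE1_WEAK, Event.STAGE2_WEAK):
            if self.stored_offline is not None:
                self._run(self.stored_offline)
                self.stored_offline = None  # assumption: avoid running it twice
                self.state = State.IDLE
        elif event is Event.STAGE1_STOP:
            if self.state is State.ONLINE_ASR:
                self.state = State.IDLE
        elif event is Event.ONLINE_COMMAND:
            if self.state is State.ONLINE_NLU:
                self._run(command)
                self.state = State.IDLE
        elif event is Event.STAGE2_STOP:
            if self.state is State.ONLINE_NLU:
                self.state = State.IDLE
```

For example, with the network online and the state machine in the online speech recognition state, an arriving offline command is only stored; a later stage-one weak-network signal then executes it and returns the state machine to idle.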
Further, if the command output receives the text form of the voice information obtained by online speech recognition, offline semantic understanding is performed on that text to obtain a command, which replaces the existing offline command or, if none exists yet, serves as the offline command.
Further, a network flag indicating whether the network is online or offline and an offline recognition flag indicating whether an offline command has been generated are preset; whether the network is online is detected, and the network flag is set to 1 if so and to 0 otherwise; whether an offline command has been obtained, either by offline speech recognition or by offline semantic understanding of the text obtained by online speech recognition, is detected, and the offline recognition flag is set to 1 if so and to 0 otherwise.
Further, during online speech recognition, it is detected whether the network request time of the online speech recognition request sent to the cloud speech recognition server is greater than or equal to a first request time threshold and less than a second request time threshold, and if so, a stage-one weak-network signal is generated;
during online speech recognition, it is detected whether the network request time of the online speech recognition request sent to the cloud speech recognition server is greater than or equal to the second request time threshold, and if so, a stage-one stop-interaction signal is generated;
during online semantic understanding, it is detected whether the network request time of the online semantic analysis and understanding request sent to the semantic understanding server is greater than or equal to the first request time threshold and less than the second request time threshold, and if so, a stage-two weak-network signal is generated;
during online semantic understanding, it is detected whether the network request time of the online semantic analysis and understanding request sent to the semantic understanding server is greater than or equal to the second request time threshold, and if so, a stage-two stop-interaction signal is generated.
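The four conditions above amount to classifying the elapsed request time against two thresholds, once per online stage. A minimal Python sketch follows, assuming times are measured in seconds; the names `Stage`, `Signal`, and `classify_request_time` are hypothetical, and the application gives no concrete threshold values.

```python
from enum import Enum
from typing import Optional


class Stage(Enum):
    ONLINE_ASR = 1  # stage one: online speech recognition request
    ONLINE_NLU = 2  # stage two: online semantic understanding request


class Signal(Enum):
    STAGE1_WEAK = "stage-one weak-network"
    STAGE1_STOP = "stage-one stop-interaction"
    STAGE2_WEAK = "stage-two weak-network"
    STAGE2_STOP = "stage-two stop-interaction"


def classify_request_time(
    stage: Stage,
    elapsed: float,
    first_threshold: float,
    second_threshold: float,
) -> Optional[Signal]:
    """Return the signal implied by the elapsed request time, or None if the network is fine."""
    if not 0 <= first_threshold < second_threshold:
        raise ValueError("expected 0 <= first_threshold < second_threshold")
    if elapsed >= second_threshold:
        # Request has run too long: stop the interaction for this stage.
        return Signal.STAGE1_STOP if stage is Stage.ONLINE_ASR else Signal.STAGE2_STOP
    if elapsed >= first_threshold:
        # Between the two thresholds: the network is considered weak.
        return Signal.STAGE1_WEAK if stage is Stage.ONLINE_ASR else Signal.STAGE2_WEAK
    return None
```

With illustrative thresholds of 1.0 and 3.0 seconds, a speech recognition request that has been pending for 1.5 seconds yields the stage-one weak-network signal, and one pending for 3.0 seconds yields the stage-one stop-interaction signal.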
Still further, the wake-up state, the VAD processing state, the online speech recognition state, the online semantic understanding state, and the idle state belong to a finite state machine comprising:
an initial state, which the device is in when it has just been powered on and in which it is then initialized;
an idle state, which is entered from the initial state after the initialization of the device is completed;
a wake-up state, which is entered from the idle state after a user wakes up the device through a wake-up event;
an online speech recognition state and an offline speech recognition and understanding state, both of which are entered after VAD detects that the voice information has ended, wherein the online speech recognition state sends the voice information to the cloud speech recognition server for online recognition, obtains the text form of the voice information, and sends the text to the command output, after which the online semantic understanding state is entered; the online semantic understanding state sends the text form of the voice information to the semantic understanding server for semantic analysis and understanding to obtain an online command, and sends the online command to the command output; the offline speech recognition and understanding state performs speech recognition and semantic understanding on the voice information to obtain an offline command and sends the offline command to the command output, and its semantic understanding process also supports analyzing and understanding the text of the voice information obtained in the online speech recognition state to obtain the offline command;
a command output state, which arbitrates the received online and offline commands according to whether the network is online or offline and according to the stage-one weak-network signal, the stage-one stop-interaction signal, the stage-two weak-network signal, and the stage-two stop-interaction signal generated during online speech processing, so as to control the online and offline speech processing and obtain a final command.
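The state machine described above can be sketched as a transition table. Because the online and offline branches are entered together once the voice information ends, the Python sketch below tracks a set of active states rather than a single state; this modeling choice, the event names (`init_done`, `wake_event`, `listen`, `speech_end`, `text_ready`, `finish`), and the helper `step` are assumptions made for illustration, not part of the application.

```python
from enum import Enum, auto
from typing import Dict, FrozenSet, Tuple


class State(Enum):
    INITIAL = auto()
    IDLE = auto()
    WAKE = auto()
    VAD = auto()
    ONLINE_ASR = auto()
    ONLINE_NLU = auto()
    OFFLINE_ASR_NLU = auto()


# (current state, event) -> states entered; event names are hypothetical.
TRANSITIONS: Dict[Tuple[State, str], FrozenSet[State]] = {
    (State.INITIAL, "init_done"): frozenset({State.IDLE}),
    (State.IDLE, "wake_event"): frozenset({State.WAKE}),
    (State.WAKE, "listen"): frozenset({State.VAD}),
    # End of speech starts the online and offline branches together.
    (State.VAD, "speech_end"): frozenset({State.ONLINE_ASR, State.OFFLINE_ASR_NLU}),
    (State.ONLINE_ASR, "text_ready"): frozenset({State.ONLINE_NLU}),
    # "finish" stands for the command output deciding to return to idle.
    (State.ONLINE_ASR, "finish"): frozenset({State.IDLE}),
    (State.ONLINE_NLU, "finish"): frozenset({State.IDLE}),
    (State.OFFLINE_ASR_NLU, "finish"): frozenset({State.IDLE}),
}


def step(active: FrozenSet[State], event: str) -> FrozenSet[State]:
    """Apply one event to every active state; states without a matching rule stay put."""
    nxt = set()
    for state in active:
        nxt |= TRANSITIONS.get((state, event), frozenset({state}))
    return frozenset(nxt)
```

Starting from `frozenset({State.INITIAL})` and applying `init_done`, `wake_event`, `listen`, and `speech_end` in turn leaves the online speech recognition state and the offline speech recognition and understanding state active at the same time.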
Further, if the time VAD spends waiting to detect the start of speech exceeds a first threshold, it is considered that no speech was heard after wake-up and the idle state is entered; if the time VAD spends waiting to detect the end of speech exceeds a second threshold, it is considered that a VAD processing error has occurred and the idle state is entered.
In a second aspect, the present application provides a speech recognition processing system comprising: a voice input module, an online speech recognition module, an online semantic understanding module, a network state module, an offline speech recognition and understanding module, and a command output module; wherein,
the voice input module is used to acquire the voice information of a user and includes a wake-up function and a VAD detection function;
the offline speech recognition and understanding module is used to recognize and understand the voice information offline, or to perform offline understanding on the text of the voice information, which is obtained by the online speech recognition module and provided through the command output module, so as to obtain an offline command;
the online speech recognition module is used to send the voice information to the cloud speech recognition server to obtain a text form of the voice information;
the online semantic understanding module is used to send the text form of the voice information to the semantic understanding server for semantic analysis and understanding to obtain an online command;
the network state module generates a stage-one weak-network signal, a stage-one stop-interaction signal, a stage-two weak-network signal, and a stage-two stop-interaction signal according to the network request times of online speech recognition and online semantic understanding;
the command output module is used to arbitrate the received online and offline commands according to whether the network is online or offline and according to the stage-one weak-network signal, the stage-one stop-interaction signal, the stage-two weak-network signal, and the stage-two stop-interaction signal generated during online speech processing, so as to control the online and offline speech processing and obtain a final command.
In a third aspect, the present application provides a speech recognition processing device comprising a processing unit, a bus unit, and a storage unit, wherein the processing unit is connected to the storage unit through the bus unit, the storage unit stores a computer program, and the speech recognition processing method is implemented when the computer program is executed by the processing unit.
In a fourth aspect, the present application provides a computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the speech recognition processing method.
Compared with the prior art, the technical scheme provided by the embodiment of the application has the following advantages:
Offline speech recognition and semantic understanding are performed on the voice information to obtain an offline command; the voice information is sent to a cloud speech recognition server for online speech recognition to obtain a text form of the voice information, and the text form is sent to a semantic understanding server for semantic analysis and understanding to obtain an online command; the received online and offline commands are then arbitrated according to whether the network is online or offline, so as to control the online and offline speech processing and obtain a final command. The switching and cooperation between online speech recognition and semantic processing and offline speech recognition and semantic understanding are decided dynamically and in real time according to whether the network is offline, which stage of weak network is present, and whether interaction has been stopped, so that recognition accuracy is preserved and the efficiency of speech recognition processing under a weak network is improved.
The application supports obtaining the offline command by performing offline semantic understanding on the text obtained by online speech recognition, which improves the accuracy of the offline command and of the overall recognition while maintaining timeliness.
The application controls the orderly execution of the speech recognition processing method through a finite state machine, thereby avoiding conflicts.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
In order to more clearly illustrate the embodiments of the application or the technical solutions of the prior art, the drawings which are used in the description of the embodiments or the prior art will be briefly described, and it will be obvious to a person skilled in the art that other drawings can be obtained from these drawings without inventive effort.
FIG. 1 is a schematic diagram of a finite state machine implementing a speech recognition processing method according to an embodiment of the present application;
FIG. 2 is a flowchart of a speech recognition processing method according to an embodiment of the present application;
FIG. 3 is a flowchart, according to an embodiment of the present application, of arbitrating the received online and offline commands according to whether the network is online or offline and according to the stage-one weak-network signal, the stage-one stop-interaction signal, the stage-two weak-network signal, and the stage-two stop-interaction signal generated during online speech processing, so as to control the online and offline speech processing and obtain a final command;
FIG. 4 is a flowchart of generating a stage-one weak-network signal, a stage-one stop-interaction signal, a stage-two weak-network signal, and a stage-two stop-interaction signal according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a speech recognition processing system according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a speech recognition processing device according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
Example 1
This embodiment of the application provides a speech recognition processing method. To better carry out the method, the application constructs a finite state machine that implements it. Referring to FIG. 1, the finite state machine comprises:
An initial state: the speech recognition processing device is in the initial state when it has just been powered on, and is then initialized. When the initialization of the device is completed, the microphone of the speech recognition device is turned on, the processor runs, and the finite state machine enters the idle state from the initial state. After a user wakes up the device through a wake-up event, the state machine enters the wake-up state from the idle state; in a specific implementation, the wake-up event triggers the transition to the wake-up state and the offline recognition flag is reset to 0. On entering the wake-up state, a wake-up prompt tone is played.
After entering the wake-up state, the state machine enters the VAD processing state, which determines the start and end of the voice information and times the speech process. Specifically, in the VAD processing state, the start of speech is detected from the sound intensity, which marks the start of the voice information, and the whole voice recording process is timed. If the time VAD spends waiting to detect the start of speech exceeds a first threshold, it is considered that no speech was heard after wake-up and the idle state is entered; if the time VAD spends waiting to detect the end of speech exceeds a second threshold, it is considered that a VAD processing error has occurred and the idle state is entered.
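The two VAD timeouts described above can be sketched as a single check. This Python sketch makes several assumptions that the application does not state: the wait for the start of speech is measured from entering the VAD processing state, the wait for the end of speech is measured from the detected start of speech, times are in seconds, and the function name `vad_timeout` and its return strings are hypothetical. In both timeout cases the state machine would return to the idle state.

```python
from typing import Optional


def vad_timeout(
    speech_started: bool,
    elapsed: float,
    start_timeout: float,
    end_timeout: float,
) -> Optional[str]:
    """Return a timeout reason, or None while VAD is still within its limits.

    elapsed is the time waited so far: since entering the VAD state if speech
    has not started, or since the detected start of speech otherwise.
    """
    if not speech_started:
        # First threshold: nothing was heard after wake-up.
        return "no_speech" if elapsed > start_timeout else None
    # Second threshold: speech started but its end was never detected.
    return "vad_error" if elapsed > end_timeout else None
```

With illustrative limits of 5 and 15 seconds, `vad_timeout(False, 6.0, 5.0, 15.0)` reports that no speech was heard, while `vad_timeout(True, 5.5, 5.0, 15.0)` reports no timeout because the end-of-speech limit applies once speech has started.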
After VAD detects that the voice information has ended, the state machine enters the online speech recognition state and the offline speech recognition and understanding state. The online speech recognition state sends the voice information to the cloud speech recognition server for online recognition to obtain a text form of the voice information, and times the network request of the online speech recognition. The text is sent to the command output and the online semantic understanding state is entered; the online semantic understanding state sends the text form of the voice information to the semantic understanding server, obtains an online command after semantic analysis and understanding, and times the network request of the online semantic understanding. The online command is then sent to the command output. The offline speech recognition and understanding state performs speech recognition and semantic understanding on the voice information to obtain an offline command and sends the offline command to the command output; its semantic understanding process also supports analyzing and understanding the text of the voice information obtained in the online speech recognition state to obtain the offline command. When an offline command is generated, the offline recognition flag is set to 1.
In the speech recognition processing method provided by the application, the command output arbitrates the received online and offline commands according to whether the network is online or offline and according to the stage-one weak-network signal, the stage-one stop-interaction signal, the stage-two weak-network signal, and the stage-two stop-interaction signal generated during online speech processing, so as to control the online and offline speech processing and obtain a final command. Specifically, referring to FIG. 2, the speech recognition processing method includes:
Acquiring voice information: the user's speech is received through the microphone of the speech recognition processing device, and the voice information is acquired through the VAD function.
Performing offline speech recognition and semantic understanding on the voice information to obtain an offline command. In one embodiment, an offline recognition text corresponding to the voice information is looked up in a text database local to the speech recognition processing device, and the offline recognition text is analyzed to obtain the offline command. In another embodiment, because of the limits of the device's storage and processor, a miniaturized speech recognition and understanding model is deployed, through which local offline speech recognition and semantic understanding are performed.
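The text-database embodiment above can be pictured as a lookup from a normalized recognition text to a command. The Python sketch below is illustrative only: the phrases, the command names, and the functions `normalize` and `offline_understand` are hypothetical, and it assumes the offline recognition text has already been produced from the audio.

```python
from typing import Dict, Optional

# Hypothetical local text database: offline recognition text -> command.
OFFLINE_COMMANDS: Dict[str, str] = {
    "turn on the light": "LIGHT_ON",
    "turn off the light": "LIGHT_OFF",
    "open the window": "WINDOW_OPEN",
}


def normalize(text: str) -> str:
    """Lowercase, collapse whitespace, and drop surrounding punctuation."""
    return " ".join(text.lower().split()).strip(".,!?")


def offline_understand(text: str) -> Optional[str]:
    """Return the offline command for a recognized text, or None if it is not in the database."""
    return OFFLINE_COMMANDS.get(normalize(text))
```

A fixed table like this recognizes only the sentence patterns it contains, which matches the limitation of local engines noted in the Background; any other text returns `None` and no offline command is produced.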
And sending the voice information to a cloud voice recognition server for online voice recognition to obtain a text form of the voice information.
And sending the text form to a semantic understanding server for semantic analysis and understanding to obtain an online command.
According to the network off-line condition, the conditions of receiving the on-line command and the off-line command are arbitrated to control the on-line voice processing and the off-line voice processing to obtain a final command.
In a specific implementation, referring to FIG. 3, arbitrating the received online and offline commands according to whether the network is online or offline and according to the stage-one weak-network signal, the stage-one stop-interaction signal, the stage-two weak-network signal, and the stage-two stop-interaction signal generated during online speech processing, so as to control the online and offline speech processing and obtain the final command, includes:
A network flag indicating whether the network is online or offline and an offline recognition flag indicating whether an offline command has been generated are preset. Whether the network is online is detected; if so, the network flag is set to 1, otherwise it is set to 0. Whether an offline command has been obtained, either by offline speech recognition or by offline semantic understanding of the text obtained by online speech recognition, is detected; if so, the offline recognition flag is set to 1, otherwise it is set to 0. The offline recognition flag is also reset to 0 on entering the wake-up state.
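The two flags and their update points can be sketched as a small record. In the following Python sketch the class `Flags` and its method names are hypothetical, and how the device actually detects connectivity is left outside the sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Flags:
    network: int = 0             # 1 = network online, 0 = network offline
    offline_recognized: int = 0  # 1 = an offline command has been generated

    def update_network(self, online: bool) -> None:
        """Record the result of a connectivity check (the check itself is not shown)."""
        self.network = 1 if online else 0

    def on_wake(self) -> None:
        """Entering the wake-up state clears the offline recognition flag."""
        self.offline_recognized = 0

    def on_offline_result(self, command: Optional[str]) -> None:
        """Set the flag when offline recognition or understanding yields a command."""
        self.offline_recognized = 1 if command is not None else 0
```

Resetting the offline recognition flag on wake-up keeps a command left over from a previous interaction from being mistaken for one belonging to the current utterance.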
When the command output receives the offline command and the network flag is 1, indicating that the network is online, the offline command is temporarily stored. Because offline recognition and understanding are limited by the performance of the speech recognition processing device, the offline command is generally less accurate than the online command obtained through online speech recognition and semantic understanding; the offline command is therefore held while the network is online, so that an attempt can be made to obtain the online command. If the network is offline, it is further determined whether the state machine is in the wake-up state or the VAD processing state; if so, the offline command is executed, otherwise the current state is kept. In other words, the offline command is executed when the network is offline and no online command can be obtained.
When the network is online, online speech recognition is performed:
During online speech recognition, if the network is normal, the command output receives the text form of the voice information, and offline semantic analysis is performed on the received text to obtain an offline command that replaces the stored offline command. Because online speech recognition is more accurate, once it has produced the text, performing offline semantic analysis on that text yields a more accurate offline command.
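The replacement step described above can be sketched as follows. The Python sketch uses a toy keyword matcher as a stand-in for the device's offline semantic understanding; the names `keyword_understand` and `refresh_offline_command` and the command strings are hypothetical. Keeping the previously stored command when the online text cannot be understood offline is an added assumption, since the application does not address that case.

```python
from typing import Callable, Optional


def keyword_understand(text: str) -> Optional[str]:
    """Toy offline semantic understanding: match a few keywords in the text."""
    words = set(text.lower().split())
    if {"light", "on"} <= words:
        return "LIGHT_ON"
    if {"light", "off"} <= words:
        return "LIGHT_OFF"
    return None


def refresh_offline_command(
    stored: Optional[str],
    online_text: str,
    understand: Callable[[str], Optional[str]] = keyword_understand,
) -> Optional[str]:
    """Re-derive the offline command from the online recognition text.

    The new command replaces the stored one; if the text cannot be understood
    offline, the stored command is kept (assumption, not stated in the source).
    """
    candidate = understand(online_text)
    return candidate if candidate is not None else stored
```

For instance, if offline recognition misheard the utterance and stored `"FAN_ON"`, while the cloud server returns the text "turn on the light", the stored offline command becomes `"LIGHT_ON"` and is what a later weak-network signal would execute.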
In the online voice recognition process, if the command output receives a weak network signal in a stage one, judging whether an offline command is received, if so, executing the offline command, and then entering an idle state, otherwise, keeping the current state; when online voice recognition is performed online, as shown in fig. 4, whether the network request time of online voice recognition initiated to the cloud voice recognition server is smaller than a second request time threshold and larger than or equal to a first request time threshold is detected, and if yes, a phase one weak network signal is generated. The generation of the phase one weak signal indicates that online speech recognition has not been achieved after the first request time threshold is exceeded, and in order to ensure efficiency, to avoid jamming, the offline command is executed immediately if the offline command is ready.
In the online voice recognition process, if the command output receives the stage one interaction stop signal, judging whether the online voice recognition state is achieved, if yes, entering an idle state, otherwise, keeping the current state; when online voice recognition is performed online, as shown in fig. 4, detecting whether the network request time of online voice recognition initiated to a cloud voice recognition server is greater than or equal to a second request time threshold, and if so, generating a phase one interaction stop signal; the generation of the phase-one interaction stop signal indicates that the phase-one weak network signal is necessarily experienced, if the phase-one weak network signal is in a period from the phase-one interaction stop signal to the phase-one interaction stop signal, the off-line command is not generated, the competition time of the off-line command and the on-line command is given, if the second request time threshold is exceeded, the on-line voice recognition is stopped, and the idle state is entered. Under the condition of generating the offline command, the online command generating process is ended, and repeated execution is avoided.
In the online voice recognition and semantic online understanding process, if the network is normal, the command output receives the online command, judges whether the online command is in an online semantic understanding state, if so, executes the online command, and then enters an idle state.
During online semantic understanding, if the command output receives the stage two weak network signal, it judges whether an offline command has been received; if so, it executes the offline command and then enters the idle state, otherwise it keeps the current state. As shown in fig. 4, while online semantic understanding is in progress, the network request time of the semantic analysis and understanding request sent to the semantic understanding server is checked; if it is greater than or equal to the first request time threshold and smaller than the second request time threshold, the stage two weak network signal is generated. The stage two weak network signal indicates that online semantic understanding has not completed even though the first request time threshold has been reached. To preserve responsiveness and avoid stalling, the offline command is executed immediately if it is ready.
During online semantic understanding, if the command output receives the stage two interaction stop signal, it judges whether it is in the online semantic understanding state; if so, it enters the idle state, otherwise it keeps the current state. As shown in fig. 4, while online semantic understanding is in progress, the network request time of the semantic analysis and understanding request sent to the semantic understanding server is checked; if it is greater than or equal to the second request time threshold, the stage two interaction stop signal is generated. The stage two interaction stop signal implies that the stage two weak network signal has already occurred and that no offline command was produced in the period between the stage two weak network signal and the stage two interaction stop signal. This period gives the offline command and the online command time to compete; once the second request time threshold is reached, online processing is stopped and the idle state is entered. If an offline command is produced, the online command generation process is ended, which avoids repeated execution.
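The way the command output reacts to these signals in both stages can be sketched as follows. This is a simplified illustration under stated assumptions: the class and method names are hypothetical, commands are plain strings, and "executing" a command is modeled as appending it to a list.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import List, Optional


class State(Enum):
    IDLE = auto()
    ONLINE_ASR = auto()  # online voice recognition (stage one)
    ONLINE_NLU = auto()  # online semantic understanding (stage two)


@dataclass
class CommandOutput:
    """Hypothetical arbiter between the offline command and the online command."""
    state: State = State.IDLE
    offline_command: Optional[str] = None
    executed: List[str] = field(default_factory=list)

    def _execute(self, command: str) -> None:
        # Execute once, then return to idle so a late result cannot run again.
        self.executed.append(command)
        self.offline_command = None
        self.state = State.IDLE

    def on_weak_network(self) -> None:
        # Stage one or stage two: use the offline command if it is ready,
        # otherwise keep the current state and keep waiting.
        if self.offline_command is not None:
            self._execute(self.offline_command)

    def on_interaction_stop(self, stage: State) -> None:
        # Give up on the online path only if still in the matching state.
        if self.state is stage:
            self.state = State.IDLE

    def on_online_command(self, command: str) -> None:
        # The online command is honored only during online semantic understanding.
        if self.state is State.ONLINE_NLU:
            self._execute(command)
```

Returning to the idle state after the offline command runs is what makes a late online command harmless in this sketch: `on_online_command` ignores it because the state is no longer `ONLINE_NLU`.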
In some examples, the speech recognition processing device applying the method of the present application serves as the control hub of home appliances. For example, if the command output module of the device finally obtains an air conditioner control command, the command is sent to the air conditioner, which executes it; if the command output module obtains a voice broadcast command, the broadcast content is sent to the voice synthesis module and played.
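Routing the final command to an appliance or to the voice synthesis module could look like the sketch below. The source does not specify a command format, so the dictionary layout (`target`, `action`, `text`) and the function `make_dispatcher` are assumptions made purely for illustration.

```python
from typing import Callable, Dict

Handler = Callable[[dict], None]


def make_dispatcher(handlers: Dict[str, Handler]) -> Callable[[dict], bool]:
    """Build a dispatch function that routes a command to the handler for its target."""

    def dispatch(command: dict) -> bool:
        handler = handlers.get(command.get("target", ""))
        if handler is None:
            return False  # unknown target: nothing is executed
        handler(command)
        return True

    return dispatch


# Usage: handlers here only record what they would send (hypothetical targets).
log = []
dispatch = make_dispatcher({
    "air_conditioner": lambda c: log.append(("air_conditioner", c["action"])),
    "tts": lambda c: log.append(("tts", c["text"])),
})
dispatch({"target": "air_conditioner", "action": "set_temperature_26"})
dispatch({"target": "tts", "text": "The air conditioner is set to 26 degrees."})
```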
Example 2
Referring to fig. 5, an embodiment of the present application provides a speech recognition processing system, including: a voice input module, an online voice recognition module, an online semantic understanding module, a network state module, an offline voice recognition understanding module and a command output module; wherein,
the voice input module is used for acquiring voice information of a user and comprises a wake-up function and a VAD detection function;
the offline voice recognition understanding module is used for offline recognition and understanding of the voice information, or for offline understanding of the text of the voice information, which is obtained by the online voice recognition module and provided by the command output module, to obtain an offline command;
the online voice recognition module is used for sending the voice information to the cloud voice recognition server to obtain a text form of the voice information;
the online semantic understanding module is used for sending the text form of the voice information to the semantic understanding server for semantic analysis and understanding to obtain an online command;
the network state module generates a stage one weak network signal, a stage one interaction stop signal, a stage two weak network signal and a stage two interaction stop signal according to the network data request time of online voice recognition and online semantic understanding;
the command output module is used for arbitrating between the received online command and offline command according to whether the network is online or offline and according to the stage one weak network signal, the stage one interaction stop signal, the stage two weak network signal and the stage two interaction stop signal generated during online voice processing, so as to control the online voice processing and the offline voice processing and obtain a final command. Specifically:
if the command output module receives the offline command and the network is online, the offline command is temporarily stored; if the network is offline, it further judges whether the device is in the wake-up state or the VAD processing state, and if so executes the offline command, otherwise keeps the current state;
if the command output module receives the text form of the voice information obtained by online voice recognition, offline voice understanding is performed on the text to obtain a command, which replaces the stored offline command or serves as the offline command if none has been obtained yet;
if the command output module receives the stage one weak network signal, it judges whether an offline command has been received; if so, it executes the offline command and then enters the idle state, otherwise it keeps the current state;
if the command output module receives the stage one interaction stop signal, it judges whether it is in the online voice recognition state; if so, it enters the idle state, otherwise it keeps the current state;
if the command output module receives the online command, it judges whether it is in the online semantic understanding state; if so, it executes the online command and then enters the idle state;
if the command output module receives the stage two weak network signal, it judges whether an offline command has been received; if so, it executes the offline command and then enters the idle state, otherwise it keeps the current state;
if the command output module receives the stage two interaction stop signal, it judges whether it is in the online semantic understanding state; if so, it enters the idle state, otherwise it keeps the current state.
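The first two branches of the command output module (temporary storage of the offline command, and offline understanding of the text returned by online voice recognition) can be sketched as below. The network flag and offline recognition flag follow the 0/1 convention stated in the claims; the class name, the keyword table and the command strings are hypothetical. The source does not say which state follows execution in the offline branch, so this sketch leaves the state unchanged there.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Callable, List, Optional


class State(Enum):
    IDLE = auto()
    WAKE_UP = auto()
    VAD = auto()
    ONLINE_ASR = auto()


@dataclass
class OfflineCommandHandler:
    understand_offline: Callable[[str], Optional[str]]  # offline semantic understanding
    state: State = State.IDLE
    network_flag: int = 1   # 1 = network online, 0 = network offline
    offline_flag: int = 0   # 1 = an offline command has been obtained
    stored_command: Optional[str] = None
    executed: List[str] = field(default_factory=list)

    def on_offline_command(self, command: str) -> None:
        if self.network_flag == 1:
            # Network online: hold the command until a weak network signal arrives.
            self.stored_command = command
            self.offline_flag = 1
        elif self.state in (State.WAKE_UP, State.VAD):
            # Network offline: execute directly in the wake-up or VAD state.
            self.executed.append(command)
        # Otherwise keep the current state and do nothing.

    def on_online_text(self, text: str) -> None:
        # Re-run offline understanding on the (usually more accurate) online text.
        command = self.understand_offline(text)
        if command is not None:
            self.stored_command = command  # replaces or supplies the offline command
            self.offline_flag = 1


def keyword_understanding(text: str) -> Optional[str]:
    """Toy offline understanding: exact-match lookup in an assumed keyword table."""
    table = {"turn on the air conditioner": "ac_on"}
    return table.get(text.strip().lower())
```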
Example 3
Referring to fig. 6, an embodiment of the present application provides a speech recognition processing device, including a processing unit connected to a storage unit through a bus unit. The storage unit, as a computer-readable storage medium, can store software programs, computer-executable programs and modules, such as those corresponding to the speech recognition processing method in the embodiments of the application (the voice input module, the online voice recognition module, the online semantic understanding module, the network state module, the offline voice recognition understanding module and the command output module). The processing unit implements the above speech recognition processing method by running the software programs, computer-executable programs and modules stored in the storage unit, the method including:
acquiring voice information; performing offline voice recognition and semantic understanding on the voice information to obtain an offline command; sending the voice information to a cloud voice recognition server for online voice recognition to obtain the text form of the voice information, and sending the text form to a semantic understanding server for semantic analysis and understanding to obtain an online command; and arbitrating between the received online command and offline command according to whether the network is online or offline, so as to control the online voice processing and the offline voice processing and obtain a final command.
Further, the storage unit may include high-speed random access memory and may also include nonvolatile memory, such as at least one magnetic disk storage device, flash memory device, or other nonvolatile solid-state storage device. In some examples, the storage unit may further include memory located remotely from the processing unit, which may be connected to the device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Of course, the computer program stored in the storage unit of the speech recognition processing device according to the embodiment of the present application is not limited to the method operations described above, and may also perform related operations in the speech recognition processing method according to any embodiment of the present application.
Example 4
An embodiment of the present application provides a computer-readable storage medium storing a computer program that, when executed, implements the speech recognition processing method, the method including:
acquiring voice information; performing offline voice recognition and semantic understanding on the voice information to obtain an offline command; sending the voice information to a cloud voice recognition server for online voice recognition to obtain the text form of the voice information, and sending the text form to a semantic understanding server for semantic analysis and understanding to obtain an online command; and arbitrating between the received online command and offline command according to whether the network is online or offline, so as to control the online voice processing and the offline voice processing and obtain a final command.
Of course, the computer program stored in the computer-readable storage medium according to the embodiment of the present application is not limited to the method operations described above, and may also perform related operations in the speech recognition processing method according to any embodiment of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed structures and methods may be implemented in other manners. The structural embodiments described above are merely illustrative. For example, the division into units is merely a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the couplings, direct couplings or communication connections shown or discussed may be indirect couplings or communication connections via interfaces, structures or units, and may be electrical, mechanical or of other forms.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The foregoing is only a specific embodiment of the application to enable those skilled in the art to understand or practice the application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A speech recognition processing method, comprising: acquiring voice information; performing offline voice recognition and semantic understanding on the voice information to obtain an offline command; sending the voice information to a cloud voice recognition server for online voice recognition to obtain the text form of the voice information, and sending the text form to a semantic understanding server for semantic analysis and understanding to obtain an online command; and arbitrating between the received online command and offline command according to whether the network is online or offline, so as to control the online voice processing and the offline voice processing and obtain a final command.
2. The method according to claim 1, wherein arbitrating between the received online command and offline command to control the online voice processing and the offline voice processing and obtain the final command includes:
if the command output receives the offline command and the network is online, temporarily storing the offline command; if the network is offline, further judging whether the device is in the wake-up state or the VAD processing state, and if so executing the offline command, otherwise keeping the current state;
if the command output receives the stage one weak network signal, judging whether an offline command has been received, and if so executing the offline command and then entering the idle state, otherwise keeping the current state;
if the command output receives the stage one interaction stop signal, judging whether the command output is in the online voice recognition state, and if so entering the idle state, otherwise keeping the current state;
if the command output receives the online command, judging whether the command output is in the online semantic understanding state, and if so executing the online command and then entering the idle state;
if the command output receives the stage two weak network signal, judging whether an offline command has been received, and if so executing the offline command and then entering the idle state, otherwise keeping the current state;
if the command output receives the stage two interaction stop signal, judging whether the command output is in the online semantic understanding state, and if so entering the idle state, otherwise keeping the current state.
3. The voice recognition processing method of claim 2, wherein if the command output receives the text form of the voice information obtained by online voice recognition, offline voice understanding is performed on the text to obtain a command, which replaces the stored offline command or serves as the offline command if none has been obtained yet.
4. The voice recognition processing method according to claim 3, wherein a network flag for indicating whether the network is online or offline and an offline recognition flag for indicating whether an offline command has been generated are preset; detecting whether the network is online, and if so setting the network flag to 1, otherwise setting the network flag to 0; and detecting whether an offline command has been obtained by offline voice recognition or by offline semantic understanding of the text obtained by online voice recognition, and if so setting the offline recognition flag to 1, otherwise setting the offline recognition flag to 0.
5. The voice recognition processing method according to claim 2, wherein while online voice recognition is in progress, detecting whether the network request time of the online voice recognition request sent to the cloud voice recognition server is greater than or equal to a first request time threshold and smaller than a second request time threshold, and if so, generating the stage one weak network signal;
while online voice recognition is in progress, detecting whether the network request time of the online voice recognition request sent to the cloud voice recognition server is greater than or equal to the second request time threshold, and if so, generating the stage one interaction stop signal;
while online semantic understanding is in progress, detecting whether the network request time of the semantic analysis and understanding request sent to the semantic understanding server is greater than or equal to the first request time threshold and smaller than the second request time threshold, and if so, generating the stage two weak network signal;
while online semantic understanding is in progress, detecting whether the network request time of the semantic analysis and understanding request sent to the semantic understanding server is greater than or equal to the second request time threshold, and if so, generating the stage two interaction stop signal.
6. The speech recognition processing method of claim 2, wherein the wake-up state, the VAD processing state, the online voice recognition state, the online semantic understanding state and the idle state belong to a finite state machine, in which:
the device is in an initial state when it has just been powered on, and is then initialized;
after initialization is completed, the device enters the idle state from the initial state;
after a user wakes up the device through a wake-up event, the device enters the wake-up state from the idle state;
after VAD detects that the voice information has ended, the device enters the online voice recognition state and the offline voice recognition understanding state; in the online voice recognition state, the voice information is sent to the cloud voice recognition server for online recognition, and the resulting text form of the voice information is sent to the command output, after which the device enters the online semantic understanding state, in which the text form of the voice information is sent to the semantic understanding server for semantic analysis and understanding to obtain an online command, and the online command is sent to the command output; in the offline voice recognition understanding state, voice recognition and semantic understanding are performed on the voice information to obtain an offline command, and the offline command is sent to the command output, wherein the semantic understanding in the offline voice recognition understanding state also supports analyzing and understanding the text of the voice information obtained in the online voice recognition state to obtain the offline command;
in the command output state, arbitration between the received online command and offline command is performed according to whether the network is online or offline and according to the stage one weak network signal, the stage one interaction stop signal, the stage two weak network signal and the stage two interaction stop signal generated during online voice processing, so as to control the online voice processing and the offline voice processing and obtain a final command.
7. The speech recognition processing method of claim 6, wherein if the time taken by VAD to detect the start of speech exceeds a first threshold, it is considered that no speech was heard after wake-up, and the idle state is entered; and if the time taken by VAD to detect the end of speech exceeds a second threshold, it is considered that a VAD processing error has occurred, and the idle state is entered.
8. A speech recognition processing system, comprising: a voice input module, an online voice recognition module, an online semantic understanding module, a network state module, an offline voice recognition understanding module and a command output module; wherein,
the voice input module is used for acquiring voice information of a user and comprises a wake-up function and a VAD detection function;
the offline voice recognition understanding module is used for offline recognition and understanding of the voice information, or for offline understanding of the text of the voice information, which is obtained by the online voice recognition module and provided by the command output module, to obtain an offline command;
the online voice recognition module is used for sending the voice information to the cloud voice recognition server to obtain a text form of the voice information;
the online semantic understanding module is used for sending the text form of the voice information to the semantic understanding server for semantic analysis and understanding to obtain an online command;
the network state module generates a stage one weak network signal, a stage one interaction stop signal, a stage two weak network signal and a stage two interaction stop signal according to the network data request time of online voice recognition and online semantic understanding;
the command output module is used for arbitrating between the received online command and offline command according to whether the network is online or offline and according to the stage one weak network signal, the stage one interaction stop signal, the stage two weak network signal and the stage two interaction stop signal generated during online voice processing, so as to control the online voice processing and the offline voice processing and obtain a final command.
9. A speech recognition processing device, comprising: at least one processing unit, said processing unit being connected to a storage unit via a bus unit, said storage unit storing a computer program, which when executed by said processing unit, implements the speech recognition processing method according to any one of claims 1-7.
10. A computer readable storage medium storing a computer program, which when executed by a processor implements the speech recognition processing method according to any one of claims 1-7.
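The finite state machine recited in claims 6 and 7 can be sketched as a transition table. This is an illustrative reading, not the claimed implementation: the claims do not spell out the transition from the wake-up state to the VAD processing state, so the `VOICE_START` event is an assumption, and the concurrent online and offline recognition states are collapsed into a single hypothetical `RECOGNIZING` state.

```python
from enum import Enum, auto


class State(Enum):
    INITIAL = auto()
    IDLE = auto()
    WAKE_UP = auto()
    VAD = auto()
    RECOGNIZING = auto()  # online and offline recognition running side by side


class Event(Enum):
    INIT_DONE = auto()
    WAKE = auto()
    VOICE_START = auto()    # assumed: VAD detected the start of speech
    START_TIMEOUT = auto()  # speech start not detected within the first VAD threshold
    VOICE_END = auto()
    END_TIMEOUT = auto()    # speech end not detected within the second VAD threshold
    COMMAND_DONE = auto()   # command output finished arbitration


TRANSITIONS = {
    (State.INITIAL, Event.INIT_DONE): State.IDLE,
    (State.IDLE, Event.WAKE): State.WAKE_UP,
    (State.WAKE_UP, Event.VOICE_START): State.VAD,
    (State.WAKE_UP, Event.START_TIMEOUT): State.IDLE,  # nothing heard after wake-up
    (State.VAD, Event.VOICE_END): State.RECOGNIZING,
    (State.VAD, Event.END_TIMEOUT): State.IDLE,        # treated as a VAD error
    (State.RECOGNIZING, Event.COMMAND_DONE): State.IDLE,
}


def step(state: State, event: Event) -> State:
    """Return the next state; events that do not apply leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)
```

Note that the first and second thresholds of claim 7 are VAD timeouts and are distinct from the first and second request time thresholds of claim 5.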
CN202311096724.6A 2023-08-29 2023-08-29 Speech recognition processing method, system, device and storage medium Pending CN117095683A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311096724.6A CN117095683A (en) 2023-08-29 2023-08-29 Speech recognition processing method, system, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311096724.6A CN117095683A (en) 2023-08-29 2023-08-29 Speech recognition processing method, system, device and storage medium

Publications (1)

Publication Number Publication Date
CN117095683A true CN117095683A (en) 2023-11-21

Family

ID=88773219

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311096724.6A Pending CN117095683A (en) 2023-08-29 2023-08-29 Speech recognition processing method, system, device and storage medium

Country Status (1)

Country Link
CN (1) CN117095683A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2026001141A1 (en) * 2024-06-26 2026-01-02 北京字跳网络技术有限公司 Information processing method, device, storage medium, and product

Similar Documents

Publication Publication Date Title
JP7114721B2 (en) Voice wake-up method and apparatus
CN107704275B (en) Intelligent device awakening method and device, server and intelligent device
CN112634897B (en) Equipment awakening method and device, storage medium and electronic device
CN108182943B (en) Intelligent device control method and device and intelligent device
CN110111789B (en) Voice interaction method and device, computing equipment and computer readable medium
CN112201246A (en) Intelligent control method and device based on voice, electronic equipment and storage medium
JP2019128938A (en) Lip reading based voice wakeup method, apparatus, arrangement and computer readable medium
CN114724564B (en) Speech processing methods, devices and systems
CN110851221A (en) Smart home scene configuration method and device
CN111599371A (en) Voice adding method, system, device and storage medium
US12062361B2 (en) Wake word method to prolong the conversational state between human and a machine in edge devices
CN107705793A (en) Information-pushing method, system and its equipment based on Application on Voiceprint Recognition
CN113205809A (en) Voice wake-up method and device
CN109524010A (en) A kind of sound control method, device, equipment and storage medium
CN108899028A (en) Voice awakening method, searching method, device and terminal
CN117095683A (en) Speech recognition processing method, system, device and storage medium
CN116264078A (en) Speech recognition processing method and device, electronic equipment and readable medium
CN115019797A (en) Voice interaction method and server
CN115762505B (en) Voice interaction method, electronic equipment and storage medium
CN109686372B (en) Resource playing control method and device
CN111081254A (en) A kind of speech recognition method and device
CN111292749A (en) Conversation control method and device for intelligent voice platform
US20220122593A1 (en) User-friendly virtual voice assistant
CN117409779B (en) Voice wakeup method, device, system and readable medium
CN115188377B (en) Voice interaction method, electronic device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination