CN114863941B - Howling suppression method and device, storage medium and electronic equipment

Howling suppression method and device, storage medium and electronic equipment

Info

Publication number
CN114863941B
CN114863941B CN202210307288.1A CN202210307288A CN114863941B CN 114863941 B CN114863941 B CN 114863941B CN 202210307288 A CN202210307288 A CN 202210307288A CN 114863941 B CN114863941 B CN 114863941B
Authority
CN
China
Prior art keywords
howling
audio
signal
suppression
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210307288.1A
Other languages
Chinese (zh)
Other versions
CN114863941A (en
Inventor
陈志鹏
阮良
陈功
陈丽
郝一亚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Netease Zhiqi Technology Co Ltd
Original Assignee
Hangzhou Netease Zhiqi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Netease Zhiqi Technology Co Ltd
Priority to CN202210307288.1A
Publication of CN114863941A
Application granted
Publication of CN114863941B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering
    • G10L21/0264Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The embodiments of this disclosure relate to the field of audio signal processing technology, and more specifically, to a method and apparatus for suppressing howling, a storage medium, and an electronic device. The method includes: extracting audio features of an audio signal to be processed, wherein the audio signal to be processed is an audio signal acquired by a first device through its first audio acquisition module, and the audio signal to be processed is a superposition of an acoustic signal emitted by a sound source and a second audio signal played by a second audio playback module; inputting the audio features to a howling detection model, wherein the howling detection model outputs howling feature parameters of the audio signal to be processed; inputting the howling feature parameters and the audio features to a howling suppression model; and obtaining a howling-suppressed audio signal based on the output of the howling suppression model. The technical solution of this disclosure can adapt to howling suppression in acoustic loops of real-time communication with many nonlinearities and uncertainties.

Description

Howling suppression method and device, storage medium and electronic equipment
Technical Field
Embodiments of the present disclosure relate to the field of audio signal processing technologies, and more particularly, to a howling suppression method and apparatus, a storage medium, and an electronic device.
Background
This section is intended to provide background or context for the embodiments of the disclosure recited in the claims. The description herein is not admitted to be prior art by inclusion in this section.
Howling arises when a feedback system is unstable. The stability of the system can be judged by applying the Nyquist stability criterion to its open-loop transfer function. In a typical feedback system, the input R(s) is transferred to the output C(s) through the forward transfer function G(s), and the output C(s) is fed back to the input through the feedback transfer function H(s). The open-loop transfer function of the system is therefore G(s)H(s), and system stability can be judged from the Nyquist diagram or the Bode diagram of this function. When the feedback signal is in phase with the input signal, so that the feedback is positive, and the corresponding open-loop gain is greater than 1, the system is unstable.
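As a worked form of the criterion just described, the closed-loop response can be written out as follows. This is a sketch using the standard single-loop positive-feedback model; the angular frequency ω and the integer n are symbols introduced here for illustration and do not appear in the original text.

```latex
% Positive-feedback loop: forward path G(s), feedback path H(s)
C(s) = G(s)\,\bigl[R(s) + H(s)\,C(s)\bigr]
\quad\Longrightarrow\quad
\frac{C(s)}{R(s)} = \frac{G(s)}{1 - G(s)\,H(s)}

% Howling can build up at an angular frequency \omega when both conditions hold:
\bigl\lvert G(j\omega)\,H(j\omega) \bigr\rvert > 1
\qquad\text{and}\qquad
\angle\, G(j\omega)\,H(j\omega) = 2\pi n,\quad n \in \mathbb{Z}
```

The first condition is the open-loop gain requirement and the second is the in-phase requirement stated above.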
In an acoustic scene, a howling phenomenon easily occurs when an acoustic feedback closed loop is formed.
Disclosure of Invention
In acoustic scenes such as conference rooms, auditoriums, and KTV rooms, the microphone picks up sound and the speaker plays it back; the signal played by the speaker is then picked up again by the microphone, forming a loop. In these scenes the howling is usually generated by the system itself (self-howling), and its acoustic signature is mostly continuous howling at a single frequency or at multiple frequencies. Howling in these scenes is therefore relatively fixed and relatively easy to identify.
In the acoustic loop of an instant messaging (real-time communication, RTC) scenario, factors such as the differing built-in audio processing of different devices, the differing network transmission environments between devices, changes in device position, and differences in device frequency response all make the transmission of the audio signal through the acoustic loop uncertain. The variations and effects of these factors are also nonlinear, so the transfer function of the acoustic loop in an instant messaging scenario cannot be measured and analyzed quantitatively. These nonlinear factors also give the howling characteristics that differ from those of traditional howling scenes, such as discontinuity, multiple frequency points, frequency point movement, and frequency point diffusion.
Current howling suppression techniques generally follow one of two schemes:
Scheme one performs howling suppression using the frequency-shift/phase-shift method, the notch method, or the adaptive filtering method. The frequency-shift/phase-shift method changes the conditions under which howling arises by shifting frequency or phase, thereby suppressing howling. The notch method first determines the howling frequency point and then applies a notch at that frequency point. The adaptive filtering method dynamically updates the filter coefficients so as to filter out the howling signal. However, these three methods are better suited to acoustic scenes in which the conditions for howling are relatively fixed, such as traditional conferences and hearing aid systems. Their howling suppression is poor in the acoustic loop of instant messaging, which has many nonlinearities and uncertainties.
Scheme two first detects the howling frequency point, then removes the signal at that frequency point, and finally repairs the signal near the howling frequency point through a neural network. Scheme two is still essentially a notch-like scheme, to which it adds a signal restoration network. It improves sound quality relative to the traditional notch method but, like the notch method, is still not applicable to the acoustic loop of instant messaging.
There is therefore a strong need for an improved howling suppression method and apparatus, storage medium, and electronic device that can adapt to acoustic loops of instant messaging with many nonlinearities and uncertainties.
In this context, embodiments of the present disclosure are intended to provide a howling suppression method and apparatus, a storage medium, and an electronic device.
According to an aspect of the present disclosure, there is provided a howling suppression method applied to a first device, where the first device is configured to perform instant communication with a second device, where the first device and the second device belong to a same acoustic loop, the first device includes a first communication module, a first audio acquisition module, and a first audio playing module, and the second device includes a second communication module, a second audio acquisition module, and a second audio playing module, where the method includes:
extracting audio features of an audio signal to be processed, wherein the audio signal to be processed is an audio signal acquired by the first device through its first audio acquisition module, and is a superposition of an acoustic signal emitted by a sound source and a second audio signal played by the second audio playing module;
inputting the audio features into a howling detection model, wherein the howling detection model outputs howling feature parameters of the audio signal to be processed;
inputting the howling feature parameters and the audio features into a howling suppression model; and
obtaining a howling-suppressed audio signal according to the output of the howling suppression model.
According to an aspect of the present disclosure, there is provided a howling suppression apparatus applied to a first device for instant communication with a second device, the first device and the second device belonging to the same acoustic loop, the first device including a first communication module, a first audio acquisition module, and a first audio playback module, the second device including a second communication module, a second audio acquisition module, and a second audio playback module, the apparatus comprising:
an audio feature extraction module, configured to extract audio features of an audio signal to be processed, wherein the audio signal to be processed is an audio signal acquired by the first device through its first audio acquisition module, and is a superposition of an acoustic signal emitted by a sound source and a second audio signal played by the second audio playing module;
a howling detection module, configured to input the audio features into a howling detection model, wherein the howling detection model outputs howling feature parameters of the audio signal to be processed;
a howling suppression input module, configured to input the howling feature parameters and the audio features into a howling suppression model; and
a howling suppression output module, configured to obtain a howling-suppressed audio signal according to the output of the howling suppression model.
According to an aspect of the present disclosure, there is provided a storage medium having stored thereon a computer program which, when executed by a processor, implements the howling suppression method described above.
According to one aspect of the present disclosure, there is provided an electronic device including:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the howling suppression method of any of the above via execution of the executable instructions.
According to the howling suppression method of the embodiment of the present disclosure, the audio features of the audio signal to be processed are input into a howling detection model to obtain howling feature parameters, and howling suppression is performed based on the audio features and the howling feature parameters. The method is therefore suited to the acoustic loop of instant messaging: the complex, non-fixed howling feature parameters produced by the uncertainty of that loop are detected by the howling detection model, so that effective howling suppression can be performed based on the detected howling feature parameters.
Drawings
The above, as well as additional purposes, features, and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description when read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, in which:
Fig. 1 schematically shows a spectrum of howling signals in a common acoustic scenario in the prior art;
Fig. 2 schematically shows a spectrum of howling signals in another common acoustic scenario in the prior art;
Fig. 3 schematically illustrates a spectrogram of a howling signal in an acoustic loop of an instant messaging scenario according to an embodiment of the present disclosure;
Fig. 4 schematically shows a flowchart of a howling suppression method according to an embodiment of the present disclosure;
Fig. 5 schematically illustrates a schematic diagram of an acoustic loop of an instant messaging scenario according to an embodiment of the present disclosure;
Fig. 6 schematically illustrates a schematic diagram of a cascade application of a howling detection model and a howling suppression model according to an embodiment of the present disclosure;
Fig. 7 schematically illustrates a schematic diagram of a howling detection model according to an embodiment of the present disclosure;
Fig. 8 schematically illustrates a schematic diagram of a howling suppression model according to an embodiment of the present disclosure;
Fig. 9 schematically illustrates a flowchart of training a howling detection model according to an embodiment of the present disclosure;
Fig. 10 schematically illustrates a flowchart of training a howling suppression model according to an embodiment of the present disclosure;
Fig. 11 schematically illustrates a block diagram of a first audio processing module of a first device according to an embodiment of the present disclosure;
Fig. 12 schematically shows a block diagram of a howling suppression apparatus according to an embodiment of the present disclosure;
Fig. 13 shows a schematic diagram of a storage medium according to an embodiment of the present disclosure; and
Fig. 14 schematically shows a block diagram of an electronic device according to an embodiment of the present disclosure.
In the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
Detailed Description
The principles and spirit of the present disclosure will be described below with reference to several exemplary embodiments. It should be understood that these embodiments are presented merely to enable one skilled in the art to better understand and practice the present disclosure and are not intended to limit the scope of the present disclosure in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Those skilled in the art will appreciate that embodiments of the present disclosure may be implemented as a system, apparatus, device, method, or computer program product. Accordingly, the present disclosure may be embodied in the form of entirely hardware, entirely software (including firmware, resident software, micro-code, etc.) or in a combination of hardware and software.
According to an embodiment of the present disclosure, there are provided a howling suppression method, a howling suppression device, a storage medium, and an electronic apparatus.
Any number of elements in the figures are for illustration and not limitation, and any naming is used for distinction only, and not for any limiting sense.
The principles and spirit of the present disclosure are described in detail below with reference to several representative embodiments thereof.
Summary of the Invention
The inventors have found that in acoustic scenes such as conference rooms, auditoriums, and KTV rooms, the microphone picks up sound and the speaker plays it back; the signal played by the speaker is then picked up again by the microphone, forming a loop. In these scenes the howling is usually generated by the system itself (self-howling), and its acoustic signature is mostly continuous howling at a single frequency or at multiple frequencies. Fig. 1 shows a multi-frequency howling signal in such a scene: F11 is the time-domain waveform and F12 is the spectrogram. The abscissa of both F11 and F12 is time, the ordinate of F11 is amplitude, the ordinate of F12 is frequency, and the brightness in F12 represents the energy at each frequency. Fig. 2 shows a single-frequency howling signal with background sound in such a scene: F21 is the time-domain waveform and F22 is the spectrogram. The abscissa of both F21 and F22 is time, the ordinate of F21 is amplitude, the ordinate of F22 is frequency, and the brightness in F22 represents the energy at each frequency. As the examples of figs. 1 and 2 show, howling in these acoustic scenes has relatively fixed, more easily identifiable characteristics.
In the acoustic loop of an instant messaging (real-time communication, RTC) scenario, factors such as the differing built-in audio processing of different devices, the differing network transmission environments between devices, changes in device position, and differences in device frequency response all make the transmission of the audio signal through the acoustic loop uncertain. The variations and effects of these factors are also nonlinear, so the transfer function of the acoustic loop in an instant messaging scenario cannot be measured and analyzed quantitatively. These nonlinear factors also give the howling characteristics that differ from those of traditional howling scenes, such as discontinuity, multiple frequency points, frequency point movement, and frequency point diffusion.
For example, the audio processing built into a device may include noise reduction. Noise tracking in the noise reduction stage may track howling as noise and eliminate it to some extent, but because the acoustic loop still exists, howling can be excited again by external sound. Moreover, if the noise reduction removes only part of the howling rather than all of it, the audio signal exhibits intermittent, discontinuous howling. Other nonlinear processing likewise affects the phase and amplitude characteristics of the system, so that the howling frequency points shift, spread, and so on. In addition, because the capture and playback frequency responses differ between devices, the transfer functions of the acoustic loops are inconsistent, so different devices produce different howling frequency points and characteristics.
It follows that the howling signal generated in the acoustic loop of instant messaging is more complex and has more uncertainty.
Fig. 3 shows a howling signal with complex characteristics in the above scene: F31 is the time-domain waveform and F32 is the spectrogram. The abscissa of both F31 and F32 is time, the ordinate of F31 is amplitude, the ordinate of F32 is frequency, and the brightness in F32 represents the energy at each frequency. As fig. 3 shows, the howling signal in this scene has more complex howling characteristics than those in figs. 1 and 2. Specifically, F12 in fig. 1 shows that the howling is multi-frequency, and F11 shows that the energy (amplitude) of the howling covers the background sound, so the energy displayed in F11 is relatively uniform. In fig. 2, F21 shows the energy (amplitude) varying over time, so the howling in F21 does not cover its background sound, while the long straight line in F22, whose frequency stays constant over time, indicates single-frequency howling. Fig. 1 thus shows a multi-frequency howling signal and fig. 2 a single-frequency howling signal. Turning to fig. 3, the larger amplitude in F31 indicates that the howling covers the original background sound, yet F32 shows neither clear multi-frequency howling as in F12 nor clear single-frequency howling as in F22. In the scene of fig. 3, therefore, the howling signal has more complex characteristics than in figs. 1 and 2.
Current howling suppression techniques generally follow one of two schemes:
Scheme one performs howling suppression using the frequency-shift/phase-shift method, the notch method, or the adaptive filtering method.
The frequency-shift/phase-shift method changes the conditions under which howling arises by shifting frequency or phase, thereby suppressing howling. Because it changes the phase and frequency of the signal, it also changes the signal's characteristics; typically, the timbre of the speaker's voice is altered, which causes distortion. Frequency shifting is generally suited to relatively fixed scenes, where it is optimized in a targeted way through analysis of the system transfer function. In instant messaging scenes with various nonlinearities and uncertainties, its howling suppression is often ineffective.
The notch method first determines the howling frequency point and then applies a notch at that frequency point, thereby suppressing the howling. An essential premise of the notch method is that the howling frequency must be detected accurately. In an instant messaging scene, howling is discontinuous, has multiple frequency points, and its frequency points move and spread, so the howling frequency points are hard to predict and the method is hard to put into practice. The notch method is generally suited to scenes with fixed howling frequency points and continuous howling.
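The two stages of the notch method (locate the howling frequency point, then notch it) can be sketched as follows. This is a minimal illustration and not the method of the present disclosure: the sample rate, frame length, Q value, test tones, and all function names are assumptions chosen for the example, and the notch coefficients follow the widely used Audio EQ Cookbook biquad form.

```python
import cmath
import math

FS = 16000   # sample rate in Hz (assumed for this sketch)
FRAME = 512  # analysis frame length in samples (assumed)


def dft_magnitudes(frame):
    """Magnitudes of DFT bins 0..N/2, computed naively for clarity."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def detect_peak_hz(frame, fs=FS):
    """Stage 1: take the strongest bin as the howling frequency point."""
    mags = dft_magnitudes(frame)
    k = max(range(len(mags)), key=mags.__getitem__)
    return k * fs / len(frame)


def notch_coefficients(f0, fs=FS, q=30.0):
    """Second-order notch centred on f0 (Audio EQ Cookbook form), normalised by a0."""
    w0 = 2 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2 * q)
    a0 = 1 + alpha
    b = [1 / a0, -2 * math.cos(w0) / a0, 1 / a0]
    a = [1.0, -2 * math.cos(w0) / a0, (1 - alpha) / a0]
    return b, a


def biquad(signal, b, a):
    """Stage 2: apply the notch with a direct-form I difference equation."""
    out = []
    x1 = x2 = y1 = y2 = 0.0
    for x in signal:
        y = b[0] * x + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        x2, x1 = x1, x
        y2, y1 = y1, y
        out.append(y)
    return out


def tone_level(frame, f, fs=FS):
    """Amplitude of the sinusoidal component at f Hz within one frame."""
    s = sum(x * cmath.exp(-2j * math.pi * f * i / fs) for i, x in enumerate(frame))
    return 2 * abs(s) / len(frame)


# Toy input: a quiet 250 Hz "voice" tone plus a loud, fixed 2000 Hz howling tone.
signal = [
    0.3 * math.sin(2 * math.pi * 250 * i / FS) + math.sin(2 * math.pi * 2000 * i / FS)
    for i in range(4 * FRAME)
]

detected = detect_peak_hz(signal[:FRAME])
b, a = notch_coefficients(detected)
filtered = biquad(signal, b, a)

howl_before = tone_level(signal[-FRAME:], 2000)
howl_after = tone_level(filtered[-FRAME:], 2000)   # measured after the filter has settled
voice_after = tone_level(filtered[-FRAME:], 250)
```

The sketch only works because the howling tone is continuous and stays on one frequency point. If the tone moved or spread between the detection frame and the filtered frames, the notch would sit in the wrong place, which is the limitation described above.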
The adaptive filtering method dynamically updates the filter coefficients so as to filter out the howling signal. It removes the notch method's dependence on howling frequency point detection and estimates the acoustic feedback signal in real time. However, adaptive filtering can only filter out linear components, and it struggles to perform well in instant messaging scenes, which contain many nonlinear factors.
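A minimal sketch of the adaptive filtering idea is given below, using a normalized LMS (NLMS) update as one common choice; the source does not name a specific algorithm. The three-tap feedback path, the step size, and the white-noise reference are hypothetical values chosen for the example, and the near-end source is set to zero so that the result is easy to check.

```python
import random


def nlms_identify(reference, mic, taps, mu=0.5, eps=1e-8):
    """Estimate a linear feedback path from the played signal to the microphone.

    reference: samples sent to the loudspeaker
    mic:       samples captured by the microphone
    Returns the estimated FIR weights and the residual (mic minus estimated feedback).
    """
    w = [0.0] * taps
    buf = [0.0] * taps  # most recent reference samples, newest first
    residual = []
    for u, d in zip(reference, mic):
        buf = [u] + buf[:-1]
        y = sum(wi * bi for wi, bi in zip(w, buf))   # estimated feedback sample
        e = d - y                                    # feedback-cancelled output
        norm = sum(bi * bi for bi in buf) + eps
        w = [wi + mu * e * bi / norm for wi, bi in zip(w, buf)]
        residual.append(e)
    return w, residual


rng = random.Random(0)
true_path = [0.5, -0.3, 0.2]  # hypothetical linear feedback path
reference = [rng.gauss(0.0, 1.0) for _ in range(2000)]
mic = [
    sum(h * reference[n - k] for k, h in enumerate(true_path) if n - k >= 0)
    for n in range(len(reference))
]

weights, residual = nlms_identify(reference, mic, taps=len(true_path))
```

Here the filter recovers the path because the path is a fixed linear FIR response. A linear filter of this kind has nothing with which to model nonlinear processing in the loop, which is the limitation noted above.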
Therefore, the frequency-shift/phase-shift method, the notch method, and the adaptive filtering method are better suited to acoustic scenes in which the conditions for howling are relatively fixed, such as traditional conferences and hearing aid systems. Their howling suppression is poor in the acoustic loop of instant messaging, which has many nonlinearities and uncertainties.
Scheme two first detects the howling frequency point, then removes the signal at that frequency point, and finally repairs the signal near the howling frequency point through a neural network.
From this analysis, scheme two is still essentially a notch-like scheme, to which it adds a signal restoration network. It improves sound quality relative to the traditional notch method but, like the notch method, is still not applicable to the acoustic loop of instant messaging.
In view of the above, in the howling suppression method according to the embodiment of the present disclosure, within an acoustic loop of instant messaging, the audio features of the audio signal to be processed are input into a howling detection model to obtain howling feature parameters, and howling suppression is performed based on the audio features and the howling feature parameters. Because the causes of howling in the acoustic loop of an instant messaging scene are uncertain, complex and non-stationary howling signals are more likely to appear there than in common acoustic scenes such as conference rooms, auditoriums, and KTV rooms, and such howling is difficult to detect and suppress. Accordingly, in the embodiment of the present disclosure, the howling signal is first detected accurately by the howling detection model, which is trained on the richer howling feature parameters found in instant messaging scenes. On that basis, the howling suppression model learns how the howling feature parameters influence howling suppression, and then performs howling suppression on the audio signal to be processed.
Having described the basic principles of the present disclosure, various non-limiting embodiments of the present disclosure are specifically described below.
Exemplary method
A howling suppression method according to an exemplary embodiment of the present disclosure is described below with reference to fig. 4. The howling suppression method is applied to a first device that is in instant communication with a second device, the first device and the second device belonging to the same acoustic loop. The first device comprises a first communication module, a first audio acquisition module, and a first audio playing module, and the second device comprises a second communication module, a second audio acquisition module, and a second audio playing module.
Referring to fig. 4, the howling suppression method may include the steps of:
Step S110, extracting audio features of an audio signal to be processed, wherein the audio signal to be processed is an audio signal acquired by the first device through its first audio acquisition module, and is a superposition of an acoustic signal emitted by a sound source and a second audio signal played by the second audio playing module;
Step S120, inputting the audio features into a howling detection model, wherein the howling detection model outputs howling feature parameters of the audio signal to be processed;
Step S130, inputting the howling feature parameters and the audio features into a howling suppression model;
Step S140, obtaining a howling-suppressed audio signal according to the output of the howling suppression model.
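The data flow of steps S110 to S140 can be sketched as below. Only the wiring between the steps reflects the text. The feature layout and both "models" are toy stand-ins invented for this illustration (simple arithmetic rather than trained models), and all function names are hypothetical.

```python
from typing import Callable, List, Sequence

Features = List[List[float]]  # one feature vector per frame (assumed layout)


def extract_features(signal: Sequence[float], frame_len: int = 4) -> Features:
    """S110 stand-in: split the signal into frames and use each raw frame as its feature vector."""
    return [
        list(signal[i:i + frame_len])
        for i in range(0, len(signal) - frame_len + 1, frame_len)
    ]


def toy_detection_model(features: Features) -> List[float]:
    """S120 stand-in: one howling parameter per frame in [0, 1], here the clipped mean level."""
    return [min(1.0, sum(abs(v) for v in f) / len(f)) for f in features]


def toy_suppression_model(howl_params: List[float], features: Features) -> Features:
    """S130 stand-in: attenuate each frame according to its howling parameter."""
    return [[v * (1.0 - p) for v in f] for p, f in zip(howl_params, features)]


def suppress_howling(
    signal: Sequence[float],
    detect: Callable[[Features], List[float]] = toy_detection_model,
    suppress: Callable[[List[float], Features], Features] = toy_suppression_model,
) -> List[float]:
    features = extract_features(signal)        # S110
    howl_params = detect(features)             # S120
    out_features = suppress(howl_params, features)  # S130: parameters AND features go in
    # S140: turn the suppression model's output back into a signal
    return [v for frame in out_features for v in frame]


quiet_then_loud = [0.5, -0.5, 0.5, -0.5, 2.0, -2.0, 2.0, -2.0]
out = suppress_howling(quiet_then_loud)
```

Passing the models in as callables keeps the cascade explicit: the detection model sees only the features, while the suppression model receives both the detection output and the same features.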
In the howling suppression method according to the embodiment of the present disclosure, within an acoustic loop of instant messaging, the audio features of the audio signal to be processed are input into a howling detection model to obtain howling feature parameters, and howling suppression is performed based on the audio features and the howling feature parameters. Because the causes of howling in the acoustic loop of an instant messaging scene are uncertain, complex and non-stationary howling signals are more likely to appear there than in common acoustic scenes such as conference rooms, auditoriums, and KTV rooms, and such howling is difficult to detect and suppress. Accordingly, in the embodiment of the present disclosure, the howling signal is first detected accurately by the howling detection model, which is trained on the richer howling feature parameters found in instant messaging scenes. On that basis, the howling suppression model learns how the howling feature parameters influence howling suppression, and then performs howling suppression on the audio signal to be processed.
Referring now to fig. 5, fig. 5 schematically illustrates a schematic diagram of an acoustic loop of an instant messaging scenario in accordance with an embodiment of the present disclosure. As shown in fig. 5, the first device 10, the second device 20, and the user 30 (the sound source) are located in the same physical space A. The physical space A may be, for example, a conference room or an office. The first device 10 comprises a first audio acquisition module 11, a first communication module 12, and a first audio playing module 13, and the second device 20 comprises a second audio acquisition module 21, a second communication module 22, and a second audio playing module 23. The first device 10 and the second device 20 are in instant communication with each other.
In fig. 5, two acoustic loops C1 (solid arrows) and C2 (dashed arrows) are shown, where the acoustic loop C1 is an audio signal transmission path generated based on the first audio acquisition module 11, the first communication module 12, the second communication module 22, and the second audio playing module 23, and the acoustic loop C2 is an audio signal transmission path generated based on the second audio acquisition module 21, the second communication module 22, the first communication module 12, and the first audio playing module 13.
Taking the acoustic loop C1 as an example, when the sound source 30 emits an acoustic signal, the acoustic signal is collected by the first audio collection module 11 and sent to the second communication module 22 through the first communication module 12. The second communication module 22 sends the received audio signal to the second audio playing module 23 for playing, and the second audio signal played by the second audio playing module 23 is collected by the first audio collection module 11, superimposed on the acoustic signal emitted by the sound source 30, and enters the acoustic loop C1 again, completing the closed-loop transmission. Further, the second audio playing module 23 is located within the pick-up distance of the first audio collection module 11, and the playing volume of the second audio playing module 23 is sufficient for the second audio signal to be picked up by the first audio collection module 11, so that the audio signal completes its transmission within the acoustic loop C1. The acoustic loop C2 completes the closed-loop transmission of the audio signal in a similar manner. Thus, since the second device 20 can also act as the first device in its acoustic loop C2, the howling suppression method of the present disclosure can also be applied in the second device 20.
Referring now to fig. 6, fig. 6 schematically illustrates a model configuration diagram of a howling suppression method according to an embodiment of the present disclosure.
As shown in fig. 6, in the howling suppression method of the present application, audio features are first extracted from the audio signal to be processed and input to the howling detection model M1. The howling feature parameters output by the howling detection model M1, together with the audio features, are then input into the howling suppression model M2 to suppress howling. Since the audio features are feature data extracted from the audio signal, they cannot be played directly. Therefore, the features output by the howling suppression model M2 need to be restored to obtain a playable howling-suppressed audio signal.
In exemplary embodiments of the present disclosure, a short-time Fourier transform (STFT) may be employed to extract features from the audio signal to be processed. Given the instant messaging scenario, different sampling rates may be used in the short-time Fourier transform, such as 48 kHz (music mode) and 16 kHz (speech mode). Further, the number of points of the Fourier transform may be selected as required (the more points, the higher the frequency resolution): for example, a music scene with a 48 kHz sampling rate needs more points, whereas a voice scene with a 16 kHz sampling rate can use fewer. In some embodiments, 512 points may be selected. The frame length and frame shift of the audio signal to be processed may also be selected in combination with the audio acquisition module, the audio playing module, or other processing modules that need to process the audio signal to be processed; for example, the frame length may be 20 ms and the frame shift 10 ms. The extracted audio features may be one or more of spectral features, Bark spectral features, mel-cepstrum features and fundamental frequency features of the audio signal to be processed.
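The feature extraction described above can be illustrated with a minimal sketch, assuming NumPy, a 16 kHz speech-mode signal, a 20 ms frame, a 10 ms frame shift and a 512-point transform. The function name `stft_magnitude` and the Hann window are assumptions made for illustration only; the disclosure does not specify a window or an implementation.

```python
import numpy as np

def stft_magnitude(signal, sample_rate=16000, frame_ms=20, shift_ms=10, n_fft=512):
    """Return (complex spectrum, magnitude), each of shape (frames, n_fft // 2 + 1)."""
    frame_len = sample_rate * frame_ms // 1000   # 320 samples at 16 kHz
    hop = sample_rate * shift_ms // 1000         # 160 samples at 16 kHz
    window = np.hanning(frame_len)               # window choice is an assumption
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    # rfft zero-pads every 320-sample frame to n_fft points
    spectrum = np.fft.rfft(frames, n=n_fft, axis=1)
    return spectrum, np.abs(spectrum)

# One second of a 1 kHz tone stands in for a captured audio signal.
t = np.arange(16000) / 16000
spectrum, magnitude = stft_magnitude(np.sin(2 * np.pi * 1000 * t))
peak_bin = int(magnitude.mean(axis=0).argmax())  # bin width is 16000 / 512 = 31.25 Hz
```

With these settings each frequency point spans 31.25 Hz; a 48 kHz music-mode signal would need more points to reach a comparable resolution.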
In the exemplary embodiments of the present disclosure, the howling suppression model M2 may directly output the suppressed audio feature. After the howling suppression model M2 outputs the suppressed audio feature, a reverse operation with respect to feature extraction may be performed to restore the suppressed audio feature to obtain the howling suppression audio signal. Thus, an audio signal for transmission or playback can be obtained by feature restoration.
In an exemplary embodiment of the present disclosure, the howling suppression model M2 may also output a howling suppression mask, which characterizes the howling suppression frequency point gain of the audio features of the reference sample signal compared to the audio features of the audio signal to be processed. In other words, the howling suppression mask provides a gain value at each frequency point of the audio feature, and the amplitude value at each frequency point of the audio feature is multiplied by the corresponding gain value to suppress howling in the audio feature. Thus, during the training of the howling suppression model M2, the audio feature of the howling sample signal may be used as the input of the howling suppression model M2, the howling suppression mask may be obtained from the howling suppression frequency point gain of the audio feature of the reference sample signal compared to the audio feature of the howling sample signal, and this howling suppression mask may be used as the output of the howling suppression model M2, thereby training the howling suppression model M2 to output a howling suppression mask capable of howling suppression. Because the howling suppression mask only represents the howling suppression frequency point gain of the audio features of the reference sample signal compared to the audio features of the audio signal to be processed, it contains less information than the audio features, so training the howling suppression model M2 on the howling suppression mask can improve the training efficiency of the howling suppression model M2. Meanwhile, the howling suppression model M2 obtains the howling suppression mask by computation on the input audio features, so that the amount of computation is smaller and the computational efficiency is higher.
After the howling suppression model M2 outputs the howling suppression mask, the howling suppression mask is multiplied with the audio feature of the audio signal to be processed (the howling suppression frequency point gain of the howling suppression mask is used for adjusting the energy of each frequency point of the audio feature of the audio signal to be processed to remove/suppress the howling) so as to obtain a suppressed audio feature, and a reverse operation with respect to feature extraction is performed on the suppressed audio feature, so that the suppressed audio feature is restored to obtain the howling suppression audio signal. Thus, an audio signal for transmission or playback can be obtained by feature restoration.
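The mask multiplication and feature restoration described above can be sketched as follows, assuming NumPy and the same 20 ms / 10 ms / 512-point settings. The mask here is written by hand around a 3 kHz test tone purely for illustration; in the disclosure it would be the output of the howling suppression model M2. The helper names `analyze`, `apply_mask` and `restore` are hypothetical.

```python
import numpy as np

FRAME, HOP, N_FFT = 320, 160, 512  # 20 ms frame, 10 ms shift, 512 points at 16 kHz

def analyze(x):
    """Windowed frames -> complex spectrum of shape (frames, N_FFT // 2 + 1)."""
    window = np.hanning(FRAME + 1)[:-1]  # periodic Hann: shifted copies sum to 1 at 50% overlap
    n_frames = 1 + (len(x) - FRAME) // HOP
    frames = np.stack([x[i * HOP:i * HOP + FRAME] * window for i in range(n_frames)])
    return np.fft.rfft(frames, n=N_FFT, axis=1)

def apply_mask(spectrum, mask):
    """Multiply every frequency point by its gain (the mask broadcasts over frames)."""
    return spectrum * mask

def restore(spectrum, length):
    """Reverse of analyze: inverse FFT per frame, then overlap-add."""
    frames = np.fft.irfft(spectrum, n=N_FFT, axis=1)[:, :FRAME]
    out = np.zeros(length)
    for i, frame in enumerate(frames):
        out[i * HOP:i * HOP + FRAME] += frame
    return out

t = np.arange(16000) / 16000
wanted = np.sin(2 * np.pi * 500 * t)     # stand-in for the wanted audio
howl = np.sin(2 * np.pi * 3000 * t)      # stand-in for a single-frequency howl
noisy = wanted + howl

mask = np.ones(N_FFT // 2 + 1)           # hand-written mask, NOT a model output
mask[93:100] = 0.05                      # frequency points around 3 kHz (31.25 Hz per point)
suppressed = restore(apply_mask(analyze(noisy), mask), len(noisy))
identity = restore(analyze(noisy), len(noisy))  # all-ones mask: plain analysis/restoration
```

With an all-ones mask the interior of the signal is reconstructed unchanged, which is why the restoration step can be treated as the reverse operation of feature extraction.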
Referring now to fig. 7, fig. 7 schematically illustrates a schematic diagram of a howling detection model according to an embodiment of the present disclosure.
The howling detection model sequentially comprises an input processing layer 201, a first intermediate processing layer 204 and a classification output layer 206. The audio feature is assigned to the input processing layer 201 for input to the first intermediate processing layer 204 via the input processing layer 201. The first intermediate processing layer 204 is configured to obtain a local feature of the audio feature, and take the local feature as a howling intermediate feature. The classification output layer 206 is configured to classify the howling intermediate feature to obtain a howling result feature, where the howling intermediate feature and the howling result feature are used as the howling feature parameter to be input to the howling suppression model.
In the exemplary embodiment of the present disclosure, the howling detection model further includes a trunk layer 202 and a loop layer 203 connected in sequence between the input processing layer 201 and the first intermediate processing layer 204. The trunk layer 202 is configured to perform convolution and/or pooling on the data input to the trunk layer 202, so as to further compress the data input to the trunk layer 202. The loop layer 203 is configured to establish an association relationship between the input data of the loop layer 203 and the output data of the loop layer 203. Since the audio feature is a sequence feature that changes with time, howling suppression of the audio feature at a given time is associated with the audio features at adjacent times; the loop layer 203 therefore allows the association relationship of the sequence feature to be learned, thereby improving the accuracy of howling suppression. The howling detection model further comprises an attention layer 205 connected between the first intermediate processing layer 204 and the classification output layer 206, the attention layer 205 being arranged to perform a weighted summation of the data input to the attention layer 205. The weights in the attention layer 205 are also learned during the training of the howling detection model: the influence of the data output by the first intermediate processing layer 204 on the classification output layer 206 is learned through training, so that the accuracy of the howling result features output by the classification output layer 206 is improved. The howling detection model may have other structures, and the present disclosure is not limited thereto.
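The weighted summation performed by an attention layer can be sketched as follows, assuming NumPy. The softmax scoring with a single score vector is one common form of attention pooling and is an assumption here; the disclosure only states that the layer weights and sums its inputs and that the weights are learned. The names `attention_pool`, `hidden` and `score_vector` are hypothetical.

```python
import numpy as np

def attention_pool(hidden, score_vector):
    """Weighted sum over time steps of `hidden`, shape (time, features)."""
    scores = hidden @ score_vector                 # one score per time step
    scores = scores - scores.max()                 # subtract max for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ hidden, weights               # context vector and the weights used

rng = np.random.default_rng(0)
hidden = rng.normal(size=(10, 4))      # placeholder output of the intermediate layer
score_vector = rng.normal(size=4)      # would be learned in training; random here
context, weights = attention_pool(hidden, score_vector)
```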
In the exemplary embodiment of the present disclosure, due to the fact that in the acoustic loop suitable for instant messaging, the propagation condition of the propagation path of the audio signal is relatively complex, and has relatively high uncertainty, in order to facilitate the howling suppression model to obtain richer information of the howling signal, the howling feature parameters output by the howling detection model may include the howling intermediate feature and the howling result feature. Wherein the howling intermediate feature is used to represent the spectral features of the howling signal. The howling result features include one or more of howling detection results, howling levels, howling types, howling continuity, and frequency point movement parameters. On the other hand, because the howling signals of the acoustic loop of instant messaging have the characteristics of intermittence, multi-frequency points, frequency point movement, frequency point diffusion and the like which are different from common acoustic scenes, the characteristics of the howling signals which are different from the common acoustic scenes are detected through the howling detection model, so that the howling result characteristics comprise one or more of howling detection results, howling grades, howling types, howling continuity and frequency point movement parameters, and the model parameters of the howling suppression model can be adjusted according to the howling result characteristics in the training process, and the howling signals with the howling result characteristics are removed from audio signals to be processed.
Further, the howling detection result indicates whether the audio feature input to the howling detection model has howling. The howling level indicates the howling strength of the audio feature input to the howling detection model. The howling type includes single-frequency howling, multi-frequency howling and diffuse howling. The howling continuity includes continuous howling and intermittent howling. The frequency point movement parameters include a frequency point movement type parameter and a frequency point movement amplitude parameter: the frequency point movement type parameter indicates whether the audio feature input to the howling detection model has frequency point movement, and the frequency point movement amplitude parameter indicates the amplitude of that frequency point movement. The above howling result features may be numeric labels or one-hot vectors. Thus, the howling characteristics are described and vectorized in a number of different ways.
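A one-hot encoding of the howling result features listed above might look like the following sketch. The category lists follow the text; the number of howling levels (`NUM_LEVELS = 4`), the field order and the function names are assumptions made only for illustration.

```python
HOWLING_TYPES = ["single_frequency", "multi_frequency", "diffuse"]
CONTINUITY = ["continuous", "intermittent"]
NUM_LEVELS = 4  # assumption: the text does not say how many howling levels exist

def one_hot(index, size):
    """Return a list of `size` zeros with a single 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def encode_result(detected, level, howl_type, continuity, moving, move_amplitude):
    """Concatenate all howling result features into one flat vector."""
    return (
        [1.0 if detected else 0.0]                         # howling detection result
        + one_hot(level, NUM_LEVELS)                       # howling level
        + one_hot(HOWLING_TYPES.index(howl_type), len(HOWLING_TYPES))
        + one_hot(CONTINUITY.index(continuity), len(CONTINUITY))
        + [1.0 if moving else 0.0, float(move_amplitude)]  # frequency point movement
    )

vector = encode_result(True, 2, "multi_frequency", "intermittent", True, 0.5)
```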
Referring now to fig. 8, fig. 8 schematically illustrates a schematic diagram of a howling suppression model according to an embodiment of the present disclosure.
The howling suppression model comprises in order an encoder 214, a second intermediate processing layer 215 and a decoder 216. The encoder 214 is configured to perform feature encoding on the audio feature and the howling result feature to obtain an encoded feature, the second intermediate processing layer 215 is configured to perform feature screening on the encoded feature and the howling intermediate feature, and the decoder 216 is configured to perform feature decoding on the screened encoded feature. Because the howling middle feature is already processed by the middle layer in the howling detection model, coding is not needed to be performed again in the howling suppression model, so that the howling middle feature can be input into the second middle processing layer 215 of the howling suppression model, better learning of the howling feature by the howling suppression model is facilitated, and repeated processing of the howling middle feature is avoided while a better suppression effect is obtained.
In an exemplary embodiment of the present disclosure, the second intermediate processing layer 215 may sequentially include a plurality of interconnected long short-term memory (LSTM) units and a fully connected layer. The LSTM units perform feature screening on the encoded features, and the fully connected layer performs a weighted summation of the outputs of the LSTM units. Thus, the relation between howling features and audio features is effectively learned through the plurality of LSTM units and the fully connected layer.
In the exemplary embodiments of the present disclosure, the audio features may be convolved by the convolution layer 211 before being input into the encoder 214, thereby ensuring that the feature size of the input audio features is adapted to the encoder 214. The howling result features may be passed through the embedding layer 212, which learns a higher-dimensional feature representation, before being input to the encoder 214, thereby ensuring that the feature size of the input howling result features is adapted to the encoder 214. The howling intermediate features may be convolved by the convolution layer 213 before being input to the second intermediate processing layer 215, thereby ensuring that the feature size of the input howling intermediate features is adapted to the second intermediate processing layer 215. The convolution layer 217 may be, for example, a deconvolution layer for restoring the features output by the decoder 216 so that they are consistent in size with the audio features input to the howling suppression model.
The above is merely a network structure schematically showing the howling suppression model, and the present disclosure is not limited thereto.
Referring now to fig. 9, fig. 9 schematically illustrates a flow chart of training a howling detection model according to an embodiment of the present disclosure. Fig. 9 shows the following steps in total:
Step S101, acquiring a first sample signal set.
In an exemplary embodiment of the present disclosure, the first set of sample signals includes a plurality of first sample signals and howling characteristic parameters of the first sample signals. The howling characteristic parameter and the howling result characteristic include the same parameter item. For example, the howling result features include a howling detection result, a howling level, a howling type, a howling continuity, and a frequency point movement parameter, and the howling characteristic parameters also include a howling detection result, a howling level, a howling type, a howling continuity, and a frequency point movement parameter. The parameter values of the parameter items vary depending on the specific audio signal.
The first sample signal includes a reference sample signal and a howling sample signal. The reference sample signal is an audio signal played by a playback device in the acoustic loop. Specifically, during model training, the playback device is placed at the position of the sound source 30 in fig. 5 and plays the reference sample signal. The howling sample signal is the superposition, as collected by the first audio collection module, of the audio signal played by the playback device and the second audio signal played by the second audio playing module.
Step S102, extracting a first sample audio feature of the first sample signal.
Specifically, the algorithm for extracting the first sample audio feature may be the same as the algorithm described above for extracting the audio features of the audio signal to be processed.
Step S103, taking the first sample audio characteristic as the input of the howling detection model, and adjusting the model parameters of the howling detection model according to the difference between the output of the howling detection model and the corresponding howling characteristic parameters.
Specifically, the model parameters of the howling detection model are adjusted, so that the howling detection model can output howling result characteristics consistent with the corresponding howling characteristic parameters, and the detection performance of the howling detection model is improved.
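The idea in step S103 of adjusting model parameters according to the difference between the model output and the labelled howling characteristic parameters can be sketched as follows, assuming NumPy. A single-layer logistic classifier trained by gradient descent is swapped in here as a stand-in for the howling detection model, and the features and labels are random placeholders rather than real audio features, so this only illustrates the update rule, not the disclosed network.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(pred, label):
    """Mean binary cross-entropy between predictions and 0/1 labels."""
    eps = 1e-12
    return float(-np.mean(label * np.log(pred + eps) + (1 - label) * np.log(1 - pred + eps)))

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 8))           # placeholder first sample audio features
labels = (features[:, 0] > 0).astype(float)    # placeholder howling detection labels

weights = np.zeros(8)
bias = 0.0
initial_loss = bce(sigmoid(features @ weights + bias), labels)
for _ in range(200):
    pred = sigmoid(features @ weights + bias)
    error = pred - labels                      # difference between output and label
    weights -= 0.1 * features.T @ error / len(labels)
    bias -= 0.1 * error.mean()
final_loss = bce(sigmoid(features @ weights + bias), labels)
```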
Referring now to fig. 10, fig. 10 schematically illustrates a flow chart of training a howling suppression model according to an embodiment of the present disclosure. Fig. 10 shows the following steps in total:
Step S104, acquiring a second sample signal set.
In an exemplary embodiment of the present disclosure, the second set of sample signals includes a plurality of second sample signal pairs, each of the second sample signal pairs including a howling sample signal and a reference sample signal. The reference sample signal is an audio signal played by a playback device in the acoustic loop. Specifically, during model training, the playback device is placed at the position of the sound source 30 in fig. 5 and plays the reference sample signal. The howling sample signal is the superposition, as collected by the first audio collection module, of the audio signal played by the playback device and the second audio signal played by the second audio playing module.
Step S105, extracting a second sample audio feature of the howling sample signal in the second set of sample signals.
Specifically, the algorithm for extracting the second sample audio feature may be the same as the aforementioned algorithm for extracting the audio features of the audio signal to be processed.
Step S106, inputting the second sample audio characteristics into the howling detection model to obtain the howling characteristic parameters.
Step S107, inputting the second sample audio feature and the howling feature parameter into the howling suppression model, and adjusting the model parameters of the howling suppression model according to the difference between the output of the howling suppression model and the corresponding reference sample signal.
Specifically, the model parameters of the howling suppression model are adjusted so that the howling suppression model outputs a suppressed audio signal that is closer to the reference sample signal, thereby improving the suppression performance of the howling suppression model.
In an exemplary embodiment of the present disclosure, the howling detection model is trained prior to the howling suppression model. Therefore, in the training of the howling suppression model, accurate howling characteristic parameters output by the howling detection model can be obtained, so that the training efficiency of the howling suppression model and the performance of the howling suppression model are improved.
In the exemplary embodiments of the present disclosure, since the training of the howling detection model and the howling suppression model requires a reference sample signal without a howling signal, the above-described first sample signal set and second sample signal set need to include the howling sample signal with the howling signal and the reference sample signal without the howling signal. Further, the second set of sample signals is used to train the howling suppression model, so that the howling sample signals and the reference sample signals thereof need to be paired, i.e. the same signal source is used, the howling and non-howling signals are recorded in the acoustic loop of instant messaging as shown in fig. 5, and time alignment is performed to serve as paired howling sample signals and reference sample signals, respectively. The howling sample signal in the first sample signal set and the howling sample signal in the second sample signal set may be the same signal or different signals, and the reference sample signal in the first sample signal set and the reference sample signal in the second sample signal set may be the same signal or different signals.
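The time alignment of a paired howling sample signal and reference sample signal can be sketched as follows, assuming NumPy. Cross-correlation is one common way to estimate the delay and is an assumption here, since the disclosure does not name an alignment method; `estimate_delay` and `align` are hypothetical helpers, and the recorded signal is simulated as a delayed copy of the reference.

```python
import numpy as np

def estimate_delay(reference, recorded):
    """Lag (in samples) at which `recorded` best matches `reference`."""
    corr = np.correlate(recorded, reference, mode="full")
    # Index 0 of the "full" output corresponds to lag -(len(reference) - 1).
    return int(corr.argmax()) - (len(reference) - 1)

def align(reference, recorded):
    """Trim both signals so that they start at the same instant."""
    start = max(estimate_delay(reference, recorded), 0)
    aligned = recorded[start:start + len(reference)]
    return reference[:len(aligned)], aligned

rng = np.random.default_rng(0)
reference = rng.normal(size=1000)                                   # placeholder reference signal
recorded = np.concatenate([np.zeros(37), reference, np.zeros(20)])  # simulated 37-sample delay
delay = estimate_delay(reference, recorded)
ref_aligned, rec_aligned = align(reference, recorded)
```

A real howling recording also contains the howling component, so the correlation peak would be less sharp than in this idealized case.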
Because of the specificity of the instant messaging scenario, the conventional data set and signal construction method are difficult to simulate the real howling condition, and such an open source data set is not currently available, so that the first sample signal set and the second sample signal set need to be actually collected. In some implementations, the howling sample signal in the first sample signal set and the howling sample signal in the second sample signal set are made identical, and the reference sample signal in the first sample signal set and the reference sample signal in the second sample signal set are made identical, so that data acquisition steps of the sample signal sets are reduced, and data acquisition efficiency is improved.
Further, in view of the complexity of the instant messaging scenario, the acquisition scheme for the first sample signal set and the second sample signal set may involve different audio content, different devices, different environments, different communication parameters and so on, thereby improving the robustness of the howling suppression algorithm.
In exemplary embodiments of the present disclosure, the reference sample signal may be generated based on different audio content. The audio content includes one or more of speech, music, ambient sound, ringtone, bird song, whistle, such that the reference sample signal overlays the different audio content.
In an exemplary embodiment of the disclosure, the first device and the second device comprise an audio processing module having an audio processing algorithm, the first device and the second device having different capabilities and different audio processing algorithms for different howling sample signals, whereby devices of different capabilities and devices having different audio processing algorithms are covered in the collection of howling sample signals.
In an exemplary embodiment of the present disclosure, the spatial region in which the acoustic loop is located has different noise environments for different howling sample signals. A first audio signal acquired in the spatial region having a first noise environment and a second audio signal acquired in the spatial region having a second noise environment have different signal-to-noise ratios under the same acquisition conditions, where the same acquisition conditions include the same device, the same spatial region and the same sound source. In this way, environments with different signal-to-noise ratios are covered in the acquisition of the howling sample signals. For example, the same conference room may be given different background noise, so that different howling sample signals are collected in the same conference room under different background noise.
In an exemplary embodiment of the present disclosure, the audio transmission parameters between the first device and the second device are different for different howling sample signals, the audio transmission parameters comprising one or more of a relative position between the first device and the second device, a network communication parameter between the first device and the second device, and a real-time volume of the first device and the second device. In this way, acoustic loops with different audio transmission parameters are covered in the acquisition of the howling sample signals.
Therefore, the acquisition of the reference sample signal and the howling sample signal is convenient for covering a plurality of different instant messaging situations, so that the robustness of the obtained howling detection model and the howling suppression model is improved.
In an exemplary embodiment of the present disclosure, the loss function of the howling suppression model is any one of an error loss function, an audio quality loss function and an adversarial loss function. The loss function of the howling suppression model may also be a weighted sum of several of the error loss function, the audio quality loss function and the adversarial loss function. Each loss function is used, as far as possible, in model training; when the model is used for howling suppression, the loss functions do not need to be calculated.
In an exemplary embodiment of the present disclosure, the loss function of the howling suppression model comprises an error loss function, which is obtained based on an error calculation between the howling suppression audio signal and a reference sample signal. Specifically, the error loss function may be calculated based on an MSE (mean-square error) between the howling suppressed audio signal and the reference sample signal.
In an exemplary embodiment of the present disclosure, the loss function of the howling suppression model includes an audio quality loss function calculated based on an audio quality of the howling suppression audio signal and an audio quality of the reference sample signal. Specifically, the average subjective opinion score (mean opinion score, MOS) of the howling suppressed audio signal and the average subjective opinion score of the reference sample signal may be obtained, respectively, and the calculation of the audio quality loss function may be performed from the obtained average subjective opinion scores. The average subjective opinion score may be obtained, for example, by a trained audio quality scoring network model, which is not a limitation of the present disclosure.
In an exemplary embodiment of the disclosure, the loss function of the howling suppression model includes an adversarial loss function, which is calculated based on the probability that a discriminator makes a correct discrimination. The discriminator is configured to discriminate an output result of the howling suppression model as a first signal or a second signal, where the first signal characterizes the corresponding output result as the howling suppression signal and the second signal characterizes the corresponding output result as a reference sample signal; the discriminator makes a correct discrimination when it discriminates the output result of the howling suppression model as the first signal. Specifically, the objective of the adversarial loss function is for the discriminator to discriminate the output result of the howling suppression model as the second signal, that is, as an audio signal that itself has no howling. Therefore, through the adversarial interplay between the howling suppression model and the discriminator, the output result of the howling suppression model achieves a better suppression effect.
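The three loss terms and their weighted sum can be sketched as follows. Only the MSE form of the error loss is stated in the text; the specific forms of the quality term (a shortfall in mean opinion score) and the adversarial term (negative log of the probability that the discriminator judges the output to be a reference signal), as well as the weights, are assumptions for illustration. The MOS values and the discriminator probability are passed in as plain numbers because the scoring network and the discriminator are outside this sketch.

```python
import math

def mse_loss(estimate, reference):
    """Mean-square error between the suppressed signal and the reference."""
    return sum((e - r) ** 2 for e, r in zip(estimate, reference)) / len(reference)

def quality_loss(mos_estimate, mos_reference):
    """Assumed form: penalize only when the output scores below the reference."""
    return max(0.0, mos_reference - mos_estimate)

def adversarial_loss(p_judged_reference):
    """Assumed form: small when the discriminator takes the output for a reference signal."""
    return -math.log(p_judged_reference)

def total_loss(estimate, reference, mos_estimate, mos_reference,
               p_judged_reference, weights=(1.0, 0.1, 0.01)):
    """Weighted sum of the three terms; the weights are illustrative only."""
    w_err, w_quality, w_adv = weights
    return (w_err * mse_loss(estimate, reference)
            + w_quality * quality_loss(mos_estimate, mos_reference)
            + w_adv * adversarial_loss(p_judged_reference))

loss = total_loss([1.0, 2.0, 3.0], [1.0, 2.0, 5.0], 3.5, 4.0, 0.5)
```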
In exemplary embodiments of the present disclosure, the howling suppression model may be multiplexed for noise suppression. Thus, the step of obtaining the howling suppressed audio signal based on the output of the howling suppression model may further comprise performing noise cancellation on the howling suppressed audio signal, where the noise is ambient noise in the audio signal and has a fixed frequency. The ambient noise is, for example, fixed-frequency noise such as the sound of a fan or an air conditioner. In one exemplary embodiment, the howling suppression model may be a deep complex convolution recurrent network (DCCRN). Since the howling suppression model and a noise suppression model both serve to eliminate specific signals, their design considerations are substantially similar, and the present disclosure can therefore reuse the howling suppression model for noise suppression, so that one model performs both howling suppression and noise suppression. To achieve this multiplexing, the present disclosure processes the sample signal set of the howling suppression model. In an exemplary embodiment of the disclosure, the howling sample signal in the second sample signal pair may be obtained by the following steps: superposing the reference sample signal played by the sound source, as acquired by the first audio acquisition module, and the second audio signal played by the second audio playing module to form a quasi-howling sample signal, where the reference sample signal has no noise; and superposing the quasi-howling sample signal and a noise audio signal to generate the howling sample signal.
Therefore, the howling sample signals in the second sample signal pairs of the howling suppression model contain both howling and noise, while the reference sample signals contain neither, so that a howling suppression model trained according to the above training method can perform both howling suppression and noise suppression.
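The construction of a training pair with both howling and noise can be sketched as follows, assuming NumPy. In the disclosure the first superposition happens acoustically at the first audio acquisition module; representing the two components as separate arrays is a simplification, and `build_training_pair` and `noise_gain` are hypothetical names.

```python
import numpy as np

def build_training_pair(captured_reference, captured_playback, noise, noise_gain=1.0):
    """Return (howling sample with noise, clean reference), trimmed to a common length."""
    n = min(len(captured_reference), len(captured_playback), len(noise))
    # Quasi-howling sample: noise-free reference plus the second audio signal
    quasi_howling = captured_reference[:n] + captured_playback[:n]
    # Final howling sample: quasi-howling sample plus a noise audio signal
    howling_sample = quasi_howling + noise_gain * noise[:n]
    return howling_sample, captured_reference[:n]

# Tiny placeholder arrays stand in for real recordings.
captured_reference = np.array([1.0, 2.0, 3.0, 4.0])
captured_playback = np.array([0.5, 0.5, 0.5])
noise = np.array([0.1, 0.1, 0.1, 0.1, 0.1])
howling_sample, reference = build_training_pair(captured_reference, captured_playback, noise)
```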
Referring now to fig. 11, fig. 11 schematically illustrates a block diagram of a first audio processing module of a first device according to an embodiment of the present disclosure. The first device may comprise a first audio processing module 15. The first audio processing module 15 has an audio processing algorithm. The audio processing algorithm may include one or more of an acoustic echo cancellation algorithm, a noise suppression algorithm, and an automatic gain control algorithm. In fig. 11, the first audio processing module 15 includes an echo cancellation module 151, a howling suppression module 152, a noise cancellation module 153, and an automatic gain module 154.
The echo cancellation module 151 is configured to execute an acoustic echo cancellation algorithm, where the acoustic echo cancellation algorithm is configured to cancel acoustic echo in the audio signal acquired by the first audio acquisition module 11, where the acoustic echo includes an echo signal formed by the audio signal played by the first audio playing module (such as reference numeral 13 in fig. 5) acquired by the first audio acquisition module 11.
Howling suppression module 152 is configured to perform the howling suppression method shown in fig. 4.
The noise cancellation module 153 is configured to execute a noise suppression algorithm, where the noise suppression algorithm is configured to suppress noise in the audio signal collected by the first audio collection module 11, where the noise is environmental noise in the audio signal collected by the first audio collection module 11, and the environmental noise has a fixed frequency. The noise cancellation module 153 may be omitted when the howling suppression model in the howling suppression method is multiplexed for noise suppression.
The automatic gain module 154 is configured to execute an automatic gain control algorithm for adjusting the volume of the audio signal collected by the first audio collection module 11 to be within a set volume range.
In an exemplary embodiment of the present disclosure, the first device may further include a built-in audio processing module 14. The built-in audio processing module 14 is built into the first device. The built-in audio processing module 14 may be a non-linear processing module and, as related to the device itself, is customized by various manufacturers and is not controllable for audio signal processing at instant messaging. The built-in audio processing module 14 may have an on or off switch. The built-in audio processing module 14 may also perform one or more of an acoustic echo cancellation algorithm, a noise suppression algorithm, and an automatic gain control algorithm.
In the exemplary embodiment of the present disclosure, the audio signal acquired by the first audio acquisition module 11 of the first device is first processed by the built-in audio processing module 14 (if turned on) and then enters the echo cancellation module 151 of the first audio processing module 15, which performs acoustic echo cancellation on the audio signal to be processed. The echo-cancelled audio signal enters the howling suppression module 152 for howling suppression. The howling-suppressed audio signal enters the noise cancellation module 153, which applies the noise suppression algorithm. The noise-suppressed audio signal enters the automatic gain module 154, which applies the automatic gain control algorithm. The audio signal after automatic gain control can be output to the first communication module or played directly by the first audio playing module.
Thus, in the first audio processing module, the howling suppression module 152 performs howling suppression after the echo cancellation module 151 performs echo cancellation, which prevents interference from the echo signal. At the same time, the howling suppression module 152 performs howling suppression before the noise cancellation module 153 performs noise suppression, so that noise suppression does not further distort the howling signal and thereby reduce the accuracy of howling detection and the effect of howling suppression.
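The ordering of the four modules can be sketched as a simple stage chain. The helper `run_pipeline`, the stage names, and the identity placeholders are hypothetical; they only illustrate that howling suppression sits after echo cancellation and before noise suppression, and do not implement any of the algorithms.

```python
def run_pipeline(frame, stages):
    """Apply the processing stages to a frame in order and record the order used."""
    trace = []
    for name, stage in stages:
        frame = stage(frame)
        trace.append(name)
    return frame, trace

def identity(frame):
    """Placeholder stage standing in for a real algorithm (assumption)."""
    return frame

# Order taken from the description: 151 -> 152 -> 153 -> 154.
STAGES = [
    ("echo_cancellation", identity),    # module 151
    ("howling_suppression", identity),  # module 152
    ("noise_suppression", identity),    # module 153
    ("automatic_gain", identity),       # module 154
]
```

Calling `run_pipeline(frame, STAGES)` returns the processed frame together with the list of stage names in the order they were applied.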
The various embodiments provided by the present disclosure are described above only schematically, and the present disclosure is not limited thereto, and the embodiments may be used alone or in combination.
Exemplary apparatus
Having introduced the howling suppression method of the exemplary embodiment of the present disclosure, next, the howling suppression apparatus of the exemplary embodiment of the present disclosure is described with reference to fig. 12. The howling suppression apparatus is applied to a first device that performs instant messaging with a second device, the first device and the second device belonging to the same acoustic loop. The first device comprises a first communication module, a first audio acquisition module and a first audio playing module, and the second device comprises a second communication module, a second audio acquisition module and a second audio playing module.
Referring to fig. 12, the howling suppression apparatus 300 of an exemplary embodiment of the present disclosure may include an audio feature extraction module 310, a howling detection module 320, a howling suppression input module 330, and a howling suppression output module 340, wherein:
the audio feature extraction module 310 may be configured to extract an audio feature of an audio signal to be processed, where the audio signal to be processed is an audio signal collected by the first device through its first audio collection module, and the audio signal to be processed is a superposition of an acoustic signal sent by a sound source and a second audio signal played by the second audio playing module;
the howling detection module 320 may be configured to input the audio feature into a howling detection model, where the howling detection model outputs a howling feature parameter of the audio signal to be processed;
the howling suppression input module 330 may be configured to input the howling feature parameters and the audio features to a howling suppression model; and
the howling suppression output module 340 may be configured to obtain a howling suppression audio signal according to an output of the howling suppression model.
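The data flow through modules 310 to 340 can be sketched as follows. The function `suppress_howling` and the four `toy_*` stand-ins are hypothetical and exist only so that the sketch runs; they are not the disclosed feature extractor or models.

```python
def suppress_howling(signal, extract, detect, suppress, reconstruct):
    """Chain the four modules: extraction -> detection -> suppression -> output."""
    features = extract(signal)                 # module 310: audio features
    params = detect(features)                  # module 320: howling feature parameters
    model_output = suppress(features, params)  # module 330: features and parameters in
    return reconstruct(model_output)           # module 340: howling suppression audio signal

# Toy stand-ins (assumptions) so the sketch is runnable.
def toy_extract(signal):
    return [abs(s) for s in signal]

def toy_detect(features):
    return {"howling": max(features) > 0.9}

def toy_suppress(features, params):
    gain = 0.5 if params["howling"] else 1.0
    return [gain * f for f in features]

def toy_reconstruct(model_output):
    return list(model_output)
```

The point of the sketch is that the suppression step receives both the audio features and the output of the detection step, rather than the features alone.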
According to an exemplary embodiment of the disclosure, the acoustic loop is an audio signal transmission path generated based on the first audio acquisition module, the first communication module, the second communication module and the second audio playing module, and the audio signal to be processed completes closed loop transmission in the acoustic loop sequentially through the first audio acquisition module, the first communication module, the second audio playing module and the first audio acquisition module.
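To illustrate why a closed acoustic loop can produce howling, the sketch below uses a textbook single-path feedback model in which the captured signal is the source plus a delayed, scaled copy of what was captured earlier. The function `simulate_loop`, the single `loop_gain` factor, and the fixed `delay` are simplifying assumptions; a real loop also involves network delay, room response, and device processing.

```python
def simulate_loop(source, loop_gain, delay):
    """Simulate captured[n] = source[n] + loop_gain * captured[n - delay]."""
    captured = []
    for n, s in enumerate(source):
        fed_back = loop_gain * captured[n - delay] if n >= delay else 0.0
        captured.append(s + fed_back)
    return captured

# A single impulse from the sound source, followed by silence.
impulse = [1.0] + [0.0] * 49
growing = simulate_loop(impulse, loop_gain=1.2, delay=5)   # loop gain above 1
decaying = simulate_loop(impulse, loop_gain=0.8, delay=5)  # loop gain below 1
```

In this model the impulse is re-captured every `delay` samples, scaled by `loop_gain` each time, so it builds up when the gain exceeds 1 and dies away when the gain is below 1.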
According to an exemplary embodiment of the present disclosure, the second audio playing module is located within a pickup distance of the first audio collecting module, and a playing volume of the second audio playing module is sufficient for the second audio signal to be picked up by the first audio collecting module.
According to an exemplary embodiment of the present disclosure, the audio feature is one of a spectral feature, a bark spectral feature, a mel-cepstrum feature, and a fundamental frequency feature of the audio signal to be processed.
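As one possible example of a spectral feature, the sketch below computes the magnitude spectrum of a single Hann-windowed frame with a direct discrete Fourier transform. The function names `magnitude_spectrum` and `peak_bin`, the window choice, and the frame length are assumptions for illustration; the disclosure does not fix how the feature is computed, and a practical implementation would use an FFT.

```python
import cmath
import math

def magnitude_spectrum(frame):
    """Magnitude spectrum of one Hann-windowed frame via a direct DFT (O(N^2))."""
    n = len(frame)
    windowed = [s * 0.5 * (1 - math.cos(2 * math.pi * i / n))
                for i, s in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):  # non-negative frequency bins only
        acc = sum(w * cmath.exp(-2j * math.pi * k * i / n)
                  for i, w in enumerate(windowed))
        spectrum.append(abs(acc))
    return spectrum

def peak_bin(frame):
    """Index of the strongest frequency bin in the frame."""
    spec = magnitude_spectrum(frame)
    return max(range(len(spec)), key=spec.__getitem__)
```

A narrow, dominant peak in such a spectrum is the kind of pattern a single-frequency-point howl would produce, which is why a spectral representation is a natural model input.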
According to an exemplary embodiment of the present disclosure, the howling detection model sequentially includes an input processing layer, a first intermediate processing layer, and a classification output layer, where the audio feature is assigned to the input processing layer, so that the first intermediate processing layer is input via the input processing layer, the first intermediate processing layer is configured to obtain a local feature of the audio feature, and use the local feature as a howling intermediate feature, and the classification output layer is configured to classify the howling intermediate feature to obtain a howling result feature, and the howling intermediate feature and the howling result feature are used as the howling feature parameter to be input to the howling suppression model.
According to an exemplary embodiment of the present disclosure, the howling detection model further includes a trunk layer and a loop layer connected in sequence between the input processing layer and the first intermediate processing layer, the trunk layer being configured to perform convolution and/or pooling processing on data input to the trunk layer, and the loop layer being configured to establish an association relationship between input data of the loop layer and output data of the loop layer. The howling detection model further includes an attention layer connected between the first intermediate processing layer and the classification output layer, the attention layer being configured to perform weighted summation on data input to the attention layer.
According to an exemplary embodiment of the present disclosure, the howling suppression model sequentially includes an encoder to perform feature encoding on the audio feature and the howling result feature to obtain an encoded feature, a second intermediate processing layer to perform feature screening on the encoded feature and the howling intermediate feature, and a decoder to perform feature decoding on the screened encoded feature.
According to an exemplary embodiment of the present disclosure, the second intermediate processing layer sequentially includes a plurality of interconnected long short-term memory (LSTM) units and a fully connected layer, the LSTM units being configured to perform feature screening on the encoded features, and the fully connected layer being configured to compute a weighted sum of the outputs of the plurality of LSTM units.
According to an exemplary embodiment of the present disclosure, the howling detection model is trained by:
acquiring a first sample signal set, wherein the first sample signal set comprises a plurality of first sample signals and the howling characteristic parameters of the first sample signals, the howling characteristic parameters and the howling result characteristics comprise the same parameter items, the first sample signals comprise reference sample signals and howling sample signals, each reference sample signal is an audio signal played in the acoustic loop by a playing device, and each howling sample signal is a superposition of the audio signal played by the playing device, as acquired by the first audio acquisition module, and the second audio signal played by the second audio playing module;
extracting a first sample audio feature of each first sample signal; and
taking the first sample audio feature as the input of the howling detection model, and adjusting model parameters of the howling detection model according to the difference between the output of the howling detection model and the corresponding howling characteristic parameters.
According to an exemplary embodiment of the present disclosure, the howling suppression model is trained by: acquiring a second sample signal set, the second sample signal set including a plurality of second sample signal pairs, each second sample signal pair including a howling sample signal and a reference sample signal, the reference sample signal being an audio signal played by a playback device in the acoustic loop, and the howling sample signal being a superposition of the audio signal played by the playback device, as acquired by the first audio acquisition module, and a second audio signal played by the second audio playback module; extracting second sample audio features of the howling sample signals in the second sample signal set; inputting the second sample audio features into the howling detection model to obtain the howling feature parameters; and inputting the second sample audio features and the howling feature parameters into the howling suppression model, and adjusting model parameters of the howling suppression model based on a difference between an output of the howling suppression model and the corresponding reference sample signal.
According to an exemplary embodiment of the present disclosure, the loss function of the howling suppression model comprises an error loss function, which is obtained based on an error calculation between the howling suppression audio signal and a reference sample signal.
According to an exemplary embodiment of the present disclosure, the loss function of the howling suppression model comprises an audio quality loss function, which is calculated based on the audio quality of the howling suppression audio signal and the audio quality of the reference sample signal.
According to an exemplary embodiment of the present disclosure, the loss function of the howling suppression model includes an adversarial loss function, which is calculated from the probability that a discriminator makes a correct discrimination. The discriminator classifies an output result of the howling suppression model as either a first signal, indicating that the output result is a howling suppression signal, or a second signal, indicating that the output result is a reference sample signal. The discriminator makes a correct discrimination when it classifies the output result of the howling suppression model as the first signal.
According to an exemplary embodiment of the present disclosure, the loss function of the howling suppression model is a weighted sum of an error loss function, an audio quality loss function, and an adversarial loss function.
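A minimal sketch of such a weighted-sum loss is given below. The disclosure names the three terms but not their formulas, so the specific forms here are assumptions: mean squared error for the error term, a shortfall in an externally supplied quality score for the quality term, and a negative log of the discriminator's failure probability for the third term. All function names and the default weights are hypothetical.

```python
import math

def error_loss(estimate, reference):
    """Mean squared error between the suppressed output and the reference."""
    return sum((e - r) ** 2 for e, r in zip(estimate, reference)) / len(reference)

def quality_loss(quality_estimate, quality_reference):
    """Shortfall of the output's quality score relative to the reference (assumed form)."""
    return max(0.0, quality_reference - quality_estimate)

def adversarial_loss(p_correct):
    """Penalty that grows with the discriminator's probability of being correct (assumed form)."""
    p = min(max(p_correct, 0.0), 1.0 - 1e-7)  # clamp to keep the log finite
    return -math.log(1.0 - p)

def total_loss(estimate, reference, quality_estimate, quality_reference,
               p_correct, weights=(1.0, 0.5, 0.1)):
    """Weighted sum of the three terms; the weights are illustrative only."""
    w_err, w_quality, w_adv = weights
    return (w_err * error_loss(estimate, reference)
            + w_quality * quality_loss(quality_estimate, quality_reference)
            + w_adv * adversarial_loss(p_correct))
```

Under these assumed forms, the total loss falls as the output approaches the reference, as its quality score approaches that of the reference, and as the discriminator becomes less able to tell the output apart from a reference signal.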
According to an exemplary embodiment of the present disclosure, the howling detection model is trained prior to the howling suppression model.
According to an exemplary embodiment of the present disclosure, the howling suppression output module further includes a first noise cancellation module configured to perform noise cancellation on the howling suppression audio signal, the noise being ambient noise in the audio signal, the ambient noise having a fixed frequency.
According to an exemplary embodiment of the disclosure, the howling sample signal in each second sample signal pair is obtained as follows: a reference sample signal played by the sound source, as acquired by the first audio acquisition module, is superposed with the second audio signal played by the second audio playing module to form a quasi-howling sample signal, wherein the reference sample signal is free of noise; the quasi-howling sample signal is then superposed with a noise audio signal to generate the howling sample signal.
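The two superposition steps can be sketched as sample-wise addition. The function names `mix` and `make_howling_sample` and the `noise_gain` parameter are hypothetical, and the sketch assumes that the signals are already time-aligned lists of equal length.

```python
def mix(a, b, gain_b=1.0):
    """Sample-wise superposition of two equal-length signals."""
    return [x + gain_b * y for x, y in zip(a, b)]

def make_howling_sample(reference_capture, second_playback, noise, noise_gain=0.1):
    """Build a howling sample in two steps, following the description.

    Step 1: captured reference + second device playback -> quasi-howling sample.
    Step 2: quasi-howling sample + scaled noise -> howling sample.
    The noise_gain value is an illustrative assumption.
    """
    quasi_howling = mix(reference_capture, second_playback)
    return mix(quasi_howling, noise, noise_gain)
```

Varying `noise_gain` across samples is one simple way to obtain training data with different signal-to-noise ratios.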
According to an exemplary embodiment of the present disclosure, the howling suppression model is a deep complex convolution recurrent network (DCCRN).
According to an exemplary embodiment of the present disclosure, the howling result characteristic includes one or more of a howling detection result, a howling level, a howling type, a howling continuity, a frequency point movement parameter.
According to an exemplary embodiment of the present disclosure, the howling detection result is used to indicate whether there is howling in the audio feature input to the howling detection model, the howling level is used to indicate the howling intensity of the audio feature input to the howling detection model, the howling type includes single-frequency-point howling, multi-frequency-point howling, and diffuse howling, the howling continuity includes continuous howling and intermittent howling, the frequency-point movement parameters include a frequency-point movement type parameter used to indicate whether there is frequency-point movement in the audio feature input to the howling detection model, and a frequency-point movement amplitude parameter used to indicate the amplitude of the frequency-point movement of the audio feature input to the howling detection model.
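The parameter items of the howling result characteristic could be carried in a small record type such as the one sketched below. The class and field names, the use of a float for the howling level, and the optional fields are assumptions made for illustration; the disclosure lists the items but not their encoding.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class HowlingType(Enum):
    SINGLE_FREQUENCY_POINT = "single-frequency-point"
    MULTI_FREQUENCY_POINT = "multi-frequency-point"
    DIFFUSE = "diffuse"

class HowlingContinuity(Enum):
    CONTINUOUS = "continuous"
    INTERMITTENT = "intermittent"

@dataclass
class HowlingResult:
    """Hypothetical container for the howling result characteristic."""
    detected: bool                                   # howling detection result
    level: float = 0.0                               # howling intensity
    howling_type: Optional[HowlingType] = None       # type of howling, if any
    continuity: Optional[HowlingContinuity] = None   # continuous or intermittent
    frequency_moves: bool = False                    # frequency-point movement type
    frequency_move_amplitude: float = 0.0            # frequency-point movement amplitude

example = HowlingResult(
    detected=True,
    level=0.8,
    howling_type=HowlingType.SINGLE_FREQUENCY_POINT,
    continuity=HowlingContinuity.CONTINUOUS,
)
```

Because the result characteristic may include only some of the items, every field other than the detection result has a default value here.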
According to an exemplary embodiment of the present disclosure, the reference sample signal is generated based on different audio content including one or more of speech, music, ambient sound, ringtone, bird song, whistle.
According to an exemplary embodiment of the present disclosure, the first device and the second device comprise an audio processing module having an audio processing algorithm, the first device and the second device having different capabilities and different audio processing algorithms for different howling sample signals.
According to an exemplary embodiment of the present disclosure, the spatial region in which the acoustic loop is located has different noise environments for different howling sample signals, and the first audio signal acquired in the spatial region having the first noise environment and the second audio signal acquired in the spatial region having the second noise environment have different signal-to-noise ratios under the same acquisition conditions, the same acquisition conditions including the same device, the same spatial region, and the same sound source.
According to an exemplary embodiment of the present disclosure, the audio transmission parameters between the first device and the second device are different for different howling sample signals, the audio transmission parameters comprising one or more of a relative position between the first device and the second device, a network communication parameter between the first device and the second device, a real-time volume of the first device and the second device.
According to the exemplary embodiment of the disclosure, the howling suppression output module comprises a first feature reduction module, which is used for performing inverse operation relative to feature extraction on the suppressed audio features output by the howling suppression model, and reducing the suppressed audio features to obtain the howling suppression audio signals.
According to the exemplary embodiment of the disclosure, the howling suppression output module comprises a mask acquisition module for acquiring a howling suppression mask of an output of the howling suppression model, wherein the howling suppression mask is used for representing howling suppression frequency point gain of an audio feature of a reference sample signal compared with an audio feature of an audio signal to be processed, a suppression feature acquisition module for multiplying the howling suppression mask with the audio feature of the audio signal to be processed to obtain a suppressed audio feature, and a second feature reduction module for performing a reverse operation relative to feature extraction on the suppressed audio feature and reducing the suppressed audio feature to obtain the howling suppression audio signal.
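The mask-based variant can be sketched as an element-wise product between the mask and the audio features. The function name `apply_mask` and the example values are hypothetical, and the sketch stops before the inverse feature operation, which depends on how the features were extracted.

```python
def apply_mask(mask, features):
    """Multiply each frequency-point feature by its howling suppression gain."""
    if len(mask) != len(features):
        raise ValueError("mask and features must have the same length")
    return [m * f for m, f in zip(mask, features)]

# Illustrative values: the middle frequency point carries the howl and is attenuated.
features = [2.0, 10.0, 3.0]
mask = [1.0, 0.25, 1.0]
suppressed = apply_mask(mask, features)
```

Predicting a per-frequency-point gain rather than the suppressed features themselves leaves frequency points with a gain near 1 essentially unchanged.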
According to an exemplary embodiment of the disclosure, the first device includes an audio processing module having an audio processing algorithm, where the audio processing algorithm includes one or more of an acoustic echo cancellation algorithm for canceling acoustic echo in an audio signal acquired by the first audio acquisition module, a noise suppression algorithm for suppressing noise in the audio signal acquired by the first audio acquisition module, where the noise is an environmental noise of the audio signal acquired by the first audio acquisition module, and an automatic gain control algorithm for adjusting a volume of the audio signal acquired by the first audio acquisition module to be within a set volume range.
According to an exemplary embodiment of the disclosure, the audio processing module further includes an echo cancellation module for performing acoustic echo cancellation on the audio signal to be processed using the acoustic echo cancellation algorithm, a noise suppression module for performing noise suppression on the howling suppressed audio signal using the noise suppression algorithm, and an automatic gain module for performing automatic gain control on the noise suppressed audio signal using the automatic gain control algorithm.
Since the functional modules of the howling suppression device in the embodiments of the present disclosure correspond to the steps of the howling suppression method embodiments described above, a detailed description thereof is omitted here.
Exemplary storage Medium
Having described the howling suppression method and apparatus of the exemplary embodiments of the present disclosure, next, a storage medium of the exemplary embodiments of the present disclosure will be described with reference to fig. 13.
Referring to fig. 13, a program product 1000 for implementing the above-described method according to an embodiment of the present disclosure is described, which may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a device, such as a personal computer. However, the program product of the present disclosure is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of a readable storage medium include an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The computer readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of remote computing devices, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., connected via the Internet using an Internet service provider).
Exemplary electronic device
Having described the storage medium of the exemplary embodiments of the present disclosure, next, an electronic device of the exemplary embodiments of the present disclosure will be described with reference to fig. 14.
The electronic device 800 shown in fig. 14 is merely an example and should not be construed to limit the functionality and scope of use of embodiments of the present disclosure in any way.
As shown in fig. 14, the electronic device 800 is embodied in the form of a general purpose computing device. The components of electronic device 800 may include, but are not limited to, at least one processing unit 810 described above, at least one storage unit 820 described above, a bus 830 connecting the various system components (including storage unit 820 and processing unit 810), and a display unit 840.
The storage unit 820 stores program code that is executable by the processing unit 810, such that the processing unit 810 performs the steps according to the various exemplary embodiments of the present disclosure described in the above sections of this specification. For example, the processing unit 810 may perform the steps shown in fig. 4.
The storage unit 820 may include volatile storage units such as a Random Access Memory (RAM) 8201 and/or a cache memory 8202, and may further include a Read Only Memory (ROM) 8203.
Storage unit 820 may also include a program/utility 8204 having a set (at least one) of program modules 8205, such program modules 8205 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.
Bus 830 may include a data bus, an address bus, and a control bus.
The electronic device 800 may also communicate with one or more external devices 900 (e.g., keyboard, pointing device, bluetooth device, etc.) via an input/output (I/O) interface 850. The electronic device 800 further comprises a display unit 840 connected to an input/output (I/O) interface 850 for displaying. Also, electronic device 800 may communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet, through network adapter 860. As shown, network adapter 860 communicates with other modules of electronic device 800 over bus 830. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with electronic device 800, including, but not limited to, microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
It should be noted that although several modules or sub-modules of the howling suppression apparatus are mentioned in the above detailed description, this division is only exemplary and not mandatory. Indeed, the features and functionality of two or more units/modules described above may be embodied in one unit/module in accordance with embodiments of the present disclosure. Conversely, the features and functions of one unit/module described above may be further divided so as to be embodied by a plurality of units/modules.
Furthermore, although the operations of the methods of the present disclosure are depicted in the drawings in a particular order, this does not require or imply that these operations must be performed in this particular order, or that all of the illustrated operations must be performed, in order to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step, and/or one step may be decomposed into multiple steps.
While the spirit and principles of the present disclosure have been described with reference to several particular embodiments, it is to be understood that the disclosure is not limited to the particular embodiments disclosed, and that the division into aspects does not mean that features in those aspects cannot be combined to advantage; the division is made for convenience of description only. The disclosure is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (54)

1. The howling suppression method is characterized by being applied to a first device, wherein the first device is used for performing instant communication with a second device, the first device and the second device belong to the same acoustic loop, the first device comprises a first communication module, a first audio acquisition module and a first audio playing module, and the second device comprises a second communication module, a second audio acquisition module and a second audio playing module, and the method comprises the following steps:
extracting audio characteristics of an audio signal to be processed, wherein the audio signal to be processed is an audio signal acquired by the first device through its first audio acquisition module, and the audio signal to be processed is a superposition of an acoustic signal sent by a sound source and a second audio signal played by the second audio playing module;
Inputting the audio characteristics into a howling detection model, wherein the howling detection model outputs howling characteristic parameters of the audio signals to be processed;
Inputting the howling feature parameters and the audio features into a howling suppression model;
Obtaining a howling suppression audio signal according to the output of the howling suppression model;
The howling detection model sequentially comprises an input processing layer, a first intermediate processing layer and a classification output layer, wherein the audio characteristics are assigned to the input processing layer so as to be input into the first intermediate processing layer through the input processing layer, the first intermediate processing layer is used for acquiring local characteristics of the audio characteristics and taking the local characteristics as howling intermediate characteristics, the classification output layer is used for classifying the howling intermediate characteristics to obtain howling result characteristics, and the howling intermediate characteristics and the howling result characteristics are used for being input into the howling suppression model as howling characteristic parameters;
The howling suppression model sequentially comprises an encoder, a second intermediate processing layer and a decoder, wherein the encoder is used for carrying out feature coding on the audio features and the howling result features to obtain coding features, the second intermediate processing layer is used for carrying out feature screening on the coding features and the howling intermediate features, and the decoder is used for carrying out feature decoding on the screened coding features.
2. The howling suppression method according to claim 1, wherein the acoustic loop is an audio signal transmission path generated based on the first audio acquisition module, the first communication module, the second communication module, and the second audio playing module, and the audio signal to be processed completes closed loop transmission in the acoustic loop sequentially via the first audio acquisition module, the first communication module, the second audio playing module, and the first audio acquisition module.
3. The howling suppression method according to claim 2, wherein the second audio playing module is located within a pickup distance of the first audio collecting module, and a playing volume of the second audio playing module is sufficient for the second audio signal to be picked up by the first audio collecting module.
4. The howling suppression method according to claim 1, wherein the audio feature is one of a bark spectral feature, a mel-cepstrum feature, and a fundamental frequency feature of the audio signal to be processed.
5. The howling suppression method according to claim 1, wherein the howling detection model further comprises, in order, a trunk layer for performing convolution and/or pooling processing on data input to the trunk layer and a loop layer for establishing an association relationship between input data of the loop layer and output data of the loop layer, connected between the input processing layer and the intermediate processing layer,
The howling detection model further comprises an attention layer connected between the first intermediate processing layer and the classification output layer, the attention layer being configured to weight sum data input to the attention layer.
6. The howling suppression method according to claim 1, wherein the second intermediate processing layer sequentially includes a plurality of interconnected long short-term memory (LSTM) units and a fully connected layer, the LSTM units being configured to perform feature screening on the encoded features, and the fully connected layer being configured to compute a weighted sum of the outputs of the plurality of LSTM units.
7. The howling suppression method according to claim 1, wherein the howling detection model is trained by:
Acquiring a first sample signal set, wherein the first sample signal set comprises a plurality of first sample signals and howling characteristic parameters of the first sample signals, the howling characteristic parameters and the howling result characteristics comprise the same parameter items, the first sample signals comprise reference sample signals and howling sample signals, the reference sample signals are audio signals played in the acoustic loop by playing equipment, and the howling sample signals are superposition of the audio signals acquired by the first audio acquisition module and played by the playing equipment and the second audio signals played by the second audio playing module;
extracting a first sample audio feature of the first sample signal;
And taking the first sample audio characteristic as the input of the howling detection model, and adjusting the model parameters of the howling detection model according to the difference between the output of the howling detection model and the corresponding howling characteristic parameters.
8. The howling suppression method according to claim 1, wherein the howling suppression model is trained by:
acquiring a second sample signal set, wherein the second sample signal set comprises a plurality of second sample signal pairs, each second sample signal pair comprises a howling sample signal and a reference sample signal, the reference sample signal is an audio signal played by a playing device in the acoustic loop, and the howling sample signal is a superposition of the audio signal played by the playing device and acquired by the first audio acquisition module and a second audio signal played by the second audio playing module;
extracting a second sample audio feature of a howling sample signal in the second sample signal set;
inputting the second sample audio feature into the howling detection model to obtain the howling characteristic parameters;
and inputting the second sample audio feature and the howling characteristic parameters into the howling suppression model, and adjusting model parameters of the howling suppression model according to the difference between the output of the howling detection model and the corresponding reference sample signals.
9. The howling suppression method according to claim 8, wherein the loss function of the howling suppression model includes an error loss function obtained based on an error calculation between the howling suppression audio signal and a reference sample signal.
10. The howling suppression method according to claim 9, wherein the loss function of the howling suppression model includes an audio quality loss function calculated based on the audio quality of the howling suppression audio signal and the audio quality of the reference sample signal.
11. The howling suppression method according to claim 9, wherein the loss function of the howling suppression model includes an adversarial loss function obtained from the probability that a discriminator makes a correct discrimination, the discriminator being configured to discriminate an output result of the howling suppression model as either a first signal, indicating that the output result is a howling suppression signal, or a second signal, indicating that the output result is a reference sample signal, and the discriminator making a correct discrimination when it discriminates the output result of the howling suppression model as the first signal.
12. The howling suppression method according to claim 9, wherein the loss function of the howling suppression model is a weighted sum of an error loss function, an audio quality loss function, and an adversarial loss function.
13. The howling suppression method according to claim 9, wherein the howling detection model is trained prior to the howling suppression model.
14. The howling suppression method according to claim 9, wherein said obtaining a howling suppressed audio signal from an output of the howling suppression model further comprises:
performing noise elimination on the howling suppression audio signal, wherein the noise is environmental noise in the audio signal, and the environmental noise has a fixed frequency.
15. The howling suppression method according to claim 14, wherein the howling sample signal in the second sample signal pair is obtained according to the following steps:
superposing the reference sample signal played by the sound source and acquired by the first audio acquisition module with the second audio signal played by the second audio playing module to form a quasi-howling sample signal, wherein the reference sample signal is free of noise;
and superposing the quasi-howling sample signal and a noise audio signal to generate the howling sample signal.
16. The howling suppression method according to claim 15, wherein the howling suppression model is a deep complex convolution recurrent network (DCCRN).
17. A howling suppression method according to any of claims 5 to 16, wherein the howling result characteristics comprise one or more of howling detection results, howling level, howling type, howling continuity, frequency bin movement parameters.
18. The howling suppression method according to claim 17, wherein the howling detection result indicates whether howling is present in the audio feature input to the howling detection model; the howling level indicates the howling intensity of the audio feature input to the howling detection model; the howling type includes single-frequency howling, multi-frequency howling, and diffuse howling; the howling continuity includes continuous howling and intermittent howling; and the frequency bin movement parameters include a frequency bin movement type parameter, indicating whether frequency bin movement is present in the audio feature input to the howling detection model, and a frequency bin movement amplitude parameter, indicating the amplitude of that frequency bin movement.
19. A howling suppression method according to any of claims 7 to 16, wherein the reference sample signal is generated based on different audio content, including one or more of speech, music, ambient sound.
20. A howling suppression method according to any of claims 7-16, characterized in that the first device and the second device comprise an audio processing module with an audio processing algorithm, the first device and the second device having different capabilities and different audio processing algorithms for different howling sample signals.
21. A howling suppression method according to any of claims 7 to 16, characterized in that the spatial region in which the acoustic loop is located has different noise environments for different howling sample signals, and that under the same acquisition conditions, the first audio signal acquired in the spatial region with the first noise environment and the second audio signal acquired in the spatial region with the second noise environment have different signal-to-noise ratios, the same acquisition conditions comprising the same device, the same spatial region and the same sound source.
22. A howling suppression method according to any of claims 7-16, characterized in that for different howling sample signals the audio transmission parameters between the first device and the second device are different, the audio transmission parameters comprising one or more of the relative position between the first device and the second device, network communication parameters between the first device and the second device, real-time volume of the first device and the second device.
23. The howling suppression method according to any one of claims 1 to 16, wherein said obtaining a howling suppressed audio signal from an output of the howling suppression model comprises:
performing, on the suppressed audio features output by the howling suppression model, the inverse of the feature extraction operation, thereby restoring the suppressed audio features to obtain the howling suppression audio signal.
24. The howling suppression method according to any one of claims 1 to 16, wherein said obtaining a howling suppressed audio signal from an output of the howling suppression model comprises:
obtaining a howling suppression mask output by the howling suppression model, wherein the howling suppression mask represents the per-frequency-bin howling suppression gain of the audio features of the reference sample signal relative to the audio features of the audio signal to be processed;
multiplying the howling suppression mask with the audio features of the audio signal to be processed to obtain suppressed audio features;
and performing, on the suppressed audio features, the inverse of the feature extraction operation, thereby restoring the suppressed audio features to obtain the howling suppression audio signal.
25. The howling suppression method according to any of claims 2-13, wherein the first device comprises an audio processing module having an audio processing algorithm comprising one or more of an acoustic echo cancellation algorithm, a noise suppression algorithm and an automatic gain control algorithm,
the acoustic echo cancellation algorithm is used for canceling acoustic echo in the audio signals acquired by the first audio acquisition module, wherein the acoustic echo comprises an echo signal formed when the first audio acquisition module picks up the audio signals played by the first audio playing module;
the noise suppression algorithm is used for suppressing noise in the audio signals acquired by the first audio acquisition module, wherein the noise is the environmental noise of the audio signals acquired by the first audio acquisition module, and the environmental noise has fixed frequency;
the automatic gain control algorithm is used for adjusting the volume of the audio signal acquired by the first audio acquisition module to be within a set volume range.
26. The howling suppression method according to claim 25, wherein
before the extracting of the audio features of the audio signal to be processed, the method further comprises:
performing acoustic echo cancellation on the audio signal to be processed using the acoustic echo cancellation algorithm; and
after the obtaining of the howling suppression audio signal according to the output of the howling suppression model, the method further comprises:
performing noise suppression on the howling suppression audio signal using the noise suppression algorithm, and
performing automatic gain control on the noise-suppressed audio signal using the automatic gain control algorithm.
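The mask-based restoration recited in claim 24 (obtain a mask, multiply it with the audio features of the signal to be processed, then invert the feature extraction) can be illustrated with a minimal Python sketch. This is only an illustration under stated assumptions, not the claimed implementation: the feature is assumed to be a plain per-frame FFT over non-overlapping frames, the names `FRAME`, `extract_features`, `apply_suppression_mask` and `restore_signal` are hypothetical, and the mask is built by hand as a stand-in for the output of the howling suppression model.

```python
import numpy as np

FRAME = 256  # hypothetical frame length in samples (not specified in the claims)


def extract_features(signal):
    """Feature extraction: non-overlapping frames -> complex spectrum per frame."""
    n_frames = len(signal) // FRAME
    frames = signal[: n_frames * FRAME].reshape(n_frames, FRAME)
    return np.fft.rfft(frames, axis=1)


def apply_suppression_mask(features, mask):
    """Multiply the per-bin gain mask with the features of the signal to be processed."""
    return features * mask


def restore_signal(features):
    """Inverse of extract_features: inverse FFT per frame, then concatenate the frames."""
    return np.fft.irfft(features, n=FRAME, axis=1).reshape(-1)


# Toy input: a quiet 250 Hz tone (stand-in for the wanted sound) plus a loud
# 2000 Hz tone (stand-in for single-frequency howling), sampled at 8 kHz.
fs = 8000
t = np.arange(8 * FRAME) / fs
wanted = 0.1 * np.sin(2 * np.pi * 250 * t)
howl = 1.0 * np.sin(2 * np.pi * 2000 * t)
to_process = wanted + howl

features = extract_features(to_process)

# Hand-built stand-in for the model output: unit gain everywhere except the
# bin holding the howl (2000 Hz * FRAME / fs = bin 64), which is set to zero.
mask = np.ones(features.shape)
mask[:, 64] = 0.0

suppressed = restore_signal(apply_suppression_mask(features, mask))
```

In this toy setup both tones fall exactly on FFT bins, so zeroing one bin removes the howl and leaves the 250 Hz tone intact. A practical system would more likely use overlapping windowed frames and a mask with gains between 0 and 1, which the sketch omits for brevity.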
27. A howling suppression apparatus, characterized by being applied to a first device, the first device being configured to perform instant communication with a second device, the first device and the second device being assigned to a same acoustic loop, the first device including a first communication module, a first audio acquisition module, and a first audio playback module, the second device including a second communication module, a second audio acquisition module, and a second audio playback module, the apparatus comprising:
the audio feature extraction module is used for extracting audio features of an audio signal to be processed, wherein the audio signal to be processed is an audio signal acquired by the first equipment through the first audio acquisition module, and the audio signal to be processed is superposition of an acoustic signal sent by a sound source and a second audio signal played by the second audio play module;
the howling detection module is used for inputting the audio features into a howling detection model, and the howling detection model outputs howling characteristic parameters of the audio signal to be processed;
the howling suppression input module is used for inputting the howling characteristic parameters and the audio features into a howling suppression model;
the howling suppression output module is used for obtaining a howling suppression audio signal according to the output of the howling suppression model;
wherein the howling detection model sequentially comprises an input processing layer, a first intermediate processing layer and a classification output layer, the audio features are fed to the input processing layer so as to be passed through the input processing layer into the first intermediate processing layer, the first intermediate processing layer is used for acquiring local features of the audio features and taking the local features as howling intermediate features, the classification output layer is used for classifying the howling intermediate features to obtain howling result features, and the howling intermediate features and the howling result features are input into the howling suppression model as the howling characteristic parameters;
and the howling suppression model sequentially comprises an encoder, a second intermediate processing layer and a decoder, wherein the encoder is used for performing feature encoding on the audio features and the howling result features to obtain encoded features, the second intermediate processing layer is used for performing feature screening on the encoded features and the howling intermediate features, and the decoder is used for performing feature decoding on the screened encoded features.
28. The howling suppression apparatus according to claim 27, wherein the acoustic loop is an audio signal transmission path generated based on the first audio acquisition module, the first communication module, the second communication module, and the second audio playback module, and the audio signal to be processed completes closed loop transmission in the acoustic loop sequentially via the first audio acquisition module, the first communication module, the second audio playback module, and the first audio acquisition module.
29. The howling suppression apparatus of claim 28, wherein said second audio playing module is located within a pickup distance of said first audio collection module, and wherein a playing volume of said second audio playing module is sufficient for said second audio signal to be picked up by said first audio collection module.
30. The howling suppression apparatus according to claim 27, wherein the audio feature is one of a bark spectral feature, a mel-cepstrum feature, and a fundamental frequency feature of the audio signal to be processed.
31. The howling suppression apparatus according to claim 27, wherein the howling detection model further comprises, in order, a trunk layer and a recurrent layer connected between the input processing layer and the intermediate processing layer, the trunk layer being configured to apply convolution and/or pooling to the data input to it, and the recurrent layer being configured to establish an association between its input data and its output data,
and the howling detection model further comprises an attention layer connected between the first intermediate processing layer and the classification output layer, the attention layer being configured to compute a weighted sum of the data input to it.
32. The howling suppression apparatus according to claim 27, wherein the second intermediate processing layer sequentially comprises a plurality of interconnected long short-term memory (LSTM) units for performing feature screening on the encoded features, and a fully connected layer for computing a weighted sum of the outputs of the plurality of LSTM units.
33. The howling suppression apparatus according to claim 27, wherein the howling detection model is trained by:
acquiring a first sample signal set, wherein the first sample signal set comprises a plurality of first sample signals and howling characteristic parameters of the first sample signals, the howling characteristic parameters and the howling result characteristics comprise the same parameter items, the first sample signals comprise reference sample signals and howling sample signals, the reference sample signals are audio signals played in the acoustic loop by a playing device, and the howling sample signals are a superposition of the audio signals played by the playing device and acquired by the first audio acquisition module and the second audio signals played by the second audio playing module;
extracting a first sample audio feature of the first sample signal;
and taking the first sample audio feature as the input of the howling detection model, and adjusting the model parameters of the howling detection model according to the difference between the output of the howling detection model and the corresponding howling characteristic parameters.
34. The howling suppression apparatus according to claim 27, wherein the howling suppression model is trained by:
acquiring a second sample signal set, wherein the second sample signal set comprises a plurality of second sample signal pairs, each second sample signal pair comprises a howling sample signal and a reference sample signal, the reference sample signal is an audio signal played by a playing device in the acoustic loop, and the howling sample signal is a superposition of the audio signal played by the playing device and acquired by the first audio acquisition module and a second audio signal played by the second audio playing module;
extracting a second sample audio feature of a howling sample signal in the second sample signal set;
inputting the second sample audio feature into the howling detection model to obtain the howling characteristic parameters;
and inputting the second sample audio feature and the howling characteristic parameters into the howling suppression model, and adjusting model parameters of the howling suppression model according to the difference between the output of the howling detection model and the corresponding reference sample signals.
35. The howling suppression apparatus according to claim 34, wherein the loss function of the howling suppression model comprises an error loss function obtained based on an error calculation between the howling suppression audio signal and a reference sample signal.
36. The howling suppression apparatus according to claim 34, wherein the loss function of the howling suppression model comprises an audio quality loss function calculated based on the audio quality of the howling suppression audio signal and the audio quality of the reference sample signal.
37. The howling suppression apparatus according to claim 34, wherein the loss function of the howling suppression model includes an adversarial loss function obtained from the probability that a discriminator makes a correct discrimination, the discriminator being configured to discriminate an output result of the howling suppression model as either a first signal, indicating that the output result is a howling suppression signal, or a second signal, indicating that the output result is a reference sample signal, and the discriminator making a correct discrimination when it discriminates the output result of the howling suppression model as the first signal.
38. The howling suppression apparatus according to claim 34, wherein the loss function of the howling suppression model is a weighted sum of an error loss function, an audio quality loss function and an adversarial loss function.
39. The howling suppression apparatus of claim 34, wherein the howling detection model is trained prior to the howling suppression model.
40. The howling suppression apparatus according to claim 34, wherein the howling suppression output module further comprises:
a first noise elimination module, used for performing noise elimination on the howling suppression audio signal, wherein the noise is environmental noise in the audio signal, and the environmental noise has a fixed frequency.
41. The howling suppression apparatus according to claim 40, wherein the howling sample signal in the second sample signal pair is obtained according to the following steps:
superposing the reference sample signal played by the sound source and acquired by the first audio acquisition module with the second audio signal played by the second audio playing module to form a quasi-howling sample signal, wherein the reference sample signal is free of noise;
and superposing the quasi-howling sample signal and a noise audio signal to generate the howling sample signal.
42. The howling suppression apparatus of claim 41, wherein said howling suppression model is a deep complex convolution recurrent network (DCCRN).
43. The howling suppression apparatus according to any of claims 27 to 42, wherein said howling result characteristic comprises one or more of a howling detection result, a howling level, a howling type, a howling continuity, a frequency bin movement parameter.
44. The howling suppression apparatus according to claim 43, wherein the howling detection result indicates whether howling is present in the audio feature input to the howling detection model; the howling level indicates the howling intensity of the audio feature input to the howling detection model; the howling type includes single-frequency howling, multi-frequency howling, and diffuse howling; the howling continuity includes continuous howling and intermittent howling; and the frequency bin movement parameters include a frequency bin movement type parameter, indicating whether frequency bin movement is present in the audio feature input to the howling detection model, and a frequency bin movement amplitude parameter, indicating the amplitude of that frequency bin movement.
45. The howling suppression apparatus according to any of claims 33 to 42, wherein said reference sample signal is generated based on different audio content, including one or more of speech, music, ambient sound.
46. The howling suppression apparatus according to any of claims 33 to 42, wherein said first device and said second device comprise an audio processing module having an audio processing algorithm, said first device and said second device having different capabilities and different audio processing algorithms for different howling sample signals.
47. The howling suppression apparatus according to any of claims 33 to 42, wherein the spatial region in which the acoustic loop is located has different noise environments for different howling sample signals, and wherein the first audio signal acquired in the spatial region having the first noise environment and the second audio signal acquired in the spatial region having the second noise environment have different signal-to-noise ratios under the same acquisition conditions, the same acquisition conditions comprising the same device, the same spatial region, and the same sound source.
48. The howling suppression apparatus according to any of claims 33 to 42, wherein audio transmission parameters between said first device and said second device are different for different howling sample signals, said audio transmission parameters comprising one or more of a relative position between said first device and said second device, a network communication parameter between said first device and said second device, a real-time volume of said first device and said second device.
49. The howling suppression apparatus according to any of claims 27 to 42, wherein said howling suppression output module comprises:
a first feature restoration module, used for performing, on the suppressed audio features output by the howling suppression model, the inverse of the feature extraction operation, thereby restoring the suppressed audio features to obtain the howling suppression audio signal.
50. The howling suppression apparatus according to any of claims 27 to 42, wherein said howling suppression output module comprises:
a mask acquisition module, used for acquiring a howling suppression mask output by the howling suppression model, wherein the howling suppression mask represents the per-frequency-bin howling suppression gain of the audio features of the reference sample signal relative to the audio features of the audio signal to be processed;
a suppression feature acquisition module, used for multiplying the howling suppression mask with the audio features of the audio signal to be processed to obtain suppressed audio features;
and a second feature restoration module, used for performing, on the suppressed audio features, the inverse of the feature extraction operation, thereby restoring the suppressed audio features to obtain the howling suppression audio signal.
51. The howling suppression apparatus according to any of claims 28 to 39, wherein said first device comprises an audio processing module having an audio processing algorithm comprising one or more of an acoustic echo cancellation algorithm, a noise suppression algorithm and an automatic gain control algorithm,
the acoustic echo cancellation algorithm is used for canceling acoustic echo in the audio signals acquired by the first audio acquisition module, wherein the acoustic echo comprises an echo signal formed when the first audio acquisition module picks up the audio signals played by the first audio playing module;
the noise suppression algorithm is used for suppressing noise in the audio signals acquired by the first audio acquisition module, wherein the noise is the environmental noise of the audio signals acquired by the first audio acquisition module, and the environmental noise has fixed frequency;
the automatic gain control algorithm is used for adjusting the volume of the audio signal acquired by the first audio acquisition module to be within a set volume range.
52. The howling suppression apparatus of claim 51, wherein said audio processing module further comprises:
the echo cancellation module is used for performing acoustic echo cancellation on the audio signal to be processed by adopting the acoustic echo cancellation algorithm;
a noise suppression module, used for performing noise suppression on the howling suppression audio signal using the noise suppression algorithm; and
an automatic gain module, used for performing automatic gain control on the noise-suppressed audio signal using the automatic gain control algorithm.
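The weighted-sum loss recited in claims 12 and 38 can be sketched in Python as follows. This is a hedged illustration only: the claims do not give the individual loss formulas or the weights, so the mean-squared error, the `-log(1 - p)` form of the adversarial term, the default weights, and the names `error_loss`, `adversarial_loss` and `total_loss` are all assumptions. The audio quality term is passed in as a precomputed number because the claims do not name a quality metric.

```python
import numpy as np


def error_loss(suppressed, reference):
    """Assumed error term: mean squared error between the two signals."""
    suppressed = np.asarray(suppressed, dtype=float)
    reference = np.asarray(reference, dtype=float)
    return float(np.mean((suppressed - reference) ** 2))


def adversarial_loss(p_correct, eps=1e-7):
    """Assumed adversarial term for the suppression model.

    p_correct is the discriminator's probability of correctly labelling the
    model output as a howling suppression signal; the loss grows as the
    discriminator becomes more certain.
    """
    return float(-np.log(np.clip(1.0 - p_correct, eps, 1.0)))


def total_loss(err, quality, adv, weights=(1.0, 0.5, 0.1)):
    """Weighted sum of the three terms; the default weights are placeholders."""
    w_err, w_quality, w_adv = weights
    return w_err * err + w_quality * quality + w_adv * adv


# Usage with toy numbers.
err = error_loss([1.0, 2.0], [1.0, 4.0])   # (0 + 4) / 2
adv = adversarial_loss(0.5)                # -log(0.5)
loss = total_loss(err, quality=2.0, adv=adv)
```

Keeping the weights as an explicit argument makes it easy to switch a term off (weight 0) when only a subset of the three losses is used, as in claims 9 to 11 and 35 to 37.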
53. A storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements:
the howling suppression method as recited in any one of claims 1 to 26.
54. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform, by executing the executable instructions:
the howling suppression method as recited in any one of claims 1 to 26.
CN202210307288.1A 2022-03-25 2022-03-25 Howling suppression method and device, storage medium and electronic equipment Active CN114863941B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210307288.1A CN114863941B (en) 2022-03-25 2022-03-25 Howling suppression method and device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210307288.1A CN114863941B (en) 2022-03-25 2022-03-25 Howling suppression method and device, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN114863941A (en) 2022-08-05
CN114863941B (en) 2026-02-03

Family

ID=82630343

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210307288.1A Active CN114863941B (en) 2022-03-25 2022-03-25 Howling suppression method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN114863941B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116229998B (en) * 2023-02-02 2025-09-02 北京达佳互联信息技术有限公司 Audio signal processing method, device, electronic device and storage medium
CN116682407A (en) * 2023-05-31 2023-09-01 菁音核创科技(厦门)有限公司 Method, system and computer medium for real-time howling detection and adaptive suppression
CN118400650B (en) * 2024-05-14 2025-04-15 广东台德智联科技有限公司 Microphone howling prevention method and device

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114067837A (en) * 2021-11-15 2022-02-18 杭州网易智企科技有限公司 Howling detection method and device, medium and computing device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111149370B (en) * 2017-09-29 2021-10-01 杜比实验室特许公司 Howling Detection in Conference System
CN109788400B (en) * 2019-03-06 2020-12-18 哈尔滨工业大学(深圳) A neural network howling suppression method, system and storage medium for digital hearing aids
CN111583949A (en) * 2020-04-10 2020-08-25 南京拓灵智能科技有限公司 Howling suppression method, device and equipment
CN113870885B (en) * 2021-12-02 2022-02-22 北京百瑞互联技术有限公司 Bluetooth audio squeal detection and suppression method, device, medium, and apparatus

Also Published As

Publication number Publication date
CN114863941A (en) 2022-08-05

Similar Documents

Publication Publication Date Title
CN114863941B (en) Howling suppression method and device, storage medium and electronic equipment
Fu et al. MetricGAN-U: Unsupervised speech enhancement/dereverberation based only on noisy/reverberated speech
CN111161752B (en) Echo cancellation method and device
CN112951259B (en) Audio noise reduction method and device, electronic equipment and computer readable storage medium
CN113241085B (en) Echo cancellation method, device, equipment and readable storage medium
US5757937A (en) Acoustic noise suppressor
Luo et al. Real-time single-channel dereverberation and separation with time-domain audio separation network.
JP5666444B2 (en) Apparatus and method for processing an audio signal for speech enhancement using feature extraction
JP2012155339A (en) Improvement in multisensor sound quality using sound state model
US10262677B2 (en) Systems and methods for removing reverberation from audio signals
JP2006215568A (en) Audio enhancement device, audio enhancement method, and computer-readable medium recording audio enhancement program
CN114067837A (en) Howling detection method and device, medium and computing device
CN118314917A (en) Noise reduction optimization method for noise reduction type MEMS microphone for smart home
CN115359804A (en) Directional audio pickup method and system based on microphone array
Chen et al. A neural network-based howling detection method for real-time communication applications
CN117894318A (en) Audio processing model training method and device, storage medium, and electronic device
WO2013057659A2 (en) Signal noise attenuation
CN120998219A (en) Adaptive noise reduction method and system for multimodal audio SoC main control chip
CN113113046B (en) Performance detection method and device for audio processing, storage medium and electronic equipment
CN112669877A (en) Noise detection and suppression method, device, terminal equipment, system and chip
CN118486320A (en) Noise suppression method, device, electronic equipment and computer readable storage medium
JP2020190606A (en) Sound noise removal device and program
Li et al. Joint noise reduction and listening enhancement for full-end speech enhancement
JP4542538B2 (en) Double talk state determination method, echo canceling apparatus using the method, program thereof, and recording medium thereof
CN119207456B (en) Audio noise reduction method, electronic device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant