CN119418712A - A noise reduction method for real-time speech at the edge - Google Patents
- Publication number
- CN119418712A (CN202510018663.4A)
- Authority
- CN
- China
- Prior art keywords
- signal
- voice
- noise
- domain signal
- mel
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L21/0224—Processing in the time domain
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L21/0232—Processing in the frequency domain
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Landscapes
- Engineering & Computer Science (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Health & Medical Sciences (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Quality & Reliability (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Noise Elimination (AREA)
Abstract
The application belongs to the technical field of voice noise reduction and provides a noise reduction method for real-time voice at the edge. The disclosed embodiments perform pre-emphasis, framing, and windowing on an input speech signal to boost its high-frequency content and reduce distortion. The preprocessed speech signal is subjected to a fast Fourier transform to convert the time domain signal into an original frequency domain signal. The Mel frequency cepstral coefficients of the original frequency domain signal are calculated to extract the MFCC characteristic parameters of the voice. A lightweight separable convolutional neural network performs noise estimation on the MFCC characteristic parameters, which reduces the parameter count and the amount of computation and achieves efficient noise estimation. Adaptive iterative Wiener filtering is then performed using the generated noise estimate, gradually eliminating noise in the voice signal to obtain a filtered frequency domain signal. An inverse fast Fourier transform is performed on the filtered frequency domain signal to recover the time domain signal, which is output as the noise-reduced voice signal.
Description
Technical Field
The embodiment of the disclosure relates to the technical field of voice noise reduction, in particular to a noise reduction method for real-time voice at an edge end.
Background
With the rapid development of information technology, voice communication and voice recognition technologies are widely used in various fields. In particular, in smart phones, smart assistants, teleconferencing, and other scenarios, clear speech signal transmission and processing are critical. However, in practical applications, the speech signal is often subjected to various noise, which may come from the environment (e.g., traffic noise, crowd noise, etc.) or the device itself (e.g., poor microphone quality, unstable signal transmission, etc.). These noises can significantly reduce the clarity and intelligibility of the speech signal and even lead to information delivery errors.
At present, voice noise reduction is mainly divided into traditional filter-based noise reduction and deep-learning-based noise reduction. Traditional filter-based methods require little computation and suit offline real-time noise reduction at the edge, but their noise reduction effect is poor, and non-stationary noise in particular is difficult to filter. Deep-learning-based schemes achieve an excellent noise reduction effect, but the computational cost of the model is huge, which makes them difficult to deploy in edge scenarios with limited power consumption and computing resources.
Accordingly, there is a need to improve one or more problems in the related art as described above.
It is noted that this section is intended to provide a background or context for the technical solutions of the present disclosure as set forth in the claims. The description herein is not admitted to be prior art by inclusion in this section.
Disclosure of Invention
An object of the embodiments of the present disclosure is to provide a noise reduction method for edge-side real-time voice, which overcomes one or more of the problems due to the limitations and disadvantages of the related art at least to some extent.
According to an embodiment of the present disclosure, there is provided a method for noise reduction of edge-side real-time speech, the method including:
Acquiring an input voice signal, and preprocessing the input voice signal to obtain a plurality of frames of voice signals;
Performing fast Fourier transform on the voice signal aiming at one frame of the voice signal to obtain an original frequency domain signal;
calculating a mel frequency cepstrum coefficient of the original frequency domain signal according to the original frequency domain signal so as to extract MFCC characteristic parameters of the input voice signal;
Inputting the MFCC characteristic parameters into a separable convolutional neural network subjected to INT8 fixed-point quantization processing to obtain a noise estimation result, wherein the separable convolutional neural network comprises an input layer, a first depth separable convolutional layer, a second depth separable convolutional layer, a full connection layer and an output layer which are sequentially connected in series, and the noise estimation result comprises a voice frame and a non-voice frame;
performing adaptive iterative wiener filtering on the voice frame and the non-voice frame by using a wiener filter to obtain a filtered frequency domain signal;
performing inverse fast fourier transform on the filtered frequency domain signal to obtain a filtered time domain signal;
Traversing all the voice signals to obtain the filtered time domain signals corresponding to all the voice signals, and combining all the filtered time domain signals to obtain noise-reducing voice signals.
Further, the step of obtaining an input voice signal and preprocessing the input voice signal includes:
Pre-emphasis the input speech signal with a pre-emphasis filter;
Dividing the pre-emphasized input voice signal into a plurality of frames of voice signals;
The speech signal is windowed for each frame using a hamming window to reduce spectral leakage.
Further, the expression of the pre-emphasis filter is:
$$y(n) = x(n) - \alpha\,x(n-1)$$
Wherein, $x(n)$ is the value of the original input signal at the $n$-th sample point, $x(n-1)$ is the value of the original input signal at the $(n-1)$-th sample point, $y(n)$ is the value of the pre-emphasized input signal at the $n$-th sample point, and $\alpha$ is the pre-emphasis coefficient;
the Hamming window expression is:
$$w(n) = 0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right)$$
Wherein, $w(n)$ is the value of the Hamming window function at the $n$-th sample point, $N$ is the number of sample points per frame, and $n$ is the index of the current sample point, with range $0 \le n \le N-1$;
The windowed expression of each frame of the voice signal $s(n)$ is:
$$s_w(n) = s(n)\,w(n)$$
Wherein, $s(n)$ is the value of the voice signal at the $n$-th sample point, and $s_w(n)$ is the value of the windowed voice signal at the $n$-th sample point.
Further, the step of calculating mel-frequency cepstrum coefficients of the original frequency-domain signal to extract MFCC characteristic parameters of the input speech signal includes:
Using a mel filter bank to convert the frequency $f$ of the original frequency domain signal to the mel frequency $f_{\mathrm{mel}}$, wherein the mel filter bank comprises a plurality of mel filters;
According to the mel frequency $f_{\mathrm{mel}}$ and the original frequency domain signal, calculating the energy of each mel filter;
and carrying out discrete cosine transform on the logarithm of the energy of all the mel filters to obtain the MFCC characteristic parameters.
Further, the mel frequency is expressed as:
$$f_{\mathrm{mel}} = 2595\,\log_{10}\left(1 + \frac{f}{700}\right)$$
the energy of the mel filter is calculated as follows:
$$E_m = \sum_{k=f_{\mathrm{start}}}^{f_{\mathrm{end}}} |X(k)|^2\,H_m(k)$$
Wherein, $k$ is the frequency bin, $X(k)$ is the original frequency domain signal, $|X(k)|^2$ is the power spectral density of the original frequency domain signal at the $k$-th frequency bin, $H_m(k)$ is the frequency response of the $m$-th mel filter at the $k$-th frequency bin, $f_{\mathrm{start}}$ is the start frequency of the mel filter, and $f_{\mathrm{end}}$ is the end frequency of the mel filter;
The discrete cosine transform expression is:
$$c_i = \sum_{m=1}^{M} \log(E_m)\,\cos\left(\frac{\pi i\,(m - 0.5)}{M}\right)$$
Wherein, $c_i$ is the $i$-th MFCC characteristic parameter, $M$ is the number of mel filters, and $E_m$ is the energy of the $m$-th mel filter.
Further, the step of inputting the MFCC characteristic parameter into the separable convolutional neural network after the INT8 fixed-point quantization process to obtain the noise estimation result includes:
Sequentially inputting the MFCC characteristic parameters into the input layer, the first depth separable convolution layer and the second depth separable convolution layer, flattening the output of the second depth separable convolution layer and inputting the flattened output into the fully connected layer to generate the noise estimation result;
The output layer outputs the noise estimation result, wherein if the VAD signal of the noise estimation result is higher than a preset threshold value, the voice signal of the current frame is the non-voice frame, and if the VAD signal of the noise estimation result is smaller than or equal to the threshold value, the voice signal of the current frame is the voice frame.
Further, the first depth-separable convolution layer includes a first depth convolution and a first point-wise convolution, and the second depth-separable convolution layer includes a second depth convolution and a second point-wise convolution.
Further, the step of performing adaptive iterative wiener filtering on the speech frame and the non-speech frame by using a wiener filter to obtain a filtered frequency domain signal includes:
The average power spectrum of the mute section is adopted as an initial noise power spectrum estimation, and a preset constant is adopted as a voice power spectrum;
If the speech signal of the current frame is the non-speech frame, updating the noise power spectrum estimation:
$$P'_{\mathrm{noise}}(k) = \beta\,P_{\mathrm{noise}}(k) + (1 - \beta)\,|X(k)|^2$$
Wherein, $P_{\mathrm{noise}}(k)$ is the noise power spectrum estimate of the $k$-th frequency bin of the current frame, $P'_{\mathrm{noise}}(k)$ is the updated noise power spectrum estimate of the $k$-th frequency bin of the current frame, $|X(k)|^2$ is the power spectral density of the $k$-th frequency bin of the current frame, and $\beta$ is a smoothing coefficient;
if the voice signal of the current frame is the voice frame, maintaining the noise power spectrum estimation unchanged;
calculating the gain $G(k)$ of the wiener filter according to the updated noise power spectrum estimation and the voice power spectrum $P_{\mathrm{speech}}(k)$:
$$G(k) = \frac{P_{\mathrm{speech}}(k)}{P_{\mathrm{speech}}(k) + P_{\mathrm{noise}}(k)}$$
Performing adaptive iterative wiener filtering on the original frequency domain signal of the current frame by using the wiener filter to obtain a filtered frequency domain signal, wherein the adaptive iterative wiener filtering has the following expression:
$$Y(k) = G(k)\,X(k)$$
Wherein, $Y(k)$ is the filtered frequency domain signal, and $X(k)$ is the original frequency domain signal.
The technical scheme provided by the embodiment of the disclosure can comprise the following beneficial effects:
In the embodiments of the present disclosure, the noise reduction method for edge-side real-time voice, on the one hand, performs pre-emphasis, framing, and windowing on an input voice signal to boost its high-frequency component and reduce distortion. The preprocessed input speech signal is subjected to a fast Fourier transform to convert the time domain signal into an original frequency domain signal. The Mel frequency cepstral coefficients of the original frequency domain signal are calculated to extract the MFCC characteristic parameters of the voice. A lightweight separable convolutional neural network performs noise estimation on the MFCC characteristic parameters, which reduces the parameter count and the amount of computation and achieves efficient noise estimation. Adaptive iterative Wiener filtering is performed using the generated noise estimate, gradually eliminating noise in the voice signal to obtain a filtered frequency domain signal. An inverse fast Fourier transform is performed on the filtered frequency domain signal to recover the time domain signal, which is output as the noise-reduced voice signal. On the other hand, the method can perform offline real-time noise reduction in edge scenarios with limited power consumption and scarce computing resources, and achieves a better noise reduction effect.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure. It will be apparent to those of ordinary skill in the art that the drawings in the following description are merely examples of the disclosure and that other drawings may be derived from them without undue effort.
FIG. 1 is a step diagram of a method for edge-side real-time speech noise reduction in an exemplary embodiment of the present disclosure;
FIG. 2 is a schematic of synthesized noisy speech audio at -20 dB and its denoised audio in an exemplary embodiment of the present disclosure;
FIG. 3 is a schematic of a call recording with environmental noise and its denoised audio in an exemplary embodiment of the present disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many forms and should not be construed as limited to the examples set forth herein, but rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the exemplary embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Furthermore, the drawings are merely schematic illustrations of embodiments of the disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus a repetitive description thereof will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities.
The embodiment provides a noise reduction method for edge-side real-time voice. Referring to fig. 1, the method for noise reduction of the edge-side real-time voice may include steps S101 to S107.
Step S101, acquiring an input voice signal, and preprocessing the input voice signal to obtain a plurality of frames of voice signals;
Step S102, performing fast Fourier transform on the voice signal aiming at one frame of the voice signal to obtain an original frequency domain signal;
Step S103, calculating the Mel frequency cepstrum coefficient of the original frequency domain signal to extract the MFCC characteristic parameters of the input voice signal;
Step S104, inputting the MFCC characteristic parameters into a separable convolutional neural network subjected to INT8 fixed-point quantization processing to obtain a noise estimation result, wherein the separable convolutional neural network comprises an input layer, a first depth separable convolutional layer, a second depth separable convolutional layer, a full connection layer and an output layer which are sequentially serial, and the noise estimation result comprises a voice frame and a non-voice frame;
Step S105, performing adaptive iterative wiener filtering on the voice frame and the non-voice frame by using a wiener filter to obtain a filtered frequency domain signal;
Step S106, performing inverse fast Fourier transform on the filtered frequency domain signal to obtain a filtered time domain signal;
Step S107, traversing all the voice signals to obtain the filtered time domain signals corresponding to all the voice signals, and combining all the filtered time domain signals to obtain the noise-reduced voice signals.
With this noise reduction method for edge-side real-time voice, on the one hand, pre-emphasis, framing, and windowing are performed on the input voice signal to boost its high-frequency component and reduce distortion. The preprocessed input speech signal is subjected to a fast Fourier transform to convert the time domain signal into an original frequency domain signal. The Mel frequency cepstral coefficients of the original frequency domain signal are calculated to extract the MFCC characteristic parameters of the voice. A lightweight separable convolutional neural network performs noise estimation on the MFCC characteristic parameters, which reduces the parameter count and the amount of computation and achieves efficient noise estimation. Adaptive iterative Wiener filtering is performed using the generated noise estimate, gradually eliminating noise in the voice signal to obtain a filtered frequency domain signal. An inverse fast Fourier transform is performed on the filtered frequency domain signal to recover the time domain signal, which is output as the noise-reduced voice signal. On the other hand, the method can perform offline real-time noise reduction in edge scenarios with limited power consumption and scarce computing resources, and achieves a better noise reduction effect.
Next, each step of the above-described edge-side real-time voice noise reduction method in the present exemplary embodiment will be described in more detail with reference to fig. 1 to 3.
In step S101, an input speech signal is acquired and preprocessed to obtain a number of frames of speech signals.
The method comprises the steps of pre-emphasizing an input voice signal by using a pre-emphasizing filter, dividing the pre-emphasized input voice signal into a plurality of frames of voice signals, and windowing each frame of voice signal by using a Hamming window to reduce frequency spectrum leakage.
In one embodiment, the input speech signal is pre-emphasized to enhance its high-frequency content. Specifically, a pre-emphasis filter is applied to the input signal $x(n)$:
$$y(n) = x(n) - \alpha\,x(n-1)$$
Wherein, $x(n)$ is the value of the original input signal at the $n$-th sample point, $x(n-1)$ is the value of the original input signal at the $(n-1)$-th sample point, $y(n)$ is the value of the pre-emphasized input signal at the $n$-th sample point, and $\alpha$ is the pre-emphasis coefficient, which controls the degree of enhancement of the high-frequency component; for example, $\alpha = 0.95$.
The pre-emphasized signal is divided into frames, each 10 ms long, with a frame shift of 5 ms. A Hamming window is applied to each frame to reduce spectral leakage. The window function is:
$$w(n) = 0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right)$$
Wherein, $w(n)$ is the value of the Hamming window function at the $n$-th sample point, $N$ is the number of sample points per frame (the frame length), and $n$ is the index of the current sample point, with range $0 \le n \le N-1$.
When the Hamming window is applied, each frame of the speech signal $s(n)$ is windowed as follows:
$$s_w(n) = s(n)\,w(n)$$
Wherein, $s(n)$ is the value of the voice signal at the $n$-th sample point, and $s_w(n)$ is the value of the windowed voice signal at the $n$-th sample point.
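The preprocessing of step S101 can be sketched in NumPy as follows. This is a minimal illustration, not the patented implementation: the 16 kHz sample rate and the function name `preprocess` are assumptions, while the pre-emphasis rule y(n) = x(n) - 0.95 x(n-1), the 10 ms frame length, the 5 ms frame shift, and the Hamming window come from the embodiment.

```python
import numpy as np

def preprocess(x, sample_rate=16000, alpha=0.95, frame_ms=10, shift_ms=5):
    """Pre-emphasize, frame, and Hamming-window a 1-D signal (sample_rate is assumed)."""
    x = np.asarray(x, dtype=np.float64)
    frame_len = sample_rate * frame_ms // 1000   # N, samples per frame
    shift = sample_rate * shift_ms // 1000       # hop between frame starts
    if len(x) < frame_len:
        raise ValueError("signal shorter than one frame")

    # Pre-emphasis: y(n) = x(n) - alpha * x(n-1); the first sample is passed through.
    y = np.empty_like(x)
    y[0] = x[0]
    y[1:] = x[1:] - alpha * x[:-1]

    # Hamming window: w(n) = 0.54 - 0.46 cos(2 pi n / (N - 1)), 0 <= n <= N - 1.
    n = np.arange(frame_len)
    w = 0.54 - 0.46 * np.cos(2.0 * np.pi * n / (frame_len - 1))

    # Row i holds samples [i * shift, i * shift + frame_len) of the pre-emphasized signal.
    n_frames = 1 + (len(y) - frame_len) // shift
    idx = n[None, :] + shift * np.arange(n_frames)[:, None]
    return y[idx] * w


frames = preprocess(np.random.default_rng(0).standard_normal(16000))
print(frames.shape)  # one second of audio at the assumed 16 kHz rate
```

Trailing samples that do not fill a complete frame are dropped in this sketch; a streaming implementation would instead buffer them until the next block arrives.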
In step S102, for one frame of the speech signal, a fast fourier transform is performed on the speech signal to obtain an original frequency domain signal.
Specifically, the windowed signal of each frame is subjected to a fast fourier transform (Fast Fourier transform, FFT), and the time domain signal is converted into a frequency domain signal, so as to obtain a complex frequency spectrum.
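As a rough illustration of step S102, the sketch below computes the one-sided complex spectrum of a single windowed frame with `numpy.fft.rfft`. The helper name `frame_spectrum`, the 16 kHz sample rate, and the 160-sample frame are assumptions made for the example.

```python
import numpy as np

def frame_spectrum(windowed_frame, n_fft=None):
    """Return the one-sided complex spectrum X(k) and the power |X(k)|^2 of a real frame."""
    frame = np.asarray(windowed_frame, dtype=np.float64)
    n_fft = n_fft or frame.shape[-1]
    spectrum = np.fft.rfft(frame, n=n_fft)   # bins k = 0 .. n_fft // 2
    power = np.abs(spectrum) ** 2
    return spectrum, power


# A 1 kHz tone sampled at an assumed 16 kHz falls on bin 1000 / (16000 / 160) = 10.
t = np.arange(160) / 16000.0
spectrum, power = frame_spectrum(np.hamming(160) * np.sin(2 * np.pi * 1000.0 * t))
print(spectrum.shape, int(np.argmax(power)))
```

Because the frame is real-valued, only the non-negative frequency bins need to be kept; the inverse transform in step S106 can restore the frame from them.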
In step S103, the mel-frequency cepstrum coefficient thereof is calculated from the original frequency-domain signal to extract the MFCC characteristic parameters of the input speech signal.
Specifically, a mel filter bank is used to convert the frequency $f$ of the original frequency domain signal to the mel frequency $f_{\mathrm{mel}}$, wherein the mel filter bank comprises several mel filters; the energy of each mel filter is calculated according to the mel frequency $f_{\mathrm{mel}}$ and the original frequency domain signal; and the logarithm of the energy of all mel filters is taken and a discrete cosine transform is performed to obtain the MFCC characteristic parameters.
In one embodiment, the Mel frequency cepstral coefficients (Mel-Frequency Cepstral Coefficients, MFCC) of the frequency domain signal are calculated. First, the spectrum is passed through a mel filter bank, and the energy of each filter is calculated. Then the logarithm of the filter energies is taken and a discrete cosine transform (DCT) is applied to yield the MFCC features. The specific process is as follows:
First, a mel filter bank (e.g., 40 filters are used in this embodiment) is used to convert the frequency $f$ to the mel frequency $f_{\mathrm{mel}}$. The mel frequency formula is as follows:
$$f_{\mathrm{mel}} = 2595\,\log_{10}\left(1 + \frac{f}{700}\right)$$
For each filter, its output energy $E_m$ is calculated:
$$E_m = \sum_{k=f_{\mathrm{start}}}^{f_{\mathrm{end}}} |X(k)|^2\,H_m(k)$$
Wherein, $k$ is the frequency bin, $X(k)$ is the original frequency domain signal, $|X(k)|^2$ is the power spectral density of the original frequency domain signal at the $k$-th frequency bin, $H_m(k)$ is the frequency response of the $m$-th mel filter at the $k$-th frequency bin, $f_{\mathrm{start}}$ is the start frequency of the mel filter, and $f_{\mathrm{end}}$ is the end frequency of the mel filter;
The expression of the discrete cosine transform is:
$$c_i = \sum_{m=1}^{M} \log(E_m)\,\cos\left(\frac{\pi i\,(m - 0.5)}{M}\right)$$
Wherein, $c_i$ is the $i$-th MFCC characteristic parameter (e.g., $i = 10$ in this embodiment), $M$ is the number of mel filters (e.g., $M = 40$ in this embodiment), and $E_m$ is the energy of the $m$-th mel filter.
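The MFCC extraction of step S103 can be sketched as below. Several details are assumptions rather than statements from the patent: the common mel mapping 2595 log10(1 + f/700), triangular filters spaced evenly on the mel scale from 0 Hz to the Nyquist frequency, a 16 kHz sample rate, zero-padding the frame to a 512-point FFT, coefficient indices starting at 1, and the small `eps` guard inside the logarithm. The 40 filters and 10 coefficients follow the embodiment's example values.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=np.float64) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=np.float64) / 2595.0) - 1.0)

def mel_filter_bank(n_filters, n_fft, sample_rate):
    """Triangular filters H_m(k), shape (n_filters, n_fft // 2 + 1)."""
    n_bins = n_fft // 2 + 1
    bin_hz = np.arange(n_bins) * sample_rate / n_fft
    # Filter edges are equally spaced on the mel scale between 0 Hz and Nyquist.
    top_mel = float(hz_to_mel(sample_rate / 2.0))
    edges = mel_to_hz(np.linspace(0.0, top_mel, n_filters + 2))
    bank = np.zeros((n_filters, n_bins))
    for m in range(n_filters):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        rising = (bin_hz - lo) / (mid - lo)
        falling = (hi - bin_hz) / (hi - mid)
        bank[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def mfcc(power, bank, n_coeffs=10, eps=1e-10):
    energies = bank @ power                      # E_m = sum_k |X(k)|^2 H_m(k)
    log_e = np.log(energies + eps)               # eps avoids log(0); an added safeguard
    n_filters = bank.shape[0]
    i = np.arange(1, n_coeffs + 1)[:, None]      # assumed coefficient indices 1..n_coeffs
    m = np.arange(1, n_filters + 1)[None, :]
    basis = np.cos(np.pi * i * (m - 0.5) / n_filters)
    return basis @ log_e                         # c_i = sum_m log(E_m) cos(pi i (m - 0.5) / M)


bank = mel_filter_bank(n_filters=40, n_fft=512, sample_rate=16000)
frame = np.hamming(160) * np.random.default_rng(0).standard_normal(160)
power = np.abs(np.fft.rfft(frame, n=512)) ** 2
print(mfcc(power, bank).shape)
```

The zero-padding matters for this particular filter layout: with a 160-point FFT the bins are 100 Hz apart, so the narrowest low-frequency triangles would contain no bin at all and their energies would collapse to zero.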
In step S104, the MFCC characteristic parameters are input into a separable convolutional neural network subjected to INT8 fixed-point quantization processing to obtain a noise estimation result, wherein the separable convolutional neural network comprises an input layer, a first depth separable convolutional layer, a second depth separable convolutional layer, a full connection layer and an output layer which are sequentially connected in series, and the noise estimation result comprises a voice frame and a non-voice frame.
More specifically, inputting the MFCC characteristic parameters into the input layer, the first depth separable convolution layer and the second depth separable convolution layer in sequence, flattening the output of the second depth separable convolution layer, and inputting the flattened output into the full connection layer to generate a noise estimation result;
The output layer outputs a noise estimation result, wherein if the VAD signal of the noise estimation result is higher than a preset threshold value, the voice signal of the current frame is a non-voice frame, and if the VAD signal of the noise estimation result is smaller than or equal to the threshold value, the voice signal of the current frame is a voice frame.
In addition, the first depth-separable convolution layer includes a first depth convolution and a first point-wise convolution, and the second depth-separable convolution layer includes a second depth convolution and a second point-wise convolution.
In one embodiment, a lightweight separable convolutional neural network (i.e., separable convolutional neural network) is constructed, and MFCC characteristics (i.e., MFCC characteristic parameters) are input into the designed lightweight separable convolutional neural network. The network consists of a plurality of depth separable convolution layers, and fixed-point quantization processing is carried out, so that a noise estimation result can be obtained quickly and efficiently. The method comprises the following specific steps:
(1) Input features: in the previous step, the Mel-frequency cepstral coefficients (MFCCs) of each frame of the signal were calculated; these MFCC features are used as the input to the neural network.
(2) Light separable convolutional neural network design
1) Network structure:
Input layer: takes the two-dimensional MFCC feature array (number of samples per frame × number of features).
Depth separable convolution layers (2 layers): depth separable convolution is used to reduce the number of parameters and the amount of computation. Each convolution layer is divided into a depth convolution (convolution kernel 10×2) and a point-by-point convolution (convolution kernel 1×1), with 32 channels.
Activation function: a ReLU activation function is applied after each separable convolution layer.
Fully connected layer: the output of the convolution layers is flattened and fed into the fully connected layer to generate the noise estimation result.
2) Network training
Data preparation: noisy speech and clean speech are labeled 1, and pure noise data is labeled 0; 15 hours of speech data are used for training.
Loss function: a mean square error loss function is used.
Quantization-aware training: to obtain a lightweight model, quantization-aware training is applied to the network structure, yielding a lightweight fixed-point-quantized network with a reduced amount of computation.
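The text does not spell out the quantization scheme, so the following is only a generic illustration of what an INT8 fixed-point representation of a weight tensor looks like. It assumes symmetric per-tensor quantization with a single floating-point scale; the function names are hypothetical. In quantization-aware training, a quantize-then-dequantize step of this kind is typically simulated in the forward pass so that the network learns weights that tolerate the rounding.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization: w is approximated by scale * q, q in [-127, 127]."""
    w = np.asarray(w, dtype=np.float32)
    max_abs = float(np.max(np.abs(w)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale


weights = np.linspace(-1.0, 1.0, 101).astype(np.float32)
q, scale = quantize_int8(weights)
max_error = float(np.max(np.abs(dequantize(q, scale) - weights)))
print(q.dtype, max_error <= scale / 2 + 1e-6)
```

Each weight then occupies one byte instead of four, and the rounding error per weight is bounded by about half the scale.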
In a specific embodiment, the output size of each layer of the neural network is:
Input layer: 160×10.
First depth separable convolution output: 151×32.
Second depth separable convolution output: 151×32.
After ReLU activation: 151×32.
Fully connected layer: input flattened to 4832.
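The layer sizes listed above, and the parameter saving that motivates separable convolutions, can be checked with a little arithmetic. The sketch assumes an unpadded, stride-1 convolution along the 160-long axis for the first layer, counts weights only (no biases), and uses a 32-to-32-channel layer with the 10×2 kernel for the comparison; these choices are illustrative assumptions.

```python
def conv_output_len(input_len, kernel_len, stride=1, padding=0):
    """Output length of a 1-D convolution along one axis."""
    return (input_len + 2 * padding - kernel_len) // stride + 1

def standard_conv_params(k_h, k_w, c_in, c_out):
    return k_h * k_w * c_in * c_out

def separable_conv_params(k_h, k_w, c_in, c_out):
    depthwise = k_h * k_w * c_in   # one k_h x k_w kernel per input channel
    pointwise = c_in * c_out       # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


print(conv_output_len(160, 10))              # 151
print(151 * 32)                              # 4832
print(standard_conv_params(10, 2, 32, 32))   # 20480
print(separable_conv_params(10, 2, 32, 32))  # 1664
```

Under these assumptions the separable layer needs about 8% of the weights of a standard convolution with the same kernel size and channel counts, since the ratio is 1/c_out + 1/(k_h k_w) = 1/32 + 1/20.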
In step S105, adaptive iterative wiener filtering is performed on the speech frame and the non-speech frame, respectively, using wiener filters to obtain filtered frequency domain signals.
Specifically, a key role of noise estimation using a neural network (i.e., a separable convolutional neural network) is to determine whether a speech signal is present in the current frame. In the adaptive iterative wiener filtering step, it is determined whether to update the parameters of the wiener filter based on the result of the voice activity detection.
(1) Voice Activity Detection (VAD)
In the noise estimation step, a lightweight separable convolutional neural network is used to indicate whether the current frame contains speech. A threshold is set (in this embodiment, the threshold is set to 0.8), and if the VAD signal output by the network is higher than the threshold, the current frame is determined to be a non-speech frame.
(2) Wiener filter initialization
At the beginning of the process, the average power spectrum of the silence segment is used as the initial noise power spectrum estimate, and a small constant value (0.0001) is used as the initial speech power spectrum.
(3) Iterative updating
Non-speech frame processing, noise power spectrum update
If the current frame is determined to be a non-speech frame, the noise power spectrum estimate is updated:
$$\hat{P}_n'(k) = \alpha\,\hat{P}_n(k) + (1-\alpha)\,P_x(k)$$
where $\hat{P}_n(k)$ is the noise power spectrum estimate at the $k$-th frequency bin of the current frame, $\hat{P}_n'(k)$ is the updated noise power spectrum estimate at the $k$-th frequency bin of the current frame, $P_x(k)$ is the power spectral density at the $k$-th frequency bin of the current frame, and $\alpha$ is the smoothing coefficient, set to 0.998 in this embodiment.
Speech frame processing, maintaining the noise power spectrum unchanged
If the current frame is determined to be a speech frame, the noise power spectrum estimate is not updated.
(4) Wiener filter computation
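The Wiener filter gain is calculated from the updated noise power spectrum and the speech power spectrum:
$$H(k) = \frac{P_s(k)}{P_s(k) + \hat{P}_n'(k)}$$
where $H(k)$ is the Wiener filter gain, $P_s(k)$ is the speech power spectrum, and $\hat{P}_n'(k)$ is the updated noise power spectrum estimate at the $k$-th frequency bin.
The Wiener filter is then applied to the frequency domain signal of the current frame:
$$Y(k) = H(k)\,X(k)$$
where $Y(k)$ is the filtered frequency domain signal and $X(k)$ is the original frequency domain signal.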
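The per-frame procedure of steps (1) to (4) can be sketched as follows. This is an illustrative sketch, not the implementation of the embodiment: it assumes the usual recursive-averaging form for the noise update, new = α·old + (1 − α)·frame PSD, and the usual gain speech / (speech + noise). The threshold 0.8 and smoothing coefficient 0.998 come from the embodiment; the function names and the sample spectra are hypothetical.

```python
VAD_THRESHOLD = 0.8   # threshold from the embodiment
ALPHA = 0.998         # smoothing coefficient from the embodiment


def update_noise_psd(noise_psd, frame_psd, alpha=ALPHA):
    """Recursive averaging of the noise PSD per frequency bin (assumed form)."""
    return [alpha * n + (1.0 - alpha) * p for n, p in zip(noise_psd, frame_psd)]


def wiener_gain(speech_psd, noise_psd):
    """Per-bin Wiener gain: speech / (speech + noise)."""
    return [s / (s + n) if (s + n) > 0 else 0.0
            for s, n in zip(speech_psd, noise_psd)]


def process_frame(spectrum, vad_score, noise_psd, speech_psd):
    """Filter one frame; returns (filtered spectrum, noise PSD after the frame)."""
    frame_psd = [abs(x) ** 2 for x in spectrum]
    # Per the text, a score above the threshold marks a non-speech frame,
    # and only non-speech frames update the noise estimate.
    if vad_score > VAD_THRESHOLD:
        noise_psd = update_noise_psd(noise_psd, frame_psd)
    gain = wiener_gain(speech_psd, noise_psd)
    filtered = [g * x for g, x in zip(gain, spectrum)]
    return filtered, noise_psd


# Hypothetical two-bin example.
spectrum = [1 + 1j, 2 + 0j]
noise0 = [1.0, 1.0]
speech = [1.0, 3.0]

# Speech frame: noise estimate kept, gains are 0.5 and 0.75.
filtered_speech, noise_after_speech = process_frame(spectrum, 0.5, noise0, speech)
# Non-speech frame: noise estimate moves slightly toward the frame PSD.
filtered_noise, noise_after_noise = process_frame(spectrum, 0.9, noise0, speech)
```

With α close to 1 the noise estimate changes slowly, so a single misclassified frame has little effect on the filter.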
In steps S106 and S107, an inverse fast Fourier transform is performed on the filtered frequency domain signal to obtain a filtered time domain signal; all frames of the speech signal are traversed to obtain the filtered time domain signal corresponding to each frame, and all the filtered time domain signals are combined to obtain the noise-reduced speech signal.
Specifically, the filtered frequency domain signal is subjected to an inverse fast Fourier transform (IFFT) to recover the time domain signal. The time domain signals of all frames are then combined to obtain the final noise-reduced speech signal.
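The inverse transform and frame combination can be illustrated with the sketch below. It uses a direct O(N²) discrete Fourier transform for readability, where a practical implementation would use an FFT routine, and it assumes non-overlapping frames that are simply concatenated; with overlapping frames an overlap-add step would be needed instead. The function names and the sample frames are hypothetical.

```python
import cmath


def dft(frame):
    """Direct discrete Fourier transform of one real-valued frame."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse transform; returns the real part as the time domain frame."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


def reconstruct(filtered_spectra):
    """Inverse-transform every frame and join the frames in order.

    Assumes non-overlapping frames (an assumption, not stated in the text).
    """
    signal = []
    for spectrum in filtered_spectra:
        signal.extend(idft(spectrum))
    return signal


# Round trip with unit gain: the output should match the input frames.
frames = [[0.0, 1.0, 0.0, -1.0], [0.5, 0.5, -0.5, -0.5]]
spectra = [dft(f) for f in frames]
output = reconstruct(spectra)
```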
In a specific embodiment, the noise reduction method for edge-side real-time speech proposed by this application (the RTC method) is compared with existing noise reduction methods (specifically RNNoise, DeepFilterNet and S-DCCRN). The proposed method consumes only 0.8 W, takes only 4 ms to process one 10 ms audio frame, and achieves more than 25 dB of noise suppression in strong-noise environments. Compared with traditional methods and pure deep learning schemes, it has fewer parameters, a smaller computation load, and a better noise reduction effect.
Table 1 comparison of noise reduction methods
Fig. 2 shows synthesized noisy speech at -20 dB and the audio after noise reduction.
Fig. 3 shows a speech recording made in ambient noise and the audio after noise reduction.
According to the noise reduction method for edge-side real-time speech of the present application, on the one hand, the input speech signal is pre-emphasized, framed, and windowed to boost its high-frequency components and reduce signal distortion. The preprocessed speech signal undergoes a fast Fourier transform, which converts the time domain signal into the original frequency domain signal. The mel frequency cepstrum coefficients of the original frequency domain signal are calculated to extract the MFCC characteristic parameters of the speech. The lightweight separable convolutional neural network performs noise estimation on the MFCC characteristic parameters, which reduces the number of parameters and the computation load and achieves efficient noise estimation. Adaptive iterative Wiener filtering is performed using the resulting noise estimate, gradually removing noise from the speech signal to obtain the filtered frequency domain signal. An inverse fast Fourier transform is performed on the filtered frequency domain signal to recover the time domain signal, which is output as the noise-reduced speech signal. On the other hand, the method can perform offline real-time noise reduction in edge-side scenarios with limited power consumption and limited computing resources, with a better noise reduction effect.
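The preprocessing stage summarized above can be sketched as follows, as a non-limiting illustration. The sketch assumes the conventional first-order pre-emphasis filter, the conventional Hamming coefficients 0.54 and 0.46, non-overlapping frames, and the common mel mapping 2595·log10(1 + f/700). The pre-emphasis coefficient 0.97, the frame length of 16 samples, and the test tone are hypothetical values chosen only to keep the example small.

```python
import math


def pre_emphasis(signal, alpha=0.97):
    """y(n) = x(n) - alpha * x(n-1); the first sample is passed through.

    Expects a non-empty signal. alpha = 0.97 is an assumed, typical value.
    """
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]


def split_frames(signal, frame_len):
    """Cut the signal into non-overlapping frames, dropping any remainder."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, frame_len)]


def hamming(frame_len):
    """Hamming window of the given length (requires frame_len > 1)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
            for n in range(frame_len)]


def apply_window(frame, window):
    """Multiply a frame by the window sample by sample."""
    return [s * w for s, w in zip(frame, window)]


def hz_to_mel(f_hz):
    """Common mel-scale mapping from frequency in Hz."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


# Hypothetical input: 64 samples of a sine tone.
signal = [math.sin(2 * math.pi * 5 * n / 64) for n in range(64)]
emphasized = pre_emphasis(signal)
frames = split_frames(emphasized, 16)
window = hamming(16)
windowed = [apply_window(f, window) for f in frames]
```

Each windowed frame would then be passed to the fast Fourier transform and the mel filter bank to obtain the MFCC characteristic parameters.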
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature. In the description of the embodiments of the present disclosure, the meaning of "a plurality" is two or more, unless explicitly defined otherwise.
In the embodiments of the present disclosure, unless explicitly specified and limited otherwise, the terms "mounted," "connected," "secured" and the like are to be construed broadly, and may be, for example, fixedly connected, detachably connected, or integrally formed, mechanically connected, electrically connected, directly connected, indirectly connected via an intervening medium, or in communication between two elements or in an interaction relationship between two elements. The specific meaning of the terms in this disclosure will be understood by those of ordinary skill in the art as the case may be.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present disclosure. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, one skilled in the art can combine the different embodiments or examples described in this specification.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
Claims (8)
1. A noise reduction method for edge-side real-time speech, characterized by comprising the following steps:
acquiring an input speech signal, and preprocessing the input speech signal to obtain a plurality of frames of speech signals;
for each frame of the speech signal, performing a fast Fourier transform on the speech signal to obtain an original frequency domain signal;
calculating mel frequency cepstrum coefficients of the original frequency domain signal according to the original frequency domain signal so as to extract MFCC characteristic parameters of the input speech signal;
inputting the MFCC characteristic parameters into a separable convolutional neural network subjected to INT8 fixed-point quantization processing to obtain a noise estimation result, wherein the separable convolutional neural network comprises an input layer, a first depthwise separable convolution layer, a second depthwise separable convolution layer, a fully connected layer and an output layer which are sequentially connected in series, and the noise estimation result comprises a speech frame and a non-speech frame;
performing adaptive iterative Wiener filtering on the speech frame and the non-speech frame by using a Wiener filter to obtain a filtered frequency domain signal;
performing an inverse fast Fourier transform on the filtered frequency domain signal to obtain a filtered time domain signal;
traversing all the speech signals to obtain the filtered time domain signals corresponding to all the speech signals, and combining all the filtered time domain signals to obtain a noise-reduced speech signal.
2. The method for noise reduction of edge-side real-time speech according to claim 1, wherein the step of obtaining an input speech signal and preprocessing the input speech signal comprises:
pre-emphasizing the input speech signal with a pre-emphasis filter;
dividing the pre-emphasized input speech signal into a plurality of frames of speech signals;
windowing each frame of the speech signal using a Hamming window to reduce spectral leakage.
3. The method for noise reduction of edge-side real-time speech according to claim 2, wherein the pre-emphasis filter has the expression:
$$y(n) = x(n) - \alpha\,x(n-1)$$
where $x(n)$ is the value of the original input signal at the $n$-th sample point, $x(n-1)$ is the value of the original input signal at the $(n-1)$-th sample point, $y(n)$ is the value of the pre-emphasized input signal at the $n$-th sample point, and $\alpha$ is the pre-emphasis coefficient;
the Hamming window expression is:
$$w(n) = 0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right)$$
where $w(n)$ is the value of the Hamming window function at the $n$-th sample point, $N$ is the number of samples per frame, and $n$ is the index of the current sample point, with range $0 \le n \le N-1$;
the windowed expression for each frame of the speech signal $s(n)$ is:
$$s_w(n) = s(n)\,w(n)$$
where $s(n)$ is the value of the speech signal at the $n$-th sample point and $s_w(n)$ is the value of the windowed speech signal at the $n$-th sample point.
4. The method of claim 3, wherein the step of calculating mel-frequency cepstrum coefficients of the original frequency-domain signal to extract MFCC characteristic parameters of the input speech signal comprises:
converting the frequency $f$ of the original frequency domain signal to the mel frequency $f_{\text{mel}}$ using a mel filter bank, wherein the mel filter bank comprises a plurality of mel filters;
calculating the energy of each mel filter according to the mel frequency $f_{\text{mel}}$ and the original frequency domain signal;
performing a discrete cosine transform on the logarithm of the energies of all the mel filters to obtain the MFCC characteristic parameters.
5. The method for noise reduction of edge-side real-time speech according to claim 4, wherein the mel frequency is expressed as:
$$f_{\text{mel}} = 2595\,\log_{10}\left(1 + \frac{f}{700}\right)$$
the energy of the mel filter is calculated as follows:
$$E_m = \sum_{k=f_{\text{start}}}^{f_{\text{end}}} |X(k)|^2\,H_m(k)$$
where $k$ is the frequency bin, $X(k)$ is the original frequency domain signal, $|X(k)|^2$ is the power spectral density of the original frequency domain signal at the $k$-th frequency bin, $H_m(k)$ is the frequency response of the $m$-th mel filter at the $k$-th frequency bin, $f_{\text{start}}$ is the starting frequency of the mel filter, and $f_{\text{end}}$ is the ending frequency of the mel filter;
the discrete cosine transform expression is:
$$c_i = \sum_{m=1}^{M} \log(E_m)\cos\left(\frac{\pi i\,(m - 0.5)}{M}\right)$$
where $c_i$ is the $i$-th MFCC characteristic parameter, $M$ is the number of mel filters, and $E_m$ is the energy of the $m$-th mel filter.
6. The method for noise reduction of edge-side real-time speech according to claim 5, wherein the step of inputting the MFCC characteristic parameters into a separable convolutional neural network subjected to INT8 fixed-point quantization processing to obtain a noise estimation result comprises:
sequentially inputting the MFCC characteristic parameters into the input layer, the first depthwise separable convolution layer and the second depthwise separable convolution layer, flattening the output of the second depthwise separable convolution layer and inputting the flattened output into the fully connected layer to generate the noise estimation result;
the output layer outputs the noise estimation result, wherein if the VAD signal of the noise estimation result is higher than a preset threshold, the speech signal of the current frame is the non-speech frame, and if the VAD signal of the noise estimation result is lower than or equal to the threshold, the speech signal of the current frame is the speech frame.
7. The method for noise reduction of edge-side real-time speech according to claim 6, wherein the first depthwise separable convolution layer comprises a first depthwise convolution and a first pointwise convolution, and the second depthwise separable convolution layer comprises a second depthwise convolution and a second pointwise convolution.
8. The method for noise reduction of edge-side real-time speech according to claim 6, wherein the step of performing adaptive iterative Wiener filtering on the speech frame and the non-speech frame, respectively, by using a Wiener filter to obtain a filtered frequency domain signal comprises:
adopting the average power spectrum of a silence segment as the initial noise power spectrum estimate, and adopting a preset constant as the speech power spectrum;
if the speech signal of the current frame is the non-speech frame, updating the noise power spectrum estimate:
$$\hat{P}_n'(k) = \alpha\,\hat{P}_n(k) + (1-\alpha)\,P_x(k)$$
where $\hat{P}_n(k)$ is the noise power spectrum estimate at the $k$-th frequency bin of the current frame, $\hat{P}_n'(k)$ is the updated noise power spectrum estimate at the $k$-th frequency bin of the current frame, $P_x(k)$ is the power spectral density at the $k$-th frequency bin of the current frame, and $\alpha$ is a smoothing coefficient;
if the speech signal of the current frame is the speech frame, keeping the noise power spectrum estimate unchanged;
calculating the gain $H(k)$ of the Wiener filter according to the updated noise power spectrum estimate $\hat{P}_n'(k)$ and the speech power spectrum $P_s(k)$:
$$H(k) = \frac{P_s(k)}{P_s(k) + \hat{P}_n'(k)}$$
performing adaptive iterative Wiener filtering on the original frequency domain signal of the current frame by using the Wiener filter to obtain the filtered frequency domain signal, wherein the adaptive iterative Wiener filtering has the following expression:
$$Y(k) = H(k)\,X(k)$$
where $Y(k)$ is the filtered frequency domain signal and $X(k)$ is the original frequency domain signal.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202510018663.4A CN119418712A (en) | 2025-01-07 | 2025-01-07 | A noise reduction method for real-time speech at the edge |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN119418712A (en) | 2025-02-11 |
Family
ID=94466039
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202510018663.4A Pending CN119418712A (en) | 2025-01-07 | 2025-01-07 | A noise reduction method for real-time speech at the edge |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN119418712A (en) |
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN120220713A (en) * | 2025-04-08 | 2025-06-27 | 中国铁路北京局集团有限公司丰台车辆段 | A method and device for adaptive noise reduction for head-mounted computer |
| CN120690222A (en) * | 2025-05-27 | 2025-09-23 | 江西亚瑞科技有限责任公司 | Real-time noise reduction method and system for call speech based on dynamic noise perception |
Citations (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2008028511A1 (en) * | 2006-09-08 | 2008-03-13 | Siemens Aktiengesellschaft | System and method for speaker recognition |
| EP2372707A1 (en) * | 2010-03-15 | 2011-10-05 | Svox AG | Adaptive spectral transformation for acoustic speech signals |
| CN112735456A (en) * | 2020-11-23 | 2021-04-30 | 西安邮电大学 | Speech enhancement method based on DNN-CLSTM network |
| US20210256988A1 (en) * | 2020-02-14 | 2021-08-19 | System One Noc & Development Solutions, S.A. | Method for Enhancing Telephone Speech Signals Based on Convolutional Neural Networks |
| CN115910034A (en) * | 2022-09-30 | 2023-04-04 | 兴业银行股份有限公司 | Speech language recognition method and system based on deep learning |
| CN116013348A (en) * | 2022-12-27 | 2023-04-25 | 思必驰科技股份有限公司 | Audio noise reduction method, device and storage medium |
| CN118283218A (en) * | 2024-04-12 | 2024-07-02 | 中国工商银行股份有限公司 | Audio and video reconstruction method and device for real-time conversation and electronic equipment |
Non-Patent Citations (2)
| Title |
|---|
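| 胡忠元: "Design and FPGA verification of a speech command recognition algorithm for assisted driving", China Master's Theses Full-text Database, Engineering Science and Technology II, no. 06, 15 June 2022 (2022-06-15), pages 035 - 251 * |
| 许昊: "Speaker recognition technology in noisy environments", China Master's Theses Full-text Database, Information Science and Technology, vol. 01, no. 01, 15 January 2016 (2016-01-15), pages 136 - 101 * |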
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||