CN119418712A - A noise reduction method for real-time speech at the edge - Google Patents

A noise reduction method for real-time speech at the edge

Info

Publication number
CN119418712A
Authority
CN
China
Prior art keywords
signal
voice
noise
domain signal
mel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202510018663.4A
Other languages
Chinese (zh)
Inventor
敬运腾
陈瀚轩
王诗音
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xi'an Saipute Information Technology Co ltd
Original Assignee
Xi'an Saipute Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xi'an Saipute Information Technology Co ltd filed Critical Xi'an Saipute Information Technology Co ltd
Priority to CN202510018663.4A priority Critical patent/CN119418712A/en
Publication of CN119418712A publication Critical patent/CN119418712A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0224 Processing in the time domain
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0232 Processing in the frequency domain
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Noise Elimination (AREA)

Abstract

The application belongs to the technical field of voice noise reduction and provides a noise reduction method for real-time voice at the edge. The disclosed embodiments apply pre-emphasis, framing, and windowing to an input speech signal to boost its high-frequency content and reduce distortion. Each preprocessed frame is converted from the time domain into an original frequency domain signal by a fast Fourier transform. Mel-frequency cepstral coefficients are computed from the original frequency domain signal to extract the MFCC characteristic parameters of the speech. A lightweight separable convolutional neural network performs noise estimation on the MFCC characteristic parameters, which reduces the parameter count and the amount of computation and makes the noise estimation efficient. Adaptive iterative Wiener filtering then uses the generated noise estimate to progressively remove noise from the speech signal, yielding a filtered frequency domain signal. An inverse fast Fourier transform of the filtered frequency domain signal recovers the time domain signal, which is output as the noise-reduced speech signal.

Description

Noise reduction method for edge-end real-time voice
Technical Field
The embodiment of the disclosure relates to the technical field of voice noise reduction, in particular to a noise reduction method for real-time voice at an edge end.
Background
With the rapid development of information technology, voice communication and voice recognition technologies are widely used in various fields. In particular, in smart phones, smart assistants, teleconferencing, and other scenarios, clear speech signal transmission and processing are critical. However, in practical applications, the speech signal is often subjected to various noise, which may come from the environment (e.g., traffic noise, crowd noise, etc.) or the device itself (e.g., poor microphone quality, unstable signal transmission, etc.). These noises can significantly reduce the clarity and intelligibility of the speech signal and even lead to information delivery errors.
At present, speech noise reduction falls mainly into two categories: traditional filter-based noise reduction and deep-learning-based noise reduction. Traditional filter-based methods require little computation and suit offline real-time noise reduction at the edge, but their noise reduction effect is poor, and non-stationary noise in particular is difficult to filter out. Deep-learning-based schemes achieve excellent noise reduction, but their models require a huge amount of computation and are difficult to deploy in edge scenarios with limited power consumption and computing resources.
Accordingly, there is a need to address one or more of the problems in the related art described above.
It is noted that this section is intended to provide a background or context for the technical solutions of the present disclosure as set forth in the claims. The description herein is not admitted to be prior art by inclusion in this section.
Disclosure of Invention
An object of the embodiments of the present disclosure is to provide a noise reduction method for edge-side real-time voice, which overcomes one or more of the problems due to the limitations and disadvantages of the related art at least to some extent.
According to an embodiment of the present disclosure, there is provided a method for noise reduction of edge-side real-time speech, the method including:
Acquiring an input voice signal, and preprocessing the input voice signal to obtain a plurality of frames of voice signals;
For each frame of the voice signal, performing a fast Fourier transform on that frame to obtain an original frequency domain signal;
Calculating mel-frequency cepstral coefficients from the original frequency domain signal so as to extract MFCC characteristic parameters of the input voice signal;
Inputting the MFCC characteristic parameters into a separable convolutional neural network that has undergone INT8 fixed-point quantization to obtain a noise estimation result, wherein the separable convolutional neural network comprises an input layer, a first depth separable convolution layer, a second depth separable convolution layer, a fully connected layer and an output layer connected in series in that order, and the noise estimation result comprises a voice frame and a non-voice frame;
Performing adaptive iterative Wiener filtering on the voice frame and the non-voice frame by using a Wiener filter to obtain a filtered frequency domain signal;
Performing an inverse fast Fourier transform on the filtered frequency domain signal to obtain a filtered time domain signal;
Traversing all the frames of the voice signal to obtain the filtered time domain signal corresponding to each frame, and combining all the filtered time domain signals to obtain a noise-reduced voice signal.
Further, the step of obtaining an input voice signal and preprocessing the input voice signal includes:
Pre-emphasizing the input speech signal with a pre-emphasis filter;
Dividing the pre-emphasized input voice signal into a plurality of frames of voice signals;
Windowing each frame of the voice signal with a Hamming window to reduce spectral leakage.
Further, the expression of the pre-emphasis filter is:
$$y(n) = x(n) - \alpha\, x(n-1)$$
Wherein $x(n)$ is the value of the original input signal at the $n$-th sample point, $x(n-1)$ is the value of the original input signal at the $(n-1)$-th sample point, $y(n)$ is the value of the pre-emphasized input signal at the $n$-th sample point, and $\alpha$ is the pre-emphasis coefficient;
the Hamming window expression is:
$$w(n) = 0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right)$$
Wherein $w(n)$ is the value of the Hamming window function at the $n$-th sample point, $N$ is the number of samples per frame, and $n$ is the index of the current sample point, with range $0 \le n \le N-1$;
The windowed expression of each frame of the voice signal $s(n)$ is:
$$s_w(n) = s(n)\, w(n)$$
Wherein $s(n)$ is the value of the voice signal at the $n$-th sample point, and $s_w(n)$ is the value of the windowed voice signal at the $n$-th sample point.
Further, the step of calculating mel-frequency cepstrum coefficients of the original frequency-domain signal to extract MFCC characteristic parameters of the input speech signal includes:
Using a mel filter bank to convert the frequency $f$ of the original frequency domain signal to the mel frequency $f_{\text{mel}}$, wherein the mel filter bank comprises a plurality of mel filters;
Calculating the energy of each mel filter according to the mel frequency $f_{\text{mel}}$ and the original frequency domain signal;
and carrying out a discrete cosine transform on the logarithm of the energy of all the mel filters to obtain the MFCC characteristic parameters.
Further, the mel frequency is expressed as:
$$f_{\text{mel}} = 2595\,\log_{10}\left(1 + \frac{f}{700}\right)$$
the energy of the mel filter is calculated as follows:
$$E_m = \sum_{k=f_{\text{start}}}^{f_{\text{end}}} |X(k)|^2\, H_m(k)$$
Wherein $k$ is the frequency bin, $X(k)$ is the original frequency domain signal, $|X(k)|^2$ is the power spectral density of the original frequency domain signal at the $k$-th frequency bin, $H_m(k)$ is the frequency response of the $m$-th mel filter at the $k$-th frequency bin, $f_{\text{start}}$ is the start frequency of the mel filter, and $f_{\text{end}}$ is the end frequency of the mel filter;
The discrete cosine transform expression is:
$$c_i = \sum_{m=1}^{M} \log(E_m)\,\cos\left(\frac{\pi i\,(m - 0.5)}{M}\right)$$
Wherein $c_i$ is the $i$-th MFCC characteristic parameter, $M$ is the number of mel filters, and $E_m$ is the energy of the $m$-th mel filter.
Further, the step of inputting the MFCC characteristic parameter into the separable convolutional neural network after the INT8 fixed-point quantization process to obtain the noise estimation result includes:
Sequentially inputting the MFCC characteristic parameters into the input layer, the first depth separable convolution layer and the second depth separable convolution layer, flattening the output of the second depth separable convolution layer and inputting the flattened output into the fully connected layer to generate the noise estimation result;
The output layer outputs the noise estimation result, wherein if the VAD signal of the noise estimation result is higher than a preset threshold value, the voice signal of the current frame is the non-voice frame, and if the VAD signal of the noise estimation result is smaller than or equal to the threshold value, the voice signal of the current frame is the voice frame.
Further, the first depth-separable convolution layer includes a first depth convolution and a first point-wise convolution, and the second depth-separable convolution layer includes a second depth convolution and a second point-wise convolution.
Further, the step of performing adaptive iterative wiener filtering on the speech frame and the non-speech frame by using a wiener filter to obtain a filtered frequency domain signal includes:
The average power spectrum of the mute section is adopted as an initial noise power spectrum estimation, and a preset constant is adopted as a voice power spectrum;
If the speech signal of the current frame is the non-speech frame, updating the noise power spectrum estimation:
$$P_n'(k) = \beta\, P_n(k) + (1 - \beta)\, P_x(k)$$
Wherein $P_n(k)$ is the noise power spectrum estimate at the $k$-th frequency bin of the current frame, $P_n'(k)$ is the updated noise power spectrum estimate at the $k$-th frequency bin of the current frame, $P_x(k) = |X(k)|^2$ is the power spectral density at the $k$-th frequency bin of the current frame, and $\beta$ is a smoothing coefficient;
if the voice signal of the current frame is the voice frame, maintaining the noise power spectrum estimation unchanged;
calculating the gain $G(k)$ of the Wiener filter according to the updated noise power spectrum estimation and the voice power spectrum $P_s(k)$:
$$G(k) = \frac{P_s(k)}{P_s(k) + P_n'(k)}$$
Performing adaptive iterative Wiener filtering on the original frequency domain signal of the current frame by using the Wiener filter to obtain a filtered frequency domain signal, wherein the adaptive iterative Wiener filtering has the following expression:
$$Y(k) = G(k)\, X(k)$$
Wherein $Y(k)$ is the filtered frequency domain signal, and $X(k)$ is the original frequency domain signal.
The technical scheme provided by the embodiment of the disclosure can comprise the following beneficial effects:
In the embodiments of the present disclosure, the noise reduction method for edge-side real-time voice, on the one hand, applies pre-emphasis, framing and windowing to an input voice signal to boost its high-frequency content and reduce distortion. Each preprocessed frame is converted from the time domain into an original frequency domain signal by a fast Fourier transform. Mel-frequency cepstral coefficients are computed from the original frequency domain signal to extract the MFCC characteristic parameters of the voice. A lightweight separable convolutional neural network performs noise estimation on the MFCC characteristic parameters, which reduces the parameter count and the amount of computation and makes the noise estimation efficient. Adaptive iterative Wiener filtering then uses the generated noise estimate to progressively remove noise from the voice signal, yielding a filtered frequency domain signal. An inverse fast Fourier transform of the filtered frequency domain signal recovers the time domain signal, which is output as the noise-reduced voice signal. On the other hand, the method can perform offline real-time noise reduction in edge scenarios with limited power consumption and computing resources, and achieves a better noise reduction effect.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure. It will be apparent to those of ordinary skill in the art that the drawings in the following description are merely examples of the disclosure and that other drawings may be derived from them without undue effort.
FIG. 1 is a step diagram of a method for edge-side real-time speech noise reduction in an exemplary embodiment of the present disclosure;
FIG. 2 is a schematic of synthesized noisy speech at -20 dB and the corresponding denoised audio in an exemplary embodiment of the present disclosure;
FIG. 3 is a schematic of a call recording with environmental noise and the corresponding denoised audio in an exemplary embodiment of the present disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many forms and should not be construed as limited to the examples set forth herein, but rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the exemplary embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Furthermore, the drawings are merely schematic illustrations of embodiments of the disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus a repetitive description thereof will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities.
The embodiment provides a noise reduction method for edge-side real-time voice. Referring to fig. 1, the method for noise reduction of the edge-side real-time voice may include steps S101 to S107.
Step S101, acquiring an input voice signal, and preprocessing the input voice signal to obtain a plurality of frames of voice signals;
Step S102, for each frame of the voice signal, performing a fast Fourier transform on that frame to obtain an original frequency domain signal;
Step S103, calculating mel-frequency cepstral coefficients from the original frequency domain signal to extract the MFCC characteristic parameters of the input voice signal;
Step S104, inputting the MFCC characteristic parameters into a separable convolutional neural network that has undergone INT8 fixed-point quantization to obtain a noise estimation result, wherein the separable convolutional neural network comprises an input layer, a first depth separable convolution layer, a second depth separable convolution layer, a fully connected layer and an output layer connected in series in that order, and the noise estimation result comprises a voice frame and a non-voice frame;
Step S105, performing adaptive iterative Wiener filtering on the voice frame and the non-voice frame by using a Wiener filter to obtain a filtered frequency domain signal;
Step S106, performing an inverse fast Fourier transform on the filtered frequency domain signal to obtain a filtered time domain signal;
Step S107, traversing all the frames of the voice signal to obtain the filtered time domain signal corresponding to each frame, and combining all the filtered time domain signals to obtain the noise-reduced voice signal.
According to the noise reduction method for edge-side real-time voice described above, on the one hand, pre-emphasis, framing and windowing are applied to an input voice signal to boost its high-frequency content and reduce distortion. Each preprocessed frame is converted from the time domain into an original frequency domain signal by a fast Fourier transform. Mel-frequency cepstral coefficients are computed from the original frequency domain signal to extract the MFCC characteristic parameters of the voice. A lightweight separable convolutional neural network performs noise estimation on the MFCC characteristic parameters, which reduces the parameter count and the amount of computation and makes the noise estimation efficient. Adaptive iterative Wiener filtering then uses the generated noise estimate to progressively remove noise from the voice signal, yielding a filtered frequency domain signal. An inverse fast Fourier transform of the filtered frequency domain signal recovers the time domain signal, which is output as the noise-reduced voice signal. On the other hand, the method can perform offline real-time noise reduction in edge scenarios with limited power consumption and computing resources, and achieves a better noise reduction effect.
Next, each step of the above-described edge-side real-time voice noise reduction method in the present exemplary embodiment will be described in more detail with reference to fig. 1 to 3.
In step S101, an input speech signal is acquired and preprocessed to obtain a number of frames of speech signals.
Specifically, the input voice signal is pre-emphasized with a pre-emphasis filter, the pre-emphasized input voice signal is divided into a plurality of frames of voice signals, and each frame of the voice signal is windowed with a Hamming window to reduce spectral leakage.
In one embodiment, the input speech signal is pre-emphasized to boost its high-frequency content. Specifically, a pre-emphasis filter is applied to the input signal $x(n)$:
$$y(n) = x(n) - \alpha\, x(n-1)$$
Wherein $x(n)$ is the value of the original input signal at the $n$-th sample point, $x(n-1)$ is the value of the original input signal at the $(n-1)$-th sample point, $y(n)$ is the value of the pre-emphasized input signal at the $n$-th sample point, and $\alpha$ is the pre-emphasis coefficient, which controls the degree of enhancement of the high-frequency components; for example, $\alpha = 0.95$.
The pre-emphasized signal is divided into frames, each 10 ms long, with a frame shift of 5 ms. A Hamming window is applied to each frame to reduce spectral leakage. The window function is:
$$w(n) = 0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right)$$
Wherein $w(n)$ is the value of the Hamming window function at the $n$-th sample point; $N$ is the number of sample points per frame, namely the frame length; and $n$ is the index of the current sample point, with range $0 \le n \le N-1$.
When the Hamming window is applied, each frame of the speech signal $s(n)$ is windowed as follows:
$$s_w(n) = s(n)\, w(n)$$
Wherein $s(n)$ is the value of the voice signal at the $n$-th sample point, and $s_w(n)$ is the value of the windowed voice signal at the $n$-th sample point.
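The preprocessing of step S101 can be sketched in NumPy as follows. This is a minimal illustration rather than the implementation of the embodiment: the 16 kHz sampling rate is an assumption (it would make a 10 ms frame 160 samples long), and the function name `preprocess` and its parameters are hypothetical.

```python
import numpy as np

def preprocess(signal, sample_rate=16000, frame_ms=10, shift_ms=5, alpha=0.95):
    """Pre-emphasize, frame, and Hamming-window a 1-D speech signal."""
    signal = np.asarray(signal, dtype=np.float64)
    # Pre-emphasis: y(n) = x(n) - alpha * x(n-1); the first sample is passed through.
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])

    frame_len = sample_rate * frame_ms // 1000    # 160 samples at the assumed 16 kHz
    frame_shift = sample_rate * shift_ms // 1000  # 80 samples at the assumed 16 kHz
    if len(emphasized) < frame_len:
        raise ValueError("signal is shorter than one frame")
    num_frames = 1 + (len(emphasized) - frame_len) // frame_shift

    # Index matrix: row i holds the sample indices of frame i.
    idx = np.arange(frame_len)[None, :] + frame_shift * np.arange(num_frames)[:, None]
    frames = emphasized[idx]

    # Hamming window: w(n) = 0.54 - 0.46 * cos(2*pi*n / (N - 1)).
    n = np.arange(frame_len)
    window = 0.54 - 0.46 * np.cos(2 * np.pi * n / (frame_len - 1))
    return frames * window

# One second of synthetic input at the assumed 16 kHz rate.
frames = preprocess(np.random.default_rng(0).standard_normal(16000))
print(frames.shape)  # (199, 160)
```

Trailing samples that do not fill a whole frame are dropped in this sketch; a streaming implementation would instead buffer them until the next frame shift arrives.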
In step S102, for each frame of the speech signal, a fast Fourier transform is performed on that frame to obtain an original frequency domain signal.
Specifically, the windowed signal of each frame is subjected to a fast Fourier transform (FFT), which converts the time domain signal into a frequency domain signal and yields a complex spectrum.
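As a minimal sketch of step S102, the complex spectrum and power spectrum of one frame can be computed with NumPy's real-input FFT. The helper name `frame_spectrum`, the optional zero-padding length `n_fft`, and the 16 kHz sampling rate used in the example are assumptions for illustration only.

```python
import numpy as np

def frame_spectrum(windowed_frame, n_fft=None):
    """Return the complex spectrum X(k) and power spectrum |X(k)|^2 of one frame."""
    frame = np.asarray(windowed_frame, dtype=np.float64)
    n_fft = n_fft or len(frame)
    # rfft keeps only the non-negative frequencies: n_fft // 2 + 1 bins.
    spectrum = np.fft.rfft(frame, n=n_fft)
    power = np.abs(spectrum) ** 2
    return spectrum, power

# A 1 kHz tone sampled at an assumed 16 kHz over a 160-sample frame:
# the bin spacing is 100 Hz, so the peak falls in bin 10.
t = np.arange(160) / 16000.0
spectrum, power = frame_spectrum(np.cos(2 * np.pi * 1000.0 * t))
print(spectrum.shape, int(np.argmax(power)))  # (81,) 10
```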
In step S103, the mel-frequency cepstrum coefficient thereof is calculated from the original frequency-domain signal to extract the MFCC characteristic parameters of the input speech signal.
Specifically, a mel filter bank is used to convert the frequency $f$ of the original frequency domain signal to the mel frequency $f_{\text{mel}}$, wherein the mel filter bank comprises a plurality of mel filters; the energy of each mel filter is calculated according to the mel frequency $f_{\text{mel}}$ and the original frequency domain signal; and the logarithm of the energy of all the mel filters is taken and a discrete cosine transform is performed to obtain the MFCC characteristic parameters.
In one embodiment, the mel-frequency cepstral coefficients (MFCC) of the frequency domain signal are calculated. First, the spectrum is passed through a mel filter bank, and the energy of each filter is calculated. The logarithm of each filter energy is then taken and a discrete cosine transform (DCT) is applied to yield the MFCC features. The specific process is as follows:
First, a mel filter bank (e.g., 40 filters are used in this embodiment) converts the frequency $f$ to the mel frequency $f_{\text{mel}}$. The mel frequency formula is as follows:
$$f_{\text{mel}} = 2595\,\log_{10}\left(1 + \frac{f}{700}\right)$$
For each filter, its output energy $E_m$ is calculated:
$$E_m = \sum_{k=f_{\text{start}}}^{f_{\text{end}}} |X(k)|^2\, H_m(k)$$
Wherein $k$ is the frequency bin, $X(k)$ is the original frequency domain signal, $|X(k)|^2$ is the power spectral density of the original frequency domain signal at the $k$-th frequency bin, $H_m(k)$ is the frequency response of the $m$-th mel filter at the $k$-th frequency bin, $f_{\text{start}}$ is the start frequency of the mel filter, and $f_{\text{end}}$ is the end frequency of the mel filter;
The expression of the discrete cosine transform is:
$$c_i = \sum_{m=1}^{M} \log(E_m)\,\cos\left(\frac{\pi i\,(m - 0.5)}{M}\right)$$
Wherein $c_i$ is the $i$-th MFCC characteristic parameter (e.g., 10 MFCC characteristic parameters are taken in this embodiment), $M$ is the number of mel filters (e.g., $M = 40$ in this embodiment), and $E_m$ is the energy of the $m$-th mel filter.
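The MFCC extraction of step S103 can be sketched as follows. The triangular filter shape, the 512-point FFT length, the 16 kHz sampling rate, the small `eps` added before the logarithm, and the unnormalized DCT with index $i$ starting at 1 are all assumptions of this sketch; the text only fixes the number of filters (40) and the number of coefficients (10).

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=np.float64) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=np.float64) / 2595.0) - 1.0)

def mel_filter_bank(num_filters, n_fft, sample_rate):
    """Triangular filters H_m(k), shape (num_filters, n_fft // 2 + 1)."""
    num_bins = n_fft // 2 + 1
    bin_freqs = np.arange(num_bins) * sample_rate / n_fft
    # num_filters + 2 edge frequencies, equally spaced on the mel scale.
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(sample_rate / 2), num_filters + 2))
    bank = np.zeros((num_filters, num_bins))
    for m in range(num_filters):
        lo, center, hi = edges[m], edges[m + 1], edges[m + 2]
        rising = (bin_freqs - lo) / (center - lo)
        falling = (hi - bin_freqs) / (hi - center)
        bank[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def mfcc(power_spectrum, bank, num_coeffs=10, eps=1e-10):
    """c_i = sum_m log(E_m) * cos(pi * i * (m - 0.5) / M), for i = 1..num_coeffs."""
    energies = bank @ power_spectrum      # E_m = sum_k |X(k)|^2 * H_m(k)
    log_e = np.log(energies + eps)        # eps guards against empty filters
    M = bank.shape[0]
    m = np.arange(1, M + 1)
    i = np.arange(1, num_coeffs + 1)[:, None]
    basis = np.cos(np.pi * i * (m - 0.5) / M)   # shape (num_coeffs, M)
    return basis @ log_e

bank = mel_filter_bank(num_filters=40, n_fft=512, sample_rate=16000)
rng = np.random.default_rng(0)
power = np.abs(np.fft.rfft(rng.standard_normal(160), n=512)) ** 2
coeffs = mfcc(power, bank)
print(coeffs.shape)  # (10,)
```

Zero-padding the 160-sample frame to 512 points is a choice made here so that the narrow low-frequency mel filters each cover at least one FFT bin; with a 160-point FFT and 40 filters, the lowest filters would be empty.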
In step S104, the MFCC characteristic parameters are input into a separable convolutional neural network subjected to INT8 fixed-point quantization processing to obtain a noise estimation result, wherein the separable convolutional neural network comprises an input layer, a first depth separable convolutional layer, a second depth separable convolutional layer, a full connection layer and an output layer which are sequentially connected in series, and the noise estimation result comprises a voice frame and a non-voice frame.
More specifically, inputting the MFCC characteristic parameters into the input layer, the first depth separable convolution layer and the second depth separable convolution layer in sequence, flattening the output of the second depth separable convolution layer, and inputting the flattened output into the full connection layer to generate a noise estimation result;
The output layer outputs a noise estimation result, wherein if the VAD signal of the noise estimation result is higher than a preset threshold value, the voice signal of the current frame is a non-voice frame, and if the VAD signal of the noise estimation result is smaller than or equal to the threshold value, the voice signal of the current frame is a voice frame.
In addition, the first depth-separable convolution layer includes a first depth convolution and a first point-wise convolution, and the second depth-separable convolution layer includes a second depth convolution and a second point-wise convolution.
In one embodiment, a lightweight separable convolutional neural network (i.e., separable convolutional neural network) is constructed, and MFCC characteristics (i.e., MFCC characteristic parameters) are input into the designed lightweight separable convolutional neural network. The network consists of a plurality of depth separable convolution layers, and fixed-point quantization processing is carried out, so that a noise estimation result can be obtained quickly and efficiently. The method comprises the following specific steps:
(1) Input features: the mel-frequency cepstral coefficients (MFCCs) computed for each frame of the signal in the previous step are used as the input to the neural network.
(2) Lightweight separable convolutional neural network design
1) Network structure:
Input layer: receives the two-dimensional MFCC feature array (number of samples per frame × number of features).
Depth separable convolution layers (2 layers): depth separable convolution is used to reduce the number of parameters and the amount of computation. Each convolution layer is split into a depth convolution (convolution kernel 10×2) and a point-wise convolution (convolution kernel 1×1), with 32 channels.
Activation function: a ReLU activation function is applied after each separable convolution layer.
Fully connected layer: the output of the convolution layers is flattened and fed into the fully connected layer to generate the noise estimation result.
2) Network training
Data preparation: noisy speech and clean speech are labeled 1, and pure noise data is labeled 0; 15 hours of speech data are used for training.
Loss function: a mean square error loss function is used.
Quantization-aware training: to obtain a lightweight model, quantization-aware training is applied to the network structure, yielding a fixed-point quantized lightweight network with reduced computation.
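The text does not specify the quantization scheme. As a rough illustration of what INT8 fixed-point quantization does to a weight tensor, the sketch below uses symmetric per-tensor quantization, which is an assumption; the function names are hypothetical. Quantization-aware training typically runs the forward pass on such quantize-then-dequantize values so that the network learns to tolerate the rounding error.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor quantization: x is approximated by scale * q, q in [-127, 127]."""
    x = np.asarray(x, dtype=np.float32)
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    return q.astype(np.float32) * scale

def fake_quantize(x):
    """Quantize then dequantize, exposing the rounding error in floating point."""
    q, scale = quantize_int8(x)
    return dequantize_int8(q, scale)

weights = np.random.default_rng(0).standard_normal((10, 2)).astype(np.float32)
q, scale = quantize_int8(weights)
# The reconstruction error of each weight is bounded by about half a quantization step.
max_error = float(np.max(np.abs(fake_quantize(weights) - weights)))
print(q.dtype, max_error <= scale / 2 + 1e-6)
```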
In a specific embodiment, the output of each layer of the neural network is as follows:
Input layer: 160×10.
First depth separable convolution: output 151×32.
Second depth separable convolution: output 151×32.
After ReLU activation: 151×32.
Fully connected layer: input flattened to 4832.
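The sizes listed above can be checked with a few lines of arithmetic, which also show why a depth separable convolution needs fewer weights than a standard convolution with the same kernel. The sketch assumes an unpadded ("valid") convolution along the 160-sample axis and 32 input channels for the comparison, and it ignores bias terms; the helper names are hypothetical.

```python
def valid_conv_length(input_len, kernel_len, stride=1):
    """Output length of an unpadded convolution along one axis."""
    return (input_len - kernel_len) // stride + 1

def standard_conv_params(k_h, k_w, c_in, c_out):
    """Weights of a standard convolution: one k_h x k_w kernel per (input, output) channel pair."""
    return k_h * k_w * c_in * c_out

def separable_conv_params(k_h, k_w, c_in, c_out):
    """Weights of a depth convolution (one kernel per input channel) plus a 1x1 point-wise convolution."""
    depthwise = k_h * k_w * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise

time_steps = valid_conv_length(160, 10)
print(time_steps)                             # 151
print(time_steps * 32)                        # 4832, the flattened size
print(standard_conv_params(10, 2, 32, 32))    # 20480
print(separable_conv_params(10, 2, 32, 32))   # 1664
```

Under these assumptions the separable layer uses roughly one twelfth of the weights of the standard layer, which is the parameter saving the design relies on.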
In step S105, adaptive iterative wiener filtering is performed on the speech frame and the non-speech frame, respectively, using wiener filters to obtain filtered frequency domain signals.
Specifically, a key role of noise estimation using a neural network (i.e., a separable convolutional neural network) is to determine whether a speech signal is present in the current frame. In the adaptive iterative wiener filtering step, it is determined whether to update the parameters of the wiener filter based on the result of the voice activity detection.
(1) Voice Activity Detection (VAD)
In the noise estimation step, a lightweight separable convolutional neural network is used to indicate whether the current frame contains speech. A threshold is set (in this embodiment, the threshold is set to 0.8), and if the VAD signal output by the network is higher than the threshold, the current frame is determined to be a non-speech frame.
(2) Wiener filter initialization
At the beginning of the process, the average power spectrum of the silence segment is used as the initial noise power spectrum estimate, and a small constant value (0.0001) is used as the initial speech power spectrum.
(3) Iterative updating
Non-speech frame processing: noise power spectrum update
If the current frame is determined to be a non-speech frame, the noise power spectrum estimate is updated:
$$P_n'(k) = \beta\, P_n(k) + (1 - \beta)\, P_x(k)$$
Wherein $P_n(k)$ is the noise power spectrum estimate at the $k$-th frequency bin of the current frame, $P_n'(k)$ is the updated noise power spectrum estimate at the $k$-th frequency bin of the current frame, $P_x(k) = |X(k)|^2$ is the power spectral density at the $k$-th frequency bin of the current frame, and $\beta$ is a smoothing coefficient; in this embodiment, $\beta = 0.998$.
Speech frame processing: noise power spectrum kept unchanged
If the current frame is determined to be a speech frame, the noise power spectrum estimate is not updated.
(4) Wiener filter computation
The Wiener filter gain $G(k)$ is calculated from the updated noise power spectrum and the speech power spectrum $P_s(k)$:
$$G(k) = \frac{P_s(k)}{P_s(k) + P_n'(k)}$$
The Wiener filter is then applied to the frequency domain signal of the current frame:
$$Y(k) = G(k)\, X(k)$$
Wherein $Y(k)$ is the filtered frequency domain signal, and $X(k)$ is the original frequency domain signal.
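The per-frame update described in (2) to (4) can be sketched as a small stateful class. Several points are assumptions of this sketch rather than statements of the embodiment: the class and method names are hypothetical, the speech/non-speech decision is passed in as a boolean, the gain takes the standard Wiener form, and the speech power spectrum is held at its initial constant because the text does not say how it evolves after initialization.

```python
import numpy as np

class AdaptiveWienerFilter:
    """Per-bin Wiener gain with a recursively smoothed noise power spectrum."""

    def __init__(self, initial_noise_psd, speech_psd=1e-4, beta=0.998):
        # Initial noise estimate, e.g. the average power spectrum of a silence segment.
        self.noise_psd = np.asarray(initial_noise_psd, dtype=np.float64).copy()
        self.speech_psd = speech_psd  # held constant in this sketch (assumption)
        self.beta = beta              # smoothing coefficient

    def process(self, spectrum, is_speech):
        """Filter one frame's complex spectrum X(k) and return Y(k) = G(k) * X(k)."""
        spectrum = np.asarray(spectrum)
        if not is_speech:
            # Non-speech frame: P_n'(k) = beta * P_n(k) + (1 - beta) * |X(k)|^2.
            frame_psd = np.abs(spectrum) ** 2
            self.noise_psd = self.beta * self.noise_psd + (1.0 - self.beta) * frame_psd
        # Speech frame: the noise estimate is left unchanged.
        gain = self.speech_psd / (self.speech_psd + self.noise_psd)
        return gain * spectrum

# Usage with synthetic data: 81 bins, as produced by a 160-point real FFT.
rng = np.random.default_rng(0)
silence = rng.standard_normal((20, 160)) * 0.01
initial_noise = np.mean(np.abs(np.fft.rfft(silence, axis=1)) ** 2, axis=0)
wiener = AdaptiveWienerFilter(initial_noise)
filtered = wiener.process(np.fft.rfft(rng.standard_normal(160)), is_speech=True)
```

Because the gain lies between 0 and 1 for every bin, the filter only attenuates: bins where the noise estimate dominates are suppressed, and bins where it is small pass almost unchanged.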
In step S106 and step S107, an inverse fast Fourier transform is performed on the filtered frequency domain signal to obtain a filtered time domain signal; all the frames of the voice signal are traversed to obtain the filtered time domain signal corresponding to each frame; and all the filtered time domain signals are combined to obtain the noise-reduced voice signal.
Specifically, the filtered frequency domain signal is subjected to an inverse fast Fourier transform (IFFT) to recover the time domain signal. The time domain signals of all frames are then combined to obtain the final noise-reduced voice signal.
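The text does not state how the frames are combined. Since consecutive frames overlap (10 ms frames with a 5 ms shift), one common choice is overlap-add, sketched below as an assumption. The function name and the default sizes (160-sample frames, 80-sample shift, corresponding to an assumed 16 kHz sampling rate) are hypothetical, and the sketch omits any amplitude normalization for the window overlap.

```python
import numpy as np

def overlap_add(filtered_spectra, frame_len=160, frame_shift=80):
    """Inverse-FFT each frame and sum the frames at their original offsets."""
    spectra = np.asarray(filtered_spectra)
    n_fft = 2 * (spectra.shape[1] - 1)   # assumes an even FFT length
    # Drop any zero-padding samples beyond the original frame length.
    frames = np.fft.irfft(spectra, n=n_fft, axis=1)[:, :frame_len]
    num_frames = frames.shape[0]
    out = np.zeros(frame_shift * (num_frames - 1) + frame_len)
    for i, frame in enumerate(frames):
        start = i * frame_shift
        out[start:start + frame_len] += frame
    return out

# Round trip on synthetic frames: three frames give 2 * 80 + 160 = 320 output samples.
rng = np.random.default_rng(0)
test_frames = rng.standard_normal((3, 160))
signal = overlap_add(np.fft.rfft(test_frames, axis=1))
print(signal.shape)  # (320,)
```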
In a specific embodiment, the edge-side real-time voice noise reduction method proposed by the application (the RTC method) is compared with existing noise reduction methods (specifically RNNoise, DeepFilterNet and S-DCCRN). The proposed method consumes only 0.8 W, processes one 10 ms audio frame in only 4 ms, and achieves noise suppression of more than 25 dB in strong-noise environments. Compared with the traditional methods and pure deep learning schemes, it has fewer parameters, a smaller amount of computation and a better noise reduction effect.
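The figures quoted above can be put in more familiar terms with a little arithmetic. The snippet below only restates numbers from the text (10 ms frames, 4 ms processing time, 25 dB suppression); the variable names are illustrative.

```python
FRAME_MS = 10.0        # audio duration of one frame, from the text
PROCESSING_MS = 4.0    # time needed to process one frame, from the text
SUPPRESSION_DB = 25.0  # lower bound on noise suppression, from the text

# Real-time factor: values below 1 mean processing keeps up with the audio.
real_time_factor = PROCESSING_MS / FRAME_MS

# Decibels to linear ratios: power uses /10, amplitude uses /20.
power_ratio = 10 ** (SUPPRESSION_DB / 10)
amplitude_ratio = 10 ** (SUPPRESSION_DB / 20)

print(f"real-time factor: {real_time_factor:.2f}")
print(f"noise power reduced by a factor of about {power_ratio:.0f}")
print(f"noise amplitude reduced by a factor of about {amplitude_ratio:.1f}")
```

A real-time factor of 0.4 means that 6 ms of every 10 ms frame period remain free on the measured device.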
Table 1 comparison of noise reduction methods
Fig. 2 shows synthesized noisy speech at -20 dB and the corresponding audio after noise reduction.
Fig. 3 shows a speech recording made in ambient noise and the corresponding audio after noise reduction.
According to the noise reduction method for edge-side real-time voice, on the one hand, the input voice signal is pre-emphasized, framed and windowed to boost the high-frequency components of the signal and reduce signal distortion. The preprocessed voice signal is subjected to a fast Fourier transform to convert the time domain signal into the original frequency domain signal. The Mel frequency cepstrum coefficients of the original frequency domain signal are calculated to extract the MFCC characteristic parameters of the voice. Noise estimation is performed on the MFCC characteristic parameters by the lightweight separable convolutional neural network, which reduces the parameter count and the amount of computation and achieves efficient noise estimation. Adaptive iterative Wiener filtering is performed using the generated noise estimate, gradually eliminating the noise in the voice signal to obtain the filtered frequency domain signal. The filtered frequency domain signal is subjected to an inverse fast Fourier transform to recover the time domain signal, which is output as the noise-reduced voice signal. On the other hand, the method can perform offline real-time noise reduction in edge-side scenarios with limited power budgets and scarce computing resources, and achieves a good noise reduction effect.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature. In the description of the embodiments of the present disclosure, the meaning of "a plurality" is two or more, unless explicitly defined otherwise.
In the embodiments of the present disclosure, unless explicitly specified and limited otherwise, the terms "mounted," "connected," "secured" and the like are to be construed broadly, and may be, for example, fixedly connected, detachably connected, or integrally formed, mechanically connected, electrically connected, directly connected, indirectly connected via an intervening medium, or in communication between two elements or in an interaction relationship between two elements. The specific meaning of the terms in this disclosure will be understood by those of ordinary skill in the art as the case may be.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present disclosure. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, one skilled in the art can combine the different embodiments or examples described in this specification.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

Claims (8)

1. A method for noise reduction of edge-side real-time voice, characterized by comprising the following steps:
Acquiring an input voice signal, and preprocessing the input voice signal to obtain a plurality of frames of voice signals;
For one frame of the voice signal, performing a fast Fourier transform on the voice signal to obtain an original frequency domain signal;
Calculating mel frequency cepstrum coefficients of the original frequency domain signal so as to extract MFCC characteristic parameters of the input voice signal;
Inputting the MFCC characteristic parameters into a separable convolutional neural network subjected to INT8 fixed-point quantization processing to obtain a noise estimation result, wherein the separable convolutional neural network comprises an input layer, a first depth separable convolution layer, a second depth separable convolution layer, a fully connected layer and an output layer which are sequentially connected in series, and the noise estimation result comprises a voice frame and a non-voice frame;
Performing adaptive iterative Wiener filtering on the voice frame and the non-voice frame by using a Wiener filter to obtain a filtered frequency domain signal;
Performing an inverse fast Fourier transform on the filtered frequency domain signal to obtain a filtered time domain signal;
Traversing all the voice signals to obtain the filtered time domain signals corresponding to all the voice signals, and combining all the filtered time domain signals to obtain a noise-reduced voice signal.
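The INT8 fixed-point quantization named in the steps above is not detailed in this section. As a hedged illustration only, the following Python shows symmetric per-tensor quantization, one common scheme; the function names, the choice of a single scale per tensor and the sample weights are all assumptions, and the input list is assumed to be non-empty.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] with one shared scale (assumed scheme)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Recover approximate float values from the INT8 codes."""
    return [q * scale for q in quantized]


weights = [1.27, -0.64, 0.0, 0.3]      # hypothetical layer weights
codes, scale = quantize_int8(weights)  # scale is about 0.01 for these weights
restored = dequantize_int8(codes, scale)
worst_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight is stored in one byte instead of four, at the cost of a rounding error of at most half the scale per value.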
2. The method for noise reduction of edge-side real-time speech according to claim 1, wherein the step of obtaining an input speech signal and preprocessing the input speech signal comprises:
Pre-emphasizing the input speech signal with a pre-emphasis filter;
Dividing the pre-emphasized input voice signal into a plurality of frames of voice signals;
Windowing each frame of the speech signal using a Hamming window to reduce spectral leakage.
3. The method for noise reduction of edge-side real-time speech according to claim 2, wherein the pre-emphasis filter has the expression:
$$y(n) = x(n) - \alpha \, x(n-1)$$
Wherein, $x(n)$ is the value of the original input signal at the $n$-th sample point, $x(n-1)$ is the value of the original input signal at the $(n-1)$-th sample point, $y(n)$ is the value of the pre-emphasized input signal at the $n$-th sample point, and $\alpha$ is the pre-emphasis coefficient;
the Hamming window expression is:
$$w(n) = 0.54 - 0.46 \cos\left(\frac{2 \pi n}{N - 1}\right)$$
Wherein, $w(n)$ is the value of the Hamming window function at the $n$-th sample point, $N$ is the number of samples per frame, and $n$ is the index of the current sample point, with range $0 \le n \le N - 1$;
the windowed expression for each frame of the voice signal $s(n)$ is:
$$s_w(n) = s(n) \, w(n)$$
Wherein, $s(n)$ is the value of the voice signal at the $n$-th sample point, and $s_w(n)$ is the value of the windowed voice signal at the $n$-th sample point.
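The preprocessing chain of claims 2 and 3 can be sketched as below. This is an illustration under stated assumptions rather than the claimed implementation: the pre-emphasis coefficient 0.97 is a commonly used value that is not given here, the frame length and hop are hypothetical, and the window uses the standard Hamming coefficients 0.54 and 0.46.

```python
import math


def pre_emphasis(signal, alpha=0.97):
    """y(n) = x(n) - alpha * x(n-1); alpha=0.97 is an assumed, commonly used value."""
    if not signal:
        return []
    return [signal[0]] + [signal[n] - alpha * signal[n - 1] for n in range(1, len(signal))]


def split_frames(signal, frame_len, hop):
    """Cut the signal into frames of frame_len samples, advancing hop samples each time."""
    return [signal[s:s + frame_len] for s in range(0, len(signal) - frame_len + 1, hop)]


def hamming_window(n_samples):
    """Standard Hamming window: 0.54 - 0.46 * cos(2*pi*n / (N - 1))."""
    if n_samples == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_samples - 1))
            for n in range(n_samples)]


def window_frame(frame, window):
    """Multiply a frame by the window, sample by sample."""
    return [x * w for x, w in zip(frame, window)]


samples = [float(i % 4) for i in range(16)]            # toy input signal
emphasized = pre_emphasis(samples)
frames = split_frames(emphasized, frame_len=8, hop=4)  # hypothetical sizes
windowed = [window_frame(f, hamming_window(8)) for f in frames]
```

The window tapers each frame toward 0.08 at its edges, which is what reduces the spectral leakage mentioned in claim 2.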
4. The method of claim 3, wherein the step of calculating mel-frequency cepstrum coefficients of the original frequency-domain signal to extract MFCC characteristic parameters of the input speech signal comprises:
Using a mel filter bank to convert the frequency $f$ of the original frequency domain signal to the mel frequency $f_{\mathrm{mel}}$, wherein the mel filter bank comprises a plurality of mel filters;
According to the mel frequency $f_{\mathrm{mel}}$ and the original frequency domain signal, calculating the energy of each mel filter;
and carrying out discrete cosine transform on the logarithm of the energy of all the Mel filters to obtain MFCC characteristic parameters.
5. The method for noise reduction of edge-side real-time speech according to claim 4, wherein the mel frequency $f_{\mathrm{mel}}$ is expressed in terms of the frequency $f$ as:
$$f_{\mathrm{mel}} = 2595 \log_{10}\left(1 + \frac{f}{700}\right)$$
the energy of the mel filter is calculated as follows:
$$E_m = \sum_{k = f_{\mathrm{start}}}^{f_{\mathrm{end}}} |X(k)|^2 \, H_m(k)$$
Wherein, $k$ is the frequency bin, $X(k)$ is the original frequency domain signal, $|X(k)|^2$ is the power spectral density of the original frequency domain signal at the $k$-th frequency bin, $H_m(k)$ is the frequency response of the $m$-th mel filter at the $k$-th frequency bin, $f_{\mathrm{start}}$ is the start frequency of the mel filter, and $f_{\mathrm{end}}$ is the end frequency of the mel filter;
The discrete cosine transform expression is:
$$c_n = \sum_{m=1}^{M} \log(E_m) \cos\left(\frac{\pi n (m - 0.5)}{M}\right)$$
Wherein, $c_n$ is the $n$-th MFCC characteristic parameter, $M$ is the number of mel filters, and $E_m$ is the energy of the $m$-th mel filter.
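The three stages of claims 4 and 5 (mel conversion, filter energies, discrete cosine transform of the log energies) can be sketched as follows. The mel formula with constants 2595 and 700 is the widely used one and is assumed here; the filter bank weights are supplied by the caller rather than constructed, the function names are hypothetical, and the code uses 0-based indices, so `m + 0.5` corresponds to a 1-based filter index minus one half.

```python
import math


def hz_to_mel(f_hz):
    """Widely used mel-scale formula (assumed): 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(f_mel):
    """Inverse of hz_to_mel, useful for placing filter edges on the frequency axis."""
    return 700.0 * (10 ** (f_mel / 2595.0) - 1.0)


def filter_energies(power_spectrum, filter_bank):
    """Energy of each filter: sum over bins of power times the filter response."""
    return [sum(p * h for p, h in zip(power_spectrum, filt)) for filt in filter_bank]


def mfcc_from_energies(energies, n_coeffs, floor=1e-10):
    """DCT-II of the log filter energies; `floor` avoids log(0)."""
    m_total = len(energies)
    logs = [math.log(max(e, floor)) for e in energies]
    return [sum(logs[m] * math.cos(math.pi * n * (m + 0.5) / m_total)
                for m in range(m_total))
            for n in range(n_coeffs)]


bank = [[1.0, 0.5, 0.0, 0.0], [0.0, 0.5, 1.0, 0.5]]  # two toy filters over four bins
energies = filter_energies([4.0, 2.0, 1.0, 2.0], bank)
coeffs = mfcc_from_energies(energies, n_coeffs=2)
```

Taking the logarithm before the cosine transform compresses the dynamic range of the filter energies, and the transform decorrelates them into a compact feature vector for the network.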
6. The method for noise reduction of edge-side real-time speech according to claim 5, wherein the step of inputting the MFCC characteristic parameters into a separable convolutional neural network subjected to INT8 fixed-point quantization processing to obtain a noise estimation result comprises:
Sequentially inputting the MFCC characteristic parameters into the input layer, the first depth separable convolution layer and the second depth separable convolution layer, flattening the output of the second depth separable convolution layer and inputting the flattened output into the fully connected layer to generate the noise estimation result;
The output layer outputs the noise estimation result, wherein if the VAD signal of the noise estimation result is higher than a preset threshold value, the voice signal of the current frame is the non-voice frame, and if the VAD signal of the noise estimation result is smaller than or equal to the threshold value, the voice signal of the current frame is the voice frame.
7. The method of edge-side real-time speech noise reduction according to claim 6, wherein the first depth separable convolution layer comprises a first depthwise convolution and a first pointwise convolution, and the second depth separable convolution layer comprises a second depthwise convolution and a second pointwise convolution.
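The reason a depth separable convolution suits a small edge model can be seen by counting weights. The sketch below compares a standard convolution with a depthwise convolution followed by a pointwise (1x1) convolution; the kernel size and channel counts are hypothetical, since the layer dimensions are not given here, and biases are ignored.

```python
def standard_conv_params(k_h, k_w, c_in, c_out):
    """Every output channel has its own k_h x k_w kernel for every input channel."""
    return k_h * k_w * c_in * c_out


def depthwise_separable_params(k_h, k_w, c_in, c_out):
    """One k_h x k_w kernel per input channel, then a 1x1 convolution to mix channels."""
    depthwise = k_h * k_w * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


# Hypothetical layer: 3x3 kernels, 16 input channels, 32 output channels.
standard = standard_conv_params(3, 3, 16, 32)
separable = depthwise_separable_params(3, 3, 16, 32)
reduction = standard / separable
```

For this hypothetical layer the separable form needs 656 weights instead of 4608, roughly a sevenfold reduction.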
8. The method for noise reduction of edge-side real-time speech according to claim 6, wherein the step of performing adaptive iterative wiener filtering on the speech frame and the non-speech frame, respectively, by using wiener filters to obtain filtered frequency domain signals comprises:
Adopting the average power spectrum of a silence segment as the initial noise power spectrum estimate, and adopting a preset constant as the initial voice power spectrum;
If the speech signal of the current frame is the non-speech frame, updating the noise power spectrum estimate:
$$\hat{P}'_n(k) = \alpha \, \hat{P}_n(k) + (1 - \alpha) \, P_x(k)$$
Wherein, $\hat{P}_n(k)$ is the noise power spectrum estimate at the $k$-th frequency bin of the current frame, $\hat{P}'_n(k)$ is the updated noise power spectrum estimate at the $k$-th frequency bin of the current frame, $P_x(k)$ is the power spectral density at the $k$-th frequency bin of the current frame, and $\alpha$ is a smoothing coefficient;
if the voice signal of the current frame is the voice frame, maintaining the noise power spectrum estimate unchanged;
calculating the gain $G(k)$ of the Wiener filter according to the updated noise power spectrum estimate $\hat{P}'_n(k)$ and the voice power spectrum $P_s(k)$:
$$G(k) = \frac{P_s(k)}{P_s(k) + \hat{P}'_n(k)}$$
Performing adaptive iterative Wiener filtering on the original frequency domain signal of the current frame by using the Wiener filter to obtain a filtered frequency domain signal, wherein the adaptive iterative Wiener filtering has the following expression:
$$Y(k) = G(k) \, X(k)$$
Wherein, $Y(k)$ is the filtered frequency domain signal, and $X(k)$ is the original frequency domain signal.
CN202510018663.4A 2025-01-07 2025-01-07 A noise reduction method for real-time speech at the edge Pending CN119418712A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202510018663.4A CN119418712A (en) 2025-01-07 2025-01-07 A noise reduction method for real-time speech at the edge

Publications (1)

Publication Number Publication Date
CN119418712A true CN119418712A (en) 2025-02-11

Family

ID=94466039

Country Status (1)

Country Link
CN (1) CN119418712A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120220713A (en) * 2025-04-08 2025-06-27 中国铁路北京局集团有限公司丰台车辆段 A method and device for adaptive noise reduction for head-mounted computer
CN120690222A (en) * 2025-05-27 2025-09-23 江西亚瑞科技有限责任公司 Real-time noise reduction method and system for call speech based on dynamic noise perception

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2008028511A1 (en) * 2006-09-08 2008-03-13 Siemens Aktiengesellschaft System and method for speaker recognition
EP2372707A1 (en) * 2010-03-15 2011-10-05 Svox AG Adaptive spectral transformation for acoustic speech signals
CN112735456A (en) * 2020-11-23 2021-04-30 西安邮电大学 Speech enhancement method based on DNN-CLSTM network
US20210256988A1 (en) * 2020-02-14 2021-08-19 System One Noc & Development Solutions, S.A. Method for Enhancing Telephone Speech Signals Based on Convolutional Neural Networks
CN115910034A (en) * 2022-09-30 2023-04-04 兴业银行股份有限公司 Speech language recognition method and system based on deep learning
CN116013348A (en) * 2022-12-27 2023-04-25 思必驰科技股份有限公司 Audio noise reduction method, device and storage medium
CN118283218A (en) * 2024-04-12 2024-07-02 中国工商银行股份有限公司 Audio and video reconstruction method and device for real-time conversation and electronic equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
胡忠元: "面向辅助驾驶的语音指令识别算法设计与FPGA验证", 《中国优秀硕士学位论文全文数据库工程科技Ⅱ辑》, no. 06, 15 June 2022 (2022-06-15), pages 035 - 251 *
许昊: "噪声环境下的说话人识别技术", 《中国优秀硕士学位论文全文数据库信息科技辑》, vol. 01, no. 01, 15 January 2016 (2016-01-15), pages 136 - 101 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination