Disclosure of Invention
The main object of the invention is to provide a method that smooths the gains of the frequency points of converted audio by means of an algorithm, so as to avoid distortion caused by abrupt changes in the data.
In order to achieve the above object, the present invention provides an adaptive noise reduction method based on frequency point gain smoothing. A speech signal is converted into a spectrum signal and a gain is then applied to it; the method smooths the gains, wherein the smoothing includes forward smoothing and/or backward smoothing;
Wherein, forward smoothing:
Take $\tilde g(K) = 0$;
For $k = K-1, K-2, \ldots, 0$: $\tilde g(k) = \max\{\alpha\,\tilde g(k+1),\ g(k)\}$;
Wherein $\alpha$ is the gain attenuation coefficient;
the original gain is replaced by the new gain, $g(k) \leftarrow \tilde g(k)$;
Wherein $\tilde g(K)$ is the virtual gain of the frequency point just outside the highest frequency, used as the initial point of the calculation;
For each actual frequency point $k$, $g(k)$ is its original gain, obtained from a noise reduction algorithm;
$\tilde g(k+1)$ is the adjusted gain of the previous, higher frequency point;
In addition, backward smoothing:
Take $\tilde g(-1) = 0$;
For $k = 0, 1, \ldots, K-1$: $\tilde g(k) = \max\{\alpha\,\tilde g(k-1),\ g(k)\}$;
The original gain is replaced by the new gain, $g(k) \leftarrow \tilde g(k)$;
$\tilde g(-1)$ is the virtual gain of the frequency point just outside the lowest frequency, used as the initial point of the calculation;
For each actual frequency point $k$, $g(k)$ is its original gain, obtained from a noise reduction algorithm or from a previous adjustment;
$\tilde g(k-1)$ is the adjusted gain of the previous, lower frequency point.
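Preferably, the method further comprises time smoothing:
Take $\tilde g_{-1}(k) = 0$;
For $l = 0, 1, 2, \ldots$: $\tilde g_l(k) = \max\{\alpha\,\tilde g_{l-1}(k),\ g_l(k)\}$;
The original gain is replaced by the new gain, $g_l(k) \leftarrow \tilde g_l(k)$;
Here $g_l(k)$ is the gain of the $l$-th frame at the current frequency point $k$.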
Preferably, the method is carried out in the following stages:
preprocessing: the real-time speech signal is framed and converted into a spectrum by FFT (fast Fourier transform), i.e. steps 1 and 2;
ANS noise reduction: gains are obtained with a noise reduction algorithm, i.e. step 3;
post filtering: the gains are adjusted, i.e. steps 4, 5 and 6;
return to the time domain: the spectrum is converted to the time domain by an inverse FFT and synthesized into an audio signal, i.e. steps 7 and 8.
Preferably, the steps are as follows:
Step 1, framing of the real-time speech signal:
the speech signal is a signal stream $x(n)$ obtained from device sampling or from a preceding algorithm;
Each sample has a certain bit depth, and the samples are normalized according to that bit depth so that $-1 \le x(n) \le 1$;
In real-time processing, every $N$ samples form one frame, i.e. the $l$-th frame signal is $x_l(n) = x(lM + n)$, where $n = 0, 1, \ldots, N-1$ and $M$ is the frame shift;
Step 2, the speech signal $x_l$ is transformed by FFT into the spectrum $X_l$, where $X_l = \mathcal{F}(w \cdot x_l)$,
$w$ is the analysis window;
$\mathcal{F}$ is the discrete Fourier transform;
$X_l(k)$ is the complex spectrum, where $k = 0, 1, \ldots, K-1$ is the frequency point index;
Step 3, a signal-processing noise reduction algorithm is used to obtain the gain $g_l(k)$ at each frequency point. In general $0 \le g_l(k) \le 1$, i.e. the gain is real, which means that the phase of the spectrum is not changed; $\hat S_l(k)$ is the noise-reduced spectrum, i.e. $\hat S_l(k) = g_l(k)\,X_l(k)$;
Step 4, the forward smoothing described above;
Step 5, the backward smoothing described above;
Step 6, the time smoothing described above.
Preferably, the method further comprises the following steps:
Step 7, updating the spectrum signal: $\hat S_l(k) = g_l(k)\,X_l(k)$;
Step 8, inverse Fourier transform back to the time domain signal: $\hat s_l = \mathcal{F}^{-1}(\hat S_l)$;
the synthesis window used for synthesizing the signal is determined by the windowing scheme.
Preferably, when $N = 2M$ and $w$ is a Hanning window, the synthesis window is a unit (rectangular) window;
the speech signal $\hat s(n)$ is synthesized frame by frame in overlap-add mode.
Preferably, for audio at a 16 kHz sampling rate, 32 ms is taken as one frame with a 16 ms frame shift, i.e. $N = 512$, $M = 256$.
Preferably, the attenuation coefficient $\alpha$ used in the forward smoothing and in the backward smoothing takes the same value or different values;
The attenuation coefficient $\alpha$ is adjusted according to the signal-to-noise ratio, i.e. $\alpha$ is reduced when the signal-to-noise ratio is high and increased when the signal-to-noise ratio is low;
The exponential decay term $\alpha\,\tilde g$ may be replaced by a linear decay term.
The invention also provides an adaptive noise reduction post-filter based on frequency point gain smoothing, which performs its processing with the above noise reduction method.
Preferably, forward smoothing, backward smoothing and time smoothing are each performed by an independently arranged post-filter.
The beneficial effects of the method are as follows. Through forward and backward smoothing, the frequency point gains of the converted speech are smoothed by the algorithm and the original gains are replaced by the new gains, so that smooth and natural frequency point data are obtained. Accordingly, the method improves the naturalness of the noise-reduced speech, particularly in scenes with a low signal-to-noise ratio, where it markedly reduces the musical noise caused by excessive noise reduction and abrupt spectral changes. The method is general and can be applied to different noise reduction algorithms (models) with different parameters or of different kinds; it is simple to implement, has low computational complexity, and can be combined with other post-filters to achieve better noise reduction and speech preservation.
Detailed Description
The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in other ways other than those described herein, and persons skilled in the art will readily appreciate that the present invention is not limited to the specific embodiments disclosed below.
The principle of the invention is explained as follows:
A recorded sound is a time sequence $x(n)$ that can be viewed as the sum of two sounds, speech $s(n)$ and noise $d(n)$, i.e. $x(n) = s(n) + d(n)$. The purpose of a noise reduction or speech separation algorithm is to compute $s(n)$ from $x(n)$.
Because sound is processed piece by piece in real time, each processing call sees only a short segment of data, e.g. tens of milliseconds; this is step 1, framing. Speech is usually processed in the frequency domain, which is step 2. The noisy speech in the frequency domain is $X_l(k) = S_l(k) + D_l(k)$, where $S_l(k)$ and $D_l(k)$ are the spectra of the speech and of the noise, respectively. Computing the speech estimate is the task of the noise reduction algorithm, step 3. It is assumed here that speech and noise are mutually independent random variables and that the signal processing algorithm does not change the phase, so an estimate $\hat S_l(k)$ of the speech has the form $\hat S_l(k) = g_l(k)\,X_l(k)$ with a real gain $g_l(k)$. For example, let $P_X$ and $P_D$ be the power spectra of the noisy speech and of the noise, respectively; then $g = (P_X - P_D)/P_X$ is a gain estimate. As a numerical example, if the energy of the noisy (original) speech is 100 and the noise energy is 10, the simplest estimate of the speech energy is 90, i.e. a gain of $g = 0.9$. Step 7 applies the gain to $X_l(k)$ to obtain the estimated speech, and step 8 returns from the frequency domain to the time domain, which yields the speech stream as it is normally heard.
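The numerical example above can be reproduced with a few lines of Python. This is only a minimal sketch of the simplest power-subtraction estimate mentioned in the text; the function name `power_subtraction_gain` and the clipping to the range [0, 1] are assumptions made for illustration, not part of the described method.

```python
def power_subtraction_gain(noisy_power: float, noise_power: float) -> float:
    """Simplest gain estimate: estimated speech power over noisy power.

    Assumption for this sketch: the result is clipped to [0, 1] so that
    an overestimated noise power cannot produce a negative gain.
    """
    if noisy_power <= 0.0:
        return 0.0
    speech_power = max(noisy_power - noise_power, 0.0)
    return speech_power / noisy_power


# Noisy energy 100 and noise energy 10 give speech energy 90, gain 0.9.
print(power_subtraction_gain(100.0, 10.0))
```

A practical noise reduction algorithm would replace this estimate with a statistically motivated one; the post-filter only needs the resulting gains.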
A practical noise reduction algorithm may also need to take into account how fast the noise varies, how accurate the noise estimate is, and so on (subject to some statistical model), and most algorithms tend to suppress noise aggressively, so that the noise is almost gone but the speech is also badly damaged. This is not a problem in a paper, which is judged by metrics rather than by listening, but it is a problem in practical use, where one would rather tolerate some residual noise than damage the speech heavily. Steps 4, 5 and 6 below therefore preserve speech by preventing abrupt changes in gain. The basic rule is that a gain is never reduced, only increased. For example, if the previous gain was 0.9 and the next gain suddenly becomes 0.1, the drop is considered too large; with a coefficient such as $\alpha = 0.8$, the gain that follows 0.9 is at least 0.9 × 0.8 = 0.72. Steps 4, 5 and 6 apply this "next" rule in three directions: toward lower frequencies, toward higher frequencies, and toward the next frame. The signal thus becomes more stable and sounds better.
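The single "next" step described above amounts to taking a maximum. The sketch below illustrates it with the numbers from the text; the function name `limit_drop` is hypothetical.

```python
def limit_drop(previous_adjusted: float, current: float, alpha: float) -> float:
    """Raise `current` so it is at least alpha times the previous adjusted gain.

    The gain is never reduced: if `current` is already large enough it is
    returned unchanged.
    """
    return max(alpha * previous_adjusted, current)


# Previous gain 0.9, raw next gain 0.1, alpha 0.8: the next gain is lifted
# to about 0.72 (up to floating-point rounding).
print(limit_drop(0.9, 0.1, 0.8))
# A raw gain above the floor is left untouched.
print(limit_drop(0.9, 0.95, 0.8))
```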
For each actual frequency point $k$, $g(k)$ is its original gain, obtained from a noise reduction algorithm, i.e. the gain at frequency point $k$ computed by an existing noise reduction algorithm. The noise reduction algorithm iterates frame by frame at each frequency point and does not consider the gains of adjacent frequency points, so it yields very aggressive gains, which produce unsmooth or unnatural speech; $g(k)$ therefore needs to be smoothed by the present algorithm.
Fig. 1 demonstrates the effect of the method on data: the first channel is the original noise reduction output, and the second channel is enhanced by the method of the present invention. After processing by the method, many details are recovered in the speech spectrum, and the gaps that break fluency (empty regions) are partly filled in, e.g. at 28-29 seconds (second region) and 31-32 seconds (third region).
Example 1
This embodiment provides an adaptive noise reduction method based on frequency point gain smoothing. A speech signal is converted into a spectrum signal and a gain is then applied to it; the method smooths the gains, wherein the smoothing includes forward smoothing and/or backward smoothing;
Wherein, forward smoothing of step 4:
Take $\tilde g(K) = 0$;
For $k = K-1, K-2, \ldots, 0$: $\tilde g(k) = \max\{\alpha\,\tilde g(k+1),\ g(k)\}$;
Wherein $\alpha$ is the gain attenuation coefficient; in the present scheme $\alpha$ is a fixed gain attenuation coefficient. If $\alpha = 0$, the post-filtering does not affect the original signal; the smaller $\alpha$ is, the more the original gain tends to be preserved, and the larger $\alpha$ is, the smoother the gain;
the original gain is replaced by the new gain, $g(k) \leftarrow \tilde g(k)$;
Wherein $\tilde g(K)$ is the virtual gain of the frequency point just outside the highest frequency; it is not applied in practice and serves only as the initial point of the calculation. Specifically, $k$ runs over the array of frequency points; when $k$ is the highest frequency point $K-1$, the recursion refers to $k+1 = K$, so the value $\tilde g(K)$ is taken, i.e. that of the frequency point just outside the highest frequency.
For example, in one specific processing case with a 16 kHz sampling rate, a 16 ms frame shift and a 32 ms frame length, the frequency points range over {0, 1, ..., 255}. The high-to-low calculation then starts at 256, which lies outside this range. Setting g(256) = 0 gives max{0, g(255)} = g(255); by induction, the value at 254 is obtained from the value at 255, and so on down to 1 and finally 0, which completes the calculation.
Likewise, the low-to-high calculation starts from -1: from -1 to 0, from 0 to 1, and so on up to 254 to 255; smoothing over time proceeds in the same way.
$\tilde g(k+1)$ is the adjusted gain of the previous, higher frequency point. For the sake of audio quality, a large gap between the gains of adjacent frequency points is undesirable, so the coefficient $\alpha$ is used to adjust the smoothing;
Accordingly $\tilde g(k) = \max\{\alpha\,\tilde g(k+1),\ g(k)\}$ is set as the adjusted gain of frequency point $k$;
This gain is used in the subsequent adjustments, hence $g(k) \leftarrow \tilde g(k)$.
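The forward smoothing of step 4 can be sketched as a single pass from the highest frequency point down to the lowest. This is a minimal illustration under the reading given above (virtual gain 0 just outside the highest frequency, then a running maximum with decay); the function name `forward_smooth` and the plain-list representation of the gains are assumptions.

```python
from typing import List


def forward_smooth(gains: List[float], alpha: float) -> List[float]:
    """Sweep from the highest frequency point to the lowest.

    `gains[k]` is the original gain of frequency point k for one frame.
    The virtual point just outside the highest frequency has gain 0.
    """
    smoothed = list(gains)
    previous = 0.0  # virtual gain outside the highest frequency
    for k in range(len(gains) - 1, -1, -1):
        previous = max(alpha * previous, gains[k])
        smoothed[k] = previous
    return smoothed


# An isolated peak at k = 2 is spread toward lower frequencies only.
print(forward_smooth([0.1, 0.1, 0.9, 0.1], alpha=0.5))
```

With `alpha=0.0` the function returns the gains unchanged, which matches the statement that the post-filter then has no effect.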
In addition, backward smoothing of step 5:
Take $\tilde g(-1) = 0$;
For $k = 0, 1, \ldots, K-1$: $\tilde g(k) = \max\{\alpha\,\tilde g(k-1),\ g(k)\}$;
The original gain is replaced by the new gain, $g(k) \leftarrow \tilde g(k)$;
$\tilde g(-1)$ is the virtual gain of the frequency point just outside the lowest frequency; it is not applied in practice and serves only as the initial point of the calculation. When $k$ is the lowest frequency point 0, the recursion refers to $k-1 = -1$, so the value $\tilde g(-1)$ is taken, i.e. that of the frequency point just outside the lowest frequency. See the description of the processing case above.
For each actual frequency point $k$, $g(k)$ is its original gain, obtained from a noise reduction algorithm or from a previous adjustment;
Forward smoothing in the frequency domain, backward smoothing in the frequency domain and smoothing in the time domain are independent of one another. The input of each adjustment is a set of gains, so the adjustments are nested functions, and the nesting order can be chosen freely.
If the gain of the original noise reduction algorithm is g, the forward filtering algorithm in the frequency domain is Apre, the backward filtering algorithm is Apost, and the time domain filtering is Atime, then the algorithm structure can be very flexible: Atime(Apre(Apost(g))) is the order described herein, Apre(Apost(Atime(g))) is another order, and so on. This is how the "previous adjustment" mentioned above arises.
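The nesting of the three adjustments can be sketched as function composition over a table of gains indexed by frame and frequency point. Everything below is a hypothetical illustration: the names `a_pre`, `a_post`, `a_time` and `compose` mirror Apre, Apost and Atime from the text but are not taken from any existing implementation, and each stage uses the same running-maximum-with-decay rule.

```python
from typing import Callable, List, Sequence

Frames = List[List[float]]  # gains[frame][frequency point]
Stage = Callable[[Frames], Frames]


def sweep(values: Sequence[float], alpha: float, reverse: bool) -> List[float]:
    """Running maximum with decay, starting from a virtual gain of 0."""
    out = list(values)
    order = range(len(out) - 1, -1, -1) if reverse else range(len(out))
    previous = 0.0
    for i in order:
        previous = max(alpha * previous, values[i])
        out[i] = previous
    return out


def a_pre(alpha: float) -> Stage:  # forward: high to low frequency
    return lambda frames: [sweep(f, alpha, True) for f in frames]


def a_post(alpha: float) -> Stage:  # backward: low to high frequency
    return lambda frames: [sweep(f, alpha, False) for f in frames]


def a_time(alpha: float) -> Stage:  # along the frame sequence, per point
    def stage(frames: Frames) -> Frames:
        columns = [sweep(col, alpha, False) for col in zip(*frames)]
        return [list(row) for row in zip(*columns)]
    return stage


def compose(*stages: Stage) -> Stage:
    """compose(f, g, h)(x) == f(g(h(x))), i.e. the innermost runs first."""
    def pipeline(frames: Frames) -> Frames:
        for stage in reversed(stages):
            frames = stage(frames)
        return frames
    return pipeline


g = [[0.0, 0.8, 0.0], [0.0, 0.0, 0.0]]
order_a = compose(a_time(0.5), a_pre(0.5), a_post(0.5))  # Atime(Apre(Apost(g)))
order_b = compose(a_pre(0.5), a_post(0.5), a_time(0.5))  # Apre(Apost(Atime(g)))
print(order_a(g))
print(order_b(g))
```

For this example input both nesting orders give the same table of gains.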
The noise reduction algorithm iterates frame by frame at each frequency point and does not consider the gains of adjacent frequency points, so very aggressive gains are obtained;
$\tilde g(k-1)$ is the adjusted gain of the previous, lower frequency point. For the sake of audio quality, a large gap between the gains of adjacent frequency points is undesirable, so the coefficient $\alpha$ is used to adjust the smoothing;
Accordingly $\tilde g(k) = \max\{\alpha\,\tilde g(k-1),\ g(k)\}$ is set as the adjusted gain of frequency point $k$;
This gain can be used in subsequent adjustments, hence $g(k) \leftarrow \tilde g(k)$.
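The backward smoothing of step 5 mirrors the forward pass and runs from the lowest frequency point upward. The sketch below assumes the same list representation of one frame's gains; the function name `backward_smooth` is hypothetical.

```python
from typing import List


def backward_smooth(gains: List[float], alpha: float) -> List[float]:
    """Sweep from the lowest frequency point to the highest.

    The virtual point just outside the lowest frequency (index -1)
    has gain 0 and only seeds the recursion.
    """
    smoothed: List[float] = []
    previous = 0.0  # virtual gain outside the lowest frequency
    for gain in gains:
        previous = max(alpha * previous, gain)
        smoothed.append(previous)
    return smoothed


# An isolated peak at k = 1 is spread toward higher frequencies only.
print(backward_smooth([0.1, 0.9, 0.1, 0.1], alpha=0.5))
```

The input may be the raw gains of the noise reduction algorithm or gains that an earlier adjustment has already smoothed.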
Preferably, the method further comprises the time smoothing of step 6:
Take $\tilde g_{-1}(k) = 0$;
For $l = 0, 1, 2, \ldots$: $\tilde g_l(k) = \max\{\alpha\,\tilde g_{l-1}(k),\ g_l(k)\}$;
The original gain is replaced by the new gain, $g_l(k) \leftarrow \tilde g_l(k)$;
The gain calculation of the noise reduction algorithm is derived from an estimate of the signal-to-noise ratio and involves little adjustment of the gain against the preceding and following frames. Since the signal is largely continuous from one frame to the next, abrupt changes in gain between frames also tend to degrade signal quality or listening quality, so the gain needs to be kept reasonably smooth across adjacent frames.
Similar to the previous description, here $g_l(k)$ is the gain of the $l$-th frame at the current frequency point $k$. Specifically, $l$ indexes the sequence of frames. That is, starting from frame -1 (a virtual frame), frame 0 is calculated, then frame 1, then frame 2, and so on.
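Because frames arrive one at a time in real-time processing, time smoothing needs to remember the adjusted gains of the previous frame. A minimal sketch, assuming a small stateful helper (the class name `TimeSmoother` and its interface are hypothetical):

```python
from typing import List


class TimeSmoother:
    """Keeps the adjusted gains of the previous frame for each frequency point."""

    def __init__(self, num_bins: int, alpha: float) -> None:
        self.alpha = alpha
        # Virtual frame -1: all gains are 0.
        self.previous = [0.0] * num_bins

    def process(self, gains: List[float]) -> List[float]:
        """Smooth one frame of gains against the previous adjusted frame."""
        self.previous = [
            max(self.alpha * prev, gain)
            for prev, gain in zip(self.previous, gains)
        ]
        return list(self.previous)


smoother = TimeSmoother(num_bins=2, alpha=0.5)
print(smoother.process([0.8, 0.1]))  # frame 0
print(smoother.process([0.0, 0.3]))  # frame 1
print(smoother.process([0.0, 0.0]))  # frame 2
```

Each call handles one frame, so the helper fits a streaming loop in which frames 0, 1, 2 and so on are processed in order.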
Example 2
On the basis of Example 1, the method is preferably carried out in the following stages:
preprocessing: the real-time speech signal is framed and converted into a spectrum by FFT (fast Fourier transform), i.e. steps 1 and 2;
ANS noise reduction (Automatic Noise Suppression, i.e. background noise suppression): gains are obtained with a noise reduction algorithm, i.e. step 3;
post filtering: the gains are adjusted, i.e. steps 4, 5 and 6;
return to the time domain: the spectrum is converted to the time domain by an inverse FFT and synthesized into an audio signal, i.e. steps 7 and 8.
Step 1, framing of the real-time speech signal:
the speech signal is a signal stream $x(n)$ obtained from device sampling or from a preceding algorithm;
Each sample has a certain bit depth, for example 16-bit sampling, and the samples are normalized according to that bit depth so that $-1 \le x(n) \le 1$;
In real-time processing, every $N$ samples form one frame, i.e. the $l$-th frame signal is $x_l(n) = x(lM + n)$, where $n = 0, 1, \ldots, N-1$. Preferably, for audio at a 16 kHz sampling rate, 32 ms is taken as one frame with a 16 ms frame shift, i.e. $N = 512$, $M = 256$; here $N$ is the number of samples in a frame and $M$ is the number of samples in the frame shift.
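Step 1 can be sketched as follows. The helper names `normalize_int16` and `split_frames` are hypothetical, and dividing by 32768 is one common convention for 16-bit samples that is assumed here rather than taken from the text.

```python
from typing import List, Sequence


def normalize_int16(samples: Sequence[int]) -> List[float]:
    """Map 16-bit integer samples to the range [-1, 1)."""
    return [s / 32768.0 for s in samples]


def split_frames(x: Sequence[float], frame_len: int, hop: int) -> List[List[float]]:
    """Return frames x_l(n) = x(l * hop + n), n = 0 .. frame_len - 1.

    Only complete frames are returned; a trailing partial frame is dropped.
    """
    frames = []
    start = 0
    while start + frame_len <= len(x):
        frames.append(list(x[start:start + frame_len]))
        start += hop
    return frames


# Frame sizes for the preferred configuration: 16 kHz, 32 ms frame, 16 ms shift.
sample_rate = 16000
N = sample_rate * 32 // 1000  # 512 samples per frame
M = sample_rate * 16 // 1000  # 256 samples per frame shift
print(N, M)
print(split_frames([float(i) for i in range(10)], frame_len=4, hop=2))
```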
Step 2, the speech signal $x_l$ is transformed by FFT into the spectrum $X_l$, where $X_l = \mathcal{F}(w \cdot x_l)$,
$w$ is the analysis window;
$\mathcal{F}$ is the discrete Fourier transform;
$X_l(k)$ is the complex spectrum, where $k = 0, 1, \ldots, K-1$ is the frequency point index;
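Step 2 can be illustrated with a direct (slow) discrete Fourier transform written with the standard library only. A real implementation would use an FFT routine; the function names, the periodic form of the Hanning window, and keeping only the first `num_bins` frequency points are assumptions of this sketch.

```python
import cmath
import math
from typing import List, Sequence


def hann_window(n_len: int) -> List[float]:
    """Periodic Hanning window, w(n) = 0.5 - 0.5 * cos(2 * pi * n / N)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / n_len) for n in range(n_len)]


def windowed_dft(frame: Sequence[float], window: Sequence[float],
                 num_bins: int) -> List[complex]:
    """Compute X(k) = sum_n w(n) x(n) exp(-2j * pi * k * n / N) for k < num_bins."""
    n_len = len(frame)
    xw = [w * x for w, x in zip(window, frame)]
    return [
        sum(xw[n] * cmath.exp(-2j * math.pi * k * n / n_len) for n in range(n_len))
        for k in range(num_bins)
    ]


# A cosine that completes 2 cycles in 16 samples peaks at frequency point 2.
N = 16
frame = [math.cos(2.0 * math.pi * 2 * n / N) for n in range(N)]
spectrum = windowed_dft(frame, hann_window(N), num_bins=N // 2)
magnitudes = [abs(v) for v in spectrum]
print(magnitudes.index(max(magnitudes)))
```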
Step 3, an existing signal-processing noise reduction algorithm is used to obtain the gain $g_l(k)$ at each frequency point. In general $0 \le g_l(k) \le 1$, i.e. the gain is real, which means that the phase of the spectrum is not changed; $\hat S_l(k)$ is the noise-reduced spectrum, i.e. $\hat S_l(k) = g_l(k)\,X_l(k)$;
Step 4, the forward smoothing step of Example 1 is applied;
Step 5, the backward smoothing step of Example 1 is applied;
Step 6, the time smoothing step of Example 1 is applied.
Step 7, updating the spectrum signal: $\hat S_l(k) = g_l(k)\,X_l(k)$;
Step 8, inverse Fourier transform back to the time domain signal: $\hat s_l = \mathcal{F}^{-1}(\hat S_l)$;
The synthesis window used for synthesizing the signal is determined by the windowing scheme. Preferably, when $N = 2M$ and $w$ is a Hanning window, the synthesis window is a unit (rectangular) window;
the speech signal $\hat s(n)$ is synthesized frame by frame in overlap-add mode.
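The choice of a unit synthesis window rests on the fact that Hanning analysis windows shifted by half a frame sum to one. The sketch below checks this numerically by overlap-adding windowed frames directly, which corresponds to applying all-ones gains; in the full pipeline each frame would instead be the inverse transform of the updated spectrum. The function names and the periodic form of the window are assumptions.

```python
import math
from typing import List, Sequence


def hann_window(n_len: int) -> List[float]:
    """Periodic Hanning window, w(n) = 0.5 - 0.5 * cos(2 * pi * n / N)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / n_len) for n in range(n_len)]


def overlap_add(frames: Sequence[Sequence[float]], hop: int) -> List[float]:
    """Sum frames at offsets 0, hop, 2 * hop, ... (unit synthesis window)."""
    if not frames:
        return []
    n_len = len(frames[0])
    out = [0.0] * (hop * (len(frames) - 1) + n_len)
    for index, frame in enumerate(frames):
        for n, value in enumerate(frame):
            out[index * hop + n] += value
    return out


N, M = 8, 4  # frame length and frame shift, N = 2M
x = [float(i % 5) for i in range(32)]
w = hann_window(N)
frames = [[w[n] * x[s + n] for n in range(N)] for s in range(0, len(x) - N + 1, M)]
y = overlap_add(frames, M)

# Samples covered by two frames are reconstructed; the first and last
# half frame are covered by a single window and are therefore attenuated.
print(max(abs(a - b) for a, b in zip(y[M:-M], x[M:-M])))
```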
Example 3
Preferably, the attenuation coefficient $\alpha$ used in the forward smoothing and in the backward smoothing takes the same value or different values;
The attenuation coefficient $\alpha$ is adjusted according to the signal-to-noise ratio, i.e. $\alpha$ is reduced when the signal-to-noise ratio is high and increased when the signal-to-noise ratio is low;
The exponential decay term $\alpha\,\tilde g$ may be replaced by a linear decay term: instead of the typical exponential decay of the Attack-Decay approach, a simple linear decay is also more effective in certain situations.
This gives an additional implementation. In the original Attack-Decay implementation, large values are tracked immediately (attack) and small values are tracked with an exponential decay (decay); here the exponential decay is replaced by a linear decay.
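Both variations of this example can be sketched briefly. The text does not specify how $\alpha$ depends on the signal-to-noise ratio or the exact form of the linear decay, so everything below is an assumption made for illustration: the linear interpolation between two hypothetical SNR thresholds, the default values, and the fixed subtractive `step` are not taken from the source.

```python
from typing import List


def alpha_from_snr(snr_db: float,
                   snr_low_db: float = 0.0, snr_high_db: float = 20.0,
                   alpha_at_low_snr: float = 0.9,
                   alpha_at_high_snr: float = 0.5) -> float:
    """Hypothetical mapping: larger alpha at low SNR, smaller alpha at high SNR."""
    t = (snr_db - snr_low_db) / (snr_high_db - snr_low_db)
    t = min(max(t, 0.0), 1.0)  # clamp outside the two thresholds
    return alpha_at_low_snr + t * (alpha_at_high_snr - alpha_at_low_snr)


def backward_smooth_linear(gains: List[float], step: float) -> List[float]:
    """Low-to-high sweep with an assumed linear decay: floor = previous - step."""
    smoothed: List[float] = []
    previous = 0.0  # virtual gain outside the lowest frequency
    for gain in gains:
        previous = max(previous - step, gain)
        smoothed.append(previous)
    return smoothed


print(alpha_from_snr(-5.0), alpha_from_snr(10.0), alpha_from_snr(30.0))
print(backward_smooth_linear([0.9, 0.1, 0.1, 0.1], step=0.25))
```

With the linear variant the floor drops by the same amount at every step, so a peak stops influencing its neighbours after a fixed number of frequency points instead of fading geometrically.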
Example 4
The invention also provides an adaptive noise reduction post-filter based on frequency point gain smoothing, which performs its processing with the noise reduction method of Examples 1-3.
Preferably, forward smoothing, backward smoothing and time smoothing are each performed by an independently arranged post-filter.
In order to better illustrate the solution of the present invention, some prior art documents are given below.
The following survey describes the general steps (e.g. steps 1-3):
Mahdi Parchami, Wei-Ping Zhu, Benoit Champagne, and Eric Plourde, Recent Developments in Speech Enhancement in the Short-Time Fourier Transform Domain, IEEE Circuits and Systems Magazine, vol. 16, no. 3, pp. 45-77, July 2016.
A classical noise reduction algorithm is described in:
Y. Ephraim and D. Malah, Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator, IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-32, no. 6, pp. 1109-1121, December 1984.
A very commonly used noise reduction method is described in:
Timo Gerkmann and Richard C. Hendriks, Unbiased MMSE-Based Noise Power Estimation with Low Complexity and Low Tracking Delay, IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 4, pp. 1383-1393, May 2012.
It will be apparent that the described embodiments are only some, but not all, embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.