Disclosure of Invention
The invention aims to provide a voice enhancement quality evaluation method, a device, a terminal and a storage medium, wherein a clean voice signal is used as an original signal, different types of noise are superposed before voice enhancement is carried out to generate a noisy voice signal, then the voice enhancement signal is generated through processing of a voice enhancement algorithm, and finally the clean original voice signal and the voice enhancement signal are led into PESQ to obtain a voice enhancement quality evaluation score.
In a first aspect, an embodiment of the present invention provides a method for evaluating speech enhancement quality, where the method includes the following steps:
the data acquisition step comprises: acquiring clean voice signals and noise data;
generating a noisy speech signal: processing the clean voice signal and the noise data according to preset evaluation content to generate a voice signal with noise; wherein the evaluation content includes a generation scenario of noise data and a signal-to-noise ratio of a noisy speech signal;
generating a speech enhancement signal: processing the noisy speech signal using a speech enhancement algorithm to generate a speech enhancement signal;
the step of evaluating the speech enhancement quality: and importing the clean voice signal and the voice enhancement signal into PESQ to calculate and obtain a voice enhancement quality score.
As a further improvement of the first aspect of the present invention, the step of generating a noisy speech signal specifically includes the steps of:
selecting a plurality of pieces of noise data to be superposed to generate a noise signal;
calculating a scaling coefficient of a corresponding noise signal according to the signal-to-noise ratio of the voice signal with the noise;
and carrying out scaling processing on the noise signal according to the scaling coefficient, and superposing the noise signal after scaling processing and the clean voice signal to generate a voice signal with noise.
As a further improvement of the first aspect of the present invention, the calculation formula of the scaling factor is as follows:
wherein $\alpha_{noise}$ represents the scaling factor of the noise signal and snr represents the signal-to-noise ratio of the noisy speech signal.
As a further improvement of the first aspect of the present invention, the calculation formula for generating the noisy speech signal is as follows:
$$noisy(x) = speech(x) + \alpha_{noise} \cdot noise(x)$$
where noisy(x) represents the noisy speech signal, speech(x) represents the clean speech signal, and noise(x) represents the noise signal.
As a further improvement of the first aspect of the present invention, the acquiring noise data specifically includes the following steps:
collecting noise data by recording or downloading a source database through a network;
and generating a noise library according to the collected noise data.
In a second aspect, an embodiment of the present invention provides a speech enhancement quality assessment apparatus, where the apparatus includes:
the data acquisition module is used for acquiring clean voice signals and noise data;
the noisy speech signal generation module is used for processing the clean speech signal and the noise data according to preset evaluation content to generate a noisy speech signal; wherein the evaluation content includes a generation scenario of noise data and a signal-to-noise ratio of a noisy speech signal;
the voice enhancement signal generation module is used for processing the voice signal with noise by utilizing a voice enhancement algorithm to generate a voice enhancement signal;
and the voice enhancement quality evaluation module is used for importing the clean voice signal and the voice enhancement signal into PESQ to calculate and obtain a voice enhancement quality score.
As a further development of the second aspect of the invention, the noisy speech signal generation module comprises the following subunits:
the noise signal generation subunit is used for selecting a plurality of pieces of noise data to be superposed so as to generate a noise signal;
the scaling coefficient calculation subunit is used for calculating the scaling coefficient of the corresponding noise signal according to the signal-to-noise ratio of the voice signal with noise;
and the noisy speech signal generating subunit is used for carrying out scaling processing on the noise signal according to the scaling coefficient and superposing the scaled noise signal and the clean speech signal to generate a noisy speech signal.
As a further development of the second aspect of the invention, the data acquisition module comprises a noise data acquisition sub-module for acquiring noise data, the noise data acquisition sub-module comprising the following sub-units:
the acquisition subunit is used for acquiring the noise data in a mode of recording or downloading a source database by a network;
and the noise library generating subunit is used for generating a noise library according to the collected noise data.
In a third aspect, an embodiment of the present invention provides a speech enhancement quality assessment terminal, including: memory, processor and computer program stored on the memory and executable on the processor, the processor implementing the speech enhancement quality assessment method according to any of the embodiments of the first aspect of the present invention when executing the program.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, where computer-executable instructions are stored, and the computer-executable instructions are configured to cause a computer to execute a speech enhancement quality assessment method according to any embodiment of the first aspect of the present invention.
Compared with the prior art, the embodiment of the invention at least has the following beneficial effects:
1. The voice enhancement quality evaluation method, device, terminal and storage medium provided by the invention can accommodate voice sources with any signal-to-noise ratio, and the voice enhancement quality evaluation can be completed using only the conference terminal, without the cooperation of additional equipment (such as a server).
2. According to the voice enhancement quality evaluation method, device, terminal and storage medium provided by the invention, the clean voice signal is used as the original signal, different types of noise are superposed on it before voice enhancement to generate the noisy voice signal, the noisy voice signal is then processed by the voice enhancement algorithm to generate the voice enhancement signal, and finally the clean original voice signal and the voice enhancement signal are imported into PESQ to obtain the voice enhancement quality evaluation score.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Detailed Description
Reference will now be made in detail to the present preferred embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to like elements throughout.
Fig. 1 is a block diagram of a terminal for performing a speech enhancement quality assessment method according to an embodiment. Referring to fig. 1, the voice enhancement quality assessment method is applied to a conference terminal 110. The conference terminal 110 is a computer device with data transmission and processing capabilities, and the conference terminal 110 is electrically connected with a microphone 120 and an analog-to-digital converter 130 for converting acoustic signals in the environment into digital signals suitable for calculation. It should be noted that the speech enhancement quality evaluation method may also be applied to other terminals, such as a desktop terminal or a mobile terminal, where the mobile terminal may specifically be at least one of a mobile phone, a tablet computer, a notebook computer, and the like.
To facilitate understanding of the present invention by those skilled in the art, technical terms related to embodiments of the present invention are briefly described below.
An ADC, i.e., an analog-to-digital converter, is a class of devices used to convert a continuous signal in analog form into a discrete signal in digital form. An analog-to-digital converter can provide a signal for measurement. The opposite device is called a digital-to-analog converter.
A microphone is an energy conversion device that converts a sound signal into an electrical signal.
The signal-to-noise ratio, abbreviated SNR or S/N, refers to the ratio of signal to noise in an electronic device or system. In a narrow sense, it is the ratio of the power of the output signal of an amplifier to the power of the noise output at the same time, often expressed in decibels; a higher signal-to-noise ratio indicates that the device generates less noise.
PESQ (Perceptual Evaluation of Speech Quality) is an objective speech quality assessment method; ITU-T (International Telecommunication Union, Telecommunication Standardization Sector) Recommendation P.862 specifies it as an objective method for evaluating MOS values.
The speech enhancement quality evaluation method provided by the embodiment of the present invention will be described and explained in detail by several specific embodiments.
In one embodiment, as shown in FIG. 2, a speech enhancement quality assessment method is provided. The embodiment is mainly illustrated by applying the method to computer equipment. The computer device may specifically be the conference terminal 110 in fig. 1 described above.
Referring to fig. 2, in this embodiment, the speech enhancement quality evaluation method includes the following steps:
step S102, data acquisition step: conference terminal 110 acquires clean voice signals and noise data.
It should be noted that the acquiring of the noise data specifically includes the following steps:
step a: the conference terminal 110 collects noise data by recording or downloading a source database through a network;
step b: the conference terminal 110 generates a noise library from the collected noise data.
It is understood that, as an example, the voice enhancement quality scoring under the actual application environment can be realized by recording the noise of the environment through the microphone 120 shown in fig. 1.
In another example, noise data can be collected by downloading a source database over a network. Using such a database makes it possible to test the voice enhancement quality under a variety of application environments or scenes, thereby improving the testing efficiency.
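As an illustration of steps a and b, the noise library can be pictured as a collection of noise clips grouped by scene. The following is a minimal sketch only; the class name `NoiseLibrary`, its methods, and the in-memory layout are hypothetical and are not prescribed by this embodiment.

```python
from collections import defaultdict

import numpy as np


class NoiseLibrary:
    """In-memory noise library grouped by scene name (hypothetical structure)."""

    def __init__(self):
        self._clips = defaultdict(list)

    def add(self, scene, samples):
        """Store one recorded or downloaded noise clip under a scene label."""
        clip = np.asarray(samples, dtype=np.float64)
        if clip.ndim != 1 or clip.size == 0:
            raise ValueError("a noise clip must be a non-empty 1-D array")
        self._clips[scene].append(clip)

    def scenes(self):
        """Return the scene labels currently in the library, sorted by name."""
        return sorted(self._clips)

    def clips(self, scene):
        """Return a copy of the list of clips stored for one scene."""
        if scene not in self._clips:
            raise KeyError(f"no noise data collected for scene {scene!r}")
        return list(self._clips[scene])
```

Grouping by scene is one way to prepare different noise libraries for different test scenes; clips recorded through the microphone 120 and clips downloaded over a network could be added through the same `add` call.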
Step S104, generating a noisy speech signal: the conference terminal 110 processes the clean voice signal and the noise data according to a preset evaluation content to generate a voice signal with noise; wherein the evaluation content includes a generation scenario of the noise data and a signal-to-noise ratio of the noisy speech signal.
It should be noted that the step of generating a noisy speech signal specifically includes the following steps:
step S1041: the conference terminal 110 selects a plurality of pieces of noise data to be superimposed to generate a noise signal;
step S1042: the conference terminal 110 calculates a scaling coefficient of a corresponding noise signal according to the signal-to-noise ratio of the voice signal with noise;
the calculation formula of the scaling factor is as follows:
wherein $\alpha_{noise}$ represents the scaling factor of the noise signal and snr represents the signal-to-noise ratio of the noisy speech signal.
Step S1043: the conference terminal 110 performs scaling processing on the noise signal according to the scaling coefficient, and superimposes the scaled noise signal with the clean voice signal to generate a noisy voice signal.
The calculation formula for generating the noisy speech signal is as follows:
$$noisy(x) = speech(x) + \alpha_{noise} \cdot noise(x)$$
where noisy(x) represents the noisy speech signal, speech(x) represents the clean speech signal, and noise(x) represents the noise signal.
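The scaling and superposition of steps S1042 and S1043 can be sketched as follows. This is a minimal illustration that assumes the common power-ratio definition of the signal-to-noise ratio, snr = 10·log10(P_speech / P_noise), in decibels; that definition and the function names `noise_scale` and `mix_at_snr` are assumptions made for the example.

```python
import numpy as np


def noise_scale(speech, noise, snr_db):
    """Scaling coefficient that places `noise` at `snr_db` relative to `speech`.

    Assumes SNR = 10*log10(mean(speech^2) / mean((alpha*noise)^2)).
    """
    speech = np.asarray(speech, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    if noise_power == 0.0:
        raise ValueError("noise signal is silent; cannot scale to a target SNR")
    return float(np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0))))


def mix_at_snr(speech, noise, snr_db):
    """Return (noisy, alpha) with noisy(x) = speech(x) + alpha * noise(x)."""
    speech = np.asarray(speech, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)
    if speech.shape != noise.shape:
        raise ValueError("speech and noise must have the same length")
    alpha = noise_scale(speech, noise, snr_db)
    return speech + alpha * noise, alpha
```

Because the coefficient is derived from the measured powers of both signals, the same noise clip can be reused at any target signal-to-noise ratio without re-recording.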
It is understood that after the noise data of various scenes is acquired through step S102, the signal-to-noise ratio of the noisy speech signal can be arbitrarily set through step S104, facilitating large-scale speech enhancement quality assessment under various scenes.
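One way to picture the preset evaluation content is as a list of (scene, signal-to-noise ratio) pairs to be evaluated one after another. The sketch below is an illustration only; the names `EvaluationContent` and `build_plan` are hypothetical.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class EvaluationContent:
    """One evaluation setting: a noise scene and a target SNR in dB."""
    scene: str
    snr_db: float


def build_plan(scenes, snr_values_db):
    """Enumerate every combination of scene and target SNR."""
    return [
        EvaluationContent(scene, float(snr))
        for scene, snr in product(scenes, snr_values_db)
    ]
```

For example, two scenes and three signal-to-noise ratios yield six settings, each of which can be passed through steps S104 to S108.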
Step S106, generating a speech enhancement signal: conference terminal 110 processes the noisy speech signal using a speech enhancement algorithm to generate a speech enhancement signal;
Step S108, evaluating the voice enhancement quality: the conference terminal 110 imports the clean voice signal and the voice enhancement signal into PESQ to calculate a voice enhancement quality score.
In summary, the speech enhancement quality assessment method in this embodiment uses a clean speech signal as an original signal, superimposes different types of noise to generate a noisy speech signal before performing speech enhancement, then processes the noisy speech signal through a speech enhancement algorithm to generate a speech enhancement signal, and finally introduces the clean original speech signal and the speech enhancement signal into PESQ to obtain a speech enhancement quality assessment score.
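The flow of steps S104 to S108 can be sketched end to end as follows. This is a minimal sketch under stated assumptions: `enhance` and `score` are hypothetical callables supplied by the caller, where `enhance` stands for any speech enhancement algorithm and `score` stands for a PESQ implementation taking (sample rate, reference, degraded); the power-ratio definition of the signal-to-noise ratio is also an assumption.

```python
import numpy as np


def evaluate_enhancement(clean, noise, snr_db, enhance, score, sample_rate=16000):
    """Mix noise into clean speech, enhance the mixture, and score the result."""
    clean = np.asarray(clean, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)
    if clean.shape != noise.shape:
        raise ValueError("clean speech and noise must have the same length")
    noise_power = np.mean(noise ** 2)
    if noise_power == 0.0:
        raise ValueError("noise signal is silent")

    # Step S104: scale the noise to the target SNR and superpose it.
    alpha = np.sqrt(np.mean(clean ** 2) / (noise_power * 10.0 ** (snr_db / 10.0)))
    noisy = clean + alpha * noise

    # Step S106: run the speech enhancement algorithm under test.
    enhanced = np.asarray(enhance(noisy), dtype=np.float64)
    if enhanced.shape != clean.shape:
        raise ValueError("enhanced signal must have the same length as the clean signal")

    # Step S108: compare the clean reference with the enhanced signal.
    return score(sample_rate, clean, enhanced)
```

Passing the enhancement algorithm and the scorer as arguments keeps the sketch independent of any particular enhancement method or PESQ implementation.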
In a preferred embodiment, a speech enhancement quality assessment method is provided, comprising the following steps:
Step one: preparing a clean voice signal by acquiring a voice signal with a signal-to-noise ratio higher than 60 dB and a sampling rate of 16 kHz.
Step two, adding a noise signal, specifically comprising the following four substeps:
substep 1: acquiring a noise library, wherein in the embodiment, the noise library is obtained by recording and downloading a source database on the internet, and different noise libraries can be prepared for different test scenes;
substep 2: selecting noise: N pieces of noise data are randomly selected and superposed, the superposition formula being as follows:
$$noise(x) = \alpha_1 noise_1(x) + \alpha_2 noise_2(x) + \dots + \alpha_N noise_N(x)$$
where noise(x) represents the noise signal after superposition, and $\alpha_n$ ($n = 1, \dots, N$) represents a scaling parameter taking values in the range (0,1).
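Substep 2 can be sketched as follows. This is an illustration under stated assumptions: the function name `superpose_noise` is hypothetical, clips shorter than the requested length are looped and longer clips are truncated (the embodiment does not say how lengths are matched), and each weight is drawn uniformly at random from (0,1).

```python
import numpy as np


def superpose_noise(noise_library, n_pieces, length, rng):
    """Pick `n_pieces` distinct clips at random and sum them with weights in (0, 1).

    Returns (noise, picks, weights) so that the mixture can be reproduced.
    """
    if not 1 <= n_pieces <= len(noise_library):
        raise ValueError("n_pieces must be between 1 and the library size")
    picks = [int(i) for i in rng.choice(len(noise_library), size=n_pieces, replace=False)]
    tiny = np.finfo(np.float64).tiny  # smallest positive normal float, keeps weights above 0
    weights = []
    noise = np.zeros(length, dtype=np.float64)
    for idx in picks:
        clip = np.asarray(noise_library[idx], dtype=np.float64)
        # Assumption: loop or truncate each clip to the requested length.
        clip = np.resize(clip, length)
        weight = float(rng.uniform(tiny, 1.0))
        weights.append(weight)
        noise += weight * clip
    return noise, picks, weights
```

A call such as `superpose_noise(library, 3, 16000, np.random.default_rng())` would produce one second of superposed noise at a 16 kHz sampling rate from three randomly chosen clips.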
Substep 3: calculating a noise scaling coefficient: the scaling needed to reach a specified signal-to-noise ratio is calculated according to the amplitude of the clean voice signal. In this embodiment, the signal-to-noise ratio is chosen at random in the range (-15 dB, 20 dB), and the scaling coefficient of the noise signal is solved from this known signal-to-noise ratio.
The signal-to-noise ratio calculation formula is as follows:
wherein T represents the number of samples of the signal. From this formula, the noise scaling factor is calculated as follows:
wherein $\alpha_{noise}$ represents the scaling factor of the noise signal and snr represents the signal-to-noise ratio of the noisy speech signal.
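For illustration, one common way of relating the signal-to-noise ratio to the noise scaling factor is worked through below. This is a sketch that assumes the signal-to-noise ratio is defined in decibels as the ratio of the summed squared samples of the clean speech to those of the scaled noise over T samples; the formulas of this embodiment may differ in form.

```latex
\[
snr = 10 \log_{10}
      \frac{\sum_{x=1}^{T} speech(x)^{2}}
           {\sum_{x=1}^{T} \bigl(\alpha_{noise}\, noise(x)\bigr)^{2}}
\]
% Solving for the scaling factor:
\[
\alpha_{noise} =
      \sqrt{\frac{\sum_{x=1}^{T} speech(x)^{2}}
                 {10^{\,snr/10} \sum_{x=1}^{T} noise(x)^{2}}}
\]
```

Under this assumption, if the speech and the unscaled noise have equal energy, a target of 20 dB gives $\alpha_{noise} = \sqrt{1/100} = 0.1$, and a target of -15 dB gives $\alpha_{noise} = 10^{0.75} \approx 5.62$.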
Substep 4: generating a noisy speech signal
Generating the noisy speech according to the noise scaling factor calculated in substep 3, the calculation formula being as follows:
$$noisy(x) = speech(x) + \alpha_{noise} \cdot noise(x)$$
where noisy(x) represents the noisy speech signal, speech(x) represents the clean speech signal, and noise(x) represents the noise signal.
Step three: inputting the noisy speech signal generated in step two into any speech enhancement algorithm for processing to obtain a speech enhancement signal.
It should be noted that speech enhancement is, in essence, speech noise reduction. In daily life, the speech collected by the microphone 120 is usually "polluted" by various kinds of noise, and the main purpose of speech enhancement is to recover the desired clean speech from this noisy speech. Speech enhancement algorithms can be divided into two broad categories: methods based on digital signal processing and methods based on machine learning. Since speech enhancement algorithms are well known in the art, they are not described further here.
Step four: importing the clean voice signal and the voice enhancement signal into PESQ to calculate and obtain a voice enhancement quality score. PESQ is a known, general-purpose open source technology and is not described further here.
Referring to fig. 4, an embodiment of the present invention provides a speech enhancement quality assessment apparatus, including:
a data acquisition module 201, configured to acquire clean voice signals and noise data;
a noisy speech signal generating module 202, configured to process the clean speech signal and the noise data according to a preset evaluation content to generate a noisy speech signal; wherein the evaluation content includes a generation scenario of noise data and a signal-to-noise ratio of a noisy speech signal;
a speech enhancement signal generation module 203, configured to process the noisy speech signal by using a speech enhancement algorithm to generate a speech enhancement signal;
and the voice enhancement quality evaluation module 204 is configured to import the clean voice signal and the voice enhancement signal into PESQ to calculate a voice enhancement quality score.
Specifically, the noisy speech signal generating module 202 includes the following sub-units:
the noise signal generation subunit is used for selecting a plurality of pieces of noise data to be superposed so as to generate a noise signal;
the scaling coefficient calculation subunit is used for calculating the scaling coefficient of the corresponding noise signal according to the signal-to-noise ratio of the voice signal with noise;
and the noisy speech signal generating subunit is used for carrying out scaling processing on the noise signal according to the scaling coefficient and superposing the scaled noise signal and the clean speech signal to generate a noisy speech signal.
Further, the data acquisition module 201 includes a noise data acquisition sub-module for acquiring noise data, the noise data acquisition sub-module including the following sub-units:
the acquisition subunit is used for acquiring the noise data in a mode of recording or downloading a source database by a network;
and the noise library generating subunit is used for generating a noise library according to the collected noise data.
In summary, the speech enhancement quality assessment apparatus in this embodiment uses a clean speech signal as an original signal, superimposes different types of noise to generate a noisy speech signal before performing speech enhancement, then processes the noisy speech signal through a speech enhancement algorithm to generate a speech enhancement signal, and finally introduces the clean original speech signal and the speech enhancement signal into PESQ to obtain a speech enhancement quality assessment score.
It should be noted that the device embodiment and the method embodiment of the present invention are based on the same inventive concept, so the details of the device embodiment are not repeated here.
FIG. 5 is a diagram illustrating an internal structure of a computer device in one embodiment. The computer device may specifically be the conference terminal 110 in fig. 1. As shown in fig. 5, the computer apparatus includes a processor, a memory, a network interface, an input device, and a display screen connected through a system bus. Wherein the memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and may also store a computer program that, when executed by the processor, causes the processor to implement the speech enhancement quality assessment method. The internal memory may also have stored therein a computer program that, when executed by the processor, causes the processor to perform a speech enhancement quality assessment method. Those skilled in the art will appreciate that the configuration shown in fig. 5 is a block diagram of only a portion of the configuration associated with aspects of the present invention and is not intended to limit the computing devices to which aspects of the present invention may be applied, and that a particular computing device may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, the speech enhancement quality assessment apparatus provided by the present application may be implemented in the form of a computer program that is executable on a computer device such as that shown in fig. 5. The memory of the computer device may store various program modules constituting the speech enhancement quality evaluation apparatus, such as a data acquisition module 201, a noisy speech signal generation module 202, a speech enhancement signal generation module 203, and a speech enhancement quality evaluation module 204 shown in fig. 4. The respective program modules constitute computer programs that cause a processor to execute the steps in the speech enhancement quality evaluation method of the respective embodiments of the present application described in the present specification.
For example, the computer device shown in fig. 5 may perform the step of acquiring clean speech signal and noise data by the data acquisition module 201 in the speech enhancement quality assessment apparatus shown in fig. 4. The step of processing the clean speech signal and the noise data according to a preset evaluation content to generate a noisy speech signal is performed by a noisy speech signal generating module 202, wherein the evaluation content includes a generation scene of the noise data and a signal-to-noise ratio of the noisy speech signal. The step of processing the noisy speech signal with a speech enhancement algorithm to generate a speech enhancement signal is performed by a speech enhancement signal generation module 203. The step of importing the clean speech signal and the speech enhancement signal into PESQ to calculate a speech enhancement quality score is performed by a speech enhancement quality evaluation module 204.
In one embodiment, there is provided a conference terminal including: a memory, a processor and a computer program stored on the memory and executable on the processor, the processor executing the program to perform the steps of the above-described speech enhancement quality assessment method. Here, the steps of the speech enhancement quality evaluation method may be the steps in the speech enhancement quality evaluation methods of the above-described respective embodiments.
In one embodiment, a computer-readable storage medium is provided, which stores computer-executable instructions for causing a computer to perform the steps of the above-described speech enhancement quality assessment method. Here, the steps of the speech enhancement quality evaluation method may be the steps in the speech enhancement quality evaluation methods of the above-described respective embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a non-volatile computer-readable storage medium and which, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be arbitrarily combined. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction between the combined technical features, each combination should be considered to be within the scope of the present specification.