CN114743566B - Voice evaluation method, device, equipment and storage medium

Info

Publication number: CN114743566B
Application number: CN202110023913.5A
Authority: CN
Inventors: 叶珑; 雷延强; 佘爽
Original assignee: Guangzhou Shikun Electronic Technology Co Ltd
Current assignee: Guangzhou Shikun Electronic Technology Co Ltd
Priority date: 2021-01-08
Filing date: 2021-01-08
Publication date: 2025-08-12
Anticipated expiration: 2041-01-08
Also published as: CN114743566A

Abstract

Embodiments of the present invention disclose a speech evaluation method, apparatus, device, and storage medium. The speech evaluation method includes determining a first Goodness of Pronunciation (GOP) sequence corresponding to the test speech and a second GOP sequence corresponding to a template speech, determining the relative entropy (KL divergence) between the first and second GOP sequences, and determining an evaluation result for the test speech based on the KL divergence. The technical solution of this embodiment exploits the fact that GOP features are probability distributions and uses the KL divergence to measure the similarity between the GOP sequence of the test speech and the GOP sequence of the template pronunciation, thereby improving the accuracy of the evaluation result.

Description

Voice evaluation method, device, equipment and storage medium

Technical Field

Embodiments of the invention relate to data processing technology, and in particular to a voice evaluation method, device, equipment and storage medium.

Background

Pronunciation Quality Assessment (PQA) is a subfield of computer-assisted language learning (CALL). A PQA technique is required to indicate a learner's pronunciation errors efficiently and accurately, give an objective phoneme-level evaluation, and help the learner correct those errors.

Goodness of Pronunciation (GOP) is a common feature in pronunciation evaluation, but it yields only a phoneme-level evaluation result. To evaluate a whole sentence, existing methods mainly use a weighted combination of phoneme-level GOP values as the sentence-level GOP, and thus obtain a sentence-level evaluation result only indirectly.

This whole-sentence evaluation approach does not address the differences in GOP value distribution between phonemes, which can make the evaluation of a whole sentence inaccurate.

Disclosure of Invention

The embodiment of the invention provides a voice evaluation method, a device, equipment and a storage medium, which can improve the accuracy of whole-sentence voice evaluation results.

In a first aspect, an embodiment of the present invention provides a voice evaluation method, including:

determining a first Goodness of Pronunciation (GOP) sequence corresponding to the voice to be detected and a second GOP sequence corresponding to the template voice;

determining a relative entropy KL divergence between the first GOP sequence and the second GOP sequence;

and determining an evaluation result of the voice to be tested based on the KL divergence.

In one embodiment, determining a first GOP sequence corresponding to a voice to be detected and a second GOP sequence corresponding to a template voice includes:

acquiring a first phoneme state sequence and first boundary information of the voice to be detected;

determining a first GOP sequence corresponding to the voice to be detected based on the first phoneme state sequence and the first boundary information;

acquiring a second phoneme state sequence and second boundary information of the template voice;

and determining a second GOP sequence corresponding to the template voice based on the second phoneme state sequence and the second boundary information.

In one embodiment, determining a first GOP sequence corresponding to the speech to be detected based on the first phoneme state sequence and the first boundary information, and determining a second GOP sequence corresponding to the template speech based on the second phoneme state sequence and the second boundary information includes:

determining a first GOP value corresponding to each non-silent phoneme in the first phoneme state sequence based on the first phoneme state sequence and the first boundary information;

arranging the first GOP values corresponding to the non-silent phonemes in a first preset order to obtain the first GOP sequence;

determining a second GOP value corresponding to each non-silent phoneme in the second phoneme state sequence based on the second phoneme state sequence and the second boundary information;

and arranging the second GOP values corresponding to the non-silent phonemes in a second preset order to obtain the second GOP sequence.

In one embodiment, after determining the second GOP sequence corresponding to the template speech based on the second phoneme state sequence and the second boundary information, the method further includes:

comparing, for each phoneme state, the GOP value in the first GOP sequence with the GOP value in the second GOP sequence;

and if the GOP value in the second GOP sequence is smaller than the GOP value in the first GOP sequence, replacing the GOP value in the second GOP sequence with the GOP value in the first GOP sequence to obtain a new second GOP sequence.

In one embodiment, determining the relative entropy KL divergence between the first GOP sequence and the second GOP sequence comprises:

calculating the KL divergence between the first GOP sequence and the second GOP sequence through a KL divergence formula.

In one embodiment, the KL divergence formula is:

$$D_{KL}(P \,\|\, Q) = \sum_{i} P(i)\,\log\frac{P(i)}{Q(i)}$$

where D_KL(P‖Q) is the KL divergence between the first GOP sequence and the second GOP sequence, P(i) is the i-th GOP value in the second GOP sequence, and Q(i) is the i-th GOP value in the first GOP sequence.

In an embodiment, the determining the evaluation result of the voice to be tested by using the KL divergence includes:

when the KL divergence is larger than or equal to a preset threshold value, determining that the evaluation result of the voice to be tested is correct in pronunciation;

and when the KL divergence is smaller than the preset threshold value, determining that the evaluation result of the voice to be tested is a pronunciation error.

In a second aspect, an embodiment of the present invention further provides a voice evaluation device, which is characterized in that the device includes:

a GOP sequence determining module, configured to determine a first Goodness of Pronunciation (GOP) sequence corresponding to the voice to be detected and a second GOP sequence corresponding to the template voice;

a KL divergence determining module configured to determine a relative entropy KL divergence between the first GOP sequence and the second GOP sequence;

and the evaluation result determining module is used for determining the evaluation result of the voice to be tested based on the KL divergence.

In a third aspect, an embodiment of the present invention further provides a voice evaluation apparatus, including:

one or more processors;

a memory for storing one or more programs;

when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the speech assessment method provided in the first aspect described above.

In a fourth aspect, embodiments of the present invention provide a computer-readable storage medium having stored thereon one or more computer programs which, when executed by a processor, implement a speech assessment method as provided in the first aspect above.

In the voice evaluation method, device, equipment and storage medium provided by the embodiments, the voice evaluation method comprises determining a first Goodness of Pronunciation (GOP) sequence corresponding to the voice to be tested and a second GOP sequence corresponding to the template voice, determining the relative entropy (KL divergence) between the first GOP sequence and the second GOP sequence, and determining the evaluation result of the voice to be tested based on the KL divergence. The technical scheme of the embodiments exploits the fact that GOP features are probability distributions and uses the KL divergence to measure the similarity between the GOP sequence of the voice to be tested and the GOP sequence of the template pronunciation, which improves the accuracy of the evaluation result.

Drawings

FIG. 1a is an exemplary diagram of an application scenario provided by an embodiment of the present invention;

FIG. 1b is another exemplary diagram of an application scenario provided by an embodiment of the present invention;

FIG. 2 is a flowchart of a speech evaluation method provided by an embodiment of the present invention;

FIG. 3 is a flowchart of another method for voice evaluation according to an embodiment of the present invention;

FIG. 4 is a flowchart of another method for voice evaluation according to an embodiment of the present invention;

Fig. 5 is a schematic structural diagram of a voice evaluation device according to an embodiment of the present invention;

Fig. 6 is a schematic hardware structure of a device according to an embodiment of the present invention.

Detailed Description

The invention is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting thereof. It should be further noted that, for convenience of description, only some, but not all of the structures related to the present invention are shown in the drawings.

Existing speech evaluation techniques can only obtain a phoneme-level pronunciation evaluation result. To evaluate a whole sentence directly, existing methods usually take a weighted phoneme-level GOP value as the sentence-level GOP value and thereby obtain a sentence-level result only indirectly. This approach does not address the differences in GOP value distribution between phonemes, which can make the evaluation of a whole sentence inaccurate.

Therefore, to address the above problems, the embodiments of the present invention provide a speech evaluation method, apparatus, device, and storage medium that obtain a sentence-level pronunciation evaluation result by calculating the KL divergence between the phoneme GOP sequence of the template pronunciation and the phoneme GOP sequence of the voice to be tested. Multiple pronunciation levels are then defined according to the KL divergence to realize sentence-level pronunciation evaluation.

The method can be used for pronunciation error detection and diagnosis in the field of speech evaluation, for example in an online or offline speech evaluation system, where it detects pronunciation errors across a language learner's whole sentence and thus realizes sentence-level pronunciation evaluation. One example is a native Chinese speaker learning a foreign language.

FIG. 1a is an exemplary diagram of an application scenario provided in an embodiment of the present application. As shown in FIG. 1a, the server 102 is configured to execute the pronunciation assessment method according to any one of the embodiments of the present application. The client 101 receives, through an input device, the voice to be tested uttered by the user, and the server 102 interacts with the client 101 to obtain the voice to be tested. After executing the pronunciation assessment method, the server 102 outputs the pronunciation assessment result to the client 101, and the output device of the client 101 notifies the learner. Further, the client 101 provides the correct pronunciation to help the learner correct his or her pronunciation.

The input device may be an input device built in the computer device, such as a built-in voice input device, or an external input device connected to the computer device through a communication line, such as a microphone. Furthermore, the output device can be an output device which is built in the computer equipment, such as a touch display screen and the like, or an external output device which is connected with the computer equipment through a communication line, such as a projector, a digital TV and the like.

In this embodiment, the client 101 is illustrated as a computer device, which may specifically be a computer device including a processor, a memory, an input device, and an output device, such as a notebook computer, desktop computer, tablet computer, intelligent terminal, learning machine, early education machine, or intelligent wearable device.

Alternatively, when the client 101 has sufficient data processing capability, that is, the client 101 has a processor and a memory, the client 101 alone may serve as the execution subject of the pronunciation assessment method according to any method embodiment of the present application, as illustrated in FIG. 1b. In FIG. 1b, a learner holds a microphone, and the built-in voice acquisition device of a mobile phone acquires the voice to be tested uttered by the user in real time; after the pronunciation assessment method is executed, the evaluation result of the voice to be tested is displayed to the learner on the display screen.

The pronunciation assessment method provided by the invention is explained below with reference to specific embodiments.

Fig. 2 is a flowchart of a voice evaluation method provided in an embodiment of the present invention, where the method is suitable for detecting whether a learner pronounces correctly, the voice evaluation method may be performed by a voice evaluation device, and the voice evaluation device may be implemented by hardware and/or software. The speech evaluation device may be composed of two or more physical entities, or may be composed of one physical entity, and is generally integrated in a computer device.

As shown in fig. 2, the voice evaluation method provided by the embodiment of the invention mainly includes the following steps:

S11, determining a first Goodness of Pronunciation (GOP) sequence corresponding to the voice to be detected and a second GOP sequence corresponding to the template voice.

In this embodiment, the voice to be detected can be understood as the collected voice that the learner utters for the pronunciation text. It may be the learner's voice collected directly through a microphone, or a recording of the learner made in advance, where the learner is a person who needs to practice pronunciation according to the pronunciation text. The template voice is the standard pronunciation of the speech that the learner needs to practice and is recorded according to the standard pronunciation of the language in question; for example, Chinese is recorded according to standard Mandarin pronunciation, and English may be recorded according to standard American or British pronunciation. The template voice can also be understood as the correct pronunciation, which is used to help the learner correct mispronunciations and to prompt the correct pronunciation.

It should be noted that, the above-mentioned voice to be measured and the template voice are voices aiming at the same pronunciation text.

In practical applications, when a learner reads a text, a speech signal corresponding to the text is generated. The electronic device first obtains the voice signal and determines whether the pronunciation of the learner is correct by detecting the voice signal.

For example, the text may be at least one word, or even at least one phoneme. A phoneme is the smallest phonetic unit divided according to the natural attributes of speech; it is analyzed in terms of the articulatory actions within a syllable, and one action constitutes one phoneme. A phoneme, in turn, is made up of multiple states. For example, a phoneme is composed of three states, each of which is assigned a duration of at least one frame, so reading out the phoneme takes at least three frames. The text here is the pronunciation text of the embodiments of the application, and the voice signal is the voice to be detected.

Taking a learning machine as an example, when a learner reads pronunciation text on a display interface of the learning machine, the learning machine collects voice signals through pickup equipment such as a microphone to obtain voice to be detected.

In this embodiment, the first GOP sequence is the series of GOP values obtained by analyzing the voice to be tested, and the second GOP sequence is the series of GOP values obtained by analyzing the template voice.

Further, the order of the GOP values in the first GOP sequence is determined by the first phoneme state sequence of the voice to be tested, and the order of the GOP values in the second GOP sequence is determined by the second phoneme state sequence of the template voice. The length of the first GOP sequence is equal to the length of the second GOP sequence.

In one embodiment, a method of determining a first GOP sequence is provided. Specifically, based on the pronunciation text and the voice to be tested, the voice to be tested is decomposed to obtain phonemes and boundary information contained in the voice to be tested, and states corresponding to the phonemes form a first phoneme state sequence. And calculating the GOP value corresponding to each phoneme, and sequentially arranging the GOP values corresponding to the phonemes in the first phoneme state sequence according to the sequence of the phonemes to obtain a first GOP sequence.

In one embodiment, a method of determining a second GOP sequence is provided. Specifically, based on the pronunciation text and the template voice, the template voice is decomposed to obtain the phonemes and boundary information it contains, and the states corresponding to the phonemes form a second phoneme state sequence. The GOP value corresponding to each phoneme is calculated, and the GOP values are arranged in the order of the phonemes in the second phoneme state sequence to obtain the second GOP sequence.

It should be noted that, because the learner's utterance of the same pronunciation text may differ each time, the first GOP sequence corresponding to the voice to be tested must be determined anew each time a voice signal to be tested is collected.

In one embodiment, for the second GOP sequence corresponding to the template voice, the steps of obtaining the template voice corresponding to the voice to be detected and determining the second GOP sequence corresponding to that template voice may be executed each time a voice signal to be detected is collected.

In another embodiment, since the standard pronunciation of a given pronunciation text does not change, once the second GOP sequence corresponding to the template voice has been determined, the correspondence between the second GOP sequence and the pronunciation text is stored. After the voice to be detected for that pronunciation text and its first GOP sequence are obtained, the second GOP sequence corresponding to the pronunciation text is read directly from the memory or storage unit.
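The stored correspondence between pronunciation text and second GOP sequence can be sketched as a small in-memory cache. This is a minimal illustration rather than the patent's implementation: the names `get_template_gop` and `compute`, and the use of a Python dictionary keyed by the pronunciation text, are assumptions made for the example.

```python
from typing import Callable, Dict, List

# Hypothetical in-memory store: pronunciation text -> second (template) GOP sequence.
_template_gop_cache: Dict[str, List[float]] = {}


def get_template_gop(text: str, compute: Callable[[str], List[float]]) -> List[float]:
    """Return the stored template GOP sequence for `text`.

    `compute` is a caller-supplied (hypothetical) function that derives the
    second GOP sequence from the template voice; it runs only on a cache miss.
    """
    if text not in _template_gop_cache:
        _template_gop_cache[text] = compute(text)
    return _template_gop_cache[text]
```

With this arrangement the template voice is analyzed once per pronunciation text, while the first GOP sequence is still recomputed for every new utterance.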

S12, determining the relative entropy KL divergence between the first GOP sequence and the second GOP sequence.

Relative entropy, also known as the Kullback-Leibler (KL) divergence or information divergence, is an asymmetric measure of the difference between two probability distributions P and Q, where P represents the true or observed distribution of the data and Q represents a theoretical distribution, an estimate, or an approximating model of the data. The KL divergence is also referred to as the KL distance, and its value is non-negative.

Specifically, the KL divergence is used to measure the degree of difference between the first GOP sequence and the second GOP sequence. This embodiment does not limit how the KL divergence is computed: the GOP values in the first and second GOP sequences can be substituted directly into a KL divergence formula, or the KL divergence of the two sequences can be obtained by mathematical derivation. Likewise, the KL divergence is only one of the possible measures of the difference between two sequences; other measures, such as the Euclidean distance, may also be used.

In one embodiment, determining the relative entropy KL divergence between the first GOP sequence and the second GOP sequence includes calculating the KL divergence between the first GOP sequence and the second GOP sequence by a KL divergence formula.

In this embodiment, the KL divergence formula may be the standard KL formula or a formula derived from it, which is not limited in this embodiment.
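As a minimal sketch of the direct-substitution option, the function below evaluates the sum of P(i)·log(P(i)/Q(i)) over two equal-length sequences, with P taken from the second (template) GOP sequence and Q from the first (test) GOP sequence. The function name and the requirement that all values be strictly positive are assumptions of this example; the text does not state how GOP values are scaled before they enter the formula.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Return sum_i p[i] * log(p[i] / q[i]) for two equal-length sequences.

    p: GOP values of the second (template) sequence.
    q: GOP values of the first (test) sequence.
    """
    if len(p) != len(q):
        raise ValueError("GOP sequences must have equal length")
    # Assumption of this sketch: values are strictly positive so the log is defined.
    if any(v <= 0 for v in p) or any(v <= 0 for v in q):
        raise ValueError("this sketch assumes strictly positive GOP values")
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

Identical sequences give a result of zero, and the result grows as the template values exceed the test values.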

S13, determining an evaluation result of the voice to be tested based on the KL divergence.

In this embodiment, the evaluation result may be correct or incorrect, a classification into multiple quality grades, or a score or percentage value, which is not limited in this embodiment.

In one embodiment, the determining the evaluation result of the voice to be tested includes determining that the evaluation result of the voice to be tested is correct in pronunciation when the KL divergence is greater than or equal to a preset threshold value, and determining that the evaluation result of the voice to be tested is incorrect in pronunciation when the KL divergence is less than the preset threshold value.

Specifically, the preset threshold may be set according to actual situations, and in this embodiment, the preset threshold is not specifically limited.

In one embodiment, determining the evaluation result of the voice to be tested based on the KL divergence includes determining that the evaluation result is excellent when the KL divergence is greater than or equal to a first threshold, determining that the evaluation result is good when the KL divergence is less than the first threshold and greater than or equal to a second threshold, determining that the evaluation result is medium when the KL divergence is less than the second threshold and greater than or equal to a third threshold, and determining that the evaluation result is poor when the KL divergence is less than the third threshold.

The first threshold value is greater than the second threshold value and the second threshold value is greater than the third threshold value, where the first threshold value, the second threshold value and the third threshold value may be set according to actual situations, and in this embodiment, the threshold values are not specifically limited.

In the above embodiment, only 1 threshold value and 3 threshold values are taken as examples, and the description is briefly made. In a specific application process, different thresholds may be set according to different situations, and the evaluation result is classified into multiple classes, which is not described in this embodiment.
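The single-threshold and three-threshold cases share one pattern, which can be sketched as a generic lookup over descending thresholds. The function name `grade`, the label strings, and the numeric thresholds in the usage note are placeholders chosen for this example; the comparison direction (a value at or above a threshold receives the earlier label) follows the text as written.

```python
from typing import Sequence


def grade(kl: float, thresholds: Sequence[float], labels: Sequence[str]) -> str:
    """Map a KL divergence to a label using strictly descending thresholds.

    labels[0] is returned when kl >= thresholds[0], labels[1] when
    thresholds[1] <= kl < thresholds[0], and so on; the last label is
    returned when kl is below every threshold.
    """
    if len(labels) != len(thresholds) + 1:
        raise ValueError("need exactly one more label than thresholds")
    if any(a <= b for a, b in zip(thresholds, thresholds[1:])):
        raise ValueError("thresholds must be strictly descending")
    for threshold, label in zip(thresholds, labels):
        if kl >= threshold:
            return label
    return labels[-1]
```

For instance, `grade(kl, [0.3], ["correct", "error"])` covers the single-threshold case, and passing three thresholds with four labels covers the four-grade case.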

In one embodiment, the KL divergence can be converted directly into a score or a percentage value, i.e. the evaluation result is presented as a score or percentage, so that the result is graded more finely and the learner can clearly understand his or her pronunciation.

Specifically, when the phonemes of the voice to be tested and the template voice are completely identical, the corresponding KL divergence is assigned 100 points; when they are completely different, the corresponding KL divergence is assigned 0 points. The range of KL divergence values between these two extremes is then divided into 99 intervals, assigned 1 point to 99 points respectively, and finally the correspondence between KL divergence and score is stored.

Further, after determining the KL divergence of the voice to be tested, inquiring in the corresponding relation between the KL divergence and the score, and determining the evaluation result of the voice to be tested according to the score corresponding to the KL divergence of the voice to be tested. The specific query manner is not described in detail in this embodiment.
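One possible realization of the score correspondence is a linear mapping between two anchor values, sketched below. The text does not specify how the intermediate range is divided, so the linear interpolation, the function name `kl_to_score`, and the parameters `kl_same` and `kl_diff` (the KL divergences observed for identical and for completely different pronunciations) are all assumptions; a direct computation also stands in here for the stored lookup table.

```python
def kl_to_score(kl: float, kl_same: float, kl_diff: float) -> int:
    """Linearly map a KL divergence to an integer score from 0 to 100.

    kl_same: hypothetical anchor value that should score 100 points.
    kl_diff: hypothetical anchor value that should score 0 points.
    Values outside the anchor range are clamped to 0 or 100.
    """
    if kl_same == kl_diff:
        raise ValueError("anchor values must differ")
    fraction = (kl - kl_diff) / (kl_same - kl_diff)  # 0 at kl_diff, 1 at kl_same
    fraction = min(1.0, max(0.0, fraction))
    return round(100 * fraction)
```

Because the anchors are passed in explicitly, the same function works whichever end of the KL range is treated as the best pronunciation.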

After the evaluation result of the voice to be tested is determined, it is sent to an output device for presentation, so that the learner has an accurate picture of his or her own pronunciation. The output device may be a display device built into the computer device, such as a touch display screen; an external display device connected to the computer device through a communication line, such as a projector or digital TV; an audio playback device built into the computer device, such as an internal speaker; or an external playback device connected to the computer device through a communication line, such as earphones or an external speaker.

Further, when the evaluation result is unqualified, or after a template voice playing instruction input by the learner is received, the template voice is played through a playback device so as to provide the correct pronunciation for the learner to imitate in practice.

The voice evaluation method provided by the embodiment of the invention comprises determining a first Goodness of Pronunciation (GOP) sequence corresponding to the voice to be tested and a second GOP sequence corresponding to the template voice, determining the relative entropy (KL divergence) between the first GOP sequence and the second GOP sequence, and determining the evaluation result of the voice to be tested based on the KL divergence. The technical scheme of the embodiment exploits the fact that GOP features are probability distributions and uses the KL divergence to measure the similarity between the GOP sequence of the voice to be tested and the GOP sequence of the template pronunciation, which improves the accuracy of the evaluation result.

Fig. 3 is another voice evaluation method provided by the embodiment of the present invention, as shown in fig. 3, the voice evaluation method provided by the embodiment of the present invention mainly includes the following steps:

S21, acquiring a first phoneme state sequence and first boundary information of the voice to be detected.

In this embodiment, the pronunciation text and the voice to be measured are aligned. Further, the acoustic score of each frame in the voice to be detected is calculated by utilizing an acoustic model trained in advance, and then an optimal path is searched in an alignment network by utilizing a Viterbi algorithm, so that a first phoneme state sequence and first boundary information of the voice to be detected are obtained.

The Viterbi algorithm is a dynamic programming algorithm widely used in machine learning to find the Viterbi path, i.e. the hidden state sequence most likely to have produced the observed event sequence, especially in the context of Markov information sources and hidden Markov models. The terms "Viterbi path" and "Viterbi algorithm" are also applied to related dynamic programming algorithms that find the most likely explanation of an observation. In the embodiment of the invention, the Viterbi algorithm is used to search for the optimal path in the alignment network to obtain the first phoneme state sequence.

The acoustic model can be built as a hybrid of a deep neural network (DNN) and a hidden Markov model (HMM), i.e. a DNN-HMM acoustic model. The voice signal to be detected is input into the DNN-HMM acoustic model frame by frame, the model outputs a state posterior probability for each frame, the state posterior probability is converted into an acoustic score, and the Viterbi algorithm searches for the optimal path to obtain the first phoneme state sequence and the boundary information. The purpose of the Viterbi search is to find, in a weighted finite-state transducer (WFST) alignment network, the optimal path that matches the speech feature sequence.
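The search step can be illustrated with a textbook Viterbi decoder over log scores. This is a simplified sketch on a small dense HMM, not the WFST alignment network used in the embodiment; the function names, the argument layout, and the `NEG_INF` constant are assumptions made for the example.

```python
from typing import List, Sequence

NEG_INF = float("-inf")  # log of probability zero, for forbidden transitions


def viterbi(log_init: Sequence[float],
            log_trans: Sequence[Sequence[float]],
            log_emit: Sequence[Sequence[float]]) -> List[int]:
    """Return the most likely state index for each frame.

    log_init[s]     : log probability of starting in state s
    log_trans[r][s] : log probability of moving from state r to state s
    log_emit[t][s]  : acoustic log score of frame t under state s
    """
    n_states = len(log_init)
    score = [log_init[s] + log_emit[0][s] for s in range(n_states)]
    backptr: List[List[int]] = []
    for t in range(1, len(log_emit)):
        prev = score
        score, ptr = [], []
        for s in range(n_states):
            # Best predecessor of state s at frame t.
            best = max(range(n_states), key=lambda r: prev[r] + log_trans[r][s])
            ptr.append(best)
            score.append(prev[best] + log_trans[best][s] + log_emit[t][s])
        backptr.append(ptr)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(backptr):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


def boundaries(path: Sequence[int]) -> List[int]:
    """Frame indices at which the aligned state changes (boundary information)."""
    return [t for t in range(1, len(path)) if path[t] != path[t - 1]]
```

The returned path plays the role of the phoneme state sequence, and the frames at which the state changes give the boundary information.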

S22, determining a first GOP sequence corresponding to the voice to be detected based on the first phoneme state sequence and the first boundary information.

In this embodiment, the first phoneme state sequence and the first boundary information are known from the alignment result, and the GOP score is calculated with the GOP formula over the duration of each phoneme. The numerator of the GOP formula is the likelihood of the phoneme sequence obtained by forced alignment, and the denominator is the likelihood of the sequence obtained by free decoding of the phonemes, where free decoding refers to decoding over a phoneme loop network.

The calculation formula of the GOP is as follows:
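The formula itself is not reproduced in this text. A form consistent with the symbols defined below and with the commonly used posterior-based GOP definition, given here as an assumption rather than as the patent's exact expression, is:

$$\mathrm{GOP}(p) = \frac{1}{T}\,\log\frac{P(O \mid p)\,P(p)}{\sum_{q \in Q} P(O \mid q)\,P(q)}$$

In practice the denominator is often approximated by the likelihood of the best free-decoding path, which matches the description of the numerator and denominator given above.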

Where T is the phoneme duration, O is the speech feature sequence corresponding to the speech signal over the duration of phoneme p, Q is the phoneme set, P(O|p) is the observation probability of phoneme p, and P(p) is the prior probability of phoneme p. GOP values are calculated only for non-silent phonemes, giving a GOP score for each non-silent phoneme in the phoneme sequence.

In one embodiment, a first GOP value corresponding to each non-silent phoneme in the first phoneme state sequence is determined based on the first phoneme state sequence and the first boundary information, and the first GOP values corresponding to the non-silent phonemes are arranged in a first preset order to obtain the first GOP sequence.

The first preset order refers to the order of the phonemes of the voice to be tested in the first phoneme state sequence.
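The per-phoneme computation can be sketched in the log domain, where the ratio of the forced-alignment likelihood to the free-decoding likelihood becomes a difference of summed log-likelihoods, normalized by the phoneme duration. The input layout (per-frame log-likelihoods for both paths, and segments given as phoneme, start frame, end frame), the function name, and the `"sil"` silence label are assumptions of this example.

```python
from typing import List, Sequence, Tuple

# Hypothetical label for silence; real alignments use their own symbol.
SILENCE = "sil"


def gop_sequence(segments: Sequence[Tuple[str, int, int]],
                 forced_loglik: Sequence[float],
                 free_loglik: Sequence[float]) -> List[float]:
    """Compute one GOP value per non-silent phoneme, in alignment order.

    segments      : (phoneme, start_frame, end_frame), end exclusive, i.e. the
                    phoneme sequence together with its boundary information
    forced_loglik : per-frame log-likelihood along the forced-alignment path
    free_loglik   : per-frame log-likelihood along the free-decoding path
    """
    gops = []
    for phoneme, start, end in segments:
        if phoneme == SILENCE:
            continue  # only non-silent phonemes receive a GOP value
        duration = end - start  # T, the phoneme duration in frames
        if duration <= 0:
            raise ValueError(f"empty segment for phoneme {phoneme!r}")
        numerator = sum(forced_loglik[start:end])
        denominator = sum(free_loglik[start:end])
        gops.append((numerator - denominator) / duration)
    return gops
```

Because the segments are visited in alignment order, the resulting list is already arranged in the order of the phonemes in the phoneme state sequence.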

S23, acquiring a second phoneme state sequence and second boundary information of the template voice.

S24, determining a second GOP sequence corresponding to the template voice based on the second phoneme state sequence and the second boundary information.

In this embodiment, steps S23 and S24 are performed in the same way as steps S21 and S22; for details, refer to the description above, which is not repeated here.

In one embodiment, a second GOP value corresponding to each non-silent phoneme in the second phoneme state sequence is determined based on the second phoneme state sequence and the second boundary information, and the second GOP values corresponding to the non-silent phonemes are arranged in a second preset order to obtain the second GOP sequence.

The second preset order refers to the order of the phonemes of the template voice in the second phoneme state sequence.

S25, comparing, for each phoneme state, the GOP value in the first GOP sequence with the GOP value in the second GOP sequence.

S26, if the GOP value in the second GOP sequence is smaller than the GOP value of the first GOP sequence, replacing the GOP value of the second GOP sequence with the GOP value in the first GOP sequence to obtain a new second GOP sequence.

In this embodiment, the first GOP sequence corresponding to the voice to be tested and the second GOP sequence corresponding to the template voice have been obtained in S21-S24. Because the output probabilities of the acoustic model may be inaccurate, each phoneme GOP value in the second GOP sequence is compared with the GOP value of the corresponding phoneme in the first GOP sequence. If the value in the second GOP sequence is smaller, it is temporarily replaced with the value from the first GOP sequence, while the first GOP sequence is left unchanged; otherwise nothing is done. This keeps the scoring stable and reliable.

For example, the second GOP sequence corresponding to the template voice comprises, in order, a GOP value a1 for phoneme A, a GOP value b1 for phoneme B and a GOP value c1 for phoneme C. The first GOP sequence corresponding to the voice to be tested comprises, in order, a GOP value a2 for phoneme A, a GOP value b2 for phoneme B and a GOP value c2 for phoneme C. Phoneme-by-phoneme comparison means that a1 is compared with a2, b1 with b2, and c1 with c2. If a1 is less than a2, a1 in the second GOP sequence is replaced with a2. If a1 is greater than or equal to a2, a1 is left unchanged.

Further, suppose the second GOP sequence is (a1, b1, c1) and the first GOP sequence is (a2, b2, c2). If a1 is smaller than a2, b1 is not smaller than b2, and c1 is not smaller than c2, then after the comparison the second GOP sequence is (a2, b1, c1) and the first GOP sequence remains (a2, b2, c2).
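The replacement in S25-S26 amounts to an element-wise maximum over the two sequences. A minimal sketch follows; the function name and the numeric values are illustrative and not taken from the patent.

```python
from typing import List, Sequence


def clamp_template_gop(test_gop: Sequence[float],
                       template_gop: Sequence[float]) -> List[float]:
    """Return a new template sequence in which every value that is smaller
    than the corresponding test value is replaced by that test value."""
    if len(test_gop) != len(template_gop):
        raise ValueError("GOP sequences must have the same length")
    # The test sequence is only read; the template value is raised if needed.
    return [max(t, s) for s, t in zip(test_gop, template_gop)]


first = [0.7, 0.5, 0.4]    # a2, b2, c2 (voice to be tested)
second = [0.6, 0.9, 0.8]   # a1, b1, c1 (template voice)
new_second = clamp_template_gop(first, second)  # a1 < a2, so a1 becomes a2
```

Returning a new list rather than modifying the template in place reflects the "temporary" nature of the replacement: the stored template values stay available for the next utterance.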

S27, calculating the KL divergence between the first GOP sequence and the second GOP sequence through a KL divergence formula.

In this embodiment, the KL divergence is calculated for the updated GOP sequences, which are of equal length. KL divergence, also called relative entropy, measures the degree of difference between two probability distributions P and Q, where P typically represents the true or observed distribution of the data and Q represents a theoretical distribution, an estimate, or an approximating model of it.

Wherein, the KL divergence formula is:

$$D_{KL}(P \,\|\, Q) = \sum_{i} P(i)\,\log\frac{P(i)}{Q(i)}$$

Wherein $D_{KL}(P \,\|\, Q)$ is the KL divergence between the first GOP sequence and the second GOP sequence, P(i) is the i-th GOP value in the first GOP sequence, and Q(i) is the i-th GOP value in the second GOP sequence.
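A minimal sketch of this calculation, applied directly to two equal-length GOP sequences, is given below. It assumes that all GOP values are strictly positive and applies no normalization; both are assumptions of this sketch rather than statements in the text.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Sum of p[i] * log(p[i] / q[i]) over two equal-length sequences."""
    if len(p) != len(q):
        raise ValueError("sequences must have the same length")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i <= 0.0 or q_i <= 0.0:
            # Assumption: GOP values behave like probabilities, so they are > 0.
            raise ValueError("GOP values must be strictly positive")
        total += p_i * math.log(p_i / q_i)
    return total
```

Identical sequences give a value of zero. Because the sequences are not normalized to sum to one, the result can be negative, unlike the KL divergence of two proper probability distributions.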

S28, determining an evaluation result of the voice to be tested based on the KL divergence.

In this embodiment, a sentence-level voice evaluation method based on KL divergence is used. Because the GOP distributions of different phonemes differ markedly, evaluating how similar the sample under test is to the template, rather than simply taking a weighted average of the phoneme-level results, effectively avoids the influence of differences between the GOP values of different phonemes. Using the fact that GOP features are probability distributions, the KL divergence measures the similarity between the GOP sequence of the voice to be tested and the GOP sequence of the template pronunciation, which ensures the accuracy of the evaluation result.

In an application example, fig. 4 is a flowchart of another voice evaluation method according to an embodiment of the present invention. As shown in fig. 4, the voice evaluation method provided in this embodiment mainly includes:

S31, forcedly aligning the text and the audio to acquire a phoneme state sequence and boundary information.

In this embodiment, the audio includes template audio and audio to be tested. Correspondingly, the phoneme state sequence comprises a first phoneme state sequence corresponding to the voice to be detected and a second phoneme state sequence corresponding to the template voice. The boundary information comprises first boundary information corresponding to the voice to be detected and second boundary information corresponding to the template voice.

S32, calculating GOP values of the non-mute phonemes.

In this embodiment, the same calculation process is applied to both the template voice and the voice to be tested, yielding a template GOP sequence and a GOP sequence of the voice to be tested. For the specific calculation process, refer to the description in the above embodiment; it is not limited in this embodiment.

S33, judging, phoneme by phoneme, whether the template GOP value is smaller than the GOP value of the voice to be tested; if so, executing step S34; if not, executing step S35.

S34, replacing the template GOP value with the GOP value of the voice to be tested, and executing step S35.

S35, calculating KL divergence between the template GOP sequence and the GOP sequence to be detected.

The specific KL divergence calculation method may refer to the description in the above embodiment, and is not limited in this embodiment.

S36, judging whether the KL divergence is larger than a preset threshold value, if so, executing S37, and if not, executing S38.

S37, the sentence is pronounced correctly.

S38, the sentence is pronounced incorrectly.

In this embodiment, using the fact that GOP features are probability distributions, the KL divergence measures the similarity between the GOP sequence of the voice to be tested and the GOP sequence of the template pronunciation, which ensures the accuracy of the evaluation result.
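Steps S33 to S38 can be sketched end to end as follows. The function name, the example GOP values and the threshold of -0.1 are hypothetical. The sketch assumes strictly positive GOP values and uses "greater than or equal to" for the threshold test, as in the device description and claim 5. Under that assumption every term of the sum is at most zero once the template value has been raised to at least the test value, so the result is at most zero and approaches zero as the two sequences become more similar, which is why a larger value is judged correct.

```python
import math
from typing import Sequence


def evaluate_sentence(test_gop: Sequence[float],
                      template_gop: Sequence[float],
                      threshold: float) -> bool:
    """Return True if the sentence is judged to be pronounced correctly."""
    if len(test_gop) != len(template_gop):
        raise ValueError("GOP sequences must have the same length")
    kl = 0.0
    for p, q in zip(test_gop, template_gop):
        if p <= 0.0 or q <= 0.0:
            # Assumption of this sketch: GOP values are strictly positive.
            raise ValueError("GOP values must be strictly positive")
        q = max(p, q)              # S33/S34: raise the template value if smaller
        kl += p * math.log(p / q)  # S35: accumulate the KL term
    return kl >= threshold         # S36: S37 if True, S38 if False


THRESHOLD = -0.1  # hypothetical value; a real threshold would be tuned on data
close_to_template = evaluate_sentence([0.85, 0.88], [0.9, 0.9], THRESHOLD)
far_from_template = evaluate_sentence([0.7, 0.5, 0.4], [0.6, 0.9, 0.8], THRESHOLD)
```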

Fig. 5 is a schematic structural diagram of a voice evaluation device provided by an embodiment of the present invention, where the device is suitable for detecting whether a learner pronounces correctly, and the voice evaluation device may be implemented by hardware and/or software. The speech evaluation device may be composed of two or more physical entities, or may be composed of one physical entity, and is generally integrated in a computer device.

As shown in fig. 5, the voice evaluation device provided by the embodiment of the invention mainly includes a GOP sequence determining module 51, a KL divergence determining module 52 and an evaluation result determining module 53.

The GOP sequence determining module 51 is configured to determine a first goodness-of-pronunciation (GOP) sequence corresponding to the voice to be tested and a second GOP sequence corresponding to the template voice;

a KL divergence determining module 52 for determining a relative entropy KL divergence between the first GOP sequence and the second GOP sequence;

And an evaluation result determining module 53, configured to determine an evaluation result of the voice to be tested based on the KL divergence.

The voice evaluation device is used for determining a first goodness-of-pronunciation (GOP) sequence corresponding to the voice to be tested and a second GOP sequence corresponding to the template voice, determining the relative entropy (KL divergence) between the first GOP sequence and the second GOP sequence, and determining an evaluation result of the voice to be tested based on the KL divergence. The technical scheme of this embodiment uses the fact that GOP features are probability distributions: the KL divergence measures the similarity between the GOP sequence of the voice to be tested and the GOP sequence of the template pronunciation, which improves the accuracy of the evaluation result.

Further, the GOP sequence determining module 51 includes a first GOP sequence determining unit and a second GOP sequence determining unit;

The first GOP sequence determining unit is used for acquiring a first phoneme state sequence and first boundary information of the voice to be detected;

and the second GOP sequence determining unit is used for acquiring a second phoneme state sequence and second boundary information of the template voice and determining a second GOP sequence corresponding to the template voice based on the second phoneme state sequence and the second boundary information.

Further, a first GOP sequence determining unit is specifically configured to determine a first GOP value corresponding to each non-mute phoneme in the first phoneme state sequence based on the first phoneme state sequence and the first boundary information;

The second GOP sequence determining unit is specifically configured to determine a second GOP value corresponding to each non-silent phoneme in the second phoneme state sequence based on the second phoneme state sequence and the second boundary information, and arrange the second GOP values corresponding to the plurality of non-silent phonemes according to a second preset order to obtain a second GOP sequence.

Further, the second GOP sequence determining unit is further specifically configured to compare, for each phoneme state, a GOP value in the first GOP sequence with a GOP value of the second GOP sequence;

Further, the KL divergence determining module 52 is specifically configured to calculate the KL divergence between the first GOP sequence and the second GOP sequence by a KL divergence formula.

Wherein, the KL divergence formula is:

$$D_{KL}(P \,\|\, Q) = \sum_{i} P(i)\,\log\frac{P(i)}{Q(i)}$$

Wherein P(i) is the i-th GOP value in the first GOP sequence and Q(i) is the i-th GOP value in the second GOP sequence.

Further, the evaluation result determining module 53 is specifically configured to determine that the evaluation result of the to-be-tested voice is correct in pronunciation when the KL divergence is greater than or equal to a preset threshold value, and determine that the evaluation result of the to-be-tested voice is incorrect in pronunciation when the KL divergence is less than the preset threshold value.

The voice evaluation device provided by the embodiment of the invention can execute the voice evaluation method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.
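The division into modules 51, 52 and 53 could be mirrored in software roughly as below. The class name, field names and the stub functions in the usage lines are hypothetical; the sketch only illustrates how the three modules hand results to one another.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

GopSequences = Tuple[List[float], List[float]]


@dataclass
class SpeechEvaluationDevice:
    """Wires together three injected callables, one per module."""
    determine_gop_sequences: Callable[[object, object], GopSequences]  # module 51
    determine_kl: Callable[[Sequence[float], Sequence[float]], float]  # module 52
    determine_result: Callable[[float], bool]                          # module 53

    def evaluate(self, test_speech: object, template_speech: object) -> bool:
        first, second = self.determine_gop_sequences(test_speech, template_speech)
        kl = self.determine_kl(first, second)
        return self.determine_result(kl)


# Usage with stand-in modules (placeholders, not real signal processing).
device = SpeechEvaluationDevice(
    determine_gop_sequences=lambda test, template: ([0.5], [0.5]),
    determine_kl=lambda p, q: 0.0,
    determine_result=lambda kl: kl >= -0.1,
)
result = device.evaluate("test.wav", "template.wav")
```

Injecting the modules as callables keeps the functional division flexible, in line with the remark that the units are divided by functional logic only.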

Fig. 6 is a schematic diagram of the hardware structure of an apparatus according to an embodiment of the present invention. As shown in fig. 6, the apparatus includes a processor 601, a memory 602, an input device 603 and an output device 604. The number of processors 601 in the apparatus may be one or more; one processor 601 is taken as an example in fig. 6. The processor 601, the memory 602, the input device 603 and the output device 604 in the apparatus may be connected by a bus or in other manners; connection by a bus is taken as an example in fig. 6.

The memory 602 is used as a computer readable storage medium, and may be used to store a software program, a computer executable program, and modules, such as program instructions/modules corresponding to the voice evaluation method in the embodiment of the present invention (for example, the modules in the voice evaluation device shown in fig. 5 include the GOP sequence determining module 51, the KL divergence determining module 52, and the evaluation result determining module 53). The processor 601 executes various functional applications of the device and data processing by running software programs, instructions and modules stored in the memory 602, i.e., implements the above-described voice evaluation method.

The memory 602 may mainly include a storage program area that may store an operating system, application programs required for at least one function, and a storage data area that may store data created according to the use of the device, etc. In addition, the memory 602 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some examples, memory 602 may further include memory located remotely from processor 601, which may be connected to the device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

And, when one or more programs included in the above-described apparatus are executed by the one or more processors 601, the programs perform the operations of:

The input device 603 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the device. The output device 604 may include a display device such as a display screen.

The embodiment of the invention also provides a computer readable storage medium, on which a computer program is stored, which when executed by a processing device, implements the voice evaluation method provided by the embodiment of the invention, the method comprising:

Of course, the storage medium containing the computer executable instructions provided in the embodiments of the present invention is not limited to the method operations described above, and may also perform the related operations in the voice evaluation method provided in any embodiment of the present invention.

From the above description of embodiments, it will be clear to a person skilled in the art that the present invention may be implemented by means of software and necessary general purpose hardware, but of course also by means of hardware, although in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a FLASH Memory (FLASH), a hard disk, or an optical disk of a computer, etc., and include several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments of the present invention.

It should be noted that, in the embodiment of the voice evaluation device, each unit and module included are only divided according to the functional logic, but not limited to the above division, so long as the corresponding functions can be implemented, and the specific names of the functional units are only for convenience of distinguishing each other, and are not used for limiting the protection scope of the present invention.

Note that the above is only a preferred embodiment of the present invention and the technical principle applied. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, while the invention has been described in connection with the above embodiments, the invention is not limited to the embodiments, but may be embodied in many other equivalent forms without departing from the spirit or scope of the invention, which is set forth in the following claims.

Claims

1. A voice evaluation method, comprising:

Determining the relative entropy KL divergence between the first GOP sequence and the second GOP sequence, wherein the KL divergence is used for measuring the difference degree of the first GOP sequence and the second GOP sequence;

determining an evaluation result of the voice to be tested based on the KL divergence;

Determining a first GOP sequence corresponding to the voice to be detected and a second GOP sequence corresponding to the template voice, wherein the method comprises the following steps:

Determining a second GOP sequence corresponding to the template voice based on the second phoneme state sequence and second boundary information;

After determining a second GOP sequence corresponding to the template speech based on the second phoneme state sequence and the second boundary information, the method further includes:

2. The method of claim 1 wherein determining a first GOP sequence corresponding to a speech under test based on the first phoneme state sequence and first boundary information and a second GOP sequence corresponding to a template speech based on the second phoneme state sequence and second boundary information comprises:

3. The method of claim 1, wherein determining the relative entropy KL divergence between the first GOP sequence and the second GOP sequence comprises:

4. The method according to claim 1, wherein the KL-divergence formula is:

$$D_{KL}(P \,\|\, Q) = \sum_{i} P(i)\,\log\frac{P(i)}{Q(i)}$$

5. The method according to claim 1, wherein determining the evaluation result of the voice to be tested based on the KL divergence includes: determining that the evaluation result of the voice to be tested is correct pronunciation when the KL divergence is greater than or equal to a preset threshold value, and determining that the evaluation result of the voice to be tested is incorrect pronunciation when the KL divergence is less than the preset threshold value.

6. A voice evaluation device, comprising:

The KL divergence determining module is used for determining the KL divergence of the relative entropy between the first GOP sequence and the second GOP sequence, wherein the KL divergence is used for measuring the difference degree of the first GOP sequence and the second GOP sequence;

the evaluation result determining module is used for determining an evaluation result of the voice to be tested based on the KL divergence;

the GOP sequence determining module comprises a first GOP sequence determining unit and a second GOP sequence determining unit;

The second GOP sequence determining unit is used for acquiring a second phoneme state sequence and second boundary information of the template voice;

the second GOP sequence determining unit is further specifically configured to compare, for each phoneme state, a GOP value in the first GOP sequence with a GOP value of the second GOP sequence;

7. A voice evaluating apparatus, characterized by comprising:

one or more processors;

a memory for storing one or more programs;

The one or more programs are executed by the one or more processors to cause the one or more processors to implement the speech assessment method of any of claims 1-5.

8. A computer-readable storage medium, on which a computer program is stored, characterized in that the program, when executed by a processor, implements the speech assessment method according to any one of claims 1-5.