RU2343564C2

RU2343564C2 - Method of voice signal variable-structure system-based adaptive encoding

Info

Publication number: RU2343564C2
Application number: RU2006143249/09A
Authority: RU
Inventors: Андрей Алексеевич Афанасьев (RU); Геннадий Васильевич Богачев (RU); Олег Олегович Басов (RU)
Priority date: 2006-12-06
Filing date: 2006-12-06
Publication date: 2009-01-10
Also published as: RU2006143249A

Abstract

FIELD: physics, communication.

SUBSTANCE: the invention relates to telecommunication systems and is intended for encoding voice signals using a variable-structure system. The proposed encoding method classifies segments of the input voice signal into six classes (pause, tone segment, noise segment of the first type, noise segment of the second type, transition segment of the first type, transition segment of the second type) and encodes the recognised segments by different methods, varying the structure of the coding system.

EFFECT: improved quality of synthesised voice signal at fixed low transmission rate.

2 dwg

Description

The proposed invention is intended for encoding speech signals using a variable-structure system, the purpose of which is to reduce the redundancy of the transmitted information.

Methods are known for encoding speech signals based on linear prediction with various excitation signals for the synthesis filter, using vector quantization of the excitation signals and of the parameters that describe the spectral envelope of the speech signal, for example [1, 2].

The drawback of these methods is their limited ability to adapt to the properties of the speech signal being processed, which results in insufficient quality of the reconstructed signal at the receiver. In these algorithms only the encoder parameters change during encoding, while the encoder structure stays fixed. The fixed structuring of the space of encoded parameters and the constant cardinalities of the representation subspaces (the prediction order for linear prediction, the codebook size for vector quantization, the length of the encoded vector), which are inherent in existing algorithms and expressed in the fixed structure of the codec, do not create the conditions needed to make full use of the available a priori information about the speech signal, which prevents further optimization of the codec.

A method is known for encoding speech signals based on linear prediction that depends on the type of the speech segment being processed [3]. It improves the quality of the synthesized signal by classifying the processed frames into two disjoint classes, voiced and unvoiced speech, and encoding segments of different classes by different methods. The drawbacks of this method are the small number of classes into which the speech signal is divided and the adaptive redistribution of the cardinalities of the representation subspaces of the encoded parameters within a fixed encoder structure, which results in insufficient quality of the reconstructed signal at the receiver. Current requirements for the digital representation of speech signals call for more thorough processing of the speech signal. Under these conditions, methods that change only the encoder parameters according to the characteristics of speech become unacceptable and do not provide sufficient quality when encoding the speech signal.

The proposed speech conversion method solves the problem of improving the quality of the synthesized speech signal without increasing the transmission rate.

This technical result is achieved as follows. In real time, the input speech signal is divided along the time axis into segments, and each segment is recognized as a pause, a tone segment, a noise segment of the first type, a noise segment of the second type, a transition segment of the first type, or a transition segment of the second type, according to the following classification procedure (Fig. 1).

At the first stage of classification, the signal is divided into active sections and pauses. The decision criterion is the relation:

where N is the number of samples in the speech segment being processed;

s_i is a sample of the speech signal;

P₀ is the threshold value of the power characteristic, determined experimentally.

If inequality (1) holds, segment 1 being processed is assigned to the class of pauses 2. Otherwise, the segment is assigned to the class of active segments 3.
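The first-stage decision can be sketched as follows. The exact form of relation (1) is not reproduced in the text, so this sketch assumes that the power characteristic P is the mean-square value of the segment and that a pause is declared when P < P₀. The function names `mean_power` and `is_pause` and the strict comparison are illustrative assumptions.

```python
from typing import Sequence


def mean_power(samples: Sequence[float]) -> float:
    """Assumed power characteristic P: (1/N) * sum of s_i squared."""
    n = len(samples)
    if n == 0:
        raise ValueError("segment must contain at least one sample")
    return sum(s * s for s in samples) / n


def is_pause(samples: Sequence[float], p0: float) -> bool:
    """Assumed form of criterion (1): the segment is a pause when P < P0.

    p0 is the experimentally chosen threshold; no value is given in the text.
    """
    return mean_power(samples) < p0
```

A segment for which `is_pause` returns `False` would be passed on as an active segment to the second stage.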

At the second stage of classification, active speech segments 3 are divided into four types: tone 7, noise 4, transition of the first type 5, and transition of the second type 6. For this division, the tone/noise (T/N) parameter and the fundamental (pitch) frequency F_0 are computed over the analysis segment. The T/N and F_0 values are computed jointly, from analysis of the autocorrelation function (ACF) of the speech signal and from the Itakura-Saito method. Using the two methods together reduces the probability of misclassifying speech segments. The decision rules for the segment type are formulated as follows.

Tone segments 7 are those for which:

- the ACF analysis method identifies the segment as tonal;

- the Itakura-Saito method identifies the segment as tonal.

Noise segments 4 are those for which:

- the ACF analysis method identifies the segment as noise;

- the Itakura-Saito method identifies the segment as noise.

Transition segments of the first type 5 are those for which:

Transition segments of the second type 6 are those for which:
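The way the two detectors' decisions combine can be sketched as below. The rules for tone and noise segments follow the text: both methods must agree. The conditions for the two transition classes are not reproduced in the text, so the sketch assumes that a transition is declared when the methods disagree. Which disagreement corresponds to which type is purely an assumption here, as are the names `ActiveSegment` and `classify_active`.

```python
from enum import Enum


class ActiveSegment(Enum):
    TONE = "tone"
    NOISE = "noise"
    TRANSITION_1 = "transition, first type"
    TRANSITION_2 = "transition, second type"


def classify_active(acf_says_tone: bool, itakura_saito_says_tone: bool) -> ActiveSegment:
    """Combine the tone/noise decisions of the two analysis methods."""
    if acf_says_tone and itakura_saito_says_tone:
        return ActiveSegment.TONE       # both methods agree: tone
    if not acf_says_tone and not itakura_saito_says_tone:
        return ActiveSegment.NOISE      # both methods agree: noise
    # ASSUMPTION: the methods disagree, so the segment is transitional.
    # The assignment of each disagreement to a type is hypothetical.
    if acf_says_tone:
        return ActiveSegment.TRANSITION_1
    return ActiveSegment.TRANSITION_2
```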

At the third stage of classification, noise segments 4 are split into two classes according to the envelope coefficient and the power characteristic of the signal (1). The decision rule is given by the relation:

where P is determined by the left-hand side of expression (1);

α₀ is the threshold value of the complexity coefficient of the encoded segment, determined experimentally;

η is the envelope coefficient of the encoded signal, defined as:

If inequality (2) holds, the segment being processed is assigned to noise segments of the first type 8; otherwise it is assigned to noise segments of the second type 9.

The segment of the input speech signal is then encoded. If the segment is identified as a pause, a noise segment of the first type, or a noise segment of the second type, it is encoded by waveform coding. If the segment is identified as a tone segment, a transition segment of the first type, or a transition segment of the second type, the short-term prediction residuals of the input speech signal are found and encoded using sinusoidal analytic coding.

Thus, according to the statistical and parametric characteristics obtained, the encoder structure (block 10 or 11) that gives the minimum distortion of the speech signal is selected.
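The choice of encoder structure by segment class can be sketched as a simple lookup. The mapping of classes to coding methods follows the text; the type names `Segment` and `CoderStructure` and the function `select_structure` are illustrative assumptions, and the coders themselves are not modelled.

```python
from enum import Enum, auto


class Segment(Enum):
    PAUSE = auto()
    TONE = auto()
    NOISE_1 = auto()
    NOISE_2 = auto()
    TRANSITION_1 = auto()
    TRANSITION_2 = auto()


class CoderStructure(Enum):
    WAVEFORM = auto()             # waveform coding of the segment
    SINUSOIDAL_RESIDUAL = auto()  # sinusoidal coding of the short-term prediction residual


# Classes that the text assigns to waveform coding.
_WAVEFORM_CLASSES = frozenset({Segment.PAUSE, Segment.NOISE_1, Segment.NOISE_2})


def select_structure(segment: Segment) -> CoderStructure:
    """Pick the coding structure for one classified segment."""
    if segment in _WAVEFORM_CLASSES:
        return CoderStructure.WAVEFORM
    return CoderStructure.SINUSOIDAL_RESIDUAL
```

Because the structure is chosen per segment, the selected structure type has to be signalled to the decoder together with the coded parameters.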

The drawings (Figs. 1 and 2) illustrate the essence of the proposed solution: Fig. 1 shows a variant of the classification of recognized speech segments according to the proposed solution, and Fig. 2 shows a block diagram of a speech signal encoding device based on a variable-structure system.

The proposed speech signal conversion method can be implemented in a speech signal encoding device (Fig. 2).

The source speech signal is fed to PCM encoder 12, which converts the analog signal to digital form according to ITU Recommendation G.711. In unit 13, which forms and pre-processes the speech analysis segment, the digitized speech signal is segmented into equal subframes whose length equals the quasi-stationarity interval. The subframes are then fed in sequence to speech/pause analyzer 14, statistical and parametric characteristics extractor 15, subframe forming unit 24, and linear predictor structure and parameter control unit 26. Speech/pause analyzer 14 divides the speech into active segments and pauses. Segments classified as active are passed on to tone/noise analysis unit 18 for further analysis, and control signals carrying the decision (speech/pause) are sent from this unit to extractor 15 and to codec structure control subsystem 17. Unit 15 extracts the statistical and parametric characteristics of a speech segment when the segment has been classified as active speech. Subframe forming unit 24 selects, within the analysis segment, the subframes for the vector quantization procedure 30; its results are fed to vector quantizer structure control unit 25 and to vector quantizer 30. Tone/noise analysis unit 18 derives the tone/noise decision for the analysis segment when unit 14 has classified it as active speech. If noise is detected, a control signal carrying this decision is sent to unit 17; otherwise (tone detected), the control signal is sent to unit 19.

Units 17 and 19 implement the encoder structure control subsystem. Unit 17 controls the encoder structure according to the classification decisions that assign the speech segment to pauses and noise segments, while unit 19 uses the information from unit 18 about the activity and tonality of the speech segment. Information signals from unit 19 are fed to ACF-based pitch frequency extractor 20 and Itakura-Saito pitch frequency extractor 21. Units 20 and 21 extract the pitch frequency from analysis of the autocorrelation function of the speech analysis segment and by the Itakura-Saito method, respectively. The results are passed to pitch frequency correction unit 22, which corrects the pitch frequency value so that speech frame classifier 16 can decide the type of the segment being processed and so that linear predictor structure and parameter control unit 26 and vector quantizer structure control unit 25 can select optimal operating modes. Classifier 16 thus receives information signals from the output of statistical and parametric characteristics extractor 15 and from the output of pitch frequency correction unit 22. The classification results from unit 16 are fed to encoder control subsystem 23, which determines the encoding mode according to the segment classification result. The outputs of this unit are control signals for subframe forming unit 24, vector quantizer structure control unit 25, and linear predictor structure and parameter control unit 26.
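The text says only that unit 20 derives the pitch frequency from the autocorrelation function of the analysis segment. A common textbook way to do this, given here as a minimal sketch and not as the device's actual procedure, is to pick the lag with the largest autocorrelation inside a plausible pitch range. The function names, the 60-400 Hz search range, and the 0.3 voicing threshold are all assumptions.

```python
import math
from typing import Optional, Sequence


def autocorrelation(samples: Sequence[float], lag: int) -> float:
    """Unnormalized autocorrelation of the segment at the given lag."""
    return sum(samples[i] * samples[i + lag] for i in range(len(samples) - lag))


def estimate_pitch_acf(
    samples: Sequence[float],
    sample_rate: float,
    f_min: float = 60.0,             # assumed lower bound of the pitch search, Hz
    f_max: float = 400.0,            # assumed upper bound of the pitch search, Hz
    voicing_threshold: float = 0.3,  # assumed minimum normalized ACF peak for "tone"
) -> Optional[float]:
    """Return the pitch estimate in Hz, or None if the segment looks like noise."""
    r0 = autocorrelation(samples, 0)
    if r0 <= 0.0:
        return None  # empty or silent segment
    lag_min = max(1, int(sample_rate / f_max))
    lag_max = min(len(samples) - 1, int(sample_rate / f_min))
    if lag_min > lag_max:
        return None  # segment too short for the search range
    best_lag = max(range(lag_min, lag_max + 1), key=lambda lag: autocorrelation(samples, lag))
    if autocorrelation(samples, best_lag) / r0 < voicing_threshold:
        return None  # peak too weak: treat the segment as noise
    return sample_rate / best_lag


# Usage: a 100 Hz sine sampled at 8 kHz (the G.711 sampling rate), 50 ms long.
segment = [math.sin(2 * math.pi * 100 * n / 8000) for n in range(400)]
pitch_hz = estimate_pitch_acf(segment, 8000.0)
```

Returning `None` for a weak peak gives a tone/noise decision alongside the pitch value, which is the pair of outputs the description attributes to the ACF analysis.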

According to the classification results, unit 25 controls the operation of vector quantizer 30 and selects, from codebooks of various structures 27, those that best match the speech subframe being encoded. Unit 26 controls the structure and parameters of the linear predictor. The linear predictor operates with codebooks of short-term linear prediction parameters 31 and codebooks of long-term linear prediction parameters 32, with short-term linear prediction parameter calculation unit 28 and long-term linear prediction parameter calculation unit 29, and with short-term linear analysis unit 33 and long-term linear analysis unit 34. Units 33 and 34 carry out the linear prediction procedures directly, using the linear prediction parameters selected from the corresponding codebooks as the closest match to the calculated ones. Unit 26 also interacts with unit 36, which selects the best structure and parameters of the linear predictor on the basis of the analysis-by-synthesis procedure and the results of the control actions on the encoder structure. The information signals produced by vector quantization (unit 30) and/or linear prediction (unit 36) are fed to the input of encoder output sequence forming unit 35, which forms the transmission frame of the encoding device.

Decoding at the receiving side consists of extracting from the received transmission frame sequence the information about the structure type and the parameters of the encoded speech signal, selecting the corresponding decoder structure, and reconstructing the speech signal from the received excitation signal and the parameters of the synthesizer.

The above shows that introducing into the coding system a procedure for classifying speech segments into six types (pause, tone segment, noise segment of the first type, noise segment of the second type, transition segment of the first type, transition segment of the second type) and encoding the recognized segments of the input speech signal by different methods, by changing the structure of the coding system, makes it possible to improve the quality of the synthesized speech signal without increasing the transmission rate.

Information sources

1. Ustinov A.A., Tyulegenev A.O., Danilyuk V.V. Patent No. 2152646, cl. 7 G10L 21/00. Method for compression and restoration of speech signals. Bull. No. 19, 10.07.2000.

2. Kostrov V.V., Dyranov Yu.V., Fabrichny S.Yu. Patent No. 2166804, cl. 7 G10L 13/02. Method of speech conversion and device for its implementation. Bull. No. 13, 10.05.2001.

3. Nishiguchi M., Iijima K., Matsumoto J., Omori S. Patent No. 2233010, cl. 7 G10L 19/06. Methods and devices for encoding and decoding speech signals. Bull. No. 20, 20.07.2004.

Claims

A method of adaptive encoding of speech signals based on a variable-structure system, in which the input speech signal is divided into segments along the time axis, short-term prediction residuals of the input speech signal are found, the input speech signal is recognized as voiced or unvoiced, and the short-term prediction residuals are encoded using sinusoidal analytic coding if a part of the input speech signal is determined to be voiced, or the input speech signal is encoded by waveform coding if a part of the input speech signal is determined to be unvoiced, characterized in that the segments of the input speech signal are recognized as a pause, a tone segment, a noise segment of the first type, a noise segment of the second type, a transition segment of the first type, or a transition segment of the second type, and the segment of the input speech signal is then encoded by waveform coding if it is identified as a pause, a noise segment of the first type, or a noise segment of the second type, or short-term prediction residuals of the input speech signal are found and encoded using sinusoidal analytic coding if the segment is identified as a tone segment, a transition segment of the first type, or a transition segment of the second type.