CN109872720B

CN109872720B - Re-recorded speech detection algorithm robust to different scenarios based on a convolutional neural network

Info

Publication number: CN109872720B
Application number: CN201910085725.8A
Authority: CN
Inventors: 王泳; 赵雅珺; 张梦鸽
Original assignee: Guangdong Polytechnic Normal University
Current assignee: Guangdong Polytechnic Normal University
Priority date: 2019-01-29
Filing date: 2019-01-29
Publication date: 2022-11-22
Anticipated expiration: 2039-01-29
Also published as: CN109872720A

Abstract

The invention discloses a re-recorded speech detection algorithm, based on a convolutional neural network, that is robust to different scenarios, and relates to the field of speech detection algorithms. A speech time-frequency map is input into the algorithm model. The model comprises seven layers, each containing a convolutional layer and a pooling layer. The output of each convolutional layer passes through a rectified linear unit (ReLU), and residual connections are added between layers. The final features are extracted by global pooling, and the detection result is predicted by a sigmoid. The invention uses the time-frequency map as the input to the network. Compared with raw speech data, the time-frequency map has a relatively dense distribution of the feature information introduced by the re-recording device, which makes feature extraction by the neural network easier, thereby speeding up training and improving accuracy.

Description

A Re-recorded Speech Detection Algorithm Robust to Different Scenes Based on Convolutional Neural Network

Technical Field

The present invention relates to the field of speech detection algorithms, and more specifically to a re-recorded speech detection algorithm, based on a convolutional neural network, that is robust to different scenarios.

Background Art

Studies have shown that spoofed speech, such as voice conversion (VC), speech synthesis (SS), and re-recorded speech, can effectively deceive automatic speaker recognition (ASV) systems, allowing an attacker to impersonate another person and log in to a system. Re-recorded speech causes ASV systems to produce a higher false acceptance rate and poses a serious threat to public security. VC and SS require a large amount of speech information and features from the target speaker, and the existing algorithms are not yet fully mature, so they are relatively costly and difficult to carry out. Re-recorded speech, by contrast, can be obtained easily with cheap recording equipment and retains essentially all the characteristics of the target speaker's voice, so it is a greater threat than VC and SS. The detection of re-recorded speech therefore deserves attention.

ASV (automatic speaker recognition) systems are used more and more widely in practice, for example in access control systems, telephone banking, and military applications. Because speaker verification requires no face-to-face contact, ASV systems are highly vulnerable to spoofed speech. Spoofed speech produced by audio equipment threatens ASV systems and degrades their security. Over the past decade or so, digital audio products have multiplied in type, and the functions integrated in them have grown in number and power. The same or similar effects can now be achieved with relatively cheap devices, such as a personal computer with audio processing software installed or a PDA with audio processing capability. For example, spoofed speech produced with a high-quality, low-cost recording device such as a smartphone poses a risk to ASV systems. Spoofed speech includes replay attacks, voice conversion, and speech synthesis. Attackers use spoofed speech to forge feature data and gain illegitimate access to a system under another identity, after which the user's files and private data can be stolen, causing irreparable losses. Among these, replay attacks are a greater threat than voice conversion and speech synthesis. A replay attack uses speech samples collected from the actual target speaker, in the form of continuous pre-recorded speech. A replay-based spoofing attack requires no technical processing of the speech: the genuine speech of the target speaker and the replayed speech have exactly the same spectral and high-level features, which makes replay the easiest type of speech attack to mount. Synthesized and converted speech, in contrast, deviate somewhat from the genuine speech of the target speaker and are not identical to it, so detecting replay attacks is harder than detecting synthesized or converted speech.

Summary of the Invention

To overcome the above defects of the prior art, embodiments of the present invention provide a re-recorded speech detection algorithm, based on a convolutional neural network, that is robust to different scenarios. The time-frequency map is used as the input to the network. Compared with raw speech data, the time-frequency map has a relatively dense distribution of the feature information introduced by the re-recording device, which makes feature extraction by the neural network easier, thereby speeding up training and improving accuracy. The algorithm detects re-recorded speech with high accuracy across different recording devices, recording environments, and recording distances.

To achieve the above object, the present invention provides the following technical solution: a re-recorded speech detection algorithm, based on a convolutional neural network, that is robust to different scenarios, comprising the following steps:

a. Original speech is collected with a recording device and passed through DA/AD conversion to obtain re-recorded speech;

b. The original speech is distorted during the conversion, and the distortion of the original speech is computed with a distortion model, whose expression is:

y(t) = λx(αt) + η,

where y(t) is the re-recorded speech, x(t) is the original speech, λ is the amplitude scaling factor, α is the linear time-axis scaling factor, and η is the additive noise;

The corresponding frequency-domain expression is:

Y(jω) = λX(jαω) + N(jω),

where Y(jω), X(jω), and N(jω) are the frequency-domain representations of y(t), x(t), and η, respectively. For a fixed recording device these characteristics are very stable, that is, λ and α are constants;

c. The re-recorded speech is converted into a speech time-frequency map by the short-time Fourier transform;

d. The speech time-frequency map is input into the algorithm model. The model comprises seven layers, each containing a convolutional layer and a pooling layer. The output of each convolutional layer passes through a rectified linear unit (ReLU), and residual connections are added between layers. The final features are extracted by global pooling, and the detection result is predicted by a sigmoid.

In a preferred embodiment, when the re-recorded speech is transformed, the short-time Fourier transform uses a Hamming (hanning) window of length 126 with a step size of 50, and the time-frequency map has size 64x62.

In a preferred embodiment, the algorithm model convolves along the frequency dimension and pools along the time dimension, specifically with 3x1 convolution kernels and 1x2 pooling. This matches the feature distribution of the speech time-frequency map, which is independent between adjacent speech frames and consistent within specific frequency bands.

In a preferred embodiment, the algorithm model uses deep learning as a data-driven technique.

In a preferred embodiment, the re-recording device introduces changes in the frequency domain of the original speech signal, and the deep learning model takes the original audio signal as the input data of the network.

In a preferred embodiment, when the algorithm model convolves along the frequency dimension it does not consider correlation along the time dimension, and pooling along the time dimension is performed together with the convolution along the frequency dimension.

In a preferred embodiment, the convolution kernel parameters are shared, so that the identically distributed device feature information along the time dimension trains the kernel parameters repeatedly. The pooling layer pools along the time dimension (1x2), and no pooling is performed along the frequency dimension.

Technical effects and advantages of the present invention:

1. The invention uses the time-frequency map as the input to the network. Compared with raw speech data, the time-frequency map has a relatively dense distribution of the feature information introduced by the re-recording device, which makes feature extraction by the neural network easier, thereby speeding up training and improving accuracy;

2. The invention convolves along the frequency dimension and pools along the time dimension, specifically with 3x1 convolution kernels and 1x2 pooling. Convolving only along the frequency dimension, without considering correlation along the time dimension, greatly reduces the number of kernel parameters, makes the model more resistant to overfitting, and reduces excessive dependence on the amount of data. During training, because the kernel parameters are shared, the identically distributed device feature information along the time dimension trains the kernel parameters repeatedly, which makes training more thorough;

3. Unlike traditional machine learning methods, the invention does not require manually selecting one or more specific features and then classifying them with a classifier. It extracts the relevant features on its own, including shallow edge features and deep features, and then classifies, which simplifies the whole pipeline and achieves better results;

4. The algorithm detects re-recorded speech with high accuracy across different recording devices, recording environments, and recording distances.
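The parameter saving claimed in advantage 2 can be illustrated with a short count. This is a sketch only: the channel counts (32 in, 32 out) and the helper name `conv_param_count` are hypothetical, since the patent does not state how many channels each layer uses.

```python
def conv_param_count(kernel_h, kernel_w, in_channels, out_channels, bias=True):
    """Number of trainable parameters in one 2-D convolutional layer."""
    weights = kernel_h * kernel_w * in_channels * out_channels
    return weights + (out_channels if bias else 0)

# Hypothetical channel counts, chosen only for illustration.
IN_CH, OUT_CH = 32, 32

freq_only = conv_param_count(3, 1, IN_CH, OUT_CH)  # 3x1: frequency axis only
square = conv_param_count(3, 3, IN_CH, OUT_CH)     # 3x3: frequency and time

print(freq_only, square)
```

Whatever channel counts are chosen, the 3x1 kernel carries one third of the weights of a 3x3 kernel, because the weight count scales with the kernel area.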

Brief Description of the Drawings

Fig. 1 is a schematic diagram of the structure of the algorithm model of the present invention.

Fig. 2 is a schematic diagram of the speech re-recording process of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by persons of ordinary skill in the art based on these embodiments without creative effort fall within the protection scope of the present invention.

Example 1

Fig. 1 shows a re-recorded speech detection algorithm, based on a convolutional neural network, that is robust to different scenarios. The algorithm model has 7 layers, each containing a convolutional layer and a pooling layer. The output of each convolutional layer passes through a rectified linear unit (ReLU), and residual connections are added between layers. The final features are extracted by global pooling, and the detection result is predicted by a sigmoid. The model convolves along the frequency dimension and pools along the time dimension, specifically with 3x1 convolution kernels and 1x2 pooling. This minimizes model capacity, greatly reduces the risk of overfitting, and lowers the model's dependence on the amount of data, while closely matching the feature distribution of the time-frequency map. Training parameters are allocated where they are most useful, so that more effective features train a more compact set of parameters.

The speech time-frequency map is generated by the short-time Fourier transform. Compared with raw speech data, the time-frequency map has a relatively dense distribution of the feature information introduced by the re-recording device, which makes feature extraction by the neural network easier, thereby speeding up training and improving accuracy. The re-recording device introduces changes in the frequency domain of the original speech signal, and the performance of a deep learning model depends heavily on its data. If the raw audio signal is used as the input to the network, its feature distribution is too sparse, which makes it much harder for the neural network to extract effective features.
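The architecture of Example 1 can be sketched in plain Python as a single-channel forward pass. Several details below are assumptions that the text does not specify: one channel per layer, zero padding at the frequency edges, a residual connection that adds the layer input to the ReLU output before pooling, max pooling that keeps an odd last column, global average pooling, and a linear head before the sigmoid. The weights are random, so the output is meaningless as a detection score and only the data flow is illustrated.

```python
import math
import random

def conv_freq(x, kernel):
    """3x1 convolution along the frequency axis (rows), zero-padded at the edges."""
    n_freq, n_time = len(x), len(x[0])
    half = len(kernel) // 2
    out = [[0.0] * n_time for _ in range(n_freq)]
    for f in range(n_freq):
        for k, w in enumerate(kernel):
            src = f + k - half
            if 0 <= src < n_freq:
                for t in range(n_time):
                    out[f][t] += w * x[src][t]
    return out

def relu(x):
    return [[max(0.0, v) for v in row] for row in x]

def add(a, b):
    return [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]

def pool_time(x):
    """1x2 max pooling along the time axis (columns); an odd last column is kept."""
    return [[max(row[t:t + 2]) for t in range(0, len(row), 2)] for row in x]

def layer(x, kernel):
    # Convolution -> ReLU -> residual add -> pooling (this ordering is an assumption).
    return pool_time(add(x, relu(conv_freq(x, kernel))))

def detect(spec, kernels, head_weights, head_bias=0.0):
    x = spec
    for kernel in kernels:
        x = layer(x, kernel)
    # Global average pooling over time leaves one value per frequency bin.
    features = [sum(row) / len(row) for row in x]
    logit = head_bias + sum(w * v for w, v in zip(head_weights, features))
    return 1.0 / (1.0 + math.exp(-logit)), features

random.seed(0)
spec = [[random.random() for _ in range(62)] for _ in range(64)]  # stand-in 64x62 input
kernels = [[random.uniform(-0.1, 0.1) for _ in range(3)] for _ in range(7)]
head = [random.uniform(-0.01, 0.01) for _ in range(64)]
prob, features = detect(spec, kernels, head)
print(len(features), prob)
```

Because pooling never touches the frequency axis, the feature vector that reaches the sigmoid head has one entry per frequency bin of the input.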

Example 2

As shown in Fig. 2, re-recording distorts the speech data to some degree, including amplitude distortion and linear scaling along the time axis. The distortion model is:

y(t) = λx(αt) + η
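The re-recording distortion can be simulated numerically. The sketch below assumes the time-domain form y(t) = λx(αt) + η, inferred from the symbol definitions (amplitude factor λ, time-axis scaling factor α, additive noise η) and from the frequency-domain expression in claim 1. Modelling η as white Gaussian noise, the 1 kHz test tone, and the parameter values 0.8, 1.001, and 0.01 are all hypothetical choices for illustration.

```python
import math
import random

def rerecord(x, lam, alpha, noise_std, sample_rate, num_samples, seed=0):
    """Sample y(t) = lam * x(alpha * t) + eta on a uniform time grid.

    `x` is a function of time in seconds. eta is modelled as white Gaussian
    noise, which is an assumption (the text only calls it additive noise).
    """
    rng = random.Random(seed)
    out = []
    for n in range(num_samples):
        t = n / sample_rate
        out.append(lam * x(alpha * t) + rng.gauss(0.0, noise_std))
    return out

def tone(t, freq_hz=1000.0):
    """Stand-in for the original speech x(t): a pure tone."""
    return math.sin(2.0 * math.pi * freq_hz * t)

fs = 16000
original = [tone(n / fs) for n in range(3200)]
replayed = rerecord(tone, lam=0.8, alpha=1.001, noise_std=0.01,
                    sample_rate=fs, num_samples=3200)
print(len(original), len(replayed))
```

For a fixed device, `lam` and `alpha` stay constant across utterances, which is the stability property the description relies on.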

Example 3

In this embodiment, 0.2-second speech segments are used as experimental data. The short-time Fourier transform uses a Hamming (hanning) window of length 126 with a step size of 50, and the time-frequency map has size 64x62.
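The 64x62 size follows from the stated settings: at the 16 kHz sampling rate given for the speech library, a 0.2-second segment has 3200 samples, a 126-point window yields 126/2 + 1 = 64 one-sided frequency bins, and a step of 50 yields (3200 - 126) // 50 + 1 = 62 frames. The sketch below checks this with a naive pure-Python STFT. It assumes no padding and a symmetric Hann window (the text names both "Hamming" and "hanning"), and the function names are illustrative.

```python
import cmath
import math

def hann_window(length):
    """Symmetric Hann window."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (length - 1)) for i in range(length)]

def spectrogram(signal, win_len=126, hop=50):
    """Magnitude STFT without padding; returns rows indexed [frequency][time]."""
    window = hann_window(win_len)
    n_bins = win_len // 2 + 1                      # one-sided spectrum
    n_frames = (len(signal) - win_len) // hop + 1  # only frames that fit entirely
    # Precomputed complex exponentials for a naive DFT (fine for short segments).
    twiddle = [cmath.exp(-2j * math.pi * i / win_len) for i in range(win_len)]
    spec = [[0.0] * n_frames for _ in range(n_bins)]
    for m in range(n_frames):
        frame = [signal[m * hop + i] * window[i] for i in range(win_len)]
        for k in range(n_bins):
            acc = sum(frame[i] * twiddle[(k * i) % win_len] for i in range(win_len))
            spec[k][m] = abs(acc)
    return spec

fs = 16000               # sampling rate stated for the speech library
num_samples = fs // 5    # 0.2-second segment
segment = [math.sin(2.0 * math.pi * 1000.0 * n / fs) for n in range(num_samples)]
spec = spectrogram(segment)
print(len(spec), len(spec[0]))  # frequency bins, time frames
```

A library STFT with centered, padded frames would produce a different frame count, so the no-padding assumption matters for reproducing the stated size.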

Further, in the above technical solution, convolution is performed along the frequency dimension while pooling is performed along the time dimension. Convolving only along the frequency dimension, without considering correlation along the time dimension, greatly reduces the number of kernel parameters, makes the model more resistant to overfitting, and reduces excessive dependence on the amount of data. During training, because the kernel parameters are shared, the identically distributed device feature information along the time dimension trains the kernel parameters repeatedly, which makes training more thorough. The pooling layer pools along the time dimension (1x2), and no pooling is performed along the frequency dimension. Pooling reduces the feature dimension, speeds up computation, and makes the network more robust to scaling and deformation of data features. For the time-frequency map, the feature distribution exhibits no scaling or deformation, so pooling only along the time dimension reduces the feature dimension without losing frequency-dimension features. After multiple layers of convolution and pooling, the features become one-dimensional, with a length equal to the number of frequency bins of the time-frequency map.
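The shrinking of the time axis can be traced with simple arithmetic. The text does not say how an odd-length time axis is rounded when pooled, so the sketch below compares two assumed conventions: rounding up (as with "same"-style padding) and rounding down. Starting from 62 frames and applying 7 pooling stages, only rounding up ends at a length of 1, which matches the statement that the features become one-dimensional along frequency.

```python
import math

def time_lengths(n_time, n_layers, mode="ceil"):
    """Time-axis length after each 1x2 pooling stage."""
    lengths = [n_time]
    for _ in range(n_layers):
        prev = lengths[-1]
        lengths.append(math.ceil(prev / 2) if mode == "ceil" else prev // 2)
    return lengths

N_FREQ, N_TIME, N_LAYERS = 64, 62, 7

ceil_lengths = time_lengths(N_TIME, N_LAYERS, mode="ceil")
floor_lengths = time_lengths(N_TIME, N_LAYERS, mode="floor")

print(ceil_lengths)
print(floor_lengths)
print((N_FREQ, ceil_lengths[-1]))  # final feature map shape under the ceil assumption
```

With rounding down, the time axis would collapse to zero at the seventh stage, so an implementation following the stated sizes would need rounding up or some form of padding.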

Further, in the above technical solution, the original speech library consists of 30,000 speech segments recorded by 60 speakers, with a sampling frequency of 16 kHz and a quantization precision of 16 bits.

The speech of 10 randomly selected speakers is used as test data, and the speech of the remaining 50 speakers is used for training. This keeps the training and test data independent and prevents recordings of the same speaker from appearing in both data sets.
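A speaker-independent split of this kind can be sketched as follows. The speaker identifiers, file names, and the even split of 500 segments per speaker are hypothetical; the text states only the totals (30,000 segments, 60 speakers, 10 held out for testing).

```python
import random

def split_speakers(speaker_ids, n_test, seed=0):
    """Randomly hold out `n_test` speakers; returns (train_speakers, test_speakers)."""
    rng = random.Random(seed)
    test = set(rng.sample(sorted(speaker_ids), n_test))
    train = set(speaker_ids) - test
    return train, test

def split_utterances(utterances, test_speakers):
    """`utterances` is a list of (speaker_id, clip_name) pairs."""
    train = [u for u in utterances if u[0] not in test_speakers]
    test = [u for u in utterances if u[0] in test_speakers]
    return train, test

speakers = [f"spk{i:02d}" for i in range(60)]
# 500 clips per speaker gives 30,000 clips in total; the even split is an assumption.
utterances = [(s, f"{s}_{j:03d}.wav") for s in speakers for j in range(500)]

train_spk, test_spk = split_speakers(speakers, n_test=10)
train_utts, test_utts = split_utterances(utterances, test_spk)
print(len(train_spk), len(test_spk), len(train_utts), len(test_utts))
```

Splitting by speaker before splitting by utterance is what guarantees that no voice heard in training reappears in the test set.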

The recording procedure is as follows. For the training set, the original speech library is re-recorded 4 times in a quiet environment with different combinations of distance and device, giving 4 re-recorded speech libraries of 25,000 segments each. A total of 25,000 segments are randomly drawn from the 4 libraries as negative samples and combined with the original speech to form a training data set of 50,000 segments. The original speech is played back through a Lenovo Y40-70AT-IFI laptop; the re-recording devices are a Dell Inspiron 14 (Ins14VD-258) laptop and a Xiaomi 2S smartphone.

The conditions of the 4 recordings are shown in Table 1:

Table 1. Recorded speech

For the test data, the same recording settings as in Table 2 are used. To verify the model's robustness to speech corrupted by random environmental noise, recordings are made both in a quiet environment and in an environment with some random noise. The test set contains 4 speech libraries; each contains a total of 10,000 test utterances, recorded in that library's recording mode in the quiet environment and with environmental noise.

Further, in the above technical solution, the network error function is the cross-entropy loss, and training uses the Adam optimization algorithm. The initial learning rate is set to 0.001 and is adjusted dynamically during training: it is halved every 10,000 training iterations. The batch size is 32. To monitor training, 2,000 samples are randomly selected from the training data for validation. Comparing the training loss with the validation loss shows that adding a regularization term to the loss function, with a regularization coefficient of 0.0001, effectively prevents overfitting.
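The stated training settings can be written down as two small functions. This is a sketch under assumptions: the halving is treated as a step schedule counted in parameter updates, and the regularization term is taken to be an L2 penalty on the weights, which the text does not specify. The function names are illustrative, and the Adam update itself is not reproduced here.

```python
import math

def learning_rate(step, base_lr=0.001, decay_every=10000, factor=0.5):
    """Step schedule: halve the learning rate every `decay_every` updates."""
    return base_lr * factor ** (step // decay_every)

def binary_cross_entropy(label, prob, eps=1e-12):
    """Cross-entropy for one sample; `prob` is the sigmoid output."""
    prob = min(max(prob, eps), 1.0 - eps)  # avoid log(0)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))

def regularized_loss(labels, probs, weights, reg_coeff=0.0001):
    """Mean cross-entropy plus a weight penalty (the L2 form is an assumption)."""
    data_term = sum(binary_cross_entropy(y, p) for y, p in zip(labels, probs)) / len(labels)
    penalty = reg_coeff * sum(w * w for w in weights)
    return data_term + penalty

for step in (0, 10000, 25000):
    print(step, learning_rate(step))
print(regularized_loss([1, 0], [0.9, 0.2], weights=[0.5, -0.3]))
```

Tracking this loss on both the training batches and the 2,000 held-out validation samples is the comparison the text uses to judge whether the regularization coefficient is adequate.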

Table 2 lists some important hyperparameter settings used during training. With these settings the network converges quickly during training and ultimately reaches quite high accuracy.

Table 2. Hyperparameters (β₁ and β₂ are Adam optimizer parameters)

Further, in the above technical solution, this embodiment includes 4 experimental tests, carried out for different recording devices and different recording distances. The results of each experiment are shown in Table 3:

Table 3. Experimental results

The test accuracy exceeds 99.8% under all of these conditions, which shows that the model generalizes well.

Finally, the above is merely a preferred embodiment of the present invention and is not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its protection scope.

Claims

1. A re-recorded speech detection algorithm based on a convolutional neural network and robust to different scenarios, characterized by comprising the following steps:

a. collecting original speech with a recording device, and obtaining re-recorded speech through DA/AD conversion;

b. the original speech being distorted during the conversion, computing the distortion of the original speech through a distortion model, wherein the expression of the distortion model is:

y(t) = λx(αt) + η,

where y(t) is the re-recorded speech, x(t) is the original speech, λ is the amplitude scaling factor, α is the linear time-axis scaling factor, and η is the additive noise;

the corresponding frequency-domain expression is:

Y(jω) = λX(jαω) + N(jω),

where Y(jω), X(jω), and N(jω) are the frequency-domain representations of y(t), x(t), and η, respectively, and the characteristics are very stable for a fixed recording device, that is, λ and α are constants;

c. converting the re-recorded speech into a speech time-frequency map by short-time Fourier transform;

d. inputting the speech time-frequency map into an algorithm model, wherein the algorithm model comprises seven layers, each layer comprises a convolutional layer and a pooling layer, the output of the convolutional layer passes through a rectified linear unit, residual connections are added between the layers, the final features are extracted through global pooling, and the detection result is predicted through a sigmoid;

the convolutional layers of the algorithm model convolve only in the frequency dimension without considering correlation in the time dimension, and the pooling layers pool only in the time dimension without pooling in the frequency dimension; specifically, 3x1 convolution kernels and 1x2 pooling are used, which matches the feature distribution of the speech time-frequency map, the distribution being independent between adjacent speech frames and consistent within specific frequency bands.

2. The convolutional neural network-based re-recorded speech detection algorithm robust to different scenarios as claimed in claim 1, wherein: when the re-recorded speech is transformed, a Hamming (hanning) window of length 126 is used for the short-time Fourier transform, the step size is 50, and the size of the time-frequency map is 64x62.

3. The convolutional neural network-based re-recorded speech detection algorithm robust to different scenarios as claimed in claim 1, wherein: the algorithm model employs deep learning as a data-driven technique.

4. The convolutional neural network-based re-recorded speech detection algorithm robust to different scenarios as claimed in claim 1, wherein: the algorithm model does not consider correlation in the time dimension when convolving in the frequency dimension, and performs pooling in the time dimension together with the convolution in the frequency dimension.

5. The convolutional neural network-based re-recorded speech detection algorithm robust to different scenarios as claimed in claim 1, wherein: the convolution kernel parameters are shared, the identically distributed device feature information in the time dimension repeatedly trains the convolution kernel parameters, the pooling layer uses 1x2 pooling in the time dimension, and no pooling is performed in the frequency dimension.