CN116129931B

CN116129931B - An audio-visual combined speech separation model building method and speech separation method

Info

Publication number: CN116129931B
Application number: CN202310394927.7A
Authority: CN
Inventors: 付民; 李贵竹; 刘雪峰; 孙梦楠; 闵健; 董亮; 刘英哲; 闫劢; 郑冰
Original assignee: Ocean University of China
Current assignee: Ocean University of China
Priority date: 2023-04-14
Filing date: 2023-04-14
Publication date: 2023-06-30
Anticipated expiration: 2043-04-14
Also published as: CN116129931A

Abstract

The invention provides a method for building an audio-visual speech separation model and a speech separation method, and belongs to the technical field of speech separation. The model building method is as follows: raw video and corresponding audio data of several speakers are acquired and preprocessed to obtain speech spectrograms, face frames and mouth movement frames, from which a data set is constructed; an audio separation module is built on a U-Net network, a face module on a ResNet-18 network, and a mouth movement module on ShuffleNet-V2 and TCN networks; the three modules are combined into a new network model, the model is trained, and the model with the highest accuracy is selected; once built, the model is used to separate mixed audio. Compared with methods that use a single visual stream, the proposed audio-visual speech separation model achieves a clear performance improvement, and comparative experiments on a public dataset verify the effectiveness of the method.

Description

An audio-visual combined speech separation model building method and speech separation method

Technical Field

The invention belongs to the technical field of speech separation, and in particular relates to a method for building an audio-visual speech separation model and to a speech separation method.

Background

In an environment where several sound sources are active at the same time, humans can rely on their sensitive auditory system to process the incoming sound, concentrate on a target sound and ignore sounds of no interest. Cherry named this phenomenon the "cocktail party effect" in his work, and it has since drawn wide attention to the problem of speech separation. Speech separation, one of the key tasks in addressing the cocktail party effect, means extracting the speech signal of a single person from several overlapping speech signals. With the spread of intelligent devices, speech separation plays a role in many voice-interaction products: hearing aids that help hearing-impaired users hear their surroundings clearly, voice control in smart homes, voice assistants on mobile phones, analysis of voice evidence in case investigations, and improved call efficiency and quality in online meetings. However, current speech separation technology still performs far below the human auditory system, and efficiently achieving separation quality close to that of humans remains a technical challenge.

Speech separation methods widely used in the early days include spectral subtraction, computational auditory scene analysis and hidden Markov models. These are shallow models that cannot fully extract signal features; their effectiveness often depends on prior knowledge or on a specific microphone configuration, and they lack the ability to learn from large amounts of data. In recent years, with the development of deep learning, many speech separation models with good performance have been proposed.

In fact, the human ability to focus on a particular sound relies not only on the sound itself but also on visual information such as the speaker's gender, age, and the opening and closing of the lips. Psychological research has shown that this non-speech information strengthens the human ability to focus on target speech in complex environments. In recent years, a growing number of models that use visual information to assist speech separation have been proposed. In 2018, Google proposed a deep-learning-based joint audio-visual speech separation model that significantly improved separation performance over audio-only methods. Other researchers proposed a time-domain audio-visual speech separation architecture in which a lip-embedding extractor is pre-trained to supply visual information, word-level and phoneme-level lip embeddings assist separation, and the network directly predicts the target speech waveform. However, these methods use only a single kind of visual information; how to extract and use audio and video features effectively, so that the model is more robust in more complex scenes, still deserves study.

Summary of the Invention

In view of the above problems, a first aspect of the invention provides a method for building an audio-visual speech separation model, comprising the following steps:

Step 1: acquire raw video and corresponding audio data of several speakers, the raw data being shot in, or downloaded from, different scenes;

Step 2: preprocess the raw data acquired in Step 1: split each video into individual frames; randomly select the data of two speakers from the raw data, mix their audio, and apply a short-time Fourier transform (STFT) to the mixed speech to obtain its spectrogram; construct a data set from the spectrogram together with the face frames and mouth movement frames of the two speakers, and divide it into a training set, a validation set and a test set;

Step 3: based on the U-Net structure, add residual connections to some of the downsampling convolution blocks of the conventional U-Net to obtain residual convolution blocks, then insert a BN layer between the convolution and the ReLU activation function in both the compression path and the expansion path, to form the audio separation module; based on the ResNet-18 structure, add a CBAM attention mechanism before and after the basic convolution blocks of ResNet-18 to form the face module; based on the ShuffleNet-V2 and TCN structures, combined with a 3D convolutional layer, form the mouth movement module; combine the three modules into the AV-ResUnet network model, in which the spectrogram of the mixed speech is fed to the audio separation module, the face frames to the face module, and the mouth movement frames to the mouth movement module;

Step 4: train and validate the AV-ResUnet network model built in Step 3 with the training set and validation set from Step 2, and select the model with the best validation performance during training as the final model for testing;

Step 5: test the finally selected AV-ResUnet network model with the data in the test set.
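The pairing and mixing in Step 2 can be sketched as follows. This is a minimal illustration, not the patented implementation: the helper names are hypothetical, the clip length of 2.55 s at 16 kHz is taken from the STFT settings stated elsewhere in the text, and equal-gain mixing is an assumption because no mixing ratio is given.

```python
import random

SAMPLE_RATE = 16_000          # Hz, as stated for the STFT input
CLIP_SECONDS = 2.55           # clip length stated in the text
CLIP_SAMPLES = int(round(SAMPLE_RATE * CLIP_SECONDS))  # 40800 samples


def pick_two_speakers(speaker_ids, rng):
    """Randomly choose two distinct speakers (hypothetical helper)."""
    return rng.sample(list(speaker_ids), 2)


def mix_waveforms(wav_a, wav_b, n_samples=CLIP_SAMPLES):
    """Sum two mono waveforms sample by sample after trimming to n_samples.

    Assumption: both clips are at least n_samples long and are mixed with
    equal gain; the source does not state a mixing ratio.
    """
    if len(wav_a) < n_samples or len(wav_b) < n_samples:
        raise ValueError("both clips must contain at least n_samples samples")
    return [a + b for a, b in zip(wav_a[:n_samples], wav_b[:n_samples])]
```

The mixed waveform is what the STFT would then turn into the input spectrogram, while the two unmixed clips remain available as separation targets.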

Preferably, the preprocessing in Step 2 proceeds as follows: the video is first split into individual frames, and one frame is selected as the face frame; for each frame, the SFD face detector is used to obtain facial key points, position-related differences are removed, and the lips are located and cropped to a fixed size; after conversion to grayscale the crops serve as the mouth movement frames, 64 frames in total; the data of two speakers are then randomly selected from the raw data, their audio is mixed, a short-time Fourier transform is applied to the mixed speech to obtain its spectrogram, and the data set is constructed together with the face frames and mouth movement frames of the two speakers.

Preferably, the audio separation module is an improved U-Net network comprising conv layers, res_conv layers, audio-visual feature fusion, up_conv layers and a Tanh function;

Each conv layer consists of a 4×4 convolution kernel with stride 2, a BN layer and a ReLU activation function; there are two conv layers in each of the compression path and the expansion path of the network, named unet_conv and unetup_conv respectively;

There are six res_conv layers, namely res_conv1, res_conv2, res_conv3, res_conv4, res_conv5 and res_conv6; each consists of two 3×3 convolution kernels with stride 1, two BN layers, two ReLU activation functions, one Maxpool layer and one residual connection;

The input and output channel counts differ for res_conv1 and res_conv2 and are equal for the other four layers. The residual connection accordingly takes one of two forms: when the channel counts are equal, the input is added directly to the convolution output and the sum is pooled; when they differ, the input is first passed through a convolution kernel and then added;
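The two forms of residual connection can be illustrated at a single spatial location, where a 1×1 convolution reduces to a linear map over channels. This is only a sketch: the function names are hypothetical, and the 1×1 kernel size for the projection is an assumption, since the text says only that the input passes through "a convolution kernel".

```python
def conv1x1(x, weight):
    """1x1 convolution at one spatial location: a linear map over channels.

    x is a list of C_in values, weight is a C_out x C_in nested list.
    """
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def residual_add(x, fx, proj_weight=None):
    """Add the shortcut branch to the convolution branch output fx.

    Same channel count: identity shortcut, x is added directly.
    Different channel count: x is first projected by a convolution
    (assumed 1x1 here; the source does not state the kernel size).
    """
    shortcut = x if proj_weight is None else conv1x1(x, proj_weight)
    if len(shortcut) != len(fx):
        raise ValueError("shortcut and branch outputs must have equal channels")
    return [s + f for s, f in zip(shortcut, fx)]
```

In the described block the sum would then be passed to the Maxpool layer; pooling is left out here to keep the shortcut logic visible.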

The audio-visual feature fusion fuses, along the time dimension, the audio features produced by the compression path with the visual features extracted by the visual networks, yielding the fused audio-visual features;

Each up_conv layer consists of an Upsample layer, a 3×3 convolution kernel with stride 1, a BN layer and a ReLU activation function, with Upsample taking the place of the Maxpool used in the compression path;

The Tanh function compresses the data to the range -1 to 1 and outputs the separation mask; the mask is multiplied with the spectrogram of the mixed speech to obtain the spectrogram of an individual speaker, and the inverse short-time Fourier transform then recovers that speaker's clean speech.

Preferably, the face module is an improved ResNet-18 network comprising a conv7 layer, CBAM layers, res layers, pooling layers and a linear layer;

The conv7 layer consists of a 7×7 convolution kernel with stride 2, a BN layer and a ReLU activation function, and its output is the input of the CBAM layer;

Each CBAM layer consists of Channel Attention and Spatial Attention. One CBAM layer is placed before the first res layer and one after the last res layer; they efficiently extract the key face regions that correlate strongly with the audio and ignore secondary regions outside the face;

The res layers comprise four layers, res1, res2, res3 and res4, each containing two convolution blocks. Each convolution block in res1 consists of 3×3 convolution kernels, BN layers and ReLU activation functions, and can be expressed as:

y = ReLU(x + BN(conv3(ReLU(BN(conv3(x))))))

where x is the input of the convolution block and y its output; conv3 is a 3×3 convolution, BN is a batch normalization layer, and ReLU is the ReLU activation function;

In res2, res3 and res4, the first convolution block is the same as in res1; the second convolution block consists of 3×3 convolution kernels, BN layers, a downsampling layer and ReLU activation functions, and can be expressed as:

y = ReLU(Downsample(x) + BN(conv3(ReLU(BN(conv3(x))))))

where Downsample is the downsampling layer;

The pooling layers comprise max pooling and average pooling. Max pooling follows the first CBAM layer and reduces the number of parameters and the complexity of the network; average pooling follows the second CBAM layer, and its output is the input of the final linear layer;

The output of the linear layer is the final face feature extracted by the network; it is replicated along the time dimension and combined with the lip features to form the visual features required by the model.

Preferably, the mouth movement module is built on the ShuffleNet-V2 and TCN network structures combined with a 3D convolutional layer; the 3D convolutional layer consists of a 5×7×7 convolution kernel with stride 1×2×2, a BN layer, a ReLU activation function, and a 3D max pooling layer of size 1×3×3 with stride 1×2×2;
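To see what this front end does to the tensor shape, the standard output-size formula for convolution and pooling can be applied to 64 mouth frames of 88×88. The kernel sizes and strides are those stated above; the padding values are assumptions, as the text does not give them, and the function names are hypothetical.

```python
def out_size(size, kernel, stride, padding):
    """Output length of a convolution or pooling along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def frontend_output_shape(frames=64, height=88, width=88):
    # 3D convolution: kernel 5x7x7, stride 1x2x2 (from the text).
    # Padding (2, 3, 3) is an assumption; the source does not state it.
    conv = [out_size(s, k, st, p) for s, k, st, p in
            zip((frames, height, width), (5, 7, 7), (1, 2, 2), (2, 3, 3))]
    # 3D max pooling: kernel 1x3x3, stride 1x2x2 (from the text).
    # Padding (0, 1, 1) is likewise an assumption.
    pool = [out_size(s, k, st, p) for s, k, st, p in
            zip(conv, (1, 3, 3), (1, 2, 2), (0, 1, 1))]
    return tuple(conv), tuple(pool)
```

Under these padding assumptions the temporal length stays at 64 frames while each spatial side shrinks from 88 to 44 and then to 22, which is consistent with the 64 time steps of the final mouth feature.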

The ShuffleNet-V2 network comprises convolutional layers, pooling layers, a fully connected layer, grouped convolution and depthwise separable convolution. The TCN network consists of several residual blocks; it takes the time-indexed sequence of feature vectors extracted by the ShuffleNet-V2 network and maps it to a new sequence by 1D temporal convolution, finally producing lip movement features of dimension 512×64.
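The idea of mapping a time-indexed sequence to a new sequence of the same length with a 1D temporal convolution can be sketched on a single feature channel. This is a deliberately simplified illustration: the kernel, the dilation, the zero padding and the one-convolution residual block are assumptions, and real TCN blocks operate on many channels with normalization layers.

```python
def temporal_conv1d(seq, kernel, dilation=1):
    """'Same'-padded dilated 1D convolution over a time-indexed sequence.

    seq is a list of floats (one feature channel over time) and kernel has
    odd length; zero padding keeps the output as long as the input.
    """
    if len(kernel) % 2 == 0:
        raise ValueError("kernel length must be odd")
    half = len(kernel) // 2
    out = []
    for t in range(len(seq)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t + (k - half) * dilation
            if 0 <= idx < len(seq):      # zero padding outside the sequence
                acc += w * seq[idx]
        out.append(acc)
    return out


def tcn_residual_block(seq, kernel, dilation=1):
    """One simplified residual block: ReLU(conv(seq)) added onto the input."""
    conv = temporal_conv1d(seq, kernel, dilation)
    return [x + max(0.0, c) for x, c in zip(seq, conv)]
```

Because the padding preserves the length, a 64-step input yields a 64-step output, matching the time dimension of the 512×64 lip movement feature.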

Preferably, during training the model built in Step 4 uses the complex ideal ratio mask (cIRM) as the training target for the audio, and uses a triplet loss to measure the similarity between the audio and the face image. The cIRM, denoted M, is computed as follows:

$$M = \frac{X_r S_r + X_i S_i}{X_r^2 + X_i^2} + j\,\frac{X_r S_i - X_i S_r}{X_r^2 + X_i^2}$$

where X_r and X_i are the real and imaginary parts of the mixed speech signal, and S_r and S_i are the real and imaginary parts of the clean speech.
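The two training quantities named here can be sketched per time-frequency bin. The mask function assumes the usual definition of the cIRM as the complex ratio of clean speech to mixture, S/X, split into real and imaginary parts; the small `eps` term, the Euclidean distance and the margin of the triplet loss are assumptions, and which embeddings act as anchor, positive and negative is not specified in the text.

```python
import math


def cirm(x_r, x_i, s_r, s_i, eps=1e-8):
    """Complex ideal ratio mask for one time-frequency bin.

    (x_r, x_i): real/imaginary parts of the mixture; (s_r, s_i): clean speech.
    eps guards against division by zero and is an added assumption.
    """
    denom = x_r * x_r + x_i * x_i + eps
    m_r = (x_r * s_r + x_i * s_i) / denom
    m_i = (x_r * s_i - x_i * s_r) / denom
    return m_r, m_i


def apply_mask(m_r, m_i, x_r, x_i):
    """Complex multiplication of mask and mixture, returning (real, imag)."""
    return m_r * x_r - m_i * x_i, m_r * x_i + m_i * x_r


def triplet_loss(anchor, positive, negative, margin=0.5):
    """max(0, d(a, p) - d(a, n) + margin) with Euclidean distance.

    The margin value and the distance metric are assumptions.
    """
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)
```

Applying the ideal mask to the mixture bin returns the clean-speech bin, which is the property that makes the cIRM usable as a regression target carrying both magnitude and phase.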

Preferably, in Step 2 the time-domain mixed audio is converted into a spectrogram by the short-time Fourier transform: the audio is sampled at 16 kHz, each audio clip is 2.55 s long, and the STFT uses a window length of 400, a hop size of 160 and an FFT size of 512.
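These settings determine the spectrogram size, which can be checked with a few lines of arithmetic. The sketch assumes a one-sided STFT with centered frames (the signal padded by half the FFT size on each side), a common library default that the text does not state explicitly; the function name is hypothetical.

```python
SAMPLE_RATE = 16_000   # Hz
CLIP_SECONDS = 2.55
WIN_LENGTH = 400       # samples (25 ms at 16 kHz)
HOP_LENGTH = 160       # samples (10 ms at 16 kHz)
N_FFT = 512


def stft_shape(n_samples, n_fft=N_FFT, hop=HOP_LENGTH, centered=True):
    """Return (frequency_bins, frames) of a one-sided STFT.

    centered=True assumes the signal is padded by n_fft // 2 on both sides,
    a common library default; the source does not state the padding mode.
    """
    freq_bins = n_fft // 2 + 1
    if centered:
        frames = 1 + n_samples // hop
    else:
        frames = 1 + (n_samples - n_fft) // hop
    return freq_bins, frames


n_samples = round(SAMPLE_RATE * CLIP_SECONDS)   # 40800
print((2, *stft_shape(n_samples)))              # (2, 257, 256)
```

With real and imaginary parts stored as two channels, this reproduces a 2×257×256 input: 257 frequency bins from the 512-point FFT and 256 frames from 40800 samples at a hop of 160.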

Preferably, the inputs of the audio separation module are as follows. For the visual input, the face image frame has size 224×224 and is reduced by the network to a face feature of dimension 128; the mouth movement frames have input size 88×88 and are reduced by the network to a mouth feature of dimension 512×64, which is combined with the face feature to give a visual feature of dimension 640×64 that serves as the visual input of the audio separation module. For the audio input, the spectrogram of the mixed audio signal, of dimension 2×257×256, serves as the audio input of the audio separation module, and the network outputs a predicted mask with the same dimensions as the input spectrogram.
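The way a single 128-dimensional face vector and a 512×64 mouth feature become a 640×64 visual feature can be sketched with nested lists. The function name is hypothetical and the channel order of the concatenation (mouth channels first) is an assumption; the text specifies only that the face feature is replicated along the time dimension and combined with the mouth feature.

```python
def fuse_visual_features(face_feature, mouth_feature):
    """Tile a face vector over time and stack it under the mouth features.

    face_feature: list of C_face values (128 in the text).
    mouth_feature: C_mouth x T nested list (512 x 64 in the text).
    Returns a (C_mouth + C_face) x T nested list (640 x 64).
    The channel order of the concatenation is an assumption.
    """
    n_frames = len(mouth_feature[0])
    tiled_face = [[value] * n_frames for value in face_feature]
    return [list(row) for row in mouth_feature] + tiled_face
```

Every time step therefore sees the same face descriptor next to a time-varying mouth descriptor, so identity cues are constant while lip motion supplies the temporal alignment with the audio.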

A second aspect of the invention provides an audio-visual speech separation method, comprising the following process:

acquiring a video containing two speakers and the corresponding audio;

processing the acquired video and audio, and extracting the face frames and mouth movement frames of each speaker in the video;

feeding the face frames, the mouth movement frames and the corresponding audio into the speech separation model built by the building method of the first aspect;

outputting, for each separated speaker, the corresponding clean speech.

A third aspect of the invention provides an audio-visual speech separation device comprising at least one processor and at least one memory coupled to each other; the memory stores a computer-executable program of the speech separation model built by the building method of the first aspect; when the processor executes the program stored in the memory, the processor carries out the speech separation method.

Compared with the prior art, the invention has the following beneficial effects:

The invention uses two kinds of visual features, mouth movement and face information, to assist speech separation at the same time. Compared with cross-modal speech separation models that use a single kind of visual information and with audio-only speech separation models, the two visual cues let the network make better use of the intrinsic connection between visual and audio information and achieve better separation performance. To further improve visual feature extraction, and because the video also contains secondary information outside the face, two attention layers are used during face feature extraction to help the network find the most important face regions and use the visual information more efficiently. To address the tendency of the conventional U-Net model to overlook fine details in the data, the invention introduces residual connections into the U-Net: residual connections are added to the convolution blocks of the compression path to help the network extract detail features, and BN layers are added after the convolutional layers to speed up training and convergence. Rather than simply using frequency-domain features of the audio signal, the invention applies the STFT to the mixed speech signal and makes full use of both its magnitude and phase information.

Brief Description of the Drawings

Fig. 1 is a framework diagram of the audio-visual speech separation model proposed by the invention.

Fig. 2 is a network structure diagram of the audio separation module.

Fig. 3 is a structure diagram of the convolutional layers in the improved U-Net network.

Fig. 4 is a structure diagram of the residual convolution blocks in the improved U-Net network.

Fig. 5 is a network structure diagram of the face module.

Fig. 6 is a schematic diagram of the attention mechanism of the CBAM layer.

Fig. 7 is a network structure diagram of the mouth movement module.

Fig. 8 is a simplified block diagram of the speech separation device in Embodiment 2.

Detailed Description

The invention is further described below with reference to specific embodiments.

Embodiment 1:

The invention proposes an audio-visual speech separation method based on dual visual cues which, as shown in Fig. 1, mainly comprises the following steps.

In this embodiment, experiments are carried out on the VoxCeleb2 dataset, which contains more than 1,000,000 speech clips and their corresponding video clips downloaded from YouTube; the gender distribution is fairly balanced, and the speakers come from many countries.

1. Acquiring the raw data

The videos in the VoxCeleb2 dataset were shot in a wide range of challenging visual and acoustic environments, including interviews on red carpets, in outdoor stadiums and in studios, so the video quality is uneven. Some speakers' video clips are very blurry, which makes it difficult for the network to extract useful information from them. In this embodiment, therefore, some of the blurry videos are removed from the dataset to ensure good video and audio quality.

2. Data preprocessing

The acquired raw data are preprocessed. The video is first split into individual frames, and one frame is selected as the face frame, with a resolution of 224×224. For each frame, the SFD face detector is used to obtain facial key points and the face in the video is aligned with a reference plane; a similarity transform removes position-related differences, the lips are located, and the region is cropped to 96×96 and converted to grayscale to serve as a mouth movement frame; the frames are saved as an .h5 file so that the model can read them later. The data of two speakers are then randomly selected from the raw data, their audio is mixed, a short-time Fourier transform is applied to the mixed speech to obtain its spectrogram, and the data set is constructed together with the face frames and mouth movement frames of the two speakers.
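The cropping and grayscale conversion of a mouth movement frame can be sketched as below. This is a rough illustration only: face detection and the similarity-transform alignment are omitted, the lip position is assumed to be given (for example as the mean of detected lip landmarks), the function names are hypothetical, and the ITU-R BT.601 luma weights are an assumption because the text does not say how grayscale values are computed.

```python
def crop_box(center_x, center_y, size, width, height):
    """Square crop of side `size` around a point, shifted to stay in the image."""
    if size > width or size > height:
        raise ValueError("crop is larger than the image")
    left = min(max(center_x - size // 2, 0), width - size)
    top = min(max(center_y - size // 2, 0), height - size)
    return left, top, left + size, top + size


def to_gray(pixel):
    """Luma from an (R, G, B) pixel using ITU-R BT.601 weights (an assumption)."""
    r, g, b = pixel
    return 0.299 * r + 0.587 * g + 0.114 * b


def mouth_frame(image, mouth_center, size=96):
    """Crop a size x size mouth region from an RGB image and convert it to gray.

    image is a height x width nested list of (R, G, B) tuples; mouth_center is
    the (x, y) lip position, e.g. the mean of the detected lip landmarks.
    """
    height, width = len(image), len(image[0])
    left, top, right, bottom = crop_box(*mouth_center, size, width, height)
    return [[to_gray(px) for px in row[left:right]] for row in image[top:bottom]]
```

Repeating this for 64 consecutive frames would give the stack of grayscale mouth crops that the mouth movement module consumes.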

3. Model building

In this embodiment, the audio separation module is an improved U-Net network comprising conv layers, res_conv layers, audio-visual feature fusion, up_conv layers and a Tanh function; its structure is shown in Fig. 2.

Each conv layer consists of a 4×4 convolution kernel with stride 2, a BN (Batch Normalization) layer and a ReLU activation function; there are two conv layers in each of the compression path and the expansion path of the network, named unet_conv and unetup_conv respectively. The detailed structure is shown in Fig. 3, where (a) is the conv layer and (b) is the up_conv layer of the U-Net network.

There are six res_conv layers, namely res_conv1, res_conv2, res_conv3, res_conv4, res_conv5 and res_conv6; each consists of two 3×3 convolution kernels with stride 1, two BN layers, two ReLU activation functions, one Maxpool layer and one residual connection;

The input and output channel counts differ for res_conv1 and res_conv2 and are equal for the other four layers. The residual connection accordingly takes one of two forms: when the channel counts are equal, the input is added directly to the convolution output and the sum is pooled; when they differ, the input is first passed through a convolution kernel and then added. The residual connection effectively avoids vanishing gradients during training and helps the network extract small details that are hard to distinguish. The structure is shown in Fig. 4, where (a) is the 1st and 2nd residual convolution blocks and (b) is the 4th to 6th residual convolution blocks. The input and output channel counts of each layer of the improved U-Net network are listed in Table 1.

Table 1. Channel counts of each layer of the improved U-Net network

Audio-visual feature fusion fuses, along the time dimension, the audio features produced by the compression path with the visual features extracted by the visual networks, yielding the fused audio-visual features;

Each up_conv layer consists of an Upsample layer, a 3×3 convolution kernel with stride 1, a BN layer and a ReLU activation function, with Upsample taking the place of the Maxpool used in the compression path; the structure is shown in Fig. 3;

The Tanh function compresses the data to the range -1 to 1 and outputs the separation mask; the mask is multiplied with the spectrogram of the mixed speech to obtain the spectrogram of an individual speaker, and the inverse short-time Fourier transform then recovers that speaker's clean speech.

The face module is an improved ResNet-18 network comprising a conv7 layer, CBAM layers, res layers, pooling layers and a linear layer;

The conv7 layer consists of a 7×7 convolution kernel with stride 2, a BN layer and a ReLU activation function, and its output is the input of the CBAM layer; the structure of the face module is shown in Fig. 5.

Each CBAM (Convolutional Block Attention Module) layer consists of Channel Attention and Spatial Attention. One CBAM layer is placed before the first res layer and one after the last res layer; they efficiently extract the key face regions that correlate strongly with the audio and ignore secondary regions outside the face. The structure of the CBAM layer is shown in Fig. 6.

The res layers comprise four layers, res1, res2, res3 and res4, each containing two convolution blocks. Each convolution block in res1 consists of 3×3 convolution kernels, BN layers and ReLU activation functions, and can be expressed as:

y = ReLU(x + BN(conv3(ReLU(BN(conv3(x))))))

where x is the input of the convolution block and y its output; conv3 is a 3×3 convolution, BN is a batch normalization layer, and ReLU is the ReLU activation function.

In res2, res3 and res4, the first convolution block is the same as in res1, and the second convolution block adds a downsampling layer and can be expressed as:

y = ReLU(Downsample(x) + BN(conv3(ReLU(BN(conv3(x))))))

where Downsample is the downsampling layer; the rest is the same as in the first convolution block.

The pooling layers comprise max pooling and average pooling. Max pooling follows the first CBAM layer and reduces the number of parameters and the complexity of the network; average pooling follows the second CBAM layer, and its output is the input of the final linear layer. The output of the linear layer is the face feature extracted by the network; it is replicated along the time dimension and combined with the lip features to form the visual features required by the model;

The mouth movement module is built on the ShuffleNet-V2 and TCN network structures combined with a 3D convolutional layer; the 3D convolutional layer consists of a 5×7×7 convolution kernel with stride 1×2×2, a BN layer, a ReLU activation function, and a 3D max pooling layer of size 1×3×3 with stride 1×2×2;

The ShuffleNet-V2 network is made up of convolutional layers, pooling layers, a fully connected layer, grouped convolution, depthwise separable convolution and so on. The TCN (Temporal Convolutional Network) consists of several residual blocks; it takes the time-indexed sequence of feature vectors extracted by the ShuffleNet-V2 network and maps it to a new sequence by 1D temporal convolution, finally producing lip movement features of dimension 512×64.

4. Model training

In this embodiment, the audio-visual speech separation method based on dual visual cues is implemented on the Linux operating system, with Python 3.8 as the programming language, PyTorch 1.11.0 as the deep learning framework, CUDA 11.1, and an NVIDIA RTX 2080Ti graphics card. Adam is used as the optimizer with a learning rate of 0.00001 and a batch size of 8; training runs for 5000 batches in total, and the latest model is saved every 500 iterations. During training, the model is checked on the validation set every 100 iterations, and the best model so far is saved. Because the full VoxCeleb2 dataset is too large and would make training too slow for experimentation, the same subset of the data is used to train every model, which saves time, keeps the comparison fair, and does not affect the comparison between models.
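The checkpointing and validation cadence described above can be written as a small driver loop. This sketch is framework-free and hypothetical: `train_step` and `validate` stand in for the real model update and validation pass, "5000 batches" is read as 5000 training iterations, and using the lowest validation loss as the selection criterion is an assumption, since the text says only that the best model so far is kept.

```python
TRAIN_CONFIG = {
    "optimizer": "Adam",
    "learning_rate": 1e-5,
    "batch_size": 8,
    "total_iterations": 5000,
    "checkpoint_every": 500,   # save the latest model
    "validate_every": 100,     # evaluate on the validation set
}


def run_schedule(config, train_step, validate):
    """Drive training and keep the iteration with the lowest validation loss.

    train_step(i) and validate(i) are caller-supplied callables (hypothetical
    stand-ins for the real model update and validation pass).
    Returns (best_iteration, best_loss, checkpoint_iterations).
    """
    best_iter, best_loss, checkpoints = None, float("inf"), []
    for i in range(1, config["total_iterations"] + 1):
        train_step(i)
        if i % config["validate_every"] == 0:
            loss = validate(i)
            if loss < best_loss:          # new best model so far
                best_iter, best_loss = i, loss
        if i % config["checkpoint_every"] == 0:
            checkpoints.append(i)
    return best_iter, best_loss, checkpoints
```

With these settings a run produces 10 periodic checkpoints and 50 validation passes, and the best-so-far model is tracked separately from the latest one.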

Every sample in the dataset contains a single voice. During training, the audio signals of two different speakers are randomly mixed, and the time-domain mixture is converted into a spectrogram by STFT. The audio is sampled at 16 kHz, each audio segment is 2.55 s long, and the STFT uses a window length of 400, a hop size of 160, and an FFT size of 512. The face image frame is 224×224 and is reduced by the network to a facial feature of dimension 128. The mouth action input is 64 grayscale mouth frames of size 96×96, which the network reduces to a mouth feature of dimension 512×64. Combining the mouth feature with the facial feature gives a visual feature of dimension 640×64, which serves as the visual input of the audio separation module. The spectrogram of the mixed audio has dimension 2×257×256, and the network outputs a prediction mask with the same dimension as the input spectrogram. The prediction mask for each speaker is multiplied by the spectrogram of the mixed speech to obtain that speaker's separated spectrogram, and the speaker's clean speech signal is then recovered by the inverse short-time Fourier transform (ISTFT).
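The stated dimensions follow from the audio and STFT parameters. The sketch below reproduces the arithmetic under two assumptions that the text does not state: the STFT is one-sided and centre-padded (so the frame count is 1 + samples // hop), and the 128-dimensional facial feature is repeated over the 64 time steps before being concatenated with the mouth feature. The function names are hypothetical.

```python
def stft_shape(sample_rate, duration_s, n_fft, hop):
    """Return (frequency bins, frames) for a one-sided, centre-padded STFT."""
    n_samples = round(sample_rate * duration_s)
    freq_bins = n_fft // 2 + 1      # one-sided spectrum
    frames = 1 + n_samples // hop   # assumes centre padding
    return freq_bins, frames


def fused_visual_shape(face_dim, mouth_shape):
    """Concatenate a time-repeated face vector with the mouth feature."""
    mouth_dim, steps = mouth_shape
    return face_dim + mouth_dim, steps


freq_bins, frames = stft_shape(16000, 2.55, 512, 160)
spec_shape = (2, freq_bins, frames)   # 2 channels: real and imaginary parts
visual_shape = fused_visual_shape(128, (512, 64))
print(spec_shape, visual_shape)       # (2, 257, 256) (640, 64)
```

A 2.55 s clip at 16 kHz has 40800 samples, which gives 256 frames at a hop of 160, and a 512-point FFT gives 257 frequency bins, matching the 2×257×256 spectrogram and the 640×64 visual feature.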

5. Experimental results

This embodiment compares the separation performance of the proposed method with that of audio-visual speech separation models that use a single visual cue, and also compares the improved model with the base model, to verify the effectiveness of the proposed scheme. Commonly used evaluation metrics for speech separation include the source-to-distortion ratio (SDR), the perceptual evaluation of speech quality (PESQ), and the short-time objective intelligibility (STOI). SDR indicates the overall distortion of the signal. STOI measures the correlation between the short-time temporal envelopes of the reference (clean) utterance and the separated utterance. PESQ is the standard measure of speech quality; it applies an auditory transform to produce loudness spectra and compares the loudness spectra of the clean reference signal and the separated signal. This embodiment uses PESQ, SDR, and STOI as the evaluation metrics for the comparative experiments, and only results for mixtures of two speakers are compared and analyzed.
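To make the SDR metric concrete, the sketch below computes a simplified energy-ratio form, 10·log10(‖s‖² / ‖s − ŝ‖²), between a reference signal s and an estimate ŝ. This is an assumption made for illustration: the text does not say which SDR implementation was used, and toolkit versions of SDR (for example BSS Eval) decompose the error further. The function name `sdr_db` is hypothetical.

```python
import math


def sdr_db(reference, estimate, eps=1e-12):
    """Simplified SDR in dB: reference energy over error energy."""
    signal = sum(r * r for r in reference)
    error = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return 10.0 * math.log10((signal + eps) / (error + eps))


reference = [1.0, -1.0, 0.5, -0.5]
estimate = [0.9 * r for r in reference]   # 10 % amplitude error
print(round(sdr_db(reference, estimate), 2))  # 20.0
```

An estimate whose error is one tenth of the reference amplitude scores 20 dB, so an SDR gain of a few dB corresponds to a substantial reduction in error energy.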

Verification of the attention mechanism:

In the video feature extraction network, a CBAM attention mechanism is added to the face module of the proposed model to help the network extract key facial information and ignore secondary information that is useless for separation, thereby improving separation performance. This embodiment compares the effect of adding the Squeeze-and-Excitation (SE) and CBAM attention mechanisms. Note that the results in the table below were obtained with the separation model built on the original U-Net network. Two samples from the test set were selected for the check, and the evaluation metrics of the separated speech are shown in Table 2. The experimental results show that adding two CBAM layers helps improve speech separation.

Table 2 Experimental results of different attention mechanisms

As can be seen, adding the CBAM attention mechanism helps the visual feature extraction module extract facial information more accurately; that is, it is the key face regions in the video stream that contain the features most useful for the separation task, and the experimental results further verify the feasibility of this idea. Compared with the model without an attention mechanism, the separation model with two CBAM layers improves PESQ by 0.09 and SDR by 0.8 dB. Unlike CBAM, adding the SE attention mechanism did not improve separation. This may be because the way CBAM processes visual information is closer to the way the human brain processes visual information in audio-visual mode.

Verification of the residual connections:

In image processing, residual connections are known to help a network extract image details and to help it avoid the degradation problem. Inspired by this, the proposed model adds residual connections to the U-Net in the audio signal processing network. When a deep network degrades (that is, when adding more layers makes performance worse), a residual connection acts as a bridge that also passes the information of the previous layer on to the next layer. The proposed model adds residual connections to the convolution layers of the compression path, and the experiment verifies how the number and type of residual connections affect separation. Two samples from the test set were selected for verification, and the evaluation metrics of the separated speech are shown in Table 3. The results show that, compared with the U-Net network without residual connections, the network model with 6 residual connections improves the quality of the separated audio.

Table 3 Experimental results of different residual connections

Verification of the visual information:

To verify that using dual visual information improves speech separation, this embodiment compares the performance of speech separation models that combine audio with mouth information, audio with facial information, and audio with both facial and lip information. Specifically, the mouth-only features, the face-only features, and the fused mouth and face features obtained by the visual feature extraction module of the proposed model were each used as the final visual feature fused into the separation module, while the network structure of the separation module was kept unchanged. The results are shown in Table 4.

Table 4 Experimental results of different visual cues

The separation performance of the model differs depending on which visual cues are introduced. As the table shows, compared with using only mouth movements or only facial information, the dual visual cues used in this work make fuller use of the speaker's visual and speech information. Compared with the method that uses only facial information, the proposed method improves PESQ by 0.23, SDR by 2.5 dB, and STOI by 0.08, achieving better separation performance.

In different application scenarios, the speech separation model built in the present invention can be used for speech separation:

First, a video containing two speakers and the corresponding audio are acquired;

The face frames, the mouth action frames, and the corresponding audio are input into the speech separation model built by the above method;

Embodiment 2:

As shown in Fig. 8, the present invention also provides an audio-visual combined speech separation device. The device comprises at least one processor and at least one memory, as well as a communication interface and an internal bus. The memory stores a computer-executable program of the speech separation model built by the building method described in Embodiment 1; when the processor executes the computer-executable program stored in the memory, the processor implements the speech separation method. The internal bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. Buses may be divided into address buses, data buses, control buses, and so on. For ease of illustration, the bus in the drawings of the present application is not limited to a single bus or a single type of bus. The memory may include high-speed RAM and may also include non-volatile memory (NVM), such as at least one magnetic disk memory; it may also be a USB flash drive, a removable hard disk, a read-only memory, a magnetic disk, an optical disc, or the like.

The device may be provided as a terminal, a server, or another form of device.

Fig. 8 is a block diagram of a device shown by way of example. The device may include one or more of the following components: a processing component, a memory, a power component, a multimedia component, an audio component, an input/output (I/O) interface, a sensor component, and a communication component. The processing component generally controls the overall operation of the electronic device, such as operations associated with display, telephone calls, data communication, camera operation, and recording. The processing component may include one or more processors that execute instructions to complete all or part of the steps of the above method. In addition, the processing component may include one or more modules that facilitate interaction between the processing component and other components. For example, the processing component may include a multimedia module to facilitate interaction between the multimedia component and the processing component.

The memory is configured to store various types of data to support operation of the electronic device. Examples of such data include instructions for any application or method operating on the electronic device, contact data, phonebook data, messages, pictures, videos, and so on. The memory may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disc.

The power component supplies power to the various components of the electronic device. The power component may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device. The multimedia component includes a screen that provides an output interface between the electronic device and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensors may sense not only the boundary of a touch or swipe action but also the duration and pressure associated with it. In some embodiments, the multimedia component includes a front camera and/or a rear camera. When the electronic device is in an operating mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front camera and rear camera may be a fixed optical lens system or may have focal length and optical zoom capability.

The audio component is configured to output and/or input audio signals. For example, the audio component includes a microphone (MIC) that is configured to receive external audio signals when the electronic device is in an operating mode, such as a call mode, a recording mode, or a voice recognition mode. The received audio signal may be further stored in the memory or sent via the communication component. In some embodiments, the audio component also includes a speaker for outputting audio signals. The I/O interface provides an interface between the processing component and peripheral interface modules, which may be a keyboard, a click wheel, buttons, and the like. The buttons may include, but are not limited to, a home button, volume buttons, a start button, and a lock button.

The sensor component includes one or more sensors that provide status assessments of various aspects of the electronic device. For example, the sensor component can detect the open/closed state of the electronic device and the relative positioning of components, such as the display and keypad of the electronic device. The sensor component can also detect a change in position of the electronic device or of one of its components, the presence or absence of user contact with the electronic device, the orientation or acceleration/deceleration of the electronic device, and changes in its temperature. The sensor component may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component may also include an optical sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.

The communication component is configured to facilitate wired or wireless communication between the electronic device and other devices. The electronic device can access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In one exemplary embodiment, the communication component receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component also includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.

In an exemplary embodiment, the electronic device may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, for performing the above method.

The above are only preferred embodiments of the present application and are not intended to limit it. For those skilled in the art, various modifications and changes may be made to the present application. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present application shall fall within its scope of protection.

Although specific embodiments of the present invention have been described above, they do not limit the scope of protection of the present invention. Those skilled in the art should understand that various modifications or variations that can be made on the basis of the technical solution of the present invention without creative effort remain within the scope of protection of the present invention.

Claims

1. An audio-visual combined voice separation model building method is characterized by comprising the following steps:

step 1, acquiring original data of videos and corresponding audios of a plurality of speakers, wherein the original data are shot or downloaded in different scenes;

step 2, preprocessing the original data obtained in the step 1; processing the videos into images frame by frame, randomly selecting the data of two speakers from the original data, mixing the audio of the two speakers, performing short-time Fourier transform on the mixed voice to obtain a spectrogram of the voice, combining the face frames and mouth action frames corresponding to the data of the two speakers to construct a data set, and dividing the data set into a training set, a verification set and a test set;

step 3, based on a U-Net network structure, establishing residual connections in part of the downsampling convolution blocks of the traditional U-Net to obtain residual convolution blocks, and then adding a BN layer between each convolution and its ReLU activation function in the compression path and the expansion path to construct an audio separation module; based on the ResNet-18 network structure, adding a CBAM attention mechanism before and after the basic convolution blocks of ResNet-18 respectively to construct a face module; based on the ShuffleNet-V2 and TCN network structures, combined with a 3D convolution layer, constructing a mouth action module; combining the three network modules to construct an AV-Resunate network model; wherein a spectrogram of the mixed speech is input into the audio separation module, a face frame is input into the face module, and a mouth action frame is input into the mouth action module;

The input of the audio separation module is specifically as follows: regarding the visual input, the face frame size is 224×224 and is extracted by the network into a facial feature of dimension 128; the mouth action frame input size is 88×88 and is extracted by the network into a mouth feature of dimension 512×64, which is combined with the facial feature to finally obtain a visual feature of dimension 640×64, wherein the visual feature is used as the visual input of the audio separation module; regarding the audio input, the signal spectrogram of the mixed audio is used as the audio input of the audio separation module, its dimension is 2×257×256, and a prediction mask with the same dimension as the input spectrogram is obtained after the network; multiplying each prediction mask with the spectrogram of the mixed voice to obtain the separated spectrogram of each individual speaker, and recovering the clean voice signal of the speaker through inverse short-time Fourier transform;

step 4, training and verifying the AV-Resunate network model built in the step 3 by using the training set and the verification set in the step 2; selecting a model with the best verification effect in the training process as a final test model;

step 5, testing the finally selected AV-Resunate network model by using the data in the test set.

2. The audio-visual combined speech separation model building method according to claim 1, wherein the specific process of preprocessing in the step 2 is: firstly, processing the video into images frame by frame, and selecting one frame as the face frame; for each frame of image, acquiring facial key points by using an SFD face detector, removing position-related differences, locating the lips, then cropping the lip region to a fixed size and converting it to grayscale to serve as a mouth action frame, wherein the number of frames is 64; and then randomly selecting the data of two speakers from the original data, mixing the audio of the two speakers, performing short-time Fourier transform on the mixed voice to obtain a spectrogram of the voice, and combining the face frames and mouth action frames corresponding to the data of the two speakers to construct a data set.

3. The audio-visual combined speech separation model building method according to claim 1, wherein: the audio separation module is improved based on a U-Net network and comprises a conv layer, a res_conv layer, audio-visual feature fusion, an up_conv layer and a Tanh function;

the conv layer consists of a convolution kernel of size 4×4 with a stride of 2, a BN layer and a ReLU activation function, wherein there are two conv layers, located in the compression path and the expansion path of the network respectively, namely unet_conv and unet_conv;

the res_conv layers comprise 6 layers, namely res_conv1, res_conv2, res_conv3, res_conv4, res_conv5 and res_conv6, and each layer consists of two 3×3 convolution kernels with a stride of 1, two BN layers, two ReLU activation functions, one Maxpool layer and one residual connection;

the residual connections are divided into two types according to whether the numbers of input and output channels are the same: when the channel numbers are the same, the input data is added directly to the convolved output before pooling; when the channel numbers are different, the input data is first processed by one convolution kernel and then added;

The audio-visual feature fusion is a process of fusing audio features obtained after the compression path processing with visual features extracted by a visual network in a time dimension to obtain audio-visual fusion features;

the up_conv layer consists of an Upsample layer, a convolution kernel of size 3×3 with a stride of 1, a BN layer and a ReLU activation function, wherein Upsample replaces the Maxpool of the compression path;

and the Tanh function compresses the data to the interval from -1 to 1 and outputs the separation mask, which is multiplied by the spectrogram of the mixed voice to obtain the spectrogram of an individual speaker, and the clean voice of the speaker is then recovered through inverse short-time Fourier transform.

4. The audio-visual combined speech separation model building method according to claim 1, wherein: the face module is improved based on a ResNet-18 network and comprises a conv7 layer, a CBAM layer, a res layer, a pooling layer and a linear layer;

the conv7 layer consists of a convolution kernel of size 7×7 with a stride of 2, a BN layer and a ReLU activation function, and the output of the conv7 layer is used as the input of the CBAM layer;

the CBAM layers each consist of Channel Attention and Spatial Attention, are respectively positioned before the first res layer and after the last res layer, and are used for efficiently extracting the audio-related key face regions and ignoring secondary regions outside the face;

The res layer comprises four layers res1, res2, res3 and res4, and each of the four layers comprises 2 convolution blocks, wherein each convolution block in res1 consists of a 3×3 convolution kernel, a BN layer and a ReLU activation function, and the convolution blocks can be represented by the following formula:

y = ReLU(x + BN(conv3(ReLU(BN(conv3(x))))))

wherein x represents the input of the convolution block and y represents the output of the convolution block; conv3 refers to a 3×3 convolution operation; BN refers to a batch normalization layer; ReLU refers to a ReLU activation function;

the first convolution block in res2, res3 and res4 is the same as res1, the second convolution block is composed of a 3×3 convolution kernel, BN layer, downsampling layer and ReLU activation function, and the second convolution block can be expressed by the following formula:

y = ReLU(Downsample(x) + BN(conv3(ReLU(BN(conv3(x))))))

wherein Downsample refers to a downsampling layer;

the pooling layer comprises maximum pooling and average pooling, wherein the maximum pooling is positioned after the first CBAM layer and is used for reducing the parameter number and simplifying the complexity of the network; the average pooling is located after the second CBAM, the average pooled output being the input to the final linear layer;

the output of the linear layer is taken as the final facial feature extracted by the network, and the final facial feature is combined with the mouth feature after being copied in the time dimension to form the visual feature required by the model.

5. The audio-visual combined speech separation model building method according to claim 1, wherein: the mouth action module is constructed based on the ShuffleNet-V2 and TCN network structures combined with a 3D convolution layer, wherein the 3D convolution layer consists of a convolution kernel of size 5×7×7 with a stride of 1×2×2, a BN layer, a ReLU activation function and a 3D max-pooling layer of size 1×3×3 with a stride of 1×2×2;

The ShuffleNet-V2 network includes convolution layers, pooling layers, a fully connected layer, grouped convolutions and depthwise separable convolutions; the TCN network is composed of a plurality of residual blocks, and maps the time-indexed sequence of feature vectors extracted by the ShuffleNet-V2 network into a new sequence by using 1D temporal convolution, so as to finally obtain a mouth feature of dimension 512×64.

6. The audio-visual combined speech separation model building method according to claim 1, wherein: in the training process, the model constructed in the step 4 uses the complex ideal ratio mask (cIRM) as the training target for the audio, and uses a triplet loss to calculate the similarity between the audio and the facial image, wherein the calculation formula of the cIRM is as follows:

wherein X_r and X_i represent the real and imaginary parts of the mixed speech signal, and S_r and S_i represent the real and imaginary parts of the clean speech.
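For reference, the cIRM is commonly defined in the literature as the complex mask M for which M multiplied by the complex mixture spectrogram reproduces the clean spectrogram. Written with the symbols above, its real part is (X_r·S_r + X_i·S_i) / (X_r² + X_i²) and its imaginary part is (X_r·S_i − X_i·S_r) / (X_r² + X_i²). The sketch below assumes this standard definition, which is not quoted from the claim; the function name `cirm` and the small constant `eps` are hypothetical.

```python
def cirm(x, s, eps=1e-8):
    """Complex ideal ratio mask for one time-frequency bin.

    x is the complex mixture value (X_r + j*X_i) and s is the complex
    clean-speech value (S_r + j*S_i). Standard definition, assumed here.
    """
    denom = x.real ** 2 + x.imag ** 2 + eps
    m_r = (x.real * s.real + x.imag * s.imag) / denom
    m_i = (x.real * s.imag - x.imag * s.real) / denom
    return complex(m_r, m_i)


mixture = 3 + 4j
clean = 1 - 2j
mask = cirm(mixture, clean)
recovered = mask * mixture   # complex multiplication applies the mask
print(abs(recovered - clean) < 1e-6)  # True
```

Applying the mask by complex multiplication recovers both the magnitude and the phase of the clean bin, which is why the spectrogram carries two channels (real and imaginary parts).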

7. The audio-visual combined speech separation model building method according to claim 1, wherein: in the step 2, the time-domain mixed audio is converted into a spectrogram through short-time Fourier transform, the audio is sampled at 16 kHz, the audio segment length is 2.55 s, and the STFT has a window length of 400, a hop size of 160 and an FFT size of 512.

8. An audiovisual combined speech separation method, characterized by comprising the following processes:

acquiring video and corresponding audio containing two speakers;

processing the acquired video and corresponding audio, and respectively extracting the face frames and mouth action frames of each speaker in the video;

inputting a face frame, a mouth action frame and corresponding audio into a speech separation model constructed by the construction method according to any one of claims 1 to 7;

outputting each separated speaker and the corresponding clean voice.

9. An audiovisual combined speech separation device, characterized by: the device includes at least one processor and at least one memory, the processor and the memory being coupled; a computer-executable program of a speech separation model constructed by the construction method according to any one of claims 1 to 7 is stored in the memory; when the processor executes the computer-executable program stored in the memory, the processor implements a speech separation method.