CN112653902B

CN112653902B - Speaker recognition method and device and electronic equipment

Info

Publication number: CN112653902B
Application number: CN201910959899.2A
Authority: CN
Inventors: 王全剑; 黄鹏; 李波
Original assignee: Alibaba Group Holding Ltd
Current assignee: Alibaba Group Holding Ltd
Priority date: 2019-10-10
Filing date: 2019-10-10
Publication date: 2023-04-11
Anticipated expiration: 2039-10-10
Also published as: CN112653902A

Abstract

The embodiments of the present application disclose a speaker recognition method, a device, and electronic equipment. The method includes: separating an audio file and an image file from a video file to be recognized; performing speech segmentation on the audio file to obtain start and end time information and speaker identification information corresponding to each of at least one speech segment, and performing face recognition on the image file to obtain face recognition results corresponding to different times; and aligning the speaker identification information and the face recognition results in time to determine the speaker corresponding to the at least one speech segment. This solution helps to improve the accuracy of speaker recognition.

Description

Speaker recognition method, device and electronic equipment

Technical field

The present application relates to the technical field of speech recognition, and in particular to a speaker recognition method, device, and electronic equipment, and to a video processing method, device, and electronic equipment.

Background

Voiceprint recognition, also known as speaker recognition, analyzes and processes speech signals using computer speech processing technology to determine the identity of the speaker.

In scenarios that may involve multiple speakers, such as live broadcasts and conferences, speaker recognition must not only identify which people are present in the speech signal but also determine which specific speech segments belong to each person. Taking a dialogue between user A and user B as an example, speech segmentation can be performed first to find the speaker change points, divide the whole dialogue into several speech segments, and label each speech segment with its start and end time and a speaker id. Speaker clustering is then performed to merge the speech segments that carry the same speaker id.

Ideally, processing the above dialogue yields two categories: one corresponds to id1 of user A and includes all speech segments spoken by user A during the dialogue; the other corresponds to id2 of user B and includes all speech segments spoken by user B during the dialogue.

In practice, however, ambient noise is likely to cause segmentation errors. For example, one speech segment of user A may be split into two sub-segments whose speaker ids are labeled id1 and id2 respectively. During speaker clustering, the sub-segment labeled id2 is then merged into the category of user B. The set of speakers is identified correctly, but the merged speech segments are wrong.

Alternatively, the two sub-segments may be labeled id1 and id3. In that case, not only are the merged speech segments wrong, but the speaker identification is wrong as well: a dialogue between only two users is recognized as involving three users.

How to improve the accuracy of speaker recognition, especially in spoken-language communication activities with multiple participants (for example, a multi-speaker lecture) where there is strong noise interference such as background music and applause, has therefore become a technical problem to be solved.

Summary of the invention

The present application provides a speaker recognition method, device, and electronic equipment that can perform speaker recognition based on voiceprint features and face features, which helps to improve recognition accuracy.

The present application provides the following solutions:

A speaker recognition method, comprising:

separating an audio file and an image file from a video file to be recognized;

performing speech segmentation on the audio file to obtain start and end time information and speaker identification information corresponding to each of at least one speech segment, and performing face recognition on the image file to obtain face recognition results corresponding to different times;

aligning the speaker identification information and the face recognition results in time to determine the speaker corresponding to the at least one speech segment.

A video file processing method, comprising:

obtaining, by a server, a video file generated by a spoken-language communication activity;

performing speaker recognition according to an audio file and an image file separated from the video file, to obtain the set of speech segments associated with each of the different speakers;

performing speech recognition on the set of speech segments associated with a speaker to obtain the speech content of the speaker;

establishing an association between the speaker and the corresponding speech content.

receiving, by a client, a highlight file sent by a server and obtained by processing a video of a spoken-language communication activity, wherein content highlights of different speakers are added to the highlight file, and the content highlights are extracted from the sets of speech segments associated with the different speakers, as determined by performing speaker recognition on the audio file and the image file separated from the video;

pushing the highlight file to a user associated with the client.

A speaker recognition device, comprising:

a file separation unit, configured to separate an audio file and an image file from a video file to be recognized;

a speech segmentation unit, configured to perform speech segmentation on the audio file and obtain start and end time information and speaker identification information corresponding to each of at least one speech segment;

a face recognition unit, configured to perform face recognition on the image file and obtain face recognition results corresponding to different times;

an alignment processing unit, configured to align the speaker identification information and the face recognition results in time and determine the speaker corresponding to the at least one speech segment.

A video file processing device, applied to a server, comprising:

a video file obtaining unit, configured to obtain a video file generated by a spoken-language communication activity;

a speaker recognition unit, configured to perform speaker recognition according to an audio file and an image file separated from the video file, and obtain the set of speech segments associated with each of the different speakers;

a speech content obtaining unit, configured to perform speech recognition on the set of speech segments associated with a speaker and obtain the speech content of the speaker;

an association establishing unit, configured to establish an association between the speaker and the corresponding speech content.

A video file processing device, applied to a client, comprising:

a highlight file receiving unit, configured to receive a highlight file sent by a server and obtained by processing a video of a spoken-language communication activity, wherein content highlights of different speakers are added to the highlight file, and the content highlights are extracted from the sets of speech segments associated with the different speakers, as determined by performing speaker recognition on the audio file and the image file separated from the video;

a highlight file pushing unit, configured to push the highlight file to a user associated with the client.

An electronic device, comprising:

one or more processors; and

a memory associated with the one or more processors, the memory being configured to store program instructions which, when read and executed by the one or more processors, perform the following operations:

An electronic device, comprising:

one or more processors; and

obtaining a video file generated by a spoken-language communication activity;

An electronic device, comprising:

one or more processors; and

receiving a highlight file sent by a server and obtained by processing a video of a spoken-language communication activity, wherein content highlights of different speakers are added to the highlight file, and the content highlights are extracted from the sets of speech segments associated with the different speakers, as determined by performing speaker recognition on the audio file and the image file separated from the video;

pushing the highlight file to a user associated with the client.

According to the specific embodiments provided by the present application, the present application discloses the following technical effects:

In the embodiments of the present application, audio and images can be separated from a video file, and speech segmentation can be performed on the separated audio file to obtain at least one speech segment together with the start and end time information and speaker identification information corresponding to each speech segment. In addition, face recognition can be performed on the separated image file to obtain face recognition results corresponding to different times. The voiceprint features and face features are then aligned to obtain a speaker clustering result, that is, to determine the speaker corresponding to the at least one speech segment. The alignment may, based on the start and end time of a speech segment, determine whether the face recognition results within that time correspond to the same speaker; alternatively, the alignment may, based on the face recognition results, determine whether adjacent speech segments correspond to the same speaker. Speaker identification based on both the voiceprint features and the face features of the user in this way helps to solve the problem of low recognition accuracy in the prior art.

Of course, a product implementing the present application does not necessarily need to achieve all of the above advantages at the same time.

Description of drawings

To explain the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings used in the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present application; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic diagram of the system provided by an embodiment of the present application;

Fig. 2 is a flowchart of the first method provided by an embodiment of the present application;

Fig. 3 is a schematic diagram of the first communication text provided by an embodiment of the present application;

Fig. 4 is a schematic diagram of the second communication text provided by an embodiment of the present application;

Fig. 5 is a flowchart of the second method provided by an embodiment of the present application;

Fig. 6 is a flowchart of the third method provided by an embodiment of the present application;

Fig. 7 is a flowchart of the fourth method provided by an embodiment of the present application;

Fig. 8 is a schematic diagram of the first device provided by an embodiment of the present application;

Fig. 9 is a schematic diagram of the second device provided by an embodiment of the present application;

Fig. 10 is a schematic diagram of the third device provided by an embodiment of the present application;

Fig. 11 is a schematic diagram of the architecture of the computer system provided by an embodiment of the present application;

Fig. 12 is a schematic diagram of the architecture of the electronic device provided by an embodiment of the present application.

Detailed description

The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in this application fall within the scope of protection of this application.

Example 1

To address the low accuracy of speaker recognition caused by environmental noise and other factors in multi-speaker scenarios, this embodiment provides a tool that performs speaker recognition based on voiceprint features and face features. The recognition process is explained below with specific examples.

The objects to be recognized in this embodiment are mainly video files that contain both audio and images. After the video file to be recognized is obtained, audio and images can first be separated: the video file is split into an audio file from which voiceprint features can be extracted and an image file from which face features can be extracted. For example, the multimedia processing tool FFmpeg (Fast Forward MPEG) can be used to separate the audio file and the image file from the video file.
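The separation step can be sketched as follows. This is only an illustration under stated assumptions, not the implementation described in the application: the helper names `build_ffmpeg_commands` and `separate`, the output file names, the 16 kHz mono WAV format, and the one-frame-per-second sampling rate are all chosen here for the example. The FFmpeg options used (`-vn`, `-acodec`, `-ar`, `-ac`, `-vf fps=`) are standard command-line options.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_commands(video_path, out_dir, fps=1):
    """Return two FFmpeg command lines: one for the audio track, one for sampled frames."""
    out = Path(out_dir)
    audio_cmd = [
        "ffmpeg", "-y", "-i", video_path,
        "-vn",                        # drop the video stream, keep audio only
        "-acodec", "pcm_s16le",       # 16-bit PCM WAV (assumed format)
        "-ar", "16000", "-ac", "1",   # 16 kHz mono (assumed, common for speech)
        str(out / "audio.wav"),
    ]
    frames_cmd = [
        "ffmpeg", "-y", "-i", video_path,
        "-vf", f"fps={fps}",          # sample `fps` frames per second
        str(out / "frame_%05d.jpg"),
    ]
    return audio_cmd, frames_cmd


def separate(video_path, out_dir, fps=1):
    """Run both commands; requires the ffmpeg binary to be installed."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for cmd in build_ffmpeg_commands(video_path, out_dir, fps):
        subprocess.run(cmd, check=True)
```

With `fps=1`, frame number n corresponds approximately to second n of the video, which is what keeps the image results comparable in time with the audio segments.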

For the audio file, speech segmentation can be performed using voiceprint tracking technology to find the speaker change points, divide the audio file into multiple speech segments, and label each speech segment with its start and end time and speaker identification information. For example, if video file 1 is 10 minutes long, performing speech segmentation on the audio file separated from video file 1 yields the speech segment information shown in Table 1.

Table 1

Speech segment      Start and end time (s)    Speaker identifier
Speech segment 1    0-11                      id1
Speech segment 2    12-550                    id2
Speech segment 3    551-600                   id1

For the image file, face recognition can be performed on the image frames it contains, and the face features determine which speakers appear in the image file. As an example, image frames can be extracted from the image file at a preset time interval, and face recognition can be used to determine the identity of the speaker in each frame. For example, extracting one frame per second from the image file separated from video file 1 yields the face recognition results shown in Table 2.

Table 2

Time (s)    Speaker identity
1           User 1
2           User 1
...         ...
12          User 2
13          User 2
14          User 1
...         ...
600         User 2

The speaker identity may be a number assigned during face recognition. For example, the user determined by face recognition on the frame at second 1 can be defined as user 1. If face recognition on the frames at seconds 2 to 11 indicates the same user, the same number continues to be used. If face recognition on the frame at second 12 determines that the face features belong to a new user, that new user can be defined as user 2. Numbers are assigned in the same way to every user that appears in the image file.
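The numbering scheme just described can be sketched as follows. The sketch assumes that a face recognition model has already turned each sampled frame into a non-zero feature vector, and that two frames show the same person when the cosine similarity of their vectors reaches a threshold; the function names, the similarity measure, and the threshold value of 0.8 are assumptions made for this example and are not specified in the application.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def assign_user_numbers(frame_features, threshold=0.8):
    """Map {second: feature vector} to {second: 'user N'} in order of appearance."""
    known = []   # one reference vector per user; index i holds user i + 1
    labels = {}
    for t in sorted(frame_features):
        feature = frame_features[t]
        match = None
        for i, reference in enumerate(known):
            if cosine(feature, reference) >= threshold:
                match = i
                break
        if match is None:            # unseen face: define a new user number
            known.append(feature)
            match = len(known) - 1
        labels[t] = f"user {match + 1}"
    return labels


# Toy two-dimensional features standing in for real face embeddings.
features = {1: [1.0, 0.0], 2: [0.99, 0.05], 12: [0.0, 1.0], 14: [1.0, 0.1]}
labels = assign_user_numbers(features)
```

Here the frames at seconds 1, 2, and 14 receive the label user 1 and the frame at second 12 receives user 2, mirroring the pattern of Table 2.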

Alternatively, a speaker information library can be created in advance to store the association between the identity information of at least one speaker (for example, the speaker's name) and that speaker's face feature information. After face recognition is performed on an image frame and the corresponding face feature information is extracted, the library can be queried to determine the identity of the speaker to whom the face features belong.
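A lookup against such a library might look like the sketch below. It assumes the library is a mapping from speaker name to an enrolled feature vector and that the nearest entry within a distance limit is taken as the match; the names in the example, the Euclidean distance, and the limit of 0.6 are hypothetical choices for illustration only.

```python
import math


def identify(feature, library, max_distance=0.6):
    """Return the name of the closest enrolled speaker, or None if none is close enough."""
    best_name, best_distance = None, float("inf")
    for name, reference in library.items():
        distance = math.dist(feature, reference)   # Euclidean distance
        if distance < best_distance:
            best_name, best_distance = name, distance
    return best_name if best_distance <= max_distance else None


# Hypothetical enrolled speakers with toy two-dimensional features.
library = {"Alice": [0.0, 1.0], "Bob": [1.0, 0.0]}
```

Returning `None` for an unmatched face leaves room to fall back on the number-based labeling when a speaker has not been enrolled.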

After the recognition results of Table 1 and Table 2 are obtained as described above, the voiceprint features and face features can be aligned to obtain the speaker clustering result.

The feature alignment process is illustrated below with specific examples.

In one approach, the target face recognition results within the start and end time of a speech segment are obtained according to the start and end time information of the segment, and it is determined whether the target face recognition results correspond to the same speaker.

Specifically, if the target face recognition results correspond to a single target user, that is, the voiceprint features and face features are already aligned within the start and end time, an association between the speech segment and the target user can be established directly.

Take speech segment 1 in Table 1 as an example. Its start and end time is 0-11 s, and Table 2 shows that the speaker identity recognized from faces throughout this period is user 1; that is, the target face recognition results correspond to only one target user. Speech segment 1 can therefore be determined to be audio produced while user 1 was speaking, and an association between speech segment 1 and user 1 can be established.

If the target face recognition results correspond to at least two users, one target user can be determined from among them. That is, the target face recognition results are merged so that the at least two users are reduced to one target user, which aligns the voiceprint features and face features within the start and end time; the speech segment is then associated with the target user.

Take speech segment 2 in Table 1 as an example. Its start and end time is 12-550 s, and Table 2 shows that the speaker identities recognized from faces in this period include user 1 and user 2; that is, the target face recognition results correspond to two users. A possible cause is that the camera cut to another user while one user was speaking. The speaker is usually one of the two users, so one target user can be determined from them and associated with the speech segment. For example, the user who appears most often within the start and end time can be determined as the target user.

In the above example, if among the image frames for 12-550 s the proportion of frames recognized as user 1 is clearly smaller than the proportion recognized as user 2, user 2 can be determined as the target user. That is, speech segment 2 can be determined to be audio produced while user 2 was speaking, and an association between speech segment 2 and user 2 can be established.
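The first alignment approach, choosing the user who appears most often within a segment's start and end time, can be sketched as a simple vote count. The function name and the representation of the face results as a mapping from second to user label are assumptions for this example; when two users tie, the choice below is arbitrary, since the text does not say how ties should be resolved.

```python
from collections import Counter


def target_user_for_segment(start, end, face_results):
    """Return the user seen most often in [start, end], or None if no frame falls in range.

    face_results maps a time in seconds to the user label recognized at that time.
    """
    votes = Counter(
        user for t, user in face_results.items() if start <= t <= end
    )
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Face results shaped like Table 2: user 1 up to second 11, then user 2,
# with a single frame at second 14 recognized as user 1.
face = {t: "user 1" for t in range(1, 12)}
face.update({t: "user 2" for t in range(12, 601)})
face[14] = "user 1"
```

With this data, the segment 0-11 s maps to user 1 and the segment 12-550 s maps to user 2 despite the stray frame at second 14.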

In another approach, whether adjacent speech segments correspond to the same speaker can be determined according to the face recognition results.

Specifically, the target start and end time for alignment can be determined from the start and end time information of the adjacent speech segments, and the target face recognition results within the target start and end time can be obtained. If the target face recognition results correspond to a single target user, the adjacent speech segments can be determined to correspond to the same speaker, and the adjacent speech segments can be associated with the target user.

Take speech segment 3 in Table 1 as an example. Its start and end time is 551-600 s. Suppose Table 2 shows that the speaker identity for this period is user 2 (the face recognition results for the period may include only user 2, or may include several users with user 2 appearing most often). Then, for the adjacent speech segments 2 and 3, the two correspond to different speaker ids in audio recognition but to the same user identity in image recognition. A possible cause is that, under the influence of environmental noise, speech segmentation split the voice of one user across several ids. In this case, the adjacent speech segments can be mapped to the speaker identified from the images, so that the voiceprint features and face features are aligned within the target start and end time. That is, both speech segment 2 and speech segment 3 can be determined to be audio produced while user 2 was speaking, and an association can be established between speech segments 2 and 3 and user 2.
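One possible reading of the second approach is sketched below: each segment is first mapped to its most frequent face label, and adjacent segments that carry different voiceprint ids but share the same face label are flagged as belonging to one speaker. The function names, the tuple layout, and the decision to compare per-segment majority labels (rather than pooling the frames of both segments) are assumptions of this sketch, not details fixed by the application.

```python
from collections import Counter


def dominant_user(start, end, face_results):
    """Most frequent user label in [start, end], or None if no frame falls in range."""
    votes = Counter(u for t, u in face_results.items() if start <= t <= end)
    return votes.most_common(1)[0][0] if votes else None


def align_adjacent(segments, face_results):
    """segments: list of (start, end, voice_id) sorted by start time.

    Returns (aligned, pairs): aligned is a list of (start, end, voice_id, user);
    pairs lists index pairs of adjacent segments whose voice ids differ but
    whose face-based user is the same.
    """
    aligned = [
        (s, e, vid, dominant_user(s, e, face_results)) for s, e, vid in segments
    ]
    pairs = []
    for i in range(len(aligned) - 1):
        a, b = aligned[i], aligned[i + 1]
        if a[2] != b[2] and a[3] is not None and a[3] == b[3]:
            pairs.append((i, i + 1))
    return aligned, pairs


# Segments from Table 1 and face results shaped like Table 2.
segments = [(0, 11, "id1"), (12, 550, "id2"), (551, 600, "id1")]
face = {t: "user 1" for t in range(1, 12)}
face.update({t: "user 2" for t in range(12, 601)})
face[14] = "user 1"
aligned, pairs = align_adjacent(segments, face)
```

For this data, segments 2 and 3 (indices 1 and 2) are flagged as the same speaker, user 2, even though the voiceprint step labeled them id2 and id1.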

Through the above alignment processing, the speaker clustering results shown in Table 3 below can be obtained.

Table 3

It should be noted that, when one speaker is associated with at least two speech segments, an association between the speaker and each of the at least two speech segments can be established, as shown in Table 3; alternatively, the at least two speech segments can be merged, and an association can then be established between the speaker and the merged speech segment.
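Both options, keeping the segments separate or merging them, can be sketched with one grouping function. The function name, the tuple layout, and the `max_gap` parameter (how many seconds may separate two segments that are still merged) are assumptions introduced for this example.

```python
from collections import defaultdict


def cluster_by_speaker(aligned_segments, max_gap=1):
    """Group (start, end, user) segments by user, merging segments that are at most max_gap seconds apart.

    Pass max_gap=None to keep every segment separate.
    """
    clusters = defaultdict(list)
    for start, end, user in sorted(aligned_segments):
        spans = clusters[user]
        if max_gap is not None and spans and start - spans[-1][1] <= max_gap:
            # Extend the previous span of the same speaker.
            spans[-1] = (spans[-1][0], max(end, spans[-1][1]))
        else:
            spans.append((start, end))
    return dict(clusters)


# Segments after alignment, following the associations described in the text.
aligned = [(0, 11, "user 1"), (12, 550, "user 2"), (551, 600, "user 2")]
merged = cluster_by_speaker(aligned)
separate = cluster_by_speaker(aligned, max_gap=None)
```

In the merged form, user 2 is associated with a single span from 12 s to 600 s; in the separate form, user 2 keeps the two original segments.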

Speaker identification based on both the voiceprint features and the face features of the user in this way helps to solve the problem of low recognition accuracy in the prior art.

It can be understood that the embodiments of the present application can perform speaker recognition on a complete video file; as in the example above, the audio file and image file for 0-600 s can be analyzed. Alternatively, speaker recognition can be performed on the part of the video file that interests the user: for example, the video for seconds 12 to 600 is clipped, audio and images are separated, and speaker recognition is performed following the process described above. Whichever approach is used, it is sufficient that the audio file and the image file remain aligned in time.

It should also be noted that the speaker recognition tool can be deployed on a terminal device associated with the user to analyze video files obtained by the terminal device locally. Alternatively, it can be deployed on a cloud server to analyze video files submitted by terminal devices or other servers, identifying how many speakers a video file contains and which speech segments correspond to each speaker.

It can be understood that the speaker recognition tool provided by the embodiments of the present application can be applied in many scenarios. As an example, it can be applied to spoken-language communication activities with multiple participants, such as multi-person meetings and multi-speaker lectures.

In a multi-person meeting, the identity of each speaker (for example, each participant) can be recognized, and the set of speech segments produced when each speaker spoke at the meeting can be determined. As an example, the viewpoints a speaker expressed at the meeting can be extracted from the set of speech segments corresponding to that speaker. In a multi-speaker lecture, the identity of each speaker (for example, each presenter) can be recognized, and the speech segments produced by each speaker during the lecture can be determined. As an example, the highlights of a speaker's lecture can be extracted from the set of speech segments corresponding to that speaker.

Alternatively, the tool can be applied to films and television dramas to identify the different characters and the set of speech segments corresponding to each character. As an example, plot highlights can be extracted from the sets of speech segments corresponding to the characters.

The processing of a video file is explained below, taking a video recorded in a live lecture broadcast as an example.

实施例2Example 2

针对演讲直播来说，现有技术只能在直播结束后，通过人工分析的方式提取不同演讲人的演讲看点，再手动添加到视频文件中，导致视频处理过程消耗大量的人力成本以及时间成本。For the live broadcast of speeches, the existing technology can only extract the highlights of different speakers’ speeches through manual analysis after the live broadcast is over, and then manually add them to the video file, resulting in a large amount of labor and time cost in the video processing process. .

对应于此，本实施例提供一种可自动进行视频文件处理的工具，如图1所示，可以包括客户端以及服务端。其中，客户端可以部署在用户关联的终端设备上，服务端可以部署在云端服务器上，例如，部署在直播服务器上，且实施例1提供的说话人识别的工具可以作为服务端的一个功能，识别演讲视频中的演讲者身份，并确定不同演讲者关联的语音段的集合。Correspondingly, this embodiment provides a tool that can automatically process video files. As shown in FIG. 1 , it may include a client and a server. Among them, the client can be deployed on the terminal device associated with the user, and the server can be deployed on the cloud server, for example, on the live broadcast server, and the speaker recognition tool provided in Embodiment 1 can be used as a function of the server to identify Speaker identities in lecture videos and determine the set of speech segments associated with different speakers.

下面结合图2所示流程图，对视频文件的处理过程进行解释说明。In the following, the video file processing process will be explained in combination with the flow chart shown in FIG. 2 .

S101: The server obtains a video file produced by a spoken-language communication activity.

In this example, the video file produced by the spoken-language communication activity, that is, the video file produced by the lecture, may be obtained by the live broadcast server that broadcasts the lecture and submitted to the server. If the live broadcast server submits the video file after the live lecture has ended, the server may perform post-broadcast processing and add the corresponding content highlights to the file (in this example, the content highlights take the form of the lecturer's lecture highlights) for users to watch on replay. If the live broadcast server submits the video file during the broadcast, the server may process it in real time and add the corresponding lecture highlights to the portion of the video that has already been broadcast. A user watching the broadcast can then choose either to follow the ongoing lecture online in real time or to replay the portion that has already been broadcast.

For example, for a 2-hour live lecture of which 30 minutes have already been broadcast, the server may analyze the video file produced by those 30 minutes and add lecture highlights, so that users can replay the first 30 minutes of the broadcast and jump directly to the lecture clip corresponding to a given highlight.

S102: Perform speaker recognition based on the audio file and the image file separated from the video file, and obtain the set of speech segments associated with each speaker.

For the specific implementation, refer to the description in Embodiment 1 above; details are not repeated here.

If video file 1 is the video file produced by the lecture, speaker recognition yields the recognition result shown in Table 3, in which speech segment 1 is audio produced while lecturer 1 was speaking, and speech segments 2 and 3 are audio produced while lecturer 2 was speaking.

As an example, a lecturer information base may be created before the lecture begins and used to store the facial feature information of the different lecturers. When the server performs face recognition on the image file, it can then determine a lecturer's identity from the lecturer information base according to the lecturer's facial features. Alternatively, the lecturer information base may store the voiceprint features of the different lecturers, so that when the server tracks voiceprint features in the audio file, it can determine a lecturer's identity from the information base according to the lecturer's voiceprint features. The lecturer information base may be created by the server of this embodiment, or created by another server and then provided to the server of this embodiment; the embodiments of this application place no specific limitation on this.
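The lookup against the lecturer information base can be sketched as a nearest-match search. This is only an illustration under stated assumptions: the text does not say how features are represented or compared, so the fixed-length feature vectors, the cosine similarity measure, the `min_similarity` cutoff, and the names `cosine_similarity` and `identify` are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length feature vectors (assumed representation)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def identify(features, info_base, min_similarity=0.8):
    """Return the identity whose enrolled features best match, or None.

    info_base maps an identity to its enrolled feature vector (face or
    voiceprint). min_similarity is a hypothetical cutoff below which no
    identity is reported.
    """
    best_id, best_sim = None, min_similarity
    for identity, enrolled in info_base.items():
        sim = cosine_similarity(features, enrolled)
        if sim >= best_sim:
            best_id, best_sim = identity, sim
    return best_id


# Toy two-dimensional features, purely for illustration.
info_base = {"lecturer_1": [1.0, 0.0], "lecturer_2": [0.0, 1.0]}
match = identify([0.9, 0.1], info_base)
no_match = identify([0.7, 0.7], info_base)
```

Returning `None` when nothing clears the cutoff keeps an unknown face or voice from being forced onto an enrolled lecturer.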

Preferably, a noise threshold may also be set in advance, and speaker recognition by aligning voiceprint features with facial features is used only when it is determined that noise may affect speech segmentation.

In one approach, if a noise collection device is installed at the lecture venue, the server, after obtaining the noise information collected by the device, may determine whether the noise value it indicates exceeds a preset threshold (for example, a preset decibel value). If the threshold is exceeded, the noise may affect the accuracy of speech segmentation, and the audio file and the image file may be separated from the video file and used for speaker recognition according to the scheme of Embodiment 1. If the threshold is not exceeded, the noise has little effect on speech segmentation, and the audio file alone may be extracted from the video file and used for speaker recognition.
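The threshold decision described above can be sketched as follows. The 60 dB default and the names `RecognitionMode` and `choose_recognition_mode` are assumptions made for illustration; the text only says that the threshold may be a preset decibel value.

```python
from enum import Enum


class RecognitionMode(Enum):
    AUDIO_ONLY = "audio_only"            # noise is low: audio file alone
    AUDIO_AND_IMAGE = "audio_and_image"  # noise is high: Embodiment 1 scheme


def choose_recognition_mode(noise_db, threshold_db=60.0):
    """Pick the recognition scheme from the measured venue noise level.

    threshold_db is a hypothetical preset; a real deployment would tune it.
    """
    if noise_db > threshold_db:
        return RecognitionMode.AUDIO_AND_IMAGE
    return RecognitionMode.AUDIO_ONLY


quiet = choose_recognition_mode(45.0)
noisy = choose_recognition_mode(72.0)
```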

Alternatively, in another approach, after the server obtains the video file and separates its audio and images, it may determine whether noise interference is present from the changes in volume in the audio file. If the volume varies within a certain range, the variation is probably caused by the lecturer speaking, and speaker recognition may be performed on the audio file alone. If the volume changes sharply, there may be noise interference such as applause or background music, and the audio file and the image file separated from the video file may both be used for speaker recognition according to the scheme of Embodiment 1.
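One way to read the volume check is as a test on the spread of per-frame volume levels. The sketch below assumes the levels have already been measured in decibels for frames that contain sound, and that a spread above `max_swing_db` counts as a sharp change; both the measure and the 20 dB default are hypothetical, since the text does not define "a certain range".

```python
def has_noise_interference(levels_db, max_swing_db=20.0):
    """Return True if the volume levels vary by more than max_swing_db.

    levels_db: per-frame volume levels in dB (assumed to be precomputed).
    """
    if len(levels_db) < 2:
        return False
    return max(levels_db) - min(levels_db) > max_swing_db


steady_speech = has_noise_interference([60.0, 62.0, 61.0, 63.0])
with_applause = has_noise_interference([60.0, 62.0, 85.0, 61.0])
```

A production system would likely look at short windows rather than the whole file, so that one burst of applause does not switch the scheme for the entire lecture.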

S103: Perform speech recognition on the set of speech segments associated with the speaker to obtain the speaker's spoken content.

S104: Establish an association between the speaker and the corresponding spoken content.

Specifically, speech recognition technology may be used to convert the speech segments into recognized text, yielding the speaker's spoken content.

In the example above, speech recognition on speech segment 1 yields lecture content 1 (in this example, the spoken content takes the form of the lecturer's lecture content), and an association is established between lecturer 1 and lecture content 1. Speech recognition on speech segments 2 and 3 yields lecture content 2 and lecture content 3 respectively, and an association is established between lecturer 2 and lecture contents 2 and 3.

After video processing, the above example yields the associations shown in Table 4 below.

Table 4

As an example, after speech recognition on speech segments 2 and 3, lecture content 2 and lecture content 3 may also be merged to obtain lecture content 4, and an association is then established between lecturer 2 and lecture content 4.
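Steps S103 and S104, together with the optional merge, can be sketched as below. The speech recognizer is passed in as a callable because the text does not name one; `associate_content`, the `merge` flag, and the placeholder transcripts are hypothetical.

```python
from collections import defaultdict


def associate_content(segments, transcribe, merge=False):
    """Map each speaker to the text recognized from that speaker's segments.

    segments: list of (speaker, segment_id) pairs in time order.
    transcribe: callable turning a segment_id into recognized text
                (stands in for whatever speech recognizer is used).
    merge: if True, join each speaker's texts into a single content item.
    """
    by_speaker = defaultdict(list)
    for speaker, segment_id in segments:
        by_speaker[speaker].append(transcribe(segment_id))
    if merge:
        return {s: [" ".join(texts)] for s, texts in by_speaker.items()}
    return dict(by_speaker)


# Placeholder transcripts standing in for real recognizer output.
fake_transcripts = {
    "segment_1": "content 1",
    "segment_2": "content 2",
    "segment_3": "content 3",
}
segments = [
    ("lecturer_1", "segment_1"),
    ("lecturer_2", "segment_2"),
    ("lecturer_2", "segment_3"),
]
separate = associate_content(segments, fake_transcripts.get)
merged = associate_content(segments, fake_transcripts.get, merge=True)
```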

S105: Extract the speaker's content highlights from the spoken content associated with the speaker.

In the example above, once the lecture content associated with a lecturer has been obtained, the lecturer's lecture highlights, that is, the main points the lecturer made during the lecture, can be extracted from it. For example, the highlights may be extracted by an abstractive or an extractive method; for the specific implementation, refer to the related art, which is not described in detail here.
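To make the extractive option concrete, the sketch below scores each sentence by the average frequency of its words in the whole text and keeps the top-scoring sentences. This is a simple word-frequency heuristic chosen for illustration, not the method the application uses (which it leaves to the related art); `extract_highlights` and `top_k` are hypothetical names, and the tokenizer only handles English text.

```python
import re
from collections import Counter


def extract_highlights(text, top_k=1):
    """Return the top_k highest-scoring sentences, in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    # sorted() is stable, so equally scored sentences keep their original order.
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:top_k])]


text = "Caching cuts latency. Caching cuts cost. Lunch is at noon."
top_two = extract_highlights(text, top_k=2)
```

A practical version would at least drop stop words before counting; an abstractive method would instead generate new summary sentences with a language model.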

For lecturer 1, the lecture highlights may be extracted from lecture content 1. For lecturer 2, the lecture highlights may be extracted from lecture content 2 and lecture content 3 separately, or from the merged lecture content 4.

In addition, since a lecturer's point may take a relatively long time to present, lecture highlights are more likely to be found in speech segments of longer duration. Text recognition may therefore be performed only on those speech segments in the set that have a longer duration. For example, text recognition may be performed on speech segment 2, and lecturer 2's lecture highlights extracted from lecture content 2.
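Selecting the longer segments before recognition can be sketched as a duration filter. The 60-second cutoff and the function name `select_long_segments` are assumptions, as are the times given for segment 3; the text does not say how long a segment must be.

```python
def select_long_segments(segments, min_duration_s=60.0):
    """Return the ids of segments lasting at least min_duration_s seconds.

    segments: list of (segment_id, start_s, end_s) tuples.
    """
    return [
        segment_id
        for segment_id, start_s, end_s in segments
        if end_s - start_s >= min_duration_s
    ]


# Illustrative times for lecturer 2's segments.
lecturer_2_segments = [("segment_2", 12.0, 550.0), ("segment_3", 560.0, 590.0)]
to_transcribe = select_long_segments(lecturer_2_segments)
```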

In summary, a lecturer's lecture highlights can be extracted automatically from the lecture video. Compared with the prior art, in which highlights can only be distilled manually after the broadcast has ended, the embodiments of this application increase the processing speed of video files and reduce labor and time costs. In addition, content that has already been broadcast can be processed in real time during the broadcast, so that users can choose between replay and watching the live broadcast online, which gives them more viewing options.

S106: Post-process the content highlights, generate a highlight file, and deliver it to the client.

In the embodiments of this application, the content highlights may be delivered to the client in different forms. The post-processing is explained below with specific examples.

1. If the smart device associated with the client is a video playback device, the content highlights may be delivered in video form. For example, in a teaching scenario, a video to which content highlights have been added may be sent to the client for the user to watch and study.

In one approach, the server may add the content highlights to the video file and send the video file containing the content highlights to the client for playback. That is, the highlight file generated by post-processing is a video file containing content highlights.

In other words, post-processing may be performed on the original video to add the content highlights. Specifically, the point at which a content highlight is added may be determined from the start and end times of the speech segment corresponding to that highlight. For example, the content highlight may be added at the start time of the corresponding speech segment, so that when the user clicks a content highlight shown on the progress bar of the video file, playback jumps quickly to the clip corresponding to that highlight.
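Placing highlights on the progress bar can be sketched as building one marker per highlight at the start time of its speech segment. The marker fields (`position_s`, `label`, `clip`) and the function name are hypothetical; the text specifies only that the start time of the segment may be used as the insertion point.

```python
def build_progress_markers(highlights, video_duration_s):
    """Build progress-bar markers, one per highlight, sorted by position.

    highlights: list of (text, start_s, end_s) for the matching speech segment.
    Highlights whose start time falls outside the video are skipped.
    """
    markers = []
    for text, start_s, end_s in highlights:
        if 0 <= start_s < video_duration_s:
            markers.append({
                "position_s": start_s,                         # where the marker sits
                "label": text,                                 # what the user sees
                "clip": (start_s, min(end_s, video_duration_s)),  # what a click plays
            })
    return sorted(markers, key=lambda m: m["position_s"])


# Illustrative highlight texts and times.
markers = build_progress_markers(
    [("highlight B", 600.0, 900.0), ("highlight A", 12.0, 550.0)],
    video_duration_s=1800.0,
)
```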

Alternatively, in another approach, a demonstration video simulating the communication activity may be generated from the content highlights, and the demonstration video containing the content highlights sent to the client for playback. That is, the highlight file generated by post-processing is a demonstration video containing content highlights.

In other words, so that users can quickly grasp the main highlights of the communication activity, a demonstration video may be generated from the content highlights. The demonstration video can serve as a summary of the original video, presenting the content highlights of the different speakers to the user as concisely as possible.

Specifically, the characters in the demonstration video may first be determined from the speakers' own likenesses or from other avatars, and the content highlights of each speaker then presented in the form of a dialogue between the characters.

2. If the smart device associated with the client is an audio playback device, the content highlights may be delivered in audio form. For example, once the client is deployed on a smart speaker, an audio file to which content highlights have been added may be sent to the client for the user to listen to.

Specifically, the content highlights may be added to the audio file (for example, the point at which a content highlight is added to the audio file may be determined from the start and end times of the corresponding speech segment), and the audio file containing the content highlights sent to the client for playback. That is, the highlight file generated by post-processing is an audio file containing content highlights.

As an example, after obtaining the highlight file, the client may play the audio directly. Alternatively, it may first play the content highlights added to the highlight file so that the user can select a target content highlight of interest. After the user inputs the target content highlight (with a smart speaker, for example, the user can speak it as a voice command), the client may submit it to the server, and the server sends the audio clip corresponding to the target content highlight to the client for playback. Alternatively, the speakers' identity information may be played first so that the user can select a target speaker of interest; the client and the server then cooperate to push audio clips related to the target speaker to the user.

3. The content highlights may be delivered in text form.

Specifically, a communication text of a type matching the client's display type may be generated from the content highlights and sent to the client for display. That is, the highlight file generated by post-processing is a communication text containing content highlights.

As an example, the client's display type may be text display, for example presenting the content highlights in an email or a post; correspondingly, the server may generate a text-type communication text as shown in FIG. 3 and deliver it to the client for display. Alternatively, the client's display type may be poster display; correspondingly, the server may generate a poster-type communication text as shown in FIG. 4 and deliver it to the client for display.

S107: The client receives the highlight file that the server obtained by processing the video of the spoken-language communication activity, and pushes the highlight file to the user associated with the client.

In the embodiments of this application, the client may push the highlight file to the user in video, audio, or text form; see the description above for details, which are not repeated here.

S108: The client submits a service request to the server, triggering the server to provide related services to the user associated with the client.

As described above, the embodiments of this application can turn a video file into structured data. As a preferred option, the server may also define an array structure as needed to store the associations among speakers, speech segments, content highlights, and other information.
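Such an array structure might look like a list of records, one per speech segment. The field names, the `HighlightRecord` type, and every value except the 12 s to 550 s range (which the example later in this embodiment mentions) are illustrative assumptions and are not taken from the tables of this application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HighlightRecord:
    speaker: str      # identity of the speaker
    segment_id: str   # speech segment the record describes
    start_s: float    # segment start time, in seconds
    end_s: float      # segment end time, in seconds
    highlight: str    # content highlight extracted from the segment


def segments_for_speaker(records, speaker):
    """Return the (start_s, end_s) ranges of all segments of one speaker."""
    return [(r.start_s, r.end_s) for r in records if r.speaker == speaker]


# Illustrative records only.
records = [
    HighlightRecord("lecturer_1", "segment_1", 0.0, 10.0, "highlight 1"),
    HighlightRecord("lecturer_2", "segment_2", 12.0, 550.0, "highlight 2"),
    HighlightRecord("lecturer_2", "segment_3", 560.0, 590.0, "highlight 3"),
]
lecturer_2_ranges = segments_for_speaker(records, "lecturer_2")
```

Keeping one flat record per segment makes the later services (replay by highlight, search by speaker or keyword) simple filters over the same list.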

For video file 1, video processing yields the associations shown in Table 5 below.

Table 5

That is, besides pushing the highlight file delivered by the server to the user, the client may also request a variety of services from the server on the user's behalf. Taking the lecture scenario as an example, the following illustrates how the server provides different services.

1. A personalized live broadcast cover may be generated in response to a forwarding request submitted by the user during the live broadcast.

Specifically, if a user watching the live lecture online wants to share the broadcast, the user may submit a live broadcast forwarding request to the server through the associated client. Correspondingly, the server may obtain the lecture highlights corresponding to the content already broadcast, for example the highlights extracted from the 30 minutes of lecture content broadcast so far, and send them to the client.

As an example, the client may display the lecture highlights delivered by the server so that the user can select a target lecture highlight, and submit the target lecture highlight to the server so that the server can generate a live broadcast cover from it. For example, the server may add the target lecture highlight to the live broadcast cover. Alternatively, it may capture a target image from the lecture clip corresponding to the target lecture highlight and generate the cover from the target image and the target lecture highlight. Specifically, the start and end times corresponding to the target lecture highlight may be determined from Table 3 and Table 5, and the target image captured from the lecture clip spanning those times.

After generating the live broadcast cover from the target lecture highlight, the server may obtain a forwarding link containing the cover. This link may serve as the user's exclusive link and be delivered to the client for the user to forward the broadcast. In this way, the user's own personalized cover is displayed when the user forwards the broadcast. Compared with the prior art, which can only forward a broadcast link with a uniform cover, the forwarding service provided by the embodiments of this application helps improve the user experience and may also attract more users to watch the live lecture through personalized sharing, improving the effectiveness of promotion.

2. On replay, playback may jump directly to the clip the user is interested in.

Specifically, whether replaying after the broadcast has ended or replaying already-broadcast content during the broadcast, the user may submit a playback request to the server through the client, triggering the server to deliver the lecture highlights to the client for display. The user selects a target lecture highlight, which the client submits to the server. Correspondingly, the server may determine the target speech segment corresponding to the target lecture highlight from the associations shown in Table 3 and Table 5, determine the corresponding clip in the video file from the start and end times of the target speech segment, and deliver the clip to the client for playback. That is, the client may receive and play only the clips the user is interested in.

Alternatively, in another approach, if the client can show the user the lecture highlights added to the video file, for example on the progress bar of the video file, the client may obtain the target lecture highlight the user selects, generate a playback request containing it, and submit the request to the server. For example, when the user clicks lecture highlight 2 shown on the progress bar, a playback request for lecture highlight 2 may be generated. Correspondingly, after obtaining the playback request, the server may determine from the associations shown in Table 3 and Table 5 that the user wants to watch the lecture clip from 12 s to 550 s, and may therefore direct the client to load the video resource for 12 s to 550 s and play that portion. That is, the client can both play the complete lecture video and jump to the clips the user is interested in according to the user's actions.
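Resolving a playback request amounts to two lookups: highlight to speech segment, then speech segment to start and end times. In the sketch below the two dictionaries stand in for the associations of Table 5 and Table 3 respectively; the 12 s to 550 s range comes from the example above, while the identifiers and the assumption that lecture highlight 2 belongs to speech segment 2 are hypothetical.

```python
def resolve_playback_range(highlight_id, highlight_to_segment, segment_times):
    """Return (start_s, end_s) of the clip for a highlight, or None if unknown."""
    segment_id = highlight_to_segment.get(highlight_id)
    if segment_id is None:
        return None
    return segment_times.get(segment_id)


# Stand-ins for the stored associations (identifiers are hypothetical).
highlight_to_segment = {"highlight_2": "segment_2"}
segment_times = {"segment_2": (12.0, 550.0)}

clip_range = resolve_playback_range("highlight_2", highlight_to_segment, segment_times)
```

The server could then tell the client to load only this range instead of the whole video.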

For users who listen to the lecture in audio form, after a playback request is submitted through the client, the server may determine from the audio file the clip corresponding to the target lecture highlight chosen by the user and deliver the clip to the client for audio playback.

3. Playback may jump directly to the clip the user is interested in according to search information entered by the user.

Specifically, the client may provide an operation option for submitting search information; after obtaining the search information through this option, it may generate a search request and submit it to the server.

As an example, the search information submitted by the user may be a lecturer's identification information, meaning that the user wants to watch the lecture clips of one or more lecturers. Correspondingly, after obtaining the search request, the server may determine the target speech segments associated with the lecturer from the associations shown in Table 3 and Table 5, determine the lecturer's clips in the video file from the start and end times of the target speech segments, and deliver the clips to the client for playback.

Alternatively, as another example, the search information submitted by the user may be a content keyword, meaning that the user wants to watch lecture clips related to one or more topics. Correspondingly, after obtaining the search request, the server may determine, among the lecture highlights, the target lecture highlight that matches the content keyword, determine the corresponding target speech segment from the associations shown in Table 3 and Table 5, determine the corresponding clip in the video file from the start and end times of the target speech segment, and deliver the clip to the client for playback.
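Keyword retrieval can be sketched as a match of the keyword against each highlight, returning the time ranges of the matching clips. Case-insensitive substring matching is an assumption made here for simplicity (the text does not say how a highlight "matches" a keyword), and the highlight texts and times are illustrative.

```python
def search_clips(keyword, highlights):
    """Return (start_s, end_s) for every highlight whose text contains keyword.

    highlights: list of (text, start_s, end_s) tuples.
    An empty or whitespace-only keyword matches nothing.
    """
    needle = keyword.strip().lower()
    if not needle:
        return []
    return [
        (start_s, end_s)
        for text, start_s, end_s in highlights
        if needle in text.lower()
    ]


# Illustrative highlights.
highlights = [
    ("Edge caching reduces latency", 12.0, 550.0),
    ("Team structure and hiring", 560.0, 590.0),
]
caching_clips = search_clips("Caching", highlights)
```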

Preferably, to improve the accuracy of keyword-based clip retrieval, content keywords may be extracted from the lecture highlights in advance and delivered to the client for display, so that the user can select a content keyword of interest and submit it to the server.

Likewise, for users who listen to the lecture in audio form, after a search request is submitted through the client, the server may determine the corresponding clip in the audio file from the lecturer identification information or content keyword chosen by the user and deliver the clip to the client for audio playback.

It will be understood that the target lecture highlights determined in the examples above may be the same highlight or different highlights, depending mainly on the user's actual needs.

Embodiment 3

Embodiment 3 corresponds to Embodiment 1 and provides a speaker recognition method. Referring to FIG. 5, the method may specifically include:

S201: Separate an audio file and an image file from the video file to be recognized;

S202: Perform speech segmentation on the audio file to obtain start and end time information and speaker identification information corresponding to each of at least one speech segment, and perform face recognition on the image file to obtain face recognition results corresponding to different times;

S203: Align the speaker identification information and the face recognition results in time to determine the speaker corresponding to the at least one speech segment.
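The alignment in S203 can be sketched by collecting the face recognition results whose timestamps fall inside a speech segment and choosing one identity from them. Picking the identity seen in the most frames is an assumption made for this illustration; how a single user is chosen when several faces appear is not specified in this part of the text. The names `align_segment` and `face_results` are hypothetical.

```python
from collections import Counter


def align_segment(start_s, end_s, face_results):
    """Return the identity seen most often during [start_s, end_s], or None.

    face_results: list of (timestamp_s, identity) pairs, one per analyzed
    frame; identity is None when no known face was recognized.
    """
    counts = Counter(
        identity
        for timestamp_s, identity in face_results
        if start_s <= timestamp_s <= end_s and identity is not None
    )
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Illustrative frame-level results, one frame every 5 seconds.
face_results = [(0.0, "A"), (5.0, "A"), (10.0, "B"), (15.0, "A"), (20.0, "B"), (25.0, "B")]
speaker_of_first = align_segment(0.0, 15.0, face_results)
speaker_of_second = align_segment(18.0, 30.0, face_results)
```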

Embodiment 4

Embodiment 4 corresponds to Embodiment 2 and provides a video file processing method from the server's perspective. Referring to FIG. 6, the method may specifically include:

S301: The server obtains a video file produced by a spoken-language communication activity;

S302: Perform speaker recognition based on the audio file and the image file separated from the video file, and obtain the set of speech segments associated with each speaker;

S303: Perform speech recognition on the set of speech segments associated with the speaker to obtain the speaker's spoken content;

S304: Establish an association between the speaker and the corresponding spoken content.

Embodiment 5

Embodiment 5 corresponds to Embodiment 2 and provides a video file processing method from the client's perspective. Referring to FIG. 7, the method may specifically include:

S401: The client receives a highlight file that the server obtained by processing the video of a spoken-language communication activity. Content highlights of different speakers have been added to the highlight file; the content highlights are extracted from the sets of speech segments associated with the different speakers, which are determined by performing speaker recognition on the audio file and the image file separated from the video;

S402: Push the highlight file to the user associated with the client.

For the parts of Embodiments 3 to 5 not described in detail, refer to the description in the preceding embodiments; details are not repeated here.

Corresponding to Embodiment 1, an embodiment of this application further provides a speaker recognition apparatus. Referring to FIG. 8, the apparatus may be applied to a server and includes:

A file separation unit 501, configured to separate an audio file and an image file from the video file to be recognized;

A speech segmentation unit 502, configured to perform speech segmentation on the audio file to obtain start and end time information and speaker identification information corresponding to each of at least one speech segment;

A face recognition unit 503, configured to perform face recognition on the image file to obtain face recognition results corresponding to different times;

An alignment processing unit 504, configured to align the speaker identification information and the face recognition results in time to determine the speaker corresponding to the at least one speech segment.

The device further includes:

a merging processing unit, configured to merge at least one speech segment corresponding to a speaker.

The alignment processing unit is specifically configured to:

determine, according to the start and end time information corresponding to each of adjacent speech segments, a target start and end time for alignment processing; obtain the target face recognition result corresponding to the target start and end time; and, if the target face recognition result corresponds to one target user, merge the adjacent speech segments and associate the merged speech segment with the target user.

obtain, according to the start and end time information corresponding to a speech segment, the target face recognition result corresponding to that start and end time; and, if the target face recognition result corresponds to at least two users, determine one target user from among them and associate the speech segment with the target user.
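The alignment behaviour described above can be illustrated with a minimal sketch. It is a simplification, not the patented implementation: it assumes face recognition results arrive as `(timestamp, [labels])` pairs, picks the face that appears most often within a segment's time window (the rule stated in claim 5), falls back to the audio-only label when no face is found, and merges adjacent segments that resolve to the same user. The names `Segment`, `faces_in_window`, `pick_target_user` and `align` are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Segment:
    start: float      # seconds
    end: float        # seconds
    speaker_id: str   # label from audio-only segmentation; may be wrong under noise


def faces_in_window(face_results, start, end):
    """Collect face labels from frames whose timestamp lies in [start, end]."""
    return [label
            for t, labels in face_results
            for label in labels
            if start <= t <= end]


def pick_target_user(face_results, start, end):
    """Return the most frequently seen face in the window, or None if no face."""
    counts = Counter(faces_in_window(face_results, start, end))
    return counts.most_common(1)[0][0] if counts else None


def align(segments, face_results):
    """Relabel each segment by face evidence and merge adjacent same-user segments."""
    aligned = []
    for seg in sorted(segments, key=lambda s: s.start):
        # Fall back to the audio-only label when no face is visible (assumption).
        user = pick_target_user(face_results, seg.start, seg.end) or seg.speaker_id
        if aligned and aligned[-1].speaker_id == user:
            aligned[-1].end = max(aligned[-1].end, seg.end)
        else:
            aligned.append(Segment(seg.start, seg.end, user))
    return aligned
```

In this sketch, a segment that audio segmentation split or mislabelled is absorbed into its neighbour whenever the face evidence for both windows points to the same user.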

The device further includes:

an information base obtaining unit, configured to obtain a speaker information base, in which face feature information associated with the identity information of at least one speaker is stored;

an identity information determining unit, configured to determine the identity information of the target user from the speaker information base according to the face feature information of the target user.
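As a rough illustration of the identity lookup, the sketch below treats the speaker information base as a mapping from identity to a reference face feature vector and returns the best match by cosine similarity. The source does not specify the feature format or the matching rule, so the vector representation, the similarity measure, the `threshold` value, and the function names are all assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def lookup_identity(face_feature, speaker_db, threshold=0.8):
    """Return the identity whose stored feature best matches, or None below threshold."""
    best_name, best_score = None, threshold
    for name, reference in speaker_db.items():
        score = cosine(face_feature, reference)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name
```

Returning `None` when no entry clears the threshold leaves the target user unidentified rather than forcing a wrong identity.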

Corresponding to Embodiment 2, an embodiment of the present application further provides a video file processing device. Referring to FIG. 9, the device is applied to a server and includes:

a video file obtaining unit 601, configured to obtain a video file generated by a spoken-language communication activity;

a speaker recognition unit 602, configured to perform speaker recognition according to the audio file and the image file separated from the video file, and obtain the set of speech segments associated with each of the different speakers;

a speech content obtaining unit 603, configured to perform speech recognition on the set of speech segments associated with a speaker and obtain the speech content of the speaker;

an association establishing unit 604, configured to establish an association between the speaker and the corresponding speech content.

The device further includes:

a content highlight extraction unit, configured to extract the content highlights of the speaker from the speech content associated with the speaker;

a content highlight processing unit, configured to post-process the content highlights, generate a highlight file, and send it to the client.

The content highlight processing unit is specifically configured to:

if the smart device associated with the client is a video playback device, add the content highlights to the video file and send the video file containing the content highlights to the client for video playback.

if the smart device associated with the client is a video playback device, generate a demonstration video simulating the communication activity according to the content highlights, and send the demonstration video containing the content highlights to the client for video playback.

if the smart device associated with the client is an audio playback device, add the content highlights to the audio file and send the audio file containing the content highlights to the client for audio playback.

according to the display type of the client, generate communication text of the corresponding type from the content highlights, and send the communication text to the client for display.
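The four post-processing alternatives above amount to a dispatch on the kind of device behind the client. A minimal sketch follows; the device type strings, the `simulate` flag, and the returned dictionary layout are hypothetical and chosen only for illustration.

```python
def build_highlight_payload(device_type, highlights, simulate=False):
    """Choose the highlight file form for a client device (illustrative only)."""
    if device_type == "video":
        # Either annotate the original video or render a simulated demonstration video.
        kind = "demo_video" if simulate else "annotated_video"
    elif device_type == "audio":
        kind = "annotated_audio"
    else:
        # Any other display type falls back to communication text.
        kind = "text"
    return {"kind": kind, "highlights": list(highlights)}
```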

If the venue of the communication activity is equipped with a noise collection device, the device further includes:

a noise information obtaining unit, configured to obtain the noise information collected by the noise collection device;

the speaker recognition unit is configured to separate the audio file and the image file from the video file and perform speaker recognition when the noise information indicates that the noise value exceeds a preset threshold.

The device further includes:

an audio file extraction unit, configured to extract an audio file from the video file and perform speaker recognition based on the audio file when the noise information indicates that the noise value does not exceed the preset threshold.
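The noise-dependent choice between the two recognition paths can be sketched as a single comparison. The function name and the returned mode strings are hypothetical; the source does not state the units of the noise value or a concrete threshold.

```python
def choose_recognition_mode(noise_value, threshold):
    """Use face-assisted recognition only when venue noise exceeds the threshold."""
    if noise_value > threshold:
        return "audio+image"   # separate audio and image files, align with faces
    return "audio_only"        # extract audio only and rely on speech segmentation
```

A value exactly equal to the threshold takes the audio-only path, matching the "does not exceed" wording above.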

The video file is a video file obtained during the live broadcast of the communication activity and corresponding to the content whose live broadcast has been completed.

The device further includes:

a content highlight sending unit, configured to send, when a live broadcast forwarding request submitted by the client is obtained, the content highlights corresponding to the content whose live broadcast has been completed to the client, so that the user associated with the client can select a target content highlight from them;

a live broadcast cover generating unit, configured to receive the target content highlight submitted by the client and generate a live broadcast cover according to the target content highlight;

a forwarding link sending unit, configured to obtain a forwarding link containing the live broadcast cover and send it to the client for forwarding the live broadcast.

The association establishing unit is further configured to establish an association among the speaker, the set of speech segments, and the content highlights.

The device further includes:

a content highlight sending unit, configured to send the content highlights to the client when a playback request submitted by the client is obtained, so that the user associated with the client can select a target content highlight from them;

a playback segment determination unit, configured to receive the target content highlight submitted by the client, determine the target speech segment corresponding to the target content highlight, and determine, according to the start and end time of the target speech segment, the playback segment in the video file corresponding to the target content highlight;

a playback segment delivery unit, configured to deliver the playback segment to the client for playback.

The device further includes:

a retrieval request obtaining unit, configured to obtain a retrieval request submitted by the client, the retrieval request including identification information of a speaker;

a playback segment determination unit, configured to determine the target speech segment associated with the speaker and determine, according to the start and end time of the target speech segment, the playback segment in the video file corresponding to the speaker;

The device further includes:

a retrieval request obtaining unit, configured to obtain a retrieval request submitted by the client, the retrieval request including a content keyword;

a playback segment determination unit, configured to determine, from the content highlights, a target content highlight matching the content keyword; obtain the target speech segment corresponding to the target content highlight; and determine, according to the start and end time of the target speech segment, the playback segment in the video file corresponding to the target content highlight;
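Because speakers, speech segments, and content highlights are associated with one another, both kinds of retrieval reduce to filtering highlights and reading off the start and end times of their speech segments. The sketch below assumes each highlight record carries its speaker, text, and segment times; the `Highlight` type, the function names, and the case-insensitive substring match are illustrative assumptions rather than the matching rule of the source.

```python
from dataclasses import dataclass


@dataclass
class Highlight:
    speaker: str   # identity of the associated speaker
    text: str      # content highlight text
    start: float   # start time of the associated speech segment, in seconds
    end: float     # end time of the associated speech segment, in seconds


def clips_by_speaker(highlights, speaker):
    """Playback segments (start, end) for every highlight of the given speaker."""
    return [(h.start, h.end) for h in highlights if h.speaker == speaker]


def clips_by_keyword(highlights, keyword):
    """Playback segments whose highlight text contains the keyword, ignoring case."""
    needle = keyword.lower()
    return [(h.start, h.end) for h in highlights if needle in h.text.lower()]
```

The returned time ranges would then be used to cut the corresponding playback segments from the video file.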

Corresponding to Embodiment 2, an embodiment of the present application further provides a video file processing device. Referring to FIG. 10, the device is applied to a client and includes:

a highlight file receiving unit 701, configured to receive a highlight file sent by the server, obtained by processing a video of a spoken-language communication activity. The highlight file contains content highlights of different speakers; the content highlights are extracted from the sets of speech segments associated with the different speakers, which are determined by performing speaker recognition on the audio file and the image file separated from the video;

a highlight file pushing unit 702, configured to push the highlight file to the user associated with the client.

In addition, an embodiment of the present application further provides an electronic device, including:

one or more processors; and

and an electronic device, including:

one or more processors; and

and an electronic device, including:

one or more processors; and

FIG. 11 shows an example architecture of a computer system, which may include a processor 810, a video display adapter 811, a disk drive 812, an input/output interface 813, a network interface 814, and a memory 820. The processor 810, video display adapter 811, disk drive 812, input/output interface 813, network interface 814, and memory 820 may be communicatively connected through a communication bus 830.

The processor 810 may be implemented as a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits, and is used to execute the relevant programs so as to realize the technical solutions provided by the present application.

The memory 820 may be implemented as ROM (Read Only Memory), RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 820 may store an operating system 821 for controlling the operation of the computer system 800, and a basic input/output system (BIOS) for controlling low-level operations of the computer system 800. It may also store a web browser 823, a data storage management system 824, a video file processing system 825, and so on. The video file processing system 825 may be the server that implements the operations of the foregoing steps in the embodiments of the present application. In short, when the technical solutions provided by the present application are implemented through software or firmware, the relevant program code is stored in the memory 820 and is called and executed by the processor 810.

The input/output interface 813 is used to connect an input/output module to realize information input and output. The input/output module may be configured in the device as a component (not shown in the figure), or may be connected externally to the device to provide the corresponding functions. Input devices may include a keyboard, mouse, touch screen, microphone, and various sensors; output devices may include a display, speaker, vibrator, and indicator light.

The network interface 814 is used to connect a communication module (not shown in the figure) to realize communication between this device and other devices. The communication module may communicate in a wired manner (for example USB or a network cable) or in a wireless manner (for example a mobile network, WiFi, or Bluetooth).

The bus 830 includes a path that carries information between the components of the device (for example the processor 810, video display adapter 811, disk drive 812, input/output interface 813, network interface 814, and memory 820).

It should be noted that although only the processor 810, video display adapter 811, disk drive 812, input/output interface 813, network interface 814, memory 820, and bus 830 are shown for the above device, in a specific implementation the device may also include other components necessary for normal operation. In addition, those skilled in the art will understand that the above device may include only the components necessary to implement the solution of the present application, and need not include all the components shown in the figure.

FIG. 12 shows an example architecture of an electronic device. For example, the device 900 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, an aircraft, and so on.

Referring to FIG. 12, the device 900 may include one or more of the following components: a processing component 902, a memory 904, a power supply component 906, a multimedia component 908, an audio component 910, an input/output (I/O) interface 912, a sensor component 914, and a communication component 916.

The processing component 902 generally controls the overall operation of the device 900, such as operations associated with display, telephone calls, data communication, camera operation, and recording. The processing component 902 may include one or more processors 920 to execute instructions so as to complete all or part of the steps of the method provided by the technical solution of the present disclosure. In addition, the processing component 902 may include one or more modules that facilitate interaction between the processing component 902 and other components. For example, the processing component 902 may include a multimedia module to facilitate interaction between the multimedia component 908 and the processing component 902.

The memory 904 is configured to store various types of data to support operation of the device 900. Examples of such data include instructions for any application or method operating on the device 900, contact data, phonebook data, messages, pictures, videos, and the like. The memory 904 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disk.

The power supply component 906 provides power to the various components of the device 900. The power supply component 906 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device 900.

The multimedia component 908 includes a screen that provides an output interface between the device 900 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may sense not only the boundary of a touch or swipe action, but also the duration and pressure associated with the touch or swipe operation. In some embodiments, the multimedia component 908 includes a front camera and/or a rear camera. When the device 900 is in an operating mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front camera and rear camera may be a fixed optical lens system or have focal length and optical zoom capability.

The audio component 910 is configured to output and/or input audio signals. For example, the audio component 910 includes a microphone (MIC), which is configured to receive external audio signals when the device 900 is in an operating mode, such as a call mode, a recording mode, or a speech recognition mode. The received audio signals may be further stored in the memory 904 or sent via the communication component 916. In some embodiments, the audio component 910 also includes a speaker for outputting audio signals.

The I/O interface 912 provides an interface between the processing component 902 and peripheral interface modules, which may be a keyboard, a click wheel, buttons, and the like. These buttons may include, but are not limited to, a home button, volume buttons, a start button, and a lock button.

The sensor component 914 includes one or more sensors for providing status assessments of various aspects of the device 900. For example, the sensor component 914 can detect the open/closed state of the device 900 and the relative positioning of components, such as the display and keypad of the device 900. The sensor component 914 can also detect a change in position of the device 900 or of one of its components, the presence or absence of user contact with the device 900, the orientation or acceleration/deceleration of the device 900, and temperature changes of the device 900. The sensor component 914 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 914 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 914 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.

The communication component 916 is configured to facilitate wired or wireless communication between the device 900 and other devices. The device 900 can access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 916 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 916 also includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.

In an exemplary embodiment, the device 900 may be implemented by one or more application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, for performing the above methods.

In an exemplary embodiment, a non-transitory computer-readable storage medium including instructions is also provided, for example the memory 904 including instructions, which can be executed by the processor 920 of the device 900 to complete the method provided by the technical solution of the present disclosure. For example, the non-transitory computer-readable storage medium may be ROM, random access memory (RAM), CD-ROM, magnetic tape, a floppy disk, an optical data storage device, and the like.

From the above description of the embodiments, those skilled in the art can clearly understand that the present application can be implemented by means of software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments, or in certain parts of the embodiments, of the present application.

The embodiments in this specification are described in a progressive manner; for the parts that are the same or similar among the embodiments, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, the system or system embodiments are described relatively briefly because they are basically similar to the method embodiments; for the relevant parts, refer to the description of the method embodiments. The systems and system embodiments described above are only illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.

The speaker recognition method, device and electronic equipment, and the video file processing method, device and electronic equipment provided by the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the description of the above embodiments is only intended to help in understanding the method of the present application and its core idea. At the same time, for those of ordinary skill in the art, there will be changes in the specific implementation and the scope of application according to the idea of the present application. In summary, the content of this specification should not be construed as limiting the present application.

Claims

1. A speaker recognition method, characterized by comprising:

separating and obtaining an audio file and an image file from a video file to be recognized;

performing speech segmentation on the audio file to obtain start and end time information and speaker identification information corresponding to each of at least one speech segment, and performing face recognition on the image file to obtain face recognition results corresponding to image frames at different time points; wherein the at least one speech segment obtained by the speech segmentation includes speech segments whose start and end times are split incorrectly, and/or whose speaker identification information is wrong, under the influence of noise information contained in the audio file;

aligning in time the start and end time information and the speaker identification information of the speech segments obtained by the speech segmentation with the face recognition results, and determining the speaker corresponding to the at least one speech segment, so that the start and end time information and the speaker identification information of the speech segments obtained by the speech segmentation are corrected according to the face recognition results.

2. The method according to claim 1, further comprising:

Merging processing is performed on at least one speech segment corresponding to the speaker.

3. The method of claim 1, wherein,

The time-aligned processing of the speaker identification information and the face recognition results includes:

According to the corresponding start and end time information of adjacent speech segments, determine the target start and end time for alignment processing;

Obtain the target face recognition result corresponding to the target start and end time;

If the target face recognition result corresponds to a target user, the adjacent speech segments are merged, and the merged speech segment is associated with the target user.

4. The method of claim 1, wherein,

According to the start and end time information corresponding to the speech segment, obtain the target face recognition result corresponding to the start and end time;

If the target face recognition result corresponds to at least two users, then determine a target user, and associate the speech segment with the target user.

5. The method of claim 4, wherein,

Determining a target user from said at least two users, comprising:

The user with the largest number of appearances within the start and end time period is determined as the target user.

6. The method according to any one of claims 3 to 5, further comprising:

Obtaining a speaker information base, which stores facial feature information associated with at least one speaker's identity information;

The identity information of the target user is determined from the speaker information database according to the facial feature information of the target user.

7. A video file processing method, characterized by comprising:

obtaining, by a server, a video file generated by a spoken-language communication activity;

performing speaker recognition according to the audio file and the image file separated from the video file, and obtaining sets of speech segments associated with different speakers; wherein the sets of speech segments are obtained in the following manner: performing speech segmentation on the audio file to obtain start and end time information and speaker identification information corresponding to each of at least one speech segment, wherein the at least one speech segment obtained by the speech segmentation includes speech segments whose start and end times are split incorrectly, and/or whose speaker identification information is wrong, under the influence of noise information contained in the audio file; performing face recognition on the image file to obtain face recognition results corresponding to image frames at different time points; and correcting, according to the face recognition results, the start and end time information and the speaker identification information of the speech segments obtained by the speech segmentation;

performing speech recognition on a set of speech segments associated with the speaker to obtain the speech content of the speaker;

An association relationship between the speaker and the corresponding speech content is established.

8. The method according to claim 7, further comprising:

Extracting the content highlights of the speaker from the speech content associated with the speaker;

Perform post-processing on the content highlight, generate a highlight file and send it to the client.

9. The method of claim 8, wherein

Perform post-processing on the highlights of the content, generate highlight files and send them to the client, including:

If the smart device associated with the client is a video playback device, adding the content highlights to the video file, and sending the video file containing the content highlights to the client for video playback.

10. The method of claim 8, wherein,

If the smart device associated with the client is a video playback device, a demonstration video of a simulated communication activity is generated according to the content highlights, and the demonstration video including the content highlights is sent to the client for video playback.

11. The method of claim 8, wherein,

If the smart device associated with the client is an audio playback device, adding the content highlights to the audio file, and sending the audio file containing the content highlights to the client for audio playback.

12. The method of claim 8, wherein,

According to the display type of the client, generate a corresponding type of communication text from the content highlights, and send the communication text to the client for display.
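As a non-limiting illustration, the device-dependent post-processing of claims 9 to 12 might be sketched as a simple dispatch. The device type strings, the `simulate` flag used to choose between the alternatives of claims 9 and 10, and the returned labels are all hypothetical names introduced for this sketch.

```python
def plan_highlight_file(device_type: str, simulate: bool = False) -> str:
    """Pick the kind of highlight file to generate for a client device.

    device_type values and return labels are assumptions of this sketch.
    """
    if device_type == "video":
        # Claim 9: add highlights to the original video.
        # Claim 10: generate a demonstration video of a simulated activity.
        return "demo_video_with_highlights" if simulate else "video_with_highlights"
    if device_type == "audio":
        # Claim 11: add highlights to the audio file.
        return "audio_with_highlights"
    # Claim 12: fall back to text suited to the client's display type.
    return "communication_text"
```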

13. The method according to claim 7, characterized in that, if the venue of the communication activity is equipped with a noise collection device, the method further comprises:

obtaining noise information collected by the noise collection device;

If the noise information indicates that the noise value exceeds a preset threshold, then separate the audio file and the image file from the video file.

14. The method of claim 13, further comprising:

If the noise information indicates that the noise value does not exceed the preset threshold, an audio file is extracted from the video file, and speaker recognition is performed based on the audio file.
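As a non-limiting illustration, the noise-dependent choice between claims 13 and 14 might be sketched as follows. The function name, the use of decibels as the unit, and the returned mode labels are assumptions of this sketch; the claims only require comparing a noise value with a preset threshold.

```python
def choose_recognition_mode(noise_value: float, threshold: float) -> str:
    """Select the speaker recognition path from the measured noise value.

    Above the threshold, audio and image files are both separated so that
    face recognition can correct the segmentation (claim 13); otherwise
    only the audio file is extracted (claim 14).
    """
    return "audio+image" if noise_value > threshold else "audio-only"
```

For example, with a hypothetical threshold of 60 dB, a venue measured at 70 dB would take the combined audio and image path, while a venue at exactly 60 dB would not, since the value does not exceed the threshold.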

15. The method of claim 7, wherein,

The video file is obtained during a live broadcast of the communication activity and corresponds to the content of the live broadcast.

16. The method of claim 15, further comprising:

If the live broadcast forwarding request submitted by the client is obtained, the content highlights corresponding to the completed live broadcast content are sent to the client, for users associated with the client to select target content highlights;

receiving the target content highlights submitted by the client, and generating a live broadcast cover according to the target content highlights;

A forwarding link including the cover of the live broadcast is obtained, and delivered to the client for forwarding the live broadcast.

17. The method of claim 8, further comprising:

An association between the speaker, the set of speech segments, and the content highlights is established.

18. The method of claim 17, further comprising:

If the playback request submitted by the client is obtained, the content highlights are sent to the client, for users associated with the client to select target content highlights;

receiving the target content highlights submitted by the client;

Determine the target speech segment corresponding to the target content highlights, and determine the playback segment corresponding to the target content highlights according to the start and end time of the target speech segment;

Send the playback segment to the client for playback.

19. The method of claim 17, further comprising:

Obtain a retrieval request submitted by the client, where the retrieval request includes speaker identification information;

Determine the target speech segment associated with the speaker, and determine the playback segment corresponding to the speaker according to the start and end time of the target speech segment;

Send the playback segment to the client for playback.

20. The method of claim 17, further comprising:

Obtain a retrieval request submitted by the client, where the retrieval request includes content keywords;

From the content highlights, determine target content highlights that match the content keywords;

Obtain the target speech segment corresponding to the target content highlights, and determine the playback segment corresponding to the target content highlights according to the start and end time of the target speech segment;

Send the playback segment to the client for playback.
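As a non-limiting illustration, the association of claim 17 and the three lookups of claims 18 to 20 might be sketched as follows. The `HighlightRecord` type, its field names, and the case-insensitive substring match used for keywords are hypothetical; the claims do not prescribe a storage layout or a matching rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HighlightRecord:
    speaker: str     # speaker identification information
    highlight: str   # content highlight extracted from the speech content
    start: float     # start time of the associated speech segment (seconds)
    end: float       # end time of the associated speech segment (seconds)


def playback_for_highlight(index, target):
    """Claim 18: playback segments for a highlight chosen by the user."""
    return [(r.start, r.end) for r in index if r.highlight == target]


def playback_for_speaker(index, speaker):
    """Claim 19: playback segments for a requested speaker."""
    return [(r.start, r.end) for r in index if r.speaker == speaker]


def playback_for_keyword(index, keyword):
    """Claim 20: playback segments whose highlight matches a keyword.

    A case-insensitive substring match is an assumption of this sketch.
    """
    kw = keyword.lower()
    return [(r.start, r.end) for r in index if kw in r.highlight.lower()]
```

Each lookup returns the start and end times of the matching speech segments, which is the information needed to cut the playback segment that is sent to the client.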

21. A video file processing method, characterized in that, comprising:

A client receives a highlight file sent by a server and obtained by processing a video of a spoken-language communication activity, wherein content highlights of different speakers are added to the highlight file, and the content highlights are extracted from sets of speech segments associated with different speakers, the sets being determined by performing speaker recognition on an audio file and an image file separated from the video; wherein, when the speaker recognition is performed, speech segmentation is performed on the audio file to obtain start and end time information and speaker identification information corresponding to each of at least one speech segment, and face recognition is performed on the image file to obtain face recognition results corresponding to image frames at different time points; wherein the at least one speech segment obtained by the speech segmentation includes speech segments whose start and end times are wrongly segmented, and/or whose speaker identification information is wrong, owing to noise information contained in the audio file; and the start and end time information and the speaker identification information of the speech segments obtained by the speech segmentation are aligned in time with the face recognition results to determine the speaker corresponding to the at least one speech segment, so that the start and end time information and the speaker identification information of the speech segments obtained by the speech segmentation are corrected according to the face recognition results;

Push the highlight file to the user associated with the client.

22. A speaker recognition device, comprising:

A file separation unit, configured to separate an audio file and an image file from a video file to be recognized;

A speech segmentation unit, configured to perform speech segmentation on the audio file to obtain start and end time information and speaker identification information corresponding to each of at least one speech segment; wherein the at least one speech segment obtained by the speech segmentation includes speech segments whose start and end times are wrongly segmented, and/or whose speaker identification information is wrong, owing to noise information contained in the audio file;

A face recognition unit, configured to perform face recognition on the image file, and obtain face recognition results corresponding to image frames at different time points;

An alignment processing unit, configured to align in time the start and end time information and the speaker identification information of the speech segments obtained by the speech segmentation with the face recognition results, and to determine the speaker corresponding to the at least one speech segment, so that the start and end time information and the speaker identification information of the speech segments obtained by the speech segmentation are corrected according to the face recognition results.
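As a non-limiting illustration, one way an alignment processing unit could correct start and end times is to split a speech segment wherever the recognized face changes. This is a minimal sketch under stated assumptions: the function name, the `(timestamp, face_id)` format of the face recognition results, and the rule that a new boundary is placed at the first frame showing a different face are all hypothetical.

```python
def split_on_face_change(start, end, speaker, face_results):
    """Split the span [start, end) at each change of recognized face.

    face_results is a list of (timestamp_seconds, face_id) pairs (assumed
    format). Returns a list of (start, end, speaker_id) tuples. When no
    frame falls inside the span, the original segment is returned as is.
    """
    frames = sorted((t, fid) for t, fid in face_results if start <= t < end)
    if not frames:
        return [(start, end, speaker)]

    pieces = []
    piece_start, current = start, frames[0][1]
    for t, fid in frames[1:]:
        if fid != current:
            # Close the running piece at the first frame with a new face.
            pieces.append((piece_start, t, current))
            piece_start, current = t, fid
    pieces.append((piece_start, end, current))
    return pieces
```

With this rule, a segment that noise caused to span two speakers is cut into two segments, each carrying the identifier of the face recognized during it.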

23. A video file processing device, characterized in that it is applied to a server, comprising:

A video file obtaining unit, configured to obtain a video file generated by a spoken-language communication activity;

A speaker recognition unit, configured to perform speaker recognition according to an audio file and an image file separated from the video file, to obtain sets of speech segments associated with different speakers; wherein the sets of speech segments are obtained in the following manner: performing speech segmentation on the audio file to obtain start and end time information and speaker identification information corresponding to each of at least one speech segment, wherein the at least one speech segment obtained by the speech segmentation includes speech segments whose start and end times are wrongly segmented, and/or whose speaker identification information is wrong, owing to noise information contained in the audio file; performing face recognition on the image file to obtain face recognition results corresponding to image frames at different time points; and correcting, according to the face recognition results, the start and end time information and the speaker identification information of the speech segments obtained by the speech segmentation;

A speech content obtaining unit, configured to perform speech recognition on a set of speech segments associated with the speaker, and obtain the speech content of the speaker;

An association relationship establishing unit, configured to establish an association relationship between the speaker and the corresponding speech content.

24. A video file processing device, characterized in that it is applied to a client, comprising:

A highlight file receiving unit, configured to receive a highlight file sent by a server and obtained by processing a video of a spoken-language communication activity, wherein content highlights of different speakers are added to the highlight file, and the content highlights are extracted from sets of speech segments associated with different speakers, the sets being determined by performing speaker recognition on an audio file and an image file separated from the video; wherein, when the speaker recognition is performed, speech segmentation is performed on the audio file to obtain start and end time information and speaker identification information corresponding to each of at least one speech segment, and face recognition is performed on the image file to obtain face recognition results corresponding to image frames at different time points; wherein the at least one speech segment obtained by the speech segmentation includes speech segments whose start and end times are wrongly segmented, and/or whose speaker identification information is wrong, owing to noise information contained in the audio file; and the start and end time information and the speaker identification information of the speech segments obtained by the speech segmentation are aligned in time with the face recognition results to determine the speaker corresponding to the at least one speech segment, so that the start and end time information and the speaker identification information of the speech segments obtained by the speech segmentation are corrected according to the face recognition results;

A highlight file pushing unit, configured to push the highlight file to users associated with the client.

25. An electronic device, characterized in that it comprises:

one or more processors; and

A memory associated with the one or more processors, the memory is used to store program instructions, and when the program instructions are read and executed by the one or more processors, perform the following operations:

26. An electronic device, characterized in that it comprises:

one or more processors; and

Obtaining a video file generated by a spoken-language communication activity;

27. An electronic device, comprising:

one or more processors; and

Receiving a highlight file sent by a server and obtained by processing a video of a spoken-language communication activity, wherein content highlights of different speakers are added to the highlight file, and the content highlights are extracted from sets of speech segments associated with different speakers, the sets being determined by performing speaker recognition on an audio file and an image file separated from the video; wherein, when the speaker recognition is performed, speech segmentation is performed on the audio file to obtain start and end time information and speaker identification information corresponding to each of at least one speech segment, and face recognition is performed on the image file to obtain face recognition results corresponding to image frames at different time points; wherein the at least one speech segment obtained by the speech segmentation includes speech segments whose start and end times are wrongly segmented, and/or whose speaker identification information is wrong, owing to noise information contained in the audio file; and the start and end time information and the speaker identification information of the speech segments obtained by the speech segmentation are aligned in time with the face recognition results to determine the speaker corresponding to the at least one speech segment, so that the start and end time information and the speaker identification information of the speech segments obtained by the speech segmentation are corrected according to the face recognition results;

Push the highlight file to the user associated with the client.