CN116705031A

CN116705031A - Violation audio detection method and device, electronic equipment, storage medium

Info

Publication number: CN116705031A
Application number: CN202310827676.7A
Authority: CN
Inventors: 曾然然
Original assignee: China Telecom Technology Innovation Center; China Telecom Corp Ltd
Current assignee: China Telecom Technology Innovation Center; China Telecom Corp Ltd
Priority date: 2023-07-06
Filing date: 2023-07-06
Publication date: 2023-09-05

Abstract

The embodiments of the present application disclose a method and device for detecting illegal audio, an electronic device, and a storage medium. The method comprises: acquiring audio to be detected, extracting the human voice audio from it, and converting the human voice audio into text information; inputting the text information into an illegal tone detection module to detect whether an illegal tone exists in the audio to be detected, wherein the illegal tone detection module performs detection based on at least one of the modal particles in the text information and the duration of the audio to be detected; and, if an illegal tone exists in the audio to be detected, inputting the text information into a preset violation detection model to detect whether the audio to be detected is illegal audio. The technical solution of the embodiments can detect illegal audio quickly and accurately while saving labor and time costs.

Description

Violation audio detection method and device, electronic equipment, storage medium

Technical field

The present application relates to the technical field of artificial intelligence, and in particular to a method and device for detecting illegal audio, an electronic device, and a computer-readable storage medium.

Background

With the development of multimedia technology, users producing user generated content (UGC) can compose and create their own audio, or adapt the lyrics of existing songs, and then publish the result online, where other users can listen to it. However, audio uploaded to the network may contain illegal content, and a series of works created by the same singer often all contain violations.

Therefore, audio content is checked for violations before it is uploaded. Most existing violation detection judges directly from text: the audio content is converted into text by speech recognition, and the text is then used to judge whether the audio content is in violation. Alternatively, detection is implemented by audio similarity matching. However, because audio content often contains musical melodies and background sound, the detection rate of existing violation detection is low under their influence. In addition, song content may involve many types of violations, complex scenarios, and multiple languages and dialects, which makes it very difficult to define judgment criteria.

At the same time, in the prior art, audio content review is mainly completed by a dual review combining machine review and manual review. There is an urgent need for a violation detection solution that reduces labor and time costs and improves the efficiency of audio content review, helping operation review teams block and delete audio content that harms the health and safety of the network, and thereby build a new pan-entertainment ecosystem that conveys correct values in a healthy way.

Summary of the invention

To solve the above technical problems, the embodiments of the present application provide a method and device for detecting illegal audio, an electronic device, and a computer-readable storage medium, aiming to solve the technical problems that existing violation review consumes considerable labor and time, and that the efficiency and accuracy of audio content review are low.

Other features and advantages of the present application will become apparent from the following detailed description, or will be learned in part through practice of the present application.

According to one aspect of the embodiments of the present application, a method for detecting illegal audio is provided, including:

acquiring the audio to be detected, extracting the human voice audio from the audio to be detected, and converting the human voice audio into text information;

inputting the text information into an illegal tone detection module to detect whether an illegal tone exists in the audio to be detected, wherein the illegal tone detection module performs detection based on at least one of the modal particles in the text information and the duration of the audio to be detected;

if an illegal tone exists in the audio to be detected, inputting the text information into a preset violation detection model to detect whether the audio to be detected is illegal audio.

In another embodiment, inputting the text information into the illegal tone detection module to detect whether an illegal tone exists in the audio to be detected includes:

acquiring the duration information of the audio to be detected, and extracting the number of modal particles and the number of words in the text information, as well as the character count of the text information;

calculating a first ratio of the number of modal particles to the number of words, and a second ratio of the character count to the duration information;

detecting whether an illegal tone exists in the audio to be detected according to the first ratio and the second ratio.

In another embodiment, detecting whether an illegal tone exists in the audio to be detected according to the first ratio and the second ratio includes:

matching the first ratio against a first detection threshold to obtain a first matching result, and matching the second ratio against a second detection threshold to obtain a second matching result;

if the first matching result indicates that the first ratio is greater than the first detection threshold and the second matching result indicates that the second ratio is less than the second detection threshold, determining that an illegal tone exists in the audio to be detected;

if the first matching result indicates that the first ratio is less than or equal to the first detection threshold, or the second matching result indicates that the second ratio is greater than or equal to the second detection threshold, determining that no illegal tone exists in the audio to be detected.
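
The two-ratio comparison described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the modal particle list, the threshold values, and the names `MODAL_PARTICLES`, `ToneThresholds`, and `has_illegal_tone` are assumptions introduced here, and the text is assumed to be already split into words by some upstream tokenizer.

```python
from dataclasses import dataclass
from typing import Sequence

# Hypothetical particle list; the source does not enumerate the modal particles.
MODAL_PARTICLES = frozenset({"啊", "呀", "嗯", "哦", "哎", "唉"})


@dataclass(frozen=True)
class ToneThresholds:
    """Hypothetical container for the two detection thresholds."""
    particle_ratio: float    # first detection threshold
    chars_per_second: float  # second detection threshold


def has_illegal_tone(words: Sequence[str], text: str, duration_s: float,
                     thresholds: ToneThresholds) -> bool:
    """Return True only if the first ratio is above its threshold AND the
    second ratio is below its threshold, as in the embodiment above."""
    if not words or duration_s <= 0:
        # Assumption: with nothing to measure, report no illegal tone.
        return False
    particle_count = sum(1 for w in words if w in MODAL_PARTICLES)
    first_ratio = particle_count / len(words)   # particles / words
    second_ratio = len(text) / duration_s       # characters / second
    return (first_ratio > thresholds.particle_ratio
            and second_ratio < thresholds.chars_per_second)
```

For example, a 10-second clip whose transcript is mostly particles and contains few characters per second would be flagged under thresholds such as `ToneThresholds(0.5, 1.0)`; suitable threshold values would have to be tuned on real data.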

In another embodiment, before inputting the text information into the illegal tone detection module to detect whether an illegal tone exists in the audio to be detected, the method further includes:

extracting a voiceprint feature from the human voice audio to obtain a target voiceprint feature;

matching the target voiceprint feature against the illegal voiceprint features in a voiceprint library to obtain a feature matching result;

if the feature matching result indicates that the target voiceprint feature does not match the illegal voiceprint features, performing the step of inputting the text information into the preset illegal tone detection module to detect whether an illegal tone exists in the audio to be detected.

In another embodiment, extracting the voiceprint feature from the human voice audio to obtain the target voiceprint feature includes:

segmenting the human voice audio to obtain multiple human voice audio segments;

extracting the voiceprint feature of each human voice audio segment through a voiceprint model;

calculating the average of the multiple voiceprint features, and using the average voiceprint feature as the target voiceprint feature.

In another embodiment, the method further includes:

if no illegal tone exists in the audio to be detected, inputting the text information into a preset semantic model for semantic recognition to obtain a semantic recognition result;

determining whether the audio to be detected is illegal audio according to the semantic recognition result.
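
The branching between the violation detection model and the semantic model can be sketched as below. The three models are passed in as plain callables because the source does not define their interfaces; the function name `classify_audio` and the convention that the semantic model returns a category string (or `None` when nothing illegal is recognized) are assumptions made for illustration.

```python
from typing import Callable, Optional, Tuple


def classify_audio(
    text: str,
    tone_detector: Callable[[str], bool],
    violation_model: Callable[[str], bool],
    semantic_model: Callable[[str], Optional[str]],
) -> Tuple[bool, Optional[str]]:
    """Return (is_illegal, category) for the transcribed text.

    If an illegal tone is present, the violation detection model decides.
    Otherwise the semantic model decides; here it is assumed to return a
    violation category, or None when the content is acceptable.
    """
    if tone_detector(text):
        return bool(violation_model(text)), None
    category = semantic_model(text)
    return category is not None, category
```

Injecting the models as callables keeps the sketch independent of any particular detector, so each branch can be exercised with simple stand-ins.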

In another embodiment, after determining whether the audio to be detected is illegal audio according to the semantic recognition result, the method further includes:

if the audio to be detected is illegal audio, determining the violation category according to the semantic recognition result;

determining the corresponding processing flow according to the violation category, and processing the audio to be detected through that processing flow.
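
One simple way to map a violation category to a processing flow is a dispatch table, sketched below. The category names (`"abusive"`, `"sensitive"`) and the two handlers are purely hypothetical placeholders; the source does not list the categories or say what each flow does.

```python
from typing import Callable, Dict


def block_upload(audio_id: str) -> str:
    """Hypothetical flow: reject the upload outright."""
    return f"blocked:{audio_id}"


def escalate_to_reviewer(audio_id: str) -> str:
    """Hypothetical flow: hand the audio to the review team."""
    return f"escalated:{audio_id}"


# Hypothetical category-to-flow mapping.
PROCESSING_FLOWS: Dict[str, Callable[[str], str]] = {
    "abusive": block_upload,
    "sensitive": escalate_to_reviewer,
}


def process_violation(category: str, audio_id: str) -> str:
    """Look up the flow for the category and run it on the audio.

    Raises KeyError for an unknown category rather than guessing a policy.
    """
    return PROCESSING_FLOWS[category](audio_id)
```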

According to one aspect of the embodiments of the present application, a device for detecting illegal audio is provided, including:

an acquisition module, configured to acquire the audio to be detected, extract the human voice audio from the audio to be detected, and convert the human voice audio into text information;

a first detection module, configured to input the text information into an illegal tone detection module to detect whether an illegal tone exists in the audio to be detected, wherein the illegal tone detection module performs detection based on at least one of the modal particles in the text information and the duration of the audio to be detected;

a second detection module, configured to, if an illegal tone exists in the audio to be detected, input the text information into a preset violation detection model to detect whether the audio to be detected is illegal audio.

According to one aspect of the embodiments of the present application, an electronic device is provided, including: one or more processors; and a storage device for storing one or more programs which, when executed by the one or more processors, cause the electronic device to implement the method for detecting illegal audio described above.

According to one aspect of the embodiments of the present application, a computer-readable storage medium is provided, on which computer-readable instructions are stored; when the computer-readable instructions are executed by a processor of a computer, the computer performs the method for detecting illegal audio described above.

According to one aspect of the embodiments of the present application, a computer program product or computer program is provided, which includes computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the method for detecting illegal audio provided in the various optional embodiments above.

In the technical solution provided by the embodiments of the present application, when audio needs to be checked, the human voice audio is extracted from the audio to be detected and converted into text information. Extracting the human voice audio greatly reduces interference from background music, which improves the recognition accuracy of converting the human voice audio into text information. The converted text information is then input into the illegal tone detection module to detect whether an illegal tone exists in the audio to be detected; the module performs detection based on at least one of the modal particles in the text information and the duration of the audio to be detected. When an illegal tone exists in the audio to be detected, its modal particles and duration differ to some degree from those of normal audio, so the illegal tone detection module can accurately detect whether an illegal tone is present. When an illegal tone is detected, the text information is further input into the preset violation detection model to detect whether the audio to be detected is illegal audio, so that violation detection can be performed more quickly and accurately. The technical solution of the present application therefore requires no dedicated manual review, which reduces labor and time costs and helps operation review teams block and delete audio content that harms the health and safety of the network, and thereby build a new pan-entertainment ecosystem that conveys correct values in a healthy way.

It should be understood that the foregoing general description and the following detailed description are exemplary and explanatory only and do not limit the present application.

Brief description of the drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and, together with the description, serve to explain its principles. The drawings described below are only some embodiments of the present application; those of ordinary skill in the art can obtain other drawings from them without creative effort. In the drawings:

FIG. 1 is a schematic diagram of an implementation environment involved in the present application;

FIG. 2 is a flowchart of a method for detecting illegal audio involved in the present application;

FIG. 3 is a flowchart of the method for detecting illegal audio in another embodiment of the present application;

FIG. 4 is a flowchart of step S310 in an embodiment of the present application;

FIG. 5 is a flowchart of step S220 in an embodiment of the present application;

FIG. 6 is a flowchart of step S530 in an embodiment of the present application;

FIG. 7 is a flowchart of the method for detecting illegal audio in another embodiment of the present application;

FIG. 8 is a flowchart of step S720 in an embodiment of the present application;

FIG. 9 is a flowchart of the method for detecting illegal audio in another embodiment of the present application;

FIG. 10 is a block diagram of a device for detecting illegal audio involved in the present application;

FIG. 11 is a schematic structural diagram of a computer system suitable for implementing the electronic device of the embodiments of the present application.

Detailed description

Exemplary embodiments are described in detail here, and examples are shown in the accompanying drawings. Where the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present application. Rather, they are merely examples of devices and methods consistent with some aspects of the present application as detailed in the appended claims.

The block diagrams shown in the drawings are merely functional entities and do not necessarily correspond to physically separate entities. That is, these functional entities may be implemented in software, in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.

The flowcharts shown in the drawings are only illustrative; they need not include all contents and operations/steps, nor must the steps be performed in the order described. For example, some operations/steps may be decomposed and others may be combined or partially combined, so the actual order of execution may change according to the actual situation.

It should also be noted that "multiple" in the present application means two or more. "And/or" describes an association between associated objects and indicates that three relationships are possible; for example, "A and/or B" may mean that A exists alone, that A and B exist together, or that B exists alone. The character "/" generally indicates an "or" relationship between the objects before and after it.

Please refer to FIG. 1, which is a schematic diagram of an implementation environment involved in the present application. The implementation environment includes a terminal 110 and a server 120, which communicate over a wired or wireless network.

The terminal 110 runs applications for creating user generated content and applications for playing audio, video, and the like; in some playback applications, user generated content can be created directly. When a user uploads user generated content to an application for playback and the application is a public platform, that is, the uploaded content can be played by other users, the content must be reviewed to determine whether it contains illegal content. If it does, the upload is rejected.

The server 120 is provided with a method for detecting violations in content uploaded by users, covering audio, images, and video.

The terminal 110 may be any electronic device capable of running creation and playback applications, such as a smartphone, tablet, laptop, or desktop computer. The server 120 may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN (Content Delivery Network), and big data and artificial intelligence platforms; no limitation is imposed here.

FIG. 2 is a flowchart of a method for detecting illegal audio according to an exemplary embodiment. The method can be applied to the implementation environment shown in FIG. 1 and is specifically executed by the server 120 in that environment.

As shown in FIG. 2, in an exemplary embodiment, the method for detecting illegal audio may include steps S210 to S230, described in detail as follows:

Step S210: acquire the audio to be detected, extract the human voice audio from it, and convert the human voice audio into text information.

In the embodiments of the present application, when a user uploads audio, the uploaded audio is taken as the audio to be detected and a detection request is generated. When the detection request is received, the corresponding audio to be detected is acquired. The audio to be detected includes background audio, melody audio, human voice audio, and so on; the human voice audio is extracted by suppressing and separating the background audio, melody audio, and the like, and is then converted into text information. Suppressing and separating the background and melody audio greatly reduces the interference of music with voiceprint features and speech features, which considerably improves the accuracy of the subsequent conversion of human voice audio into text information and of voiceprint feature extraction.

Specifically, to extract the human voice audio, the audio to be detected is input into a preset voice separation model. The voice separation model may use a deep neural network model such as U-Net, Wave-U-Net, WaveGlow, Deep Clustering, TasNet (Time-domain Audio Separation Network), or Conv-TasNet (Convolutional Time-domain Audio Separation Network), or may be trained from an attention-based TasNet or Conv-TasNet. The voice separation model separates the audio to be detected into two audio files: the human voice audio and all other audio. In the embodiments of the present application, only the human voice audio is retained for subsequent processing.

Speech recognition is performed on the extracted human voice audio to obtain text information. Specifically, the human voice audio is input into a pre-trained speech recognition model for speech recognition, yielding the text information.

In an exemplary embodiment of the present application, referring to FIG. 3, before step S220 inputs the text information into the illegal tone detection module to detect whether an illegal tone exists in the audio to be detected, the method for detecting illegal audio further includes steps S310 to S330, described in detail as follows:

Step S310: extract a voiceprint feature from the human voice audio to obtain a target voiceprint feature.

In the embodiments of the present application, a voiceprint feature is extracted from the human voice audio as the target voiceprint feature. Specifically, it can be extracted by a trained voiceprint model, which can be obtained by training a deep-learning TDNN (Time Delay Neural Network) network structure.

Step S320: match the target voiceprint feature against the illegal voiceprint features in the voiceprint library to obtain a feature matching result.

In the embodiments of the present application, voiceprint features are extracted in advance from segments of illegal audio that were discovered during operation or have been reported, and the extracted x-vectors are combined into a voiceprint library. The publisher of each piece of illegal audio is marked as a sensitive person, and the voiceprint features extracted from the illegal audio serve as the illegal voiceprint features.

The target voiceprint feature is matched against the illegal voiceprint features to obtain a feature matching result. There are two kinds of result: the target voiceprint feature does not match the illegal voiceprint features, or it matches one.

Step S330: if the feature matching result indicates that the target voiceprint feature does not match the illegal voiceprint features, perform the step of inputting the text information into the preset illegal tone detection module to detect whether an illegal tone exists in the audio to be detected.

In the embodiments of the present application, if the feature matching result indicates that the target voiceprint feature does not match the illegal voiceprint features, the step of inputting the text information into the preset illegal tone detection module to detect whether an illegal tone exists in the audio to be detected is performed. If the target voiceprint feature matches an illegal voiceprint feature, the subsequent step of inputting the text information into the illegal tone detection module is not needed, and the publisher is directly prevented from publishing the audio to be detected on the application.
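
The source does not say how two voiceprint features are compared. A common choice for x-vector style embeddings is cosine similarity against a threshold, which is what the sketch below assumes; the function names and the default threshold of 0.8 are illustrative only.

```python
import math
from typing import Iterable, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def matches_violation_library(target: Sequence[float],
                              library: Iterable[Sequence[float]],
                              threshold: float = 0.8) -> bool:
    """True if the target voiceprint is close enough to any library entry.

    The threshold is an assumed, tunable value, not one given in the source.
    """
    return any(cosine_similarity(target, v) >= threshold for v in library)
```

In the flow of steps S320 and S330, a `True` result would stop the upload immediately, while `False` would let the text proceed to the illegal tone detection module.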

In another embodiment of the present application, when the target voiceprint feature matches an illegal voiceprint feature, the text information may be input into a preset semantic model for semantic recognition to obtain a semantic recognition result, and whether the audio to be detected is illegal audio is then further determined according to the semantic recognition result.

In the embodiments of the present application, voiceprint feature matching means that not every piece of audio to be detected has to pass through the illegal tone detection module and the violation detection model, which greatly saves the server's computing resources and improves the efficiency of violation detection.

In an exemplary embodiment of the present application, referring to FIG. 4, extracting the voiceprint feature from the human voice audio in step S310 to obtain the target voiceprint feature includes steps S410 to S430, described in detail as follows:

Step S410: segment the human voice audio to obtain multiple human voice audio segments.

In the embodiments of the present application, the human voice audio is segmented to obtain multiple human voice audio segments. For segmentation, a number of segments may be preset, and the human voice audio is cut into that fixed number of segments. In another embodiment, a duration for each segment may be preset, and the human voice audio is cut into segments of that fixed duration. The preset number of segments and the preset duration can be set as needed; the present application does not limit them.
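
Both segmentation strategies can be illustrated on a plain sequence of audio samples. This is a sketch under the assumption that the audio is available as an in-memory sample list with a known sample rate; the function names are hypothetical.

```python
import math
from typing import List, Sequence


def split_by_duration(samples: Sequence[float], sample_rate: int,
                      segment_seconds: float) -> List[Sequence[float]]:
    """Cut the audio into segments of a fixed duration (last one may be shorter)."""
    seg_len = int(sample_rate * segment_seconds)
    if seg_len <= 0:
        raise ValueError("segment length must be at least one sample")
    return [samples[i:i + seg_len] for i in range(0, len(samples), seg_len)]


def split_by_count(samples: Sequence[float],
                   num_segments: int) -> List[Sequence[float]]:
    """Cut the audio into at most num_segments roughly equal segments."""
    if num_segments <= 0:
        raise ValueError("num_segments must be positive")
    seg_len = math.ceil(len(samples) / num_segments)
    if seg_len == 0:
        return []  # empty input yields no segments
    return [samples[i:i + seg_len] for i in range(0, len(samples), seg_len)]
```

Because the segment length is rounded up, `split_by_count` can return fewer segments than requested when the audio is very short relative to the requested count.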

步骤S420，通过声纹模型提取各个人声音频段的声纹特征。Step S420, extracting the voiceprint features of each human voice segment through the voiceprint model.

本申请实施例中，将每个人声音频段输入声纹模型中进行声纹识别，提取每个人声音频段的声纹特征x-vector。In the embodiment of the present application, each human voice segment is input into the voiceprint model for voiceprint recognition, and the voiceprint feature x-vector of each human voice segment is extracted.

Step S430: calculate the average of the multiple voiceprint features, and use the average voiceprint feature as the target voiceprint feature.

In this embodiment, the x-vectors of the multiple segments are averaged to obtain the average voiceprint feature. The average voiceprint feature is then used as the target voiceprint feature and matched against the illegal voiceprint features in the voiceprint library.
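Averaging the per-segment x-vectors and matching the result against the voiceprint library might look like the sketch below. The averaging follows step S430; the use of cosine similarity, the threshold value of 0.8, and all function names are assumptions made for illustration, since this passage does not specify how the match is scored.

```python
import math
from typing import Dict, List, Optional, Sequence


def mean_embedding(embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise average of per-segment x-vectors (step S430)."""
    if not embeddings:
        raise ValueError("at least one embedding is required")
    dim = len(embeddings[0])
    if any(len(e) != dim for e in embeddings):
        raise ValueError("all embeddings must have the same dimension")
    return [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_voiceprint(
    target: Sequence[float],
    library: Dict[str, Sequence[float]],
    threshold: float = 0.8,  # assumed value, to be tuned on real data
) -> Optional[str]:
    """Return the id of the best library entry scoring at least `threshold`, else None."""
    best_id, best_score = None, threshold
    for entry_id, reference in library.items():
        score = cosine_similarity(target, reference)
        if score >= best_score:
            best_id, best_score = entry_id, score
    return best_id
```

Averaging gives a single fixed-size vector per audio, so the library lookup costs one comparison per stored voiceprint rather than one per segment.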

Step S220: input the text information into the illegal tone detection module to detect whether an illegal tone exists in the audio to be detected, where the illegal tone detection module performs detection based on at least one of the modal particles in the text information and the duration of the audio to be detected.

In this embodiment, an illegal tone detection module is provided in advance, and it performs detection based on at least one of the modal particles in the text information and the duration of the audio to be detected. The converted text information is input into the illegal tone detection module to detect whether an illegal tone exists in the audio to be detected. An illegal tone in this embodiment can be understood as illegal content expressed through modal particles. Therefore, when an illegal tone exists in the audio to be detected, its modal particles and duration differ to some extent from those of normal audio.

In an exemplary embodiment of the present application, referring to FIG. 5, inputting the text information into the illegal tone detection module in step S220 to detect whether an illegal tone exists in the audio to be detected includes steps S510 to S530, which are described in detail as follows:

Step S510: obtain the duration information of the audio to be detected, and extract the number of modal particles and the number of words in the text information, as well as the number of characters in the text information.

In this embodiment, the duration information of the audio to be detected is obtained, and word segmentation is performed on the text information. Tools such as Jieba, SnowNLP, THULAC (THU Lexical Analyzer for Chinese), or the NLPIR word segmentation system may be used for the segmentation. During segmentation, each resulting word is tagged with its part of speech, the modal particles are extracted from the tagged words, and the number of modal particles, the number of words, and the number of characters in the text information are counted.

Step S520: calculate a first ratio of the number of modal particles to the number of words, and a second ratio of the number of characters to the duration information.

In this embodiment, the number of modal particles is divided by the number of words to obtain the first ratio, and the number of characters is divided by the duration to obtain the second ratio.
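The counting of step S510 and the two ratios of step S520 can be sketched as below. The sketch assumes the text has already been segmented and part-of-speech tagged by one of the tools named above, and that modal particles carry the tag `"y"`; that tag, the `ToneFeatures` type, and the function name are illustrative assumptions, and the tag must be adapted to the tagger actually used.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class ToneFeatures:
    particle_ratio: float    # first ratio: modal particles / words
    chars_per_second: float  # second ratio: characters / duration


def tone_features(
    tagged_tokens: Iterable[Tuple[str, str]],
    duration_seconds: float,
    particle_tag: str = "y",  # assumed tag for modal particles
) -> ToneFeatures:
    """Compute the first and second ratios from (word, tag) pairs and the audio duration."""
    tokens = list(tagged_tokens)
    if not tokens:
        raise ValueError("the transcript contains no words")
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    particle_count = sum(1 for _, tag in tokens if tag == particle_tag)
    char_count = sum(len(word) for word, _ in tokens)
    return ToneFeatures(
        particle_ratio=particle_count / len(tokens),
        chars_per_second=char_count / duration_seconds,
    )
```

For example, four tokens of which two are modal particles, six characters in total, and a 4-second audio give a first ratio of 0.5 and a second ratio of 1.5 characters per second.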

Step S530: detect whether an illegal tone exists in the audio to be detected according to the first ratio and the second ratio.

In this embodiment, when an illegal tone exists in the audio to be detected, words other than modal particles usually make up a small share of the human voice audio, while the share of modal particles such as "啊" (ah), "哦" (oh), "哈" (ha), and "好" (good) increases markedly. In addition, the number of characters recognized is far smaller than for normal audio of the same duration. Whether an illegal tone exists in the audio to be detected can therefore be detected according to the first ratio and the second ratio.

In an exemplary embodiment of the present application, referring to FIG. 6, detecting in step S530 whether an illegal tone exists in the audio to be detected according to the first ratio and the second ratio includes steps S610 to S630, which are described in detail as follows:

Step S610: compare the first ratio with a first detection threshold to obtain a first matching result, and compare the second ratio with a second detection threshold to obtain a second matching result.

In this embodiment, a first detection threshold and a second detection threshold are preset. The first ratio is compared with the first detection threshold to obtain the first matching result, and the second ratio is compared with the second detection threshold to obtain the second matching result.

Step S620: if the first matching result indicates that the first ratio is greater than the first detection threshold and the second matching result indicates that the second ratio is less than the second detection threshold, determine that an illegal tone exists in the audio to be detected.

In this embodiment, if the first ratio is greater than the first detection threshold and the second ratio is less than the second detection threshold, then words other than modal particles make up a small share of the human voice audio, modal particles make up a large share, and the number of characters recognized is far smaller than for normal audio of the same duration. It can therefore be determined that an illegal tone exists in the audio to be detected.

Step S630: if the first matching result indicates that the first ratio is less than or equal to the first detection threshold, or the second matching result indicates that the second ratio is greater than or equal to the second detection threshold, determine that no illegal tone exists in the audio to be detected.

In this embodiment, if either condition holds, that is, the first ratio is less than or equal to the first detection threshold or the second ratio is greater than or equal to the second detection threshold, then either words other than modal particles make up a large share of the human voice audio and modal particles a small share, or the number of characters recognized is at the level of normal audio of the same duration. It can therefore be determined that no illegal tone exists in the audio to be detected.
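The decision rule of steps S610 to S630 reduces to a single conjunction, sketched below. The function name is hypothetical, and the application gives no threshold values, so the thresholds in the usage note are placeholders only.

```python
def has_illegal_tone(
    first_ratio: float,
    second_ratio: float,
    first_threshold: float,
    second_threshold: float,
) -> bool:
    """Step S620 holds only when both comparisons hold; every other case is step S630."""
    return first_ratio > first_threshold and second_ratio < second_threshold
```

With placeholder thresholds of 0.3 and 2.0, `has_illegal_tone(0.5, 1.5, 0.3, 2.0)` is true, while a first ratio exactly equal to the first threshold, or a second ratio exactly equal to the second threshold, falls under step S630 and yields false.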

Step S230: if an illegal tone exists in the audio to be detected, input the text information into a preset violation detection model to detect whether the audio to be detected is illegal audio.

In this embodiment, the preset violation detection model may be trained on the basis of the Panns-cnn10 model or the Ecapa-tdnn model. Two classes of samples are used in training, positive and negative. The positive samples are mostly natural sounds, which may include animal calls, natural and water sounds, non-conversational human voices, household sounds, urban noise, and the like.

The negative samples are audio that has been labeled as illegal.

After the Panns-cnn10 model or the Ecapa-tdnn model is trained on the positive and negative samples to obtain a baseline model, transfer training is performed. Data such as songs, recorded courses, and audiobooks are added during transfer training, so that the final violation detection model is more accurate and the overall training takes less time. In one embodiment of the present application, the Ecapa-tdnn model is preferred for training the violation detection model.

After it is determined that an illegal tone exists in the audio to be detected, the text information is input into the preset violation detection model to further detect whether the audio to be detected is illegal audio, which makes the violation detection more accurate.

In this embodiment, if the violation detection model detects that the audio to be detected is illegal audio, the audio is sent to a manual review channel for further review. If the manual review result indicates that the audio to be detected is illegal audio, the target voiceprint feature extracted from that audio is stored in the voiceprint library as an illegal voiceprint feature, and the publisher of the audio is marked as a sensitive person.
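The bookkeeping after manual review can be sketched as follows. The in-memory list and set, the publisher identifier, and the function name `record_review` are assumptions made for illustration; the application does not describe how the voiceprint library or the sensitive-person marks are stored.

```python
from typing import List, Sequence, Set, Tuple

VoiceprintEntry = Tuple[str, List[float]]  # (publisher id, illegal voiceprint feature)


def record_review(
    confirmed_illegal: bool,
    publisher_id: str,
    target_voiceprint: Sequence[float],
    voiceprint_library: List[VoiceprintEntry],
    sensitive_publishers: Set[str],
) -> None:
    """Store the voiceprint and flag the publisher only when review confirms the violation."""
    if not confirmed_illegal:
        return
    voiceprint_library.append((publisher_id, list(target_voiceprint)))
    sensitive_publishers.add(publisher_id)
```

Entries added here are what later uploads are matched against, which is how a confirmed violation speeds up the handling of further audio with the same voice.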

In this embodiment, when audio needs to be detected, the human voice audio is extracted from it and converted into text information. Extracting the human voice audio largely removes interference from background music, which improves the accuracy of converting the human voice audio into text. The converted text information is then input into the illegal tone detection module to detect whether an illegal tone exists in the audio to be detected. The module performs detection based on at least one of the modal particles in the text information and the duration of the audio to be detected. Because the modal particles and duration of audio with an illegal tone differ from those of normal audio, the module can accurately detect whether an illegal tone exists. When an illegal tone is detected, the text information is further input into the preset violation detection model to detect whether the audio to be detected is illegal audio, so that violation detection is faster and more accurate. The technical solution of the present application therefore does not require dedicated manual review, reduces labor and time costs, and helps the operations review team block and delete audio content that harms the health and safety of the network, thereby building a healthy pan-entertainment ecosystem that conveys correct values.

In an exemplary embodiment of the present application, referring to FIG. 7, the illegal audio detection method further includes steps S710 and S720, which are described in detail as follows:

Step S710: if no illegal tone exists in the audio to be detected, input the text information into a preset semantic model for semantic recognition to obtain a semantic recognition result.

In this embodiment, after the illegal tone detection module detects that no illegal tone exists in the audio to be detected, the text information is input into the preset semantic model for semantic recognition, and a semantic recognition result is obtained.

Step S720: determine whether the audio to be detected is illegal audio according to the semantic recognition result.

In this embodiment, the semantics expressed by the audio to be detected can be determined from the semantic recognition result, and it can then be determined whether the audio to be detected is illegal audio.

In an exemplary embodiment of the present application, referring to FIG. 8, after step S720 determines whether the audio to be detected is illegal audio according to the semantic recognition result, the illegal audio detection method further includes steps S810 and S820, which are described in detail as follows:

Step S810: if the audio to be detected is illegal audio, determine the violation category according to the semantic recognition result.

In this embodiment, when the audio to be detected is determined to be illegal audio, the violation category is further determined according to the semantic recognition result. There may be multiple violation categories.

Step S820: determine the corresponding processing flow according to the violation category, and process the audio to be detected through that processing flow.

In this embodiment, each violation category has a corresponding processing flow, such as "prohibited" or "suspected". In the prohibited flow, the audio to be detected is barred from being published to the network. In the suspected flow, further confirmation of whether the audio is illegal is required. The audio to be detected is processed according to the processing flow that has been determined.
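The mapping from violation category to processing flow in step S820 can be sketched as a dispatch table. Only the two flows named above are modeled; the enum, the handler functions, and their return strings are hypothetical stand-ins for whatever the platform actually does.

```python
from enum import Enum
from typing import Callable, Dict


class ViolationCategory(Enum):
    PROHIBITED = "prohibited"
    SUSPECTED = "suspected"


def block_publication(audio_id: str) -> str:
    """Prohibited flow: the audio is not published to the network."""
    return f"{audio_id}: publication blocked"


def queue_for_further_check(audio_id: str) -> str:
    """Suspected flow: the audio needs further confirmation."""
    return f"{audio_id}: queued for further check"


# One processing flow per violation category.
HANDLERS: Dict[ViolationCategory, Callable[[str], str]] = {
    ViolationCategory.PROHIBITED: block_publication,
    ViolationCategory.SUSPECTED: queue_for_further_check,
}


def process_violation(audio_id: str, category: ViolationCategory) -> str:
    """Select the processing flow for the category and apply it to the audio."""
    return HANDLERS[category](audio_id)
```

Keeping the mapping in a table means a new violation category only needs a new entry and handler, without touching the selection logic.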

In an exemplary embodiment of the present application, referring to FIG. 9, FIG. 9 shows an illegal audio detection method according to an exemplary embodiment, which includes steps S910 to S960, described in detail as follows:

Step S910: if a detection request is received, obtain the audio to be detected according to the detection request.

In this embodiment, when a user uploads audio, the uploaded audio is taken as the audio to be detected and a detection request is generated. When the detection request is received, the audio to be detected that corresponds to it is obtained.

Step S920: extract the human voice audio from the audio to be detected.

In this embodiment, the audio to be detected is processed by a preset voice separation model, which extracts the human voice audio by suppressing and separating out the background audio, melody audio, and the like.

Step S930: extract the voiceprint feature from the human voice audio to obtain the target voiceprint feature, and match the target voiceprint feature against the illegal voiceprint features in the voiceprint library to obtain a feature matching result.

In this embodiment, step S930 is consistent with the description of steps S310 and S320 above and is not repeated here.

Step S940: if the feature matching result indicates that the target voiceprint feature does not match any illegal voiceprint feature, convert the human voice audio into text information, and input the text information into a preset semantic model for semantic recognition to obtain a semantic recognition result.

In this embodiment, if the feature matching result indicates that the target voiceprint feature does not match any illegal voiceprint feature, the human voice audio is converted into text information, and the text information is input directly into the preset semantic model for semantic recognition to obtain the semantic recognition result.

Step S950: if the semantic recognition result indicates that the audio to be detected is not illegal audio, input the text information into the preset illegal tone detection module to detect whether an illegal tone exists in the audio to be detected, where the illegal tone detection module performs detection based on at least one of the modal particles in the text information and the duration of the audio to be detected.

In this embodiment, when the semantic recognition result indicates that the audio to be detected is not illegal audio, the converted text information is input into the preset illegal tone detection module to detect whether an illegal tone exists in the audio to be detected. How the module performs this detection is consistent with the description of steps S510 to S530 and steps S610 to S630 above and is not repeated here.

Step S960: if an illegal tone exists in the audio to be detected, input the text information into the preset violation detection model to detect whether the audio to be detected is illegal audio.

In this embodiment, if an illegal tone exists in the audio to be detected, the text information is input into the preset violation detection model to detect whether the audio to be detected is illegal audio. The description of step S960 is consistent with that of step S230 above and is not repeated here.
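The control flow of steps S920 to S960 can be sketched as below, with each model passed in as a callable so that the sketch stays self-contained. All parameter names and the returned labels are hypothetical. How a voiceprint match is handled is not described in this passage, so the sketch only reports it rather than deciding anything.

```python
from typing import Any, Callable


def detect(
    audio: Any,
    *,
    extract_vocals: Callable[[Any], Any],
    matches_library: Callable[[Any], bool],
    transcribe: Callable[[Any], str],
    semantic_is_illegal: Callable[[str], bool],
    tone_is_illegal: Callable[[str, Any], bool],
    model_is_illegal: Callable[[str], bool],
) -> str:
    """Run the staged checks in order and return a label naming the deciding stage."""
    vocals = extract_vocals(audio)            # S920: isolate the human voice
    if matches_library(vocals):               # S930: voiceprint library lookup
        return "voiceprint_match"             # handling of a match is outside this sketch
    text = transcribe(vocals)                 # S940: speech to text
    if semantic_is_illegal(text):             # S940: semantic recognition
        return "illegal_by_semantics"
    if not tone_is_illegal(text, audio):      # S950: modal particle and duration check
        return "no_violation_found"
    if model_is_illegal(text):                # S960: violation detection model
        return "illegal_by_model"
    return "no_violation_found"
```

Ordering the stages this way lets the inexpensive checks decide most cases, so the violation detection model only runs on audio that has already shown an illegal tone.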

In an exemplary embodiment of the present application, referring to FIG. 10, FIG. 10 shows an illegal audio detection device according to an exemplary embodiment, which includes:

an acquisition module 1010, configured to acquire the audio to be detected, extract the human voice audio from the audio to be detected, and convert the human voice audio into text information;

a first detection module 1020, configured to input the text information into the illegal tone detection module to detect whether an illegal tone exists in the audio to be detected, where the illegal tone detection module performs detection based on at least one of the modal particles in the text information and the duration of the audio to be detected;

a second detection module 1030, configured to, if an illegal tone exists in the audio to be detected, input the text information into the preset violation detection model to detect whether the audio to be detected is illegal audio.

In an exemplary embodiment of the present application, the first detection module 1020 includes:

a first extraction submodule, configured to obtain the duration information of the audio to be detected, and to extract the number of modal particles and the number of words in the text information, as well as the number of characters in the text information;

a first calculation submodule, configured to calculate a first ratio of the number of modal particles to the number of words, and a second ratio of the number of characters to the duration information;

a detection submodule, configured to detect whether an illegal tone exists in the audio to be detected according to the first ratio and the second ratio.

In an exemplary embodiment of the present application, the detection submodule includes:

a matching unit, configured to compare the first ratio with the first detection threshold to obtain a first matching result, and to compare the second ratio with the second detection threshold to obtain a second matching result;

a first determination unit, configured to determine that an illegal tone exists in the audio to be detected if the first matching result indicates that the first ratio is greater than the first detection threshold and the second matching result indicates that the second ratio is less than the second detection threshold;

a second determination unit, configured to determine that no illegal tone exists in the audio to be detected if the first matching result indicates that the first ratio is less than or equal to the first detection threshold, or the second matching result indicates that the second ratio is greater than or equal to the second detection threshold.

In an exemplary embodiment of the present application, the illegal audio detection device further includes:

an extraction module, configured to extract the voiceprint feature from the human voice audio to obtain the target voiceprint feature;

a matching module, configured to match the target voiceprint feature against the illegal voiceprint features in the voiceprint library to obtain a feature matching result;

an execution module, configured to, if the feature matching result indicates that the target voiceprint feature does not match any illegal voiceprint feature, execute the step of inputting the text information into the preset illegal tone detection module to detect whether an illegal tone exists in the audio to be detected.

In an exemplary embodiment of the present application, the extraction module includes:

a segmentation submodule, configured to segment the human voice audio to obtain multiple human voice segments;

a second extraction submodule, configured to extract the voiceprint feature of each human voice segment through the voiceprint model;

a second calculation submodule, configured to calculate the average of the multiple voiceprint features and use the average voiceprint feature as the target voiceprint feature.

a semantic recognition module, configured to, if no illegal tone exists in the audio to be detected, input the text information into the preset semantic model for semantic recognition to obtain a semantic recognition result;

a first determination module, configured to determine whether the audio to be detected is illegal audio according to the semantic recognition result.

a second determination module, configured to determine the violation category according to the semantic recognition result if the audio to be detected is illegal audio;

a third determination module, configured to determine the corresponding processing flow according to the violation category, and to process the audio to be detected through that processing flow.

It should be noted that the illegal audio detection device provided in the above embodiments and the illegal audio detection method provided in the above embodiments belong to the same concept. The specific manner in which each module, submodule, and unit performs its operations has been described in detail in the method embodiments and is not repeated here.

An embodiment of the present application further provides an electronic device, including: one or more processors; and a storage device for storing one or more programs which, when executed by the one or more processors, cause the electronic device to implement the illegal audio detection method provided in the above embodiments;

The illegal audio detection method includes:

It should be noted that the computer system 1100 of the electronic device shown in FIG. 11 is only an example and should not impose any limitation on the functions or scope of use of the embodiments of the present application.

As shown in FIG. 11, the computer system 1100 includes a central processing unit (CPU) 1101, which can perform various appropriate actions and processing, such as the methods described in the above embodiments, according to a program stored in a read-only memory (ROM) 1102 or a program loaded from a storage section 1108 into a random access memory (RAM) 1103. The RAM 1103 also stores the various programs and data required for system operation. The CPU 1101, the ROM 1102, and the RAM 1103 are connected to one another through a bus 1104. An input/output (I/O) interface 1105 is also connected to the bus 1104.

The following components are connected to the I/O interface 1105: an input section 1106 including a keyboard, a mouse, and the like; an output section 1107 including a display such as a cathode ray tube (CRT) or liquid crystal display (LCD), as well as a speaker and the like; a storage section 1108 including a hard disk and the like; and a communication section 1109 including a network interface card such as a local area network (LAN) card or a modem. The communication section 1109 performs communication processing via a network such as the Internet. A drive 1110 is also connected to the I/O interface 1105 as needed. A removable medium 1111, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 1110 as needed, so that a computer program read from it can be installed into the storage section 1108 as needed.

In particular, according to the embodiments of the present application, the processes described above with reference to the flowcharts can be implemented as computer software programs. For example, an embodiment of the present application includes a computer program product comprising a computer program carried on a computer-readable medium, the computer program being used to execute the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 1109, and/or installed from the removable medium 1111. When the computer program is executed by the central processing unit (CPU) 1101, the various functions defined in the system of the present application are performed.

It should be noted that the computer-readable medium in the embodiments of the present application may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable storage media include, but are not limited to: an electrical connection with one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a flash memory, an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present application, a computer-readable storage medium may be any tangible medium that contains or stores a program, where the program can be used by or in combination with an instruction execution system, apparatus, or device. A computer-readable signal medium, by contrast, may include a data signal propagated in baseband or as part of a carrier wave, which carries a computer-readable computer program. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device. The computer program contained on a computer-readable medium may be transmitted over any appropriate medium, including but not limited to wireless, wired, or any suitable combination of the above.

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present application. Each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the figures. For example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functionality involved. It should also be noted that each block in the block diagrams or flowcharts, and any combination of such blocks, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.

The units described in the embodiments of the present application may be implemented in software or in hardware, and the described units may also be provided in a processor. In some cases, the names of these units do not limit the units themselves.

Another aspect of the present application provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the method described above is implemented. The computer-readable storage medium may be included in the electronic device described in the above embodiments, or it may exist separately without being assembled into the electronic device.

Another aspect of the present application provides a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the violating audio detection method provided in the above embodiments.

The above are only preferred exemplary embodiments of the present application and are not intended to limit its implementation. Those of ordinary skill in the art can readily make corresponding variations or modifications in accordance with the main concept and spirit of the present application. The scope of protection of the present application is therefore defined by the claims.

Claims

1. A method for detecting violating audio, comprising:

acquiring audio to be detected, extracting human voice audio from the audio to be detected, and converting the human voice audio into text information;

inputting the text information into a violating-tone detection module to detect whether a violating tone exists in the audio to be detected, wherein the violating-tone detection module examines at least one of the modal particles in the text information and the duration of the audio to be detected; and

if a violating tone exists in the audio to be detected, inputting the text information into a preset violation detection model to detect whether the audio to be detected is violating audio.

2. The method of claim 1, wherein inputting the text information into the violating-tone detection module to detect whether a violating tone exists in the audio to be detected comprises:

acquiring duration information of the audio to be detected, and extracting from the text information the number of modal particles and the number of text words;

calculating a first ratio of the number of modal particles to the number of text words, and calculating a second ratio of the number of text words to the duration information; and

detecting whether a violating tone exists in the audio to be detected according to the first ratio and the second ratio.

3. The method of claim 2, wherein detecting whether a violating tone exists in the audio to be detected according to the first ratio and the second ratio comprises:

matching the first ratio against a first detection threshold to obtain a first matching result, and matching the second ratio against a second detection threshold to obtain a second matching result;

if the first matching result indicates that the first ratio is greater than the first detection threshold and the second matching result indicates that the second ratio is less than the second detection threshold, determining that a violating tone exists in the audio to be detected; and

if the first matching result indicates that the first ratio is less than or equal to the first detection threshold, or the second matching result indicates that the second ratio is greater than or equal to the second detection threshold, determining that no violating tone exists in the audio to be detected.
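
By way of illustration only, the ratio test of claims 2 and 3 might be sketched as follows. The sketch reads the first ratio as the modal-particle count divided by the text-word count and the second ratio as the text-word count divided by the duration in seconds. The particle list `MODAL_PARTICLES`, the threshold values in `ToneThresholds`, the pre-tokenized input, and the handling of empty input are all assumptions made for the example; the claims do not specify them.

```python
from dataclasses import dataclass
from typing import Sequence

# Hypothetical particle list; the claims do not enumerate modal particles.
MODAL_PARTICLES = frozenset({"啊", "嗯", "哦", "呀", "哎"})


@dataclass(frozen=True)
class ToneThresholds:
    first: float = 0.5   # assumed threshold on particles / text words
    second: float = 1.0  # assumed threshold on text words / second


def has_violating_tone(
    words: Sequence[str],
    duration_s: float,
    thresholds: ToneThresholds = ToneThresholds(),
) -> bool:
    """Apply the two-ratio test to an already tokenized transcript."""
    if not words or duration_s <= 0:
        # Assumption: empty transcripts or invalid durations are
        # treated as containing no violating tone.
        return False
    particle_count = sum(1 for w in words if w in MODAL_PARTICLES)
    first_ratio = particle_count / len(words)   # particles per text word
    second_ratio = len(words) / duration_s      # text words per second
    # A violating tone requires BOTH conditions; otherwise none is found.
    return first_ratio > thresholds.first and second_ratio < thresholds.second
```

With these assumed thresholds, a ten-second clip whose transcript is mostly particles passes both conditions, while the same transcript spoken in two seconds fails the second condition.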

4. The method of claim 1, wherein before inputting the text information into the violating-tone detection module to detect whether a violating tone exists in the audio to be detected, the method further comprises:

extracting voiceprint features from the human voice audio to obtain target voiceprint features;

matching the target voiceprint features against violating voiceprint features in a voiceprint library to obtain a feature matching result; and

if the feature matching result indicates that the target voiceprint features do not match the violating voiceprint features, performing the step of inputting the text information into the violating-tone detection module to detect whether a violating tone exists in the audio to be detected.

5. The method of claim 4, wherein extracting voiceprint features from the human voice audio to obtain target voiceprint features comprises:

segmenting the human voice audio to obtain a plurality of human voice audio segments;

extracting a voiceprint feature of each human voice audio segment through a voiceprint model; and

calculating an average voiceprint feature of the plurality of voiceprint features, and taking the average voiceprint feature as the target voiceprint features.
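
The segment-and-average step of claim 5, together with the library match of claim 4, could look roughly like the sketch below. The voiceprint model is passed in as a hypothetical `embed` callable. Fixed-length segmentation, cosine similarity as the matching measure, and the `min_similarity` value of 0.8 are assumptions made for the example, not details taken from the claims.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def split_segments(samples: Sequence[float], segment_len: int) -> List[Sequence[float]]:
    """Split samples into consecutive segments; the last one may be shorter."""
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    return [samples[i:i + segment_len] for i in range(0, len(samples), segment_len)]


def target_voiceprint(
    samples: Sequence[float],
    segment_len: int,
    embed: Callable[[Sequence[float]], Vector],  # hypothetical voiceprint model
) -> Vector:
    """Embed every segment and return the element-wise mean of the features."""
    segments = split_segments(samples, segment_len)
    if not segments:
        raise ValueError("no audio samples")
    features = [embed(segment) for segment in segments]
    dim = len(features[0])
    return [sum(f[d] for f in features) / len(features) for d in range(dim)]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def matches_library(target: Vector, library: Sequence[Vector],
                    min_similarity: float = 0.8) -> bool:
    """Assumed matching rule: any library entry at or above the similarity floor."""
    return any(cosine_similarity(target, v) >= min_similarity for v in library)
```

Averaging per-segment features yields a single vector per audio file, so the library lookup needs only one comparison per stored voiceprint.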

6. The method of any one of claims 1 to 5, wherein the method further comprises:

if no violating tone exists in the audio to be detected, inputting the text information into a preset semantic model for semantic recognition to obtain a semantic recognition result; and

determining whether the audio to be detected is violating audio according to the semantic recognition result.

7. The method of claim 6, wherein after determining whether the audio to be detected is violating audio according to the semantic recognition result, the method further comprises:

if the audio to be detected is violating audio, determining a violation category according to the semantic recognition result; and

determining a corresponding processing flow according to the violation category, and processing the audio to be detected through the processing flow.
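
The branching recited in claims 1, 6, and 7 can be summarized in a short sketch. The three detectors are injected as callables because the claims do not fix their implementations. The `Verdict` type, the `ACTIONS` table, its category names, the `"manual_review"` fallback, and the convention that the semantic model returns `None` when it finds no violation are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Verdict:
    violating: bool
    category: Optional[str] = None  # set only on the semantic branch
    action: Optional[str] = None    # processing flow chosen for the category


# Hypothetical mapping from violation category to processing flow.
ACTIONS = {"abuse": "mute", "fraud": "block_and_report"}


def review(
    text: str,
    duration_s: float,
    tone_check: Callable[[str, float], bool],
    violation_model: Callable[[str], bool],
    semantic_model: Callable[[str], Optional[str]],
) -> Verdict:
    """Route a transcript through the tone branch or the semantic branch."""
    if tone_check(text, duration_s):
        # Violating tone found: the violation detection model decides.
        return Verdict(violating=violation_model(text))
    # No violating tone: fall back to semantic recognition.
    category = semantic_model(text)  # assumed to return None if no violation
    if category is None:
        return Verdict(violating=False)
    # Pick a processing flow from the violation category.
    return Verdict(True, category, ACTIONS.get(category, "manual_review"))
```

Keeping the detectors as parameters lets each stage be swapped or tested on its own without changing the control flow.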

8. A violating audio detection device, comprising:

an acquisition module configured to acquire audio to be detected, extract human voice audio from the audio to be detected, and convert the human voice audio into text information;

a first detection module configured to input the text information into a violating-tone detection module to detect whether a violating tone exists in the audio to be detected, wherein the violating-tone detection module examines at least one of the modal particles in the text information and the duration of the audio to be detected; and

a second detection module configured to, if a violating tone exists in the audio to be detected, input the text information into a preset violation detection model to detect whether the audio to be detected is violating audio.

9. An electronic device, comprising:

one or more processors;

storage means for storing one or more programs that, when executed by the one or more processors, cause the electronic device to implement the method for detecting violating audio of any one of claims 1 to 7.

10. A computer-readable storage medium having stored thereon computer-readable instructions which, when executed by a processor of a computer, cause the computer to perform the method for detecting violating audio of any one of claims 1 to 7.