CN114896978B - Entity recognition method, system and storage medium based on multi-graph collaborative semantic network - Google Patents
- Publication number
- CN114896978B (CN202210496739.0A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/242—Dictionaries
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Description
Technical Field
The present invention relates to the field of entity recognition technology, and in particular to an entity recognition method, system and storage medium based on a multi-graph collaborative semantic network.
Background Art
In the related art, named entity recognition (NER) is not only one of the most important directions in natural language processing but also a preprocessing step for downstream tasks such as relation extraction, information retrieval, and knowledge graphs. Named entity recognition aims to extract entities such as person names, place names, and organization names from unstructured text. Traditional Chinese named entity recognition is mainly divided into character-level methods and word-segmentation-level methods. Character-level methods cannot fully exploit the latent word-to-word sequence information in a sentence, so the resulting models generalize poorly. Word-segmentation-level methods rely too heavily on word-segmentation tools: segmentation errors persist through the whole task, and this error propagation degrades recognition performance. The emergence of deep learning has provided powerful tools for named entity recognition and improved the recognition ability of models. However, deep-learning-based named entity recognition relies too heavily on enhancement methods such as dictionaries and ignores the interference that dictionary enhancement introduces into the model; over-reliance on dictionaries readily causes incorrect entities to be recognized.
Summary of the Invention
The present invention aims to solve at least one of the technical problems existing in the prior art. To this end, the present invention proposes an entity recognition method, system and storage medium based on a multi-graph collaborative semantic network, which can effectively improve the accuracy of entity recognition results.
In one aspect, an embodiment of the present invention provides an entity recognition method based on a multi-graph collaborative semantic network, comprising the following steps:
acquiring original data, wherein the original data includes a plurality of Chinese sentences;
extracting feature data of the original data, wherein the feature data includes dictionary feature data and part-of-speech feature data;
constructing a multi-graph matrix according to the original data and the feature data;
merging the original data and the feature data to obtain merged data;
fusing the merged data with the multi-graph matrix and then performing collaborative training to obtain a feature fusion result;
decoding the feature fusion result to obtain the Chinese named entity recognition result of the original data.
In some embodiments, extracting the feature data of the original data includes:
using a dictionary tool to perform word feature matching on the original data to obtain the dictionary feature data;
using a part-of-speech parsing tool to perform part-of-speech parsing on the original data to obtain the part-of-speech feature data.
In some embodiments, after extracting the feature data of the original data, the method further comprises the following steps:
inputting the dictionary feature data into an embedding layer to obtain a dictionary feature vector;
inputting the part-of-speech feature data into an embedding layer to obtain a part-of-speech feature vector;
inputting the original data into an embedding layer to obtain a context information vector.
In some embodiments, constructing the multi-graph matrix according to the original data and the feature data includes:
fusing the dictionary feature vector and the context information vector to generate a boundary graph;
fusing the part-of-speech feature vector and the context information vector to generate a relationship graph and an inclusion graph, respectively.
In some embodiments, the relationship graph includes dependencies between individual characters and dependencies between words.
In some embodiments, fusing the merged data with the multi-graph matrix and then performing collaborative training includes:
inputting the fusion result of first merged data and the boundary graph into a first graph attention network for graph convolution processing, wherein the first merged data includes the merged data of the dictionary feature vector and the context information vector;
inputting the fusion result of second merged data and the relationship graph into a second graph attention network for graph convolution processing, wherein the second merged data includes the merged data of the part-of-speech feature vector and the context information vector;
inputting the fusion result of the second merged data and the inclusion graph into a third graph attention network for graph convolution processing.
In some embodiments, fusing the merged data with the multi-graph matrix and then performing collaborative training further includes:
inputting the boundary graph, the relationship graph, the inclusion graph and the context information of the original data into a fusion layer for parameter learning;
performing collaborative training through multi-graph fusion.
In another aspect, an embodiment of the present invention provides an entity recognition system based on a multi-graph collaborative semantic network, comprising:
an acquisition module, configured to acquire original data, wherein the original data includes a plurality of Chinese sentences;
an extraction module, configured to extract feature data of the original data, wherein the feature data includes dictionary feature data and part-of-speech feature data;
a construction module, configured to construct a multi-graph matrix according to the original data and the feature data;
a merging module, configured to merge the original data and the feature data to obtain merged data;
a collaborative training module, configured to fuse the merged data with the multi-graph matrix and then perform collaborative training to obtain a feature fusion result;
a decoding module, configured to decode the feature fusion result to obtain the Chinese named entity recognition result of the original data.
In another aspect, an embodiment of the present invention provides an entity recognition system based on a multi-graph collaborative semantic network, comprising:
at least one memory, configured to store a program;
at least one processor, configured to load the program to execute the entity recognition method based on a multi-graph collaborative semantic network.
In another aspect, an embodiment of the present invention provides a storage medium storing a computer-executable program which, when executed by a processor, implements the entity recognition method based on a multi-graph collaborative semantic network.
The entity recognition method based on a multi-graph collaborative semantic network provided by the embodiment of the present invention has the following beneficial effects:
This embodiment extracts dictionary feature data and part-of-speech feature data from original data containing a plurality of Chinese sentences to form the feature data, and constructs a multi-graph matrix from the original data and the feature data. It also merges the original data and the feature data to obtain merged data, fuses the merged data with the multi-graph matrix and performs collaborative training to obtain a feature fusion result, and then decodes the feature fusion result to obtain the Chinese named entity recognition result of the original data. Because this embodiment takes both dictionary feature data and part-of-speech feature data into account when recognizing Chinese entities in Chinese sentences, it can effectively improve the accuracy of entity recognition results.
Additional aspects and advantages of the present invention will be given in part in the following description, and in part will become apparent from the following description or be learned through practice of the present invention.
Brief Description of the Drawings
The present invention is further described below with reference to the accompanying drawings and embodiments, wherein:
FIG. 1 is a flow chart of an entity recognition method based on a multi-graph collaborative semantic network according to an embodiment of the present invention;
FIG. 2 is a process diagram of an entity recognition method based on a multi-graph collaborative semantic network according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are shown in the accompanying drawings, wherein the same or similar reference numerals throughout denote the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the accompanying drawings are exemplary, are intended only to explain the present invention, and are not to be construed as limiting it.
In the description of the present invention, "several" means one or more and "a plurality" means two or more; "greater than", "less than", "exceeding" and the like are understood to exclude the stated number, while "or more", "or less", "within" and the like are understood to include it. Where "first" and "second" are used, they serve only to distinguish technical features and are not to be understood as indicating or implying relative importance, the number of the indicated technical features, or the order of the indicated technical features.
In the description of the present invention, unless otherwise expressly limited, words such as "arrange" are to be understood broadly, and a person skilled in the art can reasonably determine the specific meaning of such words in the present invention in light of the specific content of the technical solution.
In the description of the present invention, reference to the terms "one embodiment", "some embodiments", "illustrative embodiment", "example", "specific example", or "some examples" means that the specific features or characteristics described in connection with that embodiment or example are included in at least one embodiment or example of the present invention. In this specification, such references do not necessarily refer to the same embodiment or example. Moreover, the specific features or characteristics described may be combined in a suitable manner in any one or more embodiments or examples.
At present, deep-learning-based named entity recognition relies too heavily on enhancement methods such as dictionaries and ignores the interference that dictionary enhancement introduces into the model; over-reliance on dictionaries readily causes incorrect entities to be recognized. For example, for the sentence "南京市长江大桥" (Nanjing Yangtze River Bridge), the dictionary can match "长江大桥" (Yangtze River Bridge) to help the model recognize the place-name entity "长江大桥". At the same time, however, the dictionary also matches the words "南京市" (Nanjing City), "南京" (Nanjing), "市长" (mayor) and "江大桥" (Jiang Daqiao). The matched word "南京" can mislead the model into treating "南京" as a true entity in this sentence, which in turn leads to the subsequent recognition of two incorrect entities, "市长" and "江大桥". The sentence is then read as saying that the mayor of Nanjing is named Jiang Daqiao. This is the problem caused by over-reliance on dictionary enhancement. In addition, most methods ignore the most important point about Chinese sentences, namely the part-of-speech dependency information of Chinese. Compared with other languages, Chinese sentences are more richly expressive. Take "广大学生会" as an example: a machine has difficulty associating the character "广" or the character "大" with the characters "学生会" (student union), yet "广大学生会" is in fact a single unit. Most current methods cannot solve this problem.
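The over-matching described above can be reproduced with a minimal sketch. The function name `match_lexicon` and the toy lexicon are illustrative assumptions; the lexicon holds only the five words named in the paragraph, and exhaustive substring lookup stands in for whatever dictionary tool an implementation actually uses.

```python
def match_lexicon(sentence, lexicon):
    """Return (start, end, word) for every multi-character substring found in the lexicon.

    Hypothetical helper: a brute-force stand-in for a dictionary matching tool.
    """
    matches = []
    for start in range(len(sentence)):
        for end in range(start + 1, len(sentence) + 1):
            word = sentence[start:end]
            if len(word) > 1 and word in lexicon:
                matches.append((start, end - 1, word))  # inclusive character span
    return matches


# Toy lexicon (assumption): exactly the words mentioned in the text.
lexicon = {"南京", "南京市", "市长", "长江大桥", "江大桥"}
matches = match_lexicon("南京市长江大桥", lexicon)

for start, end, word in matches:
    print(start, end, word)
```

The spans of "市长" and "长江大桥" overlap on the character "长", and "南京市" and "市长" overlap on "市", which is the kind of conflict a purely dictionary-driven model has no way to resolve on its own.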
Based on this, referring to FIG. 1, an embodiment of the present invention provides an entity recognition method based on a multi-graph collaborative semantic network. The method embodiment can be applied to a cloud server or to the processor of a Chinese named entity recognition platform. Specifically, as shown in FIG. 1, the method embodiment includes but is not limited to the following steps:
Step 110: acquire original data, wherein the original data includes a plurality of Chinese sentences;
Step 120: extract feature data of the original data, wherein the feature data includes dictionary feature data and part-of-speech feature data;
Step 130: construct a multi-graph matrix according to the original data and the feature data;
Step 140: merge the original data and the feature data to obtain merged data;
Step 150: fuse the merged data with the multi-graph matrix and then perform collaborative training to obtain a feature fusion result;
Step 160: decode the feature fusion result to obtain the Chinese named entity recognition result of the original data.
In this embodiment of the present application, the original data is a set of Chinese sentences; that is, this embodiment performs Chinese named entity recognition on the sentences that require it. Specifically, assume that the original data input to the character-based model is $s = \{c_1, c_2, \ldots, c_n\}$, where $c_i$ is the $i$-th character of a sentence. Each character $c_i$ is represented by looking up an embedding vector, as shown in formula (1):
$x_i^c = e^c(c_i)$    Formula (1)
where $e^c$ denotes the character embedding lookup table.
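The lookup in formula (1) can be sketched as row indexing into a table. This is an illustration under assumptions: the vocabulary is built from a single example sentence, the table `e_c` is randomly initialized rather than pre-trained, and `embedding_dim = 4` is an arbitrary choice.

```python
import numpy as np

sentence = "我爱北京故宫博物馆"

# Assumption: a toy vocabulary built from the sentence itself.
vocab = {ch: idx for idx, ch in enumerate(sorted(set(sentence)))}

rng = np.random.default_rng(0)
embedding_dim = 4
# e_c plays the role of the character embedding lookup table (random here, not pre-trained).
e_c = rng.normal(size=(len(vocab), embedding_dim))


def embed(chars):
    """Map each character c_i to its row of the table, i.e. e_c(c_i)."""
    ids = [vocab[c] for c in chars]
    return e_c[ids]


x_c = embed(sentence)   # one row per character
print(x_c.shape)
```

In practice the table would come from pre-trained character embeddings, and unknown characters would need a reserved index.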
In this embodiment of the present application, the feature data of the original data, namely the dictionary feature data and the part-of-speech feature data, can be extracted as follows: a dictionary tool performs word feature matching on the original data to obtain the dictionary feature data, and a part-of-speech parsing tool performs part-of-speech parsing on the original data to obtain the part-of-speech feature data. Because lexical knowledge helps enhance character representations and thereby improves the performance of the NER model, this embodiment denotes the entries that the dictionary matching tool matches for the characters of each sentence in the original data as $l = \{l_1, l_2, \ldots, l_m\}$. This embodiment also uses the parts of speech and dependency relations of Chinese words: the part-of-speech parsing tool segments the sentences in the original data, and the resulting words are denoted $w = \{w_1, w_2, \ldots, w_k\}$.
In this embodiment of the present application, the multi-graph matrix can be constructed from the original data and the feature data as follows: the dictionary feature data and the original data are fused to generate the boundary graph, and the part-of-speech feature data and the original data are fused to generate the relationship graph and the inclusion graph, respectively. Specifically, this embodiment constructs three word-character interaction graphs whose vertices and edges all differ. An adjacency matrix is introduced to represent each graph structure, and each value in the adjacency matrix indicates whether a relationship exists between two vertices of the graph.
Take the original data $s_i$ = (我, 爱, 北, 京, 故, 宫, 博, 物, 馆), the sentence "我爱北京故宫博物馆" (I love the Palace Museum in Beijing), as an example. Dictionary matching is performed on each character of the original data to make full use of the latent characters, yielding the dictionary feature data set $l_i$ = {北京, 故宫, 博物馆}, which is used to construct the boundary graph B-graph. The boundary graph is obtained by fusing the dictionary feature data with the original data: its vertex set $V$ is the union of the dictionary feature data and the original data, and its data is stored in the adjacency matrix $A^B$. If a matched word has multiple characters, the entries of the matrix linking the word to its first character and to its last character are set to 1.
Again taking $s_i$ as an example, a dependency parsing tool performs part-of-speech analysis on the original data and segments it into the set of characters and words $d_i$ = {我, 爱, 北京, 故宫, 博物馆}, which is used to construct the relationship graph R-graph and the inclusion graph C-graph. The dependency information set of the words is defined as $f_i$ = {SBV, HED, ATT, ATT, VOB} and is represented by the index set $index_i$ = {2, 0, 5, 5, 2}. The relationship graph R-graph considers both the dependencies between individual characters and the dependencies between words. The adjacency matrices $A^R$ and $A^C$ store the data of the R-graph and the C-graph. For the adjacency matrix $A^R$: if a segmented word contains multiple characters, the entries linking the word to every character it contains are set to 1, and the entries linking each character to the remaining characters of the word are set to 1 in turn; if a segmented word is related to another word, the corresponding entry is set to 1. For the adjacency matrix $A^C$: the entries linking a segmented word to its first and last characters are set to 1, and the entries linking a segmented character to its predecessor and successor characters are set to 1.
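The rules for $A^B$ and $A^R$ can be traced on the example sentence with a short sketch. Several points are assumptions rather than statements from the text: nodes are ordered with the $n$ characters first and the words after them, edges are stored symmetrically, each matched word is located at its first occurrence, and the index set {2, 0, 5, 5, 2} is read as 1-based head positions over the segmented words with 0 marking the root. The helper names are hypothetical, and $A^C$ is omitted.

```python
def zeros(size):
    return [[0] * size for _ in range(size)]


def link(A, i, j):
    """Mark an undirected edge between nodes i and j (symmetry is an assumption)."""
    A[i][j] = 1
    A[j][i] = 1


def boundary_graph(sentence, matched_words):
    """A^B: each multi-character matched word is linked to its first and last character."""
    n = len(sentence)
    A = zeros(n + len(matched_words))
    for w, word in enumerate(matched_words):
        start = sentence.find(word)              # toy choice: first occurrence only
        if len(word) > 1:
            link(A, n + w, start)
            link(A, n + w, start + len(word) - 1)
    return A


def relation_graph(sentence, seg_words, heads):
    """A^R: word-to-character, character-to-character (same word), and word-to-head links."""
    n = len(sentence)
    A = zeros(n + len(seg_words))
    pos = 0
    for w, word in enumerate(seg_words):
        idxs = range(pos, pos + len(word))
        if len(word) > 1:
            for i in idxs:
                link(A, n + w, i)                # word <-> each character it contains
                for j in idxs:
                    if i != j:
                        link(A, i, j)            # character <-> other characters of the word
        pos += len(word)
    for w, head in enumerate(heads):
        if head > 0:                             # 0 is taken to mark the root (HED)
            link(A, n + w, n + head - 1)         # word <-> its head word
    return A


sentence = "我爱北京故宫博物馆"
A_B = boundary_graph(sentence, ["北京", "故宫", "博物馆"])
A_R = relation_graph(sentence, ["我", "爱", "北京", "故宫", "博物馆"], [2, 0, 5, 5, 2])

print(len(A_B), sum(map(sum, A_B)))
print(len(A_R), sum(map(sum, A_R)))
```

With this node ordering, $A^B$ has $n + m = 12$ rows and $A^R$ has $n + k = 14$ rows, which matches the output dimensions given later for the graph attention networks.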
In this embodiment of the present application, after the feature data of the original data is extracted, the dictionary feature data can be input into an embedding layer to obtain a dictionary feature vector, the part-of-speech feature data can be input into an embedding layer to obtain a part-of-speech feature vector, and the original data can be input into an embedding layer to obtain a context information vector. Once the feature data and the context information of the original data are available in vector form, the multi-graph matrix can be constructed from the original data and the feature data by fusing the dictionary feature vector with the context information vector to generate the boundary graph, and by fusing the part-of-speech feature vector with the context information vector to generate the relationship graph and the inclusion graph, respectively.
Specifically, as shown in FIG. 2, the dictionary feature data set is defined as $l = \{l_1, l_2, \ldots, l_m\}$ and is fed into a pre-trained embedding layer to obtain low-dimensional dense vectors of the dictionary feature data, namely the dictionary feature vector $e^w(l)$, where $e^w$ is a word embedding lookup table. Similarly, the part-of-speech feature data set $w = \{w_1, w_2, \ldots, w_k\}$ and the original data $s = \{c_1, c_2, \ldots, c_n\}$ are each fed into the embedding layer to obtain the part-of-speech feature vector $x^w = e^w(w)$ and the vector representation of the original data $x^c = e^c(c)$, where $e^c$ is the character embedding lookup table. The original data is then processed with an LSTM model, as described by formulas (2), (3) and (4).
The preceding-context vector and the following-context vector of each original data vector are combined by an XOR operation to obtain the set of context information vectors $H = \{h_1, h_2, \ldots, h_n\}$ of the original data. The dictionary feature vectors are merged with the original data vectors to form the vertex set of the boundary graph B-graph. The part-of-speech feature vectors are merged with the original data vectors to form the vertex set shared by the relationship graph R-graph and the inclusion graph C-graph.
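Formulas (2) to (4) are referenced but not shown in the text. A conventional bidirectional LSTM formulation that fits the description is sketched below. It is an assumption, not the source's own notation: the arrows, the recurrence arguments, and the symbol $\oplus$ for the combining step are all supplied here. In BiLSTM encoders this step is commonly a concatenation of the forward and backward states, whereas the text calls it an XOR operation.

```latex
\begin{aligned}
\overrightarrow{h_i} &= \overrightarrow{\mathrm{LSTM}}\left(x_i^c,\ \overrightarrow{h_{i-1}}\right) \\
\overleftarrow{h_i}  &= \overleftarrow{\mathrm{LSTM}}\left(x_i^c,\ \overleftarrow{h_{i+1}}\right) \\
h_i &= \overrightarrow{h_i} \oplus \overleftarrow{h_i}
\end{aligned}
```

Under this reading, $\overrightarrow{h_i}$ summarizes the characters before position $i$, $\overleftarrow{h_i}$ summarizes those after it, and collecting $h_1, \ldots, h_n$ gives the context set $H$.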
In this embodiment of the present application, as shown in FIG. 2, fusing the merged data with the multi-graph matrix and then performing collaborative training includes but is not limited to the following steps:
inputting the fusion result of the first merged data and the boundary graph B-graph into the first graph attention network GAT1 for graph convolution processing, wherein the first merged data includes the merged data of the dictionary feature vector and the context information vector;
inputting the fusion result of the second merged data and the relationship graph R-graph into the second graph attention network GAT2 for graph convolution processing, wherein the second merged data includes the merged data of the part-of-speech feature vector and the context information vector;
inputting the fusion result of the second merged data and the inclusion graph C-graph into the third graph attention network GAT3 for graph convolution processing.
In this embodiment, graph attention networks (GAT) are used to model the three interaction graphs. Specifically, in an $M$-layer GAT, the input of the $j$-th layer consists of a set of node features $NF^{j} = \{f_1, f_2, \ldots, f_N\}$. This embodiment also introduces the adjacency matrix $A$, with $f_i \in \mathbb{R}^{F}$ and $A \in \mathbb{R}^{N \times N}$, where $F$ is the feature dimension of the $j$-th layer and $N$ is the number of nodes. The output of the $j$-th layer is a new set of node features $NF^{(j+1)} = \{f_1', f_2', \ldots, f_N'\}$. Each GAT has $K$ independent and distinct attention heads, as shown in formulas (5) and (6):
where $\Vert$ denotes the concatenation operation; $\sigma$ denotes a nonlinear activation function; $N_i$ denotes the neighbors of node $i$ in the graph, which are weighted by the attention parameters; $W^k \in \mathbb{R}^{F' \times F}$; and $a \in \mathbb{R}^{2F'}$ is a single-layer feed-forward neural network. The output $f_i'$ of the $j$-th layer has dimension $KF'$, and $F'$ is the dimension of the final output features, for which the average over the heads is kept. Specifically, the average is computed as shown in formula (7):
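Formulas (5) to (7) are referenced but not shown in the text. The standard multi-head graph attention formulation uses the same symbols ($\Vert$, $\sigma$, $N_i$, $W^k$, $a$), so a sketch consistent with the description is given below. The attention coefficient symbol $\alpha_{ij}^{k}$, the LeakyReLU scoring function and the name $f_i^{\mathrm{final}}$ are assumptions taken from that standard formulation.

```latex
\begin{aligned}
f_i' &= \Big\Vert_{k=1}^{K}\ \sigma\Big( \sum_{j \in N_i} \alpha_{ij}^{k}\, W^{k} f_j \Big) \\
\alpha_{ij}^{k} &= \frac{\exp\big(\mathrm{LeakyReLU}\big(a^{\top}\,[\,W^{k} f_i \,\Vert\, W^{k} f_j\,]\big)\big)}
                       {\sum_{u \in N_i} \exp\big(\mathrm{LeakyReLU}\big(a^{\top}\,[\,W^{k} f_i \,\Vert\, W^{k} f_u\,]\big)\big)} \\
f_i^{\mathrm{final}} &= \sigma\Big( \frac{1}{K} \sum_{k=1}^{K} \sum_{j \in N_i} \alpha_{ij}^{k}\, W^{k} f_j \Big)
\end{aligned}
```

The first line concatenates the $K$ heads, which gives the dimension $KF'$ mentioned in the text; the last line averages the heads instead, so the final output keeps dimension $F'$.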
Specifically, this embodiment constructs three independent graph attention networks, GAT1, GAT2 and GAT3, to model the three interaction graphs. The boundary graph B-graph captures the boundary entity information of lexical knowledge. From its vertex set, the GAT1 model produces the output node features shown in formula (8):
$G_1 = \mathrm{GAT1}(X^B, A^B)$    Formula (8)
where $G_1 \in \mathbb{R}^{F' \times (n+m)}$, $n$ is the number of characters in the sentence, and $m$ is the number of matched words. The relationship graph R-graph and the inclusion graph C-graph share the same vertex set. The output node features of the GAT2 and GAT3 models are $G_2$ and $G_3$, respectively, as shown in formulas (9) and (10):
$G_2 = \mathrm{GAT2}(X, A^R)$    Formula (9)
$G_3 = \mathrm{GAT3}(X, A^C)$    Formula (10)
where $G_y \in \mathbb{R}^{F' \times (n+k)}$, $y \in \{2, 3\}$, $n$ is the number of characters in the sentence, and $k$ is the number of words obtained by part-of-speech parsing and segmentation. For the output node features $G_i$, $i \in \{1, 2, 3\}$, of the graph attention networks, only the first $n$ columns of the matrix, the character representations $Q_i = G_i[:, 0:n]$, $i \in \{1, 2, 3\}$, are kept for use by the decoding layer.
In this embodiment of the application, fusing the merged data with the multi-graph matrix and performing collaborative training further includes: inputting the boundary graph, the relationship graph, the inclusion graph and the context information of the original data into a fusion layer for parameter learning, and then performing collaborative training through multi-graph fusion to obtain the feature fusion result. The three character-word interaction graphs built in this embodiment serve different functions, and the fusion layer combines them so that the strengths of each are used. Because the context information of the original data also helps Chinese NER, it is fed into the fusion module together with the retained features output by the three graph attention networks. The fusion output R is given by formula (11):
$R=W_1H+W_2Q_1+W_3Q_2+W_4Q_3$    Formula (11)
where H carries the context information of the original data and $W_y$, $y\in\{1,2,3,4\}$, are trainable matrices. The fusion layer thus yields a collaboration matrix R that integrates the different functions of the graphs.
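The column slicing $Q_i=G_i[:,0:n]$ and the weighted sum of formula (11) can be sketched in a few lines of NumPy. This is an illustration under assumptions rather than the patent's implementation: the function name `fuse`, the column-wise layout with one column per character, and taking n from the width of H are all hypothetical choices.

```python
import numpy as np

def fuse(H, G1, G2, G3, W1, W2, W3, W4):
    """Fusion in the shape of formula (11) (hypothetical helper).

    H:  (d_h, n) context representation, one column per character.
    Gi: (F', n + extra) GAT outputs; the extra columns are word nodes.
    W1: (d, d_h); W2, W3, W4: (d, F'). Returns R with shape (d, n).
    """
    n = H.shape[1]
    # Keep only the first n columns, i.e. the character nodes.
    Q1, Q2, Q3 = (G[:, :n] for G in (G1, G2, G3))
    return W1 @ H + W2 @ Q1 + W3 @ Q2 + W4 @ Q3
```

Because the word-node columns are dropped before the sum, the three graphs may have different numbers of word nodes (m for the B-graph, k for the R-graph and C-graph) and still produce one vector per character in R.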
In this embodiment of the application, the prediction result is obtained by decoding the feature fusion result with a CRF. Specifically, a standard conditional random field (CRF) is used to capture the dependencies between consecutive labels. For any input $s=\{c_1,c_2,\ldots,c_n\}$, collaborative training with multi-graph fusion yields the feature fusion result $R=(r_1,r_2,\ldots,r_n)$, which is the input to the CRF decoding layer. Decoding through the CRF layer gives the Chinese named entity recognition result for the input. Given a predicted label sequence $Y=(y_1,y_2,\ldots,y_n)$, the calculation for the label sequence Y is given by formula (12).
Here, $y'$ denotes an arbitrary label sequence, and the remaining quantities in the formula are training parameters. This embodiment uses the first-order Viterbi algorithm to find the highest-scoring label sequence over the character-based input representation. Given a manually annotated training set, the model is optimized with the sentence-level log-likelihood loss with L2 regularization, as shown in formula (13).
Here, $\lambda$ is the L2 regularization parameter and $\Theta$ is the set of training parameters.
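The decoding and training steps described above can be sketched for a generic linear-chain CRF as follows. This is a minimal sketch, not the patent's formulas (12) and (13): it assumes that per-character label scores (`emissions`) have already been computed from the fusion result R, that `transitions[p, q]` scores moving from label p to label q, and that the L2 penalty takes the common form $(\lambda/2)\lVert\Theta\rVert^2$. All function names are hypothetical.

```python
import numpy as np

def viterbi(emissions, transitions):
    """First-order Viterbi. emissions: (n, L), transitions: (L, L)."""
    n, L = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((n, L), dtype=int)
    for t in range(1, n):
        cand = score[:, None] + transitions          # rows: previous label
        back[t] = cand.argmax(axis=0)                # best previous label
        score = cand.max(axis=0) + emissions[t]
    path = [int(score.argmax())]
    for t in range(n - 1, 0, -1):                    # follow back-pointers
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(score.max())

def sequence_score(emissions, transitions, labels):
    """Unnormalized score of one label sequence."""
    s = emissions[0, labels[0]]
    for t in range(1, len(labels)):
        s += transitions[labels[t - 1], labels[t]] + emissions[t, labels[t]]
    return float(s)

def log_partition(emissions, transitions):
    """Log of the summed exponentiated scores of all label sequences."""
    alpha = emissions[0].copy()
    for t in range(1, emissions.shape[0]):
        cand = alpha[:, None] + transitions
        m = cand.max(axis=0)
        alpha = m + np.log(np.exp(cand - m).sum(axis=0)) + emissions[t]
    m = alpha.max()
    return float(m + np.log(np.exp(alpha - m).sum()))

def nll_with_l2(emissions, transitions, labels, params, lam):
    """Negative sentence log-likelihood plus an assumed (lam / 2) * ||params||^2."""
    nll = log_partition(emissions, transitions) - sequence_score(
        emissions, transitions, labels)
    return nll + 0.5 * lam * sum(float((p ** 2).sum()) for p in params)
```

At inference time only `viterbi` is needed; `log_partition` normalizes over every possible label sequence, which corresponds to the role of the arbitrary sequence $y'$ in the text.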
An embodiment of the present invention provides an entity recognition system based on a multi-graph collaborative semantic network, comprising:
an acquisition module, configured to acquire original data, the original data comprising a plurality of Chinese sentences;
an extraction module, configured to extract feature data of the original data, the feature data comprising dictionary feature data and part-of-speech feature data;
a construction module, configured to construct a multi-graph matrix from the original data and the feature data;
a merging module, configured to merge the original data and the feature data to obtain merged data;
a collaborative training module, configured to fuse the merged data with the multi-graph matrix and perform collaborative training to obtain a feature fusion result;
a decoding module, configured to decode the feature fusion result to obtain the Chinese named entity recognition result of the original data.
The content of the method embodiments of the present invention applies to this system embodiment. The functions implemented by this system embodiment are the same as those of the method embodiments above, and the beneficial effects achieved are also the same.
An embodiment of the present invention provides an entity recognition system based on a multi-graph collaborative semantic network, comprising:
at least one memory, configured to store a program;
at least one processor, configured to load the program to execute the entity recognition method based on a multi-graph collaborative semantic network shown in FIG. 1.
The content of the method embodiments of the present invention applies to this system embodiment. The functions implemented by this system embodiment are the same as those of the method embodiments above, and the beneficial effects achieved are also the same.
An embodiment of the present invention provides a storage medium storing a computer-executable program which, when executed by a processor, implements the entity recognition method based on a multi-graph collaborative semantic network shown in FIG. 1.
An embodiment of the present invention further provides a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. A processor of a computer device can read the computer instructions from the computer-readable storage medium and execute them, causing the computer device to perform the entity recognition method based on a multi-graph collaborative semantic network shown in FIG. 1.
The embodiments of the present invention have been described in detail above with reference to the accompanying drawings, but the present invention is not limited to these embodiments. Within the knowledge of a person of ordinary skill in the art, various changes can be made without departing from the purpose of the present invention. In addition, where no conflict arises, the embodiments of the present invention and the features in the embodiments can be combined with one another.
Claims (7)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210496739.0A | 2022-05-09 | 2022-05-09 | Entity recognition method, system and storage medium based on multi-graph collaborative semantic network |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN114896978A | 2022-08-12 |
| CN114896978B | 2024-10-22 |
Family
ID=82721379
Family Applications (1)
| Application Number | Status | Priority Date | Filing Date | Title |
|---|---|---|---|---|
| CN202210496739.0A | Active | 2022-05-09 | 2022-05-09 | Entity recognition method, system and storage medium based on multi-graph collaborative semantic network |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN114896978B (en) |
Family Cites Families (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US11176325B2 (en) * | 2017-06-26 | 2021-11-16 | International Business Machines Corporation | Adaptive evaluation of meta-relationships in semantic graphs |
| CN107526834B (en) * | 2017-09-05 | 2020-10-23 | 北京工商大学 | Word2vec improvement method for training correlation factors of united parts of speech and word order |
| CN111046671A (en) * | 2019-12-12 | 2020-04-21 | 中国科学院自动化研究所 | Chinese named entity recognition method based on graph network and merged into dictionary |
| CN113420557B (en) * | 2021-06-09 | 2024-03-08 | 山东师范大学 | Chinese named entity recognition method, system, equipment and storage medium |
Non-Patent Citations (1)
| Title |
|---|
| MSCN:multi-graph collaborative semantic network for Chinese NER;Wenjing Gu等;《Knowledge science,engineering and management》;20220719;322-334 * |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |