CN108415902A

CN108415902A - A named entity linking method based on a search engine

Info

Publication number: CN108415902A
Application number: CN201810138076.9A
Authority: CN
Inventors: 吴共庆; 何颖; 胡学钢; 胡东辉; 李磊; 吴信东
Original assignee: Hefei University of Technology
Current assignee: Hefei University of Technology
Priority date: 2018-02-10
Filing date: 2018-02-10
Publication date: 2018-08-17
Anticipated expiration: 2038-02-10
Also published as: CN108415902B

Abstract

The invention discloses a named entity linking method based on a search engine, comprising the following steps: segmenting the input query text and identifying a group of named entities in the text that need to be linked to a knowledge base; for each named entity, searching the Chinese entity knowledge base by entity mention to obtain a list of candidate entities; using a search engine to expand the context information of the entity mention; using a search engine to expand the context information of each candidate entity in the candidate entity list; calculating the matching degree between the entity mention and each candidate entity; and selecting the entity with the highest matching degree for linking. The method retrieves entity mentions and candidate entities in a search engine and extracts information from the search results to expand their context information, which provides additional information for improving the accuracy of entity linking.

Description

A Method of Named Entity Linking Based on Search Engine

Technical Field

The invention relates to network information processing methods, and in particular to a named entity linking method based on a search engine.

Background

With the continuing advance of informatization, people increasingly obtain information through the Internet, and traditional information media such as newspapers, magazines, and periodicals are gradually being overtaken by portal websites, digital libraries, and search engines. Most information obtained through the Internet is text spread through platforms such as news websites, microblogs, and Tieba forums. These texts contain a large number of named entities, so the information age, while offering efficient ways of reading, is accompanied by explosive growth in the volume of information. How to link this large number of named entities efficiently, accurately, and unambiguously to the corresponding entities in a knowledge base has become an urgent problem in fields such as information fusion, natural language processing, and information retrieval.

Named entity linking is mainly divided into two stages. The first stage searches the knowledge base using the entity mention to obtain a set of potential candidate entities. The second stage performs entity disambiguation based on the entity mention and the candidate entity set. At present, work on named entity linking concentrates on optimizing the entity disambiguation algorithm. The main disambiguation methods are those based on entity popularity ranking, those based on context similarity ranking, classification-based methods, and graph-based methods.

Disambiguation based on entity popularity ranking is the simplest and most direct method: it returns the candidate entity with the highest popularity. For person entities, for example, the better known a person is, the more likely that person is to be chosen as the link target; this degree of fame is usually measured by the number of links associated with the entity. The shortcoming of this method is that it returns the same result regardless of whether the entity actually being queried is popular.

Disambiguation methods based on context similarity ranking use the similarity between the context of the entity mention and the context of each candidate entity to decide which entity should be linked. These methods usually represent the two contexts as bag-of-words vectors, or represent selected keywords, such as the named entities in the text, as word-space vectors, and then compute a similarity between the vectors, such as cosine similarity. Their drawback is that the similarity computation depends on word co-occurrence information and needs a large amount of entity context; when little context is available, the accuracy of the result is low.

Classification-based disambiguation methods treat linking as a classification process. In the training stage, <entity mention, candidate entity> pairs in which the mention is linked to the entity in the knowledge base are extracted as positive examples, and pairs with no link are taken as negative examples. A classifier is then trained on the training set formed from these positive and negative examples; common classification models include SVM, decision trees, and naive Bayes. In the linking stage, the trained classifier classifies the candidate entities of an entity mention. However, this approach often yields two or more linked entities, which must then be re-ranked in combination with methods such as context similarity.

Graph-based disambiguation methods build a graph in which entity mentions and candidate entities are the nodes and the relations between them are the edges. When computing the relation between a mention node and its candidate entity nodes, most methods use the context similarity between the two. Graph-based methods therefore also depend on word co-occurrence information to some extent, and likewise suffer from low accuracy when little context is available.

In summary, most named entity linking methods rely on the similarity between the mention context and the candidate entity context, yet computing this feature requires a large amount of context for the mention and its candidates in order to overcome the over-reliance on word co-occurrence information. To obtain more context for entity mentions and candidate entities, a method is needed that can expand the context information of a mention and of its candidate entities so as to improve the accuracy of the computation.

Summary of the Invention

The purpose of the invention is to provide a named entity linking method based on a search engine that solves the problems of the prior art.

To achieve this purpose, the invention adopts the following technical solution:

A named entity linking method based on a search engine, characterized by comprising the following steps:

Step 1: segment the input query text and identify a group of named entities in the text that need to be linked to the knowledge base;

Step 2: for each named entity, obtain a candidate entity list by means of the entity mention;

Step 3: use a search engine to expand the context information of the entity mention;

Step 4: use a search engine to expand the context information of each candidate entity in the candidate entity list;

Step 5: calculate the matching degree between the entity mention and each candidate entity;

Step 6: rank the candidate entities by matching degree and select the candidate entity with the highest matching degree as the target entity for linking.

The named entity linking method based on a search engine described above is characterized in that, in step 1:

Named entities are person names, organization names, place names, and all other entities identified by a name; an entity is an object or concept in real life;

The query text is the text or set of texts input by the user, comprising the entity mention and its context text, and has a fixed length;

The knowledge base is a collection of entity entries covering persons, organizations, place names, and all other entities identified by a name;

An entity entry is the specific information stored in the knowledge base for each entity, including the entity's context text, its attributes, and their values. The context information of an entity is text that describes the entity in detail, and an entity mention is the nominal reference in the query text to the entity to be linked.

The named entity linking method based on a search engine described above is characterized in that, in step 2, candidate entities are obtained from the entity mention as follows:

Each entity entry in the knowledge base is taken in turn and its name attribute value and alias attribute values are traversed; the string similarity between the entity mention in the query text and the name and alias attribute values of each entity entry is calculated. If the similarity between the entity mention and the name attribute value of an entity entry is greater than a preset threshold, the current entity entry is output to the candidate entity list; if the similarity between the entity mention and an alias attribute value of an entity entry is greater than the preset threshold, the entity entry to which that alias attribute belongs is output to the candidate entity list;

The above procedure yields the initial candidate entity list, which is then filtered to obtain the candidate entity list as follows:

Some of the entities in the knowledge base are extracted as a training set, and an entity-type classifier is trained on the set of labeled <entity, type> pairs. The entity-type classifier is then used to identify the type of each entity in the candidate entity list; entities of the same type as the entity to be linked in the query text are kept and entities of a different type are deleted, giving the final candidate entity list;

The type of an entity is its category attribute, including person, place, organization, and time. An <entity, type> pair is a mapping between an entity and its type; correctly mapped <entity, type> pairs are the positive examples for training the classifier and the others are negative examples. The entity-type classifier is a multi-class classification model trained on the set of <entity, type> pairs, and the type of an entity can be determined with this model.

The named entity linking method based on a search engine described above is characterized in that, in step 3, the context information of the entity mention is expanded with a search engine as follows:

The entity mentions of the group of named entities in the query text of step 1 are used as a seed for a search-engine query, which returns search result pages related to the entity mention. The titles and snippets of the first m search results are extracted from the result pages to obtain the expansion text of the entity mention. The query text and the expansion text together constitute the expanded context information of the entity mention.

The named entity linking method based on a search engine described above is characterized in that, in step 4, the context information of each candidate entity in the candidate entity list is expanded with a search engine as follows:

The context text in the knowledge base of each candidate entity in the candidate entity list is segmented, and named entity recognition is performed on the segmented text to obtain a group of named entities related to the candidate entity. This group of named entities is then used as one seed for a search-engine query, which returns search result pages related to the candidate entity. The titles and snippets of the first n search results are extracted from the result pages to obtain the expansion text of the candidate entity. The candidate entity's context text in the knowledge base and the expansion text together constitute the expanded context information of the candidate entity.

The named entity linking method based on a search engine described above is characterized in that, in step 5, the matching degree between the entity mention and each candidate entity is calculated as follows:

For each candidate entity in the candidate entity list, calculate the cosine similarity a between the expanded context information of the entity mention and the expanded context information of the candidate entity, calculate the popularity b of the candidate entity, and calculate the matching degree between the entity mention and the candidate entity as c = w_a × a + w_b × b, where w_a and w_b are preset weights and w_a + w_b = 1. The popularity of a candidate entity is the ratio of the number of times the entity mention appears as a hyperlink pointing to that candidate entity's page to the total number of times the entity mention appears as a hyperlink pointing to any of the candidate entities.
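The matching degree defined above can be written out in one place as follows. The symbols m (entity mention), e (candidate entity), C(m) (the candidate list of m), the vectors v_m and v_e, and count(m → e) are notation introduced here for illustration; the text does not fix how the expanded contexts are turned into vectors, so v_m and v_e stand for whatever term-weight vectors are chosen.

```latex
\begin{aligned}
a(m,e) &= \frac{\mathbf{v}_m \cdot \mathbf{v}_e}{\lVert \mathbf{v}_m \rVert \, \lVert \mathbf{v}_e \rVert}, \\
b(m,e) &= \frac{\operatorname{count}(m \to e)}{\sum_{e' \in C(m)} \operatorname{count}(m \to e')}, \\
c(m,e) &= w_a \, a(m,e) + w_b \, b(m,e), \qquad w_a + w_b = 1 .
\end{aligned}
```

For example, with a = 0.5, b = 0.25, and assumed weights w_a = 0.8, w_b = 0.2, the matching degree is c = 0.8 × 0.5 + 0.2 × 0.25 = 0.45.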

Compared with the prior art, the beneficial effects of the invention are as follows:

1. The invention generates the candidate entity list in two stages. First, string similarity is computed to obtain the initial candidate entity list. Then the initial candidate entity list is filtered by entity type classification to obtain the candidate entity list. This two-stage procedure can produce a larger number of candidate entities related to the entity mention while removing some noisy candidates through entity type classification, which reduces the complexity of disambiguation.

2. The key contribution of the invention is a method that expands the context text of entity mentions and candidate entities in order to solve the named entity linking problem. The context text of an entity mention is limited in length, and the entity context contained in the knowledge base is also limited, so computing the similarity between the mention context and the candidate entity context from word co-occurrence information is not very accurate. By retrieving the entity mention and the candidate entities in a search engine, all web pages related to them are obtained; organizing these pages yields expanded context information, which helps extract more entity features and thus provides additional information for improving the accuracy of entity linking.

Brief Description of the Drawings

FIG. 1 is a flowchart of the named entity linking method based on a search engine according to the invention.

Detailed Description

A named entity linking method based on a search engine comprises the following steps:

Step 1: segment the input query text and identify a group of named entities in the text that need to be linked to the knowledge base.

Step 2: for each named entity, obtain a candidate entity list by means of the entity mention.

Step 3: use a search engine to expand the context information of the entity mention.

Step 4: use a search engine to expand the context information of each candidate entity in the candidate entity list.

Step 5: calculate the matching degree between the entity mention and each candidate entity.

In step 1, named entities are person names, organization names, place names, and all other entities identified by a name; an entity is an object or concept in real life. The query text is the text or set of texts input by the user, comprising the entity mention and its context text, and has a fixed length. The knowledge base is a collection of entity entries covering persons, organizations, place names, and all other entities identified by a name. An entity entry is the specific information stored in the knowledge base for each entity, including the entity's context text, its attributes, and their values. The context information of an entity is text that describes the entity in detail, and an entity mention is the nominal reference in the query text to the entity to be linked.

In step 2, candidate entities are obtained from the entity mention as follows. Each entity entry in the knowledge base is taken in turn and its name attribute value and alias attribute values are traversed; the string similarity between the entity mention in the query text and the name and alias attribute values of each entity entry is calculated. If the similarity between the entity mention and the name attribute value of an entity entry is greater than a preset threshold, the current entity entry is output to the candidate entity list; if the similarity between the entity mention and an alias attribute value of an entity entry is greater than the preset threshold, the entity entry to which that alias attribute belongs is output to the candidate entity list. This yields the initial candidate entity list.

The initial candidate entity list is then filtered to obtain the candidate entity list. Some of the entities in the knowledge base are extracted as a training set, and an entity-type classifier is trained on the set of labeled <entity, type> pairs. The entity-type classifier is then used to identify the type of each entity in the candidate entity list; entities of the same type as the entity to be linked in the query text are kept and entities of a different type are deleted, giving the final candidate entity list. The type of an entity is its category attribute, including person, place, organization, and time. An <entity, type> pair is a mapping between an entity and its type; correctly mapped <entity, type> pairs are the positive examples for training the classifier and the others are negative examples. The entity-type classifier is a multi-class classification model trained on the set of <entity, type> pairs. A multi-class classification model divides entities into several classes; each entity belongs to exactly one of these classes, and the classes are mutually exclusive. The type of an entity can be determined with this model.
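The two-stage candidate generation described in step 2 can be sketched as follows. This is a minimal illustration under several assumptions: the text does not name the string similarity measure, so `difflib.SequenceMatcher` is used as a stand-in; the dictionary layout of a knowledge-base entry, the function names, and the `classify` callable (a placeholder for the trained entity-type classifier) are all hypothetical; the toy aliases and the second and third toy entries are invented. The threshold of 0.7 is the value given in the embodiment.

```python
from difflib import SequenceMatcher

THRESHOLD = 0.7  # preset threshold used in the embodiment


def string_similarity(a, b):
    """Similarity in [0, 1]. SequenceMatcher is an assumed stand-in measure."""
    return SequenceMatcher(None, a, b).ratio()


def initial_candidates(mention, knowledge_base, threshold=THRESHOLD):
    """Stage 1: keep entries whose name or any alias exceeds the threshold."""
    kept = []
    for entry in knowledge_base:
        labels = [entry["name"]] + list(entry.get("aliases", []))
        if any(string_similarity(mention, label) > threshold for label in labels):
            kept.append(entry)
    return kept


def filter_by_type(candidates, mention_type, classify):
    """Stage 2: keep candidates whose predicted type equals the mention's type."""
    return [c for c in candidates if classify(c) == mention_type]


# Toy knowledge base. Aliases and the last two entries are invented.
TOY_KB = [
    {"name": "李娜(网球运动员)", "aliases": ["李娜", "娜姐"], "type": "person"},
    {"name": "toy-place-entry", "aliases": ["李娜"], "type": "place"},
    {"name": "toy-unrelated-entry", "aliases": [], "type": "person"},
]

initial = initial_candidates("李娜", TOY_KB)
# The lambda reads a stored label; a real system would call the trained classifier.
final = filter_by_type(initial, "person", classify=lambda e: e["type"])
print([e["name"] for e in final])
```

Scanning every entry for every mention is linear in the size of the knowledge base, which matches the procedure as described; an index over names and aliases would be a natural optimization but is not part of the text.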

In step 3, the context information of the entity mention is expanded with a search engine as follows. The entity mentions of the group of named entities in the query text of step 1 are used as a seed for a search-engine query, which returns search result pages related to the entity mention. The titles and snippets of the first m search results are extracted from the result pages to obtain the expansion text of the entity mention. The query text and the expansion text together constitute the expanded context information of the entity mention.
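The expansion in step 3 can be sketched as below. This is only an outline under stated assumptions: `search` is a hypothetical callable supplied by the caller that takes a query string and yields (title, snippet) pairs in rank order; how the seed entities are combined into one query (here, joined by spaces) and how the texts are concatenated (here, by newlines) are choices made for illustration, not details given in the text.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical search interface: query string -> ranked (title, snippet) pairs.
SearchFn = Callable[[str], Iterable[Tuple[str, str]]]


def expand_context(base_text: str, seeds: List[str], search: SearchFn, top_k: int) -> str:
    """Return base_text followed by the titles and snippets of the top_k results."""
    query = " ".join(seeds)  # assumed way of combining the seed entities
    parts = [base_text]
    for rank, (title, snippet) in enumerate(search(query)):
        if rank >= top_k:
            break
        parts.append(title)
        parts.append(snippet)
    return "\n".join(parts)


def fake_search(query: str):
    """Stand-in search function returning invented results for demonstration."""
    return [(f"title {i} for {query}", f"snippet {i}") for i in range(1, 4)]


expanded = expand_context("query text", ["李娜", "纳达尔", "娜姐"], fake_search, top_k=2)
print(expanded)
```

Passing the search function in as a parameter keeps the sketch independent of any particular search engine or scraping library. The same helper would also fit step 4, with the candidate's knowledge-base context as `base_text`, the named entities recognized in that context as `seeds`, and n as `top_k`.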

In step 4, the context information of each candidate entity in the candidate entity list is expanded with a search engine as follows. The context text in the knowledge base of each candidate entity is segmented, and named entity recognition is performed on the segmented text to obtain a group of named entities related to the candidate entity. This group of named entities is then used as one seed for a search-engine query, which returns search result pages related to the candidate entity. The titles and snippets of the first n search results are extracted from the result pages to obtain the expansion text of the candidate entity. The candidate entity's context text in the knowledge base and the expansion text together constitute the expanded context information of the candidate entity.

In step 5, the matching degree between the entity mention and each candidate entity is calculated as follows. For each candidate entity in the candidate entity list, calculate the cosine similarity a between the expanded context information of the entity mention and the expanded context information of the candidate entity, calculate the popularity b of the candidate entity, and calculate the matching degree between the entity mention and the candidate entity as c = w_a × a + w_b × b, where w_a and w_b are preset weights and w_a + w_b = 1. The popularity of a candidate entity is the ratio of the number of times the entity mention appears as a hyperlink pointing to that candidate entity's page to the total number of times the entity mention appears as a hyperlink pointing to any of the candidate entities.
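The scoring in step 5, followed by the selection of the best-scoring candidate, can be sketched as follows. The assumptions are: contexts are represented as raw term-count (bag-of-words) vectors over already-segmented tokens, which the text does not prescribe; the hyperlink counts are given as a plain dictionary; the default weight w_a = 0.5 is arbitrary, since the text only says the weights are preset and sum to 1. All function names are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity a between two token lists, using term counts."""
    va, vb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def popularity(link_counts, candidate):
    """Popularity b: share of the mention's hyperlinks that point to this candidate."""
    total = sum(link_counts.values())
    return link_counts.get(candidate, 0) / total if total else 0.0


def matching_degree(a, b, w_a=0.5):
    """c = w_a * a + w_b * b with w_b = 1 - w_a (w_a default is an assumption)."""
    return w_a * a + (1.0 - w_a) * b


def best_candidate(mention_tokens, candidate_tokens, link_counts, w_a=0.5):
    """Score every candidate and return (best name, all scores)."""
    scores = {
        name: matching_degree(
            cosine_similarity(mention_tokens, tokens),
            popularity(link_counts, name),
            w_a,
        )
        for name, tokens in candidate_tokens.items()
    }
    return max(scores, key=scores.get), scores


# Invented toy data: two candidates, with hyperlink counts for the mention.
best, scores = best_candidate(
    ["tennis", "retire"],
    {"x": ["tennis", "champion"], "y": ["singer", "song"]},
    {"x": 1, "y": 3},
    w_a=0.8,
)
print(best, scores)
```

In this toy case candidate "y" is more popular, but "x" shares context with the mention, so with w_a = 0.8 the context term dominates and "x" is selected. Lowering w_a shifts the decision toward popularity.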

Specific embodiment:

This embodiment provides a named entity linking method based on a search engine. The steps of the method in this embodiment are described below with reference to FIG. 1:

(1) As shown in S101 of FIG. 1, the input query text is segmented and a group of named entities in the text that need to be linked to the knowledge base is identified. A user inputs a query text, for example a passage from a news item: "李娜退役仪式纳达尔献花，娜姐泪奔享全场欢呼。" (roughly: "Nadal presents flowers at Li Na's retirement ceremony; Sister Na bursts into tears amid cheers from the whole crowd."). The input query text is first segmented, and a named entity recognition tool is then used to identify the group of named entities in the text. Named entities here are person names, organization names, place names, and all other entities identified by a name; in a broader sense, entities also include numbers, dates, currencies, and addresses.

Segmenting the input query text means using a word segmentation tool to divide the query text into words according to a Chinese dictionary. The tool used in this embodiment is the NLPIR Chinese word segmentation tool provided by the Institute of Computing Technology, Chinese Academy of Sciences (URL: http://ictclas.nlpir.org/downloads). The set of words and their type tags obtained by segmentation is: W = {李娜/nr 退役/vi 仪式/n 纳达尔/nr 献花/vi ，/wd 娜姐/nr 泪/n 奔/v 享/vg 全场/n 欢呼/v}, where "/nr", "/vi", "/n", "/wd", "/v", and "/vg" are the type tags of the words.

After segmentation has produced a set of words, named entity recognition is used to identify the set of entities contained in the segmented text: E = {李娜, 纳达尔, 娜姐} (Li Na, Nadal, "Sister Na"). The named entity recognition method used in this embodiment takes the words and type tags produced by segmentation and extracts the words tagged nr, ns, nt, or nz to form the named entity set. Here "李娜", "纳达尔", and "娜姐" are the entity mentions of the recognized named entities. This embodiment uses the linking of the entity mention "李娜" as an example to describe the named entity linking process.
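The tag-based extraction described above is simple enough to show directly. The sketch below assumes the segmenter's output is available as whitespace-separated `word/tag` tokens, as in the set W of this embodiment; the function name and the choice to drop repeated mentions are illustrative. Tags are matched exactly, so a token tagged `nrf`, for instance, is not selected.

```python
ENTITY_TAGS = {"nr", "ns", "nt", "nz"}  # tags named in the embodiment


def extract_entities(tagged_tokens):
    """Return words whose tag is exactly one of ENTITY_TAGS, in order, without repeats."""
    entities = []
    for token in tagged_tokens:
        # Split on the last "/" so that a "/" inside the word itself is preserved.
        word, sep, tag = token.rpartition("/")
        if sep and tag in ENTITY_TAGS and word not in entities:
            entities.append(word)
    return entities


tagged = "李娜/nr 退役/vi 仪式/n 纳达尔/nr 献花/vi ，/wd 娜姐/nr 泪/n 奔/v 享/vg 全场/n 欢呼/v".split()
print(extract_entities(tagged))  # ['李娜', '纳达尔', '娜姐']
```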

(2) As shown in S102 of FIG. 1, a candidate entity list is obtained for each named entity by means of the entity mention. Each entity entry in the knowledge base is taken in turn and its name attribute value and alias attribute values are traversed; the string similarity between the entity mention in the query text and the name and alias attribute values of each entity entry is calculated. If the similarity between the entity mention and the name attribute value of an entity entry is greater than a preset threshold, the current entity entry is output to the candidate entity list; if the similarity between the entity mention and an alias attribute value of an entity entry is greater than the preset threshold, the entity entry to which that alias attribute belongs is output to the candidate entity list. This yields the initial candidate entity list. The knowledge base used in this embodiment is the Chinese knowledge base released by Tsinghua University (URL: http://keg.cs.tsinghua.edu.cn/project/ChineseKB), which contains 800,000 different entities and their attributes. The name attribute value is held in the rdf:about attribute of the <rdf:Description rdf:about=""></rdf:Description> tag, and the alias attribute is held in the <ont:别名></ont:别名> ("alias") tag. The preset threshold in this embodiment is 0.7. For example, the initial candidate entity entries that satisfy the threshold when searching for the entity mention "李娜" are shown in Table 1:

Table 1. Initial candidate entity list for the entity mention "李娜"

In Table 1, the entity name is the name attribute value extracted from the entity entry in the knowledge base, held in the rdf:about attribute of the <rdf:Description rdf:about=""></rdf:Description> tag. The entity context is the context attribute value extracted from the entity entry, held in the <ont:ABSTRACT></ont:ABSTRACT> tag. The other attributes are the remaining attribute values of the entity entry, held in the tags other than <rdf:Description rdf:about=""></rdf:Description> and <ont:ABSTRACT></ont:ABSTRACT>.

The initial candidate entity list is then filtered to obtain the candidate entity list. Some of the entities in the knowledge base are extracted as a training set, and an "entity-type" classifier is trained on the set of labeled <entity, type> pairs. In this embodiment the training set is built by randomly drawing 200 entity entries from the set of entity entries of each type, and the classifier model is an SVM (Support Vector Machine). The "entity-type" classifier is then used to identify the type of each entity in the candidate entity list; entities of the same type as the entity to be linked in the query text are kept and entities of a different type are deleted, giving the final candidate entity list. The entity type is the category attribute of the entity, including person, place, organization, and time. The filtered candidate entity list is shown in Table 2:

Table 2. Candidate entity list for the entity mention "Li Na" (李娜)
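The type-based filtering step can be sketched as below. The classifier is passed in as a plain function so that the sketch stays self-contained. The dictionary lookup used here is only a stand-in for the trained SVM, and the place-typed entry is invented purely to show a candidate being removed.

```python
from typing import Callable, Iterable, List


def filter_by_type(candidates: Iterable[str], mention_type: str,
                   predict_type: Callable[[str], str]) -> List[str]:
    """Keep the candidates whose predicted type equals the mention's type."""
    return [c for c in candidates if predict_type(c) == mention_type]


# Toy stand-in for the trained "entity-type" classifier (assumption).
# The last entry is hypothetical and exists only for illustration.
TOY_TYPES = {
    "李娜(网球运动员)": "person",
    "李娜(歌手)": "person",
    "李娜(hypothetical place entry)": "place",
}

kept = filter_by_type(TOY_TYPES, "person", lambda name: TOY_TYPES[name])
print(kept)
```

In a fuller implementation, `predict_type` would wrap the prediction call of whatever SVM library is used.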

(3) As shown in S103 of FIG. 1, a search engine is used to expand the context information of the entity mention. Take the entity mention "Li Na" (李娜) as an example, whose query text is "李娜退役仪式纳达尔献花，娜姐泪奔享全场欢呼。" ("At Li Na's retirement ceremony Nadal presented flowers, and Sister Na burst into tears amid the cheers of the whole audience."). The entity mentions of the group of named entities identified in the query text, "Li Na, Nadal, Sister Na" (李娜、纳达尔、娜姐), are submitted to the search engine as a seed, which returns search result pages related to the entity mention "Li Na". The titles and snippets of the first m search results are extracted from the result pages, giving the expanded text of the entity mention "Li Na" shown in Table 3. The query text and the expanded text together constitute the expanded context information of the entity mention "Li Na". In this embodiment, the search engine is Bing, the titles and snippets of the search results are extracted with CSS selectors, and m is 8.

Table 3. Expanded text of the entity mention "Li Na" (李娜)
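The context expansion of step (3) can be sketched as follows. The search engine is abstracted as a function argument, and `fake_search` is an offline placeholder that returns made-up results. Joining the seed entities with spaces and representing each result as a dict with "title" and "snippet" keys are assumptions of this sketch, not details given in the source.

```python
from typing import Callable, Dict, List

# A search function maps a query string to a ranked list of results.
# Each result is assumed to be a dict with "title" and "snippet" keys.
SearchFn = Callable[[str], List[Dict[str, str]]]


def expand_context(base_text: str, seeds: List[str],
                   search: SearchFn, top_k: int) -> str:
    """Append the titles and snippets of the top_k results to base_text."""
    query = " ".join(seeds)              # the whole entity group is one seed
    results = search(query)[:top_k]      # keep only the first top_k results
    extra = [f"{r['title']} {r['snippet']}" for r in results]
    return "\n".join([base_text] + extra)


def fake_search(query: str) -> List[Dict[str, str]]:
    """Offline stand-in for a real search engine (placeholder results)."""
    return [{"title": f"title {i} for {query}", "snippet": f"snippet {i}"}
            for i in range(1, 21)]


query_text = "李娜退役仪式纳达尔献花，娜姐泪奔享全场欢呼。"
mention_context = expand_context(
    query_text, ["李娜", "纳达尔", "娜姐"], fake_search, top_k=8)
print(len(mention_context.split("\n")))  # query text plus 8 result lines
```

The same function would serve step (4) by passing the candidate entity's knowledge-base context as `base_text` and the corresponding value of n as `top_k`.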

(4) As shown in S104 of FIG. 1, a search engine is used to expand the context information of each candidate entity in the candidate entity list. First, the context text of each candidate entity in the knowledge base is segmented into words; in this embodiment, the context information of a candidate entity is contained in the <ont:ABSTRACT></ont:ABSTRACT> tag. Taking the candidate entity "Li Na (tennis player)" (李娜(网球运动员)) as an example, segmenting its context gives W^* = {李娜/nr (/wkz 1982年/t 2月/t 26日/t —/wp )/wky ，/wd 中国/nr 女子/n 网球/n 选手/n 。/wj 2011年/t 6月/t 4日/t ，/wd 在/p 法国/nr 的/ude1 巴黎/nr 西部/f 蒙/vi 特/d 高/a 地/n 的/ude1 罗兰/nrf·加/b 洛/b 斯/b 体育场/n 内/f ，/wd 李娜/nr 获得/v 法网/n 女单/n 冠军/n 。/wj 成为/v 有史以来/dl 第一/m 个/q 获得/v 大/a 满贯/n 网球赛/n 事/n 冠军/n 的/ude1 亚洲/nr 人/n 。/wj}. Named entity recognition is then performed on the segmented text to obtain a group of named entities related to the candidate entity. The named entity recognition method used in this embodiment takes the words and part-of-speech tags produced by word segmentation and extracts the words tagged nr, ns, nt, or nz to form the named entity set. The recognition result is E^* = {李娜, 中国, 法国, 巴黎, 亚洲} (Li Na, China, France, Paris, Asia). Next, this group of named entities is submitted to the search engine as a single seed, which returns search result pages related to the candidate entity "Li Na (tennis player)". The titles and snippets of the first n search results are extracted from the result pages, giving the expanded text of the candidate entity "Li Na (tennis player)" shown in Table 4. The context text of the candidate entity in the knowledge base and the expanded text together constitute the expanded context information of the candidate entity "Li Na (tennis player)". In this embodiment, the search engine is Bing, the titles and snippets of the search results are extracted with CSS selectors, and n is 10.

Table 4. Expanded text of the candidate entity "Li Na (tennis player)" (李娜(网球运动员))
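The tag-based named entity recognition in step (4) can be sketched as below. The token list is an abridged copy of the segmentation result W^*, with the tags as printed in the source. Matching the tags exactly and removing duplicates in order of first appearance are assumptions of this sketch; both are consistent with 罗兰/nrf being absent from E^* and with 李娜 appearing in it only once.

```python
from typing import Iterable, List, Tuple

# Part-of-speech tags treated as named entities in the embodiment.
NE_TAGS = {"nr", "ns", "nt", "nz"}


def extract_named_entities(tagged: Iterable[Tuple[str, str]]) -> List[str]:
    """Collect words whose tag is in NE_TAGS, keeping first occurrences."""
    entities: List[str] = []
    for word, tag in tagged:
        if tag in NE_TAGS and word not in entities:
            entities.append(word)
    return entities


# Abridged (word, tag) pairs taken from the segmentation result W^*.
tokens = [
    ("李娜", "nr"), ("1982年", "t"), ("中国", "nr"), ("女子", "n"),
    ("网球", "n"), ("选手", "n"), ("法国", "nr"), ("巴黎", "nr"),
    ("罗兰", "nrf"), ("体育场", "n"), ("李娜", "nr"), ("法网", "n"),
    ("亚洲", "nr"), ("人", "n"),
]

print(extract_named_entities(tokens))
# ['李娜', '中国', '法国', '巴黎', '亚洲']
```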

(5) As shown in S105 of FIG. 1, the matching degree between the entity mention and each candidate entity is calculated. For each candidate entity in the candidate entity list, the cosine similarity a between the expanded context information of the entity mention and the expanded context information of the candidate entity is calculated, the popularity b of the candidate entity is calculated, and the matching degree between the entity mention and the candidate entity is calculated as c = w_a × a + w_b × b, where w_a and w_b are preset weights with w_a + w_b = 1. The popularity of a candidate entity is the ratio of the number of times the entity mention, used as a hyperlink, links to the page of that candidate entity to the total number of times the entity mention, used as a hyperlink, links to any of the candidate entities. In this embodiment, w_a = 0.6 and w_b = 0.4. Taking the entity mention "Li Na" (李娜) from step (3) and the candidate entity "Li Na (tennis player)" (李娜(网球运动员)) as an example, the cosine similarity between the expanded context information of the entity mention and that of the candidate entity is a = 0.53 and the popularity of the candidate entity is b = 0.88, so the matching degree is c = 0.6 × 0.53 + 0.4 × 0.88 = 0.67. Table 5 lists the matching degree between the entity mention "Li Na" and each candidate entity.

Table 5. Matching degree between the entity mention "Li Na" (李娜) and each candidate entity

| Candidate entity | Cosine similarity | Popularity | Matching degree |
| --- | --- | --- | --- |
| Li Na (tennis player) | 0.53 | 0.88 | 0.67 |
| Li Na (singer) | 0.43 | 0.056 | 0.28 |
| Li Na (Peking University professor) | 0.39 | 0 | 0.23 |
| Lee Na Young | 0.19 | 0.31 | 0.24 |
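The three quantities in step (5) can be sketched as follows. The source does not say how the expanded contexts are turned into vectors, so plain term-frequency vectors over token lists are assumed here. The function names are illustrative, and the default weights are the values of this embodiment.

```python
import math
from collections import Counter
from typing import Dict, List


def cosine_similarity(tokens_a: List[str], tokens_b: List[str]) -> float:
    """Cosine similarity of two term-frequency vectors (assumed weighting)."""
    va, vb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def popularity(link_counts: Dict[str, int], candidate: str) -> float:
    """Share of the mention's hyperlinks that point to this candidate."""
    total = sum(link_counts.values())
    return link_counts.get(candidate, 0) / total if total else 0.0


def matching_degree(a: float, b: float,
                    w_a: float = 0.6, w_b: float = 0.4) -> float:
    """c = w_a * a + w_b * b, with w_a + w_b = 1."""
    return w_a * a + w_b * b


# Worked example from the text: a = 0.53, b = 0.88.
print(round(matching_degree(0.53, 0.88), 2))  # 0.67
```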

(6) As shown in S106 of FIG. 1, the candidate entities are ranked by matching degree, and the candidate entity with the highest matching degree is selected as the target entity for linking. In this embodiment, Table 5 is sorted in descending order of the matching degree between the entity mention and the candidate entities to give Table 6. The candidate entity in the first row of Table 6 is the one with the highest matching degree, so the target entity linked to the entity mention "Li Na" (李娜) is "Li Na (tennis player)" (李娜(网球运动员)).

Table 6. Candidate entities for the entity mention "Li Na" (李娜), sorted by matching degree in descending order

| Candidate entity | Cosine similarity | Popularity | Matching degree |
| --- | --- | --- | --- |
| Li Na (tennis player) | 0.53 | 0.88 | 0.67 |
| Li Na (singer) | 0.43 | 0.056 | 0.28 |
| Lee Na Young | 0.19 | 0.31 | 0.24 |
| Li Na (Peking University professor) | 0.39 | 0 | 0.23 |
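The ranking and selection in step (6) can be reproduced from the similarity and popularity values listed for the four candidates. The sketch below recomputes the matching degree with the embodiment's weights and sorts in descending order. The data layout and the function name are illustrative choices.

```python
from typing import List, Tuple

# (candidate entity, cosine similarity a, popularity b), as listed above.
ROWS: List[Tuple[str, float, float]] = [
    ("李娜(网球运动员)", 0.53, 0.88),
    ("李娜(歌手)", 0.43, 0.056),
    ("李娜(北大教授)", 0.39, 0.0),
    ("李娜英", 0.19, 0.31),
]


def rank_candidates(rows: List[Tuple[str, float, float]],
                    w_a: float = 0.6,
                    w_b: float = 0.4) -> List[Tuple[str, float]]:
    """Score each candidate with c = w_a*a + w_b*b and sort descending."""
    scored = [(name, w_a * a + w_b * b) for name, a, b in rows]
    return sorted(scored, key=lambda item: item[1], reverse=True)


ranking = rank_candidates(ROWS)
target_entity = ranking[0][0]  # candidate with the highest matching degree
print(target_entity)           # 李娜(网球运动员)
```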

Claims

1. A named entity linking method based on a search engine, characterized by comprising the following steps:

Step 1: segmenting the input query text into words and identifying a group of named entities in the text that need to be linked to the knowledge base;

Step 2: for each named entity, obtaining a candidate entity list from its entity mention;

Step 3: using a search engine to expand the context information of the entity mention;

Step 4: using a search engine to expand the context information of each candidate entity in the candidate entity list;

Step 5: calculating the matching degree between the entity mention and each candidate entity;

Step 6: ranking the candidate entities by their matching degree and selecting the candidate entity with the highest matching degree as the target entity for linking.

2. The named entity linking method based on a search engine according to claim 1, characterized in that, in step 1:

a named entity is a person name, organization name, place name, or any other entity identified by a name, an entity being an object or concept in the real world;

the query text is a text or set of texts input by the user, comprising the entity mention and its context text, the text being of regular length;

the knowledge base is a set of entity entries, covering persons, organizations, place names, and all other entities identified by a name;

an entity entry is the specific information stored for each entity in the knowledge base, comprising the context text of the entity and the attributes and attribute values of the entity; the context information of an entity is text that describes the entity in detail; and the entity mention is the nominal reference in the query text to the entity to be linked.

3. The named entity linking method based on a search engine according to claim 1, characterized in that, in step 2, the method for obtaining candidate entities from the entity mention is:

taking each entity entry in the knowledge base in turn, traversing its name attribute value and alias attribute values, and calculating the string similarity between the entity mention in the query text and the name attribute value and alias attribute values of each entity entry; if the similarity between the entity mention of the query text and the name attribute value of an entity entry exceeds a preset threshold, outputting the current entity entry to the candidate entity list; if the similarity between the entity mention of the query text and an alias attribute value of an entity entry exceeds the preset threshold, outputting the entity entry corresponding to that alias attribute to the candidate entity list;

the above method yields an initial candidate entity list, which is filtered to obtain the candidate entity list as follows:

extracting a subset of the entities in the knowledge base as a training set and training an entity-type classifier on the labeled set of <entity, type> pairs; then using the entity-type classifier to identify the type of each entity in the candidate entity list, keeping the entities of the same type as the entity to be linked in the query text and deleting entities of a different type, to obtain the final candidate entity list;

the type of an entity is the category attribute of the entity, including person, place, organization, and time; an <entity, type> pair is a mapping between an entity and its type, where a correctly mapped <entity, type> pair is a positive example for training the classifier and any other pair is a negative example; and the entity-type classifier is a multi-class classification model trained on the set of <entity, type> pairs, by which the type of an entity can be determined.

4. The named entity linking method based on a search engine according to claim 1, characterized in that, in step 3, the method for using a search engine to expand the context information of the entity mention is:

submitting the entity mentions of the group of named entities identified in the query text in step 1 to the search engine as a seed to obtain search result pages related to the entity mention; extracting the titles and snippets of the first m search results from the search result pages to obtain the expanded text of the entity mention; the query text and the expanded text together constituting the expanded context information of the entity mention.

5. The named entity linking method based on a search engine according to claim 1, characterized in that, in step 4, the method for using a search engine to expand the context information of each candidate entity in the candidate entity list is:

segmenting into words the context text in the knowledge base of each candidate entity in the candidate entity list; performing named entity recognition on the segmented text to obtain a group of named entities related to the candidate entity; submitting this group of named entities to the search engine as a single seed to obtain search result pages related to the candidate entity; extracting the titles and snippets of the first n search results from the search result pages to obtain the expanded text of the candidate entity; the context text of the candidate entity in the knowledge base and the expanded text together constituting the expanded context information of the candidate entity.

6. The named entity linking method based on a search engine according to claim 1, characterized in that, in step 5, the method for calculating the matching degree between the entity mention and each candidate entity is:

for each candidate entity in the candidate entity list, calculating the cosine similarity a between the expanded context information of the entity mention and the expanded context information of the candidate entity, calculating the popularity b of the candidate entity, and calculating the matching degree between the entity mention and the candidate entity as c = w_a × a + w_b × b, where w_a and w_b are preset weights and w_a + w_b = 1; the popularity of the candidate entity being the ratio of the number of times the entity mention, used as a hyperlink, links to the page of the candidate entity to the total number of times the entity mention, used as a hyperlink, links to all candidate entities.