CN105912625A

CN105912625A - Linked data oriented entity classification method and system

Info

Publication number: CN105912625A
Application number: CN201610213411.8A
Authority: CN
Inventors: 葛涛; 穗志方
Original assignee: Peking University
Current assignee: Peking University
Priority date: 2016-04-07
Filing date: 2016-04-07
Publication date: 2016-08-31
Anticipated expiration: 2036-04-07
Also published as: CN105912625B

Abstract

The invention discloses a linked-data-oriented entity classification method and system. To address entity classification in linked data, the method comprises preprocessing, statistical classification, and post-processing. In preprocessing, the text description in each entity page is segmented into words, and the infobox attribute names together with the segmented words form the features of the entity page. In statistical classification, statistical classification models are trained at several segmentation granularities and used to classify the entity pages, giving a preliminary prediction of the entity category. In post-processing, the statistical classification result is corrected by model fusion, linguistic knowledge, link information, and category-associated attribute information applied to the fused entity category. The technical solution is easy to implement and debug, efficient, and accurate; it is suited to knowledge management of linked data and classifies entities with high precision.

Description

A Linked Data Oriented Entity Classification Method and System

Technical Field

The invention belongs to the field of information processing and relates to the classification and search of linked data, and in particular to a method and system for high-precision classification of entity pages in linked data.

Background Art

We are now in the era of big data, and how to make the most of data to help computers process information has become the most active research topic in information processing. In recent years, with the arrival of Web 2.0, linked data (for example the Semantic Web and knowledge graphs) has attracted wide attention for its strong ability to describe relations. Linked data refers to data organized in the manner of Baidu Baike or Wikipedia: each page corresponds to one entity, and entities link to one another, hence the name linked data. As the data keeps growing, managing linked data by hand is no longer realistic, and efficient methods and systems for knowledge management of linked data are urgently needed.

Entity classification is an important technical problem in the knowledge management of linked data. Classifying the entities makes it possible to organize the large number of entity pages effectively and thereby improves the user's search and reading experience.

At present, the common approach to entity classification is to classify the entity's descriptive text. In many cases, however, this simple approach cannot determine the entity category accurately. Its main shortcomings are:

(1) Although a person can easily judge an entity's category from its text description, current feature-based statistical classification methods cannot realistically do so with high precision. For example, the texts "X is an animation adapted from a famous game" and "A is a game based on a famous animation" have very similar lexical representations, yet the former describes an animation entity and the latter a game entity, which are entirely different entity types. A statistical classifier that relies on text features alone therefore lacks the precision to determine entity categories accurately.

(2) Many entity pages do not contain enough text description. In such cases, classifying entities by text description alone inevitably leads to errors, because the entity category cannot be obtained from the text.

Summary of the Invention

To overcome the shortcomings of the prior art described above, the invention provides a linked-data-oriented entity classification method and system that achieves high-precision entity classification through a statistical classification process and a post-processing process. The statistical classification process classifies by modeling the text information. The post-processing process uses rich resources (for example affix information and link data) to correct the result of the statistical classification, by means including model fusion, linguistic knowledge, link information, and correction of the fused entity category with category-associated attribute information.

An entity page in linked data usually contains a text description and an infobox. The invention segments the text description and then extracts the infobox attribute names, together with the words obtained by segmentation, as the feature representation of the entity page. The entity page is then classified with maximum entropy models at several segmentation granularities, giving a preliminary prediction of the entity category. The predicted category is then post-processed to verify whether the classification result is reliable. Post-processing specifically includes: fusing the results of the classifiers trained on features of different segmentation granularities; correcting obvious prediction errors with the category-associated attribute information in a category attribute database; and understanding the first sentence of the text description in depth, analyzing the sentence structure by syntactic parsing and similar methods to obtain entity category information that corrects the earlier prediction. Preferably, a confusion matrix can also be used to identify categories that are hard to classify correctly, and predictions of those categories are verified further, by correcting the entity category with the categories of the adjacent pages that the entity page links to and with the affix information of the entity page.

The technical solution provided by the invention is as follows.

A linked-data-oriented entity classification method, in which the linked data consists of a plurality of entity pages and each entity page contains a text description and an infobox. The method comprises a preprocessing stage, a statistical classification stage, and a post-processing stage, and specifically includes the following steps:

1) In the preprocessing stage, segment the text description in the entity page into words; the infobox attribute names and the resulting words form the features of the entity page.

2) In the statistical classification stage, use the features of the entity page to train statistical classification models at several segmentation granularities and classify the entity page, obtaining a preliminary prediction of the entity category.

3) In the post-processing stage, correct the preliminary prediction of the entity category to obtain the corrected entity category. The correction includes the following steps:

31) Using multi-granularity model fusion, fuse the preliminary predictions produced by the statistical classification models trained at different segmentation granularities, obtaining the fused entity category.

32) Build a category attribute database and use the category-associated attribute information in it to correct the fused entity category, obtaining the attribute-corrected entity category.

33) Analyze the sentence structure by syntactic parsing and, through in-depth understanding of the first sentence of the text description, correct the attribute-corrected entity category obtained in step 32), obtaining the entity category corrected by first-sentence understanding.

Further, in the above entity classification method, the word segmentation methods of step 1) include forward maximum matching, backward maximum matching, and methods based on statistical sequence labeling.
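The two dictionary-based methods named here can be sketched as follows. This is a minimal illustration and not the segmenter used by the invention: the dictionary `VOCAB`, the `max_len` limit, and the fallback to single characters are assumptions made for the example.

```python
def forward_max_match(text, vocab, max_len=4):
    """Greedy left-to-right segmentation: take the longest dictionary word."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first; a single character always matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


def backward_max_match(text, vocab, max_len=4):
    """Same idea, scanning from the end of the text toward the start."""
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                j -= size
                break
    tokens.reverse()
    return tokens


# Hypothetical toy dictionary.
VOCAB = {"纽约", "时代", "广场"}
print(forward_max_match("纽约时代广场", VOCAB))   # ['纽约', '时代', '广场']
print(backward_max_match("纽约时代广场", VOCAB))  # ['纽约', '时代', '广场']
```

Adding "时代广场" to the dictionary makes both functions return the coarser split `['纽约', '时代广场']`, which shows how the dictionary alone already changes the segmentation granularity.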

Further, in the above entity classification method, step 2) uses two segmentation granularities: segmentation with named entity recognition and segmentation without named entity recognition.

Further, in the above entity classification method, the statistical classification model is a maximum entropy model. The multi-granularity model fusion of step 31) computes, by Formula 1, a probability distribution that combines the predictions of the classifiers trained at different segmentation granularities, thereby fusing the entity category results that the maximum entropy models trained at the several granularities produce for an entity page:
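The formula itself does not appear in this text. Since λ is described as the weight of a linear interpolation between the two classifiers, one reading consistent with the symbol definitions is the following. Which of the two terms carries λ is an assumption; with the embodiment's value λ = 0.5 the two choices coincide.

```latex
P_{\mathrm{multi}}(y \mid x) = \lambda \, P_{w}(y \mid x) + (1 - \lambda) \, P_{n}(y \mid x), \qquad 0 \le \lambda \le 1
```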

In Formula 1, P_multi(y|x) is the probability distribution that fuses the predictions of the classifiers at different segmentation granularities; P_w(y|x) is the probability distribution predicted for sample x by the maximum entropy model that uses only word segmentation as features; y is the sample category and x is the sample; P_n(y|x) is the probability distribution predicted by the maximum entropy model that adds named entity annotation to the word segmentation as features; λ is the parameter that adjusts the linear interpolation weight.

Further, in the above entity classification method, analyzing the sentence structure by syntactic parsing in step 33) to obtain the entity category corrected by first-sentence understanding specifically includes the following steps:

331) Perform dependency parsing on the first sentence of the entity description and determine whether the object of the first sentence is a judgment-sentence object.

332) Train Chinese word vectors on a large unlabeled corpus, define a lexical semantic similarity, compute the similarity between the judgment-sentence object and the word vector of each category, and find the category whose word vector is most similar.

333) Using cosine similarity, set a cosine similarity threshold; when the cosine similarity between the judgment-sentence object and the word vector of its most similar category exceeds the threshold, correct the entity's category to that most similar category.
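Steps 332) and 333) can be sketched as below, assuming the word vectors are already trained and available as plain lists. The vectors, the category names, and the default threshold of 0.6 are hypothetical; the text does not state a threshold value.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def correct_by_first_sentence(predicted, object_vec, category_vecs, threshold=0.6):
    """Replace `predicted` with the category most similar to the
    judgment-sentence object, but only if the similarity exceeds `threshold`."""
    best_cat, best_sim = max(
        ((cat, cosine(object_vec, vec)) for cat, vec in category_vecs.items()),
        key=lambda pair: pair[1],
    )
    return best_cat if best_sim > threshold else predicted


# Hypothetical 2-dimensional vectors standing in for trained word vectors.
CATEGORY_VECS = {"game": [1.0, 0.0], "animation": [0.0, 1.0]}
print(correct_by_first_sentence("animation", [0.9, 0.1], CATEGORY_VECS))
```

When no category is similar enough, the classifier's own prediction is kept, which matches the threshold condition in step 333).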

Further, in the above entity classification method, after the post-processing stage has corrected the preliminary prediction and produced the corrected entity category, a confusion matrix is used to identify difficult entity categories, and for the difficult categories identified, the entity category result is verified by link analysis and affix analysis. The confusion matrix identification works as follows: on the validation set, when the prediction precision of the statistical classification model for an entity category y_i does not reach 90%, category y_i is treated as a difficult entity category.
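A minimal sketch of this identification step follows. It reads "prediction precision for category y_i" as the fraction of validation samples predicted as y_i that truly belong to y_i; that reading, the function name, and the toy labels are assumptions.

```python
from collections import Counter


def difficult_categories(gold, predicted, min_precision=0.9):
    """Return the categories whose per-category precision on the validation
    set falls below `min_precision` (90% in the text)."""
    confusion = Counter(zip(gold, predicted))   # (gold, predicted) -> count
    predicted_totals = Counter(predicted)       # column sums of the matrix
    hard = set()
    for category, total in predicted_totals.items():
        precision = confusion[(category, category)] / total
        if precision < min_precision:
            hard.add(category)
    return hard


# Hypothetical validation labels.
gold = ["game", "game", "animation", "animation", "city"]
pred = ["game", "animation", "animation", "animation", "city"]
print(difficult_categories(gold, pred))
```

Categories that the classifier never predicts do not appear in the result, because their precision is undefined under this reading.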

Further, the link analysis method is as follows. Let y' be the category that the classifier predicts for entity page e, and let N(e) be the set of entity pages that e links to. Among the pages in N(e) that carry a category label, find the category that labels the most pages and denote it y*. When y* differs from the prediction y', use y* to correct y', so that the category of entity page e becomes y*.
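The majority vote over linked pages can be sketched as follows. The page identifiers and the `labels` dictionary are hypothetical, and the text does not say how ties are broken; here the category counted first wins.

```python
from collections import Counter


def correct_by_links(predicted, linked_pages, labels):
    """Return y*, the most common category among the labeled pages in N(e),
    or the original prediction if no linked page is labeled."""
    votes = Counter(labels[page] for page in linked_pages if page in labels)
    if not votes:
        return predicted
    majority, _ = votes.most_common(1)[0]
    return majority


# Hypothetical labeled pages; page "d" has no category label.
LABELS = {"a": "game", "b": "game", "c": "animation"}
print(correct_by_links("animation", ["a", "b", "c", "d"], LABELS))
```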

Further, in the above entity classification method, the affix analysis method is as follows. For entity categories whose entity names end in fixed Chinese characters, affix information associated with the entity type is learned from large-scale unlabeled data: the affixes of the words most similar to the category are counted by frequency to obtain the affixes associated with each difficult entity category, and the entity's category is then obtained by analyzing its affix.
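One way to picture the frequency statistics is the sketch below. It assumes the list of most similar words has already been retrieved (for example from word vectors), treats the final character as the affix, and uses a hypothetical "hospital" category; none of these details are given in the text.

```python
from collections import Counter


def learn_suffixes(similar_words, top_k=1):
    """Count the final character of each similar word and keep the top_k."""
    counts = Counter(word[-1] for word in similar_words if word)
    return [suffix for suffix, _ in counts.most_common(top_k)]


def category_from_suffix(entity_name, suffixes_by_category):
    """Return the first category whose learned suffix ends the entity name."""
    for category, suffixes in suffixes_by_category.items():
        if any(entity_name.endswith(suffix) for suffix in suffixes):
            return category
    return None


# Hypothetical words similar to a "hospital" category.
similar = ["协和医院", "人民医院", "中心医院", "卫生院"]
suffixes = {"hospital": learn_suffixes(similar)}
print(category_from_suffix("北京儿童医院", suffixes))
```

A returned `None` means the name carries no learned affix, in which case the earlier prediction would be left unchanged.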

The invention also provides a linked-data-oriented entity classification system that implements the above method, comprising a preprocessing module, a statistical classification module, and a post-processing module. The preprocessing module segments the text description in the entity page and extracts the infobox attribute names and the segmented words as the feature representation of the entity page. The statistical classification module trains classification models with the maximum entropy classification algorithm and identifies the entity category from the entity's description in the entity page. The post-processing module corrects the entity category produced by the statistical classification module using multi-granularity model fusion, category-associated attributes, and in-depth understanding of the first sentence, obtaining the corrected entity category.

In the above system, the word segmentation tool is the Stanford CoreNLP toolkit, and the classification models use the maximum entropy classifier package Maxent.

Compared with the prior art, the beneficial effects of the invention are as follows.

The invention provides a linked-data-oriented entity classification method and system that achieves high-precision entity classification through a statistical classification process and a post-processing process. On top of the basic text classification, the result of classifying the entity description text is corrected by the following means:

(1) Fusing models trained at multiple word segmentation granularities, to overcome the weakness of a single segmentation granularity in text feature extraction;

(2) Correcting the fused entity category with category-associated attribute information, to fix obvious errors;

(3) Understanding the first sentence in depth, to reduce the effect of noise in the text;

(4) Identifying difficult samples and verifying their results with methods such as link analysis and affix analysis.

Existing entity classification methods perform no further processing after classification, so a wrong classification cannot be corrected. The invention instead uses a post-processing flow to correct the cases in which the text-based statistical classification module may be wrong. The proposed technical solution is easy to implement and debug, efficient, and accurate; it is well suited to enterprises performing knowledge management on linked data and classifies entities with high precision. In the JIST2015 entity classification evaluation, the solution of the invention reached an accuracy of 98.6%, the highest of any classification scheme in that evaluation.

Brief Description of the Drawings

Fig. 1 is a flow chart of the linked-data-oriented entity classification method provided by the invention.

Fig. 2 is a structural block diagram of the linked-data-oriented entity classification system provided by an embodiment of the invention.

Fig. 3 is a flow chart of the first-sentence understanding step in the method provided by the invention.

Detailed Description

The invention is further described below through embodiments with reference to the drawings, without limiting the scope of the invention in any way.

The invention provides a linked-data-oriented entity classification method and system that achieves high-precision entity classification through a statistical classification process and a post-processing process. The statistical classification process classifies by modeling the text information; the post-processing process uses rich resources (for example affix information and link data) to correct the result of the statistical classification. Fig. 1 is a flow chart of the entity classification method for linked data provided by the invention. As shown in Fig. 1, the method comprises a preprocessing process, a statistical classification process, and a post-processing process. Entity pages are first segmented and their features extracted, and the extracted features are used to train statistical classification models. For the classification results, we first apply multi-granularity model fusion to correct single-model prediction errors, then use category-associated attribute information to correct obvious mispredictions in the fused entity category, and then analyze the first sentence of the entity page's description in depth to determine its category. For samples of categories that are hard to classify correctly, the category can be corrected again by link analysis and affix analysis. The specific steps are:

1) Preprocess the entity page to obtain its features. Preprocessing includes Chinese word segmentation (typical methods are forward maximum matching, backward maximum matching, and methods based on statistical sequence labeling) and feature extraction (word features and infobox attribute name features are extracted to represent the page).

2) Using the entity page features extracted in step 1), classify the entity page with maximum entropy models at several segmentation granularities, obtaining a preliminary prediction of the entity category.

In this embodiment, two classifiers are trained with the maximum entropy model. The feature representation of one classifier consists of the words produced by segmentation with named entity recognition plus the infobox attributes; the other uses the words produced by segmentation without named entity recognition plus the infobox attributes.

3) Post-process the entity category obtained in step 2) to verify whether the classification result is reliable. This specifically includes the following steps:

31) Fuse the classification results of the classifiers trained on features of different segmentation granularities.

In this embodiment, two segmentation granularities are used: segmentation with named entity recognition and segmentation without named entity recognition.

32) Build a category attribute database in advance and use the category-associated attribute information in it to correct obvious prediction errors.

33) Understand the first sentence of the text description in depth with a syntactic parser, analyzing the sentence structure to obtain entity category information that corrects the earlier prediction.

34) Use a confusion matrix to identify categories that are hard to classify correctly, and verify the predictions of those categories further, including:

341) Correcting the entity category with the categories of the adjacent pages that the entity page links to;

342) Correcting the entity category with the affix information of the entity page.

Fig. 2 is a structural block diagram of the linked-data-oriented entity classification system provided by an embodiment of the invention. The system comprises a preprocessing module, a statistical classification module, and a post-processing module, each described further below.

Preprocessing Module

An entity page in linked data usually contains a text description and an infobox.

In the preprocessing module, we use the Stanford CoreNLP toolkit to segment the text description in the entity page. In this embodiment we adopt two segmentation granularities: with named entity recognition and without named entity recognition. For example, under segmentation with named entity recognition, "纽约时代广场" (Times Square, New York) is treated as a single word, whereas under segmentation without named entity recognition it is split into three words: "纽约" (New York), "时代" (Times), and "广场" (Square).

After the Chinese text is segmented, we extract the infobox attribute names together with the segmented words as the feature representation of the entity page.
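As a rough illustration of this feature representation, the sketch below combines bag-of-words counts with binary infobox attribute name features. The `w=` and `attr=` prefixes, the function name, and the sample page are assumptions made for the example.

```python
from collections import Counter


def page_features(tokens, infobox):
    """Bag-of-words counts plus one binary feature per infobox attribute name.
    Attribute values are ignored; only the names are used as features."""
    features = Counter(f"w={token}" for token in tokens)
    for attribute_name in infobox:
        features[f"attr={attribute_name}"] = 1
    return dict(features)


# Hypothetical entity page: segmented description and an infobox.
tokens = ["纽约", "广场", "广场"]
infobox = {"位置": "曼哈顿"}
print(page_features(tokens, infobox))
```

The prefixes keep a word and an attribute name with the same spelling from colliding in one feature space.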

Statistical Classification Module

The invention mainly uses the description of the entity in the entity page as the basis for judging the entity's category. It trains the classification models with the maximum entropy classification algorithm, a log-linear model commonly used in natural language processing. As noted for the preprocessing module, the features used by the statistical classification module comprise word features and infobox attribute features. The word features are the classic bag-of-words representation. The infobox attribute features play a very important role in identifying the entity's category; for example, "date of birth" is likely to be associated with entities of the person type.

In the text classification module, we train the text classification models on word segmentations of different granularity, because in some cases a single segmentation granularity does not meet the needs of classification. For example, treating "纽约时代广场" (Times Square, New York) as one named entity helps classification less than splitting it into "纽约" (New York), "时代" (Times), and "广场" (Square), because the word "广场" (Square) has a decisive influence on the category. On the other hand, without named entity recognition a name such as "张一山" (Zhang Yishan) is split into "张", "一", and "山", which also affects the classification result. Therefore, in the statistical classification module, this embodiment uses the maximum entropy classifier package Maxent (available for download at http://homepages.inf.ed.ac.uk/lzhang10/maxent_toolkit.html) to train two classification models: one on the fine-grained segmentation with named entity recognition and one on the plain coarse-grained word segmentation.

Post-processing Module

The invention uses the post-processing module to correct the cases in which the text-based statistical classification module may be wrong. The post-processing module can perform the following processes:

31) Multi-granularity model fusion

Although model fusion is widely used in machine learning, most fusion methods combine different kinds of machine learning models. For natural language (Chinese in particular), the segmentation granularity affects the performance of the whole model. Given the respective strengths and weaknesses of the different granularities, the invention uses model fusion so that the classification models obtained at the various granularities compensate for one another's weaknesses.

We define P_w(y|x) as the probability distribution over category y that the maximum entropy model predicts for sample x using only word segmentation as features, and P_n(y|x) as the probability distribution predicted by the maximum entropy model that adds named entity annotation to the word segmentation as features. We fuse the results of the two classifiers as follows:

In Formula 1, P_multi(y|x) is the probability distribution that fuses the predictions of the classifiers at different segmentation granularities; P_w(y|x) is the probability distribution predicted for sample x by the maximum entropy model that uses only word segmentation as features; y is the sample category and x is the sample; P_n(y|x) is the probability distribution predicted by the maximum entropy model that adds named entity annotation to the word segmentation as features; λ is the parameter that adjusts the linear interpolation weight, and in this embodiment λ = 0.5.
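The fusion can be sketched in a few lines, assuming the linear interpolation takes the form λ·P_w + (1 − λ)·P_n; the text describes λ only as an interpolation weight, so the placement of λ is an assumption. The two input distributions are hypothetical values, not outputs of the Maxent package.

```python
def fuse(p_word, p_ner, lam=0.5):
    """Linearly interpolate two category distributions given as dicts."""
    categories = set(p_word) | set(p_ner)
    return {
        y: lam * p_word.get(y, 0.0) + (1.0 - lam) * p_ner.get(y, 0.0)
        for y in categories
    }


def predict(p_word, p_ner, lam=0.5):
    """Return the category with the highest fused probability."""
    fused = fuse(p_word, p_ner, lam)
    return max(fused, key=fused.get)


# Hypothetical outputs of the word-only and word+NER classifiers.
p_w = {"game": 0.6, "animation": 0.4}
p_n = {"game": 0.3, "animation": 0.7}
print(predict(p_w, p_n))  # animation
```

With λ = 0.5 the fused scores are 0.45 for "game" and 0.55 for "animation", so the more confident classifier decides the category.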

32) Correcting predictions with category-associated attributes

This module uses category-associated attributes to correct category predictions that are obviously wrong. It relies mainly on the category specificity of infobox attributes. As shown in Table 1, certain attributes cannot be associated with certain categories; for example, "game platform" cannot be associated with a city entity. The specificity of these attributes can therefore be used to correct obvious prediction errors of the classifier. For the predefined entity types, the invention builds a category attribute database by hand and uses it to correct the predictions.
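A possible shape for this correction is sketched below. It assumes the database maps each attribute name to the categories it cannot belong to, and that a ruled-out prediction falls back to the next category in the classifier's ranking. The fallback rule, the function name, and the sample entries are assumptions, since the text does not describe the lookup in detail.

```python
def correct_by_attributes(ranked_categories, infobox_attrs, forbidden):
    """Return the highest-ranked category that no infobox attribute rules out.
    `ranked_categories` must be non-empty and sorted best first."""
    for category in ranked_categories:
        ruled_out = any(
            category in forbidden.get(attr, set()) for attr in infobox_attrs
        )
        if not ruled_out:
            return category
    # Every candidate is ruled out: keep the classifier's top choice.
    return ranked_categories[0]


# Hypothetical database entry: "game platform" never occurs on these types.
FORBIDDEN = {"game platform": {"city", "person"}}
print(correct_by_attributes(["city", "game"], ["game platform"], FORBIDDEN))
```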

Table 1 Examples of category-associated attributes

33) Deeply understanding the first sentence of the entity description with a dependency parser to identify the entity category more precisely

The first sentence of the description on an entity page in linked data (e.g., Wikipedia, Baidu Baike) is usually a defining description of the entity, for example "砸六家是一种流行于天津的扑克牌游戏" ("Za Liu Jia is a poker game popular in Tianjin"). A deep understanding of this first sentence is therefore of great help in identifying the entity category precisely.

Fig. 3 is a flow diagram of the first-sentence deep understanding step of the method provided by the present invention. The present invention first uses a dependency parser to find the judgment-sentence object in the first sentence of the entity page's text description, and then uses that object to determine the category of the entity page. This specifically comprises the following steps:

331) Identifying the judgment-sentence object

The present invention uses the Stanford dependency parser to perform dependency parsing on the first sentence of the entity description and to extract its subject, predicate and object. If the object obtained has a direct dependency relation with "是" ("is"), it is called a "judgment-sentence object"; otherwise it is called a "non-judgment-sentence object".

If the object of the first sentence is a judgment-sentence object, it can be used as a clue to determine the category of the entity and thus to verify whether the classifier's prediction is accurate. If the classifier's prediction contradicts the conclusion drawn from the judgment-sentence object, that conclusion is used to correct the classifier's prediction. If the first sentence contains no judgment-sentence object, this step is skipped and processing proceeds to step 34).

For example, for the sentence "砸六家是一种流行于天津的扑克牌游戏" ("Za Liu Jia is a poker game popular in Tianjin"), dependency parsing identifies "游戏" ("game") as the object of the sentence, and "游戏" has a direct dependency relation with "是". "游戏" is therefore the judgment-sentence object of the sentence. If "game" is a predefined entity category in the entity taxonomy, it is used as the category of the entity.
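The identification rule of step 331) can be sketched in Python as follows. The parse is represented as hand-written (head, relation, dependent) triples; this representation, the relation labels, and the names `judgment_object` and `PREDEFINED_CATEGORIES` are illustrative assumptions and do not reproduce the actual output format of the Stanford parser.

```python
COPULA = "是"

def judgment_object(dependencies, object_relations=("attr", "dobj")):
    """Return the object directly attached to the copula, or None.

    dependencies: iterable of (head, relation, dependent) triples.
    The relation labels are assumed for illustration only.
    """
    for head, relation, dependent in dependencies:
        if head == COPULA and relation in object_relations:
            return dependent
    return None

# Hand-written, simplified parse of "砸六家是一种流行于天津的扑克牌游戏".
parse = [
    ("是", "nsubj", "砸六家"),
    ("是", "attr", "游戏"),
    ("游戏", "nn", "扑克牌"),
    ("流行", "prep", "天津"),
]

PREDEFINED_CATEGORIES = {"游戏", "城市", "景点"}  # hypothetical taxonomy

obj = judgment_object(parse)
# Use the object as the category only if it names a predefined category.
category = obj if obj in PREDEFINED_CATEGORIES else None
```

When the object is found but does not name a predefined category, the similarity-based matching of step 332) applies instead.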

332) Correcting category predictions with the judgment-sentence object

Even when a judgment-sentence object has been found, it cannot be used indiscriminately to correct the prediction, because doing so may introduce unnecessary errors. Moreover, in many cases the judgment-sentence object does not exactly match a category name. For example, in "野泽雅子是日本著名声优" ("Masako Nozawa is a famous Japanese voice actor"), dependency parsing finds "声优" ("voice actor") as the judgment-sentence object, but the predefined entity categories may not include a "voice actor" category. The present invention therefore defines a correction condition: it uses lexical semantic similarity, namely the cosine similarity between word vectors trained on a large-scale unlabeled corpus, to find the category most similar to the judgment-sentence object, so that the category can be corrected reliably.

In natural language processing, the cosine similarity between word vectors is commonly used as a measure of the semantic similarity of words. Specifically, this embodiment trains Chinese word vectors with the word2vec toolkit (https://word2vec.googlecode.com/svn/trunk/) on the Chinese Gigaword corpus (a publicly available dataset), and uses the resulting vectors to find the category name that is semantically most similar to the judgment-sentence object. The category of the entity is corrected to the most similar category only if the cosine similarity between the word vector of the judgment-sentence object and that of its most similar category exceeds a preset threshold (0.9 in this embodiment).

To this end, let w₀ denote the judgment-sentence object of the first sentence of the entity page's text description, let y ∈ Y denote a category word (Y being the set of entity categories), and let sim(w₁, w₂) denote the cosine similarity between the word vectors of words w₁ and w₂. The correction condition is then given by Formula 2:

y* = argmax_{y∈Y} sim(w₀, y) ∧ sim(w₀, y*) > 0.9 (Formula 2)

In Formula 2, ∧ denotes logical conjunction ("and"). The left-hand term states that y* is the category with the highest semantic similarity to w₀; the right-hand term requires that the similarity between y* and w₀ exceed 0.9. The correction is performed only when the condition of Formula 2 is satisfied, that is, y* replaces the original category prediction only if y* is the most similar category and its similarity to w₀ is higher than 0.9.
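The correction condition of Formula 2 can be sketched in Python as follows. The three-dimensional vectors are toy values chosen for illustration, not vectors trained on Gigaword, and the names `cosine` and `most_similar_category` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def most_similar_category(w0_vec, category_vecs, threshold=0.9):
    """Return y* if Formula 2 holds, otherwise None (no correction)."""
    best = max(category_vecs, key=lambda y: cosine(w0_vec, category_vecs[y]))
    if cosine(w0_vec, category_vecs[best]) > threshold:
        return best
    return None

# Toy word vectors (illustrative values only).
category_vecs = {
    "actor": [1.0, 0.2, 0.0],
    "city": [0.0, 0.1, 1.0],
}
voice_actor = [0.9, 0.1, 0.0]   # stands in for the vector of "声优"
unrelated = [0.5, 0.5, 0.5]     # a word that is close to no category

corrected = most_similar_category(voice_actor, category_vecs)
no_correction = most_similar_category(unrelated, category_vecs)
```

Returning `None` when the threshold is not met leaves the classifier's prediction unchanged, which mirrors the requirement that the correction be applied only when both terms of Formula 2 hold.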

In the example above ("野泽雅子是日本著名声优"), the category most similar to "声优" ("voice actor") is found to be "演员" ("actor"), and the semantic similarity between the two is above 0.9. The category of the "野泽雅子" (Masako Nozawa) entity page is therefore corrected to "actor". Table 2 lists, for each category, some of the most similar words computed from the word vectors trained on Chinese Gigaword; words in bold have a similarity to the category above 0.9.

Table 2 Words most similar to each category

34) Identifying difficult samples with a confusion matrix

In practice, samples of certain categories are often hard to tell apart; such samples are called difficult samples. For example, the classifier tends to make wrong predictions for entities of the "city" and "attraction" categories, because the descriptions and infobox attributes of these two kinds of entity are very similar. To improve classification precision, the present invention uses a confusion matrix to find the sample categories on which the classifier is prone to error. Specifically, if the prediction precision of the statistical classification model for an entity category y_i is below 90% on the validation set, category y_i is regarded as a difficult sample category. For example, if on the validation set the model predicts 18 entity pages as "city" but only 15 of them really belong to the "city" category, its prediction precision for "city" is only 83.33% (15/18), and "city" is identified as a difficult sample category. Samples that the statistical classification model assigns to a difficult sample category are called difficult samples.
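The per-category precision check can be sketched in Python as follows. The confusion matrix layout (gold category, then predicted category) and the function names are assumptions; the "city" counts reproduce the 15/18 example above, while the "attraction" counts are invented for illustration.

```python
def class_precision(confusion):
    """confusion[gold][predicted] = number of validation pages.

    Returns, per predicted category, correct predictions / all predictions.
    """
    predicted_total, correct = {}, {}
    for gold, row in confusion.items():
        for pred, count in row.items():
            predicted_total[pred] = predicted_total.get(pred, 0) + count
            if pred == gold:
                correct[pred] = correct.get(pred, 0) + count
    return {y: correct.get(y, 0) / n for y, n in predicted_total.items() if n > 0}

def difficult_categories(confusion, threshold=0.9):
    """Categories whose precision on the validation set is below the threshold."""
    return {y for y, p in class_precision(confusion).items() if p < threshold}

# 18 pages predicted as "city", of which 15 are truly "city".
confusion = {
    "city": {"city": 15, "attraction": 1},
    "attraction": {"city": 3, "attraction": 19},
}

precision = class_precision(confusion)
difficult = difficult_categories(confusion)
```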

For the identified difficult samples, the following two methods are used to verify the results.

341) Link analysis

For difficult samples, the content of the entity page alone may not be sufficient for a correct decision. The present invention therefore uses link analysis to verify the classification results of difficult samples.

In linked data, an entity page usually links to other entity pages related to it. In general, the pages it links to are very likely to belong to the same category as the page itself. The categories of the entity pages linked from an entity page can therefore help the system determine the category of that entity more reliably.

Specifically, for an entity page e, the entity pages linked from e are analyzed; their set is denoted N(e). Some pages in N(e) carry category labels. The present invention finds the labeled pages in N(e), determines the category y* that occurs most often among them, and checks whether y* agrees with the classifier's prediction y' for e. If the two disagree, y* is used to correct y'.
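The majority vote over labeled neighbours can be sketched in Python as follows. The page names, labels, and the function name `correct_by_links` are hypothetical; keeping the original prediction when no neighbour is labeled is an assumption, since the description does not cover that case.

```python
from collections import Counter

def correct_by_links(predicted, linked_pages, labels):
    """Replace the prediction y' by the most frequent neighbour label y*.

    linked_pages: the set N(e) of pages linked from e.
    labels: mapping from page name to its known category (partial).
    """
    votes = Counter(labels[p] for p in linked_pages if p in labels)
    if not votes:
        return predicted  # assumed fallback: no labeled neighbour, keep y'
    majority, _ = votes.most_common(1)[0]
    return majority

# Hypothetical neighbourhood of a page predicted as "city".
linked_pages = ["West Lake", "Mount Lu", "Hangzhou", "Unlabeled page"]
labels = {"West Lake": "attraction", "Mount Lu": "attraction", "Hangzhou": "city"}

corrected = correct_by_links("city", linked_pages, labels)
```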

342) Affix analysis

For some samples that are hard to distinguish, the present invention also uses affix analysis to verify the classification result. For certain categories, entity names usually end with particular Chinese characters. For example, "city" entities usually end with "市" (city) or "县" (county), and "attraction" entities usually end with "湖" (lake), "山" (mountain) and the like. Table 3 lists examples of common entity affixes for these categories.

Table 3 Common category affixes of entities

Category | Common affixes
Attraction | 山, 湖, 城, 道, 景, 河, 岭, 洞 (mountain, lake, city, road, scenery, river, ridge, cave)
City | 区, 市, 县 (district, city, county)

The present invention proposes learning the affix information associated with entity types from large-scale unlabeled data. Specifically, word vectors are trained with the word2vec toolkit on the Chinese Gigaword dataset, and the words semantically closest to each category (words whose word-vector cosine similarity to the category is 0.7 or higher) are found by computing cosine similarity. Frequency statistics are then collected on the affixes of the words most similar to each of the two difficult sample categories, which yields the affixes associated with each category, so that the category of an entity can be determined by analyzing its affix. Specifically, if the affix s of an entity page occurs in one category y₁ at least twice as frequently as in the other category y₂, y₁ is taken as the category of the entity and replaces the original prediction. For example, for the entity page "庐山仙人洞" (Lushan Immortal Cave), the affix "洞" (cave) occurs far more frequently in the "attraction" category than in the "city" category, so the predicted category of the entity is corrected to "attraction".
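The affix rule can be sketched in Python as follows. The word lists stand in for the vocabulary most similar to each category and are invented for illustration; treating the affix as the last character of the name, and measuring frequency as the share of similar words ending in that character, are assumptions. The names `suffix_frequency` and `correct_by_affix` are hypothetical.

```python
def suffix_frequency(words, suffix):
    """Share of the words that end with the given suffix."""
    if not words:
        return 0.0
    return sum(1 for w in words if w.endswith(suffix)) / len(words)

def correct_by_affix(name, predicted, similar_words, ratio=2.0):
    """Pick the category in which the name's final character is at least
    `ratio` times as frequent as in every other category; otherwise keep
    the original prediction.
    """
    suffix = name[-1]
    freq = {y: suffix_frequency(ws, suffix) for y, ws in similar_words.items()}
    best = max(freq, key=freq.get)
    rivals = [f for y, f in freq.items() if y != best]
    if freq[best] > 0 and all(freq[best] >= ratio * f for f in rivals):
        return best
    return predicted

# Hypothetical words most similar to each difficult category.
similar_words = {
    "attraction": ["仙人洞", "溶洞", "西湖", "泰山"],
    "city": ["北京市", "长沙市", "吴县"],
}

corrected = correct_by_affix("庐山仙人洞", "city", similar_words)
```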

It should be noted that the embodiments are disclosed to aid further understanding of the present invention. Those skilled in the art will appreciate that various substitutions and modifications are possible without departing from the spirit and scope of the present invention and the appended claims. The present invention is therefore not limited to the content disclosed in the embodiments, and its scope of protection is that defined by the claims.

Claims

1. A linked data-oriented entity classification method, wherein the linked data comprises a plurality of entity pages, each entity page comprising a text description and an infobox; the entity classification method comprises a preprocessing stage, a statistical classification stage and a post-processing stage, and comprises the following steps:

1) in the preprocessing stage, word information is obtained by performing word segmentation on the text description in the entity page; the infobox attribute names and the word information together form the features of the entity page;

2) in the statistical classification stage, statistical classification models are trained on the features of the entity page using multiple segmentation granularities and are used to classify the entity page, yielding preliminary prediction results for the entity category;

3) in the post-processing stage, the preliminary prediction results for the entity category are corrected to obtain the corrected entity category; the correction comprises the following steps:

31) fusing, by a multi-granularity model fusion method, the preliminary prediction results obtained by the statistical classification models trained with multiple segmentation granularities, to obtain a fused entity category result;

32) constructing a category attribute database, and using the category-associated attribute information in the category attribute database to correct the fused entity category, to obtain the entity category after category-associated attribute correction;

33) analyzing the sentence structure of the first sentence of the text description by syntactic analysis, and, through this deep understanding of the first sentence, correcting the entity category obtained in step 32) to obtain the entity category corrected by first-sentence deep understanding.

2. The linked data-oriented entity classification method according to claim 1, wherein the word segmentation method in step 1) includes a forward maximum matching method, a backward maximum matching method, and a statistical sequence labeling method.

3. The linked data-oriented entity classification method according to claim 1, wherein step 2) adopts two segmentation granularities, namely a segmentation granularity with named entity recognition and a segmentation granularity without named entity recognition.

4. The linked data-oriented entity classification method according to claim 1, wherein the statistical classification model is a maximum entropy model; the multi-granularity model fusion method of step 31) computes, by Formula 1, the fused probability distribution predicted by the classifiers of different segmentation granularities, thereby combining the maximum entropy classification models trained with multiple segmentation granularities to classify the entity page and obtain the entity category result:

P_multi(y|x) = λP_w(y|x) + (1-λ)P_n(y|x) (Formula 1)

In Formula 1, P_multi(y|x) is the probability distribution obtained by fusing the predictions of the classifiers of different segmentation granularities; P_w(y|x) is the probability distribution predicted for sample x by the maximum entropy classification model that uses only word segmentation as a feature; y is the sample category and x is the sample; P_n(y|x) is the probability distribution predicted by the maximum entropy model that adds named entity annotation to word segmentation; λ is the parameter that adjusts the linear interpolation weight.

5. The linked data-oriented entity classification method according to claim 1, wherein analyzing the sentence structure by syntactic analysis in step 33) to obtain the entity category corrected by first-sentence deep understanding specifically comprises the following steps:

331) performing dependency parsing on the first sentence of the entity description, and identifying whether the object of the first sentence is a judgment-sentence object;

332) training Chinese word vectors on a large-scale unlabeled corpus, defining lexical semantic similarity, calculating the lexical semantic similarity between the word vector of each category and that of the judgment-sentence object, and obtaining the category word vector with the highest lexical semantic similarity;

333) computing cosine similarity and setting a cosine similarity threshold, and, when the cosine similarity between the word vector of the judgment-sentence object and that of its most similar category is greater than the threshold, correcting the category of the entity to the most similar category.

6. The linked data-oriented entity classification method according to claim 1, wherein, after the preliminary prediction results for the entity category have been corrected in the post-processing stage to obtain the corrected entity category, a confusion matrix is used to identify difficult entity categories; for the identified difficult entity categories, the entity category result is verified by a link analysis method and an affix analysis method; the confusion matrix identification is specifically: when the prediction precision of the statistical classification model for an entity category y_i is below 90% on the validation set, category y_i is regarded as a difficult entity category.

7. The linked data-oriented entity classification method according to claim 6, wherein the link analysis method is specifically: denoting the category prediction made by the classifier for entity page e as y', and denoting the set of entity pages linked from entity page e as N(e); finding the pages in N(e) that carry category labels, and obtaining the category that occurs most often among those labels, denoted y*; when the category y* is inconsistent with the category prediction y', using y* to correct y', so that the category of entity page e is y*.

8. The linked data-oriented entity classification method according to claim 6, wherein the affix analysis method is specifically: for entity categories whose entity names end with particular Chinese characters, learning the affix information associated with entity types from large-scale unlabeled data; obtaining the affixes associated with the difficult entity categories by performing frequency statistics on the affixes of the most similar words; and obtaining the category of the entity by analyzing its affix.

9. A linked data-oriented entity classification system implementing the linked data-oriented entity classification method according to any one of claims 1 to 8, comprising a preprocessing module, a statistical classification module and a post-processing module;

the preprocessing module is configured to perform word segmentation on the text description in the entity page, and to extract the infobox attribute names and the word information obtained by word segmentation as the feature representation of the entity page;

the statistical classification module is configured to train a classification model using the maximum entropy classification algorithm, and to identify the entity category from the description information of the entity in the entity page;

the post-processing module is configured to correct the entity category obtained by the statistical classification module by means of multi-granularity model fusion, category-associated attributes and deep understanding of the first sentence, to obtain the corrected entity category.

10. The linked data-oriented entity classification system according to claim 9, wherein the word segmentation tool is the Stanford CoreNLP toolkit, and the classification model uses the maximum entropy classifier software package Maxent.