CN105912625B

CN105912625B - A Linked Data-Oriented Entity Classification Method and System

Info

Publication number: CN105912625B
Application number: CN201610213411.8A
Authority: CN
Inventors: 葛涛; 穗志方
Original assignee: Peking University
Current assignee: Peking University
Priority date: 2016-04-07
Filing date: 2016-04-07
Publication date: 2019-05-14
Anticipated expiration: 2036-04-07
Also published as: CN105912625A

Abstract

The invention discloses an entity classification method and system for linked data. To address the entity classification problem in linked data, the method comprises preprocessing, statistical classification, and post-processing. Preprocessing segments the text description in each entity page; the features of the entity page consist of the infobox attribute names and the words obtained by segmentation. The statistical classification process trains statistical classification models at multiple segmentation granularities to classify entity pages, giving a preliminary prediction of the entity category. The post-processing process corrects the statistical classification result, using methods that include model fusion, linguistic knowledge, link information, and correction of the fused entity category with category-associated attribute information. The technical solution is easy to implement and debug, efficient, and accurate; it is suited to information management of linked data and classifies entities with high precision.

Description

An entity classification method and system for linked data

Technical field

The invention belongs to the field of information processing and relates to the classification and retrieval of linked data, in particular to a method and system for high-precision classification of entity pages in linked data.

Background

We are now in the era of big data, and how to make the fullest use of data to help computers process information has become one of the most popular research topics in information processing. In recent years, with the arrival of the Web 2.0 era, linked data (such as the Semantic Web and knowledge graphs) has attracted wide attention for its powerful ability to describe relations. Linked data refers to data organized in the manner of Baidu Baike or Wikipedia: each page corresponds to an entity, and entities link to one another, hence the name linked data. As the data grows in scale, managing linked data manually becomes impractical, and efficient methods and systems for managing linked data are urgently needed.

Entity classification is an important technical problem in linked data knowledge management. Classifying the entities in linked data makes it possible to organize the large number of entity pages effectively and so improve the user's search and reading experience.

At present, the common approach to entity classification is to classify the description text of the entity. In many cases, however, this simple approach cannot determine the entity category accurately. Its main shortcomings are:

(1) Although judging an entity's category from its text description is easy for a human, it is unrealistic to expect current feature-based statistical classification methods to do so with high accuracy. For example, the texts "X is an animation adapted from a famous game" and "A is a game made from a famous animation" are very similar at the lexical level, yet the former describes an animation entity and the latter a game entity; the entity types they describe are entirely different. Statistical classification based purely on text features therefore has insufficient accuracy and cannot obtain the entity category reliably.

(2) Many entity pages lack a sufficient text description. In that case, classifying the entity from the text description alone inevitably leads to classification errors, and the entity category cannot be obtained from the text.

Summary of the invention

To overcome the above deficiencies of the prior art, the present invention provides an entity classification method and system for linked data. It addresses the entity classification problem in linked data and achieves high-precision entity classification through a statistical classification process and a post-processing process. The statistical classification process classifies by modeling the text information. The post-processing process uses rich resources (such as affix information and link data) to correct the result of statistical classification, with methods that include model fusion, linguistic knowledge, link information, and correction of the fused entity category using category-associated attribute information.

Entity pages in linked data generally contain a text description and an infobox. The invention segments the text description, then extracts the infobox attribute names together with the words obtained by segmentation as features, which serve as the feature representation of the entity page. Next, maximum entropy models trained at multiple segmentation granularities classify the entity page, giving a preliminary prediction of the entity category. The predicted category is then post-processed to verify whether the classification result is reliable. Post-processing specifically includes: fusing the classification results of the classifiers trained on features of different segmentation granularities; correcting obvious prediction errors using the category-associated attribute information in a category attribute database; and deeply understanding the first sentence of the text description, parsing the sentence structure with methods such as syntactic analysis to obtain entity category information and correct the earlier prediction. Preferably, a confusion matrix is also used to identify categories that are difficult to classify correctly, and predictions of those categories are verified further, which includes correcting the entity category using the categories of the neighboring pages that the entity page links to and correcting it using the affix information of the entity page.

The technical solution provided by the present invention is as follows:

An entity classification method for linked data, wherein the linked data is a plurality of entity pages and each entity page contains a text description and an infobox; the entity classification method comprises a preprocessing stage, a statistical classification stage, and a post-processing stage, and specifically comprises the following steps:

1) In the preprocessing stage, the text description in the entity page is segmented to obtain word information; the features of the entity page consist of the infobox attribute names and the word information;

2) In the statistical classification stage, statistical classification models are trained on the features of the entity pages at multiple segmentation granularities and used to classify the entity pages, giving a preliminary prediction of the entity category;

3) In the post-processing stage, the preliminary prediction of the entity category is corrected to obtain the corrected entity category; the correction comprises the following steps:

31) By a multi-granularity model fusion method, the preliminary predictions of the entity category obtained from the statistical classification models trained at different segmentation granularities are fused, giving the fused entity category result;

32) A category attribute database is constructed, and the category-associated attribute information in the category attribute database is used to correct the fused entity category, giving the entity category corrected by category-associated attributes;

33) The sentence structure is parsed by a syntactic analysis method, and the first sentence of the text description is deeply understood in order to correct the entity category obtained in step 32), giving the entity category corrected by first-sentence deep understanding.

For the above entity classification method for linked data, further, the segmentation method of step 1) includes the forward maximum matching method, the backward maximum matching method, and methods based on statistical sequence labeling.

For the above entity classification method for linked data, further, step 2) uses two segmentation granularities: segmentation with named entity recognition and segmentation without named entity recognition.

For the above entity classification method for linked data, further, the statistical classification model is a maximum entropy model; the multi-granularity model fusion method of step 31) computes, by Formula 1, the fused probability distribution of the predictions of the classifiers of different segmentation granularities, thereby fusing the entity category results obtained by classifying the entity page with the maximum entropy classification models trained at the multiple segmentation granularities:

P_multi(y|x) = λ P_w(y|x) + (1-λ) P_n(y|x) (Formula 1)

In Formula 1, P_multi(y|x) is the probability distribution obtained by fusing the predictions of the classifiers of different segmentation granularities; P_w(y|x) is the probability distribution predicted for sample x by the maximum entropy classification model that uses only word segmentation as features; y is the sample category and x is the sample; P_n(y|x) is the probability distribution predicted by the maximum entropy model that adds named entity annotation to word segmentation as features; λ is the parameter that adjusts the linear interpolation weight.

For the above entity classification method for linked data, further, parsing the sentence structure by the syntactic analysis method in step 33) to obtain the entity category corrected by first-sentence deep understanding specifically comprises the following steps:

331) Dependency parsing is performed on the first sentence of the entity description to identify whether the object of the first sentence is a judgment-sentence object, that is, the object of a sentence whose predicate is the copula "is";

332) Chinese word vectors are trained on a large-scale unannotated corpus and a word similarity is defined; the word similarity between word vectors and the judgment-sentence object is calculated, and the word vector with the highest word similarity is obtained;

333) Cosine similarity is used and a cosine similarity threshold is set; when the cosine similarity between the word vector of the judgment-sentence object and that of its most similar category exceeds the threshold, the category of the entity is corrected to the most similar category.

For the above entity classification method for linked data, further, after the preliminary prediction of the entity category has been corrected in the post-processing stage to obtain the corrected entity category, a confusion matrix is used to identify difficult entity categories; for the identified difficult entity categories, the entity category result is verified by a link analysis method and an affix analysis method. The confusion matrix identification method is specifically: on the validation set, when the prediction precision of the statistical classification model for an entity category y_i is below 90%, the category y_i is regarded as a difficult entity category.

Further, the link analysis method is specifically: let the category predicted by the classifier for entity page e be y', and denote by N(e) the set of entity pages that entity page e links to; find the pages in N(e) that carry category labels, and count them to obtain the category held by the most labeled pages in N(e), denoted y*; when y* and the predicted category y' are inconsistent, y* is used to correct y', so that the category of entity page e is y*.

For the above entity classification method for linked data, further, the affix analysis method is specifically: for entity categories whose entity names end in fixed Chinese characters, affix information associated with entity types is learned from large-scale unlabeled data; frequency statistics are gathered on the affixes of the words closest to each category, giving the affixes associated with the difficult entity categories, and the category of the entity is obtained by analyzing its affix.

The present invention also provides an entity classification system for linked data that implements the above entity classification method, comprising a preprocessing module, a statistical classification module, and a post-processing module. The preprocessing module segments the text description in the entity page and extracts the infobox attribute names and the words obtained by segmentation as features, which serve as the feature representation of the entity page. The statistical classification module trains classification models with the maximum entropy classification algorithm and identifies the entity category from the description information of the entity in the entity page. The post-processing module corrects the entity category obtained by the statistical classification module using multi-granularity model fusion, category-associated attributes, and first-sentence deep understanding, giving the corrected entity category.

In the above entity classification system for linked data, the segmentation tool is the Stanford CoreNLP toolkit, and the classification model uses the maximum entropy classifier package Maxent.

Compared with the prior art, the beneficial effects of the present invention are:

The present invention provides an entity classification method and system for linked data that addresses the entity classification problem in linked data and achieves high-precision entity classification through a statistical classification process and a post-processing process. Building on a basic classification of the text, the result of classifying the entity description text is corrected by the following methods:

(1) multi-granularity word-segmentation model fusion, which overcomes the shortcomings of a single segmentation granularity in text feature extraction;

(2) correction of the fused entity category using category-associated attribute information, which corrects obvious errors;

(3) first-sentence deep understanding, which reduces the effect of text noise;

(4) identification of difficult samples, with the recognition results verified by methods such as link analysis and affix analysis.

Existing entity classification methods perform no further processing after classification, so they cannot correct the result when the predicted entity category may be wrong. The present invention, by contrast, uses post-processing to correct cases in which the text-based statistical classification module may err. The proposed technical solution is easy to implement and debug, efficient, and accurate; it is well suited to enterprises managing linked data and classifies entities with high precision. In the JIST 2015 entity classification evaluation competition, the accuracy of the scheme of the present invention was 98.6%, the highest among the classification schemes in that evaluation.

Brief description of the drawings

Fig. 1 is a flow diagram of the entity classification method for linked data provided by the present invention.

Fig. 2 is a structural block diagram of the entity classification system for linked data provided in an embodiment of the present invention.

Fig. 3 is a flow diagram of the first-sentence deep understanding step in the method provided by the present invention.

Detailed description of embodiments

The present invention is further described below through embodiments with reference to the accompanying drawings; the embodiments do not limit the scope of the invention in any way.

The present invention provides an entity classification method and system for linked data. It addresses the entity classification problem in linked data and achieves high-precision entity classification through a statistical classification process and a post-processing process. The statistical classification process classifies by modeling the text information; the post-processing process uses rich resources (such as affix information and link data) to correct the result of statistical classification. Fig. 1 is a flow diagram of the entity classification method for linked data provided by the present invention. As shown in Fig. 1, the method comprises a preprocessing process, a statistical classification process, and a post-processing process. The entity page is first segmented and its features extracted, and the extracted features are then used to train statistical classification models. For the classification results, multi-granularity model fusion is first used to correct the prediction errors of a single model; category-associated attribute information is then used to correct the fused entity category, fixing some obviously wrong predictions; the first sentence of the entity page's description is then analyzed in depth to determine its category. For samples of categories that are difficult to classify correctly, the invention can correct the category again by link analysis and affix analysis. The specific steps are:

1) The entity page is preprocessed, including Chinese word segmentation (typical methods are forward maximum matching, backward maximum matching, and methods based on statistical sequence labeling) and feature extraction (word features and infobox attribute-name features are extracted to represent the page), giving the entity page features;

2) Using the entity page features extracted in step 1), the entity page is classified by maximum entropy models at multiple segmentation granularities, giving a preliminary prediction of the entity category;

In the embodiment of the present invention, two classifiers are trained with the maximum entropy model. The feature representation of one classifier uses the words produced by segmentation with named entity recognition plus the infobox attributes; that of the other uses the words produced by segmentation without named entity recognition plus the infobox attributes.

3) The entity category obtained in step 2) is post-processed to verify whether the classification result is reliable, specifically including the following steps:

31) The classification results of the classifiers trained on features of different segmentation granularities are fused;

In the embodiment of the present invention, two segmentation granularities are used: segmentation with named entity recognition and segmentation without named entity recognition;

32) A category attribute database is constructed in advance, and the category-associated attribute information in it is used to correct obvious prediction errors;

33) The first sentence of the text description is deeply understood with a syntactic parser; the sentence structure is analyzed by methods such as syntactic analysis to obtain entity category information and correct the earlier prediction;

34) A confusion matrix is used to identify categories that are difficult to classify correctly, and predictions of those categories are verified further, including:

341) correcting the entity category using the categories of the neighboring pages that the entity page links to;

342) correcting the entity category using the affix information of the entity page.

Fig. 2 is a structural block diagram of the entity classification system for linked data provided in an embodiment of the present invention. The system comprises a preprocessing module, a statistical classification module, and a post-processing module. Each module is described further below:

Preprocessing module

Entity pages in linked data generally contain a text description and an infobox.

In the preprocessing module, we use the Stanford CoreNLP toolkit to segment the text description in the entity page. In this embodiment, we adopt two different segmentation granularities: with named entity recognition and without named entity recognition. For example, "New York Times Square" is treated as a single word under segmentation with named entity recognition, whereas under segmentation without named entity recognition it is split into three words: "New York", "Times", and "Square".

After segmenting the Chinese text, we extract the infobox attribute names together with the words obtained by segmentation as features, which serve as the feature representation of the entity page.
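The feature representation described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `page_features`, the `word=` and `attr=` prefixes, and the sample tokens and attribute names are all assumptions introduced here, and the tokens are written by hand rather than produced by Stanford CoreNLP.

```python
from collections import Counter

def page_features(tokens, infobox_attrs):
    """Bag-of-words counts plus one indicator per infobox attribute name."""
    features = Counter(f"word={t}" for t in tokens)
    for attr in infobox_attrs:
        features[f"attr={attr}"] = 1
    return dict(features)

# Hand-written (hypothetical) segmentations of the same description
# at the two granularities discussed in the text.
tokens_with_ner = ["New York Times Square", "is", "a", "landmark"]
tokens_without_ner = ["New York", "Times", "Square", "is", "a", "landmark"]
attrs = ["location", "opened"]  # hypothetical infobox attribute names

feats_ner = page_features(tokens_with_ner, attrs)
feats_plain = page_features(tokens_without_ner, attrs)
```

Only the finer segmentation exposes "Square" as a feature of its own, which is why the two granularities are kept as separate feature sets for separate classifiers.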

Statistical classification module

The present invention mainly uses the description information of the entity in the entity page as the basis for judging the entity category. It trains the classification model with the maximum entropy classification algorithm, a log-linear model commonly used in natural language processing. As mentioned for the preprocessing module, the features used by the statistical classification module include word features and infobox attribute features. The word features are the classical bag-of-words representation. Infobox attribute features play a very important role in identifying the entity category; for example, "date of birth" is very likely to be associated with an entity of the person type.

In the text classification module, we train text classification models with word segmentation at different granularities, because in some cases a single segmentation granularity cannot meet the needs of classification. For example, if "New York Times Square" is treated as a single named entity, it is less useful for classification than when it is split into "New York", "Times", and "Square", because the word "Square" has a vital influence on classification. On the other hand, without named entity recognition, a name such as "Zhang Yishan" would be split into "Zhang", "Yi", and "Shan", which would also harm the classification result. Therefore, in the statistical classification module, the embodiment of the present invention trains two classification models with the maximum entropy classifier package Maxent (available at http://homepages.inf.ed.ac.uk/lzhang10/maxent_toolkit.html): one using fine-grained segmentation with named entity recognition, and one using simple coarse-grained word segmentation.

Post-processing module

To handle cases in which the text-based statistical classification module may be wrong, the present invention applies corrections in a post-processing module. The post-processing module can perform the following processes:

31) Multi-granularity model fusion

Model fusion is widely used in machine learning, but most model fusion methods fuse different kinds of machine learning models. For natural language (especially Chinese), differences in segmentation granularity affect the performance of the whole model. Given the respective strengths and weaknesses of different segmentation granularities, the present invention proposes using model fusion so that the classification models obtained at the various segmentation granularities complement one another.

We define P_w(y|x) as the probability distribution over category y predicted for sample x by the maximum entropy classification model that uses only word segmentation as features, and P_n(y|x) as the probability distribution predicted by the maximum entropy model that adds named entity annotation to word segmentation as features. We fuse the results of the two classifiers as follows:

P_multi(y|x) = λ P_w(y|x) + (1-λ) P_n(y|x) (Formula 1)

In Formula 1, P_multi(y|x) is the probability distribution obtained by fusing the predictions of the classifiers of different segmentation granularities; P_w(y|x) is the probability distribution predicted for sample x by the maximum entropy classification model that uses only word segmentation as features; y is the sample category and x is the sample; P_n(y|x) is the probability distribution predicted by the maximum entropy model that adds named entity annotation to word segmentation as features; λ is the parameter that adjusts the linear interpolation weight. In this embodiment, λ = 0.5.
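Formula 1 is a plain linear interpolation of two probability distributions, which the following sketch illustrates. The function name `fuse` and the example probabilities are assumptions made for illustration; in the embodiment the two distributions would come from the two Maxent classifiers.

```python
def fuse(p_w, p_n, lam=0.5):
    """Formula 1: P_multi(y|x) = lam * P_w(y|x) + (1 - lam) * P_n(y|x)."""
    categories = set(p_w) | set(p_n)
    return {
        y: lam * p_w.get(y, 0.0) + (1.0 - lam) * p_n.get(y, 0.0)
        for y in categories
    }

# Hypothetical outputs of the two classifiers for one entity page.
p_w = {"city": 0.6, "scenic spot": 0.4}   # word-segmentation-only model
p_n = {"city": 0.3, "scenic spot": 0.7}   # model with named entity annotation

p_multi = fuse(p_w, p_n)                  # lam = 0.5 as in the embodiment
best = max(p_multi, key=p_multi.get)      # fused prediction
```

With these made-up numbers the two models disagree, and the fused distribution favors "scenic spot" because the second model is more confident than the first.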

32) Correcting predictions with category-associated attributes

This module uses category-associated attributes to correct some obviously wrong category predictions. It mainly exploits the category specificity of infobox attributes. As shown in Table 1, certain attributes cannot be associated with some specific categories. For example, "gaming platform" cannot be associated with a city entity. The specificity of these attributes can therefore be used to correct obvious prediction errors of the classifier. The present invention manually builds a category attribute database for the predefined entity types and uses it to correct predictions.
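One way such a correction could work is sketched below. The text does not specify the database layout or how the replacement category is chosen, so both are assumptions here: the database is modeled as a mapping from attribute name to the categories it rules out, and the replacement is the most probable category that is not ruled out. The names `INCOMPATIBLE` and `correct_with_attributes` are hypothetical.

```python
# Hypothetical category attribute database:
# attribute name -> categories that cannot carry this attribute.
INCOMPATIBLE = {"gaming platform": {"city"}}

def correct_with_attributes(p_multi, infobox_attrs, incompatible=INCOMPATIBLE):
    """Return the most probable category not excluded by any infobox attribute."""
    excluded = set()
    for attr in infobox_attrs:
        excluded |= incompatible.get(attr, set())
    allowed = {y: p for y, p in p_multi.items() if y not in excluded}
    # Assumption: if every category is excluded, keep the original ranking.
    candidates = allowed or p_multi
    return max(candidates, key=candidates.get)

fused = {"city": 0.5, "game": 0.4, "person": 0.1}  # made-up fused distribution
corrected = correct_with_attributes(fused, ["gaming platform"])
unchanged = correct_with_attributes(fused, ["date of birth"])
```

Here the "gaming platform" attribute rules out "city", so the prediction falls to the next most probable category; an attribute with no entry in the database leaves the prediction untouched.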

Table 1: Examples of category-associated attributes

33) The first sentence of the entity description is deeply understood with a dependency parser to identify the entity category more precisely;

In linked data (such as Wikipedia and Baidu Baike), the first sentence of an entity page's description is usually a defining description of the entity (for example: "Pounding Six is a card game popular in Tianjin"). If the first sentence of the entity description can be understood in depth, it helps greatly in identifying the entity category accurately.

Fig. 3 is a flow diagram of the first-sentence deep understanding step in the method provided by the present invention. The invention first uses a dependency parser to find the judgment-sentence object in the first sentence of the entity page's text description, then uses that object to analyze the category of the entity page. This specifically comprises the following steps:

331) Identifying the judgment-sentence object

The present invention uses the Stanford dependency parser to perform dependency parsing on the first sentence of the entity description and identify the subject, predicate, and object of the first sentence. If the object obtained by dependency parsing has a direct dependency relation with the copula "is", the object is called a "judgment-sentence object"; otherwise it is called a "non-judgment-sentence object".

If the object of the first sentence of the entity text is a judgment-sentence object, we can use it as a clue to determine the category of the entity and thus verify whether the classifier's prediction is accurate. If the classifier's prediction contradicts the conclusion drawn from the judgment-sentence object, the prediction is corrected with that conclusion. If the first sentence contains no judgment-sentence object, this step is skipped and processing continues with 34).

For example, in the sentence "Pounding Six is a card game popular in Tianjin", dependency parsing finds that "game" is the object of the sentence and that "game" has a direct dependency relation with "is", so "game" is the judgment-sentence object of this sentence. If "game" is a predefined entity category in the entity classification system, we take it as the category of the entity.
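The check for a direct dependency on the copula can be sketched over parser output represented as (head, relation, dependent) triples. This is an illustration under stated assumptions, not the Stanford parser's API: the triples are written by hand with English glosses, the copula is represented by the string "is", and the relation labels in `OBJECT_RELATIONS` are assumed, since label inventories differ between parsers.

```python
COPULA = "is"  # English stand-in for the Chinese copula
# Assumed object-like relation labels; real parsers use their own inventories.
OBJECT_RELATIONS = {"attr", "dobj", "cop"}

def judgment_object(triples, copula=COPULA):
    """Return the word linked directly to the copula by an object-like
    relation, or None if the sentence has no judgment-sentence object."""
    for head, rel, dep in triples:
        if rel not in OBJECT_RELATIONS:
            continue
        if head == copula:
            return dep
        if dep == copula:
            return head
    return None

# Hand-written parse of "Pounding Six is a card game popular in Tianjin".
triples = [
    ("is", "nsubj", "Pounding Six"),
    ("is", "attr", "game"),
    ("game", "amod", "popular"),
]
obj = judgment_object(triples)
```

If `obj` is itself a predefined category it can be used directly; otherwise it is passed on to the similarity-based correction of step 332).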

332) Correcting the category prediction with the judgment-sentence object

In some cases, even when a judgment-sentence object is found, it cannot be used indiscriminately to correct the prediction, because doing so may introduce unnecessary errors. Moreover, in many cases the judgment-sentence object does not exactly match a category name. Take "Masako Nozawa is a famous Japanese voice actor": although dependency parsing finds that "voice actor" is the judgment-sentence object, "voice actor" may not be among the predefined entity categories. The invention therefore defines a correction condition: using word similarity, that is, the cosine similarity between word vectors trained on a large-scale unannotated corpus, it finds the category most similar to the judgment-sentence object, so that category correction can be carried out reliably.

In natural language processing, cosine similarity between word vectors is commonly taken as a measure of the semantic similarity of words. Specifically, the embodiment of the present invention first uses the word2vec toolkit (https://word2vec.googlecode.com/svn/trunk/) to train Chinese word vectors on the Chinese Gigaword corpus (a public dataset), and then uses the trained word vectors to find the category name semantically most similar to the judgment-sentence object. If the cosine similarity between the word vector of the judgment-sentence object and that of its most similar category exceeds a preset threshold (0.9 in the embodiment of the present invention), the category of the entity is corrected to the most similar category.

To this end, let w₀ be the judgment-sentence object of the first sentence of the entity page's text description, y ∈ Y a category (Y is the set of entity categories), and sim(w₁, w₂) the cosine similarity of the word vectors of words w₁ and w₂. The correction condition is then given by Formula 2:

y* = argmax_{y∈Y} sim(w₀, y) ∧ sim(w₀, y*) > 0.9 (Formula 2)

In Formula 2, ∧ denotes logical "and". The part to the left of ∧ states that y* is the category with the highest semantic similarity; the part to the right states that the similarity between y* and w₀ must exceed 0.9. The correction is made only when the correction condition (Formula 2) is met: only when y* is the category with the highest semantic similarity and its similarity to w₀ exceeds 0.9 is y* used to correct the original category prediction.

In the earlier example ("Masako Nozawa is a famous Japanese voice actor"), we find that the category most similar to "voice actor" is "actor" (see Table 2, which lists some of the words most similar to each category, computed with the word vectors trained on Chinese Gigaword; words in bold have a similarity to the category of 0.9 or more), and that the semantic similarity between "actor" and "voice actor" is 0.9 or more. The category of the entity page "Masako Nozawa" is therefore corrected to "actor".
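The correction condition of Formula 2 can be sketched in a few lines. The three-dimensional vectors below are made up purely for illustration (real vectors would come from word2vec trained on Chinese Gigaword), and the function names `cosine` and `correct_by_first_sentence` are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def correct_by_first_sentence(w0, predicted, vectors, categories, threshold=0.9):
    """Formula 2: replace the prediction with y* only if y* is the category
    most similar to w0 and sim(w0, y*) exceeds the threshold."""
    if w0 not in vectors:
        return predicted
    scored = [(cosine(vectors[w0], vectors[y]), y) for y in categories if y in vectors]
    if not scored:
        return predicted
    best_sim, y_star = max(scored)
    return y_star if best_sim > threshold else predicted

# Toy vectors, invented for this example.
vectors = {
    "voice actor": [0.9, 0.4, 0.1],
    "actor": [0.8, 0.5, 0.1],
    "city": [0.0, 0.1, 1.0],
}
result = correct_by_first_sentence("voice actor", "city", vectors, ["actor", "city"])
```

With these toy vectors the most similar category clears the 0.9 threshold, so the original prediction is overridden; with a stricter threshold the original prediction would be kept.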

Table 2. Words most similar to each category

34) Identifying difficult samples using a confusion matrix

In practical applications, samples of certain categories are often hard to tell apart; such samples are called difficult samples. For example, for entities of the two categories "city" and "sight spot", the classifier often makes wrong predictions, because the descriptions and information box attributes of these two kinds of entities are very similar. To improve classification precision, the present invention uses a confusion matrix to find the sample categories on which the classifier is prone to error. Specifically, if the prediction precision of the statistical classification model for an entity category y_i on the validation set does not reach 90%, category y_i is regarded as a difficult sample category. For example, suppose that on the validation set the statistical classification model predicts 18 entity pages as "city", but only 15 of them actually belong to "city"; the prediction precision for "city" is then only 83.33% (15/18), so "city" is identified as a difficult sample category. Samples that the statistical classification model predicts as belonging to a difficult sample category are called difficult samples.
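The precision check described above can be sketched as follows. This is a minimal illustration rather than the embodiment's implementation: the function name and the toy label lists are assumptions, and per-category precision is computed directly from the gold and predicted labels, which is equivalent to reading one column of the confusion matrix.

```python
from collections import Counter

def difficult_categories(gold, predicted, min_precision=0.9):
    """Return {category: precision} for every predicted category whose
    precision on the validation set is below min_precision."""
    predicted_counts = Counter(predicted)
    correct_counts = Counter(p for g, p in zip(gold, predicted) if g == p)
    result = {}
    for category, n_predicted in predicted_counts.items():
        precision = correct_counts[category] / n_predicted
        if precision < min_precision:
            result[category] = precision
    return result

# Toy validation set reproducing the worked example:
# 18 pages are predicted as "city", of which 15 really are "city".
gold = ["city"] * 15 + ["sight spot"] * 3 + ["sight spot"] * 10
pred = ["city"] * 18 + ["sight spot"] * 10
hard = difficult_categories(gold, pred)
```

Here "city" has precision 15/18 and is flagged as a difficult sample category, while "sight spot" (10 of 10 predictions correct) is not.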

For the difficult samples thus identified, the following two methods are used to verify the classification results.

341) Link analysis

For difficult samples, the content of the entity page alone may not be enough to make a correct judgment. The present invention therefore uses link analysis to verify the classification results of difficult samples.

In linked data, an entity page usually links to other entity pages related to it. Generally speaking, the categories of the linked entity pages are very likely to be the same as the category of the page itself. Therefore, the categories of the other entity pages that an entity page links to can help the system better judge the category of that entity.

Specifically, for an entity page e, the entity pages that e links to are analyzed, and their set is denoted N(e). Some pages in N(e) carry category annotations. The present invention finds the pages in N(e) that have category annotations, determines the category y* that occurs most often among them, and checks whether y* is consistent with the category prediction y' made by the classifier for e. If they are inconsistent, y* is used to correct y'.
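The link analysis step amounts to a majority vote over the annotated pages in N(e). A minimal sketch follows; the function name and the page identifiers are hypothetical, and ties between equally frequent categories are resolved by first occurrence, which is a choice of this sketch and is not specified in the text.

```python
from collections import Counter

def verify_by_links(predicted, linked_pages, labels):
    """Return the majority category y* among annotated pages in N(e);
    keep the classifier's prediction y' if no linked page is annotated."""
    annotated = [labels[page] for page in linked_pages if page in labels]
    if not annotated:
        return predicted
    # most_common(1) yields the most frequent category; ties keep first-seen order.
    y_star, _count = Counter(annotated).most_common(1)[0]
    return y_star

# Hypothetical data: three of the four linked pages carry a category annotation.
labels = {"page_a": "sight spot", "page_b": "sight spot", "page_c": "city"}
linked = ["page_a", "page_b", "page_c", "page_d"]
result = verify_by_links("city", linked, labels)
```

In this toy case the prediction "city" is corrected to "sight spot", since two of the three annotated neighbors carry that category.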

342) Affix analysis

For certain samples that are hard to distinguish, the present invention also uses affix analysis to verify the classification results. For certain categories, entity names usually end with fixed Chinese characters. For example, names of "city" entities usually end with the characters meaning "city" or "county", while names of "sight spot" entities usually end with characters such as "lake" or "mountain". Table 3 lists examples of common entity-name affixes by category.

Table 3. Common entity-name affixes by category

Sight spot	City
Mountain, lake, city, road, scenery, river, ridge, cave	District, city, county

The present invention is the first to propose learning the affix information associated with entity types from large-scale unlabeled data. Specifically, word vectors are trained on the Chinese Gigaword dataset with the word vector toolkit word2vec, and cosine similarity is then used to find the words semantically closest to each category (words whose word vector cosine similarity to the category is 0.7 or above). Frequency statistics are then computed over the affixes of the closest words of each of the two categories in question, which yields the affixes associated with each difficult sample category, so that the category of an entity can be determined by analyzing its affix. Specifically, if the affix s of an entity page occurs in category y₁ with a frequency significantly higher (2 times or more) than its frequency in another category y₂, then y₁ is taken as the category of the entity, correcting the original prediction. For example, for the entity page "Mount Lushan Fairy Cave", the affix "cave" occurs far more frequently in the "sight spot" category than in the "city" category, so the predicted category of the entity is corrected to "sight spot".
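The affix check can be sketched as below. Several points are assumptions of this sketch rather than statements from the text: the affix is taken to be the last character of the name, "frequency" is interpreted as relative frequency within a category's similar-word list, the function names are hypothetical, and the word lists are invented toy data rather than words retrieved from trained vectors.

```python
from collections import Counter

def suffix_frequencies(similar_words):
    """Relative frequency of the final character among a category's similar words."""
    counts = Counter(word[-1] for word in similar_words if word)
    total = sum(counts.values())
    return {suffix: count / total for suffix, count in counts.items()}

def verify_by_suffix(entity_name, predicted, cat_a, cat_b, freqs, ratio=2.0):
    """Pick the category in which the name's final character is at least
    `ratio` times more frequent; otherwise keep the original prediction."""
    suffix = entity_name[-1]
    freq_a = freqs[cat_a].get(suffix, 0.0)
    freq_b = freqs[cat_b].get(suffix, 0.0)
    if freq_a > 0 and freq_a >= ratio * freq_b:
        return cat_a
    if freq_b > 0 and freq_b >= ratio * freq_a:
        return cat_b
    return predicted

# Invented toy word lists (final characters: cave, cave, lake, mountain / city, district, county, city).
freqs = {
    "sight spot": suffix_frequencies(["溶洞", "山洞", "西湖", "泰山"]),
    "city": suffix_frequencies(["北京市", "朝阳区", "密云县", "上海市"]),
}
# "庐山仙人洞" is Mount Lushan Fairy Cave; its final character means "cave".
result = verify_by_suffix("庐山仙人洞", "city", "sight spot", "city", freqs)
```

With these toy lists the final character of the name occurs only among the "sight spot" words, so the prediction is corrected to "sight spot". A name whose final character appears in neither list keeps its original prediction.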

It should be noted that the embodiments are disclosed to aid further understanding of the present invention, but those skilled in the art will appreciate that various substitutions and modifications are possible without departing from the spirit and scope of the present invention and the appended claims. Therefore, the present invention should not be limited to what is disclosed in the embodiments, and the scope of protection claimed by the present invention is defined by the claims.

Claims

1. An entity classification method oriented to linked data, wherein the linked data is a plurality of entity pages, and the entity pages comprise text descriptions and information boxes; the entity classification method comprises a preprocessing stage, a statistical classification stage and a post-processing stage, including the following steps:

1) In the process of the preprocessing stage, word information is obtained by segmenting the text description information in the entity page; the attribute name of the information box and the word information constitute the feature of the entity page;

2) In the statistical classification stage, using the features of the entity pages, adopting a variety of segmentation granularities to train the statistical classification model to classify the entity pages, and obtain the preliminary prediction results of the entity categories;

3) In the post-processing stage, the preliminary prediction result of the entity category is modified to obtain the modified entity classification category; the modification includes the following steps:

31) By the multi-granularity model fusion method, the preliminary prediction results of the entity categories obtained by using the statistical classification models trained by multiple segmentation granularities are fused to obtain the fused entity category results;

32) Constructing a category attribute database, and using the category association attribute information in the category attribute database to correct the fused entity category, obtaining the entity category after category association attribute correction;

33) Analyzing the sentence structure by using a syntactic analysis method, and, through deep understanding of the first sentence of the text description, correcting the entity category obtained in step 32) to obtain the entity category information after first-sentence deep-understanding correction.

2. The entity classification method oriented to linked data as claimed in claim 1, wherein the word segmentation method in step 1) is one of a forward maximum matching method, a backward maximum matching method and a method based on statistical sequence labeling.

3. The entity classification method for linked data as claimed in claim 1, wherein step 2) adopts two segmentation granularities, namely the segmentation granularity with named entity recognition and the segmentation granularity without named entity recognition.

4. The entity classification method for linked data as claimed in claim 1, wherein the statistical classification model is a maximum entropy model; the multi-granularity model fusion method of step 31) computes, by Equation 1, the probability distribution that fuses the predictions of the classifiers of different segmentation granularities, thereby fusing the entity category results obtained by classifying the entity pages with the maximum entropy classification models trained at multiple segmentation granularities:

$$P_{\mathrm{multi}}(y \mid x) = \lambda P_{w}(y \mid x) + (1-\lambda) P_{n}(y \mid x) \qquad \text{(Equation 1)}$$

In Equation 1, P_multi(y|x) is the probability distribution obtained by fusing the predictions of the classifiers of different segmentation granularities; P_w(y|x) is the probability distribution predicted for sample x by the maximum entropy classification model that uses only word segmentation as features; P_n(y|x) is the probability distribution predicted by the maximum entropy model that uses word segmentation with named entity annotation as features; y is the sample category and x is the sample; λ is the parameter that adjusts the linear interpolation weight.

5. The entity classification method oriented to linked data as claimed in claim 1, wherein analyzing the sentence structure by using the syntactic analysis method in step 33) to obtain the entity category information after first-sentence deep-understanding correction specifically comprises the following steps:

331) Perform dependency syntax analysis on the first sentence of the entity description, and identify whether the object of the first sentence belongs to the object of the judgment sentence;

332) train Chinese word vectors on a large-scale unlabeled corpus, define the lexical semantic similarity, calculate the lexical semantic similarity between the word vector and the object of the judgment sentence, and obtain the word vector with the highest lexical semantic similarity;

333) Set a cosine similarity threshold; when the cosine similarity between the word vector of the judgment-sentence object and the word vector of its most similar category, computed by the cosine similarity calculation method, is greater than the cosine similarity threshold, the category of the entity is corrected to the most similar category.

6. The entity classification method oriented to linked data according to claim 1, wherein, in the post-processing stage, after the preliminary prediction result of the entity category has been corrected to obtain the corrected entity category, a confusion matrix is used to identify difficult entity categories; for the identified difficult entity categories, the entity category results are verified by a link analysis method and an affix analysis method; the confusion matrix identification method is specifically: on the validation set, when the prediction precision of the statistical classification model for an entity category y_i does not reach 90%, the category y_i is regarded as a difficult entity category.

7. The entity classification method for linked data as claimed in claim 6, wherein the link analysis method is specifically: the category prediction made by the classifier for the entity page e is y', and the set of entity pages linked from the entity page e is denoted N(e); the pages with category annotations in N(e) are found, and the category that occurs most often among them is determined and denoted y*; when y* is inconsistent with the category prediction y', y* is used to correct y', and the category of the entity page e is obtained as y*.

8. The entity classification method oriented to linked data as claimed in claim 6, wherein the affix analysis method is specifically: for entity types whose entity names end with fixed Chinese characters, the affix information associated with the entity types is learned from large-scale unlabeled data; the affixes associated with each difficult entity category are obtained by performing frequency statistics on the affixes of the words most similar to that category; and the entity category is obtained by analyzing the affixes.

9. A linked data-oriented entity classification system implemented using the linked data-oriented entity classification method according to claims 1 to 8, comprising a preprocessing module, a statistical classification module and a post-processing module;

The preprocessing module is used to perform word segmentation on the text description information in the entity page, and to extract the attribute names of the information box and the word information obtained by word segmentation as features, which serve as the feature representation of the entity page;

The statistical classification module uses the maximum entropy classification algorithm to train the classification model, and uses the description information of the entity in the entity page to identify and obtain the entity category;

The post-processing module is used to modify the entity category obtained by the statistical classification module by adopting multi-granularity model fusion, category association attributes and in-depth understanding of the first sentence to obtain the modified entity category.

10. The linked data-oriented entity classification system according to claim 9, wherein the word segmentation tool is the StanfordCoreNLP toolkit; and the classification model adopts the maximum entropy classifier software package Maxent.