CN105912625B - A Linked Data-Oriented Entity Classification Method and System - Google Patents

A Linked Data-Oriented Entity Classification Method and System

Info

Publication number
CN105912625B
Authority
CN
China
Prior art keywords
entity
classification
category
information
sentence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201610213411.8A
Other languages
Chinese (zh)
Other versions
CN105912625A (en
Inventor
葛涛
穗志方
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University
Priority to CN201610213411.8A
Publication of CN105912625A
Application granted
Publication of CN105912625B
Expired - Fee Related (current legal status)
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36: Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367: Ontology
    • G06F16/35: Clustering; Classification
    • G06F16/353: Clustering; Classification into predefined classes
    • G06F16/355: Creation or modification of classes or clusters

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an entity classification method and system for linked data. It addresses the entity classification problem in linked data and comprises preprocessing, statistical classification, and post-processing. Preprocessing segments the text description of each entity page into words; the features of the page consist of the infobox attribute names and the words obtained by segmentation. Statistical classification trains statistical classification models at several segmentation granularities and uses them to classify entity pages, yielding a preliminary prediction of the entity category. Post-processing corrects the statistical classification result through model fusion, linguistic knowledge, link information, and correction of the fused entity category using category-associated attribute information. The technical solution is easy to implement and debug, efficient, and accurate; it is suited to information management over linked data and classifies entities with high accuracy.

Description

An entity classification method and system for linked data
Technical field
The invention belongs to the field of information processing and relates to the classification and retrieval of linked data, in particular to a method and system for classifying the entity pages in linked data with high accuracy.
Background
We are now in the era of big data, and how to make maximal use of data to help computers process information has become one of the most popular research topics in information processing. In recent years, with the arrival of the Web 2.0 era, linked data (for example the Semantic Web and knowledge graphs) has attracted wide attention because of its strong ability to describe relationships. Linked data refers to data organized in the manner of Baidu Baike or Wikipedia: each page corresponds to one entity, and entities link to one another, hence the name linked data. As data volumes keep growing, managing linked data manually is impractical, and efficient methods and systems for managing information in linked data are urgently needed.
Entity classification is an important technical problem in knowledge management for linked data. Classifying the entities in linked data makes it possible to organize the large number of entity pages effectively and so improve the search and reading experience of users.
At present, the common approach to entity classification is to classify the description text of the entity. In many cases, however, this simple approach cannot determine the entity category accurately. Its main shortcomings are:
(1) Although it is easy for a person to judge the category of an entity from a text description, it is unrealistic to expect current feature-based statistical classification methods to do so with high accuracy. For example, the texts "X is an animation adapted from a famous game" and "A is a game made from a famous animation" are very similar at the lexical level, yet the former describes an animation entity and the latter a game entity, so the entity types they describe are entirely different. Statistical classification based purely on text features therefore lacks the precision needed to obtain the entity category accurately.
(2) Many entity pages lack sufficient text description. In such cases, classifying entities by text description alone inevitably produces classification errors, and the entity category cannot be obtained from the text description.
Summary of the invention
To overcome the above shortcomings of the prior art, the present invention provides an entity classification method and system for linked data that achieves highly accurate entity classification through a statistical classification process followed by a post-processing process. The statistical classification process classifies entities by modeling their text information. The post-processing process uses rich resources (such as affix information and link data) to correct the statistical classification result, through model fusion, linguistic knowledge, link information, and correction of the fused entity category using category-associated attribute information.
An entity page in linked data generally contains a text description and an infobox. The invention segments the text description and extracts the infobox attribute names, together with the resulting words, as the feature representation of the entity page. It then classifies the entity page with maximum entropy models trained at several segmentation granularities, obtaining a preliminary prediction of the entity category. The predicted category is then post-processed to verify whether the classification result is reliable. Post-processing comprises: fusing the classification results of classifiers trained on features from different segmentation granularities; correcting obvious prediction errors using the category-associated attribute information in a category attribute database; and deeply understanding the first sentence of the text description, by parsing the sentence structure with methods such as syntactic analysis, to obtain entity category information and correct the earlier prediction. Preferably, a confusion matrix is also used to identify categories that are hard to classify correctly, and predictions for those categories are verified further, by correcting the entity category using the categories of the neighboring pages linked from the entity page and using the affix information of the entity page.
The technical solution provided by the present invention is as follows:
An entity classification method for linked data, where the linked data is a set of entity pages and each entity page contains a text description and an infobox. The method comprises a preprocessing stage, a statistical classification stage, and a post-processing stage, with the following steps:
1) In the preprocessing stage, the text description of the entity page is segmented into words; the features of the entity page consist of the infobox attribute names and these words;
2) In the statistical classification stage, statistical classification models are trained on the entity page features at several segmentation granularities and used to classify the entity pages, yielding a preliminary prediction of the entity category;
3) In the post-processing stage, the preliminary prediction of the entity category is corrected to obtain the corrected entity category. The correction comprises the following steps:
31) Using multi-granularity model fusion, the preliminary predictions produced by the statistical classification models trained at different segmentation granularities are fused, yielding the fused entity category;
32) A category attribute database is built, and the category-associated attribute information in it is used to correct the fused entity category, yielding the entity category corrected by category-associated attributes;
33) The sentence structure is parsed by syntactic analysis, and deep understanding of the first sentence of the text description is used to correct the entity category obtained in step 32), yielding the entity category corrected by deep understanding of the first sentence.
In the above entity classification method for linked data, further, the segmentation methods of step 1) include forward maximum matching, backward maximum matching, and statistical sequence labeling.
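The two dictionary-based segmentation methods named above can be illustrated with a minimal sketch. The lexicon, the `max_len` limit, and the single-character fallback are assumptions made for illustration; the patent does not specify them, and its embodiment relies on the Stanford CoreNLP toolkit rather than on code like this.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Scan left to right, always taking the longest lexicon word that fits."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is accepted as a fallback for unknown text.
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


def backward_max_match(text, lexicon, max_len=4):
    """Scan right to left, always taking the longest lexicon word that fits."""
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                j -= size
                break
    return tokens[::-1]


# Toy lexicon (hypothetical); a real system would load a full dictionary.
lexicon = {"研究", "研究生", "生命", "起源"}
print(forward_max_match("研究生命起源", lexicon))
print(backward_max_match("研究生命起源", lexicon))
```

On this deliberately ambiguous string the two directions disagree, which is one reason statistical sequence labeling is listed as a third option.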
In the above entity classification method for linked data, further, step 2) uses two segmentation granularities: segmentation with named entity recognition and segmentation without named entity recognition.
In the above entity classification method for linked data, further, the statistical classification model is a maximum entropy model. The multi-granularity model fusion of step 31) uses Formula 1 to compute the fused probability distribution predicted by the classifiers of different segmentation granularities, thereby fusing the entity category results that the maximum entropy classification models trained at the various granularities produce for an entity page:
$$P_{\mathrm{multi}}(y \mid x) = \lambda P_w(y \mid x) + (1 - \lambda) P_n(y \mid x) \quad \text{(Formula 1)}$$
In Formula 1, $P_{\mathrm{multi}}(y \mid x)$ is the probability distribution obtained by fusing the predictions of the classifiers of different segmentation granularities; $P_w(y \mid x)$ is the probability distribution predicted for sample $x$ by the maximum entropy classification model that uses only word segmentation as features; $y$ is the sample category and $x$ is the sample; $P_n(y \mid x)$ is the probability distribution predicted by the maximum entropy model whose features add named entity annotation on top of word segmentation; $\lambda$ is the parameter that adjusts the linear interpolation weight.
In the above entity classification method for linked data, further, parsing the sentence structure by syntactic analysis in step 33) to obtain the entity category corrected by deep understanding of the first sentence comprises the following steps:
331) Perform dependency parsing on the first sentence of the entity description and determine whether the object of the first sentence is a judgment-sentence object (the object of a sentence of the form "X is Y");
332) Train Chinese word vectors on a large unlabeled corpus, define word similarity, compute the word similarity between the word vector of each category and the judgment-sentence object, and take the category with the highest word similarity;
333) Using cosine similarity with a preset threshold, when the cosine similarity between the word vector of the judgment-sentence object and that of its most similar category exceeds the threshold, correct the category of the entity to that most similar category.
In the above entity classification method for linked data, further, after the preliminary prediction has been corrected in the post-processing stage to obtain the corrected entity category, a confusion matrix is used to identify difficult entity categories; for the difficult categories so identified, the entity category result is verified by link analysis and affix analysis. The confusion matrix identification works as follows: on the validation set, when the prediction precision of the statistical classification model for an entity category $y_i$ does not reach 90%, category $y_i$ is regarded as a difficult entity category.
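The difficult-category test can be sketched from a confusion matrix built on the validation set. This sketch assumes that "prediction precision" for a category means the share of pages predicted as that category whose gold label agrees; the function name and data layout are hypothetical.

```python
from collections import Counter


def difficult_categories(gold, predicted, threshold=0.9):
    """Return the categories whose validation precision is below the threshold."""
    # Confusion matrix stored sparsely as (gold label, predicted label) -> count.
    confusion = Counter(zip(gold, predicted))
    predicted_totals = Counter(predicted)
    difficult = set()
    for label, total in predicted_totals.items():
        precision = confusion[(label, label)] / total
        if precision < threshold:
            difficult.add(label)
    return difficult


gold = ["game", "game", "anime", "anime", "city"]
pred = ["game", "anime", "anime", "anime", "city"]
print(difficult_categories(gold, pred))
```

Categories that the classifier never predicts on the validation set have no defined precision here and are simply not reported; a real system would need a policy for them.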
Further, the link analysis method is as follows: let y′ be the category the classifier predicts for entity page e, and let N(e) be the set of entity pages that e links to. Find the pages in N(e) that carry category labels and determine the category that is most frequent among them, denoted y*. When y* and the prediction y′ disagree, y* is used to correct y′, so the category of entity page e becomes y*.
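The link analysis rule amounts to a majority vote over the labeled neighbors. A minimal sketch, assuming the links and the known labels are available as plain Python collections (the names are hypothetical):

```python
from collections import Counter


def correct_by_links(predicted, linked_pages, known_labels):
    """Replace the prediction y' with y*, the most frequent label among labeled neighbors."""
    labels = [known_labels[page] for page in linked_pages if page in known_labels]
    if not labels:
        # No labeled neighbor: nothing to vote with, so keep the prediction.
        return predicted
    majority, _ = Counter(labels).most_common(1)[0]
    return majority


known = {"page_a": "game", "page_b": "game", "page_c": "anime"}
print(correct_by_links("city", ["page_a", "page_b", "page_c", "page_d"], known))
```

The source does not say how ties between equally frequent categories are resolved; this sketch keeps whichever tied category was encountered first.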
In the above entity classification method for linked data, further, the affix analysis method is as follows: for entity categories whose entity names end in fixed Chinese characters, affix information associated with each entity type is learned from large-scale unlabeled data. Frequency statistics are computed over the affixes of the words closest to each category, giving the affixes associated with the difficult entity categories, and the category of an entity is obtained by analyzing its affix.
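One possible reading of the affix analysis is sketched below. Several details are assumptions not stated in the source: the affix is taken to be the final character of a name, the "closest words" of each category are supplied as ready-made lists, and an affix counts only if it occurs at least `min_count` times. The categories and words are toy data.

```python
from collections import Counter


def learn_suffixes(nearest_words, min_count=2):
    """For each category, keep the final characters that recur among its nearest words."""
    suffixes = {}
    for category, words in nearest_words.items():
        counts = Counter(word[-1] for word in words if word)
        suffixes[category] = {ch for ch, n in counts.items() if n >= min_count}
    return suffixes


def category_from_suffix(name, suffixes):
    """Return the single category whose learned suffixes match the name, else None."""
    matches = [c for c, chars in suffixes.items() if name and name[-1] in chars]
    return matches[0] if len(matches) == 1 else None


# Toy nearest-word lists (hypothetical), e.g. as retrieved with word vectors.
nearest = {
    "hospital": ["人民医院", "中心医院", "卫生院"],
    "company": ["有限公司", "分公司", "集团"],
}
suffixes = learn_suffixes(nearest)
print(category_from_suffix("协和医院", suffixes))
```

Returning `None` when zero or several categories match keeps the step conservative, so an ambiguous affix leaves the earlier prediction untouched.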
The present invention also provides an entity classification system for linked data that implements the above method, comprising a preprocessing module, a statistical classification module, and a post-processing module. The preprocessing module segments the text description of the entity page and extracts the infobox attribute names and the resulting words as the feature representation of the entity page. The statistical classification module trains classification models with the maximum entropy classification algorithm and identifies the entity category from the description of the entity in the entity page. The post-processing module corrects the entity category produced by the statistical classification module using multi-granularity model fusion, category-associated attributes, and deep understanding of the first sentence, yielding the corrected entity category.
In the above entity classification system for linked data, the word segmentation tool is the Stanford CoreNLP toolkit, and the classification model uses the maximum entropy classifier software package Maxent.
Compared with the prior art, the beneficial effects of the present invention are as follows:
The present invention provides an entity classification method and system for linked data that achieves highly accurate entity classification through a statistical classification process and a post-processing process. On top of the basic text classification, the result of classifying the entity description text is corrected using the following methods:
(1) Fusion of models trained at multiple word segmentation granularities, which overcomes the weakness of a single segmentation granularity in text feature extraction;
(2) Correction of the fused entity category using category-associated attribute information, which fixes obvious errors;
(3) Deep understanding of the first sentence, which reduces the effect of text noise;
(4) Identification of difficult samples, whose results are verified by methods such as link analysis and affix analysis.
Existing entity classification methods perform no further processing after classification and so cannot correct results that may be wrong, whereas the present invention uses post-processing to correct the possible errors of the text-based statistical classification module. The proposed technical solution is easy to implement and debug, efficient, and accurate; it is well suited to enterprises managing information in linked data, and it classifies entities with high accuracy. In the JIST2015 entity classification evaluation competition, the accuracy of the present solution was 98.6%, the highest among the classification schemes in that evaluation.
Brief description of the drawings
Fig. 1 is a flow diagram of the entity classification method for linked data provided by the present invention.
Fig. 2 is a structural block diagram of the entity classification system for linked data provided by an embodiment of the present invention.
Fig. 3 is a flow diagram of the first-sentence deep understanding step in the method provided by the present invention.
Detailed description of the embodiments
The present invention is further described below through embodiments with reference to the accompanying drawings, without limiting the scope of the invention in any way.
The present invention provides an entity classification method and system for linked data that achieves highly accurate entity classification through a statistical classification process and a post-processing process. The statistical classification process classifies entities by modeling their text information; the post-processing process uses rich resources (such as affix information and link data) to correct the statistical classification result. Fig. 1 is a flow diagram of the entity classification method for linked data provided by the present invention. As shown in Fig. 1, the method comprises preprocessing, statistical classification, and post-processing. The entity page is first segmented and its features extracted, and the extracted features are used to train statistical classification models. For the classification result, multi-granularity model fusion is first used to correct the prediction errors of a single model; category-associated attribute information is then used to correct the fused entity category and so fix some obvious prediction errors; and the first sentence of the entity page description is then analyzed in depth to determine the category. For samples of categories that are hard to classify correctly, the invention can correct the category again through link analysis and affix analysis. The specific steps are:
1) Preprocess the entity page, including Chinese word segmentation (typical methods are forward maximum matching, backward maximum matching, and statistical sequence labeling) and feature extraction (word features and infobox attribute name features are extracted to represent the page), to obtain the entity page features;
2) Using the entity page features extracted in step 1), classify the entity page with maximum entropy models at several segmentation granularities to obtain a preliminary prediction of the entity category;
In this embodiment, two classifiers are trained with the maximum entropy model. The feature representation of one classifier consists of the words produced by segmentation with named entity recognition plus the infobox attributes; that of the other consists of the words produced by segmentation without named entity recognition plus the infobox attributes.
3) Post-process the entity category obtained in step 2) to verify whether the classification result is reliable, with the following steps:
31) Fuse the classification results of the classifiers trained on features from different segmentation granularities;
In this embodiment, two segmentation granularities are used: segmentation with named entity recognition and segmentation without named entity recognition;
32) Build a category attribute database in advance, and use the category-associated attribute information in it to correct obvious prediction errors;
33) Use a parser to deeply understand the first sentence of the text description, analyzing the sentence structure with methods such as syntactic analysis to obtain entity category information and correct the earlier prediction;
34) Use a confusion matrix to identify categories that are hard to classify correctly, and further verify predictions for those categories, including:
341) Correcting the entity category using the categories of the neighboring pages linked from the entity page;
342) Correcting the entity category using the affix information of the entity page.
Fig. 2 is a structural block diagram of the entity classification system for linked data provided by an embodiment of the present invention. The entity classification system for linked data comprises a preprocessing module, a statistical classification module, and a post-processing module. Each module is described further below:
Preprocessing module
An entity page in linked data generally contains a text description and an infobox.
In the preprocessing module, the Stanford CoreNLP toolkit is used to segment the text description in the entity page. This embodiment uses two different segmentation granularities: with named entity recognition and without named entity recognition. For example, "New York Times Square" is treated as a single word under segmentation with named entity recognition, whereas under segmentation without named entity recognition it is split into three words: "New York", "Times", and "Square".
After the Chinese text has been segmented, the infobox attribute names are extracted together with the words obtained by segmentation as the feature representation of the entity page.
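The feature representation described above can be sketched as a sparse dictionary that combines bag-of-words counts with infobox attribute names. The `word=` and `attr=` prefixes and the function name are illustrative assumptions; the patent does not prescribe a feature encoding.

```python
def page_features(tokens, infobox):
    """Build a sparse feature dict from segmented words and infobox attribute names."""
    features = {}
    for token in tokens:
        key = "word=" + token
        features[key] = features.get(key, 0) + 1  # bag-of-words count
    for attribute in infobox:
        # Only the attribute name is used as a feature, not its value.
        features["attr=" + attribute] = 1
    return features


# The same page under the two segmentation granularities of the embodiment.
infobox = {"location": "Manhattan"}
print(page_features(["New York Times Square"], infobox))
print(page_features(["New York", "Times", "Square"], infobox))
```

Keeping words and attribute names in separate namespaces prevents a word in the description from colliding with an attribute of the same name.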
Statistical classification module
The present invention mainly uses the description of the entity in the entity page as the basis for judging the entity category. It trains the classification model with the maximum entropy classification algorithm, a log-linear model commonly used in natural language processing. As mentioned for the preprocessing module, the features used by the statistical classification module comprise word features and infobox attribute features. The word features are the classic bag-of-words representation. Infobox attribute features play a very important role in identifying the entity category; for example, "date of birth" is likely to be associated with entities of the person type.
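As a log-linear model, a maximum entropy classifier scores each category by a weighted sum of the active features and normalizes the exponentiated scores. The sketch below shows only this prediction step with hand-picked weights; it is not the API of the Maxent package, and training (fitting the weights) is omitted.

```python
import math


def maxent_predict(features, weights, labels):
    """Return P(y | x) for each label under a log-linear (maximum entropy) model."""
    scores = {
        y: sum(weights.get((y, f), 0.0) * value for f, value in features.items())
        for y in labels
    }
    top = max(scores.values())  # subtract the maximum for numerical stability
    exp_scores = {y: math.exp(s - top) for y, s in scores.items()}
    z = sum(exp_scores.values())
    return {y: e / z for y, e in exp_scores.items()}


# Hypothetical weights: the attribute "date of birth" is evidence for "person".
weights = {("person", "attr=date of birth"): 2.0}
dist = maxent_predict({"attr=date of birth": 1}, weights, ["person", "city"])
print(dist)
```

The result is a full probability distribution over categories, which is what the later fusion step interpolates.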
In the text classification module, word segmentation at different granularities is used to train the text classification models, because in some cases a single segmentation granularity cannot meet the needs of classification. For example, if "New York Times Square" is treated as one named entity, it is less useful for classification than the split into "New York", "Times", and "Square", because the word "Square" has a decisive influence on the classification. On the other hand, without named entity recognition a personal name such as "Zhang Yishan" is split into the three single characters "Zhang", "Yi", and "Shan", which also harms the classification result. Therefore, in the statistical classification module, this embodiment uses the maximum entropy classifier software package Maxent (available at http://homepages.inf.ed.ac.uk/lzhang10/maxent_toolkit.html) to train two classification models: one with fine-grained segmentation that includes named entity recognition, and one with simple coarse-grained word segmentation.
Post-processing module
Because the text-based statistical classification module may make errors, the present invention corrects its output with a post-processing module. The post-processing module can carry out the following procedures:
31) Multi-granularity model fusion
Model fusion is widely used in machine learning, but most model fusion methods fuse different kinds of machine learning models. For natural language (especially Chinese), differences in segmentation granularity affect the performance of the whole model. Given the respective strengths and weaknesses of different segmentation granularities, the invention proposes using model fusion so that the classification models obtained at the various granularities compensate for each other's weaknesses.
We define $P_w(y \mid x)$ as the probability distribution over categories $y$ that the maximum entropy classification model using only word segmentation as features predicts for sample $x$, and $P_n(y \mid x)$ as the probability distribution predicted by the maximum entropy model whose features add named entity annotation on top of word segmentation. The results of the two classifiers are fused as follows:
$$P_{\mathrm{multi}}(y \mid x) = \lambda P_w(y \mid x) + (1 - \lambda) P_n(y \mid x) \quad \text{(Formula 1)}$$
In Formula 1, $P_{\mathrm{multi}}(y \mid x)$ is the probability distribution obtained by fusing the predictions of the classifiers of different segmentation granularities; $P_w(y \mid x)$ is the probability distribution predicted for sample $x$ by the maximum entropy classification model that uses only word segmentation as features; $y$ is the sample category and $x$ is the sample; $P_n(y \mid x)$ is the probability distribution predicted by the maximum entropy model whose features add named entity annotation on top of word segmentation; $\lambda$ is the parameter that adjusts the linear interpolation weight. In this embodiment, $\lambda = 0.5$.
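Formula 1 is a linear interpolation of two probability distributions and can be sketched directly. The dictionary representation and the function name are assumptions for illustration; the distributions below are made-up numbers.

```python
def fuse(p_w, p_n, lam=0.5):
    """Formula 1: P_multi(y|x) = lam * P_w(y|x) + (1 - lam) * P_n(y|x)."""
    labels = set(p_w) | set(p_n)
    return {
        y: lam * p_w.get(y, 0.0) + (1 - lam) * p_n.get(y, 0.0)
        for y in labels
    }


p_w = {"game": 0.6, "anime": 0.4}   # word-segmentation-only classifier
p_n = {"game": 0.3, "anime": 0.7}   # classifier with named entity features
fused = fuse(p_w, p_n, lam=0.5)
print(max(fused, key=fused.get))
```

Because both inputs sum to one and the weights sum to one, the fused values again form a probability distribution, so the category with the largest fused value can be taken as the fused prediction.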
32) Correcting predictions with category-associated attributes
This module uses category-associated attributes to correct some obviously wrong category predictions. It mainly exploits the category specificity of infobox attributes. As shown in Table 1, certain attributes cannot be associated with certain categories. For example, "gaming platform" cannot be associated with a city entity. The specificity of these attributes can therefore be used to correct obvious prediction errors of the classifier. For the predefined entity types, the present invention builds the category attribute database manually and uses it to correct predictions.
Table 1. Examples of category-associated attributes
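The attribute-based correction can be sketched as a filter over the fused distribution. The source states only that some attributes rule out some categories; representing the database as a map from attribute name to excluded categories, and falling back to the most probable remaining category, are assumptions of this sketch.

```python
def correct_by_attributes(distribution, infobox_attributes, excluded_by_attribute):
    """Pick the most probable category that no infobox attribute rules out."""
    excluded = set()
    for attribute in infobox_attributes:
        excluded |= excluded_by_attribute.get(attribute, set())
    allowed = {y: p for y, p in distribution.items() if y not in excluded}
    if not allowed:
        # Every category is ruled out: keep the original top prediction.
        return max(distribution, key=distribution.get)
    return max(allowed, key=allowed.get)


# Hypothetical database entry based on the example in the text.
database = {"gaming platform": {"city"}}
fused = {"city": 0.5, "game": 0.4, "anime": 0.1}
print(correct_by_attributes(fused, ["gaming platform"], database))
```

Pages whose infobox contains no listed attribute pass through unchanged, so the step only intervenes on the obvious conflicts the text describes.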
33) Deep understanding of the first sentence of the entity description with a dependency parser, to identify the entity category more precisely
In linked data (for example Wikipedia and Baidu Baike), the first sentence of the description on an entity page is usually a qualitative description of the entity (for example, "Pounding Six is a card game popular in Tianjin"). If the first sentence of the entity description can be understood in depth, it greatly helps identify the entity category precisely.
Fig. 3 is a flow diagram of the first-sentence deep understanding step in the method provided by the present invention. The invention first uses a dependency parser to find the judgment-sentence object in the first sentence of the text description of the entity page, and then uses that object to analyze the category of the entity page. The specific steps are:
331) Identifying the judgment-sentence object
The present invention uses the Stanford dependency parser to perform dependency parsing on the first sentence of the entity description and identify the subject, predicate, and object of the first sentence. If the object obtained from the dependency parse has a direct dependency relation with the copula "是" ("is"), the object is called a "judgment-sentence object"; otherwise it is called a "non-judgment-sentence object".
If the object of the first sentence of the entity text is a judgment-sentence object, it can be used as a clue to determine the category of the entity and thereby verify whether the prediction of the classifier is accurate. If the prediction of the classifier contradicts the conclusion drawn from the judgment-sentence object, that conclusion is used to correct the prediction. If the first sentence contains no judgment-sentence object, this step is skipped and processing moves to step 34).
For example, for the sentence "Pounding Six is a card game popular in Tianjin", dependency parsing gives "game" as the object of the sentence, and "game" has a direct dependency relation with "是" ("is"), so "game" is the judgment-sentence object of this sentence. If "game" is a predefined entity category in the entity classification system, it is taken as the category of the entity.
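The judgment-sentence test can be sketched over a dependency parse given as a set of (head, dependent) arcs. The arc representation, the function names, and the toy parse are assumptions; a real parser labels arcs with relations and word positions, and the label scheme of the Stanford parser is not reproduced here.

```python
def is_judgment_object(obj, arcs, copula="是"):
    """True if obj is directly linked to the copula, in either arc direction."""
    return (copula, obj) in arcs or (obj, copula) in arcs


def category_from_first_sentence(obj, arcs, categories):
    """Return obj as the category if it is a judgment-sentence object and a known category."""
    if obj is not None and is_judgment_object(obj, arcs) and obj in categories:
        return obj
    return None


# Hypothetical parse of a sentence shaped like "X 是 ... 游戏" ("X is a ... game").
arcs = {("是", "X"), ("是", "游戏"), ("游戏", "流行")}
print(category_from_first_sentence("游戏", arcs, {"游戏", "城市"}))
```

When the object is a judgment-sentence object but not a predefined category, the function returns `None`, and the word-similarity step described next takes over.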
332) Correcting the category prediction with the judgment-sentence object
In some cases, even when a judgment-sentence object has been found, it cannot be used indiscriminately to correct the prediction, because doing so may introduce unnecessary errors. Moreover, in many cases the judgment-sentence object does not exactly match a category name. Consider "Masako Nozawa is a famous Japanese voice actor": dependency parsing yields "voice actor" as the judgment-sentence object, but the predefined entity categories may not include a "voice actor" category. The invention therefore defines a correction condition: using word similarity, that is, the cosine similarity between word vectors trained on a large unlabeled corpus, it finds the category most similar to the judgment-sentence object, so that the category correction can be made reliably.
In natural language processing, the cosine similarity between word vectors is commonly taken as the semantic similarity of words. Specifically, this embodiment first uses the word2vec toolkit (https://word2vec.googlecode.com/svn/trunk/) to train Chinese word vectors on the Chinese Gigaword corpus (a publicly available data set), and uses the trained word vectors to find the category name semantically most similar to the judgment-sentence object. If the cosine similarity between the word vector of the judgment-sentence object and that of its most similar category exceeds a preset threshold (0.9 in this embodiment), the category of the entity is corrected to that most similar category.
To this end, we define $w_0$ as the judgment-sentence object of the first sentence of the text description of the entity page, $y \in Y$ as a category ($Y$ is the set of entity categories), and $\mathrm{sim}(w_1, w_2)$ as the cosine similarity between the word vectors of words $w_1$ and $w_2$. The correction condition is then given by Formula 2:
$$y^* = \arg\max_{y \in Y} \mathrm{sim}(w_0, y) \;\wedge\; \mathrm{sim}(w_0, y^*) > 0.9 \quad \text{(Formula 2)}$$
In Formula 2, $\wedge$ denotes logical AND. The part to the left of $\wedge$ states that $y^*$ is the category with the highest semantic similarity to $w_0$; the part to the right requires the similarity between $y^*$ and $w_0$ to exceed 0.9. The correction is made only when the condition holds: the original category prediction is replaced by $y^*$ only when $y^*$ is the most similar category and its similarity to $w_0$ exceeds 0.9.
In previous example (" the wild refined son in pool is that the famous sound of Japan is excellent "), we can be found out and " sound is excellent " most like class It is not that " performer " (as shown in table 2, table 2 is to utilize the term vector of the training from Chinese gigaword calculated and classification most phase As some vocabulary, wherein runic word indicates the similarity of these vocabulary and classification 0.9 or more), and find " performer " and " sound It is excellent " semantic similarity 0.9 or more, therefore, the classification of " wild pool refined son " this physical page is modified to " performer ".
The most like vocabulary of 2 classification of table
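The correction condition of Formula 2 can be illustrated with a small sketch. The three-dimensional vectors below are made-up toy values, not vectors trained on Chinese Gigaword, and the function names are hypothetical; only the 0.9 threshold and the "most similar category, then threshold" logic come from the description.

```python
import math
from typing import Dict, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def correct_with_object(
    predicted: str,
    object_word: str,
    categories: Sequence[str],
    vectors: Dict[str, Sequence[float]],
    threshold: float = 0.9,
) -> str:
    """Apply Formula 2: pick the most similar category y*, keep it only if
    sim(w0, y*) exceeds the threshold; otherwise keep the original prediction."""
    if object_word not in vectors:
        return predicted
    scored = [
        (cosine(vectors[object_word], vectors[c]), c)
        for c in categories
        if c in vectors
    ]
    if not scored:
        return predicted
    best_similarity, best_category = max(scored)
    return best_category if best_similarity > threshold else predicted


# Toy vectors (illustrative values only).
toy_vectors = {
    "voice actor": (0.9, 0.1, 0.0),
    "actor": (1.0, 0.0, 0.0),
    "city": (0.0, 1.0, 0.0),
    "lake": (0.5, 0.5, 0.7),
}
print(correct_with_object("city", "voice actor", ["actor", "city"], toy_vectors))
```

With these toy values "voice actor" is closest to "actor" with a similarity above 0.9, so a wrong prediction is replaced, whereas "lake" is not close enough to either category and the original prediction is left unchanged.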
34) Identifying difficult samples using the confusion matrix
In practical applications, samples of certain categories are often hard to tell apart; such samples are called difficult samples. For example, for entities of the two categories "city" and "scenic spot", the classifier often makes wrong predictions, because the descriptions and infobox attributes of these two kinds of entity are very similar. To improve classification precision, the present invention uses the confusion matrix to find the sample categories on which the classifier is prone to error. Specifically, if the prediction precision of the statistical classification model for an entity category yi on the validation set does not reach 90%, category yi is regarded as a difficult sample category. For example, suppose that on the validation set the statistical classification model predicts 18 entity pages as "city", but only 15 of them actually belong to "city". The model's prediction precision for "city" is then only 83.33% (15/18), and "city" is identified as a difficult sample category. Samples that the statistical classification model predicts as belonging to a difficult sample category are called difficult samples.
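As a rough sketch of this identification step, the per-category precision can be computed directly from gold and predicted labels on the validation set, which is equivalent to reading the columns of the confusion matrix. The function name and the toy label lists are hypothetical; the 90% threshold and the 15/18 example follow the text.

```python
from collections import Counter
from typing import Dict, Sequence


def difficult_categories(
    gold: Sequence[str], predicted: Sequence[str], min_precision: float = 0.9
) -> Dict[str, float]:
    """Return {category: precision} for every predicted category whose
    precision on the validation set is below min_precision."""
    predicted_count = Counter(predicted)
    correct_count = Counter(p for g, p in zip(gold, predicted) if g == p)
    result = {}
    for category, n_predicted in predicted_count.items():
        precision = correct_count[category] / n_predicted
        if precision < min_precision:
            result[category] = precision
    return result


# 18 pages predicted as "city", of which 15 really are cities;
# 10 pages predicted as "scenic spot", all correct.
gold_labels = ["city"] * 15 + ["scenic spot"] * 3 + ["scenic spot"] * 10
predicted_labels = ["city"] * 18 + ["scenic spot"] * 10
print(difficult_categories(gold_labels, predicted_labels))
```

Any page that the classifier later assigns to one of the returned categories would then be treated as a difficult sample and passed to the verification steps.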
For the difficult samples identified, we verify the results using the following two methods.
341) Link analysis
For a difficult sample, the content of the entity page alone may not be enough to make a correct judgment. The present invention therefore uses link analysis to verify the classification result of difficult samples.
In linked data, an entity page is usually linked to other entity pages related to it. In general, the categories of those linked entity pages are very likely to be the same as the category of the page itself. The categories of the other entity pages that an entity page links to can therefore help the system judge the entity's category more reliably.
Specifically, for an entity page e, we analyze the entity pages that e links to and denote their set by N(e). Some pages in N(e) carry category annotations. The present invention finds the pages in N(e) that have category annotations, determines the category y* that occurs most often among them, and checks whether y* agrees with the category prediction y′ that the classifier made for e. If they disagree, y′ is corrected to y*.
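The link-analysis check amounts to a majority vote over the annotated neighbours of a page. The sketch below assumes the links and annotations are available as plain Python collections; the function name, page identifiers and the tie-breaking behaviour (the first category encountered wins) are assumptions, not details given in the description.

```python
from collections import Counter
from typing import Dict, Iterable


def correct_by_links(
    predicted: str, linked_pages: Iterable[str], annotations: Dict[str, str]
) -> str:
    """Return the most frequent annotated category among the pages in N(e);
    fall back to the classifier's prediction if no neighbour is annotated."""
    neighbour_categories = [
        annotations[page] for page in linked_pages if page in annotations
    ]
    if not neighbour_categories:
        return predicted
    majority_category, _ = Counter(neighbour_categories).most_common(1)[0]
    # If the majority category equals the prediction nothing changes;
    # otherwise the prediction is replaced by the majority category.
    return majority_category


# Hypothetical page identifiers and annotations.
page_annotations = {"page_A": "scenic spot", "page_B": "scenic spot", "page_C": "city"}
print(correct_by_links("city", ["page_A", "page_B", "page_C", "page_D"], page_annotations))
```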
342) Affix analysis
For certain samples that are hard to distinguish, the present invention also uses affix analysis to verify the classification result. For some categories, entity names usually end with a fixed Chinese character. For example, "city" entities usually end with "city" or "county", while "scenic spot" entities usually end with "lake", "mountain" and the like. Table 3 lists examples of common entity-name affixes for these categories.
Table 3. Common entity-name affixes by category
| Category | Common affixes |
| --- | --- |
| Scenic spot | mountain, lake, city, road, scenery, river, ridge, cave |
| City | district, city, county |
The present invention is the first to propose learning the affix information associated with entity types from large-scale unlabeled data. Specifically, we train word vectors on the Chinese Gigaword data set with the word2vec toolkit, and then use cosine similarity to find the words semantically closest to each category (words whose word-vector cosine similarity with the category is 0.7 or more). Next, we count the frequency of the affixes of these most similar words separately for each of the two categories ("scenic spot" and "city" in this example), which yields the affixes associated with each difficult sample category, so that the category of an entity can be determined by analyzing its affix. Specifically, if the frequency of the affix s of an entity page in one category y1 is significantly higher (2 times or more) than its frequency in the other category y2, we take y1 as the entity's category and correct the original prediction. For example, for the "Mount Lushan Fairy Cave" entity page, the affix "cave" occurs much more frequently in the "scenic spot" category than in the "city" category, so the predicted category of the entity is corrected to "scenic spot".
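The affix rule can be sketched as below, under several assumptions: the affix is taken to be the final character of a name, the word lists stand in for the words that word2vec would return with similarity of 0.7 or more, and the Chinese strings and function names are illustrative toy data rather than values from the patent. Only the "2 times or more" ratio comes from the text.

```python
from collections import Counter
from typing import Dict, Iterable


def affix_counts(similar_words: Iterable[str]) -> Counter:
    """Count how often each final character occurs among a category's
    most similar words (assumed affix = last character)."""
    return Counter(word[-1] for word in similar_words if word)


def correct_by_affix(
    name: str,
    predicted: str,
    category_a: str,
    category_b: str,
    counts: Dict[str, Counter],
    ratio: float = 2.0,
) -> str:
    """Choose the category in which the name's affix is at least `ratio`
    times more frequent; otherwise keep the original prediction."""
    affix = name[-1]
    freq_a = counts[category_a][affix]
    freq_b = counts[category_b][affix]
    if freq_a > 0 and freq_a >= ratio * freq_b:
        return category_a
    if freq_b > 0 and freq_b >= ratio * freq_a:
        return category_b
    return predicted


# Toy stand-ins for the words most similar to each category.
toy_counts = {
    "scenic spot": affix_counts(["溶洞", "山洞", "西湖", "泰山"]),
    "city": affix_counts(["北京市", "大兴区", "密云县"]),
}
print(correct_by_affix("庐山仙人洞", "city", "scenic spot", "city", toy_counts))
```

An affix that appears in neither category's counts leaves the prediction untouched, which keeps the rule conservative for names with unseen endings.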
It should be noted that the embodiments are disclosed to help further understanding of the present invention, but those skilled in the art will understand that various substitutions and modifications are possible without departing from the spirit and scope of the present invention and the appended claims. The present invention should therefore not be limited to what is disclosed in the embodiments; the scope of protection claimed is defined by the claims.

Claims (10)

1. A linked-data-oriented entity classification method, wherein the linked data consists of a plurality of entity pages, each entity page containing a text description and an infobox; the entity classification method comprises a preprocessing stage, a statistical classification stage and a post-processing stage, and specifically comprises the following steps:
1) in the preprocessing stage, performing word segmentation on the text description information in the entity page to obtain word information; the attribute names of the infobox and the word information constitute the features of the entity page;
2) in the statistical classification stage, using the features of the entity page, training statistical classification models with multiple segmentation granularities to classify the entity pages, and obtaining preliminary prediction results of the entity category;
3) in the post-processing stage, correcting the preliminary prediction results of the entity category to obtain the corrected entity category; the correction comprises the following steps:
31) fusing, by a multi-granularity model fusion method, the preliminary prediction results of the entity category obtained by the statistical classification models trained with the multiple segmentation granularities, to obtain a fused entity category result;
32) constructing a category attribute database, and using the category-associated attribute information in the category attribute database to correct the fused entity category, obtaining the entity category corrected by category-associated attributes;
33) analyzing the sentence structure by a syntactic analysis method and, through deep understanding of the first sentence of the text description, correcting the entity category obtained in step 32), to obtain the entity category corrected by deep understanding of the first sentence.

2. The linked-data-oriented entity classification method of claim 1, characterized in that the word segmentation method in step 1) is one of a forward-backward maximum matching method, a backward maximum matching method and a statistical sequence labeling method.

3. The linked-data-oriented entity classification method of claim 1, characterized in that step 2) uses two segmentation granularities: a segmentation granularity with named entity recognition and a segmentation granularity without named entity recognition.

4. The linked-data-oriented entity classification method of claim 1, characterized in that the statistical classification model is a maximum entropy model; in step 31), the multi-granularity model fusion method computes by Formula 1 a probability distribution that fuses the predictions of the classifiers of different segmentation granularities, thereby fusing the entity category results obtained when the maximum entropy classification models trained with the multiple segmentation granularities classify the entity pages:
$$P_{multi}(y \mid x) = \lambda P_{w}(y \mid x) + (1 - \lambda) P_{n}(y \mid x) \qquad \text{(Formula 1)}$$
In Formula 1, P_multi(y|x) is the probability distribution that fuses the predictions of the classifiers of different segmentation granularities; P_w(y|x) is the probability distribution predicted for sample x by the maximum entropy classification model that uses only word segmentation as features; y is the sample category and x is the sample; P_n(y|x) is the probability distribution predicted by the maximum entropy model that adds named entity annotation to word segmentation as features; λ is the parameter that adjusts the linear interpolation weight.

5. The linked-data-oriented entity classification method of claim 1, characterized in that, in step 33), analyzing the sentence structure by the syntactic analysis method and obtaining the entity category corrected by deep understanding of the first sentence specifically comprises the following steps:
331) performing dependency parsing on the first sentence of the entity description, and identifying whether the object of the first sentence is a judgment-sentence object;
332) training Chinese word vectors on a large-scale unannotated corpus, defining lexical semantic similarity, computing the lexical semantic similarity between word vectors and the judgment-sentence object, and obtaining the word vector with the highest lexical semantic similarity;
333) by a cosine similarity calculation method, setting a cosine similarity threshold; when the cosine similarity between the judgment-sentence object and the word vector of its most similar category is greater than the cosine similarity threshold, correcting the category of the entity to the most similar category.

6. The linked-data-oriented entity classification method of claim 1, characterized in that, after the preliminary prediction results of the entity category have been corrected in the post-processing stage to obtain the corrected entity category, a confusion matrix is used to identify difficult entity categories; for the identified difficult entity categories, the entity category results are verified by a link analysis method and an affix analysis method; the confusion matrix identification method is specifically: on the validation set, when the prediction precision of the statistical classification model for an entity category yi does not reach 90%, category yi is regarded as a difficult entity category.

7. The linked-data-oriented entity classification method of claim 6, characterized in that the link analysis method is specifically: let y′ be the category prediction made by the classifier for entity page e, and denote the set of entity pages linked from entity page e by N(e); find the pages in N(e) that have category annotations, and determine the category that occurs most often among them, denoted y*; when category y* is inconsistent with the category prediction y′, use y* to correct y′, so that the category of entity page e is y*.

8. The linked-data-oriented entity classification method of claim 6, characterized in that the affix analysis method is specifically: for entity categories whose entity names end with fixed Chinese characters, using the affix information associated with entity types learned from large-scale unlabeled data, counting the frequency of the affixes of the most similar words separately to obtain the affixes associated with the difficult entity categories, and obtaining the category of the entity by analyzing its affix.

9. A linked-data-oriented entity classification system implemented with the linked-data-oriented entity classification method of any one of claims 1 to 8, characterized by comprising a preprocessing module, a statistical classification module and a post-processing module;
the preprocessing module is used to perform word segmentation on the text description information in the entity page, and to extract the infobox attribute names and the word information obtained by segmentation as the feature representation of the entity page;
the statistical classification module trains a classification model with the maximum entropy classification algorithm, and identifies the entity category from the description of the entity in the entity page;
the post-processing module is used to correct the entity category obtained by the statistical classification module by multi-granularity model fusion, category-associated attributes and deep understanding of the first sentence, obtaining the corrected entity category.

10. The linked-data-oriented entity classification system of claim 9, characterized in that the word segmentation tool is the Stanford CoreNLP toolkit, and the classification model uses the maximum entropy classifier software package Maxent.
CN201610213411.8A 2016-04-07 2016-04-07 A Linked Data-Oriented Entity Classification Method and System Expired - Fee Related CN105912625B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610213411.8A CN105912625B (en) 2016-04-07 2016-04-07 A Linked Data-Oriented Entity Classification Method and System

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610213411.8A CN105912625B (en) 2016-04-07 2016-04-07 A Linked Data-Oriented Entity Classification Method and System

Publications (2)

Publication Number Publication Date
CN105912625A CN105912625A (en) 2016-08-31
CN105912625B true CN105912625B (en) 2019-05-14

Family

ID=56745356

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610213411.8A Expired - Fee Related CN105912625B (en) 2016-04-07 2016-04-07 A Linked Data-Oriented Entity Classification Method and System

Country Status (1)

Country Link
CN (1) CN105912625B (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20190514