CN105912625B - A Linked Data-Oriented Entity Classification Method and System - Google Patents
- Publication number
- CN105912625B (application CN201610213411.8A)
- Authority
- CN
- China
- Prior art keywords
- entity
- classification
- category
- information
- sentence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/355—Creation or modification of classes or clusters
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Animal Behavior & Ethology (AREA)
- Computational Linguistics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses an entity classification method and system for linked data. To address the entity classification problem in linked data, the method comprises preprocessing, statistical classification, and post-processing. Preprocessing segments the text description in each entity page; the features of an entity page consist of the infobox attribute names and the words obtained by segmentation. Statistical classification trains statistical classification models at multiple segmentation granularities to classify entity pages, yielding a preliminary prediction of the entity category. Post-processing corrects the statistical classification result through model fusion, linguistic knowledge, link information, and correction of the fused entity category using category-associated attribute information. The technical solution is easy to implement and debug, efficient, and accurate; it is suited to information management of linked data and classifies entities with high accuracy.
Description
Technical field
The invention belongs to the field of information processing and relates to linked data classification and search, and in particular to a method and system for classifying the entity pages in linked data with high accuracy.
Background art
We are now in the era of big data, and how to make the most of data to help computers process information has become one of the most popular research topics in information processing. In recent years, with the arrival of the Web 2.0 era, linked data (such as the Semantic Web and knowledge graphs) has attracted wide attention because of its powerful ability to describe relationships. Linked data refers to data organized in the manner of Baidu Baike and Wikipedia: each page corresponds to an entity, and entities link to one another, hence the name linked data. As the scale of such data keeps growing, managing linked data manually is impractical, and efficient methods and systems for information management of linked data are urgently needed.
Entity classification is an important technical problem in linked data knowledge management. Classifying the entities in linked data makes it possible to organize the large number of entity pages effectively and thereby improve the user's search and reading experience.
At present, the common approach to entity classification is to classify the description text of the entity. In many cases, however, this simple approach cannot determine the category of an entity accurately. Its shortcomings are mainly the following:
(1) Although a person can easily judge an entity's category from its text description, it is unrealistic to expect current feature-based statistical classification methods to do so with high accuracy. For example, the texts "X is an animation adapted from a famous game" and "A is a game made from a famous animation" are very similar at the lexical level, but the former describes an animation entity and the latter a game entity, so the entity types they describe are entirely different. Statistical classification based purely on text features therefore has insufficient accuracy and cannot reliably obtain the entity category.
(2) Many entity pages do not have enough text description. In that case, classifying the entity on the text description alone inevitably leads to classification errors, and the entity category cannot be obtained from the text description.
Summary of the invention
To overcome the above deficiencies of the prior art, the present invention provides an entity classification method and system for linked data that achieves highly accurate entity classification through a statistical classification process and a post-processing process. The statistical classification process classifies by modeling the text information. The post-processing process uses rich resources (such as affix information and link data) to correct the result of the statistical classification; it includes model fusion, linguistic knowledge, link information, and correction of the fused entity category using category-associated attribute information.
An entity page in linked data generally contains a text description and an infobox. The invention segments the text description and extracts the infobox attribute names, together with the words obtained by segmentation, as the feature representation of the entity page. Maximum entropy models trained at multiple segmentation granularities then classify the entity page, giving a preliminary prediction of the entity category. The predicted category is then post-processed to verify whether the classification result is reliable. Post-processing specifically includes: fusing the results of the classifiers trained on features from different segmentation granularities; correcting obvious prediction errors using the category-associated attribute information in a category attribute database; and deeply analyzing the first sentence of the text description, parsing the sentence structure with methods such as syntactic analysis to obtain entity category information and correct the earlier prediction. Preferably, a confusion matrix is also used to identify categories that are hard to classify correctly, and predictions of those categories are verified further by correcting the entity category using the categories of the neighboring pages that the entity page links to and using the affix information of the entity page.
The technical solution provided by the present invention is as follows:
An entity classification method for linked data, where the linked data is a plurality of entity pages and each entity page contains a text description and an infobox. The entity classification method comprises a preprocessing stage, a statistical classification stage, and a post-processing stage, and specifically comprises the following steps:
1) In the preprocessing stage, the text description in the entity page is segmented to obtain word information; the features of the entity page consist of the infobox attribute names and the word information;
2) In the statistical classification stage, the features of the entity page are used to train statistical classification models at multiple segmentation granularities; the models classify the entity page to obtain a preliminary prediction of the entity category;
3) In the post-processing stage, the preliminary prediction of the entity category is corrected to obtain the corrected entity category; the correction comprises the following steps:
31) By multi-granularity model fusion, the preliminary predictions obtained from the statistical classification models trained at different segmentation granularities are fused to obtain the fused entity category;
32) A category attribute database is constructed, and the category-associated attribute information in the category attribute database is used to correct the fused entity category, giving the entity category corrected by category-associated attributes;
33) Syntactic analysis is used to parse sentence structure: the first sentence of the text description is deeply analyzed to correct the entity category obtained in step 32), giving the entity category corrected by first-sentence analysis.
In the above entity classification method for linked data, further, the segmentation methods of step 1) include forward maximum matching, backward maximum matching, and methods based on statistical sequence labeling.
In the above entity classification method for linked data, further, step 2) uses two segmentation granularities: segmentation with named entity recognition and segmentation without named entity recognition.
In the above entity classification method for linked data, further, the statistical classification model is a maximum entropy model. The multi-granularity model fusion of step 31) uses Formula 1 to compute the fused probability distribution over the predictions of the classifiers at different segmentation granularities, thereby fusing the entity category results that the maximum entropy classification models trained at multiple segmentation granularities produce for an entity page:
P_{multi}(y | x) = λ P_w(y | x) + (1 - λ) P_n(y | x)   (Formula 1)
In Formula 1, P_{multi}(y | x) is the fused probability distribution over the predictions of the classifiers at different segmentation granularities; P_w(y | x) is the probability distribution predicted for sample x by the maximum entropy classification model that uses only word segmentation as features; y is the sample category and x is the sample; P_n(y | x) is the probability distribution predicted by the maximum entropy model whose features add named entity annotation on top of word segmentation; λ is the parameter that adjusts the linear interpolation weight.
In the above entity classification method for linked data, further, parsing the sentence structure by syntactic analysis in step 33) to obtain the entity category corrected by first-sentence analysis specifically comprises the following steps:
331) Perform dependency parsing on the first sentence of the entity description and identify whether the object of the first sentence is the object of a judgment ("is") sentence;
332) Train Chinese word vectors on a large-scale unannotated corpus, define word similarity, compute the word similarity between the word vectors and the judgment-sentence object, and obtain the word vector with the highest word similarity;
333) Using cosine similarity, set a cosine similarity threshold; when the cosine similarity between the word vectors of the judgment-sentence object and its most similar category is greater than the threshold, the entity's category is corrected to the most similar category.
In the above entity classification method for linked data, further, after the preliminary prediction is corrected in the post-processing stage and the corrected entity category is obtained, difficult entity categories are identified using a confusion matrix; for the identified difficult entity categories, the entity category results are verified by link analysis and affix analysis. The confusion matrix identification is specifically: on the validation set, when the prediction precision of the statistical classification model for an entity category y_i does not reach 90%, category y_i is regarded as a difficult entity category.
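The 90% criterion above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `difficult_categories` is hypothetical, and "prediction precision" is assumed to mean the fraction of pages predicted as y_i whose gold category is y_i (a per-category recall would be an alternative reading).

```python
from collections import Counter

def difficult_categories(gold, predicted, threshold=0.9):
    """Return the categories whose precision on a validation set is below threshold.

    gold and predicted are parallel lists of category labels. Precision for a
    category is (correct predictions of it) / (all predictions of it), which is
    the column-wise reading of a confusion matrix (an assumption of this sketch).
    """
    predicted_total = Counter(predicted)
    predicted_correct = Counter(p for g, p in zip(gold, predicted) if g == p)
    return {
        category
        for category, total in predicted_total.items()
        if predicted_correct[category] / total < threshold
    }

gold = ["game", "game", "anime", "city"]
predicted = ["game", "anime", "anime", "city"]
hard = difficult_categories(gold, predicted)
```

Only predictions that fall in the returned categories would then be passed on to link analysis and affix analysis.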
Further, the link analysis method is specifically: let the classifier's category prediction for entity page e be y', and denote the set of entity pages linked from e as N(e). Find the pages in N(e) that have category annotations and count them to obtain the most frequent category among those pages, denoted y*. When y* and the prediction y' disagree, y* is used to correct y', and the category of entity page e is taken to be y*.
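The link analysis rule can be sketched as a majority vote over the annotated neighbors. The function name `correct_by_links` and the dictionary-based inputs are hypothetical; the handling of ties and of pages with no annotated neighbors is not specified in the text and is an assumption here.

```python
from collections import Counter

def correct_by_links(predicted, linked_pages, known_labels):
    """Return y*, the most frequent category among annotated pages in N(e).

    predicted is the classifier output y'; linked_pages is N(e); known_labels
    maps a page identifier to its annotated category. If no linked page is
    annotated, the original prediction is kept (assumption of this sketch).
    Ties are resolved in favor of the category encountered first.
    """
    labels = [known_labels[page] for page in linked_pages if page in known_labels]
    if not labels:
        return predicted
    y_star, _ = Counter(labels).most_common(1)[0]
    return y_star

known = {"a": "game", "b": "game", "c": "anime"}
corrected = correct_by_links("anime", ["a", "b", "c", "d"], known)
```

When y* equals y' the prediction is unchanged, so returning y* covers both cases.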
In the above entity classification method for linked data, further, the affix analysis method is specifically: for entity categories whose entity names end in fixed Chinese characters, affix information associated with entity types is learned from large-scale unannotated data; frequency statistics over the affixes of the most similar words give the affixes associated with each difficult entity category, and the category of an entity is obtained by analyzing its affix.
The invention also provides an entity classification system for linked data that implements the above entity classification method, comprising a preprocessing module, a statistical classification module, and a post-processing module. The preprocessing module segments the text description in the entity page and extracts the infobox attribute names and the words obtained by segmentation as the feature representation of the entity page. The statistical classification module trains classification models with the maximum entropy classification algorithm and uses the description of the entity in the entity page to identify the entity category. The post-processing module corrects the entity category obtained by the statistical classification module using multi-granularity model fusion, category-associated attributes, and first-sentence analysis, giving the corrected entity category.
In the above entity classification system for linked data, the word segmentation tool is the Stanford CoreNLP toolkit, and the classification model uses the maximum entropy classifier software package Maxent.
Compared with the prior art, the beneficial effects of the invention are as follows:
The invention provides an entity classification method and system for linked data that achieves highly accurate entity classification through a statistical classification process and a post-processing process. On top of the basic classification of the text, the result of classifying the entity description text is corrected using the following methods:
(1) Multi-granularity word segmentation model fusion, which overcomes the limitations of a single segmentation granularity in text feature extraction;
(2) Correction of the fused entity category using category-associated attribute information, which corrects obvious errors;
(3) Deep analysis of the first sentence, which reduces the effect of text noise;
(4) Identification of difficult samples, whose recognition results are verified by methods such as link analysis and affix analysis.
Existing entity classification methods perform no further processing and cannot correct the result when the entity category may have been identified wrongly; the present invention uses post-processing to correct cases where the text-based statistical classification module may be wrong. The proposed technical solution is easy to implement and debug, efficient, and accurate, and is well suited to enterprises managing information in linked data; it classifies entities with high accuracy. In the JIST2015 entity classification evaluation competition, the accuracy of the scheme of the present invention was 98.6%, the highest accuracy among the classification schemes in that evaluation.
Description of the drawings
Fig. 1 is a flow diagram of the entity classification method for linked data provided by the invention.
Fig. 2 is a structural block diagram of the entity classification system for linked data provided by an embodiment of the invention.
Fig. 3 is a flow diagram of the first-sentence analysis step in the method provided by the invention.
Specific embodiments
The invention is further described below through embodiments with reference to the drawings, without limiting the scope of the invention in any way.
The present invention provides an entity classification method and system for linked data that achieves highly accurate entity classification through a statistical classification process and a post-processing process. The statistical classification process classifies by modeling the text information; the post-processing process uses rich resources (such as affix information and link data) to correct the statistical classification result. Fig. 1 is a flow diagram of the entity classification method for linked data provided by the invention. As shown in Fig. 1, the method comprises a preprocessing process, a statistical classification process, and a post-processing process. The entity page is first segmented and its features extracted, and the extracted features are used to train statistical classification models. For the classification results, multi-granularity model fusion is first used to correct the prediction errors of any single model; category-associated attribute information is then used to correct the fused entity category and remove obvious prediction errors; the first sentence of the entity page description is then analyzed in depth to determine its category. For samples in categories that are hard to classify correctly, the invention can correct the category again by link analysis and affix analysis. The specific steps are:
1) Preprocess the entity page, including Chinese word segmentation (typical segmentation methods are forward maximum matching, backward maximum matching, and methods based on statistical sequence labeling) and feature extraction (word features and infobox attribute name features are extracted to represent the page), to obtain the entity page features;
2) Using the entity page features extracted in step 1), classify the entity page with maximum entropy models at multiple segmentation granularities to obtain a preliminary prediction of the entity category;
In this embodiment, two classifiers are trained with the maximum entropy model. The features of one classifier are the words produced by segmentation with named entity recognition plus the infobox attributes; the features of the other are the words produced by segmentation without named entity recognition plus the infobox attributes.
3) Post-process the entity category obtained in step 2) to verify whether the classification result is reliable, specifically as follows:
31) Fuse the classification results of the classifiers trained on features from different segmentation granularities;
In this embodiment, two segmentation granularities are used: segmentation with named entity recognition and segmentation without named entity recognition;
32) Construct a category attribute database in advance, and correct obvious prediction errors using the category-associated attribute information in the database;
33) Deeply analyze the first sentence of the text description with a parser, using methods such as syntactic analysis to analyze the sentence structure and obtain entity category information, and correct the earlier prediction;
34) Use a confusion matrix to identify categories that are hard to classify correctly and further verify predictions of those categories, including:
341) correcting the entity category using the categories of the neighboring pages that the entity page links to;
342) correcting the entity category using the affix information of the entity page.
Fig. 2 is a structural block diagram of the entity classification system for linked data provided by an embodiment of the invention. The entity classification system for linked data comprises a preprocessing module, a statistical classification module, and a post-processing module. Each module is described further below:
Preprocessing module
An entity page in linked data generally contains a text description and an infobox.
In the preprocessing module, the Stanford CoreNLP toolkit is used to segment the text description in the entity page. This embodiment uses two segmentation granularities: with named entity recognition and without named entity recognition. For example, "New York Times Square" is treated as a single word under segmentation with named entity recognition, whereas under segmentation without named entity recognition it is split into three words: "New York", "Times", and "Square".
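The effect of the two granularities can be illustrated with forward maximum matching, one of the typical segmentation methods named earlier. This is only a sketch: the embodiment uses Stanford CoreNLP, whereas the function `forward_max_match` and the toy vocabularies below are hypothetical stand-ins in which named entity recognition is approximated by adding the entity name to the vocabulary.

```python
def forward_max_match(text, vocabulary, max_len=6):
    """Greedy left-to-right segmentation: take the longest vocabulary match.

    A single character is emitted when no longer match exists, so the loop
    always advances.
    """
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += size
                break
    return tokens

# Toy vocabularies (assumed for illustration): "New York", "Times", "Square".
base_vocab = {"纽约", "时代", "广场"}
# Adding the full entity name imitates segmentation with named entity recognition.
ne_vocab = base_vocab | {"纽约时代广场"}

fine = forward_max_match("纽约时代广场", base_vocab)
coarse = forward_max_match("纽约时代广场", ne_vocab)
```

With the entity name in the vocabulary the whole string is kept as one token; without it the string is split into three words, matching the example above.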
After the Chinese text is segmented, the infobox attribute names, together with the words obtained by segmentation, are extracted as the feature representation of the entity page.
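A minimal sketch of this feature representation follows. The function name `page_features` and the `word=` / `attr=` key prefixes are hypothetical, and the use of word counts (rather than binary indicators) for the bag-of-words part is an assumption.

```python
from collections import Counter

def page_features(tokens, infobox_attributes):
    """Build a sparse feature dictionary for one entity page.

    tokens are the words from the segmented text description;
    infobox_attributes are the attribute names found in the page's infobox.
    Prefixes keep word features and attribute features in separate namespaces.
    """
    features = Counter(f"word={token}" for token in tokens)
    for name in infobox_attributes:
        features[f"attr={name}"] = 1
    return dict(features)

example = page_features(["game", "card", "game"], ["platform"])
```

One such dictionary would be built per segmentation granularity, so the two classifiers see different word features but the same attribute features.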
Statistical classification module
The invention mainly uses the description of the entity in the entity page as the basis for judging the entity category. It trains the classification models with the maximum entropy classification algorithm, a log-linear model commonly used in natural language processing. As mentioned for the preprocessing module, the features used by the statistical classification module include word features and infobox attribute features. The word features are the classic bag-of-words representation. Infobox attribute features play a very important role in identifying an entity's category; for example, "date of birth" is likely to be associated with an entity of the person type.
In the text classification module, word segmentation at different granularities is used to train the text classification models, because in some cases a single segmentation granularity cannot meet the needs of classification. For example, if "New York Times Square" is treated as one named entity, it is less useful for classification than when it is split into "New York", "Times", and "Square", because the word "Square" has a decisive influence on classification. On the other hand, without named entity recognition, a name such as "Zhang Yishan" is split into "Zhang", "Yi", and "Shan", which also harms the classification result. Therefore, in the statistical classification module, this embodiment uses the maximum entropy classifier software package Maxent (available at http://homepages.inf.ed.ac.uk/lzhang10/maxent_toolkit.html) to train two classification models: one on plain fine-grained word segmentation and one on coarse-grained segmentation with named entity recognition.
Post-processing module
Because the text-based statistical classification module may make errors, the invention corrects its output with a post-processing module. The post-processing module can perform the following procedures:
31) Multi-granularity model fusion
Model fusion is widely used in machine learning, but most model fusion methods fuse different kinds of machine learning models. For natural language (especially Chinese), the choice of segmentation granularity affects the performance of the whole model. Because each segmentation granularity has its own strengths and weaknesses, the invention proposes using model fusion so that the classification models obtained at the various segmentation granularities complement one another.
We define P_w(y | x) as the probability distribution over category y predicted for sample x by the maximum entropy classification model that uses only word segmentation as features, and P_n(y | x) as the probability distribution predicted by the maximum entropy model whose features add named entity annotation on top of word segmentation. The results of the two classifiers are fused as follows:
P_{multi}(y | x) = λ P_w(y | x) + (1 - λ) P_n(y | x)   (Formula 1)
In Formula 1, P_{multi}(y | x) is the fused probability distribution over the predictions of the classifiers at different segmentation granularities; P_w(y | x) is the probability distribution predicted for sample x by the maximum entropy classification model that uses only word segmentation as features; y is the sample category and x is the sample; P_n(y | x) is the probability distribution predicted by the maximum entropy model whose features add named entity annotation on top of word segmentation; λ is the parameter that adjusts the linear interpolation weight. In this embodiment, λ = 0.5.
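Formula 1 is a linear interpolation of two probability distributions and can be sketched directly. The function names `fuse` and `predict` and the example probabilities are hypothetical; only the interpolation itself and λ = 0.5 come from the text.

```python
def fuse(p_w, p_n, lam=0.5):
    """Formula 1: P_multi(y|x) = lam * P_w(y|x) + (1 - lam) * P_n(y|x).

    p_w and p_n map each category y to its predicted probability for one
    sample x. A category missing from one distribution is treated as
    probability 0 (an assumption of this sketch).
    """
    categories = set(p_w) | set(p_n)
    return {y: lam * p_w.get(y, 0.0) + (1 - lam) * p_n.get(y, 0.0) for y in categories}

def predict(p_w, p_n, lam=0.5):
    """Return the category with the highest fused probability."""
    fused = fuse(p_w, p_n, lam)
    return max(fused, key=fused.get)

# Illustrative distributions only; they are not results from the patent.
p_word = {"game": 0.6, "anime": 0.4}
p_named = {"game": 0.3, "anime": 0.7}
fused_example = fuse(p_word, p_named)
```

In this toy case the word-only model prefers "game" while the fused distribution prefers "anime", which is the kind of single-model error the fusion step is meant to correct.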
32) Correcting predictions with category-associated attributes
This module uses category-associated attributes to correct some obviously wrong category predictions. It mainly exploits the category specificity of infobox attributes. As shown in Table 1, certain attributes cannot be associated with certain specific categories. For example, "gaming platform" cannot be associated with a city entity. The specificity of these attributes can therefore be used to correct obvious prediction errors of the classifier. For the predefined entity types, the invention establishes the category attribute database manually and uses it to correct predictions.
Table 1. Examples of category-associated attributes
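One way such a database could be applied is sketched below. The text does not specify the database layout or the correction rule, so everything here is an assumption: the database is modeled as a mapping from an attribute name to the set of categories it may appear with, and the correction picks the most probable category that is compatible with all known attributes of the page.

```python
def correct_with_attributes(fused, infobox_attributes, allowed_categories):
    """Pick the most probable category compatible with the page's attributes.

    fused maps each category to its fused probability; allowed_categories is a
    hypothetical category attribute database mapping an attribute name to the
    categories it can be associated with. Attributes absent from the database
    impose no constraint. If the constraints rule out every category, the
    unconstrained best category is returned.
    """
    candidates = set(fused)
    for name in infobox_attributes:
        if name in allowed_categories:
            candidates &= allowed_categories[name]
    if not candidates:
        return max(fused, key=fused.get)
    return max(candidates, key=fused.get)

database = {"gaming platform": {"game"}}
scores = {"city": 0.5, "game": 0.4, "anime": 0.1}
fixed = correct_with_attributes(scores, ["gaming platform"], database)
```

Here the classifier's top choice "city" is rejected because a page with a "gaming platform" attribute cannot be a city entity.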
33) Deeply analyze the first sentence of the entity description with a dependency parser to identify the entity category more precisely;
The first sentence of an entity page description in linked data (for example Wikipedia or Baidu Baike) is usually a qualitative description of the entity (for example: "Pounding Six is a card game popular in Tianjin"). If the first sentence of the entity description can be understood in depth, it greatly helps to identify the entity category accurately.
Fig. 3 is a flow diagram of the first-sentence analysis step in the method provided by the invention. The invention first uses a dependency parser to find the judgment-sentence object in the first sentence of the entity page's text description, and then uses that object to analyze the category of the entity page. The specific steps are as follows:
331) Identifying the judgment-sentence object
The invention uses the Stanford University dependency parser to perform dependency parsing on the first sentence of the entity description and obtain the subject, predicate, and object of the first sentence. If the object obtained from the dependency parse has a direct dependency on "是" ("is"), the object is called a "judgment-sentence object"; otherwise it is called a "non-judgment-sentence object".
If the object of the first sentence of the entity's text is a judgment-sentence object, it can be used as a clue to determine the entity's category and thus check whether the classifier's prediction is accurate. If the classifier's prediction contradicts the conclusion drawn from the judgment-sentence object, that conclusion is used to correct the classifier's prediction. If the first sentence has no judgment-sentence object, this step is skipped and processing proceeds to 34).
For example, for the sentence "Pounding Six is a card game popular in Tianjin", dependency parsing finds that "game" is the object of the sentence and that "game" has a direct dependency on "是" ("is"), so "game" is the judgment-sentence object of this sentence. If "game" is a predefined entity category in the entity classification system, it is used as the category of the entity.
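The identification step can be sketched on an already parsed sentence. The triple format `(head, relation, dependent)` and the relation labels `"subject"`, `"object"`, and `"modifier"` are hypothetical simplifications chosen for illustration; they are not the output format or label set of the Stanford parser, whose results would first have to be mapped into this form.

```python
def judgment_object(dependencies, copula="是"):
    """Return the object that depends directly on the copula, or None.

    dependencies is a list of (head, relation, dependent) triples in a
    simplified, assumed format.
    """
    for head, relation, dependent in dependencies:
        if head == copula and relation == "object":
            return dependent
    return None

def category_from_first_sentence(dependencies, categories):
    """Use the judgment-sentence object as the category if it is predefined."""
    obj = judgment_object(dependencies)
    return obj if obj in categories else None

# Hand-written parse of "X 是 ... 游戏" ("X is a ... game"); X is a placeholder.
parse = [
    ("是", "subject", "X"),
    ("是", "object", "游戏"),
    ("游戏", "modifier", "流行"),
]
found = category_from_first_sentence(parse, {"游戏", "城市"})
```

When the function returns None, either the sentence is not a judgment sentence or its object is not a predefined category; the latter case is what the similarity-based correction in step 332) addresses.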
332) Correcting the category prediction with the judgment-sentence object
In some cases, even when a judgment-sentence object has been found, it cannot be used indiscriminately to correct the prediction, because doing so may introduce unnecessary errors. Moreover, in many cases the judgment-sentence object does not exactly match a category name. For example, in "Masako Nozawa is a famous Japanese voice actress", dependency parsing identifies "voice actor" as the judgment-sentence object, but "voice actor" may not be among the predefined entity categories. The invention therefore defines a correction condition: word similarity, that is, the cosine similarity between word vectors learned from a large-scale unannotated corpus, is used to find the category most similar to the judgment-sentence object so that the category can be corrected reliably.
In natural language processing, cosine similarity between word vectors is commonly taken as the semantic similarity of words. Specifically, this embodiment first uses the word2vec toolkit (https://word2vec.googlecode.com/svn/trunk/) to train Chinese word vectors on the Chinese Gigaword corpus (a publicly available dataset), and uses the trained word vectors to find the category name semantically most similar to the judgment-sentence object. If the cosine similarity between the word vectors of the judgment-sentence object and its most similar category is greater than a preset threshold (0.9 in this embodiment), the entity's category is corrected to the most similar category.
To this end, let $w_0$ denote the judgment-sentence object of the first sentence of the entity page's text description, let $y \in Y$ denote a category ($Y$ is the set of entity categories), and let $\mathrm{sim}(w_1, w_2)$ denote the cosine similarity between the word vectors of words $w_1$ and $w_2$. The correction condition is then given by Formula 2:

$$y^{*} = \arg\max_{y \in Y} \mathrm{sim}(w_0, y) \;\wedge\; \mathrm{sim}(w_0, y^{*}) > 0.9 \qquad \text{(Formula 2)}$$

In Formula 2, $\wedge$ denotes logical AND. The part to the left of $\wedge$ states that $y^{*}$ is the category with the highest semantic similarity to $w_0$; the part to the right requires that the similarity between $y^{*}$ and $w_0$ be higher than 0.9. The correction is made only when the correction condition (Formula 2) is satisfied: only when $y^{*}$ is the category with the highest semantic similarity and its similarity to $w_0$ is higher than 0.9 is $y^{*}$ used to correct the original category prediction.
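The correction condition can be illustrated with a short sketch. It assumes that word vectors for the judgment-sentence object and for every category name have already been obtained (for example from a trained word2vec model); the function names and the two-dimensional toy vectors are hypothetical and serve only to show the argmax-plus-threshold logic.

```python
import math
from typing import Dict, Optional, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def corrected_category(
    w0_vec: Sequence[float],
    category_vecs: Dict[str, Sequence[float]],
    threshold: float = 0.9,
) -> Optional[str]:
    """Return y* if sim(w0, y*) > threshold, otherwise None (no correction)."""
    if not category_vecs:
        return None
    # Left-hand part of the condition: the most similar category.
    best = max(category_vecs, key=lambda y: cosine(w0_vec, category_vecs[y]))
    # Right-hand part of the condition: the similarity must exceed the threshold.
    if cosine(w0_vec, category_vecs[best]) > threshold:
        return best
    return None


# Toy vectors, invented for illustration only.
categories = {"actor": [1.0, 0.2], "city": [0.0, 1.0]}
print(corrected_category([1.0, 0.1], categories))
```

Returning `None` rather than the best match below the threshold keeps the original classifier prediction untouched, which matches the stated goal of correcting only when the match is reliable.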
In the preceding example ("Masako Nozawa is a famous Japanese voice actor"), the category most similar to "voice actor" is found to be "actor" (see Table 2, which lists some of the words most similar to each category, computed with word vectors trained on Chinese Gigaword; words in bold have a similarity of 0.9 or above with the category), and the semantic similarity between "actor" and "voice actor" is 0.9 or above. Therefore, the category of the "Masako Nozawa" entity page is corrected to "actor".
Table 2. Words most similar to each category
34) Identifying difficult samples using the confusion matrix
In practical applications, samples of certain categories are often hard to tell apart; such samples are called difficult samples. For example, for entities of the two categories "city" and "scenic spot", the classifier often makes wrong predictions, because the descriptions and infobox attributes of these two kinds of entities are very similar. To improve classification precision, the invention uses the confusion matrix to find the sample categories on which the classifier is prone to error. Specifically, if the prediction precision of the statistical classification model for an entity category $y_i$ on the validation set is below 90%, category $y_i$ is regarded as a difficult sample category. For example, on the validation set the statistical classification model predicts 18 entity pages as "city", but only 15 of those pages actually belong to "city"; the model's prediction precision for "city" is therefore only 83.33% (15/18), and "city" is identified as a difficult sample category. Samples that the statistical classification model predicts as belonging to a difficult sample category are called difficult samples.
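The precision check can be sketched as below. This is an illustrative helper, assuming the confusion matrix is stored as nested dictionaries indexed as `confusion[true_label][predicted_label]`; the function name and every count other than the 15 correct out of 18 predicted "city" pages are invented for the example.

```python
from typing import Dict, Set


def difficult_categories(
    confusion: Dict[str, Dict[str, int]], min_precision: float = 0.9
) -> Set[str]:
    """Return the categories whose prediction precision is below min_precision."""
    predicted_total: Dict[str, int] = {}
    correct: Dict[str, int] = {}
    for true_label, row in confusion.items():
        for pred_label, count in row.items():
            # Column sums give the number of pages predicted as each category.
            predicted_total[pred_label] = predicted_total.get(pred_label, 0) + count
            if pred_label == true_label:
                correct[pred_label] = correct.get(pred_label, 0) + count
    return {
        label
        for label, total in predicted_total.items()
        if total > 0 and correct.get(label, 0) / total < min_precision
    }


# 18 pages are predicted as "city", of which 15 are truly "city" (15/18 = 83.33%).
confusion = {
    "city": {"city": 15, "scenic spot": 1},
    "scenic spot": {"city": 3, "scenic spot": 19},
}
print(difficult_categories(confusion))
```

Any page that the classifier later assigns to one of the returned categories would then be treated as a difficult sample and passed on to the verification steps.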
For the difficult samples identified, the following two methods are used to verify the result.
341) Link analysis
For a difficult sample, the content of the entity page alone may not be enough to make a correct judgment; therefore, the invention uses link analysis to verify the classification result of difficult samples.
In linked data, an entity page is usually linked to other entity pages related to it. In general, the categories of the linked entity pages are very likely to be the same as the category of the page itself. Using the categories of the entity pages that a page links to can therefore help the system judge the category of that entity more accurately.
Specifically, for an entity page $e$, the entity pages linked from $e$ are analyzed, and their set is denoted $N(e)$. Some pages in $N(e)$ carry category annotation information. The invention finds the pages in $N(e)$ that have category annotations, determines the category $y^{*}$ that occurs most often among those pages, and checks whether it agrees with the category prediction $y'$ that the classifier made for $e$. If the two disagree, $y^{*}$ is used to correct $y'$.
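A minimal sketch of this majority vote follows. It assumes the annotated pages are available as a dictionary from page identifier to category; the function name and the page identifiers are hypothetical. The source does not say how ties between categories are broken, so the sketch simply takes the first most common category returned by `Counter.most_common`.

```python
from collections import Counter
from typing import Dict, Iterable


def verify_by_links(
    predicted: str, linked_pages: Iterable[str], known_labels: Dict[str, str]
) -> str:
    """Replace the prediction y' with the majority category y* of annotated neighbors."""
    # Only pages in N(e) that carry a category annotation take part in the vote.
    votes = Counter(known_labels[p] for p in linked_pages if p in known_labels)
    if not votes:
        return predicted  # no annotated neighbor, keep the original prediction
    majority, _ = votes.most_common(1)[0]
    return majority


# Hypothetical neighborhood N(e): three annotated pages and one unannotated page.
known = {"page_a": "scenic spot", "page_b": "scenic spot", "page_c": "city"}
print(verify_by_links("city", ["page_a", "page_b", "page_c", "page_d"], known))
```

When the majority category equals the classifier's prediction, the function returns the same value, so no separate consistency check is needed.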
342) Affix analysis
For certain samples that are hard to distinguish, the invention also uses affix analysis to verify their classification results. For certain categories, entity names usually end with fixed Chinese characters. For example, "city" entities usually end with "city" or "county", and "scenic spot" entities usually end with "lake", "mountain", and the like. Table 3 lists examples of common entity affixes by category.
Table 3. Common entity affixes by category
| Scenic spot | City |
|---|---|
| Mountain, lake, city, road, scenery, river, ridge, cave | District, city, county |
The invention is the first to propose learning the affix information associated with entity types from large-scale unlabeled data. Specifically, word vectors are trained on the Chinese Gigaword dataset with the word2vec toolkit, and the words semantically closest to each category (those whose word-vector cosine similarity to the category is 0.7 or above) are found by computing cosine similarity. Then, frequency statistics are gathered over the affixes of the closest words of each of the two categories, which yields the affixes associated with each difficult sample category, so that the category of an entity can be determined by analyzing its affix. Specifically, if the frequency of the affix $s$ of an entity page in a category $y_1$ is significantly higher (2 times or more) than its frequency in another category $y_2$, then $y_1$ is taken as the entity's category and the original prediction is corrected. For example, for the "Mount Lushan Fairy Cave" entity page, the affix "cave" appears in the "scenic spot" category with a markedly higher frequency than in the "city" category, so the predicted category of the entity is corrected to "scenic spot".
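The frequency comparison can be sketched as follows. This is an illustration under several assumptions that the source leaves open: the affix is taken to be the final character of the name, "frequency" is taken to be the relative frequency within each category's list of similar words, and the word lists below are invented stand-ins for the words that would be retrieved with the 0.7 similarity cutoff. The function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def affix_counts(similar_words: Iterable[str]) -> Counter:
    """Count the final characters of the words most similar to one category."""
    return Counter(word[-1] for word in similar_words if word)


def verify_by_affix(
    entity_name: str,
    predicted: str,
    counts_by_category: Dict[str, Counter],
    ratio: float = 2.0,
) -> str:
    """Correct the prediction if the entity's affix clearly favors one category."""
    if not entity_name or len(counts_by_category) < 2:
        return predicted
    affix = entity_name[-1]
    freqs = {}
    for category, counts in counts_by_category.items():
        total = sum(counts.values())
        freqs[category] = counts[affix] / total if total else 0.0
    ranked = sorted(freqs, key=freqs.get, reverse=True)
    best, runner_up = ranked[0], ranked[1]
    # Require the best category to be at least `ratio` times as frequent.
    if freqs[best] > 0 and freqs[best] >= ratio * freqs[runner_up]:
        return best
    return predicted


# Invented word lists standing in for the similar words of each category.
counts = {
    "scenic spot": affix_counts(["仙人洞", "水帘洞", "西湖", "泰山"]),
    "city": affix_counts(["北京市", "朝阳区", "平遥县"]),
}
print(verify_by_affix("庐山仙人洞", "city", counts))
```

An affix that never occurs in either list leaves the prediction unchanged, so the check only overrides the classifier when the affix evidence is present and one-sided.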
It should be noted that the embodiments are disclosed to aid further understanding of the invention, but those skilled in the art will understand that various substitutions and modifications are possible without departing from the spirit and scope of the invention and the appended claims. Therefore, the invention should not be limited to what is disclosed in the embodiments, and the scope of protection claimed by the invention is defined by the claims.
Claims (10)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201610213411.8A CN105912625B (en) | 2016-04-07 | 2016-04-07 | A Linked Data-Oriented Entity Classification Method and System |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN105912625A CN105912625A (en) | 2016-08-31 |
| CN105912625B true CN105912625B (en) | 2019-05-14 |
Family
ID=56745356
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201610213411.8A Expired - Fee Related CN105912625B (en) | 2016-04-07 | 2016-04-07 | A Linked Data-Oriented Entity Classification Method and System |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN105912625B (en) |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| C06 | Publication | ||
| PB01 | Publication | ||
| C10 | Entry into substantive examination | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||
| GR01 | Patent grant | ||
| CF01 | Termination of patent right due to non-payment of annual fee | ||
| CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20190514 |