(2) Background Art
With the rapid development of information technology, and of Internet technology in particular, the ways in which people obtain and exchange information have changed, and a new form of public opinion, online public opinion, has emerged. Online public opinion is the sum of the views that the public (netizens) express on a public affair or a topical issue through network platforms, in online language or by other means. Compared with traditional media, online public opinion draws on a rich range of sources, including news comments, BBS forums, chat rooms, blogs, and aggregated news (RSS); it also spreads quickly, reaches widely, and has a deep influence. Managing online public opinion is therefore much harder than managing public opinion in traditional society. According to the "Statistical Report on Internet Development in China" issued by the China Internet Network Information Center, the number of netizens in China has surpassed that of the United States and ranks first in the world. The situation facing online public opinion management in China is therefore more severe, and new technologies and methods are urgently needed to support it.
Network topics are the basic elements through which online public opinion is expressed, and managing network topics is the most basic and most important link in managing online public opinion. The life cycle of a network topic generally comprises three stages: emergence, survival, and extinction. The survival stage is the most important. During this stage the content related to the topic is continually extended, and as events develop that content also keeps changing, until the topic enters the extinction stage. The phenomenon in which the content of a topic keeps changing as its constituent elements develop is called topic evolution. Topic evolution shows how a topic develops and changes during its survival stage, and it is the most important basis for understanding and managing online public opinion.
Techniques for analyzing the evolution of network topics build on network topic detection and tracking and are an extension and improvement of it. The following technical solutions for network topic evolution analysis have been proposed in succession. Zhao Hua et al. [Research on topic detection oriented to dynamic evolution [J]. Hi-Tech Communication, 2006, 16(12): 1230-1235], while studying topic detection, proposed a method for analyzing the dynamic evolution of topics based on a dual-centroid topic model. The method uses an initial centroid and a current centroid to represent, respectively, the content the topic focused on early and the content it focuses on now. The creation of a split point marks the appearance of new topic content, and the initial and current centroids are updated whenever a split point is created. The method thus retains the early content of the topic while capturing newly emerging content promptly. However, although the dual-centroid method detects the appearance of a new aspect of a topic in time, it cannot correctly identify the evolution of the topic content when reports on the different aspects of the topic arrive out of order.
Wu Pingbo et al. [Research on intelligent retrieval of event-related documents based on event frames [J]. Journal of Chinese Information Processing, 2003, 17(6): 25-30] proposed a topic evolution analysis method based on the idea of event frames. Reports on each aspect of a topic are collected manually, aspect keywords are extracted from them, and a more complete event frame is built; the evolution of the topic content is then analyzed against this frame in order to improve the accuracy of topic detection and tracking. This method requires reports on the different aspects of the topic to be collected in advance and aspect keywords to be extracted, so it involves too much manual intervention, and the performance of topic detection and tracking depends heavily on whether the information on every aspect has been collected completely.
Wang Huizhen et al. [Adaptive Chinese topic tracking based on feedback learning [J]. Journal of Chinese Information Processing, 2005, 20(3), 92-98] addressed the influence of topic evolution on topic tracking, that is, the topic drift phenomenon, and proposed an adaptive topic tracking method based on feedback learning. The topic model is revised incrementally, every revised topic model is retained, and subsequent reports are tracked with a linear combination of these topic models, so as to counter the effect of topic evolution on topic tracking.
Juha Makkonen [Investigations on event evolution in TDT. In Proceedings of Student Workshop of Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL), Edmonton, Canada, 2003: 43-48] divided the traditional single vector into four sub-vectors by word meaning, corresponding to the basic elements of a news event: time, place, persons, and event content. The similarities of the four semantic vectors are computed separately and finally combined into a single similarity. The similarity between events (or reports) is judged on this basis, and the part of the current topic whose content has evolved is analyzed.
Topic evolution is closely tied to the passage of time, so the temporal data associated with topic content is an important basis for analyzing topic evolution. Chih-Ping Wei [IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans, 2007, 37(2), 273-283] proposed a method for mining event evolution patterns from document sequences; a document sequence is formed by arranging the documents to be analyzed in chronological order, and each document is assumed to concern only certain aspects of the topic. Jia Ziyan [An event detection and tracking algorithm based on a dynamic evolution model [J]. Computer Research and Development, 2004, 41(7): 1273-1280] held that the smaller the time difference between a topic and a report, the greater their similarity. Time was introduced into the topic similarity calculation, yielding a similarity model based on time distance. This model is reasonably effective at distinguishing different topics but is not suited to distinguishing the different aspects of the same topic.
An analysis of the existing methods and techniques shows that their shortcomings can be attributed to the following two points:
1. Topic model representation: for a given topic, the focus of its content often changes dynamically as the related events develop; that is, topic content is multi-faceted. The topic models proposed so far all have difficulty describing this multi-faceted nature and cannot fully present a topic as it evolves.
2. Topic modeling: existing methods merely apply traditional clustering to the news reports that belong to the same topic and cannot reflect the dynamic development inside the topic.
(3) Summary of the Invention
The object of the present invention is to provide a network topic content evolution analysis device that can analyze and present the dynamic evolution of network topic content accurately and comprehensively, and that provides more advanced technical support for network-oriented intelligent information processing and public opinion analysis. A further object of the present invention is to provide a network topic content evolution analysis method.
The object of the present invention is achieved as follows:
The network topic dynamic evolution analysis device of the present invention comprises a network event data collection device, a network event data preprocessing device, a topic content evolution analysis device, and an output device, connected in sequence. The network event data collection device actively obtains, in real time, raw data describing events related to network topics from the Internet and stores it. The network event data preprocessing device parses the raw network event description data stored by the collection device, filters out the noise in it, extracts the core data actually related to the network events, performs feature definition and extraction on the core data, and represents it in vector space model form. The preprocessed data is input to the topic evolution analysis device, which clusters the events related to a topic and analyzes the dynamic development and evolution of the events inside the topic. The output device outputs the system's topic evolution analysis results.
The network topic content evolution analysis device of the present invention may further include the following:
1. The network event data preprocessing device consists of a network event data purification unit and a network event data representation unit. The purification unit removes interfering information from web pages and extracts the news content accurately. The representation unit performs Chinese word segmentation on the extracted news content and then represents it as a vector.
2. The topic content dynamic evolution analysis device consists of a similarity calculation unit, a topic multi-center establishment unit, and a topic center update unit. The similarity calculation unit computes the similarity between a collected report and each topic center and determines the topic class to which the report belongs. Once the topic class of the current report has been determined, the topic multi-center establishment unit decides which topic center the report belongs to by comparing the number of different features between the current news report and the existing centers of the topic. When a new report is added to a center of a topic, the topic center update unit updates the vector representation of that center.
The concepts involved in the network topic content evolution analysis method of the present invention are explained as follows:
Multi-center structure: a center of a topic represents one aspect of the topic. The multi-center structure represents multiple aspects of the topic, and each aspect has a different focus of discussion.
Different features: the features of the current report that are new relative to a given topic center. The different features computed against different topic centers may differ.
Different degree: the percentage of the total number of features in the current report that are new features.
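As an illustration of the two definitions above, the following minimal sketch computes the different features and the different degree of a report against one center. It assumes that reports and centers are held as dictionaries mapping a feature (word) to its weight; the function names and the sample data are hypothetical and are not taken from the invention.

```python
def different_features(report, center):
    """Features of the report that do not occur in the given topic center."""
    return set(report) - set(center)


def different_degree(report, center):
    """Share of the report's features that are new relative to the center."""
    if not report:
        return 0.0
    return len(different_features(report, center)) / len(report)


# Hypothetical sample data: feature -> weight
report = {"fire": 0.5, "firefighter": 0.4, "investigation": 0.3, "cause": 0.2}
center = {"fire": 0.9, "firefighter": 0.7, "rescue": 0.4}

print(sorted(different_features(report, center)))  # ['cause', 'investigation']
print(different_degree(report, center))            # 0.5
```

Two of the four report features are absent from the center, so the different degree is 0.5. The same report measured against another center would generally give a different value.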
The network topic content evolution analysis method comprises the following steps:
Network event data collection step: news web pages are downloaded from the network and saved on the server as files, providing raw data for processing and analysis by the subsequent modules;
Network event preprocessing step: noise is removed from the original news web pages to discard useless information, Chinese word segmentation is then performed, word weights are computed with a specific strategy, and each report is finally represented in vector space model form;
Similarity calculation step: the cosine distance is used to compute the similarity between the current report and each existing topic, and the topic yielding the maximum similarity is recorded; if the maximum similarity is greater than or equal to a preset threshold, the current report is considered to belong to that topic class; otherwise, that is, if the maximum similarity is less than the threshold, a new topic class is established;
Topic multi-center establishment step: after the topic class of the current report has been determined, it is further determined which center within that topic class the report belongs to; the report is added to the report set of that center, and the topic center is updated at the same time;
Topic center update step: when a new report is added to a topic center, the corresponding topic center vector is updated;
Output step: the results of the topic content evolution analysis are output, including all the centers inside the topic and the news reports contained in each center.
The network topic content evolution analysis method of the present invention may further include the following:
1. In the similarity calculation step, when the similarity between a report and a topic is computed, the similarity between the report and each center of that topic is computed separately, and the maximum is taken as the similarity between the report and the topic.
2. In the topic multi-center establishment step, the center to which a report belongs is determined from the similarity and the different degree between the current report and each center of the topic: the center with the greatest similarity is selected as the center closest to the current report; if the different degree between the current report and this center is greater than or equal to a preset threshold, a new center of the topic is established with the current report; if the different degree is less than the threshold, the current report is considered to belong to this topic center.
3. In the topic center update step, the specific update method is to sum the vector formed from the current report and the center vector to form the new center vector.
The advantages of the present invention are as follows. The invention can discover the multiple content aspects related to a topic and uses a multi-center structure to build the corresponding topic model, describing the topic more accurately and comprehensively. By establishing and updating the multiple centers of a topic, it can show the dynamic evolution of the topic content, that is, the whole process of the topic from emergence through development and climax to extinction. The proposed method does not depend on the order in which reports are processed and can be applied when news reports with different focal points are interleaved.
(5) Detailed Description of the Embodiments
The present invention is described in more detail below by way of example with reference to the accompanying drawings:
Figure 1 shows the system for dynamic evolution analysis of network topic content based on the multi-center structure, which comprises:
Network event data collection device: actively obtains, in real time, raw data describing events related to network topic content from the Internet and stores it;
Network event data preprocessing device: parses the raw network event description data stored by the collection device according to a predefined format, filters out the noise in it, and extracts the core data actually related to the network events; it also performs feature definition and extraction on the core data and expresses it in a suitable form;
Topic evolution analysis device: after data preprocessing, clusters the news related to a given event together and analyzes the dynamic development and evolution of the events inside the topic;
Output device: outputs the system's topic evolution analysis results, specifically the topic centers and the news reports belonging to each center.
Fig. 2 shows the detailed flowchart of the method for dynamic evolution analysis of network topic content based on the multi-center structure.
1. Network event data collection
A characteristic of Internet news events is that news is multi-faceted: among all the news reports related to a given topic there are several focal points, each discussing one aspect of the news. As the event develops, the focus of the topic discussion also keeps shifting and changing.
2. Network event data preprocessing
The present invention uses the vector space model as the formal description of news reports and topic models. Vectorizing the network event data comprises the following steps:
(1) Extract the body text of the news from the original web page;
(2) Segment the news text into words using a segmentation dictionary, keep the content words, and remove function words and stop words;
(3) Use the TF-IDF method to determine the weight of each word obtained by segmentation. The TF-IDF weight is calculated by the following formula:
where $W_{t,d}$ is the weight of feature $t$ in document $d$, $m$ is the number of features, $N$ is the total number of documents, $TF_{t,d}$ is the frequency of feature $t$ in document $d$, and $DF_t$ is the document frequency of feature $t$.
(4) Each word, that is, each feature, together with its weight forms one component of the vector representation of the news report, which is expressed as follows:
$V_d = \{(T_1, W_{1,d}); (T_2, W_{2,d}); \ldots; (T_m, W_{m,d})\}$
where $V_d$ denotes the vector representation of document $d$, $T_i$ ($1 \le i \le m$) denotes the $i$-th feature in document $d$, and $W_{i,d}$ denotes the weight of the $i$-th feature in document $d$.
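The vectorization steps above can be sketched as follows. Because the weight formula itself is not reproduced in this text, the sketch assumes one common length-normalized variant, $TF_{t,d} \cdot \log(N / DF_t)$ divided by the vector's Euclidean norm; that choice, the function name, and the sample tokens are assumptions for illustration only. Word segmentation and stop-word removal are taken as already done.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists (already segmented, stop words removed).

    Returns one {feature: weight} dictionary per document.
    """
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each feature once per document

    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        raw = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        # Assumed length normalization; the norm is zero only when every
        # feature of the document occurs in all documents.
        vectors.append({t: (w / norm if norm else 0.0) for t, w in raw.items()})
    return vectors


# Hypothetical segmented reports
docs = [
    ["fire", "mall", "fire"],
    ["fire", "investigation"],
    ["earthquake", "rescue"],
]
vecs = tfidf_vectors(docs)
```

In this toy corpus "mall" occurs in one document and "fire" in two, so "mall" receives the larger weight in the first report even though "fire" occurs there twice.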
3. Similarity calculation
To compute the similarity between a report and a topic, the similarity between the report and each center of the topic is computed, and the maximum of these is taken as the similarity between the report and the topic. The included-angle cosine formula is used, as follows:
(1) Use the included-angle cosine method to compute the similarity between the report and each center of the topic, by the following formula:
$$\mathrm{sim}(d_i, d_j) = \frac{V_{d_i} \cdot V_{d_j}}{\lVert V_{d_i} \rVert \, \lVert V_{d_j} \rVert}$$
where $V_{d_i}$ and $V_{d_j}$ are the vector representations of documents $d_i$ and $d_j$, respectively.
(2) Select the maximum of the computed similarities as the similarity between the report and the topic.
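The two sub-steps above can be sketched as follows, again assuming sparse vectors stored as feature-to-weight dictionaries. The helper names `cosine` and `topic_similarity` and the sample vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Included-angle cosine of two sparse vectors given as {feature: weight}."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def topic_similarity(report, centers):
    """Return (maximum similarity, index of the closest center) for one topic."""
    sims = [cosine(report, c) for c in centers]
    best = max(range(len(sims)), key=sims.__getitem__)
    return sims[best], best


# Hypothetical topic with two centers
centers = [
    {"fire": 1.0, "firefighter": 1.0},
    {"fire": 1.0, "cause": 1.0},
]
report = {"fire": 1.0, "cause": 1.0, "investigation": 1.0}
sim, idx = topic_similarity(report, centers)
```

Here the report shares two features with the second center and only one with the first, so the second center gives the maximum similarity and that value is used as the report-to-topic similarity.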
4. Topic multi-center establishment
The evolution of topic content often shows up as the appearance of new features. If certain features do not appear at the start of a topic but only after it has lasted for some time, their appearance probably means that the topic content has evolved. The appearance of a few new features, however, is not enough to conclude that the topic content has evolved; only when the number of new features reaches a certain scale can the content be considered to have evolved. Here a vector decomposition method is used to build the topic multi-center structure model and to judge whether the topic content has evolved.
In the topic multi-center structure proposed by the present invention, determining only the topic class of a report is incomplete; it is also necessary to determine which center the report discusses. When making this determination, the fewer the different features, the more likely the report is discussing that center; conversely, the more different features there are, the more likely it is discussing a different center. The algorithm is as follows:
(1) Compute the similarity and the number of different features between the report and every center of the topic;
(2) Select the center closest to the report: from the centers of the topic, select the one with the greatest similarity as the center closest to the report;
(3) Determine the topic center that the report discusses: if the percentage of different features between the report and this center is less than the threshold, the report belongs to this center; otherwise, a new center of the topic is established with this report.
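A minimal sketch of steps (1) to (3) follows. It assumes a non-empty report, a topic that already has at least one center, and sparse vectors stored as dictionaries; `assign_center` and the sample data are hypothetical, and the default threshold of 0.6 is simply the value used in the experiment section of this document. Updating the chosen center is left to the separate update step.

```python
import math


def cosine(u, v):
    """Included-angle cosine of two sparse vectors given as {feature: weight}."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def assign_center(report, centers, diff_threshold=0.6):
    """Return the index of the center the report belongs to.

    A new center is appended to `centers` when the different degree against
    the closest center reaches the threshold.
    """
    # Step (2): the closest center is the one with the greatest similarity.
    best = max(range(len(centers)), key=lambda i: cosine(report, centers[i]))
    # Step (3): share of report features that are new relative to that center.
    new_features = [t for t in report if t not in centers[best]]
    degree = len(new_features) / len(report)
    if degree >= diff_threshold:
        centers.append(dict(report))  # the report founds a new center
        return len(centers) - 1
    return best


# Hypothetical topic with a single existing center
centers = [{"fire": 1.0, "firefighter": 1.0, "rescue": 1.0}]
same = assign_center({"fire": 1.0, "firefighter": 1.0, "funeral": 1.0}, centers)
new = assign_center(
    {"fire": 1.0, "cause": 1.0, "investigation": 1.0, "wiring": 1.0}, centers
)
```

The first report introduces one new feature out of three and joins the existing center; the second introduces three out of four and therefore founds a second center.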
5. Topic center update
A topic center is represented with the vector space model. When a new report is added to a topic center, the corresponding topic center vector must be updated. The update sums the vector formed from the current report and the center vector to form the new center vector, as follows:
Let $T_{i,d}$ ($1 \le i \le m$) and $W_{i,d}$ denote feature item $i$ of the current report's document vector $d$ and its weight, respectively. The current document vector $V_d$ can then be written as $V_d = \{(T_{1,d}, W_{1,d}); (T_{2,d}, W_{2,d}); \ldots; (T_{m,d}, W_{m,d})\}$. Likewise, the document vector $V_c$ of the topic's current center can be written as $V_c = \{(T_{1,c}, W_{1,c}); (T_{2,c}, W_{2,c}); \ldots; (T_{m,c}, W_{m,c})\}$. Their sum is written as $\mathrm{sum}(V_d, V_c) = \{(T_{1,s}, W_{1,s}); (T_{2,s}, W_{2,s}); \ldots; (T_{n,s}, W_{n,s})\}$, where each component $(T_{i,s}, W_{i,s})$ is generated by the following rules:
(1) Generating the feature items: let $S(V_d)$ be the set of feature items of $V_d$ and $S(V_c)$ the set of feature items of $V_c$; then $T_{i,s}$ ($1 \le i \le n$) $\in S(V_d) \cap S(V_c)$.
(2) Generating the weights:
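The weight rule is not reproduced in this text, so the following sketch rests on an assumption: the update is taken to be the ordinary component-wise vector sum, with a feature that is absent from one vector treated as having weight 0. Under that reading the result carries every feature found in either vector; if the feature-item rule is read strictly as an intersection, the keys would need to be filtered to the shared features instead. The function name and sample data are hypothetical.

```python
def update_center(center, report):
    """Return a new center vector: the component-wise sum of center and report.

    Assumption: a feature missing from one vector contributes weight 0.
    """
    merged = dict(center)
    for feature, weight in report.items():
        merged[feature] = merged.get(feature, 0.0) + weight
    return merged


# Hypothetical center and report vectors
center = {"fire": 1.0, "rescue": 0.5}
report = {"fire": 0.5, "cause": 0.25}
updated = update_center(center, report)
```

The function returns a new dictionary and leaves the old center untouched, which makes it easy to keep earlier states of a center when tracing how its content changed.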
6. Embodiment scenario and results
To verify the effectiveness of the present invention, we implemented the specific techniques and methods described above and compared them with the topic detection method based on the dual-centroid model. The comparison covers two aspects: topic detection performance and the establishment of topic centers. The experimental data are news web pages collected from the Sina website, comprising 5 topics and 181 news reports in total: the Damxung earthquake in Tibet, the Hangzhou subway construction site collapse, the Urumchi commercial building fire, the melamine-tainted eggs, and the abuse of workers at illegal brick kilns in Shanxi, numbered 1 to 5 respectively.
In the implementation, the similarity threshold is set to 0.4 and the different degree threshold is set to 0.6.
Table 1 compares the topic detection performance of the method of the invention with that of the dual-centroid method.
To verify whether the method of the present invention and the dual-centroid method can detect the different aspects of a topic and accurately describe the different aspects within a topic, we carried out the following experiment. Each report was first assigned a number. 23 reports were chosen from the "Urumchi commercial building fire" event and divided into three aspects: 1. the deaths of three firefighters (reports 1-7); 2. the fire accident investigation and aftermath handling (reports 8-12); 3. analysis and summary of the cause of the accident (reports 13-23). Table 2 gives the results when the reports on each aspect are processed in sequence; Table 3 gives the results when reports on different aspects are interleaved.
Table 1. Performance comparison of the multi-center structure method and the dual-centroid method
Table 2. Classification results for sequential processing
Table 3. Classification results when the aspects are interleaved
The recall R, precision P, and F1 value used as evaluation metrics are calculated by the following formulas:
F1 value: $F1 = \dfrac{2PR}{P + R}$
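For illustration, the three metrics can be computed as in the sketch below. It assumes the standard definitions: precision is the share of reports assigned to a topic that truly belong to it, recall is the share of the topic's true reports that were assigned to it, and F1 is their harmonic mean. The function name and the report numbers are hypothetical and are not experimental data from this document.

```python
def precision_recall_f1(detected, relevant):
    """detected: report ids assigned to a topic; relevant: ids that truly belong."""
    detected, relevant = set(detected), set(relevant)
    hits = len(detected & relevant)
    p = hits / len(detected) if detected else 0.0
    r = hits / len(relevant) if relevant else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


# Hypothetical assignment: 4 reports detected, 5 truly relevant, 3 in common
p, r, f1 = precision_recall_f1({1, 2, 3, 4}, {1, 2, 3, 5, 6})
```

With three of four detected reports correct and three of five relevant reports found, precision is 0.75 and recall is 0.6.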
The experimental results show that, in terms of topic detection performance, the method of the invention is essentially on par with the dual-centroid method on metrics such as precision, recall, and F1 value. In establishing topic centers, however, the method of the invention outperforms the dual-centroid method. The dual-centroid method creates split points based solely on the appearance of new words and is suited only to cases where the aspects of a topic appear one after another; it cannot cope with interleaved content. The method of the present invention can handle interleaved content: it not only discovers newly emerging topic content but also re-classifies old content. The multi-center structure keeps all the centers inside the topic and updates each of them in time, so the evolution of the topic content is grasped accurately and topic detection performance is improved.