CN110321432A

CN110321432A - Text event information extraction method, electronic device and non-volatile storage medium

Info

Publication number: CN110321432A
Application number: CN201910548427.8A
Authority: CN
Inventors: 乔春庚; 江敏; 刘瑞宝
Original assignee: Tols Information Technology Co Ltd
Current assignee: Tols Information Technology Co Ltd
Priority date: 2019-06-24
Filing date: 2019-06-24
Publication date: 2019-10-11
Anticipated expiration: 2039-06-24
Also published as: CN110321432B

Abstract

The invention belongs to the technical field of information processing. To solve the technical problem that prior-art event information extraction schemes have low accuracy, a first aspect of the invention provides a text event information extraction method comprising: segmenting the text into words, converting the segmented words into word vectors, inputting the word vectors into a neural network model, and outputting entities; defining information types based on text format features and, according to pattern rules defined by a grammar, organizing the segmented words and entities in each text block into a structured text block; and performing event information extraction on the structured text blocks, in which keywords are extracted using the grammar-defined pattern rules and output into a result template. By combining neural-network deep learning with rules, an event extraction model can thus be configured and text event information extracted accurately.

Description

Text event information extraction method, electronic device and non-volatile storage medium

Technical field

The present invention relates to the technical field of information processing, in particular to event information extraction in text mining research, and specifically to a text event information extraction method, an electronic device and a non-volatile storage medium.

Background art

Event information extraction is one of the most challenging tasks in text mining research. It aims to use a computer to automatically extract events of specific types and their elements from text. As a key technology in information processing, event information extraction is widely applied in information retrieval, automatic question answering, automatic summarization, data mining, text mining and other fields.

Current research and experiments on event information extraction fall mainly into three classes: (1) Rule-based text event extraction; typical systems using this approach include ExDisco and GenPAM. (2) Text event extraction based on trigger word detection, whose core is detecting the trigger word and determining the event arguments and their roles; a trigger word is a word that expresses the central meaning of a class of events well, for example "appointment" and "resignation" in post-change events. (3) Text information extraction based on probabilistic statistical models, for example extracting all fields of the header information of computer science papers with a hidden Markov model.

Although there has been much research applying statistical models to information extraction, in these studies the data fields to be extracted can all be regarded as a very compact sequence. Event descriptions in text often lack this property: the data fields to be extracted are scattered and sparse, and some fields lie at a considerable distance from the centre of the event description (which can be taken as the position of the trigger word). Accuracy therefore needs to be improved.

Summary of the invention

To solve the technical problem that prior-art event information extraction schemes have low accuracy, the invention provides a text event information extraction method, an electronic device and a non-volatile storage medium, which combine neural-network deep learning with rules to configure an event extraction model and extract text event information accurately.

To achieve the above object, the technical solution provided by the invention includes the following.

A first aspect of the invention provides a text event information extraction method, characterized in that the method comprises:

preprocessing the text, the preprocessing including segmenting the text into words, converting the segmented words into word vectors, inputting the word vectors into a neural network model, and outputting entities through the neural network model;

dividing the text into blocks to obtain text blocks and performing text block classification extraction, which includes: defining information types based on text format features and, according to pattern rules defined by a grammar, organizing the segmented words and entities in each text block into a structured text block;

performing event information extraction on the structured text blocks in the text, which includes extracting keywords using the grammar-defined pattern rules and outputting the keywords into the result template corresponding to the event information extraction.

In a preferred implementation of the embodiment of the invention, when the text is a résumé, block processing of the résumé yields a first text block for basic information, a second text block for education experience, a third text block for work experience, a fourth text block for training experience, a fifth text block for certificates, and a sixth text block for job intention; the format features include the information features corresponding to basic information, education experience, work experience, training experience, certificates and job intention respectively; and the grammar-defined pattern rules include judgment rules defined according to lexical analysis, syntactic analysis and semantic analysis as in compiler principles.

In a preferred implementation of the embodiment of the invention, segmenting the text into words includes: organizing the core dictionary as a double-array trie; handling overlapping segmentation ambiguity with a method that combines rules and statistics; and recognizing out-of-vocabulary words with a method based on conditional random fields.

In a preferred implementation of the embodiment of the invention, the deep neural network model includes an Embedding layer, a bidirectional RNN layer and a CRF layer; the Embedding layer converts the segmented words into word vectors, which are fed in sequence into the bidirectional RNN layer to obtain the probability distribution of segmentation labels; this distribution is fed into the CRF layer to obtain the entity label sequence corresponding to the entities.

In a preferred implementation of the embodiment of the invention, the pattern rules are modifiable, and the rule-configured information extraction model can be configured separately for different application scenarios; word-class information is provided in the pattern rules; and the text preprocessing further includes context rule analysis within text lines, in which the results of word segmentation and entity recognition are corrected with a predetermined rule correction method and ambiguous word classes are re-identified.

In a preferred implementation of the embodiment of the invention, the pattern rules include rules for simplification and merging, and complex long rules are placed first with simple short rules after them.

In a preferred implementation of the embodiment of the invention, organizing the segmented words and entities in the text block into a structured text block according to the corresponding pattern rules includes: using the pattern rules to judge, in order, whether consecutive lines satisfy a specific pattern and, once the text has been matched against the specific patterns, storing the pattern rule result matched by each line in a multidimensional array of string type.

In a preferred implementation of the embodiment of the invention, the rule-configured information extraction model is written in the rule description language NPRDL, which is specified in BNF form; the description language describes the grammatical and semantic information of words by means of complex feature sets, and in dynamic analysis uses dynamic attribute lists based on complex feature sets.

Second aspect of the present invention also provides a kind of electronic device characterized by comprising

Memory；

Processor；And

Computer program；

wherein the computer program is stored in the memory and is configured to be executed by the processor to implement the method provided in any implementation of the first aspect.

A third aspect of the invention further provides a non-volatile storage medium storing a computer program, characterized in that the program, when executed, implements the steps of the method provided in any implementation of the first aspect.

The technical solution provided by the invention can achieve the following beneficial effects:

1. Using the mathematical model of a neural network to perform word segmentation and entity recognition on the text makes it possible to obtain the basic elements of the text quickly. Combined with pattern rules, text block classification extraction is then performed: the segmented words and entities in each text block are organized into a structured text block according to the grammar-defined pattern rules, so that the text information is structured according to the grammar expressions required by a computer language, in a form more suitable for information extraction. Event extraction is then performed on the basis of this block-wise structuring, which effectively addresses the problems of scattered and sparse data and of extraction fields lying far from the centre of the event description, and thus allows text event information to be extracted accurately. In addition, entity recognition uses a deep neural network model, which improves recognition.

2. As a preferred implementation, the pattern rules include word-class information, so that an external user-defined lexicon can be introduced, making the pattern rules more convenient to use.

3. The pattern rules are modifiable at their foundation; for example, they are written in the rule description language NPRDL, which is specified in BNF form, so the information extraction model can be configured flexibly and quickly for different application scenarios.

4. Complex long rules are placed first and simple short rules after them. Because rule matching proceeds from front to back, this avoids the technical problem that a rule near the front matches first, the later rules are never traversed, and the match fails.

Other features and advantages of the invention are set out in the following description, and in part become apparent from the description or are understood by implementing the technical solution of the invention. The objectives and other advantages of the invention can be realized and obtained through the structures and/or processes particularly pointed out in the description, the claims and the drawings.

Brief description of the drawings

Fig. 1 is a flowchart of a text event information extraction method provided by an embodiment of the invention.

Fig. 2 is a flowchart of text preprocessing in a text event information extraction method provided by an embodiment of the invention.

Fig. 3 is a flowchart of text block classification extraction in a text event information extraction method provided by an embodiment of the invention.

Fig. 4 is a flowchart of event information extraction in a text event information extraction method provided by an embodiment of the invention.

Fig. 5 is a structural block diagram of a text event information extraction apparatus provided by an embodiment of the invention.

Fig. 6 is a structural block diagram of an electronic device provided by an embodiment of the invention.

Detailed description of the embodiments

Embodiments of the invention are described in detail below with reference to the drawings and examples, so that the process by which the invention applies technical means to solve technical problems and achieve technical effects can be fully understood and implemented. Note that these specific descriptions are only intended to help those of ordinary skill in the art understand the invention more easily and clearly and do not limit it; provided no conflict arises, the embodiments and the features of the embodiments may be combined with one another, and the resulting technical solutions all fall within the scope of protection of the invention.

In addition, the steps shown in the flowcharts of the drawings may be executed in a control system such as a set of controller-executable instructions, and although a logical order is shown in the flowcharts, in some cases the steps shown or described may be executed in an order different from the one given here.

The technical solution of the invention is described in detail below through the drawings and specific embodiments:

Embodiment

To solve the technical problem that prior-art event information extraction schemes are difficult to model or have low accuracy, this embodiment proposes an event extraction method and apparatus that first preprocess the text with a neural network, then divide the text into blocks, and extract the corresponding event information from the different text blocks.

As shown in Fig. 1, this embodiment provides a text event information extraction method comprising:

S110, preprocess the text: segment the text into words, convert the segmented words into word vectors, input the word vectors into a neural network model, and output entities through the neural network model.

In this embodiment, text preprocessing further includes text format alignment before word segmentation and entity recognition: runs of consecutive spaces are reduced to a single space; a row containing several "keyword" + ":" + "slot information" structures is split into single rows; all full-width characters in the text are converted to half-width characters; and all uppercase English letters in the text are converted to lowercase.
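The format alignment step can be illustrated with a minimal sketch. The function names and the keyword list below are hypothetical and chosen only for illustration; the patent does not specify how the step is implemented.

```python
import re

def normalize_line(line):
    """Full-width -> half-width, uppercase -> lowercase, collapse runs of spaces."""
    out = []
    for ch in line:
        code = ord(ch)
        if code == 0x3000:                 # ideographic (full-width) space
            code = 0x20
        elif 0xFF01 <= code <= 0xFF5E:     # full-width forms of printable ASCII
            code -= 0xFEE0
        out.append(chr(code))
    text = "".join(out).lower()
    return re.sub(r" {2,}", " ", text).strip()

# Hypothetical keyword list; a real system would load it from its lexicon.
KEYWORDS = ("name", "phone", "email")

def split_slots(line, keywords=KEYWORDS):
    """Split a row holding several 'keyword: slot' pairs into one row per pair."""
    pattern = r"(?=\b(?:%s)\s*:)" % "|".join(map(re.escape, keywords))
    parts = [p.strip() for p in re.split(pattern, line)]
    return [p for p in parts if p]

rows = split_slots(normalize_line("Ｎａｍｅ：　Li   Lei  Phone: 123  Email: A@B.com"))
```

The split relies on a zero-width lookahead, so each keyword stays attached to its own slot value.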

In a preferred implementation of this embodiment, segmenting the text into words includes: organizing the core dictionary as a double-array trie; handling overlapping segmentation ambiguity with a method that combines rules and statistics; and recognizing out-of-vocabulary words with a method based on conditional random fields.

As shown in Fig. 2, in a preferred implementation of this embodiment, the deep neural network model includes an Embedding layer, a bidirectional RNN layer and a CRF layer; the Embedding layer converts the segmented words into word vectors, which are fed in sequence into the bidirectional RNN layer to obtain the probability distribution of segmentation labels; this distribution is fed into the CRF layer to obtain the entity label sequence corresponding to the entities. That is, in the preferred implementation of this embodiment, word segmentation combines dictionary-based segmentation with statistical segmentation:

1) In organizing the core dictionary, considering the time efficiency of dictionary lookup, the space efficiency of storage, the statistical regularities of Chinese and similar factors, a double-array trie is used. The double-array trie (Double Array Trie) is a simple and efficient implementation of the trie, consisting of two integer arrays, base[] and check[]. For array index i, if base[i] and check[i] are both 0, the position is empty; if base[i] is negative, the state is a word. check[i] records the previous state of that state: t = base[i] + a, check[t] = i.
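The transition relation t = base[i] + a, check[t] = i can be sketched as follows. This is an illustrative toy, not the patent's implementation: the first-fit search for base values, the character codes starting at 1, the root at index 1, and the use of the absolute value of a negative base for transitions are all assumptions made to keep the example small.

```python
from collections import deque

ROOT = 1  # index 0 is left unused so that check[t] == 0 can mean "empty"

def build(words):
    """Build base/check arrays for a small word list (breadth-first, first-fit)."""
    code = {ch: i + 1 for i, ch in enumerate(sorted({c for w in words for c in w}))}
    trie = {}
    for w in words:
        node = trie
        for ch in w:
            node = node.setdefault(ch, {})
        node[None] = True                      # end-of-word marker
    base, check = [0, 0], [0, 0]
    used = {ROOT}
    queue = deque([(ROOT, trie)])
    while queue:
        state, node = queue.popleft()
        children = [(code[ch], sub) for ch, sub in node.items() if ch is not None]
        b = 1
        while any((b + c) in used for c, _ in children):
            b += 1                             # find an offset whose slots are all free
        for c, sub in children:
            t = b + c
            if t >= len(base):
                grow = t + 1 - len(base)
                base.extend([0] * grow)
                check.extend([0] * grow)
            check[t] = state                   # check[t] records the previous state
            used.add(t)
            queue.append((t, sub))
        base[state] = -b if None in node else b  # negative base marks a word
    return base, check, code

def contains(word, base, check, code):
    """Follow t = |base[s]| + code(ch) and verify check[t] == s at every step."""
    s = ROOT
    for ch in word:
        if ch not in code:
            return False
        t = abs(base[s]) + code[ch]
        if t >= len(check) or check[t] != s:
            return False
        s = t
    return base[s] < 0

base, check, code = build(["he", "her", "his"])
```

Each lookup step costs one addition and one comparison, which is the time efficiency the description refers to.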

2) Ambiguity resolution and out-of-vocabulary words are the two major difficulties of Chinese word segmentation; overlapping segmentation ambiguity is handled with a method that combines rules and statistics.

3) Out-of-vocabulary words are recognized with a method based on conditional random fields.

4) Part-of-speech tagging uses a method based on the hidden Markov model.

Specifically, entity recognition identifies and labels all dates, times, telephone numbers, person names, place names, organization names and the like that appear in the text (an entity can thus also be regarded as a segmented word that has undergone a particular kind of processing). Entity recognition is performed with a deep-learning-based method that treats the task as a classification problem based on sequence labelling. The deep neural network used by this method mainly consists of an Embedding layer (mainly word vectors, character vectors and some additional features), a bidirectional RNN layer, a tanh hidden layer and a final CRF layer; the RNN here is often an LSTM or a GRU. As shown in Fig. 2, text preprocessing specifically includes:

1) The text is split into single characters, English words, digits and punctuation marks, and word vectors are built in the Embedding layer.

2) After the word vectors are built they are fed into a bidirectional LSTM model, which outputs a segmentation label sequence for the input sequence.

3) A softmax function is applied at the output of the LSTM to obtain the probability distribution of the segmentation labels.

4) The probability distribution of the segmentation label sequence is fed into the CRF model to obtain the optimal entity label sequence; "optimal" here merely indicates that the entity label sequence obtained by the CRF model is good and does not specifically limit it.
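Step 4 can be illustrated with a small Viterbi decoder, the standard way to pick the best label sequence in a linear-chain CRF. The patent does not say which decoder it uses, so this is a sketch under that assumption; the tag set (O, B, I) and all scores below are invented for the example.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring tag sequence.

    emissions[t][k]   : score of tag k at position t (e.g. from the BiLSTM output)
    transitions[j][k] : score of moving from tag j to tag k (learned by the CRF)
    """
    n_tags = len(transitions)
    score = list(emissions[0])
    back = []
    for emit in emissions[1:]:
        prev_best, new_score = [], []
        for k in range(n_tags):
            j = max(range(n_tags), key=lambda j: score[j] + transitions[j][k])
            prev_best.append(j)
            new_score.append(score[j] + transitions[j][k] + emit[k])
        back.append(prev_best)
        score = new_score
    last = max(range(n_tags), key=lambda k: score[k])
    path = [last]
    for prev_best in reversed(back):           # follow back-pointers to the start
        path.append(prev_best[path[-1]])
    return path[::-1]

# Hypothetical tags: 0 = O (outside), 1 = B (entity begins), 2 = I (inside entity).
transitions = [
    [0.0, 0.0, -100.0],   # O -> I is strongly penalised: an entity must start with B
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]
emissions = [
    [1.0, 0.8, 0.0],      # position 0 slightly prefers O on its own
    [0.0, 0.0, 2.0],      # position 1 strongly prefers I
]
best = viterbi(emissions, transitions)
```

Taking the per-position maximum would give O followed by I, an invalid sequence; the transition scores steer the decoder to B followed by I instead, which is the benefit of placing a CRF layer after the softmax output.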

In the preferred implementation of this embodiment, text preprocessing further includes context rule analysis within text lines: the results of word segmentation and entity recognition are corrected with a predetermined rule correction method, and ambiguous word classes are re-identified. The word-class information may refer to the pattern rules described below.

Since current word segmentation and entity recognition cannot reach 100% accuracy, the segmentation results need to be corrected with a predetermined rule correction method and ambiguous word classes re-identified. As a concrete realization of the predetermined rule correction method, an n-gram model or custom grammar rules may be used, for example, to disambiguate the Chinese word segmentation.

In this embodiment, the main task of named entity recognition is to identify and classify proper names such as person names and place names, and meaningful numeral phrases such as times and dates, in the text.

Note that the text referred to in this embodiment includes, but is not limited to, text whose content is in plain-text format or text containing editable text; its form includes, but is not limited to, web page formats, document formats stored on a server, and so on.

S120, divide the text into blocks to obtain text blocks and perform text block classification extraction, which includes: defining information types based on text format features and, according to pattern rules defined by a grammar, organizing the segmented words and entities in each text block into a structured text block.

In a preferred implementation of this embodiment, when the text is a résumé, block processing of the résumé yields a first text block for basic information, a second text block for education experience, a third text block for work experience, a fourth text block for training experience, a fifth text block for certificates, and a sixth text block for job intention; the format features include the information features corresponding to basic information, education experience, work experience, training experience, certificates and job intention respectively; and the grammar-defined pattern rules include judgment rules defined according to lexical analysis, syntactic analysis and semantic analysis as in compiler principles.

In a preferred implementation of this embodiment, the pattern rules are modifiable, and the rule-configured information extraction model can be configured separately for different application scenarios; word-class information is provided in the pattern rules; and text preprocessing further includes context rule analysis within text lines, in which the results of word segmentation and entity recognition are corrected with a predetermined rule correction method and ambiguous word classes are re-identified. The pattern rules can be roughly understood as adjusting the rules of the event information extraction model according to different modes (or application scenarios), or as using different rules for different scenarios. A rule in this embodiment may be a conditional-statement expression or one of various input-output mapping tables, and the different modes may be identified from the application scenario of the text itself or automatically from keywords in the text. This embodiment imposes no limitation, and these different implementations all fall within its scope of protection.

In a preferred implementation of this embodiment, the pattern rules include rules for simplification and merging, and complex long rules are placed first with simple short rules after them. Because rule matching proceeds from front to back, this avoids the technical problem that a rule near the front matches first, the later rules are never traversed, and the match fails.
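The effect of rule ordering under first-match semantics can be shown with a toy example. The rule names and word classes are hypothetical, and using the number of required classes as a stand-in for "complexity" is an assumption made only for this sketch.

```python
def first_match(rules, classes_in_line):
    """Return the name of the first rule whose required classes all occur in the line."""
    for name, required in rules:
        if required <= classes_in_line:        # subset test
            return name
    return None

# Hypothetical rules: the short rule is a sub-case of the long one.
rules = [
    ("work_header", {"company"}),
    ("work_entry", {"date", "company", "position"}),
]
line = {"date", "company", "position"}

shadowed = first_match(rules, line)            # short rule fires first

# Put the more specific (longer) rule first, as the description recommends.
ordered = sorted(rules, key=lambda rule: len(rule[1]), reverse=True)
resolved = first_match(ordered, line)
```

With the short rule first, the longer rule is never reached for this line; after reordering, the more specific rule wins and the short rule still applies to lines that contain only a company.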

In a preferred implementation of this embodiment, organizing the segmented words and entities in the text block into a structured text block according to the corresponding pattern rules includes: using the pattern rules to judge, in order, whether consecutive lines satisfy a specific pattern and, once the text has been matched against the specific patterns, storing the pattern rule result matched by each line in a multidimensional array of string type. This is explained in further detail below with reference to Fig. 3.

The text block classification extraction process is built on modifiable pattern rules, which are rules for judging information types defined from text format features; a pattern is the set of format features of a kind of information. The pattern rules can be derived from the defined grammar, and their interpretation can borrow the methods of lexical analysis, syntactic analysis and semantic analysis from compiler principles.

As a preferred implementation of this embodiment, the pattern rules include word-class information, so that an external customizable lexicon can be introduced, making the pattern rules more convenient to use.

In this embodiment, a pattern rule is formally defined as a left part and a right part separated by a colon. The left part is the category name of the rule, and the right part is an expression defined by a regular grammar and composed of and (&), or (|), not (-), exclusive or (^) and parentheses (()).

Blocking rules are used to judge, in order, whether consecutive lines satisfy a specific pattern.

For example: x_y_0_S_1_z; x_y_1_M_n_z; x_y_2_E_1_N -> SUCCESS. Here x_y_0_S_1_z, x_y_1_M_n_z and x_y_2_E_1_N are category names used in blocking. A rule consists of word classes and operators, where & is logical AND, | is logical OR and - is logical NOT, as follows:

x_y_0_S_1_z:“key1_&loc-key0”

The rule match then means: "if the key1_ and loc classes occur in the text line and the key0 class does not occur, the line satisfies the x_y_0_S_1_z category pattern".
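A rule's right part such as "key1_&loc-key0" could be evaluated against the set of word classes found in a line roughly as follows. The operator precedence (negation, then AND, then OR/XOR) and the reading of a bare "-" as "and not" are assumptions inferred from the single example above, not something the description specifies; SYS_TRUE is the logical-true constant the description defines.

```python
import re

TOKEN = re.compile(r"\w+|[&|^()\-]")

def rule_matches(rule, classes_in_line):
    """Evaluate a rule expression over the word classes present in one text line."""
    tokens = TOKEN.findall(rule)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def factor():
        tok = take()
        if tok == "-":                         # logical NOT
            return not factor()
        if tok == "(":
            value = expr()
            take()                             # consume the closing ")"
            return value
        if tok == "SYS_TRUE":
            return True
        return tok in classes_in_line          # a word-class name

    def term():
        value = factor()
        while peek() is not None and peek() not in ("|", "^", ")"):
            if peek() == "&":
                take()
            rhs = factor()                     # parse first, then combine (no short-circuit)
            value = value and rhs
        return value

    def expr():
        value = term()
        while peek() in ("|", "^"):
            op = take()
            rhs = term()
            value = (value or rhs) if op == "|" else (value != rhs)
        return value

    return expr()

matched = rule_matches("key1_&loc-key0", {"key1_", "loc"})
```

The right-hand operands are always parsed before being combined, so the token position stays correct even when the left operand already decides the result.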

The parts of a category name mean the following:

x is the matching priority number of each information category, starting from 1 (0 ...); the smaller the number, the higher the priority;

y is the subscript of the rule, starting from 0 (0 ...); it also indicates the matching priority within the same information category;

the 1st field after x_y_ is the serial number of the rule sub-condition, incrementing from 0; to limit complexity, the serial number is restricted to < 5;

the 2nd field is the relative position of the rule sub-condition: 'S' is start, 'M' is middle, 'E' is end;

the 3rd field is the number of lines the rule sub-condition may match: a natural number from 1 to 9, or n for no limit;

z is the information category with which the line is to be labelled, starting from 1 (0 ...); N is the default event information category.
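The field layout of a category name can be made concrete with a small parser. The class and function names are hypothetical; only the field order and the constraints (sub-condition serial number < 5, position S/M/E, line limit 1 to 9 or n) come from the description above.

```python
from typing import NamedTuple

class RuleName(NamedTuple):
    priority: int      # x: matching priority of the information category
    rule_index: int    # y: subscript of the rule within the category
    sub_index: int     # serial number of the rule sub-condition (0..4)
    position: str      # 'S' start, 'M' middle, 'E' end
    max_lines: str     # '1'..'9', or 'n' for no limit
    category: str      # z: category to label the line with, or 'N' for the default

def parse_rule_name(name):
    """Split a name such as '1_0_2_E_1_N' into its six fields and validate them."""
    x, y, sub, position, max_lines, category = name.split("_")
    if not 0 <= int(sub) < 5:
        raise ValueError("sub-condition serial number must be < 5")
    if position not in ("S", "M", "E"):
        raise ValueError("position must be S, M or E")
    if max_lines != "n" and max_lines not in "123456789":
        raise ValueError("line limit must be 1-9 or n")
    return RuleName(int(x), int(y), int(sub), position, max_lines, category)

parsed = parse_rule_name("1_0_2_E_1_N")
```

Rejecting malformed names early mirrors the statement that blocking stops when values fall outside their predefined ranges.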

If line k satisfies x_y_0_S_1_z, lines k+1 to k+1+n satisfy x_y_1_M_n_z, and line k+1+n+1 satisfies x_y_2_E_1_N, then lines k to k+1+n+1 are matched successfully, and the information category of lines k to k+1+n is marked as 1.

The values of x, y and z must lie within their predefined ranges; otherwise the program stops blocking.

For convenience in writing rules, "SYS_TRUE" is defined to denote logical true.

Simplifying and merging rules resolves overlap and redundancy among rules; in the pattern rules, complex long rules should be placed first and simple short rules after them. Specifically, as shown in Fig. 3, the text block classification extraction process includes:

S121, start the text block classification extraction process.

S122, pattern classification: according to the preset pattern rules, determine which pattern rule the current text line satisfies, i.e., match the right part of the pattern rule.

S123, determine whether the text has ended or no rule was matched, i.e., whether all lines of the current text have been processed or the text line matches no rule; if so, go to step S124, otherwise return to S122.

S124, analyze the text lines using the patterns matched by each line, and store the left part of each matched pattern rule (i.e., the x_y_0_S_1_z classification result) in the three-dimensional array Cn[a][b][c]. Subscript a corresponds to the 1st field x of the category (the matching priority number of each information category), subscript b to the 2nd field y (the rule subscript), and subscript c to the 3rd field (the serial number of the rule sub-condition). Cn[a][b][c] holds the last three fields of the rule condition: Cn[a][b][c][0] stores the relative position of the rule sub-condition, Cn[a][b][c][1] the limit on matched lines, and Cn[a][b][c][2] the information category with which the line is to be labelled. This step completes rule matching for all text lines and stores the category results in Cn[a][b][c].

S125, determine whether Cn[a][b][c] marks the end of the text: read Cn[a][b][c] starting from n=0 and increment n until the end of the text; if Cn[a][b][c] is NULL, the text has ended and the process jumps to S129.

S126, examine the value of Cn[a][b][c][0]: "E" indicates that rule matching for the category has ended, so jump to S128; "S" or "M" indicates the start or the middle of rule matching for the category, so jump to S127.

S127, determine whether b has reached its bound: if so, matching of the category ends, the line is not classified, and line n+1 is processed directly. If not, increment b and c, continue with the subsequent rule of the category, and jump to S126 to continue judging.

S128, assign the line category result: when multiple lines from "S" to "E", or a single "E" line, are matched successfully, merge these lines and label the block category as Cn[a][b][c][2]. Then increment n and process the next line.

S129, matching ends.

S130, event information extraction process, event information are carried out respectively to the text block of all categories after structuring in text Extraction process includes realizing keywording using the associative mode rule of grammar definition, and keyword is output to event information It extracts in corresponding result template.

Therefore textual event information extracting method provided in this embodiment, using the mathematical model of neural network, to text Participle cutting and Entity recognition are carried out, can quickly obtain the fundamental in text, the mode of binding pattern rule is to text The extraction of text block classification information is carried out, by the associative mode rule of participle and entity according to grammar definition in text block, arrangement At the text block after structuring, by text information according to the text of computer language requirement in this way in a manner of being more conducive to information extraction Method expression formula structuring, and Event Distillation is carried out again on the basis of text sections structuring processing, effective solution data dispersion, The problem of center is stated farther out apart from event in sparse and extraction domain, the accurate extraction of such textual event information；And entity Identification includes being identified using deep neural network model, improves recognition effect.

The present embodiment is preferably carried out in mode, and regular configuration information extracts model using regular description language NPRDL Expression writing is carried out, NPRDL language is using BNF form；And description language be the means based on set of complex features come The grammatical and semantic information of vocabulary is described, while being retouched in dynamic analysis using the dynamic attribute list described based on set of complex features It states.

Therefore, as the present embodiment preferred embodiment, the basis of pattern rules is revisable, for example, using rule Then description language NPRDL carries out expression writing, and NPRDL language is using BNF form；It, can be with for different application scene Flexibly quickly configuration information extracts model.In addition, complicated length rule is placed on front, after simple short rule is placed on Face；Since rule match carries out from front to back, avoid it is possible that first with the success of the rule match of front, it is subsequent then because It does not traverse, and causes the technical issues of it fails to match.

Specifically, the keywords of each event category are extracted from a text block as follows: the format characteristics of each kind of information, mainly its organizational characteristics, are exploited and recognized by rule-based methods. For example, to recognize a "school", one usable pattern is "place" + "school" (or "university", "institute", etc.), as in "Peking University"; to recognize an "organization", one usable pattern is "take office in" (or "hold a post in", etc.) + other information + "company" (or "group", "office", etc.), as in "take office in Beijing TRS company".
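The two organizational patterns above can be illustrated with a minimal Python sketch. It is not the implementation of this embodiment: the token representation, the class label "loc", the word lists and the function names are all assumptions made for illustration.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, class label); labels here are illustrative

# Hypothetical word lists standing in for dictionary entries.
SCHOOL_WORDS = {"University", "Institute", "School"}
ORG_TRIGGERS = {"take office in", "hold a post in"}
ORG_WORDS = {"company", "group", "office"}


def find_schools(tokens: List[Token]) -> List[str]:
    """Return every "place" + "school word" pair found in the token list."""
    found = []
    for (w1, c1), (w2, _) in zip(tokens, tokens[1:]):
        if c1 == "loc" and w2 in SCHOOL_WORDS:
            found.append(f"{w1} {w2}")
    return found


def find_organizations(tokens: List[Token]) -> List[str]:
    """Return the span after a trigger phrase, up to and including an organization word."""
    found = []
    for i, (word, _) in enumerate(tokens):
        if word not in ORG_TRIGGERS:
            continue
        for j in range(i + 1, len(tokens)):
            if tokens[j][0] in ORG_WORDS:
                found.append(" ".join(w for w, _ in tokens[i + 1 : j + 1]))
                break
    return found
```

In this sketch the "other information" between the trigger phrase and the organization word is simply kept as part of the organization name, which matches the "Beijing TRS company" example but may be too permissive for real text.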

More specifically, performing event information extraction on the text blocks in the text includes:

One), rule description language NPRDL

The rules in the system are written in the rule description language NPRDL, which takes BNF form. The basic unit of NPRDL is the <rule>; one <simple rule> is effectively equivalent to a simple "conditional clause" in natural language, and its basic form is as follows:

Meaning: if <test> succeeds, <operation> is executed.

<test> and <operation> are distinguished by function. Because both refer to the same object (namely the input sentence under analysis), they can be unified into the form of the following structural formula:

In actual analysis, a <structural formula> can be used to express a segment of the sentence under analysis, where each <structure item> corresponds to a 'word' in the sentence. The sequence composed of <item label>, <item operation> and <item element> can be used to express tests or operations on the attributes of that word.

The objects described by NPRDL are Chinese sentences, with the word as the basic unit, and their intermediate structures (such as syntax trees, semantic concept spaces, etc.), including:

1. The information of each word is represented as: concept (attribute 1, attribute 2, ..., attribute n), where the concept is represented by the word itself.

2. During analysis, the attributes of a phrase or clause that is centered on a given word and closely bound to it (such as phrase class, sentence class, mood, etc.) can be attributed to that center word.

3. The relationship between one concept and another can be expressed as:

Relationship

Concept --- > concept

A sentence (or a part of one) can be represented as a <word> sequence of the following form: <word> + <word> + ... + <word>. In rule descriptions this formula is called a <structural formula>, and each <word> in it is called a <structure item>. The <structural formula> corresponding to a <word> sequence is: <structure item> + <structure item> + ... + <structure item>.

(1) Each <structure item> corresponds to one <word>, and its content includes:

<item label>: indicates the position of the word in the sentence.

<item operation>: indicates the operation on the relevant attributes of the word.

<item element>: expresses certain attributes of the word.

(2) The symbol '+' is the structure connector; it indicates that the <structure items> before and after it are adjacent and ordered.

(3) Two <structure items> in a parent-child relationship are marked with ↑ (parent) or ↓ (child), indicating that the two lie at different levels of the hierarchy.

Example: ^ (VV, 2033)+^ ↓ # (NN, 111, SUBJECT)=> (^ ↓ #.GRELA:=AGT)

Meaning: if the current word is (verb, thought), and the child of the current word is (noun, person, subject of the verb above it), then the case relation of the child of the current word is changed to the agentive case.

The rule description language has three features:

1. Strong descriptive power. The description language describes the grammatical and semantic information of words by means of complex feature sets, and in dynamic analysis it uses dynamic attribute lists described on the basis of complex feature sets, so the information of a Chinese text analysis unit can be described at multiple levels and from multiple aspects.

2. Convenient for computer processing. The description language provides a rich set of atomic operations on trees and networks, which makes it very convenient for the computer to manipulate the syntax trees and semantics formed during analysis; it also provides a rich set of atomic operations for querying, modifying and deleting information in static attribute lists.

3. Convenient for rule writing. The rule language not only has strong descriptive power but can also describe phenomena in fine detail. Many fine-grained language phenomena can be described entirely by uniquely identifying means such as words, parts of speech, and classification codes. Moreover, because the rules are designed around a descriptive approach, they are intuitive and convenient for users to write.

Two), contextual analysis rules

The keyword-marking rules can be expressed in the following Backus-Naur form:

<rule> ::= ^ <test> => <action>

<test> ::= <test formula> ; { <test formula> ; }

<test formula> ::= n , <test item> ( & | '|' | + ) <test item> | n , ~ { & <test item> }

<test item> ::= <attribute> ( = | != ) '<attribute value>'

<attribute> ::= lex | class

<attribute value> ::= <word> | <class feature>

<action> ::= <action type> ; { <action type> }

<action type> ::= n , n , <class feature> , <part of speech>

Wherein:

(1) ^ is the rule start symbol.

(2) ~ denotes a node pointer structure.

(3) n denotes a natural number; a concrete number indicates a node number, in the range 1-9.

(4) lex and class indicate whether the attribute value that follows is a specific entry or a specific class: if it is lex, the attribute value that follows is a specific 'entry'; if it is class, the attribute value that follows is a specific 'class'.

(5) A class feature denotes a class in the dictionary, for example: org.

Three), specific application example

The pattern rule for parsing the date of birth (birthday) is as follows:

^1, lex='born in'; 1, class='time'; => 2,2,birthday,n

This means: if the entry of the 1st node is "born in" and the class of the 2nd node is "time", then the class of the 2nd node is marked as "birthday".
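How such a rule could be applied to a node sequence can be sketched in Python. This is only an illustration under stated assumptions, not the rule interpreter of this embodiment: it supports only single-node tests of the form `1, lex='...'` or `1, class='...'`, uses ASCII punctuation, and reads the action `first,last,class,pos` as "relabel the nodes matched by tests `first` through `last`", which is an interpretation inferred from the birthday example. The names `parse_rule` and `apply_rule` are hypothetical.

```python
import re
from typing import Dict, List, Tuple

Node = Dict[str, str]  # {"lex": word, "class": class label}

# One single-node test formula, e.g. "1, class='time'".
TEST_RE = re.compile(r"\s*1\s*,\s*(lex|class)\s*=\s*'([^']*)'\s*")


def parse_rule(rule: str) -> Tuple[List[Tuple[str, str]], Tuple[int, int, str, str]]:
    """Split '^tests => action' into single-node tests and one action."""
    head, action = rule.lstrip("^").split("=>")
    tests = []
    for part in head.split(";"):
        if not part.strip():
            continue  # skip the empty piece after the last ';'
        m = TEST_RE.fullmatch(part)
        if m is None:
            raise ValueError(f"unsupported test formula: {part!r}")
        tests.append((m.group(1), m.group(2)))
    first, last, new_class, pos = (f.strip() for f in action.split(","))
    return tests, (int(first), int(last), new_class, pos)


def apply_rule(rule: str, nodes: List[Node]) -> bool:
    """Relabel the first window of nodes that satisfies every test; report success."""
    tests, (first, last, new_class, _pos) = parse_rule(rule)
    for start in range(len(nodes) - len(tests) + 1):
        window = nodes[start : start + len(tests)]
        if all(node[attr] == value for node, (attr, value) in zip(window, tests)):
            # Assumption: first/last index the tests whose nodes get the new class.
            for node in window[first - 1 : last]:
                node["class"] = new_class
            return True
    return False
```

Applied to the nodes for "born in" followed by a node of class "time", the sketch changes the class of the second node to "birthday" and leaves the first node unchanged.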

Four), merging and organization of pattern rules

As rules are gradually added they may overlap, which introduces redundancy. Simplifying and merging rules solves this problem, for example:

^1, class='time'; n, ~&lex!='$'; 1, class='from'; => 1,1,106,n

and ^1, class='time'; n, ~&lex!='$'; 1, class='to'; => 1,1,106,n

These two rules can be merged into one. Because the pattern rule grammar satisfies the associative and distributive laws, the merged rule is:

^1, class='time'; n, ~&lex!='$'; 1, class='from' | class='to'; => 1,1,106,n

This makes the logic more compact.

Because rule matching proceeds from front to back, an earlier rule may match first, so that a later rule is never traversed and fails to match, which can sometimes cause errors. Pattern rules should therefore be arranged with complex, long rules first and simple, short rules after them.
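The effect of rule order under first-match semantics can be shown with a small Python sketch. The rule representation (a name plus a sequence of class labels matched as a prefix) and the use of the number of test items as the measure of rule length are simplifying assumptions; `order_rules` and `first_match` are hypothetical names.

```python
from typing import List, Optional, Sequence, Tuple

Rule = Tuple[str, Sequence[str]]  # (rule name, class labels to match in order)


def order_rules(rules: List[Rule]) -> List[Rule]:
    """Put rules with more test items first; ties keep their original order."""
    return sorted(rules, key=lambda rule: len(rule[1]), reverse=True)


def first_match(rules: List[Rule], classes: Sequence[str]) -> Optional[str]:
    """Return the name of the first rule whose pattern is a prefix of `classes`."""
    for name, pattern in rules:
        if list(classes[: len(pattern)]) == list(pattern):
            return name
    return None
```

With a short rule listed before a longer rule that shares its prefix, `first_match` never reaches the longer rule; after `order_rules`, the longer rule is tried first and the short rule still serves as a fallback.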

As shown in Fig. 4, in the textual event information extraction method of this embodiment, event information extraction includes:

S131: confirm the text and the extraction type, for example with reference to the rules described above for determining the information type defined on the basis of text format features.

S132: open the rule file, i.e. open the file of the configured information extraction model.

S133: obtain one line of characters from the text.

S134: preprocess the text, including word segmentation and entity recognition.

S135: load the text container, i.e. save this line, after word segmentation and entity recognition, in the preprocessed format.

S136: extract blocks using pattern rules, i.e. perform text block division and text block classification extraction on the text in the manner described in S120 above.

S137: according to the block classification, mark the keywords of each category through rule-interpreter analysis.

S138: output the keywords to the result template.

S139: determine whether the end of the text has been reached; if so, normalize or output the text (S140); otherwise, return to S133.
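The control flow of steps S133 to S139 can be sketched as a simple loop in Python. The sketch assumes that preprocessing, block classification and keyword marking are supplied as callables; the function name `extract_events`, its parameters and the nested-dictionary result template are illustrative assumptions, and classifying after every line is a simplification of the block division described above.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Token = Tuple[str, str]  # (word, class label)


def extract_events(
    lines: Iterable[str],
    preprocess: Callable[[str], List[Token]],
    classify_block: Callable[[List[List[Token]]], str],
    mark_keywords: Callable[[str, List[Token]], Dict[str, str]],
) -> Dict[str, Dict[str, str]]:
    """Run steps S133 to S139 over `lines` and return the filled result template."""
    container: List[List[Token]] = []
    template: Dict[str, Dict[str, str]] = {}
    for line in lines:                      # S133: take one line of text
        tokens = preprocess(line)           # S134: segmentation + entity recognition
        container.append(tokens)            # S135: store in the text container
        block = classify_block(container)   # S136: block classification by pattern rules
        keywords = mark_keywords(block, tokens)          # S137: rule interpreter
        template.setdefault(block, {}).update(keywords)  # S138: write to the template
    return template                         # S139/S140: end of text reached
```

Passing the three stages as callables keeps the loop independent of any particular segmenter or rule file, which mirrors the configurability of the rule-based model described in this embodiment.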

As shown in Fig. 5, this embodiment also provides a textual event information extraction device 100, which includes:

A text preprocessing module 110, configured to preprocess the text, where preprocessing includes segmenting the text into words, converting the segmented words into word vectors, inputting the word vectors into a neural network model, and outputting entities through the neural network model.

A text block classification extraction module 120, configured to divide the text into blocks to obtain text blocks and to perform text block classification extraction, which includes: based on the information types defined by text format features and according to the corresponding grammar-defined pattern rules, organizing the segmented words and entities in each text block into a structured text block.

An event information extraction module 130, configured to perform event information extraction on the structured text blocks in the text, where event information extraction includes marking keywords using the corresponding grammar-defined pattern rules and outputting the keywords to the result template corresponding to the event information extraction.

It should be noted that the text processing procedure in the textual event information extraction device 100 provided in this embodiment is the same as that of the above textual event information extraction method, which combines statistics with rules, and achieves the same technical effect; it is not described again here.

To help those skilled in the art understand the technical solution of this embodiment, text information extraction is described in detail below, taking event information extraction from resume text as an example. Suppose that six major categories need to be extracted (basic information, education experience, work experience, training experience, credentials, and job-seeking intention), totaling dozens of attributes. The specific textual event information extraction is as follows:

One), text preprocessing

1. Define the content of event information extraction

INFOTYPE 1 TResume_TRS # basic information

1_00 IgnoreFlag #

1_01 TrueName # applicant's real name

1_02 Email # e-mail address

1_03 Mobel # mobile phone number

1_04 Phone # telephone number

1_05 Sex_s # gender

…

INFOTYPE 2 TEducation_TRS # education experience

2_00 E_StartTime_s # education start date

2_01 E_StartTime_d #

2_02 E_EndTime_s # education end date

2_03 E_EndTime_d #

2_04 E_SchoolName # school name

…

INFOTYPE 3 TWorkExperierence_TRS # work experience

3_00 InductionDate_s # job start date

3_01 InductionDate_d #

3_02 DimissionDate_s # job end date

3_03 DimissionDate_d #

3_04 CompanyName # company name

…

2. Define the keyword dictionary

To make rules easier to write, some keywords can be extracted by manual observation. Such a keyword may be segmented as a single word or split into several words. The keywords are added to class.GB, and \t and the class name are appended after each entry, for example:

Name kr1#kr1_01

Email kr1#kr1_02

Phone number kr1#kr1_03

Phone kr1#kr1_04

Gender kr1#kr1_05

Marital status kr1#kr1_07

…

3. Text word segmentation and entity recognition

Word segmentation and entity recognition are performed on the text, and a class is automatically assigned to each segmentation result. The classes built into the system include: name (person name), loc (place name), org (organization name), id (ID card number), digit (numeric value), etc.

Two), text block classification

According to the defined extraction content, the resume text needs to be divided into six blocks. Taking the basic information block as an example, the pattern rules are written as follows:

1_00_0_S_1_1:kr1&kru-(bdfhm|bdfhk)

1_00_1_E_n_1:(SYS_TRUE-kru)|(kru&(bdfhm|bdfhk)-kr9)

1_01_0_S_1_1:kr1&kru&bdfhm

1_01_1_E_n_1:(SYS_TRUE-kru)|(kru-bdfhm-kr9)

1_02_0_S_1_1:kr1&kru&bdfhk

1_02_1_E_n_1:(SYS_TRUE-kru)|(kru-bdfhk-kr9)

1_03_0_S_1_1:kr1&(bdfhm|kru)

1_03_1_E_n_1:SYS_TRUE-(bdfhm|kru)

1_04_0_S_1_1:english-(kr0|kr1|kr2|kr3|kr4|kr5|kr6|kr7|kr8|kr9)

1_04_1_E_n_1:kr1-kru

Three), information extraction

Taking basic information as an example, the information extraction rules are written as follows:

1) Direct recognition by class

^1, class='name'; => 1,1,1_01,n

That is, a node whose class is name is recognized as the resume name.

2) Recognition by context keywords, for example:

^1, class='kr1_01';

1, class='bdfhm';

n, ~&class!='bdfh'&class!='kr';

1, class='row' | class='kr';

=> 3,3,1_01,n

The above expression means: starting with "kr1_01 + colon", and continuing until a line break or another keyword is encountered, the nodes in between are recognized as the resume name.
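The behavior described for this context-keyword rule can be mimicked by a short Python function. This is a hand-written sketch of one rule, not a general rule interpreter: it assumes that 'bdfhm' is the class of the colon, that 'row' is the class of a line break, and that the tests against 'bdfh' and 'kr' apply to every class whose label starts with that prefix. The function name `extract_after_keyword` is hypothetical.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, class label)


def extract_after_keyword(tokens: List[Token], keyword_class: str = "kr1_01") -> List[str]:
    """Collect the words between '<keyword> + colon' and the next line break or keyword."""
    for i in range(len(tokens) - 1):
        if tokens[i][1] != keyword_class or tokens[i + 1][1] != "bdfhm":
            continue
        value: List[str] = []
        for word, cls in tokens[i + 2 :]:
            if cls == "row" or cls.startswith("kr"):
                return value  # terminator found: the rule matches
            if cls.startswith("bdfh"):
                break  # punctuation class inside the span: this attempt fails
            value.append(word)
    return []
```

The function returns an empty list when no terminating line break or keyword follows, so that an unterminated span is not reported as a name.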

As shown in Fig. 6, this embodiment also provides an electronic device, comprising:

a memory 210;

a processor 220; and

a computer program;

wherein the computer program is stored in the memory 210 and is configured to be executed by the processor 220 to implement any one of the textual event information extraction methods provided above.

In addition, this embodiment also provides a non-volatile storage medium on which a computer program is stored; when executed, the computer program implements the steps of any one of the textual event information extraction methods provided above, which combine statistics with rules.

Those of ordinary skill in the art will understand that the method according to the embodiments of the present invention may be implemented in hardware or firmware, or implemented as software or computer code that can be stored in a recording medium (such as a CD-ROM, RAM, floppy disk, hard disk, or magneto-optical disk), or implemented as computer code that is originally stored in a remote recording medium or a non-volatile machine-readable medium, downloaded over a network, and stored in a local recording medium, so that the method described herein can be processed by such software stored on a recording medium using a general-purpose computer, a dedicated processor, or programmable or dedicated hardware (such as an ASIC, FPGA, or SoC). It will be understood that a computer, processor, microprocessor controller, or programmable hardware includes a storage component (for example, RAM, ROM, flash memory, etc.) that can store or receive software or computer code, and that the processing method described herein is implemented when the software or computer code is accessed and executed by the computer, processor, or hardware. In addition, when a general-purpose computer accesses code for implementing the processing shown herein, execution of the code converts the general-purpose computer into a special-purpose computer for executing that processing.

Those of ordinary skill in the art will appreciate that the units and method steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, or in a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled practitioners may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of the embodiments of the present invention.

Finally, it should be noted that the above describes only the preferred embodiments of the present invention and does not limit the present invention in any form. Anyone skilled in the art may, without departing from the scope of the present invention, use the methods and technical content disclosed above to make many possible variations and simple substitutions to the technical solution of the present invention, all of which fall within the scope of protection of the technical solution of the present invention.

Claims

1. A textual event information extraction method, characterized in that the method comprises:

preprocessing text, the preprocessing comprising segmenting the text into words, converting the segmented words into word vectors, inputting the word vectors into a neural network model, and outputting entities through the neural network model;

dividing the text into blocks to obtain text blocks, and performing text block classification extraction, the text block classification extraction comprising: based on information types defined by text format features, and according to corresponding grammar-defined pattern rules, organizing the segmented words and entities in each text block into a structured text block;

performing event information extraction on the structured text blocks in the text, the event information extraction comprising marking keywords using the corresponding grammar-defined pattern rules, and outputting the keywords to a result template corresponding to the event information extraction.

2. The method according to claim 1, characterized in that, when the text is a resume, the resume after block division comprises a first text block corresponding to basic information, a second text block corresponding to education experience, a third text block corresponding to work experience, a fourth text block corresponding to training experience, a fifth text block corresponding to credentials, and a sixth text block corresponding to job-seeking intention; the format features respectively comprise the information features corresponding to basic information, education experience, work experience, training experience, credentials, and job-seeking intention; and the grammar-defined pattern rules comprise judgment rules defined according to the lexical analysis, syntactic analysis, and semantic analysis of compiler principles.

3. The method according to claim 1, characterized in that segmenting the text into words comprises: organizing the core dictionary using a double-array trie; handling overlapping segmentation ambiguity using a method that combines rules with statistics; and recognizing unknown words using a recognition method based on conditional random fields.

4. The method according to claim 1, characterized in that the deep neural network model comprises an Embedding layer, a bidirectional RNN layer, and a CRF layer; the Embedding layer converts the segmented words into word vectors, which are fed in sequence into the bidirectional RNN layer to obtain a probability distribution over the labels of the segmented words; the probability distribution over the labels is fed into the CRF layer to obtain the entity label sequence corresponding to the entities.

5. The method according to claim 1, characterized in that the pattern rules are modifiable, and the rule-configured information extraction model can be configured separately for different application scenarios; class information is provided in the pattern rules; the text preprocessing further comprises in-line context rule analysis, which comprises correcting, by a predetermined rule adjustment method, the word segmentation results obtained from word segmentation and entity recognition of the text, and re-identifying ambiguous classes.

6. The method according to claim 1, characterized in that the pattern rules comprise simplified and merged rules, with complex, long rules placed first and simple, short rules placed after them.

7. The method according to claim 1, characterized in that organizing the segmented words and entities in the text block into a structured text block according to the corresponding pattern rules comprises: using a sequence of pattern rules to judge whether consecutive lines satisfy a specific pattern, and, once the text has been found to satisfy the corresponding specific pattern, storing the pattern rule result matched by each line in a multidimensional array of string type.

8. The method according to any one of claims 1-7, characterized in that the rule-configured information extraction model is written in the rule description language NPRDL, which takes BNF form; the description language describes the grammatical and semantic information of words by means of complex feature sets, and in dynamic analysis uses dynamic attribute lists described on the basis of complex feature sets.

9. An electronic device, characterized by comprising:

a memory;

a processor; and

a computer program;

wherein the computer program is stored in the memory and is configured to be executed by the processor to implement the method according to any one of claims 1-8.

10. A non-volatile storage medium on which a computer program is stored, characterized in that the computer program, when executed, implements the steps of the method according to any one of claims 1-8.