CN117763077A

CN117763077A - Data query method and device

Info

Publication number: CN117763077A
Application number: CN202311829247.XA
Authority: CN
Inventors: 卢轩; 张亚军; 刘志伟
Original assignee: Alipay Hangzhou Information Technology Co Ltd
Current assignee: Alipay Hangzhou Information Technology Co Ltd
Priority date: 2023-12-27
Filing date: 2023-12-27
Publication date: 2024-03-26

Abstract

The embodiment of the specification provides a data query method and device, wherein the method comprises the following steps: receiving a data query request, wherein the data query request comprises a regular expression; generating a state machine diagram based on the regular expression, wherein the state machine diagram comprises a first node representing an initial state, a second node representing a termination state and an edge representing a word element, the word element consists of a plurality of characters, and a target path between the first node and the second node represents a word element combination conforming to the regular expression; inquiring a pre-generated inverted index dictionary based on a target path, and determining a target text meeting the regular expression; the inverted index dictionary comprises index sets corresponding to a plurality of words, wherein the index set corresponding to any word comprises inverted indexes of a plurality of items about positions, and each inverted index indicates a text identifier of a text containing the any word and the position of the any word in the text.

Description

Data query method and device

Technical Field

The embodiment of the specification relates to the technical field of computers, in particular to a data query method and device.

Background

Many applications require searching of data or use pattern matching to capture the required information, with regular expressions being a very popular method of pattern matching such as rule detection for network packets, search processing of text strings, and matching of protein sequences. However, with the growth of the internet, the data of users is also growing rapidly, the data volume required to be processed by each application is far more than ever nowadays, although the performance of hardware is greatly improved than ever, the software architecture is expanded from stand-alone computing to distributed computing, but the conventional regular matching of a large amount of data still needs a lot of computation effort and time consumption above the minute level, and the response time is often not acceptable to users.

Therefore, a reasonable and reliable scheme is urgently needed, and efficient data query can be achieved.

Disclosure of Invention

The embodiment of the specification provides a data query scheme which can realize efficient data query.

In a first aspect, an embodiment of the present disclosure provides a data query method, including: receiving a data query request, wherein the data query request comprises a regular expression; generating a state machine diagram based on the regular expression, wherein the state machine diagram comprises a first node representing an initial state, a second node representing a termination state and an edge representing a word element, the word element consists of a plurality of characters, and a target path between the first node and the second node represents a word element combination conforming to the regular expression; inquiring a pre-generated inverted index dictionary based on the target path, and determining a target text meeting the regular expression; the inverted index dictionary comprises index sets corresponding to a plurality of words respectively, wherein the index set corresponding to any word comprises inverted indexes of a plurality of items about positions, and each inverted index indicates a text identifier of a text containing the any word and the position of the any word in the text.

In some embodiments, the generating a state machine graph based on the regular expression includes: constructing a first state machine with single characters as edges based on the regular expression; converting the first state machine into a second state machine with a word element as an edge; and converting the second state machine into a graph to obtain the state machine graph.

In some embodiments, the determining the target text that satisfies the regular expression includes: determining a plurality of first edges in the state machine diagram; the first side is capable of cutting off access to a target path formed by a plurality of sides; for a target first edge in the plurality of first edges, inquiring the inverted index dictionary to obtain a first index set corresponding to a first word element represented by the target first edge; and sequentially matching text identifiers and positions in the index set corresponding to other edges in a target path where the target first edge is located based on a plurality of text identifiers and corresponding positions indicated in the first index set by taking the target first edge as a starting point to obtain matched text identifiers, and determining corresponding texts as the target texts.

In some embodiments, the determining a number of first edges in the state machine diagram includes: determining a plurality of first edges in the state machine diagram based on a max-min cut algorithm; such that the removal of the number of first edges can cause a disruption in network flow of the state machine graph or a sub-graph of the state machine graph that is a graph of the state machine graph after the removal of a second edge that is directly connected from the first node to the second node.

In some embodiments, the determining the target text that satisfies the regular expression further comprises: in response to the state machine graph including a second edge directly connected from the first node to the second node, querying the inverted index dictionary for a second index set corresponding to a second lemma represented by the second edge; and determining the text corresponding to each text identifier indicated by the queried second index set as the target text.

In some embodiments, the number of text identifications indicated by the first set of indices does not include the text identification indicated by the second set of indices.

In some embodiments, the sequentially matching text identifiers and positions in the index set corresponding to other edges in the target path where the target first edge is located based on the text identifiers and corresponding positions indicated in the first index set includes: determining a current first text identifier in the text identifiers, and determining a current first position in each position indicated by the first index set and corresponding to the first text identifier; sequentially matching text identifiers and positions in an index set corresponding to other edges in a target path where the target first edge is located based on the first text identifier and the first position; and if the target path where the target first edge is located is successfully matched, determining the first text identifier as a matched text identifier.

In some embodiments, the sequentially matching text identifiers and positions in the index set corresponding to other edges in the target path where the target first edge is located based on the text identifiers and corresponding positions indicated in the first index set further includes: if the target path of the target first side is failed to be matched based on the first text identifier, and other text identifiers which do not participate in path matching exist in the text identifiers, determining a current second text identifier in the other text identifiers, and determining a current second position in each position indicated by the first index set and corresponding to the second text identifier; and sequentially matching text identifiers and positions in the index set corresponding to other edges in the target path where the target first edge is located based on the second text identifiers and the second positions.

In some embodiments, the sequentially matching text identifiers and positions in the index set corresponding to other edges in the target path where the target first edge is located based on the text identifiers and corresponding positions indicated in the first index set further includes: if the target path where the target first edge is located fails to match based on the first position, and other positions which do not participate in path matching exist in the positions indicated by the first index set and correspond to the first text mark, determining a current third position in the other positions; and sequentially matching text identifiers and positions in the index set corresponding to other edges in the target path where the target first edge is located based on the first text identifier and the third position.

In some embodiments, the target path on which the target first edge is located includes a multi-entry target path; the sequentially matching text identifiers and positions in the index set corresponding to other edges in the target path where the target first edge is located based on the first text identifier and the first position includes: for any first target path in the multi-item target path, sequentially matching text identifiers and positions in index sets corresponding to other edges in the first target path based on the first text identifiers and the first positions; and if the first target path is successfully matched, determining that the target path where the target first edge is located is successfully matched.

In some embodiments, the sequentially matching text identifiers and positions in the index set corresponding to other edges in the target path where the target first edge is located based on the first text identifier and the first position further includes: and if the matching of other edges in the first target path fails, and the other target paths in the multi-entry target path all comprise edges with failed matching in the other edges, determining that the matching of the target path where the target first edge is located fails.

In some embodiments, the sequentially matching text identifiers and positions in the index set corresponding to other edges in the target path where the target first edge is located based on the first text identifier and the first position further includes: and if the matching of other edges in the first target path fails, and a second target path which does not comprise the edges with the failed matching in the other edges exists in the multi-item target path, sequentially matching the text identifiers and the positions in the index set corresponding to the other edges in the second target path based on the first text identifiers and the first positions.

In some embodiments, the sequentially matching text identifiers and positions in the index set corresponding to other edges in the first target path based on the first text identifier and the first position includes: determining a third side to be matched currently in the first target path; acquiring a third index set corresponding to a third word element represented by the third side in the inverted index dictionary; if the third index set comprises the first text identifier and a target position corresponding to the first text identifier, determining that the third edge matching is successful; wherein the target position is determined according to the first position and the positional relationship between the first side and the third side of the target; and if the third index set does not comprise the first text identifier or the target position, determining that the third edge matching fails.

In some embodiments, the determining the third edge to be currently matched in the first target path includes: if the first target path comprises a plurality of edges adjacent to the target first edge, querying an index set corresponding to the word elements represented by the edges in the inverted index dictionary; and determining a target index set with the largest number of the included text identifiers in each queried index set, and determining an edge corresponding to the target index set in the plurality of edges as the third edge.

In some embodiments, the determining the target text that satisfies the regular expression further comprises: querying index sets corresponding to the word elements respectively represented by the plurality of first sides in the inverted index dictionary; for each index set queried, determining a minimum text identification in the index set as a weight value corresponding to a first edge of the index set; and sequentially determining the first sides as the target first sides according to the sequence of the weight values from small to large.

In some embodiments, prior to receiving the data query request, further comprising: for each text to be processed in a preset text set, performing word segmentation on the text to be processed by using an N-element language model to generate a word segmentation result; wherein N is a natural number greater than 1, the word segmentation result comprises a word element information set corresponding to the text identifier of the text to be processed, and any word element information comprises word elements formed by N characters and the positions of the word elements in the text to be processed; forming a word element set from the word elements in each word segmentation result; for each word element in the word element set, generating an index set corresponding to the word element based on a word segmentation result of the word element; and generating the inverted index dictionary based on index sets corresponding to the respective lemmas in the lemma set.

In some embodiments, prior to generating the word-forming result, further comprising: for any word element divided from the text to be processed, if the any word element comprises capital letters, converting the capital letters into lowercase letters.

In a second aspect, embodiments of the present disclosure provide a data query device, including: a receiving unit configured to receive a data query request including a regular expression; a generating unit configured to generate a state machine graph based on the regular expression, the state machine graph including a first node representing an initial state, a second node representing a termination state, and an edge representing a lemma, the lemma being composed of a plurality of characters, a target path between the first node and the second node representing a lemma combination conforming to the regular expression; a determining unit configured to query a pre-generated inverted index dictionary based on the target path, and determine a target text satisfying the regular expression; the inverted index dictionary comprises index sets corresponding to a plurality of words respectively, wherein the index set corresponding to any word comprises inverted indexes of a plurality of items about positions, and each inverted index indicates a text identifier of a text containing the any word and the position of the any word in the text.

In a third aspect, embodiments of the present specification provide a computer-readable storage medium having stored thereon a computer program, wherein the computer program, when executed in a computer, causes the computer to perform a method as described in any of the implementations of the first aspect.

In a fourth aspect, embodiments of the present specification provide a computing device comprising a memory and a processor, wherein the memory has executable code stored therein, and wherein the processor, when executing the executable code, implements a method as described in any of the implementations of the first aspect.

In a fifth aspect, the present description provides a computer program product, wherein the computer program product, when executed in a computer, causes the computer to perform the method as described in any one of the implementations of the first aspect.

The scheme provided by the embodiment of the specification designs an inverted index dictionary, which comprises index sets corresponding to a plurality of words, wherein the index set corresponding to any word comprises inverted indexes of a plurality of items about positions, and each inverted index indicates a text identifier of a text containing the any word and the position of the any word in the text. The inverted index dictionary can be used for data queries based on regular matches. Specifically, upon receiving a data query request including a regular expression, a state machine graph may be generated based on the regular expression, the state machine graph including a first node representing an initial state, a second node representing a termination state, and edges representing tokens, the tokens being composed of a plurality of characters, a target path between the first node and the second node representing a combination of tokens conforming to the regular expression. Then, a pre-generated inverted index dictionary may be queried based on the target path to determine target text satisfying the regular expression. Therefore, the matching of the regular expression can be converted into the traversal of the graph, the dynamic state machine matching part is converted into the analysis of the graph, and the regular matching is quickened by combining the positions of the words in the inverted index dictionary, so that the efficient data query is realized.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments disclosed in the present specification, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only examples of the embodiments disclosed in the present specification, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic diagram of one application scenario in which embodiments of the present description may be applied;

FIG. 2 is a flowchart of the generation process of the inverted index dictionary in the embodiment of the present specification;

FIG. 3 is a flow chart of a data polling method in an embodiment of the present disclosure;

FIG. 4 is a schematic diagram of a state machine in an embodiment of the present description;

FIG. 5 is a flow chart of a data polling process in an embodiment of the present description;

FIG. 6 is a schematic diagram of a matching process of the target path where the edge e1 is located in the embodiment of the present specification;

fig. 7 is a schematic diagram of the structure of the data polling device in the embodiment of the present specification.

Detailed Description

The present specification is further described in detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. The described embodiments are only some of the embodiments of the present description and not all of the embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are intended to be within the scope of the present application based on the embodiments herein.

For convenience of description, only a portion related to the present invention is shown in the drawings. Embodiments and features of embodiments in this specification may be combined with each other without conflict.

As previously mentioned, many applications require searching data or capturing desired information using pattern matching, where regular expressions are a very common method of pattern matching, such as rule detection for network packets, search processing of text strings, and matching of protein sequences. In practice, a regular expression, also called regular expression (Regular Expression, often abbreviated in code as regex, regex p, or RE), is a text pattern that includes common characters (e.g., letters between a and z) and special characters (called "meta characters"), and is a concept of computer science. Regular expressions use a single string to describe, match a series of strings that meet a certain syntactic rule, and are typically used to retrieve, replace, text that meets a certain pattern (rule).

Along with the growth of the internet, the data of users is also growing rapidly, the data volume required to be processed by each application is far beyond ever today, and although the performance of hardware is greatly improved relative to ever, the software architecture is expanded from stand-alone computing to distributed computing, but the conventional regular matching of a large amount of data still needs a lot of computation power and time consumption above the minute level, and the response time is often not acceptable to users.

In addition, the data processing scene of the applications is often not simple pattern matching, but is combined with other various data retrieval and processing capabilities to meet the demands. For example, existing matching algorithms typically filter out the most likely matches of data (articles, strings, etc.) in a dataset to obtain a candidate set, and then perform conventional canonical matching on the candidate set. The core of the algorithm and the implementation is focused on the filtering process, and the filtering effect determines the advantages and disadvantages of the regular matching effect, because the regular matching is still required to be carried out on single possibly matched corpus, and the time consumption is high.

The embodiment of the specification provides a data query method, which can convert the matching of regular expressions into the traversal of graphs, convert the matching part of a dynamic state machine into the analysis of the graphs, and accelerate the regular matching by combining the positions of the words in an inverted index dictionary, thereby realizing efficient data query.

Fig. 1 is a schematic diagram of one application scenario to which the embodiments of the present description may be applied. As shown in fig. 1, the application scenario may include a user device 101 and a server 102.

The server 102 may be used to provide data query services and support data queries based on a regular matching approach. In addition, an inverted index dictionary may be included in the server 102. The inverted index dictionary comprises index sets corresponding to a plurality of words, wherein the index set corresponding to any word comprises inverted indexes of a plurality of items about positions, and each inverted index indicates a text identifier of a text containing the any word and the position of the any word in the text. Wherein, any word element is composed of N characters, N is a natural number greater than 1. In one example, N may be 2 or 3, etc., which may be set according to actual requirements, and is not specifically limited herein. The several items may refer to one or more items.

It should be noted that existing Inverted indexes (Inverted indexes), also commonly referred to as Inverted indexes, embedded documents, or Inverted documents, are an indexing method that enables mapping from a keyword to a document or group of documents that contain the keyword. Compared with the existing inverted index, the inverted index related to the position provided in the embodiment of the present disclosure indicates a document or a group of documents where a keyword is located, and also indicates the position of the keyword in the located document. It will be appreciated that the inverted index of a location in this embodiment of the present description may be referred to as an inverted index of a location with a lexeme.

In the application scenario shown in fig. 1, a user may send a data query request to the server 102 through the user device 101, where the data query request may include the regular expression regex1. Upon receiving the data query request, the server 102 may generate a state machine graph based on regex1. The state machine diagram comprises a node 0 representing an initial state, a node 1 representing a termination state, and an edge representing a word element, wherein a target path between the node 0 and the node 1 represents a word element combination conforming to regex1.

In practice, the state machine may be a short for Finite State Machine (FSM). Finite state machines, also known as Finite State Automata (FSA), are mathematical computational models that represent finite states and the behavior of transitions and actions between these states. Finite state machines can be classified into deterministic finite state machines (deterministic finite automaton, DFA) and non-deterministic finite state machines (non-deterministic finite automaton, NFA). DFA is an automaton that enables state transitions, which for a given state belonging to the automaton and a character belonging to the automaton alphabet Sigma can transition to the next state (this state may be the previous one) according to a pre-given transition function. NFA is a finite state automaton that may have multiple possible next states for each state and input symbol pair. Although DFA and NFA have different definitions, they can prove equivalent in formal theory; that is, for any given NFA, an equivalent DFA may be constructed, and vice versa: by using power set construction.

In generating the state machine diagram based on regex1, for example, as shown in fig. 1, a state machine FSM1 with single character edges may be constructed based on regex1, then FSM1 is converted into a state machine FSM2 with lemmas edges, and then FSM2 is converted into a diagram, resulting in a state machine diagram. It should be appreciated that FSM1 and FSM2 are both equivalent to regex1.

After generating the state machine graph, the server 102 may query a pre-generated reverse index dictionary based on the target path in the state machine graph to determine target text that satisfies regex1.

It should be noted that, the text corresponding to the text identifier in the inverted index dictionary described above may be included in a preset text set. The reverse index dictionary may be formed by performing N-gram word segmentation processing or the like on each text in the text set. Next, the generation process of the inverted index dictionary will be described in detail.

Referring to fig. 2, a flowchart of the generation process of the reverse index dictionary in the embodiment of the present specification is shown. The generation process may be performed by any device, platform, or cluster of devices having data storage, computing, processing capabilities, such as by the server 102 as previously described. In addition, the generation process may include steps S201 to S207 as shown below.

As shown in fig. 2, in step S201, for each text to be processed in a preset text set, word segmentation is performed on the text to be processed by using an N-element language model, so as to generate a word segmentation result; the word segmentation result comprises a word element information set corresponding to the text identification of the text to be processed, and any word element information comprises a word element formed by N characters and the position of the word element in the text to be processed.

Specifically, the analyzer of Lucene may be predefined such that the analyzer uses an N-gram word segmentation device as a core, and words are segmented in a specific manner (such as a binary word segmentation manner or a ternary word segmentation manner). The Lucene is a full-text retrieval engine tool kit of an open source code, and aims to provide a simple and easy-to-use tool kit for software developers so as to conveniently realize full-text retrieval function in a target system or establish a complete full-text retrieval engine based on the function.

It should be appreciated that the N-gram model may be referred to as an N-gram model. When n=2, the N-gram model may be referred to as a binary language model (Bi-gram model). When n=3, the N-gram model may be referred to as a ternary language model (Tri-gram model). The Bi-gram model can use two character units as a word token, for example, after the Bi-gram model is used for word segmentation of the character string abcdefg, the following word token can be obtained: ab. bc, cd, de, ef, fg. The Tri-gram model can use three character units as a word element, for example, after the Tri-gram model is used for word segmentation of the character string abcdefg, the following word elements can be obtained: abc, bcd, cde, def, efg.

Before generating the word segmentation result corresponding to the text to be processed, if any word element is segmented from the text to be processed, the word element may be converted into a lower case letter, for example, the lower case letter may be converted into a lower case letter by using a LowerCase Filter (LowerCase Filter). It should be appreciated that in such a case, no lemma containing capital letters exists in the word segmentation result.

In the following, taking n=3, text as a character string, and a text set including each character string as shown in table 1 as an example, a text word segmentation process is described.

TABLE 1

Text identification	Text of
		1	abbbd
2	abbbcd
		3	accbccccd
4	acdabd

For a string "abbbd" with text identification of 1, the following lements can be obtained by using a Tri-gram model to segment the string: abb, bbb, bbd. Then, a word segmentation result corresponding to the character string can be generated, wherein the word segmentation result comprises a text identifier 1 and a word element information set corresponding to the text identifier 1. Any of the lemma information in the lemma information set includes one of abb, bbb, and bbd, and a position of the lemma in the character string. It is noted that the position of the lemma may specifically be the position of the first character of the lemma. For example, the position of the token abb, bbb, bbd can be 0, 1, 2 in order.

For a character string "abbbcd" with a text identifier of 2, the character string is segmented by using a Tri-gram model, so that the following lemmas can be obtained: abb, bbb, bbc, bcd. Then, a word segmentation result corresponding to the character string can be generated, wherein the word segmentation result comprises a text identifier 2 and a word element information set corresponding to the text identifier 2. Any of the lemma information in the lemma information set includes one of abb, bbb, bbc and bcd, and the position of the lemma in the string. The positions of the terms abb, bbb, bbc, bcd can be 0, 1, 2 and 3 in sequence.

For a string "accbccccd" with text identification of 3, the following lements can be obtained by using a Tri-gram model to segment the string: acc, ccb, cbc, bcc, ccc, ccc, ccd. Then, a word segmentation result corresponding to the character string can be generated, wherein the word segmentation result comprises a text identifier 3 and a word element information set corresponding to the text identifier 3. Any of the lemma information in the lemma information set includes one of acc, ccb, cbc, bcc, ccc and ccd, and the position of the lemma in the string. The positions of the terms acc, ccb, cbc, bcc, ccc, ccc, ccd can be 0, 1, 2, 3, 4, 5 and 6 in sequence.

For a character string "acdabd" with text identification of 4, the character string is segmented by using a Tri-gram model, so that the following lemmas can be obtained: acd, cda, dab, abd. Then, a word segmentation result corresponding to the character string can be generated, wherein the word segmentation result comprises a text identifier 4 and a word element information set corresponding to the text identifier 4. Any of the lemma information in the lemma information set includes one of acd, cda, dab and abd, and the position of the lemma in the string. The positions of the terms acd, cda, dab, abd can be 0, 1, 2 and 3 in sequence.

In one embodiment, for any word element information in a word segmentation result corresponding to any text, the any word element information may be implemented as a binary group, where the binary group may include a word element and a position of the word element in the any text, and the position may specifically include each position of the word element in the any text. Based on this, the word segmentation results for each of the character strings in table 1 may be as shown in table 2. Wherein each record in table 2 represents a word segmentation result, and in the word element information of the word segmentation result which is implemented as a binary group, the word element is located on the left side, and the position of the word element is located on the right side. Taking the word element information (ccc, [4,5 ]) corresponding to the text identifier 3 as an example, ccc on the left side is a word element, and 4 and 5 in [4,5] on the right side are the positions of the word element ccc in the text corresponding to the text identifier 3.

TABLE 2

Text identification	Word element information set
		1	(abb，0)(bbb，1)(bbd，2)
2	(abb，0)(bbb，1)(bbc，2)(bcd，3)
		3	(acc，0)(ccb，1)(cbc，2)(bcc，3)(ccc，[4，5])(ccd，6)
4	(acd，0)(cda，1)(dab，2)(abd，3)

Next, in step S203, the tokens in each word segmentation result are formed into a set of tokens.

Specifically, each word element in each word segmentation result can be obtained, the obtained each word element is subjected to duplication removal processing, and the duplication-removed word elements form a word element set. Taking the respective word segmentation results shown in table 2 as an example, based on the respective word segmentation results, a set { abb, bbb, bbd, bbc, bcd, acc, ccb, cbc, bcc, ccd, acd, cda, dab, abd } can be generated.

Next, in step S205, for each of the lemmas in the lemma set, an index set corresponding to the lemma is generated based on the segmentation result in which the lemma is located.

Specifically, for each term in the term set, based on the word segmentation result in which the term is located, the text identifier of each text in which the term is located and one or more positions of the term in the text can be determined, and then an index set corresponding to the term can be generated according to the determination result. The index set includes inverted indexes of terms regarding locations, each of the inverted indexes indicating a text identifier of a text containing the term, and a location of the term in the text.

Further, the inverted indexes may be implemented as tuples, respectively, where the tuples include a text identifier of a text in which the term is located, and a position of the term in the text, where the position may specifically include each position of the term in the text. Further, when the respective positions are a plurality of positions, the plurality of positions may be arranged in order of the numerical values from small to large. It should be noted that, for any index set, when the inverted indexes in the index set implement the two tuples described herein, respectively, and are arranged in order of the text labels from small to large, the index set may be referred to as an inverted chain.

Continuing with the example of each of the word segmentation results shown in Table 2, after forming a set of words from the words in each of the word segmentation results, an index set as shown in Table 3 may be generated for each of the words in the set of words. Wherein, each inverted index included in the index set in table 3 is implemented as a binary group. In any inverted index implemented as a tuple, the text label is on the left and the position of the lemma is on the right. Taking inverted indexes (3, [4,5 ]) corresponding to the word element ccc as an example, the left 3 is a text identifier, and the right 4 and the right 5 in [4,5] are the positions of the word element ccc in the text corresponding to the text identifier 3.

TABLE 3 Table 3

Next, in step S207, an inverted index dictionary is generated based on the index sets corresponding to the respective lemmas in the lemma set.

The inverted index dictionary may specifically include each word element in the word element set, and an index set corresponding to each word element.

The scheme provided by the embodiment corresponding to fig. 2 can segment a text in a preset text set according to an N-element word segmentation mode, and generate a word segmentation result corresponding to the text, wherein the word segmentation result comprises a word element information set corresponding to a text identifier of the text, and any word element information comprises a word element formed by N characters and the position of the word element in the text. Then, the vocabulary elements in each vocabulary element result can be formed into a vocabulary element set, and an index set corresponding to each vocabulary element in the vocabulary element set is generated based on the vocabulary element result of the vocabulary element set. An inverted index dictionary may then be generated for use in data queries based on the respective corresponding index sets of the respective tokens in the set of tokens. Therefore, the positions of the words in the inverted index dictionary are helpful to accelerate regular matching, and therefore efficient data query is achieved.

Fig. 3 is a flowchart of a data polling method in the embodiment of the present specification. The method may be performed by any device, platform, or cluster of devices having data storage, computing, processing capabilities, such as by the server 102 as previously described. In addition, the method may include steps S301-S305 as shown below.

As shown in fig. 3, in step S301, a data query request is received, including a regular expression.

The data query request may be triggered by a user, and the data query request sent by the user equipment of the user may be received. Alternatively, the data query request may be automatically generated by the other device from which the data query request may be received.

Next, in step S303, a state machine graph is generated based on the regular expression, the state machine graph including a first node representing an initial state, a second node representing a termination state, and edges representing lemmas, the lemmas being composed of N characters, a target path between the first node and the second node representing a lemma combination conforming to the regular expression.

Specifically, a state machine graph may be generated based on the regular expression and the length N of the lemma in the inverted index dictionary described above. The state machine diagram may include a plurality of nodes and edges for connecting the nodes and representing tokens. The plurality of nodes are all state nodes and may include a first node representing an initial state, a second node representing a termination state, and the like. The first node and the second node may be referred to as a start node, an end node, etc., which are not specifically limited herein. A target path between the first node and the second node represents a word element combination conforming to the regular expression; it is understood that the state machine diagram illustrates a state machine equivalent to the regular expression. In addition, the edges in the state machine diagram are directed edges.

When the state machine graph is generated based on the regular expression, a first state machine with a single character as an edge can be firstly constructed based on the regular expression, then the first state machine is converted into a second state machine with a word element as an edge, and then the second state machine is converted into the graph, so that the state machine graph is obtained. Taking a regular expression as a (b+|c+) d, taking the length n=3 of the word element as an example, a state machine located on the left side in fig. 4 can be constructed based on the regular expression, then the state machine is converted into a state machine located on the right side in fig. 4, and then the state machine obtained by conversion is converted into a graph, so as to obtain a state machine graph. It will be appreciated that the state machine diagram can show the state machine on the right in fig. 4. Wherein fig. 4 is a schematic diagram of a state machine in an embodiment of the present disclosure. In fig. 4, node 0 represents an initial state, node 1 represents a termination state, and nodes 2 to 4 may each represent intermediate states.

Next, in step S305, the inverted index dictionary generated in advance is queried based on the target path, and the target text satisfying the regular expression is determined.

Specifically, when the state machine diagram includes an edge e2 directly connected from the first node to the second node, the index set index 2 corresponding to the lemma represented by the edge e2 may be queried in the inverted index dictionary. Thereafter, each text identifier indicated by the index set2 may be determined as a matched text identifier, so that the text corresponding to each text identifier is determined as a target text satisfying the regular expression. In addition, when a plurality of target paths in the state machine diagram respectively include a plurality of edges, for any target path in the plurality of target paths, a starting edge may be determined in the any target path, for example, an outgoing edge of a first node in the target path is taken as a starting edge, and the like, then an inverted index dictionary may be queried to obtain an index set corresponding to a word element represented by the starting edge, and text identifiers and positions in the index set corresponding to other edges in the any target path are sequentially matched based on a plurality of text identifiers and corresponding positions indicated in the index set, so as to obtain matched text identifiers, and a text corresponding to the text identifiers is determined as a target text. It should be noted that, in the case where the state machine diagram includes the edge e2, in order to improve the matching efficiency, the text identifiers may be made to not include the text identifier indicated by the IndexSet2.

In the following, the data query procedure is described taking the example that the state machine diagram is obtained by converting the state machine located on the right side in fig. 4, and the inverted index dictionary includes the respective records in table 3.

From the illustration of fig. 4, it can be appreciated that the state machine diagram includes edges <0,1> that are directly connected from node 0 to node 1. For the tokens acd and abd represented by the edges <0,1>, the index sets to which the tokens acd and abd correspond respectively can be queried from the inverted index dictionary. As shown in table 3, the index set corresponding to the token acd includes the inverted index (4, 0), and the index set corresponding to the token abd includes the inverted index (4, 3). Then, the text corresponding to the text identifier 4 indicated by the queried index set can be determined as the target text meeting the regular expression a (b+|c+) d.

In addition, from the illustration of fig. 4, it can also be appreciated that the multiple-entry label paths in the state machine diagram each include multiple edges. The multi-entry label path is:

path Path1: side <0,2> - > side <2,1>;

path Path2: side <0,2> - > side <2,2> (cycle) - > side <2,1>;

path Path3: side <0,3> - > side <3,1>;

path Path4: side <0,3> - > side <3,3> (cycle) - > side <3,1>.

For Path1, a starting edge may be determined in Path1, e.g., with the outgoing edge <0,2> of node 0 as the starting edge. Then, the index set corresponding to the lemma acc represented by the starting point edge <0,2> can be queried in the inverted index dictionary. Wherein the index set includes inverted index (3, 0) as shown in table 3. Then, the text identifiers and positions in the index set corresponding to the other edges in Path1 may be sequentially matched based on the text identifiers 3 and the corresponding positions 0. Specifically, the edge <2,1> adjacent to the starting edge <0,2> in the Path1 may be determined as the edge to be currently matched, and the index set corresponding to the character ccd represented by the edge <2,1> may be queried in the inverted index dictionary. Wherein the index set is shown in table 3, comprising inverted indexes (3, 6). By matching text identifier 3 with the text identifiers in the index set, it may be determined that the index set includes text identifier 3, and position 0 may then be matched with the position in the index set corresponding to text identifier 3. From Path1, it is known that the edge <2,1> is located to the right of the starting edge <0,2>, and the matching direction currently adopted is from left to right. If the match of edges <2,1> is successful, then there must be a position 0 plus one of the positions in the index set corresponding to text label 3, i.e. there must be a position 1. According to the matching logic, for the position 6 corresponding to the text identifier 3 in the index set, the position 6 is not the position 0 plus one position, so that the failure of matching of the edge <2,1> can be determined, and the failure of matching of the Path1 can be further determined.

For Path2, for example, the outgoing edge <0,2> of the node 0 may be used as a starting edge, and based on the text identifier 3 indicated in the index set corresponding to the word acc represented by the starting edge <0,2> and the corresponding position 0, text identifiers and positions in the index set corresponding to other edges in Path2 may be sequentially matched. Specifically, the edge <2,2> adjacent to the starting edge <0,2> in Path2 may be determined as the edge to be currently matched, and the index set corresponding to the lemma ccc represented by the edge <2,2> may be queried in the inverted index dictionary. Wherein the index set is shown in Table 3, including inverted index (3, [4,5 ]). By matching text identifier 3 with the text identifier in the index set corresponding to the token ccc, it can be determined that the index set includes text identifier 3, and then position 0 can be subsequently matched with the position in the index set corresponding to text identifier 3. From Path2, it is known that the edge <2,2> is to the right of the starting edge <0,2>, and the matching direction currently adopted is from left to right. According to the matching logic described in the foregoing, for the position 4 and the position 5 corresponding to the text identifier 3 in the index set, it can be known that neither the position 4 nor the position 5 is the position 0 plus one position, so that the edge <2,2> matching failure can be determined, and the Path2 matching failure can be determined.

For Path3, for example, the outgoing edge <0,3> of the node 0 may be used as the starting edge, and the index set corresponding to the lemma abb represented by the starting edge <0,3> may be queried in the inverted index dictionary. Wherein the index set is shown in Table 3, and comprises inverted indexes (1, 0) and (2, 0). Then, text identifiers and positions in the index set corresponding to other edges in Path3 can be sequentially matched based on the text identifiers 1 and 2 and the corresponding position 0.

Specifically, the matching may be performed based on one of the text identifiers 1 and 2 and the position corresponding to the text identifier. Taking text identifier 1 as an example, an edge <3,1> adjacent to a starting edge <0,3> in Path3 can be determined as an edge to be matched currently, and an index set corresponding to a word bbd represented by the edge <3,1> is queried in an inverted index dictionary. Wherein the index set is shown in table 3, comprising inverted indexes (1, 2). By matching text identifier 1 with the text identifiers in the index set, it may be determined that the index set includes text identifier 1, and position 0 may then be matched with the position in the index set corresponding to text identifier 1. From Path3, it is known that the edge <3,1> is located to the right of the starting edge <0,3>, and the matching direction currently adopted is from left to right. According to the matching logic described in the foregoing, for the position 2 corresponding to the text identifier 1 in the index set, it is known that the position 2 is not the position 0 plus one, so that the edge <3,1> matching failure can be determined, and the Path3 matching failure can be determined.

Path3 may then be matched based on text identification 2 and corresponding location 0. According to the foregoing description, the index set corresponding to the lemma bbd represented by the side <3,1> adjacent to the start side <0,3> in Path3 includes the inverted index (1, 2). Since the index set does not include text identification 2, it can be determined that edge <3,1> failed to match, and thus that Path3 failed to match.

For Path4, for example, the outgoing edge <0,3> of the node 0 may be used as a starting edge, and the text identifiers 1 and 2 and the corresponding positions 0 indicated in the index set corresponding to the lemma abb represented by the starting edge <0,3> may be sequentially matched with the text identifiers and the positions in the index set corresponding to the other edges in Path 4.

Specifically, taking text identifier 1 and position 0 corresponding thereto as an example, edges <3,3> adjacent to starting edges <0,3> in Path4 may be determined as edges to be matched currently, and an index set corresponding to a word bbb represented by the edges <3,3> may be queried in an inverted index dictionary. The index set is shown in table 3 and includes inverted indexes (1, 1) and (2, 1). By matching text identifier 1 with the text identifiers in the index set, it may be determined that the index set includes text identifier 1, and position 0 may then be matched with the position in the index set corresponding to text identifier 1. From Path4 it is known that the edge <3,3> is to the right of the starting edge <0,3>, and the matching direction currently adopted is from left to right. From the matching logic described in the foregoing, for position 1 in the index set corresponding to text identifier 1, it can be known that position 1 is the position 0 plus one, so that it can be determined that edge <3,3> matching was successful. Next edges to be matched may then be determined in Path 4. From the illustration of fig. 4, it can be appreciated that edge <3,3> is a circular edge, and the next edge of edge <3,3> can be itself, or edge <3,1>. Thus, the next edge to be matched can be determined among the edges <3,3> and the edges <3,1>. Taking the edge <3,1> as the next edge to be matched as an example, the edge <3,1> can be matched based on the text identifier 1 and the position 1 where the word bbb is successfully matched. According to the foregoing description, the index set corresponding to the lemma bbd represented by the side <3,1> includes the inverted index (1, 2). Since the index set includes text identifier 1 and position 2 corresponding to text identifier 1 is the position 1 plus one, it can be determined that the edge <3,1> match was successful, and thus it can be determined that the Path4 match was successful. Thus, text label 1 may be determined as a matching text label and its corresponding text determined as the target text satisfying regular expression a (b+|c+) d.

Path4 may then be matched based on text label 2 and corresponding location 0. According to the foregoing description, the index set corresponding to the lemma bbb represented by the side <3,3> includes the inverted indexes (1, 1), (2, 1). Since the index set includes text identifier 2, and position 1 corresponding to text identifier 2 is the position 0 plus one, it can be determined that the edge <3,3> match was successful. Next edges of edge <3,3>, such as edge <3,3> or edge <3,1>, can then be matched based on the text label 2 and the position 1 where the word bbb match successfully. When matching edge <3,3> again based on the position 1 where the text mark 2 and the word bbb are successfully matched, edge <3,3> may fail to match because there is no position 1 plus one position in the position corresponding to the text mark 2 in the index set corresponding to the word bbb. In addition, when matching edge <3,1> based on the position 1 where the text identifier 2 and the word bbb are successfully matched, edge <3,1> will also fail to match because the text identifier 2 does not exist in the index set corresponding to the word bbd represented by edge <3,1 >. Thus, path4 fails to match text identifier 2 with corresponding location 0 based on the index set indication corresponding to the token abb represented by the starting edge <0,3 >.

In summary, by analyzing the state machine diagram, the matched text identifier 1 and the text identifier 4 can be quickly obtained, and the respective corresponding texts are determined to be target texts meeting the regular expression a (b+|c+) d.

The scheme provided by the corresponding embodiment of fig. 3 can convert the matching of the regular expression into the traversal of the graph, convert the dynamic state machine matching part into the analysis of the graph, and accelerate the regular matching by combining the positions of the words in the inverted index dictionary, thereby realizing the efficient data query.

In addition, the existing matching algorithm described in the foregoing generally filters out those data most likely to match in the data set to obtain a candidate set, and then performs conventional regular matching on the candidate set. The algorithm actually performs a plurality of matching processes, and the time consumption is high. By reversely looking at the scheme provided by the embodiment of the specification, the query result can be obtained by combining the state machine diagram and the positions of the words in the inverted index dictionary and performing a matching process on the words, and the index efficiency is greatly improved without calculating the candidate set and performing conventional regular matching.

In practice, when a regular expression is more complex, the content of a state machine graph generated based on the regular expression is also more complex, and many paths are included. In order to improve the path matching efficiency, thereby further improving the data query efficiency, a plurality of key edges can be determined in the state machine diagram. Wherein the critical edge is capable of cutting off access to a target path formed by a plurality of edges. For any key edge in the plurality of key edges, each target path where the any key edge is located can be regarded as a group of paths, and the group of paths are matched based on a plurality of text identifiers and corresponding positions indicated by an index set corresponding to the word element represented by the any key edge by taking the any key edge as a starting point.

The key edge will be hereinafter referred to as the first edge, and a specific implementation of step S305 as described above will be described in conjunction with the data query process shown in fig. 5. Fig. 5 is a flowchart of the data polling process in the embodiment of the present specification, including steps S501 to S505 shown below.

As shown in fig. 5, in step S501, a number of first edges are determined in a state machine diagram; the first side is capable of cutting access to a target path formed by a plurality of sides.

In one embodiment, for example, an outgoing edge of a plurality of edges of the target path where each outgoing edge of the first node is located may be determined as the first edge. Or determining an incoming edge of the target path which is located in each incoming edge of the second node and is formed by a plurality of edges as a first edge.

In another embodiment, to obtain fewer first edges, thereby increasing the matching speed, a max-min-cut algorithm may be applied. The algorithm may refer to that in a directed graph, the maximum traffic that can reach a sink (terminal) from a source (source) is equal to the sum of the minimum capacities of the set of edges that can cause disruption of network flow if pruned from the graph. I.e. in any network the value of the maximum flow is equal to the capacity of the minimum cut. Since the max-min-cut algorithm is a well-known technique, it will not be described in any great detail here. Thus, a number of first edges may be determined in the state machine diagram based on a max-min cut algorithm; such that the removal of the number of first edges can cause an interruption of the network flow of the state machine graph or a sub-graph of the state machine graph after the state machine graph removes the second edge that is directly connected from the first node to the second node. Wherein the first edge may be the outgoing edge of the first node, or the incoming edge of the second node, or some intermediate edge (neither the outgoing edge of the first node nor the incoming edge of the second node) in the target path.

After determining the plurality of first sides, for a target first side of the plurality of first sides, steps S503 and S505 may be performed for the target first side. It should be noted that, any index set typically includes a large number of inverted indexes, and the inverted indexes may be arranged in order of the text labels from small to large. In order to further improve the matching efficiency, in one embodiment, the index set corresponding to the terms represented by each of the plurality of first edges may be queried in the inverted index dictionary, the minimum text identifier in the index set is determined to be a weight value corresponding to the first edge of the index set, and then the plurality of first edges are sequentially determined to be target first edges according to the order of the weight values from small to large, so that steps S503 and S505 are executed for the target first edges.

Next, in step S503, the inverted index dictionary is queried for the edge e1 among the plurality of first edges, and the index set index 1 corresponding to the lemma represented by the edge e1 is obtained.

The inverted index dictionary may specifically include a plurality of lemmas and index sets corresponding to the lemmas. For the target first edge (may be referred to as edge e 1) in the above several first edges, the inverted index dictionary may be queried based on the lemma represented by edge e1, to obtain an index set1 corresponding to the lemma.

Next, in step S505, based on the text identifiers and the corresponding positions indicated in the index set1, text identifiers and positions in the index set corresponding to other edges in the target path where the edge e1 is located are sequentially matched with each other with the edge e1 as a starting point, so as to obtain matched text identifiers, and the corresponding text is determined as a target text satisfying the regular expression.

Specifically, a matching process as shown in fig. 6 may be employed. Fig. 6 is a schematic diagram of a matching process of the target path where the edge e1 is located in the embodiment of the present disclosure. The matching process includes steps S601-S617 as shown below.

As shown in fig. 6, in step S601, a current text identifier docid_1 is determined among a plurality of text identifiers indicated by the index set IndexSet1, and a current position_1 is determined among respective positions indicated by the IndexSet1 corresponding to the docid_1.

In particular, when the several text identifications are one text identification, the text identification may be directly determined as the current text identification docid_1. When the plurality of text identifiers are a plurality of text identifiers, one text identifier can be selected randomly from the plurality of text identifiers, or the smallest text identifier is selected, and then the selected text identifier is determined to be the current text identifier docid_1. After determining docid_1, if each Position indicated by IndexSet1 corresponding to docid_1 is one Position, the Position may be directly determined as the current position_1. If the respective positions are a plurality of positions, one Position may be randomly selected from the plurality of positions, or a minimum Position may be selected, and then the selected Position may be determined as the current position_1.

Next, in step S603, text identifiers and positions in the index set corresponding to other edges in the target path where the edge e1 is located are sequentially matched based on the docid_1 and the position_1.

Specifically, for any target Path1 in each target Path where the edge e1 is located, text identifiers and positions in the index set corresponding to other edges in Path1 can be sequentially matched based on docid_1 and position_1. If the Path1 is successfully matched, it may be determined that the target Path where the edge e1 is located is successfully matched, and then step S605 may be performed next.

If the matching of the other edges in the Path1 fails, and the other target paths in the target paths all include the edges with the matching failure in the other edges, it may be determined that the matching failure of the target Path where the edge e1 is located is not found. For example, if the other edges in Path1 fail to match based on docid_1 and the other target paths all include edges that fail to match among the other edges, it may be determined that the target Path on which edge e1 is located fails to match based on docid_1, and then step S607 may be performed next. If the other edges in Path1 fail to match based on position_1 and the other target paths all include edges that fail to match in the other edges, it may be determined that the target Path in which the edge e1 is located fails to match based on position_1, and then step S613 may be performed next. In addition, if matching of other edges in the Path1 fails, and there is a target Path2 that does not include an edge that fails to match in the other edges in the target paths, text identifiers and positions in the index set corresponding to the other edges in the Path2 may be sequentially matched based on docid_1 and position_1.

Further, when matching Path1 based on DocId_1 and position_1, the edge e3 currently to be matched may be determined in Path 1. For example, when a match for Path1 is started, the edge e3 to be currently matched may be determined among the edges adjacent to edge e 1. Where the adjacent edge is an edge, the edge may be directly determined as the edge e3 to be currently matched. When the adjacent edges are multiple edges, one edge can be selected from the multiple edges, for example, one edge is selected randomly, or the edge most likely to be successfully matched is selected, and the selected edge is determined as the edge e3 to be matched currently. The most likely successful matching edge can be the edge with the most text marks in the index set corresponding to the represented word element; when selecting the most likely successfully matched edge, the index set corresponding to the word element represented by each edge can be queried in the inverted index dictionary, the target index set with the largest number of the included text identifiers is determined in each queried index set, and the edge corresponding to the target index set in the edges is selected.

After determining the current edge e3 to be matched, the corresponding index set index 3 of the word element represented by the edge e3 in the inverted index dictionary can be obtained, and the docid_1 can be searched in the index set 3. If DocId_1 is not found, it may be determined that edge e3 fails to match based on DocId_1, thereby determining that other paths including edge e3 in each of the target paths where Path1 and edge e1 are located fail to match based on DocId_1. If docid_1 is found, it may then be determined whether there is a target Position among the respective positions indicated by IndexSet3 corresponding to docid_1, the target Position being determined according to position_1 and the positional relationship of edge e1 and edge e3. For example, when the edge e3 is the i-th edge to the right of the edge e1, the target Position may be the i-added Position of position_1; when the edge e3 is the i-th edge to the left of the edge e1, the target Position may be the i-minus Position of position_1. Taking i=1 as an example, when the edge e3 is the first edge on the right side of the edge e1, that is, the right adjacent edge of the edge e1, the target Position may be the position_1 plus one Position; when the edge e3 is the first edge to the left of the edge e1, i.e., the left adjacent edge of the edge e1, the target Position may be the minus one Position of position_1. Taking i=2 as an example, when the edge e3 is the second edge on the right side of the edge e1, the target Position may be the plus two Position of position_1; when the edge e3 is the second edge to the left of the edge e1, the target Position may be the minus two Position of position_1.

If the target position is determined to exist, the successful matching of the edge e3 can be determined, and whether other edges which do not participate in Path matching exist in the Path1 is further determined. If it is determined that the other edge does not exist, it may be determined that the Path1 match is successful. If it is determined that the other edge exists, the other edge may continue to be matched. Further, after the successful matching of the edge e3 is determined, the target position of the successful matching of the edge e3 may be associated with the docid_1 for buffering, so that when the other edge includes the edge e4 adjacent to the edge e3, the edge e3 may be determined as the current edge, and further, based on the target position of the successful matching of the docid_1 and the current edge e3, the text identifier and the position in the index set4 corresponding to the word element represented by the edge e4 are matched. It should be appreciated that if edge e4 can be successfully matched, then docid_1 must be included in IndexSet4, and that there is either a plus or minus one position of the target position in each position in IndexSet4 corresponding to docid_1.

If it is determined that the target Position does not exist, it may be determined that the edge e3 fails to match based on position_1, so that it is determined that other paths including the edge e3 in the respective paths where the paths 1 and e1 exist fail to match based on position_1.

Next, in step S605, if the matching of the target path where the edge e1 is located is successful, the docid_1 is determined as the matched text identifier, and the corresponding text is determined as the target text satisfying the regular expression.

Specifically, if the matching of the target path where the edge e1 is located is successful, the docid_1 may be determined as the matched text identifier, and then the text corresponding to the docid_1 may be determined as the target text satisfying the regular expression.

In step S607, if the matching of the target path where the edge e1 is located fails based on the docid_1, it is determined whether there are other text identifiers that do not participate in the matching of the path among the text identifiers.

Specifically, when it is determined that the target path in which the edge e1 is located fails to match based on docid_1 because the index set corresponding to the other edge in the target path in which the edge e1 is located does not include docid_1, it may be determined whether there are other text identifiers that do not participate in the path matching among the above several text identifiers indicated by the index set 1. If it is determined that the other text identification exists, step S609 may be performed next. If it is determined that the other text identifier does not exist, the matching of the target path where the edge e1 is located may end.

In step S609, if it is determined that there are other text identifiers that do not participate in the path matching among the above-mentioned several text identifiers, the current text identifier docid_2 is determined among the other text identifiers, and the current position_2 is determined among the respective positions indicated by the index set1 and corresponding to the docid_2.

The determination methods of docid_2 and position_2 are similar to those of docid_1 and position_1 described above, and reference is made to the related description above, which is not repeated here.

Next, in step S611, text identifiers and positions in the index set corresponding to other edges in the target path where the edge e1 is located are sequentially matched based on the docid_2 and the position_2.

The matching process performed based on docid_2 and position_2 is similar to the matching process based on docid_1 and position_1 described above, and reference is made to the related description above, and thus will not be repeated here.

In step S613, if the target path where the edge e1 is located fails to match based on the position_1, it is determined whether there are other positions which do not participate in the path matching in the respective positions corresponding to docid_1 indicated by the index set 1.

Specifically, when it is determined that the target path in which the edge e1 is located is failed to match based on the position_1 because the other edges in the target path in which the edge e1 is located are failed to match based on the position_1, it may be determined whether there are other positions which do not participate in the path matching among the respective positions corresponding to the docid_1 indicated by the IndexSet 1. If it is determined that the other location exists, step S615 may be next performed. If it is determined that the other position does not exist, the process may proceed to step S607.

In step S615, if it is determined that there is another location that does not participate in the path matching among the locations indicated by the index set1 and corresponding to the docid_1, the current location position_3 is determined among the other locations.

The method for determining position_3 is similar to the method for determining position_1 described above, and reference is made to the description related to the above, which is not repeated here.

Next, in step S617, text identifiers and positions in the index set corresponding to other edges in the target path where the edge e1 is located are sequentially matched based on the docid_1 and the position_3.

The matching process performed based on docid_1 and position_3 is similar to the matching process based on docid_1 and position_1 described above, and reference is made to the related description above, which is not repeated here.

The method of determining a plurality of key edges in the state machine diagram, regarding each target path where any key edge is located as a group of paths, and using any key edge as a starting point, and matching the group of paths based on a plurality of text identifiers indicated by an index set corresponding to a word element represented by the any key edge and corresponding positions is described in detail above with reference to fig. 5 and 6. The method can effectively improve the path matching efficiency, thereby further improving the data query efficiency.

According to the description, the scheme provided by the embodiment of the specification can convert the matching of the regular expression into the traversal of the graph, convert the dynamic state machine matching part into the analysis of the graph, and accelerate the regular matching by combining the positions of the words in the inverted index dictionary, so that the query result can be obtained by carrying out one-time matching process on the corpus, the candidate set is not required to be calculated first, and then the conventional regular matching is carried out, the index efficiency is greatly improved, and the efficient data query is realized.

Fig. 7 is a schematic diagram of the structure of the data polling device in the embodiment of the present specification. The apparatus may be applied to any device, platform or cluster of devices having data storage, computing, processing capabilities, such as the server 102 as previously described. The apparatus may perform the methods as shown in fig. 2, 3, 5, 6, and include: a receiving unit 701 configured to receive a data query request, including a regular expression; a generating unit 702 configured to generate a state machine graph based on the regular expression, the state machine graph including a first node representing an initial state, a second node representing a termination state, and an edge representing a token, the token being composed of a plurality of characters, a target path between the first node and the second node representing a combination of tokens conforming to the regular expression; a determining unit 703 configured to query a pre-generated inverted index dictionary based on the target path, determining a target text satisfying the regular expression; the inverted index dictionary comprises index sets corresponding to a plurality of words, wherein the index set corresponding to any word comprises inverted indexes of a plurality of items about positions, and each inverted index indicates a text identifier of a text containing the any word and the position of the any word in the text.

The present embodiments also provide a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed in a computer, causes the computer to perform the method shown in fig. 2, 3, 5, 6.

Embodiments of the present disclosure also provide a computing device including a memory and a processor, where the memory stores executable code that when executed by the processor implements the methods shown in fig. 2, 3, 5, and 6.

The present description also provides a computer program product, wherein the computer program product, when executed in a computer, causes the computer to perform the method as shown in fig. 2, 3, 5, 6.

Those of skill in the art will appreciate that in one or more of the above examples, the functions described in the various embodiments disclosed herein may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.

In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

While the foregoing detailed description has described the objects, aspects and advantages of the embodiments disclosed herein in further detail, it should be understood that the foregoing detailed description is merely illustrative of the embodiments disclosed herein and is not intended to limit the scope of the embodiments disclosed herein, but rather any modifications, equivalents, improvements or the like that may be made to the embodiments disclosed herein are intended to be included within the scope of the embodiments disclosed herein.

Claims

1. A data query method, including:

Receive data query requests, including regular expressions;

A state machine graph is generated based on the regular expression, and the state machine graph includes a first node representing an initial state, a second node representing a termination state, and an edge representing a word element, where the word element is composed of a plurality of characters, The target path between the first node and the second node represents a word element combination that conforms to the regular expression;

Based on the target path, query the pre-generated inverted index dictionary to determine the target text that satisfies the regular expression; the inverted index dictionary includes an index set corresponding to multiple word elements, and an index set corresponding to any word element. It includes several inverted indexes about positions, each inverted index indicates the text identifier of the text containing the arbitrary word element, and the position of the arbitrary word element in the text.

2. The method of claim 1, wherein generating a state machine diagram based on the regular expression includes:

Based on the regular expression, construct a first state machine with a single character as an edge;

Convert the first state machine into a second state machine with word elements as edges;

Convert the second state machine into a graph to obtain the state machine graph.

3. The method of claim 1, wherein determining the target text that satisfies the regular expression includes:

Determine a number of first edges in the state machine diagram; the first edges can cut off access to a target path composed of multiple edges;

For the target first edge among the plurality of first edges, query the inverted index dictionary to obtain the first index set corresponding to the first word element represented by the target first edge;

Taking the first edge of the target as the starting point, based on several text identifiers and corresponding positions indicated in the first index set, sequentially match the text identifiers and positions in the index set corresponding to other edges in the target path where the first edge of the target is located. , obtain the matching text identifier, and determine the corresponding text as the target text.

4. The method of claim 3, wherein determining a plurality of first edges in the state machine diagram includes:

Based on the maximum flow-minimum cut algorithm, several first edges are determined in the state machine graph; so that the removal of the several first edges can cause the network flow interruption of the state machine graph or a subgraph of the state machine graph. , the subgraph is the graph of the state machine graph after removing the second edge directly connected from the first node to the second node.

5. The method of claim 3, wherein determining the target text that satisfies the regular expression further includes:

In response to the state machine graph including a second edge directly connected from the first node to the second node, query the inverted index dictionary for the second edge corresponding to the second word element represented by the second edge. second index set;

Text corresponding to each text identifier indicated by the queried second index set is determined as the target text.

6. The method of claim 5, wherein the plurality of text identifiers indicated by the first index set do not include text identifiers indicated by the second index set.

7. The method according to claim 3, wherein the index set corresponding to other edges in the target path where the first edge of the target is located is sequentially matched based on several text identifiers and corresponding positions indicated in the first index set. Text identification and location, including:

Determine the current first text identifier among the plurality of text identifiers, and determine the current first position among the respective positions corresponding to the first text identifier indicated by the first index set;

Based on the first text identifier and the first position, sequentially match the text identifiers and positions in the index set corresponding to other edges in the target path where the first edge of the target is located;

If the target path where the first side of the target is located is successfully matched, the first text identifier is determined as the matching text identifier.

8. The method according to claim 7, wherein the index set corresponding to other edges in the target path where the first edge of the target is located is sequentially matched based on several text identifiers and corresponding positions indicated in the first index set. text identification and location, also including:

If the target path where the first side of the target is located fails to match based on the first text identifier, and there are other text identifiers that do not participate in path matching among the several text identifiers, the current third text identifier is determined among the other text identifiers. two text identifiers, and determining the current second position among the positions corresponding to the second text identifier indicated by the first index set;

Based on the second text identifier and the second position, text identifiers and positions in the index set corresponding to other edges in the target path where the first edge of the target is located are matched in sequence.

9. The method according to claim 7, wherein the index set corresponding to other edges in the target path where the first edge of the target is located is sequentially matched based on several text identifiers and corresponding positions indicated in the first index set. text identification and location, also including:

If the target path where the first edge of the target is located fails to match based on the first position, and there are other positions that do not participate in path matching among the positions indicated by the first index set corresponding to the first text identifier, Then determine the current third position among the other positions;

Based on the first text identifier and the third position, text identifiers and positions in the index set corresponding to other edges in the target path where the first edge of the target is located are matched in sequence.

10. The method according to claim 7, wherein the target path where the first side of the target is located includes multiple target paths;

Based on the first text identifier and the first position, sequentially matching the text identifiers and positions in the index set corresponding to other edges in the target path where the first edge of the target is located includes:

For any first target path among the plurality of target paths, based on the first text identifier and the first position, sequentially match the text identifiers and positions in the index set corresponding to other edges in the first target path;

If the first target path is successfully matched, it is determined that the target path where the first side of the target is located is successfully matched.

11. The method according to claim 10, wherein based on the first text identifier and the first position, text in an index set corresponding to other edges in the target path where the first edge of the target is located is matched in sequence. Identity and location, also including:

If the other edges in the first target path fail to match, and the other target paths in the multiple target paths all include the edges that fail to match in the other edges, then it is determined that the target path where the first target edge is located matches. fail.

12. The method according to claim 10, wherein based on the first text identifier and the first position, text in an index set corresponding to other edges in the target path where the first edge of the target is located is matched in sequence. Identity and location, also including:

If other edges in the first target path fail to match, and there is a second target path in the plurality of target paths that does not include the edge that fails to match in the other edges, then based on the first text identification and the The first position matches text identifiers and positions in the index set corresponding to other edges in the second target path in sequence.

13. The method according to claim 10, wherein, based on the first text identifier and the first position, sequentially matching text identifiers and positions in the index set corresponding to other edges in the first target path, include:

Determine the third edge currently to be matched in the first target path;

Obtain the third index set corresponding to the third word element represented by the third side in the inverted index dictionary;

If the third index set includes the first text identifier and a target position corresponding to the first text identifier, it is determined that the third side is successfully matched; wherein the target position is based on the first position, and the positional relationship between the first side of the target and the third side;

If the third index set does not include the first text identifier or the target position, it is determined that the third side matching fails.

14. The method of claim 13, wherein determining the third edge currently to be matched in the first target path includes:

If the first target path includes multiple edges adjacent to the target first edge, then query the inverted index dictionary for an index set corresponding to the word elements each represented by the multiple edges;

In each of the queried index sets, a target index set containing the largest number of text identifiers is determined, and the edge corresponding to the target index set among the plurality of edges is determined as the third edge.

15. The method according to claim 3, said determining target text that satisfies said regular expression, further comprising:

Query in the inverted index dictionary the index set corresponding to the word elements each represented by the plurality of first sides;

For each queried index set, determine the minimum text identifier in the index set as the weight value corresponding to the first edge of the index set;

In order of weight values from small to large, the first sides are determined as the target first sides.

16. The method according to any one of claims 1-15, wherein before receiving the data query request, it further includes:

For each text to be processed in the preset text collection, an N-gram language model is used to perform word segmentation processing on the text to be processed, and a word segmentation result is generated; where N is a natural number greater than 1, and the word segmentation result includes the word segmentation result to be processed. Process the word element information set corresponding to the text identifier of the text. Any word element information includes a word element formed by N characters and the position of the word element in the text to be processed;

Form the word elements in each word segmentation result into a word element set;

For each word element in the word element set, generate an index set corresponding to the word element based on the word segmentation result where the word element is located;

The inverted index dictionary is generated based on the index set corresponding to each word element in the word element set.

17. The method according to claim 16, wherein before generating the word segmentation result, it further includes:

For any word element divided from the text to be processed, if the arbitrary word element includes uppercase letters, the uppercase letters are converted into lowercase letters.

18. A data query device, including:

a receiving unit configured to receive a data query request, including a regular expression;

A generation unit configured to generate a state machine graph based on the regular expression, the state machine graph including a first node representing an initial state, a second node representing a termination state, and an edge representing a word element, the word element being Composed of multiple characters, the target path between the first node and the second node represents a combination of word elements that conforms to the regular expression;

The determination unit is configured to query a pre-generated inverted index dictionary based on the target path and determine the target text that satisfies the regular expression; the inverted index dictionary includes an index set corresponding to a plurality of word elements, any The index set corresponding to a word element includes a number of inverted indexes regarding positions. Each inverted index indicates the text identifier of the text containing the arbitrary word element and the position of the arbitrary word element in the text.

19. A computer-readable storage medium having a computer program stored thereon, wherein when the computer program is executed in a computer, the computer is caused to perform the method of any one of claims 1-17.

20. A computing device, including a memory and a processor, wherein executable code is stored in the memory, and when the processor executes the executable code, the method of any one of claims 1-17 is implemented. method.