CN117763077A - Data query method and device - Google Patents

Data query method and device Download PDF

Info

Publication number
CN117763077A
CN117763077A CN202311829247.XA CN202311829247A CN117763077A CN 117763077 A CN117763077 A CN 117763077A CN 202311829247 A CN202311829247 A CN 202311829247A CN 117763077 A CN117763077 A CN 117763077A
Authority
CN
China
Prior art keywords
text
target
edge
index set
edges
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311829247.XA
Other languages
Chinese (zh)
Inventor
卢轩
张亚军
刘志伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd filed Critical Alipay Hangzhou Information Technology Co Ltd
Priority to CN202311829247.XA priority Critical patent/CN117763077A/en
Publication of CN117763077A publication Critical patent/CN117763077A/en
Pending legal-status Critical Current

Links

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the specification provides a data query method and device, wherein the method comprises the following steps: receiving a data query request, wherein the data query request comprises a regular expression; generating a state machine diagram based on the regular expression, wherein the state machine diagram comprises a first node representing an initial state, a second node representing a termination state and an edge representing a word element, the word element consists of a plurality of characters, and a target path between the first node and the second node represents a word element combination conforming to the regular expression; inquiring a pre-generated inverted index dictionary based on a target path, and determining a target text meeting the regular expression; the inverted index dictionary comprises index sets corresponding to a plurality of words, wherein the index set corresponding to any word comprises inverted indexes of a plurality of items about positions, and each inverted index indicates a text identifier of a text containing the any word and the position of the any word in the text.

Description

Data query method and device
Technical Field
The embodiment of the specification relates to the technical field of computers, in particular to a data query method and device.
Background
Many applications require searching of data or use pattern matching to capture the required information, with regular expressions being a very popular method of pattern matching such as rule detection for network packets, search processing of text strings, and matching of protein sequences. However, with the growth of the internet, the data of users is also growing rapidly, the data volume required to be processed by each application is far more than ever nowadays, although the performance of hardware is greatly improved than ever, the software architecture is expanded from stand-alone computing to distributed computing, but the conventional regular matching of a large amount of data still needs a lot of computation effort and time consumption above the minute level, and the response time is often not acceptable to users.
Therefore, a reasonable and reliable scheme is urgently needed, and efficient data query can be achieved.
Disclosure of Invention
The embodiment of the specification provides a data query scheme which can realize efficient data query.
In a first aspect, an embodiment of the present disclosure provides a data query method, including: receiving a data query request, wherein the data query request comprises a regular expression; generating a state machine diagram based on the regular expression, wherein the state machine diagram comprises a first node representing an initial state, a second node representing a termination state and an edge representing a word element, the word element consists of a plurality of characters, and a target path between the first node and the second node represents a word element combination conforming to the regular expression; inquiring a pre-generated inverted index dictionary based on the target path, and determining a target text meeting the regular expression; the inverted index dictionary comprises index sets corresponding to a plurality of words respectively, wherein the index set corresponding to any word comprises inverted indexes of a plurality of items about positions, and each inverted index indicates a text identifier of a text containing the any word and the position of the any word in the text.
In some embodiments, the generating a state machine graph based on the regular expression includes: constructing a first state machine with single characters as edges based on the regular expression; converting the first state machine into a second state machine with a word element as an edge; and converting the second state machine into a graph to obtain the state machine graph.
In some embodiments, the determining the target text that satisfies the regular expression includes: determining a plurality of first edges in the state machine diagram; the first side is capable of cutting off access to a target path formed by a plurality of sides; for a target first edge in the plurality of first edges, inquiring the inverted index dictionary to obtain a first index set corresponding to a first word element represented by the target first edge; and sequentially matching text identifiers and positions in the index set corresponding to other edges in a target path where the target first edge is located based on a plurality of text identifiers and corresponding positions indicated in the first index set by taking the target first edge as a starting point to obtain matched text identifiers, and determining corresponding texts as the target texts.
In some embodiments, the determining a number of first edges in the state machine diagram includes: determining a plurality of first edges in the state machine diagram based on a max-min cut algorithm; such that the removal of the number of first edges can cause a disruption in network flow of the state machine graph or a sub-graph of the state machine graph that is a graph of the state machine graph after the removal of a second edge that is directly connected from the first node to the second node.
In some embodiments, the determining the target text that satisfies the regular expression further comprises: in response to the state machine graph including a second edge directly connected from the first node to the second node, querying the inverted index dictionary for a second index set corresponding to a second lemma represented by the second edge; and determining the text corresponding to each text identifier indicated by the queried second index set as the target text.
In some embodiments, the number of text identifications indicated by the first set of indices does not include the text identification indicated by the second set of indices.
In some embodiments, the sequentially matching text identifiers and positions in the index set corresponding to other edges in the target path where the target first edge is located based on the text identifiers and corresponding positions indicated in the first index set includes: determining a current first text identifier in the text identifiers, and determining a current first position in each position indicated by the first index set and corresponding to the first text identifier; sequentially matching text identifiers and positions in an index set corresponding to other edges in a target path where the target first edge is located based on the first text identifier and the first position; and if the target path where the target first edge is located is successfully matched, determining the first text identifier as a matched text identifier.
In some embodiments, the sequentially matching text identifiers and positions in the index set corresponding to other edges in the target path where the target first edge is located based on the text identifiers and corresponding positions indicated in the first index set further includes: if the target path of the target first side is failed to be matched based on the first text identifier, and other text identifiers which do not participate in path matching exist in the text identifiers, determining a current second text identifier in the other text identifiers, and determining a current second position in each position indicated by the first index set and corresponding to the second text identifier; and sequentially matching text identifiers and positions in the index set corresponding to other edges in the target path where the target first edge is located based on the second text identifiers and the second positions.
In some embodiments, the sequentially matching text identifiers and positions in the index set corresponding to other edges in the target path where the target first edge is located based on the text identifiers and corresponding positions indicated in the first index set further includes: if the target path where the target first edge is located fails to match based on the first position, and other positions which do not participate in path matching exist in the positions indicated by the first index set and correspond to the first text mark, determining a current third position in the other positions; and sequentially matching text identifiers and positions in the index set corresponding to other edges in the target path where the target first edge is located based on the first text identifier and the third position.
In some embodiments, the target path on which the target first edge is located includes a multi-entry target path; the sequentially matching text identifiers and positions in the index set corresponding to other edges in the target path where the target first edge is located based on the first text identifier and the first position includes: for any first target path in the multi-item target path, sequentially matching text identifiers and positions in index sets corresponding to other edges in the first target path based on the first text identifiers and the first positions; and if the first target path is successfully matched, determining that the target path where the target first edge is located is successfully matched.
In some embodiments, the sequentially matching text identifiers and positions in the index set corresponding to other edges in the target path where the target first edge is located based on the first text identifier and the first position further includes: and if the matching of other edges in the first target path fails, and the other target paths in the multi-entry target path all comprise edges with failed matching in the other edges, determining that the matching of the target path where the target first edge is located fails.
In some embodiments, the sequentially matching text identifiers and positions in the index set corresponding to other edges in the target path where the target first edge is located based on the first text identifier and the first position further includes: and if the matching of other edges in the first target path fails, and a second target path which does not comprise the edges with the failed matching in the other edges exists in the multi-item target path, sequentially matching the text identifiers and the positions in the index set corresponding to the other edges in the second target path based on the first text identifiers and the first positions.
In some embodiments, the sequentially matching text identifiers and positions in the index set corresponding to other edges in the first target path based on the first text identifier and the first position includes: determining a third side to be matched currently in the first target path; acquiring a third index set corresponding to a third word element represented by the third side in the inverted index dictionary; if the third index set comprises the first text identifier and a target position corresponding to the first text identifier, determining that the third edge matching is successful; wherein the target position is determined according to the first position and the positional relationship between the first side and the third side of the target; and if the third index set does not comprise the first text identifier or the target position, determining that the third edge matching fails.
In some embodiments, the determining the third edge to be currently matched in the first target path includes: if the first target path comprises a plurality of edges adjacent to the target first edge, querying an index set corresponding to the word elements represented by the edges in the inverted index dictionary; and determining a target index set with the largest number of the included text identifiers in each queried index set, and determining an edge corresponding to the target index set in the plurality of edges as the third edge.
In some embodiments, the determining the target text that satisfies the regular expression further comprises: querying index sets corresponding to the word elements respectively represented by the plurality of first sides in the inverted index dictionary; for each index set queried, determining a minimum text identification in the index set as a weight value corresponding to a first edge of the index set; and sequentially determining the first sides as the target first sides according to the sequence of the weight values from small to large.
In some embodiments, prior to receiving the data query request, further comprising: for each text to be processed in a preset text set, performing word segmentation on the text to be processed by using an N-element language model to generate a word segmentation result; wherein N is a natural number greater than 1, the word segmentation result comprises a word element information set corresponding to the text identifier of the text to be processed, and any word element information comprises word elements formed by N characters and the positions of the word elements in the text to be processed; forming a word element set from the word elements in each word segmentation result; for each word element in the word element set, generating an index set corresponding to the word element based on a word segmentation result of the word element; and generating the inverted index dictionary based on index sets corresponding to the respective lemmas in the lemma set.
In some embodiments, prior to generating the word-forming result, further comprising: for any word element divided from the text to be processed, if the any word element comprises capital letters, converting the capital letters into lowercase letters.
In a second aspect, embodiments of the present disclosure provide a data query device, including: a receiving unit configured to receive a data query request including a regular expression; a generating unit configured to generate a state machine graph based on the regular expression, the state machine graph including a first node representing an initial state, a second node representing a termination state, and an edge representing a lemma, the lemma being composed of a plurality of characters, a target path between the first node and the second node representing a lemma combination conforming to the regular expression; a determining unit configured to query a pre-generated inverted index dictionary based on the target path, and determine a target text satisfying the regular expression; the inverted index dictionary comprises index sets corresponding to a plurality of words respectively, wherein the index set corresponding to any word comprises inverted indexes of a plurality of items about positions, and each inverted index indicates a text identifier of a text containing the any word and the position of the any word in the text.
In a third aspect, embodiments of the present specification provide a computer-readable storage medium having stored thereon a computer program, wherein the computer program, when executed in a computer, causes the computer to perform a method as described in any of the implementations of the first aspect.
In a fourth aspect, embodiments of the present specification provide a computing device comprising a memory and a processor, wherein the memory has executable code stored therein, and wherein the processor, when executing the executable code, implements a method as described in any of the implementations of the first aspect.
In a fifth aspect, the present description provides a computer program product, wherein the computer program product, when executed in a computer, causes the computer to perform the method as described in any one of the implementations of the first aspect.
The scheme provided by the embodiment of the specification designs an inverted index dictionary, which comprises index sets corresponding to a plurality of words, wherein the index set corresponding to any word comprises inverted indexes of a plurality of items about positions, and each inverted index indicates a text identifier of a text containing the any word and the position of the any word in the text. The inverted index dictionary can be used for data queries based on regular matches. Specifically, upon receiving a data query request including a regular expression, a state machine graph may be generated based on the regular expression, the state machine graph including a first node representing an initial state, a second node representing a termination state, and edges representing tokens, the tokens being composed of a plurality of characters, a target path between the first node and the second node representing a combination of tokens conforming to the regular expression. Then, a pre-generated inverted index dictionary may be queried based on the target path to determine target text satisfying the regular expression. Therefore, the matching of the regular expression can be converted into the traversal of the graph, the dynamic state machine matching part is converted into the analysis of the graph, and the regular matching is quickened by combining the positions of the words in the inverted index dictionary, so that the efficient data query is realized.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments disclosed in the present specification, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only examples of the embodiments disclosed in the present specification, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of one application scenario in which embodiments of the present description may be applied;
FIG. 2 is a flowchart of the generation process of the inverted index dictionary in the embodiment of the present specification;
FIG. 3 is a flow chart of a data polling method in an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a state machine in an embodiment of the present description;
FIG. 5 is a flow chart of a data polling process in an embodiment of the present description;
FIG. 6 is a schematic diagram of a matching process of the target path where the edge e1 is located in the embodiment of the present specification;
fig. 7 is a schematic diagram of the structure of the data polling device in the embodiment of the present specification.
Detailed Description
The present specification is further described in detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. The described embodiments are only some of the embodiments of the present description and not all of the embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are intended to be within the scope of the present application based on the embodiments herein.
For convenience of description, only a portion related to the present invention is shown in the drawings. Embodiments and features of embodiments in this specification may be combined with each other without conflict.
As previously mentioned, many applications require searching data or capturing desired information using pattern matching, where regular expressions are a very common method of pattern matching, such as rule detection for network packets, search processing of text strings, and matching of protein sequences. In practice, a regular expression, also called regular expression (Regular Expression, often abbreviated in code as regex, regex p, or RE), is a text pattern that includes common characters (e.g., letters between a and z) and special characters (called "meta characters"), and is a concept of computer science. Regular expressions use a single string to describe, match a series of strings that meet a certain syntactic rule, and are typically used to retrieve, replace, text that meets a certain pattern (rule).
Along with the growth of the internet, the data of users is also growing rapidly, the data volume required to be processed by each application is far beyond ever today, and although the performance of hardware is greatly improved relative to ever, the software architecture is expanded from stand-alone computing to distributed computing, but the conventional regular matching of a large amount of data still needs a lot of computation power and time consumption above the minute level, and the response time is often not acceptable to users.
In addition, the data processing scene of the applications is often not simple pattern matching, but is combined with other various data retrieval and processing capabilities to meet the demands. For example, existing matching algorithms typically filter out the most likely matches of data (articles, strings, etc.) in a dataset to obtain a candidate set, and then perform conventional canonical matching on the candidate set. The core of the algorithm and the implementation is focused on the filtering process, and the filtering effect determines the advantages and disadvantages of the regular matching effect, because the regular matching is still required to be carried out on single possibly matched corpus, and the time consumption is high.
The embodiment of the specification provides a data query method, which can convert the matching of regular expressions into the traversal of graphs, convert the matching part of a dynamic state machine into the analysis of the graphs, and accelerate the regular matching by combining the positions of the words in an inverted index dictionary, thereby realizing efficient data query.
Fig. 1 is a schematic diagram of one application scenario to which the embodiments of the present description may be applied. As shown in fig. 1, the application scenario may include a user device 101 and a server 102.
The server 102 may be used to provide data query services and support data queries based on a regular matching approach. In addition, an inverted index dictionary may be included in the server 102. The inverted index dictionary comprises index sets corresponding to a plurality of words, wherein the index set corresponding to any word comprises inverted indexes of a plurality of items about positions, and each inverted index indicates a text identifier of a text containing the any word and the position of the any word in the text. Wherein, any word element is composed of N characters, N is a natural number greater than 1. In one example, N may be 2 or 3, etc., which may be set according to actual requirements, and is not specifically limited herein. The several items may refer to one or more items.
It should be noted that existing Inverted indexes (Inverted indexes), also commonly referred to as Inverted indexes, embedded documents, or Inverted documents, are an indexing method that enables mapping from a keyword to a document or group of documents that contain the keyword. Compared with the existing inverted index, the inverted index related to the position provided in the embodiment of the present disclosure indicates a document or a group of documents where a keyword is located, and also indicates the position of the keyword in the located document. It will be appreciated that the inverted index of a location in this embodiment of the present description may be referred to as an inverted index of a location with a lexeme.
In the application scenario shown in fig. 1, a user may send a data query request to the server 102 through the user device 101, where the data query request may include the regular expression regex1. Upon receiving the data query request, the server 102 may generate a state machine graph based on regex1. The state machine diagram comprises a node 0 representing an initial state, a node 1 representing a termination state, and an edge representing a word element, wherein a target path between the node 0 and the node 1 represents a word element combination conforming to regex1.
In practice, the state machine may be a short for Finite State Machine (FSM). Finite state machines, also known as Finite State Automata (FSA), are mathematical computational models that represent finite states and the behavior of transitions and actions between these states. Finite state machines can be classified into deterministic finite state machines (deterministic finite automaton, DFA) and non-deterministic finite state machines (non-deterministic finite automaton, NFA). DFA is an automaton that enables state transitions, which for a given state belonging to the automaton and a character belonging to the automaton alphabet Sigma can transition to the next state (this state may be the previous one) according to a pre-given transition function. NFA is a finite state automaton that may have multiple possible next states for each state and input symbol pair. Although DFA and NFA have different definitions, they can prove equivalent in formal theory; that is, for any given NFA, an equivalent DFA may be constructed, and vice versa: by using power set construction.
In generating the state machine diagram based on regex1, for example, as shown in fig. 1, a state machine FSM1 with single character edges may be constructed based on regex1, then FSM1 is converted into a state machine FSM2 with lemmas edges, and then FSM2 is converted into a diagram, resulting in a state machine diagram. It should be appreciated that FSM1 and FSM2 are both equivalent to regex1.
After generating the state machine graph, the server 102 may query a pre-generated reverse index dictionary based on the target path in the state machine graph to determine target text that satisfies regex1.
It should be noted that, the text corresponding to the text identifier in the inverted index dictionary described above may be included in a preset text set. The reverse index dictionary may be formed by performing N-gram word segmentation processing or the like on each text in the text set. Next, the generation process of the inverted index dictionary will be described in detail.
Referring to fig. 2, a flowchart of the generation process of the reverse index dictionary in the embodiment of the present specification is shown. The generation process may be performed by any device, platform, or cluster of devices having data storage, computing, processing capabilities, such as by the server 102 as previously described. In addition, the generation process may include steps S201 to S207 as shown below.
As shown in fig. 2, in step S201, for each text to be processed in a preset text set, word segmentation is performed on the text to be processed by using an N-element language model, so as to generate a word segmentation result; the word segmentation result comprises a word element information set corresponding to the text identification of the text to be processed, and any word element information comprises a word element formed by N characters and the position of the word element in the text to be processed.
Specifically, the analyzer of Lucene may be predefined such that the analyzer uses an N-gram word segmentation device as a core, and words are segmented in a specific manner (such as a binary word segmentation manner or a ternary word segmentation manner). The Lucene is a full-text retrieval engine tool kit of an open source code, and aims to provide a simple and easy-to-use tool kit for software developers so as to conveniently realize full-text retrieval function in a target system or establish a complete full-text retrieval engine based on the function.
It should be appreciated that the N-gram model may be referred to as an N-gram model. When n=2, the N-gram model may be referred to as a binary language model (Bi-gram model). When n=3, the N-gram model may be referred to as a ternary language model (Tri-gram model). The Bi-gram model can use two character units as a word token, for example, after the Bi-gram model is used for word segmentation of the character string abcdefg, the following word token can be obtained: ab. bc, cd, de, ef, fg. The Tri-gram model can use three character units as a word element, for example, after the Tri-gram model is used for word segmentation of the character string abcdefg, the following word elements can be obtained: abc, bcd, cde, def, efg.
Before generating the word segmentation result corresponding to the text to be processed, if any word element is segmented from the text to be processed, the word element may be converted into a lower case letter, for example, the lower case letter may be converted into a lower case letter by using a LowerCase Filter (LowerCase Filter). It should be appreciated that in such a case, no lemma containing capital letters exists in the word segmentation result.
In the following, taking n=3, text as a character string, and a text set including each character string as shown in table 1 as an example, a text word segmentation process is described.
TABLE 1
Text identification Text of
1 abbbd
2 abbbcd
3 accbccccd
4 acdabd
For a string "abbbd" with text identification of 1, the following lements can be obtained by using a Tri-gram model to segment the string: abb, bbb, bbd. Then, a word segmentation result corresponding to the character string can be generated, wherein the word segmentation result comprises a text identifier 1 and a word element information set corresponding to the text identifier 1. Any of the lemma information in the lemma information set includes one of abb, bbb, and bbd, and a position of the lemma in the character string. It is noted that the position of the lemma may specifically be the position of the first character of the lemma. For example, the position of the token abb, bbb, bbd can be 0, 1, 2 in order.
For a character string "abbbcd" with a text identifier of 2, the character string is segmented by using a Tri-gram model, so that the following lemmas can be obtained: abb, bbb, bbc, bcd. Then, a word segmentation result corresponding to the character string can be generated, wherein the word segmentation result comprises a text identifier 2 and a word element information set corresponding to the text identifier 2. Any of the lemma information in the lemma information set includes one of abb, bbb, bbc and bcd, and the position of the lemma in the string. The positions of the terms abb, bbb, bbc, bcd can be 0, 1, 2 and 3 in sequence.
For a string "accbccccd" with text identification of 3, the following lements can be obtained by using a Tri-gram model to segment the string: acc, ccb, cbc, bcc, ccc, ccc, ccd. Then, a word segmentation result corresponding to the character string can be generated, wherein the word segmentation result comprises a text identifier 3 and a word element information set corresponding to the text identifier 3. Any of the lemma information in the lemma information set includes one of acc, ccb, cbc, bcc, ccc and ccd, and the position of the lemma in the string. The positions of the terms acc, ccb, cbc, bcc, ccc, ccc, ccd can be 0, 1, 2, 3, 4, 5 and 6 in sequence.
For a character string "acdabd" with text identification of 4, the character string is segmented by using a Tri-gram model, so that the following lemmas can be obtained: acd, cda, dab, abd. Then, a word segmentation result corresponding to the character string can be generated, wherein the word segmentation result comprises a text identifier 4 and a word element information set corresponding to the text identifier 4. Any of the lemma information in the lemma information set includes one of acd, cda, dab and abd, and the position of the lemma in the string. The positions of the terms acd, cda, dab, abd can be 0, 1, 2 and 3 in sequence.
In one embodiment, for any word element information in a word segmentation result corresponding to any text, the any word element information may be implemented as a binary group, where the binary group may include a word element and a position of the word element in the any text, and the position may specifically include each position of the word element in the any text. Based on this, the word segmentation results for each of the character strings in table 1 may be as shown in table 2. Wherein each record in table 2 represents a word segmentation result, and in the word element information of the word segmentation result which is implemented as a binary group, the word element is located on the left side, and the position of the word element is located on the right side. Taking the word element information (ccc, [4,5 ]) corresponding to the text identifier 3 as an example, ccc on the left side is a word element, and 4 and 5 in [4,5] on the right side are the positions of the word element ccc in the text corresponding to the text identifier 3.
TABLE 2
Text identification Word element information set
1 (abb,0)(bbb,1)(bbd,2)
2 (abb,0)(bbb,1)(bbc,2)(bcd,3)
3 (acc,0)(ccb,1)(cbc,2)(bcc,3)(ccc,[4,5])(ccd,6)
4 (acd,0)(cda,1)(dab,2)(abd,3)
Next, in step S203, the tokens in each word segmentation result are formed into a set of tokens.
Specifically, each word element in each word segmentation result can be obtained, the obtained each word element is subjected to duplication removal processing, and the duplication-removed word elements form a word element set. Taking the respective word segmentation results shown in table 2 as an example, based on the respective word segmentation results, a set { abb, bbb, bbd, bbc, bcd, acc, ccb, cbc, bcc, ccd, acd, cda, dab, abd } can be generated.
Next, in step S205, for each of the lemmas in the lemma set, an index set corresponding to the lemma is generated based on the segmentation result in which the lemma is located.
Specifically, for each term in the term set, based on the word segmentation result in which the term is located, the text identifier of each text in which the term is located and one or more positions of the term in the text can be determined, and then an index set corresponding to the term can be generated according to the determination result. The index set includes inverted indexes of terms regarding locations, each of the inverted indexes indicating a text identifier of a text containing the term, and a location of the term in the text.
Further, the inverted indexes may be implemented as tuples, respectively, where the tuples include a text identifier of a text in which the term is located, and a position of the term in the text, where the position may specifically include each position of the term in the text. Further, when the respective positions are a plurality of positions, the plurality of positions may be arranged in order of the numerical values from small to large. It should be noted that, for any index set, when the inverted indexes in the index set implement the two tuples described herein, respectively, and are arranged in order of the text labels from small to large, the index set may be referred to as an inverted chain.
Continuing with the example of each of the word segmentation results shown in Table 2, after forming a set of words from the words in each of the word segmentation results, an index set as shown in Table 3 may be generated for each of the words in the set of words. Wherein, each inverted index included in the index set in table 3 is implemented as a binary group. In any inverted index implemented as a tuple, the text label is on the left and the position of the lemma is on the right. Taking inverted indexes (3, [4,5 ]) corresponding to the word element ccc as an example, the left 3 is a text identifier, and the right 4 and the right 5 in [4,5] are the positions of the word element ccc in the text corresponding to the text identifier 3.
TABLE 3 Table 3
Next, in step S207, an inverted index dictionary is generated based on the index sets corresponding to the respective lemmas in the lemma set.
The inverted index dictionary may specifically include each word element in the word element set, and an index set corresponding to each word element.
The scheme provided by the embodiment corresponding to fig. 2 can segment a text in a preset text set according to an N-element word segmentation mode, and generate a word segmentation result corresponding to the text, wherein the word segmentation result comprises a word element information set corresponding to a text identifier of the text, and any word element information comprises a word element formed by N characters and the position of the word element in the text. Then, the vocabulary elements in each vocabulary element result can be formed into a vocabulary element set, and an index set corresponding to each vocabulary element in the vocabulary element set is generated based on the vocabulary element result of the vocabulary element set. An inverted index dictionary may then be generated for use in data queries based on the respective corresponding index sets of the respective tokens in the set of tokens. Therefore, the positions of the words in the inverted index dictionary are helpful to accelerate regular matching, and therefore efficient data query is achieved.
Fig. 3 is a flowchart of a data polling method in the embodiment of the present specification. The method may be performed by any device, platform, or cluster of devices having data storage, computing, processing capabilities, such as by the server 102 as previously described. In addition, the method may include steps S301-S305 as shown below.
As shown in fig. 3, in step S301, a data query request is received, including a regular expression.
The data query request may be triggered by a user, and the data query request sent by the user equipment of the user may be received. Alternatively, the data query request may be automatically generated by the other device from which the data query request may be received.
Next, in step S303, a state machine graph is generated based on the regular expression, the state machine graph including a first node representing an initial state, a second node representing a termination state, and edges representing lemmas, the lemmas being composed of N characters, a target path between the first node and the second node representing a lemma combination conforming to the regular expression.
Specifically, a state machine graph may be generated based on the regular expression and the length N of the lemma in the inverted index dictionary described above. The state machine diagram may include a plurality of nodes and edges for connecting the nodes and representing tokens. The plurality of nodes are all state nodes and may include a first node representing an initial state, a second node representing a termination state, and the like. The first node and the second node may be referred to as a start node, an end node, etc., which are not specifically limited herein. A target path between the first node and the second node represents a word element combination conforming to the regular expression; it is understood that the state machine diagram illustrates a state machine equivalent to the regular expression. In addition, the edges in the state machine diagram are directed edges.
When the state machine graph is generated based on the regular expression, a first state machine with a single character as an edge can be firstly constructed based on the regular expression, then the first state machine is converted into a second state machine with a word element as an edge, and then the second state machine is converted into the graph, so that the state machine graph is obtained. Taking a regular expression as a (b+|c+) d, taking the length n=3 of the word element as an example, a state machine located on the left side in fig. 4 can be constructed based on the regular expression, then the state machine is converted into a state machine located on the right side in fig. 4, and then the state machine obtained by conversion is converted into a graph, so as to obtain a state machine graph. It will be appreciated that the state machine diagram can show the state machine on the right in fig. 4. Wherein fig. 4 is a schematic diagram of a state machine in an embodiment of the present disclosure. In fig. 4, node 0 represents an initial state, node 1 represents a termination state, and nodes 2 to 4 may each represent intermediate states.
Next, in step S305, the inverted index dictionary generated in advance is queried based on the target path, and the target text satisfying the regular expression is determined.
Specifically, when the state machine diagram includes an edge e2 directly connected from the first node to the second node, the index set index 2 corresponding to the lemma represented by the edge e2 may be queried in the inverted index dictionary. Thereafter, each text identifier indicated by the index set2 may be determined as a matched text identifier, so that the text corresponding to each text identifier is determined as a target text satisfying the regular expression. In addition, when a plurality of target paths in the state machine diagram respectively include a plurality of edges, for any target path in the plurality of target paths, a starting edge may be determined in the any target path, for example, an outgoing edge of a first node in the target path is taken as a starting edge, and the like, then an inverted index dictionary may be queried to obtain an index set corresponding to a word element represented by the starting edge, and text identifiers and positions in the index set corresponding to other edges in the any target path are sequentially matched based on a plurality of text identifiers and corresponding positions indicated in the index set, so as to obtain matched text identifiers, and a text corresponding to the text identifiers is determined as a target text. It should be noted that, in the case where the state machine diagram includes the edge e2, in order to improve the matching efficiency, the text identifiers may be made to not include the text identifier indicated by the IndexSet2.
In the following, the data query procedure is described taking the example that the state machine diagram is obtained by converting the state machine located on the right side in fig. 4, and the inverted index dictionary includes the respective records in table 3.
From the illustration of fig. 4, it can be appreciated that the state machine diagram includes edges <0,1> that are directly connected from node 0 to node 1. For the tokens acd and abd represented by the edges <0,1>, the index sets to which the tokens acd and abd correspond respectively can be queried from the inverted index dictionary. As shown in table 3, the index set corresponding to the token acd includes the inverted index (4, 0), and the index set corresponding to the token abd includes the inverted index (4, 3). Then, the text corresponding to the text identifier 4 indicated by the queried index set can be determined as the target text meeting the regular expression a (b+|c+) d.
In addition, from the illustration of fig. 4, it can also be appreciated that the multiple-entry label paths in the state machine diagram each include multiple edges. The multi-entry label path is:
path Path1: side <0,2> - > side <2,1>;
path Path2: side <0,2> - > side <2,2> (cycle) - > side <2,1>;
path Path3: side <0,3> - > side <3,1>;
path Path4: side <0,3> - > side <3,3> (cycle) - > side <3,1>.
For Path1, a starting edge may be determined in Path1, e.g., with the outgoing edge <0,2> of node 0 as the starting edge. Then, the index set corresponding to the lemma acc represented by the starting point edge <0,2> can be queried in the inverted index dictionary. Wherein the index set includes inverted index (3, 0) as shown in table 3. Then, the text identifiers and positions in the index set corresponding to the other edges in Path1 may be sequentially matched based on the text identifiers 3 and the corresponding positions 0. Specifically, the edge <2,1> adjacent to the starting edge <0,2> in the Path1 may be determined as the edge to be currently matched, and the index set corresponding to the character ccd represented by the edge <2,1> may be queried in the inverted index dictionary. Wherein the index set is shown in table 3, comprising inverted indexes (3, 6). By matching text identifier 3 with the text identifiers in the index set, it may be determined that the index set includes text identifier 3, and position 0 may then be matched with the position in the index set corresponding to text identifier 3. From Path1, it is known that the edge <2,1> is located to the right of the starting edge <0,2>, and the matching direction currently adopted is from left to right. If the match of edges <2,1> is successful, then there must be a position 0 plus one of the positions in the index set corresponding to text label 3, i.e. there must be a position 1. According to the matching logic, for the position 6 corresponding to the text identifier 3 in the index set, the position 6 is not the position 0 plus one position, so that the failure of matching of the edge <2,1> can be determined, and the failure of matching of the Path1 can be further determined.
For Path2, for example, the outgoing edge <0,2> of the node 0 may be used as a starting edge, and based on the text identifier 3 indicated in the index set corresponding to the word acc represented by the starting edge <0,2> and the corresponding position 0, text identifiers and positions in the index set corresponding to other edges in Path2 may be sequentially matched. Specifically, the edge <2,2> adjacent to the starting edge <0,2> in Path2 may be determined as the edge to be currently matched, and the index set corresponding to the lemma ccc represented by the edge <2,2> may be queried in the inverted index dictionary. Wherein the index set is shown in Table 3, including inverted index (3, [4,5 ]). By matching text identifier 3 with the text identifier in the index set corresponding to the token ccc, it can be determined that the index set includes text identifier 3, and then position 0 can be subsequently matched with the position in the index set corresponding to text identifier 3. From Path2, it is known that the edge <2,2> is to the right of the starting edge <0,2>, and the matching direction currently adopted is from left to right. According to the matching logic described in the foregoing, for the position 4 and the position 5 corresponding to the text identifier 3 in the index set, it can be known that neither the position 4 nor the position 5 is the position 0 plus one position, so that the edge <2,2> matching failure can be determined, and the Path2 matching failure can be determined.
For Path3, for example, the outgoing edge <0,3> of the node 0 may be used as the starting edge, and the index set corresponding to the lemma abb represented by the starting edge <0,3> may be queried in the inverted index dictionary. Wherein the index set is shown in Table 3, and comprises inverted indexes (1, 0) and (2, 0). Then, text identifiers and positions in the index set corresponding to other edges in Path3 can be sequentially matched based on the text identifiers 1 and 2 and the corresponding position 0.
Specifically, the matching may be performed based on one of the text identifiers 1 and 2 and the position corresponding to the text identifier. Taking text identifier 1 as an example, an edge <3,1> adjacent to a starting edge <0,3> in Path3 can be determined as an edge to be matched currently, and an index set corresponding to a word bbd represented by the edge <3,1> is queried in an inverted index dictionary. Wherein the index set is shown in table 3, comprising inverted indexes (1, 2). By matching text identifier 1 with the text identifiers in the index set, it may be determined that the index set includes text identifier 1, and position 0 may then be matched with the position in the index set corresponding to text identifier 1. From Path3, it is known that the edge <3,1> is located to the right of the starting edge <0,3>, and the matching direction currently adopted is from left to right. According to the matching logic described in the foregoing, for the position 2 corresponding to the text identifier 1 in the index set, it is known that the position 2 is not the position 0 plus one, so that the edge <3,1> matching failure can be determined, and the Path3 matching failure can be determined.
Path3 may then be matched based on text identification 2 and corresponding location 0. According to the foregoing description, the index set corresponding to the lemma bbd represented by the side <3,1> adjacent to the start side <0,3> in Path3 includes the inverted index (1, 2). Since the index set does not include text identification 2, it can be determined that edge <3,1> failed to match, and thus that Path3 failed to match.
For Path4, for example, the outgoing edge <0,3> of the node 0 may be used as a starting edge, and the text identifiers 1 and 2 and the corresponding positions 0 indicated in the index set corresponding to the lemma abb represented by the starting edge <0,3> may be sequentially matched with the text identifiers and the positions in the index set corresponding to the other edges in Path 4.
Specifically, taking text identifier 1 and position 0 corresponding thereto as an example, edges <3,3> adjacent to starting edges <0,3> in Path4 may be determined as edges to be matched currently, and an index set corresponding to a word bbb represented by the edges <3,3> may be queried in an inverted index dictionary. The index set is shown in table 3 and includes inverted indexes (1, 1) and (2, 1). By matching text identifier 1 with the text identifiers in the index set, it may be determined that the index set includes text identifier 1, and position 0 may then be matched with the position in the index set corresponding to text identifier 1. From Path4 it is known that the edge <3,3> is to the right of the starting edge <0,3>, and the matching direction currently adopted is from left to right. From the matching logic described in the foregoing, for position 1 in the index set corresponding to text identifier 1, it can be known that position 1 is the position 0 plus one, so that it can be determined that edge <3,3> matching was successful. Next edges to be matched may then be determined in Path 4. From the illustration of fig. 4, it can be appreciated that edge <3,3> is a circular edge, and the next edge of edge <3,3> can be itself, or edge <3,1>. Thus, the next edge to be matched can be determined among the edges <3,3> and the edges <3,1>. Taking the edge <3,1> as the next edge to be matched as an example, the edge <3,1> can be matched based on the text identifier 1 and the position 1 where the word bbb is successfully matched. According to the foregoing description, the index set corresponding to the lemma bbd represented by the side <3,1> includes the inverted index (1, 2). Since the index set includes text identifier 1 and position 2 corresponding to text identifier 1 is the position 1 plus one, it can be determined that the edge <3,1> match was successful, and thus it can be determined that the Path4 match was successful. Thus, text label 1 may be determined as a matching text label and its corresponding text determined as the target text satisfying regular expression a (b+|c+) d.
Path4 may then be matched based on text label 2 and corresponding location 0. According to the foregoing description, the index set corresponding to the lemma bbb represented by the side <3,3> includes the inverted indexes (1, 1), (2, 1). Since the index set includes text identifier 2, and position 1 corresponding to text identifier 2 is the position 0 plus one, it can be determined that the edge <3,3> match was successful. Next edges of edge <3,3>, such as edge <3,3> or edge <3,1>, can then be matched based on the text label 2 and the position 1 where the word bbb match successfully. When matching edge <3,3> again based on the position 1 where the text mark 2 and the word bbb are successfully matched, edge <3,3> may fail to match because there is no position 1 plus one position in the position corresponding to the text mark 2 in the index set corresponding to the word bbb. In addition, when matching edge <3,1> based on the position 1 where the text identifier 2 and the word bbb are successfully matched, edge <3,1> will also fail to match because the text identifier 2 does not exist in the index set corresponding to the word bbd represented by edge <3,1 >. Thus, path4 fails to match text identifier 2 with corresponding location 0 based on the index set indication corresponding to the token abb represented by the starting edge <0,3 >.
In summary, by analyzing the state machine diagram, the matched text identifier 1 and the text identifier 4 can be quickly obtained, and the respective corresponding texts are determined to be target texts meeting the regular expression a (b+|c+) d.
The scheme provided by the corresponding embodiment of fig. 3 can convert the matching of the regular expression into the traversal of the graph, convert the dynamic state machine matching part into the analysis of the graph, and accelerate the regular matching by combining the positions of the words in the inverted index dictionary, thereby realizing the efficient data query.
In addition, the existing matching algorithm described in the foregoing generally filters out those data most likely to match in the data set to obtain a candidate set, and then performs conventional regular matching on the candidate set. The algorithm actually performs a plurality of matching processes, and the time consumption is high. By reversely looking at the scheme provided by the embodiment of the specification, the query result can be obtained by combining the state machine diagram and the positions of the words in the inverted index dictionary and performing a matching process on the words, and the index efficiency is greatly improved without calculating the candidate set and performing conventional regular matching.
In practice, when a regular expression is more complex, the content of a state machine graph generated based on the regular expression is also more complex, and many paths are included. In order to improve the path matching efficiency, thereby further improving the data query efficiency, a plurality of key edges can be determined in the state machine diagram. Wherein the critical edge is capable of cutting off access to a target path formed by a plurality of edges. For any key edge in the plurality of key edges, each target path where the any key edge is located can be regarded as a group of paths, and the group of paths are matched based on a plurality of text identifiers and corresponding positions indicated by an index set corresponding to the word element represented by the any key edge by taking the any key edge as a starting point.
The key edge will be hereinafter referred to as the first edge, and a specific implementation of step S305 as described above will be described in conjunction with the data query process shown in fig. 5. Fig. 5 is a flowchart of the data polling process in the embodiment of the present specification, including steps S501 to S505 shown below.
As shown in fig. 5, in step S501, a number of first edges are determined in a state machine diagram; the first side is capable of cutting access to a target path formed by a plurality of sides.
In one embodiment, for example, an outgoing edge of a plurality of edges of the target path where each outgoing edge of the first node is located may be determined as the first edge. Or determining an incoming edge of the target path which is located in each incoming edge of the second node and is formed by a plurality of edges as a first edge.
In another embodiment, to obtain fewer first edges, thereby increasing the matching speed, a max-min-cut algorithm may be applied. The algorithm may refer to that in a directed graph, the maximum traffic that can reach a sink (terminal) from a source (source) is equal to the sum of the minimum capacities of the set of edges that can cause disruption of network flow if pruned from the graph. I.e. in any network the value of the maximum flow is equal to the capacity of the minimum cut. Since the max-min-cut algorithm is a well-known technique, it will not be described in any great detail here. Thus, a number of first edges may be determined in the state machine diagram based on a max-min cut algorithm; such that the removal of the number of first edges can cause an interruption of the network flow of the state machine graph or a sub-graph of the state machine graph after the state machine graph removes the second edge that is directly connected from the first node to the second node. Wherein the first edge may be the outgoing edge of the first node, or the incoming edge of the second node, or some intermediate edge (neither the outgoing edge of the first node nor the incoming edge of the second node) in the target path.
After determining the plurality of first sides, for a target first side of the plurality of first sides, steps S503 and S505 may be performed for the target first side. It should be noted that, any index set typically includes a large number of inverted indexes, and the inverted indexes may be arranged in order of the text labels from small to large. In order to further improve the matching efficiency, in one embodiment, the index set corresponding to the terms represented by each of the plurality of first edges may be queried in the inverted index dictionary, the minimum text identifier in the index set is determined to be a weight value corresponding to the first edge of the index set, and then the plurality of first edges are sequentially determined to be target first edges according to the order of the weight values from small to large, so that steps S503 and S505 are executed for the target first edges.
Next, in step S503, the inverted index dictionary is queried for the edge e1 among the plurality of first edges, and the index set index 1 corresponding to the lemma represented by the edge e1 is obtained.
The inverted index dictionary may specifically include a plurality of lemmas and index sets corresponding to the lemmas. For the target first edge (may be referred to as edge e 1) in the above several first edges, the inverted index dictionary may be queried based on the lemma represented by edge e1, to obtain an index set1 corresponding to the lemma.
Next, in step S505, based on the text identifiers and the corresponding positions indicated in the index set1, text identifiers and positions in the index set corresponding to other edges in the target path where the edge e1 is located are sequentially matched with each other with the edge e1 as a starting point, so as to obtain matched text identifiers, and the corresponding text is determined as a target text satisfying the regular expression.
Specifically, a matching process as shown in fig. 6 may be employed. Fig. 6 is a schematic diagram of a matching process of the target path where the edge e1 is located in the embodiment of the present disclosure. The matching process includes steps S601-S617 as shown below.
As shown in fig. 6, in step S601, a current text identifier docid_1 is determined among a plurality of text identifiers indicated by the index set IndexSet1, and a current position_1 is determined among respective positions indicated by the IndexSet1 corresponding to the docid_1.
In particular, when the several text identifications are one text identification, the text identification may be directly determined as the current text identification docid_1. When the plurality of text identifiers are a plurality of text identifiers, one text identifier can be selected randomly from the plurality of text identifiers, or the smallest text identifier is selected, and then the selected text identifier is determined to be the current text identifier docid_1. After determining docid_1, if each Position indicated by IndexSet1 corresponding to docid_1 is one Position, the Position may be directly determined as the current position_1. If the respective positions are a plurality of positions, one Position may be randomly selected from the plurality of positions, or a minimum Position may be selected, and then the selected Position may be determined as the current position_1.
Next, in step S603, text identifiers and positions in the index set corresponding to other edges in the target path where the edge e1 is located are sequentially matched based on the docid_1 and the position_1.
Specifically, for any target Path1 in each target Path where the edge e1 is located, text identifiers and positions in the index set corresponding to other edges in Path1 can be sequentially matched based on docid_1 and position_1. If the Path1 is successfully matched, it may be determined that the target Path where the edge e1 is located is successfully matched, and then step S605 may be performed next.
If the matching of the other edges in the Path1 fails, and the other target paths in the target paths all include the edges with the matching failure in the other edges, it may be determined that the matching failure of the target Path where the edge e1 is located is not found. For example, if the other edges in Path1 fail to match based on docid_1 and the other target paths all include edges that fail to match among the other edges, it may be determined that the target Path on which edge e1 is located fails to match based on docid_1, and then step S607 may be performed next. If the other edges in Path1 fail to match based on position_1 and the other target paths all include edges that fail to match in the other edges, it may be determined that the target Path in which the edge e1 is located fails to match based on position_1, and then step S613 may be performed next. In addition, if matching of other edges in the Path1 fails, and there is a target Path2 that does not include an edge that fails to match in the other edges in the target paths, text identifiers and positions in the index set corresponding to the other edges in the Path2 may be sequentially matched based on docid_1 and position_1.
Further, when matching Path1 based on DocId_1 and position_1, the edge e3 currently to be matched may be determined in Path 1. For example, when a match for Path1 is started, the edge e3 to be currently matched may be determined among the edges adjacent to edge e 1. Where the adjacent edge is an edge, the edge may be directly determined as the edge e3 to be currently matched. When the adjacent edges are multiple edges, one edge can be selected from the multiple edges, for example, one edge is selected randomly, or the edge most likely to be successfully matched is selected, and the selected edge is determined as the edge e3 to be matched currently. The most likely successful matching edge can be the edge with the most text marks in the index set corresponding to the represented word element; when selecting the most likely successfully matched edge, the index set corresponding to the word element represented by each edge can be queried in the inverted index dictionary, the target index set with the largest number of the included text identifiers is determined in each queried index set, and the edge corresponding to the target index set in the edges is selected.
After determining the current edge e3 to be matched, the corresponding index set index 3 of the word element represented by the edge e3 in the inverted index dictionary can be obtained, and the docid_1 can be searched in the index set 3. If DocId_1 is not found, it may be determined that edge e3 fails to match based on DocId_1, thereby determining that other paths including edge e3 in each of the target paths where Path1 and edge e1 are located fail to match based on DocId_1. If docid_1 is found, it may then be determined whether there is a target Position among the respective positions indicated by IndexSet3 corresponding to docid_1, the target Position being determined according to position_1 and the positional relationship of edge e1 and edge e3. For example, when the edge e3 is the i-th edge to the right of the edge e1, the target Position may be the i-added Position of position_1; when the edge e3 is the i-th edge to the left of the edge e1, the target Position may be the i-minus Position of position_1. Taking i=1 as an example, when the edge e3 is the first edge on the right side of the edge e1, that is, the right adjacent edge of the edge e1, the target Position may be the position_1 plus one Position; when the edge e3 is the first edge to the left of the edge e1, i.e., the left adjacent edge of the edge e1, the target Position may be the minus one Position of position_1. Taking i=2 as an example, when the edge e3 is the second edge on the right side of the edge e1, the target Position may be the plus two Position of position_1; when the edge e3 is the second edge to the left of the edge e1, the target Position may be the minus two Position of position_1.
If the target position is determined to exist, the successful matching of the edge e3 can be determined, and whether other edges which do not participate in Path matching exist in the Path1 is further determined. If it is determined that the other edge does not exist, it may be determined that the Path1 match is successful. If it is determined that the other edge exists, the other edge may continue to be matched. Further, after the successful matching of the edge e3 is determined, the target position of the successful matching of the edge e3 may be associated with the docid_1 for buffering, so that when the other edge includes the edge e4 adjacent to the edge e3, the edge e3 may be determined as the current edge, and further, based on the target position of the successful matching of the docid_1 and the current edge e3, the text identifier and the position in the index set4 corresponding to the word element represented by the edge e4 are matched. It should be appreciated that if edge e4 can be successfully matched, then docid_1 must be included in IndexSet4, and that there is either a plus or minus one position of the target position in each position in IndexSet4 corresponding to docid_1.
If it is determined that the target Position does not exist, it may be determined that the edge e3 fails to match based on position_1, so that it is determined that other paths including the edge e3 in the respective paths where the paths 1 and e1 exist fail to match based on position_1.
Next, in step S605, if the matching of the target path where the edge e1 is located is successful, the docid_1 is determined as the matched text identifier, and the corresponding text is determined as the target text satisfying the regular expression.
Specifically, if the matching of the target path where the edge e1 is located is successful, the docid_1 may be determined as the matched text identifier, and then the text corresponding to the docid_1 may be determined as the target text satisfying the regular expression.
In step S607, if the matching of the target path where the edge e1 is located fails based on the docid_1, it is determined whether there are other text identifiers that do not participate in the matching of the path among the text identifiers.
Specifically, when it is determined that the target path in which the edge e1 is located fails to match based on docid_1 because the index set corresponding to the other edge in the target path in which the edge e1 is located does not include docid_1, it may be determined whether there are other text identifiers that do not participate in the path matching among the above several text identifiers indicated by the index set 1. If it is determined that the other text identification exists, step S609 may be performed next. If it is determined that the other text identifier does not exist, the matching of the target path where the edge e1 is located may end.
In step S609, if it is determined that there are other text identifiers that do not participate in the path matching among the above-mentioned several text identifiers, the current text identifier docid_2 is determined among the other text identifiers, and the current position_2 is determined among the respective positions indicated by the index set1 and corresponding to the docid_2.
The determination methods of docid_2 and position_2 are similar to those of docid_1 and position_1 described above, and reference is made to the related description above, which is not repeated here.
Next, in step S611, text identifiers and positions in the index set corresponding to other edges in the target path where the edge e1 is located are sequentially matched based on the docid_2 and the position_2.
The matching process performed based on docid_2 and position_2 is similar to the matching process based on docid_1 and position_1 described above, and reference is made to the related description above, and thus will not be repeated here.
In step S613, if the target path where the edge e1 is located fails to match based on the position_1, it is determined whether there are other positions which do not participate in the path matching in the respective positions corresponding to docid_1 indicated by the index set 1.
Specifically, when it is determined that the target path in which the edge e1 is located is failed to match based on the position_1 because the other edges in the target path in which the edge e1 is located are failed to match based on the position_1, it may be determined whether there are other positions which do not participate in the path matching among the respective positions corresponding to the docid_1 indicated by the IndexSet 1. If it is determined that the other location exists, step S615 may be next performed. If it is determined that the other position does not exist, the process may proceed to step S607.
In step S615, if it is determined that there is another location that does not participate in the path matching among the locations indicated by the index set1 and corresponding to the docid_1, the current location position_3 is determined among the other locations.
The method for determining position_3 is similar to the method for determining position_1 described above, and reference is made to the description related to the above, which is not repeated here.
Next, in step S617, text identifiers and positions in the index set corresponding to other edges in the target path where the edge e1 is located are sequentially matched based on the docid_1 and the position_3.
The matching process performed based on docid_1 and position_3 is similar to the matching process based on docid_1 and position_1 described above, and reference is made to the related description above, which is not repeated here.
The method of determining a plurality of key edges in the state machine diagram, regarding each target path where any key edge is located as a group of paths, and using any key edge as a starting point, and matching the group of paths based on a plurality of text identifiers indicated by an index set corresponding to a word element represented by the any key edge and corresponding positions is described in detail above with reference to fig. 5 and 6. The method can effectively improve the path matching efficiency, thereby further improving the data query efficiency.
According to the description, the scheme provided by the embodiment of the specification can convert the matching of the regular expression into the traversal of the graph, convert the dynamic state machine matching part into the analysis of the graph, and accelerate the regular matching by combining the positions of the words in the inverted index dictionary, so that the query result can be obtained by carrying out one-time matching process on the corpus, the candidate set is not required to be calculated first, and then the conventional regular matching is carried out, the index efficiency is greatly improved, and the efficient data query is realized.
Fig. 7 is a schematic diagram of the structure of the data polling device in the embodiment of the present specification. The apparatus may be applied to any device, platform or cluster of devices having data storage, computing, processing capabilities, such as the server 102 as previously described. The apparatus may perform the methods as shown in fig. 2, 3, 5, 6, and include: a receiving unit 701 configured to receive a data query request, including a regular expression; a generating unit 702 configured to generate a state machine graph based on the regular expression, the state machine graph including a first node representing an initial state, a second node representing a termination state, and an edge representing a token, the token being composed of a plurality of characters, a target path between the first node and the second node representing a combination of tokens conforming to the regular expression; a determining unit 703 configured to query a pre-generated inverted index dictionary based on the target path, determining a target text satisfying the regular expression; the inverted index dictionary comprises index sets corresponding to a plurality of words, wherein the index set corresponding to any word comprises inverted indexes of a plurality of items about positions, and each inverted index indicates a text identifier of a text containing the any word and the position of the any word in the text.
The present embodiments also provide a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed in a computer, causes the computer to perform the method shown in fig. 2, 3, 5, 6.
Embodiments of the present disclosure also provide a computing device including a memory and a processor, where the memory stores executable code that when executed by the processor implements the methods shown in fig. 2, 3, 5, and 6.
The present description also provides a computer program product, wherein the computer program product, when executed in a computer, causes the computer to perform the method as shown in fig. 2, 3, 5, 6.
Those of skill in the art will appreciate that in one or more of the above examples, the functions described in the various embodiments disclosed herein may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
While the foregoing detailed description has described the objects, aspects and advantages of the embodiments disclosed herein in further detail, it should be understood that the foregoing detailed description is merely illustrative of the embodiments disclosed herein and is not intended to limit the scope of the embodiments disclosed herein, but rather any modifications, equivalents, improvements or the like that may be made to the embodiments disclosed herein are intended to be included within the scope of the embodiments disclosed herein.

Claims (20)

1.一种数据查询方法,包括:1. A data query method, including: 接收数据查询请求,其中包括正则表达式;Receive data query requests, including regular expressions; 基于所述正则表达式生成状态机图,所述状态机图包括代表初始状态的第一节点、代表终止状态的第二节点、及代表词元的边,所述词元由多个字符组成,所述第一节点到所述第二节点之间的目标路径,代表符合所述正则表达式的词元组合;A state machine graph is generated based on the regular expression, and the state machine graph includes a first node representing an initial state, a second node representing a termination state, and an edge representing a word element, where the word element is composed of a plurality of characters, The target path between the first node and the second node represents a word element combination that conforms to the regular expression; 基于所述目标路径,查询预先生成的倒排索引词典,确定满足所述正则表达式的目标文本;所述倒排索引词典包括多个词元各自对应的索引集,任意词元对应的索引集包括若干项关于位置的倒排索引,各项倒排索引指示包含有该任意词元的文本的文本标识、及该任意词元在该文本中的位置。Based on the target path, query the pre-generated inverted index dictionary to determine the target text that satisfies the regular expression; the inverted index dictionary includes an index set corresponding to multiple word elements, and an index set corresponding to any word element. It includes several inverted indexes about positions, each inverted index indicates the text identifier of the text containing the arbitrary word element, and the position of the arbitrary word element in the text. 2.根据权利要求1所述的方法,其中,所述基于所述正则表达式生成状态机图,包括:2. The method of claim 1, wherein generating a state machine diagram based on the regular expression includes: 基于所述正则表达式,构造以单个字符为边的第一状态机;Based on the regular expression, construct a first state machine with a single character as an edge; 将所述第一状态机转换成以词元为边的第二状态机;Convert the first state machine into a second state machine with word elements as edges; 将所述第二状态机转换成图,得到所述状态机图。Convert the second state machine into a graph to obtain the state machine graph. 3.根据权利要求1所述的方法,其中,所述确定满足所述正则表达式的目标文本,包括:3. The method of claim 1, wherein determining the target text that satisfies the regular expression includes: 在所述状态机图中确定若干第一边;所述第一边能够切断由多条边构成的目标路径的通达;Determine a number of first edges in the state machine diagram; the first edges can cut off access to a target path composed of multiple edges; 对于所述若干第一边中的目标第一边,查询所述倒排索引词典,得到所述目标第一边代表的第一词元对应的第一索引集;For the target first edge among the plurality of first edges, query the inverted index dictionary to obtain the first index set corresponding to the first word element represented by the target first edge; 以所述目标第一边为起点,基于所述第一索引集中指示的若干文本标识和对应位置,依次匹配所述目标第一边所在的目标路径中其他边对应的索引集中的文本标识和位置,得到匹配的文本标识,将对应文本确定为所述目标文本。Taking the first edge of the target as the starting point, based on several text identifiers and corresponding positions indicated in the first index set, sequentially match the text identifiers and positions in the index set corresponding to other edges in the target path where the first edge of the target is located. , obtain the matching text identifier, and determine the corresponding text as the target text. 4.根据权利要求3所述的方法,其中,所述在所述状态机图中确定若干第一边,包括:4. The method of claim 3, wherein determining a plurality of first edges in the state machine diagram includes: 基于最大流-最小割算法,在所述状态机图中确定若干第一边;使得所述若干第一边的割除能够导致所述状态机图或所述状态机图的子图的网络流中断,所述子图为所述状态机图去除从所述第一节点直接连接到所述第二节点的第二边后的图。Based on the maximum flow-minimum cut algorithm, several first edges are determined in the state machine graph; so that the removal of the several first edges can cause the network flow interruption of the state machine graph or a subgraph of the state machine graph. , the subgraph is the graph of the state machine graph after removing the second edge directly connected from the first node to the second node. 5.根据权利要求3所述的方法,其中,所述确定满足所述正则表达式的目标文本,还包括:5. The method of claim 3, wherein determining the target text that satisfies the regular expression further includes: 响应于所述状态机图中包括从所述第一节点直接连接到所述第二节点的第二边,在所述倒排索引词典中查询所述第二边代表的第二词元对应的第二索引集;In response to the state machine graph including a second edge directly connected from the first node to the second node, query the inverted index dictionary for the second edge corresponding to the second word element represented by the second edge. second index set; 将查询到的所述第二索引集指示的各个文本标识各自对应的文本,确定为所述目标文本。Text corresponding to each text identifier indicated by the queried second index set is determined as the target text. 6.根据权利要求5所述的方法,其中,所述第一索引集指示的所述若干文本标识不包括所述第二索引集指示的文本标识。6. The method of claim 5, wherein the plurality of text identifiers indicated by the first index set do not include text identifiers indicated by the second index set. 7.根据权利要求3所述的方法,其中,所述基于所述第一索引集中指示的若干文本标识和对应位置,依次匹配所述目标第一边所在的目标路径中其他边对应的索引集中的文本标识和位置,包括:7. The method according to claim 3, wherein the index set corresponding to other edges in the target path where the first edge of the target is located is sequentially matched based on several text identifiers and corresponding positions indicated in the first index set. Text identification and location, including: 在所述若干文本标识中确定当前的第一文本标识,及在所述第一索引集指示的对应于所述第一文本标识的各个位置中确定当前的第一位置;Determine the current first text identifier among the plurality of text identifiers, and determine the current first position among the respective positions corresponding to the first text identifier indicated by the first index set; 基于所述第一文本标识和所述第一位置,依次匹配所述目标第一边所在的目标路径中其他边对应的索引集中的文本标识和位置;Based on the first text identifier and the first position, sequentially match the text identifiers and positions in the index set corresponding to other edges in the target path where the first edge of the target is located; 若所述目标第一边所在的目标路径匹配成功,则将所述第一文本标识确定为匹配的文本标识。If the target path where the first side of the target is located is successfully matched, the first text identifier is determined as the matching text identifier. 8.根据权利要求7所述的方法,其中,所述基于所述第一索引集中指示的若干文本标识和对应位置,依次匹配所述目标第一边所在的目标路径中其他边对应的索引集中的文本标识和位置,还包括:8. The method according to claim 7, wherein the index set corresponding to other edges in the target path where the first edge of the target is located is sequentially matched based on several text identifiers and corresponding positions indicated in the first index set. text identification and location, also including: 若所述目标第一边所在的目标路径基于所述第一文本标识匹配失败,且所述若干文本标识中存在未参与路径匹配的其他文本标识,则在所述其他文本标识中确定当前的第二文本标识,及在所述第一索引集指示的对应于所述第二文本标识的各位置中确定当前的第二位置;If the target path where the first side of the target is located fails to match based on the first text identifier, and there are other text identifiers that do not participate in path matching among the several text identifiers, the current third text identifier is determined among the other text identifiers. two text identifiers, and determining the current second position among the positions corresponding to the second text identifier indicated by the first index set; 基于所述第二文本标识和所述第二位置,依次匹配所述目标第一边所在的目标路径中其他边对应的索引集中的文本标识和位置。Based on the second text identifier and the second position, text identifiers and positions in the index set corresponding to other edges in the target path where the first edge of the target is located are matched in sequence. 9.根据权利要求7所述的方法,其中,所述基于所述第一索引集中指示的若干文本标识和对应位置,依次匹配所述目标第一边所在的目标路径中其他边对应的索引集中的文本标识和位置,还包括:9. The method according to claim 7, wherein the index set corresponding to other edges in the target path where the first edge of the target is located is sequentially matched based on several text identifiers and corresponding positions indicated in the first index set. text identification and location, also including: 若所述目标第一边所在的目标路径基于所述第一位置匹配失败,且所述第一索引集指示的对应于所述第一文本标识的各个位置中存在未参与路径匹配的其他位置,则在所述其他位置中确定当前的第三位置;If the target path where the first edge of the target is located fails to match based on the first position, and there are other positions that do not participate in path matching among the positions indicated by the first index set corresponding to the first text identifier, Then determine the current third position among the other positions; 基于所述第一文本标识和所述第三位置,依次匹配所述目标第一边所在的目标路径中其他边对应的索引集中的文本标识和位置。Based on the first text identifier and the third position, text identifiers and positions in the index set corresponding to other edges in the target path where the first edge of the target is located are matched in sequence. 10.根据权利要求7所述的方法,其中,所述目标第一边所在的目标路径包括多条目标路径;10. The method according to claim 7, wherein the target path where the first side of the target is located includes multiple target paths; 所述基于所述第一文本标识和所述第一位置,依次匹配所述目标第一边所在的目标路径中其他边对应的索引集中的文本标识和位置,包括:Based on the first text identifier and the first position, sequentially matching the text identifiers and positions in the index set corresponding to other edges in the target path where the first edge of the target is located includes: 对于所述多条目标路径中任意的第一目标路径,基于所述第一文本标识和所述第一位置,依次匹配所述第一目标路径中其他边对应的索引集中的文本标识和位置;For any first target path among the plurality of target paths, based on the first text identifier and the first position, sequentially match the text identifiers and positions in the index set corresponding to other edges in the first target path; 若所述第一目标路径匹配成功,则确定所述目标第一边所在的目标路径匹配成功。If the first target path is successfully matched, it is determined that the target path where the first side of the target is located is successfully matched. 11.根据权利要求10所述的方法,其中,所述基于所述第一文本标识和所述第一位置,依次匹配所述目标第一边所在的目标路径中其他边对应的索引集中的文本标识和位置,还包括:11. The method according to claim 10, wherein based on the first text identifier and the first position, text in an index set corresponding to other edges in the target path where the first edge of the target is located is matched in sequence. Identity and location, also including: 若所述第一目标路径中的其他边匹配失败,且所述多条目标路径中的其他目标路径均包括该其他边中匹配失败的边,则确定所述目标第一边所在的目标路径匹配失败。If the other edges in the first target path fail to match, and the other target paths in the multiple target paths all include the edges that fail to match in the other edges, then it is determined that the target path where the first target edge is located matches. fail. 12.根据权利要求10所述的方法,其中,所述基于所述第一文本标识和所述第一位置,依次匹配所述目标第一边所在的目标路径中其他边对应的索引集中的文本标识和位置,还包括:12. The method according to claim 10, wherein based on the first text identifier and the first position, text in an index set corresponding to other edges in the target path where the first edge of the target is located is matched in sequence. Identity and location, also including: 若所述第一目标路径中的其他边匹配失败,且所述多条目标路径中存在不包括该其他边中匹配失败的边的第二目标路径,则基于所述第一文本标识和所述第一位置,依次匹配所述第二目标路径中其他边对应的索引集中的文本标识和位置。If other edges in the first target path fail to match, and there is a second target path in the plurality of target paths that does not include the edge that fails to match in the other edges, then based on the first text identification and the The first position matches text identifiers and positions in the index set corresponding to other edges in the second target path in sequence. 13.根据权利要求10所述的方法,其中,所述基于所述第一文本标识和所述第一位置,依次匹配所述第一目标路径中其他边对应的索引集中的文本标识和位置,包括:13. The method according to claim 10, wherein, based on the first text identifier and the first position, sequentially matching text identifiers and positions in the index set corresponding to other edges in the first target path, include: 在所述第一目标路径中确定当前要匹配的第三边;Determine the third edge currently to be matched in the first target path; 获取所述第三边代表的第三词元在所述倒排索引词典中对应的第三索引集;Obtain the third index set corresponding to the third word element represented by the third side in the inverted index dictionary; 若所述第三索引集包括所述第一文本标识和对应于所述第一文本标识的目标位置,则确定所述第三边匹配成功;其中,所述目标位置根据所述第一位置,以及所述目标第一边和所述第三边的位置关系而确定;If the third index set includes the first text identifier and a target position corresponding to the first text identifier, it is determined that the third side is successfully matched; wherein the target position is based on the first position, and the positional relationship between the first side of the target and the third side; 若所述第三索引集不包括所述第一文本标识或所述目标位置,则确定所述第三边匹配失败。If the third index set does not include the first text identifier or the target position, it is determined that the third side matching fails. 14.根据权利要求13所述的方法,其中,所述在所述第一目标路径中确定当前要匹配的第三边,包括:14. The method of claim 13, wherein determining the third edge currently to be matched in the first target path includes: 若所述第一目标路径包括与所述目标第一边相邻的多条边,则在所述倒排索引词典中查询所述多条边各自代表的词元对应的索引集;If the first target path includes multiple edges adjacent to the target first edge, then query the inverted index dictionary for an index set corresponding to the word elements each represented by the multiple edges; 在查询到的各个索引集中,确定所包括的文本标识的数量最大的目标索引集,并将所述多条边中对应于所述目标索引集的边确定为所述第三边。In each of the queried index sets, a target index set containing the largest number of text identifiers is determined, and the edge corresponding to the target index set among the plurality of edges is determined as the third edge. 15.根据权利要求3所述的方法,所述确定满足所述正则表达式的目标文本,还包括:15. The method according to claim 3, said determining target text that satisfies said regular expression, further comprising: 在所述倒排索引词典中查询所述若干第一边各自代表的词元对应的索引集;Query in the inverted index dictionary the index set corresponding to the word elements each represented by the plurality of first sides; 对于查询到的每个索引集,将该索引集中的最小文本标识确定为对应于该索引集的第一边的权重值;For each queried index set, determine the minimum text identifier in the index set as the weight value corresponding to the first edge of the index set; 按照权重值由小到大的顺序,依次将所述若干第一边确定为所述目标第一边。In order of weight values from small to large, the first sides are determined as the target first sides. 16.根据权利要求1-15之一所述的方法,其中,在接收数据查询请求之前,还包括:16. The method according to any one of claims 1-15, wherein before receiving the data query request, it further includes: 对于预设的文本集合中的每个待处理文本,使用N元语言模型对所述待处理文本进行分词处理,生成分词结果;其中,N为大于1的自然数,所述分词结果包括所述待处理文本的文本标识对应的词元信息集,任意词元信息包括由N个字符形成的词元、及该词元在所述待处理文本中的位置;For each text to be processed in the preset text collection, an N-gram language model is used to perform word segmentation processing on the text to be processed, and a word segmentation result is generated; where N is a natural number greater than 1, and the word segmentation result includes the word segmentation result to be processed. Process the word element information set corresponding to the text identifier of the text. Any word element information includes a word element formed by N characters and the position of the word element in the text to be processed; 将各个分词结果中的词元形成词元集合;Form the word elements in each word segmentation result into a word element set; 对于所述词元集合中的每个词元,基于该词元所在的分词结果,生成该词元对应的索引集;For each word element in the word element set, generate an index set corresponding to the word element based on the word segmentation result where the word element is located; 基于所述词元集合中的各个词元各自对应的索引集,生成所述倒排索引词典。The inverted index dictionary is generated based on the index set corresponding to each word element in the word element set. 17.根据权利要求16所述的方法,其中,在生成分词结果之前,还包括:17. The method according to claim 16, wherein before generating the word segmentation result, it further includes: 对于从所述待处理文本中划分出的任意词元,若该任意词元包括大写字母,则将所述大写字母转换成小写字母。For any word element divided from the text to be processed, if the arbitrary word element includes uppercase letters, the uppercase letters are converted into lowercase letters. 18.一种数据查询装置,包括:18. A data query device, including: 接收单元,被配置成接收数据查询请求,其中包括正则表达式;a receiving unit configured to receive a data query request, including a regular expression; 生成单元,被配置成基于所述正则表达式生成状态机图,所述状态机图包括代表初始状态的第一节点、代表终止状态的第二节点、及代表词元的边,所述词元由多个字符组成,所述第一节点到所述第二节点之间的目标路径,代表符合所述正则表达式的词元组合;A generation unit configured to generate a state machine graph based on the regular expression, the state machine graph including a first node representing an initial state, a second node representing a termination state, and an edge representing a word element, the word element being Composed of multiple characters, the target path between the first node and the second node represents a combination of word elements that conforms to the regular expression; 确定单元,被配置成基于所述目标路径,查询预先生成的倒排索引词典,确定满足所述正则表达式的目标文本;所述倒排索引词典包括多个词元各自对应的索引集,任意词元对应的索引集包括若干项关于位置的倒排索引,各项倒排索引指示包含有该任意词元的文本的文本标识、及该任意词元在该文本中的位置。The determination unit is configured to query a pre-generated inverted index dictionary based on the target path and determine the target text that satisfies the regular expression; the inverted index dictionary includes an index set corresponding to a plurality of word elements, any The index set corresponding to a word element includes a number of inverted indexes regarding positions. Each inverted index indicates the text identifier of the text containing the arbitrary word element and the position of the arbitrary word element in the text. 19.一种计算机可读存储介质,其上存储有计算机程序,其中,当所述计算机程序在计算机中执行时,令计算机执行权利要求1-17中任一项所述的方法。19. A computer-readable storage medium having a computer program stored thereon, wherein when the computer program is executed in a computer, the computer is caused to perform the method of any one of claims 1-17. 20.一种计算设备,包括存储器和处理器,其中,所述存储器中存储有可执行代码,所述处理器执行所述可执行代码时,实现权利要求1-17中任一项所述的方法。20. A computing device, including a memory and a processor, wherein executable code is stored in the memory, and when the processor executes the executable code, the method of any one of claims 1-17 is implemented. method.
CN202311829247.XA 2023-12-27 2023-12-27 Data query method and device Pending CN117763077A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311829247.XA CN117763077A (en) 2023-12-27 2023-12-27 Data query method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311829247.XA CN117763077A (en) 2023-12-27 2023-12-27 Data query method and device

Publications (1)

Publication Number Publication Date
CN117763077A true CN117763077A (en) 2024-03-26

Family

ID=90317999

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311829247.XA Pending CN117763077A (en) 2023-12-27 2023-12-27 Data query method and device

Country Status (1)

Country Link
CN (1) CN117763077A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118113728A (en) * 2024-04-30 2024-05-31 浪潮电子信息产业股份有限公司 A data query method, system, device, equipment, and readable storage medium
US12475625B1 (en) 2024-05-15 2025-11-18 Lagena Inc. Systems and methods for rendering AI generated videos in real time

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118113728A (en) * 2024-04-30 2024-05-31 浪潮电子信息产业股份有限公司 A data query method, system, device, equipment, and readable storage medium
US12475625B1 (en) 2024-05-15 2025-11-18 Lagena Inc. Systems and methods for rendering AI generated videos in real time
WO2025240325A1 (en) * 2024-05-15 2025-11-20 Lagena Inc. Systems and methods for rendering ai generated videos in real time

Similar Documents

Publication Publication Date Title
CN108304378B (en) Text similarity computing method, apparatus, computer equipment and storage medium
US9195738B2 (en) Tokenization platform
JP5330269B2 (en) Document matching engine using asymmetric signature generation
JP5043116B2 (en) Network-based method and apparatus for filtering junk information
CN107784110B (en) A kind of index establishment method and apparatus
CN108846016B (en) A Search Algorithm for Chinese Word Segmentation
CN112115232A (en) A data error correction method, device and server
CN113111178B (en) An unsupervised representation learning-based method and device for disambiguation of the author of the same name
CN111460170B (en) Word recognition method, device, terminal equipment and storage medium
CN117763077A (en) Data query method and device
CN101154228A (en) A segmented pattern matching method and device thereof
CN110795526A (en) Mathematical formula index creating method and system for retrieval system
WO2015010509A1 (en) One-dimensional liner space-based method for implementing trie tree dictionary search
CN113779200A (en) Target industry word stock generation method, processor and device
CN119357376A (en) Hybrid retrieval method, device, equipment and storage medium based on knowledge graph and vector
CN119357366B (en) Large model retrieval method, device, equipment and storage medium based on priori atlas
CN115859932A (en) Log template extraction method and device, electronic equipment and storage medium
CN114385777A (en) Text data processing method and device, computer equipment and storage medium
US8682900B2 (en) System, method and computer program product for documents retrieval
CN113420219B (en) Method, device, electronic device and readable storage medium for querying information error correction
CN113849538A (en) An intelligent extraction method and system based on fuzzy search multiple options
CN119719252A (en) Inverted index method, device, equipment and storage medium based on double linked list structure
CN118780261A (en) Log template determination method, device, computer equipment, medium and product
CN118838995A (en) Method, device, equipment and storage medium for answering questioning and notifying based on knowledge base
CN115221013A (en) A method, device and device for determining a log mode

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
CB02 Change of applicant information

Country or region after: China

Address after: 310000 Zhejiang Province, Hangzhou City, Xihu District, Xixi Road 543-569 (continuous odd numbers) Building 1, Building 2, 5th Floor, Room 518

Applicant after: Alipay (Hangzhou) Digital Service Technology Co.,Ltd.

Address before: 310000 801-11 section B, 8th floor, 556 Xixi Road, Xihu District, Hangzhou City, Zhejiang Province

Applicant before: Alipay (Hangzhou) Information Technology Co., Ltd.

Country or region before: China