Disclosure of Invention
To address the problems in the prior art, the invention provides a multi-document question-answering retrieval method that combines a generative language model with a semantic document graph. The method refines both the construction and the traversal of the document knowledge graph: the graph is constructed on the basis of a BERT model, and the KGP3 algorithm is designed to optimize retrieval traversal. During graph traversal, the invention combines the main question with the nodes already obtained to generate a sub-question describing the information the next node should contain, selects the most suitable document node from the neighbor nodes, and judges the relevance of that node to the initial question; a node judged irrelevant is excluded from the subsequent retrieval list. The method is efficient, traceable and interpretable, and it abstracts and summarizes the whole process by which the model selects the next-hop node during graph traversal. To reduce cost, the invention also improves the small encoder-decoder T5 model by introducing a contrastive learning mechanism, which further improves the retrieval performance of the small model.
The object of the invention is achieved through the following technical scheme:
A multi-document question-answering retrieval method combining a generative language model and a semantic document graph, comprising the following steps:
Step 1, constructing a document knowledge graph:
Step 1.1, acquiring a document library: acquiring a document library containing a large number of related documents in a given field;
Step 1.2, text cutting: after the document library is obtained, performing word segmentation on the documents and dividing them into independent text blocks;
Step 1.3, embedding the document blocks based on a pre-trained language model: extracting the semantic features of each document block with the pre-trained language model;
Step 1.4, calculating similarity: calculating the similarity between text blocks from the embedding vectors of the document blocks to form edges, and, for each document node, connecting neighboring document nodes based on a similarity threshold and a top-k algorithm to form the document knowledge graph;
Step 2, generative question answering based on multi-hop retrieval over the document knowledge graph with a generative language model:
Step 21, generating an initial candidate set: taking the user's question as the query, searching the document library obtained in step 1 with the TF-IDF algorithm to obtain a group of nodes relevant to the question, namely the initial candidate set;
Step 22, iteratively expanding a document set:
Step 221, traversing the initial node set: traversing each node in the initial candidate set generated in step 21 and taking it as the starting node of the current path;
Step 222, inputting the question and the retrieved nodes into the generative language model: inputting the question and the node information on the current path into the generative language model, and performing multi-hop question answering over the document knowledge graph based on the KGP3 algorithm, with the following specific steps:
(1) Generating a sub-question: combining the main question with the nodes already obtained to generate a sub-question describing the information that the next required node should contain;
(2) Obtaining the next node according to the generated sub-question: using the generated sub-question to search for a new node in the document knowledge graph constructed in step 1, this node being the target to visit next;
(3) Judging the relevance of the selected node: if the relevance of the node is judged to be "support" or "fuzzy", adding the node to the current path and continuing the next-hop selection from that node; if the relevance is judged to be "reject", skipping the current node and continuing to traverse the initial node set; wherein "support" indicates that the current node strongly supports the seed node in answering the query, "fuzzy" indicates that the current node supports the seed node to some extent, and "reject" indicates that the node carries no valuable information for the main question;
(4) Judging whether the length of the current path has reached a threshold: checking whether the length of the current path has reached a preset threshold; if not, repeating steps (1) to (3); if so, adding the documents on the current path to the retrieved document set;
(5) Judging whether the initial node set has been fully traversed: if not, continuing to traverse the initial node set; if so, ending the whole retrieval process to obtain the retrieved document set;
The generative language model is a generative large language model or a small encoder-decoder T5 model; when the generative language model is the small encoder-decoder T5 model, the model is fine-tuned, negative samples are added for each piece of fine-tuning data, and a contrastive learning head is introduced; the input of the T5 model comprises the question and the retrieved evidence related to the question, in which each candidate node represents a potentially relevant piece of information; the input is converted into a rich continuous vector representation by the encoder of the T5 model and then fed to the decoder to generate or predict text; during training, two negative samples are provided for each positive sample, forming a contrast between positive and negative samples;
Step 23, after steps 21 and 22, de-duplicating the retrieved document set as a post-processing step to obtain the final retrieved document set, and inputting it together with the user's question into the generative language model to obtain the answer.
Compared with the prior art, the invention has the following advantages:
The invention provides the KGP3 method, which further improves the retrieval flow of the model. The method is efficient, traceable and interpretable; it abstracts and summarizes the whole process by which the model selects the next-hop node during graph traversal, and it is general enough that common scenarios can conveniently be simulated to construct a fine-tuning dataset that further improves the retrieval capability of the model. In addition, the invention fine-tunes the T5 model and, in view of the encoder-decoder architecture and the characteristics of the KGP3 traversal, introduces a contrastive learning head. By adding a contextual contrastive objective, this approach encourages the model to select the correct document while avoiding other confusable documents, thereby strengthening the model's judgment. Finally, the question-answering robot is integrated into WeChat for practical deployment, which reduces the consumption of manpower and material resources and promotes the high-quality development of enterprises.
Detailed Description
The technical scheme of the invention is further described below with reference to the accompanying drawings, but is not limited thereto; any modification or equivalent substitution of the technical scheme that does not depart from its spirit and scope shall fall within the scope of protection of the invention.
The invention provides a multi-document question-answering retrieval method combining a generative language model and a semantic document graph, which comprises the following steps:
Step 1, defining and constructing a document knowledge graph.
Step 1.1, definition of a document knowledge graph: a document knowledge graph is a way of converting text information into structured knowledge. By extracting information such as the semantics and keywords of documents and organizing it into a graph, it allows machines to better understand and use the text.
Step 1.2, constructing a document knowledge graph:
Step 1.2.1, obtaining a document library: a document library containing a large number of related documents in a given field is needed first, as it is the basis for constructing the knowledge graph. The invention selects two public datasets for verification. HotpotQA is a multi-hop reasoning question-answering dataset developed by researchers at Stanford University in the United States. The dataset is intended to advance the understanding and answering of complex questions in natural language processing, particularly questions that require integrating information from multiple sources. HotpotQA requires a model not only to understand a single document but also to integrate information from different documents effectively in order to complete multi-step reasoning. MuSiQue (Multihop Questions via Single-hop Question Composition) is a multi-hop reasoning question-answering dataset introduced by researchers at Stony Brook University and the Allen Institute for AI. It is designed to overcome the weakness of existing datasets whose questions can be answered through shortcuts. MuSiQue uses a bottom-up question composition method that strengthens the logical dependency between questions by ensuring that the answer to each sub-question is necessary for answering the subsequent questions.
Step 1.2.2, text cutting: after the document library is obtained, each document is tokenized (word segmentation) and long documents are split into independent text blocks (such as sentences, paragraphs or chapters). The purpose of this step is to break documents into smaller units that are easier to process and analyze. Word segmentation can use existing natural language processing (NLP) tools such as jieba (Chinese word segmentation) or NLTK (English tokenization).
Text cutting, especially for long documents, is an important step in NLP. After word segmentation the text is still one continuous stream, so to facilitate subsequent processing and analysis the document usually needs to be further divided into separate chunks. The definition of a chunk can be chosen according to specific requirements, for example segmentation by paragraph, by sentence, or by specific markers (such as titles or line breaks). The resulting text blocks can serve as independent processing units for tasks such as information extraction and summary generation. In practice, text cutting typically involves the following steps:
Step 1.2.2.1, preprocessing, namely removing irrelevant characters (such as punctuation marks, numbers and the like) in the text, and performing necessary text cleaning.
Step 1.2.2.2, word segmentation, namely using a word segmentation tool or algorithm to segment the text.
Step 1.2.2.3, chunking: dividing the segmented text into independent text blocks according to the defined rules or markers.
Step 1.2.2.4, post-processing (optional): further cleaning or formatting the resulting text blocks.
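The chunking steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used by the invention: the function name `split_into_chunks`, the `max_chars` limit, the blank-line paragraph rule and the regular-expression sentence splitter are all hypothetical choices, and a real pipeline would use a tokenizer such as jieba or NLTK. Punctuation is kept here because the sentence splitter relies on it.

```python
import re


def split_into_chunks(document: str, max_chars: int = 200) -> list[str]:
    """Split a document into paragraph chunks, falling back to sentence groups.

    Assumptions (not from the source text): paragraphs are separated by blank
    lines, and a paragraph longer than max_chars is split at sentence ends.
    """
    chunks = []
    for paragraph in re.split(r"\n\s*\n", document):
        # Light cleaning: collapse runs of whitespace inside the paragraph.
        paragraph = " ".join(paragraph.split())
        if not paragraph:
            continue
        if len(paragraph) <= max_chars:
            chunks.append(paragraph)
            continue
        # Long paragraph: split after sentence-ending punctuation and pack
        # consecutive sentences together while they fit in max_chars.
        current = ""
        for sentence in re.split(r"(?<=[.!?])\s+", paragraph):
            if current and len(current) + 1 + len(sentence) > max_chars:
                chunks.append(current)
                current = sentence
            else:
                current = f"{current} {sentence}".strip()
        if current:
            chunks.append(current)
    return chunks


if __name__ == "__main__":
    text = "First paragraph.\n\nSecond one is here. It has two sentences."
    for chunk in split_into_chunks(text, max_chars=25):
        print(chunk)
```

Each returned chunk would later become one node of the document knowledge graph.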
Step 1.2.3, embedding the document blocks based on a pre-trained language model: the semantic features of each document block are extracted with a pre-trained language model such as BERT for use in the subsequent similarity calculation. The specific steps are as follows:
Step 1.2.3.1, pre-training the language model: the language model (e.g., BERT) is pre-trained on a large amount of text data to learn a general-purpose language representation. BERT is a typical example: it processes input sequences with a Transformer architecture and can capture the context-dependent meaning of words.
Step 1.2.3.2, document block embedding: each document block is embedded with the pre-trained language model. For a given piece of text, or "document block", the embedding process can be expressed as equation (1):
$$\mathbf{e}_i = \mathrm{Embed}(\mathrm{chunk}_i) \tag{1}$$
where $\mathbf{e}_i$ is the embedding vector of document block $i$, $\mathrm{Embed}$ is the embedding function, usually a pre-trained deep neural network, and $\mathrm{chunk}_i$ is the $i$-th small unit obtained by segmenting the long text.
Step 1.2.3.3, extracting semantic features: the pre-trained language model converts each document block into a fixed-length vector. This typically involves feeding the text into the language model and taking the output feature vector from one of its layers (usually the last). The resulting feature vector carries the semantic information of the original text: it reflects not only the literal meaning of the words but also their roles within sentences and more complex features such as logical relationships between sentences, and it is used in the subsequent similarity calculation.
Step 1.2.4, calculating similarity: the similarity between text blocks is calculated from the embedding vectors of the document blocks to form edges. For each document node, neighboring document nodes are connected based on a similarity threshold and a top-k algorithm to form the document knowledge graph. This process involves the following key steps:
Step 1.2.4.1, document node representation.
First, each document block is treated as a node in the graph. These document nodes are converted to vector representations by an embedding technique (e.g., word embedding, sentence embedding, or paragraph embedding) for use in the subsequent similarity calculation.
Step 1.2.4.2, calculating the similarity.
A similarity measure (for example cosine similarity or Euclidean distance) is used to calculate the similarity between each pair of document nodes. The choice of measure depends on the nature of the embedding vectors and the specific requirements of the task.
Taking cosine similarity as an example, assume two document blocks A and B whose embedding vectors are $\mathbf{A}$ and $\mathbf{B}$ respectively. The cosine similarity between them is calculated as:
$$\cos(\mathbf{A}, \mathbf{B}) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \, \|\mathbf{B}\|} \tag{2}$$
where $\mathbf{A} \cdot \mathbf{B}$ is the dot product of vectors $\mathbf{A}$ and $\mathbf{B}$, and $\|\mathbf{A}\|$ and $\|\mathbf{B}\|$ are their moduli. Cosine similarity ranges over [-1, 1]: 1 indicates that the two vectors point in the same direction (regardless of their lengths), 0 indicates that they are orthogonal (i.e., uncorrelated), and -1 indicates that they point in opposite directions.
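As a small worked illustration of the cosine similarity just described, the following Python sketch computes it for plain lists of numbers. The function name and the convention of returning 0.0 for a zero vector are assumptions made for this example only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Same direction, different lengths: similarity is (numerically) 1.
    print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
    # Orthogonal vectors: similarity is 0.
    print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
    # Opposite directions: similarity is -1.
    print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))
```

The first call shows that only direction matters: scaling a vector does not change its cosine similarity to another vector.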
Step 1.2.4.3, connecting neighbor nodes.
Based on the similarity calculation result, the neighbor document nodes are connected.
Step 1.2.4.4, similarity threshold and top-k selection.
First, a similarity threshold is set. If the similarity between two document nodes exceeds this threshold, they are considered neighbors and are connected by an edge in the graph. Next, a top-k algorithm is applied: for each document node, the k nodes most similar to it are found and connected as its neighbors, where k is a predetermined positive integer.
Step 1.2.4.5, forming the document knowledge graph.
Through the above steps, the present invention constructs a network consisting of document nodes and the edges connecting them. This network can be viewed as a document knowledge graph, in which nodes represent documents and edges represent similarity relationships between documents.
Step 1.2.4.6, subsequent application.
After the steps are completed, the formed document knowledge graph can be used for various NLP tasks, such as document clustering, topic detection, information retrieval, recommendation systems and the like. By analyzing the nodes and edges in the graph, potential links between documents can be found, revealing the inherent structure of the data.
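The graph construction of steps 1.2.4.1 to 1.2.4.5 can be sketched as below. The text does not state exactly how the threshold and the top-k rule are combined, so this sketch assumes that a neighbor must both exceed the threshold and be among the k most similar nodes, and that edges are undirected; `build_document_graph` and its parameters are hypothetical names.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def build_document_graph(embeddings, threshold=0.5, k=2):
    """Return an adjacency mapping {node index: sorted neighbor indices}.

    Assumed combination rule: keep candidates whose similarity exceeds the
    threshold, then connect at most the k most similar of them per node.
    """
    n = len(embeddings)
    neighbors = {i: set() for i in range(n)}
    for i in range(n):
        scored = [(cosine(embeddings[i], embeddings[j]), j)
                  for j in range(n) if j != i]
        above = [(s, j) for s, j in scored if s > threshold]
        above.sort(key=lambda pair: (-pair[0], pair[1]))  # most similar first
        for _, j in above[:k]:
            neighbors[i].add(j)
            neighbors[j].add(i)  # assumption: edges are undirected
    return {i: sorted(js) for i, js in neighbors.items()}


if __name__ == "__main__":
    toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    print(build_document_graph(toy_embeddings, threshold=0.5, k=1))
```

With the toy vectors above, blocks 0 and 1 form one pair of neighbors and blocks 2 and 3 another, since only those pairs exceed the threshold.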
Step 2, generative question answering based on multi-hop retrieval over the document knowledge graph with a generative language model.
Step 21, generating an initial candidate set: first, taking the user's question as the query, the document corpus is searched with the TF-IDF algorithm to obtain a group of nodes relevant to the question, namely the initial candidate set.
TF-IDF is a weighting technique widely used in information retrieval and text mining to evaluate the importance of a term to one document within a document collection or corpus. The TF-IDF search process mainly comprises the following steps:
Step 211, preprocessing:
(1) Word segmentation: segmenting each document in the document set to obtain a list of words.
(2) Stop-word removal: removing stop words (common words that carry little meaning on their own, such as "is").
(3) Lemmatization (optional): reducing each word to its base form (e.g., reducing "running" to "run") to better handle morphological variation.
Step 212, constructing a word frequency matrix:
(1) Calculating term frequency (TF): the number of occurrences of each word in each document is counted and divided by the total number of words in that document to obtain the TF of the word in the document:
$$\mathrm{TF}(t, d) = \frac{n_{t,d}}{\sum_{k} n_{k,d}} \tag{3}$$
where $n_{t,d}$ is the number of occurrences of word $t$ in document $d$ and the denominator is the total number of words in $d$.
(2) A matrix is constructed in which rows represent documents and columns represent words, and each element in the matrix represents the word frequency of the corresponding word in the corresponding document.
Step 213, calculating the inverse document frequency (IDF):
For each word in the vocabulary, its IDF is calculated. IDF reflects how distinctive a word is across the collection: the higher the IDF value, the fewer documents contain the word and the more representative it is of those documents. Its expression is:
$$\mathrm{IDF}(t) = \log \frac{N}{1 + |\{d : t \in d\}|} \tag{4}$$
where $N$ is the total number of documents and $|\{d : t \in d\}|$ is the number of documents containing word $t$. The 1 added to the denominator avoids a denominator of 0.
Step 214, calculating TF-IDF value:
The TF of each word in each document is multiplied by the word's IDF to obtain the TF-IDF value of the word in the document:
$$\mathrm{TFIDF}(t, d) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t) \tag{5}$$
Step 215, constructing TF-IDF vector:
For each document, a vector is constructed from the TF-IDF values of all of its words. This vector can be regarded as the representation of the document in the vocabulary space.
Step 216, searching and sorting:
When a user inputs a query, the query is preprocessed in the same way (word segmentation, stop-word removal, and so on). A TF-IDF value is then calculated for each query term; because the query itself is not a complete document, the IDF value of the term over the document collection is typically used. The similarity between the query vector and each document vector is calculated with cosine similarity or a similar measure, the documents are sorted by similarity score, and the highest-scoring documents are returned as the search results.
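Steps 211 to 216 can be sketched end to end in a few lines of Python. This is an illustrative sketch, not the invention's implementation: the stop-word list, the regular-expression tokenizer and the names `tokenize` and `tfidf_search` are assumptions, lemmatization is omitted, and the IDF uses the "1 added to the denominator" form described above (so a word found in almost every document gets a weight of zero or below).

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "is", "a", "of", "in"}  # tiny illustrative list (assumption)


def tokenize(text):
    """Lower-case, split on non-alphanumerics, and drop stop words."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS]


def tfidf_search(query, documents, top_n=2):
    """Return indices of the top_n documents with positive similarity to the query."""
    doc_tokens = [tokenize(d) for d in documents]
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in doc_tokens:
        doc_freq.update(set(tokens))
    idf = {t: math.log(n_docs / (1 + df)) for t, df in doc_freq.items()}

    def vectorize(tokens):
        counts = Counter(tokens)
        total = len(tokens) or 1
        # Query terms unknown to the corpus have no IDF and are skipped.
        return {t: (c / total) * idf[t] for t, c in counts.items() if t in idf}

    def cosine(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        nu = math.sqrt(sum(w * w for w in u.values()))
        nv = math.sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    q_vec = vectorize(tokenize(query))
    scores = [(cosine(q_vec, vectorize(toks)), i) for i, toks in enumerate(doc_tokens)]
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for s, i in scores[:top_n] if s > 0]


if __name__ == "__main__":
    docs = [
        "Bob Bryan is a tennis player",
        "Mariaan de Swardt is a tennis player",
        "Paris is the capital of France",
        "The Eiffel Tower is in Paris",
    ]
    print(tfidf_search("capital of France", docs))
```

The returned indices would play the role of the initial candidate set (the seed nodes) for the traversal in step 22.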
Step 22, iteratively expanding a document set:
Step 221, traversing the initial node set, namely traversing each node in the initial node set and taking the node as the initial node of the current path.
Step 222, inputting the question and the retrieved nodes into the generative language model: based on the KGP3 algorithm designed by the invention, the question and the node information on the current path are input into the generative language model to perform multi-hop question answering over the document knowledge graph. The specific implementation flow is as follows:
(1) Generating a sub-question: the main question is combined with the nodes already obtained to generate a sub-question about the information the next required node should contain. For example, when asked who is older, Bob Bryan or Mariaan de Swardt, and given a node with Bob Bryan's age information, the intuitive next step is to ask for Mariaan de Swardt's age. This intuition forms the basis of the generative language model's reasoning and enables it to generate follow-up questions. The sub-question output by the generative language model guides the selection of the next-hop node.
(2) Obtaining the next node according to the generated sub-question: the generated sub-question is used to search for a new node in the knowledge graph, which becomes the target to visit next. Specifically, after the sub-question is generated, the generative language model selects the most suitable next document node from the neighbors of the current node according to the sub-question. At every step the model must pick one best node to extend the current path, namely the candidate neighbor most helpful for answering the initial question Q, and the next round of retrieval proceeds from that node.
(3) Judging the relevance of the selected node: when a neighbor node is selected, the neighbors of the current node do not necessarily contain the information needed to answer the question, and some may be entirely irrelevant. Including such a node in the retrieval result would interfere with answering the question, so the relevance between the selected node and the question must be judged. Specifically, the selected node receives one of three ratings: "support", meaning the current node strongly supports the seed node in answering the query; "fuzzy", meaning the current node supports the seed node to some extent; and "reject", meaning the node carries no valuable information for the main question. If the relevance of a node is judged to be "support" or "fuzzy", the node is added to the current path and the next-hop selection continues from it. If the relevance is judged to be "reject", the current node is skipped and traversal of the initial node set continues. The method is efficient, traceable and interpretable; it abstracts and summarizes the whole process by which the generative language model selects the next-hop node during graph traversal, and it is general enough that common scenarios can conveniently be simulated to construct a fine-tuning dataset that further improves the retrieval capability of the model.
(4) Judging whether the length of the current path has reached a threshold: the length of the current path is checked against a preset threshold. If the threshold has not been reached, steps (1) to (3) are repeated; if it has been reached, the documents on the current path are added to the retrieved document set.
(5) Judging whether the initial node set has been fully traversed: if not, traversal of the initial node set continues; if so, the whole retrieval process ends.
After the above steps are completed, the multi-hop retrieval over the document knowledge graph yields the retrieved document set.
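The control flow of steps (1) to (5) can be sketched as follows. The generative language model is replaced here by `KeywordStubModel`, a hypothetical word-overlap stand-in used only so that the example runs; the method names `generate_subquestion`, `select_node` and `judge`, the function `kgp3_retrieve`, and the decision to keep a partial path when a node is rejected are all assumptions of this sketch rather than details stated in the text.

```python
import re


def _words(text):
    return set(re.findall(r"[a-z]+", text.lower()))


class KeywordStubModel:
    """Hypothetical stand-in for the generative language model (word overlap only)."""

    def generate_subquestion(self, question, path_texts):
        # (1) Ask about the question words not yet covered by the current path.
        covered = set().union(*map(_words, path_texts))
        return " ".join(sorted(_words(question) - covered))

    def select_node(self, sub_question, candidates, texts):
        # (2) Pick the neighbor whose text overlaps most with the sub-question.
        target = _words(sub_question)
        return max(candidates, key=lambda n: (len(target & _words(texts[n])), -n))

    def judge(self, question, node_text):
        # (3) Rate the node against the main question.
        overlap = len(_words(question) & _words(node_text))
        return "support" if overlap >= 2 else "fuzzy" if overlap == 1 else "reject"


def kgp3_retrieve(question, seeds, graph, texts, model, max_path_len=3):
    retrieved = []
    for seed in seeds:  # (5) loop until every seed node has been traversed
        path = [seed]
        while len(path) < max_path_len:  # (4) path-length threshold
            sub_q = model.generate_subquestion(question, [texts[n] for n in path])
            candidates = [n for n in graph.get(path[-1], []) if n not in path]
            if not candidates:
                break
            nxt = model.select_node(sub_q, candidates, texts)
            if model.judge(question, texts[nxt]) == "reject":
                break  # skip the node and move on to the next seed
            path.append(nxt)  # "support" or "fuzzy": extend the current path
        retrieved.extend(path)  # assumption: a partial path is also kept
    return retrieved


if __name__ == "__main__":
    texts = {
        0: "Biography of Bob Bryan with his date of birth",
        1: "Biography of Mariaan de Swardt with her date of birth",
        2: "General history of tennis tournaments",
    }
    graph = {0: [1, 2], 1: [0], 2: [0]}
    question = "Who is older, Bob Bryan or Mariaan de Swardt"
    print(kgp3_retrieve(question, [0], graph, texts, KeywordStubModel(), max_path_len=2))
```

In this toy run the path starts at the Bob Bryan node, the stub's sub-question asks about the uncovered words (Mariaan de Swardt), and the matching neighbor is rated "support" and added to the path.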
In the present invention, the generative language model is a generative large language model or a small encoder-decoder T5 model. When the generative language model is the small encoder-decoder T5 model, the model is fine-tuned and, in view of the encoder-decoder structure and the characteristics of the KGP3 traversal, improved: negative samples are added for each piece of fine-tuning data and a contrastive learning head is introduced. By adding a contextual contrastive objective, this approach encourages the model to select the correct document while avoiding other confusable documents, thereby strengthening the model's judgment. In this process, the model input comprises the question and the associated retrieved evidence, in which each candidate node represents a potentially relevant piece of information. The input is converted into a rich continuous vector representation by the T5 encoder and then fed to the decoder to generate or predict text. During training, two negative samples are provided for each positive sample. A negative sample is generated by replacing the correct node with an irrelevant node or by changing the decision (e.g., changing "accept" to "reject"). This contrastive learning approach makes the model more sensitive to subtle differences between useful information and misleading information. In this way a contrast between positive and negative samples is formed, and in addition to the conventional cross-entropy loss the invention uses contrastive learning to further improve the performance of the model.
$$p^{+} = \sigma\big(\mathrm{avg}(W_y y^{+} + b_y)\big) \tag{6}$$
$$p_k^{-} = \sigma\big(\mathrm{avg}(W_y y_k^{-} + b_y)\big) \tag{7}$$
$$L_{cl} = -\log \frac{\exp(p^{+}/\tau)}{\exp(p^{+}/\tau) + \sum_{k} \exp(p_k^{-}/\tau)} \tag{8}$$
$$L_{mim} = -\sum_{t=1}^{T} \log P(w_t \mid w_{<t}, x) \tag{9}$$
$$L_{total} = L_{mim} + \lambda L_{cl} \tag{10}$$
Formulas (6) and (7) give the output probabilities $p^{+}$ and $p_k^{-}$ of the positive sample and the $k$-th negative sample, where $y^{+}$ and $y_k^{-}$ are the decoder hidden states of the positive sample and the $k$-th negative sample, $W_y$ and $b_y$ are learnable parameters, $\sigma$ is the Sigmoid function, which maps its input to a value between 0 and 1, and $\mathrm{avg}(\cdot)$ is an average pooling function over the length of the target sequence. Equation (8) defines the contrastive loss $L_{cl}$, which is obtained from the probability of the positive sample divided by the sum over all samples; $\tau$ is a temperature hyper-parameter that adjusts the steepness of the softmax function. Equation (9) is the cross-entropy loss $L_{mim}$, which measures the difference between the model's predictions and the actual labels; here $x$ is the model input, $w_t$ is the $t$-th target token and $T$ is the target length. Finally, equation (10) gives the total loss $L_{total}$, a weighted sum of the contrastive loss $L_{cl}$ and the cross-entropy loss $L_{mim}$, with $\lambda$ as the weight coefficient. Together, these formulas describe how the model is trained by jointly optimizing the two loss functions.
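The loss described above can be illustrated numerically with the following Python sketch. It is one plausible reading under stated assumptions, not the invention's training code: the projection is assumed to map each hidden state to a single scalar, the contrastive term is assumed to be a temperature-scaled softmax over the Sigmoid outputs of one positive and several negative samples, and the weight is assumed to multiply the contrastive term. All function names and default values are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample_probability(hidden_states, w, b):
    """Project each decoder hidden state to a scalar, average over the
    target sequence, and squash with a Sigmoid (assumed scalar projection)."""
    logits = [sum(wi * hi for wi, hi in zip(w, h)) + b for h in hidden_states]
    return sigmoid(sum(logits) / len(logits))


def contrastive_loss(p_pos, p_negs, tau=0.1):
    """Negative log of the positive sample's share of the temperature-scaled
    softmax taken over the positive and all negative samples."""
    scores = [p_pos / tau] + [p / tau for p in p_negs]
    log_denominator = math.log(sum(math.exp(s) for s in scores))
    return -(scores[0] - log_denominator)


def total_loss(cross_entropy, contrastive, lam=0.5):
    """Weighted sum of the two losses (assumed: lam scales the contrastive term)."""
    return cross_entropy + lam * contrastive


if __name__ == "__main__":
    # One positive and two negatives, as in the training setup described above.
    confident = contrastive_loss(0.9, [0.1, 0.2])
    undecided = contrastive_loss(0.5, [0.5, 0.5])
    print(confident < undecided)
```

The contrastive term is small when the positive sample scores well above both negatives and equals log 3 when all three samples are indistinguishable, which is the behavior that pushes the model to separate correct nodes from confusable ones.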
Step 23, after steps 21 and 22, post-processing such as de-duplication is applied to the retrieved document set to obtain the final retrieved document set, which is then input together with the user's question into the large model to obtain the answer.
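A minimal sketch of this post-processing step is given below. The normalization rule used to detect duplicates, the prompt layout and the names `deduplicate` and `build_prompt` are assumptions for illustration; the text does not specify them.

```python
def deduplicate(documents):
    """Drop repeated documents while preserving first-seen order."""
    seen = set()
    unique = []
    for doc in documents:
        key = " ".join(doc.split()).lower()  # assumed normalization rule
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


def build_prompt(question, documents):
    """Merge the de-duplicated evidence and the user's question into one input."""
    evidence = "\n".join(f"[{i}] {doc}" for i, doc in enumerate(documents, start=1))
    return f"Evidence:\n{evidence}\n\nQuestion: {question}\nAnswer:"


if __name__ == "__main__":
    retrieved = ["Doc A", "doc  a", "Doc B"]
    print(build_prompt("Who is older?", deduplicate(retrieved)))
```

The resulting string is what would be passed to the generative language model to produce the final answer.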