Disclosure of Invention
In view of the above, the invention aims to provide a large-model-based government document quotation retrieval method, apparatus, device and medium that can efficiently identify political nouns and strategic nouns in government document content and retrieve accurate quotations for them, thereby improving the efficiency and quality of government document writing and auditing. The specific scheme is as follows:
In a first aspect, the application discloses a government affair document quotation retrieval method based on a large model, which comprises the following steps:
acquiring government affair documents, determining document slices corresponding to the government affair documents, and creating a document knowledge base and a vector knowledge base based on the document slices;
identifying the technical terms in the government affairs document by utilizing a target pre-training natural language processing model so as to acquire corresponding identification results;
performing keyword matching on the recognition result by utilizing a target information retrieval algorithm based on the document knowledge base and the vector knowledge base so as to obtain a plurality of corresponding candidate citation documents;
and scoring the semantic relevance between each candidate citation document and the identification result through a reranker model, reordering the candidate citation documents according to the corresponding scoring results, and outputting a target citation document according to the corresponding reordering result.
Optionally, the determining the document slice corresponding to the government document, and creating a document knowledge base and a vector knowledge base based on the document slice includes:
dividing the government affair document text by using the LangChain framework to determine the document slices corresponding to the government affair documents;
constructing an index based on the document slices through Elasticsearch to create the document knowledge base;
vectorizing the document slice by using a deep learning model based on bidirectional generating codes to obtain a first processed document slice;
and constructing the vector knowledge base based on the first processed document slice.
Optionally, before the identifying the technical terms in the government affairs document by using the target pre-training natural language processing model, the method further includes:
training an initial pre-training natural language processing model by utilizing a target government affair document corpus and a government affair document field special word list to obtain a pre-training natural language processing model, wherein the government affair document field special word list comprises political nouns, strategic nouns and policy terms;
supplementing nouns in the government affair document field special word list by using an external database to obtain a supplementing result, wherein the external database comprises government files, white papers and policy documents;
training the pre-training natural language processing model by using the supplementing result to obtain the target pre-training natural language processing model.
Optionally, the keyword matching is performed on the recognition result by using a target information retrieval algorithm based on the document knowledge base and the vector knowledge base to obtain a plurality of candidate citation documents, including:
Performing word segmentation, stop word removal and keyword extraction operations on the document slices in the document knowledge base to obtain second processed document slices;
constructing a global vocabulary based on each second processed document slice, wherein the global vocabulary is a vocabulary containing unique words or phrases appearing in all second processed document slices;
vectorizing the second processed document slice based on the global vocabulary table to obtain a corresponding sparse vector, and storing the sparse vector into the document knowledge base;
Calculating a relevance score between the identification result and the keywords corresponding to each document slice in the document knowledge base based on a target information retrieval algorithm;
screening initial documents from the document knowledge base according to the relevance scores;
And carrying out keyword matching on the recognition result based on the vector knowledge base and the initial document so as to obtain a plurality of corresponding candidate citation documents.
Optionally, the keyword matching is performed on the recognition result based on the vector knowledge base and the initial document to obtain a plurality of candidate citation documents, including:
performing word segmentation, stop word removal and keyword extraction operations on the identification result to obtain a processed identification result;
determining a target vector corresponding to the processed recognition result;
retrieving a target document slice in the vector knowledge base that is similar to the target vector;
Determining a similarity score between the target document slice and the target vector;
Determining a composite score for each of the initial documents based on the similarity score and the relevance score;
and re-ordering the initial documents according to the comprehensive scores to determine a plurality of candidate citation documents.
Optionally, the scoring of semantic relevance between each candidate citation document and the identification result through a reranker model, reordering each candidate citation document according to the corresponding scoring result, and outputting a target citation document according to the corresponding reordering result includes:
carrying out semantic relevance scoring on each candidate citation document and the identification result through the reranker model;
filtering out, according to the scoring results, the candidate citation documents whose scoring results do not meet a preset scoring threshold condition, so as to determine the initial citation documents;
and reordering the initial citation documents according to the scoring results and the issuing time and issuing organization of each initial citation document, and outputting a target citation document according to the corresponding reordering result.
Optionally, the method further comprises:
and when outputting the target citation document, outputting context information of the target citation document, wherein the context information comprises the surrounding context of the cited paragraph and any one or a combination of the title, issuing organization and issuing time of the target citation document.
In a second aspect, the application discloses a government affair document quotation retrieval device based on a large model, which comprises:
the knowledge base construction module is used for acquiring government affair documents, determining document slices corresponding to the government affair documents, and creating a document knowledge base and a vector knowledge base based on the document slices;
the recognition module is used for identifying the professional terms in the government affair documents by utilizing the target pre-training natural language processing model so as to obtain corresponding identification results;
the keyword matching module is used for performing keyword matching on the identification result by utilizing a target information retrieval algorithm based on the document knowledge base and the vector knowledge base so as to obtain a plurality of corresponding candidate citation documents;
and the citation document output module is used for scoring the semantic relevance between each candidate citation document and the identification result through the reranker model, reordering the candidate citation documents according to the corresponding scoring results, and outputting the target citation document according to the corresponding reordering result.
In a third aspect, the present application discloses an electronic device, comprising:
a memory for storing a computer program;
and the processor is used for executing a computer program to realize the government affair document quotation retrieval method based on the large model.
In a fourth aspect, the present application discloses a computer readable storage medium, on which a computer program is stored, which when executed by a processor implements a large model based government document quotation retrieval method as described above.
The application first acquires government affair documents, determines the document slices corresponding to the government affair documents, and creates a document knowledge base and a vector knowledge base based on the document slices. It then identifies the professional terms in the government affair documents by using a target pre-training natural language processing model to obtain corresponding identification results, and performs keyword matching on the identification results by using a target information retrieval algorithm based on the document knowledge base and the vector knowledge base to obtain a plurality of corresponding candidate citation documents. Finally, it scores the semantic relevance between each candidate citation document and the identification results through a reranker model, reorders the candidate citation documents according to the scoring results, and outputs the target citation document according to the reordering result. In this way, the application accurately identifies proper nouns in the government field by using a natural language processing model, retrieves the documents and paragraphs related to the identified nouns from the knowledge bases by combining semantic matching with an efficient indexing mechanism, and screens and sorts the retrieval results to recommend the most relevant documents and paragraphs. The efficiency and accuracy of quotation retrieval are thereby improved, and related content can be located quickly. At the same time, the semantic understanding capability of the large model significantly improves the accuracy of noun recognition. The method can be widely applied to working scenarios such as official document writing, policy research and document checking in government departments at all levels, and provides technical support for improving the efficiency and accuracy of government work.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
At present, when government documents are written and audited, citing related documents or regulations depends mainly on manual searching. This approach has the following defects:
1. Low efficiency. Manually searching for cited documents takes a great deal of time, and efficiency drops markedly when a document involves many complex nouns. Traditional retrieval systems rely mainly on keyword matching and cannot understand the deep semantics and contextual relations of political terms, so the retrieval results are not accurate enough and contain a large amount of irrelevant content.
2. Insufficient accuracy. Manual retrieval may lead to omissions or improper citations because of comprehension bias or knowledge blind spots. Common retrieval systems lack the ability to understand the specific terms and expressions of the government field and cannot accurately identify and extract key content such as political nouns and strategic expressions.
3. Low degree of intelligence. Although existing knowledge bases have accumulated rich policies, regulations and strategic documents, their value is difficult to fully exploit because intelligent retrieval tools are lacking. Most existing systems offer only passive retrieval, lack functions such as active recommendation and intelligent analysis, and require a large amount of manual screening and judgment.
In order to solve these technical problems, the application discloses a government affair document quotation retrieval method based on a large model, which can efficiently identify political nouns and strategic nouns in government document content and provide accurate quotation retrieval, so as to improve the efficiency and quality of government document writing and auditing.
Referring to fig. 1, the embodiment of the invention discloses a government affair document quotation retrieval method based on a large model, which comprises the following steps:
Step S11: acquiring government affair documents, determining document slices corresponding to the government affair documents, and creating a document knowledge base and a vector knowledge base based on the document slices.
In this embodiment, a government document draft uploaded by a user is first received; both plain-text and document file formats are supported. The LangChain framework is used to divide the government affair document text and cut the document into a plurality of small blocks (namely, document slices). This ensures that each part of the document content can be processed independently and improves the efficiency of subsequent retrieval and matching. An index of the sliced document content is constructed through Elasticsearch (ES) to create the document knowledge base, which makes it easy to quickly search for related information in the documents and provides efficient data storage and query support for subsequent quotation retrieval. The document slices are vectorized using a BGE (Bidirectional Generative Encoder) model. The BGE model can effectively capture the semantic information of the document content and generate high-dimensional vector representations that are well suited to semantic matching and retrieval. A vector index library of the document slices is then constructed through Faiss (Facebook AI Similarity Search) to form the vector knowledge base. The Faiss library supports efficient approximate nearest neighbor queries, which improves the speed and accuracy of large-scale document retrieval. In other words, the government affair document text is divided by using the LangChain framework to determine the document slices corresponding to the government affair documents; an index is constructed based on the document slices through Elasticsearch to create the document knowledge base; the document slices are vectorized by using a deep learning model based on bidirectional generative encoding to obtain first processed document slices; and the vector knowledge base is constructed based on the first processed document slices.
Specifically, as shown in FIG. 2, the LangChain framework is used to divide the government affair document text and cut the document into small blocks (i.e., document slices), so that each part of the document content can be processed independently and subsequent retrieval and matching become more efficient. An index of the sliced document content is constructed through Elasticsearch (ES) to create the document knowledge base. The document knowledge base provides the retrieval module with fast retrieval support in the word segmentation dimension: the full-text retrieval capability of Elasticsearch quickly locates the document slices relevant to a user query, screens out the document fragments that may be relevant, and provides a candidate set for subsequent precise matching. The document slices are also vectorized by the BGE model, which captures the semantic information of the document content and generates high-dimensional vector representations that improve retrieval accuracy when used for semantic matching. A vector index library of the document slices is constructed through Faiss to form the vector knowledge base. The vector knowledge base provides the retrieval module with retrieval support based on semantic vectors: Faiss performs efficient approximate nearest neighbor (ANN, Approximate Nearest Neighbor) queries, which evaluate the similarity between document slices and the user query at the semantic level and yield more accurate matching results.
In summary, document slicing is implemented based on the LangChain framework, a sparse index library is constructed through Elasticsearch (ES), and the BGE model is used to vectorize the documents and construct a Faiss vector index library. Through this dual index mechanism, the system can dynamically manage and expand the knowledge bases, meet retrieval requirements of different scales and complexity, and provide a flexible and efficient solution for government document management.
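The slicing and sparse indexing described above can be illustrated with a minimal, library-free sketch. It is not the LangChain or Elasticsearch implementation: the function names `split_into_slices` and `build_inverted_index`, the character-window strategy, and the `chunk_size` and `overlap` values are all assumptions chosen for illustration, and a simple in-memory inverted index stands in for the Elasticsearch index.

```python
from collections import defaultdict


def split_into_slices(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Cut text into character windows of chunk_size that overlap by `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    slices = []
    for start in range(0, len(text), step):
        slices.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return slices


def build_inverted_index(slices: list[str]) -> dict[str, set[int]]:
    """Map each whitespace-separated token to the ids of the slices that contain it."""
    index: dict[str, set[int]] = defaultdict(set)
    for slice_id, piece in enumerate(slices):
        for token in piece.lower().split():
            index[token].add(slice_id)
    return dict(index)


if __name__ == "__main__":
    demo_slices = split_into_slices("abcdefghij", chunk_size=4, overlap=1)
    print(demo_slices)
    print(build_inverted_index(["policy on rural revitalization", "rural policy notice"]))
```

The overlap keeps a term that straddles a slice boundary intact in at least one slice; in a real deployment the token split would be replaced by a proper word segmenter for Chinese text.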
In general, the application produces high-dimensional semantic representations of documents and queries based on a deep learning model (such as the BGE model) and can thereby capture deep semantic associations. First, high-dimensional semantic vectors are generated through the BGE model for the documents in the knowledge base and for the query statement. These vectors represent overall semantic features and are not limited to keyword-level information. Then, the similarity between the query vector and each document vector is calculated using cosine similarity or Euclidean distance to measure their degree of semantic relevance. In addition, the system uses a Faiss-based vector index library to accelerate vector retrieval through cluster partitioning and an efficient approximate nearest neighbor (ANN) algorithm: the vector space is divided into multiple subspaces, and a quantization technique in each subspace encodes the vectors into short codes. During retrieval, the nearest neighbors are computed approximately by searching the short codes, which significantly reduces the amount of computation, accelerates retrieval, and supports real-time and accurate retrieval over a large-scale knowledge base.
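The similarity computation can be sketched as follows. This is an exact brute-force search over toy vectors, shown only to make the cosine similarity step concrete; it is not the approximate, quantization-based search that Faiss performs, and the names `cosine_similarity` and `top_k` are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list[float], vectors: list[list[float]], k: int = 3) -> list[tuple[int, float]]:
    """Return (index, similarity) for the k vectors most similar to the query."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    slice_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy 2-D embeddings
    print(top_k([1.0, 0.0], slice_vectors, k=2))
```

Brute force compares the query with every stored vector, so its cost grows linearly with the knowledge base; the short-code search described above trades a small amount of exactness for a much lower cost per query.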
Step S12: identifying the professional terms in the government affair documents by utilizing a target pre-training natural language processing model so as to acquire corresponding identification results.
In this embodiment, before the professional terms in the government affair documents are identified by using the target pre-training natural language processing model, an initial pre-training natural language processing model is trained by using a target government affair document corpus and a government affair document field special word list to obtain a pre-training natural language processing model, wherein the special word list comprises political nouns, strategic nouns and policy terms; nouns in the special word list are supplemented by using an external database to obtain a supplementing result, wherein the external database comprises government files, white papers and policy documents; and the pre-training natural language processing model is trained by using the supplementing result to obtain the target pre-training natural language processing model. That is, the application adopts a BERT (Bidirectional Encoder Representations from Transformers) model for semantic analysis, dedicated to identifying political nouns, strategic nouns and related terms in government documents. It should be noted that the application adopts a large-scale pre-trained BERT model as the base model and fine-tunes it on a domain-specific government document corpus so that it better understands the professional terms, such as political nouns and strategic terms, that appear in the documents. The fine-tuning process adjusts the model parameters based on labeled data (e.g., documents annotated with labels such as political nouns and strategic terms) to suit the particular context of government documents. At the same time, a word list specific to the government document field is introduced, which comprises political nouns, strategic nouns, policy terms and the like. The word list covers not only common terms but also relatively specialized and rarely used political and economic terms. Introducing the word list effectively improves the accuracy of noun recognition and helps the BERT model recognize special terms that standard models have difficulty capturing.
In order to enhance the recognition capability of the model on the newly appeared nouns or terms, a dynamic expansion mechanism is designed. The mechanism can automatically expand the terms in the vocabulary by analyzing newly uploaded documents or data fed back by users in real time. The expansion process ensures that new political or strategic nouns can be identified in time based on the matching of new words in the data to the context. Specifically, nouns are supplemented with an external knowledge base (e.g., government documents, white papers, policy documents, etc.). Based on the prediction result of the BERT model, the unrecognized nouns are marked and fed back to the training set for retraining so as to enhance the recognition capability of the model.
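The role of the special word list and its dynamic expansion can be illustrated with a small dictionary-based sketch. This is explicitly not the fine-tuned BERT recognizer described above: a greedy longest-match lookup stands in for the model, and the class name `DomainVocabulary` and its methods are hypothetical.

```python
class DomainVocabulary:
    """A set of domain terms with greedy longest-match lookup and incremental expansion."""

    def __init__(self, terms):
        self.terms = set(terms)

    def add_terms(self, new_terms):
        """Add terms (e.g. from newly uploaded documents); return those that were new."""
        added = {t for t in new_terms if t and t not in self.terms}
        self.terms |= added
        return added

    def find_terms(self, text):
        """Scan left to right, preferring the longest term at each position.

        Returns a list of (start_offset, term) pairs in order of appearance.
        """
        if not self.terms:
            return []
        max_len = max(len(t) for t in self.terms)
        hits, i = [], 0
        while i < len(text):
            match = None
            for length in range(min(max_len, len(text) - i), 0, -1):
                candidate = text[i:i + length]
                if candidate in self.terms:
                    match = candidate
                    break
            if match:
                hits.append((i, match))
                i += len(match)
            else:
                i += 1
        return hits


if __name__ == "__main__":
    vocab = DomainVocabulary({"rural revitalization", "rural"})
    print(vocab.find_terms("promote rural revitalization strategy"))
    vocab.add_terms(["new quality productive forces"])
    print(vocab.find_terms("develop new quality productive forces"))
```

A lookup of this kind cannot use context, which is the gap the model-based recognizer is meant to close; in the mechanism described above, terms the model fails to recognize would be labeled and returned to the training set rather than only added to a dictionary.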
In addition to political and strategic nouns, the names of related persons, places, institutions and the like are identified from the documents in combination with named entity recognition (NER, Named Entity Recognition) technology. This process relies on the multi-level feature extraction capability of BERT to improve the model's understanding of complex document content. Term recognition is not limited to word-level matching; it also draws on the bidirectional semantic understanding of the BERT model to interpret context. By analyzing semantic information at the sentence level, paragraph level and even full-text level, the system ensures that the recognized terms are accurate and consistent with the overall context of the document.
Finally, the professional terms in the government affair documents are identified by using the target pre-training natural language processing model to obtain the corresponding identification results. To further enhance system performance, the application also introduces a comprehensive optimization mechanism covering several aspects. First, a dynamic query expansion technique expands the user's query content based on semantic embeddings and a domain knowledge graph, improving retrieval coverage. Second, an efficient incremental update strategy for the knowledge base ensures that newly added or modified documents are brought into the retrieval range in time. In addition, to cope with large-scale data, an index compression mechanism, a caching mechanism and a distributed retrieval architecture are adopted, which effectively reduces the response time of the system. Together, these optimizations give the system high dynamic adaptability and large-scale retrieval capability.
Step S13: performing keyword matching on the identification result by utilizing a target information retrieval algorithm based on the document knowledge base and the vector knowledge base so as to obtain a plurality of corresponding candidate citation documents.
In this embodiment, candidate citation documents are determined as follows. Word segmentation, stop word removal and keyword extraction operations are performed on the document slices in the document knowledge base to obtain second processed document slices. A global vocabulary is constructed based on the second processed document slices, the global vocabulary being a vocabulary containing the unique words or phrases appearing in all the second processed document slices. The second processed document slices are vectorized based on the global vocabulary to obtain corresponding sparse vectors, and the sparse vectors are stored in the document knowledge base. A relevance score between the identification result and the keywords corresponding to each document slice in the document knowledge base is calculated based on the target information retrieval algorithm, and initial documents are screened from the document knowledge base according to the relevance scores. Word segmentation, stop word removal and keyword extraction operations are then performed on the identification result to obtain a processed identification result, and a target vector corresponding to the processed identification result is determined. A target document slice similar to the target vector is retrieved in the vector knowledge base, and a similarity score between the target document slice and the target vector is determined. A composite score of each initial document is determined based on the similarity score and the relevance score, and the initial documents are reordered according to the composite scores to determine a plurality of candidate citation documents. That is, the application first preprocesses the documents in the knowledge base, including word segmentation, stop word removal and keyword extraction, and constructs a bag-of-words model of each document. Then, the BM25 (Best Matching 25) algorithm evaluates the relevance between the query statement and each document by calculating the degree of matching between the keywords in the query statement and the document keywords.
This process takes into account term frequency (TF), inverse document frequency (IDF, Inverse Document Frequency) and the distribution of terms in the document, and can efficiently capture explicit semantic associations between queries and documents. The candidate documents are then reordered according to the composite score to generate the final retrieval result. This ensures that the retrieval result reflects both explicit semantic association and semantic similarity, thereby significantly improving retrieval accuracy and user experience.
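The relevance scoring and score fusion can be sketched as below. The BM25 function follows the widely used Okapi form with a non-negative IDF variant and the conventional default parameters `k1 = 1.5` and `b = 0.75`; the source does not specify these values. The source also does not state how the composite score is computed, so the min-max normalization and the weight `alpha` in `composite_scores` are assumptions.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query terms with Okapi BM25."""
    n_docs = len(docs)
    if n_docs == 0:
        return []
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


def min_max(values):
    """Rescale values to [0, 1]; all zeros if every value is identical."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def composite_scores(relevance, similarity, alpha=0.5):
    """Assumed fusion: weighted sum of normalized BM25 relevance and vector similarity."""
    rel, sim = min_max(relevance), min_max(similarity)
    return [alpha * r + (1 - alpha) * s for r, s in zip(rel, sim)]


if __name__ == "__main__":
    docs = [["rural", "revitalization", "strategy"],
            ["urban", "planning", "notice"],
            ["rural", "policy"]]
    sparse = bm25_scores(["rural", "revitalization"], docs)
    print(sparse)
    print(composite_scores(sparse, [0.2, 0.1, 0.9]))
```

Normalizing before fusion matters because BM25 scores are unbounded while cosine similarities lie in a fixed range; without it, one of the two signals would dominate the composite score.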
Step S14: scoring the semantic relevance between each candidate citation document and the identification result through a reranker model, reordering the candidate citation documents according to the corresponding scoring results, and outputting a target citation document according to the corresponding reordering result.
In this embodiment, semantic relevance scoring is performed on each candidate citation document and the identification result through a reranker model; the candidate citation documents whose scoring results do not meet a preset scoring threshold condition are filtered out according to the scoring results to determine the initial citation documents; and the initial citation documents are reordered according to the scoring results and the issuing time and issuing organization of each initial citation document, and the target citation document is output according to the corresponding reordering result. These operations are mainly carried out by the quotation recommendation module. The quotation recommendation module is the component of the system that generates high-quality recommendation results; its core functions are scoring the relevance of the retrieval results and performing the final reordering so that the user obtains the best quotation content. The module combines the noun recognition results with deep learning technology and achieves accurate quotation recommendation through a multi-stage processing flow, as shown in FIG. 3. First, the system extracts the retrieval result set related to the nouns according to the political noun and strategic noun results output by the noun recognition module. These retrieval results come from the preliminary screening of the sparse retrieval and vector retrieval modules and contain documents and paragraphs semantically related to the nouns. To further improve the accuracy of the recommended results, a reranker model is introduced into the system as a secondary screening tool. In the re-scoring stage, the system performs an in-depth analysis of the candidate documents and paragraphs.
Specifically, the reranker model is based on a pre-trained language model such as BERT and scores the relevance of the retrieval results at fine granularity in combination with contextual semantics. The model attends not only to direct matching between query and document but also to deep semantic associations between the document context and the query intent. Taking the query content and the paragraph text of a candidate document as input, the reranker model generates a relevance score for each document, and low-scoring documents that do not meet the requirement are filtered out. The system then reorders the candidate results according to the relevance scores of the reranker model. During reordering, the system considers not only the degree of semantic matching but also the authority and timeliness of each document. For example, for government documents, the system preferentially recommends documents published by authoritative organizations and uses the publication time to provide users with more up-to-date reference sources. To keep the ranking fair and efficient, the system adopts a weighted ranking algorithm that combines the scoring indicators to assign each document its final ranking position.
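The threshold filtering and weighted ranking can be sketched as follows. The source does not give the weights, the threshold, or how authority and timeliness are quantified, so everything numeric here is an assumption: the `Candidate` fields, the authority weight in [0, 1], the linear recency decay over `horizon_days`, and the weights `w_rel`, `w_auth` and `w_time`. The reranker score itself is taken as a given input rather than computed.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Candidate:
    title: str
    relevance: float   # reranker score, assumed to lie in [0, 1]
    issued: date       # issuing time of the document
    authority: float   # hypothetical weight in [0, 1] for the issuing organization


def rank_citations(candidates, today, threshold=0.5,
                   w_rel=0.6, w_auth=0.25, w_time=0.15, horizon_days=3650):
    """Drop candidates below the relevance threshold, then sort by a weighted score."""
    kept = [c for c in candidates if c.relevance >= threshold]

    def final_score(c):
        age_days = max((today - c.issued).days, 0)
        # Linear decay: 1.0 for a document issued today, 0.0 at or beyond the horizon.
        recency = max(0.0, 1.0 - age_days / horizon_days)
        return w_rel * c.relevance + w_auth * c.authority + w_time * recency

    return sorted(kept, key=final_score, reverse=True)


if __name__ == "__main__":
    pool = [
        Candidate("A", 0.90, date(2015, 1, 1), 0.5),
        Candidate("B", 0.85, date(2024, 1, 1), 1.0),
        Candidate("C", 0.30, date(2024, 6, 1), 1.0),
    ]
    for c in rank_citations(pool, today=date(2025, 1, 1)):
        print(c.title)
```

In this toy pool, candidate C is removed by the threshold, and B outranks A despite a slightly lower relevance score because it is both more recent and issued by a higher-authority organization.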
Finally, when the target citation document is output, the context information of the target citation document is also output, wherein the context information comprises the surrounding context of the cited paragraph and any one or a combination of the title, issuing organization and issuing time of the target citation document. Specifically, when the recommendation results are output, the system provides detailed context information for each cited item. The context information includes the text surrounding the cited paragraph and the metadata of the document (e.g., title, issuing organization and issuing time), so that the user is fully aware of the source of the citation. In addition, the system supports further screening and customization of the recommended content by the user, such as refining the ranking of results by time, source or keyword. In this way, the quotation recommendation module ensures that the recommended content is highly relevant and authoritative and provides reliable quotation support for users. The retrieval results are re-scored using a BERT classification model and reranker technology, and reordered by combining the degree of semantic matching with document authority. By providing context information and detailed metadata for the cited content, the system significantly improves the authority and practicality of the recommended content and helps users complete quotation screening and citing efficiently.
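The output with context information can be sketched as a small helper that returns the cited paragraph together with its neighbouring paragraphs and whatever metadata is available. The function name `build_citation_output`, the `window` parameter, and the metadata keys `title`, `issuer` and `issued` are hypothetical choices for illustration.

```python
def build_citation_output(paragraphs, hit_index, metadata, window=1):
    """Bundle the cited paragraph with `window` paragraphs of context on each side.

    Only the metadata keys that are actually present are copied into the result.
    """
    start = max(0, hit_index - window)
    end = min(len(paragraphs), hit_index + window + 1)
    output = {
        "quoted": paragraphs[hit_index],
        "before": paragraphs[start:hit_index],
        "after": paragraphs[hit_index + 1:end],
    }
    for key in ("title", "issuer", "issued"):
        if key in metadata:
            output[key] = metadata[key]
    return output


if __name__ == "__main__":
    paras = ["p0", "p1", "p2", "p3"]
    print(build_citation_output(paras, 2, {"title": "Example notice", "issued": "2024-01-01"}))
```

Copying only the keys that exist mirrors the "any one or a combination" wording above: a citation can be returned with just a title, or with the full set of title, issuing organization and issuing time.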
The application first acquires government affair documents, determines the document slices corresponding to the government affair documents, and creates a document knowledge base and a vector knowledge base based on the document slices. It then identifies the professional terms in the government affair documents by using a target pre-training natural language processing model to obtain corresponding identification results, and performs keyword matching on the identification results by using a target information retrieval algorithm based on the document knowledge base and the vector knowledge base to obtain a plurality of corresponding candidate citation documents. Finally, it scores the semantic relevance between each candidate citation document and the identification results through a reranker model, reorders the candidate citation documents according to the scoring results, and outputs the target citation document according to the reordering result. In this way, the application accurately identifies proper nouns in the government field by using a natural language processing model, retrieves the documents and paragraphs related to the identified nouns from the knowledge bases by combining semantic matching with an efficient indexing mechanism, and screens and sorts the retrieval results to recommend the most relevant documents and paragraphs. The efficiency and accuracy of quotation retrieval are thereby improved, and related content can be located quickly. At the same time, the semantic understanding capability of the large model significantly improves the accuracy of noun recognition. The method can be widely applied to working scenarios such as official document writing, policy research and document checking in government departments at all levels, and provides technical support for improving the efficiency and accuracy of government work.
Based on the above embodiment, the present application may perform keyword matching on the recognition result by using a target information retrieval algorithm based on the document knowledge base and the vector knowledge base, so as to obtain a plurality of candidate citation documents. Next, a detailed description will be made regarding a specific candidate citation document screening process.
The sparse retrieval module uses a traditional bag-of-words model and the BM25 algorithm to perform preliminary screening of the documents. First, the documents in the knowledge base are preprocessed, including word segmentation, stop word removal, and keyword extraction, and a bag-of-words model is constructed for each document. The BM25 algorithm then evaluates the relevance of each document by calculating the degree of matching between the keywords in the query statement and the keywords of the document. This process takes into account term frequency (TF), inverse document frequency (IDF), and the distribution of words in the document, and efficiently captures the explicit semantic association between the query and the document. By screening out the documents with higher scores, the sparse retrieval module provides a candidate document set for subsequent processing quickly and efficiently. The detailed steps are as follows:
1.1, Document preprocessing:
Word segmentation: each document in the knowledge base is segmented into words or phrases. This step follows the word segmentation rules of the language concerned, for example jieba for Chinese and NLTK for English.
Stop word removal: common stop words (high-frequency function words such as "is") are removed from the document. These words carry little semantic information, and removing them reduces noise and improves retrieval efficiency.
Keyword extraction: keywords are extracted from the document for the subsequent construction of the bag-of-words model. Keyword extraction can be implemented with algorithms such as TF-IDF and TextRank, so that the extracted keywords are informative and representative.
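The preprocessing steps above can be sketched as follows. This is a simplified illustration, assuming English text: the regular-expression tokenizer stands in for jieba or NLTK, the stop-word list is a tiny sample, and plain frequency counting stands in for TF-IDF or TextRank. All function names are hypothetical.

```python
import re
from collections import Counter
from typing import List

# Tiny illustrative stop-word list; a real system would load a full list.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "to", "in"}


def tokenize(text: str) -> List[str]:
    """Lower-case the text and split it on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stop_words(tokens: List[str]) -> List[str]:
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def extract_keywords(tokens: List[str], top_k: int = 5) -> List[str]:
    """Keep the top_k most frequent tokens (a crude substitute for TF-IDF or TextRank)."""
    return [word for word, _ in Counter(tokens).most_common(top_k)]


def preprocess(text: str, top_k: int = 5) -> List[str]:
    """Run segmentation, stop-word removal, and keyword extraction in sequence."""
    return extract_keywords(remove_stop_words(tokenize(text)), top_k)
```

The same `preprocess` function can be applied to documents and to queries, which keeps both sides of the later matching step in the same token space.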
1.2, Bag-of-words model construction:
Vocabulary construction: a global vocabulary is constructed from the preprocessed document set. The vocabulary contains every unique word or phrase that appears in the documents.
Document vectorization: each document is represented as a vector in which each dimension corresponds to a word or phrase in the vocabulary, and the value in that dimension is the term frequency (TF) or TF-IDF value of the word or phrase in the document. In this way, each document is converted into a sparse vector, which is stored in the document knowledge base.
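As a rough sketch of the vocabulary and sparse-vector construction described above, the following uses raw term frequency as the vector value. The function names and the dictionary-based sparse representation are assumptions chosen for illustration.

```python
from collections import Counter
from typing import Dict, List


def build_vocabulary(docs: List[List[str]]) -> Dict[str, int]:
    """Map every unique term across all tokenized documents to a column index."""
    vocab: Dict[str, int] = {}
    for tokens in docs:
        for term in tokens:
            # Assign the next free index the first time a term is seen.
            vocab.setdefault(term, len(vocab))
    return vocab


def to_sparse_tf(tokens: List[str], vocab: Dict[str, int]) -> Dict[int, int]:
    """Sparse term-frequency vector as {column index: count}; zero entries are omitted."""
    counts = Counter(t for t in tokens if t in vocab)
    return {vocab[t]: c for t, c in counts.items()}
```

Storing only the non-zero entries keeps each vector small even when the global vocabulary is large, which is the practical meaning of "sparse" here.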
1.3, BM25 algorithm:
Query preprocessing: the query statement input by the user is preprocessed in the same way as the documents, including word segmentation and stop word removal.
Keyword matching: the degree of matching between the keywords in the query statement and the keywords of each document is calculated. The BM25 algorithm evaluates the relevance between the query and a document by taking into account term frequency (TF), inverse document frequency (IDF), and the distribution of terms in the document.
The BM25 algorithm generates a relevance score for each document, with higher scores indicating higher relevance of the document to the query. The specific formula is as follows:
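$$\mathrm{Score}(q, d) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, d) \cdot (k_1 + 1)}{f(q_i, d) + k_1 \cdot \left(1 - b + b \cdot \frac{|d|}{\mathrm{avgdl}}\right)}$$
Wherein $\mathrm{Score}(q, d)$ is the relevance score, $n$ is the total number of words in query $q$, $q$ is the query, $d$ is the document, $q_i$ is a keyword in the query, $f(q_i, d)$ is the word frequency of keyword $q_i$ in document $d$, $\mathrm{IDF}(q_i)$ is the inverse document frequency of keyword $q_i$, $k_1$ and $b$ are adjustment parameters, $|d|$ is the length of document $d$, and $\mathrm{avgdl}$ is the average length of the documents in the collection.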
1.4, Preliminary screening:
The documents with higher BM25 scores are screened out to form a preliminary candidate document set. These documents are considered to have a stronger explicit semantic association with the user query and provide the basis for the subsequent dense vector retrieval.
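The BM25 scoring and preliminary screening of steps 1.3 and 1.4 can be sketched as follows. This is one common formulation and not necessarily the exact variant used by the application: the smoothed IDF, the default values k1 = 1.5 and b = 0.75, and the function names are assumptions. The sketch also assumes a non-empty document collection.

```python
import math
from collections import Counter
from typing import List, Tuple


def bm25_scores(query: List[str], docs: List[List[str]],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every tokenized document against the tokenized query."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue  # terms absent from the document contribute nothing
            # Smoothed IDF (assumed variant; the "+ 1" keeps it non-negative).
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Length normalization: longer-than-average documents are penalized.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def top_k(scores: List[float], k: int) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs for the k highest-scoring documents."""
    ranked = sorted(enumerate(scores), key=lambda p: p[1], reverse=True)
    return ranked[:k]
```

The value of `k` in `top_k` controls how many documents are handed on to the dense retrieval stage; a larger value trades speed for recall.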
In addition, the dense vector retrieval module utilizes a vector knowledge base and an efficient approximate nearest neighbor algorithm to realize accurate retrieval based on semantics. The detailed steps are as follows:
2.1, constructing a vector knowledge base:
Document slice vectorization: each document slice in the document knowledge base is vectorized using a BGE (Bidirectional Generative Encoder) model. The BGE model captures the semantic information of the document content and generates a high-dimensional vector representation.
Vector index construction: a vector index of the document slices is built with the Faiss library to form the vector knowledge base. The Faiss library supports efficient approximate nearest neighbor (ANN) queries and can quickly locate the document slices most similar to a query vector.
2.2, Query vectorization:
Query preprocessing: the query statement input by the user is preprocessed in the same way as the documents, including word segmentation and stop word removal.
Query vectorization: the query statement is converted into a high-dimensional vector representation using the BGE model. This vector is used for similarity retrieval in the vector knowledge base.
2.3, Approximate nearest neighbor search:
Vector retrieval: the query vector is input into the vector knowledge base and retrieved using the approximate nearest neighbor (ANN) algorithm of the Faiss library. Faiss quickly locates the document slices most similar to the query vector by clustering the slice vectors and using an efficient index structure.
Similarity scoring: a similarity score is generated for each retrieved document slice. The higher the score, the higher the semantic similarity between the document slice and the query vector.
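The similarity search of steps 2.1 to 2.3 can be illustrated with an exact, brute-force version. This is a deliberately simplified stand-in: it does not call the BGE model or the Faiss library, it assumes the slice and query vectors have already been produced by some embedding model, and it uses cosine similarity as the score. An ANN index such as those provided by Faiss would replace the linear scan in a large knowledge base.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def search(query_vec: Sequence[float],
           slice_vecs: List[Sequence[float]],
           k: int) -> List[Tuple[int, float]]:
    """Exact nearest-neighbor search: return the k most similar (slice index, score) pairs."""
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(slice_vecs)]
    return sorted(scored, key=lambda p: p[1], reverse=True)[:k]
```

Because the scan is exact, its results are a useful reference when checking that an approximate index returns acceptable neighbors.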
2.4, Fusion of results:
Comprehensive scoring: the results of the sparse retrieval module and the dense vector retrieval module are fused. For each candidate document, a comprehensive score is generated by combining its BM25 score and its similarity score.
Final sorting: the candidate documents are re-sorted according to the comprehensive scores to generate the final retrieval result. This procedure ensures that the retrieval result takes into account both explicit semantic association (sparse retrieval) and semantic similarity (dense vector retrieval).
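One way to realize the comprehensive scoring above is linear weighting of normalized scores, sketched below. The min-max normalization, the weight `alpha`, the treatment of a document missing from one list as scoring 0 there, and the function names are all assumptions made for this example; the application does not fix these details.

```python
from typing import Dict, List, Tuple


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] so BM25 and similarity scores become comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def fuse(bm25: Dict[str, float], dense: Dict[str, float],
         alpha: float = 0.5) -> List[Tuple[str, float]]:
    """Linear weighted fusion: alpha * sparse score + (1 - alpha) * dense score."""
    sparse_n, dense_n = min_max(bm25), min_max(dense)
    doc_ids = set(sparse_n) | set(dense_n)
    fused = {doc_id: alpha * sparse_n.get(doc_id, 0.0)
                     + (1 - alpha) * dense_n.get(doc_id, 0.0)
             for doc_id in doc_ids}
    # Highest comprehensive score first; ties are broken by document id.
    return sorted(fused.items(), key=lambda p: (-p[1], p[0]))
```

Raising `alpha` favors exact keyword matches, while lowering it favors semantic similarity; a suitable value would normally be tuned on labelled queries.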
The hybrid retrieval module achieves efficient and accurate knowledge retrieval by combining the advantages of sparse retrieval and dense vector retrieval. The modules cooperate as follows:
3.1, Sparse retrieval module:
Preliminary screening: the BM25 algorithm quickly screens out a document set that has a strong explicit semantic association with the query statement, reducing the number of documents to be processed.
Candidate document set: the screened documents provide a compact candidate document set for the dense vector retrieval module, improving retrieval efficiency.
3.2, Dense vector retrieval module:
Semantic matching: the candidate document set is semantically matched using the BGE model and the Faiss library, further improving the accuracy of the retrieval result.
Similarity scoring: a similarity score is generated for each candidate document, ensuring that the retrieval results are not only related in explicit semantics but also closely matched in meaning.
3.3, Fusion of results:
Comprehensive scoring: the BM25 score from the sparse retrieval module and the similarity score from the dense vector retrieval module are fused to generate a comprehensive score.
Final sorting: the candidate documents are re-sorted according to the comprehensive scores to generate the final retrieval result. This process ensures that the retrieval result takes into account both explicit semantic association and semantic similarity, improving retrieval accuracy and user experience.
With this hybrid retrieval mode, the system makes full use of the efficiency of sparse retrieval and the accuracy of dense vector retrieval, improving the overall performance and user experience of document retrieval. The multi-path retrieval strategy combines the advantages of sparse retrieval and vector retrieval and balances retrieval efficiency against precision. First, the sparse retrieval module quickly screens out a candidate set of documents that explicitly match the query content, providing the basis for subsequent in-depth retrieval. Then, the vector retrieval module performs semantic matching on the candidate documents, and the matching can be extended to the global document library to find documents with potential semantic associations. Finally, the system combines the BM25 score of the sparse retrieval with the semantic similarity score of the vector retrieval and fuses and sorts the retrieval results using linear weighting or a learning-based ranking method to generate the final quotation recommendation list. The multi-path strategy improves the recall of retrieval while maintaining the accuracy of the results. In this strategy, the sparse retrieval module uses the BM25 algorithm to locate highly relevant documents quickly, and the vector retrieval module builds a vector index through the BGE model to achieve deep matching at the semantic level. Together, these mechanisms allow candidate documents related to the query to be found quickly in a large-scale knowledge base and allow potentially relevant information to be captured through deep semantic matching, improving the comprehensiveness and accuracy of retrieval.
Therefore, the efficiency and quality of government official document processing and citation retrieval are remarkably improved, an innovative solution is provided for government informatization and intelligent office work, and the method has great social value and application prospect.
Referring to fig. 4, the embodiment of the invention discloses a government affair document quotation retrieval device based on a large model, which comprises:
The knowledge base construction module 11 is used for acquiring government affair documents, determining document slices corresponding to the government affair documents, and creating a document knowledge base and a vector knowledge base based on the document slices;
The recognition module 12 is configured to recognize the technical terms in the government affairs document by using a target pre-training natural language processing model, so as to obtain a corresponding recognition result;
the keyword matching module 13 is configured to perform keyword matching on the recognition result by using a target information retrieval algorithm based on the document knowledge base and the vector knowledge base, so as to obtain a plurality of candidate citation documents;
And the quotation document output module 14 is configured to score semantic relevance between each candidate quotation document and the recognition result through the reranker model, reorder each candidate quotation document according to the corresponding scoring result, and output a target quotation document according to the corresponding reordering result.
In this embodiment, government documents are first acquired, the document slices corresponding to the government documents are determined, and a document knowledge base and a vector knowledge base are created based on the document slices. A target pre-training natural language processing model is then used to identify the technical terms in the government documents and obtain the corresponding recognition results. Next, a target information retrieval algorithm performs keyword matching on the recognition results based on the document knowledge base and the vector knowledge base to obtain a plurality of candidate citation documents. Finally, a reranker model scores the semantic relevance between each candidate citation document and the recognition results, the candidate citation documents are reordered according to the scoring results, and the target citation document is output according to the reordering result. The application thus identifies proper nouns in the government field using a natural language processing model, retrieves the documents and paragraphs relevant to the identified nouns from the knowledge bases by combining semantic matching with an efficient indexing mechanism, and screens and sorts the retrieval results to recommend the most relevant documents and paragraphs. This improves the efficiency and accuracy of quotation retrieval and allows relevant content to be located quickly. The semantic understanding capability of the large model also improves the accuracy of noun recognition. The method can be applied to working scenarios such as official document writing, policy research, and document review in government departments at all levels, and provides technical support for improving the efficiency and accuracy of government work.
In some specific embodiments, the knowledge base construction module 11 may be specifically configured to: split the government document text using the LangChain framework to determine the document slices corresponding to the government document; construct an index over the document slices using Elasticsearch to create the document knowledge base; vectorize the document slices using a deep learning model based on bidirectional generative encoding to obtain first processed document slices; and construct the vector knowledge base based on the first processed document slices.
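The slicing step can be illustrated with a simple fixed-size splitter with overlap. This sketch does not use the LangChain API; it is a plain-Python stand-in, and the character-based `chunk_size` and `overlap` parameters and the function name `split_text` are assumptions made for the example.

```python
from typing import List


def split_text(text: str, chunk_size: int = 200, overlap: int = 20) -> List[str]:
    """Split text into slices of at most chunk_size characters, with adjacent
    slices sharing `overlap` characters so content cut at a boundary is kept."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    slices: List[str] = []
    start = 0
    while start < len(text):
        slices.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the slice just added already reaches the end of the text
        start += step
    return slices
```

Each resulting slice would then be indexed for keyword retrieval and embedded for vector retrieval; splitting on paragraph or sentence boundaries is a common refinement for official documents.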
In some specific embodiments, the device may be further configured to: train an initial pre-training natural language processing model using a target government document corpus and a vocabulary specific to the government document domain to obtain a pre-training natural language processing model, wherein the vocabulary includes political nouns, strategic nouns, and policy terms; supplement the nouns in the vocabulary using an external database to obtain a supplement result, wherein the external database includes government documents, white papers, and policy documents; and train the pre-training natural language processing model using the supplement result to obtain the target pre-training natural language processing model.
In some specific embodiments, the keyword matching module 13 may be specifically configured to: perform word segmentation, stop word removal, and keyword extraction on the document slices in the document knowledge base to obtain second processed document slices; construct a global vocabulary based on the second processed document slices, wherein the global vocabulary contains the unique words or phrases appearing in all second processed document slices; vectorize the second processed document slices based on the global vocabulary to obtain corresponding sparse vectors, and store the sparse vectors in the document knowledge base; calculate, based on the target information retrieval algorithm, relevance scores between the recognition result and the keywords corresponding to each document slice in the document knowledge base; screen initial documents from the document knowledge base according to the relevance scores; and perform keyword matching on the recognition result based on the vector knowledge base and the initial documents to obtain a plurality of candidate citation documents.
In some specific embodiments, the keyword matching module 13 may be specifically configured to perform word segmentation, stop word removal, and keyword extraction on the recognition result, obtain a processed recognition result, determine a target vector corresponding to the processed recognition result, retrieve a target document slice similar to the target vector in the vector knowledge base, determine a similarity score between the target document slice and the target vector, determine a composite score of each initial document based on the similarity score and the relevance score, and reorder the initial documents according to the composite score to determine a plurality of candidate citation documents.
In some specific embodiments, the quotation document output module 14 may be specifically configured to: score the semantic relevance between each candidate citation document and the recognition result through the reranker model; filter out, according to the scoring results, the candidate citation documents whose scores do not meet a preset scoring threshold condition, so as to determine the initial citation documents; reorder the initial citation documents according to the scoring results, the issuing time, and the issuing organization of each initial citation document; and output the target citation document according to the corresponding reordering result.
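The threshold filtering and reordering performed by the output module can be sketched as follows. The reranker itself is not implemented here: the sketch assumes its scores are already available. The `Candidate` record, the `issuer_rank` authority table, and the key order (score first, then newer issuing time, then higher-ranked issuing organization) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Candidate:
    """A candidate citation document with its reranker score (hypothetical record)."""
    doc_id: str
    score: float      # semantic relevance score from the reranker
    issue_date: str   # ISO date string, e.g. "2023-01-01"
    issuer: str       # issuing organization


def filter_and_reorder(cands: List[Candidate], threshold: float,
                       issuer_rank: Dict[str, int]) -> List[Candidate]:
    """Drop candidates below the threshold, then sort by score, date, and issuer."""
    kept = [c for c in cands if c.score >= threshold]
    unknown = len(issuer_rank)  # issuers not in the table rank last
    # Python's sort is stable, so sorting by the least significant key first
    # and the most significant key last yields a multi-key ordering.
    kept.sort(key=lambda c: issuer_rank.get(c.issuer, unknown))
    kept.sort(key=lambda c: c.issue_date, reverse=True)
    kept.sort(key=lambda c: c.score, reverse=True)
    return kept
```

With real-valued reranker scores exact ties are rare, so a practical system might instead fold issuing time and authority into a weighted score; the tie-breaking form above is only the simplest reading of the reordering step.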
In some specific embodiments, the device may also be configured to output the context information of the target citation document when outputting the target citation document, wherein the context information comprises the context of the cited paragraph and any one or a combination of the title, the issuing organization, and the issuing time of the target citation document.
Further, the embodiment of the present application further discloses an electronic device, and fig. 5 is a block diagram of an electronic device 20 according to an exemplary embodiment, where the content of the figure is not to be considered as any limitation on the scope of use of the present application.
Fig. 5 is a schematic structural diagram of an electronic device 20 according to an embodiment of the present application. The electronic device 20 may include, in particular, at least one processor 21, at least one memory 22, a power supply 23, a communication interface 24, an input-output interface 25, and a communication bus 26. The memory 22 is configured to store a computer program, where the computer program is loaded and executed by the processor 21 to implement relevant steps in the large-model-based government document quotation retrieval method disclosed in any of the foregoing embodiments. In addition, the electronic device 20 in the present embodiment may be specifically an electronic computer.
In this embodiment, the power supply 23 is configured to provide working voltages for each hardware device on the electronic device 20, the communication interface 24 is capable of creating a data transmission channel with an external device for the electronic device 20, and the communication protocol to be followed is any communication protocol applicable to the technical solution of the present application, which is not specifically limited herein, and the input/output interface 25 is configured to obtain external input data or output data to the external device, and the specific interface type of the input/output interface may be selected according to the specific application needs and is not specifically limited herein.
The memory 22 may be a carrier for storing resources, such as a read-only memory, a random access memory, a magnetic disk, or an optical disk, and the resources stored thereon may include an operating system 221, a computer program 222, and the like, and the storage may be temporary storage or permanent storage.
The operating system 221 is used for managing and controlling the hardware devices on the electronic device 20 and the computer program 222, and may be Windows Server, NetWare, Unix, Linux, or the like. In addition to the computer program that can be used to perform the large-model-based government document quotation retrieval method executed by the electronic device 20 as disclosed in any of the foregoing embodiments, the computer program 222 may further include computer programs that can be used to perform other specific tasks.
Furthermore, the application also discloses a computer readable storage medium for storing a computer program, wherein the computer program is executed by a processor to realize the large-model-based government document quotation retrieval method. For specific steps of the method, reference may be made to the corresponding contents disclosed in the foregoing embodiments, and no further description is given here.
In this specification, the embodiments are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be understood by referring to one another. Because the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for relevant details, refer to the description of the method.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
Finally, it is further noted that relational terms such as first and second are used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The foregoing uses specific examples to illustrate the principles and embodiments of the present application. These examples are provided only to assist in understanding the principles and embodiments and are in no way limiting. Those of ordinary skill in the art will appreciate, in light of the above teachings, that the embodiments of the present application and their scope of application may be varied.