CN120296146A - Government document citation retrieval method, device, equipment and medium based on a large model

Info

Publication number
CN120296146A
Authority
CN
China
Prior art keywords
document
documents
knowledge base
government
citation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202510446020.XA
Other languages
Chinese (zh)
Inventor
马尚
李廷
韩同
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Cloud Information Technology Co Ltd
Original Assignee
Inspur Cloud Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Cloud Information Technology Co Ltd
Priority to CN202510446020.XA
Publication of CN120296146A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3347 Query execution using vector based model
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3335 Syntactic pre-processing, e.g. stopword elimination, stemming
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/338 Presentation of query results
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Library & Information Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present application discloses a method, device, equipment and medium for citation retrieval of government documents based on a large model, which relates to the field of artificial intelligence technology, including: obtaining government documents, determining the document slices corresponding to the government documents, and creating a document knowledge base and a vector knowledge base based on the document slices; using a target pre-trained natural language processing model to identify professional terms in the government documents to obtain corresponding recognition results; using a target information retrieval algorithm based on the document knowledge base and the vector knowledge base to perform keyword matching on the recognition results to obtain corresponding multiple candidate citation documents; using a reranker model to perform semantic relevance scoring on each of the candidate citation documents and the recognition results, reordering each of the candidate citation documents according to the corresponding scoring results, and outputting a target citation document according to the corresponding reordering results. Thus, the present application can improve the efficiency of citation retrieval.

Description

Government affair document quotation retrieval method, device, equipment and medium based on large model
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a government affair official document quotation retrieval method, device, equipment and medium based on a large model.
Background
Government documents are important tools of national governance. Their content spans politics, economics, law and other fields, and they typically contain many political nouns and strategic nouns of major significance. These terms must be understood accurately, and they often call for citing related laws, policies or strategic documents to strengthen the authority and reference value of the document. At present, citing related documents or regulations while drafting and reviewing government documents depends mainly on manual searching, which has the following defects:
1. Low efficiency. Manually searching for cited documents takes a great deal of time, and efficiency drops markedly when a document involves many complex nouns. Traditional retrieval systems rely mainly on keyword matching and cannot understand the deep semantics and context of political terms, so their results are imprecise and contain much irrelevant content.
2. Insufficient accuracy. Manual retrieval can lead to omissions or improper citations because of misunderstanding or gaps in knowledge. General-purpose retrieval systems lack an understanding of the terms and expressions specific to the government field and cannot accurately identify and extract key content such as political nouns and strategic expressions.
3. Low degree of intelligence. Although existing knowledge bases have accumulated a wealth of policies, regulations and strategic documents, their value is hard to realize fully without intelligent retrieval tools. Most existing systems retrieve passively, lack functions such as active recommendation and intelligent analysis, and leave a large amount of screening and judgment to people.
In recent years, large pre-trained language models (LPLMs) have demonstrated excellent capabilities in the field of natural language processing. These models have strong semantic understanding and generation capabilities and open new possibilities for the automatic analysis and processing of complex texts. However, applying them directly to government document citation retrieval still faces the following challenges:
1. Accuracy of noun identification. The political and strategic nouns in government documents are highly specialized and complex, and the model must identify them accurately.
2. Efficiency of knowledge base matching. When the knowledge base contains a large number of documents, efficiently retrieving the citation content related to a noun is a key problem.
3. Usability of the presented results. The citation content returned by the system must be clear and directly usable by the document writer.
4. Limitations of keyword-based full-text retrieval. Such systems use traditional information retrieval techniques such as inverted indexes, and their efficiency and accuracy are insufficient.
Therefore, how to efficiently retrieve political nouns and strategic nouns in government documents and provide accurate citation retrieval is a problem to be solved urgently.
Disclosure of Invention
In view of the above, the invention aims to provide a large-model-based government document quotation retrieval method, device, equipment and medium, which can be used for efficiently retrieving political nouns and strategic nouns in government document contents and providing accurate quotation retrieval, thereby improving the efficiency and quality of government document writing and auditing. The specific scheme is as follows:
In a first aspect, the application discloses a government affair document quotation retrieval method based on a large model, which comprises the following steps:
acquiring government affair documents, determining document slices corresponding to the government affair documents, and creating a document knowledge base and a vector knowledge base based on the document slices;
identifying the technical terms in the government affairs document by utilizing a target pre-training natural language processing model so as to acquire corresponding identification results;
performing keyword matching on the recognition result by utilizing a target information retrieval algorithm based on the document knowledge base and the vector knowledge base so as to obtain a plurality of corresponding candidate citation documents;
and scoring the semantic relevance between each candidate citation document and the identification result through a reranker model, reordering the candidate citation documents according to the corresponding scoring results, and outputting a target citation document according to the corresponding reordering result.
Optionally, the determining the document slice corresponding to the government document, and creating a document knowledge base and a vector knowledge base based on the document slice includes:
dividing the government affair document text by using the LangChain framework to determine the document slices corresponding to the government affair documents;
constructing an index based on the document slices through Elasticsearch to create the document knowledge base;
vectorizing the document slices by using a deep learning model based on bidirectional generative encoding to obtain first processed document slices;
and constructing the vector knowledge base based on the first processed document slice.
Optionally, before the identifying the technical terms in the government affairs document by using the target pre-training natural language processing model, the method further includes:
training an initial pre-trained natural language processing model by using a target government affair document corpus and a vocabulary specific to the government affair document field to obtain a pre-trained natural language processing model, wherein the vocabulary comprises political nouns, strategic nouns and policy terms;
supplementing the nouns in the vocabulary by using an external database to obtain a supplementary result, wherein the external database comprises government files, white papers and policy documents;
training the pre-trained natural language processing model by using the supplementary result to obtain the target pre-trained natural language processing model.
Optionally, the keyword matching is performed on the recognition result by using a target information retrieval algorithm based on the document knowledge base and the vector knowledge base to obtain a plurality of candidate citation documents, including:
Performing word segmentation, stop word removal and keyword extraction operations on the document slices in the document knowledge base to obtain second processed document slices;
constructing a global vocabulary based on each second processed document slice, wherein the global vocabulary is a vocabulary containing unique words or phrases appearing in all second processed document slices;
vectorizing the second processed document slice based on the global vocabulary table to obtain a corresponding sparse vector, and storing the sparse vector into the document knowledge base;
Calculating a relevance score between the identification result and the keywords corresponding to each document slice in the document knowledge base based on a target information retrieval algorithm;
screening initial documents from the document knowledge base according to the relevance scores;
And carrying out keyword matching on the recognition result based on the vector knowledge base and the initial document so as to obtain a plurality of corresponding candidate citation documents.
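The sparse retrieval steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the application's implementation: the source does not name the "target information retrieval algorithm", so BM25 is assumed here; whitespace splitting stands in for real word segmentation; and the stop-word list, the parameters `k1` and `b`, and all function names are hypothetical.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "of", "and", "a", "to", "in"}  # hypothetical stop-word list

def preprocess(text):
    """Segment, lowercase and drop stop words (whitespace split is a stand-in)."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]

def build_vocabulary(token_lists):
    """Global vocabulary: every unique term across all slices gets one index."""
    vocab = {}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab

def to_sparse(tokens, vocab):
    """Sparse vector stored as {term_index: term_frequency}."""
    return dict(Counter(vocab[t] for t in tokens if t in vocab))

def bm25_scores(query_tokens, sparse_docs, vocab, k1=1.5, b=0.75):
    """Assumed scoring function: BM25 relevance of each slice to the query."""
    n = len(sparse_docs)
    if n == 0:
        return []
    lengths = [sum(d.values()) for d in sparse_docs]
    avg_len = sum(lengths) / n
    scores = [0.0] * n
    for term in set(query_tokens):
        idx = vocab.get(term)
        if idx is None:
            continue  # term never appears in the knowledge base
        df = sum(1 for d in sparse_docs if idx in d)
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1.0)
        for i, d in enumerate(sparse_docs):
            tf = d.get(idx, 0)
            if tf:
                norm = tf + k1 * (1 - b + b * lengths[i] / avg_len)
                scores[i] += idf * tf * (k1 + 1) / norm
    return scores

def screen_initial(scores, top_k=2):
    """Keep the top_k slices that have a positive relevance score."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [i for i in ranked[:top_k] if scores[i] > 0]

slices = [
    "rural revitalization strategy promotes agricultural modernization",
    "the digital economy development plan",
    "implementation of the rural revitalization strategy in counties",
]
tokenized = [preprocess(s) for s in slices]
vocab = build_vocabulary(tokenized)
sparse_docs = [to_sparse(t, vocab) for t in tokenized]
scores = bm25_scores(preprocess("rural revitalization strategy"), sparse_docs, vocab)
initial = screen_initial(scores)
```

In this toy run the shorter matching slice ranks first because BM25 normalizes by slice length, and the unrelated slice is screened out with a score of zero.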
Optionally, the keyword matching is performed on the recognition result based on the vector knowledge base and the initial document to obtain a plurality of candidate citation documents, including:
performing word segmentation, stop word removal and keyword extraction operations on the identification result to obtain a processed identification result;
determining a target vector corresponding to the processed recognition result;
retrieving a target document slice in the vector knowledge base that is similar to the target vector;
Determining a similarity score between the target document slice and the target vector;
Determining a composite score for each of the initial documents based on the similarity score and the relevance score;
and re-ordering the initial documents according to the comprehensive scores to determine a plurality of candidate citation documents.
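One possible way to form the composite score is sketched below. The source does not say how the similarity score and the relevance score are combined, so the min-max normalization, the weighted sum and the weight `alpha` are all assumptions, as are the function names and the sample scores.

```python
def min_max(values):
    """Rescale scores to [0, 1] so the two score types are comparable."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def composite_rank(doc_ids, relevance, similarity, alpha=0.5, top_n=3):
    """Weighted sum of normalized keyword relevance and vector similarity."""
    rel = min_max([relevance[d] for d in doc_ids])
    sim = min_max([similarity.get(d, 0.0) for d in doc_ids])
    combined = {d: alpha * r + (1 - alpha) * s for d, r, s in zip(doc_ids, rel, sim)}
    ranked = sorted(doc_ids, key=lambda d: combined[d], reverse=True)
    return [(d, round(combined[d], 4)) for d in ranked[:top_n]]

# Hypothetical scores for three initial documents.
initial_docs = ["doc_a", "doc_b", "doc_c"]
relevance = {"doc_a": 4.2, "doc_b": 2.1, "doc_c": 3.0}      # keyword relevance
similarity = {"doc_a": 0.62, "doc_b": 0.91, "doc_c": 0.40}  # vector similarity
candidates = composite_rank(initial_docs, relevance, similarity, alpha=0.5, top_n=2)
```

Normalizing first keeps an unbounded keyword score from dominating a similarity score that lies in a narrow range; a different fusion rule, such as reciprocal rank fusion, would fit the same step.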
Optionally, the scoring of the semantic relevance between each candidate citation document and the identification result through a reranker model, reordering the candidate citation documents according to the corresponding scoring results, and outputting a target citation document according to the corresponding reordering result includes:
scoring the semantic relevance between each candidate citation document and the identification result through the reranker model;
filtering out, according to the scoring results, the candidate citation documents whose scoring results do not meet a preset scoring threshold condition, to determine the initial citation documents;
and reordering the initial citation documents according to the scoring results and the release time and issuing agency of each initial citation document, and outputting the target citation document according to the corresponding reordering result.
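The filtering and reordering steps above might look like the following sketch. The reranker scores are supplied as plain numbers rather than computed by a model, and the way score, release time and issuing agency are combined is an assumption: scores are bucketed to one decimal place so that near-ties are broken first by recency and then by a hypothetical agency ranking (`ISSUER_RANK`). The threshold value is also illustrative.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Citation:
    title: str
    score: float      # semantic relevance score from the reranker (given here)
    published: date   # release time
    issuer: str       # issuing agency

# Hypothetical authority ranking; a lower number means higher authority.
ISSUER_RANK = {"State Council": 0, "Provincial Government": 1}

def select_citations(candidates, threshold=0.5):
    """Drop low-scoring candidates, then sort by score bucket, recency, issuer."""
    kept = [c for c in candidates if c.score >= threshold]
    return sorted(
        kept,
        key=lambda c: (
            -round(c.score, 1),               # higher score bucket first
            -c.published.toordinal(),         # newer documents first
            ISSUER_RANK.get(c.issuer, 99),    # more authoritative issuer first
        ),
    )

candidates = [
    Citation("A", 0.92, date(2021, 3, 1), "Provincial Government"),
    Citation("B", 0.88, date(2023, 6, 15), "State Council"),
    Citation("C", 0.31, date(2024, 1, 10), "State Council"),
    Citation("D", 0.90, date(2022, 1, 1), "State Council"),
    Citation("E", 0.71, date(2024, 5, 1), "Provincial Government"),
]
result = select_citations(candidates)
```

Here candidate C is removed by the threshold, A, B and D share a score bucket and are ordered by release time, and E follows in a lower bucket.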
Optionally, the method further comprises:
and when outputting the target citation document, outputting context information of the target citation document, wherein the context information comprises the context of the cited paragraph and any one or a combination of the title, issuing agency and release time of the target citation document.
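A small sketch of what such an output record could contain is given below. The field names, the `window` parameter and the sample paragraphs are hypothetical; the source only states which kinds of context information may be output.

```python
def citation_with_context(paragraphs, index, title, issuer, published, window=1):
    """Bundle the cited paragraph with its neighbors and document metadata."""
    start = max(0, index - window)
    end = min(len(paragraphs), index + window + 1)
    return {
        "title": title,
        "issuer": issuer,
        "published": published,
        "cited_paragraph": paragraphs[index],
        "context_before": paragraphs[start:index],
        "context_after": paragraphs[index + 1:end],
    }

paragraphs = ["Preamble.", "Article 1 text.", "Article 2 text.", "Closing."]
record = citation_with_context(
    paragraphs, index=1, title="Sample Policy", issuer="Sample Agency",
    published="2024-01-01",
)
```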
In a second aspect, the application discloses a government affair document quotation retrieval device based on a large model, which comprises:
the knowledge base construction module is used for acquiring government affair documents, determining document slices corresponding to the government affair documents, and creating a document knowledge base and a vector knowledge base based on the document slices;
the recognition module is used for identifying the technical terms in the government affair documents by utilizing the target pre-trained natural language processing model so as to obtain corresponding identification results;
the keyword matching module is used for performing keyword matching on the identification result by utilizing a target information retrieval algorithm based on the document knowledge base and the vector knowledge base so as to obtain a plurality of corresponding candidate citation documents;
and the citation document output module is used for scoring the semantic relevance between each candidate citation document and the identification result through the reranker model, reordering the candidate citation documents according to the corresponding scoring results, and outputting the target citation document according to the corresponding reordering result.
In a third aspect, the present application discloses an electronic device, comprising:
a memory for storing a computer program;
and the processor is used for executing a computer program to realize the government affair document quotation retrieval method based on the large model.
In a fourth aspect, the present application discloses a computer readable storage medium, on which a computer program is stored, which when executed by a processor implements a large model based government document quotation retrieval method as described above.
In the application, government affair documents are first acquired, the corresponding document slices are determined, and a document knowledge base and a vector knowledge base are created from the slices. A target pre-trained natural language processing model then identifies the technical terms in the government affair documents to obtain the corresponding identification results. Next, based on the document knowledge base and the vector knowledge base, a target information retrieval algorithm performs keyword matching on the identification results to obtain a plurality of candidate citation documents. Finally, a reranker model scores the semantic relevance between each candidate citation document and the identification results, the candidates are reordered according to the scores, and the target citation document is output according to the reordering result. The application can therefore accurately identify proper nouns in the government field with a natural language processing model, retrieve the documents and paragraphs relevant to the identified nouns from the knowledge bases by combining semantic matching with an efficient indexing mechanism, and finally screen and sort the retrieval results to recommend the most relevant documents and paragraphs. This improves the efficiency and accuracy of citation retrieval and allows related content to be located quickly. The semantic understanding capability of the large model also improves the accuracy of noun recognition. The method can be widely applied to working scenarios such as official document writing, policy research and document review in government departments at all levels, and provides technical support for improving the efficiency and accuracy of government work.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present invention, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a government official document quotation retrieval method based on a large model, which is disclosed by the application;
FIG. 2 is a schematic diagram of a specific large-model-based government document quotation retrieval method;
FIG. 3 is a flow chart of a quotation screening method of the present disclosure;
FIG. 4 is a schematic diagram of a government document quotation retrieval device based on a large model;
FIG. 5 is a block diagram of an electronic device according to the present disclosure.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
As described in the Background, citing related documents or regulations while drafting and reviewing government documents currently depends mainly on manual searching, which suffers from low efficiency, insufficient accuracy and a low degree of intelligence: manual searches are slow, keyword-based retrieval systems cannot understand the deep semantics of political terms, and most existing systems retrieve passively and leave extensive screening and judgment to people. To solve these technical problems, the application discloses a large-model-based government affair document citation retrieval method that efficiently retrieves political nouns and strategic nouns in government document content and provides accurate citation retrieval, thereby improving the efficiency and quality of government document writing and review.
Referring to fig. 1, the embodiment of the invention discloses a government affair document quotation retrieval method based on a large model, which comprises the following steps:
Step S11: acquiring government affair documents, determining document slices corresponding to the government affair documents, and creating a document knowledge base and a vector knowledge base based on the document slices.
In this embodiment, a government document draft uploaded by a user is first received; both text and document formats are supported. The LangChain framework is used to split the government document text into a plurality of small blocks (that is, document slices). Slicing ensures that each part of the document can be processed independently and improves the efficiency of subsequent retrieval and matching. An index over the sliced content is built with Elasticsearch (ES) to create the document knowledge base, which allows related information in the documents to be found quickly and provides efficient storage and query support for subsequent citation retrieval. The document slices are vectorized with a BGE (Bidirectional Generative Encoder) model, which captures the semantic information of the document content and generates high-dimensional vector representations suited to semantic matching and retrieval. A vector index of the document slices is then built with Faiss (Facebook AI Similarity Search) to form the vector knowledge base; Faiss supports efficient approximate nearest neighbor queries, which improves the speed and accuracy of large-scale document retrieval. In other words, the government document text is split with the LangChain framework to determine the document slices; an index is built from the slices with Elasticsearch to create the document knowledge base; the slices are vectorized with a deep learning model based on bidirectional generative encoding to obtain first processed document slices; and the vector knowledge base is built from the first processed document slices.
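The slicing step can be illustrated with a dependency-free sketch. It is a stand-in for a LangChain text splitter rather than a call into LangChain itself, and the fixed character window, the `chunk_size` and `overlap` values and the record layout are all assumptions; the source does not specify how slice boundaries are chosen.

```python
def slice_document(text, chunk_size=200, overlap=40):
    """Split text into overlapping character windows (document slices)."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    slices = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        slices.append({"slice_id": len(slices), "start": start, "text": chunk})
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return slices

# Synthetic 500-character text so the slice boundaries are easy to check.
text = "".join(str(i % 10) for i in range(500))
doc_slices = slice_document(text, chunk_size=200, overlap=40)
```

The overlap means a term that straddles a slice boundary still appears whole in at least one slice, at the cost of some duplicated text in the indexes.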
Specifically, as shown in FIG. 2, the two knowledge bases serve the retrieval module in complementary ways. The document knowledge base provides fast retrieval in the word-segmentation dimension: the full-text retrieval capability of Elasticsearch quickly locates the document slices relevant to a user query, screening out potentially relevant fragments and providing a candidate set for subsequent precise matching. The vector knowledge base provides retrieval based on semantic vectors: the BGE vectors of the document slices are indexed with Faiss, and efficient approximate nearest neighbor (ANN) queries evaluate the similarity between a document slice and the user query at the semantic level, giving a more accurate matching result. In summary, document slicing is implemented with the LangChain framework, a sparse index is built with Elasticsearch (ES), and the BGE model vectorizes the slices to build a Faiss vector index. Through this dual index mechanism, the system can dynamically manage and expand the knowledge base, meet retrieval requirements of different scales and complexity, and provide a flexible and efficient solution for government document management.
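The dual index idea can be sketched in memory as follows. This does not use Elasticsearch, Faiss or a BGE model: a dictionary-based inverted index stands in for the sparse index, a brute-force cosine scan stands in for the vector index, and `toy_embed` is a hypothetical three-dimensional keyword counter used only so the example runs without a model.

```python
import math
from collections import defaultdict

def cosine(a, b):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class DualIndex:
    """Keeps a sparse (term -> slice ids) and a dense (slice id -> vector) index."""

    def __init__(self, embed):
        self.embed = embed                 # injected embedding function
        self.inverted = defaultdict(set)   # sparse index
        self.vectors = {}                  # dense index

    def add_slice(self, slice_id, text):
        """Incremental update: both indexes are extended in one call."""
        for term in set(text.lower().split()):
            self.inverted[term].add(slice_id)
        self.vectors[slice_id] = self.embed(text)

    def keyword_lookup(self, query):
        hits = set()
        for term in query.lower().split():
            hits |= self.inverted.get(term, set())
        return hits

    def vector_lookup(self, query, k=2):
        q = self.embed(query)
        scored = sorted(((cosine(q, v), sid) for sid, v in self.vectors.items()), reverse=True)
        return [(sid, round(s, 3)) for s, sid in scored[:k]]

AXES = ("rural", "digital", "strategy")  # hypothetical embedding dimensions

def toy_embed(text):
    tokens = text.lower().split()
    return [float(tokens.count(a)) for a in AXES]

index = DualIndex(toy_embed)
index.add_slice(0, "rural revitalization strategy")
index.add_slice(1, "digital economy plan")
index.add_slice(2, "digital rural strategy")  # added later, no rebuild needed
keyword_hits = index.keyword_lookup("rural")
semantic_hits = index.vector_lookup("rural strategy", k=2)
```

Because each slice is written to both structures in `add_slice`, new documents become searchable by keyword and by vector at the same time, which is the property the dual index mechanism relies on.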
In general, the application represents documents and queries as high-dimensional semantic vectors using a deep learning model (such as the BGE model), which captures deep semantic associations. First, the BGE model generates a high-dimensional semantic vector for each document in the knowledge base and for the query sentence. These vectors represent the overall semantic features of the text and are not limited to keyword-level information. Next, the similarity between the query vector and each document vector is calculated with cosine similarity or Euclidean distance to measure their degree of semantic relevance. In addition, the system uses a Faiss-based vector index to accelerate vector retrieval through cluster-based partitioning and an efficient approximate nearest neighbor (ANN) algorithm: the vector space is divided into multiple subspaces, and a quantization technique in each subspace encodes every vector into a short code. During retrieval, nearest neighbors are computed approximately by searching the short codes, which significantly reduces the amount of computation, speeds up retrieval, and supports real-time, accurate retrieval over a large-scale knowledge base.
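The short-code idea described above corresponds to product quantization, which can be sketched without any library as follows. The codebooks here are written by hand purely for illustration; in practice they would be learned from data (for example with k-means), and Faiss provides its own implementations that this sketch does not call. All names and values are assumptions.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def split(vec, m):
    """Divide a vector into m equal-sized sub-vectors (one per subspace)."""
    size = len(vec) // m
    return [vec[i * size:(i + 1) * size] for i in range(m)]

def encode(vec, codebooks):
    """Short code: for each subspace, the index of the nearest centroid."""
    code = []
    for sub, centroids in zip(split(vec, len(codebooks)), codebooks):
        nearest = min(range(len(centroids)), key=lambda j: sq_dist(sub, centroids[j]))
        code.append(nearest)
    return tuple(code)

def approx_sq_dist(query, code, codebooks):
    """Approximate distance: compare the raw query with the coded centroids."""
    subs = split(query, len(codebooks))
    return sum(sq_dist(sub, cb[j]) for sub, cb, j in zip(subs, codebooks, code))

# Two subspaces of dimension 2, each with two hand-written centroids.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
database = {
    "slice_v": [0.9, 1.1, 0.1, 0.8],
    "slice_w": [0.1, 0.0, 0.9, 0.2],
}
codes = {name: encode(vec, codebooks) for name, vec in database.items()}

query = [1.0, 1.0, 0.0, 1.0]
nearest = min(codes, key=lambda name: approx_sq_dist(query, codes[name], codebooks))
```

Each four-dimensional vector is stored as just two small integers, and the search compares the query against centroids instead of the original vectors, which is where the reduction in computation and memory comes from.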
Step S12: identifying the technical terms in the government affair document by utilizing a target pre-trained natural language processing model so as to acquire a corresponding identification result.
In this embodiment, before the technical terms in the government affair document are identified with the target pre-trained natural language processing model, the method further includes: training an initial pre-trained natural language processing model with a target government document corpus and a vocabulary specific to the government document field to obtain a pre-trained natural language processing model, where the vocabulary includes political nouns, strategic nouns and policy terms; supplementing the nouns in the vocabulary from an external database to obtain a supplementary result, where the external database includes government files, white papers and policy documents; and training the pre-trained natural language processing model with the supplementary result to obtain the target pre-trained natural language processing model. That is, the application uses a BERT (Bidirectional Encoder Representations from Transformers) model for semantic analysis, dedicated to identifying political nouns, strategic nouns and related terms in government documents. A large-scale pre-trained BERT model serves as the base model and is fine-tuned on a domain-specific corpus of government documents so that it better understands the technical terms these documents involve, such as political and strategic terms. Fine-tuning adjusts the model parameters using labeled data (for example, documents annotated with labels such as political noun or strategic term) to suit the particular context of government documents. A vocabulary specific to the government document field is also introduced, containing political nouns, strategic nouns, policy terms and the like. The vocabulary covers not only common terms but also relatively specialized and rarely used political and economic terms. Introducing it effectively improves the accuracy of noun recognition and helps the BERT model recognize special terms that standard models find difficult to capture.
In order to enhance the recognition capability of the model on the newly appeared nouns or terms, a dynamic expansion mechanism is designed. The mechanism can automatically expand the terms in the vocabulary by analyzing newly uploaded documents or data fed back by users in real time. The expansion process ensures that new political or strategic nouns can be identified in time based on the matching of new words in the data to the context. Specifically, nouns are supplemented with an external knowledge base (e.g., government documents, white papers, policy documents, etc.). Based on the prediction result of the BERT model, the unrecognized nouns are marked and fed back to the training set for retraining so as to enhance the recognition capability of the model.
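The expansion step described above can be sketched as follows. This is a minimal illustration, not the application's implementation: the function name `expand_vocabulary`, the frequency cutoff `min_count` and the sample terms are all assumptions introduced here, and candidate terms are assumed to have already been extracted from new documents or user feedback.

```python
from collections import Counter

def expand_vocabulary(vocabulary, candidate_terms, min_count=2):
    """Return (updated vocabulary, newly added terms).

    candidate_terms: terms seen in newly uploaded documents or user feedback
    that the recognizer did not label. min_count is an assumed cutoff that
    keeps one-off noise out of the vocabulary; it is not from the source.
    """
    # Count only candidates that are not already in the vocabulary.
    counts = Counter(t for t in candidate_terms if t not in vocabulary)
    new_terms = sorted(t for t, c in counts.items() if c >= min_count)
    return set(vocabulary) | set(new_terms), new_terms

# Illustrative data only.
vocab = {"rural revitalization", "digital government"}
candidates = ["new quality productive forces", "digital government",
              "new quality productive forces", "typo term"]
vocab, added = expand_vocabulary(vocab, candidates)
```

In a full system the returned `added` terms would also be labeled and fed back into the training set for retraining, as the paragraph above describes.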
In addition to identifying political and strategic nouns, the names of related persons, places, institutions and the like are identified in the document using named entity recognition (NER) techniques. This process relies on the multi-level feature extraction capability of BERT to improve the model's understanding of complex document content. Term recognition is not limited to word-level matching; it also uses the bidirectional semantic understanding of the BERT model to take context into account. By analyzing semantic information at the sentence level, the paragraph level and even the full-text level, the system ensures that terms are identified accurately and in line with the overall context of the document.
Finally, the technical terms in the government document are identified using the target pre-trained natural language processing model to obtain the corresponding recognition result. To further enhance system performance, the application also applies a comprehensive optimization mechanism in several respects. First, a dynamic query expansion technique expands the user's query content based on semantic embeddings and a domain knowledge graph, which improves retrieval coverage. Second, an efficient incremental update strategy for the knowledge base ensures that newly added or modified documents are brought into the retrieval range in a timely manner. In addition, to cope with large-scale data, an index compression mechanism, a caching mechanism and a distributed retrieval architecture are adopted to reduce the response time of the system. Together, these measures give the system high dynamic adaptability and large-scale retrieval capability.
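As a rough illustration of the query expansion idea, the sketch below adds related terms from a small lookup table to the query. The table `related_terms` stands in for the domain knowledge graph and semantic embeddings mentioned above; its contents, the function name `expand_query` and the `max_extra` limit are assumptions made for this example.

```python
def expand_query(terms, related, max_extra=3):
    """Append up to max_extra related terms per query term, without duplicates."""
    expanded = list(terms)
    for term in terms:
        for extra in related.get(term, [])[:max_extra]:
            if extra not in expanded:
                expanded.append(extra)
    return expanded

# Hypothetical stand-in for a domain knowledge graph lookup.
related_terms = {
    "digital government": ["e-government", "government informatization"],
}
expanded = expand_query(["digital government", "policy"], related_terms)
```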
Step S13: perform keyword matching on the recognition result using a target information retrieval algorithm, based on the document knowledge base and the vector knowledge base, to obtain a plurality of corresponding candidate citation documents.
In this embodiment, the candidate citation documents are determined as follows. Word segmentation, stop word removal and keyword extraction are performed on the document slices in the document knowledge base to obtain second processed document slices. A global vocabulary is constructed from the second processed document slices; it contains the unique words or phrases appearing in all of them. The second processed document slices are vectorized against the global vocabulary to obtain corresponding sparse vectors, which are stored in the document knowledge base. Relevance scores between the recognition result and the keywords of each document slice in the document knowledge base are calculated with the target information retrieval algorithm, and initial documents are screened from the document knowledge base according to these relevance scores. Word segmentation, stop word removal and keyword extraction are then performed on the recognition result to obtain a processed recognition result, the target vector corresponding to the processed recognition result is determined, target document slices similar to the target vector are retrieved from the vector knowledge base, and a similarity score between each target document slice and the target vector is determined. A composite score for each initial document is determined from the similarity score and the relevance score. That is, the application first preprocesses the documents in the knowledge base, including word segmentation, stop word removal and keyword extraction, and constructs a bag-of-words model for each document. The BM25 (Best Matching 25) algorithm then evaluates the relevance between a query and a document by calculating the degree of matching between the query keywords and the document keywords.
This process takes into account term frequency (TF), inverse document frequency (IDF) and the distribution of terms in the document, and efficiently captures explicit semantic associations between queries and documents. The candidate documents are then reordered according to the composite score to generate the final retrieval result. This ensures that the retrieval result reflects both explicit semantic association and semantic similarity, which improves retrieval accuracy and the user experience.
Step S14: score the semantic relevance between each candidate citation document and the recognition result with a reranker model, reorder the candidate citation documents according to the corresponding scoring results, and output a target citation document according to the corresponding reordering result.
In this embodiment, the semantic relevance between each candidate citation document and the recognition result is scored by the reranker model; candidate citation documents whose scores do not meet a preset score threshold condition are filtered out to determine the initial citation documents; the initial citation documents are reordered according to the scoring result, the issuing time and the issuing organization of each initial citation document; and the target citation document is output according to the reordering result. These operations are mainly performed by the citation recommendation module. The citation recommendation module is the component of the system that generates high-quality recommendation results; its core functions are scoring the relevance of the retrieval results and performing the final reordering, so that the user obtains the best citation content. The module combines the noun recognition results with deep learning techniques and produces accurate citation recommendations through a multi-stage processing flow, shown in fig. 3. First, the system extracts the set of retrieval results related to the nouns, according to the political and strategic nouns output by the noun recognition module. These retrieval results are the documents and paragraphs semantically related to the nouns that were initially screened by the sparse retrieval and vector retrieval modules. To further improve the accuracy of the recommendations, a reranker model is introduced as a secondary screening tool. In the re-scoring stage, the system analyzes the candidate documents and paragraphs in depth.
Specifically, the reranker model is based on a pre-trained language model such as BERT and uses contextual semantics to score the relevance of the retrieval results at fine granularity. The model considers not only the direct match between the query and the document but also the deeper semantic association between the document context and the query intent. Taking the query content and the paragraph text of each candidate document as input, the reranker model generates a relevance score for each document and filters out low-scoring documents that do not meet the requirement. The system then reorders the candidate results according to the relevance scores of the reranker model. During reordering, the system considers not only the degree of semantic matching but also the authority and timeliness of each document. For example, for government documents, the system preferentially recommends documents published by an authoritative body and uses the publication time to offer users more current reference sources. To make the ranking fair and efficient, the system uses a weighted ranking algorithm that combines these scoring indicators to assign each document its final ranking position.
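The threshold filtering and weighted ranking described above might look like the following sketch. The source does not give the weights, the threshold or the way authority and timeliness are quantified, so `weights`, `threshold`, `horizon_days`, the `authority` field and the linear freshness decay are all assumptions chosen for illustration; the `relevance` field stands for the score a reranker model would produce.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Candidate:
    doc_id: str
    relevance: float   # reranker score, assumed to lie in [0, 1]
    authority: float   # assumed weight of the issuing organization, in [0, 1]
    published: date    # issuing time

def rerank(cands, today, threshold=0.5, weights=(0.6, 0.25, 0.15), horizon_days=3650):
    """Drop candidates below the threshold, then sort by a weighted score."""
    w_rel, w_auth, w_time = weights
    kept = [c for c in cands if c.relevance >= threshold]

    def final_score(c):
        age_days = max((today - c.published).days, 0)
        # Assumed linear decay: 1.0 for a document issued today, 0.0 at the horizon.
        freshness = max(0.0, 1.0 - age_days / horizon_days)
        return w_rel * c.relevance + w_auth * c.authority + w_time * freshness

    return sorted(kept, key=final_score, reverse=True)

cands = [
    Candidate("a", relevance=0.90, authority=0.5, published=date(2015, 1, 1)),
    Candidate("b", relevance=0.85, authority=1.0, published=date(2024, 1, 1)),
    Candidate("c", relevance=0.30, authority=1.0, published=date(2025, 1, 1)),
]
ranked = rerank(cands, today=date(2025, 1, 1))
```

In this toy data, candidate "c" is removed by the threshold, and "b" outranks "a" despite a slightly lower relevance score because it is more recent and comes from a more authoritative issuer.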
Finally, when the target citation document is output, its context information is output as well. The context information includes the context of the cited paragraph and any one or a combination of the title, issuing organization and issuing time of the target citation document. Specifically, when the recommendation results are output, the system provides detailed context information for each citation. The context information includes the text surrounding the cited paragraph and the metadata of the document (for example, title, issuing organization and issuing time), so that the user fully understands the source of the citation. In addition, the system lets the user further filter and customize the recommended content, for example by refining the ranking of results by time, source or keyword. In this way, the citation recommendation module ensures that the recommended content is highly relevant and authoritative and provides reliable citation support for users. The retrieval results are re-scored with a BERT classification model and the reranker technique, and reordered by combining semantic matching degree with document authority. By providing the context and detailed metadata of the cited content, the system improves the authority and practicality of the recommended content and helps users complete citation screening and citing efficiently.
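One possible shape for the output record is sketched below: the cited paragraph, its neighboring paragraphs and the document metadata. The class name `CitationResult`, the field names, the `window` parameter and the sample values are hypothetical and serve only to make the description concrete.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class CitationResult:
    title: str
    issuer: str          # issuing organization
    published: str       # issuing time
    cited: str           # the cited paragraph itself
    before: List[str]    # paragraphs preceding the cited one
    after: List[str]     # paragraphs following the cited one

def build_result(meta, paragraphs, hit_index, window=1):
    """Attach up to `window` neighboring paragraphs on each side of the hit."""
    start = max(0, hit_index - window)
    end = min(len(paragraphs), hit_index + window + 1)
    return CitationResult(
        title=meta["title"], issuer=meta["issuer"], published=meta["published"],
        cited=paragraphs[hit_index],
        before=paragraphs[start:hit_index],
        after=paragraphs[hit_index + 1:end],
    )

# Placeholder data for illustration.
meta = {"title": "Example notice", "issuer": "Example office", "published": "2024-01-01"}
paragraphs = ["Background.", "Goals.", "Cited measure.", "Implementation."]
result = build_result(meta, paragraphs, hit_index=2)
```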
In summary, the method first obtains government documents, determines the document slices corresponding to the government documents, and creates a document knowledge base and a vector knowledge base based on the document slices. It identifies the technical terms in the government document using a target pre-trained natural language processing model to obtain a corresponding recognition result. It then performs keyword matching on the recognition result using a target information retrieval algorithm, based on the document knowledge base and the vector knowledge base, to obtain a plurality of candidate citation documents. Finally, it scores the semantic relevance between each candidate citation document and the recognition result with a reranker model, reorders the candidate citation documents according to the scoring results, and outputs the target citation document according to the reordering result. The application can thus identify proper nouns in the government domain with a natural language processing model, retrieve the documents and paragraphs relevant to the identified nouns from the knowledge base by combining semantic matching with an efficient indexing mechanism, and screen and rank the retrieval results to recommend the most relevant documents and paragraphs. This improves the efficiency and accuracy of citation retrieval and allows relevant content to be located quickly. The semantic understanding capability of the large model also improves the accuracy of noun recognition. The method is applicable to working scenarios such as official document drafting, policy research and file checking in government departments at all levels, and provides technical support for improving the efficiency and accuracy of government work.
Based on the above embodiment, the application performs keyword matching on the recognition result using a target information retrieval algorithm, based on the document knowledge base and the vector knowledge base, to obtain a plurality of candidate citation documents. The screening process for candidate citation documents is described in detail below.
The sparse retrieval module uses a traditional bag-of-words model and the BM25 algorithm for the preliminary screening of documents. First, the documents in the knowledge base are preprocessed (word segmentation, stop word removal and keyword extraction) and a bag-of-words model is built for each document. The BM25 algorithm then evaluates the relevance between the query and each document by calculating the degree of matching between the query keywords and the document keywords. This process takes into account term frequency (TF), inverse document frequency (IDF) and the distribution of terms in the document, and efficiently captures explicit semantic associations between the query and the document. By screening out the higher-scoring documents, the sparse retrieval module provides a fast and efficient candidate document set for subsequent processing. The detailed steps are as follows:
1.1, Document preprocessing:
Word segmentation: each document in the knowledge base is segmented into words or phrases. This step follows the segmentation rules of the language, for example jieba for Chinese and the NLTK toolkit for English.
Stop word removal: common stop words (such as "yes" and the like) are removed from the document. These words carry no important semantic information, so removing them reduces noise and improves retrieval efficiency.
Keyword extraction: keywords are extracted from the document for the subsequent construction of the bag-of-words model. Keyword extraction can be implemented with algorithms such as TF-IDF and TextRank, so that the extracted keywords are informative and representative.
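The three preprocessing steps can be sketched with the standard library alone. This is a simplified stand-in: a regular-expression tokenizer for English text replaces jieba or NLTK, the stop word list `STOP_WORDS` is a toy list, and keyword extraction uses a smoothed TF-IDF variant chosen for this example rather than any formula given in the source.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "of", "and", "a", "is", "to", "in"}  # assumed toy list

def tokenize(text):
    # Simplified English tokenizer; a real system would use jieba or NLTK.
    return re.findall(r"[a-z0-9]+", text.lower())

def preprocess(text):
    """Word segmentation followed by stop word removal."""
    return [t for t in tokenize(text) if t not in STOP_WORDS]

def top_keywords(docs_tokens, doc_index, k=3):
    """Rank the terms of one document by a smoothed TF-IDF score."""
    n_docs = len(docs_tokens)
    df = Counter()
    for tokens in docs_tokens:
        df.update(set(tokens))
    tokens = docs_tokens[doc_index]
    tf = Counter(tokens)
    scores = {
        t: (c / len(tokens)) * (math.log((1 + n_docs) / (1 + df[t])) + 1.0)
        for t, c in tf.items()
    }
    # Sort by descending score, then alphabetically for a stable order.
    return [t for t, _ in sorted(scores.items(), key=lambda p: (-p[1], p[0]))[:k]]

docs = [
    "The policy of rural revitalization is a national strategy.",
    "The digital government policy is in the plan.",
]
docs_tokens = [preprocess(d) for d in docs]
keywords = top_keywords(docs_tokens, doc_index=0, k=2)
```

Because "policy" occurs in both toy documents, it receives a lower TF-IDF score than terms unique to the first document and is not selected as a keyword.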
1.2, Bag-of-words model construction:
Vocabulary construction: a global vocabulary is built from the preprocessed document set. The vocabulary contains the unique words or phrases that appear across all documents.
Document vectorization: each document is represented as a vector in which each dimension corresponds to a word or phrase in the vocabulary and holds the term frequency (TF) or TF-IDF value of that word or phrase in the document. In this way, each document is converted into a sparse vector, which is stored in the document knowledge base.
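A minimal sketch of vocabulary construction and sparse vectorization follows. It stores raw term frequencies and represents each sparse vector as a dictionary from vocabulary index to count; this storage layout and the function names are assumptions for illustration.

```python
from collections import Counter

def build_vocabulary(docs_tokens):
    """Map each unique term across all documents to a vector dimension."""
    terms = sorted({t for tokens in docs_tokens for t in tokens})
    return {term: idx for idx, term in enumerate(terms)}

def to_sparse_vector(tokens, vocab):
    """Return {dimension index: term frequency}, omitting zero entries."""
    counts = Counter(t for t in tokens if t in vocab)
    return {vocab[t]: c for t, c in counts.items()}

docs_tokens = [["policy", "rural", "policy"], ["digital", "policy"]]
vocab = build_vocabulary(docs_tokens)
vectors = [to_sparse_vector(tokens, vocab) for tokens in docs_tokens]
```

Only non-zero dimensions are stored, which is what keeps the representation sparse when the global vocabulary is large.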
1.3, BM25 algorithm:
Query preprocessing: the query statement entered by the user is preprocessed in the same way as the documents, including word segmentation and stop word removal.
Keyword matching: the degree of matching between the keywords in the query statement and the keywords of each document is calculated. The BM25 algorithm evaluates the relevance between the query and a document by taking into account term frequency (TF), inverse document frequency (IDF) and the distribution of terms in the document.
Relevance scoring: the BM25 algorithm generates a relevance score for each document; the higher the score, the more relevant the document is to the query. The formula is as follows:
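$$\mathrm{Score}(q, d) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, d)\,(k_1 + 1)}{f(q_i, d) + k_1 \left(1 - b + b \cdot \frac{|d|}{\mathrm{avgdl}}\right)}$$
where $\mathrm{Score}(q, d)$ is the relevance score, $n$ is the total number of words in query $q$, $q$ is the query, $d$ is the document, $q_i$ is a keyword in the query, $f(q_i, d)$ is the term frequency of keyword $q_i$ in document $d$, $\mathrm{IDF}(q_i)$ is the inverse document frequency of keyword $q_i$, $k_1$ and $b$ are tuning parameters, $|d|$ is the length of document $d$, and $\mathrm{avgdl}$ is the average document length in the collection.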
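The BM25 relevance score can be sketched in a few lines. The source does not specify the exact form of the inverse document frequency or the parameter values, so the smoothed IDF used here, `k1=1.5` and `b=0.75` are assumptions (common defaults), and the token lists are toy data.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score every document against the query with BM25."""
    n_docs = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / n_docs
    df = Counter()
    for d in docs_tokens:
        df.update(set(d))  # document frequency of each term
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        score = 0.0
        for q in query_tokens:
            if tf[q] == 0:
                continue
            # Assumed smoothed IDF variant; always positive.
            idf = math.log((n_docs - df[q] + 0.5) / (df[q] + 0.5) + 1.0)
            norm = tf[q] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[q] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs_tokens = [
    ["rural", "revitalization", "strategy"],
    ["digital", "government", "platform"],
    ["rural", "policy", "document", "rural"],
]
scores = bm25_scores(["rural", "strategy"], docs_tokens)
```

The first document matches both query terms and scores highest, the third matches only "rural", and the second shares no term with the query and scores zero.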
1.4, Preliminary screening:
Documents with higher BM25 scores are selected to form a preliminary candidate document set. These documents are considered to have a stronger explicit semantic association with the user query and provide the basis for the subsequent dense vector retrieval.
In addition, the dense vector retrieval module uses the vector knowledge base and an efficient approximate nearest neighbor algorithm to achieve accurate semantics-based retrieval. The detailed steps are as follows:
2.1, Vector knowledge base construction:
Document slice vectorization: each document slice in the document knowledge base is vectorized using the BGE (Bidirectional Generative Encoder) model. The BGE model captures the semantic information of the document content and generates a high-dimensional vector representation.
Vector index construction: a vector index library of the document slices is built with the Faiss library to form the vector knowledge base. The Faiss library supports efficient approximate nearest neighbor (ANN) queries and can quickly locate the document slices most similar to the query vector.
2.2, Query vectorization:
Query preprocessing: the query statement entered by the user is preprocessed in the same way as the documents, including word segmentation and stop word removal.
Query vectorization: the query statement is converted into a high-dimensional vector representation using the BGE model. This vector is used for similarity retrieval in the vector knowledge base.
2.3, Approximate nearest neighbor search:
Vector retrieval: the query vector is submitted to the vector knowledge base and retrieval is performed with the approximate nearest neighbor (ANN) algorithm of the Faiss library. Faiss quickly locates the document slices most similar to the query vector by clustering the slices and using an efficient index structure.
Similarity scoring: a similarity score is generated for each retrieved document slice; the higher the score, the higher the semantic similarity between the document slice and the query vector.
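To make the similarity search concrete, the sketch below ranks document slices by cosine similarity to a query vector. It is an exhaustive (exact) search over hand-written toy vectors, used here as a stand-in: it is not the Faiss approximate index and not BGE embeddings, and the choice of cosine similarity as the metric is an assumption.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def search(query_vec, slice_vecs, top_k=2):
    """Return the top_k (slice id, similarity) pairs, best first."""
    scored = [(cosine(query_vec, vec), sid) for sid, vec in slice_vecs.items()]
    scored.sort(key=lambda p: (-p[0], p[1]))
    return [(sid, score) for score, sid in scored[:top_k]]

# Toy 3-dimensional vectors; real embeddings have far more dimensions.
slice_vecs = {
    "s1": [1.0, 0.0, 0.0],
    "s2": [0.7, 0.7, 0.0],
    "s3": [0.0, 0.0, 1.0],
}
hits = search([1.0, 0.1, 0.0], slice_vecs)
```

An approximate index trades a small amount of accuracy for speed by comparing the query against only part of the collection, for example the slices in the nearest clusters, instead of every vector as this sketch does.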
2.4, Result fusion:
Composite scoring: the results of the sparse retrieval module and the dense vector retrieval module are fused. For each candidate document, a composite score is generated by combining its BM25 score and its similarity score.
Final ranking: the candidate documents are reordered according to the composite score to generate the final retrieval result. This ensures that the retrieval result takes into account both explicit semantic association (sparse retrieval) and semantic similarity (dense vector retrieval).
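One simple way to combine the two scores is a weighted sum after normalization, sketched below. The min-max normalization, the weight `alpha` and the sample scores are assumptions for this example; the source does not fix a particular fusion formula at this point.

```python
def min_max(scores):
    """Rescale scores to [0, 1] so BM25 and similarity values are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 1.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}

def fuse(bm25, similarity, alpha=0.5):
    """Composite score = alpha * BM25 + (1 - alpha) * similarity, both normalized."""
    bm25_n, sim_n = min_max(bm25), min_max(similarity)
    doc_ids = set(bm25_n) | set(sim_n)
    combined = {
        d: alpha * bm25_n.get(d, 0.0) + (1 - alpha) * sim_n.get(d, 0.0)
        for d in doc_ids
    }
    return sorted(combined.items(), key=lambda p: (-p[1], p[0]))

# Toy scores for three candidate documents.
bm25 = {"d1": 4.0, "d2": 2.0, "d3": 0.0}
similarity = {"d1": 0.5, "d2": 0.9, "d3": 0.1}
ranking = fuse(bm25, similarity, alpha=0.4)
```

With `alpha=0.4` the semantic score carries more weight, so "d2" overtakes "d1" even though "d1" has the higher BM25 score.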
The hybrid retrieval module achieves efficient and accurate knowledge retrieval by combining the advantages of sparse retrieval and dense vector retrieval. The two modules cooperate as follows:
3.1, Sparse retrieval module:
Preliminary screening: the BM25 algorithm quickly screens out the set of documents that have a stronger explicit semantic association with the query statement, which reduces the number of documents to be processed.
Candidate document set: the module provides an efficient candidate document set for the dense vector retrieval module, which improves retrieval efficiency.
3.2, Dense vector retrieval module:
Semantic matching: accurate semantic matching is performed on the candidate document set through the BGE model and the Faiss library, which further improves the accuracy of the retrieval result.
Similarity scoring: a similarity score is generated for each candidate document, ensuring that the retrieval results are not only related in explicit semantics but also closely matched in meaning.
3.3, Result fusion:
Composite scoring: the BM25 score of the sparse retrieval module and the similarity score of the dense vector retrieval module are fused to generate a composite score.
Final ranking: the candidate documents are reordered according to the composite score to generate the final retrieval result. This ensures that the retrieval result reflects both explicit semantic association and semantic similarity, which improves retrieval accuracy and the user experience.
With this hybrid retrieval mode, the system combines the efficiency of sparse retrieval with the accuracy of dense vector retrieval, improving the overall performance and user experience of document retrieval. The multi-path strategy balances retrieval efficiency and precision. First, the sparse retrieval module quickly screens out a candidate set of documents that explicitly match the query content, providing a basis for subsequent deep retrieval. Next, the vector retrieval module performs semantic matching on the candidate documents, and can be extended to the global document library to find potentially semantically related documents. Finally, the system combines the BM25 score from sparse retrieval with the semantic similarity score from vector retrieval and fuses the results using linear weighting or a learning-based ranking method to generate the final citation recommendation list. The multi-path strategy improves recall while keeping precision high. In this design, the sparse retrieval module uses the BM25 algorithm to locate highly relevant documents quickly, and the vector retrieval module builds a vector index library through the BGE model to achieve matching at the semantic level. As a result, candidate documents related to the query can be found quickly in a large-scale knowledge base, and potentially related information can be captured through deep semantic matching, which improves the comprehensiveness and accuracy of retrieval.
Therefore, the efficiency and quality of government official document processing and citation retrieval are remarkably improved, an innovative solution is provided for government informatization and intelligent office work, and the method has great social value and application prospect.
Referring to fig. 4, the embodiment of the invention discloses a government affair document quotation retrieval device based on a large model, which comprises:
The knowledge base construction module 11 is used for acquiring government affair documents, determining document slices corresponding to the government affair documents, and creating a document knowledge base and a vector knowledge base based on the document slices;
The recognition module 12 is configured to recognize the technical terms in the government affairs document by using a target pre-training natural language processing model, so as to obtain a corresponding recognition result;
the keyword matching module 13 is configured to perform keyword matching on the recognition result by using a target information retrieval algorithm based on the document knowledge base and the vector knowledge base, so as to obtain a plurality of candidate citation documents;
And the quotation document output module 14 is configured to score semantic relevance between each candidate quotation document and the recognition result through the reranker model, reorder each candidate quotation document according to the corresponding scoring result, and output a target quotation document according to the corresponding reordering result.
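The division of the device into four modules can be pictured as a small pipeline in which each module is a replaceable callable. The class name `CitationRetrievalDevice`, the method `run` and the trivial stand-in functions below are hypothetical; they only show how data flows between modules 11 to 14, not how each module is implemented.

```python
class CitationRetrievalDevice:
    """Wires together the four modules; each one is injected as a callable."""

    def __init__(self, build_kb, recognize, match, rank):
        self.build_kb = build_kb      # knowledge base construction module 11
        self.recognize = recognize    # recognition module 12
        self.match = match            # keyword matching module 13
        self.rank = rank              # citation document output module 14

    def run(self, documents):
        kb = self.build_kb(documents)
        terms = self.recognize(documents)
        candidates = self.match(terms, kb)
        return self.rank(candidates, terms)

# Trivial stand-ins so the data flow can be exercised end to end.
device = CitationRetrievalDevice(
    build_kb=lambda docs: dict(enumerate(docs)),
    recognize=lambda docs: ["policy"],
    match=lambda terms, kb: [d for d in kb.values() if any(t in d for t in terms)],
    rank=lambda cands, terms: sorted(cands, key=len),
)
output = device.run(["rural policy notice", "weather report", "policy"])
```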
In some specific embodiments, the knowledge base construction module 11 may be specifically configured to: split the government document text using the LangChain framework to determine the document slices corresponding to the government document; build an index over the document slices using Elasticsearch to create the document knowledge base; vectorize the document slices using a deep learning model based on bidirectional generative encoding to obtain first processed document slices; and construct the vector knowledge base based on the first processed document slices.
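Document slicing can be illustrated with a fixed-size splitter that overlaps consecutive slices. This plain-Python function is a stand-in for a framework text splitter and does not use the LangChain API; the character-based strategy, `chunk_size` and `overlap` are assumptions for this sketch.

```python
def split_text(text, chunk_size=200, overlap=20):
    """Split text into slices of at most chunk_size characters.

    Consecutive slices share `overlap` characters so that a sentence cut at
    a boundary still appears whole in at least one neighboring slice.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]

text = "abcdefghij" * 3  # 30 characters of toy content
chunks = split_text(text, chunk_size=12, overlap=2)
```

Each resulting slice would then be indexed for keyword search and embedded for vector search, as described in the paragraph above.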
In some specific embodiments, the device may be further configured to train an initial pre-training natural language processing model by using a target government document corpus and a government document domain specific vocabulary to obtain a pre-training natural language processing model, wherein the government document domain specific vocabulary includes political nouns, strategic nouns and policy terms, supplement nouns in the government document domain specific vocabulary by using an external database to obtain a supplement result, wherein the external database includes government documents, white papers and policy documents, and train the pre-training natural language processing model by using the supplement result to obtain the target pre-training natural language processing model.
In some specific embodiments, the keyword matching module 13 may be specifically configured to: perform word segmentation, stop word removal and keyword extraction on the document slices in the document knowledge base to obtain second processed document slices; construct a global vocabulary based on the second processed document slices, where the global vocabulary contains the unique words or phrases appearing in all second processed document slices; vectorize the second processed document slices based on the global vocabulary to obtain corresponding sparse vectors, and store the sparse vectors in the document knowledge base; calculate relevance scores between the recognition result and the keywords corresponding to each document slice in the document knowledge base based on a target information retrieval algorithm, and screen initial documents from the document knowledge base according to the relevance scores; and perform keyword matching on the recognition result based on the vector knowledge base and the initial documents to obtain a plurality of candidate citation documents.
In some specific embodiments, the keyword matching module 13 may be specifically configured to perform word segmentation, stop word removal, and keyword extraction on the recognition result, obtain a processed recognition result, determine a target vector corresponding to the processed recognition result, retrieve a target document slice similar to the target vector in the vector knowledge base, determine a similarity score between the target document slice and the target vector, determine a composite score of each initial document based on the similarity score and the relevance score, and reorder the initial documents according to the composite score to determine a plurality of candidate citation documents.
In some specific embodiments, the citation document output module 14 may be specifically configured to: score the semantic relevance between each candidate citation document and the recognition result with the reranker model; filter out, according to the scoring results, the candidate citation documents whose scores do not meet a preset score threshold condition, so as to determine the initial citation documents; reorder the initial citation documents according to the scoring result, the issuing time and the issuing organization of each initial citation document; and output the target citation document according to the corresponding reordering result.
In some specific embodiments, the device may be further configured to output the context information of the target citation document when outputting the target citation document, where the context information includes the context of the cited paragraph and any one or a combination of the title, issuing organization and issuing time of the target citation document.
Further, the embodiment of the present application further discloses an electronic device, and fig. 5 is a block diagram of an electronic device 20 according to an exemplary embodiment, where the content of the figure is not to be considered as any limitation on the scope of use of the present application.
Fig. 5 is a schematic structural diagram of an electronic device 20 according to an embodiment of the present application. The electronic device 20 may include, in particular, at least one processor 21, at least one memory 22, a power supply 23, a communication interface 24, an input-output interface 25, and a communication bus 26. The memory 22 is configured to store a computer program, where the computer program is loaded and executed by the processor 21 to implement relevant steps in the large-model-based government document quotation retrieval method disclosed in any of the foregoing embodiments. In addition, the electronic device 20 in the present embodiment may be specifically an electronic computer.
In this embodiment, the power supply 23 is configured to provide working voltages for each hardware device on the electronic device 20, the communication interface 24 is capable of creating a data transmission channel with an external device for the electronic device 20, and the communication protocol to be followed is any communication protocol applicable to the technical solution of the present application, which is not specifically limited herein, and the input/output interface 25 is configured to obtain external input data or output data to the external device, and the specific interface type of the input/output interface may be selected according to the specific application needs and is not specifically limited herein.
The memory 22, as a carrier for storing resources, may be a read-only memory, a random access memory, a magnetic disk, an optical disk, or the like. The resources stored thereon may include an operating system 221, a computer program 222, and the like, and the storage may be temporary or permanent.
The operating system 221 is used to manage and control the hardware devices on the electronic device 20 and the computer program 222, and may be Windows Server, NetWare, Unix, Linux, or the like. In addition to the computer program used to perform the large-model-based government document citation retrieval method executed by the electronic device 20 as disclosed in any of the foregoing embodiments, the computer program 222 may further include computer programs used to perform other specific tasks.
Furthermore, the present application also discloses a computer-readable storage medium for storing a computer program, wherein the computer program, when executed by a processor, implements the aforementioned large-model-based government document citation retrieval method. For the specific steps of the method, reference may be made to the corresponding content disclosed in the foregoing embodiments, which is not repeated here.
In this specification, the embodiments are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another. Since the device disclosed in an embodiment corresponds to the method disclosed in an embodiment, its description is relatively brief; for relevant details, refer to the description of the method.
Those skilled in the art will further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the various illustrative elements and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
Finally, it should be further noted that relational terms such as "first" and "second" are used herein solely to distinguish one entity or action from another, without necessarily requiring or implying any actual relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
The principles and embodiments of the present application have been illustrated above with specific examples. These examples are provided only to assist in understanding the principles and embodiments of the present application and are in no way limiting; those of ordinary skill in the art will appreciate, in light of the above teachings, that the embodiments of the present application may be varied.

Claims (10)

1. A large-model-based government document citation retrieval method, characterized by comprising:
obtaining government documents, determining document slices corresponding to the government documents, and creating a document knowledge base and a vector knowledge base based on the document slices;
identifying professional terms in the government documents using a target pre-trained natural language processing model to obtain a corresponding recognition result;
performing keyword matching on the recognition result using a target information retrieval algorithm, based on the document knowledge base and the vector knowledge base, to obtain a corresponding plurality of candidate citation documents;
scoring the semantic relevance between each candidate citation document and the recognition result using a reranker model, reordering the candidate citation documents according to the corresponding scoring results, and outputting a target citation document according to the corresponding reordering result.

2. The large-model-based government document citation retrieval method according to claim 1, characterized in that determining the document slices corresponding to the government documents and creating the document knowledge base and the vector knowledge base based on the document slices comprises:
dividing the text of the government documents using the Langchain framework to determine the document slices corresponding to the government documents;
building an index based on the document slices through Elasticsearch to create the document knowledge base;
vectorizing the document slices using a deep learning model based on bidirectional generative encoding to obtain first processed document slices;
constructing the vector knowledge base based on the first processed document slices.

3. The large-model-based government document citation retrieval method according to claim 1, characterized in that, before identifying the professional terms in the government documents using the target pre-trained natural language processing model, the method further comprises:
training an initial pre-trained natural language processing model using a target government document corpus and a vocabulary specific to the government document field to obtain a pre-trained natural language processing model, wherein the vocabulary specific to the government document field includes political terms, strategic terms, and policy terms;
supplementing the nouns in the vocabulary specific to the government document field using an external database to obtain a supplemented result, wherein the external database includes government files, white papers, and policy documents;
training the pre-trained natural language processing model using the supplemented result to obtain the target pre-trained natural language processing model.

4. The large-model-based government document citation retrieval method according to claim 1, characterized in that performing keyword matching on the recognition result using the target information retrieval algorithm, based on the document knowledge base and the vector knowledge base, to obtain the corresponding plurality of candidate citation documents comprises:
performing word segmentation, stop-word removal, and keyword extraction on the document slices in the document knowledge base to obtain second processed document slices;
building a global vocabulary based on the second processed document slices, wherein the global vocabulary contains the unique words or phrases appearing in all of the second processed document slices;
vectorizing the second processed document slices based on the global vocabulary to obtain corresponding sparse vectors, and storing the sparse vectors in the document knowledge base;
calculating, based on the target information retrieval algorithm, a relevance score between the recognition result and the keywords corresponding to each document slice in the document knowledge base;
selecting initial documents from the document knowledge base according to the relevance scores;
performing keyword matching on the recognition result based on the vector knowledge base and the initial documents to obtain the corresponding plurality of candidate citation documents.

5. The large-model-based government document citation retrieval method according to claim 4, characterized in that performing keyword matching on the recognition result based on the vector knowledge base and the initial documents to obtain the corresponding plurality of candidate citation documents comprises:
performing word segmentation, stop-word removal, and keyword extraction on the recognition result to obtain a processed recognition result;
determining a target vector corresponding to the processed recognition result;
retrieving, from the vector knowledge base, target document slices similar to the target vector;
determining a similarity score between each target document slice and the target vector;
determining a comprehensive score for each initial document based on the similarity score and the relevance score;
reordering the initial documents according to the comprehensive scores to determine the plurality of candidate citation documents.

6. The large-model-based government document citation retrieval method according to claim 1, characterized in that scoring the semantic relevance between each candidate citation document and the recognition result using the reranker model, reordering the candidate citation documents according to the corresponding scoring results, and outputting the target citation document according to the corresponding reordering result comprises:
scoring the semantic relevance between each candidate citation document and the recognition result using the reranker model;
filtering out, according to the scoring results, the candidate citation documents whose scoring results do not meet a preset scoring threshold condition, to determine the initial citation documents;
reordering the initial citation documents according to the scoring results and the publication time and publishing organization of each initial citation document, and outputting the target citation document according to the corresponding reordering result.

7. The large-model-based government document citation retrieval method according to any one of claims 1 to 6, characterized by further comprising:
outputting context information of the target citation document when outputting the target citation document, wherein the context information includes the context of the cited paragraph and any one or a combination of the title, publishing organization, and publication time of the target citation document.

8. A large-model-based government document citation retrieval device, characterized by comprising:
a knowledge base construction module, configured to obtain government documents, determine document slices corresponding to the government documents, and create a document knowledge base and a vector knowledge base based on the document slices;
a recognition module, configured to identify professional terms in the government documents using a target pre-trained natural language processing model to obtain a corresponding recognition result;
a keyword matching module, configured to perform keyword matching on the recognition result using a target information retrieval algorithm, based on the document knowledge base and the vector knowledge base, to obtain a corresponding plurality of candidate citation documents;
a citation document output module, configured to score the semantic relevance between each candidate citation document and the recognition result using a reranker model, reorder the candidate citation documents according to the corresponding scoring results, and output a target citation document according to the corresponding reordering result.

9. An electronic device, characterized by comprising:
a memory, configured to store a computer program;
a processor, configured to execute the computer program to implement the large-model-based government document citation retrieval method according to any one of claims 1 to 7.

10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, implements the large-model-based government document citation retrieval method according to any one of claims 1 to 7.
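The retrieval steps recited in claims 4 to 6 can be sketched in a few lines. This is an illustration under stated assumptions, not the claimed implementation: the claims do not name the "target information retrieval algorithm", the formula for the comprehensive score, or how publication time and publishing organization enter the final ordering. Here cosine similarity over term-count vectors stands in for keyword relevance, the comprehensive score is assumed to be a weighted sum with a hypothetical weight `alpha`, and the final ordering is assumed to sort by reranker score with newer documents first on ties (publishing organization is left out). All function and field names are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from math import sqrt


def build_vocabulary(token_lists):
    """Map every unique token across all processed slices to a column index."""
    vocab = {}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


def sparse_vector(tokens, vocab):
    """Term-count vector stored sparsely as {column index: count}."""
    counts = Counter(t for t in tokens if t in vocab)
    return {vocab[t]: c for t, c in counts.items()}


def cosine(a, b):
    """Cosine similarity of two sparse vectors (stand-in for keyword relevance)."""
    dot = sum(v * b.get(i, 0) for i, v in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def combined_score(relevance, similarity, alpha=0.5):
    """Assumed comprehensive score: weighted sum of the two scores."""
    return alpha * relevance + (1 - alpha) * similarity


@dataclass
class Candidate:
    doc_id: str
    rerank_score: float   # score assumed to come from a reranker model
    published: str        # ISO date string, e.g. "2024-05-01"


def filter_and_reorder(candidates, threshold):
    """Drop candidates below the threshold, then sort by score, newest first on ties."""
    kept = [c for c in candidates if c.rerank_score >= threshold]
    return sorted(kept, key=lambda c: (c.rerank_score, c.published), reverse=True)
```

ISO date strings are used so that plain string comparison orders documents chronologically; a production system would more likely store real date values and an explicit ranking of publishing organizations.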
CN202510446020.XA 2025-04-10 2025-04-10 Government document citation retrieval method, device, equipment and medium based on big model Pending CN120296146A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202510446020.XA CN120296146A (en) 2025-04-10 2025-04-10 Government document citation retrieval method, device, equipment and medium based on big model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202510446020.XA CN120296146A (en) 2025-04-10 2025-04-10 Government document citation retrieval method, device, equipment and medium based on big model

Publications (1)

Publication Number Publication Date
CN120296146A true CN120296146A (en) 2025-07-11

Family

ID=96284477

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202510446020.XA Pending CN120296146A (en) 2025-04-10 2025-04-10 Government document citation retrieval method, device, equipment and medium based on big model

Country Status (1)

Country Link
CN (1) CN120296146A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120744081A (en) * 2025-08-14 2025-10-03 北京国联视讯信息技术股份有限公司 Industrial document intelligent retrieval method and system based on large model
CN120821832A (en) * 2025-09-18 2025-10-21 国网浙江省电力有限公司信息通信分公司 Modeling method, device, equipment and medium for power business decision model
CN120873154A (en) * 2025-07-25 2025-10-31 神州医疗科技股份有限公司 Auxiliary method and auxiliary device for scientific research in medical science special field
CN120874801A (en) * 2025-09-24 2025-10-31 浪潮云信息技术股份公司 Intelligent document generation method, system, equipment and medium based on multi-model fusion

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120873154A (en) * 2025-07-25 2025-10-31 神州医疗科技股份有限公司 Auxiliary method and auxiliary device for scientific research in medical science special field
CN120744081A (en) * 2025-08-14 2025-10-03 北京国联视讯信息技术股份有限公司 Industrial document intelligent retrieval method and system based on large model
CN120821832A (en) * 2025-09-18 2025-10-21 国网浙江省电力有限公司信息通信分公司 Modeling method, device, equipment and medium for power business decision model
CN120821832B (en) * 2025-09-18 2026-01-30 国网浙江省电力有限公司信息通信分公司 A modeling method, apparatus, equipment and medium for a power business decision model
CN120874801A (en) * 2025-09-24 2025-10-31 浪潮云信息技术股份公司 Intelligent document generation method, system, equipment and medium based on multi-model fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination