CN120162400A - A large-scale model-based sewage treatment question-and-answer method, equipment, medium and product
- Publication number
- CN120162400A (application CN202411244427.6A)
- Authority
- CN
- China
- Prior art keywords
- sewage treatment
- text
- question
- vector
- database
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3329—Natural language query formulation
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3347—Query execution using vector based model
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/06—Energy or water supply
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- Business, Economics & Management (AREA)
- Data Mining & Analysis (AREA)
- Economics (AREA)
- Mathematical Physics (AREA)
- Databases & Information Systems (AREA)
- General Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- General Business, Economics & Management (AREA)
- Marketing (AREA)
- Primary Health Care (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Resources & Organizations (AREA)
- Public Health (AREA)
- Water Supply & Treatment (AREA)
- Tourism & Hospitality (AREA)
- Human Computer Interaction (AREA)
- Strategic Management (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The application discloses a sewage treatment question-answering method, equipment, medium and product based on a large model, and relates to the field of language model construction. Existing data is preprocessed to obtain text data; the text data is segmented into text blocks; the text blocks are stored in a text database and encoded into semantic vectors, which are stored in a vector database. Given a sewage treatment question, retrieval results related to the question are obtained from the text database and the vector database using a keyword search method and a vector similarity search method; the retrieval results are reordered to obtain answer document fragments; a prompt is formed from the question and the answer document fragments; and the prompt is input into a large language model to generate a sewage treatment question-answer text. The application can provide accurate answers to sewage treatment questions, thereby reducing sewage treatment risk.
Description
Technical Field
The application relates to the field of language model construction, in particular to a sewage treatment question-answering method, equipment, medium and product based on a large model.
Background
In recent years, large language models (Large Language Model, LLM, abbreviated as large models) have made remarkable progress, and well-known models such as ChatGPT (Chat Generative Pre-trained Transformer) perform excellently in question answering, summarization and content generation. These models show strong generalization capability not only in natural language processing (Natural Language Processing, NLP) tasks but also across interdisciplinary fields. However, the rapid development of the sewage treatment field has produced massive amounts of domain knowledge, while models such as ChatGPT are trained on general-purpose datasets and lack specific optimization for the sewage field. As a result, they cannot provide accurate answers to sewage treatment questions, which in turn introduces risk into sewage treatment.
Disclosure of Invention
The application aims to provide a sewage treatment question-answering method, equipment, medium and product based on a large model, which can provide accurate answers corresponding to sewage treatment questions so as to reduce sewage treatment risks.
In order to achieve the above object, the present application provides the following solutions:
In a first aspect, the application provides a sewage treatment question-answering method based on a large model, which comprises the following steps:
the method comprises the steps of acquiring the existing data related to sewage treatment, wherein the existing data related to sewage treatment comprises one or more of books in the sewage treatment field, industry standards in the sewage treatment field, academic papers in the sewage treatment field and questions and answers related to knowledge points of sewage treatment;
preprocessing the existing data related to sewage treatment to obtain text data;
Cutting the text data to obtain text blocks;
Storing the text block in a text database, encoding the text block to generate a semantic vector, and storing the generated semantic vector in a vector database;
obtaining a search result related to the sewage treatment problem from the text database and the vector database based on the sewage treatment problem by adopting a keyword search method and a vector similarity search method;
reordering the retrieval result to obtain an answer document fragment;
Forming a prompt based on the wastewater treatment question and the answer document fragment;
and inputting the prompt into a large language model to generate a sewage treatment question-answering text.
Optionally, preprocessing existing materials related to sewage treatment to obtain text data, including:
converting the format of the existing data related to sewage treatment to obtain the text data, wherein the format of the text data comprises the docx format and the md format;
and removing redundant information from the existing data related to sewage treatment that has been converted into the docx format, wherein the redundant information comprises headers, page numbers, empty lines and pictures.
Optionally, the text data is segmented to obtain text blocks, which includes:
for the text data in the docx format, segmenting with the ChineseRecursiveTextSplitter in LangChain to obtain text blocks;
and for the text data in the md format, segmenting with the MarkdownHeaderTextSplitter in LangChain to obtain text blocks.
Optionally, storing the text block in a text database, and encoding the text block to generate a semantic vector, and storing the generated semantic vector in a vector database, including:
creating an index in Elasticsearch and storing the text blocks in the index;
and encoding the text blocks with the bge_m3 model to generate semantic vectors, and using Faiss as the vector database to store the generated semantic vectors.
Optionally, a keyword search method and a vector similarity search method are adopted, and based on the sewage treatment problem, search results related to the sewage treatment problem are obtained from the text database and the vector database, and specifically include:
Adopting a BM25 algorithm as the keyword searching method, obtaining a search result related to the sewage treatment problem from the text database, and taking the search result as a first search result;
And carrying out vector similarity search by using Euclidean distance, obtaining a search result related to the sewage treatment problem from the vector database, and taking the search result as a second search result.
Optionally, reordering the retrieval result to obtain an answer document fragment, including:
determining the semantic matching degree between the retrieval results and the sewage treatment question using a Rerank model;
and sorting the retrieval results in descending order of semantic matching degree to obtain the answer document fragments.
Optionally, forming a prompt based on the sewage treatment question and the answer document fragment includes:
forming the prompt with the sewage treatment question as the preceding text and the answer document fragments as the following text.
In a second aspect, the application provides a computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor executes the computer program to implement the steps of the large model-based sewage treatment question-answering method provided above.
In a third aspect, the present application provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the large model based sewage treatment question-answering method provided above.
In a fourth aspect, the present application provides a computer program product comprising a computer program which, when executed by a processor, implements the steps of the large model based sewage treatment question-answering method provided above.
According to the specific embodiment provided by the application, the application discloses the following technical effects:
The application provides a sewage treatment question-answering method, equipment, medium and product based on a large model, which are used for carrying out sewage treatment question-answering treatment on the basis of obtained existing data related to sewage treatment, so that the comprehensiveness of data coverage can be improved. And after operations such as preprocessing, segmentation and storage based on text data, different search methods (a keyword search method and a vector similarity search method) are adopted to obtain search results related to the sewage treatment problem from a text database and a vector database based on the sewage treatment problem, and the obtained search results are subjected to fusion and sorting, so that the diversified requirements in the sewage treatment question-answering process can be met. The prompt generated based on the search results after fusion sequencing is input into the large language model to generate the sewage treatment question-answering text, so that accurate answers corresponding to sewage treatment questions can be provided, the defect of the existing single vector index in the search accuracy is overcome, and the sewage treatment risk is further reduced.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a diagram of an application environment of a sewage treatment question-answering method based on a large model in an embodiment of the application;
FIG. 2 is a schematic flow chart of a sewage treatment question-answering method based on a large model according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a search result generation and fusion process according to another embodiment of the present application;
FIG. 4 is a schematic diagram of the designed prompt template according to another embodiment of the present application;
FIG. 5 is a diagram illustrating a data format of Retrieve Evaluation according to another embodiment of the present application;
FIG. 6 is a schematic diagram of a Hit_Rate search experiment result according to another embodiment of the present application;
FIG. 7 is a schematic diagram of an MRR search experiment result according to another embodiment of the present application;
FIG. 8 is a schematic diagram of a sewage treatment question-answer text generated by RAG according to another embodiment of the present application;
FIG. 9 is a schematic diagram of a sewage treatment question-answer text generated by GLM-4 according to another embodiment of the present application;
FIG. 10 is a general workflow diagram of a large model-based sewage treatment question-answering method according to another embodiment of the present application;
FIG. 11 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The foregoing objects, features, and advantages of the application will be more readily apparent from the following detailed description of the application when taken in conjunction with the accompanying drawings and detailed description.
The sewage treatment question-answering method based on the large model provided by the embodiment of the application can be applied to the application environment shown in fig. 1, in which the terminal 102 communicates with the server 104 via a network. A data storage system may store data that the server 104 needs to process; it may be provided separately, integrated on the server 104, or placed on a cloud or other server. The terminal 102 may send the existing data related to sewage treatment to the server 104. After receiving it, the server 104 preprocesses the existing data related to sewage treatment to obtain text data, and segments the text data to obtain text blocks. The text blocks are stored in a text database and encoded to generate semantic vectors, and the generated semantic vectors are stored in a vector database. Using a keyword search method and a vector similarity search method, retrieval results related to the sewage treatment question are obtained from the text database and the vector database. The retrieval results are reordered to obtain answer document fragments, a prompt is formed from the sewage treatment question and the answer document fragments, and the prompt is input into a large language model to generate a sewage treatment question-answer text. The server 104 may feed the resulting sewage treatment question-answer text back to the terminal 102.
In addition, in some embodiments, the sewage treatment question-answering method based on the large model may be implemented by the server 104 or the terminal 102 alone, for example, the terminal 102 may directly obtain the sewage treatment question-answering text for the existing data related to the sewage treatment, or the server 104 may obtain the existing data related to the sewage treatment from the data storage system, and obtain the sewage treatment question-answering text for the existing data related to the sewage treatment.
The terminal 102 may be, but not limited to, various desktop computers, notebook computers, smart phones, tablet computers, internet of things devices, and portable wearable devices, where the internet of things devices may be smart speakers, smart televisions, smart air conditioners, smart vehicle devices, and the like. The portable wearable device may be a smart watch, smart bracelet, headset, or the like. The server 104 may be implemented as a stand-alone server or a server cluster composed of a plurality of servers, or may be a cloud server.
In an exemplary embodiment, as shown in fig. 2, a sewage treatment question-answering method based on a large model is provided, and the method is executed by a computer device, specifically, may be executed by a computer device such as a terminal or a server, or may be executed by the terminal and the server together, and in an embodiment of the present application, the method is applied to the server 104 in fig. 1 and is described as an example, and includes the following steps 200 to 207. Wherein:
Step 200, acquiring the existing data related to sewage treatment. Existing data related to sewage treatment includes one or more of books in sewage treatment fields, industry standards in sewage treatment fields, academic papers in sewage treatment fields, and questions and answers related to knowledge points of sewage treatment.
Step 201, preprocessing the existing data related to sewage treatment to obtain text data.
Step 202, segmenting the text data to obtain text blocks.
Step 203, storing the text blocks in a text database, encoding the text blocks to generate semantic vectors, and storing the generated semantic vectors in a vector database.
Step 204, obtaining retrieval results related to the sewage treatment question from the text database and the vector database, based on the sewage treatment question, using a keyword search method and a vector similarity search method.
Step 205, reordering the retrieval results to obtain answer document fragments.
Step 206, forming a prompt based on the sewage treatment question and the answer document fragments.
Step 207, inputting the prompt into the large language model to generate a sewage treatment question-answer text.
By implementing the steps 200 to 207, the comprehensiveness of data coverage can be improved, the diversified requirements in the sewage treatment question-answering process can be met, accurate answers corresponding to sewage treatment questions can be provided, the defect of the existing single vector index in retrieval accuracy can be overcome, and the sewage treatment risk can be further reduced.
In another exemplary embodiment of the present application, implementing intelligent question answering and problem diagnosis for the sewage treatment field with retrieval-augmented generation on a large language model requires collecting a large amount of literature and related knowledge about the field. To cover the broad knowledge content of the field, the preprocessing in step 201 may convert the format of the existing data related to sewage treatment to obtain text data. The text data includes the docx format and the md format.
Redundant information is removed from existing data related to sewage treatment converted into docx format. The redundant information includes a header, a page number, an empty line, and a picture.
For example, preprocessing can proceed from four sources: books in the sewage treatment field, industry standards in the sewage treatment field, academic papers in the sewage treatment field, and questions and answers related to sewage treatment knowledge points (i.e., common questions in the sewage treatment field), wherein:
(1) Books in the sewage treatment field: PDF files are converted into docx documents using the PDF-to-Word conversion tool in WPS (word processing software); references irrelevant to the research are removed; interference factors (i.e., redundant information) that may affect the results, such as headers, page numbers and blank lines, are removed in WPS; and all image data is removed, because current retrieval-augmented generation (Retrieval-Augmented Generation, RAG) technology cannot process images. All books related to the sewage treatment field undergo this preprocessing. The redundant information in the books may be removed manually.
(2) Industry standards in the sewage treatment field: policy and standard documents related to sewage treatment are crawled from websites using crawler technology, and the documents closely related to sewage treatment are screened out. These documents are converted from PDF to Word using the PDF-to-Word conversion tool in WPS, formatting problems are tidied up in WPS, and interference factors such as headers, page numbers, blank lines and pictures are removed. The screening and the removal of redundant information may also be done manually.
(3) Common questions in the sewage treatment field: a Python script is written that uses the GLM-4 model with suitable prompts to generate questions and answers related to sewage. To cover sewage treatment knowledge more comprehensively, an outline of industry knowledge can be compiled, and the GLM-4 model is used to generate related questions and answers for each knowledge point in the outline. These are stored in a file in Markdown format (i.e., md format), with each generated question set as a level-1 heading and its answer set as body text. The industry knowledge outline may be compiled manually.
(4) Academic papers in the sewage treatment field: the latest papers related to sewage treatment are downloaded from various websites and consolidated; PDF files are converted into docx using the PDF-to-Word conversion tool in WPS; references irrelevant to sewage treatment are removed; interference factors that may affect the results, such as headers, page numbers and blank lines, are removed; and all image data is removed, because RAG currently cannot process images. Downloading and consolidating the latest papers may also be done manually.
In another exemplary embodiment of the present application, the preprocessed text data in the docx and Markdown files is loaded and then segmented into a plurality of text blocks. Based on this, step 202 is implemented as follows:
(1) For text data in the docx format, the ChineseRecursiveTextSplitter in LangChain is used for segmentation to obtain text blocks.
For example, CHUNK_SIZE in the ChineseRecursiveTextSplitter is set to 500 and OVERLAP_SIZE is set to 100, where OVERLAP_SIZE is the number of overlapping tokens between two adjacent text blocks. To preserve semantic continuity, adjacent text blocks may overlap to some extent; the chunk_overlap parameter of the ChineseRecursiveTextSplitter controls the size of this overlap region. This segmentation satisfies the maximum length limit while keeping adjacent text blocks semantically connected. A suitable text block size and overlap can improve the fluency and consistency with which a large language model processes long text.
(2) For text data in the md format, the MarkdownHeaderTextSplitter in LangChain is used for segmentation to obtain text blocks. The MarkdownHeaderTextSplitter splits text mainly according to the character #.
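The two segmentation strategies above can be illustrated with a minimal pure-Python sketch. It is not the ChineseRecursiveTextSplitter or MarkdownHeaderTextSplitter implementation: `split_with_overlap` is a hypothetical fixed-window splitter that counts characters, and `split_markdown_by_header` is a hypothetical splitter that assumes each question is a level-1 heading followed by its answer, as described for the generated Q&A files.

```python
import re


def split_with_overlap(text: str, chunk_size: int = 500, overlap: int = 100) -> list:
    """Cut text into windows of at most chunk_size characters.

    Each window starts (chunk_size - overlap) characters after the previous
    one, so adjacent blocks share `overlap` characters.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


def split_markdown_by_header(md: str) -> list:
    """Split Markdown into blocks at level-1 headings ('# ' at line start)."""
    blocks = []
    current = None
    for line in md.splitlines():
        match = re.match(r"^#\s+(.*)", line)
        if match:
            # A new heading opens a new block.
            current = {"header": match.group(1).strip(), "content": ""}
            blocks.append(current)
        elif current is not None and line.strip():
            # Non-empty body lines are appended to the current block.
            separator = "\n" if current["content"] else ""
            current["content"] += separator + line.strip()
    return blocks
```

With the values from this embodiment (500 and 100), a 1200-character text yields three blocks, and the last 100 characters of each block reappear at the start of the next.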
In another exemplary embodiment of the present application, in order to further improve the accuracy of the retrieval results, the segmented text blocks are stored in two databases.
First, the text is stored in Elasticsearch. For example, an index named docs2 is created and the text blocks are stored in that index.
Second, the vector embeddings of the text blocks are stored in a vector database. For example, the text blocks are encoded with the bge_m3 model to generate semantic vectors that capture the nuances of each segment. The bge_m3 model is an open-source model from BAAI (Beijing Academy of Artificial Intelligence) and the University of Science and Technology of China. It supports more than 100 languages, accepts text input in different forms, and extends the maximum text input length to 8192 tokens. The generated semantic vectors are stored in a vector database; in this embodiment, Faiss is selected as the vector database and IndexFlatL2 as the index type. At this point, the offline data preparation is complete.
In another exemplary embodiment of the present application, RAG can obtain information from an external knowledge base, thereby improving the accuracy and relevance of the answer. In the RAG framework, R stands for Retrieval, i.e., retrieval and recall, which is critical to the quality of the final generated answer, and vector-based semantic retrieval is the best-known basic retrieval mode. In practice, however, a single pass of vector retrieval often cannot meet diverse needs. Therefore, to improve the retrieval stage, related text blocks are recalled with different retrieval algorithms and the multi-path retrieval results from different dimensions are fused, overcoming the limited retrieval accuracy of a single vector index. The results recalled by the different retrieval algorithms are then reordered (Rerank). Fused retrieval typically combines traditional keyword-based search with modern vector-based search. Based on this, as shown in fig. 3, steps 204 and 205 may be implemented as follows:
(1) The BM25 algorithm is used as the keyword search method to obtain retrieval results related to the sewage treatment question from the text database; these serve as the first retrieval result.
That is, text blocks related to the sewage treatment question are recalled using the BM25 algorithm. The BM25 score, written in its standard form with the symbols defined below, is:

$$\mathrm{BM25}(D, Q) = \sum_{i=1}^{n} \ln\left(\frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}\right) \cdot \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1\left(1 - b + b \cdot \frac{\mathrm{len}(D)}{\mathrm{avg\_len}}\right)}$$

where $\mathrm{BM25}(D, Q)$ is the relevance score of question $Q$ to text block $D$, $n$ is the number of terms in the query, $q_i$ is the $i$-th term in the query, $N$ is the total number of documents, $n(q_i)$ is the number of documents containing term $q_i$, $f(q_i, D)$ is the number of occurrences of term $q_i$ in document $D$, $\mathrm{len}(D)$ is the length of document $D$, $\mathrm{avg\_len}$ is the average length of all documents, and $k_1$ and $b$ are tuning parameters, typically set to $k_1 = 1.5$ and $b = 0.75$.
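The BM25 scoring described above can be sketched in a few lines of Python. This is an illustration only, not the scoring code that Elasticsearch runs: the function name `bm25_scores` is hypothetical, documents are assumed to be pre-tokenized lists of terms, and the inverse document frequency is assumed to take the textbook Okapi form built from N and n(q_i).

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Return one BM25 score per document for the given query terms.

    docs is a list of token lists. The IDF below is the textbook Okapi form
    (an assumption); it can be negative for terms found in more than half of
    the documents, which is why some engines add 1 inside the logarithm.
    """
    if not docs:
        return []
    total_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / total_docs
    # n(q_i): number of documents that contain each term.
    doc_freq = Counter()
    for d in docs:
        doc_freq.update(set(d))
    scores = []
    for d in docs:
        term_freq = Counter(d)
        score = 0.0
        for q in query_terms:
            f = term_freq[q]
            if f == 0:
                continue  # a term absent from the document contributes nothing
            idf = math.log((total_docs - doc_freq[q] + 0.5) / (doc_freq[q] + 0.5))
            length_norm = 1 - b + b * len(d) / avg_len
            score += idf * f * (k1 + 1) / (f + k1 * length_norm)
        scores.append(score)
    return scores
```

For Chinese text, the tokenization step (not shown) largely determines recall quality, since BM25 only matches exact terms.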
(2) Vector similarity search is performed using the Euclidean distance to obtain retrieval results related to the sewage treatment question from the vector database; these serve as the second retrieval result.
For example, the query question is vectorized with the bge_m3 model and vector similarity search is performed using the Euclidean distance. IndexFlatL2 was selected as the index type above; the L2 distance is the most common Euclidean distance measure of the spatial distance between two vectors. In multidimensional space, the L2 distance is the square root of the sum of the squared differences between the two vectors in each dimension. The L2 distance between two points in space is:

$$d(a, b) = \sqrt{\sum_{i=1}^{m} (a_i - b_i)^2}$$

where $a_i$ and $b_i$ are the values of vectors $a$ and $b$ in the $i$-th dimension, and $m$ is the dimension of the vectors.
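As a rough illustration of what an exhaustive ("flat") L2 index does, the sketch below compares the query vector with every stored vector and keeps the k closest. It is a pure-Python stand-in, not Faiss; the names `l2_distance` and `flat_l2_search` are hypothetical, and the vectors would in practice come from the embedding model.

```python
import math


def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def flat_l2_search(query_vec, vectors, k=5):
    """Brute-force nearest-neighbour search.

    Returns up to k (index, distance) pairs, closest first.
    """
    distances = [(i, l2_distance(query_vec, v)) for i, v in enumerate(vectors)]
    distances.sort(key=lambda pair: pair[1])
    return distances[:k]
```

Faiss's IndexFlatL2 reports squared L2 distances rather than their square roots; since the square root is monotonic, the ranking is the same.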
(3) A Rerank model is used to determine the semantic matching degree between the retrieval results (including the first retrieval result and the second retrieval result) and the sewage treatment question.
(4) The retrieval results are sorted in descending order of semantic matching degree to obtain the answer document fragments.
In steps (1) and (2), several (e.g., 5) related documents may be recalled by each method, and the effect is further optimized on top of the fused retrieval. The documents recalled by fused retrieval are the most similar ones, but they are not necessarily the most relevant context. Therefore, after retrieval is complete, the retrieval results are passed to the Rerank model, which computes the semantic matching degree between the list of recalled documents and the user's question; the list is then re-sorted by this degree. This improves the semantic ranking, places the relevant documents at the front, and improves the accuracy of RAG. The Rerank model selected is bge-reranker-large, an open-source Rerank model from BAAI that is widely used in China and performs well in a number of model evaluations. The Rerank model takes the user's question and the documents recalled by fused retrieval as input and directly outputs a similarity score, providing a finer-grained relevance measure than the vector embedding model, so that documents more relevant to the question are ranked first. In this way, a plurality of relevant document fragments related to the question can be obtained.
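The fuse-then-rerank flow in steps (3) and (4) can be sketched as follows. Several details are assumptions made for illustration: the function names are hypothetical, duplicate fragments returned by both retrieval paths are merged before scoring, `top_n` is an arbitrary cutoff, and `toy_overlap_score` is a word-overlap stand-in for the scorer, not bge-reranker-large.

```python
def fuse_and_rerank(question, keyword_hits, vector_hits, score_fn, top_n=3):
    """Merge two recall lists, score each candidate against the question,
    and return the top_n (fragment, score) pairs in descending score order."""
    seen = set()
    candidates = []
    for fragment in list(keyword_hits) + list(vector_hits):
        if fragment not in seen:  # keep the first copy of a duplicated fragment
            seen.add(fragment)
            candidates.append(fragment)
    scored = [(fragment, score_fn(question, fragment)) for fragment in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


def toy_overlap_score(question, fragment):
    """Stand-in scorer: fraction of question words that occur in the fragment.

    A real rerank model would replace this function.
    """
    q_words = set(question.lower().split())
    f_words = set(fragment.lower().split())
    return len(q_words & f_words) / len(q_words) if q_words else 0.0
```

Passing the scorer in as `score_fn` keeps the fusion logic independent of whichever rerank model is plugged in.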
In another exemplary embodiment of the present application, the question and answer document fragments from the retrieving portion are treated as key contexts, together forming a Prompt (Prompt), which is then submitted to a Large Language Model (LLM) to generate a sewage treatment question-answer text.
Further, to improve the output quality of the LLM, techniques such as prompt engineering are explored and the Prompt is adjusted to optimize the generation quality of the whole RAG pipeline. Based on this, the prompt template designed in this embodiment is shown in fig. 4. First, a role is set for the model and the range of questions it may answer is restricted; the model is then instructed to analyze step by step and think carefully about the question being answered; the remaining prompt content requires the model to answer strictly according to the retrieved fragments, preventing it from fabricating incorrect answers.
In this embodiment, the LLM selected is the GLM-4 model, which is invoked through an application program interface (Application Program Interface, API). The GLM-4 model uses the General Language Model (GLM) architecture with an autoregressive blank-infilling objective; it is pre-trained on 10 trillion tokens, mainly in Chinese and English, and is aligned for Chinese and English through supervised fine-tuning and learning from human feedback.
When a user submits a query, the sewage treatment question replaces {{question}} and the document fragments obtained by the retrieval part replace {{context}}; both are sent to the GLM-4 model together, and the text analysis and reasoning capability of the GLM-4 model, combined with prompt engineering, is used to answer the sewage treatment question on the basis of factual knowledge. {{question}} and {{context}} are placeholders, and separators are used to delimit the different text portions and to separate instructions, context and input, thereby avoiding confusion.
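A minimal sketch of the placeholder substitution is given below. The wording of `PROMPT_TEMPLATE` is hypothetical and is not the template of fig. 4; only the {{question}} and {{context}} placeholders, the role setting, the step-by-step instruction and the use of separators follow the description. The `###` separator and the `build_prompt` helper are likewise assumptions.

```python
# Hypothetical template wording; only the two placeholders come from the text.
PROMPT_TEMPLATE = """You are an expert assistant for the sewage treatment field.
Only answer questions about sewage treatment. Think step by step, and answer
strictly from the reference fragments. If they do not contain the answer, say so.

### Question
{{question}}

### Reference fragments
{{context}}
"""


def build_prompt(question, fragments, template=PROMPT_TEMPLATE):
    """Fill the template with the question and the numbered fragments."""
    context = "\n\n".join(
        "[{}] {}".format(i, fragment) for i, fragment in enumerate(fragments, 1)
    )
    return template.replace("{{question}}", question).replace("{{context}}", context)
```

The resulting string would then be sent to the large language model through its API; that call is omitted here because the request format depends on the provider's SDK.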
In another exemplary embodiment of the present application, the LangChain framework is used. LangChain is an open-source framework for developing downstream applications on top of large language models. It aims to provide a generic interface for large language model applications and to simplify their development. In this embodiment, LangChain assists in building the end-to-end application, fusing the local domain knowledge base with the large language model to realize question answering in the sewage treatment field.
In another exemplary embodiment of the present application, the retrieval part is critical to the quality of the final generated answer. In this embodiment, an evaluation experiment is performed on the retrieval part, and the answer generated by the method provided by the present application is compared with the answer generated by the GLM-4 model alone.
In this embodiment, the retrieval part provided in the present application is evaluated with reference to the official Retrieve Evaluation experiment of LlamaIndex, and the detailed experimental procedure and results are given.
The general procedure for the Retrieve Evaluation experiment is as follows:
(1) Document segmentation: the sewage data set is segmented into text blocks, using the slicing method that has been described in detail.
(2) Question generation: for each segmented document, an LLM is used to generate questions about its content.
(3) Text recall: for each generated question, different recall algorithms are used to obtain recall results.
(4) Metric evaluation: the recall results are evaluated with the Hit Rate and MRR metrics.
Hit Rate means that the expected recall text (the ground truth) appears among the first k texts of the recall result, that is, the expected text is obtained at Recall@k. The higher the Hit Rate, the better the recall algorithm performs. The hit rate is calculated as:
$$HR = \frac{1}{S}\sum_{i=1}^{S} \mathrm{hit}(i)$$
where HR is the hit rate, S is the total number of query/relevant-document pairs (that is, the number of user queries), and hit(i) is an indicator function: it is 1 if the relevant document of the i-th query is among the top k retrieval results, and 0 otherwise.
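The hit rate definition above can be sketched in a few lines of Python. The function name, the argument layout (one expected document id and one ranked result list per query) and the toy ids are assumptions for illustration, not part of the evaluation code used in the experiment.

```python
def hit_rate(expected_ids, ranked_results, k):
    """Fraction of queries whose expected document is in the top-k results.

    expected_ids[i] is the ground-truth document id of the i-th query and
    ranked_results[i] is the ranked list of ids recalled for that query.
    """
    if not expected_ids or len(expected_ids) != len(ranked_results):
        raise ValueError("need one non-empty ranked list per query")
    # hit(i) is 1 when the expected id is among the first k results, else 0.
    hits = sum(
        1
        for expected, ranked in zip(expected_ids, ranked_results)
        if expected in ranked[:k]
    )
    return hits / len(expected_ids)


if __name__ == "__main__":
    # Toy data with hypothetical document ids.
    expected = ["d1", "d2", "d3", "d4"]
    results = [
        ["d1", "d9", "d8"],  # hit at rank 1
        ["d7", "d2", "d5"],  # hit at rank 2
        ["d6", "d5", "d9"],  # expected document never recalled
        ["d9", "d8", "d4"],  # hit at rank 3
    ]
    for k in (1, 2, 3):
        print(k, hit_rate(expected, results, k))
```

Because a larger k can only add hits, the hit rate never decreases as k grows.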
MRR (Mean Reciprocal Rank) is a common metric for evaluating retrieval quality. It is the mean, over a series of queries, of the reciprocal of the rank at which the relevant document is returned. For example, if a system ranks the correct answer to the first query second and the correct answer to the second query first, the MRR is (1/2 + 1/1)/2 = 0.75. The formula of MRR is:
$$MRR = \frac{1}{S}\sum_{i=1}^{S} \frac{1}{p_i}$$
where S is the total number of query/relevant-document pairs (that is, the number of user queries), and p_i is the position of the relevant text of the i-th query in the top-k retrieval result list. If the relevant text is not in the list, p_i = ∞, so that query contributes 0 to the sum.
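As a sketch of the MRR definition described above, the following Python function treats a relevant text missing from the top-k list as contributing 0, which corresponds to an infinite rank. The function name and the toy ids are illustrative assumptions.

```python
def mean_reciprocal_rank(expected_ids, ranked_results, k):
    """Mean of 1/rank of the expected document within the top-k results."""
    if not expected_ids or len(expected_ids) != len(ranked_results):
        raise ValueError("need one non-empty ranked list per query")
    total = 0.0
    for expected, ranked in zip(expected_ids, ranked_results):
        top_k = ranked[:k]
        if expected in top_k:
            # Ranks are 1-based, list indexes are 0-based.
            total += 1.0 / (top_k.index(expected) + 1)
        # Otherwise the rank is treated as infinite and adds nothing.
    return total / len(expected_ids)


if __name__ == "__main__":
    # The worked example from the text: correct answers at ranks 2 and 1.
    expected = ["a", "b"]
    results = [["x", "a", "y"], ["b", "x", "y"]]
    print(mean_reciprocal_rank(expected, results, k=5))
```

With k = 1 the first query in this example no longer counts, so the value drops from 0.75 to 0.5, which illustrates why MRR is reported per top_k setting.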
The final Retrieve Evaluation data format is shown in fig. 5 and includes four fields: queries, corpus, relevant_docs and mode.
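A possible shape of such a record is sketched below. Only the four field names come from the text. The nesting (ids mapped to texts, query ids mapped to lists of relevant document ids), the `"text"` value of `mode`, the ids and the placeholder strings are all assumptions, since fig. 5 is not reproduced here.

```python
import json

# Hypothetical evaluation record using the four named fields.
dataset = {
    "queries": {"q1": "<generated question about chunk c1>"},
    "corpus": {
        "c1": "<text of sliced chunk 1>",
        "c2": "<text of sliced chunk 2>",
    },
    "relevant_docs": {"q1": ["c1"]},
    "mode": "text",
}


def check_dataset(ds):
    """Return a list of consistency problems; an empty list means none found."""
    problems = [
        f"missing field: {name}"
        for name in ("queries", "corpus", "relevant_docs", "mode")
        if name not in ds
    ]
    if problems:
        return problems
    for query_id in ds["queries"]:
        doc_ids = ds["relevant_docs"].get(query_id, [])
        if not doc_ids:
            problems.append(f"no relevant document for query {query_id}")
        for doc_id in doc_ids:
            if doc_id not in ds["corpus"]:
                problems.append(f"unknown document {doc_id} for query {query_id}")
    return problems


if __name__ == "__main__":
    print(json.dumps(dataset, ensure_ascii=False, indent=2))
    print(check_dataset(dataset))
```

A check of this kind is useful before computing Hit Rate and MRR, because a query without a resolvable ground-truth chunk would silently count as a miss.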
Four methods are evaluated: vector semantic search (gte-base-zh, bge-large-zh-v1.5, bce-embedding-base_v1, bge_m3), keyword search (BM25), fusion search + RRF, and fusion search + Rerank (bge-reranker-large). In the vector semantic search group, bge_m3 has the highest scores: when the number of recalled documents (top_k) is 5, its Hit_Rate and MRR are 0.871988 and 0.692344, respectively. For the BM25 algorithm, Hit_Rate and MRR are 0.849398 and 0.603665, respectively. For fusion search + RRF, Hit_Rate and MRR are 0.896084 and 0.724799, respectively. For fusion search + Rerank (bge-reranker-large), Hit_Rate and MRR are 0.942771 and 0.805723, respectively. The last method performs best, so this embodiment adopts the fourth method as the retrieval part for sewage treatment questions. The results of the four methods for top_k = 1 to 5 are shown in fig. 6 and fig. 7. In fig. 6 and fig. 7, the legend entries Ensemble_search and Ensemble_and_Rerank denote the two methods that optimize the conventional retrieval algorithm in RAG, namely fusion search + RRF and fusion search + Rerank (bge-reranker-large), respectively. Analysis of the metrics in fig. 6 and fig. 7 leads to the following conclusions:
(1) Except for top_1 retrieval (top_k is usually not set to 1), fusion search achieves better top_k recall than any single search method, and adding the Rerank model improves the results further. For example, at top_k = 5, fusion search + Rerank raises Hit_Rate by 9.34 percentage points over BM25 and by 7.08 percentage points over bge_m3, and raises MRR by 20.20 percentage points over BM25 and by 11.34 percentage points over bge_m3.
(2) Except for top_1 retrieval (top_k is usually not set to 1), the retrieval quality ranks as fusion search + Rerank > fusion search > single-method retrieval, which further verifies the effectiveness of the method provided by the application.
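The improvements quoted in conclusion (1) are absolute differences between the top_k = 5 scores, which can be checked with a short calculation. The dictionary layout and function name below are illustrative. The scores are the ones reported above. Recomputing the MRR difference against BM25 gives 20.2058, which the text reports as 20.20.

```python
# (Hit_Rate, MRR) at top_k = 5, as reported in the text.
SCORES = {
    "bge_m3": (0.871988, 0.692344),
    "BM25": (0.849398, 0.603665),
    "fusion + RRF": (0.896084, 0.724799),
    "fusion + Rerank": (0.942771, 0.805723),
}
HIT_RATE, MRR = 0, 1


def gain_in_points(method, baseline, metric):
    """Absolute difference between two methods, in percentage points."""
    return (SCORES[method][metric] - SCORES[baseline][metric]) * 100


if __name__ == "__main__":
    for baseline in ("BM25", "bge_m3"):
        hr = gain_in_points("fusion + Rerank", baseline, HIT_RATE)
        mrr = gain_in_points("fusion + Rerank", baseline, MRR)
        print(f"vs {baseline}: Hit_Rate +{hr:.4f}, MRR +{mrr:.4f}")
```

Expressed as relative gains the figures would be larger (for example, Hit_Rate over BM25 would be roughly 11%), so the quoted numbers should be read as percentage points.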
In another exemplary embodiment of the application, to demonstrate answer quality, two approaches are used to answer questions in the sewage treatment field: the RAG system and the GLM-4 model alone. Because a large language model is prone to hallucination on questions that require local domain knowledge, this embodiment designs a test question for which relevant expertise exists in the local knowledge base, in order to verify the answer quality of RAG. The question is "grid design requirement". The sewage treatment question-answer text generated by RAG is shown in fig. 8, and the sewage treatment question-answer text generated by GLM-4 alone is shown in fig. 9. Most of the answer generated by RAG comes from reference books or verified knowledge, which greatly reduces the hallucination problem that can occur with a general-purpose large language model and helps keep the generated content up to date. Comparing the two answers shows that the RAG answer is more accurate and more detailed, whereas the GLM-4 reply is more general, gives no specific data requirements, and ends by referring the user to reference books, so it is of little practical value to the user.
In summary, the application builds a question-answering system for the sewage treatment field by applying retrieval-augmented generation (RAG) to a large language model (LLM). In particular, the retrieval and recall mechanism of RAG is optimized, which markedly improves retrieval efficiency and accuracy. The overall workflow of this embodiment is shown in fig. 10. Text data related to sewage treatment is collected, the corresponding documents are preprocessed and loaded as text, and a slicing operation is performed. The sliced text then undergoes two operations: it is vectorized by an embedding model and stored in a vector database, and it is stored as text in a keyword database. When there is a query (that is, when the user raises a sewage treatment question), two searches are performed: a keyword search, and a vector similarity search on the vectorized question. The results of the two searches are fused and reordered (Rerank), and the top_k texts are recalled and inserted into the custom prompt, allowing the large language model to answer the user's query based on the retrieved context information.
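The fusion and reordering steps of this workflow can be sketched in plain Python. The sketch assumes that the keyword search and the vector search have each already returned a ranked list of document ids. Reciprocal rank fusion (RRF) is shown with a constant of 60, a commonly used default that the text does not specify, and the rerank step takes an arbitrary scoring function as a stand-in for a model such as bge-reranker-large. All function names and ids are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[str]], rrf_k: int = 60
) -> list[str]:
    """Fuse ranked lists: each document scores the sum of 1 / (rrf_k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (rrf_k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


def merge_unique(*rankings: Sequence[str]) -> list[str]:
    """Union of the candidate lists, keeping first-seen order."""
    seen: set[str] = set()
    merged: list[str] = []
    for ranking in rankings:
        for doc_id in ranking:
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged


def rerank(
    query: str,
    candidates: Sequence[str],
    score_fn: Callable[[str, str], float],
    top_k: int,
) -> list[str]:
    """Sort candidates by descending relevance score and keep the top_k."""
    ordered = sorted(candidates, key=lambda d: score_fn(query, d), reverse=True)
    return ordered[:top_k]


if __name__ == "__main__":
    keyword_hits = ["d2", "d1", "d4"]  # e.g. from BM25
    vector_hits = ["d1", "d3", "d2"]   # e.g. from vector similarity search
    print(reciprocal_rank_fusion([keyword_hits, vector_hits]))

    # Stand-in scores; a real system would call a rerank model here.
    toy_scores = {"d1": 0.2, "d2": 0.9, "d3": 0.5, "d4": 0.1}
    candidates = merge_unique(keyword_hits, vector_hits)
    print(rerank("query", candidates, lambda q, d: toy_scores[d], top_k=2))
```

RRF needs only rank positions, so it can combine BM25 scores and vector distances that are not on a common scale. The rerank variant instead scores each candidate against the query directly, which costs more per query but is the variant the evaluation above found most accurate.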
In another exemplary embodiment of the present application, a computer device, which may be a server or a terminal, is provided, and an internal structure diagram thereof may be as shown in fig. 11. The computer device includes a processor, a memory, an Input/Output interface (I/O) and a communication interface. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface is connected to the system bus through the input/output interface. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer device is used for storing sewage treatment question-answer data. The input/output interface of the computer device is used to exchange information between the processor and the external device. The communication interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements a large model-based sewage treatment question-answering method.
It will be appreciated by those skilled in the art that the structure shown in FIG. 11 is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the computer device to which the present inventive arrangements may be applied, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components. In an exemplary embodiment, a computer device is provided, comprising a memory and a processor, the memory having stored therein a computer program, the processor performing the steps of the method embodiments described above when the computer program is executed.
In an exemplary embodiment, a computer-readable storage medium is provided, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method embodiments described above.
In an exemplary embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the steps of the method embodiments described above.
It should be noted that the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) related to the present application are all information and data authorized by the user or sufficiently authorized by each party, and the collection, use and processing of the related data must comply with the relevant regulations.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, database, or other medium used in embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, and the like. Volatile memory can include random access memory (RAM) or external cache memory, and the like. By way of illustration, and not limitation, RAM can be in various forms such as static random access memory (SRAM) or dynamic random access memory (DRAM), etc.
The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processor referred to in the embodiments provided in the present application may be a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic unit, a data processing logic unit based on quantum computing, or the like, but is not limited thereto.
The technical features of the above embodiments may be arbitrarily combined. For brevity of description, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction between the combined technical features, the combinations should be considered to be within the scope of this description.
The principles and embodiments of the present application have been described herein with reference to specific examples, which are intended only to facilitate an understanding of the method and core concepts of the application. Persons of ordinary skill in the art may, based on the teachings herein, make changes to the specific embodiments and to the scope of application. In view of the foregoing, this description should not be construed as limiting the application.
Claims (10)
1. A large model-based sewage treatment question-answering method, characterized by comprising the following steps:
acquiring existing data related to sewage treatment, wherein the existing data related to sewage treatment comprises one or more of books in the sewage treatment field, industry standards in the sewage treatment field, academic papers in the sewage treatment field and questions and answers related to knowledge points of sewage treatment;
preprocessing the existing data related to sewage treatment to obtain text data;
Cutting the text data to obtain text blocks;
Storing the text block in a text database, encoding the text block to generate a semantic vector, and storing the generated semantic vector in a vector database;
obtaining a search result related to the sewage treatment problem from the text database and the vector database based on the sewage treatment problem by adopting a keyword search method and a vector similarity search method;
reordering the retrieval result to obtain an answer document fragment;
forming a prompt based on the sewage treatment question and the answer document fragment;
and inputting the prompt into a large language model to generate a sewage treatment question-answering text.
2. The large model based sewage treatment question-answering method according to claim 1, wherein preprocessing existing materials related to sewage treatment to obtain text data comprises:
converting the format of the existing data related to sewage treatment, wherein the format of the text data comprises docx format and md format;
And removing redundant information of the existing data which is converted into the docx format and is related to sewage treatment, wherein the redundant information comprises a header, a page number, empty lines and pictures.
3. The large model-based sewage treatment question-answering method according to claim 2, wherein the text data is cut to obtain text blocks, comprising:
for the text data in the docx format, a ChineseRecursiveTextSplitter slicer in LangChain is adopted for slicing to obtain text blocks;
and for the text data in the md format, a MarkdownHeaderTextSplitter slicer in LangChain is adopted for slicing to obtain text blocks.
4. The large model based sewage treatment question-answering method according to claim 1, wherein storing the text block in a text database, and encoding the text block to generate a semantic vector, storing the generated semantic vector in a vector database, comprises:
creating an index in Elasticsearch and storing the text block in the index, wherein the index belongs to Elasticsearch;
and encoding the text block by adopting a bge_m3 model to generate a semantic vector, and adopting Faiss as the vector database to store the generated semantic vector in the vector database.
5. The large model-based sewage treatment question-answering method according to claim 1, wherein a keyword search method and a vector similarity search method are adopted to obtain search results related to sewage treatment problems from the text database and the vector database based on the sewage treatment problems, and specifically comprising:
Adopting a BM25 algorithm as the keyword searching method, obtaining a search result related to the sewage treatment problem from the text database, and taking the search result as a first search result;
And carrying out vector similarity search by using Euclidean distance, obtaining a search result related to the sewage treatment problem from the vector database, and taking the search result as a second search result.
6. The large model-based sewage treatment question-answering method according to claim 1, wherein reordering the retrieval results to obtain answer document fragments comprises:
determining the semantic matching degree between the search result and the sewage treatment problem by adopting a Rerank model;
and carrying out descending order arrangement on the search results according to the semantic matching degree to obtain the answer document fragments.
7. The large model based sewage treatment question-answering method according to claim 1, wherein forming a prompt based on the sewage treatment question and the answer document fragment includes:
taking the sewage treatment question as the preceding text and the answer document fragment as the following text to form the prompt.
8. A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor executes the computer program to implement the large model based sewage treatment question-answering method according to any one of claims 1-7.
9. A computer readable storage medium having stored thereon a computer program, which when executed by a processor implements the large model based sewage treatment question-answering method according to any one of claims 1 to 7.
10. A computer program product comprising a computer program which, when executed by a processor, implements the large model based sewage treatment question-answering method according to any one of claims 1 to 7.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202411244427.6A CN120162400A (en) | 2024-09-06 | 2024-09-06 | A large-scale model-based sewage treatment question-and-answer method, equipment, medium and product |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN120162400A (en) | 2025-06-17 |
Family
ID=96008009
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202411244427.6A Pending CN120162400A (en) | 2024-09-06 | 2024-09-06 | A large-scale model-based sewage treatment question-and-answer method, equipment, medium and product |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN120162400A (en) |
Patent Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN117951274A (en) * | 2024-01-29 | 2024-04-30 | 上海岩芯数智人工智能科技有限公司 | RAG knowledge question-answering method and device based on fusion vector and keyword retrieval |
| CN118277521A (en) * | 2024-02-27 | 2024-07-02 | 福建亿榕信息技术有限公司 | An intelligent question-answering method, system, device and medium in the electric power field based on LLM |
| CN118296120A (en) * | 2024-03-29 | 2024-07-05 | 中国科学院香港创新研究院人工智能与机器人创新中心有限公司 | Large-scale language model retrieval enhancement generation method for multi-mode multi-scale multi-channel recall |
Cited By (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN120463275A (en) * | 2025-07-15 | 2025-08-12 | 广东海洋大学 | A sewage treatment system and method based on large language model and multi-agent collaboration |
| CN120463275B (en) * | 2025-07-15 | 2025-10-17 | 广东海洋大学 | Sewage treatment system and method based on large language model and multi-agent cooperation |
| CN121051139A (en) * | 2025-10-29 | 2025-12-02 | 中国电子科技集团公司第二十八研究所 | A hybrid ranking method and system integrating domain embedding search and keyword search |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |