CN117056490A - Question extraction and answer generation methods, devices, media and equipment - Google Patents

Question extraction and answer generation methods, devices, media and equipment

Info

Publication number
CN117056490A
Authority
CN
China
Prior art keywords
target
initial
page
text
answers
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311093501.4A
Other languages
Chinese (zh)
Other versions
CN117056490B (en
Inventor
詹乐
龚静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Bank Co Ltd
Original Assignee
Ping An Bank Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Bank Co Ltd
Priority to CN202311093501.4A
Publication of CN117056490A
Application granted
Publication of CN117056490B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34 Browsing; Visualisation therefor
    • G06F16/345 Summarisation for human users
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation
    • G06N5/022 Knowledge engineering; Knowledge acquisition
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models
    • G06N5/041 Abduction
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/02 Banking, e.g. interest calculation or account maintenance
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Business, Economics & Management (AREA)
  • Accounting & Taxation (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Finance (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Development Economics (AREA)
  • Strategic Management (AREA)
  • Technology Law (AREA)
  • General Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Economics (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method, device, medium and equipment for question extraction and answer generation. Text information in a bank data document is segmented and vectorized to obtain multiple text vectors for each page; each text vector is summarized to obtain multiple target abstracts; each target abstract is sent to a natural language processing model using a preset question format to obtain multiple initial questions and answers; and the initial questions and answers are deduplicated and sorted, and the target questions and answers corresponding to each target abstract are output. The invention can parse bank data documents and automatically generate relevant questions and answers, helping users quickly understand document content and keep up with industry developments.

Description

Question extraction and answer generation method, device, medium and equipment
Technical Field
The application relates to the technical field of banks, in particular to a method, a device, a medium and equipment for extracting questions and generating answers.
Background
Intelligent document parsing is currently an important business scenario for banks, but parsing long bank data documents is difficult. If a document is too long, the user does not know what it contains and therefore cannot pose relevant questions for the system to answer. In this situation, automatically generating related questions and answers can help the user quickly understand the document content and keep up with industry developments.
Disclosure of Invention
Based on the above, it is necessary to provide a method, an apparatus, a medium and a device for question extraction and answer generation, so as to solve the problem that relevant questions and answers are difficult to extract automatically when a document is too long.
A method of question extraction and answer generation, the method comprising:
segmenting and vectorizing text information in a bank data document to obtain a plurality of text vectors corresponding to each page in the bank data document;
summarizing each text vector to form a corresponding target abstract, thereby obtaining a plurality of target abstracts;
sending each target abstract into a natural language processing model according to a preset question format, so as to generate a plurality of different initial questions and initial answers corresponding to each target abstract;
and performing de-duplication screening on the initial questions and the initial answers, sorting the remaining initial questions and initial answers by word count, and outputting target questions and target answers corresponding to each target abstract based on the sorting result.
In one embodiment, the segmenting and vectorizing of text information in the bank data document to obtain a plurality of text vectors corresponding to each page in the bank data document includes:
reading the text information of each page in the bank data document, and taking the text information of a target page together with the floating variables of its adjacent pages as the text block of the target page, to obtain text blocks for multiple pages; the target page is any page in the bank data document, an adjacent page is the previous page and/or the next page of the target page, and a floating variable is text information of a preset length in an adjacent page that adjoins the target page;
dividing the text block of each page, using word count and/or paragraph as the dividing unit, to obtain sub-text blocks of each page;
and embedding the sub-text blocks of each page into corresponding text vectors by using a preset sentence similarity model.
In one embodiment, the summarizing of each text vector to form a corresponding target abstract includes:
inputting each vector block into a preset abstract generation model, and obtaining the output initial abstract; each vector block consists of a sub-text block and its corresponding text vector;
and extracting from each initial abstract, based on sentence boundaries and/or keywords, a target abstract that corresponds to that initial abstract and is shorter than a preset word count threshold.
In one embodiment, the performing the de-duplication filtering on the initial question and the initial answer includes:
among all the initial questions, calculating the cosine similarity between every two initial questions and sorting, to obtain a first sorting result formed by a plurality of question pairs;
among all the initial answers, calculating the cosine similarity between every two initial answers and sorting, to obtain a second sorting result formed by a plurality of answer pairs;
in the first sorting result, performing a first deduplication operation on the first N question pairs with the highest cosine similarity and the corresponding initial answers; wherein N is a preset value, and the first deduplication operation means deleting either initial question in each question pair together with its corresponding initial answer;
in the second sorting result, performing a second deduplication operation on the first M answer pairs with the highest cosine similarity and the corresponding initial questions; wherein M is a preset value, and the second deduplication operation means deleting either initial answer in each answer pair together with its corresponding initial question.
In one embodiment, the sorting of the remaining initial questions and initial answers by word count, and the outputting of target questions and target answers corresponding to each target abstract based on the sorting result, includes:
returning the first L answers with the highest word counts as target answers, and outputting the questions corresponding to the target answers as target questions together with the target answers; wherein L is a preset value.
A question extraction and answer generation device, the device comprising:
the text vector generation module is used for performing a segmentation operation and a vectorization operation on text information in the bank data document, so as to obtain a plurality of text vectors corresponding to each page in the bank data document;
the target abstract generation module is used for summarizing each text vector to form a corresponding target abstract, so as to obtain a plurality of target abstracts;
the result output module is used for sending each target abstract into the natural language processing model according to a preset question format, so as to generate a plurality of different initial questions and initial answers corresponding to each target abstract; and for performing de-duplication screening on the initial questions and the initial answers, sorting the remaining initial questions and initial answers by word count, and outputting target questions and target answers corresponding to each target abstract based on the sorting result.
In one embodiment, the text vector generation module is specifically configured to:
reading the text information of each page in the bank data document, and taking the text information of a target page together with the floating variables of its adjacent pages as the text block of the target page, to obtain text blocks for multiple pages; the target page is any page in the bank data document, an adjacent page is the previous page and/or the next page of the target page, and a floating variable is text information of a preset length in an adjacent page that adjoins the target page;
dividing the text block of each page, using word count and/or paragraph as the dividing unit, to obtain sub-text blocks of each page;
and embedding the sub-text blocks of each page into corresponding text vectors by using a preset sentence similarity model.
In one embodiment, the target abstract generating module is specifically configured to:
inputting each vector block into a preset abstract generating model respectively, and acquiring an output initial abstract; each vector block consists of a sub text block and a corresponding text vector;
and extracting from each initial abstract, based on sentence boundaries and/or keywords, a target abstract that corresponds to that initial abstract and is shorter than a preset word count threshold.
A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the question extraction and answer generation method described above.
A question extraction and answer generation device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the question extraction and answer generation method described above.
The application provides a method, a device, a medium and equipment for question extraction and answer generation: text information in a bank data document is segmented and vectorized to obtain a plurality of text vectors for each page; each text vector is summarized to obtain a plurality of target abstracts; each target abstract is sent into a natural language processing model using a preset question format to obtain a plurality of initial questions and answers; and the initial questions and answers are de-duplicated and sorted, and the target questions and answers corresponding to each target abstract are output. The application can parse bank data documents and automatically generate related questions and answers, helping users quickly understand the document content and keep up with industry developments.
Drawings
In order to more clearly illustrate the embodiments of the application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Wherein:
FIG. 1 is a flow chart of a method for question extraction and answer generation;
FIG. 2 is a schematic diagram of a device for question extraction and answer generation;
FIG. 3 is a block diagram showing the construction of a question extraction and answer generation apparatus.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The terms first, second and the like in the description and in the claims and in the above-described figures are used for distinguishing between different objects and not necessarily for describing a sequential or chronological order. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
As shown in FIG. 1, which is a flow chart of a question extraction and answer generation method in one embodiment, the method in this embodiment includes the following steps:
S101, performing a segmentation operation and a vectorization operation on text information in a bank data document to obtain a plurality of text vectors corresponding to each page in the bank data document.
Here, a bank data document refers to a document containing bank-related data and information, such as a bank profile, business description, or financial report. The segmentation operation refers to dividing text information into a plurality of parts according to a certain rule or standard, where each part contains certain semantic information. For example, we can segment text information by paragraph, sentence, or punctuation. The vectorization operation refers to converting text information into numeric vectors for computer processing and analysis. For example, we can map each word or phrase to a point in a high-dimensional space using word embedding techniques, resulting in a word vector or phrase vector.
In one implementation, the plurality of text vectors corresponding to each page are obtained through the following steps:
(1) Reading the text information of each page in the bank data document, and taking the text information of the target page together with the floating variables of the adjacent pages as the text block of the target page, to obtain text blocks for a plurality of pages.
The target page is any page in the bank data document, an adjacent page is the previous page and/or the next page of the target page, and a floating variable is text information of a preset length in an adjacent page that adjoins the target page.
This step extends the text of each page with some surrounding context in order to capture more semantic information. For example, assume that a bank data document has three pages, each page having a piece of text, as follows:
Part 1: Bank profile
A bank is a financial institution that primarily engages in deposits, loans, payments, settlement, credit cards, and other business. Banks are the core components of financial systems and play an important role in economic development and social stability.
Part 2: Banking business
Banking business is mainly divided into two main categories: deposit and loan businesses. Deposit business refers to the business that a bank accepts funds deposited by a customer and pays interest according to agreements. Loan business refers to the business that a bank provides funds to a customer and charges interest and commission.
Part 3: Bank risk
Bank risk refers to the loss or damage that a bank may suffer during an operation. The bank risk mainly includes credit risk, market risk, liquidity risk, operation risk, and the like. Effective risk management measures are required by banks to ensure asset security and profitability.
Assuming we select the second page as the target page, we can take the last sentence of the first page and the first sentence of the third page as floating variables and splice them with the text of the second page to form the text block of the second page, as follows:
Banks are the core components of financial systems and play an important role in economic development and social stability. Banking business is mainly divided into two main categories: deposit and loan businesses. Deposit business refers to the business that a bank accepts funds deposited by a customer and pays interest according to agreements. Loan business refers to the business that a bank provides funds to a customer and charges interest and commission. Bank risk refers to the loss or damage that a bank may suffer during an operation.
The other two pages are processed in the same way, giving three text blocks in total. The preset length used here can of course be set as needed.
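The context extension in step (1) can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function names `split_sentences` and `build_text_blocks`, the naive punctuation-based sentence splitter, and the choice to measure the preset length in sentences (`n_context`) are all assumptions made for the example.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def build_text_blocks(pages: list[str], n_context: int = 1) -> list[str]:
    """Extend each page with the text that adjoins it in the adjacent pages.

    The last n_context sentences of the previous page are prepended and the
    first n_context sentences of the next page are appended (assumed unit:
    sentences; the patent only says "preset length").
    """
    blocks = []
    for i, page in enumerate(pages):
        parts = []
        if i > 0 and n_context > 0:
            parts.extend(split_sentences(pages[i - 1])[-n_context:])
        parts.append(page.strip())
        if i + 1 < len(pages) and n_context > 0:
            parts.extend(split_sentences(pages[i + 1])[:n_context])
        blocks.append(" ".join(parts))
    return blocks
```

With three pages, the middle block then contains the closing sentence of the first page, the whole second page, and the opening sentence of the third page, which mirrors the worked example above.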
(2) Dividing the text blocks of each page by taking the word number and/or the paragraph as a dividing unit to obtain the sub-text blocks of each page.
This step is to further refine each text block into sub-text blocks in order to extract finer semantic information. For example, we can divide the text blocks of the second page according to paragraphs or a fixed number of words as follows:
Sub-text block 1: Banks are the core components of financial systems and play an important role in economic development and social stability.
Sub-text block 2: Banking business is mainly divided into two main categories: deposit and loan businesses.
Sub-text block 3: Banks are the core components of financial systems and play an important role in economic development and social stability.
Sub-text block 4: Banking business is mainly divided into two main categories: deposit and loan businesses. Deposit business refers to the business that a bank accepts funds deposited by a customer and pays interest according to agreements.
Sub-text block 5: Loan business refers to the business that a bank provides funds to a customer and charges interest and commission.
Sub-text block 6: Bank risk refers to the loss or damage that a bank may suffer during an operation.
Sub-text block 7: Deposit business refers to the business that a bank accepts funds deposited by a customer and pays interest according to agreements.
Sub-text block 8: Bank risk refers to the loss or damage that a bank may suffer during an operation.
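One possible way to carry out the division in step (2) is sketched below. It is only an illustration under stated assumptions: paragraphs are taken to be separated by newlines, the word-count limit `max_words` is a hypothetical parameter, and the function name `split_into_sub_blocks` does not come from the source.

```python
def split_into_sub_blocks(block: str, max_words: int = 40) -> list[str]:
    """Split a text block by paragraph, then cap each piece at max_words words."""
    sub_blocks = []
    for paragraph in block.split("\n"):
        words = paragraph.split()
        # Empty paragraphs produce no words and are skipped automatically.
        for start in range(0, len(words), max_words):
            sub_blocks.append(" ".join(words[start:start + max_words]))
    return sub_blocks
```

Splitting by paragraph first keeps semantically related sentences together; the word-count cap only takes effect for paragraphs that are longer than the limit.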
(3) Embedding the sub-text blocks of each page into corresponding text vectors by using a preset sentence similarity model.
This step converts each sub-text block into a numeric vector for computer processing and analysis. For example, the SimCSE model may be selected. SimCSE is a sentence similarity model based on contrastive learning that learns semantic representations of sentences through self-prediction. For example, we can use the SimCSE model to embed each sub-text block of the second page as a 768-dimensional vector, as follows:
Sub-text block 1 -> text vector 1: [0.12, -0.34, ..., -0.45]
Sub-text block 2 -> text vector 2: [-0.23, 0.56, ..., -0.67]
Sub-text block 3 -> text vector 3: [0.34, -0.78, ..., -0.89]
Sub-text block 4 -> text vector 4: [-0.23, 0.56, ..., -0.67]
Sub-text block 5 -> text vector 5: [0.45, -0.89, ..., -0.12]
Sub-text block 6 -> text vector 6: [0.67, -0.34, ..., -0.45]
Sub-text block 7 -> text vector 7: [0.78, -0.12, ..., -0.56]
Sub-text block 8 -> text vector 8: [0.45, -0.89, ..., -0.50]
Thus, a plurality of text vectors corresponding to each page are obtained, and the text vectors can reflect the semantic information of each sub-text block and can also be used for subsequent calculation and analysis.
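To make the shape of this step concrete without depending on a trained model, the sketch below uses a toy hashed bag-of-words embedding as a stand-in. It is explicitly not SimCSE and carries no learned semantics; the function name `toy_embed` and the hashing scheme are assumptions chosen only so that each sub-text block maps to a fixed-length (here 768-dimensional) unit vector.

```python
import hashlib
import math


def toy_embed(text: str, dim: int = 768) -> list[float]:
    """Stand-in embedding: hash each token to one of dim buckets, then L2-normalize."""
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % dim
        vec[index] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    # An empty text yields the zero vector, which cannot be normalized.
    return [v / norm for v in vec] if norm else vec
```

In a real system the call to `toy_embed` would be replaced by the preset sentence similarity model; the surrounding pipeline only relies on getting one fixed-length vector per sub-text block.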
S102, summarizing each text vector to form a corresponding target abstract, thereby obtaining a plurality of target abstracts.
Step S102 is a natural language processing task whose purpose is to produce a short target abstract from the semantic content of each text vector, capturing the main information of the text. The target abstract may be a sentence, a phrase, or a single word.
In one implementation, the specific steps of forming the corresponding target abstract include:
(1) Inputting each vector block into a preset abstract generation model and obtaining the output initial abstract.
Each vector block is composed of a sub-text block and its corresponding text vector. This sub-step generates an initial abstract from the textual information and numerical representation of each vector block using a preset abstract generation model, such as the GPT-2 model.
As an illustration, consider another example about bank cards:
Vector block 1 -> initial abstract 1: A bank card is an electronic payment tool issued by a bank. It can store the user's funds and personal information and makes it convenient for the user to make purchases or transfers at an automatic teller machine, a POS machine, or an online platform. Bank cards typically come in magnetic stripe and chip forms, with the chip card being safer and more stable than the magnetic stripe card.
Vector block 2 -> initial abstract 2: There are many types of bank cards, mainly debit cards, credit cards, and quasi-credit cards. A debit card is a card that can be used only after the user deposits funds; a credit card is a card for which the bank grants the user a certain credit limit and which the user repays after spending; a quasi-credit card is a card for which the bank grants the user a certain pre-authorized amount and which can be overdrawn within that amount. Different card types have different functions and fees, and users should select a suitable card type according to their own needs and means.
Vector block 3 -> initial abstract 3: Bank cards should be used safely and sensibly: avoid revealing the password or personal information, and repay or check the balance promptly. When using a bank card, keep the card and the password safe, do not write the password on the card or tell it to other people, and do not enter bank card information on unsafe devices or websites. When using a credit card, pay attention to the repayment period and interest and clear the debt on time to avoid overdue or late fees. When using a debit card, check the balance and transaction records regularly so that any abnormal situation is found and handled in time.
(2) Extracting from each initial abstract, based on sentence boundaries and/or keywords, a target abstract that corresponds to that initial abstract and is shorter than a preset word count threshold.
Sentence boundaries are the positions where sentences begin and end, usually marked by punctuation. Based on sentence length and position, the sentences that best represent the subject of the original text can be selected, or several sentences can be merged or trimmed, to form a concise target abstract. Optionally, the preset word count threshold here may be 50 words. For example:
Initial abstract 1 -> target abstract 1: A bank card is an electronic payment tool that can store funds and information and supports various operations. The chip card is safer than the magnetic stripe card.
Keywords are words that reflect the topic or core content of the text. Words that reflect the subject and content of the original text can be selected according to its keywords or key phrases, or several words can be combined or replaced, to form a refined target abstract. Optionally, the preset word count threshold here may be 50 words. For example:
Initial abstract 2 -> target abstract 2: Bank cards are classified as debit cards, credit cards, and quasi-credit cards. Debit cards require a deposit, credit cards require repayment, and quasi-credit cards may be overdrawn. Different card types have different characteristics and costs.
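A much simpler variant of the sentence-boundary approach can be sketched as follows. Note the simplification: the examples above rewrite and condense sentences, whereas this sketch only keeps whole leading sentences until the word count threshold would be reached. The function name `truncate_summary` and the punctuation-based splitter are assumptions for illustration.

```python
import re


def truncate_summary(initial_summary: str, max_words: int = 50) -> str:
    """Keep leading sentences while the total stays strictly below max_words."""
    sentences = [
        s.strip()
        for s in re.split(r"(?<=[.!?])\s+", initial_summary)
        if s.strip()
    ]
    kept, count = [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        if count + n_words >= max_words:
            break  # adding this sentence would reach or exceed the threshold
        kept.append(sentence)
        count += n_words
    return " ".join(kept)
```

Because whole sentences are kept, the result always ends on a sentence boundary; if even the first sentence is too long, the function returns an empty string and a keyword-based fallback would be needed.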
S103, respectively sending each target abstract into a natural language processing model according to a preset question format so as to generate a plurality of different initial questions and initial answers corresponding to each target abstract.
Specifically, assume we have the following target abstract as input: A bank card is an electronic payment tool issued by a bank. It can store the user's funds and personal information and makes it convenient for the user to make purchases or transfers at an automatic teller machine, a POS machine, or an online platform. Bank cards typically come in magnetic stripe and chip forms, with the chip card being safer and more stable than the magnetic stripe card.
We can then use different methods to generate the initial questions and initial answers related to the text, for example a model based on a pre-trained language model. Such a model is pre-trained on a large-scale corpus to learn general knowledge and regularities of language, and is then fine-tuned on a specific task, which improves generation quality and efficiency. For example, we can take a pre-trained language model such as BERT or GPT-2, add a special tag (e.g., [Q]) to the text, and let the model generate a corresponding question based on the tag. Possible initial questions and initial answers include:
Initial question: What are the two forms of bank cards? Initial answer: Bank cards typically come in both magnetic stripe and chip form.
Initial question: What is a bank card? Initial answer: A bank card is an electronic payment tool issued by a bank. It can store the user's funds and personal information and makes it convenient for the user to make purchases or transfers at an automatic teller machine, a POS machine, or an online platform.
Initial question: What advantage does a chip card have over a magnetic stripe card? Initial answer: The chip card is safer and more stable than the magnetic stripe card.
S104, performing de-duplication screening on the initial questions and the initial answers, sorting the remaining initial questions and initial answers by word count, and outputting target questions and target answers corresponding to each target abstract based on the sorting result.
It will be appreciated that the initial questions and initial answers generated in step S103 may contain duplicate or similar content, which is of no use to the user, so a series of filtering operations may be performed.
In one implementation, the de-duplication filtering of the initial question and the initial answer includes:
(1) In all the initial problems, calculating cosine similarity between every two initial problems and sequencing the initial problems to obtain a first sequencing result formed by a plurality of problem pairs;
this step is to find the initial problem pair with the highest similarity, i.e. those initial problems with repeated or similar content. Cosine similarity is a method of evaluating the similarity of two vectors by calculating their angle cosine values. Each initial question may be considered herein as a vector of words that may be weighted using word frequency, TF-IDF, etc. By calculating the cosine similarity between the two initial questions, the degree of similarity between them can be obtained, with closer 1 representing more similar and closer 0 representing less relevant. And (3) carrying out cosine similarity calculation on all the initial questions pairwise, and sequencing the initial questions from large to small to obtain a first sequencing result, wherein the first sequencing result is a list formed by a plurality of question pairs, and each question pair has a corresponding cosine similarity value.
(2) Among all the initial answers, calculating the cosine similarity between every two initial answers and sorting the resulting pairs to obtain a second sorting result composed of a plurality of answer pairs.
This step finds the pairs of initial answers with the highest similarity, i.e. those initial answers whose content is repeated or similar. The cosine similarity is calculated in the same way as in the previous step, with the initial questions simply replaced by the initial answers.
(3) In the first sorting result, performing a first de-duplication operation on the top N question pairs with the highest cosine similarity and their corresponding initial answers; wherein N is a preset value, and the first de-duplication operation deletes either one of the initial questions in each question pair together with its corresponding initial answer.
This step deletes repeated or similar initial questions and their corresponding initial answers to reduce redundant information. N is a preset value that indicates how many question pairs are processed.
(4) In the second sorting result, performing a second de-duplication operation on the top M answer pairs with the highest cosine similarity and their corresponding initial questions; wherein M is a preset value, and the second de-duplication operation deletes either one of the initial answers in each answer pair together with its corresponding initial question.
This step further deletes repeated or similar initial answers and their corresponding initial questions to reduce redundant information further. M is a preset value that indicates how many answer pairs are processed.
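The four de-duplication steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation described in the application: it uses raw term-frequency vectors over whitespace-separated tokens (the text also allows TF-IDF weighting), the names `cosine_similarity`, `ranked_pairs` and `deduplicate` are hypothetical, and the choices to always drop the second member of a pair and to skip a pair whose member was already deleted are assumptions, since the text only says that "any one" member is deleted.

```python
import math
from collections import Counter
from itertools import combinations


def cosine_similarity(text_a, text_b):
    """Cosine similarity of two texts using raw term-frequency vectors."""
    vec_a = Counter(text_a.lower().split())
    vec_b = Counter(text_b.lower().split())
    dot = sum(vec_a[w] * vec_b[w] for w in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def ranked_pairs(texts):
    """Return (similarity, i, j) for every index pair, most similar first."""
    pairs = [
        (cosine_similarity(texts[i], texts[j]), i, j)
        for i, j in combinations(range(len(texts)), 2)
    ]
    return sorted(pairs, key=lambda p: p[0], reverse=True)


def deduplicate(qa_pairs, n, m):
    """Drop one entry from each of the top-n question pairs and top-m answer pairs."""
    questions = [q for q, _ in qa_pairs]
    answers = [a for _, a in qa_pairs]
    removed = set()
    # Steps (1) and (3): rank question pairs, delete one question and its answer.
    for _, i, j in ranked_pairs(questions)[:n]:
        if i not in removed and j not in removed:
            removed.add(j)
    # Steps (2) and (4): rank answer pairs, delete one answer and its question.
    for _, i, j in ranked_pairs(answers)[:m]:
        if i not in removed and j not in removed:
            removed.add(j)
    return [qa for k, qa in enumerate(qa_pairs) if k not in removed]
```

Both rankings are computed over the full set of initial questions and answers before anything is deleted, which matches the order of steps (1) to (4). For Chinese text, whitespace splitting would need to be replaced by a word segmenter.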
In one implementation, sorting the screened initial questions and initial answers by word count, and outputting the target questions and target answers corresponding to each target abstract based on the sorting result, includes: returning the top L answers with the highest word counts as target answers, and outputting the questions corresponding to those target answers as target questions together with the target answers; wherein L is a preset value.
This step selects, from the screened initial questions and initial answers, the target questions and target answers that can cover the content of the target abstract, and outputs them. L is a preset value indicating how many answers to select. The initial answers are sorted by word count in descending order, the top L answers with the most words are selected as target answers, and their corresponding initial questions are output as target questions together with the target answers.
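The selection by word count can be sketched in a few lines. This is only an illustration: the function name `select_target_pairs` is hypothetical, and counting words by whitespace splitting is an assumption that fits space-delimited text; Chinese text would need a word segmenter or a character count instead.

```python
def select_target_pairs(qa_pairs, top_l):
    """Return the top_l (question, answer) pairs whose answers have the most words."""
    # Sort by the word count of the answer, longest first; ties keep input order.
    ranked = sorted(qa_pairs, key=lambda qa: len(qa[1].split()), reverse=True)
    return ranked[:top_l]
```

Each question stays attached to its answer throughout, so the target questions are output together with the target answers as the step requires.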
The question extraction and answer generation method segments and vectorizes the text information in a bank data document to obtain a plurality of text vectors for each page; summarizes each text vector to obtain a plurality of target abstracts; sends each target abstract to a natural language processing model in a preset question format to obtain a plurality of initial questions and initial answers; and de-duplicates and sorts the initial questions and initial answers before outputting the target questions and target answers corresponding to each target abstract. The application can therefore analyze a bank data document and automatically generate related questions and answers, helping users quickly understand the document content and keep up with industry information.
In one embodiment, as shown in fig. 2, a device for extracting questions and generating answers is provided, which comprises:
the text vector generation module 201 is configured to perform a segmentation operation and a vectorization operation on text information in a bank data document, so as to obtain a plurality of text vectors corresponding to each page in the bank data document;
the target abstract generating module 202 is configured to summarize each text vector into a corresponding target abstract, so as to obtain a plurality of target abstracts;
the result output module 203 is configured to send each target abstract to the natural language processing model according to a preset question format, so as to generate a plurality of different initial questions and initial answers corresponding to each target abstract; and performing de-duplication screening on the initial questions and the initial answers, sorting the de-duplication screened initial questions and initial answers based on the word numbers, and outputting target questions and target answers corresponding to each target abstract based on sorting results.
In one embodiment, the text vector generation module 201 is specifically configured to:
reading text information of each page in the bank data document, and taking the text information of a target page together with the floating variables of its adjacent pages as the text block of the target page, to obtain text blocks of multiple pages; the target page is any page in the bank data document, the adjacent page is the previous page and/or the next page of the target page, and the floating variable is text information of a preset length in the adjacent page that adjoins the target page;
dividing the text blocks of each page by taking the word number and/or the paragraph as a dividing unit to obtain sub text blocks of each page;
and embedding the sub-text blocks of each page into corresponding text vectors by using a preset sentence similarity model.
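The first two operations of the text vector generation module can be sketched as below. This is a minimal sketch under stated assumptions: pages are plain strings, the floating variable is measured in characters, paragraphs are separated by blank lines, and the names `build_page_blocks` and `split_block` are hypothetical. The embedding step is left out because the text does not name the sentence similarity model.

```python
def build_page_blocks(pages, float_len):
    """Extend each page's text with float_len characters from its adjacent pages."""
    blocks = []
    for idx, text in enumerate(pages):
        # Tail of the previous page, i.e. the part that adjoins the current page.
        before = pages[idx - 1][-float_len:] if idx > 0 and float_len > 0 else ""
        # Head of the next page, i.e. the part that adjoins the current page.
        after = pages[idx + 1][:float_len] if idx + 1 < len(pages) else ""
        blocks.append(before + text + after)
    return blocks


def split_block(block, max_words):
    """Split a block on paragraphs, then cap each piece at max_words words."""
    sub_blocks = []
    for paragraph in block.split("\n\n"):
        words = paragraph.split()
        # Empty paragraphs yield no words and therefore no sub-blocks.
        for start in range(0, len(words), max_words):
            sub_blocks.append(" ".join(words[start:start + max_words]))
    return sub_blocks
```

Borrowing text from the neighbouring pages keeps a sentence that straddles a page break inside at least one text block, which appears to be the purpose of the floating variable.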
In one embodiment, the target summary generating module 202 is specifically configured to:
inputting each vector block into a preset abstract generating model respectively, and acquiring an output initial abstract; each vector block consists of a sub text block and a corresponding text vector;
and extracting each initial abstract based on sentence boundaries and/or keywords respectively to obtain a target abstract which is smaller than a preset word number threshold and corresponds to each initial abstract.
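The extraction of a target abstract from an initial abstract can be illustrated for the sentence-boundary case. The sketch below is an assumption-laden example, not the described implementation: it splits on `.`, `!` and `?`, counts words by whitespace, keeps leading sentences only, and uses the hypothetical name `trim_summary`. Keyword-based extraction, which the text also allows, is not shown.

```python
import re


def trim_summary(initial_summary, max_words):
    """Keep whole leading sentences while the total stays below max_words."""
    sentences = re.split(r"(?<=[.!?])\s+", initial_summary.strip())
    kept, count = [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Stop before the result would reach the preset word-count threshold.
        if count + n_words >= max_words:
            break
        kept.append(sentence)
        count += n_words
    return " ".join(kept)
```

Cutting only at sentence boundaries keeps the target abstract readable when it is later placed into the preset question format.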
The result output module 203 is specifically configured to: among all the initial questions, calculating the cosine similarity between every two initial questions and sorting the resulting pairs to obtain a first sorting result composed of a plurality of question pairs;
among all the initial answers, calculating the cosine similarity between every two initial answers and sorting the resulting pairs to obtain a second sorting result composed of a plurality of answer pairs;
in the first sorting result, performing a first de-duplication operation on the top N question pairs with the highest cosine similarity and their corresponding initial answers; wherein N is a preset value, and the first de-duplication operation deletes either one of the initial questions in each question pair together with its corresponding initial answer;
in the second sorting result, performing a second de-duplication operation on the top M answer pairs with the highest cosine similarity and their corresponding initial questions; wherein M is a preset value, and the second de-duplication operation deletes either one of the initial answers in each answer pair together with its corresponding initial question.
The result output module 203 is specifically configured to: returning the first L answers with the maximum word numbers as target answers, and outputting the questions corresponding to the target answers as target questions together with the target answers; wherein L is a preset value.
Fig. 3 is a diagram showing the internal structure of the question extraction and answer generation device in one embodiment. As shown in Fig. 3, the question extraction and answer generation device includes a processor, a memory, and a network interface connected through a system bus. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the device stores an operating system, and may also store a computer program that, when executed by the processor, causes the processor to implement the question extraction and answer generation method. The internal memory may also store a computer program that, when executed by the processor, causes the processor to perform the question extraction and answer generation method. Those skilled in the art will appreciate that the structure shown in Fig. 3 is merely a block diagram of the part of the structure relevant to the present application and does not limit the question extraction and answer generation device to which the present application is applied; a specific device may include more or fewer components than shown in the drawing, combine certain components, or have a different arrangement of components.
A computer readable storage medium storing a computer program which, when executed by a processor, performs the following steps: performing a segmentation operation and a vectorization operation on text information in a bank data document to obtain a plurality of text vectors corresponding to each page in the bank data document; summarizing each text vector into a corresponding target abstract to obtain a plurality of target abstracts; sending each target abstract to a natural language processing model in a preset question format, so as to generate a plurality of different initial questions and initial answers corresponding to each target abstract; and performing de-duplication screening on the initial questions and the initial answers, sorting the screened initial questions and initial answers by word count, and outputting the target questions and target answers corresponding to each target abstract based on the sorting result.
A question extraction and answer generation device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the following steps when executing the computer program: performing a segmentation operation and a vectorization operation on text information in a bank data document to obtain a plurality of text vectors corresponding to each page in the bank data document; summarizing each text vector into a corresponding target abstract to obtain a plurality of target abstracts; sending each target abstract to a natural language processing model in a preset question format, so as to generate a plurality of different initial questions and initial answers corresponding to each target abstract; and performing de-duplication screening on the initial questions and the initial answers, sorting the screened initial questions and initial answers by word count, and outputting the target questions and target answers corresponding to each target abstract based on the sorting result.
It should be noted that the above-mentioned method, apparatus, device and computer-readable storage medium for question extraction and answer generation belong to a general inventive concept, and the content in the embodiments of the method, apparatus, device and computer-readable storage medium for question extraction and answer generation are applicable to each other.
Those skilled in the art will appreciate that all or part of the methods of the above embodiments may be implemented by a computer program stored in a non-transitory computer-readable storage medium, and the program, when executed, may comprise the steps of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. The non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features involves no contradiction, it should be considered to fall within the scope of this description.
The foregoing examples illustrate only a few embodiments of the application, which are described in detail and are not to be construed as limiting the scope of the application. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.

Claims (10)

1. A method for question extraction and answer generation, the method comprising:
performing a segmentation operation and a vectorization operation on text information in a bank data document to obtain a plurality of text vectors corresponding to each page in the bank data document;
summarizing each text vector into a corresponding target abstract to obtain a plurality of target abstracts;
sending each target abstract into a natural language processing model according to a preset questioning format so as to generate a plurality of different initial questions and initial answers corresponding to each target abstract;
and performing de-duplication screening on the initial questions and the initial answers, sorting the de-duplication screened initial questions and the initial answers based on the word numbers, and outputting target questions and target answers corresponding to each target abstract based on sorting results.
2. The method according to claim 1, wherein the performing a segmentation operation and a vectorization operation on text information in a bank data document to obtain a plurality of text vectors corresponding to each page in the bank data document includes:
reading text information of each page in the bank data document, and taking the text information of a target page together with the floating variables of its adjacent pages as the text block of the target page, to obtain text blocks of multiple pages; the target page is any page in the bank data document, the adjacent page is the previous page and/or the next page of the target page, and the floating variable is text information of a preset length in the adjacent page that adjoins the target page;
dividing the text blocks of each page by taking the word number and/or the paragraph as a dividing unit to obtain sub text blocks of each page;
and embedding the sub-text blocks of each page into corresponding text vectors by using a preset sentence similarity model.
3. The method of claim 1, wherein the summarizing each text vector into a corresponding target abstract comprises:
inputting each vector block into a preset abstract generating model respectively, and acquiring an output initial abstract; each vector block consists of a sub text block and a corresponding text vector;
and extracting each initial abstract based on sentence boundaries and/or keywords respectively to obtain a target abstract which is smaller than a preset word number threshold and corresponds to each initial abstract.
4. The method of claim 1, wherein the de-duplication screening the initial question and the initial answer comprises:
among all the initial questions, calculating the cosine similarity between every two initial questions and sorting the resulting pairs to obtain a first sorting result composed of a plurality of question pairs;
among all the initial answers, calculating the cosine similarity between every two initial answers and sorting the resulting pairs to obtain a second sorting result composed of a plurality of answer pairs;
in the first sorting result, performing a first de-duplication operation on the top N question pairs with the highest cosine similarity and their corresponding initial answers; wherein N is a preset value, and the first de-duplication operation deletes either one of the initial questions in each question pair together with its corresponding initial answer;
in the second sorting result, performing a second de-duplication operation on the top M answer pairs with the highest cosine similarity and their corresponding initial questions; wherein M is a preset value, and the second de-duplication operation deletes either one of the initial answers in each answer pair together with its corresponding initial question.
5. The method of claim 1, wherein the ranking the initial questions and the initial answers after the de-duplication filtering based on the word count, and outputting the target questions and the target answers corresponding to each target abstract based on the ranking result, comprises:
returning the first L answers with the maximum word numbers as target answers, and outputting the questions corresponding to the target answers as target questions together with the target answers; wherein L is a preset value.
6. A question extraction and answer generation device, the device comprising:
the text vector generation module is used for performing a segmentation operation and a vectorization operation on text information in a bank data document, so as to obtain a plurality of text vectors corresponding to each page in the bank data document;
the target abstract generating module is used for summarizing each text vector into a corresponding target abstract, so as to obtain a plurality of target abstracts;
the result output module is used for respectively sending each target abstract into the natural language processing model according to a preset questioning format so as to generate a plurality of different initial questions and initial answers corresponding to each target abstract; and performing de-duplication screening on the initial questions and the initial answers, sorting the de-duplication screened initial questions and initial answers based on the word numbers, and outputting target questions and target answers corresponding to each target abstract based on sorting results.
7. The question extraction and answer generation device according to claim 6, wherein the text vector generation module is specifically configured to:
reading text information of each page in the bank data document, and taking the text information of a target page together with the floating variables of its adjacent pages as the text block of the target page, to obtain text blocks of multiple pages; the target page is any page in the bank data document, the adjacent page is the previous page and/or the next page of the target page, and the floating variable is text information of a preset length in the adjacent page that adjoins the target page;
dividing the text blocks of each page by taking the word number and/or the paragraph as a dividing unit to obtain sub text blocks of each page;
and embedding the sub-text blocks of each page into corresponding text vectors by using a preset sentence similarity model.
8. The question extraction and answer generation device according to claim 6, wherein the target abstract generation module is specifically configured to:
inputting each vector block into a preset abstract generating model respectively, and acquiring an output initial abstract; each vector block consists of a sub text block and a corresponding text vector;
and extracting each initial abstract based on sentence boundaries and/or keywords respectively to obtain a target abstract which is smaller than a preset word number threshold and corresponds to each initial abstract.
9. A computer readable storage medium, characterized in that a computer program is stored, which, when being executed by a processor, causes the processor to perform the steps of the method according to any of claims 1 to 5.
10. A question extraction and answer generation device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the method of any one of claims 1 to 5.
CN202311093501.4A 2023-08-28 2023-08-28 Methods, apparatus, media and equipment for question extraction and answer generation Active CN117056490B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311093501.4A CN117056490B (en) 2023-08-28 2023-08-28 Methods, apparatus, media and equipment for question extraction and answer generation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311093501.4A CN117056490B (en) 2023-08-28 2023-08-28 Methods, apparatus, media and equipment for question extraction and answer generation

Publications (2)

Publication Number Publication Date
CN117056490A true CN117056490A (en) 2023-11-14
CN117056490B CN117056490B (en) 2026-03-24

Family

ID=88653332

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311093501.4A Active CN117056490B (en) 2023-08-28 2023-08-28 Methods, apparatus, media and equipment for question extraction and answer generation

Country Status (1)

Country Link
CN (1) CN117056490B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112800177A (en) * 2020-12-31 2021-05-14 北京智源人工智能研究院 Method and device for automatic generation of FAQ knowledge base based on complex data types
US20210192126A1 (en) * 2019-12-19 2021-06-24 Adobe Inc. Generating structured text summaries of digital documents using interactive collaboration
CN114579796A (en) * 2022-05-06 2022-06-03 北京沃丰时代数据科技有限公司 Machine reading understanding method and device
CN114925184A (en) * 2022-05-16 2022-08-19 阿里巴巴(中国)有限公司 Question-answer pair generation method and device and computer readable storage medium
KR102436549B1 (en) * 2022-07-20 2022-08-25 (주) 유비커스 Method and apparatus for automatically generating training dataset for faq and chatbot based on natural language processing using deep learning
CN114970563A (en) * 2022-07-28 2022-08-30 山东大学 Chinese question generation method and system fusing content and form diversity
CN114997138A (en) * 2022-06-20 2022-09-02 壹沓科技(上海)有限公司 Chemical specification analysis method, device, equipment and readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
张浩宇: "文本问答与信息抽取关键技术研究", 《中国优秀硕士学位论文全文数据库》, 15 February 2023 (2023-02-15) *

Also Published As

Publication number Publication date
CN117056490B (en) 2026-03-24

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant