Methods, apparatus, media and equipment for question extraction and answer generation

Info

Publication number
CN117056490B
Authority
CN
China
Prior art keywords
target
page
initial
text
answers
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311093501.4A
Other languages
Chinese (zh)
Other versions
CN117056490A (en
Inventor
詹乐
龚静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Bank Co Ltd
Original Assignee
Ping An Bank Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Bank Co Ltd
Priority to CN202311093501.4A
Publication of CN117056490A
Application granted
Publication of CN117056490B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34 Browsing; Visualisation therefor
    • G06F16/345 Summarisation for human users
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation
    • G06N5/022 Knowledge engineering; Knowledge acquisition
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models
    • G06N5/041 Abduction
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/02 Banking, e.g. interest calculation or account maintenance
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Business, Economics & Management (AREA)
  • Accounting & Taxation (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Finance (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Development Economics (AREA)
  • Strategic Management (AREA)
  • Technology Law (AREA)
  • General Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Economics (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

This invention discloses a method, apparatus, medium, and device for question extraction and answer generation. It segments and vectorizes text information in bank data documents, obtaining multiple text vectors per page; it summarizes each text vector to obtain multiple target summaries; it applies a preset question format to each target summary and feeds it into a natural language processing model to obtain multiple initial questions and answers; it deduplicates and sorts the initial questions and answers, outputting the target question and answer corresponding to each target summary. This invention can parse bank data documents and automatically generate relevant questions and answers, helping users quickly understand document content and stay abreast of industry information.

Description

Question extraction and answer generation method, device, medium and equipment
Technical Field
The invention relates to the technical field of banking, and in particular to a method, an apparatus, a medium and a device for question extraction and answer generation.
Background
Intelligent document parsing is currently an important business scenario for banks, but parsing long bank data documents is difficult. If a document is too long, the user does not know what it contains and therefore cannot pose the relevant questions for the system to answer. In this situation, automatically generating relevant questions and answers helps the user understand the document content quickly and keep up with industry developments.
Disclosure of Invention
In view of the above, it is necessary to provide a method, an apparatus, a medium and a device for question extraction and answer generation, so as to solve the problem that relevant questions and answers are difficult to extract automatically from a document when the document is too long.
A method of question extraction and answer generation, the method comprising:
performing a segmentation operation and a vectorization operation on text information in a bank data document to obtain a plurality of text vectors corresponding to each page in the bank data document;
summarizing each text vector to form a corresponding target summary, so as to obtain a plurality of target summaries;
sending each target summary into a natural language processing model according to a preset question format, so as to generate a plurality of different initial questions and initial answers corresponding to each target summary;
and performing de-duplication screening on the initial questions and the initial answers, sorting the screened initial questions and initial answers by word count, and outputting the target questions and target answers corresponding to each target summary based on the sorting result.
In one embodiment, performing the segmentation operation and the vectorization operation on the text information in the bank data document to obtain the plurality of text vectors corresponding to each page includes:
reading the text information of each page in the bank data document, and taking the text information of a target page together with the floating variable of its adjacent pages as the text block of the target page, so as to obtain text blocks for multiple pages, wherein the target page is any page in the bank data document, an adjacent page is the previous page and/or the next page of the target page, and the floating variable is the text information of a preset length in the adjacent page that adjoins the target page;
dividing the text block of each page, with word count and/or paragraph as the dividing unit, to obtain the sub-text blocks of each page;
and embedding the sub-text blocks of each page into corresponding text vectors using a preset sentence similarity model.
In one embodiment, summarizing each text vector to form the corresponding target summary includes:
inputting each vector block into a preset summary generation model and obtaining the initial summary it outputs, wherein each vector block consists of a sub-text block and its corresponding text vector;
and extracting from each initial summary, based on sentence boundaries and/or keywords, a target summary that corresponds to that initial summary and is shorter than a preset word count threshold.
In one embodiment, performing the de-duplication screening on the initial questions and the initial answers includes:
calculating the cosine similarity between every two initial questions among all the initial questions and sorting by it, to obtain a first sorting result composed of a plurality of question pairs;
calculating the cosine similarity between every two initial answers among all the initial answers and sorting by it, to obtain a second sorting result composed of a plurality of answer pairs;
in the first sorting result, performing a first de-duplication operation on the N question pairs with the highest cosine similarity and their corresponding initial answers, wherein N is a preset value and the first de-duplication operation deletes either initial question in each question pair together with its corresponding initial answer;
and in the second sorting result, performing a second de-duplication operation on the M answer pairs with the highest cosine similarity and their corresponding initial questions, wherein M is a preset value and the second de-duplication operation deletes either initial answer in each answer pair together with its corresponding initial question.
In one embodiment, sorting the screened initial questions and initial answers by word count and outputting the target questions and target answers corresponding to each target summary based on the sorting result includes:
returning the L answers with the highest word counts as target answers, and outputting the questions corresponding to those target answers as target questions together with the target answers, wherein L is a preset value.
A question extraction and answer generation apparatus, the apparatus comprising:
a text vector generation module, configured to perform a segmentation operation and a vectorization operation on text information in a bank data document to obtain a plurality of text vectors corresponding to each page in the bank data document;
a target summary generation module, configured to summarize each text vector to form a corresponding target summary, so as to obtain a plurality of target summaries;
and a result output module, configured to send each target summary into a natural language processing model according to a preset question format so as to generate a plurality of different initial questions and initial answers corresponding to each target summary, perform de-duplication screening on the initial questions and the initial answers, sort the screened initial questions and initial answers by word count, and output the target questions and target answers corresponding to each target summary based on the sorting result.
In one embodiment, the text vector generation module is specifically configured to:
read the text information of each page in the bank data document, and take the text information of a target page together with the floating variable of its adjacent pages as the text block of the target page, so as to obtain text blocks for multiple pages, wherein the target page is any page in the bank data document, an adjacent page is the previous page and/or the next page of the target page, and the floating variable is the text information of a preset length in the adjacent page that adjoins the target page;
divide the text block of each page, with word count and/or paragraph as the dividing unit, to obtain the sub-text blocks of each page;
and embed the sub-text blocks of each page into corresponding text vectors using a preset sentence similarity model.
In one embodiment, the target summary generation module is specifically configured to:
input each vector block into a preset summary generation model and obtain the initial summary it outputs, wherein each vector block consists of a sub-text block and its corresponding text vector;
and extract from each initial summary, based on sentence boundaries and/or keywords, a target summary that corresponds to that initial summary and is shorter than a preset word count threshold.
A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the question extraction and answer generation method described above.
A question extraction and answer generation device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the question extraction and answer generation method described above.
The invention provides a method, an apparatus, a medium and a device for question extraction and answer generation, which segment and vectorize text information in a bank data document to obtain a plurality of text vectors for each page, summarize each text vector to obtain a plurality of target summaries, send each target summary into a natural language processing model using a preset question format to obtain a plurality of initial questions and answers, de-duplicate and sort the initial questions and answers, and output the target questions and answers corresponding to each target summary. The invention can parse bank data documents and automatically generate relevant questions and answers, helping users understand document content quickly and keep up with industry developments.
Drawings
In order to illustrate the embodiments of the invention or the technical solutions in the prior art more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the invention, and a person skilled in the art may obtain other drawings from them without inventive effort.
Wherein:
FIG. 1 is a flow chart of a question extraction and answer generation method;
FIG. 2 is a schematic diagram of a question extraction and answer generation apparatus;
FIG. 3 is a structural block diagram of a question extraction and answer generation device.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The terms first, second and the like in the description and in the claims and in the above-described figures are used for distinguishing between different objects and not necessarily for describing a sequential or chronological order. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
As shown in FIG. 1, which is a flow chart of a question extraction and answer generation method in one embodiment, the method in this embodiment includes the following steps:
S101, performing a segmentation operation and a vectorization operation on text information in a bank data document to obtain a plurality of text vectors corresponding to each page in the bank data document.
Here, a bank data document is a document containing bank-related data and information, such as a bank profile, a business description or a financial report. The segmentation operation divides text information into several parts according to some rule or standard, with each part containing certain semantic information. For example, we can segment text information by paragraph, sentence, punctuation, and so on. The vectorization operation converts text information into numeric vectors for computer processing and analysis. For example, we can use word embedding techniques to map each word or phrase to a point in a high-dimensional space, giving a word vector or phrase vector.
In one implementation, the plurality of text vectors corresponding to each page are obtained through the following specific steps:
(1) Reading the text information of each page in the bank data document, and taking the text information of the target page together with the floating variable of its adjacent pages as the text block of the target page, so as to obtain text blocks for multiple pages.
The target page is any page in the bank data document, an adjacent page is the previous page and/or the next page of the target page, and the floating variable is the text information of a preset length in the adjacent page that adjoins the target page.
This step extends the text information of each page with some surrounding context so that more semantic information is captured. For example, assume that a bank data document has three pages, each containing one passage of text, as follows:
Part One: Bank profile
A bank is a financial institution that primarily engages in deposits, loans, payments, settlement, credit cards and other business. Banks are a core component of the financial system and play an important role in economic development and social stability.
Part Two: Banking business
Banking business is mainly divided into two major categories: deposit business and loan business. Deposit business is the business in which a bank accepts funds deposited by a customer and pays interest as agreed. Loan business is the business in which a bank provides funds to a customer and charges interest and fees.
Part Three: Bank risk
Bank risk refers to the loss or damage that a bank may suffer in the course of its operations. Bank risk mainly includes credit risk, market risk, liquidity risk, operational risk, and the like. Banks need effective risk management measures to ensure asset security and profitability.
Assuming we select the second page as the target page, we can take the last sentence of the first page and the first sentence of the third page as floating variables and splice them onto the text information of the second page to form the text block of the second page, as follows:
Banks are a core component of the financial system and play an important role in economic development and social stability. Banking business is mainly divided into two major categories: deposit business and loan business. Deposit business is the business in which a bank accepts funds deposited by a customer and pays interest as agreed. Loan business is the business in which a bank provides funds to a customer and charges interest and fees. Bank risk refers to the loss or damage that a bank may suffer in the course of its operations.
The other two pages can be handled in the same way, giving three text blocks in total. The preset length can of course be set as required.
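By way of illustration only, the page-extension step can be sketched in Python as follows. The function name build_page_blocks and the parameter float_len are hypothetical, and the floating variable is measured here in characters, which is an assumption; the description only says "a preset length".

```python
from typing import List


def build_page_blocks(pages: List[str], float_len: int = 80) -> List[str]:
    """Extend each page with up to float_len characters from its neighbours.

    Assumption: the "floating variable" is the tail of the previous page
    and the head of the next page, measured in characters.
    """
    blocks = []
    for i, text in enumerate(pages):
        # Tail of the previous page, i.e. the text adjoining the target page.
        prev_tail = pages[i - 1][-float_len:] if i > 0 and float_len > 0 else ""
        # Head of the next page.
        next_head = pages[i + 1][:float_len] if i + 1 < len(pages) else ""
        parts = [part for part in (prev_tail, text, next_head) if part]
        blocks.append(" ".join(parts))
    return blocks


pages = ["A1. A2.", "B1. B2.", "C1. C2."]
blocks = build_page_blocks(pages, float_len=3)
print(blocks[1])  # A2. B1. B2. C1.
```

The first and last pages have only one neighbour, so they are extended on one side only.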
(2) Dividing the text block of each page, with word count and/or paragraph as the dividing unit, to obtain the sub-text blocks of each page.
This step further refines each text block into sub-text blocks so that finer-grained semantic information can be extracted. For example, we can divide the text block of the second page by paragraph or by a fixed number of words, as follows:
Sub-text block 1: Banks are a core component of the financial system and play an important role in economic development and social stability.
Sub-text block 2: Banking business is mainly divided into two major categories: deposit business and loan business.
Sub-text block 3: Banks are a core component of the financial system and play an important role in economic development and social stability.
Sub-text block 4: Banking business is mainly divided into two major categories: deposit business and loan business. Deposit business is the business in which a bank accepts funds deposited by a customer and pays interest as agreed.
Sub-text block 5: Loan business is the business in which a bank provides funds to a customer and charges interest and fees.
Sub-text block 6: Bank risk refers to the loss or damage that a bank may suffer in the course of its operations.
Sub-text block 7: Deposit business is the business in which a bank accepts funds deposited by a customer and pays interest as agreed.
Sub-text block 8: Bank risk refers to the loss or damage that a bank may suffer in the course of its operations.
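A minimal sketch of this division step is given below, assuming paragraphs are separated by blank lines and words by whitespace (for Chinese text a character count or a tokenizer would be needed instead). The name split_block and the parameter max_words are hypothetical.

```python
from typing import List


def split_block(block: str, max_words: int = 40) -> List[str]:
    """Split a text block by paragraph, then cap each piece at max_words words."""
    sub_blocks = []
    for paragraph in block.split("\n\n"):
        words = paragraph.split()
        # Walk through the paragraph in windows of at most max_words words.
        for start in range(0, len(words), max_words):
            sub_blocks.append(" ".join(words[start:start + max_words]))
    return sub_blocks


sample = "one two three four five\n\nsix seven"
sub_blocks = split_block(sample, max_words=3)
print(sub_blocks)  # ['one two three', 'four five', 'six seven']
```

Using both units together means a long paragraph is never merged with its neighbour, while no sub-text block exceeds the word limit.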
(3) Embedding the sub-text blocks of each page into corresponding text vectors using a preset sentence similarity model.
This step converts each sub-text block into a numeric vector for computer processing and analysis. For example, the SimCSE model may be selected; SimCSE is a sentence similarity model based on contrastive learning that learns semantic representations of sentences through self-prediction. For example, we can use the SimCSE model to embed each sub-text block of the second page as a 768-dimensional vector, as follows (only a few components of each vector are shown):
Sub-text block 1 -> text vector 1: [0.12, -0.34, -0.45]
Sub-text block 2 -> text vector 2: [-0.23, 0.56, -0.67]
Sub-text block 3 -> text vector 3: [0.34, -0.78, -0.89]
Sub-text block 4 -> text vector 4: [-0.23, 0.56, -0.67]
Sub-text block 5 -> text vector 5: [0.45, -0.89, -0.12]
Sub-text block 6 -> text vector 6: [0.67, -0.34, -0.45]
Sub-text block 7 -> text vector 7: [0.78, -0.12, -0.56]
Sub-text block 8 -> text vector 8: [0.45, -0.89, ..., -0.50]
Thus, a plurality of text vectors corresponding to each page are obtained, and the text vectors can reflect the semantic information of each sub-text block and can also be used for subsequent calculation and analysis.
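To make the shape of this step concrete, the sketch below maps a sub-text block to a fixed-length, unit-norm vector. It does not call SimCSE: toy_embed is a hypothetical stand-in based on feature hashing, chosen only so that the example runs with the standard library, and the dimension of 16 is arbitrary. A real system would replace it with the preset sentence similarity model.

```python
import hashlib
import math
from typing import List


def toy_embed(text: str, dim: int = 16) -> List[float]:
    """Stand-in embedding: hash each token into one of dim signed buckets."""
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % dim
        sign = 1.0 if digest[4] % 2 == 0 else -1.0
        vec[index] += sign
    # Normalise to unit length so vectors can be compared by cosine similarity.
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


vector = toy_embed("bank")
print(len(vector))  # 16
```

Unlike a learned model, this stand-in captures only token overlap and no semantics; it illustrates the input and output types of the step, nothing more.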
S102, summarizing each text vector to form a corresponding target summary, so as to obtain a plurality of target summaries.
Step S102 is a natural language processing task whose purpose is to condense the semantic content of each text vector into a short target summary that captures the main information of the text. The target summary may be a sentence, a phrase or a word.
In one implementation, the specific steps of forming the corresponding target summary include:
(1) Inputting each vector block into a preset summary generation model and obtaining the initial summary it outputs.
Each vector block is composed of a sub-text block and its corresponding text vector. This sub-step uses a preset summary generation model, such as the GPT-2 model, to generate an initial summary from the text information and numeric representation of each vector block.
Illustratively, taking bank cards as another example:
Vector block 1 -> initial summary 1: A bank card is an electronic payment tool issued by a bank. It can store the user's funds and personal information and makes it convenient for the user to make purchases or transfers at an automatic teller machine, a POS terminal or an online platform. Bank cards typically come in two forms, magnetic stripe and chip, with chip cards being safer and more stable than magnetic stripe cards.
Vector block 2 -> initial summary 2: There are many types of bank card, mainly debit cards, credit cards and quasi-credit cards. A debit card can be used only after the user has deposited funds; a credit card is one for which the bank grants the user a credit limit, with the user spending first and repaying afterwards; a quasi-credit card is one for which the bank grants the user a pre-authorized amount, within which the user may overdraw. Different card types have different functions and fees, and users should choose a suitable card type according to their own needs and means.
Vector block 3 -> initial summary 3: Bank cards should be used safely and sensibly: avoid disclosing passwords or personal information, and repay or check balances promptly. When using a bank card, keep the card and password safe, do not write the password on the card or tell it to others, and do not enter bank card information on unsafe devices or websites. When using a credit card, pay attention to the repayment period and interest and clear debts on time to avoid overdue or late fees. When using a debit card, check the balance and transaction records regularly so that any anomalies are found and handled promptly.
(2) Extracting from each initial summary, based on sentence boundaries and/or keywords, a target summary that corresponds to that initial summary and is shorter than a preset word count threshold.
A sentence boundary is the position where a sentence begins or ends, usually marked by punctuation. Based on the length and position of sentences, we can select the sentence that best represents the main topic, or merge or delete several sentences, to form a concise target summary. Optionally, the preset word count threshold here may be 50 words. For example:
Initial summary 1 -> target summary 1: A bank card is an electronic payment tool that can store funds and information and supports a variety of operations. Chip cards are safer than magnetic stripe cards.
Keywords are words that reflect the topic or core content of the text. Based on the keywords or phrases in the original text, we can select the words that best reflect its topic and content, or combine or replace several words, to form a refined target summary. Optionally, the preset word count threshold here may be 50 words. For example:
Initial summary 2 -> target summary 2: Bank cards are classified as debit cards, credit cards and quasi-credit cards. Debit cards require a deposit, credit cards require repayment, and quasi-credit cards may be overdrawn. Different card types have different characteristics and costs.
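The sentence-boundary variant of this extraction can be sketched as follows. This is one simple policy among many, assumed here for illustration: keep leading sentences while the running word count stays strictly below the threshold. The name extract_target_summary is hypothetical, and sentences are assumed to end in ".", "!" or "?" followed by whitespace.

```python
import re


def extract_target_summary(initial_summary: str, max_words: int = 50) -> str:
    """Keep leading sentences while the total stays below max_words words."""
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", initial_summary) if s.strip()]
    kept, count = [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Stop as soon as adding the sentence would reach the threshold.
        if count + n_words >= max_words:
            break
        kept.append(sentence)
        count += n_words
    return " ".join(kept)


text = ("Bank cards store funds. Chip cards are safer. "
        "Users should pick a card type carefully.")
summary = extract_target_summary(text, max_words=10)
print(summary)  # Bank cards store funds. Chip cards are safer.
```

If the first sentence alone reaches the threshold, this policy returns an empty string; a production version would need a fallback such as truncating that sentence or switching to keyword extraction.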
S103, sending each target summary into a natural language processing model according to a preset question format, so as to generate a plurality of different initial questions and initial answers corresponding to each target summary.
Specifically, assume the text is as follows: a bank card is an electronic payment tool issued by a bank, which can store the user's funds and personal information and makes it convenient for the user to make purchases or transfers at an automatic teller machine, a POS terminal or an online platform. Bank cards typically come in two forms, magnetic stripe and chip, with chip cards being safer and more stable than magnetic stripe cards.
We can then use different methods to generate initial questions and initial answers related to the text, such as methods based on a pre-trained language model, which is pre-trained on a large corpus to learn general knowledge and regularities of the language and then fine-tuned on a specific task to improve the quality and efficiency of generation. For example, we can take a pre-trained language model such as BERT or GPT-2, add a special tag (e.g., [Q]) to the text, and let the model generate a corresponding question based on the tag. Possible initial questions and initial answers include:
Initial question: What two forms do bank cards come in? Initial answer: Bank cards typically come in two forms, magnetic stripe and chip.
Initial question: What is a bank card? Initial answer: A bank card is an electronic payment tool issued by a bank, which can store the user's funds and personal information and makes it convenient for the user to make purchases or transfers at an automatic teller machine, a POS terminal or an online platform.
Initial question: What advantages does a chip card have over a magnetic stripe card? Initial answer: A chip card is safer and more stable than a magnetic stripe card.
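The description does not give the wording of the preset question format, so the sketch below uses a purely hypothetical template (PROMPT_TEMPLATE) and a hypothetical "Q:/A:" output convention to show how a target summary could be wrapped before being sent to a model and how the reply could be split into question-answer pairs. No model is called; model_output is a hand-written stand-in for a model reply.

```python
import re
from typing import List, Tuple

# Hypothetical question format; the actual preset format is not specified.
PROMPT_TEMPLATE = (
    "Read the summary below and write {k} different question-answer pairs.\n"
    "Write each pair as 'Q: ...' on one line and 'A: ...' on the next.\n\n"
    "Summary: {summary}"
)


def build_prompt(summary: str, k: int = 3) -> str:
    """Wrap a target summary in the question format."""
    return PROMPT_TEMPLATE.format(k=k, summary=summary)


def parse_qa(model_output: str) -> List[Tuple[str, str]]:
    """Extract (question, answer) pairs from 'Q:' / 'A:' lines."""
    pattern = re.compile(r"Q:\s*(.+?)\s*\nA:\s*(.+)")
    return [(q.strip(), a.strip()) for q, a in pattern.findall(model_output)]


prompt = build_prompt("Chip cards are safer than magnetic stripe cards.", k=2)
model_output = (
    "Q: What is a bank card?\nA: An electronic payment tool.\n"
    "Q: What forms do bank cards take?\nA: Magnetic stripe and chip."
)
pairs = parse_qa(model_output)
print(pairs[0])  # ('What is a bank card?', 'An electronic payment tool.')
```

Fixing the output convention in the prompt is what makes the reply machine-parseable, so that each initial question stays linked to its initial answer in later steps.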
S104, performing de-duplication screening on the initial questions and the initial answers, sorting the screened initial questions and initial answers by word count, and outputting the target questions and target answers corresponding to each target summary based on the sorting result.
It will be appreciated that the initial questions and initial answers generated in step S103 may contain repeated or similar content, which is of no use to the user, so a series of screening operations may be performed.
In one implementation, the de-duplication screening of the initial questions and the initial answers includes:
(1) Calculating the cosine similarity between every two initial questions among all the initial questions and sorting by it, to obtain a first sorting result composed of a plurality of question pairs.
This step finds the pairs of initial questions with the highest similarity, that is, the initial questions whose content is repeated or similar. Cosine similarity evaluates how similar two vectors are by calculating the cosine of the angle between them. Here each initial question may be treated as a vector of words, which may be weighted by word frequency, TF-IDF, and so on. Calculating the cosine similarity between two initial questions gives their degree of similarity, with a value closer to 1 indicating greater similarity and a value closer to 0 indicating less relevance. The cosine similarity is calculated for every pair of initial questions and the pairs are sorted in descending order to obtain the first sorting result, which is a list of question pairs, each with its corresponding cosine similarity value.
(2) Calculating the cosine similarity between every two initial answers among all the initial answers and sorting by it, to obtain a second sorting result composed of a plurality of answer pairs.
This step finds the pairs of initial answers with the highest similarity, that is, the initial answers whose content is repeated or similar. The cosine similarity is calculated in the same way as in the previous step, with initial answers in place of initial questions.
(3) In the first sorting result, perform a first de-duplication operation on the first N question pairs with the greatest cosine similarity and their corresponding initial answers, where N is a preset value and the first de-duplication operation deletes either initial question in each question pair together with its corresponding initial answer.
This step is to delete duplicate or similar initial questions and their corresponding initial answers to reduce redundant information. N is a preset value that indicates how many question pairs to delete.
(4) In the second sorting result, perform a second de-duplication operation on the first M answer pairs with the greatest cosine similarity and their corresponding initial questions, where M is a preset value and the second de-duplication operation deletes either initial answer in each answer pair together with its corresponding initial question.
This step is to further delete those repeated or similar initial answers and their corresponding initial questions to further reduce redundant information. M is a preset value that indicates how many answer pairs to delete.
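Steps (1) to (4) can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the patented implementation: words are weighted by raw frequency (the text also allows TF-IDF), tokens are taken with a simple regular expression (Chinese text would need a proper word segmenter), the later member of each pair is the one deleted (the text allows either), and a pair is skipped if one of its members has already been deleted. The names `term_vector`, `rank_pairs` and `dedupe` are hypothetical.

```python
import math
import re
from collections import Counter
from itertools import combinations
from typing import List, Tuple


def term_vector(text: str) -> Counter:
    """Bag-of-words vector weighted by raw word frequency."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank_pairs(texts: List[str]) -> List[Tuple[float, int, int]]:
    """Score every pair of texts and sort the pairs from most to least similar."""
    vectors = [term_vector(text) for text in texts]
    scored = [
        (cosine_similarity(vectors[i], vectors[j]), i, j)
        for i, j in combinations(range(len(texts)), 2)
    ]
    return sorted(scored, key=lambda item: item[0], reverse=True)


def dedupe(questions: List[str], answers: List[str], n: int, m: int) -> Tuple[List[str], List[str]]:
    """Delete one member of the top-n question pairs and top-m answer pairs, with its partner."""
    removed = set()
    for texts, limit in ((questions, n), (answers, m)):
        for _, first, second in rank_pairs(texts)[:limit]:
            if first in removed or second in removed:
                continue  # assumption: the pair is no longer a duplicate once one member is gone
            removed.add(second)  # assumption: drop the later item of the pair
    kept = [i for i in range(len(questions)) if i not in removed]
    return [questions[i] for i in kept], [answers[i] for i in kept]
```

Because a question and its answer share the same index, removing an index deletes the question and its corresponding answer together, as both de-duplication operations require.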
In one implementation, sorting the screened initial questions and initial answers by word count and outputting the target questions and target answers corresponding to each target abstract based on the sorting result includes: returning the first L answers with the largest word counts as target answers, and outputting the questions corresponding to those target answers as target questions together with the target answers, where L is a preset value.
This step selects, from the initial questions and initial answers that remain after de-duplication screening, the target questions and target answers that best cover the content of the target abstract, and outputs them. L is a preset value indicating how many answers to select. The initial answers are sorted by word count in descending order, the first L answers with the largest word counts are selected as target answers, and their corresponding initial questions are output as target questions together with the target answers.
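The selection by word count can be illustrated with a short sketch. The function name `select_targets` is hypothetical, and counting words by splitting on whitespace is an assumption; for Chinese text a character count or a word segmenter would be needed instead.

```python
from typing import List, Tuple


def select_targets(questions: List[str], answers: List[str], top_l: int) -> List[Tuple[str, str]]:
    """Return the top_l (question, answer) pairs whose answers contain the most words."""
    pairs = list(zip(questions, answers))
    # Sort by answer word count, longest first; ties keep their original order.
    pairs.sort(key=lambda pair: len(pair[1].split()), reverse=True)
    return pairs[:top_l]
```

Sorting the pairs rather than the answers alone keeps each target question attached to its target answer.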
In summary, the method for question extraction and answer generation segments and vectorizes the text information in a bank data document to obtain a plurality of text vectors for each page, summarizes each text vector to obtain a plurality of target abstracts, sends each target abstract to a natural language processing model in a preset question format to obtain a plurality of initial questions and initial answers, de-duplicates and sorts the initial questions and initial answers, and outputs the target questions and target answers corresponding to each target abstract. The invention can therefore analyze a bank data document and automatically generate related questions and answers, helping users quickly understand the document content and keep up with industry developments in a timely manner.
In one embodiment, as shown in fig. 2, a device for extracting questions and generating answers is provided, which comprises:
The text vector generation module 201 is configured to perform a segmentation operation and a vectorization operation on text information in a bank data document, so as to obtain a plurality of text vectors corresponding to each page in the bank data document;
The target abstract generating module 202 is configured to respectively summarize and form a target abstract corresponding to each text vector, so as to obtain a plurality of target abstracts;
The result output module 203 is configured to send each target abstract to a natural language processing model according to a preset question format, so as to generate a plurality of different initial questions and initial answers corresponding to each target abstract, perform de-duplication screening on the initial questions and the initial answers, sort the de-duplication screened initial questions and initial answers based on word numbers, and output target questions and target answers corresponding to each target abstract based on the sorting result.
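The division into three modules can be pictured as a small pipeline object. The sketch below is only an illustration of how the modules hand data to one another: the class name `QuestionAnswerDevice`, the field names and the callable signatures are all hypothetical, and the text vectors are omitted so that the modules pass plain strings.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

QAPair = Tuple[str, str]


@dataclass
class QuestionAnswerDevice:
    # Module 201: pages of text -> sub-text blocks (the vectors are omitted in this sketch).
    text_vector_module: Callable[[List[str]], List[str]]
    # Module 202: one sub-text block -> one target abstract.
    summary_module: Callable[[str], str]
    # Module 203: one target abstract -> screened and sorted (question, answer) pairs.
    result_module: Callable[[str], List[QAPair]]

    def run(self, pages: List[str]) -> List[Tuple[str, List[QAPair]]]:
        """Chain the three modules and return each abstract with its question-answer pairs."""
        results = []
        for block in self.text_vector_module(pages):
            summary = self.summary_module(block)
            results.append((summary, self.result_module(summary)))
        return results
```

Passing the modules in as callables means each stage can be swapped or tested in isolation.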
In one embodiment, the text vector generation module 201 is specifically configured to:
Reading the text information of each page in the bank data document, and taking the text information of a target page together with the floating variables of its adjacent pages as the text block of the target page, so as to obtain text blocks for multiple pages, wherein the target page is any page in the bank data document, an adjacent page is the previous page and/or the next page of the target page, and a floating variable is text information of a preset length in an adjacent page that adjoins the target page;
dividing the text blocks of each page by taking the word number and/or the paragraph as a dividing unit to obtain sub text blocks of each page;
and embedding the sub-text blocks of each page into corresponding text vectors by using a preset sentence similarity model.
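The first two operations of this module can be sketched as follows. Several details are assumptions made for illustration: the floating variable is measured in whitespace-separated words (`float_words`), paragraphs are separated by blank lines, and the function names `build_page_blocks` and `split_block` are hypothetical. The embedding step is left out because the source does not name the sentence similarity model.

```python
from typing import List


def build_page_blocks(pages: List[str], float_words: int) -> List[str]:
    """Extend each page with the adjoining words of its previous and next page."""
    blocks = []
    for i, page in enumerate(pages):
        parts = []
        if i > 0 and float_words > 0:
            # Tail of the previous page that adjoins the target page.
            parts.append(" ".join(pages[i - 1].split()[-float_words:]))
        parts.append(page)
        if i + 1 < len(pages) and float_words > 0:
            # Head of the next page that adjoins the target page.
            parts.append(" ".join(pages[i + 1].split()[:float_words]))
        blocks.append(" ".join(parts))
    return blocks


def split_block(block: str, max_words: int) -> List[str]:
    """Split a text block by paragraph, then into chunks of at most max_words words."""
    sub_blocks = []
    for paragraph in block.split("\n\n"):
        words = paragraph.split()
        for start in range(0, len(words), max_words):
            sub_blocks.append(" ".join(words[start:start + max_words]))
    return sub_blocks
```

The overlap with neighbouring pages is what lets a sentence that straddles a page break appear complete in at least one text block.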
In one embodiment, the target summary generating module 202 is specifically configured to:
inputting each vector block into a preset abstract generating model respectively, and obtaining an output initial abstract, wherein each vector block consists of a sub text block and a corresponding text vector;
And extracting each initial abstract based on sentence boundaries and/or keywords respectively to obtain a target abstract which is smaller than a preset word number threshold and corresponds to each initial abstract.
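The extraction based on sentence boundaries can be sketched as follows. This is a hedged illustration only: the function name `trim_summary` is hypothetical, words are counted by whitespace splitting, and whole sentences are kept from the start of the initial abstract while the total stays below the threshold. Keyword-based extraction, which the text also allows, is not shown.

```python
import re

# Split after Latin sentence punctuation followed by whitespace, or directly after CJK punctuation.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+|(?<=[。！？])")


def trim_summary(initial_summary: str, max_words: int) -> str:
    """Keep leading whole sentences while the word count stays below max_words."""
    kept, total = [], 0
    for sentence in SENTENCE_BOUNDARY.split(initial_summary.strip()):
        if not sentence:
            continue  # trailing empty piece produced by the final boundary
        count = len(sentence.split())
        if total + count >= max_words:
            break  # adding this sentence would reach or exceed the threshold
        kept.append(sentence)
        total += count
    return " ".join(kept)
```

Cutting only at sentence boundaries avoids target abstracts that end in the middle of a sentence.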
In one embodiment, the result output module 203 is specifically configured to: among all the initial questions, calculate the cosine similarity between every two initial questions and sort the results to obtain a first sorting result composed of a plurality of question pairs;
among all the initial answers, calculate the cosine similarity between every two initial answers and sort the results to obtain a second sorting result composed of a plurality of answer pairs;
in the first sorting result, perform a first de-duplication operation on the first N question pairs with the greatest cosine similarity and their corresponding initial answers, where N is a preset value and the first de-duplication operation deletes either initial question in each question pair together with its corresponding initial answer;
and in the second sorting result, perform a second de-duplication operation on the first M answer pairs with the greatest cosine similarity and their corresponding initial questions, where M is a preset value and the second de-duplication operation deletes either initial answer in each answer pair together with its corresponding initial question.
The result output module 203 is specifically configured to return the first L answers with the largest number of words as target answers, and output a question corresponding to the target answer as a target question together with the target answer, where L is a preset value.
Fig. 3 is a diagram showing the internal structure of the question extraction and answer generation apparatus in one embodiment. As shown in fig. 3, the question extraction and answer generation apparatus includes a processor, a memory, and a network interface connected through a system bus. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the question extraction and answer generation apparatus stores an operating system, and may also store a computer program that, when executed by the processor, causes the processor to implement the method for question extraction and answer generation. The internal memory may also store a computer program that, when executed by the processor, causes the processor to perform the method for question extraction and answer generation. It will be appreciated by those skilled in the art that the structure shown in fig. 3 is merely a block diagram of the part of the structure relevant to the present application and does not limit the question extraction and answer generation apparatus to which the present application is applied; a specific question extraction and answer generation apparatus may include more or fewer components than those shown in the drawing, may combine some components, or may have a different arrangement of components.
A computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to perform the following steps: performing a segmentation operation and a vectorization operation on text information in a bank data document to obtain a plurality of text vectors corresponding to each page in the bank data document; summarizing a target abstract corresponding to each text vector to obtain a plurality of target abstracts; sending each target abstract to a natural language processing model according to a preset question format to generate a plurality of different initial questions and initial answers corresponding to each target abstract; and performing de-duplication screening on the initial questions and initial answers, sorting the screened initial questions and initial answers by word count, and outputting the target questions and target answers corresponding to each target abstract based on the sorting result.
A question extraction and answer generation device comprises a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, performs the following steps: performing a segmentation operation and a vectorization operation on text information in a bank data document to obtain a plurality of text vectors corresponding to each page in the bank data document; summarizing a target abstract corresponding to each text vector to obtain a plurality of target abstracts; sending each target abstract to a natural language processing model according to a preset question format to generate a plurality of different initial questions and initial answers corresponding to each target abstract; and performing de-duplication screening on the initial questions and initial answers, sorting the screened initial questions and initial answers by word count, and outputting the target questions and target answers corresponding to each target abstract based on the sorting result.
It should be noted that the above-mentioned method, apparatus, device and computer-readable storage medium for question extraction and answer generation belong to a general inventive concept, and the content in the embodiments of the method, apparatus, device and computer-readable storage medium for question extraction and answer generation are applicable to each other.
Those skilled in the art will appreciate that all or part of the methods of the above embodiments may be implemented by a computer program stored in a non-transitory computer-readable storage medium, and that the program, when executed, may comprise the steps of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features involves no contradiction, it should be considered to fall within the scope of this description.
The foregoing examples illustrate only a few embodiments of the application, which are described in detail and are not to be construed as limiting the scope of the application. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.

Claims (9)

1. A method for question extraction and answer generation, the method comprising:
performing a segmentation operation and a vectorization operation on text information in a bank data document to obtain a plurality of text vectors corresponding to each page in the bank data document, which specifically comprises: reading the text information of each page in the bank data document, and taking the text information of a target page together with the floating variables of its adjacent pages as the text block of the target page to obtain text blocks for multiple pages, wherein the target page is any page in the bank data document, an adjacent page is the previous page and/or the next page of the target page, and a floating variable is text information of a preset length in an adjacent page that adjoins the target page; dividing the text block of each page using word count and/or paragraphs as the dividing unit to obtain sub-text blocks of each page; and embedding the sub-text blocks of each page into corresponding text vectors by using a preset sentence similarity model;
summarizing a target abstract corresponding to each text vector to obtain a plurality of target abstracts;
sending each target abstract to a natural language processing model according to a preset question format, so as to generate a plurality of different initial questions and initial answers corresponding to each target abstract;
and performing de-duplication screening on the initial questions and initial answers, sorting the screened initial questions and initial answers by word count, and outputting the target questions and target answers corresponding to each target abstract based on the sorting result.
2. The method of claim 1, wherein the separately summarizing the target summaries corresponding to each text vector comprises:
inputting each vector block into a preset abstract generating model respectively, and obtaining an output initial abstract, wherein each vector block consists of a sub text block and a corresponding text vector;
And extracting each initial abstract based on sentence boundaries and/or keywords respectively to obtain a target abstract which is smaller than a preset word number threshold and corresponds to each initial abstract.
3. The method of claim 1, wherein the de-duplication screening the initial question and the initial answer comprises:
among all the initial questions, calculating the cosine similarity between every two initial questions and sorting the results to obtain a first sorting result composed of a plurality of question pairs;
among all the initial answers, calculating the cosine similarity between every two initial answers and sorting the results to obtain a second sorting result composed of a plurality of answer pairs;
in the first sorting result, performing a first de-duplication operation on the first N question pairs with the greatest cosine similarity and their corresponding initial answers, wherein N is a preset value and the first de-duplication operation deletes either initial question in each question pair together with its corresponding initial answer;
and in the second sorting result, performing a second de-duplication operation on the first M answer pairs with the greatest cosine similarity and their corresponding initial questions, wherein M is a preset value and the second de-duplication operation deletes either initial answer in each answer pair together with its corresponding initial question.
4. The method of claim 1, wherein the ranking the initial questions and the initial answers after the de-duplication filtering based on the word count, and outputting the target questions and the target answers corresponding to each target abstract based on the ranking result, comprises:
And returning the first L answers with the maximum word numbers as target answers, and outputting the questions corresponding to the target answers as target questions together with the target answers, wherein L is a preset value.
5. A question extraction and answer generation device, the device comprising:
the text vector generation module is used for performing a segmentation operation and a vectorization operation on text information in a bank data document, so as to obtain a plurality of text vectors corresponding to each page in the bank data document;
the text vector generation module is specifically used for reading the text information of each page in the bank data document, taking the text information of a target page together with the floating variables of its adjacent pages as the text block of the target page to obtain text blocks for multiple pages, wherein the target page is any page in the bank data document, an adjacent page is the previous page and/or the next page of the target page, and a floating variable is text information of a preset length in an adjacent page that adjoins the target page, and dividing the text block of each page using word count and/or paragraphs as the dividing unit to obtain sub-text blocks of each page;
the target abstract generating module is used for summarizing a target abstract corresponding to each text vector, so as to obtain a plurality of target abstracts;
the result output module is used for sending each target abstract to a natural language processing model according to a preset question format to generate a plurality of different initial questions and initial answers corresponding to each target abstract, performing de-duplication screening on the initial questions and initial answers, sorting the screened initial questions and initial answers by word count, and outputting the target questions and target answers corresponding to each target abstract based on the sorting result.
6. The question extraction and answer generation device according to claim 5, wherein the text vector generation module is specifically configured to:
reading the text information of each page in the bank data document, and taking the text information of a target page together with the floating variables of its adjacent pages as the text block of the target page to obtain text blocks for multiple pages, wherein the target page is any page in the bank data document, an adjacent page is the previous page and/or the next page of the target page, and a floating variable is text information of a preset length in an adjacent page that adjoins the target page;
dividing the text blocks of each page by taking the word number and/or the paragraph as a dividing unit to obtain sub text blocks of each page;
and embedding the sub-text blocks of each page into corresponding text vectors by using a preset sentence similarity model.
7. The question extraction and answer generation device according to claim 5, wherein the target abstract generation module is specifically configured to:
inputting each vector block into a preset abstract generating model respectively, and obtaining an output initial abstract, wherein each vector block consists of a sub text block and a corresponding text vector;
And extracting each initial abstract based on sentence boundaries and/or keywords respectively to obtain a target abstract which is smaller than a preset word number threshold and corresponds to each initial abstract.
8. A computer readable storage medium, characterized in that a computer program is stored, which, when being executed by a processor, causes the processor to perform the steps of the method according to any of claims 1 to 4.
9. A question extraction and answer generation device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the method of any one of claims 1 to 4.
CN202311093501.4A 2023-08-28 2023-08-28 Methods, apparatus, media and equipment for question extraction and answer generation Active CN117056490B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311093501.4A CN117056490B (en) 2023-08-28 2023-08-28 Methods, apparatus, media and equipment for question extraction and answer generation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311093501.4A CN117056490B (en) 2023-08-28 2023-08-28 Methods, apparatus, media and equipment for question extraction and answer generation

Publications (2)

Publication Number Publication Date
CN117056490A CN117056490A (en) 2023-11-14
CN117056490B CN117056490B (en) 2026-03-24

Family

ID=88653332

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311093501.4A Active CN117056490B (en) 2023-08-28 2023-08-28 Methods, apparatus, media and equipment for question extraction and answer generation

Country Status (1)

Country Link
CN (1) CN117056490B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114925184A (en) * 2022-05-16 2022-08-19 阿里巴巴(中国)有限公司 Question-answer pair generation method and device and computer readable storage medium
CN114970563A (en) * 2022-07-28 2022-08-30 山东大学 Chinese question generation method and system fusing content and form diversity

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11222167B2 (en) * 2019-12-19 2022-01-11 Adobe Inc. Generating structured text summaries of digital documents using interactive collaboration
CN112800177B (en) * 2020-12-31 2021-09-07 北京智源人工智能研究院 Method and device for automatic generation of FAQ knowledge base based on complex data types
CN114579796B (en) * 2022-05-06 2022-07-12 北京沃丰时代数据科技有限公司 Machine reading understanding method and device
CN114997138B (en) * 2022-06-20 2024-07-19 壹沓科技(上海)有限公司 Chemical specification analysis method, device, equipment and readable storage medium
KR102436549B1 (en) * 2022-07-20 2022-08-25 (주) 유비커스 Method and apparatus for automatically generating training dataset for faq and chatbot based on natural language processing using deep learning

Also Published As

Publication number Publication date
CN117056490A (en) 2023-11-14

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant