CN117056490B

CN117056490B - Methods, apparatus, media and equipment for question extraction and answer generation

Info

Publication number: CN117056490B
Application number: CN202311093501.4A
Authority: CN
Inventors: 詹乐; 龚静
Original assignee: Ping An Bank Co Ltd
Current assignee: Ping An Bank Co Ltd
Priority date: 2023-08-28
Filing date: 2023-08-28
Publication date: 2026-03-24
Anticipated expiration: 2043-08-28
Also published as: CN117056490A

Abstract

This invention discloses a method, apparatus, medium, and device for question extraction and answer generation. It segments and vectorizes text information in bank data documents, obtaining multiple text vectors per page; it summarizes each text vector to obtain multiple target summaries; it applies a preset question format to each target summary and feeds it into a natural language processing model to obtain multiple initial questions and answers; it deduplicates and sorts the initial questions and answers, outputting the target question and answer corresponding to each target summary. This invention can parse bank data documents and automatically generate relevant questions and answers, helping users quickly understand document content and stay abreast of industry information.

Description

Question extraction and answer generation method, device, medium and equipment

Technical Field

The invention relates to the field of banking technology, and in particular to a method, a device, a medium and equipment for question extraction and answer generation.

Background

Intelligent document parsing is currently an important business scenario for banks, but parsing long bank data documents is difficult. When a document is too long, the user does not know what it contains and therefore cannot pose suitable questions for the system to answer. In this situation, automatically generating relevant questions and answers for the user helps the user quickly understand the document content and keep up with industry developments.

Disclosure of Invention

In view of the above, it is necessary to provide a method, a device, a medium and equipment for question extraction and answer generation, so as to solve the problem that relevant questions and answers are difficult to extract automatically from a document that is too long.

A method of question extraction and answer generation, the method comprising:

segmenting and vectorizing text information in a bank data document to obtain a plurality of text vectors corresponding to each page in the bank data document;

summarizing each text vector to form a corresponding target summary, so as to obtain a plurality of target summaries;

sending each target summary into a natural language processing model according to a preset questioning format, so as to generate a plurality of different initial questions and initial answers corresponding to each target summary;

and performing de-duplication screening on the initial questions and the initial answers, sorting the de-duplicated initial questions and initial answers based on word count, and outputting the target questions and target answers corresponding to each target summary based on the sorting result.

In one embodiment, segmenting and vectorizing the text information in the bank data document to obtain a plurality of text vectors corresponding to each page in the bank data document includes:

reading the text information of each page in the bank data document, and taking the text information of a target page together with a floating variable of an adjacent page as the text block of the target page, so as to obtain text blocks for multiple pages, wherein the target page is any page in the bank data document, the adjacent page is the previous page and/or the next page of the target page, and the floating variable is text information of a preset length in the adjacent page that adjoins the target page;

dividing the text block of each page by taking word count and/or paragraph as the dividing unit to obtain the sub-text blocks of each page;

and embedding the sub-text blocks of each page into corresponding text vectors by using a preset sentence similarity model.

In one embodiment, summarizing each text vector to form a corresponding target summary includes:

inputting each vector block into a preset summary generation model, and obtaining the output initial summary, wherein each vector block consists of a sub-text block and its corresponding text vector;

and extracting from each initial summary based on sentence boundaries and/or keywords to obtain, for each initial summary, a target summary shorter than a preset word count threshold.

In one embodiment, performing de-duplication screening on the initial questions and the initial answers includes:

among all the initial questions, calculating the cosine similarity between every two initial questions and sorting the results to obtain a first sorting result composed of a plurality of question pairs;

among all the initial answers, calculating the cosine similarity between every two initial answers and sorting the results to obtain a second sorting result composed of a plurality of answer pairs;

in the first sorting result, performing a first de-duplication operation on the top N question pairs with the highest cosine similarity and their corresponding initial answers, wherein N is a preset value, and the first de-duplication operation deletes either initial question in each question pair together with its corresponding initial answer;

and in the second sorting result, performing a second de-duplication operation on the top M answer pairs with the highest cosine similarity and their corresponding initial questions, wherein M is a preset value, and the second de-duplication operation deletes either initial answer in each answer pair together with its corresponding initial question.

In one embodiment, sorting the de-duplicated initial questions and initial answers based on word count, and outputting the target questions and target answers corresponding to each target summary based on the sorting result, includes:

returning the L answers with the highest word counts as target answers, and outputting the questions corresponding to the target answers as target questions together with the target answers, wherein L is a preset value.

A question extraction and answer generation device, the device comprising:

the text vector generation module is used for segmenting and vectorizing text information in the bank data document to obtain a plurality of text vectors corresponding to each page in the bank data document;

the target summary generating module is used for summarizing each text vector to form a corresponding target summary, so as to obtain a plurality of target summaries;

the result output module is used for sending each target summary into a natural language processing model according to a preset questioning format to generate a plurality of different initial questions and initial answers corresponding to each target summary, performing de-duplication screening on the initial questions and the initial answers, sorting the de-duplicated initial questions and initial answers based on word count, and outputting the target questions and target answers corresponding to each target summary based on the sorting result.

In one embodiment, the text vector generation module is specifically configured to:

In one embodiment, the target summary generating module is specifically configured to:

A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the question extraction and answer generation method described above.

A question extraction and answer generation device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the question extraction and answer generation method described above.

The invention provides a method, a device, a medium and equipment for question extraction and answer generation. Text information in a bank data document is segmented and vectorized to obtain a plurality of text vectors for each page; each text vector is summarized to obtain a plurality of target summaries; each target summary is sent into a natural language processing model using a preset question format to obtain a plurality of initial questions and answers; and the initial questions and answers are de-duplicated and sorted, and the target questions and answers corresponding to each target summary are output. The invention can parse bank data documents and automatically generate relevant questions and answers, helping users quickly understand the document content and keep up with industry developments.

Drawings

In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

Wherein:

FIG. 1 is a flow chart of a method for question extraction and answer generation;

FIG. 2 is a schematic diagram of a device for question extraction and answer generation;

FIG. 3 is a block diagram showing the structure of a question extraction and answer generation device.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

The terms first, second and the like in the description and in the claims and in the above-described figures are used for distinguishing between different objects and not necessarily for describing a sequential or chronological order. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus.

Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.

FIG. 1 is a flow chart of a method for question extraction and answer generation in one embodiment. The method in this embodiment includes the following steps:

S101, segmenting and vectorizing text information in a bank data document to obtain a plurality of text vectors corresponding to each page in the bank data document.

Here, a bank data document refers to a document containing bank-related data and information, such as a bank profile, a business description or a financial report. The segmentation operation divides the text information into several parts according to some rule or criterion, each part carrying certain semantic information; for example, the text can be segmented by paragraph, sentence or punctuation mark. The vectorization operation converts text information into numeric vectors so that a computer can process and analyze it; for example, word embedding techniques can map each word or phrase to a point in a high-dimensional space, yielding a word vector or phrase vector.

In one implementation, the plurality of text vectors corresponding to each page are obtained through the following steps:

(1) Reading the text information of each page in the bank data document, and taking the text information of the target page together with the floating variable of the adjacent page as the text block of the target page, so as to obtain text blocks for multiple pages.

The target page is any page in the bank data document, the adjacent page is the previous page and/or the next page of the target page, and the floating variable is text information of a preset length in the adjacent page that adjoins the target page.

This step extends the text information of each page with some surrounding context in order to capture more semantic information. For example, assume that a bank data document has three pages, each containing one piece of text, as follows:

Part 1: Bank profile

A bank is a financial institution that primarily engages in deposit, loan, payment, settlement, credit card, and other transactions. Banks are the core components of financial systems and play an important role in economic development and social stability.

Part 2: Banking business

Banking business is mainly divided into two major categories, deposit business and loan business. Deposit business refers to the business that a bank accepts funds deposited by a customer and pays interest according to agreements. Loan business refers to the business that a bank provides funds to a customer and charges interest and commission.

Part 3: Bank risk

Bank risk refers to the loss or damage that a bank may suffer during an operation. The bank risk mainly includes credit risk, market risk, liquidity risk, operation risk, and the like. Effective risk management measures are required by banks to ensure asset security and profitability.

Assuming the second page is selected as the target page, the last sentence of the first page and the first sentence of the third page serve as floating variables and are concatenated with the text information of the second page to form the text block of the second page, as follows:

Banks are the core components of financial systems and play an important role in economic development and social stability. Banking business is mainly divided into two major categories, deposit business and loan business. Deposit business refers to the business that a bank accepts funds deposited by a customer and pays interest according to agreements. Loan business refers to the business that a bank provides funds to a customer and charges interest and commission. Bank risk refers to the loss or damage that a bank may suffer during an operation.

The other two pages can be processed in the same way, giving three text blocks in total. The preset length can of course be set as needed.
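The page-extension step above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `build_page_blocks` and the parameter `overlap_chars` are hypothetical, and the floating variable is measured here in characters, whereas the example above uses one sentence from each neighbor.

```python
from typing import List


def build_page_blocks(pages: List[str], overlap_chars: int = 80) -> List[str]:
    """Extend each page with a fixed-length slice of its neighbouring pages.

    overlap_chars is a hypothetical stand-in for the "preset length" of the
    floating variable; it is an assumption, not a value from the source.
    """
    blocks = []
    for i, page in enumerate(pages):
        # Tail of the previous page: the text that adjoins the top of the target page.
        prev_tail = pages[i - 1][-overlap_chars:] if i > 0 and overlap_chars > 0 else ""
        # Head of the next page: the text that adjoins the bottom of the target page.
        next_head = pages[i + 1][:overlap_chars] if i + 1 < len(pages) else ""
        parts = [p.strip() for p in (prev_tail, page, next_head) if p.strip()]
        blocks.append(" ".join(parts))
    return blocks


pages = ["A1. A2.", "B1. B2.", "C1. C2."]
blocks = build_page_blocks(pages, overlap_chars=3)
print(blocks[1])
```

The first and last pages have only one neighbor, so their text blocks are extended on one side only.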

(2) Dividing the text block of each page by taking word count and/or paragraph as the dividing unit to obtain the sub-text blocks of each page.

This step further refines each text block into sub-text blocks in order to extract finer-grained semantic information. For example, the text block of the second page can be divided by paragraph or by a fixed number of words as follows:

Sub-text block 1: Banks are the core components of financial systems and play an important role in economic development and social stability.

Sub-text block 2: Banking business is mainly divided into two major categories, deposit business and loan business.

Sub-text block 3: Banks are the core components of financial systems and play an important role in economic development and social stability.

Sub-text block 4: Banking business is mainly divided into two major categories, deposit business and loan business. Deposit business refers to the business that a bank accepts funds deposited by a customer and pays interest according to agreements.

Sub-text block 5: Loan business refers to the business that a bank provides funds to a customer and charges interest and commission.

Sub-text block 6: Bank risk refers to the loss or damage that a bank may suffer during an operation.

Sub-text block 7: Deposit business refers to the business that a bank accepts funds deposited by a customer and pays interest according to agreements.

Sub-text block 8: Bank risk refers to the loss or damage that a bank may suffer during an operation.
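A minimal sketch of dividing a text block by paragraph and by word count is given below. It assumes that paragraphs are separated by blank lines and that words are whitespace-delimited (Chinese text would need a character count or a tokenizer instead). The name `split_block` and the parameter `max_words` are hypothetical.

```python
from typing import List


def split_block(block: str, max_words: int = 40) -> List[str]:
    """Split a page text block into sub-text blocks.

    Paragraphs (separated by blank lines) are split first; any paragraph
    longer than max_words is then cut into consecutive windows of at most
    max_words words.
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    sub_blocks = []
    for paragraph in block.split("\n\n"):
        words = paragraph.split()
        for start in range(0, len(words), max_words):
            sub_blocks.append(" ".join(words[start:start + max_words]))
    return sub_blocks


sub_blocks = split_block("one two three four five\n\nsix seven", max_words=3)
print(sub_blocks)
```

Unlike the example above, this sketch produces non-overlapping sub-text blocks; overlapping windows would be an equally valid choice.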

(3) Embedding the sub-text blocks of each page into corresponding text vectors by using a preset sentence similarity model.

This step converts each sub-text block into a numeric vector for computer processing and analysis. For example, the SimCSE model may be selected; SimCSE is a sentence similarity model based on contrastive learning that learns semantic representations of sentences through self-prediction. Using SimCSE, each sub-text block of the second page can be embedded as a 768-dimensional vector, as follows:

Sub-text block 1 -> text vector 1: [0.12, -0.34, ..., -0.45]

Sub-text block 2 -> text vector 2: [-0.23, 0.56, ..., -0.67]

Sub-text block 3 -> text vector 3: [0.34, -0.78, ..., -0.89]

Sub-text block 4 -> text vector 4: [-0.23, 0.56, ..., -0.67]

Sub-text block 5 -> text vector 5: [0.45, -0.89, ..., -0.12]

Sub-text block 6 -> text vector 6: [0.67, -0.34, ..., -0.45]

Sub-text block 7 -> text vector 7: [0.78, -0.12, ..., -0.56]

Sub-text block 8 -> text vector 8: [0.45, -0.89, ..., -0.50]

Thus, a plurality of text vectors corresponding to each page are obtained, and the text vectors can reflect the semantic information of each sub-text block and can also be used for subsequent calculation and analysis.
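The pairing of sub-text blocks with vectors can be sketched as below. The encoder is deliberately pluggable: `toy_encoder` is a hashed bag-of-words stand-in written only so that the example runs without external models, and it is not SimCSE. In practice a real sentence encoder would be passed in its place. All names here are hypothetical.

```python
import hashlib
import math
from typing import Callable, List, Tuple

Vector = List[float]


def toy_encoder(text: str, dim: int = 768) -> Vector:
    """Stand-in encoder (NOT SimCSE): hashed bag-of-words, L2-normalised."""
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        vec[int(digest, 16) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def embed_sub_blocks(
    sub_blocks: List[str],
    encoder: Callable[[str], Vector] = toy_encoder,
) -> List[Tuple[str, Vector]]:
    """Pair every sub-text block with the vector produced by the encoder."""
    return [(text, encoder(text)) for text in sub_blocks]


vector_blocks = embed_sub_blocks(["deposit business", "loan business"])
print(len(vector_blocks), len(vector_blocks[0][1]))
```

Each returned pair keeps the text next to its vector, so that a later summarization step can use both.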

S102, summarizing each text vector to form a corresponding target summary, so as to obtain a plurality of target summaries.

Step S102 is a natural language processing task whose purpose is to produce a short target summary from the semantic content of each text vector, capturing the main information of the text. The target summary may be a sentence, a phrase or a single word.

In one implementation, the specific steps of forming the corresponding target summary include:

(1) Inputting each vector block into a preset summary generation model, and obtaining the output initial summary.

Each vector block is composed of a sub-text block and its corresponding text vector. This sub-step uses a preset summary generation model, such as the GPT-2 model, to generate an initial summary from the text information and numeric representation of each vector block.

As an illustration, consider another example, this time about bank cards:

Vector block 1 -> initial summary 1: A bank card is an electronic payment tool issued by a bank. It can store the user's funds and personal information, and lets the user conveniently make purchases or transfers at an automatic teller machine, a POS terminal or an online platform. Bank cards typically come in magnetic stripe and chip forms, with chip cards being safer and more stable than magnetic stripe cards.

Vector block 2 -> initial summary 2: Bank cards come in many types, mainly debit cards, credit cards and quasi-credit cards. A debit card can be used only after the user has deposited funds. A credit card is one for which the bank grants the user a credit limit, so that the user spends first and repays later. A quasi-credit card is one for which the bank grants the user a pre-authorized amount, within which the user may overdraw. Different card types have different functions and fees, and users should choose a suitable card type according to their own needs and means.

Vector block 3 -> initial summary 3: Bank cards should be used safely and sensibly: avoid disclosing the password or personal information, and repay or check the balance in good time. When using a bank card, keep the card and password safe, do not write the password on the card or tell it to others, and do not enter bank card information on unsafe devices or websites. When using a credit card, pay attention to the repayment period and interest and clear debts on time to avoid overdue or late fees. When using a debit card, check the balance and transaction records regularly so that anomalies are found and handled promptly.

(2) Extracting from each initial summary based on sentence boundaries and/or keywords to obtain, for each initial summary, a target summary shorter than a preset word count threshold.

A sentence boundary is the position where a sentence begins or ends, usually marked by punctuation. Based on the length and position of the sentences, the sentence that best represents the main topic can be selected, or several sentences can be merged or pruned, to form a compact target summary. Optionally, the preset word count threshold here may be 50 words. For example:

Initial summary 1 -> target summary 1: A bank card is an electronic payment tool that can store funds and information and supports various operations. Chip cards are safer than magnetic stripe cards.

Keywords are words that reflect the topic or core content of the text. Words that reflect the subject and content of the original text can be selected from its keywords or phrases, or several words can be combined or replaced, to form a concise target summary. Optionally, the preset word count threshold here may be 50 words. For example:

Initial summary 2 -> target summary 2: Bank cards are classified as debit cards, credit cards and quasi-credit cards. Debit cards require a deposit, credit cards require repayment, and quasi-credit cards may be overdrawn. Different card types have different characteristics and costs.
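The sentence-boundary variant of this extraction can be sketched as follows. The sketch simply keeps leading sentences while the total stays below the threshold, which is one simple policy among several the text allows (selection by position, merging, pruning). The function name `extract_target_summary` is hypothetical, and words are assumed to be whitespace-delimited.

```python
import re
from typing import List


def extract_target_summary(initial_summary: str, max_words: int = 50) -> str:
    """Keep leading sentences while the total word count stays below max_words."""
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", initial_summary) if s.strip()]
    kept: List[str] = []
    total = 0
    for sentence in sentences:
        count = len(sentence.split())
        if total + count >= max_words:
            break
        kept.append(sentence)
        total += count
    if not kept and sentences:
        # Fallback: the first sentence alone is too long, so truncate it.
        return " ".join(sentences[0].split()[: max_words - 1])
    return " ".join(kept)


text = (
    "Bank cards store funds. They support many operations. "
    "Chip cards are safer than magnetic stripe cards."
)
target = extract_target_summary(text, max_words=9)
print(target)
```

A keyword-based variant would instead score sentences by the keywords they contain before applying the same word count limit.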

S103, sending each target summary into a natural language processing model according to a preset question format, so as to generate a plurality of different initial questions and initial answers corresponding to each target summary.

Specifically, assume the text is as follows: a bank card is an electronic payment tool issued by a bank, which can store the user's funds and personal information and lets the user conveniently make purchases or transfers at an automatic teller machine, a POS terminal or an online platform. Bank cards typically come in magnetic stripe and chip forms, with chip cards being safer and more stable than magnetic stripe cards.

Initial questions and initial answers related to this text can then be generated in different ways, for example with a model based on a pre-trained language model. Such a model is pre-trained on a large corpus to learn general knowledge and regularities of the language, and is then fine-tuned on the specific task to improve generation quality and efficiency. For example, a pre-trained language model such as BERT or GPT-2 can be used: a special tag (e.g., [Q]) is added to the text, and the model generates a corresponding question based on the tag. Possible initial questions and initial answers include:

Initial question: What two forms do bank cards take? Initial answer: Bank cards typically come in magnetic stripe and chip forms.

Initial question: What is a bank card? Initial answer: A bank card is an electronic payment tool issued by a bank, which can store the user's funds and personal information and lets the user conveniently make purchases or transfers at an automatic teller machine, a POS terminal or an online platform.

Initial question: What advantages does a chip card have over a magnetic stripe card?
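As a rough illustration of a preset question format, the sketch below wraps a target summary in a fixed prompt and parses question and answer pairs out of the model's reply. The prompt wording, the `Q: ... A: ...` line layout and both function names are assumptions made for this example, since the source does not specify the format. The call to the model itself is omitted, and `sample_output` is hand-written rather than real model output.

```python
import re
from typing import List, Tuple


def build_prompt(target_summary: str, num_pairs: int = 3) -> str:
    """Wrap a target summary in a fixed (hypothetical) questioning format."""
    return (
        f"Text: {target_summary}\n"
        f"Write {num_pairs} different questions about the text, each with its answer.\n"
        "Use one line per pair in the form: Q: <question> A: <answer>"
    )


def parse_qa(model_output: str) -> List[Tuple[str, str]]:
    """Extract (question, answer) pairs from lines shaped 'Q: ... A: ...'."""
    pairs = []
    for line in model_output.splitlines():
        match = re.match(r"\s*Q:\s*(.+?)\s*A:\s*(.+?)\s*$", line)
        if match:
            pairs.append((match.group(1), match.group(2)))
    return pairs


# Hand-written sample reply, used only to exercise the parser.
sample_output = (
    "Q: What is a bank card? A: An electronic payment tool issued by a bank.\n"
    "Q: What two forms do bank cards take? A: Magnetic stripe and chip."
)
qa_pairs = parse_qa(sample_output)
print(qa_pairs)
```

Lines that do not follow the expected layout are skipped, so a malformed reply yields fewer pairs rather than an error.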

S104, performing de-duplication screening on the initial questions and the initial answers, sorting the de-duplicated initial questions and initial answers based on word count, and outputting the target questions and target answers corresponding to each target summary based on the sorting result.

It will be appreciated that the initial questions and initial answers generated in step S103 may contain repeated or similar content, which is of no use to the user, so a series of screening operations may be performed.

In one implementation, the de-duplication screening of the initial questions and the initial answers includes:

(1) Among all the initial questions, calculating the cosine similarity between every two initial questions and sorting the results to obtain a first sorting result composed of a plurality of question pairs.

This step finds the pairs of initial questions with the highest similarity, i.e. initial questions whose content is repeated or similar. Cosine similarity evaluates how similar two vectors are by computing the cosine of the angle between them. Each initial question can be treated here as a vector of words, weighted by word frequency, TF-IDF or a similar scheme. The cosine similarity between two initial questions gives their degree of similarity: the closer to 1, the more similar; the closer to 0, the less related. Cosine similarity is computed for every pair of initial questions and the pairs are sorted from high to low to obtain the first sorting result, which is a list of question pairs, each with its cosine similarity value.

(2) Among all the initial answers, calculating the cosine similarity between every two initial answers and sorting the results to obtain a second sorting result composed of a plurality of answer pairs.

This step finds the pairs of initial answers with the highest similarity, i.e. initial answers whose content is repeated or similar. The cosine similarity is calculated in the same way as in the previous step, with initial answers in place of initial questions.

(3) In the first sorting result, performing a first de-duplication operation on the top N question pairs with the highest cosine similarity and their corresponding initial answers, wherein N is a preset value, and the first de-duplication operation deletes either initial question in each question pair together with its corresponding initial answer.

This step deletes repeated or similar initial questions and their corresponding initial answers to reduce redundant information. N is a preset value indicating how many question pairs are subject to deletion.

(4) In the second sorting result, performing a second de-duplication operation on the top M answer pairs with the highest cosine similarity and their corresponding initial questions, wherein M is a preset value, and the second de-duplication operation deletes either initial answer in each answer pair together with its corresponding initial question.

This step further deletes repeated or similar initial answers and their corresponding initial questions to reduce redundant information further. M is a preset value indicating how many answer pairs are subject to deletion.
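Steps (1) to (4) can be sketched together as below. Several choices here are assumptions made for illustration: raw term-frequency vectors over whitespace-delimited words (TF-IDF would work equally well), always deleting the later member of a pair, and skipping a pair when one of its members has already been deleted. The names `cosine`, `ranked_pairs` and `deduplicate` are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations
from typing import List, Tuple


def cosine(a: str, b: str) -> float:
    """Cosine similarity of two texts using raw term-frequency vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def ranked_pairs(texts: List[str]) -> List[Tuple[float, int, int]]:
    """All index pairs (i < j) with their similarity, sorted from high to low."""
    scored = [(cosine(texts[i], texts[j]), i, j) for i, j in combinations(range(len(texts)), 2)]
    return sorted(scored, key=lambda item: item[0], reverse=True)


def deduplicate(qa: List[Tuple[str, str]], n: int, m: int) -> List[Tuple[str, str]]:
    """Drop one member of the top-n question pairs and of the top-m answer pairs."""
    removed = set()
    # First de-duplication operation: most similar question pairs.
    for _, i, j in ranked_pairs([q for q, _ in qa])[:n]:
        if i not in removed and j not in removed:
            removed.add(j)  # either member may be deleted; this sketch drops the later one
    # Second de-duplication operation: most similar answer pairs.
    for _, i, j in ranked_pairs([a for _, a in qa])[:m]:
        if i not in removed and j not in removed:
            removed.add(j)
    # Removing an index removes the question and its answer together.
    return [pair for idx, pair in enumerate(qa) if idx not in removed]


qa = [
    ("what is a bank card", "a payment tool"),
    ("what is a bank card", "a payment tool issued by banks"),
    ("how are loans priced", "by interest rate"),
]
print(deduplicate(qa, n=1, m=0))
```

Because N and M are fixed counts rather than similarity thresholds, pairs that are not actually similar can still be removed when N or M is set too high; a similarity cutoff would be one way to guard against that.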

In one implementation, sorting the de-duplicated initial questions and initial answers based on word count, and outputting the target questions and target answers corresponding to each target summary based on the sorting result, includes: returning the L answers with the highest word counts as target answers, and outputting the questions corresponding to the target answers as target questions together with the target answers, wherein L is a preset value.

This step selects, from the de-duplicated initial questions and initial answers, the target questions and target answers that best cover the content of the target summary, and outputs them. L is a preset value indicating how many answers to select. The initial answers are sorted by word count from most to fewest, the L answers with the most words are selected as target answers, and the initial questions corresponding to them are output as target questions together with the target answers.
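The final selection can be sketched in a few lines. Word count is approximated here by whitespace-delimited tokens, and the name `select_top_answers` is hypothetical.

```python
from typing import List, Tuple


def select_top_answers(qa: List[Tuple[str, str]], top_l: int) -> List[Tuple[str, str]]:
    """Return the top_l (question, answer) pairs whose answers contain the most words."""
    ranked = sorted(qa, key=lambda pair: len(pair[1].split()), reverse=True)
    return ranked[:top_l]


candidates = [("q1", "one two"), ("q2", "one two three four"), ("q3", "one")]
print(select_top_answers(candidates, top_l=2))
```

Python's sort is stable, so answers with equal word counts keep their original relative order.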

In summary, the method for question extraction and answer generation segments and vectorizes text information in a bank data document to obtain a plurality of text vectors for each page, summarizes each text vector to obtain a plurality of target summaries, sends each target summary into a natural language processing model using a preset question format to obtain a plurality of initial questions and answers, de-duplicates and sorts the initial questions and answers, and outputs the target questions and answers corresponding to each target summary. The invention can therefore parse bank data documents and automatically generate relevant questions and answers, helping users quickly understand the document content and keep up with industry developments.

In one embodiment, as shown in fig. 2, a device for extracting questions and generating answers is provided, which comprises:

The text vector generation module 201 is configured to segment and vectorize text information in a bank data document, so as to obtain a plurality of text vectors corresponding to each page in the bank data document;

The target summary generating module 202 is configured to summarize each text vector to form a corresponding target summary, so as to obtain a plurality of target summaries;

The result output module 203 is configured to send each target summary into a natural language processing model according to a preset question format, so as to generate a plurality of different initial questions and initial answers corresponding to each target summary, perform de-duplication screening on the initial questions and the initial answers, sort the de-duplicated initial questions and initial answers based on word count, and output the target questions and target answers corresponding to each target summary based on the sorting result.

In one embodiment, the text vector generation module 201 is specifically configured to:

In one embodiment, the target summary generating module 202 is specifically configured to:

The result output module 203 is specifically configured to respectively calculate cosine similarity between every two initial questions and order the two initial questions in all the initial questions to obtain a first ordering result composed of a plurality of question pairs;

The result output module 203 is specifically configured to return the L answers with the highest word counts as target answers, and to output the question corresponding to each target answer as a target question together with that target answer, where L is a preset value.
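The word-count selection can be sketched in a few lines. The function name and the whitespace-based word count are assumptions for illustration; Chinese text would need a proper tokenizer or a character count instead.

```python
from typing import List, Tuple


def top_l_by_word_count(
    qa_pairs: List[Tuple[str, str]], top_l: int
) -> List[Tuple[str, str]]:
    """Return the `top_l` (question, answer) pairs with the longest answers."""
    # Assumption: words are whitespace-separated tokens of the answer.
    ranked = sorted(qa_pairs, key=lambda qa: len(qa[1].split()), reverse=True)
    return ranked[:top_l]
```

Because Python's sort is stable, answers with equal word counts keep their original relative order.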

Fig. 3 is a diagram showing the internal structure of the question extraction and answer generation apparatus in one embodiment. As shown in fig. 3, the question extraction and answer generation apparatus includes a processor, a memory, and a network interface connected through a system bus. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the question extraction and answer generation apparatus stores an operating system, and may also store a computer program that, when executed by the processor, causes the processor to implement the method of question extraction and answer generation. The internal memory may also store a computer program that, when executed by the processor, causes the processor to perform the method of question extraction and answer generation. It will be appreciated by those skilled in the art that the structure shown in fig. 3 is merely a block diagram of the part of the structure relevant to the present application and does not limit the question extraction and answer generation apparatus to which the present application is applied; a specific question extraction and answer generation apparatus may include more or fewer components than those shown in the drawing, may combine some components, or may have a different arrangement of components.

A computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to perform the following steps: performing a segmentation operation and a vectorization operation on text information in a bank data document to obtain a plurality of text vectors corresponding to each page in the bank data document; summarizing each text vector into a corresponding target summary to obtain a plurality of target summaries; feeding each target summary into a natural language processing model according to a preset question format to generate a plurality of different initial questions and initial answers corresponding to each target summary; performing de-duplication screening on the initial questions and initial answers; sorting the de-duplicated initial questions and initial answers by word count; and outputting the target question and target answer corresponding to each target summary based on the sorting result.

A device for question extraction and answer generation comprises a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, performs the following steps: performing a segmentation operation and a vectorization operation on text information in a bank data document to obtain a plurality of text vectors corresponding to each page in the bank data document; summarizing each text vector into a corresponding target summary to obtain a plurality of target summaries; feeding each target summary into a natural language processing model according to a preset question format to generate a plurality of different initial questions and initial answers corresponding to each target summary; performing de-duplication screening on the initial questions and initial answers; sorting the de-duplicated initial questions and initial answers by word count; and outputting the target question and target answer corresponding to each target summary based on the sorting result.

It should be noted that the above-mentioned method, apparatus, device and computer-readable storage medium for question extraction and answer generation belong to a single general inventive concept, and the content of the embodiments of the method, apparatus, device and computer-readable storage medium for question extraction and answer generation is mutually applicable.

Those skilled in the art will appreciate that all or part of the processes in the methods of the above embodiments may be implemented by a computer program stored in a non-transitory computer-readable storage medium, and that the program, when executed, may include the steps of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include Read-Only Memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM), among others.

The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features involves no contradiction, it should be considered to fall within the scope of this description.

The foregoing examples illustrate only a few embodiments of the application. Although they are described in specific detail, they are not to be construed as limiting the scope of the application. It should be noted that those skilled in the art can make several variations and modifications without departing from the spirit of the application, all of which fall within the scope of the application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.

Claims

1. A method for question extraction and answer generation, the method comprising:

Performing a segmentation operation and a vectorization operation on text information in a bank data document to obtain a plurality of text vectors corresponding to each page in the bank data document, which specifically comprises: reading the text information of each page in the bank data document; taking the text information of a target page together with the floating variables of its adjacent pages as the text block of the target page, so as to obtain text blocks for multiple pages, wherein the target page is any page in the bank data document, the adjacent pages are the previous page and/or the next page of the target page, and a floating variable is text information of a preset length in an adjacent page that adjoins the target page; dividing the text block of each page, using word count and/or paragraphs as the dividing unit, to obtain the sub text blocks of each page; and embedding the sub text blocks of each page into corresponding text vectors using a preset sentence similarity model;

2. The method of claim 1, wherein the separately summarizing the target summaries corresponding to each text vector comprises:

3. The method of claim 1, wherein the de-duplication screening the initial question and the initial answer comprises:

4. The method of claim 1, wherein the ranking the initial questions and the initial answers after the de-duplication filtering based on the word count, and outputting the target questions and the target answers corresponding to each target abstract based on the ranking result, comprises:

5. A question extraction and answer generation device, the device comprising:

The text vector generation module is specifically used for reading text information of each page in the bank data document, taking the text information of a target page and a floating variable of an adjacent page as text blocks of the target page to obtain text blocks of multiple pages, wherein the target page is any page in the bank data document, the adjacent page is a previous page and/or a next page of the target page, the floating variable is text information of a preset length adjacent to the target page in the adjacent page, and dividing the text blocks of each page by taking the number of words and/or paragraphs as dividing units to obtain sub text blocks of each page;

6. The question extraction and answer generation device according to claim 5, wherein the text vector generation module is specifically configured to:

7. The question extraction and answer generation device according to claim 5, wherein the target abstract generation module is specifically configured to:

8. A computer readable storage medium, characterized in that a computer program is stored, which, when being executed by a processor, causes the processor to perform the steps of the method according to any of claims 1 to 4.

9. A question extraction and answer generation device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the method of any one of claims 1 to 4.