CN119169650A

CN119169650A - A method for generating sequential text bill image question-answering data based on a multimodal large model

Info

Publication number: CN119169650A
Application number: CN202411271331.9A
Authority: CN
Inventors: 刘禹良; 宋家俊; 伏凌; 朱泠皞; 罗琪頔; 黎宇哲; 匡嚞玢; 白翔
Original assignee: Huazhong University of Science and Technology
Current assignee: Huazhong University of Science and Technology
Priority date: 2024-09-11
Filing date: 2024-09-11
Publication date: 2024-12-20
Anticipated expiration: 2044-09-11
Also published as: CN119169650B

Abstract

The invention relates to the technical field of natural language processing and provides a method for generating sequential text bill image question-answer data based on a multimodal large model. The method comprises: describing the content of a target image in detail; having the multimodal large model perform self-question-answering on the text information to generate questions and answers about the text information; having the multimodal large model answer the generated questions a second time while reasoning to produce the basis or reason for each answer; having the multimodal large model judge whether each question-answer pair is related to the text information; performing a consistency check on the question-answer pairs; and deleting irrelevant and inconsistent question-answer pairs. The invention uses a multimodal large model to automatically generate, in large-scale batches, question-answer data about the text information in images in the receipt scenario. The method addresses the shortage, in both quantity and quality, of large-scale text-centric multimodal instruction fine-tuning data for receipts, and avoids the high time and labor cost of manual annotation otherwise needed to obtain large-scale image-text understanding question-answer data.

Description

Multi-mode large model-based sequential text bill image question-answer data generation method

Technical Field

The invention relates to the technical field of natural language processing, in particular to a method for generating sequential text bill image question-answer data based on a multi-mode large model.

Background

In recent years, multimodal large models have made significant progress in the field of artificial intelligence and have broad application prospects. A multimodal large model is an artificial intelligence model capable of processing multiple types of data, including text, images and audio, and it exhibits a high level of intelligence across a variety of tasks. For example, multimodal large models may be used for scene understanding in autonomous driving systems, for responding to voice and image instructions in intelligent assistants, for generating natural language descriptions of images, and for translating and answering questions about a given image by combining text and images. These applications benefit greatly from the models' in-depth understanding of multiple types of data, and of image data in particular. One factor behind the strong image understanding of multimodal large models is that they can recognize and understand text information in images, such as road signs, trademarks and document content.

However, despite breakthroughs in image content understanding and recognition, multimodal large models still face challenges in processing text information in images. Current multimodal models are often less accurate at extracting text information from images than specialized optical character recognition systems, and are deficient in understanding the semantics and context of such text. To overcome these limitations, researchers are actively developing new training strategies and techniques, one of which is instruction fine-tuning. Instruction fine-tuning is additional training of a pre-trained model that enables the model to better understand and execute given task instructions. Such methods typically use a carefully designed set of instruction-output pairs as training data, intended to teach the model how to generate the expected output correctly, which can significantly improve the performance of the pre-trained model on a particular task. By training on an appropriate instruction dataset, the model can be better adapted to a particular application scenario and thus produce more accurate and useful results. How to generate high-quality instruction datasets is therefore a key issue.

However, publicly available instruction fine-tuning question-answer data for receipt images is currently small in quantity and poor in quality. This gap in large-model instruction fine-tuning question-answer datasets degrades the performance of large models on related tasks. Manually acquiring large-scale, high-quality, text-centric instruction fine-tuning question-answer data is difficult and time-consuming. Manual collection and labeling of data is not only costly but also unable to keep up with the needs of rapidly developing multimodal large models.

In view of this, overcoming the drawbacks of the prior art is a problem to be solved in the art.

Disclosure of Invention

The invention aims to provide a method for generating sequential text bill image question-answering data based on a multi-mode large model.

The invention adopts the following technical scheme:

In a first aspect, the invention provides a method for generating sequential text bill image question-answer data based on a multimodal large model, which comprises the following steps:

In step 201, a public receipt dataset is used as the picture data for generating the corresponding text understanding question-answer data, and the labels carried by the dataset are organized to obtain the information of the image data;

In step 202, the target image content is described in detail by using the multimodal large model and the information of the picture data organized in step 201;

In step 203, the information of the picture data obtained in step 201 is used to determine the number of questions about the text information, and of corresponding answers, to be generated for a picture;

In step 204, using the information of the picture data obtained in step 201 and the image content description obtained in step 202, the multimodal large model performs self-question-answering on the text information and generates questions and corresponding answers related to the text information (called question-answer pairs in the subsequent embodiments), wherein the number of generated questions and answers equals the number determined in step 203;

In step 205, using the information of the picture data obtained in step 201 and the image content description obtained in step 202, the multimodal large model answers the questions posed in step 204 a second time and, at the same time, generates by reasoning the basis or reason for each answer;

In step 206, based on the information of the picture data organized in step 201 and the questions and answers generated in step 204, the multimodal large model judges whether each question-answer pair is related to the text information, and if not, the question-answer pair is marked as invalid;

In step 207, a consistency check is performed on the question-answer pairs by using the information of the picture data obtained in step 201, the image content description obtained in step 202, the questions and answers generated in step 204, and the secondary answers and reasons generated in step 205;

In step 208, invalid question-answer pairs are deleted based on the question-answer pair marks obtained in steps 206 and 207.
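As a rough illustration only, the flow of steps 201 to 208 can be sketched in Python as follows. The sketch assumes that each model-dependent step is supplied as a callable; the names `QAPair`, `run_pipeline` and all parameter names are hypothetical placeholders and do not appear in the original disclosure.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class QAPair:
    """One generated question-answer pair plus the results of later steps."""
    question: str
    answer: str
    second_answer: str = ""
    reason: str = ""
    valid: bool = True


def run_pipeline(
    info: dict,                                           # output of step 201
    describe: Callable[[dict], str],                      # step 202
    count_questions: Callable[[dict], int],               # step 203
    self_ask: Callable[[dict, str, int], List[QAPair]],   # step 204
    re_answer: Callable[[dict, str, QAPair], None],       # step 205
    is_relevant: Callable[[dict, QAPair], bool],          # step 206
    is_consistent: Callable[[dict, str, QAPair], bool],   # step 207
) -> List[QAPair]:
    description = describe(info)
    q_num = count_questions(info)
    pairs = self_ask(info, description, q_num)
    for pair in pairs:
        # Step 205 fills in the secondary answer and its reasoning basis.
        re_answer(info, description, pair)
        # Steps 206 and 207 only mark pairs; nothing is deleted yet.
        if not is_relevant(info, pair):
            pair.valid = False
        if not is_consistent(info, description, pair):
            pair.valid = False
    # Step 208: delete every pair that was marked invalid.
    return [pair for pair in pairs if pair.valid]
```

Keeping the marking (steps 206 and 207) separate from the deletion (step 208) mirrors the order of the steps as listed; in practice each callable would wrap a prompt to the multimodal large model.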

Preferably, the generating corresponding text understanding question-answering data by using the public receipt data set as the picture data, and sorting labels carried by the data set to obtain information of the image data specifically includes:

The labels carried by the dataset are organized to obtain the information of the image data, and the key information in the labels is extracted and simplified, wherein the extracted information comprises one or more of: the text content, the position of the text content, the information category represented by the text content, and brief information of the receipt;

The position of the text content is represented by two points, the lower-left corner and the lower-right corner of the text content; the information categories represented by the text content comprise shop name, shop address, commodity name, commodity price and commodity total price; and the brief information of the receipt comprises the shop name, the shop address and the amount spent. In the specific storage format, the text content Text, the information category Category it represents and its position Coordinate are stored together as a tuple (Text, Category, Coordinate), and the brief information of the receipt is stored as a separate dictionary Info_direct: {"shop name": , "shop address": , "article total price": }.
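A minimal sketch of this storage format is given below. The tuple fields follow (Text, Category, Coordinate) and the dictionary keys follow Info_direct as stated above; the class name `TextItem`, the helper `build_info_direct`, the way the dictionary is filled from the tuples, and all sample values are assumptions made purely for illustration.

```python
from typing import Dict, List, NamedTuple, Tuple

Point = Tuple[int, int]


class TextItem(NamedTuple):
    """One labelled text region, stored as (Text, Category, Coordinate)."""
    text: str
    category: str
    coordinate: Tuple[Point, Point]  # two corner points of the text region


def build_info_direct(items: List[TextItem]) -> Dict[str, str]:
    """Collect the brief receipt information into a single dictionary.

    Assumption: the first item of each relevant category supplies the value.
    """
    info = {"shop name": "", "shop address": "", "article total price": ""}
    category_to_key = {
        "shop name": "shop name",
        "shop address": "shop address",
        "commodity total price": "article total price",
    }
    for item in items:
        key = category_to_key.get(item.category)
        if key and not info[key]:
            info[key] = item.text
    return info


# Made-up sample data, not taken from any dataset.
items = [
    TextItem("Sample Mart", "shop name", ((10, 40), (180, 40))),
    TextItem("Cola", "commodity name", ((10, 120), (60, 120))),
    TextItem("3.50", "commodity total price", ((150, 200), (190, 200))),
]
info_direct = build_info_direct(items)
```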

Preferably, the describing the target image content in detail by using the multi-mode big model and the information of the picture data obtained by the arrangement in step 201 specifically includes:

A detailed description of the target image is generated by using the multimodal large model; the instruction prompt of the detailed description generation stage emphasizes attention to the text information in the image, and, to control the total length of the prompt instructions, the generated detailed description is limited to between 70 and 100 words;

The instruction prompt of the detailed description generation stage is randomly selected from three instructions with the same semantics but different wording; preset regular expressions are used to extract the generated question-answer pairs from the responses of the multimodal large model; if matching fails, the regular expression is replaced and matching continues; if all regular expressions fail, the self-question-answering output does not meet the format requirement and another instruction prompt is used; and if the outputs for all three instruction prompts fail to meet the format requirement, the image is skipped;

The instruction prompt of the detailed description generation stage is generated from the following template: "You are an image content analysis expert for receipt images. Text information of the input receipt image: (Text, Category, Coordinate); brief information: Info_direct. Please generate a short and accurate description of the input receipt image by combining the provided image information. The description should be about 70 to 100 words long. Please pay special attention to the text information in the image."
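The prompt construction described above might be sketched as follows. The first template paraphrases the one given in the text; the second and third templates are invented stand-ins for the "same semantics, different wording" variants, and the whitespace-based word count in `description_length_ok` is likewise an assumption, since the text does not say how the 70 to 100 word limit is checked.

```python
import random
from typing import List

_COMMON = (
    "Text information of the input receipt image: {items}. "
    "Brief information: {info}. "
)

# Three prompts with the same meaning but different wording (2 and 3 are hypothetical).
DESCRIPTION_PROMPTS: List[str] = [
    "You are an image content analysis expert for receipt images. " + _COMMON +
    "Please generate a short and accurate description of the input receipt image "
    "by combining the provided image information. The description should be about "
    "70 to 100 words long. Please pay special attention to the text in the image.",
    "As an expert in analysing receipt images, read the following. " + _COMMON +
    "Using this information, describe the receipt image briefly and accurately "
    "in about 70 to 100 words, focusing on the text it contains.",
    "You analyse the content of receipt images. " + _COMMON +
    "Write an accurate description of the receipt image in about 70 to 100 words, "
    "based on the information above and with particular attention to its text.",
]


def build_description_prompt(items, info_direct, rng=random) -> str:
    """Pick one of the three prompts at random and fill in the image information."""
    template = rng.choice(DESCRIPTION_PROMPTS)
    return template.format(items=items, info=info_direct)


def description_length_ok(description: str, low: int = 70, high: int = 100) -> bool:
    """Check the length limit, counting whitespace-separated words."""
    return low <= len(description.split()) <= high
```

Passing a seeded `random.Random` instance as `rng` makes the prompt choice reproducible, which can help when debugging a generation run.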

Preferably, the determining, by using the information of the picture data obtained in step 201, the number of questions and corresponding answers of the text information generated for a picture specifically includes:

For an input image I, the label information of the corresponding image in the dataset is L, the image information obtained by organizing the labels is Info, and the number of question-answer pairs generated for the input image I is Q_num = Num(Info), wherein Num(Info) is a preset function.
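The text leaves Num(Info) open as a preset function. One possible choice, shown here only as an assumption, scales the number of questions with the number of labelled text items and clamps it to a fixed range; the parameters `per_items`, `lower` and `upper` are hypothetical.

```python
def num(info_items: list, per_items: int = 3, lower: int = 1, upper: int = 10) -> int:
    """Hypothetical Num(Info): one question per `per_items` text items, clamped."""
    return max(lower, min(upper, len(info_items) // per_items))
```

Tying Q_num to the amount of labelled text keeps sparse receipts from being asked for more questions than their content can support.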

Preferably, the method for generating the question and the answer of the text information by using the information of the picture data obtained in the step 201 and the image content description obtained in the step 202 includes:

The instruction prompt used by the multimodal large model for self-question-answering on the text information is obtained by combining a predefined prompt backbone p₂, the detailed description d of the image, and the information Info of the image data organized from the labels, wherein the predefined prompt backbone is randomly selected from three instructions with the same semantics but different wording;

The instruction prompt of the self-question-answering stage requires the question-answer pairs to be output in the format 'Question: [your first question] Answer: [answer according to the text in the picture]';

The result generated by the multimodal large model is matched with a preset regular expression to extract the questions Q and the corresponding answers A; if the number of extracted questions or answers is 0, or the number of questions differs from the number of answers, the matching has failed, and the regular expression is replaced and matching continues;

If all regular expressions fail, the output format of the multimodal large model in the self-question-answering does not meet the requirement and another instruction prompt is used; if the output formats for all three instruction prompts fail to meet the requirement, question-answer pair generation has failed for the image and the image is skipped.
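The matching and fallback logic above can be sketched as follows. The concrete regular expressions are hypothetical, and the sketch assumes that each question and each answer sits on its own line; `model` stands for any callable that sends a prompt to the multimodal large model and returns its text output.

```python
import re
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical pattern sets, tried in order; each is (question regex, answer regex).
PATTERN_SETS = [
    (re.compile(r"Question:\s*(.+)"), re.compile(r"Answer:\s*(.+)")),
    (re.compile(r"Q\d*[:.]\s*(.+)"), re.compile(r"A\d*[:.]\s*(.+)")),
]


def extract_pairs(output: str) -> Optional[List[Tuple[str, str]]]:
    """Return the question-answer pairs, or None if every pattern set fails."""
    for q_re, a_re in PATTERN_SETS:
        questions = [q.strip() for q in q_re.findall(output)]
        answers = [a.strip() for a in a_re.findall(output)]
        # Matching fails if either list is empty or the counts differ.
        if questions and answers and len(questions) == len(answers):
            return list(zip(questions, answers))
    return None


def self_ask(
    model: Callable[[str], str], prompts: Sequence[str]
) -> Optional[List[Tuple[str, str]]]:
    """Try each instruction prompt in turn; None means the image is skipped."""
    for prompt in prompts:
        pairs = extract_pairs(model(prompt))
        if pairs is not None:
            return pairs
    return None
```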

Preferably, the method for using the information of the picture data obtained in the step 201 and the image content description obtained in the step 202 to make the multi-mode big model perform secondary answer to the question set forth in the step 204 and simultaneously generate the basis or reason of the answer by reasoning specifically includes:

The questions generated in step 204 are answered in chain-of-thought form, and a reasoning basis and a new answer are generated for each question;

The instruction prompt of the secondary question-answering and reasoning generation stage is obtained by combining a predefined prompt backbone p₃, the detailed description d of the image, the information Info of the image data organized from the labels, and the questions Q generated in step 204, wherein the predefined prompt backbone is randomly selected from three instructions with the same semantics but different wording;

If the output for a single instruction prompt does not meet the format requirement, another instruction prompt is used, and if generation fails for all three instruction prompts, the image is skipped.

Preferably, the instruction prompt of the secondary question-answering and reasoning generation stage requires the generated chain of thought to be output one question at a time, each question being followed by its corresponding reasoning basis and its corresponding answer; the information in the output is then extracted item by item using regular expressions and stored in two lists, the reasoning bases Reason and the secondary answers A'.

Preferably, the information based on the picture data collated in step 201, and the questions and answers generated in step 204, make the multi-mode big model determine whether the question-answer pair is related to the text information, and if not, mark the question-answer pair as invalid, which specifically includes:

The multimodal large model judges whether each question-answer pair is related to the text information, and the instruction prompt used for this is obtained by combining a predefined prompt backbone p₄, the information Info of the image data organized from the labels, and the questions Q and answers A generated in step 204;

The question-answer pairs judged as 'no' are discarded, and those judged as 'yes' are retained;

The instruction prompt of the judging stage requires the judgments to be output one by one in the format 'Judgment: [judgment for the i-th question-answer pair]'; the information in the output is then extracted item by item using a regular expression and stored as a list Judge.
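A small sketch of how the relevance judgments might be parsed and applied is shown below. The regular expression and the literal 'yes'/'no' tokens are assumptions about the output format, and `parse_judgments` and `filter_relevant` are hypothetical helper names.

```python
import re
from typing import List, Tuple

# Hypothetical pattern: one "Judgment: yes" or "Judgment: no" per question-answer pair.
JUDGE_RE = re.compile(r"Judge?ment:\s*(yes|no)", re.IGNORECASE)


def parse_judgments(output: str) -> List[bool]:
    """Extract the per-pair judgments into the list Judge (True means relevant)."""
    return [token.lower() == "yes" for token in JUDGE_RE.findall(output)]


def filter_relevant(
    pairs: List[Tuple[str, str]], output: str
) -> List[Tuple[str, str]]:
    """Keep the pairs judged 'yes' and discard the pairs judged 'no'."""
    judge = parse_judgments(output)
    if len(judge) != len(pairs):
        raise ValueError("expected exactly one judgment per question-answer pair")
    return [pair for pair, keep in zip(pairs, judge) if keep]
```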

Preferably, the consistency test for the question-answer pair by using the information of the picture data obtained in the step 201 and the image content description obtained in the step 202, the question and answer generated in the step 204, and the secondary answer and reason for the question generated in the step 205 specifically includes:

The instruction prompt used for the consistency check of the question-answer pairs consists of a predefined prompt backbone p₅, the questions Q and answers A generated in steps 203 and 204, and the secondary answers A' and reasoning bases Reason generated in step 205;

The question-answer pairs judged as 'no' are deleted;

The instruction prompt of the consistency check stage requires the judgments to be output one by one in the format 'Judgment: [True or False]'; the information in the output is then extracted item by item using a regular expression and stored as a consistency judgment list Judge, where Judge_i = { j_{i,k} ∈ {True, False} | k = 1, 2, ..., Q_num }.
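The consistency list and the final deletion of invalid pairs could look roughly like this. The regular expression is an assumption about how True and False appear in the model output, and `parse_consistency` and `keep_valid` are hypothetical helper names.

```python
import re
from typing import List, Sequence, TypeVar

T = TypeVar("T")

# Hypothetical pattern: each judgment contains the literal token True or False.
BOOL_RE = re.compile(r"\b(True|False)\b")


def parse_consistency(output: str, q_num: int) -> List[bool]:
    """Build the consistency list for one image, one entry per question."""
    flags = [token == "True" for token in BOOL_RE.findall(output)]
    if len(flags) != q_num:
        raise ValueError("expected one True/False judgment per question")
    return flags


def keep_valid(
    pairs: Sequence[T], relevant: Sequence[bool], consistent: Sequence[bool]
) -> List[T]:
    """Delete every pair marked invalid by the relevance or consistency check."""
    return [p for p, r, c in zip(pairs, relevant, consistent) if r and c]
```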

Preferably, the multi-mode large model is a closed-source multi-mode large model and a corresponding API interface, or the multi-mode large model is an open-source multi-mode large model.

In a second aspect, the present invention further provides a device for generating question-answer data of sequential text bill images based on a multi-mode big model, which is used for implementing the method for generating question-answer data of sequential text bill images based on a multi-mode big model in the first aspect, and the device comprises:

At least one processor, and a memory in communication with the at least one processor, wherein the memory stores instructions executable by the at least one processor, the instructions being used to execute the method for generating sequential text bill image question-answer data based on a multimodal large model of the first aspect.

In a third aspect, the present invention further provides a non-volatile computer storage medium storing computer-executable instructions which, when executed by one or more processors, perform the method for generating sequential text bill image question-answer data based on a multimodal large model according to the first aspect.

In a fourth aspect, there is provided a chip comprising a processor and an interface for calling from memory and running a computer program stored in memory, performing a method as in the first aspect.

In a fifth aspect, there is provided a computer program product comprising instructions which, when run on a computer or processor, cause the computer or processor to perform the method as in the first aspect.

The invention uses a multimodal large model to automatically generate, in large-scale batches, question-answer data about the text information in images in the receipt scenario. The method addresses the shortage, in both quantity and quality, of large-scale text-centric multimodal instruction fine-tuning data for receipts, and avoids the high time and labor cost of manual annotation otherwise needed to obtain large-scale image-text understanding question-answer data.

Drawings

In order to more clearly illustrate the technical solution of the embodiments of the present invention, the drawings that are required to be used in the embodiments of the present invention will be briefly described below. It is evident that the drawings described below are only some embodiments of the present invention and that other drawings may be obtained from these drawings without inventive effort for a person of ordinary skill in the art.

FIG. 1 is a schematic flow chart of a method for generating question-answer data of a serial text bill image based on a multi-mode large model provided by the embodiment of the invention;

FIG. 2 is a schematic flow chart of a method for generating question-answer data of sequential text bill images based on a multi-mode large model provided by the embodiment of the invention;

FIG. 3 is a schematic flow chart of a method for generating question-answer data of sequential text bill images based on a multi-mode large model provided by the embodiment of the invention;

FIG. 4 is a schematic diagram of a method for generating question-answer data of a sequential text bill image based on a multi-mode large model provided by the embodiment of the invention;

Fig. 5 is a schematic diagram of a structure of a device for generating question-answer data of a sequential text bill image based on a multi-mode large model according to an embodiment of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.

Throughout the specification and claims, unless the context requires otherwise, the term "comprising" is to be interpreted in an open-ended sense, i.e. as "including, but not limited to". In the description of the present specification, the terms "one embodiment," "some embodiments," "example embodiments," "examples," "particular examples," or "some examples," etc., are intended to indicate that a particular feature, structure, material, or characteristic associated with the embodiment or example is included in at least one embodiment or example of the present disclosure. The schematic uses of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

In the description of the present invention, the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined by "first" or "second" may explicitly or implicitly include one or more such features. In the description of the embodiments of the present disclosure, unless otherwise indicated, "a plurality" means two or more. In addition, where two separate individuals described by the same term are distinguished by appending "A" and "B" to the term, the features so labelled are distinguished for descriptive purposes only, and the labels are not to be interpreted as indicating or implying relative importance or the number of technical features indicated.

In the description of the invention, the expression "A and/or B" (where A and B stand for specific features) covers three combinations: A only, B only, and the combination of A and B.

As used herein, "about," "roughly" or "approximately" includes the stated value as well as average values within the acceptable deviation range of the particular value, as determined by one of ordinary skill in the art in view of the measurement in question and the error associated with the measurement of the particular quantity (i.e., limitations of the measurement system).

In addition, the technical features of the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.

Example 1:

To address the current shortage, in both quantity and quality, of text-oriented question-answer instruction fine-tuning data for receipt images, the invention provides an innovative solution in which an automatic data generation pipeline is built on an existing multimodal large model and a public receipt (i.e. sequential text bill) dataset. A series of carefully designed instructions is used to organize the annotation information of the images in the receipt dataset and to carry out self-question-answering, self-reasoning and self-screening, thereby automatically generating high-quality, large-scale question-answer data about the text information in receipt images.

In step 201, a public receipt dataset is used as the picture data for generating the corresponding text understanding question-answer data, and the labels carried by the dataset are organized to obtain the information of the image data.

In step 202, the information of the picture data obtained by sorting in step 201 is used to describe the target image content in detail, wherein the multi-mode large model is a closed-source multi-mode large model and a corresponding API interface, or the multi-mode large model is an open-source multi-mode large model.

In step 203, the information of the picture data obtained in step 201 is used to determine the question of text information and the number of answers corresponding to the question generated for one picture.

In step 204, the information of the picture data obtained in step 201 and the image content description obtained in step 202 are used to have the multimodal large model perform self-question-answering on the text information and generate questions and corresponding answers related to the text information, wherein the number of generated questions and answers equals the number determined in step 203.

In step 205, the information of the picture data obtained in step 201 and the image content description obtained in step 202 are used to make the multi-mode big model perform secondary answer to the question posed in step 204 and simultaneously generate the basis or reason of the answer by reasoning.

In step 206, based on the information of the picture data organized in step 201 and the questions and answers generated in step 204, the multi-mode big model is made to determine whether the question-answer pair is related to the text information, and if not, the question-answer pair is marked as invalid.

In step 207, the consistency check is performed on the question-answer pair by using the information of the picture data obtained in step 201 and the image content description obtained in step 202, the question and answer generated in step 204, and the secondary answer and reason for the question generated in step 205, and if the two answers corresponding to the same question are inconsistent, the question-answer pair is marked as invalid.

The embodiment uses a multimodal large model to automatically generate, in large-scale batches, question-answer data about the text information in images in the receipt scenario. The method addresses the shortage, in both quantity and quality, of large-scale text-centric multimodal instruction fine-tuning data for receipts, and avoids the high time and labor cost of manual annotation otherwise needed to obtain large-scale image-text understanding question-answer data.

The method for generating the corresponding text understanding question-answering data by using the public receipt data set as the picture data, and arranging labels carried by the data set to obtain the information of the image data specifically comprises the following steps:

The labels carried by the dataset are organized, and the key information in them is extracted and simplified so as to carry as much information as possible in as few words as possible. This provides reliable text information for the multimodal large model's self-question-answering on the image, and effectively improves both the model's understanding of the image content and the accuracy of the generated question-answer pairs about the text information. For an input image I, the label information of the corresponding image in the dataset is L, and the image information obtained by organizing the labels is Info = Extract(L). The extracted information includes one or more of: the text content, the position of the text content, the information category represented by the text content, and brief information of the receipt.

The position of the text content is represented by two points, the lower-left corner and the lower-right corner of the text content; the information categories represented by the text content comprise shop name, shop address, commodity name, commodity price, commodity total price and the like; and the brief information of the receipt comprises the shop name, the shop address and the amount spent. In the specific storage format, the text content Text, the information category Category it represents and its position Coordinate are stored together as a tuple (Text, Category, Coordinate), and the brief information of the receipt is stored as a separate dictionary Info_direct: {"shop name": , "shop address": , "article total price": }.

In one embodiment, the method for describing the target image content in detail by using the multi-mode big model and the information of the picture data obtained by the arrangement in step 201, as shown in fig. 2, specifically includes:

In step 301, a detailed description of the target image is generated using the multimodal large model. The instruction prompt of the detailed description generation stage emphasizes attention to the text information in the image, and the generated detailed description is limited to between 70 and 100 words, in order to strike a balance between controlling the total length of the prompt instructions and preserving the details of the image content description. Considering the randomness of large models and the need for diversity in training data, the instruction prompt of the detailed description generation stage is randomly selected from three instructions with the same semantics but different wording.

In step 302, preset regular expressions are used to extract the generated question-answer pairs from the responses of the multimodal large model. If matching fails, the regular expression is replaced and matching continues; if all regular expressions fail, the self-question-answering output does not meet the format requirement and another instruction prompt is used; if the outputs for all three instruction prompts fail to meet the format requirement, the image is skipped. Let the input image be I, the predefined instruction prompt be p₁ and the multimodal large model used be M; the detailed description is then d = M(I, p₁). This design effectively improves the success rate of model generation and the diversity of the resulting data. The same idea and method are applied throughout the pipeline: in every step that involves generation by the large model, the number of prompt instructions, the way they are designed and the way they are used are the same as described here, so the design idea and usage of the prompt instructions are not explained again in detail in the later steps.

The instruction prompt of the detailed description generation stage (i.e. step 202) is generated from the following template: "You are an image content analysis expert for receipt images. Text information of the input receipt image: (Text, Category, Coordinate); brief information: Info_direct. Please generate a short and accurate description of the input receipt image by combining the provided image information. The description should be about 70 to 100 words long. Please pay special attention to the text information in the image."

In an alternative embodiment, using the information of the picture data obtained in step 201 to determine the number of questions and corresponding answers about the text information to be generated for a picture specifically includes:

To balance the number of questions and corresponding answers about the text information generated for a picture against the relevance of the question-answer pairs to the text in the image and the accuracy of the generated pairs, the information of the image data obtained by collating the dataset annotations in step 201 is used to limit the number of question-answer pairs that the multimodal large model generates for a specific image. For an input image I, let the annotation information of the corresponding image in the dataset be L and the collated image information be Info. The number of question-answer pairs generated for the input image I is Q_num = Num(Info), where Num(Info) is a preset function determined by a person skilled in the art according to requirements. Different receipt datasets may have different annotations, so the collated image information may also differ. With this design, as many question-answer pairs as possible are generated while the relevance and accuracy of the question-answer pairs generated by the model are effectively improved.

In an actual application scenario, the information of the picture data obtained in step 201 and the image content description obtained in step 202 are used to make the multimodal large model perform self-questioning and answering on the text information, so as to generate questions about the text information and corresponding answers. As shown in Fig. 3, this specifically includes:

In step 401, the self-questioning-and-answering stage generates a plurality of question-answer pairs. Each question should be clear, meaningful and related to the text information in the image; each question corresponds to one answer, which should be a short word or phrase related to the text in the image. The instruction prompt used by the multimodal large model for self-questioning and answering on the text information is obtained by combining a predefined prompt backbone p₂, the detailed description d of the image, and the information Info of the image data collated from the annotations; the predefined prompt backbone is randomly selected from three instructions with the same semantics but different wording. The instruction prompt of the self-questioning-and-answering stage (i.e., step 204) requires output in the format "Question: [your first question] Answer: [answer based on the text in the picture]".

In step 402, preset regular expressions are matched against the result generated by the multimodal large model to extract the questions Q and the corresponding answers A. If the number of extracted questions or answers is 0, or the number of questions differs from the number of answers, the match is considered wrong and another regular expression is tried.

In step 403, if all regular expressions fail, the output format of the multimodal large model in the self-questioning-and-answering stage does not meet the requirement, and another instruction prompt is used. If the outputs for all three instruction prompts fail to meet the format requirement, question-answer pair generation for the image has failed and the image is skipped.
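The validity rule of steps 402 and 403 (a match counts as wrong when no question or no answer is found, or when the counts differ) can be sketched as below. The regular expressions and the names `PATTERN_SETS` and `extract_qa` are assumptions made for illustration; the source does not disclose the actual expressions.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical (question pattern, answer pattern) pairs, tried in order.
PATTERN_SETS = [
    (r"^\s*\d+\.\s*Question:\s*(.+?)\s*Answer:", r"Answer:\s*(.+?)\s*$"),
    (r"Q\d*:\s*(.+?)\s*A\d*:", r"A\d*:\s*(.+?)\s*$"),
]


def extract_qa(output: str) -> Optional[Tuple[List[str], List[str]]]:
    """Return (questions, answers), or None if every pattern set gives an invalid match."""
    for question_pattern, answer_pattern in PATTERN_SETS:
        questions = re.findall(question_pattern, output, flags=re.MULTILINE)
        answers = re.findall(answer_pattern, output, flags=re.MULTILINE)
        # A match is wrong if either list is empty or the two counts differ.
        if questions and answers and len(questions) == len(answers):
            return questions, answers
    return None  # the caller switches to another instruction prompt
```

Returning `None` signals the caller to retry with another instruction prompt and, after three failed prompts, to skip the image.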

In a preferred embodiment, using the information of the picture data obtained in step 201 and the image content description obtained in step 202 to make the multimodal large model answer the questions posed in step 204 a second time, while generating by reasoning the basis or reason for each answer, specifically includes:

In the secondary question-answering and reasoning generation stage, the questions generated in step 204 are answered in chain-of-thought form: a reasoning basis and a new answer are generated for each question.

The instruction prompt of the secondary question-answering and reasoning generation stage (i.e., step 205) is obtained by combining a predefined prompt backbone p₃, the detailed description d of the image, the information Info of the image data collated from the annotations, and the questions Q generated in step 204. The predefined prompt backbone is randomly selected from three instructions with the same semantics but different wording.

If the output for a single instruction prompt does not meet the format requirement, another instruction prompt is used; if generation fails for all three instruction prompts, the image is skipped. This step aims to generate reasoning data for the answers of the question-answer pairs, which helps to expand the training data, and the chain-of-thought form can also yield higher-quality answers. In addition, the reasoning basis and the secondary answer generated in this stage are used for consistency detection in the subsequent stage to evaluate the quality of the question-answer pairs.

To control the output format of the model, the instruction prompt in the secondary question-answering and reasoning generation stage requires the generated chain of thought to be output item by item in the format "Question: [first question] Answer: [answer corresponding to the first question] Basis: [reasoning basis corresponding to the first question]". The information in the output is then extracted item by item using a regular expression and stored as two lists: the secondary answers A' and the reasoning bases Reason.
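A minimal sketch of this extraction step is given below, assuming each item occupies one line of the model output. The pattern `COT_PATTERN` and the function `parse_cot_output` are hypothetical; the source does not give the regular expression actually used.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical pattern for one "Question ... Answer ... Basis ..." item per line.
COT_PATTERN = re.compile(
    r"Question:\s*(?P<question>.+?)\s*Answer:\s*(?P<answer>.+?)\s*Basis:\s*(?P<basis>.+?)\s*$",
    re.MULTILINE,
)


def parse_cot_output(output: str, expected: int) -> Optional[Tuple[List[str], List[str]]]:
    """Return (secondary answers A', reasoning bases Reason), or None on a format error."""
    matches = list(COT_PATTERN.finditer(output))
    if len(matches) != expected:
        return None  # one item is expected per question generated earlier
    secondary_answers = [m.group("answer") for m in matches]
    reasons = [m.group("basis") for m in matches]
    return secondary_answers, reasons
```

Checking the item count against the number of questions is an added safeguard in this sketch, so that the two lists stay aligned with the questions from the earlier stage.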

In one embodiment, making the multimodal large model determine, based on the information of the picture data collated in step 201 and the questions and answers generated in step 204, whether each question-answer pair is related to the text information, and marking the pair as invalid if it is not, specifically includes:

The multimodal large model judges whether each question-answer pair is related to the text information and gives a binary "yes" or "no" judgment for each pair. The instruction prompt used for this judgment is obtained by combining the predefined prompt backbone p₄, the information Info of the image data collated from the annotations, and the questions Q and answers A generated in step 204.

Question-answer pairs judged "no" are discarded, and those judged "yes" are retained.

To control the output format of the model, the instruction prompt in the judgment stage (i.e., step 206) requires the generated judgments to be output one by one in the format "Judgment: [judgment corresponding to the i-th question-answer pair]". The information in the output is then extracted item by item using a regular expression and stored as a list Judge.
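The judgment extraction and the resulting filtering can be sketched as follows. The pattern and the names `parse_judgments` and `filter_relevant` are illustrative assumptions; only the "Judgment: ..." output format and the yes/no semantics come from the description above.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical pattern: accepts "Judgment: yes" as well as "Judgment: [no]".
JUDGMENT_PATTERN = re.compile(r"Judgment:\s*\[?\s*(yes|no)\b", re.IGNORECASE)


def parse_judgments(output: str, q_num: int) -> Optional[List[bool]]:
    """Return the list Judge as booleans (True for "yes"), or None on a format error."""
    found = JUDGMENT_PATTERN.findall(output)
    if len(found) != q_num:
        return None
    return [judgment.lower() == "yes" for judgment in found]


def filter_relevant(questions: List[str], answers: List[str],
                    judge: List[bool]) -> List[Tuple[str, str]]:
    """Keep only the question-answer pairs judged as related to the text."""
    return [(q, a) for q, a, keep in zip(questions, answers, judge) if keep]
```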

In an alternative embodiment, performing the consistency check on the question-answer pairs, using the information of the picture data obtained in step 201, the image content description obtained in step 202, the questions and answers generated in step 204, and the secondary answers and reasons generated in step 205, specifically includes:

The multimodal large model determines, based on two different contextual prefixes and the reasoning basis generated in step 205, whether the initial answer generated in step 204 and the secondary answer generated in step 205 express the same meaning. The multimodal large model gives a binary "yes" or "no" judgment for each question-answer pair.

The instruction prompt used for the consistency check of question-answer pairs consists of a predefined prompt backbone p₅, the questions Q and answers A generated in step 204, and the secondary answers A' and reasoning bases Reason generated in step 205.

Question-answer pairs judged "no" are deleted.

The main purpose of consistency detection is to delete question-answer pairs that are confusing or incorrect, or for which the multimodal large model has low confidence. Most answers in such pairs do not accurately answer the question; these pairs lower the overall quality of the generated question-answer dataset, introduce noise into model training and degrade model performance. To control the output format of the model, the instruction prompt of the consistency check stage (i.e., step 207) requires the judgments to be output in the format "Judge: <True or False>". The information in the output is then extracted item by item using a regular expression and stored as a consistency judgment list Judge, where Judge_i = {j_{i,k} ∈ {True, False} | k = 1, 2, ..., Q_num}.

Step 202, step 204 and step 207 in this embodiment can be understood as inputting corresponding instruction prompts into the multi-modal large model, so as to obtain the result of the corresponding stage.

Example 2:

This embodiment is based on the method described in embodiment 1 and describes, in the technical terms of a specific application scenario, how the invention is implemented in that scenario.

The invention provides a method, based on a multimodal large model, for generating image text understanding question-answering data for receipt images. It aims to use the multimodal large model to automatically generate, in large batches, question-answering data about the text information of images in receipt scenarios. The method addresses the shortage in quantity and quality of large-scale, text-centric multimodal instruction fine-tuning data for receipts, and avoids the high time and labor cost of annotation otherwise needed to obtain large-scale image-text understanding question-answering data.

As shown in Fig. 4, this embodiment proposes a method, based on a multimodal large model, for generating text understanding question-answering data for receipt images, which includes the following implementation steps:

Step one, data preprocessing: a public receipt dataset is used as the picture data for generating the corresponding text understanding question-answering data. The annotations in the dataset are collated to obtain the information of the image data.

In step one, the annotations of the dataset are collated to obtain the information of the image data: the key information in the annotations is extracted and condensed so that as few words as possible carry as much information as possible. This provides reliable text information for the multimodal large model's self-questioning and answering on the image, and effectively improves both the model's understanding of the image content and the accuracy of the generated question-answer pairs about the text information. The extracted information may include the text content, the position of the text content, the information category represented by the text content, and brief information about the receipt. The position of the text content is represented by two points, the upper-left and lower-right corner coordinates of the text content. The information categories represented by the text content include store name, store address, commodity name, commodity price, total commodity price and the like. The brief information of the receipt includes store name, store address, amount spent and the like. The storage format is as follows: the text content Text, the text content position Coordinates and the information category Category are stored together as a tuple (Text, Category, Coordinates), and the brief information of the receipt is stored as a single dictionary Info_dict: {"store name": ..., "store address": ..., "item total": ...}.
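The storage format can be illustrated with a small data structure. The sketch below is only an illustration under assumptions: the raw annotation layout (`text`, `category` and `box` keys) and the names `TextItem`, `collate_annotations` and `BRIEF_KEYS` are hypothetical and do not correspond to any particular receipt dataset.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Point = Tuple[int, int]

# Hypothetical categories copied into the brief-information dictionary.
BRIEF_KEYS = ("store name", "store address", "item total")


@dataclass(frozen=True)
class TextItem:
    text: str
    category: str
    coordinates: Tuple[Point, Point]  # two corner points of the text box

    def as_tuple(self):
        """Return the (Text, Category, Coordinates) tuple used in the prompts."""
        return (self.text, self.category, self.coordinates)


def collate_annotations(raw_labels: List[dict]) -> Tuple[List[TextItem], Dict[str, str]]:
    """Condense raw labels into text tuples plus a brief-information dictionary."""
    items: List[TextItem] = []
    info_dict: Dict[str, str] = {}
    for label in raw_labels:
        x1, y1, x2, y2 = label["box"]  # assumed layout: [x1, y1, x2, y2]
        item = TextItem(label["text"].strip(), label["category"], ((x1, y1), (x2, y2)))
        items.append(item)
        # Keep the first occurrence of each brief-information category.
        if item.category in BRIEF_KEYS and item.category not in info_dict:
            info_dict[item.category] = item.text
    return items, info_dict
```

Since annotations differ between receipt datasets, the mapping from raw labels to categories would need to be adapted per dataset.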

Step two: using the multimodal large model and the information of the picture data obtained from the annotations in step one, describe the content of the target image in detail, paying special attention to the text information in the target image.

Specifically, in step two, a detailed description of the target image is generated using the multimodal large model, and the instruction prompt emphasizes attention to the text information in the image. To preserve the detail of the image content description while controlling the length of the overall instruction prompt, the detailed description is limited to about 70 to 100 words. A reference instruction prompt is as follows:

"You are an image content analysis expert for receipt images. Text information of the input receipt image: (Text, Category, Coordinates); brief information: Info_dict. Please generate a brief but accurate description of the input receipt image based on the provided image information. The description should be about 70 to 100 words long. Please pay special attention to the text information in the image."

An instruction prompt similar to the reference prompt allows the multimodal large model to make full use of the annotation information of the receipt dataset, improves its understanding of receipt images and yields a more detailed and accurate image description. The image description from this step is incorporated as auxiliary information into the instruction prompts of the subsequent steps, where it helps the multimodal large model to understand the image further, improves its performance in each subsequent step and helps it to generate accurate, high-quality question-answer pairs.

Step three: using the information of the picture data obtained in step one, determine the number of questions and corresponding answers about the text information to be generated for a picture.

In step three, the number of question-answer pairs to be generated later is determined from the image information obtained in step one. Choosing an appropriate number of question-answer pairs helps to improve both the correctness of the pairs generated by the multimodal large model and their relevance to the text information in the image. The number of question-answer pairs to be generated is tied to the number of text items in the image: for a receipt picture I, the text content Text in the image information collated from the corresponding dataset annotations determines the number of questions and corresponding answers, Num = F(len(Text)), where len(Text) is the number of text items and F is a function of that number. The specific form of F differs according to the annotations of the dataset.
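Because the source leaves the form of F open, the following is purely a hypothetical example of such a function: a fixed ratio of the number of text items, clamped to a range. The name `num_question_pairs` and the parameters `ratio`, `minimum` and `maximum` are assumptions chosen for illustration, not values from the invention.

```python
import math
from typing import Sequence


def num_question_pairs(texts: Sequence[str], ratio: float = 0.5,
                       minimum: int = 1, maximum: int = 10) -> int:
    """Hypothetical F(len(Text)): a clamped fraction of the number of text items."""
    n = len(texts)
    if n == 0:
        return 0  # no text in the image, so no text-centric questions
    return max(minimum, min(maximum, math.ceil(n * ratio)))
```

The upper bound in this sketch reflects the stated goal of limiting the number of pairs so that the generated questions stay relevant and accurate; a dataset with denser annotations might call for a different ratio or cap.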

Step four: using the information of the picture data obtained in step one and the image content description obtained in step two, make the multimodal large model perform self-questioning and answering on the text information and generate questions about the text information and corresponding answers, where the number of generated questions and corresponding answers equals the number determined in step three.

In step four, self-questioning-and-answering data are generated from the image information. In the self-questioning-and-answering stage, a number of question-answer pairs are generated according to the amount of text information in the image. Each question should be clear, meaningful and related to the text information in the image; each question corresponds to one answer, which should be a short word or phrase related to the text in the image. The instruction prompt of this stage is obtained by combining a predefined prompt backbone p₂, the detailed description d of the image and the information Info of the image data collated from the annotations; the predefined prompt backbone is randomly selected from three instructions with the same semantics but different wording.

A reference instruction prompt is as follows:

"This is a receipt image. Here is the text information in the image, in the form (text content, text category, (coordinates of the upper-left corner of the text, coordinates of the lower-right corner of the text)): (Text, Category, Coordinates), together with basic information about the image <Info_dict> and a detailed description <Description>. Generate <Q_num> different and detailed questions and their corresponding answers. The questions should be as detailed as possible, the corresponding answers should be as concise and comprehensive as possible, and each question-answer pair must be related to the text in the picture.

Each question-answer pair should follow the format below:

1. Question: [your first question] Answer: [answer based on the text in the picture]

2. Question: [your second question] Answer: [answer based on the text in the picture]

......

<Q_num>. Question: [your <Q_num>-th question] Answer: [answer based on the text in the picture]

Number the questions using '1. 2. 3. 4. ... <Q_num>.'"

With instruction prompts similar to the reference prompt above, the multimodal large model can generate a series of question-answer pairs that focus on the text content of the image, with both the number and the quality of the generated pairs ensured. The strict format requirement also keeps the number of generated question-answer pairs under tight control and makes the output of the multimodal large model more regular and easier to match with regular expressions.

Step five: using the information of the picture data obtained in step one and the image content description obtained in step two, make the multimodal large model answer the questions posed in step four a second time and, at the same time, generate by reasoning the basis or reason for each answer.

The main function of step five is to generate the reasoning bases of the question-answer pairs and the secondary answers needed in the subsequent consistency detection stage. In the secondary question-answering and reasoning generation stage, the questions generated in step four are answered in chain-of-thought form, and a reasoning basis and a new answer are generated for each question. The instruction prompt of this stage is obtained by combining a predefined prompt backbone p₃, the detailed description d of the image, the information Info of the image data collated from the annotations and the questions Q generated in the previous stage; the predefined prompt backbone is randomly selected from three instructions with the same semantics but different wording. A reference instruction prompt is as follows:

"This is a receipt image. Here is the text information in the image, in the form (text content, text category, (coordinates of the upper-left corner of the text, coordinates of the lower-right corner of the text)): (Text, Category, Coordinates), together with basic information about the image <Info_dict> and a detailed description <Description>. Answer the following questions. The answer to each question must be related to the text information. Please think step by step: first give a short piece of reasoning that explains the basis for answering the question, then give a specific answer. The questions to be answered are <Q>.

The output should conform to the following format:

1. Question: [first question] Answer: [answer corresponding to the first question] Basis: [reasoning basis corresponding to the first question]

2. Question: [second question] Answer: [answer corresponding to the second question] Basis: [reasoning basis corresponding to the second question]

......

<Q_num>. Question: [<Q_num>-th question] Answer: [answer corresponding to the <Q_num>-th question] Basis: [reasoning basis corresponding to the <Q_num>-th question]

Number the questions using '1. 2. 3. 4. ... <Q_num>.'"

With instruction prompts similar to the reference prompt above, the multimodal large model can answer the questions a second time in chain-of-thought form. Answering in chain-of-thought form not only improves the accuracy of the model's answers but also produces the associated reasoning bases. These reasoning bases enrich the information in the dataset and provide the information needed by the subsequent consistency check step.

Step six: based on the information of the picture data collated in step one and the questions and answers generated in step four, make the multimodal large model judge whether each question-answer pair is related to the text information; if not, mark the question-answer pair as invalid.

Specifically, the main function of step six is to judge whether each generated question-answer pair is related to the text in the image, so that the questions and answers generated in step four can be screened on the basis of this judgment. In practice, it is observed that some of the questions and answers generated in step four are unrelated to the text in the image and concern other visual information instead. Such pairs need to be screened out to avoid degrading the quality of the question-answer dataset generated by the multimodal large model. In this stage, the multimodal large model gives a binary "yes" or "no" judgment for each question-answer pair. The instruction prompt of this stage is obtained by combining the predefined prompt backbone p₄, the information Info of the image data collated from the annotations, and the question-answer pairs Q&A generated in step four.

A reference instruction prompt is as follows:

"This is a receipt image. Here is the text information in the image, in the form (text content, text category, (coordinates of the upper-left corner of the text, coordinates of the lower-right corner of the text)): (Text, Category, Coordinates), together with basic information about the image <Info_dict>. Below are some questions about the image and their corresponding answers. The questions are <Q> and the corresponding answers are <A>. Please act as a data inspection expert and determine whether each question-answer pair is related to the text in the image. Output a "yes" or "no" judgment for each question in turn. The output should conform to the following format:

1. Judgment: [judgment for the first question-answer pair]

2. Judgment: [judgment for the second question-answer pair]

......

<Q_num>. Judgment: [judgment for the <Q_num>-th question-answer pair]

Number the judgments using '1. 2. 3. 4. ... <Q_num>.'"

With instruction prompts similar to the reference prompt above, the multimodal large model outputs a judgment for each question-answer pair in the series. The judgments generated in this stage are used together with those generated in the consistency detection stage to filter the question-answer pairs.

Step seven: perform a consistency check on the question-answer pairs using the information of the picture data obtained in step one, the image content description obtained in step two, the questions and answers generated in step four, and the secondary answers and reasons generated in step five. If the two answers to the same question are inconsistent, the question-answer pair is marked as invalid.

The main function of step seven is to check the consistency of the question-and-answer data. The results generated by the multimodal large model may vary, i.e., the model may generate several different answers to the same question. The consistency check screens out wrong question-answer pairs, which improves the accuracy and consistency of the answers and the quality of the generated question-answer pairs. In the consistency detection stage, the multimodal large model judges, according to the reasoning basis generated in step five, whether the initial answer generated in step four and the secondary answer generated in step five express the same meaning. The multimodal large model gives a binary "yes" or "no" judgment for each question-answer pair. The instruction prompt of this stage consists of a predefined prompt backbone together with the questions, the two sets of answers and the reasoning bases generated in steps four and five.

A reference instruction prompt is as follows:

"You are an image content analysis expert. The following are some questions about a receipt image and two sets of answers. You need to judge, based on the content of the image and the reasoning basis given, whether the two sets of answers are consistent and correct. If the contents of the first and second sets of answers differ, output the judgment "no". The questions to be judged are <Q>, the reasoning basis for answering the questions is <Reason>, the first set of answers is <A>, and the second set of answers is <A'>. Output a "yes" or "no" judgment for each question in turn. The output should conform to the following format:

1. Judgment: [judgment for the first question-answer pair]

2. Judgment: [judgment for the second question-answer pair]

......

Number the judgments using '1. 2. 3. 4. ... <Q_num>.'"

With instruction prompts similar to the reference prompt above, the multimodal large model outputs a judgment for each question-answer pair in the series. The judgments of this stage are used together with those of the previous stage for the final screening operation.

Step eight: delete the invalid question-answer pairs according to the question-answer pair marks obtained in step six and step seven.

In step eight, the question-answer pairs are screened using the judgments obtained in step six and step seven. The text-relevance judgment and the consistency-check judgment of each question-answer pair are combined to obtain the final judgment for that pair. Question-answer pairs whose final judgment is "no" are deleted, and those judged "yes" are retained.
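One way to combine the two judgments is sketched below. It assumes that a pair is retained only when both the text-relevance judgment and the consistency judgment are "yes", which matches the rule that a pair marked invalid in either step six or step seven is deleted; the function name `screen_pairs` is hypothetical.

```python
from typing import List, Sequence, Tuple


def screen_pairs(questions: Sequence[str], answers: Sequence[str],
                 relevance: Sequence[bool], consistency: Sequence[bool]) -> List[Tuple[str, str]]:
    """Keep the question-answer pairs that pass both the relevance and consistency checks."""
    lengths = {len(questions), len(answers), len(relevance), len(consistency)}
    if len(lengths) != 1:
        raise ValueError("questions, answers and both judgment lists must have equal length")
    return [
        (question, answer)
        for question, answer, related, consistent in zip(questions, answers, relevance, consistency)
        if related and consistent  # assumed combination rule: logical AND
    ]
```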

The invention uses the multimodal large model to automatically generate large-scale, high-quality, text-centric image text question-answering data for receipts, and can thereby remedy the current situation in academia, in which text-centric instruction fine-tuning data for receipt image question answering are insufficient in quantity and low in quality. The invention designs an automatic data generation pipeline whose entire flow requires no manual annotation, saving a great deal of labor and time. The invention realizes a method for generating image-text understanding question-answering data that mainly handles receipt images, and its workflow and instruction prompts can process various types of receipt images. The method also has the potential to be extended to more image types, such as natural scene images, mobile screenshots, table images, chart images and document images.

Example 3:

Fig. 5 is a schematic diagram of the architecture of the device for generating sequential text bill image question-answer data based on a multimodal large model according to an embodiment of the invention. The device of this embodiment comprises one or more processors 21 and a memory 22. In Fig. 5, one processor 21 is taken as an example.

The processor 21 and the memory 22 may be connected by a bus or in another manner; connection by a bus is taken as an example in Fig. 5.

The memory 22 is used as a non-volatile computer readable storage medium for storing a non-volatile software program and a non-volatile computer executable program, such as the sequential text ticket image question-answer data generation method based on the multimodal big model in embodiment 1. The processor 21 executes a sequential text ticket image question-answer data generation method based on a multimodal big model by running a non-volatile software program and instructions stored in the memory 22.

The memory 22 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, memory 22 may optionally include memory located remotely from processor 21, which may be connected to processor 21 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The program instructions/modules are stored in the memory 22, which when executed by the one or more processors 21, perform the sequential text ticket image question-answer data generation method based on the multimodal big model in embodiment 1 described above.

It should be noted that, because the content of information interaction and execution process between modules and units in the above-mentioned device and system is based on the same concept as the processing method embodiment of the present invention, specific content may be referred to the description in the method embodiment of the present invention, and will not be repeated here.

Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the embodiments may be implemented by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, which may include a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and so on.

The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, and alternatives falling within the spirit and principles of the invention.

Claims

1. A method for generating sequential text bill image question-answering data based on a multimodal large model, characterized by comprising:

In step 201, the public receipt dataset is used as image data to generate corresponding text comprehension question and answer data, and the annotations of the dataset are sorted to obtain information of the image data;

In step 202, the target image content is described in detail using the multimodal large model and the information of the image data obtained in step 201;

In step 203, using the information of the image data obtained in step 201, the number of text-related questions and corresponding answers to be generated for an image is determined;

In step 204, using the information of the image data obtained in step 201 and the image content description obtained in step 202, the multimodal large model is made to perform self-questioning and answering on the text information and to generate questions related to the text information and corresponding answers, wherein the number of generated questions and answers is equal to the number determined in step 203;

In step 205, using the information of the image data obtained in step 201 and the image content description obtained in step 202, the multimodal large model is made to give a second answer to the question raised in step 204 and at the same time infer the basis or reason for generating the answer;

In step 206, based on the information of the image data organized in step 201 and the questions and answers generated in step 204, the multimodal large model is used to determine whether each question-answer pair is related to the text information; if not, the question-answer pair is marked as invalid;

In step 207, the consistency check is performed on the question-answer pair using the information of the image data obtained in step 201, the image content description obtained in step 202, the question and answer generated in step 204, and the secondary answer and reason to the question generated in step 205; if the two answers corresponding to the same question are inconsistent, the question-answer pair is marked as invalid;

In step 208, invalid question-answer pairs are deleted based on the question-answer pair tags obtained in steps 206 and 207.
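The marking and deletion in steps 206 to 208 can be sketched as follows. This is a minimal illustration rather than the claimed implementation: the `QAPair` class, its field names, and the three helper functions are hypothetical, and the model's relevance and consistency judgments are assumed to have already been parsed into lists of booleans.

```python
from dataclasses import dataclass


@dataclass
class QAPair:
    question: str
    answer: str           # answer from the self-questioning stage (step 204)
    second_answer: str    # secondary answer (step 205)
    reason: str           # reasoning basis (step 205)
    valid: bool = True    # cleared by step 206 or step 207


def mark_irrelevant(pairs, relevance):
    """Step 206: mark pairs judged unrelated to the text information."""
    for pair, related in zip(pairs, relevance):
        if not related:
            pair.valid = False


def mark_inconsistent(pairs, consistency):
    """Step 207: mark pairs whose two answers were judged inconsistent."""
    for pair, consistent in zip(pairs, consistency):
        if not consistent:
            pair.valid = False


def delete_invalid(pairs):
    """Step 208: keep only the pairs that are still marked valid."""
    return [pair for pair in pairs if pair.valid]
```

Keeping the mark and delete stages separate mirrors the claim, where both checks tag pairs first and a single final step removes everything tagged as invalid.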

2. The method for generating sequential text bill image question-answering data based on a multimodal large model according to claim 1, characterized in that using a public receipt dataset as image data to generate corresponding text-comprehension question-answer data, and organizing the annotations of the dataset to obtain information of the image data, specifically includes:

Organizing the annotations of the dataset to obtain information of the image data by extracting key information from the annotations and simplifying it; the extracted information includes one or more of the text content, the position of the text content, the type of information represented by the text content, and the brief information of the receipt;

The position of the text content is represented by the coordinates of the lower-left corner and the lower-right corner of the text content. The types of information represented by the text content include store name, store address, product name, unit price of product, and total price of product. The brief information of the receipt includes the store name, store address, and amount spent. As for the storage format, each text content Text is saved together with its position Coordinate and the type of information Category it represents as a tuple (Text, Category, Coordinate). The brief information of the receipt is saved as a separate dictionary Info_dict: {"store name": , "store address": , "total price of item": }.
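The storage format described above might look like the following in code. This is a sketch under stated assumptions: the input schema (dicts with the keys "text", "category", and "box") is hypothetical, since each public receipt dataset has its own annotation layout, and the `BRIEF_KEYS` mapping from annotation categories to Info_dict keys is an illustrative guess, not something the claim specifies.

```python
# Assumed mapping from annotation categories to Info_dict keys (illustrative only).
BRIEF_KEYS = {
    "store name": "store name",
    "store address": "store address",
    "total price of product": "total price of item",
}


def organize_annotations(raw_annotations):
    """Build (Text, Category, Coordinate) tuples and the receipt summary dict.

    `raw_annotations` is assumed to be a list of dicts with the hypothetical
    keys "text", "category", and "box"; real datasets use their own schemas.
    """
    items = []
    info_dict = {"store name": "", "store address": "", "total price of item": ""}
    for entry in raw_annotations:
        text, category = entry["text"], entry["category"]
        coordinate = tuple(map(tuple, entry["box"]))  # two corner points
        items.append((text, category, coordinate))
        key = BRIEF_KEYS.get(category)
        if key is not None and not info_dict[key]:
            info_dict[key] = text  # keep the first value seen for each key
    return items, info_dict
```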

3. The method for generating sequential text bill image question-answering data based on a multimodal large model according to claim 2, characterized in that the use of the multimodal large model and the information of the image data obtained in step 201 to describe the target image content in detail specifically includes:

Use a multimodal large model to generate a detailed description of the target image, and emphasize the text information in the image in the instruction prompts during the detailed description generation stage. At the same time, control the length of the total prompt instructions so that the generated detailed description is between 70 and 100 words.

The instruction prompt in the detailed description generation stage is randomly selected from three instructions with the same semantics but different wording. The generated question-answer pairs are extracted from the output of the multimodal large model by matching against a preset regular expression. If the match fails, another regular expression is tried. If all regular expressions fail, the self-questioning and answering output does not meet the format requirements, and another instruction prompt is used. If the outputs of all three instruction prompts fail to meet the format requirements, the image is skipped.

4. The method for generating sequential text bill image question-answering data based on a multimodal large model according to claim 1, characterized in that using the information of the image data obtained in step 201 to determine the number of text-related questions and corresponding answers to be generated for an image specifically includes:

For an input image I, let L be the annotation information of the image in the corresponding dataset and Info be the organized image information. The number of question-answer pairs generated for the input image I is then Q_num = Num(Info), where Num(Info) is a preset function.
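The claim leaves Num(Info) open as a preset function. As one hypothetical choice, the count could scale with the number of organized text items and be clamped to a fixed range. The function name `num_questions`, the divisor, and the bounds below are all assumptions made for illustration and are not taken from the method.

```python
def num_questions(info, per_items=3, lower=1, upper=5):
    """One hypothetical choice of Num(Info): scale with the number of text items.

    `info` is the list of organized text items for one image. The divisor and
    the bounds are illustrative defaults, not values taken from the method.
    """
    return max(lower, min(upper, len(info) // per_items))
```

Clamping keeps sparse receipts from producing zero questions and dense receipts from producing more questions than the text can support.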

5. The method for generating sequential text bill image question-answering data based on a multimodal large model according to claim 1, characterized in that using the information of the image data obtained in step 201 and the image content description obtained in step 202 to make the multimodal large model perform self-questioning and answering on the text information, and to generate questions about the text information and corresponding answers, specifically includes:

The instruction prompt used by the multimodal large model for self-questioning and answering on the text information is obtained by combining a predefined prompt trunk p₂, the detailed description d of the image, and the information Info of the image data obtained from the annotations; the predefined prompt trunk is randomly selected from three instructions with the same semantics but different wording;

The instruction prompt in the self-questioning and answering stage requires the output of question and answer pairs one by one in the format of "Question: [Your first question] Answer: [Answer according to the text in the picture]";

The output generated by the multimodal large model is matched against a preset regular expression to extract the questions Q and the corresponding answers A. If the number of questions or answers extracted is 0, or the number of questions is not equal to the number of answers, the match is considered to have failed, and another regular expression is tried;

If all regular expressions fail, the output of the multimodal large model for this round of self-questioning and answering does not meet the format requirements, and another instruction prompt is used. If the outputs of all three instruction prompts fail to meet the format requirements, the generation of question-answer pairs for the image fails and the image is skipped.
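The extraction and fallback logic of this claim can be sketched as below. The concrete regular expressions, the function names, the `ask_model` callable, and the way the prompt trunk is joined with the description and Info are all assumptions for illustration; the claim only specifies the output format, the failure conditions, and the order of fallbacks.

```python
import random
import re

# Each entry pairs a question pattern with an answer pattern. The first is
# strict; the second tolerates case differences and numbering (assumed variants).
PATTERN_SETS = [
    (
        re.compile(r"Question:\s*(.*?)\s*(?=Answer:)", re.S),
        re.compile(r"Answer:\s*(.*?)\s*(?=Question:|\Z)", re.S),
    ),
    (
        re.compile(r"Question\s*\d*\s*:\s*(.*?)\s*(?=Answer\s*\d*\s*:)", re.I | re.S),
        re.compile(r"Answer\s*\d*\s*:\s*(.*?)\s*(?=Question\s*\d*\s*:|\Z)", re.I | re.S),
    ),
]


def extract_qa(output):
    """Return (questions, answers), or None if every pattern set fails."""
    for question_pattern, answer_pattern in PATTERN_SETS:
        questions = question_pattern.findall(output)
        answers = answer_pattern.findall(output)
        # A match fails if nothing was found or the two counts differ.
        if questions and len(questions) == len(answers):
            return questions, answers
    return None


def self_ask(ask_model, prompt_trunks, description, info):
    """Try the prompt trunks in random order; None means the image is skipped."""
    for trunk in random.sample(prompt_trunks, k=len(prompt_trunks)):
        prompt = f"{trunk}\n{description}\n{info}"  # assumed way of combining
        result = extract_qa(ask_model(prompt))
        if result is not None:
            return result
    return None
```

Here `ask_model` stands in for whatever call sends a prompt (and the image) to the multimodal large model and returns its text output.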

6. The method for generating sequential text bill image question-answering data based on a multimodal large model according to claim 1, characterized in that using the information of the image data obtained in step 201 and the image content description obtained in step 202 to make the multimodal large model give a secondary answer to the questions raised in step 204, and at the same time infer the basis or reason for the answer, specifically includes:

Answering the questions generated in step 204 in the form of a chain of thought, generating a reasoning basis and a new answer for each question;

The instruction prompt in the secondary answering and reasoning generation stage is obtained by combining the predefined prompt trunk p₃, the detailed description d of the image, the information Info of the image data obtained from the annotations, and the questions Q generated in step 204. The predefined prompt trunk is randomly selected from three instructions with the same semantics but different wording.

If a single self-question and answer output does not meet the format requirements, another instruction prompt is used. If all three instruction prompts fail to generate, the image is skipped.

7. The method for generating sequential text bill image question-answering data based on a multimodal large model according to claim 6, characterized in that the instruction prompt in the secondary answering and reasoning generation stage requires the generated chain of thought to be output item by item in the format "question: [first question] answer: [answer corresponding to the first question] basis: [reasoning basis corresponding to the first question]"; regular expressions are then used to extract the information from the output item by item, and the results are saved as two lists, the reasoning basis Reason and the secondary answer A'.
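A possible way to pull the two lists out of output in the "question / answer / basis" format is shown below. The regular expression and the function name `extract_second_answers` are assumptions for illustration; the claim states only the output format and that regular expressions are used.

```python
import re

# Matches one "question: ... answer: ... basis: ..." item at a time.
COT_PATTERN = re.compile(
    r"question:\s*(.*?)\s*answer:\s*(.*?)\s*basis:\s*(.*?)\s*(?=question:|\Z)",
    re.I | re.S,
)


def extract_second_answers(output):
    """Return the lists (Reason, A') parsed from the model output."""
    reasons, second_answers = [], []
    for _question, answer, basis in COT_PATTERN.findall(output):
        second_answers.append(answer)
        reasons.append(basis)
    return reasons, second_answers
```

An empty pair of lists signals that the output did not follow the required format, which is the case where another instruction prompt would be tried.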

8. The method for generating sequential text bill image question-answering data based on a multimodal large model according to claim 1, characterized in that, based on the information of the image data organized in step 201 and the questions and answers generated in step 204, using the multimodal large model to determine whether each question-answer pair is related to the text information and, if not, marking the question-answer pair as invalid specifically includes:

The instruction prompt used by the multimodal large model to determine whether a question-answer pair is related to the text information is obtained by combining the predefined prompt trunk p₄, the information Info of the image data organized from the annotations, and the questions Q and answers A generated in step 204;

Question-answer pairs judged as "no" are discarded, and those judged as "yes" are retained;

The instruction prompt in the judgment stage requires the generated chain of thought to be output item by item in the format "Judgment: [the judgment corresponding to the i-th question and answer]"; regular expressions are then used to extract the information from the output item by item, and the results are saved as a list Judge.
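The relevance judgment could be parsed and applied roughly as follows. The pattern, the function names, and the length check are assumptions for illustration; the claim specifies only the "Judgment: ..." format, the list Judge, and that pairs judged "no" are discarded.

```python
import re

# Accepts "Judgment: yes" as well as a bracketed form such as "Judgment: [No]".
JUDGMENT_PATTERN = re.compile(r"Judgment:\s*\[?\s*(yes|no)\b", re.IGNORECASE)


def parse_relevance(output):
    """Return the list Judge as lowercase "yes"/"no" strings."""
    return [match.lower() for match in JUDGMENT_PATTERN.findall(output)]


def keep_relevant(questions, answers, judge):
    """Keep only the question-answer pairs judged "yes"."""
    if len(judge) != len(questions):
        # Assumed safeguard: a count mismatch means the output was malformed.
        raise ValueError("number of judgments does not match number of pairs")
    return [(q, a) for q, a, j in zip(questions, answers, judge) if j == "yes"]
```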

9. The method for generating sequential text bill image question-answering data based on a multimodal large model according to claim 1, characterized in that performing the consistency check on the question-answer pairs using the information of the image data obtained in step 201, the image content description obtained in step 202, the questions and answers generated in step 204, and the secondary answers and reasons generated in step 205 specifically includes:

The instruction prompt used for the consistency check of the question-answer pairs is composed of the predefined prompt trunk p₅, the questions Q generated in step 203 and step 204, the answers A, the secondary answers A' generated in step 205, and the reasoning basis Reason;

Delete the questions that are judged as "no";

The instruction prompt in the consistency check stage requires the generated chain of thought to be output item by item in the format "judgment: <True or False>"; regular expressions are then used to extract the information from the output item by item, and the results are saved as a consistency judgment list Judge, where Judge_i = {j_{i,k} ∈ {True, False} | k = 1, 2, ..., Q_num}.
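Composing the consistency-check prompt and parsing its output might look like the sketch below. The line layout of the prompt, the regular expression, and both function names are hypothetical; the claim names only the components of the prompt (p₅, Q, A, A', Reason) and the "judgment: <True or False>" output format.

```python
import re

# Accepts "judgment: True" as well as the bracketed form "judgment: <False>".
CONSISTENCY_PATTERN = re.compile(r"judgment:\s*<?\s*(True|False)\b", re.IGNORECASE)


def build_consistency_prompt(trunk, questions, answers, second_answers, reasons):
    """Combine the prompt trunk with Q, A, A' and Reason, one item per line."""
    lines = [trunk]
    rows = zip(questions, answers, second_answers, reasons)
    for k, (q, a, a2, r) in enumerate(rows, start=1):
        lines.append(f"{k}. question: {q} | answer: {a} | second answer: {a2} | basis: {r}")
    return "\n".join(lines)


def parse_consistency(output):
    """Return the list Judge for one image as booleans, one per question."""
    return [match.lower() == "true" for match in CONSISTENCY_PATTERN.findall(output)]
```

The resulting list has one entry per question, matching j_{i,k} for k = 1 to Q_num when the model output is well formed.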

10. The method for generating sequential text bill image question-answering data based on a multimodal large model according to any one of claims 1 to 9, characterized in that the multimodal large model is a closed-source multimodal large model with its corresponding API, or the multimodal large model is an open-source multimodal large model.