Detailed Description
To make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently some, but not all, embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of protection of the invention.
Text-to-image generation methods in the related art generally convert the descriptive text directly into an image. However, text described in natural language is usually highly abstract and weakly structured, and direct conversion is limited in understanding the spatial relationships and attribute characteristics among entities. As a result, the generated image often fails to reflect the user's intention accurately, which degrades the quality and accuracy of the generated image.
To address the above problems, an embodiment of the invention provides an image generation method, which includes: obtaining a descriptive text of a target image; performing scene analysis on the descriptive text based on a large language model to obtain a scene graph corresponding to the descriptive text; and generating the target image based on a multimodal image generation model by applying the descriptive text and the scene graph. Because the generation refers simultaneously to the text information of the descriptive text and to the spatial and semantic information represented by the scene graph, a more accurate and reasonable image is generated.
The embodiment of the invention can be applied to scenes needing to generate images through texts, such as the fields of content creation, game design, virtual reality, advertisement design and the like. The execution subject of the method may be an electronic device such as a terminal device, a computer, a server cluster, or a specially designed image generating device, or may be an image generating apparatus provided in the electronic device, and the image generating apparatus may be implemented by software, hardware, or a combination of both.
Fig. 1 is a schematic flowchart of an image generation method provided by the present invention. As shown in Fig. 1, the method includes the following steps:
Step 110, a descriptive text of the target image is acquired.
Specifically, the target image is the image that the user desires to generate, and the descriptive text of the target image refers to one or more text descriptions used to guide or prompt a text-to-image generation technique to create the target image. The descriptive text may include, for example, the subject matter and content of the image, its style, colors, scene, objects and their interrelationships, and so forth.
For example, if the target image is an autumn landscape, the descriptive text may contain autumn-related elements such as fallen leaves, golden trees, harvested fields, etc.
Further, to more accurately generate a desired image, descriptive text may also include details of colors, light, shadows, shape and size of objects in the image, and the like.
Step 120, performing scene analysis on the descriptive text based on the large language model to obtain a scene graph corresponding to the descriptive text.
Specifically, in the related art, text described in natural language is usually converted directly into an image. However, such text is usually highly abstract and weakly structured, and direct conversion is limited in understanding the spatial relationships and attribute characteristics among entities. As a result, multiple entities in the generated image may overlap spatially, obscuring the user's intention; the attributes of different entities may be mismatched, so that element characteristics are presented incorrectly in the image; or the spatial relationships among entities may be ignored, so that the generated image fails to reflect the actual logical relationships.
In this embodiment, scene analysis is first performed on the descriptive text by means of the powerful natural language processing capability of the large language model, so as to obtain a scene graph corresponding to the descriptive text.
The large language model (Large Language Model, LLM), or large model for short, refers to a natural language processing (Natural Language Processing, NLP) model with a very large parameter scale, in which the number of model parameters and/or the complexity of the model structure exceeds a preset threshold. Such a model processes large-scale text data during training and is capable of understanding and generating natural language. For example, the large language model may include the Xinghuo (Spark) large model or the like.
A scene graph is a graphical representation that shows the individual elements (e.g., objects, characters, scenes, etc.) mentioned in the descriptive text and the relationships between them. The scene graph is used as a structure for expressing objects and relations thereof in an image, and can better capture space and semantic information in a complex scene to form understanding of the whole scene.
The scene graph may be a directed acyclic graph (Directed Acyclic Graph, DAG) in which each node represents an object and each edge represents a relationship between objects. Each node in the scene graph has a corresponding region in the picture. The relationships in the scene graph are directed, i.e., each relationship has an explicit subject and an explicit object.
To perform scene analysis on the descriptive text, the large language model may be used to understand and analyze the descriptive text and extract key information, and the scene graph corresponding to the descriptive text is then generated based on the key information. The scene graph includes all elements mentioned in the descriptive text, as well as the spatial and logical relationships between them.
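As a minimal sketch of the structure described above, a scene graph can be held as a set of attributed nodes plus directed (subject, relation, object) edges. The class and field names below (`Node`, `Edge`, `SceneGraph`) and the example entities are illustrative assumptions and are not taken from the embodiment.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One object mentioned in the descriptive text."""
    name: str
    attributes: dict = field(default_factory=dict)


@dataclass(frozen=True)
class Edge:
    """A directed relationship: subject --relation--> object."""
    subject: str
    relation: str
    obj: str


@dataclass
class SceneGraph:
    nodes: dict = field(default_factory=dict)   # name -> Node
    edges: list = field(default_factory=list)   # list of Edge

    def add_node(self, name, **attributes):
        self.nodes[name] = Node(name, dict(attributes))

    def add_edge(self, subject, relation, obj):
        # Both endpoints must already exist as nodes.
        if subject not in self.nodes or obj not in self.nodes:
            raise KeyError("edge endpoints must be existing nodes")
        self.edges.append(Edge(subject, relation, obj))


# Hypothetical example: "a red apple on a wooden table"
graph = SceneGraph()
graph.add_node("apple", color="red")
graph.add_node("table", material="wood")
graph.add_edge("apple", "on", "table")
```

Keeping the subject and object as separate fields preserves the direction of each relationship, so "apple on table" is never confused with "table on apple".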
Step 130, generating a target image based on the multimodal image generation model by applying the descriptive text and the scene graph.
Specifically, current image generation models generally take the descriptive text directly as input to generate an image. The descriptive text is usually highly abstract and weakly structured, which limits the understanding of spatial relationships and attribute characteristics between entities. The scene graph, by contrast, makes full use of the various elements (such as objects, characters, scenes and the like) mentioned in the text and the relationships between them, so that the spatial and semantic information in a complex scene can be captured better and an understanding of the whole scene is formed.
The multimodal image generation model is a generation model capable of processing multiple types of inputs (e.g., text, images, etc.) and generating images, and may be based on diffusion model training.
Thus, the target image is generated based on the multimodal image generation model by applying the descriptive text and the scene graph, referring simultaneously to the text information of the descriptive text and to the spatial and semantic information represented by the scene graph, thereby generating a more accurate and reasonable image.
During generation, the descriptive text and the scene graph may be input into a pre-trained multimodal image generation model, which performs multimodal data fusion on the descriptive text and the scene graph and predicts the target image based on the fusion result, thereby outputting the target image. Alternatively, image prediction may be performed separately with the descriptive text and with the scene graph, and the predicted image obtained from the descriptive text and the predicted image obtained from the scene graph may be combined to determine the final target image. The embodiment of the present invention does not specifically limit this.
According to the method provided by the embodiment of the invention, scene analysis is performed on the descriptive text by means of the powerful natural language processing capability of the large language model to obtain a scene graph corresponding to the descriptive text, and the descriptive text and the scene graph are combined to generate the target image. Referring simultaneously to the text information of the descriptive text and to the spatial and semantic information represented by the scene graph improves the quality and semantic consistency of the generated image, so that a more accurate and reasonable image is generated.
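The flow of steps 110 to 130 can be sketched as a small function that takes the two model-dependent stages as parameters. The function `generate_image` and the stand-in callables `fake_parse` and `fake_generate` are hypothetical placeholders chosen so that the sketch runs without any real model; they do not represent the large language model or the multimodal image generation model themselves.

```python
from typing import Any, Callable


def generate_image(
    description: str,
    parse_scene: Callable[[str], Any],
    generate: Callable[[str, Any], Any],
) -> Any:
    """Steps 110-130: descriptive text -> scene graph -> target image."""
    scene_graph = parse_scene(description)      # step 120: scene analysis
    return generate(description, scene_graph)   # step 130: uses text AND graph


# Stand-in callables so the sketch runs without any model.
def fake_parse(text):
    return {"nodes": text.split(" and "), "edges": []}


def fake_generate(text, graph):
    return f"<image of {len(graph['nodes'])} entities: {text}>"


result = generate_image("a cat and a dog", fake_parse, fake_generate)
```

The point of the sketch is only that the generation stage receives both the original text and the scene graph, rather than the text alone.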
Based on any of the above embodiments, performing scene analysis on the descriptive text based on the large language model to obtain the scene graph corresponding to the descriptive text, that is, step 120, specifically includes:
inputting the descriptive text into the large language model, which structures the descriptive text, to obtain a structured text output by the large language model, and parsing the structured text to obtain the scene graph.
Specifically, the descriptive text contains enough information to construct a clear scene, including the objects in the scene, the attributes of the objects, the relationships between the objects, and the like. The descriptive text is input into the large language model, and after receiving the descriptive text, the large language model parses and structures it.
The structuring process specifically comprises entity recognition, attribute extraction, relation extraction and text structuring. Entities in descriptive text, such as people, objects, places, etc., are first identified. The attributes of the entity, such as color, shape, size, location, etc., are extracted. Relationships between entities are identified, such as spatial relationships (up and down, left and right, front and back), temporal relationships (sequential, simultaneous), logical relationships (causal, conditional), etc. The identified entities, attributes and relationships are represented in a structured manner, such as JSON, XML, etc.
After the structured text is obtained, it is parsed to generate a scene graph. Specifically, the scene graph is constructed by taking the entities contained in the structured text and the attributes of the entities as nodes and taking the spatial relationships between the entities as edges.
Nodes are created in the scene graph according to the entities in the structured text, and the entity attributes extracted from the structured text are assigned to the corresponding nodes. Connections between nodes are created in the scene graph based on the relationships identified in the structured text. These connections may be directed (indicating a directional relationship) or undirected (indicating a non-directional relationship).
In addition, the layout of nodes and connections may be optimized according to the complexity and readability of the scene graph. This may include adjusting the position, size, color, etc. of the nodes and adjusting the shape, length, width, etc. of the connections.
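The parsing step can be sketched as follows, assuming the large language model returns JSON. The schema used here (`entities`, `attributes`, `relations`, `subject`, `relation`, `object`) and the sample output are hypothetical; the embodiment only states that a structured format such as JSON or XML may be used.

```python
import json


def parse_structured_text(structured_text: str) -> dict:
    """Turn the LLM's structured (JSON) output into nodes and edges."""
    data = json.loads(structured_text)
    # Each entity becomes a node carrying its extracted attributes.
    nodes = {e["name"]: dict(e.get("attributes", {})) for e in data["entities"]}
    edges = []
    for r in data.get("relations", []):
        # Skip relations that mention an entity the model never declared.
        if r["subject"] in nodes and r["object"] in nodes:
            edges.append((r["subject"], r["relation"], r["object"]))
    return {"nodes": nodes, "edges": edges}


# Hypothetical LLM output for "a red apple on a wooden table".
llm_output = """
{
  "entities": [
    {"name": "apple", "attributes": {"color": "red"}},
    {"name": "table", "attributes": {"material": "wood"}}
  ],
  "relations": [
    {"subject": "apple", "relation": "on", "object": "table"},
    {"subject": "apple", "relation": "next to", "object": "lamp"}
  ]
}
"""
scene_graph = parse_structured_text(llm_output)
```

Dropping relations with undeclared endpoints is one possible design choice; a stricter implementation might instead reject the output and query the model again.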
Based on any of the above embodiments, Fig. 2 is a first schematic flowchart of step 130 in the image generation method provided by the present invention. As shown in Fig. 2, generating the target image based on the multimodal image generation model by applying the descriptive text and the scene graph, that is, step 130, specifically includes:
step 131, extracting text features of the descriptive text and scene graph features of the scene graph respectively;
step 132, taking the text features and the scene graph features as diffusion conditions of the multimodal image generation model to generate the target image.
Specifically, for text feature extraction, the words in the descriptive text are first converted into high-dimensional vector representations using a pre-trained word embedding model (e.g., Word2Vec, GloVe, BERT, etc.). These vectors capture the semantic relationships between words. The sequence of text embedding vectors is then input into a model such as a recurrent neural network (RNN), a long short-term memory network (LSTM), or a Transformer to capture contextual information in the text. The output may be a fixed-length vector representing the features of the entire text, i.e., the text features of the descriptive text. Global features (e.g., overall topic, sentiment) and local features (e.g., attributes and relationships of particular entities) can also be extracted from the text as needed.
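The shape of the data in this step can be illustrated with a toy sketch: a word-embedding lookup followed by mean pooling into a fixed-length vector. The randomly initialised embedding table and the mean pooling are stand-ins assumed for illustration only; they replace the pre-trained embedding model and the RNN/LSTM/Transformer encoder named above and do not capture context the way those models do.

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM = 8
vocab = {}          # word -> row index, grown on demand (toy stand-in)
table = []          # list of embedding vectors


def embed(word):
    """Look up (or create) a toy embedding vector for a word."""
    if word not in vocab:
        vocab[word] = len(table)
        table.append(rng.normal(size=EMBED_DIM))
    return table[vocab[word]]


def text_features(text):
    """Return (per-word vector sequence, fixed-length pooled vector)."""
    words = text.lower().split()
    sequence = np.stack([embed(w) for w in words])   # (num_words, EMBED_DIM)
    pooled = sequence.mean(axis=0)                   # (EMBED_DIM,)
    return sequence, pooled


seq, vec = text_features("golden trees and fallen leaves")
```

The per-word sequence corresponds to local features, while the pooled vector plays the role of a fixed-length feature for the whole text.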
For scene graph feature extraction, the scene graph is first represented as a graph structure, where nodes represent objects and edges represent relationships between objects. Features are extracted for each node (object), including the class, attributes (e.g., color, shape), position information, etc. of the object. Features are extracted for each edge (relationship), including the type of the relationship (e.g., "located at", "connected to"), its directionality, etc. The scene graph may be encoded using a graph neural network (e.g., GCN, GAT, Graph Transformer, etc.) to capture the interactions between nodes and the global graph structure information. The output scene graph features may be an embedding vector for each node or an embedding vector for the entire graph.
After the text features and the scene graph features are extracted, they are fused. This can be achieved by simple concatenation, weighted summation, an attention mechanism, etc. The fused features are used as input to the multimodal image generation model.
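The two simplest fusion options mentioned above, concatenation and weighted summation, can be sketched with NumPy as follows. The function names, the weight `w_text` and the example vectors are assumptions made for illustration.

```python
import numpy as np


def fuse_concat(text_feat, graph_feat):
    """Concatenation: keeps both vectors side by side."""
    return np.concatenate([text_feat, graph_feat])


def fuse_weighted_sum(text_feat, graph_feat, w_text=0.5):
    """Weighted sum: requires both vectors to share a shape."""
    if text_feat.shape != graph_feat.shape:
        raise ValueError("weighted sum needs equal shapes")
    return w_text * text_feat + (1.0 - w_text) * graph_feat


text_feat = np.array([1.0, 2.0, 3.0])
graph_feat = np.array([3.0, 2.0, 1.0])
concat = fuse_concat(text_feat, graph_feat)          # shape (6,)
blended = fuse_weighted_sum(text_feat, graph_feat)   # [2., 2., 2.]
```

Concatenation preserves all information but doubles the dimension, whereas a weighted sum keeps the dimension fixed at the cost of requiring both feature spaces to be aligned.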
The multimodal image generation model employs an image generation model based on a diffusion process (e.g., DDPM, Denoising Diffusion Implicit Models (DDIM), etc.). The model may adopt a U-Net architecture, and the diffusion model generates high-quality images by gradually removing noise.
The fused features are input as the condition of the multimodal image generation model, guiding the model to generate a target image that conforms to the descriptive text and the scene graph. Conditional input may be achieved by modifying the noise prediction network of the model or by adding additional condition embeddings.
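To make the role of the condition concrete, the sketch below shows the standard DDPM forward noising step, x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * noise, with a linear noise schedule, and then one deliberately simple way of handing a condition vector to a noise prediction network. The function `conditioned_input` and its concatenation scheme are hypothetical simplifications; the embodiment does not specify how the condition embedding enters the network, and practical models typically use mechanisms such as cross-attention instead.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)          # linear noise schedule
alpha_bars = np.cumprod(1.0 - betas)        # cumulative signal retention


def q_sample(x0, t, noise):
    """Forward process: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * noise."""
    a = alpha_bars[t]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * noise


def conditioned_input(x_t, t, condition):
    """Toy conditioning: append the timestep and condition vector to the input.

    A real noise prediction network would consume this information in a
    richer way; here we only show that the condition reaches the network.
    """
    return np.concatenate([x_t.ravel(), [t / T], condition])


rng = np.random.default_rng(0)
x0 = rng.normal(size=(4, 4))                # stand-in for an image
noise = rng.normal(size=x0.shape)
x_t = q_sample(x0, t=500, noise=noise)
net_in = conditioned_input(x_t, 500, condition=np.ones(8))
```

During training, the network would be asked to predict `noise` from `net_in`; at generation time the same condition steers each denoising step.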
Based on any of the above embodiments, step 131 specifically includes:
extracting text features of the descriptive text based on a text coding model, and extracting scene graph features of the scene graph based on a graph convolution coding model.
Specifically, the text coding model here is used to extract the text features of the descriptive text, and may be, for example, a recurrent neural network (RNN), a long short-term memory network (LSTM), a gated recurrent unit (GRU), or a Transformer.
The word vector sequence of the descriptive text is input into the text coding model, which is capable of capturing sequence information and contextual dependencies in the text. The output of the text coding model, i.e., the text features, may be a fixed-length vector representing the features of the entire text, or a sequence of vector representations in which each vector corresponds to a part or word of the text.
The scene graph features can be extracted by a graph convolution coding model (Graph Convolutional Network, GCN). The graph convolution coding model is a deep learning model specifically designed to process graph-structured data. It performs convolution operations on the graph to extract the feature information in the graph structure, thereby realizing deep learning on graph data. This network structure can capture complex relationships among nodes and deep structural features of the graph.
The GCN model updates the representation of each node by aggregating the information of its neighboring nodes in the scene graph, thereby capturing local and global information in the graph structure. The scene graph features extracted from the output of the GCN model may include node-level features (e.g., an embedded representation of each node) and graph-level features (e.g., an embedded representation of the entire graph). Node-level features may be used to represent the attributes of a single object or entity, while graph-level features may be used to represent the global information of the entire scene.
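The neighbor aggregation described above can be sketched as a single graph convolution layer in NumPy, using the widely used propagation rule ReLU(D^-1/2 (A + I) D^-1/2 H W). This is a simplified illustration under stated assumptions: the adjacency matrix is treated as undirected, relation types are ignored, the weights are random rather than trained, and mean pooling is assumed for the graph-level feature.

```python
import numpy as np


def gcn_layer(adj, feats, weight):
    """One graph convolution layer: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    a_hat = adj + np.eye(adj.shape[0])              # add self-loops
    deg = a_hat.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    norm = d_inv_sqrt @ a_hat @ d_inv_sqrt          # symmetric normalisation
    return np.maximum(norm @ feats @ weight, 0.0)   # aggregate, project, ReLU


# Hypothetical 3-node scene graph: node 0 -- node 1 -- node 2
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]], dtype=float)
feats = np.eye(3)                                   # one-hot node features
rng = np.random.default_rng(0)
weight = rng.normal(size=(3, 4))

node_features = gcn_layer(adj, feats, weight)       # node-level, shape (3, 4)
graph_feature = node_features.mean(axis=0)          # graph-level, shape (4,)
```

Stacking several such layers lets information travel more than one hop, which is how a node's representation comes to reflect entities it is only indirectly related to.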
Based on any of the above embodiments, Fig. 3 is a second schematic flowchart of step 130 in the image generation method provided by the present invention. As shown in Fig. 3, step 132 specifically includes:
step 132-1, based on the correlation between the text features and the scene graph features, fusing the text features and the scene graph features through a cross-attention mechanism to obtain a fusion feature;
step 132-2, taking the fusion feature as the diffusion condition of the multimodal image generation model to generate the target image.
In particular, fusing text features and scene graph features may be accomplished through a cross-attention mechanism. The attention mechanism is used to calculate the weight of each feature and the features are fused according to the weights. The attention mechanism can dynamically adjust the weights of the features to accommodate different input and task requirements.
Here, the correlation between the text features and the scene graph features may be represented by the similarity between each text feature and each scene graph feature, computed for example by dot product or cosine similarity. In the cross-attention mechanism, one set of features serves as the query set and the other as the key-value set. Based on the similarity results, a weight is assigned to each element of the key-value set, with a higher weight indicating that the element is more relevant to the element in the query set. The weighted key-value set elements are then summed to obtain the final output representation, namely the fusion feature. Through the cross-attention mechanism, the network can dynamically attend to the key information in the descriptive text and the scene graph when generating the image, so that the image generation process is guided jointly by the multimodal information.
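The computation described above corresponds to scaled dot-product attention, softmax(Q K^T / sqrt(d)) V, sketched here in NumPy. Using the text features as queries and the scene graph node features as keys and values is an assumption made for illustration (the roles could equally be reversed), and the learned projection matrices of a full attention layer are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)    # similarity of each query to each key
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ values, weights


rng = np.random.default_rng(0)
text_tokens = rng.normal(size=(5, 16))        # 5 text features (queries)
graph_nodes = rng.normal(size=(3, 16))        # 3 scene graph node features (keys/values)

fused, attn = cross_attention(text_tokens, graph_nodes, graph_nodes)
```

Each row of `attn` states how strongly one text feature attends to each scene graph node, and `fused` holds the resulting weighted sums.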
Based on any of the above embodiments, Fig. 4 is a second schematic flowchart of the image generation method provided by the present invention. As shown in Fig. 4, the method further includes:
step 410, receiving an editing instruction;
step 420, performing scene analysis on the editing instruction based on the large language model, and updating the scene graph based on the analysis result to obtain an updated scene graph;
step 430, generating an updated target image based on the multimodal image generation model and the updated scene graph.
Specifically, the present embodiment also supports an image editing function on top of image generation. The system receives an editing instruction input by the user. The editing instruction is used to adjust entity relationships or attributes in the target image, and may include operations such as adding or deleting objects in the target image, or modifying their positions or attributes. The editing instruction may be received through text input, voice input, or other forms of user interaction. To ensure the accuracy and understandability of the instruction, the system may also provide predefined instruction formats or templates for the user to refer to and select from.
The system analyzes the scene of the editing instruction through the large language model, and identifies the type of operation (such as adding an object, modifying an attribute and the like) which the user wants to perform, and the specific object and target of the operation. And correspondingly updating the scene graph based on the analysis result to obtain an updated scene graph. For example, if a user wants to add a new object, the system may add a new node to the scene graph and update the edges associated therewith to reflect the new relationship. For another example, if a user wants to modify an attribute or location of an object, the system may update the attribute of the node or adjust its location in the scene graph.
The updated scene graph is re-encoded by the graph convolution coding model and input into the multimodal image generation model to generate an image conforming to the editing instruction, namely the updated target image, thereby ensuring that the updated image meets the user's requirements.
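The scene graph update can be sketched as applying a parsed edit operation to a dictionary-based graph. The operation names (`add_object`, `remove_object`, `set_attribute`) and the shape of the `edit` dictionary are hypothetical; they stand for whatever structured result the large language model produces when it analyzes the editing instruction.

```python
import copy


def apply_edit(scene_graph, edit):
    """Return an updated copy of the scene graph; the original is untouched."""
    g = copy.deepcopy(scene_graph)
    op = edit["op"]
    if op == "add_object":
        g["nodes"][edit["name"]] = dict(edit.get("attributes", {}))
        g["edges"].extend(tuple(e) for e in edit.get("edges", []))
    elif op == "remove_object":
        g["nodes"].pop(edit["name"], None)
        # Drop every edge that touched the removed node.
        g["edges"] = [e for e in g["edges"] if edit["name"] not in (e[0], e[2])]
    elif op == "set_attribute":
        g["nodes"][edit["name"]][edit["key"]] = edit["value"]
    else:
        raise ValueError(f"unsupported edit operation: {op}")
    return g


graph = {"nodes": {"apple": {"color": "red"}, "table": {}},
         "edges": [("apple", "on", "table")]}

# Hypothetical parsed instruction: "make the apple green".
updated = apply_edit(graph, {"op": "set_attribute", "name": "apple",
                             "key": "color", "value": "green"})
```

Returning a copy keeps the original scene graph available, which makes it straightforward to compare the images generated before and after the edit.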
The method provided by the embodiment of the invention supports flexible editing of the generated image by means of strong reasoning and generating capacity of the large language model, and the user can adjust the scene according to the requirements, so that the practicability and flexibility of the application are further improved.
Based on any of the above embodiments, fig. 5 is a third flowchart of an image generating method according to the present invention, and as shown in fig. 5, an image generating method is provided, including:
S1, acquiring a descriptive text of a target image. The user inputs a piece of descriptive text (prompt) as the basic information for generating the image. The descriptive text contains the entities, attributes and relationships involved in the scene.
S2, performing scene analysis on the descriptive text based on the large language model to obtain a scene graph corresponding to the descriptive text. This specifically includes:
constructing the scene graph by taking the entities contained in the structured text and the attributes of the entities as nodes and taking the spatial relationships between the entities as edges.
S3, generating a target image based on the multimodal image generation model by applying the descriptive text and the scene graph. This specifically includes:
S31, extracting text features of the descriptive text based on the text coding model. The encoded text features serve as one input condition of the multimodal image generation model and provide the context information of the text to the generation process, so that the generated image meets the semantic requirements of the user's description.
S32, extracting scene graph features of the scene graph based on the graph convolution coding model. The graph convolution coding model (GCN) encodes the scene graph generated by the large language model and converts the entities and relationships in the scene graph into vectorized feature representations, that is, structured entity and relationship representations. Through multi-layer graph convolution operations, the network captures the attributes of each entity in the scene graph and its complex relationships with other entities, ensuring that complex spatial and attribute relationships are expressed clearly.
S33, based on the correlation between the text features and the scene graph features, fusing the text features and the scene graph features through a cross-attention mechanism to obtain a fusion feature, and generating the target image by taking the fusion feature as the diffusion condition of the multimodal image generation model.
Under the guidance of the multi-mode image generation model, the system generates a target image conforming to the description text input by the user. The target image not only reflects the content of the descriptive text, but also contains information such as spatial layout, entity attributes and the like in the scene graph.
Unlike traditional text-to-image diffusion models, the model in this method can accept a text condition embedding and a scene graph condition embedding simultaneously. This embedded information ensures that the model, when generating the image, both conforms to the overall description of the text and accurately presents the entities and relationships depicted in the scene graph.
S4, receiving an editing instruction, performing scene analysis on the editing instruction based on the large language model, updating the scene graph based on the analysis result to obtain an updated scene graph, and generating an updated target image based on the multimodal image generation model and the updated scene graph.
For example, the descriptive text input by the user includes "a tortoise and a rabbit racing". Fig. 6 is a schematic diagram of a scene graph provided by the present invention. The scene graph obtained through scene analysis by the large language model is shown in Fig. 6, in which the circles represent four main entity nodes, the edges between nodes represent positional and spatial relationships, and the boxes show node attributes. Fig. 7 is a schematic diagram of a target image provided by the present invention. As shown in Fig. 7, it is the target image generated by the multimodal image generation model from the scene graph of Fig. 6.
The user then inputs an editing instruction to exchange the positions of the tortoise and the rabbit. The large language model updates the scene graph according to the editing instruction, exchanging the spatial positions of the tortoise node and the rabbit node while keeping all other information in the original scene graph unchanged. The resulting updated scene graph is shown in Fig. 8. Fig. 9 is the updated target image generated by the multimodal image generation model from the updated scene graph. Comparing Fig. 7 with Fig. 9 shows that the positions of the tortoise and the rabbit in Fig. 7 have been exchanged in Fig. 9.
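One way to express the position exchange in this example is to reverse every spatial edge that links the two entities while leaving all other edges untouched. The edge list below is hypothetical, since the actual contents of the scene graph figures are not reproduced in the text, and `swap_positions` is an illustrative helper rather than the operation performed by the large language model.

```python
def swap_positions(edges, a, b):
    """Reverse every spatial edge that links a and b; leave other edges as-is."""
    updated = []
    for subject, relation, obj in edges:
        if {subject, obj} == {a, b}:
            updated.append((obj, relation, subject))   # exchange the two roles
        else:
            updated.append((subject, relation, obj))
    return updated


# Hypothetical spatial edges for the race scene.
edges = [("rabbit", "in front of", "tortoise"),
         ("tortoise", "on", "track")]

updated_edges = swap_positions(edges, "tortoise", "rabbit")
```

Only the relationship between the two swapped entities changes, which mirrors the requirement that the remaining information in the scene graph stay the same.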
According to the method provided by the embodiment of the invention, the attributes and spatial relationships of the entities are described explicitly through the scene graph representation, which avoids the confused relationships and positions found in images generated by the related art. By adjusting the nodes and edges in the scene graph, this embodiment allows the user to edit the generated image afterwards, and therefore offers high flexibility and practicability.
The image generating apparatus provided by the present invention will be described below, and the image generating apparatus described below and the image generating method described above may be referred to correspondingly to each other.
Based on any of the above embodiments, fig. 10 is a schematic structural diagram of an image generating apparatus provided by the present invention, and as shown in fig. 10, the image generating apparatus includes:
a text acquisition unit 1010, configured to acquire a descriptive text of the target image;
a scene analysis unit 1020, configured to perform scene analysis on the descriptive text based on a large language model to obtain a scene graph corresponding to the descriptive text;
an image generating unit 1030, configured to generate the target image based on a multimodal image generation model by applying the descriptive text and the scene graph.
According to the apparatus provided by the embodiment of the invention, scene analysis is performed on the descriptive text by means of the powerful natural language processing capability of the large language model to obtain a scene graph corresponding to the descriptive text, and the descriptive text and the scene graph are combined to generate the target image. Because the generation refers simultaneously to the text information of the descriptive text and to the spatial and semantic information represented by the scene graph, a more accurate and reasonable image is generated.
Based on any of the above embodiments, the scene parsing unit is specifically configured to:
and inputting the description text into the large language model, structuring the description text to obtain a structured text output by the large language model, and analyzing the structured text to obtain the scene graph.
Based on any of the above embodiments, the scene parsing unit is specifically configured to:
and constructing the scene graph by taking the entities contained in the structured text and the attributes of the entities as nodes and the spatial relationship between the entities as edges.
Based on any of the above embodiments, the image generating unit is specifically configured to:
respectively extracting text features of the descriptive text and scene graph features of the scene graph;
taking the text features and the scene graph features as diffusion conditions of the multimodal image generation model to generate the target image.
Based on any of the above embodiments, the image generating unit is specifically configured to:
extracting text features of the descriptive text based on a text coding model;
extracting scene graph features of the scene graph based on a graph convolution coding model.
Based on any of the above embodiments, the image generating unit is specifically configured to:
based on the correlation between the text feature and the scene graph feature, fusing the text feature and the scene graph feature through a cross attention mechanism to obtain a fused feature;
and taking the fusion characteristic as a diffusion condition of the multi-mode image generation model to generate the target image.
Based on any of the above embodiments, the apparatus further includes an image editing unit configured to:
receiving an editing instruction;
performing scene analysis on the editing instruction based on the large language model, and updating the scene graph based on the analysis result to obtain an updated scene graph;
and generating an updated target image based on the multi-modal image generation model and the updated scene graph.
Fig. 11 illustrates a schematic diagram of the physical structure of an electronic device. As shown in Fig. 11, the electronic device may include a processor 1110, a communication interface (Communications Interface) 1120, a memory 1130, and a communication bus 1140, where the processor 1110, the communication interface 1120, and the memory 1130 communicate with each other through the communication bus 1140. The processor 1110 may invoke logic instructions in the memory 1130 to perform an image generation method that includes: obtaining a descriptive text of a target image; performing scene analysis on the descriptive text based on a large language model to obtain a scene graph corresponding to the descriptive text; and generating the target image based on a multimodal image generation model by applying the descriptive text and the scene graph.
Further, the logic instructions in the memory 1130 described above may be implemented in the form of software functional units and, when sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and comprises several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), a magnetic disk, an optical disk, or other media capable of storing program code.
In another aspect, the invention further provides a computer program product comprising a computer program that can be stored on a non-transitory computer-readable storage medium. When executed by a processor, the computer program performs the image generation method provided by the above methods, the method including: obtaining a descriptive text of a target image; performing scene analysis on the descriptive text based on a large language model to obtain a scene graph corresponding to the descriptive text; and generating the target image based on a multimodal image generation model by applying the descriptive text and the scene graph.
In yet another aspect, the present invention further provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the image generation method provided by the above methods, the method including: obtaining a descriptive text of a target image; performing scene analysis on the descriptive text based on a large language model to obtain a scene graph corresponding to the descriptive text; and generating the target image based on a multimodal image generation model by applying the descriptive text and the scene graph.
The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the present invention without inventive effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
It should be noted that the above-mentioned embodiments are merely for illustrating the technical solution of the present invention, and not for limiting the same, and although the present invention has been described in detail with reference to the above-mentioned embodiments, it should be understood by those skilled in the art that the technical solution described in the above-mentioned embodiments may be modified or some technical features may be equivalently replaced, and these modifications or substitutions do not make the essence of the corresponding technical solution deviate from the spirit and scope of the technical solution of the embodiments of the present invention.