CN117786193A

CN117786193A - A method and device for generating multimedia information, and a computer-readable storage medium

Info

Publication number: CN117786193A
Application number: CN202211139046.2A
Authority: CN
Inventors: 张政; 刘银星; 阮涛; 吕晶晶
Original assignee: Beijing Wodong Tianjun Information Technology Co Ltd
Current assignee: Beijing Wodong Tianjun Information Technology Co Ltd
Priority date: 2022-09-19
Filing date: 2022-09-19
Publication date: 2024-03-29
Also published as: WO2024061073A1

Abstract

Embodiments of the invention provide a method and a device for generating multimedia information, and a computer-readable storage medium. The method comprises: recalling item information and content information in response to a received browsing request; extracting features based on the item information and the content information to obtain item features corresponding to an item dimension and content features corresponding to a content dimension, and performing collaboration and fusion on the item features and the content features to obtain a plurality of groups of fusion features; estimating the plurality of groups of fusion features through a preset recommendation model, and selecting the target item information and target content information corresponding to the group of fusion features with the highest estimated value; and generating target multimedia information based on the target item information and the target content information. In this scheme, features are extracted from the item information and the content information and then combined into a plurality of fusion features, so that the generated target multimedia information is diverse and achieves a good recommendation effect.

Description

Method and device for generating multimedia information and computer readable storage medium

Technical Field

The present invention relates to the field of computer vision, and in particular, to a method and apparatus for generating multimedia information, and a computer readable storage medium.

Background

In current e-commerce advertising systems, the recall of commodities of interest is usually performed at commodity granularity. The system selects a candidate set of commodities for the current user according to the user's historical behaviors, such as browsing, searching, purchasing, and adding to the shopping cart; the optimal commodity is then selected by the ranking model of the advertising system, and related advertisements are generated once the commodity is determined. At present, advertisements are usually generated from templates: the main image of the commodity is substituted into a template, and the corresponding commodity advertisement is rendered. Although such generation is automated, the resulting advertisement recommendations lack variety.

Disclosure of Invention

Embodiments of the invention provide a method and a device for generating multimedia information, and a computer-readable storage medium, which can generate corresponding target multimedia information according to item information and content information, so that the generated information is diverse and achieves a good recommendation effect.

The technical scheme of the invention is realized as follows:

the embodiment of the invention provides a method for generating multimedia information, which comprises the following steps:

recalling item information and content information in response to a received browsing request;

extracting features based on the item information and the content information to obtain item features corresponding to an item dimension and content features corresponding to a content dimension, and performing collaboration and fusion on the item features and the content features to obtain a plurality of groups of fusion features; each group of fusion features characterizes a fusion with different items under a different combination of content modalities;

estimating the plurality of groups of fusion features through a preset recommendation model, and selecting the target item information and target content information corresponding to the group of fusion features with the highest estimated value; the preset recommendation model is used to screen the fusion features;

and generating target multimedia information based on the target item information and the target content information.

In the above scheme, the feature extraction is performed based on the article information and the content information to obtain an article feature corresponding to an article dimension and a content feature corresponding to a content dimension, including:

extracting features of the article information to obtain the article features corresponding to the article dimensions;

identifying the content information to obtain content information corresponding to the content multi-mode type; the content multi-modal type comprises at least two modes of text information, image information and image sequence information;

and extracting the characteristics of the content information corresponding to the content multi-mode type to obtain the content characteristics corresponding to the content dimension.

In the above solution, the extracting the characteristics of the content information corresponding to the content multi-mode type to obtain the content characteristics corresponding to the content dimension includes: if the content multi-mode type is a text type, extracting the characteristics of the text information through a first coding mode to obtain text characteristics;

If the content multi-mode type is an image type or an image sequence type, respectively extracting features of the image information and the image sequence information through a second coding mode to obtain image features and behavior features;

and determining the content characteristics corresponding to the content dimension according to at least one of the text characteristics, the image characteristics and the behavior characteristics.

In the above scheme, if the content multi-mode type is a text type, extracting features of the text information by a first coding mode to obtain text features, including:

if the content multi-modal type is a text type, extracting features of the text information to obtain text initial features; the text initial characteristics comprise semantic expression information and word information;

and carrying out coding processing on the initial text features by the first coding mode to obtain the text features.

In the above solution, if the content multi-mode type is an image type or an image sequence type, respectively extracting features of the image information and the image sequence information by a second coding mode to obtain image features and behavior features, including:

If the content multi-mode type is an image type, extracting features of the image information to obtain initial features of the image; the image initial characteristics comprise scene information, content information and style information;

if the content multi-mode type is an image sequence type, extracting features of the image sequence information to obtain behavior initial features; the behavior initial characteristics comprise main body target information and key frame information;

and respectively carrying out coding processing on the image initial feature and the behavior initial feature through the second coding mode to obtain the image feature and the behavior feature.

In the above scheme, the step of performing collaboration and fusion on the object features and the content features to obtain a plurality of groups of fusion features includes:

carrying out cooperative processing on the object features and the content features to obtain first object features and first content features with the same probability distribution; the first item feature comprises a plurality of first sub-item features; the first content feature comprises a plurality of first sub-content features;

randomly combining the plurality of first sub-article features to obtain a plurality of article combination features;

randomly combining the plurality of first sub-content features to obtain a plurality of content combination features; the content combination features include content features corresponding to at least two content multi-modal types;

And fusing the article combination features and the content combination features to obtain the multiple groups of fusion features.
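The random combination and fusion steps above can be sketched as follows. This is a minimal illustration rather than the claimed implementation: it assumes the sub-features have already been mapped to the same space, that each content sub-feature comes from a different modality, and that pooling by averaging followed by concatenation is an acceptable stand-in for fusion. The names `random_combinations`, `mean_pool`, and `fuse` are hypothetical.

```python
import random
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def random_combinations(count: int, group_size: int, n_groups: int,
                        rng: random.Random) -> List[Tuple[int, ...]]:
    """Draw n_groups index tuples, each picking group_size distinct indices."""
    return [tuple(sorted(rng.sample(range(count), group_size)))
            for _ in range(n_groups)]


def mean_pool(vectors: Sequence[Vector]) -> Vector:
    """Average several same-length vectors element by element."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def fuse(item_subs: Sequence[Vector], content_subs: Sequence[Vector],
         n_groups: int = 4, seed: int = 0) -> List[Dict]:
    """Build fusion groups from random item and content combinations."""
    rng = random.Random(seed)
    item_combos = random_combinations(len(item_subs), 1, n_groups, rng)
    # Two content sub-features per combination, assuming each sub-feature
    # belongs to a different modality (text, image, image sequence).
    content_combos = random_combinations(len(content_subs), 2, n_groups, rng)
    groups = []
    for item_idx, content_idx in zip(item_combos, content_combos):
        item_vec = mean_pool([item_subs[i] for i in item_idx])
        content_vec = mean_pool([content_subs[c] for c in content_idx])
        # Fusion is modelled here as plain concatenation (an assumption).
        groups.append({"items": item_idx, "contents": content_idx,
                       "vector": item_vec + content_vec})
    return groups
```

Keeping the chosen indices next to each fused vector makes it straightforward to recover which items and contents a group was built from.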

In the above scheme, the predicting the multiple groups of fusion features through a preset recommendation model, selecting target object information and target content information corresponding to a group of fusion features with highest predicted values, includes:

inputting the multiple groups of fusion features into the preset recommendation model for pre-estimation to obtain first pre-estimation values corresponding to the multiple groups of fusion features;

selecting a group of fusion features with highest predicted values from the multiple groups of fusion features based on the multiple first predicted values;

and decoding the group of fusion features to obtain the target object information and the target content information.
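The estimate, select, and decode steps above can be sketched as follows. The scoring function is a hypothetical stand-in (a dot product with fixed weights) for the preset recommendation model, and "decoding" is reduced to reading back identifiers stored with each group; the field names `item_id` and `content_ids` are assumptions made for illustration.

```python
from typing import Dict, List, Sequence


def estimate(vector: Sequence[float], weights: Sequence[float]) -> float:
    """Hypothetical stand-in for the preset recommendation model."""
    return sum(v * w for v, w in zip(vector, weights))


def select_target(groups: List[Dict], weights: Sequence[float]) -> Dict:
    """Score every fusion group and return the one with the highest estimate."""
    scored = [(estimate(group["vector"], weights), group) for group in groups]
    best_score, best_group = max(scored, key=lambda pair: pair[0])
    # Decoding is modelled as looking up the ids kept alongside the vector.
    return {
        "score": best_score,
        "item_id": best_group["item_id"],
        "content_ids": best_group["content_ids"],
    }
```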

In the above solution, the generating the target multimedia information based on the target item information and the target content information includes:

generating the layout of the target object information and the target content information through a preset layout generation model to obtain a plurality of layouts; the preset layout generation model characterizes the adjustment of the layout through the objects and the contents;

Evaluating the multiple layouts through an evaluation model to determine candidate layouts; the evaluation model is used for evaluating and screening the layout;

selecting an optimal layout from the candidate layouts through a layout preference model;

and generating the target multimedia information based on the optimal layout, the target item information and the target content information.
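The evaluate, filter, and prefer sequence above can be sketched as a small helper. The evaluation model and the layout preference model are passed in as plain callables, because the source does not specify their form; `choose_layout` and its parameter names are hypothetical.

```python
from typing import Callable, Optional, Sequence, TypeVar

LayoutT = TypeVar("LayoutT")


def choose_layout(layouts: Sequence[LayoutT],
                  passes_evaluation: Callable[[LayoutT], bool],
                  preference_score: Callable[[LayoutT], float]) -> Optional[LayoutT]:
    """Keep layouts that pass evaluation, then pick the most preferred one."""
    candidates = [layout for layout in layouts if passes_evaluation(layout)]
    if not candidates:
        return None  # no layout survived the evaluation step
    return max(candidates, key=preference_score)
```

Separating the pass/fail evaluation from the preference ranking mirrors the two-stage structure described above: the first stage removes unusable layouts, and the second stage orders what remains.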

In the above solution, the generating, by using a preset layout generating model, the layout of the target object and the target content to obtain a plurality of layouts includes:

generating an initialization layout corresponding to the target object information and the target content information through a preset layout generation model; the preset layout generation model comprises a sequential stacking sequence of image layers and a constraint of a text size range in text information;

adjusting the initialized layout through an adjustment rule to determine the multiple layouts; the adjustment rule is obtained by continuously training by taking the preference degree of the object as an incentive.
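A minimal sketch of initial layout generation under the two stated constraints (layer stacking order and text size range) is given below. The source describes an adjustment rule trained with the object's preference as the incentive; this sketch swaps in a plain random perturbation instead, and the concrete layer names, size range, and field names are assumptions.

```python
import random
from dataclasses import dataclass, replace
from typing import List, Tuple

LAYER_ORDER: Tuple[str, ...] = ("background", "item_image", "text")  # assumed order
TEXT_SIZE_RANGE: Tuple[int, int] = (12, 48)                          # assumed range


@dataclass(frozen=True)
class Layout:
    layers: Tuple[str, ...]  # bottom-to-top stacking order
    text_size: int           # font size of the text layer
    text_x: float            # normalized horizontal position, 0..1
    text_y: float            # normalized vertical position, 0..1


def clamp(value, low, high):
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


def initial_layout() -> Layout:
    """Initial layout that already satisfies both constraints."""
    return Layout(LAYER_ORDER, 24, 0.5, 0.8)


def adjust(layout: Layout, rng: random.Random) -> Layout:
    """Random perturbation standing in for the trained adjustment rule."""
    size = clamp(layout.text_size + rng.randint(-6, 6), *TEXT_SIZE_RANGE)
    x = clamp(layout.text_x + rng.uniform(-0.2, 0.2), 0.0, 1.0)
    y = clamp(layout.text_y + rng.uniform(-0.2, 0.2), 0.0, 1.0)
    return replace(layout, text_size=size, text_x=x, text_y=y)


def generate_layouts(count: int, seed: int = 0) -> List[Layout]:
    """Produce several layouts by adjusting the same initial layout."""
    rng = random.Random(seed)
    base = initial_layout()
    return [adjust(base, rng) for _ in range(count)]
```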

In the above aspect, before the evaluating the multiple layouts through the evaluation model and determining the candidate layout, the method further includes:

Acquiring historical target multimedia information;

identifying the historical target multimedia information to obtain a historical layout; the historical layout includes positive sample data and negative sample data;

and training an initial evaluation model through the positive sample data and the negative sample data to determine the evaluation model.

In the above solution, the evaluating the multiple layouts through the evaluation model to determine candidate layouts includes:

evaluating the multiple layouts through the evaluation model to obtain evaluation results corresponding to the multiple layouts respectively;

and if the evaluation result is characterized as successful, the corresponding layout is used as the candidate layout.
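Training the evaluation model on positive and negative layout samples, and then using it as a pass/fail filter, can be sketched as below. The source does not name the model type; logistic regression trained by stochastic gradient descent is used here purely as an assumption, and the two-element layout vectors (for example, overlap ratio and margin ratio) are hypothetical features.

```python
import math
from typing import List, Sequence, Tuple

Model = Tuple[List[float], float]  # (weights, bias)


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def score(vector: Sequence[float], model: Model) -> float:
    """Probability that a layout vector is a good layout."""
    weights, bias = model
    return sigmoid(bias + sum(w * v for w, v in zip(weights, vector)))


def train_evaluator(positives: Sequence[Sequence[float]],
                    negatives: Sequence[Sequence[float]],
                    epochs: int = 200, lr: float = 0.5) -> Model:
    """Fit a logistic-regression evaluator with plain SGD."""
    data = [(x, 1.0) for x in positives] + [(x, 0.0) for x in negatives]
    weights = [0.0] * len(data[0][0])
    bias = 0.0
    for _ in range(epochs):
        for x, label in data:
            error = score(x, (weights, bias)) - label
            weights = [w - lr * error * v for w, v in zip(weights, x)]
            bias -= lr * error
    return weights, bias


def evaluation_succeeds(vector: Sequence[float], model: Model,
                        threshold: float = 0.5) -> bool:
    """A layout becomes a candidate when its score reaches the threshold."""
    return score(vector, model) >= threshold
```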

The embodiment of the invention provides a generation device of multimedia information, which comprises an acquisition unit, a selection unit and a generation unit; wherein,

the acquisition unit is used for recalling the article information and the content information in response to the received browsing request; extracting features based on the article information and the content information to obtain article features corresponding to article dimensions and content features corresponding to content dimensions, and carrying out synergy and fusion on the article features and the content features to obtain a plurality of groups of fusion features; each group of fusion features characterizes fusion with different articles under different content mode combinations;

the selecting unit is used for estimating the plurality of groups of fusion features through a preset recommendation model and selecting the target item information and target content information corresponding to the group of fusion features with the highest estimated value; the preset recommendation model is used to screen the fusion features;

the generation unit is used for generating target multimedia information based on the target object information and the target content information.

The embodiment of the invention provides a device for generating multimedia information, which comprises:

a memory for storing executable instructions;

and the processor is used for executing the executable instructions stored in the memory, and when the executable instructions are executed, the processor executes the method for generating the multimedia information.

An embodiment of the present invention provides a computer-readable storage medium, where executable instructions are stored, and when the executable instructions are executed by one or more processors, the processors execute the method for generating multimedia information.

Embodiments of the invention provide a method and a device for generating multimedia information, and a computer-readable storage medium. The method comprises: recalling item information and content information in response to a received browsing request; extracting features based on the item information and the content information to obtain item features corresponding to an item dimension and content features corresponding to a content dimension, and performing collaboration and fusion on the item features and the content features to obtain a plurality of groups of fusion features, where each group of fusion features characterizes a fusion with different items under a different combination of content modalities; estimating the plurality of groups of fusion features through a preset recommendation model, and selecting the target item information and target content information corresponding to the group of fusion features with the highest estimated value, where the preset recommendation model is used to screen the fusion features; and generating target multimedia information based on the target item information and the target content information.

In this scheme, the server first represents the item information and the content information as vectors to obtain the item features and the content features, and converts the item features and content features, which lie in different spaces, into vectors in the same space for fusion, obtaining a plurality of groups of fusion features. Because each fusion feature spans two dimensions, the resulting fusion features are diverse. Second, the server estimates the plurality of groups of fusion features with the preset recommendation model to obtain a plurality of estimated values. The higher the estimated value, the better the diversity of the fusion features; accordingly, the target item information and target content information corresponding to the group of fusion features with the highest estimated value are the most diverse, so the target multimedia information generated from them is diverse. Finally, the method improves the diversity of the target multimedia information, thereby providing personalized recommendations for the user and improving the recommendation effect.

Drawings

Fig. 1 is a first schematic flowchart of an optional method for generating multimedia information according to an embodiment of the present invention;

Fig. 2 is a second schematic flowchart of an optional method for generating multimedia information according to an embodiment of the present invention;

Fig. 3 is a third schematic flowchart of an optional method for generating multimedia information according to an embodiment of the present invention;

Fig. 4 is a fourth schematic flowchart of an optional method for generating multimedia information according to an embodiment of the present invention;

Fig. 5 is a fifth schematic flowchart of an optional method for generating multimedia information according to an embodiment of the present invention;

Fig. 6 is a sixth schematic flowchart of an optional method for generating multimedia information according to an embodiment of the present invention;

Fig. 7 is a seventh schematic flowchart of an optional method for generating multimedia information according to an embodiment of the present invention;

Fig. 8 is an eighth schematic flowchart of an optional method for generating multimedia information according to an embodiment of the present invention;

Fig. 9 is a ninth schematic flowchart of an optional method for generating multimedia information according to an embodiment of the present invention;

Fig. 10 is a first schematic structural diagram of a device for generating multimedia information according to an embodiment of the present invention;

Fig. 11 is a second schematic structural diagram of a device for generating multimedia information according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Evidently, the described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on these embodiments without creative effort fall within the protection scope of the present invention.

In order to better understand the aspects of the present invention, the present invention will be described in further detail with reference to the accompanying drawings and detailed description. Fig. 1 is a schematic flow chart of an alternative method for generating multimedia information according to an embodiment of the present invention, and will be described with reference to the steps shown in fig. 1.

S101, recalling the article information and the content information in response to the received browsing request.

In some embodiments of the present invention, the item information covers all items to be recommended by the terminal. The item information includes a plurality of pieces of sub-item information; the content information includes a plurality of pieces of sub-content information. The browsing request is formed when a user enters the item information to be browsed in a search box on a browsing page of an application or on a web browsing page. For example, after a user enters "photo frame" in the search box on a browsing page of a shopping platform, a request for browsing photo frames is formed.

In some embodiments of the present invention, the server receives a browse request sent by the terminal, and recalls the item information and the content information from the item library and the content library according to the historical browse information of the object in response to the browse request.

For example, as shown in fig. 2, according to the historical browsing information of the object (equivalent to an actor), a plurality of commodities in the commodity library (equivalent to the commodity) are input into a Deep & Cross Network (DCN) for extraction to obtain a plurality of pieces of commodity information; the browsed pictures are input into a convolutional neural network (CNN) for extraction to obtain content information carrying item information; and the item information is removed from the content information carrying the item information to obtain a Click (equivalent to the content information).
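As a rough illustration of the recall step, the sketch below filters an item library and a content library by the categories found in the browsing history. It does not reproduce the DCN or CNN extraction described above; simple category matching is swapped in, and the record fields (`category`, `id`) and function names are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def recall(history_categories: Set[str], library: List[Dict],
           limit: int = 10) -> List[Dict]:
    """Return up to `limit` library entries whose category was browsed before."""
    matches = [entry for entry in library
               if entry["category"] in history_categories]
    return matches[:limit]


def handle_browse_request(history: List[Dict], item_library: List[Dict],
                          content_library: List[Dict]) -> Tuple[List[Dict], List[Dict]]:
    """Recall item information and content information for one request."""
    categories = {record["category"] for record in history}
    return recall(categories, item_library), recall(categories, content_library)
```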

S102, extracting characteristics of the articles and the contents based on the article information and the content information to obtain article characteristics corresponding to the article dimension and content characteristics corresponding to the content dimension, and carrying out cooperation and fusion on the article characteristics and the content characteristics to obtain a plurality of groups of fusion characteristics.

In some embodiments of the invention, each group of fusion features characterizes a fusion with different items under a different combination of content modalities. Collaboration processes a plurality of vectors in different vector spaces and maps them into the same vector space so that they follow the same probability distribution; fusion combines a plurality of vectors in the same space in different ways to form fusion vectors. In the present invention, the vectors may be item features and content features. Collaboration is performed before fusion; fusion can be performed only after the collaboration processing is complete. The item features are the features exhibited by an item, and the content features are obtained by extracting features from the picture description, the video description, and the text description of an item.
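One way to picture the collaboration step is shown below: two vectors of different lengths are projected into a shared dimension and then standardized to zero mean and unit variance so that they are comparable before fusion. The source does not specify the mapping; the fixed random projection and the standardization are assumptions, and a learned mapping would presumably be used in practice. All function names are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple


def random_projection(in_dim: int, out_dim: int, seed: int) -> List[List[float]]:
    """Fixed random matrix mapping in_dim inputs to out_dim outputs."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) / math.sqrt(in_dim) for _ in range(in_dim)]
            for _ in range(out_dim)]


def project(matrix: List[List[float]], vector: Sequence[float]) -> List[float]:
    """Matrix-vector product."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def standardize(vector: Sequence[float]) -> List[float]:
    """Shift and scale to zero mean and unit variance."""
    mean = sum(vector) / len(vector)
    variance = sum((v - mean) ** 2 for v in vector) / len(vector)
    std = math.sqrt(variance) or 1.0  # avoid division by zero
    return [(v - mean) / std for v in vector]


def collaborate(item_vec: Sequence[float], content_vec: Sequence[float],
                shared_dim: int = 8) -> Tuple[List[float], List[float]]:
    """Map both vectors into one space with matching first two moments."""
    item_matrix = random_projection(len(item_vec), shared_dim, seed=1)
    content_matrix = random_projection(len(content_vec), shared_dim, seed=2)
    return (standardize(project(item_matrix, item_vec)),
            standardize(project(content_matrix, content_vec)))
```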

By way of example, the item information may be the attribute characteristics of the item; the content information may be images and text other than the item attributes, that is, creative content that advertises the item.

In some embodiments of the present invention, the server may perform feature extraction on the item information to obtain the item features corresponding to the item dimension; identify the content information to obtain the content multi-modal type; and extract features from the content information corresponding to the content multi-modal type to obtain the content features corresponding to the content dimension. The server then performs collaboration processing on the item features and the content features to obtain first item features and first content features with the same probability distribution, and fuses the first item features and the first content features to obtain a plurality of groups of fusion features.

In some embodiments of the present invention, Fig. 3 is a third schematic flowchart of an optional method for generating multimedia information according to an embodiment of the present invention. As shown in Fig. 3, extracting features based on the item information and the content information to obtain the item features corresponding to the item dimension and the content features corresponding to the content dimension may be implemented through S1021-S1023, as follows:

S1021, extracting features of the item information to obtain item features corresponding to the item dimension.

In some embodiments of the present invention, the server may convert the item information into the feature in the form of a vector by extracting the feature of the item information, so as to obtain the item feature corresponding to the item dimension. The item feature is a 1024-dimensional floating point array.
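To make the 1024-dimensional floating-point array concrete, the sketch below turns item attributes into such a vector with the hashing trick. The source does not say how the item feature is computed, so feature hashing is an assumption, and `item_feature` and the attribute keys are hypothetical.

```python
import hashlib
from typing import Dict, List

ITEM_DIM = 1024  # dimension stated in the text


def item_feature(attributes: Dict[str, str], dim: int = ITEM_DIM) -> List[float]:
    """Hash each key=value attribute into a signed slot of a dim-sized vector."""
    vector = [0.0] * dim
    for key, value in attributes.items():
        digest = hashlib.md5(f"{key}={value}".encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % dim
        sign = 1.0 if digest[4] % 2 == 0 else -1.0
        vector[index] += sign
    return vector
```

The hashing trick keeps the output size fixed regardless of how many distinct attribute values exist, at the cost of occasional collisions.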

And S1022, identifying the content information to obtain the content information corresponding to the content multi-mode type.

In some embodiments of the present invention, the server may identify the content information through a neural network model to obtain the content information corresponding to the content multi-modal type, where the content multi-modal type includes at least two of text information, image information, and image sequence information. A neural network (NN) is a complex network system formed by a large number of simple processing units (called neurons) that are widely interconnected; it reflects many basic features of human brain function and is a highly complex nonlinear dynamic learning system. Neural networks offer massive parallelism, distributed storage and processing, self-organization, adaptivity, and self-learning, and are particularly suited to imprecise and ambiguous information-processing problems in which many factors and conditions must be considered simultaneously.

S1023, extracting the characteristics of the content information corresponding to the content multi-mode type to obtain the content characteristics corresponding to the content dimension.

In some embodiments of the present invention, the server may perform corresponding feature extraction according to the content multimodal type. If the content multi-mode type is the text type, performing feature extraction processing on the text information through a first coding mode to obtain text features. If the content multi-mode type is the image type or the image sequence type, respectively carrying out feature extraction processing on the image information and the image sequence information through a second coding mode to obtain image features and behavior features. And determining the content characteristics corresponding to the content dimension according to the text characteristics, the image characteristics and the behavior characteristics. The object acted by the first coding mode is mainly aimed at text information; the main objects to which the second encoding scheme is applied are image information and image sequence information. For example, the image information may be an image and the image sequence information may be a video.
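The modality-dependent branching described above can be sketched as a small dispatcher. The two encoders are passed in as callables because their internals are covered elsewhere; the record fields `type` and `data` and the function name are hypothetical.

```python
from typing import Any, Callable, Dict, List

Encoder = Callable[[Any], List[float]]


def extract_content_features(contents: List[Dict[str, Any]],
                             first_encoding: Encoder,
                             second_encoding: Encoder) -> Dict[str, List[float]]:
    """Route each piece of content to the encoder for its modality."""
    features: Dict[str, List[float]] = {}
    for content in contents:
        kind = content["type"]
        if kind == "text":
            features["text"] = first_encoding(content["data"])
        elif kind == "image":
            features["image"] = second_encoding(content["data"])
        elif kind == "image_sequence":
            # Features from image sequences are called behavior features.
            features["behavior"] = second_encoding(content["data"])
        else:
            raise ValueError(f"unknown content type: {kind}")
    return features
```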

It can be understood that the server performs feature extraction on the article information, and performs vectorization representation on the article information to obtain article features corresponding to the articles; identifying the content information to obtain a content multi-mode type; and extracting the characteristics of the content information corresponding to the content multi-mode type to obtain the content characteristics corresponding to the content dimension. Because the object features and the content features belong to features under different dimensions, the server obtains the multi-dimensional features, and when the target multimedia information is generated based on the multi-dimensional features, the target multimedia information has multi-dimensional information, so that the target multimedia information has diversity.

In some embodiments of the present invention, fig. 4 is a schematic flowchart of an alternative method for generating multimedia information according to an embodiment of the present invention, as shown in fig. 4, S1023 may be implemented by S201-S203, as follows:

S201, if the content multi-modal type is a text type, extracting features of the text information through a first coding mode to obtain text features.

In some embodiments of the invention, the server performs feature extraction on the text information according to the content multi-mode type as the text type to obtain text initial features; and carrying out coding processing on the initial text features by a first coding mode to obtain the text features.

Note that the text feature is a text initial feature in the form of a vector.

In some embodiments of the present invention, S201 may be implemented through S2011-S2012 as follows:

S2011, if the content multi-modal type is a text type, extracting features of the text information to obtain text initial features.

In some embodiments of the invention, the text initial feature includes semantic expression information and word information.

In some embodiments of the present invention, when the content multi-modal type is the text type, the server performs feature extraction on the text information to obtain semantic expression information and word information. Both the semantic expression information and the word information are text initial features.

For example, Fig. 5 is a fifth schematic flowchart of an optional method for generating multimedia information according to an embodiment of the present invention. As shown in Fig. 5, the server obtains a semantic expression (corresponding to the semantic expression information) and word segmentation (corresponding to the word information) by extracting features of the text information. Specifically, the semantic expression is obtained by means of BERT.

S2012, coding the initial text features by a first coding mode to obtain the text features.

In some embodiments of the present invention, the server encodes the initial text feature by using a first encoding method, so as to obtain a vectorized text feature.

For example, as shown in fig. 5, the first encoding mode is concatenation (ConCat): the server concatenates the semantic expression (corresponding to the semantic expression information) and the word segmentation (corresponding to the word information) to obtain a feature vector (corresponding to the text features).

It can be understood that the server performs feature extraction and encoding processing on the text information to obtain text features. In the process, the server converts the text information into vectorized text features, so that the coordination and fusion of the object features and the content features can be conveniently carried out subsequently.
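The text branch can be sketched as a concatenation of a semantic vector and a word-level vector. The semantic vector is taken as an input here (it would come from a BERT-style encoder, which is not reproduced), and the word vector is a simple count over a tiny assumed vocabulary; `VOCAB`, `word_vector`, and `text_feature` are hypothetical names.

```python
from typing import List, Sequence

VOCAB = ("photo", "frame", "wooden", "gift")  # assumed toy vocabulary


def word_vector(text: str, vocab: Sequence[str] = VOCAB) -> List[float]:
    """Count how often each vocabulary word occurs after whitespace splitting."""
    tokens = text.lower().split()
    return [float(tokens.count(word)) for word in vocab]


def text_feature(semantic_vec: Sequence[float], text: str) -> List[float]:
    """Concatenate the semantic vector with the word-level vector."""
    return list(semantic_vec) + word_vector(text)
```

For instance, `text_feature([0.1, 0.2], "Wooden photo frame")` yields the two semantic values followed by one count per vocabulary word.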

S202, if the content multi-mode type is an image type or an image sequence type, respectively extracting features of the image information and the image sequence information through a second coding mode to obtain image features and behavior features.

In some embodiments of the present invention, the server performs feature extraction on the image information according to the content multi-mode type as the image type, to obtain the image initial feature. And extracting the characteristics of the image sequence information according to the content multi-mode type as the image sequence type to obtain the behavior initial characteristics. And respectively carrying out coding processing on the image initial characteristics and the behavior initial characteristics through a second coding mode to obtain the image characteristics and the behavior characteristics.

In some embodiments of the present invention, S202 may be implemented by S2021-S2023, as follows:

S2021, if the content multi-modal type is an image type, extracting features of the image information to obtain image initial features.

In some embodiments of the invention, the image initial features include scene information, content information, and style information.

In some embodiments of the present invention, the server may perform feature extraction on the image information according to the content multimodal type as the image type, to obtain scene information, content information, and style information. Scene information, content information and style information are all initial features of an image.

For example, as shown in fig. 5, the image information may be a promotional picture of an item presentation; the server extracts features of the image information to obtain a scene (corresponding to scene information), a content, a body (corresponding to content information), a color, a style, and a layout (corresponding to style information). The scene, content, body, color, style and layout all belong to the image initial features.

S2022, if the content multi-mode type is the image sequence type, extracting features of the image sequence information to obtain behavior initial features.

In some embodiments of the invention, the behavioral initial characteristics include subject target information and key frame information.

In some embodiments of the present invention, when the content multi-mode type is the image sequence type, the server may perform feature extraction on the image sequence information to obtain the subject target information and the key frame information. Both the subject target information and the key frame information are behavior initial features.

For example, as shown in fig. 5, the server performs feature extraction on the image sequence information to obtain a key frame and a highlight point (both corresponding to the key frame information), and a summary and a subject target behavior action (both corresponding to the subject target information). The key frame, the highlight point, the summary and the subject target behavior action all belong to the behavior initial features. The key frames, highlights, summaries and subject target behavior actions all belong to the manifest.

S2023, respectively encoding the image initial feature and the behavior initial feature by a second encoding mode to obtain the image feature and the behavior feature.

In some embodiments of the present invention, the server encodes the initial image feature by using a second encoding method to obtain a vectorized image feature; and carrying out coding processing on the behavior initial characteristics to obtain vectorized behavior characteristics.

For example, as shown in fig. 5, the second encoding mode is One Hot, and the server performs feature encoding on the scene, the content, the main body, the color, the style and the layout through the One Hot to obtain feature vectors (equivalent to image features). The server performs feature coding on the key frame, the highlight, the abstract and the main body target behavior action through the One Hot to obtain feature vectors (equivalent to behavior features).
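The One Hot encoding in this example can be sketched as follows. This is a minimal illustration only: the vocabularies in `VOCAB`, the helper names `one_hot` and `encode_initial_features`, and the choice to concatenate the per-field vectors are assumptions, since the description names the feature fields but not their values or the exact vector layout.

```python
from typing import Dict, List

# Hypothetical vocabularies for each image initial feature. The description
# names the fields (scene, content, subject, color, style, layout) but not
# their possible values, so these are placeholders.
VOCAB: Dict[str, List[str]] = {
    "scene": ["indoor", "outdoor"],
    "content": ["product", "person"],
    "subject": ["phone", "shoe", "bag"],
    "color": ["warm", "cool"],
    "style": ["minimal", "festive"],
    "layout": ["centered", "left", "right"],
}


def one_hot(value: str, vocabulary: List[str]) -> List[int]:
    """Return a vector with a single 1 at the position of `value`."""
    if value not in vocabulary:
        raise ValueError(f"unknown value: {value!r}")
    return [1 if item == value else 0 for item in vocabulary]


def encode_initial_features(features: Dict[str, str]) -> List[int]:
    """Concatenate the one-hot vectors of all fields in a fixed order
    (concatenation is an assumption of this sketch)."""
    vector: List[int] = []
    for field, vocabulary in VOCAB.items():
        vector.extend(one_hot(features[field], vocabulary))
    return vector


image_initial_features = {
    "scene": "outdoor", "content": "product", "subject": "shoe",
    "color": "warm", "style": "minimal", "layout": "centered",
}
image_feature = encode_initial_features(image_initial_features)
```

The same helper could be applied to the behavior initial features (key frame, highlight point, summary, subject target behavior action) given suitable vocabularies.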

It can be understood that the server performs feature extraction and encoding processing on the image information and the image sequence information to obtain image features and behavior features. The server can convert the image information and the image sequence information into vectorized image features and vectorized behavior features respectively, so that multi-mode content features are obtained, and the content features are diversified.

S203, determining the content characteristics corresponding to the content dimension according to at least one of the text characteristics, the image characteristics and the behavior characteristics.

In some embodiments of the invention, the server takes at least one of text features, image features, and behavior features as content features corresponding to the content dimension.

For example, the server may determine the text feature as a content feature corresponding to the content dimension; alternatively, the server may determine the image feature as a content feature corresponding to the content dimension; alternatively, the server may determine the behavior feature as a content feature corresponding to the content dimension; alternatively, the server may determine the text feature and the image feature as content features corresponding to the content dimension; alternatively, the server may determine the text feature and the behavior feature as content features corresponding to the content dimension; alternatively, the server may determine the image feature and the behavior feature as content features corresponding to the content dimension; alternatively, the server may determine the text feature, the image feature, and the behavior feature as content features corresponding to the content dimensions.

It is understood that the server may identify and extract features from the content information to obtain text features, image features, and behavior features. The server can determine the content characteristics corresponding to the content dimension according to one of the text characteristics, the image characteristics and the behavior characteristics; or determining the content characteristics corresponding to the content dimension according to two characteristics of the text characteristics, the image characteristics and the behavior characteristics; alternatively, the server may determine the content feature corresponding to the content dimension from three of the text feature, the image feature, and the behavior feature. The content features have a variety because they have one or more multi-modal features.
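The seven alternatives listed above are exactly the non-empty subsets of the three feature types. A short sketch of that enumeration follows; the names `MODALITIES`, `modality_subsets` and `content_feature` are hypothetical, and joining the chosen vectors by concatenation is an assumption rather than something the description specifies.

```python
from itertools import combinations
from typing import Dict, List, Tuple

MODALITIES = ("text", "image", "behavior")


def modality_subsets() -> List[Tuple[str, ...]]:
    """Enumerate every way to pick at least one of the three feature types."""
    return [
        subset
        for size in range(1, len(MODALITIES) + 1)
        for subset in combinations(MODALITIES, size)
    ]


def content_feature(available: Dict[str, List[float]],
                    chosen: Tuple[str, ...]) -> List[float]:
    """Build a content feature from the chosen feature types.

    Concatenation is an assumption of this sketch.
    """
    vector: List[float] = []
    for name in chosen:
        vector.extend(available[name])
    return vector
```

For example, `content_feature({"text": [1.0], "image": [2.0, 3.0], "behavior": [4.0]}, ("text", "behavior"))` yields a content feature built from the text and behavior features only.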

In some embodiments of the present invention, coordinating and fusing the item features and the content features to obtain multiple sets of fusion features may be implemented by S301-S304, as follows:

S301, carrying out cooperative processing on the item features and the content features to obtain first item features and first content features with the same probability distribution.

In some embodiments of the invention, the first item feature comprises a plurality of first sub-item features; the first content feature includes a plurality of first sub-content features.

In some embodiments of the present invention, the server performs collaborative learning processing on the item features and the content features according to the difference of the feature domains, and maps the item features and the content features to the same vector space, so as to obtain a first item feature and a first content feature with the same probability distribution.

It should be noted that, the cooperative processing is to process a plurality of vectors located in different vector spaces, so that the vectors are mapped to the same vector space and the same probability distribution is satisfied; the cooperative treatment is consistent with the cooperative technical means.
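One simple way to picture the effect of cooperative processing is a linear projection of each feature set into a shared space followed by per-dimension standardization, so that both sets end up with the same dimensionality, zero mean and unit variance. This is a stand-in technique chosen for illustration, not the collaborative learning procedure of the embodiment; the weight matrices and the function names `project` and `standardize` are assumptions.

```python
from statistics import mean, pstdev
from typing import List

Matrix = List[List[float]]


def project(vectors: Matrix, weights: Matrix) -> Matrix:
    """Multiply each row vector by an (input_dim x shared_dim) weight matrix."""
    return [
        [sum(x * w for x, w in zip(vec, column)) for column in zip(*weights)]
        for vec in vectors
    ]


def standardize(vectors: Matrix) -> Matrix:
    """Give every dimension zero mean and unit variance across the set."""
    columns = list(zip(*vectors))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) or 1.0 for col in columns]  # avoid division by zero
    return [
        [(x - m) / s for x, m, s in zip(vec, means, stds)]
        for vec in vectors
    ]


# Hypothetical projection weights; in practice these would be learned.
ITEM_W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3-dim item space -> 2-dim
CONTENT_W = [[2.0, 0.0], [0.0, 0.5]]            # 2-dim content space -> 2-dim

item_features = [[1.0, 2.0, 3.0], [2.0, 0.0, 1.0], [0.0, 1.0, 2.0]]
content_features = [[10.0, 4.0], [20.0, 8.0], [30.0, 0.0]]

first_item_features = standardize(project(item_features, ITEM_W))
first_content_features = standardize(project(content_features, CONTENT_W))
```

After this step both feature sets live in the same two-dimensional space with matching first and second moments, which is the property the later fusion step relies on.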

S302, randomly combining the first sub-article features to obtain a plurality of article combination features.

In some embodiments of the present invention, the server may randomly combine the plurality of first sub-item features to obtain a plurality of different item combination features.

Illustratively, the server randomly combines 12 first sub-item features (12 first sub-item features are not identical) to obtain 5 item combination features; the 5 article combination features respectively comprise 6 first sub-article features, 8 first sub-article features, 3 first sub-article features, 5 first sub-article features and 9 first sub-article features. It should be noted that, in the 5 article combination features, there may be the same first sub-article feature, and there may also be different first sub-article features.

S303, randomly combining the first sub-content features to obtain a plurality of content combination features.

In some embodiments of the invention, the content combination feature comprises content features corresponding to at least two content multi-modality types.

In some embodiments of the present invention, the server may randomly combine the plurality of first sub-content features to obtain a plurality of different content combination features.

For example, the server randomly combines 6 first sub-content features (6 first sub-content features are different, and are embodied in different multi-mode types of content contained in the content or different content features), so as to obtain 2 content combination features. Wherein, 1 content combination feature contains three content features corresponding to three content multi-mode types, the text features are 2, the image features are 3 and the behavior features are 1; the other content combination feature contains content features corresponding to two content multi-mode types, wherein the text features are 2 types and the image features are 1 type.

S304, fusing the plurality of article combination features and the plurality of content combination features to obtain a plurality of groups of fusion features.

In some embodiments of the present invention, the server may fuse the plurality of item combination features and the plurality of content combination features to obtain a plurality of sets of fused features; a set of fusion features includes at least one item combination feature and at least one content combination feature.

Illustratively, the server fuses the 5 item combination features and the 2 content combination features to obtain 3 sets of fusion features, set 1, set 2, and set 3, respectively. The 1 st group of fusion features comprise 3 first sub-object features, three content multi-mode types, 2 text features, 3 image features and 1 behavior feature; the 2 nd group of fusion features comprise 8 first sub-object features, two content multi-mode types, 2 text features and 1 image feature; group 3 includes 13 first sub-item features, three content multi-modal types, 4 text features, 4 image features and 1 behavior feature.

It can be understood that the server processes the item features and the content features, which lie in different vector spaces, and maps them to the same vector space so that they satisfy the same probability distribution; because the two kinds of features are then located in the same vector space, they can be fused conveniently. The server randomly combines the plurality of first sub-item features to obtain a plurality of item combination features, and each item combination feature contains multiple first sub-item features, so the item combination features have diversity. The server randomly combines the plurality of first sub-content features to obtain a plurality of content combination features, and each content combination feature contains multiple first sub-content features, so the content combination features have diversity. The server randomly fuses the item combination features and the content combination features to obtain a plurality of groups of fusion features; because each group comprises both item combination features and content combination features, the fusion features have diversity.
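The random combination and fusion of S302-S304 can be sketched as below. This is a minimal sketch under stated assumptions: features are represented by `(kind, index)` placeholders instead of vectors, each fusion group pairs exactly one item combination with one content combination (the embodiment allows "at least one" of each), and the names `random_combinations` and `fuse` are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Feature = Tuple[str, int]          # (kind, index): stand-in for a vector
Combo = Tuple[Feature, ...]


def random_combinations(features: Sequence[Feature], count: int,
                        rng: random.Random, min_kinds: int = 1) -> List[Combo]:
    """Draw `count` random subsets, each with at least two features and
    at least `min_kinds` distinct kinds (e.g. content multi-mode types)."""
    combos: List[Combo] = []
    while len(combos) < count:
        size = rng.randint(2, len(features))
        combo = tuple(rng.sample(list(features), size))
        if len({kind for kind, _ in combo}) >= min_kinds:
            combos.append(combo)
    return combos


def fuse(item_combos: List[Combo], content_combos: List[Combo],
         groups: int, rng: random.Random) -> List[Tuple[Combo, Combo]]:
    """Pair one random item combination with one random content combination."""
    return [(rng.choice(item_combos), rng.choice(content_combos))
            for _ in range(groups)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
sub_item_features = [("item", i) for i in range(12)]
sub_content_features = ([("text", i) for i in range(2)]
                        + [("image", i) for i in range(3)]
                        + [("behavior", 0)])

item_combos = random_combinations(sub_item_features, 5, rng)
content_combos = random_combinations(sub_content_features, 2, rng, min_kinds=2)
fusion_groups = fuse(item_combos, content_combos, 3, rng)
```

The `min_kinds=2` argument mirrors the requirement that a content combination feature cover at least two content multi-mode types.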

S103, estimating a plurality of groups of fusion features through a preset recommendation model, and selecting target object information and target content information corresponding to a group of fusion features with the highest estimated value.

In some embodiments of the present invention, the server may input the plurality of sets of fusion features into a preset recommendation model for prediction, so as to obtain a first predicted value for each set of fusion features. Based on the plurality of first predicted values, the server selects the set of fusion features with the highest predicted value from the plurality of sets, and decodes that set of fusion features to obtain the target item information and the target content information.

In some embodiments of the present invention, S103 may be implemented by S1031, S1032, and S1033 as follows:

S1031, inputting a plurality of groups of fusion features into a preset recommendation model for prediction to obtain first predicted values corresponding to the plurality of groups of fusion features.

In some embodiments of the present invention, the server predicts a plurality of groups of fusion features through a preset recommendation model, so as to obtain first predicted values corresponding to the plurality of groups of fusion features.

Illustratively, the server predicts 3 groups of fusion features through a preset recommendation model to obtain first predicted values of 0.7, 0.85 and 0.62 for the 3 groups of fusion features, respectively.

S1032, selecting a group of fusion features with highest predicted values from the multiple groups of fusion features based on the multiple first predicted values.

In some embodiments of the present invention, the server selects a fusion feature with the highest predicted value from the multiple fusion features by using multiple first predicted values.

Illustratively, the server selects fusion features of the predicted value 0.85 from the 3 sets of fusion features corresponding to the first predicted values 0.7, 0.85, and 0.62, respectively.

S1033, decoding the fusion features to obtain the target object information and the target content information.

In some embodiments of the invention, the server may convert the fused features into target item information and target content information by performing a decoding process on a set of fused features.

Illustratively, the server decodes the 1st set of fusion features to obtain the target item information and the target content information; the target item information comprises 3 items, and the target content information comprises three content multi-mode types, with 2 texts, 3 images and 1 image sequence.

It can be understood that the server predicts the fusion features according to a preset recommendation model to obtain a plurality of predicted values. A higher predicted value indicates better diversity of the fusion features; accordingly, the target item information and target content information corresponding to the group of fusion features with the highest predicted value have better diversity, so that the target multimedia information generated from them has diversity.
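The selection in S1031-S1032 amounts to scoring every group and taking the maximum. The sketch below uses the predicted values from the example (0.7, 0.85, 0.62); the function name `select_best`, the group labels and the lookup table standing in for the preset recommendation model are assumptions.

```python
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")


def select_best(fusion_groups: Sequence[T],
                predict: Callable[[T], float]) -> Tuple[T, float]:
    """Score each group with `predict` and return the highest-scoring one."""
    if not fusion_groups:
        raise ValueError("no fusion features to select from")
    scores = [predict(group) for group in fusion_groups]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return fusion_groups[best_index], scores[best_index]


# A lookup table stands in for the preset recommendation model.
toy_scores = {"group_1": 0.7, "group_2": 0.85, "group_3": 0.62}
best_group, best_score = select_best(list(toy_scores), toy_scores.__getitem__)
```

With these values, the group whose first predicted value is 0.85 is selected and would then be passed to the decoding step S1033.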

And S104, generating target multimedia information based on the target article information and the target content information.

In some embodiments of the present invention, the server may perform layout generation on the target item information and the target content information through a preset layout generation model, so as to obtain a plurality of layouts. And evaluating the multiple layouts through an evaluation model to determine candidate layouts. And selecting an optimal layout from the candidate layouts through a layout preference model. And generating target multimedia information according to the optimal layout, the target object information and the target content information. And sending the target multimedia information to the terminal, and displaying the browsing page based on the target multimedia information by the terminal.

Fig. 6 is a sixth schematic flowchart of an alternative method for generating multimedia information according to an embodiment of the present invention. As shown in fig. 6, a conventional multimedia information generation process is: receiving a user request (corresponding to a browsing request), and recalling commodities by the server to obtain commodity information; ranking the commodity information with a model, and selecting the Top-1 commodity information as the recommended commodity information; generating a templated creative, and fusing it with the commodity information to obtain multimedia information.

Fig. 7 is a flowchart of an alternative method for generating multimedia information according to an embodiment of the present invention. As shown in fig. 7, data A/B (corresponding to the target item information and the target content information) is input into an online learning module of the server, and an initialized layout (not shown in the figure) is generated through a preset layout generation model. Character size adjustment, element position adjustment, and color and contrast adjustment are performed on the initialized layout through adjustment rules to obtain a plurality of layouts. The plurality of layouts (shown as "++" in fig. 7) are evaluated through an evaluation model to obtain an evaluation result, which is either pass or fail; if the evaluation result is pass, a layout plan (corresponding to the candidate layouts) is output, where the layout plan comprises four layouts numbered 1, 2, 3 and 4. The layout plan is then optimized through a layout preference model to obtain an optimal pattern (the optimal layout), where the optimal pattern comprises a document, a picture, a video, or a middle page. Finally, the target multimedia information is generated through a real-time multimedia information generation engine.

It can be understood that the server performs vectorized representation of the item information and the content information to obtain the item features corresponding to the items and the content features corresponding to the content. The server converts the item features and the content features, which lie in different spaces, into vectors in the same space and fuses them to obtain a plurality of groups of fusion features. Because the fusion features cover two dimensions, they are diversified. The server predicts the fusion features according to a preset recommendation model to obtain a plurality of predicted values. A higher predicted value indicates better diversity of the fusion features; accordingly, the target item information and target content information corresponding to the group of fusion features with the highest predicted value have better diversity, so that the target multimedia information generated from them has diversity. Finally, this method for generating multimedia information can improve the diversity of the target multimedia information, thereby providing personalized recommendation for the user and improving the recommendation effect.

In some embodiments of the present invention, fig. 8 is an eighth schematic flowchart of an alternative method for generating multimedia information according to an embodiment of the present invention. As shown in fig. 8, S104 may be implemented by S1041-S1045, as follows:

S1041, performing layout generation on the target object information and the target content information through a preset layout generation model to obtain a plurality of layouts.

In some embodiments of the present invention, the preset layout generation model includes a sequential stacking order of image layers and a text size range constraint in the text information.

In some embodiments of the present invention, the server may generate an initialized layout corresponding to the target item information and the target content information through a preset layout generation model. And adjusting the initialized layout through an adjustment rule to determine a plurality of layouts.

In some embodiments of the present invention, S1041 may be implemented by S401 and S402 as follows:

S401, generating an initialized layout corresponding to the target item information and the target content information through a preset layout generation model.

In some embodiments of the present invention, the server may input the target item information and the target content information into a preset layout generation model, and generate an initialized layout corresponding to the target item information and the target content information. The initialization layout is obtained by arranging and combining the positions of the target object information and the target content information.

S402, adjusting the initialized layout through adjustment rules to determine a plurality of layouts.

In some embodiments of the invention, the adjustment rules are obtained through continuous training with the object's degree of preference as the reward. Specifically, based on reinforcement learning and according to the object's degree of preference, a higher click-through rate after an adjustment is treated as a positive reward and a lower click-through rate as a negative reward; the adjustment rules are obtained through repeated adjustment and learning.
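The reward signal described here can be illustrated with a very small preference update. This is only a sketch of the sign convention (higher click rate after adjustment gives a positive reward, lower gives a negative one); the additive update, the learning rate, and the names `reward` and `update_preferences` are assumptions and do not represent the training procedure of the embodiment.

```python
from typing import Dict


def reward(ctr_before: float, ctr_after: float) -> float:
    """Positive reward when the click rate rises, negative when it falls."""
    if ctr_after > ctr_before:
        return 1.0
    if ctr_after < ctr_before:
        return -1.0
    return 0.0


def update_preferences(preferences: Dict[str, float], rule: str,
                       signal: float, learning_rate: float = 0.1) -> Dict[str, float]:
    """Return a copy of the rule preferences with `rule` nudged by the reward."""
    updated = dict(preferences)
    updated[rule] = updated.get(rule, 0.0) + learning_rate * signal
    return updated


# Hypothetical adjustment rule names and click rates.
preferences = {"enlarge_text": 0.0, "shift_elements": 0.0}
preferences = update_preferences(
    preferences, "enlarge_text", reward(ctr_before=0.02, ctr_after=0.03))
preferences = update_preferences(
    preferences, "shift_elements", reward(ctr_before=0.02, ctr_after=0.01))
```

Repeating such updates over many adjustments would gradually favor the rules that tend to raise the click rate.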

In some embodiments of the present invention, the server adjusts the initialized layout by adjusting the rules to obtain a plurality of layouts.

It can be understood that the server can generate an initialization layout corresponding to the target object information and the target content information through a preset layout generation model, and adjust the initialization layout through an adjustment rule to determine a plurality of layouts of the target object information and the target content information; the initialization layout is adjusted through the adjustment rule, so that an unreasonable layout mode is adjusted, and the rationality of the layout can be improved. The server-adjusted layout may still include a variety of layouts such that the adjusted layout still has a variety.
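A sketch of how one initialized layout can be expanded into several layouts by applying adjustment rules is shown below. The `Layout` fields, the concrete rules, and the font size range `MIN_FONT`/`MAX_FONT` are hypothetical; they only mirror the kinds of adjustment mentioned in the description (character size, element position, contrast) and the text size range constraint of the layout generation model.

```python
from dataclasses import dataclass, replace
from typing import Callable, List

MIN_FONT, MAX_FONT = 12, 48  # hypothetical text size range constraint


@dataclass(frozen=True)
class Layout:
    font_size: int
    x: int
    y: int
    contrast: float


def clamp_font(size: int) -> int:
    """Keep the text size inside the allowed range."""
    return max(MIN_FONT, min(MAX_FONT, size))


# Each rule maps a layout to an adjusted copy; the originals are not mutated.
ADJUSTMENT_RULES: List[Callable[[Layout], Layout]] = [
    lambda l: replace(l, font_size=clamp_font(l.font_size + 4)),  # text size
    lambda l: replace(l, x=l.x + 10, y=l.y + 10),                 # position
    lambda l: replace(l, contrast=min(1.0, l.contrast + 0.2)),    # contrast
]


def adjust(initial: Layout, rules: List[Callable[[Layout], Layout]]) -> List[Layout]:
    """Apply every rule to the initialized layout to obtain several layouts."""
    return [rule(initial) for rule in rules]


initial_layout = Layout(font_size=46, x=0, y=0, contrast=0.9)
layouts = adjust(initial_layout, ADJUSTMENT_RULES)
```

Because every rule yields its own adjusted copy, the set of layouts stays varied while each one respects the stated constraints.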

S1042, evaluating the multiple layouts through an evaluation model to determine candidate layouts.

In some embodiments of the invention, an evaluation model is used to evaluate and screen the layout.

In some embodiments of the present invention, the server may evaluate the plurality of layouts through an evaluation model, to obtain evaluation results corresponding to each of the plurality of layouts. And if the evaluation result is characterized as successful, the corresponding layout is used as a candidate layout, and if the evaluation result is characterized as failed, the corresponding layout is deleted.

In some embodiments of the present invention, S1042 may be implemented by S501 and S502 as follows:

S501, evaluating the multiple layouts through an evaluation model to obtain evaluation results corresponding to the multiple layouts.

In some embodiments of the present invention, the server may perform a rationality evaluation on the plurality of layouts through the evaluation model, to obtain evaluation results corresponding to each of the plurality of layouts. The evaluation results include success and failure.

S502, if the evaluation result is characterized as successful, the corresponding layout is used as a candidate layout.

In some embodiments of the invention, the server may take a layout whose evaluation result is characterized as successful, meaning that the layout has passed the evaluation, as a candidate layout.

It can be understood that the server can evaluate the multiple layouts through the evaluation model to obtain an evaluation result for each layout; the evaluation results represent the rationality of the layouts. The server screens the layouts according to the evaluation results, removes unreasonable layouts and determines the candidate layouts. Because the candidate layouts are what remains after unreasonable layouts are removed, they are layouts with higher rationality.
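The screening of S501-S502 can be sketched as a filter that keeps only layouts whose evaluation succeeds. The rule-based `evaluate` function below (text must fit the canvas width) is a hypothetical stand-in for the trained evaluation model, and the layout dictionaries are made-up examples.

```python
from typing import Callable, Dict, List

LayoutDict = Dict[str, object]
CANVAS_WIDTH = 640  # hypothetical canvas width in pixels


def evaluate(layout: LayoutDict) -> bool:
    """Stand-in evaluation: succeed when the text fits on the canvas."""
    return int(layout["text_width"]) <= CANVAS_WIDTH


def select_candidates(layouts: List[LayoutDict],
                      evaluator: Callable[[LayoutDict], bool]) -> List[LayoutDict]:
    """Keep layouts evaluated as successful; failed layouts are dropped."""
    return [layout for layout in layouts if evaluator(layout)]


all_layouts = [
    {"name": "a", "text_width": 300},
    {"name": "b", "text_width": 900},   # too wide, fails evaluation
    {"name": "c", "text_width": 640},
]
candidate_layouts = select_candidates(all_layouts, evaluate)
```

Only layouts "a" and "c" remain as candidate layouts in this toy case.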

In some embodiments of the present invention, S601, S602, and S603 are also performed before S1042, as follows:

S601, acquiring historical target multimedia information.

In some embodiments of the present invention, the server may obtain historical target multimedia information.

S602, identifying the historical target multimedia information to obtain a historical layout.

In some embodiments of the invention, the historical layout includes positive sample data and negative sample data.

In some embodiments of the present invention, the server may perform recognition analysis on the historical target multimedia information to obtain a historical layout corresponding to the historical target multimedia information.

S603, training an initial evaluation model through positive sample data and negative sample data, and determining an evaluation model.

In some embodiments of the present invention, the server trains the initial evaluation model through positive sample data and negative sample data of the historical layout until the evaluation result output by the model meets a preset threshold value, and saves the model to obtain the evaluation model.

It can be understood that the server trains the initial evaluation model through the historical target multimedia information to determine the evaluation model, so that the evaluation accuracy of the evaluation model can be ensured.
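Training on positive and negative samples until a preset threshold is met can be illustrated with a tiny perceptron. The perceptron is a stand-in chosen for brevity, not the model used by the embodiment; the two layout features (overlap ratio, margin ratio), the sample values and the 0.95 accuracy threshold are all assumptions.

```python
from typing import List, Tuple

Sample = Tuple[List[float], int]  # (layout features, 1 = positive, 0 = negative)


def predict(weights: List[float], bias: float, x: List[float]) -> int:
    """Classify a layout as passing (1) or failing (0)."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0


def accuracy(weights: List[float], bias: float, samples: List[Sample]) -> float:
    correct = sum(predict(weights, bias, x) == y for x, y in samples)
    return correct / len(samples)


def train(samples: List[Sample], threshold: float = 0.95,
          learning_rate: float = 0.1, max_epochs: int = 100) -> Tuple[List[float], float]:
    """Perceptron updates until accuracy reaches the preset threshold."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(max_epochs):
        for x, y in samples:
            error = y - predict(weights, bias, x)
            weights = [w + learning_rate * error * xi for w, xi in zip(weights, x)]
            bias += learning_rate * error
        if accuracy(weights, bias, samples) >= threshold:
            break  # the model is saved once the threshold is met
    return weights, bias


# Hypothetical historical layouts: [overlap_ratio, margin_ratio].
TRAINING_SAMPLES: List[Sample] = [
    ([0.0, 0.9], 1), ([0.1, 0.8], 1), ([0.05, 0.7], 1),   # positive samples
    ([0.8, 0.1], 0), ([0.9, 0.2], 0), ([0.7, 0.0], 0),    # negative samples
]
model_weights, model_bias = train(TRAINING_SAMPLES)
```

The returned weights and bias play the role of the saved evaluation model; a real system would also hold out validation data rather than measure accuracy on the training samples.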

S1043, selecting an optimal layout from the candidate layouts through a layout preference model.

In some embodiments of the invention, the server may input the candidate layouts into a layout preference model and select an optimal layout from among them. Specifically, the layout preference model performs index evaluation on the candidate layouts to obtain an index evaluation value for each candidate layout; the candidate layout with the highest index evaluation value is selected as the optimal layout.

For example, there are 3 candidate layouts, and index evaluation is performed on the 3 candidate layouts through a layout preference model, so as to obtain index evaluation values corresponding to the 3 candidate layouts. The index evaluation value of the first candidate layout is 0.5, the index evaluation value of the second candidate layout is 0.7, and the index evaluation value of the third candidate layout is 0.8; and taking the third candidate layout with the index evaluation value of 0.8 as the optimal layout.
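Using the index evaluation values from this example, the choice of the optimal layout can be sketched as follows. The function name `select_optimal_layout` and the candidate labels are hypothetical, and the dictionary stands in for the output of the layout preference model.

```python
from typing import Dict


def select_optimal_layout(index_values: Dict[str, float]) -> str:
    """Return the candidate layout with the highest index evaluation value."""
    if not index_values:
        raise ValueError("no candidate layouts")
    return max(index_values, key=index_values.__getitem__)


index_values = {"candidate_1": 0.5, "candidate_2": 0.7, "candidate_3": 0.8}
optimal_layout = select_optimal_layout(index_values)
```

Here the third candidate, with an index evaluation value of 0.8, is chosen.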

S1044, generating target multimedia information based on the optimal layout, the target item information, and the target content information.

In some embodiments of the present invention, the server may arrange the target item information and the target content information according to an optimal layout to generate the target multimedia information.

S1045, the target multimedia information is sent to the terminal, and the terminal displays the browsing page based on the target multimedia information.

In some embodiments of the invention, the server transmits the target multimedia information to the terminal. The terminal can display the browsing page based on the target multimedia information.

It can be understood that the server may generate a plurality of layouts of the target item information and the target content information according to a preset layout generation model and an adjustment rule, and filter the plurality of layouts through an evaluation model and a layout preference model to determine an optimal layout. The optimal layout is determined after the unreasonable layout is removed, so that the target multimedia information obtained through the optimal layout is reasonable and accurate, and the accuracy of the target multimedia information is improved. When the server recommends the target multimedia information, the target multimedia information can better meet the requirements of users, personalized recommendation can be provided for the users, and the recommendation effect is good.

In some embodiments of the present invention, fig. 9 is a flowchart illustrating an alternative method for generating multimedia information according to an embodiment of the present invention, where, as shown in fig. 9, a server receives a user request (equivalent to a browsing request); and carrying out interest commodity recall (corresponding to article information recall) and creative element recall (corresponding to content recall) to obtain article information and content information. And carrying out vectorization collaborative modeling on the item information and the content information to obtain item characteristics and content characteristics. The item features and the content features are fused to obtain fusion features (not shown in the figure, obtained before input to the cross-modal CTR estimation). Multi-commodity optimization (equivalent to fusion feature optimization) is carried out through cross-mode CTR estimation, and the optimal commodity content combination is obtained; the multi-modality includes text (corresponding to text characteristics), style, picture (corresponding to image characteristics), and video (corresponding to behavior characteristics). And performing element planning on the commodity content combination to generate a layout in real time through a preset layout generation model (not shown in the figure) and adjustment rules, and determining final target multimedia information to be sent to a user (equivalent to a terminal).

It can be understood that, first, the server may perform vectorized representation of the item information and the content information to obtain the item features corresponding to the items and the content features corresponding to the content. The server converts the item features and the content features, which lie in different spaces, into vectors in the same space and fuses them to obtain a plurality of fusion features. Because the fusion features cover two dimensions, they are diverse. On that basis, the server predicts the plurality of fusion features according to a preset recommendation model to obtain a plurality of predicted values. A higher predicted value indicates better diversity of the fusion features; accordingly, the optimal commodity content combination corresponding to the group of fusion features with the highest predicted value has better diversity. Second, the server performs element planning on the optimal commodity content combination, generates a layout in real time through a preset layout generation model and adjustment rules, and determines the final target multimedia information. Since the optimal commodity content combination has diversity, the target multimedia information generated from it also has diversity.

Based on the method for generating multimedia information in the foregoing embodiment, the embodiment of the present invention further provides a device for generating multimedia information, as shown in fig. 10, fig. 10 is a schematic structural diagram of a device for generating multimedia information according to the embodiment of the present invention, where the device 10 for generating multimedia information includes: an acquisition unit 1001, a selection unit 1002, and a generation unit 1003; wherein,

The acquiring unit 1001 is configured to recall item information and content information in response to a received browsing request; extracting features based on the article information and the content information to obtain article features corresponding to article dimensions and content features corresponding to content dimensions, and carrying out synergy and fusion on the article features and the content features to obtain a plurality of groups of fusion features; each group of fusion features characterizes fusion with different articles under different content mode combinations;

the selecting unit 1002 is configured to predict the multiple sets of fusion features through a preset recommendation model, and select target object information and target content information corresponding to a set of fusion features with highest predicted values; the preset recommendation model representation is used for optimizing fusion characteristics;

the generating unit 1003 is configured to generate target multimedia information based on the target item information and the target content information.
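The division of the device into three units can be pictured as three pluggable callables wired together in sequence. This is only a structural sketch: the class name `GenerationDevice`, the callable signatures and the toy implementations are assumptions, and each toy function is a placeholder for the corresponding unit's actual processing.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

FusionGroup = Tuple[str, str, float]  # (item, content, toy predicted value)


@dataclass
class GenerationDevice:
    acquire: Callable[[str], List[FusionGroup]]              # acquisition unit 1001
    select: Callable[[List[FusionGroup]], Tuple[str, str]]   # selection unit 1002
    generate: Callable[[str, str], str]                      # generation unit 1003

    def handle(self, browsing_request: str) -> str:
        """Run the three units in order for one browsing request."""
        fusion_groups = self.acquire(browsing_request)
        target_item, target_content = self.select(fusion_groups)
        return self.generate(target_item, target_content)


def toy_acquire(request: str) -> List[FusionGroup]:
    # Placeholder for recall, feature extraction, cooperation and fusion.
    return [("shoe", "text", 0.7), ("bag", "video", 0.85)]


def toy_select(groups: List[FusionGroup]) -> Tuple[str, str]:
    # Placeholder for prediction and selection of the highest value.
    item, content, _ = max(groups, key=lambda group: group[2])
    return item, content


def toy_generate(item: str, content: str) -> str:
    # Placeholder for layout generation and rendering.
    return f"{item}+{content}"


device = GenerationDevice(toy_acquire, toy_select, toy_generate)
result = device.handle("browse home page")
```

Keeping the units as separate callables makes it straightforward to add further units, such as the determining unit 1004, without changing the overall flow.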

In some embodiments of the present invention, the obtaining unit 1001 is configured to perform feature extraction on the item information to obtain the item features corresponding to the item dimension; identify the content information to obtain content information corresponding to the content multi-mode type, where the content multi-mode type comprises at least two of text information, image information and image sequence information; and perform feature extraction on the content information corresponding to the content multi-mode type to obtain the content features corresponding to the content dimension.

In some embodiments of the present invention, the generating device of multimedia information further includes a determining unit 1004; wherein,

the obtaining unit 1001 is configured to perform feature extraction on the text information by using a first encoding manner if the content multimodal type is a text type, so as to obtain text features; if the content multi-mode type is an image type or an image sequence type, respectively extracting features of the image information and the image sequence information through a second coding mode to obtain image features and behavior features;

the determining unit 1004 is configured to determine the content feature corresponding to the content dimension according to at least one of the text feature, the image feature and the behavior feature.
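The dispatch by content multi-modal type described above may be sketched as follows. This is a minimal illustration under stated assumptions: `first_encoding` and `second_encoding` are hypothetical toy encoders (text length statistics and mean pixel values), and concatenation is only one possible way of determining the content feature from the available modality features.

```python
from typing import Any, Dict, List

Vector = List[float]


def first_encoding(text: str) -> Vector:
    # Toy stand-in for the first encoding manner: character count and word count.
    return [float(len(text)), float(len(text.split()))]


def second_encoding(frames: List[List[float]]) -> Vector:
    # Toy stand-in for the second encoding manner: average of per-frame mean
    # pixel values, followed by the number of frames.
    means = [sum(frame) / len(frame) for frame in frames]
    return [sum(means) / len(means), float(len(frames))]


def extract_content_feature(content: Dict[str, Any]) -> Vector:
    """Encode each available modality and concatenate the results (assumption)."""
    feature: Vector = []
    if "text" in content:                 # text type -> text feature
        feature += first_encoding(content["text"])
    if "image" in content:                # image type -> image feature
        feature += second_encoding([content["image"]])
    if "image_sequence" in content:       # image sequence type -> behavior feature
        feature += second_encoding(content["image_sequence"])
    if not feature:
        raise ValueError("content has no supported modality")
    return feature


content_feature = extract_content_feature({"text": "red shoe", "image": [0.0, 1.0]})
```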

In some embodiments of the present invention, the obtaining unit 1001 is configured to, if the content multi-modal type is a text type, perform feature extraction on the text information to obtain text initial features, where the text initial features include semantic expression information and word information; and encode the text initial features through the first encoding manner to obtain the text features.

In some embodiments of the present invention, the obtaining unit 1001 is configured to, if the content multi-modal type is an image type, perform feature extraction on the image information to obtain image initial features, where the image initial features include scene information, content information and style information; if the content multi-modal type is an image sequence type, perform feature extraction on the image sequence information to obtain behavior initial features, where the behavior initial features include subject target information and key frame information; and encode the image initial features and the behavior initial features, respectively, through the second encoding manner to obtain the image features and the behavior features.

In some embodiments of the present invention, the obtaining unit 1001 is configured to perform coordination processing on the item features and the content features to obtain first item features and first content features that have the same probability distribution, where the first item features include a plurality of first sub-item features and the first content features include a plurality of first sub-content features; randomly combine the plurality of first sub-item features to obtain a plurality of item combination features; randomly combine the plurality of first sub-content features to obtain a plurality of content combination features, where each content combination feature includes content features corresponding to at least two content multi-modal types; and fuse the item combination features and the content combination features to obtain the multiple groups of fusion features.
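The coordination, random combination and fusion steps may be illustrated as follows. The sketch makes several assumptions that are not stated in the embodiment: coordination is approximated by standardizing each feature vector to zero mean and unit variance, fusion is approximated by concatenation, and the feature values, names and combination counts are invented for the example.

```python
import random
import statistics
from typing import Dict, List, Tuple

Vector = List[float]
Combo = Tuple[Tuple[str, ...], Vector]


def coordinate(feature: Vector) -> Vector:
    # Assumption: bring features to a common distribution by standardization.
    mean = statistics.fmean(feature)
    std = statistics.pstdev(feature) or 1.0  # guard against constant vectors
    return [(x - mean) / std for x in feature]


def random_combos(sub: Dict[str, Vector], min_size: int, count: int,
                  rng: random.Random) -> List[Combo]:
    # Randomly pick at least `min_size` sub-features and concatenate them.
    names = sorted(sub)
    combos: List[Combo] = []
    for _ in range(count):
        size = rng.randint(min_size, len(names))
        chosen = sorted(rng.sample(names, size))
        combos.append((tuple(chosen), [x for n in chosen for x in sub[n]]))
    return combos


def fuse(item_combos: List[Combo],
         content_combos: List[Combo]) -> List[Tuple[Tuple[str, ...], Tuple[str, ...], Vector]]:
    # Assumption: fusion by concatenating every item combination with every
    # content combination.
    return [(i_names, c_names, i_vec + c_vec)
            for i_names, i_vec in item_combos
            for c_names, c_vec in content_combos]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
items = {n: coordinate(v) for n, v in {"shoe": [1.0, 3.0], "bag": [10.0, 30.0]}.items()}
contents = {m: coordinate(v) for m, v in
            {"text": [2.0, 4.0], "image": [5.0, 9.0], "image_sequence": [0.0, 8.0]}.items()}
item_combos = random_combos(items, 1, 2, rng)
content_combos = random_combos(contents, 2, 3, rng)  # at least two modalities each
fusion_groups = fuse(item_combos, content_combos)
```

After coordination, the two item vectors above, which differ in scale by a factor of ten, become identical, which is the sense in which the sketch places features on a common footing before they are combined.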

In some embodiments of the present invention, the obtaining unit 1001 is configured to input the multiple groups of fusion features into the preset recommendation model for prediction to obtain first predicted values respectively corresponding to the multiple groups of fusion features; select, based on the first predicted values, the group of fusion features with the highest predicted value from the multiple groups of fusion features; and decode that group of fusion features to obtain the target item information and the target content information.
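The prediction and selection described above amount to scoring each group and taking the maximum, as the following sketch shows. The linear scoring function `toy_model`, its weights, and the lookup table used for decoding are assumptions made for illustration; the embodiment does not specify how the preset recommendation model or the decoding is implemented.

```python
from typing import Callable, Dict, Sequence, Tuple

Vector = Tuple[float, ...]


def select_target(fusion_groups: Sequence[Vector],
                  model: Callable[[Vector], float],
                  decode: Dict[Vector, Tuple[str, str]]) -> Tuple[str, str]:
    # One first predicted value per group of fusion features.
    predicted = [model(group) for group in fusion_groups]
    # Index of the group with the highest predicted value.
    best_index = max(range(len(predicted)), key=predicted.__getitem__)
    # Assumption: decoding is a lookup from fusion features back to their sources.
    return decode[fusion_groups[best_index]]


def toy_model(group: Vector) -> float:
    # Hypothetical stand-in for the preset recommendation model.
    weights = (1.0, 0.5)
    return sum(w * x for w, x in zip(weights, group))


groups = [(0.2, 0.9), (0.8, 0.4), (0.5, 0.5)]
lookup = {
    (0.2, 0.9): ("shoe", "text+image"),
    (0.8, 0.4): ("bag", "text+image_sequence"),
    (0.5, 0.5): ("shoe", "image+image_sequence"),
}
target_item, target_content = select_target(groups, toy_model, lookup)
```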

In some embodiments of the present invention, the obtaining unit 1001 is configured to perform layout generation on the target item information and the target content information through a preset layout generation model to obtain a plurality of layouts; the preset layout generation model is used to adjust the layout according to the item and the content;

the determining unit 1004 is configured to evaluate the multiple layouts through an evaluation model, and determine candidate layouts; the evaluation model is used for evaluating and screening the layout;

the selecting unit 1002 is configured to select an optimal layout from the candidate layouts by using a layout preference model;

the generating unit 1003 is configured to generate the target multimedia information based on the optimal layout, the target item information, and the target content information.
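The evaluate, screen and select sequence for layouts may be sketched as follows. The `Layout` attributes, the rule inside `evaluate` and the score inside `preference` are hypothetical placeholders for the evaluation model and the layout preference model, whose internals are not given here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Layout:
    name: str
    text_area_ratio: float  # fraction of the canvas covered by text (hypothetical)
    overlap_ratio: float    # fraction of the item image hidden by other layers (hypothetical)


def evaluate(layout: Layout) -> bool:
    # Stand-in evaluation model: a layout passes if little of the item is covered.
    return layout.overlap_ratio <= 0.1


def preference(layout: Layout) -> float:
    # Stand-in layout preference model: prefer roughly 30% text coverage.
    return 1.0 - abs(layout.text_area_ratio - 0.3)


def choose_layout(layouts: List[Layout],
                  evaluate_fn: Callable[[Layout], bool],
                  preference_fn: Callable[[Layout], float]) -> Layout:
    # Screening step: only layouts that pass evaluation become candidates.
    candidates = [layout for layout in layouts if evaluate_fn(layout)]
    if not candidates:
        raise ValueError("no layout passed evaluation")
    # Selection step: the preference model picks the optimal candidate.
    return max(candidates, key=preference_fn)


layouts = [
    Layout("A", text_area_ratio=0.30, overlap_ratio=0.40),  # fails evaluation
    Layout("B", text_area_ratio=0.50, overlap_ratio=0.05),
    Layout("C", text_area_ratio=0.25, overlap_ratio=0.00),
]
optimal = choose_layout(layouts, evaluate, preference)
```

Layout A has the best text coverage in this toy data but is screened out before the preference model is consulted, which illustrates why the two models are applied in sequence rather than merged.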

In some embodiments of the present invention, the generating unit 1003 is configured to generate, through a preset layout generation model, an initialized layout corresponding to the target item information and the target content information; the preset layout generation model includes constraints on the stacking order of image layers and on the range of text sizes in the text information;

the determining unit 1004 is configured to adjust the initialized layout through an adjustment rule to determine the multiple layouts; the adjustment rule is obtained through continuous training that takes the preference degree of the object as the incentive.
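The two constraints named above, a stacking order for layers and a permitted text size range, may be illustrated as follows. The concrete order in `STACK_ORDER`, the range in `TEXT_SIZE_RANGE` and the fixed font-size deltas are assumptions; in particular, the fixed deltas merely stand in for the trained adjustment rule, which is not reproduced here.

```python
from dataclasses import dataclass, replace
from typing import List, Sequence


@dataclass(frozen=True)
class Layer:
    kind: str           # "background", "item" or "text"
    z_index: int        # position in the stacking order
    font_size: int = 0  # only meaningful for text layers


STACK_ORDER = {"background": 0, "item": 1, "text": 2}  # assumed stacking order
TEXT_SIZE_RANGE = (12, 48)                             # assumed text size range


def clamp_font(size: int) -> int:
    low, high = TEXT_SIZE_RANGE
    return max(low, min(high, size))


def initialize_layout(kinds: Sequence[str], font_size: int) -> List[Layer]:
    # Build layers, enforce the text size range, and sort by stacking order.
    layers = [Layer(k, STACK_ORDER[k], clamp_font(font_size) if k == "text" else 0)
              for k in kinds]
    return sorted(layers, key=lambda layer: layer.z_index)


def adjust(layout: List[Layer], font_deltas: Sequence[int]) -> List[List[Layer]]:
    # Stand-in for the learned adjustment rule: one variant per font-size delta,
    # each still respecting the text size range.
    return [
        [replace(layer, font_size=clamp_font(layer.font_size + delta))
         if layer.kind == "text" else layer
         for layer in layout]
        for delta in font_deltas
    ]


initial = initialize_layout(["text", "item", "background"], font_size=100)
variants = adjust(initial, font_deltas=[-8, 0, 8])
```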

In some embodiments of the present invention, the obtaining unit 1001 is configured to obtain historical target multimedia information before evaluating the plurality of layouts by an evaluation model to determine candidate layouts; identifying the historical target multimedia information to obtain a historical layout; the historical layout includes positive sample data and negative sample data;

the determining unit 1004 is configured to train an initial evaluation model by the positive sample data and the negative sample data, and determine the evaluation model.
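Training an evaluation model from positive and negative layout samples may be sketched as follows. Logistic regression is used here purely as an illustrative choice; the embodiment does not name the model type. The two layout features (text area ratio, overlap ratio) and all sample values are invented for the example.

```python
import math
from typing import List, Sequence, Tuple

Model = Tuple[List[float], float]  # (weights, bias)


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_evaluation_model(positive: Sequence[Sequence[float]],
                           negative: Sequence[Sequence[float]],
                           epochs: int = 500, lr: float = 0.5) -> Model:
    # Label positive samples 1 and negative samples 0.
    samples = [(x, 1.0) for x in positive] + [(x, 0.0) for x in negative]
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in samples:
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            err = p - y  # gradient of the log loss with respect to the logit
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def evaluate(model: Model, x: Sequence[float]) -> float:
    # Returns the estimated probability that a layout is acceptable.
    weights, bias = model
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


# Hypothetical historical layouts as (text_area_ratio, overlap_ratio).
positive_samples = [(0.30, 0.00), (0.25, 0.05), (0.35, 0.02)]
negative_samples = [(0.30, 0.60), (0.50, 0.50), (0.20, 0.70)]
evaluation_model = train_evaluation_model(positive_samples, negative_samples)
```

A layout may then be treated as passing evaluation when `evaluate(evaluation_model, features)` exceeds a chosen threshold such as 0.5; the threshold is likewise an assumption of this sketch.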

In some embodiments of the present invention, the obtaining unit 1001 is configured to evaluate, through the evaluation model, the plurality of layouts to obtain evaluation results corresponding to the plurality of layouts respectively;

the determining unit 1004 is configured to, if the evaluation result is characterized as successful, take the layout corresponding to the evaluation result as the candidate layout.

It should be noted that, when the device for generating multimedia information provided in the foregoing embodiment generates multimedia information, the above division into program modules is only an example. In practical applications, the processing may be allocated to different program modules as needed; that is, the internal structure of the device may be divided into different program modules to complete all or part of the processing described above. In addition, the device for generating multimedia information provided in the foregoing embodiment and the method embodiment for generating multimedia information belong to the same concept; the detailed implementation process and beneficial effects are described in the method embodiment and are not repeated here. For technical details not disclosed in this device embodiment, refer to the description of the method embodiment of the present invention.

Based on the method for generating multimedia information in the foregoing embodiments, an embodiment of the present invention further provides a device for generating multimedia information. As shown in FIG. 11, which is a second schematic structural diagram of a device for generating multimedia information provided in an embodiment of the present invention, the device 11 for generating multimedia information includes a processor 1101 and a memory 1102; the memory 1102 stores one or more programs executable by the processor, and when the one or more programs are executed, the processor 1101 performs any of the methods for generating multimedia information of the foregoing embodiments.

It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, magnetic disk storage, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

The foregoing description is only of the preferred embodiments of the present invention, and is not intended to limit the scope of the present invention.

Claims

1. A method for generating multimedia information, comprising:

2. The method according to claim 1, wherein the extracting features based on the item information and the content information to obtain the item features corresponding to the item dimensions and the content features corresponding to the content dimensions includes:

3. The method according to claim 2, wherein the feature extraction of the content information corresponding to the content multi-modal type to obtain the content feature corresponding to the content dimension includes:

if the content multi-modal type is a text type, performing feature extraction on the text information through a first encoding manner to obtain text features;

4. The method according to claim 3, wherein, if the content multi-modal type is a text type, the performing feature extraction on the text information through a first encoding manner to obtain text features comprises:

5. The method according to claim 3, wherein, if the content multi-modal type is an image type or an image sequence type, the performing feature extraction on the image information and the image sequence information, respectively, through a second encoding manner to obtain image features and behavior features comprises:

6. The method according to claim 1, wherein the coordinating and fusing the item features and the content features to obtain multiple groups of fusion features comprises:

randomly combining the plurality of first sub-content features to obtain a plurality of content combination features; the content combination features comprise content features corresponding to at least two content multi-modal types;

7. The method according to any one of claims 1 to 6, wherein the predicting the plurality of sets of fusion features through a preset recommendation model, and selecting the target item information and the target content information corresponding to a set of fusion features with the highest predicted value includes:

8. The method of any of claims 1-6, wherein the generating target multimedia information based on the target item information and the target content information comprises:

9. The method according to claim 8, wherein the performing layout generation on the target item information and the target content information through the preset layout generation model to obtain a plurality of layouts comprises:

10. The method according to claim 8, wherein, before the evaluating the plurality of layouts through the evaluation model to determine candidate layouts, the method further comprises:

acquiring historical target multimedia information;

11. The method of claim 8, wherein evaluating the plurality of layouts by an evaluation model to determine candidate layouts comprises:

12. A device for generating multimedia information, comprising an acquisition unit, a selection unit and a generation unit; wherein,

13. A multimedia information generating apparatus, comprising:

a memory for storing executable instructions;

a processor configured to implement the method for generating multimedia information according to any one of claims 1 to 11 when executing the executable instructions stored in the memory.

14. A computer readable storage medium storing executable instructions which, when executed, are adapted to cause a processor to perform the method of generating multimedia information according to any one of claims 1-11.