CN113780009A - Information generation method, apparatus, electronic device and computer readable medium

Info

Publication number: CN113780009A
Application number: CN202110130566.6A
Authority: CN
Inventors: 赵楠; 吴友政
Original assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Wodong Tianjun Information Technology Co Ltd
Current assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Wodong Tianjun Information Technology Co Ltd
Priority date: 2021-01-29
Filing date: 2021-01-29
Publication date: 2021-12-10

Abstract

Embodiments of the present disclosure disclose an information generation method, apparatus, electronic device, and computer-readable medium. A specific implementation of the method includes: acquiring a dialog information sequence in a dialog scene, wherein the dialog information in the dialog information sequence includes pictures and sentences; performing picture feature extraction processing on each picture included in the dialog information sequence to obtain a picture vector sequence; performing sentence feature extraction processing on each sentence included in the dialog information sequence to obtain a sentence vector sequence; and generating a response information feedback result based on the picture vector sequence and the sentence vector sequence. By taking the picture information input by the user into account, this embodiment improves the accuracy of replies to the sentences input by the user, which improves the user experience and reduces the loss of user traffic.

Description

Information generation method and device, electronic equipment and computer readable medium

Technical Field

Embodiments of the present disclosure relate to the field of computer technologies, and in particular, to an information generation method, an information generation device, an electronic device, and a computer-readable medium.

Background

With the rapid development of online shopping platforms, dialog systems are widely applied in human-computer dialog scenes. At present, a dialog system usually replies only to the sentences input by a user.

However, this approach generally has the following technical problem: other information input by the user is not considered, so the information input by the user cannot be replied to accurately, the user experience is poor, and user traffic is lost.

Disclosure of Invention

This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.

Some embodiments of the present disclosure propose information generation methods, apparatuses, electronic devices, and computer readable media to solve one or more of the technical problems mentioned in the background section above.

In a first aspect, some embodiments of the present disclosure provide an information generation method, including: acquiring a dialog information sequence in a dialog scene, wherein the dialog information in the dialog information sequence includes pictures and sentences; performing picture feature extraction processing on each picture included in the dialog information sequence to obtain a picture vector sequence; performing sentence feature extraction processing on each sentence included in the dialog information sequence to obtain a sentence vector sequence; and generating a response information feedback result based on the picture vector sequence and the sentence vector sequence.

Optionally, the generating a response information feedback result based on the picture vector sequence and the sentence vector sequence includes: determining the position corresponding to each piece of dialog information in the dialog information sequence to obtain a position sequence; performing position feature conversion processing on each position included in the position sequence to obtain a position vector sequence; and generating a response information feedback result based on the picture vector sequence, the sentence vector sequence, and the position vector sequence.

Optionally, the generating a response information feedback result based on the picture vector sequence, the sentence vector sequence, and the position vector sequence includes: determining a role corresponding to each piece of dialog information in the dialog information sequence to obtain a role set corresponding to the dialog information sequence; performing role feature conversion processing on each role in the role set to obtain a role vector sequence; and generating a response information feedback result based on the picture vector sequence, the sentence vector sequence, the position vector sequence, and the role vector sequence.

Optionally, the generating a response information feedback result based on the picture vector sequence, the sentence vector sequence, the position vector sequence, and the role vector sequence includes: fusing each picture vector in the picture vector sequence with the sentence vector, the position vector, and the role vector corresponding to that picture vector to generate a fusion vector, so as to obtain a fusion vector sequence; and inputting the fusion vector sequence into a pre-trained response text feedback model to generate a response information feedback result.

Optionally, the response text feedback model includes: an attention encoding neural network and an attention decoding neural network.

Optionally, the inputting the fusion vector sequence into a pre-trained response text feedback model to generate a response information feedback result includes: inputting the fusion vector sequence into the attention encoding neural network to obtain a multi-modal scene vector sequence; and inputting the multi-modal scene vector sequence into the attention decoding neural network to obtain a response information feedback result.

Optionally, the performing picture feature extraction processing on each picture included in the dialog information sequence to obtain a picture vector sequence includes: inputting each of the pictures into a pre-trained picture feature extraction network to generate a picture vector, so as to obtain a picture vector sequence.

Optionally, the performing sentence feature extraction processing on each sentence included in the dialog information sequence to obtain a sentence vector sequence includes: performing pooling processing on each of the sentences to generate a sentence vector, so as to obtain a sentence vector sequence.

In a second aspect, some embodiments of the present disclosure provide an information generation apparatus, the apparatus comprising: an acquisition unit configured to acquire a dialog information sequence in a dialog scene, wherein the dialog information in the dialog information sequence includes pictures and sentences; a picture feature extraction unit configured to perform picture feature extraction processing on each picture included in the dialog information sequence to obtain a picture vector sequence; a sentence feature extraction unit configured to perform sentence feature extraction processing on each sentence included in the dialog information sequence to obtain a sentence vector sequence; and a generating unit configured to generate a response information feedback result based on the picture vector sequence and the sentence vector sequence.

Optionally, the generating unit is further configured to: determine the position corresponding to each piece of dialog information in the dialog information sequence to obtain a position sequence; perform position feature conversion processing on each position included in the position sequence to obtain a position vector sequence; and generate a response information feedback result based on the picture vector sequence, the sentence vector sequence, and the position vector sequence.

Optionally, the generating unit is further configured to: determine a role corresponding to each piece of dialog information in the dialog information sequence to obtain a role set corresponding to the dialog information sequence; perform role feature conversion processing on each role in the role set to obtain a role vector sequence; and generate a response information feedback result based on the picture vector sequence, the sentence vector sequence, the position vector sequence, and the role vector sequence.

Optionally, the generating unit is further configured to: fuse each picture vector in the picture vector sequence with the sentence vector, the position vector, and the role vector corresponding to that picture vector to generate a fusion vector, so as to obtain a fusion vector sequence; and input the fusion vector sequence into a pre-trained response text feedback model to generate a response information feedback result.

Optionally, the generating unit is further configured to: input the fusion vector sequence into the attention encoding neural network to obtain a multi-modal scene vector sequence; and input the multi-modal scene vector sequence into the attention decoding neural network to obtain a response information feedback result.

Optionally, the picture feature extraction unit is further configured to: input each of the pictures into a pre-trained picture feature extraction network to obtain a picture vector sequence.

Optionally, the sentence feature extraction unit is further configured to: perform pooling processing on each of the sentences to generate a sentence vector, so as to obtain a sentence vector sequence.

In a third aspect, some embodiments of the present disclosure provide an electronic device, comprising: one or more processors; a storage device having one or more programs stored thereon, which when executed by one or more processors, cause the one or more processors to implement the method described in any of the implementations of the first aspect.

In a fourth aspect, some embodiments of the present disclosure provide a computer readable medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method described in any of the implementations of the first aspect.

The above embodiments of the present disclosure have the following advantages: the information generation method of some embodiments of the present disclosure reduces the loss of user traffic. Specifically, user traffic is lost because other information input by the user is not considered, so the information input by the user cannot be replied to accurately and the user experience is poor. Based on this, the information generation method of some embodiments of the present disclosure first acquires a dialog information sequence in a dialog scene, which provides data support for subsequently generating the text feedback result. Then, picture feature extraction processing is performed on each picture included in the dialog information sequence to obtain a picture vector sequence, so that the picture information input by the user is taken into account and data support is provided for improving the accuracy of the generated text feedback result. Next, sentence feature extraction processing is performed on each sentence included in the dialog information sequence to obtain a sentence vector sequence. Finally, a response information feedback result is generated based on the picture vector sequence and the sentence vector sequence. Because the picture information input by the user is considered, the accuracy of replies to the sentences input by the user is improved. As a result, the user experience is improved and the loss of user traffic is reduced.

Drawings

The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that components and elements are not necessarily drawn to scale.

FIGS. 1-2 are schematic diagrams of one application scenario of the information generation method of some embodiments of the present disclosure;

FIG. 3 is a flow diagram of some embodiments of an information generation method according to the present disclosure;

FIG. 4 is a flow diagram of further embodiments of an information generation method according to the present disclosure;

FIG. 5 is a schematic block diagram of some embodiments of an information generating apparatus according to the present disclosure;

FIG. 6 is a schematic structural diagram of an electronic device suitable for use in implementing some embodiments of the present disclosure.

Detailed Description

Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.

It should be noted that, for convenience of description, only the portions relevant to the invention are shown in the drawings. The embodiments in the present disclosure and the features of those embodiments may be combined with each other where no conflict arises.

It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence of the functions performed by the devices, modules or units.

It is noted that the modifiers "a", "an", and "the" in this disclosure are intended to be illustrative rather than limiting, and those skilled in the art will understand that they should be read as "one or more" unless the context clearly dictates otherwise.

The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.

The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.

FIGS. 1-2 are schematic diagrams of an application scenario of an information generation method according to some embodiments of the present disclosure.

In the application scenario of FIGS. 1-2, first, the computing device 101 may acquire a dialog information sequence 102 in a dialog scene. As shown in FIG. 2, the dialog information in the dialog information sequence 102 includes pictures and sentences. Here, the dialog information sequence may refer to the sequence of information exchanged in a dialog between a user and a human-machine customer service. For example, the dialog information sequence 102 may be "[ user: XXxxxxxXX, FIG. 1.png ]; [ human-machine customer service: XXXYYXXXXX, FIG. 2.png ]; [ user: xyyxyxyxxxyxy, FIG. 3.png ]". Next, the computing device 101 may perform picture feature extraction processing on each picture included in the dialog information sequence 102 to obtain a picture vector sequence 103. For example, the picture feature extraction processing may be performed on each picture included in the dialog information sequence 102 by a residual neural network. Then, the computing device 101 may perform sentence feature extraction processing on each sentence included in the dialog information sequence 102 to obtain a sentence vector sequence 104. For example, the sentence feature extraction processing may be performed on each sentence included in the dialog information sequence 102 by a language representation model. Finally, the computing device 101 may generate a response information feedback result 105 based on the picture vector sequence 103 and the sentence vector sequence 104. In practice, the picture vector sequence 103 and the sentence vector sequence 104 may be input into a text generation model (e.g., an attention neural network model) to generate the response information feedback result 105.

The computing device 101 may be hardware or software. When the computing device is hardware, it may be implemented as a distributed cluster composed of multiple servers or terminal devices, or as a single server or a single terminal device. When the computing device is software, it may be installed in the hardware devices enumerated above, and may be implemented, for example, as multiple pieces of software or software modules that provide distributed services, or as a single piece of software or software module. No particular limitation is imposed here.

It should be understood that the number of computing devices in FIG. 1 is merely illustrative. There may be any number of computing devices, as implementation needs dictate.

With continued reference to fig. 3, a flow 300 of some embodiments of an information generation method according to the present disclosure is shown. The information generation method comprises the following steps:

Step 301, acquiring a dialog information sequence in a dialog scene.

In some embodiments, an execution subject of the information generation method (e.g., the computing device 101 shown in FIG. 1) may acquire the dialog information sequence in the dialog scene from a terminal device through a wired or wireless connection. The dialog information in the dialog information sequence includes pictures and sentences. Here, the dialog information sequence may refer to the sequence of information exchanged in a dialog between a user and a human-machine customer service. For example, the dialog information sequence may be "[ XX appears on the screen and it cannot be used normally, as shown in the picture, FIG. 1.png ]; [ Does it look like this picture?, FIG. 2.png ]; [ Yes, FIG. 3.png ]".
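As a minimal sketch of how such a dialog information sequence might be held in memory, the example below pairs each sentence with its picture and speaker. The class name `DialogInfo`, the role labels, and the file names are assumptions made for illustration; the source does not prescribe a data structure.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class DialogInfo:
    """One piece of dialog information: speaker, sentence, optional picture."""
    role: str                      # assumed labels: "user" or "customer_service"
    sentence: str
    picture: Optional[str] = None  # hypothetical picture file name


def split_modalities(
    sequence: List[DialogInfo],
) -> Tuple[List[str], List[Optional[str]]]:
    """Separate sentences and pictures while preserving dialog order."""
    sentences = [info.sentence for info in sequence]
    pictures = [info.picture for info in sequence]
    return sentences, pictures


dialog = [
    DialogInfo("user", "XX appears on the screen and it cannot be used.", "1.png"),
    DialogInfo("customer_service", "Does it look like this picture?", "2.png"),
    DialogInfo("user", "Yes.", "3.png"),
]
sentences, pictures = split_modalities(dialog)
```

Keeping the two lists index-aligned means the i-th picture vector and the i-th sentence vector produced later refer to the same piece of dialog information.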

Step 302, performing picture feature extraction processing on each picture included in the dialog information sequence to obtain a picture vector sequence.

In some embodiments, the execution subject may input each picture included in the dialog information sequence into a pre-trained image extraction neural network model to obtain a picture vector sequence. Here, the image extraction neural network model may include, but is not limited to, at least one of: VGG, ResNet, GoogLeNet, MobileNet.

Step 303, performing sentence feature extraction processing on each sentence included in the dialog information sequence to obtain a sentence vector sequence.

In some embodiments, the execution subject may input each sentence included in the dialog information sequence into a pre-trained sentence feature extraction neural network model to obtain a sentence vector sequence. Here, the sentence feature extraction neural network model may be a recurrent neural network model, for example, an RNN (Recurrent Neural Network).

Step 304, generating a response information feedback result based on the picture vector sequence and the sentence vector sequence.

In some embodiments, first, the execution subject may add each picture vector in the picture vector sequence to its corresponding sentence vector to generate an addition vector, resulting in an addition vector sequence. Then, the addition vectors in the addition vector sequence may be input in order into a pre-trained text feedback model to generate a response information feedback result. Here, the text feedback model may be BERT (a language representation model).
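The addition described for step 304 can be sketched as an element-wise sum of index-aligned vectors. The function names and the toy values below are illustrative assumptions, and the sketch assumes that picture vectors and sentence vectors share the same dimension, which the source does not state explicitly.

```python
from typing import List

Vector = List[float]


def add_vectors(a: Vector, b: Vector) -> Vector:
    """Element-wise sum of two vectors of equal dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return [x + y for x, y in zip(a, b)]


def build_addition_sequence(
    picture_vectors: List[Vector], sentence_vectors: List[Vector]
) -> List[Vector]:
    """Add each picture vector to the sentence vector at the same index."""
    if len(picture_vectors) != len(sentence_vectors):
        raise ValueError("sequences must have the same length")
    return [add_vectors(p, s) for p, s in zip(picture_vectors, sentence_vectors)]


# Toy values chosen only to make the arithmetic easy to follow.
picture_vectors = [[0.5, 1.0, -1.0], [2.0, 0.0, 0.25]]
sentence_vectors = [[0.5, -1.0, 1.0], [1.0, 1.0, 0.75]]
addition_sequence = build_addition_sequence(picture_vectors, sentence_vectors)
# addition_sequence is [[1.0, 0.0, 0.0], [3.0, 1.0, 1.0]]
```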

The above embodiments of the present disclosure have the following advantages: the information generation method of some embodiments of the present disclosure reduces the loss of user traffic. Specifically, user traffic is lost because other information input by the user is not considered, so the sentence input by the user cannot be replied to accurately and the user experience is poor. Based on this, the information generation method of some embodiments of the present disclosure first acquires a dialog information sequence in a dialog scene, which provides data support for subsequently generating the text feedback result. Then, picture feature extraction processing is performed on each picture included in the dialog information sequence to obtain a picture vector sequence, so that the picture information input by the user is taken into account and data support is provided for improving the accuracy of the generated text feedback result. Next, sentence feature extraction processing is performed on each sentence included in the dialog information sequence to obtain a sentence vector sequence. Finally, a response information feedback result is generated based on the picture vector sequence and the sentence vector sequence. Because the picture information input by the user is considered, the accuracy of replies to the sentences input by the user is improved. As a result, the user experience is improved and the loss of user traffic is reduced.

With further reference to fig. 4, a flow diagram of further embodiments of an information generation method according to the present disclosure is shown. The information generation method comprises the following steps:

Step 401, acquiring a dialog information sequence in a dialog scene.

In some embodiments, for the specific implementation of step 401 and its technical effects, reference may be made to step 301 in the embodiments corresponding to FIG. 3; details are not repeated here.

Step 402, inputting each of the pictures into a pre-trained picture feature extraction network to obtain a picture vector sequence.

In some embodiments, the execution subject may input each picture into an image extraction network model to obtain a picture vector sequence. Here, the image extraction network model may be ResNet (a residual neural network), VGG, GoogLeNet, or LeNet.

Step 403, performing pooling processing on each of the sentences to generate a sentence vector, so as to obtain a sentence vector sequence.

In some embodiments, first, the execution subject may input each sentence into a pre-trained word embedding neural network model to obtain an initial sentence vector sequence. Then, max pooling is performed on each initial sentence vector in the initial sentence vector sequence to generate a sentence vector, yielding a sentence vector sequence. Here, the sentence feature extraction neural network model may be an RNN (Recurrent Neural Network).
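The pooling in step 403 can be illustrated with a small sketch. It assumes that each initial sentence vector is a list of per-word embedding vectors and that pooling takes the maximum over words in each dimension; the function name `max_pool` and the toy embeddings are hypothetical.

```python
from typing import List

Vector = List[float]


def max_pool(word_vectors: List[Vector]) -> Vector:
    """Collapse per-word vectors into one sentence vector.

    For every embedding dimension, keep the largest value seen across
    all words of the sentence.
    """
    if not word_vectors:
        raise ValueError("a sentence needs at least one word vector")
    return [max(column) for column in zip(*word_vectors)]


# Two sentences with three and two words; embedding dimension is 2.
initial_sentence_vectors = [
    [[0.1, 0.9], [0.4, 0.2], [0.3, 0.5]],
    [[-1.0, 2.0], [0.0, -3.0]],
]
sentence_vectors = [max_pool(words) for words in initial_sentence_vectors]
# sentence_vectors is [[0.4, 0.9], [0.0, 2.0]]
```

Pooling gives every sentence a vector of the same dimension regardless of how many words it contains, which is what allows it to be combined with a picture vector later.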

Step 404, determining a position corresponding to each piece of dialog information in the dialog information sequence to obtain a position sequence.

In some embodiments, the execution subject may determine the position corresponding to each piece of dialog information in the dialog information sequence to obtain a position sequence. Here, the position may refer to the sequence number of the dialog information within the dialog information sequence.

Step 405, performing position feature conversion processing on each position included in the position sequence to obtain a position vector sequence.

In some embodiments, the execution subject may input each position included in the position sequence into a position vector extraction neural network to obtain a position vector sequence. Here, the position vector extraction neural network may be an RNN (Recurrent Neural Network) or BERT (Bidirectional Encoder Representations from Transformers, a language representation model).
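Steps 404 and 405 can be sketched as follows. Note the substitution: the source names an RNN or BERT as the position vector extraction network, whereas this example uses the fixed sinusoidal position encoding known from Transformer models, purely because it needs no trained weights. Zero-based positions and the function names are also assumptions.

```python
import math
from typing import List


def position_sequence(dialog_length: int) -> List[int]:
    """Sequence number of each piece of dialog information (zero-based here)."""
    return list(range(dialog_length))


def sinusoidal_position_vector(position: int, dim: int) -> List[float]:
    """Fixed sinusoidal encoding: sine on even indices, cosine on odd ones."""
    vector = []
    for i in range(dim):
        # Dimensions 2k and 2k+1 share the same frequency.
        angle = position / (10000 ** ((2 * (i // 2)) / dim))
        vector.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vector


positions = position_sequence(3)
position_vectors = [sinusoidal_position_vector(p, 4) for p in positions]
```

A learned lookup table indexed by position would serve the same role; the only requirement for the later fusion step is that position vectors have the same dimension as the other vectors.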

Step 406, generating a response information feedback result based on the picture vector sequence, the sentence vector sequence, and the position vector sequence.

In some embodiments, the execution body may generate a response information feedback result based on the picture vector sequence, the sentence vector sequence, and the position vector sequence in various ways.

In some optional implementation manners of some embodiments, the execution subject may generate the response information feedback result by:

In the first step, the role corresponding to each piece of dialog information in the dialog information sequence is determined to obtain the role set corresponding to the dialog information sequence. In practice, the role corresponding to a piece of dialog information may refer to the party that output that dialog information. Here, the party may be the user or the human-machine customer service.

In the second step, role feature conversion processing is performed on each role in the role set to obtain a role vector sequence. In practice, the execution subject may input each role in the role set into a role vector conversion neural network to obtain a role vector sequence. Here, the role vector conversion neural network may be an RNN (Recurrent Neural Network) or BERT (Bidirectional Encoder Representations from Transformers, a language representation model).

In the third step, a response information feedback result is generated based on the picture vector sequence, the sentence vector sequence, the position vector sequence, and the role vector sequence.
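The first two steps above can be sketched with a lookup table that maps each role to a vector. This is a simplification: the source names an RNN or BERT for the role conversion, and the table `ROLE_VECTORS`, its values, and the role labels are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical role embeddings; in practice these would be learned.
ROLE_VECTORS: Dict[str, List[float]] = {
    "user": [1.0, 0.0, 0.0, 0.0],
    "customer_service": [0.0, 1.0, 0.0, 0.0],
}


def determine_roles(dialog: List[Tuple[str, str]]) -> List[str]:
    """Return the speaker of each (role, sentence) pair, in dialog order."""
    return [role for role, _ in dialog]


def role_vector_sequence(roles: List[str]) -> List[List[float]]:
    """Convert each role into its vector; unknown roles raise KeyError."""
    return [list(ROLE_VECTORS[role]) for role in roles]


dialog = [
    ("user", "It cannot be used normally."),
    ("customer_service", "Does it look like this picture?"),
    ("user", "Yes."),
]
roles = determine_roles(dialog)
role_vectors = role_vector_sequence(roles)
```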

In some optional implementations of some embodiments, the third step may include the following sub-steps:

In the first sub-step, each picture vector in the picture vector sequence is fused with the sentence vector, the position vector, and the role vector corresponding to that picture vector to generate a fusion vector, so as to obtain a fusion vector sequence. Here, the fusion processing may refer to addition processing.

In the second sub-step, the fusion vector sequence is input into a pre-trained response text feedback model to generate a response information feedback result. Here, the response text feedback model may include: an attention encoding neural network and an attention decoding neural network.

In practice, the execution subject may input the fusion vector sequence into the attention encoding neural network to obtain a multi-modal scene vector sequence. Then, the multi-modal scene vector sequence may be input into the attention decoding neural network to obtain a response information feedback result.
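Taking the fusion processing to be addition, as the first sub-step allows, the fusion vector sequence can be sketched as below. The function names and toy values are illustrative assumptions, and all four vector types are assumed to share one dimension.

```python
from typing import List

Vector = List[float]


def fuse(*vectors: Vector) -> Vector:
    """Element-wise sum of any number of equal-length vectors."""
    if len({len(v) for v in vectors}) != 1:
        raise ValueError("all vectors must have the same dimension")
    return [sum(components) for components in zip(*vectors)]


def build_fusion_sequence(
    picture_vectors: List[Vector],
    sentence_vectors: List[Vector],
    position_vectors: List[Vector],
    role_vectors: List[Vector],
) -> List[Vector]:
    """Fuse the four vectors that belong to the same piece of dialog information."""
    sequences = (picture_vectors, sentence_vectors, position_vectors, role_vectors)
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("all sequences must have the same length")
    return [fuse(p, s, pos, r) for p, s, pos, r in zip(*sequences)]


fusion_sequence = build_fusion_sequence(
    picture_vectors=[[1.0, 2.0], [0.5, 0.5]],
    sentence_vectors=[[0.0, 1.0], [0.5, -0.5]],
    position_vectors=[[0.0, 1.0], [1.0, 0.0]],
    role_vectors=[[1.0, 0.0], [0.0, 1.0]],
)
# fusion_sequence is [[2.0, 4.0], [2.0, 1.0]]
```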

As can be seen from FIG. 4, compared with the description of some embodiments corresponding to FIG. 3, the flow 400 in some embodiments corresponding to FIG. 4 fuses four dimensions of the dialog information: sentence, picture, position, and role. The fused vectors are then combined through a self-attention mechanism to obtain a multi-modal representation of the dialog scene, and finally the response information feedback result for the current dialog is generated by the response text feedback model. This improves the accuracy of replies to the sentences input by the user, which in turn improves the user experience and reduces the loss of user traffic.
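To make the self-attention step concrete, the sketch below applies scaled dot-product self-attention to a fusion vector sequence. It is deliberately simplified: a real attention encoder would apply learned query, key, and value projections, multiple heads, and feed-forward layers, none of which are shown, and the source does not specify the network's internals. The function names are assumptions.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(sequence: List[Vector]) -> List[Vector]:
    """Scaled dot-product self-attention with identity projections.

    Each output vector is a weighted average of all input vectors, with
    weights given by the softmax of the scaled dot products.
    """
    dim = len(sequence[0])
    scale = math.sqrt(dim)
    output = []
    for query in sequence:
        scores = [
            sum(q * k for q, k in zip(query, key)) / scale for key in sequence
        ]
        weights = softmax(scores)
        context = [
            sum(w * value[j] for w, value in zip(weights, sequence))
            for j in range(dim)
        ]
        output.append(context)
    return output


fusion_sequence = [[2.0, 4.0], [2.0, 1.0]]
scene_vectors = self_attention(fusion_sequence)
```

Because every output mixes information from all positions, each scene vector reflects the whole dialog rather than a single turn.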

With further reference to FIG. 5, as an implementation of the methods shown in the above figures, the present disclosure provides some embodiments of an information generation apparatus. These apparatus embodiments correspond to the method embodiments shown in FIG. 3, and the apparatus may be applied in various electronic devices.

As shown in FIG. 5, the information generating apparatus 500 of some embodiments includes: an acquisition unit 501, a picture feature extraction unit 502, a sentence feature extraction unit 503, and a generation unit 504. The acquisition unit 501 is configured to acquire a dialog information sequence in a dialog scene, where the dialog information in the dialog information sequence includes pictures and sentences; the picture feature extraction unit 502 is configured to perform picture feature extraction processing on each picture included in the dialog information sequence to obtain a picture vector sequence; the sentence feature extraction unit 503 is configured to perform sentence feature extraction processing on each sentence included in the dialog information sequence to obtain a sentence vector sequence; the generating unit 504 is configured to generate a response information feedback result based on the picture vector sequence and the sentence vector sequence.
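One way to picture the four units is as pluggable callables wired together by a small class. This is only an illustrative sketch: the class name `InformationGenerator`, the callable signatures, and the toy stand-in functions are assumptions, and the stand-ins do not perform real feature extraction or text generation.

```python
from typing import Callable, List, Tuple

Vector = List[float]
Dialog = List[Tuple[str, str]]  # (sentence, picture name) pairs


class InformationGenerator:
    """Wires together stand-ins for units 501 to 504."""

    def __init__(
        self,
        acquire: Callable[[], Dialog],
        extract_picture: Callable[[str], Vector],
        extract_sentence: Callable[[str], Vector],
        generate: Callable[[List[Vector], List[Vector]], str],
    ) -> None:
        self.acquire = acquire
        self.extract_picture = extract_picture
        self.extract_sentence = extract_sentence
        self.generate = generate

    def run(self) -> str:
        dialog = self.acquire()
        picture_vectors = [self.extract_picture(pic) for _, pic in dialog]
        sentence_vectors = [self.extract_sentence(sent) for sent, _ in dialog]
        return self.generate(picture_vectors, sentence_vectors)


# Toy stand-ins: vector contents are simple length counts, not real features.
apparatus = InformationGenerator(
    acquire=lambda: [("hello", "1.png"), ("hi there", "2.png")],
    extract_picture=lambda name: [float(len(name))],
    extract_sentence=lambda sentence: [float(len(sentence.split()))],
    generate=lambda pics, sents: f"{len(pics)} pictures, {len(sents)} sentences",
)
result = apparatus.run()
```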

In some optional implementations of some embodiments, the generating unit 504 is further configured to: determine the position corresponding to each piece of dialog information in the dialog information sequence to obtain a position sequence; perform position feature conversion processing on each position included in the position sequence to obtain a position vector sequence; and generate a response information feedback result based on the picture vector sequence, the sentence vector sequence, and the position vector sequence.

In some optional implementations of some embodiments, the generating unit 504 is further configured to: determine a role corresponding to each piece of dialog information in the dialog information sequence to obtain a role set corresponding to the dialog information sequence; perform role feature conversion processing on each role in the role set to obtain a role vector sequence; and generate a response information feedback result based on the picture vector sequence, the sentence vector sequence, the position vector sequence, and the role vector sequence.

In some optional implementations of some embodiments, the generating unit 504 is further configured to: fuse each picture vector in the picture vector sequence with the sentence vector, the position vector, and the role vector corresponding to that picture vector to generate a fusion vector, so as to obtain a fusion vector sequence; and input the fusion vector sequence into a pre-trained response text feedback model to generate a response information feedback result.

In some optional implementations of some embodiments, the generating unit 504 is further configured to: input the fusion vector sequence into the attention encoding neural network to obtain a multi-modal scene vector sequence; and input the multi-modal scene vector sequence into the attention decoding neural network to obtain a response information feedback result.

In some optional implementations of some embodiments, the picture feature extraction unit 502 is further configured to: input each of the pictures into a pre-trained picture feature extraction network to obtain a picture vector sequence.

In some optional implementations of some embodiments, the sentence feature extraction unit 503 is further configured to: perform pooling processing on each of the sentences to generate a sentence vector, so as to obtain a sentence vector sequence.

It will be understood that the units described in the apparatus 500 correspond to the respective steps of the method described with reference to FIG. 3. Thus, the operations, features, and resulting advantages described above with respect to the method also apply to the apparatus 500 and the units included therein, and are not described herein again.

Referring now to FIG. 6, a block diagram of an electronic device (e.g., computing device 101 of FIG. 1)600 suitable for use in implementing some embodiments of the present disclosure is shown. The electronic device in some embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle-mounted terminal (e.g., a car navigation terminal), and the like, and a stationary terminal such as a digital TV, a desktop computer, and the like. The electronic device shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.

As shown in FIG. 6, the electronic device 600 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 601 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage device 608 into a Random Access Memory (RAM) 603. The RAM 603 also stores various programs and data necessary for the operation of the electronic device 600. The processing device 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.

Generally, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 607 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage devices 608 including, for example, magnetic tape, hard disk, etc.; and a communication device 609. The communication device 609 may allow the electronic device 600 to communicate with other devices wirelessly or by wire to exchange data. While FIG. 6 illustrates an electronic device 600 having various devices, it is to be understood that not all of the illustrated devices are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided. Each block shown in FIG. 6 may represent one device or may represent multiple devices as desired.

In particular, according to some embodiments of the present disclosure, the processes described above with reference to the flow diagrams may be implemented as computer software programs. For example, some embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In some such embodiments, the computer program may be downloaded and installed from a network through the communication device 609, or installed from the storage device 608, or installed from the ROM 602. The computer program, when executed by the processing device 601, performs the above-described functions defined in the methods of some embodiments of the present disclosure.

It should be noted that the computer readable medium described in some embodiments of the present disclosure may be a computer readable signal medium, a computer readable storage medium, or any combination of the two. A computer readable storage medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In some embodiments of the present disclosure, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. A computer readable signal medium, by contrast, may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electromagnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.

In some embodiments, clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.

The computer readable medium may be embodied in the electronic device, or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquire a dialogue information sequence in a dialogue scene, wherein the dialogue information in the dialogue information sequence comprises pictures and sentences; perform picture feature extraction processing on each picture included in the dialogue information sequence to obtain a picture vector sequence; perform sentence feature extraction processing on each sentence included in the dialogue information sequence to obtain a sentence vector sequence; and generate a response information feedback result based on the picture vector sequence and the sentence vector sequence.
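The four operations carried by such a program can be sketched as a single function. This is only an outline under stated assumptions: the `DialogueInfo` type, the `generate_response` function, and the three toy stand-ins (`toy_picture_encoder`, `toy_sentence_encoder`, `toy_responder`) are hypothetical placeholders for the pre-trained networks described in the disclosure, and they do no real feature extraction or text generation.

```python
from dataclasses import dataclass
from typing import Callable, List

Vector = List[float]


@dataclass
class DialogueInfo:
    """One piece of dialogue information: a picture and a sentence."""
    picture: bytes
    sentence: str


def generate_response(
    dialogue: List[DialogueInfo],
    picture_encoder: Callable[[bytes], Vector],
    sentence_encoder: Callable[[str], Vector],
    responder: Callable[[List[Vector], List[Vector]], str],
) -> str:
    """Run the steps: extract picture vectors, extract sentence vectors, respond."""
    picture_vectors = [picture_encoder(info.picture) for info in dialogue]
    sentence_vectors = [sentence_encoder(info.sentence) for info in dialogue]
    return responder(picture_vectors, sentence_vectors)


# Toy stand-ins (assumptions): real embodiments would use trained networks.
def toy_picture_encoder(data: bytes) -> Vector:
    return [float(len(data))]


def toy_sentence_encoder(text: str) -> Vector:
    return [float(len(text.split()))]


def toy_responder(pictures: List[Vector], sentences: List[Vector]) -> str:
    return f"received {len(pictures)} pictures and {len(sentences)} sentences"


dialogue = [
    DialogueInfo(picture=b"\x00\x01\x02", sentence="does this come in red"),
    DialogueInfo(picture=b"\x03\x04", sentence="what about this one"),
]
reply = generate_response(
    dialogue, toy_picture_encoder, toy_sentence_encoder, toy_responder
)
```

Passing the encoders and the responder as parameters keeps the acquisition, extraction, and generation steps separable, so any one of them can be swapped out without changing the others.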

Computer program code for carrying out operations of embodiments of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
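As one possible illustration of blocks being executed substantially concurrently, the picture and sentence feature extraction steps do not depend on each other and could be submitted to a thread pool. The sketch below uses the standard `concurrent.futures` module; the two extraction functions are hypothetical toy stand-ins, and nothing in the disclosure requires this particular scheduling.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

Vector = List[float]


def extract_picture_vectors(pictures: List[bytes]) -> List[Vector]:
    """Toy stand-in (assumption): one number per picture."""
    return [[float(len(p))] for p in pictures]


def extract_sentence_vectors(sentences: List[str]) -> List[Vector]:
    """Toy stand-in (assumption): one number per sentence."""
    return [[float(len(s.split()))] for s in sentences]


def extract_features_concurrently(
    pictures: List[bytes], sentences: List[str]
) -> Tuple[List[Vector], List[Vector]]:
    """Run both extraction blocks in parallel and wait for both results."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        picture_future = pool.submit(extract_picture_vectors, pictures)
        sentence_future = pool.submit(extract_sentence_vectors, sentences)
        return picture_future.result(), sentence_future.result()


picture_vectors, sentence_vectors = extract_features_concurrently(
    [b"abc"], ["hello there"]
)
```

Because each future is resolved before the function returns, the later generation step still receives both vector sequences regardless of which extraction finished first.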

The units described in some embodiments of the present disclosure may be implemented in software or in hardware. The described units may also be provided in a processor, which may be described as follows: a processor includes an acquisition unit, a picture feature extraction unit, a sentence feature extraction unit, and a generating unit. In some cases, the names of these units do not limit the units themselves; for example, the generating unit may also be described as a "unit that generates a response information feedback result based on the picture vector sequence and the sentence vector sequence".

The functions described above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.

The foregoing description covers only preferred embodiments of the present disclosure and illustrates the technical principles employed. It will be appreciated by those skilled in the art that the scope of the invention in the embodiments of the present disclosure is not limited to the specific combination of the above-mentioned features, but also encompasses other embodiments formed by any combination of the above-mentioned features or their equivalents without departing from the inventive concept defined above. For example, the scope also covers technical solutions formed by interchanging the above features with (but not limited to) technical features having similar functions disclosed in the embodiments of the present disclosure.

Claims

1. An information generating method, comprising:

acquiring a dialogue information sequence in a dialogue scene, wherein the dialogue information in the dialogue information sequence comprises pictures and sentences;

performing picture feature extraction processing on each picture included in the dialogue information sequence to obtain a picture vector sequence;

performing sentence feature extraction processing on each sentence included in the dialogue information sequence to obtain a sentence vector sequence; and

generating a response information feedback result based on the picture vector sequence and the sentence vector sequence.

2. The method of claim 1, wherein the generating a response information feedback result based on the picture vector sequence and the sentence vector sequence comprises:

determining the position corresponding to each piece of dialogue information in the dialogue information sequence to obtain a position sequence;

performing position feature conversion processing on each position included in the position sequence to obtain a position vector sequence; and

generating a response information feedback result based on the picture vector sequence, the sentence vector sequence, and the position vector sequence.

3. The method of claim 2, wherein the generating a response information feedback result based on the picture vector sequence, the sentence vector sequence, and the position vector sequence comprises:

determining a role corresponding to each piece of dialogue information in the dialogue information sequence to obtain a role set corresponding to the dialogue information sequence;

performing role feature conversion processing on each role in the role set to obtain a role vector sequence; and

generating a response information feedback result based on the picture vector sequence, the sentence vector sequence, the position vector sequence, and the role vector sequence.

4. The method of claim 3, wherein the generating a response information feedback result based on the picture vector sequence, the sentence vector sequence, the position vector sequence, and the role vector sequence comprises:

fusing each picture vector in the picture vector sequence with the sentence vector, the position vector, and the role vector corresponding to that picture vector to generate a fusion vector, thereby obtaining a fusion vector sequence; and

inputting the fusion vector sequence into a pre-trained response text feedback model to generate a response information feedback result.

5. The method of claim 4, wherein the response text feedback model comprises: an attention encoding neural network and an attention decoding neural network; and

the inputting the fusion vector sequence into a pre-trained response text feedback model to generate a response information feedback result comprises:

inputting the fusion vector sequence into the attention encoding neural network to obtain a multi-modal scene vector sequence; and

inputting the multi-modal scene vector sequence into the attention decoding neural network to obtain a response information feedback result.

6. The method according to claim 1, wherein the performing picture feature extraction processing on each picture included in the dialogue information sequence to obtain a picture vector sequence comprises:

inputting each of the pictures into a pre-trained picture feature extraction network to obtain the picture vector sequence.

7. The method according to claim 1, wherein the performing sentence feature extraction processing on each sentence included in the dialogue information sequence to obtain a sentence vector sequence comprises:

performing pooling processing on each of the sentences to generate a sentence vector, thereby obtaining the sentence vector sequence.

8. An information generating apparatus comprising:

an acquisition unit configured to acquire a dialogue information sequence in a dialogue scene, wherein the dialogue information in the dialogue information sequence comprises pictures and sentences;

a picture feature extraction unit configured to perform picture feature extraction processing on each picture included in the dialogue information sequence to obtain a picture vector sequence;

a sentence feature extraction unit configured to perform sentence feature extraction processing on each sentence included in the dialogue information sequence to obtain a sentence vector sequence; and

a generating unit configured to generate a response information feedback result based on the picture vector sequence and the sentence vector sequence.

9. An electronic device, comprising:

one or more processors;

a storage device having one or more programs stored thereon;

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7.

10. A computer-readable medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the method of any one of claims 1-7.