CN118741039A

CN118741039A - Video conference processing method, device, electronic device and medium based on computing power network

Info

Publication number: CN118741039A
Application number: CN202410850112.XA
Authority: CN
Inventors: 王艳辉; 沈军; 方东; 杨春晖
Original assignee: Visionvera Information Technology Co Ltd
Current assignee: Visionvera Information Technology Co Ltd
Priority date: 2024-06-27
Filing date: 2024-06-27
Publication date: 2024-10-01

Abstract

The embodiments of the present invention provide a video conference processing method, device, electronic device and medium based on a computing power network, including: obtaining a first video stream sent by a target conference terminal; calling a first container to generate a picture of a portrait area in a video frame of the first video stream as a preset character image, and generating a picture of a background area in a video frame of the first video stream as a preset background image, to obtain a second video stream; calling a second container to extract scene features and/or portrait features from video frames of the first video stream, and repairing video frames in the second video stream according to the scene features and/or portrait features to obtain a third video stream; calling a third container to generate dynamic behavior features according to video frames in the first video stream and corresponding audio data, and repairing video frames in the third video stream according to the dynamic behavior features to obtain a fourth video stream, thereby realizing multiple repairs of digital human images through multiple containers in the computing power network.

Description

Video conference processing method and device based on computing power network, electronic equipment and medium

Technical Field

The present invention relates to the field of video conferencing technologies, and in particular, to a video conference processing method and apparatus based on a computing power network, an electronic device, and a medium.

Background

With the rapid development of global informatization, video conferencing has become an important means of daily work and communication. In some scenarios, however, the stability and quality of a video conference are limited. In these scenarios, digital human technology can be applied, that is, virtual characters are used in place of real people.

In the prior art, the digital human picture synthesized by digital human technology is of poor quality and differs greatly from the real picture of the participant's scene, which degrades the video conference experience.

Disclosure of Invention

In view of the foregoing problems, a video conference processing method, apparatus, electronic device and medium based on a computing power network are proposed that overcome or at least partially solve those problems, comprising:

A method of video conference processing based on a computing power network, the method comprising:

in the process of a video conference, acquiring a first video stream sent by a target conference terminal;

invoking a first container, generating a picture of a portrait area in a video frame of the first video stream as a preset character image, and generating a picture of a background area in the video frame of the first video stream as a preset background image, to obtain a second video stream carrying a digital human picture;

invoking a second container, extracting scene features and/or portrait features from video frames of the first video stream, and repairing the video frames in the second video stream according to the scene features and/or portrait features, to obtain a third video stream carrying a digital human picture;

and invoking a third container, generating dynamic behavior features according to the video frames in the first video stream and corresponding audio data, and repairing the video frames in the third video stream according to the dynamic behavior features, to obtain a fourth video stream carrying a digital human picture.

Optionally, the calling the first container generates a picture of a portrait area in a video frame of the first video stream as a preset portrait image, and generates a picture of a background area in the video frame of the first video stream as a preset background image, so as to obtain a second video stream carrying a digital portrait image, which includes:

Invoking the first container, extracting character audio features from audio data corresponding to the first video stream, determining character images from preset candidate character images according to the character audio features, extracting environment audio features from the audio data corresponding to the first video stream, and determining background images from preset candidate background images according to the environment audio features;

And calling the first container, generating a picture of a portrait area in a video frame of the first video stream into a preset portrait image, and generating a picture of a background area in the video frame of the first video stream into a preset background image to obtain a second video stream carrying a digital portrait image.

Optionally, the second container includes a first sub-container and a second sub-container, the calling the second container extracts scene features and/or portrait features from video frames of the first video stream, repairs the video frames in the second video stream according to the scene features and/or portrait features, and obtains a third video stream carrying digital portrait, including:

invoking the first sub-container, extracting scene features from video frames of the first video stream, and repairing background images of the video frames in the second video stream according to the scene features, to obtain a first sub-video stream carrying a digital human picture; and

invoking the second sub-container, extracting portrait features from the video frames of the first video stream, and repairing the portrait images of the video frames in the first sub-video stream according to the portrait features, to obtain a third video stream carrying a digital human picture;

or, invoking the second sub-container, extracting portrait features from the video frames of the first video stream, and repairing the portrait images of the video frames in the second video stream according to the portrait features, to obtain a second sub-video stream carrying a digital human picture; and

invoking the first sub-container, extracting scene features from the video frames of the first video stream, and repairing the background images of the video frames in the second sub-video stream according to the scene features, to obtain a third video stream carrying a digital human picture.

Optionally, the second container includes an image encoder, an image decoder and a color decoder;

the image encoder is used for extracting picture features and performing multi-scale processing;

the image decoder is used for upsampling the visual features;

the color decoder is used for decoding the color queries.

Optionally, the calling the third container generates a dynamic behavior feature according to the video frame in the first video stream and the corresponding audio data, repairs the video frame in the third video stream according to the dynamic behavior feature, and obtains a fourth video stream carrying a digital human picture, including:

Invoking the third container, generating dynamic behavior characteristics according to video frames and corresponding audio data in the first video stream, and generating transition frames according to the dynamic behavior characteristics;

and calling the third container, and inserting the transition frame into the video frame of the third video stream to obtain a fourth video stream carrying the digital human picture.

Optionally, the generating the dynamic behavior feature according to the video frame and the corresponding audio data in the first video stream includes:

determining a first dynamic behavior sub-feature according to the human behavior in the video frame of the first video stream, and generating a second dynamic behavior sub-feature according to the corresponding audio data of the first video stream;

And fusing the first dynamic behavior sub-feature and the second dynamic behavior sub-feature to obtain a dynamic behavior feature.

Optionally, before the calling the first container, generating a picture of a portrait area in a video frame of the first video stream into a preset portrait image, and generating a picture of a background area in a video frame of the first video stream into a preset background image, to obtain a second video stream, the method further includes:

when a trigger event is detected, turning on the digital human function.

A video conference processing device based on a computing power network, the device comprising:

The first video stream receiving module is used for acquiring a first video stream sent by a target conference terminal in the video conference process;

The second video stream obtaining module is used for calling the first container, generating a picture of a portrait area in a video frame of the first video stream into a preset portrait image, and generating a picture of a background area in the video frame of the first video stream into a preset background image to obtain a second video stream carrying digital portrait images;

The third video stream obtaining module is used for calling a second container, extracting scene characteristics and/or portrait characteristics from video frames of the first video stream, and repairing the video frames in the second video stream according to the scene characteristics and/or portrait characteristics to obtain a third video stream carrying digital portrait;

The fourth video stream obtaining module is used for calling a third container, generating dynamic behavior characteristics according to video frames in the first video stream and corresponding audio data, repairing the video frames in the third video stream according to the dynamic behavior characteristics, and obtaining a fourth video stream carrying digital human pictures.

An electronic device comprising a processor, a memory and a computer program stored on the memory and capable of running on the processor, which when executed by the processor implements a method as described above.

A computer readable storage medium having stored thereon a computer program which when executed by a processor implements a method as described above.

The embodiment of the invention has the following advantages:

In the embodiments of the invention, a first video stream sent by a target conference terminal is acquired during a video conference. A first container is called to generate the picture of the portrait area in a video frame of the first video stream as a preset character image and the picture of the background area as a preset background image, yielding a second video stream carrying a digital human picture. A second container is called to extract scene features and/or portrait features from video frames of the first video stream and to repair the video frames in the second video stream according to those features, yielding a third video stream carrying a digital human picture. A third container is called to generate dynamic behavior features from the video frames in the first video stream and the corresponding audio data and to repair the video frames in the third video stream according to those features, yielding a fourth video stream carrying a digital human picture, which is sent to the other conference terminals participating in the conference. The digital human picture is thus repaired multiple times by multiple containers in the computing power network, so that it approaches the real picture of the conference scene and the quality of the digital human picture synthesized in the video conference is improved.

Drawings

In order to more clearly illustrate the technical solutions of the present invention, the drawings that are needed in the description of the present invention will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort to a person skilled in the art.

FIG. 1 is a flowchart of the steps of a video conference processing method based on a computing power network according to some embodiments of the present invention;

FIG. 2 is a schematic diagram of a system architecture provided by some embodiments of the invention;

FIG. 3 is a flowchart of the steps of another video conference processing method based on a computing power network according to some embodiments of the present invention;

FIG. 4 is a block diagram of a video conference processing device based on a computing power network according to some embodiments of the present invention.

Detailed Description

In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description. It will be apparent that the described embodiments are some, but not all, embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Referring to FIG. 1, a flowchart of the steps of a video conference processing method based on a computing power network according to some embodiments of the present invention is shown. The method may specifically include the following steps:

Step 101, in the process of the video conference, acquiring a first video stream sent by a target conference terminal.

As an example, the video conference may be a video conference based on the view network, that is, data in the video conference is transmitted by using the view network, and the conference terminal may be a view network terminal.

In a video conference there are a plurality of conference terminals, and the video and audio collected by one conference terminal may need to be sent to the other conference terminals. For example, if a conference terminal needs to speak in the conference, that conference terminal is the target conference terminal. The target conference terminal can send the collected video to a server over the network as a video stream and the collected audio as an audio stream; the video and the audio can also be sent in the same video stream.

In some embodiments of the present invention, before the calling the first container, generating a picture of a portrait area in a video frame of the first video stream as a preset character image, and generating a picture of a background area in a video frame of the first video stream as a preset background image, to obtain a second video stream, the method further includes: when a trigger event is detected, turning on the digital human function.

In a video conference, image quality problems may occur in the video stream, for example blurred images (e.g. motion blur, lens defocus), low-resolution images, a low video frame rate, and other problems such as insufficient brightness, insufficient contrast, noise, overcompression, rain, snow or fog, pictures that are too bright or too dark, occlusion-type damage (e.g. logos, watermarks, subtitles, occluding objects), color distortion (e.g. color cast, missing color) and black-and-white illumination. These image quality problems can be addressed by applying digital human technology to synthesize a digital human picture.

In some examples, the above image quality problems are more likely to occur at low bandwidth, that is, when the bandwidth is smaller than a bandwidth threshold; for example, the bandwidth threshold of the video stream is 150 kb/s and the bandwidth threshold of the audio stream is 64 kb/s. The trigger event may then be the detection that the bandwidth is smaller than the bandwidth threshold. The server establishes a communication channel with each participating conference terminal and may be provided with a bandwidth monitoring module, which can detect the bandwidth condition of each communication channel; each channel may be assigned its own bandwidth threshold. After the video conference starts, the server can use the bandwidth monitoring module to detect whether the bandwidth of the channel connected to the target conference terminal is smaller than the bandwidth threshold, and when the bandwidth is smaller than the bandwidth threshold, a trigger event is detected.

In some examples, the trigger event may also be the detection that the definition of the video stream is lower than a definition threshold, for example a definition threshold of 1024. The server can monitor the video stream sent by the target conference terminal in real time, and when the definition of the video stream is lower than the definition threshold, a trigger event is detected.

In some examples, the trigger event may be a manual user operation, that is, the user manually turns on the digital human function. An interactive interface may be provided to the user at the target conference terminal; the user performs the manual operation through the interactive interface, the target conference terminal generates a video composition request and sends it to the server, and when the server receives the video composition request, a trigger event is detected. For example, when the camera connected to the target conference terminal can only capture low-definition video streams and cannot capture high-definition video streams, the user can, through the interactive interface, control the target conference terminal to generate and send a video composition request to the server. In another example, when large bandwidth fluctuations cause the picture of the video stream uploaded by the target conference terminal to stutter, the user can likewise control the target conference terminal, through the interactive interface, to generate and send a video composition request to the server, so that a trigger event is detected.
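The three kinds of trigger event described above (low bandwidth, low definition, manual request) can be illustrated with a minimal sketch. The threshold values come from the examples in the text; the `ChannelStatus` structure, the field names, and the choice to treat either the video or the audio bandwidth falling below its threshold as a trigger are assumptions made only for illustration, not part of the disclosed method.

```python
from dataclasses import dataclass
from typing import List

# Threshold values follow the examples in the text; how they are combined
# below is an assumption of this sketch.
VIDEO_BANDWIDTH_THRESHOLD_KBPS = 150
AUDIO_BANDWIDTH_THRESHOLD_KBPS = 64
DEFINITION_THRESHOLD = 1024


@dataclass
class ChannelStatus:
    """Hypothetical snapshot of one terminal's channel as seen by the server."""
    video_bandwidth_kbps: float
    audio_bandwidth_kbps: float
    definition: float
    composition_requested: bool = False  # True once a video composition request arrives


def detect_trigger_events(status: ChannelStatus) -> List[str]:
    """Return the names of all trigger events that currently hold."""
    events = []
    if (status.video_bandwidth_kbps < VIDEO_BANDWIDTH_THRESHOLD_KBPS
            or status.audio_bandwidth_kbps < AUDIO_BANDWIDTH_THRESHOLD_KBPS):
        events.append("low_bandwidth")
    if status.definition < DEFINITION_THRESHOLD:
        events.append("low_definition")
    if status.composition_requested:
        events.append("manual_request")
    return events


def should_enable_digital_human(status: ChannelStatus) -> bool:
    """The digital human function is turned on when any trigger event is detected."""
    return bool(detect_trigger_events(status))
```

For instance, `should_enable_digital_human(ChannelStatus(120, 64, 2000))` is true under these assumptions because the video bandwidth is below 150 kb/s.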

In some examples, a scheduling platform may be provided. The scheduling platform may control the number of cameras and microphones that users are allowed to switch on and off, and whether the AI function may be switched on (that is, whether the digital human function is turned on) is likewise subject to scheduling. In some examples, because the computing power of the back-end GPU cluster is limited, individual users may not be allowed to switch the AI enhancement and synthesis functions on and off at will. In some examples, the servers are initialized first, and the scheduling platform, the conference server and the video post-processing server establish TCP socket connections. Users log in at the participating terminals. After login succeeds, the conference terminal responsible for hosting starts a conference through the scheduling platform; the scheduling platform then controls the server to open a conference room and feeds conference information back to the hosting conference terminal, and the other participating terminals join the conference room using the conference information. The scheduling platform can control the participating terminals to turn on their cameras and microphones and can control the turning on of the digital human function.

Step 102, calling a first container, generating a picture of a portrait area in a video frame of the first video stream into a preset portrait image, and generating a picture of a background area in the video frame of the first video stream into a preset background image, so as to obtain a second video stream carrying a digital portrait image.

In practical applications, a plurality of containers, such as containers 1-4 in FIG. 2, can be deployed in the computing power network, and the video stream is processed by a plurality of container images connected in series. Through repeated feature extraction, multi-scale processing, and encoding and decoding, each container builds on the lower-fidelity output of the previous container, so that the picture quality of the output video stream continuously approaches the original picture (that is, the real picture of the participant's scene). The computing power network thus completes frame-by-frame correction of the video stream toward the original picture under ultra-low bandwidth and improves the effect of AI image enhancement.
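The serial arrangement of containers can be pictured with a minimal sketch in which each stage receives both the original frame and the output of the previous stage. The dictionary-based `Frame`, the stage functions and the field names are hypothetical placeholders; real containers would run image and audio models rather than copy fields.

```python
from typing import Callable, Dict, List

Frame = Dict[str, object]              # hypothetical stand-in for a decoded frame
Stage = Callable[[Frame, Frame], Frame]  # (original frame, current output) -> new output


def first_container(original: Frame, current: Frame) -> Frame:
    """Replace portrait and background with preset images (second stream)."""
    return {**current, "portrait": "preset", "background": "preset"}


def second_container(original: Frame, current: Frame) -> Frame:
    """Repair portrait and background from the original frame (third stream)."""
    return {**current,
            "portrait": original["portrait"],
            "background": original["background"]}


def third_container(original: Frame, current: Frame) -> Frame:
    """Repair dynamic behavior so picture and audio stay aligned (fourth stream)."""
    return {**current, "synced_to_audio": True}


def run_pipeline(original: Frame, stages: List[Stage]) -> List[Frame]:
    """Run the stages in series; every stage also sees the original frame."""
    outputs = []
    current = dict(original)  # copy so the caller's frame is never modified
    for stage in stages:
        current = stage(original, current)
        outputs.append(current)
    return outputs
```

Calling `run_pipeline(frame, [first_container, second_container, third_container])` returns the stand-ins for the second, third and fourth video stream frames in order.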

For the first video stream sent by the target conference terminal, the current video frame of the first video stream can be input into the first container. The first container can replace the current video stream with a high-definition digital human video stream according to a digital human video stream synthesis rule. Specifically, the person in the current video frame of the first video stream can be replaced with a preset character and the background with a preset background, yielding the second video stream.

In some embodiments of the present invention, the calling the first container generates a picture of a portrait area in a video frame of the first video stream as a preset portrait image, and generates a picture of a background area in a video frame of the first video stream as a preset background image, to obtain a second video stream carrying a digital portrait image, including:

Sub-step 11, calling the first container, extracting character audio features from the audio data corresponding to the first video stream, determining a character image from the preset candidate character images according to the character audio features, extracting environment audio features from the audio data corresponding to the first video stream, and determining a background image from the preset candidate background images according to the environment audio features.

In practical application, a plurality of candidate character images and candidate background images may be preset, and matching character images and background images may be selected from the plurality of candidate character images and candidate background images according to the result of analyzing the audio data.

For the character image, character audio features may be extracted from the audio data corresponding to the first video stream. The character audio features may include the character's gender, voice age, personality and so on; for example, a given set of character audio features may indicate that the gender is female, the voice age is 30 and the personality is lively. According to the character audio features, the first container of the computing power network can find a matching character image. In some examples, the first container of the computing power network may preferentially select frequently chosen pictures as the character image, based on the frequency of use in the large model data.

For the background image, environment audio features may be extracted from the audio data corresponding to the first video stream. The environment audio features may include a noise attribute, an echo attribute and a character role; for example, a given set of environment audio features may indicate that the noise attribute is quiet (noise below 20 dB), the echo attribute is echo present (exceeding 10 dB) and the character role is senior company executive. According to the environment audio features, the first container of the computing power network can find a matching background image. In some examples, the first container of the computing power network may preferentially select frequently chosen pictures as the background image, based on the frequency of use in the large model data.
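As a rough illustration of matching a preset image to audio features, the sketch below scores each candidate by the number of feature values it matches and breaks ties by usage count, which stands in for the frequency-of-use preference mentioned above. The `Candidate` structure, the tag names and the scoring rule are assumptions of this sketch; the text does not specify how matching is performed.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Candidate:
    """Hypothetical preset image with descriptive tags and a usage count."""
    name: str
    tags: Dict[str, str]
    use_count: int = 0


def select_candidate(features: Dict[str, str], candidates: List[Candidate]) -> Candidate:
    """Pick the candidate matching the most features; ties go to the most used one."""
    def score(candidate: Candidate) -> Tuple[int, int]:
        matches = sum(1 for key, value in features.items()
                      if candidate.tags.get(key) == value)
        return (matches, candidate.use_count)

    # max() raises ValueError on an empty candidate list, which is acceptable here.
    return max(candidates, key=score)
```

The same function can serve both selections: once with character audio features against candidate character images, and once with environment audio features against candidate background images.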

Sub-step 12, calling the first container, generating a picture of a portrait area in a video frame of the first video stream as a preset character image, and generating a picture of a background area in the video frame of the first video stream as a preset background image, so as to obtain a second video stream carrying a digital human picture.

Step 103, calling a second container, extracting scene features and/or portrait features from the video frames of the first video stream, and repairing the video frames in the second video stream according to the scene features and/or portrait features to obtain a third video stream carrying a digital human picture.

For the second video stream output by the first container, since the person in the current video frame of the first video stream is generated into a preset person and the background is generated into a preset background, the image quality of the first video stream can be improved from low image quality to 4K image quality, but the person and the scene do not coincide with the real person and the scene picture.

In practical applications, the video frames of the second video stream, together with the video frames of the first video stream that are subsequently received during the video conference, may be input into the second container.

The second container can perform scene feature extraction on the video frames in the first video stream to obtain scene features, then generate a high-definition scene image from the scene features and use it for digital human synthesis.

The second container can also perform portrait feature extraction on the video frames in the first video stream to obtain portrait features, then generate a high-definition portrait image from the portrait features and use it for digital human synthesis.

And repairing video frames in the second video stream according to scene characteristics and portrait characteristics to obtain a third video stream carrying digital portrait, wherein the third video stream still has 4K picture quality, and the scene is consistent with the real scene picture and the personage is consistent with the real personage picture.

In some embodiments of the invention, the second container comprises an image encoder, an image decoder and a color decoder. The image encoder performs picture feature extraction and multi-scale processing, the image decoder performs upsampling of the visual features, and the color decoder performs decoding of the color queries.

In some embodiments of the present invention, the second container includes a first sub-container and a second sub-container, the calling the second container extracts scene features and/or portrait features from video frames of the first video stream, repairs the video frames in the second video stream according to the scene features and/or portrait features, and obtains a third video stream carrying digital portrait, including:

Sub-step 21, calling the first sub-container, extracting scene features from the video frames of the first video stream, and repairing the background images of the video frames in the second video stream according to the scene features to obtain a first sub-video stream carrying a digital human picture.

Because the background image of the digital person in the previously generated video stream is not unified with the real scene image, the scene characteristics can be extracted from the video frame of the subsequently received first video stream, then the background image of the digital person in the previously generated video stream is repaired by adopting the scene characteristics, the image quality of the generated video stream is still 4K, and the scene is consistent with the real scene image.

Sub-step 22, calling the second sub-container, extracting portrait features from the video frames of the first video stream, and repairing the portrait images of the video frames in the first sub-video stream according to the portrait features to obtain a third video stream carrying a digital human picture.

Because the character image of the digital person in the video stream generated in advance is not unified with the real character image, the character features can be extracted from the video frames of the first video stream received later, then the character image of the digital person in the video stream generated in advance is repaired by adopting the character features, the image quality of the generated video stream is still 4K, and the character is consistent with the real character image.

In some examples, the repair approach includes feature extraction, multi-scale processing, AI coloring and color richness optimization. The container is implemented mainly with a dual-decoder architecture, which comprises an image encoder and two decoders. The image encoder extracts picture features and performs multi-scale processing. The two decoders are an image decoder and a color decoder: the image decoder performs upsampling of the visual features, and the color decoder decodes the color queries based on a Transformer.

Alternatively, sub-step 31, calling the second sub-container, extracting portrait features from the video frames of the first video stream, and repairing the portrait images of the video frames in the second video stream according to the portrait features to obtain a second sub-video stream carrying a digital human picture.

Sub-step 32, calling the first sub-container, extracting scene features from the video frames of the first video stream, and repairing the background images of the video frames in the second sub-video stream according to the scene features to obtain a third video stream carrying a digital human picture.
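The two orderings of the sub-containers (background first, then portrait; or portrait first, then background) can be sketched as below. The repair functions simply copy a field from the original frame, which is a deliberate simplification and an assumption of this sketch; it is meant only to show that the order is a parameter and that both orders end in a frame with both regions repaired.

```python
from typing import Callable, Dict, List

Frame = Dict[str, str]                  # hypothetical stand-in for a video frame
Repair = Callable[[Frame, Frame], Frame]  # (original frame, current frame) -> repaired frame


def repair_background(original: Frame, current: Frame) -> Frame:
    """First sub-container: repair the background using scene features."""
    return {**current, "background": original["background"]}


def repair_portrait(original: Frame, current: Frame) -> Frame:
    """Second sub-container: repair the portrait using portrait features."""
    return {**current, "portrait": original["portrait"]}


def run_second_container(original: Frame, second_stream_frame: Frame,
                         order: List[Repair]) -> Frame:
    """Apply the sub-containers in the given order to a second-stream frame."""
    current = dict(second_stream_frame)
    for repair in order:
        current = repair(original, current)
    return current
```

In this simplified model the two orders give identical results; with real models the intermediate sub-video streams differ, which is why the text names them separately.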

Step 104, calling a third container, generating dynamic behavior characteristics according to the video frames in the first video stream and corresponding audio data, and repairing the video frames in the third video stream according to the dynamic behavior characteristics to obtain a fourth video stream carrying digital human pictures.

The video frames in the second video stream are repaired according to the scene features and the portrait features, so the characters and scenes in the resulting third video stream are consistent with the real characters and scene. However, continuously optimizing the character image and the background image may cause the audio and the picture to fall out of sync.

In practical applications, the audio-video synchronization problem arises mainly because dynamic behaviors of the digital human, such as expressions and actions, are out of sync with the audio. The dynamic behavior features of the digital human, such as expression features and action features, can therefore be determined by analyzing the video frames and corresponding audio data in the subsequently received first video stream. The video frames in the third video stream can then be repaired according to the dynamic behavior features to obtain a fourth video stream. The picture quality of the fourth video stream is still 4K, the scene is consistent with the real scene picture, the character is consistent with the real character picture, and the audio and picture are kept fully synchronized.

In some embodiments of the present invention, calling the third container, generating dynamic behavior features according to the video frames in the first video stream and the corresponding audio data, and repairing the video frames in the third video stream according to the dynamic behavior features to obtain a fourth video stream carrying the digital human picture includes:

Calling the third container, generating dynamic behavior features according to the video frames in the first video stream and the corresponding audio data, and generating transition frames according to the dynamic behavior features; and calling the third container, and inserting the transition frames among the video frames of the third video stream to obtain a fourth video stream carrying the digital human picture.

In practical applications, transition frames can be generated from the dynamic behavior features by frame interpolation and then inserted among the video frames of the third video stream, so that consecutive frames join smoothly.
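The frame insertion described above can be illustrated with a minimal sketch. The patent derives its transition frames from the dynamic behavior features; the code below swaps in plain linear blending between neighboring frames, purely to show where inserted frames land in the stream. The `Frame` alias, `blend`, `insert_transition_frames`, and the `per_gap` parameter are hypothetical names.

```python
from typing import List

Frame = List[float]  # hypothetical frame: a flat list of pixel intensities


def blend(a: Frame, b: Frame, t: float) -> Frame:
    """Linear blend of two equally sized frames; t=0 gives a, t=1 gives b."""
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]


def insert_transition_frames(frames: List[Frame], per_gap: int = 1) -> List[Frame]:
    """Insert `per_gap` blended frames between each pair of consecutive frames."""
    if not frames:
        return []
    out: List[Frame] = [frames[0]]
    for prev, nxt in zip(frames, frames[1:]):
        for k in range(1, per_gap + 1):
            # Evenly spaced blend positions strictly between prev and nxt.
            out.append(blend(prev, nxt, k / (per_gap + 1)))
        out.append(nxt)
    return out


third_stream = [[0.0, 0.0], [1.0, 2.0]]
fourth_stream = insert_transition_frames(third_stream, per_gap=1)
```

With one transition frame per gap, a stream of n frames grows to 2n - 1 frames; the original frames keep their relative order.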

In some embodiments of the present invention, generating dynamic behavior features according to the video frames in the first video stream and the corresponding audio data includes: determining a first dynamic behavior sub-feature according to the character behavior in the video frames of the first video stream, and generating a second dynamic behavior sub-feature according to the audio data corresponding to the first video stream; and fusing the first dynamic behavior sub-feature and the second dynamic behavior sub-feature to obtain the dynamic behavior feature.

In practical applications, the first dynamic behavior sub-feature can be determined from the character behavior by analyzing the video frames of the first video stream, and the second dynamic behavior sub-feature can be generated by analyzing the audio data corresponding to the first video stream. The two sub-features are then fused to obtain the dynamic behavior feature.
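The source does not specify how the two sub-features are fused. As one possible illustration, the sketch below assumes both sub-features are numeric vectors of equal length and fuses them by weighted averaging; the `fuse` function and the `visual_weight` parameter are hypothetical.

```python
from typing import List, Sequence


def fuse(visual: Sequence[float], audio: Sequence[float],
         visual_weight: float = 0.5) -> List[float]:
    """Weighted average of two equally long feature vectors (an assumed fusion rule)."""
    if len(visual) != len(audio):
        raise ValueError("sub-features must have the same length")
    if not 0.0 <= visual_weight <= 1.0:
        raise ValueError("visual_weight must be in [0, 1]")
    return [visual_weight * v + (1.0 - visual_weight) * a
            for v, a in zip(visual, audio)]


first_sub_feature = [0.0, 1.0, 0.5]   # e.g. derived from character behavior in video frames
second_sub_feature = [1.0, 1.0, 0.0]  # e.g. derived from the corresponding audio data
dynamic_behavior_feature = fuse(first_sub_feature, second_sub_feature)
```

Other fusion rules (concatenation, learned attention, and so on) would fit the same description equally well; averaging is chosen here only for brevity.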

In some examples, after the fourth video stream is generated, the fourth video stream may be sent to other conference terminals participating in the conference, and in particular, the fourth video stream may be sent to a conference server, which sends the fourth video stream to the other conference terminals.
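The relay path described above (target terminal to conference server to the other terminals) can be sketched as a simple fan-out. The `ConferenceServer` class, its in-memory inboxes, and the terminal identifiers are hypothetical; a real deployment would use network transport rather than Python lists.

```python
from typing import Dict, List


class ConferenceServer:
    """Hypothetical relay: forwards a packet to every joined terminal except the sender."""

    def __init__(self) -> None:
        self.inboxes: Dict[str, List[bytes]] = {}

    def join(self, terminal_id: str) -> None:
        self.inboxes.setdefault(terminal_id, [])

    def forward(self, sender_id: str, packet: bytes) -> int:
        """Deliver `packet` to all other joined terminals; return the delivery count."""
        delivered = 0
        for terminal_id, inbox in self.inboxes.items():
            if terminal_id != sender_id:
                inbox.append(packet)
                delivered += 1
        return delivered


server = ConferenceServer()
for tid in ("target", "terminal_b", "terminal_c"):
    server.join(tid)
count = server.forward("target", b"fourth-stream-packet")
```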

Referring to fig. 3, a flowchart of the steps of another video conference processing method based on a computing power network according to some embodiments of the present invention is shown. The method may specifically include the following steps:

Step 301, in the process of a video conference, acquiring a first video stream sent by a target conference terminal.

Step 302, when a trigger event is detected, turning on the digital human function.

Step 303, calling a first container, extracting a character audio feature from audio data corresponding to the first video stream, determining a character image from preset candidate character images according to the character audio feature, extracting an environment audio feature from audio data corresponding to the first video stream, and determining a background image from preset candidate background images according to the environment audio feature.

Step 304, calling the first container, generating the picture of the portrait area in the video frames of the first video stream as the preset character image, and generating the picture of the background area in the video frames of the first video stream as the preset background image, to obtain a second video stream carrying the digital human picture.

Step 305, calling a first sub-container of a second container, extracting scene features from the video frames of the first video stream, and repairing the background image of the video frames in the second video stream according to the scene features, to obtain a first sub-video stream carrying the digital human picture.

Step 306, calling a second sub-container of the second container, extracting portrait features from the video frames of the first video stream, and repairing the portrait image of the video frames in the first sub-video stream according to the portrait features, to obtain a third video stream carrying the digital human picture.

Step 307, calling a third container, generating dynamic behavior characteristics according to the video frames and the corresponding audio data in the first video stream, and generating transition frames according to the dynamic behavior characteristics.

Step 308, calling the third container, and inserting the transition frame into the video frame of the third video stream to obtain a fourth video stream carrying the digital human picture.
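Step 303 above selects a character image and a background image from preset candidates according to audio features. The source does not state the matching rule; the sketch below assumes each candidate carries a reference feature vector and picks the nearest one by squared Euclidean distance. The function names, candidate names, and all numeric vectors are hypothetical.

```python
from typing import Dict, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def select_candidate(feature: Sequence[float],
                     candidates: Dict[str, Sequence[float]]) -> str:
    """Return the name of the candidate whose reference vector is closest to `feature`."""
    if not candidates:
        raise ValueError("no candidates configured")
    return min(candidates, key=lambda name: squared_distance(feature, candidates[name]))


# Hypothetical reference vectors; real audio features would come from a feature extractor.
character_candidates = {"avatar_low_voice": [0.2, 0.8], "avatar_high_voice": [0.9, 0.1]}
background_candidates = {"office": [0.1, 0.1], "outdoor": [0.8, 0.7]}

character_image = select_candidate([0.85, 0.2], character_candidates)  # character audio feature
background_image = select_candidate([0.2, 0.0], background_candidates)  # environment audio feature
```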

It should be noted that, for simplicity of description, the method embodiments are shown as a series of acts, but it should be understood by those skilled in the art that the embodiments are not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred embodiments, and that the acts are not necessarily required by the embodiments of the invention.

Referring to fig. 4, a schematic structural diagram of a video conference processing device based on a computing power network according to some embodiments of the present invention is shown. The device may specifically include the following modules:

a first video stream obtaining module 401, configured to obtain a first video stream sent by a target conference terminal in a video conference process;

a second video stream obtaining module 402, configured to call a first container, generate the picture of the portrait area in the video frames of the first video stream as a preset character image, and generate the picture of the background area in the video frames of the first video stream as a preset background image, to obtain a second video stream carrying the digital human picture;

a third video stream obtaining module 403, configured to call a second container, extract scene features and/or portrait features from the video frames of the first video stream, and repair the video frames in the second video stream according to the scene features and/or portrait features, to obtain a third video stream carrying the digital human picture;

and a fourth video stream obtaining module 404, configured to call a third container, generate dynamic behavior features according to the video frames in the first video stream and the corresponding audio data, and repair the video frames in the third video stream according to the dynamic behavior features, to obtain a fourth video stream carrying the digital human picture.

In some embodiments of the invention, the second container comprises an image encoder, an image decoder, and a color decoder.

In some embodiments of the present invention, the generating dynamic behavior features according to the video frames and the corresponding audio data in the first video stream includes:

In some embodiments of the present invention, the device further includes:

a digital human function starting module, configured to turn on the digital human function when a trigger event is detected, before the first container is called to generate the picture of the portrait area in the video frames of the first video stream as the preset character image and the picture of the background area as the preset background image to obtain the second video stream.
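The module structure of the device can be sketched as a small composition of callables. This is an illustrative sketch only: the `VideoConferenceDevice` class, its field names, and the string-labelled frames are hypothetical, and passing the first stream through unchanged while the digital human function is off is an assumption that the source does not state.

```python
from dataclasses import dataclass
from typing import Callable, List

Stream = List[str]  # hypothetical stream: a list of frame labels


@dataclass
class VideoConferenceDevice:
    """Hypothetical composition of modules 402-404 plus the digital human switch."""
    generate_digital_human: Callable[[Stream], Stream]      # module 402, first container
    repair_appearance: Callable[[Stream, Stream], Stream]   # module 403, second container
    repair_dynamics: Callable[[Stream, Stream], Stream]     # module 404, third container
    digital_human_enabled: bool = False

    def on_trigger_event(self) -> None:
        # Stand-in for the module that turns on the digital human function.
        self.digital_human_enabled = True

    def process(self, first: Stream) -> Stream:
        # Module 401 is represented by the `first` argument (the acquired stream).
        if not self.digital_human_enabled:
            return first  # assumption: pass through unchanged while the function is off
        second = self.generate_digital_human(first)
        third = self.repair_appearance(second, first)
        return self.repair_dynamics(third, first)


device = VideoConferenceDevice(
    generate_digital_human=lambda first: [f"avatar({f})" for f in first],
    repair_appearance=lambda second, first: [f"repaired({f})" for f in second],
    repair_dynamics=lambda third, first: [f"synced({f})" for f in third],
)
before = device.process(["f0"])
device.on_trigger_event()
after = device.process(["f0"])
```

Each repair stage receives both the stream it repairs and the first video stream, mirroring the text, where features are always extracted from the first video stream.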

Some embodiments of the invention also provide an electronic device comprising a processor, a memory and a computer program stored on the memory and capable of running on the processor, the computer program implementing a method as above when executed by the processor.

Some embodiments of the present invention also provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method as above.

Some embodiments of the invention also provide a computer program product comprising a computer program which, when executed by a processor, implements a method as above.

For the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments for relevant points.

It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) related to the present application are information and data authorized by the user or fully authorized by each party, and the collection, use and processing of the related data need to comply with the related laws and regulations and standards of the related country and region, and provide corresponding operation entries for the user to select authorization or rejection.

In this specification, the embodiments are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and for identical or similar parts the embodiments may be referred to one another.

It will be apparent to those skilled in the art that embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the invention may take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.

Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal device to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal device, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiment and all such alterations and modifications as fall within the scope of the embodiments of the invention.

Finally, it is further noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or terminal device that comprises the element.

The video conference processing method, device, electronic equipment and medium based on a computing power network provided by the present invention have been described in detail above. Specific examples are used herein to illustrate the principles and implementations of the present invention, and the above description of the embodiments is only intended to help understand the method and core ideas of the present invention. Meanwhile, those skilled in the art may, following the ideas of the present invention, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.

Claims

1. A video conference processing method based on a computing network, characterized in that the method comprises:

During the video conference, obtaining a first video stream sent by a target conference terminal;

Calling the first container, generating a picture of a portrait area in the video frame of the first video stream as a preset character image, and generating a picture of a background area in the video frame of the first video stream as a preset background image, to obtain a second video stream carrying a picture of a digital human;

Calling the second container, extracting scene features and/or portrait features from the video frames of the first video stream, and repairing the video frames in the second video stream according to the scene features and/or portrait features to obtain a third video stream carrying the digital human image;

Calling the third container, generating dynamic behavior features according to the video frames and corresponding audio data in the first video stream, and repairing the video frames in the third video stream according to the dynamic behavior features to obtain a fourth video stream carrying the digital human image.

2. The method according to claim 1, characterized in that the calling of the first container, generating the picture of the portrait area in the video frame of the first video stream as a preset character image, and generating the picture of the background area in the video frame of the first video stream as a preset background image, to obtain the second video stream carrying the digital human picture, comprises:

Calling the first container, extracting character audio features from audio data corresponding to the first video stream, determining a character image from preset candidate character images based on the character audio features, and extracting environmental audio features from audio data corresponding to the first video stream, determining a background image from preset candidate background images based on the environmental audio features;

Calling the first container, generating the picture of the portrait area in the video frame of the first video stream as the preset character image, and generating the picture of the background area in the video frame of the first video stream as the preset background image, to obtain a second video stream carrying the digital human picture.

3. The method according to claim 1, characterized in that the second container includes a first sub-container and a second sub-container, and the calling of the second container, extracting scene features and/or portrait features from the video frames of the first video stream, and repairing the video frames in the second video stream according to the scene features and/or portrait features to obtain a third video stream carrying the digital human image, comprises:

Calling the first sub-container, extracting scene features from the video frames of the first video stream, and repairing the background image of the video frames in the second video stream according to the scene features, to obtain a first sub-video stream carrying the digital human image;

Calling the second sub-container, extracting portrait features from the video frames of the first video stream, and repairing the portrait image of the video frames in the first sub-video stream according to the portrait features, to obtain a third video stream carrying the digital human image;

Or, calling the second sub-container, extracting portrait features from the video frames of the first video stream, and repairing the portrait image of the video frames in the second video stream according to the portrait features, to obtain a second sub-video stream carrying the digital human image;

Calling the first sub-container, extracting scene features from the video frames of the first video stream, and repairing the background image of the video frames in the second sub-video stream according to the scene features, to obtain a third video stream carrying the digital human image.

4. The method according to claim 3, characterized in that the second container comprises an image encoder, an image decoder, and a color decoder;

The image encoder is used for picture feature extraction and multi-scale processing;

The image decoder is used for upsampling the visual features;

The color decoder is used for decoding the color queries.

5. The method according to any one of claims 1 to 4, characterized in that the calling of the third container, generating dynamic behavior features according to the video frames and corresponding audio data in the first video stream, and repairing the video frames in the third video stream according to the dynamic behavior features to obtain a fourth video stream carrying the digital human image, comprises:

Calling the third container, generating dynamic behavior features according to the video frames and corresponding audio data in the first video stream, and generating transition frames according to the dynamic behavior features;

Calling the third container, and inserting the transition frames among the video frames of the third video stream to obtain a fourth video stream carrying the digital human image.

6. The method according to claim 5, characterized in that generating dynamic behavior features according to the video frames and corresponding audio data in the first video stream comprises:

Determining a first dynamic behavior sub-feature according to the behavior of the person in the video frame of the first video stream, and generating a second dynamic behavior sub-feature according to the audio data corresponding to the first video stream;

Fusing the first dynamic behavior sub-feature and the second dynamic behavior sub-feature to obtain the dynamic behavior feature.

7. The method according to claim 1, characterized in that before calling the first container, generating the picture of the portrait area in the video frame of the first video stream as a preset character image, and generating the picture of the background area in the video frame of the first video stream as a preset background image, to obtain the second video stream, the method further comprises:

When a trigger event is detected, turning on the digital human function.

8. A video conference processing device based on a computing network, characterized in that the device comprises:

A first video stream acquisition module, used to acquire a first video stream sent by a target conference terminal during a video conference;

A second video stream obtaining module, used to call the first container, generate the picture of the portrait area in the video frame of the first video stream as a preset character image, and generate the picture of the background area in the video frame of the first video stream as a preset background image, so as to obtain a second video stream carrying the picture of the digital human;

A third video stream obtaining module is used to call the second container, extract scene features and/or portrait features from the video frames of the first video stream, and repair the video frames in the second video stream according to the scene features and/or portrait features to obtain a third video stream carrying the digital human image;

A fourth video stream obtaining module, used to call the third container, generate dynamic behavior features according to the video frames and corresponding audio data in the first video stream, and repair the video frames in the third video stream according to the dynamic behavior features to obtain a fourth video stream carrying the digital human image.

9. An electronic device, characterized in that it comprises a processor, a memory, and a computer program stored in the memory and capable of running on the processor, wherein when the computer program is executed by the processor, the method according to any one of claims 1 to 7 is implemented.

10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the method according to any one of claims 1 to 7 is implemented.