Detailed Description
In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description. It will be apparent that the described embodiments are some, but not all, embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to fig. 1, a flowchart illustrating steps of a video conference method based on a frame insertion technique according to some embodiments of the present invention is shown, where the method may be applied to a server in a video conference.
In some examples, the server may be one server or may be multiple servers. As shown in fig. 2, the servers may include a conference server (such as an all-in-one machine, XMCU, etc.) and a video post-processing server (which may also be an AI server), and the conference server and the video post-processing server are connected to each other.
The method may be implemented partly in the conference server and partly in the video post-processing server. For example, content related to video stream synthesis may be implemented by the video post-processing server; this may include AI processing (e.g., determining the digital portrait, super-resolution reconstruction) and streaming media processing (e.g., audio encoding and decoding, video encoding and decoding). Other content may be implemented by the conference server.
As shown in fig. 2, the video post-processing server may be provided with a plurality of super-resolution models (also referred to herein as super-division models), and may call these models to generate a digital human video stream (i.e., the second video stream hereinafter). In some examples, the super-resolution models may be models provided on a plurality of third-party platforms, and the digital human video stream is generated by invoking the super-resolution model of a third-party platform.
The video post-processing server may also be connected to a computing power network (also referred to herein as the power network or the power calculation network), and may perform some functions by calling it, for example the frame insertion function (i.e., inserting video frames of the second video stream into the first video stream to obtain the third video stream).
Here, the video post-processing server (AI server) is a data server capable of providing artificial intelligence services. It can support both local applications and web pages, and can provide complex AI models and services for cloud and local servers.
In some examples, the video post-processing server may be provided with an AI conference management module, a user management module, and a proxy service module. The AI conference management module may be used to establish a communication channel with the conference server, to manage, compute and classify the received data streams, to encode and decode audio and video streams, and to synthesize the high-definition digital portrait video stream. The user management module may be used to manage user information in the video conference. When a conference terminal needs to create a high-definition digital portrait video stream, the user management module may create user information for that conference terminal in the video conference. The user name, portrait data and the like used in the user information may be synchronized to the AI server by the conference server when the conference terminal starts the conference; the user information mainly comprises a conference ID, the user ID used when the conference terminal logged in, an image ID, an identifier of the conference terminal, and the like. The proxy service module may be used to establish a communication channel with the conference server and, once that channel is established, to upload, download and forward files.
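The user information record kept by the user management module can be pictured with a small data structure. The following is only an illustrative sketch: the class names, field names and the choice of (conference ID, terminal identifier) as the lookup key are assumptions made for this example and are not specified by the embodiments.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class ConferenceUser:
    """Hypothetical record mirroring the fields listed in the description."""
    conference_id: str
    user_id: str       # user ID used when the conference terminal logged in
    image_id: str      # identifies the synchronized portrait/image data
    terminal_id: str   # identifier of the conference terminal


class UserManager:
    """Hypothetical in-memory stand-in for the user management module."""

    def __init__(self) -> None:
        self._users: Dict[Tuple[str, str], ConferenceUser] = {}

    def register(self, user: ConferenceUser) -> None:
        # Assumption: one record per terminal per conference.
        self._users[(user.conference_id, user.terminal_id)] = user

    def find(self, conference_id: str, terminal_id: str) -> Optional[ConferenceUser]:
        return self._users.get((conference_id, terminal_id))
```

A record would be registered when the conference server synchronizes the terminal's data, and looked up later when a digital portrait video stream has to be created for that terminal.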
Specifically, the method comprises the following steps:
Step 101, acquiring a first video stream sent by a target conference terminal participating in a video conference.
During the video conference, when the picture of the target conference terminal needs to be shared, the target conference terminal may capture the first video stream through its camera and microphone and send it to the server; as shown in fig. 2, the target conference terminal sends the first video stream to the conference server.
Step 102, generating a second video stream containing digital human pictures for the first video stream.
Here, a digital person is a humanoid image produced by computer technology, or a product produced by computer software. Digital persons have the appearance or behavior patterns of humans but are not video recordings of anyone in the real world; they can run and exist independently. A digital person resides in a computing device (e.g., a computer, a mobile phone, a VR headset) and is presented through a display device so that people can see it or interact with it by voice. Digital persons have independent personality settings, specific identities, independent names, independent character images, and independent knowledge bases for answering specific questions.
In some examples, the video stream synthesis process is a super-resolution reconstruction process. Super-resolution reconstruction improves the resolution of an original image by hardware or software methods, obtaining a high-resolution image from a series of low-resolution images. Specifically, when the video conference stutters or when the definition of the video stream captured by the conference terminal is low, the video stream is switched to a high-definition video stream, which improves the quality of the played video and keeps the video conference running normally.
When the digital person function is turned on, digital person technology may be used to generate a second video stream containing digital person pictures. As shown in fig. 2, the video post-processing server may send the first video stream to the super-resolution (super-division) model and to the computing power network. The super-resolution model generates the second video stream containing the digital person pictures, and the computing power network computes the frame insertion mechanism for the video data stream: it determines the number of digital person frames currently required from the total number of frames of the video data stream currently received, and then inserts the generated digital person frames into the real video stream using the video frame insertion principle. In this way, provided the computing power network finely controls the number of inserted frames, the digital person and the real video stream can be played alternately throughout the video conference, so that the conference maintains smooth, high-quality playback.
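The description does not state how the number of digital person frames is derived from the total number of received frames. One plausible reading, shown here purely as an assumption, is that the shortfall against a target frame rate over an observation window is made up with digital person frames. The function name, the `target_fps` parameter and the window-based accounting are all hypothetical.

```python
def digital_frames_needed(received_frames: int,
                          window_seconds: float,
                          target_fps: float) -> int:
    """Return how many digital person frames would fill the shortfall.

    Assumption (not from the embodiments): the stream should carry
    target_fps frames per second, and any missing frames in the window
    are replaced by digital person frames.
    """
    if received_frames < 0 or window_seconds <= 0 or target_fps <= 0:
        raise ValueError("inputs must be positive (received_frames may be 0)")
    expected = round(target_fps * window_seconds)
    return max(0, expected - received_frames)
```

For instance, if 180 real frames arrive in a 10-second window and the assumed target is 25 frames per second, 70 digital person frames would be requested; if the real stream already meets the target, none are.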
In some embodiments of the present invention, before generating, for the first video stream, the second video stream containing digital person pictures, the method further includes:
acquiring image data sent by the target conference terminal.
In some embodiments of the present invention, generating, for the first video stream, the second video stream containing digital person pictures includes:
generating, based on the image data, a second video stream containing digital person pictures.
In practical applications, the target conference terminal may upload image data to the server after joining the video conference. From the uploaded image data, the server may find the target image data that was synchronized when the user created the conference (if no image data was synchronized at creation, one piece of image data may be matched automatically at random from the photos stored by the user management module). For example, the image data includes portrait image data and conference scene image data. A second video stream containing the digital portrait picture may then be generated based on the image data.
In some examples, the corresponding audio data of the first video stream may be further obtained, decoded by the media library and converted into text information, and then the target image data and the text information may be input into the super-resolution reconstruction model in the third-party platform to generate a second video stream for presenting the digital portrait.
In some embodiments of the present invention, before generating, for the first video stream, the second video stream containing digital person pictures, the method further includes:
turning on a digital person function in response to a trigger event.
In a video conference, the video stream may suffer image quality problems such as blurred images (e.g., motion blur, lens defocus), low-resolution images, video stuttering or a low frame rate, and other problems (e.g., insufficient brightness, insufficient contrast, noise, over-compression, rain, snow or fog, overexposure or underexposure, occlusion-type damage (e.g., logos, watermarks, subtitles, occluding objects), color distortion (e.g., color cast, missing color), and black-and-white illumination). These image quality problems may be addressed by applying digital person technology to synthesize digital person pictures.
In some examples, the above image quality problems are more likely to occur at low bandwidth, that is, when the bandwidth is smaller than a bandwidth threshold; for example, the bandwidth threshold of the video stream is 150kb/s and the bandwidth threshold of the audio stream is 64kb/s. The triggering event may then be the event of detecting that the bandwidth is smaller than the bandwidth threshold. The server establishes a communication channel with each participant terminal and may be provided with a bandwidth monitoring module, which can detect the bandwidth condition of each communication channel; each channel may be given its own bandwidth threshold. After the video conference starts, the server may use the bandwidth monitoring module to detect whether the bandwidth of the channel connected to the target conference terminal is smaller than the bandwidth threshold, and when the bandwidth is smaller than the bandwidth threshold, a triggering event is detected.
In some examples, the triggering event may also be an event that the definition of the detected video stream is less than a definition threshold, for example, the definition threshold is 1024, and the video stream sent by the target conference terminal may be monitored in real time by the server, and when the definition of the video stream is less than the definition threshold, the triggering event is detected.
In some examples, the trigger event may be a manual user operation, that is, the user manually turns on the digital person function. An interactive interface may be provided for the user at the target conference terminal. The user may operate through the interactive interface so that the target conference terminal generates a video composition request and sends it to the server; when the server receives the video composition request, the trigger event is detected. For example, when the camera connected to the target conference terminal can only capture low-definition video streams and cannot capture high-definition video streams, the user may, through the interactive interface, control the target conference terminal to generate and send a video composition request to the server. In another example, when large bandwidth fluctuations cause the video stream uploaded by the target conference terminal to stutter, the user may likewise control the target conference terminal through the interactive interface to generate and send a video composition request to the server, so that a trigger event is detected.
In some examples, a scheduling platform may be provided. It may be used to control how many cameras and microphones users are allowed to turn on and off, and whether the AI function (i.e., the digital person function) is allowed to be turned on also needs to be scheduled. In some examples, because of the limited computing power of the background GPU cluster, individual users may not be allowed to turn the AI enhancement and synthesis functions on and off at will. In some examples, the server may first be initialized, and the scheduling platform, the conference server and the video post-processing server may then establish TCP Socket connections. The participant terminals may perform user login. After login succeeds, the conference terminal responsible for hosting may start a conference through the scheduling platform; the scheduling platform may then control the server to open a conference room and feed conference information back to the hosting conference terminal. Other participant terminals may join the conference room using the conference information, and the scheduling platform may control the participant terminals to turn on their cameras and microphones and may control whether the digital person function is turned on.
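The three kinds of trigger event described above (bandwidth below threshold, definition below threshold, and a manually issued video composition request) can be checked together. The sketch below is only an illustration: the type names, the `ChannelStatus` fields and the function names are hypothetical, and the default thresholds simply reuse the example values from the text (150kb/s for video bandwidth, 1024 for definition, whose unit the text does not specify).

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class TriggerEvent(Enum):
    LOW_BANDWIDTH = "low_bandwidth"
    LOW_DEFINITION = "low_definition"
    MANUAL_REQUEST = "manual_request"


@dataclass
class ChannelStatus:
    """Hypothetical snapshot of one terminal's channel as seen by the server."""
    bandwidth_kbps: float         # measured bandwidth of the channel
    definition: float             # measured definition of the video stream
    composition_requested: bool   # True once a video composition request arrived


def detect_triggers(status: ChannelStatus,
                    bandwidth_threshold: float = 150.0,
                    definition_threshold: float = 1024.0) -> List[TriggerEvent]:
    """Return every trigger event that currently holds for the channel."""
    events: List[TriggerEvent] = []
    if status.bandwidth_kbps < bandwidth_threshold:
        events.append(TriggerEvent.LOW_BANDWIDTH)
    if status.definition < definition_threshold:
        events.append(TriggerEvent.LOW_DEFINITION)
    if status.composition_requested:
        events.append(TriggerEvent.MANUAL_REQUEST)
    return events


def should_enable_digital_person(status: ChannelStatus) -> bool:
    """The digital person function is turned on if any trigger is detected."""
    return bool(detect_triggers(status))
```

Both comparisons are strict, matching the wording "smaller than" and "less than" in the text, so a value exactly at the threshold does not trigger.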
Step 103, inserting the video frames of the second video stream into the first video stream to obtain a third video stream, and sending the third video stream to other conference terminals participating in the video conference.
After the second video stream containing the digital person picture is obtained, a video frame insertion algorithm may be used to insert video frames of the second video stream into the first video stream, fusing the digital person video stream with the real video stream. The fused third video stream is then sent to the other conference terminals. There is no need to switch back and forth between the digital person video stream and the real video stream, which simplifies operation and improves the quality of the presented video.
The core principle of a video frame insertion algorithm is to insert additional frames between existing video frames to increase the frame rate of the video. Examples include video frame insertion methods based on optical flow and video frame insertion methods based on deep learning.
In practical applications, the second video stream containing the digital person picture may be generated by the super-resolution (super-division) model, which may send it to the computing power network; the computing power network may then insert video frames of the second video stream into the first video stream, fusing the digital person video stream with the real video stream.
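To make the idea of "inserting additional frames between existing frames" concrete, the sketch below generates intermediate frames by plain linear blending of two neighbouring frames. This is deliberately the simplest possible stand-in: it is neither an optical-flow method nor a deep-learning method, and it is not the algorithm used by the embodiments. Frames are modelled as nested lists of grayscale values, and all names are hypothetical.

```python
from typing import List

Frame = List[List[float]]  # rows of grayscale pixel values (assumed layout)


def blend_frames(frame_a: Frame, frame_b: Frame, t: float) -> Frame:
    """Linearly blend two equally sized frames; t=0 gives frame_a, t=1 gives frame_b."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    if len(frame_a) != len(frame_b):
        raise ValueError("frames must have the same height")
    blended: Frame = []
    for row_a, row_b in zip(frame_a, frame_b):
        if len(row_a) != len(row_b):
            raise ValueError("frames must have the same width")
        blended.append([(1.0 - t) * pa + t * pb for pa, pb in zip(row_a, row_b)])
    return blended


def interpolate(frame_a: Frame, frame_b: Frame, n: int) -> List[Frame]:
    """Return n evenly spaced intermediate frames between frame_a and frame_b."""
    if n < 0:
        raise ValueError("n must be non-negative")
    return [blend_frames(frame_a, frame_b, k / (n + 1)) for k in range(1, n + 1)]
```

Inserting one blended frame between every pair of neighbours doubles the frame rate. Optical-flow and learned methods replace the per-pixel blend with motion-compensated prediction, which avoids the ghosting that simple blending produces on moving content.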
In some embodiments of the present invention, the inserting the video frames of the second video stream into the first video stream to obtain a third video stream includes:
when the video frame rate of the first video stream is smaller than a threshold value, inserting the video frames of the second video stream into the region of the first video stream in which the video frame rate is smaller than the threshold value, so as to obtain a third video stream.
In practical applications, the video frame rate of the first video stream may be detected. When the video frame rate is greater than or equal to the threshold value, the bandwidth is stable and the quality of the first video stream meets the requirement, so no frame needs to be inserted into the first video stream. When the video frame rate is smaller than the threshold value, the bandwidth is unstable and the quality of the first video stream may not meet the requirement, so video frames of the second video stream may be inserted into the first video stream.
In some embodiments of the present invention, when the video frame rate of the first video stream is less than a threshold value, inserting video frames of the second video stream in a region where the video frame rate is less than the threshold value in the first video stream to obtain a third video stream, including:
determining a first video frame and a second video frame from the first video stream; determining a video frame rate between the first video frame and the second video frame; and inserting the video frames of the second video stream between the first video frame and the second video frame when the video frame rate is smaller than a frame rate threshold value, so as to obtain a third video stream.
The second video frame is a video frame ordered after the first video frame; for example, the second video frame is the video frame received a preset duration (for example, 1 minute or 2 minutes) after the first video frame is received.
In practical applications, one frame may be selected from the first video stream as the first video frame, for example at random; the first video frame serves as the initial frame. A frame after the preset duration may then be selected as the second video frame, which serves as the target frame.
After determining the first video frame and the second video frame, a video frame rate between the first video frame and the second video frame may be determined, and then a determination may be made as to whether the video frame rate is less than a frame rate threshold.
When the video frame rate is greater than or equal to the threshold value, the bandwidth is stable and the quality of the video frames between the first video frame and the second video frame meets the requirement, so no frame needs to be inserted between them. When the video frame rate is smaller than the threshold value, the bandwidth is unstable and the quality of the video frames between the first video frame and the second video frame may not meet the requirement, so video frames of the second video stream may be inserted between the first video frame and the second video frame to improve the video quality.
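The segment-level decision described above can be sketched as follows. The example is an illustration under stated assumptions rather than the embodiments' implementation: frames are represented only by a timestamp and a source label, the first and second video frames are identified by their timestamps `start` and `end`, and the segment frame rate is taken as the number of real frames in the segment divided by its duration. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Frame:
    timestamp: float  # seconds since the start of the stream
    source: str       # "real" (first stream) or "digital" (second stream)


def segment_frame_rate(frames: List[Frame], start: float, end: float) -> float:
    """Frames per second of the frames whose timestamps lie in [start, end]."""
    duration = end - start
    if duration <= 0:
        raise ValueError("end must be later than start")
    count = sum(1 for f in frames if start <= f.timestamp <= end)
    return count / duration


def insert_digital_frames(real: List[Frame], digital: List[Frame],
                          start: float, end: float,
                          fps_threshold: float) -> List[Frame]:
    """Build the third stream for one segment of the first stream."""
    if segment_frame_rate(real, start, end) >= fps_threshold:
        return list(real)  # quality is sufficient: nothing is inserted
    # Only digital frames strictly between the two boundary frames are inserted.
    extra = [f for f in digital if start < f.timestamp < end]
    return sorted(list(real) + extra, key=lambda f: f.timestamp)
```

With real frames at 0 s, 0.5 s and 1.0 s (3 frames per second over the segment) and a threshold of 5, digital frames at 0.25 s and 0.75 s are merged in, while a digital frame at 1.5 s falls outside the segment and is left out.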
In some embodiments of the invention, the determining a video frame rate between the first video frame and the second video frame comprises:
And determining a characteristic change vector between the first video frame and the second video frame, and determining a video frame rate between the first video frame and the second video frame according to the characteristic change vector.
In practical application, the first video frame and the second video frame may be input into an algorithm model, a feature change vector, such as a displacement vector, between the first video frame and the second video frame is calculated through the algorithm model, and then a video frame rate between the first video frame and the second video frame may be determined according to the feature change vector.
In some examples, the video frame rate between the first video frame and the second video frame may also be calculated in other ways, such as counting the number of video frames between the first video frame and the second video frame, and then calculating the video frame rate from that number and the duration between the first video frame and the second video frame.
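The counting approach amounts to dividing the number of frames by the elapsed time. As a worked instance, with numbers chosen purely for illustration (they do not come from the embodiments), let N be the number of frames counted between the first and second video frames and Δt the duration between them:

```latex
r = \frac{N}{\Delta t},
\qquad
N = 450,\ \Delta t = 60\ \text{s}
\;\Longrightarrow\;
r = \frac{450}{60} = 7.5\ \text{frames/s}
```

Under an assumed frame rate threshold of 15 frames/s, 7.5 < 15, so video frames of the second video stream would be inserted between the two frames.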
In some embodiments of the present invention, before the inserting the video frames of the second video stream into the first video stream to obtain a third video stream, the method further includes:
Extracting portrait features from the first video stream; and adjusting the video frames in the second video stream according to the portrait characteristic.
After the second video stream containing the digital person picture is generated, it may be optimized using the first video stream, that is, the real video stream is used to optimize the digital person video stream. Specifically, portrait features (which may include, for example, face features and action features) may be extracted from the first video stream and then used to adjust the video frames in the second video stream, so that the generated digital portrait video stream grows richer as features are fed back and its picture fits the real picture more closely.
In practical applications, the portrait features may be extracted from the first video stream by the computing power network and fed back to the super-resolution (super-division) model, which may then use them to correct the second video stream.
In the embodiments of the present invention, a first video stream sent by a target conference terminal participating in a video conference is acquired, a second video stream containing digital person pictures is generated for the first video stream, video frames of the second video stream are inserted into the first video stream to obtain a third video stream, and the third video stream is sent to the other conference terminals participating in the video conference. The digital person and the real video stream are thereby fused through the frame insertion technique during the video conference: the digital person and the real video are played alternately throughout, the digital person pictures are blended in imperceptibly, and the video conference maintains smooth, high-quality playback. The scheme places no high demands on the monitoring capability of the conference terminal or the conference server and requires no secondary development, which reduces cost.
Referring to fig. 3, a flowchart illustrating steps of another video conferencing method based on a frame inserting technology according to some embodiments of the present invention may specifically include the following steps:
Step 301, obtaining image data sent by a target conference terminal.
Step 302, in response to a triggering event, turning on a digital person function.
Step 303, obtaining a first video stream sent by a target conference terminal participating in the video conference.
Step 304, generating a second video stream containing digital person pictures based on the image data.
Step 305, when the video frame rate of the first video stream is smaller than the threshold value, inserting video frames of the second video stream into the region of the first video stream in which the video frame rate is smaller than the threshold value to obtain a third video stream, and sending the third video stream to other conference terminals participating in the video conference.
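The order of steps 301 to 305 can be summarized as a single control flow. In the sketch below the stream generation and the frame insertion are passed in as callables, because the embodiments delegate them to the super-resolution model and the computing power network; the function name, parameter names and the representation of frames as strings are all assumptions made for this illustration.

```python
from typing import Callable, List, Sequence

Stream = List[str]  # frames are plain labels here, purely for illustration


def process_stream(first_stream: Sequence[str],
                   image_data: bytes,
                   trigger_detected: bool,
                   measured_fps: float,
                   fps_threshold: float,
                   generate: Callable[[bytes, Sequence[str]], Stream],
                   insert: Callable[[Sequence[str], Sequence[str]], Stream]) -> Stream:
    """Return the stream to forward to the other conference terminals."""
    # Step 302: without a trigger event the digital person function stays off.
    if not trigger_detected:
        return list(first_stream)
    # Step 304: generate the second (digital person) stream from the image data.
    second_stream = generate(image_data, first_stream)
    # Step 305: insert frames only where the frame rate is below the threshold.
    if measured_fps >= fps_threshold:
        return list(first_stream)
    return insert(first_stream, second_stream)
```

Passing the two operations in as parameters keeps the sketch independent of any particular model or interpolation method; a caller supplies whatever generator and insertion routine the deployment actually uses.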
It should be noted that, for simplicity of description, the method embodiments are shown as a series of acts, but it should be understood by those skilled in the art that the embodiments are not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred embodiments, and that the acts are not necessarily required by the embodiments of the invention.
Referring to fig. 4, a schematic structural diagram of a video conference apparatus based on a frame inserting technology according to some embodiments of the present invention may specifically include the following modules:
a first video stream obtaining module 401, configured to obtain a first video stream sent by a target conference terminal participating in a video conference;
a second video stream generating module 402, configured to generate, for the first video stream, a second video stream containing digital person pictures;
a third video stream generating and sending module 403, configured to insert video frames of the second video stream into the first video stream to obtain a third video stream, and send the third video stream to other conference terminals participating in the video conference.
In some embodiments of the present invention, the third video stream generating and sending module 403 is configured to:
when the video frame rate of the first video stream is smaller than a threshold value, insert the video frames of the second video stream into the region of the first video stream in which the video frame rate is smaller than the threshold value, so as to obtain a third video stream.
In some embodiments of the present invention, the third video stream generating and sending module 403 is configured to:
determine a first video frame and a second video frame from the first video stream, wherein the second video frame is a video frame ordered after the first video frame;
determine a video frame rate between the first video frame and the second video frame;
and insert the video frames of the second video stream between the first video frame and the second video frame when the video frame rate is smaller than a frame rate threshold value, so as to obtain a third video stream.
In some embodiments of the present invention, the third video stream generating and sending module 403 is configured to:
determine a characteristic change vector between the first video frame and the second video frame, and determine the video frame rate between the first video frame and the second video frame according to the characteristic change vector.
In some embodiments of the present invention, the apparatus further includes:
a portrait feature extraction module, configured to extract portrait features from the first video stream;
and a video frame adjusting module, configured to adjust the video frames in the second video stream according to the portrait features.
In some embodiments of the present invention, the apparatus further includes:
an image data acquisition module, configured to acquire image data sent by the target conference terminal.
In some embodiments of the present invention, the second video stream generating module 402 is configured to generate a second video stream containing digital person pictures based on the image data.
In some embodiments of the present invention, the apparatus further includes:
a digital person function starting module, configured to turn on the digital person function in response to a trigger event.
In the embodiments of the present invention, a first video stream sent by a target conference terminal participating in a video conference is acquired, a second video stream containing digital person pictures is generated for the first video stream, video frames of the second video stream are inserted into the first video stream to obtain a third video stream, and the third video stream is sent to the other conference terminals participating in the video conference. The digital person and the real video stream are thereby fused through the frame insertion technique during the video conference: the digital person and the real video are played alternately throughout, the digital person pictures are blended in imperceptibly, and the video conference maintains smooth, high-quality playback. The scheme places no high demands on the monitoring capability of the conference terminal or the conference server and requires no secondary development, which reduces cost.
Some embodiments of the invention also provide an electronic device comprising a processor, a memory and a computer program stored on the memory and capable of running on the processor, the computer program implementing a method as above when executed by the processor.
Some embodiments of the present invention also provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method as above.
Some embodiments of the invention also provide a computer program product comprising a computer program which, when executed by a processor, implements a method as above.
For the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments for relevant points.
It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) related to the present application are information and data authorized by the user or fully authorized by each party, and the collection, use and processing of the related data need to comply with the related laws and regulations and standards of the related country and region, and provide corresponding operation entries for the user to select authorization or rejection.
In this specification, the embodiments are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and for identical or similar parts the embodiments may be referred to one another.
It will be apparent to those skilled in the art that embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the invention may take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal device to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal device, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiment and all such alterations and modifications as fall within the scope of the embodiments of the invention.
Finally, it is further noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal device that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal device. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or terminal device that comprises the element.
The video conferencing method and apparatus based on a frame insertion technique provided by the present invention have been described in detail above. Specific examples have been used herein to illustrate the principles and embodiments of the present invention, and the description of the above examples is intended only to assist in understanding the method and core idea of the present invention. Meanwhile, those skilled in the art may, following the idea of the present invention, make changes to the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.