CN120812412A

CN120812412A - Scene multi-focal-length target video coding method

Info

Publication number: CN120812412A
Application number: CN202511018605.8A
Authority: CN
Inventors: 董良; 彭熇磊; 翟青林
Original assignee: Beijing Colink Digital Technology Co ltd
Current assignee: Beijing Colink Digital Technology Co ltd
Priority date: 2025-07-23
Filing date: 2025-07-23
Publication date: 2025-10-17

Abstract

The invention provides a scene multi-focal-length target video coding method that enhances the definition of multiple specific targets in a video recording system. The method comprises: determining the key targets that require focused recognition; setting up one zoom video recording system and a plurality of fixed-focus video recording systems; matching the same key targets between the zoom video frame and each fixed-focus video frame according to image data and position information; and fusing the highest-definition image data acquired for each key target into the position of that key target in the zoom video frame. Through the cooperation of the zoom and fixed-focus systems, the specific targets are cropped from both, the highest-definition version of each target is determined algorithmically, and that target data is fused into the frames of the zoom video stream, so that multiple targets in the zoom video stream remain in high definition.

Description

Scene multi-focal-length target video coding method

Technical Field

The invention relates to the technical field of video coding, in particular to a scene multi-focal-length target video coding method.

Background

When a conventional video recording system records video, targets located at several different focal positions are difficult to display clearly in the video stream at the same time. For example, key elements of a sports match such as athletes, referees and scoreboards are scattered at different positions and therefore can hardly all be sharp in one picture. Conventional systems are thus limited in use.

Disclosure of Invention

Aiming at the defects existing in the prior art, the invention provides a scene multi-focal-length target video coding method which can effectively solve the problems.

The technical scheme adopted by the invention is as follows:

the invention provides a scene multi-focal-length target video coding method, which comprises the following steps:

Step S1, determining the key targets requiring focused recognition, and acquiring a key target feature representation vector for each key target, the feature representation vectors together forming a key target recognition feature matrix F = {F_1, F_2, ..., F_m}, wherein F_j represents the feature representation vector of key target j, j = 1, 2, ..., m, and m is the number of key targets;

Step S2, setting up one zoom video system and n fixed-focus video systems facing the shooting scene according to the shooting scene requirements, wherein each fixed-focus video system is configured with a different shooting angle and focal length position;

Step S3, simultaneously starting the zoom video system and the n fixed-focus video systems for collaborative shooting, and acquiring, at the same acquisition time t, a zoom video frame S_t and a fixed-focus video frame set {C_{1,t}, C_{2,t}, ..., C_{n,t}}, wherein C_{i,t} represents the fixed-focus video frame acquired by fixed-focus video system i at acquisition time t, i = 1, 2, ..., n;

Step S4, identifying the m key targets in the zoom video frame S_t based on the key target recognition feature matrix F = {F_1, F_2, ..., F_m}, and acquiring the image data and position information of each identified key target j in the zoom video frame S_t;

performing key target recognition on each fixed-focus video frame C_{i,t} based on the key target recognition feature matrix F = {F_1, F_2, ..., F_m}, recognizing k_{i,t} key targets, and acquiring the image data and position information of each recognized key target l in the fixed-focus video frame C_{i,t}, wherein l = 1, 2, ..., k_{i,t};

Step S5, matching the same key targets in the zoom video frame S_t and each fixed-focus video frame C_{i,t} according to their image data and position information;

for each key target j, comparing the image definition of key target j in each fixed-focus video frame C_{i,t} and in the zoom video frame S_t; if the image definition of key target j is highest in the zoom video frame S_t, no image fusion processing is performed for that target;

Step S6, fusing the acquired highest-definition image data of key target j into the position of the same key target in the zoom video frame S_t to obtain a fused zoom video frame S_t;

Step S7, executing steps S5 to S6 for each of the m key targets in the zoom video frame S_t, and then performing video compression coding on the processed zoom video frame S_t to obtain a multi-focal-length fused, encoded zoom video frame S_t;

Step S8, outputting the encoded zoom video frame S_t, setting t = t + 1, and returning to step S3.
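The per-target control flow of steps S5 to S7 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `enhance_frame`, the `sharpness` and `fuse` callables, and the dictionary layout of already-matched crops are hypothetical names introduced here, and the matching of step S5 is assumed to have been done beforehand.

```python
from typing import Any, Callable, Dict, List


def enhance_frame(
    zoom_frame: Any,
    zoom_crops: Dict[str, Any],
    fixed_crops: Dict[str, List[Any]],
    sharpness: Callable[[Any], float],
    fuse: Callable[[Any, str, Any], Any],
) -> Any:
    """Fuse, for each key target, the sharpest fixed-focus crop into the zoom frame.

    zoom_crops  : target id -> crop of that target in the zoom frame (hypothetical layout)
    fixed_crops : target id -> crops of the same target from the fixed-focus frames
    """
    out = zoom_frame
    for target_id, zoom_crop in zoom_crops.items():
        candidates = fixed_crops.get(target_id, [])
        if not candidates:
            continue  # target not seen by any fixed-focus camera
        best = max(candidates, key=sharpness)
        # Fuse only when a fixed-focus crop beats the zoom crop;
        # otherwise the zoom frame is left untouched for this target.
        if sharpness(best) > sharpness(zoom_crop):
            out = fuse(out, target_id, best)
    return out
```

The encoded output of step S7 would then be produced by passing the returned frame to whatever video encoder is in use.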

Preferably, in step S1, the feature representation vector of key target j is F_j = {F_{j,visual}, F_{j,spatial}, F_{j,temporal}};

wherein:

F_{j,visual} is the appearance feature vector of key target j, obtained by extracting the color histogram, texture features and shape description features of key target j;

F_{j,spatial} is the position feature vector of key target j, describing the position distribution and movement trajectory pattern of key target j in the picture;

F_{j,temporal} is the time feature vector of key target j, describing the temporal pattern and duration statistics of key target j.

Preferably, the zoom video recording system is used for shooting the whole scene when shooting, and dynamically adjusting the focal length according to the scene requirement to obtain the panoramic video stream containing all key targets.

Preferably, each fixed-focus video recording system i is configured with a shooting angle θ_i and a specific fixed focal length f_i, both of which remain unchanged while the scene is shot, wherein the shooting angle θ_i is measured relative to the zoom video recording system.

Preferably, the image data and position information of key target j in the zoom video frame S_t are represented as E_{s,j,t} = {T_{s,j,t}, P_{s,j,t}, B_{s,j,t}}, wherein T_{s,j,t} represents the image content matrix of key target j in the zoom video frame S_t at acquisition time t, obtained by cropping the image of key target j from the zoom video frame S_t; P_{s,j,t} represents the position coordinate vector of the center of key target j in the zoom video frame S_t at acquisition time t; and B_{s,j,t} represents the bounding box coordinate vector of the minimum circumscribed rectangle of key target j in the zoom video frame S_t at acquisition time t;

the image data and position information of each identified key target l in the fixed-focus video frame C_{i,t} are represented as E_{i,l,t} = {T_{i,l,t}, P_{i,l,t}, B_{i,l,t}}, wherein T_{i,l,t} represents the image content matrix of key target l in the fixed-focus video frame C_{i,t} at acquisition time t, obtained by cropping the image of key target l from the fixed-focus video frame C_{i,t}; P_{i,l,t} represents the position coordinate vector of the center of key target l in the fixed-focus video frame C_{i,t} at acquisition time t; and B_{i,l,t} represents the bounding box coordinate vector of the minimum circumscribed rectangle of key target l in the fixed-focus video frame C_{i,t} at acquisition time t.

Preferably, step S4 further includes:

For each key target j identified in the zoom video frame S_t and each key target l identified in each fixed-focus video frame C_{i,t}, a confidence evaluation function is used to calculate the cosine similarity Confidence between the features of the identified target and the feature representation vector of the corresponding key target from step S1; the identification is considered valid only when Confidence > τ_threshold, wherein τ_threshold is set in the range 0.7 to 0.9.

Preferably, in step S5, matching the same key targets in the zoom video frame S_t and each fixed-focus video frame C_{i,t} according to their image data and position information specifically comprises:

Step S5.1, calculating the appearance similarity Sim_visual(j, l) between key target j and key target l from the image content matrix T_{s,j,t} of key target j in the zoom video frame S_t at acquisition time t and the image content matrix T_{i,l,t} of key target l in the fixed-focus video frame C_{i,t} at acquisition time t;

Step S5.2, calculating the position similarity Sim_spatial(j, l) between key target j and key target l from the position coordinate vector P_{s,j,t} of key target j in the zoom video frame S_t at acquisition time t and the position coordinate vector P_{i,l,t} of key target l in the fixed-focus video frame C_{i,t} at acquisition time t:

Sim_spatial(j, l) = exp(-||P_{s,j,t} - transform(P_{i,l,t}, M_i)||² / (2σ²))

wherein transform(P_{i,l,t}, M_i) is a transformation function that converts P_{i,l,t} from the coordinate system of fixed-focus video system i into the coordinate system of the zoom video system; M_i is a 3×3 transformation matrix obtained through camera calibration of fixed-focus video system i; and σ is a position tolerance parameter set to 10% to 20% of the size of the image of key target l cropped from the fixed-focus video frame C_{i,t};

Step S5.3, calculating the comprehensive similarity between key target j and key target l using the following formula:

Sim(j, l) = α·Sim_visual(j, l) + β·Sim_spatial(j, l)

wherein α and β are the appearance similarity weight and the position similarity weight respectively, α + β = 1, α > 0 and β > 0; β is increased for scenes in which the target appearance changes greatly, and α is increased for scenes in which the target appearance is relatively stable;

Step S5.4, if the comprehensive similarity Sim(j, l) between key target j and key target l is greater than the matching threshold θ_match, key target j and key target l are successfully matched and are the same key target.
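Steps S5.2 to S5.4 can be illustrated with a short NumPy sketch. The function names are hypothetical, and treating `transform` as a homogeneous-coordinate mapping through the 3×3 matrix M_i is an assumption; the text only states that M_i comes from camera calibration. The appearance similarity is taken as a given number here, and no default is supplied for α or θ_match because the text does not fix their values.

```python
import numpy as np


def to_zoom_coords(p_fixed, M_i):
    """Map a 2-D point through a 3x3 matrix using homogeneous coordinates (assumed form of transform)."""
    x, y, w = np.asarray(M_i, dtype=float) @ np.array([p_fixed[0], p_fixed[1], 1.0])
    return np.array([x / w, y / w])


def sim_spatial(p_zoom, p_fixed, M_i, sigma):
    """Position similarity: exp(-||P_s - transform(P_i, M_i)||^2 / (2 sigma^2))."""
    diff = np.asarray(p_zoom, dtype=float) - to_zoom_coords(p_fixed, M_i)
    return float(np.exp(-np.sum(diff ** 2) / (2.0 * sigma ** 2)))


def sim_combined(sim_visual, sim_sp, alpha):
    """Comprehensive similarity with beta = 1 - alpha, so that alpha + beta = 1."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    return alpha * sim_visual + (1.0 - alpha) * sim_sp


def is_same_target(sim, theta_match):
    """Step S5.4: a pair matches when the comprehensive similarity exceeds theta_match."""
    return sim > theta_match
```

For example, with an identity M_i, centers 5 pixels apart and σ = 5, `sim_spatial` returns exp(-0.5), which `sim_combined` then blends with the appearance similarity.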

Preferably, for each key target j, the image definition of key target j in each fixed-focus video frame C_{i,t} and in the zoom video frame S_t is compared as follows:

for each video frame, a definition quantization score is obtained in the following manner, and image definition is compared on the basis of that score:

Step A, the image requiring a definition quantization score is called the target image T, with a width of M pixels and a height of N pixels; the target image T is converted from the spatial domain to the frequency domain by a two-dimensional Fourier transform,

wherein:

T(x, y) represents the pixel value at position (x, y) in the spatial domain, j is the imaginary unit in the transform (not the target index), F(u, v) represents the complex coefficient in the frequency domain, and position (x, y) in the spatial domain corresponds to position (u, v) in the frequency domain;

Step B, for the complex coefficients F(u, v) in the frequency domain, a filtering operation is performed that retains high-frequency components and filters out low-frequency components, giving a filter mask H(u, v):

H(u, v) = 1 if the frequency at (u, v) exceeds ω_cutoff; otherwise H(u, v) = 0;

wherein ω_cutoff is the cutoff frequency threshold;

Step C, the filter mask H(u, v) is multiplied element by element with the complex coefficients F(u, v) to realize frequency-domain filtering and extract the high-frequency information of the image, giving the high-frequency representation F_high(u, v):

F_high(u, v) = F(u, v) ⊙ H(u, v)

wherein ⊙ denotes the element-wise product;

Step D, a definition quantization score Sharpness_score(T) is computed from F_high(u, v), wherein U and V are the numbers of rows and columns in the frequency domain, respectively.
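The formulas for steps A, B and D are not reproduced in this text. A plausible reconstruction, consistent with the symbols defined above and with the embodiment's description of the score as the proportion of high-frequency energy to total energy, is given below. The unnormalized form of the transform, the radial frequency measure (with u and v taken relative to the zero-frequency term), and the exact form of the score are assumptions, not confirmed formulas of the patent.

```latex
% Step A: 2-D discrete Fourier transform of the M x N target image (assumed unnormalized)
F(u,v) = \sum_{x=0}^{M-1}\sum_{y=0}^{N-1} T(x,y)\, e^{-j 2\pi \left(\frac{ux}{M} + \frac{vy}{N}\right)}

% Step B: ideal high-pass mask (assumed radial cutoff)
H(u,v) = \begin{cases} 1, & \sqrt{u^{2}+v^{2}} > \omega_{\mathrm{cutoff}} \\ 0, & \text{otherwise} \end{cases}

% Step D: share of spectral energy above the cutoff (assumed form)
\mathrm{Sharpness\_score}(T) =
  \frac{\sum_{u=0}^{U-1}\sum_{v=0}^{V-1} \lvert F_{\mathrm{high}}(u,v)\rvert^{2}}
       {\sum_{u=0}^{U-1}\sum_{v=0}^{V-1} \lvert F(u,v)\rvert^{2}}
```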

Preferably, in step S6, fusing the acquired highest-definition image data of key target j into the position of the same key target in the zoom video frame S_t to obtain the fused zoom video frame S_t specifically comprises:

Step S6.1, aligning the center coordinates of the acquired highest-definition image data of key target j with the center coordinates of the same key target in the zoom video frame S_t; if the two do not match in size, resizing the image data by bilinear interpolation; the image data then replaces the same key target in the zoom video frame S_t;

Step S6.2, performing color and brightness correction on the image data of the key target embedded in the zoom video frame S_t, so that its color and brightness distribution is consistent with that of the zoom video frame S_t, to obtain first image data;

Step S6.3, calculating a weight w(p) for each pixel p of the first image data, wherein σ is a parameter controlling transition smoothness and is set to 1/6 to 1/4 of the size of the first image data;

Step S6.4, fusing each pixel p of the first image data, according to its weight w(p), with the pixel value S_region(p) at the same position p in the region of the same key target in the zoom video frame S_t, to obtain the fused image I_fused;
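The weight and fusion formulas of steps S6.3 and S6.4 are not reproduced in this text. Based on the embodiment's description of Gaussian-weighted fusion (the high-definition image used fully at the target center, with a gradual transition to the original image at the edges), one plausible form is sketched below. The symbol T' for the first image data and c for its center are labels introduced here for illustration; the exact formulas are an assumption.

```latex
% Step S6.3: Gaussian weight, equal to 1 at the center c and decaying towards the edges
w(p) = \exp\!\left(-\frac{\lVert p - c \rVert^{2}}{2\sigma^{2}}\right)

% Step S6.4: per-pixel blend of the first image data T' with the original region
I_{\mathrm{fused}}(p) = w(p)\, T'(p) + \bigl(1 - w(p)\bigr)\, S_{\mathrm{region}}(p)
```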

Step S6.5, performing boundary smoothing on the fused image I_fused using the following formula to obtain the boundary-smoothed image I_fused':

I_fused' = I_fused * G_smooth

wherein * denotes convolution and G_smooth is a smoothing filter, G_smooth = (1/16)·[1,2,1; 2,4,2; 1,2,1];

Step S6.6, re-embedding the boundary-smoothed image I_fused' into the target area of the zoom video frame S_t to obtain the finally fused zoom video frame S_enhanced(t):

S_enhanced(t) = S_t + Mask ⊙ (I_fused' - S_region)

wherein Mask is the mask matrix of the target area in which the same key target is located in the zoom video frame S_t, S_region is that target area in the zoom video frame S_t, and S_t is the zoom video frame;
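Steps S6.5 and S6.6 can be sketched in NumPy as follows. This is an illustrative sketch for a single-channel image: the function names, the edge-replication border handling, and the `top`/`left` window arguments are assumptions introduced here, since the text does not specify them.

```python
import numpy as np

# Smoothing filter from step S6.5: (1/16) * [1,2,1; 2,4,2; 1,2,1]
G_SMOOTH = np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]], dtype=float) / 16.0


def smooth(img):
    """Convolve a 2-D image with G_SMOOTH, replicating edge pixels (border handling is an assumption)."""
    img = np.asarray(img, dtype=float)
    h, w = img.shape
    padded = np.pad(img, 1, mode="edge")
    out = np.zeros((h, w), dtype=float)
    # The kernel is symmetric, so correlation and convolution give the same result.
    for dy in range(3):
        for dx in range(3):
            out += G_SMOOTH[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out


def embed(frame, fused_smoothed, mask, top, left):
    """Apply S_enhanced = S_t + Mask * (I_fused' - S_region) on the target window."""
    out = np.asarray(frame, dtype=float).copy()
    h, w = fused_smoothed.shape
    region = out[top:top + h, left:left + w]
    out[top:top + h, left:left + w] = region + mask * (fused_smoothed - region)
    return out
```

With a binary mask, pixels where Mask is 1 take the smoothed fused value and all other pixels keep the original zoom-frame value.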

Step S6.7, performing video compression coding on the finally fused zoom video frame S_enhanced(t), and outputting the result.

The scene multi-focal-length target video coding method provided by the invention has the following advantages:

The method enhances the definition of multiple specific targets in a video recording system. A zoom video recording system and fixed-focus video recording systems work cooperatively; the specific targets are cropped from both the fixed-focus and the zoom systems, the highest-definition version of each target is determined algorithmically, and that target data is fused into the frames of the zoom video stream, so that multiple targets in the zoom video stream remain in high definition.

Drawings

Fig. 1 is a flowchart of a method for encoding a scene multi-focal-length target video.

Fig. 2 is a flowchart of an embodiment of a method for encoding a scene multi-focal-length target video according to the present invention.

Detailed Description

In order to make the technical problems, technical schemes and beneficial effects solved by the invention more clear, the invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.

Referring to Fig. 1 and Fig. 2, the invention provides a scene multi-focal-length target video coding method comprising steps S1 to S8 described above.

Specifically, the feature representation vector of key target j is F_j = {F_{j,visual}, F_{j,spatial}, F_{j,temporal}}, wherein F_{j,visual} is the appearance feature vector of key target j, obtained by extracting the color histogram, texture features and shape description features of key target j; F_{j,spatial} is the position feature vector of key target j, describing the position distribution and movement trajectory pattern of key target j in the picture; and F_{j,temporal} is the time feature vector of key target j, describing the temporal pattern and duration statistics of key target j.

During shooting, the zoom video recording system shoots the whole scene and dynamically adjusts its focal length according to scene requirements, to obtain a panoramic video stream containing all key targets.

Each fixed-focus video recording system i is configured with a shooting angle θ_i and a specific fixed focal length f_i, both of which remain unchanged while the scene is shot, wherein the shooting angle θ_i is measured relative to the zoom video recording system.

Specifically, the image data and position information of key target j in the zoom video frame S_t are represented as E_{s,j,t} = {T_{s,j,t}, P_{s,j,t}, B_{s,j,t}}, wherein T_{s,j,t} represents the image content matrix of key target j in the zoom video frame S_t at acquisition time t, obtained by cropping the image of key target j from the zoom video frame S_t; P_{s,j,t} represents the position coordinate vector of the center of key target j in the zoom video frame S_t at acquisition time t; and B_{s,j,t} represents the bounding box coordinate vector of the minimum circumscribed rectangle of key target j in the zoom video frame S_t at acquisition time t.


The scene multi-focal-length target video coding method combines one zoom video recording system with a plurality of fixed-focus video recording systems working cooperatively. The shooting angle and focal length position of each fixed-focus video recording system are set according to preset scene requirements so as to capture target details in different depth-of-field ranges; the zoom system and the fixed-focus systems together yield one zoom video stream and multiple fixed-focus video streams. Using artificial intelligence image recognition, the target areas corresponding to the target pictures or names predefined by the user are automatically identified and cropped from the video streams. A correspondence between specific targets in the zoom video stream and in the fixed-focus video streams is established through a position information and feature matching algorithm. For each target, the algorithm traverses all instances of the target in the related video streams and selects the highest-definition target image data by transforming to the frequency domain and applying high-pass filtering. That image data then replaces the corresponding target position in the zoom video stream, and the replaced area is smoothed, so that all key targets in the resulting zoom video stream reach their best definition. Finally, the optimized zoom video stream is compressed using video compression coding.

One embodiment is described below:

(1) Predefining an object identification database to be enhanced, wherein the object to be enhanced can be a photo, a name or a characteristic description;

Specifically, the system constructs a database containing all important target feature information, including elements such as optical characteristics, spatial positions, time sequence relations and the like, to form a complete target feature library.

Specifically, the user predefines a set of important targets O = {O_1, O_2, ..., O_m} whose definition is to be kept high in the video, where m is the total number of targets. This predefinition solves the technical problem that conventional video recording systems cannot distinguish important targets from the background. A target may be specified in three flexible ways:

providing a reference picture of the target: a standard image of the target is uploaded directly, which suits fixed targets of known appearance;

inputting the name of the target, such as "athlete", "referee" or "scoreboard": the system calls a pre-trained semantic recognition model;

describing the features of the target, such as "person wearing a red jersey": this supports flexible attribute-based description.

To achieve accurate recognition, the system builds a multidimensional feature vector for each target, which is one of the key technical innovations of the present invention. The feature vector contains three types of complementary information:

appearance feature vector: visual characteristics of the target such as the color histogram, texture features and shape description, typically 512 to 1024 dimensions;

position feature vector: the usual position distribution and movement trajectory pattern of the target in the picture, typically 64 to 128 dimensions;

time feature vector: the temporal pattern of the target's appearance and its duration statistics, typically 32 to 64 dimensions.

The complete characteristic representation of each target adopts a vector splicing mode, namely, an appearance characteristic vector, a position characteristic vector and a time characteristic vector are spliced, and the total dimension of the spliced characteristic vector is generally 600-1200 dimensions.

The feature files of all targets are combined into a target recognition database matrix which is used as a lookup table of a subsequent recognition algorithm to support rapid similarity calculation and target matching.
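The concatenation and database construction described above can be illustrated as follows. This is a minimal sketch: the function names are hypothetical, the sub-vectors are placeholders, and the dimensions in the usage are simply the lower ends of the ranges quoted in the text.

```python
import numpy as np


def build_feature_vector(visual, spatial, temporal):
    """Splice appearance, position and time feature vectors into one target descriptor."""
    return np.concatenate([np.ravel(visual), np.ravel(spatial), np.ravel(temporal)])


def build_database(feature_vectors):
    """Stack m target descriptors of equal length into an m x d lookup matrix."""
    return np.stack(feature_vectors)


# Usage with placeholder data: 512 + 64 + 32 = 608 dimensions per target.
example = build_feature_vector(np.zeros(512), np.zeros(64), np.zeros(32))
database = build_database([example, example, example])
```

Storing the descriptors as rows of one matrix lets a detected feature be compared against all targets in a single matrix-vector product.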

(2) The zoom video recording system and the plurality of fixed-focus video recording systems are started simultaneously to record, and a frame S1 of the zoom video stream and a frame set {C1, C2, ..., Cn} of the fixed-focus video streams are acquired at the same moment.

Through the collaborative recording of multiple cameras, the technical problem that a single camera can not obtain a plurality of different-distance target clear images simultaneously is solved, and the combination of panoramic coverage and local high definition is realized through the collaborative work of the multiple cameras.

The system adopts a camera configuration strategy of 'one main camera and multiple auxiliary cameras', and simultaneously starts two types of video equipment to perform cooperative work:

The zoom camera Z serves as the main camera and is responsible for shooting the whole scene; its focal length can be dynamically adjusted according to scene requirements to obtain a panoramic video stream containing all targets. Its advantage is a large field of view that captures complete scene information; its drawback is limited definition for distant targets.

The set of n fixed-focus cameras F = {F_1, F_2, ..., F_n} serves as the auxiliary cameras; each is pre-fixed at a specific focal length and shooting angle and is dedicated to shooting targets within a specific distance range. This design ensures that, at each distance level, a dedicated camera provides a high-definition target image.

At time t, the system achieves strict time synchronization, and synchronously acquires a main video stream frame from the zoom camera and an auxiliary video stream frame set from each fixed-focus camera. Different cameras may have different resolution configurations.

Time synchronization is achieved by synchronizing all cameras through a hardware synchronization signal or the Network Time Protocol (NTP), keeping the time difference between frames below 1/60 second and avoiding target position deviation caused by desynchronization.
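The synchronization tolerance can be checked per acquisition time with a few lines of Python. This sketch assumes that each camera reports a capture timestamp in seconds and interprets "time difference between frames" as the spread between the earliest and latest timestamp in one frame set; the function name is hypothetical.

```python
def frames_synchronized(timestamps, max_skew=1.0 / 60.0):
    """Return True when all capture timestamps (seconds) lie within max_skew of each other."""
    if not timestamps:
        raise ValueError("at least one timestamp is required")
    return max(timestamps) - min(timestamps) < max_skew
```

A frame set that fails this check could be dropped or re-aligned before matching, since a position offset caused by motion between captures would lower the position similarity.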

The camera parameter configuration matrix is P_camera = [f_1, θ_1; f_2, θ_2; ...; f_n, θ_n], where f_i is the focal length of the i-th fixed-focus camera (in mm) and θ_i is its shooting angle relative to the main camera (in degrees). The matrix records the geometric configuration of the whole camera array and is used for subsequent coordinate transformation and target matching.

(3) Using artificial intelligence recognition techniques, the preset key targets are identified in frame S1 and in the frame set {C1, C2, ..., Cn}, and the image data and precise position information of each identified key target are recorded.

Specifically, the system uses an artificial intelligent image recognition algorithm AI_Detector based on a convolutional neural network, and the algorithm combines the target detection and feature matching technology to automatically find and recognize preset important targets in video streams of all cameras. The algorithm has the technical advantage that the complex conditions of scale change, illumination change, partial shielding and the like of the target can be processed.

The specific identification process adopts a parallel processing architecture:

the system takes the target identification database matrix phi established in the first step as a reference standard, and loads the reference standard into the GPU memory so as to improve the query speed;

each frame of the main video stream S(t) is scanned in real time, and all preset targets are identified using sliding-window and multi-scale detection techniques;

and meanwhile, the same identification process is carried out on pictures in each auxiliary video stream, and the identification tasks of all cameras are executed in parallel, so that the processing efficiency is improved.

For each identified target, the system will accurately record the triplet (T, P, B):

T is the image content matrix of the target, obtained by cropping the target area from the original image while keeping the original pixel values; P is the coordinate vector of the center position of the target in the picture, in pixels; and B is the bounding box coordinate vector of the target, defining the minimum rectangular area containing the target, with the origin at the upper-left corner of the image.

In order to ensure the identification accuracy, a confidence evaluation mechanism is introduced into the system, specifically, a confidence evaluation function is adopted, the cosine similarity between the detected target feature and the target feature in the database is calculated by the function, and the closer the value is to 1, the higher the matching degree is.

An identification is considered valid only if Confidence > τ_threshold. The threshold τ_threshold is typically set to 0.7 to 0.9 and can be adjusted according to the accuracy requirements of the application scenario.
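The confidence gate can be sketched as follows. The function names are hypothetical, the default threshold of 0.8 is simply a value inside the 0.7 to 0.9 range quoted above, and picking the best-scoring database row is an assumption about how a detection is assigned to a target.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors; 0.0 if either vector is all zeros."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def identify(detected, database, tau_threshold=0.8):
    """Return (target index, confidence) for the best database row, or None if below threshold."""
    scores = [cosine_similarity(detected, row) for row in database]
    j = int(np.argmax(scores))
    return (j, scores[j]) if scores[j] > tau_threshold else None
```

For instance, a detection that is equally similar to two targets at about 0.71 is rejected under the default threshold rather than being assigned arbitrarily.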

Quality control of identification results: the system also records quality indexes for each identification result, including the completeness of the target (whether it is occluded), the image definition and the illumination conditions, as a basis for subsequent target selection.

(4) According to the image data and position information, the same key targets are matched between the zoom video frame S1 and the set of fixed-focus video frames {C1, C2, ..., Cn}.

The method solves the key technical problem of how to accurately identify whether the targets shot by different cameras are the same target in the multi-camera system. This is a precondition for achieving the target sharpness improvement.

The same target may appear in the main video stream and in several auxiliary video streams at the same time, yet look different in each camera because of different shooting angles, distances and illumination conditions. The system therefore needs to establish the correspondence between targets and determine which detections are different views of the same target. The technical innovation of the matching algorithm is that it considers information of multiple dimensions together, avoiding the limitations of matching on a single feature. The match decision is based on a weighted combination of two main factors:

First, the appearance similarity is calculated by comparing the visual feature vectors of the two target images, covering color distribution, texture pattern, shape outline and the like. Cosine similarity is insensitive to changes in image brightness and is therefore suitable for matching targets under different illumination conditions.

Next, the position similarity is calculated; this function takes into account the spatial position relationship of the target across different cameras.

Finally, the appearance similarity and the position similarity are combined by weighted summation to obtain the comprehensive similarity.

(5) For each matched key target, the image is transformed from the spatial domain to the frequency domain. A high-pass filter is applied that passes only frequency information above a preset threshold, so as to extract the high-frequency components. The target F1 with the highest definition is selected by comparing the intensity and quantity of the high-frequency components of the different key target images.

Specifically, in the present invention, for each matched target, the system needs to select the version with the highest definition from the main video stream and each auxiliary video stream. The principle of sharpness evaluation is that a sharp image contains more high-frequency components and a blurred image contains fewer. The sharpness evaluation uses frequency domain analysis, which has the advantages of being objective and accurate and of not being affected by subjective human visual perception.

Further, a quantitative sharpness score is used: the ratio of high-frequency energy to total energy is calculated, and the larger the value, the sharper the image. Normalizing by the denominator ensures comparability between images of different sizes.

In order to avoid noise interference, the difference between the highest score and the second-highest score can be checked; when the difference is smaller than a threshold value, the decision is made by also taking other quality indexes (such as contrast and saturation) into account. In this way the image version with the highest definition is selected for each key target, providing the best material for subsequent image fusion.
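The scoring and the ambiguity check described above can be sketched as follows. This is a minimal illustration under stated assumptions: the cutoff is expressed as a radial frequency in cycles per pixel, and the default values of `cutoff` and `margin` are hypothetical; the function names are not from the invention.

```python
import numpy as np

def sharpness_score(image, cutoff=0.25):
    """Share of spectral energy above a radial cutoff (cycles per pixel)."""
    img = np.asarray(image, dtype=float)
    energy = np.abs(np.fft.fft2(img)) ** 2          # energy of each frequency bin
    fy = np.fft.fftfreq(img.shape[0])[:, None]      # vertical frequencies
    fx = np.fft.fftfreq(img.shape[1])[None, :]      # horizontal frequencies
    radius = np.sqrt(fx ** 2 + fy ** 2)
    total = energy.sum()
    if total == 0.0:
        return 0.0                                  # all-black image
    return float(energy[radius > cutoff].sum() / total)

def pick_sharpest(scores, margin=0.02):
    """Return (index of best score, True if the runner-up is within margin)."""
    s = np.asarray(scores, dtype=float)
    order = np.argsort(s)[::-1]                     # indices, best first
    best = int(order[0])
    ambiguous = len(s) > 1 and (s[order[0]] - s[order[1]]) < margin
    return best, bool(ambiguous)

# Example: a noisy texture versus a 2x2 box-blurred copy of it.
rng = np.random.default_rng(0)
sharp = rng.random((32, 32))
blurred = (sharp + np.roll(sharp, 1, 0) + np.roll(sharp, 1, 1)
           + np.roll(np.roll(sharp, 1, 0), 1, 1)) / 4.0
score_sharp = sharpness_score(sharp)
score_blurred = sharpness_score(blurred)
```

When `pick_sharpest` reports an ambiguous result, a caller would fall back to the additional quality indexes (contrast, saturation) mentioned in the text; that fallback is not shown here.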

(6) The image data of the target F1 with the highest definition is fused to the corresponding position in the zoom video frame S1. In the fusion process, a smoothing algorithm is applied to the target edge area to ensure a natural transition and consistency between F1 and the background of S1.

Specifically, in the invention, high-definition target images from different sources are seamlessly fused into the main video stream, so that the high definition of the target is maintained while the naturalness and consistency of the whole picture are ensured. To keep the fused image natural and smooth, a weighted fusion technique based on Gaussian weights is adopted. Its core idea is that the high-definition image is used in full in the central area of the target and gradually gives way to the original image toward the edge area, which avoids a hard boundary.
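A minimal sketch of such a Gaussian-weighted blend is given below. It assumes that the weight of a pixel is exp(−d²/(2σ²)), where d is the distance to the patch center, and that σ is a fraction of the shorter patch side; both are assumptions made for illustration, and the function names are hypothetical.

```python
import numpy as np

def gaussian_weights(height, width, sigma_ratio=0.25):
    """Weight map that is 1 at the patch center and decays toward the edges."""
    ys = np.arange(height)[:, None] - (height - 1) / 2.0
    xs = np.arange(width)[None, :] - (width - 1) / 2.0
    sigma = sigma_ratio * min(height, width)     # assumed: fraction of shorter side
    return np.exp(-(xs ** 2 + ys ** 2) / (2.0 * sigma ** 2))

def fuse_patch(sharp_patch, zoom_region, sigma_ratio=0.25):
    """Blend the sharp patch into the zoom-frame region of the same shape."""
    sharp = np.asarray(sharp_patch, dtype=float)
    base = np.asarray(zoom_region, dtype=float)
    if sharp.shape != base.shape:
        raise ValueError("patch and region must have the same shape")
    w = gaussian_weights(sharp.shape[0], sharp.shape[1], sigma_ratio)
    if sharp.ndim == 3:
        w = w[:, :, None]                        # broadcast over color channels
    return w * sharp + (1.0 - w) * base          # center: sharp, edges: original

# Example: a bright sharp patch blended onto a darker zoom-frame region.
fused = fuse_patch(np.full((9, 9), 200.0), np.full((9, 9), 100.0))
```

At the center the result equals the sharp patch, and toward the corners it approaches the original zoom-frame values, so no hard boundary appears.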

(7) The fused zoom video frame S1 is processed with a video stream compression coding method to reduce the size of the video file and improve transmission efficiency. The compressed video stream is then output; the definition of the multiple key targets in the video stream is significantly improved, while the overall perceived visual quality and smooth playback are maintained.

This step is the final link of the system: on the premise of keeping the enhancement effect, efficient compression coding is performed using H.264 or H.265 video coding, as appropriate.

The invention provides a scene multi-focal-length target video coding method, which can effectively improve the definition of multiple targets in a video recording system in which a zoom recording system and fixed-focus recording systems work cooperatively.

The foregoing is merely a preferred embodiment of the present invention. It should be noted that those skilled in the art may make modifications and adaptations without departing from the principles of the present invention, and such modifications and adaptations are also intended to fall within the scope of protection of the present invention.

Claims

1. A scene multi-focal-length target video encoding method, characterized in that it comprises the following steps:

Step S1: determining the key targets that need to be identified, obtaining the key target feature representation vector of each key target, and forming a key target recognition feature matrix F = {F_1, F_2, ..., F_m}; wherein F_j represents the feature representation vector of key target j; j = 1, 2, ..., m; and m is the number of key targets;

Step S2: according to the requirements of the shooting scene, setting a zoom recording system and n fixed-focus recording systems facing the shooting scene; wherein each of the fixed-focus recording systems is configured with a different shooting angle and focal length position, and the shooting angle of each of the fixed-focus recording systems covers the shooting scene;

Step S3: simultaneously starting the zoom recording system and the n fixed-focus recording systems to perform collaborative shooting with multiple recording systems; at the same acquisition time t, simultaneously obtaining a zoom camera video frame S_t and a fixed-focus camera video frame set {C_{1,t}, C_{2,t}, ..., C_{n,t}}; wherein C_{i,t} represents the fixed-focus camera video frame captured by fixed-focus recording system i at acquisition time t, and i = 1, 2, ..., n;

Step S4: based on the key target recognition feature matrix F = {F_1, F_2, ..., F_m}, identifying the m key targets in the zoom camera video frame S_t, and obtaining the image data and position information of each identified key target j in the zoom camera video frame S_t;

based on the key target recognition feature matrix F = {F_1, F_2, ..., F_m}, performing key target recognition on each fixed-focus camera video frame C_{i,t}, identifying k_{i,t} key targets, and obtaining the image data and position information of each identified key target l in the fixed-focus camera video frame C_{i,t}; wherein l = 1, 2, ..., k_{i,t};

Step S5: matching the same key target in the zoom camera video frame S_t and each of the fixed-focus camera video frames C_{i,t} according to the image data and its position information;

for each key target j, comparing its image clarity in each of the fixed-focus camera video frames C_{i,t} and in the zoom camera video frame S_t; if its image clarity in the zoom camera video frame S_t is the highest, performing no image fusion processing; otherwise, obtaining the highest-definition image data and position information of the key target j from the fixed-focus camera video frames C_{i,t};

Step S6: fusing the acquired highest-definition image data of the key target j to the position of the same key target in the zoom camera video frame S_t to obtain the fused zoom camera video frame S_t;

Step S7: for each of the m key targets in the zoom camera video frame S_t, executing steps S5 to S6, and then performing video compression encoding on the processed zoom camera video frame S_t to obtain the multi-focal-length fused encoded zoom camera video frame S_t;

Step S8: outputting the encoded zoom camera video frame S_t; setting t = t+1; and returning to step S3.
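The per-target decision in step S5 (fuse only when a fixed-focus view is sharper than the zoom view) can be sketched as below. The function name and the dictionary layout are hypothetical; the sketch also assumes that ties are resolved in favor of the zoom frame, which the claim does not state explicitly.

```python
def select_fusion_source(zoom_score, fixed_scores):
    """Return the index i of the fixed-focus camera to fuse from, or None.

    zoom_score   -- clarity score of key target j in the zoom frame S_t
    fixed_scores -- {camera index i: clarity score of the matched target in C_{i,t}}
    """
    if not fixed_scores:
        return None                      # target matched in no fixed-focus frame
    best_i = max(fixed_scores, key=fixed_scores.get)
    if fixed_scores[best_i] > zoom_score:
        return best_i                    # a fixed-focus view is sharper: fuse it
    return None                          # zoom frame is already the sharpest

# Example: camera 2 provides the sharpest view of the target.
source = select_fusion_source(0.30, {1: 0.22, 2: 0.41})
```

A return value of `None` corresponds to the branch in which no image fusion processing is performed for that key target.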

2. The scene multi-focal-length target video encoding method according to claim 1, characterized in that, in step S1, the feature representation vector F_j of the key target j is F_j = {F_{j,visual}, F_{j,spatial}, F_{j,temporal}};

wherein:

F_{j,visual} is the appearance feature vector of the key target j, which is obtained by extracting the color histogram, texture features, and shape description features of the key target j;

F_{j,spatial} is the position feature vector of the key target j, which is used to describe the position distribution and movement trajectory pattern of the key target j in the picture;

F_{j,temporal} is the temporal feature vector of the key target j, which is used to describe the temporal pattern and duration statistics of the key target j.

3. The scene multi-focal-length target video encoding method according to claim 1, characterized in that the zoom recording system is used to capture the entire scene during shooting, dynamically adjusts the focal length according to scene requirements, and obtains a panoramic video stream containing all key targets.

4. The scene multi-focal-length target video encoding method according to claim 1, wherein each fixed-focus recording system i is configured with a shooting angle θ_i and a specific fixed focal length f_i, and when capturing a scene, the shooting angle θ_i and the specific fixed focal length f_i remain unchanged; wherein the shooting angle θ_i is defined relative to the shooting angle of the zoom recording system.

5. The scene multi-focal-length target video encoding method according to claim 1, characterized in that the image data and position information of the key target j in the zoom camera video frame S_t are expressed as: E_{s,j,t} = {T_{s,j,t}, P_{s,j,t}, B_{s,j,t}}; wherein T_{s,j,t} represents the image content matrix of the key target j in the zoom camera video frame S_t at the acquisition time t, obtained by cropping the image of the key target j from the zoom camera video frame S_t; P_{s,j,t} represents the position coordinate vector of the center of the key target j in the zoom camera video frame S_t at the acquisition time t; B_{s,j,t} represents the bounding box coordinate vector of the minimum circumscribed rectangular area of the key target j in the zoom camera video frame S_t at the acquisition time t;

The image data and position information of each identified key target l in the fixed-focus camera video frame C_{i,t} are expressed as: E_{i,l,t} = {T_{i,l,t}, P_{i,l,t}, B_{i,l,t}}; wherein T_{i,l,t} represents the image content matrix of the key target l in the fixed-focus camera video frame C_{i,t} at the acquisition time t, obtained by cropping the image of the key target l from the fixed-focus camera video frame C_{i,t}; P_{i,l,t} represents the position coordinate vector of the center of the key target l in the fixed-focus camera video frame C_{i,t} at the acquisition time t; B_{i,l,t} represents the bounding box coordinate vector of the minimum circumscribed rectangular area of the key target l in the fixed-focus camera video frame C_{i,t} at the acquisition time t.

6. The scene multi-focal-length target video encoding method according to claim 5, wherein step S4 further comprises:

For each key target j identified in the zoom camera video frame S_t, and each key target l identified in each fixed-focus camera video frame C_{i,t}, a confidence evaluation function is used to calculate the cosine similarity Confidence between the identified target and the feature representation vector of the corresponding key target from step S1. The recognition is considered valid only when the cosine similarity Confidence is greater than the threshold τ_threshold; wherein τ_threshold is set to 0.7-0.9.

7. The scene multi-focal-length target video encoding method according to claim 6, characterized in that, in step S5, matching the same key target in the zoom camera video frame S_t and each of the fixed-focus camera video frames C_{i,t} based on the image data and its position information is specifically performed as follows:

Step S5.1: calculating the appearance similarity Sim_visual(j,l) between key targets j and l based on the image content matrix T_{s,j,t} of the key target j in the zoom camera video frame S_t at the acquisition time t and the image content matrix T_{i,l,t} of the key target l in the fixed-focus camera video frame C_{i,t} at the acquisition time t;

Step S5.2: based on the position coordinate vector P_{s,j,t} of the center of the key target j in the zoom camera video frame S_t at the acquisition time t and the position coordinate vector P_{i,l,t} of the center of the key target l in the fixed-focus camera video frame C_{i,t} at the acquisition time t, calculating the position similarity Sim_spatial(j,l) between the key targets j and l:

Sim_spatial(j,l) = exp(−||P_{s,j,t} − transform(P_{i,l,t}, M_i)||² / (2σ²))

where transform(P_{i,l,t}, M_i) represents the transformation function that converts P_{i,l,t} from the coordinate system of the fixed-focus recording system i to the coordinate system of the zoom recording system; M_i is the 3×3 homography transformation matrix obtained by calibrating the camera of the fixed-focus recording system i; σ is the position tolerance parameter, which is set to 10%-20% of the image size of the key target l cropped from the fixed-focus camera video frame C_{i,t};

Step S5.3: the comprehensive similarity between key target j and key target l is calculated using the following formula:

Sim(j,l) = α·Sim_visual(j,l) + β·Sim_spatial(j,l)

where α and β are the appearance similarity weight and the position similarity weight, respectively; α + β = 1 and α, β > 0; for scenes with large changes in target appearance, the value of β is increased; for scenes with relatively stable target appearance, the value of α is increased;

Step S5.4: if the comprehensive similarity Sim(j,l) between the key target j and the key target l is greater than the matching threshold θ_match, the key target j and the key target l are successfully matched and are the same key target.
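Steps S5.2 to S5.4 can be sketched as follows. The sketch treats M_i as a 3×3 projective transformation applied in homogeneous coordinates, which is an assumption about how transform(·) is implemented; the calibration matrix, the weights and the value used for θ_match are hypothetical.

```python
import numpy as np

def transform_point(point_xy, M):
    """Map a pixel coordinate through a 3x3 matrix M in homogeneous coordinates."""
    x, y = point_xy
    v = np.asarray(M, dtype=float) @ np.array([x, y, 1.0])
    return v[:2] / v[2]                       # back to inhomogeneous coordinates

def spatial_similarity(p_zoom, p_fixed, M, sigma):
    """Sim_spatial = exp(-||P_s - transform(P_i, M)||^2 / (2 sigma^2))."""
    d = np.asarray(p_zoom, dtype=float) - transform_point(p_fixed, M)
    return float(np.exp(-np.dot(d, d) / (2.0 * sigma ** 2)))

def combined_similarity(sim_visual, sim_spatial, alpha=0.6):
    """Sim = alpha * Sim_visual + beta * Sim_spatial with beta = 1 - alpha."""
    return alpha * sim_visual + (1.0 - alpha) * sim_spatial

# Hypothetical calibration: the fixed-focus view is shifted by (+40, -10) pixels.
M_i = np.array([[1.0, 0.0, 40.0],
                [0.0, 1.0, -10.0],
                [0.0, 0.0, 1.0]])

sim_sp = spatial_similarity((140.0, 90.0), (100.0, 100.0), M_i, sigma=12.0)
sim = combined_similarity(0.8, sim_sp, alpha=0.6)
theta_match = 0.75                            # hypothetical matching threshold
is_match = sim > theta_match
```

In this example the transformed center lands exactly on the zoom-frame center, so the position similarity is 1 and the pair is accepted as the same key target.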

8. The scene multi-focal-length target video encoding method according to claim 7, characterized in that, for each key target j, its image clarity in each of the fixed-focus camera video frames C_{i,t} and in the zoom camera video frame S_t is compared, specifically:

For each video frame, the following method is used to obtain its clarity quantification score, and the image clarity is compared on the basis of the clarity quantification score:

Step A: The video frame whose clarity is to be quantified is called the target image T, with a width of M pixels and a height of N pixels. The target image T is converted from the spatial domain to the frequency domain using the following formula:

wherein:

T(x,y) represents the pixel value at position (x,y) in the spatial domain; j in the formula is the imaginary unit; F(u,v) represents the complex coefficient in the frequency domain; the position (x,y) in the spatial domain corresponds to the position (u,v) in the frequency domain;

Step B: For the complex coefficients F(u,v) in the frequency domain, the following filtering operation is performed to retain the high-frequency components and filter out the low-frequency components, using the filter mask H(u,v):

If the frequency at position (u,v) is greater than ω_cutoff, then let H(u,v) = 1; otherwise, let H(u,v) = 0;

wherein ω_cutoff is the cutoff frequency threshold;

Step C: Use the following formula to perform element-by-element multiplication of the filter mask H(u,v) and the complex coefficients F(u,v) in the frequency domain to implement frequency domain filtering, extract the high-frequency information in the image, and obtain the high-frequency information representation F_high(u,v):

F_high(u,v) = F(u,v) ⊙ H(u,v)

wherein ⊙ represents the element-by-element product;

Step D: Use the following formula to obtain the sharpness quantification score Sharpness_Score(T):

wherein U and V are the lengths of the rows and columns in the frequency domain, respectively.
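Steps A to D can plausibly be written out as below. This is a reconstruction under stated assumptions and should not be read as the exact wording of the claim: Step A is taken to be the standard unnormalized two-dimensional discrete Fourier transform; Step B is taken to measure frequency as the radial distance of (u,v) from the zero-frequency origin; Step D is taken to be the ratio of high-frequency energy to total energy, following the description of the score as a proportion of energies. Only Step C is given explicitly in the text.

```latex
% Step A (assumed: standard 2D DFT, no normalization factor)
F(u,v) = \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} T(x,y)\,
         e^{-j 2\pi \left( \frac{u x}{M} + \frac{v y}{N} \right)}

% Step B (assumed: radial distance from the zero-frequency origin)
H(u,v) =
\begin{cases}
  1, & \sqrt{u^{2} + v^{2}} > \omega_{\mathrm{cutoff}} \\
  0, & \text{otherwise}
\end{cases}

% Step C (as stated in the claim)
F_{\mathrm{high}}(u,v) = F(u,v) \odot H(u,v)

% Step D (assumed: high-frequency energy divided by total energy)
\mathrm{Sharpness\_Score}(T) =
  \frac{\sum_{u=0}^{U-1} \sum_{v=0}^{V-1} \left| F_{\mathrm{high}}(u,v) \right|^{2}}
       {\sum_{u=0}^{U-1} \sum_{v=0}^{V-1} \left| F(u,v) \right|^{2}}
```

With this form the score lies between 0 and 1, and dividing by the total energy keeps it comparable between images of different sizes.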

9. The scene multi-focal-length target video encoding method according to claim 6, characterized in that, in step S6, the obtained highest-definition image data of the key target j is fused to the position of the same key target in the zoom camera video frame S_t to obtain the fused zoom camera video frame S_t, specifically:

Step S6.1: The acquired highest-definition image data of the key target j is positioned so that its center coordinates coincide with the center coordinates of the same key target in the zoom camera video frame S_t. If the sizes of the two do not match, bilinear interpolation is used to adjust the size, and the image data then replaces the area of the same key target in the zoom camera video frame S_t;

Step S6.2: Color and brightness correction is performed on the image data of the same key target embedded in the zoom camera video frame S_t, so that it matches the color and brightness distribution of the zoom camera video frame S_t, and the first image data is obtained;

Step S6.3: For each pixel point p in the first image data, the following formula is used to calculate its weight w(p):

wherein σ is the parameter that controls the transition smoothness, set to 1/6 to 1/4 of the size of the first image data;

Step S6.4: Using the following formula, the pixels p of the first image data are fused to obtain the fused image I_fused:

wherein the formula takes as inputs the pixel values of the first image data and S_region(p); S_region(p) represents the pixel value of the pixel point p in the same key target region in the zoom camera video frame S_t; I_fused(p) represents the pixel value of the pixel point p in the fused image;

Step S6.5: The fused image I_fused is subjected to boundary smoothing using the following formula to obtain a boundary-smoothed image I_fused":

I_fused" = I_fused * G_smooth

wherein * denotes two-dimensional convolution and G_smooth is the 3×3 smoothing filter, G_smooth = (1/16)·[1, 2, 1; 2, 4, 2; 1, 2, 1];

Step S6.6: Re-embed the boundary-smoothed image I_fused" into the target area of the zoom camera video frame S_t to obtain the final fused zoom camera video frame S_enhanced(t):

S_enhanced(t) = S_t + Mask ⊙ (I_fused" − S_region)

wherein Mask is the target area mask matrix where the same key target is located in the zoom camera video frame S_t;

S_region is the target region where the same key target is located in the zoom camera video frame S_t; S_t represents the zoom camera video frame;

Step S6.7: Perform video compression encoding on the final fused zoom camera video frame S_enhanced(t) and output it.
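Steps S6.5 and S6.6 can be sketched as follows for a single-channel frame. The sketch assumes that border pixels are replicated before convolution and that the mask update is applied only on the target rectangle; neither detail is specified in the claim, and the function names are hypothetical.

```python
import numpy as np

# Smoothing kernel from step S6.5.
G_SMOOTH = np.array([[1, 2, 1],
                     [2, 4, 2],
                     [1, 2, 1]], dtype=float) / 16.0

def smooth3x3(image):
    """Convolve a 2-D image with G_SMOOTH (assumed: edge pixels replicated)."""
    img = np.asarray(image, dtype=float)
    padded = np.pad(img, 1, mode="edge")
    h, w = img.shape
    out = np.zeros_like(img)
    for dy in range(3):                       # kernel is symmetric, so no flip needed
        for dx in range(3):
            out += G_SMOOTH[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out

def embed_region(frame, fused_smooth, top, left, mask):
    """S_enhanced = S_t + Mask * (I_fused'' - S_region) on the target rectangle."""
    out = np.asarray(frame, dtype=float).copy()
    h, w = fused_smooth.shape
    region = out[top:top + h, left:left + w]  # S_region, a view into the copy
    region += mask * (fused_smooth - region)  # mask 1: take fused, mask 0: keep S_t
    return out

# Example: embed a smoothed 2x2 patch into an empty 6x6 frame.
frame = np.zeros((6, 6))
patch = smooth3x3(np.full((2, 2), 50.0))
enhanced = embed_region(frame, patch, top=2, left=2, mask=np.ones((2, 2)))
```

Pixels where the mask is 1 take the smoothed fused value and pixels where it is 0 keep the original zoom-frame value, which is what the S_enhanced(t) formula expresses.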