CN120812412A - Scene multi-focal-length target video coding method - Google Patents
- Publication number
- CN120812412A (CN202511018605.8A)
- Authority
- CN
- China
- Prior art keywords
- target
- video frame
- key
- camera video
- key target
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N23/00—Cameras or camera modules comprising electronic image sensors; Control thereof
- H04N23/95—Computational photography systems, e.g. light-field imaging systems
- H04N23/951—Computational photography systems, e.g. light-field imaging systems by using two or more images to influence resolution, frame rate or aspect ratio
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/169—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
- H04N19/17—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
- H04N19/172—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a picture, frame or field
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/42—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/85—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using pre-processing or post-processing specially adapted for video compression
- H04N19/87—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using pre-processing or post-processing specially adapted for video compression involving scene cut or scene change detection in combination with video compression
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N23/00—Cameras or camera modules comprising electronic image sensors; Control thereof
- H04N23/60—Control of cameras or camera modules
- H04N23/61—Control of cameras or camera modules based on recognised objects
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N23/00—Cameras or camera modules comprising electronic image sensors; Control thereof
- H04N23/80—Camera processing pipelines; Components thereof
- H04N23/84—Camera processing pipelines; Components thereof for processing colour signals
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N23/00—Cameras or camera modules comprising electronic image sensors; Control thereof
- H04N23/80—Camera processing pipelines; Components thereof
- H04N23/84—Camera processing pipelines; Components thereof for processing colour signals
- H04N23/86—Camera processing pipelines; Components thereof for processing colour signals for controlling the colour saturation of colour signals, e.g. automatic chroma control circuits
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N23/00—Cameras or camera modules comprising electronic image sensors; Control thereof
- H04N23/90—Arrangement of cameras or camera modules, e.g. multiple cameras in TV studios or sports stadiums
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Computing Systems (AREA)
- Theoretical Computer Science (AREA)
- Compression Or Coding Systems Of Tv Signals (AREA)
Abstract
The invention provides a scene multi-focal-length target video coding method comprising: determining the key targets that require focused recognition; setting up one zoom video recording system and a plurality of fixed-focus video recording systems; matching the same key targets between the zoom camera video frame and all fixed-focus camera video frames according to their image data and position information; and fusing the highest-definition image data acquired for each key target into the position of the same key target in the zoom camera video frame. The method enhances the definition of multiple specific targets in a video recording system. The zoom video recording system and the fixed-focus video recording systems work cooperatively: each specific target is cropped from the fixed-focus frames and the zoom frame, the instance with the highest definition is selected by an algorithm, and its data is fused into the frame of the zoom video stream, so that multiple targets in the zoom video stream remain in high definition.
Description
Technical Field
The invention relates to the technical field of video coding, in particular to a scene multi-focal-length target video coding method.
Background
When a conventional video recording system records video, targets located at different focal positions are difficult to display clearly in the video stream at the same time. In a sports match, for example, key elements such as athletes, referees and scoreboards are scattered across different positions and therefore cannot all be shown clearly in one picture. Conventional systems are therefore limited in use.
Disclosure of Invention
To address the above shortcoming of the prior art, the invention provides a scene multi-focal-length target video coding method.
The technical scheme adopted by the invention is as follows:
The invention provides a scene multi-focal-length target video coding method, which comprises the following steps:
Step S1, determining the key targets that require focused recognition and acquiring a feature representation vector for each key target, the feature representation vectors together forming a key target recognition feature matrix F = {F_1, F_2, ..., F_m}, where F_j is the feature representation vector of key target j, j = 1, 2, ..., m, and m is the number of key targets;
Step S2, according to the requirements of the shooting scene, setting up one zoom video recording system and n fixed-focus video recording systems facing the scene, each fixed-focus video recording system being configured with a different shooting angle and focal length;
Step S3, starting the zoom video recording system and the n fixed-focus video recording systems simultaneously for cooperative shooting, and acquiring, at the same acquisition time t, a zoom camera video frame S_t and a set of fixed-focus camera video frames {C_{1,t}, C_{2,t}, ..., C_{n,t}}, where C_{i,t} is the fixed-focus camera video frame acquired by fixed-focus video recording system i at acquisition time t, i = 1, 2, ..., n;
Step S4, identifying the m key targets in the zoom camera video frame S_t based on the key target recognition feature matrix F = {F_1, F_2, ..., F_m}, and acquiring the image data and position information of each identified key target j in the zoom camera video frame S_t;
performing key target recognition on each fixed-focus camera video frame C_{i,t} based on the key target recognition feature matrix F = {F_1, F_2, ..., F_m}, identifying k_{i,t} key targets, and acquiring the image data and position information of each identified key target l in the fixed-focus camera video frame C_{i,t}, where l = 1, 2, ..., k_{i,t};
Step S5, matching the same key targets in the zoom camera video frame S_t and each fixed-focus camera video frame C_{i,t} according to their image data and position information;
for each key target j, comparing the image definition of key target j in each fixed-focus camera video frame C_{i,t} and in the zoom camera video frame S_t; if the image definition of key target j is highest in the zoom camera video frame S_t, no image fusion is performed for that target; otherwise step S6 is performed;
Step S6, fusing the acquired highest-definition image data of key target j into the position of the same key target in the zoom camera video frame S_t to obtain a fused zoom camera video frame S_t;
Step S7, executing steps S5 to S6 for the m key targets in the zoom camera video frame S_t, and then performing video compression coding on the processed zoom camera video frame S_t to obtain a multi-focal-length fused, encoded zoom camera video frame S_t;
Step S8, outputting the encoded zoom camera video frame S_t, setting t = t + 1, and returning to step S3.
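The per-target decision made in steps S5 and S6 can be sketched as follows. This is a minimal illustration rather than the claimed implementation: the function name `choose_source`, the dictionary of per-camera scores, and the rule that a tie leaves the zoom frame untouched are assumptions introduced for the example.

```python
from typing import Dict, Optional


def choose_source(zoom_score: float, fixed_scores: Dict[int, float]) -> Optional[int]:
    """Pick the source with the highest definition for one key target.

    zoom_score   : definition score of the target in the zoom frame S_t
    fixed_scores : hypothetical mapping from fixed-focus camera index i to the
                   definition score of the matched target in C_{i,t}
                   (cameras without a match are simply absent)

    Returns None when the zoom frame already has the highest definition
    (no fusion), otherwise the index of the fixed-focus camera to fuse from.
    """
    if not fixed_scores:
        return None
    best_cam = max(fixed_scores, key=fixed_scores.get)
    # Assumption: on a tie the zoom frame is kept as it is.
    if fixed_scores[best_cam] <= zoom_score:
        return None
    return best_cam


# One frame with three key targets; the scores are made-up numbers.
plan = {
    "athlete": choose_source(5.0, {1: 3.0, 2: 7.5}),
    "referee": choose_source(6.0, {1: 4.0}),
    "scoreboard": choose_source(2.0, {}),
}
print(plan)
```

Only the targets whose entry in `plan` is not `None` would go through the fusion of step S6 before the frame is encoded in step S7.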
Preferably, in step S1, the feature representation vector of key target j is F_j = {F_{j,visual}, F_{j,spatial}, F_{j,temporal}};
wherein:
F_{j,visual} is the appearance feature vector of key target j, obtained by extracting the color histogram, texture features and shape description features of key target j;
F_{j,spatial} is the position feature vector of key target j, describing the position distribution and movement trajectory pattern of key target j in the picture;
F_{j,temporal} is the temporal feature vector of key target j, describing the temporal pattern and duration statistics of key target j.
Preferably, when shooting, the zoom video recording system captures the whole scene and dynamically adjusts its focal length according to scene requirements to obtain a panoramic video stream containing all key targets.
Preferably, each fixed-focus video recording system i is configured with a shooting angle θ_i and a fixed focal length f_i, both of which remain unchanged while the scene is shot, wherein the shooting angle θ_i is measured relative to the zoom video recording system.
Preferably, the image data and position information of key target j in the zoom camera video frame S_t are represented as E_{s,j,t} = {T_{s,j,t}, P_{s,j,t}, B_{s,j,t}}, wherein T_{s,j,t} is the image content matrix of key target j in the zoom camera video frame S_t at acquisition time t, obtained by cropping the image of key target j from the zoom camera video frame S_t; P_{s,j,t} is the position coordinate vector of the center of key target j in the zoom camera video frame S_t at acquisition time t; and B_{s,j,t} is the bounding-box coordinate vector of the minimum circumscribed rectangle of key target j in the zoom camera video frame S_t at acquisition time t;
The image data and position information of each identified key target l in the fixed-focus camera video frame C_{i,t} are represented as E_{i,l,t} = {T_{i,l,t}, P_{i,l,t}, B_{i,l,t}}, wherein T_{i,l,t} is the image content matrix of key target l in the fixed-focus camera video frame C_{i,t} at acquisition time t, obtained by cropping the image of key target l from the fixed-focus camera video frame C_{i,t}; P_{i,l,t} is the position coordinate vector of the center of key target l in the fixed-focus camera video frame C_{i,t} at acquisition time t; and B_{i,l,t} is the bounding-box coordinate vector of the minimum circumscribed rectangle of key target l in the fixed-focus camera video frame C_{i,t} at acquisition time t.
Preferably, step S4 further includes:
For each key target j identified in the zoom camera video frame S_t and each key target l identified in each fixed-focus camera video frame C_{i,t}, a confidence evaluation function is applied: the cosine similarity Confidence between the identified target and the feature representation vector of the corresponding key target from step S1 is calculated, and the identification is considered valid only when Confidence > τ_threshold, wherein the threshold τ_threshold is set between 0.7 and 0.9.
Preferably, in step S5, the same key targets in the zoom camera video frame S_t and each fixed-focus camera video frame C_{i,t} are matched according to their image data and position information, specifically:
Step S5.1, calculating the appearance similarity Sim_visual(j,l) between key target j and key target l from the image content matrix T_{s,j,t} of key target j in the zoom camera video frame S_t at acquisition time t and the image content matrix T_{i,l,t} of key target l in the fixed-focus camera video frame C_{i,t} at acquisition time t;
Step S5.2, calculating the position similarity Sim_spatial(j,l) between key target j and key target l from the position coordinate vector P_{s,j,t} of key target j in the zoom camera video frame S_t at acquisition time t and the position coordinate vector P_{i,l,t} of key target l in the fixed-focus camera video frame C_{i,t} at acquisition time t:
Sim_spatial(j,l) = exp(-||P_{s,j,t} - transform(P_{i,l,t}, M_i)||^2 / (2σ^2))
wherein transform(P_{i,l,t}, M_i) is the transformation function that maps P_{i,l,t} from the coordinate system of fixed-focus video recording system i into the coordinate system of the zoom video recording system; M_i is a 3×3 homography transformation matrix obtained through camera calibration of fixed-focus video recording system i; and σ is the position tolerance parameter, set to 10%-20% of the image size of key target l cropped from the fixed-focus camera video frame C_{i,t};
Step S5.3, calculating the comprehensive similarity between key target j and key target l by the following formula:
Sim(j,l) = α·Sim_visual(j,l) + β·Sim_spatial(j,l)
wherein α and β are the appearance similarity weight and the position similarity weight respectively, α + β = 1, α > 0 and β > 0; β is increased for scenes in which the appearance of the target changes greatly, and α is increased for scenes in which the appearance of the target is relatively stable;
Step S5.4, if the comprehensive similarity Sim(j,l) between key target j and key target l is greater than the matching threshold θ_match, key target j and key target l are successfully matched and are the same key target.
Preferably, for each key target j, the image definition of key target j in each fixed-focus camera video frame C_{i,t} and in the zoom camera video frame S_t is compared, specifically:
A definition quantization score is obtained for each camera video frame in the following manner, and image definition is compared on the basis of this score:
Step A, the camera video frame image requiring a definition quantization score is called the target image T, whose width is M pixels and whose height is N pixels; the target image T is converted from the spatial domain to the frequency domain by the two-dimensional discrete Fourier transform:
F(u,v) = Σ_{x=0}^{M-1} Σ_{y=0}^{N-1} T(x,y) · exp(-j·2π·(u·x/M + v·y/N))
wherein:
T(x,y) is the pixel value at position (x,y) in the spatial domain, j is the imaginary unit, and F(u,v) is the complex coefficient at position (u,v) in the frequency domain;
Step B, the complex coefficients F(u,v) in the frequency domain are filtered so that high-frequency components are retained and low-frequency components are removed, using the filter H(u,v):
if sqrt(u^2 + v^2) > ω_cutoff, let H(u,v) = 1; otherwise let H(u,v) = 0;
wherein ω_cutoff is the cutoff frequency threshold;
Step C, the filter H(u,v) is multiplied element by element with the complex coefficients F(u,v) in the frequency domain to realize frequency-domain filtering and extract the high-frequency information of the image, giving the high-frequency representation F_high(u,v):
F_high(u,v) = F(u,v) ⊙ H(u,v)
wherein ⊙ denotes the element-wise product;
Step D, the definition quantization score Sharpness_score(T) is computed from the high-frequency representation F_high(u,v), wherein U and V are the numbers of rows and columns in the frequency domain, respectively.
Preferably, in step S6, fusing the acquired highest-definition image data of key target j into the position of the same key target in the zoom camera video frame S_t to obtain the fused zoom camera video frame S_t specifically comprises:
Step S6.1, aligning the acquired highest-definition image data of key target j so that its center coordinates coincide with the center coordinates of the same key target in the zoom camera video frame S_t; if the two differ in size, the image data is resized by bilinear interpolation; the image data then replaces the same key target in the zoom camera video frame S_t;
Step S6.2, performing color and brightness correction on the image data of the key target embedded in the zoom camera video frame S_t so that its color and brightness distribution is consistent with that of the zoom camera video frame S_t, thereby obtaining first image data;
Step S6.3, calculating a weight w(p) for each pixel p of the first image data, wherein σ is the parameter controlling transition smoothness and is set to 1/6 to 1/4 of the size of the first image data;
Step S6.4, fusing each pixel p of the first image data with the pixel value S_region(p) according to its weight w(p) to obtain the fused image I_fused, wherein S_region(p) is the pixel value at position p within the region of the same key target in the zoom camera video frame S_t;
Step S6.5, performing boundary smoothing on the fused image I_fused by the following formula to obtain the boundary-smoothed image I_fused':
I_fused' = I_fused * G_smooth
wherein * denotes convolution and G_smooth is the smoothing filter, G_smooth = (1/16)·[1,2,1; 2,4,2; 1,2,1];
Step S6.6, re-embedding the boundary-smoothed image I_fused' into the target region of the zoom camera video frame S_t to obtain the finally fused zoom camera video frame S_enhanced(t):
S_enhanced(t) = S_t + Mask ⊙ (I_fused' - S_region)
wherein Mask is the mask matrix of the target region where the same key target is located in the zoom camera video frame S_t, S_region is that target region in the zoom camera video frame S_t, and S_t is the zoom camera video frame;
Step S6.7, performing video compression coding on the finally fused zoom camera video frame S_enhanced(t) and outputting it.
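The masked re-embedding of step S6.6 can be illustrated with a small sketch. It applies the update S_t + Mask ⊙ (fused region − S_region) only inside the target region, on grayscale images stored as nested lists. The function name `reembed` and the `top`/`left` offset arguments are hypothetical names chosen for the example.

```python
from typing import List

Image = List[List[float]]


def reembed(frame: Image, fused: Image, mask: Image, top: int, left: int) -> Image:
    """Write the smoothed fused region back into a copy of the zoom frame.

    frame : zoom camera video frame S_t
    fused : boundary-smoothed fused image covering the target region
    mask  : target-region mask, 1 inside the target and 0 outside
    top, left : hypothetical offset of the target region inside the frame
    """
    out = [row[:] for row in frame]
    for r, (fused_row, mask_row) in enumerate(zip(fused, mask)):
        for c, (value, m) in enumerate(zip(fused_row, mask_row)):
            y, x = top + r, left + c
            # S_enhanced = S_t + Mask * (I_fused' - S_region), pixel by pixel
            out[y][x] = frame[y][x] + m * (value - frame[y][x])
    return out


frame = [[10.0] * 4 for _ in range(4)]
fused = [[50.0, 50.0], [50.0, 50.0]]
mask = [[1, 0], [1, 1]]
enhanced = reembed(frame, fused, mask, top=1, left=1)
```

Where the mask is 0 the original zoom-frame pixel survives unchanged, so only the pixels belonging to the key target are replaced.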
The scene multi-focal-length target video coding method provided by the invention has the following advantages:
The method enhances the definition of multiple specific targets in a video recording system. The zoom video recording system and the fixed-focus video recording systems work cooperatively: each specific target is cropped from the fixed-focus frames and the zoom frame, the instance with the highest definition is selected by an algorithm, and its data is fused into the frame of the zoom video stream, so that multiple targets in the zoom video stream remain in high definition.
Drawings
Fig. 1 is a flowchart of a method for encoding a scene multi-focal-length target video.
Fig. 2 is a flowchart of an embodiment of a method for encoding a scene multi-focal-length target video according to the present invention.
Detailed Description
In order to make the technical problems, technical schemes and beneficial effects solved by the invention more clear, the invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Referring to fig. 1 and 2, the present invention provides a method for encoding a scene multi-focal-length target video, comprising the following steps:
Step S1, determining the key targets that require focused recognition and acquiring a feature representation vector for each key target, the feature representation vectors together forming a key target recognition feature matrix F = {F_1, F_2, ..., F_m}, where F_j is the feature representation vector of key target j, j = 1, 2, ..., m, and m is the number of key targets;
Specifically, the feature representation vector of key target j is F_j = {F_{j,visual}, F_{j,spatial}, F_{j,temporal}}, wherein F_{j,visual} is the appearance feature vector of key target j, obtained by extracting the color histogram, texture features and shape description features of key target j; F_{j,spatial} is the position feature vector of key target j, describing the position distribution and movement trajectory pattern of key target j in the picture; and F_{j,temporal} is the temporal feature vector of key target j, describing the temporal pattern and duration statistics of key target j.
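As an illustration of how the feature representation vector of step S1 might be held in memory, the sketch below groups the three parts in a small data class. The class name `KeyTargetFeature`, the `name` label, the concatenation in `as_vector`, and the use of a grayscale intensity histogram in place of a full color histogram are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class KeyTargetFeature:
    """Feature representation vector F_j of one key target (step S1)."""
    name: str              # hypothetical label, e.g. "scoreboard"
    visual: List[float]    # F_{j,visual}: histogram, texture, shape descriptors
    spatial: List[float]   # F_{j,spatial}: position distribution / trajectory
    temporal: List[float]  # F_{j,temporal}: temporal pattern / duration

    def as_vector(self) -> List[float]:
        # Assumption: the three parts are simply concatenated.
        return self.visual + self.spatial + self.temporal


def gray_histogram(pixels: List[int], bins: int = 4) -> List[float]:
    """Normalized intensity histogram, a stand-in for the color histogram."""
    counts = [0] * bins
    for p in pixels:
        counts[min(p * bins // 256, bins - 1)] += 1
    total = len(pixels) or 1
    return [c / total for c in counts]


# Key target recognition feature matrix F = {F_1, ..., F_m}, here with m = 2.
F = [
    KeyTargetFeature("athlete", gray_histogram([10, 200, 220, 90]), [0.5, 0.5], [1.0]),
    KeyTargetFeature("scoreboard", gray_histogram([250, 251, 252, 253]), [0.9, 0.1], [1.0]),
]
```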
Step S2, according to the requirements of the shooting scene, setting up one zoom video recording system and n fixed-focus video recording systems facing the scene, each fixed-focus video recording system being configured with a different shooting angle and focal length;
When shooting, the zoom video recording system captures the whole scene and dynamically adjusts its focal length according to scene requirements to obtain a panoramic video stream containing all key targets.
Each fixed-focus video recording system i is configured with a shooting angle θ_i and a fixed focal length f_i, both of which remain unchanged while the scene is shot, wherein the shooting angle θ_i is measured relative to the zoom video recording system.
Step S3, starting the zoom video recording system and the n fixed-focus video recording systems simultaneously for cooperative shooting, and acquiring, at the same acquisition time t, a zoom camera video frame S_t and a set of fixed-focus camera video frames {C_{1,t}, C_{2,t}, ..., C_{n,t}}, where C_{i,t} is the fixed-focus camera video frame acquired by fixed-focus video recording system i at acquisition time t, i = 1, 2, ..., n;
Step S4, identifying the m key targets in the zoom camera video frame S_t based on the key target recognition feature matrix F = {F_1, F_2, ..., F_m}, and acquiring the image data and position information of each identified key target j in the zoom camera video frame S_t;
performing key target recognition on each fixed-focus camera video frame C_{i,t} based on the key target recognition feature matrix F = {F_1, F_2, ..., F_m}, identifying k_{i,t} key targets, and acquiring the image data and position information of each identified key target l in the fixed-focus camera video frame C_{i,t}, where l = 1, 2, ..., k_{i,t};
Specifically, the image data and position information of key target j in the zoom camera video frame S_t are represented as E_{s,j,t} = {T_{s,j,t}, P_{s,j,t}, B_{s,j,t}}, wherein T_{s,j,t} is the image content matrix of key target j in the zoom camera video frame S_t at acquisition time t, obtained by cropping the image of key target j from the zoom camera video frame S_t; P_{s,j,t} is the position coordinate vector of the center of key target j in the zoom camera video frame S_t at acquisition time t; and B_{s,j,t} is the bounding-box coordinate vector of the minimum circumscribed rectangle of key target j in the zoom camera video frame S_t at acquisition time t;
The image data and position information of each identified key target l in the fixed-focus camera video frame C_{i,t} are represented as E_{i,l,t} = {T_{i,l,t}, P_{i,l,t}, B_{i,l,t}}, wherein T_{i,l,t} is the image content matrix of key target l in the fixed-focus camera video frame C_{i,t} at acquisition time t, obtained by cropping the image of key target l from the fixed-focus camera video frame C_{i,t}; P_{i,l,t} is the position coordinate vector of the center of key target l in the fixed-focus camera video frame C_{i,t} at acquisition time t; and B_{i,l,t} is the bounding-box coordinate vector of the minimum circumscribed rectangle of key target l in the fixed-focus camera video frame C_{i,t} at acquisition time t.
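The triple E = {T, P, B} described above can be sketched as a small record built from a bounding box. This is only an illustration on a grayscale frame stored as nested lists. The names `TargetObservation` and `observe`, and the convention that the box is (x_min, y_min, x_max, y_max) with exclusive maxima, are assumptions of the example.

```python
from dataclasses import dataclass
from typing import List, Tuple

Image = List[List[int]]  # grayscale image as rows of pixel values


@dataclass
class TargetObservation:
    """E = {T, P, B} for one key target in one frame."""
    T: Image                      # cropped image content matrix
    P: Tuple[float, float]        # center position (x, y) in frame coordinates
    B: Tuple[int, int, int, int]  # bounding box (x_min, y_min, x_max, y_max)


def observe(frame: Image, box: Tuple[int, int, int, int]) -> TargetObservation:
    """Crop the target from the frame and record its center and bounding box."""
    x0, y0, x1, y1 = box
    crop = [row[x0:x1] for row in frame[y0:y1]]
    center = ((x0 + x1) / 2, (y0 + y1) / 2)
    return TargetObservation(T=crop, P=center, B=box)


# A 4x6 test frame whose pixel value encodes its row and column.
frame = [[r * 10 + c for c in range(6)] for r in range(4)]
obs = observe(frame, (1, 1, 4, 3))
```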
For each key target j identified in the zoom camera video frame S_t and each key target l identified in each fixed-focus camera video frame C_{i,t}, a confidence evaluation function is applied: the cosine similarity Confidence between the identified target and the feature representation vector of the corresponding key target from step S1 is calculated, and the identification is considered valid only when Confidence > τ_threshold, wherein the threshold τ_threshold is set between 0.7 and 0.9.
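The confidence check can be sketched as below, assuming that both the detected target and the reference from step S1 are available as plain numeric vectors of equal length. The function names and the default threshold of 0.8 (a value inside the stated 0.7 to 0.9 range) are choices made for the example.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_valid_recognition(detected: Sequence[float],
                         reference: Sequence[float],
                         tau_threshold: float = 0.8) -> bool:
    """Accept the recognition only when Confidence > tau_threshold."""
    return cosine_similarity(detected, reference) > tau_threshold


print(is_valid_recognition([1.0, 0.1], [1.0, 0.0]))  # nearly parallel vectors
print(is_valid_recognition([1.0, 1.0], [1.0, 0.0]))  # 45 degrees apart
```

A lower threshold accepts more detections at the cost of more false matches, which is presumably why the text gives a range rather than a single value.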
Step S5, matching the same key targets in the zoom camera video frame S_t and each fixed-focus camera video frame C_{i,t} according to their image data and position information;
The specific matching method comprises the following steps:
Step S5.1, calculating the appearance similarity Sim_visual(j,l) between key target j and key target l from the image content matrix T_{s,j,t} of key target j in the zoom camera video frame S_t at acquisition time t and the image content matrix T_{i,l,t} of key target l in the fixed-focus camera video frame C_{i,t} at acquisition time t;
Step S5.2, calculating the position similarity Sim_spatial(j,l) between key target j and key target l from the position coordinate vector P_{s,j,t} of key target j in the zoom camera video frame S_t at acquisition time t and the position coordinate vector P_{i,l,t} of key target l in the fixed-focus camera video frame C_{i,t} at acquisition time t:
Sim_spatial(j,l) = exp(-||P_{s,j,t} - transform(P_{i,l,t}, M_i)||^2 / (2σ^2))
wherein transform(P_{i,l,t}, M_i) is the transformation function that maps P_{i,l,t} from the coordinate system of fixed-focus video recording system i into the coordinate system of the zoom video recording system; M_i is a 3×3 homography transformation matrix obtained through camera calibration of fixed-focus video recording system i; and σ is the position tolerance parameter, set to 10%-20% of the image size of key target l cropped from the fixed-focus camera video frame C_{i,t};
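The position similarity of step S5.2 can be sketched as follows, assuming that M_i acts on homogeneous pixel coordinates in the usual way for a 3×3 planar transformation. The translation-only matrix used in the demonstration is made up, and a calibrated matrix would normally also contain scale and perspective terms.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]
Matrix3 = List[List[float]]


def transform(p: Point, M: Matrix3) -> Point:
    """Map a point from fixed-focus camera coordinates into zoom-frame coordinates."""
    x, y = p
    xh = M[0][0] * x + M[0][1] * y + M[0][2]
    yh = M[1][0] * x + M[1][1] * y + M[1][2]
    w = M[2][0] * x + M[2][1] * y + M[2][2]
    if w == 0:
        raise ValueError("point maps to infinity under M")
    return (xh / w, yh / w)


def sim_spatial(p_zoom: Point, p_fixed: Point, M: Matrix3, sigma: float) -> float:
    """exp(-||P_s - transform(P_i, M_i)||^2 / (2 sigma^2))."""
    tx, ty = transform(p_fixed, M)
    dist_sq = (p_zoom[0] - tx) ** 2 + (p_zoom[1] - ty) ** 2
    return math.exp(-dist_sq / (2.0 * sigma ** 2))


# Made-up calibration: the fixed camera's origin sits 10 px to the right in the zoom frame.
M_shift = [[1.0, 0.0, 10.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(sim_spatial((10.0, 0.0), (0.0, 0.0), M_shift, sigma=5.0))  # perfectly aligned
print(sim_spatial((15.0, 0.0), (0.0, 0.0), M_shift, sigma=5.0))  # one sigma away
```

The similarity is 1 when the transformed position coincides with the zoom-frame position and decays smoothly with distance, with σ setting how quickly a positional mismatch is penalized.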
Step S5.3, calculating the comprehensive similarity between key target j and key target l by the following formula:
Sim(j,l) = α·Sim_visual(j,l) + β·Sim_spatial(j,l)
wherein α and β are the appearance similarity weight and the position similarity weight respectively, α + β = 1, α > 0 and β > 0; β is increased for scenes in which the appearance of the target changes greatly, and α is increased for scenes in which the appearance of the target is relatively stable;
Step S5.4, if the comprehensive similarity Sim(j,l) between key target j and key target l is greater than the matching threshold θ_match, key target j and key target l are successfully matched and are the same key target.
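Steps S5.3 and S5.4 can be sketched together. The text only states the threshold test. Keeping a single best candidate l for each target j when several exceed the threshold is an assumption added here, as are the function names and the dictionary layout of the similarity table.

```python
from typing import Dict, Tuple


def combined_similarity(sim_visual: float, sim_spatial: float, alpha: float = 0.5) -> float:
    """Sim = alpha * Sim_visual + beta * Sim_spatial with beta = 1 - alpha."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    beta = 1.0 - alpha
    return alpha * sim_visual + beta * sim_spatial


def match_targets(similarities: Dict[Tuple[int, int], float],
                  theta_match: float) -> Dict[int, int]:
    """Map each zoom-frame target j to its best fixed-focus candidate l.

    similarities: {(j, l): Sim(j, l)} for one fixed-focus camera.
    Only pairs with Sim(j, l) > theta_match count as matches.
    """
    best: Dict[int, int] = {}
    for (j, l), score in similarities.items():
        if score <= theta_match:
            continue
        if j not in best or score > similarities[(j, best[j])]:
            best[j] = l
    return best


sims = {
    (1, 1): combined_similarity(0.9, 0.5, alpha=0.6),
    (1, 2): combined_similarity(0.2, 0.1, alpha=0.6),
    (2, 1): 0.3,
}
matches = match_targets(sims, theta_match=0.6)
```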
For each key target j, the image definition of key target j in each fixed-focus camera video frame C_{i,t} and in the zoom camera video frame S_t is compared; if the image definition of key target j is highest in the zoom camera video frame S_t, no image fusion is performed for that target;
Specifically, a definition quantization score is obtained for each camera video frame in the following manner, and image definition is compared on the basis of this score:
Step A, the camera video frame image requiring a definition quantization score is called the target image T, whose width is M pixels and whose height is N pixels; the target image T is converted from the spatial domain to the frequency domain by the two-dimensional discrete Fourier transform:
F(u,v) = Σ_{x=0}^{M-1} Σ_{y=0}^{N-1} T(x,y) · exp(-j·2π·(u·x/M + v·y/N))
wherein:
T(x,y) is the pixel value at position (x,y) in the spatial domain, j is the imaginary unit, and F(u,v) is the complex coefficient at position (u,v) in the frequency domain;
Step B, the complex coefficients F(u,v) in the frequency domain are filtered so that high-frequency components are retained and low-frequency components are removed, using the filter H(u,v):
if sqrt(u^2 + v^2) > ω_cutoff, let H(u,v) = 1; otherwise let H(u,v) = 0;
wherein ω_cutoff is the cutoff frequency threshold;
Step C, the filter H(u,v) is multiplied element by element with the complex coefficients F(u,v) in the frequency domain to realize frequency-domain filtering and extract the high-frequency information of the image, giving the high-frequency representation F_high(u,v):
F_high(u,v) = F(u,v) ⊙ H(u,v)
wherein ⊙ denotes the element-wise product;
Step D, the definition quantization score Sharpness_score(T) is computed from the high-frequency representation F_high(u,v), wherein U and V are the numbers of rows and columns in the frequency domain, respectively.
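Steps A to D can be sketched with a naive discrete Fourier transform on a tiny grayscale image. Several details are assumptions, because the source does not preserve them: frequency indices are folded so that index u and index M − u count as the same distance from zero frequency, and the score is taken as the mean squared magnitude of the retained coefficients. The direct quadruple loop is only for illustration, and a real system would use a fast Fourier transform.

```python
import cmath
import math
from typing import List

Image = List[List[float]]


def dft2(img: Image) -> List[List[complex]]:
    """Naive 2-D DFT; coeffs[v][u] is F(u, v) for an N-row, M-column image."""
    n_rows, n_cols = len(img), len(img[0])
    coeffs = []
    for v in range(n_rows):
        row = []
        for u in range(n_cols):
            acc = 0j
            for y in range(n_rows):
                for x in range(n_cols):
                    angle = -2.0 * math.pi * (u * x / n_cols + v * y / n_rows)
                    acc += img[y][x] * cmath.exp(1j * angle)
            row.append(acc)
        coeffs.append(row)
    return coeffs


def sharpness_score(img: Image, cutoff: float = 1.0) -> float:
    """Mean squared magnitude of the coefficients kept by the high-pass mask."""
    n_rows, n_cols = len(img), len(img[0])
    coeffs = dft2(img)
    energy = 0.0
    for v in range(n_rows):
        for u in range(n_cols):
            # Assumption: fold indices so both ends of the spectrum are "low".
            fu, fv = min(u, n_cols - u), min(v, n_rows - v)
            if math.hypot(fu, fv) > cutoff:  # H(u, v) = 1
                energy += abs(coeffs[v][u]) ** 2
    return energy / (n_rows * n_cols)


checker = [[255.0 if (x + y) % 2 == 0 else 0.0 for x in range(4)] for y in range(4)]
flat = [[128.0] * 4 for _ in range(4)]
print(sharpness_score(checker) > sharpness_score(flat))
```

A uniformly gray patch has all of its energy at zero frequency and scores close to zero, while a patch with fine detail keeps energy above the cutoff and scores higher, which is the property the comparison in step S5 relies on.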
Step S6, fusing the acquired highest-definition image data of key target j to the position of the same key target in the zoom camera video frame S_t to obtain a fused zoom camera video frame S_t;
specifically comprising the following steps:
Step S6.1, making the center coordinates of the acquired highest-definition image data of key target j coincide with the center coordinates of the same key target in the zoom camera video frame S_t and, if the two do not match in size, resizing the image data by bilinear interpolation, so that the image data replaces the same key target in the zoom camera video frame S_t;
Step S6.2, performing color and brightness correction on the image data of the key target embedded in the zoom camera video frame S_t to obtain first image data whose color and brightness distribution is consistent with that of the zoom camera video frame S_t;
Step S6.3, calculating the weight w(p) of each pixel point p in the first image data, wherein σ is a parameter controlling transition smoothness and is set to 1/6 to 1/4 of the size of the first image data;
Step S6.4, fusing each pixel point p in the first image data, using the weight w(p), with S_region(p), the pixel value at pixel point p in the area of the same key target in the zoom camera video frame S_t, to obtain the fused image I_fused;
Step S6.5, applying boundary smoothing to the fused image I_fused using the following formula to obtain the boundary-smoothed image I_fused':
I_fused' = I_fused * G_smooth
wherein * denotes convolution and G_smooth is a smoothing filter, G_smooth = (1/16)[1, 2, 1; 2, 4, 2; 1, 2, 1];
Step S6.6, re-embedding the boundary-smoothed image I_fused' into the target area in the zoom camera video frame S_t to obtain the finally fused zoom camera video frame S_enhanced(t):
S_enhanced(t) = S_t + Mask ⊙ (I_fused' - S_region)
wherein Mask is the mask matrix of the target area where the same key target is located in the zoom camera video frame S_t, S_region is the target area where the same key target is located in the zoom camera video frame S_t, and S_t represents the zoom camera video frame;
Step S6.7, performing video compression coding on the finally fused zoom camera video frame S_enhanced(t) and outputting the result.
Step S7, executing steps S5 to S6 for each of the m key targets in the zoom camera video frame S_t, and then performing video compression coding on the processed zoom camera video frame S_t to obtain the multi-focal-length fused, encoded zoom camera video frame S_t;
Step S8, outputting the encoded zoom camera video frame S_t, letting t = t + 1, and returning to step S3.
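The formula images for Steps B and D and for steps S6.3 and S6.4 are not reproduced in this text. The block below is a hedged reconstruction from the surrounding definitions, not the patent's own wording. It assumes a radial high-pass mask on a spectrum centred at a hypothetical point (u_0, v_0), an energy-ratio score, a Gaussian weight in the distance from a hypothetical target centre c, and a convex blend in which I_1 is a placeholder symbol for the first image data.

```latex
% Step B (assumed): radial high-pass mask around the spectrum centre (u_0, v_0)
H(u,v) =
\begin{cases}
1, & \sqrt{(u-u_0)^2 + (v-v_0)^2} > \omega_{\mathrm{cutoff}} \\
0, & \text{otherwise}
\end{cases}

% Step D (assumed): ratio of high-frequency energy to total energy
\mathrm{Sharpness\_score}(T) =
\frac{\sum_{u=0}^{U-1}\sum_{v=0}^{V-1} \lvert F_{\mathrm{high}}(u,v)\rvert^{2}}
     {\sum_{u=0}^{U-1}\sum_{v=0}^{V-1} \lvert F(u,v)\rvert^{2}}

% Step S6.3 (assumed): Gaussian weight, equal to 1 at the target centre c
w(p) = \exp\!\left(-\frac{\lVert p - c\rVert^{2}}{2\sigma^{2}}\right)

% Step S6.4 (assumed): per-pixel blend of first image data I_1 and S_region
I_{\mathrm{fused}}(p) = w(p)\, I_1(p) + \bigl(1 - w(p)\bigr)\, S_{\mathrm{region}}(p)
```

With these forms, w(p) is close to 1 near the target centre, so the high-definition data dominates there, and falls towards 0 at the edge, so the original zoom-frame pixels dominate at the boundary.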
The scene multi-focal-length target video coding method relies on the cooperative operation of one zoom video recording system and a plurality of fixed-focus video recording systems. The shooting angle and focal length of each fixed-focus video recording system are set according to preset scene requirements so as to capture target details in different depth-of-field ranges, yielding one zoom video stream and multiple fixed-focus video streams. Artificial intelligence image recognition is used to automatically identify and crop, from these video streams, the target areas corresponding to a plurality of target pictures or names predefined by the user. A correspondence between a specific target in the zoom video stream and the same target in the fixed-focus video streams is established through position information and a feature matching algorithm. For each target, the algorithm traverses all instances of the target in the related video streams and selects the target image data with the highest definition by high-pass filtering after transformation to the frequency domain. This image data then replaces the corresponding target position in the zoom video stream, and the replaced area is smoothed, so that all key targets in the resulting zoom video stream reach their best available definition. Finally, the optimized zoom video stream is compressed using video compression coding.
One embodiment is described below:
(1) Predefining a database of target identifiers to be enhanced, wherein a target to be enhanced can be specified by a photo, a name or a feature description;
Specifically, the system constructs a database containing the feature information of all important targets, including elements such as optical characteristics, spatial positions and time-sequence relations, to form a complete target feature library.
Specifically, the user predefines a set of important targets O = {O_1, O_2, ..., O_m} that should remain sharp in the video, where m is the total number of targets. This predefined approach solves the technical problem that conventional video recording systems cannot distinguish important targets from the background. A target may be specified in three flexible ways:
providing a reference picture of the target, that is, directly uploading a standard image of the target, which suits fixed targets of known appearance;
inputting the name of the target, such as "athlete", "referee" or "scoreboard", whereupon the system calls a pre-trained semantic recognition model;
describing features of the target, such as "a person wearing a red jersey", which supports flexible attribute-based description.
To achieve accurate recognition, the system builds a multidimensional feature vector for each target, which is one of the key technical innovations of the present invention. The feature vector contains three types of complementary information:
appearance feature vector: visual characteristics of the target such as its color histogram, texture features and shape description, typically 512-1024 dimensions;
position feature vector: the usual position distribution and movement trajectory patterns of the target in the picture, typically 64-128 dimensions;
time feature vector: the temporal patterns of the target's appearance and its duration statistics, typically 32-64 dimensions.
The complete feature representation of each target is obtained by vector concatenation, that is, the appearance feature vector, the position feature vector and the time feature vector are concatenated, and the total dimension of the concatenated feature vector is generally 600-1200.
The feature profiles of all targets are combined into a target recognition database matrix, which serves as a lookup table for the subsequent recognition algorithm and supports rapid similarity calculation and target matching.
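The concatenation described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the dimensions are picked from the lower end of the stated ranges, the random vectors stand in for real extracted features, and the function names and the row normalisation (convenient for later cosine-similarity lookups) are assumptions.

```python
import numpy as np

# Hypothetical dimensions taken from the lower end of the ranges in the text.
APPEARANCE_DIM, POSITION_DIM, TIME_DIM = 512, 64, 32


def build_target_feature(appearance, position, time):
    """Concatenate appearance, position and time vectors into one feature vector."""
    parts = [np.asarray(v, dtype=np.float32) for v in (appearance, position, time)]
    return np.concatenate(parts)


def build_database(features):
    """Stack per-target vectors into an (m, d) matrix with L2-normalised rows.

    The normalisation is an assumption made here so that a dot product with
    another unit vector directly yields a cosine similarity.
    """
    phi = np.stack(features)
    norms = np.linalg.norm(phi, axis=1, keepdims=True)
    return phi / np.maximum(norms, 1e-12)


# Placeholder features for m = 3 targets (random values, for illustration only).
rng = np.random.default_rng(0)
targets = [
    build_target_feature(
        rng.random(APPEARANCE_DIM), rng.random(POSITION_DIM), rng.random(TIME_DIM)
    )
    for _ in range(3)
]
phi = build_database(targets)
print(phi.shape)  # (3, 608): 512 + 64 + 32 columns per target
```

With the upper end of each range (1024 + 128 + 64) the concatenated vector would have 1216 dimensions, which is in line with the stated total of roughly 600-1200.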
(2) The zoom video recording system and the plurality of fixed-focus video recording systems are started simultaneously to record, and at the same moment a frame S1 is acquired from the zoom video stream and a frame set {C1, C2, ..., Cn} is acquired from the fixed-focus video streams.
Collaborative recording with multiple cameras solves the technical problem that a single camera cannot simultaneously obtain sharp images of several targets at different distances, and combines panoramic coverage with local high definition.
The system adopts a camera configuration strategy of 'one main camera and multiple auxiliary cameras', and simultaneously starts two types of video equipment to perform cooperative work:
The zoom camera Z serves as the main camera and is responsible for shooting the whole scene; its focal length can be adjusted dynamically according to scene requirements to obtain a panoramic video stream containing all targets. Its advantages are a large field of view and the ability to capture complete scene information; its drawback is limited definition for distant targets.
The set of n fixed-focus cameras F = {F_1, F_2, ..., F_n} serves as the auxiliary cameras, each fixed in advance at a specific focal length and shooting angle and dedicated to shooting targets within a specific distance range. This design ensures that, at each distance level, a dedicated camera provides a high-definition target image.
At time t, the system achieves strict time synchronization, and synchronously acquires a main video stream frame from the zoom camera and an auxiliary video stream frame set from each fixed-focus camera. Different cameras may have different resolution configurations.
Time synchronization: all cameras use a hardware synchronization signal or the Network Time Protocol (NTP) to ensure that the time difference between frames is less than 1/60 second, avoiding target position deviation caused by desynchronization.
The camera parameter configuration matrix is P_camera = [f_1, θ_1; f_2, θ_2; ...; f_n, θ_n], where f_i is the focal length of the i-th fixed-focus camera (in mm) and θ_i is its shooting angle relative to the main camera (in degrees). The matrix records the geometric configuration of the whole camera array and is used for subsequent coordinate transformation and target matching.
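A minimal sketch of the 1/60-second synchronization check and of one way to hold P_camera, assuming each frame carries a capture timestamp in seconds. The function name, the timestamps and the camera values are illustrative and do not come from the patent.

```python
MAX_SKEW_S = 1.0 / 60.0  # tolerance stated in the text


def frames_synchronised(timestamps, max_skew=MAX_SKEW_S):
    """Return True if all capture timestamps (in seconds) lie within max_skew."""
    return (max(timestamps) - min(timestamps)) < max_skew


# P_camera as rows of (focal length in mm, angle in degrees relative to the
# main camera); the values are made up for illustration.
p_camera = [
    (35.0, -15.0),
    (85.0, 0.0),
    (200.0, 20.0),
]

zoom_ts = 10.000                    # timestamp of the zoom frame
fixed_ts = [10.004, 9.998, 10.010]  # timestamps of the fixed-focus frames

if frames_synchronised([zoom_ts] + fixed_ts):
    print("frame set accepted for time t")
else:
    print("frame set rejected: cameras out of sync")
```

Rejecting a whole frame set is only one possible policy; a system could instead drop the single out-of-tolerance camera for that time step.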
(3) Using artificial intelligence recognition techniques, the preset key targets are identified in frame S1 and in the frames {C1, C2, ..., Cn}. The image data and accurate position information of each identified key target are recorded.
Specifically, the system uses a convolutional-neural-network-based artificial intelligence image recognition algorithm, AI_Detector, which combines target detection and feature matching to automatically find and recognize the preset important targets in the video streams of all cameras. Its technical advantage is that it can handle complex conditions such as scale change, illumination change and partial occlusion of the target.
The specific identification process adopts a parallel processing architecture:
the system takes the target recognition database matrix Φ established in step (1) as the reference standard and loads it into GPU memory to improve query speed;
each frame of the main video stream S(t) is scanned in real time, and all preset targets are identified using sliding-window and multi-scale detection techniques;
at the same time, the same identification process is applied to the frames of each auxiliary video stream, and the identification tasks of all cameras are executed in parallel to improve processing efficiency.
For each identified target, the system accurately records the triplet (T, P, B):
T: the image content matrix of the target, cropped from the original image with the original pixel values retained;
P: the coordinate vector of the center position of the target in the picture, in pixels;
B: the bounding-box coordinate vector of the target, defining the minimum rectangular area containing the target, with the origin at the upper left corner of the image.
To ensure identification accuracy, the system introduces a confidence evaluation mechanism: a confidence evaluation function computes the cosine similarity between the detected target features and the target features in the database, and the closer the value is to 1, the higher the degree of matching.
An identification is considered valid only if Confidence > τ_threshold, where τ_threshold is typically set to 0.7-0.9 and can be adjusted according to the accuracy requirements of the application scenario.
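The confidence test can be sketched in a few lines. The function names and the default threshold of 0.8 (a value inside the stated 0.7-0.9 range) are assumptions for illustration, and the three-element vectors stand in for the much longer feature vectors described earlier.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as matching nothing
    return dot / (norm_a * norm_b)


def is_valid_detection(detected, reference, tau_threshold=0.8):
    """Accept a detection only if Confidence > tau_threshold."""
    return cosine_similarity(detected, reference) > tau_threshold


reference = [1.0, 0.0, 1.0]      # stored database feature (toy example)
close_match = [0.9, 0.1, 1.1]    # similar direction, so confidence is near 1
unrelated = [0.0, 1.0, 0.0]      # orthogonal, so confidence is 0

print(is_valid_detection(close_match, reference))
print(is_valid_detection(unrelated, reference))
```

Raising tau_threshold trades missed detections for fewer false matches, which is the adjustment the text leaves to the application scenario.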
Quality control of identification results: the system also records quality indicators for each identification result, including target completeness (whether the target is occluded), image definition and illumination conditions, providing a basis for subsequent target selection.
(4) Based on the image data and position information, the correspondence of the same key target between the zoom video frame S1 and the set of fixed-focus video frames {C1, C2, ..., Cn} is accurately matched.
This solves the key technical problem in a multi-camera system of how to determine accurately whether targets captured by different cameras are the same target, which is a precondition for improving target definition.
The same target may appear in the main video stream and in several auxiliary video streams at the same time, yet look different in different cameras because of differing shooting angles, distances and illumination conditions. The system therefore needs to establish correspondences between targets and determine which detections are views of the same target from different shooting angles. The technical innovation of the matching algorithm is that it considers information of several dimensions together, avoiding the limitations of single-feature matching. The match determination is based on a weighted combination of two main factors:
First, the appearance similarity is calculated as the cosine similarity between the visual feature vectors of the two target images, covering color distribution, texture pattern, shape contour and the like. Cosine similarity is insensitive to changes in image brightness and is therefore suitable for matching targets under different illumination conditions.
Then, the position similarity is calculated by a function that takes into account the spatial position relationship of the targets in the different cameras.
Finally, the appearance similarity and the position similarity are summed with weights to obtain the comprehensive similarity.
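The weighted matching can be sketched as below. The text does not give the position-similarity function or the weights, so everything specific here is an assumption: positions are taken to be already mapped into a common normalised coordinate frame, position similarity is modelled as a Gaussian falloff in distance, and the weight 0.7, the sigma of 0.1 and the acceptance score of 0.6 are placeholders.

```python
import math


def cosine_similarity(a, b):
    """Appearance similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0.0 and nb > 0.0 else 0.0


def position_similarity(p, q, sigma=0.1):
    """Assumed form: Gaussian falloff in the distance between (x, y) positions."""
    d2 = (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    return math.exp(-d2 / (2.0 * sigma ** 2))


def combined_similarity(feat_a, pos_a, feat_b, pos_b, w_appearance=0.7):
    """Weighted sum of appearance and position similarity."""
    return (w_appearance * cosine_similarity(feat_a, feat_b)
            + (1.0 - w_appearance) * position_similarity(pos_a, pos_b))


def best_match(zoom_target, candidates, min_score=0.6):
    """Index of the best fixed-focus candidate, or None if none exceeds min_score."""
    best_idx, best_score = None, min_score
    for idx, (feat, pos) in enumerate(candidates):
        score = combined_similarity(zoom_target[0], zoom_target[1], feat, pos)
        if score > best_score:
            best_idx, best_score = idx, score
    return best_idx


zoom_target = ([1.0, 0.0], (0.50, 0.50))          # (feature, position) in S1
candidates = [
    ([0.0, 1.0], (0.90, 0.90)),                   # different look, far away
    ([1.0, 0.1], (0.52, 0.50)),                   # similar look, nearby
]
print(best_match(zoom_target, candidates))
```

Using both terms means a look-alike object far from the expected position, or a nearby object with a different appearance, scores lower than a detection that agrees on both.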
(5) For each matched key target, the image is transformed from the spatial domain to the frequency domain. A high-pass filter is applied that passes only frequency information above a preset threshold, extracting the high-frequency components. By comparing the strength and quantity of the high-frequency components in the different images of the key target, the target image F1 with the highest definition is evaluated and selected.
Specifically, for each matched target, the system selects the version with the highest definition from the main video stream and each auxiliary video stream. The principle of definition evaluation is that a sharp image contains more high-frequency components and a blurred image contains fewer. Definition evaluation uses frequency-domain analysis, which is objective and accurate and is not affected by subjective human perception.
Further, a definition quantization score is used: the proportion of high-frequency energy to total energy is calculated, and the larger the value, the sharper the image. Normalization by the denominator ensures comparability between images of different sizes.
To avoid noise interference, the difference between the highest and second-highest scores can be checked; when the difference is smaller than a threshold, other quality indicators (such as contrast and saturation) are also taken into account. In this way the highest-definition image version of each important target is selected, providing the best material for subsequent image fusion.
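The scoring and selection described above can be sketched with NumPy's FFT routines. This is an illustration under stated assumptions rather than the patent's exact formula: the image is a 2-D grayscale array, the cutoff is expressed as a normalised radial frequency in cycles per pixel (0.25 is a placeholder), energy is taken as squared magnitude, and the margin of 0.02 for the near-tie check is hypothetical.

```python
import numpy as np


def sharpness_score(image, cutoff=0.25):
    """Ratio of high-frequency energy to total spectral energy of a 2-D image."""
    img = np.asarray(image, dtype=np.float64)
    spectrum = np.fft.fft2(img)                      # spatial -> frequency domain
    fy = np.fft.fftfreq(img.shape[0])[:, None]       # vertical frequencies
    fx = np.fft.fftfreq(img.shape[1])[None, :]       # horizontal frequencies
    high_pass = np.sqrt(fx ** 2 + fy ** 2) > cutoff  # H(u, v): True above cutoff
    energy = np.abs(spectrum) ** 2
    total = energy.sum()
    return float(energy[high_pass].sum() / total) if total > 0 else 0.0


def pick_sharpest(scores, min_margin=0.02):
    """Return (index of best score, whether it beats the runner-up by min_margin)."""
    order = np.argsort(scores)[::-1]
    best = int(order[0])
    decisive = len(scores) < 2 or (scores[best] - scores[int(order[1])]) >= min_margin
    return best, decisive


# Toy comparison: a noisy texture versus a 3x3 box-blurred copy of it.
rng = np.random.default_rng(0)
sharp = rng.random((64, 64))
blurred = sum(
    np.roll(sharp, (dy, dx), axis=(0, 1)) for dy in (-1, 0, 1) for dx in (-1, 0, 1)
) / 9.0

scores = [sharpness_score(sharp), sharpness_score(blurred)]
best, decisive = pick_sharpest(scores)
print(best, decisive)
```

When `decisive` is False, the caller would fall back to the other quality indicators mentioned in the text, such as contrast and saturation.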
(6) The image data of the highest-definition target image F1 is fused to the corresponding position in the zoom video frame S1. During fusion, a smoothing algorithm is applied to the target edge area to ensure a natural, consistent transition between F1 and the background of S1.
Specifically, high-definition target images from different sources are seamlessly fused into the main video stream, preserving the high definition of the target while keeping the overall picture natural and consistent. To keep the fused image natural and smooth, a weighted fusion technique based on Gaussian weights is used. Its core idea is that the high-definition image is used fully in the central area of the target and transitions gradually to the original image in the edge area, avoiding a hard boundary.
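The fusion idea can be sketched for a single grayscale target as follows. Several simplifications are assumptions made here: the high-definition patch is taken to be already resized and color-corrected, the weight is a Gaussian of the distance from the patch center with sigma set to a fraction of the smaller patch side, the target mask is a full rectangle, and image borders are handled by edge replication. Only the 3x3 kernel is taken directly from the text.

```python
import numpy as np

# Smoothing kernel G_smooth = (1/16)[1,2,1; 2,4,2; 1,2,1] as given in the text.
G_SMOOTH = np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]], dtype=np.float64) / 16.0


def gaussian_weights(height, width, sigma_fraction=0.2):
    """Weight map equal to 1 at the patch center and decaying towards the edges."""
    sigma = sigma_fraction * min(height, width)  # text suggests 1/6 to 1/4 of size
    ys = np.arange(height)[:, None] - (height - 1) / 2.0
    xs = np.arange(width)[None, :] - (width - 1) / 2.0
    return np.exp(-(xs ** 2 + ys ** 2) / (2.0 * sigma ** 2))


def smooth3x3(img):
    """Convolve with G_SMOOTH (symmetric kernel), replicating edge pixels."""
    padded = np.pad(img, 1, mode="edge")
    out = np.zeros(img.shape, dtype=np.float64)
    for dy in range(3):
        for dx in range(3):
            out += G_SMOOTH[dy, dx] * padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out


def fuse_target(frame, patch, top, left):
    """Blend a same-size high-definition patch into the frame at (top, left)."""
    h, w = patch.shape
    result = frame.astype(np.float64).copy()
    region = result[top:top + h, left:left + w].copy()   # S_region
    weights = gaussian_weights(h, w)                      # w(p)
    fused = weights * patch + (1.0 - weights) * region    # I_fused
    fused = smooth3x3(fused)                              # I_fused'
    mask = np.ones((h, w))                                # rectangular Mask (assumed)
    result[top:top + h, left:left + w] = region + mask * (fused - region)
    return result


frame = np.zeros((20, 20))          # stand-in for the zoom frame
patch = np.full((9, 9), 100.0)      # stand-in for the high-definition target
enhanced = fuse_target(frame, patch, top=5, left=5)
print(enhanced[9, 9] > enhanced[5, 5])  # center keeps more of the patch than a corner
```

A non-rectangular Mask (for example a segmentation of the target) could be substituted for the all-ones array without changing the rest of the sketch.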
(7) The fused zoom video frame S1 is processed with a video stream compression coding method to reduce the size of the video file and optimize transmission efficiency. The compressed video stream is then output; the definition of the key targets in it is significantly improved while overall visual quality and smooth playback are maintained.
This step is the final link of the system: efficient compression coding is performed with a suitable video coding standard such as H.264 or H.265 while preserving the enhancement effect.
The invention provides a scene multi-focal-length target video coding method, that is, a video coding method that effectively improves the definition of multiple targets in a video recording system in which zoom and fixed-focus cameras cooperate.
The foregoing is merely a preferred embodiment of the present invention. It should be noted that those skilled in the art may make modifications and adaptations without departing from the principles of the present invention, and such modifications and adaptations are also intended to be covered by the present invention.
Claims (9)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202511018605.8A CN120812412A (en) | 2025-07-23 | 2025-07-23 | Scene multi-focal-length target video coding method |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN120812412A (en) | 2025-10-17 |
Family
ID=97307994
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202511018605.8A Pending CN120812412A (en) | 2025-07-23 | 2025-07-23 | Scene multi-focal-length target video coding method |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN120812412A (en) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN120980197A (en) * | 2025-10-20 | 2025-11-18 | 深圳市道格恒通科技有限公司 | An autofocus method and a rugged mobile phone |
Citations (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101720027A (en) * | 2009-11-27 | 2010-06-02 | 西安电子科技大学 | Method for cooperative acquisition of multi-target videos under different resolutions by variable-focus array camera |
| CN102542545A (en) * | 2010-12-24 | 2012-07-04 | 方正国际软件(北京)有限公司 | Multi-focal length photo fusion method and system and photographing device |
| CN107481213A (en) * | 2017-08-28 | 2017-12-15 | 湖南友哲科技有限公司 | Microscope hypograph multi-layer focusing fusion method |
| CN110830756A (en) * | 2018-08-07 | 2020-02-21 | 华为技术有限公司 | Monitoring method and device |
| CN113936154A (en) * | 2021-11-23 | 2022-01-14 | 上海商汤智能科技有限公司 | Image processing method and device, electronic equipment and storage medium |
| CN116132791A (en) * | 2023-03-10 | 2023-05-16 | 创视微电子(成都)有限公司 | Method and device for acquiring multi-field-depth clear images of multiple moving objects |
| CN119027882A (en) * | 2024-08-30 | 2024-11-26 | 四川国创新视超高清视频科技有限公司 | A dynamic target tracking fusion method for large scene monitoring |
| CN119204592A (en) * | 2024-11-25 | 2024-12-27 | 浙江吉欧科技有限公司 | Shield machine operation and maintenance management system based on data analysis |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |