CN120812412A - Scene multi-focal-length target video coding method - Google Patents

Scene multi-focal-length target video coding method

Info

Publication number
CN120812412A
Authority
CN
China
Prior art keywords
target
video frame
key
camera video
key target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202511018605.8A
Other languages
Chinese (zh)
Inventor
董良
彭熇磊
翟青林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Colink Digital Technology Co ltd
Original Assignee
Beijing Colink Digital Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Colink Digital Technology Co ltd
Priority to CN202511018605.8A
Publication of CN120812412A
Legal status: Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00 Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/95 Computational photography systems, e.g. light-field imaging systems
    • H04N23/951 Computational photography systems, e.g. light-field imaging systems by using two or more images to influence resolution, frame rate or aspect ratio
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/17 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
    • H04N19/172 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a picture, frame or field
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/42 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/85 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using pre-processing or post-processing specially adapted for video compression
    • H04N19/87 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using pre-processing or post-processing specially adapted for video compression involving scene cut or scene change detection in combination with video compression
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00 Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/60 Control of cameras or camera modules
    • H04N23/61 Control of cameras or camera modules based on recognised objects
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00 Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/80 Camera processing pipelines; Components thereof
    • H04N23/84 Camera processing pipelines; Components thereof for processing colour signals
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00 Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/80 Camera processing pipelines; Components thereof
    • H04N23/84 Camera processing pipelines; Components thereof for processing colour signals
    • H04N23/86 Camera processing pipelines; Components thereof for processing colour signals for controlling the colour saturation of colour signals, e.g. automatic chroma control circuits
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00 Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/90 Arrangement of cameras or camera modules, e.g. multiple cameras in TV studios or sports stadiums

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computing Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention provides a scene multi-focal-length target video coding method. The method comprises: determining the key targets that require priority recognition; setting up one zoom video recording system and a plurality of fixed-focus video recording systems; matching the same key targets between the zoom video frame and all fixed-focus video frames according to image data and position information; and fusing the highest-definition image data acquired for each key target into the position of that key target in the zoom video frame. The method enhances the definition of multiple specific targets in a video recording system: the zoom and fixed-focus systems work cooperatively, the specific targets are cropped from both, the capture of each target with the highest definition is determined algorithmically, and that target data is fused into the frames of the zoom video stream, so that multiple targets in the zoom video stream remain in high definition.

Description

Scene multi-focal-length target video coding method
Technical Field
The invention relates to the technical field of video coding, in particular to a scene multi-focal-length target video coding method.
Background
When a conventional video recording system records a video, targets located at several different focal positions are difficult to show clearly in the video stream at the same time. In a sports match, for example, key elements such as athletes, referees and scoreboards are scattered across different positions and therefore cannot all appear sharp in a single picture. Conventional systems are thus limited in use.
Disclosure of Invention
To address the defects of the prior art, the invention provides a scene multi-focal-length target video coding method that effectively solves the above problem.
The technical scheme adopted by the invention is as follows:
The invention provides a scene multi-focal-length target video coding method, comprising the following steps:
Step S1, determining the key targets that require priority recognition, and acquiring a key target feature representation vector for each key target, the feature representation vectors together forming a key target recognition feature matrix $F=\{F_1,F_2,\ldots,F_m\}$, where $F_j$ is the feature representation vector of key target $j$, $j=1,2,\ldots,m$, and $m$ is the number of key targets;
Step S2, setting up one zoom video recording system and $n$ fixed-focus video recording systems facing the shooting scene according to the requirements of the scene, each fixed-focus video recording system being configured with a different shooting angle and focal position;
Step S3, starting the zoom video recording system and the $n$ fixed-focus video recording systems simultaneously for collaborative shooting, and acquiring, at the same acquisition time $t$, a zoom video frame $S_t$ and a set of fixed-focus video frames $\{C_{1,t},C_{2,t},\ldots,C_{n,t}\}$, where $C_{i,t}$ is the fixed-focus video frame acquired by fixed-focus video recording system $i$ at acquisition time $t$, $i=1,2,\ldots,n$;
Step S4, recognizing the $m$ key targets in the zoom video frame $S_t$ based on the key target recognition feature matrix $F=\{F_1,F_2,\ldots,F_m\}$, and acquiring the image data and position information of each recognized key target $j$ in the zoom video frame $S_t$;
performing key target recognition on each fixed-focus video frame $C_{i,t}$ based on the key target recognition feature matrix $F=\{F_1,F_2,\ldots,F_m\}$, recognizing $k_{i,t}$ key targets, and acquiring the image data and position information of each recognized key target $l$ in the fixed-focus video frame $C_{i,t}$, where $l=1,2,\ldots,k_{i,t}$;
Step S5, matching the same key targets between the zoom video frame $S_t$ and each fixed-focus video frame $C_{i,t}$ according to their image data and position information;
for each key target $j$, comparing the image definition of key target $j$ in each fixed-focus video frame $C_{i,t}$ and in the zoom video frame $S_t$; if the image definition of key target $j$ is highest in the zoom video frame $S_t$, no image fusion is performed for that target; otherwise, proceeding to step S6 with the highest-definition image data of key target $j$;
Step S6, fusing the acquired highest-definition image data of key target $j$ into the position of the same key target in the zoom video frame $S_t$ to obtain a fused zoom video frame $S_t$;
Step S7, executing steps S5 to S6 for each of the $m$ key targets in the zoom video frame $S_t$, and then performing video compression coding on the processed zoom video frame $S_t$ to obtain a multi-focal-length fused, encoded zoom video frame $S_t$;
Step S8, outputting the encoded zoom video frame $S_t$, setting $t=t+1$, and returning to step S3.
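The per-frame control flow of steps S4 to S6 can be sketched as follows. This is a minimal illustration, not the patented implementation: the callables `detect`, `match`, `sharpness` and `fuse` are hypothetical placeholders for the operations described in steps S4, S5 and S6, and video compression coding (step S7) is left out.

```python
from typing import Any, Callable, List, Sequence


def process_frame(
    zoom_frame: Any,
    fixed_frames: Sequence[Any],
    detect: Callable[[Any], List[Any]],      # hypothetical: S4 recognition
    match: Callable[[Any, Any], bool],       # hypothetical: S5 matching
    sharpness: Callable[[Any], float],       # hypothetical: definition score
    fuse: Callable[[Any, Any, Any], Any],    # hypothetical: S6 fusion
) -> Any:
    """Run steps S4 to S6 for one acquisition time t and return the fused frame."""
    zoom_targets = detect(zoom_frame)                    # targets in S_t
    fixed_targets = [detect(c) for c in fixed_frames]    # targets in each C_{i,t}

    out = zoom_frame
    for zt in zoom_targets:
        # Start from the zoom frame's own definition; fusion happens only if
        # some fixed-focus capture of the same target is strictly sharper.
        best, best_score = None, sharpness(zt)
        for targets in fixed_targets:
            for ft in targets:
                if match(zt, ft):
                    score = sharpness(ft)
                    if score > best_score:
                        best, best_score = ft, score
        if best is not None:
            out = fuse(out, zt, best)
    return out
```

In a full loop this function would be called once per acquisition time $t$, with the result passed to an encoder before $t$ is incremented.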
Preferably, in step S1, the feature representation vector of key target $j$ is $F_j=\{F_{j,\text{visual}},F_{j,\text{spatial}},F_{j,\text{temporal}}\}$;
wherein:
$F_{j,\text{visual}}$ is the appearance feature vector of key target $j$, obtained by extracting the color histogram, texture features and shape description features of key target $j$;
$F_{j,\text{spatial}}$ is the position feature vector of key target $j$, describing the position distribution and movement trajectory pattern of key target $j$ in the picture;
$F_{j,\text{temporal}}$ is the temporal feature vector of key target $j$, describing the temporal pattern and duration statistics of key target $j$.
Preferably, during shooting the zoom video recording system captures the whole scene and dynamically adjusts its focal length according to scene requirements, to obtain a panoramic video stream containing all key targets.
Preferably, each fixed-focus video recording system $i$ is configured with a shooting angle $\theta_i$ and a specific fixed focal length $f_i$, both of which remain unchanged while the scene is shot, where the shooting angle $\theta_i$ is measured relative to the zoom video recording system.
Preferably, the image data and position information of key target $j$ in the zoom video frame $S_t$ are represented as $E_{s,j,t}=\{T_{s,j,t},P_{s,j,t},B_{s,j,t}\}$, where $T_{s,j,t}$ is the image content matrix of key target $j$ in the zoom video frame $S_t$ at acquisition time $t$, obtained by cropping the image of key target $j$ from the zoom video frame $S_t$; $P_{s,j,t}$ is the position coordinate vector of the center of key target $j$ in the zoom video frame $S_t$ at acquisition time $t$; and $B_{s,j,t}$ is the bounding-box coordinate vector of the minimum circumscribed rectangular region of key target $j$ in the zoom video frame $S_t$ at acquisition time $t$;
The image data and position information of each recognized key target $l$ in the fixed-focus video frame $C_{i,t}$ are represented as $E_{i,l,t}=\{T_{i,l,t},P_{i,l,t},B_{i,l,t}\}$, where $T_{i,l,t}$ is the image content matrix of key target $l$ in the fixed-focus video frame $C_{i,t}$ at acquisition time $t$, obtained by cropping the image of key target $l$ from the fixed-focus video frame $C_{i,t}$; $P_{i,l,t}$ is the position coordinate vector of the center of key target $l$ in the fixed-focus video frame $C_{i,t}$ at acquisition time $t$; and $B_{i,l,t}$ is the bounding-box coordinate vector of the minimum circumscribed rectangular region of key target $l$ in the fixed-focus video frame $C_{i,t}$ at acquisition time $t$.
Preferably, step S4 further includes:
For each key target $j$ recognized in the zoom video frame $S_t$ and each key target $l$ recognized in each fixed-focus video frame $C_{i,t}$, a confidence evaluation function is applied: the cosine similarity $\mathrm{Confidence}$ with the feature representation vector of the corresponding key target from step S1 is calculated, and the recognition is considered valid only when $\mathrm{Confidence}>\tau_{\text{threshold}}$, where $\tau_{\text{threshold}}$ is set in the range 0.7 to 0.9.
Preferably, in step S5, the same key targets in the zoom video frame $S_t$ and each fixed-focus video frame $C_{i,t}$ are matched according to their image data and position information, specifically:
Step S5.1, calculating the appearance similarity $\mathrm{Sim}_{\text{visual}}(j,l)$ between key target $j$ and key target $l$ from the image content matrix $T_{s,j,t}$ of key target $j$ in the zoom video frame $S_t$ at acquisition time $t$ and the image content matrix $T_{i,l,t}$ of key target $l$ in the fixed-focus video frame $C_{i,t}$ at acquisition time $t$;
Step S5.2, calculating the position similarity $\mathrm{Sim}_{\text{spatial}}(j,l)$ between key target $j$ and key target $l$ from the position coordinate vector $P_{s,j,t}$ of key target $j$ in the zoom video frame $S_t$ at acquisition time $t$ and the position coordinate vector $P_{i,l,t}$ of key target $l$ in the fixed-focus video frame $C_{i,t}$ at acquisition time $t$:
$$\mathrm{Sim}_{\text{spatial}}(j,l)=\exp\left(-\frac{\lVert P_{s,j,t}-\mathrm{transform}(P_{i,l,t},M_i)\rVert^{2}}{2\sigma^{2}}\right)$$
where $\mathrm{transform}(P_{i,l,t},M_i)$ is the transformation function that maps $P_{i,l,t}$ from the coordinate system of fixed-focus video recording system $i$ into the coordinate system of the zoom video recording system; $M_i$ is a 3×3 transformation matrix obtained through camera calibration of fixed-focus video recording system $i$; and $\sigma$ is a position tolerance parameter, set to 10% to 20% of the image size of key target $l$ cropped from the fixed-focus video frame $C_{i,t}$;
Step S5.3, calculating the comprehensive similarity between key target $j$ and key target $l$ by the following formula:
$$\mathrm{Sim}(j,l)=\alpha\cdot\mathrm{Sim}_{\text{visual}}(j,l)+\beta\cdot\mathrm{Sim}_{\text{spatial}}(j,l)$$
where $\alpha$ and $\beta$ are the appearance similarity weight and the position similarity weight respectively, with $\alpha+\beta=1$ and $\alpha,\beta>0$; $\beta$ is increased for scenes in which the appearance of the target changes greatly, and $\alpha$ is increased for scenes in which the appearance of the target is relatively stable;
Step S5.4, if the comprehensive similarity $\mathrm{Sim}(j,l)$ between key target $j$ and key target $l$ is greater than the matching threshold $\theta_{\text{match}}$, key target $j$ and key target $l$ are successfully matched and are the same key target.
Preferably, for each key target $j$, the image definition of key target $j$ in each fixed-focus video frame $C_{i,t}$ and in the zoom video frame $S_t$ is compared, specifically:
For each video frame, a definition quantization score is obtained as follows, and image definition is compared on the basis of this score:
Step A, the captured video frame to be scored is referred to as the target image $T$, with a width of $M$ pixels and a height of $N$ pixels; the target image $T$ is converted from the spatial domain to the frequency domain by the following formula:
$$F(u,v)=\sum_{x=0}^{M-1}\sum_{y=0}^{N-1}T(x,y)\,e^{-j2\pi\left(\frac{ux}{M}+\frac{vy}{N}\right)}$$
wherein:
$T(x,y)$ is the pixel value at position $(x,y)$ in the spatial domain, $j$ in the formula is the imaginary unit, $F(u,v)$ is the complex coefficient in the frequency domain, and position $(x,y)$ in the spatial domain corresponds to position $(u,v)$ in the frequency domain;
Step B, the following filtering operation is performed on the complex coefficients $F(u,v)$ in the frequency domain to retain the high-frequency components and filter out the low-frequency components, giving a filter mask $H(u,v)$:
if [condition on $(u,v)$ and $\omega_{\text{cutoff}}$ not reproduced in the source], let $H(u,v)=1$; otherwise let $H(u,v)=0$;
where $\omega_{\text{cutoff}}$ is the cutoff frequency threshold;
Step C, the filter mask $H(u,v)$ and the complex coefficients $F(u,v)$ in the frequency domain are multiplied element by element to perform frequency-domain filtering and extract the high-frequency information of the image, giving the high-frequency representation $F_{\text{high}}(u,v)$:
$$F_{\text{high}}(u,v)=F(u,v)\odot H(u,v)$$
where $\odot$ denotes element-wise multiplication;
Step D, the definition quantization score $\mathrm{Sharpness\_score}(T)$ is obtained by the following formula:
[formula not reproduced in the source]
where $U$ and $V$ are the numbers of rows and columns in the frequency domain, respectively.
Preferably, in step S6, the acquired highest-definition image data of key target $j$ is fused into the position of the same key target in the zoom video frame $S_t$ to obtain a fused zoom video frame $S_t$, specifically:
Step S6.1, positioning the acquired highest-definition image data of key target $j$ so that its center coordinates coincide with the center coordinates of the same key target in the zoom video frame $S_t$; if the two differ in size, adjusting the size of the image data by bilinear interpolation; the image data thereby replaces the same key target in the zoom video frame $S_t$;
Step S6.2, performing color and brightness correction on the image data of the same key target embedded in the zoom video frame $S_t$, so that it is consistent with the color and brightness distribution of the zoom video frame $S_t$, to obtain first image data;
Step S6.3, calculating the weight $w(p)$ of each pixel point $p$ in the first image data by the following formula:
[formula not reproduced in the source]
where $\sigma$ is a parameter controlling the smoothness of the transition, set to 1/6 to 1/4 of the size of the first image data;
Step S6.4, fusing each pixel point $p$ of the first image data by the following formula to obtain a fused image $I_{\text{fused}}$:
[formula not reproduced in the source]
where the terms of the formula are the pixel value of the first image data at pixel point $p$ and $S_{\text{region}}(p)$, the pixel value at position $p$ in the region of the same key target in the zoom video frame $S_t$;
Step S6.5, performing boundary smoothing on the fused image $I_{\text{fused}}$ by the following formula to obtain a boundary-smoothed image $I'_{\text{fused}}$:
$$I'_{\text{fused}}=I_{\text{fused}}*G_{\text{smooth}}$$
where $G_{\text{smooth}}$ is a smoothing filter, $G_{\text{smooth}}=\frac{1}{16}\begin{bmatrix}1&2&1\\2&4&2\\1&2&1\end{bmatrix}$;
Step S6.6, re-embedding the boundary-smoothed image $I'_{\text{fused}}$ into the target region of the zoom video frame $S_t$ to obtain the finally fused zoom video frame $S_{\text{enhanced}}(t)$:
$$S_{\text{enhanced}}(t)=S_t+\mathrm{Mask}\odot\left(I'_{\text{fused}}-S_{\text{region}}\right)$$
where $\mathrm{Mask}$ is the mask matrix of the target region in which the same key target is located in the zoom video frame $S_t$, $S_{\text{region}}$ is the target region in which the same key target is located in the zoom video frame $S_t$, and $S_t$ is the zoom video frame;
Step S6.7, performing video compression coding on the finally fused zoom video frame $S_{\text{enhanced}}(t)$ and outputting it.
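Steps S6.3 to S6.6 can be illustrated with the sketch below for a single-channel patch. The formulas for the weight $w(p)$ and for the per-pixel fusion are not reproduced in this text, so the code assumes a Gaussian fall-off from the patch center for $w(p)$ and a convex combination $w(p)\cdot\text{patch}(p)+(1-w(p))\cdot S_{\text{region}}(p)$ for the fusion; both are assumptions, not the patent's stated formulas. Only the 3×3 kernel and the mask re-embedding follow the text directly; replicating edge pixels at the border of the patch is a further assumption.

```python
import math
from typing import List

Image = List[List[float]]

# Smoothing kernel G_smooth from step S6.5 (to be divided by 16).
G_SMOOTH = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]


def gaussian_weights(height: int, width: int, sigma: float) -> Image:
    """ASSUMED form of w(p): Gaussian of the distance to the patch center."""
    cy, cx = (height - 1) / 2, (width - 1) / 2
    return [
        [math.exp(-((y - cy) ** 2 + (x - cx) ** 2) / (2 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]


def blend(patch: Image, region: Image, weights: Image) -> Image:
    """ASSUMED fusion rule: w * patch + (1 - w) * region, pixel by pixel."""
    return [
        [w * p + (1 - w) * r for p, r, w in zip(prow, rrow, wrow)]
        for prow, rrow, wrow in zip(patch, region, weights)
    ]


def smooth(img: Image) -> Image:
    """Convolve with G_smooth / 16; border pixels are replicated (assumption)."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    acc += G_SMOOTH[dy + 1][dx + 1] * img[yy][xx]
            out[y][x] = acc / 16
    return out


def embed(frame: Image, fused: Image, mask: Image, top: int, left: int) -> Image:
    """S_enhanced = S_t + Mask * (I'_fused - S_region), applied on the region."""
    out = [row[:] for row in frame]
    for y, row in enumerate(fused):
        for x, value in enumerate(row):
            base = frame[top + y][left + x]
            out[top + y][left + x] = base + mask[y][x] * (value - base)
    return out
```

A caller would chain these as `embed(frame, smooth(blend(patch, region, gaussian_weights(h, w, sigma))), mask, top, left)`, with `sigma` chosen as 1/6 to 1/4 of the patch size as stated in step S6.3.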
The scene multi-focal-length target video coding method provided by the invention has the following advantages:
The method enhances the definition of multiple specific targets in a video recording system. A zoom video recording system and fixed-focus video recording systems work cooperatively: the specific targets are cropped from the fixed-focus and zoom frames, the capture of each target with the highest definition is determined algorithmically, and its data is fused into the frames of the zoom video stream, so that multiple targets in the zoom video stream remain in high definition.
Drawings
Fig. 1 is a flowchart of a method for encoding a scene multi-focal-length target video.
Fig. 2 is a flowchart of an embodiment of a method for encoding a scene multi-focal-length target video according to the present invention.
Detailed Description
To make the technical problem solved, the technical solution and the beneficial effects of the invention clearer, the invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein serve only to illustrate the invention and are not intended to limit its scope.
Referring to Fig. 1 and Fig. 2, the invention provides a scene multi-focal-length target video coding method, comprising the following steps:
Step S1, determining the key targets that require priority recognition, and acquiring a key target feature representation vector for each key target, the feature representation vectors together forming a key target recognition feature matrix $F=\{F_1,F_2,\ldots,F_m\}$, where $F_j$ is the feature representation vector of key target $j$, $j=1,2,\ldots,m$, and $m$ is the number of key targets;
Specifically, the feature representation vector of key target $j$ is $F_j=\{F_{j,\text{visual}},F_{j,\text{spatial}},F_{j,\text{temporal}}\}$, where $F_{j,\text{visual}}$ is the appearance feature vector of key target $j$, obtained by extracting the color histogram, texture features and shape description features of key target $j$; $F_{j,\text{spatial}}$ is the position feature vector of key target $j$, describing the position distribution and movement trajectory pattern of key target $j$ in the picture; and $F_{j,\text{temporal}}$ is the temporal feature vector of key target $j$, describing the temporal pattern and duration statistics of key target $j$.
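One possible in-memory layout for the feature representation vector $F_j$ is sketched below. The class name `KeyTargetFeature`, the helper `gray_histogram` and the choice of a normalized grayscale histogram as the only appearance feature are illustrative assumptions; the text does not prescribe how the color histogram, texture or shape features are computed or how the three parts are combined.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class KeyTargetFeature:
    """Hypothetical container for F_j = {F_visual, F_spatial, F_temporal}."""
    visual: List[float]    # appearance features (e.g. histogram bins)
    spatial: List[float]   # position distribution / trajectory descriptors
    temporal: List[float]  # temporal pattern / duration statistics

    def as_vector(self) -> List[float]:
        # Concatenation is an assumption; the source only lists the three parts.
        return [*self.visual, *self.spatial, *self.temporal]


def gray_histogram(pixels: Sequence[Sequence[int]], bins: int = 4,
                   max_value: int = 255) -> List[float]:
    """Normalized intensity histogram, standing in for the color histogram."""
    counts = [0] * bins
    total = 0
    for row in pixels:
        for value in row:
            index = min(value * bins // (max_value + 1), bins - 1)
            counts[index] += 1
            total += 1
    return [c / total for c in counts] if total else [0.0] * bins


# The recognition feature matrix F is then simply a list of m such entries.
feature_matrix = [
    KeyTargetFeature(gray_histogram([[0, 64], [128, 255]]), [0.5, 0.5], [1.0]),
]
```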
Step S2, setting up one zoom video recording system and $n$ fixed-focus video recording systems facing the shooting scene according to the requirements of the scene, each fixed-focus video recording system being configured with a different shooting angle and focal position;
During shooting, the zoom video recording system captures the whole scene and dynamically adjusts its focal length according to scene requirements, to obtain a panoramic video stream containing all key targets.
Each fixed-focus video recording system $i$ is configured with a shooting angle $\theta_i$ and a specific fixed focal length $f_i$, both of which remain unchanged while the scene is shot, where the shooting angle $\theta_i$ is measured relative to the zoom video recording system.
Step S3, starting the zoom video recording system and the $n$ fixed-focus video recording systems simultaneously for collaborative shooting, and acquiring, at the same acquisition time $t$, a zoom video frame $S_t$ and a set of fixed-focus video frames $\{C_{1,t},C_{2,t},\ldots,C_{n,t}\}$, where $C_{i,t}$ is the fixed-focus video frame acquired by fixed-focus video recording system $i$ at acquisition time $t$, $i=1,2,\ldots,n$;
Step S4, recognizing the $m$ key targets in the zoom video frame $S_t$ based on the key target recognition feature matrix $F=\{F_1,F_2,\ldots,F_m\}$, and acquiring the image data and position information of each recognized key target $j$ in the zoom video frame $S_t$;
performing key target recognition on each fixed-focus video frame $C_{i,t}$ based on the key target recognition feature matrix $F=\{F_1,F_2,\ldots,F_m\}$, recognizing $k_{i,t}$ key targets, and acquiring the image data and position information of each recognized key target $l$ in the fixed-focus video frame $C_{i,t}$, where $l=1,2,\ldots,k_{i,t}$;
Specifically, the image data and position information of key target $j$ in the zoom video frame $S_t$ are represented as $E_{s,j,t}=\{T_{s,j,t},P_{s,j,t},B_{s,j,t}\}$, where $T_{s,j,t}$ is the image content matrix of key target $j$ in the zoom video frame $S_t$ at acquisition time $t$, obtained by cropping the image of key target $j$ from the zoom video frame $S_t$; $P_{s,j,t}$ is the position coordinate vector of the center of key target $j$ in the zoom video frame $S_t$ at acquisition time $t$; and $B_{s,j,t}$ is the bounding-box coordinate vector of the minimum circumscribed rectangular region of key target $j$ in the zoom video frame $S_t$ at acquisition time $t$;
The image data and position information of each recognized key target $l$ in the fixed-focus video frame $C_{i,t}$ are represented as $E_{i,l,t}=\{T_{i,l,t},P_{i,l,t},B_{i,l,t}\}$, where $T_{i,l,t}$ is the image content matrix of key target $l$ in the fixed-focus video frame $C_{i,t}$ at acquisition time $t$, obtained by cropping the image of key target $l$ from the fixed-focus video frame $C_{i,t}$; $P_{i,l,t}$ is the position coordinate vector of the center of key target $l$ in the fixed-focus video frame $C_{i,t}$ at acquisition time $t$; and $B_{i,l,t}$ is the bounding-box coordinate vector of the minimum circumscribed rectangular region of key target $l$ in the fixed-focus video frame $C_{i,t}$ at acquisition time $t$.
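The triple $E=\{T,P,B\}$ can be pictured as a small record built from a frame and a bounding box. In the sketch below, the class `TargetObservation`, the function `extract_observation` and the `(x_min, y_min, x_max, y_max)` box convention with exclusive upper bounds are assumptions made for illustration; the text does not fix a coordinate convention.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

BBox = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), max exclusive


@dataclass
class TargetObservation:
    """Hypothetical record for E = {T, P, B} of one recognized key target."""
    image: List[List[int]]         # T: cropped image content matrix
    center: Tuple[float, float]    # P: center position (x, y) in the frame
    bbox: BBox                     # B: minimum circumscribed rectangle


def extract_observation(frame: Sequence[Sequence[int]], bbox: BBox) -> TargetObservation:
    """Crop the target from a frame stored as rows of pixels (frame[y][x])."""
    x0, y0, x1, y1 = bbox
    if not (0 <= x0 < x1 and 0 <= y0 < y1):
        raise ValueError("bounding box must be non-empty and non-negative")
    crop = [list(row[x0:x1]) for row in frame[y0:y1]]
    center = ((x0 + x1) / 2, (y0 + y1) / 2)
    return TargetObservation(image=crop, center=center, bbox=bbox)
```

The same record type serves both the zoom frame and every fixed-focus frame, which keeps the later matching and definition comparison symmetric.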
For each key target $j$ recognized in the zoom video frame $S_t$ and each key target $l$ recognized in each fixed-focus video frame $C_{i,t}$, a confidence evaluation function is applied: the cosine similarity $\mathrm{Confidence}$ with the feature representation vector of the corresponding key target from step S1 is calculated, and the recognition is considered valid only when $\mathrm{Confidence}>\tau_{\text{threshold}}$, where $\tau_{\text{threshold}}$ is set in the range 0.7 to 0.9.
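The confidence check amounts to a cosine similarity followed by a threshold comparison. A minimal sketch, assuming both feature vectors are flat lists of equal length (the function names and the default $\tau_{\text{threshold}}=0.8$, picked from the stated 0.7 to 0.9 range, are illustrative):

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as having no similarity
    return dot / (norm_a * norm_b)


def is_valid_recognition(candidate: Sequence[float], reference: Sequence[float],
                         tau_threshold: float = 0.8) -> bool:
    """Accept a recognition only when Confidence > tau_threshold."""
    return cosine_similarity(candidate, reference) > tau_threshold
```

A lower threshold admits more candidates at the risk of false recognitions; a higher one is stricter but may drop targets whose appearance has drifted from the reference features.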
Step S5, matching the same key targets between the zoom video frame $S_t$ and each fixed-focus video frame $C_{i,t}$ according to their image data and position information;
The specific matching method comprises the following steps:
Step S5.1, calculating the appearance similarity $\mathrm{Sim}_{\text{visual}}(j,l)$ between key target $j$ and key target $l$ from the image content matrix $T_{s,j,t}$ of key target $j$ in the zoom video frame $S_t$ at acquisition time $t$ and the image content matrix $T_{i,l,t}$ of key target $l$ in the fixed-focus video frame $C_{i,t}$ at acquisition time $t$;
Step S5.2, calculating the position similarity $\mathrm{Sim}_{\text{spatial}}(j,l)$ between key target $j$ and key target $l$ from the position coordinate vector $P_{s,j,t}$ of key target $j$ in the zoom video frame $S_t$ at acquisition time $t$ and the position coordinate vector $P_{i,l,t}$ of key target $l$ in the fixed-focus video frame $C_{i,t}$ at acquisition time $t$:
$$\mathrm{Sim}_{\text{spatial}}(j,l)=\exp\left(-\frac{\lVert P_{s,j,t}-\mathrm{transform}(P_{i,l,t},M_i)\rVert^{2}}{2\sigma^{2}}\right)$$
where $\mathrm{transform}(P_{i,l,t},M_i)$ is the transformation function that maps $P_{i,l,t}$ from the coordinate system of fixed-focus video recording system $i$ into the coordinate system of the zoom video recording system; $M_i$ is a 3×3 transformation matrix obtained through camera calibration of fixed-focus video recording system $i$; and $\sigma$ is a position tolerance parameter, set to 10% to 20% of the image size of key target $l$ cropped from the fixed-focus video frame $C_{i,t}$;
Step S5.3, calculating the comprehensive similarity between key target $j$ and key target $l$ by the following formula:
$$\mathrm{Sim}(j,l)=\alpha\cdot\mathrm{Sim}_{\text{visual}}(j,l)+\beta\cdot\mathrm{Sim}_{\text{spatial}}(j,l)$$
where $\alpha$ and $\beta$ are the appearance similarity weight and the position similarity weight respectively, with $\alpha+\beta=1$ and $\alpha,\beta>0$; $\beta$ is increased for scenes in which the appearance of the target changes greatly, and $\alpha$ is increased for scenes in which the appearance of the target is relatively stable;
Step S5.4, if the comprehensive similarity $\mathrm{Sim}(j,l)$ between key target $j$ and key target $l$ is greater than the matching threshold $\theta_{\text{match}}$, key target $j$ and key target $l$ are successfully matched and are the same key target.
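Steps S5.2 to S5.4 can be sketched as below. The code treats $M_i$ as a planar homography applied in homogeneous coordinates, which is an assumption about what the 3×3 calibration matrix represents. The appearance similarity is passed in as a number because the text does not specify how $\mathrm{Sim}_{\text{visual}}$ is computed, and the default values for `alpha` and `theta_match` are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]
Matrix3 = Sequence[Sequence[float]]


def transform(p: Point, m: Matrix3) -> Point:
    """Map a point into the zoom camera's frame (homography assumption)."""
    x, y = p
    xh = m[0][0] * x + m[0][1] * y + m[0][2]
    yh = m[1][0] * x + m[1][1] * y + m[1][2]
    w = m[2][0] * x + m[2][1] * y + m[2][2]
    if w == 0:
        raise ValueError("point maps to infinity under this matrix")
    return (xh / w, yh / w)


def sim_spatial(p_zoom: Point, p_fixed: Point, m: Matrix3, sigma: float) -> float:
    """exp(-||P_s - transform(P_i, M_i)||^2 / (2 sigma^2))."""
    qx, qy = transform(p_fixed, m)
    d2 = (p_zoom[0] - qx) ** 2 + (p_zoom[1] - qy) ** 2
    return math.exp(-d2 / (2 * sigma ** 2))


def combined_similarity(sim_visual: float, sim_sp: float, alpha: float = 0.5) -> float:
    """alpha * Sim_visual + beta * Sim_spatial with beta = 1 - alpha."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    return alpha * sim_visual + (1.0 - alpha) * sim_sp


def is_match(sim_visual: float, sim_sp: float, alpha: float = 0.5,
             theta_match: float = 0.6) -> bool:
    """Step S5.4: the pair matches when Sim(j, l) > theta_match."""
    return combined_similarity(sim_visual, sim_sp, alpha) > theta_match
```

Deriving $\beta$ as $1-\alpha$ keeps the constraint $\alpha+\beta=1$ true by construction, so only one weight needs tuning per scene.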
Comparing the image definition of each key target j in each fixed-focus shooting video frame C i,t and the image definition of each zoom shooting video frame S t, and if the image definition of each key target j in the zoom shooting video frame S t is highest, not performing image fusion processing;
specifically, for each captured video frame, the following manner is adopted to obtain the definition quantization score thereof; and comparing the image definition based on the definition quantization score:
The method comprises the step A of converting an imaging video frame needing definition quantization scoring into a target image, wherein the imaging video frame needing definition quantization scoring is called as the target image T, the width of the target image T is M pixels, the height of the target image T is N pixels, and the target image T is converted from a space domain to a frequency domain by adopting the following formula:
wherein:
T (x, y) represents the pixel value of the (x, y) position in the spatial domain, j is an imaginary unit in the formula, F (u, v) represents the complex coefficient in the frequency domain, and the (x, y) position in the spatial domain corresponds to the (u, v) position in the frequency domain;
Step B: for the complex coefficients F(u, v) in the frequency domain, the following filtering operation is performed to retain high-frequency components and filter out low-frequency components, obtaining the filtered image H(u, v):
if the frequency at position (u, v) exceeds the cutoff frequency threshold, let H(u, v) = 1; otherwise, let H(u, v) = 0;
wherein ω_cutoff is a cutoff frequency threshold;
Step C: the filtered image H(u, v) is multiplied element by element with the complex coefficients F(u, v) in the frequency domain, using the following formula, to implement frequency-domain filtering and extract the high-frequency information in the image, obtaining the high-frequency information representation F_high(u, v):
F_high(u,v)=F(u,v)⊙H(u,v)
wherein ⊙ denotes the element-by-element product;
Step D: the definition quantization score Sharpness_Score(T) is obtained using the following formula:
Wherein U and V are the lengths of the rows and columns in the frequency domain, respectively.
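Steps A to D can be illustrated with the following sketch. The formulas themselves are not reproduced in this text, so the code rests on stated assumptions: a standard two-dimensional discrete Fourier transform, a radial cutoff measured from the zero-frequency term, and a score equal to the ratio of high-frequency energy to total energy (the embodiment below describes the score as the proportion of high-frequency energy to total energy). The function names are hypothetical, and the naive transform is written for clarity only; a library routine such as numpy.fft.fft2 would normally be used.

```python
import cmath
import math


def dft2(image):
    """Naive 2-D DFT of a grayscale image given as a list of N rows of M values."""
    n_rows, n_cols = len(image), len(image[0])
    spectrum = [[0j] * n_cols for _ in range(n_rows)]
    for v in range(n_rows):
        for u in range(n_cols):
            acc = 0j
            for y in range(n_rows):
                for x in range(n_cols):
                    angle = -2.0 * math.pi * (u * x / n_cols + v * y / n_rows)
                    acc += image[y][x] * cmath.exp(1j * angle)
            spectrum[v][u] = acc
    return spectrum


def sharpness_score(image, cutoff):
    """Assumed score: energy of coefficients beyond the cutoff divided by total energy."""
    spectrum = dft2(image)
    n_rows, n_cols = len(spectrum), len(spectrum[0])
    high = total = 0.0
    for v in range(n_rows):
        for u in range(n_cols):
            # Wrap indices so that distance is measured from the zero-frequency term.
            fu, fv = min(u, n_cols - u), min(v, n_rows - v)
            energy = abs(spectrum[v][u]) ** 2
            total += energy
            if math.hypot(fu, fv) > cutoff:  # H(u, v) = 1 in the assumed high-pass mask
                high += energy
    return high / total if total > 0 else 0.0
```

Under these assumptions a uniform image scores approximately 0, while an image with fine alternating detail scores noticeably higher, which is the ordering the comparison in step S5 relies on.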
Step S6: the acquired highest-definition image data of key target j is fused into the position of the same key target in the zoom camera video frame S_t to obtain the fused zoom camera video frame S_t;
specifically:
Step S6.1: the center coordinates of the acquired highest-definition image data of key target j are aligned with the center coordinates of the same key target in the zoom camera video frame S_t; if the two differ in size, the image data is resized by bilinear interpolation, so that the image data replaces the region of the same key target in the zoom camera video frame S_t;
Step S6.2: color and brightness correction is performed on the image data embedded at the same key target in the zoom camera video frame S_t so that its color and brightness distribution matches that of the zoom camera video frame S_t, obtaining first image data;
Step S6.3: for each pixel point p in the first image data, its weight w(p) is calculated using the following formula:
wherein σ is a parameter controlling the smoothness of the transition and is set to 1/6 to 1/4 of the size of the first image data;
Step S6.4: each pixel point p in the first image data is fused using the following formula to obtain the fused image I_fused:
wherein the formula combines the pixel value at position p in the first image data with S_region(p), the pixel value at position p in the same key target region of the zoom camera video frame S_t;
Step S6.5: boundary smoothing is performed on the fused image I_fused using the following formula to obtain the boundary-smoothed image I_fused":
I_fused"=I_fused*G_smooth
wherein G_smooth is a smoothing filter, G_smooth= (1/16) [1,2,1;2,4,2;1,2,1];
Step S6.6: the boundary-smoothed image I_fused" is re-embedded into the target area in the zoom camera video frame S_t to obtain the finally fused zoom camera video frame S_enhanced(t):
S_enhanced(t)=S_t+Mask⊙(I_fused"-S_region)
wherein Mask is the mask matrix of the target area in which the same key target is located in the zoom camera video frame S_t, S_region is that target area, and S_t is the zoom camera video frame;
Step S6.7: video compression coding is performed on the finally fused zoom camera video frame S_enhanced(t), and the result is output.
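Steps S6.3 to S6.6 can be sketched on a single-channel patch as follows. The weight and blending formulas are not reproduced in this text, so the sketch assumes a Gaussian weight centered on the patch, w(p) = exp(-d(p)^2 / (2σ^2)), and a convex blend I_fused(p) = w(p)·T'(p) + (1 - w(p))·S_region(p); both are assumptions consistent with the Gaussian-weighted fusion described in the embodiment, not the confirmed formulas. The smoothing kernel and the re-embedding rule are taken from the text. All function names and the border handling (edge replication) are illustrative.

```python
import math

# Step S6.5 kernel; the 1/16 factor is applied inside smooth()
G_SMOOTH = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]


def gaussian_weights(height, width, sigma):
    """Assumed w(p): 1 at the patch center, decaying toward the edges."""
    cy, cx = (height - 1) / 2.0, (width - 1) / 2.0
    return [[math.exp(-((y - cy) ** 2 + (x - cx) ** 2) / (2.0 * sigma ** 2))
             for x in range(width)] for y in range(height)]


def blend(patch, region, weights):
    """Assumed fusion: w * patch + (1 - w) * region, pixel by pixel."""
    return [[w * p + (1.0 - w) * s for p, s, w in zip(pr, sr, wr)]
            for pr, sr, wr in zip(patch, region, weights)]


def smooth(img):
    """I_fused" = I_fused * G_smooth, replicating edge pixels at the border."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    acc += G_SMOOTH[dy + 1][dx + 1] * img[yy][xx]
            out[y][x] = acc / 16.0
    return out


def embed(frame, fused, top, left, mask):
    """S_enhanced = S_t + Mask * (I_fused" - S_region) over the target area."""
    out = [row[:] for row in frame]  # leave the input frame untouched
    for y, (frow, mrow) in enumerate(zip(fused, mask)):
        for x, (f, m) in enumerate(zip(frow, mrow)):
            s = frame[top + y][left + x]
            out[top + y][left + x] = s + m * (f - s)
    return out
```

With σ chosen between 1/6 and 1/4 of the patch size, as the text specifies, the weight is close to 1 near the target center and falls off toward the patch border, so the original zoom-frame pixels dominate at the edges.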
Step S7, for m key targets in the zoom camera video frame S t, executing steps S5 to S6, and then performing video compression coding on the processed zoom camera video frame S t to obtain a multi-focal-length fused coded zoom camera video frame S t;
step S8, outputting the encoded zoom camera video frame S t, enabling t=t+1, and returning to step S3.
The scene multi-focal-length target video coding method combines a zoom video recording system with a plurality of fixed-focus video recording systems working cooperatively. The shooting angle and focal length position of each fixed-focus video recording system are set according to preset scene requirements so as to capture target details in different depth-of-field ranges, yielding one zoom video stream and multiple fixed-focus video streams. Artificial intelligence image recognition is used to automatically identify and crop, from these video streams, the target areas corresponding to a plurality of target pictures or names predefined by the user. A position information and feature matching algorithm establishes the correspondence between specific targets in the zoom video stream and in the fixed-focus video streams. For each target, the algorithm traverses all instances of the target in the related video streams and selects the target image data with the highest definition by transforming to the frequency domain and then applying band-pass filtering. The highest-definition target image data then replaces the corresponding target position in the zoom video stream, and the replaced area is smoothed, so that all key targets in the resulting zoom video stream reach their best definition. Finally, the optimized zoom video stream is compressed using video compression coding.
One embodiment is described below:
(1) A recognition database of the targets to be enhanced is predefined; each target to be enhanced may be specified by a photo, a name, or a feature description;
Specifically, the system constructs a database containing all important target feature information, including elements such as optical characteristics, spatial positions, time sequence relations and the like, to form a complete target feature library.
Specifically, the user predefines a set of important targets O = {O_1, O_2, ..., O_m} that should remain high-definition in the video, where m is the total number of targets. This predefinition solves the technical problem that conventional video recording systems cannot distinguish important targets from the background. A target may be specified in three flexible ways:
providing a reference picture of the target, that is, directly uploading a standard image of the target; this suits fixed targets with a known appearance;
entering the name of the target, such as "athlete", "referee", or "scoreboard", upon which the system calls a pre-trained semantic recognition model;
describing features of the target, such as "person wearing a red jersey"; this supports flexible attribute-based description.
To achieve accurate recognition, the system builds a multidimensional feature vector for each target, which is one of the key technical innovations of the present invention. The feature vector contains three types of complementary information:
appearance feature vector: visual characteristics of the target such as its color histogram, texture features, and shape description, typically 512-1024 dimensions;
position feature vector: the usual position distribution and movement trajectory pattern of the target in the picture, typically 64-128 dimensions;
time feature vector: the temporal pattern of the target's appearance and its duration statistics, typically 32-64 dimensions.
The complete feature representation of each target is formed by concatenating the appearance, position, and time feature vectors; the concatenated feature vector typically has a total of 600-1200 dimensions.
The feature files of all targets are combined into a target recognition database matrix which is used as a lookup table of a subsequent recognition algorithm to support rapid similarity calculation and target matching.
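The concatenation and the database matrix described above can be sketched as follows. The function names are hypothetical, and the feature extractors that would produce the three sub-vectors are outside the scope of this sketch.

```python
def build_feature_vector(visual, spatial, temporal):
    """Concatenate appearance, position, and time features into one vector."""
    return list(visual) + list(spatial) + list(temporal)


def build_database(targets):
    """Stack one concatenated vector per target into a matrix (list of rows).

    `targets` is a sequence of (visual, spatial, temporal) triples. All rows
    must have the same length so the result can serve as a lookup table.
    """
    rows = [build_feature_vector(v, s, t) for v, s, t in targets]
    if len({len(r) for r in rows}) > 1:
        raise ValueError("all targets must share the same feature dimensions")
    return rows
```

For example, with the smallest dimensions quoted above (512, 64, and 32), each row has 608 entries; with the largest (1024, 128, and 64), each row has 1216, which is consistent with the stated range of roughly 600-1200.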
(2) The zoom video recording system and the plurality of fixed-focus video recording systems are started simultaneously to record, and, at the same moment, the frame S1 of the zoom video stream and the frame set {C1, C2, ..., Cn} of the fixed-focus video streams are acquired.
Through the collaborative recording of multiple cameras, the technical problem that a single camera can not obtain a plurality of different-distance target clear images simultaneously is solved, and the combination of panoramic coverage and local high definition is realized through the collaborative work of the multiple cameras.
The system adopts a camera configuration strategy of 'one main camera and multiple auxiliary cameras', and simultaneously starts two types of video equipment to perform cooperative work:
The zoom camera Z serves as the main camera and is responsible for shooting the whole scene; its focal length can be adjusted dynamically according to scene requirements to obtain a panoramic video stream containing all targets. Its advantage is a large field of view that captures complete scene information; its drawback is limited definition for distant targets.
The set of n fixed-focus cameras F = {F_1, F_2, ..., F_n} serves as the auxiliary cameras; each is fixed in advance at a specific focal length and a specific shooting angle and is dedicated to shooting targets within a specific distance range. This design ensures that, at each distance level, a dedicated camera provides a high-definition target image.
At time t, the system achieves strict time synchronization, and synchronously acquires a main video stream frame from the zoom camera and an auxiliary video stream frame set from each fixed-focus camera. Different cameras may have different resolution configurations.
Time synchronization: all cameras use a hardware synchronization signal or the Network Time Protocol (NTP) to keep the time difference between frames below 1/60 second, avoiding target position deviation caused by desynchronization.
The camera parameter configuration matrix is P_camera = [f_1, θ_1; f_2, θ_2; ...; f_n, θ_n], where f_i is the focal length of the i-th fixed-focus camera (in mm) and θ_i is its shooting angle relative to the main camera (in degrees). The matrix records the geometric configuration of the whole camera array and is used for subsequent coordinate transformation and target matching.
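A small sketch of the configuration matrix and the synchronization check follows. The class name, field names, and the representation of timestamps as seconds are illustrative assumptions; only the 1/60 second tolerance and the (f_i, θ_i) row layout come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FixedFocusCamera:
    focal_length_mm: float  # f_i
    angle_deg: float        # theta_i, relative to the main (zoom) camera


def camera_matrix(cameras):
    """P_camera as a list of [f_i, theta_i] rows, one row per fixed-focus camera."""
    return [[c.focal_length_mm, c.angle_deg] for c in cameras]


def frames_synchronized(timestamps, tolerance=1.0 / 60.0):
    """True when all frame timestamps (in seconds) lie within the tolerance."""
    return max(timestamps) - min(timestamps) < tolerance
```

A frame set that fails this check would be expected to show the position deviation the text warns about, so a caller might drop or re-align such a set before matching.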
(3) Using artificial intelligence recognition techniques, the preset key targets are identified in frames S1 and {C1, C2, ..., Cn}. The image data of each identified key target and its precise position information are recorded.
Specifically, the system uses AI_Detector, an artificial intelligence image recognition algorithm based on a convolutional neural network, which combines target detection and feature matching to automatically find and recognize the preset key targets in the video streams of all cameras. Its technical advantage is that it can handle complex conditions such as scale change, illumination change, and partial occlusion of the target.
The specific identification process adopts a parallel processing architecture:
the system takes the target identification database matrix phi established in the first step as a reference standard, and loads the reference standard into the GPU memory so as to improve the query speed;
Scanning each frame of picture in the main video stream S (t) in real time, and identifying all preset targets by using a sliding window and a multi-scale detection technology;
and meanwhile, the same identification process is carried out on pictures in each auxiliary video stream, and the identification tasks of all cameras are executed in parallel, so that the processing efficiency is improved.
For each identified target, the system will accurately record the triplet (T, P, B):
T is the image content matrix of the target, obtained by cropping the target area from the original image with the original pixel values preserved; P is the coordinate vector of the center position of the target in the picture, in pixels, with the origin at the upper left corner of the image; B is the bounding box coordinate vector of the target, defining the minimum rectangular area containing the target.
In order to ensure the identification accuracy, a confidence evaluation mechanism is introduced into the system, specifically, a confidence evaluation function is adopted, the cosine similarity between the detected target feature and the target feature in the database is calculated by the function, and the closer the value is to 1, the higher the matching degree is.
A recognition is considered valid only if Confidence > τ_threshold; τ_threshold is typically set to 0.7-0.9 and can be adjusted according to the accuracy requirements of the application scenario.
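The confidence check can be sketched as follows. The function names and the default τ_threshold = 0.8 (a value inside the stated 0.7-0.9 range) are illustrative assumptions; returning 0 for a zero-length vector is also a choice made for this sketch.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an empty feature as no match
    return dot / (norm_a * norm_b)


def is_valid_detection(detected, reference, tau_threshold=0.8):
    """A detection is kept only when Confidence > tau_threshold."""
    return cosine_similarity(detected, reference) > tau_threshold
```

Because cosine similarity ignores vector magnitude, scaling a feature vector by a positive constant does not change the confidence value.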
Quality control of recognition results: the system also records quality indexes for each recognition result, including the integrity of the target (whether it is occluded), the definition of the image, the illumination condition, and the like, to provide a basis for subsequent target selection.
(4) According to the image data and the position information, the correspondence of the same key target between the zoom video frame S1 and the fixed-focus video frame set {C1, C2, ..., Cn} is accurately matched.
The method solves the key technical problem of how to accurately identify whether the targets shot by different cameras are the same target in the multi-camera system. This is a precondition for achieving the target sharpness improvement.
The same target may appear in the main video stream and in several auxiliary video streams at the same time, but its appearance may differ between cameras because of different shooting angles, distances, and illumination conditions. The system therefore needs to establish the correspondence between targets and determine which detections are different views of the same target. The technical innovation of the matching algorithm is that it considers information from multiple dimensions together, avoiding the limitation of single-feature matching. The match determination is based on a weighted combination of two main factors:
First, the appearance similarity is calculated by comparing the visual feature vectors of the two target images, including color distribution, texture pattern, shape contour, and the like. Cosine similarity is insensitive to changes in image brightness and is therefore suitable for matching targets under different illumination conditions.
Then, the position similarity is calculated, and the function considers the spatial position relation of the targets in different cameras.
And finally, carrying out weighted summation on the appearance similarity and the position similarity to obtain the comprehensive similarity.
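The position similarity can be illustrated with the formula given in claim 7, Simspatial(j,l) = exp(-||P_{s,j,t} - transform(P_{i,l,t}, M_i)||^2 / (2σ^2)). In the sketch below, transform is assumed to apply the 3×3 matrix M_i to the point in homogeneous coordinates; that interpretation, and the function names, are assumptions rather than details stated in the text.

```python
import math


def transform(point, m):
    """Map a 2-D point through a 3x3 matrix using homogeneous coordinates (assumed)."""
    x, y = point
    xh = m[0][0] * x + m[0][1] * y + m[0][2]
    yh = m[1][0] * x + m[1][1] * y + m[1][2]
    wh = m[2][0] * x + m[2][1] * y + m[2][2]
    if wh == 0.0:
        raise ValueError("point maps to infinity under this matrix")
    return (xh / wh, yh / wh)


def spatial_similarity(p_zoom, p_fixed, m_i, sigma):
    """exp(-d^2 / (2 sigma^2)), d = distance between p_zoom and the mapped p_fixed."""
    tx, ty = transform(p_fixed, m_i)
    d2 = (p_zoom[0] - tx) ** 2 + (p_zoom[1] - ty) ** 2
    return math.exp(-d2 / (2.0 * sigma ** 2))
```

The value is 1 when the mapped center coincides with the center found in the zoom frame and decays smoothly with distance; per claim 7, σ is set to 10%-20% of the cropped target size.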
(5) For each matching key object, the image is transformed from the spatial domain to the frequency domain. A specially designed high-pass filtering algorithm is applied to allow the passage of frequency information above a preset threshold to extract high-frequency components. The object F1 with the highest definition is evaluated and selected by comparing the intensity and the quantity of high-frequency components of different key object images.
Specifically, in the present invention, for each matched object, the system needs to select the version with the highest definition from the main video stream and each auxiliary video stream. The principle of sharpness evaluation is that a sharp image contains more high frequency components and a blurred image contains less high frequency components. The sharpness evaluation adopts a frequency domain analysis technology, and the method has the advantages of objectivity, accuracy and no influence of subjective feeling of human eyes.
Further, a definition quantization scoring mode is adopted, the proportion of high-frequency energy to total energy is calculated, and the larger the value is, the clearer the image is. The normalization process of the denominator ensures comparability between different size images.
To avoid noise interference, the difference between the highest score and the second-highest score can be checked; when the difference is smaller than a threshold, the decision is made by also considering other quality indexes (such as contrast and saturation). In this way the image version with the highest definition is selected for each key target, providing the best material for subsequent image fusion.
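The selection with a margin check can be sketched as follows. The function name, the dictionary input, and the default margin of 0.05 are illustrative assumptions; the text does not specify the threshold or how the other quality indexes are combined, so the sketch only flags the ambiguous case.

```python
def select_sharpest(scores, margin=0.05):
    """Pick the source with the highest definition score.

    `scores` maps a source name (for example "zoom" or "C1") to its score.
    Returns (best_source, ambiguous); `ambiguous` is True when the lead over
    the runner-up is below `margin`, in which case the caller would consult
    other quality indexes such as contrast and saturation.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best_source, best_score = ranked[0]
    ambiguous = len(ranked) > 1 and (best_score - ranked[1][1]) < margin
    return best_source, ambiguous
```

If the best source is the zoom frame itself, step S5 specifies that no fusion is performed for that target.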
(6) The image data of the object F1 with the highest definition is fused to the corresponding position in the zoom video frame S1. In the fusion process, a smoothing algorithm is adopted to process the target edge area, so that the natural transition and consistency of F1 and S1 backgrounds are ensured.
Specifically, in the invention, high-definition target images from different sources are seamlessly fused into a main video stream, so that the high definition of the target is maintained, and the naturalness and consistency of the whole picture are ensured. In order to ensure natural and smooth images after fusion, an intelligent weighted fusion technology based on Gaussian weight is adopted, and the core idea of the technology is that a high-definition image is completely used in a target center area, and gradually transits to an original image in an edge area, so that a hard boundary effect is avoided.
(7) The fused zoom video frame S1 is processed with a video stream compression coding method to reduce the video file size and optimize transmission efficiency. The compressed video stream is output; the definition of the plurality of key targets in the video stream is significantly improved, while the overall visual quality and smooth playback are maintained.
This step is the final stage of the system: efficient compression coding is performed with a suitable video codec such as H.264/H.265 while preserving the enhancement effect.
The invention provides a scene multi-focus target video coding method, which is a video coding method capable of effectively improving the definition of a plurality of targets in a multi-focus and fixed-focus cooperative video recording system.
The foregoing is merely a preferred embodiment of the present invention. It should be noted that those skilled in the art may make modifications and adaptations without departing from the principles of the present invention, and such modifications and adaptations are also intended to be covered by the present invention.

Claims (9)

1. A scene multi-focal-length target video encoding method, characterized in that it comprises the following steps:
Step S1: determine the key targets that require focused recognition and obtain a key target feature representation vector for each key target; the key target feature representation vectors form a key target recognition feature matrix F = {F_1, F_2, ..., F_m}, where F_j is the feature representation vector of key target j, j = 1, 2, ..., m, and m is the number of key targets;
Step S2: according to the requirements of the shooting scene, set up a zoom recording system and n fixed-focus recording systems facing the shooting scene, where the fixed-focus recording systems are configured with different shooting angles and focal length positions, and the shooting angles of the fixed-focus recording systems cover the shooting scene;
Step S3: start the zoom recording system and the n fixed-focus recording systems simultaneously for coordinated multi-system shooting; at the same acquisition time t, simultaneously obtain a zoom camera video frame S_t and a fixed-focus camera video frame set {C_{1,t}, C_{2,t}, ..., C_{n,t}}, where C_{i,t} is the fixed-focus camera video frame captured by fixed-focus recording system i at acquisition time t, and i = 1, 2, ..., n;
Step S4: based on the key target recognition feature matrix F = {F_1, F_2, ..., F_m}, identify m key targets in the zoom camera video frame S_t, and obtain the image data and position information of each identified key target j in the zoom camera video frame S_t;
based on the key target recognition feature matrix F = {F_1, F_2, ..., F_m}, perform key target recognition on each fixed-focus camera video frame C_{i,t}, identify k_{i,t} key targets, and obtain the image data and position information of each identified key target l in the fixed-focus camera video frame C_{i,t}, where l = 1, 2, ..., k_{i,t};
Step S5: according to the image data and position information, match the same key targets in the zoom camera video frame S_t and each fixed-focus camera video frame C_{i,t};
for each key target j, compare its image definition in each fixed-focus camera video frame C_{i,t} and in the zoom camera video frame S_t; if its image definition in the zoom camera video frame S_t is the highest, no image fusion processing is performed; otherwise, obtain the highest-definition image data of key target j among the fixed-focus camera video frames C_{i,t}, together with its position information;
Step S6: fuse the acquired highest-definition image data of key target j into the position of the same key target in the zoom camera video frame S_t to obtain the fused zoom camera video frame S_t;
Step S7: for each of the m key targets in the zoom camera video frame S_t, execute steps S5 to S6, and then perform video compression encoding on the processed zoom camera video frame S_t to obtain the multi-focal-length fused, encoded zoom camera video frame S_t;
Step S8: output the encoded zoom camera video frame S_t; set t = t+1; return to step S3.

2. The scene multi-focal-length target video encoding method according to claim 1, characterized in that, in step S1, the feature representation vector of key target j is F_j = {F_{j,visual}, F_{j,spatial}, F_{j,temporal}};
where:
F_{j,visual} is the appearance feature vector of key target j, obtained by extracting the color histogram, texture features, and shape description features of key target j;
F_{j,spatial} is the position feature vector of key target j, describing the position distribution and movement trajectory pattern of key target j in the picture;
F_{j,temporal} is the time feature vector of key target j, describing the temporal pattern and duration statistics of the appearance of key target j.

3. The scene multi-focal-length target video encoding method according to claim 1, characterized in that, during shooting, the zoom recording system is used to capture the whole scene and dynamically adjusts its focal length according to scene requirements to obtain a panoramic video stream containing all key targets.

4. The scene multi-focal-length target video encoding method according to claim 1, characterized in that each fixed-focus recording system i is configured with a shooting angle θ_i and a specific fixed focal length f_i, both of which remain unchanged while the scene is being shot; where the shooting angle θ_i is the shooting angle relative to the zoom recording system.

5. The scene multi-focal-length target video encoding method according to claim 1, characterized in that the image data and position information of key target j in the zoom camera video frame S_t are expressed as E_{s,j,t} = {T_{s,j,t}, P_{s,j,t}, B_{s,j,t}}; where T_{s,j,t} is the image content matrix of key target j in the zoom camera video frame S_t at acquisition time t, obtained by cropping the image of key target j from the zoom camera video frame S_t; P_{s,j,t} is the position coordinate vector of the center of key target j in the zoom camera video frame S_t at acquisition time t; B_{s,j,t} is the bounding box coordinate vector of the minimum circumscribed rectangular area of key target j in the zoom camera video frame S_t at acquisition time t;
the image data and position information of each identified key target l in the fixed-focus camera video frame C_{i,t} are expressed as E_{i,l,t} = {T_{i,l,t}, P_{i,l,t}, B_{i,l,t}}; where T_{i,l,t} is the image content matrix of key target l in the fixed-focus camera video frame C_{i,t} at acquisition time t, obtained by cropping the image of key target l from the fixed-focus camera video frame C_{i,t}; P_{i,l,t} is the position coordinate vector of the center of key target l in the fixed-focus camera video frame C_{i,t} at acquisition time t; B_{i,l,t} is the bounding box coordinate vector of the minimum circumscribed rectangular area of key target l in the fixed-focus camera video frame C_{i,t} at acquisition time t.

6. The scene multi-focal-length target video encoding method according to claim 5, characterized in that step S4 further comprises:
for each key target j identified in the zoom camera video frame S_t, and each key target l identified in each fixed-focus camera video frame C_{i,t}, a confidence evaluation function is used to calculate the cosine similarity Confidence with the feature representation vector of the corresponding key target in step S1; the recognition is considered valid only when the cosine similarity Confidence > threshold τ_threshold; where τ_threshold is set to 0.7-0.9.

7. The scene multi-focal-length target video encoding method according to claim 6, characterized in that, in step S5, matching the same key targets in the zoom camera video frame S_t and each fixed-focus camera video frame C_{i,t} according to the image data and position information is specifically:
Step S5.1: according to the image content matrix T_{s,j,t} of key target j in the zoom camera video frame S_t at acquisition time t and the image content matrix T_{i,l,t} of key target l in the fixed-focus camera video frame C_{i,t} at acquisition time t, calculate the appearance similarity Simvisual(j,l) between key target j and key target l;
Step S5.2: based on the position coordinate vector P_{s,j,t} of key target j in the zoom camera video frame S_t at acquisition time t and the position coordinate vector P_{i,l,t} of the center of key target l in the fixed-focus camera video frame C_{i,t} at acquisition time t, calculate the position similarity Simspatial(j,l) between key target j and key target l:
Simspatial(j,l) = exp(-||P_{s,j,t} - transform(P_{i,l,t}, M_i)||^2 / (2σ^2))
where: transform(P_{i,l,t}, M_i) is the transformation function that converts P_{i,l,t} from the coordinate system of fixed-focus recording system i into the coordinate system of the zoom recording system; M_i is the 3×3 unit transformation matrix, obtained through camera calibration of fixed-focus recording system i; σ is the position tolerance parameter, set to 10%-20% of the size of the image of key target l cropped from the fixed-focus camera video frame C_{i,t};
Step S5.3: calculate the comprehensive similarity between key target j and key target l using the following formula:
Sim(j,l) = α·Simvisual(j,l) + β·Simspatial(j,l)
where: α and β are the appearance similarity weight and the position similarity weight, respectively; α+β=1 and α, β>0; for scenes in which the target appearance changes greatly, the value of β is increased; for scenes in which the target appearance is relatively stable, the value of α is increased;
Step S5.4: if the comprehensive similarity Sim(j,l) between key target j and key target l is greater than the matching threshold θ_match, key target j and key target l are successfully matched and are the same key target.

8. The scene multi-focal-length target video encoding method according to claim 7, characterized in that, for each key target j, comparing its image definition in each fixed-focus camera video frame C_{i,t} and in the zoom camera video frame S_t is specifically:
for each camera video frame, a definition quantization score is obtained in the following manner, and image definition is then compared on the basis of that score:
Step A: the camera video frame requiring a definition quantization score is called the target image T, with a width of M pixels and a height of N pixels; the target image T is converted from the spatial domain to the frequency domain using the following formula:
where:
T(x,y) is the pixel value at position (x,y) in the spatial domain; j in the formula is the imaginary unit; F(u,v) is the complex coefficient in the frequency domain; position (x,y) in the spatial domain corresponds to position (u,v) in the frequency domain;
Step B: for the complex coefficients F(u,v) in the frequency domain, perform the following filtering operation to retain high-frequency components and filter out low-frequency components, obtaining the filtered image H(u,v):
if the frequency at position (u,v) exceeds the cutoff frequency threshold, let H(u,v)=1; otherwise, let H(u,v)=0;
where: ω_cutoff is the cutoff frequency threshold;
Step C: multiply the filtered image H(u,v) element by element with the complex coefficients F(u,v) in the frequency domain, using the following formula, to implement frequency-domain filtering and extract the high-frequency information in the image, obtaining the high-frequency information representation F_high(u,v):
F_high(u,v)=F(u,v)⊙H(u,v)
where: ⊙ denotes the element-by-element product;
Step D: obtain the definition quantization score Sharpness_Score(T) using the following formula:
where: U and V are the lengths of the rows and columns in the frequency domain, respectively.

9. The scene multi-focal-length target video encoding method according to claim 6, characterized in that, in step S6, fusing the acquired highest-definition image data of key target j into the position of the same key target in the zoom camera video frame S_t to obtain the fused zoom camera video frame S_t is specifically:
Step S6.1: align the center coordinates of the acquired highest-definition image data of key target j with the center coordinates of the same key target in the zoom camera video frame S_t; if the two differ in size, resize the image data by bilinear interpolation, so that the image data replaces the region of the same key target in the zoom camera video frame S_t;
Step S6.2: perform color and brightness correction on the image data embedded at the same key target in the zoom camera video frame S_t so that its color and brightness distribution matches that of the zoom camera video frame S_t, obtaining first image data;
Step S6.3: for each pixel point p in the first image data, calculate its weight w(p) using the following formula:
where: σ is the parameter controlling the smoothness of the transition, set to 1/6 to 1/4 of the size of the first image data;
Step S6.4: fuse each pixel point p in the first image data using the following formula to obtain the fused image I_fused:
where: the formula combines the pixel value at position p in the first image data with S_region(p), the pixel value at position p in the same key target region of the zoom camera video frame S_t; I_fused(p) is the pixel value at position p in the fused image;
Step S6.5: perform boundary smoothing on the fused image I_fused using the following formula to obtain the boundary-smoothed image I_fused":
I_fused"=I_fused*G_smooth
where: G_smooth is the smoothing filter; G_smooth=(1/16)[1,2,1;2,4,2;1,2,1];
Step S6.6: re-embed the boundary-smoothed image I_fused" into the target area of the zoom camera video frame S_t to obtain the final fused zoom camera video frame
S_enhanced(t); S_enhanced(t)=St+Mask⊙(I_fused"-S_region)S_enhanced(t)=S t +Mask⊙(I_fused"-S_region) 其中:Mask为变焦摄像视频帧St中相同重点目标所在的目标区域掩码矩阵;Where: Mask is the target area mask matrix where the same key target is located in the zoom camera video frame St ; S_region为变焦摄像视频帧St中相同重点目标所在的目标区域;St代表变焦摄像视频帧;S_region is the target region where the same key target is located in the zoom camera video frame S t ; S t represents the zoom camera video frame; 步骤S6.7,对最终融合后的变焦摄像视频帧S_enhanced(t)进行视频压缩编码,并输出。Step S6.7: perform video compression encoding on the final fused zoom camera video frame S_enhanced(t) and output it.
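The matching logic of steps S5.2 to S5.4 in claim 7 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: `transform` is assumed to apply M_i to a 2D point in homogeneous coordinates, the appearance similarity is assumed to come from the preceding step as a number in [0, 1], and the default values of `alpha` and `theta_match` are hypothetical, since the claims give no concrete numbers for them.

```python
import numpy as np


def transform(p, M):
    """Map a 2D point through a 3x3 matrix (assumed homogeneous-coordinate mapping)."""
    x, y, w = M @ np.array([p[0], p[1], 1.0])
    return np.array([x / w, y / w])


def sim_spatial(p_zoom, p_fixed, M, sigma):
    """Position similarity: exp(-||P_s - transform(P_i, M)||^2 / (2 sigma^2))."""
    diff = np.asarray(p_zoom, dtype=float) - transform(p_fixed, M)
    return float(np.exp(-np.sum(diff ** 2) / (2.0 * sigma ** 2)))


def sim_total(sim_visual, sim_sp, alpha=0.5):
    """Comprehensive similarity with alpha + beta = 1 and alpha, beta > 0."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    beta = 1.0 - alpha
    return alpha * sim_visual + beta * sim_sp


def is_match(sim, theta_match=0.7):
    """Step S5.4: the targets match when Sim(j,l) exceeds the threshold."""
    return bool(sim > theta_match)


# Usage: identity calibration matrix, target centers 5 pixels apart.
M_i = np.eye(3)
s_sp = sim_spatial((103.0, 204.0), (100.0, 200.0), M_i, sigma=10.0)
matched = is_match(sim_total(0.9, s_sp, alpha=0.6))
```

Raising `alpha` makes the decision lean on appearance, which matches the guidance in step S5.3 for scenes where the target appearance is stable.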
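Steps A to D of claim 8 describe a frequency-domain sharpness measure. The formulas for the transform, the filter condition, and the final score are missing from this text, so the sketch below fills them with common choices that are assumptions only: NumPy's 2D FFT with the zero frequency shifted to the center, a radial high-pass mask that keeps coefficients farther than `cutoff` from the center, and a score equal to the high-frequency energy divided by the number of coefficients (U·V).

```python
import numpy as np


def sharpness_score(img, cutoff):
    """Frequency-domain sharpness score for a 2D grayscale image (assumed form)."""
    T = np.asarray(img, dtype=float)
    # Step A: spatial domain -> frequency domain, zero frequency at the center.
    F = np.fft.fftshift(np.fft.fft2(T))
    rows, cols = F.shape
    v = np.arange(rows)[:, None] - rows // 2
    u = np.arange(cols)[None, :] - cols // 2
    # Step B: binary high-pass mask, 1 outside the cutoff radius, 0 inside.
    H = (np.sqrt(u ** 2 + v ** 2) > cutoff).astype(float)
    # Step C: element-wise product keeps only the high-frequency coefficients.
    F_high = F * H
    # Step D (assumed normalization): mean energy of the retained coefficients.
    return float(np.sum(np.abs(F_high) ** 2) / (rows * cols))


# Usage: a checkerboard has far more high-frequency energy than a flat patch.
yy, xx = np.indices((8, 8))
checker = ((xx + yy) % 2).astype(float)
flat = np.full((8, 8), 0.5)
sharper_is_checker = sharpness_score(checker, 1.0) > sharpness_score(flat, 1.0)
```

When the score is used to compare crops of the same target from different cameras, the crops would need a common size first, because the score depends on the image dimensions.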
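Steps S6.3 to S6.6 of claim 9 can be illustrated with the following Python sketch for a single-channel frame. The weight formula and the blending formula are missing from this text, so two assumptions are made and should be read as such: w(p) is taken to be a Gaussian of the distance from the patch center, and the blend is taken to be w(p)·patch(p) + (1 − w(p))·S_region(p). The function names, the edge-replicating padding, and the all-ones mask are likewise hypothetical. Only the 3×3 kernel G_smooth and the re-embedding formula come from the claim.

```python
import numpy as np

# Smoothing kernel given in step S6.5.
G_SMOOTH = np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]], dtype=float) / 16.0


def gaussian_weights(h, w, sigma):
    """Assumed form of w(p): Gaussian falloff from the patch center."""
    y = np.arange(h)[:, None] - (h - 1) / 2.0
    x = np.arange(w)[None, :] - (w - 1) / 2.0
    return np.exp(-(x ** 2 + y ** 2) / (2.0 * sigma ** 2))


def smooth3x3(img):
    """Convolve with G_SMOOTH, replicating edge pixels at the border."""
    h, w = img.shape
    padded = np.pad(img, 1, mode="edge")
    out = np.zeros((h, w), dtype=float)
    for dy in range(3):
        for dx in range(3):
            out += G_SMOOTH[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out


def fuse(frame, patch, top, left, sigma):
    """Blend a corrected patch into frame at (top, left) and return S_enhanced."""
    h, w = patch.shape
    s_region = frame[top:top + h, left:left + w].astype(float)
    wgt = gaussian_weights(h, w, sigma)
    i_fused = wgt * patch + (1.0 - wgt) * s_region    # step S6.4 (assumed form)
    i_smooth = smooth3x3(i_fused)                      # step S6.5
    mask = np.ones((h, w))                             # target-region mask
    out = frame.astype(float).copy()
    out[top:top + h, left:left + w] += mask * (i_smooth - s_region)  # step S6.6
    return out


# Usage: insert a bright 4x4 patch into a dark 8x8 frame.
frame = np.zeros((8, 8))
patch = np.ones((4, 4))
enhanced = fuse(frame, patch, top=2, left=2, sigma=1.0)
```

Because the weight decays toward the patch border, pixels near the edge stay close to the original zoom frame, which is what makes the transition smooth.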
CN202511018605.8A 2025-07-23 2025-07-23 Scene multi-focal-length target video coding method Pending CN120812412A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202511018605.8A CN120812412A (en) 2025-07-23 2025-07-23 Scene multi-focal-length target video coding method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202511018605.8A CN120812412A (en) 2025-07-23 2025-07-23 Scene multi-focal-length target video coding method

Publications (1)

Publication Number Publication Date
CN120812412A (en) 2025-10-17

Family

ID=97307994

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202511018605.8A Pending CN120812412A (en) 2025-07-23 2025-07-23 Scene multi-focal-length target video coding method

Country Status (1)

Country Link
CN (1) CN120812412A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120980197A (en) * 2025-10-20 2025-11-18 深圳市道格恒通科技有限公司 An autofocus method and a rugged mobile phone

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101720027A (en) * 2009-11-27 2010-06-02 西安电子科技大学 Method for cooperative acquisition of multi-target videos under different resolutions by variable-focus array camera
CN102542545A (en) * 2010-12-24 2012-07-04 方正国际软件(北京)有限公司 Multi-focal length photo fusion method and system and photographing device
CN107481213A (en) * 2017-08-28 2017-12-15 湖南友哲科技有限公司 Microscope hypograph multi-layer focusing fusion method
CN110830756A (en) * 2018-08-07 2020-02-21 华为技术有限公司 Monitoring method and device
CN113936154A (en) * 2021-11-23 2022-01-14 上海商汤智能科技有限公司 Image processing method and device, electronic equipment and storage medium
CN116132791A (en) * 2023-03-10 2023-05-16 创视微电子(成都)有限公司 Method and device for acquiring multi-field-depth clear images of multiple moving objects
CN119027882A (en) * 2024-08-30 2024-11-26 四川国创新视超高清视频科技有限公司 A dynamic target tracking fusion method for large scene monitoring
CN119204592A (en) * 2024-11-25 2024-12-27 浙江吉欧科技有限公司 Shield machine operation and maintenance management system based on data analysis

Similar Documents

Publication Publication Date Title
US10733472B2 (en) Image capture device with contemporaneous image correction mechanism
US8330831B2 (en) Method of gathering visual meta data using a reference image
US9129381B2 (en) Modification of post-viewing parameters for digital images using image region or feature information
CN114862698B (en) Channel-guided real overexposure image correction method and device
Levin et al. Image and depth from a conventional camera with a coded aperture
CN108537155B (en) Image processing method, apparatus, electronic device, and computer-readable storage medium
US20180013950A1 (en) Modification of post-viewing parameters for digital images using image region or feature information
CN107945135B (en) Image processing method, device, storage medium and electronic device
US20120069198A1 (en) Foreground/Background Separation Using Reference Images
CN107948517B (en) Preview image blur processing method, device and device
US20080317378A1 (en) Digital image enhancement with reference images
JP4597391B2 (en) Facial region detection apparatus and method, and computer-readable recording medium
CN108846807B (en) Light effect processing method, device, terminal and computer-readable storage medium
JP2010508571A (en) Digital image processing using face detection and skin tone information
CN109493283A (en) A kind of method that high dynamic range images ghost is eliminated
CN112261292B (en) Image acquisition method, terminal, chip and storage medium
CN113379609B (en) Image processing method, storage medium and terminal equipment
Banerjee et al. In-camera automation of photographic composition rules
US20250203194A1 (en) Image processing device, image processing method, and program
CN120812412A (en) Scene multi-focal-length target video coding method
CN110365897B (en) Image correction method and device, electronic equipment and computer readable storage medium
CN107911609B (en) Image processing method, apparatus, computer-readable storage medium and electronic device
CN113379608B (en) Image processing method, storage medium and terminal device
US20250193339A1 (en) Videoconference image enhancement based on scene models
Zhou et al. Pixel-level Multi-directional Image Sharpness Linear Assessment for Optical Image Stabilizer Performance Monitoring

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination