CN120853236B

CN120853236B - A method and system for eye movement correction based on codec and feature decoupling

Info

Publication number: CN120853236B
Application number: CN202510957643.3A
Authority: CN
Inventors: 吴昊; 陶新昊
Original assignee: Zhaoyi Information Technology Shanghai Co ltd
Current assignee: Zhaoyi Information Technology Shanghai Co ltd
Priority date: 2025-07-11
Filing date: 2025-07-11
Publication date: 2026-02-06
Anticipated expiration: 2045-07-11
Also published as: CN120853236A

Abstract

The invention provides an eye correction method and system based on a codec and feature decoupling. The method comprises: step a, acquiring an original face image I and obtaining from it an eye image I_c and head posture information H_gt of the user; step b, extracting features from I_c and H_gt to obtain a vector attribute code z_i representing static attribute features of the user, current eye posture information G_pre representing the current eye posture, and a rotation attribute code z_r representing the rotation attribute of the eye; step c, performing a three-dimensional transformation on z_r according to G_pre and preset target eye posture information G_tar to obtain a rotation attribute code z_pre corresponding to G_tar; step d, generating a target eye image I_pre from z_i and z_pre; and step e, pasting I_pre back into the original face image I and outputting the eye-corrected face image. The method effectively solves the image distortion caused by confusion between attribute features and posture features in traditional methods, and greatly improves the accuracy and naturalness of eye correction.

Description

Eye correction method and system based on codec and feature decoupling

Technical Field

The invention relates to the technical field of computer vision and artificial intelligence, and in particular to an eye correction method and system based on a codec and feature decoupling.

Background

Eye correction (gaze correction) is a technique applied to facial feature processing that aligns the user's gaze with a target line of sight by adjusting the direction or position of the eyeballs. The technology has wide application value in fields such as video conferencing, virtual reality, and augmented reality.

At present, the technical scheme for realizing eye correction is mainly divided into two major categories, namely a hardware method and a software method.

Traditional eye correction often relies on specialized hardware devices such as eye trackers or infrared cameras. These devices track the user's eyes in real time by capturing eye movement trajectories and positions, and change the gaze direction by means of optical correction or hardware adjustment. However, such methods tend to have high equipment costs and often require the user to wear additional devices or fixtures, resulting in a poor user experience. Their scope of application is also limited, which makes them difficult to deploy at scale in scenarios such as video conferencing.

The software method mainly uses a traditional image processing algorithm to perform geometric transformation or pixel replacement on an eye region in an image by analyzing facial feature points (such as eyes, nose, mouth and the like) in the image so as to realize eye correction. This approach has high flexibility and speed and does not require special hardware support. However, such conventional algorithms are typically based on static rules, have limited processing power, lack adaptive optimization capabilities, and have difficulty in handling diverse user demands.

With the development of deep learning models such as convolutional neural networks (CNNs) and generative adversarial networks (GANs), neural-network-based eye correction has gradually become a research hotspot. However, existing neural-network-based eye correction techniques still suffer from insufficient precision, poor naturalness, and low computational efficiency.

Disclosure of Invention

In view of the above-mentioned shortcomings of the prior art, the present invention provides an eye correction method and system based on a codec and feature decoupling.

One aspect of the present invention provides an eye correction method based on codec and feature decoupling, comprising the steps of:

Step a, acquiring an original face image I through image acquisition equipment, and acquiring an eye image I_c and head posture information H_gt of a user from the original face image;

Step b, extracting features of the eye image I_c and the head posture information H_gt to obtain a vector attribute code z_i representing static attribute features of the user, current eye posture information G_pre representing the current eye posture, and a rotation attribute code z_r representing the rotation attribute of the eye;

Step c, performing three-dimensional transformation on the rotation attribute code z_r according to the current eye posture information G_pre and preset target eye posture information G_tar to obtain a rotation attribute code z_pre corresponding to the target eye posture information G_tar;

Step d, generating a target eye image I_pre according to the vector attribute code z_i and the rotation attribute code z_pre; and

Step e, pasting the target eye image I_pre back into the original face image I, and outputting the face image after eye correction.

Another aspect of the present invention provides an eye correction system based on codec and feature decoupling, comprising:

A data input module, which acquires an original face image I through image acquisition equipment and acquires an eye image I_c and head posture information H_gt of a user from the original face image;

A feature extraction module, which extracts features of the eye image I_c and the head posture information H_gt to obtain a vector attribute code z_i representing static attribute features of the user, current eye posture information G_pre representing the current eye posture, and a rotation attribute code z_r representing the rotation attribute of the eye;

A feature transformation module, which performs three-dimensional transformation on the rotation attribute code z_r according to the current eye posture information G_pre and preset target eye posture information G_tar to obtain a rotation attribute code z_pre corresponding to the target eye posture information G_tar;

A decoding module, which generates a target eye image I_pre from the vector attribute code z_i and the rotation attribute code z_pre; and

An image generation module, which pastes the target eye image I_pre back into the original face image I and outputs the face image after eye correction.

The invention has the following beneficial effects:

1. Through feature-coding decoupling, the invention ensures that the generated image adjusts the gaze accurately and naturally while preserving the user's personalized features. In addition, the generated texture details are optimized with a generative adversarial network, so that the corrected image is visually natural and lifelike.

2. The method has strong real-time performance and is suitable for low-resource devices. By applying re-parameterization to the trained multi-layer convolutional neural network, the computational cost in the test stage (the eye correction processing stage) is significantly reduced, and real-time correction can be achieved on mobile devices. The method is therefore suitable for scenarios with high requirements on interactivity and real-time performance, such as video conferencing and virtual reality.

Drawings

Fig. 1 is a flow chart of a method of eye correction based on codec and feature decoupling in accordance with a preferred embodiment of the present invention.

Fig. 2 is a block diagram of a codec and feature decoupling based eye correction system in accordance with a preferred embodiment of the present invention.

Fig. 3 is a schematic diagram of the multi-layer convolutional neural network of the encoding module and the decoding module, to which re-parameterization is applied, according to a preferred embodiment of the present invention.

Detailed Description

The invention is further illustrated by the following embodiments, which are intended only to aid understanding of the invention and not to limit its scope.

As shown in FIG. 1, the method according to a preferred embodiment of the present invention comprises the following steps a-e.

First, in step a, an original face image I containing face information is acquired through an image acquisition device such as a camera, and an eye image I_c and head posture information H_gt of the user are obtained from the original face image.

In a preferred embodiment, acquiring the eye image of the user in step a further comprises the following steps:

Step a1, identifying the corresponding 106 face key points from the original face image I;

Step a2, locating an eye region from the 106 face key points, wherein the left and right edges of the eye region are preferably kept 16 pixels from the left and right eye corners, and the lower edge is kept 16 pixels from the lower orbit;

Step a3, cropping the original face image according to the eye region to generate the eye image I_c. Preferably, the image is 96 pixels wide and 64 pixels high.
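The region selection and cropping of steps a2 and a3 can be sketched as follows. This is only a minimal illustration, not the patented implementation: the function names `eye_crop_box` and `crop_region`, the representation of landmarks as (x, y) pixel tuples, and the upper margin `top_margin` are assumptions, since the text specifies only the left, right, and lower margins. Resizing the crop to the network input size is omitted.

```python
def eye_crop_box(eye_points, img_w, img_h, margin=16, top_margin=16):
    """Return (left, top, right, bottom) of the eye region, clamped to the image.

    eye_points: (x, y) pixel coordinates of the eye-related key points
    (hypothetical input; the source uses a 106-point face landmark set).
    margin: 16-pixel left/right/lower margin stated in the text.
    top_margin: upper margin, NOT given in the text (assumption).
    """
    xs = [x for x, _ in eye_points]
    ys = [y for _, y in eye_points]
    left = max(0, min(xs) - margin)
    right = min(img_w, max(xs) + margin)
    top = max(0, min(ys) - top_margin)
    bottom = min(img_h, max(ys) + margin)
    return left, top, right, bottom


def crop_region(image, box):
    """Crop a nested-list image (rows of pixels) to the given box."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]
```

Clamping to the image bounds keeps the crop valid when the eyes lie near the image border.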

In step a, the head posture information H_gt includes a pitch angle, i.e., the up-down rotation angle, and a yaw angle, i.e., the left-right rotation angle.

Then, in step b, features are extracted from the eye image I_c and the head posture information H_gt to obtain a vector attribute code z_i representing the static attribute features of the user, current eye posture information G_pre representing the current eye posture, and a rotation attribute code z_r representing the rotation attribute of the eye.

Preferably, in step b, the feature extraction module performs the feature extraction with a re-parameterized convolutional neural network composed of a plurality of cascaded convolutional layers and activation functions. Preferably, the static attribute features of the user are personalized features of the user, including gender, age, skin tone, and the like. Preferably, the vector attribute code z_i is a 256-dimensional vector. The current eye posture information before correction is G_pre = (P_pre, Y_pre), where P_pre denotes the current pitch angle and Y_pre denotes the current yaw angle. Preferably, the rotation attribute code z_r has 48 dimensions.

Next, in step c, a three-dimensional transformation is performed on the rotation attribute code z_r according to the current eye posture information G_pre and preset target eye posture information G_tar to obtain a rotation attribute code z_pre corresponding to the target eye posture information G_tar.

Preferably, the target eye posture information G_tar = (P_tar, Y_tar) is preset before this step, where P_tar denotes the target pitch angle and Y_tar denotes the target yaw angle, both expressed in radians. Preferably, in a video call scenario, the target posture may be set to the posture in which the user looks straight ahead, i.e., both the pitch angle and the yaw angle are 0 degrees and the target posture is (0, 0). In other scenarios, the target posture may be specified manually by the user, e.g., looking up by 15° and obliquely to the left by 10°, or looking down by 15° and obliquely to the right by 10°.

Preferably, step c further comprises the following steps c1-c3. Before steps c1-c3, z_r is reshaped to 3 × 16, i.e., it is represented as 16 three-dimensional vectors.

Step c1, acquiring a three-dimensional rotation matrix R_pre corresponding to the current eye posture information G_pre;

Step c2, acquiring a three-dimensional rotation matrix R_tar corresponding to the target eye posture information G_tar,

wherein the R_pre matrix includes a left-right rotation and an up-down rotation, and the R_tar matrix likewise includes a left-right rotation and an up-down rotation.

Step c3, applying the inverse rotation of R_pre to z_r and then applying the rotation matrix R_tar to obtain the rotation attribute code z_pre corresponding to the target eye posture, i.e., z_pre = R_tar · R_pre^(-1) · z_r. Finally, the 3 × 16 z_pre is reshaped back to 48 dimensions, consistent with z_r. This process completes the natural adjustment of the eye posture from the current direction G_pre to the target direction G_tar in the latent space.
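The explicit rotation matrices of steps c1 and c2 are not reproduced in this text. As a hedged illustration only, one common convention composes a rotation about the x-axis by the pitch angle P with a rotation about the y-axis by the yaw angle Y. The axis assignment, the composition order, and the signs below are assumptions and are not confirmed by the source:

```latex
\begin{aligned}
R(P, Y) &= R_y(Y)\, R_x(P)
= \begin{pmatrix} \cos Y & 0 & \sin Y \\ 0 & 1 & 0 \\ -\sin Y & 0 & \cos Y \end{pmatrix}
  \begin{pmatrix} 1 & 0 & 0 \\ 0 & \cos P & -\sin P \\ 0 & \sin P & \cos P \end{pmatrix}, \\
R_{pre} &= R(P_{pre}, Y_{pre}), \qquad R_{tar} = R(P_{tar}, Y_{tar}), \\
z_{pre} &= R_{tar}\, R_{pre}^{-1}\, z_r, \qquad R_{pre}^{-1} = R_{pre}^{\top}.
\end{aligned}
```

Here z_r is taken as a 3 × 16 matrix whose columns are the 16 three-dimensional vectors. Because a rotation matrix is orthogonal, its inverse is its transpose, and when G_tar equals G_pre the transformation reduces to the identity, so z_pre equals z_r.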

Next, in step d, a target eye image I_pre is generated from the vector attribute code z_i and the rotation attribute code z_pre.

In step d, a decoding module formed by a multi-layer convolutional neural network performs feature extraction on the vector attribute code z_i and the rotation attribute code z_pre to generate the target eye image I_pre.

Preferably, in step d, the decoding module performs the feature extraction with a re-parameterized convolutional neural network composed of a bilinear upsampling layer, a plurality of cascaded convolutional layers, and activation functions. The bilinear upsampling used by the decoding module preserves detail and smooths edges well in the generated image, thereby effectively avoiding the artifacts caused by deconvolution.
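For readers unfamiliar with the operation, bilinear upsampling of a single-channel feature map can be sketched as below. This is a generic illustration rather than the decoder of the invention: the function name `bilinear_upsample` is hypothetical, and the corner-aligned sampling grid is an assumption, since the text does not state which grid convention is used.

```python
def bilinear_upsample(img, out_h, out_w):
    """Resize a 2-D nested-list feature map to (out_h, out_w).

    Each output value is a weighted average of its four nearest input
    values. Corner-aligned sampling is assumed (input and output corners
    coincide); other conventions give slightly different results.
    """
    in_h, in_w = len(img), len(img[0])
    out = []
    for i in range(out_h):
        # Map the output row index to a fractional input row coordinate.
        y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row = []
        for j in range(out_w):
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            top = img[y0][x0] * (1 - wx) + img[y0][x1] * wx
            bottom = img[y1][x0] * (1 - wx) + img[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out
```

Unlike a learned deconvolution, this interpolation has no trainable weights, so the subsequent convolutional layers carry all of the learned refinement.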

Finally, in step e, the target eye image I_pre is pasted back into the original face image I, and the eye-corrected face image is output. The system pastes the generated target eye image I_pre back into the original face image I according to the face key points, so that it blends seamlessly with the original face and the overall appearance shows no visible discontinuity. The corrected complete face image is displayed on the screen in real time, and the user can directly check the adjusted result. In video communication or other scenarios, the adjusted image may be used directly for real-time transmission.
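The paste-back of step e can be sketched as a direct copy of the generated patch into the crop location. This is a simplified illustration under stated assumptions: the function name `paste_back` is hypothetical, the images are nested lists, the patch is assumed to have already been resized to the crop size, and any edge blending used to achieve a seamless result is not described in the text and is therefore omitted.

```python
def paste_back(face, eye_patch, left, top):
    """Return a copy of `face` with `eye_patch` written at (left, top).

    face, eye_patch: nested lists of pixel values (rows of pixels).
    left, top: top-left corner of the eye region located from the key
    points. Pixels falling outside the face image are skipped.
    """
    out = [row[:] for row in face]  # copy rows; the input image is not modified
    for dy, patch_row in enumerate(eye_patch):
        y = top + dy
        if not 0 <= y < len(out):
            continue
        for dx, value in enumerate(patch_row):
            x = left + dx
            if 0 <= x < len(out[y]):
                out[y][x] = value
    return out
```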

As described above, the invention encodes the user's personalized attributes and eye posture features independently through the feature extraction module, generating the attribute code z_i and the rotation attribute code z_r respectively, thereby achieving feature decoupling. Specifically, the attribute code z_i characterizes personalized features of the user, i.e., static information such as gender, age, and skin color, while the rotation attribute code z_r describes the dynamic rotation features of the user's eyes. During eye correction, only the rotation attribute code z_r is adjusted, and the natural rotation of the eyeball is simulated through a three-dimensional posture transformation. Meanwhile, the attribute code z_i is kept unchanged, so that the generated image adjusts the gaze accurately and naturally while preserving the personalized features.
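The decoupled editing summarized above can be sketched as follows: only the 48-dimensional rotation code is rotated, while the attribute code would be passed to the decoder untouched. This is a minimal sketch under stated assumptions, not the patented implementation. The function names, the convention R = R_y(yaw) · R_x(pitch), and the row-major 48 to 3 × 16 reshape are all assumptions, because the text does not reproduce the matrices or the memory layout.

```python
import math


def matmul(a, b):
    """Multiply two nested-list matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


def rotation_matrix(pitch, yaw):
    """R = R_y(yaw) @ R_x(pitch), angles in radians (assumed convention)."""
    cp, sp = math.cos(pitch), math.sin(pitch)
    cy, sy = math.cos(yaw), math.sin(yaw)
    r_x = [[1.0, 0.0, 0.0], [0.0, cp, -sp], [0.0, sp, cp]]
    r_y = [[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]]
    return matmul(r_y, r_x)


def redirect_rotation_code(z_r, g_pre, g_tar):
    """Map z_r (48 floats) from posture g_pre to posture g_tar.

    g_pre, g_tar: (pitch, yaw) tuples in radians.
    """
    # Reshape 48 -> 3 x 16 (row-major layout is an assumption).
    z = [z_r[i * 16:(i + 1) * 16] for i in range(3)]
    r_pre = rotation_matrix(*g_pre)
    r_tar = rotation_matrix(*g_tar)
    # The inverse of a rotation matrix is its transpose.
    transform = matmul(r_tar, transpose(r_pre))
    z_pre = matmul(transform, z)
    # Flatten 3 x 16 back to 48 so the size matches z_r.
    return [value for row in z_pre for value in row]
```

For a video call, the target would be `g_tar = (0.0, 0.0)`; angles given in degrees can be converted with `math.radians`. Applying the function again with the two postures swapped recovers the original code, which is a convenient sanity check.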

One embodiment of the present invention also provides a codec and feature decoupling based eye correction system 20 comprising:

The data input module 21 acquires an original face image I through image acquisition equipment and acquires an eye image I_c and head posture information H_gt of a user from the original face image;

The feature extraction module 22 performs feature extraction on the eye image I_c and the head posture information H_gt to obtain a vector attribute code z_i representing static attribute features of the user, current eye posture information G_pre representing the current eye posture, and a rotation attribute code z_r representing the rotation attribute of the eye;

The feature transformation module 23 performs three-dimensional transformation on the rotation attribute code z_r according to the current eye posture information G_pre and preset target eye posture information G_tar to obtain a rotation attribute code z_pre corresponding to the target eye posture information G_tar;

The decoding module 24 generates a target eye image I_pre from the vector attribute code z_i and the rotation attribute code z_pre;

The image generation module 25 pastes the target eye image I_pre back into the original face image I and outputs the face image after eye correction.

The data input module 21 further includes a facial point recognition module 211 and a head pose prediction module 212.

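The facial point recognition module 211 performs steps a1 to a3 described above: identifying the face key points in the original face image I, locating the eye region from the face key points, and cropping the original face image according to the eye region to generate the eye image I_c.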

The head pose prediction module 212 is configured to predict the head posture H_gt of the corresponding face in the image I, including the up-down rotation angle (pitch angle) and the left-right rotation angle (yaw angle).

Preferably, the multi-layer convolutional neural networks of the feature extraction module (encoding module) 22 and the decoding module 24 of the present invention adopt a simplified and optimized network design, which addresses the inability of the prior art to run in real time under limited resources.

In the preferred embodiment, the feature extraction module (encoding module) 22 and the decoding module 24 apply re-parameterization to the trained multi-layer convolutional neural network, thereby reducing computation and memory reads and writes at test time. Specifically, the structure of a convolution module is shown in Fig. 3. During training the module has three branches: the first branch is a 3×3 convolution layer followed by a batch normalization (BN) layer, the second branch is a 1×1 convolution layer followed by a BN layer, and the third branch is a residual connection (present only when the number of input channels equals the number of output channels). In the training phase the three branches are trained together, effectively modeling both the residual structure and a spatially independent feature transformation. In the test stage (the eye correction processing stage), the 1×1 convolution can be regarded as a 3×3 convolution kernel whose surrounding entries are 0, and the residual connection can be regarded as a 3×3 convolution kernel whose center is 1 and whose surrounding entries are 0. The parameters of each batch normalization layer are then folded into the corresponding convolution kernel, and finally the parameters of the three kernels are added. In other words, two convolutions, two batch normalization layers, and one residual connection are merged into a single 3×3 convolution layer, which greatly reduces the number of layers, the amount of computation, and the memory reads and writes.
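The merging procedure described above can be sketched in plain Python as follows. This is an illustrative sketch, not the code of the invention: the function names, the kernel layout `[out_channel][in_channel][row][col]`, and the batch normalization parameter tuple `(gamma, beta, mean, var)` are assumptions, and the folding formula is the standard inference-time form of batch normalization.

```python
import math


def fuse_conv_bn(kernel, gamma, beta, mean, var, eps=1e-5):
    """Fold inference-time BN statistics into a bias-free convolution.

    Returns (fused_kernel, fused_bias), one scale and bias per output channel.
    """
    fused_kernel, fused_bias = [], []
    for o, k_out in enumerate(kernel):
        scale = gamma[o] / math.sqrt(var[o] + eps)
        fused_kernel.append([[[w * scale for w in row] for row in k_in]
                             for k_in in k_out])
        fused_bias.append(beta[o] - mean[o] * scale)
    return fused_kernel, fused_bias


def pad_1x1_to_3x3(kernel):
    """Place each 1x1 weight at the center of a zero 3x3 kernel."""
    return [[[[0.0, 0.0, 0.0], [0.0, k_in[0][0], 0.0], [0.0, 0.0, 0.0]]
             for k_in in k_out] for k_out in kernel]


def identity_3x3(channels):
    """Residual connection as a 3x3 kernel: center 1 where in == out channel."""
    return [[[[0.0, 0.0, 0.0],
              [0.0, 1.0 if i == o else 0.0, 0.0],
              [0.0, 0.0, 0.0]] for i in range(channels)]
            for o in range(channels)]


def reparameterize(k3, bn3, k1, bn1, eps=1e-5):
    """Merge 3x3+BN, 1x1+BN and the residual branch into one 3x3 kernel.

    bn3, bn1: (gamma, beta, mean, var) tuples of per-channel lists.
    The residual branch is added only if input and output channels match.
    """
    k3_f, b3_f = fuse_conv_bn(k3, *bn3, eps=eps)
    k1_f, b1_f = fuse_conv_bn(pad_1x1_to_3x3(k1), *bn1, eps=eps)
    n_out, n_in = len(k3), len(k3[0])
    k_id = identity_3x3(n_out) if n_out == n_in else None
    kernel = [[[[k3_f[o][i][r][c] + k1_f[o][i][r][c]
                 + (k_id[o][i][r][c] if k_id else 0.0)
                 for c in range(3)] for r in range(3)]
               for i in range(n_in)] for o in range(n_out)]
    bias = [a + b for a, b in zip(b3_f, b1_f)]
    return kernel, bias
```

Because convolution is linear, a single 3×3 convolution with the merged kernel and bias produces the same output as the sum of the three training-time branches, provided the batch normalization layers use their fixed inference statistics.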

In the preferred embodiment, all convolution layers use 3×3 convolution kernels, because a 3×3 kernel is computed much more efficiently than kernels of other sizes while still taking neighborhood information into account.

In the preferred embodiment, all convolution modules have a single-path architecture without any side branches at test time. Although operations such as residual and cross-layer connections involve little computation, they occupy a large amount of extra memory and thus reduce runtime efficiency. Therefore, the invention contains no residual or cross-layer connection structures in actual use.

In a preferred embodiment, the eye correction system 20 further includes a supervisory optimization module 26 that optimizes parameters of the various modules in the system by way of data-driven training. The module 26 uses five loss functions to improve the quality of the generated image and the accuracy of the pose adjustment.

(1) Reconstruction loss, namely measuring the pixel-level difference between the generated image and the real image by calculating the pixel error between the predicted image I_pre and the target image I_gt during training.

(2) Perceptual loss, namely calculating, through a pre-trained VGG network, the difference between the perceptual features of the generated image I_pre and those of the target image I_gt during training.

(3) Adversarial loss, namely a loss based on feedback from the discriminator of a generative adversarial network (GAN): the discriminator is trained to judge the predicted image I_pre as fake and the target image I_gt as real, while the generator tries to fool the discriminator, thereby enhancing the realism of the generated image I_pre.

(4) Posture loss, namely measuring the angular error between the currently predicted eye posture and the target posture by calculating the squared angular distance between the posture information G_pre and the target posture G_tar during training.

(5) Attribute code similarity loss, namely calculating the cosine similarity between the attribute code z_i of the input image and the attribute code z_t of a target image of the same person, ensuring that different images of the same person have similar attribute codes.

The supervisory optimization module 26 optimizes the five loss functions in combination so that the corrected image generated by the system is not only visually realistic but also accurate in pose adjustment.
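The loss terms that need no neural network can be sketched as below, together with a weighted sum. This is an illustration under stated assumptions: the function names and the weights are hypothetical, the absolute-error form of the reconstruction loss and the `1 - cosine similarity` form of the attribute term are assumptions (the text says only "pixel error" and "cosine similarity"), and the perceptual and adversarial terms are assumed to be computed elsewhere and passed in as numbers.

```python
import math


def reconstruction_loss(pred_pixels, target_pixels):
    """Mean absolute pixel error between flattened images (L1 form assumed)."""
    return sum(abs(p - t) for p, t in zip(pred_pixels, target_pixels)) / len(pred_pixels)


def pose_loss(g_pred, g_tar):
    """Squared angular distance between two (pitch, yaw) pairs in radians."""
    return sum((a - b) ** 2 for a, b in zip(g_pred, g_tar))


def attribute_loss(z_i, z_t):
    """1 - cosine similarity of two attribute codes (0 when directions match)."""
    dot = sum(a * b for a, b in zip(z_i, z_t))
    norm = math.sqrt(sum(a * a for a in z_i)) * math.sqrt(sum(b * b for b in z_t))
    return 1.0 - dot / norm


def total_loss(terms, weights):
    """Weighted sum of named loss terms; the weights are hypothetical."""
    return sum(weights[name] * value for name, value in terms.items())
```

In such a setup, the relative weights decide how strongly visual realism is traded against posture accuracy; the text does not disclose the values used.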

It will be apparent to those skilled in the art that the above embodiments are provided for illustration only and not for limitation of the invention, and that variations and modifications of the above described embodiments are intended to fall within the scope of the claims of the invention as long as they fall within the true spirit of the invention.

Claims

1. An eye correction method based on codec and feature decoupling, characterized by comprising the following steps:

Step d, performing feature extraction on the vector attribute code z_i and the rotation attribute code z_pre to generate a target eye image I_pre, and

Step e, pasting the target eye image I_pre back into the original face image I and outputting the face image after eye correction,

Wherein step c further comprises the steps of:

Step c1, acquiring a three-dimensional rotation matrix R_pre corresponding to the current eye posture information G_pre;

Step c2, acquiring a three-dimensional rotation matrix R_tar corresponding to the target eye posture information G_tar;

Step c3, performing a rotation inversion operation with R_pre, and then performing a transformation operation with the rotation matrix R_tar, to obtain the rotation attribute code z_pre corresponding to the target eye posture.

2. The eye correction method based on codec and feature decoupling according to claim 1, wherein acquiring the eye image of the user in step a further comprises the steps of:

Step a1, identifying a plurality of corresponding face key points from the original face image I;

Step a2, locating an eye region from the plurality of face key points;

3. The eye correction method based on codec and feature decoupling according to claim 1, wherein in step a, the head posture information H_gt includes an up-down rotation angle, i.e., a pitch angle, and a left-right rotation angle, i.e., a yaw angle.

4. The eye correction method based on codec and feature decoupling according to claim 1, wherein in step b, the feature extraction is performed with a re-parameterized convolutional neural network composed of a plurality of cascaded convolutional layers and activation functions.

5. The eye correction method based on codec and feature decoupling according to claim 1, wherein in steps c1 and c2, the current eye posture information is G_pre = (P_pre, Y_pre), where P_pre represents a current pitch angle and Y_pre represents a current yaw angle, and the target eye posture information is G_tar = (P_tar, Y_tar), where P_tar represents a target pitch angle and Y_tar represents a target yaw angle,

wherein the R_pre matrix includes a left-right rotation and an up-down rotation, and the R_tar matrix includes a left-right rotation and an up-down rotation.

6. The eye correction method based on codec and feature decoupling according to claim 1, wherein in step d, the feature extraction is performed with a re-parameterized convolutional neural network composed of a bilinear upsampling layer, a plurality of cascaded convolutional layers, and activation functions.

7. An eye correction system based on codec and feature decoupling, using the method of any one of claims 1-6, comprising:

8. The system of claim 7, wherein the feature extraction module and the decoding module perform eye correction with a trained multi-layer convolutional neural network to which re-parameterization has been applied.

9. The system of claim 7, further comprising a supervisory optimization module for optimizing parameters in the convolutional neural networks of the feature extraction module and the decoding module by data-driven training, the supervisory optimization module using a plurality of loss functions.