CN117523670A

CN117523670A - Gait recognition method and system based on multi-modal feature fusion

Info

Publication number: CN117523670A
Application number: CN202311551406.4A
Authority: CN
Inventors: 李心慧; 石柱国; 李凡平
Original assignee: Qingdao Yisa Data Technology Co Ltd; ISSA Technology Co Ltd
Current assignee: Qingdao Yisa Data Technology Co Ltd; ISSA Technology Co Ltd
Priority date: 2023-11-20
Filing date: 2023-11-20
Publication date: 2024-02-06

Abstract

The invention discloses a gait recognition method and system based on multi-modal feature fusion. The method comprises: acquiring a video stream containing a target object; extracting a gait sequence diagram of the target object from the video stream; selecting a face optimal frame and a pedestrian optimal frame from the gait sequence diagram of the target object; extracting face features of the target object from the face optimal frame; extracting human body features of the target object from the pedestrian optimal frame; extracting gait features of the target object from the gait sequence diagram; performing weighted fusion on the face features, human body features, and gait features of the target object to obtain a multi-modal fusion feature of the target object; and performing identity recognition on the target object according to its multi-modal fusion feature to obtain an identity recognition result of the target object. The accuracy of identity recognition is thereby improved.

Description

Gait recognition method and system based on multi-modal feature fusion

Technical Field

The invention relates to the technical field of identity recognition, and in particular to a gait recognition method and system based on multi-modal feature fusion.

Background

The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.

Gait is the change in a person's posture during walking. Gait recognition is a biometric technology that identifies individuals from the biological and behavioral characteristics contained in their gait. Its basic goal is to capture a video of a pedestrian walking normally and compare it with stored walking videos in order to find the person in the database who corresponds to that pedestrian. Because pedestrians differ in muscle strength, tendon and bone length, bone density, center of gravity, and so on, these differences can be used to identify a person uniquely. Gait recognition can be realized either by building a human motion model from these characteristics or by extracting features directly from the human contour.

Gait recognition has achieved some success at long distances and under uncontrolled conditions, but in practical applications it is subject to many constraints, so its performance remains unsatisfactory. Current gait recognition techniques rely primarily on gait-cycle features such as step size, pace, and gait rhythm. Although these features reflect the basic regularities of human walking, they are strongly affected by real-world conditions such as complex backgrounds, severe occlusion, different camera viewing angles, and lighting that is too bright or too dark, which lowers recognition accuracy. Gait recognition is also limited by data samples and algorithms: improving the technique requires a large number of training samples, and when the population is highly diverse these samples are difficult to collect and process, which further lowers the accuracy of gait recognition.

Disclosure of Invention

To solve the above problems, the invention provides a gait recognition method and system based on multi-modal feature fusion that improve the accuracy of identity recognition.

In order to achieve the above purpose, the invention adopts the following technical scheme:

In a first aspect, a gait recognition method based on multi-modal feature fusion is provided, including:

acquiring a video stream containing a target object;

extracting a gait sequence diagram of a target object from a video stream;

selecting a human face optimal frame and a pedestrian optimal frame from a gait sequence diagram of a target object;

extracting face features of a target object from the face optimal frame;

extracting human body characteristics of a target object from the pedestrian optimal frame;

extracting gait characteristics of a target object from the gait sequence diagram;

performing weighted fusion on the face features, the human body features and the gait features of the target object to obtain a multi-modal fusion feature of the target object;

and performing identity recognition on the target object according to the multi-modal fusion feature of the target object to obtain an identity recognition result of the target object.

In a second aspect, a gait recognition system based on multi-modal feature fusion is provided, including:

the video acquisition module is used for acquiring a video stream containing a target object;

the gait sequence diagram acquisition module is used for extracting a gait sequence diagram of the target object from the video stream;

the optimal frame determining module is used for selecting a human face optimal frame and a pedestrian optimal frame from the gait sequence diagram of the target object;

the feature extraction module is used for extracting the face features of the target object from the face optimal frame; extracting human body characteristics of a target object from the pedestrian optimal frame; extracting gait characteristics of a target object from the gait sequence diagram;

the feature fusion module is used for performing weighted fusion on the face features, the human body features and the gait features of the target object to obtain a multi-modal fusion feature of the target object;

and the identity recognition module is used for performing identity recognition on the target object according to the multi-modal fusion feature of the target object to obtain an identity recognition result of the target object.

In a third aspect, an electronic device is provided that includes a memory and a processor, and computer instructions stored on the memory and running on the processor, which when executed by the processor, perform the steps described for a multi-modal feature fusion-based gait recognition method.

In a fourth aspect, a computer readable storage medium is provided for storing computer instructions that, when executed by a processor, perform the steps of a gait recognition method based on multi-modal feature fusion.

Compared with the prior art, the invention has the beneficial effects that:

1. The invention fuses the gait features, human body features, and face features into a multi-modal fusion feature and identifies the target object according to that fusion feature, which improves the accuracy of identity recognition of the target object.

2. The invention determines the face optimal frame and the pedestrian optimal frame from the gait sequence diagram and extracts the face features and human body features from them respectively, which ensures the accuracy of the extracted face features and human body features. When the face features, human body features, and gait features are then fused for identity recognition of the target object, the accuracy of the recognition result is ensured.

3. The invention evaluates quality scores for the face features, human body features, and gait features, determines the weight of each feature from its quality score, and performs weighted fusion on the three features according to these weights. The resulting multi-modal fusion feature takes feature quality into account and does not give excessive weight to features with low quality scores, which further improves the accuracy of the recognition result when the multi-modal fusion feature is used to identify the target object.

Additional aspects of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application.

FIG. 1 is a flow chart of an embodiment of the disclosed method;

FIG. 2 is a flow chart of the quality judgment of the embodiment.

Detailed Description

The invention will be further described with reference to the drawings and examples.

It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the present application. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.

Example 1

Current gait recognition techniques rely primarily on gait-cycle features such as step size, pace, and gait rhythm. Although these features reflect the basic regularities of human walking, they are strongly affected by real-world conditions such as complex backgrounds, severe occlusion, different camera viewing angles, and lighting that is too bright or too dark, which results in poor recognition accuracy. Gait recognition is also limited by data samples and algorithms: improving the technique requires a large number of training samples, and when the population is highly diverse these samples are difficult to collect and process.

In addition, current developments in gait recognition technology ignore the influence of other biological characteristics such as height, weight, and age. These characteristics affect human walking: for example, height and weight affect step size and stride frequency, while age and health affect gait rhythm and pace. Because current gait recognition techniques do not take these characteristics into account, their recognition accuracy is lower.

To solve the above technical problems, this embodiment discloses a gait recognition method based on multi-modal feature fusion which, as shown in FIG. 1, includes:

S1: A video stream containing a target object is acquired.

In this embodiment, a video stream containing a target object is acquired by a camera or a similar device.

The acquired video stream may contain one or more target objects.

S2: a gait sequence diagram of the target object is extracted from the video stream.

When a plurality of target objects are contained in the video stream, a gait sequence diagram of each target object is extracted from the video stream.

The process for extracting the gait sequence diagram of the target object from the video stream comprises the following steps: detecting and tracking a target object in a video stream to obtain an image sequence of the target object; and obtaining a gait sequence diagram of the target object according to the image sequence of the target object.

Specifically, each video frame in the video stream that contains a target object x_i is preprocessed (size adjustment, normalization, channel-order adjustment, and the like) to obtain a preprocessed image.
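The preprocessing step can be sketched as follows. This is a minimal illustration only: the nearest-neighbour resize, the BGR input order, the division by 255, and the function name `preprocess` are assumptions, since the text names the operations but not their parameters.

```python
from typing import List

Pixel = List[int]          # [B, G, R] values in 0..255 (assumed input order)
Image = List[List[Pixel]]  # height x width x 3 (HWC layout)


def preprocess(frame: Image, out_h: int, out_w: int) -> List[List[List[float]]]:
    """Resize, reorder channels, and normalize one frame.

    Returns a 3 x out_h x out_w (CHW) nested list with RGB values in [0, 1].
    """
    in_h, in_w = len(frame), len(frame[0])
    # Size adjustment: nearest-neighbour mapping of each output pixel to a source pixel.
    resized = [
        [frame[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]
    # Channel-order adjustment (BGR -> RGB, HWC -> CHW) and normalization to [0, 1].
    bgr_to_rgb = (2, 1, 0)
    return [
        [[resized[y][x][c] / 255.0 for x in range(out_w)] for y in range(out_h)]
        for c in bgr_to_rgb
    ]


# A 2x2 pure-blue frame (BGR) upscaled to 4x4.
blue = [[[255, 0, 0] for _ in range(2)] for _ in range(2)]
tensor = preprocess(blue, 4, 4)
```

In practice an image library would perform the resize; pure Python is used here only to keep the sketch self-contained.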

A target detection model performs target detection on the preprocessed images and determines the target object in each preprocessed image.

Preferably, the target detection model takes the preprocessed image as input and the identified target object as output, and is built on the YOLOv7 network.

The preprocessed image is input into the target detection model. The model first extracts semantic information and spatial information from the preprocessed image through a convolutional neural network and fuses the feature information of different layers to obtain a more comprehensive and accurate feature representation. It then classifies and localizes target objects on the fused feature map using convolutional and fully connected layers, post-processes the detection results (including non-maximum suppression (NMS), confidence-threshold screening, and the like), and finally outputs the detection result for target object x_i.
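The post-processing named above (confidence-threshold screening followed by NMS) can be illustrated with a short sketch. The thresholds 0.25 and 0.45 and the function names are assumptions chosen for illustration; the text does not give values.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def postprocess(boxes: Sequence[Box], scores: Sequence[float],
                conf_thr: float = 0.25, iou_thr: float = 0.45) -> List[int]:
    """Return indices of the boxes kept after screening and NMS."""
    # Confidence-threshold screening, then sort survivors by descending score.
    order = sorted((i for i, s in enumerate(scores) if s >= conf_thr),
                   key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap a higher-scoring kept box too much.
        if all(iou(boxes[i], boxes[k]) <= iou_thr for k in keep):
            keep.append(i)
    return keep


boxes = [(10, 10, 50, 90), (12, 12, 52, 92), (100, 20, 140, 100), (0, 0, 5, 5)]
scores = [0.90, 0.80, 0.75, 0.10]
kept = postprocess(boxes, scores)  # box 1 overlaps box 0, box 3 is low confidence
```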

After target object x_i is detected, target tracking is performed with the DeepSORT method: a unique ID number is generated for each detected target object, and a deep learning model extracts its feature vector for subsequent target matching. Appearance features are extracted for the detected target object x_i, similarity scores against all target objects tracked in the previous frame are calculated, and the one with the highest score is selected as the matching target. Next, for a matching target whose similarity score exceeds a threshold, a Kalman filter predicts its position in the next frame and updates its state information (position, speed, etc.). DeepSORT can still assign the original ID to a target that reappears after being occluded, which improves the tracking accuracy for moving targets. Detecting and tracking the target object in the video stream in this way yields an image sequence of the target object.

The humanoid region of the target object is determined in the image sequence of the target object, and the gait sequence diagram of the target object is obtained from the humanoid region.
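The matching and prediction steps described above can be sketched as follows. This follows the wording of the text (highest similarity score, then a threshold, then a Kalman filter) and is not the full DeepSORT algorithm. Cosine similarity, the threshold 0.6, the one-dimensional constant-velocity model, and all names are assumptions.

```python
import math
from typing import Dict, Optional, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv) if nu and nv else 0.0


def match_detection(det_feat: Sequence[float],
                    tracks: Dict[int, Sequence[float]],
                    thr: float = 0.6) -> Optional[int]:
    """Return the ID of the most similar track, or None if below the threshold."""
    if not tracks:
        return None
    best_id = max(tracks, key=lambda tid: cosine(det_feat, tracks[tid]))
    return best_id if cosine(det_feat, tracks[best_id]) > thr else None


class ConstantVelocityKalman1D:
    """Kalman filter for one coordinate with state (position, velocity), dt = 1."""

    def __init__(self, pos: float, var: float = 10.0, q: float = 0.01, r: float = 1.0):
        self.p, self.v = pos, 0.0
        self.P = [[var, 0.0], [0.0, var]]  # state covariance
        self.q, self.r = q, r              # process and measurement noise

    def predict(self) -> float:
        (a, b), (c, d) = self.P
        self.p += self.v                   # position advances by the velocity
        self.P = [[a + b + c + d + self.q, b + d], [c + d, d + self.q]]
        return self.p

    def update(self, z: float) -> None:
        (a, b), (c, d) = self.P
        k0, k1 = a / (a + self.r), c / (a + self.r)  # Kalman gain
        y = z - self.p                               # innovation
        self.p += k0 * y
        self.v += k1 * y
        self.P = [[(1 - k0) * a, (1 - k0) * b], [c - k1 * a, d - k1 * b]]


tracks = {1: [1.0, 0.0, 0.0], 2: [0.0, 1.0, 0.0]}
matched = match_detection([0.9, 0.1, 0.0], tracks)    # closest to track 1
unmatched = match_detection([0.0, 0.0, 1.0], tracks)  # similar to no track

kf = ConstantVelocityKalman1D(pos=0.0)
for z in range(1, 21):       # target moves one unit per frame
    kf.predict()
    kf.update(float(z))
next_pos = kf.predict()      # predicted position in the next frame
```

A full tracker would run one filter per bounding-box coordinate and resolve conflicts when several detections prefer the same track.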

S3: A face optimal frame and a pedestrian optimal frame are selected from the gait sequence diagram of the target object.

After the gait sequence diagram of target object x_i is obtained, the frames in which the face quality and the human body quality of x_i are best are selected from the current sequence as the face optimal frame and the pedestrian optimal frame, respectively.

Preferably, a maximum motion stabilization method is used to select the face optimal frame and the pedestrian optimal frame from the gait sequence diagram of the target object. Specifically:

During target tracking of target object x_i, the face region and the human body region of x_i are tracked separately, and the motion speed and motion direction of each region are calculated in the different frames. For each region, the motion speed difference and the motion direction change between each frame and its adjacent frames are calculated and combined by weighted summation into the motion variation of that frame relative to its adjacent frames. The frame with the smallest motion variation relative to its adjacent frames is the most motion-stable frame and is taken as the optimal frame. The optimal frame determined from the face region is the face optimal frame, and the optimal frame determined from the human body region is the pedestrian optimal frame.

The maximum motion stabilization method avoids frames in which the target moves rapidly or changes its direction of motion, which improves tracking robustness.
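The selection rule above can be sketched as follows for one tracked region. Several details are assumptions: the region is reduced to its centre point, velocity is the displacement from the previous frame, the motion variation is averaged over the available neighbours, and the weights 0.5 and 0.5 and the function name are illustrative.

```python
import math
from typing import List, Sequence, Tuple


def most_stable_frame(centers: Sequence[Tuple[float, float]],
                      w_speed: float = 0.5, w_dir: float = 0.5) -> int:
    """Return the index of the frame whose motion differs least from its neighbours."""
    if len(centers) < 3:
        raise ValueError("need at least three frames")
    # Velocity of frame t is its displacement from frame t-1 (frame 0 has none).
    vel = [(x1 - x0, y1 - y0) for (x0, y0), (x1, y1) in zip(centers, centers[1:])]
    speed: List[float] = [math.hypot(vx, vy) for vx, vy in vel]
    direction: List[float] = [math.atan2(vy, vx) for vx, vy in vel]

    def angle_diff(a: float, b: float) -> float:
        d = abs(a - b) % (2 * math.pi)
        return min(d, 2 * math.pi - d)

    best_frame, best_var = -1, math.inf
    for k in range(len(vel)):
        nbrs = [j for j in (k - 1, k + 1) if 0 <= j < len(vel)]
        # Weighted sum of speed difference and direction change, per neighbour.
        var = sum(w_speed * abs(speed[k] - speed[j])
                  + w_dir * angle_diff(direction[k], direction[j])
                  for j in nbrs) / len(nbrs)
        if var < best_var:
            best_frame, best_var = k + 1, var  # vel[k] belongs to frame k + 1
    return best_frame


# Fast start, steady walk to the right, then a sudden turn upward.
centers = [(0, 0), (5, 0), (6, 0), (7, 0), (8, 0), (8, 6)]
best = most_stable_frame(centers)
```

Running the same function once on the face-region track and once on the body-region track would give the face optimal frame and the pedestrian optimal frame, respectively.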

S4: Face features of the target object are extracted from the face optimal frame; human body features of the target object are extracted from the pedestrian optimal frame; and gait features of the target object are extracted from the gait sequence diagram.

In this embodiment, after the face optimal frame and the pedestrian optimal frame of the target object are selected, face features and human body features are extracted from them and used to assist gait recognition of the target object.

When the face features and human body features are extracted, the face region of the target object is determined in the face optimal frame and the human body region of the target object is determined in the pedestrian optimal frame. Face features f_i1 of the target object are extracted from the face region of the target object, and human body features f_i2 of the target object are extracted from the human body region of the target object.

Preferably, a face feature extraction network extracts the face features from the face region and a human body feature extraction network extracts the human body features from the human body region. Both networks are built as deep neural networks and are obtained by training on corresponding training data.

The face features f_i1 include the gender, age, ethnicity, and the like of the target object.

The human body features f_i2 include the height, weight, shooting angle, and the like of the target object.

In addition, this embodiment uses a gait feature extraction network to extract gait features f_i3 from the gait sequence diagram of target object x_i.

The face feature extraction network, the human body feature extraction network, and the gait feature extraction network are all built as deep convolutional neural networks and are obtained by training on corresponding training data.

The gait features f_i3 include stride, stride frequency, and gait cycle characteristics.

S5: Weighted fusion is performed on the face features, human body features, and gait features of the target object to obtain the multi-modal fusion feature of the target object.

Because the face, human body, and gait features produced by a pedestrian target while walking can be disturbed by various factors, quality evaluation is performed on all three before the face, human body, and gait features are extracted. The unsupervised quality judgment algorithm designed here works together with the corresponding face, human body, and gait feature extraction models: it receives the face optimal frame, pedestrian optimal frame, and gait sequence diagram to be evaluated and gives each input image a quality score from the standpoint of recognition rather than visual quality.

The quality score of each of the face features, human body features, and gait features is determined first; the weight of each feature is then determined from its quality score, and the three features are fused by weighting.

Preferably, the quality score of each feature is mapped into the interval from 0 to 1 to obtain the weight of that feature.

The process of determining the quality score of each feature is as follows:

As shown in FIG. 2, Input I is the input vector I, that is, the vector fed into the model after the image has undergone preprocessing and similar operations; the image is specifically the face optimal frame, the pedestrian optimal frame, or the gait sequence diagram. Output O(I) is the feature O(I) output by the feature extraction network M: face features are output when M is the face feature extraction network, human body features when M is the human body feature extraction network, and gait features when M is the gait feature extraction network.

The feature extraction network M comprises a plurality of feature extraction sub-networks. M_i denotes a feature extraction sub-network of M, obtained from M by randomly dropping out some neurons, and o_i denotes the feature vector output by the corresponding sub-network. For the feature vectors output by all sub-networks, the Euclidean distance d(o_i, o_j) between every pair of features is calculated. After all the feature vectors have been processed, the quality score is calculated as follows:

where q(O(I)) is the quality score of the output feature O(I), σ is the sigmoid activation function, and m is the number of sub-networks of the feature extraction network; the sigmoid activation function maps q into the interval from 0 to 1.
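The quality formula is not reproduced in this text. One form consistent with the symbols defined here, and with the published SER-FIQ face image quality method that this procedure closely resembles, is q(O(I)) = 2σ(−(2/m²) Σ_{i<j} d(o_i, o_j)). The sketch below implements that form as an assumption, not as the patent's own formula; the function name is hypothetical, and the sub-network outputs are passed in as plain lists rather than produced by a real dropout network.

```python
import math
from itertools import combinations
from typing import Sequence


def quality_score(sub_outputs: Sequence[Sequence[float]]) -> float:
    """Quality of one input from the agreement of m sub-network outputs o_1..o_m.

    Assumed form: q = 2 * sigmoid(-(2 / m**2) * sum_{i<j} d(o_i, o_j)),
    where d is the Euclidean distance.
    """
    m = len(sub_outputs)
    if m < 2:
        raise ValueError("need at least two sub-network outputs")
    total = sum(math.dist(a, b) for a, b in combinations(sub_outputs, 2))
    z = -2.0 * total / (m * m)      # z <= 0 because distances are non-negative
    ez = math.exp(z)                # safe: exp of a non-positive number
    return 2.0 * ez / (1.0 + ez)    # 2 * sigmoid(z), which lies in (0, 1]


stable = quality_score([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])     # outputs agree
unstable = quality_score([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])  # outputs scatter
```

Because the distances are non-negative, the sigmoid argument is never positive and σ is at most 0.5; the factor 2 is what stretches the score to the interval from 0 to 1. Outputs that agree across sub-networks score close to 1, and outputs that scatter score lower.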

In this way, quality evaluation is performed separately on the face features, human body features, and gait features to obtain their quality scores, and the quality scores are mapped into the interval from 0 to 1 to obtain the weights of the face features, human body features, and gait features, denoted q₁, q₂, and q₃ respectively.

Using the weights q₁, q₂, q₃, weighted fusion is performed on the face features f_i1, human body features f_i2, and gait features f_i3 acquired in S4 to obtain the multi-modal fusion feature f_i of the target object:

where f_i denotes the multi-modal fusion feature extracted for target object x_i, and q₁, q₂, q₃ denote the weights of the face features f_i1, human body features f_i2, and gait features f_i3, respectively.
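The fusion formula is not reproduced in this text, so the exact operation is unknown. The sketch below shows one plausible reading, a weighted concatenation of L2-normalized features, f_i = [q₁·f_i1, q₂·f_i2, q₃·f_i3]. Both the concatenation and the normalization are assumptions; a weighted sum would be an equally plausible reading if the three features share one dimension.

```python
import math
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)


def fuse(features: Sequence[Sequence[float]], weights: Sequence[float]) -> List[float]:
    """Concatenate the features after scaling each one by its quality weight."""
    if len(features) != len(weights):
        raise ValueError("one weight per feature is required")
    fused: List[float] = []
    for f, q in zip(features, weights):
        # A low-quality modality contributes a shorter block to the fused vector.
        fused.extend(q * x for x in l2_normalize(f))
    return fused


face, body, gait = [3.0, 4.0], [0.0, 2.0, 0.0], [1.0]  # toy f_i1, f_i2, f_i3
f_i = fuse([face, body, gait], [0.9, 0.5, 0.2])         # toy q1, q2, q3
```

With this reading, a modality whose quality weight is near zero has almost no influence on distances computed between fused vectors.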

S6: Identity recognition is performed on the target object according to its multi-modal fusion feature to obtain the identity recognition result of the target object.

Because the multi-modal fusion feature calculated in S5 takes the face features, human body features, and gait features into account together, using it for identity recognition effectively improves the accuracy of identity recognition.
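The text does not specify how the fusion feature is compared with stored identities. A minimal sketch, assuming a gallery of enrolled fusion features, cosine similarity, and an acceptance threshold of 0.7 (all three are assumptions, as are the names), could look like this:

```python
import math
from typing import Dict, Optional, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv) if nu and nv else 0.0


def identify(query: Sequence[float],
             gallery: Dict[str, Sequence[float]],
             thr: float = 0.7) -> Tuple[Optional[str], float]:
    """Return (identity, similarity); identity is None when no entry is close enough."""
    best_id, best_sim = None, -1.0
    for ident, feat in gallery.items():
        sim = cosine(query, feat)
        if sim > best_sim:
            best_id, best_sim = ident, sim
    if best_sim < thr:
        return None, best_sim
    return best_id, best_sim


gallery = {"person_a": [1.0, 0.0, 0.0], "person_b": [0.0, 1.0, 0.0]}
who, sim = identify([0.9, 0.1, 0.0], gallery)      # close to person_a
unknown, _ = identify([0.0, 0.0, 1.0], gallery)    # matches nobody
```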

Example 2

In this embodiment, a gait recognition system based on multi-modal feature fusion is disclosed, comprising:

Example 3

In this embodiment, an electronic device is disclosed that includes a memory and a processor, and computer instructions stored on the memory and running on the processor that, when executed by the processor, perform the steps described in the gait recognition method based on multimodal feature fusion disclosed in embodiment 1.

Example 4

In this embodiment, a computer readable storage medium is disclosed for storing computer instructions that, when executed by a processor, perform the steps of a gait recognition method based on multimodal feature fusion disclosed in embodiment 1.

Finally, it should be noted that: the above embodiments are only for illustrating the technical aspects of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the above embodiments, it should be understood by those of ordinary skill in the art that: modifications and equivalents may be made to the specific embodiments of the invention without departing from the spirit and scope of the invention, which is intended to be covered by the claims.

Claims

1. A gait recognition method based on multi-modal feature fusion is characterized by comprising the following steps:

acquiring a video stream containing a target object;

extracting a gait sequence diagram of a target object from a video stream;

extracting face features of a target object from the face optimal frame;

2. The gait recognition method based on multi-modal feature fusion of claim 1, wherein the target object in the video stream is detected and tracked to obtain an image sequence of the target object; and obtaining a gait sequence diagram of the target object according to the image sequence of the target object.

3. The gait recognition method based on multi-modal feature fusion of claim 1, wherein a face region of a target object in a face optimal frame and a human body region of a target object in a pedestrian optimal frame are determined;

extracting the face characteristics of the target object from the face area of the target object;

human body characteristics of the target object are extracted from a human body region of the target object.

4. The gait recognition method based on multi-modal feature fusion of claim 1, wherein the optimal frames of the face and the optimal frames of the pedestrian are selected from the gait sequence diagram of the target object by using a maximum motion stabilization method.

5. The method for gait recognition based on multi-modal feature fusion of claim 1, wherein a quality score for each of the face feature, the body feature and the gait feature is determined; and determining the weight of the corresponding feature according to the quality score of each feature, and carrying out weighted fusion on the three features.

6. The gait recognition method based on multi-modal feature fusion of claim 5, wherein the quality score of each feature is mapped into a 0 to 1 interval, and a weight for each feature is obtained.

7. The method for gait recognition based on multi-modal feature fusion as claimed in claim 1, wherein the gait features comprise stride, stride frequency and gait cycle;

the face features include gender, age, and ethnicity of the target subject;

the human body characteristics include the height, weight and shooting angle of the target object.

8. A gait recognition system based on multi-modal feature fusion, comprising:

9. An electronic device comprising a memory and a processor and computer instructions stored on the memory and running on the processor, which when executed by the processor, perform the steps of a multimodal feature fusion-based gait recognition method as claimed in any one of claims 1 to 7.

10. A computer readable storage medium storing computer instructions which, when executed by a processor, perform the steps of a multimodal feature fusion-based gait recognition method as claimed in any one of claims 1 to 7.