CN116798071B - Indoor user gesture and position recognition method and system based on WIFI perception

Info

Publication number: CN116798071B
Application number: CN202310751686.7A
Authority: CN
Inventors: 廖学文; 周靖淦; 高贞贞; 吕刚明; 李昂
Original assignee: Xian Jiaotong University
Current assignee: Xian Jiaotong University
Priority date: 2023-06-25
Filing date: 2023-06-25
Publication date: 2026-02-03
Anticipated expiration: 2043-06-25
Also published as: CN116798071A

Abstract

Preprocessed high-dimensional WiFi data contains both the gesture motion and the position information of the person making the gesture. A double-flow 2D neural network serves as a shared layer for two recognition tasks, namely gesture recognition and position recognition, and performs the feature extraction. After a feature fusion step, a 3D neural network extracts features again, and a specially designed loss function in the training stage lets the multi-task learning of the two tasks converge. Experiments show that both tasks achieve good recognition performance under the same neural network framework.

Description

Indoor user gesture and position recognition method and system based on WIFI perception

Technical Field

The invention belongs to the technical field of wireless communication and artificial intelligence recognition, and particularly relates to a recognition method and system for indoor user gestures and positions based on WIFI perception.

Background

In recent years, the development of artificial intelligence technologies has greatly advanced indoor sensing. Indoor sensing tasks generally refer to tasks that analyze and determine the activity of a person in a room; important subtasks include gesture recognition and recognition of the user's location (positioning).

At present, image-based recognition methods offer mature solutions for Internet-of-Things smart homes. However, image-based solutions are often limited in practice by user privacy concerns and by factors such as lighting.

With the progress of wireless communication technology, methods based on radio-frequency signals, such as millimeter wave, RFID, and various wearable devices, are gradually attracting research attention because of their better privacy protection. Among them, millimeter-wave and RFID technologies are often unsuitable for ordinary households because the supporting equipment is expensive. Wearable smart-sensing devices, such as wristbands, are also limited in application because they rely on the user wearing them. WiFi devices are common in households, and the properties of WiFi signals make them suitable for indoor sensing; in recent years WiFi sensing has attracted growing attention and has been widely studied. In gesture recognition, for example, Liu Yunhao's group at Tsinghua University extracted higher-level gesture motion information from the Doppler velocity obtained from WiFi signals and combined it with deep learning to build the gesture recognition system Widar3.0 [1]. Document [2] uses the same data set as document [1], and its authors propose a deep learning method based on a 3D convolutional neural network for gesture recognition and user recognition. For positioning, the current approach is WiFi-based fingerprint positioning, which is mature and close to commercial deployment.

However, the above work does not fully exploit the correlation between the two tasks of gesture recognition and position recognition. In practical solutions, the two tasks are usually implemented as two independent systems, so software and hardware resources are not well utilized, and no existing scheme completes both tasks at once.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provides a method and a system for recognizing indoor user gestures and positions based on WIFI perception, so as to solve the problem of duplicated hardware and software resources caused by the lack of correlation between solutions for indoor sensing subtasks.

In order to achieve the purpose, the invention is realized by adopting the following technical scheme:

A recognition method for indoor user gestures and positions based on WIFI perception comprises the following steps:

Step 1, receiving a disturbed WiFi signal and obtaining a three-dimensional video stream data set, wherein the disturbance is caused by gesture change;

Step 2, transposing the gesture information data set to obtain a data set D1 and a data set D2 in two formats, and processing the data set D1 and the data set D2 by a 2D convolutional neural network to obtain two high-dimensional feature maps;

Step 3, fusing the two high-dimensional features through independent channels to obtain fused features, and extracting the fused features through a 3D convolutional neural network to obtain the type probability and the indoor position probability of gesture change;

Step 4, outputting the type of gesture change and the indoor position.

The invention further improves that:

Preferably, in step 1, the preprocessing is zero-padding of the three-dimensional video stream data set.

Preferably, in step 1, the normalization processing is to restrict the preprocessed dataset to the interval of [0,1] to obtain the gesture information dataset.

Preferably, in step 2, the gesture information data set is in the form n_t-X-Y, the data set D1 is in the form X-n_t-Y, and the data set D2 is in the form Y-n_t-X, where X is the size of the Doppler component on the X axis, Y is the size of the Doppler component on the Y axis, and n_t is a set constant greater than or equal to 36.

Preferably, in step 2, the 2D convolutional neural network comprises a 2D convolutional layer, a batch sample normalization 2DBatchNorm layer, a Relu activation function layer and a pooling layer, and the 2D convolutional neural network comprises two branches for processing the data set D1 and the data set D2 respectively.

Preferably, the 2D convolution layer is configured to apply different convolution kernels to the data sets, multiplying and then summing, to obtain a feature map U1 and a feature map U2;

the batch sample normalization 2DBatchNorm layer is used for normalizing the data;

the ReLU activation function layer is used for processing the normalized data through the ReLU function;

the pooling layer is used for downsampling the feature maps U1 and U2 and reducing the dimension of the feature matrices, to obtain a feature map U1 of dimension [...] and a feature map U2 of dimension [...].

Preferably, in step 3, the two high-dimensional feature maps are transposed and then superimposed to obtain a feature map of size [...], and the feature map of size [...] is processed through a 3D neural network.

Preferably, the 3D neural network comprises two 3D convolution layers, a ReLU layer, a batch sample normalization 3DBatchNorm layer, a 3D pooling layer and a fully connected layer;

the two 3D convolution layers are used for multiplying and then summing the data in the feature map of size [...] over three dimensions to obtain a feature map U5 and a feature map U6;

the fully connected layer, with softmax, is used to predict the gesture type and the position.

Preferably, the 2D convolutional neural network and the 3D convolutional neural network are both obtained by training with the loss function Loss using a gradient descent method.

An indoor user gesture and location recognition system based on WIFI sensing, comprising:

an input module, used for receiving the disturbed WiFi signal to obtain a three-dimensional video stream data set, wherein the disturbance is caused by gesture change;

a 2D extraction module, used for transposing the gesture information data set to obtain a data set D1 and a data set D2 in two formats, and for processing the data set D1 and the data set D2 by the 2D convolutional neural network to obtain two high-dimensional feature maps;

a 3D extraction module, used for fusing the two high-dimensional features through independent channels to obtain fused features, and for extracting the fused features through a 3D convolutional neural network to obtain the type probability and the indoor position probability of gesture change;

an output module, used for outputting the type of gesture change and the indoor position.

Compared with the prior art, the invention has the following beneficial effects:

The invention discloses a joint recognition method for indoor gestures and positions based on WIFI sensing. For these two classical indoor sensing subtasks, it provides a method that performs joint recognition on data derived from the Doppler frequency shift changes of WiFi signals, which arise while the user performs a gesture. The method designs a low-complexity double-flow parallel neural network recognition framework built on 2D and 3D convolutional neural network structures, adopts multi-task learning, balances the losses of the two subtasks, and obtains good performance. Compared with existing schemes that handle each sensing task independently, it is more practical, makes better use of resources, and achieves better system integration.

The invention aims to provide a method for jointly recognizing gestures and positions based on WIFI signals, addressing the shortcomings of existing indoor wireless sensing. Existing methods rarely exploit the correlation between position recognition (positioning) and gesture recognition, and usually handle these indoor sensing subtasks with two independent schemes. The method operates on BVP (body-coordinate velocity profile) data, a high-level synthesized feature that records the Doppler frequency shift changes of the WiFi signal, and adopts a double-flow parallel 2D convolutional feature extraction architecture together with a feature fusion structure. Compared with the best-performing existing architecture, which is a pure 3D convolutional neural network, the network uses a low-complexity double-flow 2D neural network in place of a 3D neural network. Moreover, the method is based on multi-task learning and accomplishes two indoor sensing tasks, namely position recognition and gesture recognition. It also senses different gestures and different user positions with high accuracy under changes of individual user, user orientation and experimental environment, so the algorithm is robust.

Drawings

FIG. 1 is a system flow diagram;

FIG. 2 is an experimental data acquisition scenario;

FIG. 3 is a diagram showing the change of the loss of the double-perception task in the training stage;

FIG. 4 shows the accuracy of gesture recognition and position recognition.

Detailed Description

The invention is described in further detail below with reference to the attached drawing figures:

In the description of the present invention, it should be noted that terms such as "center," "upper," "lower," "left," "right," "vertical," "horizontal," "inner," and "outer" are used merely for convenience and to simplify the description, rather than to indicate or imply that the apparatus or elements referred to must have a specific orientation or be constructed and operated in a specific orientation, and thus should not be construed as limiting the present invention. The terms "first," "second," and "third" are used for descriptive purposes only and should not be construed as indicating or implying relative importance. Furthermore, unless otherwise specifically stated or defined, the terms "mounted," "connected," and "coupled" should be construed broadly; for example, a connection may be fixed or detachable, may be made indirectly through intermediaries, or may be internal communication between two elements. The specific meaning of the above terms in the present invention will be understood in specific cases by those of ordinary skill in the art.

The basic principle of the invention is that, in a typical closed indoor environment, a signal travels over multiple paths and a varying channel between transmission and reception. The multiple paths arise from reflections in the enclosed environment, while the channel varies regularly during the process, mainly because of the user's hand movements. Only the change in the primary reflection path produced by the user's hand corresponds to the gesture action, and it is recorded as the gesture signal. Meanwhile, the variation pattern of the whole received signal depends on the positions of the transmitter and receiver and on the position of the user relative to them. Other factors, such as individual differences among users and the indoor environment, also affect this variation pattern. Therefore, a deep learning method can extract the features of how the received signal changes with the person's gesture and with the position relative to the transmitter and receiver, thereby achieving recognition.

The invention discloses a combined gesture recognition and position recognition method based on WiFi signals, which comprises the following steps:

S1, receiving a WiFi signal disturbed by gesture change, filtering and time-frequency processing the signal to obtain the Doppler frequency shift changes of the received signal, converting from world coordinates to user coordinates to normalize the resulting three-dimensional video stream data set, and then preprocessing and normalizing the data to obtain a gesture information data set of size N_t-X-Y.

S2, transposing the gesture information data set to obtain data D1 and data D2 in different formats, and extracting two parallel high-dimensional features from the data D1 and the data D2 through a double-flow 2D convolution feature extraction network.

S3, fusing the two high-dimensional features through independent channels to obtain fused features, extracting the features of the fused features through a 3D neural network, and obtaining the probability that the input data is a certain gesture or a certain indoor position.

S4, outputting the category of the gesture and the indoor position of the gesture.

The invention also discloses a combined gesture recognition and position recognition system based on the WiFi signal, which comprises:

The input module is used for receiving a WiFi signal disturbed by gesture change, filtering and time-frequency processing the signal to obtain the Doppler frequency shift changes of the received signal, and converting from world coordinates to user coordinates to normalize the resulting three-dimensional video stream data.

The 2D extraction module is used for transposing the gesture information data set to obtain data D1 and data D2 in different formats, and two parallel high-dimensional features are extracted from the data D1 and the data D2 through the double-flow 2D convolution feature extraction network.

The 3D extraction module is used for fusing the two high-dimensional features through independent channels to obtain fused features, and extracting the features of the fused features through a 3D neural network to obtain the probability that the input data is a certain gesture or a certain indoor position.

And the output module is used for outputting the category of the gesture and the indoor position of the gesture.

The invention discloses a method for acquiring a combined gesture recognition and position recognition system based on WiFi signals, which comprises the following steps:

S1, preprocessing and normalizing BVP data;

The preprocessing and normalization in step S1 are designed around the input format of the neural network. In the data set used in the experiment, a gesture made by the user disturbs the surrounding WiFi signal; the disturbed signal is received at the receiving end and then filtered and time-frequency processed to obtain the Doppler frequency shift changes caused by the gesture; a conversion from world coordinates to user coordinates then normalizes the resulting three-dimensional video stream data set, which is named the BVP (body-coordinate velocity profile) data set. Each sample in the data set corresponds to one gesture action, and besides gesture information each sample also contains the location feature of the person making the gesture; that is, every sample carries both gesture information and location information.

The i-th sample of the BVP data set has size T_i-X-Y, where T_i is the time dimension and X and Y are the spatial dimensions; together they characterize the gesture over time and the position of the user when making the gesture. X is the size of the Doppler component on the X axis and Y is the size of the Doppler component on the Y axis. X and Y are the same for every BVP sample, while T_i varies from 14 to 34. The preprocessing here is zero-padding: T_{i_e} all-zero X-Y matrices are placed before the T_i two-dimensional X-Y matrices of the sample, and T_{i_l} all-zero matrices are appended after them, so that the following formula is satisfied:

T_i + T_{i_e} + T_{i_l} = N_t    (1)

where N_t is a preset constant greater than or equal to 36. After zero-padding, all samples can be fed into the neural network in a unified format.
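The zero-padding of formula (1) can be sketched as follows. This is a minimal illustration, not the patented implementation: the text does not say how the padding is divided between T_{i_e} and T_{i_l}, so an even split is assumed, and the function name zero_pad_bvp is hypothetical.

```python
import numpy as np

def zero_pad_bvp(sample: np.ndarray, n_t: int = 36) -> np.ndarray:
    """Pad a (T_i, X, Y) sample with all-zero X-Y frames up to n_t frames."""
    t_i = sample.shape[0]
    if t_i > n_t:
        raise ValueError(f"sample has {t_i} frames, more than n_t={n_t}")
    missing = n_t - t_i
    t_e = missing // 2      # leading zero frames T_{i_e} (even split is an assumption)
    t_l = missing - t_e     # trailing zero frames T_{i_l}
    # Pad only the time axis; the X and Y axes are left unchanged.
    return np.pad(sample, ((t_e, t_l), (0, 0), (0, 0)), mode="constant")

sample = np.ones((14, 20, 20))          # T_i = 14, X = Y = 20
padded = zero_pad_bvp(sample, n_t=36)
print(padded.shape)                     # (36, 20, 20)
```

With T_i = 14 and N_t = 36, the sketch places 11 zero frames before and 11 after the sample, so T_i + T_{i_e} + T_{i_l} = 36.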

The data normalization method is to restrict the size of each element of the data to the [0,1] interval, and the specific method is as follows:

Where d_i' is the new element and d_i is the i-th element in the data.
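The expression of formula (2) is not reproduced in the text. One common way to restrict every element to [0, 1] is min-max scaling, d_i' = (d_i - min(d)) / (max(d) - min(d)); the sketch below assumes that form, which may differ from the formula actually used. The function name is hypothetical.

```python
import numpy as np

def normalize_unit_interval(data: np.ndarray) -> np.ndarray:
    """Min-max scale all elements into [0, 1] (assumed form of formula (2))."""
    d_min = data.min()
    d_max = data.max()
    if d_max == d_min:
        # Constant input: avoid division by zero and map everything to 0.
        return np.zeros_like(data, dtype=float)
    return (data - d_min) / (d_max - d_min)

data = np.array([2.0, 4.0, 6.0, 10.0])
normalized = normalize_unit_interval(data)
print(normalized.min(), normalized.max())   # smallest element maps to 0, largest to 1
```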

Through the above preprocessing of the BVP data set, a gesture information data set of size N_t-X-Y is obtained. The data set records how the two-dimensional signal features change over time, and it can be classified using the features extracted by the feature extraction layers of the neural network. The position information also affects how features are expressed in the data, so a suitably designed neural network can extract these features and then classify them.

S2, extracting parallel high-dimensional features from the BVP data through a double-flow 2D convolution feature extraction network;

The double-flow 2D convolution network splits the input three-dimensional data into two branches for parallel feature extraction. Specifically, the input N_t-X-Y data is transposed to form the data D1 of form X-N_t-Y and the data D2 of form Y-N_t-X. Data D1 and data D2 are fed into a first 2D convolutional neural network branch and a second 2D convolutional neural network branch, respectively.
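The two transpositions can be written with axis permutations. This is a small sketch using random stand-in data; the sizes N_t = 36 and X = Y = 20 follow values mentioned elsewhere in the text.

```python
import numpy as np

n_t, x_size, y_size = 36, 20, 20
rng = np.random.default_rng(0)
bvp = rng.random((n_t, x_size, y_size))   # N_t-X-Y input (stand-in data)

d1 = np.transpose(bvp, (1, 0, 2))         # X-N_t-Y, input of branch one
d2 = np.transpose(bvp, (2, 0, 1))         # Y-N_t-X, input of branch two

print(d1.shape, d2.shape)                 # (20, 36, 20) (20, 36, 20)
```

Both views hold the same values; only the axis order differs, so each branch scans the data along a different pair of axes.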

Specifically, the first branch and the second branch of the 2D convolutional neural network respectively comprise a 2D convolutional layer, a batch sample normalization 2DBatchNorm layer, a Relu activation function layer and a pooling layer.

Each 2D convolution layer applies 16 different convolution kernels of size (2, 2) to the data, multiplying and then summing, to obtain a feature map U1 and a feature map U2 (the feature maps U1 and U2 are subsequently updated by the 2DBatchNorm layer, the ReLU activation function layer and the pooling layer). The parameters of the convolution kernels are learned during training by the gradient descent method of the neural network.
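The multiply-and-sum operation of a 2D convolution layer with 16 kernels of size (2, 2) can be sketched directly. The stride and padding are not stated in the text, so stride 1 and no padding are assumed here; the kernels are random stand-ins for trained parameters, and a single N_t-by-Y slice is used as input because the text does not say how the leading axis of D1 is handled.

```python
import numpy as np

def conv2d_valid(image: np.ndarray, kernels: np.ndarray) -> np.ndarray:
    """image: (H, W); kernels: (K, kh, kw) -> feature maps (K, H-kh+1, W-kw+1)."""
    k, kh, kw = kernels.shape
    h, w = image.shape
    out = np.zeros((k, h - kh + 1, w - kw + 1))
    for i in range(h - kh + 1):
        for j in range(w - kw + 1):
            patch = image[i:i + kh, j:j + kw]
            # Multiply the patch by every kernel element-wise, then sum.
            out[:, i, j] = (kernels * patch).sum(axis=(1, 2))
    return out

rng = np.random.default_rng(0)
kernels = rng.standard_normal((16, 2, 2))   # 16 kernels of size (2, 2)
frame = rng.standard_normal((36, 20))       # one N_t-by-Y slice of D1 (assumption)
feature = conv2d_valid(frame, kernels)
print(feature.shape)                        # (16, 35, 19)
```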

The batch sample normalization 2DBatchNorm layer normalizes the value range of the feature maps U1 and U2 produced by the 2D convolution layer. The basic process is similar to formula (2); its main purpose is to eliminate scale differences in the data and to prevent overfitting of the neural network.
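For orientation, standard 2D batch normalization standardizes each channel over the batch and spatial axes. The sketch below shows that standard operation in its simplest form; the learnable scale and shift and the running statistics of a full BatchNorm layer are omitted, and the input shape is hypothetical.

```python
import numpy as np

def batch_norm_2d(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """x: (batch, channels, H, W); standardize each channel over batch, H and W."""
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16, 17, 9)) * 3.0 + 5.0   # hypothetical feature maps
y = batch_norm_2d(x)
print(y.shape)   # (4, 16, 17, 9); each channel now has mean near 0 and variance near 1
```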

The ReLU activation function layer passes the data output by the BatchNorm layer (i.e. the normalized feature map U1 and feature map U2) through the ReLU function, as shown in formula (3).

ReLU(x) = max(0, x)    (3)

Where x is the input value of the function.
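The ReLU function is conventionally defined as max(0, x), applied element by element. A short sketch:

```python
import numpy as np

def relu(x: np.ndarray) -> np.ndarray:
    """Element-wise ReLU: negative inputs become 0, others pass through."""
    return np.maximum(0.0, x)

values = np.array([-2.0, -0.5, 0.0, 1.5])
activated = relu(values)
print(activated.tolist())   # [0.0, 0.0, 0.0, 1.5]
```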

The pooling layer uses max pooling. It downsamples local regions of the feature map output by the ReLU activation function layer and reduces the dimension of the feature matrix: the input matrix is first divided into several partitions, and the max operation extracts the local maximum feature within each partition. Max pooling reduces the dimension of the processed data while preserving the scale invariance of the features. The pooling window size is (2, 2) and the stride is 2, so each pooled dimension of the output is half that of the input.
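Max pooling with a (2, 2) window and stride 2 can be sketched with a reshape. Dropping a trailing odd row or column is an assumption, since the text does not say how odd sizes are handled.

```python
import numpy as np

def max_pool_2x2(feature: np.ndarray) -> np.ndarray:
    """feature: (C, H, W) -> (C, H//2, W//2), window (2, 2), stride 2."""
    c, h, w = feature.shape
    h2, w2 = h // 2, w // 2
    trimmed = feature[:, :h2 * 2, :w2 * 2]        # drop a trailing odd row/column
    blocks = trimmed.reshape(c, h2, 2, w2, 2)     # split into 2x2 partitions
    return blocks.max(axis=(2, 4))                # local maximum of each partition

feature = np.arange(16, dtype=float).reshape(1, 4, 4)
pooled = max_pool_2x2(feature)
print(pooled.tolist())   # [[[5.0, 7.0], [13.0, 15.0]]]
```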

In summary, after data D1 and data D2 pass through branch one and branch two of the convolutional neural network, respectively, a feature map U1 of size [...] and a feature map U2 of size [...] are obtained.

S3, a method for fusing parallel high-dimensional features and extracting features again;

High-dimensional feature fusion merges the two parallel high-dimensional feature maps U1 and U2 for the subsequent feature re-extraction. The two sets of features, which have different dimension orderings, are fused as independent channels, and the channel number of the previously obtained features is treated as the frame number of the input 3D data. Specifically, the feature map U1 and the feature map U2 are transposed and expanded into a feature map U3 and a feature map U4 of size [...], which are then stacked to obtain a fused feature map of size [...].
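One possible reading of this fusion is sketched below: each stream keeps its 16 convolution channels as the frame (depth) axis, and the two streams become two independent channels of a single 3D input. The feature map sizes are not reproduced in the text, so the shapes used here are hypothetical, and the axis layout (channels, frames, height, width) is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical sizes: 16 frames (the former 2D channels) of 9x9 features per stream.
u3 = rng.standard_normal((16, 9, 9))   # from branch one, after transposition
u4 = rng.standard_normal((16, 9, 9))   # from branch two, after transposition

# Stack the two streams along a new leading axis so each is an independent channel.
fused = np.stack([u3, u4], axis=0)     # (channels=2, frames=16, height=9, width=9)
print(fused.shape)                     # (2, 16, 9, 9)
```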

The feature re-extraction mainly uses a 3D neural network to extract features from the fused feature map of size [...]. The 3D neural network comprises two 3D convolution layers, a batch sample normalization 3DBatchNorm layer, a ReLU layer and a fully connected layer.

The two 3D convolution layers use convolution kernels of size (3, 3, 3); the numbers of kernels are 64 and 32, respectively, and the kernels scan with a stride of 1 in all three dimensions. The data are multiplied and summed over the three dimensions to obtain a feature map U5 and a feature map U6. The parameters of the 3D convolution kernels are likewise learned during training by the gradient descent method of the neural network.
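The effect of a (3, 3, 3) kernel with stride 1 on the feature map size follows the usual convolution arithmetic, out = (in + 2*padding - kernel) // stride + 1 per axis. The sketch assumes no padding, which the text does not state, and uses a hypothetical input size.

```python
def conv_output_size(size: int, kernel: int = 3, stride: int = 1, padding: int = 0) -> int:
    """Output length of one axis after a convolution."""
    return (size + 2 * padding - kernel) // stride + 1

def conv3d_output_shape(shape, kernel=3, stride=1, padding=0):
    """Apply the per-axis rule to a (depth, height, width) shape."""
    return tuple(conv_output_size(s, kernel, stride, padding) for s in shape)

shape_in = (16, 8, 8)                              # hypothetical (depth, height, width)
after_first = conv3d_output_shape(shape_in)        # layer with 64 kernels
after_second = conv3d_output_shape(after_first)    # layer with 32 kernels
print(after_first, after_second)                   # (14, 6, 6) (12, 4, 4)
```

The number of output channels equals the number of kernels (64, then 32); only the three scanned axes shrink.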

The ReLU layer and the batch sample normalization 3DBatchNorm layer work on the same principle as the corresponding layers in step S2, extended by one dimension to three-dimensional data.

The batch sample normalization 3DBatchNorm layer normalizes the value range of the feature maps U5 and U6. The basic process is similar to formula (2); its main purpose is to eliminate scale differences in the data and to prevent overfitting of the neural network.

The ReLU activation function layer passes the data output by the BatchNorm layer through the ReLU function, as shown in formula (3).

To sum up, after processing by the convolution layers, the batch sample normalization 3DBatchNorm layer and the ReLU layer, the final data size is [...]. Here, the method sets n_t to 34 and X = Y = 20, so the size of the data is (8, 32, 4, 4), and the flattened data size is (4096, 1).
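The flattened size follows from multiplying the four dimensions, 8 x 32 x 4 x 4 = 4096, which can be checked in a few lines:

```python
import numpy as np

features = np.zeros((8, 32, 4, 4))   # data size stated in the text
flat = features.reshape(-1, 1)       # flatten into a column vector
print(flat.shape)                    # (4096, 1)
```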

The fully connected part consists of three connected layers of neurons. The first layer receives data from the ReLU activation function layer, and the input and output sizes of the three layers are (4096, 256), (256, 64) and (64, 5), respectively. A Dropout layer is placed between adjacent fully connected layers; during training it randomly deactivates neurons of a layer with a given probability, so that the classification result does not depend on any single large parameter value and overfitting is prevented. The output layer is a softmax layer, as commonly used in neural networks. It outputs the probability of each predicted target, and the maximum is selected as the final classification result. Two outputs are obtained, a gesture probability vector and a position probability vector; each is a 5-dimensional vector whose entries give the probability that the current input belongs to the corresponding gesture or position.
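A fully connected head with the stated sizes can be sketched as below. Several details are assumptions rather than statements of the text: the ReLU between layers, the dropout probability of 0.5, the use of two heads with identical sizes (one per task), and the random weights that stand in for trained parameters. All function names are hypothetical.

```python
import numpy as np

LAYER_SIZES = [(4096, 256), (256, 64), (64, 5)]   # sizes stated in the text

def init_head(rng):
    """Random (weight, bias) pairs; stand-ins for trained parameters."""
    return [(rng.standard_normal((n_in, n_out)) * np.sqrt(2.0 / n_in), np.zeros(n_out))
            for n_in, n_out in LAYER_SIZES]

def dropout(x, p, rng, training):
    """Inverted dropout: zero each unit with probability p, during training only."""
    if not training or p == 0.0:
        return x
    keep = rng.random(x.shape) >= p
    return x * keep / (1.0 - p)

def softmax(z):
    shifted = z - z.max()            # subtract the maximum for numerical stability
    e = np.exp(shifted)
    return e / e.sum()

def head_forward(x, params, rng, p=0.5, training=False):
    last = len(params) - 1
    for idx, (w, b) in enumerate(params):
        x = x @ w + b
        if idx < last:
            x = np.maximum(0.0, x)             # assumed ReLU between layers
            x = dropout(x, p, rng, training)   # Dropout between adjacent layers
    return softmax(x)

rng = np.random.default_rng(0)
flat = rng.standard_normal(4096)               # flattened feature vector
gesture_probs = head_forward(flat, init_head(rng), rng)
position_probs = head_forward(flat, init_head(rng), rng)
print(gesture_probs.shape, position_probs.shape)   # (5,) (5,)
```

Each output is a 5-dimensional probability vector whose entries sum to 1.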

Specifically, the gesture probability vector and the position probability vector output by the softmax layer represent the decoupled gesture and position information and form the output of the recognition system. Because they give the probability of each gesture category (push-pull, slide, etc.) and of each position (point A, point B, etc.), respectively, the indices p and q of their largest entries are determined separately; the p-th gesture category and the q-th position are then the result for the current input.
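Selecting the largest dimension of each output is an argmax. In the sketch below the probability values are made up for illustration, and the indices are zero-based.

```python
import numpy as np

# Illustrative softmax outputs (made-up values), one 5-dimensional vector per task.
gesture_probs = np.array([0.05, 0.70, 0.10, 0.10, 0.05])
position_probs = np.array([0.10, 0.10, 0.15, 0.60, 0.05])

p = int(np.argmax(gesture_probs))    # index of the most probable gesture category
q = int(np.argmax(position_probs))   # index of the most probable position
print(p, q)                          # 1 3
```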

S4, training and classifying method combining gesture recognition and position recognition

In the neural network designed above, the two tasks of gesture recognition and position recognition have shared layers and separate, independent output layers. The shared layers exploit the correlation between the two recognition tasks: their parameters are trained jointly, which improves the recognition accuracy of both. The independent output layers decouple the features extracted by the shared layers. On the one hand, the shared layers can better extract the features common to two tasks that are relatively independent but closely related, so that both tasks gain in performance. On the other hand, the independent output layers allow each task to further extract its own high-dimensional features, so that the two network branches can complete the classification of their respective tasks.

In the neural network training stage, a cross-entropy loss function is adopted for each of the two classification tasks,

The loss Loss₁ of the gesture recognition task and the loss Loss₂ of the position recognition task are updated simultaneously. To reduce the discrepancy caused by the different convergence rates of the two tasks, a geometric regularization method is adopted to balance them in the total Loss.
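The loss expressions are not reproduced in the text. The sketch below assumes the standard cross-entropy for each task and, as one plausible reading of the geometric balancing, a total loss equal to the geometric mean sqrt(Loss₁ · Loss₂); the actual formula may differ. The probability values and labels are made up.

```python
import numpy as np

def cross_entropy(probs: np.ndarray, label: int, eps: float = 1e-12) -> float:
    """Cross-entropy of one sample: negative log-probability of the true class."""
    return float(-np.log(probs[label] + eps))

def balanced_loss(loss_1: float, loss_2: float) -> float:
    """Geometric mean of the two task losses (assumed form of the total Loss)."""
    return float(np.sqrt(loss_1 * loss_2))

gesture_probs = np.array([0.1, 0.6, 0.1, 0.1, 0.1])    # made-up softmax outputs
position_probs = np.array([0.2, 0.2, 0.4, 0.1, 0.1])
loss_1 = cross_entropy(gesture_probs, label=1)    # gesture task
loss_2 = cross_entropy(position_probs, label=2)   # position task
total = balanced_loss(loss_1, loss_2)
print(loss_1 < total < loss_2)                    # True
```

A geometric mean lies between the two task losses, so neither task dominates the update simply because its loss is numerically larger.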

The Loss is minimized by a gradient descent method, and a threshold can be set to judge whether training has converged. When the Loss is small enough, stabilizes and no longer drops, the network can be considered well trained.
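A threshold-based convergence test of this kind might look like the following sketch. The threshold, tolerance and window length are hypothetical values chosen for illustration; the text does not specify them.

```python
def has_converged(losses, threshold=0.1, tolerance=1e-3, window=5):
    """True when the last `window` losses are all below `threshold`
    and vary by less than `tolerance` (all three values are hypothetical)."""
    if len(losses) < window:
        return False
    recent = losses[-window:]
    small_enough = max(recent) < threshold
    stable = (max(recent) - min(recent)) < tolerance
    return small_enough and stable

history = [1.0, 0.5, 0.05, 0.0501, 0.0502, 0.0501, 0.05]
print(has_converged(history))                     # True
print(has_converged([1.0, 0.9, 0.8, 0.7, 0.6]))   # False
```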

Further analysis and explanation follows in connection with specific examples.

Example 1

Referring to fig. 1, which shows the recognition framework of the whole system: the input BVP data derived from the original signal first undergoes data preprocessing; features are then extracted by the first and second branches of the double-flow 2D neural network; after feature fusion, further feature extraction for gesture recognition and for position recognition is performed by the first and second branches of the 3D neural network, respectively; and the classification results are obtained through multi-task learning.

Referring to fig. 2, which shows the basic experimental setup; the experimental data are BVP data. The positions of the WiFi transmitter and receivers used in the experiment are shown in the figure. The five points loc_i (i = 1, 2, ..., 5) mark five different positions, which are one of the main recognition targets. The recognized gestures comprise five types: push-pull, panning, clapping, circling and zigzag.

Referring to fig. 3, a total of 18000 experimental samples are used, training runs for 50 epochs, and the learning rate is 0.01. The training, validation and test sets are divided in a ratio of 7:2:1. Training of the multi-task model is balanced through geometric regularization. The losses of the two tasks converge at different times, but with the geometric mean the difference between the converged losses always stays within 0.5. On the test set, the final gesture recognition accuracy is 88.56% and the position recognition accuracy is 88.9%.
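The 7:2:1 division of 18000 samples gives 12600 training, 3600 validation and 1800 test samples. A sketch of such a split follows; shuffling with a fixed seed is an assumption, since the text does not say how samples are assigned to the subsets.

```python
import numpy as np

def split_indices(n_samples: int, ratios=(7, 2, 1), seed: int = 0):
    """Shuffle sample indices and split them by the given ratio."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)        # random assignment is an assumption
    total = sum(ratios)
    n_train = n_samples * ratios[0] // total
    n_val = n_samples * ratios[1] // total
    train_idx = order[:n_train]
    val_idx = order[n_train:n_train + n_val]
    test_idx = order[n_train + n_val:]        # remainder goes to the test set
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = split_indices(18000)
print(len(train_idx), len(val_idx), len(test_idx))   # 12600 3600 1800
```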

Referring to fig. 4, which shows the results of the multi-task learning: panel (a) shows the accuracy of gesture recognition, where good recognition is achieved for the different gestures, and panel (b) shows that the neural network also recognizes the different positions with high probability.

The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, alternatives, and improvements that fall within the spirit and scope of the invention.

Claims

1. The method for recognizing the indoor user gesture and the position based on the WIFI perception is characterized by comprising the following steps of:

Step 2, transposing the gesture information data set to obtain a data set D1 and a data set D2 in two formats, and processing the data set D1 and the data set D2 by a 2D convolutional neural network to obtain two high-dimensional feature maps, wherein the gesture information data set is in the form n_t-X-Y, the data set D1 is in the form X-n_t-Y, and the data set D2 is in the form Y-n_t-X, where X is the size of the Doppler component on the X axis, Y is the size of the Doppler component on the Y axis, and n_t is a constant value of 36 or more;

Step 3, fusing the two high-dimensional features through independent channels to obtain fused features, and extracting the fused features through a 3D convolutional neural network to obtain the type probability and the indoor position probability of gesture change, wherein the two high-dimensional feature maps are transposed and then superimposed to obtain a three-dimensional feature map of size [...], and the feature map of size [...] is processed through a 3D neural network;

the 3D neural network comprises two 3D convolution layers, a ReLU layer, a batch sample normalization 3DBatchNorm layer, a 3D pooling layer and a fully connected layer;

the full-connection layer is softmax and is used for predicting gesture types and positions;

Step 4, outputting the type of gesture change and the indoor position.

2. The method for recognizing indoor user gestures and positions based on WIFI sensing according to claim 1, wherein in step 1, the preprocessing is zero-padding of three-dimensional video stream data sets.

3. The method for recognizing gesture and position of indoor user based on WIFI sensing according to claim 1, wherein in step 1, the normalization process is to restrict the preprocessed dataset to the interval of [0,1] to obtain the gesture information dataset.

4. The method for recognizing indoor user gestures and positions based on WIFI sensing according to claim 1, wherein in the step 2, the 2D convolutional neural network comprises a 2D convolutional layer, a batch sample normalization 2DBatchNorm layer, a Relu activation function layer and a pooling layer, and the 2D convolutional neural network comprises two branches for processing a data set D1 and a data set D2 respectively.

5. The method for recognizing indoor user gestures and positions based on WIFI perception according to claim 4, wherein the 2D convolution layer is used for applying different convolution kernels to the data sets, multiplying and then summing, to obtain a feature map U1 and a feature map U2;

6. The method for recognizing indoor user gestures and positions based on WIFI perception according to claim 1, wherein the 2D convolutional neural network and the 3D convolutional neural network are both obtained by training with the loss function Loss using a gradient descent method.

7. A WIFI-aware based indoor user gesture and location recognition system for implementing the method of claim 1, comprising: