CN113946216A - Man-machine interaction method, intelligent device, storage medium and program product - Google Patents

Publication number
CN113946216A
Authority
CN
China
Prior art keywords
target
detection
portrait
gesture
hand
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111212024.XA
Other languages
Chinese (zh)
Inventor
邵柏韬
刘朋浩
李颖
姜飞俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Alibaba Cloud Computing Ltd
Original Assignee
Alibaba China Co Ltd
Alibaba Cloud Computing Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba China Co Ltd, Alibaba Cloud Computing Ltd
Priority to CN202111212024.XA
Publication of CN113946216A

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/017 Gesture based interaction, e.g. based on a set of recognized hand gestures
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Human Computer Interaction (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

Embodiments of the present application provide a human-computer interaction method, a smart device, a computer storage medium, and a program product. The human-computer interaction method includes: performing multi-target detection on video images acquired in real time, wherein the multi-target detection includes detection of multiple target types, including at least portrait detection and human hand detection; determining a target portrait and a target hand according to the detection result of the multi-target detection; obtaining gesture posture information of the target hand according to the tracking detection results for the target portrait and the target hand; and performing interactive control on the smart device according to the gesture posture information. The solutions provided by the embodiments of the present application improve the efficiency of interactive control of smart devices.

Description

Man-machine interaction method, intelligent device, storage medium and program product
Technical Field
The embodiments of the present application relate to the technical field of smart devices, and in particular to a human-computer interaction method, a smart device, a computer storage medium, and a computer program product.
Background
With the development of AIoT (Artificial Intelligence & Internet of Things) technology, more and more smart devices are used in people's work and daily life. Smart devices with an image acquisition function, such as smart televisions, large smart screens, and smart glasses, are an important category of such devices.
Currently, most smart devices of this type implement human-computer interaction by recognizing gestures in captured images. In practice, the user wears a hardware sensor, such as a bracelet, so that the user's mid-air gestures and contact gestures can be detected and recognized.
However, this approach requires additional hardware sensors, which greatly increases the implementation cost of the smart device and hinders its development and large-scale use.
Disclosure of Invention
In view of the above, embodiments of the present application provide a human-computer interaction solution to at least partially solve the above problems.
According to a first aspect of the embodiments of the present application, a human-computer interaction method is provided, including: performing multi-target detection on video images acquired in real time, wherein the multi-target detection includes detection of multiple target types, including at least portrait detection and human hand detection; determining a target portrait and a target hand according to the detection result of the multi-target detection; obtaining gesture posture information of the target hand according to the tracking detection results for the target portrait and the target hand; and performing interactive control on the smart device according to the gesture posture information.
According to a second aspect of the embodiments of the present application, another human-computer interaction method is provided, including: acquiring, in real time, video images of the space where a smart device is located through an image acquisition device arranged in the smart device; performing multi-target detection on the video images acquired in real time, wherein the multi-target detection includes detection of multiple target types, including at least portrait detection and human hand detection; determining a target portrait and a target hand according to the detection result of the multi-target detection; obtaining gesture posture information of the target hand according to the tracking detection results for the target portrait and the target hand; determining, according to the gesture posture information, the target content displayed on the display screen of the smart device at which the corresponding gesture is directed, and the interactive control operation to be performed on the target content; and performing the interactive control operation on the target content.
According to a third aspect of the embodiments of the present application, a smart device is provided, including an image acquisition device, a display screen, and a processor. The display screen is used to obtain content to be displayed from the processor and display it. The image acquisition device is used to acquire, in real time, video images of the space where the smart device is located. The processor is used to: perform multi-target detection on the video images acquired in real time, wherein the multi-target detection includes detection of multiple target types, including at least portrait detection and human hand detection; determine a target portrait and a target hand according to the detection result of the multi-target detection; obtain gesture posture information of the target hand according to the tracking detection results for the target portrait and the target hand; determine, according to the gesture posture information, the target content displayed on the display screen at which the corresponding gesture is directed, and the interactive control operation to be performed on the target content; and perform the interactive control operation on the target content. The display screen is also used to display the result of the interactive control operation.
According to a fourth aspect of embodiments of the present application, there is provided a computer storage medium having a computer program stored thereon, the program, when executed by a processor, implementing the human-computer interaction method according to the first or second aspect.
According to a fifth aspect of embodiments of the present application, there is provided a computer program product, which includes computer instructions for instructing a computing device to perform operations corresponding to the human-computer interaction method according to the first aspect or the second aspect.
According to the human-computer interaction scheme provided by the embodiments of the present application, portrait and hand tracking detection is performed on the video images, the gesture posture information of the hand is obtained, and the smart device can then be interactively controlled according to the gesture posture that this information represents. On the one hand, the smart device does not need a dedicated hardware sensor such as a bracelet; it only needs an image acquisition device such as a camera to obtain gestures from the video images, which greatly reduces the implementation cost of the smart device and promotes its development and large-scale use. On the other hand, when detection is performed on the video images, the target portrait and the target hand are determined first, tracking detection is then performed for the target hand, and the gesture posture of the target hand is obtained from the result of that tracking detection. This greatly reduces the energy and time consumed by detection computation, improves detection efficiency, and thereby improves the efficiency of interactive control of the smart device.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments described in the embodiments of the present application, and other drawings can be obtained by those skilled in the art according to the drawings.
FIG. 1 is a schematic diagram of an exemplary system suitable for use with embodiments of the present application;
FIG. 2 is a flowchart illustrating steps of a human-computer interaction method according to a first embodiment of the present disclosure;
FIG. 3A is a flowchart illustrating steps of a human-computer interaction method according to a second embodiment of the present application;
FIG. 3B is a diagram illustrating a gesture location map in the embodiment of FIG. 3A;
FIG. 4A is a flowchart illustrating steps of a human-computer interaction method according to a third embodiment of the present application;
FIG. 4B is a diagram illustrating a tracking detection process in the embodiment of FIG. 4A;
FIG. 4C is a diagram of a hand key point in the embodiment of FIG. 4A;
FIG. 5 is a flowchart illustrating steps of a human-computer interaction method according to a fourth embodiment of the present disclosure;
FIG. 6 is a schematic structural diagram of a smart device according to a fifth embodiment of the present application.
Detailed Description
In order to make those skilled in the art better understand the technical solutions in the embodiments of the present application, the technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, but not all embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application shall fall within the scope of the protection of the embodiments in the present application.
FIG. 1 illustrates an exemplary system to which embodiments of the present application may be applied. As shown in FIG. 1, the system generally includes a smart device 102, illustrated in FIG. 1 as a smart display device.
The smart device 102 is provided with an image capturing device and a display screen. In FIG. 1 the image capturing device is illustrated as a camera built into the smart device 102, but those skilled in the art should understand that, in some cases, the image capturing device may be separate from the smart device 102 and connected to it in a wired or wireless manner.
The smart device 102 may display content through the display screen, where at least part of the displayed content is interactive, that is, a user may interact with the smart device 102 by operating on that part of the content. For example, if the content displayed on the display screen includes a plurality of video programs, the user may select a desired video program by a gesture operation. The interaction is not limited to displayed content: the user may also use gestures to control items that the smart device 102 does not display, for example adjusting the volume of the smart device 102 with a corresponding gesture. The specific correspondence between gestures and interactive control operations may be set by those skilled in the art as required, and the embodiments of the present application do not limit it. In addition, the smart device 102 may interact with the user in other ways in response to gestures; for example, on detecting that the user makes a heart gesture, it may display an animation of a heart pattern on the display screen.
The user's gesture operation is captured as video images by the image capturing device of the smart device 102. A processor in the smart device 102 detects and processes the captured video images to obtain the corresponding gesture information, the interactive control operation corresponding to that gesture information, and the target object of the interactive control operation (for example, content displayed on the display screen, or an adjustable item that is not displayed, such as the volume or the brightness of the display screen).
Based on the system, the embodiment of the application provides a human-computer interaction method, which is described in the following through a plurality of embodiments.
Example one
Referring to FIG. 2, a flowchart illustrating the steps of a human-computer interaction method according to a first embodiment of the present application is shown.
The human-computer interaction method of this embodiment includes the following steps:
Step S202: Perform multi-target detection on video images acquired in real time.
The multi-target detection includes detection of multiple target types, including at least portrait detection and human hand detection.
For a smart device, especially a smart display device, a user who wants to interact with it usually stays within the spatial range that the device's image acquisition device can capture while performing the interactive control operation. In some cases only one user interacts with the smart device; in other cases several users are present, some of whom interact with it. Accordingly, the video images captured in real time may contain the portraits of one or more users, and for each portrait, one hand (if the other is occluded), both hands, or neither hand (if both are occluded) may be captured. The image acquisition device may be a monocular sensor, a binocular sensor, or an RGBD sensor.
Specifically, in the present embodiment, the multi-target detection means that the video image is detected by both the portrait detection and the human hand detection, and in the case of multiple portraits and/or multiple human hands, the multiple portraits and/or multiple human hands can be detected simultaneously. That is, the multi-target detection in the embodiment of the present application may detect a plurality of different target types, such as a portrait and a human hand, but at the same time, for any one of the target types, it may also perform detection on a plurality of target objects of the same type, such as a plurality of portraits and a plurality of human hands.
It should be noted that the multi-target detection may be implemented by those skilled in the art in any appropriate manner according to actual needs, including but not limited to a neural network model for multi-target detection, and the embodiments of the present application are not limited in this respect. In the embodiments of the present application, terms indicating plurality, such as "a plurality of" and "multiple", mean two or more unless otherwise specified.
Step S204: Determine the target portrait and the target hand according to the detection result of the multi-target detection.
The target portrait is the portrait corresponding to the user who performs interactive control operations on the smart device. For example, if a user is detected performing a "hand-waving" operation in the video image, and this operation wakes up the smart device for subsequent tracking detection and processing, then the portrait corresponding to that user in the video image is the target portrait, and the hand with which the user performs the "hand-waving" operation is the target hand.
The detection result of the multi-target detection usually includes a portrait positioning frame and the category corresponding to the portrait (indicating whether the portrait is within a preset spatial range, for example at a doorway far from the smart device or in front of the smart device), as well as a hand positioning frame and the category corresponding to the hand (indicating the action category of the hand, such as spreading five fingers, making a fist, waving, or an OK sign). The target portrait and the target hand can be determined from this information.
Step S206: Obtain gesture posture information of the target hand according to the tracking detection results for the target portrait and the target hand.
A hand is part of a human body, so the hand region in the video image is also part of a portrait. Although the final goal of this embodiment is hand tracking detection, this relationship between the hand and the body means that both the target portrait and the target hand need to be tracked: the hand region is tracked on the basis of the portrait region, information about the target hand is obtained from the tracking detection result, and targeted detection processing is then performed on the target hand based on that information to obtain its gesture posture information.
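The idea of tracking the hand region on the basis of the portrait region can be illustrated with a small sketch. The source does not say how hands are associated with a portrait; the code below assumes, purely for illustration, that a hand belongs to a portrait when most of its box lies inside the portrait box. The names `Box`, `overlap_fraction`, `hands_in_portrait`, and the 0.5 threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in pixels: (x1, y1) top-left, (x2, y2) bottom-right."""
    x1: float
    y1: float
    x2: float
    y2: float

    def area(self) -> float:
        # A box with inverted corners is treated as empty.
        return max(0.0, self.x2 - self.x1) * max(0.0, self.y2 - self.y1)


def overlap_fraction(hand: Box, portrait: Box) -> float:
    """Fraction of the hand box's area that lies inside the portrait box."""
    inter = Box(
        max(hand.x1, portrait.x1),
        max(hand.y1, portrait.y1),
        min(hand.x2, portrait.x2),
        min(hand.y2, portrait.y2),
    )
    hand_area = hand.area()
    return inter.area() / hand_area if hand_area > 0 else 0.0


def hands_in_portrait(portrait: Box, hands: List[Box],
                      min_fraction: float = 0.5) -> List[Box]:
    """Keep only hands that mostly fall inside the portrait region
    (assumed association rule, not taken from the source)."""
    return [h for h in hands if overlap_fraction(h, portrait) >= min_fraction]
```

Restricting the hand search to the target portrait's region in this way is one plausible reading of why tracking the portrait first reduces the amount of image data the hand tracker has to examine.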
Step S208: Perform interactive control on the smart device according to the gesture posture information.
The gesture posture information can effectively represent the posture of the human hand, and includes but is not limited to specific gesture classification and human hand position information.
Based on this, in a feasible manner, the interaction position or the interaction area corresponding to the gesture can be mapped to a corresponding position or area in a display area displayed on a display screen of the smart device through a position mapping algorithm, the target content targeted by the gesture is determined, and the interaction control operation corresponding to the gesture is executed on the target content.
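The source does not specify the position mapping algorithm. As a minimal sketch, assuming a simple proportional mapping from camera-frame pixels to screen pixels with an optional horizontal mirror (so that moving the hand to the user's right moves the pointer right), the mapping and the lookup of the targeted content might look as follows. The function names, the mirror option, and the rectangular content regions are hypothetical.

```python
from typing import Dict, Optional, Tuple

Point = Tuple[int, int]
Rect = Tuple[int, int, int, int]  # (x, y, width, height) in screen pixels


def map_hand_to_screen(hand_center: Tuple[float, float],
                       frame_size: Tuple[int, int],
                       screen_size: Tuple[int, int],
                       mirror: bool = True) -> Point:
    """Proportionally map a hand position in the camera frame to a screen pixel."""
    fx, fy = hand_center
    fw, fh = frame_size
    sw, sh = screen_size
    # Normalize to [0, 1] and clamp positions that fall outside the frame.
    nx = min(max(fx / fw, 0.0), 1.0)
    ny = min(max(fy / fh, 0.0), 1.0)
    if mirror:
        nx = 1.0 - nx  # the camera faces the user, so left and right are swapped
    return (round(nx * (sw - 1)), round(ny * (sh - 1)))


def find_target(point: Point, regions: Dict[str, Rect]) -> Optional[str]:
    """Return the name of the first content region containing the point, if any."""
    px, py = point
    for name, (x, y, w, h) in regions.items():
        if x <= px < x + w and y <= py < y + h:
            return name
    return None
```

For example, with a 1920x1080 camera frame and a 3840x2160 screen split into two side-by-side tiles, a hand at the frame's top-left corner maps (after mirroring) to the top-right of the screen and therefore selects the right-hand tile.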
In another feasible manner, when the interactive control operation concerns an item that the smart device does not show on the display screen, such as the volume, the interactive control operation corresponding to the gesture may be performed directly on the smart device.
In yet another possible approach, the smart device may perform an interactive operation with the user according to the gesture, which is independent of the device display content or the device's own function. For example, if the user makes a heart gesture, the smart device may flash a heart pattern on the display screen in response to the user gesture, and so on.
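The three manners above (operating on displayed content, adjusting a non-displayed item, and a purely responsive interaction) can be sketched as a lookup table from gesture category to action. The source leaves the gesture-to-operation correspondence to the implementer, so every binding below is an assumption: `SmartDevice` is a hypothetical stand-in for device state, and the gesture labels `"ok"`, `"thumbs_up"`, and `"heart"` are illustrative.

```python
from typing import Callable, Dict, Optional


class SmartDevice:
    """Hypothetical device state used only for this sketch."""

    def __init__(self) -> None:
        self.volume = 5
        self.selected: Optional[str] = None
        self.animation: Optional[str] = None


Action = Callable[[SmartDevice, Optional[str]], None]


def select_content(device: SmartDevice, target: Optional[str]) -> None:
    # Manner 1: operate on content shown on the display screen.
    device.selected = target


def volume_up(device: SmartDevice, target: Optional[str]) -> None:
    # Manner 2: adjust an item that is not displayed, such as the volume.
    device.volume = min(device.volume + 1, 10)


def show_heart(device: SmartDevice, target: Optional[str]) -> None:
    # Manner 3: respond to the user independently of content or device functions.
    device.animation = "heart"


# Assumed bindings; the actual correspondence is configurable.
GESTURE_ACTIONS: Dict[str, Action] = {
    "ok": select_content,
    "thumbs_up": volume_up,
    "heart": show_heart,
}


def handle_gesture(device: SmartDevice, gesture: str,
                   target: Optional[str] = None) -> bool:
    """Run the action bound to the gesture; return False if none is bound."""
    action = GESTURE_ACTIONS.get(gesture)
    if action is None:
        return False
    action(device, target)
    return True
```

Keeping the bindings in a table rather than in branching code reflects the statement that the correspondence between gestures and operations can be set as required.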
According to this embodiment, portrait and hand tracking detection is performed on the video images, the gesture posture information of the hand is obtained, and the smart device can then be interactively controlled according to the gesture posture that this information represents. On the one hand, the smart device does not need a dedicated hardware sensor such as a bracelet; it only needs an image acquisition device such as a camera to obtain gestures from the video images, which greatly reduces the implementation cost of the smart device and promotes its development and large-scale use. On the other hand, when detection is performed on the video images, the target portrait and the target hand are determined first, tracking detection is then performed for the target hand, and the gesture posture of the target hand is obtained from the result of that tracking detection. This greatly reduces the energy and time consumed by detection computation, improves detection efficiency, and thereby improves the efficiency of interactive control of the smart device.
Example two
Referring to FIG. 3A, a flowchart illustrating the steps of a human-computer interaction method according to a second embodiment of the present application is shown.
The human-computer interaction method of this embodiment includes the following steps:
Step S302: Perform system initialization on the smart device.
The system initialization includes, but is not limited to, initializing the camera angle, focal length, resolution, position, and so on of the smart device's image acquisition device, such as a camera, so that images can be acquired effectively. For example, the camera angle is 180 degrees, the focal length is about 3.67 mm, the resolution is 1080P or higher, and the camera is positioned about 1 to 3 meters in front of the interaction location or interaction zone where the user may be.
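As a rough illustration, the initialization parameters could be collected in a single configuration object. The default values below are the example figures quoted above; the class name, field names, and the sanity checks are assumptions made for this sketch and do not describe how the device actually stores or applies its settings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CameraConfig:
    """Example camera settings; defaults follow the figures given in the text."""
    view_angle_deg: float = 180.0
    focal_length_mm: float = 3.67
    frame_height_px: int = 1080          # "1080P or higher"
    min_user_distance_m: float = 1.0     # camera is roughly 1 to 3 m from the user
    max_user_distance_m: float = 3.0

    def validate(self) -> None:
        """Basic sanity checks (hypothetical, not from the source)."""
        if not 0 < self.view_angle_deg <= 360:
            raise ValueError("view angle must be in (0, 360] degrees")
        if self.focal_length_mm <= 0:
            raise ValueError("focal length must be positive")
        if self.frame_height_px <= 0:
            raise ValueError("frame height must be positive")
        if not 0 < self.min_user_distance_m <= self.max_user_distance_m:
            raise ValueError("user distance range is invalid")

    def meets_example_resolution(self) -> bool:
        """True if the frame height is at least the 1080-line example value."""
        return self.frame_height_px >= 1080
```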
System initialization ensures that the smart device can obtain the video stream of the image acquisition device under the current settings and use it as the input for subsequent data processing.
Step S304: Acquire video images in real time through the image acquisition device of the smart device.
Step S306: Perform multi-target detection on the video images acquired in real time.
The multi-target detection includes detection of multiple target types, including at least portrait detection and human hand detection.
In one possible approach, multi-target detection based on RGB temporal sequences may be employed. Different from the traditional method of carrying out target detection by identifying key points of human bodies and key points of human hands, the target detection is carried out based on RGB video images in the embodiment of the application. Compared with key point detection, the RGB information of the video image contains rich texture and other information, and semantic feature extraction can be performed, so that the information obtained by processing the current video image can be effectively transmitted to the subsequent video image tracking detection process, and the tracking detection efficiency is improved.
Because the video images in a video stream are ordered in time, performing multi-target detection on consecutive RGB video images realizes multi-target detection based on an RGB temporal sequence.
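One simple way to feed consecutive frames to a detector is a fixed-length sliding window over the video stream. The sketch below shows only that buffering step; the window length and the `FrameWindow` class are hypothetical, and the source does not state how many frames the detection model consumes or how it combines them.

```python
from collections import deque
from typing import Any, Deque, List, Optional


class FrameWindow:
    """Keeps the most recent `length` frames in arrival order."""

    def __init__(self, length: int) -> None:
        if length < 1:
            raise ValueError("length must be at least 1")
        self._frames: Deque[Any] = deque(maxlen=length)

    def push(self, frame: Any) -> Optional[List[Any]]:
        """Add a frame; return the full window once enough frames have arrived."""
        self._frames.append(frame)  # the oldest frame is dropped automatically
        if len(self._frames) < self._frames.maxlen:
            return None
        return list(self._frames)
```

Each returned window is a short RGB temporal sequence that a sequence-aware detection model could take as input; until the window fills, `push` returns `None` and detection is skipped.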
Step S308: Determine a target portrait and a target hand according to the detection result of the multi-target detection, and perform tracking detection on the target portrait and the target hand.
As described above, the target portrait is the portrait corresponding to the user who performs interactive control operations on the smart device, and the target hand is the hand with which that user performs the preset wake-up operation for the smart device. During tracking detection, the target hand is detected within the region of the target portrait to which it belongs, which reduces the data processing burden of tracking detection, improves tracking detection efficiency, and lowers the hardware performance required of the smart device.
Based on this, in one possible implementation, this step can be carried out as follows: obtain portrait information and hand information from the detection result of the multi-target detection; determine, according to the hand information, whether there is a hand performing the preset wake-up gesture for the smart device; and if so, determine the hand performing the wake-up gesture as the target hand, and determine the portrait corresponding to the target hand as the target portrait. In the embodiments of the present application, a wake-up gesture is set for the smart device; if a user performs this gesture, the user is considered to want to wake up the smart device in order to interact with it. Setting a wake-up gesture makes it possible to quickly and efficiently identify the target user and the hand with which the target user performs the wake-up gesture; in the video image, these correspond to the target portrait and the target hand. Other portraits and hands therefore need not be tracked subsequently, which greatly improves tracking detection efficiency and, in turn, human-computer interaction efficiency. In addition, to improve the interactivity of human-computer interaction, prompt information may be provided through the display screen of the smart device after the wake-up gesture is detected. The prompt may require interactive confirmation, such as "Would you like to start interacting with me?", in which case subsequent processing is performed after the user confirms; or it may require no confirmation, such as "Thank you for interacting with me, let's get started!".
The portrait information includes, but is not limited to, the portrait positioning frame and the category corresponding to the portrait; the hand information includes, but is not limited to, the hand positioning frame and the category corresponding to the hand.
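The selection of the target portrait and target hand by wake-up gesture can be sketched as follows. The record layout is an assumption: each detected hand is taken to carry the identifier of the portrait it belongs to, the wake-up gesture is taken to be the category label `"wave"`, and the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

BoxT = Tuple[float, float, float, float]  # positioning frame: (x1, y1, x2, y2)


@dataclass
class Portrait:
    portrait_id: int
    box: BoxT
    category: str      # e.g. "in_front_of_device" or "at_door" (assumed labels)


@dataclass
class Hand:
    hand_id: int
    portrait_id: int   # assumed link from a hand to its portrait
    box: BoxT
    category: str      # action category, e.g. "wave", "fist", "ok" (assumed labels)


WAKE_GESTURE = "wave"  # assumed label for the preset wake-up gesture


def select_target(portraits: List[Portrait], hands: List[Hand],
                  wake_gesture: str = WAKE_GESTURE
                  ) -> Optional[Tuple[Portrait, Hand]]:
    """Return the first (portrait, hand) pair whose hand performs the wake-up
    gesture, or None if no detected hand performs it."""
    by_id = {p.portrait_id: p for p in portraits}
    for hand in hands:
        if hand.category == wake_gesture and hand.portrait_id in by_id:
            return by_id[hand.portrait_id], hand
    return None
```

This sketch simply takes the first matching hand. When several users perform the wake-up gesture at once, a real implementation would need a tie-breaking rule or a user choice rather than relying on detection order.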
In addition, to further increase data processing speed and reduce processing latency, a portrait identifier and a hand identifier may optionally be assigned to the portrait corresponding to the portrait information and the hand corresponding to the hand information, respectively, and tracking detection of the target portrait and the target hand may be performed according to the portrait identifier and the hand identifier. The portrait identifier and/or the hand identifier may also be displayed by the smart device. The portrait identifier includes at least one of the following: an icon identifier corresponding to the portrait (such as an avatar, icon, or logo set by the user), an ID identifier corresponding to the portrait, a name identifier corresponding to the portrait (such as a user name or nickname), and a role identifier corresponding to the portrait (such as the user's role in the household, for example dad, mom, or baby). The hand identifier includes at least one of the following: an icon identifier corresponding to the hand (such as an icon or logo set by the user), an ID identifier corresponding to the hand, and a name identifier corresponding to the hand (such as left hand or right hand). Of course, other implementations of portrait identifiers and hand identifiers are equally applicable to the embodiments of the present application.
When tracking detection is performed on the target portrait and the target hand based on the portrait identifier and the hand identifier, multi-target tracking detection may first be performed on the video images acquired in real time according to the portrait identifier and the hand identifier, where the multi-target tracking detection includes target portrait tracking detection and target hand tracking detection; single-target tracking detection for the target hand is then performed based on the result of the multi-target tracking detection. Moving from multi-target tracking detection to single-target tracking detection for the hand in this way greatly reduces the amount of data that tracking detection has to process, further lowers the hardware performance required of the smart device, increases tracking detection speed and efficiency, and reduces the latency of human-computer interaction. Optionally, a detection mode based on an RGB temporal sequence may be adopted to obtain richer information and improve detection accuracy.
For example, for a frame of video image a, through multi-target tracking detection, a portrait positioning frame X therein and a human hand positioning frame X ' in the portrait positioning frame are determined, and then information of an image area of the human hand positioning frame X ' may be given to a subsequent neural network model to perform tracking detection for a human hand in the human hand positioning frame X '.
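Passing only the image area of the hand positioning frame X' to the next model amounts to cropping that region out of the frame. A minimal sketch follows, assuming the frame is a row-major list of pixel rows and the box uses integer pixel coordinates with exclusive right and bottom edges; `crop_region` is a hypothetical helper, and a real pipeline would more likely slice an image array.

```python
from typing import Any, List, Tuple


def crop_region(frame: List[List[Any]],
                box: Tuple[int, int, int, int]) -> List[List[Any]]:
    """Return the pixels of `frame` inside `box` = (x1, y1, x2, y2).

    Coordinates are clamped to the frame, so a box that extends past the
    image border yields only the part that lies inside the image.
    """
    height = len(frame)
    width = len(frame[0]) if height else 0
    x1, y1, x2, y2 = box
    x1, x2 = max(0, min(x1, width)), max(0, min(x2, width))
    y1, y2 = max(0, min(y1, height)), max(0, min(y2, height))
    return [row[x1:x2] for row in frame[y1:y2]]
```

The cropped patch is much smaller than the full frame, which is the source of the reduced data volume when moving from multi-target tracking to single-target hand tracking.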
And the tracking detection aiming at the target portrait can be realized as follows: determining a portrait area in the video image according to the portrait identifier; and tracking and detecting the target portrait and the target human hand based on the portrait area, the portrait identifier and the human hand identifier. Because the tracking detection of the target portrait is still in the stage of multi-target tracking detection, the tracking detection of the corresponding portrait and the human hand of the portrait needs to be performed based on the corresponding identification, so as to provide an effective and accurate basis for the subsequent single-target tracking detection of the human hand.
In addition, to improve the interactivity of human-computer interaction, in one feasible mode, if the detection result of the multi-target detection indicates that a plurality of portraits or a plurality of human hands exist in the video image, an information pop-up window is displayed on the display screen of the intelligent device to present option information for the portraits and the human hands; the target portrait and the target human hand are then determined according to a selection operation on the option information. In this case, there may be multiple users in front of the intelligent device, and at least two of them may perform the same wake-up gesture operation, such as waving a hand, at the same time. To improve the efficiency of subsequent tracking detection and interaction, portrait information of the users who performed the wake-up gesture operation at the same time can be displayed on the display screen for selection; the portrait corresponding to the selected portrait information is taken as the target portrait, and the human hand that belongs to the target portrait and performed the wake-up gesture operation is taken as the target human hand.
It should be noted that, in the embodiments of the present application, the multi-target detection as well as the subsequent multi-target tracking detection and single-target tracking detection may each be implemented with a trained neural network model having the corresponding function. The embodiments of the present application do not limit the specific training process or the specific structure of these neural network models, provided that the corresponding functions are implemented.
Step S310: acquiring gesture posture information of the target hand according to the tracking detection results for the target portrait and the target hand.
Based on the tracking detection for the target portrait and the target hand, a corresponding tracking detection result can be obtained. In this embodiment of the present application, the tracking detection result at least includes: a tracking frame of the target portrait, a type of the target portrait, a tracking frame of the target hand, and a type of the target hand. The term "tracking frame" is used here to distinguish it from the "positioning frame" obtained by the earlier multi-target detection, which involves no tracking.
As described in step S308, the tracking detection of the target portrait and the target human hand may transition from multi-target tracking detection to single-target tracking detection for the human hand, based on which a human hand region corresponding to the target human hand may be obtained, and based on the image of the human hand region, gesture posture detection is performed, and corresponding gesture posture information may be obtained. In addition, in the process of tracking and detecting the target human hand, the motion track of the target human hand can be displayed in real time through a display screen of the intelligent device, so that a user can know the mapping condition of the gesture of the user in the intelligent device more clearly.
The gesture posture information can effectively represent the posture of the human hand, and includes but is not limited to specific gesture classification and human hand position information.
Step S312: performing interactive control on the intelligent device according to the gesture posture information.
In one feasible manner, if the interactive control operation corresponding to the gesture posture information is directed to something other than displayed content of the intelligent device, such as the volume, the corresponding interactive control operation can be determined from the gesture that the gesture posture information represents, and the intelligent device is then controlled accordingly, for example by turning the volume up or down.
In another feasible manner, if the interactive control operation corresponding to the gesture posture information is an operation for the content displayed on the display screen of the intelligent device, the operation of the human hand in the real physical space needs to be mapped to the content displayed on the display screen finally. For example, the corresponding interactive control operation is determined according to the gesture, the position of the human hand mapped on the display area of the display screen is further determined according to the position of the human hand, and the target object aimed by the gesture and the operation performed on the target object are determined. In order to improve the presentability of the operation, corresponding indication icons such as indication arrows and the like can also be displayed in the display area, so that the user can clearly know the specific position and operation information of the gesture operation.
However, in order to further improve the user experience, in a feasible manner, three-dimensional gesture reconstruction and hand position mapping may be performed according to gesture posture information, so as to map the reconstructed three-dimensional gesture to a position corresponding to the hand position on the display screen of the smart device.
In one specific example, taking the "waving" gesture operation as an example, as shown in fig. 3B, multiple "waving" positions may be obtained through tracking detection from multiple frames of video images containing the "waving" gesture. An initial gesture box (denoted xmin, ymin, xmax, ymax) can then be obtained by averaging the "waving" positions. The width w and the height h of the initial gesture box follow directly as w = xmax - xmin and h = ymax - ymin. From w and h, the size of the corresponding gesture control box is determined to be 2w x 2h, with the expansion centered on the center point of the initial gesture box. In this way, the actual effective interaction area of the gesture operation in the video image is obtained. Further, assuming that the display area of the display screen has width w_screen and height h_screen, a proportional mapping from the gesture control box area to the display area completes the conversion from the operation range of the gesture operation to screen coordinates.
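The calculation described in this example can be sketched as follows. This is only an illustrative sketch, not the implementation of the present application: the function names are hypothetical, the boxes are assumed to be (xmin, ymin, xmax, ymax) tuples in image pixels, and clamping a point that falls outside the gesture control box to the screen edge is an added assumption.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def initial_gesture_box(boxes: Sequence[Box]) -> Box:
    """Average the per-frame 'waving' boxes to get the initial gesture box."""
    n = len(boxes)
    return tuple(sum(b[i] for b in boxes) / n for i in range(4))


def control_box(initial: Box) -> Box:
    """Expand the initial box to 2w x 2h around its own center point."""
    xmin, ymin, xmax, ymax = initial
    w, h = xmax - xmin, ymax - ymin
    cx, cy = (xmin + xmax) / 2, (ymin + ymax) / 2
    return (cx - w, cy - h, cx + w, cy + h)


def to_screen(point: Tuple[float, float], ctrl: Box,
              w_screen: float, h_screen: float) -> Tuple[float, float]:
    """Proportionally map an image point in the control box to screen coordinates."""
    x0, y0, x1, y1 = ctrl
    # Assumption: points outside the control box are clamped to its border.
    u = min(max((point[0] - x0) / (x1 - x0), 0.0), 1.0)
    v = min(max((point[1] - y0) / (y1 - y0), 0.0), 1.0)
    return (u * w_screen, v * h_screen)
```

For instance, a hand located at the center of the gesture control box maps to the center of the display area, regardless of where the user stands in the camera image.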
In another feasible mode, displaying an interactive operation option corresponding to the gesture information on a display screen of the intelligent device; and receiving selection operation of the interactive operation options, and performing interactive control on the intelligent equipment according to the interactive operation options selected by the selection operation. By displaying the interactive operation options, the user can more flexibly determine the required interactive operation, and the interactivity of the human-computer interaction is improved. The interactive operation options can be realized by those skilled in the art in any appropriate manner according to actual requirements, such as interactive buttons, interactive questions, and the like, and are displayed in a manner of a small pop-up window or a floating layer.
In another possible way, interactive response animations or texts responding to the gesture posture information can be displayed on the display screen of the intelligent device according to the gesture posture information. For example, if a user is detected to make a heart gesture, an animation of the heart pattern may be displayed on the display screen, and so on.
According to the embodiment, corresponding portrait and human hand tracking detection is carried out based on the video image, the gesture posture information of the human hand is finally obtained, and interactive control can be carried out on the intelligent device according to the gesture posture corresponding to the information. Therefore, on one hand, the intelligent equipment is not required to be provided with a special hardware sensor such as a bracelet, and the intelligent equipment can obtain the gesture through the video image only by being provided with a corresponding image acquisition device such as a camera, so that the implementation cost of the intelligent equipment is greatly reduced, and the development and large-scale use of the intelligent equipment are promoted; on the other hand, when the detection is carried out based on the video image, the target portrait and the target hand are firstly determined, then the tracking detection aiming at the target hand is carried out, and the gesture posture of the target hand is obtained based on the detection result of the tracking detection aiming at the target hand, so that the energy consumption and the time consumption of detection calculation are greatly reduced, the detection efficiency is improved, and the interaction control efficiency aiming at the intelligent equipment is also improved.
EXAMPLE III
Referring to fig. 4A, a flowchart illustrating steps of a human-computer interaction method according to a third embodiment of the present application is shown.
In this embodiment, a human-computer interaction method according to this embodiment is described by taking an example of implementing human-computer interaction by combining a plurality of neural network models.
The man-machine interaction method of the embodiment comprises the following steps:
Step S402: performing multi-target detection on the video images acquired in real time.
Wherein the multi-target detection includes at least detection of multiple target types including portrait detection and human hand detection.
In this embodiment, the multi-target detection of the video image may be implemented by a neural network model having a multi-target detection function, such as a convolutional neural network model; optionally, a lightweight convolutional neural network model may be used. For convenience of description, this neural network model is referred to as the first neural network model in the present embodiment.
It should be noted that, in this embodiment, if the acquired video image includes a plurality of portraits, that is, there may be a plurality of users within the acquisition range of the image acquisition device of the intelligent device, this step may be implemented as: performing multi-target detection on a video image acquired in real time to obtain a plurality of detection frames corresponding to a plurality of candidate objects; merging, among the plurality of detection frames, those that overlap one another or whose mutual distance is within a preset distance range; and performing multi-target detection again based on the merged detection frames. The preset distance range can be set by a person skilled in the art according to actual conditions, and the embodiment of the present application does not limit it. In this way, the efficiency and the accuracy of recognition can be effectively ensured. The embodiment is not limited to this, however; in practical applications, a conventional approach that detects each portrait individually is equally applicable.
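The merging of detection frames described above might look like the following minimal sketch. The present application does not specify how the distance between detection frames is measured or how merged frames are formed; the edge-to-edge distance, the union rectangle, and the names `gap`, `union`, `merge_boxes` and `max_gap` are all assumptions made for illustration.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def gap(a: Box, b: Box) -> float:
    """Edge-to-edge distance between two boxes; 0.0 if they overlap or touch."""
    dx = max(a[0] - b[2], b[0] - a[2], 0.0)
    dy = max(a[1] - b[3], b[1] - a[3], 0.0)
    return (dx * dx + dy * dy) ** 0.5


def union(a: Box, b: Box) -> Box:
    """Smallest box that encloses both inputs."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def merge_boxes(boxes: List[Box], max_gap: float) -> List[Box]:
    """Repeatedly merge any two boxes that overlap or lie within max_gap."""
    merged = list(boxes)
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                if gap(merged[i], merged[j]) <= max_gap:
                    merged[i] = union(merged[i], merged[j])
                    del merged[j]
                    changed = True
                    break
            if changed:
                break  # restart the scan, since indices have shifted
    return merged
```

The merged frames would then serve as the regions on which multi-target detection is performed again.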
Step S404: determining the target portrait and the target human hand according to the detection result of the multi-target detection.
Through multi-target detection of the first neural network model, one or more portrait positioning frames and the corresponding categories of the portraits and one or more hand positioning frames and the corresponding categories of the hands can be output. The action corresponding to the hand can be obtained through the category corresponding to the hand, and therefore whether the hand performs the awakening gesture operation aiming at the intelligent equipment is judged. And if the fact that one hand carries out the awakening gesture operation is determined, determining the hand as the target hand, and taking the corresponding portrait as the target portrait.
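The selection logic of this step can be illustrated with a short sketch. The category label `"wave"` for the wake-up gesture, the `Detection` structure, and the rule that a hand belongs to the portrait whose positioning frame contains the center of the hand's positioning frame are hypothetical choices for illustration; the present application does not prescribe them.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


@dataclass
class Detection:
    box: Box
    category: str  # e.g. "portrait", or a hand action label such as "wave"


WAKE_CATEGORY = "wave"  # hypothetical label of the wake-up gesture


def contains_center(outer: Box, inner: Box) -> bool:
    """True if the center of `inner` lies inside `outer`."""
    cx = (inner[0] + inner[2]) / 2
    cy = (inner[1] + inner[3]) / 2
    return outer[0] <= cx <= outer[2] and outer[1] <= cy <= outer[3]


def select_target(portraits: List[Detection],
                  hands: List[Detection]) -> Optional[Tuple[Detection, Detection]]:
    """Return (target portrait, target hand), or None if nobody woke the device."""
    for hand in hands:
        if hand.category != WAKE_CATEGORY:
            continue
        for portrait in portraits:
            if contains_center(portrait.box, hand.box):
                return portrait, hand
    return None
```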
Step S406: acquiring gesture posture information of the target hand according to the tracking detection results for the target portrait and the target hand.
In the embodiment, multi-target tracking detection aiming at a target portrait and a target hand is carried out on a video image collected in real time through a multi-target tracking network model; obtaining a detection result of the target human hand from detection results of multi-target tracking detection; based on the detection result of the target hand, carrying out single-target tracking detection aiming at the target hand on the video image acquired in real time through a single-target tracking network model; and determining a hand area in the video image according to the detection result, performing gesture posture detection based on key points of the hand on the hand area, and obtaining gesture posture information of the target hand according to the gesture posture detection result.
The multi-target tracking network model can adopt the same lightweight network model structure as the first neural network model, or can be the very same neural network model as the first neural network model. In the latter case, the neural network model also has a tracking detection function, and only its multi-target detection function is used when it performs multi-target detection on a video image acquired in real time.
The single-target tracking network model is connected after the multi-target tracking network model and performs single-target tracking detection for the human hand using the hand-related part of the detection result output by the multi-target tracking network model, such as the position of the target human hand in the video image. The single-target tracking detection result includes a more accurate human hand region in the video image. Gesture posture detection based on the human hand region can then be performed by a second neural network model dedicated to gesture posture detection, so as to obtain the gesture posture information of the human hand.
In one possible approach, the single-target tracking network model may be implemented as a single-target tracking network model based on a twin (Siamese) network. Because the multi-target tracking network model can adopt a lightweight network model structure, tracking detection with low computing power can be realized; and by applying the twin-network-based single-target tracking network model to the hand identified through that tracking detection, more accurate hand tracking detection can be achieved, which lowers the hardware performance required of the intelligent device and reduces interaction latency.
The second neural network model may also be a lightweight convolutional neural network model usable for gesture posture detection. In one possible way, the second neural network model can be implemented as a multi-target regression network model that simultaneously performs 3D key point regression for the 21 key points of the detected human hand and gesture classification. In this form, key point regression and gesture classification are trained jointly in the training stage of the multi-target regression network model, and as a multi-task setup the two training objectives reinforce and promote each other.
In addition, optionally, the detected gesture posture information includes position information of the human hand in the video image. In that case, performing gesture posture detection based on human hand key points on the human hand region may include: acquiring gesture posture information of the human hand in the N frames of video images immediately preceding the current video image; estimating the gesture posture information of the human hand in the current video image from the gesture posture information in those N preceding frames; and performing gesture posture detection based on human hand key points on the human hand region in the current video image with the estimated gesture posture information as auxiliary information, where N is a positive integer. That is, when the gesture posture is detected through the second neural network model, the results the model has already output, namely the gesture posture information of the human hand in earlier video images, serve as a reference: the gesture posture information for the current video image is estimated from the N frames preceding it, and the estimate is used as auxiliary information to correct the model's intermediate detection result (that is, the gesture posture detected in the current video image). This makes the gesture position more stable and keeps it continuous from frame to frame.
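One simple way to picture this estimate-then-correct idea is the sketch below. It is a stand-in under stated assumptions, not the method of the present application: the estimate is a constant-velocity extrapolation of the 2D hand position over the previous N frames, and the correction is a weighted blend controlled by a hypothetical parameter `alpha`, whereas the application feeds the estimate to the model as auxiliary information.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]


def predict_next(history: Sequence[Point]) -> Point:
    """Estimate the current hand position from the previous N positions (N >= 1)."""
    if not history:
        raise ValueError("history must contain at least one frame")
    last = history[-1]
    if len(history) == 1:
        return last  # no motion information yet, so assume the hand stayed put
    first = history[0]
    steps = len(history) - 1
    # Average per-frame displacement over the history window.
    vx = (last[0] - first[0]) / steps
    vy = (last[1] - first[1]) / steps
    return (last[0] + vx, last[1] + vy)


def correct(detected: Point, predicted: Point, alpha: float = 0.7) -> Point:
    """Blend the model's detection with the estimate; alpha weights the detection."""
    return (alpha * detected[0] + (1 - alpha) * predicted[0],
            alpha * detected[1] + (1 - alpha) * predicted[1])
```

A smaller `alpha` yields a steadier but more sluggish gesture position; a larger one follows the raw detection more closely.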
A schematic of one such tracking detection process is shown in fig. 4B. The DetNet part is responsible for human hand tracking detection. According to the tracking frame (bounding box) of the hand output by DetNet, the hand region is cropped from the original video image and used as the input image of KeyNet. KeyNet is responsible for detecting the gesture posture from the input image, that is, the image of the human hand region; in this example its output includes a 2D handoff (i.e. position coordinate information) and a 1D Heatmap (i.e. depth information) for each key point of the human hand (a schematic representation of the hand key points is shown in fig. 4C), which are post-processed to obtain the 3D human hand key points (keypoints). The keypoints are then parameterized (the parameters required for three-dimensional reconstruction of the hand are set by a person skilled in the art according to actual requirements), and the gesture in the video image is finally reconstructed. As can also be seen from the figure, when the image at time t+1 is detected, the gesture posture information at time t+1 is estimated from the gesture posture information of the human hand at time t and at time t-1; this estimate is input to KeyNet as auxiliary information and is used to correct the gesture posture information that the model detects for the image at time t+1.
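The data flow between the two parts can be sketched as follows. This is only a rough illustration: the image is assumed to be a row-major nested list, the function names are hypothetical, and the post-processing is reduced to shifting each key point's 2D coordinates from crop coordinates back into the original image and attaching its depth value, which is an assumed simplification of the post-processing mentioned above.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def crop_hand_region(image: List[list], box: Box):
    """Cut the tracked hand region out of the frame; returns (crop, (x0, y0))."""
    height, width = len(image), len(image[0])
    # Clamp the tracking frame to the image bounds before slicing.
    x0, y0 = max(0, int(box[0])), max(0, int(box[1]))
    x1, y1 = min(width, int(box[2])), min(height, int(box[3]))
    return [row[x0:x1] for row in image[y0:y1]], (x0, y0)


def lift_keypoints(coords_2d: Sequence[Tuple[float, float]],
                   depths: Sequence[float],
                   offset: Tuple[int, int]) -> List[Tuple[float, float, float]]:
    """Combine per-keypoint 2D positions (in crop coordinates) with depths."""
    if len(coords_2d) != len(depths):
        raise ValueError("each key point needs exactly one depth value")
    ox, oy = offset
    return [(u + ox, v + oy, d) for (u, v), d in zip(coords_2d, depths)]
```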
Step S408: performing interactive control on the intelligent device according to the gesture posture information.
The specific implementation of this step can be seen from the description of the relevant parts in the foregoing embodiment one or two, and is not described herein again.
Step S410: displaying the result of the interactive control.
For example, a progress indication of volume adjustment is displayed on a display screen of the smart device according to the interactive control, or new content is displayed on the display screen according to the interactive control, or a video program is played according to the interactive control, and so on.
According to the embodiment, corresponding portrait and human hand tracking detection is carried out based on the video image, the gesture posture information of the human hand is finally obtained, and interactive control can be carried out on the intelligent device according to the gesture posture corresponding to the information. Therefore, on one hand, the intelligent equipment is not required to be provided with a special hardware sensor such as a bracelet, and the intelligent equipment can obtain the gesture through the video image only by being provided with a corresponding image acquisition device such as a camera, so that the implementation cost of the intelligent equipment is greatly reduced, and the development and large-scale use of the intelligent equipment are promoted; on the other hand, when the detection is carried out based on the video image, the target portrait and the target hand are firstly determined, then the tracking detection aiming at the target hand is carried out, and the gesture posture of the target hand is obtained based on the detection result of the tracking detection aiming at the target hand, so that the energy consumption and the time consumption of detection calculation are greatly reduced, the detection efficiency is improved, and the interaction control efficiency aiming at the intelligent equipment is also improved.
EXAMPLE IV
Referring to fig. 5, a flowchart illustrating steps of a man-machine interaction method according to a fourth embodiment of the present application is shown.
In this embodiment, the man-machine interaction method of the present application is described taking as an example the case where the intelligent device is an intelligent display device, such as an intelligent television, an intelligent large screen, an intelligent screen of ordinary size, or an intelligent small screen.
The man-machine interaction method of the embodiment comprises the following steps:
Step S502: acquiring, in real time, video images of the space where the intelligent device is located through an image acquisition device arranged in the intelligent device.
The image acquisition device can be a camera, which acquires video images of the space where the intelligent device is located in real time.
Step S504: performing multi-target detection on the video images acquired in real time.
Wherein the multi-target detection includes at least detection of multiple target types including portrait detection and human hand detection.
Step S506: determining the target portrait and the target human hand according to the detection result of the multi-target detection.
The detection result of the multi-target detection includes corresponding portrait information and human hand information, for example, portrait positioning frame and portrait category information, and human hand positioning frame and human hand category information. Whether the human hand carries out awakening gesture operation aiming at the intelligent equipment or not can be determined based on the class information of the human hand, if yes, the human hand is determined as a target human hand, and a portrait corresponding to the target human hand is determined as a target portrait.
Step S508: acquiring gesture posture information of the target hand according to the tracking detection results for the target portrait and the target hand.
After the target portrait and the target hand are determined, the video image collected in real time can be tracked and detected. The specific process can comprise multi-target tracking detection aiming at the target portrait and the target human hand, further determining related information of the target human hand based on the detection result, such as information of an Identification (ID) and/or a tracking frame, and transitioning from the multi-target tracking detection to single-target tracking detection aiming at the human hand; and then determining a hand region based on the single-target tracking detection result, and performing gesture posture recognition based on the hand region to obtain gesture posture information.
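The transition between the tracking stages described above can be pictured with the following sketch. The three callables stand in for the multi-target tracking model, the single-target tracking model and the gesture posture model; their signatures, the class name `HandTrackingPipeline`, and the fallback to the multi-target stage when the single-target tracker loses the hand are assumptions for illustration only.

```python
from typing import Any, Callable, Dict, Optional, Tuple

Box = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


class HandTrackingPipeline:
    def __init__(self,
                 multi_tracker: Callable[[Any], Dict[int, Box]],
                 single_tracker: Callable[[Any, Box], Optional[Box]],
                 pose_model: Callable[[Any, Box], Any],
                 target_hand_id: int) -> None:
        self.multi_tracker = multi_tracker    # frame -> {track ID: tracking frame}
        self.single_tracker = single_tracker  # (frame, previous box) -> new box or None
        self.pose_model = pose_model          # (frame, hand box) -> gesture posture info
        self.target_hand_id = target_hand_id
        self.hand_box: Optional[Box] = None

    def step(self, frame: Any) -> Any:
        """Process one frame; returns gesture posture info, or None if no hand."""
        if self.hand_box is None:
            # Multi-target stage: look up the target hand by its identifier.
            tracks = self.multi_tracker(frame)
            self.hand_box = tracks.get(self.target_hand_id)
        else:
            # Single-target stage: follow only the target hand.
            self.hand_box = self.single_tracker(frame, self.hand_box)
        if self.hand_box is None:
            return None  # hand not found; the next frame retries the multi-target stage
        return self.pose_model(frame, self.hand_box)
```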
Step S510: determining, according to the gesture posture information, the target content displayed on the display screen of the intelligent device to which the corresponding gesture is directed, and the interactive control operation for the target content.
In this embodiment, the user's gesture is intended to operate on the content displayed on the display screen of the intelligent display device. It is therefore necessary to determine, based on the gesture posture information, the position or region on the display area of the display screen to which the hand position information maps, and to determine the interactive control operation (such as clicking a certain video program or changing the displayed page) corresponding to the gesture posture. In this way, the target content and the interactive control operation to which the gesture is directed are determined.
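As a rough illustration of this determination, the sketch below looks up the operation for a gesture class and hit-tests the mapped screen position against the rectangles of the displayed items. The gesture labels, the operation names, and the `ScreenItem` structure are hypothetical and are not taken from the present application.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class ScreenItem:
    name: str
    rect: Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax) in screen pixels


# Hypothetical mapping from gesture class to interactive control operation.
GESTURE_TO_OPERATION = {"pinch": "click", "swipe_left": "next_page"}


def resolve_target(gesture: str,
                   screen_pos: Tuple[float, float],
                   items: Sequence[ScreenItem]) -> Optional[Tuple[str, str]]:
    """Return (target content name, operation), or None if nothing applies."""
    operation = GESTURE_TO_OPERATION.get(gesture)
    if operation is None:
        return None  # gesture has no associated operation
    x, y = screen_pos
    for item in items:
        x0, y0, x1, y1 = item.rect
        if x0 <= x <= x1 and y0 <= y <= y1:
            return item.name, operation
    return None  # the hand does not point at any displayed item
```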
Step S512: performing the interactive control operation on the target content.
According to the embodiment, the intelligent device carries out corresponding portrait and human hand tracking detection based on the video image, the gesture posture information of the human hand is finally obtained, and interactive control can be carried out on the intelligent device according to the gesture posture corresponding to the information. Therefore, on one hand, the intelligent equipment is not required to be provided with a special hardware sensor such as a bracelet, and the intelligent equipment can obtain the gesture through the video image only by being provided with a corresponding image acquisition device such as a camera, so that the implementation cost of the intelligent equipment is greatly reduced, and the development and large-scale use of the intelligent equipment are promoted; on the other hand, when the intelligent device detects based on the video image, the target portrait and the target hand are firstly determined, then tracking detection is carried out on the target hand, and the gesture posture of the target hand is obtained based on the detection result of the tracking detection on the target hand, so that the energy consumption and the time consumption of detection calculation are greatly reduced, the detection efficiency is improved, and the interaction control efficiency on the intelligent device is further improved.
In addition, it should be noted that, in the present embodiment, implementation of some steps is similar to that in the foregoing embodiments, and therefore description is brief, and reference may be made to the description of relevant portions in the foregoing embodiments for corresponding specific implementation.
EXAMPLE V
Referring to fig. 6, a schematic structural diagram of an intelligent device according to a fifth embodiment of the present application is shown.
The smart device of this embodiment includes: an image capture device 602, a display screen 604, and a processor 606.
Wherein:
The display screen 604 is configured to obtain the content to be displayed from the processor 606 and to display it.
The image acquisition device 602 is configured to acquire a video image of a space where the smart device is located in real time.
The processor 606 is configured to implement the human-computer interaction method described in any of the foregoing embodiments. For example, processor 606 performs multi-target detection on the video images captured in real-time, the multi-target detection including at least detection of multiple target types including portrait detection and human hand detection; determining a target portrait and a target hand according to a detection result of the multi-target detection; acquiring gesture posture information of the target hand according to the tracking detection results of the target portrait and the target hand; according to the gesture posture information, determining target content displayed on the display screen 604 corresponding to the corresponding gesture, and interactive control operation aiming at the target content; and performing the interactive control operation on the target content.
The display screen 604 is further configured to display the result of the interactive control operation.
The intelligent device of this embodiment is used to implement the corresponding human-computer interaction method in the foregoing method embodiments, and has the beneficial effects of the corresponding method embodiments, which are not described herein again. In addition, the description of the corresponding parts in the foregoing method embodiments can be referred to for the functional implementation of each module in the intelligent device of this embodiment, and is not repeated here.
The embodiment of the present application further provides a computer program product, which includes computer instructions that instruct an intelligent device to perform the operations corresponding to any one of the human-computer interaction methods in the foregoing method embodiments.
It should be noted that, according to the implementation requirement, each component/step described in the embodiment of the present application may be divided into more components/steps, and two or more components/steps or partial operations of the components/steps may also be combined into a new component/step to achieve the purpose of the embodiment of the present application.
The above-described methods according to embodiments of the present application may be implemented in hardware or firmware, or as software or computer code that can be stored in a recording medium such as a CD-ROM, a RAM, a floppy disk, a hard disk, or a magneto-optical disk, or as computer code that is originally stored in a remote recording medium or a non-transitory machine-readable medium, downloaded through a network, and stored in a local recording medium, so that the methods described herein can be carried out by such software stored on a recording medium using a general-purpose computer, a dedicated processor, or programmable or dedicated hardware such as an ASIC or FPGA. It will be appreciated that the computer, processor, microprocessor controller or programmable hardware includes memory components (e.g., RAM, ROM, flash memory, etc.) that can store or receive software or computer code that, when accessed and executed by the computer, processor or hardware, implements the human-machine interaction methods described herein. Further, when a general-purpose computer accesses code for implementing the human-computer interaction method illustrated herein, execution of the code transforms the general-purpose computer into a special-purpose computer for performing the human-computer interaction method illustrated herein.
Those of ordinary skill in the art will appreciate that the various illustrative elements and method steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the embodiments of the present application.
The above embodiments are only used for illustrating the embodiments of the present application, and not for limiting the embodiments of the present application, and those skilled in the relevant art can make various changes and modifications without departing from the spirit and scope of the embodiments of the present application, so that all equivalent technical solutions also belong to the scope of the embodiments of the present application, and the scope of patent protection of the embodiments of the present application should be defined by the claims.

Claims (14)

1. A human-computer interaction method, comprising:
performing multi-target detection on video images captured in real time, the multi-target detection comprising detection of multiple target types including at least portrait detection and human hand detection;
determining a target portrait and a target hand according to a detection result of the multi-target detection;
obtaining gesture posture information of the target hand according to a tracking detection result for the target portrait and the target hand; and
performing interactive control on a smart device according to the gesture posture information.

2. The method according to claim 1, wherein determining the target portrait and the target hand according to the detection result of the multi-target detection comprises:
obtaining portrait information and hand information from the detection result of the multi-target detection;
judging, according to the hand information, whether there is a hand that has performed a preset smart device wake-up operation gesture; and
if so, determining the hand that performed the smart device wake-up operation gesture as the target hand, and determining the portrait corresponding to the target hand as the target portrait.

3. The method according to claim 2, further comprising:
setting a corresponding portrait identifier for the portrait corresponding to the portrait information and a corresponding hand identifier for the hand corresponding to the hand information; and
performing tracking detection on the target portrait and the target hand according to the portrait identifier and the hand identifier.

4. The method according to claim 3, further comprising:
displaying the portrait identifier and/or the hand identifier through the smart device, wherein the portrait identifier comprises at least one of the following: an icon identifier corresponding to the portrait, an ID identifier corresponding to the portrait, a name identifier corresponding to the portrait, and a role identifier corresponding to the portrait; and the hand identifier comprises at least one of the following: an icon identifier corresponding to the hand, an ID identifier corresponding to the hand, and a name identifier corresponding to the hand.

5. The method according to claim 3 or 4, wherein performing tracking detection on the target portrait and the target hand according to the portrait identifier and the hand identifier comprises:
performing multi-target tracking detection on the video images captured in real time according to the portrait identifier and the hand identifier, the multi-target tracking detection comprising target portrait tracking detection and target hand tracking detection; and
performing single-target tracking detection for the target hand based on a detection result of the multi-target tracking detection.

6. The method according to claim 3 or 4, wherein performing tracking detection on the target portrait and the target hand according to the portrait identifier and the hand identifier comprises:
determining a portrait region in the video image according to the portrait identifier; and
performing tracking detection on the target portrait and the target hand based on the portrait region, the portrait identifier and the hand identifier.

7. The method according to claim 1, wherein performing interactive control on the smart device according to the gesture posture information comprises:
performing three-dimensional gesture reconstruction and hand position mapping according to the gesture posture information, so as to map the reconstructed three-dimensional gesture to a position on the display screen of the smart device that corresponds to the hand position;
or,
displaying, on the display screen of the smart device, interactive operation options corresponding to the gesture posture information; receiving a selection operation on the interactive operation options, and performing interactive control on the smart device according to the interactive operation option selected by the selection operation;
or,
displaying, on the display screen of the smart device and according to the gesture posture information, an interactive response animation or text that responds to the gesture posture information.

8. The method according to claim 1, wherein obtaining the gesture posture information of the target hand according to the tracking detection result for the target portrait and the target hand comprises:
performing, through a multi-target tracking network model, multi-target tracking detection for the target portrait and the target hand on the video images captured in real time;
obtaining a detection result for the target hand from the detection result of the multi-target tracking detection;
performing, based on the detection result for the target hand and through a single-target tracking network model, single-target tracking detection for the target hand on the video images captured in real time; and
determining a hand region in the video image according to the detection result, performing gesture posture detection based on hand key points on the hand region, and obtaining the gesture posture information of the target hand according to the gesture posture detection result.

9. The method according to claim 8, wherein the gesture posture information includes position information of the hand in the video image;
performing gesture posture detection based on hand key points on the hand region comprises: obtaining gesture posture information of the hand in the N frames of video images immediately preceding the current video image; inferring the gesture posture information of the hand in the current video image from the gesture posture information of the hand in the preceding N frames; and, using the inferred gesture posture information as auxiliary information, performing gesture posture detection based on hand key points on the hand region in the current video image, wherein N is a positive integer.

10. The method according to claim 1, wherein determining the target portrait and the target hand according to the detection result of the multi-target detection comprises:
if the detection result of the multi-target detection indicates that multiple portraits or multiple hands are present in the video image, displaying an information pop-up window on the display screen of the smart device, so as to display option information for the multiple portraits and the multiple hands in the information pop-up window; and
determining the target portrait and the target hand according to a selection operation on the option information.

11. The method according to claim 1, further comprising:
during tracking detection of the target hand, displaying a motion trajectory of the target hand in real time on the display screen of the smart device.

12. A human-computer interaction method, comprising:
capturing, in real time, video images of the space in which a smart device is located through an image capture device provided in the smart device;
performing multi-target detection on the video images captured in real time, the multi-target detection comprising detection of multiple target types including at least portrait detection and human hand detection;
determining a target portrait and a target hand according to a detection result of the multi-target detection;
obtaining gesture posture information of the target hand according to a tracking detection result for the target portrait and the target hand;
determining, according to the gesture posture information, the target content displayed on the display screen of the smart device at which the corresponding gesture is directed, and an interactive control operation for the target content; and
performing the interactive control operation on the target content.

13. A smart device, comprising: an image capture device, a display screen and a processor;
wherein,
the display screen is configured to obtain content to be displayed from the processor and display it;
the image capture device is configured to capture, in real time, video images of the space in which the smart device is located;
the processor is configured to: perform multi-target detection on the video images captured in real time, the multi-target detection comprising detection of multiple target types including at least portrait detection and human hand detection; determine a target portrait and a target hand according to a detection result of the multi-target detection; obtain gesture posture information of the target hand according to a tracking detection result for the target portrait and the target hand; determine, according to the gesture posture information, the target content displayed on the display screen at which the corresponding gesture is directed, and an interactive control operation for the target content; and perform the interactive control operation on the target content; and
the display screen is further configured to display a result of the interactive control operation.

14. A computer storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the human-computer interaction method according to any one of claims 1-12.
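
The target selection of claim 2 and the frame-history inference of claim 9 can be sketched roughly as follows. This is a minimal illustration, not the applicant's implementation: the `Detection` type, the `"wake"` gesture label, the rule that a hand belongs to the portrait whose box contains the hand's center, and the constant-velocity extrapolation are all assumptions introduced here; the claims do not specify how hands are associated with portraits or how the prior is computed.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels
Point = Tuple[float, float]


@dataclass
class Detection:
    """Hypothetical per-frame detection record (not defined in the patent)."""
    track_id: int                  # portrait or hand identifier, as in claim 3
    kind: str                      # "portrait" or "hand"
    box: Box
    gesture: Optional[str] = None  # assumed gesture label, only set for hands


def center(box: Box) -> Point:
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def contains(box: Box, point: Point) -> bool:
    x1, y1, x2, y2 = box
    px, py = point
    return x1 <= px <= x2 and y1 <= py <= y2


def select_target(
    detections: List[Detection], wake_gesture: str = "wake"
) -> Optional[Tuple[Detection, Detection]]:
    """Return (target_portrait, target_hand) for the first hand that shows
    the wake-up gesture, or None if no such hand can be matched."""
    portraits = [d for d in detections if d.kind == "portrait"]
    hands = [d for d in detections if d.kind == "hand"]
    for hand in hands:
        if hand.gesture != wake_gesture:
            continue
        # Assumed association rule: the owner is the first portrait whose
        # bounding box contains the center of the hand box.
        hand_center = center(hand.box)
        for portrait in portraits:
            if contains(portrait.box, hand_center):
                return portrait, hand
    return None


def predict_hand_center(history: Sequence[Point]) -> Point:
    """Extrapolate the hand center for the current frame from the centers
    observed in the previous N frames, assuming constant velocity."""
    if not history:
        raise ValueError("history must contain at least one frame")
    if len(history) == 1:
        return history[-1]
    (x_first, y_first), (x_last, y_last) = history[0], history[-1]
    steps = len(history) - 1
    vx = (x_last - x_first) / steps
    vy = (y_last - y_first) / steps
    return (x_last + vx, y_last + vy)
```

In such a sketch, the predicted center would only narrow the search region for the key-point detector in the current frame; the detector itself still produces the final gesture posture information.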
CN202111212024.XA 2021-10-18 2021-10-18 Man-machine interaction method, intelligent device, storage medium and program product Pending CN113946216A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111212024.XA CN113946216A (en) 2021-10-18 2021-10-18 Man-machine interaction method, intelligent device, storage medium and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111212024.XA CN113946216A (en) 2021-10-18 2021-10-18 Man-machine interaction method, intelligent device, storage medium and program product

Publications (1)

Publication Number Publication Date
CN113946216A (en) 2022-01-18

Family

ID=79331417

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111212024.XA Pending CN113946216A (en) 2021-10-18 2021-10-18 Man-machine interaction method, intelligent device, storage medium and program product

Country Status (1)

Country Link
CN (1) CN113946216A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114610156A (en) * 2022-03-23 2022-06-10 浙江猫精人工智能科技有限公司 Interaction method and device based on AR/VR glasses and AR/VR glasses
CN114816045A (en) * 2022-03-08 2022-07-29 影石创新科技股份有限公司 Method and device for determining interaction gesture and electronic equipment
CN115097995A (en) * 2022-06-23 2022-09-23 京东方科技集团股份有限公司 Interface interaction method, interface interaction device and computer storage medium
CN115421590A (en) * 2022-08-15 2022-12-02 珠海视熙科技有限公司 Gesture control method, storage medium and camera device
CN115421591A (en) * 2022-08-15 2022-12-02 珠海视熙科技有限公司 Gesture control device and camera equipment
CN115496412A (en) * 2022-10-25 2022-12-20 浙江中控技术股份有限公司 Behavioral interaction method, system, device and storage medium
CN115840507A (en) * 2022-12-20 2023-03-24 北京帮威客科技有限公司 Large-screen equipment interaction method based on 3D image control
CN116166119A (en) * 2022-12-30 2023-05-26 深圳市洲明科技股份有限公司 Image control method, device, equipment and storage medium
CN116360603A (en) * 2023-05-29 2023-06-30 中数元宇数字科技(上海)有限公司 Interaction method, device, medium and program product based on timing signal matching

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104992171A (en) * 2015-08-04 2015-10-21 易视腾科技有限公司 Method and system for gesture recognition and man-machine interaction based on 2D video sequence
CN107493495A (en) * 2017-08-14 2017-12-19 深圳市国华识别科技开发有限公司 Interaction position determination method, system, storage medium and intelligent terminal
CN108596092A (en) * 2018-04-24 2018-09-28 亮风台(上海)信息科技有限公司 Gesture identification method, device, equipment and storage medium
CN109725727A (en) * 2018-12-29 2019-05-07 百度在线网络技术(北京)有限公司 Gesture control method and device for equipment with a screen
CN110287891A (en) * 2019-06-26 2019-09-27 北京字节跳动网络技术有限公司 Gestural control method, device and electronic equipment based on human body key point
CN111722700A (en) * 2019-03-21 2020-09-29 Tcl集团股份有限公司 Man-machine interaction method and man-machine interaction equipment
CN112328090A (en) * 2020-11-27 2021-02-05 北京市商汤科技开发有限公司 Gesture recognition method and device, electronic equipment and storage medium
CN112686169A (en) * 2020-12-31 2021-04-20 深圳市火乐科技发展有限公司 Gesture recognition control method and device, electronic equipment and storage medium
CN112711335A (en) * 2021-01-19 2021-04-27 腾讯科技(深圳)有限公司 Virtual environment picture display method, device, equipment and storage medium
US20210191519A1 (en) * 2019-12-23 2021-06-24 Sensetime International Pte. Ltd. Gesture recognition method and apparatus, electronic device, and storage medium
CN113253847A (en) * 2021-06-08 2021-08-13 北京字节跳动网络技术有限公司 Terminal control method and device, terminal and storage medium

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114816045A (en) * 2022-03-08 2022-07-29 影石创新科技股份有限公司 Method and device for determining interaction gesture and electronic equipment
CN114610156A (en) * 2022-03-23 2022-06-10 浙江猫精人工智能科技有限公司 Interaction method and device based on AR/VR glasses and AR/VR glasses
CN115097995A (en) * 2022-06-23 2022-09-23 京东方科技集团股份有限公司 Interface interaction method, interface interaction device and computer storage medium
CN115421590A (en) * 2022-08-15 2022-12-02 珠海视熙科技有限公司 Gesture control method, storage medium and camera device
CN115421591A (en) * 2022-08-15 2022-12-02 珠海视熙科技有限公司 Gesture control device and camera equipment
CN115421591B (en) * 2022-08-15 2024-03-15 珠海视熙科技有限公司 Gesture control device and image pickup apparatus
CN115496412A (en) * 2022-10-25 2022-12-20 浙江中控技术股份有限公司 Behavioral interaction method, system, device and storage medium
CN115840507A (en) * 2022-12-20 2023-03-24 北京帮威客科技有限公司 Large-screen equipment interaction method based on 3D image control
CN115840507B (en) * 2022-12-20 2024-05-24 北京帮威客科技有限公司 A large-screen device interaction method based on 3D image control
CN116166119A (en) * 2022-12-30 2023-05-26 深圳市洲明科技股份有限公司 Image control method, device, equipment and storage medium
CN116360603A (en) * 2023-05-29 2023-06-30 中数元宇数字科技(上海)有限公司 Interaction method, device, medium and program product based on timing signal matching

Similar Documents

Publication Publication Date Title
CN113946216A (en) Man-machine interaction method, intelligent device, storage medium and program product
CN103353935B (en) 3D dynamic gesture recognition method for smart home system
US8879787B2 (en) Information processing device and information processing method
CN108062525B (en) A deep learning hand detection method based on hand region prediction
US8509484B2 (en) Information processing device and information processing method
CN102270348B (en) Method for tracking deformable hand gesture based on video streaming
CN114463833B (en) Android human-computer interaction method based on MediaPipe gesture recognition model
CN109117753B (en) Part recognition method, device, terminal and storage medium
CN102999152A (en) Method and system for gesture recognition
CN103390168A (en) Intelligent wheelchair dynamic gesture recognition method based on Kinect depth information
WO2018000519A1 (en) Projection-based interaction control method and system for user interaction icon
CN109800676A (en) Gesture identification method and system based on depth information
CN108279573A (en) Control method, device, intelligent appliance and medium based on human body attribute detection
CN114513694B (en) Score determination method, device, electronic equipment and storage medium
CN107895161B (en) Real-time gesture recognition method, device and computing device based on video data
CN103413137B (en) Multi-rule-based method for segmenting interaction gesture movement trajectories
CN112700568A (en) Identity authentication method, equipment and computer readable storage medium
CN112199994A (en) A method and device for real-time detection of 3D hand interaction with unknown objects in RGB video
CN107832736A (en) Real-time human action recognition method and real-time human action recognition device
Liao et al. DensePoseGait: Dense human pose part-guided for gait recognition
CN120954103A (en) An Automatic Body Posture Recognition System Based on Video Analysis
CN115393962A (en) Action recognition method, head-mounted display device and storage medium
CN114610156A (en) Interaction method and device based on AR/VR glasses and AR/VR glasses
CN109903300A (en) An intelligent touch point display method and device suitable for congenitally blind people to learn to recognize pictures
CN114779925A (en) A method and device for line-of-sight interaction based on a single target

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 2022-01-18