CN112766061B - A multimodal unsupervised pedestrian pixel-level semantic annotation method and system - Google Patents

A multimodal unsupervised pedestrian pixel-level semantic annotation method and system

Info

Publication number
CN112766061B
Authority
CN
China
Prior art keywords
point cloud
image acquisition
acquisition device
cloud information
information
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011615688.6A
Other languages
Chinese (zh)
Other versions
CN112766061A (en)
Inventor
彭鹭斌
苏松志
苏松剑
蔡国榕
陈延艺
陈延行
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lop Xiamen System Integration Co ltd
Ropt Technology Group Co ltd
Original Assignee
Lop Xiamen System Integration Co ltd
Ropt Technology Group Co ltd
Application filed by Lop Xiamen System Integration Co ltd and Ropt Technology Group Co ltd
Priority to CN202011615688.6A (patent CN112766061B)
Priority to PCT/CN2021/074232 (patent WO2022141721A1)
Publication of CN112766061A
Application granted
Publication of CN112766061B

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103 Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 Three-dimensional [3D] modelling for computer graphics
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/20 Image enhancement or restoration using local operators
    • G06T5/30 Erosion or dilatation, e.g. thinning
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/30 Determination of transform parameters for the alignment of images, i.e. image registration
    • G06T7/33 Determination of transform parameters for the alignment of images, i.e. image registration using feature-based methods
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10048 Infrared image
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30196 Human being; Person

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention provides a multimodal unsupervised pedestrian pixel-level semantic annotation method and system. The method comprises: performing three-dimensional reconstruction of a monitored scene with no people present to obtain initial point cloud information of the scene; acquiring first point cloud information of the monitored scene with a ToF image acquisition device, registering the first point cloud information with the initial point cloud information, performing a set difference operation to obtain second point cloud information, and projecting the second point cloud information onto a horizontal plane to obtain a person point cloud information set; dilating and eroding a binarized image of the scene information acquired by an infrared image acquisition device to obtain a connected region information set; and, using the calibrated positional relationship between the cameras, projecting the person point cloud information set and the connected region information set into the image plane space of an RGB image acquisition device, performing a set intersection operation, and obtaining the corresponding human body region set when the number of common pixels exceeds a first threshold. The method and system combine the advantages of cameras of different modalities and can effectively extract human body pixels in a scene.

Description

Multi-mode unsupervised pedestrian pixel-level semantic annotation method and system
Technical Field
The invention relates to the technical field of target detection, in particular to a multi-mode unsupervised pedestrian pixel-level semantic labeling method and system.
Background
Pedestrian detection is a classical problem in computer vision, and its related techniques can be applied in fields such as video surveillance and autonomous driving. The common approach at present is to first capture a large number of samples containing pedestrians, then manually label the positions of the pedestrians in the pictures as training data, and finally train a classifier with a supervised learning method (such as a support vector machine or deep learning) to distinguish pedestrian regions from non-pedestrian regions. With the development of deep learning, the number of training samples required keeps growing, and labeling a large number of samples is time-consuming and labor-intensive.
Pedestrian detection techniques can be classified by the format of the input data into methods based on two-dimensional images (color and grayscale), methods based on three-dimensional point clouds, and methods based on infrared imaging. From a technical point of view, they can be divided into holistic methods, part-based methods, and patch-based methods. Most of these methods use supervised classification techniques from machine learning. Supervised classification requires the positions of pedestrians in the pictures to be labeled, which consumes a great deal of manpower, material, and financial resources.
Disclosure of Invention
To solve the technical problem in the prior art that labeling pedestrian positions in pictures consumes a large amount of manpower, material, and financial resources, the invention provides a multimodal unsupervised pedestrian pixel-level semantic annotation method and system that avoid the need to label pedestrian samples manually.
According to one aspect of the invention, a multimodal unsupervised pedestrian pixel-level semantic annotation method is provided, comprising the following steps:
S1, performing three-dimensional reconstruction of a monitored scene with no people present to obtain initial point cloud information of the monitored scene;
S2, acquiring first point cloud information of the monitored scene with a ToF image acquisition device, registering the first point cloud information with the initial point cloud information, performing a set difference operation to obtain second point cloud information, and projecting the second point cloud information onto a horizontal plane to obtain a person point cloud information set;
S3, dilating and eroding the binarized image obtained by thresholding the scene information acquired by an infrared image acquisition device to obtain a connected region information set; and
S4, using the calibrated positional relationship between the cameras, projecting the person point cloud information set and the connected region information set into the image plane space of an RGB image acquisition device, performing a set intersection operation, and obtaining the corresponding human body region set when the number of common pixels exceeds a first threshold.
In some specific embodiments, step S1 specifically includes:
selecting an arbitrary origin in the unoccupied monitored scene and establishing a three-dimensional coordinate system;
arranging m × n points at intervals along the x axis and z axis as image acquisition positions for the RGB image acquisition device, selecting a shooting angle every k degrees for each of the pitch, yaw, and roll angles, and acquiring M = m × n × (180/k) images; and
performing three-dimensional reconstruction of the monitored scene from the M images using a Structure from Motion (SfM) three-dimensional reconstruction algorithm to obtain the initial point cloud information. SfM algorithms recover three-dimensional structure from the projected two-dimensional motion field of a moving object or scene.
In some specific embodiments, the first point cloud information and the initial point cloud information are registered using the iterative closest point (ICP) algorithm. This step allows data acquired by different acquisition devices to be registered.
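The registration idea can be illustrated with a minimal sketch. It is a simplified two-dimensional, brute-force variant of ICP written for illustration only; the patent registers three-dimensional point clouds, and the function names (`icp_2d`, `best_rigid_2d`, `apply_transform`) are hypothetical.

```python
import math

def apply_transform(points, theta, tx, ty):
    """Rotate 2D points by theta (radians) and translate by (tx, ty)."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y + tx, s * x + c * y + ty) for x, y in points]

def nearest(p, cloud):
    """Brute-force nearest neighbour of p in cloud."""
    return min(cloud, key=lambda q: (q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)

def best_rigid_2d(src, dst):
    """Least-squares rotation and translation mapping paired src points onto dst."""
    n = len(src)
    sx = sum(p[0] for p in src) / n
    sy = sum(p[1] for p in src) / n
    qx = sum(p[0] for p in dst) / n
    qy = sum(p[1] for p in dst) / n
    dot = cross = 0.0
    for (ax, ay), (bx, by) in zip(src, dst):
        ax, ay, bx, by = ax - sx, ay - sy, bx - qx, by - qy
        dot += ax * bx + ay * by
        cross += ax * by - ay * bx
    theta = math.atan2(cross, dot)  # angle maximising the centred alignment
    c, s = math.cos(theta), math.sin(theta)
    return theta, qx - (c * sx - s * sy), qy - (s * sx + c * sy)

def icp_2d(source, target, iterations=30):
    """Alternate nearest-neighbour matching and rigid alignment (assumed 2D simplification)."""
    current = list(source)
    for _ in range(iterations):
        matches = [nearest(p, target) for p in current]
        theta, tx, ty = best_rigid_2d(current, matches)
        current = apply_transform(current, theta, tx, ty)
    return current
```

For large point clouds, a spatial index would normally replace the brute-force neighbour search, and the 3D case needs a different closed-form rotation estimate.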
In some specific embodiments, the second point cloud information is projected onto the XY plane of the three-dimensional coordinate system, a plurality of circular regions are obtained based on the Hough transform, and the point cloud information corresponding to the same circular region is included in the person point cloud information set.
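A rough sketch of Hough-style circle detection on the projected points follows. It assumes a single known radius and a fixed voting grid, neither of which is specified in the source; `hough_circle_centers` and its parameters are hypothetical names.

```python
import math
from collections import Counter

def hough_circle_centers(points, radius, cell=0.1, min_votes=12, angle_steps=72):
    """Vote for centres of circles with a known radius; return well-supported centres."""
    votes = Counter()
    for x, y in points:
        cells = set()
        for k in range(angle_steps):
            angle = 2.0 * math.pi * k / angle_steps
            cells.add((round((x + radius * math.cos(angle)) / cell),
                       round((y + radius * math.sin(angle)) / cell)))
        for key in cells:  # each point votes at most once per grid cell
            votes[key] += 1
    # Strongest cells first, then greedy suppression of cells close to an accepted centre.
    candidates = sorted(((v, key) for key, v in votes.items() if v >= min_votes),
                        reverse=True)
    centers = []
    for _, (ix, iy) in candidates:
        cx, cy = ix * cell, iy * cell
        if all(math.hypot(cx - ox, cy - oy) >= radius for ox, oy in centers):
            centers.append((cx, cy))
    return centers
```

Points lying within roughly one radius of a returned centre could then be grouped into one person's point cloud. That grouping rule is also an assumption.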
In some specific embodiments, step S3 further includes, after dilating and eroding the binarized image, removing regions whose pixel area is smaller than a second threshold. This step processes the image so that the connected regions can be obtained.
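The dilation and erosion operations can be sketched on a small binary image stored as nested lists. The 3×3 structuring element and the border handling are assumptions made for this illustration; the source does not specify them.

```python
def dilate(img):
    """3x3 dilation: a pixel becomes 1 if any in-bounds neighbour (or itself) is 1."""
    h, w = len(img), len(img[0])
    return [[int(any(img[ny][nx]
                     for ny in range(max(0, y - 1), min(h, y + 2))
                     for nx in range(max(0, x - 1), min(w, x + 2))))
             for x in range(w)] for y in range(h)]

def erode(img):
    """3x3 erosion: a pixel stays 1 only if every in-bounds neighbour (and itself) is 1."""
    h, w = len(img), len(img[0])
    return [[int(all(img[ny][nx]
                     for ny in range(max(0, y - 1), min(h, y + 2))
                     for nx in range(max(0, x - 1), min(w, x + 2))))
             for x in range(w)] for y in range(h)]

def close_holes(img):
    """Dilate, then erode, which fills small gaps inside a blob."""
    return erode(dilate(img))
```

Applying `close_holes` to a 3×3 ring of ones with a missing centre pixel yields a solid 3×3 block, which is the effect wanted before extracting connected regions.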
In some specific embodiments, the first threshold is taken in the range of 20 × 40 to 80 × 160 and the second threshold is taken in the range of 1000 to 8196.
In some specific embodiments, the ToF image acquisition device, the infrared image acquisition device, and the RGB image acquisition device are each installed in the monitored scene, and the positional relationship and pose information of the three devices are calculated using the initial point cloud information. The positional relationship and pose information of the three image acquisition devices of different modalities facilitate the subsequent transformation of the feature point clouds.
In some specific embodiments, the positional relationship and pose information are acquired as follows:
A depth image of the monitored scene is acquired with the ToF image acquisition device, and the six-degree-of-freedom pose of the ToF image acquisition device in the monitored scene is obtained using the iterative closest point algorithm in combination with the initial point cloud information;
Color images of the monitored scene are acquired with the infrared image acquisition device and the RGB image acquisition device, and the position and pose information of the infrared image acquisition device and the RGB image acquisition device are obtained from the acquired images and the initial point cloud information using SIFT descriptors and a bag-of-words feature description algorithm, based on the bundle adjustment algorithm.
In some specific embodiments, a first transformation matrix between the ToF image acquisition device and the infrared image acquisition device, a second transformation matrix between the ToF image acquisition device and the RGB image acquisition device, and a third transformation matrix between the infrared image acquisition device and the RGB image acquisition device are obtained from the positional relationship, the pose information, and the internal parameters of the image acquisition devices.
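As an illustration of how pairwise transformation matrices can follow from per-camera poses, the sketch below assumes each pose is given as a 4×4 rigid camera-to-world matrix. That convention, the function names, and the omission of the internal parameters are assumptions for this example only.

```python
def matmul(a, b):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def invert_rigid(t):
    """Invert a rigid transform [R | t] as [R^T | -R^T t]."""
    r_t = [[t[j][i] for j in range(3)] for i in range(3)]
    trans = [-sum(r_t[i][k] * t[k][3] for k in range(3)) for i in range(3)]
    return [r_t[i] + [trans[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

def relative_transform(world_from_src, world_from_dst):
    """Matrix mapping coordinates in the src camera frame to the dst camera frame."""
    return matmul(invert_rigid(world_from_dst), world_from_src)
```

Under this convention the three matrices are mutually consistent: mapping from the ToF frame to the infrared frame and then from the infrared frame to the RGB frame gives the same result as mapping from the ToF frame to the RGB frame directly.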
In some specific embodiments, step S4 specifically includes:
Projecting the person point cloud information into the image plane space of the RGB image acquisition device using the second transformation matrix, according to the camera imaging principle, to obtain a first projection region set;
Projecting the connected region information into the image plane space of the RGB image acquisition device using the third transformation matrix to obtain a second projection region set;
Performing an intersection operation on the pixels of the first projection region set and the pixels of the second projection region set.
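The intersection step can be sketched with each projected region represented as a set of (x, y) pixel coordinates. Returning the shared pixels as the human body region is an assumption, since the source only states that a region is obtained when the common pixels exceed the first threshold; `match_regions` and `rect` are hypothetical helpers.

```python
def rect(x0, y0, x1, y1):
    """Pixel set of the axis-aligned rectangle [x0, x1) x [y0, y1); helper for examples."""
    return {(x, y) for x in range(x0, x1) for y in range(y0, y1)}

def match_regions(regions_a, regions_b, min_common):
    """Intersect every pair of regions and keep overlaps with more than min_common pixels."""
    matched = []
    for region_a in regions_a:
        for region_b in regions_b:
            common = region_a & region_b
            if len(common) > min_common:
                matched.append(common)
    return matched
```

With one ToF-derived region `rect(0, 0, 40, 80)` and infrared-derived regions `rect(10, 0, 50, 80)` and `rect(200, 200, 210, 210)`, a threshold of 20 × 40 = 800 keeps only the first pair.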
According to a second aspect of the present invention, a computer-readable storage medium is presented, on which one or more computer programs are stored which, when executed by a computer processor, implement the method of any of the above.
According to a third aspect of the present invention, there is provided a multi-modal unsupervised pedestrian pixel level semantic annotation system comprising:
an initial point cloud information acquisition unit configured to perform three-dimensional reconstruction of a monitored scene with no people present and obtain initial point cloud information of the monitored scene;
a person point cloud information set acquisition unit configured to acquire first point cloud information of the monitored scene with a ToF image acquisition device, register the first point cloud information with the initial point cloud information, perform a set difference operation to obtain second point cloud information, and project the second point cloud information onto a horizontal plane to obtain a person point cloud information set;
a connected region information set acquisition unit configured to dilate and erode the binarized image obtained by thresholding the scene information acquired by an infrared image acquisition device to obtain a connected region information set; and
a human body region set acquisition unit configured to project the person point cloud information set and the connected region information set into the image plane space of an RGB image acquisition device using the calibrated positional relationship between the cameras, perform a set intersection operation, and obtain the corresponding human body region set when the number of common pixels exceeds a first threshold.
The invention provides a multimodal unsupervised pedestrian pixel-level semantic annotation method and system that combine the advantages of cameras of different modalities, namely a ToF image acquisition device, an infrared image acquisition device, and an RGB image acquisition device, and can effectively extract human body pixels in a scene. In pedestrian detection tasks, pixel-level annotation information can thus be provided automatically to machine learning algorithms.
Drawings
The accompanying drawings are included to provide a further understanding of the embodiments and are incorporated in and constitute a part of this specification. The drawings illustrate embodiments and, together with the description, serve to explain the principles of the application. Other embodiments and many of their intended advantages will be readily appreciated as they become better understood by reference to the following detailed description. Other features, objects, and advantages of the present application will become more apparent upon reading the detailed description of non-limiting embodiments, made with reference to the accompanying drawings, in which:
FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 is a flow chart of a multi-modality unsupervised pedestrian pixel level semantic annotation method of one embodiment of the present application;
FIG. 3 is a framework diagram of a multi-modality unsupervised pedestrian pixel level semantic annotation system according to one embodiment of the present application;
FIG. 4 is a schematic diagram of a computer system suitable for use in implementing an embodiment of the application.
Detailed Description
The application is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the application and are not limiting of the application. It should be noted that, for convenience of description, only the portions related to the present application are shown in the drawings.
It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other. The application will be described in detail below with reference to the drawings in connection with embodiments.
FIG. 1 illustrates an exemplary system architecture 100 to which the multi-modal unsupervised pedestrian pixel level semantic annotation method of embodiments of the present application may be applied.
As shown in fig. 1, a system architecture 100 may include a data server 101, a network 102, and a primary server 103. Network 102 is the medium used to provide communication links between data server 101 and primary server 103. Network 102 may include various connection types such as wired, wireless communication links, or fiber optic cables, among others.
The main server 103 may be a server providing various services, such as a data processing server that processes information uploaded to the data server 101. The data processing server can detect pedestrians and store the detection results in a database in association with the corresponding data.
It should be noted that the multimodal unsupervised pedestrian pixel-level semantic annotation method provided by the embodiments of the present application is generally executed by the main server 103, and accordingly the corresponding apparatus is generally disposed in the main server 103.
The data server and the main server may be hardware or software. In the case of hardware, each may be implemented as a distributed server cluster formed by a plurality of servers, or as a single server. In the case of software, each may be implemented as multiple pieces of software or software modules (e.g., software or software modules for providing distributed services) or as a single piece of software or software module.
It should be understood that the numbers of data servers, networks, and main servers in FIG. 1 are merely illustrative. There may be any number of data servers, networks, and main servers, as required by the implementation.
FIG. 2 shows a flow chart of a multi-modality unsupervised pedestrian pixel level semantic annotation method according to an embodiment of the present application. As shown in fig. 2, the method includes:
S201, performing three-dimensional reconstruction of a monitored scene with no people present to obtain initial point cloud information of the monitored scene. With the scene unoccupied, three-dimensional reconstruction is performed on the M images using a Structure from Motion three-dimensional reconstruction algorithm to obtain the initial point cloud information. The goal of Structure from Motion (SfM) is to automatically recover camera motion and scene structure from two or more views of a scene; it is a self-calibrating technique that can automatically accomplish camera tracking and motion matching.
S202, acquiring first point cloud information of the monitored scene with a ToF image acquisition device, registering the first point cloud information with the initial point cloud information, performing a set difference operation to obtain second point cloud information, and projecting the second point cloud information onto a horizontal plane to obtain a person point cloud information set. The first point cloud information and the initial point cloud information are registered using the iterative closest point algorithm. In this step, people are allowed to enter the monitored scene. The second point cloud information is projected onto the XY plane of the three-dimensional coordinate system established in step S201, a plurality of circular regions are obtained based on the Hough transform, and the point cloud information corresponding to the same circular region is included in the person point cloud information set.
S203, dilating and eroding the binarized image obtained by thresholding the scene information acquired by the infrared image acquisition device to obtain a connected region information set. After the binarized image is dilated and eroded, regions whose pixel area is smaller than a second threshold are removed to obtain the connected region information set, where the second threshold is taken in the range of 1000 to 8196.
S204, using the calibrated positional relationship between the cameras, projecting the person point cloud information set and the connected region information set into the image plane space of the RGB image acquisition device, performing a set intersection operation, and obtaining the corresponding human body region set when the number of common pixels exceeds a first threshold.
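The set difference and horizontal projection in step S202 can be sketched as follows. The voxel-grid comparison is an assumed way of deciding that two registered points coincide, since the source does not say how the difference is computed; the function names and the voxel size are hypothetical.

```python
import math

def voxel(point, size):
    """Index of the cubic grid cell containing a 3D point."""
    return tuple(math.floor(c / size) for c in point)

def point_cloud_difference(new_cloud, background, size=0.05):
    """Keep points of new_cloud whose voxel contains no background point."""
    occupied = {voxel(p, size) for p in background}
    return [p for p in new_cloud if voxel(p, size) not in occupied]

def project_to_ground(points):
    """Drop the vertical Z coordinate, leaving positions on the horizontal XY plane."""
    return [(x, y) for x, y, _ in points]
```

A point near a voxel boundary can be misclassified with this simple grid test. A nearest-neighbour distance check would be a more tolerant alternative.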
In a specific embodiment, a ToF image acquisition device, an infrared image acquisition device, and an RGB image acquisition device are each installed in the monitored scene, and the positional relationship and pose information of the three devices are calculated using the initial point cloud information:
A depth image of the monitored scene is acquired with the ToF image acquisition device, and the six-degree-of-freedom pose of the ToF image acquisition device in the monitored scene is obtained using the iterative closest point algorithm in combination with the initial point cloud information;
Color images of the monitored scene are acquired with the infrared image acquisition device and the RGB image acquisition device, and the position and pose information of the infrared image acquisition device and the RGB image acquisition device are obtained from the acquired images and the initial point cloud information using SIFT descriptors and a bag-of-words feature description algorithm, based on the bundle adjustment algorithm.
A first transformation matrix between the ToF image acquisition device and the infrared image acquisition device, a second transformation matrix between the ToF image acquisition device and the RGB image acquisition device, and a third transformation matrix between the infrared image acquisition device and the RGB image acquisition device are obtained from the positional relationship, the pose information, and the internal parameters of the image acquisition devices. The person point cloud information is projected into the image plane space of the RGB image acquisition device using the second transformation matrix, according to the camera imaging principle, to obtain a first projection region set. The connected region information is projected into the image plane space of the RGB image acquisition device using the third transformation matrix to obtain a second projection region set. An intersection operation is performed on the pixels of the first projection region set and the pixels of the second projection region set, and regions whose number of common pixels exceeds a first threshold are jointly determined to be the human body region set, where the first threshold is taken in the range of 20 × 40 to 80 × 160.
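The projection of a person's point cloud into the RGB image plane can be sketched with a pinhole camera model. The pinhole model, the 4×4 matrix layout, and the intrinsic parameter names (`fx`, `fy`, `cx`, `cy`) are assumptions for illustration; the source only refers to the camera imaging principle and the internal parameters.

```python
def project_points(points, cam_from_src, fx, fy, cx, cy, width, height):
    """Map 3D points into pixel coordinates of the target camera.

    cam_from_src is a 4x4 rigid transform (nested lists) taking source-frame
    coordinates to the target camera frame, where +Z looks along the optical axis.
    """
    pixels = set()
    for x, y, z in points:
        xc, yc, zc = (row[0] * x + row[1] * y + row[2] * z + row[3]
                      for row in cam_from_src[:3])
        if zc <= 0:  # behind the camera, not visible
            continue
        u = int(round(fx * xc / zc + cx))
        v = int(round(fy * yc / zc + cy))
        if 0 <= u < width and 0 <= v < height:  # discard points outside the image
            pixels.add((u, v))
    return pixels
```

The returned pixel set is sparse. Turning it into a filled region (for example by taking a bounding box or applying dilation) would be a further design choice that the source does not describe.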
In one specific embodiment of the invention, the multimodal unsupervised pedestrian pixel-level semantic annotation method implements pedestrian detection and automatic annotation using three acquisition devices: a Time-of-Flight camera (camera A), an infrared thermal imaging camera (camera B), and an RGB color camera (camera C). In the following description, these are referred to as camera A, camera B, and camera C, respectively.
Step 1: Select a monitored scene, take a point P on the ground, and establish a three-dimensional coordinate system XYZ, in which the X axis lies in the horizontal plane and points in a chosen direction, the Z axis is perpendicular to the ground and points toward the center of the earth, and the Y axis lies in the horizontal plane perpendicular to the X axis, with its direction determined by the right-hand rule.
Step 2: Along the X axis, select one point every 100 cm, giving m points Q1, Q2, ..., Qm as the horizontal shooting positions of camera C. Along the Z axis, select one point every 50 cm, giving n points P1, P2, ..., Pn as the vertical shooting heights of the camera. At each of the m × n positions, select a shooting angle every k degrees for each of the pitch, yaw, and roll angles. With camera C placed at these positions and angles, a total of M = m × n × (180/k) images are collected.
Step 3: Perform three-dimensional reconstruction of the scene from the M images obtained in step 2 using the Structure-from-Motion technique, thereby obtaining the point cloud information of the scene, denoted Scene_Point_Cloud_BG.
Step 4: Install camera A, camera B, and camera C in the scene, and calculate the mutual positional relationship between cameras A, B, and C using the scene point cloud information from step 3. In this step, it must be ensured that there are no moving objects in the scene.
4A) After camera A is installed, acquire a depth image of the scene, denoted Depth_Image. Taking Depth_Image and Scene_Point_Cloud_BG as inputs, solve for the six-degree-of-freedom pose of camera A in the scene (three rotation angles and three translation coordinates) using the iterative closest point (ICP) algorithm.
4B) After cameras B and C are installed, acquire color picture information of the scene, denoted Color_Image_B and Color_Image_C respectively. Using SIFT descriptors and the bag-of-words feature description method, solve for the position and pose information of cameras B and C based on the bundle adjustment algorithm, from the M acquired images and the point cloud information of the scene.
4C) From the camera pose information obtained in steps 4A and 4B and the internal parameters of cameras A, B, and C, obtain the transformation matrix Tab between camera A and camera B, the transformation matrix Tac between camera A and camera C, and the transformation matrix Tbc between camera B and camera C.
Step 5: After steps 1 to 4 are completed, open the scene and allow pedestrians to enter. Acquire the three-dimensional point cloud information Scene_Point_Cloud_New of the scene with camera A, use the ICP algorithm again to register Scene_Point_Cloud_New with Scene_Point_Cloud_BG, and after registration perform a set difference operation on the two point clouds to obtain a new point cloud Scene_Point_Cloud_FG. Project Scene_Point_Cloud_FG onto the XY plane and obtain several circular regions C1, C2, ..., Cp based on the Hough transform. The point cloud information corresponding to the same circular region Ci is denoted Person_i.
Step 6: Threshold the scene information captured by camera B to obtain a binarized image Camera_B_Image_Binary, perform dilation and erosion operations on Camera_B_Image_Binary, and remove regions whose pixel area is smaller than a threshold thr (thr is set according to the actual scene, in the range of 1000 to 8196). The resulting connected regions are denoted R1, R2, ..., Rq.
Step 7: Using the transformation matrix Tac between cameras A and C obtained in step 4, project the point cloud information Person_i (i = 1, 2, ..., p) from step 5 into the image plane space of camera C; the corresponding region is denoted Region_from_A_i (i = 1, 2, ..., p). Using the transformation matrix Tbc between cameras B and C obtained in step 4, project the region Rj (j = 1, 2, ..., q) into the image plane space of camera C; the corresponding region is denoted Region_from_B_j (j = 1, 2, ..., q).
Step 8: Perform a set intersection operation on the two region sets obtained in step 7, namely {Region_from_A_1, ..., Region_from_A_p} and {Region_from_B_1, ..., Region_from_B_q}. When the number of common pixels between Region_from_A_i and Region_from_B_j exceeds a threshold thr_region (set in the range of 20 × 40 to 80 × 160), the corresponding human body region Region_from_C_k is obtained, where 1 ≤ k ≤ min(p, q).
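Step 6 above can be sketched on a small grayscale image stored as nested lists. The sketch assumes that warm bodies appear as high intensities in the thermal image, uses 4-connectivity, and omits the dilation and erosion pass; these choices and the function names are illustrative assumptions.

```python
from collections import deque

def threshold(image, level):
    """Binarize: 1 where intensity is at least level, else 0."""
    return [[1 if value >= level else 0 for value in row] for row in image]

def connected_regions(binary, min_area):
    """4-connected regions of ones with at least min_area pixels, as sets of (x, y)."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    regions = []
    for y in range(h):
        for x in range(w):
            if not binary[y][x] or seen[y][x]:
                continue
            seen[y][x] = True
            queue, region = deque([(x, y)]), set()
            while queue:  # breadth-first flood fill of one region
                px, py = queue.popleft()
                region.add((px, py))
                for nx, ny in ((px + 1, py), (px - 1, py), (px, py + 1), (px, py - 1)):
                    if 0 <= nx < w and 0 <= ny < h and binary[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            if len(region) >= min_area:  # drop regions smaller than thr
                regions.append(region)
    return regions
```

In a real deployment, `min_area` would play the role of thr (1000 to 8196 in the source) and the regions returned would be the R1, ..., Rq passed on to step 7.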
With continued reference to FIG. 3, FIG. 3 illustrates a framework diagram of a multi-modality unsupervised pedestrian pixel level semantic annotation system according to one embodiment of the present application. The system specifically includes an initial point cloud information acquisition unit 301, a person point cloud information set acquisition unit 302, a connected region information set acquisition unit 303, and a human body region set acquisition unit 304.
In a specific embodiment, the initial point cloud information acquisition unit 301 is configured to perform three-dimensional reconstruction of a monitored scene with no people present to obtain initial point cloud information of the monitored scene; the person point cloud information set acquisition unit 302 is configured to acquire first point cloud information of the monitored scene with a ToF image acquisition device, register the first point cloud information with the initial point cloud information, perform a set difference operation to obtain second point cloud information, and project the second point cloud information onto a horizontal plane to obtain a person point cloud information set; the connected region information set acquisition unit 303 is configured to dilate and erode a binarized image of the scene information acquired by an infrared image acquisition device to obtain a connected region information set; and the human body region set acquisition unit 304 is configured to project the person point cloud information set and the connected region information set into the image plane space of the RGB image acquisition device, perform a set intersection operation, and obtain the corresponding human body region set in response to the number of common pixels exceeding a first threshold.
Referring now to FIG. 4, there is illustrated a schematic diagram of a computer system 400 suitable for use in implementing an electronic device of an embodiment of the present application. The electronic device shown in fig. 4 is only an example and should not be construed as limiting the functionality and scope of use of the embodiments of the application.
As shown in fig. 4, the computer system 400 includes a Central Processing Unit (CPU) 401, which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 402 or a program loaded from a storage section 408 into a Random Access Memory (RAM) 403. In RAM 403, various programs and data required for the operation of system 400 are also stored. The CPU 401, ROM 402, and RAM 403 are connected to each other by a bus 404. An input/output (I/O) interface 405 is also connected to bus 404.
Connected to the I/O interface 405 are an input section 406 including a keyboard, a mouse, and the like; an output section 407 including a Liquid Crystal Display (LCD), a speaker, and the like; a storage section 408 including a hard disk and the like; and a communication section 409 including a network interface card such as a LAN card, a modem, and the like. The communication section 409 performs communication processing via a network such as the Internet. A drive 410 is also connected to the I/O interface 405 as needed. A removable medium 411 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is installed on the drive 410 as needed, so that a computer program read therefrom is installed into the storage section 408 as needed.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable storage medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network via the communication section 409 and/or installed from the removable medium 411. The above-described functions defined in the method of the present application are performed when the computer program is executed by the Central Processing Unit (CPU) 401. The computer readable medium of the present application may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of a computer-readable storage medium may include, but are not limited to, an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
In the present application, by contrast, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules involved in the embodiments of the present application may be implemented in software or in hardware.
As another aspect, the present application also provides a computer-readable storage medium that may be included in the electronic device described in the above embodiment, or may exist alone without being incorporated into the electronic device. The computer-readable storage medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: perform three-dimensional reconstruction on an unmanned monitoring scene to obtain initial point cloud information of the monitoring scene; acquire first point cloud information in the monitoring scene by using a Tof image acquisition device, register the first point cloud information with the initial point cloud information, perform a set difference operation to obtain second point cloud information, and project the second point cloud information onto a horizontal plane to obtain a personnel point cloud information set; dilate and erode a binary image, obtained by thresholding the scene information acquired by an infrared image acquisition device, to obtain a connected region information set; and project the personnel point cloud information set and the connected region information set into the image plane space of an RGB image acquisition device, perform a set intersection operation, and acquire a corresponding human body region set when the common pixels exceed a first threshold.
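The set difference and horizontal projection described above can be sketched as follows. Because two scans rarely contain exactly coincident points, this sketch quantizes both clouds to a voxel grid before subtracting them. The voxel grid, the `voxel_size` parameter, the function names, and the choice of the XY plane as the horizontal plane are all assumptions made for illustration, and the two clouds are assumed to be registered already.

```python
import math
from typing import Iterable, Set, Tuple

Point = Tuple[float, float, float]
Voxel = Tuple[int, int, int]


def voxelize(points: Iterable[Point], voxel_size: float) -> Set[Voxel]:
    """Quantize 3D points to integer voxel indices (assumed representation)."""
    return {tuple(math.floor(c / voxel_size) for c in p) for p in points}


def foreground_voxels(
    first_cloud: Iterable[Point],
    initial_cloud: Iterable[Point],
    voxel_size: float,
) -> Set[Voxel]:
    """Set difference: voxels occupied in the live scan but not in the
    empty-scene reconstruction. Both clouds must share one coordinate frame."""
    return voxelize(first_cloud, voxel_size) - voxelize(initial_cloud, voxel_size)


def project_to_horizontal(voxels: Iterable[Voxel]) -> Set[Tuple[int, int]]:
    """Drop the vertical index (assumed to be z) to get a ground-plane footprint."""
    return {(x, y) for x, y, _ in voxels}


# Toy data: a two-point background plus two points stacked above one ground cell.
initial_cloud = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
first_cloud = initial_cloud + [(1.3, 1.4, 0.3), (1.3, 1.4, 1.6)]
foreground = foreground_voxels(first_cloud, initial_cloud, voxel_size=0.25)
footprint = project_to_horizontal(foreground)
```

The two extra points fall in different voxels but share the same ground cell, so the footprint contains a single cell. Grouping such cells into individual persons is a separate step that this sketch does not attempt.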
The above description is only illustrative of the preferred embodiments of the present application and of the principles of the technology employed. It will be appreciated by persons skilled in the art that the scope of the application referred to in the present application is not limited to the specific combinations of the technical features described above, but also covers other technical solutions formed by any combination of the technical features described above or their equivalents without departing from the inventive concept described above. An example is a technical solution formed by replacing the above features with technical features having similar functions that are disclosed in (but not limited to) the present application.

Claims (10)

1. A multimodal unsupervised pedestrian pixel-level semantic annotation method, characterized by comprising:
S1: performing three-dimensional reconstruction on an unmanned monitoring scene to obtain initial point cloud information of the monitoring scene;
S2: acquiring first point cloud information in the monitoring scene by using a Tof image acquisition device, registering it with the initial point cloud information and then performing a set difference operation to obtain second point cloud information, and projecting the second point cloud information onto a horizontal plane to obtain a personnel point cloud information set;
S3: dilating and eroding a binary image, obtained by thresholding the scene information acquired by an infrared image acquisition device, to obtain a connected region information set; and
S4: projecting the personnel point cloud information set and the connected region information set respectively into the image plane space of an RGB image acquisition device, using the calibrated positional relationship between the cameras, performing a set intersection operation, and acquiring a corresponding human body region set in response to the common pixels exceeding a first threshold;
wherein, according to the positional relationship, the posture information and the intrinsic parameters of the image acquisition devices, a first transformation matrix between the Tof image acquisition device and the infrared image acquisition device, a second transformation matrix between the Tof image acquisition device and the RGB image acquisition device, and a third transformation matrix between the infrared image acquisition device and the RGB image acquisition device are obtained; and S4 specifically comprises:
projecting, by using the second transformation matrix, the personnel point cloud information into the image plane space of the RGB image acquisition device according to the camera imaging principle to obtain a first projection region set;
projecting, by using the third transformation matrix, the connected region information into the image plane space of the RGB image acquisition device to obtain a second projection region set; and
performing an intersection operation on the pixels of the first projection region set and the second projection region set.

2. The multimodal unsupervised pedestrian pixel-level semantic annotation method according to claim 1, characterized in that S1 specifically comprises:
selecting an arbitrary origin in the unmanned monitoring scene and establishing a three-dimensional coordinate system;
setting m*n points at intervals in the x-axis and z-axis directions as image acquisition positions of the RGB image acquisition device, selecting shooting angles at intervals of k degrees for each of the pitch angle, yaw angle and roll angle, and acquiring M=m*n*(180/k)*(180/k)*(180/k) images; and
performing three-dimensional reconstruction of the monitoring scene on the M images by using a Structure from Motion three-dimensional reconstruction algorithm, and obtaining the initial point cloud information.

3. The multimodal unsupervised pedestrian pixel-level semantic annotation method according to claim 1, characterized in that the first point cloud information and the initial point cloud information are registered by using an iterative closest point algorithm.

4. The multimodal unsupervised pedestrian pixel-level semantic annotation method according to claim 2, characterized in that the second point cloud information is projected onto the XY plane of the three-dimensional coordinate system, a number of circular regions are obtained based on the Hough transform, and the point cloud information corresponding to the same circular region is included in the personnel point cloud information set.

5. The multimodal unsupervised pedestrian pixel-level semantic annotation method according to claim 1, characterized in that, after the binary image is dilated and eroded in S3, the method further comprises removing regions whose pixel area is smaller than a second threshold.

6. The multimodal unsupervised pedestrian pixel-level semantic annotation method according to claim 5, characterized in that the first threshold is taken from the range of 20*40-80*160, and the second threshold is taken from the range of 1000-8196.

7. The multimodal unsupervised pedestrian pixel-level semantic annotation method according to claim 1, characterized in that a Tof image acquisition device, an infrared image acquisition device and an RGB image acquisition device are respectively installed in the monitoring scene, and the initial point cloud information is used to calculate the positional relationship and posture information of the Tof image acquisition device, the infrared image acquisition device and the RGB image acquisition device respectively.

8. The multimodal unsupervised pedestrian pixel-level semantic annotation method according to claim 7, characterized in that the positional relationship and posture information are specifically obtained by:
acquiring a depth image of the monitoring scene by using the Tof image acquisition device, and obtaining the degree-of-freedom pose of the Tof image acquisition device in the monitoring scene by using an iterative closest point algorithm in combination with the initial point cloud information; and
acquiring color images of the monitoring scene by using the infrared image acquisition device and the RGB image acquisition device, and obtaining the position and posture information of the infrared image acquisition device and the RGB image acquisition device from the acquired images and the initial point cloud information, using SIFT descriptors and a Bag of Words feature description algorithm, based on the Bundle Adjustment algorithm.

9. A computer-readable storage medium having one or more computer programs stored thereon, characterized in that the one or more computer programs, when executed by a computer processor, implement the method according to any one of claims 1 to 8.

10. A multimodal unsupervised pedestrian pixel-level semantic annotation system, characterized in that the system comprises:
an initial point cloud information acquisition unit, configured to perform three-dimensional reconstruction on an unmanned monitoring scene and obtain initial point cloud information of the monitoring scene;
a personnel point cloud information set acquisition unit, configured to acquire first point cloud information in the monitoring scene by using a Tof image acquisition device, register it with the initial point cloud information and then perform a set difference operation to obtain second point cloud information, and project the second point cloud information onto a horizontal plane to obtain a personnel point cloud information set;
a connected region information set acquisition unit, configured to dilate and erode a binary image, obtained by thresholding the scene information acquired by an infrared image acquisition device, to obtain a connected region information set; and
a human body region set acquisition unit, configured to project the personnel point cloud information set and the connected region information set respectively into the image plane space of an RGB image acquisition device, using the calibrated positional relationship between the cameras, perform a set intersection operation, and acquire a corresponding human body region set in response to the common pixels exceeding a first threshold;
wherein, according to the positional relationship, the posture information and the intrinsic parameters of the image acquisition devices, a first transformation matrix between the Tof image acquisition device and the infrared image acquisition device, a second transformation matrix between the Tof image acquisition device and the RGB image acquisition device, and a third transformation matrix between the infrared image acquisition device and the RGB image acquisition device are obtained; and the human body region set acquisition unit is specifically configured to:
project, by using the second transformation matrix, the personnel point cloud information into the image plane space of the RGB image acquisition device according to the camera imaging principle to obtain a first projection region set;
project, by using the third transformation matrix, the connected region information into the image plane space of the RGB image acquisition device to obtain a second projection region set; and
perform an intersection operation on the pixels of the first projection region set and the second projection region set.
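As a worked illustration of the image count in claim 2, the formula M=m*n*(180/k)*(180/k)*(180/k) can be evaluated directly. The sketch below assumes that k divides 180 evenly so that each angle takes a whole number of values; the claim itself does not state this, and the function name `image_count` is hypothetical.

```python
def image_count(m: int, n: int, k: int) -> int:
    """Number of images M = m*n*(180/k)^3 for an m-by-n grid of camera
    positions with pitch, yaw and roll each sampled every k degrees.
    Assumes k divides 180 evenly (an assumption, not stated in the claim)."""
    if m <= 0 or n <= 0 or k <= 0:
        raise ValueError("m, n and k must be positive")
    if 180 % k != 0:
        raise ValueError("this sketch assumes k divides 180 evenly")
    angles_per_axis = 180 // k
    return m * n * angles_per_axis ** 3


# A 3-by-4 grid of positions with 30-degree angular steps:
# 12 positions, 6 values per angle, 6**3 = 216 orientations per position.
example_total = image_count(3, 4, 30)  # 12 * 216 = 2592
```

The count grows with the cube of 180/k, so halving the angular step multiplies the number of images by eight.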
CN202011615688.6A 2020-12-30 2020-12-30 A multimodal unsupervised pedestrian pixel-level semantic annotation method and system Active CN112766061B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011615688.6A CN112766061B (en) 2020-12-30 2020-12-30 A multimodal unsupervised pedestrian pixel-level semantic annotation method and system
PCT/CN2021/074232 WO2022141721A1 (en) 2020-12-30 2021-01-28 Multimodal unsupervised pedestrian pixel-level semantic labeling method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011615688.6A CN112766061B (en) 2020-12-30 2020-12-30 A multimodal unsupervised pedestrian pixel-level semantic annotation method and system

Publications (2)

Publication Number Publication Date
CN112766061A CN112766061A (en) 2021-05-07
CN112766061B true CN112766061B (en) 2025-05-16

Family

ID=75697793

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011615688.6A Active CN112766061B (en) 2020-12-30 2020-12-30 A multimodal unsupervised pedestrian pixel-level semantic annotation method and system

Country Status (2)

Country Link
CN (1) CN112766061B (en)
WO (1) WO2022141721A1 (en)

Also Published As

Publication number Publication date
CN112766061A (en) 2021-05-07
WO2022141721A1 (en) 2022-07-07

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CB02 Change of applicant information

Country or region after: China

Address after: 361000 unit 102, No. 59, erhaihai Road, software park, Siming District, Xiamen City, Fujian Province

Applicant after: ROPT TECHNOLOGY GROUP Co.,Ltd.

Applicant after: ROPT (Xiamen) Big Data Group Co.,Ltd.

Address before: 361000 unit 102, No. 59, erhaihai Road, software park, Siming District, Xiamen City, Fujian Province

Applicant before: ROPT TECHNOLOGY GROUP Co.,Ltd.

Country or region before: China

Applicant before: Lop (Xiamen) system integration Co.,Ltd.