CN112766061B - A multimodal unsupervised pedestrian pixel-level semantic annotation method and system

Info

Publication number: CN112766061B
Application number: CN202011615688.6A
Authority: CN
Inventors: 彭鹭斌; 苏松志; 苏松剑; 蔡国榕; 陈延艺; 陈延行
Original assignee: Lop Xiamen System Integration Co ltd; Ropt Technology Group Co ltd
Current assignee: Lop Xiamen System Integration Co ltd; Ropt Technology Group Co ltd
Priority date: 2020-12-30
Filing date: 2020-12-30
Publication date: 2025-05-16
Anticipated expiration: 2040-12-30
Also published as: CN112766061A; WO2022141721A1

Abstract

The invention provides a multi-mode unsupervised pedestrian pixel-level semantic labeling method and system. The method comprises: performing three-dimensional reconstruction on an unmanned (person-free) monitoring scene to obtain initial point cloud information of the monitoring scene; acquiring first point cloud information in the monitoring scene by using a ToF image acquisition device, registering the first point cloud information with the initial point cloud information, and performing a set difference operation to obtain second point cloud information; projecting the second point cloud information onto a horizontal plane to obtain a personnel point cloud information set; dilating and eroding a binarized image of the scene information acquired by an infrared image acquisition device to obtain a connected region information set; using the calibrated positional relationship between the cameras to project the personnel point cloud information set and the connected region information set into the image plane space of an RGB image acquisition device and performing a set intersection operation; and obtaining the corresponding human body region set when the number of common pixels exceeds a first threshold. The method and system combine the advantages of cameras of different modalities and can effectively extract human body pixels in a scene.

Description

Multi-mode unsupervised pedestrian pixel-level semantic annotation method and system

Technical Field

The invention relates to the technical field of target detection, in particular to a multi-mode unsupervised pedestrian pixel-level semantic labeling method and system.

Background

Pedestrian detection is a classical problem in computer vision, and related techniques can be applied in fields such as video monitoring and automatic driving. The common approach at present is to first capture a large number of samples containing pedestrians, then manually mark the positions of the pedestrians in the pictures as training data, and finally train a classifier with a supervised learning method (such as a support vector machine or deep learning) to distinguish pedestrian regions from non-pedestrian regions. With the development of deep learning technology, an increasingly large number of training samples is required, and labeling a large number of samples is a time-consuming and labor-intensive task.

According to the format of the input data, pedestrian detection techniques can be classified into methods based on two-dimensional images (color or grayscale), methods based on three-dimensional point clouds, and methods based on infrared imaging. From the technical point of view, they can be divided into holistic methods, part-based methods and patch-based methods. Most of these methods rely on supervised classification techniques from machine learning. Supervised classification requires the positions of pedestrians in the pictures to be marked, which consumes a great deal of manpower, material resources and money.

Disclosure of Invention

In order to solve the technical problem that a large amount of manpower, material resources and financial resources are consumed to mark the positions of pedestrians in pictures in the prior art, the invention provides a multi-mode unsupervised pedestrian pixel-level semantic marking method and system, and the trouble of manually marking pedestrian samples is avoided.

According to one aspect of the invention, a multi-mode unsupervised pedestrian pixel level semantic annotation method is provided, comprising the following steps:

S1, performing three-dimensional reconstruction on an unmanned monitoring scene to obtain initial point cloud information of the monitoring scene;

S2, acquiring first point cloud information in the monitoring scene by using a ToF image acquisition device, registering the first point cloud information with the initial point cloud information, performing a set difference operation to obtain second point cloud information, and projecting the second point cloud information onto a horizontal plane to obtain a personnel point cloud information set;

S3, dilating and eroding the binarized image obtained by thresholding the scene information acquired by the infrared image acquisition device to obtain a connected region information set; and

S4, using the calibrated positional relationship between the cameras, projecting the personnel point cloud information set and the connected region information set into the image plane space of the RGB image acquisition device, performing a set intersection operation, and obtaining the corresponding human body region set when the number of common pixels exceeds a first threshold.

In some specific embodiments, step S1 specifically includes:

an origin is arbitrarily selected from an unmanned monitoring scene, and a three-dimensional coordinate system is established;

m x n points are arranged at intervals in the directions of an x axis and a z axis and serve as image acquisition positions of RGB image acquisition equipment, shooting angles are selected for pitch angle, yaw angle and roll angle respectively at intervals of k degrees, and M=m x n (180/k) images are acquired;

performing three-dimensional reconstruction of the monitoring scene from the M images by using a structure-from-motion (SfM) three-dimensional reconstruction algorithm to obtain the initial point cloud information. An SfM algorithm recovers the three-dimensional structure from the two-dimensional motion field projected by a moving object or scene.

In some particular embodiments, the first point cloud information and the initial point cloud information are registered using an iterative closest point algorithm. By means of this step, images acquired by different acquisition devices can be registered.

In some specific embodiments, the second point cloud information is projected onto the XY plane of the three-dimensional coordinate system, a plurality of circular areas are obtained based on the Hough transform, and the point cloud information corresponding to the same circular area is included in the personnel point cloud information set.

In some specific embodiments, step S3 further includes, after dilating and eroding the binarized image, removing regions whose pixel area is smaller than a second threshold. This step cleans up the binarized image so that the connected regions can be obtained.

In some specific embodiments, the first threshold is in the range of 20 × 40 to 80 × 160, and the second threshold is in the range of 1000 to 8196.

In some specific embodiments, the ToF image acquisition device, the infrared image acquisition device and the RGB image acquisition device are each installed in the monitoring scene, and the positional relationship and pose information of the three devices are calculated by using the initial point cloud information. The positional relationship and pose information of the image acquisition devices of the three modalities facilitate the later transformation of the point clouds and regions between the devices.

In some specific embodiments, the positional relationship and the pose information are acquired as follows:

acquiring a depth image of the monitoring scene by using the ToF image acquisition device, and obtaining the six-degree-of-freedom pose of the ToF image acquisition device in the monitoring scene by using an iterative closest point algorithm in combination with the initial point cloud information;

acquiring color images of the monitoring scene by using the infrared image acquisition device and the RGB image acquisition device, and, from the acquired images and the initial point cloud information, obtaining the position and pose information of the infrared image acquisition device and the RGB image acquisition device by using SIFT descriptors and a bag-of-words feature description algorithm, based on the bundle adjustment algorithm.

In some specific embodiments, a first transformation matrix between the ToF image acquisition device and the infrared image acquisition device, a second transformation matrix between the ToF image acquisition device and the RGB image acquisition device, and a third transformation matrix between the infrared image acquisition device and the RGB image acquisition device are obtained according to the positional relationship, the pose information, and the internal parameters of the image acquisition devices.

In some specific embodiments, step S4 specifically includes:

Projecting personnel point cloud information to an image plane space of RGB image acquisition equipment according to a camera imaging principle by using a second transformation matrix to obtain a first projection area set;

Projecting the connected region information to an image plane space of the RGB image acquisition equipment by using a third transformation matrix to obtain a second projection region set;

performing a set intersection operation on the pixels of the first projection region set and the pixels of the second projection region set.

According to a second aspect of the present invention, a computer-readable storage medium is presented, on which one or more computer programs are stored which, when executed by a computer processor, implement the method of any of the above.

According to a third aspect of the present invention, there is provided a multi-modal unsupervised pedestrian pixel level semantic annotation system comprising:

The initial point cloud information acquisition unit is configured to reconstruct the unmanned monitoring scene in three dimensions and acquire initial point cloud information of the monitoring scene;

the personnel point cloud information set acquisition unit is configured to acquire first point cloud information in the monitoring scene by using the ToF image acquisition device, register the first point cloud information with the initial point cloud information, perform a set difference operation to obtain second point cloud information, and project the second point cloud information onto a horizontal plane to obtain a personnel point cloud information set;

the connected region information set acquisition unit is configured to dilate and erode the binarized image obtained by thresholding the scene information acquired by the infrared image acquisition device to obtain a connected region information set; and

the human body region set acquisition unit is configured to use the calibrated positional relationship between the cameras to project the personnel point cloud information set and the connected region information set into the image plane space of the RGB image acquisition device, perform a set intersection operation, and obtain the corresponding human body region set when the number of common pixels exceeds a first threshold.

The invention provides a multi-mode unsupervised pedestrian pixel-level semantic labeling method and system, which integrate the advantages of different mode cameras of a Tof image acquisition device, an infrared image acquisition device and an RGB image acquisition device and can effectively extract human body pixel points in a scene. In the pedestrian detection task, pixel-level annotation information can be automatically provided for a machine learning algorithm.

Drawings

The accompanying drawings are included to provide a further understanding of the embodiments and are incorporated in and constitute a part of this specification. The drawings illustrate embodiments and, together with the description, serve to explain the principles of the application. Other embodiments and many of their intended advantages will be readily appreciated as they become better understood by reference to the following detailed description. Other features, objects and advantages of the present application will become more apparent upon reading the detailed description of non-limiting embodiments, made with reference to the accompanying drawings, in which:

FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;

FIG. 2 is a flow chart of a multi-modality unsupervised pedestrian pixel level semantic annotation method of one embodiment of the present application;

FIG. 3 is a framework diagram of a multi-modality unsupervised pedestrian pixel level semantic annotation system according to one embodiment of the present application;

fig. 4 is a schematic diagram of a computer system suitable for use in implementing an embodiment of the application.

Detailed Description

The application is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the application and are not limiting of the application. It should be noted that, for convenience of description, only the portions related to the present application are shown in the drawings.

It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other. The application will be described in detail below with reference to the drawings in connection with embodiments.

FIG. 1 illustrates an exemplary system architecture 100 to which the multi-modal unsupervised pedestrian pixel level semantic annotation method of embodiments of the present application may be applied.

As shown in fig. 1, a system architecture 100 may include a data server 101, a network 102, and a primary server 103. Network 102 is the medium used to provide communication links between data server 101 and primary server 103. Network 102 may include various connection types such as wired, wireless communication links, or fiber optic cables, among others.

The main server 103 may be a server providing various services, such as a data processing server processing information uploaded to the data server 101. The data processing server can detect pedestrians and store the detection result in a database in an associated mode.

It should be noted that, the multi-mode unsupervised pedestrian pixel level semantic annotation analysis method provided by the embodiment of the present application is generally executed by the main server 103, and accordingly, the device for semantic analysis of the small dataset is generally disposed in the main server 103.

The data server and the main server may be hardware or software. In the case of hardware, the system may be implemented as a distributed server cluster formed by a plurality of servers, or as a single server. In the case of software, it may be implemented as a plurality of software or software modules (e.g., software or software modules for providing distributed services) or as a single software or software module.

It should be understood that the numbers of data servers, networks, and main servers in fig. 1 are merely illustrative. There may be any number of data servers, networks, and main servers, as required by the implementation.

FIG. 2 shows a flow chart of a multi-modality unsupervised pedestrian pixel level semantic annotation method according to an embodiment of the present application. As shown in fig. 2, the method includes:

S201, performing three-dimensional reconstruction on an unmanned monitoring scene to obtain initial point cloud information of the monitoring scene. With no person in the monitoring scene, three-dimensional reconstruction is performed on the M images by using a structure-from-motion three-dimensional reconstruction algorithm to obtain the initial point cloud information. The goal of structure from motion (SfM) is to automatically recover camera motion and scene structure from two or more views; it is a self-calibrating technique that can automatically accomplish camera tracking and motion matching.

S202, acquiring first point cloud information in the monitoring scene by using a ToF image acquisition device, registering the first point cloud information with the initial point cloud information, performing a set difference operation to obtain second point cloud information, and projecting the second point cloud information onto a horizontal plane to obtain a personnel point cloud information set. The first point cloud information and the initial point cloud information are registered by using an iterative closest point algorithm. In this step, persons are allowed to enter the monitoring scene; the second point cloud information is projected onto the XY plane of the three-dimensional coordinate system established in step S201, a plurality of circular areas are obtained based on the Hough transform, and the point cloud information corresponding to the same circular area is included in the personnel point cloud information set.

S203, dilating and eroding the binarized image obtained by thresholding the scene information acquired by the infrared image acquisition device to obtain a connected region information set. After the binarized image is dilated and eroded, regions whose pixel area is smaller than a second threshold are removed to obtain the connected region information set, where the second threshold is in the range of 1000 to 8196.

S204, using the calibrated positional relationship between the cameras, projecting the personnel point cloud information set and the connected region information set into the image plane space of the RGB image acquisition device, performing a set intersection operation, and obtaining the corresponding human body region set when the number of common pixels exceeds a first threshold.

In a specific embodiment, a Tof image acquisition device, an infrared image acquisition device and an RGB image acquisition device are respectively installed in a monitored scene, and the position relationship and the posture information of the Tof image acquisition device, the infrared image acquisition device and the RGB image acquisition device are respectively calculated by using initial point cloud information:

A first transformation matrix between the ToF image acquisition device and the infrared image acquisition device, a second transformation matrix between the ToF image acquisition device and the RGB image acquisition device, and a third transformation matrix between the infrared image acquisition device and the RGB image acquisition device are obtained according to the positional relationship, the pose information and the internal parameters of the image acquisition devices. The personnel point cloud information is projected into the image plane space of the RGB image acquisition device by using the second transformation matrix, according to the camera imaging principle, to obtain a first projection region set. The connected region information is projected into the image plane space of the RGB image acquisition device by using the third transformation matrix to obtain a second projection region set. A set intersection operation is performed on the pixels of the first projection region set and the pixels of the second projection region set, and regions whose number of common pixels exceeds a first threshold are taken as the human body region set, where the first threshold is in the range of 20 × 40 to 80 × 160.

The multi-mode unsupervised pedestrian pixel level semantic labeling method according to one specific embodiment of the present invention realizes pedestrian detection and automatic labeling by using three acquisition devices, namely a time-of-flight (ToF) camera, an infrared thermal imaging camera and an RGB color camera. In the following description, these are referred to as camera A, camera B and camera C, respectively.

Step 1: Select a monitoring scene, take a point P on the ground as the origin, and establish a three-dimensional coordinate system XYZ, in which the X axis lies in the horizontal plane and points in a chosen direction, the Z axis is perpendicular to the ground and points toward the center of the earth, and the Y axis lies in the horizontal plane perpendicular to the X axis, with its direction determined by the right-hand rule.

Step 2: In the X-axis direction, select one point every 100 cm, giving m points Q1, Q2, ..., Qm as the horizontal shooting positions of camera C; in the Z-axis direction, select one point every 50 cm, giving n points P1, P2, ..., Pn as the vertical shooting heights of the camera. At each of the m × n positions, select one shooting angle every k degrees for each of the pitch, yaw and roll angles. With camera C placed at these positions and angles, a total of M = m × n × (180/k) images are collected.

Step 3: Perform three-dimensional reconstruction of the scene from the M images obtained in step 2 by using the structure-from-motion technique, thereby obtaining the point cloud information of the scene, denoted Scene_Point_Cloud_BG.

Step 4: Install camera A, camera B and camera C in the scene, and calculate the mutual positional relationship between cameras A, B and C by using the scene point cloud information obtained in step 3. In this step, it must be ensured that there are no moving objects in the scene.

4A) After camera A is installed, a depth image of the scene is acquired and denoted Depth_Image. Taking Depth_Image and Scene_Point_Cloud_BG as inputs, the six-degree-of-freedom pose of camera A in the scene (three rotation angles and three translation coordinates) is solved by using the iterative closest point (ICP) algorithm.

4B) After cameras B and C are installed, color picture information of the scene is acquired and denoted Color_Image_B and Color_Image_C, respectively. From the M acquired images and the point cloud information of the scene, the position and pose information of cameras B and C is solved by using SIFT descriptors and the bag-of-words feature description method, based on the bundle adjustment algorithm.

4C) According to the camera pose information obtained in steps 4A and 4B and the internal parameters of cameras A, B and C, the transformation matrix Tab between camera A and camera B, the transformation matrix Tac between camera A and camera C, and the transformation matrix Tbc between camera B and camera C are obtained.
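As a rough illustration of how a camera-to-camera transform such as Tac could be composed from two solved camera poses, the sketch below handles only the rigid (extrinsic) part. It assumes each pose is a 4 × 4 matrix that maps camera coordinates to scene coordinates; that convention and all function names are assumptions made for the example, and the internal parameters are left out.

```python
def mat_mul(a, b):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def rigid_inverse(t):
    """Invert a rigid transform [R | p] as [R^T | -R^T p]."""
    r_t = [[t[j][i] for j in range(3)] for i in range(3)]
    p = [t[i][3] for i in range(3)]
    new_p = [-sum(r_t[i][k] * p[k] for k in range(3)) for i in range(3)]
    return [r_t[i] + [new_p[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def relative_transform(pose_src, pose_dst):
    """Transform that takes points from the source camera frame to the
    destination camera frame (assumes camera-to-scene pose matrices)."""
    return mat_mul(rigid_inverse(pose_dst), pose_src)


def apply_transform(t, point):
    """Apply a 4x4 transform to a 3D point."""
    x, y, z = point
    return tuple(t[i][0] * x + t[i][1] * y + t[i][2] * z + t[i][3]
                 for i in range(3))


# Hypothetical poses: camera A shifted 1 unit along X, camera C rotated
# 90 degrees about Z and shifted 2 units along Y.
pose_a = [[1.0, 0.0, 0.0, 1.0],
          [0.0, 1.0, 0.0, 0.0],
          [0.0, 0.0, 1.0, 0.0],
          [0.0, 0.0, 0.0, 1.0]]
pose_c = [[0.0, -1.0, 0.0, 0.0],
          [1.0, 0.0, 0.0, 2.0],
          [0.0, 0.0, 1.0, 0.0],
          [0.0, 0.0, 0.0, 1.0]]

t_ac = relative_transform(pose_a, pose_c)
origin_of_a_in_c = apply_transform(t_ac, (0.0, 0.0, 0.0))
```

If the poses were instead stored as scene-to-camera matrices, the composition order would be reversed, so the convention should be checked against whatever the ICP and bundle adjustment steps actually output.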

Step 5: After steps 1 to 4 are completed, open the scene and allow pedestrians to enter. The three-dimensional point cloud information of the scene, Scene_Point_Cloud_New, is acquired by using camera A; the ICP algorithm is used again to register Scene_Point_Cloud_New with Scene_Point_Cloud_BG, and after registration the set difference of the two point clouds is computed to obtain a new point cloud, Scene_Point_Cloud_FG. Scene_Point_Cloud_FG is projected onto the XY plane, and several circular areas C1, C2, ..., Cp are obtained based on the Hough transform. The point cloud information corresponding to the same circular area Ci is denoted Person_i.
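The set difference and the projection onto the XY plane in step 5 can be sketched as follows. The text does not say how two points from different captures are judged to be "the same", so this example assumes a voxel-grid tolerance: a new point is discarded when its voxel already contains a background point. The voxel size, the function names and the toy coordinates are all assumptions; the Hough-based circle grouping is not shown.

```python
def voxel_key(point, voxel_size):
    """Quantize a 3D point to the integer index of the voxel containing it."""
    return tuple(int(c // voxel_size) for c in point)


def point_cloud_difference(new_cloud, background_cloud, voxel_size=0.05):
    """Keep the points of new_cloud whose voxel holds no background point.

    Both clouds are assumed to be already registered in the same frame.
    """
    occupied = {voxel_key(p, voxel_size) for p in background_cloud}
    return [p for p in new_cloud if voxel_key(p, voxel_size) not in occupied]


def project_to_xy(cloud):
    """Project onto the horizontal XY plane by dropping the vertical Z value."""
    return [(x, y) for x, y, _ in cloud]


# Toy data: two background points, plus two new points belonging to a person.
background = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
new_scan = [(0.01, 0.01, 0.0), (1.0, 0.0, 0.0),
            (0.5, 0.5, -1.0), (0.52, 0.5, -1.7)]

foreground = point_cloud_difference(new_scan, background)
footprint = project_to_xy(foreground)
```

A plain voxel test can miss matches that straddle a voxel boundary; a nearest-neighbour distance test would be a natural alternative, at higher cost.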

Step 6: Threshold the scene information captured by camera B to obtain a binarized image Camera_B_Image_Binary, perform dilation and erosion operations on Camera_B_Image_Binary, and remove regions whose pixel area is smaller than a threshold thr (thr is set according to the actual scene, in the range of 1000 to 8196); the resulting connected regions are denoted R1, R2, ..., Rq.
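A minimal pure-Python sketch of step 6 is given below: thresholding, one dilation followed by one erosion with a 3 × 3 structuring element, and removal of small connected regions. The 3 × 3 element, the 4-connectivity, the intensity level of 128 and the toy image are assumptions chosen for the example; the area threshold here is tiny only because the image is tiny.

```python
from collections import deque


def threshold(image, level):
    """Binarize a grayscale image given as a list of rows."""
    return [[1 if v >= level else 0 for v in row] for row in image]


def _morph(binary, require_all):
    """3x3 erosion (require_all=True) or dilation (require_all=False).

    Windows are clipped at the image border, so out-of-image pixels are ignored.
    """
    h, w = len(binary), len(binary[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [binary[ny][nx]
                      for ny in range(max(0, y - 1), min(h, y + 2))
                      for nx in range(max(0, x - 1), min(w, x + 2))]
            out[y][x] = int(all(window)) if require_all else int(any(window))
    return out


def dilate(binary):
    return _morph(binary, False)


def erode(binary):
    return _morph(binary, True)


def connected_regions(binary, min_area):
    """Return 4-connected regions (sets of (row, col)) with area >= min_area."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    regions = []
    for y in range(h):
        for x in range(w):
            if not binary[y][x] or seen[y][x]:
                continue
            queue, pixels = deque([(y, x)]), set()
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                pixels.add((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                               (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and binary[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(pixels) >= min_area:
                regions.append(pixels)
    return regions


# Toy thermal image: a warm 3x3 blob with a cold hole, plus one noisy pixel.
image = [[0] * 12 for _ in range(9)]
for row in range(2, 5):
    for col in range(2, 5):
        image[row][col] = 200
image[3][3] = 0      # hole inside the blob
image[6][9] = 200    # isolated noise pixel

closed = erode(dilate(threshold(image, 128)))
regions = connected_regions(closed, min_area=4)
```

In this toy case the dilation followed by erosion fills the hole in the blob, and the area filter discards the isolated pixel, leaving a single region.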

Step 7: According to the transformation matrix Tac between cameras A and C obtained in step 4, the point cloud information Person_i (i = 1, 2, ..., p) of step 5 is projected into the image plane space of camera C according to the camera imaging principle, and the corresponding region is denoted Region_from_A_i (i = 1, 2, ..., p). According to the transformation matrix Tbc between cameras B and C obtained in step 4, the region Rj (j = 1, 2, ..., q) is projected into the image plane space of camera C, and the corresponding region is denoted Region_from_B_j (j = 1, 2, ..., q).
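The projection of a person's point cloud into the image plane of camera C can be sketched with a standard pinhole model. The example assumes that the transform is a 4 × 4 rigid matrix from the source frame to the camera C frame, that camera C looks along its own positive z axis, and that lens distortion is ignored; the intrinsic values fx, fy, cx, cy and the function name are illustrative and not taken from the text.

```python
def project_points(points, transform, fx, fy, cx, cy, width, height):
    """Project 3D points into the pixel grid of the target camera.

    Returns the set of (u, v) pixels hit by at least one point.
    """
    pixels = set()
    for x, y, z in points:
        # Move the point into the target camera frame.
        xc, yc, zc = [transform[i][0] * x + transform[i][1] * y
                      + transform[i][2] * z + transform[i][3]
                      for i in range(3)]
        if zc <= 0:
            continue  # behind the camera, not visible
        # Pinhole projection followed by rounding to the nearest pixel.
        u = int(round(fx * xc / zc + cx))
        v = int(round(fy * yc / zc + cy))
        if 0 <= u < width and 0 <= v < height:
            pixels.add((u, v))
    return pixels


identity = [[1.0, 0.0, 0.0, 0.0],
            [0.0, 1.0, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0],
            [0.0, 0.0, 0.0, 1.0]]

person_points = [(0.0, 0.0, 2.0),    # straight ahead
                 (0.5, -0.5, 2.0),   # up and to the right in the image
                 (0.0, 0.0, -1.0),   # behind the camera
                 (5.0, 0.0, 1.0)]    # outside the field of view

region = project_points(person_points, identity,
                        fx=100.0, fy=100.0, cx=50.0, cy=40.0,
                        width=100, height=80)
```

A sparse point cloud projects to scattered pixels, so in practice the projected set would probably need to be filled in (for example by a hull or a morphological closing) before it is compared with a dense image region.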

Step 8: Perform a set intersection operation on the two region sets obtained in step 7, namely {Region_from_A_1, ..., Region_from_A_p} and {Region_from_B_1, ..., Region_from_B_q}. When the number of common pixels between Region_from_A_i and Region_from_B_j exceeds a threshold thr_region (set in the range of 20 × 40 to 80 × 160), a corresponding human body region Region_from_C_k is obtained, where 1 ≤ k ≤ min(p, q).
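The pairwise intersection test of step 8 can be sketched with Python sets of pixel coordinates. The example treats thr_region as a pixel count and uses 20 × 40 = 800, the lower end of the stated range; taking the common pixels themselves as the resulting human body region is one reading of the text, and the rectangle helper and the toy regions are assumptions for the illustration.

```python
def rect(x0, y0, w, h):
    """Helper for the example: pixel set of an axis-aligned rectangle."""
    return {(x, y) for x in range(x0, x0 + w) for y in range(y0, y0 + h)}


def match_regions(regions_a, regions_b, min_common):
    """Return (i, j, common_pixels) for every pair of regions whose number of
    common pixels exceeds min_common (indices start at 1, as in the text)."""
    matches = []
    for i, region_a in enumerate(regions_a, start=1):
        for j, region_b in enumerate(regions_b, start=1):
            common = region_a & region_b
            if len(common) > min_common:
                matches.append((i, j, common))
    return matches


# Regions projected from camera A (point cloud) and camera B (thermal image).
regions_from_a = [rect(0, 0, 30, 60), rect(100, 0, 30, 60)]
regions_from_b = [rect(5, 5, 30, 60), rect(100, 50, 30, 60)]

thr_region = 20 * 40
human_regions = match_regions(regions_from_a, regions_from_b, thr_region)
```

Here the first pair overlaps in 25 × 55 = 1375 pixels and is kept, while the second pair overlaps in only 30 × 10 = 300 pixels and is rejected.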

With continued reference to FIG. 3, FIG. 3 illustrates a framework diagram of a multi-modality unsupervised pedestrian pixel level semantic annotation system according to one embodiment of the present application. The system specifically includes an initial point cloud information acquisition unit 301, a person point cloud information set acquisition unit 302, a connected region information set acquisition unit 303, and a human body region set acquisition unit 304.

In a specific embodiment, the initial point cloud information acquisition unit 301 is configured to perform three-dimensional reconstruction on an unmanned monitoring scene to obtain initial point cloud information of the monitoring scene. The personnel point cloud information set acquisition unit 302 is configured to acquire first point cloud information in the monitoring scene by using a ToF image acquisition device, register the first point cloud information with the initial point cloud information, perform a set difference operation to obtain second point cloud information, and project the second point cloud information onto a horizontal plane to obtain a personnel point cloud information set. The connected region information set acquisition unit 303 is configured to dilate and erode the binarized image obtained by thresholding the scene information acquired by an infrared image acquisition device to obtain a connected region information set. The human body region set acquisition unit 304 is configured to project the personnel point cloud information set and the connected region information set into the image plane space of the RGB image acquisition device, perform a set intersection operation, and obtain the corresponding human body region set in response to the number of common pixels exceeding a first threshold.

Referring now to FIG. 4, there is illustrated a schematic diagram of a computer system 400 suitable for use in implementing an electronic device of an embodiment of the present application. The electronic device shown in fig. 4 is only an example and should not be construed as limiting the functionality and scope of use of the embodiments of the application.

As shown in fig. 4, the computer system 400 includes a Central Processing Unit (CPU) 401, which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 402 or a program loaded from a storage section 408 into a Random Access Memory (RAM) 403. In RAM 403, various programs and data required for the operation of system 400 are also stored. The CPU 401, ROM 402, and RAM 403 are connected to each other by a bus 404. An input/output (I/O) interface 405 is also connected to bus 404.

Connected to the I/O interface 405 are an input section 406 including a keyboard, a mouse, and the like, an output section 407 including a Liquid Crystal Display (LCD) and the like and a speaker and the like, a storage section 408 including a hard disk and the like, and a communication section 409 including a network interface card such as a LAN card, a modem, and the like. The communication section 409 performs communication processing via a network such as the internet. The drive 410 is also connected to the I/O interface 405 as needed. A removable medium 411 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is installed on the drive 410 as needed, so that a computer program read therefrom is installed into the storage section 408 as needed.

In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable storage medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network via the communication section 409 and/or installed from the removable medium 411. The above-described functions defined in the method of the present application are performed when the computer program is executed by the Central Processing Unit (CPU) 401. The computer readable medium of the present application may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of a computer-readable storage medium may include, but are not limited to, an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
In the present application, however, a computer readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with computer readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Computer program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including object oriented programming languages such as Java, Smalltalk and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The modules involved in the embodiments of the present application may be implemented in software or in hardware.

As another aspect, the present application also provides a computer-readable storage medium, which may be included in the electronic device described in the above embodiment, or may exist alone without being incorporated into the electronic device. The computer-readable storage medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: perform three-dimensional reconstruction on an unmanned monitoring scene to obtain initial point cloud information of the monitoring scene; obtain first point cloud information in the monitoring scene using a Tof image acquisition device, register the first point cloud information with the initial point cloud information, and perform a set difference operation to obtain second point cloud information; project the second point cloud information onto a horizontal plane to obtain a personnel point cloud information set; dilate and erode the binary image obtained by thresholding the scene information acquired by an infrared image acquisition device to obtain a connected region information set; and perform a set intersection operation on the personnel point cloud information set and the connected region information set in the image plane space of an RGB image acquisition device, and obtain a corresponding human body region set when the common pixels exceed a first threshold.

The above description is only illustrative of the preferred embodiments of the present application and of the principles of the technology employed. It will be appreciated by persons skilled in the art that the scope of the application referred to in the present application is not limited to the specific combinations of the technical features described above, but also covers other technical solutions formed by any combination of the technical features described above or their equivalents without departing from the inventive concept described above, for example, technical solutions formed by replacing the above features with technical features having similar functions disclosed in (but not limited to) the present application.

Claims

1. A multimodal unsupervised pedestrian pixel-level semantic annotation method, characterized by comprising:

S1: performing three-dimensional reconstruction on an unmanned monitoring scene to obtain initial point cloud information of the monitoring scene;

S2: using a Tof image acquisition device to obtain first point cloud information in the monitoring scene, aligning it with the initial point cloud information and performing a set difference operation to obtain second point cloud information, and projecting the second point cloud information on a horizontal plane to obtain a personnel point cloud information set;

S3: dilating and eroding the binary image obtained by thresholding the scene information acquired by the infrared image acquisition device to obtain a connected region information set; and

S4: projecting the personnel point cloud information set and the connected region information set respectively into the image plane space of the RGB image acquisition device using the positional relationship between the calibrated cameras to perform a set intersection operation, and acquiring a corresponding human body region set in response to the number of common pixels exceeding a first threshold;

wherein, according to the positional relationship, posture information and internal parameters of the image acquisition devices, a first transformation matrix between the Tof image acquisition device and the infrared image acquisition device, a second transformation matrix between the Tof image acquisition device and the RGB image acquisition device, and a third transformation matrix between the infrared image acquisition device and the RGB image acquisition device are obtained; and S4 specifically comprises:

projecting, using the second transformation matrix, the personnel point cloud information onto the image plane space of the RGB image acquisition device according to the camera imaging principle to obtain a first projection region set;

projecting, using the third transformation matrix, the connected region information onto the image plane space of the RGB image acquisition device to obtain a second projection region set; and

performing an intersection operation on the pixels of the first projection region set and the pixels of the second projection region set.
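
As an illustration of S4 only, and not the implementation disclosed in this application, the projection and intersection steps could be sketched in Python with NumPy as follows. The function names, the 4x4 homogeneous matrix `T_src_to_rgb`, the pinhole intrinsic matrix `K`, and the assumption that the connected region masks have already been warped into the RGB image plane are hypothetical choices made for this sketch. The first threshold is interpreted here as a count of common pixels.

```python
import numpy as np

def project_points(points_xyz, T_src_to_rgb, K, image_shape):
    """Project (N, 3) points from a source camera frame into a boolean RGB-plane mask.

    T_src_to_rgb is an assumed 4x4 homogeneous transform (e.g. the second
    transformation matrix, Tof -> RGB); K is an assumed 3x3 pinhole intrinsic matrix.
    """
    h, w = image_shape
    homog = np.hstack([points_xyz, np.ones((len(points_xyz), 1))])
    cam = (T_src_to_rgb @ homog.T).T[:, :3]   # points in the RGB camera frame
    cam = cam[cam[:, 2] > 0]                  # keep points in front of the camera
    uvw = (K @ cam.T).T
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    mask = np.zeros((h, w), dtype=bool)
    mask[v[inside], u[inside]] = True
    return mask

def human_regions(person_masks, region_masks, first_threshold):
    """Intersect every person mask with every region mask in the RGB plane.

    Returns (person index, region index, common-pixel mask) for each pair whose
    number of common pixels exceeds first_threshold.
    """
    result = []
    for i, person in enumerate(person_masks):
        for j, region in enumerate(region_masks):
            common = person & region
            if common.sum() > first_threshold:
                result.append((i, j, common))
    return result
```

In this sketch each person mask would come from `project_points` applied to one entry of the personnel point cloud information set, and each region mask from one connected region of the infrared image.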

2. The multimodal unsupervised pedestrian pixel-level semantic annotation method according to claim 1, characterized in that S1 specifically comprises:

selecting an origin at random in the unmanned monitoring scene to establish a three-dimensional coordinate system;

setting m*n points at intervals in the x-axis and z-axis directions as image acquisition positions of the RGB image acquisition device, and selecting shooting angles at intervals of k degrees for the pitch angle, yaw angle, and roll angle, respectively, to acquire M=m*n*(180/k)*(180/k)*(180/k) images; and

performing three-dimensional reconstruction of the monitoring scene on the M images using the Structure from Motion three-dimensional reconstruction algorithm to obtain the initial point cloud information.
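
The image count M in claim 2 can be checked with a few lines of Python. This is a minimal sketch: the function names, the `spacing` parameter, and the assumption that k divides 180 evenly are illustrative and are not stated in the claim.

```python
from itertools import product

def image_count(m, n, k):
    """M = m*n*(180/k)^3, assuming k divides 180 evenly (an assumption of this sketch)."""
    if 180 % k != 0:
        raise ValueError("k must divide 180")
    steps = 180 // k
    return m * n * steps ** 3

def capture_poses(m, n, k, spacing=1.0):
    """Yield (x, z, pitch, yaw, roll) for every acquisition position and angle.

    spacing is a hypothetical grid interval; the claim only says "at intervals".
    """
    angles = range(0, 180, k)
    for ix, iz in product(range(m), range(n)):
        for pitch, yaw, roll in product(angles, angles, angles):
            yield (ix * spacing, iz * spacing, pitch, yaw, roll)
```

For example, a 3 by 4 grid of positions with k = 30 gives 6 angles per axis, so 3 * 4 * 6^3 = 2592 images.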

3. The multimodal unsupervised pedestrian pixel-level semantic annotation method according to claim 1, characterized in that the first point cloud information and the initial point cloud information are aligned using an iterative closest point algorithm.
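
For readers unfamiliar with the iterative closest point algorithm, the following NumPy sketch shows a basic point-to-point variant together with a tolerance-based set difference. It is a simplified illustration under stated assumptions: the brute-force nearest-neighbour search, the fixed iteration count, and the distance tolerance `tol` are hypothetical choices, and a practical system would more likely use a k-d tree or a point cloud library.

```python
import numpy as np

def nearest(src, dst):
    """Index of, and distance to, the nearest dst point for every src point (brute force)."""
    d2 = ((src[:, None, :] - dst[None, :, :]) ** 2).sum(axis=2)
    idx = d2.argmin(axis=1)
    return idx, np.sqrt(d2[np.arange(len(src)), idx])

def best_rigid_transform(src, dst):
    """Least-squares R, t such that R @ src_i + t approximates dst_i (Kabsch method)."""
    cs, cd = src.mean(axis=0), dst.mean(axis=0)
    H = (src - cs).T @ (dst - cd)
    U, _, Vt = np.linalg.svd(H)
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0   # avoid reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, cd - R @ cs

def icp(src, dst, iterations=30):
    """Align src to dst; returns the accumulated R, t and the aligned copy of src."""
    R_total, t_total = np.eye(3), np.zeros(3)
    cur = src.copy()
    for _ in range(iterations):
        idx, _ = nearest(cur, dst)
        R, t = best_rigid_transform(cur, dst[idx])
        cur = cur @ R.T + t
        R_total, t_total = R @ R_total, R @ t_total + t
    return R_total, t_total, cur

def set_difference(aligned, background, tol):
    """Keep aligned points that have no background point within tol."""
    _, dist = nearest(aligned, background)
    return aligned[dist > tol]
```

Here `src` plays the role of the first point cloud information, `dst` the initial point cloud information, and the output of `set_difference` the second point cloud information.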

4. The multimodal unsupervised pedestrian pixel-level semantic annotation method according to claim 2, characterized in that the second point cloud information is projected onto the XY plane of the three-dimensional coordinate system, a number of circular regions are obtained based on the Hough transform, and the point cloud information corresponding to the same circular region is included in the personnel point cloud information set.
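
One way to picture the Hough-based grouping is a fixed-radius circle Hough transform over the projected points. The sketch below is only an illustration: the known `radius`, the accumulator `cell` size, the `min_votes` cut-off, and the `slack` factor used when assigning points to a circle are all hypothetical parameters that the claim does not specify.

```python
import numpy as np

def hough_circle_centres(points_2d, radius, cell=0.05, min_votes=20):
    """Vote for centres of circles with a known radius; return centres with enough votes."""
    lo, hi = points_2d.min(axis=0), points_2d.max(axis=0)
    origin = lo - radius - 0.5 * cell
    shape = np.ceil((hi - lo + 2 * radius) / cell).astype(int) + 2
    acc = np.zeros(shape, dtype=int)
    angles = np.linspace(0.0, 2.0 * np.pi, 72, endpoint=False)
    ring = radius * np.stack([np.cos(angles), np.sin(angles)], axis=1)
    for p in points_2d:
        # Every point votes once for each accumulator cell lying on its ring.
        cells = np.unique(np.floor((p + ring - origin) / cell).astype(int), axis=0)
        acc[cells[:, 0], cells[:, 1]] += 1
    centres, win = [], int(np.ceil(radius / cell))
    while acc.max() >= min_votes:
        i, j = np.unravel_index(acc.argmax(), acc.shape)
        centres.append(origin + (np.array([i, j]) + 0.5) * cell)
        # Suppress the neighbourhood of the accepted peak before searching again.
        acc[max(i - win, 0):i + win + 1, max(j - win, 0):j + win + 1] = 0
    return np.array(centres)

def group_by_circle(points_2d, centres, radius, slack=1.5):
    """Indices of the points falling inside each detected circular region."""
    groups = []
    for c in centres:
        d = np.linalg.norm(points_2d - c, axis=1)
        groups.append(np.flatnonzero(d <= slack * radius))
    return groups
```

Each index group returned by `group_by_circle` selects the 3-D points of one candidate person, which would then form one entry of the personnel point cloud information set.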

5. The multimodal unsupervised pedestrian pixel-level semantic annotation method according to claim 1, characterized in that, after the binary image is dilated and eroded in S3, S3 further comprises removing regions whose pixel area is smaller than a second threshold.
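
The thresholding, dilation, erosion and small-region removal of S3 could be sketched as below. This is a toy NumPy version written for illustration: the 3x3 structuring element, the 4-connectivity, the intensity `threshold`, and the function names are assumptions, and `min_area` stands in for the second threshold.

```python
import numpy as np
from collections import deque

def _windows(mask, pad_value):
    """The nine shifted views of mask that make up each pixel's 3x3 neighbourhood."""
    p = np.pad(mask, 1, constant_values=pad_value)
    h, w = mask.shape
    return [p[dy:dy + h, dx:dx + w] for dy in range(3) for dx in range(3)]

def dilate(mask):
    """Set a pixel if any pixel of its 3x3 neighbourhood is set."""
    return np.logical_or.reduce(_windows(mask, False))

def erode(mask):
    """Keep a pixel only if its whole 3x3 neighbourhood is set."""
    return np.logical_and.reduce(_windows(mask, False))

def connected_regions(mask, min_area):
    """4-connected regions (lists of (row, col)) with at least min_area pixels."""
    h, w = mask.shape
    seen = np.zeros((h, w), dtype=bool)
    regions = []
    for y, x in zip(*np.nonzero(mask)):
        if seen[y, x]:
            continue
        queue, pixels = deque([(int(y), int(x))]), []
        seen[y, x] = True
        while queue:
            cy, cx = queue.popleft()
            pixels.append((cy, cx))
            for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                    seen[ny, nx] = True
                    queue.append((ny, nx))
        if len(pixels) >= min_area:
            regions.append(pixels)
    return regions

def clean_infrared(ir_image, threshold, min_area):
    """Threshold, dilate, erode, then drop regions smaller than min_area."""
    binary = ir_image > threshold
    return connected_regions(erode(dilate(binary)), min_area)
```

Dilating before eroding fills small holes inside a warm body region, and the area filter then discards isolated hot spots.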

6. The multimodal unsupervised pedestrian pixel-level semantic annotation method according to claim 5, characterized in that the first threshold is taken from the range of 20*40 to 80*160, and the second threshold is taken from the range of 1000 to 8196.

7. The multimodal unsupervised pedestrian pixel-level semantic annotation method according to claim 1, characterized in that a Tof image acquisition device, an infrared image acquisition device and an RGB image acquisition device are each installed in the monitoring scene, and the initial point cloud information is used to calculate the positional relationship and posture information of each of the Tof image acquisition device, the infrared image acquisition device and the RGB image acquisition device.

8. The multimodal unsupervised pedestrian pixel-level semantic annotation method according to claim 7, characterized in that obtaining the positional relationship and posture information specifically comprises:

acquiring a depth image of the monitoring scene using the Tof image acquisition device, and acquiring the degree-of-freedom position of the Tof image acquisition device in the monitoring scene using an iterative closest point algorithm in combination with the initial point cloud information; and

acquiring a color image of the monitoring scene using the infrared image acquisition device and the RGB image acquisition device, and acquiring the position and posture information of the infrared image acquisition device and the RGB image acquisition device based on the Bundle Adjustment algorithm, according to the acquired images and the initial point cloud information, using a SIFT descriptor and a bag-of-words feature description algorithm.
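
Once each device's pose in the scene coordinate system is known, a transformation matrix between two devices follows from composing one pose with the inverse of the other. The sketch below assumes that each pose is expressed as a 4x4 homogeneous matrix mapping camera coordinates to scene coordinates; this convention and the function names are assumptions made for illustration and are not taken from the claims.

```python
import numpy as np

def pose_matrix(R, t):
    """Build a 4x4 camera-to-scene matrix from a 3x3 rotation and a translation."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def relative_transform(T_scene_from_a, T_scene_from_b):
    """Matrix that maps points in camera a's frame into camera b's frame."""
    return np.linalg.inv(T_scene_from_b) @ T_scene_from_a
```

Under this convention the second transformation matrix (Tof to RGB) equals the third (infrared to RGB) multiplied by the first (Tof to infrared), which gives a simple consistency check on a calibration.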

9. A computer-readable storage medium having one or more computer programs stored thereon, wherein the one or more computer programs implement the method according to any one of claims 1 to 8 when executed by a computer processor.

10. A multimodal unsupervised pedestrian pixel-level semantic annotation system, characterized in that the system comprises:

An initial point cloud information acquisition unit: configured to perform three-dimensional reconstruction on an unmanned monitoring scene and acquire initial point cloud information of the monitoring scene;

A personnel point cloud information set acquisition unit: configured to acquire the first point cloud information in the monitoring scene by using a Tof image acquisition device, perform a set difference operation after registering the first point cloud information with the initial point cloud information to obtain the second point cloud information, and project the second point cloud information on a horizontal plane to obtain a personnel point cloud information set;

A connected region information set acquisition unit: configured to dilate and erode the binary image obtained by the infrared image acquisition device after the scene information is thresholded, so as to obtain a connected region information set; and

A human body region set acquisition unit: configured to project the personnel point cloud information set and the connected region information set into the image plane space of the RGB image acquisition device respectively, using the positional relationship between the calibrated cameras, to perform a set intersection operation, and acquire a corresponding human body region set in response to the number of common pixels exceeding a first threshold;

According to the position relationship, posture information and internal parameters of the image acquisition device, a first transformation matrix between the Tof image acquisition device and the infrared image acquisition device, a second transformation matrix between the Tof image acquisition device and the RGB image acquisition device, and a third transformation matrix between the infrared image acquisition device and the RGB image acquisition device are obtained; the human body region set acquisition unit is specifically configured to: