CN117523560A

CN117523560A - Semantic segmentation method, device and storage medium

Info

Publication number: CN117523560A
Application number: CN202210940422.1A
Authority: CN
Inventors: 蒋东生; 史博文; 张晓鹏; 田奇
Original assignee: Huawei Cloud Computing Technologies Co Ltd
Current assignee: Huawei Cloud Computing Technologies Co Ltd
Priority date: 2022-07-29
Filing date: 2022-07-29
Publication date: 2024-02-06

Abstract

The present application relates to a semantic segmentation method, device and storage medium. The method can be used for a first neural network model. The method includes: obtaining first image feature data of image data to be processed; performing feature enhancement on the first image feature data to obtain first enhanced image feature data, where the first enhanced image feature data includes context information within the image; using second image feature data to perform feature enhancement on the first image feature data to obtain second enhanced image feature data, where the second enhanced image feature data includes cross-image context information; and determining, according to the first enhanced image feature data and the second enhanced image feature data, a prediction mask of the image to be processed, where the prediction mask indicates the semantic segmentation result of the image to be processed. According to the embodiments of the present application, feature information with richer levels can be obtained, the accuracy of semantic segmentation results can be improved, and a first neural network model that is more efficient and has better transferability and performance can be obtained.

Description

Semantic segmentation method, semantic segmentation device and storage medium

Technical Field

The present disclosure relates to the field of artificial intelligence technologies, and in particular, to a semantic segmentation method, apparatus, and storage medium.

Background

With the continuous development of artificial intelligence (artificial intelligence, AI) technology, a technology that has been widely used in the field of computer vision is called semantic segmentation. The goal of semantic segmentation is to segment an image into regions with different semantic information and label each region with its corresponding semantic tag.

Current semantic segmentation methods usually use a convolutional neural network (CNN) model, or focus on designing the encoder part of a self-attention (Transformer) model while neglecting the decoder part. As a result, the computational cost is high, the efficiency is insufficient, and the transferability and performance are poor. Therefore, there is a need for a more efficient, more transferable and better-performing method for semantic segmentation of image data.

Disclosure of Invention

In view of this, a semantic segmentation method, apparatus and storage medium are proposed.

In a first aspect, embodiments of the present application provide a semantic segmentation method. The method is for a first neural network model, the method comprising:

acquiring first image feature data of image data to be processed;

performing feature enhancement on the first image feature data to obtain first enhanced image feature data, wherein the first enhanced image feature data comprises context information in an image;

performing feature enhancement on the first image feature data by using second image feature data to obtain second enhanced image feature data, wherein the second enhanced image feature data comprises cross-image context information;

and determining a prediction mask of the image to be processed according to the first enhanced image feature data and the second enhanced image feature data, wherein the prediction mask indicates a semantic segmentation result of the image to be processed.

According to the embodiment of the application, performing feature enhancement on the first image feature data of the image data to be processed allows context information within the image to be better mined, and introducing the second image feature data to perform feature enhancement on the first image feature data allows cross-image context information to be better mined. Feature information with richer levels can thus be obtained, the accuracy of the semantic segmentation result is improved, and a first neural network model with higher efficiency and better transferability and performance can be obtained.

According to the first aspect, in a first possible implementation manner of the semantic segmentation method, the first neural network model is a decoder of a Transformer self-attention model, and performing feature enhancement on the first image feature data by using the second image feature data to obtain second enhanced image feature data includes:

determining third enhanced image feature data according to the second image feature data and the first image feature data;

determining second enhanced image feature data according to the first image feature data and the third enhanced image feature data.

According to the embodiment of the application, the third enhanced image feature data is determined first and the second enhanced image feature data is determined afterwards, so that cross-image context information of the image feature data is mined in two stages to perform feature enhancement. The resulting image feature data is richer in information and levels, and a more accurate semantic segmentation result is thereby obtained.

In a second possible implementation manner of the semantic segmentation method according to the first possible implementation manner of the first aspect, determining third enhanced image feature data according to the second image feature data and the first image feature data includes:

projecting the second image feature data to the first intermediate feature data and the second intermediate feature data, respectively;

projecting the first image feature data to third intermediate feature data;

determining third enhanced image feature data according to the first intermediate feature data, the second intermediate feature data, and the third intermediate feature data.

According to the embodiment of the application, the cross-image context information of the first image feature data is mined by introducing the second image feature data, so that feature enhancement is performed and the third enhanced image feature data is obtained. The obtained feature data is richer in information, and the resulting model performs better.

In a third possible implementation manner of the semantic segmentation method according to the first or second possible implementation manner of the first aspect, determining the second enhanced image feature data according to the first image feature data and the third enhanced image feature data includes:

projecting the first image feature data to fourth intermediate feature data;

projecting the third enhanced image feature data to fifth intermediate feature data and sixth intermediate feature data, respectively;

and determining second enhanced image feature data according to the fourth intermediate feature data, the fifth intermediate feature data and the sixth intermediate feature data.

According to the embodiment of the application, the cross-image context information of the first image feature data is mined through the fourth intermediate feature data, the fifth intermediate feature data and the sixth intermediate feature data to perform feature enhancement, the second enhanced image feature data is determined, the problem of feature confusion of the feature data can be prevented, the information quantity of the obtained feature data is richer, and the semantic segmentation result is more accurate.

In a fourth possible implementation form of the semantic segmentation method according to the second or third possible implementation form of the first aspect, the data in the second image feature data corresponds to a category of semantic segmentation.

According to the embodiment of the application, the category of the second image characteristic data is fixed, so that the characteristics corresponding to different categories of semantic segmentation can be better learned, and the semantic segmentation result is more accurate.

In a fifth possible implementation manner of the first aspect, the second image feature data includes first image feature sub-data and second image feature sub-data, the first image feature sub-data indicating context information across images, the second image feature sub-data indicating a category of semantic segmentation, and projecting the second image feature data to the first intermediate feature data and the second intermediate feature data, respectively, includes:

projecting the first image feature sub-data to first intermediate feature data;

projecting the second image feature sub-data to the second intermediate feature data.

According to the embodiment of the application, the second image feature data comprises the first image feature sub-data and the second image feature sub-data, so that the responsibilities of the second image feature data are decoupled: one part is responsible for the interaction of context information inside and outside the image, and the other part is responsible for learning category information for category prediction. This reduces the difficulty of learning to compress information when obtaining the second image feature data, strengthens the expressive capability of the second image feature data, and makes the semantic segmentation result more accurate.

In a sixth possible implementation manner of the semantic segmentation method according to the first or second or third or fourth or fifth possible implementation manner of the first aspect, feature enhancement is performed on the first image feature data to obtain first enhanced image feature data, including:

projecting the first image feature data to seventh intermediate feature data, eighth intermediate feature data, and ninth intermediate feature data, respectively;

and obtaining first enhanced image feature data according to the seventh intermediate feature data, the eighth intermediate feature data and the ninth intermediate feature data.

According to the embodiment of the application, the internal context information of the first image feature data is mined through the seventh intermediate feature data, the eighth intermediate feature data and the ninth intermediate feature data to perform feature enhancement and determine the first enhanced image feature data, so that feature confusion in the feature data and discontinuous or incorrect results in the subsequent mask prediction can be prevented, and the semantic segmentation result is more accurate.

In a seventh possible implementation manner of the semantic segmentation method according to the second or third or fourth or fifth or sixth possible implementation manner of the first aspect, the stride of the projection is greater than 1.

According to the embodiment of the application, making the stride of the projection greater than 1 can further reduce the computational complexity and the computational cost of the decoder.
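The effect of a projection stride greater than 1 can be illustrated with a small sketch. It assumes that the projection is a non-overlapping strided convolution (kernel size equal to the stride) and that only the key projection is strided; the function name `strided_projection`, the shapes and the random weights are hypothetical and chosen only for illustration, not taken from the application.

```python
import numpy as np

def strided_projection(x, weight, stride):
    """Project an (H, W, C) feature map with a non-overlapping strided linear map.

    Equivalent to a convolution whose kernel size equals its stride:
    each stride x stride block is flattened and multiplied by `weight`.
    Returns (H // stride * W // stride, C_out) tokens.
    """
    h, w, c = x.shape
    assert h % stride == 0 and w % stride == 0, "H and W must be multiples of stride"
    blocks = x.reshape(h // stride, stride, w // stride, stride, c)
    blocks = blocks.transpose(0, 2, 1, 3, 4)  # (H/s, W/s, s, s, C)
    blocks = blocks.reshape((h // stride) * (w // stride), stride * stride * c)
    return blocks @ weight

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32, 16))  # illustrative feature map, not from the source
c = x.shape[-1]
stride = 2

w_q = rng.standard_normal((c, c)) * 0.1
w_k_full = rng.standard_normal((c, c)) * 0.1
w_k_strided = rng.standard_normal((stride * stride * c, c)) * 0.1

q = x.reshape(-1, c) @ w_q                              # 1024 query tokens
k_full = x.reshape(-1, c) @ w_k_full                    # stride 1: 1024 key tokens
k_strided = strided_projection(x, w_k_strided, stride)  # stride 2: 256 key tokens

scores_full = q @ k_full.T        # (1024, 1024) score matrix
scores_strided = q @ k_strided.T  # (1024, 256) score matrix
```

With a stride of 2, the number of key tokens drops by a factor of 4, and so does the size of the score matrix that the attention step has to compute.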

In an eighth possible implementation manner of the semantic segmentation method according to the first aspect, acquiring first image feature data of the image data to be processed includes:

acquiring third image feature data obtained after feature extraction of the image data to be processed through a second neural network model, wherein the second neural network model is an encoder of a Transformer self-attention model;

and processing the size and the channel dimension of the third image feature data to determine the first image feature data.

According to the embodiment of the application, processing the size and the channel dimension of the third image feature data better unifies third image feature data of different sizes and dimensions obtained by the encoder, so that the decoder of the Transformer model obtained by the application is better suited to encoders with different structures, and the adaptability and applicability of the model are improved.

In a second aspect, embodiments of the present application provide a semantic segmentation apparatus. The apparatus is for a first neural network model, the apparatus comprising:

the acquisition module is used for acquiring first image feature data of the image data to be processed;

the first feature enhancement module is used for carrying out feature enhancement on the first image feature data to obtain first enhanced image feature data, wherein the first enhanced image feature data comprises context information in an image;

the second feature enhancement module is used for carrying out feature enhancement on the first image feature data by utilizing the second image feature data to obtain second enhanced image feature data, wherein the second enhanced image feature data comprises cross-image context information;

and the determining module is used for determining a prediction mask of the image to be processed according to the first enhanced image feature data and the second enhanced image feature data, wherein the prediction mask indicates a semantic segmentation result of the image to be processed.

According to the second aspect, in a first possible implementation manner of the semantic segmentation apparatus, the first neural network model is a decoder of a Transformer self-attention model, and the second feature enhancement module is configured to:

According to a first possible implementation manner of the second aspect, in a second possible implementation manner of the semantic segmentation device, determining third enhanced image feature data according to the second image feature data and the first image feature data includes:

projecting the first image feature data to third intermediate feature data;

In a third possible implementation manner of the semantic segmentation device according to the first or second possible implementation manner of the second aspect, determining the second enhanced image data according to the first image feature data and the third enhanced image feature data includes:

projecting the first image feature data to fourth intermediate feature data;

In a fourth possible implementation manner of the semantic segmentation device according to the second or third possible implementation manner of the second aspect, the data in the second image feature data corresponds to a category of semantic segmentation.

In a fifth possible implementation manner of the second aspect, the second image feature data includes first image feature sub-data and second image feature sub-data, the first image feature sub-data indicating context information across images, the second image feature sub-data indicating a category of semantic segmentation, projecting the second image feature data to the first intermediate feature data and the second intermediate feature data, respectively, including:

projecting the first image feature sub-data to first intermediate feature data;

In a sixth possible implementation manner of the semantic segmentation apparatus according to the first or second or third or fourth or fifth possible implementation manner of the second aspect, the first feature enhancement module is configured to:

In a seventh possible implementation manner of the semantic segmentation apparatus according to the second or third or fourth or fifth or sixth possible implementation manner of the second aspect, the stride of the projection is greater than 1.

In an eighth possible implementation manner of the semantic segmentation apparatus according to the second aspect, the obtaining module is configured to:

In a third aspect, embodiments of the present application provide a semantic segmentation apparatus, the apparatus comprising: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to implement the semantic segmentation method of the first aspect or one or several of the plurality of possible implementations of the first aspect when executing instructions.

In a fourth aspect, embodiments of the present application provide a non-transitory computer-readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the semantic segmentation method of the first aspect or one or more of the possible implementations of the first aspect.

In a fifth aspect, embodiments of the present application provide a terminal device, which may perform the semantic segmentation method of the first aspect or one or several of the multiple possible implementations of the first aspect.

In a sixth aspect, embodiments of the present application provide a computer program product comprising computer readable code, or a non-transitory computer readable storage medium carrying computer readable code, which when run in an electronic device, a processor in the electronic device performs the semantic segmentation method of the first aspect or one or more of the possible implementations of the first aspect.

These and other aspects of the application will be apparent from and elucidated with reference to the embodiment(s) described hereinafter.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features and aspects of the present application and together with the description, serve to explain the principles of the present application.

Fig. 1 shows a schematic diagram of an application scenario according to an embodiment of the present application.

FIG. 2 illustrates a flow chart of a semantic segmentation method according to an embodiment of the present application.

FIG. 3 illustrates a flow chart of a semantic segmentation method according to an embodiment of the present application.

Fig. 4 shows a schematic diagram of determining first enhanced image feature data according to an embodiment of the present application.

Fig. 5 (a) shows an effect schematic of image internal feature enhancement according to an embodiment of the present application.

Fig. 5 (b) shows an effect schematic of image internal feature enhancement according to an embodiment of the present application.

FIG. 6 illustrates a flow chart of a semantic segmentation method according to an embodiment of the present application.

FIG. 7 shows a flow chart of a semantic segmentation method according to an embodiment of the present application.

Fig. 8 shows a schematic diagram of determining second enhanced image feature data according to an embodiment of the present application.

FIG. 9 shows a flow chart of a semantic segmentation method according to an embodiment of the present application.

Fig. 10 shows a structural diagram of a semantic segmentation device according to an embodiment of the present application.

Fig. 11 shows a block diagram of an electronic device 1300 according to an embodiment of the present application.

Detailed Description

Various exemplary embodiments, features and aspects of the present application will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers indicate identical or functionally similar elements. Although various aspects of the embodiments are illustrated in the accompanying drawings, the drawings are not necessarily drawn to scale unless specifically indicated.

The word "exemplary" is used herein to mean "serving as an example, embodiment, or illustration." Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.

In addition, numerous specific details are set forth in the following detailed description in order to provide a better understanding of the present application. It will be understood by those skilled in the art that the present application may be practiced without some of these specific details. In some instances, methods, means, elements, and circuits have not been described in detail so as not to unnecessarily obscure the present application.

With the continuous development of AI technology, a technology that is widely used in the field of computer vision is called semantic segmentation. The goal of semantic segmentation is to segment an image into regions with different semantic information and label each region with its corresponding semantic tag. In current semantic segmentation approaches, a CNN model is usually used, or the design focuses on the encoder part of a Transformer model while the decoder part is neglected, so that the computational cost is high, the efficiency is insufficient, and the transferability and performance are poor. Therefore, there is a need for a more efficient, more transferable and better-performing method for semantic segmentation of image data.

To solve the above technical problems, the application provides a semantic segmentation method that can be used for a first neural network model. Performing feature enhancement on the first image feature data of the image data to be processed allows context information within the image to be better mined, and introducing second image feature data to perform feature enhancement on the first image feature data allows cross-image context information to be better mined. Feature information with richer levels can thus be obtained, the accuracy of semantic segmentation results is improved, and a first neural network model with higher efficiency and better transferability and performance can be obtained.

Fig. 1 shows a schematic diagram of an application scenario according to an embodiment of the present application. The embodiment of the application can be used in a scenario of performing semantic segmentation on image data. For example, the semantic segmentation method of the embodiment of the application can be used for the decoder (decoder) in the Transformer model shown in fig. 1, and semantic segmentation can be performed on the image data through the Transformer model. The Transformer model may be deployed on a terminal device or a server. For example, in an autonomous driving scenario, images around the vehicle captured by a camera are acquired, and the Transformer model can be used to determine the road area in which the vehicle travels and to further identify pedestrians, other vehicles, and the like in the image, so that the vehicle can be controlled to avoid such obstacles according to the semantic segmentation result of the image.

The Transformer model may include an encoder (encoder) and a decoder, as shown in fig. 1. The encoder can be used for extracting features of the image data to be processed to obtain image feature data; the decoder may be configured to process the feature data to determine the semantic segmentation result. The decoder in the Transformer model of the embodiment of the application can comprise an internal feature enhancement module, an external feature enhancement module, a feature fusion module and a prediction module. The internal feature enhancement module can be used for mining context information within the image from the first image feature data obtained through the encoder to obtain enhanced image feature data; the external feature enhancement module can be used for mining cross-image context information from the first image feature data obtained through the encoder by using the second image feature data to obtain enhanced image feature data; the feature fusion module can be used for aggregating the feature-enhanced image feature data obtained by the two modules and inputting the result into the prediction module to determine the final semantic segmentation result. For example, in an autonomous driving scenario, the semantic segmentation result may be used to indicate the regions in the image that belong to pedestrians, roads, other vehicles, etc.

The terminal device related to the application may refer to a device with a wireless connection function, that is, a function of connecting with other terminal devices or servers through wireless connection modes such as Wi-Fi and Bluetooth; the terminal device of the application may also communicate through a wired connection. The terminal device may have a touch screen, a non-touch screen, or no screen: a touch-screen device can be controlled by tapping, sliding and the like on the display with a finger or stylus; a non-touch-screen device can be connected to input devices such as a mouse, a keyboard or a touch panel and controlled through them; and a screenless device may be, for example, a Bluetooth speaker without a screen. For example, the terminal device of the present application may be a smart phone, a netbook, a tablet computer, a notebook computer, a wearable electronic device (e.g., a smart bracelet, a smart watch, etc.), a TV, a virtual reality device, a speaker, an electronic ink device, etc. For example, the Transformer model of the present application may be deployed on a terminal device, and a user may input image data to the terminal device to determine a semantic segmentation result of the image data using the Transformer model deployed on the terminal device.

The server related to the application may be located in the cloud or locally, and may be a physical device or a virtual device such as a virtual machine or a container. The server has a wireless communication function, which may be provided in a chip (system) or other parts or components of the server. The server may be a device with a wireless connection function, that is, a function of connecting with other servers or terminal devices through wireless connection modes such as Wi-Fi and Bluetooth; the server of the present application may also communicate through a wired connection. For example, the server of the present application may be located in the cloud, communicate with a terminal device, receive image data sent by the terminal device, determine a semantic segmentation result of the image data by using a Transformer model deployed on the server, and return the semantic segmentation result to the terminal device.

It should be noted that the semantic segmentation method in the embodiment of the present application is not limited to the above autonomous driving scenario and may also be applied to other scenarios such as fingerprint identification and traffic monitoring.

The following describes the semantic segmentation method according to the embodiment of the present application in detail through fig. 2 to 9.

FIG. 2 illustrates a flow chart of a semantic segmentation method according to an embodiment of the present application. The method may be used for a first neural network model, such as the decoder of the above-described Transformer model. As shown in fig. 2, the method includes:

step S201, first image feature data of image data to be processed is acquired.

The image data to be processed may be data input by a user or other modules. For example, in an autopilot scenario, the data to be processed may be image data of the surroundings of the vehicle acquired by the sensor.

Optionally, the step S201 may include:

acquiring third image feature data obtained after feature extraction of the image data to be processed through the second neural network model.

The second neural network model may be an encoder of a Transformer model. The encoder of the Transformer model may be, for example, ViT (Vision Transformer), Swin Transformer, DeiT (data-efficient image transformers), etc., and the application is not limited in this regard.

The image data to be processed may be divided into at least one image block (patch), and these image blocks may be used as input of the encoder of the Transformer model, which performs feature extraction to obtain the third image feature data. For example, in the case where the size of the image data to be processed is H×W×3, the size of the obtained third image feature data may be H'×W'×C, where H and H' may represent height values of the data, W and W' may represent width values of the data, and C may represent the channel value of the data.
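As a rough illustration of the shapes involved, the sketch below splits an H×W×3 image into non-overlapping blocks and maps each block to C channels with a single linear layer. The patch size of 16, the random weights and the linear embedding standing in for the encoder are assumptions for illustration only; an actual encoder such as ViT or Swin Transformer is considerably more involved.

```python
import numpy as np

def split_into_patches(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an H x W x 3 image into non-overlapping patch x patch blocks.

    Returns an array of shape (H // patch, W // patch, patch * patch * 3),
    i.e. one flattened vector per image block.
    """
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "H and W must be multiples of patch"
    blocks = image.reshape(h // patch, patch, w // patch, patch, c)
    blocks = blocks.transpose(0, 2, 1, 3, 4)  # (H', W', patch, patch, 3)
    return blocks.reshape(h // patch, w // patch, patch * patch * c)

rng = np.random.default_rng(0)
image = rng.random((64, 96, 3))                # H = 64, W = 96 (illustrative)
patches = split_into_patches(image, patch=16)  # H' = 4, W' = 6

# Hypothetical linear patch embedding to C channels, standing in for the encoder.
C = 32
w_embed = rng.standard_normal((patches.shape[-1], C)) * 0.02
third_features = patches @ w_embed             # shape (H', W', C)
```

Here a 64×96×3 input yields feature data of size 4×6×32, matching the H'×W'×C layout described above.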

After the third image feature data is obtained, the size and the channel dimension of the third image feature data may also be processed to determine the first image feature data.

The channel dimensions and sizes of the third image feature data may be unified using a multi-layer perceptron (MLP) layer and an optional upsampling operation, which may be implemented by linear mapping. For example, the channel dimensions of the at least one third image feature data may be unified to the same value, and the sizes (H and W) of the at least one third image feature data may also be unified to the same value. The channel dimension may then be further processed: for example, after several (e.g., 4) third image feature data are concatenated, a linear mapping is performed to reduce the channel dimension (e.g., concatenating 4 third image feature data gives a channel dimension of 4C, which the mapping reduces to C). Thus, the first image feature data can be determined.
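The unification step can be sketched as follows. This is a minimal illustration, assuming four encoder outputs of different sizes, a single linear layer per output in place of the MLP, nearest-neighbour upsampling to the size of the first output, and random weights; the function names and all shapes are hypothetical.

```python
import numpy as np

def upsample_nearest(x, out_h, out_w):
    """Nearest-neighbour resize of an (h, w, c) feature map to (out_h, out_w, c)."""
    h, w, _ = x.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return x[rows][:, cols]

def unify_features(features, channel_weights, fuse_weight):
    """Unify channel dimension and size, concatenate, then reduce 4C -> C."""
    out_h, out_w = features[0].shape[:2]  # assumption: use the first map's size
    unified = [
        upsample_nearest(f @ w, out_h, out_w)  # per-map linear map to C channels
        for f, w in zip(features, channel_weights)
    ]
    stacked = np.concatenate(unified, axis=-1)  # channel dimension 4C
    return stacked @ fuse_weight                # channel dimension C

rng = np.random.default_rng(0)
C = 24
shapes = [(32, 32, 16), (16, 16, 32), (8, 8, 64), (4, 4, 128)]  # illustrative
third_features = [rng.standard_normal(s) for s in shapes]
channel_weights = [rng.standard_normal((s[-1], C)) * 0.1 for s in shapes]
fuse_weight = rng.standard_normal((4 * C, C)) * 0.1

first_features = unify_features(third_features, channel_weights, fuse_weight)
```

The result has a single size and channel dimension regardless of how many channels or which resolution each encoder output had, which is what lets one decoder work with differently structured encoders.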

Next, context information within the image and context information across images may be mined from the obtained first image feature data to perform feature enhancement; see the processes in step S202 and step S203 described below, respectively.

In order to alleviate the feature confusion and the discontinuous or incorrect mask prediction caused by unifying the channel dimensions and sizes of the feature data obtained by the encoder, context information within the image may be mined to perform feature enhancement, as described below.

Step S202, performing feature enhancement on the first image feature data to obtain first enhanced image feature data.

This process may be implemented, for example, by the internal feature enhancement module shown in fig. 1, see equation (1):

$$Y_{inter} = M_{inter}(X_{inter}) \tag{1}$$

where $M_{inter}$ may represent the internal feature enhancement process, $X_{inter}$ may represent the first image feature data, and $Y_{inter}$ may represent the first enhanced image feature data obtained through internal feature enhancement, which may include context information within the image.

FIG. 3 illustrates a flow chart of a semantic segmentation method according to an embodiment of the present application. As shown in fig. 3, optionally, the step S202 may include:

In step S301, the first image feature data is projected to the seventh intermediate feature data, the eighth intermediate feature data, and the ninth intermediate feature data, respectively.

The projection may comprise three different linear maps, each used to determine the respective intermediate feature data; the linear maps may be realized, for example, by a convolution operation. For at least one first image feature data corresponding to the image data to be processed, the seventh intermediate feature data may be used to indicate a feature of the current first image feature data among the one or more first image feature data, and may be referred to as $Q_{inter}$; the eighth intermediate feature data may be used to indicate a feature of the relationship between the current first image feature data and the other first image feature data, and may be referred to as $K_{inter}$; the ninth intermediate feature data may be used to indicate a weighting parameter corresponding to the current first image feature data, and may be referred to as $V_{inter}$.
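The three linear maps of step S301 can be sketched as three matrix multiplications applied at every spatial position, which is equivalent to a 1×1 convolution. The function name `project_qkv`, the shapes and the random weights below are assumptions for illustration only.

```python
import numpy as np

def project_qkv(x, w_q, w_k, w_v):
    """Project an (H, W, C) feature map to Q, K and V token matrices.

    Each spatial position becomes one token; the three weights are
    independent linear maps (a 1x1 convolution per projection).
    Returns three arrays of shape (H * W, C).
    """
    h, w, c = x.shape
    tokens = x.reshape(h * w, c)
    return tokens @ w_q, tokens @ w_k, tokens @ w_v

rng = np.random.default_rng(0)
x_inter = rng.standard_normal((8, 8, 16))  # illustrative first image feature data
c = x_inter.shape[-1]
w_q, w_k, w_v = (rng.standard_normal((c, c)) * 0.1 for _ in range(3))

q_inter, k_inter, v_inter = project_qkv(x_inter, w_q, w_k, w_v)
```

All three outputs come from the same input, which is what makes the subsequent attention step a self-attention over the image's own features.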

Step S302, obtaining first enhanced image feature data according to the seventh intermediate feature data, the eighth intermediate feature data and the ninth intermediate feature data.

The above-described process of obtaining the first enhanced image feature data may be implemented using the self-attention mechanism of a Transformer. Fig. 4 shows a schematic diagram of determining the first enhanced image feature data according to an embodiment of the present application. As shown in fig. 4, the first enhanced image feature data (e.g., y_inter in the figure) may be obtained by a single-head attention module from the seventh intermediate feature data (e.g., q_inter in the figure), the eighth intermediate feature data (e.g., k_inter in the figure), and the ninth intermediate feature data (e.g., v_inter in the figure), all of which are obtained from the first image feature data (e.g., x_inter in the figure). One way of determining the first enhanced image feature data by the single-head attention module is described below.

Wherein the score (score) of the current first image feature data may be determined according to the seventh intermediate feature data and the eighth intermediate feature data, and the first enhanced image feature data may be determined by weighting according to the ninth intermediate feature data corresponding to the current first image feature data. This process can be seen in equation (2):

$$Y_{inter} = \mathrm{SoftMax}\left(\frac{Q_{inter} K_{inter}^{T}}{\sqrt{C}}\right) V_{inter} \qquad \text{Formula (2)}$$

wherein Y_inter may represent the first enhanced image feature data, SoftMax may represent a normalization operation, Q_inter may represent the seventh intermediate feature data, K_inter may represent the eighth intermediate feature data, V_inter may represent the ninth intermediate feature data, and C may represent the channel dimension corresponding to the first image feature data.
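As an illustration only, steps S301 and S302 can be sketched in plain Python. The sketch assumes the standard scaled dot-product form (scores divided by the square root of C) and uses matrix multiplication in place of the convolution-based projections; the helper names (`matmul`, `softmax_rows`, `internal_enhancement`) and the identity projection weights are hypothetical and do not come from the embodiment.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax_rows(m):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def single_head_attention(q, k, v):
    """SoftMax(Q K^T / sqrt(C)) V for q of shape (N, C) and k, v of shape (M, C)."""
    c = len(q[0])
    scores = [[s / math.sqrt(c) for s in row] for row in matmul(q, transpose(k))]
    return matmul(softmax_rows(scores), v)

def internal_enhancement(x, w_q, w_k, w_v):
    """Project X_inter to Q, K, V with three linear maps (S301), then attend (S302)."""
    return single_head_attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))

# Toy input: N = 3 tokens, C = 2 channels; identity projections are an assumption.
identity = [[1.0, 0.0], [0.0, 1.0]]
x_inter = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y_inter = internal_enhancement(x_inter, identity, identity, identity)
```

Each output row is a SoftMax-weighted average of the rows of V_inter, so every token receives information from all other tokens of the same image.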

Referring to fig. 5 (a) and 5 (b), there are shown schematic effects of image internal feature enhancement according to an embodiment of the present application. Wherein, the image in fig. 5 (a) may represent the semantic segmentation result obtained without the internal feature enhancement as in step S202; the image in fig. 5 (b) may represent the semantic segmentation result obtained in the case where the internal feature enhancement as in step S202 is performed. As can be seen from the mask coverage effect (e.g. the shadow coverage part in the figure) on the house in the figure, the internal feature enhancement is performed on the feature data in step S202, so that a more continuous and more accurate semantic segmentation result can be obtained.

Table 1 shows an effect schematic of the internal feature enhancement of the feature data according to an embodiment of the present application. See table 1:

TABLE 1

Heads	Depths	MLP	mIoU (%)
0	0	N	40.97
1	1	N	42.65 (1.68↑)
1	2	N	42.35 (1.56↑)
2	1	N	42.62 (1.65↑)
1	1	Y	42.81 (1.84↑)

Wherein, Heads may represent the self-attention mechanism used for internal feature enhancement: a value of 0 may represent that no self-attention mechanism is used, a value of 1 may represent that single-head attention is used, and a value greater than 1 (e.g., 2 in the table) may represent that multi-head attention is used. Depths may represent the number of layers in each attention. MLP may represent whether a multi-layer perceptron model is used, where Y may represent that the MLP model is used and N may represent that it is not. mIoU may represent a semantic segmentation evaluation index, namely the mean intersection over union (mIoU), which may be obtained by calculating the ratio of the intersection to the union of two sets, the actual result of semantic segmentation and the model prediction result; the larger the mIoU value, the higher the accuracy of the corresponding model.
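The mIoU computation described above can be sketched as follows. This is a minimal per-sample illustration; benchmark tools usually accumulate intersections and unions over a whole dataset before averaging, and the function name `mean_iou` and the toy labels are hypothetical.

```python
def mean_iou(pred, target, num_classes):
    """Mean of per-class intersection / union over flattened label lists."""
    ious = []
    for cls in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
        union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
        if union > 0:  # skip classes absent from both prediction and ground truth
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0

# Six pixels, three classes (toy data).
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, 3)  # per-class IoU: 1/3, 2/3, 1/2
```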

The first row of table 1 may represent the result obtained without internal feature enhancement of the feature data. As can be seen from table 1, performing internal feature enhancement on the feature data with the method of the embodiments of the present application yields a higher mIoU, so that the semantic segmentation result of the model is more accurate and the performance is improved.

In order to further improve the performance of the model, cross-image context information can be mined and used to enhance the first image feature data, so that the resulting image feature data is richer in information and in feature levels, as described below.

Step S203, the first image feature data is subjected to feature enhancement by using the second image feature data, so as to obtain second enhanced image feature data.

This process may be implemented, for example, by the external feature enhancement module shown in fig. 1, see equation (3):

$$Y_{exter} = M_{exter}(X_{inter}, X_{exter}) \qquad \text{Formula (3)}$$

Wherein M_exter may represent the process of external feature enhancement, X_inter may represent the first image feature data, X_exter may represent the second image feature data, which may be preset (e.g., X_exter may be randomly initialized in the training phase), and Y_exter may represent the second enhanced image feature data obtained through external feature enhancement, which may include cross-image context information.

FIG. 6 illustrates a flow chart of a semantic segmentation method according to an embodiment of the present application. As shown in fig. 6, optionally, the step S203 includes:

step S601, determining third enhanced image feature data according to the second image feature data and the first image feature data.

The external feature enhancement may be performed in two levels. In the first level, intermediate third enhanced image feature data may be obtained; the detailed process of this level is described below.

FIG. 7 shows a flow chart of a semantic segmentation method according to an embodiment of the present application. As shown in fig. 7, optionally, the step S601 includes:

step S701, projecting the second image feature data to the first intermediate feature data and the second intermediate feature data, respectively.

Wherein the projection may comprise two different linear maps for determining the first intermediate feature data and the second intermediate feature data, respectively, which linear maps may be realized, for example, by means of a convolution operation or the like.

For at least one newly introduced second image feature data corresponding to the first image feature data, the first intermediate feature data may indicate a relational feature between the current second image feature data and the other second image feature data, and may be denoted K_mid; the second intermediate feature data may indicate a weighting parameter corresponding to the current second image feature data, and may be denoted V_mid.

In step S702, the first image feature data is projected to the third intermediate feature data.

The projection may be performed by means of a linear mapping, which may be implemented, for example, by a convolution operation. For at least one first image feature data corresponding to the image data to be processed, the third intermediate feature data may indicate a feature of the current first image feature data among the one or more first image feature data, and may be denoted Q_mid.

Step S703, determining third enhanced image feature data according to the first intermediate feature data, the second intermediate feature data and the third intermediate feature data.

The above-described two-level process of obtaining the second enhanced image feature data may be implemented using the attention mechanism of a Transformer. Fig. 8 shows a schematic diagram of determining the second enhanced image feature data according to an embodiment of the present application.

The two levels for determining the second enhanced image feature data may each be implemented by a single-head attention module: the third enhanced image feature data (e.g., x_mid in the figure) may first be determined by one single-head attention module, and the second enhanced image feature data (e.g., y_outer in the figure) may then be determined by the other single-head attention module.

As shown in fig. 8, in the first hierarchy, the third enhanced image feature data (e.g., x_mid in the figure) may be determined by the single-head attention module based on the first intermediate feature data (e.g., k_mid in the figure) and the second intermediate feature data (e.g., v_mid in the figure) obtained from the second image feature data (e.g., x_outer in the figure), and the third intermediate feature data (e.g., q_mid in the figure) obtained from the first image feature data (e.g., x_inter in the figure). One way of determining the third enhanced image feature data by the single head attention module may be found below.

Wherein, the score of the current second image feature data may be determined according to the third intermediate feature data and the first intermediate feature data, and the third enhanced image feature data may be determined by weighting according to the second intermediate feature data corresponding to the current second image feature data.

The process of determining the score can be found in equation (4):

$$Attn_{mid} = \frac{Q_{mid} K_{mid}^{T}}{\sqrt{C}} \qquad \text{Formula (4)}$$

wherein Attn_mid may represent the corresponding score, Q_mid may represent the third intermediate feature data, K_mid may represent the first intermediate feature data, and C may represent the channel dimension corresponding to the second image feature data.

According to Attn _mid The process of determining the third enhanced image feature data may be found in equation (5):

$$X_{mid} = \mathrm{SoftMax}(Attn_{mid})\, V_{mid} \qquad \text{Formula (5)}$$

Wherein X_mid may represent the third enhanced image feature data, SoftMax may represent a normalization operation, and V_mid may represent the second intermediate feature data.

Referring back to fig. 6, a second hierarchy of external feature enhancements is described.

Step S602, determining second enhanced image feature data according to the first image feature data and the third enhanced image feature data.

After obtaining the third enhanced image feature data, interactive learning can be further performed on the intra-image and inter-image context information, so that the inter-image context information can be better mined for feature enhancement, and the detailed process of the hierarchy can be seen in the following.

FIG. 9 shows a flow chart of a semantic segmentation method according to an embodiment of the present application. Optionally, the step S602 includes:

step S901 projects the first image feature data to fourth intermediate feature data.

The projection may be performed by means of a linear mapping, which may be implemented, for example, by a convolution operation. For at least one first image feature data corresponding to the image data to be processed, the fourth intermediate feature data may indicate a feature of the current first image feature data among the one or more first image feature data, and may be denoted Q_final.

Step S902, projecting the third enhanced image feature data to the fifth intermediate feature data and the sixth intermediate feature data, respectively.

Wherein the projection may comprise two different linear maps for determining the fifth intermediate feature data and the sixth intermediate feature data, respectively, which linear maps may be realized, for example, by means of a convolution operation or the like.

For at least one third enhanced image feature data obtained for each second image feature data, the fifth intermediate feature data may indicate a relational feature between the current third enhanced image feature data and the other third enhanced image feature data, and may be denoted K_final; the sixth intermediate feature data may indicate a weighting parameter corresponding to the current third enhanced image feature data, and may be denoted V_final.

Step S903, determining second enhanced image feature data according to the fourth intermediate feature data, the fifth intermediate feature data, and the sixth intermediate feature data.

Referring to fig. 8, in the second hierarchy, second enhanced image feature data (e.g., y_outer in the figure) may be determined by the single-head attention module from fourth intermediate feature data (e.g., q_final in the figure) obtained in step S901, and fifth intermediate feature data (e.g., k_final in the figure) and sixth intermediate feature data (e.g., v_final in the figure) obtained in step S902. One way of determining the second enhanced image feature data by the single head attention module may be found below.

The score of the current third enhanced image feature data may be determined according to the fourth intermediate feature data and the fifth intermediate feature data, and the second enhanced image feature data may be determined by weighting according to the sixth intermediate feature data corresponding to the current third enhanced image feature data. This procedure can be seen in equation (6):

$$Y_{exter} = \mathrm{SoftMax}\left(\frac{Q_{final} K_{final}^{T}}{\sqrt{C}}\right) V_{final} \qquad \text{Formula (6)}$$

wherein Y_exter may represent the second enhanced image feature data, SoftMax may represent a normalization operation, Q_final may represent the fourth intermediate feature data, K_final may represent the fifth intermediate feature data, V_final may represent the sixth intermediate feature data, and C may represent the channel dimension corresponding to the third enhanced image feature data.
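The two levels of external feature enhancement (steps S601 and S602) can be illustrated with the following sketch. It assumes scaled dot-product attention with a square-root-of-C scaling, replaces the convolution-based projections with plain matrix multiplication, and uses a hypothetical dictionary `w` of projection matrices; none of these names come from the embodiment.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def attention(q, k, v):
    """SoftMax(Q K^T / sqrt(C)) V; q is (N, C), k and v are (M, C)."""
    c = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    weights = []
    for row in matmul(q, k_t):
        scaled = [s / math.sqrt(c) for s in row]
        peak = max(scaled)
        exps = [math.exp(s - peak) for s in scaled]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return matmul(weights, v)

def external_enhancement(x_inter, x_exter, w):
    """Two stacked single-head attentions; w maps names to (C, C) matrices."""
    # Level 1: queries from X_inter, keys and values from X_exter.
    x_mid = attention(matmul(x_inter, w["q_mid"]),
                      matmul(x_exter, w["k_mid"]),
                      matmul(x_exter, w["v_mid"]))
    # Level 2: queries from X_inter, keys and values from X_mid.
    y_exter = attention(matmul(x_inter, w["q_final"]),
                        matmul(x_mid, w["k_final"]),
                        matmul(x_mid, w["v_final"]))
    return x_mid, y_exter

# Toy sizes: N = 3 image tokens, M = 2 external tokens, C = 2 channels.
eye = [[1.0, 0.0], [0.0, 1.0]]
w = {name: eye for name in
     ("q_mid", "k_mid", "v_mid", "q_final", "k_final", "v_final")}
x_inter = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
x_exter = [[1.0, 0.0], [0.0, 1.0]]
x_mid, y_exter = external_enhancement(x_inter, x_exter, w)
```

In the first level the attention matrix has shape (N, M), so X_mid keeps one row per image token; the second level then attends over those N rows.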

Optionally, in order to reduce the computational cost of the decoder, the stride of the projection may be greater than 1 when determining the enhanced image feature data within and across images as described above. For example, in the above projection processes, the stride of the linear mapping may be set to R, where R is a positive integer.

Thus, in one possible implementation, the amount of K and V data described above may be reduced, which reduces the cost of the matrix multiplications in the decoder: the complexity of the self-attention mechanism may be reduced from O(N^2) to O(N^2/R), where N may represent the amount of K and V data before the reduction.
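The effect of the stride on the cost of the score matrix can be checked with a small count. The sketch below is an assumption-laden stand-in: it keeps every R-th token, whereas a strided convolution would aggregate neighbouring tokens; both produce about N/R keys and values. The function names and sizes are hypothetical.

```python
def strided_tokens(tokens, stride):
    """Keep every stride-th token, mimicking a stride-R projection of K and V."""
    return tokens[::stride]

def score_matrix_cost(num_queries, num_keys, channels):
    """Multiply-accumulate count for the Q K^T product."""
    return num_queries * num_keys * channels

n, c, r = 1024, 64, 4
tokens = list(range(n))              # stand-in token indices
reduced = strided_tokens(tokens, r)  # N / R keys and values remain
full_cost = score_matrix_cost(n, n, c)
reduced_cost = score_matrix_cost(n, len(reduced), c)
```

With these toy values the score matrix shrinks from N x N to N x (N/R), so the multiply-accumulate count drops by the factor R.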

Optionally, to further enhance the performance of the decoder, the data in the second image feature data may be made to correspond to a category of semantic segmentation.

That is, an additional constraint may be added to the second image feature data so that the category of each second image feature data is fixed, which allows the features corresponding to the different semantic segmentation categories to be learned better. The constraint that fixes the categories of the second image feature data may be applied during training of the decoder. See equation (7), which first yields an intermediate prediction mask:

$$M^{mid} = \mathrm{Upsample}_{(H \times W)}\big(\mathrm{Reshape}(Attn_{mid})\big) \qquad \text{Formula (7)}$$

Wherein M^mid may represent the intermediate prediction mask, i.e., an intermediate semantic segmentation result for the image data, which may be obtained by applying operations such as resizing (reshape) and upsampling to Attn_mid. Attn_mid may be determined by formula (4).
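A minimal sketch of the reshape-then-upsample step follows. It assumes that Attn_mid has one row per image token on an h x w grid and one column per category, and it uses nearest-neighbour interpolation, which is an assumption because the interpolation mode is not specified here; the function names are hypothetical.

```python
def reshape_scores(attn, h, w):
    """Turn an (h*w, K) score matrix into K maps of shape (h, w)."""
    num_classes = len(attn[0])
    return [[[attn[i * w + j][k] for j in range(w)] for i in range(h)]
            for k in range(num_classes)]

def upsample_nearest(fmap, out_h, out_w):
    """Nearest-neighbour resize of one (h, w) map to (out_h, out_w)."""
    h, w = len(fmap), len(fmap[0])
    return [[fmap[i * h // out_h][j * w // out_w] for j in range(out_w)]
            for i in range(out_h)]

def intermediate_mask(attn, h, w, out_h, out_w):
    """Upsample(Reshape(Attn_mid)): one (out_h, out_w) score map per category."""
    return [upsample_nearest(m, out_h, out_w) for m in reshape_scores(attn, h, w)]

# 2 x 2 token grid, 2 categories, upsampled to 4 x 4 (toy values).
attn_mid = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.4, 0.6]]
m_mid = intermediate_mask(attn_mid, 2, 2, 4, 4)
```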

Then, loss optimization can be performed on the second image feature data according to the intermediate prediction mask result and the actual category indicated by the label of the training data, so that category-specific second image feature data can be obtained. The process of this loss optimization can be found in equation (8):

wherein the two loss terms may each represent a corresponding loss optimization function; i and j may index positions along the height H and the width W, respectively; M^mid may represent the intermediate prediction mask; and the remaining term may represent the actual category indicated by the label of the training data.

Optionally, once the categories of the second image feature data have been fixed as described above, the second image feature data is responsible both for the information interaction between the internal and external context information and for the category prediction performed by learning category information. To make it easier to learn to compress this information when obtaining the second image feature data, and to enhance the expressive capability of the second image feature data, the second image feature data may be decoupled into two parts, so that the second image feature data includes first image feature sub-data and second image feature sub-data.

Wherein the two parts may respectively assume different responsibilities, the first image feature sub-data may indicate contextual information across the image, and the second image feature sub-data may indicate a category of semantic segmentation.

See formula (9):

wherein the left-hand side may represent the decoupled second image feature data, which comprises two parts: one part may represent the first image feature sub-data, and the other part may represent the second image feature sub-data.

After decoupling, the step S701 may include:

projecting the first image feature sub-data to first intermediate feature data; the second image feature sub-data is projected to the second intermediate feature data.

That is, K_mid may instead be determined by projecting the first image feature sub-data, and V_mid may instead be determined by projecting the second image feature sub-data, so that the different responsibilities of the second image feature data are decoupled when determining the first intermediate feature data and the second intermediate feature data.
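The decoupled projection can be sketched as below, with the first image feature sub-data feeding K_mid and the second image feature sub-data feeding V_mid. The variable names (`x_ctx`, `x_cls`, `w_k`, `w_v`) and the toy weights are hypothetical, and matrix multiplication stands in for the convolution-based linear mappings.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def decoupled_projection(x_ctx, x_cls, w_k, w_v):
    """K_mid from the first sub-data, V_mid from the second sub-data."""
    k_mid = matmul(x_ctx, w_k)
    v_mid = matmul(x_cls, w_v)
    return k_mid, v_mid

# Two external tokens with two channels each (toy values).
x_ctx = [[1.0, 2.0], [3.0, 4.0]]  # first image feature sub-data
x_cls = [[0.5, 0.5], [1.0, 0.0]]  # second image feature sub-data
w_k = [[2.0, 0.0], [0.0, 2.0]]    # doubles each channel
w_v = [[0.0, 1.0], [1.0, 0.0]]    # swaps the two channels
k_mid, v_mid = decoupled_projection(x_ctx, x_cls, w_k, w_v)
```

Because the two sub-data are separate parameters, a training signal that shapes one of them does not directly constrain the other.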

Table 2 shows an effect illustration of external feature enhancement on feature data according to an embodiment of the present application. See table 2:

TABLE 2

Wherein, Type of x_exter may represent the manner in which x_exter is determined: Learn is Y when x_exter is obtained in the learnable manner of the embodiments of the present application, and N otherwise; Mom is Y when x_exter is obtained in an existing manner other than the learnable manner of the embodiments of the present application, and N otherwise.

Use of the class-specific manner of the embodiments of the present application may be denoted Y, and N otherwise.

Y_exter being Y may represent that Y_exter is determined using the two levels of the embodiments of the present application, and N may represent that Y_exter is determined by only one level (i.e., using only one single-head attention module).

De. (de-coupling) being Y may represent that x_exter is decoupled in the manner of the embodiments of the present application, and N otherwise.

mIoU may represent a semantic segmentation evaluation index, namely the mean intersection over union; the larger the mIoU value, the higher the accuracy of the corresponding model.

The first row of table 2 may represent the effect of not performing external feature enhancement on the feature data, and as can be seen from table 2, by performing external feature enhancement on the feature data by using the method of the embodiment of the present application, and by using each optional mode in the embodiment of the present application, a higher mIoU value may be obtained, so that the result of performing semantic segmentation on the model is more accurate, and greater performance improvement is brought.

Table 3 shows an effect schematic of the internal feature enhancement and the external feature enhancement on the feature data according to an embodiment of the present application.

See table 3:

TABLE 3

Wherein, Mining Type may represent the type of feature enhancement performed: Internal is Y when internal feature enhancement is performed according to the embodiments of the present application, and N otherwise; External is Y when external feature enhancement is performed according to the embodiments of the present application, and N otherwise. mIoU may represent a semantic segmentation evaluation index, namely the mean intersection over union; the larger the mIoU value, the higher the accuracy of the corresponding model.

The first row of table 3 may represent the case in which neither external nor internal feature enhancement is performed on the feature data. As can be seen from table 3, external feature enhancement and internal feature enhancement are complementary: performing both improves the semantic segmentation accuracy of the model and significantly improves its performance.

Referring back to FIG. 2, after Y_inter and Y_exter are obtained at the decoder by the process described above, semantic segmentation inference can be performed to obtain a semantic segmentation result, as described below.

Step S204, determining a prediction mask of the image to be processed according to the first enhanced image feature data and the second enhanced image feature data.

This process may be implemented, for example, by the feature fusion module and prediction module shown in fig. 1.

The first enhanced image feature data and the second enhanced image feature data may first be aggregated, for example by concatenating Y_inter and Y_exter to obtain dual-scale enhanced image feature data, and the result may then be reshaped, that is, resized by means of a linear mapping, to obtain processed enhanced image feature data F_aug. The prediction module may determine a prediction mask of the image data to be processed from the processed enhanced image feature data F_aug; one implementation of this process can be seen in equation (10):

where M may represent a resulting prediction mask, which may indicate a semantic segmentation result of the image data to be processed, e.g. may represent a semantic segmentation class corresponding to each pixel in the image data to be processed.

F_aug may represent the processed enhanced image feature data obtained from Y_inter and Y_exter as described above. The prediction head may be at least one layer of an activation function, which may be, for example, a rectified linear unit (ReLU) function. Upsample may represent an upsampling operation that resizes the data output by the prediction head.
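The aggregation and prediction in step S204 can be illustrated with the sketch below. It is only one possible reading: the prediction head is modelled as a ReLU followed by a hypothetical linear classifier `w_head`, the per-token category is taken with an argmax, and nearest-neighbour upsampling is assumed (with nearest-neighbour interpolation, taking the argmax before or after upsampling gives the same mask).

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def fuse(y_inter, y_exter):
    """Concatenate the two enhanced features along the channel axis."""
    return [a + b for a, b in zip(y_inter, y_exter)]

def predict_mask(f_aug, w_head, h, w, scale):
    """ReLU, linear head, per-token argmax, then nearest-neighbour upsampling."""
    activated = [[max(0.0, v) for v in row] for row in f_aug]
    logits = matmul(activated, w_head)                    # (h*w, num_classes)
    labels = [row.index(max(row)) for row in logits]      # category id per token
    grid = [labels[i * w:(i + 1) * w] for i in range(h)]  # (h, w)
    return [[grid[i // scale][j // scale] for j in range(w * scale)]
            for i in range(h * scale)]

# 2 x 2 token grid, one channel per branch, two categories (toy values).
y_inter = [[1.0], [0.0], [1.0], [0.0]]
y_exter = [[0.0], [1.0], [0.0], [1.0]]
f_aug = fuse(y_inter, y_exter)     # shape (4, 2)
w_head = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical classifier weights
mask = predict_mask(f_aug, w_head, h=2, w=2, scale=2)
```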

Table 4 shows an effect schematic of the first neural network model according to an embodiment of the present application.

See table 4:

TABLE 4

Method may represent the decoder of the neural network model used for semantic segmentation. DeiT-S (data-efficient image transformers, small), Swin-S, and SegFormer-B2 may represent neural network models of different structures, each of which may be combined with the first neural network model of the embodiments of the present application to jointly form the decoder part of a Transformer model.

A marker symbol attached to a model name may indicate that UperNet is used.

The symbol "×" may indicate that an efficient self-attention mechanism is used. Params may represent the number of parameters of the corresponding model.

FPS, i.e., frames per second, may represent the inference speed of a model; a larger value may represent a faster inference speed of the corresponding model.
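For illustration, an FPS figure of this kind is typically obtained by timing repeated inference calls. The sketch below uses a trivial stand-in function instead of a real model; `measure_fps` and its parameters are hypothetical, and real benchmarks usually add warm-up runs and device synchronization.

```python
import time

def measure_fps(infer, frame, num_frames=50):
    """Run infer(frame) num_frames times and return frames per second."""
    start = time.perf_counter()
    for _ in range(num_frames):
        infer(frame)
    elapsed = time.perf_counter() - start
    return num_frames / max(elapsed, 1e-9)  # guard against a zero interval

# Stand-in "model": summing a list of 1000 numbers.
fps = measure_fps(lambda frame: sum(frame), list(range(1000)))
```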

It can be seen from table 4 that combining a model with the first neural network model of the embodiments of the present application can reduce the number of parameters of the whole decoder, increase the inference speed of the model, and improve the accuracy of model inference.

Table 5 shows an effect schematic of semantic segmentation on ADE20K dataset according to an embodiment of the present application.

See table 5:

TABLE 5

Where Method may represent the decoder used for semantic segmentation; FCN, EncNet, PSPNet, CCNet, DeepLabV3+, DeiT-B, DPT, SETR-PUP, Twins, SegFormer-B1, SegFormer-B5, and Swin-L may represent neural network models of different structures used as decoders, each of which may be combined with the first neural network model of the embodiments of the present application to jointly form the decoder part of a Transformer model. Encoder may represent the encoder used for semantic segmentation, and ResNet101, DeiT-B, ViT-B, ViT-L, SVT-L, MiT-B1, MiT-B5, and Swin-L may represent neural network models of different structures used as encoders.

One marker symbol attached to a model name may indicate that UperNet is used, and another marker symbol may indicate that pre-training was performed using ImageNet-22K.

Crop Size may represent the size of the image. Params may represent the number of parameters of the corresponding model. mIoU may represent a semantic segmentation evaluation index, namely the mean intersection over union, and MS mIoU may represent the mIoU obtained with multi-scale (MS) inference; the larger the mIoU or MS mIoU value, the higher the accuracy of the corresponding model.

As can be seen from Table 5, the first neural network model of the embodiment of the application can be adapted to encoders of various structures, has wide application capability, has significant advantages on an ADE20K data set by combining the semantic segmentation method realized by the first neural network model of the embodiment of the application, and can reduce the overall parameter quantity of the model and improve the accuracy of model reasoning.

Table 6 shows an effect schematic of semantic segmentation on COCO-Stuff datasets according to an embodiment of the present application.

See table 6:

TABLE 6

Where Method may represent the decoder used for semantic segmentation; DeiT-B, DPT, SETR-PUP, SegFormer-B5, and Swin-L may represent neural network models of different structures used as decoders, each of which may be combined with the first neural network model of the embodiments of the present application to jointly form the decoder part of a Transformer model. Encoder may represent the encoder used for semantic segmentation, and DeiT-B, MiT-B5, and Swin-L may represent neural network models of different structures used as encoders. One marker symbol attached to a model name may indicate that UperNet is used, and another marker symbol may indicate that pre-training was performed using ImageNet-22K. mIoU may represent a semantic segmentation evaluation index, namely the mean intersection over union; the larger the mIoU value, the higher the accuracy of the corresponding model.

As can be seen from table 6, the first neural network model of the embodiment of the present application can adapt to encoders with various different structures, and has a wide application capability, and the semantic segmentation method implemented by combining the first neural network model of the embodiment of the present application has significant advantages on the COCO-Stuff dataset, so that the overall parameter amount of the model can be reduced, and the accuracy of model reasoning can be improved.

Table 7 shows an effect schematic of semantic segmentation on a Cityscapes dataset according to an embodiment of the present application.

See table 7:

TABLE 7

Where Method may represent the decoder used for semantic segmentation; DeiT-B, DPT, SETR-PUP, SegFormer-B5, and Swin-L may represent neural network models of different structures used as decoders, each of which may be combined with the first neural network model of the embodiments of the present application to jointly form the decoder part of a Transformer model. Encoder may represent the encoder used for semantic segmentation, and DeiT-B, MiT-B5, and Swin-L may represent neural network models of different structures used as encoders. One marker symbol attached to a model name may indicate that UperNet is used, another marker symbol may indicate that pre-training was performed using ImageNet-22K, and the symbol "×" may indicate that an efficient self-attention mechanism is used. mIoU may represent a semantic segmentation evaluation index, namely the mean intersection over union; the larger the mIoU value, the higher the accuracy of the corresponding model.

As can be seen from table 7, the first neural network model of the embodiments of the present application can be adapted to encoders of various structures and is therefore widely applicable. The semantic segmentation method implemented with the first neural network model of the embodiments of the present application has significant advantages on the Cityscapes dataset: it can reduce the overall number of model parameters and improve the accuracy of model inference.

Fig. 10 shows a structural diagram of a semantic segmentation device according to an embodiment of the present application. The apparatus may be used for a first neural network model, as shown in fig. 10, the apparatus comprising:

an acquiring module 1001, configured to acquire first image feature data of image data to be processed;

the first feature enhancement module 1002 is configured to perform feature enhancement on first image feature data to obtain first enhanced image feature data, where the first enhanced image feature data includes context information in an image;

A second feature enhancement module 1003, configured to perform feature enhancement on the first image feature data with second image feature data to obtain second enhanced image feature data, where the second enhanced image feature data includes context information of a cross-image;

a determining module 1004, configured to determine a prediction mask of the image to be processed according to the first enhanced image feature data and the second enhanced image feature data, where the prediction mask indicates a semantic segmentation result of the image to be processed.

Optionally, the obtaining module 1001 is configured to:

Optionally, the first feature enhancement module 1002 may be configured to:

Optionally, the first neural network model is a decoder of a Transformer self-attention model, and the second feature enhancement module 1003 is configured to:

Optionally, determining the third enhanced image feature data from the second image feature data and the first image feature data may include:

projecting the first image feature data to third intermediate feature data;

Optionally, determining the second enhanced image data from the first image feature data and the third enhanced image feature data may include:

projecting the first image feature data to fourth intermediate feature data;

Optionally, the stride of the projection is greater than 1.

Optionally, the data in the second image feature data corresponds to a category of semantic segmentation.

Optionally, the second image feature data includes first image feature sub-data and second image feature sub-data, the first image feature sub-data indicating context information across the image, the second image feature sub-data indicating a category of semantic segmentation, projecting the second image feature data to the first intermediate feature data and the second intermediate feature data, respectively, may include:

projecting the first image feature sub-data to first intermediate feature data;

The embodiment of the application provides a semantic segmentation device, which comprises: a processor and a memory for storing processor-executable instructions; wherein the processor is configured to implement the above-described method when executing the instructions.

The embodiment of the application provides a terminal device, which can execute the semantic segmentation method.

Embodiments of the present application provide a non-transitory computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described method.

Embodiments of the present application provide a computer program product comprising computer readable code, or a non-transitory computer readable storage medium carrying computer readable code, where, when the computer readable code runs in a processor of an electronic device, the processor performs the above method.

Fig. 11 shows a block diagram of an electronic device 1300 according to an embodiment of the present application. As shown in Fig. 11, the electronic device 1300 may be a server or a terminal device that performs the functions of the semantic segmentation method shown in any of Figs. 2-9 above. The electronic device 1300 includes at least one processor 1801, at least one memory 1802, and at least one communication interface 1803. The electronic device may further include common components such as an antenna, which are not described in detail herein.

The constituent elements of the electronic device 1300 are described below with reference to Fig. 11.

The processor 1801 may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling execution of the above programs. The processor 1801 may include one or more processing units. For example, the processor 1801 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU). The different processing units may be separate devices or may be integrated in one or more processors.

The communication interface 1803 is used to communicate with other electronic devices or communication networks, such as Ethernet, a radio access network (RAN), a core network, or a wireless local area network (WLAN).

The memory 1802 may be, but is not limited to, a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a random access memory (RAM) or another type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other compact disc storage, optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), a magnetic disk storage medium or another magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory may stand alone and be coupled to the processor via a bus. The memory may also be integrated with the processor.

The memory 1802 is configured to store application program code for executing the above solutions, and execution of the code is controlled by the processor 1801. The processor 1801 is configured to execute the application program code stored in the memory 1802.

In the foregoing embodiments, the description of each embodiment has its own emphasis. For parts of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments.

The computer readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium include: a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch cards or raised structures in a groove having instructions stored thereon, and any suitable combination of the foregoing.

The computer readable program instructions or code described herein may be downloaded from a computer readable storage medium to a respective computing/processing device or to an external computer or external storage device over a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmissions, wireless transmissions, routers, firewalls, switches, gateway computers and/or edge servers. The network interface card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium in the respective computing/processing device.

Computer program instructions for carrying out operations of the present application may be assembly instructions, instruction set architecture (ISA) instructions, machine-dependent instructions, microcode, firmware instructions, state setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The computer readable program instructions may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it may be connected to an external computer (e.g., through the Internet using an Internet service provider). In some embodiments, aspects of the present application are implemented by personalizing electronic circuitry, such as programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA), with state information of the computer readable program instructions.

Various aspects of the present application are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.

These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.

The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.

It will also be noted that each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by hardware (e.g., circuits or application-specific integrated circuits (ASICs)) that performs the corresponding functions or acts, or by combinations of hardware and software, such as firmware.

Although the invention is described herein in connection with various embodiments, other variations to the disclosed embodiments can be understood and effected by those skilled in the art in practicing the claimed invention, from a study of the drawings, the disclosure, and the appended claims. In the claims, the word "comprising" does not exclude other elements or steps, and the indefinite article "a" or "an" does not exclude a plurality. A single processor or other unit may fulfill the functions of several items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.

The embodiments of the present application have been described above. The foregoing description is exemplary, not exhaustive, and is not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope of the various embodiments described. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application, or their improvement over technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims

1. A semantic segmentation method for a first neural network model, the method comprising:

acquiring first image feature data of image data to be processed;

performing feature enhancement on the first image feature data by using second image feature data to obtain second enhanced image feature data, wherein the second enhanced image feature data comprises cross-image context information;

2. The method of claim 1, wherein the first neural network model is a decoder of a Transformer self-attention model, and the performing feature enhancement on the first image feature data by using second image feature data to obtain second enhanced image feature data comprises:

determining third enhanced image feature data according to the second image feature data and the first image feature data;

and determining the second enhanced image feature data according to the first image feature data and the third enhanced image feature data.

3. The method of claim 2, wherein said determining third enhanced image feature data from said second image feature data and said first image feature data comprises:

projecting the second image feature data to first intermediate feature data and second intermediate feature data, respectively;

projecting the first image feature data to third intermediate feature data;

and determining the third enhanced image feature data according to the first intermediate feature data, the second intermediate feature data and the third intermediate feature data.

4. The method according to claim 2 or 3, wherein said determining said second enhanced image feature data from said first image feature data and said third enhanced image feature data comprises:

projecting the first image feature data to fourth intermediate feature data;

and determining the second enhanced image feature data according to the fourth intermediate feature data, the fifth intermediate feature data and the sixth intermediate feature data.

5. The method according to claim 3 or 4, wherein the data in the second image feature data corresponds to a category of semantic segmentation.

6. The method of claim 5, wherein the second image feature data includes first image feature sub-data and second image feature sub-data, the first image feature sub-data indicating the cross-image context information, the second image feature sub-data indicating a category of the semantic segmentation, the projecting the second image feature data to first intermediate feature data and second intermediate feature data, respectively, comprising:

projecting the first image feature sub-data to first intermediate feature data;

and projecting the second image feature sub-data to second intermediate feature data.

7. The method according to any one of claims 2-6, wherein the performing feature enhancement on the first image feature data to obtain first enhanced image feature data includes:

and obtaining the first enhanced image feature data according to the seventh intermediate feature data, the eighth intermediate feature data and the ninth intermediate feature data.

8. The method of any of claims 3-7, wherein the projection has a step size greater than 1.

9. The method of claim 1, wherein the acquiring the first image feature data of the image data to be processed comprises:

10. A semantic segmentation apparatus, the apparatus for a first neural network model, the apparatus comprising:

a second feature enhancement module, configured to perform feature enhancement on the first image feature data by using second image feature data to obtain second enhanced image feature data, wherein the second enhanced image feature data comprises cross-image context information;

11. The apparatus of claim 10, wherein the first neural network model is a decoder of a Transformer self-attention model, and wherein the second feature enhancement module is configured to:

12. The apparatus of claim 11, wherein said determining third enhanced image feature data from said second image feature data and said first image feature data comprises:

projecting the first image feature data to third intermediate feature data;

13. The apparatus according to claim 11 or 12, wherein said determining said second enhanced image feature data from said first image feature data and said third enhanced image feature data comprises:

projecting the first image feature data to fourth intermediate feature data;

14. The apparatus according to claim 12 or 13, wherein the data in the second image feature data corresponds to a category of semantic segmentation.

15. The apparatus of claim 14, wherein the second image feature data comprises first image feature sub-data and second image feature sub-data, the first image feature sub-data indicating the cross-image context information, the second image feature sub-data indicating a category of the semantic segmentation, the projecting the second image feature data to first intermediate feature data and second intermediate feature data, respectively, comprising:

projecting the first image feature sub-data to first intermediate feature data;

16. The apparatus of any of claims 11-15, wherein the first feature enhancement module is configured to:

17. The apparatus of any one of claims 12-16, wherein the projection has a step size greater than 1.

18. The apparatus of claim 10, wherein the acquisition module is configured to:

19. A semantic segmentation apparatus, comprising:

a processor;

a memory for storing processor-executable instructions;

wherein the processor is configured to implement the method of any of claims 1-9 when executing the instructions.

20. A non-transitory computer readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the method of any of claims 1-9.

21. A computer program product comprising computer readable code, or a non-transitory computer readable storage medium carrying computer readable code, wherein, when the computer readable code runs in an electronic device, a processor in the electronic device performs the method of any one of claims 1-9.