Disclosure of Invention
The present application provides a training method of an image classification model, an electronic device, and a readable storage medium, which can reduce the complexity of sample data labeling and improve the accuracy of image classification model training. The technical scheme is as follows:
In a first aspect, a training method of an image classification model is provided. The method is applied to an electronic device and includes:
The method includes: obtaining a positive sample image training set, where the positive sample image training set includes a plurality of sample images and a plurality of first text information, the scene category to which each sample image belongs is at least one of a plurality of scene categories, the plurality of sample images are in one-to-one correspondence with the plurality of first text information, and each first text information is used for describing the scene category to which the corresponding sample image belongs; determining a negative sample image training set according to the positive sample image training set, where the negative sample image training set includes the plurality of sample images of the positive sample image training set, the text information of each sample image in the negative sample image training set is second text information, and the second text information of any sample image in the negative sample image training set is one first text information, among the plurality of first text information, other than the first text information corresponding to that sample image; and performing iterative training on an initial classification model based on the positive sample image training set and the negative sample image training set to obtain a target image classification model, where the target image classification model can identify images belonging to at least one of the plurality of scene categories.
It should be noted that the scene category to which each sample image belongs is at least one of the plurality of scene categories; that is, any sample image may relate to a plurality of scene categories at the same time, or may relate to only one scene category.
As one example, the image content of each sample image in the negative sample image training set does not correspond to the scene category described by the corresponding second text information.
It should be noted that the plurality of sample images in the negative sample image training set are the same as the plurality of sample images in the positive sample image training set. That is, the electronic device may determine the plurality of sample images in the positive sample image training set as the plurality of sample images in the negative sample image training set and, for the same sample image, adjust the text information corresponding to that sample image in the negative sample image training set according to the plurality of first text information. In other words, for any sample image, the text information corresponding to that sample image in the positive sample image training set is different from the text information corresponding to that sample image in the negative sample image training set.
Because, in the process of constructing the negative sample image training set, the plurality of sample images in the negative sample image training set are the plurality of sample images in the positive sample image training set, and the text information in the negative sample image training set is also drawn from the plurality of first text information, while for the same sample image the text information in the positive sample image training set differs from the text information in the negative sample image training set, data annotation is only required once in the process of constructing the two training sets. This reduces the complexity of data annotation and improves its efficiency. In addition, information of different modalities of the same thing is used in the process of training the image classification model to be trained, so that the trained image classification model is more accurate.
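The construction of the positive and negative sample image training sets described above can be sketched as follows. This is a minimal illustration under assumed names, not the application's implementation; the negative text here is chosen by simple rotation as a placeholder, whereas the application selects the most similar non-matching first text information.

```python
def build_negative_set(positive_set):
    """Reuse every sample image from the positive set, but pair it with a
    first text information belonging to a *different* image, so the text
    no longer describes the image (the second text information).
    Placeholder strategy: take the next image's text in round-robin order.
    """
    n = len(positive_set)
    return [(image, positive_set[(i + 1) % n][1])
            for i, (image, _own_text) in enumerate(positive_set)]


# Toy positive set: (sample image, first text information) pairs.
positive_set = [
    ("img_sky.jpg", "sky, a photo of blue sky"),
    ("img_grass.jpg", "grass, a photo of grassland"),
    ("img_tree.jpg", "tree, a photo of trees"),
]
negative_set = build_negative_set(positive_set)
# Same images, but each is now paired with non-matching text.
```

Because the images are shared between the two sets and only the text is re-paired, each image has to be annotated once, which is the labeling saving the passage describes.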
As one example of the present application, the operations of the electronic device to determine a negative sample image training set from a positive sample image training set include:
The operations include: determining a similarity between a target sample image and the first text information corresponding to each target scene category, where the target sample image is any sample image in the positive sample image training set, and each target scene category refers to each scene category, among the plurality of scene categories, other than the scene category to which the target sample image belongs; determining the first text information corresponding to the target scene category with the largest similarity as the second text information corresponding to the target sample image; and determining the plurality of sample images and the second text information corresponding to each sample image as the negative sample image training set.
As an example, the electronic device may also determine a similarity between the target sample image and each first text information that does not describe the scene category to which the target sample image belongs, and determine the first text information with the greatest similarity among these as the second text information.
It is worth noting that, by determining the first text information corresponding to the target scene category with the maximum similarity as the second text information corresponding to the target sample image, the construction of the negative sample can not only be completed successfully; because the similarity between the second text information and the corresponding target sample image is the highest among the candidates, training the image classification model on the target sample image and its second text information also improves the accuracy with which the model identifies images of highly similar categories.
As one example of the present application, the operation of the electronic device determining the similarity between the target sample image and the first text information corresponding to each target scene category includes:
The operations include: processing the target sample image through a pre-trained target image encoder to obtain a visual feature vector of the target sample image; processing the first text information corresponding to each target scene category through a pre-trained target text encoder to obtain a text feature vector of the first text information corresponding to each target scene category; and determining the similarity between the visual feature vector of the target sample image and the text feature vector corresponding to each target scene category.
It should be noted that the target text encoder and the target image encoder are obtained through joint training in the pre-training process.
It should be noted that processing each first text information through the pre-trained target text encoder and processing each sample image through the target image encoder ensures the consistency of the feature vectors.
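A minimal sketch of this similarity-based selection, assuming cosine similarity between the encoder outputs (the application does not fix the similarity measure) and using toy 2-dimensional feature vectors in place of real encoder outputs; all names are illustrative.

```python
import numpy as np

def cosine_similarity(v, t):
    """Similarity between a visual feature vector v and a text feature vector t."""
    return float(np.dot(v, t) / (np.linalg.norm(v) * np.linalg.norm(t)))

def pick_second_text(visual_vec, text_vecs, own_category):
    """Among the first text information of every target scene category (every
    category except the image's own), pick the one most similar to the image."""
    candidates = {c: t for c, t in text_vecs.items() if c != own_category}
    return max(candidates, key=lambda c: cosine_similarity(visual_vec, candidates[c]))

# Toy encoder outputs: the image is a sky image, so "grass" (the closest
# non-matching text) becomes its second text information.
visual_vec = np.array([1.0, 0.1])
text_vecs = {
    "sky":   np.array([1.0, 0.0]),
    "grass": np.array([0.8, 0.3]),
    "night": np.array([0.0, 1.0]),
}
second = pick_second_text(visual_vec, text_vecs, own_category="sky")
```

Selecting the hardest (most similar) non-matching text is what gives the negative samples their training value for distinguishing similar scenes.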
As an example of the application, before determining the similarity between the target sample image and the first text information corresponding to each target scene category, the electronic device may further: acquire a plurality of sample text information and a plurality of sample training images, where the plurality of sample text information are in one-to-one correspondence with the plurality of sample training images and each sample text information describes the scene category to which the corresponding sample training image belongs; iteratively train an initial text encoder based on the plurality of sample text information and iteratively train an initial image encoder based on the plurality of sample training images; determine, in the iterative training process, a loss value of a first loss function between the text encoder obtained after each training round and the image encoder obtained after each training round; and determine the text encoder obtained when the loss value converges as the target text encoder and the image encoder obtained when the loss value converges as the target image encoder. The target text encoder is used for determining the text feature vector of the first text information corresponding to each target scene category, and the target image encoder is used for determining the visual feature vector of the target sample image.
It should be noted that the plurality of sample text information and the plurality of sample training images are in one-to-one correspondence: for any sample text information D, there is exactly one sample training image whose image content matches the content described by that sample text information D. In addition, in the iterative training process, corresponding sample text information and sample training images can be trained in pairs.
Therefore, because the text encoder and the image encoder are trained with image-text matched data (namely, sample text information and the corresponding sample training images), the two modalities of information, the sample training image and its corresponding sample text information, can be mapped into the same feature space, and a text encoder and an image encoder with a better training effect can be obtained.
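One common way to realize such joint pre-training is a symmetric contrastive objective over a batch of matched image-text pairs, as popularized by CLIP. The application does not specify the exact form of the first loss function, so the sketch below is an assumption: matched pairs sit on the diagonal of the similarity matrix and are pushed to score higher than all mismatched combinations.

```python
import numpy as np

def first_loss(img_feats, txt_feats, temperature=0.07):
    """Symmetric InfoNCE-style loss between batches of image and text
    feature vectors; row i of each batch is a matched image-text pair."""
    img = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    txt = txt_feats / np.linalg.norm(txt_feats, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # pairwise similarities
    labels = np.arange(len(logits))

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)    # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # average the image->text and text->image directions
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2

# Perfectly matched toy features give a near-zero loss; deliberately
# mis-paired ones give a large loss, which is what drives the two
# encoders toward a shared feature space.
feats = np.eye(3)
matched = first_loss(feats, feats)
mismatched = first_loss(feats, feats[[1, 2, 0]])
```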
As an example of the present application, the electronic device performs iterative training on the initial classification model based on the positive sample image training set and the negative sample image training set, and the operation of obtaining the target image classification model includes:
The operations include: splicing the visual feature vector of each sample image in the positive sample image training set with the text feature vector of the corresponding first text information to obtain a plurality of positive sample mixed feature vectors; splicing the visual feature vector of each sample image in the negative sample image training set with the text feature vector of the corresponding second text information to obtain a plurality of negative sample mixed feature vectors; performing iterative training on the initial classification model according to the plurality of positive sample mixed feature vectors and the plurality of negative sample mixed feature vectors; determining, in the iterative training process, a loss value of a second loss function between the classification result of the image classification model obtained after each training round and a preset result; and, when the loss value of the second loss function converges, determining the image classification model obtained at convergence as the target image classification model.
It should be noted that the preset result is the classification label corresponding to each sample image in the positive sample image training set or the classification label corresponding to each sample image in the negative sample image training set. That is, when training on sample images from the positive sample image training set, the preset result is the classification label corresponding to each sample image in the positive sample image training set, and when training on sample images from the negative sample image training set, the preset result is the classification label corresponding to each sample image in the negative sample image training set.
Because information of different modalities of the same thing is used in the process of training the image classification model to be trained, the trained image classification model is more accurate.
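The splicing step and the second-loss training can be sketched as follows. The application does not specify the classifier architecture or the form of the second loss function; here a tiny logistic-regression classifier trained with binary cross-entropy stands in for the initial classification model, with label 1 for positive mixed vectors (text matches the image) and 0 for negative ones. All names are illustrative, and the data is a toy linearly separable set.

```python
import numpy as np

def mixed_feature(visual_vec, text_vec):
    """Splice a visual feature vector with a text feature vector."""
    return np.concatenate([visual_vec, text_vec])

def train_classifier(X, y, lr=1.0, epochs=2000):
    """Stand-in for iteratively training the initial classification model:
    logistic regression driven by a binary cross-entropy 'second loss'."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted match probability
        grad = p - y                              # gradient of the loss wrt logits
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

# Two positive and one negative mixed feature vector (toy 2+2 dims).
X = np.array([
    mixed_feature([1.0, 0.0], [1.0, 0.0]),   # image and text agree  -> label 1
    mixed_feature([0.0, 1.0], [0.0, 1.0]),   # image and text agree  -> label 1
    mixed_feature([1.0, 0.0], [0.0, 1.0]),   # mismatched (negative) -> label 0
])
y = np.array([1.0, 1.0, 0.0])
w, b = train_classifier(X, y)
preds = 1.0 / (1.0 + np.exp(-(X @ w + b))) > 0.5
```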
According to the method, the electronic device performs iterative training on an initial classification model based on a positive sample image training set and a negative sample image training set to obtain a target image classification model; then obtains an image to be classified and determines a visual feature vector of the image to be classified; determines the similarity between the visual feature vector of the image to be classified and each of a plurality of first text information to obtain a plurality of similarities; splices the visual feature vector of the image to be classified with the text feature vector corresponding to each of N first text information to obtain N mixed feature vectors, where the N first text information are those whose similarities rank in the first N when the plurality of similarities are arranged from largest to smallest, and N is a positive integer greater than or equal to 1; and processes the N mixed feature vectors through the target image classification model to obtain a classification result of the image to be classified.
Because all scene categories to which the image to be classified belongs can be identified through a single model, the target image classification model, the scene categories to which the image to be classified belongs can be obtained at once, which improves the efficiency of image classification. In addition, the electronic device can adopt different display schemes according to the scene categories to which the image to be classified belongs, thereby improving the display quality and display effect of the image to be classified.
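The inference path just described, ranking all first text information by similarity, splicing the image's visual feature vector with the top-N text feature vectors, and scoring each mixed vector, can be sketched as follows. The names and the stand-in scoring function are illustrative; a real deployment would score with the trained target image classification model.

```python
import numpy as np

def classify(image_vec, text_vecs, score_fn, n=2):
    """Rank first text information by cosine similarity to the image,
    keep the top N, splice image and text vectors, and score each mixed
    feature vector with score_fn (a stand-in for the target model)."""
    image_vec = np.asarray(image_vec, dtype=float)
    sims = {c: float(np.dot(image_vec, t) /
                     (np.linalg.norm(image_vec) * np.linalg.norm(t)))
            for c, t in text_vecs.items()}
    top_n = sorted(sims, key=sims.get, reverse=True)[:n]
    return {c: score_fn(np.concatenate([image_vec, text_vecs[c]]))
            for c in top_n}

text_vecs = {
    "sky":   np.array([1.0, 0.0]),
    "grass": np.array([0.7, 0.3]),
    "night": np.array([0.0, 1.0]),
}
# Dummy scorer: agreement between the image half and the text half.
score = lambda mixed: float(mixed[:2] @ mixed[2:])
result = classify([1.0, 0.0], text_vecs, score, n=2)
```

Restricting the spliced candidates to the top-N similarities keeps the number of classifier evaluations small, which matches the efficiency claim above.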
As an example of the present application, the operation of the electronic device to acquire the image to be classified includes:
in the case that the camera is started, determining a preview image collected by the camera as the image to be classified; or
in the case that an image selection operation is received, determining the image selected by the image selection operation as the image to be classified.
The case in which the camera is started includes: a case in which information such as a two-dimensional code or a bar code is scanned after the camera is started, a case in which character recognition or object recognition is performed after the camera is started, a case in which shooting is performed after the camera is started, and the like.
Therefore, the image to be classified can be a preview image or an image selected by the user, so that the electronic device can perform scene recognition on any image, which broadens the application scenarios of the image classification model and improves its practicability.
As an example, when the camera is started, the electronic device displays a scene recognition control in the shooting interface and performs image acquisition through the camera to obtain a preview image, where the scene recognition control is used for controlling whether scene recognition is performed; in response to an enabling operation on the scene recognition control, the preview image is determined as the image to be classified.
As one example, in a case where the scene recognition control is in an off state, that is, in a case where the user does not perform an enabling operation on the scene recognition control, the electronic device does not perform subsequent scene recognition (or image recognition) operations.
Therefore, by setting the scene recognition control, the user can autonomously choose whether to perform scene recognition, which improves interactivity with the user and saves the running resources of the electronic device when scene recognition is not needed.
As an example, after determining, through the target image classification model, the scene category to which the preview image belongs, the electronic device may, in response to a shooting operation, perform image exposure based on the preview image and store the exposed preview image in the image folder corresponding to the scene category to which the preview image belongs, where the preview image is the image to be classified.
As an example, to save storage space, the electronic device may also store the image to be classified under only one of the plurality of scene categories to which it belongs. Alternatively, the electronic device stores a single copy of the image to be classified together with its image identification and the corresponding scene categories. When the images need to be displayed by category, the electronic device can retrieve the image to be classified according to its image identification and display it in the classified display interface.
Therefore, the exposed preview image is stored in the image folder corresponding to the scene category to which the preview image belongs, so that the user can conveniently search for images by scene category, which improves interactivity with the user and user stickiness.
As an example, in the case where the image to be classified is a preview image, the electronic device may further display a scene tag in the preview image after determining the scene category to which the image to be classified belongs, where the scene tag is used to describe the scene category to which the image to be classified belongs.
As an example, in the case where the image to be classified is a preview image, after determining the scene category to which the image to be classified belongs, the electronic device may further determine an imaging scheme for the image to be classified according to the scene category to which it belongs, for example, determine an exposure parameter, a filter scheme, a shooting mode, a display resolution of the image to be classified, and so on.
In a second aspect, an electronic device is provided, where the electronic device includes a processor and a memory, where the memory is configured to store a program for supporting the electronic device to execute the training method of the image classification model provided in the first aspect, and store data related to implementing the training method of the image classification model in the first aspect. The processor is configured to execute a program stored in the memory. The electronic device may further comprise a communication bus for establishing a connection between the processor and the memory.
In a third aspect, a computer readable storage medium is provided, in which instructions are stored which, when run on a computer, cause the computer to perform the training method of the image classification model according to the first aspect described above.
In a fourth aspect, a computer program product is provided comprising instructions which, when run on a computer, cause the computer to perform the method of training an image classification model as described in the first aspect above.
The technical effects obtained by the second, third and fourth aspects are similar to the technical effects obtained by the corresponding technical means in the first aspect, and are not described in detail herein.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
It should be understood that references to "a plurality" in this disclosure refer to two or more. In the description of the present application, "/" means "or"; for example, A/B may mean A or B. "And/or" herein merely describes an association relationship between associated objects, indicating that three relationships may exist; for example, A and/or B may mean that A exists alone, that A and B both exist, or that B exists alone, unless otherwise stated. In addition, to facilitate a clear description of the technical solution of the present application, the words "first", "second", etc. are used to distinguish identical or similar items having substantially the same function. It will be appreciated by those of skill in the art that the words "first", "second", and the like do not limit the quantity or the order of execution, and do not necessarily indicate a difference.
Reference in the specification to "one embodiment" or "some embodiments" or the like means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," and the like in the specification are not necessarily all referring to the same embodiment, but mean "one or more but not all embodiments" unless expressly specified otherwise. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless expressly specified otherwise.
With the development of terminal technology, image processing technology is applied ever more widely. The recognition and classification of images is the basis of image processing technology, and electronic devices can usually perform image recognition or scene recognition through an image classification model. In some scenarios, to enable the user to quickly find the images they need, the mobile phone may classify the images in the gallery by scene category as shown in fig. 1. To facilitate image classification, a pre-trained image classification model can be built into the mobile phone; that is, before leaving the factory, the mobile phone may have a pre-trained image classification model built in. The image classification model can be obtained by other electronic devices through model training on sample images labeled with different scene categories, where labeling means that staff annotate different sample images with text information describing the scene category to which each sample image belongs; for example, sample images may be labeled as belonging to a viaduct scene, an urban road scene, a rain and fog scene, a building scene, a food scene, a night scene, a certificate scene, a text scene, and the like.
However, in the process of training an image classification model, to obtain a model with a better classification effect, finer-grained detail is typically required when labeling the sample images, and this data labeling task is complex, tedious, and difficult to carry out. There are also image classification models that use only a single modality of information (e.g., only image information) as training data, resulting in models that cannot recognize some complex scenes.
To improve the classification accuracy of the image classification model and reduce the difficulty of labeling data, an embodiment of the application provides a training method of an image classification model. An electronic device can acquire a positive sample image training set, where the positive sample image training set includes a plurality of sample images and a plurality of first text information, the scene category to which each sample image belongs is at least one of a plurality of scene categories, and each first text information describes the scene category to which the corresponding sample image belongs; determine a negative sample image training set according to the positive sample image training set, where the negative sample image training set includes the plurality of sample images, the text information corresponding to each sample image in the negative sample image training set is second text information, and the second text information is one first text information, among the plurality of first text information, other than the first text information corresponding to that sample image; and perform iterative training on an initial image classification model based on the positive sample image training set and the negative sample image training set to obtain a target image classification model capable of identifying images belonging to at least one of the plurality of scene categories. In the process of constructing the negative sample image training set, the plurality of sample images in the negative sample image training set are the plurality of sample images in the positive sample image training set, and the negative sample image training set also draws on the plurality of first text information.
However, for the same sample image, the text information corresponding to that sample image in the positive sample image training set is different from the text information corresponding to it in the negative sample image training set, so data annotation is only required once in the process of constructing the two training sets, which reduces the complexity of data annotation and improves its efficiency. In addition, information of different modalities of the same thing is used in the process of training the image classification model, so the trained image classification model is more accurate.
For easy understanding, before describing the method provided by the embodiment of the present application in detail, the application scenario related to the embodiment of the present application is described next.
Referring to fig. 2, fig. 2 is a schematic diagram of an application scenario provided in an embodiment of the present application. In one application scenario, a user may shoot through the camera of a mobile phone. After the mobile phone starts the camera, the shooting interface shown in fig. 2 (a) may be displayed on the display screen, and a preview image a is displayed in the shooting interface, where the preview image a includes blue sky, grassland, and trees. When the mobile phone acquires the preview image a, the preview image a may be input into a target image classification model, where the target image classification model is capable of identifying images belonging to at least one of a plurality of scene categories, such as images corresponding to a building scene category, a tree scene category, a lawn scene category, a blue sky scene category, a flower scene category, a night scene category, a beach scene category, an ocean scene category, a recreation ground scene category, a school scene category, and the like. The target image classification model may perform scene recognition (or image recognition, image classification) on the preview image a and output a classification result of the preview image a. Referring to fig. 2 (b), the mobile phone may display scene labels of the scene categories to which the preview image a belongs on the preview image a according to the classification result; if it is determined that the scene categories to which the preview image a belongs include the sky scene category, the grassland scene category, and the tree scene category, the mobile phone may display the scene labels "sky", "grassland", and "tree" at the positions of the blue sky, the grassland, and the trees in the preview image a.
In another application scenario, the mobile phone may also adjust the image display scheme according to the scene category to which the preview image a belongs, for example, adjust the shooting mode, adjust the exposure parameters, adjust the display resolution, and so on. For example, the scene category of the preview image a may be an overexposure scene category; the overexposure scene category includes, for example, a moon scene category, a flame scene category, a sun scene category, an electric lamp scene category, and the like. That is, if the preview image a includes a light source such as the moon, an electric lamp, or a flame, the scene categories of the preview image a generally include the overexposure scene category. In this case, the mobile phone may adjust the photographing mode or reduce the exposure parameters of the preview image a. As shown in fig. 3 (a), if the preview image a includes the moon, and the mobile phone recognizes through the target image classification model that the scene categories to which the preview image a belongs include the moon scene category, the mobile phone can adjust the photographing mode to a telescopic mode. In the telescopic mode, the exposure parameters of the preview image a are changed; that is, the mobile phone can reduce the exposure of the preview image a and shorten the exposure time of the preview image a, and display the preview image B as shown in fig. 3 (b).
In still another application scenario, referring to fig. 4 (a), if the user is satisfied with the current preview image a, the user may click the shooting control P1 in the shooting interface; the mobile phone may perform image exposure based on the preview image a to obtain an exposed preview image a, and may store the exposed preview image a in an image folder corresponding to a scene category to which the preview image a belongs, that is, in at least one of the image folder corresponding to the sky scene, the image folder corresponding to the lawn scene, and the image folder corresponding to the tree scene, for example, in the image folder corresponding to the sky scene category. Thereafter, if the user needs to search for an image of a certain scene category, referring to fig. 4 (b), the user may click the application identifier of the gallery application on the desktop; in response to the click operation on the application identifier of the gallery application, the mobile phone may display the image preview interface P2 shown in fig. 4 (c), in which a "find" control P3 is displayed; in response to a click operation on the "find" control P3, the mobile phone may display the classified view interface P4 shown in fig. 4 (d). The classified view interface P4 may display a plurality of image folders, for example, a "person" folder, a "place" folder, a "thing" folder, etc., and the "thing" folder may include a "building" folder, a "person image" folder, a "grass" folder, a "tree" folder, a "night view" folder, etc., so that the user may continue searching for a more detailed scene category from the "thing" folder.
Referring to fig. 5, fig. 5 is a schematic diagram of an application scenario provided in an embodiment of the present application. In another application scenario, after the camera is started by the mobile phone, the shooting interface shown in fig. 5 (a) is displayed, and not only the preview image a collected by the camera but also a scene recognition control may be displayed in the shooting interface. If the user needs to recognize the scene category of a shot image, the user may click the scene recognition control P5; in response to the click operation on the scene recognition control P5, the mobile phone may change the display mode of the scene recognition control P5 (in the embodiment of the present application, this is illustrated as the display color of the scene recognition control P5 changing from a white background to a black background) and input the preview image a into the target image classification model, and the target image classification model processes the preview image a and may output the classification result of the preview image a.
It should be noted that, in the embodiment of the present application, only the scenes shown in fig. 2 to 5 are described as examples, and the embodiment of the present application is not limited to the above.
Based on the application scenario provided by the above embodiment, the training method of the image classification model provided by the embodiment of the present application is described next. Referring to fig. 6, fig. 6 is a flowchart of a training method of an image classification model according to an exemplary embodiment, which is illustrated by way of example and not limitation, and may include some or all of the following:
step 601, acquiring a positive sample image training set.
It should be noted that, the positive sample image training set includes a plurality of sample images and a plurality of first text information, where each sample image belongs to a scene category that is at least one of a plurality of scene categories, the plurality of sample images and the plurality of first text information are in one-to-one correspondence, and one first text information is used for describing the scene category to which the corresponding sample image belongs.
As an example, the one-to-one correspondence of the plurality of sample images with the plurality of first text information means that, for any one first text information C in the plurality of first text information, there is exactly one sample image whose image content is the same as that described in the first text information C.
For example, one piece of first text information may be "flower, a photo of a flower sea" or "flower, an image of a flower sea", and then among the plurality of sample images there is a sample image whose image content is a flower or a flower sea, and the first text information describes that the scene category to which that sample image belongs is the flower scene category.
It should be noted that, the scene category to which each sample image belongs is at least one of the plurality of scene categories, which means that any sample image may relate to a plurality of scene categories at the same time, or may relate to only one scene category.
Illustratively, the plurality of scene categories may include a blue sky scene category, a tree scene category, a flower scene category, a cat scene category, a dog scene category, a sea scene category, a building scene category, a portrait scene category, a night scene category, a snow scene category, and the like. If the image content of one sample image includes objects such as a blue sky, a flower, a building, etc., the scene category to which the sample image belongs includes a blue sky scene category, a flower scene category, and a building scene category, and the first text information corresponding to the sample image is "an image of a blue sky, a flower, and a building", and the first text information describes three scene categories. If the image content of one sample image includes a cat, the scene category to which the sample image belongs is a cat scene category, and the first text information corresponding to the sample image is "an image of a cat".
As an example, for any one of the plurality of sample images, the electronic device may receive an input operation for the any one sample image, and determine text information carried by the input operation as first text information corresponding to the any one sample image. Or the electronic device may perform a text capturing operation in a specific page, determine the captured text information as the first text information for the arbitrary sample image, where the specific page is a page describing a scene category to which the arbitrary sample image belongs, and describe, in the specific page, the scene category to which the arbitrary sample image belongs through text information.
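The structure of the positive sample image training set described above can be sketched in code. This is a minimal illustration only, not part of the claimed method; all identifiers (`positive_training_set`, `first_text_for`, the image ids) are hypothetical:

```python
# Sketch: an in-memory representation of the positive sample image training set.
# Each sample image is paired with exactly one piece of first text information
# describing the scene categories (one or several) to which it belongs.
positive_training_set = [
    # (sample image id, scene categories, first text information)
    ("img_001", {"blue sky", "flower", "building"},
     "an image of a blue sky, a flower, and a building"),
    ("img_002", {"cat"}, "an image of a cat"),
    ("img_003", {"dog"}, "an image of a dog"),
]

def first_text_for(image_id, training_set):
    """Return the first text information paired with a given sample image."""
    for img_id, _categories, text in training_set:
        if img_id == image_id:
            return text
    raise KeyError(image_id)
```

The one-to-one correspondence means each image id appears exactly once, so the lookup is unambiguous.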
Step 602, determining a negative sample image training set according to the positive sample image training set.
It should be noted that the negative sample image training set includes a plurality of sample images, the text information of each sample image in the negative sample image training set is second text information, and the second text information of any one sample image in the negative sample image training set is one of the plurality of first text information except the first text information corresponding to any one sample image.
As one example, the image content of each sample image in the negative sample image training set does not correspond to the scene category described by the corresponding second text information.
In some embodiments, the operation of the electronic device determining a negative sample image training set according to the positive sample image training set includes: determining the similarity between a target sample image and the first text information corresponding to each target scene category, where the target sample image is any one sample image in the positive sample image training set, and each target scene category refers to each scene category, among the plurality of scene categories, other than the scene category to which the target sample image belongs; determining the first text information corresponding to the target scene category with the largest similarity as the second text information corresponding to the target sample image; and determining the plurality of sample images and the second text information corresponding to each sample image as the negative sample image training set.
It should be noted that the plurality of sample images in the negative sample image training set are the same as the plurality of sample images in the positive sample image training set, that is, the electronic device may determine the plurality of sample images in the positive sample image training set as the plurality of sample images in the negative sample image training set, and for the same sample image, the electronic device may adjust the text information corresponding to the sample image in the negative sample image training set according to the plurality of first text information. In other words, for any one sample image in the negative sample image training set, the text information corresponding to that sample image in the positive sample image training set is different from the text information corresponding to that sample image in the negative sample image training set.
In some embodiments, for a target sample image (any one sample image in the negative sample image training set, or equivalently any one sample image in the plurality of sample images), the electronic device may determine any one first text information other than the first text information corresponding to the target sample image in the positive sample image training set as the second text information corresponding to the target sample image in the negative sample image training set. Of course, in order to improve the accuracy of classifying similar images by the trained image classification model, the electronic device may instead determine the first text information corresponding to the target scene category with the greatest similarity as the second text information corresponding to the target sample image.
For example, the image content of the target sample image is a dog, and the first text information corresponding to the target sample image in the positive sample image training set is "an image of a dog", that is, the scene category corresponding to the target sample image is the dog scene category. The electronic device may then determine the first text information corresponding to the scene categories other than the dog scene category among the plurality of scene categories (where, if one piece of first text information describes both the dog scene category and the grassland scene category, the electronic device may acquire the text information describing the grassland scene category in that first text information, but may not acquire the text information describing the dog scene category). The electronic device may then determine the similarity between the target sample image and the first text information corresponding to each of the determined other scene categories, and determine the first text information corresponding to the target scene category with the maximum similarity as the second text information corresponding to the target sample image.
As an example, the electronic device may also determine the similarity between the target sample image and each piece of first text information that does not describe any scene category to which the target sample image belongs, and determine, among these, the first text information with the greatest similarity as the second text information.
For example, the image content of the target sample image is a dog, and the first text information corresponding to the target sample image in the positive sample image training set is "an image of a dog", that is, the scene category corresponding to the target sample image is the dog scene category. The electronic device may then determine the first text information that does not describe the dog scene category, determine the similarity between the target sample image and each piece of first text information that does not describe the dog scene category, and determine the first text information not describing the dog scene category with the greatest similarity as the second text information.
It is worth noting that, by determining the first text information corresponding to the target scene category with the maximum similarity as the second text information corresponding to the target sample image, not only can the construction of the negative sample be successfully completed, but also, because the similarity between the second text information and the target sample image is the highest among the candidates, training the image classification model with the target sample image and the corresponding second text information can improve the accuracy with which the image classification model distinguishes highly similar images.
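The selection of the "hardest" second text information described above can be sketched as follows. This is a minimal illustration under assumptions, not the claimed implementation; the `numpy` library and all names (`pick_second_text`, the candidate list format) are assumed for the sketch, and cosine similarity stands in for whichever similarity measure is used:

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def pick_second_text(image_vec, own_categories, candidates):
    """candidates: list of (scene categories, first text information, text feature vector).
    Among first text information whose scene categories do not overlap the target
    sample image's own categories, select the one most similar to the image --
    the hardest negative -- as the second text information."""
    best_text, best_sim = None, -float("inf")
    for categories, text, text_vec in candidates:
        if categories & own_categories:   # skip text describing the image's own categories
            continue
        sim = cosine_similarity(image_vec, text_vec)
        if sim > best_sim:
            best_text, best_sim = text, sim
    return best_text
```

Because the text describing the image's own scene categories is excluded first, the returned text is guaranteed to mismatch the image content while remaining maximally confusable.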
In some embodiments, the electronic device determining the similarity between the target sample image and the first text information corresponding to each of the target scene categories includes processing the target sample image by a pre-trained target image encoder to obtain visual feature vectors of the target sample image, processing the first text information corresponding to each of the target scene categories by the pre-trained target text encoder to obtain text feature vectors of the first text information corresponding to each of the target scene categories, and determining the similarity between the visual feature vectors of the target sample image and the text feature vectors corresponding to each of the target scene categories.
As an example, the electronic device may determine at least one of the Euclidean distance, the cosine distance, the Jaccard distance, and the like between the visual feature vector of the target sample image and the text feature vector corresponding to each target scene category. If one distance is determined, the resulting distance is determined as the similarity between the visual feature vector of the target sample image and the text feature vector corresponding to that target scene category; if a plurality of distances are determined, the mean of the distances may be determined as the similarity, or different weights may be assigned to the distances and the resulting weighted sum determined as the similarity between the visual feature vector of the target sample image and the text feature vector corresponding to that target scene category. The embodiment of the present application is not particularly limited thereto.
It should be noted that, the target text encoder and the target image encoder are obtained through mutual cooperation training in the pre-training process.
It should be noted that, the consistency of the feature vector can be ensured by processing each first text information by the pre-trained target text encoder and processing each sample image by the target image encoder.
In some embodiments, before determining the similarity between the target sample image and the first text information corresponding to each of the target scene categories, the electronic device may further train to obtain the target text encoder and the target image encoder in advance.
As one example, an electronic device may obtain a plurality of sample text information and a plurality of sample training images, where the plurality of sample text information corresponds to the plurality of sample training images one by one, and each sample text information in the plurality of sample text information is used to describe a scene category to which the corresponding sample training image belongs, iteratively train an initial text encoder based on the plurality of sample text information, and iteratively train the initial image encoder based on the plurality of sample training images, determine a loss value of a first loss function between a text encoder obtained after each training and an image encoder obtained after each training during the iterative training, determine, in case that the loss value converges, the text encoder obtained upon convergence as a target text encoder, and determine, in case that the loss value converges, the image encoder obtained upon convergence as a target image encoder, the target text encoder being used to determine a text feature vector of first text information corresponding to each target scene category, and the target image encoder being used to determine a visual feature vector of the target sample image.
It should be noted that the one-to-one correspondence between the plurality of sample text information and the plurality of sample training images means that, for any one sample text information D in the plurality of sample text information, there is exactly one sample training image whose image content is the same as that described in the sample text information D.
It should be further noted that, in the iterative training process, the corresponding sample text information and the sample training image may be paired to train. Illustratively, the sample text information is "a photo of a cat" or "an image of a cat", the corresponding sample training image is an image of a cat, the sample text information is input to the text encoder to train the text encoder, and the sample training image is input to the image encoder to train the image encoder.
It should be noted that, because the text encoder and the image encoder are trained by using image-text matched data (i.e., the sample text information and the corresponding sample training images), the two modalities of information of a sample training image and the corresponding sample text information can be mapped to the same feature space, and the similarity between the sample training image and the sample text information can be calculated. The training constrains the similarity between a sample training image and its corresponding sample text information to be the highest, and the similarity between the sample training image and other sample text information to be low, so that a text encoder and an image encoder with a better training effect can be obtained.
In some embodiments, determining a loss value of a first loss function between the text encoder obtained after each training and the image encoder obtained after each training during the iterative training process includes determining a sample text feature vector output by the text encoder obtained after each training and a sample visual feature vector output by the image encoder obtained after each training during the iterative training process, and then determining the loss value of the first loss function based on the sample text feature vector and the sample visual feature vector.
For example, referring to fig. 7, in the course of iteratively training the text encoder and the image encoder, the electronic device may input a plurality of sample text information into the text encoder and a plurality of sample training images into the image encoder. The text encoder may determine the sample text feature vector of the input sample text information, the image encoder may determine the sample visual feature vector of the input sample training image, and the electronic device may then determine the loss value of the first loss function based on the sample text feature vector and the sample visual feature vector.
As an example, the first loss function may be an Image-text contrast (Image-Text Contrastive, ITC) loss function, and the first loss function may be as shown in the following first formula (1).
L(W, (Y, X_1, X_2)) = Y · (1/2) · (D_w)^2 + (1 − Y) · (1/2) · [max(0, m − D_w)]^2   (1)

In the first formula (1), D_w represents the Euclidean distance between the two sample feature vectors X_1 and X_2, that is, D_w = sqrt(Σ_{i=1}^{P} (X_{1,i} − X_{2,i})^2), where P is the feature dimension. Y is a label indicating whether the two sample feature vectors match: Y is 1 when the two samples are similar or match, and Y is 0 when the two samples are dissimilar or do not match. m is a set margin threshold, and L(W, (Y, X_1, X_2)) is the loss value.
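A minimal numerical sketch of the first formula (1) follows; the `numpy` library and the function name are assumed, and Y = 1 denotes a matching pair as in the text above:

```python
import numpy as np

def contrastive_loss(x1, x2, y, m=1.0):
    """First formula (1): contrastive loss between two sample feature vectors.
    y = 1 when the pair matches (pulls the vectors together);
    y = 0 when it does not (pushes them apart up to the margin m)."""
    d_w = np.linalg.norm(x1 - x2)  # Euclidean distance D_w
    return y * 0.5 * d_w ** 2 + (1 - y) * 0.5 * max(0.0, m - d_w) ** 2
```

A matching pair with identical features incurs zero loss, and a non-matching pair incurs zero loss once its distance exceeds the threshold m.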
As an example, the first loss function may be represented by the first formula (1) described above, but may also be represented in other manners; illustratively, the first loss function may be represented by the second formula (2) described below.
L_itc = (1/2) · E_{(I,T)~D} [ H(y^{i2t}(I), p^{i2t}(I)) + H(y^{t2i}(T), p^{t2i}(T)) ]   (2)

where p_m^{i2t}(I) = exp(s(I, T_m)/τ) / Σ_{m=1}^{M} exp(s(I, T_m)/τ), and p_m^{t2i}(T) is defined analogously from s(T, I). In the second formula (2), L_itc is the loss value; s(I, T) and s(T, I) are the similarities between feature vectors, computed from g_v(v_cls), the visual feature vector, and g'_w(w'_cls), the text feature vector, where v and w denote network parameters and τ is a temperature parameter; H is the cross-entropy; M is the number of scene categories; p_m^{i2t} is the matching similarity between any visual feature vector and the corresponding text feature vector, and p_m^{t2i} is the matching similarity between any text feature vector and the corresponding visual feature vector.
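The second formula (2) can be sketched numerically as follows. This is an illustration under assumptions, not the claimed implementation: the `numpy` library, the function names, and the temperature value are assumed, and the matching image-text pairs are taken to lie on the diagonal of the similarity matrices:

```python
import numpy as np

def itc_loss(sim_i2t, sim_t2i, tau=0.07):
    """Image-text contrastive (ITC) loss over a batch.
    sim_i2t[i][m] is the similarity s(I_i, T_m); sim_t2i is the transpose
    direction s(T, I). tau is an assumed temperature parameter."""
    def cross_entropy_diag(sim):
        logits = sim / tau
        # softmax over each row gives p_m; the cross-entropy with a one-hot
        # target reduces to -log of the matching (diagonal) probability
        exp = np.exp(logits - logits.max(axis=1, keepdims=True))
        p = exp / exp.sum(axis=1, keepdims=True)
        return -np.mean(np.log(np.diag(p)))
    return 0.5 * (cross_entropy_diag(sim_i2t) + cross_entropy_diag(sim_t2i))
```

When matching pairs score much higher than non-matching ones (strong diagonal), the loss approaches zero; when all similarities are equal, it equals log of the number of candidates.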
As an example, the case where the loss value converges is the case where the loss value is less than or equal to a first preset value, and/or the case where the change in the loss value is less than or equal to a second preset value. The first preset value and the second preset value can be preset according to requirements.
It is worth noting that the training accuracy is higher by training the text encoder and the image encoder to obtain the target text encoder and the image encoder.
In some embodiments, in addition to determining, in the case where the loss value converges, the text encoder obtained at the time of convergence as the target text encoder and the image encoder obtained at the time of convergence as the target image encoder, the electronic device may further determine the training number of the iterative training, and in a case where the training number is greater than or equal to a number threshold, determine the text encoder obtained at that time as the target text encoder and the image encoder obtained at that time as the target image encoder.
It should be noted that, the frequency threshold may be set in advance according to the requirement, for example, the frequency threshold may be 100 times, 150 times, or the like.
In some embodiments, the electronic device may determine the text feature vector of each first text information and the visual feature vector of each sample image not only in the manner described above, but also in other manners. Illustratively, the electronic device may determine the visual feature vector of each sample image through a visual base network model (e.g., a ResNet (Residual Neural Network) model, etc.), and determine the text feature vector of each first text information through a multilingual text model. Or the electronic device performs vectorization processing on each first text information through a text vector model to obtain the text feature vector of each first text information, where the text vector model may be determined based on a pre-trained language representation model (Bidirectional Encoder Representations from Transformers, BERT). The embodiment of the present application is not particularly limited thereto.
Since the negative sample image training set is determined from the positive sample image training set, in general, in order to ensure the balance of model training, the ratio between the number of sample images in the positive sample image training set and the number of sample images in the negative sample image training set is typically 1:1. Of course, other ratios are also possible; for example, a portion of the sample images may be taken from the plurality of sample images as the sample images in the negative sample image training set, in which case the ratio between the number of sample images in the positive sample image training set and the number of sample images in the negative sample image training set may be 2:1, 1.5:1, etc., which is not particularly limited in the embodiment of the present application.
In some embodiments, each sample image in the positive sample image training set corresponds to a classification label, and each sample image in the negative sample image training set also corresponds to a classification label, where the classification labels are used to indicate the relationship between a sample image and the corresponding text information. The classification labels may be [0,1] and [1,0]: the first element in a classification label represents the probability that the sample image does not match the corresponding text information, and the second element represents the probability that the sample image matches the corresponding text information. That is, the classification label of each sample image in the positive sample image training set may be [0,1], where the first element 0 represents that the probability that the sample image does not match the corresponding first text information is 0, and the second element 1 represents that the probability that the sample image matches the corresponding first text information is 1. The classification label of each sample image in the negative sample image training set may be [1,0], where the first element 1 represents that the probability that the sample image does not match the corresponding second text information is 1, and the second element 0 represents that the probability that the sample image matches the corresponding second text information is 0.
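The labeling rule above can be sketched in one hypothetical helper (the function name is an assumption for illustration):

```python
def classification_label(matches):
    """Return the [P(no match), P(match)] classification label described above:
    [0, 1] for a positive (matching) image-text pair, [1, 0] for a negative one."""
    return [0, 1] if matches else [1, 0]
```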
It should be noted that, in the embodiment of the present application, the above-mentioned classification labels are merely taken as examples, and the embodiment of the present application is not limited to the above-mentioned examples, and the classification labels may be labels of other types.
And 603, performing iterative training on the initial classification model based on the positive sample image training set and the negative sample image training set to obtain a target image classification model.
The target image classification model is capable of identifying images belonging to at least one scene category of a plurality of scene categories. The initial classification model may be an image-text matching (Image Text Matching, ITM) module.
In some embodiments, the electronic device performs iterative training on the initial classification model based on the positive sample image training set and the negative sample image training set to obtain the target image classification model, where the operation of performing iterative training on the initial classification model includes: splicing the visual feature vector of each sample image in the positive sample image training set with the text feature vector of the corresponding first text information to obtain a plurality of positive sample mixed feature vectors; splicing the visual feature vector of each sample image in the negative sample image training set with the text feature vector of the corresponding second text information to obtain a plurality of negative sample mixed feature vectors; performing iterative training on the initial classification model according to the plurality of positive sample mixed feature vectors and the plurality of negative sample mixed feature vectors; determining, in the iterative training process, a loss value of a second loss function between the classification result of the image classification model obtained after each training and a preset result; and, in the case that the loss value of the second loss function converges, determining the image classification model obtained upon convergence as the target image classification model. For this process, reference may be made to the training diagram shown in fig. 7.
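The splicing of feature vectors and the matching head can be sketched as follows. This is a minimal illustration under assumptions: the `numpy` library, the function names, and the single-linear-layer head (with assumed trainable parameters W, b) are not from the source:

```python
import numpy as np

def mixed_feature(visual_vec, text_vec):
    """Splice (concatenate) a visual feature vector with a text feature
    vector to form the mixed feature vector fed to the classification model."""
    return np.concatenate([visual_vec, text_vec])

def itm_head(mixed, W, b):
    """Sketch of the matching head: one linear layer followed by softmax,
    returning [P(no match), P(match)] to be compared against the
    classification label via the second loss function."""
    logits = W @ mixed + b
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()
```

During training, the head's output for a positive mixed feature vector is pushed toward [0, 1] and for a negative one toward [1, 0].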
It should be noted that, the preset result is a classification label corresponding to each sample image in the positive sample image training set and a classification label corresponding to each sample image in the negative sample image training set. That is, in the case of training according to the sample images in the positive sample image training set, the preset result is a classification label corresponding to each sample image in the positive sample image training set, and in the case of training according to the sample images in the negative sample image training set, the preset result is a classification label corresponding to each sample image in the negative sample image training set.
In some embodiments, in the process of performing iterative training on the initial classification model according to the plurality of positive sample mixed feature vectors and the plurality of negative sample mixed feature vectors, the electronic device may further perform dimension up-scaling or dimension down-scaling on each mixed feature vector (including each of the plurality of positive sample mixed feature vectors and each of the plurality of negative sample mixed feature vectors) according to the dimension of the model parameters of the initial classification model, so that the dimension of each mixed feature vector is the same as the dimension of the network parameters of the initial classification model. Then, iterative training is performed on the initial classification model according to the processed mixed feature vectors.
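One simple way to raise or lower the dimension of a mixed feature vector is a linear projection; the sketch below uses an assumed fixed random projection matrix (the `numpy` library, the seed, and the function name are illustrative assumptions, since the source does not specify the projection):

```python
import numpy as np

def project(vec, target_dim, seed=0):
    """Map a mixed feature vector to target_dim dimensions (up- or
    down-scaling) with a fixed random linear projection."""
    rng = np.random.default_rng(seed)
    P = rng.standard_normal((target_dim, vec.shape[0]))
    return P @ vec
```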
As one example, the second loss function may be an ITM loss function. Of course, other loss functions are possible, and embodiments of the present application are not limited in this regard.
The second loss function may be expressed by the following third formula (3).
L_itm = E_{(I,T)~D} H(y^{itm}, p^{itm}(I, T))   (3)
In the embodiment of the application, in the process of constructing the negative sample image training set, the plurality of sample images in the negative sample image training set are the plurality of sample images in the positive sample image training set, and the negative sample image training set also reuses the plurality of first text information. However, for the same sample image, the text information corresponding to the sample image in the positive sample image training set is different from the text information corresponding to the sample image in the negative sample image training set, so that data annotation only needs to be performed once in the process of constructing the positive sample image training set and the negative sample image training set, thereby reducing the complexity of data annotation and improving the efficiency of data annotation. In addition, information of different modalities of the same thing is used in the process of training the image classification model to be trained, so that the trained image classification model is more accurate.
After the electronic device obtains the target image classification model, the electronic device may perform scene recognition (or image recognition, or image classification) on images belonging to different scene categories through the target image classification model. To facilitate understanding of the embodiment of the application, the manner in which the electronic device identifies the scene category to which an image belongs through the target image classification model is explained next. In addition, the reasoning process of the target image classification model is the same as the process of performing image recognition by applying the target image classification model, so the reasoning process of the target image classification model is not described again in the embodiment of the application.
Referring to fig. 8, fig. 8 is a flowchart illustrating a method of classifying images according to an exemplary embodiment, which is illustrated by way of example and not limitation, and may include some or all of the following:
step 801, obtaining an image to be classified.
As an example, the operation of the electronic device obtaining the image to be classified includes: determining, when the camera is turned on, that the preview image collected by the camera is the image to be classified; or determining, when an image selection operation is received, that the image selected by the image selection operation is the image to be classified. It should be noted that the image to be classified may be any image; for example, the image to be classified may be an image downloaded from a network, a preview image collected by the camera of the electronic device, or any image stored in the electronic device.
The case where the camera is turned on includes the case where information such as a two-dimensional code or a bar code is scanned after the camera is turned on, the case where character recognition or object recognition is performed after the camera is turned on, the case where shooting is performed after the camera is turned on, and the like.
As an example, the image selection operation may be a selection operation of any one image in a gallery (formed by stored images) of the electronic device, or may be a download operation, a save operation, or the like of a network image, which is not particularly limited in the embodiment of the present application.
It is worth noting that, because the image to be classified may be a preview image or an image selected by the user, the electronic device can realize scene recognition of any image, thereby increasing the application scenarios of the target image classification model and improving the practicability of the target image classification model.
In some embodiments, when the camera is turned on, the electronic device determines that the preview image acquired by the camera is an image to be classified, and the operation includes displaying a scene recognition control in a shooting interface and acquiring an image through the camera to obtain the preview image when the camera is turned on, wherein the scene recognition control is used for controlling whether scene recognition is performed or not, and determining that the preview image is the image to be classified in response to the operation of turning on the scene recognition control. For example, the scenario may refer to the application scenario shown in fig. 5 described above.
Since not all scenes need to be subjected to scene recognition, in order for the user to selectively perform scene recognition, the electronic device may also display a scene recognition control in the shooting interface.
As one example, in a case where the scene recognition control is in an off state, that is, in a case where the user does not perform the opening operation on the scene recognition control, the electronic device does not perform the subsequent scene recognition (or image recognition) operation.
It is worth noting that, by providing the scene recognition control, the user can actively choose whether to perform scene recognition, thereby increasing interactivity with the user, and in the case where scene recognition is not required, saving the operation resources of the electronic device.
Step 802, determining a visual feature vector of an image to be classified.
As an example, the electronic device may process the image to be classified by using the target image encoder to obtain the visual feature vector of the image to be classified, as shown in fig. 9. Alternatively, the electronic device may determine the visual feature vector of the image to be classified by other means, such as by the visual basic network model described above (e.g., a ResNet (Residual Neural Network) model, etc.). The embodiment of the present application is not particularly limited thereto.
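The encoding step above can be sketched as follows. This is a minimal illustration only: `encode_image` is a hypothetical stand-in for the target image encoder (it simply pools pixel values into a fixed number of buckets), and the fixed dimension of 4 is an assumption for demonstration; a real encoder such as a ResNet would produce a learned, much higher-dimensional vector. The L2 normalization is shown because it lets the similarity computation of step 803 use cosine similarity directly.

```python
import math

def encode_image(pixels, dim=4):
    # Hypothetical stand-in for the target image encoder: pools the pixel
    # values into `dim` buckets, then L2-normalizes the result so the
    # output can be compared by cosine similarity in step 803.
    buckets = [0.0] * dim
    for i, p in enumerate(pixels):
        buckets[i % dim] += p
    norm = math.sqrt(sum(b * b for b in buckets)) or 1.0
    return [b / norm for b in buckets]

# Toy "image" as a flat list of pixel intensities.
visual = encode_image([0.2, 0.8, 0.5, 0.1, 0.9, 0.3])
```

The resulting `visual` vector plays the role of the visual feature vector used in steps 803 and 804.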
Step 803, determining the similarity between the visual feature vector of the image to be classified and each of the plurality of first text information to obtain a plurality of similarities.
It should be noted that the electronic device may store the first text information corresponding to each of the plurality of scene categories and/or the text feature vector of the first text information corresponding to each of the plurality of scene categories. In this way, as shown in fig. 7, the electronic device may determine the similarity between the visual feature vector of the image to be classified and each of the plurality of first text information, so as to obtain a plurality of similarities.
In some embodiments, the electronic device may determine at least one of a euclidean distance, a cosine distance, a jaccard distance, and the like between the visual feature vector of the image to be classified and each text feature vector. If one distance is determined, the electronic device determines that distance as the similarity between the visual feature vector of the image to be classified and the text feature vector. If a plurality of distances are determined, the electronic device may determine the mean of the distances as the similarity, or may assign a different weight to each distance, add the weighted distances, and determine the resulting sum as the similarity between the visual feature vector of the image to be classified and the text feature vector. The embodiment of the present application is not particularly limited thereto.
Step 804, splicing the visual feature vector of the image to be classified and the text feature vector corresponding to each of the N pieces of first text information respectively to obtain N pieces of mixed feature vectors.
It should be noted that, the N pieces of first text information are first text information corresponding to each of the first N pieces of similarity after the plurality of similarities are arranged from large to small, where N is a positive integer greater than or equal to 1.
As an example, in a case where the electronic device obtains the plurality of similarities, the plurality of similarities may be ranked in order from large to small to obtain a first sorting result. The electronic device acquires the text feature vector of the first text information corresponding to each of the top N similarities in the first sorting result, and splices each of the N obtained text feature vectors with the visual feature vector of the image to be classified to obtain N mixed feature vectors.
As an example, in the case that the electronic device obtains the plurality of similarities, the plurality of similarities may instead be ranked in order from small to large to obtain a second sorting result. The electronic device acquires the text feature vector of the first text information corresponding to each of the last N similarities in the second sorting result, and splices each of the N obtained text feature vectors with the visual feature vector of the image to be classified to obtain N mixed feature vectors.
As an example, the electronic device may also traverse the plurality of similarities: each time, it obtains the maximum similarity among the similarities traversed, then continues to traverse the remaining similarities other than the obtained maximum and again obtains the maximum among them, repeating the traversal until N similarities have been obtained. The electronic device then acquires the text feature vector of the first text information corresponding to each of the N similarities, and splices each of the N obtained text feature vectors with the visual feature vector of the image to be classified to obtain N mixed feature vectors.
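The top-N selection and splicing of step 804 can be sketched as one small function. This is a simplified illustration assuming plain Python lists for the feature vectors; the function name `top_n_mixed_vectors` and the sort-by-index approach are choices of this sketch, not of the embodiment, but the behavior matches the first sorting result described above (rank from large to small, keep the top N, concatenate each text feature vector with the visual feature vector).

```python
def top_n_mixed_vectors(visual_vec, text_vecs, similarities, n):
    # Rank indices by similarity from large to small (the first sorting
    # result), keep the text feature vectors of the top-N first text
    # information, and concatenate each with the visual feature vector.
    ranked = sorted(range(len(similarities)),
                    key=lambda i: similarities[i], reverse=True)
    return [visual_vec + text_vecs[i] for i in ranked[:n]]

mixed = top_n_mixed_vectors(
    visual_vec=[1.0, 2.0],
    text_vecs=[[0.0], [9.0], [5.0]],
    similarities=[0.1, 0.9, 0.5],
    n=2,
)
```

Here the second and third text feature vectors have the two largest similarities, so `mixed` contains their concatenations with the visual feature vector, in descending order of similarity.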
In some embodiments, the operations of steps 803 and 804 may or may not be implemented by the target image classification model. In the former case, after obtaining the visual feature vector of the image to be classified, the electronic device may input it into the target image classification model, which may store the text feature vectors of the plurality of first text information; the electronic device may then determine, through the target image classification model, the similarity between the visual feature vector of the image to be classified and each text feature vector to obtain the plurality of similarities, and splice the visual feature vector of the image to be classified with each of the N text feature vectors to obtain the N mixed feature vectors. Or, referring to fig. 9, the electronic device obtains the N mixed feature vectors without passing through the target image classification model, inputs the N mixed feature vectors into the target image classification model, and then performs the following operation of step 805.
And 805, processing the N mixed feature vectors through the target image classification model to obtain a classification result of the image to be classified.
In some embodiments, the electronic device may perform a dimension-raising or dimension-reduction process on each of the N mixed feature vectors according to the dimension of the network parameters of the target image classification model, so that the dimension of each of the N mixed feature vectors is the same as the dimension of the network parameters of the target image classification model. The electronic device then performs the relevant classification processing on the N mixed feature vectors after dimension raising or dimension reduction to obtain the classification result of the image to be classified.
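The dimension-alignment step can be illustrated with the simplest possible scheme, zero-padding for dimension raising and truncation for dimension reduction. This is an assumption of this sketch: a real model would typically use a learned linear projection rather than padding or truncation, but the invariant shown (output dimension equals the model's parameter dimension) is the same.

```python
def align_dim(vec, target_dim):
    # Dimension raising (zero padding) or dimension reduction (truncation)
    # so that each mixed feature vector matches the dimension of the
    # model's network parameters. A real model would more likely use a
    # learned linear projection; this is a minimal illustration only.
    if len(vec) < target_dim:
        return vec + [0.0] * (target_dim - len(vec))
    return vec[:target_dim]

raised = align_dim([1.0, 2.0], 4)   # dimension raised to 4
reduced = align_dim([1.0, 2.0, 3.0], 2)  # dimension reduced to 2
```

Whatever the mixed feature vector's original length, the aligned vector always has exactly `target_dim` entries, which is the property the embodiment requires before classification.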
Since the target image classification model is capable of identifying images belonging to at least one of a plurality of scene categories, the electronic device can obtain a plurality of classification results for the images to be classified. The electronic device may determine, according to the plurality of classification results, a scene category to which the image to be classified belongs.
It should be noted that, the electronic device may represent the classification result by using the classification tag mentioned in step 601, and of course, may also represent the classification result by other manners, for example, at least one of information such as letters, numbers, patterns, identifiers, etc., which is not particularly limited in the embodiment of the present application.
For example, each of the plurality of classification results output by the target image classification model may be represented by a letter. In the case where the letter output by the target image classification model for an identifiable scene category E is "yes", the scene category to which the image to be classified belongs is the scene category E; in the case where the letter output for the scene category E is "no", the scene category to which the image to be classified belongs is not the scene category E.
Since there may be things corresponding to a plurality of scene categories in one image, the image to be classified may belong to at least one scene category among the plurality of scene categories at the same time. In this case, the electronic device may determine that the scene category to which the image to be classified belongs is the at least one scene category.
Illustratively, the image to be classified includes a dog, grassland, trees and a blue sky, and the target image classification model can identify images belonging to the sky scene category, the grassland scene category, the night scene category, the lovely pet scene category, the portrait scene category, the snow scene category and the like. The electronic device may input the N mixed feature vectors corresponding to the image to be classified into the target image classification model. The target image classification model may perform scene recognition on the image to be classified and output a classification result for each scene category, each classification result being represented by the above-described classification tag. For the sky scene category, the classification tag output by the target image classification model is [0,1]; for the grassland scene category, [0,1]; for the night scene category, [1,0]; for the lovely pet scene category, [0,1]; for the portrait scene category, [1,0]; and for the snow scene category, [1,0]. Based on the classification tags output by the target image classification model for each scene category, the electronic device may determine that the scene categories to which the image to be classified belongs are the lovely pet scene category, the grassland scene category, and the sky scene category.
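The per-category tag convention in this example can be collected into scene categories with a few lines. The tag convention ([0,1] means the image belongs to the category, [1,0] means it does not) is taken from the example above; the function name and the use of a dict keyed by category name are illustrative assumptions of this sketch.

```python
def categories_from_tags(tags):
    # [0, 1] means the image belongs to the scene category,
    # [1, 0] means it does not (convention from the example above).
    return [name for name, tag in tags.items() if tag == [0, 1]]

# Classification tags output per scene category, as in the example.
tags = {
    "sky": [0, 1],
    "grassland": [0, 1],
    "night": [1, 0],
    "lovely pet": [0, 1],
    "portrait": [1, 0],
    "snow": [1, 0],
}
belongs_to = categories_from_tags(tags)
```

For the tags above, `belongs_to` contains exactly the sky, grassland, and lovely pet scene categories, matching the conclusion of the example.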
In one example, if the image to be classified is a preview image and the user is satisfied with the preview image, the user may trigger a photographing operation. The electronic device may receive the photographing operation and, in response, store the exposed preview image in the image folder corresponding to the scene category to which the preview image belongs. For example, the scenario may refer to the scenario shown in fig. 4 described above.
As can be seen from the above, the number of the scene categories to which the image to be classified belongs may be plural, and then, in the case that the image to be classified needs to be stored, the electronic device may store the image to be classified into the image folder corresponding to each of the plural scene categories to which the image to be classified belongs.
Of course, in order to save storage space, the electronic device may also store the image to be classified into any one of a plurality of scene categories to which the image to be classified belongs. Or the electronic equipment stores one image to be classified, and stores the image identification of the image to be classified and the corresponding scene category. Under the condition that the images are required to be displayed in a classified mode, the electronic equipment can acquire the images to be classified according to the image identifications of the images to be classified and display the images to be classified in a classified display interface.
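The space-saving variant above, storing one copy of the image and recording its identifier against each scene category, can be sketched as a small index. The function name `build_index` and the dict-of-lists layout are assumptions of this sketch; the point is only that the classified display interface can look an image up by scene category without the image being duplicated into multiple folders.

```python
def build_index(image_id, categories, index=None):
    # Store a single copy of the image elsewhere and record its
    # identifier against every scene category it belongs to, so the
    # classified display interface can find it without duplication.
    index = index if index is not None else {}
    for cat in categories:
        index.setdefault(cat, []).append(image_id)
    return index

idx = build_index("IMG_001", ["lovely pet", "grassland", "sky"])
idx = build_index("IMG_002", ["sky"], idx)
```

Looking up the "sky" category now yields both image identifiers, while each image is stored only once.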
It is worth noting that, by storing the exposed preview image in the image folder corresponding to the scene category to which it belongs, the user can conveniently search for images by scene category, which improves interactivity with the user and user stickiness.
As an example, in the case where the image to be classified is a preview image, the electronic device may further display a scene tag in the preview image after determining the scene category to which the image to be classified belongs, where the scene tag is used to describe the scene category to which the image to be classified belongs. For example, if the scene categories to which the image to be classified belongs include a loving pet scene category, a grassland scene category, and a sky scene category, the electronic device may display "loving pet", "grassland", and "sky" in the preview image, and the scene may refer to the application scene shown in fig. 2.
As an example, in the case where the image to be classified is a preview image, after determining the scene category to which the image to be classified belongs, the electronic device may further determine an imaging scheme of the image to be classified according to the scene category to which the image to be classified belongs, for example, determine an exposure parameter, a filter scheme, a shooting mode, a display resolution, and the like of the image to be classified. For example, the scenario may refer to the scenario illustrated in fig. 3 above.
For example, in the case where the scene category to which the image to be classified belongs is a moon scene, the electronic device may decrease the exposure parameter of the image to be classified to obtain a clearer moon image. Or the electronic device may control the camera to enter a moon-shooting mode to change the imaging scheme of the image to be classified.
For example, in the case where the scene category to which the image to be classified belongs is a text scene category or a texture-rich scene category, the electronic device may perform super-resolution processing on the image to be classified, that is, increase the display resolution of the image to be classified.
In the embodiment of the application, since all scene categories to which the image to be classified belongs can be identified by one image classification module of the target image classification model, the scene categories to which the image to be classified belongs can, as far as possible, be obtained at once, improving the image classification efficiency of the target image classification model. Moreover, the electronic device can adopt different display schemes according to the scene categories to which the image to be classified belongs, improving the display quality and display effect of the image to be classified.
After explaining the training method of the image classification model provided by the embodiment of the application in detail, the electronic device related to the embodiment of the application is explained.
As one example, the method may be applied to an electronic device capable of model training. By way of example and not limitation, the electronic device may be a tablet computer, a desktop computer, a laptop computer, a handheld computer, a notebook computer, an in-vehicle device, an ultra-mobile personal computer (UMPC), a netbook, a personal digital assistant (PDA), a cell phone, etc., which is not limited by the embodiments of the application.
In addition, the electronic device may also apply the target image classification model obtained by training, and the electronic device for training the target image classification model and the electronic device for applying the target image classification model may be the same electronic device or different electronic devices, which is not particularly limited in the embodiment of the present application.
Fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present application. Referring to fig. 10, the electronic device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (universal serial bus, USB) interface 130, a charge management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, a sensor module 180, keys 190, a motor 191, an indicator 192, a camera 193, a display 194, a subscriber identity module (subscriber identification module, SIM) card interface 195, and the like. The sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, and the like.
It should be understood that the illustrated structure of the embodiment of the present application does not constitute a specific limitation on the electronic device 100. In other embodiments of the application, electronic device 100 may include more or fewer components than shown, or certain components may be combined, or certain components may be split, or different arrangements of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
The processor 110 may include one or more processing units, for example, the processor 110 may include an application processor (application processor, AP), a modem processor, a graphics processor (graphics processing unit, GPU), an image signal processor (image signal processor, ISP), a controller, a memory, a video codec, a digital signal processor (digital signal processor, DSP), a baseband processor, and/or a neural network processor (neural-network processing unit, NPU), etc. The different processing units may be separate devices or may be integrated in one or more processors.
The controller may be a neural hub and a command center of the electronic device 100, among others. The controller can generate operation control signals according to the instruction operation codes and the time sequence signals to finish the control of instruction fetching and instruction execution.
A memory may also be provided in the processor 110 for storing instructions and data. In some embodiments, the memory in the processor 110 is a cache memory. The memory may hold instructions or data that the processor 110 has just used or recycled. If the processor 110 needs to reuse the instruction or data, it can be called directly from the memory. Repeated accesses are avoided and the latency of the processor 110 is reduced, thereby improving the efficiency of the system.
In some embodiments, the processor 110 may include one or more interfaces, such as an inter-integrated circuit (inter-integrated circuit, I2C) interface, an inter-integrated circuit sound (inter-integrated circuit sound, I2S) interface, a pulse code modulation (pulse code modulation, PCM) interface, a universal asynchronous receiver/transmitter (universal asynchronous receiver/transmitter, UART) interface, a mobile industry processor interface (mobile industry processor interface, MIPI), a general-purpose input/output (GPIO) interface, a subscriber identity module (subscriber identity module, SIM) interface, and/or a universal serial bus (universal serial bus, USB) interface, among others.
It should be understood that the interfacing relationship between the modules illustrated in the embodiments of the present application is only illustrative, and is not meant to limit the structure of the electronic device 100. In other embodiments of the present application, the electronic device 100 may also employ different interfacing manners in the above embodiments, or a combination of multiple interfacing manners.
The electronic device 100 implements display functions through a GPU, a display screen 194, an application processor, and the like. The GPU is a microprocessor for image processing, and is connected to the display 194 and the application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. Processor 110 may include one or more GPUs that execute program instructions to generate or change display information.
The display screen 194 is used to display images, videos, and the like. The display 194 includes a display panel. The display panel may employ a liquid crystal display (liquid crystal display, LCD), an organic light-emitting diode (organic light-emitting diode, OLED), an active-matrix organic light-emitting diode (active-matrix organic light-emitting diode, AMOLED), a flexible light-emitting diode (flexible light-emitting diode, FLED), a mini-LED, a micro-LED, a micro-OLED, a quantum dot light-emitting diode (quantum dot light-emitting diode, QLED), or the like. In some embodiments, the electronic device 100 may include 1 or N display screens 194, N being an integer greater than 1.
The electronic device 100 may implement photographing functions through an ISP, a camera 193, a video codec, a GPU, a display screen 194, an application processor, and the like.
The ISP is used to process data fed back by the camera 193. For example, when photographing, the shutter is opened, light is transmitted to the camera photosensitive element through the lens, the light signal is converted into an electric signal, and the camera photosensitive element transmits the electric signal to the ISP for processing and is converted into an image visible to naked eyes. ISP can also optimize the noise, brightness and skin color of the image. The ISP can also optimize parameters such as exposure, color temperature and the like of a shooting scene. In some embodiments, the ISP may be provided in the camera 193.
The camera 193 is used to capture still images or video. The object generates an optical image through the lens and projects the optical image onto the photosensitive element. The photosensitive element may be a charge coupled device (charge coupled device, CCD) or a Complementary Metal Oxide Semiconductor (CMOS) phototransistor. The photosensitive element converts the optical signal into an electrical signal, which is then transferred to the ISP to be converted into a digital image signal. The ISP outputs the digital image signal to the DSP for processing. The DSP converts the digital image signal into an image signal in a standard RGB, YUV, or the like format. In some embodiments, electronic device 100 may include 1 or N cameras 193, N being an integer greater than 1.
The digital signal processor is used for processing digital signals, and can process other digital signals besides digital image signals. For example, when the electronic device 100 selects a frequency bin, the digital signal processor is used to fourier transform the frequency bin energy, and so on.
Video codecs are used to compress or decompress digital video. The electronic device 100 may support one or more video codecs. Thus, the electronic device 100 may play or record video in a variety of encoding formats, such as moving picture experts group (moving picture experts group, MPEG) 1, MPEG2, MPEG3, MPEG4, and the like.
The NPU is a neural-network (neural-network, NN) computing processor. By drawing on the structure of biological neural networks, for example the transfer mode between human brain neurons, it can rapidly process input information and can also continuously perform self-learning. Applications such as intelligent recognition of the electronic device 100, for example image recognition, face recognition, voice recognition, text understanding, etc., can be realized through the NPU.
As an example, the NPU may include the target image classification model provided by the embodiment of the present application.
The external memory interface 120 may be used to connect an external memory card, such as a Micro SD card, to enable expansion of the memory capabilities of the electronic device 100. The external memory card communicates with the processor 110 through an external memory interface 120 to implement data storage functions. Such as storing files of music, video, etc. in an external memory card.
The internal memory 121 may be used to store computer-executable program code that includes instructions. The processor 110 executes various functional applications of the electronic device 100 and data processing by executing instructions stored in the internal memory 121. The internal memory 121 may include a storage program area and a storage data area. The storage program area may store an application program (such as a sound playing function, an image playing function, etc.) required for at least one function of the operating system, etc. The storage data area may store data (e.g., audio data, phonebook, etc.) created by the electronic device 100 during use, and so forth. In addition, the internal memory 121 may include a high-speed random access memory, and may further include a nonvolatile memory such as at least one magnetic disk storage device, a flash memory device, a universal flash memory (universal flash storage, UFS), and the like.
The electronic device 100 may implement audio functions such as music playing, recording, etc. through the audio module 170, speaker 170A, receiver 170B, microphone 170C, headphone interface 170D, and application processor, etc.
The pressure sensor 180A is used to sense a pressure signal, and may convert the pressure signal into an electrical signal. In some embodiments, the pressure sensor 180A may be disposed on the display screen 194. The pressure sensor 180A is of various types, such as a resistive pressure sensor, an inductive pressure sensor, a capacitive pressure sensor, and the like. The capacitive pressure sensor may be a capacitive pressure sensor comprising at least two parallel plates with conductive material. The capacitance between the electrodes changes when a force is applied to the pressure sensor 180A. The electronic device 100 determines the strength of the pressure from the change in capacitance. When a touch operation is applied to the display screen 194, the electronic apparatus 100 detects the touch operation intensity according to the pressure sensor 180A. The electronic device 100 may also calculate the location of the touch based on the detection signal of the pressure sensor 180A. In some embodiments, touch operations that act on the same touch location, but at different touch operation strengths, may correspond to different operation instructions. For example, when a touch operation with the touch operation intensity smaller than the pressure threshold is applied to the short message application icon, an instruction for viewing the short message is executed. And executing the instruction of newly creating the short message when the touch operation with the touch operation intensity being larger than or equal to the pressure threshold acts on the short message application icon.
The gyro sensor 180B may be used to determine a motion gesture of the electronic device 100. In some embodiments, the angular velocity of electronic device 100 about three axes (i.e., x, y, and z axes) may be determined by gyro sensor 180B. The gyro sensor 180B may be used for photographing anti-shake. For example, when the shutter is pressed, the gyro sensor 180B detects the shake angle of the electronic device 100, calculates the distance to be compensated by the lens module according to the angle, and makes the lens counteract the shake of the electronic device 100 through the reverse motion, so as to realize anti-shake. The gyro sensor 180B may also be used for navigating, somatosensory game scenes.
A distance sensor 180F for measuring a distance. The electronic device 100 may measure the distance by infrared or laser. In some embodiments, in a shooting scene, the electronic device 100 may range using the distance sensor 180F to achieve fast focus.
The ambient light sensor 180L is used to sense ambient light level. The electronic device 100 may adaptively adjust the brightness of the display 194 based on the perceived ambient light level. The ambient light sensor 180L may also be used to automatically adjust white balance when taking a photograph. Ambient light sensor 180L may also cooperate with proximity light sensor 180G to detect whether electronic device 100 is in a pocket to prevent false touches.
The fingerprint sensor 180H is used to collect a fingerprint. The electronic device 100 may utilize the collected fingerprint feature to unlock the fingerprint, access the application lock, photograph the fingerprint, answer the incoming call, etc.
The touch sensor 180K, also referred to as a "touch panel". The touch sensor 180K may be disposed on the display screen 194, and the touch sensor 180K and the display screen 194 form a touch screen, which is also called a "touch screen". The touch sensor 180K is for detecting a touch operation acting thereon or thereabout. The touch sensor 180K may communicate the detected touch operation to the application processor to determine the touch event type. Visual output related to touch operations may be provided through the display 194. In other embodiments, the touch sensor 180K may also be disposed on the surface of the electronic device 100 at a different location than the display 194.
The software system of the electronic device 100 will be described next.
The software system of the electronic device 100 may employ a layered architecture, an event driven architecture, a microkernel architecture, a microservice architecture, or a cloud architecture. In the embodiment of the application, an Android (Android) system with a layered architecture is taken as an example, and a software system of the electronic device 100 is illustrated.
Fig. 11 is a block diagram of a software system of the electronic device 100 according to an embodiment of the present application. Referring to fig. 11, the layered architecture divides the software into several layers, each with a clear role and division of work. The layers communicate with each other through software interfaces. In some embodiments, the Android system is divided into four layers, from top to bottom: an application layer, an application framework layer, an Android runtime (Android runtime) and system library layer, and a kernel layer.
The application layer may include a series of application packages. As shown in fig. 11, the application package may include applications for cameras, gallery, calendar, phone calls, maps, navigation, WLAN, bluetooth, music, video, short messages, etc.
The application framework layer provides an application programming interface (application programming interface, API) and programming framework for the application of the application layer. The application framework layer includes a number of predefined functions. As shown in fig. 11, the application framework layer may include a window manager, a content provider, a view system, a phone manager, a resource manager, a notification manager, and the like. The window manager is used for managing window programs. The window manager can acquire the size of the display screen, judge whether a status bar exists, lock the screen, intercept the screen and the like. The content provider is used to store and retrieve data, which may include video, images, audio, calls made and received, browsing history and bookmarks, phonebooks, etc., and make such data accessible to the application. The view system includes visual controls, such as controls to display text, controls to display pictures, and the like. The view system may be used to construct a display interface for an application, which may be comprised of one or more views, such as a view that includes displaying a text notification icon, a view that includes displaying text, and a view that includes displaying a picture. The telephony manager is used to provide communication functions of the electronic device 100, such as management of call status (including on, off, etc.). The resource manager provides various resources for the application program, such as localization strings, icons, pictures, layout files, video files, and the like. The notification manager allows the application to display notification information in a status bar, can be used to communicate notification type messages, can automatically disappear after a short dwell, and does not require user interaction. For example, a notification manager is used to inform that the download is complete, a message alert, etc. 
The notification manager may also display notifications in the top status bar of the system in the form of a chart or scrolling text, such as a notification of an application running in the background. The notification manager may also display notifications on the screen in the form of a dialog window, for example, prompting text information in the status bar, emitting a prompt sound, making the electronic device vibrate, or flashing an indicator light.
The Android runtime includes a core library and a virtual machine. The Android runtime is responsible for scheduling and management of the Android system. The core library consists of two parts: one part is the functions that the Java language needs to call, and the other part is the Android core library. The application layer and the application framework layer run in the virtual machine. The virtual machine executes the Java files of the application layer and the application framework layer as binary files. The virtual machine is used to perform functions such as object lifecycle management, stack management, thread management, security and exception management, and garbage collection.
The system library may include a plurality of functional modules, such as a surface manager (surface manager), media libraries (Media Libraries), a three-dimensional graphics processing library (e.g., OpenGL ES), and a 2D graphics engine (e.g., SGL). The surface manager is used to manage the display subsystem and provides fusion of 2D and 3D layers for a plurality of applications. The media libraries support playback and recording of a variety of commonly used audio and video formats, as well as still image files and the like. The media libraries may support a variety of audio and video encoding formats, such as MPEG-4, H.264, MP3, AAC, AMR, JPG, and PNG. The three-dimensional graphics processing library is used to implement three-dimensional graphics drawing, image rendering, composition, layer processing, and the like. The 2D graphics engine is a drawing engine for 2D drawing.
The kernel layer is the layer between hardware and software. The kernel layer contains at least a display driver, a camera driver, an audio driver, and a sensor driver.
The workflow of the software and hardware of the electronic device 100 is illustrated below by taking a photographing scenario as an example.
When the touch sensor 180K receives a touch operation, a corresponding hardware interrupt is issued to the kernel layer. The kernel layer processes the touch operation into a raw input event (including information such as the touch coordinates and the timestamp of the touch operation). The raw input event is stored at the kernel layer. The application framework layer obtains the raw input event from the kernel layer and identifies the control corresponding to the raw input event. Taking an example in which the touch operation is a tap operation and the control corresponding to the tap operation is the control of the camera application icon: the camera application calls the interface of the application framework layer to start the camera application, which then calls the kernel layer to start the camera driver and captures a still image or video through the camera 193.
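The event flow described above (hardware interrupt → kernel layer → application framework layer → application launch) can be sketched as a minimal, self-contained simulation. All class and method names below (KernelLayer, FrameworkLayer, launchFor, and so on) are illustrative stand-ins for the layered flow, not actual Android framework APIs.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Simplified model of the touch-event flow: the kernel layer stores raw
// input events, the framework layer reads one and identifies the target
// control, and the matching application is launched. Names are illustrative.
public class InputEventFlow {

    // Raw input event as produced by the kernel layer
    // (touch coordinates plus a timestamp).
    static final class RawInputEvent {
        final int x, y;
        final long timestampMs;
        RawInputEvent(int x, int y, long timestampMs) {
            this.x = x; this.y = y; this.timestampMs = timestampMs;
        }
    }

    // Kernel layer: turns a hardware interrupt into a raw input event
    // and stores it until the framework layer fetches it.
    static final class KernelLayer {
        private final Deque<RawInputEvent> queue = new ArrayDeque<>();
        void onHardwareInterrupt(int x, int y, long ts) {
            queue.addLast(new RawInputEvent(x, y, ts));
        }
        RawInputEvent nextEvent() {
            return queue.pollFirst();
        }
    }

    // Framework layer: maps the event's coordinates to a control.
    // A single hard-coded hit region stands in for the view system.
    static final class FrameworkLayer {
        String identifyControl(RawInputEvent e) {
            // Assume the camera icon occupies the square (0,0)-(100,100).
            if (e.x >= 0 && e.x <= 100 && e.y >= 0 && e.y <= 100) {
                return "camera_icon";
            }
            return "unknown";
        }
    }

    // Resolves a control to the application that should be launched.
    static String launchFor(String control) {
        return control.equals("camera_icon") ? "camera_app" : "none";
    }

    public static void main(String[] args) {
        KernelLayer kernel = new KernelLayer();
        FrameworkLayer framework = new FrameworkLayer();

        // Touch sensor reports a tap at (50, 60): hardware interrupt.
        kernel.onHardwareInterrupt(50, 60, System.currentTimeMillis());

        // Framework layer fetches the raw event, identifies the control,
        // and the corresponding application is launched.
        RawInputEvent event = kernel.nextEvent();
        String control = framework.identifyControl(event);
        System.out.println(control + " -> " + launchFor(control));
        // prints "camera_icon -> camera_app"
    }
}
```

In a real system the camera application would then call the kernel layer to start the camera driver; here the launch decision is the last step modeled.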
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired (e.g., coaxial cable, optical fiber, digital subscriber line (Digital Subscriber Line, DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, or a magnetic tape), an optical medium (e.g., a digital versatile disc (Digital Versatile Disc, DVD)), or a semiconductor medium (e.g., a solid state disk (Solid State Disk, SSD)), etc.
The above embodiments are not intended to limit the present application. Any modification, equivalent substitution, improvement, or the like made within the technical scope of the present application shall be included in the protection scope of the present application.