CN112819023B - Sample set acquisition method, device, computer equipment and storage medium

Sample set acquisition method, device, computer equipment and storage medium

Info

Publication number
CN112819023B
CN112819023B (application CN202010529394.5A)
Authority
CN
China
Prior art keywords
sample
classification
sample set
keywords
training
Prior art date
Legal status
Active
Application number
CN202010529394.5A
Other languages
Chinese (zh)
Other versions
CN112819023A (en)
Inventor
费志辉
李超
李振阳
马连洋
衡阵
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202010529394.5A
Publication of CN112819023A
Application granted
Publication of CN112819023B
Status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/24 Classification techniques
    • G06F 18/24765 Rule-based classification
    • G06F 18/25 Fusion techniques
    • G06F 18/254 Fusion techniques of classification results, e.g. of results related to same input data

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present application relates to a sample set acquisition method and apparatus, a computer device and a storage medium. The method comprises the following steps: searching objects based on the keywords of a label, and obtaining a sample set of the label from the positive samples that contain the keywords and the negative samples that do not; selecting K training sets from the sample set and respectively training an initial classification model to obtain K classification models; predicting each sample in the sample set with the K classification models, and obtaining, from the K prediction results output for each sample, a classification result of whether the sample belongs to the label; and updating the sample set according to the classification result of each sample, taking the K classification models as the initial classification models, and iteratively returning to the step of selecting K training sets from the sample set and respectively training the initial classification models to obtain K classification models, until an iteration stop condition is met, to obtain the sample set of the label. The method improves the efficiency of sample set acquisition.

Description

Sample set acquisition method, device, computer equipment and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence technologies, and in particular, to a method and apparatus for acquiring a sample set, a computer device, and a storage medium.
Background
Machine learning aims to give machines a learning ability comparable to that of humans; it studies how a computer can simulate or implement human learning behaviour to acquire new knowledge or skills, and how it can reorganize its existing knowledge structure to continuously improve its performance.
Machine learning typically requires a large amount of annotated data for the machine to learn from; by continuously learning from and optimizing over the annotated data, a generalizable model is built that classifies or predicts when new data is passed through it. The sample set, and the annotation of each sample in it, therefore play a critical role in artificial intelligence technology. Taking text classification as an example, a certain amount of labeled data is required for each label in order to train a text classification model, which is then used to predict texts and determine their classification labels. In reality, however, high-quality data annotated with concept labels is scarce, so a portion of the data usually has to be extracted by some method or rule to obtain training samples, and each of these samples must then be labeled manually to obtain a sample set for model training.
However, manual labeling takes a great deal of time, which makes sample set acquisition inefficient.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a sample set acquisition method, apparatus, computer device, and storage medium that can improve efficiency.
A method of obtaining a sample set, the method comprising:
searching an object based on a keyword of a label, and obtaining a sample set of the label according to the positive sample with the keyword and the negative sample without the keyword;
selecting K training sets from the sample set, and respectively training initial classification models to obtain K classification models;
predicting each sample in the sample set by using the K classification models respectively, and obtaining a classification result of whether each sample belongs to the label according to K prediction results of each sample output by the K classification models;
updating the sample set according to the classification result of each sample in the sample set, taking the K classification models as the initial classification models, and iteratively returning to the step of selecting K training sets from the sample set and respectively training the initial classification models to obtain K classification models, until an iteration stop condition is met, to obtain the sample set of the label.
An acquisition device for a sample set, the device comprising:
the searching and acquiring module is used for searching the object based on the keywords of the label and acquiring a sample set of the label according to the positive sample with the keywords and the negative sample without the keywords;
the training module is used for selecting K training sets from the sample set, and training the initial classification models respectively to obtain K classification models;
the prediction module is used for predicting each sample in the sample set by using the K classification models respectively, and obtaining a classification result of whether each sample belongs to the label according to K prediction results of each sample output by the K classification models;
and the iteration module is used for updating the sample set according to the classification result of each sample in the sample set, taking the K classification models as initial classification models, and iteratively returning to the step of selecting the K training sets from the sample set, respectively training the initial classification models to obtain the K classification models until the iteration stopping condition is met, so as to obtain the sample set of the label.
A computer device comprising a memory storing a computer program and a processor which when executing the computer program performs the steps of:
searching an object based on a keyword of a label, and obtaining a sample set of the label according to the positive sample with the keyword and the negative sample without the keyword;
selecting K training sets from the sample set, and respectively training initial classification models to obtain K classification models;
predicting each sample in the sample set by using the K classification models respectively, and obtaining a classification result of whether each sample belongs to the label according to K prediction results of each sample output by the K classification models;
updating the sample set according to the classification result of each sample in the sample set, taking the K classification models as the initial classification models, and iteratively returning to the step of selecting K training sets from the sample set and respectively training the initial classification models to obtain K classification models, until an iteration stop condition is met, to obtain the sample set of the label.
A computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of:
searching an object based on a keyword of a label, and obtaining a sample set of the label according to the positive sample with the keyword and the negative sample without the keyword;
selecting K training sets from the sample set, and respectively training initial classification models to obtain K classification models;
predicting each sample in the sample set by using the K classification models respectively, and obtaining a classification result of whether each sample belongs to the label according to K prediction results of each sample output by the K classification models;
updating the sample set according to the classification result of each sample in the sample set, taking the K classification models as the initial classification models, and iteratively returning to the step of selecting K training sets from the sample set and respectively training the initial classification models to obtain K classification models, until an iteration stop condition is met, to obtain the sample set of the label.
According to the above sample set acquisition method and apparatus, computer device and storage medium, a search is performed with the keywords of the label and a sample set is preliminarily determined from the search results; K training sets are selected from the sample set to train the initial classification model, yielding K classification models; the K classification models then predict the samples of the sample set, and whether each sample belongs to the label is determined from the classification results of the K models. Because the results of multiple classifiers are fused, classification accuracy is improved. The sample set is then updated according to the classification results for iterative training, and when training ends, the sample set of the label is obtained. The method determines the sample set of the label, and the label of every sample in it, by keyword search and model training, requires no manual annotation, and thus improves the efficiency of sample set acquisition.
Drawings
FIG. 1 is an application environment diagram of a method for acquiring a sample set in one embodiment;
FIG. 2 is a flow chart of a method for acquiring a sample set in one embodiment;
FIG. 3 is a topology diagram of a fasttext model in one embodiment;
FIG. 4 is an explanatory diagram of a text representation method based on fastText model in one embodiment;
FIG. 5 is an application environment diagram of a method for acquiring a sample set in another embodiment;
FIG. 6 is a schematic illustration of data set and model iterations in one embodiment;
FIG. 7 is a block diagram of a sample set acquisition device in one embodiment;
FIG. 8 is an internal structural diagram of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
Artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the capabilities of perception, reasoning and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Machine Learning (ML) is a multi-field interdiscipline involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It studies how a computer can simulate or implement human learning behaviour to acquire new knowledge or skills, and how it can reorganize its existing knowledge structure to continuously improve its performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied throughout the various fields of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and learning from instruction.
The solution provided by the embodiments of the present application involves technologies such as sample set acquisition for artificial intelligence, and is described in detail through the following embodiments.
The sample set acquisition method provided by the present application can be applied to the application environment shown in FIG. 1, in which the terminal 102 communicates with the server 104 via a network. The server 104 obtains a sample set and trains on it to derive a model for label classification. The server 104 classifies objects with the label classification model and sets classification labels for them according to the classification results. The server may then tag objects with their classification labels and optimize content distribution to the terminal 102 according to those labels, thereby improving user experience.
The server searches objects based on the keywords of the label, and obtains a sample set of the label from the positive samples that contain the keywords and the negative samples that do not; selects K training sets from the sample set and respectively trains the initial classification models to obtain K classification models; predicts each sample in the sample set with the K classification models, and obtains, from the K prediction results output for each sample, a classification result of whether the sample belongs to the label; and updates the sample set according to the classification result of each sample, takes the K classification models as the initial classification models, and iteratively returns to the step of selecting K training sets from the sample set and respectively training the initial classification models to obtain K classification models, until the iteration stop condition is met, to obtain the sample set of the label. The terminal 102 may be, but is not limited to, a personal computer, notebook computer, smartphone, tablet computer or portable wearable device, and the server 104 may be implemented as a stand-alone server or as a server cluster composed of multiple servers.
In one embodiment, as shown in fig. 2, a method for obtaining a sample set is provided, and the method is applied to the server in fig. 1 for illustration, and includes the following steps:
step 202, searching the object based on the keywords of the label, and obtaining a sample set of the label according to the positive sample with the keywords and the negative sample without the keywords.
Tags are used to describe the attributes of things; the same thing has different attributes when viewed from different angles, so one thing may carry multiple tags. For example, consider an article whose content is a news report about an incident involving a basketball star. From the perspective of article category, the article is news, so a "news" tag may be added to it. From the perspective of content, it involves a figure who is a well-known basketball player, so a "basketball" tag may be added to it. The tag name is determined in advance by data mining, and the present application uses the tags so determined to build a training set for machine learning.
Keywords are words that describe the characteristics of the tag's content. For each tag, keywords are set according to the characteristics of the content the tag corresponds to. That is, keywords are directional and representative: they characterize the content of the tag, correspond to it, and point to it. In general, if an object carries a certain tag, it usually contains a keyword corresponding to that tag. For the tag named "basketball", for example, the corresponding keywords typically include basketball organizations such as "NBA" and "CBA" and the names of basketball stars such as "Kobe". For the tag "fashion", the keywords typically include fashion brands such as "Chanel" and "Dior" as well as the names of fashion figures.
The object is the target of tag classification and, depending on the actual application scenario, may be a text object or a non-text object; non-text objects include images, videos, audio and other forms. It will be appreciated that, because the search is a keyword search, whatever the form of the object, the method of the present application is premised on the object having a textual description. That is, for non-text objects, the method can be used to construct a training set provided a textual description exists, for example entities in an image have been recognized, audio has been transcribed into text, or a summary of a video has been extracted.
Searching among a large number of objects based on the keywords, an object found to contain a keyword is taken as a positive sample, and an object that does not contain a keyword is taken as a negative sample. Taking articles as an example, articles containing the keywords are found among a large number of articles and taken as positive samples, and articles not containing the keywords are taken as negative samples. Taking images as an example, a textual description composed of entity tags has been set for each image in advance according to the entities it contains; for instance, for a photograph of a basketball game, a textual description is provided in advance from entity tags such as "NBA", "Kobe" and "Lakers" of the entities in the photograph. Searching among a large number of photographs based on the keywords, photographs whose textual descriptions contain the keywords are taken as positive samples, and those whose descriptions do not contain the keywords are taken as negative samples. The sample set of the label is then obtained from the positive samples and the negative samples.
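As an illustration of this step, the following Python sketch builds the preliminary positive and negative samples by keyword search over object texts (or the textual descriptions of non-text objects). The function name and data layout are assumptions for illustration, not part of the patent.

# Hypothetical sketch of step 202: keyword search over texts or textual descriptions.
def build_preliminary_sample_set(objects: dict, keywords: list):
    # objects maps an object id to its text; returns (positive ids, negative ids).
    positives, negatives = [], []
    for obj_id, text in objects.items():
        # An object containing any keyword of the label is a positive sample;
        # otherwise it is a negative sample.
        if any(kw in text for kw in keywords):
            positives.append(obj_id)
        else:
            negatives.append(obj_id)
    return positives, negatives

# Example with keywords that might be chosen for a "basketball" label.
pos, neg = build_preliminary_sample_set(
    {"a1": "NBA finals recap ...", "a2": "stock market update ..."},
    keywords=["NBA", "CBA"],
)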
Step 204, selecting K training sets from the sample set, and respectively training the initial classification models to obtain K classification models.
K training sets are selected from the sample set, and each training set is used to train the initial classification model, yielding K different classification models. To facilitate the subsequent voting, K is set to an odd number. The neural network used as the initial classification model can be chosen flexibly according to the category of the objects in the actual business scenario. If the objects are texts, the initial classification model may be a convolutional neural network (CNN), a long short-term memory network (LSTM), a fastText model (a text classifier open-sourced by Facebook in 2016), and so on. If the objects are images, the initial classification model may be a convolutional neural network (CNN).
Specifically, the process of training the initial classification model with the K training sets is similar to model training in other machine learning settings: according to the prediction result output by the initial classification model, the difference between the prediction result and the annotation is propagated backwards and the model parameters are adjusted, yielding the K classification models corresponding to the K training sets.
The K training sets may be obtained by randomly drawing a certain number of samples, covering different ranges, from the sample set; alternatively, using a K-fold cross-validation approach, the sample set may be divided into K equal parts, and different combinations of K-1 parts may be taken as the samples of the K training sets.
The sample set comprises positive samples and negative samples, and the attribute of each sample (positive or negative) serves as its annotation. The process of training the K classification models can therefore be regarded as training binary classifiers, yielding classification models that judge whether a sample belongs to the label.
Step 206, predicting each sample in the sample set by using the K classification models respectively, and obtaining a classification result of whether each sample belongs to the label according to the K prediction results of each sample output by the K classification models.
For the K different, preliminarily trained classification models, every sample in the sample set is used as validation data and predicted with each of the K models, yielding K prediction results per sample. That is, the K models each predict every sample in the sample set, each classification model outputs a prediction result for each sample, and the K models together output K prediction results for each sample.
Because the K classification models judge whether a sample belongs to the label, the prediction result output by a classification model takes one of two values: 1, indicating that the sample belongs to the label, and 0, indicating that the sample does not belong to the label.
Specifically, predicting each sample in the sample set with the K classification models and obtaining, according to the K prediction results output for each sample, a classification result of whether the sample belongs to the label comprises the following steps: predicting each sample in the sample set with the K classification models to obtain the K prediction results output for each sample; voting on the classification of the sample according to the K prediction results; and obtaining the classification result of whether the sample belongs to the label from the prediction result with the most votes.
Voting is carried out over the K prediction results of each sample, and the prediction result with the most votes is taken as the classification result of whether the sample belongs to the label. For example, if more than half of the K prediction results of the K classification models indicate that a sample belongs to the label, the sample is determined to belong to the label; if more than half indicate that it does not, the sample is determined not to belong to the label. The same method is used to obtain, for every sample in the sample set, the classification result of whether it belongs to the label.
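A minimal sketch of this voting step, assuming each classification model outputs 1 (the sample belongs to the label) or 0 (it does not); with an odd K a strict majority always exists. The function name is illustrative only.

from collections import Counter

def vote(predictions: list) -> int:
    # Fuse K binary predictions (0/1) for one sample by majority vote.
    counts = Counter(predictions)
    # With K odd, one of the two classes always receives more than half the votes.
    return counts.most_common(1)[0][0]

# Example: five classification models, three of which predict "belongs to the label".
assert vote([1, 0, 1, 1, 0]) == 1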
In this embodiment, the K preliminarily trained classification models predict each sample of the sample set, and whether the sample belongs to the label is determined by voting over the K prediction results. Compared with determining the classification result from a single classification model, this fuses the results of multiple classifiers and improves classification accuracy.
Step 208, updating the sample set according to the classification result of each sample in the sample set, taking the K classification models as the initial classification models, and iteratively returning to the step of selecting K training sets from the sample set and respectively training the initial classification models to obtain K classification models, until the iteration stop condition is met, to obtain the sample set of the label.
Specifically, the sample set is updated according to the classification results determined from the prediction results output by the K models for each sample: samples determined by the K classification models to belong to the label are added to the positive samples, and samples determined not to belong to the label are added to the negative samples, yielding the updated sample set. The K classification models are updated to be the initial classification models, and the process returns to the step of selecting K training sets from the sample set and respectively training the initial classification models to obtain K classification models, until the iteration stop condition is met, yielding the sample set of the label.
The iteration stop condition includes the prediction results for the samples in the sample set becoming stable, or the prediction accuracy reaching a set value. When the iteration stop condition is reached, the most recently determined sample set is taken as the sample set of the label.
It will be appreciated that the resulting sample set of the label includes positive samples and negative samples. The positive and negative samples are determined by voting over the prediction results of the K classification models and therefore have high accuracy.
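The overall iteration of step 208 can be outlined as the loop below. The helper functions passed in (splitting, training, predicting, voting, updating, and the stop test) stand for the operations described above and are assumptions for illustration, not APIs defined by the patent.

def build_label_sample_set(samples, initial_models, split_k_training_sets,
                           train, predict, vote, update_sample_set,
                           stop_condition_met, max_rounds=20):
    # Outline of step 208: retrain, vote, and update until the stop condition holds.
    models = initial_models
    for _ in range(max_rounds):
        training_sets = split_k_training_sets(samples)               # step 204
        models = [train(m, ts) for m, ts in zip(models, training_sets)]
        results = {s: vote([predict(m, s) for m in models])          # step 206
                   for s in samples}
        if stop_condition_met(samples, results):                     # accuracy or stability
            break
        samples = update_sample_set(samples, results)                # step 208
    return samples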
According to the above sample set acquisition method, a search is performed with the keywords of the label and a sample set is preliminarily determined from the search results; K training sets are selected from the sample set to train the initial classification model, yielding K classification models; the K classification models then predict the samples of the sample set, and whether each sample belongs to the label is determined from the K classification results. Because the results of multiple classifiers are fused, classification accuracy is improved. The sample set is then updated according to the classification results for iterative training, and when training ends, the sample set of the label is obtained. The method determines the sample set of the label, and the label of every sample in it, by keyword search and model training, requires no manual annotation, and thus improves the efficiency of sample set acquisition.
In another embodiment, selecting K training sets from the sample set and respectively training the initial classification models to obtain K classification models comprises: randomly dividing the sample set into K sample subsets, selecting different combinations of K-1 sample subsets to form K training sets, and respectively training the initial classification model on the K training sets to obtain K classification models.
Specifically, the sample set is randomly divided into K sample subsets; different combinations of K-1 sample subsets form the K training sets, and for each training set the remaining sample subset serves as the validation set. Because each training set is formed from a different combination of K-1 sample subsets, with the remaining subset as its validation set, the validation sets together cover all samples of the sample set. Training the initial classification model on the K training sets yields K classification models. Taking K=5 as an example, the sample set is randomly divided into 5 parts, and different combinations of 4 parts are selected to train the initial classification model, yielding 5 classification models. For example, an initial sample set of 100 samples is divided into 5 sample subsets of 20 samples each, denoted A, B, C, D and E. The sample subsets and validation set assigned to each classification model are shown in Table 1.
Table 1 Sample subset allocation table

Classification model        | Training sample subsets | Validation set
First classification model  | A, B, C, D              | E
Second classification model | A, C, D, E              | B
Third classification model  | A, B, D, E              | C
Fourth classification model | A, B, C, E              | D
Fifth classification model  | B, C, D, E              | A
The initial classification model is trained using sample subsets A, B, C and D to obtain the first classification model, with sample subset E as the validation set. The initial classification model is trained using sample subsets A, C, D and E to obtain the second classification model, with sample subset B as the validation set. The initial classification model is trained using sample subsets A, B, D and E to obtain the third classification model, with sample subset C as the validation set. The initial classification model is trained using sample subsets A, B, C and E to obtain the fourth classification model, with sample subset D as the validation set. The initial classification model is trained using sample subsets B, C, D and E to obtain the fifth classification model, with sample subset A as the validation set. In this way, sample subsets A, B, C, D and E all serve as validation sets.
In this embodiment, the sample set is randomly divided into K sample subsets and different combinations of K-1 subsets form the K training sets, so that all samples are fully used; the K training sets are then used to train the initial classification model, yielding K classification models. In this way the data set is fully exploited for model training even when the sample size is insufficient, the effect of the algorithm is tested, and the sample set is then updated according to the test results.
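The subset assignment of Table 1 can be reproduced with a split like the one sketched below; the function and the fixed random seed are assumptions for illustration.

import random

def k_fold_training_sets(samples: list, k: int = 5, seed: int = 0):
    # Randomly split the samples into k subsets; for each fold, the other k-1
    # subsets form one training set and the held-out subset is the validation
    # set, so every sample serves as a validation sample exactly once.
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    subsets = [shuffled[i::k] for i in range(k)]
    folds = []
    for i in range(k):
        validation = subsets[i]
        training = [s for j, subset in enumerate(subsets) if j != i for s in subset]
        folds.append((training, validation))
    return folds

# With 100 samples and k=5 this yields five folds of 80 training samples and
# 20 validation samples each, matching the allocation in Table 1.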
The training process of each classification model is similar to model training in other machine learning settings: the model parameters are adjusted according to the difference between the prediction result output by the initial classification model and the positive or negative sample annotation, yielding the K classification models corresponding to the K training sets. For example, a sample set of 100 samples is divided into 5 training sets, and the initial classification model is trained on each of them, yielding five classification models. A positive sample is annotated 1, indicating that it belongs to the label; a negative sample is annotated 0, indicating that it does not. If the prediction result for a positive sample is 0 (not belonging to the label), there is a difference between the prediction and the sample annotation, and the model parameters are adjusted by back-propagating this difference.
The K training sets selected from the sample set are independent of one another, so the K independent classification models can be trained in parallel. The model structure of the initial classification model can be chosen flexibly according to the category of the objects in the actual business scenario. If the objects are texts, the initial classification model may be a convolutional neural network (CNN), a long short-term memory network (LSTM), a fastText model (a text classifier open-sourced by Facebook in 2016), and so on. If the objects are images, the initial classification model may be a convolutional neural network (CNN).
The following description takes text objects as an example, with the classification model adopting the fastText model structure.
Training the initial classification model on the K training sets to obtain K classification models comprises: converting the words of each sample in the K training sets into N-gram bag-of-words vectors; converting the K training sets into word pairs of center word and context word according to the word order and the N-gram bag-of-words vectors; and inputting the word pairs of the K training sets into a single-hidden-layer neural network with a skip-gram structure for training, to obtain K classification models.
Specifically, each sample in the training set has already been pre-processed, for example by word segmentation, and the words of each sample are converted into N-gram bag-of-words vectors, where each word is represented as a bag of character n-grams. A vector representation is associated with each character n-gram, and the representation of a word is the sum of these representations; for example, a word of several characters can be decomposed into its character n-grams, and its word vector initialized as the sum of the corresponding one-hot vectors. The training corpus is then converted into word pairs of the form "center word-context word", which are input as training samples into the neural network shown in FIG. 3; the word vector and n-gram vectors of each word are finally trained, and all word vectors of a text are summed and averaged to obtain the vector of the whole text. The process of obtaining the text vector is shown in FIG. 4.
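A small sketch of the character n-gram decomposition and of averaging word vectors into a text vector, as described above; the boundary markers, n-gram length and vector dimensionality are illustrative assumptions.

import numpy as np

def char_ngrams(word: str, n: int = 3) -> list:
    # Decompose a word into its character n-grams, preserving character order.
    padded = "<" + word + ">"          # assumed boundary markers
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def text_vector(words: list, ngram_vectors: dict, dim: int = 100) -> np.ndarray:
    # A word vector is the sum of its n-gram vectors; the text vector is the
    # average of the word vectors, as in the fastText representation above.
    word_vecs = []
    for w in words:
        vec = sum((ngram_vectors.get(g, np.zeros(dim)) for g in char_ngrams(w)),
                  np.zeros(dim))
        word_vecs.append(vec)
    return np.mean(word_vecs, axis=0) if word_vecs else np.zeros(dim)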
The single-hidden-layer neural network with the skip-gram structure may adopt the fastText model structure. The fastText model uses a single-hidden-layer neural network with a skip-gram structure; it has few parameters and trains quickly, and the prediction results of several independent fastText models are highly robust, so it is well suited to repeated iteration over the training set and improves the sample quality of the training set.
In another embodiment, searching the object based on the keywords of the label and obtaining the sample set of the label from the positive samples with the keywords and the negative samples without the keywords comprises: acquiring keywords related to the label; searching the object according to the keywords to obtain positive samples with the keywords and negative samples without the keywords; and extracting positive samples and negative samples in a preset proportion to obtain the sample set of the label.
The training set comprises positive samples and negative samples, and the proportion of positive to negative samples can be set according to the proportion of objects carrying the label in the actual business scenario. For example, if articles under the basketball label account for a proportion of 1:100 of what users read, the ratio of the number of positive samples to the number of negative samples in the training set may be set to 1:99. In practical applications, if the ratio of positive to negative samples is very unbalanced, the share of positive samples can be increased appropriately, for example setting the ratio of positive to negative samples in the training set to 10:90.
In practical applications, during the keyword search, objects containing the keywords form a preliminary positive sample set, and objects without the keywords form a preliminary negative sample set. Samples are drawn in proportion from the preliminary positive and negative sample sets, merged, and shuffled to obtain the sample set of the label.
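A sketch of drawing positive and negative samples in a preset proportion, merging and shuffling them; the sample-set size and the 10:90 ratio follow the example above, and the function itself is an assumed illustration.

import random

def draw_sample_set(preliminary_pos, preliminary_neg, total=10000,
                    pos_ratio=0.10, seed=0):
    # Draw positive and negative samples in a preset proportion (e.g. 10:90),
    # merge them, and shuffle the order to obtain the sample set of the label.
    rng = random.Random(seed)
    n_pos = int(total * pos_ratio)
    n_neg = total - n_pos
    pos = rng.sample(preliminary_pos, min(n_pos, len(preliminary_pos)))
    neg = rng.sample(preliminary_neg, min(n_neg, len(preliminary_neg)))
    merged = [(p, 1) for p in pos] + [(n, 0) for n in neg]   # 1 = positive, 0 = negative
    rng.shuffle(merged)
    return merged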
Specifically, searching the object according to the keywords to obtain a positive sample with the keywords and a negative sample without the keywords, including:
The text of a text object, or the textual description of a non-text object, is searched according to the keywords to obtain positive samples with the keywords and negative samples without them. In the present application the object of tag classification may be a text object or a non-text object: text objects can be labeled and a label classification model trained for them, and non-text objects can likewise be labeled and have a label classification model trained for them. That is, the objects to which the method of the present application applies include both text objects and non-text objects. It is worth noting that, because the sample set is determined by keyword search, the method is premised on the object having a textual description.
For a non-text object, the keyword lookup may be performed on the textual description of the object; based on the textual descriptions of non-text objects, the sample set of the label may be determined and a label classification model for non-text objects may be trained from the perspective of the textual description. Acquiring the textual description of a non-text object comprises: according to the type of the non-text object, invoking a recognition model to recognize the object and obtain its textual description. Different types of non-text objects correspond to different recognition models. For example, for videos and images, a convolutional neural network is used to identify the entities in the video or image and entity tags are set for it to obtain the textual description; for audio, the textual description is obtained by recognizing the audio content with a speech recognition model.
By adopting the method, a sample set of various types of files can be constructed, and a basis is provided for constructing a label classification model for various types of files.
In another embodiment, as shown in fig. 5, the method for obtaining a sample set includes:
step 502, searching the object based on the keywords of the label, and obtaining a sample set of the label according to the positive sample with the keywords and the negative sample without the keywords.
In this embodiment, the keywords of the tag are manually selected, so that the sample set determined according to the keyword search is the result of manual intervention. On the other hand, high quality sample sets can be collected from unlabeled data sets based on keywords determined by human intervention.
Step 504, selecting K training sets from the sample set, and respectively training the initial classification models to obtain K classification models.
Step 506, predicting each sample in the sample set by using the K classification models respectively, and obtaining a classification result of whether each sample belongs to the label according to the K prediction results of each sample output by the K classification models.
And step 508, determining the prediction accuracy according to the classification result of each sample in the sample set.
The prediction accuracy is the ratio of the number of correctly predicted samples in the sample set to the total number of samples. A prediction is correct when the classification result of whether a sample belongs to the label, determined by voting over the results of the K classification models, is the same as the sample's annotation; if it differs from the annotation, the prediction is wrong. A sample's annotation is either positive or negative: a positive annotation indicates that the sample belongs to the label, and a negative annotation indicates that it does not. For example, if a sample is annotated as a positive sample, indicating that it belongs to the label, but the result determined by voting over the K classification models is that it does not belong to the label, the prediction is wrong; if the voted result is that it belongs to the label, the prediction is correct.
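The prediction accuracy described here can be computed as in the sketch below, assuming both the annotation and the voted classification result of each sample are encoded as 0/1.

def prediction_accuracy(annotations: list, voted_results: list) -> float:
    # Fraction of samples whose voted classification result equals their annotation.
    correct = sum(1 for a, v in zip(annotations, voted_results) if a == v)
    return correct / len(annotations)

# Example: three of four samples predicted correctly gives an accuracy of 0.75.
assert prediction_accuracy([1, 0, 1, 1], [1, 0, 0, 1]) == 0.75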
Step 510, determining whether the prediction accuracy reaches the set value. If not, step 512 is executed; if so, step 514 is executed.
The prediction accuracy reaching the set value is the judging condition for stopping the iteration and is the training target. The accuracy set value may be, for example, 99%.
In other embodiments, the iteration stop condition may be that the prediction results over the sample set are stable, that is, the prediction accuracies of successive rounds of iterative training differ only slightly; the allowed difference can be set according to the required accuracy.
Step 512, updating the sample set according to the classification result of each sample in the sample set, and taking the K classification models as initial classification models. After step 512, return to step 504, iterate training until the predictive accuracy reaches the set point.
Specifically, according to the classification results determined by the K classification models, samples determined to belong to the label are added to the positive samples and samples determined not to belong to the label are added to the negative samples, yielding the updated training set. The K classification models are updated to be the initial classification models, and the process returns to the step of selecting K training sets from the sample set and respectively training the initial classification models to obtain K classification models, until the iteration stop condition is met.
Step 514, a sample set of labels is obtained.
The iteration between the data set and the model is shown in FIG. 6: the quality of the data set and the actual performance of the model promote each other and finally reach an equilibrium. Notably, when the model is used to optimize the data set, erroneous predictions can be corrected with a small amount of manual intervention; whether the few mispredicted samples need manual correction can be judged from the model's actual prediction performance.
The application further provides an application scene, and the application scene applies the sample set acquisition method. Specifically, the application of the sample set acquisition method in the application scene is as follows:
(1) A number of keywords related to the label are identified; articles containing the keywords are found among a large number of articles and taken as the preliminary positive sample set, and articles not containing the keywords are taken as the preliminary negative sample set.
(2) An appropriate number of positive and negative samples are extracted from the preliminary positive and negative sample sets, merged, and shuffled to obtain the sample set.
(3) Five-fold cross-validation is performed on the sample set using the fastText model shown in FIG. 3. Specifically, the sample set is randomly divided into five parts; four parts are selected in turn as the training set, with the remaining part as the validation set, so that five training sets are constructed and five independent fastText models are trained (a rough sketch of this step with the open-source fastText package is given after step (6) below);
Specifically, each sample in the training set has already been pre-processed, for example by word segmentation, and the words of each sample are converted into N-gram bag-of-words vectors, where each word is represented as a bag of character n-grams. A vector representation is associated with each character n-gram, and the representation of a word is the sum of these representations; for example, a word of several characters can be decomposed into its character n-grams, and its word vector initialized as the sum of the corresponding one-hot vectors. The training corpus is then converted into word pairs of the form "center word-context word", which are input as training samples into the neural network shown in FIG. 3; the word vector and n-gram vectors of each word are finally trained, and all word vectors of a text are summed and averaged to obtain the vector of the whole text. The process of obtaining the text vector is shown in FIG. 4.
The single-hidden-layer neural network with the skip-gram structure may adopt the fastText model structure. The fastText model uses a single-hidden-layer neural network with a skip-gram structure; it has few parameters and trains quickly, and the prediction results of several independent fastText models are highly robust, so it is well suited to repeated iteration over the training set and improves the sample quality of the training set.
(4) Each sample in the whole sample set is predicted with the five independent models, so each sample obtains five prediction results indicating whether it is an article under the concept label; voting over these prediction results yields a unique prediction result for the sample (whether it belongs to the concept label);
(5) The positive and negative samples in the training set are readjusted according to the unique prediction result of each sample: samples predicted not to belong to the label are added to the negative sample set, and samples predicted to belong to the label are added to the positive sample set;
(6) Step (3) is executed again and the iteration is repeated until the prediction results for the training samples are stable or the prediction accuracy on the sample set approaches 100%.
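For reference, the five independent models of step (3) could be trained roughly as follows with the open-source fastText Python package. The file names, label format and hyperparameters are assumptions; the embodiment only requires that five independent fastText models be trained on the five training sets.

import fasttext  # open-source fastText package, assumed to be installed

# Each fold's training set is assumed to be written in fastText's supervised
# format, one sample per line: "__label__1 <segmented text>" for positive
# samples and "__label__0 <segmented text>" for negative samples.
models = []
for fold in range(5):
    model = fasttext.train_supervised(
        input="train_fold_%d.txt" % fold,   # hypothetical file names
        epoch=5, lr=0.5, wordNgrams=2,
    )
    models.append(model)

# Every sample is then predicted by all five models and the results are voted.
labels, probs = models[0].predict("some segmented text")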
With the above sample set acquisition method, a training set with high annotation quality can be collected quickly from an unlabeled data set with little manual intervention; large-scale annotation of training samples is not required, and training on a large volume of samples can be completed in a short time. The iteration between the data set and the model can be applied effectively and promptly to the construction of a label system in a recommendation system. For new concept labels in the data set, a classification model can likewise be trained quickly and in time, which significantly helps and promotes the downstream recommendation system.
It should be understood that, although the steps in the flowcharts of FIG. 2 and FIG. 5 are shown in the order indicated by the arrows, they are not necessarily executed in that order. Unless explicitly stated herein, the order of execution of the steps is not strictly limited, and the steps may be executed in other orders. Moreover, at least some of the steps in FIG. 2 and FIG. 5 may include sub-steps or stages that are not necessarily executed at the same moment but may be executed at different moments, and these sub-steps or stages are not necessarily executed in sequence but may be executed in turn or alternately with at least part of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 7, there is provided a sample set obtaining apparatus, which may use a software module or a hardware module, or a combination of both, as a part of a computer device, and specifically includes:
the searching and obtaining module 702 is configured to search the object based on the keyword of the tag, and obtain a sample set of the tag according to the positive sample with the keyword and the negative sample without the keyword;
The training module 704 is configured to select K training sets from the sample sets, and train the initial classification models respectively to obtain K classification models;
the prediction module 706 is configured to predict each sample in the sample set by using K classification models, and obtain a classification result of whether each sample belongs to a label according to K prediction results of each sample output by the K classification models;
and the iteration module 708 is configured to update the sample set according to the classification result of each sample in the sample set, take the K classification models as the initial classification models, and iteratively return to the step of selecting K training sets from the sample set and respectively training the initial classification models to obtain K classification models, until the iteration stop condition is met, to obtain the sample set of the label.
According to the above sample set acquisition apparatus, a search is performed with the keywords of the label and a sample set is preliminarily determined from the search results; K training sets are selected from the sample set to train the initial classification model, yielding K classification models; the K classification models then predict the samples of the sample set, and whether each sample belongs to the label is determined from the K classification results. Because the results of multiple classifiers are fused, classification accuracy is improved, and the sample set is then updated according to the classification results for iterative training. The apparatus determines the sample set of the label, and the label of every sample in it, by keyword search and model training, requires no manual annotation, and thus improves the efficiency of sample set acquisition.
In one embodiment, the training module comprises:
The training set acquisition module is used for randomly dividing the sample set into K sample subsets, selecting different K-1 sample subsets to respectively form K training sets, and correspondingly taking the remaining sample subset as a verification set; the entire validation set includes each sample in the sample set.
The classification model training module is used for training the initial classification models according to the K training sets respectively to obtain K classification models.
In another embodiment, the lookup acquisition module includes:
and the keyword acquisition module is used for acquiring keywords related to the labels.
And the searching module is used for searching the object according to the keywords to obtain a positive sample with the keywords and a negative sample without the keywords.
The sample set acquisition module is used for extracting positive samples and negative samples according to a preset proportion to obtain a sample set of the label.
In another embodiment, the searching module is configured to search the text description or the text object of the non-text object according to the keyword, so as to obtain a positive sample with the keyword and a negative sample without the keyword.
In another embodiment, the classification model training module is configured to convert words of each sample in the K training sets into N-element model word bag vectors respectively; according to the word order and the N-element model word bag vector, converting the K training sets into word pairs of central words and context words; and training the words of the K training sets respectively input into a single hidden layer neural network of the skip model structure to obtain K classification models.
In another embodiment, a prediction module includes:
and the prediction result acquisition module is used for predicting each sample in the sample set by using the K classification models to obtain K prediction results of each sample output by the K classification models.
And the voting module is used for voting the classification of the samples according to the K prediction results.
And the classification module is used for obtaining a classification result of whether the sample belongs to the label according to the prediction result with the highest voting.
In another embodiment, the sample set acquisition device further includes:
the prediction accuracy rate acquisition module is used for: and determining the prediction accuracy according to the classification result of each sample in the sample set. Wherein the iteration stop condition includes: the prediction accuracy reaches the set value.
For specific limitations on the means for obtaining the sample set, reference may be made to the above limitations on the method for obtaining the sample set, and no further description is given here. The respective modules in the sample set acquisition device described above may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be a server, and the internal structure of which may be as shown in fig. 8. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer device is for storing acquired data of the sample set. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a method of acquiring a sample set.
It will be appreciated by those skilled in the art that the structure shown in fig. 8 is merely a block diagram of some of the structures associated with the present application and is not limiting of the computer device to which the present application may be applied, and that a particular computer device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
In an embodiment, there is also provided a computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the method embodiments described above when the computer program is executed.
In one embodiment, a computer-readable storage medium is provided, storing a computer program which, when executed by a processor, implements the steps of the method embodiments described above.
Those skilled in the art will appreciate that all or part of the flows of the methods described above may be implemented by a computer program stored on a non-transitory computer-readable storage medium which, when executed, may include the flows of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, or the like. The volatile memory may include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered to fall within the scope of this description.
The above embodiments merely represent several implementations of the present application and are described in relative detail, but they should not be construed as limiting the scope of the invention. It should be noted that those skilled in the art could make various modifications and improvements without departing from the concept of the present application, and such modifications and improvements would fall within the protection scope of the present application. Accordingly, the protection scope of the present application shall be determined by the appended claims.

Claims (12)

1. A method of obtaining a sample set, the method comprising:
searching an object based on a keyword of a label, and obtaining a sample set of the label according to a positive sample with the keyword and a negative sample without the keyword;
selecting K training sets from the sample set;
converting the words of each sample in the K training sets into N-gram bag-of-words vectors, wherein the bag corresponding to an N-gram bag-of-words vector comprises a plurality of character strings, the characters contained in each character string are at least a part of the characters contained in the word, and the character order in each character string is consistent with the character order in the word;
converting the K training sets into word pairs of center words and context words according to the word order and the N-gram bag-of-words vectors;
respectively inputting the words of the K training sets into an initial classification model for training to obtain K classification models, wherein the initial classification model adopts a single-hidden-layer neural network with a skip-gram structure;
for the K classification models, taking the samples in the sample set as a verification set, and performing classification prediction on each sample with the K classification models respectively, to obtain K prediction results of each sample output by the K classification models;
voting on the classification of each sample according to its K prediction results;
obtaining a classification result of whether the sample belongs to the label according to the prediction result with the most votes;
according to the respective classification results of the samples in the sample set, adding the samples whose classification result indicates that they belong to the label as positive samples and adding the samples whose classification result indicates that they do not belong to the label as negative samples, to obtain an updated sample set; taking the K classification models as the initial classification models; and iteratively returning to the step of selecting K training sets from the sample set until an iteration stop condition is met, to obtain the sample set of the label.
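Purely as an editorial illustration of the loop recited above (not a definitive implementation of the claim), the sketch below ties the earlier fragments together, reusing the vote helper sketched in the description; train_k_models and predict are hypothetical callables standing in for the training and prediction steps, and the stop threshold is an assumed example of the iteration stop condition.

```python
def acquire_label_sample_set(sample_set, train_k_models, predict, k=5,
                             stop_threshold=0.95, max_rounds=10):
    """sample_set: list of (text, label) pairs; returns the iteratively updated set."""
    models = None
    for _ in range(max_rounds):
        # K training sets -> K classification models (previous models reused as init)
        models = train_k_models(sample_set, k=k, init=models)
        voted = [vote([predict(m, text) for m in models]) for text, _ in sample_set]
        accuracy = sum(1 for (_, label), v in zip(sample_set, voted)
                       if label == v) / len(sample_set)
        # samples voted as belonging to the label become positives, the rest negatives
        sample_set = [(text, v) for (text, _), v in zip(sample_set, voted)]
        if accuracy >= stop_threshold:  # iteration stop condition (assumed form)
            break
    return sample_set
```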
2. The method of claim 1, wherein the selecting K training sets from the sample set comprises:
randomly dividing the sample set into K sample subsets, and selecting different groups of K-1 sample subsets to respectively form the K training sets.
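A small sketch of one plausible reading of this split (leave-one-subset-out, with the subset assignment and the seed as assumptions) is given below for illustration only:

```python
import random

def k_training_sets(sample_set, k=5, seed=0):
    """Randomly divide the sample set into K subsets; each training set is the
    union of a different K-1 of those subsets."""
    shuffled = sample_set[:]
    random.Random(seed).shuffle(shuffled)
    subsets = [shuffled[i::k] for i in range(k)]
    return [[x for j, s in enumerate(subsets) if j != i for x in s] for i in range(k)]
```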
3. The method of claim 1, wherein the searching an object based on a keyword of a label and obtaining a sample set of the label according to a positive sample with the keyword and a negative sample without the keyword comprises:
acquiring keywords related to the label;
searching an object according to the keywords to obtain a positive sample with the keywords and a negative sample without the keywords;
and extracting the positive sample and the negative sample according to a preset proportion to obtain a sample set of the label.
4. The method according to claim 3, wherein the searching an object according to the keywords to obtain a positive sample with the keywords and a negative sample without the keywords comprises:
searching text objects, or the text descriptions of non-text objects, according to the keywords to obtain a positive sample with the keywords and a negative sample without the keywords.
5. The method according to claim 1, wherein the method further comprises: determining the prediction accuracy according to the classification result of each sample in the sample set;
the iteration stop condition comprises: the prediction accuracy reaching a set value.
6. An acquisition device for a sample set, the device comprising:
the searching and acquiring module is used for searching the object based on the keywords of the label and acquiring a sample set of the label according to the positive sample with the keywords and the negative sample without the keywords;
the training module is used for selecting K training sets from the sample set; converting the words of each sample in the K training sets into N-gram bag-of-words vectors; converting the K training sets into word pairs of center words and context words according to the word order and the N-gram bag-of-words vectors; and respectively inputting the words of the K training sets into an initial classification model for training to obtain K classification models; wherein the initial classification model adopts a single-hidden-layer neural network with a skip-gram structure, the bag corresponding to an N-gram bag-of-words vector comprises a plurality of character strings, the characters contained in each character string are at least a part of the characters contained in the word, and the character order in each character string is consistent with the character order in the word;
the prediction result acquisition module is used for taking the samples in the sample set as a verification set and performing classification prediction on each sample with the K classification models, to obtain K prediction results of each sample output by the K classification models;
the voting module is used for voting on the classification of each sample according to its K prediction results;
the classification module is used for obtaining a classification result of whether the sample belongs to the label according to the prediction result with the most votes;
and the iteration module is used for, according to the respective classification results of the samples in the sample set, adding the samples whose classification result indicates that they belong to the label as positive samples and adding the samples whose classification result indicates that they do not belong to the label as negative samples to obtain an updated sample set, taking the K classification models as the initial classification models, and iteratively returning to the step of selecting K training sets from the sample set until an iteration stop condition is met, to obtain the sample set of the label.
7. The apparatus of claim 6, wherein the training module comprises:
the training set acquisition module is used for randomly dividing the sample set into K sample subsets, and selecting different K-1 sample subsets to respectively form K training sets.
8. The apparatus of claim 6, wherein the searching and acquiring module comprises:
the keyword acquisition module is used for acquiring keywords related to the tag;
the searching module is used for searching the object according to the keywords to obtain positive samples with the keywords and negative samples without the keywords;
and the sample set acquisition module is used for extracting the positive sample and the negative sample according to a preset proportion to obtain a sample set of the label.
9. The apparatus of claim 8, wherein the searching module is configured to search text objects, or the text descriptions of non-text objects, according to the keywords to obtain a positive sample with the keywords and a negative sample without the keywords.
10. The apparatus of claim 6, wherein the apparatus further comprises:
the prediction accuracy acquisition module is used for determining the prediction accuracy according to the classification results of the samples in the sample set; wherein the iteration stop condition comprises: the prediction accuracy reaching a set value.
11. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 5 when the computer program is executed.
12. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the method of any one of claims 1 to 5.
CN202010529394.5A 2020-06-11 2020-06-11 Sample set acquisition method, device, computer equipment and storage medium Active CN112819023B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010529394.5A CN112819023B (en) 2020-06-11 2020-06-11 Sample set acquisition method, device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010529394.5A CN112819023B (en) 2020-06-11 2020-06-11 Sample set acquisition method, device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112819023A CN112819023A (en) 2021-05-18
CN112819023B true CN112819023B (en) 2024-02-02

Family

ID=75853154

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010529394.5A Active CN112819023B (en) 2020-06-11 2020-06-11 Sample set acquisition method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112819023B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113408291B (en) * 2021-07-09 2023-06-30 平安国际智慧城市科技股份有限公司 Training method, training device, training equipment and training storage medium for Chinese entity recognition model
CN113656575B (en) * 2021-07-13 2024-02-02 北京搜狗科技发展有限公司 Training data generation method and device, electronic equipment and readable medium
CN113761925B (en) * 2021-07-23 2022-10-28 中国科学院自动化研究所 Named entity identification method, device and equipment based on noise perception mechanism
CN113688239B (en) * 2021-08-20 2024-04-16 平安国际智慧城市科技股份有限公司 Text classification method and device under small sample, electronic equipment and storage medium
CN113780466B (en) * 2021-09-27 2024-02-02 重庆紫光华山智安科技有限公司 Model iterative optimization method, device, electronic device and readable storage medium
CN114140670B (en) * 2021-11-25 2024-07-02 支付宝(杭州)信息技术有限公司 Method and device for verifying ownership of model based on exogenous characteristics
CN114169460A (en) * 2021-12-14 2022-03-11 中国工商银行股份有限公司 Sample screening method, sample screening device, computer equipment and storage medium
CN116483945B (en) * 2022-01-14 2024-12-13 腾讯科技(深圳)有限公司 Content type detection method, device, equipment and storage medium
CN114969725A (en) * 2022-04-18 2022-08-30 中移互联网有限公司 Target command identification method and device, electronic equipment and readable storage medium
CN115049077B (en) * 2022-06-06 2025-05-20 北京宾理信息科技有限公司 Machine learning model training method, device and equipment for target tasks
CN117370791A (en) * 2022-06-27 2024-01-09 第四范式(北京)技术有限公司 Sample data processing methods, devices, storage media and systems
CN115242724B (en) * 2022-07-21 2024-05-31 东南大学 A high-speed network traffic service classification method based on two-stage clustering
CN114997535A (en) * 2022-08-01 2022-09-02 联通(四川)产业互联网有限公司 Intelligent analysis method and system platform for big data produced in whole process of intelligent agriculture
CN120047762B (en) * 2023-11-14 2026-01-09 荣耀终端股份有限公司 Training methods, electronic devices, and readable storage media for image classification models

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107229731A (en) * 2017-06-08 2017-10-03 百度在线网络技术(北京)有限公司 Method and apparatus for grouped data
CN107315797A (en) * 2017-06-19 2017-11-03 江西洪都航空工业集团有限责任公司 A kind of Internet news is obtained and text emotion forecasting system
CN107229614A (en) * 2017-06-29 2017-10-03 百度在线网络技术(北京)有限公司 Method and apparatus for grouped data
US10402691B1 (en) * 2018-10-04 2019-09-03 Capital One Services, Llc Adjusting training set combination based on classification accuracy
CN109598292A (en) * 2018-11-23 2019-04-09 华南理工大学 A kind of transfer learning method of the positive negative ratio of difference aid sample
CN110263166A (en) * 2019-06-18 2019-09-20 北京海致星图科技有限公司 Public sentiment file classification method based on deep learning
CN110717039A (en) * 2019-09-17 2020-01-21 平安科技(深圳)有限公司 Text classification method and device, electronic equipment and computer-readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Yu Xiaogao. 《基于大数据的高风险学生预测研究》 [Research on High-Risk Student Prediction Based on Big Data]. Xiamen University Press, 2019, pp. 124-126. *

Also Published As

Publication number Publication date
CN112819023A (en) 2021-05-18


Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code: Ref country code: HK; Ref legal event code: DE; Ref document number: 40048280; Country of ref document: HK
SE01 Entry into force of request for substantive examination
GR01 Patent grant