CN113254801A - Information processing and training method and device, electronic equipment and computer storage medium - Google Patents

Information processing and training method and device, electronic equipment and computer storage medium

Publication number
CN113254801A
Authority
CN
China
Prior art keywords
user
information
observation sequence
attitude
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110675351.2A
Other languages
Chinese (zh)
Inventor
陈程
王贺
李涵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Zhuoer Digital Media Technology Co ltd
Original Assignee
Wuhan Zhuoer Digital Media Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Zhuoer Digital Media Technology Co ltd filed Critical Wuhan Zhuoer Digital Media Technology Co ltd
Priority to CN202110675351.2A priority Critical patent/CN113254801A/en
Publication of CN113254801A publication Critical patent/CN113254801A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9536Search customisation based on social or collaborative filtering
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/40Business processes related to social networking or social networking services

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Business, Economics & Management (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Human Resources & Organizations (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Economics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure provides an information processing method, a training method, corresponding devices, an electronic device, and a storage medium. The information processing method uses a user's own ability to recognize the authenticity of information (the recognition degree), together with the user's comment data on the information, to obtain the user's observation sequence for the information; the observation sequence is input into a trained machine learning model to obtain a credibility probability value of the observation sequence, from which a credibility score is obtained. The training method trains the machine learning model used by the information processing method with the recognition degrees of sample users and with samples of observation sequences obtained from the evaluation data of the sample users. By exploiting the ability of social media users themselves to recognize information as false or true when scoring the credibility of information, the disclosed methods can reduce the amount of computation, extract effective information, and improve the accuracy of credibility judgments.

Description

Information processing and training method and device, electronic equipment and computer storage medium
Technical Field
The present disclosure relates to the field of information processing, and in particular, to an information processing and training method and apparatus, an electronic device, and a computer storage medium.
Background
With the development of internet technology, the number of social media users keeps growing, and ever denser information reaches users through social media. This speeds up the spread of information, but it also allows false information to spread widely and cause adverse effects.
Therefore, there is a need for a detection apparatus capable of reducing the amount of calculation and accurately detecting false information.
Disclosure of Invention
The disclosure provides an information processing and training method, an information processing and training device, an electronic device and a storage medium.
According to a first aspect of the present disclosure, there is provided a method of information processing, the method comprising:
obtaining an observation sequence of a user for information according to the recognition degree of the user and the comment data of the user on the information; wherein the recognition degree reflects the user's ability to discern the authenticity of information; the observation sequence comprises observation values of users for the information, each observation value indicating the attitude information of a user toward the information; and the attitude information of the user toward the information includes: negative attitude, neutral attitude and positive attitude;
inputting the observation sequence into a trained machine learning model to obtain a reliability probability value of the observation sequence;
and obtaining the authenticity of the information according to the reliability probability value.
Optionally, the method further comprises:
acquiring attribute data of a user who performs a preset operation behavior on the information; wherein the preset operation behavior indicates at least one of: the user likes, comments on, or forwards the information; the attribute data of the user includes at least one of: the number of fans (followers) of the user, the number of accounts the user follows, and the number of items, among the information published or forwarded by the user, whose authenticity needs to be judged;
acquiring a characteristic data set of the user according to the attribute data of the user;
and inputting the characteristic data set of the user into a classification model to obtain the identification degree of the user.
Optionally, the method further comprises:
obtaining the recognition degree of the user based on the feature data of the user by using a classification model; wherein the classification model is the one, among a plurality of candidate models trained with labels obtained by a clusterer, whose test output agrees best with the clusterer.
Optionally, the obtaining an observation sequence of the information by the user according to the recognition degree of the user and the comment data of the information by the user includes:
obtaining attitude information of the user to the information according to the identification degree of the user and comment data of the user to the information;
and obtaining an observation sequence of the information by the user according to the attitude information of the information by the user.
Optionally, the obtaining the authenticity of the information according to the reliability probability value includes:
obtaining the credibility score according to the credibility probability value;
and if the credibility score meets the condition that the credibility score is larger than a preset threshold value, the information is real information.
Optionally, the machine learning model comprises:
a bidirectional long short-term memory network (BiLSTM) that receives the observation sequence;
a convolutional neural network (CNN) that is connected to the BiLSTM and outputs the reliability probability value; the CNN includes at least: a convolution layer and two pooling layers located after the convolution layer.
Optionally, the two pooling layers comprise:
a max pooling layer;
and an average pooling layer located after the max pooling layer.
According to a second aspect of the present disclosure, there is provided a training method, the method comprising:
acquiring the identification degree of a sample user;
obtaining a sample of an observation sequence according to the evaluation data of the sample user;
training with the samples of the observation sequence and the recognition degree of the sample user to obtain the machine learning model of the information processing method of the first aspect.
Optionally, the machine learning model comprises:
a bidirectional long short-term memory network (BiLSTM) that receives the observation sequence;
a convolutional neural network (CNN) that is connected to the BiLSTM and outputs the reliability probability value; the CNN includes at least: a convolution layer and two pooling layers located after the convolution layer.
Optionally, the two pooling layers comprise:
a max pooling layer;
and an average pooling layer located after the max pooling layer.
Optionally, the method further comprises: grading the users by recognition degree with a clusterer, based on the feature data set of the sample users, wherein users with the same recognition degree grade fall into the same category and share the same label;
testing a plurality of classification models of different types based on the labels and the feature data sets of the users to obtain different test results;
and, based on the goodness of fit between each test result and the k-means clusterer, taking the classification model with the highest goodness of fit as the classification model used to compute the recognition degree.
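The selection step above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: it assumes that "goodness of fit" is measured as the fraction of users for whom a candidate model's predicted grade equals the label produced by the clusterer, and the names `goodness_of_fit`, `select_classifier` and the toy candidate models are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

FeatureVector = Sequence[float]
Classifier = Callable[[FeatureVector], int]


def goodness_of_fit(predicted: Sequence[int], cluster_labels: Sequence[int]) -> float:
    """Fraction of users whose predicted grade equals the cluster-derived label.

    Assumption: plain agreement rate; the source does not define the metric.
    """
    if len(predicted) != len(cluster_labels) or not cluster_labels:
        raise ValueError("label sequences must be non-empty and of equal length")
    matches = sum(p == c for p, c in zip(predicted, cluster_labels))
    return matches / len(cluster_labels)


def select_classifier(
    candidates: Dict[str, Classifier],
    features: List[FeatureVector],
    cluster_labels: Sequence[int],
) -> Tuple[str, Dict[str, float]]:
    """Score every candidate against the cluster labels and return the best one."""
    scores = {
        name: goodness_of_fit([model(x) for x in features], cluster_labels)
        for name, model in candidates.items()
    }
    best = max(scores, key=lambda name: scores[name])
    return best, scores


# Toy data: one feature (fan count) per user and labels assumed to come
# from a k-means clusterer (grades 0, 1, 2).
features = [[10.0], [200.0], [5000.0], [40.0]]
cluster_labels = [0, 1, 2, 0]

# Hypothetical, already-trained candidate models.
candidates = {
    "threshold": lambda x: 0 if x[0] < 100 else (1 if x[0] < 1000 else 2),
    "constant": lambda x: 1,
}

best, scores = select_classifier(candidates, features, cluster_labels)
```

In practice the candidates would be trained classifiers of different types and the agreement would be measured on held-out users; the agreement rate here simply stands in for whatever fit measure is actually used.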
According to a third aspect of the present disclosure, there is provided an apparatus for information processing, the apparatus comprising:
the first determining module is used for obtaining an observation sequence of a user for information according to the recognition degree of the user and the comment data of the user on the information; wherein the recognition degree reflects the user's ability to discern the authenticity of information; the observation sequence comprises observation values of users for the information, each observation value indicating the attitude information of a user toward the information; and the attitude information of the user toward the information includes: negative attitude, neutral attitude and positive attitude;
the second determination module is used for inputting the observation sequence into a trained machine learning model to obtain the reliability probability value of the observation sequence;
and the third determining module is used for obtaining the authenticity of the information according to the reliability probability value.
According to a fourth aspect of the present disclosure, there is provided a training apparatus, the apparatus comprising:
the acquisition module is used for acquiring the identification degree of the sample user;
the fourth determining module is used for obtaining a sample of an observation sequence according to the evaluation data of the sample user;
and a training module, configured to train, with the samples of the observation sequence and the degrees of recognition of the sample users, to obtain the machine learning model in the information processing method provided in the foregoing first aspect.
According to a fifth aspect of the present disclosure, there is provided an electronic apparatus comprising:
a memory;
and a processor, connected to the memory, for executing computer-executable instructions stored in the memory so as to perform the steps of the information processing method of the first aspect or the training method of the second aspect.
According to a sixth aspect of the present disclosure, there is provided a storage medium storing computer-executable instructions; the computer-executable instructions, when executed by a processor, are capable of performing the steps of the information processing method of the first aspect or the training method of the second aspect.
The technical solution provided by the embodiments of the disclosure can have the following beneficial effects. In the information processing method, an observation sequence of a user for information is obtained from the user's recognition degree regarding the authenticity of information and the user's comment data on the information; the observation sequence is input into a trained machine learning model to obtain a reliability probability value of the observation sequence, from which a credibility score is obtained. Existing technical solutions extract feature data of the information itself from social media, where the volume of information is exploding, which consumes a large amount of computation and yields low accuracy. By contrast, the embodiments of the disclosure obtain the credibility score of the information from the ability of social media users to recognize false information, which can save computation, extract effective information, and improve the accuracy of the credibility judgment.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
FIG. 1 is a flow diagram illustrating an information processing method in accordance with an illustrative embodiment;
FIG. 2 is a flow diagram illustrating an information processing method in accordance with an illustrative embodiment;
FIG. 3 is a flow diagram illustrating an information processing method in accordance with an illustrative embodiment;
FIG. 4 is a flow diagram illustrating an information processing method in accordance with an illustrative embodiment;
FIG. 5 is a diagram illustrating the structure of a machine learning model in accordance with an exemplary embodiment;
FIG. 6 is a flow diagram illustrating a training method in accordance with an exemplary embodiment;
FIG. 7 is a flowchart illustrating a training method in accordance with an exemplary embodiment;
FIG. 8 is a schematic structural diagram of an information processing apparatus according to an exemplary embodiment;
FIG. 9 is a schematic structural diagram of a training device according to an exemplary embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with embodiments of the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the disclosed embodiments, as detailed in the attached application documents.
The embodiment of the present disclosure provides an information processing method, which is shown in fig. 1 and includes:
step S101, obtaining an observation sequence of a user for information according to the recognition degree of the user and the comment data of the user on the information; wherein the recognition degree reflects the user's ability to discern the authenticity of information; the observation sequence comprises observation values of users for the information, each observation value indicating the attitude information of a user toward the information; and the attitude information of the user toward the information includes: negative attitude, neutral attitude and positive attitude;
step S102, inputting the observation sequence into a trained machine learning model to obtain a reliability probability value of the observation sequence;
and S103, obtaining the authenticity of the information according to the reliability probability value.
In the embodiment of the present disclosure, the information includes but is not limited to: blog information, picture information, video information, alone or in combination.
In the embodiment of the present disclosure, the user is an active user of an electronic information dissemination medium, who may like, favorite, forward, or comment on the information, alone or in combination. Because users mostly read the information before acting on it, a user can judge the authenticity of the information while reading and then, having made that judgment, perform a preset operation behavior on it.
In the embodiment of the present disclosure, the electronic information dissemination medium may be semi-public social media that still has a certain propagation capability, such as WeChat and QQ; or fully public social media with a high propagation capability, such as Weibo, Xiaohongshu, Douyin, Instagram, Facebook, etc.
In the embodiment of the present disclosure, the recognition degree is the user's ability to discern information, for example the ability to judge the authenticity or reliability of information, after reading it, from the knowledge, professional experience or logic the user possesses. For example, a user in the medical profession can judge the authenticity of medicine-related information from his or her medical knowledge, medical experience and medical logic; if the reliability probability value is obtained using this user's recognition degree for the information, the authenticity of the information can be determined more accurately. Of course, the recognition degree is not limited to the medical profession; it may also be the recognition degree with which a user in another profession, such as history or law, judges the authenticity of information within that profession.
In the embodiment of the present disclosure, in step S101, an observation sequence of a user for information is obtained according to the recognition degree of the user and the comment data of the user on the information. In one embodiment, this comprises the following steps: feature values are extracted from the personal data of user K_t and from the post corresponding to information F; the trained classification model is then used to classify user K_t, yielding the value of the recognition degree x_t; once the value of x_t has been obtained, the attitude of active user K_t can be obtained from the user's comments using existing opinion mining methods.
In some embodiments, the opinion mining method may use natural language processing and emotion classifier to determine the attitudes of the user about opinions, such as: positive attitude, negative attitude, or neutral attitude, etc.
In the embodiment of the present disclosure, regarding the recognition degree of the user, the following may be combined to assess comprehensively whether the user's attitude toward the information is negative, neutral, or positive: the number of fans of the user; the number of accounts the user follows; whether the user is a verified user, such as a well-known blogger or an ordinary blogger in a certain professional field; the gender of the user; the number of items the user has published and/or forwarded; the registration time of the user; the number of items, among the posts the user has published and/or forwarded, whose authenticity needs to be checked; and the user's behavior pattern toward the information.
In some embodiments, let y_t denote the observation obtained from the t-th user K_t. The value of y_t is defined separately for three cases: when user K_t holds a positive attitude toward F, when user K_t holds a negative attitude toward F, and when user K_t holds a neutral attitude toward F. The three defining formulas, and a further condition from which it follows that y_t takes integer values, appear only as images in the original publication and are not reproduced here.
The observation sequence of length t generated for information F during propagation is denoted by a symbol that likewise appears only as an image. That is, while being propagated by different users, information F passes through t users, whose observation values for F may be the same or different; the observation sequence is formed by the observation values of these t users for information F, with the users indexed 1, 2, 3, ···, t.
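The construction of an observation sequence can be sketched as follows. The actual observation values are defined by formulas that survive only as images in this text, so the signed encoding used here (the recognition degree x_t for a positive attitude, its negative for a negative attitude, and 0 for a neutral attitude, with x_t a positive integer grade) is purely an assumption chosen to give integer observations; the function names are hypothetical as well.

```python
from typing import List, Tuple

# Assumed encoding, NOT taken from the source: the sign carries the attitude
# and the magnitude carries the user's recognition degree grade.
ATTITUDE_SIGN = {"positive": 1, "neutral": 0, "negative": -1}


def observation(recognition_degree: int, attitude: str) -> int:
    """Return the integer observation y_t for one user under the assumed encoding."""
    if attitude not in ATTITUDE_SIGN:
        raise ValueError(f"unknown attitude: {attitude!r}")
    return ATTITUDE_SIGN[attitude] * recognition_degree


def observation_sequence(events: List[Tuple[int, str]]) -> List[int]:
    """Map (x_t, attitude) pairs, in propagation order, to the sequence of y_t."""
    return [observation(x_t, attitude) for x_t, attitude in events]


# Four users who acted on information F, in the order they propagated it.
events = [(3, "positive"), (1, "negative"), (2, "neutral"), (2, "negative")]
sequence = observation_sequence(events)
print(sequence)  # [3, -1, 0, -2]
```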
In the embodiment of the present disclosure, in step S102, the observation sequence is input into a trained machine learning model to obtain a reliability probability value of the observation sequence. Within an observation sequence, the users' observation values for the information may influence one another. For example, the i-th user's observation value for information F may show that this user holds a positive attitude toward F, and that observation may have been influenced by the attitude information of the (i-1)-th user, whom this user follows. Therefore, in one embodiment, the machine learning model comprises a bidirectional long short-term memory (Bi-LSTM) layer. Because it computes in both directions, the attitude information of the vectors before and after any vector in the sequence can be obtained, so that the influence between users with respect to the information can be taken into account.
In the embodiment of the disclosure, the machine learning model further comprises a convolutional neural network (CNN) layer, which comprises a convolution layer, two pooling layers and an activation function layer. The values of the observation sequence are input sequentially into the Bi-LSTM layer of the machine learning model, the output of which is then input into the CNN layer to obtain the reliability probability value. The activation function layer may be a softmax function layer.
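The CNN part of this structure (convolution, max pooling, average pooling, softmax) can be sketched in plain Python as follows. This is only an illustration under several assumptions: the Bi-LSTM output is taken as given (one hidden-state vector per user), the filters are hand-written rather than trained, exactly two filters are used so that the softmax output can be read as two class probabilities, and all function names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def conv1d(states: Matrix, filters: List[Matrix]) -> Matrix:
    """Valid 1-D convolution over the time axis; returns one row per filter."""
    feature_maps = []
    for kernel in filters:
        width = len(kernel)
        row = []
        for start in range(len(states) - width + 1):
            row.append(sum(
                kernel[i][d] * states[start + i][d]
                for i in range(width)
                for d in range(len(kernel[i]))
            ))
        feature_maps.append(row)
    return feature_maps


def max_pool(row: List[float], size: int = 2) -> List[float]:
    """Non-overlapping max pooling; needs at least `size` values in `row`."""
    return [max(row[i:i + size]) for i in range(0, len(row) - size + 1, size)]


def avg_pool(row: List[float]) -> float:
    """Global average pooling applied after the max pooling layer."""
    return sum(row) / len(row)


def softmax(values: List[float]) -> List[float]:
    peak = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def cnn_head(states: Matrix, filters: List[Matrix]) -> List[float]:
    """Convolution -> max pooling -> average pooling -> softmax."""
    pooled = [avg_pool(max_pool(row)) for row in conv1d(states, filters)]
    return softmax(pooled)


# Assumed Bi-LSTM output: 5 time steps (users) with 2 hidden features each.
states = [[0.2, 0.1], [0.5, 0.4], [0.9, 0.3], [0.4, 0.8], [0.7, 0.6]]
# Two hand-written filters of width 2 (a trained model would learn these).
filters = [[[1.0, 0.0], [0.0, 1.0]], [[-1.0, 0.0], [0.0, -1.0]]]

probabilities = cnn_head(states, filters)
```

A practical implementation would more likely use a deep learning framework and insert a fully connected layer before the softmax; the sketch keeps only the layers the text names.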
In the embodiment of the present disclosure, in step S103, the authenticity of the information is obtained according to the reliability probability value. In one embodiment, the reliability probability value is compared with a preset threshold, set for example between 0.6 and 0.9; if the reliability probability value is higher than the preset threshold, the information is determined to be real information.
In one embodiment, the confidence probability value may be multiplied by 100 to obtain a confidence score, and when the confidence score is greater than a preset threshold, for example, when the preset threshold is set between 60 and 90, if the confidence score is higher than the preset threshold, the information is determined to be real information.
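The score-based decision can be sketched as follows. The threshold of 75 is an arbitrary value picked from the 60 to 90 range mentioned above, and the function names are hypothetical.

```python
def credibility_score(probability: float) -> float:
    """Scale a probability in [0, 1] to a score in [0, 100]."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    return probability * 100.0


def is_real(probability: float, score_threshold: float = 75.0) -> bool:
    """Judge the information real when its score exceeds the preset threshold.

    The default threshold is an assumption within the 60-90 range.
    """
    return credibility_score(probability) > score_threshold


verdict_high = is_real(0.82)  # score of about 82, above the threshold
verdict_low = is_real(0.41)   # score of about 41, below the threshold
```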
The calculation method of the plausibility of the information obtained by the reliability probability value in the embodiment of the present disclosure is not limited to the above-described embodiment.
In the embodiment of the disclosure, users, especially active users, have a basic ability to discern whether information is true or false, and some active users have rich professional knowledge and experience. Computing the recognition degree of a user from attribute data such as the user's number of fans allows users to be graded effectively by recognition degree. For example, a user with many fans, a long registration time and many forwarded items may have both professional knowledge and influence; the higher such a user's recognition degree grade, the stronger the user's ability to tell true information from false. Therefore, using the recognition degree of active users to derive the credibility value of the information and determine its authenticity can save computation, extract effective information, and improve the accuracy of the credibility judgment.
In the embodiment of the present disclosure, as shown in fig. 2, the method further includes:
step S1001, acquiring attribute data of a user who performs a preset operation behavior on the information; wherein the preset operation behavior indicates at least one of: the user likes, comments on, or forwards the information; the attribute data of the user includes at least one of: the number of fans (followers) of the user, the number of accounts the user follows, and the number of items, among the information published or forwarded by the user, whose authenticity needs to be judged;
step S1002, acquiring a characteristic data set of the user according to the attribute data of the user;
step S1003, inputting the characteristic data set of the user into a classification model to obtain the identification degree of the user.
In this embodiment of the present disclosure, in step S1001, the attribute data acquired is that of a user who, during the propagation of the information, performed a positive or a negative preset operation behavior on the information. A positive preset operation behavior includes but is not limited to: liking, favoriting, or commenting positively, alone or combined with forwarding the information. A negative preset operation behavior includes but is not limited to: disliking or commenting negatively, alone or combined with forwarding the information, and reporting the information.
In one embodiment, if user A sees information F but pays no attention to it and takes no preset operation behavior on it, the attribute data of user A does not need to be acquired. If user B sees information F, pays attention to it, and takes either a positive preset operation behavior on it (such as liking, favoriting, or commenting positively, alone or combined with forwarding) or a negative preset operation behavior on it (such as disliking or commenting negatively, alone or combined with forwarding), the attribute data of user B needs to be acquired.
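The rule for deciding whose attribute data to acquire can be sketched as follows. The action names, the `Interaction` record and the function `users_to_profile` are hypothetical; the sketch only assumes that each interaction with information F is logged with a user identifier and an action type.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical action names for the positive and negative preset behaviors.
POSITIVE_ACTIONS = {"like", "favorite", "positive_comment", "forward_with_positive_comment"}
NEGATIVE_ACTIONS = {"dislike", "negative_comment", "forward_with_negative_comment", "report"}


@dataclass(frozen=True)
class Interaction:
    user_id: str
    action: str  # "view" means the user only saw the information


def users_to_profile(interactions: List[Interaction]) -> List[str]:
    """Return, in first-seen order, the users who took a preset operation behavior."""
    preset_actions = POSITIVE_ACTIONS | NEGATIVE_ACTIONS
    selected: List[str] = []
    for interaction in interactions:
        if interaction.action in preset_actions and interaction.user_id not in selected:
            selected.append(interaction.user_id)
    return selected


log = [
    Interaction("A", "view"),              # saw F, took no action: skipped
    Interaction("B", "like"),              # positive preset behavior
    Interaction("B", "report"),            # same user again: listed once
    Interaction("C", "negative_comment"),  # negative preset behavior
]
selected_users = users_to_profile(log)
print(selected_users)  # ['B', 'C']
```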
In the embodiment of the present disclosure, after judging the authenticity of the information, the user may take a preset operation behavior on the information according to that judgment. For false information, the preset operation behavior may be, but is not limited to: behaviors such as liking, commenting on, or forwarding the information, for example commenting on the false information with emoticons or stickers expressing surprise, disbelief, ridicule, and the like; or reporting the false information online, even calling on other related users to report it together; or blocking or downvoting the blogger who produced the false information, and so on.
In the embodiment of the present disclosure, if the user determines that the information is real, an active propagation behavior may be performed, for example, a praise behavior, a positive comment behavior, a comment behavior attached during silently forwarding or forwarding, and the like.
The preset operation behavior of the information taken by the user in the embodiment of the present disclosure is not limited to the above behaviors or forms alone or in combination.
In one embodiment, user C is a well-known blogger in a professional field who has a large number of fans and follows many accounts (for example, other well-known bloggers in many professional fields), and a large share of the information that user C forwards judges the authenticity of information. User D has fewer fans than user C, follows fewer accounts, is not an authenticated user in a professional field, and forwards little information, or few of the items that user D forwards debunk rumors. User C is therefore probably a professional rumor-debunking user and user D an ordinary user, and the identification degree of user C is higher than that of user D.
In this embodiment of the disclosure, in step S1002, a feature data set of the user is obtained according to the attribute data of the user. That is, a feature data set composed of the attribute feature values of each user is extracted from attribute data such as the user's number of fans, the number of accounts the user follows, the amount of information the user has published or forwarded, the amount of that information which judges the authenticity of information, whether the user is an authenticated user, the user's gender, the registration time, and the like. An attribute feature value is an element placed in a feature vector; it represents the category of a piece of attribute data and the corresponding numerical value obtained after the user's attribute data is extracted.
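The extraction in step S1002 can be pictured as turning each user's attribute record into a fixed-order numeric vector. The following is only a minimal sketch: the `UserAttributes` record, its field names, and the ordering of the vector are hypothetical choices made for illustration and are not specified by the disclosure.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class UserAttributes:
    """Hypothetical attribute record; field names are illustrative only."""
    fans: int              # number of fans (followers)
    following: int         # number of accounts the user follows
    posts: int             # items published or forwarded
    judging_posts: int     # items that judge the authenticity of information
    authenticated: bool    # whether the user is an authenticated user
    registered_days: int   # days since registration


def to_feature_vector(user: UserAttributes) -> List[float]:
    """Map the attribute record to a numeric feature vector in a fixed order."""
    return [
        float(user.fans),
        float(user.following),
        float(user.posts),
        float(user.judging_posts),
        1.0 if user.authenticated else 0.0,  # boolean encoded as 1 / 0
        float(user.registered_days),
    ]


example_vector = to_feature_vector(
    UserAttributes(fans=50000, following=120, posts=800,
                   judging_posts=40, authenticated=True, registered_days=2000)
)
```

A fixed ordering matters because the classification model in step S1003 expects every user's vector to carry the same attribute in the same position.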
In the embodiment of the present disclosure, in step S1003, the feature data set of the user is input to the classification model, so as to obtain the recognition degree of the user. The classification model divides a plurality of users into different recognition levels, so that the feature data sets of the users are input into the classification model, and the classification model outputs the recognition levels of the users. The classification model is trained from data of sample users.
In the embodiment of the disclosure, the users who performed any preset operation behavior on the information during its propagation are identified, and the attribute data of those users is obtained. Feature values are extracted from the attribute data; a feature data set may simply label the user's attributes numerically. The feature data set composed of these feature values is then input into the classification model to obtain the identification degree level of the user. Obtaining the identification degree level through the classification model makes it possible to extract useful data from a huge volume of data and saves computation. It also integrates the user's attribute data into a single identification degree that preliminarily quantifies the user's ability to tell false information from real information, which facilitates subsequent calculation and makes preliminary use of users' ability to purify false information in social media.
In the embodiment of the present disclosure, the method further includes:
obtaining the identification degree of the user based on the feature data of the user by using a classification model; wherein the classification model is the one, among a plurality of candidate models trained with labels obtained by a clustering device, whose test results have the highest matching degree with the clustering device.
In the embodiment of the present disclosure, the clustering device clusters a plurality of different users into M different clusters according to the feature values extracted by using the attribute data of the users, so that each cluster corresponds to one recognition level, the recognition levels of active users in the same cluster are the same, and finally the active users in the same cluster are labeled with the same label. Since there are multiple clusters, there are multiple labels obtained using the clusterer.
The attribute data includes, but is not limited to, at least one of:
the gender of the user;
the age of the user;
the user's occupation;
the education level of the user;
user preferences;
the user's usual place of residence.
In the embodiment of the present disclosure, the clustering device divides data based on a clustering algorithm, which may be, but is not limited to, the k-means clustering algorithm, the mean shift clustering algorithm, density-based clustering (DBSCAN), expectation-maximization (EM) clustering with a Gaussian mixture model (GMM), agglomerative hierarchical clustering, graph community detection, or another existing clustering algorithm.
In the embodiment of the present disclosure, the plurality of candidate models are candidate models for classifying the identification degree of the user. The labels obtained by the clustering device and the user feature data sets corresponding to the labels are input into a plurality of different candidate models to obtain the test results of each candidate model, and the candidate model whose test results best match the clustering device is taken as the classification model.
In the embodiment of the present disclosure, the candidate models include, but are not limited to: naive Bayes (NB), decision trees (DT), support vector machines (SVM), and other common classification models.
In one embodiment, the labels obtained based on the clustering algorithm and the user data marked by the labels are used for training and testing classifiers based on different classification algorithms, and then the best classifier is selected as the classification model according to the matching degree of the test result and the clustering device.
In one embodiment, feature values are extracted from the personal attribute data of user K_t and from the data corresponding to information F, and the trained best classifier is then used to classify user K_t to obtain the value of x_t. In the embodiment of the present disclosure, a feature value is an element a_j placed in a feature vector, where j = 1, 2, 3, ..., i; it represents the category of a piece of the user's attribute data and the corresponding numerical value obtained after the attribute data is extracted. For example, a_1 represents the user's number of fans, and a_1 = 50000 indicates that the user has 50000 fans; a_2 represents whether the user is an authenticated user, where a_2 = 1 indicates that the user is an authenticated user and a_2 = 0 indicates that the user is not; a_3 represents the user's age, and a_3 = 50 indicates that the user is 50 years old; a_4 represents the user's preference, quantized into five levels with a higher level indicating a stronger preference, so a_4 = 4 indicates that the user's preference is at the fourth level. In the embodiment of the present disclosure, the attribute data of the user is obtained by quantizing data related to the user's attributes into specific values, and a feature value represents a category of the user's attribute data together with its value.
In the embodiment of the disclosure, the label to which each user belongs is obtained through clustering, and the labels and the corresponding user feature data are input into the candidate models. According to the test results, the candidate model with the highest matching degree with the clustering device is selected as the best classification model for classifying the identification degree of users. When the feature data set of a user is input into the trained classification model, the identification degree of the user is obtained, so the identification degree can be calculated conveniently and the amount of computation is reduced.
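The selection step described above can be sketched as follows. This is an illustration under stated assumptions, not the disclosed implementation: the cluster labels are taken as already produced by a clustering device, the two toy candidates (`train_nearest_centroid`, `train_majority`) merely stand in for models such as NB, DT, or SVM, and "matching degree" is assumed to mean the fraction of held-out users whose predicted label equals the cluster label.

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]
Classifier = Callable[[Vector], int]


def train_nearest_centroid(X: List[Vector], y: List[int]) -> Classifier:
    """Toy candidate: predict the label whose mean feature vector is closest."""
    sums: Dict[int, List[float]] = {}
    counts = Counter(y)
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
    centroids = {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}

    def predict(x: Vector) -> int:
        return min(
            centroids,
            key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, centroids[lab])),
        )

    return predict


def train_majority(X: List[Vector], y: List[int]) -> Classifier:
    """Toy candidate: always predict the most frequent training label."""
    most_common = Counter(y).most_common(1)[0][0]
    return lambda x: most_common


def matching_degree(clf: Classifier, X: List[Vector], cluster_labels: List[int]) -> float:
    """Fraction of users whose predicted label equals the cluster label (assumed metric)."""
    hits = sum(1 for x, lab in zip(X, cluster_labels) if clf(x) == lab)
    return hits / len(X)


def select_classifier(candidates, X_train, y_train, X_test, y_test):
    """Train every candidate on cluster labels and keep the best-matching one."""
    models = {name: trainer(X_train, y_train) for name, trainer in candidates.items()}
    scores = {name: matching_degree(m, X_test, y_test) for name, m in models.items()}
    best = max(scores, key=scores.get)
    return best, models[best], scores


# Hypothetical one-dimensional feature data with labels from a clustering device.
X_train = [[1.0], [1.2], [5.0], [5.3], [9.0], [9.4]]
y_train = [0, 0, 1, 1, 2, 2]
X_test = [[0.9], [5.1], [9.2]]
y_test = [0, 1, 2]

best_name, best_model, scores = select_classifier(
    {"nearest_centroid": train_nearest_centroid, "majority": train_majority},
    X_train, y_train, X_test, y_test,
)
```

Once selected, `best_model` can be applied to the feature vector of any new user to obtain an identification degree level without re-running the clustering.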
In the embodiment of the present disclosure, as shown in fig. 3, the step S101 includes:
step S1011, obtaining attitude information of the user to the information according to the identification degree of the user and the comment data of the user to the information;
step S1012, obtaining an observation sequence of the information by the user according to the attitude information of the information by the user.
In the embodiment of the disclosure, different users may hold different attitudes toward the same information. The attitude information of each user yields a corresponding observation value, and the observation values of the plurality of users, taken in order, form the observation sequence of the information.
In one embodiment, let y_t denote the observation value produced by the t-th user K_t. The observation value is defined separately for three cases: user K_t holds a positive attitude toward information F, user K_t holds a negative attitude toward F, and user K_t holds a neutral attitude toward F. [The three case formulas and the accompanying range condition appear only as images in the source and are not reproduced here.] In every case, y_t takes integer values.
Let (y_1, y_2, ..., y_t) denote the observation sequence of length t generated by information F during propagation. That is, information F passes through t users while being propagated, and the observation values of different users for information F may be the same or different. The observation sequence consists of the observation values of these t users for information F, where the user index runs over 1, 2, 3, ..., t.
In the embodiment of the disclosure, the attitude information of the user is represented by an observation value based on the identification degree of the user and the attitude contained in the comment data, and an observation sequence formed by the observation values of a plurality of users is finally obtained. The higher the identification degree of the user, the more reliable the user's comment data on the information, and the higher the corresponding observation value. The attitude information is calculated from the identification degree of the user and the user's comment data on the information; the observation values differ according to whether the user holds a positive, negative, or neutral attitude toward the information. In this way the user's data is measured comprehensively, and a more comprehensive and accurate observation sequence is obtained.
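The construction of the observation sequence can be sketched as below. Because the formulas for the three attitude cases survive only as images, the mapping used here is an assumption chosen for illustration: a positive attitude yields +x_t, a negative attitude yields -x_t, and a neutral attitude yields 0, where x_t is the integer identification degree level of user K_t. The function names are hypothetical.

```python
from typing import List, Tuple


def observation(recognition_level: int, attitude: str) -> int:
    """Return the integer observation value y_t for one user.

    Assumed convention (not confirmed by the source text):
    positive -> +x_t, negative -> -x_t, neutral -> 0.
    """
    if attitude == "positive":
        return recognition_level
    if attitude == "negative":
        return -recognition_level
    if attitude == "neutral":
        return 0
    raise ValueError(f"unknown attitude: {attitude!r}")


def observation_sequence(users: List[Tuple[int, str]]) -> List[int]:
    """Build the observation sequence in propagation order.

    Each entry of `users` is (identification degree level, attitude).
    """
    return [observation(level, attitude) for level, attitude in users]


# Three users who saw information F, in the order the information reached them.
sequence = observation_sequence([(3, "positive"), (1, "negative"), (2, "neutral")])
```

Under this convention, a user with a higher identification degree contributes an observation of larger magnitude, which matches the statement that more reliable users weigh more heavily.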
In the embodiment of the present disclosure, as shown in fig. 4, the step S103 includes:
step S1031, obtaining the credibility score according to the credibility probability value;
step S1032, if the credibility score is greater than a preset threshold, the information is real information.
In the embodiment of the present disclosure, in step S1031, the credibility score is calculated by multiplying the credibility probability value by a preset full score. The credibility probability value lies in the usual range of 0 to 1 and may be written as a decimal, a fraction, or a percentage, for example 0.5, 1/2, or 50%.
In some embodiments, the preset full score may be set to, but is not limited to, 1 point, 10 points, 100 points, and so on. In one embodiment, when the preset full score is 1 point, the credibility score equals the credibility probability value. In one embodiment, when the preset full score is 100 points, the credibility score is the product of the credibility probability value and 100; for example, when the credibility probability value is 0.56, the credibility score is 56. The embodiments of the present disclosure are not limited to the above settings of the preset full score or to the above formula for the credibility score.
In this embodiment of the disclosure, in step S1032, if the credibility score is greater than a preset threshold, the information is real information. For example, when the preset full score is 100 points, the preset threshold is set within 100 points, for example 60 points or 70 points. In one embodiment, the preset full score is 100 points, the preset threshold is 60 points, and the credibility probability value of the information is 0.75; the credibility score is then 75 points, which is greater than the preset threshold of 60 points, so the information is real information.
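Steps S1031 and S1032 amount to a scaling followed by a strict comparison. The sketch below uses hypothetical function names and takes the default full score of 100 and threshold of 60 from the example in the text; other values are equally permitted by the disclosure.

```python
def credibility_score(probability: float, full_score: float = 100.0) -> float:
    """Step S1031: scale the credibility probability value by the preset full score."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("credibility probability value must lie in [0, 1]")
    return probability * full_score


def is_real_information(probability: float,
                        full_score: float = 100.0,
                        threshold: float = 60.0) -> bool:
    """Step S1032: the information is real if the score is greater than the threshold."""
    return credibility_score(probability, full_score) > threshold


# Example from the text: probability 0.75, full score 100, threshold 60.
verdict = is_real_information(0.75)
```

The comparison is strict, so a score exactly equal to the threshold is not classified as real information in this sketch.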
In the embodiment of the disclosure, whether the information is false is judged by calculating the credibility score and comparing it with a preset threshold, so that false information is found early and its propagation rate and negative influence among users in social media are reduced. For example, false reports of terrorist events in false news may harm the mental health of users; false medical information may promote fake hospitals or fake treatments, and a user who believes it, or even undergoes the treatment, may suffer physical harm; and false donation information and donation channels may damage users' interests and exploit their sympathy. Detecting the authenticity of information and cleaning up the social media environment therefore reduces the probability that users' mental health, physical health, or related interests are harmed.
In the embodiment of the present disclosure, as shown in fig. 5, the machine learning model includes:
a bidirectional long short-term memory network (BiLSTM) that receives the observation sequence;
a convolutional neural network (CNN) that is connected to the BiLSTM and outputs the credibility probability value; the CNN includes at least a convolution layer and two pooling layers located after the convolution layer.
In the embodiment of the present disclosure, after an observation sequence composed of observation values is obtained, the machine learning model receives the observation sequence. Specifically, the machine learning model includes a BiLSTM that receives the observation sequence and a CNN connected after the BiLSTM.
In the embodiment of the disclosure, the BiLSTM is a bidirectional form of the long short-term memory network (LSTM). An LSTM generally processes the observation values from left to right, so later observation values weigh more heavily than earlier ones; the BiLSTM also processes the sequence from right to left.
In the embodiment of the disclosure, the comments of the t users on the information are related to one another. For example, user t may trust information F published or forwarded by user t-1, forward information F with a comment whose attitude is similar to that of user t-1, or take an operation behavior similar to that of user t-1. When the observation sequence is input into the BiLSTM layer, the information of the sequences to the left and to the right of each observation value is obtained, and the BiLSTM layer finally outputs a feature matrix. The specific calculation formulas are as follows:
[Equation 1.1: formula image not reproduced]
[Equation 1.2: formula image not reproduced]
In equations 1.1 and 1.2, the symbols (shown only as images in the source) denote, in order: the feature vector of the observation values to the left of the i-th observation value; a matrix that transforms the current hidden layer into the next hidden layer; a matrix that carries the information connecting the left observation values to the current one; the feature vector of the left observation values of the previous observation value; and the feature vector of the current observation value. Similarly, equation 1.2 gives the feature vector of the observation values to the right of the current observation value.
In the embodiment of the present disclosure, the representation of the current observation value is obtained from equations 1.1 and 1.2 as follows:
[Equation 1.3: formula image not reproduced]
In the embodiment of the present disclosure, the feature matrix output by the BiLSTM layer is input into the CNN. As shown in fig. 5, the feature matrix is first convolved by the convolution layer and then pooled by the two pooling layers to obtain the feature vector matrix.
In the embodiment of the disclosure, the convolutional neural network further includes a fully connected layer and a softmax layer connected after the fully connected layer. The feature vector matrix obtained through the convolution layer and the two pooling layers is input into the fully connected layer and then into the softmax layer. The fully connected layer multiplies the weight matrix by the input feature vector matrix and adds a bias term, mapping n real numbers in (-∞, +∞) to k real numbers (scores) in (-∞, +∞). The softmax layer maps these k real numbers to k real numbers (probabilities) in (0, 1) while ensuring that the k probabilities sum to 1. The specific calculation formula is as follows:
[Equation 1.4: formula image not reproduced]
In equation 1.4, y represents the probability value, which lies in the range 0 to 1; the weight matrix has the same dimension as the sequence feature matrix Z; the bias term is added to their product; and Z represents the sequence feature matrix.
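The fully connected and softmax stages described above can be sketched in a few lines. The names `weights`, `bias`, and `z`, the two-class output, and the numeric values are assumptions for illustration; equation 1.4 itself is not reproduced in the text, so this shows only the generic computation the paragraph describes (weight matrix times input, plus bias, followed by softmax).

```python
import math
from typing import List


def fully_connected(weights: List[List[float]], bias: List[float],
                    z: List[float]) -> List[float]:
    """Map n input features to k scores: one weighted sum plus bias per output."""
    return [sum(w * v for w, v in zip(row, z)) + b for row, b in zip(weights, bias)]


def softmax(scores: List[float]) -> List[float]:
    """Map k real-valued scores to k probabilities in (0, 1) that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical 3-feature input mapped to 2 classes (for example, real vs. false).
z = [0.8, -0.2, 0.5]
weights = [[1.0, 0.5, -0.5],
           [-1.0, 0.2, 0.3]]
bias = [0.1, -0.1]

probabilities = softmax(fully_connected(weights, bias, z))
```

With a two-class output, either component of `probabilities` can serve as the credibility probability value, depending on which class is taken to mean "real".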
In the embodiment of the disclosure, the machine learning model extracts feature vectors from the observation sequence for calculation and finally outputs a probability value related to the observation values in the observation sequence. This is especially useful for the huge volume of data in social media, where the machine learning model can efficiently calculate the probability value of the recognition degree.
In the embodiment of the present disclosure, as shown in fig. 5, the two pooling layers include:
a maximum pooling layer;
and the average pooling layer is positioned at the rear end of the maximum pooling layer.
In the embodiment of the present disclosure, the max pooling layer (Max-pooling) is calculated as follows:
[Equation 1.5: formula image not reproduced]
The average pooling layer (Avg-pooling) is calculated as follows:
[Equation 1.6: formula image not reproduced]
The outputs of the max pooling layer and the average pooling layer are spliced as follows:
[Equation 1.7: formula image not reproduced]
In equations 1.5 to 1.7, the symbols (shown only as images in the source) denote, in order, the output of max pooling, the output of average pooling, the splicing (concatenation) operation, and the sliding window size. Splicing max pooling with average pooling overcomes the information loss that arises because each max pooling operation keeps only the maximum value. Fusing the features from the two pooling operations preserves the integrity of the sequence feature information and yields more comprehensive, deeper features.
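The splicing of max pooling and average pooling can be illustrated on a one-dimensional feature sequence. This sketch assumes non-overlapping windows (stride equal to the window size) and applies both pooling operations to the same convolution output before concatenating them; neither detail is stated in the text, and the function names are hypothetical.

```python
from typing import List


def max_pool(features: List[float], window: int) -> List[float]:
    """Keep the maximum of each non-overlapping window."""
    return [max(features[i:i + window])
            for i in range(0, len(features) - window + 1, window)]


def avg_pool(features: List[float], window: int) -> List[float]:
    """Keep the mean of each non-overlapping window."""
    return [sum(features[i:i + window]) / window
            for i in range(0, len(features) - window + 1, window)]


def spliced_pool(features: List[float], window: int) -> List[float]:
    """Concatenate the max-pooled and average-pooled outputs."""
    return max_pool(features, window) + avg_pool(features, window)


conv_output = [1.0, 3.0, 2.0, 5.0, 4.0, 0.0]
fused = spliced_pool(conv_output, window=2)
```

In the window `[4.0, 0.0]`, max pooling alone would keep 4.0 and discard the fact that the other value is 0.0; the average 2.0 retains some of that information, which is the motivation the text gives for fusing the two.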
In the embodiment of the present disclosure, as shown in fig. 5 and fig. 6, a training method is provided, where the method includes:
step S201, obtaining the identification degree of a sample user;
step S202, obtaining a sample of an observation sequence according to the evaluation data of the sample user;
step S203, training to obtain a machine learning model in the information processing method according to the sample of the observation sequence and the identification degree of the sample user.
In the embodiment of the present disclosure, the sample users in step S201 are different from the users in the information processing method. Accordingly, the samples of the observation sequence obtained from the evaluation data of the sample users in step S202 are different from the observation sequence in the information processing method.
In the embodiment of the disclosure, the attribute data and evaluation data of the sample users are acquired from a large amount of popular real information collected on social media, the data is preprocessed, and the identification degree of the sample users is then obtained.
In the embodiment of the disclosure, a user data set generated by a large amount of popular real information is collected and processed. The identification degree of each user is obtained by using the identification degree calculation method of the information processing method together with an existing social media opinion mining method; observation values are then constructed from the users' identification degrees and attitudes; and finally a large number of observation sequences generated by the popular real information are obtained and used as sample data for training the machine learning model.
In the embodiment of the disclosure, a machine learning model trained on a large number of observation sequences can effectively process a small number of extracted observation sequences and detect whether information is real or false.
In the embodiment of the present disclosure, as shown in fig. 5, the machine learning model includes:
a bidirectional long short-term memory network (BiLSTM) that receives the observation sequence;
a convolutional neural network (CNN) that is connected to the BiLSTM and outputs the credibility probability value; the CNN includes at least a convolution layer and two pooling layers located after the convolution layer.
In the embodiment of the present disclosure, after an observation sequence composed of observation values is obtained, the machine learning model receives the observation sequence. Specifically, the machine learning model includes a BiLSTM that receives the observation sequence and a CNN connected after the BiLSTM.
In the embodiment of the disclosure, the BiLSTM is a bidirectional form of the long short-term memory network (LSTM). An LSTM generally processes the observation values from left to right, so later observation values weigh more heavily than earlier ones; the BiLSTM also processes the sequence from right to left.
In the embodiment of the disclosure, the comments of the t users on the information are related to one another. For example, user t may trust information F published or forwarded by user t-1, forward information F with a comment whose attitude is similar to that of user t-1, or take an operation behavior similar to that of user t-1. When the observation sequence is input into the BiLSTM layer, the information of the sequences to the left and to the right of each observation value is obtained, and the BiLSTM layer finally outputs a feature matrix. See equations 1.1 and 1.2 of the above embodiment for the specific calculation formulas.
In the embodiment of the present disclosure, the feature matrix output by the BiLSTM layer is input into the CNN. As shown in fig. 5, the feature matrix is first convolved by the convolution layer and then enters the two pooling layers for feature screening to obtain the feature vector matrix.
In the embodiment of the disclosure, the convolutional neural network further includes a fully connected layer and a softmax layer connected after the fully connected layer, and the credibility probability value is finally output.
In the embodiment of the present disclosure, the structure of the machine learning model is the same as that in the embodiment of the information processing method, and the calculation principle corresponding to the structure is also the same.
In the embodiment of the present disclosure, as shown in fig. 5, the two pooling layers include:
a maximum pooling layer;
and the average pooling layer is positioned at the rear end of the maximum pooling layer.
In the embodiment of the present disclosure, the max pooling layer is calculated as follows:
[Equation 1.5: formula image not reproduced]
The average pooling layer is calculated as follows:
[Equation 1.6: formula image not reproduced]
The outputs of the max pooling layer and the average pooling layer are spliced as follows:
[Equation 1.7: formula image not reproduced]
In equations 1.5 to 1.7, the symbols (shown only as images in the source) denote, in order, the output of max pooling, the output of average pooling, the splicing (concatenation) operation, and the sliding window size. Splicing max pooling with average pooling overcomes the information loss that arises because each max pooling operation keeps only the maximum value. Fusing the features from the two pooling operations preserves the integrity of the sequence feature information and yields more comprehensive, deeper features.
In the embodiment of the present disclosure, as shown in fig. 7, the training method further includes:
step S204, based on the characteristic data set of the sample user, using a clustering device to classify the users according to the identification degrees, wherein the users with the same identification degree grade belong to the same label under the same category;
step S205, testing a plurality of different types of classification models based on the labels and the characteristic data sets of the users to obtain different test results;
step S206, based on the matching degree between the test results and the k-means clustering device, taking the classification model with the highest matching degree as the classification model for classifying the identification degree.
In the embodiment of the present disclosure, a clustering device clusters a plurality of different users into M different clusters, so that each cluster corresponds to one recognition level, the recognition levels of active users in the same cluster are the same, and finally, the active users in the same cluster are labeled with the same label. Since there are multiple clusters, there are multiple labels obtained using the clusterer.
In the embodiment of the present disclosure, the clustering device divides data based on a clustering algorithm, which may be, but is not limited to, the k-means clustering algorithm, the mean shift clustering algorithm, density-based clustering (DBSCAN), expectation-maximization (EM) clustering with a Gaussian mixture model (GMM), agglomerative hierarchical clustering, graph community detection, and the like.
In the embodiment of the present disclosure, the plurality of candidate models are candidate models for classifying the identification degree of the user. The labels obtained by the clustering device and the user feature data sets corresponding to the labels are input into a plurality of different candidate models to obtain the test results of each candidate model, and the candidate model whose test results best match the clustering device is taken as the classification model.
In the embodiment of the present disclosure, the candidate models include, but are not limited to: naive Bayes (NB), decision trees (DT), support vector machines (SVM), and other common classification models.
In one embodiment, the labels obtained based on the clustering algorithm and the user data marked by the labels are used for training and testing classifiers based on different classification algorithms, and then the best classifier is selected as the classification model according to the matching degree of the test result and the clustering device.
In one embodiment, feature values are extracted from the personal attribute data of user K_t and from the data corresponding to information F, and the trained best classifier is then used to classify user K_t to obtain the value of x_t.
In the embodiment of the disclosure, the label to which each user belongs is obtained through clustering, and the labels and the corresponding user feature data are input into the candidate models. According to the test results, the candidate model with the highest matching degree with the clustering device is selected as the best classification model for classifying the identification degree of users. When the feature data set of a user is input into the trained classification model, the identification degree of the user is obtained, so the identification degree can be calculated conveniently and the amount of computation is reduced.
An example of the information processing method and the training method is provided below in combination with the above embodiments:
Example 1: an information processing method and a training method.
A microblog is a social media platform with many users, broad reach, and rapid information dissemination. To reduce the influence of false information on the public and to let microblogs play a more positive role in people's work and daily life, the invention provides a method for detecting false information on microblogs.
Microblogs have a self-purifying effect: information with obvious errors does not spread widely. It is therefore sufficient to run false information detection on the popular information on a microblog, and if the credibility of microblog information can be evaluated in real time, false information on the microblog can be found early.
Basic concepts and models:
(1) Active users: active users are users who can judge the authenticity of a piece of information. Active users on a microblog fall into 3 types: users who only comment on the information without forwarding it, users who only forward the information without commenting on it, and users who add a comment when forwarding the information. Users who neither comment on nor forward the information are not counted as active users.
(2) Information recognition degree: an active user's ability to recognize whether a given piece of information is true is called the recognition degree; the higher the recognition degree, the more accurately the user can identify false information. The recognition degree of an active user is measured by the number of followers of the active user, the number of users the active user follows, whether the active user is a verified user, the gender of the active user, the number of messages posted or forwarded by the active user, the registration time of the active user, and the number of times the active user's labels appear in topic-related information.
(3) Attitude of active users: during information dissemination, each active user's attitude toward the information differs. The attitude, which is positive, negative or neutral, can be determined from the user's comment. For users who only forward the information without commenting on it, the attitude is taken to be neutral. For active users who comment, whether the attitude is positive or negative can be extracted by calling an existing semantic analysis product interface. An active user's attitude toward the information is referred to simply as the attitude of the active user.
(4) Recognition level of active users: the invention uses a classification algorithm to determine the recognition degree level of an active user automatically. Assume that the recognition degree of active users can be divided into M levels. When a piece of information F is posted on a microblog, let x_t denote the recognition degree of the t-th active user K_t for the information F. The recognition degree x_t of K_t is obtained automatically with a classification algorithm as follows:
(a) collect data on the active users that appear during the propagation of a large amount of real popular information on a microblog, and extract the feature values of each active user, namely the number of followers of the active user, the number of users the active user follows, whether the active user is a verified user, the gender of the active user, the number of messages posted or forwarded by the active user, the registration time of the active user, and the number of times the active user's labels appear in topic-related information, where the last feature refers to the number of messages, among those posted or forwarded by the active user, whose authenticity needs to be judged;
(b) based on these feature values, cluster the active users of step (a) into M different clusters using the k-means clustering algorithm, let each cluster correspond to one recognition degree level so that active users in the same cluster have the same level, and finally mark the active users in the same cluster with the same class label;
(c) train and test classifiers based on different classification algorithms using the labeled data of step (b), and then select the best classifier according to the test results;
(d) extract feature values from the personal data of active user K_t and from the post corresponding to the information F, and then classify K_t with the trained best classifier to obtain the value of x_t;
(e) after the value of x_t is obtained, the attitude of active user K_t can be obtained from the user's comments using existing microblog opinion mining methods.
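Steps (b) to (d) can be sketched in a few lines. The following is a minimal illustration only, not the patent's implementation: the two-dimensional feature vectors, the deterministic k-means initialisation, and the two toy candidates `fit_centroid` and `fit_1nn` (standing in for NB, DT or SVM) are all assumptions made so the example runs with the standard library alone.

```python
def dist2(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, m, iters=20):
    """Step (b): cluster users into m groups; returns one cluster index per user.
    Assumption: centers start at evenly spaced ranks of the sorted points."""
    ordered = sorted(points)
    centers = [ordered[i * (len(ordered) - 1) // max(m - 1, 1)] for i in range(m)]
    for _ in range(iters):
        labels = [min(range(m), key=lambda c: dist2(p, centers[c])) for p in points]
        for c in range(m):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels

def fit_centroid(X, y):
    """Toy candidate 1 (hypothetical): nearest class centroid."""
    groups = {}
    for p, lab in zip(X, y):
        groups.setdefault(lab, []).append(p)
    cents = {lab: tuple(sum(col) / len(ps) for col in zip(*ps))
             for lab, ps in groups.items()}
    return lambda p: min(cents, key=lambda lab: dist2(p, cents[lab]))

def fit_1nn(X, y):
    """Toy candidate 2 (hypothetical): label of the single nearest training user."""
    train = list(zip(X, y))
    return lambda p: min(train, key=lambda item: dist2(p, item[0]))[1]

def select_best(candidates, X_train, y_train, X_test, y_test):
    """Step (c): keep the candidate whose test predictions agree most
    with the cluster labels. Returns (name, predict_fn, agreement)."""
    scored = []
    for name, fit in candidates.items():
        predict = fit(X_train, y_train)
        agree = sum(predict(p) == lab for p, lab in zip(X_test, y_test)) / len(X_test)
        scored.append((agree, name, predict))
    best = max(scored, key=lambda s: s[0])
    return best[1], best[2], best[0]

# Made-up feature vectors, e.g. (followers in thousands, posts in hundreds).
users = [(1.0, 2.0), (1.2, 1.8), (0.8, 2.2), (1.1, 2.1),
         (9.0, 8.0), (9.2, 8.1), (8.8, 7.9), (9.1, 8.3)]
labels = kmeans(users, 2)
name, predict, agreement = select_best(
    {"nearest_centroid": fit_centroid, "one_nn": fit_1nn},
    users[0::2], labels[0::2], users[1::2], labels[1::2])
# Step (d): classify a new active user K_t to get its cluster index.
print(name, agreement, predict((8.5, 7.5)))
```

The cluster index returned by `predict` plays the role of the class label from step (b); which index corresponds to which of the M recognition degree levels is not fixed by k-means itself and would have to be assigned separately.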
(5) Observation sequence: let y_t denote the observation produced by the t-th user K_t. When user K_t takes a positive attitude toward F, the observation is [Figure 543468DEST_PATH_IMAGE001]; when user K_t takes a negative attitude toward F, it is [Figure 928050DEST_PATH_IMAGE002]; when user K_t takes a neutral attitude toward F, it is [Figure 568110DEST_PATH_IMAGE003], where [Figure 772564DEST_PATH_IMAGE004]. Therefore y_t takes integer values. Let [Figure 280906DEST_PATH_IMAGE005] denote the observation sequence of length t that the information F generates during propagation. That is, the information F passes through t users as it is propagated, and the observations of F may be the same or different from user to user; [Figure 755881DEST_PATH_IMAGE005] is the observation sequence formed by the observations of F from these t users, where t = 1, 2, 3, .... [Figure 268639DEST_PATH_IMAGE030] can also be understood as the concatenated vector of the attitude values of multiple users toward the information.
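The construction of an observation sequence can be illustrated as follows. The formula images that define the observation values are not reproduced in this text, so the signed encoding used here (plus the level for a positive attitude, minus the level for a negative attitude, zero for a neutral attitude) is an assumption chosen only because it yields integer observations that depend on both the recognition degree level and the attitude. The names `Attitude`, `attitude_of`, `observation` and `observation_sequence` are hypothetical.

```python
from enum import Enum
from typing import List, Optional, Tuple

class Attitude(Enum):
    NEGATIVE = -1
    NEUTRAL = 0
    POSITIVE = 1

def attitude_of(comment_polarity: Optional[str]) -> Attitude:
    """Map a comment polarity to an attitude.
    None means the user only forwarded the information, which the text treats as neutral."""
    if comment_polarity is None:
        return Attitude.NEUTRAL
    return {"positive": Attitude.POSITIVE,
            "negative": Attitude.NEGATIVE,
            "neutral": Attitude.NEUTRAL}[comment_polarity]

def observation(level: int, attitude: Attitude) -> int:
    """ASSUMED encoding of y_t: sign from the attitude, magnitude from the level x_t."""
    return attitude.value * level

def observation_sequence(users: List[Tuple[int, Optional[str]]]) -> List[int]:
    """Build (y_1, ..., y_t) from (recognition level, comment polarity) pairs,
    listed in the order in which the users propagated the information."""
    return [observation(level, attitude_of(polarity)) for level, polarity in users]

# Four users in propagation order: levels 3, 2, 4, 1.
print(observation_sequence([(3, "positive"), (2, "negative"), (4, None), (1, "neutral")]))
```

In a real pipeline the polarity strings would come from the opinion mining step mentioned in (e), and the levels from the classifier in (d).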
(6) Detection model: the invention provides a false information detection model that combines Bi-LSTM and CNN. The model is shown in fig. 5. First, a Bi-LSTM network receives the observation sequence; because it computes in both directions, the Bi-LSTM can capture information from the sequences to the left and to the right of the target position. The calculation formulas are as follows:
[Figure 524171DEST_PATH_IMAGE007] (equation 1.8)
[Figure 166952DEST_PATH_IMAGE009] (equation 1.9)
In equations 1.8 and 1.9, [Figure 647349DEST_PATH_IMAGE010] is the feature vector of the observations to the left of the i-th observation; [Figure 593440DEST_PATH_IMAGE011] is the matrix that transforms the current hidden layer into the next hidden layer; the matrix [Figure 020748DEST_PATH_IMAGE012] carries the information that connects the current observation with the observations to its left; [Figure 470315DEST_PATH_IMAGE013] is the feature vector of the left-hand observations of the previous observation; and [Figure 324876DEST_PATH_IMAGE014] is the feature vector of the current observation. Similarly, equation 1.9 gives the feature vector [Figure 146519DEST_PATH_IMAGE016] of the observations to the right of [Figure 414055DEST_PATH_IMAGE015].
In the embodiment of the present disclosure, the current observation [Figure 220566DEST_PATH_IMAGE015] is obtained from equations 1.8 and 1.9 as follows:
[Figure 481914DEST_PATH_IMAGE017] (equation 2.1)
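The formula images for equations 1.8, 1.9 and 2.1 are not reproduced in this text. One hedged reading, assuming they follow the common recurrent left-context and right-context formulation that the symbol descriptions suggest, is sketched below. Every symbol name here ($c_l$, $c_r$, $e$, $W^{(l)}$, $W^{(sl)}$, $W^{(r)}$, $W^{(sr)}$, $f$, $x_i$) is an assumption and is not taken from the source; the form shown is a plain bidirectional recurrence and does not include LSTM gating.

```latex
\begin{align}
c_l(w_i) &= f\left(W^{(l)}\, c_l(w_{i-1}) + W^{(sl)}\, e(w_{i-1})\right) \tag{1.8} \\
c_r(w_i) &= f\left(W^{(r)}\, c_r(w_{i+1}) + W^{(sr)}\, e(w_{i+1})\right) \tag{1.9} \\
x_i &= \left[\, c_l(w_i);\; e(w_i);\; c_r(w_i) \,\right] \tag{2.1}
\end{align}
```

Under this reading, $c_l(w_i)$ and $c_r(w_i)$ summarise the observations to the left and right of position $i$, $e(\cdot)$ is the feature vector of a single observation, $f$ is a nonlinear activation, and the representation $x_i$ of the current observation is the concatenation of its left context, its own feature vector and its right context.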
Second, the feature matrix output by the Bi-LSTM network layer is input into the convolutional layer for the convolution operation. After convolution, pooling is used for feature screening, which further extracts the correlation features of adjacent observations in the sequence feature matrix. The pooling part is calculated as follows.
The maximum pooling layer is calculated as:
[Figure 697870DEST_PATH_IMAGE023] (equation 2.2)
The average pooling layer is calculated as:
[Figure 374970DEST_PATH_IMAGE024] (equation 2.3)
The outputs of the maximum pooling layer and the average pooling layer are concatenated as:
[Figure 032085DEST_PATH_IMAGE025] (equation 2.4)
In the embodiments of the present disclosure, [Figure 261072DEST_PATH_IMAGE026] denotes the output of maximum pooling, [Figure 902007DEST_PATH_IMAGE027] denotes the output of average pooling, [Figure 710694DEST_PATH_IMAGE028] denotes the concatenation operation, and [Figure 278773DEST_PATH_IMAGE029] denotes the sliding window size. Concatenating maximum pooling and average pooling compensates for the information lost because each maximum pooling operation keeps only a single maximum value. Fusing the results of the two pooling operations preserves the integrity of the sequence feature information and yields more comprehensive, deeper features.
In the embodiment of the present disclosure, referring to the dual pooling layer in fig. 5, feature fusion of the two pooling results means concatenating the maximum pooling result and the average pooling result into a new feature. That is, maximum pooling yields a k × m vector and average pooling yields a k × m vector; after alignment, the two are concatenated into a k × 2m vector, where k and m are positive integers. For example, if maximum pooling yields a vector of 1 row by 5 columns and average pooling yields a vector of 1 row by 5 columns, their concatenation is a vector of 1 row by 10 columns. If the maximum pooling vector is (a_1, a_2, a_3, a_4, a_5) and the average pooling vector is (b_1, b_2, b_3, b_4, b_5), the concatenated vector is (a_1, a_2, a_3, a_4, a_5, b_1, b_2, b_3, b_4, b_5).
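The dual pooling and concatenation step can be shown on a single feature row. This is a minimal sketch under stated assumptions: the pooling formulas themselves are not reproduced in this text, so non-overlapping windows (stride equal to the window size) are assumed, and the helper names `pool` and `dual_pool` are hypothetical.

```python
def pool(row, window, reduce_fn, stride=None):
    """Slide a window over one feature row and reduce each window to one value.
    Assumption: the stride defaults to the window size (non-overlapping windows)."""
    stride = stride or window
    return [reduce_fn(row[i:i + window])
            for i in range(0, len(row) - window + 1, stride)]

def dual_pool(row, window):
    """Concatenate the max pooling and average pooling results of one row."""
    max_part = pool(row, window, max)
    avg_part = pool(row, window, lambda w: sum(w) / len(w))
    return max_part + avg_part

# A 1 x 10 feature row pooled with window 2 gives two 1 x 5 vectors,
# which concatenate into a 1 x 10 vector as in the example in the text.
features = [1, 3, 2, 8, 5, 4, 7, 7, 0, 6]
print(dual_pool(features, 2))  # [3, 8, 5, 7, 6, 2.0, 5.0, 4.5, 7.0, 3.0]
```

The first half of the output keeps the strongest response in each window, while the second half keeps the window average, so a window such as (2, 8) contributes both 8 and 5.0 instead of only 8.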
Finally, the feature representation of the sequence extracted by the dual pooling layer is used as the input of the Softmax layer. The classification process is calculated as follows:
[Figure 413082DEST_PATH_IMAGE019] (equation 2.5)
In equation 2.5, [Figure 541313DEST_PATH_IMAGE020] is the probability value y, which lies in the range 0 to 1; [Figure 153691DEST_PATH_IMAGE021] is a feature matrix whose dimensions are the same as those of the sequence feature matrix Z; [Figure 050978DEST_PATH_IMAGE022] is the bias term; and Z is the sequence feature matrix.
In conjunction with fig. 5, the classification process of the features by the Softmax layer is a binary classification process.
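A binary Softmax classification of this kind can be sketched as follows. The image for equation 2.5 is not reproduced in this text, so the usual form softmax(W·z + b) is assumed; the flattened feature vector `z`, the weight rows `W`, the biases `b`, and the convention that index 0 is the credible class are all illustrative assumptions.

```python
import math

def softmax(scores):
    """Numerically stable softmax: the outputs are positive and sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(z, W, b):
    """Assumed form of the Softmax layer: one linear score per class, then softmax.
    z: flattened feature vector; W: one weight row per class; b: one bias per class."""
    logits = [sum(w * x for w, x in zip(row, z)) + bias
              for row, bias in zip(W, b)]
    return softmax(logits)

# Two classes (assumed order: credible, not credible) with made-up parameters.
probs = classify([1.0, 2.0], [[0.5, 0.5], [-0.5, -0.5]], [0.0, 0.0])
print(probs)
```

Because there are only two classes, the two outputs are complementary, so the credible probability alone is enough for the threshold decision made later.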
The specific implementation steps are as follows:
S1, data acquisition and preprocessing. Collect a large data set generated by popular real information on a microblog and preprocess it. Obtain the recognition degree level and the attitude of each active user using the automatic recognition-level method described above and an existing microblog opinion mining method, then construct observations from each active user's recognition degree level and attitude, and finally obtain a large number of observation sequences generated by popular real information to serve as the data set for model training;
S2, input the data set into the model shown in fig. 5 for training;
S3, false information detection. Input an observation sequence into the trained model. The model calculates the credibility of each piece of information in real time and outputs a credible score and a non-credible score (summing to 100%); whether the information is false is judged against a preset threshold, so that false information can be found as early as possible.
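The threshold decision in S3 can be sketched as below. The default threshold of 0.5 and the helper names `judge` and `first_alarm` are assumptions for illustration; the source only states that a threshold is set and that information whose credibility score exceeds it is treated as real.

```python
from typing import Iterable, Optional

def judge(credible_score: float, threshold: float = 0.5) -> str:
    """Return "real" when the credible score exceeds the threshold, else "false".
    The 0.5 default is an assumption; the source leaves the threshold open."""
    if not 0.0 <= credible_score <= 1.0:
        raise ValueError("credible_score must lie in [0, 1]")
    return "real" if credible_score > threshold else "false"

def first_alarm(credible_scores: Iterable[float], threshold: float = 0.5) -> Optional[int]:
    """Given the credible score recomputed after each new user's observation,
    return the 1-based step at which the information is first judged false."""
    for step, score in enumerate(credible_scores, start=1):
        if judge(score, threshold) == "false":
            return step
    return None

# Scores after 1, 2, 3 and 4 observations of the same piece of information.
print(first_alarm([0.9, 0.7, 0.4, 0.2]))  # 3
```

Re-scoring the growing observation sequence after each new active user is one way to read "in real time": the earlier the score drops to the threshold or below, the earlier the information can be flagged.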
The invention provides a data processing approach based on the active users of a piece of information: by defining active users, it formalizes the microblog information propagation process and converts it into an observation sequence, which makes the data easy for various models to process. It also provides a deep learning model that effectively detects false information from the extracted observation sequence, reducing the influence of false information on the public.
In the embodiment of the present disclosure, as shown in fig. 8, an information processing apparatus 300 is provided, where the apparatus 300 includes:
a first determining module 301, configured to obtain the user's observation sequence for the information according to the recognition degree of the user and the user's comment data on the information; wherein the recognition degree reflects the user's ability to judge the authenticity of information; the observation sequence comprises the user's observation value for the information, the observation value indicates the user's attitude information toward the information, and the attitude information comprises: a negative attitude, a neutral attitude and a positive attitude;
a second determining module 302, configured to input the observation sequence into a trained machine learning model, so as to obtain a reliability probability value of the observation sequence;
and a third determining module 303, configured to obtain authenticity of the information according to the reliability probability value.
In the embodiment of the present disclosure, the information processing apparatus 300 further includes:
a first data obtaining module 304, configured to obtain attribute data of a user who has performed a preset operation on the information; wherein the preset operation indicates at least one of: the user liking, commenting on or forwarding the information; the attribute data of the user includes at least one of: the user's number of followers, the number of accounts the user follows, and the number of messages, among those published or forwarded by the user, whose authenticity needs to be judged;
a first feature obtaining module 305, configured to obtain a feature data set of the user according to the attribute data of the user;
the second determining module 302 is configured to input the feature data set of the user into a classification model, so as to obtain the recognition degree of the user.
In an embodiment of the present disclosure, the second determining module 302 is further configured to:
obtain the recognition degree of the user from the feature data of the user using a classification model; wherein the classification model is the one, among a plurality of candidate models trained with the labels obtained by the clusterer, whose test results match the clusterer most closely.
In this embodiment of the present disclosure, the first determining module 301 is further configured to:
obtain attitude information of the user toward the information according to the recognition degree of the user and the user's comment data on the information;
and obtain the user's observation sequence for the information according to the attitude information of the user toward the information.
In an embodiment of the present disclosure, the third determining module 303 is further configured to:
obtain a credibility score according to the credibility probability value;
and determine that the information is real information if the credibility score is greater than a preset threshold.
In the embodiment of the present disclosure, as shown in fig. 9, a training apparatus 400 is provided, where the training apparatus 400 includes:
an obtaining module 401, configured to obtain an identification degree of a sample user;
a fourth determining module 402, configured to obtain a sample of the observation sequence according to the evaluation data of the sample user;
and a training module 403, configured to train a machine learning model in the foregoing information processing method according to the sample of the observation sequence and the recognition degree of the sample user.
In an embodiment of the present disclosure, there is provided an electronic device including:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to implement the steps of the information processing method or the training method described above when executing the instructions.
Those of ordinary skill in the art will understand that all or part of the steps of the method embodiments may be implemented by hardware controlled by program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the method embodiments. The aforementioned storage medium includes: a removable storage device, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program code.
In the embodiment of the present disclosure, a storage medium is provided, and the storage medium has computer-executable instructions, which are executed by a processor to implement the steps in the information processing method or the training method described above.
Alternatively, the integrated unit according to the embodiment of the present invention may be stored in a computer-readable storage medium if it is implemented in the form of a software functional module and sold or used as a separate product. Based on such understanding, the technical solutions of the embodiments of the present invention may be essentially implemented or a part contributing to the prior art may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present invention. And the aforementioned storage medium includes: a mobile storage device, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above description is only for the specific embodiments of the present disclosure, but the scope of the present disclosure is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present disclosure, and all the changes or substitutions should be covered within the scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (10)

1. A method for information processing, characterized in that the method comprises: obtaining a user's observation sequence for information according to the recognition degree of the user and the user's comment data on the information; wherein the recognition degree reflects the user's ability to judge the authenticity of information; the observation sequence comprises the user's observation value for the information, the observation value indicates the user's attitude information toward the information, and the attitude information comprises: a negative attitude, a neutral attitude and a positive attitude; inputting the observation sequence into a trained machine learning model to obtain a credibility probability value of the observation sequence; and obtaining the authenticity of the information according to the credibility probability value.

2. The method for information processing according to claim 1, characterized in that the method further comprises: obtaining attribute data of a user who has performed a preset operation on the information; wherein the preset operation indicates at least one of: the user liking, commenting on or forwarding the information; the attribute data of the user comprises at least one of: the user's number of followers, the number of accounts the user follows, and the number of messages, among those published or forwarded by the user, whose authenticity needs to be judged; obtaining a feature data set of the user according to the attribute data of the user; and inputting the feature data set of the user into a classification model to obtain the recognition degree of the user.

3. The method for information processing according to claim 1, characterized in that obtaining the user's observation sequence for the information according to the recognition degree of the user and the user's comment data on the information comprises: obtaining attitude information of the user toward the information according to the recognition degree of the user and the user's comment data on the information; and obtaining the user's observation sequence for the information according to the attitude information of the user toward the information.

4. The method for information processing according to claim 1, characterized in that the machine learning model comprises: a bidirectional long short-term memory network (BiLSTM) that receives the observation sequence; and a convolutional neural network (CNN) that is connected to the BiLSTM and outputs the credibility probability value; wherein the CNN comprises at least a convolutional layer and two pooling layers located after the convolutional layer.

5. A training method, characterized in that the method comprises: obtaining the recognition degree of sample users; obtaining a sample of the observation sequence according to evaluation data of the sample users; and training the machine learning model according to any one of claims 1 to 4 with the sample of the observation sequence and the recognition degree of the sample users.

6. The training method according to claim 5, characterized in that the method further comprises: classifying the users by recognition degree using a clusterer, based on the feature data set of the sample users, wherein users with the same recognition degree level belong to the same label under the same category; testing a plurality of classification models of different types based on the labels and the feature data set of the users to obtain different test results; and, based on the degree of agreement between the test results and the k-means clusterer, taking a classification model with a high degree of agreement as the classification model used to classify the recognition degree.

7. An apparatus for information processing, characterized in that the apparatus comprises: a first determining module, configured to obtain a user's observation sequence for information according to the recognition degree of the user and the user's comment data on the information; wherein the recognition degree reflects the user's ability to judge the authenticity of information; the observation sequence comprises the user's observation value for the information, the observation value indicates the user's attitude information toward the information, and the attitude information comprises: a negative attitude, a neutral attitude and a positive attitude; a second determining module, configured to input the observation sequence into a trained machine learning model to obtain a credibility probability value of the observation sequence; and a third determining module, configured to obtain the authenticity of the information according to the credibility probability value.

8. A training apparatus, characterized in that the apparatus comprises: an obtaining module, configured to obtain the recognition degree of sample users; a fourth determining module, configured to obtain a sample of the observation sequence according to evaluation data of the sample users; and a training module, configured to train the machine learning model according to any one of claims 1 to 4 with the sample of the observation sequence and the recognition degree of the sample users.

9. An electronic device, characterized in that the electronic device comprises: a memory; and a processor connected to the memory and configured to implement the method according to any one of claims 1 to 4 or 5 to 6 by executing computer-executable instructions stored in the memory.

10. A storage medium, characterized in that the storage medium stores computer-executable instructions; and the computer-executable instructions, when executed by a processor, implement the method according to any one of claims 1 to 4 or 5 to 6.
CN202110675351.2A 2021-06-18 2021-06-18 Information processing and training method and device, electronic equipment and computer storage medium Pending CN113254801A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110675351.2A CN113254801A (en) 2021-06-18 2021-06-18 Information processing and training method and device, electronic equipment and computer storage medium

Publications (1)

Publication Number Publication Date
CN113254801A true CN113254801A (en) 2021-08-13

Family

ID=77188560

Country Status (1)

Country Link
CN (1) CN113254801A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113887669A (en) * 2021-11-02 2022-01-04 中国电信股份有限公司 User classification method, user classification device, storage medium and electronic equipment
CN114219669A (en) * 2021-11-05 2022-03-22 招银云创信息技术有限公司 False information identification method and device, computer equipment and readable storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103136330A (en) * 2013-01-04 2013-06-05 武汉大学 User reliability assessment method based on microblog platforms
CN106484679A (en) * 2016-10-20 2017-03-08 北京邮电大学 A kind of false review information recognition methodss being applied on consumption platform and device
CN108550065A (en) * 2018-04-10 2018-09-18 百度在线网络技术(北京)有限公司 comment data processing method, device and equipment
US20190215159A1 (en) * 2018-01-10 2019-07-11 Tmail Inc. System and computer program product for certified confidential data collaboration using blockchains

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210813
