CN114638008B

CN114638008B - Data processing methods, equipment, systems and storage media

Info

Publication number: CN114638008B
Application number: CN202011480985.4A
Authority: CN
Inventors: 刘巍然; 张磊
Original assignee: Alibaba Group Holding Ltd
Current assignee: Alibaba Group Holding Ltd
Priority date: 2020-12-15
Filing date: 2020-12-15
Publication date: 2026-01-09
Anticipated expiration: 2040-12-15
Also published as: CN114638008A

Abstract

This application provides a data processing method, device, system, and storage medium. In this application, a local differential privacy mechanism based on virtual data is provided. This mechanism involves adding a certain amount of virtual data to the original data based on differential privacy parameters of a data analysis algorithm. The resulting mixed data with added virtual data is then scrambled before being provided to the data analyst for analysis. Adding virtual data to the original data provides a degree of privacy protection, and the data analysis results, after correction by the data analyst, can be directly provided to the querying user without adding noise, thus solving the data consistency problem inherent in centralized differential privacy. Furthermore, scrambling the mixed data with added virtual data requires only a small amount of virtual data to meet differential privacy requirements, providing quantifiable privacy protection.

Description

Data processing method, device, system and storage medium

Technical Field

The present application relates to the field of internet technologies, and in particular, to a data processing method, device, system, and storage medium.

Background

With the advent of the big data era, the number of big data analysis platforms keeps growing. Almost all of them expose statistical analysis functions, such as histogram analysis, to external users; the most typical histogram analysis function is crowd profile analysis. To protect privacy, big data analysis platforms generally adopt differential privacy techniques, which provide a certain degree of privacy protection for the data while preserving query accuracy.

Among these techniques, centralized differential privacy (Central Differential Privacy) is a common one. In centralized differential privacy, users upload their real data to a trusted data owner, the data owner opens a data analysis function to the data analysis party, and noise is randomly added to each externally output analysis result to protect the user data. However, because noise is added randomly to the analysis result of every query, a data consistency problem easily arises for related query requests; that is, the same query obtains different results at different times because different amounts of noise are added.

Disclosure of Invention

Aspects of the present application provide a data processing method, apparatus, system, and storage medium, for solving the problem of data consistency in centralized differential privacy while implementing data privacy protection.

The embodiment of the application provides a data processing system which comprises at least one data source end, a data scrambling end and a data analysis end, wherein the at least one data source end is used for adding virtual data into original data based on differential privacy parameters of a data analysis algorithm to obtain mixed data, the data scrambling end is used for scrambling the mixed data and providing the scrambled mixed data to the data analysis end, and the data analysis end is used for carrying out data analysis on the scrambled mixed data by adopting the data analysis algorithm according to a query request of a query user and outputting a data analysis result to the query user.

The embodiment of the application also provides a data processing system which comprises at least one data source end, a data scrambling end and a data analysis end, wherein the at least one data source end is used for uploading original data to the data scrambling end, the data scrambling end is used for adding virtual data into the original data based on differential privacy parameters of a data analysis algorithm to obtain mixed data, scrambling the mixed data, and the data analysis end is used for carrying out data analysis on the scrambled mixed data by adopting the data analysis algorithm according to a query request of a query user and outputting a data analysis result to the query user.

The embodiment of the application also provides a data processing method which is suitable for the data source end and comprises the steps of generating original data, adding virtual data into the original data based on differential privacy parameters of a data analysis algorithm to obtain mixed data, uploading the mixed data to a data scrambling end, providing the mixed data to the data analysis end after scrambling by the data scrambling end, and carrying out data analysis on the scrambled mixed data by the data analysis end by adopting the data analysis algorithm.

The embodiment of the application also provides a data processing method which is suitable for the data scrambling terminal, and the method comprises the steps of receiving the original data uploaded by at least one data source terminal, adding virtual data into the original data based on the differential privacy parameters of a data analysis algorithm to obtain mixed data, scrambling the mixed data, and sending the scrambled mixed data to the data analysis terminal for the data analysis terminal to perform data analysis on the scrambled mixed data by adopting the data analysis algorithm.

The embodiment of the application also provides a data processing method, which comprises the steps of receiving the original data uploaded by at least one data source, adding virtual data into the original data based on the differential privacy parameters of a data analysis algorithm to obtain mixed data, scrambling the mixed data to obtain scrambled mixed data, carrying out data analysis on the scrambled mixed data by adopting the data analysis algorithm according to the query request of a query user, and outputting the data analysis result to the query user.

The embodiment of the application also provides data source equipment, which comprises a memory and a processor, wherein the memory is used for storing a computer program, the processor is coupled with the memory and used for executing the computer program and generating original data, virtual data is added in the original data based on differential privacy parameters of a data analysis algorithm to obtain mixed data, the mixed data is uploaded to a data scrambling end, the mixed data is scrambled by the data scrambling end and then provided for a data analysis end, and the data analysis end adopts the data analysis algorithm to conduct data analysis on the scrambled mixed data.

The embodiment of the application also provides data processing equipment which comprises a memory and a processor, wherein the memory is used for storing a computer program, the processor is coupled with the memory and used for executing the computer program and used for receiving original data uploaded by at least one data source end, adding virtual data into the original data based on differential privacy parameters of a data analysis algorithm to obtain mixed data, carrying out scrambling processing on the mixed data and sending the scrambled mixed data to a data analysis end, so that the data analysis end can carry out data analysis on the scrambled mixed data by adopting the data analysis algorithm.

The embodiment of the application also provides data processing equipment which comprises a memory and a processor, wherein the memory is used for storing a computer program, the processor is coupled with the memory and used for executing the computer program and used for receiving original data uploaded by at least one data source, adding virtual data into the original data based on differential privacy parameters of a data analysis algorithm to obtain mixed data, scrambling the mixed data to obtain scrambled mixed data, adopting the data analysis algorithm to analyze the scrambled mixed data according to a query request of a query user and outputting a data analysis result to the query user.

The embodiments of the present application also provide a computer-readable storage medium storing a computer program, which when executed by a processor causes the processor to implement the steps in the methods provided by the embodiments of the present application.

In the embodiment of the application, a local differential privacy mechanism based on virtual data is provided, namely, based on differential privacy parameters of a data analysis algorithm, a certain amount of virtual data is added in original data, then scrambling operation is carried out on mixed data added with the virtual data, and the scrambled mixed data is provided for a data analysis party for analysis. The virtual data is added into the original data, so that a certain degree of privacy protection can be provided for the original data, and the data analysis result corrected by the data analysis end can be directly provided for a query user without adding noise amount, so that the data consistency problem existing in the centralized differential privacy can be solved; in addition, the mixed data added with the virtual data is scrambled, and the differential privacy requirement can be met by only adding a small amount of virtual data, so that a quantifiable privacy protection effect is provided.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:

FIG. 1 is a schematic diagram of a data processing system according to an exemplary embodiment of the present application;

FIG. 2 is a schematic diagram of another data processing system according to an illustrative embodiment of the present application;

FIG. 3a is a flowchart of a data processing method according to an exemplary embodiment of the present application;

FIG. 3b is a flowchart illustrating another data processing method according to an exemplary embodiment of the present application;

FIG. 3c is a flowchart of yet another data processing method according to an exemplary embodiment of the present application;

FIG. 4a is a schematic diagram of a data processing apparatus according to an exemplary embodiment of the present application;

FIG. 4b is a schematic structural diagram of a data source device according to an exemplary embodiment of the present application;

FIG. 5a is a schematic diagram of another data processing apparatus according to an exemplary embodiment of the present application;

FIG. 5b is a schematic diagram of a data processing apparatus according to an exemplary embodiment of the present application;

FIG. 6a is a schematic diagram of a data processing apparatus according to another exemplary embodiment of the present application;

FIG. 6b is a schematic structural diagram of another data processing apparatus according to an exemplary embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be clearly and completely described below with reference to specific embodiments of the present application and corresponding drawings. It will be apparent that the described embodiments are only some, but not all, embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.

To address the data consistency problem of existing centralized differential privacy, the embodiments of the application provide a local differential privacy mechanism based on virtual data: based on the differential privacy parameters of a data analysis algorithm, a certain amount of virtual data is added to the original data, the mixed data containing the virtual data is then scrambled, and the scrambled mixed data is provided to the data analysis party for analysis. Adding virtual data to the original data provides a certain degree of privacy protection for the original data, and the data analysis result, once corrected by the data analysis end, can be provided directly to the query user without adding noise, which solves the data consistency problem of centralized differential privacy. In addition, because the mixed data containing the virtual data is scrambled, only a small amount of virtual data needs to be added to meet the differential privacy requirement, which provides a quantifiable privacy protection effect.

The following describes in detail the technical solutions provided by the embodiments of the present application with reference to the accompanying drawings.

FIG. 1 is a schematic diagram of a data processing system according to an exemplary embodiment of the present application. As shown in FIG. 1, the data processing system 100 includes at least one data source terminal 101, a data scrambling terminal 102, and a data analysis terminal 103.

In this embodiment, the data source 101 is an end capable of generating original data, and in terms of implementation, the data source 101 may be any application program, an application system, a functional module in the application system, a client, a hardware device (such as a server, a terminal device), or a hardware chip (such as a CPU, a GPU, or an FPGA) capable of generating original data. For example, the data source 101 may be a video APP, an instant communication APP, an online education APP, an online shopping APP, or a game APP, etc. installed and running on various terminal devices. In the present embodiment, the deployment implementation of the data source terminal 101 is not limited.

In this embodiment, the data scrambling end 102 refers to an end capable of scrambling received data. Scrambling the data involves two operations: removing identification information that may identify a specific object in the data, such as names, IP addresses, MAC addresses, and timestamps; and randomly shuffling the data (either before or after the identification information is removed) so as to disrupt the order of the data and eliminate, as far as possible, the association between a specific object and the data. The execution order of these two operations is not limited. In terms of implementation, the data scrambling end 102 may be implemented in software as an application program, a service, an instance, a functional module, or a virtual machine (VM) or container with a data scrambling function, or in hardware as a hardware device (such as a server or a terminal device) or a hardware chip (such as a CPU, a GPU, or an FPGA) with a data scrambling function. In this embodiment, the deployment mode of the data scrambling end 102 is likewise not limited.

In this embodiment, the data analysis end 103 refers to an end capable of performing data analysis on received data by using a data analysis algorithm, and in terms of implementation, the data analysis end 103 may be implemented as a function module, VM or container having a data analysis function, or may also be implemented as a hardware device (such as a server or a terminal device) having a data analysis function, a hardware chip (such as a CPU, GPU or FPGA), or the like. In the present embodiment, the deployment mode of the data analysis terminal 103 is not limited as well.

In an application scenario, as shown in FIG. 1, the data source end 101 is a terminal device on which an application program is installed and running, and the application program is responsible for generating original data. The data analysis end 103 is a cloud server that runs a data analysis algorithm and provides data analysis services externally. The data scrambling end 102 may be implemented as a conventional server that serves as a data middle platform between the terminal device and the cloud server: it provides data services to the cloud server and is responsible for processing data from the terminal device so as to supply the cloud server with data that meets its requirements.

In this embodiment, the data source end 101 may generate original data, and the original data generated differs according to the application scenario. For example, if the data source end 101 is a shopping application, the original data it generates includes, but is not limited to, the sales volume and sales amount of goods in each dimension, reviews of goods in each dimension, and the quantity, types, and stock levels of goods in each dimension. As another example, if the data source end 101 is a game application, the original data it generates includes, but is not limited to, the number of game players, their age groups and gender ratio, and the sales volume and sales amount of various game items.

In addition to generating the original data, the data source end 101 may also provide the original data to the data analysis end 103 for data analysis, so that the data analysis end 103 can provide the corresponding query results to the query user. Optionally, the data analysis end 103 may perform histogram statistical analysis on the original data generated by the data source end 101. The following briefly describes histogram statistical analysis, taking as an example original data consisting of n individual records of an enumeration type with k possible values, where n and k are positive integers.

In one application scenario, assume n = 10000, that is, 10000 individual records with a sex attribute, whose possible values are {male, female, unknown}, so k = 3. A histogram counts the number of individuals under each of the k possible values. Further assume that, among the n = 10000 records, 2000 individuals are male, 4000 are female, and 4000 are unknown; the histogram statistics result is then {male: 2000, female: 4000, unknown: 4000}.

In another application scenario, assume n = 20000, that is, 20000 individual records with an age attribute, whose possible values are {under 18 years old, 18 to 30 years old, 31 to 45 years old, 46 to 60 years old, 60 years old or above}, so k = 5. A histogram counts the number of individuals under each of the k possible values. Further assume that, among the n = 20000 records, the numbers of individuals in the five age ranges are 3000, 4000, 5000, 5000, and 3000, respectively; the histogram statistics result is then {under 18 years old: 3000, 18 to 30 years old: 4000, 31 to 45 years old: 5000, 46 to 60 years old: 5000, 60 years old or above: 3000}.
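The histogram statistics described in the two scenarios above can be sketched as follows. This is only an illustration: the function name `histogram` and the synthetic records are assumptions made for the example and do not come from the application.

```python
from collections import Counter


def histogram(records, domain):
    """Count how many records take each enumerated value in `domain`."""
    counts = Counter(records)
    # Report every possible value, including values that never occur.
    return {value: counts.get(value, 0) for value in domain}


# Synthetic data for the sex-attribute scenario: n = 10000 records, k = 3 values.
domain = ["male", "female", "unknown"]
records = ["male"] * 2000 + ["female"] * 4000 + ["unknown"] * 4000

result = histogram(records, domain)
print(result)  # {'male': 2000, 'female': 4000, 'unknown': 4000}
```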

In this embodiment, to prevent an attacker from obtaining or inferring users' private information from data analysis results, a local differential privacy mechanism based on virtual data is provided; with this mechanism, data privacy can be protected while the accuracy of the data analysis result is preserved. On this basis, the data analysis algorithm M used by the data analysis end 103 may be defined to satisfy differential privacy, together with the differential privacy parameters it satisfies. For example, the data analysis algorithm M may be defined to satisfy (ε, δ)-differential privacy, where ε is called the privacy budget, and a smaller ε means that the outputs on neighboring data sets are closer and the degree of privacy protection is higher; δ is called the failure probability, that is, with probability δ the data analysis algorithm M does not satisfy ε-differential privacy. When δ = 0, (ε, δ)-differential privacy is abbreviated as ε-differential privacy. Specifically, the data analysis algorithm M is said to satisfy (ε, δ)-differential privacy if, for any two possible data elements v, v' ∈ D and any possible output range R of the data analysis algorithm M, $\Pr[M(v) \in R] \le e^{\varepsilon} \cdot \Pr[M(v') \in R] + \delta$. Here $\Pr[M(v) \in R]$ denotes the probability that the data analysis algorithm M, taking the data v as input, outputs a result in the output range R, and $\Pr[M(v') \in R]$ denotes the probability that the data analysis algorithm M, taking the data v' as input, outputs a result in the output range R. It should be noted that the values of the differential privacy parameters ε and δ differ according to the required degree of differential privacy protection and can be set flexibly according to application requirements.
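As an illustration of the (ε, δ) inequality above, the following sketch checks it for a mechanism with a finite set of outputs. For such a mechanism the worst-case output range R consists of exactly those outputs whose probability under v exceeds e^ε times their probability under v', so the smallest workable δ is a simple sum. The function names and the toy two-output mechanism are assumptions made for this example; they are not part of the application.

```python
import math


def min_delta(p, q, epsilon):
    """Smallest delta with Pr_p[R] <= e^epsilon * Pr_q[R] + delta for every R.

    p and q map each possible output to its probability under inputs v and v'.
    """
    outputs = set(p) | set(q)
    bound = math.exp(epsilon)
    # Only outputs where p exceeds e^epsilon * q contribute to the worst-case R.
    return sum(max(0.0, p.get(o, 0.0) - bound * q.get(o, 0.0)) for o in outputs)


def satisfies_dp(p, q, epsilon, delta, tol=1e-12):
    """Check the inequality in both directions (v vs v' and v' vs v)."""
    return (min_delta(p, q, epsilon) <= delta + tol
            and min_delta(q, p, epsilon) <= delta + tol)


# Toy mechanism (hypothetical): output the true value with probability 0.75.
p = {"a": 0.75, "b": 0.25}  # output distribution when the input is "a"
q = {"a": 0.25, "b": 0.75}  # output distribution when the input is "b"

print(satisfies_dp(p, q, math.log(3), 0.0))  # True: the ratio 0.75 / 0.25 is 3
print(satisfies_dp(p, q, 0.5, 0.0))          # False: e^0.5 is smaller than 3
```

A larger δ relaxes the requirement: with ε = 0.5, the same toy mechanism passes once δ covers the excess probability mass returned by `min_delta`.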

In this embodiment, the local differential privacy mechanism based on virtual data may be implemented jointly by the data source end 101, the data scrambling end 102, and the data analysis end 103. As shown at ① in FIG. 1, at least one data source end 101 may add virtual data to the original data, based on the differential privacy parameters of the data analysis algorithm M, to obtain mixed data. In this embodiment, provided that the degree of differential privacy protection required of the data analysis algorithm M is met, neither the manner of adding virtual data to the original data nor the amount of virtual data added is limited.

In an optional embodiment, a certain amount of virtual data may be randomly generated according to the data structure of the original data, so that the virtual data has the same data structure as the original data, and the virtual data is then added to the original data to obtain mixed data. On the one hand, adding virtual data provides a certain degree of privacy protection for the original data; on the other hand, this approach is simple and flexible to implement. In another optional embodiment, the original data may be enumeration-type data, that is, the original data x ∈ D, where D is the data set corresponding to the original data and contains k possible enumeration values, D = {d_1, d_2, ..., d_k}, from which the original data takes its values. In this case, virtual data may be sampled uniformly at random from D = {d_1, d_2, ..., d_k} and added to the original data to obtain mixed data. On the one hand, adding virtual data provides a certain degree of privacy protection for the original data. On the other hand, because the virtual data is drawn from the enumeration values of the original data and is uniformly distributed, the random noise tends to cancel out during statistical analysis, which reduces interference with the original data, reduces the amount of noise in the data analysis result, and helps improve its accuracy. In addition, each data source end 101 only needs uniform random sampling and no other data sampling technique, so the approach is relatively simple to implement.
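The enumeration-type variant above can be sketched as follows, assuming the enumeration values are held in a Python list. The function name `add_virtual_data` and its parameters are hypothetical and chosen only for this illustration.

```python
import random


def add_virtual_data(original, domain, num_virtual, rng=None):
    """Return mixed data: the original items plus uniformly sampled virtual items."""
    rng = rng or random.Random()
    # Each virtual item is drawn uniformly at random from the k enumeration values.
    virtual = [rng.choice(domain) for _ in range(num_virtual)]
    return list(original) + virtual


domain = ["male", "female", "unknown"]   # D = {d_1, ..., d_k} with k = 3
mixed = add_virtual_data(["female"], domain, num_virtual=2, rng=random.Random(7))
print(len(mixed))  # 3: one original item and two virtual items
```

In this sketch the virtual items are simply appended; hiding which items are virtual is left to the scrambling step performed later by the data scrambling end.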

Regarding the amount of virtual data added, in an optional embodiment, each data source end 101 may obtain a data mixing ratio s based on the differential privacy parameters of the data analysis algorithm M and add virtual data to the original data according to the data mixing ratio s to obtain mixed data. The data mixing ratio s is the ratio of the number of virtual data items to the number of original data items; it may be greater than or equal to 1, or greater than 0 and less than 1. If s equals 1, the numbers of original and virtual data items are the same, that is, a ratio of 1:1; if s is greater than 1, there is more virtual data than original data; and if s is less than 1, there is more original data than virtual data. When s equals 1, one option is for each data source end 101 to add one virtual data item for each original data item; further optionally, where each data source end 101 generates a single original data item, it adds a single virtual data item for it. When s is less than 1, one option is for some of the data source ends 101 to add virtual data to the original data they generate while the others do not.
When the data mixing ratio s is less than 1, each data source end 101 may use a negotiation mechanism to determine whether it belongs to the subset of data source ends entitled to add virtual data and, if so, add virtual data to the original data it generates to obtain mixed data. Alternatively, when s is less than 1, each data source end 101 may use a randomized response mechanism to make this determination. For example, each data source end 101 may toss a coin that lands heads with a certain probability: heads means the data source end is entitled to add virtual data, and tails means it is not. In these alternatives, a data source end 101 entitled to add virtual data may obtain virtual data in the manner provided by the foregoing embodiments and add it to the original data.
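The coin-toss option can be sketched as follows for the case where each data source end holds a single original data item. The sketch assumes that the coin lands heads with probability equal to the data mixing ratio s, so that about n·s virtual items are added across n data source ends; the text above only says "a certain probability", so this choice, like the function name `mix_at_source`, is an assumption.

```python
import random


def mix_at_source(original_value, domain, s, rng=None):
    """Mixed data for one data source end holding one original item (0 < s <= 1)."""
    if not 0 < s <= 1:
        raise ValueError("this sketch only covers 0 < s <= 1")
    rng = rng or random.Random()
    mixed = [original_value]
    # Biased coin: heads (probability s, an assumption) grants the right to add
    # one virtual item, sampled uniformly from the enumeration values.
    if rng.random() < s:
        mixed.append(rng.choice(domain))
    return mixed


domain = ["male", "female", "unknown"]
rng = random.Random(42)
uploads = [mix_at_source("male", domain, s=0.3, rng=rng) for _ in range(10000)]
num_virtual = sum(len(u) - 1 for u in uploads)
print(num_virtual)  # close to n * s = 3000; the exact count varies with the seed
```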

After the mixed data is obtained, it is uploaded to the data scrambling end 102, as shown at ② in FIG. 1. Further, as shown at ③ in FIG. 1, the data scrambling end 102 scrambles the mixed data after receiving it. The scrambling process includes searching the mixed data for set object identifiers and removing them, where the set object identifiers may be, but are not limited to, names, IP addresses, MAC addresses, timestamps, and the like; and then randomly shuffling the mixed data from which the object identifiers have been removed, so as to disrupt the order of the data and eliminate, as far as possible, the association between specific objects and the data. Scrambling the mixed data anonymizes it to a certain extent, which makes it harder for an attacker to obtain private information from the data analysis result; this is equivalent to the data already having a certain degree of privacy protection. As a result, adding only a small amount of virtual data (even with only some of the data source ends adding virtual data) is enough to satisfy the established differential privacy definition and provide a quantifiable privacy protection effect.
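A minimal sketch of the two scrambling operations, assuming each mixed data item arrives as a dictionary. The field names in `IDENTIFIER_FIELDS` and the function name `scramble` are hypothetical; the application does not prescribe a record format.

```python
import random

# Hypothetical field names standing in for the "set object identifiers".
IDENTIFIER_FIELDS = ("name", "ip_address", "mac_address", "timestamp")


def scramble(records, identifier_fields=IDENTIFIER_FIELDS, rng=None):
    """Remove identifying fields from each record, then shuffle the record order."""
    rng = rng or random.Random()
    # Operation 1: drop identifier fields (copies are made; the input is untouched).
    stripped = [
        {key: value for key, value in record.items() if key not in identifier_fields}
        for record in records
    ]
    # Operation 2: shuffle so that position no longer links a record to its sender.
    rng.shuffle(stripped)
    return stripped


records = [
    {"name": "user-1", "ip_address": "10.0.0.1", "sex": "male"},
    {"name": "user-2", "timestamp": 1700000000, "sex": "female"},
    {"sex": "unknown"},  # e.g. a virtual item carrying no identifiers
]
scrambled = scramble(records, rng=random.Random(1))
print(sorted(r["sex"] for r in scrambled))  # ['female', 'male', 'unknown']
```

Python's `random` module is used here only for readability; a deployment that relies on the shuffle for privacy would more plausibly draw from a cryptographically secure source such as `random.SystemRandom`.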

Further, as shown at ④ in FIG. 1, after obtaining the scrambled mixed data, the data scrambling end 102 provides it to the data analysis end 103. As shown at ⑤ in FIG. 1, after receiving the scrambled mixed data, the data analysis end 103 may analyze it using the data analysis algorithm M according to a query request from the query user 104. Because the analysis is based not on the original data but on mixed data obtained by adding virtual data and scrambling according to the differential privacy protection requirement, the data analysis result satisfies the established differential privacy definition but also contains a certain deviation. To ensure the accuracy of the analysis result, as shown at ⑥ in FIG. 1, the data analysis result is corrected according to the virtual data. The corrected data analysis result both satisfies the established differential privacy definition and is more accurate, so it can be provided directly to the query user 104, as shown at ⑦ in FIG. 1. Note that correcting the data analysis result is optional; the uncorrected data analysis result, with its deviation, may also be provided directly to the query user 104. The query user 104 can be informed whether the received data analysis result is corrected or still contains a deviation and, in the latter case, can take the deviation into account when using the result.

In this embodiment, the query user 104 is not limited. In an optional embodiment, the query user 104 is the data source end 101. To protect privacy, after generating the original data, the data source end 101 may add virtual data to it and send the resulting mixed data to the data scrambling end 102 (for example, a data middle platform). The data scrambling end 102 scrambles the mixed data and provides the scrambled mixed data to the data analysis end 103. After receiving a query request from the data source end 101, the data analysis end 103 analyzes the scrambled mixed data using the data analysis algorithm M and returns the uncorrected or corrected data analysis result to the data source end 101 that initiated the query request. The data source end 101 can thus use the data analysis service provided by the data analysis end 103 while its privacy is protected, and can carry out subsequent operations such as quality monitoring and service improvement based on the returned data analysis result.

In another optional embodiment, the query user 104 may be a third party in a cooperative relationship with the data source ends 101 or the data scrambling end 102 that wishes to understand the data distribution of each data source end 101 before cooperating. In this case, to protect privacy, after generating the original data, each data source end 101 may add virtual data to it and send the resulting mixed data to the data scrambling end 102 (for example, a data middle platform). The data scrambling end 102 scrambles the mixed data and provides the scrambled mixed data to the data analysis end 103. After receiving a query request from the third party, the data analysis end 103 analyzes the scrambled mixed data using the data analysis algorithm M and returns the uncorrected or corrected data analysis result to the third party that initiated the query request, so that the third party can understand the data distribution of each data source end 101.

In the above embodiments, the data analysis result obtained by the data analysis terminal 103 may be, for example, the sales volume of a commodity, the price distribution of commodities within the same commodity category, the age distribution of the users corresponding to the same commodity category, or the age distribution and sex distribution of game users. The data analysis result differs by application scenario, which is not limited herein.

In some alternative embodiments of the present application, considering that the data analysis result contains noise and that the noise is mainly caused by the added virtual data, the data analysis end 103 may correct the data analysis result according to the virtual data. The amount of noise in the data analysis result is related to the amount of virtual data: the more virtual data there is, the greater the amount of noise. Based on this, when virtual data is added to the original data according to the data mixing ratio s, the data analysis end 103 can calculate the amount of noise generated by the virtual data according to the data mixing ratio s and correct the data analysis result according to that amount of noise. For convenience of description and distinction, the amount of noise generated by the virtual data is referred to as the first noise amount.

In an application scenario, assume that virtual data is mixed into n pieces of original data according to a data mixing ratio s, and that the data set corresponding to the original data includes k possible enumeration values, such as $D = \{d_1, d_2, \ldots, d_k\}$. For each possible enumeration value $d_j \in \{d_1, d_2, \ldots, d_k\}$, the data analysis terminal 103 may use the data analysis algorithm M to count the total number of occurrences $m_j$ of $d_j$ in the scrambled mixed data and calculate the corrected total number of occurrences $n_j = m_j - n \cdot s / k$. Here $n \cdot s / k$ is the expected number of virtual records that take the value $d_j$, because the $n \cdot s$ virtual records are sampled uniformly from the k enumeration values. The data analysis terminal 103 may then return the histogram statistics $\{n_1, \ldots, n_k\}$ to the data source terminal 101 or the third party that initiated the query request.
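The correction described above can be sketched as follows. This is a minimal illustration, not the implementation of the application: the function name `corrected_histogram`, the four-value domain, and the sample records are all hypothetical, and the histogram is computed with a plain counter standing in for the data analysis algorithm M.

```python
from collections import Counter


def corrected_histogram(scrambled, domain, n, s):
    """Return {d_j: n_j} with n_j = m_j - n*s/k for each enumeration value."""
    k = len(domain)
    raw_counts = Counter(scrambled)   # m_j: occurrences in the scrambled mixed data
    first_noise = n * s / k           # expected virtual records per enumeration value
    return {d: raw_counts.get(d, 0) - first_noise for d in domain}


# Illustrative input (assumed): n = 8 original records plus n*s = 4 virtual records.
domain = ["a", "b", "c", "d"]
scrambled = ["a", "c", "a", "b", "a", "d", "a", "b", "a", "c", "b", "a"]
histogram = corrected_histogram(scrambled, domain, n=8, s=0.5)
```

Because the subtraction uses the expected rather than the actual number of virtual records per value, an individual corrected count may differ slightly from the true count in general.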

To demonstrate the beneficial effects of the local differential privacy mechanism based on virtual data, it can be proved that, for any k, s, n, $\varepsilon \in [0, 1]$ and $\delta \in [0, 0.2907]$, the mechanism satisfies $(\varepsilon, \delta)$-differential privacy with $\varepsilon = \left(14k \cdot \ln(2/\delta) / (|n \cdot s| - 1)\right)^{1/2}$. It can further be proved that, for any k, s, n, the mean square error (MSE) of the data analysis result under this mechanism is $\mathrm{MSE} = s(k-1)/(n \cdot k^2)$. The two formulas show that, as the amount of virtual data increases, the mechanism provides a higher degree of privacy protection (a smaller $\varepsilon$), but the mean square error of the data analysis result also increases. In practical applications, the data mixing ratio s can be selected according to application requirements so as to balance privacy protection against the accuracy of the analysis result.
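To make the trade-off concrete, the two expressions can be evaluated for a few mixing ratios. The sketch below assumes that |n·s| denotes the number of virtual records and takes it as floor(n·s); the function names and the parameter values are hypothetical, and the formulas are only meaningful within the parameter ranges stated above.

```python
import math


def epsilon_virtual(k, s, n, delta):
    """epsilon = sqrt(14*k*ln(2/delta) / (|n*s| - 1))."""
    virtual_count = math.floor(n * s)  # assumption: |n*s| is the number of virtual records
    if virtual_count <= 1:
        raise ValueError("need more than one virtual record")
    return math.sqrt(14 * k * math.log(2 / delta) / (virtual_count - 1))


def mse_virtual(k, s, n):
    """MSE = s*(k-1) / (n*k^2)."""
    return s * (k - 1) / (n * k ** 2)


# Hypothetical parameters: k = 10 values, n = 1,000,000 records, delta = 1e-6.
for s in (0.1, 0.2, 0.4):
    print(s, epsilon_virtual(10, s, 1_000_000, 1e-6), mse_virtual(10, s, 1_000_000))
```

Running the loop shows epsilon shrinking and the MSE growing as s increases, which is the balance the paragraph describes.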

The data processing system provided by the embodiment of the application can implement a local differential privacy mechanism based on virtual data. In this mechanism, the data source terminal adds a certain amount of virtual data to the original data uploaded to the data scrambling terminal; the data scrambling terminal then scrambles the mixed data, that is, anonymizes the mixed data to a certain extent, and provides the scrambled mixed data to the data analysis terminal for analysis. The virtual data can be obtained by random uniform sampling from the data set corresponding to the original data, so that the virtual data is uniformly distributed; the random noise therefore tends to cancel out when the data is counted, which improves the accuracy of the data analysis result. In addition, because the data is anonymized to a certain extent, only a small amount of virtual data needs to be added (even if only some of the data source terminals add virtual data to their original data) to satisfy the given differential privacy definition and provide a quantifiable privacy protection effect. Moreover, compared with a centralized differential privacy mechanism, the data analysis result corrected by the data analysis terminal can be provided directly to the querying user without adding noise, which solves the data consistency problem of centralized differential privacy. Furthermore, compared with a centralized differential privacy mechanism, the mechanism no longer depends on a trusted data owner, so it also solves the single point of failure problem of centralized differential privacy while protecting data privacy.
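The two roles described above, adding uniformly sampled virtual data and then scrambling, can be sketched as below. The helper names `add_virtual_data` and `scramble`, the age-bracket domain, and the choice of floor(n·s) virtual records are assumptions made for illustration; the application does not prescribe a particular scrambling procedure, and a uniform shuffle is used here as one plausible choice.

```python
import math
import random
from collections import Counter


def add_virtual_data(original, domain, s, rng):
    """Data source side: append floor(n*s) values sampled uniformly from the domain."""
    virtual_count = math.floor(len(original) * s)  # assumption: ratio s is relative to n
    return list(original) + [rng.choice(domain) for _ in range(virtual_count)]


def scramble(mixed, rng):
    """Scrambling side: return a uniformly random permutation of the mixed data."""
    shuffled = list(mixed)
    rng.shuffle(shuffled)
    return shuffled


rng = random.Random(0)  # fixed seed only to make the sketch repeatable
domain = ["18-25", "26-35", "36-45", "46+"]
original = ["26-35"] * 6 + ["18-25"] * 2
mixed = add_virtual_data(original, domain, s=0.5, rng=rng)
scrambled = scramble(mixed, rng)
```

Scrambling changes only the order of the records, so any histogram computed by the data analysis terminal is the same before and after the shuffle; what the shuffle removes is the link between a record's position and its source.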

Further optionally, in some embodiments of the present application, as shown in A0 in fig. 1, the at least one data source 101 may also randomize the original data before adding the virtual data to the original data. The randomization process refers to replacing a part of original data with randomized data, which is data different from the replaced original data, according to a certain randomization probability. Therefore, the amount of the data subjected to randomization is unchanged, but a part of the data is not the original data, so that the privacy protection function can be realized. Of course, in addition to randomizing the original data before adding the dummy data to the original data, the randomizing may be performed on the original data in the mixed data after adding the dummy data to the original data to obtain the mixed data and before sending the mixed data to the data scrambling terminal 102.

The original data may be randomized in the same manner whether the randomization is performed before or after the virtual data is added. In this embodiment, the manner of randomizing the original data is not limited. Optionally, the randomization includes obtaining a randomization probability θ based on the differential privacy parameters of the data analysis algorithm M, and randomizing the original data according to the randomization probability θ. Further, a randomized response mechanism that takes the randomization probability θ as its probability parameter may be used. In short, a coin is tossed for each piece of original data, where the probability that the coin lands heads up is p, the probability that it lands tails up is q, and the values of p and q are related to θ. If the coin lands heads up, the original data is kept unchanged; if it lands tails up, the original data is replaced with randomized data. Technically, the coin toss may be implemented by generating a random number: generating a random number that satisfies a first condition represents tossing heads, and generating a random number that satisfies a second condition represents tossing tails. Accordingly, the randomization process is as follows: for each piece of original data, a random number is generated according to the randomization probability θ; if the random number satisfies the first condition, the original data is kept unchanged; if the random number satisfies the second condition, the original data is replaced with randomized data. The probability of generating a random number that satisfies the first condition and the probability of generating one that satisfies the second condition are both determined by the randomization probability θ: the former is the heads probability p and the latter is the tails probability q.

In an alternative embodiment, the randomized data may be randomly generated. In another alternative embodiment, the original data may be data of an enumeration type, that is, the original data $x \in D$, where D is the data set corresponding to the original data and includes k possible enumeration values, $D = \{d_1, d_2, \ldots, d_k\}$, from which the original data is drawn. In this case, the randomized data may be randomly and uniformly sampled from $D = \{d_1, d_2, \ldots, d_k\}$. That is, for each piece of original data, the original data is kept unchanged when heads is tossed, that is, when a random number satisfying the first condition is generated; when tails is tossed, that is, when a random number satisfying the second condition is generated, randomized data is randomly and uniformly sampled from $D = \{d_1, d_2, \ldots, d_k\}$ and used to replace the original data. In this alternative embodiment, the probability of generating a random number satisfying the first condition is $p = e^{\theta}/(e^{\theta} + k - 1)$, and the probability of generating a random number satisfying the second condition is $q = 1/(e^{\theta} + k - 1)$.
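One way to realize the coin toss with a random number is sketched below. It rests on an assumption that is not stated in this paragraph: the record is resampled uniformly from D with probability λ = k/(e^θ + k − 1) and kept otherwise. Under that assumption the reported value equals the original with probability e^θ/(e^θ + k − 1) and equals any particular other value with probability 1/(e^θ + k − 1), which matches p and q. The function names and the sample domain are hypothetical.

```python
import math
import random


def replace_probability(theta, k):
    """Assumed lambda = k / (e^theta + k - 1): chance that a record is resampled."""
    return k / (math.exp(theta) + k - 1)


def randomize(x, domain, theta, rng):
    lam = replace_probability(theta, len(domain))
    if rng.random() >= lam:       # "first condition": keep the original data
        return x
    return rng.choice(domain)     # "second condition": uniform sample from D


theta, domain = 1.0, ["a", "b", "c", "d"]
k = len(domain)
lam = replace_probability(theta, k)
p = math.exp(theta) / (math.exp(theta) + k - 1)
q = 1 / (math.exp(theta) + k - 1)
reported = [randomize("a", domain, theta, random.Random(i)) for i in range(5)]
```

Note that the uniform resample may return the original value again; the identity 1 − λ + λ/k = p accounts for exactly that case.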

Further, in the embodiment that combines randomization with virtual data addition, each of the two operations provides a certain privacy protection effect, and the two can cooperate to satisfy the given privacy protection definition. That is, for a given degree of privacy protection, the greater the degree of randomization, the smaller the amount of virtual data may be; conversely, the smaller the degree of randomization, the greater the amount of virtual data may be. Since the degree of randomization can be expressed as the randomization probability and the amount of virtual data can be expressed as the data mixing ratio, the randomization probability θ and the data mixing ratio s may be obtained together based on the differential privacy parameters of the data analysis algorithm. On this basis, if the original data is randomized before the virtual data is added, a random number is generated according to the randomization probability θ; if the random number satisfies the first condition, the original data is kept; if the random number satisfies the second condition, randomized data is randomly and uniformly sampled from $D = \{d_1, d_2, \ldots, d_k\}$ and used to replace the original data. Then virtual data is randomly and uniformly sampled from $D = \{d_1, d_2, \ldots, d_k\}$ according to the data mixing ratio s and added to the randomized data to obtain the mixed data. If the virtual data is added first, virtual data is randomly and uniformly sampled from $D = \{d_1, d_2, \ldots, d_k\}$ according to the data mixing ratio s and added to the original data to obtain the mixed data. Then, for the original data in the mixed data, a random number is generated according to the randomization probability θ; if the random number satisfies the first condition, the original data is kept; if the random number satisfies the second condition, randomized data is randomly and uniformly sampled from $D = \{d_1, d_2, \ldots, d_k\}$ and used to replace the original data.
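The two orderings described above can be sketched in one helper. In both orders only the original records are randomized, so the virtual records pass through untouched. The function `build_mixed_data`, its `randomize_first` flag, the replacement probability `lam` (assumed to be derived from θ elsewhere), and the use of floor(n·s) virtual records are all illustrative assumptions.

```python
import math
import random


def build_mixed_data(original, domain, s, lam, rng, randomize_first=True):
    """Combine randomization and virtual-data addition in either order.

    lam is the assumed probability of replacing an original record with a
    uniform sample from the domain; only original records are randomized.
    """
    def randomize(records):
        return [rng.choice(domain) if rng.random() < lam else x for x in records]

    n = len(original)
    virtual_count = math.floor(n * s)
    if randomize_first:
        randomized = randomize(original)
        return randomized + [rng.choice(domain) for _ in range(virtual_count)]
    mixed = list(original) + [rng.choice(domain) for _ in range(virtual_count)]
    return randomize(mixed[:n]) + mixed[n:]   # randomize only the original part


domain = ["a", "b", "c", "d"]
original = ["a", "a", "b", "c"]
first = build_mixed_data(original, domain, 0.5, 0.3, random.Random(1), randomize_first=True)
second = build_mixed_data(original, domain, 0.5, 0.3, random.Random(1), randomize_first=False)
```

Both calls yield n + floor(n·s) records drawn from the same domain; they differ only in when the virtual records are generated relative to the randomization step.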

After randomizing the original data and adding the virtual data, the obtained mixed data may be sent to the data scrambling terminal 102, the data scrambling terminal 102 performs scrambling processing on the received mixed data and provides the scrambled mixed data to the data analysis terminal 103, the data analysis terminal 103 performs data analysis on the scrambled mixed data by using the data analysis algorithm M and corrects the data analysis result to remove noise amount introduced by the virtual data and randomizing processing, and returns the corrected data analysis result to the query user 104 who initiates the query request, as shown in ②-⑦ in fig. 1.

In this embodiment, the noise in the data analysis result is mainly caused by the added virtual data and by the randomization. The amount of noise in the data analysis result is related to the amount of virtual data: the more virtual data there is, the greater the amount of noise. The amount of noise is likewise related to the degree of randomization, which can be expressed as the randomization probability: the greater the degree of randomization, the greater the amount of noise. Based on this, when virtual data is added to the original data according to the data mixing ratio s and the original data is randomized according to the randomization probability θ, the data analysis end 103 may calculate a first noise amount generated by the virtual data according to the data mixing ratio s and a second noise amount generated by the randomization according to the randomization probability θ, and correct the data analysis result according to the first noise amount and the second noise amount.

In an application scenario, assume that virtual data is mixed into n pieces of original data according to a data mixing ratio s, that the original data is randomized according to a randomization probability θ, and that the data set corresponding to the original data includes k possible enumeration values, such as $D = \{d_1, d_2, \ldots, d_k\}$. For each possible enumeration value $d_j \in \{d_1, d_2, \ldots, d_k\}$, the data analysis terminal 103 may use the data analysis algorithm M to count the total number of occurrences $m_j$ of $d_j$ in the scrambled mixed data and calculate the corrected total number of occurrences $n_j = (m_j - n \cdot s / k - n \cdot \lambda / k)/(1 - \lambda)$, where $\lambda = k/(e^{\theta} + k - 1)$. The data analysis terminal 103 may then return the histogram statistics $\{n_1, \ldots, n_k\}$ to the data source terminal 101 or the third party that initiated the query request.
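A sketch of this two-part correction follows. The raw counts and parameter values are illustrative, the function name is hypothetical, and λ is passed in directly rather than derived from θ so that the arithmetic is easy to follow. The guard on λ is an addition of this sketch, since the formula divides by 1 − λ.

```python
def corrected_histogram(raw_counts, n, s, lam):
    """Apply n_j = (m_j - n*s/k - n*lam/k) / (1 - lam) to every raw count m_j."""
    if not 0 <= lam < 1:
        raise ValueError("lam must be in [0, 1) for the correction to be defined")
    k = len(raw_counts)
    first_noise = n * s / k      # expected virtual records per value
    second_noise = n * lam / k   # expected randomized replacements per value
    return {d: (m - first_noise - second_noise) / (1 - lam) for d, m in raw_counts.items()}


raw_counts = {"a": 6, "b": 3, "c": 2, "d": 1}   # illustrative m_j values
corrected = corrected_histogram(raw_counts, n=8, s=0.5, lam=0.5)
```

With these numbers the corrected counts are 8, 2, 0 and −2, which sum to n = 8. An individual corrected count can be negative, because the expected noise is subtracted rather than the noise actually drawn.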

To demonstrate the beneficial effects of the local differential privacy mechanism that combines virtual data and randomization, it can be proved that, for any k, s, n, $\varepsilon \in [0, 1]$, $\lambda = k/(e^{\theta} + k - 1) \in (0, 1]$ and $\delta \in [0, 0.5814]$, the mechanism satisfies $(\varepsilon, \delta)$-differential privacy with $\varepsilon = \left(14k \cdot \ln(4/\delta) \,/\, \left(|n \cdot s| + (n-1)\lambda - (2(n-1) \cdot \lambda \cdot \ln(2/\delta))^{1/2} - 1\right)\right)^{1/2}$. It can further be proved that, for any k, s, n, the mean square error of the data analysis result under this mechanism is $\mathrm{MSE} = (e^{\theta} + k - 2)/(n \cdot (e^{\theta} - 1)^2) + s(k-1)/(n \cdot k^2) \cdot \left((e^{\theta} + k - 1)/(e^{\theta} - 1)\right)^2$. The two formulas show that, as the data mixing ratio s (that is, the amount of virtual data) increases and as $e^{\theta}$ in the randomization decreases, the mechanism provides a higher degree of privacy protection, but the mean square error of the data analysis result also increases. In practical applications, the values of the data mixing ratio s and of $e^{\theta}$ can be selected according to application requirements so as to balance privacy protection against the accuracy of the analysis result.
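The two expressions for the combined mechanism can be evaluated numerically as sketched below. The sketch assumes λ = k/(e^θ + k − 1), reading the θ in the printed expression for λ as e^θ so that λ stays in (0, 1], and takes |n·s| as floor(n·s). Function names and parameter values are hypothetical, and the bound is only meaningful within the parameter ranges stated above.

```python
import math


def replace_probability(theta, k):
    return k / (math.exp(theta) + k - 1)   # lambda, under the assumption stated above


def epsilon_combined(k, s, n, theta, delta):
    lam = replace_probability(theta, k)
    blanket = (math.floor(n * s) + (n - 1) * lam
               - math.sqrt(2 * (n - 1) * lam * math.log(2 / delta)) - 1)
    if blanket <= 0:
        raise ValueError("parameters outside the range where the bound is defined")
    return math.sqrt(14 * k * math.log(4 / delta) / blanket)


def mse_combined(k, s, n, theta):
    e = math.exp(theta)
    randomization_term = (e + k - 2) / (n * (e - 1) ** 2)
    virtual_term = s * (k - 1) / (n * k ** 2) * ((e + k - 1) / (e - 1)) ** 2
    return randomization_term + virtual_term


# Hypothetical parameters: k = 10, n = 1,000,000, theta = 2.0, delta = 1e-6.
for s in (0.1, 0.2):
    print(s, epsilon_combined(10, s, 1_000_000, 2.0, 1e-6), mse_combined(10, s, 1_000_000, 2.0))
```

The second factor in the virtual-data term equals 1/(1 − λ)², which reflects that the correction divides the counts by 1 − λ and so amplifies the noise contributed by the virtual records.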

The data processing system provided by the embodiment of the application can implement a local differential privacy mechanism that combines virtual data and randomization. In this mechanism, the original data is randomized as in local differential privacy, and a certain amount of virtual data is also added to the original data. The randomized data with the added virtual data is then scrambled, that is, anonymized to a certain extent, and the scrambled data is provided to the data analysis terminal for analysis. The virtual data can be obtained by random uniform sampling from the data set corresponding to the original data, so that the virtual data is uniformly distributed; the random noise therefore tends to cancel out when the data is counted, which improves the accuracy of the data analysis result. In addition, randomizing and anonymizing the original data means that the original data itself also receives privacy protection, which further increases the degree of privacy protection. For the same degree of privacy protection, the amount of virtual data can therefore be reduced accordingly, that is, adding virtual data to only part of the original data is enough to satisfy the given differential privacy definition and provide a quantifiable privacy protection effect. Moreover, compared with a centralized differential privacy mechanism, the data analysis result corrected by the data analysis terminal under this mechanism can be provided directly to the querying user without adding noise, which solves the data consistency problem of centralized differential privacy. Furthermore, compared with a centralized differential privacy mechanism, the mechanism no longer depends on a trusted data owner, so it also solves the single point of failure problem of centralized differential privacy while protecting data privacy.

In practice, if the data source end sufficiently trusts the data scrambling end, the data source end can upload the generated original data directly to the data scrambling end, and the data scrambling end performs the virtual data addition and the randomization in place of the data source end. Privacy protection is still achieved while the accuracy of the analysis result is ensured, and the uncorrected or corrected data analysis result obtained by the data analysis end can be provided directly to the querying user without adding noise. This solves the data consistency problem of centralized differential privacy and produces beneficial effects similar to those of the previous embodiment.

FIG. 2 is a schematic diagram of another data processing system according to an exemplary embodiment of the present application. As shown in fig. 2, the data processing system 200 includes at least one data source terminal 201, a data scrambling terminal 202, and a data analysis terminal 203.

The implementation forms and related descriptions of the at least one data source terminal 201, the data scrambling terminal 202, and the data analysis terminal 203 are the same as or similar to those of the data source terminal 101, the data scrambling terminal 102, and the data analysis terminal 103 in the foregoing embodiments; reference may be made to the foregoing embodiments, and details are not repeated here.

In this embodiment, the data source terminal 201, the data scrambling terminal 202, and the data analysis terminal 203 cooperate with each other to implement a local differential privacy mechanism based on virtual data. Specifically, after the data source terminal 201 generates the original data, it uploads the original data directly to the data scrambling terminal 202, as shown in ① in fig. 2. After receiving the original data, the data scrambling terminal 202 adds virtual data to the original data according to the differential privacy parameters of the data analysis algorithm M to obtain mixed data, as shown in ③ in fig. 2. The data scrambling terminal 202 then scrambles the mixed data to obtain scrambled mixed data, as shown in ④ in fig. 2, and sends the scrambled mixed data to the data analysis terminal 203, as shown in ⑤ in fig. 2. As shown in ⑥ in fig. 2, after receiving the scrambled mixed data, the data analysis terminal 203 may perform data analysis on the scrambled mixed data by using the data analysis algorithm M according to the query request of the querying user 204. To ensure the accuracy of the analysis result, the data analysis result is corrected according to the virtual data, as shown in ⑦ in fig. 2. The corrected data analysis result both satisfies the given differential privacy definition and is more accurate, so it may be provided to the querying user 204, as shown in ⑧ in fig. 2. As in the embodiment of fig. 1, the correction of the data analysis result shown in ⑦ in fig. 2 is an optional operation; the data analysis terminal 203 may also provide the uncorrected data analysis result directly to the querying user 204.

Further optionally, the data source terminal 201, the data scrambling terminal 202, and the data analysis terminal 203 cooperate to implement a local differential privacy mechanism that combines virtual data and randomization. Compared with the above local differential privacy mechanism based on virtual data, in this mechanism the data scrambling terminal 202, after receiving the original data, both adds virtual data to the original data and randomizes the original data. The original data may be randomized before the virtual data is added, as shown in ② in fig. 2. Alternatively, the original data in the resulting mixed data may be randomized after the virtual data has been added to the original data.

In this embodiment, the operations of adding virtual data and randomizing are performed by the data scrambling terminal 202. The detailed process by which the data scrambling terminal 202 adds virtual data and randomizes is the same as or similar to the detailed process by which the data source terminal 101 does so in the foregoing embodiments; the only difference is the execution body, so reference may be made to the foregoing embodiments and details are not repeated here. Likewise, for the detailed process by which the data scrambling terminal 202 scrambles the data and the detailed process by which the data analysis terminal 203 analyzes the data and corrects the data analysis result, reference may be made to the foregoing embodiments, and details are not repeated here.

In addition to the data processing system described above, embodiments of the present application provide several data processing methods, and in particular, reference may be made to the embodiments shown in fig. 3 a-3 c.

Fig. 3a is a schematic flow chart of a data processing method according to an exemplary embodiment of the present application. The method is mainly described from the perspective of the data source end in the system shown in fig. 1, and as shown in fig. 3a, the method includes:

31a, generating the original data.

32a, adding virtual data to the original data based on the differential privacy parameters of the data analysis algorithm to obtain mixed data.

33a, uploading the mixed data to a data scrambling end, so that the data scrambling end scrambles the mixed data and provides the scrambled mixed data to a data analysis end, and the data analysis end performs data analysis on the scrambled mixed data by using the data analysis algorithm.

In an alternative embodiment, the step 32a of adding virtual data to the original data based on the differential privacy parameters of the data analysis algorithm to obtain the mixed data includes obtaining a data mixing ratio based on the differential privacy parameters of the data analysis algorithm, and adding the virtual data to the original data according to the data mixing ratio to obtain the mixed data.

Further alternatively, in the process of adding the virtual data to the original data, the virtual data may be sampled randomly and uniformly from the data set corresponding to the original data, and added to the original data to obtain the mixed data.

In an alternative embodiment, before step 33a, the method further includes randomizing the original data before or after adding the virtual data, that is, before or after step 32a.

Further optionally, when the original data is randomized before the virtual data is added, randomizing the original data includes: obtaining a data mixing ratio and a randomization probability based on the differential privacy parameters of the data analysis algorithm, and randomizing the original data according to the randomization probability. Correspondingly, adding virtual data to the original data to obtain the mixed data includes: adding the virtual data to the randomized data according to the data mixing ratio to obtain the mixed data.

Further optionally, when the original data is randomized after the virtual data is added, adding virtual data to the original data to obtain the mixed data includes: obtaining a data mixing ratio and a randomization probability based on the differential privacy parameters of the data analysis algorithm, and adding the virtual data to the original data according to the data mixing ratio to obtain the mixed data. Correspondingly, randomizing the original data after adding the virtual data includes: randomizing the original data in the mixed data according to the randomization probability.

Further optionally, randomizing the original data according to the randomization probability includes: generating a random number according to the randomization probability; if the random number satisfies a first condition, keeping the original data; and if the random number satisfies a second condition, replacing the original data with randomized data. The probability of generating a random number that satisfies the first condition and the probability of generating a random number that satisfies the second condition are determined by the randomization probability.

Further, before the original data is replaced by the randomized data, the method further comprises the step of randomly and uniformly sampling the randomized data in a data set corresponding to the original data. Correspondingly, before adding the virtual data into the original data or the randomized data, the method also comprises the step of randomly and uniformly sampling the virtual data from a data set corresponding to the original data.

In this embodiment, the data source end adds virtual data to the original data, which provides a certain degree of privacy protection for the original data. In cooperation with the data source end, the data analysis end can perform data analysis based on the mixed data and provide the data analysis result to the querying user without adding noise, which solves the data consistency problem of centralized differential privacy. Further, in this embodiment, the virtual data may be obtained by random uniform sampling from the data set corresponding to the original data, that is, the virtual data is uniformly distributed. The random noise therefore tends to cancel out when the data is statistically analyzed, which reduces the interference with the original data, reduces the amount of noise in the data analysis result, and helps improve the accuracy of the data analysis result. Furthermore, in this embodiment, the original data can also be randomized, which means that the original data itself receives privacy protection and the degree of privacy protection is further increased. For the same degree of privacy protection, the amount of virtual data can therefore be reduced accordingly, that is, adding virtual data to only part of the original data is enough to satisfy the given differential privacy definition and provide a quantifiable privacy protection effect.

Fig. 3b is a flowchart of another data processing method according to an exemplary embodiment of the present application. The method is mainly described from the perspective of the data scrambling end in the system shown in fig. 2, and as shown in fig. 3b, the method includes:

31b, receiving the original data uploaded by at least one data source terminal.

32b, adding virtual data to the original data based on the differential privacy parameters of the data analysis algorithm to obtain mixed data.

33b, scrambling the mixed data, and sending the scrambled mixed data to a data analysis end, so that the data analysis end performs data analysis on the scrambled mixed data by using the data analysis algorithm.

In an alternative embodiment, before step 33b, the method further includes randomizing the original data before or after adding the virtual data, that is, before or after step 32b.

For the detailed implementation process of each step in this embodiment, reference may be made to the foregoing embodiments, and details are not repeated here. In this embodiment, the operation of adding virtual data and the operation of randomizing original data are performed by the data scrambling end, which is beneficial to reducing the processing burden of the data source end and saving the resources of the data source end.

In practical applications, the data scrambling end 202 in the data processing system shown in fig. 2 may be integrated with the data analysis end 203, and under this system architecture, virtual data is added to the original data, and randomizing, scrambling, and data analysis are performed by the same device. As shown in fig. 3c, a flowchart of yet another data processing method according to an exemplary embodiment of the present application is provided. The method is mainly described from the perspective of an integrated data analysis end, as shown in fig. 3c, and comprises the following steps:

31c, receiving the original data uploaded by at least one data source.

32c, adding virtual data to the original data based on the differential privacy parameters of the data analysis algorithm to obtain mixed data, and scrambling the mixed data to obtain scrambled mixed data.

33c, according to the query request of the querying user, performing data analysis on the scrambled mixed data by using the data analysis algorithm, and outputting the data analysis result to the querying user.

In an alternative embodiment, outputting the data analysis results to the querying user includes modifying the data analysis results based on the virtual data and outputting the modified data analysis results to the querying user.

In an alternative embodiment, the method further comprises randomizing the original data before or after adding the dummy data.

For the detailed implementation process of each step in this embodiment, reference may be made to the foregoing embodiments, and details are not repeated here. In this embodiment, the operations of adding virtual data, randomizing, scrambling, analyzing and processing the original data are performed by the integrated data analysis end, so that the system architecture is simpler, which is beneficial to reducing the processing burden of the data source end and saving the resources of the data source end.
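Steps 31c to 33c, with the optional correction, can be sketched end to end as follows. The function `integrated_analysis`, the sample data, the use of a uniform shuffle for scrambling, a histogram count standing in for the data analysis algorithm, and floor(n·s) virtual records are all assumptions made for illustration; the optional randomization step is omitted to keep the sketch short.

```python
import math
import random
from collections import Counter


def integrated_analysis(original, domain, s, rng, correct=True):
    n, k = len(original), len(domain)
    # Step 32c: add uniformly sampled virtual data, then scramble the mixed data.
    mixed = list(original) + [rng.choice(domain) for _ in range(math.floor(n * s))]
    rng.shuffle(mixed)
    # Step 33c: histogram analysis, optionally corrected by n*s/k per value.
    counts = Counter(mixed)
    offset = n * s / k if correct else 0.0
    return {d: counts.get(d, 0) - offset for d in domain}


rng = random.Random(7)   # fixed seed only to make the sketch repeatable
domain = ["a", "b", "c", "d"]
original = ["a"] * 500 + ["b"] * 300 + ["c"] * 200   # step 31c: received original data
result = integrated_analysis(original, domain, s=0.25, rng=rng)
```

When n·s is an integer, as here, the corrected counts always sum to n, although each individual count still carries the sampling noise of the virtual records.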

It should be noted that the steps of the methods provided in the above embodiments may all be executed by the same device, or may be executed by different devices. For example, the execution body of steps 31c to 33c may be device A; as another example, the execution body of steps 31c and 32c may be device B and the execution body of step 33c may be device A, and so on.

In addition, some of the flows described in the above embodiments and the drawings include a plurality of operations that appear in a specific order, but it should be clearly understood that these operations may be performed out of the order in which they appear herein or performed in parallel. The sequence numbers of the operations, such as 31a and 32a, are merely used to distinguish the operations from one another and do not by themselves represent any order of execution. The flows may also include more or fewer operations, and the operations may be performed sequentially or in parallel. It should be noted that the terms "first" and "second" herein are used to distinguish different messages, devices, modules, and the like; they do not represent a sequence, nor do they require that the "first" and the "second" be of different types.

Fig. 4a is a schematic structural diagram of a data processing apparatus according to an exemplary embodiment of the present application. As shown in fig. 4a, the apparatus includes a generating module 41a, an adding module 42a, and an uploading module 43a.

The generating module 41a is configured to generate the raw data and output the raw data to the adding module 42a.

The adding module 42a is configured to add virtual data to the original data based on the differential privacy parameters of the data analysis algorithm, so as to obtain mixed data.

The uploading module 43a is configured to upload the mixed data obtained by the adding module 42a to the data scrambling end, so that the data scrambling end scrambles the mixed data and provides the scrambled mixed data to the data analysis end, and the data analysis end performs data analysis on the scrambled mixed data by using a data analysis algorithm.

In an alternative embodiment, the adding module 42a is specifically configured to obtain a data mixing ratio based on the differential privacy parameter of the data analysis algorithm, and add virtual data to the original data according to the data mixing ratio to obtain the mixed data.

Further alternatively, in the process of adding the virtual data, the adding module 42a may specifically sample the virtual data randomly and uniformly in the data set corresponding to the original data, and add the virtual data to the original data to obtain the mixed data.
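The operation of the adding module 42a may be illustrated by the following minimal sketch. It is not the implementation of the present application. It assumes that the data mixing ratio denotes the number of virtual records added per original record, and the names `add_virtual_data`, `domain`, and `mixing_ratio` are hypothetical.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def add_virtual_data(
    original: Sequence[T],
    domain: Sequence[T],
    mixing_ratio: float,
    rng: Optional[random.Random] = None,
) -> List[T]:
    """Return the original records followed by uniformly sampled virtual records.

    Assumption: mixing_ratio is the number of virtual records per original
    record; the resulting count is rounded to the nearest integer.
    """
    rng = rng or random.Random()
    n_virtual = round(mixing_ratio * len(original))
    # Each virtual record is drawn uniformly at random from the value domain
    # (the data set corresponding to the original data).
    virtual = [rng.choice(domain) for _ in range(n_virtual)]
    return list(original) + virtual


mixed = add_virtual_data(
    ["a", "a", "b", "c"],
    domain=["a", "b", "c", "d"],
    mixing_ratio=0.5,
    rng=random.Random(0),
)
print(len(mixed))  # 4 original records plus 2 virtual records
```

In this sketch the virtual records are simply appended; hiding which records are virtual is left to the subsequent scrambling step.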

In an alternative embodiment, as shown in fig. 4a, the apparatus further includes a randomizing module 44a. The randomizing module 44a is configured to randomize the original data before or after the adding module 42a adds the virtual data.

Further, the randomizing module 44a is specifically configured to obtain a data mixing ratio and a randomizing probability based on the differential privacy parameter of the data analysis algorithm before the adding module 42a adds the virtual data, and randomize the original data according to the randomizing probability. Accordingly, the adding module 42a is specifically configured to add virtual data to the randomized data according to the data mixing ratio to obtain mixed data.

Or alternatively

The adding module 42a is specifically configured to obtain a data mixing ratio and a randomizing probability based on a differential privacy parameter of a data analysis algorithm, and add virtual data to the original data according to the data mixing ratio to obtain mixed data. Accordingly, the randomization module 44a is specifically configured to randomize the original data according to the randomization probability after the virtual data is added by the adding module 42a.
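The two orderings described above may be compared with the following sketch, which is illustrative only. It assumes that, in the second ordering, only the original records within the mixed data are randomized, so each record is tagged to keep the two kinds apart. The names `DOMAIN`, `randomize_then_add`, and `add_then_randomize` are hypothetical.

```python
import random

DOMAIN = ["a", "b", "c", "d"]  # hypothetical value domain of the original data


def sample_virtual(n, rng):
    # Draw n virtual records uniformly at random from the domain.
    return [rng.choice(DOMAIN) for _ in range(n)]


def randomize(records, p, rng):
    # With probability p, replace a record with a uniform sample from the domain.
    return [rng.choice(DOMAIN) if rng.random() < p else r for r in records]


def randomize_then_add(original, p, n_virtual, rng):
    # First ordering: randomize the original data, then add virtual data.
    return randomize(original, p, rng) + sample_virtual(n_virtual, rng)


def add_then_randomize(original, p, n_virtual, rng):
    # Second ordering: add virtual data first, then randomize only the
    # records tagged as original.
    mixed = [(r, True) for r in original]
    mixed += [(v, False) for v in sample_virtual(n_virtual, rng)]
    return [
        rng.choice(DOMAIN) if is_original and rng.random() < p else r
        for r, is_original in mixed
    ]


rng = random.Random(1)
first = randomize_then_add(["a", "b", "c", "d"], 0.25, 2, rng)
second = add_then_randomize(["a", "b", "c", "d"], 0.25, 2, rng)
print(len(first), len(second))
```

Under these assumptions both orderings yield mixed data of the same size, and each original record is replaced with the same probability in either case.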

Further alternatively, the randomizing module 44a is specifically configured to generate a random number according to a randomization probability when randomizing the original data, maintain the original data if the random number satisfies a first condition, and replace the original data with the randomized data if the random number satisfies a second condition, wherein the probability of generating the random number satisfying the first condition and the probability of generating the random number satisfying the second condition are determined by the randomization probability.

Further, the randomizing module 44a is further configured to randomly and uniformly sample the randomized data in the data set corresponding to the original data before replacing the original data with the randomized data. Correspondingly, the adding module 42a is further configured to randomly and uniformly sample the virtual data in the data set corresponding to the original data before adding the virtual data.
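The random-number-based randomization described above may be sketched as follows. This is a minimal illustration and not a definitive implementation. It assumes that the random number is drawn uniformly from [0, 1), that the first condition is "not less than the randomization probability", and that the second condition is "less than the randomization probability". The name `randomize_record` is hypothetical.

```python
import random
from typing import Optional, Sequence, TypeVar

T = TypeVar("T")


def randomize_record(
    value: T,
    domain: Sequence[T],
    p: float,
    rng: Optional[random.Random] = None,
) -> T:
    """Keep `value` with probability 1 - p, otherwise replace it."""
    rng = rng or random.Random()
    u = rng.random()  # random number in [0, 1)
    if u >= p:
        # Assumed first condition: the original data is maintained.
        return value
    # Assumed second condition: the original data is replaced with randomized
    # data sampled uniformly from the data set of the original data.
    return rng.choice(domain)


print(randomize_record("a", ["a", "b", "c", "d"], p=0.0))  # always keeps "a"
```

With this choice of conditions, the second condition is met with probability p and the first with probability 1 - p, so both probabilities are determined by the randomization probability.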

The internal functions and structures of the data processing apparatus are described above, and as shown in fig. 4b, the data processing apparatus may be implemented as a data source device in practice, including a memory 41b, a processor 42b, and a communication component 43b.

The memory 41b is used for storing a computer program and may be configured to store other various data to support operations on the data source device. Examples of such data include instructions, messages, pictures, videos, etc. for any application or method operating on the data source device.

The processor 42b is coupled to the memory 41b and is configured to execute the computer program in the memory 41b to: generate the original data; add virtual data to the original data based on the differential privacy parameters of the data analysis algorithm to obtain the mixed data; and upload the mixed data to the data scrambling end via the communication component 43b, so that the data scrambling end scrambles the mixed data and provides the scrambled mixed data to the data analysis end, and the data analysis end performs data analysis on the scrambled mixed data using the data analysis algorithm.

In an alternative embodiment, the processor 42b is specifically configured to obtain the data mixing ratio based on the differential privacy parameter of the data analysis algorithm, and add virtual data to the original data according to the data mixing ratio to obtain the mixed data.

Further alternatively, in the process of adding the virtual data, the processor 42b may specifically sample the virtual data randomly and uniformly in the data set corresponding to the original data, and add the virtual data to the original data to obtain the mixed data.

In an alternative embodiment, the processor 42b is further configured to randomize the original data before or after adding the virtual data.

Further, the processor 42b is specifically configured to obtain a data mixing ratio and a randomizing probability based on a differential privacy parameter of a data analysis algorithm before adding the virtual data, randomize the original data according to the randomizing probability, and then add the virtual data to the randomized data according to the data mixing ratio to obtain the mixed data.

Or alternatively

The processor 42b is specifically configured to obtain a data mixing ratio and a randomizing probability based on a differential privacy parameter of a data analysis algorithm, add virtual data to the original data according to the data mixing ratio to obtain mixed data, and then randomize the original data according to the randomizing probability.

Further alternatively, the processor 42b is specifically configured to generate a random number according to a randomization probability when randomizing the original data, to maintain the original data if the random number satisfies a first condition, and to replace the original data with the randomized data if the random number satisfies a second condition, wherein the probability of generating the random number satisfying the first condition and the probability of generating the random number satisfying the second condition are determined by the randomization probability.

Further, the processor 42b is further configured to randomly and uniformly sample the randomized data in the data set corresponding to the original data before replacing the original data with the randomized data, and to randomly and uniformly sample the virtual data in the data set corresponding to the original data before adding the virtual data.

Further, as shown in fig. 4b, the data source device also includes an audio component 44b, a power component 45b, and a display screen 46b, among other components. Only part of the components are schematically shown in fig. 4b, which does not mean that the data source device only comprises the components shown in fig. 4b. In addition, the components shown in dashed boxes in fig. 4b are optional components rather than mandatory components, depending on the device form of the data source device.

Accordingly, embodiments of the present application also provide a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to carry out the steps of the method embodiment shown in fig. 3a.

Fig. 5a is a schematic structural diagram of another data processing apparatus according to an exemplary embodiment of the present application. As shown in fig. 5a, the apparatus includes a receiving module 51a, an adding module 52a, a scrambling module 53a, and a transmitting module 54a.

The receiving module 51a is configured to receive the original data uploaded by at least one data source. The adding module 52a is configured to add virtual data to the original data based on the differential privacy parameters of the data analysis algorithm, so as to obtain mixed data. The scrambling module 53a is configured to scramble the mixed data. The sending module 54a is configured to send the scrambled mixed data to the data analysis end, so that the data analysis end performs data analysis on the scrambled mixed data by using a data analysis algorithm.

Further, as shown in fig. 5a, the apparatus further includes a randomizing module 55a for randomizing the original data before or after the adding module 52a adds the virtual data.
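The scrambling performed by the scrambling module 53a may be illustrated by the following sketch. It assumes that scrambling means a uniformly random permutation of the mixed records, so that the position of a record no longer indicates its source or whether it is virtual. The name `scramble` is hypothetical.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def scramble(mixed: Sequence[T], rng: Optional[random.Random] = None) -> List[T]:
    """Return a randomly permuted copy of the mixed data."""
    # SystemRandom draws from the operating system's entropy source; a seeded
    # random.Random may be passed instead for reproducible tests.
    rng = rng or random.SystemRandom()
    shuffled = list(mixed)  # copy, so the caller's data is left unchanged
    rng.shuffle(shuffled)
    return shuffled


scrambled = scramble(["a", "a", "b", "c", "d", "b"])
print(sorted(scrambled))  # same records as the input, order aside
```

Scrambling changes only the order of the records, so any statistic that depends only on record values, such as a histogram, is unaffected by it.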

The detailed operation principles of the adding module 52a and the randomizing module 55a may correspond to those of the adding module 42a and the randomizing module 44a in the embodiment shown in fig. 4a, and will not be described herein.

The internal functions and structures of the data processing apparatus are described above, and as shown in fig. 5b, the data processing apparatus may be implemented as a data processing device in practice, including a memory 51b, a processor 52b, and a communication component 53b.

The memory 51b is used for storing a computer program and may be configured to store various other data to support operations on the data processing apparatus. Examples of such data include instructions, messages, pictures, videos, etc. for any application or method operating on the data processing device.

The processor 52b is coupled to the memory 51b and is configured to execute the computer program in the memory 51b to: receive, through the communication component 53b, the original data uploaded by the at least one data source; add virtual data to the original data based on the differential privacy parameters of the data analysis algorithm to obtain mixed data; scramble the mixed data; and send the scrambled mixed data to the data analysis end, so that the data analysis end performs data analysis on the scrambled mixed data using the data analysis algorithm.

In an alternative embodiment, the processor 52b is further configured to randomize the original data before or after adding the virtual data.

The process by which the processor 52b adds virtual data and randomizes the original data is the same as or similar to that of the processor 42b in the embodiment shown in fig. 4b; reference may be made to the foregoing embodiments, and details are not repeated here.

Further, as shown in fig. 5b, the data processing device also includes other components such as a power supply component 54b. Only part of the components are schematically shown in fig. 5b, which does not mean that the data processing device only comprises the components shown in fig. 5b.

Accordingly, embodiments of the present application also provide a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to carry out the steps of the method embodiment shown in fig. 3b.

Fig. 6a is a schematic structural diagram of still another data processing apparatus according to an exemplary embodiment of the present application. As shown in fig. 6a, the apparatus comprises a receiving module 61a, an adding module 62a, a scrambling module 63a, an analyzing module 64a and an output module 66a.

The receiving module 61a is configured to receive the original data uploaded by at least one data source. The adding module 62a is configured to add virtual data to the original data based on the differential privacy parameters of the data analysis algorithm to obtain mixed data. The scrambling module 63a is configured to perform scrambling processing on the mixed data to obtain scrambled mixed data. The analysis module 64a is configured to perform data analysis on the scrambled mixed data using a data analysis algorithm according to a query request of a querying user. The output module 66a is configured to output the data analysis result to the querying user.

In an alternative embodiment, as shown in fig. 6a, the apparatus further includes a correction module 65a for correcting the data analysis result obtained by the analysis module 64a according to the virtual data, and providing the corrected data analysis result to the output module 66a. The output module 66a is specifically configured to output the corrected data analysis result to the querying user.
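For a histogram query, the correction performed by the correction module 65a may be sketched as follows. This is a minimal sketch under stated assumptions rather than the correction procedure of the present application: virtual and randomized records are assumed to be uniform over a domain of k values, the data mixing ratio is assumed to be the number of virtual records per original record, and the names `corrected_histogram`, `first_noise`, and `second_noise` are hypothetical. Under these assumptions, the expected count of a value v is (1 - p) * n_v + p * n / k + m / k, where n_v is the true count, n the number of original records, m the number of virtual records, and p the randomization probability, and the sketch inverts this relation.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, Sequence


def corrected_histogram(
    scrambled: Iterable[Hashable],
    domain: Sequence[Hashable],
    n_original: int,
    mixing_ratio: float,
    p: float = 0.0,
) -> Dict[Hashable, float]:
    """Estimate the true per-value counts from scrambled mixed data."""
    if not 0.0 <= p < 1.0:
        raise ValueError("randomization probability must be in [0, 1)")
    k = len(domain)
    n_virtual = mixing_ratio * n_original
    counts = Counter(scrambled)
    # Expected number of virtual records falling into each bucket.
    first_noise = n_virtual / k
    # Expected number of randomized replacements falling into each bucket.
    second_noise = p * n_original / k
    return {
        v: (counts[v] - first_noise - second_noise) / (1.0 - p)
        for v in domain
    }


# Four original records ("a", "a", "b", "c") mixed with four virtual records
# that happen to cover each domain value exactly once.
observed = ["a", "b", "a", "c", "d", "a", "c", "b"]
result = corrected_histogram(observed, ["a", "b", "c", "d"], n_original=4, mixing_ratio=1.0)
print(result)
```

The corrected counts are estimates whose error depends on how far the sampled virtual and randomized records deviate from their expected per-bucket numbers.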

In an alternative embodiment, as shown in fig. 6a, the apparatus further includes a randomizing module 67a for randomizing the original data before or after the adding module 62a adds the virtual data.

The detailed operation principles of the adding module 62a and the randomizing module 67a may correspond to those of the adding module 42a and the randomizing module 44a in the embodiment shown in fig. 4a, and will not be described herein.

The internal functions and structures of the data processing apparatus are described above, and as shown in fig. 6b, the data processing apparatus may be implemented as another data processing device in practice, including a memory 61b, a processor 62b, and a communication component 63b.

The memory 61b is used for storing a computer program and may be configured to store other various data to support operations on the data processing apparatus. Examples of such data include instructions, messages, pictures, videos, etc. for any application or method operating on the data processing device.

The processor 62b is coupled to the memory 61b and is configured to execute the computer program in the memory 61b to: receive, via the communication component 63b, the original data uploaded by the at least one data source; add virtual data to the original data based on the differential privacy parameters of the data analysis algorithm to obtain mixed data; scramble the mixed data to obtain scrambled mixed data; perform data analysis on the scrambled mixed data using the data analysis algorithm according to a query request of a querying user; and output the data analysis result to the querying user.

In an alternative embodiment, the processor 62b is further configured to correct the data analysis results based on the virtual data prior to outputting the data analysis results to the querying user. Reference is made to the foregoing embodiments for detailed implementation of the correction, and no further description is given here.

In an alternative embodiment, the processor 62b is further configured to randomize the original data before or after adding the virtual data.

The process by which the processor 62b adds virtual data and randomizes the original data is the same as or similar to that of the processor 42b in the embodiment shown in fig. 4b; reference may be made to the foregoing embodiments, and details are not repeated here.

Further, as shown in fig. 6b, the data processing device also includes other components such as a power supply component 64b. Only part of the components are schematically shown in fig. 6b, which does not mean that the data processing device only comprises the components shown in fig. 6b.

Accordingly, embodiments of the present application also provide a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to carry out the steps of the method embodiment shown in fig. 3c.

The memory in the above embodiments may be implemented by any type or combination of volatile or nonvolatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.

The communication component in the above embodiments is configured to facilitate wired or wireless communication between the device in which the communication component is located and other devices. The device in which the communication component is located can access a wireless network based on a communication standard, such as WiFi, a 2G, 3G, 4G/LTE, or 5G mobile communication network, or a combination thereof. In one exemplary embodiment, the communication component receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component further comprises a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra-Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.

The display in the above-described embodiments includes a screen, which may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may sense not only the boundary of a touch or slide action, but also the duration and pressure associated with the touch or slide operation.

The power supply component in the above embodiments provides power for the various components of the device in which the power supply component is located. The power supply component may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the device in which the power supply component is located.

The audio component in the above embodiments may be configured to output and/or input audio signals. For example, the audio component includes a microphone (MIC) configured to receive external audio signals when the device in which the audio component is located is in an operational mode, such as a call mode, a recording mode, or a speech recognition mode. The received audio signal may be further stored in a memory or transmitted via a communication component. In some embodiments, the audio component further comprises a speaker for outputting audio signals.

It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or nonvolatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of computer-readable media.

Computer-readable media include both permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. Computer-readable media, as defined herein, do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.

The foregoing is merely exemplary of the present application and is not intended to limit the present application. Various modifications and variations of the present application will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. which come within the spirit and principles of the application are to be included in the scope of the claims of the present application.

Claims

1. A data processing system, characterized in that it comprises: at least one data source end, a data scrambling end, and a data analysis end;

The at least one data source is used to obtain a data mixing ratio based on the differential privacy parameter of the data analysis algorithm. If the data mixing ratio is less than 1, it determines whether it belongs to the data source that is authorized to add virtual data. If it is determined to belong to the data source, it randomly and uniformly samples virtual data from the data set corresponding to the original data according to the data mixing ratio. The virtual data satisfies the uniform distribution. The virtual data is added to the original data to obtain mixed data.

The data scrambling terminal is used to scramble the mixed data and provide the scrambled mixed data to the data analysis terminal.

The data analysis terminal is used to perform data analysis on the scrambled mixed data according to the query request of the querying user, using the data analysis algorithm, and output the data analysis results to the querying user.

2. The system according to claim 1, wherein the data analysis terminal is specifically used for: calculating a first noise level generated by the virtual data according to the data mixing ratio; correcting the data analysis result according to the first noise level; and outputting the corrected data analysis result to the querying user.

3. The system according to claim 1, wherein the at least one data source is further configured to: randomize the original data before or after adding the virtual data.

4. The system according to claim 3, wherein the at least one data source is specifically used for: obtaining a data mixing ratio and a randomization probability based on differential privacy parameters of a data analysis algorithm; randomizing the original data according to the randomization probability; and adding virtual data to the randomized data according to the data mixing ratio to obtain mixed data.

or

obtaining a data mixing ratio and a randomization probability based on differential privacy parameters of a data analysis algorithm; adding virtual data to the original data according to the data mixing ratio to obtain mixed data; and randomizing the original data in the mixed data according to the randomization probability.

5. The system according to claim 4, wherein the at least one data source is specifically configured to: generate random numbers according to the randomization probability; if the random numbers satisfy a first condition, retain the original data; if the random numbers satisfy a second condition, replace the original data with randomized data;

The probability of generating a random number that satisfies the first condition and the probability of generating a random number that satisfies the second condition are determined by the randomization probability.

6. The system according to claim 5, wherein the at least one data source is further configured to: randomly and uniformly sample the randomized data from the data set corresponding to the original data; and randomly and uniformly sample the virtual data from the data set corresponding to the original data.

7. The system according to claim 5 or 6, wherein the data analysis terminal is specifically used for: calculating a first noise level generated by the virtual data according to the data mixing ratio; calculating a second noise level generated by randomization processing according to the randomization probability; correcting the data analysis result according to the first noise level and the second noise level; and outputting the corrected data analysis result to the querying user.

8. The system according to any one of claims 1-6, wherein the data source end is an application, the data scrambling end is data middleware, and the data analysis end is a cloud server.

9. A data processing system, characterized in that it comprises: at least one data source end, a data scrambling end, and a data analysis end;

The at least one data source is used to upload the original data to the data scrambling end;

The data scrambling end is used to obtain the data mixing ratio based on the differential privacy parameter of the data analysis algorithm. If the data mixing ratio is less than 1, it determines whether it belongs to a data source end that is authorized to add virtual data. If it is determined to belong to the data source end, it adds virtual data to the original data according to the data mixing ratio to obtain mixed data, and scrambles the mixed data. The virtual data is randomly and uniformly sampled from the data set corresponding to the original data, and the virtual data satisfies a uniform distribution.

The data analysis terminal is used to perform data analysis on the scrambled mixed data according to the query request of the querying user, and output the data analysis results to the querying user.

10. The system according to claim 9, wherein the data scrambling terminal is further configured to: randomize the original data before or after adding the virtual data.

11. A data processing method, applicable to a data source, characterized in that the method comprises:

Generate raw data;

Based on the differential privacy parameter of the data analysis algorithm, the data mixing ratio is obtained. If the data mixing ratio is less than 1, it is determined whether it belongs to a data source end that is authorized to add virtual data. If it is determined to belong to the data source end, virtual data is randomly and uniformly sampled from the data set corresponding to the original data according to the data mixing ratio. The virtual data satisfies the uniform distribution. Virtual data is added to the original data to obtain mixed data.

The mixed data is uploaded to the data scrambling terminal, which then scrambles the mixed data and provides it to the data analysis terminal, which uses the data analysis algorithm to perform data analysis on the scrambled mixed data.

12. The method according to claim 11, characterized in that, adding virtual data to the original data to obtain mixed data includes:

Virtual data is randomly and uniformly sampled from the dataset corresponding to the original data and added to the original data to obtain mixed data.

13. The method according to claim 12, characterized in that it further comprises:

The original data is randomized before or after the virtual data is added.

14. The method according to claim 13, characterized in that, before adding the virtual data, the original data is randomized, comprising: obtaining the data mixing ratio and randomization probability based on the differential privacy parameter of the data analysis algorithm; and randomizing the original data according to the randomization probability;

Accordingly, adding virtual data to the original data to obtain mixed data includes: adding virtual data to the randomized data according to the data mixing ratio to obtain mixed data.

15. The method according to claim 14, characterized in that, randomizing the original data according to the randomization probability includes:

Random numbers are generated according to the randomization probability; if the random numbers satisfy the first condition, the original data is retained; if the random numbers satisfy the second condition, the randomized data is used to replace the original data.

16. The method according to claim 14, characterized in that it further comprises:

Randomized data is randomly and uniformly sampled from the dataset corresponding to the original data.

17. A data processing method, applicable to a data scrambling terminal, characterized in that the method comprises:

Receive raw data uploaded from at least one data source;

The mixed data is scrambled, and the scrambled mixed data is sent to the data analysis terminal so that the data analysis terminal can use the data analysis algorithm to perform data analysis on the scrambled mixed data.

18. The method according to claim 17, further comprising: randomizing the original data before or after adding the virtual data.

19. A data processing method, characterized in that it includes:

Receive raw data uploaded from at least one data source;

Based on the differential privacy parameter of the data analysis algorithm, the data mixing ratio is obtained. If the data mixing ratio is less than 1, it is determined whether it belongs to a data source end that is authorized to add virtual data. If it is determined to belong to the data source end, virtual data is randomly and uniformly sampled from the data set corresponding to the original data according to the data mixing ratio. The virtual data satisfies the uniform distribution. Virtual data is added to the original data to obtain mixed data, and the mixed data is scrambled to obtain scrambled mixed data.

Based on the query request from the user, the data analysis algorithm is used to perform data analysis on the scrambled mixed data, and the data analysis results are output to the user.

20. The method according to claim 19, further comprising: randomizing the original data before or after adding the virtual data.

21. A data source device, characterized in that it comprises: a memory and a processor;

The memory is used to store computer programs; the processor is coupled to the memory and is used to execute the computer programs for:

Generate raw data;

22. A data processing device, characterized in that it comprises: a memory and a processor;

Receive raw data uploaded from at least one data source;

23. A data processing device, characterized in that it comprises: a memory and a processor;

Receive raw data uploaded from at least one data source;

24. A computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, it causes the processor to perform the steps of the method according to any one of claims 11-20.