CN104484402A - Method and device for deleting repeating data - Google Patents
Method and device for deleting repeating data Download PDFInfo
- Publication number
- CN104484402A CN104484402A CN201410775122.8A CN201410775122A CN104484402A CN 104484402 A CN104484402 A CN 104484402A CN 201410775122 A CN201410775122 A CN 201410775122A CN 104484402 A CN104484402 A CN 104484402A
- Authority
- CN
- China
- Prior art keywords
- virtual
- data
- attribute information
- data blocks
- data block
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/14—Details of searching files based on file metadata
- G06F16/148—File search processing
- G06F16/152—File search processing using file content signatures, e.g. hash values
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/16—File or folder operations, e.g. details of user interfaces specifically adapted to file systems
- G06F16/162—Delete operations
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Library & Information Science (AREA)
- Human Computer Interaction (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a method for deleting repeating data. A first file and a second file are stored in a storage node. The method comprises the steps: dividing the first file into a plurality of first data blocks which are identical in size, and dividing the second file into a plurality of second data blocks which are identical in size; generating a hash value respectively corresponding to each first data block and each second data block; generating a first virtual repeating target for the N first data blocks, deleting the hash value of the N first data blocks, and regenerating a hash value according to N first data blocks when the has values of the N continuous second data blocks are completely identical to tat of N continuous first data blocks; deleting N second data blocks and the hash values of the N second data blocks, correlating second attribute information of the N second data blocks to the first virtual repeating target, so that not only can the quantity of the stored hash value be reduced, but also the precision for searching the repeating data can be guaranteed. The invention also discloses a device for deleting the repeating data.
Description
Technical Field
The invention relates to the technical field of distributed storage, in particular to a method and a device for deleting repeated data.
Background
In the field of storage, the core idea of deleting repeated data is as follows: when storing data, the data is compared with the existing data, if the data is the same, the backup of the data is filtered, then the existing data is referred by a pointer, and only the pointer is stored. By deleting the repeated data, the space occupied by storage can be reduced fundamentally, the data volume transmitted in the network can be reduced, the energy consumption and the network cost are further reduced, and a large amount of network bandwidth is saved for data replication.
In the prior art, two techniques for deleting duplicate data are most commonly used, one is a file-level technique for deleting duplicate data, and the other is a data block-level technique for deleting duplicate data.
File-level deduplication: for files already stored in the storage system, their respective hash function values are first calculated (typically using MD5 or SHA-1) and stored separately in a hash function value library. When new files to be stored reach the storage system, the hash function values of the new files are calculated firstly, the obtained hash function values are compared with the values already stored in the hash function value library one by one, if the two files have the same hash function value, the two files are judged to be the same, and only a pointer pointing to the stored files is required to replace the new files to be stored; and if the hash function value of the new file is not in the hash function value library, actually storing the new file in the system, and adding the new hash function value into the hash function value library.
The data block level deduplication technique is similar to the deduplication approach described above, with reference to fig. 1, except that: dividing all files in a storage system into data blocks according to a fixed size, calculating a hash function value of each data block, and combining all hash function values into a hash function value library for independent storage. When new data needs to be stored, the data is divided into data blocks according to the fixed size, and the hash function value of each data block is compared with the values in the hash function value library one by one. If the hash function value of the new data block is found to exist in the hash function value library, the data block is stored in the system, and the data block does not need to be stored again, and only a pointer pointing to the data block represented by the hash function value is stored in a corresponding position; if the new data chunk hash function value is not in the hash function value library, it is actually stored in the system and the new hash function value is added to the library.
The method for deleting the repeated data is a relatively common method, but in practical application, the problem that the number of the hash value base table entries is too large is encountered. For example, if duplicate detection finds a 2000M duplicate file, and each fixed block size is set to 4M, a maximum of 500 hash value storages are required. In a massive distributed storage environment, excessive hash values inevitably bring about the problem of low hash value searching efficiency, thereby affecting the searching efficiency of the system.
If repeated data detection is performed by increasing the size of the blocks or by using files as granularity, although the number of hash values is reduced, the granularity is too large, so that the problem of insufficient precision of repeated data elimination is caused.
Therefore, a method for deleting the repeated data is needed, which can not only reduce the hash value to ensure the searching efficiency, but also ensure the precision of searching the repeated data.
Disclosure of Invention
In view of the above, the present invention provides a method and an apparatus for deleting duplicate data to solve the above problem.
In order to achieve the above purpose, the technical solution of the embodiment of the present invention is realized as follows:
the embodiment of the invention provides a method for deleting duplicate data, which is used for storing a first file and a second file in a storage node and comprises the following steps:
dividing a first file into a plurality of first data blocks with the same size, and dividing a second file into a plurality of second data blocks with the same size as the first data blocks; wherein each of the second data blocks has second attribute information, and a hash value is generated for each of the first data blocks and the second data blocks;
detecting whether the hash value of the second data block is identical to the hash value of the first data block, and when detecting that the hash values of the N continuous second data blocks are identical to the hash values of the N continuous first data blocks:
generating a first virtual repetitive object aiming at the N first data blocks, deleting the hash values of the N first data blocks, and regenerating one hash value aiming at the N first data blocks;
and deleting the N second data blocks and the hash values of the N second data blocks, and modifying the second attribute information of the N second data blocks to enable the N second data blocks to be associated with the first virtual repetitive object.
Optionally, a hash value database and an attribute information database are arranged in the storage node;
the hash values are stored in a hash value database, and the second attribute information of the second data block is stored in an attribute information database.
Optionally, the first virtual repetitive object has attribute information, and the attribute information is stored in the attribute information database;
the attribute information of the first virtual repetitive object includes: the virtual repeated object identification and the host object attribute value are the data block identification of the 1 st of the N first data blocks, and the number of the data blocks is N;
the second attribute information of the second data block includes: associating the virtual repetitive object identifier with the position offset value;
the modifying the second attribute information of the N second data blocks to make it associated with the first virtual repetitive object includes:
modifying the associated virtual repetitive object identifications in the second attribute information of the N second data blocks to be associated with the virtual repetitive object identification of the first virtual repetitive object; and modifying the position offset value in the second attribute information, and associating the position offset value with the host object attribute value to point to the N first data blocks associated with the first virtual repetitive object respectively.
Optionally, when modifying the Mth second data block of the N second data blocks,
copying and storing an Mth first data block corresponding to the Mth second data block in a second file, and modifying the copied Mth first data block;
generating a hash value aiming at the copied Mth first data block and storing the hash value in a hash value database; modifying the second attribute information of the Mth second data block to point to the copied Mth first data block;
deleting the attribute information of the first virtual repetitive object, generating a second virtual repetitive object aiming at the 1 st to (M-1) th first data blocks before the Mth first data block, generating a hash value aiming at the 1 st to (M-1) th first data blocks and storing the hash value in a hash value database; generating a third virtual repeating object for the (M +1) -N first data blocks after the Mth first data block, generating a hash value for the (M +1) -N first data blocks and storing the hash value in a hash value database;
modifying second attribute information of the 1 st to (M-1) th second data blocks to be associated with the second virtual repetitive object; and modifying the second attribute information of the (M +1) -N second data blocks to be associated with the third virtual repetitive object.
Optionally, modifying the second attribute information of the mth second data block is: and the associated virtual repeated object identifier is null, the position deviation value is null, and the associated data block is the Mth first data block.
Optionally, the attribute information of the second virtual repetitive object and the attribute information of the third virtual repetitive object are both stored in the attribute information database;
setting the attribute information of the second virtual repetitive object as follows: the host object attribute value is the data block identifier of the 1 st data block in the N first data blocks, and the number value of the data block is M-1;
modifying second attribute information of the 1 st to (M-1) th second data blocks to be associated with the second virtual repetitive object, including:
modifying an associated virtual repetitive object identification in second attribute information of the 1 st to (M-1) th second data blocks to be associated with the virtual repetitive object identification of the second virtual repetitive object; modifying a position offset value in the second attribute information, and associating the position offset value with the host object attribute value of the second virtual repetitive object so as to respectively point to the 1 st to (M-1) th first data blocks associated with the second virtual repetitive object;
setting the attribute information of the third virtual repetitive object as follows: the attribute value of the host object is the data block identifier of the (M +1) th data block in the N first data blocks, and the number value of the data block is N-M;
modifying the second attribute information of the (M +1) -N second data blocks to be associated with the third virtual repetitive object, including:
modifying associated virtual repeating object identifications in second attribute information of (M +1) -N second data blocks to be associated with the virtual repeating object identification of the third virtual repeating object; and modifying the position offset value in the second attribute information, and associating the position offset value with the host object attribute value of the third virtual repeating object so as to respectively point to (M +1) -N first data blocks associated with the third virtual repeating object.
Optionally, when the mth second data block of the N modified second data blocks is restored to the original data,
deleting the copied Mth first data block;
deleting the attribute information of the second virtual repetitive object and the third virtual repetitive object, regenerating the 1 st to N th first data blocks into first virtual repetitive objects, generating a hash value aiming at the 1 st to N th first data blocks and storing the hash value in a hash value database;
and modifying the second attribute information of the 1 st to N second data blocks to enable the second attribute information to be associated with the recovered first virtual repetitive object.
Optionally, the first virtual repetitive object has attribute information stored in the attribute information database, and the attribute information includes: the attribute value of the host object is the data block identification of the 1 st first data block, and the number value of the data block is N;
modifying the second attribute information of the 1 st to N second data blocks to make the second attribute information associated with the recovered first virtual repetitive object, specifically comprising:
modifying the associated virtual repetitive object identifier in the second attribute information of the 1 st to N second data blocks to be associated with the virtual repetitive object identifier of the recovered first virtual repetitive object; and modifying the position offset value in the second attribute information, and associating the position offset value with the host object attribute value to respectively point to the 1 st to N th first data blocks associated with the recovered first virtual repetitive object.
The embodiment of the invention provides a device for deleting repeated data, which is arranged on a storage node, wherein a first file and a second file are stored on the storage node, and the device comprises:
the first dividing module is used for dividing the first file into a plurality of first data blocks with the same size, wherein a hash value is generated aiming at each first data block;
the second dividing module is used for dividing a second file into a plurality of second data blocks with the same size as the first data blocks; wherein each of the second data blocks has second attribute information, and a hash value is generated for each of the second data blocks;
the detection module is used for detecting whether the hash values of the second data block and the first data block are the same or not, and informing the first data processing module to act when the hash values of the N continuous second data blocks and the N continuous first data blocks are completely the same;
the first data processing module generates a first virtual repetitive object aiming at the N first data blocks, deletes the hash value of the N first data blocks, regenerates a hash value aiming at the N first data blocks and informs the second data processing module to act;
and the second data processing module deletes the N second data blocks and the hash values thereof, and modifies the second attribute information of the N second data blocks so as to be associated with the first virtual repetitive object.
Optionally, a hash value database and an attribute information database are arranged in the storage node;
the hash values are stored in a hash value database, and the second attribute information of the second data block is stored in an attribute information database.
Optionally, the first virtual repetitive object has attribute information and is stored in the attribute information database;
the attribute information of the first virtual repetitive object includes: the virtual repeated object identification and the host object attribute value are the data block identification of the 1 st of the N first data blocks, and the number of the data blocks is N;
the second attribute information of the second data block includes: associating the virtual repetitive object identifier with the position offset value;
the second data processing module modifies the second attribute information of the N second data blocks to make the second attribute information associated with the first virtual repetitive object, and includes:
the second data processing module modifies the associated virtual repetitive object identifier in the second attribute information of the N second data blocks to be associated with the virtual repetitive object identifier of the first virtual repetitive object; and modifying the position offset value in the second attribute information, and associating the position offset value with the host object attribute value to point to the N first data blocks associated with the first virtual repetitive object respectively.
Optionally, when modifying the Mth second data block of the N second data blocks,
the second data processing module copies and stores an Mth first data block corresponding to the Mth second data block in a second file, and modifies the copied Mth first data block; generating a hash value aiming at the copied Mth first data block, storing the hash value in a hash value database, and modifying the second attribute information of the Mth second data block to enable the Mth second data block to point to the copied Mth first data block;
the first data processing module deletes the attribute information of the first virtual repetitive object, generates a second virtual repetitive object aiming at the 1 st to (M-1) th first data blocks before the Mth first data block, generates a hash value aiming at the 1 st to (M-1) th first data blocks and stores the hash value in the hash value database; generating a third virtual repeating object for the (M +1) -N first data blocks after the Mth first data block, generating a hash value for the (M +1) -N first data blocks and storing the hash value in a hash value database;
the second data processing module modifies second attribute information of the 1 st to (M-1) th second data blocks to be associated with the second virtual repetitive object;
and the second data processing module modifies the second attribute information of the (M +1) -N second data blocks to be associated with the third virtual repetitive object.
Optionally, the second data processing module modifies the second attribute information of the mth second data block to: the associated virtual repetitive object identifier is null, and the position offset value is null; the associated data block is the mth first data block.
Optionally, the attribute information of the second virtual repetitive object and the attribute information of the third virtual repetitive object are both stored in the attribute information database;
the first data processing module sets the attribute information of the second virtual repetitive object as: the host object attribute value is the data block identifier of the 1 st data block in the N first data blocks, and the number value of the data block is M-1;
the second data processing module modifies the second attribute information of the 1 st to (M-1) th second data blocks to be associated with the second virtual repetitive object, and includes:
the second data processing module modifies the associated virtual repetitive object identification in the second attribute information of the 1 st to (M-1) th second data blocks to be associated with the virtual repetitive object identification of the second virtual repetitive object; modifying a position offset value in the second attribute information, and associating the position offset value with the host object attribute value of the second virtual repetitive object so as to respectively point to the 1 st to (M-1) th first data blocks associated with the second virtual repetitive object;
the first data processing module sets the attribute information of the third virtual repetitive object as: the attribute value of the host object is the data block identifier of the (M +1) th data block in the N first data blocks, and the number value of the data block is N-M;
the second data processing module modifies the second attribute information of the (M +1) -N second data blocks to be associated with the third virtual repetitive object, and includes:
the second data processing module modifies the associated virtual repetitive object identification in the second attribute information of the (M +1) -N second data blocks to be associated with the virtual repetitive object identification of the third virtual repetitive object; and modifying the position offset value in the second attribute information, and associating the position offset value with the host object attribute value of the third virtual repeating object so as to respectively point to (M +1) -N first data blocks associated with the third virtual repeating object.
Optionally, when the mth second data block of the N modified second data blocks is restored to the original data,
the second data processing module deletes the copied Mth first data block;
the first data processing module deletes the attribute information of the second virtual repetitive object and the third virtual repetitive object, and regenerates the first virtual repetitive object aiming at the 1 st to the N th first data blocks; generating a hash value aiming at the 1 st to the N first data blocks and storing the hash value in a hash value database;
and the second data processing module modifies the second attribute information of the 1 st to N second data blocks to enable the second attribute information to be associated with the recovered first virtual repetitive object.
Optionally, the first virtual repetitive object has attribute information stored in the attribute information database, and the attribute information includes: the attribute value of the host object is the data block identification of the 1 st first data block, and the number value of the data block is N;
the second data processing module modifies the second attribute information of the 1 st to N th second data blocks to associate the recovered first virtual repetitive object, and specifically includes:
the second data processing module modifies the associated virtual repetitive object identifier in the second attribute information of the 1 st to N second data blocks to be associated with the virtual repetitive object identifier of the recovered first virtual repetitive object; and modifying the position offset value in the second attribute information, and associating the position offset value with the host object attribute value to respectively point to the 1 st to N th first data blocks associated with the recovered first virtual repetitive object.
The method has the advantages that when the N first data blocks in the first file and the N second data blocks in the second file are detected to be repeated, the first virtual repeated objects are generated aiming at the N first data blocks, the hash values of the first virtual repeated objects are used for replacing the hash values of the plurality of first data blocks, and the repeated N second data blocks are deleted, so that the data volume of the hash values stored in each same data block in the prior art is reduced, better searching performance is achieved, and the precision of searching the repeated data can be ensured.
Drawings
FIG. 1 is a prior art block level deduplication flow diagram;
FIG. 2 is a first diagram illustrating a method for deleting duplicate data according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating a second method for deleting duplicate data according to an embodiment of the present invention;
FIG. 4 is a third schematic diagram illustrating a method for deleting duplicate data according to an embodiment of the present invention;
FIG. 5 is a flowchart illustrating a first method for deleting duplicate data in an embodiment of the present invention;
FIG. 6 is a flowchart illustrating a method for deleting duplicate data according to an embodiment of the present invention;
FIG. 7 is a flowchart of a method for deleting duplicate data according to an embodiment of the present invention;
fig. 8 is a block diagram of an apparatus for deleting duplicate data according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in detail below with reference to the accompanying drawings by specific embodiments.
In order to solve the problem that the searching efficiency of the hash value is low due to excessive hash values in the prior art, thereby affecting the searching efficiency of the system, an embodiment of the present invention provides a method for deleting duplicate data, which is applied to storing a first file and a second file in a storage node, and as shown in fig. 5, the method includes:
a1, dividing the first file into a plurality of first data blocks with the same size, wherein each first data block has first attribute information and a hash value is generated for each first data block.
a2, dividing the second file into a plurality of second data blocks with the same size as the first data blocks; wherein each of the second data blocks has second attribute information, and a hash value is generated and stored for each of the second data blocks.
The storage node is provided with a hash value database, and the hash value of each first data block and the hash value of each second data block are stored in the hash value database. The hash values generated in the following text are also stored in the hash value database, and are deleted from the hash value database when the hash values are deleted, so that the description thereof is omitted.
In addition, in the embodiment, the first file is used as a basis for comparison, the duplicate data in the second file is deleted, and the second attribute information of the second data block is modified. In practical application, the sequence in which the first file and the second file are stored in the storage node is not limited, so that the duplicate data can be deleted, for example, the duplicate data in the first file can also be deleted, and in this case, the first attribute information of the first data block needs to be modified.
It should be noted that, in the actual storage node, the identification of one data block, the data content in the data block, and the attribute information thereof are stored as one object. All objects are stored in a flat namespace in the storage node (if there is no directory structure). In the present embodiment, for convenience of description, the technical solutions are not described with "object" as an object, but are described with the component levels of "object" (i.e., identification of a data block, data content in the data block, and attribute information of the data block).
In actual application, the data block identifier and the attribute information of the data block are stored in the attribute information keyvalue database of the storage node. More specifically, the key value represents a data block identification; the value represents attribute information of the data block, and one key value corresponds to one value. Wherein a value may be a set of attributes, i.e. a set of attributes.
a3, detecting whether the hash value of the second data block is the same as the hash value of the first data block, when detecting that the hash values of the N consecutive second data blocks are the same as the hash values of the N consecutive first data blocks, entering step a4, and if not detecting that the second data blocks and the first data blocks with the same hash values exist, entering step a 8.
a4, calculating the number N of data blocks with the same hash value.
a5, judging whether the detection is finished, if not, returning to the step a3 to continue the detection; if yes, go to step a 6.
a6, generating a first virtual duplicate object for the N first data chunks, deleting the hash values of the N first data chunks, and regenerating a hash value for the N first data chunks. The attribute information of the first virtual repetitive object includes: the virtual repeated object identifier and the host object attribute value are the data block identifier of the 1 st data block in the N first data blocks, and the number of the data blocks is N.
It should be noted here that the virtual repetitive object generated here only functions like a pointer, and does not include the original N data blocks.
In actual application, the attribute information of the first virtual repetitive object is also stored in the attribute information keyvalue database.
a7, deleting the N second data blocks and the hash values thereof, and modifying the second attribute information of the N second data blocks to be associated with the first virtual repetitive object so as to point to the N first data blocks.
It should be noted that the attribute information of the data block, which includes but is not limited to the identification of the associated virtual repetitive object, the position offset value, and the associated data block, may also include, for example, the owner of the file, the creation date, the last modification date, etc. The following text refers to the attribute information items required in this embodiment, but the attribute information that does not represent the data block includes only these items, and for the remaining general attribute information items, those skilled in the art can obtain them according to the prior art, and this embodiment is not described again.
More specifically, in step a7, the first virtual repetitive object has attribute information, and the attribute information is stored in the attribute information database, and the attribute information includes: the virtual repeated object identification and the host object attribute value are the data block identification of the 1 st of the N first data blocks, and the number of the data blocks is N;
modifying the associated virtual repetitive object identifier and the position offset value in the second attribute information of the second data block;
the modifying the second attribute information of the N second data blocks to make the second attribute information associated with the first virtual repetitive object includes: modifying the associated virtual repetitive object identifications in the second attribute information of the N second data blocks to be associated with the virtual repetitive object identification of the first virtual repetitive object; and modifying the position offset value in the second attribute information, and associating the position offset value with the host object attribute value to point to the N first data blocks associated with the first virtual repetitive object respectively.
Wherein,
modifying the associated virtual repeating object identification of the 1 st second data block to be associated with the virtual repeating object identification of the first virtual repeating object; the position offset value is 0 and is associated with the host object attribute value, the data chunk identification of the 1 st first data chunk, to point to the 1 st first data chunk associated with the first virtual repeating object.
Modifying the associated virtual repeating object identification of the 2 nd second data block to be associated with the virtual repeating object identification of the first virtual repeating object; the position offset value is 1, and is associated with the host object attribute value, i.e. the data block identifier of the 1 st first data block +1 (position offset value), so as to point to the 2 nd first data block associated with the first virtual repetitive object;
……;
modifying the associated virtual repeating object identification of the nth second data block to be associated with the virtual repeating object identification of the first virtual repeating object; the position offset value is N-1 and is associated with the host object attribute value, data chunk identification of the 1 st first data chunk, i.e., data chunk identification of the 1 st first data chunk + (N-1) (position offset value), to point to the Nth first data chunk in the first virtual repeating object.
Referring to fig. 2 and 3, in this embodiment, taking N as an example, the identifiers of the 4 first data blocks of the first file 1 are 10, 11, 12, and 13, respectively; the 4 second data blocks of the second file 2 identify four data blocks 20, 21, 22, and 23 as duplicate data, that is, data blocks 10 and 20 are duplicated, data blocks 11 and 21 are duplicated, data blocks 12 and 22 are duplicated, and data blocks 13 and 23 are duplicated.
The hash values of the first data chunks 10, 11, 12, 13 are deleted in the hash value database and a first virtual repetitive object V1 is generated, which comprises the 4 first data chunks 10, 11, 12, 13 of the first file 1. And generating a hash value aiming at the four first data blocks and storing the hash value in a hash value database. The first attribute information 10', 11', 12', and 13' are respectively associated with 4 first data blocks 10, 11, 12, and 13.
The data chunks and hash values of the second data chunks 20, 21, 22, 23 are deleted from the hash value database, and the second attribute information 20', 21', 22', 23' of the 4 second data chunks is modified to point to the 4 first data chunks 10, 11, 12, 13, respectively.
For the second file 2, specifically taking the second data block 20 as an example, the associated virtual repetitive object identifier of the second data block 20 is modified to be associated with the first virtual repetitive object V1; the position offset value is 0 and is associated with the host object attribute value of the first virtual duplicate object V1 (in this embodiment, the host object attribute value of the first virtual duplicate object V1 is the data chunk identifier 10 of the 1 st first data chunk); i.e. the identification 10+0, is the identification 10 to point to the 1 st first data block 10.
When reading the second file 2, when reading the second data block 20, first obtaining the first virtual repetitive object V1, which is the associated virtual repetitive object identifier of the second data block 20, and then obtaining the host object attribute value identifier 10 of the first virtual repetitive object V1; and finally obtaining the data of the first data block 10 according to the position deviation value 0.
Taking the second data block 21 as an example, the associated virtual repetitive object identification of the second data block 21 is modified to be associated with the first virtual repetitive object V1; the position offset value is 1 and is associated with the host object attribute value of the first virtual repetitive object V1 (i.e. the data chunk identification 10 of the 1 st first data chunk), i.e. identification 10+1 is identification 11, to point to the 2 nd first data chunk 11.
When reading the second data block 21, first obtaining an associated virtual repetitive object identifier of the second data block 21, namely a first virtual repetitive object V1, and then obtaining a host object attribute value identifier 10 of the first virtual repetitive object V1; and finally obtaining the data of the first data block 11 according to the position deviation value 1.
……;
By analogy, this embodiment is not listed.
As can be seen from the above example, according to the associated virtual repetitive object identifier and the position offset value in the second attribute information of the second data block, the data information of the first data block associated with the second data block can be obtained, so as to delete the repetitive data.
a8, storing the first data blocks and the second data blocks in the storage nodes respectively.
As can be seen from the above discussion, in the embodiment of the present invention, when it is detected that N first data blocks in a first file and N second data blocks in a second file are repeated, a first virtual duplicate object is generated for the N first data blocks, the hash value of the first virtual duplicate object is used to replace the hash values of the plurality of first data blocks, and the repeated N second data blocks are deleted, so that the data amount of storing hash values for each identical block in the prior art is reduced, better search performance is achieved, and the accuracy of searching for duplicate data can be ensured.
When the second data chunk is modified, it results in the fission and merging of the virtual repetitive object, as described in detail below.
When modifying the mth second data block of the N second data blocks, see fig. 6:
b0, judging whether the Mth second data block is associated with the first virtual repetitive object V1, if so, entering the step b 1.
b1, copying and storing the Mth first data block corresponding to the Mth second data block in a second file, and modifying the copied Mth first data block; generating a hash value aiming at the copied Mth first data block and storing the hash value in a hash value database;
and modifying the second attribute information of the Mth second data block to point to the copied Mth first data block.
Specifically, modifying the second attribute information of the mth second data block is: and the associated virtual repeated object identifier is null, the position deviation value is null, and the associated data block is the Mth first data block.
b2, deleting the attribute information of the first virtual repetitive object, generating a second virtual repetitive object aiming at the 1 st to the (M-1) th first data blocks before the Mth first data block, generating a hash value aiming at the 1 st to the (M-1) th first data blocks and storing the hash value in the hash value database; and generating a third virtual repeating object for the (M +1) -N first data blocks after the Mth first data block, and generating a hash value for the (M +1) -N first data blocks and storing the hash value in the hash value database.
The attribute information of the second virtual repetitive object and the attribute information of the third virtual repetitive object are both stored in the attribute information database, and the attribute information of the second virtual repetitive object is set as follows: the virtual repetitive object identifier is V2, the attribute value of the host object is the identifier of the 1 st data block in the N first data blocks, and the number of the data blocks is M-1; setting the attribute information of the third virtual repetitive object as follows: the virtual repetitive object identifier is V3, the attribute value of the host object is the identifier of the (M +1) th data block in the N first data blocks, and the number of the data blocks is N-M.
It should be noted that the attribute information of the deleted first virtual duplicate object includes not only the identifier of the deleted virtual duplicate object, the attribute value of the host object, the number of data blocks, and the like, but also other attribute values associated with the first virtual duplicate object, such as the total number of objects in the storage medium, and the like. For the deletion process, those skilled in the art can know from the technical content of the present disclosure that the deletion process is not substantially changed in the embodiment of the present invention, and therefore, the description thereof is omitted here.
b3, modifying the second attribute information of the 1 st to (M-1) th second data blocks to be associated with the second virtual repetitive object; and modifying the second attribute information of the (M +1) -N second data blocks to enable the second attribute information to be associated with a third virtual repeating object.
Specifically, modifying the second attribute information of the 1 st to (M-1) th second data blocks so as to be associated with the second virtual repetitive object includes: modifying an associated virtual repetitive object identification in second attribute information of the 1 st to (M-1) th second data blocks to be associated with the virtual repetitive object identification of the second virtual repetitive object; and modifying a position offset value in the second attribute information, and associating the position offset value with the host object attribute value of the second virtual repeating object so as to respectively point to the 1 st to (M-1) th first data blocks associated with the second virtual repeating object.
Wherein,
modifying the associated virtual repeating object identification of the 1 st second data block to be associated with the virtual repeating object identification of the second virtual repeating object; the position offset value is 0, and is associated with the host object attribute value, namely the data block identifier of the 1 st data block in the N first data blocks, so as to point to the 1 st data block associated with the second virtual repetitive object;
modifying the associated virtual repeating object identification of the 2 nd second data block to be associated with the virtual repeating object identification of the second virtual repeating object; the position offset value is 1, and is associated with the host object attribute value, namely the data block identifier of the 1 st data block in the N first data blocks, the position offset value is +1 for the data block identifier of the 1 st data block to point to the 2 nd first data block associated with the second virtual repetitive object;
……;
modifying the associated virtual repeating object identification of the M-1 th second data chunk to be associated with the virtual repeating object identification of the second virtual repeating object; and the position offset value is M-2, and is associated with the host object attribute value, namely the data block identifier of the 1 st data block in the N first data blocks, the position offset value is + M-2 for the data block identifier of the 1 st data block so as to point to the M-1 st data block associated with the second virtual repetitive object.
Modifying the second attribute information of the (M +1) -N second data blocks to be associated with a third virtual repetitive object, including:
modifying associated virtual repeating object identifications in second attribute information of (M +1) -N second data blocks to be associated with the virtual repeating object identification of the third virtual repeating object; and modifying the position offset value in the second attribute information, and associating the position offset value with the host object attribute value of the third virtual repeating object so as to respectively point to (M +1) -N first data blocks associated with the third virtual repeating object.
Wherein,
modifying the associated virtual repeating object identification of the M +1 th second data block to be associated with the virtual repeating object identification of the second virtual repeating object; the position offset value is 0, and is associated with the host object attribute value, namely the data block identifier of the (M +1) th data block in the N first data blocks, so as to point to the (M +1) th first data block associated with the third virtual repeating object;
modifying the associated virtual repeating object identification of the M +2 th second data block to be associated with the virtual repeating object identification of the second virtual repeating object; the position offset value is 1, and is associated with the host object attribute value, namely the data block identifier of the (M +1) th data block in the N first data blocks, so as to point to the (M + 2) th first data block associated with the third virtual repeating object;
……;
modifying the associated virtual repeating object identification of the nth second data block to be associated with the virtual repeating object identification of the second virtual repeating object; the position offset value is N-M-1 and is associated with the host object attribute value, the data chunk identification of the M +1 th of the N first data chunks, to point to the Nth first data chunk in the third virtual repeating object.
Referring to fig. 4, when the second data block 22 is modified,
and copying the first data block 12 corresponding to the second data block 22 to generate a data block 22 ", modifying the copied data block 22", generating a hash value for the copied data block 22 "and storing the hash value in a hash value database.
Generating a second virtual repetitive object V2 for the data blocks 10 and 11 before the first data block 12, and generating a hash value for the data blocks 10 and 11 and storing the hash value in a hash value database; a third virtual duplicate object V3 is generated for data chunks 13 subsequent to the first data chunk 12 and a hash value is generated for the data chunk 13 and stored in the hash value database. And setting the attribute information of the second virtual repetitive object V2 as follows: the attribute value of the host object is a data block identifier 10 of the 1 st first data block, and the number value of the data block is 2; setting the attribute information of the third virtual repetitive object V3 as follows: the attribute value of the host object is the data block identifier 13 of the 4 th first data block, and the number of the data blocks is 1.
Wherein the associated virtual repetitive object identification of the second data chunk 20 is modified to be associated with the second virtual repetitive object V2; the position offset value is 0, and is associated with the host object attribute value of the second virtual repetitive object V2 (in this embodiment, the host object attribute value of the first virtual repetitive object V1 is the data block identifier 10 of the 1 st first data block), that is, the identifier 10+0 is the identifier 10, so as to point to the 1 st first data block 10 associated with the second virtual repetitive object V2;
modifying the associated virtual repeating object identification of the second data chunk 21 to be associated with a second virtual repeating object V2; the position offset value is 1, and is associated with the host object attribute value of the second virtual repetitive object V2 (in this embodiment, the host object attribute value of the second virtual repetitive object V2 is the data block identifier 10 of the 1 st first data block), that is, the identifier 10+1 is the identifier 11, so as to point to the 2 nd first data block 11 associated with the second virtual repetitive object V2;
modifying the associated virtual repeating object identification of the second data chunk 23 to be associated with a third virtual repeating object V3; the position offset value is 0, and is associated with the host object attribute value of the third virtual repetitive object V3 (in this embodiment, the host object attribute value of the third virtual repetitive object V3 is the data block identifier 13 of the 4 th first data block), that is, the identifier 13+0 is the identifier 13, so as to point to the 1 st first data block 13 associated with the third virtual repetitive object V3.
As can be seen from the above description of the technical solution, when the second data block is modified, the virtual repetitive object is caused to be split from V1 into V2 and V3.
And when the second data block is modified again, namely the data in the modified second data block is restored to the original data, the split virtual repeating objects are merged and restored again. See below:
when the mth second data block of the N modified second data blocks is restored to the original data, refer to fig. 7:
c0, judging whether the Mth second data block in the N second data blocks is recovered, if so, entering the step c 1;
c1, deleting the Mth first data block after copying;
and c2, deleting the attribute information of the second virtual repetitive object and the third virtual repetitive object, recovering and generating the first virtual repetitive object aiming at the 1 st to N first data blocks, and generating a hash value aiming at the 1 st to N first data blocks and storing the hash value in a hash value database.
The deleting of the attribute information of the second virtual repetitive object and the third virtual repetitive object includes not only the virtual repetitive object identifier, the host object attribute value, the data block number value, and the like of the second virtual repetitive object and the third virtual repetitive object, but also other attribute values associated with the second virtual repetitive object and the third virtual repetitive object, such as the object number of the storage node, and the like. For the deletion process, those skilled in the art can know from the technical content of the present disclosure that the deletion process is not substantially changed in the embodiment of the present invention, and therefore, the description thereof is omitted here.
The attribute information of the first virtual repetitive object includes: the host object attribute value is the data block identification of the 1 st first data block, and the number of the data blocks is N.
c3, modifying the second attribute information of the 1 st to N th second data blocks to make the second attribute information associated with the restored first virtual repetitive object.
More specifically, modifying the second attribute information of the 1 st to nth second data blocks to associate the recovered first virtual repetitive object includes:
modifying the associated virtual repetitive object identifier in the second attribute information of the 1 st to N second data blocks to be associated with the virtual repetitive object identifier of the recovered first virtual repetitive object; and modifying the position offset value in the second attribute information, and associating the position offset value with the host object attribute value to respectively point to the 1 st to N th first data blocks associated with the recovered first virtual repetitive object.
Wherein,
modifying the associated virtual repeating object identification of the 1 st second data chunk to be associated with the virtual repeating object identification of the recovered first virtual repeating object; the position offset value is 0, and is associated with the host object attribute value, namely the data block identifier of the 1 st data block in the N first data blocks, so as to point to the 1 st data block associated with the first virtual repetitive object;
……;
modifying the associated virtual repeating object identification of the nth second data block to be associated with the virtual repeating object identification of the recovered first virtual repeating object; the location offset value is N-1 and is associated with the host object attribute value, i.e., the data chunk identification of the (N-1) th of the N first data chunks, to point to the Nth first data chunk with which the first virtual repetitive object is associated.
Referring again to fig. 3, when the second data block 22 is restored to the original data,
deleting the copied data block 22';
the attribute information of the second virtual repetitive object V2 and the third virtual repetitive object V3 is deleted, and the generation of the first virtual repetitive object V1 is resumed for the 4 first data blocks 10, 11, 12, 13, and hash values are generated for the 4 first data blocks and stored in the hash value database.
And the second property information 20', 21', 22', 23' of the second data blocks 20, 21, 22, 23 is modified to point to the first data blocks 10, 11, 12, 13, respectively. Wherein,
modifying the associated virtual repeating object identification of said 1 st second data chunk 20 to be associated with said recovered first virtual repeating object V1; the position offset value is 0 and is associated with the host object attribute value of the first virtual repetitive object V1 (i.e., the data chunk identification 10 of the 1 st first data chunk) to point to the 1 st first data chunk 10;
modifying the associated virtual repeating object identification of said 2 nd second data chunk 21 to be associated with said recovered first virtual repeating object V1; the position offset value is 1 and is associated with the host object attribute value of the first virtual repetitive object V1 (i.e. the data chunk identification 10 of the 1 st first data chunk), i.e. identification 10+1 is identification 11, to point to the 2 nd first data chunk 11.
……;
And so on.
It can be seen that when the modified second chunk is restored again, the virtual duplicate objects V2 and V3 are caused to be recombined to generate V1.
As can be seen from the method for deleting duplicate data in this embodiment, when the local data block is modified, the method only causes modification of the virtual duplicate object and copying of the individual data block, and has a small impact on the performance of the system.
In addition, because the deletion is based on the continuous data, the application scenario is not limited to the scenario in which the local duplicate data exists in the file 1 and the file 2, and the technical solution of the present application may also be applied when all of a certain file is the same as part or all of other files, for example, all of the file 1 is the same as part or all of the file 3. Moreover, the method may also be continuously applied to deep elimination, for example, only in a scene where duplicate data exists in the file 1, deletion may also be performed, and a method for deleting duplicate data in a single file is consistent with the above method, so details are not described herein again.
The above is the method of deleting duplicate data according to the present embodiment. The apparatus for deleting duplicate data according to the present embodiment will be described below.
An embodiment of the present invention further provides a device for deleting duplicate data, which is shown in fig. 8 and is disposed on a storage node, where the storage node stores a first file and a second file, and the device includes:
the first dividing module is used for dividing the first file into a plurality of first data blocks with the same size, wherein a hash value is generated aiming at each first data block;
the second dividing module is used for dividing a second file into a plurality of second data blocks with the same size as the first data blocks; wherein each of the second data blocks has second attribute information, and a hash value is generated for each of the second data blocks. It should be noted that in the present embodiment, the first file is used as a basis for comparison, and the duplicate data in the second file is deleted. In practical application, the order in which the first file and the second file are stored on the storage node is not limited, so that the repeated data can be deleted.
A hash value database and an attribute information database are arranged in the storage node; the hash values are stored in a hash value database, and the second attribute information of the second data block is stored in an attribute information database.
In actual application, the data block identifier and the attribute information of the data block are stored in the attribute information keyvalue database of the storage node. More specifically, the key value represents a data block identification; the value represents attribute information of the data block, and one key value corresponds to one value. Wherein a value may be a set of attributes, i.e. a set of attributes.
And the detection module is used for detecting whether the hash values of the second data blocks are the same as the hash values of the first data blocks, and informing the first data processing module to act when the N continuous second data blocks are detected to be completely the same as the hash values of the N continuous first data blocks.
And the first data processing module deletes the hash values of the N first data blocks, generates a first virtual repetitive object aiming at the N first data blocks, generates and stores the hash value aiming at the first virtual repetitive object, and informs the second data processing module of action.
Wherein the attribute information of the first virtual repetitive object includes: the virtual repeated object identifier and the host object attribute value are the data block identifier of the 1 st data block in the N first data blocks, and the number of the data blocks is N. In actual application, the attribute information of the first virtual repetitive object is also stored in the attribute information keyvalue database.
It should be noted here that the virtual repetitive object generated here only functions like a pointer, and does not include the original N data blocks.
And the second data processing module deletes the N second data blocks and the hash values thereof, and modifies the second attribute information of the N second data blocks so as to be associated with the first virtual repetitive object.
In more detail, the modifying, by the second data processing module, the second attribute information of the N second data blocks to be associated with the first virtual repetitive object includes: the second data processing module modifies the associated virtual repetitive object identifier in the second attribute information of the N second data blocks to be associated with the virtual repetitive object identifier of the first virtual repetitive object; and modifying the position offset value in the second attribute information, and associating the position offset value with the host object attribute value to point to the N first data blocks associated with the first virtual repetitive object respectively.
It should be noted that the attribute information of the data block, which includes but is not limited to the identification of the associated virtual repetitive object, the position offset value, and the associated data block, may also include, for example, the owner of the file, the creation date, the last modification date, etc. The following text refers to the attribute information items required in this embodiment, but the attribute information that does not represent the data block includes only these items, and for the remaining general attribute information items, those skilled in the art can obtain them according to the prior art, and this embodiment is not described again.
The attribute information of the first virtual repetitive object includes: the virtual repeated object identification and the host object attribute value are the data block identification of the 1 st of the N first data blocks, and the number of the data blocks is N;
the second attribute information of the second data block includes: associating the virtual repetitive object identification with the position offset value.
With reference to figures 2 and 3 of the drawings,
in this embodiment, taking N as 4 as an example, the identifiers of the 4 first data blocks of the first file 1 are 10, 11, 12, and 13, respectively; the 4 second data blocks of the second file 2 identify four data blocks 20, 21, 22, and 23 as duplicate data, that is, data blocks 10 and 20 are duplicated, data blocks 11 and 21 are duplicated, data blocks 12 and 22 are duplicated, and data blocks 13 and 23 are duplicated.
The first data processing module deletes the hash value of the first data block 10, 11, 12, 13 in the hash value database and generates a first virtual repetitive object V1 including the 4 first data blocks 10, 11, 12, 13 of the first file 1. And generating a hash value aiming at the four first data blocks and storing the hash value in a hash value database. The first attribute information 10', 11', 12', and 13' are respectively associated with 4 first data blocks 10, 11, 12, and 13.
The second data processing module deletes the data blocks and hash values of the second data blocks 20, 21, 22, 23 in the hash value database, and modifies the second attribute information 20', 21', 22', 23' of the 4 second data blocks to point to the 4 first data blocks 10, 11, 12, 13, respectively.
For the second file 2, specifically taking the second data block 20 as an example, the second data processing module modifies the associated virtual repetitive object identifier of the second data block 20 to be associated with the first virtual repetitive object V1; the position offset value is 0 and is associated with the host object attribute value of the first virtual duplicate object V1 (in this embodiment, the host object attribute value of the first virtual duplicate object V1 is the data chunk identifier 10 of the 1 st first data chunk); i.e. the identification 10+0, is the identification 10 to point to the 1 st first data block 10.
When reading the second file 2, when reading the second data block 20, first obtaining the first virtual repetitive object V1, which is the associated virtual repetitive object identifier of the second data block 20, and then obtaining the host object attribute value identifier 10 of the first virtual repetitive object V1; and finally obtaining the data of the first data block 10 according to the position deviation value 0.
Taking the second data block 21 as an example, the second data processing module modifies the associated virtual repetitive object identification of the second data block 21 to associate with the first virtual repetitive object V1; the position offset value is 1 and is associated with the host object attribute value of the first virtual repetitive object V1 (i.e. the data chunk identification 10 of the 1 st first data chunk), i.e. identification 10+1 is identification 11, to point to the 2 nd first data chunk 11.
When reading the second data block 21, first obtaining an associated virtual repetitive object identifier of the second data block 21, namely a first virtual repetitive object V1, and then obtaining a host object attribute value identifier 10 of the first virtual repetitive object V1; and finally obtaining the data of the first data block 11 according to the position deviation value 1.
……;
By analogy, this embodiment is not listed.
Therefore, according to the associated virtual repetitive object identifier and the position offset value in the second attribute information of the second data block, the data information of the first data block associated with the second data block can be obtained, and the deletion of the repetitive data is realized.
When the second data chunk is modified, it results in the fission and merging of the virtual repetitive object, as described in detail below.
When the mth of the N second data blocks is modified,
the second data processing module judges whether the Mth second data block is associated with a first virtual repetitive object V1, if so, copies and stores the Mth first data block corresponding to the Mth second data block in a second file 2, and modifies the copied Mth first data block; and generating a hash value aiming at the copied Mth first data block and storing the hash value in a hash value database.
And modifying the second attribute information of the Mth second data block to point to the copied Mth first data block.
The first data processing module deletes the attribute information of the first virtual repetitive object, generates a second virtual repetitive object aiming at the 1 st to (M-1) th first data blocks before the Mth first data block, generates a hash value aiming at the 1 st to (M-1) th first data blocks and stores the hash value in the hash value database; and generating a third virtual repeating object for the (M +1) -N first data blocks after the Mth first data block, and generating a hash value for the (M +1) -N first data blocks and storing the hash value in the hash value database.
It should be noted that the deleting of the attribute information of the first virtual repetitive object by the first data processing module includes not only the identifier of the deleted virtual repetitive object, the attribute value of the host object, the number value of the data blocks, and the like, but also other attribute values associated with the first virtual repetitive object, such as the total number of objects in the storage medium, and the like. For the deletion process, those skilled in the art can know from the technical content of the present disclosure that the deletion process is not substantially changed in the embodiment of the present invention, and therefore, the description thereof is omitted here.
The second data processing module modifies second attribute information of the 1 st to (M-1) th second data blocks to be associated with the second virtual repetitive object;
and the second data processing module modifies the second attribute information of the (M +1) -N second data blocks to be associated with the third virtual repetitive object.
In more detail, the present invention is described in more detail,
the second data processing module modifies the second attribute information of the mth second data block to: the associated virtual repetitive object identifier is null, and the position offset value is null; the associated data block is the mth first data block.
The first data processing module sets the attribute information of the second virtual repetitive object as: the second virtual repeated object identifier and the host object attribute value are the data block identifier of the 1 st data block in the N first data blocks, and the number value of the data block is M-1;
the first data processing module sets the attribute information of the third virtual repetitive object as: the third virtual repeated object identifier and the host object attribute value are the data block identifiers of the (M +1) th data block in the N first data blocks, and the number of the data blocks is N-M;
the second data processing module modifies the second attribute information of the 1 st to (M-1) th second data blocks to be associated with the second virtual repetitive object, and includes: the second data processing module modifies the associated virtual repetitive object identification in the second attribute information of the 1 st to (M-1) th second data blocks to be associated with the virtual repetitive object identification of the second virtual repetitive object; and modifying a position offset value in the second attribute information, and associating the position offset value with the host object attribute value of the second virtual repeating object so as to respectively point to the 1 st to (M-1) th first data blocks associated with the second virtual repeating object.
The second data processing module modifies the second attribute information of the (M +1) -N second data blocks to be associated with the third virtual repetitive object, and includes: the second data processing module modifies the associated virtual repeating object identification in the second attribute information of the (M +1) -N second data blocks to be associated with the virtual repeating object identification of the third virtual repeating object; and modifying the position offset value in the second attribute information, and associating the position offset value with the host object attribute value of the third virtual repeating object so as to respectively point to (M +1) -N first data blocks associated with the third virtual repeating object.
Referring to fig. 4, when the second data block 22 is modified,
the second data processing module copies the first data block 12 corresponding to the second data block 22 to generate a data block 22 ", modifies the copied data block 22", generates a hash value for the copied data block 22 ", and stores the hash value in a hash value database.
The first data processing module generates a second virtual repetitive object V2 for the data blocks 10 and 11 before the first data block 12, and generates a hash value for the data blocks 10 and 11 and stores the hash value in the hash value database; a third virtual duplicate object V3 is generated for data chunks 13 subsequent to the first data chunk 12 and a hash value is generated for the data chunk 13 and stored in the hash value database.
The first data processing module sets the attribute information of the second virtual repetitive object V2 as follows: the attribute value of the host object is a data block identifier 10 of the 1 st first data block, and the number value of the data block is 2; setting the attribute information of the third virtual repetitive object V3 as follows: the attribute value of the host object is the data block identifier 13 of the 4 th first data block, and the number of the data blocks is 1.
Wherein the second data processing module modifies the associated virtual repetitive object identification of the second data chunk 20 to be associated with the second virtual repetitive object V2; the position offset value is 0, and is associated with the host object attribute value of the second virtual repetitive object V2 (in this embodiment, the host object attribute value of the first virtual repetitive object V1 is the data block identifier 10 of the 1 st first data block), that is, the identifier 10+0 is the identifier 10, so as to point to the 1 st first data block 10 associated with the second virtual repetitive object V2;
the second data processing module modifies the associated virtual repetitive object identification of the second data block 21 to be associated with the second virtual repetitive object V2; the position offset value is 1, and is associated with the host object attribute value of the second virtual repetitive object V2 (in this embodiment, the host object attribute value of the second virtual repetitive object V2 is the data block identifier 10 of the 1 st first data block), that is, the identifier 10+1 is the identifier 11, so as to point to the 2 nd first data block 11 associated with the second virtual repetitive object V2;
the second data processing module modifies the associated virtual repetitive object identification of the second data block 23 to be associated with the third virtual repetitive object V3; the position offset value is 0, and is associated with the host object attribute value of the third virtual repetitive object V3 (in this embodiment, the host object attribute value of the third virtual repetitive object V3 is the data block identifier 13 of the 4 th first data block), that is, the identifier 13+0 is the identifier 13, so as to point to the 1 st first data block 13 associated with the third virtual repetitive object V3.
As can be seen from the above description of the technical solution, when the second data block is modified, the virtual repetitive object is caused to be split from V1 into V2 and V3.
When the Mth second data block in the N modified second data blocks is restored to original data:
the second data processing module judges whether the Mth second data block in the N second data blocks is recovered or not, and deletes the copied Mth first data block;
the first data processing module deletes the attribute information of the second virtual repetitive object and the third virtual repetitive object and restores and generates the first virtual repetitive object aiming at the 1 st to the N th first data blocks; generating a hash value aiming at the 1 st to the N first data blocks and storing the hash value in a hash value database;
and the second data processing module modifies the second attribute information of the 1 st to N second data blocks to enable the second attribute information to be associated with the recovered first virtual repetitive object.
The first virtual repetitive object has attribute information stored in the attribute information database, and the attribute information includes: the first virtual repetitive object identifier and the host object attribute value are the data block identifiers of the 1 st first data block, and the number of the data blocks is N.
The second data processing module modifies the second attribute information of the 1 st to N th second data blocks to associate the recovered first virtual repetitive object, and specifically includes: the second data processing module modifies the associated virtual repetitive object identifier in the second attribute information of the 1 st to N second data blocks to be associated with the virtual repetitive object identifier of the recovered first virtual repetitive object; and modifying the position offset value in the second attribute information, and associating the position offset value with the host object attribute value to respectively point to the 1 st to N th first data blocks associated with the recovered first virtual repetitive object.
Referring again to fig. 3, when the second data block 22 is restored to the original data,
deleting the copied data block 22';
the first data processing module deletes the attribute information of the second virtual repetitive object V2 and the third virtual repetitive object V3, and resumes generating the first virtual repetitive object V1 for the 4 first data blocks 10, 11, 12, 13, and generates and stores hash values for the 4 first data blocks in the hash value database.
And the second data processing module modifies the second property information 20', 21', 22', 23' of the second data block 20, 21, 22, 23 to point to the first data block 10, 11, 12, 13, respectively. Wherein,
the second data processing module modifies the associated virtual repetitive object identification of said 1 st second data chunk 20 to be associated with said recovered first virtual repetitive object V1; the position offset value is 0 and is associated with the host object attribute value of the first virtual repetitive object V1 (i.e., the data chunk identification 10 of the 1 st first data chunk) to point to the 1 st first data chunk 10;
the second data processing module modifies the associated virtual repetitive object identification of said 2 nd second data block 21 to be associated with said restored first virtual repetitive object V1; the position offset value is 1 and is associated with the host object attribute value of the first virtual repetitive object V1 (i.e. the data chunk identification 10 of the 1 st first data chunk), i.e. identification 10+1 is identification 11, to point to the 2 nd first data chunk 11.
……;
And so on.
It can be seen that when the modified second chunk is restored again, the virtual duplicate objects V2 and V3 are caused to be recombined to generate V1.
As can be seen from the apparatus for deleting duplicate data in this embodiment, when the local data block is modified, the method only causes modification of the virtual duplicate object and duplication of the individual data block, and has a small impact on the performance of the system.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.
Claims (16)
1. A method for deleting duplicate data when a first file and a second file are stored in a storage node, comprising:
dividing a first file into a plurality of first data blocks with the same size, and dividing a second file into a plurality of second data blocks with the same size as the first data blocks; wherein each of the second data blocks has second attribute information, and a hash value is generated for each of the first data blocks and the second data blocks;
detecting whether the hash value of the second data block is identical to the hash value of the first data block, and when detecting that the hash values of the N continuous second data blocks are identical to the hash values of the N continuous first data blocks:
generating a first virtual repetitive object aiming at the N first data blocks, deleting the hash values of the N first data blocks, and regenerating one hash value aiming at the N first data blocks;
and deleting the N second data blocks and the hash values of the N second data blocks, and modifying the second attribute information of the N second data blocks to enable the N second data blocks to be associated with the first virtual repetitive object.
2. The method for deleting the duplicated data according to claim 1, wherein a hash value database and an attribute information database are provided in the storage node;
the hash values are stored in a hash value database, and the second attribute information of the second data block is stored in an attribute information database.
3. The method of deleting duplicate data according to claim 2, wherein the first virtual repeating object has attribute information, and the attribute information is stored in the attribute information database;
the attribute information of the first virtual repetitive object includes: the virtual repeated object identification and the host object attribute value are the data block identification of the 1 st of the N first data blocks, and the number of the data blocks is N;
the second attribute information of the second data block includes: associating the virtual repetitive object identifier with the position offset value;
the modifying the second attribute information of the N second data blocks to make it associated with the first virtual repetitive object includes:
modifying the associated virtual repetitive object identifications in the second attribute information of the N second data blocks to be associated with the virtual repetitive object identification of the first virtual repetitive object; and modifying the position offset value in the second attribute information, and associating the position offset value with the host object attribute value to point to the N first data blocks associated with the first virtual repetitive object respectively.
4. The method of de-duplication according to claim 3, wherein when modifying the Mth of the N second data blocks,
copying and storing an Mth first data block corresponding to the Mth second data block in a second file, and modifying the copied Mth first data block;
generating a hash value aiming at the copied Mth first data block and storing the hash value in a hash value database; modifying the second attribute information of the Mth second data block to point to the copied Mth first data block;
deleting the attribute information of the first virtual repetitive object, generating a second virtual repetitive object aiming at the 1 st to (M-1) th first data blocks before the Mth first data block, generating a hash value aiming at the 1 st to (M-1) th first data blocks and storing the hash value in a hash value database; generating a third virtual repeating object for the (M +1) -N first data blocks after the Mth first data block, generating a hash value for the (M +1) -N first data blocks and storing the hash value in a hash value database;
modifying second attribute information of the 1 st to (M-1) th second data blocks to be associated with the second virtual repetitive object; and modifying the second attribute information of the (M +1) -N second data blocks to be associated with the third virtual repetitive object.
5. The method according to claim 4, wherein the modifying the second attribute information of the mth second data block is: and the associated virtual repeated object identifier is null, the position deviation value is null, and the associated data block is the Mth first data block.
6. The method of deleting duplicate data according to claim 5, wherein the attribute information of the second virtual repeating object and the attribute information of the third virtual repeating object are both stored in the attribute information database;
setting the attribute information of the second virtual repetitive object as follows: the host object attribute value is the data block identifier of the 1 st data block in the N first data blocks, and the number value of the data block is M-1;
modifying second attribute information of the 1 st to (M-1) th second data blocks to be associated with the second virtual repetitive object, including:
modifying an associated virtual repetitive object identification in second attribute information of the 1 st to (M-1) th second data blocks to be associated with the virtual repetitive object identification of the second virtual repetitive object; modifying a position offset value in the second attribute information, and associating the position offset value with the host object attribute value of the second virtual repetitive object so as to respectively point to the 1 st to (M-1) th first data blocks associated with the second virtual repetitive object;
setting the attribute information of the third virtual repetitive object as follows: the attribute value of the host object is the data block identifier of the (M +1) th data block in the N first data blocks, and the number value of the data block is N-M;
modifying the second attribute information of the (M +1) -N second data blocks to be associated with the third virtual repetitive object, including:
modifying associated virtual repeating object identifications in second attribute information of (M +1) -N second data blocks to be associated with the virtual repeating object identification of the third virtual repeating object; and modifying the position offset value in the second attribute information, and associating the position offset value with the host object attribute value of the third virtual repeating object so as to respectively point to (M +1) -N first data blocks associated with the third virtual repeating object.
7. The method for deleting duplicate data according to claim 4, wherein when the Mth second data block of the N modified second data blocks is restored to the original data,
deleting the copied Mth first data block;
deleting the attribute information of the second virtual repetitive object and the third virtual repetitive object, regenerating the 1 st to N th first data blocks into first virtual repetitive objects, generating a hash value aiming at the 1 st to N th first data blocks and storing the hash value in a hash value database;
and modifying the second attribute information of the 1 st to N second data blocks to enable the second attribute information to be associated with the recovered first virtual repetitive object.
8. The method of deleting duplicate data according to claim 7, wherein the first virtual replicated object has attribute information stored in the attribute information database, and wherein the attribute information includes: the attribute value of the host object is the data block identification of the 1 st first data block, and the number value of the data block is N;
modifying the second attribute information of the 1 st to N second data blocks to make the second attribute information associated with the recovered first virtual repetitive object, specifically comprising:
modifying the associated virtual repetitive object identifier in the second attribute information of the 1 st to N second data blocks to be associated with the virtual repetitive object identifier of the recovered first virtual repetitive object; and modifying the position offset value in the second attribute information, and associating the position offset value with the host object attribute value to respectively point to the 1 st to N th first data blocks associated with the recovered first virtual repetitive object.
9. A device for deleting duplicate data is arranged on a storage node, and a first file and a second file are stored on the storage node, and the device is characterized by comprising:
the first dividing module is used for dividing the first file into a plurality of first data blocks with the same size, wherein a hash value is generated aiming at each first data block;
the second dividing module is used for dividing a second file into a plurality of second data blocks with the same size as the first data blocks; wherein each of the second data blocks has second attribute information, and a hash value is generated for each of the second data blocks;
the detection module is used for detecting whether the hash values of the second data block and the first data block are the same or not, and informing the first data processing module to act when the hash values of the N continuous second data blocks and the N continuous first data blocks are completely the same;
the first data processing module generates a first virtual repetitive object aiming at the N first data blocks, deletes the hash value of the N first data blocks, regenerates a hash value aiming at the N first data blocks and informs the second data processing module to act;
and the second data processing module deletes the N second data blocks and the hash values thereof, and modifies the second attribute information of the N second data blocks so as to be associated with the first virtual repetitive object.
10. The apparatus for deleting duplicated data according to claim 9, wherein a hash value database and an attribute information database are provided in said storage node;
the hash values are stored in a hash value database, and the second attribute information of the second data block is stored in an attribute information database.
11. The apparatus for deleting duplicate data according to claim 10, wherein the first virtual duplicate object has attribute information and is stored in the attribute information database;
the attribute information of the first virtual repetitive object includes: the virtual repeated object identification and the host object attribute value are the data block identification of the 1 st of the N first data blocks, and the number of the data blocks is N;
the second attribute information of the second data block includes: associating the virtual repetitive object identifier with the position offset value;
the second data processing module modifies the second attribute information of the N second data blocks to make the second attribute information associated with the first virtual repetitive object, and includes:
the second data processing module modifies the associated virtual repetitive object identifier in the second attribute information of the N second data blocks to be associated with the virtual repetitive object identifier of the first virtual repetitive object; and modifying the position offset value in the second attribute information, and associating the position offset value with the host object attribute value to point to the N first data blocks associated with the first virtual repetitive object respectively.
12. The apparatus for deleting duplicate data according to claim 11, wherein when modifying the Mth one of the N second data blocks,
the second data processing module copies and stores an Mth first data block corresponding to the Mth second data block in a second file, and modifies the copied Mth first data block; generating a hash value aiming at the copied Mth first data block, storing the hash value in a hash value database, and modifying the second attribute information of the Mth second data block to enable the Mth second data block to point to the copied Mth first data block;
the first data processing module deletes the attribute information of the first virtual repetitive object, generates a second virtual repetitive object aiming at the 1 st to (M-1) th first data blocks before the Mth first data block, generates a hash value aiming at the 1 st to (M-1) th first data blocks and stores the hash value in the hash value database; generating a third virtual repeating object for the (M +1) -N first data blocks after the Mth first data block, generating a hash value for the (M +1) -N first data blocks and storing the hash value in a hash value database;
the second data processing module modifies second attribute information of the 1 st to (M-1) th second data blocks to be associated with the second virtual repetitive object;
and the second data processing module modifies the second attribute information of the (M +1) -N second data blocks to be associated with the third virtual repetitive object.
13. The apparatus for deleting duplicate data according to claim 12, wherein the second data processing module modifies the second attribute information of the mth second data chunk to: the associated virtual repetitive object identifier is null, and the position offset value is null; the associated data block is the mth first data block.
14. The apparatus for deleting duplicate data according to claim 13, wherein the attribute information of the second virtual repeating object and the attribute information of the third virtual repeating object are both stored in the attribute information database;
the first data processing module sets the attribute information of the second virtual repetitive object as: the host object attribute value is the data block identifier of the 1 st data block in the N first data blocks, and the number value of the data block is M-1;
the second data processing module modifies the second attribute information of the 1 st to (M-1) th second data blocks to be associated with the second virtual repetitive object, and includes:
the second data processing module modifies the associated virtual repetitive object identification in the second attribute information of the 1 st to (M-1) th second data blocks to be associated with the virtual repetitive object identification of the second virtual repetitive object; modifying a position offset value in the second attribute information, and associating the position offset value with the host object attribute value of the second virtual repetitive object so as to respectively point to the 1 st to (M-1) th first data blocks associated with the second virtual repetitive object;
the first data processing module sets the attribute information of the third virtual repetitive object as: the attribute value of the host object is the data block identifier of the (M +1) th data block in the N first data blocks, and the number value of the data block is N-M;
the second data processing module modifies the second attribute information of the (M +1) -N second data blocks to be associated with the third virtual repetitive object, and includes:
the second data processing module modifies the associated virtual repetitive object identification in the second attribute information of the (M +1) -N second data blocks to be associated with the virtual repetitive object identification of the third virtual repetitive object; and modifying the position offset value in the second attribute information, and associating the position offset value with the host object attribute value of the third virtual repeating object so as to respectively point to (M +1) -N first data blocks associated with the third virtual repeating object.
15. The apparatus for deleting duplicate data according to claim 12, wherein when the Mth second data block of the N second data blocks after modification is restored as the original data,
the second data processing module deletes the copied Mth first data block;
the first data processing module deletes the attribute information of the second virtual repetitive object and the third virtual repetitive object, and regenerates the first virtual repetitive object aiming at the 1 st to the N th first data blocks; generating a hash value aiming at the 1 st to the N first data blocks and storing the hash value in a hash value database;
and the second data processing module modifies the second attribute information of the 1 st to N second data blocks to enable the second attribute information to be associated with the recovered first virtual repetitive object.
16. The apparatus for deleting duplicate data according to claim 15, wherein the first virtual duplicate object has attribute information stored in the attribute information database, and the attribute information comprises: the attribute value of the host object is the data block identification of the 1 st first data block, and the number value of the data block is N;
the second data processing module modifies the second attribute information of the 1 st to N th second data blocks to associate the recovered first virtual repetitive object, and specifically includes:
the second data processing module modifies the associated virtual repetitive object identifier in the second attribute information of the 1 st to N second data blocks to be associated with the virtual repetitive object identifier of the recovered first virtual repetitive object; and modifying the position offset value in the second attribute information, and associating the position offset value with the host object attribute value to respectively point to the 1 st to N th first data blocks associated with the recovered first virtual repetitive object.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201410775122.8A CN104484402B (en) | 2014-12-15 | 2014-12-15 | A kind of method and device of deleting duplicated data |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201410775122.8A CN104484402B (en) | 2014-12-15 | 2014-12-15 | A kind of method and device of deleting duplicated data |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN104484402A true CN104484402A (en) | 2015-04-01 |
| CN104484402B CN104484402B (en) | 2018-02-09 |
Family
ID=52758943
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201410775122.8A Active CN104484402B (en) | 2014-12-15 | 2014-12-15 | A kind of method and device of deleting duplicated data |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN104484402B (en) |
Cited By (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN108153839A (en) * | 2017-12-15 | 2018-06-12 | 北京小米移动软件有限公司 | Document handling method and device |
| CN110427391A (en) * | 2018-04-28 | 2019-11-08 | 伊姆西Ip控股有限责任公司 | Determine the method, equipment and computer program product of repeated data |
| CN111340940A (en) * | 2020-02-23 | 2020-06-26 | 广东明星创意动画有限公司 | Method for searching repeated ARNOLD material and plug-in thereof |
| WO2021082581A1 (en) * | 2019-10-29 | 2021-05-06 | 牛文运 | File system |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US7260659B2 (en) * | 2003-06-30 | 2007-08-21 | Intel Corporation | Rate matching apparatus, systems, and methods |
| CN101989929A (en) * | 2010-11-17 | 2011-03-23 | 中兴通讯股份有限公司 | Disaster recovery data backup method and system |
| CN102629258A (en) * | 2012-02-29 | 2012-08-08 | 浪潮(北京)电子信息产业有限公司 | Repeating data deleting method and device |
| CN103154950A (en) * | 2012-05-04 | 2013-06-12 | 华为技术有限公司 | Repeated data deleting method and device |
-
2014
- 2014-12-15 CN CN201410775122.8A patent/CN104484402B/en active Active
Patent Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US7260659B2 (en) * | 2003-06-30 | 2007-08-21 | Intel Corporation | Rate matching apparatus, systems, and methods |
| CN101989929A (en) * | 2010-11-17 | 2011-03-23 | 中兴通讯股份有限公司 | Disaster recovery data backup method and system |
| CN102629258A (en) * | 2012-02-29 | 2012-08-08 | 浪潮(北京)电子信息产业有限公司 | Repeating data deleting method and device |
| CN103154950A (en) * | 2012-05-04 | 2013-06-12 | 华为技术有限公司 | Repeated data deleting method and device |
Cited By (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN108153839A (en) * | 2017-12-15 | 2018-06-12 | 北京小米移动软件有限公司 | Document handling method and device |
| CN108153839B (en) * | 2017-12-15 | 2021-12-21 | 北京小米移动软件有限公司 | File processing method and device |
| CN110427391A (en) * | 2018-04-28 | 2019-11-08 | 伊姆西Ip控股有限责任公司 | Determine the method, equipment and computer program product of repeated data |
| CN110427391B (en) * | 2018-04-28 | 2023-07-28 | 伊姆西Ip控股有限责任公司 | Method, apparatus and computer program product for determining duplicate data |
| WO2021082581A1 (en) * | 2019-10-29 | 2021-05-06 | 牛文运 | File system |
| CN111340940A (en) * | 2020-02-23 | 2020-06-26 | 广东明星创意动画有限公司 | Method for searching repeated ARNOLD material and plug-in thereof |
| CN111340940B (en) * | 2020-02-23 | 2023-08-04 | 广东明星创意动画有限公司 | Method for searching repeated ARNOLD material and plug-in thereof |
Also Published As
| Publication number | Publication date |
|---|---|
| CN104484402B (en) | 2018-02-09 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN110998537B (en) | Expired backup processing method and backup server | |
| US10162555B2 (en) | Deduplicating snapshots associated with a backup operation | |
| US10545833B1 (en) | Block-level deduplication | |
| US9141633B1 (en) | Special markers to optimize access control list (ACL) data for deduplication | |
| US9424185B1 (en) | Method and system for garbage collection of data storage systems | |
| US9934104B2 (en) | Metadata generation for incremental backup | |
| US8909605B1 (en) | Method and system for accelerating data movement using change information concerning difference between current and previous data movements | |
| US9367448B1 (en) | Method and system for determining data integrity for garbage collection of data storage systems | |
| CN113918385B (en) | Method, device, electronic equipment and medium for online incremental backup and recovery of virtual machine | |
| CN106407224B (en) | A method and device for file compaction in a key-value storage system | |
| KR102187127B1 (en) | Deduplication method using data association and system thereof | |
| JP5650982B2 (en) | Apparatus and method for eliminating file duplication | |
| CN103645970B (en) | Realizing method and device for de-weighting increments among multiple snapshots for remote copy | |
| US11620056B2 (en) | Snapshots for any point in time replication | |
| CN102033924A (en) | Data storage method and system | |
| CN103186652A (en) | Distributed data de-duplication system and method thereof | |
| CN105095027A (en) | Data backup method and apparatus | |
| CN104484402B (en) | A kind of method and device of deleting duplicated data | |
| EP3963465B1 (en) | File system metadata deduplication | |
| US20170351608A1 (en) | Host device | |
| CN113821476B (en) | Data processing method and device | |
| RU2016124319A (en) | METHOD AND DEVICE FOR RESTORING DEDUPLICATED DATA | |
| US20170293531A1 (en) | Snapshot backup | |
| CN120255819A (en) | File processing method and device | |
| US10860608B2 (en) | Any point in time replication to the cloud |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| C06 | Publication | ||
| PB01 | Publication | ||
| C10 | Entry into substantive examination | ||
| SE01 | Entry into force of request for substantive examination | ||
| CB02 | Change of applicant information |
Address after: 310052 Binjiang District Changhe Road, Zhejiang, China, No. 466, No. Applicant after: Xinhua three Technology Co., Ltd. Address before: 310052 Binjiang District Changhe Road, Zhejiang, China, No. 466, No. Applicant before: Huasan Communication Technology Co., Ltd. |
|
| CB02 | Change of applicant information | ||
| GR01 | Patent grant | ||
| GR01 | Patent grant |