Commit 6fd9ebe

Merge pull request #11 from arcalex/issue4_readme
Fix issue #4
2 parents c0a95c8 + 5db00f0 commit 6fd9ebe

1 file changed (+13, -0 lines changed)

README.md

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
1+
# warcrefs
2+
Web archive deduplication tools for identifying duplicates and converting them to references in a web archive collection after crawl time. warcrefs is implemented in Java.
3+
4+
The warcrefs tool takes as input a list of WARC files, together with the post-processed hash manifest lines for the records in those files.
5+
warcrefs iterates through each input WARC file and, concurrently, through the corresponding lines in the post-processed hash manifest.
6+
Each record whose corresponding manifest line has a copy number greater than 1 is converted into a revisit record: WARC-Refers-To-Target-URI and WARC-Refers-To-Date in the record headers are set to the URI and date, respectively, of the original resource, and the payload headers are carried over as-is into the revisit record.
7+
Otherwise, if the copy number is 1, or if no corresponding line is in the manifest, the record is not altered.
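
The per-record rule above can be sketched in Java roughly as follows. This is only an illustration and not the actual warcrefs code: the `ManifestEntry` class, its fields, and the `revisitHeaders` helper are hypothetical names, and the real manifest line format is not described here. The sketch assumes a manifest entry has already been parsed into a copy number plus the URI and date of the original resource, and that a missing manifest line is represented by `null`.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

final class RevisitSketch {

    /** Hypothetical parsed manifest line (the real format is an assumption). */
    static final class ManifestEntry {
        final int copyNumber;
        final String originalUri;
        final String originalDate;

        ManifestEntry(int copyNumber, String originalUri, String originalDate) {
            this.copyNumber = copyNumber;
            this.originalUri = originalUri;
            this.originalDate = originalDate;
        }
    }

    /**
     * Returns the WARC headers to set when a record becomes a revisit record,
     * or an empty Optional when the record must be left unaltered
     * (copy number 1, or no manifest line at all).
     */
    static Optional<Map<String, String>> revisitHeaders(ManifestEntry entry) {
        if (entry == null || entry.copyNumber <= 1) {
            return Optional.empty();
        }
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("WARC-Type", "revisit");
        headers.put("WARC-Refers-To-Target-URI", entry.originalUri);
        headers.put("WARC-Refers-To-Date", entry.originalDate);
        return Optional.of(headers);
    }

    public static void main(String[] args) {
        ManifestEntry duplicate =
            new ManifestEntry(2, "http://example.org/", "2015-01-01T00:00:00Z");
        // A duplicate yields revisit headers; a missing line yields nothing.
        System.out.println(revisitHeaders(duplicate));
        System.out.println(revisitHeaders(null));
    }
}
```

Payload headers are not modelled in this sketch, since they are transferred unchanged.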
8+
9+
warcrefs uses the Java Web Archive Toolkit (JWAT) for WARC file IO.
10+
warcrefs can be configured to rewrite files in place or save to a new file.
11+
warcrefs is to be run on all hosts in the data store.
12+
The post-processed hash manifest is to be split across the hosts such that each host only has lines corresponding to records in WARC files on the host.
13+
Further, as the absence of a manifest line for a record implies the record is not a duplicate, lines where the copy number is 1 are to be omitted to reduce the amount of manifest data warcrefs has to process.
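
As a rough illustration of the manifest preparation described above, the following Java sketch keeps only the lines a given host needs. It is not part of warcrefs: the `Line` class, its `warcFile` and `copyNumber` fields, and the `linesForHost` helper are hypothetical, and it assumes each manifest line can be associated with the name of the WARC file that holds the record.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class ManifestSplitSketch {

    /** Hypothetical parsed manifest line (the real format is an assumption). */
    static final class Line {
        final String warcFile;
        final int copyNumber;

        Line(String warcFile, int copyNumber) {
            this.warcFile = warcFile;
            this.copyNumber = copyNumber;
        }
    }

    /**
     * Keeps only lines for WARC files stored on this host, and drops lines
     * with copy number 1, since a missing line already means "not a duplicate".
     */
    static List<Line> linesForHost(List<Line> manifest, Set<String> localWarcFiles) {
        List<Line> kept = new ArrayList<>();
        for (Line line : manifest) {
            if (line.copyNumber > 1 && localWarcFiles.contains(line.warcFile)) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Line> manifest = Arrays.asList(
            new Line("a.warc.gz", 1),
            new Line("a.warc.gz", 2),
            new Line("b.warc.gz", 3));
        Set<String> local = new HashSet<>(Arrays.asList("a.warc.gz"));
        for (Line line : linesForHost(manifest, local)) {
            System.out.println(line.warcFile + " copy " + line.copyNumber);
        }
    }
}
```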
