Closed
Describe the bug
WARC files produced with Wget 1.21.1 contain unexpected < >
characters in some metadata values. For instance:
WARC/1.0
WARC-Type: request
WARC-Target-URI: <http://www.archiveteam.org/>
The current record loader does not handle this case for WARC-Target-URI and therefore assigns an empty string ("") to the URL. This does not raise an error, but it affects subsequent operations. For instance, after calling .webgraph(), the results contain empty URLs:
"",murielecamac.blogspot.com,75820
"",lescosaquesdesfrontieres.com,36273
"",poethead.wordpress.com,31716
To Reproduce
- Run:
wget "http://www.archiveteam.org/" --warc-file="at" --no-warc-compression
- Explore the at.warc file
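To make the second step concrete, here is a minimal sketch in plain Python that scans an uncompressed WARC file for bracketed WARC-Target-URI headers. The function name find_bracketed_target_uris is hypothetical and not part of AUT or Wget. It is a naive line scan, so a payload line that happens to start with the same header name would also match.

```python
def find_bracketed_target_uris(lines):
    """Return WARC-Target-URI values that are wrapped in angle brackets.

    `lines` is any iterable of str or bytes lines, e.g. a file opened in
    binary mode. Hypothetical helper for illustration only.
    """
    prefix = "warc-target-uri:"
    found = []
    for raw in lines:
        # Decode leniently so binary payload bytes do not raise.
        line = raw.decode("utf-8", errors="replace") if isinstance(raw, bytes) else raw
        # Header names are matched case-insensitively.
        if not line.lower().startswith(prefix):
            continue
        value = line[len(prefix):].strip()
        if value.startswith("<") and value.endswith(">"):
            found.append(value)
    return found
```

Assuming the wget command above produced at.warc in the working directory, it could be called as `find_bracketed_target_uris(open("at.warc", "rb"))`.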
Expected behavior
A string without the <> characters.
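As a sketch of that expectation, independent of Spark and of AUT's actual record loader, the normalization could look like the following. The function name strip_angle_brackets is an assumption made for illustration.

```python
def strip_angle_brackets(value):
    """Remove one surrounding <...> pair from a header value, if present.

    Hypothetical helper; None, empty strings, and unbracketed values are
    returned unchanged.
    """
    if value and len(value) >= 2 and value[0] == "<" and value[-1] == ">":
        return value[1:-1]
    return value


# The header value from the example record above.
print(strip_angle_brackets("<http://www.archiveteam.org/>"))  # http://www.archiveteam.org/
```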
This is my current Python workaround for the problem. It only works with .webpages():
from aut import *
from pyspark.sql.functions import udf

# Strip a surrounding <...> pair from a URL; None and empty strings pass through unchanged.
url_correction = udf(
    lambda s: s[1:-1] if s and s[0] == '<' and s[-1] == '>' else s
)

# sc, sqlContext, and WARCs_PATH are assumed to be defined already in the session.
WebArchive(sc, sqlContext, WARCs_PATH) \
    .webpages() \
    .withColumn("url", url_correction("url")) \
    .select("url") \
    .show(5)
Environment information
- AUT version: aut-0.90.0
- Env: Google Colab
- Java version: Java 11
- Apache Spark version: 3.0.0
- Apache Spark command used to run AUT: --driver-memory 8 --jars /content/aut-0.90.0-fatjar.jar --py-files /content/aut-0.90.0.zip pyspark-shell