WARC-Target-URI in Wget warc files is not parsed properly #514

@javieraespinosa

Describe the bug
WARC files produced with Wget 1.21.1 contain unexpected < > characters in some metadata values. For instance:

WARC/1.0
WARC-Type: request
WARC-Target-URI: <http://www.archiveteam.org/>

The current record loader does not handle this case for WARC-Target-URI and therefore assigns "" to URLs. This does not raise an error, but it affects subsequent operations, for instance when calling .webgraph():

"",murielecamac.blogspot.com,75820
"",lescosaquesdesfrontieres.com,36273
"",poethead.wordpress.com,31716

To Reproduce

  1. Run:
wget "http://www.archiveteam.org/" --warc-file="at" --no-warc-compression
  2. Explore the at.warc file
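
One way to do the "explore" step without opening the file by hand is to count how many WARC-Target-URI headers are wrapped in < >. The snippet below is only a rough sketch: the function name `count_target_uri_styles` is made up for illustration, it assumes an uncompressed file (as produced by --no-warc-compression), and it matches the header name case-sensitively at the start of a line, so a payload line that happens to start with the same text would also be counted.

```python
from collections import Counter
from typing import Iterable

HEADER = b"WARC-Target-URI:"


def count_target_uri_styles(lines: Iterable[bytes]) -> Counter:
    """Count bracketed vs. plain WARC-Target-URI values (hypothetical helper)."""
    counts = Counter()
    for raw in lines:
        if not raw.startswith(HEADER):
            continue
        # Drop the header name and surrounding whitespace / CRLF.
        value = raw[len(HEADER):].strip()
        if value.startswith(b"<") and value.endswith(b">"):
            counts["bracketed"] += 1
        else:
            counts["plain"] += 1
    return counts


# Usage, assuming at.warc from step 1 is in the working directory:
#   with open("at.warc", "rb") as warc:
#       print(count_target_uri_styles(warc))
```

A non-zero "bracketed" count would indicate that the file is affected.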

Expected behavior
WARC-Target-URI is parsed to a URL string without the enclosing < > characters.
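
To make the expected behavior concrete, here is a minimal sketch in plain Python of the normalization I have in mind. It is not AUT's record loader code; `normalize_target_uri` and `parse_header_line` are hypothetical names used only for illustration.

```python
from typing import Optional, Tuple


def normalize_target_uri(value: Optional[str]) -> Optional[str]:
    """Strip one pair of enclosing angle brackets, if present (hypothetical helper)."""
    if value is None:
        return None
    value = value.strip()
    if len(value) >= 2 and value[0] == "<" and value[-1] == ">":
        return value[1:-1]
    return value


def parse_header_line(line: str) -> Tuple[str, str]:
    """Split 'Name: value' on the first colon; normalize WARC-Target-URI values."""
    name, _, value = line.partition(":")
    name, value = name.strip(), value.strip()
    if name.lower() == "warc-target-uri":
        value = normalize_target_uri(value)
    return name, value
```

With this, both `<http://www.archiveteam.org/>` and `http://www.archiveteam.org/` would end up as the same URL, and values without brackets are left untouched.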

This is my current Python workaround. It only works with .webpages():

from aut import *
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Strip one pair of enclosing angle brackets, e.g. "<http://a/>" -> "http://a/".
# None, empty, and unbracketed values are returned unchanged.
url_correction = udf(
    lambda s: s[1:-1] if s and s[0] == '<' and s[-1] == '>' else s,
    StringType(),
)

WebArchive(sc, sqlContext, WARCs_PATH) \
  .webpages() \
  .withColumn("url", url_correction("url")) \
  .select("url") \
  .show(5)

Environment information

  • AUT version: aut-0.90.0
  • Env: Google Colab
  • Java version: Java 11
  • Apache Spark version: 3.0.0
  • Apache Spark command used to run AUT: --driver-memory 8 --jars /content/aut-0.90.0-fatjar.jar --py-files /content/aut-0.90.0.zip pyspark-shell
