Comparing changes

base repository: webrecorder/browsertrix-crawler
base: v1.6.4
...
head repository: webrecorder/browsertrix-crawler
compare: v1.7.0
  • 16 commits
  • 21 files changed
  • 4 contributors

Commits on Jul 1, 2025

  1. base: bump to brave 1.80.113 (#857)

    version: bump to 1.7.0-beta.0
    tests: update deprecated command to work with latest minio
    ikreymer authored Jul 1, 2025
    eb374fa
  2. Add option to save local/sessionStorage (#856)

    If --saveStorage is set, localStorage and sessionStorage will be
    serialized with the WARC record for the page.
    If a page redirects, track what the current page URL is and save storage
    as part of the page's WARC record.
    
    Fixes #855
    ikreymer authored Jul 1, 2025
    687f08b
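
As a rough illustration of what #856 describes, the sketch below gathers both storage areas into one JSON string tagged with the page's current URL (the URL after any redirect). It is not the crawler's code: `StorageLike`, `StorageSnapshot`, `dumpStorage`, `snapshotStorage`, and `memoryStorage` are hypothetical names, and the layout actually written alongside the WARC record may differ.

```typescript
// Hypothetical minimal view of the Web Storage API, so the sketch also runs outside a browser.
interface StorageLike {
  readonly length: number;
  key(index: number): string | null;
  getItem(key: string): string | null;
}

// Hypothetical snapshot shape; the real serialized layout is an assumption.
interface StorageSnapshot {
  url: string;
  localStorage: Record<string, string>;
  sessionStorage: Record<string, string>;
}

// Copy every key/value pair out of a storage area.
function dumpStorage(store: StorageLike): Record<string, string> {
  const entries: Record<string, string> = {};
  for (let i = 0; i < store.length; i++) {
    const key = store.key(i);
    if (key === null) continue;
    const value = store.getItem(key);
    if (value !== null) entries[key] = value;
  }
  return entries;
}

// Serialize both storage areas together with the page's current URL.
function snapshotStorage(
  currentUrl: string,
  local: StorageLike,
  session: StorageLike,
): string {
  const snapshot: StorageSnapshot = {
    url: currentUrl,
    localStorage: dumpStorage(local),
    sessionStorage: dumpStorage(session),
  };
  return JSON.stringify(snapshot);
}

// In-memory stand-in for a browser storage area, for trying the sketch out.
function memoryStorage(items: Record<string, string>): StorageLike {
  const keys = Object.keys(items);
  return {
    length: keys.length,
    key: (index: number) => keys[index] ?? null,
    getItem: (key: string) => (keys.indexOf(key) >= 0 ? items[key] ?? null : null),
  };
}
```

Passing the current URL in explicitly mirrors the commit's note about redirects: the snapshot is attributed to wherever the page ended up, not to the URL that was originally queued.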

Commits on Jul 3, 2025

  1. Support downloading seed file from URL (#852)

    Fixes #841 
    
    Crawler work toward long URL lists in Browsertrix. This PR moves seed
    handling from the arg parser's validation step to the crawler's
    bootstrap step in order to be able to async fetch the seed file from a
    URL.
    
    ---------
    
    Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
    Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
    3 people authored Jul 3, 2025
    2af94ff
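
The move described in #852 can be pictured roughly as follows: seed parsing stays synchronous, while loading becomes an async step that can pull the file from a URL. This is a sketch under stated assumptions, not the crawler's implementation. It assumes one URL per line, and `parseSeedList`, `loadSeeds`, `fetchText`, and `readLocal` are hypothetical names; the I/O helpers are injected so the example does not depend on any particular HTTP client.

```typescript
// Assumption: a seed file holds one URL per line; blank lines are ignored.
function parseSeedList(text: string): string[] {
  const seeds: string[] = [];
  for (const line of text.split(/\r?\n/)) {
    const trimmed = line.trim();
    if (trimmed.length > 0) seeds.push(trimmed);
  }
  return seeds;
}

// Load seeds from either a remote URL or a local path.
// `fetchText` and `readLocal` are hypothetical injected helpers.
async function loadSeeds(
  source: string,
  fetchText: (url: string) => Promise<string>,
  readLocal: (path: string) => string,
): Promise<string[]> {
  const isRemote = /^https?:\/\//i.test(source);
  const text = isRemote ? await fetchText(source) : readLocal(source);
  return parseSeedList(text);
}
```

Because `loadSeeds` is async, it cannot run inside a synchronous argument-validation step, which is the reason the commit gives for moving seed handling into the bootstrap step.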

Commits on Jul 4, 2025

  1. Use consistent profile directory name (merge 1.6.4 change) (#859)

    - Use `TMPDIR/btrixProfile` as consistent profile directory name
    - Avoid accumulation of temp profile dirs if crawler is restarted
    multiple times, e.g. if tmp dir is mapped to /crawls (as it is in
    Browsertrix now), this prevents a proliferation of
    /crawls/tmp/profile-* dirs for each crawler restart
    - change released in 1.6.4, merging into main
    ikreymer authored Jul 4, 2025
    c84f58f

Commits on Jul 8, 2025

  1. async fetch: allow retrying async fetch if interrupted (#863)

    - retry if 'truncated' set, or if size mismatch, or other exception
    occurs
    - retry only for network load and async fetch, not for response fetch
    - set max retries to 2 (same as default for pages currently)
    - fixes #831
    ikreymer authored Jul 8, 2025
    6244515
  2. Support option to fail crawl on content check (#861)

    - add --failOnContentCheck for quick fail if content check in behavior
    fails
    - expose __bx_contentCheckFailed to cause an immediate failure from
    behavior
    - only allow failing crawl due to content check from within
    awaitPageLoad() callback
    - set a 'failReason' key to track that crawl failed due to a particular
    content check reason
    - deps: update to browsertrix-behaviors 0.9.0, update to wabac.js
    (2.23.6)
    - fixes #860
    
    ---------
    Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
    ikreymer and tw4l authored Jul 8, 2025
    549d655
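
The retry rules listed for #863 (retry on a truncated result, a size mismatch, or an exception, with at most 2 retries) can be sketched as a small decision function plus a loop. This is an illustrative model only: the crawler's fetch is asynchronous, whereas this sketch is synchronous to stay short, and `FetchAttempt`, `needsRetry`, `runWithRetries`, and `MAX_ASYNC_FETCH_RETRIES` are hypothetical names.

```typescript
// Hypothetical summary of one fetch attempt.
interface FetchAttempt {
  truncated: boolean; // the body ended early
  expectedSize: number; // announced size, or -1 if unknown (assumption)
  receivedSize: number; // bytes actually received
  failed: boolean; // an exception occurred
}

// "set max retries to 2" per the commit message.
const MAX_ASYNC_FETCH_RETRIES = 2;

// An attempt is retried if it failed, was truncated, or the sizes disagree.
function needsRetry(attempt: FetchAttempt): boolean {
  const sizeMismatch =
    attempt.expectedSize >= 0 && attempt.receivedSize !== attempt.expectedSize;
  return attempt.failed || attempt.truncated || sizeMismatch;
}

// Run one initial attempt plus up to `maxRetries` retries.
function runWithRetries(
  tryOnce: () => FetchAttempt,
  maxRetries: number = MAX_ASYNC_FETCH_RETRIES,
): { attempts: number; ok: boolean } {
  let attempts = 0;
  while (true) {
    attempts++;
    const result = tryOnce();
    if (!needsRetry(result)) return { attempts, ok: true };
    if (attempts > maxRetries) return { attempts, ok: false };
  }
}
```

With the default of 2 retries, a fetch that keeps failing is attempted three times in total before the loop gives up.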

Commits on Jul 21, 2025

  1. Fix docs mistaking --waitUntil with --pageLoadTimeout (#864)

    Fixes #853
    
    Corrects a documentation inaccuracy pointed out by a user
    tw4l authored Jul 21, 2025
    acae515

Commits on Jul 23, 2025

  1. deps update: (#867)

    - bump brave to 1.80.122
    - bump wabac.js to 2.23.8
    - bump RWP to 2.3.15
    - bump browsertrix-behaviors to 0.9.1
    ikreymer authored Jul 23, 2025
    96fd229
  2. url queueing: log skipped URLs as errors if depth === 0 (#868)

    - will ensure seeds from URL list are reported as errors if skipped
    - also set logging context to 'scope' instead of 'links'
    - fixes #866
    
    ---------
    
    Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
    ikreymer and tw4l authored Jul 23, 2025
    1a4341b
  3. Add documentation for --failOnContentCheck and update CLI options in docs (#869)
    
    Related to #860 
    
    This will give us something we can link to from Browsertrix/the
    Browsertrix User Guide for up-to-date information on this option.
    tw4l authored Jul 23, 2025
    66402c2
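
The queueing change in #868 from this group amounts to choosing a log level by depth: a skipped URL at depth 0 is a seed, so it is reported as an error under the 'scope' context. The sketch below is only a model of that rule. `SkipLogEntry` and `describeSkippedUrl` are hypothetical names, the message text is invented for illustration, and the "debug" level for non-seed URLs is an assumption, since the commit message does not say what level they use.

```typescript
type SkipLogLevel = "error" | "debug";

// Hypothetical log entry shape.
interface SkipLogEntry {
  level: SkipLogLevel;
  context: string;
  message: string;
  url: string;
}

function describeSkippedUrl(url: string, depth: number): SkipLogEntry {
  return {
    // depth === 0 means the URL came straight from the seed list.
    level: depth === 0 ? "error" : "debug", // "debug" is an assumption
    context: "scope", // the commit switches the context from 'links' to 'scope'
    message: "Skipping URL", // illustrative text, not the crawler's wording
    url,
  };
}
```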

Commits on Jul 25, 2025

  1. Capitalization fix for log messages (#870)

    Capitalizes "URL" in log messages.
    SuaYoo authored Jul 25, 2025
    bc4d649

Commits on Jul 29, 2025

  1. quickfix: WACZ upload retry support: (#871)

    - if a WACZ upload fails and the crawler restarts on error, exit with
    'interrupt' to allow for automatic restart (e.g. in the Browsertrix
    app)
    - otherwise, a failed upload will exit the crawl with no WACZ, resulting
    in overall crawl failure
    ikreymer authored Jul 29, 2025
    0652a3f
  2. Don't trim to limit if limit is default of 0 (#873)

    Fixes #872 
    
    Fix for restarting crawl from saved state, where the default `--limit`
    value of 0 was incorrectly preventing any URLs from being re-queued.
    tw4l authored Jul 29, 2025
    aba065c
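
The bug fixed in #873 is easy to reproduce in miniature: if a limit of 0 is applied literally, trimming a restored queue to "the first 0 URLs" leaves nothing to re-queue. A minimal sketch of the corrected rule, assuming the trim is a plain slice and treating 0 as "no limit" (`trimToLimit` is a hypothetical name, and the real code may also account for pages already crawled):

```typescript
// Trim a restored queue to `limit` entries, where the default of 0 means "no limit".
function trimToLimit<T>(queued: T[], limit: number): T[] {
  if (limit <= 0) {
    // No limit set: keep everything rather than slicing to zero entries.
    return queued.slice();
  }
  return queued.slice(0, limit);
}
```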

Commits on Jul 30, 2025

  1. behavior logging: remove last line dupe check for behavior logs (#874)

    Shouldn't skip multiple log messages, as this is unexpected behavior for
    user-defined behaviors.
    ikreymer authored Jul 30, 2025
    18fe5a9

Commits on Jul 31, 2025

  1. 5c7ff3d
  2. version: bump to 1.7.0

    ikreymer committed Jul 31, 2025
    a6ad6a0