Skip to content

Conversation

cdrini
Copy link
Collaborator

@cdrini cdrini commented Sep 23, 2025

Instead of a long identifier:(foo OR bar OR ...) query, @ximm is working on adding a new doc_ids parameter that will let us just specify the direct ids! This will fix an issue we've been having where we hit the maximum query length.

Technical

  • Also switches to using the python built-in itertools.batched

Testing

Screenshot

Stakeholders

@cdrini cdrini marked this pull request as ready for review October 3, 2025 01:09
@Copilot Copilot AI review requested due to automatic review settings October 3, 2025 01:09
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR switches to using a new doc_ids parameter for Internet Archive search requests instead of constructing long identifier:(foo OR bar OR ...) queries, which helps avoid hitting maximum query length limits. The change also modernizes the code by using Python's built-in itertools.batched instead of custom batching functions.

  • Replaces custom batch and batch_until_len functions with itertools.batched
  • Updates IA search to use doc_ids parameter instead of complex query construction
  • Removes unnecessary query parameters and adjusts batch sizing

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

import re
import typing
from collections.abc import Iterable, Sized
from collections.abc import Iterable, Sequence
Copy link

Copilot AI Oct 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing import for itertools module which is used on lines 212 and 220.

Copilot uses AI. Check for mistakes.

logger.warning(f"Trying to cache invalid OCAIDs: {invalid_ocaids}")
valid_ocaids = list(set(ocaids) - invalid_ocaids)
batches = list(batch_until_len(valid_ocaids, 3000))
batches = list(itertools.batched(valid_ocaids, 250))
Copy link

Copilot AI Oct 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The batch size has been significantly reduced from 3000 characters to 250 items without explanation. This could lead to many more API requests than necessary, potentially impacting performance. Consider documenting the rationale for this specific batch size or making it configurable.

Copilot uses AI. Check for mistakes.

@cdrini cdrini added the Patch Deployed This PR has been deployed to production independently, outside of the regular deploy cycle. label Oct 3, 2025
@mekarpeles mekarpeles merged commit d8c5903 into internetarchive:master Oct 7, 2025
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Patch Deployed This PR has been deployed to production independently, outside of the regular deploy cycle.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants