/isbn endpoint crushing Open Library haproxy queue #11280

@mekarpeles

Summary

Huge haproxy queue (~1,300) after deploy for 2+ hours

We saw the queue fill up at a rate of several hundred requests in under a second. Around 3,200 requests a minute were going specifically to /isbn, which is about as much as all labeled bot traffic.

Using obfi (source /opt/openlibrary/scripts/obfi.sh) we saw:
obfi tac | grep /isbn | obfi_count_minute

   3362 16/Sep/2025:16:01
   3403 16/Sep/2025:16:00
   3326 16/Sep/2025:15:59
   3264 16/Sep/2025:15:58
   3173 16/Sep/2025:15:57
   3280 16/Sep/2025:15:56
   3415 16/Sep/2025:15:55
   3105 16/Sep/2025:15:54
    464 16/Sep/2025:15:53
    273 16/Sep/2025:15:52
    314 16/Sep/2025:15:51
    327 16/Sep/2025:15:50
    236 16/Sep/2025:15:49
    231 16/Sep/2025:15:48
    290 16/Sep/2025:15:47
Image
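
For readers without access to obfi, the per-minute count above can be approximated with a short script. This is a minimal sketch, not the obfi implementation: it assumes log lines in the common nginx "combined" layout (a bracketed timestamp and a quoted request line), which may differ from the actual Open Library log format, and `count_per_minute` is a hypothetical helper name.

```python
import re
from collections import Counter

# Assumption: timestamps look like [16/Sep/2025:16:01:23 +0000].
# The capture group keeps everything down to the minute.
MINUTE_RE = re.compile(r"\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}):\d{2} ")

# Assumption: the request is logged as "METHOD /path HTTP/x.y".
REQUEST_RE = re.compile(r'"[A-Z]+ (\S+) ')


def count_per_minute(lines, path_prefix="/isbn"):
    """Count log lines per minute whose request path starts with path_prefix."""
    counts = Counter()
    for line in lines:
        request = REQUEST_RE.search(line)
        if not request or not request.group(1).startswith(path_prefix):
            continue
        minute = MINUTE_RE.search(line)
        if minute:
            counts[minute.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Made-up sample lines for illustration only, not real traffic.
    sample = [
        '0.0.0.0 - - [16/Sep/2025:16:01:05 +0000] "GET /isbn/1 HTTP/1.1" 200 1',
        '0.0.0.0 - - [16/Sep/2025:16:01:40 +0000] "GET /isbn/2 HTTP/1.1" 200 1',
        '0.0.0.0 - - [16/Sep/2025:16:00:10 +0000] "GET /works/OL1W HTTP/1.1" 200 1',
    ]
    for minute, n in count_per_minute(sample).items():
        print(f"{n:7d} {minute}")
```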

We noticed lots of requests to /isbn in Sentry, the nginx logs, etc.

Here's the before / after:
Image

  • What is wrong?
  • What caused it?

A DDoS of requests hitting /isbn, which we corroborated by the large amount of worker time spent in connections to affiliate-server (seen via Sentry, nginx, and Grafana).

Image
  • What fixed it?
    Completely 429'ing the /isbn endpoint in nginx via:
    location ^~ /isbn/ {
        return 429;
    }
  • What was the impact?
    30% of traffic was 503s

  • What could have gone better?

Having docs on how to investigate worker time spent in "other" or in connections.

Having stats on disproportionately expensive endpoints like /isbn.
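
One way such stats could be produced is by grouping requests by endpoint and ranking endpoints by total time rather than by request count alone. The following is only a sketch: it assumes each request is available as a `(path, seconds)` pair (for example, parsed from a log that records request time), and `endpoint_prefix` and `summarize` are hypothetical names, not existing Open Library tooling.

```python
from collections import defaultdict


def endpoint_prefix(path):
    """Reduce '/isbn/9780140328721' to '/isbn' so requests group by endpoint."""
    path = path.split("?")[0]  # drop any query string
    parts = path.split("/")
    return "/" + parts[1] if len(parts) > 1 and parts[1] else "/"


def summarize(records):
    """Aggregate (path, seconds) pairs per endpoint.

    Returns (endpoint, count, total_seconds, mean_seconds) tuples,
    sorted so the endpoint consuming the most total time comes first.
    """
    totals = defaultdict(lambda: [0, 0.0])  # endpoint -> [count, total seconds]
    for path, seconds in records:
        entry = totals[endpoint_prefix(path)]
        entry[0] += 1
        entry[1] += seconds
    rows = [(ep, n, t, t / n) for ep, (n, t) in totals.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)


if __name__ == "__main__":
    # Made-up timings for illustration only.
    sample = [("/isbn/1", 2.0), ("/isbn/2", 4.0), ("/works/OL1W", 0.5)]
    for endpoint, count, total, mean in summarize(sample):
        print(f"{endpoint:10s} count={count} total={total:.2f}s mean={mean:.2f}s")
```

Sorting by total time is a deliberate choice here: an endpoint with modest volume but slow upstream calls can occupy more worker time than a high-volume cheap one.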

  • Followup actions:
    • Add stats for expensive endpoints like /isbn, by volume and by time
    • Fundamentally fix the affiliate-server /isbn flow so it does not get overwhelmed by traffic
    • Re-enable /isbn
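
As one possible direction for keeping the /isbn flow from being overwhelmed (an illustration only, not the fix the team chose), the number of concurrent calls to affiliate-server could be capped, with excess requests rejected immediately instead of holding a worker. The sketch below assumes a threaded worker model; `ConcurrencyLimiter` and `fetch_from_affiliate_server` are hypothetical names.

```python
import threading


class ConcurrencyLimiter:
    """Admit at most max_in_flight concurrent calls; reject the rest immediately."""

    def __init__(self, max_in_flight):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def run(self, func, *args, **kwargs):
        """Return (200, result) if a slot was free, else (429, None)."""
        # Non-blocking acquire: when no slot is free, shed the request
        # rather than letting it wait and tie up a worker.
        if not self._slots.acquire(blocking=False):
            return 429, None
        try:
            return 200, func(*args, **kwargs)
        finally:
            self._slots.release()


def fetch_from_affiliate_server(isbn):
    """Hypothetical stand-in for the real call to affiliate-server."""
    return {"isbn": isbn}


if __name__ == "__main__":
    limiter = ConcurrencyLimiter(max_in_flight=10)
    status, result = limiter.run(fetch_from_affiliate_server, "9780140328721")
    print(status, result)
```

Unlike a blanket `return 429`, a cap like this would keep serving /isbn up to a fixed concurrency and only shed the overflow; the right limit would need to be measured.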

Steps to close

  1. Assignment: Is someone assigned to this issue? (notetaker, responder)
  2. Labels: Is there an Affects: label applied?
  3. Diagnosis: Add a description and scope of the issue
  4. Updates: As events unfold, is notable provenance documented in issue comments? (i.e. useful debug commands / steps / learnings / reference links)
  5. "What caused it?" - please answer in summary
  6. "What fixed it?" - please answer in summary
  7. "Followup actions:" actions added to summary

Labels

  • Affects: Operations (Affects the IA DevOps folks)
  • Lead: @cdrini (Issues overseen by Drini (Staff: Team Lead & Solr, Library Explorer, i18n) [managed])
  • Priority: 0 (Fix now: Issue prevents users from using the site or active data corruption. [managed])
  • Theme: Affiliate API
  • Type: Post-Mortem (Log for when having to resolve a P0 issue)
