Labels
Affects: Operations (affects the IA DevOps folks); Lead: @cdrini (issues overseen by Drini, Staff: Team Lead & Solr, Library Explorer, i18n) [managed]; Priority: 0 (fix now: issue prevents users from using the site or active data corruption) [managed]; Theme: Affiliate API; Type: Post-Mortem (log for when having to resolve a P0 issue)
Description
Summary
Huge haproxy queue (~1,300) after deploy for 2+ hours
We saw the queue fill up at a rate of several hundred requests in under a second. Around 3,200 requests a minute went specifically to /isbn, which is about as much as all labeled bot traffic.
Using obfi (source /opt/openlibrary/scripts/obfi.sh) we saw:
obfi tac | grep /isbn | obfi_count_minute
3362 16/Sep/2025:16:01
3403 16/Sep/2025:16:00
3326 16/Sep/2025:15:59
3264 16/Sep/2025:15:58
3173 16/Sep/2025:15:57
3280 16/Sep/2025:15:56
3415 16/Sep/2025:15:55
3105 16/Sep/2025:15:54
464 16/Sep/2025:15:53
273 16/Sep/2025:15:52
314 16/Sep/2025:15:51
327 16/Sep/2025:15:50
236 16/Sep/2025:15:49
231 16/Sep/2025:15:48
290 16/Sep/2025:15:47
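For readers without access to the obfi helpers, the per-minute tally above can be approximated with a short standalone script. This is only a sketch: it assumes access-log lines with a bracketed timestamp followed by a quoted request line (as in the common nginx "combined" format), which may not match the real log format, and `count_per_minute` and `first_spike` are hypothetical names, not part of obfi.

```python
import re
from collections import Counter

# Assumption: each line contains "[16/Sep/2025:16:01:23 +0000] "GET /path HTTP/1.1"".
# Lines in any other format are silently skipped.
LOG_RE = re.compile(r'\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}):\d{2}[^\]]*\]\s+"\w+\s+(\S+)')


def count_per_minute(lines, path_prefix):
    """Count requests whose path starts with path_prefix, keyed by minute."""
    counts = Counter()
    for line in lines:
        match = LOG_RE.search(line)
        if match and match.group(2).startswith(path_prefix):
            counts[match.group(1)] += 1
    return counts


def first_spike(series, factor=5.0):
    """Return the first minute whose count is >= factor times the mean of
    all earlier minutes, or None. `series` is (minute, count), oldest first."""
    total = 0
    for i, (minute, count) in enumerate(series):
        if i > 0 and count >= factor * (total / i):
            return minute
        total += count
    return None


# The counts reported above, reordered oldest first.
observed = [
    ("16/Sep/2025:15:47", 290), ("16/Sep/2025:15:48", 231),
    ("16/Sep/2025:15:49", 236), ("16/Sep/2025:15:50", 327),
    ("16/Sep/2025:15:51", 314), ("16/Sep/2025:15:52", 273),
    ("16/Sep/2025:15:53", 464), ("16/Sep/2025:15:54", 3105),
    ("16/Sep/2025:15:55", 3415),
]
print(first_spike(observed))
```

With the default factor of 5, the check flags 15:54, the minute where /isbn traffic jumps from a few hundred requests to over 3,000.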
We also noticed lots of requests to /isbn in Sentry, the nginx logs, etc.
- What is wrong?
- What caused it?
A DDoS of requests hitting /isbn, which we corroborated by the large amount of worker time spent in connections to affiliate-server (seen via Sentry, nginx, and Grafana).
- What fixed it?
Returning 429 for every request to the /isbn endpoint via:
location ^~ /isbn/ {
    return 429;
}
- What was the impact?
30% of traffic was 503s
- What could have gone better?
Having docs on how to investigate worker time spent in "other" or in connections.
Having stats on disproportionately expensive endpoints like /isbn
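As an illustration of what such stats could look like, the sketch below groups requests by their first path segment and ranks endpoints by total time spent. It assumes the input is a list of (path, seconds) pairs, for example parsed from logs that record request duration. Both that input shape and the names `endpoint_key` and `summarize` are hypothetical.

```python
from collections import defaultdict


def endpoint_key(path):
    """Reduce a path to its first segment: "/isbn/9780140328721" -> "/isbn"."""
    segment = path.split("?", 1)[0].strip("/").split("/", 1)[0]
    return "/" + segment


def summarize(records):
    """Return (endpoint, count, total_seconds, mean_seconds) rows,
    sorted by total time spent, most expensive first."""
    totals = defaultdict(lambda: [0, 0.0])
    for path, seconds in records:
        entry = totals[endpoint_key(path)]
        entry[0] += 1
        entry[1] += seconds
    rows = [(key, n, total, total / n) for key, (n, total) in totals.items()]
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows


# Made-up durations, purely to show the output shape.
records = [
    ("/isbn/1", 2.0),
    ("/isbn/2", 4.0),
    ("/works/OL1W", 0.1),
    ("/works/OL2W?x=1", 0.1),
    ("/works/OL3W", 0.1),
]
for endpoint, count, total, mean in summarize(records):
    print(f"{endpoint:10} count={count} total={total:.1f}s mean={mean:.2f}s")
```

Ranking by total time rather than by request count is what surfaces an endpoint that is disproportionately expensive even at modest volume.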
- Followup actions:
  - Having stats for expensive endpoints by volume / time, like /isbn
  - Fundamentally fix affiliate-server /isbn flow to not get overwhelmed by traffic
  - Re-enable /isbn
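One common way to keep a slow upstream from being overwhelmed is to shed excess load early instead of letting it queue. The token-bucket sketch below illustrates that idea only. It is not the fix adopted for affiliate-server, and `TokenBucket`, `handle_isbn`, and the rate and capacity values are all hypothetical.

```python
import time


class TokenBucket:
    """Allow up to `rate` requests per second on average, with bursts of
    at most `capacity` requests."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.clock = clock
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def handle_isbn(bucket, lookup):
    """Run the (possibly slow) lookup only if the bucket allows it;
    otherwise answer 429 immediately without touching the upstream."""
    if bucket.allow():
        return 200, lookup()
    return 429, None


bucket = TokenBucket(rate=50, capacity=100)  # illustrative values only
status, body = handle_isbn(bucket, lambda: "record for some ISBN")
print(status)
```

Unlike a blanket `return 429`, a limiter of this kind keeps the endpoint usable for normal traffic while rejecting the excess quickly, so workers are not tied up in upstream connections.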
Steps to close
- Assignment: Is someone assigned to this issue? (notetaker, responder)
- Labels: Is there an Affects: label applied?
- Diagnosis: Add a description and scope of the issue
- Updates: As events unfold, is notable provenance documented in issue comments? (i.e. useful debug commands / steps / learnings / reference links)
- "What caused it?" - please answer in summary
- "What fixed it?" - please answer in summary
- "Followup actions:" actions added to summary
