
Reduce relying on database locks
Open, Medium, Public

Description

Around 19% of the time spent in write queries is spent locking or waiting for locks on the primary database. This is a scalability bottleneck (we can buy more replicas; we can't buy more primary databases for a section).

grafik.png (attached flame graph, 151 KB)

Maybe these traces can be adjusted? I can think of several options:

  • Reduce the timeout; it's 15 seconds in some cases
  • Build and use a dedicated lock manager? We have one for files, but I don't know how it's used post-Redis.
  • Use PoolCounter instead.
  • Use x2/main stash in some cases

Event Timeline

To spare others the effort of deciphering the charts: it looks like we're talking about the locking that happens in these 4 stack traces:

Wikimedia\Rdbms\Database::lock
Wikimedia\Rdbms\DBConnRef::__call
Wikimedia\Rdbms\DBConnRef::lock
MediaWiki\Storage\PageEditStash::getAndWaitForStashValue
MediaWiki\Storage\PageEditStash::checkCache
MediaWiki\Storage\DerivedPageDataUpdater::prepareContent
MediaWiki\Storage\PageUpdater::prepareUpdate
MediaWiki\EditPage\EditPage::internalAttemptSave
MediaWiki\EditPage\EditPage::attemptSave
MediaWiki\EditPage\EditPage::edit
EditAction::show
SubmitAction::show
MediaWiki\Actions\ActionEntryPoint::performAction
MediaWiki\Actions\ActionEntryPoint::performRequest
MediaWiki\Actions\ActionEntryPoint::execute
MediaWiki\MediaWikiEntryPoint::run
/srv/mediawiki/php-1.43.0-wmf.6/index.php
/srv/mediawiki/w/index.php

Wikimedia\Rdbms\Database::lock
Wikimedia\Rdbms\Database::getScopedLockAndFlush
Wikimedia\Rdbms\DBConnRef::__call
Wikimedia\Rdbms\DBConnRef::getScopedLockAndFlush
CategoryMembershipChangeJob::run
MediaWiki\Extension\EventBus\JobExecutor::execute
/srv/mediawiki/rpc/RunSingleJob.php

Wikimedia\Rdbms\Database::lock
Wikimedia\Rdbms\Database::getScopedLockAndFlush
Wikimedia\Rdbms\DBConnRef::__call
Wikimedia\Rdbms\DBConnRef::getScopedLockAndFlush
MediaWiki\Deferred\LinksUpdate\LinksUpdate::acquirePageLock
MediaWiki\Deferred\LinksUpdate\LinksUpdate::doUpdate
MediaWiki\Deferred\DeferredUpdates::attemptUpdate
MediaWiki\Deferred\RefreshSecondaryDataUpdate::doUpdate
MediaWiki\Deferred\DeferredUpdates::attemptUpdate
MediaWiki\Storage\DerivedPageDataUpdater::doSecondaryDataUpdates
WikiPage::doSecondaryDataUpdates
RefreshLinksJob::runForTitle
RefreshLinksJob::run
MediaWiki\Extension\EventBus\JobExecutor::execute
/srv/mediawiki/rpc/RunSingleJob.php

Wikimedia\Rdbms\Database::lock
Wikimedia\Rdbms\Database::getScopedLockAndFlush
Wikimedia\Rdbms\DBConnRef::__call
Wikimedia\Rdbms\DBConnRef::getScopedLockAndFlush
MediaWiki\Deferred\LinksUpdate\LinksUpdate::acquirePageLock
RefreshLinksJob::runForTitle
RefreshLinksJob::run
MediaWiki\Extension\EventBus\JobExecutor::execute
/srv/mediawiki/rpc/RunSingleJob.php

So the relevant components are DerivedPageDataUpdater, CategoryMembershipChangeJob, RefreshLinksJob and LinksUpdate.

I'm not really familiar with any of them, but maybe this helps us find someone who is.

aaron triaged this task as Medium priority. Jun 13 2024, 3:44 PM
aaron updated the task description.
@Ladsgroup wrote in the task description:

[…]

  • Build and use a dedicated lock manager? We have one for files but I don't know how it's used post redis.
  • Use PoolCounter instead.

See also T161749: Standardise lock management in MediaWiki. For WMF, this should move most locks to PoolCounter. However, "most" locks have timeout=0 and aren't the subject of this task, as they don't wait, and any denied threads have no work to do (whoever holds the lock is doing the work already).
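
For illustration, this is roughly what a PoolCounter-mediated critical section looks like with MediaWiki's existing PoolCounterWorkViaCallback helper (the pool name, key, and doTheExpensiveWork() are made up for this sketch):

```
// Sketch only: pool name, key, and doTheExpensiveWork() are hypothetical.
$work = new PoolCounterWorkViaCallback(
	'ExamplePool',
	"example:page:{$pageId}",
	[
		// Runs when this thread wins (or waits for and then wins) the slot.
		'doWork' => static function () use ( $pageId ) {
			return doTheExpensiveWork( $pageId );
		},
		// Runs when another thread already holds the slot and we don't wait:
		// nothing to do, because whoever holds it is doing the work already.
		'fallback' => static function () {
			return false;
		},
		'error' => static function ( $status ) {
			return false;
		},
	]
);
$result = $work->execute();
```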

This task is specifically about a handful of locks that have a non-zero timeout, presumably because we think each of them has valuable work that would be lost unless it is done now (or retried later).

1. PageEditStash
Wikimedia\Rdbms\Database::lock
…
MediaWiki\Storage\PageEditStash::getAndWaitForStashValue
…
EditAction::show
SubmitAction::show
…
/srv/mediawiki/w/index.php

This is a unique case (https://www.mediawiki.org/wiki/Edit_stash) that is none of the above. It isn't waiting for its turn to do database writes, but waiting for the Parser in the previous API request from JavaScript to finish. That means we want a high timeout here, but it also doesn't need to be in the database. This is a perfect case for PoolCounter, but there's no good way to code this right now because it doesn't fit our current PoolCounter abstraction, and our generic lock mechanisms (BagOStuff lock and Rdbms lock) don't have a way to configure a PoolCounter backend. This is exactly what T161749 is meant to provide.
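
For reference, the waiting pattern here is roughly the following (a simplified sketch, not the actual PageEditStash code; the key name and 15-second timeout are illustrative):

```
use Wikimedia\Rdbms\IDatabase;

/**
 * Sketch of the "wait for the stashing request to finish" pattern.
 * The real logic lives in PageEditStash::getAndWaitForStashValue().
 */
function waitForStashedEdit( IDatabase $dbw, BagOStuff $cache, string $key ) {
	$editInfo = $cache->get( $key );
	if ( !$editInfo ) {
		// The earlier API request that is parsing this edit holds a named DB
		// lock on the same key; block on it so we can reuse its parse output.
		if ( $dbw->lock( $key, __METHOD__, 15 ) ) {
			$editInfo = $cache->get( $key );
			$dbw->unlock( $key, __METHOD__ );
		}
	}
	return $editInfo ?: false;
}
```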

2. CategoryMembershipChangeJob
Wikimedia\Rdbms\Database::lock
…
CategoryMembershipChangeJob::run
…
/srv/mediawiki/rpc/RunSingleJob.php

3. LinksUpdate via RefreshLinksJob (doSecondaryDataUpdates)
Wikimedia\Rdbms\Database::lock
…
MediaWiki\Deferred\LinksUpdate\LinksUpdate::acquirePageLock
…
WikiPage::doSecondaryDataUpdates
…
RefreshLinksJob::run
…
/srv/mediawiki/rpc/RunSingleJob.php

4. LinksUpdate via RefreshLinksJob (main)
Wikimedia\Rdbms\Database::lock
…
MediaWiki\Deferred\LinksUpdate\LinksUpdate::acquirePageLock
…
RefreshLinksJob::run
…
/srv/mediawiki/rpc/RunSingleJob.php

These three are essentially all the same idea. They're de-duplicated jobs that process the "current" revision (LinksUpdate) or a batch of revisions up to the "current" time (CategoryMembershipChangeJob). We already silently discard these at queue time if another is queued for the same page (even if originally for a different revision or timestamp), because when it runs, it'll include or otherwise obsolete whatever came before.
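
For context, queue-time de-duplication is something a job class opts into via the base Job API, roughly like this (a sketch; the class name and which parameters get ignored are illustrative):

```
// Hypothetical job class illustrating the de-duplication hooks in Job.
class ExampleRefreshJob extends Job {
	public function __construct( array $params ) {
		parent::__construct( 'exampleRefresh', $params );
		// Let the queue drop this job if an equivalent one is already queued.
		$this->removeDuplicates = true;
	}

	public function getDeduplicationInfo() {
		$info = parent::getDeduplicationInfo();
		// Ignore revision-specific details so that a newer job for the same
		// page supersedes older ones (illustrative parameter name).
		unset( $info['params']['triggeringRevisionId'] );
		return $info;
	}

	public function run() {
		// ... process the current state of the page ...
		return true;
	}
}
```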

It's not clear to me, given that we already de-dupe these, why they would use a multiple-second timeout as a way to get multiple of them to run (one after the other). I suppose this is mainly about the race condition around when a job has just begun executing and thus holds the lock. A new revision arriving then should not be discarded, since it isn't seen by that job, and declining that workload would result in a state that isn't eventually consistent or otherwise self-correcting. Basically, we want something other than crude "debouncing" (which is what locks give you), but rather "throttling": the first is allowed to run right away, a trailing one will always run as well, and in the middle we want to run only one at a given time (not delayed indefinitely).

I suppose the JobQueue de-duplication already gives us this. From https://wikitech.wikimedia.org/wiki/MediaWiki_JobQueue:

[…] Deduplication based on SHA1. In case the job parameters are the same as some previously executed event, and the similar job was already executed after the event was created - it's skipped. […]

If I'm misunderstanding this, or if we don't want to trust/rely on this for some reason, I suppose one solution could be to shorten the timeout to 0 and, on denied locks, re-queue the job for later. That has a risk of creating a somewhat-obscured chain of hidden retries if it gets denied repeatedly (i.e. hidden because they count as new jobs, not as retries). I think this will be limited, because de-duplication applies to these as well. Either way, this seems preferable to holding job queue workers busy on DB locks like we do today.
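
As a hypothetical sketch of that (the lock key, helper name, and ExampleRefreshJob from the earlier sketch are illustrative, not the actual patch):

```
use MediaWiki\MediaWikiServices;
use Wikimedia\Rdbms\IDatabase;

/**
 * Sketch: try the per-page lock without waiting; if another worker holds it,
 * re-queue a fresh (de-duplicated) job instead of blocking a job runner.
 */
function runWithoutWaiting( IDatabase $dbw, int $pageId ): bool {
	$key = "LinksUpdate:job:pageid:{$pageId}"; // illustrative lock key
	$scopedLock = $dbw->getScopedLockAndFlush( $key, __METHOD__, 0 );
	if ( !$scopedLock ) {
		// Someone else is updating this page right now. Rather than sitting
		// on GET_LOCK, push a new job; queue-level de-duplication keeps the
		// number of re-queued copies bounded.
		MediaWikiServices::getInstance()->getJobQueueGroup()->lazyPush(
			new ExampleRefreshJob( [ 'pageId' => $pageId ] )
		);
		return true;
	}
	// ... do the links update while holding $scopedLock ...
	return true;
}
```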

For CategoryMembershipChangeJob, any redundant instances are cheap because it detects results of previous work and no-ops before doing expensive work.

For RefreshLinksJob, we have page_links_updated (WikiPage::getLinksTimestamp), which is already checked so the job can no-op if another job already finished in the meantime. We could also set triggeringRevisionId to further cancel old jobs once a new revision is current. Both of these should make processing an excess of duplicate jobs cheap (if that were to hypothetically happen, which I'm not yet convinced of).
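
The cheap short-circuit being referred to looks roughly like this (sketch; the function wrapper and exact comparison are illustrative of the check RefreshLinksJob does):

```
/**
 * Sketch: has another job already refreshed the links at or after the point
 * in time this job's data was current? If so, there is nothing left to do.
 */
function linksAlreadyFresh( WikiPage $page, string $jobTimestamp ): bool {
	$linksTimestamp = $page->getLinksTimestamp(); // page_links_updated, TS_MW
	// TS_MW timestamps compare correctly as strings.
	return $linksTimestamp && $linksTimestamp >= $jobTimestamp;
}
```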

Change #1187791 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] Reduce db lock timeout in LinksUpdate and CategoryMembershipChangeJob

https://gerrit.wikimedia.org/r/1187791

Thanks Krinkle for the thorough investigation. That made my life much easier. I suggest we start by reducing the timeout, eventually get it to zero, and maybe later remove it fully. That way we can roll back in case we realize it causes too much data loss or other issues. One thing I'm not 100% sure about is how MariaDB behaves if we set the timeout to 0. The documentation says it will wait forever if the value is negative, but doesn't say anything about zero. I really want to check (especially whether there is a mismatch between MariaDB and MySQL).
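
One way to check that empirically is a throwaway script against a disposable server (host and credentials below are placeholders), racing GET_LOCK with a zero timeout from two connections:

```
<?php
// Does GET_LOCK(..., 0) return immediately when the lock is already held,
// or does it wait? Run against a disposable MariaDB/MySQL instance.
$a = new mysqli( '127.0.0.1', 'test', 'test', 'test' );
$b = new mysqli( '127.0.0.1', 'test', 'test', 'test' );

// Connection A takes the lock.
$a->query( "SELECT GET_LOCK('t366938_check', 0)" );

// Connection B asks for the same lock with timeout 0 and measures how long
// the call takes; an almost-instant 0 result means "don't wait".
$start = microtime( true );
$row = $b->query( "SELECT GET_LOCK('t366938_check', 0) AS got" )->fetch_assoc();
printf( "got=%s after %.3fs\n", $row['got'], microtime( true ) - $start );

$a->query( "SELECT RELEASE_LOCK('t366938_check')" );
```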

I also think this has been contributing to several primary overload outages given what I saw in processlist during the outages. I just never connected the dots until now.

Change #1187791 merged by jenkins-bot:

[mediawiki/core@master] Reduce db lock timeout in LinksUpdate and CategoryMembershipChangeJob

https://gerrit.wikimedia.org/r/1187791

Change #1188285 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@wmf/1.45.0-wmf.18] Reduce db lock timeout in LinksUpdate and CategoryMembershipChangeJob

https://gerrit.wikimedia.org/r/1188285

Change #1188285 merged by jenkins-bot:

[mediawiki/core@wmf/1.45.0-wmf.18] Reduce db lock timeout in LinksUpdate and CategoryMembershipChangeJob

https://gerrit.wikimedia.org/r/1188285

Mentioned in SAL (#wikimedia-operations) [2025-09-15T09:33:48Z] <ladsgroup@deploy1003> Started scap sync-world: Backport: [[gerrit:1188285|Reduce db lock timeout in LinksUpdate and CategoryMembershipChangeJob]] (T366938)

Mentioned in SAL (#wikimedia-operations) [2025-09-15T10:15:54Z] <ladsgroup@deploy1003> Finished scap sync-world: Backport: [[gerrit:1188285|Reduce db lock timeout in LinksUpdate and CategoryMembershipChangeJob]] (T366938) (duration: 42m 06s)

Change #1191126 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] LinksUpdate: Reduce lock timeout for the database even furhter

https://gerrit.wikimedia.org/r/1191126

It looks like the Sept 15 backport did increase "Could not acquire lock" errors. This might not matter, per my analysis, but I haven't looked into these yet; just noting it for now as something to understand/confirm first.

https://logstash.wikimedia.org/goto/a304df8e3d6ec2e1402227a4f1baa4ae

Screenshot 2025-09-25 at 16.11.16.png (attached Logstash screenshot, 124 KB)

It would be nice to see how often the same job hits these errors, e.g. as a histogram. Jobs get up to 3 attempts. Nevertheless, that daily rate looks very low, so I highly doubt that many jobs run out of retries. It seems fine AFAIK.

The number is pretty low IMHO; we have way stranger DB errors that are more common. The gain from this improvement outweighs the small increase (in absolute numbers) in retries or errors. I can try to redo the flame graph to show the impact.

Change #1191126 merged by jenkins-bot:

[mediawiki/core@master] LinksUpdate: Reduce lock timeout for the database even furhter

https://gerrit.wikimedia.org/r/1191126