
Using moveTranslatableBundle.php sometimes causes a lock wait timeout fatal error
Open, Medium, Public

Description

Sometimes, when executing requests like T425503, I get the following kind of output:

[urbanecm@deploy1003 ~]$ mwscript-k8s --attach --sal --comment=T425504 extensions/Translate/scripts/moveTranslatableBundle.php -- --wiki=metawiki --reason="per [[:phab:T425504]]" "ESEAP_Hub_Charter" "ESEAP Hub/Governance/Charter/Previous draft" "Martin Urbanec"
[...]
* Translations:ESEAP Hub Charter/Page display title/en → Translations:ESEAP Hub/Governance/Charter/Previous draft/Page display title/en
* Translations:ESEAP Hub Charter/Page display title/ja → Translations:ESEAP Hub/Governance/Charter/Previous draft/Page display title/ja
* Translations:ESEAP Hub Charter/Page display title/ko → Translations:ESEAP Hub/Governance/Charter/Previous draft/Page display title/ko
* Translations:ESEAP Hub Charter/Page display title/zh → Translations:ESEAP Hub/Governance/Charter/Previous draft/Page display title/zh
---------------
Subpages marked for translation 

No pages found.
---------------
In total 663 pages including 0 subpages and 1 talk page to move.
---------------

Type "MOVE" to begin the move operation: MOVE
Starting page move
Wikimedia\Rdbms\DBQueryError from line 1230 of /srv/mediawiki/php-1.47.0-wmf.1/includes/libs/Rdbms/Database/Database.php: Error 1205: Lock wait timeout exceeded; try restarting transaction
Function: MediaWiki\Revision\RevisionStore::fetchRevisionRowFromConds
Query: SELECT  rev_id,rev_page,rev_timestamp,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,actor_rev_user.actor_user AS `rev_user`,actor_rev_user.actor_name AS `rev_user_text`,rev_actor,comment_rev_comment.comment_text AS `rev_comment_text`,comment_rev_comment.comment_data AS `rev_comment_data`,comment_rev_comment.comment_id AS `rev_comment_cid`,page_namespace,page_title,page_id,page_latest,page_is_redirect,page_len,user_name  FROM `revision` JOIN `actor` `actor_rev_user` ON ((actor_rev_user.actor_id = rev_actor)) JOIN `comment` `comment_rev_comment` ON ((comment_rev_comment.comment_id = rev_comment_id)) JOIN `page` ON ((page_id = rev_page)) LEFT JOIN `user` ON ((actor_rev_user.actor_user != 0) AND (user_id = actor_rev_user.actor_user))   WHERE page_id = 12534985 AND rev_id = 29676073  LIMIT 1   FOR UPDATE

#0 /srv/mediawiki/php-1.47.0-wmf.1/includes/libs/Rdbms/Database/Database.php(1214): Wikimedia\Rdbms\Database->getQueryException('Lock wait timeo...', 1205, 'SELECT  rev_id,...', 'MediaWiki\\Revis...')
#1 /srv/mediawiki/php-1.47.0-wmf.1/includes/libs/Rdbms/Database/Database.php(1188): Wikimedia\Rdbms\Database->getQueryExceptionAndLog('Lock wait timeo...', 1205, 'SELECT  rev_id,...', 'MediaWiki\\Revis...')
#2 /srv/mediawiki/php-1.47.0-wmf.1/includes/libs/Rdbms/Database/Database.php(644): Wikimedia\Rdbms\Database->reportQueryError('Lock wait timeo...', 1205, 'SELECT  rev_id,...', 'MediaWiki\\Revis...', false)
#3 /srv/mediawiki/php-1.47.0-wmf.1/includes/libs/Rdbms/Database/Database.php(1368): Wikimedia\Rdbms\Database->query(Object(Wikimedia\Rdbms\Query), 'MediaWiki\\Revis...')
#4 /srv/mediawiki/php-1.47.0-wmf.1/includes/libs/Rdbms/Database/Database.php(1378): Wikimedia\Rdbms\Database->select(Array, Array, Array, 'MediaWiki\\Revis...', Array, Array)
#5 /srv/mediawiki/php-1.47.0-wmf.1/includes/libs/Rdbms/Database/DBConnRef.php(129): Wikimedia\Rdbms\Database->selectRow(Array, Array, Array, 'MediaWiki\\Revis...', Array, Array)
#6 /srv/mediawiki/php-1.47.0-wmf.1/includes/libs/Rdbms/Database/DBConnRef.php(407): Wikimedia\Rdbms\DBConnRef->__call('selectRow', Array)
#7 /srv/mediawiki/php-1.47.0-wmf.1/includes/libs/Rdbms/QueryBuilder/SelectQueryBuilder.php(809): Wikimedia\Rdbms\DBConnRef->selectRow(Array, Array, Array, 'MediaWiki\\Revis...', Array, Array)
#8 /srv/mediawiki/php-1.47.0-wmf.1/includes/Revision/RevisionStore.php(2355): Wikimedia\Rdbms\SelectQueryBuilder->fetchRow()
#9 /srv/mediawiki/php-1.47.0-wmf.1/includes/Revision/RevisionStore.php(2303): MediaWiki\Revision\RevisionStore->fetchRevisionRowFromConds(Object(Wikimedia\Rdbms\DBConnRef), Array, 3, Array)
#10 /srv/mediawiki/php-1.47.0-wmf.1/includes/Revision/RevisionStore.php(2265): MediaWiki\Revision\RevisionStore->loadRevisionFromConds(Object(Wikimedia\Rdbms\DBConnRef), Array, 3, NULL, Array)
#11 /srv/mediawiki/php-1.47.0-wmf.1/includes/Revision/RevisionStore.php(1282): MediaWiki\Revision\RevisionStore->newRevisionFromConds(Array, 3)
#12 /srv/mediawiki/php-1.47.0-wmf.1/includes/Page/WikiPage.php(726): MediaWiki\Revision\RevisionStore->getRevisionByPageId(12534985, 29676073, 3)
#13 /srv/mediawiki/php-1.47.0-wmf.1/includes/Page/WikiPage.php(758): MediaWiki\Page\WikiPage->loadLastEdit()
#14 /srv/mediawiki/php-1.47.0-wmf.1/includes/Storage/DerivedPageDataUpdater.php(524): MediaWiki\Page\WikiPage->getRevisionRecord()
#15 /srv/mediawiki/php-1.47.0-wmf.1/includes/Storage/PageUpdater.php(444): MediaWiki\Storage\DerivedPageDataUpdater->grabLatestRevision()
#16 /srv/mediawiki/php-1.47.0-wmf.1/includes/Storage/PageUpdater.php(341): MediaWiki\Storage\PageUpdater->grabParentRevision()
#17 /srv/mediawiki/php-1.47.0-wmf.1/includes/Storage/PageUpdater.php(733): MediaWiki\Storage\PageUpdater->setForceEmptyRevision(true)
#18 /srv/mediawiki/php-1.47.0-wmf.1/includes/Page/MovePage.php(965): MediaWiki\Storage\PageUpdater->saveDummyRevision('Martin Urbanec ...', 398)
#19 /srv/mediawiki/php-1.47.0-wmf.1/includes/Page/MovePage.php(665): MediaWiki\Page\MovePage->moveToInternal(Object(MediaWiki\User\User), Object(MediaWiki\Title\Title), 'per [[:phab:T42...', true, Array)
#20 /srv/mediawiki/php-1.47.0-wmf.1/includes/Page/MovePage.php(459): MediaWiki\Page\MovePage->moveUnsafe(Object(MediaWiki\User\User), 'per [[:phab:T42...', true, Array)
#21 /srv/mediawiki/php-1.47.0-wmf.1/extensions/Translate/src/PageTranslation/TranslatableBundleMover.php(393): MediaWiki\Page\MovePage->move(Object(MediaWiki\User\User), 'per [[:phab:T42...', true)
#22 /srv/mediawiki/php-1.47.0-wmf.1/extensions/Translate/src/PageTranslation/TranslatableBundleMover.php(245): MediaWiki\Extension\Translate\PageTranslation\TranslatableBundleMover->move(Object(MediaWiki\Extension\Translate\PageTranslation\TranslatablePage), Object(MediaWiki\User\User), Array, Array, 'per [[:phab:T42...', Object(Closure))
#23 /srv/mediawiki/php-1.47.0-wmf.1/extensions/Translate/src/PageTranslation/MoveTranslatableBundleMaintenanceScript.php(150): MediaWiki\Extension\Translate\PageTranslation\TranslatableBundleMover->moveSynchronously(Object(MediaWiki\Title\Title), Object(MediaWiki\Title\Title), Array, Array, Object(MediaWiki\User\User), 'per [[:phab:T42...', Object(Closure))
#24 /srv/mediawiki/php-1.47.0-wmf.1/maintenance/includes/MaintenanceRunner.php(692): MediaWiki\Extension\Translate\PageTranslation\MoveTranslatableBundleMaintenanceScript->execute()
#25 /srv/mediawiki/php-1.47.0-wmf.1/maintenance/run.php(53): MediaWiki\Maintenance\MaintenanceRunner->run()
#26 /srv/mediawiki/multiversion/MWScript.php(219): require_once('/srv/mediawiki/...')
#27 {main}
[urbanecm@deploy1003 ~]$

I think the fact that I ran the script twice "helped" a bit (the first time, the race condition in mwscript-k8s meant I did not see the list, so I pressed Enter and the script terminated; I then ran it again and it errored out). However, even if that is the cause, the script shouldn't keep holding a lock after it terminates, and it shouldn't fail hard when it cannot acquire a lock.

Waiting a bit and then running the script again sometimes helps, but it is not 100% reliable either.

Event Timeline

Also, the script seems to wait on a lock of some kind at the end, when all pages have been moved but the script hasn't returned yet. For example, this has been on my screen for a couple of minutes already:

(2250/2255) Translations:ESEAP Hub/Community Yellow Pages/Page display title/ko --> Translations:ESEAP Hub/Contact/Community Yellow Pages/Page display title/ko
(2251/2255) Translations:ESEAP Hub/Community Yellow Pages/Page display title/ms --> Translations:ESEAP Hub/Contact/Community Yellow Pages/Page display title/ms
(2252/2255) Translations:ESEAP Hub/Community Yellow Pages/Page display title/ru --> Translations:ESEAP Hub/Contact/Community Yellow Pages/Page display title/ru
(2253/2255) Translations:ESEAP Hub/Community Yellow Pages/Page display title/tl --> Translations:ESEAP Hub/Contact/Community Yellow Pages/Page display title/tl
(2254/2255) Translations:ESEAP Hub/Community Yellow Pages/Page display title/zh --> Translations:ESEAP Hub/Contact/Community Yellow Pages/Page display title/zh
(2255/2255) Talk:ESEAP Hub/Community Yellow Pages --> Talk:ESEAP Hub/Contact/Community Yellow Pages

Not sure how meaningful that is.

siebrand triaged this task as Medium priority.

I'm fiddling around with this issue a bit. Changesets are in https://gerrit.wikimedia.org/r/q/topic:%22T425888%22.

Change #1290927 had a related patch set uploaded (by Siebrand; author: Siebrand):

[mediawiki/extensions/Translate@master] tests: Add integration tests for TranslatableBundleMover

https://gerrit.wikimedia.org/r/1290927

Change #1290940 had a related patch set uploaded (by Siebrand; author: Siebrand):

[mediawiki/extensions/Translate@master] tests: Add locking behavior tests for TranslatableBundleMover

https://gerrit.wikimedia.org/r/1290940

Change #1290946 had a related patch set uploaded (by Siebrand; author: Siebrand):

[mediawiki/extensions/Translate@master] TranslatableBundleMover: Acquire locks in moveSynchronously

https://gerrit.wikimedia.org/r/1290946

Change #1290950 had a related patch set uploaded (by Siebrand; author: Siebrand):

[mediawiki/extensions/Translate@master] TranslatableBundleMover: Remove outer atomic section from move()

https://gerrit.wikimedia.org/r/1290950

Change #1290955 had a related patch set uploaded (by Siebrand; author: Siebrand):

[mediawiki/extensions/Translate@master] TranslatableBundleMover: Add error handling for individual move failures

https://gerrit.wikimedia.org/r/1290955

Five changesets were submitted.

Verification Checklist

After all changesets are merged:

  • Moving a translatable bundle with 500+ pages completes without lock timeout
  • No post-move stall (deferred updates run incrementally)
  • Running the script twice concurrently: second invocation sees locked pages and reports errors gracefully instead of crashing with DBQueryError
  • Individual move failures are logged and don't abort the batch
  • The web UI path (moveAsynchronously) continues to work unchanged
  • MoveTranslatableBundleJob (job queue path) benefits from the same fixes
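
The checklist item about individual move failures could be pictured roughly as below. This is only a language-neutral sketch in Python, not the code from any of the changesets above: `MoveError`, `move_pages`, and the `move_one` callback are hypothetical names, and the real mover is PHP inside the Translate extension.

```python
import logging

logger = logging.getLogger("bundle_move_sketch")


class MoveError(Exception):
    """Hypothetical error raised when a single page move fails."""


def move_pages(pairs, move_one):
    """Move each (source, target) pair; log failures instead of aborting.

    `move_one` is a hypothetical callback that performs one page move and
    raises MoveError on failure (for example, when the page is locked).
    Returns the list of moved pairs and the list of failures.
    """
    moved = []
    failures = []
    for source, target in pairs:
        try:
            move_one(source, target)
            moved.append((source, target))
        except MoveError as exc:
            # Record the failure and continue with the remaining pages.
            logger.warning("Failed to move %s to %s: %s", source, target, exc)
            failures.append((source, target, str(exc)))
    return moved, failures
```

With this shape, the caller can report every failed page at the end of the run rather than stopping at the first exception.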

Out of Scope

The following improvements are intentionally deferred:

  • Retry logic with exponential backoff — With the atomic section removed, lock contention is minimal. Retry can be added if monitoring shows it's still needed.
  • waitForReplication() between moves — In CLI mode, DeferredUpdates will now run between moves, which implicitly paces the writes. Explicit replication waiting can be added if replica lag becomes an issue for very large bundles.
  • Resumable moves — If the script crashes mid-batch, re-running requires manual intervention (some pages already moved). A checkpoint/resume mechanism is a larger feature that can be built separately.
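
For reference, the deferred retry item might look something like the following if it is ever needed. It is a generic Python sketch of retry with exponential backoff under stated assumptions, not MediaWiki code: `LockTimeout`, `retry_with_backoff`, and its parameters are hypothetical names chosen for illustration.

```python
import time


class LockTimeout(Exception):
    """Hypothetical stand-in for a lock wait timeout (such as error 1205)."""


def retry_with_backoff(operation, attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call `operation`, retrying on LockTimeout with doubling delays.

    The delay before retry number n (0-based) is base_delay * 2**n.
    The last failure is re-raised once `attempts` calls have been made.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except LockTimeout:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting.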

Change #1290927 merged by jenkins-bot:

[mediawiki/extensions/Translate@master] tests: Add integration tests for TranslatableBundleMover

https://gerrit.wikimedia.org/r/1290927

Change #1290940 merged by jenkins-bot:

[mediawiki/extensions/Translate@master] tests: Add locking behavior tests for TranslatableBundleMover

https://gerrit.wikimedia.org/r/1290940

Change #1290950 abandoned by Siebrand:

[mediawiki/extensions/Translate@master] TranslatableBundleMover: Remove outer atomic section from move()

https://gerrit.wikimedia.org/r/1290950

Change #1290946 merged by jenkins-bot:

[mediawiki/extensions/Translate@master] TranslatableBundleMover: Acquire locks in moveSynchronously

https://gerrit.wikimedia.org/r/1290946

Change #1290955 abandoned by Siebrand:

[mediawiki/extensions/Translate@master] TranslatableBundleMover: Add error handling for individual move failures

https://gerrit.wikimedia.org/r/1290955