
Upgrade the MediaWiki servers to ICU 72 ☂️
Closed, DeclinedPublic

Description

Goal

In order to upgrade all our MediaWiki clusters to Debian Bookworm, we need to upgrade the current Debian Bullseye installation to use ICU 72. We will test-drive the new process created in T263437 on smaller wikis, and use the old (more disruptive) process for the big ones this time. The aim is to test the new process with a limited blast radius, so that we can confidently use it on all wikis next time, and thus improve user experience.

Based on T419980#11734256, we will use the new process on s3 wikis minus ruwikinews and the old process everywhere else.
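The split described above can be sketched as a small helper. This is only an illustration: the function name `split_by_process` and the sample wiki-to-section mapping are assumptions made for the example, and real section membership would come from the production configuration.

```python
from typing import Dict, FrozenSet, List, Tuple


def split_by_process(
    wiki_sections: Dict[str, str],
    new_process_section: str = "s3",
    excluded: FrozenSet[str] = frozenset({"ruwikinews"}),
) -> Tuple[List[str], List[str]]:
    """Return (new_process_wikis, old_process_wikis), each sorted by name."""
    new_process: List[str] = []
    old_process: List[str] = []
    for wiki, section in sorted(wiki_sections.items()):
        # New process: wikis on the chosen section, minus explicit exclusions.
        if section == new_process_section and wiki not in excluded:
            new_process.append(wiki)
        else:
            old_process.append(wiki)
    return new_process, old_process


# Hypothetical sample mapping, for illustration only.
sample = {"enwiki": "s1", "testwiki": "s3", "ruwikinews": "s3", "aawiki": "s3"}
new_wikis, old_wikis = split_by_process(sample)
```

Keeping the exclusion list as a parameter makes it easy to move another wiki to the old process if needed.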

Roadmap

These need to happen in sequence.

Prep

  1. Prepare packages and production images for ICU 72 upgrade — T419058
  2. [new process] Copy the categorylinks tables — T419980
  3. [new process] enable remote ICU collation writes — T419274
  4. [new process] Migrate collation data to ICU 72 — T419242
  5. [new process] Confirm migration date, sync with DBA and MW Engineering, put "no deployments" into deployment calendar

Day of migration

  1. upgrade production systems/images to the build with ICU 72
  2. scap lock
  3. [new process] swap the tables — T419980
  4. scap unlock
  5. [old process] start collation data migration maintenance script for old process wikis
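As a rough sketch of what step 3 could look like at the SQL level: MariaDB and MySQL allow several renames in a single `RENAME TABLE` statement, which is applied atomically. The helper below only builds such a statement. It is not the procedure agreed with the DBAs; `build_swap_statement` and the `categorylinks_old` name are hypothetical, while `categorylinks_icu72` is the temporary table name mentioned in the SAL entries on this task.

```python
import re

# Conservative check so table names cannot smuggle in extra SQL.
_IDENTIFIER = re.compile(r"^[A-Za-z0-9_]+$")


def build_swap_statement(live: str, staged: str, retired: str) -> str:
    """Build one RENAME TABLE statement that retires `live` and promotes `staged`."""
    names = (live, staged, retired)
    for name in names:
        if not _IDENTIFIER.match(name):
            raise ValueError(f"unsafe table name: {name!r}")
    if len(set(names)) != 3:
        raise ValueError("live, staged and retired must be three distinct tables")
    # Both renames are part of a single statement, so they are applied together.
    return f"RENAME TABLE `{live}` TO `{retired}`, `{staged}` TO `{live}`;"


statement = build_swap_statement(
    live="categorylinks",
    staged="categorylinks_icu72",
    retired="categorylinks_old",  # hypothetical name for the retired table
)
```

Keeping the retired table around under a new name matches the cleanup step of dropping the old table only after a few days on the new version.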

SRE on point: @Raine
SRE backup: @Scott_French
DBA on point: TODO
MW on point: TODO

TODO:

  • monitoring and rollback steps

Cleanup

  • [new process] disable remote writes
  • [old process] monitor the maintenance script
  • after things have been running on the new version for at least a few days:
    • clean up shellbox deployment
    • drop old table

In parallel:

  • Build and test production images for MW
  • CommRel support

Still needs clarification:

  • Discuss this process with DBA: T419980
  • Discuss this process with MW engineers
    • get sign-off on this procedure
    • discuss risks & risk mitigations
    • ask them to look at whether the code for this functionality is still current
    • double-check order of operations, in particular:
      • how to minimize the disruption/funky sorting window
  • Find which wikis on which DB sections actually use a non-standard collation: find_collations.py
    • and put the checklist into the relevant tasks’ runbooks
  • Come up with a way to check the newly written collation data: swap the table and deploy the images somewhere ahead of swapping production, and/or run a one-off sanity-check script
  • When should we upgrade deployment hosts?
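One possible shape for the one-off sanity check mentioned in the list above is sketched here. It assumes rows have already been fetched from the old and new tables as simple tuples. The row layout `(page id, category, sort key)` and the function name `check_migrated_rows` are hypothetical and do not reflect the actual `categorylinks` schema. Sort keys are expected to differ between ICU versions, so the sketch only checks that the same links exist and that no sort key went missing.

```python
from typing import Iterable, List, Tuple

# Hypothetical row layout: (page id, category name, binary sort key).
Row = Tuple[int, str, bytes]


def check_migrated_rows(old_rows: Iterable[Row], new_rows: Iterable[Row]) -> List[str]:
    """Return human-readable problems; an empty list means the check passed."""
    old = {(page, cat): key for page, cat, key in old_rows}
    new = {(page, cat): key for page, cat, key in new_rows}
    problems: List[str] = []
    for link in sorted(old.keys() - new.keys()):
        problems.append(f"missing in new table: {link}")
    for link in sorted(new.keys() - old.keys()):
        problems.append(f"unexpected in new table: {link}")
    for link in sorted(old.keys() & new.keys()):
        # A sort key that was present before should not be empty afterwards.
        if old[link] and not new[link]:
            problems.append(f"empty sort key in new table: {link}")
    return problems
```

A check like this would not validate the ordering itself; confirming that the new sort keys sort correctly would still need a host running the ICU 72 build.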

Out of scope

Upgrading ICU on Beta.

References

T345561 - upgrade using the old process
T329491 - preparation for the upgrade (before it was determined that the old process would be used)
T263437 - implementation of the new process

Event Timeline

Raine updated the task description.

Please keep in mind that there are currently 2 vulnerabilities inside ICU 67.1 (the version currently used).

https://nvd.nist.gov/vuln/detail/CVE-2025-5222
https://nvd.nist.gov/vuln/detail/cve-2020-21913

CVE-2025-5222 affects ICU versions up to (but excluding) 77.1.

Quoting the comment above: "Please keep in mind that there are currently 2 vulnerabilities inside ICU 67.1 (the version currently used)."

Both of these have been fixed in the Debian packaging we use:

Change #1254266 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/puppet@production] k8s: create shellbox-icu72

https://gerrit.wikimedia.org/r/1254266

Change #1254266 abandoned by Kamila Součková:

[operations/puppet@production] k8s: create shellbox-icu72

Reason:

Will go with just another release in one of the existing namespaces.

https://gerrit.wikimedia.org/r/1254266

Change #1254348 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[mediawiki/core@master] DB schema: Create temporary table for ICU upgrade

https://gerrit.wikimedia.org/r/1254348

Change #1254348 abandoned by Kamila Součková:

[mediawiki/core@master] DB schema: Create temporary table for ICU upgrade

Reason:

Not needed for temporary tables.

https://gerrit.wikimedia.org/r/1254348

Mentioned in SAL (#wikimedia-operations) [2026-03-19T10:36:44Z] <Raine> created temporary categorylinks_icu72 tables -- T419980, T419049

Change #1256384 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/mediawiki-config@master] Temporarily add shellbox-icu to $wgShellboxUrls

https://gerrit.wikimedia.org/r/1256384

Change #1256384 merged by jenkins-bot:

[operations/mediawiki-config@master] Temporarily add shellbox-icu to $wgShellboxUrls

https://gerrit.wikimedia.org/r/1256384

Mentioned in SAL (#wikimedia-operations) [2026-03-26T13:17:11Z] <kamila@deploy2002> Started scap sync-world: Backport for [[gerrit:1256384|Temporarily add shellbox-icu to $wgShellboxUrls (T419049 T419242 T419274)]]

Mentioned in SAL (#wikimedia-operations) [2026-03-26T13:19:12Z] <kamila@deploy2002> kamila: Backport for [[gerrit:1256384|Temporarily add shellbox-icu to $wgShellboxUrls (T419049 T419242 T419274)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2026-03-26T13:24:27Z] <kamila@deploy2002> Finished scap sync-world: Backport for [[gerrit:1256384|Temporarily add shellbox-icu to $wgShellboxUrls (T419049 T419242 T419274)]] (duration: 07m 16s)

Change #1261470 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/mediawiki-config@master] Enable $wgTempCategoryCollations for testwiki.

https://gerrit.wikimedia.org/r/1261470

Change #1261470 merged by jenkins-bot:

[operations/mediawiki-config@master] Enable $wgTempCategoryCollations for testwiki.

https://gerrit.wikimedia.org/r/1261470

Mentioned in SAL (#wikimedia-operations) [2026-03-26T20:06:29Z] <kamila@deploy1003> Started scap sync-world: Backport for [[gerrit:1261545|Wrap 'centralauthtoken' in a JWT (T420280)]], [[gerrit:1261470|Enable $wgTempCategoryCollations for testwiki. (T419274 T419049)]]

Mentioned in SAL (#wikimedia-operations) [2026-03-26T20:25:05Z] <kamila@deploy1003> matmarex, kamila: Backport for [[gerrit:1261545|Wrap 'centralauthtoken' in a JWT (T420280)]], [[gerrit:1261470|Enable $wgTempCategoryCollations for testwiki. (T419274 T419049)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2026-03-26T20:44:01Z] <kamila@deploy1003> Finished scap sync-world: Backport for [[gerrit:1261545|Wrap 'centralauthtoken' in a JWT (T420280)]], [[gerrit:1261470|Enable $wgTempCategoryCollations for testwiki. (T419274 T419049)]] (duration: 37m 32s)

Change #1262091 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/mediawiki-config@master] Enable $wgTempCategoryCollations for s3 wikis.

https://gerrit.wikimedia.org/r/1262091

Change #1262091 merged by jenkins-bot:

[operations/mediawiki-config@master] Enable $wgTempCategoryCollations for s3 wikis.

https://gerrit.wikimedia.org/r/1262091

Mentioned in SAL (#wikimedia-operations) [2026-03-30T14:20:56Z] <kamila@deploy1003> Started scap sync-world: Backport for [[gerrit:1262091|Enable $wgTempCategoryCollations for s3 wikis. (T419274 T419049)]]

Mentioned in SAL (#wikimedia-operations) [2026-03-30T14:22:41Z] <kamila@deploy1003> kamila: Backport for [[gerrit:1262091|Enable $wgTempCategoryCollations for s3 wikis. (T419274 T419049)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2026-03-30T14:30:55Z] <kamila@deploy1003> Finished scap sync-world: Backport for [[gerrit:1262091|Enable $wgTempCategoryCollations for s3 wikis. (T419274 T419049)]] (duration: 09m 59s)

Change #1264670 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/mediawiki-config@master] Enable $wgTempCategoryCollations for s3 wikis.

https://gerrit.wikimedia.org/r/1264670

Change #1264670 merged by jenkins-bot:

[operations/mediawiki-config@master] Enable $wgTempCategoryCollations for s3 wikis.

https://gerrit.wikimedia.org/r/1264670

Mentioned in SAL (#wikimedia-operations) [2026-03-30T17:05:53Z] <kamila@deploy1003> Started scap sync-world: Backport for [[gerrit:1264670|Enable $wgTempCategoryCollations for s3 wikis. (T419274 T419049)]]

Mentioned in SAL (#wikimedia-operations) [2026-03-30T17:07:40Z] <kamila@deploy1003> kamila: Backport for [[gerrit:1264670|Enable $wgTempCategoryCollations for s3 wikis. (T419274 T419049)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2026-03-30T17:18:30Z] <kamila@deploy1003> Finished scap sync-world: Backport for [[gerrit:1264670|Enable $wgTempCategoryCollations for s3 wikis. (T419274 T419049)]] (duration: 12m 36s)

Change #1266250 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/deployment-charts@master] shellbox-icu72: Add ClusterIP to TLS cert SANs

https://gerrit.wikimedia.org/r/1266250

Change #1266264 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/mediawiki-config@master] Temporarily add shellbox-icu ClusterIP endpoint

https://gerrit.wikimedia.org/r/1266264

Change #1266278 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[mediawiki/core@master] maintenance/updateCollation: add --source-table

https://gerrit.wikimedia.org/r/1266278

Upon further discussion with DBA, we will fall back to the old process (i.e. just running the maintenance script in place) this time around.

This is because the table swap is, and always will be, a risky DB operation on very large wikis (e.g. enwiki). While this does not prevent the test on smaller wikis, we would have to change the table swap procedure to make it safe for larger wikis. At this point we have tested almost everything other than the table swap, so there is no added value in testing a procedure we won't be able to reuse.

Therefore:

  1. We will fall back to the old process for this ICU upgrade, tracked in T422544.
    • We can reuse a lot of the work done on this task.
  2. We will discuss how to make the new process safe. One option is to move the table swapping logic into MW, so that the DBs are not put at risk at all.
    • By not trying to do this right now, we will avoid further stalling the current upgrade with unknown unknowns.
  3. We will capture all the work, patches, processes and learnings from this task, so that once the above is done, we can directly reuse everything done here.
    • This might be best done by (1) being careful to not delete any information here, and (2) following the subtasks here and creating a runbook on wikitech, either now or during the next ICU upgrade.

Change #1266250 abandoned by Kamila Součková:

[operations/deployment-charts@master] shellbox-icu72: Add ClusterIP to TLS cert SANs

Reason:

T419049 declined

https://gerrit.wikimedia.org/r/1266250

Change #1266264 abandoned by Kamila Součková:

[operations/mediawiki-config@master] Temporarily add shellbox-icu ClusterIP endpoint

Reason:

T419049 declined

https://gerrit.wikimedia.org/r/1266264

Change #1266278 merged by jenkins-bot:

[mediawiki/core@master] maintenance/updateCollation: add --table

https://gerrit.wikimedia.org/r/1266278