Goal
To upgrade all our MediaWiki clusters to Debian Bookworm, we first need to upgrade the current Debian Bullseye installation to ICU 72. We will test-drive the new process created in T263437 on smaller wikis and use the old (more disruptive) process for the big ones this time. The aim is to test the new process with a limited blast radius, so that we can confidently use it on all wikis next time and thus improve the user experience.
Based on T419980#11734256, we will use the new process on all s3 wikis except ruwikinews, and the old process everywhere else.
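The split described above can be written down as a small rule, which may help when generating per-wiki task lists. This is only a sketch: the helper names (`migration_process`, `split_by_process`) and the sample wiki-to-section mapping are hypothetical, and apart from ruwikinews being on s3 none of the sample data comes from this task.

```python
# Hypothetical helper names; only the rule itself (s3 minus ruwikinews -> new
# process, everything else -> old process) is taken from the plan.
NEW_PROCESS_SECTION = "s3"
NEW_PROCESS_EXCLUDED = frozenset({"ruwikinews"})


def migration_process(wiki: str, section: str) -> str:
    """Return "new" or "old" for a wiki, given the DB section it lives on."""
    if section == NEW_PROCESS_SECTION and wiki not in NEW_PROCESS_EXCLUDED:
        return "new"
    return "old"


def split_by_process(wiki_sections: dict[str, str]) -> dict[str, list[str]]:
    """Group wikis by the process that applies to them, sorted by name."""
    groups: dict[str, list[str]] = {"new": [], "old": []}
    for wiki, section in sorted(wiki_sections.items()):
        groups[migration_process(wiki, section)].append(wiki)
    return groups


if __name__ == "__main__":
    # Placeholder mapping for illustration; real data would come from the
    # section lists used in production.
    sample = {"bigwiki": "s1", "ruwikinews": "s3", "smallwiki": "s3"}
    print(split_by_process(sample))
```

Keeping the exclusion list as data rather than code makes it easy to adjust if further wikis are moved to the old process.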
Roadmap
These need to happen in sequence.
Prep
- Prepare packages and production images for ICU 72 upgrade — T419058
- [new process] Copy the categorylinks tables — T419980
- [new process] Enable remote ICU collation writes — T419274
- [new process] Migrate collation data to ICU 72 — T419242
- [new process] Confirm migration date, sync with DBA and MW Engineering, put "no deployments" into deployment calendar
Day of migration
- upgrade production systems/images to the build with ICU 72
- scap lock
- [new process] swap the tables — T419980
- scap unlock
- [old process] start collation data migration maintenance script for old process wikis
SRE on point: @Raine
SRE backup: @Scott_French
DBA on point: TODO
MW on point: TODO
TODO:
- monitoring and rollback steps
Cleanup
- [new process] disable remote writes
- [old process] monitor the maintenance script
- after things have been running on the new version for at least a few days:
  - clean up shellbox deployment
  - drop old table
In parallel:
- Build and test production images for MW
- CommRel support
Still needs clarification:
- Discuss this process with DBA: T419980
- Discuss this process with MW engineers
  - get sign-off on this procedure
  - discuss risks & risk mitigations
  - ask them to look at whether the code for this functionality is still current
    - in particular: Gerrit change 745754 ("Write to multiple categorylinks tables on update") doesn’t appear to have a unit test for the multiple write?
- double-check order of operations, in particular:
  - how to minimize the disruption/funky sorting window
- Find which wikis on which DB sections actually use a non-standard collation (see find_collations.py)
- and put the checklist into the relevant tasks’ runbooks
- Come up with a way to check the newly written collation data: swap the table and deploy the images somewhere ahead of swapping production, and/or run a one-off sanity-check script
- When should we upgrade deployment hosts?
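For the sanity-check item above, one possible shape for a one-off script is a comparison of rows sampled from the old and the newly written categorylinks tables. The sketch below is an assumption about what "sane" could mean here (same set of links, no empty sort keys). The `Row` type, the function name, and the column names `cl_from`, `cl_to` and `cl_sortkey` are illustrative and may not match the current schema. It works on rows already fetched into memory and does not talk to a database.

```python
from typing import Iterable, NamedTuple


class Row(NamedTuple):
    # Column names are assumptions for illustration only.
    cl_from: int
    cl_to: str
    cl_sortkey: bytes


def check_migrated_rows(old_rows: Iterable[Row], new_rows: Iterable[Row]) -> list[str]:
    """Return a list of human-readable problems; an empty list means no findings."""
    problems: list[str] = []
    old_keys = {(r.cl_from, r.cl_to) for r in old_rows}
    new_by_key = {(r.cl_from, r.cl_to): r for r in new_rows}
    new_keys = set(new_by_key)

    # Every link in the old table should also exist in the new one, and vice versa.
    for key in sorted(old_keys - new_keys):
        problems.append(f"missing in new table: {key}")
    for key in sorted(new_keys - old_keys):
        problems.append(f"unexpected in new table: {key}")

    # A freshly computed sort key is assumed to be non-empty.
    for key in sorted(new_keys):
        if not new_by_key[key].cl_sortkey:
            problems.append(f"empty sortkey for {key}")
    return problems


if __name__ == "__main__":
    old = [Row(1, "A", b"x"), Row(2, "B", b"y")]
    new = [Row(1, "A", b"z"), Row(3, "C", b"")]
    for problem in check_migrated_rows(old, new):
        print(problem)
```

Sort keys are expected to differ between ICU versions, so the sketch deliberately does not compare them byte for byte. Checking the resulting order against a known-good sample would be a separate step.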
Out of scope
Upgrading ICU on Beta.
References
T345561 - upgrade using the old process
T329491 - preparation for the upgrade (before it was determined that the old process would be used)
T263437 - implementation of the new process