Description
We have already fixed the Tone Check failures that accounted for the bulk of the errors, but there are still other causes to investigate.
Details
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Open | | None | T406836 The Edit Check's SLO has burned all its error budget |
| Open | | DLynch | T424684 [SPIKE] Investigate the remaining causes of the SLO being in the red |
Event Timeline
Change #1287011 had a related patch set uploaded (by DLynch; author: DLynch):
[mediawiki/extensions/VisualEditor@master] EditCheckFactory: catch errors creating a check
Change #1287012 had a related patch set uploaded (by DLynch; author: DLynch):
[mediawiki/extensions/VisualEditor@master] TextMatchEditCheck: fall back to a valid language if possible
Change #1287017 had a related patch set uploaded (by DLynch; author: DLynch):
[mediawiki/extensions/VisualEditor@master] EditCheck controller: some more checks for destroyed surface
Change #1287011 merged by jenkins-bot:
[mediawiki/extensions/VisualEditor@master] EditCheckFactory: catch errors creating a check
Those patches all came from me investigating uncaught JS errors related to editcheck for this task, though not all of them could actually interfere with the SLO.
https://gerrit.wikimedia.org/r/1287017 in particular might account for the SLO-related errors that show up in spikes. It's the one most likely to be triggered when the tone check service has one of its slow-response periods, which gives people an opportunity to cancel out of their editing session while the response is still being awaited.
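To illustrate the race described above, here is a minimal sketch of re-checking for a destroyed surface after an awaited response. It is not the code from the patch: `Surface`, `Controller`, `slowToneCheck`, and `runToneCheck` are hypothetical stand-ins, and the real VisualEditor classes are assumed to look different.

```typescript
// Hypothetical stand-ins; these are NOT the real VisualEditor classes.
interface Surface {
  destroyed: boolean;
}

type ToneResult = { flagged: boolean };

// Simulates a slow response from the tone check service.
function slowToneCheck(delayMs: number): Promise<ToneResult> {
  return new Promise((resolve) => {
    setTimeout(() => resolve({ flagged: true }), delayMs);
  });
}

class Controller {
  shown: ToneResult[] = [];
  surface: Surface | null;

  constructor(surface: Surface) {
    this.surface = surface;
  }

  // Called when the user cancels out of the editing session.
  destroy(): void {
    if (this.surface) {
      this.surface.destroyed = true;
    }
    this.surface = null;
  }

  // Resolves to true if the result was applied, false if it was dropped.
  async runToneCheck(delayMs: number): Promise<boolean> {
    const result = await slowToneCheck(delayMs);
    // The session may have ended while the response was pending,
    // so re-check the surface before touching it.
    if (!this.surface || this.surface.destroyed) {
      return false;
    }
    this.shown.push(result);
    return true;
  }
}

async function demo(): Promise<boolean[]> {
  const active = new Controller({ destroyed: false });
  const cancelled = new Controller({ destroyed: false });
  const first = active.runToneCheck(10);
  const second = cancelled.runToneCheck(10);
  cancelled.destroy(); // the user leaves while the request is in flight
  return Promise.all([first, second]);
}
```

In this sketch the second controller drops its result quietly instead of throwing, which is the kind of behaviour a destroyed-surface check is meant to give; the slower the simulated service, the wider the window in which `destroy()` can land first.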
Change #1287012 merged by jenkins-bot:
[mediawiki/extensions/VisualEditor@master] TextMatchEditCheck: fall back to a valid language if possible
Change #1287017 merged by jenkins-bot:
[mediawiki/extensions/VisualEditor@master] EditCheck controller: some more checks for destroyed surface
I think this has helped. Since those patches went out on the train (which finished on the 21st), we haven't seen any spikes like the earlier ones.
This does indeed look promising! Let's watch for another week or two, and then we can probably call it good enough.
Looks pretty good now. We've had literally no budget consumed in the last 7 days on the dashboard.

