Reduce cache miss noise in memcached due to hcaptcha health checks
Closed, Resolved · Public

Assigned To

Authored By

	jijiki
	Dec 17 2025, 2:08 PM

Project Tags

Referenced Files

	F72274417: image.png
	Feb 23 2026, 9:21 AM

	F71146899: image.png
	Dec 19 2025, 4:56 PM

	F71103926: image.png
	Dec 17 2025, 2:08 PM

	F71103782: image.png
	Dec 17 2025, 2:08 PM

Subscribers

Description

The current hCaptcha availability check relies on two memcached keys that default to cache misses under normal operation, creating an inverted logic where misses indicate health and hits indicate problems. This approach raises some issues:

incident response overhead: frequent cache misses during normal operation create noise, making it difficult for on-call engineers to distinguish between "system is healthy" and "system is degraded" without explicitly knowing about this behaviour
- This was evident on December 2nd 2025, when a 50%+ rise in get misses from mc1047 and mc1042 turned out to align with normal operational patterns, not actual failures.
inverted semantics: normally cache misses indicate data recalculation. The current implementation creates inconsistency in how we interpret misses across the memcached layer

Original Task Description
(was: Increase in memcached get misses from mc1047 and mc1042)

Since 2 Dec 2025, I have observed an odd increase in get misses: mc1047 and mc1042 show a 50+% rise in get misses compared to the rest of the cluster.

2 Dec 2025 Memcached Commands (get) misses

The first time I see the pattern (for those two servers) is roughly around 18th Nov 2025.

18th Nov 2025 - Memcached Commands (get) misses

Details

Related Changes in Gerrit:

Subject	Repo	Branch	Lines +/-
hCaptcha: Add APCu cache layer to health checker	mediawiki/extensions/ConfirmEdit	wmf/1.46.0-wmf.21	+86 -10
hCaptcha: Add APCu cache layer to health checker	mediawiki/extensions/ConfirmEdit	master	+86 -10
HCaptchaEnterpriseHealthChecker: Use a cache hit for health check	mediawiki/extensions/ConfirmEdit	wmf/1.46.0-wmf.16	+41 -54
HCaptchaEnterpriseHealthChecker: Use a cache hit for health check	mediawiki/extensions/ConfirmEdit	master	+41 -54

Related Objects

Mentioned Here: rOPUP1242254fb508
rECOE53c82eb84f54: hCaptcha: Provide capabilities for failing over to alternate CAPTCHA type
T405586: hCaptcha editing trial deployment tracker

Event Timeline

jijiki created this task.Dec 17 2025, 2:08 PM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptDec 17 2025, 2:08 PM

jijiki added a project: serviceops-deprecated.Dec 17 2025, 2:09 PM

By and large, identifying which memcached key is receiving the most misses is not straightforward, unless the application explicitly counts those misses. Moreover, when a cache miss occurs, the conversation with memcached will be:

get WANCache:dawiktionary:revision-slots::6272:16905|#|v
END

whereas a hit will appear as:

get WANCache:cawiki:page-content-model:123456|#|v
VALUE WANCache:cawiki:page-content-model:123456|#|v 4 71
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
END

To identify the key misses, I captured a tcpdump which I then processed with tshark to isolate the misses.

This yielded

mc1047
---------
10003 MISS: WANCache:global:confirmedit-hcaptcha-failover-mode|#|v
   30 MISS: WANCache:eswiki:page:14:<snip>|#|v

mc1042
----------
   5952 MISS: global:confirmedit-hcaptcha-siteverify-error-count
     26 MISS: WANCache:commonswiki:page:<snip>|#|v
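The miss isolation described above can be approximated with a short script. This is only a sketch under stated assumptions, not the tcpdump/tshark processing that was actually used: it assumes the memcached text-protocol conversation has already been extracted from the capture as ordered text lines from a single connection, and that each request is a single-key get. The function name count_misses is hypothetical.

```python
from collections import Counter


def count_misses(lines):
    """Count get misses per key in an ordered memcached text-protocol transcript.

    A miss is a single-key "get <key>" answered by "END" with no "VALUE" line
    in between. Multi-key gets are skipped (assumption, to keep the sketch small).
    """
    misses = Counter()
    pending = None  # key of the most recent single-key get still awaiting a reply
    for raw in lines:
        line = raw.strip()
        if line.startswith("get "):
            parts = line.split()
            pending = parts[1] if len(parts) == 2 else None
        elif line.startswith("VALUE "):
            # The server returned data, so this get was a hit.
            pending = None
        elif line == "END":
            if pending is not None:
                misses[pending] += 1
            pending = None
    return misses


if __name__ == "__main__":
    sample = [
        "get WANCache:dawiktionary:revision-slots::6272:16905|#|v",
        "END",
        "get WANCache:cawiki:page-content-model:123456|#|v",
        "VALUE WANCache:cawiki:page-content-model:123456|#|v 4 71",
        "aaaa",
        "END",
    ]
    for key, count in count_misses(sample).most_common():
        print(f"{count:6d} MISS: {key}")
```

Sorting by count, as most_common() does, gives the same "count MISS: key" shape as the listing above.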

Looking at a trace here https://trace.wikimedia.org/trace/c39a49b6ad781654709b21bac2f4beac, it looks like we are fetching both

global:confirmedit-hcaptcha-failover-mode (miss)
global:confirmedit-hcaptcha-apiurl-available (hit)

I have not dug any deeper to find traces for global:confirmedit-hcaptcha-siteverify-error-count.

While everything works as designed, having frequent cache misses during normal operation creates unnecessary noise and makes it difficult to distinguish between "system is healthy" and "system is degraded."

A key miss is used to indicate health, whilst a key hit indicates a problem. This inverted logic makes incident response confusing, especially when on-call engineers cannot quickly determine system state by looking at cache patterns.

What options do we have here to avoid relying on cache misses?

jijiki attached a referenced file: F71146899: image.png. (Show Details)Dec 22 2025, 11:36 AM

What options do we have here to avoid relying on cache misses?

I'm not sure. I guess the alternative would be to invert the existing logic that tracks failover mode and use a confirmedit-hcaptcha-is-healthy cache key? IIRC, we explored that in rECOE53c82eb84f54: hCaptcha: Provide capabilities for failing over to alternate CAPTCHA type and ended up moving away from that approach.

From ServiceOps new point of view, is this issue a "must fix" and if so, on what timeline?

The first time I see the pattern (for those two servers) is roughly around 18th Nov 2025.

hCaptcha editing went live on November 17 (T405586), which is also when crawler traffic to ?action=edit URLs started triggering the hCaptcha availability checks more frequently. Unfortunately, we do need to know which CAPTCHA backend to load at page load time, so we need to consult the confirmedit-hcaptcha-failover-mode key.

Since 2 Dec 2025, I have observed an odd increase in get misses: mc1047 and mc1042 show a 50+% rise in get misses compared to the rest of the cluster.

The automatic failover mode kicked in on December 3 (logs), so that doesn't explain the December 2 event.

In T412947#11480759, @kostajh wrote:

Since 2 Dec 2025, I have observed an odd increase in get misses: mc1047 and mc1042 show a 50+% rise in get misses compared to the rest of the cluster.

The automatic failover mode kicked in on December 3 (logs), so that doesn't explain the December 2 event.

2 Dec 2025 lines up with https://sal.toolforge.org/log/wSq_3poB8tZ8Ohr0Wwvw, which, as we now know, is normal.

From my perspective (I have not discussed this with the team yet), the current logic potentially makes incident response baffling, especially for on-call engineers who most likely will not be privy to the above details. In a new norm where we are actively trying to pin down request patterns in order to fight scrapers, cache misses are subject to investigation. I myself spent a considerable amount of time deducing where those misses were coming from.

(Please take this suggestion with a grain of salt, as I am a little out of my depth.) What if we consider an approach along the lines of:

Currently we are using 3 keys when calling isAvailable():

"failover mode" key (TTL 10', default is a miss): if it exists and is true, failover is enabled.
"SiteVerify Error Count" key (TTL 1', default is a miss): if it exists, the counter is updated. May set the "failover mode" key to true.
"secure-api.js" key (TTL 5', default is a hit): on a miss, we set the "failover mode" key to true.

Instead of maintaining a separate failover-mode key with a 10-minute backoff window, we drop it and use the actual health indicators, which, in combination with APCu, reduces the number of memcached key fetches.

Use an isHcaptchaAvailable boolean key (TTL 1'), which should be _true_ when SiteVerify Error Count < threshold AND secure-api.js is true
- Set the TTL to 10' if false, implementing a backoff similar to what we are doing now.
SiteVerify Error Count (TTL 1', should exist, default value is 0): reset it to 0 if there are no errors and the TTL is about to expire, or leave it alone if it is non-zero
secure-api.js key (TTL 5', should exist and should be boolean)

Additionally, we could cache isHcaptchaAvailable to APCu with a TTL of 1', so to spread fetches across pods.

With the above implementation:

cache misses indicate recalculation of key data
- though using getWithSetCallback() where appropriate will affect the number of cache misses
depending on the implementation, we accept periodic misses of SiteVerify Error Count, which will naturally occur with a 1-minute TTL.
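The lookup order proposed above can be sketched roughly as follows. This is an illustration under assumptions, not the ConfirmEdit implementation: TTLCache is a hypothetical in-memory stand-in for both memcached (shared) and APCu (local), the key name and the function is_hcaptcha_available are made up for the example, and the error count, threshold and secure-api.js status are passed in rather than read from their own keys.

```python
import time


class TTLCache:
    """Tiny in-memory cache with per-key TTLs (hypothetical stand-in for memcached/APCu)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}
        self.misses = 0  # counted so the effect on miss noise is visible

    def get(self, key):
        entry = self._data.get(key)
        if entry is None or entry[1] <= self._clock():
            self._data.pop(key, None)
            self.misses += 1
            return None
        return entry[0]

    def set(self, key, value, ttl):
        self._data[key] = (value, self._clock() + ttl)


KEY = "is-hcaptcha-available"  # hypothetical key name
HEALTHY_TTL = 60    # 1' while healthy
BACKOFF_TTL = 600   # 10' backoff while unhealthy
LOCAL_TTL = 60      # 1' in the per-host cache


def is_hcaptcha_available(local, shared, error_count, threshold, api_ok):
    """Check the local cache first, then the shared key, and recompute only on a miss."""
    cached = local.get(KEY)
    if cached is not None:
        return cached
    value = shared.get(KEY)
    if value is None:
        # A miss now means "recalculate", not "healthy".
        value = bool(error_count < threshold and api_ok)
        shared.set(KEY, value, HEALTHY_TTL if value else BACKOFF_TTL)
    local.set(KEY, value, LOCAL_TTL)
    return value
```

With this shape, a shared-cache miss happens at most once per TTL window, and both the healthy and the unhealthy state are stored as explicit values, so a hit is the normal case in either state.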

Thoughts?

kostajh added a project: Product Safety and Integrity.Jan 14 2026, 9:52 AM

JMeybohm triaged this task as Medium priority.Jan 14 2026, 4:12 PM

JMeybohm edited projects, added: ServiceOps new, ServiceOps-Datastores; removed: serviceops-deprecated.

JMeybohm moved this task from Inbox to In Progress on the ServiceOps new board.

Thanks for the notes and ideas, @jijiki. We'll have another pass at the implementation to address the concerns you've raised. For sense of urgency, is sometime this quarter OK, or does it need to be dealt with more urgently? (cc @OKryva-WMF @Rsilvola)

In T412947#11528726, @kostajh wrote:

Thanks for the notes and ideas, @jijiki. We'll have another pass at the implementation to address the concerns you've raised. For sense of urgency, is sometime this quarter OK, or does it need to be dealt with more urgently? (cc @OKryva-WMF @Rsilvola)

Sometime this quarter would be grand! Thank you!

jijiki renamed this task from Increase in memcached get misses from mc1047 and mc1042 to Reduce cache miss noise in memcached due to hcaptcha health checks.Jan 19 2026, 1:07 PM

jijiki updated the task description. (Show Details)

jijiki updated the task description. (Show Details)Jan 19 2026, 1:23 PM

jijiki moved this task from In Progress to Radar (Awareness) on the ServiceOps new board.Jan 22 2026, 9:54 AM

Change #1238059 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[mediawiki/extensions/ConfirmEdit@master] HCaptchaEnterpriseHealthChecker: Use a cache hit for health check

https://gerrit.wikimedia.org/r/1238059

gerritbot added a project: Patch-For-Review.Feb 9 2026, 11:40 PM

@jijiki I've proposed a patch that implements most of what you've written out above, except for this part

Additionally, we could cache isHcaptchaAvailable to APCu with a TTL of 1', so to spread fetches across pods.

How important is the APCu aspect of this? Can we wait to see how things look with the redesigned version that will have a cache hit as an indicator of health?

OKryva-WMF edited projects, added: Product Safety and Integrity (Sprint Flower (Feb 9 - Feb 27)); removed: Product Safety and Integrity (Sprint Daffodil (Jan 19 - Feb 6)).Feb 10 2026, 8:33 AM

In T412947#11599742, @kostajh wrote:

@jijiki I've proposed a patch that implements most of what you've written out above, except for this part

Additionally, we could cache isHcaptchaAvailable to APCu with a TTL of 1', so to spread fetches across pods.

How important is the APCu aspect of this? Can we wait to see how things look with the redesigned version that will have a cache hit as an indicator of health?

Thank you very much Kosta! Yes I agree, let's see how things land and iterate if needed.

kostajh moved this task from Backlog to Needs review on the Product Safety and Integrity (Sprint Flower (Feb 9 - Feb 27)) board.Feb 18 2026, 4:40 PM

Change #1238059 merged by jenkins-bot:

[mediawiki/extensions/ConfirmEdit@master] HCaptchaEnterpriseHealthChecker: Use a cache hit for health check

https://gerrit.wikimedia.org/r/1238059

Maintenance_bot removed a project: Patch-For-Review.Feb 19 2026, 12:31 PM

kostajh moved this task from Needs review to Needs QA on the Product Safety and Integrity (Sprint Flower (Feb 9 - Feb 27)) board.Feb 19 2026, 12:51 PM

ReleaseTaggerBot added a project: MW-1.46-notes (1.46.0-wmf.17; 2026-02-24).Feb 19 2026, 1:00 PM

Change #1242254 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[mediawiki/extensions/ConfirmEdit@wmf/1.46.0-wmf.16] HCaptchaEnterpriseHealthChecker: Use a cache hit for health check

https://gerrit.wikimedia.org/r/1242254

gerritbot added a project: Patch-For-Review.Feb 23 2026, 8:35 AM

Change #1242254 merged by jenkins-bot:

[mediawiki/extensions/ConfirmEdit@wmf/1.46.0-wmf.16] HCaptchaEnterpriseHealthChecker: Use a cache hit for health check

https://gerrit.wikimedia.org/r/1242254

Mentioned in SAL (#wikimedia-operations) [2026-02-23T09:10:06Z] <kharlan@deploy2002> Started scap sync-world: Backport for [[gerrit:rOPUP1242254fb508|HCaptchaEnterpriseHealthChecker: Use a cache hit for health check (T412947)]]

Mentioned in SAL (#wikimedia-operations) [2026-02-23T09:11:56Z] <kharlan@deploy2002> kharlan: Backport for [[gerrit:rOPUP1242254fb508|HCaptchaEnterpriseHealthChecker: Use a cache hit for health check (T412947)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2026-02-23T09:18:49Z] <kharlan@deploy2002> Finished scap sync-world: Backport for [[gerrit:rOPUP1242254fb508|HCaptchaEnterpriseHealthChecker: Use a cache hit for health check (T412947)]] (duration: 08m 43s)

@jijiki I've deployed the patch, and observing https://grafana.wikimedia.org/d/000000316/memcache?from=now-1h&orgId=1&timezone=utc&to=now&var-cluster=memcached&var-datasource=000000006&var-instance=$__all&viewPanel=panel-60 we can see that the cache miss count has dropped substantially

Maintenance_bot removed a project: Patch-For-Review.Feb 23 2026, 9:31 AM

ReleaseTaggerBot edited projects, added: MW-1.46-notes (1.46.0-wmf.16; 2026-02-17); removed: MW-1.46-notes (1.46.0-wmf.17; 2026-02-24).Feb 23 2026, 10:00 AM

Thank you so much for sorting this!

• hector.arroyo moved this task from Needs QA to Done on the Product Safety and Integrity (Sprint Flower (Feb 9 - Feb 27)) board.Feb 27 2026, 3:17 PM

• hector.arroyo moved this task from Backlog to Done on the Bot detection and mitigation (WE4.2 hCaptcha editing trial) board.Mar 19 2026, 5:15 PM

Change #1261605 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[mediawiki/extensions/ConfirmEdit@master] hCaptcha: Add APCu cache layer to health checker

https://gerrit.wikimedia.org/r/1261605

gerritbot added a project: Patch-For-Review.Mar 26 2026, 9:23 PM

Change #1261605 merged by jenkins-bot:

[mediawiki/extensions/ConfirmEdit@master] hCaptcha: Add APCu cache layer to health checker

https://gerrit.wikimedia.org/r/1261605

Maintenance_bot removed a project: Patch-For-Review.Mar 30 2026, 10:30 AM

Change #1264578 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[mediawiki/extensions/ConfirmEdit@wmf/1.46.0-wmf.21] hCaptcha: Add APCu cache layer to health checker

https://gerrit.wikimedia.org/r/1264578

gerritbot added a project: Patch-For-Review.Mar 30 2026, 10:44 AM

ReleaseTaggerBot edited projects, added: MW-1.46-notes (1.46.0-wmf.22; 2026-03-31); removed: MW-1.46-notes (1.46.0-wmf.16; 2026-02-17).Mar 30 2026, 11:00 AM

Change #1264578 merged by jenkins-bot:

[mediawiki/extensions/ConfirmEdit@wmf/1.46.0-wmf.21] hCaptcha: Add APCu cache layer to health checker

https://gerrit.wikimedia.org/r/1264578

Mentioned in SAL (#wikimedia-operations) [2026-03-30T13:05:21Z] <kharlan@deploy1003> Started scap sync-world: Backport for [[gerrit:1264578|hCaptcha: Add APCu cache layer to health checker (T421204 T412947)]]

Mentioned in SAL (#wikimedia-operations) [2026-03-30T13:07:12Z] <kharlan@deploy1003> kharlan: Backport for [[gerrit:1264578|hCaptcha: Add APCu cache layer to health checker (T421204 T412947)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2026-03-30T13:17:17Z] <kharlan@deploy1003> Finished scap sync-world: Backport for [[gerrit:1264578|hCaptcha: Add APCu cache layer to health checker (T421204 T412947)]] (duration: 11m 56s)

Maintenance_bot removed a project: Patch-For-Review.Mar 30 2026, 1:41 PM

ReleaseTaggerBot edited projects, added: MW-1.46-notes (1.46.0-wmf.21; 2026-03-24); removed: MW-1.46-notes (1.46.0-wmf.22; 2026-03-31).Mar 30 2026, 2:00 PM
