Reduce cache miss noise in memcached due to hcaptcha health checks
Closed, Resolved, Public

Description

The current hCaptcha availability check relies on 2 memcached keys that default to cache misses under normal operations, creating an inverted logic where misses indicate health and hits indicate problems. This approach raises some issues:

  • incident response overhead: frequent cache misses during normal operation create noise, making it difficult for on-call engineers to distinguish between "system is healthy" and "system is degraded" without explicitly knowing about this behaviour
    • This was evident on December 2nd 2025, when a 50%+ rise in get misses from mc1047 and mc1042 aligned with normal operational patterns, not actual failures.
  • inverted semantics: normally cache misses indicate data recalculation. The current implementation creates inconsistency in how we interpret misses across the memcached layer

Original Task Description
(was: Increase in memcached get misses from mc1047 and mc1042)

Since 2 Dec 2025, I have observed an odd increase in get misses: mc1047 and mc1042 show a 50+% rise in get misses compared to the rest of the cluster.

2 Dec 2025 Memcached Commands (get) misses

The pattern first appears (for those two servers) around 18th Nov 2025.

18th Nov 2025 - Memcached Commands (get) misses

Event Timeline

By and large, identifying which memcached key is receiving the most misses is not straightforward, unless the application explicitly counts those misses. Moreover, when a cache miss occurs, the conversation with memcached will be:

get WANCache:dawiktionary:revision-slots::6272:16905|#|v
END

whereas a hit will appear as:

get WANCache:cawiki:page-content-model:123456|#|v
VALUE WANCache:cawiki:page-content-model:123456|#|v 4 71
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
END

To identify the key misses, I captured a tcpdump which I then processed with tshark to isolate the misses.

This yielded

mc1047
---------
10003 MISS: WANCache:global:confirmedit-hcaptcha-failover-mode|#|v
   30 MISS: WANCache:eswiki:page:14:<snip>|#|v

mc1042
----------
   5952 MISS: global:confirmedit-hcaptcha-siteverify-error-count
     26 MISS: WANCache:commonswiki:page:<snip>|#|v
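
The tally above can be approximated with a short script. This is a minimal sketch, not the tshark pipeline actually used: it assumes the capture has already been decoded into an interleaved text transcript of single-key `get` requests and their responses, in the shape shown earlier in this thread. The function name `count_misses` is hypothetical.

```python
from collections import Counter


def count_misses(transcript: str) -> Counter:
    """Count memcached get misses per key in a decoded text-protocol transcript.

    Assumption: each request is a single-key "get <key>" line and its
    response follows directly. A miss is a "get" answered by a bare "END";
    a hit is a "get" answered by "VALUE ..." before "END".
    """
    misses: Counter = Counter()
    pending_key = None  # key of the last get that has not been answered yet

    for raw in transcript.splitlines():
        line = raw.strip()
        if line.startswith("get "):
            pending_key = line[4:].strip()
        elif line.startswith("VALUE "):
            # The server returned data, so the pending get was a hit.
            pending_key = None
        elif line == "END":
            # END with the get still pending means no VALUE was sent: a miss.
            if pending_key is not None:
                misses[pending_key] += 1
            pending_key = None
        # Any other line is a data block payload and is ignored.

    return misses
```

Calling `count_misses(text).most_common(5)` on such a transcript would list the most frequently missed keys first. Real captures carry requests and responses in separate packets and may use multi-key gets, so a production version would need to pair them by connection.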

Looking at a trace here https://trace.wikimedia.org/trace/c39a49b6ad781654709b21bac2f4beac, it looks like we are getting both

global:confirmedit-hcaptcha-failover-mode (miss)
global:confirmedit-hcaptcha-apiurl-available (hit)

I have not dug any deeper to find traces for global:confirmedit-hcaptcha-siteverify-error-count.

Everything works as designed; however, frequent cache misses during normal operation create unnecessary noise and make it difficult to distinguish between "system is healthy" and "system is degraded."

A key miss is used to indicate health, whilst a key hit indicates a problem. This inverted logic makes incident response confusing, especially when on-call engineers cannot quickly determine system state by looking at cache patterns.

What options do we have here to avoid relying on cache misses?

kostajh subscribed.

What options do we have here to avoid relying on cache misses?

I'm not sure. I guess the alternative would be to invert the existing logic that tracks failover mode and use a confirmedit-hcaptcha-is-healthy cache key? IIRC, we explored that in rECOE53c82eb84f54: hCaptcha: Provide capabilities for failing over to alternate CAPTCHA type and ended up moving away from that approach.

From ServiceOps new point of view, is this issue a "must fix" and if so, on what timeline?

The first time I see the pattern is (for those two servers), roughly, around ~18th Nov 2025

hCaptcha editing went live on November 17 (T405586), which is also when crawler traffic to ?action=edit URLs started triggering the hCaptcha availability checks more frequently. Unfortunately, we do need to know which CAPTCHA backend to load at page load time, so we need to consult the confirmedit-hcaptcha-failover-mode key.

Since 2 Dec 2025, I have observed an odd increase in get misses: mc1047 and mc1042 show a 50+% rise in get misses compared to the rest of the cluster.

The automatic failover mode kicked in on December 3 (logs), so that doesn't explain the December 2 event.

2 Dec 2025 lines up with https://sal.toolforge.org/log/wSq_3poB8tZ8Ohr0Wwvw, which, as we now know, is normal.

From my perspective (I have not discussed this with the team yet), the current logic potentially makes incident response baffling, especially for on-call engineers who most likely will not be privy to the above details. In a new norm where we are actively trying to pin down request patterns in order to fight scrapers, cache misses are subject to investigation. I myself spent a considerable amount of time deducing where those misses were coming from.

(Please take this suggestion with a grain of salt, as I am a little out of my depth). What if we consider an approach along the lines of:

Currently we are using 3 keys when calling isAvailable():

  • "failover mode" key (TTL 10', default is a miss): if it exists and is true, failover is enabled
  • "SiteVerify Error Count" key (TTL 1', default is a miss): if it exists, the counter is updated. May set the "failover mode" key to true.
  • "secure-api.js" key (TTL 5', default is a hit): on a miss, we set the "failover mode" key to true.

Instead of maintaining a separate failover-mode key with a 10-minute backoff window, we drop it and use the actual health indicators and, in combination with APCu, reduce the number of memcached key fetches.

  • Use an isHcaptchaAvailable boolean key (TTL 1'), which should be _true_ when SiteVerify Error Count < threshold AND secure-api.js is true
    • Set TTL to 10' if false, implementing a backoff similar to what we are doing now.
  • SiteVerify Error Count key (TTL 1', should exist and default value is 0): update again to 0 if there are no errors and the TTL is about to expire, or leave it as is if non-zero
  • secure-api.js key (TTL 5', should exist and should be boolean)

Additionally, we could cache isHcaptchaAvailable to APCu with a TTL of 1', so as to spread fetches across pods.

With the above implementation:

  • cache misses indicate recalculation of key data
    • though using getWithSetCallback() where appropriate will affect the cache miss counts
  • depending on the implementation, we accept periodic misses of the SiteVerify Error Count key, which will occur naturally given its 1-minute TTL.
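
The proposal above can be sketched roughly as follows. This is an illustration under stated assumptions, not the ConfirmEdit implementation: `TTLCache` is a tiny in-memory stand-in for both memcached and APCu, and the key name, `ERROR_THRESHOLD` and the two callbacks are hypothetical. Only the TTLs (1' healthy, 10' backoff, 1' local copy) come from the list above.

```python
import time
from typing import Callable, Dict, Optional, Tuple

HEALTHY_TTL = 60     # 1' while hCaptcha is healthy
BACKOFF_TTL = 600    # 10' backoff once it is marked unavailable
LOCAL_TTL = 60       # 1' per-process copy, standing in for APCu
ERROR_THRESHOLD = 3  # hypothetical value; the thread does not give one
HEALTH_KEY = "is-hcaptcha-available"  # hypothetical key name


class TTLCache:
    """In-memory stand-in for memcached or APCu (assumption, not a real client)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self.clock = clock
        self.data: Dict[str, Tuple[bool, float]] = {}
        self.misses = 0  # counted so the miss behaviour can be observed

    def get(self, key: str) -> Optional[bool]:
        entry = self.data.get(key)
        if entry is None or self.clock() >= entry[1]:
            self.misses += 1
            return None
        return entry[0]

    def set(self, key: str, value: bool, ttl: float) -> None:
        self.data[key] = (value, self.clock() + ttl)


def is_hcaptcha_available(
    shared: TTLCache,
    local: TTLCache,
    error_count: Callable[[], int],
    api_reachable: Callable[[], bool],
) -> bool:
    """Return cached health; a shared-cache miss now means recalculation."""
    cached = local.get(HEALTH_KEY)
    if cached is not None:  # False is a valid cached value, so compare to None
        return cached

    healthy = shared.get(HEALTH_KEY)
    if healthy is None:
        # Recompute from the underlying indicators and store the result,
        # using the longer TTL as a backoff when unhealthy.
        healthy = error_count() < ERROR_THRESHOLD and bool(api_reachable())
        shared.set(HEALTH_KEY, healthy, HEALTHY_TTL if healthy else BACKOFF_TTL)

    local.set(HEALTH_KEY, healthy, LOCAL_TTL)
    return healthy
```

With this shape, a healthy system produces hits on the shared key, and a miss only occurs when the value expires and has to be recomputed. The local layer means each process consults the shared cache at most about once per minute.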

Thoughts?

JMeybohm triaged this task as Medium priority. (Jan 14 2026, 4:12 PM)
JMeybohm moved this task from Inbox to In Progress on the ServiceOps new board.

Thanks for the notes and ideas, @jijiki. We'll have another pass at the implementation to address the concerns you've raised. For sense of urgency, is sometime this quarter OK, or does it need to be dealt with more urgently? (cc @OKryva-WMF @Rsilvola)

Sometime this quarter would be grand! Thank you!

jijiki renamed this task from Increase in memcached get misses from mc1047 and mc1042 to Reduce cache miss noise in memcached due to hcaptcha health checks. (Jan 19 2026, 1:07 PM)
jijiki updated the task description.

Change #1238059 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[mediawiki/extensions/ConfirmEdit@master] HCaptchaEnterpriseHealthChecker: Use a cache hit for health check

https://gerrit.wikimedia.org/r/1238059

@jijiki I've proposed a patch that implements most of what you've written out above, except for this part:

Additionally, we could cache isHcaptchaAvailable to APCu with a TTL of 1', so as to spread fetches across pods.

How important is the APCu aspect of this? Can we wait to see how things look with the redesigned version that will have a cache hit as an indicator of health?

Thank you very much Kosta! Yes I agree, let's see how things land and iterate if needed.

Change #1238059 merged by jenkins-bot:

[mediawiki/extensions/ConfirmEdit@master] HCaptchaEnterpriseHealthChecker: Use a cache hit for health check

https://gerrit.wikimedia.org/r/1238059

Change #1242254 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[mediawiki/extensions/ConfirmEdit@wmf/1.46.0-wmf.16] HCaptchaEnterpriseHealthChecker: Use a cache hit for health check

https://gerrit.wikimedia.org/r/1242254

Change #1242254 merged by jenkins-bot:

[mediawiki/extensions/ConfirmEdit@wmf/1.46.0-wmf.16] HCaptchaEnterpriseHealthChecker: Use a cache hit for health check

https://gerrit.wikimedia.org/r/1242254

Mentioned in SAL (#wikimedia-operations) [2026-02-23T09:10:06Z] <kharlan@deploy2002> Started scap sync-world: Backport for [[gerrit:rOPUP1242254fb508|HCaptchaEnterpriseHealthChecker: Use a cache hit for health check (T412947)]]

Mentioned in SAL (#wikimedia-operations) [2026-02-23T09:11:56Z] <kharlan@deploy2002> kharlan: Backport for [[gerrit:rOPUP1242254fb508|HCaptchaEnterpriseHealthChecker: Use a cache hit for health check (T412947)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2026-02-23T09:18:49Z] <kharlan@deploy2002> Finished scap sync-world: Backport for [[gerrit:rOPUP1242254fb508|HCaptchaEnterpriseHealthChecker: Use a cache hit for health check (T412947)]] (duration: 08m 43s)

Thank you so much for sorting this!

Change #1261605 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[mediawiki/extensions/ConfirmEdit@master] hCaptcha: Add APCu cache layer to health checker

https://gerrit.wikimedia.org/r/1261605

Change #1261605 merged by jenkins-bot:

[mediawiki/extensions/ConfirmEdit@master] hCaptcha: Add APCu cache layer to health checker

https://gerrit.wikimedia.org/r/1261605

Change #1264578 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[mediawiki/extensions/ConfirmEdit@wmf/1.46.0-wmf.21] hCaptcha: Add APCu cache layer to health checker

https://gerrit.wikimedia.org/r/1264578

Change #1264578 merged by jenkins-bot:

[mediawiki/extensions/ConfirmEdit@wmf/1.46.0-wmf.21] hCaptcha: Add APCu cache layer to health checker

https://gerrit.wikimedia.org/r/1264578

Mentioned in SAL (#wikimedia-operations) [2026-03-30T13:05:21Z] <kharlan@deploy1003> Started scap sync-world: Backport for [[gerrit:1264578|hCaptcha: Add APCu cache layer to health checker (T421204 T412947)]]

Mentioned in SAL (#wikimedia-operations) [2026-03-30T13:07:12Z] <kharlan@deploy1003> kharlan: Backport for [[gerrit:1264578|hCaptcha: Add APCu cache layer to health checker (T421204 T412947)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2026-03-30T13:17:17Z] <kharlan@deploy1003> Finished scap sync-world: Backport for [[gerrit:1264578|hCaptcha: Add APCu cache layer to health checker (T421204 T412947)]] (duration: 11m 56s)