The current hCaptcha availability check relies on two memcached keys that default to cache misses under normal operation, creating an inverted logic where misses indicate health and hits indicate problems. This approach raises some issues:
- Incident response overhead: frequent cache misses during normal operation create noise, making it difficult for on-call engineers to distinguish "system is healthy" from "system is degraded" unless they already know about this behaviour.
  - This was evident on 2 December 2025, when a 50%+ rise in get misses from mc1047 and mc1042 turned out to reflect normal operational patterns, not actual failures.
- Inverted semantics: a cache miss normally means the data has to be recalculated. The current implementation makes the interpretation of misses inconsistent across the memcached layer.
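
To make the inverted logic concrete, here is a minimal Python sketch of how such a check behaves. It is not the actual implementation: `FakeCache` is an in-memory stand-in for a memcached client, and the key names and `is_available` helper are hypothetical, chosen only for illustration. It assumes the keys are written only when a problem is detected.

```python
class FakeCache:
    """In-memory stand-in for a memcached client that counts get hits and misses."""

    def __init__(self):
        self._data = {}
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._data:
            self.hits += 1
            return self._data[key]
        self.misses += 1
        return None

    def set(self, key, value):
        self._data[key] = value


# Hypothetical key names; the real keys are not given in this task.
FAILURE_KEYS = ("captcha-unavailable-a", "captcha-unavailable-b")


def is_available(cache):
    # Inverted logic: the service counts as healthy only when every lookup misses.
    # A list (not a generator) is used so both keys are always fetched.
    results = [cache.get(key) for key in FAILURE_KEYS]
    return all(value is None for value in results)


cache = FakeCache()

# Healthy period: every check produces two misses and no hits.
healthy_checks = [is_available(cache) for _ in range(100)]
misses_while_healthy = cache.misses

# Degraded period: a failure flag is written, so the next check records a hit.
cache.set(FAILURE_KEYS[0], 1)
degraded = is_available(cache)
```

In this sketch, the miss counter grows with every health check while the system is fine, and the only signal of a problem is a hit. That is the opposite of what a get-misses dashboard normally suggests.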
Original Task Description
(was: Increase in memcached get misses from mc1047 and mc1042)
Since 2 Dec 2025, I have observed an odd increase in get misses from mc1047 and mc1042: a 50+% rise compared to the rest of the cluster.
2 Dec 2025 Memcached Commands (get) misses
The first time I see the pattern (for those two servers) is roughly around 18th Nov 2025.



