
Google's alerting philosophy biases toward paging people only for user-visible symptoms, not potential causes, but there are often still lower-priority ticket-based alerts for things like machines running too hot.

For rationale, see:

- https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa...

- https://landing.google.com/sre/sre-book/chapters/monitoring-...
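
As a rough illustration of that split, here is a minimal sketch. It is hypothetical and not Google's actual tooling: the `Alert` type, the `user_visible` flag, and the `route` function are names invented for this example.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    PAGE = "page"      # wake a human up now
    TICKET = "ticket"  # low priority, handled during working hours


@dataclass
class Alert:
    name: str
    # Assumption: each alert is labeled as a user-visible symptom or not.
    user_visible: bool


def route(alert: Alert) -> Route:
    """Page only for user-visible symptoms; file a ticket for potential causes."""
    return Route.PAGE if alert.user_visible else Route.TICKET


if __name__ == "__main__":
    for alert in (
        Alert("http_error_rate_high", user_visible=True),
        Alert("machine_temperature_high", user_visible=False),
    ):
        print(alert.name, "->", route(alert).value)
```

Under this scheme an overheating machine on its own only files a ticket; a page goes out only once users start seeing errors.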



That way of prioritising things makes sense, but in this case it meant the on-call SRE and someone from the edge networking team wasted their time diagnosing the hardware problem.

It seems a shame that the blog post wasn't able to follow

« They immediately removed ("drained") the machines from serving, thus eliminating the errors that might result in a degraded state for customers »

with something like « and they were shown a notification that there was a low-priority automatic ticket showing a possible hardware problem on those machines, so they subscribed to that ticket and didn't waste any more time ».


> on-call SRE and someone from the edge networking team wasted their time diagnosing the hardware problem

I highly doubt that the edge oncaller who was paged debugged this issue down to the hardware. The post even said that they worked with other teams to figure this out. Once the SRE figured out it wasn't their problem, they likely moved on to something else.



