A different take on the NUMA OOM killer story

Article URL: https://rachelbythebay.com/w/2021/09/22/membind/

I was digging through some notes on old outages tonight and found something potentially useful for other people. It's something I have mentioned before, but it seems like maybe that post didn't have enough specifics to make it really "land" when someone else does a web search.

So, in the hopes of spreading some knowledge, here is a little story about a crashy service.

On a Wednesday night not so long ago, someone opened a SEV (that is, a notification of something going wrong) and said the individual tasks for their service were "crash looping". These things ran in an environment where they were supervised by the job runner, so if they died, they'd be restarted, but you never wanted to leave a service in this kind of state.

A bot announced the SEV's creation on the company chat channel where production stuff was discussed, and that got the attention of at least one engineer who happened to be hanging around there. The SEV-creating engineer was also in that channel, as best practices dictated, and a conversation started. They started debugging things together.

It looked like this had started when some update had rolled out. That is, prior to the update, their service was fine. After they changed something, it started crashing. Things were so unstable that some of the tasks never "became healthy" - they would go from "starting up" to "dead" and back again - they never managed…