Alert on flapping / restart-looping systemd units #370

Open
opened 2025-10-05 01:36:52 +00:00 by delroth · 0 comments
Owner

We recently had a situation on `lix-zulip01` where some of the Zulip queue runners were restart-looping at a high rate. Because they never actually entered the `failed` state, nothing showed up in our alerting until further symptoms ended up causing problems. I don't know off-hand how we could monitor for this, but it's something we should really have imo.
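
One possible direction, as a rough and untested sketch rather than a proposal for how our alerting should actually be wired: systemd tracks an `NRestarts` property per service, which counts automatic restarts triggered by `Restart=`. Sampling it twice and alerting on the delta would catch units that loop without ever reaching `failed`. The function names, the threshold of 3, and the use of a `*.service` glob below are all assumptions made for illustration.

```python
import subprocess


def parse_nrestarts(show_output: str) -> dict[str, int]:
    """Parse `systemctl show -p Id -p NRestarts` output into {unit: restart count}.

    systemctl prints one block of KEY=VALUE lines per unit, separated by blank lines.
    """
    counts: dict[str, int] = {}
    for block in show_output.strip().split("\n\n"):
        props = dict(line.split("=", 1) for line in block.splitlines() if "=" in line)
        # Skip blocks that lack either property (e.g. non-service units).
        if "Id" in props and props.get("NRestarts", "").isdigit():
            counts[props["Id"]] = int(props["NRestarts"])
    return counts


def flapping_units(previous: dict[str, int], current: dict[str, int], threshold: int = 3) -> list[str]:
    """Return units whose restart counter grew by at least `threshold` between samples.

    Units missing from the previous sample are treated as having a delta of 0.
    """
    return sorted(
        unit
        for unit, count in current.items()
        if count - previous.get(unit, count) >= threshold
    )


def sample_restart_counts() -> dict[str, int]:
    """Take one sample from the local systemd instance (hypothetical helper)."""
    result = subprocess.run(
        ["systemctl", "show", "*.service", "--property=Id", "--property=NRestarts"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_nrestarts(result.stdout)
```

In practice this would mean calling `sample_restart_counts()` on a timer, keeping the previous sample around, and raising an alert for anything `flapping_units()` returns. If our metrics exporter can already expose a per-unit restart counter, a rate-based alert rule on that counter would likely be the cleaner equivalent.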
Reference
afnix/infra#370