CI infrastructure monitoring #327

Open
opened 2025-09-07 02:20:26 +00:00 by delroth · 1 comment
Owner

We've had a few cases recently of CI failing for a while (mostly impacting Lix) due to infrastructure issues. We should figure out a way to alert for:

  • CI actions not starting;
  • CI actions always failing (this could for example be implemented with a probe running an unchanging trivial-ish workload on a schedule that is always expected to succeed).

Probably makes sense to figure this out for Forgejo as well.
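
As a rough illustration of the two alert conditions above, here is a minimal sketch of how the results of a scheduled probe could be evaluated. It assumes the probe runs have already been fetched into a list of dicts; the function name `probe_alerts`, the field names `started_at` and `state`, and the thresholds are all hypothetical and would need adapting to whatever the CI system actually reports.

```python
from datetime import datetime, timedelta, timezone


def probe_alerts(runs, now, max_age=timedelta(hours=1), max_consecutive_failures=3):
    """Return the alert names that should fire for a scheduled probe.

    `runs` is a list of dicts with a timezone-aware `started_at` datetime
    and a `state` string ("passed" or anything else). Both field names are
    assumptions made for this sketch.
    """
    alerts = []
    ordered = sorted(runs, key=lambda r: r["started_at"])

    # Condition 1: no probe run has started recently enough.
    if not ordered or now - ordered[-1]["started_at"] > max_age:
        alerts.append("not_starting")

    # Condition 2: the most recent N runs all failed.
    recent = ordered[-max_consecutive_failures:]
    if len(recent) == max_consecutive_failures and all(
        r["state"] != "passed" for r in recent
    ):
        alerts.append("always_failing")

    return alerts
```

Requiring several consecutive failures rather than a single one is a design choice to avoid paging on one flaky run; the right threshold depends on how often the probe is scheduled.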

Owner

Low-hanging fruit:

  • https://buildkite.com/docs/apis/rest-api/agents (number of agents > 0 per platform; this should solve item 1);
  • https://buildkite.com/docs/apis/rest-api/builds plus the trivial workload on a schedule.

I'd say this won't catch peskier issues like broken DNS resolution inside the agents (our recent outage).

For those, I'd still like to count the number of failed builds in a time range and alert if there's a statistical anomaly. Not sure how exactly yet.
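
To make the two ideas above a bit more concrete, here is a minimal sketch operating on already-fetched JSON from the Buildkite agents and builds endpoints. Several things are assumptions rather than anything decided in this thread: that agents carry a `platform=...` tag in their `meta_data` list, that `connection_state` is `"connected"` for healthy agents, and that a one-sided binomial tail test against a known baseline failure rate is an acceptable definition of "statistical anomaly". All function names are hypothetical, and the field names should be checked against the API docs linked above.

```python
from collections import Counter
from math import comb


def connected_agents_per_platform(agents, tag_key="platform"):
    """Count connected agents per value of an assumed `platform=...` tag."""
    counts = Counter()
    for agent in agents:
        if agent.get("connection_state") != "connected":
            continue
        # meta_data is assumed to be a list of "key=value" strings.
        tags = dict(t.split("=", 1) for t in agent.get("meta_data", []) if "=" in t)
        if tag_key in tags:
            counts[tags[tag_key]] += 1
    return counts


def missing_platforms(agents, expected, tag_key="platform"):
    """Return the expected platforms that have no connected agent."""
    counts = connected_agents_per_platform(agents, tag_key)
    return sorted(p for p in expected if counts[p] == 0)


def failure_tail_probability(failed, total, baseline_rate):
    """P(X >= failed) for X ~ Binomial(total, baseline_rate)."""
    return sum(
        comb(total, k) * baseline_rate**k * (1 - baseline_rate) ** (total - k)
        for k in range(failed, total + 1)
    )


def failures_anomalous(states, baseline_rate, alpha=0.001):
    """Flag a window of build states whose failure count is improbably high.

    Only "passed" and "failed" builds are counted; other states (running,
    canceled, ...) are ignored. `baseline_rate` and `alpha` are tuning
    knobs picked for illustration, not recommendations.
    """
    finished = [s for s in states if s in ("passed", "failed")]
    if not finished:
        return False
    failed = finished.count("failed")
    return failure_tail_probability(failed, len(finished), baseline_rate) < alpha
```

For example, with an assumed 10% baseline failure rate, ten failures out of ten builds would be flagged, while one failure out of ten would not. The baseline itself would have to come from historical data, which is the part that is still open.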
Reference
afnix/infra#327