Spurious alerts on CI coredumps #325

Open
opened 2025-09-06 22:34:15 +00:00 by raito · 6 comments
Owner

Certain CI workloads will generate broken coredumps as part of their integration testing.

This shows up in our alerting as spurious systemd unit failures whenever such a coredump happens.

We should find a way to exclude an entire cgroup tree, e.g. the Nix build cgroup tree.
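
One way to picture the exclusion is a filter applied to coredump journal entries before they reach alerting. The sketch below is only an illustration: it assumes the alerting pipeline can see the `COREDUMP_CGROUP` field of each entry, and the cgroup path in `EXCLUDED_CGROUP_ROOTS` is a hypothetical placeholder, not the actual Nix build cgroup on our hosts.

```python
from pathlib import PurePosixPath

# Hypothetical cgroup subtree to ignore; the real Nix build cgroup path
# depends on the host configuration and is NOT known from this issue.
EXCLUDED_CGROUP_ROOTS = ("/system.slice/nix-daemon.service",)


def is_excluded(entry: dict, roots=EXCLUDED_CGROUP_ROOTS) -> bool:
    """Return True if the entry's cgroup equals one of `roots` or lies below it."""
    cgroup = entry.get("COREDUMP_CGROUP")
    if not cgroup:
        # No cgroup recorded: keep alerting rather than silently dropping it.
        return False
    path = PurePosixPath(cgroup)
    for root in roots:
        root_path = PurePosixPath(root)
        # Compare whole path components so that a sibling such as
        # ".../nix-daemon.service-other" does not match by string prefix.
        if path == root_path or root_path in path.parents:
            return True
    return False
```

Matching on path components rather than a plain string prefix avoids accidentally excluding unrelated units whose names merely start with the same characters.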

Owner

Is there maybe a way to just drop those coredumps on the floor instead of having them processed by `systemd-coredump`? I'd prefer we isolate the CI workloads from their host as much as possible.

Author
Owner

I do not see an option in `coredump.conf` to achieve this.

Owner

I was thinking of, e.g., a new Lix option to set `RLIMIT_CORE=0`.
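
For illustration, this is roughly what setting `RLIMIT_CORE=0` for a child process looks like. It is a minimal Python sketch, not how Lix does it; the helper names `disable_core_dumps` and `run_without_core_dumps` are made up for this example.

```python
import resource
import subprocess


def disable_core_dumps() -> None:
    """Set the soft and hard core-size limits to 0 for the calling process."""
    resource.setrlimit(resource.RLIMIT_CORE, (0, 0))


def run_without_core_dumps(argv: list) -> subprocess.CompletedProcess:
    # preexec_fn runs in the child between fork() and exec(), so only the
    # child and its descendants inherit the lowered limit.
    return subprocess.run(
        argv,
        preexec_fn=disable_core_dumps,
        capture_output=True,
        text=True,
    )
```

Because the hard limit is lowered as well, an unprivileged child cannot raise it again.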

Owner

Wait, actually, looking into this: `RLIMIT_CORE=0` should be the default unless we explicitly pass `--option enable-core-dumps true`. What gives?

Author
Owner

I think `RLIMIT_CORE=0` only has an effect if `kernel.core_pattern` is a filename and not a pipe, i.e. when the kernel writes the coredump itself, not when systemd processes it.
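
To check which case applies on a given host, one can look at the first character of `kernel.core_pattern`: a leading `|` means the kernel pipes the dump to a handler program. A small sketch, with the helper names `is_piped` and `read_core_pattern` invented for this example:

```python
from pathlib import Path

CORE_PATTERN = Path("/proc/sys/kernel/core_pattern")


def is_piped(pattern: str) -> bool:
    """A core_pattern starting with '|' names a userspace handler program."""
    return pattern.startswith("|")


def read_core_pattern(path: Path = CORE_PATTERN) -> str:
    """Read the current pattern (Linux only); trailing newline removed."""
    return path.read_text().rstrip("\n")
```

On a host where this returns true for the live pattern, the handler receives the dump through the pipe, which is the situation described above.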

Author
Owner

I noticed `[6736486.194464] coredump: 3234(liblixmain-test): RLIMIT_CORE is set to 1, aborting core` in my EPYC logs, BTW.

Reference
afnix/infra#325