Spurious alerts on CI coredumps #325
Reference
afnix/infra#325
Certain CI workloads will generate broken coredumps as part of their integration testing.
Each such coredump shows up in our alerting as a spuriously failing systemd unit.
We should find a way to exclude an entire cgroup tree, e.g. the Nix build cgroup tree.
Is there maybe a way to just drop those coredumps on the floor instead of having them processed by systemd-coredump? I'd prefer we isolate the CI workloads from their host as much as possible.
I do not see an option in coredump.conf to achieve this.
I was thinking of e.g. a new Lix option to set RLIMIT_CORE=0.
Wait, actually, looking into this, RLIMIT_CORE=0 should be the default unless we explicitly pass --option enable-core-dumps true, what gives?
I think RLIMIT_CORE=0 only has an effect if kernel.core_pattern is a filename and not a pipe, i.e. when the kernel writes the coredump itself, not when systemd processes it.
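To make that hypothesis concrete, here is a minimal Python sketch that classifies what one would expect to happen to a crashing process, given the value of kernel.core_pattern and the process's soft RLIMIT_CORE. The function name classify and its return labels are made up for illustration. The "piped" case encodes the guess from this thread (a limit of 0 is not honoured for pipe handlers), and the "aborted" case is based only on the kernel message quoted in this thread; neither has been checked against a specific kernel version here.

```python
import resource


def classify(core_pattern: str, soft_limit: int) -> str:
    """Guess the fate of a coredump (hypothetical helper, not a kernel API)."""
    if core_pattern.startswith("|"):
        # Pipe handler, e.g. systemd-coredump.
        if soft_limit == 1:
            # Matches the "RLIMIT_CORE is set to 1, aborting core" log line.
            return "aborted"
        # Assumption from the thread: other limits, including 0,
        # do not stop the handler from being invoked.
        return "piped"
    if soft_limit == 0:
        # Kernel writes the file itself and honours the zero limit.
        return "suppressed"
    return "file"


if __name__ == "__main__":
    # Inspect the current host and process (Linux only).
    with open("/proc/sys/kernel/core_pattern") as f:
        pattern = f.read().strip()
    soft, _hard = resource.getrlimit(resource.RLIMIT_CORE)
    print(pattern, soft, classify(pattern, soft))
```

Running this inside a build sandbox and on the host would show whether the two disagree on the limit, which is the part that still needs explaining.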
I noticed
[6736486.194464] coredump: 3234(liblixmain-test): RLIMIT_CORE is set to 1, aborting core
in my EPYC logs BTW.
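If it helps to quantify how often this happens, a rough Python sketch for pulling the PID and command name out of such kernel log lines could look like the following. The regular expression is derived only from the single line quoted above, so treat the message format as an assumption; the function name parse_aborted_core is made up for illustration.

```python
import re

# Pattern inferred from the one log line quoted in this thread.
ABORTED_CORE = re.compile(
    r"coredump: (\d+)\((.*)\): RLIMIT_CORE is set to 1, aborting core"
)


def parse_aborted_core(line: str):
    """Return (pid, comm) if the line reports an aborted coredump, else None."""
    match = ABORTED_CORE.search(line)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


if __name__ == "__main__":
    sample = (
        "[6736486.194464] coredump: 3234(liblixmain-test): "
        "RLIMIT_CORE is set to 1, aborting core"
    )
    print(parse_aborted_core(sample))
```

Note that the kernel reports the task's comm, which is limited to 15 characters, so longer test binary names may show up truncated.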