SysGuardd Operations Guide¶
Purpose¶
This guide defines how to run, configure, and operate SysGuardd safely in development and production environments.
Runtime Modes¶
Monitor Mode¶
- Captures and evaluates process events.
- Does not terminate processes.
- Best for baseline learning and policy tuning.
Enforce Mode¶
- Captures and evaluates process events.
- Terminates unauthorized processes using
SIGKILL. - Requires mature policy coverage and alerting.
Suggested Configuration Keys¶
Use these keys as the initial configuration contract.
mode:monitororenforcepolicy.source:fileorremotepolicy.path: local policy file pathpolicy.refresh_interval_seconds: remote policy refresh cadencetelemetry.format:jsonorgrpctelemetry.endpoint: remote sink for gRPC streamlog.level:info,warn,error,debug
Policy Authoring Guidelines¶
- Start with monitor mode and capture at least 7 days of process data.
- Convert known-bad executables and suspicious paths into deny rules.
- Keep rules explicit and reviewable.
- Version all policy changes in source control.
- Require peer review for policy updates before enforce rollout.
Rollout Strategy¶
- Deploy in monitor mode to a non-critical environment.
- Validate event volume, false positives, and expected process baselines.
- Apply policy updates until false positives are near zero.
- Roll out enforce mode to a small canary subset.
- Expand enforce coverage incrementally.
Real-Time Alerting¶
SysGuardd includes a built-in non-blocking webhook dispatcher. It is disabled by default and opt-in via flags.
Enabling Alerts¶
For local runs, put SYSGUARDD_ALERT_WEBHOOK_URL=... in a repo-root .env file. The daemon loads it automatically, and .env is already ignored by git.
sysguardd daemon --mode enforce \
--alert-enabled \
--alert-webhook-url https://hooks.slack.com/services/XXX/YYY/ZZZ \
--alert-min-severity critical
For monitor-mode deployments during policy tuning:
sysguardd daemon --mode monitor \
--alert-enabled \
--alert-webhook-url https://hooks.example.com/sysguardd \
--alert-min-severity warning \
--alert-dedupe-window 120 \
--alert-rate-limit 20
Alert Configuration Reference¶
| Flag | Default | Description |
|---|---|---|
--alert-enabled |
off | Enable the dispatcher. Without this flag no dispatcher is created. |
--alert-webhook-url |
— | Webhook URL to POST alerts to. Both http:// and https:// are supported. |
--alert-min-severity |
warning |
Drop alerts below this level. Values: info, warning, critical. |
--alert-dedupe-window |
60 |
Seconds. Suppress repeated alerts for the same exe+severity. |
--alert-rate-limit |
60 |
Max alerts dispatched per minute. Excess alerts are dropped silently. |
Severity Levels¶
| Condition | Severity |
|---|---|
| Process allowed by policy | info |
| Process denied in monitor mode | warning |
| Process denied in enforce mode | critical |
| Mitigation (SIGKILL) failed | critical |
Audit Log Fields Added¶
Every event log line now includes:
{
"event_id": "0000000000000000",
"severity": "critical",
"node_id": "k8s-node-01",
"policy_version": "v1.2.3",
...
}
event_id— monotonic hex counter, unique per daemon run. Use this to correlate audit log entries with webhook alerts.node_id— defaults to system hostname; override with--node-id.policy_version— optional tag set via--policy-version.
Safety Guarantees¶
- Alert dispatch never blocks the event loop. Events are queued and delivered by a background thread.
- If the webhook is unreachable, the failure is logged to stderr and the daemon continues.
- If the queue reaches 128 entries (e.g. webhook is down), the oldest queued alert is dropped to prevent unbounded memory growth.
--alert-dedupe-windowand--alert-rate-limitprevent alert storms from overwhelming downstream notification services.
Legacy Alerting Recommendations¶
Create alerts at your SIEM/monitoring layer for: - Unauthorized process killed. - Mitigation failed. - Ring buffer drop rate above threshold. - Telemetry export failures. - Policy fetch or signature verification failures.
Incident Response Checklist¶
- Capture blocked executable path and command arguments.
- Identify parent process and user context.
- Determine whether the event is malicious or false positive.
- If false positive, patch policy and re-test in monitor mode.
- If malicious, preserve evidence and rotate compromised credentials.
Performance and Reliability Checks¶
Track the following indicators: - Event ingestion rate - Policy decision latency - Mitigation execution latency - Ring buffer utilization and drop rate - CPU and memory footprint of daemon
Maintenance Tasks¶
- Rotate logs and archive high-fidelity audit trails.
- Review policy exceptions monthly.
- Re-test enforcement after kernel upgrades.
- Validate integration endpoints and certificates.
Change Management¶
- Every production policy change should include:
- Change reason
- Risk assessment
- Rollback procedure
- Approval record
Minimal Runbook Template¶
Use this template for on-call handoffs: - Service owner - Escalation contacts - Current mode (monitor/enforce) - Policy version - Last deployment timestamp - Known exceptions - Rollback command/process