New · Open Source

Kubernetes RCA — Why Did Errors Spike After Deploy?

SignalPilot correlates deploy diffs, K8s events, metrics-server, logs, cAdvisor, Prometheus, and git — then ranks findings with copy-paste kubectl fixes. Analysis, not another dashboard.

The loop

observe → correlate → explain → recommend → verify → learn

Observe

Parallel collectors across K8s API, logs, metrics, cAdvisor, Prometheus, and optional git history.
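
As a rough illustration of the "parallel collectors" idea, the sketch below fans several collectors out on a thread pool and treats a failing optional source as empty. The collector functions, their return shape, and the 30-second timeout are hypothetical stand-ins, not SignalPilot's actual internals.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Signal = Dict[str, str]
Collector = Callable[[str], List[Signal]]

# Hypothetical collectors: real ones would query the K8s API, logs, etc.
def collect_events(namespace: str) -> List[Signal]:
    return [{"source": "events", "reason": "BackOff", "namespace": namespace}]

def collect_pod_status(namespace: str) -> List[Signal]:
    return [{"source": "k8s_api", "reason": "OOMKilled", "namespace": namespace}]

def collect_git(namespace: str) -> List[Signal]:
    # Simulates an optional source that is not configured.
    raise RuntimeError("no git repository configured")

def run_collectors(namespace: str, collectors: Dict[str, Collector]) -> Dict[str, List[Signal]]:
    """Run every collector concurrently; a failing source yields an empty list."""
    results: Dict[str, List[Signal]] = {}
    with ThreadPoolExecutor(max_workers=max(1, len(collectors))) as pool:
        futures = {name: pool.submit(fn, namespace) for name, fn in collectors.items()}
        for name, future in futures.items():
            try:
                results[name] = future.result(timeout=30)  # assumed timeout
            except Exception:
                results[name] = []  # optional or broken source: keep going
    return results

signals = run_collectors(
    "my-namespace",
    {"events": collect_events, "pods": collect_pod_status, "git": collect_git},
)
```

Swallowing collector errors keeps one missing source (here, git) from blocking the rest of the analysis.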

Correlate

Deterministic rules fuse cross-source evidence — each finding cites multiple signal types.
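
To make "each finding cites multiple signal types" concrete, here is a minimal sketch of one deterministic rule in the spirit of oom_killed. The `Finding` and `Evidence` types, the 90% "near limit" threshold, and the HIGH severity are assumptions made for illustration; the page does not state the real values.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Evidence:
    source: str   # which signal type produced this evidence
    detail: str

@dataclass
class Finding:
    rule: str
    severity: str
    evidence: List[Evidence] = field(default_factory=list)

MEMORY_NEAR_LIMIT = 0.90  # assumed threshold for "mem near limit"

def oom_killed_rule(last_termination_reason: str,
                    memory_used_bytes: int,
                    memory_limit_bytes: int) -> Optional[Finding]:
    """Fire only when both signals agree: an OOMKilled status and high memory use."""
    if last_termination_reason != "OOMKilled" or memory_limit_bytes <= 0:
        return None
    ratio = memory_used_bytes / memory_limit_bytes
    if ratio < MEMORY_NEAR_LIMIT:
        return None
    return Finding(
        rule="oom_killed",
        severity="HIGH",  # assumed severity label
        evidence=[
            Evidence("k8s_api", "container last terminated with reason OOMKilled"),
            Evidence("metrics-server", f"memory at {ratio:.0%} of limit"),
        ],
    )

finding = oom_killed_rule("OOMKilled", 481 * 1024**2, 512 * 1024**2)
```

Because the rule is a plain function of its inputs, the same evidence always yields the same finding.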

Recommend

Ranked, copy-paste kubectl fixes — not generic advice.
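
One way ranked, copy-paste fixes could be produced is sketched below: findings are sorted by severity and a `kubectl set resources` command is rendered for a memory-limit fix. The severity levels, the finding dictionaries, the container name, and the new limit value are illustrative assumptions; only the kubectl subcommand and its `-n`, `-c`, and `--limits` flags are standard.

```python
import shlex
from typing import Dict, List

# Assumed severity levels; lower rank sorts first.
SEVERITY_RANK = {"CRITICAL": 0, "HIGH": 1, "MEDIUM": 2, "LOW": 3}

def memory_fix_command(namespace: str, deployment: str, container: str, new_limit: str) -> str:
    """Render a shell-safe kubectl command that raises a container's memory limit."""
    args = [
        "kubectl", "set", "resources", "deployment", deployment,
        "-n", namespace,
        "-c", container,
        f"--limits=memory={new_limit}",
    ]
    return shlex.join(args)

def rank(findings: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Order findings from most to least severe (stable for equal severities)."""
    return sorted(findings, key=lambda f: SEVERITY_RANK[f["severity"]])

findings = [
    {"rule": "probe_failure", "severity": "MEDIUM"},
    {"rule": "oom_killed", "severity": "HIGH"},
]
ranked = rank(findings)
command = memory_fix_command("my-namespace", "my-app", "app", "768Mi")
print(command)
```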

Verify

Capture a baseline before applying a fix, then compare after the next deploy. Each finding is reported as Fixed, Regressed, or Unchanged.
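
The baseline comparison can be pictured as a small classifier over one "lower is better" metric, such as restart count or error rate. This is a sketch only: the 10% tolerance band and the choice of a single metric are assumptions, and the tool's actual verdict logic may differ.

```python
from enum import Enum

class Verdict(Enum):
    FIXED = "Fixed"
    REGRESSED = "Regressed"
    UNCHANGED = "Unchanged"

def classify(baseline: float, current: float, tolerance: float = 0.10) -> Verdict:
    """Compare a lower-is-better metric before the fix and after the next deploy."""
    if baseline == 0:
        # Nothing was wrong before; any new occurrence counts as a regression.
        return Verdict.UNCHANGED if current == 0 else Verdict.REGRESSED
    change = (current - baseline) / baseline
    if change <= -tolerance:
        return Verdict.FIXED
    if change >= tolerance:
        return Verdict.REGRESSED
    return Verdict.UNCHANGED  # within the assumed noise band

# Example: 12 restarts before the fix, none after the next deploy.
verdict = classify(baseline=12, current=0)
```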

Signal sources

Tier | Source | Always-on
0 | Deploy diff (image, env, resources, probes) | Yes
0 | Git repo correlation (commit SHA → suspect files) | Optional
1 | K8s API: restarts, OOMKilled, CrashLoopBackOff, probes | Yes
1 | K8s Events: FailedScheduling, BackOff, Unhealthy | Yes
1 | metrics-server: CPU/memory saturation vs limits | Yes
1 | Container logs: drain3 clustering, new errors | Yes
2 | cAdvisor: CPU throttling, memory working-set | Yes
2 | Network: endpoint readiness, DNS failures | Yes
4 | Prometheus: p95/p99, error rate, CFS throttle | Optional
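
To show what detecting "new errors" in container logs can look like, the sketch below masks variable tokens so that similar lines share a fingerprint, then reports fingerprints that appear only after the deploy. It is a deliberately simplified regex stand-in for drain3-style clustering, not drain3 itself, and the mask patterns and sample log lines are invented for illustration.

```python
import re
from typing import Iterable, List

# Simplified masking rules (an assumption; drain3 uses a parse tree instead).
_MASKS = [
    (re.compile(r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b"), "<uuid>"),
    (re.compile(r"\b\d+(?:\.\d+)?\b"), "<num>"),
]

def fingerprint(line: str) -> str:
    """Replace variable parts of a log line with placeholder tokens."""
    for pattern, token in _MASKS:
        line = pattern.sub(token, line)
    return line.strip()

def new_fingerprints(before: Iterable[str], after: Iterable[str]) -> List[str]:
    """Return fingerprints seen after the deploy but never before, in first-seen order."""
    known = {fingerprint(line) for line in before}
    fresh: List[str] = []
    for line in after:
        fp = fingerprint(line)
        if fp not in known and fp not in fresh:
            fresh.append(fp)
    return fresh

before_deploy = ["GET /orders/17 took 120 ms", "GET /orders/99 took 87 ms"]
after_deploy = [
    "GET /orders/3 took 95 ms",
    "ERROR heap exhausted after 2048 MB",
    "ERROR heap exhausted after 4096 MB",
]
fresh = new_fingerprints(before_deploy, after_deploy)
```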

RCA rules

Rule | Trigger signals | Typical fix
oom_killed | OOMKilled + mem near limit | Raise memory limit
cpu_throttled | CFS throttle > 30% + latency regression | Raise CPU limit/request
crash_loop | CrashLoopBackOff + log patterns + config diff | Fix env vars, rollback
image_pull_error | ImagePullBackOff / ErrImagePull | Fix image tag, rollback
probe_failure | Readiness/liveness probe failing | Fix probe path/port/timing
code_regression | New log fingerprints after deploy + git suspect | Rollback, investigate commit
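
As a worked example of the cpu_throttled trigger, the sketch below derives a CFS throttle ratio from two samples of the cAdvisor counters `container_cpu_cfs_throttled_periods_total` and `container_cpu_cfs_periods_total`, then combines it with a p95 latency comparison. The 30% throttle threshold comes from the table above; the 20% latency-regression threshold and the sample numbers are assumptions.

```python
def throttle_ratio(throttled_start: float, throttled_end: float,
                   periods_start: float, periods_end: float) -> float:
    """Share of CFS periods that were throttled between two counter samples."""
    periods = periods_end - periods_start
    if periods <= 0:
        return 0.0  # no elapsed periods (or a counter reset): nothing to report
    return (throttled_end - throttled_start) / periods

def cpu_throttled(ratio: float, p95_before_ms: float, p95_after_ms: float,
                  throttle_threshold: float = 0.30,
                  latency_regression: float = 0.20) -> bool:
    """Fire only when throttling and a latency regression are both present."""
    if p95_before_ms <= 0:
        return False
    latency_change = (p95_after_ms - p95_before_ms) / p95_before_ms
    return ratio > throttle_threshold and latency_change >= latency_regression

# Example: 420 of 1000 periods throttled, p95 rose from 180 ms to 260 ms.
ratio = throttle_ratio(100, 520, 1000, 2000)
fired = cpu_throttled(ratio, p95_before_ms=180, p95_after_ms=260)
```

Requiring both conditions avoids flagging workloads that are throttled but still meet their latency targets.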

Quick start

From deploy to RCA report

Prerequisites: Python 3.12+ and kubectl configured for the target cluster. The read-only RBAC is applied by the kubectl apply step below.

pip install perfsage-signalpilot

kubectl apply -f deploy/signalpilot-rbac.yaml

signalpilot analyze my-namespace --deployment my-app --output report.html

# CI gate (exit 1 on HIGH+ findings)
signalpilot gate my-namespace --deployment my-app --junit-xml results.xml
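
The gate's "exit 1 on HIGH+ findings" behavior can be approximated as a threshold check over finding severities. The severity scale and the function below are hypothetical and only illustrate the decision; they are not the `signalpilot gate` implementation.

```python
from typing import Iterable

# Assumed severity scale, ordered from least to most severe.
SEVERITY_ORDER = ["LOW", "MEDIUM", "HIGH", "CRITICAL"]

def gate_exit_code(severities: Iterable[str], fail_at: str = "HIGH") -> int:
    """Return 1 if any finding is at or above the threshold, otherwise 0."""
    threshold = SEVERITY_ORDER.index(fail_at)
    blocking = any(SEVERITY_ORDER.index(s) >= threshold for s in severities)
    return 1 if blocking else 0

# A CI step would pass this value to sys.exit() so the pipeline fails on HIGH+.
exit_code = gate_exit_code(["LOW", "HIGH"])
print(exit_code)
```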

Frequently Asked Questions

What is PerfSage SignalPilot?

An open-source Kubernetes RCA copilot that answers why errors and performance degradation happened after your last deployment — by correlating deploy diffs, K8s events, metrics, logs, Prometheus, and git into ranked findings with copy-paste kubectl fixes.

How is this different from kubectl describe and dashboards?

kubectl shows one object at a time. Dashboards show metrics without deploy context. SignalPilot applies deterministic rules to cross-source evidence. For example, OOMKilled plus memory at 94% of limit plus a git commit touching heap code points to an undersized memory limit, and the finding comes with a concrete fix.

What cluster permissions does it need?

Read-only RBAC via deploy/signalpilot-rbac.yaml. It uses the Kubernetes API, metrics-server, and optional Prometheus auto-detection. No agents in your app pods.

Can I use it in CI/CD?

Yes. signalpilot gate exits non-zero on HIGH+ findings and can export JUnit XML for Jenkins or GitHub Actions. It complements the load-test SLO gates from PerfSage SLO Reporter.

Does it require Prometheus or an LLM?

Prometheus enriches findings but is optional (auto-detected). LLM narrative polish is optional — core RCA rules and kubectl recommendations run without any API key.

How fast is a SignalPilot RCA?

Under 5 minutes for typical post-deploy regressions. The deploy diff, events, metrics, and logs are correlated into a single ranked report, replacing hours of tab-switching across kubectl and dashboards.

Is it free?

Yes. SignalPilot is open source under the MIT license. Use PerfSage Reveal for test-time analysis and SignalPilot for production RCA.

Ready to RCA your next deploy?

pip install perfsage-signalpilot · Pair with Reveal for load tests and SLO Reporter for CI gates.