Kubernetes · SRE · performance engineering · Field Notes · open source · product launch

Deploy Broke Prod Again — So I Built PerfSage SignalPilot

After one too many post-deploy war rooms staring at kubectl and Grafana separately, I built SignalPilot — an open-source Kubernetes RCA copilot that correlates deploy diffs, events, metrics, logs, and git into ranked findings with kubectl fixes.

Field Notes #3 · TL;DR: You deployed. Errors spiked. Someone opens kubectl, someone opens Grafana, someone blames the last commit. PerfSage SignalPilot runs an observe → correlate → explain → recommend → verify loop across the Kubernetes API, metrics-server, logs, cAdvisor, Prometheus, and (optionally) git, then ranks its findings and attaches copy-paste kubectl fixes. Open source, MIT licensed.

Field Notes #4: For the why behind SignalPilot (war rooms, MTTR, the expensive-tool gap), read "I Got Tired of 3-Hour Post-Deploy War Rooms."

The question every deploy review should answer

“Why are errors and performance degradation happening after my last deployment?”

That question is simple. Getting a defensible answer in under five minutes is not.

kubectl describe shows one pod. Grafana shows a metric spike. Git shows a commit. None of them cite each other.


What SignalPilot does differently

SignalPilot fuses cross-source evidence into deterministic RCA rules:

Rule              What it correlates                          Typical fix
----------------  ------------------------------------------  ------------------------
oom_killed        OOMKilled + memory near limit               Raise memory limit
cpu_throttled     CFS throttle + latency regression           Raise CPU request/limit
crash_loop        CrashLoopBackOff + logs + config diff       Rollback or fix env
code_regression   New log fingerprints + git suspect commit   Investigate commit

Each finding cites multiple signal types — not a single chart anomaly.
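To make "deterministic rule over multiple signals" concrete, here is a minimal sketch of what an oom_killed-style check could look like. It is an illustration, not SignalPilot's actual code: the ContainerEvidence and Finding types, the 90% "near limit" threshold, the 1.5x headroom, and the HIGH severity label are all assumptions of mine. The only real interface it leans on is kubectl set resources.

```python
import math
from dataclasses import dataclass
from typing import List, Optional

NEAR_LIMIT_RATIO = 0.9  # assumption: "near limit" means >= 90% of the memory limit
LIMIT_HEADROOM = 1.5    # assumption: propose 1.5x the current limit as the fix
MIB = 2 ** 20


@dataclass
class ContainerEvidence:
    """Hypothetical snapshot of one container, merged from several sources."""
    namespace: str
    deployment: str
    container: str
    last_terminated_reason: Optional[str]  # pod status, e.g. "OOMKilled"
    peak_memory_bytes: int                 # peak working set seen before the kill
    memory_limit_bytes: Optional[int]      # from the pod spec; None if no limit is set


@dataclass
class Finding:
    rule: str
    severity: str
    signals: List[str]  # every signal that backs the finding
    fix: str            # copy-paste kubectl command


def check_oom_killed(ev: ContainerEvidence) -> Optional[Finding]:
    """Fire only when BOTH signals agree: an OOMKilled status and memory near the limit."""
    if ev.last_terminated_reason != "OOMKilled":
        return None
    if not ev.memory_limit_bytes:
        return None  # without a limit there is nothing to compare against
    ratio = ev.peak_memory_bytes / ev.memory_limit_bytes
    if ratio < NEAR_LIMIT_RATIO:
        return None
    new_limit_mib = math.ceil(ev.memory_limit_bytes * LIMIT_HEADROOM / MIB)
    return Finding(
        rule="oom_killed",
        severity="HIGH",
        signals=[
            "pod status: last terminated reason is OOMKilled",
            f"metrics: peak memory at {ratio:.0%} of the limit",
        ],
        fix=(
            f"kubectl set resources deployment/{ev.deployment} -n {ev.namespace} "
            f"-c {ev.container} --limits=memory={new_limit_mib}Mi"
        ),
    )
```

The point of the shape: the rule returns nothing unless both the status signal and the metric signal line up, so a lone memory spike on a chart never becomes a finding by itself.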


Quick start

git clone https://github.com/perfsage/signalpilot
cd signalpilot && pip install -e .

kubectl apply -f deploy/signalpilot-rbac.yaml

signalpilot analyze my-namespace --deployment my-app --output report.html

CI gate (exit 1 on HIGH+ findings):

signalpilot gate my-namespace --deployment my-app --junit-xml results.xml
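The gate's contract is just "non-zero exit when anything at HIGH or above shows up". A rough sketch of that decision, assuming a four-level severity scale (only HIGH is named in this post; LOW, MEDIUM, and CRITICAL are my guesses, and gate_exit_code is a hypothetical helper, not part of the SignalPilot API):

```python
from typing import Iterable

# Assumed ordering, lowest to highest. Only "HIGH" is confirmed by the post.
SEVERITY_ORDER = ["LOW", "MEDIUM", "HIGH", "CRITICAL"]


def gate_exit_code(severities: Iterable[str], threshold: str = "HIGH") -> int:
    """Return 1 if any finding is at or above the threshold, else 0."""
    floor = SEVERITY_ORDER.index(threshold)
    blocking = [s for s in severities if SEVERITY_ORDER.index(s) >= floor]
    return 1 if blocking else 0
```

A CI step would pass the result to sys.exit(), so the pipeline stops on a HIGH finding and lets LOW or MEDIUM ones through as warnings.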

Full docs are on the SignalPilot landing page and in the GitHub README.


The PerfSage ladder: test → gate → RCA

  1. Reveal — JMeter JTL analysis in the lab (/reveal/)
  2. SLO Reporter — CI gates on load tests (/slo-plugin/)
  3. SignalPilot — post-deploy RCA in production (/signalpilot/)

Field Notes #3 · By Aashish Bajpai