Friday, 4:47 PM
Deploy went out. The pipeline was green. Then the pager fired.
Someone opened kubectl. Someone opened Grafana. Someone scrolled through the last five commits in git. Slack filled up with theories:
- “Is it the new image tag?”
- “Did we change an env var?”
- “Maybe it’s the dependency — their status page looks fine though.”
Forty minutes in, we still didn’t have a defensible answer to the only question that mattered:
“Why are errors and performance degradation happening after my last deployment?”
I’ve sat in this room at VMware, at client SaaS shops, and in my own side projects. The tools got better. The war rooms didn’t get shorter.
The problems I kept hitting
MTTR is the real SaaS tax
Every minute of downtime costs revenue, trust, and on-call sanity. We talk about “shift left” in load testing and CI gates — and I believe in that; it’s why I built PerfSage Reveal and SLO Reporter.
But production still breaks right after deploy. Not during the load test. Not in staging on a Tuesday. On Friday, when the traffic curve is real and the rollback decision has a dollar sign attached.
The industry loves average latency on dashboards. Averages lie — I wrote about that in The P99 Trap. In prod, it’s usually error rate + tail latency that tell you something broke. Finding what broke is the expensive part.
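To make the "averages lie" point concrete, here is a minimal sketch using made-up numbers (the latency sample and the `percentile` helper are illustrative assumptions, not output from any PerfSage tool). Two slow requests out of a hundred barely move the mean, but the p99 exposes them immediately.

```python
import statistics

# Hypothetical sample in milliseconds: 98 fast requests and 2 very slow ones.
latencies_ms = [100] * 98 + [5000] * 2


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    # Integer ceiling division avoids floating-point surprises in the rank.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


mean_ms = statistics.mean(latencies_ms)  # 198: looks healthy on a dashboard
p50_ms = percentile(latencies_ms, 50)    # 100: the typical request is fine
p99_ms = percentile(latencies_ms, 99)    # 5000: the tail is on fire

print(f"mean={mean_ms}ms p50={p50_ms}ms p99={p99_ms}ms")
```

The mean sits at 198 ms while the slowest 2% of users wait five seconds, which is why tail latency and error rate are the signals worth watching after a deploy.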
Diagnosis takes hours because signals don’t talk
I wasn’t missing data. I was missing a story that tied the data together.
| Tool | What it shows | What it doesn’t show |
|---|---|---|
| kubectl describe | One pod’s last state | Deploy context, cross-pod pattern |
| Grafana / Prometheus | A metric spike | Why the spike started after this image tag |
| Container logs | Stack trace, error string | Link to the resource limit change in the same deploy |
| Git | Suspect commit | Whether that commit correlates with new log fingerprints |
Each tool is correct in isolation. None of them cite each other. So you context-switch for an hour — or three — until someone connects OOMKilled with a memory limit that changed in the same rollout.
That’s not a skill problem. It’s a workflow problem.
The expensive-tools gap
Enterprise AIOps and RCA platforms exist. They also exist at enterprise price tags: long contracts, agents in every namespace, “contact sales,” quarters of onboarding.
Startups and mid-market SaaS teams run the same Kubernetes, hit the same OOMKilled and CrashLoopBackOff patterns, and have a fraction of that budget. I’ve watched teams not buy the tool they need — not because they don’t want RCA, but because procurement and seat math kill the conversation before the pilot starts.
That’s the gap PerfSage exists to close.
Why I decided to build instead of working around it
When I drowned in JMeter HTML reports, I didn’t want another dashboard. I wanted someone — or something — to answer “so what do we do?” That’s how Reveal happened.
Production has the same gap, with higher stakes:
- Test time — Reveal turns a JTL into charts, SLO verdicts, and recommendations.
- Gate time — SLO Reporter blocks bad builds in CI.
- Prod time — nothing in the PerfSage ladder answered the post-deploy war room.
I couldn’t keep telling teams to “correlate manually” while selling them on analysis over reporting. The philosophy had to extend to Kubernetes.
Three principles I’m building SignalPilot around:
1. Analysis over dashboards. Don’t show me forty panels. Give me ranked hypotheses with evidence — deploy diff + events + metrics + logs cited in one finding.
2. Deterministic rules over LLM theater. Core RCA runs without an API key. Optional narrative polish exists; magic guesses don’t. I want reproducible output you can put in a postmortem.
3. Open source over gatekeeping. MIT licensed. Read-only RBAC. No agents in your application pods. Good reliability tooling shouldn’t require a Fortune 500 procurement cycle.
What I’m building (high level)
SignalPilot runs an observe → correlate → explain → recommend → verify loop against your cluster after a deploy.
What it pulls in:
- Deploy diff — image tag, env vars, resource requests/limits, probe changes (the anchor: what changed)
- Kubernetes API + events — restarts, OOMKilled, CrashLoopBackOff, FailedScheduling, probe failures
- metrics-server — CPU and memory saturation vs limits
- Container logs — clustered fingerprints; new error patterns after deploy
- cAdvisor — CPU throttling, memory working-set pressure
- Prometheus (optional, auto-detected) — p95/p99 latency, error rate, CFS throttle
- Git (optional) — suspect commit when log fingerprints shift
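The deploy diff is the anchor for everything else in the list above, so it is worth sketching what "what changed" can look like in practice. This is a minimal illustration, not SignalPilot's actual implementation: the helper names (`flatten`, `summarize`, `diff_container`) are hypothetical, and the input dicts are assumed to follow the shape of a container entry in a Kubernetes Deployment spec.

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts into {"a.b.c": value} so fields can be compared one by one."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def summarize(container):
    """Keep only the fields a post-deploy diff cares about (an assumed selection)."""
    watched = ("image", "resources", "livenessProbe", "readinessProbe")
    subset = {k: container[k] for k in watched if k in container}
    # Kubernetes stores env as a list of {name, value}; index it by name.
    subset["env"] = {e["name"]: e.get("value") for e in container.get("env", [])}
    return flatten(subset)


def diff_container(old, new):
    """Return sorted (field, old_value, new_value) tuples for every field that changed."""
    before, after = summarize(old), summarize(new)
    return [
        (field, before.get(field), after.get(field))
        for field in sorted(before.keys() | after.keys())
        if before.get(field) != after.get(field)
    ]


# Hypothetical before/after container specs for one rollout.
old = {
    "image": "my-app:1.4.2",
    "env": [{"name": "JAVA_OPTS", "value": "-Xmx512m"}],
    "resources": {"limits": {"memory": "1Gi", "cpu": "500m"}},
}
new = {
    "image": "my-app:1.5.0",
    "env": [{"name": "JAVA_OPTS", "value": "-Xmx512m"}],
    "resources": {"limits": {"memory": "512Mi", "cpu": "500m"}},
}

changes = diff_container(old, new)
for field, was, now in changes:
    print(f"{field}: {was} -> {now}")
```

Here the diff surfaces two changes, the image tag and the memory limit, and every later signal (events, metrics, logs) can be checked against that short list.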
What comes out:
Ranked findings. Each one fuses multiple signal types — not a single chart anomaly. Example: OOMKilled + memory at 94% of limit + git commit touching heap-related config → undersized memory limit, with a concrete kubectl fix you can copy.
Rules I’m shipping first include oom_killed, cpu_throttled, crash_loop, image_pull_error, probe_failure, and code_regression. Full rule table and signal tiers are on the SignalPilot landing page.
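To show what "fusing multiple signal types" can mean for a deterministic rule, here is a rough sketch modeled on the OOMKilled example above. Everything in it is an assumption for illustration: the `Finding` shape, the signal keys, the 90% threshold, and the scoring are hypothetical and not taken from SignalPilot's code. The suggested fix uses the standard `kubectl set resources` command with placeholders left unfilled.

```python
from dataclasses import dataclass, field


@dataclass
class Finding:
    rule: str
    severity: str
    score: int
    evidence: list = field(default_factory=list)
    fix: str = ""


def oom_killed_rule(signals):
    """Return a Finding when an OOMKilled event exists, else None.

    Each corroborating signal adds one line of evidence and one point of score.
    """
    if "OOMKilled" not in signals.get("events", []):
        return None
    evidence = ["event: OOMKilled"]
    score = 1
    ratio = signals.get("memory_usage_ratio")
    if ratio is not None and ratio >= 0.90:  # assumed saturation threshold
        evidence.append(f"metrics: memory at {ratio:.0%} of limit")
        score += 1
    if "resources.limits.memory" in signals.get("deploy_diff", []):
        evidence.append("deploy diff: memory limit changed in this rollout")
        score += 1
    if signals.get("git_touches_heap_config"):
        evidence.append("git: commit touches heap-related config")
        score += 1
    severity = "HIGH" if score >= 3 else "MEDIUM"  # assumed cutoff
    fix = "kubectl set resources deployment/<name> -n <namespace> --limits=memory=<new-limit>"
    return Finding("oom_killed", severity, score, evidence, fix)


def rank(findings):
    """Highest score first; rules that did not fire (None) are dropped."""
    return sorted((f for f in findings if f is not None), key=lambda f: f.score, reverse=True)


# Hypothetical signals matching the example in the text.
signals = {
    "events": ["OOMKilled"],
    "memory_usage_ratio": 0.94,
    "deploy_diff": [],
    "git_touches_heap_config": True,
}
finding = oom_killed_rule(signals)
for line in finding.evidence:
    print(line)
print(finding.severity, finding.fix)
```

Because the rule is plain conditionals over collected signals, the same inputs always produce the same finding, which is the property that makes the output safe to paste into a postmortem.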
What it’s not:
- Not a replacement for Prometheus, Grafana, or your observability stack
- Not dependent on Prometheus or an LLM for useful output
- Not a paid SaaS — launching open source now
For install commands and CI gate examples, see Field Notes #3.
How this changes the MTTR math
I’m not going to invent a “73% faster” stat I can’t defend. Here’s the honest framing:
Before: For “obvious” Kubernetes issues — OOM, CPU throttle, bad probe, image pull — I’ve routinely seen 45 minutes to three hours when three engineers are context-switching across tools and Slack threads.
After (the goal): First ranked, cited finding in minutes, because correlation is automated and tied to the deploy diff.
In CI/CD: signalpilot gate exits non-zero on HIGH+ findings and exports JUnit XML — so you can complement load-test SLO gates from SLO Reporter with a post-deploy sanity check before traffic fully shifts.
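The gate behavior described above (non-zero exit on HIGH+ findings, JUnit XML export) can be sketched in a few lines. This is an assumption-laden illustration rather than the real `signalpilot gate` internals: the severity scale, the finding dict shape, and the function names are all hypothetical, and the XML uses only the common `testsuite`/`testcase`/`failure` elements that most CI systems read.

```python
import xml.etree.ElementTree as ET

SEVERITY_ORDER = ["LOW", "MEDIUM", "HIGH", "CRITICAL"]  # assumed scale


def is_blocking(severity, threshold="HIGH"):
    """True when the severity is at or above the gate threshold."""
    return SEVERITY_ORDER.index(severity) >= SEVERITY_ORDER.index(threshold)


def to_junit_xml(findings, threshold="HIGH"):
    """Render one testcase per finding; blocking findings become failures."""
    blocking = [f for f in findings if is_blocking(f["severity"], threshold)]
    suite = ET.Element(
        "testsuite",
        name="post-deploy-gate",
        tests=str(len(findings)),
        failures=str(len(blocking)),
    )
    for f in findings:
        case = ET.SubElement(suite, "testcase", classname="gate", name=f["rule"])
        if is_blocking(f["severity"], threshold):
            failure = ET.SubElement(case, "failure", message=f["message"])
            failure.text = f["severity"]
    return ET.tostring(suite, encoding="unicode")


def gate_exit_code(findings, threshold="HIGH"):
    """Return 1 if any finding should block the pipeline, otherwise 0."""
    return 1 if any(is_blocking(f["severity"], threshold) for f in findings) else 0


# Hypothetical findings for one deploy.
findings = [
    {"rule": "oom_killed", "severity": "HIGH", "message": "memory at 94% of limit"},
    {"rule": "cpu_throttled", "severity": "LOW", "message": "brief CFS throttling"},
]
report_xml = to_junit_xml(findings)
exit_code = gate_exit_code(findings)
print(report_xml)
print("exit code:", exit_code)
```

In a pipeline step you would write `report_xml` to a file for the CI test reporter and pass `exit_code` to `sys.exit`, so a single HIGH finding fails the stage while LOW findings are only recorded.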
I’m not promising zero incidents. I’m promising you stop paying the tax of manual correlation on every deploy.
The PerfSage ladder: test → gate → RCA
This is the product story I’ve been building toward:
- Reveal — JMeter JTL analysis in the lab
- SLO Reporter — CI gates on load tests
- SignalPilot — post-deploy RCA in production
Test-time analysis and prod-time RCA share the same DNA: go beyond reporting data and explain what to do next.
Install now
pip install perfsage-signalpilot
kubectl apply -f deploy/signalpilot-rbac.yaml
signalpilot analyze my-namespace --deployment my-app --output report.html
- Repo: github.com/perfsage/signalpilot · Release: v1.0.0
- signalpilot analyze: HTML report with ranked findings and kubectl recommendations
- signalpilot gate: pipeline gate with JUnit XML export
- Read-only RBAC: deploy/signalpilot-rbac.yaml included; no agents in app pods
Try Reveal and SLO Reporter if you haven’t — SignalPilot is the third rung on the same ladder.
If this sounds familiar
If you’ve ever stared at Grafana while someone said “should we roll back?” and nobody could point to evidence, this is for you.
I’m building in public. Feedback, issues, and war-room stories welcome on GitHub.
Also on Medium: How SignalPilot Correlates Kubernetes Signals for Post-Deploy RCA
Field Notes #4 · By Aashish Bajpai