SOS Services
engineering · Vimal Bahuguna

Observability before customers

We wired Prometheus, Alertmanager, and Grafana into Aviation AI Pro before the first user signed up. Here's why, and the volume-mount bug that almost burned us.

The case against early observability is straightforward: you don’t have users yet, the alerts won’t fire, the dashboards stay flat, and the time would be better spent on features.

The case for it is also straightforward: when your first user does show up, you’ll discover the things they break in their first hour. You want the metrics already collecting by then, not still being designed.

We picked the second case. Here’s what shipped and what we learned.

What it looks like

Aviation AI Pro runs eight containers in production: the FastAPI app, the React SPA, a Celery worker, a Celery beat scheduler, and a Redis broker, plus Prometheus, Alertmanager, and Grafana for observability. The first five are the product. The last three exist so we can answer the question “is the product actually working right now?”

The Prometheus instance scrapes the API’s /metrics endpoint and a Celery exporter sidecar. Eleven alert rules are loaded — six FDR pipeline alerts (export latency p95, anomaly detection failure rates, pipeline silence) and five system-health alerts (scrape target down, error rate, no Celery workers, queue backlog, task failure spikes).
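A quick way to confirm that the rules you think are loaded really are loaded is to ask Prometheus itself. The sketch below counts alerting rules per group from the /api/v1/rules endpoint of the Prometheus HTTP API. The address and the group names in the sample payload are assumptions for illustration, not the project's actual configuration.

```python
import json
import urllib.request

# Assumption: Prometheus address; adjust for your deployment.
PROMETHEUS_URL = "http://localhost:9090"


def count_alerting_rules(payload: dict) -> dict:
    """Return {group_name: number_of_alerting_rules} from a /api/v1/rules response."""
    counts = {}
    for group in payload.get("data", {}).get("groups", []):
        # Recording rules live in the same groups; only count alerting rules.
        alerting = [r for r in group.get("rules", []) if r.get("type") == "alerting"]
        counts[group["name"]] = len(alerting)
    return counts


def fetch_rules(base_url: str = PROMETHEUS_URL) -> dict:
    """Fetch the loaded rule groups from the Prometheus HTTP API."""
    with urllib.request.urlopen(f"{base_url}/api/v1/rules", timeout=5) as resp:
        return json.load(resp)


# Offline example with a trimmed, hypothetical payload (group and rule names are made up).
sample = {
    "status": "success",
    "data": {
        "groups": [
            {"name": "fdr_pipeline", "rules": [{"type": "alerting", "name": "ExampleA"}]},
            {"name": "system_health", "rules": [
                {"type": "alerting", "name": "ExampleB"},
                {"type": "recording", "name": "example:rate5m"},
            ]},
        ]
    },
}
counts = count_alerting_rules(sample)
print(counts, "total:", sum(counts.values()))
```

Against a live instance, count_alerting_rules(fetch_rules()) gives the same breakdown, and comparing the total with the expected rule count makes a cheap post-deploy check.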

Alertmanager routes critical alerts to email via Resend SMTP. We use HTML templates with severity color-coding so a glance at the inbox tells you whether it’s a warning or a critical.
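With no real alerts firing yet, one way to exercise the email route is to push a synthetic alert into Alertmanager and watch the inbox. This is a minimal sketch using the Alertmanager v2 API (POST /api/v2/alerts). The address, the alert name, and the summary text are assumptions; only the severity label corresponds to something described above.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone

# Assumption: Alertmanager address; adjust for your deployment.
ALERTMANAGER_URL = "http://localhost:9093"


def build_test_alert(severity: str, minutes: int = 5) -> list:
    """Build a POST /api/v2/alerts payload containing one synthetic alert."""
    now = datetime.now(timezone.utc)
    return [{
        "labels": {
            "alertname": "SyntheticRoutingTest",  # hypothetical alert name
            "severity": severity,
        },
        "annotations": {"summary": f"Synthetic {severity} alert to test email routing"},
        # RFC 3339 timestamps; the alert resolves itself once endsAt has passed.
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
    }]


def send_alert(alerts: list, base_url: str = ALERTMANAGER_URL) -> int:
    """POST the alerts to Alertmanager and return the HTTP status code."""
    req = urllib.request.Request(
        f"{base_url}/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status
```

Calling send_alert(build_test_alert("critical")) should produce one email if the route matches on severity, and sending a "warning" afterwards shows whether the two are distinguishable at a glance.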

Grafana has five dashboards: API overview, HTTP deep-dive, Prometheus self-monitoring, auth & tenants, and Celery workers. All are publicly reachable but gated by basic auth at the Traefik edge.

The bug we almost shipped

Two days after the observability stack went live, somebody (me) changed the Grafana admin password through the UI. The next time the container restarted, the password didn’t work anymore. Investigation took an hour.

The problem: our Dockerfile didn’t declare VOLUME /var/lib/grafana. Without it, Grafana’s SQLite database (which holds the admin user, all dashboards, and all sessions) lives in the container’s writable layer. That layer is thrown away whenever the container is replaced, which is what happens on a redeploy, so each new container starts from an empty database. It then re-seeds the admin user from the GF_SECURITY_ADMIN_PASSWORD environment variable, which was now out of sync with what I’d set in the UI.

If we’d had real users at that point, all their dashboards would have been gone. Two more containers (Prometheus and Alertmanager) had the same missing declaration. The fix was one line in each of three Dockerfiles:

# Grafana
VOLUME ["/var/lib/grafana"]

# Prometheus
VOLUME ["/prometheus"]

# Alertmanager
VOLUME ["/alertmanager"]

The paradox of adding a VOLUME declaration is that the deploy that introduces the volume also wipes the state one last time: the old container’s data lived in its writable layer and is not migrated, so the first container with the declaration starts on a fresh mount. We re-provisioned the five dashboards via Grafana’s API one final time. The next restart preserved them. The next ten will too.
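This class of bug is easy to check for mechanically, because docker inspect reports every mount a running container has. The sketch below flags containers whose data directory is not backed by a mount. The container names are hypothetical; the paths are the three from the fix above.

```python
import json
import subprocess

# Data directories from the fix above, keyed by hypothetical container names.
REQUIRED_MOUNTS = {
    "grafana": "/var/lib/grafana",
    "prometheus": "/prometheus",
    "alertmanager": "/alertmanager",
}


def has_mount(inspect_output: list, destination: str) -> bool:
    """True if the first inspected container has a volume or bind mount at destination."""
    mounts = inspect_output[0].get("Mounts", [])
    return any(m.get("Destination") == destination for m in mounts)


def inspect_container(name: str) -> list:
    """Run `docker inspect` and parse its JSON output (a list, one entry per object)."""
    result = subprocess.run(
        ["docker", "inspect", name], capture_output=True, text=True, check=True
    )
    return json.loads(result.stdout)


def unpersisted(required: dict = REQUIRED_MOUNTS) -> list:
    """Return the names of containers whose data directory has no mount behind it."""
    return [
        name for name, path in required.items()
        if not has_mount(inspect_container(name), path)
    ]
```

Running unpersisted() on the host after a deploy and failing the deploy when the list is non-empty would have turned our hour of investigation into a one-line error.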

What it caught

In the week after the stack went live and before any customer existed, the alerting pipeline caught:

  • The marketing inquiry email pipeline silently failing because the sosservices.online domain wasn’t actually verified in Resend (a long-running assumption we never validated)
  • The Coolify is_http_basic_auth_enabled flag being stored but not actually rendered into Traefik labels (a vendor quirk we worked around by emitting the basicauth middleware label directly)
  • A Pydantic Optional import missing from three FastAPI routers — /openapi.json returned 500 until we found and fixed it

None of these had customer impact because there were no customers. All three would have been embarrassing to discover from a support email. Instead, we found them after the up{job="avaipro-api"} == 0 alert fired at 3 AM and we went poking at the system.
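The same up expression can be checked by hand when you are the one poking at the system. This sketch runs an instant query against the Prometheus HTTP API (/api/v1/query) and lists the instances currently reporting 0. The address is an assumption, and the job name is passed in by the caller rather than hard-coded.

```python
import json
import urllib.parse
import urllib.request

# Assumption: Prometheus address; adjust for your deployment.
PROMETHEUS_URL = "http://localhost:9090"


def down_instances(payload: dict) -> list:
    """Return instance labels of series whose sample value is 0 in an instant-query response."""
    down = []
    for series in payload.get("data", {}).get("result", []):
        _timestamp, value = series["value"]  # the API returns the value as a string
        if float(value) == 0:
            down.append(series["metric"].get("instance", "<unknown>"))
    return down


def query_up(job: str, base_url: str = PROMETHEUS_URL) -> dict:
    """Run the instant query up{job="<job>"} and return the decoded JSON response."""
    query = urllib.parse.urlencode({"query": f'up{{job="{job}"}}'})
    with urllib.request.urlopen(f"{base_url}/api/v1/query?{query}", timeout=5) as resp:
        return json.load(resp)
```

For example, down_instances(query_up("avaipro-api")) returns an empty list while every scrape target for that job is healthy.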

What it costs

The observability stack adds about 4 GB of RAM to the production cost (Prometheus TSDB + Grafana + Alertmanager) and roughly fifteen minutes per week of attention. The Grafana dashboards have to be kept in sync with new endpoints, and the alert thresholds need occasional tuning. Neither is free, but neither is expensive.

In exchange, we have a way to verify, at any moment, that the platform is actually working. The first customer will arrive into a system we already trust. That’s the trade.