Puppet Early Access
A multi-cloud FinOps platform — ingest cloud billing across five hyperscalers, surface sub-second spend analytics, and auto-generate Terraform pull requests so customers apply savings through GitOps.
The problem
Cloud spend is scattered: every hyperscaler bills in its own format, at its own cadence, in volumes measured in terabytes a day. A FinOps platform has to ingest all of it without losing a row, normalise it into one comparable shape, answer analytical questions over billions of rows fast enough to feel interactive, and then close the loop — turn a recommendation into a change a customer can actually apply. And it has to do all of that multi-tenant and SOC 2-ready from day one.
What I did
I owned this platform end-to-end and authored its low-level design.
Billing in five formats out one side; an applyable Terraform pull request out the other.
- Multi-tenant by design. Database-per-tenant in ClickHouse, schema-per-tenant in PostgreSQL, async Kafka messaging, token-auth REST for service-to-service. Hard tenant isolation was the SOC 2-ready default, not a retrofit.
- Ingest at scale. Terabytes a day, dual-trigger (scheduled + SQS event-driven), with KEDA scaling the parser fleet 0→N on queue backlog under backpressure, plus dead-letter queues and retry. Zero production data-loss.
- Sub-second analytics. ClickHouse materialised views, pre-aggregation, Redis caching and query rewriting took p95 from ~20s to sub-second over billions of rows — the difference between a report you wait for and a tool you explore.
- GraphQL platform. An Apollo Federation supergraph over ~7 Go (gqlgen) subgraphs, codegen typings into a React + TanStack UI, with reactive state over Redis pub/sub and WebSockets.
- Closing the loop. Optimisation recommendations become auto-generated Terraform pull requests, so a customer applies savings through their normal GitOps review — not a console click that drifts from their IaC.
Impact
- p95 analytics latency from ~20 seconds to sub-second over billions of rows.
- Zero production data-loss across terabyte-a-day ingest, with elastic 0→N scaling absorbing spend spikes.
- Savings delivered as reviewable Terraform PRs, keeping customers’ infra in GitOps rather than out-of-band console changes.
- A multi-tenant, SOC 2-ready platform spanning five hyperscalers from one normalised FOCUS model.
Note: Product UI screenshots are internal and pending clearance — the architecture above is the cleared view. Internal UI captures will be added here once approved.