SaaS · Case study

Puppet Early Access

A multi-cloud FinOps platform — ingest cloud billing across five hyperscalers, surface sub-second spend analytics, and auto-generate Terraform pull requests so customers apply savings through GitOps.

End-to-end owner · LLD author · 2022 → 2026

Go · ClickHouse · Kafka · KEDA · GraphQL Federation · React · Terraform · Kubernetes

The problem

Cloud spend is scattered: every hyperscaler bills in its own format, at its own cadence, in volumes measured in terabytes a day. A FinOps platform has to ingest all of it without losing a row, normalise it into one comparable shape, answer analytical questions over billions of rows fast enough to feel interactive, and then close the loop — turn a recommendation into a change a customer can actually apply. And it has to do all of that multi-tenant and SOC 2-ready from day one.

What I did

I owned this platform end-to-end and authored its low-level design.

Sources: AWS · Azure · GCP · OCI · Kubernetes (FOCUS-normalised)

Ingest: Scheduled + SQS event triggers, Kafka, KEDA parser fleet 0→N, DLQ & retry

Store: ClickHouse db-per-tenant + materialised views · PostgreSQL schema-per-tenant · Redis

API: GraphQL Apollo Federation supergraph over ~7 Go subgraphs

Act: React + TanStack UI · auto-generated Terraform PRs → GitOps

Billing in five formats goes in one side; a ready-to-apply Terraform pull request comes out the other.
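To make the normalisation step concrete, here is a minimal sketch of mapping provider-specific billing rows into one common record. It is an illustration under assumptions, not the platform's code: `FocusRow` is a trimmed-down record loosely modelled on FOCUS column names, `normalise` is a hypothetical helper, and the per-provider input column names are placeholders that would need checking against each provider's export format.

```go
package main

import (
	"fmt"
	"strconv"
)

// FocusRow is a hypothetical, trimmed-down subset of a FOCUS-style record.
type FocusRow struct {
	ProviderName    string
	ServiceName     string
	BilledCost      float64
	BillingCurrency string
}

// normalise maps one raw billing row (column name -> value) into a FocusRow.
// The per-provider column names below are assumptions for illustration only.
func normalise(provider string, raw map[string]string) (FocusRow, error) {
	// For each provider: the raw columns feeding service, cost and currency.
	columns := map[string][3]string{
		"aws":   {"lineItem/ProductCode", "lineItem/UnblendedCost", "lineItem/CurrencyCode"},
		"azure": {"MeterCategory", "CostInBillingCurrency", "BillingCurrency"},
	}
	cols, ok := columns[provider]
	if !ok {
		return FocusRow{}, fmt.Errorf("unsupported provider %q", provider)
	}
	cost, err := strconv.ParseFloat(raw[cols[1]], 64)
	if err != nil {
		// Return an error so the caller can dead-letter the row instead of dropping it.
		return FocusRow{}, fmt.Errorf("bad cost in column %q: %w", cols[1], err)
	}
	return FocusRow{
		ProviderName:    provider,
		ServiceName:     raw[cols[0]],
		BilledCost:      cost,
		BillingCurrency: raw[cols[2]],
	}, nil
}

func main() {
	row, err := normalise("aws", map[string]string{
		"lineItem/ProductCode":   "AmazonEC2",
		"lineItem/UnblendedCost": "12.50",
		"lineItem/CurrencyCode":  "USD",
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", row)
}
```

Returning an error for an unparseable row, rather than skipping it, is what lets a dead-letter queue keep the "without losing a row" promise: the bad row is parked for retry instead of silently disappearing.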

Multi-tenant by design. Database-per-tenant in ClickHouse, schema-per-tenant in PostgreSQL, async Kafka messaging, token-auth REST for service-to-service. Hard tenant isolation was the SOC 2-ready default, not a retrofit.
Ingest at scale. Terabytes a day, dual-trigger (scheduled + SQS event-driven), with KEDA scaling the parser fleet 0→N on queue backlog under backpressure, plus dead-letter queues and retries. Zero production data loss.
Sub-second analytics. ClickHouse materialised views, pre-aggregation, Redis caching and query rewriting took p95 from ~20s to sub-second over billions of rows — the difference between a report you wait for and a tool you explore.
GraphQL platform. An Apollo Federation supergraph over ~7 Go (gqlgen) subgraphs, codegen typings into a React + TanStack UI, with reactive state over Redis pub/sub and WebSockets.
Closing the loop. Optimisation recommendations become auto-generated Terraform pull requests, so a customer applies savings through their normal GitOps review — not a console click that drifts from their IaC.
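As a rough sketch of the last step, the snippet below turns one recommendation into the title and body of a pull request. Everything named here is hypothetical: the `Recommendation` shape, the `renderPR` helper and the `finops:` title prefix are illustrative assumptions, and the sketch stops at rendering text. Actually opening the PR would go through whichever Git host the customer uses.

```go
package main

import (
	"fmt"
	"strings"
)

// Recommendation is a hypothetical shape for one optimisation suggestion.
type Recommendation struct {
	Address       string  // Terraform resource address, e.g. "aws_instance.api"
	Attribute     string  // attribute to change, e.g. "instance_type"
	Current       string  // value currently in the customer's IaC
	Proposed      string  // value the recommendation suggests
	MonthlySaving float64 // estimated saving in USD per month
}

// renderPR builds the title and body of a pull request for one recommendation.
// The body shows the attribute change as a diff-style before/after pair.
func renderPR(r Recommendation) (title, body string) {
	title = fmt.Sprintf("finops: set %s.%s to %q", r.Address, r.Attribute, r.Proposed)
	var b strings.Builder
	fmt.Fprintf(&b, "Estimated saving: $%.2f/month\n\n", r.MonthlySaving)
	fmt.Fprintf(&b, "- %s = %q\n", r.Attribute, r.Current)
	fmt.Fprintf(&b, "+ %s = %q\n", r.Attribute, r.Proposed)
	return title, b.String()
}

func main() {
	title, body := renderPR(Recommendation{
		Address:       "aws_instance.api",
		Attribute:     "instance_type",
		Current:       "t3.large",
		Proposed:      "t3.medium",
		MonthlySaving: 40,
	})
	fmt.Println(title)
	fmt.Print(body)
}
```

Keeping the change as a reviewable diff is the point of the design: the customer's own review and apply pipeline stays the only path to production, so the Terraform state never drifts from what was merged.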

Impact

p95 analytics latency from ~20 seconds to sub-second over billions of rows.
Zero production data loss across terabyte-a-day ingest, with elastic 0→N scaling absorbing spend spikes.
Savings delivered as reviewable Terraform PRs, keeping customers’ infra in GitOps rather than out-of-band console changes.
A multi-tenant, SOC 2-ready platform spanning five hyperscalers from one normalised FOCUS model.
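For readers unfamiliar with backlog-driven autoscaling, the 0→N behaviour mentioned above roughly follows a ceiling-division rule: enough replicas that each one handles at most a target share of the backlog, clamped to a maximum, and none at all when the queue is empty. The sketch below shows only that arithmetic. `desiredReplicas` and its parameters are hypothetical names, and a real KEDA deployment layers cooldown and stabilisation behaviour on top.

```go
package main

import "fmt"

// desiredReplicas returns how many parser replicas a backlog calls for:
// ceil(backlog / perReplica), clamped to [0, maxReplicas].
// An empty backlog (or a non-positive perReplica target) yields 0.
func desiredReplicas(backlog, perReplica, maxReplicas int) int {
	if backlog <= 0 || perReplica <= 0 {
		return 0
	}
	n := (backlog + perReplica - 1) / perReplica // integer ceiling division
	if n > maxReplicas {
		return maxReplicas
	}
	return n
}

func main() {
	// Hypothetical numbers: 100 messages per replica, at most 20 replicas.
	for _, backlog := range []int{0, 1, 250, 10000} {
		fmt.Printf("backlog=%d replicas=%d\n", backlog, desiredReplicas(backlog, 100, 20))
	}
}
```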

Note: Product UI screenshots are internal and pending clearance — the architecture above is the cleared view. Internal UI captures will be added here once approved.