Cloud Design Patterns: A Practical Tour

Maybe you’ve had a different experience, but I’ve never once looked at a system and said, “This is perfect, we’re done here.” There is always a new requirement, a shifting bottleneck, or a sudden spike in traffic. You can’t build a distributed system without at least one of your services eventually collapsing under the weight of its own dependencies. It’s a rite of passage. This is a tour of the patterns I keep reaching for across backend systems — written for engineers new to distributed architecture. We’ll start with the one that usually forces you to learn about all the others.

Circuit Breaker: The One That Gets You Here

I’m making this scenario up entirely just to explain Circuit Breakers and some other cloud design patterns, but if you’ve been in the industry long enough, you know it’s basically true. A company runs a high-traffic telemetry service that tracks sessions and feature usage. Every incoming session needs enrichment (user plan tier, region, feature flags, etc.) from an internal metadata API. Normally it responds in 5ms. No circuit breaker.

One deploy later, the metadata API responds in 800ms. The telemetry service has 200 threads. Each thread now blocks for 800ms instead of 5ms. Throughput drops by over 150x. The thread pool queues fill up. This is classic thread exhaustion. The connection pool to PostgreSQL (where sessions are stored) is also exhausted — threads don’t release connections while waiting for enrichment. Now session writes fail. Session reads fail. Even the health check (SELECT 1) fails. The load balancer starts routing traffic to the remaining healthy instances, which get crushed faster. Cascading failure. The entire telemetry service is gone.
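The arithmetic behind that drop is worth doing once. This is a back-of-the-envelope sketch that assumes the enrichment call dominates each request and that every thread blocks for its full duration; the function name is mine, not something from a real library.

```python
def max_throughput(threads: int, latency_s: float) -> float:
    """Upper bound on requests/second when each thread blocks for latency_s per request."""
    return threads / latency_s


healthy = max_throughput(200, 0.005)   # 5ms enrichment: about 40,000 req/s
degraded = max_throughput(200, 0.800)  # 800ms enrichment: about 250 req/s
slowdown = healthy / degraded          # about 160x

print(f"{healthy:.0f} req/s -> {degraded:.0f} req/s ({slowdown:.0f}x drop)")
```

Same 200 threads, same hardware. The only thing that changed is how long each thread sits waiting.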

The punchline: the telemetry dashboards (active sessions, used features, current error rates, and especially latency) ran on the telemetry service itself. When the service died, the dashboards died too. The on-call engineer had to SSH into boxes and read raw log files (like a sysadmin from back when I first started in this industry) while the product manager panicked in Slack: “Did billable sessions just drop to zero, or is the dashboard broken?”

Post-mortem fix: a circuit breaker wrapping the enrichment API call. Open the circuit when latency exceeds 200ms or error rate hits 5%. Return "enrichment": "unknown" immediately instead of blocking. The telemetry service stays up, dashboards stay live, and you see exactly what went wrong in real time.
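Here is a minimal sketch of what such a breaker could look like. The 200ms and 5% thresholds come from the post-mortem above; the window size, the minimum call count, the cooldown, and the class itself are my assumptions, not a specific library's API. It also only measures latency after the call returns, so you would still pair it with a real timeout on the call.

```python
import time
from collections import deque


class CircuitBreaker:
    """Sketch: trips on the share of slow or failed calls in a sliding window."""

    def __init__(self, latency_limit_s=0.2, failure_ratio=0.05, window=100,
                 min_calls=20, cooldown_s=30.0, clock=time.monotonic):
        self.latency_limit_s = latency_limit_s  # 200ms, from the post-mortem
        self.failure_ratio = failure_ratio      # 5%, from the post-mortem
        self.min_calls = min_calls              # assumed: don't trip on one bad call
        self.cooldown_s = cooldown_s            # assumed: how long to stay open
        self.clock = clock
        self.outcomes = deque(maxlen=window)    # True means slow or failed
        self.opened_at = None                   # None means the circuit is closed

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown_s:
                return fallback  # open: answer immediately, skip the dependency
            # Cooldown elapsed: half-open, let this call through as a probe.
        start = self.clock()
        try:
            result = fn()
            failed = self.clock() - start > self.latency_limit_s
        except Exception:
            result, failed = fallback, True
        self._record(failed)
        return result

    def _record(self, failed):
        if self.opened_at is not None:
            # Probe outcome decides: close on success, restart the cooldown on failure.
            self.outcomes.clear()
            self.opened_at = self.clock() if failed else None
            return
        self.outcomes.append(failed)
        enough = len(self.outcomes) >= self.min_calls
        if enough and sum(self.outcomes) / len(self.outcomes) >= self.failure_ratio:
            self.opened_at = self.clock()
```

In the telemetry service you would wrap the enrichment call in `breaker.call(...)` and pass `{"enrichment": "unknown"}` as the fallback. The injectable `clock` is only there so the behavior can be tested without sleeping.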

They built a telemetry system so sophisticated it could tell them everything — except why it was down.

Where Circuit Breaker Fits

It’s in the Resilience category alongside:

Retry: automatically retry on transient failures. But pair it with exponential backoff + jitter, or your retry storm will turn a 30-second blip into a 5-minute outage.
Bulkhead: isolate thread pools per dependency. One slow service shouldn’t eat your entire thread budget.
Timeout: just set one. Seriously. A missing timeout is how a 100ms Redis call can hold a thread hostage for 30 seconds.
Health Endpoint Monitoring: expose /health and /ready, but make sure they don’t depend on the same resources your main endpoints use (or you’ll have the telemetry problem all over again).
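To make the Retry advice concrete, here is a small sketch of exponential backoff with full jitter. The function name, the defaults, and the choice of which exceptions count as transient are all assumptions for illustration; tune them for your own dependency.

```python
import random
import time


def retry_with_backoff(fn, max_attempts=5, base_delay_s=0.1, max_delay_s=5.0,
                       retry_on=(ConnectionError, TimeoutError),
                       sleep=time.sleep, rng=random.random):
    """Call fn, retrying transient failures with capped exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # The cap doubles each attempt: 0.1s, 0.2s, 0.4s, ... up to max_delay_s.
            cap = min(max_delay_s, base_delay_s * 2 ** attempt)
            # Full jitter: sleep a random amount in [0, cap) so clients don't retry in lockstep.
            sleep(rng() * cap)
```

The jitter is the part people skip. Without it, every client that failed at the same moment retries at the same moment, which is exactly the retry storm described above.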

Idempotency: The Reason Retry Is Safe

Retry only works if calling twice is the same as calling once. Otherwise your retry turns a transient blip into a double-charge, a duplicate email, or in the case of our telemetry service, drastically inflated metrics.

Let’s go back to our telemetry service. A mobile client sends a session_started event. The service takes 2 seconds to respond. The mobile client times out at 1.5s and retries. The first request actually completed server-side; the client just never got the response. Now you’ve recorded two sessions for one user. The product manager is thrilled with the sudden 2x growth in active users, right up until they realize it’s garbage data.

sequenceDiagram
    participant C as Mobile Client
    participant S as Telemetry Service
    C->>S: Record Session (idempotency_key=abc)
    S->>S: Process & Save... (slow)
    Note over C: Client times out at 1.5s
    C->>S: Record Session (idempotency_key=abc)
    Note over S: Key seen before — return cached result
    S-->>C: Session recorded (once)

Ideally, your first line of defense lives in the client SDK, which should be designed to avoid generating duplicate session events. But when a network drops and a retry is forced anyway, server-side idempotency is your safety net. You implement this using an idempotency key: a client-supplied token, such as a UUID generated once per logical operation (here, when the app opened), that the server uses to dedupe. Same key, same result, no matter how many times the request lands. Stripe’s Idempotency-Key header for payments is the most famous real-world example of this.
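Server-side, the dedupe can be sketched roughly like this. It is an in-memory toy: the class name is hypothetical, the dict stands in for whatever durable store you would really use (a database unique constraint on the key is a common choice), and holding one lock around the handler is a simplification that serializes all requests.

```python
import threading


class IdempotentHandler:
    """Sketch: run the handler once per idempotency key, replay the stored result after."""

    def __init__(self, handler):
        self._handler = handler
        self._results = {}             # idempotency_key -> stored result
        self._lock = threading.Lock()  # simplification: one global lock

    def handle(self, idempotency_key, payload):
        with self._lock:
            if idempotency_key in self._results:
                # Key seen before: return the cached result, do not run the handler again.
                return self._results[idempotency_key]
            result = self._handler(payload)
            self._results[idempotency_key] = result
            return result
```

A real version would also expire old keys and decide what to do when the same key arrives with a different payload.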

Idempotency isn’t a pattern you bolt on later. It’s a property you design in from the first endpoint:

| Operation | Safe by default? | How to make idempotent |
| --- | --- | --- |
| `GET /sessions/42` | Yes, reads are naturally idempotent | Nothing to do |
| `PUT /sessions/42 {region: "eu-west-1"}` | Yes, same input, same final state | Nothing to do |
| `POST /sessions` | No, creates a new session every call | Idempotency key |
| `DELETE /sessions/42` | Mostly, second delete is a no-op | Check exists -> no-op on miss |

Note: The second DELETE might return a 404 Not Found instead of a 200/204, but the system state safely remains the same.

The rule: if the operation mutates state, isn’t naturally idempotent, and the client might call twice, it needs a key. Retry is only safe on top of idempotency. The Saga compensations, the Pub/Sub at-least-once delivery, the CQRS event replays — every one of them assumes you can call the same thing twice without breaking the world.

Saga: Distributed Transactions, The Honest Way

I want to use a different story for this part because telemetry data is usually fire-and-forget or append-only, not incremental. Let’s say you have an order service, a payment service, and an inventory service. A customer places an order. You charge their card. You decrement stock. If step 2 succeeds but step 3 fails, you’ve charged a customer for something you can’t ship.

A saga breaks this into a sequence of local transactions, each with a compensating action:

sequenceDiagram
    participant O as Order Service
    participant P as Payment Service
    participant I as Inventory Service
    O->>O: Create order (pending)
    O->>P: Charge card
    Note over P: Compensation: Refund
    P-->>O: Payment confirmed
    O->>I: Decrement stock
    Note over I: Compensation: Restock
    I-->>O: Stock decremented
    O->>O: Mark order (confirmed)

Two flavors:

Choreography: each service emits events, others react. Simple, but hard to trace.
Orchestration: a coordinator tells each service what to do (shown above). More code, but you can sleep at night.

If you are building an orchestration Saga today, I highly recommend looking into a platform like Temporal. It handles the complex state-machine logic, retries, and persistence so you don’t have to build the coordinator yourself.
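If you want to see the core idea without a platform, here is a deliberately tiny orchestrator sketch. `run_saga` and the step tuples are hypothetical names of mine; a real coordinator would persist its progress, retry compensations, and assume each compensation is idempotent.

```python
from typing import Callable, List, Tuple

# (step name, action, compensating action)
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> bool:
    """Run steps in order; on failure, undo the completed ones in reverse order."""
    completed = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            # The failed step made no change we track, so only earlier steps are undone.
            for _done_name, undo in reversed(completed):
                undo()
            return False
        completed.append((name, compensate))
    return True
```

With steps for "charge card" (compensation: refund) and "decrement stock" (compensation: restock), a failure in the stock step triggers the refund and nothing else.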

CQRS: Stop Reading From The Same Table You’re Writing To

Your telemetry service writes sessions at 10k/sec. Your dashboard queries session counts by region. The query scans millions of rows. The query blocks the write. The write queues up. The dashboard refreshes and triggers another scan.

CQRS (Command Query Responsibility Segregation) says: use one model for writes, a separate one for reads.

flowchart LR
    W[Write Path] --> P[(Primary DB<br/>normalized)]
    P -.->|events| M[(Materialized View<br/>denormalized)]
    R[Read Path] --> M

Sync them via events. Your writes stay fast. Your dashboards stay snappy. The price is eventual consistency — your read model is always a few seconds behind. For dashboards, nobody notices. For billing, you might need something tighter.
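A toy version of the split, assuming an in-process queue stands in for the event stream and plain dicts stand in for the two databases. All class and function names here are made up for illustration.

```python
from collections import Counter, deque


class SessionWriteModel:
    """Write side: stores the session and emits an event describing the change."""

    def __init__(self, event_queue):
        self.sessions = {}
        self.events = event_queue

    def record_session(self, session_id, region):
        self.sessions[session_id] = {"region": region}
        self.events.append(("session_recorded", session_id, region))


class SessionCountsReadModel:
    """Read side: a denormalized count per region, updated only from events."""

    def __init__(self):
        self.by_region = Counter()

    def apply(self, event):
        kind, _session_id, region = event
        if kind == "session_recorded":
            self.by_region[region] += 1


def project_pending(event_queue, read_model):
    """Drain queued events into the read model (runs asynchronously in real systems)."""
    while event_queue:
        read_model.apply(event_queue.popleft())
```

Until `project_pending` runs, the read model still shows the old counts. That gap is the eventual consistency you are paying for.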

Event Sourcing: Storage, But Make It Append-Only

Instead of storing the current state, store every event that led to it. “User changed email from a@b to c@d” instead of “User.email = c@d”.

Benefits:

Audit log for free: every state change is recorded
Time travel: rebuild state at any point in time
Debugging: replay events to reproduce bugs

Cost:

Storage grows forever (snapshot periodically)
Reading current state requires replaying events (snapshot + remaining events)
Everyone on the team needs to understand event sourcing
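The replay and snapshot mechanics fit in a few lines. This sketch uses the email example from above; the event class and function names are mine, and a real event store would add versions, timestamps, and persistence.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class EmailChanged:
    user_id: str
    old: str
    new: str


def apply(state: Dict[str, str], event: EmailChanged) -> Dict[str, str]:
    """Return a new state with the event applied; the input state is not mutated."""
    new_state = dict(state)
    new_state[event.user_id] = event.new
    return new_state


def replay(events: Iterable[EmailChanged],
           snapshot: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Rebuild state from a snapshot (or from nothing) plus the events after it."""
    state = dict(snapshot or {})
    for event in events:
        state = apply(state, event)
    return state
```

Replaying a prefix of the log gives you the state at that point in time. Replaying from a snapshot plus the remaining events gives the same result as replaying everything, just faster.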

Materialized View: Cheat Codes For Queries

Your dashboard needs: “active sessions in the last 24 hours, grouped by plan tier, filtered by region, with a 7-day trend.”

Running that SQL on the raw session table will perform a table scan that wakes up the DBA at 3 AM.

Build a materialized view — a table that pre-computes the result and refreshes periodically (or on every event, if you can afford it).

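-- Hourly session counts per plan tier and region over the trailing 7 days.
-- NOW() is evaluated at refresh time, so the window only moves when the view is refreshed.
CREATE MATERIALIZED VIEW session_stats AS
SELECT
  plan_tier,
  region,
  COUNT(*) AS active_sessions,
  DATE_TRUNC('hour', created_at) AS hour
FROM sessions
WHERE created_at > NOW() - INTERVAL '7 days'
GROUP BY plan_tier, region, DATE_TRUNC('hour', created_at);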

Query the materialized view instead. 10ms instead of 30 seconds. Your DBA stays asleep.

Sharding: When One Database Isn’t Enough

Your session table has 2 billion rows. Queries are slow. Indexes are huge. You’ve tried everything.

Sharding splits data across multiple databases by a shard key (e.g., user_id % 16). Each shard holds ~125M rows.

Shard 0: user_id % 16 == 0
Shard 1: user_id % 16 == 1
...
Shard 15: user_id % 16 == 15

Trade-offs:

Queries without the shard key hit every shard (scatter-gather)
Resharding when you need to grow is painful
Transactions across shards are hard (hello again, Saga)
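The first trade-off is easiest to see in code. In this sketch each shard is just a list of dicts; the function names and the row shape are assumptions for illustration.

```python
NUM_SHARDS = 16


def shard_for(user_id: int) -> int:
    """Modulo routing, as in the shard layout above."""
    return user_id % NUM_SHARDS


def sessions_for_user(shards, user_id):
    """Query with the shard key: touches exactly one shard."""
    return [s for s in shards[shard_for(user_id)] if s["user_id"] == user_id]


def sessions_in_region(shards, region):
    """Query without the shard key: scatter to every shard, gather the results."""
    results = []
    for shard in shards:
        results.extend(s for s in shard if s["region"] == region)
    return results
```

The per-user lookup costs one shard no matter how many shards you have. The per-region lookup costs all of them, and it only gets slower as you add more.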

Pub/Sub: Stop Pointing HTTP At Everything

Your telemetry service ingests sessions and needs to notify billing, analytics, and the user activity feed. If you call them directly via HTTP:

One slow subscriber blocks ingestion
If the analytics service is down, does the session write fail?
You’re coupling ingestion throughput to the slowest subscriber

A pub/sub broker (Redis Streams, RabbitMQ, Kafka, GCP Pub/Sub — anything with message persistence) decouples them:

flowchart TD
    I[Ingestion] --> EB[Event Bus]
    EB --> B[Billing]
    EB --> A[Analytics]

Each subscriber reads at its own pace. If analytics falls behind, ingestion stays fast. If billing is down, the event stays in the queue for later.
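Here is the decoupling in miniature. This broker is an in-memory stand-in I made up for illustration: it gives each subscriber its own queue, but it has none of the persistence that the real brokers listed above provide.

```python
from collections import deque


class Broker:
    """Sketch: fan each published event out to one queue per subscriber."""

    def __init__(self):
        self._queues = {}

    def subscribe(self, name):
        self._queues.setdefault(name, deque())

    def publish(self, event):
        # Publishing never waits on a subscriber; it only appends to queues.
        for queue in self._queues.values():
            queue.append(event)

    def poll(self, name, max_events=10):
        """Each subscriber pulls at its own pace."""
        queue = self._queues[name]
        batch = []
        while queue and len(batch) < max_events:
            batch.append(queue.popleft())
        return batch
```

If analytics never calls `poll`, its queue grows, but `publish` stays exactly as fast and billing is unaffected.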

BFF: Your API Shouldn’t Be Everything For Everyone

You have a mobile app, a web app, and a CLI. They all hit the same API. The mobile app needs small payloads (bandwidth is expensive). The web app needs rich data. The CLI needs machine-friendly output.

A Backend For Frontend (BFF) gives each client its own backend:

flowchart LR
    Mobile[Mobile App] --> MBFF[Mobile BFF]
    Web[Web App] --> WBFF[Web BFF]
    CLI --> CBFF[CLI BFF]
    MBFF --> SS[(Shared Services<br/>Auth, Users, Sessions)]
    WBFF --> SS
    CBFF --> SS

Each BFF is optimized for its client: pagination, payload shape, caching strategy. The shared services stay generic. The alternative is a general-purpose API with query parameters that simulate having multiple backends, and a codebase nobody wants to touch.

Gateway: The Bouncer

A gateway sits at the edge of your system and handles cross-cutting concerns so your services don’t have to:

Routing (path → service)
Authentication
Rate limiting
Request/response transformation
Aggregation (call multiple services, combine results)

Popular implementations: Kong, Envoy, APISIX, or rolling your own with a reverse proxy config.

The risk: the gateway becomes a single point of failure and a monolith in its own right. Keep it stateless and horizontally scalable.

Sidecar and Ambassador: Putting Helpers Next to Your Service

A sidecar is a helper process deployed alongside your main container. It handles logging, metrics, and service mesh features, so your app only worries about business logic.

flowchart TB
    subgraph Pod
        Main[Main Container<br/>your app]
        Sidecar[Sidecar Container<br/>log forwarder, metrics exporter]
    end

An ambassador is a proxy between your service and the outside world. It handles retries, circuit breaking, and service discovery, so your app doesn’t need to know those exist.

The best-known implementation is Envoy as a sidecar in Istio. Your app makes localhost calls; Envoy handles the rest.

Strangler Fig: Killing The Monolith Without A Single Point Of Failure

You have a monolith. You want to migrate to microservices. You can’t rewrite it all at once.

The strangler fig pattern routes features to the new system one by one:

flowchart TD
    Client --> Router[Reverse Proxy]
    Router --> Monolith[Monolith<br/>old auth, old everything else]
    Router --> MS[Microservices<br/>new auth]

When POST /login is migrated, the router sends it to the new auth service. Everything else still hits the monolith. Over months, the monolith shrinks until it’s gone. Named after the strangler fig vine that slowly envelops a tree and eventually replaces it entirely.
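The routing decision itself is tiny. In practice it lives in your reverse proxy config rather than in application code, so treat this as a sketch of the logic only; the route table and the backend names are hypothetical.

```python
# Routes that have been migrated so far. Everything else defaults to the monolith.
MIGRATED_ROUTES = {
    ("POST", "/login"): "auth-service",
}


def route(method: str, path: str, migrated=MIGRATED_ROUTES) -> str:
    """Return the backend that should serve this request."""
    return migrated.get((method, path), "monolith")
```

Migrating a feature means adding one entry to the table. Rolling it back means removing it, which is a large part of why this pattern is less scary than a rewrite.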

Leader Election: Someone Has To Be In Charge

You have three replicas of a scheduler that runs cron jobs. If all three run the same job, you triple-bill your customers and send three identical emails. If zero run it, your customers notice nothing happened.

Leader election ensures that at most one replica acts as the coordinator at any given time. The leader holds a lease (via etcd, ZooKeeper, or a database row with TTL) and renews it while it is healthy. Other replicas wait. If the leader dies, the lease expires and another replica takes over.

Consensus algorithms like Raft and Paxos handle this at a lower level for systems that need stronger guarantees.
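The lease mechanics can be sketched with an in-memory stand-in for the lease store. The class and its method are hypothetical. In a real store the check-and-set inside `try_acquire` has to be atomic, and a leader has to stop working once it can no longer renew.

```python
import time


class LeaseStore:
    """Sketch of a TTL lease: whoever holds an unexpired lease is the leader."""

    def __init__(self, ttl_s=10.0, clock=time.monotonic):
        self.ttl_s = ttl_s
        self.clock = clock
        self.holder = None
        self.expires_at = 0.0

    def try_acquire(self, replica_id) -> bool:
        """Acquire or renew the lease; returns True if replica_id is now the leader."""
        now = self.clock()
        free = self.holder is None or now >= self.expires_at
        if free or self.holder == replica_id:
            self.holder = replica_id
            self.expires_at = now + self.ttl_s
            return True
        return False
```

Each scheduler replica would call `try_acquire` on a timer shorter than the TTL and run cron jobs only while it returns True.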

Which Pattern Reaches For Which Problem?

The answer is almost always “it depends,” but here’s a heuristic:

| You have this problem | Try this pattern |
| --- | --- |
| A slow dependency kills your entire service | Circuit Breaker, Timeout, Bulkhead |
| A transient failure loses data | Retry + backoff |
| One bad consumer blocks everyone | Bulkhead, Queue (Pub/Sub) |
| You need cross-service transactions | Saga |
| Your read queries are killing your writes | CQRS, Materialized View |
| One database can’t handle the load | Sharding |
| Your monolith is too big | Strangler Fig |
| You have cross-cutting concerns (auth, routing) | Gateway |
| Your API is bloated for all clients | BFF |
| You need an audit trail | Event Sourcing |
| Multiple instances need a single coordinator | Leader Election |

The Meta Pattern

Most of these patterns solve one problem and create another:

Circuit breaker prevents cascading failures, but you need health checking to know when to close it.
Saga gives you distributed transactions, but adds compensating logic complexity.
CQRS improves performance, but introduces eventual consistency.
Sharding scales writes, but makes cross-shard queries painful.

A distributed system is a series of trade-offs connected by more trade-offs. The patterns don’t eliminate complexity, they move it to where you can manage it.

That’s why understanding the patterns matters. Not so you can build the perfect system (you won’t), but so when the telemetry service eats itself at 3 AM, you know which lever to pull.

These are the patterns that have kept me the busiest over the years. Next time, we’ll pick a specific topic (like rate limiting) and investigate it together through a real-world scenario.
