Mar 15, 2025 · 3 min read

Putting Out Fires at 100k Users

Backend · Scaling · WebSocket · Redis · Infrastructure

Scaling isn't a plan you execute calmly. It's a sequence of fires, and you learn the architecture by putting them out. This is the order in which the fires came at ACS Future School as we grew past 100,000 users, and what each one forced me to build.

Fire 1 — the database under brute force

The first hit came at the database. Brute-force traffic plus a surge of very real, very organic users meant the DB was the bottleneck and the target at the same time. I had to act fast and scale the database for the actual user load first: more capacity, tuned connection settings, and optimized versions of the queries that mattered, just to keep real students served while the noise hammered at the door.

Fire 2 — closing the door

Scaling the DB bought time; it didn't stop the abuse. So I started hardening the edge: firewalls, Cloudflare protection in front of the origin, and rate limiting on the endpoints that were being pounded. The goal was simple — make sure capacity went to people learning, not to scripts probing.
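To make the rate-limiting idea concrete, here is a minimal sketch of a token bucket keyed by client. The post doesn't say which algorithm or tool was used, so the class name `RateLimiter`, its parameters, and the numbers are all assumptions for illustration.

```javascript
// Token-bucket rate limiter keyed by client (for example, an IP address).
// Illustrative sketch: names and limits are assumptions, not the post's actual setup.
class RateLimiter {
  constructor({ capacity, refillPerSecond, now = () => Date.now() }) {
    this.capacity = capacity;               // maximum burst size
    this.refillPerSecond = refillPerSecond; // sustained requests per second
    this.now = now;                         // injectable clock, useful in tests
    this.buckets = new Map();               // key -> { tokens, last }
  }

  // Returns true if the request may proceed, false if it should be rejected.
  allow(key) {
    const t = this.now();
    const bucket = this.buckets.get(key) ?? { tokens: this.capacity, last: t };
    // Refill in proportion to elapsed time, never above capacity.
    const elapsedSeconds = (t - bucket.last) / 1000;
    bucket.tokens = Math.min(
      this.capacity,
      bucket.tokens + elapsedSeconds * this.refillPerSecond
    );
    bucket.last = t;
    const allowed = bucket.tokens >= 1;
    if (allowed) bucket.tokens -= 1;
    this.buckets.set(key, bucket);
    return allowed;
  }
}

const limiter = new RateLimiter({ capacity: 5, refillPerSecond: 1 });
// Express-style usage (sketch):
// app.use((req, res, next) => limiter.allow(req.ip) ? next() : res.status(429).end());
```

A limiter like this lives in one process's memory and never evicts idle keys, so it is only a starting point. Once there are several app instances, each would keep its own counts unless the buckets move to a shared store.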

Fire 3 — Express servers running out of breath

With the edge protected and the DB holding, the next thing to buckle was the application tier. The Express.js servers were getting exhausted under concurrency. A single Node.js process runs its JavaScript on one thread, so there is a hard ceiling on what one instance can handle.

The fix was to stop relying on one process. I put a load balancer in front and spread traffic across multiple app instances — a simple round-robin with Caddy. Caddy gave me HTTPS and load balancing in one clean config, and suddenly the app tier had room to breathe.
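Caddy handles the balancing itself, so none of this needs to be hand-written. Purely to show the selection rule behind round-robin, here is a small sketch; `createRoundRobin` and the upstream addresses are made-up placeholders.

```javascript
// Round-robin picker: hands out upstreams in a fixed, rotating order.
// Illustrative only; the upstream addresses are hypothetical.
function createRoundRobin(upstreams) {
  if (upstreams.length === 0) {
    throw new Error("need at least one upstream");
  }
  let next = 0; // index of the upstream that receives the next request
  return function pick() {
    const chosen = upstreams[next];
    next = (next + 1) % upstreams.length; // wrap around after the last one
    return chosen;
  };
}

const pick = createRoundRobin(["app-1:3000", "app-2:3000", "app-3:3000"]);
const firstSix = Array.from({ length: 6 }, () => pick());
console.log(firstSix);
```

Each instance gets every third request regardless of how busy it is. That is the trade-off of plain round-robin: it is simple and even on average, but it does not react to load.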

Fire 4 — 8,000 people in a live class, and the sockets die

Then came the real disaster. Live classes meant WebSockets, and when roughly 8,000 users were connected at once, the WebSocket layer crashed. In a live class that means nobody can comment, nobody can interact — the exact moment the product is supposed to shine, it goes dark.

The problem was state. A naive WebSocket server keeps every connection's session in its own memory, so the moment you run more than one instance, they don't know about each other — and one instance can't hold everyone.

Two changes fixed it:

A shared Redis to hold session/connection state, so any instance can find any user instead of each process hoarding its own map.
Splitting the socket workload across multiple servers behind the same load-balancer approach I'd already used for the app tier — many socket servers, one shared source of truth.

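// Assumption: ioredis-style clients (the post doesn't name the library).
import Redis from "ioredis";

const redis = new Redis(); // publisher connection
const sub = new Redis();   // a subscribed connection can't publish, so use a second one
const localConnections = new Map(); // userId -> WebSocket held by this instance

// publish from anywhere; whichever server holds the socket delivers it
await redis.publish("live-class", JSON.stringify({ userId, payload }));

await sub.subscribe("live-class");
sub.on("message", (_channel, msg) => {
  const { userId, payload } = JSON.parse(msg);
  localConnections.get(userId)?.send(JSON.stringify(payload));
});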
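The snippet above depends on a running Redis. To watch the same pattern work end to end without one, here is a self-contained sketch with two simulated socket servers. The `Bus` class is an in-memory stand-in for a Redis pub/sub channel, and `SocketServer` and `fakeSocket` are hypothetical names introduced only for this illustration.

```javascript
// In-memory stand-in for a Redis pub/sub channel, so the sketch runs on its own.
class Bus {
  constructor() {
    this.handlers = new Map(); // channel -> list of subscriber callbacks
  }
  subscribe(channel, handler) {
    const list = this.handlers.get(channel) ?? [];
    list.push(handler);
    this.handlers.set(channel, list);
  }
  publish(channel, message) {
    for (const handler of this.handlers.get(channel) ?? []) handler(message);
  }
}

// One socket server instance: it only knows the sockets connected to it.
class SocketServer {
  constructor(name, bus) {
    this.name = name;
    this.localConnections = new Map(); // userId -> socket-like object
    bus.subscribe("live-class", (msg) => {
      const { userId, payload } = JSON.parse(msg);
      // Only the instance holding this user's socket delivers; others do nothing.
      this.localConnections.get(userId)?.send(JSON.stringify(payload));
    });
  }
  connect(userId, socket) {
    this.localConnections.set(userId, socket);
  }
}

// Fake socket that records whatever it is sent.
const fakeSocket = () => ({
  received: [],
  send(data) {
    this.received.push(data);
  },
});

const bus = new Bus();
const serverA = new SocketServer("A", bus);
const serverB = new SocketServer("B", bus);
const alice = fakeSocket();
const bob = fakeSocket();
serverA.connect("alice", alice);
serverB.connect("bob", bob);

// The publisher does not need to know which server holds bob's socket.
bus.publish("live-class", JSON.stringify({ userId: "bob", payload: { text: "hi" } }));
console.log(bob.received, alice.received);
```

Every instance sees every message and drops the ones it can't deliver. That keeps the servers ignorant of each other, at the cost of each one parsing traffic meant for the rest.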

With state in Redis and the load spread out, the live classes held — 8,000+ concurrent students commenting in real time, no meltdown.

The lesson that stuck: every tier fails in turn, and the fix is almost always the same shape — stop depending on a single process, move shared state out of memory, and put a balancer in front. You don't design that on a whiteboard up front. You earn it, one fire at a time.
