Mar 15, 2025 · 3 min read

Putting Out Fires: Scaling an EdTech Backend to 100k Users

Backend · Scaling · WebSocket · Redis · Infrastructure

Scaling isn't a plan you execute calmly. It's a sequence of fires, and you learn the architecture by putting them out. This is the order the fires came at ACS Future School as we grew past 100,000 users — and what each one forced me to build.

Fire 1 — the database under brute force

The first hit came at the database. Brute-force traffic plus a surge of very real, very organic users meant the DB was the bottleneck and the target at the same time. I had to act fast and scale the database for the actual user load first: more capacity, tuned connections, and the queries that mattered most optimized, just to keep real students served while the noise hammered the door.

Fire 2 — closing the door

Scaling the DB bought time; it didn't stop the abuse. So I started hardening the edge: firewalls, Cloudflare protection in front of the origin, and rate limiting on the endpoints that were being pounded. The goal was simple — make sure capacity went to people learning, not to scripts probing.
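
To make the rate-limiting idea concrete, here is a minimal sketch of a fixed-window limiter with an Express-style middleware wrapper. It is not the limiter that ran in production; the names `createRateLimiter`, `limit`, and `windowMs` are illustrative assumptions, and keying on `req.ip` is just one possible choice.

```javascript
// Fixed-window rate limiter: at most `limit` requests per `windowMs` per key.
// All names here (createRateLimiter, limit, windowMs) are hypothetical.
function createRateLimiter({ limit, windowMs, now = () => Date.now() }) {
  const windows = new Map(); // key (e.g. client IP) -> { start, count }

  function allow(key) {
    const t = now();
    const w = windows.get(key);
    if (!w || t - w.start >= windowMs) {
      windows.set(key, { start: t, count: 1 }); // open a fresh window
      return true;
    }
    if (w.count < limit) {
      w.count += 1;
      return true;
    }
    return false; // over the limit for this window
  }

  // Express-style middleware: reject with 429 once the key is over its limit.
  function middleware(req, res, next) {
    if (allow(req.ip)) return next();
    res.status(429).send("Too Many Requests");
  }

  return { allow, middleware };
}
```

The counters in this sketch live in one process's memory, so once several app instances sit behind a balancer each would count separately; a shared store would be needed for a global limit.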

Fire 3 — Express servers running out of breath

With the edge protected and the DB holding, the next thing to buckle was the application tier. The Express.js servers were getting exhausted under concurrency: a single Node.js process runs its JavaScript on one thread, so one instance can only do so much.

The fix was to stop relying on one process. I put a load balancer in front and spread traffic across multiple app instances — a simple round-robin with Caddy. Caddy gave me HTTPS and load balancing in one clean config, and suddenly the app tier had room to breathe.
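
In Caddy, round-robin comes from listing several upstreams on a `reverse_proxy` directive with the `round_robin` load-balancing policy. What that policy does can be sketched in a few lines of JavaScript. The function name `createRoundRobin` and the upstream addresses are placeholders, not the real deployment.

```javascript
// Round-robin selection over a fixed list of upstreams.
// Addresses below are hypothetical placeholders.
function createRoundRobin(upstreams) {
  if (upstreams.length === 0) throw new Error("need at least one upstream");
  let next = 0; // index of the upstream that gets the next request
  return function pick() {
    const target = upstreams[next];
    next = (next + 1) % upstreams.length; // wrap around at the end of the list
    return target;
  };
}

const pick = createRoundRobin(["localhost:3001", "localhost:3002", "localhost:3003"]);
```

Each call to `pick()` hands back the next instance in turn, so no single process takes every request.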

Fire 4 — 8,000 people in a live class, and the sockets die

Then came the real disaster. Live classes meant WebSockets, and when roughly 8,000 users were connected at once, the WebSocket layer crashed. In a live class that means nobody can comment, nobody can interact — the exact moment the product is supposed to shine, it goes dark.

The problem was state. A naive WebSocket server keeps every connection's session in its own memory, so the moment you run more than one instance, the instances know nothing about each other's connections, and a single instance can't hold everyone.

Two changes fixed it:

  1. A shared Redis to hold session/connection state, so any instance can find any user instead of each process hoarding its own map.
  2. Splitting the socket workload across multiple servers behind the same load-balancer approach I'd already used for the app tier — many socket servers, one shared source of truth.
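
// Assumes the ioredis client; a subscribed connection can't publish, hence two clients.
import Redis from "ioredis";
const redis = new Redis(); // publisher
const sub = new Redis();   // subscriber
const localConnections = new Map(); // userId -> WebSocket held by this instance
await sub.subscribe("live-class");

// publish from anywhere; whichever server holds the socket delivers it
await redis.publish("live-class", JSON.stringify({ userId, payload }));

sub.on("message", (_channel, msg) => {
  const { userId, payload } = JSON.parse(msg);
  localConnections.get(userId)?.send(JSON.stringify(payload));
});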

With state in Redis and the load spread out, the live classes held — 8,000+ concurrent students commenting in real time, no meltdown.
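
The fan-out pattern can be tried without a Redis server. The sketch below swaps Redis for a tiny in-memory bus (`FakeBus`, a stand-in invented for this example) and runs two hypothetical socket server instances against it; only the instance that holds a user's socket delivers the message. The class names and the `student-42` id are assumptions for illustration.

```javascript
// In-memory stand-in for a Redis pub/sub channel, so the sketch runs without Redis.
class FakeBus {
  constructor() { this.handlers = []; }
  subscribe(handler) { this.handlers.push(handler); }
  publish(msg) { for (const h of this.handlers) h(msg); }
}

// One socket server instance: it only knows the sockets connected to it.
class SocketServer {
  constructor(name, bus) {
    this.name = name;
    this.localConnections = new Map(); // userId -> socket-like object with send()
    bus.subscribe((msg) => {
      const { userId, payload } = JSON.parse(msg);
      // Only the instance holding this user's socket delivers; the others skip.
      this.localConnections.get(userId)?.send(JSON.stringify(payload));
    });
  }
  connect(userId, socket) { this.localConnections.set(userId, socket); }
  disconnect(userId) { this.localConnections.delete(userId); }
}

const bus = new FakeBus();
const serverA = new SocketServer("A", bus);
const serverB = new SocketServer("B", bus);

// The student is connected to server B only.
const received = [];
serverB.connect("student-42", { send: (data) => received.push(data) });

// Any process can publish; it does not need to know which server holds the socket.
bus.publish(JSON.stringify({ userId: "student-42", payload: { text: "hello" } }));
```

Every instance sees every published message and drops the ones for users it does not hold, which keeps the publisher simple at the cost of some wasted work per message.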


The lesson that stuck: every tier fails in turn, and the fix is almost always the same shape — stop depending on a single process, move shared state out of memory, and put a balancer in front. You don't design that on a whiteboard up front. You earn it, one fire at a time.
