Jun 6, 2026 · 11 min read

Catching a Container in the Act with eBPF

eBPF · Kubernetes · Security · Thesis

How a year of kernel tracing, broken signatures, and 700 simulated attacks taught me that the honest answer is usually more interesting than the clean one.

When I tell people my undergraduate thesis was about "detecting container escape and lateral movement in Kubernetes via graph-correlated eBPF syscall monitoring," their eyes tend to glaze over somewhere around "eBPF." So let me start with the part that actually matters.

Containers are everywhere now. They start in milliseconds, pack densely onto a single machine, and power most of what you'd recognise as modern cloud software. But there's a catch built into their design: unlike a virtual machine, a container does not get its own kernel. Every container on a host shares the same running Linux kernel, separated only by namespaces and cgroups — a set of bookkeeping tricks, not a wall. A systematic review I leaned on early reports that something on the order of 80% of surveyed Kubernetes and Docker vulnerabilities involve privilege escalation. In other words: an attacker who gets a foothold in one container is often one clever syscall away from the host, or from quietly pivoting to the database next door.

My thesis partner and I set out to build something that could see that happening in real time. This is the story of how we got there, including all the parts that didn't go to plan.

The idea: watch the kernel, but only the parts that matter

The default tool for syscall-level monitoring on Linux is auditd. It works, and researchers have built intrusion detectors on top of it. But the moment you try to use it as a serious Kubernetes IDS, it falls apart in two ways. First, it has no idea what a container is — its records carry a process ID and a username, but no cgroup, no namespace, no way to say "these two events came from the same container." Second, to watch file access across a cluster, it has to log every single openat system-wide, and on a busy node that flood overruns the kernel's audit buffer and silently drops records — including, on a bad day, the one attack event you cared about.

Our answer was eBPF: small, verified programs that run inside the kernel at syscall hooks. The key move is that you can filter, enrich, and discard events before they ever cross into user space. So our design had two halves:

A kernel data plane that hooks six event sources (the execve, open/openat, mount, setns, and unshare syscall families, plus the kernel function tcp_v4_connect), tags each event with the calling process's cgroup ID and namespace inode, and drops everything irrelevant before it leaves the kernel.
A user-space detection engine that takes the survivors and builds a live provenance graph — processes, files, and sockets as nodes; "opened," "executed," "connected to" as edges — and runs detection rules over the graph's shape.
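To make the second half concrete, here is a minimal sketch of the graph-building step in plain Python. The event fields (`cgroup_id`, `pid`, `kind`, `target`), the edge labels, and the sample values are illustrative assumptions rather than the thesis's actual schema, and dictionaries stand in for the NetworkX graph the prototype used.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One kernel event that survived in-kernel filtering (hypothetical schema)."""
    cgroup_id: int   # container identity tagged in the kernel
    pid: int
    kind: str        # "open", "exec", or "connect"
    target: str      # file path, binary path, or "ip:port"


# Map each event kind to its edge label and the type of node it points at.
EDGE_FOR_KIND = {
    "open": ("opened", "file"),
    "exec": ("executed", "file"),
    "connect": ("connected_to", "socket"),
}


class ProvenanceGraph:
    def __init__(self):
        # Adjacency list: process node -> list of (edge label, target node).
        self.edges = defaultdict(list)

    def add(self, ev: Event) -> None:
        label, node_type = EDGE_FOR_KIND[ev.kind]
        src = ("process", ev.cgroup_id, ev.pid)
        dst = (node_type, ev.target)
        self.edges[src].append((label, dst))

    def targets(self, cgroup_id: int, label: str) -> set:
        """All nodes reached by `label` edges from any process in one container."""
        return {
            dst
            for (_type, cg, _pid), out in self.edges.items()
            if cg == cgroup_id
            for (lbl, dst) in out
            if lbl == label
        }


# Made-up events from two containers.
graph = ProvenanceGraph()
graph.add(Event(4242, 17, "exec", "/bin/sh"))
graph.add(Event(4242, 18, "open", "/etc/hostname"))
graph.add(Event(4242, 18, "connect", "10.0.0.9:5432"))
graph.add(Event(9001, 3, "open", "/etc/hostname"))
```

Keying process nodes on the cgroup ID is what lets a rule ask "what did this container do?" rather than "what did this PID do?", which matters once a container spawns more than one process.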

That second half is where the actual contribution lives, and I'll come back to it. But first I have to be honest about how much of the year was spent simply making the thing work.

The messy middle: four bugs that taught me everything

You can read the final results table and imagine a smooth march to success. The reality was a sequence of confident assumptions getting demolished by the kernel. These were the four that mattered most.

The connection that was always going to nowhere. Our lateral-movement detection just... didn't fire. Every TCP connect event showed a destination of 0.0.0.0. It turned out we were reading the destination address from the socket struct at the moment the probe fired, but at kprobe entry the kernel hasn't performed the connection yet, so the field is still empty. The fix was to read the destination from the uaddr argument instead, which holds the address the calling process asked to connect to. One wrong field, and an entire category of detection was dead. That bug taught me never to trust that a value is populated just because the struct has a slot for it.
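The address that uaddr points to has a fixed layout for IPv4. As a rough illustration, this Python snippet decodes the raw bytes of a Linux `sockaddr_in`. The function name and the sample destination are made up for the example, and the real read happened inside the eBPF program, not in Python.

```python
import socket
import struct


def decode_sockaddr_in(raw: bytes):
    """Decode the first 8 bytes of a Linux IPv4 sockaddr_in into (ip, port).

    Layout: 2-byte address family (host byte order), 2-byte port
    (network byte order), 4-byte IPv4 address (network byte order).
    """
    (family,) = struct.unpack_from("=H", raw, 0)
    if family != socket.AF_INET:
        raise ValueError(f"not an IPv4 sockaddr: family={family}")
    (port,) = struct.unpack_from("!H", raw, 2)
    ip = socket.inet_ntoa(raw[4:8])
    return ip, port


# Build a sample sockaddr_in for 10.0.0.5:6443 (made-up destination).
sample = (
    struct.pack("=H", socket.AF_INET)
    + struct.pack("!H", 6443)
    + socket.inet_aton("10.0.0.5")
    + b"\x00" * 8  # zero padding up to the 16-byte struct size
)
```

The port is stored in network byte order, so reading it with the host's byte order on a little-endian machine gives a wrong number without raising any error.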

2,832 false positives per run. Our file-boundary rule was supposed to fire when a container reads a host file it shouldn't. Instead it fired thousands of times, because every container legitimately reads its own /etc/passwd at startup. We were flagging normal life. The fix was to scope the rule to genuine host-mount prefixes and add a whitelist for the paths containers touch routinely. This is the unglamorous heart of IDS work: a detector that cries wolf 2,832 times a run is worse than no detector at all.
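A minimal sketch of that scoping, assuming the host filesystem is bind-mounted under `/host`. The prefix tuple and the allowlisted paths are hypothetical placeholders, not the thesis's actual rule configuration.

```python
# Assumed for illustration: the host filesystem is bind-mounted under /host.
HOST_MOUNT_PREFIXES = ("/host/",)

# Hypothetical allowlist of host-mount paths that workloads read routinely.
ALLOWED_PATHS = frozenset({"/host/etc/hostname", "/host/etc/resolv.conf"})


def violates_file_boundary(path: str) -> bool:
    """Flag only reads that land under a host-mount prefix and are not allowlisted."""
    if not path.startswith(HOST_MOUNT_PREFIXES):
        # Container-local path, such as the container's own /etc/passwd.
        return False
    return path not in ALLOWED_PATHS
```

This sketch compares raw strings, so a real rule would also need to normalise paths (for example resolving `..` segments) before matching.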

The 171-millisecond stall. Our detection latency was terrible — a p95 over 3.5 seconds — and events were getting dropped from the ring buffer. The culprit was a single line: inside the hot event-processing loop, we called crictl inspect synchronously to resolve which Kubernetes Pod an event belonged to. That call cost 171.9 ms each time, and it blocked the entire poll loop. Moving that enrichment onto a background thread cut p95 latency by roughly 20×. The lesson — keep expensive, nice-to-have work off the critical path — is one I now apply reflexively.
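The pattern can be sketched with a queue and a worker thread. The class and method names are invented for illustration, and `fake_lookup` stands in for the real container-runtime query. The event loop only ever consults a cache, and the slow call runs elsewhere.

```python
import queue
import threading


class PodResolver:
    """Resolve cgroup ID -> Pod name off the hot path, with a cache."""

    def __init__(self, slow_lookup):
        self._slow_lookup = slow_lookup   # e.g. a wrapper around a runtime query
        self._cache = {}
        self._queued = set()
        self._lock = threading.Lock()
        self._pending = queue.Queue()
        threading.Thread(target=self._worker, daemon=True).start()

    def lookup(self, cgroup_id):
        """Non-blocking: return the cached Pod name, or None and enqueue a lookup."""
        with self._lock:
            if cgroup_id in self._cache:
                return self._cache[cgroup_id]
            if cgroup_id not in self._queued:
                self._queued.add(cgroup_id)
                self._pending.put(cgroup_id)
        return None

    def _worker(self):
        while True:
            cgroup_id = self._pending.get()
            pod = self._slow_lookup(cgroup_id)   # the expensive call lives here
            with self._lock:
                self._cache[cgroup_id] = pod
            self._pending.task_done()

    def wait_idle(self):
        """Block until every queued lookup has finished (handy in tests)."""
        self._pending.join()


def fake_lookup(cgroup_id):
    # Stand-in for the real runtime query; the mapping is invented.
    return f"pod-{cgroup_id}"


resolver = PodResolver(fake_lookup)
first = resolver.lookup(4242)    # cache miss: returns None immediately
resolver.wait_idle()
second = resolver.lookup(4242)   # cache hit once the worker has run
```

The trade-off is that the first events from a new container may be emitted without a Pod name and need to be labelled later, which is acceptable when the enrichment is nice-to-have rather than required for detection.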

The syscall we forgot existed. For a while, our Go-based test attacks were caught and our shell-based ones vanished. The reason: we only hooked openat, but plenty of attacker tooling (anything built on busybox) uses the legacy open syscall. Modern Go binaries use openat, so they showed up; everything else was invisible. We had to hook all three variants. It's a humbling kind of bug — the system was working perfectly on exactly the inputs we happened to test with.

Every one of these required a real code change, and every one of them is in the thesis as a documented bug with a root-cause analysis. I'm oddly proud of that table.

The discovery I didn't expect: a textbook signature that doesn't fire

The most satisfying moment of the whole project came from something failing. There's a well-known technique in the literature for detecting container escape via nsenter — the attacker jumps into the host's namespaces. The published signature watches for a process opening /proc/1/ns/mnt and then calling setns. We implemented it faithfully. It never fired.

Chasing why led somewhere genuinely new: modern nsenter (on recent versions of util-linux and glibc) no longer opens that namespace file at all. It uses pidfd_open to get a handle on the target process and then calls setns directly. The signature the literature relies on was watching for a behaviour that the current tooling simply doesn't exhibit anymore. Our fix was to detect the setns call directly from a container context, and that finding — that a standard detection signature has quietly gone stale — became one of the thesis's contributions in its own right.
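In rule form, the replacement check is short. The sketch below assumes each event carries the mount-namespace inode of the caller and that the agent knows the host's own inode at startup. The field names and the sample inode values are hypothetical.

```python
def is_container_setns(event: dict, host_mnt_ns_inode: int) -> bool:
    """True when a setns call comes from a process outside the host mount namespace."""
    return (
        event["syscall"] == "setns"
        and event["mnt_ns_inode"] != host_mnt_ns_inode
    )


HOST_INODE = 1000  # placeholder; a real agent would read this from the host at startup

from_container = {"syscall": "setns", "mnt_ns_inode": 2000}
from_host = {"syscall": "setns", "mnt_ns_inode": HOST_INODE}
unrelated = {"syscall": "openat", "mnt_ns_inode": 2000}
```

Because the rule keys on the setns call itself rather than on how the namespace handle was obtained, it does not depend on whether the tool opened a /proc namespace file or used pidfd_open.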

It reframed how I think about security tooling: detection assumptions decay. A signature that was correct against one generation of a utility can silently fail against the next, and nobody sends you a notification when it happens.

Proving it mattered: a head-to-head against the industry standard

It's one thing to build a detector. It's another to show it does something the tools people actually deploy can't. So we went after Falco, the most widely used eBPF runtime-security tool in the Kubernetes world and the de facto industry baseline, on identical terms: the same 700 attack executions (100 trials for each of 7 scenarios), the same cluster, the same scoring harness. And crucially, we gave Falco a fair configuration: not its stock rules (which, we found, were trivially evaded by our /host mount indirection), but the idiomatic, host-aware rules a competent operator would actually write.

The result was clean and narrow. On every single-event escape — reading a sensitive file, switching a namespace, executing a host binary — a well-configured Falco matched us at 100%. No contest, no contribution to claim there.

The entire difference was one scenario: the service-account-token exfiltration chain. The malicious signal there isn't any single event — it's a sequence: the same container that read its Kubernetes token then turns around and connects to the API server with it. Reading a token is completely normal (every pod does it at startup; we measured a naive rule firing 738 times on benign pods). The connect is normal. It's the join — "the same container did both" — that's the attack. Falco's detection model evaluates one event at a time, so it cannot express that join natively, at any correlation window. Its detection ceiling sat at 0.857; the missing 14.3% was exactly those token-exfil chains. Our provenance graph caught all of them, because the join on cgroup identity is precisely what the graph is for.
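The join can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the thesis's rule: the event schema is invented, the API server address is a placeholder, and the window value is arbitrary. A production rule would likely need further conditions to exclude Pods that talk to the API server legitimately.

```python
from dataclasses import dataclass

# Standard in-Pod mount path for the service-account token.
TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"
API_SERVER = "10.43.0.1:443"  # assumed address, for illustration only


@dataclass(frozen=True)
class Event:
    ts: float        # seconds since some epoch
    cgroup_id: int   # container identity
    kind: str        # "open" or "connect"
    target: str      # file path or "ip:port"


def token_exfil_alerts(events, api_server, window_s):
    """Return (cgroup_id, read_ts, connect_ts) for each container that reads its
    token and then connects to the API server within `window_s` seconds."""
    last_token_read = {}   # cgroup_id -> timestamp of the most recent token read
    alerts = []
    for ev in sorted(events, key=lambda e: e.ts):
        if ev.kind == "open" and ev.target == TOKEN_PATH:
            last_token_read[ev.cgroup_id] = ev.ts
        elif ev.kind == "connect" and ev.target == api_server:
            read_ts = last_token_read.get(ev.cgroup_id)
            if read_ts is not None and ev.ts - read_ts <= window_s:
                alerts.append((ev.cgroup_id, read_ts, ev.ts))
    return alerts


events = [
    Event(1.0, 1, "open", TOKEN_PATH),     # container 1 only reads the token
    Event(1.5, 2, "connect", API_SERVER),  # container 2 only connects
    Event(2.0, 3, "open", TOKEN_PATH),     # container 3 does both, in order
    Event(3.5, 3, "connect", API_SERVER),
]
alerts = token_exfil_alerts(events, API_SERVER, window_s=30.0)
```

Only container 3 produces an alert. The read alone and the connect alone each stay silent, which is the behaviour a one-event-at-a-time rule cannot reproduce.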

That's the thesis in one sentence: a per-event rule engine sees events; a provenance graph sees stories.

The part where I tell you what didn't work

If there's one thing the year taught me, it's that the honest results section is the one worth reading. So here's where the system fell short, stated as plainly as the wins.

We adopted a 100 ms detection-latency target from the evaluation framework we based our methodology on. We missed it — mean latency was around 635 ms, and the multi-step chain was slower still. I could have buried that, but the truth is more interesting: our system was never architected to be an inline prevention tool that blocks attacks in sub-millisecond time. It's a detection-and-attribution pipeline whose correlation layer lives in user space by design. Against a near-real-time alerting target measured in seconds, the escape scenarios comply fine. We reported both readings and made no inline-prevention claim we couldn't back up.

There was also a subtle one I'm glad we caught. Our internal-scan scenario (L1) appeared to have a detection rate of only 0.84, which looked like a coverage gap. When we re-scored the same attack log across different time windows, the truth emerged: every attack was eventually detected — the 0.84 was an artifact of alerts arriving late, past a strict 10-second scoring window, because a burst of rapid connections got coalesced by our deduplication logic. Complete coverage, genuine latency problem. Two very different things that a single averaged number would have blurred together. Separating "did we see it" from "did we see it in time" is now permanently part of how I read any detection metric.
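The re-scoring idea fits in a few lines. The delays below are synthetic numbers chosen for illustration and are not the thesis's measurements. The function name and the input shape are likewise assumptions.

```python
def detection_rate(delays_s, window_s):
    """Fraction of attacks whose first alert arrived within `window_s` seconds.

    `delays_s` holds one entry per attack: seconds from attack start to the
    first matching alert, or None if no alert was ever raised.
    """
    hits = sum(1 for d in delays_s if d is not None and d <= window_s)
    return hits / len(delays_s)


# Synthetic delays for illustration only.
delays = [0.4, 0.7, 1.2, 9.8, 12.5, 0.6, 15.0, 0.9, 2.1, 0.5]

strict = detection_rate(delays, window_s=10)               # "did we see it in time"
eventual = detection_rate(delays, window_s=float("inf"))   # "did we see it at all"
```

Scoring the same log with both windows gives two numbers to report side by side. A gap between them points to a latency problem, while a low eventual rate points to a coverage problem.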

The extras that became their own stories

Two side-quests turned into findings I didn't anticipate.

We took the system to a real two-node cluster to test whether cross-node lateral movement would still be caught. It was — each connection is detected by the agent on its source node. But a chain that pivots across nodes gets seen only in halves, because our correlation key (the in-kernel cgroup ID) is node-local. We closed that gap with an offline "stitch" that re-keys on cluster-global Pod identity, reconstructing the full attacker→relay→database chain from the two half-views. It's a working prototype of the distributed correlation layer that a production version would need.
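A minimal sketch of the re-keying step, assuming each agent exports (node, cgroup ID, destination IP) records and that lookup tables from node-local cgroup IDs and from Pod IPs to Pod names are available offline. All names, IPs, and table shapes here are hypothetical, and the sketch ignores timestamps, which a real stitcher would need to order the hops.

```python
def stitch_chains(connections, cgroup_to_pod, ip_to_pod):
    """Re-key node-local connection records on Pod identity, then join hops
    whose destination Pod is the source Pod of another hop."""
    hops = []
    for node, cgroup_id, dst_ip in connections:
        src = cgroup_to_pod.get((node, cgroup_id))
        dst = ip_to_pod.get(dst_ip)
        if src and dst:
            hops.append((src, dst))

    # Index hops by source Pod so two-hop chains can be joined.
    by_src = {}
    for src, dst in hops:
        by_src.setdefault(src, []).append(dst)

    chains = []
    for src, mid in hops:
        for dst in by_src.get(mid, []):
            if dst != src:
                chains.append((src, mid, dst))
    return chains


# Each node's agent saw only its own half. Note that cgroup ID 11 appears on
# both nodes but refers to different containers.
connections = [
    ("node-a", 11, "10.42.1.7"),   # attacker -> relay
    ("node-b", 11, "10.42.1.9"),   # relay -> database
]
cgroup_to_pod = {("node-a", 11): "attacker", ("node-b", 11): "relay"}
ip_to_pod = {"10.42.1.7": "relay", "10.42.1.9": "database"}

chains = stitch_chains(connections, cgroup_to_pod, ip_to_pod)
```

The sample data reuses the same cgroup ID on both nodes on purpose: keyed on the raw ID the two records look like one container, while keyed on (node, cgroup ID) they resolve to different Pods and the chain falls out of the join.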

And because a reviewer asked whether our memory footprint was intrinsic to the method or just an artifact of our Python-and-NetworkX prototype, we reimplemented the entire correlation engine in Go and replayed the exact same event stream through both. They produced byte-identical alerts — 3,439 of them across 8 rules — which proved the detection capability is language-invariant. More usefully, it located where the memory actually goes: the provenance graph itself is tiny (tens of megabytes), and the bulk of the live agent's footprint is the eBPF runtime machinery, not our logic. That's the difference between guessing and measuring.

What I actually took away

The technical skills are real — I came in not knowing kprobes from tracepoints and left able to debug a Flannel VXLAN failure across two nodes from documentation and observed behaviour alone. But the durable lessons were quieter.

Trust the kernel over your assumptions. The most important bugs weren't logic errors; they were places where I assumed a field was populated, a syscall was the only one, or a signature still matched reality. Reality kept winning.

Report the unfavourable result as plainly as the favourable one. Our thesis says, in print, that a well-configured Falco is actually cheaper than our agent on real application traffic, that we missed the latency SLO, and that our false-positive claim needs an explicit denominator to mean anything. None of that weakens the work. It's the part that makes the rest believable.

And detection is a moving target. The nsenter finding is the whole lesson in miniature: the substrate changes underneath you, and a security engineer's real job is less any single rule than the habit of reading the primary source, building the thing, measuring it honestly, and being ready to relearn the ground when it shifts.

That habit is the part I'm taking with me. The thesis just happens to be where I learned it.

This work was a two-person undergraduate thesis at the Bangladesh University of Engineering and Technology, supervised by Dr. Mohammad Mahfuzul Islam. The system was built on k3s, BCC, NetworkX, and the Online Boutique demo application, and evaluated against auditd and Falco across 700 simulated attacks.
