Circa 2018 I was working on R&D for my employer's medical device product. AI was integral to the platform, and that project fell in my lap. Back then, nothing in the ML stack worked unless you were willing to get your hands dirty. If you were serious about ML in 2018, you had no choice but to learn how everything beneath ML worked.
This is the story of how "just trying to train a model" turned into running multi-cluster Kubernetes, writing automation controllers, managing storage backends, and flexing event-driven architecture in most of my projects.
2018: ML Was Painful
Machine learning in 2018 was not exactly plug-and-play. You had to SSH into unmanaged GPU instances. Docker was a necessity, but good luck getting it to work with the GPU. Training workflows amounted to shell scripts you had to babysit. Even relatively small models took a long time to train.
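Those babysitting scripts were rarely sophisticated. Here's a minimal sketch of the genre - the `babysit` helper, its retry limit, and the `train.py` invocation in the comment are illustrative stand-ins, not any real tool from that project:

```shell
#!/usr/bin/env bash
# babysit CMD...: rerun CMD until it exits 0, giving up after a few failures.
# This was how overnight runs survived OOM kills and driver hiccups:
# resume from the last checkpoint and hope it finishes by morning.
babysit() {
    local max_retries=5 attempt=0
    until "$@"; do
        attempt=$((attempt + 1))
        if [ "$attempt" -ge "$max_retries" ]; then
            echo "giving up after $max_retries failures" >&2
            return 1
        fi
        echo "attempt $attempt failed; restarting" >&2
        sleep 1  # in practice, long enough for the GPU to free its memory
    done
}

# Hypothetical usage: babysit python train.py --resume-from-checkpoint
```

The whole scheme only works if the training script checkpoints often and can resume from the latest checkpoint on its own - which was yet another thing you had to build yourself.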
"While my model is training, I'll work on making this less brittle." - I probably had this thought dozens of times per day.
Every ML experiment was a systems problem in disguise. Somewhere on this path, there is an inflection point where you become "the person who understands why the cluster is behaving strangely."
ML Got Easy, Infra Didn't
By 2020-2023, the situation had changed: Colab, managed GPUs, and cloud notebooks hid the complexity. HuggingFace and hosted MLOps platforms abstracted away the dependencies. Even serious ML engineers no longer needed to touch a server.
On my lab's Discord, a common motif is the Zoomer dev with plenty of ML experience but next to no backend or distributed systems experience. They're interested in backend projects, but find them a tough nut to crack. I've recently come to appreciate the implications of this.
Realization #1: Infra Isn't One Discipline
Solving these problems in 2018 required far more than ML experience. I was building storage backends, Kubernetes operators, multi-layer authentication flows, observability message buses, and automated data ingestion pipelines. I wasn't doing infra merely to enable ML - I was doing infra because nothing worked without it. And I was doing a lot of it.
I had, entirely by circumstance, become an infrastructure engineer.
Realization #2: The Ladder Was Pulled Up Behind Me
When infra got easier to consume, people stopped learning it, and a second generation of "accidental infra engineers" never formed. In 2026, it seems most of the remaining problems in distributed systems are hard problems, and the difficulty curve of the discipline is steeper than it has ever been. Engineers entering the industry now can't justify "rolling their own" infra to their superiors. Becoming a senior engineer in this domain has likely never been harder than it is today.
Moreover, there are infra engineers and architects who've been doing this much longer than I have. These guys are arcane grandmaster wizards, and I relish every opportunity to learn from them. Since I'm no longer in the industry, though, those opportunities are few and far between. Many of them are making comfy salaries as architects and SREs, which leaves them with little free time, and their numbers will dwindle with retirements. I've been effectively cut off from their wisdom by these same industry changes.
The talent pipeline has been severely disrupted. If it weren't for antirez (Salvatore Sanfilippo) and CodeOpinion (Derek Comartin), it'd be much harder for me to stay "plugged in" to the discipline; listening to someone talk about high-level architecture has been such a blessing.
I can only imagine what it's like for someone just getting started. If you're a Zoomer who has successfully applied for an "ML Engineer" position, I'd love to hear your story. My understanding is that this job listing is culture-coded for "we want someone who grew up in the 2016-2018 trenches to solve our problems." These listings are everywhere because basically all product teams now have an unmet need that can only be filled by people who already work their dream jobs.
Realization #3: The Platforms I'm Building Now Exist Because of That Disaster
Eosin, Cyto, Lysis, etc. are the natural consequence of applying developer expectations to computational pathology:
- Why do we not have a clean, fast, and permissively licensed WSI viewer-annotator?
- Why is scientific data perpetually scattered, inconsistent, and painful to ingest?
- Why does nobody build DevOps infrastructure for computational biology?
None of this would've happened if ML had always been easy.
If I'm being honest, even though I've fallen in love with infrastructure engineering, I still just want to train my darn model...
"The more things change, the more they stay the same."
-Tom