Platform Engineering: Why You’re Doing It Wrong

Authors of a new book on platform engineering discuss why so many platform teams are building ill-considered IDPs, often by not listening to developers and other stakeholders.

Oct 24th, 2024 12:00pm by Jennifer Riggins

Featured image by Martin Reisch on Unsplash.

Platform engineering, when implemented well, uses shared resources to improve the internal developer experience. It became a hot topic back in 2023, as organizations scrambled to increase developer productivity. But many platform teams are still scrambling to measure and demonstrate their worth at a time of continued tech layoffs.

Camille Fournier and Ian Nowland are both experienced platform engineers; Fournier worked at JPMorgan Chase and Two Sigma, Nowland at Datadog, Two Sigma and Amazon Web Services. They argue that too many platform teams are building the wrong thing while not properly considering their cross-organizational stakeholders.

The pair wrote a book on this subject, sharing the lessons they learned from working in this nascent role longer than most. “Platform Engineering: A Guide for Technical, Product, and People Leaders” was just published by O’Reilly. The New Stack sat down with its authors to talk about the value of Platform as a Product.

The New Stack: How do you define platform engineering within the context of your book?

Fournier: Platform engineering is what I’ve been working in for seven years, but, over the last several years, it’s become this faddish term that has gone through a bunch of different meanings.

It has always been DevOps-related, but it was a lot about the Kubernetes and the cloud native ecosystems. And then, some people branded it as the integrated developer platform — one UI to rule them all, for all your internal developers. Sometimes, they talk about Heroku.

Then you’ve got Team Topologies, which also talks about platform teams as a slightly related but different concept — any kind of team that is supporting shared services, shared infrastructure for a company is a platform team.

Any piece of software that you deploy that can be used by a bunch of people. But it’s not really platform engineering.

A platform engineering strategy should be more enduring than any one particular technology. That’s important and building that stuff is hard, but the actual leadership elements of successfully running one of these organizations is not something we felt was very well-covered.

When does platform engineering — and your new book — become necessary?

Nowland: Until a team is getting between 50 and 80 engineers, the collaboration doesn’t need a platform. Just focus on the startup being successful and getting product-market fit. It’s only as you grow, as the engineering social dynamics change, then you really need platform engineering. Because once you have platforms, you have two sides — a platform team and a non-platform team.

I saw this at Datadog as we saw platforms outgrow pure social cohesion. Suddenly the engineers who used to all get along are like: ‘Oh, let’s just change our CI system here, and everything will be better.’ It becomes a turf war.

You just get to that number where [engineers] don’t know each other anymore, they don’t quite feel that they’re on the same team anymore. That is when you just need more formal mechanisms to manage that dynamic that you’re going to have some people focusing on the platform and some people using the platform, as opposed to everyone doing both.

Your book talks about the concept of the ‘shared commons.’ What goes into that? What doesn’t?

Fournier: The shared stuff that someone needs to maintain — and nobody really wants to.

You start to evolve these things that are ‘the commons’ — shared, critical pieces of infrastructure or software that everyone is using. You need people that are actually thinking about them full time. You need to formalize that, so the people that spend all their time thinking about it aren’t penalized by the organization for not shipping a bunch of product stuff.

Camille Fournier, co-author of “Platform Engineering.”

That technology might be Kubernetes. It might be your build and test environment. Your deployments. It might be some other kind of core underlying service that everybody really depends on for your business, that you don’t want duplicate versions of, but that nobody quite owns.

Please explain your book’s concept of ‘the swamp’ and how developer autonomy and tool sprawl drive platform engineering.

Nowland: The first chapter of the book [is], why did people start talking about platform engineering over the last five years, when platforms in software have been around for decades?

Over the last 20 years, we’ve moved so quickly, [companies] have gone: ‘Let these developer teams choose whatever they want.’ That has created a lot of what we call the ‘over-general swamp.’ Everything is great on the day it ships, and it sucks two years later. Looking at a company that’s 10 years old, they have five generations of that, and so platform engineering is addressing that technical debt.

Ian Nowland, co-author of “Platform Engineering.”

Each generation of teams has chosen separate technologies. Maybe 15 years ago, that was a LAMP stack. Today it might be JavaScript. Each team has chosen these things. When they’re at small scale and the people who built them are there to operate, they actually work really, really well.

The problem is that you get team, after team, after team making these individual decisions, and then your security engineering team comes [with an upgrade]. The company says it’s going to take two years. This is the massive toll of technology sprawl, because each team is choosing their own primitives to build on top. Everything slows down once it’s in the swamp.

In Chapter 8, you contend that organizations should always re-architect their platforms rather than migrate to a new version. Please explain.

Nowland: The platform team can only see a tiny piece of the iceberg of what the platform needs to be. Most great platforms start pretty close to product teams, and the platform team takes over and helps them grow. This is better than the platform team trying to build the perfect system, which gets very platform for platform’s sake. You end up with this thing that works really, really well for the one use case, and then the platform team goes: Now we need to rewrite this from scratch.

Re-architecture is the idea that, if you have something successful, fix it in place, and slowly reiterate it over time. I saw this at AWS and Datadog. It doesn’t always work, but when it does, it means you keep the best bits of an internal product that people love using, but you actually get architecture that can support the scale of the business.

Can you give an example of when you as a platform engineer have gone from serving one or two early platform adopters to the broader engineering organization? How did your IDP evolve and how did you make those trade-offs or choices?

Nowland: This comes from our time working together [at Two Sigma]. The data science team created a batch compute platform [based on Mesos]. They found they occasionally wanted services and built their own mini-service platform. It was almost like a [Platform as a Service] for services — it was very, very thick.

When I first took this on as a manager, I was like: ‘I have to kill this thing. This thing will never scale to a broad service platform. It is way too particular to the specific case.’

I overlooked that this thing was loved by its users, because it made that particular case so easy. [It became] how did I do the re-architecture? It was very naive about Mesos scheduling, so we moved it to Kubernetes. That took about four years. But rather than kill the golden goose and say, ‘Hey, data scientists, you now need to go write YAML and Kubernetes jobs,’ we took the user experience that they loved, switched the platform out underneath and moved them to a better architecture.

How does a Platform as a Product mindset fit into all this?

Fournier: I have seen engineering leader upon engineering leader fail on the beautiful Version 2 they are going to build.

When you put it in the product mindset framework, it becomes clearer why this is a bad idea: Nobody thinks about how you’re going to migrate those users from the old to the new platform. ‘Our customers are just going to figure it out.’ People drastically underestimate the effort of teaching existing users how to use something new and the value of a new interface.

You’re locked in once you get that first version that works reasonably well.

We can go away and rewrite it better, but rewriting it takes years usually. As Ian said, it took four years to do that migration, but it probably would have taken at least two years to go away and write something new, and then you still would have had all the migration work.

Your book is the first I’ve seen that defines platform engineering as not just the discipline of creating an internal developer platform, but of operating one.

Fournier: If we learned one thing from the DevOps movement, it is that it is much better when people operate the things they build. You have that very fast learning cycle, that necessity that comes from, ‘I have to deal with the consequences of my decisions,’ or ‘I’m going to fix the consequences very quickly.’

We both personally had challenges with teams who did not want to have to think about the operational consequences — they wanted fun software engineering. They wanted to build it, and they wanted to hire a team of operators to just deal with the other parts for them. And that, in our experience, doesn’t really work.

Your book uses the term ‘leveraging’ platform engineering a lot. What do you mean?

Nowland: Sometimes, with platform, there’s this focus on developer efficiency, and that’s good. Efficiency is a subset of leverage, it’s often very measurable, and you can do surveys around efficiency. The problem with just focusing on efficiency, it just focuses on developers becoming more efficient in doing the tasks that they are doing.

You aren’t asking: Are we doing the best things for the business in terms of building things that can add to top-line growth? Leveraging is the idea that you can bring a bunch of problems together that a bunch of different engineering teams have into one engineering team and add a lot more value to the overall business.

Fournier: It’s very important that leverage is one of the values that platforms should bring. With platform engineering, you have fewer people needing to be experts. Rather than every single team having their cloud expert, their databases expert, their whatever expert.

I’m personally a believer that I don’t think it’s a huge value for every single engineer to need to understand Terraform. The leverage is [that] not everybody needs to be an expert in that thing. We can focus on that expertise. We can deliver more with fewer people. Less duplication waste.

Nowland: Most developers should not need to know Kubernetes. We should be putting things on top of them. But of course, so many developers have to learn kubectl. They have to learn YAML. That’s poor leverage. That doesn’t mean that building the thing on top is easy, but it does mean that we’re in a state of poor leverage.

Who makes up a platform team?

Fournier: If you’re not careful, platform teams are all software engineers. They are very smart and they build very interesting things, but they really hate the ownership part of the operations. Not all of them, but many.

Sometimes you have the opposite, where you have a platform team that’s heavily built up by people from an operational DevOps — maybe SRE — background, who are great operators and fixers, but don’t always want to create that leverage.

Oftentimes, a platform team that is overloaded with people who are more operations ends up just deploying and managing a lot of open source, and that is very hard to scale and create leverage. They have a lot of ownership in the operational excellence side, but not as much ownership of: How are we efficiently building the right thing?

You want to get this mix of people with different skills that can teach one another. The final part is that customer-focused product mindset. Does your team hear regularly from the people that are using what you’re building? That way you can not only be proud of having an impact on the rest of the company, but also hear where people are frustrated and develop some of that customer empathy.

Especially in these tighter times, how does a platform team prove its value to the organization for long-term buy-in and funding?

Fournier: Build things that people want to depend on and use. And then [work on] stakeholder buy-in — from the engineering leadership, from product teams, from the business teams themselves — that this is something critical to this company.

This is the hardest part of platform engineering. I’ve spent a lot of time doing it at larger companies that aren’t pure tech companies. There, tech is already a cost center, and the farther away it is from the business, the more it is viewed as a necessary evil.

Part of the way you stay relevant to the organization is through the operating of these essential things that people have adopted.

This is also why you need to make sure people are adopting what you’re building, and why going off to the side and building something new that no one is using is a very bad idea.

The hardest part of platform engineering is stakeholder management and budgeting. Platform teams are getting cut because they don’t realize that they need to make the essentialness of what they are doing clear to the organization, and they need to figure out how to communicate that.

I think platform engineering leaders can be very naive about how hard it is to build a new thing and convince people that it was worth the investment of precious engineering resources. When you take the thing that people are already using and you assimilate it and make it more useful to more people, that’s much easier because they’re already using it, they already want it.

Or when you take the thing that they’re saying: ‘You make it better. You make it able to do more.’ Again, they’re already using it, they already want it. It’s much easier to show your value. You’re operating something for people, you’re taking that work off of their plate.

Platform engineering leaders would do well to realize that one of the hardest problems is not the technology. It’s often the stakeholders, the budget and the like, convincing and re-convincing and justifying your worth over and over and over again.

Think of yourself as building a product that you’re trying to sell to the rest of the company.

The Kindle version of “Platform Engineering” is out now, while the paperback is due out Nov. 12.

Jennifer Riggins is a tech storyteller and journalist, event and panel host. She bridges the gap between business, culture and technology, with her work grounded in the developer experience. She has been a working writer since 2003, and is based...