Improve MTTR by Doing the Hard Stuff First

To reduce your mean time to recovery (MTTR) from an outage, you have to take care of the hard stuff first.

Aug 23rd, 2024 7:26am by Rita Manachi and David Zendzian

Featured image by Unsplash+ in collaboration with Galina Nelyubova.

IT systems are constantly under threat, malicious or not, so much so that breaches have become almost commonplace. Case in point: As we started writing this article, news broke that 4.5 million people were affected by the recent HealthEquity data breach.

As our cloud native systems continue to scale, their distributed nature also makes them more complex. This complexity affords us flexibility and velocity; it also exposes more points of failure and intrusion.

Falling prey to human error, poorly written code or an intentional breach isn’t just about the immediate business impact. Companies are at risk of government scrutiny, billions of dollars in fines or even legal action if they can’t recover quickly.

So while the recent CrowdStrike fiasco certainly made headlines, it’s the aftermath that matters. It also has us giving mean time to recovery (MTTR) a second look, specifically how you can reduce the amount of time it takes to recover from an outage or malicious attack. As the DevOps Research and Assessment (DORA) team defines it, MTTR is “the average amount of time it takes your team to restore service when there’s a service disruption, like an outage.”
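To make the definition concrete, here is a minimal sketch of that average in Python. The incident timestamps and the function name `mean_time_to_recovery` are made up for illustration; in practice the data would come from your incident tracker.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Average restore time over (started, restored) datetime pairs."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((restored - started for started, restored in incidents), timedelta())
    return total / len(incidents)


# Hypothetical incident log: (disruption started, service restored)
incidents = [
    (datetime(2024, 8, 1, 9, 0), datetime(2024, 8, 1, 9, 45)),     # 45 minutes
    (datetime(2024, 8, 9, 14, 0), datetime(2024, 8, 9, 16, 15)),   # 135 minutes
    (datetime(2024, 8, 20, 2, 30), datetime(2024, 8, 20, 3, 30)),  # 60 minutes
]

print(mean_time_to_recovery(incidents))  # 1:20:00
```

One long outage can dominate this average, so it helps to look at the individual recovery times alongside the mean.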

Do the Hard Stuff First

Before you change your technology approach, you must change your organization’s mindset. Start by making security inherent to your software development life cycle (SDLC), from code to production to management. It’s much harder to change behaviors than to adopt a new tool or platform, and without this culture shift, it won’t matter what technology choices you make.

Stop treating security as an outcome: Security is not one single thing, and today’s cloud native ecosystems are extremely porous and connected. Rather than setting up major checkpoints that could thwart weeks of work, check on security throughout the SDLC, starting with developers.
Embrace a product mindset: The platforms developers work on are dynamic, and they need to be treated as such. This means considering the platform as a product that requires upgrading, patching and improving over time. Be sure to include roles like platform engineers, compliance architects and security specialists as part of your platform delivery and strategy. As many (including the Center for Internet Security) have said, “security is a team sport.”
Make the secure thing the easy thing: Make security inherent in your process. Give developers self-service access to app and code templates that are automatically updated and patched, a catalog of approved open source and commercial software, buildpacks, an API gateway with policy controls and so on. Make sure they can use the tools they love safely!

Dig Into the Technical Stuff

Your platform choice matters to your security posture. Look for security-enhancing features and capabilities that support a DevSecOps-based working model. Upskill your current employees in newer disciplines like platform engineering and architecting for compliance.

Blue-green deployment is a technique that can reduce app downtime and risk by running two identical production environments, one “blue” and one “green.” Only one environment is live and serving production traffic at any time; the other is idle. A new release is deployed to the idle environment, and only after proper testing does that environment start serving production workloads.
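As a rough illustration, the switch between the two environments can be modeled in a few lines of Python. The `BlueGreenRouter` class and its method names are hypothetical, and the health check is a stand-in; a real platform would do this at the router or load balancer.

```python
class BlueGreenRouter:
    """Toy model: two environments, exactly one of which serves traffic."""

    def __init__(self, live_version):
        self.versions = {"blue": live_version, "green": None}
        self.live = "blue"

    @property
    def idle(self):
        return "green" if self.live == "blue" else "blue"

    @property
    def live_version(self):
        return self.versions[self.live]

    def stage(self, version):
        """Deploy a release to the idle environment; live traffic is untouched."""
        self.versions[self.idle] = version

    def cut_over(self, health_check):
        """Switch traffic to the idle environment only if it passes the check."""
        candidate = self.versions[self.idle]
        if candidate is None or not health_check(candidate):
            return False
        self.live = self.idle
        return True

    def roll_back(self):
        """Point traffic back at the previous, known-good environment."""
        self.live = self.idle


router = BlueGreenRouter("v1")
router.stage("v2")
router.cut_over(lambda version: True)   # stand-in for real smoke tests
print(router.live, router.live_version)  # green v2
router.roll_back()
print(router.live, router.live_version)  # blue v1
```

Because the previous environment is left intact after the cutover, rolling back is just another switch rather than a redeploy.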

Canary deployment is another way to test the viability of new software or updates in production. You release the new version to a small subset of users or traffic and see how it runs. If things go smoothly, you gradually widen the rollout. It’s part of another modern app delivery paradigm called progressive delivery, a term coined by RedMonk’s James Governor several years ago. What blue-green and canary deployments have in common is that they allow you to easily roll back to a known-good version of the software with minimal disruption if something breaks.
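A canary rollout can be sketched as a loop that widens exposure step by step and falls back to the known-good version as soon as the new one misbehaves. The step sizes, the 1% error threshold and the `observe_error_rate` callback below are all assumptions chosen for illustration, not recommended values.

```python
def run_canary(observe_error_rate, steps=(5, 25, 50, 100), max_error_rate=0.01):
    """Widen a rollout in stages.

    observe_error_rate(percent) is a caller-supplied function returning the
    error rate seen while `percent` of traffic hits the new version.
    Returns (final_percent, history); final_percent is 0 after a rollback.
    """
    history = []
    final_percent = 0
    for percent in steps:
        rate = observe_error_rate(percent)
        history.append((percent, rate))
        if rate > max_error_rate:
            return 0, history  # roll back: all traffic returns to known-good
        final_percent = percent
    return final_percent, history


# A healthy release reaches full traffic.
print(run_canary(lambda percent: 0.001)[0])  # 100

# A release that degrades under load is rolled back at the 25% step.
print(run_canary(lambda percent: 0.001 if percent < 25 else 0.05))
# (0, [(5, 0.001), (25, 0.05)])
```

The failing release in the second call never reaches more than a quarter of traffic, which is the point of releasing progressively.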

Test-driven development (TDD) is critical to continuously releasing stable and resilient applications. To get the most from TDD, don’t just do functional tests on what you added. You need to test the new piece in context of everything else, so be sure to include regular fuzz, chaos or fault testing in your approach.
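As one small example of going beyond functional tests, here is a sketch of a fuzz test in Python. The function under test, `parse_percent`, is hypothetical, and the fuzzer is deliberately simple: it throws random printable strings at the function and accepts only two outcomes, a valid result or a clean `ValueError`.

```python
import random
import string


def parse_percent(text):
    """Hypothetical function under test: parse '42%' into an int from 0 to 100."""
    if not text.endswith("%"):
        raise ValueError("missing % sign")
    digits = text[:-1]
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError("not a whole number")
    value = int(digits)
    if value > 100:
        raise ValueError("out of range")
    return value


def fuzz_parse_percent(iterations=1000, seed=0):
    """Feed random strings to parse_percent; return how many were accepted."""
    rng = random.Random(seed)  # fixed seed keeps failures reproducible
    accepted = 0
    for _ in range(iterations):
        text = "".join(rng.choice(string.printable) for _ in range(rng.randint(0, 6)))
        try:
            value = parse_percent(text)
        except ValueError:
            continue  # rejecting bad input cleanly is the expected behavior
        assert 0 <= value <= 100, f"accepted {text!r} but returned {value}"
        accepted += 1
    return accepted


fuzz_parse_percent()  # any exception other than ValueError would surface here
```

Dedicated fuzzing and property-based testing tools generate far better inputs than this, but the shape of the test is the same.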

Error handling and monitoring, combined with robust log monitoring and observability, can capture problems as they happen and limit the scope of a failure.

Policy-based automation can improve multiple aspects of your software delivery and maintenance processes. To safely automate the multiple layers of your security, get input from various teams, including platform engineering, security, compliance, and infrastructure and operations (I&O). This will make the policies that define your automation process more holistic, which helps prevent a disastrous outage or lessen the damage when one occurs.
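One way to picture this is policy as code: each team contributes rules, and the pipeline evaluates all of them before a deployment proceeds. The sketch below is a simplified Python illustration; the `Deployment` fields, the registry name and the three rules are invented examples, not policies from any particular platform.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    image: str
    runs_as_root: bool
    days_since_base_image_patch: int


# Hypothetical policies: (owning team, rule description, check function)
POLICIES = [
    ("security", "image must come from the approved registry",
     lambda d: d.image.startswith("registry.example.internal/")),
    ("compliance", "containers must not run as root",
     lambda d: not d.runs_as_root),
    ("platform engineering", "base image patched within 30 days",
     lambda d: d.days_since_base_image_patch <= 30),
]


def violations(deployment):
    """Return a description of every policy the deployment fails."""
    return [f"{team}: {rule}" for team, rule, check in POLICIES if not check(deployment)]


candidate = Deployment(
    image="registry.example.internal/shop/cart:1.4.2",
    runs_as_root=True,
    days_since_base_image_patch=12,
)
print(violations(candidate))  # ['compliance: containers must not run as root']
```

Keeping the owning team next to each rule makes it clear who to ask when a policy blocks a release.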

Three Before Four

Before VMware Tanzu introduced the four golden commands (build, bind, deploy and scale), there were the 3Rs (rotate, repave and repair). They provided a simple way of looking at a cloud native platform’s security attributes. The idea behind the 3Rs is that by being fast, you are safer.

Rotate data center credentials every few minutes or hours.
Repave every server and application in the data center every few hours from a known-good state.
Repair vulnerable operating systems and application stacks consistently within hours of patch availability.
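The 3Rs lend themselves to simple automated checks. The Python sketch below flags which of the three actions is overdue for a given server. The thresholds (1, 6 and 12 hours) and the function and parameter names are assumptions picked for illustration; the right intervals depend on your platform and risk tolerance.

```python
from datetime import datetime, timedelta

# Hypothetical limits, chosen only for this example
ROTATE_EVERY = timedelta(hours=1)    # maximum credential age
REPAVE_EVERY = timedelta(hours=6)    # maximum server age since last rebuild
REPAIR_WITHIN = timedelta(hours=12)  # maximum time a released patch may stay unapplied


def overdue_actions(now, credential_issued_at, server_built_at,
                    oldest_unapplied_patch_at=None):
    """Return which of rotate, repave and repair are overdue."""
    overdue = []
    if now - credential_issued_at > ROTATE_EVERY:
        overdue.append("rotate")
    if now - server_built_at > REPAVE_EVERY:
        overdue.append("repave")
    if (oldest_unapplied_patch_at is not None
            and now - oldest_unapplied_patch_at > REPAIR_WITHIN):
        overdue.append("repair")
    return overdue


now = datetime(2024, 8, 23, 12, 0)
print(overdue_actions(
    now,
    credential_issued_at=datetime(2024, 8, 23, 11, 50),      # 10 minutes old
    server_built_at=datetime(2024, 8, 23, 2, 0),             # 10 hours old
    oldest_unapplied_patch_at=datetime(2024, 8, 23, 9, 0),   # available for 3 hours
))  # ['repave']
```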

The 3Rs continue to be core tenets of the Tanzu Platform, and you can follow our blog to learn more about Tanzu and security.

Multiple factors determine whether recovering from an outage or security breach is devastating to your application development and delivery processes, including platform choice, development styles (e.g., agile, extreme, test-driven) and organizational or cultural factors.

Rather than treating security as a single outcome, focus on delivering secure software supply chains; support a security-focused culture; automate patches, upgrades and policy enforcement; stay on top of policy drift and monitoring; and adopt other security-enabling practices.

Rita Manachi is a marketing and communications pro with decades of experience in high tech. She is a marketing manager at VMware Tanzu.