Terraform Users: Day 2 Operations Aren’t Failing Because of Tools, but Because Day 1 Never Happened

Your disaster recovery plan just failed, not because your DR strategy was wrong, but because you had no idea what you were actually running in production.

Every Day 2 tool you’ve bought (think: FinOps, drift detection, policy enforcement and security scanning) is optimizing, governing and securing infrastructure based on incomplete data. These tools are making decisions about maybe 40% of your actual cloud footprint (if you’re one of the luckier ones), while the other 60% runs wild, unmanaged and mostly invisible.

We’ve normalized building on a foundation that doesn’t exist — and it’s about to get much worse.

Terraform users, here’s a news flash: Your Day 2 operations aren’t failing because your tools are inadequate. They’re failing because Day 1 never happened.

The Day 2 Lie We’re All Living (and Believing)

Every organization is obsessed with Day 2 operations. For better governance, tighter cost control, automated drift detection and policy-as-code enforcement, they’re buying tools for FinOps, compliance scanning and security posture management.

Then, nothing works as expected. The tools aren’t bad; every single Day 2 tool simply operates on one massive, broken assumption: that you actually know what infrastructure you have.

Day 2 tools aren’t built to discover your infrastructure. They’re built to optimize, govern and secure infrastructure you’ve already identified and cataloged. They assume that you’ve completed Day 1 and that you have an accurate system of record. Most organizations don’t.

You can’t optimize costs for resources you don’t know exist. You can’t enforce policies on infrastructure you haven’t discovered. You can’t detect drift when half your infrastructure isn’t in your Terraform state. Yet, we keep spending on Day 2 tools while ignoring the foundational problem.

Your Cloud Isn’t What You Think It Is

Here’s what actually exists in your cloud right now:

Infrastructure deployed through Terraform

Resources created through the AWS console because ‘it was faster’

That Kubernetes cluster from a demo that’s somehow still running

Lambda functions created through CloudFormation by teams not using your Terraform setup

IAM roles created manually during incidents and never codified

S3 buckets so old, nobody remembers who created them

Third-party SaaS integrations provisioning infrastructure in your account

If your Day 2 tools can only see the first item on that list, they’re making decisions about maybe half of your actual spending. The other half? Invisible. Unmanaged. Accumulating cost and risk while your optimization tools pat you on the back.

What’s Broken? Drift Detection That Never Wins

You’ve probably noticed that drift detection doesn’t actually do anything to stop drift. You set up automated scanning. You get alerts for console changes. You have a dedicated Slack channel for drift notifications. Yet the drift keeps coming back, on the same resources and with the same modifications. You fix it, and two weeks later, it’s back.

Here’s why: You’re treating the symptom, not the disease.

The disease is that you never completed Day 1. You never established the full scope of what needs to be in infrastructure as code (IaC). So, you’re stuck chasing drift across resources you know about, all while completely missing the resources you don’t.

When drift detection alerts you that an EC2 instance was manually modified, what’s your remediation? You either update Terraform to match reality or roll back the change. But what about the EC2 instances that were never in Terraform to begin with? They’re not drifting. They’re invisible, and your drift detection tool will never find them because it only compares live resources against what is already in your Terraform state.
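As a rough illustration of that blind spot, the sketch below compares a live inventory against the resource IDs recorded in a Terraform state file. The state snippet, the inventory entries and the helper names (`managed_ids`, `split_inventory`) are hypothetical. It also assumes the version 4 state layout, in which each managed resource lists its instances with an `id` attribute, which will not hold for every provider or resource type.

```python
import json

# Hypothetical, trimmed-down Terraform state (version 4 layout). Real state
# files carry many more fields per resource.
STATE_JSON = """
{
  "version": 4,
  "resources": [
    {"mode": "managed", "type": "aws_instance", "name": "web",
     "instances": [{"attributes": {"id": "i-0aaa"}}]},
    {"mode": "managed", "type": "aws_s3_bucket", "name": "logs",
     "instances": [{"attributes": {"id": "acme-logs"}}]},
    {"mode": "data", "type": "aws_ami", "name": "base",
     "instances": [{"attributes": {"id": "ami-0123"}}]}
  ]
}
"""

# Hypothetical live inventory, as it might come back from cloud provider APIs.
LIVE_INVENTORY = [
    {"id": "i-0aaa", "type": "aws_instance"},
    {"id": "i-0bbb", "type": "aws_instance"},
    {"id": "acme-logs", "type": "aws_s3_bucket"},
    {"id": "acme-demo-2019", "type": "aws_s3_bucket"},
]


def managed_ids(state):
    """Collect the IDs of resources Terraform manages (data sources excluded)."""
    ids = set()
    for resource in state.get("resources", []):
        if resource.get("mode") != "managed":
            continue
        for instance in resource.get("instances", []):
            resource_id = instance.get("attributes", {}).get("id")
            if resource_id:
                ids.add(resource_id)
    return ids


def split_inventory(inventory, state):
    """Split live resources into those in state and those Terraform never saw."""
    known = managed_ids(state)
    managed = [r for r in inventory if r["id"] in known]
    unmanaged = [r for r in inventory if r["id"] not in known]
    return managed, unmanaged


managed, unmanaged = split_inventory(LIVE_INVENTORY, json.loads(STATE_JSON))
print("Drift detection can watch:", [r["id"] for r in managed])
print("Invisible to drift detection:", [r["id"] for r in unmanaged])
```

The point of the sketch is the second list: a drift detector that starts from the state file has no reason to ever look at those resources.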

The 40% IaC Coverage Myth, Finally Busted

Every engineering leader wants 100% IaC — everything versioned, reviewed, tested and deployed through CI/CD, with no ClickOps and complete reproducibility.

Then your cloud reality hits. You’re at maybe 40% IaC coverage, if you’re optimistic. The other 60%? It looks like years of accumulated technical debt, resources created during incidents, prototypes that became production, infrastructure that predates Terraform adoption, plus manual changes that never got documented.
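To make the coverage figure concrete, here is a minimal sketch of how IaC coverage could be computed once a full inventory exists. The inventory, the set of managed IDs and the function name `iac_coverage` are invented for illustration; the numbers are chosen to land on the 40% used above, not measured from any real environment.

```python
from collections import Counter

# Hypothetical full inventory: ten resources across four types.
INVENTORY = [
    {"id": "i-01", "type": "aws_instance"},
    {"id": "i-02", "type": "aws_instance"},
    {"id": "i-03", "type": "aws_instance"},
    {"id": "i-04", "type": "aws_instance"},
    {"id": "bucket-a", "type": "aws_s3_bucket"},
    {"id": "bucket-b", "type": "aws_s3_bucket"},
    {"id": "bucket-c", "type": "aws_s3_bucket"},
    {"id": "incident-role-1", "type": "aws_iam_role"},
    {"id": "incident-role-2", "type": "aws_iam_role"},
    {"id": "resize-images", "type": "aws_lambda_function"},
]

# Hypothetical IDs found in Terraform state.
MANAGED_IDS = {"i-01", "i-02", "bucket-a", "resize-images"}


def iac_coverage(inventory, managed_ids):
    """Return (overall coverage, coverage per resource type) as fractions."""
    total = Counter(r["type"] for r in inventory)
    covered = Counter(r["type"] for r in inventory if r["id"] in managed_ids)
    by_type = {t: covered[t] / total[t] for t in total}
    overall = sum(covered.values()) / len(inventory) if inventory else 0.0
    return overall, by_type


overall, by_type = iac_coverage(INVENTORY, MANAGED_IDS)
print(f"Overall IaC coverage: {overall:.0%}")
for resource_type, fraction in sorted(by_type.items()):
    print(f"  {resource_type}: {fraction:.0%}")
```

Breaking the figure down by type is a design choice rather than a requirement: it tends to show where the gap is concentrated, such as IAM roles created during incidents.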

The trap is thinking that you can gradually increase IaC coverage while simultaneously running Day 2 operations. (Spoiler: You can’t.)

Your Day 2 tools are making decisions based on incomplete data. Optimizing the wrong things. Enforcing policies on a fraction of your infrastructure. Detecting drift only where you’re already looking.

You need complete visibility first. Not 40% or even 80%, but all of it.

What Actually Breaks When Day 1 Doesn’t Exist

When you skip Day 1, cost optimization fails, your disaster recovery strategy is built on hope and the policy enforcement you think you have is an illusion. Security posture becomes less a matter of strategy and more a matter of guesswork.

Here’s what it looks like in practice:

Your FinOps tool identifies $50,000 in optimization opportunities, and you implement every recommendation. Still, the bill barely moves because the real waste is in the unmanaged resources your tool never saw.

You enforce tagging standards through policy as code, and every Terraform deployment gets validated — yet, half of your infrastructure was created outside Terraform, with zero tags. Your compliance reports look pristine, while your actual environment is chaos.

You’re confident that in the case of a disaster, you can rebuild from IaC. Then, an outage hits and critical dependencies aren’t in your Terraform states. You’re stuck manually recreating infrastructure during an incident because nobody knew what you were actually running.

Your scanning tools can find vulnerabilities in managed infrastructure, but give you no visibility into shadow IT, your forgotten test environments or the resources created through CloudFormation that nobody migrated.

This is what it’s like when you skip Day 1. Your Day 2 operations fail, largely because they’re operating with incomplete cloud context.
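The tagging scenario above can be sketched in a few lines. This is an illustrative example only: the required tags, the inventory and the `managed` flag are assumptions, and a real policy engine would evaluate far richer rules. It shows how the same check yields a pristine report when scoped to Terraform and a very different one across the whole account.

```python
# Hypothetical tagging standard.
REQUIRED_TAGS = {"owner", "cost-center"}

# Hypothetical inventory; "managed" marks resources deployed through Terraform.
INVENTORY = [
    {"id": "i-0aaa", "managed": True,
     "tags": {"owner": "web", "cost-center": "42"}},
    {"id": "acme-logs", "managed": True,
     "tags": {"owner": "platform", "cost-center": "17"}},
    {"id": "i-0bbb", "managed": False, "tags": {}},
    {"id": "acme-demo-2019", "managed": False, "tags": {"owner": "unknown"}},
]


def missing_tags(resource):
    """Return the required tag keys that the resource does not carry."""
    return REQUIRED_TAGS - set(resource.get("tags", {}))


def compliance_rate(resources):
    """Fraction of resources that carry every required tag."""
    if not resources:
        return 1.0
    compliant = sum(1 for r in resources if not missing_tags(r))
    return compliant / len(resources)


managed_only = [r for r in INVENTORY if r["managed"]]
print(f"Report scoped to Terraform: {compliance_rate(managed_only):.0%} compliant")
print(f"Whole account: {compliance_rate(INVENTORY):.0%} compliant")
for resource in INVENTORY:
    gaps = missing_tags(resource)
    if gaps:
        print(f"  {resource['id']} is missing: {sorted(gaps)}")
```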

The System of Record That (Mostly) No One Has

Here’s an uncomfortable truth: Most organizations don’t have a system of record for their cloud infrastructure.

They have Terraform state files that only reflect what’s managed by Terraform. They have AWS Config, which records configuration history only for the regions and resource types where recording is enabled, so it is not an authoritative source on its own. They have CMDBs that were accurate when someone set them up two years ago but haven’t been updated since.

What many organizations don’t have is a single source of truth that answers: What infrastructure do we actually have running right now, across all our clouds, whether it’s managed by IaC or not?

Without that, every Day 2 tool is making decisions in the dark. Your FinOps tool optimizes based on partial data. Your policy engine enforces rules on a subset of resources. Your drift detection catches changes to managed infrastructure while unmanaged infrastructure runs wild.

What Day 1 Actually Requires

Real Day 1 isn’t a one-time project. It’s establishing continuous, automated visibility across your entire cloud footprint.

It means scanning everything (think: AWS, Azure, GCP, Kubernetes and SaaS platforms) and building a real-time inventory that actually stays current. It means identifying what’s managed by IaC and what isn’t. It means detecting infrastructure that shouldn’t exist: orphaned resources, forgotten test environments and shadow IT that’s costing you money and creating security risks. And critically, it means automatically generating IaC for unmanaged resources so you can actually close the gap instead of just documenting how big it is.
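One possible way to start closing that gap is sketched below, assuming Terraform 1.5 or later, which supports declarative `import` blocks. The script turns a list of unmanaged resources into import blocks. The inventory and the helpers `resource_label` and `import_block` are hypothetical, and the sketch assumes the import ID equals the cloud resource ID, which is true for EC2 instances and S3 buckets but varies by resource type.

```python
import re

# Hypothetical unmanaged resources, e.g. the output of an inventory scan.
UNMANAGED = [
    {"type": "aws_instance", "id": "i-0bbb"},
    {"type": "aws_s3_bucket", "id": "acme-demo-2019"},
]


def resource_label(resource_id):
    """Derive a valid Terraform resource name from a cloud resource ID."""
    label = re.sub(r"[^A-Za-z0-9_]", "_", resource_id)
    # Terraform names must start with a letter or underscore.
    if not re.match(r"[A-Za-z_]", label):
        label = "r_" + label
    return label


def import_block(resource):
    """Render one Terraform import block for an unmanaged resource."""
    address = f'{resource["type"]}.{resource_label(resource["id"])}'
    return f'import {{\n  to = {address}\n  id = "{resource["id"]}"\n}}\n'


imports_tf = "\n".join(import_block(r) for r in UNMANAGED)
print(imports_tf)
```

Saving the output as a `.tf` file and running `terraform plan -generate-config-out=generated.tf` asks Terraform to draft matching resource blocks. The generated configuration should be treated as a starting point to review, not as finished code.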

This is the foundation that makes Day 2 operations possible, not just aspirational.

Fair warning: The window for fixing this is closing fast. When AI agents start managing cloud infrastructure at scale, the lack of a complete system of record won’t just hurt efficiency — it will create chaos, even more than already exists in your cloud (think: Agents optimizing resources that don’t exist in your inventory or modifying infrastructure that isn’t tracked).

Without Day 1, Day 2 fails, and it is primed to get far worse before it gets better unless we as practitioners act smartly and quickly.