Platform9 Elastic Machine Pool for EKS Cost Optimization

With EMP, enterprises could finally start realizing significant efficiency gains in virtualized data center operations.

May 27th, 2024 10:00am by Joe Thompson

Just about every organization with a significant cloud footprint has issues with wasted spending on enormous amounts of unused cloud resources, and an entire catalog of tools has sprung up to try to help. But what do you do when you choose and deploy a FinOps cost-control tool like an autoscaler in your Amazon Web Services EKS cluster and… it falls short of expectations?

Did you pick the wrong tool? Probably not — at Platform9, our conversations with customers have consistently shown that a wide variety of existing tools do help… but not enough. You need something more than just a tool — you need an integrated solution built to achieve your FinOps goals.

Recap: Kubernetes Right-Sizing Challenges

If you’ve been reading our latest blog posts on Kubernetes FinOps, you’ll know that there are a wide variety of tools intended to help you optimize workload resource usage — from core code built into Kubernetes itself, to components built to interface with external SaaS products, to add-ons intended to run within the cluster itself. But you also know that these bring additional complexity of their own, from infrastructure credential-handling to unexpected interactions with each other when using multiple tools.

Many of them do a good job of scaling the cluster up or down to meet application resource demands and ensure availability, but they aren’t actually intended to optimize those application workloads — and even when they are, they’re limited by the resource-management capabilities of Kubernetes itself.

The net result is that even with one or more of these tools in play, actual cluster resource consumption still typically hovers around 30%. This is significantly less than most enterprises would like — but what more can you do?

This is not a new problem — in fact, it’s a fairly old one: The main drivers of container inefficiency are the same ones that drove early virtualization inefficiency — resource and load management challenges, and the desire to avoid leaving applications starved for resources because of scaling delays during periods of peak demand.

Elastic Machine Pools for EKS Clusters: A Cost-Optimization Layer Built on Proven VM Technology

History Rewind – Solving Utilization Challenges in VM Environments

The issue of intractable under-utilization in Kubernetes is an almost direct replay of the same struggle in virtualization environments 15-20 years ago. Virtual machines (VMs) promised the ability to run services with hardware-like isolation while sharing resources more efficiently. Instead of needing to either run multiple services in the same operating system or waste hardware resources to fully isolate them from each other, in theory, you could provision multiple small VMs per physical node, with services running under separate operating systems from the kernel up to maintain a security boundary between them.

In practice, virtualization in its earliest forms didn’t live up to this promise: If your application had periods of higher utilization, you had to allow it to use enough resources to handle those peaks. Sometimes deploying instances of the application on additional VMs behind a load-balancer was enough to deal with the extra demand, but if you couldn’t provision quickly enough to absorb the load as it increased, your application would still end up hitting a wall — so virtual machines still tended to be configured with a lot of resource overhead. It was also difficult to handle moving VMs between hypervisors to rebalance resource usage without disrupting the workloads running on them.

Before too long, hypervisors started gaining capabilities aimed at better resource management:

Overcommitment allowed allocating more memory to VMs running on a hypervisor node than the node itself had; if some VMs weren’t using all the memory they were allocated, others could use it.
Memory page merging allowed VMs running similar operating systems and applications to share a single copy of identical portions of memory, increasing the density with which VMs could be placed on nodes.
Live migration allowed VMs to be moved to newly provisioned nodes seamlessly when the cluster needed to expand to handle demand, or to consolidate workloads when nodes were underutilized so some could be powered off until needed.
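
To make the first two mechanisms concrete, here is a minimal Python sketch of the arithmetic behind overcommitment and page merging. It is a toy model, not how any particular hypervisor implements these features: the VM class, the function names and the sample numbers are all invented for illustration, and it assumes the shareable portion of memory is identical across every VM on the node.

```python
from dataclasses import dataclass


@dataclass
class VM:
    name: str
    allocated_gib: float        # memory promised to the VM
    used_gib: float             # memory the VM actually touches
    shareable_gib: float = 0.0  # part of used_gib assumed identical across VMs


def overcommit_ratio(vms, node_gib):
    """Allocated memory divided by physical memory; above 1.0 means overcommitted."""
    return sum(vm.allocated_gib for vm in vms) / node_gib


def physical_demand_gib(vms):
    """Physical memory needed when identical pages are stored only once.

    Simplifying assumption: the shareable portions are identical across VMs,
    so only the largest one has to be kept in physical memory.
    """
    if not vms:
        return 0.0
    private = sum(vm.used_gib - vm.shareable_gib for vm in vms)
    shared = max(vm.shareable_gib for vm in vms)
    return private + shared


def fits_on_node(vms, node_gib):
    """True if actual (merged) demand fits, even if allocations do not."""
    return physical_demand_gib(vms) <= node_gib


# Hypothetical example: four VMs, each allocated 32 GiB but using only 12 GiB,
# of which 2 GiB is identical (for instance, the same OS image).
vms = [VM(f"vm-{i}", allocated_gib=32, used_gib=12, shareable_gib=2) for i in range(4)]
print("overcommit ratio:", overcommit_ratio(vms, node_gib=64))
print("physical demand (GiB):", physical_demand_gib(vms))
print("fits on a 64 GiB node:", fits_on_node(vms, node_gib=64))
```

In this made-up case the node has promised twice the memory it physically has, yet actual demand still fits because the VMs use far less than they were allocated and part of what they do use is shared.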

One of the touted benefits of the trend toward containerization over the last decade was improved efficiency over traditional virtualization, using new Linux capabilities like namespaces and control groups; in theory, without the need to run a full operating system kernel and libraries to isolate applications, application processes could share the same hardware at higher density safely. In practice… it hasn’t worked out so nicely. Kubernetes has some mechanisms to help, but within the platform itself, nothing comprehensively solves these issues.

How Elastic Machine Pools Use Proven Virtualization Technology To Solve Kubernetes FinOps Challenges

The solution for Kubernetes resource-management issues is the same today as it was back then for virtual machines: use overcommitment, page merging and live migration to make the necessary consolidation work seamlessly. But Kubernetes itself has no way to do this, and in a cloud environment like AWS, you don’t normally have access to the hypervisor running your instances. Elastic Machine Pool (EMP) bridges the gap by leveraging AWS Bare Metal, which gives it the capability to set up a virtualization layer under EMP’s direct control (in fact, allowing customers to run their own virtualization environments like this was exactly why AWS built the Bare Metal capability in the first place).

With the virtualization layer established, EMP sets up its own virtual machines, called Elastic VMs (EVMs), and joins them to the EKS cluster as new nodes — allowing EMP to use the same production-proven virtualization mechanisms discussed above to automatically optimize Kubernetes utilization without sacrificing availability:

EVMs with significant amounts of resources allocated but unused by their workloads are consolidated more densely on EMP-managed Bare Metal to improve utilization — without altering the configuration of individual Kubernetes workloads at all.
When more workloads are deployed, or existing ones start to use more of their resource allocations due to additional demand on applications, more Bare Metal instances are provisioned and EVMs are live-migrated to them to rebalance the load — without disrupting the pods running on them (especially beneficial for monolithic disruption-sensitive applications, such as many business apps written in Java). Likewise, if overall cluster utilization decreases again, underutilized EVMs are live-migrated onto a smaller number of Bare Metal instances and the excess compute is deprovisioned without disruption.
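
The consolidation and rebalancing described above can be pictured as a bin-packing problem: place EVMs on as few Bare Metal hosts as their actual usage allows. The sketch below uses a generic first-fit-decreasing heuristic purely as an illustration. It is not EMP's actual placement algorithm, and the function name, EVM names and capacity figures are hypothetical.

```python
def consolidate(evm_usage, host_capacity):
    """Pack EVMs onto hosts using a first-fit-decreasing heuristic.

    evm_usage: dict mapping EVM name -> resources actually in use (any unit)
    host_capacity: usable capacity of one host, in the same unit
    Returns a list of hosts, each a list of the EVM names placed on it.
    """
    hosts = []  # each entry: {"free": remaining capacity, "evms": [names]}
    # Place the largest EVMs first so the small ones can fill the leftover gaps.
    for name, used in sorted(evm_usage.items(), key=lambda kv: kv[1], reverse=True):
        if used > host_capacity:
            raise ValueError(f"{name} does not fit on a single host")
        for host in hosts:
            if host["free"] >= used:
                host["free"] -= used
                host["evms"].append(name)
                break
        else:
            # No existing host has room: provision another one.
            hosts.append({"free": host_capacity - used, "evms": [name]})
    return [host["evms"] for host in hosts]


# Hypothetical usage figures for the same five EVMs at peak and during a quiet period.
peak = {"evm-a": 40, "evm-b": 30, "evm-c": 30, "evm-d": 20, "evm-e": 10}
quiet = {"evm-a": 20, "evm-b": 15, "evm-c": 15, "evm-d": 10, "evm-e": 4}

print("hosts needed at peak:", len(consolidate(peak, host_capacity=64)))
print("hosts needed when quiet:", len(consolidate(quiet, host_capacity=64)))
```

The difference between the two placements is what live migration makes usable: moving from one layout to the other without restarting the pods running on the EVMs.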

All of this automated optimization takes place at a level below EKS and cluster-based autoscalers — you don’t need to change how you define and run your workloads to benefit, and you’re still using the standard EKS cluster control plane and Kubernetes API. Plus, if you’re already using autoscalers or workload right-sizing tools in your EKS clusters, you can continue to do so — EMP can run alongside them, will not interfere with their actions and will still provide additional optimization via its EVMs.

The capability to manage utilization of the cluster as a whole in this way is something that has been missing from Kubernetes, and that gap is the main reason cluster operators have had such a difficult time achieving higher cluster utilization — using tools built on top of the Kubernetes API simply can’t achieve these kinds of results without negative impact to the workloads in the cluster.

With EMP finally filling this gap, cluster operators no longer have to walk a difficult line between saving money and protecting availability of applications. As a result, utilizations of up to 70% in Kubernetes without risking application availability are now achievable — and the cost savings of not paying for wasted compute resources at EC2 pricing are significant. And at general availability, we plan to enable Platform9’s Always-On Assurance — our proactive monitoring and management of your environment to detect and correct issues, usually before you even notice a problem developing.
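
To put the utilization figures in perspective, here is a back-of-the-envelope calculation of what moving from roughly 30% to 70% utilization means for provisioned capacity. It is only a sketch under simplifying assumptions: the same price per unit of capacity in both cases, no EMP fees, and no pricing difference between Bare Metal and standard instances. The function names are made up for this example.

```python
def provisioned_capacity(needed, utilization):
    """Capacity you must pay for to serve `needed` units at a given utilization."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return needed / utilization


def capacity_reduction(util_before, util_after):
    """Fraction of provisioned capacity no longer needed for the same workload."""
    return 1 - util_before / util_after


# Illustrative workload that genuinely needs 30 units of compute.
before = provisioned_capacity(30, 0.30)
after = provisioned_capacity(30, 0.70)
print(f"provisioned at 30% utilization: {before:.1f} units")
print(f"provisioned at 70% utilization: {after:.1f} units")
print(f"reduction: {capacity_reduction(0.30, 0.70):.0%}")
```

Under these assumptions, the same workload needs a little under half the capacity it did before, a reduction of about 57%. Actual savings depend on instance pricing and how closely a cluster approaches the upper utilization figure.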

Get Started Optimizing With EMP!

Platform9 Elastic Machine Pool is now available as an early-access offering for Amazon EKS (see our AWS Marketplace listing for details). Get in touch with us for more information and we’ll get you up and running quickly — customers typically see significant savings in EKS clusters over using autoscalers alone within a few weeks!

Additional Reading

From the Platform9 blog: Kubernetes FinOps: Right-Sizing Kubernetes Workloads

Joe Thompson's career in IT started when the Linux kernel was still in 1.x versions and home Internet speeds were expressed in kilobits per second. Since 2014, he's worked primarily with cloud native systems, starting with OpenStack and public clouds,...