Saving with Confidence: The Strategic Advantage of Spot Instances

The dynamic nature of spot instance pricing, availability and stability requires a proactive approach, where adjustments to workloads are made in real time.

Mar 18th, 2024 6:57am by Leon Kuperman

Image by CAST AI.

As cloud services have moved into the mainstream, organizations are continually looking for new strategies to optimize their spending without compromising performance or uptime.

Amid the rapid growth of hyperscalers like AWS, Microsoft Azure and Google Cloud Platform (GCP), which have all seen double-digit expansion, a significant opportunity for savings lies in an often-misunderstood resource: spot instances.

Despite the potential to slash compute costs by 75% to 90%, many customers remain hesitant, primarily because they perceive spot instances as unstable. At CAST AI, we’ve seen these challenges firsthand. For example, one of our customers has run the majority of the company’s apps on spot instances for the past year with zero downtime, even through the busiest part of the holiday season when spot instance inventory becomes scarce. Despite this, senior leaders keep warning about the risk of relying on spot instances. We’ve also seen what works effectively at scale when using spot instances in a cost reduction strategy.

The Hesitation: Perceived Instability

The main deterrent against using spot instances is the fear of instability. Cloud providers can reclaim these instances with minimal notice — 2 minutes on AWS, and just 30 seconds on GCP and Azure. This unpredictability poses a mind-bending challenge for businesses relying on stable and uninterrupted computing resources: Do I save 75% to 90% on compute costs and risk downtime, or do I pay more and worry less about downtime?

In Kubernetes environments, spot instances present several unique and interesting technical challenges. These stem from the inherent unpredictability of spot instance availability and the complex nature of workload management. Let’s go over a few examples.

Graceful shutdown and migration of workloads: Upon receiving a termination notice for a spot instance, the Kubernetes cluster needs to perform several operations in a very short window. This includes gracefully shutting down running applications, committing any final state to storage and rerouting traffic to ensure availability. These operations are nontrivial, especially for stateful applications or those with complex shutdown procedures that might require more time than the notice period allows.

Rescheduling and capacity planning: Kubernetes must quickly reschedule the workloads from the terminated spot instance to another compute resource. This requires real-time capacity planning to identify available resources that can accommodate the evicted workloads without causing resource contention or performance degradation. In a cloud environment, where spot instance availability can fluctuate dramatically, ensuring a smooth transition can be challenging.

Automated, intelligent decision-making: To manage these transitions effectively, Kubernetes clusters need to employ sophisticated automation and decision-making algorithms. This involves not just reacting to spot instance terminations but proactively managing the mix of instance types and purchasing options (spot, on-demand, reserved) based on cost, availability and workload requirements. Developing and tuning these algorithms to balance cost savings with reliability and performance objectives requires deep expertise and continuous adjustment.

Network and dependency management: Workloads running on spot instances might be part of a larger, interdependent microservices architecture. When an instance is terminated, it’s not just about moving the affected workload; it’s also about ensuring that network configurations, service discovery mechanisms and dependency relationships are updated in real time to reflect the new deployment topology. Kubernetes and adjacent cloud native technologies such as service mesh take care of many of these concerns. However, tight time constraints add to the complexity.
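The first two challenges above, the short shutdown window and rescheduling, can be sketched together in a few lines. This is a simplified model under stated assumptions, not how Kubernetes or any CAST AI product actually schedules workloads: the `Pod` and `Node` classes, their fields and the `plan_eviction` function are hypothetical names invented for illustration, and placement uses plain first-fit decreasing on CPU requests only. The notice windows are the ones cited earlier in this article.

```python
from dataclasses import dataclass, field

# Notice windows in seconds, as cited in this article.
NOTICE_SECONDS = {"aws": 120, "gcp": 30, "azure": 30}


@dataclass
class Pod:
    name: str
    cpu: float               # requested CPU cores
    shutdown_seconds: int    # time the pod needs to stop gracefully


@dataclass
class Node:
    name: str
    free_cpu: float
    pods: list = field(default_factory=list)


def plan_eviction(provider, evicted, nodes):
    """Return (placements, at_risk, unschedulable) for pods on a reclaimed node.

    Hypothetical helper: a real scheduler also weighs memory, affinity,
    disruption budgets and more.
    """
    window = NOTICE_SECONDS[provider]
    # Pods whose graceful shutdown will not fit inside the notice window.
    at_risk = [p.name for p in evicted if p.shutdown_seconds > window]

    placements, unschedulable = {}, []
    # First-fit decreasing: try to place the largest pods first.
    for pod in sorted(evicted, key=lambda p: p.cpu, reverse=True):
        target = next((n for n in nodes if n.free_cpu >= pod.cpu), None)
        if target is None:
            unschedulable.append(pod.name)
            continue
        target.free_cpu -= pod.cpu
        target.pods.append(pod.name)
        placements[pod.name] = target.name
    return placements, at_risk, unschedulable
```

In this model, a pod that needs 90 seconds to shut down is flagged as at risk on a 30-second provider but not on a 2-minute one, and any pod listed as unschedulable signals that replacement capacity has to be provisioned before the node disappears.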

Given all this, it’s understandable why many companies hesitate to embrace spot instance capacity. Opting for savings plans, reserved instances and cloud providers’ other commitment-based discount programs seems much more straightforward in terms of planning and utilization. Yet, in taking this route, customers overlook the most substantial savings opportunities the cloud has to offer, coupled with absolute flexibility.

The Reality: Measurable and Manageable Risks

What if perceived instability could be quantified and, therefore, effectively managed through automation? This is the premise behind our latest innovation: a global heat map that provides clear insights into spot instance availability and reliability across different regions and availability zones. Once launched, the heat map will track metrics such as spot interruption rate and insufficient capacity errors (ICE), offering a tangible way to assess the risk associated with using spot instances in specific locations.
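To make the two metrics concrete, here is a minimal sketch of how they might be computed per availability zone. It does not describe how the CAST AI heat map is built. The event format and the function names `zone_risk` and `rank_zones` are hypothetical, and the definitions are assumptions: ICE rate is taken as ICE errors divided by capacity requests, and interruption rate as interruptions divided by instances launched.

```python
from collections import defaultdict


def zone_risk(events):
    """Aggregate per-zone spot metrics from (zone, kind) event tuples.

    kind is one of "request", "ice", "launch" or "interruption"
    (an assumed record format, for illustration only).
    """
    counts = defaultdict(lambda: defaultdict(int))
    for zone, kind in events:
        counts[zone][kind] += 1

    report = {}
    for zone, c in counts.items():
        requests, launches = c["request"], c["launch"]
        report[zone] = {
            # Share of capacity requests rejected for insufficient capacity.
            "ice_rate": c["ice"] / requests if requests else 0.0,
            # Share of launched instances that were later reclaimed.
            "interruption_rate": c["interruption"] / launches if launches else 0.0,
        }
    return report


def rank_zones(report):
    """Order zones from lowest to highest interruption rate, then ICE rate."""
    return sorted(
        report,
        key=lambda z: (report[z]["interruption_rate"], report[z]["ice_rate"]),
    )
```

Ranking zones this way gives automation a simple, numeric basis for preferring one location over another, which is the kind of signal a risk map is meant to surface.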

Embracing Automation

The key to unlocking the full potential of spot instances lies in automation. The dynamic nature of spot instance pricing, availability and stability requires a proactive approach, where adjustments to workloads are made in real time based on current market conditions. This includes not just choosing the most cost-effective instances, but also preparing for and responding to interruptions without manual intervention. Automation can ensure that workloads are seamlessly transferred to new instances, eliminating downtime and maintaining performance.
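As a rough illustration of choosing the most cost-effective capacity while staying inside a risk budget, the sketch below picks the cheapest available offer whose interruption rate is acceptable and falls back to on-demand when no spot offer qualifies. The `Offer` class, its fields and `choose_offer` are hypothetical, and any prices passed in are illustrative rather than real quotes. A production system would re-evaluate this decision continuously as market conditions change.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    instance_type: str
    purchase: str              # "spot" or "on-demand"
    hourly_price: float        # illustrative value, not a real quote
    interruption_rate: float   # modeled as 0.0 for on-demand
    available: bool = True


def choose_offer(offers, max_interruption_rate):
    """Pick the cheapest available offer within the risk budget.

    Because on-demand offers carry an interruption rate of 0.0 in this
    model, they act as the automatic fallback when spot is too risky
    or out of capacity.
    """
    eligible = [
        o for o in offers
        if o.available and o.interruption_rate <= max_interruption_rate
    ]
    if not eligible:
        raise LookupError("no capacity satisfies the constraints")
    return min(eligible, key=lambda o: o.hourly_price)
```

Running the same function again whenever an offer's availability or interruption rate changes is one simple way to model adjustments made without manual intervention.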

The Strategic Advantage

We hope that our heat map provides organizations with some insights into risk management across all cloud regions and availability zones. Observability and risk assessments are not enough, however. With automated management tools, businesses can confidently incorporate spot instances into their cloud infrastructure. This not only leads to substantial cost savings but also empowers organizations to make data-driven decisions about their cloud resources. The fear of instability becomes a manageable risk, overshadowed by the benefits of optimized spending and enhanced efficiency.

To learn more about Kubernetes and the cloud native ecosystem, join us at KubeCon + CloudNativeCon Europe in Paris, from March 19-22, 2024.

Leon Kuperman is co-founder and CTO at CAST AI. Formerly vice president of Security Products OCI at Oracle, Leon has 20+ years of experience spanning companies such as IBM, Truition and HostedPCI. He founded and served as the CTO of...