How We Built a VectorDB-Powered Cloud Service in 6 Months

Learn from architectural design decisions made when bringing open source vector database technology to the cloud.
Mar 1st, 2024 8:39am by
Featured image by stefan moertl on Unsplash.

In May 2022, Milvus 2.0, the latest version of our open source vector database, was stabilizing after several significant iterations. At the same time, our users were asking loudly for a stable, commercially hosted version of the platform. For Zilliz, the company behind Milvus, the stars seemed to align: we had a seasoned team of engineers, a product on the cusp of maturity and a user base eager for a managed offering. Fueled by this momentum, we set our sights on an ambitious objective: to launch our cloud service, Zilliz Cloud, in just six months.

Amid a landscape propelled by the exponential growth of large language models (LLMs), we began creating a comprehensive cloud service from the ground up. We gained numerous insights during the 18-month journey of building Zilliz Cloud, a fully managed vector-search service driven by the open source Milvus database. I’ll discuss the design decisions and invaluable lessons learned on this journey in this two-part series.

The Milvus architecture

Step 1: Evaluate Existing Capabilities

When we started the project, we conducted an assessment of existing Milvus capabilities and our overarching objectives.

Evaluate Core Technologies

Our base technology, Milvus, is a cloud native, open source vector database built with storage-computing disaggregation and a microservices framework. This design allowed it to integrate cleanly into Kubernetes clusters and adapt rapidly to diverse cloud production environments.

Assess Deployment Flexibility

We were confident that by leveraging the Kubernetes Operator, Milvus had excellent service deployment capabilities across major public cloud platforms such as Amazon Web Services (AWS), Google Cloud Platform (GCP) and Microsoft Azure. This versatility was corroborated by numerous team members who had successfully implemented production services on these platforms at previous companies, underscoring the Milvus platform’s scalability and compatibility.

Enhance Observability

While Milvus had basic observability features such as monitoring and logging, we recognized we needed to bolster alerting functionalities tailored for production environments. This enhancement was necessary to supply our internal teams and users with real-time insights and proactive measures to ensure uninterrupted service delivery.
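
To make the gap between monitoring and alerting concrete, here is a minimal Python sketch of a threshold alert rule. It is not the Zilliz Cloud implementation; the `AlertRule` class, the `should_fire` function and the metric name are hypothetical names chosen for illustration. The idea is that a raw metric only becomes an alert once it stays above a threshold for several consecutive samples, which avoids paging on a single spike.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class AlertRule:
    """Hypothetical rule: fire when a metric exceeds a threshold several samples in a row."""
    metric: str
    threshold: float
    consecutive_samples: int


def should_fire(rule: AlertRule, samples: Sequence[float]) -> bool:
    """Return True if the most recent `consecutive_samples` values all exceed the threshold."""
    if rule.consecutive_samples < 1:
        raise ValueError("consecutive_samples must be at least 1")
    if len(samples) < rule.consecutive_samples:
        return False  # not enough data to decide yet
    recent = samples[-rule.consecutive_samples:]
    return all(value > rule.threshold for value in recent)


# Illustrative rule: p99 query latency above 500 ms for three samples in a row.
rule = AlertRule(metric="query_latency_p99_ms", threshold=500.0, consecutive_samples=3)
sustained = should_fire(rule, [120.0, 640.0, 710.0, 655.0])  # last three all breach
one_spike = should_fire(rule, [640.0, 710.0, 120.0, 655.0])  # breach is interrupted
```

In a real deployment a rule like this would typically live in the monitoring stack and hand off to a paging service rather than run as application code.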

Address Service Completeness

Despite significant strides with Milvus, Zilliz Cloud was still nascent, lacking critical components essential in a managed service. These components include user login authentication, metering and billing systems, payment mechanisms, networking infrastructure, security protocols, a comprehensive web console, user-facing API support, resource scheduling capabilities and workflow management tools. Addressing these was central to fortifying our service’s efficacy and appeal to a broader user base.
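
As one illustration of what a missing component involves, the Python sketch below shows the core of a metering step: rolling raw usage records up into a per-account charge. Everything here is an assumption made for the example, including the `UsageRecord` type, the notion of a "compute unit hour" and the price; it does not describe how Zilliz Cloud actually meters or prices usage.

```python
from collections import defaultdict
from dataclasses import dataclass
from decimal import Decimal
from typing import Dict, Iterable


@dataclass(frozen=True)
class UsageRecord:
    """Hypothetical metering event emitted by a running cluster."""
    account_id: str
    compute_unit_hours: Decimal


def compute_charges(
    records: Iterable[UsageRecord], price_per_unit_hour: Decimal
) -> Dict[str, Decimal]:
    """Sum usage per account and convert it to a charge rounded to cents."""
    totals: Dict[str, Decimal] = defaultdict(Decimal)  # Decimal() starts at 0
    for record in records:
        totals[record.account_id] += record.compute_unit_hours
    return {
        account: (hours * price_per_unit_hour).quantize(Decimal("0.01"))
        for account, hours in totals.items()
    }


records = [
    UsageRecord("acct-a", Decimal("1.5")),
    UsageRecord("acct-b", Decimal("10")),
    UsageRecord("acct-a", Decimal("2.25")),
]
charges = compute_charges(records, price_per_unit_hour=Decimal("0.20"))
```

Using `Decimal` rather than floating point keeps the arithmetic exact, which matters once the totals feed an invoice.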

Step 2: Establish Design Principles

We definitely had our work cut out for us! Now that we had determined what was required for a minimum viable product (MVP), the next step was to maximize our team’s efficiency in developing the Zilliz Cloud MVP within a six-month timeframe. Through some introspection and analysis, we distilled a set of foundational design principles to guide our development efforts.

Use Mature Third-Party Products Whenever Possible

To prepare for market entry, we relied on established cloud and third-party services. AWS’s core offerings, including Elastic Kubernetes Service (EKS), Elastic Compute Cloud (EC2), Simple Storage Service (S3), Elastic Block Store (EBS) and Application Load Balancer (ALB), alongside AWS-managed Kafka and Relational Database Service (RDS), formed the basis of our infrastructure. This approach met our immediate needs, avoided “reinventing the wheel” and paved a cost-effective path for potential adaptation to a multicloud environment, expediting our innovation pace.

Unfortunately, we ran into compatibility challenges between GCP/Azure messaging queues and managed Kafka services, which led us to develop a distributed log system using Apache BookKeeper. The absence of reliable, open source, cloud native distributed logging solutions spurred this initiative, and we are considering open sourcing this solution to assist others building cloud services.

Third-party Software-as-a-Service (SaaS) providers also played a pivotal role in accelerating our platform’s development. For instance, we adopted Stripe for payment processing, addressing metering and taxation requirements. To streamline connections with multicloud marketplaces, we integrated Suger.io. Additionally, we assessed billing-service platforms like Orb and Metronome to optimize our billing operations. Auth0 served as our preferred choice for account management and login functionality, with expanded support for Google login. Establishing our operational alerting system on PagerDuty, chosen for its seamless integration with existing monitoring tools and customizable notification rules, further aided our operational efficiency.

Avoid Multiplying Entities Unnecessarily

We embraced a minimalist design ethos that permeated various facets of our product:

  • Architecture simplicity: Initially, our design had over 60 microservices, which posed significant challenges in development and testing. To simplify our architecture, we pared the list down to fewer than 10 core microservices, including user billing, resources, metadata and scheduling. This reduction clarified dependencies and alleviated the testing burden.
  • Functional simplicity: The initial iteration of Zilliz Cloud prioritized core user functionalities such as registration, cluster deployment and billing. Less urgent features like scaling and backups were deliberately deferred to lighten the workload. We did commit to establishing a robust feedback loop, initially through email-based feedback, and later augmenting it with Zendesk integration so prompt and high-quality feedback could guide further improvements.
  • Design simplicity: Our cloud-service design prioritized efficient communication and user engagement potential, necessitating a disciplined and focused approach. Leveraging rapid A/B testing enabled us to swiftly validate features and adapt based on user engagement metrics.

Anticipate Day 2 Challenges from Day 1

For cloud services, we had to evolve swiftly without compromising the reliability of user interfaces and services. Easier said than done! This is like “swapping out jet engines in midair.” Externally, the service appears seamless, while internally a vigorous cycle of innovation and enhancement is unfolding. We learned quickly that embracing an end-in-mind development approach is key in navigating this complex terrain.

Step 3: Develop the Architecture

Driven by our design principles, we successfully reached the milestone of launching our commercial vector-search product within the six-month timeframe, simultaneously securing our initial group of seed customers. Here is the architectural diagram illustrating the framework of our inaugural release.

Vector search architecture

We were able to build quite a robust solution with the following capabilities.

Multicloud support: While we initially centered on AWS, our commitment to cloud agnosticism led us to evaluate compatibility across public cloud providers including GCP and Alibaba Cloud. By leveraging customizations to the open source Crossplane project, we developed a cloud adapter layer to streamline multicloud support and reduce associated costs. This approach facilitated rapid integration with GCP within just one month and paved the way for seamless integration with other public cloud providers.
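
The adapter idea can be sketched in a few lines of Python. This is not Crossplane's API and not the Zilliz Cloud code; `CloudAdapter`, `AwsAdapter`, `GcpAdapter` and `plan_cluster` are hypothetical names used to show the pattern. The control plane talks to one interface, and each provider supplies its own mapping to managed Kubernetes and object storage, so adding a provider means adding one class rather than touching every caller.

```python
from abc import ABC, abstractmethod
from typing import Dict, Type


class CloudAdapter(ABC):
    """Hypothetical provider-neutral interface used by the control plane."""

    @abstractmethod
    def kubernetes_service(self) -> str:
        """Name of the provider's managed Kubernetes offering."""

    @abstractmethod
    def object_storage_uri(self, bucket: str) -> str:
        """Provider-specific URI for an object storage bucket."""


class AwsAdapter(CloudAdapter):
    def kubernetes_service(self) -> str:
        return "EKS"

    def object_storage_uri(self, bucket: str) -> str:
        return f"s3://{bucket}"


class GcpAdapter(CloudAdapter):
    def kubernetes_service(self) -> str:
        return "GKE"

    def object_storage_uri(self, bucket: str) -> str:
        return f"gs://{bucket}"


# Registry of supported providers; a new cloud is one more entry here.
ADAPTERS: Dict[str, Type[CloudAdapter]] = {"aws": AwsAdapter, "gcp": GcpAdapter}


def plan_cluster(provider: str, cluster_name: str) -> Dict[str, str]:
    """Describe the resources a cluster needs without provider-specific branching."""
    if provider not in ADAPTERS:
        raise ValueError(f"unsupported provider: {provider}")
    adapter = ADAPTERS[provider]()
    return {
        "cluster": cluster_name,
        "kubernetes": adapter.kubernetes_service(),
        "storage": adapter.object_storage_uri(f"{cluster_name}-data"),
    }


aws_plan = plan_cluster("aws", "demo")
gcp_plan = plan_cluster("gcp", "demo")
```

The calling code is identical for both providers; only the registry lookup differs.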

Security: Zilliz Cloud places a high priority on data security. We adhere strictly to cloud identity and access management (IAM) standards, control data access permissions and encrypt all data, whether in transit or at rest. With an emphasis on network isolation and performance, we opted for AWS’s EKS network add-ons for their efficiency and ease of use. By delineating interaction boundaries between the data and control layers, we’ve realized significant cost savings during the rollout of our “bring your own cloud” (BYOC) product.

Resource pooling: Zilliz Cloud adheres to the “law of cloud commutativity,” prioritizing elastic scalability through resource pooling. By decoupling storage and computation and employing dynamic load balancing, we make efficient use of cloud resources. This approach lets us reserve resources only when necessary, significantly enhancing the utilization of Spot Instances and Lambda functions while driving down costs.
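
A toy Python model helps show why pooling raises utilization. The `ComputePool` class below is a hypothetical sketch, not the Zilliz Cloud scheduler: it assumes compute nodes are interchangeable (plausible when storage is decoupled from compute) and that clusters lease nodes only while they need them, returning them to the shared pool afterward.

```python
from typing import Dict


class ComputePool:
    """Hypothetical shared pool of interchangeable compute nodes."""

    def __init__(self, capacity: int) -> None:
        if capacity < 0:
            raise ValueError("capacity must be non-negative")
        self.capacity = capacity
        self._leases: Dict[str, int] = {}  # cluster id -> nodes currently held

    @property
    def in_use(self) -> int:
        return sum(self._leases.values())

    def acquire(self, cluster_id: str, nodes: int) -> bool:
        """Lease nodes to a cluster; return False if the pool cannot cover the request."""
        if nodes <= 0:
            raise ValueError("nodes must be positive")
        if self.in_use + nodes > self.capacity:
            return False
        self._leases[cluster_id] = self._leases.get(cluster_id, 0) + nodes
        return True

    def release(self, cluster_id: str) -> int:
        """Return all of a cluster's nodes to the pool and report how many were freed."""
        return self._leases.pop(cluster_id, 0)

    def utilization(self) -> float:
        return self.in_use / self.capacity if self.capacity else 0.0


pool = ComputePool(capacity=10)
pool.acquire("cluster-a", 6)             # granted
rejected = pool.acquire("cluster-b", 6)  # refused: only 4 nodes are free
pool.release("cluster-a")                # cluster-a goes idle and frees its nodes
granted = pool.acquire("cluster-b", 6)   # the same nodes now serve cluster-b
```

Because idle capacity flows back to the pool, the same ten nodes serve both clusters over time instead of each cluster reserving its own peak.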

Operations friendliness: Zilliz Cloud is designed with developers and operational staff in mind. Featuring a comprehensive graphical user interface (GUI) and advanced monitoring capabilities, the platform offers triple availability-zone disaster recovery and adheres to strict service-level agreements (SLAs), helping to ensure stability and reliability for production environments.

Lessons Learned

I am proud of this architecture and all the work we did as a team. However, despite our success in getting to market in the six-month period we committed to, there were a number of things we didn’t anticipate. I will review the lessons learned in the second part of this series.
