A good read from Programmer magazine, issue 05: "Lessons from Designing and Deploying Large-Scale Services"

This content is licensed under the CC BY-SA 4.0 license.

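This article introduces lessons on designing and deploying large-scale services, built on three basic tenets: expect failures, keep things simple, and automate everything. It also covers best practices for overall application design, automatic management and provisioning, dependency management, and related areas.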

Today is 2008.5.19, exactly one week after 5.12. A moment of silence and prayer for the victims of the Wenchuan earthquake.

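I received issue 05 of Programmer more than a week ago, but the week that just passed felt both very long and very short. Yesterday afternoon, while watching CCTV's live coverage, I skimmed through the magazine and thought the article "Lessons from Designing and Deploying Large-Scale Services" was a good one. This issue carries only the first half; if you are impatient, you can download the English original from the addresses below.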

The original author's homepage:
http://www.mvdirona.com/jrh/Default.htm
English original:
http://www.mvdirona.com/jrh/TalksAndPapers/JamesRH_Lisa.pdf

Outline:
Introduction
    Three simple tenets:
    1. Expect failures
    2. Keep things simple
    3. Automate everything

Recommendations

    1. Overall Application Design
        *Design for failure (core concept)
        *Implement redundancy and fault recovery (modeling)
        *Commodity hardware slice (elephant vs mouse)
        *Support single-version software
        *Implement multi-tenancy

        More specific best practices
        *Quick service health check
        *Develop in the full environment
        *Zero trust of underlying components
        *Do not build the same functionality in multiple components
        *One pod or cluster should not affect another pod or cluster
        *Allow (rare) emergency human intervention (use scripts)
        *Keep things simple and robust
        *Enforce admission control at all levels
            The general rule is to attempt to gracefully degrade rather than hard failing and to block entry to the service before giving uniform poor service to all users.
        *Partition the service (look-up table at the mid-tier)
        *Understand the network design
        *Analyze throughput and latency
        *Treat operations utilities as part of the service
        *Understand access patterns
        *Version everything
        *Keep the unit/functional tests from the last release
        *Avoid single points of failure (stateless)
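
The "Partition the service (look-up table at the mid-tier)" item can be illustrated with a minimal sketch. It assumes a hypothetical PartitionMap class (not from the paper) that maps each key to a partition through an explicit table, so a single key can be moved between partitions without touching any other key. The least-loaded placement rule is also an assumption chosen for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PartitionMap:
    """Hypothetical mid-tier look-up table: key -> partition name."""
    partitions: List[str]
    table: Dict[str, str] = field(default_factory=dict)

    def lookup(self, key: str) -> str:
        # Unseen keys go to the least-loaded partition; the choice is remembered.
        if key not in self.table:
            loads = {p: 0 for p in self.partitions}
            for p in self.table.values():
                loads[p] += 1
            self.table[key] = min(self.partitions, key=lambda p: loads[p])
        return self.table[key]

    def move(self, key: str, target: str) -> None:
        # Rebalancing is a single table update; no other key is affected.
        if target not in self.partitions:
            raise ValueError(f"unknown partition: {target}")
        self.table[key] = target
```

Compared with hashing the key directly to a partition, the explicit table costs one extra lookup but lets operators relocate a hot user or drain a partition one key at a time.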

    2. Automatic Management and Provisioning
        *Be restartable and redundant
        *Support geo-distribution
        *Automatic provisioning and installation
        *Configuration and code as a unit
        *Audit any configuration change that must be made in production
        *Manage server roles or personalities rather than servers
        *Multi-system failures are common
        *Recover at the service level
        *Never rely on local storage for non-recoverable information
        *Keep deployment simple
        *Fail services regularly

    3. Dependency Management
        *Expect latency
        *Isolate failures (fail fast)
        *Use shipping and proven components
        *Implement inter-service monitoring and alerting
        *Dependent services require the same design point
        *Decouple components
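
One common way to approach "Isolate failures (fail fast)" is a circuit breaker around calls to a dependency. The sketch below is an illustration under stated assumptions, not the paper's design: CircuitBreaker, CircuitOpenError, max_failures, and reset_after are all hypothetical names. After a run of consecutive failures it rejects calls immediately for a cool-down period instead of letting callers queue up behind a sick dependency.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected without contacting the dependency."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds
        self.clock = clock                # injectable for testing
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("dependency marked unhealthy; failing fast")
            # Cool-down elapsed: let one trial call through. A single
            # failure of that trial reopens the circuit.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result
```

A caller would wrap each remote call as `breaker.call(fetch, request)` and treat CircuitOpenError as a signal to serve a degraded response.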

    4. Release Cycle and Testing
        The following rules must be followed:
        i. the production system has to have sufficient redundancy that, in the event of catastrophic new service failure, state can be quickly recovered,
        ii. data corruption or state-related failures have to be extremely unlikely (functional testing must first be passing),
        iii. errors must be detected and the engineering team (rather than operations) must be monitoring system health of the code in test, and
        iv. it must be possible to quickly roll back all changes and this roll back must be tested before going into production.

        *Ship often
        *Use production data to find problems
            . Measurable release criteria
            . Tune goals in real time
            . Always collect the actual numbers
            . Minimize false positives
            . Analyze trends
            . Make the system health highly visible
            . Monitor continuously
        *Invest in engineering
        *Support version roll-back
        *Maintain forward and backward compatibility
        *Single-server deployment
        *Stress test for load
        *Perform capacity and performance testing prior to new releases
        *Build and deploy shallowly and iteratively
        *Test with real data
        *Run system-level acceptance tests
        *Test and develop in full environments
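
"Maintain forward and backward compatibility" can be sketched with a tolerant message reader. The example assumes a hypothetical JSON user-profile message in which a `locale` field was added in version 2; the field names and the default value are illustrative only. Old messages that lack the field still decode (backward compatibility), and unknown fields from newer writers are ignored rather than rejected (forward compatibility).

```python
import json


def decode_profile(raw: str) -> dict:
    """Decode a hypothetical profile message written by any version."""
    msg = json.loads(raw)
    return {
        # Version 1 writers did not send a version number.
        "version": msg.get("version", 1),
        "name": msg["name"],
        # Added in version 2; older writers omit it, so supply a default.
        "locale": msg.get("locale", "en-US"),
        # Any other keys are deliberately ignored.
    }
```

Because both directions are tolerated, old and new servers can run side by side during a roll-out, and a roll-back does not strand messages already written by the newer version.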

    5. Hardware Selection and Standardization
        *Use only standard SKUs
        *Purchase full racks
        *Write to a hardware abstraction
        *Abstract the network and naming

    6. Operations and Capacity Planning
        *Make the development team responsible. (you built it, you manage it.)
        *Soft delete only
        *Track resource allocation
        *Make one change at a time
        *Make everything configurable
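
A minimal sketch of "Soft delete only", assuming a hypothetical in-memory SoftDeleteStore (a real service would keep the marker in its database). Deleting a row only stamps it with a deletion time, so an operator mistake or a buggy client can be undone without restoring from backup.

```python
import time


class SoftDeleteStore:
    def __init__(self, clock=time.time):
        self._rows = {}
        self._clock = clock  # injectable for testing

    def put(self, key, value):
        self._rows[key] = {"value": value, "deleted_at": None}

    def get(self, key):
        row = self._rows.get(key)
        # Soft-deleted rows look absent to normal readers.
        if row is None or row["deleted_at"] is not None:
            raise KeyError(key)
        return row["value"]

    def delete(self, key):
        # Mark the row instead of removing it, so the delete can be undone.
        row = self._rows.get(key)
        if row is None or row["deleted_at"] is not None:
            raise KeyError(key)
        row["deleted_at"] = self._clock()

    def undelete(self, key):
        row = self._rows.get(key)
        if row is None:
            raise KeyError(key)
        row["deleted_at"] = None
```

Physically purging rows whose `deleted_at` is older than some retention window would then be a separate, deliberate background job.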

    7. Auditing, Monitoring and Alerting
        *Instrument everything
        *Data is the most valuable asset
        *Have a customer view of service
        *Instrumentation required for production testing
        *Latencies are the toughest problem
        *Have sufficient production data
            . Use performance counters for all operations
            . Audit all operations
            . Track all fault tolerance mechanisms
            . Track operations against important entities
            . Asserts
            . Keep historical data
        *Configurable logging
        *Expose health information for monitoring
        *Make all reported errors actionable
        *Enable quick diagnosis of production problems
            . Give enough information to diagnose
            . Chain of evidence
            . Debugging in production
            . Record all significant actions
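
The "Use performance counters for all operations" item can be approximated with a decorator that records call counts, error counts, and cumulative latency per operation. This is a sketch under assumptions: the COUNTERS dictionary and the `instrumented` decorator are hypothetical names, and a production service would export these values to its monitoring system rather than keep them in a module-level dictionary.

```python
import time
from collections import defaultdict
from functools import wraps

# Per-operation counters: calls, errors, and total elapsed seconds.
COUNTERS = defaultdict(lambda: {"calls": 0, "errors": 0, "seconds": 0.0})


def instrumented(name):
    """Wrap a function so every call updates COUNTERS[name]."""
    def decorate(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            except Exception:
                COUNTERS[name]["errors"] += 1
                raise
            finally:
                # Runs for both successful and failed calls.
                COUNTERS[name]["calls"] += 1
                COUNTERS[name]["seconds"] += time.perf_counter() - start
        return wrapper
    return decorate
```

Applying `@instrumented("lookup_user")` to a handler is then enough to get its call rate, failure rate, and average latency (seconds divided by calls).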

    8. Graceful Degradation and Admission Control
        *Support a "big red switch"
        *Control admission
        *Meter admission
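
The three items in this section can be combined in one small sketch. It assumes a hypothetical AdmissionController with a concurrency limit (control admission), a `shed_non_critical` flag standing in for the "big red switch", and a `meter_up` method that raises the limit in small steps after an incident (meter admission). The names and the step-based policy are illustrative assumptions, not taken from the paper.

```python
class AdmissionController:
    def __init__(self, capacity):
        self.capacity = capacity        # max concurrent requests admitted
        self.in_flight = 0
        self.shed_non_critical = False  # the "big red switch"

    def try_admit(self, critical=False):
        # With the switch on, only critical work gets in.
        if self.shed_non_critical and not critical:
            return False
        # Reject at the door rather than degrade every admitted request.
        if self.in_flight >= self.capacity:
            return False
        self.in_flight += 1
        return True

    def release(self):
        if self.in_flight > 0:
            self.in_flight -= 1

    def meter_up(self, step, ceiling):
        # Re-open gradually after recovery instead of admitting everyone at once.
        self.capacity = min(ceiling, self.capacity + step)
```

A front end would call `try_admit()` before doing any work, return a "busy, try later" response when it is refused, and call `release()` when the request completes.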

    9. Customer and Press Communication Plan

    10. Customer Self-Provisioning and Self-Help

Conclusion