Netflix 的异地多活设计: Active-Active for Multi-Regional Resiliency

说明

《Active-Active for Multi-Regional Resiliency》是 2013 年的资料。从时间上来看资料比较老，那时候应该处于 IaaS 形式的云计算普及阶段，kubernetes、云原生等概念还在孕育中，从技术理念来说不过时，异地多活需要考虑的问题还是那么多，没有随着时间发展而消失。

这里是我的阅读摘要。

Target

A failure of any kind in one region should not affect services running in another.

A networking partitioning event should not affect quality of services in either region.

Implement Target

Services must be stateless, all data/state replication needs to handled in data tier
Must access resources locally in-Region
There should not be any cross-regional calls on user’s call path
Data replication should be asynchronous

简单来说就是在每个区域都部署一整套服务，每个区域的系统能够独立运行，服务都是无状态的，数据层在不同区域之间的进行数据同步。

Technical Challenges

Effective tooling for directing traffic to the correct Region：

DNS
AWS Route53

Traffic shaping and load shedding, to handling thundering herd event：

zuul：服务框架支持熔断、限流
- Ability to identify and handle mis-routed requests. (user request is defined as mis-routed if it does not conform to our geo directional records)
- Ability to declare a region in failover mode
- Ability to define a maximum traffic level at any point in time, so that any additional requests will be automatically shed, in order to protect downstream services against a thundering herd of requests.

State/Data asynchronous cross-regional replication：

Cassandra：使用 cassandra 的 multi-region 模式，生产负载下实测500ms完成数据同步
- Wrote 1 million records in one region of a multi-region cluster, 500ms later, initiated a reading of the records that were just written in the initial region in the other region, while keeping a production level of load on the cluster.
EvCache：一个 memcache client，Netflix没有用缓存的多主部署而是用远程失效的方式处理，远程 region 中的缓存被清除后，在后续请求中触发重新加载
- Whenever there is a write in one region, EvCache client will send a message to another region(via SQS) to invalidate the corresponding entry
后面在 split-brain 中提到即使数据同步过程阻塞了，也要正常服务：
- We were looking to demonstrate that services in each Region continued to function normally, even though some of the data replication was getting queued up. Over the course of the Active-Active project we ran Split-brain exercise many times, and found and fixed many issues

此外，还实现了支持多区域的自动部署工具方便系统数据，设计了各种 Monkeys 检验系统的可靠性。

Monkeys：

Chaos Gorilla：taks out a whole Availability Zone
Split-brain：servered the connectivity between Regions
Chaos Kong：模拟 Region 失败进行流量迁移，为了避免损害用户体验，这个 monkey 执行的时候流量迁移会更平滑、同时不会向用户返回错误。

参考

编程基础系统基础关注跟踪

Kubernetes Prometheus 超级账本

系统设计

Netflix 的异地多活设计: Active-Active for Multi-Regional Resiliency

目录

说明

Target

Implement Target

Technical Challenges

参考

推荐阅读