It’s always fun to deal with “chaos monkeys”, a barrel full of fun indeed. A lucky bunch of team members at Scality had their dose of monkey fun at the Paris Kubernetes Meetup. While presenting MetalK8s to the group, Julien Girardin, R&D Engineer, and Laure Vergeron, Zenko Technology Evangelist, mused over Zenko’s architecture and its resilient design. The meetup was a joint event with the serverless, Docker, and chaos engineering communities, and the keynote speaker came from the chaos engineering community.
You get chaos in your systems when there is an urgent problem, with assumed causes and consequences, but the link between them is unclear and so is the path to resolution. The ability of a system to cope with chaos is called resilience. Resilience can be defined as a system’s ability to absorb perturbation and keep functioning, even if in a less efficient fashion.
Chaos Engineering is the practice of injecting perturbations into a production system and studying its resilience. It’s a scientific approach and it’s not testing! Testing should be done at unit, functional, and integration levels before anything is put in production. On the other hand, Chaos Engineering lets you evaluate your system’s resilience in a real production setting, with all its functional partners available, at all levels of your architecture and deployment; it also lets you evaluate the resilience of the people in your teams, which is as important to a service’s survival as the system itself.
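To make the idea concrete, here is a toy sketch of a chaos experiment: check the steady state, inject a perturbation, then observe whether the service still answers. Everything in it (`Service`, `kill_random_replicas`, `run_experiment`) is a hypothetical in-process simulation made up for illustration; it is not a real chaos tool, and a real experiment would run against production rather than a Python list.

```python
import random


class Service:
    """Hypothetical service backed by a fixed number of replicas."""

    def __init__(self, replicas: int):
        self.alive = [True] * replicas

    def handle_request(self) -> bool:
        # The service answers as long as at least one replica is alive.
        return any(self.alive)


def kill_random_replicas(service: Service, count: int, rng: random.Random) -> list:
    """Perturbation: mark `count` randomly chosen replicas as dead."""
    victims = rng.sample(range(len(service.alive)), count)
    for index in victims:
        service.alive[index] = False
    return victims


def run_experiment(replicas: int, kills: int, seed: int = 0) -> bool:
    """Return True if the service still answers after the perturbation."""
    rng = random.Random(seed)
    service = Service(replicas)
    # Step 1: verify the steady state before injecting anything.
    if not service.handle_request():
        raise RuntimeError("service is unhealthy before the experiment")
    # Step 2: inject the perturbation.
    kill_random_replicas(service, kills, rng)
    # Step 3: observe whether the system absorbed it.
    return service.handle_request()
```

Calling `run_experiment(replicas=3, kills=2)` reports that the service survives, while `run_experiment(replicas=3, kills=3)` reports that it does not: the system keeps functioning, possibly less efficiently, until the last replica is gone.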
At Scality, we’re familiar with deployments in environments where failure is not an option: some of our customers are Tier 1 providers, global banks, government agencies, global booking systems… they cannot afford to have their service down at all. Our architects designed the RING and Zenko to be resilient to chaotic situations. While the RING leverages the Chord algorithm, private cloud replication, and shared NFS/object interfaces over a single storage system, Zenko uses different mechanisms.
Zenko is designed with auto-scaling, self-healing, infrastructure checks and data availability in mind:
- Auto-scaling: to handle sudden surges of traffic, which are common at Tier 1 network providers, we chose Kubernetes for its native auto-scaling feature;
- Self-healing: thanks to Kubernetes again, any pod that disappears will immediately be respawned elsewhere; together with microservices replica sets and auto-balancing of services on nodes, Zenko’s deployments are very resilient, even with several nodes gone;
- Infrastructure checks: thanks to MetalK8s, our optimized K8s distribution, Zenko will not be deployed on a shaky network or an insufficient number of nodes, providing automatic protection from Network Chaos Monkey;
- Data availability: the main purpose of Zenko is to give you multi-cloud control, so when one of its providers is down, an application can seamlessly fail over to another using the S3 API; a single namespace over all clouds, with aggregated metadata search and replication workflows across all of them, can, when used wisely, provide unlimited service availability.
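The data availability point can be illustrated with a minimal failover sketch: read an object from the first location that answers. The names below (`InMemoryBackend`, `BackendDown`, `get_with_failover`) are hypothetical stand-ins invented for this example; they are not Zenko's API or a real S3 client, and they assume the object has already been replicated to every location.

```python
from typing import Dict, List, Tuple


class BackendDown(Exception):
    """Raised when a (hypothetical) storage location cannot serve a request."""


class InMemoryBackend:
    """Stand-in for one cloud location; not a real S3 client."""

    def __init__(self, name: str, objects: Dict[str, bytes], healthy: bool = True):
        self.name = name
        self.objects = dict(objects)
        self.healthy = healthy

    def get_object(self, key: str) -> bytes:
        if not self.healthy:
            raise BackendDown(self.name)
        return self.objects[key]


def get_with_failover(backends: List[InMemoryBackend], key: str) -> Tuple[str, bytes]:
    """Try each location in order; return (location name, data) from the first that answers."""
    failed = []
    for backend in backends:
        try:
            return backend.name, backend.get_object(key)
        except BackendDown:
            # This location is down: remember it and move on to the next one.
            failed.append(backend.name)
    raise BackendDown("all locations failed: " + ", ".join(failed))
```

With the first location marked unhealthy and the same key replicated to a second one, `get_with_failover` returns the data from the second location; only when every location is down does the caller see an error.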
All of that allows Zenko to pass the Chaos Engineering test: its components probe the hardware, scale up or down automatically, self-heal, and offer stats on the system’s health via the Orbit management UI. We love the quote attributed to Nora Jones (Senior Software Engineer @Netflix) about Chaos Engineering: “Introducing Chaos is not the best way to meet your new colleagues, though it is the fastest.” Are your systems Chaos-proof? Come tell us in Zenko’s community… (fast)!