Comment on Cloudfare outage post mortem
dan@upvote.au 5 days agoThe actual issue is that they should be canarying changes. Push them to a small percentage of servers, and ensure nothing bad happens before pushing them more broadly. At my workplace, configs are tested on one server, then an entire rack, then an entire cluster, before fully rolling out. The rollout process watches the core logs for things like elevated HTTP 5xx errors.