Having "hot sites" for failover seems to have stopped being a thing for a LOOOT of sites in like 2008.
Cloudflare you'd think would be one that has lots and lots and lots though.
Comment on Vanishing power feeds, UPS batteries, failover fails... Cloudflare explains that two-day outage
draughtcyclist@programming.dev 1 year ago
This is interesting. What I’m hearing is they didn’t have proper anti-affinity rules in place, or backups for mission-critical equipment.
The data center did some dumb stuff, but that shouldn’t matter if you set up your application failover properly. The architecture and the untested failovers are the real issue here.
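For anyone unfamiliar with the term, an anti-affinity rule just says that replicas of the same service must not all land in the same failure domain. A minimal sketch of such a check follows. The service names, data center labels, and the `anti_affinity_violations` helper are all hypothetical, made up for illustration, and say nothing about Cloudflare's actual tooling.

```python
from collections import defaultdict


def anti_affinity_violations(placements):
    """Return the services whose replicas all sit in a single data center.

    `placements` is an iterable of (service, replica_id, datacenter) tuples.
    All names used with this function are hypothetical.
    """
    sites = defaultdict(set)
    for service, _replica_id, datacenter in placements:
        sites[service].add(datacenter)
    # A service spread over fewer than two data centers has no site-level failover.
    return sorted(name for name, dcs in sites.items() if len(dcs) < 2)


# Hypothetical placement data: "logging-core" lives in one data center only.
placements = [
    ("api", 1, "dc-a"),
    ("api", 2, "dc-b"),
    ("logging-core", 1, "dc-a"),
    ("logging-core", 2, "dc-a"),
]
violations = anti_affinity_violations(placements)
print(violations)
```

A check like this only catches placement problems. It says nothing about hidden runtime dependencies between services, which is what the postmortem discussion further down is about.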
Mbourgon@lemmy.world 1 year ago
It also sounded like they had apps that ran mainly, or only, in that data center and that had to be online for everything else to work.
But the 4-minutes-instead-of-10 batteries certainly didn’t help.
towerful@programming.dev 1 year ago
That’s exactly it.
blog.cloudflare.com/post-mortem-on-cloudflare-con…
Here is a quick summary, but the actual postmortem is worth reading.
Classic example of a cascade failure, or domino effect. Luckily their resilience meant it wasn’t a full outage.
Basically, new features get developed fast and are iterated quickly. When they mature, they get integrated into the high availability cluster.
There are also some services that are deliberately not clustered. One of these is logging: when the logging core service is down, logs should simply pile up “at the edge”.
Unfortunately, some services were too tightly coupled to the logging core. So although they were supposed to be highly available, they were unable to cope with the core logging service being down.
Whilst HA failover had been tested, the core services had never been taken offline, so all this was missed.
All of this resulted in inconsistent high availability across different services and products. A lot of new features would have failed as expected, and some mature features that shouldn’t have failed did.
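To illustrate the "logs pile up at the edge" idea from the summary above, here is a rough sketch of a loosely coupled logger. It is not Cloudflare's implementation. `EdgeLogger`, `FakeCore`, and `CoreUnavailable` are hypothetical names, and the assumption is that the core logging client raises an exception when it cannot be reached.

```python
from collections import deque


class CoreUnavailable(Exception):
    """Raised by the (hypothetical) core logging client when the core is down."""


class EdgeLogger:
    """Buffers records locally so callers never depend on the core being up."""

    def __init__(self, send_to_core, max_buffered=10_000):
        self._send = send_to_core
        # Bounded buffer: once full, the oldest records are dropped.
        self._buffer = deque(maxlen=max_buffered)

    def log(self, record):
        """Queue a record and try to deliver; never raises CoreUnavailable."""
        self._buffer.append(record)
        self.flush()

    def flush(self):
        """Deliver buffered records in order; return how many were delivered."""
        delivered = 0
        while self._buffer:
            try:
                self._send(self._buffer[0])
            except CoreUnavailable:
                break  # core is down: keep the backlog and carry on
            self._buffer.popleft()
            delivered += 1
        return delivered

    @property
    def pending(self):
        return len(self._buffer)


class FakeCore:
    """Stand-in for the core logging service, with a switch to take it offline."""

    def __init__(self):
        self.up = True
        self.received = []

    def send(self, record):
        if not self.up:
            raise CoreUnavailable("core logging service is down")
        self.received.append(record)
```

Having a switch like `FakeCore.up` is the point of the exercise: it lets you take the core offline in a test and confirm the service keeps working, which is the scenario the summary says was never exercised.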
When they brought their disaster recovery site up, there were some things that needed manual configuration, and some newer features that hadn’t been tested in a disaster recovery scenario.
They are now focusing significant resources on: