Comment on A single DNS race condition brought AWS to its knees
aeronmelon@lemmy.world 2 days ago
As with most IT troubleshooting,
Time spent applying the fix: 15 minutes.
Time spent identifying the problem, then discovering where it is in the system: 15 hours.
pageflight@piefed.social 2 days ago
There’s a full postmortem from AWS. One piece that stands out to me:
That is, the load that resulted from the initial failure was not something the system was designed to handle, so it caused cascading effects and required manual cleanup.