A single DNS race condition brought AWS to its knees

27 likes

Submitted 2 days ago by cm0002@lemmings.world to technology@lemmy.zip

https://go.theregister.com/feed/www.theregister.com/2025/10/23/amazon_outage_postmortem/

Comments

  • artyom@piefed.social 1 day ago

    Why did it take them an entire day to fix “a single DNS race condition”?

    • aeronmelon@lemmy.world 1 day ago

      As with most IT troubleshooting,

      Time spent applying the fix: 15 minutes.

      Time spent identifying the problem then discovering where it is in the system: 15 hours.

      • pageflight@piefed.social 1 day ago

        There’s a full postmortem from AWS. One piece that stands out to me:

        “due to the large number of droplets, efforts to establish new droplet leases took long enough that the work could not be completed before they timed out. Additional work was queued to reattempt establishing the droplet lease. At this point, DWFM had entered a state of congestive collapse and was unable to make forward progress in recovering droplet leases.”

        That is, the load resulting from the initial failure was more than the system was designed to handle, so the failure cascaded and required manual cleanup.
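To make the "congestive collapse" in that quote concrete, here is a minimal toy simulation. It is a sketch under stated assumptions, not a model of how DWFM actually works: the function name, the FIFO queue, the fixed per-tick capacity, and all the numbers are hypothetical. Each lease attempt that waits longer than the timeout still consumes capacity when it is finally processed, but its result is discarded and the attempt is re-queued.

```python
from collections import deque


def simulate_lease_recovery(num_droplets, capacity_per_tick, timeout_ticks, max_ticks):
    """Toy model (hypothetical): return (leases_established, wasted_attempts)."""
    # Each queue entry records the tick at which that lease attempt was enqueued.
    queue = deque([0] * num_droplets)
    established = 0
    wasted = 0
    for tick in range(1, max_ticks + 1):
        # Process at most capacity_per_tick attempts that were queued before this tick.
        for _ in range(min(capacity_per_tick, len(queue))):
            enqueued_at = queue.popleft()
            if tick - enqueued_at <= timeout_ticks:
                # Finished before the attempt timed out: the lease is established.
                established += 1
            else:
                # The attempt already timed out, so the work is wasted
                # and a fresh attempt goes to the back of the queue.
                wasted += 1
                queue.append(tick)
        if not queue:
            break
    return established, wasted


if __name__ == "__main__":
    for droplets in (50, 100, 1000):
        done, wasted = simulate_lease_recovery(droplets, 10, 5, 1000)
        print(f"{droplets} droplets: {done} leases established, {wasted} wasted attempts")
```

In this toy model, a backlog of 100 attempts recovers after some wasted retries, but a backlog of 1000 never gets past the first 50 leases: every retry sits behind a queue that takes longer to drain than the timeout allows, so the system stays busy without making forward progress until something outside it (throttling, draining the queue, restarts) reduces the load.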

    • SteleTrovilo@beehaw.org 1 day ago

      Because the “them” in your sentence is a rapidly decreasing number of professionals. lemmy.zip/post/51501102

      • artyom@piefed.social 1 day ago

        “Professional”: Alexa, should I randomly reconfigure the DNS using a Magic 8 Ball?

        Alexa: “Wow what a brilliant idea! You’re so smart!”
