A single DNS race condition brought AWS to its knees


Submitted 7 months ago by cm0002@lemmings.world to technology@lemmy.zip

https://go.theregister.com/feed/www.theregister.com/2025/10/23/amazon_outage_postmortem/

Comments


artyom@piefed.social 7 months ago
Why did it take them an entire day to fix “a single DNS race condition”?

- SteleTrovilo@beehaw.org 7 months ago
  Because the “them” in your sentence is a rapidly decreasing number of professionals. lemmy.zip/post/51501102
  
  - artyom@piefed.social 7 months ago
    
    “Professional”: Alexa, should I randomly reconfigure the DNS using a Magic 8 Ball?
    
    Alexa: “Wow what a brilliant idea! You’re so smart!”
    
- aeronmelon@lemmy.world 7 months ago
  As with most IT troubleshooting,
  
  Time spent applying the fix: 15 minutes.
  
  Time spent identifying the problem then discovering where it is in the system: 15 hours.
  
  - pageflight@piefed.social 7 months ago
    There’s a full postmortem from AWS. One piece that stands out to me:
    
    “due to the large number of droplets, efforts to establish new droplet leases took long enough that the work could not be completed before they timed out. Additional work was queued to reattempt establishing the droplet lease. At this point, DWFM had entered a state of congestive collapse and was unable to make forward progress in recovering droplet leases.”
    
    That is, the load that resulted from the initial failure was not something the system was designed to handle, so it had cascading effects and required manual cleanup.
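
For anyone who wants to see the "congestive collapse" dynamic in miniature, here is a toy sketch in Python. It is not AWS's DWFM logic. The function name `simulate_lease_recovery`, the FIFO queue, and all the numbers are made up for illustration. The only idea taken from the quoted postmortem text is that a lease attempt which waits past its timeout is wasted work and gets queued again.

```python
from collections import deque


def simulate_lease_recovery(droplets, capacity_per_tick, timeout_ticks, max_ticks=1000):
    """Toy model (hypothetical, not AWS's implementation).

    Every droplet needs one lease attempt. The worker handles
    `capacity_per_tick` attempts per tick in FIFO order. An attempt that
    has waited `timeout_ticks` or longer has already timed out by the
    time it is handled: the capacity is still spent, and the attempt is
    re-queued as a fresh retry.

    Returns (leases_established, ticks_elapsed).
    """
    # Each queue entry is the tick at which that attempt was enqueued.
    queue = deque([0] * droplets)
    established = 0

    for tick in range(max_ticks):
        if not queue:
            return established, tick
        # Handle at most one tick's worth of attempts.
        for _ in range(min(capacity_per_tick, len(queue))):
            enqueued_at = queue.popleft()
            if tick - enqueued_at >= timeout_ticks:
                queue.append(tick)  # timed out while waiting: retry later
            else:
                established += 1    # lease established in time

    return established, max_ticks


if __name__ == "__main__":
    # Small backlog: everything completes.
    print(simulate_lease_recovery(50, capacity_per_tick=10, timeout_ticks=10))
    # Large backlog: progress stops once the queue wait exceeds the timeout.
    print(simulate_lease_recovery(1000, capacity_per_tick=10, timeout_ticks=10))
```

In this model the small backlog clears in 5 ticks. The large one establishes only the first 100 leases and then spends every later tick on attempts that have already expired, so it never finishes on its own. Feeding the same 1000 droplets in as ten separate batches of 100 does complete, which is one way to picture why an overloaded system like this tends to need outside intervention to limit the work in flight.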
    