Comment on CrowdStrike downtime apparently caused by update that replaced a file with 42kb of zeroes
tiramichu@lemm.ee 5 months ago
If I send you on stage at the Olympic Games opening ceremony with a sealed envelope
And I say “This contains your script, just open it and read it”
And then when you open it, the script is blank
You’re gonna freak out
Except “freak out” could have various manifestations.
In this case it was “burn down the venue”.
It should have been “I’m sorry, there’s been an issue, let’s move on to the next speaker”
Except, since it was antivirus software, the system is basically told “I must be running for you to finish booting”, which does make sense, as it means the antivirus can watch the system before any malicious code can get its hooks into things.
I don’t think the kernel could continue like that. The driver runs in kernel mode and took a null pointer exception. The kernel can’t know how badly it’s been screwed by that, the only feasible option is to BSOD.
The driver itself is where the error handling should take place. First off it ought to have static checks to prove it can’t have trivial memory errors like this. Secondly, if a configuration file fails to load, it should make a determination about whether it’s safe to continue or halt the system to prevent a potential exploit. You know, instead of shitting its pants and letting Windows handle it.
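A minimal sketch of that "make a determination" step, written in Python for readability rather than as real driver code. The names `Action` and `decide`, the `fail_closed` flag, and the rule that an empty or all-zero blob counts as a failed load are all assumptions made for illustration; nothing here reflects how the actual driver works.

```python
from enum import Enum


class Action(Enum):
    USE_NEW = 1        # the new config looks usable
    KEEP_PREVIOUS = 2  # fall back to the last known-good config
    HALT = 3           # stop on purpose, with a clear reason


def decide(raw: bytes, fail_closed: bool) -> Action:
    """Choose what to do with a freshly loaded config blob.

    Hypothetical rule: an empty or all-zero blob is a failed load.
    """
    looks_valid = len(raw) > 0 and any(raw)  # any() is False for all-zero bytes
    if looks_valid:
        return Action.USE_NEW
    # The load failed: make an explicit choice instead of crashing later.
    return Action.HALT if fail_closed else Action.KEEP_PREVIOUS
```

The point is only that halting becomes a deliberate, documented outcome (`fail_closed=True`) rather than the side effect of a bad pointer.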
In this case it was “burn down the venue”.
It was more like “barricade the doors until a SWAT team sniper gets a clear shot at you”.
Hmmmm.
More like standing there and loudly shitting your pants and spreading it around the stage.
You’re right of course and that should be on Microsoft to better implement their driver loading. But yes.
The driver is in kernel mode. If it crashes, the kernel has no idea if any internal structures have been left in an inconsistent state. If it doesn’t halt then it has the potential to cause all sorts of damage.
Computers have social anxiety.
The envelope contains a barrel of diesel and a lit flare
Great layman’s explanation.
Nice analogy, except you’d check the script before you tried to use it. Computers are really good at CRC/hash checking files to verify their integrity, and that’s exactly what a privileged process like antivirus should do with every source of information.
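For illustration, a hash check of this kind could look roughly like the sketch below, using Python's standard `hashlib`. The function name `is_intact` is made up, and it assumes the expected SHA-256 digest arrives separately from a source you trust. A plain hash only catches corruption; catching deliberate tampering would need a signature.

```python
import hashlib
import hmac


def is_intact(data: bytes, expected_sha256_hex: str) -> bool:
    """Return True only if data hashes to the expected SHA-256 digest."""
    actual = hashlib.sha256(data).hexdigest()
    # compare_digest does a constant-time comparison of the two hex strings
    return hmac.compare_digest(actual, expected_sha256_hex.lower())
```

A file of zeroes would fail this check before anything tried to interpret it.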
Maybe. But I’d like to think I’d just say something clever like, “says here that this year the pommel horse will be replaced by yours truly!”
Problem is that software cannot deal with unexpected situations like a human brain can. Computers do exactly what a programmer tells them to do, nothing more, nothing less. So if a situation arises that the programmer hasn’t written code for, then there will be a crash.
Poorly written code can’t.
In this case:
Is just poor code.
When talking about the driver level, you can’t always just proceed to the next thing when an error happens.
Imagine if you went in for open heart surgery but the doctor forgot to put in the new valve while he was in there. He can’t just stitch you up and tell you to get on with it, you’ll be bleeding away inside.
In this specific case we’re talking about security for business devices and critical infrastructure. If a security driver is compromised, in a lot of cases it may legitimately be better for the computer to not run at all, because a security compromise could mean it’s open season for hackers on your sensitive device. We’ve seen hospitals held ransom, we’ve seen customer data swiped from major businesses. A day of downtime is arguably better than those outcomes.
The real answer here is CrowdStrike needs a more reliable CI/CD pipeline. A failure of this magnitude is inexcusable and represents a major systemic failure in their development process. But the OS crashing as a result of that systemic failure may actually be the most reasonable outcome compared to any other possible outcome.
If AV suddenly stops working, it could mean the AV is compromised. A BSOD is a desirable outcome in that case. Booting a compromised system anyway is bad code.
I agree that the code is probably poor but I doubt it was a conscious decision to crash the OS.
The code is probably just:
And 2 fails unexpectedly because the data is garbage and wasn’t checked if it’s valid.
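To show what "checked if it’s valid" could mean, here is a rough sketch of a parser that refuses garbage up front. The file layout (a `CFG1` marker, a little-endian entry count, then the entries) is entirely made up for the example, as are the names `MAGIC`, `InvalidData` and `parse_entries`; it is not the real file format.

```python
import struct

MAGIC = b"CFG1"  # hypothetical 4-byte marker, not a real format


class InvalidData(ValueError):
    """Raised when the input does not match the expected layout."""


def parse_entries(raw: bytes) -> list:
    """Parse MAGIC, a uint32 count, then that many uint32 entries."""
    if len(raw) < 8 or raw[:4] != MAGIC:
        raise InvalidData("missing or wrong header")
    (count,) = struct.unpack_from("<I", raw, 4)
    if len(raw) != 8 + 4 * count:
        raise InvalidData("length does not match declared entry count")
    return list(struct.unpack_from(f"<{count}I", raw, 8))
```

With a check like this, 42 KB of zeroes is rejected at the header test with a clear error, and the caller gets to decide what happens next.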
I’m gonna take from this that we should have AI doing disaster recovery on all deployments. Tech CEOs have been hyping AI up so much, what could possibly go wrong?
What are the chances that CrowdStrike started using AI to do their update deployments, and they just won’t admit it?
I’m nominating this for the “best metaphor of the day” award.
Well done!
The ironic bit is, I’m sure more than a few people at CrowdStrike are preparing 3 envelopes right now.
This guy ELI5s
Ah yes. So Windows is the screaming in terror version and other systems are the “oh, sorry everyone, looks like there’s an error. Let’s just move on to the next bit” version.
Gork@lemm.ee 5 months ago
Ah, makes sense. I guess a driver would completely freak out if that file gave no instructions and was just like “…”
PriorityMotif@lemmy.world 5 months ago
You would think that Microsoft would implement some basic error handling.
planish@sh.itjust.works 5 months ago
That’s what the BSOD is. It tries to bring the system back to a nice safe freshly-booted state where e.g. the fans are running and the GPU is not happily drawing several kilowatts and trying to catch fire.
Kaboom@reddthat.com 5 months ago
For most things, yes. But if someone were to compromise the file, stopping when the file turns out to be invalid is probably a good idea for security.