I’m gonna take from this that we should have AI doing disaster recovery on all deployments. Tech CEOs have been hyping AI up so much, what could possibly go wrong?
Comment on CrowdStrike downtime apparently caused by update that replaced a file with 42kb of zeroes
Imgonnatrythis@sh.itjust.works 3 months ago
Maybe. But I’d like to think I’d just say something clever like, “says here that this year the pummel horse will be replaced by yours truly!”
Hazzia@infosec.pub 3 months ago
Couldbealeotard@lemmy.world 3 months ago
What are the chances that CrowdStrike started using AI to do their update deployments, and they just won’t admit it?
Takios@discuss.tchncs.de 3 months ago
Problem is that software cannot deal with unexpected situations like a human brain can. Computers do exactly what a programmer tells them to do, nothing more, nothing less. So if a situation arises that the programmer hasn’t written code for, then there will be a crash.
deadbeef79000@lemmy.nz 3 months ago
Poorly written code can’t.
In this case:
Is just poor code.
5C5C5C@programming.dev 3 months ago
When talking about the driver level, you can’t always just proceed to the next thing when an error happens.
Imagine if you went in for open heart surgery but the doctor forgot to put in the new valve while he was in there. He can’t just stitch you up and tell you to get on with it, you’ll be bleeding away inside.
In this specific case we’re talking about security for business devices and critical infrastructure. If a security driver is compromised, in a lot of cases it may legitimately be better for the computer to not run at all, because a security compromise could mean it’s open season for hackers on your sensitive device. We’ve seen hospitals held ransom, we’ve seen customer data swiped from major businesses. A day of downtime is arguably better than those outcomes.
The real answer here is that CrowdStrike needs a more reliable CI/CD pipeline. A failure of this magnitude is inexcusable and represents a major systemic failure in their development process. But the OS crashing as a result of that systemic failure may actually be the most reasonable outcome compared to any other possible outcome.
Morphit@feddit.uk 3 months ago
This error isn’t intentionally crashing because of a security risk, though that could happen. It’s a null pointer exception, so evidently no static or runtime checks were in place to prevent it or handle it more gracefully. This was presumably a bug in the driver for a long time, then a faulty config file came along and triggered the crashes. Better static analysis and testing of the kernel driver is one aspect; how these live config updates are deployed and monitored is another.
deadbeef79000@lemmy.nz 3 months ago
In which case this should’ve been documented behaviour and probably configurable.
CeeBee_Eh@lemmy.world 3 months ago
That’s a bad analogy. CrowdStrike’s driver encountering an error isn’t the same as not having disk IO or a memory corruption. If CrowdStrike’s driver didn’t load at all the system could still boot.
It should absolutely be expected that if the CrowdStrike driver itself encounters an error, there should be a process that allows the system to gracefully recover. The issue is that CrowdStrike likely thought of their code as not being able to crash as they likely only ever tested with good configs, and thus never considered a graceful failure of their driver.
ChairmanMeow@programming.dev 3 months ago
If AV suddenly stops working, it could mean the AV is compromised. A BSOD is a desirable outcome in that case. Booting a compromised system anyway is bad code.
CeeBee_Eh@lemmy.world 3 months ago
You know there’s a whole other scenario where the system can simply boot the last known good config.
Takios@discuss.tchncs.de 3 months ago
I agree that the code is probably poor but I doubt it was a conscious decision to crash the OS.
The code is probably just:
And 2 fails unexpectedly because the data is garbage and was never checked for validity.
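To illustrate the missing validity check in general terms, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `CFG1` header, the record layout, and the names `parse_config` and `InvalidConfig` are hypothetical and have nothing to do with CrowdStrike's actual file format or driver code. The point is only that the data is rejected before any field in it is trusted.

```python
import struct

MAGIC = b"CFG1"  # hypothetical 4-byte header, invented for this sketch


class InvalidConfig(Exception):
    """Raised when a config blob fails basic sanity checks."""


def parse_config(blob: bytes) -> list[int]:
    """Parse a toy format: MAGIC, a uint32 record count, then uint32 records."""
    # Reject empty or all-zero input before interpreting any field.
    if not blob or blob.count(0) == len(blob):
        raise InvalidConfig("blob is empty or all zeroes")
    if len(blob) < 8 or blob[:4] != MAGIC:
        raise InvalidConfig("missing or wrong header")
    (count,) = struct.unpack_from("<I", blob, 4)
    expected = 8 + 4 * count
    # The declared record count must match the actual size exactly.
    if len(blob) != expected:
        raise InvalidConfig(f"expected {expected} bytes, got {len(blob)}")
    return list(struct.unpack_from(f"<{count}I", blob, 8))
```

A file of 42 KB of zeroes, like the one described in the post title, fails the first check and raises `InvalidConfig` instead of being dereferenced.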
Morphit@feddit.uk 3 months ago
You can still catch the error at runtime and do something appropriate. That might be to say this update might have been tampered with and refuse to boot, but more likely it’d be to just send an error report back to the developers that an unexpected condition is being hit, and then continue without loading that one faulty definition file.
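A rough Python sketch of that "report and skip" behaviour, under stated assumptions: `load_definitions`, the `parse` callback, and the `report` hook are hypothetical names, and real kernel-mode code would look very different. It only shows the control flow of isolating one bad file from the rest.

```python
import logging

log = logging.getLogger("definitions")


def load_definitions(blobs, parse, report=log.error):
    """Load every definition that parses; report and skip the ones that do not.

    blobs:  mapping of file name -> raw bytes
    parse:  caller-supplied function that raises on invalid input
    report: hook for sending the error somewhere (defaults to the logger)
    """
    loaded, skipped = {}, []
    for name, blob in blobs.items():
        try:
            loaded[name] = parse(blob)
        except Exception as exc:  # broad on purpose: one bad file must not stop the rest
            report("skipping definition %s: %s", name, exc)
            skipped.append(name)
    return loaded, skipped
```

Whether skipping is acceptable or the safer choice is to refuse to boot is the policy question argued elsewhere in this thread; the sketch only covers the skip-and-report branch.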
CeeBee_Eh@lemmy.world 3 months ago
If there’s an error, use last known good config. So many systems do this.
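As a small illustration of the last-known-good pattern, here is a Python sketch. The JSON format, the file paths, and the function name `load_with_fallback` are assumptions chosen to keep the example short; this is not how any particular product implements it.

```python
import json
import shutil
from pathlib import Path


def load_with_fallback(new_path: Path, good_path: Path) -> dict:
    """Try the new config; on any read or parse error, use the last known good one."""
    try:
        config = json.loads(new_path.read_text())
        if not isinstance(config, dict):
            raise ValueError("config must be a JSON object")
    except (OSError, ValueError):
        # json.JSONDecodeError and UnicodeDecodeError are both ValueError subclasses.
        return json.loads(good_path.read_text())
    # The new config loaded cleanly, so promote it to last known good.
    shutil.copyfile(new_path, good_path)
    return config
```

The new file only replaces the known-good copy after it has been parsed successfully, so a corrupt update can never overwrite the fallback.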