This is an unpopular opinion, and I get why – people crave a scapegoat. CrowdStrike undeniably pushed a faulty update that required a low-level fix (booting into recovery). However, this incident lays bare the fragility of corporate IT, particularly for companies entrusted with vast amounts of sensitive personal information.
Robust disaster recovery plans, including automated processes to remotely reboot and remediate thousands of machines, aren’t revolutionary. They’re basic hygiene, especially when considering the potential consequences of a breach. Yet, this incident highlights a systemic failure across many organizations. While CrowdStrike erred, the real culprit is a culture of shortcuts and misplaced priorities within corporate IT.
Too often, companies throw millions at vendor contracts, lured by flashy promises and neglecting the due diligence necessary to ensure those solutions truly fit their needs. This is exacerbated by a corporate culture where CEOs, vice presidents, and managers are often more easily swayed by vendor kickbacks, gifts, and lavish trips than by investing in innovative ideas with measurable outcomes.
This misguided approach not only results in bloated IT budgets but also leaves companies vulnerable to precisely the kind of disruptions caused by the CrowdStrike incident. When decision-makers prioritize personal gain over the long-term health and security of their IT infrastructure, it’s ultimately the customers and their data that suffer.
breakingcups@lemmy.world 3 months ago
Please, enlighten me how you’d remotely service a few thousand Bitlocker-locked machines, that won’t boot far enough to get an internet connection, with non-tech-savvy users behind them. Pray tell what common “basic hygiene” practices would’ve helped, especially with Crowdstrike reportedly ignoring and bypassing the rollout policies set by their customers.
Not saying the rest of your post is wrong, but this stood out as easily glossed over.
ramble81@lemm.ee 3 months ago
You’d have to have something even lower level, like an OOB KVM on every workstation, which would be stupid expensive for the ROI, or something at the UEFI layer that could potentially introduce more security holes.
Leeks@lemmy.world 3 months ago
Maybe they should offer a real-time patcher for the security vulnerabilities in the OOB KVM. I know a great vulnerability database offered by a company that does this for a lot of systems worldwide! /s
circuscritic@lemmy.ca 3 months ago
…you don’t have OOB on every single networked device and terminal? Have you never heard of the buddy system?
timewarp@lemmy.world 3 months ago
UEFI isn’t going away. Sorry to break the news to you.
Brkdncr@lemmy.world 3 months ago
vPro is usually $20 per machine and offers OOB KVM.
LrdThndr@lemmy.world 3 months ago
A decade ago I worked for a regional chain of gyms with locations in 4 states.
I was in TN. When a system would go down in SC or NC, we originally had three options:
I got sick of this. So I researched options and found an open source software solution called FOG. I ran a server in our office and shipped out little OptiPlex 160s running the FOG client software to each club. Then every computer at each club was configured to PXE boot from that local FOG client.
If everything was okay, it would chain the boot to the OS on the machine. But I could flag a machine for reimage, and at next boot the machine would check in over PXE and get a complete reimage from premade images on the FOG server.
So yes, I could completely reimage a computer from hundreds of miles away by clicking a few checkboxes on my computer.
This was free software. It saved us thousands in shipping fees alone.
There ARE options out there.
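To give a rough idea of the check-in logic, here’s an illustrative Python sketch. This is NOT FOG’s actual code or API, just the shape of the decision: if an admin flagged the machine, it gets reimaged; otherwise it chainloads the OS already on disk.

```python
# Illustrative only -- not FOG's real API. At PXE check-in the server looks up the
# machine; if an admin flagged it for reimaging it gets a deploy task, otherwise it
# chainloads the OS already installed on the local disk.
FLAGGED_FOR_REIMAGE = {"aa:bb:cc:dd:ee:01"}  # set by ticking checkboxes in the UI

def boot_action(mac: str) -> str:
    if mac.lower() in FLAGGED_FOR_REIMAGE:
        return "deploy-image"    # pull the premade image from the imaging server
    return "chainload-local"     # hand the boot back to the OS on the local disk

print(boot_action("AA:BB:CC:DD:EE:01"))  # -> deploy-image
print(boot_action("aa:bb:cc:dd:ee:02"))  # -> chainload-local
```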
magikmw@lemm.ee 3 months ago
This works great for stationary PCs and local servers, but does nothing for public-internet-connected laptops in the hands of users.
The only fix here is staggered and tested updates, and apparently this update bypassed even the deferred update settings that CrowdStrike themselves put into their software.
The only winning move here was to not use crowdstrike.
Brkdncr@lemmy.world 3 months ago
How removed from IT are you that you think FOG would have helped here?
Evotech@lemmy.world 3 months ago
Now your FOG servers are dead. What now?
cyberpunk007@lemmy.ca 3 months ago
This is a good solution for these types of scenarios. Doesn’t fit all though. Where I work, 85% of staff work from home. We largely use SaaS. I’m struggling to think of a good method here other than walking them through reinstalling windows on all their machines.
timewarp@lemmy.world 3 months ago
Thank you for sharing this. This is what I’m talking about. Larger companies not utilizing something like this already are dysfunctional. There are no excuses for why it would take them days, weeks or longer.
mynamesnotrick@lemmy.zip 3 months ago
Was a Windows sysadmin for a decade. We had thousands of machines with endpoint management and BitLocker encryption. (I have since moved on to cloud Kubernetes DevOps.) Nothing on a remote endpoint has any basic “hygiene” solution that could remotely fix this mess automatically. I guess Intel’s BIOS remote connection (forget the name) could in theory let at least some poor tech remote in, given there’s an internet connection and the company paid the exorbitant price.
All that to say, anything with end-user machines that won’t boot is a nightmare. And with BitLocker it’s even more complicated. (Hope your BitLocker key synced… lol)
Spuddlesv2@lemmy.ca 3 months ago
You’re thinking of Intel vPro. I imagine some of the CrowdStrike victims have this, and a bunch of poor level 1 techs are slowly grinding their way through every workstation on their networks. But yeah, OP is deluded and/or very inexperienced if they think this could have been mitigated on workstations through some magical “hygiene”.
LrdThndr@lemmy.world 3 months ago
Bro. PXE boot image servers. You can remotely image machines from hundreds of miles away with a few clicks and all it takes on the other end is a reboot.
Brkdncr@lemmy.world 3 months ago
You’re thinking of AMT/vPro.
Dran_Arcana@lemmy.world 3 months ago
Separate persistent data and operating system partitions, ensure that every local network has small PXE servers, VPNed (WireGuard, etc.) to a CDN with your base OS deployment images, that validate images based on CA and checksum before delivering, and give every user the ability to PXE boot and redeploy the non-data partition.
Bitlocker keys for the OS partition are irrelevant, and keys for the data partition can be stored and passed via AD after the redeploy. If someone somehow deploys an image that isn’t ours, it won’t have keys to the data partition because it won’t have a trust relationship with AD.
(This is actually what I do at work)
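Roughly, the validation step looks something like this. A minimal sketch, assuming (hypothetically) that images ship with a plain-text manifest of “image-name sha256” lines plus a detached signature over that manifest made with a key our CA vouches for; the file layout and names are made up for illustration:

```python
# Sketch: refuse to serve a base OS image unless the signed manifest checks out
# and the image's checksum matches. Manifest format and filenames are assumed.
import hashlib
import subprocess
from pathlib import Path

def sha256_of(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def manifest_is_trusted(manifest: Path, signature: Path, pubkey: Path) -> bool:
    # openssl verifies the detached signature over the manifest file.
    result = subprocess.run(
        ["openssl", "dgst", "-sha256", "-verify", str(pubkey),
         "-signature", str(signature), str(manifest)],
        capture_output=True,
    )
    return result.returncode == 0

def image_ok(image: Path, manifest: Path, signature: Path, pubkey: Path) -> bool:
    # Don't serve anything whose manifest signature or checksum doesn't match.
    if not manifest_is_trusted(manifest, signature, pubkey):
        return False
    expected = dict(
        line.split() for line in manifest.read_text().splitlines() if line.strip()
    )
    return expected.get(image.name) == sha256_of(image)
```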
I_Miss_Daniel@lemmy.world 3 months ago
Sounds good, but can you trust an OS partition not to store things in %programdata% etc that should be encrypted?
Brkdncr@lemmy.world 3 months ago
But your PXE boot server is down, your RADIUS server providing VPN auth is down, and your BitLocker keys are in AD, which is down because all your domain controllers are down.
Trainguyrom@reddthat.com 3 months ago
At that point why not just redirect the data partition to a network share with local caching? Seems like it would simplify this setup greatly (plus makes enabling shadow copy for all users stupid easy)
pHr34kY@lemmy.world 3 months ago
I’ve been separating OS and data partitions since I was a kid running Windows 95. It’s horrifying that people don’t expect and prepare for machines to become unbootable on a regular basis.
Hell, I bricked my work PC twice this year just by using the Windows cleanup tool - on Windows 11. The antivirus went nuclear, as antivirus products do.
felbane@lemmy.world 3 months ago
Rollout policies are the answer, and CrowdStrike should be made an example of if they were truly overriding policies set by the customer.
It seems more likely to me that nobody was expecting “fingerprint update” to have the potential to completely brick a device, and so none of the affected IT departments were setting staged rollout policies in the first place. Or if they were, they weren’t adequately testing.
Then - after the fact - it’s easy to claim that rollout policies were ignored when there’s no way to prove it.
If there’s some evidence that CS was indeed bypassing policies to force their updates I’ll eat the egg on my face.
originalucifer@moist.catsweat.com 3 months ago
from what I’ve read/watched, that’s the crux of the issue… did they push a ‘content’ update, i.e. signatures, or did they push a code update?
so you basically had a bunch of companies who absolutely do test all vendor code updates being slipped a code update they weren’t aware of because it was labeled a ‘content’ update.
DesertCreosote@lemm.ee 3 months ago
I’m one of the admins who manage CrowdStrike at my company.
We have all automatic updates disabled, because when they were enabled (according to the CrowdStrike best practices guide they gave us), they pushed out a version with a bug that overwhelmed our domain servers. Now we test everything through multiple environments before things make it to production, with at least two weeks of testing before we move a version to the next environment.
This was a channel file update, and per our TAM and account managers in our meeting after this happened, there’s no way to stop that file from being pushed, or to delay it. Supposedly they’ll be adding that functionality in now.
lazynooblet@lazysoci.al 3 months ago
Autopilot, Intune. Force restart the device twice to get Startup Repair, choose factory reset, share the LAPS admin password, and let the workstation rebuild itself.
sp3tr4l@lemmy.zip 3 months ago
You are talking about how to fix the problem.
This person is talking about what caused the problem.
Completely different things.
Analogous to: a house is on fire; call the ambulances to treat any wounded, call the fire department, call insurance, figure out temporary housing.
Analogous to: Investigate the causes of the fire, suggest various safety regulations on natural gas infrastructure, home appliances, electrical wiring, building material and methods, etc.
riskable@programming.dev 3 months ago
Not using a proprietary, unvetted, auto-updating, 3rd party kernel module in essential systems would be a good start.
Back in the day, companies used to insist upon access to the source code for such things, along with regular 3rd party code audits, but these days companies are cheap and lazy and don’t care as much. They’d rather just invest in “security incident insurance” and hope for the best 🤷
Sometimes they don’t even go that far and instead just insist upon useless indemnification clauses in software licenses. …and yes, they’re useless:
nolo.com/…/indemnification-provisions-contracts.h…
(Important part indicating why they’re useless should be highlighted)
JasonDJ@lemmy.zip 3 months ago
Does Windows have a solid native way to remotely re-image a system like macOS does?
lazynooblet@lazysoci.al 3 months ago
Yes, but it is license-based and focused on business customers.
Brkdncr@lemmy.world 3 months ago
Yes.
catloaf@lemm.ee 3 months ago
No.
Maybe with Intune and Autopilot, but I haven’t used it.
Nomad@infosec.pub 3 months ago
It’s called EFI. How do you think your BIOS update from inside BIOS is working? ;)
timewarp@lemmy.world 3 months ago
I’d issue IPMI or remote management commands to reboot the machines. Then I’d boot into either a Linux recovery environment (yes, Linux can unlock BitLocker-encrypted drives) or a WinPE/Windows RE image and unlock the drives. Preferably that environment is already staged on the drives, but you could have them PXE boot. Just giving ideas here, but the ideal DR scenario would have the environment ready to load locally, since PXE would add delays.
I’d then push a command or script that removes the update file that caused the issue and reboots. Having planned for a scenario like this already, total time to fix would be less than 2 hours.
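A rough sketch of the reboot fan-out piece, assuming the machines actually expose a BMC/IPMI interface; the hosts, credentials, and remediation step here are placeholders, not a real runbook:

```python
# Sketch of the out-of-band reboot fan-out. ipmitool's "chassis power cycle" is a
# real command; the host list and credentials are hypothetical placeholders.
import subprocess

BMC_HOSTS = ["bmc-ws-001.example.corp", "bmc-ws-002.example.corp"]  # hypothetical

def power_cycle(bmc_host: str, user: str, password: str) -> bool:
    result = subprocess.run(
        ["ipmitool", "-I", "lanplus", "-H", bmc_host,
         "-U", user, "-P", password, "chassis", "power", "cycle"],
        capture_output=True, text=True,
    )
    return result.returncode == 0

if __name__ == "__main__":
    for host in BMC_HOSTS:
        ok = power_cycle(host, "admin", "REDACTED")
        print(f"{host}: {'cycled' if ok else 'FAILED'}")
    # Each machine then boots the pre-staged recovery environment, which unlocks
    # the drive, removes the offending update file, and reboots back into Windows.
```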
riskable@programming.dev 3 months ago
At my company I use a virtual desktop and it was restored from a nightly snapshot a few hours before I logged in that day (and presumably, they also applied a post-restore temp fix). This action was performed on all the virtual desktops at the entire company and took approximately 30 minutes (though, probably like 4 hours to get the approval to run that command, LOL).
It all took place before I even logged in that day. I was actually kind of impressed… We don’t usually act that fast.
SuperFola@programming.dev 3 months ago
Dual partitioning as Android does it might have helped. Install the update to partition B, reboot, and if everything is alright swap the roles so B becomes the default; from then on you boot the updated partition, keeping the old one untouched as a fallback.
As far as I understand, with the faulty update the B partition wouldn’t have booted correctly, and the machine would have been reverted to the untouched A partition.
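A toy sketch of that slot logic (not Android’s actual update engine; the slot names and health check are stand-ins):

```python
# Toy A/B slot logic: updates land in the standby slot, a failed boot reverts to
# the untouched slot. Purely illustrative, not Android's implementation.
from dataclasses import dataclass

@dataclass
class Slots:
    active: str = "A"    # slot the machine currently boots
    standby: str = "B"   # slot new updates get written to

def apply_update(slots: Slots) -> None:
    # Write the new OS image to the standby slot, then try booting it next time.
    print(f"installing update to slot {slots.standby}")
    slots.active, slots.standby = slots.standby, slots.active

def on_boot(slots: Slots, healthy: bool) -> None:
    if healthy:
        print(f"slot {slots.active} booted fine; keeping it as default")
    else:
        # The new slot never came up cleanly, so fall back to the untouched one.
        print(f"slot {slots.active} failed to boot; reverting to slot {slots.standby}")
        slots.active, slots.standby = slots.standby, slots.active

slots = Slots()
apply_update(slots)            # faulty update lands in B, B becomes the trial slot
on_boot(slots, healthy=False)  # boot fails, machine reverts to the untouched A
```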