How removed from IT are you that you think fog would have helped here?
Comment on CrowdStrike Isn't the Real Problem
LrdThndr@lemmy.world 5 months ago
A decade ago I worked for a regional chain of gyms with locations in 4 states.
I was in TN. When a system would go down in SC or NC, we originally had three options:
- (The most common) have them put it in a box and ship it to me.
- I go there and fix it (rare)
- I walk them through fixing it over the phone (fuck my life)
I got sick of this. So I researched options and found an open source software solution called FOG. I ran a server in our office and had little OptiPlex 160s running a software client that I shipped to each club. Then each computer at each club was configured to PXE boot from the FOG client.
If everything was okay, it would chain the boot to the OS on the machine. But I could flag a machine for reimage and at next boot, the machine would check in with PXE and get a complete reimage from premade images on the FOG server.
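Roughly, the check-in decision looks like the following. This is a minimal Python sketch of the idea, not FOG's actual code: `Host`, `boot_script`, and the image URL layout are all made up for illustration. The only real pieces are the iPXE script commands (`kernel`, `initrd`, `boot`, `exit`) in the returned text.

```python
from dataclasses import dataclass


@dataclass
class Host:
    """One managed machine, keyed by MAC address (hypothetical record)."""
    mac: str
    reimage_flagged: bool = False


def boot_script(host: Host, image_url: str) -> str:
    """Return the iPXE script to serve for one PXE check-in.

    Hypothetical helper: image_url is assumed to point at a directory
    holding an imaging kernel (vmlinuz) and initrd (initrd.img).
    """
    if host.reimage_flagged:
        # Flagged: boot an imaging environment instead of the local OS.
        # Clearing the flag here is a simplification; a real system would
        # more likely clear it once imaging reports success.
        host.reimage_flagged = False
        return (
            "#!ipxe\n"
            f"kernel {image_url}/vmlinuz\n"
            f"initrd {image_url}/initrd.img\n"
            "boot\n"
        )
    # Not flagged: exit iPXE so the firmware falls through to the next
    # boot device, normally the local disk.
    return "#!ipxe\nexit\n"
```

The point of the design is that the normal path is a no-op: a healthy machine pays only a short network check-in on each boot, and the server decides per machine whether anything else happens.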
So yes, I could completely reimage a computer from hundreds of miles away by clicking a few checkboxes on my computer.
This was free software. It saved us thousands in shipping fees alone.
There ARE options out there.
Brkdncr@lemmy.world 5 months ago
LrdThndr@lemmy.world 5 months ago
How would it not have? You got an office or field offices?
“Bring your computer by and plug it in over there.” And flag it for reimage. Yeah. It’s gonna be slow, since you have 200 of the damn things running at once, but you really want to go and manually touch every computer in your org?
The damn thing’s even boot looping, so you don’t even have to reboot it.
I’m sure the user saved all their data in OneDrive like they were supposed to, right?
I get it, it’s not a 100% fix rate. And it’s a bit of a callous answer to their data. And I don’t even know if the project is still being maintained.
But the post I replied to was lamenting the lack of an option to remotely fix unbootable machines. This was an option to remotely fix nonbootable machines. No need to be a jerk about it.
Brkdncr@lemmy.world 5 months ago
Because your imaging environment would also be down. And you’re still touching each machine and bringing users into the office.
Or your imaging process over the wan takes 3 hours since it’s dynamically installing apps and updates and not a static “gold” image. Imaging is then even slower because your source disk is only ssd and imaging slows down once you get 10+ going at once.
I’m being rude because I see a lot of armchair sysadmins that don’t seem to understand the scale of the CrowdStrike outage, what CrowdStrike even is beyond antivirus, and the workflow needed to recover from it.
LrdThndr@lemmy.world 5 months ago
FOG ran on Linux. It wouldn’t have been down. But that’s beside the point.
I never said it was a good answer to CrowdStrike. It was just a story about how I did things 10 years ago, and an option for remotely fixing nonbooting machines. That’s it.
I get you’ve been overworked and stressed as fuck these last few days. I’ve been out of corporate IT for 10 years and I do not envy the shit you guys are going through right now. I wish I could buy you a cup of coffee or a beer or something.
timewarp@lemmy.world 5 months ago
Imaging environment down? If a sysadmin can’t figure out how to boot a machine into recovery to remove the bad update file then they have bigger problems. The fix in this instance wasn’t even re-imaging machines. It was merely removing a file. Ideal DR scenario would have a recovery image already on the system that can be booted into remotely, so there is minimal strain on the network. Furthermore, we don’t live in dial-up age anymore.
Evotech@lemmy.world 5 months ago
Now your fog servers are dead. What now?
cyberpunk007@lemmy.ca 5 months ago
This is a good solution for these types of scenarios. Doesn’t fit all though. Where I work, 85% of staff work from home. We largely use SaaS. I’m struggling to think of a good method here other than walking them through reinstalling windows on all their machines.
timewarp@lemmy.world 5 months ago
- Configure PXE to reboot into recovery image, push out command to remove bad file. Reboot. Done.
or
- Have recovery image already installed. Have user reboot & push key to boot into recovery. Push out fix. Done.
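The "push out fix" step in either option boils down to deleting the matching files from one directory. Here is a minimal Python sketch of just that step. The default path and pattern follow CrowdStrike's public remediation guidance for the July 2024 incident, but treat them as assumptions and verify them first. The function name is made up, and a stock Windows recovery environment doesn't ship Python, so in practice this logic would more likely be a batch or PowerShell one-liner.

```python
from pathlib import Path

# Assumed defaults, taken from the public remediation guidance. In a
# recovery environment the Windows volume may be mounted under a different
# drive letter, and an encrypted volume has to be unlocked first.
DEFAULT_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")
DEFAULT_PATTERN = "C-00000291*.sys"


def remove_bad_channel_files(
    directory: Path = DEFAULT_DIR, pattern: str = DEFAULT_PATTERN
) -> list[str]:
    """Delete files in directory matching pattern; return the deleted names."""
    removed = []
    for path in sorted(directory.glob(pattern)):
        if path.is_file():
            path.unlink()
            removed.append(path.name)
    return removed
```

Returning the deleted names gives the caller something to log, and running it a second time is harmless because it just returns an empty list.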
cyberpunk007@lemmy.ca 5 months ago
I had no idea you could remotely configure pxe to reboot into a recovery image and run a script. How do you do this?
LrdThndr@lemmy.world 5 months ago
Fuck yeah. Even better than reimaging. That’s creative as fuck and I love it.
LrdThndr@lemmy.world 5 months ago
That’s still 15% less work though. If I had to manually fix 1000 computers, clicking a few buttons to automatically fix 150 of them sounds like a sweet-ass deal to me even if it’s not universal.
You could also always commandeer a conference room or three and throw a switch on the table. “Bring in your laptop and go to conference room 3. Plug in using any available cable on the table and reboot your computer. Should be ready in an hour or so. There’s donuts and coffee in conference room 4.” Could knock out another few dozen.
Won’t help for people across the country, but if they’re nearish, it’s not too bad.
cyberpunk007@lemmy.ca 5 months ago
Not a lot of nearish. It would be pretty bad if this happened here.
timewarp@lemmy.world 5 months ago
Thank you for sharing this. This is what I’m talking about. Larger companies not utilizing something like this already are dysfunctional. There are no excuses for why it would take them days, weeks or longer.
magikmw@lemm.ee 5 months ago
This works great for stationary PCs and local servers, but does nothing for public-internet-connected laptops in the hands of users.
The only fix here is staggered and tested updates, and apparently this update bypassed even the deferred update settings that CrowdStrike themselves put into their software.
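To make "staggered" concrete, here is a minimal Python sketch of ring-based rollout, assuming hosts are identified by hostname. The ring names and percentages are made-up numbers, and `rollout_ring` and `may_install` are hypothetical helpers, not anything from CrowdStrike's product.

```python
import hashlib

# (ring name, cumulative percent of fleet). Illustrative values only.
RINGS = [("canary", 1), ("early", 10), ("broad", 100)]


def rollout_ring(hostname: str) -> str:
    """Map a hostname to a stable rollout ring via a hash bucket in 0..99."""
    digest = hashlib.sha256(hostname.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    for name, upper in RINGS:
        if bucket < upper:
            return name
    return RINGS[-1][0]


def may_install(hostname: str, released_rings: set[str]) -> bool:
    """A host takes the update only once its ring has been released."""
    return rollout_ring(hostname) in released_rings
```

Hashing the hostname keeps each machine in the same ring across updates without storing any state, so a bad update released only to `canary` hits roughly 1% of the fleet before anyone widens it.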
The only winning move here was to not use crowdstrike.
wizardbeard@lemmy.dbzer0.com 5 months ago
It also assumes that reimaging is always an option.
Yes, every company should have networked storage enforced specifically for issues like this, so no user data would be lost, but there’s often a gap between should and “has been able to find the time and get the required business side buy in to make it happen”.
Also, users constantly find new ways to do non-standard, non-supported things with business critical data.
Bluetreefrog@lemmy.world 5 months ago
Isn’t this just more of what caused the problem in the first place? Namely, centralisation. If you store data locally and you lose a machine, that’s bad but not the end of the world. If you store it centrally and you lose the data, that’s catastrophic. Nassim Taleb nailed this stuff. Keep the downside limited, and the upside unlimited or as he says, “Don’t pick up pennies in front of a steamroller.”
LrdThndr@lemmy.world 5 months ago
Absolutely. 100%
But don’t let perfect be the enemy of good. A fix that gets you 40% of the way there is still 40% less work you have to do by hand. Not everything has to be a fix for all situations. There’s no such thing as a panacea.
magikmw@lemm.ee 5 months ago
Sure. At the same time one needs to manage resources.
I was all in on laptop deployment automation. It cut down on a lot of human error issues and having inconsistent configuration popping up all the time.
But it needs constant supervision, even if not constant updates. More systems and solutions lead to neglect if not resourced well. So some “would be good to have” systems just never make the cut, because as overachieving as I am, I also don’t want to think everything is taken care of when it clearly isn’t.
timewarp@lemmy.world 5 months ago
You were all in, but was the company all in? How many employees? It sounds like you innovated. Let’s say that the company you worked for was spending millions on vendors that promised solutions but rarely delivered. If instead they gave you $400k a year, a $1 million/year budget & 10 employees… I’m guessing you could have managed the laptop deployment automation, along with some other significant projects as well.
Instead though, people with good ideas, even ones loyal to the company, are competing against sales and marketing reps from billion dollar companies, and upper management is easily swayed.
catloaf@lemm.ee 5 months ago
Yeah. I find a base image and post-install config with group policy or Ansible to be far more reliable.
LrdThndr@lemmy.world 5 months ago
Completely fair, man.
timewarp@lemmy.world 5 months ago
Almost all computers can be set to PXE boot, but work laptops usually have even more advanced remote management capabilities. You ask the employee to reboot the laptop and presto!
magikmw@lemm.ee 5 months ago
I wonder how you’re supposed to get PXE boot to work securely over the internet. And how that helps when affected disk is still encrypted and needs unusual intervention to fix, including admin access to system files.
I’ve been doing this for a while, and I like creative solutions, so I wonder about those issues a lot. Not much comes to my mind besides let’s recall all the laptops and do it one by one.
timewarp@lemmy.world 5 months ago
PXE boot is more of a last resort IMO, but can be used as a chainloader to a more secure option. The biggest challenge I could see security-wise is having PXE boot being run on unsecured networks. Even then though, normally a computer will have been provisioned on a secure network and will have encryption and secure boot-based encryption, and some additional signature-based image verification.
wizardbeard@lemmy.dbzer0.com 5 months ago
Hypothetically you could ship a company-provided router to handle the VPN connection to your remote users, so you aren’t relying on the OS to be able to boot up to get connected to the company network and PXE environment. Lots of extra cost and mess though.
LrdThndr@lemmy.world 5 months ago
From a home user? Probably ain’t shit-all you can do with PXE booting. But if you have a field office or somewhere a user can go with a hardware vpn appliance? Well now you’re in business.