All of Japan's Toyota Assembly Plants Shut Down for a Day Because Their Server Ran Out of Disk Space

Submitted ⁨⁨1⁩ ⁨year⁩ ago⁩ by ⁨FlyingSquid@lemmy.world⁩ to ⁨technology@lemmy.world⁩

https://www.reuters.com/business/autos-transportation/toyota-says-plant-shutdown-last-week-due-server-malfunction-2023-09-06/

source

Comments

Sort:hotnew top

grabyourmotherskeys@lemmy.world ⁨1⁩ ⁨year⁩ ago
I haven’t read the article because documentation is overhead but I’m guessing the real reason is because the guy who kept saying they needed to add more storage was repeatedly told to calm down and stop overreacting.

source
- krellor@kbin.social ⁨1⁩ ⁨year⁩ ago
  I used to do some freelance work years ago and I had a number of customers who operated assembly lines. I specialized in emergency database restoration, and the assembly line folks were my favorite customers. They know how much it costs them for every hour of downtime, and never balked at my rates and minimums.
  
  The majority of the time the outages were due to failure to follow basic maintenance, and log files eating up storage space was a common culprit.
  
  So yes, I wouldn't be surprised at all if the problem was something called out by the local IT, but were overruled for one reason or another.
  
  source
  - otl@lemmy.sdf.org ⁨1⁩ ⁨year⁩ ago
    
    and log files eating up storage space was a common culprit.
    
    Another classic symptom of poorly maintained software. Constant announcements of trivial nonsense, like [INFO]: Sum(1, 1) - got result 2! filling up disks.
    
    I don’t know if the systems you’re talking about are like this, but it wouldn’t surprise me!
    
    source
    -> View More Comments
  - Pat12@lemmy.world ⁨1⁩ ⁨year⁩ ago
    this is software speciifcally for assembly line management?
    
    source
    -> View More Comments
- DontMakeMoreBabies@kbin.social ⁨1⁩ ⁨year⁩ ago
  I'm this person in my organization. I sent an email up the chain warning folks we were going to eventually run out of space about 2 years ago.
  
  Guess what just recently happened?
  
  ShockedPikachuFace.gif
  
  source
  - TheBat@lemmy.world ⁨1⁩ ⁨year⁩ ago
    You got approval for new SSDs because the manglement recognised threat identified by you as critical?
    
    Right?
    
    source
  - vagrantprodigy@lemmy.whynotdrs.org ⁨1⁩ ⁨year⁩ ago
    Literally sent that email this morning. It’s not that we don’t have the space, it’s that I can’t get a maintenance window to migrate the data to the new storage platform.
    
    source
  - mdd@lemm.ee ⁨1⁩ ⁨year⁩ ago
    Can’t you just add a few external USB drives? (heard this more than once at an NGO think tank.)
    
    source
    -> View More Comments
  - IMongoose@lemmy.world ⁨1⁩ ⁨year⁩ ago
    Sometimes that person is very silly though. We had a vendor call us saying we needed to clear our logs ASAP!!! due to their size. The log file was no joke, 20 years old. At the current rate, our disk would be full in another 20 years. We cleared it but like, calm down dude.
    
    source
- dojan@lemmy.world ⁨1⁩ ⁨year⁩ ago
  Ballast!
  
  Just plonk a large file in the storage, make it relative to however much is normally used in the span of a work week or so. Then when shit hits the fan, delete the ballast and you’ll suddenly have bought a week to “find” and implement a solution. You’ll be hailed as a hero, rather than be the annoying doomer that just bothers people about technical stuff that’s irrelevant to the here and now.
  
  source
  - lemmyvore@feddit.nl ⁨1⁩ ⁨year⁩ ago
    Or you could be fired because technically you’re the one that caused the outage.
    
    source
    -> View More Comments
  - Malfeasant@lemm.ee ⁨1⁩ ⁨year⁩ ago
    Except then they’ll decide you fixed it, so nothing more needs to be done. I’ve seen this happen more than once.
    
    source
- IWantToFuckSpez@kbin.social ⁨1⁩ ⁨year⁩ ago
  And was fired for not doing his job which management prevented him from doing.
  
  source
Semi-Hemi-Demigod@kbin.social ⁨1⁩ ⁨year⁩ ago
Sysadmin pro tip: Keep a 1-10GB file of random data named DELETEME on your data drives. Then if this happens you can get some quick breathing room to fix things.

Also, set up alerts for disk space.

source
- Dkarma@lemmy.world ⁨1⁩ ⁨year⁩ ago
  The answer here is not storage it is better alerting.
  
  source
  - nickhammes@lemmy.world ⁨1⁩ ⁨year⁩ ago
    Why not both? Alerting to find issues quickly, a bit of extra storage so you have more options available in case of an outage, and maybe some redundancy for good measure.
    
    source
    -> View More Comments
  - Agent641@lemmy.world ⁨1⁩ ⁨year⁩ ago
    Yes, alert me when disk space is about to run out so I can ask for a massive raise and quit my job when they dont give it to me.
    
    Then when TSHTF they pay me to come back.
    
    source
  - looz@sopuli.xyz ⁨1⁩ ⁨year⁩ ago
    There’s cases where disk fills up quicker than one can reasonably react, even if alerts are in place. And sometimes culprit is something you can’t just go and kill.
    
    source
    -> View More Comments
- dx1@lemmy.world ⁨1⁩ ⁨year⁩ ago
  The real pro tip is to segregate the core system and anything on your system that eats up disk space into separate partitions, along with alerting, log rotation, etc. And also to not have a single point of failure in general. Hard to say exact what went wrong w/ Toyota but they probably could have planned better for it in a general way.
  
  source
- Maximilious@kbin.social ⁨1⁩ ⁨year⁩ ago
  10GB is nothing in an enterprise datastore housing PBs of data. 10GB is nothing for my 40TB homelab!
  
  source
  - Semi-Hemi-Demigod@kbin.social ⁨1⁩ ⁨year⁩ ago
    It not going to bring the service online, but it will prevent a full disk from letting you do other things. In some cases SSH won’t work with a full disk.
    
    source
    -> View More Comments
  - idunnololz@lemmy.world ⁨1⁩ ⁨year⁩ ago
    It’s nothing for my homework folder.
    
    source
    -> View More Comments
- Lem453@lemmy.ca ⁨1⁩ ⁨year⁩ ago
  Even better, cron job every 5 mins and if total remaining space falls to 5 mins auto deleting the file and send a message to sys admin
  
  source
  - Semi-Hemi-Demigod@kbin.social ⁨1⁩ ⁨year⁩ ago
    Sends a message and gets the services ready for potential shutdown. Or implements a rate limit to keep the service available but degraded.
    
    source
  - bug@lemmy.one ⁨1⁩ ⁨year⁩ ago
    At that point just set the limit a few gig higher and don’t have the decoy file at all
    
    source
  - gazter@aussie.zone ⁨1⁩ ⁨year⁩ ago
    Also, if space starts decreasing much more rapidly than normal.
    
    source
- z00s@lemmy.world ⁨1⁩ ⁨year⁩ ago
  Or make the file a little larger and wait until you’re up for a promotion…
  
  source
- mkhopper@lemmy.world ⁨1⁩ ⁨year⁩ ago
  500Gb maybe.
  
  source
Swiggles@lemmy.blahaj.zone ⁨1⁩ ⁨year⁩ ago
This happens. Recently we had a problem in production where our database grew by a factor of 10 in just a few minutes due to a replication glitch. Of course it took down the whole application as we ran out of space.

Some things just happen and all head room and monitoring cannot save you if things go seriously wrong. You cannot prepare for everything in life and IT I guess. It is part of the job.

source
- RidcullyTheBrown@lemmy.world ⁨1⁩ ⁨year⁩ ago
  Bad things can happen but that’s why you build disaster recovery into the infrastructure. Especially with a compqny as big as Toyota, you can’t have a single point of failure like this. They produce over 13,000 cars per day. This failure cost them close to 300,000,000 dollars just in cars.
  
  source
  - frododouchebaggins@lemmy.world ⁨1⁩ ⁨year⁩ ago
    The IT people that want to implement that disaster recovery plan do not make the purchasing decisions. It takes an event like this to get the retards in the C-suite listen to IT staff.
    
    source
    -> View More Comments
  - Swiggles@lemmy.blahaj.zone ⁨1⁩ ⁨year⁩ ago
    Yea, fair point regarding the single point of failure. I guess it was one of those scenarios that should just never happen.
    
    I am sure it won’t happen again though.
    
    As I said it can just happen even though you have redundant systems and everything. Sometimes you don’t think about that one unlikely scenario and boom.
    
    source
MechanicalJester@lemm.ee ⁨1⁩ ⁨year⁩ ago
I blame lean philosophy. Keeping spare parts and redundancy is expensive so definitely don’t do it…which is just rolling the dice until it comes up snake eyes and your plant shuts down.

It’s the “save 5% yearly and stop trying to avoid a daily 5% chance of disaster”

Over prepared is silly, but so is under prepared.

They were under prepared.

source
- Ryumast3r@lemmy.world ⁨1⁩ ⁨year⁩ ago
  Lean philosophy is supposed to account for those dice-rolling moments. It’s not just “keep nothing in inventory”, there is supposed to be risk assessment involved.
  
  The problem is that leadership doesn’t interpret it that way and just sees “minimizing inventory increases profit!”
  
  source
  - IonAddis@lemmy.world ⁨1⁩ ⁨year⁩ ago
    
    The problem is that leadership doesn’t interpret it that way and just sees “minimizing inventory increases profit!”
    
    Yep. Managers prioritize short-term gains (often personal gains, too) over the overall health of a business.
    
    There’s also industries where the “lean” strategy is inappropriate because the given industry is one that booms in times of crisis when logistics to get “just in time” supplies go kaput due to the same catastrophe that’s causing the industry to boom. Hospitals and clinics can end up in trouble like this.
    
    But there’s other industries too–I haven’t looked for it, but I’m sure there’s a plethora of analysis already on what Covid did to companies and their supply chains.
    
    source
    -> View More Comments
  - Aceticon@lemmy.world ⁨1⁩ ⁨year⁩ ago
    In my own impression from the side of software engineering (i.e. the whole discipline rather than just “coding”) this kind of thing is pretty common:
    
    Start with ad-hoc software development with lots of confusion, redundancy, inneficient “we’ll figure it out as when we get there” and so on.
    
    To improve on this somebody really thinks things through and eventually a software development process emerges, something like Agile.
    
    There are lots of good reasons for every part of this processes but naturally sometimes the conditions are not met and certain parts are not suitable for use: the whole process is not and can never be a one size fits all silver bullet because it’s way to complex and vast a discipline for that (if it wasn’t you wouldn’t need a process to do it with even the minimum of efficency).
    
    However most people using it aren’t the “grand thinkers” of software engineering - software architect level types with tons of experience and who thus have seen quite a lot and know why certain elements of a process are as they are, and hence when to use them and when not to use them - and instead they’re run-of-the-mill, far more junior software designers and developers, as well as people from the management side of things trying to organise a tech-heavy process.
    
    So you end up with what is an excellent process when used by people who know that each part tries to achieve, what’s the point of that and when is it actually applicable, being used by people who have no such experience and understanding of software development processes and just use it as one big recipe, blindly following it with no real understanding and hence often using it incorrectly.
    
    For example, you see tons of situations where the short development cycles of Agile (aka sprints) and use cases are used without the crucial element which is actually envolving the end-users or stakeholders in the definition of the use cases, evaluation of results and even prioritization of what to do in the next sprint, so one of the crucial objectives of use cases - the discovery of the requirement details by interactive cycles with end-users where they quickly see some results and you use their feedback to fine-tune what gets done to match what they actually need (rather than the vague very high level idea they themselves have at the start of the project) is not at all achieve and instead they’re little more than small project milestones that in the old days would just be entries in Microsoft Manager or some tool like that.
    
    This is IMHO the “problem” with any advanced systematic process in a complex domain: it’s excellent in the hands of those who have enough experience and understanding of concerns at all levels to use it but they’re generally either used by people without that experience (often because managers don’t even recognize the value of that experience until things unexpectedly blow up) or by actual managers whose experience might be vast but is actuallly in a parallel track that’s not really about dealing with the kinds of concerns technical concerns that the process is designed to account for.
    
    source
    -> View More Comments
MoogleMaestro@kbin.social ⁨1⁩ ⁨year⁩ ago
There's some irony to every tech company modeling their pipeline off Toyota's Kanban system...

Only for Toyota to completely fuck up their tech by running out of disk space for their system to exist on. Looks like someone should have put "Buy more hard drives" to the board.

source
- palitu@aussie.zone ⁨1⁩ ⁨year⁩ ago
  not to mention the lean process effed them during fukashima and covid, with a breakdown in logistics and a shortage of chips, meant that their entire mode of operating shut down, as they had no capacity to deal with any outages in any of their systems. Maybe that has happened again, just in server land.
  
  source
  - GamingChairModel@lemmy.world ⁨1⁩ ⁨year⁩ ago
    Toyota was the carmaker best positioned for the COVID chip shortage because they recognized it as a bottleneck. They were pumping out cars a few months longer than the others (even if they eventually hit the same wall everyone else did).
    
    source
    -> View More Comments
  - burningmatches@feddit.uk ⁨1⁩ ⁨year⁩ ago
    It wasn’t just Fukushima. There was a massive flood in Thailand at the same time that shut down a load of suppliers. It was a really bad bit of luck but they did learn from that.
    
    source
    -> View More Comments
- netburnr@lemmy.world ⁨1⁩ ⁨year⁩ ago
  It was forever ignore in backlog
  
  source
AnUnusualRelic@lemmy.world ⁨1⁩ ⁨year⁩ ago
Idiots, they ought to have switched to tabs for indenting. Everybody knows that.

source
- cheery_coffee@lemmy.ca ⁨1⁩ ⁨year⁩ ago
  And who needs 100% quality images. Just set jpeg quality to 60% and be done with it.
  
  source
autotldr@lemmings.world [bot] ⁨1⁩ ⁨year⁩ ago
This is the best summary I could come up with:

TOKYO, Sept 6 (Reuters) - A malfunction that shut down all of Toyota Motor’s (7203.T) assembly plants in Japan for about a day last week occurred because some servers used to process parts orders became unavailable after maintenance procedures, the company said.

The system halt followed an error due to insufficient disk space on some of the servers and was not caused by a cyberattack, the world’s largest automaker by sales said in a statement on Wednesday.

“The system was restored after the data was transferred to a server with a larger capacity,” Toyota said.

The issue occurred following regular maintenance work on the servers, the company said, adding that it would review its maintenance procedures.

Two people with knowledge of the matter had told Reuters the malfunction occurred during an update of the automaker’s parts ordering system.

Toyota restarted operations at its assembly plants in its home market on Wednesday last week, a day after the malfunction occurred.

The original article contains 159 words, the summary contains 159 words. Saved 0%. I’m a bot and I’m open source!

source
- RoyalEngineering@lemmy.world ⁨1⁩ ⁨year⁩ ago
  Lol good bot I guess
  
  source
  - Classy@sh.itjust.works ⁨1⁩ ⁨year⁩ ago
    I wonder what happens if the summary is longer than the original text. Negative percentages? Stack underflow?
    
    source
    -> View More Comments
- dabster291@lemmy.zip ⁨1⁩ ⁨year⁩ ago
  Wow, what a useful bot!
  
  source
blazera@kbin.social ⁨1⁩ ⁨year⁩ ago
This is a fun read in the wake of learning about all the personal data car manufacturers have been collecting

source
Blurrg@lemmy.world ⁨1⁩ ⁨year⁩ ago
Free disk space is just inventory and therefor wasteful.

source
JackbyDev@programming.dev ⁨1⁩ ⁨year⁩ ago
Kanban

source
c0mbatbag3l@lemmy.world ⁨1⁩ ⁨year⁩ ago
Was this that full shutdown everyone thought was going to be malware?

The worst malware of all, unsupervised junior sysadmins.

source
- Takina_sOldPairTM@lemmy.world ⁨1⁩ ⁨year⁩ ago
  Human error…lol, classic.
  
  source
LEDZeppelin@lemmy.world ⁨1⁩ ⁨year⁩ ago
Just delete some p0rn

source
- WtfEvenIsExistence@lemmy.ca ⁨1⁩ ⁨year⁩ ago
  There’s no porn, just a large Petabyte of “Work Folder”.
  
  source
  - LEDZeppelin@lemmy.world ⁨1⁩ ⁨year⁩ ago
    C:\Homework\System Files\Win32\Windows files\Do not Open\Virus\Please don’t open
    
    source
    -> View More Comments
httpjames@sh.itjust.works ⁨1⁩ ⁨year⁩ ago
Server less DBs ftw

source
- RidcullyTheBrown@lemmy.world ⁨1⁩ ⁨year⁩ ago
  Serverless just means that the user doesn’t manage the capacity by themselves. This scenario can happen easily if the serverless provider is as incompetent as the Toyota admins.
  
  source
RFBurns@lemmy.world ⁨1⁩ ⁨year⁩ ago
Storage has never been cheaper.

There’s going to be a seppuku session in somebody’s IT department.

source
imgonnatrythis@lemm.ee ⁨1⁩ ⁨year⁩ ago
Search for non system files older than 2020 and lmza zip archive that shit stat!

source
- tiredofsametab@kbin.social ⁨1⁩ ⁨year⁩ ago
  You have now archived the encryption keys. Nothing can communicate and everything begins failing.
  
  source
  - imgonnatrythis@lemm.ee ⁨1⁩ ⁨year⁩ ago
    Ctrl+z?
    
    source
csolisr@communities.azkware.net ⁨1⁩ ⁨year⁩ ago
And that’s why I have a weekly cronjob in my server to call BleachBit, remove cycled logs, and compress the images on my storage. Having to make do with a rather limited VPS for years taught me to be resourceful with what I had

source
chemicalwonka@discuss.tchncs.de ⁨1⁩ ⁨year⁩ ago
well well, somebody will be fired

source
massive_bereavement@kbin.social ⁨1⁩ ⁨year⁩ ago
Someone messed up log rotation and the whole /var went ro.

source
Sygheil@lemmy.world ⁨1⁩ ⁨year⁩ ago
Free space is a wasted disk space.

source
- whofearsthenight@lemm.ee ⁨1⁩ ⁨year⁩ ago
  They paid for the whole disk, they’re going to use the whole disk.
  
  source