They probably mean that they did a change in a config file that is uploaded in their weekly or bi-weekly change window, and that that file was malformed for whichever reason that made the process that reads it crash. The main process depends on said process, and all the chain failed.
Things to improve:
- make the pipeline more resilient, if you have a “bot detection module” that expects a file,and that file is malformed, it shouldn’t crash the whole thing: if the bot detection module crahses, control it, fire an alert but accept the request until fixed.
- Have a control of updated files to ensure that nothing outside of expected values and form is uploaded: this file does not comply with the expected format, upload fails and prod environment doesn’t crash.
- Have proper validation of updated config files to ensure that if something is amiss, nothing crashes and the program makes a controlled decision: if file is wrong, instead of crashing the module return an informed value and let the main program decide if keep going or not.
I’m sure they have several of these and sometimes shit happens, but for something as critical as CloudFlare to not have automated integration tests in a testing environment before anything touches prod is pretty bad.
floquant@lemmy.dbzer0.com 4 months ago
No, in “DevOps” environments “configuration changes” is most of what you do every day