Comment on I hate Clouds - a personal perspective on why I think Clouds suck
Tja@programming.dev 4 months ago
Split brain is easily solved: there are off-the-shelf solutions, and if you have some custom code you can use plenty of well-researched approaches, for instance Raft. Putting "Byzantine fault" into Google Scholar yields thousands of papers, if you want something fancier.
The same goes for most of the problems you mentioned: they were an issue 10 years ago, but nowadays you can federate, abstract or outsource most of it.
Making it harder to identify SPOFs doesn’t increase fragility. If your whole system is a single instance, the SPOF is trivial to identify (it's the whole thing), but the system is very brittle.
loudwhisper@infosec.pub 4 months ago
Of course the problem is solved, but that doesn’t mean that the solution is easy. Also, distributed protocols still need to work on top of a complicated network and within real-life constraints such as performance (to list just one). One bug, misconfiguration or oversight and you have a problem.
Just to give an example, I remember a Kafka cluster with 5 replicas completely shitting its pants for 6h to rebalance data during a planned maintenance where one node was brought offline. It caused one of the longest outages to date, with the websites that relied on it going offline. Was it our fault? Was it a misconfiguration? A bug? It doesn’t matter: it’s a complex system that people implemented, and something was probably missed.
Technology is implemented by people, and complexity increases the chances of mistakes. I'm not sure this can be argued against.
Making it harder to identify SPOFs means you might miss one. That means carrying liabilities and still having scenarios where your system can crash, in addition to paying quite a lot to build a resilience that you don’t actually achieve.
A single instance with 2 failure scenarios (disk failure and network failure), to give an example, is not more fragile than a distributed system with 20 failure scenarios. Failure scenarios and SPOFs can have compensating controls and be mitigated successfully. A complex system where these can’t be fully identified can’t have compensating controls, and its residual risk might be much higher. So yes, a single disk is more likely to fail than 3 disks at once, but this doesn’t give the whole picture.
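To put rough numbers on that last point, here is a minimal sketch. Everything in it is an assumption made up for illustration: the 1% disk failure probability, the idea that replicas fail independently, and a single "coordinator" component standing in for whatever the replicated setup additionally depends on. The function names are hypothetical too.

```python
def single_instance_availability(p_disk_fail: float) -> float:
    """One disk, no extra dependencies: up whenever the disk is up."""
    return 1.0 - p_disk_fail


def replicated_availability(
    p_disk_fail: float, replicas: int, coordinator_availability: float
) -> float:
    """Replicated disks that all depend on one coordinator.

    Assumes disk failures are independent, and that the system is up only
    if the coordinator is up AND at least one replica is still alive.
    """
    all_replicas_down = p_disk_fail ** replicas
    return coordinator_availability * (1.0 - all_replicas_down)


if __name__ == "__main__":
    p = 0.01  # assumed per-disk failure probability over some period

    print("single instance:           ", single_instance_availability(p))
    # A very reliable coordinator: the replicated setup comes out ahead.
    print("3 replicas, coordinator 99.9%:", replicated_availability(p, 3, 0.999))
    # A less reliable coordinator: the replicated setup ends up worse
    # than the single disk, despite the redundancy.
    print("3 replicas, coordinator 98%:  ", replicated_availability(p, 3, 0.98))
```

Under these assumptions, losing all 3 disks at once is far less likely than losing one, but the result is capped by the availability of the extra component the redundancy relies on.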
Tja@programming.dev 4 months ago
The only problem is that the single instance also has 20 scenarios (and keeps the 2 as well), making it more brittle.
A well-designed system removes points of failure. Disk, power and network are the obvious ones, and as long as you keep it Byzantine safe, anything you add should be redundant, so if one component fails the system still runs. Ideally you remove all of them, but if there’s a hidden one left it’s still better than “the whole thing is a single point of failure”.
loudwhisper@infosec.pub 4 months ago
No, it’s not true. A single system has fewer failure scenarios, because it doesn’t depend on external controllers or on anything else that makes the system distributed and that can fail, causing a failure in your system (which may or may not be tolerated).
This is especially true from a security standpoint: complexity adds attack surface.
Simple example: a Kubernetes cluster has more failure scenarios than a single node. With the node you have hardware failure, misconfiguration of the node, and network failure. With a Kubernetes cluster you have all of that for each node (each with marginally less impact, potentially, because that depends for example on stateful storage, and mitigating that introduces other failure scenarios as well). On top of that, if the control plane goes up in flames your node is useless, if the etcd data gets corrupted your node is useless, and anything that happens to resources (a bug, a misuse of the API, etc.) can break your product. You have more failure scenarios because your product depends on more components working at the same time. This is what it means that complexity brings fragility.
Looking at it from the security side: a single instance can be accessed only via SSH, so if you are worried about compromise you have essentially one service to secure. Once you run on Kubernetes you have the CI/CD system, the Kubernetes API, the Kubernetes supply chain, etcd, and, if you are in the cloud, plenty of cloud permissions that can indirectly grant access to the control plane and to a console. Now you need to secure 5-6-7 entry points to a node.
Mind you, I am not advocating against the use of complex systems, sometimes they are necessary, but if the complexity is not fully managed and addressed, you have a more fragile system. Essentially complexity is a necessary evil to respond to some other necessities.
This is the reason why nobody would recommend to someone who needs to run a single static website to run it on Kubernetes, for example.
You say “a well designed system”, but designing well is harder the more complexity exists, obviously. Redundancy doesn’t always work, because redundancy needs coordination, needs processes that also depend on external components.
In any case, I agree that you can build a robust system within Cloud! The argument I am trying to make is that:
And mind you, everything you can do in Cloud you can also do on your own, if you invest in it.
Tja@programming.dev 4 months ago
You make it redundant, I thought I didn’t need to say that…