cross-posted from: lemmy.zip/post/51866711
Signal was just one of many services brought down by the AWS outage.
Submitted 15 hours ago by schizoidman@lemmy.zip to technology@lemmy.world
https://www.theverge.com/news/807147/signal-aws-outage-meredith-whittaker
So much talking out of ass in these comments.
Federation/decentralization is great. It’s why we’re here on Lemmy.
It also means you expect everyone involved, people you've never met or vetted, to be competent and able to shell out the cash and time to commit to a certain level of uptime. That's unacceptable for a high-SLA product like Signal. Hell, midwest.social, the Lemmy instance I'm on, is very often quite slow. I and others put up with it because we know it's run by one person on one server that he's presumably paying for himself. But that doesn't reflect Lemmy as a whole.
AWS isn't just a bunch of servers. They have dedicated services for database clusters, cache stores, data warehousing, load balancing, container clusters, Kubernetes clusters, CDN, web application firewall, to name just a few. Every region has multiple datacenters, and Northern Virginia's is by far the largest region. By default most people use one DC, but multi-region, while a huge and expensive lift, is something they already have tools to assist with. Also, and maybe most importantly, AWS, Azure, and GCP run their own backbones between their datacenters rather than relying on the shared one that you, I, and most other smaller DCs are using.
I’m a DevOps Engineer but I’m no big tech fan. I run my own hobby server too. Amazon is an evil company. But the claim that “multi cloud is easy, smaller CSPs are just as good” is naive at best.
Ideally some legislation comes in and forces these companies to simplify the process for adopting multi-cloud, because right now you have to build it all yourself, and it still ends up very imperfect once you start factoring in things like databases.
Can’t find a screenshot, but when you’re logged in and click for the screen to show all AWS products, holy shit. AWS is far more than most people think.
Not to mention the fact that the grand majority of federated services have extremely unsustainable performance characteristics that make them effectively impossible to scale beyond hobby projects.
DevOps here too, I’ve been starting to slide my smaller redundant services into k8s. I had to really defend my position not to use EKS.
No, we're using kubeadm because I don't want to give a damn whether it's running in the office, or Google, or Amazon, or my house. It's WAY harder and more expensive than setting up an EKS and an EC/Aurora cluster, but I can bypass vendor lock-in. Setting up my own clusters and replicas is a never-ending source of work.
Her real comment was that there are only 3 major cloud providers they can consider: AWS, GCP, and Azure. They chose AWS and AWS only. So there are a few options for them going forward: 1) keep doing what they're doing and hope a single cloud provider can improve reliability, 2) modify their architecture to a multi-cloud architecture, given that the odds of more than one major provider going down simultaneously are much lower, or 3) build their own datacenters or use colos, which have a learning curve but are still viable alternatives. Those that are serious about software own their own hardware, after all.
Each choice has its strengths and drawbacks. The economics are tough with any choice. Comes down to priorities, ability to differentiate, and value in differentiation :)
Meredith mentioned in a reply to her posts that they do leverage multi-cloud and were able to fall back onto GCP (Google Cloud Platform), which enabled Signal to recover quicker than just waiting on AWS. I’d link to source but on phone, it’s somewhere in this thread: https://mastodon.world/@Mer__edith/115445701583902092
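A client-side sketch of that kind of multi-cloud fallback (the provider names and send functions below are hypothetical stand-ins, not Signal's actual code):

```python
def send_with_fallback(payload, providers):
    """Try each provider in order; return the first success.

    `providers` is a list of (name, send_fn) pairs, e.g. the primary
    AWS endpoint first and a GCP fallback second.
    """
    last_error = None
    for name, send in providers:
        try:
            return name, send(payload)
        except ConnectionError as err:
            last_error = err  # this provider is down; try the next one
    raise last_error

# Simulate the outage: the AWS path raises, the GCP path accepts.
def aws_send(payload):
    raise ConnectionError("us-east-1 unavailable")

def gcp_send(payload):
    return "delivered"

used, result = send_with_fallback(
    {"msg": "hi"}, [("aws", aws_send), ("gcp", gcp_send)]
)
```

The hard parts in practice (state replication, DNS cutover, consistent auth across providers) don't show up in a toy like this, which is why multi-cloud is expensive rather than conceptually hard.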
I’m sorry, what, a balanced and informed answer? Surely you must be joking!
What reason do they give for only wanting to use those three cloud providers? There are many others.
Scale, they need worldwide coverage.
Those are the only 3 that matter at the top tier/enterprise class of infrastructure. Oracle could be considered as well for nuanced/specialized deployments that are (largely) Oracle DB heavy; but AWS is so far ahead of Azure and GCP from a tooling standpoint it’s not even worth considering the other two if AWS is on the table.
It's so bad with other cloud providers that ones like Azure offer insane discounts on their MSSQL DB licensing (basically "free") just to get you to use them over AWS. Sometimes the cost savings are worth it, but you take a usability and infrastructure-cost hit by using anything other than AWS.
I honestly, legitimately wish there were some other cloud provider out there that could do what AWS can do, but they don't exist. Anyone else is a pale imitation from a DevOps perspective. It sucks. There should be real competitors, especially to the US-based cloud companies since the US cannot be trusted anymore, but they just don't exist without taking a huge hit in tools, APIs, and reliability options compared to AWS.
Didn't only 1 AWS region go down? Maybe before even thinking about anything else, they should focus on redundancy within AWS.
us-east-1 went down. Problem is that IAM's global services run through that region. Any code relying on an IAM role could fail to authenticate.
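One partial mitigation is pinning credential traffic to regional STS endpoints instead of the legacy global one, which is served out of us-east-1. A sketch (the endpoint pattern is the documented regional-STS one; the helper function is mine):

```python
def sts_endpoint(region=None):
    """Pick an AWS STS endpoint.

    The legacy global endpoint rides on us-east-1, so pinning
    role-assumption calls to a regional endpoint removes one
    us-east-1 dependency from your auth path.
    """
    if region is None:
        return "https://sts.amazonaws.com"  # global: backed by us-east-1
    return f"https://sts.{region}.amazonaws.com"

# Usage with boto3 would look roughly like:
# boto3.client("sts", region_name="eu-west-1",
#              endpoint_url=sts_endpoint("eu-west-1"))
```

This only helps the data path for token issuance; IAM's control plane (creating/updating roles and policies) is still centralized.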
I hardly touched AWS at my last job, but listening to my teammates and seeing their code led me to believe IAM is used everywhere.
Apparently even if you're fully redundant, there are a lot of core services in us-east-1 that you rely on.
No, there isn't, if you design your infrastructure correctly…
This is the actual realistic change a lot of people are missing. Multi-cloud is hard and imperfect and brings its own new potential issues. But AWS does give you tools to adopt multi-region. It's just very expensive.
Unfortunately DNS transcends regions, though, so that can't really be escaped.
This has been my biggest pet peeve in the wake of the AWS outage. If you’d built for high-availability and continuity then this event would at most have been a minor blip in your services.
Yeah, but if you want real redundancy, you pay double. My team looked into it. Even our CEO, no tightwad, just laughed and shook his head when we told him.
Why is it that only the larger cloud providers are acceptable? What’s wrong with one of the smaller providers like Linode/Akamai? There are a lot of crappy options, but also plenty of decent ones. If you build your infrastructure over a few different providers, you’ll pay more upfront in engineering time, but you’ll get a lot more flexibility.
For something like Signal, it should be pretty easy to build this type of redundancy since data storage is minimal and sending messages probably doesn’t need to use that data storage.
Akamai isn't small, hehe
It is, compared to AWS, Azure, and Google Cloud. Here’s 2024 revenue to give an idea of scale:
The smallest on this list has 10x the revenue of Akamai.
Here are a few other providers for reference:
I’m arguing they could put together a solution with these smaller providers. That takes more work, but you’re rewarded with more resilience and probably lower hosting costs. Once you have two providers in your infra, it’s easier to add another. Maybe start with using them for disaster recovery, then slowly diversify the hosting portfolio.
Also, you know… building your own data centers / colocating. Even with the added man-hours required, it ends up being far cheaper.
But far less reliable. If your data center has a power outage or internet disruption, you're screwed. Signal isn't big enough to have several data centers for geographic diversity and redundancy; they're maybe a few racks total.
Colo is more feasible, but who is going to travel to the various parts of the world to swap drives or whatever? If there’s an outage, you’re talking hours to days to get another server up, vs minutes for rented hosting.
For the scale that signal operates at and the relatively small processing needs, I think you’d want lots of small instances. To route messages, you need very little info, and messages don’t need to be stored. I’d rather have 50 small replicas than 5 big instances for that workload.
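A sketch of how routing across many small replicas can stay simple (a hypothetical helper, assuming stateless message relays):

```python
import hashlib

def pick_replica(conversation_id, replicas):
    """Stable hash-mod assignment: the same conversation always lands
    on the same replica, so losing one replica affects only its own
    slice of traffic instead of the whole service."""
    digest = hashlib.sha256(conversation_id.encode()).digest()
    return replicas[int.from_bytes(digest[:8], "big") % len(replicas)]

replicas = [f"relay-{i}" for i in range(50)]
chosen = pick_replica("alice/bob", replicas)
```

Plain hash-mod reshuffles most assignments when the replica count changes; consistent or rendezvous hashing avoids that, at the cost of a slightly more involved lookup.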
For something like Lemmy, colo makes a ton of sense though.
I call bullshit too. If it's too expensive for them, just decentralize the project. Self-hosters all around the world would help. I alone have better uptime than AWS and probably wouldn't even notice usage from a few hundred thousand users.
You can't run a professional service on self-hosters' hardware…
You, on a single ISP, relying on the world's shared backbone rather than your own between multiple DCs within a region and multiple regions around the world, have better uptime than AWS?
Stop.
I’m all for decentralizing for the case of no single entity controlling everything, but not for the case of uptime. That is one thing you give up with services like Matrix or Lemmy.
AWS actually has an SLA it’s contractually committed to when you pay them with thousands of engineers working to maintain it.
Well yes, considering the downtime they had. An SLA is just words on paper; you also need to not fuck your infrastructure up. Even if all self-hosters had 99% uptime, which is bad, it's easy to build a system that replicates data across a few of them to achieve resiliency. People need to stop assuming they can be 100% reliant on a single host, and actually design their systems to take downtime into account and recover from it.
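The resiliency arithmetic above checks out, at least in a toy model that assumes independent failures (which hosts sharing ISPs or power grids are not):

```python
def all_down_probability(node_uptime, replicas):
    """Chance that every replica holding a piece of data is down at
    the same moment, assuming failures are independent."""
    return (1 - node_uptime) ** replicas

# 99% per-node uptime ("bad" for a single host), replicated 3 ways:
# 0.01 ** 3 = 1e-6, i.e. ~99.9999% availability for the replicated data.
p = all_down_probability(0.99, 3)
```

Correlated failures (same ISP, same datacenter, same buggy update) are exactly what breaks this model, which is why geographic and administrative diversity matters more than the raw replica count.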
Matrix solved this with decentralization and federation. Don't tell me it's not possible.
Decentralized Matrix isn't at production quality. The only Matrix users with no issues are the ones who, along with all their contacts, have accounts on matrix.org. So they use it as a centralized app.
And your source?
I've been running and been part of a Matrix server for years and have experienced near-zero problems so far.
Session is a decentralized alternative to Signal. It doesn't require a phone number, and all traffic is routed through a Tor-like onion network. Relays are run by the community, and relay operators are rewarded with a crypto token for their troubles. To prevent bad actors from attacking the network, you have to stake some of those tokens in order to run a relay, and if your node misbehaves that stake will get slashed.
shame their entire node system relies on cryptobros tech.
Tor doesn't need a currency to back it up. I2P doesn't need a currency to back it up. Why the hell does Lokinet?
Tor relays only relay the traffic; they don't store anything (other than HSDirs, but that's minuscule). Session relays have to store all the messages, pictures, and files until the user comes online and retrieves them. Obviously all that data would be too much to store on every single node, so instead it is spread across only 5-7 nodes at a time. If all of those nodes were to go offline at the same time, messages would be lost, so there has to be some mechanism that discourages taking nodes offline without giving the network a notice period. Without the staking mechanism, an attacker could spin up a bunch of nodes and then take them all down relatively cheaply, leaving users' messages undelivered. It also incentivizes honest operators to ensure their node's reliability and rewards them for it, which, even if you run your node purely for altruistic reasons, is a nice bonus, so I don't really see any downside to it, especially since the end user doesn't need to interact with it at all.
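The per-message node selection described above can be sketched with rendezvous hashing; this is an illustration of the idea, not Session's actual algorithm:

```python
import hashlib

def swarm_for(message_id, nodes, k=5):
    """Deterministically pick the k storage nodes for a message.

    Every participant can recompute the same swarm from the message
    ID alone, and removing one node only shifts that node's share of
    messages rather than reshuffling everything."""
    def weight(node):
        h = hashlib.sha256(f"{message_id}:{node}".encode()).hexdigest()
        return int(h, 16)
    return sorted(nodes, key=weight, reverse=True)[:k]

nodes = [f"node-{i}" for i in range(30)]
swarm = swarm_for("msg-123", nodes)
```

The staking/slashing layer then sits on top of a scheme like this: the protocol can detect which of the k assigned nodes failed to serve their stored messages.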
Can you think of another way for people across the world to easily pay each other directly?
I would not recommend it. Session is a Signal fork that deliberately removes forward secrecy from the protocol and uses weaker keys. The removal of forward secrecy means that if your private key is ever exposed, all your past messages could be decrypted.
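What forward secrecy buys you, in miniature: a one-way hash ratchet (a toy sketch, not Signal's actual Double Ratchet) derives a fresh key per message, so capturing today's chain key can't be run backwards to decrypt yesterday's traffic:

```python
import hashlib

def ratchet_step(chain_key):
    """Derive a per-message key and the next chain key.

    SHA-256 is one-way, so earlier message keys cannot be recovered
    from any later chain key: compromising the current state doesn't
    expose past messages."""
    message_key = hashlib.sha256(b"message" + chain_key).digest()
    next_chain_key = hashlib.sha256(b"chain" + chain_key).digest()
    return message_key, next_chain_key

ck = b"initial shared secret"
keys = []
for _ in range(3):
    mk, ck = ratchet_step(ck)
    keys.append(mk)
# three distinct message keys; the final `ck` reveals none of them
```

With a static key instead (as the comment above describes for Session), there is nothing to ratchet: one leaked key decrypts the whole recorded history.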
This is a bad tool, but even if it weren't, the no-phone-number thing is an anti-feature for most of the population.
The main issue with Session is they removed PFS when they redesigned everything. Also, it’s admittedly been years since I tried it, but I remember the app being noticeably buggy.
It’s gotten more usable over the past couple of years. Sadly, I just got done getting all my family/friend contacts to get on Signal (they’d much prefer to use WhatsApp) so Session remains a lonely place for me. I seem to use it solely as a place to stash notes for myself, even though I do this with Signal as well.
I don’t know that we’ll ever see a messenger that both appeals to everyone and has all the features we want (from privacy to visual appeal).
Just use Briar or SimpleX instead of this clown's service with no perfect forward secrecy.
I found it workable when I tried it recently, but wound up going with SimpleX. I like the multi-identity system, and you can proxy it through Tor. Found the app customization more fleshed out too.
The phrasing of the quotes is very “I sure hope someone comes along and fixes this for me because I’m not going to”
They are serving one-on-one chats and group chats. That practically partitions itself. There are many server lease options all over the world. My assumption is that they use some AWS service and now can't migrate off. But you need an on-call team anyway, so you aren't buying that much convenience.
There are many server lease options all over the world
It increases complexity a lot to go with a bunch of separate server leases. There’s a reason global companies use hyperscalers instead of getting VPSes in 30 or 40 different countries.
I hate the centralization as much as everyone else, but for some things it’s just not feasible to go on-prem. I do know an exception. Used to work at a company with a pretty large and widely spread out customer base (big corps on multiple continents) that had its own k8s cluster in a super secure colocation space. But our backend was always slow to some degree (in multiple cases I optimized multi-second API endpoints into 10-200ms), we used asynchronous processing for the truly slow things instead of letting the user wait for a multi-minute API request, and it just wasn’t the sort of application that you need to be super fast anyway, so the extra milliseconds of latency didn’t matter that much, whether it was 50 or 500.
But with a chat app, users want it to be fast. They expect their messages to be sent as soon as they hit the send button. It might take longer to actually reach the other people in the conversation, but it needs to be fast enough that if the user hits send and then immediately closes the app, it’s sent already. Otherwise it’s bad UX.
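One common way to get that "sent the instant you tap send" feel is a durable local outbox: the app persists the message immediately and a background worker drains it later. A minimal sketch (real apps would use the platform's background-task and storage APIs, not a flat file):

```python
import json
import os

class Outbox:
    """Append-only local queue. enqueue() returns as soon as the
    message is on disk, so the UI can show it as sent even if the
    app is killed before the network send happens."""

    def __init__(self, path):
        self.path = path

    def enqueue(self, message):
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(message) + "\n")
            f.flush()
            os.fsync(f.fileno())  # survive an immediate app kill

    def pending(self):
        """Messages still awaiting delivery by the background worker."""
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]
```

The latency the user perceives is then a local disk write, not a round trip to whatever datacenter happens to be up.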
It's weird for Signal to not be able to do what Telegram does. Yes, for this particular purpose they're no different.
It was only down for some users, not all, anyway.
Wrong. It is actually quite easy to use multiple clouds with the help of OpenTofu, so it's just a cheap excuse.
Alright, but then at the very least have a fallback implemented. Right?
Excuse me, but I don’t believe this BS.
Gifs you can hear ❤️
I'm going to call bullshit, in that there are several networks that might be capable of doing this, such as various blockchain networks or IPFS.
I’m going to call bullshit on the underlying assertion that Signal is using Amazon services for the sake of lining Jeff’s pocket instead of considering the “several” alternatives. As if they don’t have staff to consider such a thing and just hit buy now on the Amazon smile.
In any monopoly, there are going to be smaller, less versatile, less reliable options. Fine and dandy for Mr Joe Technology to hop on the niche wagon and save a few bucks, but that’s not going to work for anyone casting a net encompassing the world.
Lol imagine still being stuck on Blockchain in 2025
SimpleX literally solves the messaging problem. You can bounce through their default relay nodes or run your own to use exclusively or add to the mix. It’s all very transparent to end users.
At most, an AWS outage would have only affected chats relayed through those AWS servers.