Hello all!
Firstly, apologies that I missed last month’s update. As I will go into during this month’s update, the train has not been smooth sailing, the ship missed the station - for a little while, trying to uncover what has gone wrong has been like unraveling a can of worms that has gone off the tracks.
Because of the length of this update, I’m leaving off new communities for this month, but feel free to pop any you’d like to share in the comments!
Server Updates
So, I’ll begin at the 0.19.11 update for Lemmy. I’d planned in some time for the update, and to begin with things went fairly normally, with the usual server updates etc. Then it came time to update Lemmy itself to 0.19.11 - however, instead of the ~10 minutes of suggested downtime, it quickly became apparent that something wasn’t quite right. The logs gave no indication of anything happening, and after an hour it was clear the update hadn’t worked. With not much to go on, I cancelled the upgrade process and turned up the log verbosity so I’d get more information than the standard logs usually show. This turned out to be an internal DNS issue: Lemmy wasn’t talking to the database.
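For anyone curious what “Lemmy wasn’t talking to the database” looks like in practice, here’s a minimal sketch of the kind of connectivity check that narrows it down. The hostname and port are assumptions based on a typical Lemmy docker-compose setup (a `postgres` service on the default port), not our exact config:

```python
# Quick connectivity check: resolve the database hostname and try to open
# a TCP connection to it. Host and port are assumptions, not our real config.
import socket

DB_HOST = "postgres"  # hypothetical service name for the Lemmy database
DB_PORT = 5432        # default PostgreSQL port

try:
    addresses = socket.getaddrinfo(DB_HOST, DB_PORT, proto=socket.IPPROTO_TCP)
    print(f"{DB_HOST} resolves to:", sorted({a[4][0] for a in addresses}))
except socket.gaierror as err:
    # This is the failure mode we were hitting: the name never resolved,
    # so Lemmy sat there waiting on the database and the logs stayed quiet.
    raise SystemExit(f"DNS lookup for {DB_HOST} failed: {err}")

try:
    with socket.create_connection((DB_HOST, DB_PORT), timeout=5):
        print(f"TCP connection to {DB_HOST}:{DB_PORT} succeeded")
except OSError as err:
    raise SystemExit(f"Could not connect to {DB_HOST}:{DB_PORT}: {err}")
```

If the name doesn’t resolve at all, you know it’s DNS rather than the database itself - which is exactly where this one ended up.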
So, onwards again: I fixed that issue and reapplied the update, and this time it appeared to start working. Excitedly, I gave it ten minutes as advised, came back to see that it was still going… and going… and going… and two hours later it was still reporting that something was happening. Well, by this point I’d run out of time and had to trust that the process would resolve itself - I unfortunately had to go to sleep, as I had work the next morning and a baby to look after.
By the next morning, still nothing. However, I can’t SSH into the server from my phone (nor would I ever want to, in case it was stolen etc.), and all the backend stuff is behind layers of protection, so the site was down for the day while I dealt with real-world work nonsense.
By the time I’d got home, it still thought the update was ongoing, so I decided that something had likely broken and cancelled it. Of course, letting something go to town on the database for hours meant it was probably ruined, or would at least need a complete investigation - ain’t nobody got time for that. So I crossed my fingers and hoped my backup solution had worked, which it had, with flying colours. I restored the database and the site came back to life.
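For context, Lemmy stores everything in PostgreSQL, so the backup that saved the day is, at its core, just a regular database dump. Here’s a hedged sketch of a nightly dump along those lines - the database name and paths are placeholders, not our actual setup:

```python
# Minimal sketch of a nightly PostgreSQL dump using the standard
# pg_dump/pg_restore workflow. All names and paths are placeholders.
import datetime
import pathlib
import subprocess

BACKUP_DIR = pathlib.Path("/var/backups/lemmy")  # hypothetical location
DB_NAME = "lemmy"                                # hypothetical database name

def nightly_dump() -> pathlib.Path:
    """Write a compressed custom-format dump that pg_restore can replay."""
    BACKUP_DIR.mkdir(parents=True, exist_ok=True)
    stamp = datetime.date.today().isoformat()
    outfile = BACKUP_DIR / f"{DB_NAME}-{stamp}.dump"
    subprocess.run(
        ["pg_dump", "--format=custom", f"--file={outfile}", DB_NAME],
        check=True,
    )
    return outfile

if __name__ == "__main__":
    dump = nightly_dump()
    # Restoring is the reverse step, roughly:
    #   pg_restore --clean --if-exists --dbname=lemmy /var/backups/lemmy/<dump>
    print(f"Backup written to {dump}")
```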
Of course, I don’t like running old versions, so I rescheduled the upgrade, which went much, much smoother the second time around, taking 8 minutes. Whew.
For a few days, everything looked great. Metrics looked spot on and the site was running like it always had - and then bam, the server completely locked up: Image
A restart did eventually fix this, and things returned to normal - but not quite as they were before.
On the 24th of April, I started getting reports of the site returning various 502 and 504 errors, along with issues loading images. Broken images have been a feature of the image proxy for a while, but this was on another scale, with sometimes whole pages of broken images and a connection timeout on every refresh. I was, in hindsight, very out of my depth here.
I started trying to work out what was causing the timeouts, and it quickly became apparent that the site was being hit by previously-unseen levels of traffic: Image On the above graph, the left-hand side shows the usual amount of traffic we’d get, and the right-hand side shows the new levels we were receiving. This was effectively DDoSing the server and uncovering issues with its performance. For a comparison with the usual amount of traffic we get, here’s a graph from the last server update: Image
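To give an idea of how we went about digging into that traffic, here’s a rough sketch of the sort of log crunching involved: tallying up client IPs and user agents from a reverse-proxy access log. The log path and the nginx “combined” format are assumptions, not necessarily our exact setup:

```python
# Rough sketch: count top client IPs and user agents from an access log
# in nginx's default "combined" format. Path and format are assumptions.
import collections
import re

LOG_PATH = "/var/log/nginx/access.log"  # hypothetical path

# combined format: ip - user [time] "request" status bytes "referer" "user-agent"
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

ips = collections.Counter()
agents = collections.Counter()

with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        match = LINE_RE.match(line)
        if match:
            ips[match["ip"]] += 1
            agents[match["agent"]] += 1

print("Top client IPs:")
for ip, count in ips.most_common(10):
    print(f"  {count:>8}  {ip}")

print("Top user agents:")
for agent, count in agents.most_common(10):
    print(f"  {count:>8}  {agent}")
```

A handful of user agents dominating the counts is usually a good sign you’re looking at bots rather than people.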
Many configs were tweaked. Much database performance monitoring took place. Things were turned off and back on indiscriminately. Unfortunately not much worked.
It was at this point our new Admin, Gazby, joined the team - and quickly got to work diagnosing and fixing so many of the issues.
Some of the changes include much better backups, reduced latency in image storage (the images moved from US to EU storage), images being served from i.lemmy.zip to allow for better caching and reduce the load on the server, a complete review and tweak of all the configs and server setup, and a detailed list of things we need to work on going forwards (plus no doubt lots of things I’m forgetting).
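As a small example of the caching side of that work, here’s a hedged sketch of how you might check whether an image URL is coming back with cacheable headers. The URL is a placeholder, and the CF-Cache-Status header assumes the image host sits behind Cloudflare:

```python
# Minimal sketch: inspect the caching-related headers on an image response.
# The URL is a placeholder, and CF-Cache-Status assumes Cloudflare is in front.
import urllib.request

IMAGE_URL = "https://i.lemmy.zip/example.webp"  # hypothetical image URL

request = urllib.request.Request(IMAGE_URL, method="HEAD")
with urllib.request.urlopen(request, timeout=10) as response:
    for header in ("Cache-Control", "Age", "CF-Cache-Status"):
        print(f"{header}: {response.headers.get(header, '<not set>')}")
```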
This has really helped to stabilize the site, and while we can see it’s not 100% perfect, the number of 502/504 errors should be massively reduced and the site should be running almost as it used to.
One of the things Gazby has also worked on is reporting and insights into the server, so, for example, here’s a graph showing the 504 errors today:
We’ve got some plans on the horizon which should increase the performance of the site - possibly (probably) including a server move, to rule out hardware issues or latent config issues we can’t find.
I’m still not sure what triggered these issues - it looks like it could be scrapers/AI bots hammering the site, so we’ve put some measures in place. If you use the old front end, you may have noticed a Cloudflare challenge to try and prevent the server from being overloaded. Also, a forewarning: the old front end is no longer maintained and, unless someone steps in, it is not compatible with the next version of Lemmy, so at that point it will be retired unless someone works on it to bring it back in line. It is also the cause of a LOT of the server traffic problems, so retiring it probably isn’t too much of a bad thing. Maybe someone will rewrite it to be better :)
As it stands, things still aren’t perfect - we had an issue where the bit of the server that directs requests to the right place got overwhelmed, so we’ve put a fix in for that - but things are hopefully a lot better than they were a couple of months ago.
Donations
Lemmy.zip only continues to exist because of the generous donations of its users. The operating cost of Lemmy.zip is over 60 euros a month ($60, £50) and is mostly funded by the community!
We keep all the details around donations on our OpenCollective page, with full transparency around income and expenditure.
If you’re enjoying Lemmy.zip, please check out the OpenCollective page, where we have a selection of one-off and recurring donation options. All funds go directly to hosting the site and keeping the virtual lights on.
We’ve also put up a link to our Ko-Fi page, where you can donate via PayPal instead of using a card. All Ko-Fi donations will be totalled up and added to OpenCollective each month for transparency. I’ve added a link in the sidebar, but you can also click the image below to go there:
We continue to have some really kind and generous donors and I can’t express my thanks enough. You can see them all in the Thank You thread - you could get your name in there too!
Please remember, traditional social media is only “free” to you because they sell your data. We don’t do that - if you want to support independent social websites like this one and you value your privacy, please consider a small donation. It really does help.
Graphs
I know you’ve come for the shiny pictures, so here you go!
CPU over last 30 days: Image
RAM over last 30 days: Image
Disk space used: Image
Here are a few new ones for you! Lemmy DB size: Image
Images database size: Image
Here’s the number of images we currently store:
Here’s the Cloudflare overview: Image
Here are requests: Image
Bandwidth: Image
And here are visitors: Image
And finally, traffic by country (mostly federation traffic, remember!): Image
So hopefully that fills everyone in on where we’re up to, and what we’re working on, but if you have any questions please ask away below!
One final thing - on the 10th of June, Lemmy.zip turns two years old!! 🥳 🎉 I’m hoping to do something nice for it, similar to last year’s (which is here if you haven’t seen it!) - but a quick thank you to everyone who has been part of the ride so far!
Thanks
Demigodrick
scytale@lemmy.zip 21 hours ago
Hey! I recently migrated from .ee and I use alexandrite.app as my front-end. It seems post images are not being proxied properly and have to be clicked to open them. I do see there was a post from a month ago about the issue. Has anything changed since then?
Demigodrick@lemmy.zip 21 hours ago
Hey, would it be possible to link me to where you’re seeing this behaviour please?
scytale@lemmy.zip 21 hours ago
Sure! Here’s an example: alexandrite.app/lemmy.zip/post/40268056
On my .ee account, it looks like this when I open the post:
Image
On my .zip account, it looks like this, and I have to open the image in another tab to see it in full view:
Image