We have paused all crawling as of Feb 6th, 2025 until we implement robots.txt support. Stats will not update during this period.
Forced to use lemmy.fediverse.observer/list to see which instances are the most active
Submitted 2 months ago by mesamunefire@lemmy.world to fediverse@lemmy.world
https://lemmy.world/pictrs/image/4281c731-b98c-4e63-bd69-6360708ff5a4.png
This looks more accurate than FediDB, TBH. You can see the initial surge from Reddit back in 2023, then the slow fall-off in active members. I personally think the reason the number of users drops so much is that certain instances turn off the ability for outside crawlers to get their user info.
Did someone complain? Or why stop?
No idea, honestly. If anyone knows, let us know! I don't think it's necessarily a bad thing: if their crawler was being too aggressive, it could accidentally DDoS smaller servers. I'm hoping that's what they're doing, and that they'll respect the robots.txt that some sites have.
GoToSocial has a setting in development that is designed to baffle bots that don't respect robots.txt. FediDB didn't know about that feature and thought GoToSocial was trying to inflate its stats.
In the arguments that went back and forth between the devs of the apps involved, it turned out that FediDB was ignoring robots.txt; i.e., it was badly behaved.
I think it's just one HTTP request to the nodeinfo API endpoint once a day or so. Can't really be an issue regarding load on the instances.
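For reference, a rough sketch of what that daily check could look like (this follows the standard NodeInfo discovery flow; I don't know FediDB's actual code, and `lemmy.world` is just an example instance):

```python
import requests

def fetch_nodeinfo(instance: str) -> dict:
    """Two small GETs per instance: discovery, then the stats document."""
    # /.well-known/nodeinfo lists the NodeInfo endpoints the server offers.
    discovery = requests.get(
        f"https://{instance}/.well-known/nodeinfo", timeout=10
    ).json()
    # Follow the first advertised link to the actual document.
    nodeinfo_url = discovery["links"][0]["href"]
    return requests.get(nodeinfo_url, timeout=10).json()

info = fetch_nodeinfo("lemmy.world")
print(info["usage"]["users"])  # totals plus monthly/half-year active counts
```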
stoped
Well, they needed to stope. Stope, I said. Lest thy carriage spede into the crosseth-rhodes.
We can’t afford to wait at every sop, yeld, or one vay sign!
Whan that Aprill with his shoures soote
lol FediDB isn't a crawler, though. It makes API calls.
lemmyverse.net still crawling, baby. 🤘
Semi_Hemi_Demigod@lemmy.world 2 months ago
Robots.txt is a lot like email in that it was built for a far simpler time.
It would be better if the server could detect bots and send them down a rabbit hole rather than trusting randos to abide by the rules.
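A toy sketch of the rabbit-hole idea in Python (purely illustrative: every page is generated on demand, links only to more generated pages, and is served slowly to waste the bot's time):

```python
import random
import time
from flask import Flask

app = Flask(__name__)

@app.route("/maze/<token>")
def maze(token: str):
    time.sleep(2)  # drip-feed the response to tie the crawler up
    # Every page links to five more pages that exist only when requested.
    links = " ".join(
        f'<a href="/maze/{random.getrandbits(64):x}">continue</a>'
        for _ in range(5)
    )
    return f"<html><body>{links}</body></html>"

if __name__ == "__main__":
    app.run()
```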
swizzlestick@lemmy.zip 2 months ago
Already possible: Nepenthes.
Semi_Hemi_Demigod@lemmy.world 2 months ago
I’m sold
Skepticpunk@lemmy.world 2 months ago
Ooh, nice.
merthyr1831@lemmy.ml 1 month ago
that website feels like uncovering a piece of ancient alien weaponry
poVoq@slrpnk.net 2 months ago
Because of AI bots ignoring robots.txt (especially when you don't explicitly name their user-agent and instead use a * wildcard), more and more people are implementing exactly that, and I wouldn't be surprised if that is what triggered the need for robots.txt support in FediDB.
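For illustration, the difference in a robots.txt (GPTBot is one real example of a named AI crawler; the rules themselves are made up):

```
# Wildcard rule, the one some AI crawlers reportedly ignore:
User-agent: *
Disallow: /private/

# Explicitly named bot, more likely to be honored:
User-agent: GPTBot
Disallow: /
```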
jagged_circle@feddit.nl 2 months ago
It is not possible to reliably detect bots. Attempting to do so invariably leads to false positives that deny access to your content, usually to the most at-risk and marginalized folks.
Just implement a cache and forget about it. If read-only content is causing you too much load, you're doing something terribly wrong.
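For example, a minimal Flask sketch (the endpoint path follows NodeInfo convention; the numbers are made up). Serving the rarely-changing stats document with a Cache-Control header lets a proxy or CDN absorb repeat fetches:

```python
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/nodeinfo/2.0")
def nodeinfo():
    # The stats change slowly, so let caches reuse the response for a day.
    resp = jsonify({"usage": {"users": {"total": 1234, "activeMonth": 56}}})
    resp.cache_control.public = True
    resp.cache_control.max_age = 86400
    return resp
```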
bhamlin@lemmy.world 1 month ago
While I agree with you, the quantity of robots has greatly increased of late. They're still not as numerous as users, but they hit every link and wreck your caches by not focusing on hotspots the way humans do.
Fredthefishlord@lemmy.blahaj.zone 1 month ago
False positives? Meh, who cares … That's what appeals are for.