Comment on I'm Starting A Search Engine For The Fediverse

TimLovesTech@badatbeing.social 1 year ago

Since the fediverse is run almost exclusively by volunteers who pay the server bills and act as admins, I could see some larger instances not taking kindly to this, especially depending on how much stress it would put on servers that are already at capacity.


loobkoob@kbin.social 1 year ago
Ideally, OP's crawlers will just come from their own instance that other instance owners can defederate from if they want to opt out.

- lautan@lemmy.ca 1 year ago
  Yeah, that would be the case.
  
  - scrubbles@poptalk.scrubbles.tech 1 year ago
    That’s a good idea. Listen to the public data being broadcast, and then you aren’t worrying people with scraping or anything. It would only cover posts from go-live onward, but you would just be listening to the protocol.
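
A rough sketch of the "just listen" idea, assuming the search instance receives federated activities as already-parsed JSON dicts. The function names and the in-memory `index` dict are made up for illustration. The public-addressing IRI is the one ActivityPub defines, but real servers may wrap posts in other activity types or use compact forms of that IRI, which this sketch does not handle.

```python
# Full IRI that ActivityPub uses to address the public collection.
PUBLIC = "https://www.w3.org/ns/activitystreams#Public"


def _as_list(value):
    """Normalise a JSON value that may be missing, a single item, or a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def is_public(activity):
    """True if the activity is addressed to the public collection via to/cc."""
    recipients = _as_list(activity.get("to")) + _as_list(activity.get("cc"))
    return PUBLIC in recipients


def index_activity(index, activity):
    """Store a public Create activity's object in `index`; return True if stored.

    `index` is a plain dict standing in for a real search backend (assumption).
    """
    if activity.get("type") != "Create" or not is_public(activity):
        return False
    obj = activity.get("object")
    if not isinstance(obj, dict) or "id" not in obj:
        return False
    index[obj["id"]] = obj.get("content", "")
    return True
```

Nothing here makes a request to anyone else's server. It only indexes what gets delivered, which is why it can only cover posts from go-live onward.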
    
    TimLovesTech@badatbeing.social 1 year ago
    For that to happen organically on an instance, users would need to visit all these instances/communities. To speed that up you would need a bot to do all that “seeding” for you. That brings you full circle to the server resources on bigger instances.
    
    This seems like an opt-in, not an opt-out activity.
    
TrickDacy@lemmy.world 1 year ago
How much bandwidth do you suppose a crawler would use? I’d guess very little.

- TimLovesTech@badatbeing.social 1 year ago
  I was thinking more in terms of resources (number of spider threads × posts/communities/users being indexed) that would now be dedicated to a bot, not so much network traffic, which is probably tiny if it isn’t downloading images.
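
To put a rough shape on that, here is a back-of-envelope estimate. The function name and every number passed to it are hypothetical, not measurements from any real instance. It only assumes the crawler pulls a fixed number of items per request at a fixed request rate.

```python
import math


def estimate_crawl(items, items_per_request, requests_per_second):
    """Return (total requests, total seconds) for one full pass over an instance.

    All inputs are illustrative guesses supplied by the caller.
    """
    if items < 0 or items_per_request <= 0 or requests_per_second <= 0:
        raise ValueError("items must be >= 0; the other inputs must be > 0")
    requests = math.ceil(items / items_per_request)
    seconds = requests / requests_per_second
    return requests, seconds


# Made-up example: 1,000,000 items, 50 per request, 2 requests per second.
requests, seconds = estimate_crawl(1_000_000, 50, 2.0)
print(requests, seconds / 3600)  # request count and hours for one pass
```

Each of those requests is a query against the instance's database, so the cost lands on the server's side even when the bytes on the wire are few.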
  
  - TrickDacy@lemmy.world 1 year ago
    Right, it would be an initial hit, but if the bot was properly built it wouldn’t need to do a full reindexing very often. I’m no expert, but I think it could be done in a way that causes no noticeable spike in traffic or anything.
    
    TimLovesTech@badatbeing.social 1 year ago
    That’s the thing: if you want a complete index of an instance, it would need to be done in chunks, with revisits scheduled. For a large instance that’s a lot of DB thrashing unless you space it out or just sample something like the “top 10 posts”, and depending on the goal of the search engine, that kind of sampled data is going to make it useless. If you just wanted to catalog the daily top posts of the fediverse that might work, but if you want to catalog everything, it’s going to take a lot of resources and a long time to make sure you’re not hammering people’s servers.
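
A minimal sketch of what "chunks, spaced out" could look like, assuming the instance exposes its listing as numbered pages. The names (`plan_crawl`, `PlannedFetch`) and the delay values are hypothetical. It only plans when each page would be fetched and does no fetching itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlannedFetch:
    page: int        # 1-based page number of the listing
    start_at: float  # seconds after the crawl begins


def plan_crawl(total_pages, chunk_size, delay_between_requests, pause_between_chunks):
    """Spread page fetches over time: a short delay between requests inside a
    chunk, and a longer pause after each full chunk."""
    if total_pages < 0 or chunk_size <= 0:
        raise ValueError("total_pages must be >= 0 and chunk_size > 0")
    plan = []
    clock = 0.0
    for page in range(1, total_pages + 1):
        plan.append(PlannedFetch(page, clock))
        if page % chunk_size == 0:
            clock += pause_between_chunks    # end of a chunk: back off longer
        else:
            clock += delay_between_requests  # inside a chunk: short gap
    return plan


# Made-up numbers: 5 pages, chunks of 2, 1 s between requests, 10 s between chunks.
for fetch in plan_crawl(5, 2, 1.0, 10.0):
    print(fetch.page, fetch.start_at)
```

Scheduling revisits would be the same plan run again later, ideally fetching only what changed since the last pass. Either way, the longer the pauses, the longer a full index of a big instance takes.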
    
- lautan@lemmy.ca 1 year ago
  It will be very little if it isn’t downloading full HTML pages.
  