Comment on A Project to Poison LLM Crawlers

douglasg14b@lemmy.world 4 days ago

This is assuming the content is aggressively cached, yes.

Also, "just text files" describes every website minus media. You can still easily end up with 10+ MB pages this way between the HTML, CSS, JS, and JSON, all of which are text files.

A Gitea repo page, for example, is 400-500 KB transferred (1.5-2.5 MB decompressed), almost all of it text.

If you have a repo with 150 files and the scraper isn't caching assets (many don't), then a single crawl of the repo serves up roughly 60 MB of HTML/CSS/JS alongside the actual repository assets, since each file gets its own rendered page.
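A quick back-of-the-envelope sketch of that arithmetic, using the transferred page size from above (the 150-file repo and 400 KB/page figures are illustrative, not measured from any specific instance):

```python
# Estimate bandwidth served to a scraper that re-downloads every
# rendered file page without caching shared assets (CSS/JS).

FILES_IN_REPO = 150          # one rendered page per file in the repo
PAGE_KB_TRANSFERRED = 400    # low end of the ~400-500 KB Gitea page size

total_mb = FILES_IN_REPO * PAGE_KB_TRANSFERRED / 1024
print(f"~{total_mb:.0f} MB served for one repo crawl")  # ~59 MB
```

And that is per repo, per crawl: a scraper that revisits uncached pages repeats the whole bill each time.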
