LLM-driven web scraping hits some sites hard, so their bot-detection software is tuned aggressively, which creates a lot of false positives.
Obscuring your browser fingerprint, blocking JavaScript, or using an unusual user-agent string can each trigger a captcha challenge.
If you’re not doing any of that and a site suddenly starts giving you captchas, it may be getting DDoS’d by scrapers and challenging all clients.
A site that archives content is especially vulnerable because it holds a lot of the data that is useful for AI training.
It is incredibly annoying, but until we have a robust way of proving identity that can’t be gamed by bad actors, we’re stuck with individual user challenges.
mjr@infosec.pub 3 days ago
Not every time, but far too often. They don’t seem to care that they’re discriminating against people with AV impairment, plus locking out some secure browsers.
ilovepiracy@lemmy.dbzer0.com 2 days ago
Just a heads up, archive.is is not related to the Internet Archive and, I believe, is run by a solo dev with private funding.
mjr@infosec.pub 2 days ago
I looked into who runs it a bit and oh wow, it’s far far worse than that. If you get a captcha from archive.is / archive.ph / archive.today and allow it scripting permission, it seems to use your browser as part of a DDoS attack. See infosec.exchange/@iampytest1/115902693235671566 and linked pages.
Arcane2077@sh.itjust.works 3 days ago
Dang, yeah it’s probably my strict browser settings. Thanks for the confirmation of shared experience.
cecilkorik@piefed.ca 3 days ago
Sometimes I’m able to get around it by tweaking some ublock permissions, but once I was surprised to discover that changing my user-agent with user-agent switcher seemed to do the trick. It’s really strange. Cloudflare’s captcha loops are inscrutable.