Submitted 3 weeks ago by some_guy@lemmy.sdf.org to technology@lemmy.world
https://idiallo.com/blog/zipbomb-protection
The one-liner:
dd if=/dev/zero bs=1G count=10 | gzip -c > 10GB.gz
This is brilliant.
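To get a feel for why this works: gzip compresses a stream of zeroes at roughly 1000:1, so the file on disk is tiny while the payload a client has to decompress is the full 10G. A quick check (the exact compressed size will vary a little):
ls -lh 10GB.gz          # only around 10M on disk
zcat 10GB.gz | wc -c    # 10737418240 bytes once decompressed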
Sadly, about the only thing that reliably helps against malicious crawlers is Anubis.
That URL is telling me “Invalid response”. Am I a bot?
I’m sorry you had to find out this way.
Now you know why your mom spent so much time with the Amiga
anubis.techaro.lol/…/known-broken-extensions
If you have JShelter installed, it breaks the proof of work from Anubis.
Neat
I don’t really like this approach, not just because I was flagged as a bot, but because I don’t really like captchas. I swear I’m not a bot guys!
That's the reason I say "sadly". It's definitely not good. But since everything else fails, this is what currently remains.
Before I tell you how to create a zip bomb, I do have to warn you that you can potentially crash and destroy your own device.
LOL. Destroy your device, kill the cat, what else?
destroy your device by… having to reboot it. the horror! The pain! The financial loss of downtime!
It'll email your grandmother all of your porn!
Ah yes, the infamous “stinky cheese” email virus. Who knew zip bombs could be so destructive. It erased all of the easter eggs off of my DVDs.
outstanding reference
Haven’t thought about that Weird Al song in a while
The horrors of having your TV record Gigli!
I’ve been thinking about making an nginx plugin that randomizes words on a page to poison AI scrapers.
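A very crude sketch of the idea, using lua-nginx-module / OpenResty rather than a full plugin (the backend name and the word swaps are placeholders, and this ignores gzipped responses):
location / {
    proxy_pass http://backend;
    proxy_set_header Accept-Encoding "";   # ask the backend for uncompressed bodies so they can be rewritten
    body_filter_by_lua_block {
        local chunk = ngx.arg[1]
        if chunk then
            -- swap a few common words to quietly poison the text for scrapers
            ngx.arg[1] = chunk:gsub("the", "an"):gsub("with", "without")
        end
    }
}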
There are “AI mazes” that do that.
I remember reading an article about this but haven't found it yet
The one below, named Anubis, is the one I heard about. Come back to the thread and check the link.
That is a very interesting git repo. Is this just a web view into the actual git folder?
If you have the time, I think it’s a great idea.
I’d be amazed if this works, since these sorts of tricks have been around since dinosaurs ruled the Earth, and most bots will use pretty modern zip libraries which will just return “nope” or throw an exception, which will be treated exactly the same way any corrupt file is - for example a site saying it’s serving a zip file but the contents are a generic 404 html file, which is not uncommon.
Also, be careful because you could destroy your own device? What the hell? No. Unless you’re using dd backwards and as root, you can’t do anything bad, and even then it’s the drive contents you overwrite, not the device you “destroy”.
On the other hand, there are lots of bots scraping Wikipedia even though it’s easy to download the entire website as a single archive.
So they’re not really that smart…
Yeah, this article came across as if written by a complete beginner. They mention having their WordPress hacked, but failed to admit it was because they didn’t upgrade the install.
And if you want some customisation, e.g. some repeating string over and over, you can use something like this:
yes "b0M" | tr -d '\n' | head -c 10G | gzip -c > 10GB.gz
yes repeats the given string (followed by a line feed) indefinitely - originally meant to type "yes" + ENTER into prompts. tr then removes the line breaks again, and head makes sure to only take 10GB and not have it run indefinitely.
If you want to be really fancy, you can even add some HTML header and footer to some files like header and footer, and then run it like this:
yes "b0M" | tr -d '\n' | head -c 10G | cat header - footer | gzip -c > 10GB.gz
Anyone who writes a spider that’s going to inspect all the content out there is already going to have to have dealt with this, along with about a bazillion other kinds of oddball or bad data.
Competent ones, yes. Most developers aren’t competent, scraper writers even less so.
That's true. Scraping is a gold mine for the people that don't know. I worked for a place which crawls the internet and beyond (fetches some internal dumps we pay for). There is no chance a zip bomb would crash the workers as there are strict timeouts and smell tests (even if one does, it will crash an ECS task at worst and we will be alerted to fix that within a short time). We were as honest as it gets though, following GDPR, honoring the robots file, no spiders or scanners allowed, only the home page to extract some insights.
I am aware of some big name EU non-software companies very interested in keeping an eye on some key things that are only possible with scraping.
That's the usual case with arms races: Unless you are yourself a major power, odds are you'll never be able to fully stand up to one (at least not on your own, but let's not stretch the metaphor too far). Often, the best you can do is to deter other, minor powers and hope major ones never have a serious intent to bring you down.
In this specific case, the number of potential minor "attackers" and the hurdle for "attack" make it attractive to try to overwhelm the amateurs at least. You'll never get the pros, you just hope they don't bother you too much.
If you have billions of targets to scan, there’s generally no need to handle each and every edge case. Just ignoring what you can’t understand easily and jumping on to the next target is an absolutely viable strategy. You will never be able to process everything anyway.
Of course, it changes a bit if some of these targets actually make your bot crash. If it happens too often, you will want to harden your bot against it. Then again, if it just happens every now and then, it's still much easier to just restart and continue with the next target.
First off, be very careful with bs=1G as it may overload the RAM (dd buffers a full block in memory, so bs=1G means a 1G buffer). You will want to set count accordingly.
Yup, use something sensible like 10M or so.
I would normally go much lower, bs=4M count=262144, which with a 4M block size pipes 1T of zeros through gzip for roughly 1G of compressed output.
Probably only works for dumb bots and I’m guessing the big ones are resilient to this sort of thing.
Judging from recent stories the big threat is bots scraping for AIs and I wonder if there is a way to poison content so any AI ingesting it becomes dumber. e.g. text which is nonsensical or filled with counter information, trap phrases that reveal any AIs that ingested it, garbage pictures that purport to show something they don’t etc.
When it comes to attacks on the Internet, doing simple things to get rid of the stupid bots means kicking 90% of attacks out. No, it won’t work against a determined foe, but it does something useful.
Same goes for setting SSH to a random port. Logs are so much cleaner after doing that.
Setting a random SSH port and limiting it to 3/min saw failed login attempts fall by 99% and jailed IPs fall to 0.
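For anyone wanting to reproduce that: the port goes into sshd_config, and the per-IP rate limit can be done with an iptables "recent" rule (fail2ban or nftables work just as well). A rough sketch, with a made-up port number:
# in /etc/ssh/sshd_config: Port 47832
iptables -A INPUT -p tcp --dport 47832 -m conntrack --ctstate NEW -m recent --set --name ssh
iptables -A INPUT -p tcp --dport 47832 -m conntrack --ctstate NEW -m recent --update --seconds 60 --hitcount 4 --name ssh -j DROP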
There have been some attempts in that regard, I don’t remember the names of the projects, but there were one or two that’d basically generate a crapton of nonsense to do just that. No idea how well that works.
I don't know as to poisoning AI, but one thing that I used to do was to redirect any suspicious bots, or ones that were hitting the server too much, to a simple html page with no JS or CSS. Then they used to go away.
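If you happen to be on nginx, that kind of redirect is only a few lines; the user-agent patterns and the page name here are just placeholders:
map $http_user_agent $suspicious_bot {
    default 0;
    ~*(python-requests|scrapy|go-http-client) 1;
}
server {
    # ... usual listen/server_name ...
    if ($suspicious_bot) {
        return 302 /plain.html;
    }
}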
This reminds me of shitty FTP sites with ratio when I was on dial-up. I used to push them files full of null characters with good filenames. The modem would compress the upload as it transmitted it which allowed me to upload the junk files at several times the rate of a normal file.
that is pretty darn clever
Funny part is I was using derivatives of this decades ago to test RAID-5/6 sequential read and write speeds.
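For context, those tests are basically the same dd pointed at the array instead of at gzip; something along these lines (paths and sizes made up):
dd if=/dev/zero of=/mnt/raid/testfile bs=1M count=10240 oflag=direct   # sequential write, bypassing the page cache
dd if=/mnt/raid/testfile of=/dev/null bs=1M iflag=direct               # sequential read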
At least in Germany, having one of these on your system is illegal.
Out of curiosity, what is illegal about it, exactly?
I mean, I am not a lawyer.
In Germany we have § 303b StGB. In short it says if you hinder someone else's data processing through physical means or malicious data, you can go to jail for up to 3 years. If it is a major process for someone, you can get up to 5, and in major cases up to 10 years.
So if you have a zip bomb on your system and a crawler reads and unpacks it, you did two crimes. 1. You hindered that crawler's data processing. 2. Some ISP nodes look into it and can crash too. If the ISP is pissed off enough, you can go to jail for 5 years. This applies even if you didn't crash them due to them having protection against it, because trying it is also against the law.
Having a zip bomb is part of a gray area. Because trying to disrupt data processing is illegal, having a zip bomb can be considered trying; however, I am not aware of any judgement in this regard.
Maybe bots shouldn’t be trying to install malicious code? Sucks to suck.
Still illegal. Not immoral, but a lot of our laws aren’t built on morality.
Illegal to publicly serve or distribute.
Have you ever heard of sparse files, and how Linux and Windows deal with zips of them? You'll love this.
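For anyone who hasn't: a sparse file reports a huge size while using almost no disk space, which is why compressing one gives you a zip-bomb-like payload for free. A quick demo (file name made up):
truncate -s 10T sparse.bin           # "10T" file, created instantly
du -h --apparent-size sparse.bin     # reports 10T
du -h sparse.bin                     # reports ~0, nothing is actually allocated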
Interesting. I wonder how long it takes until most bots adapt to this type of “reverse DoS”.
Then we’ll just be more clever as well. It’s an arms race after all.
I want to know how they built that visualization
How I read that code:
“If the dev’s bullshit is equal to 1 gram…”
This is why I use things like Docusaurus to generate static sites. Vulnerability injections are pretty hard when there’s no code to inject into.
❤️
Bishma@discuss.tchncs.de 2 weeks ago
When I was serving high-volume sites (that were targeted by scrapers) I had a collection of files in a CDN that contained nothing but the word "no" over and over. Scrapers who barely hit our detection thresholds saw all their requests go to the 50M version. Super aggressive scrapers got the 10G version. And the scripts that just wouldn't stop got the 50G version.
It didn’t move the needle on budget, but hopefully it cost them.
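For reference, files like that are a one-liner with the same tools discussed above (names and sizes are just examples):
yes no | head -c 50M > no-50M.txt
yes no | head -c 10G > no-10G.txt
yes no | head -c 50G > no-50G.txt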
sugar_in_your_tea@sh.itjust.works 2 weeks ago
How do you tell scrapers from regular traffic?
Bishma@discuss.tchncs.de 2 weeks ago
Most often because they don't download any of the CSS or external JS files from the pages they scrape. But there are a lot of other patterns you can detect once you have their traffic logs loaded in a time series database. I used an ELK stack back in the day.