I wonder how much these models are now learning from spam they were used to generate
Data contamination expert
Submitted 8 months ago by ElCanut@jlai.lu to [deleted]
https://jlai.lu/pictrs/image/7bd52039-2a48-43f7-8575-3b5af029b7ba.jpeg
Comments
benignintervention@lemmy.world 8 months ago
Kbin_space_program@kbin.social 8 months ago
Time to make a lot of wandering dwarf bots on reddit to make variations of various game phrases all over, so the LLM based bots just spout Rock And Stone and This is my favourite store on the Citadel?
Ilovethebomb@lemm.ee 8 months ago
Thing is, you could use a bot to do nothing but post pop culture references, and it would be indistinguishable from a garden variety Redditor. Reddit is one of the worst places to train an AI.
LordOfTheChia@lemmy.world 8 months ago
Johnson! Why the hell is your report the most unintelligible thing I've read since nineteen ninety eight when the undertaker threw mankind off hell in a cell, and plummeted sixteen feet through an announcer's table
THE_MASTERMIND@feddit.ch 8 months ago
All of them
Adalast@lemmy.world 8 months ago
OpenAI team after including the data: why is the model suddenly even more horny, abusive, and discriminatory?
alphacyberranger@lemmy.world 8 months ago
If it takes reddit data to train a model, instead of Artificial Intelligence we will end up with Artificial Idiocy, and a horny one at that.
init@lemmy.ml 8 months ago
Sigh, unzips
Poem_for_your_sprog@lemmy.world 8 months ago
You had to unzip?
niktemadur@lemmy.world 8 months ago
Hey, I'd say that Facebook, Twitter and YouTube are at least just as bad, and probably worse.
asexualchangeling@lemmy.ml 8 months ago
As opposed to now when we just have regular artificial idiocy
FlyingSquid@lemmy.world 8 months ago
eager_eagle@lemmy.world 8 months ago
Good move, but anyone using public data already applies a simple spam filter to reject "dumb" data poisoning. Also, hatred and other negative comments as responses will be penalized in language model training, so effective data poisoning takes effort. I'll just throw out some ideas here on how poisoning could hypothetically have a tangible negative impact on their results.
The best one can do in terms of data poisoning is make comments that are not easily discernible from usual comments - both for humans and machines - but are either unhelpful or misleading. This is an "in-distribution" data poisoning attack. To have any real impact on training, they need to be applied en masse using different user accounts that also upvote each others' comments in a way that mimics real user interaction: if applied in a simplistic way, a simple graph analysis of these interactions can light up these fake accounts like a Christmas tree.
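To make the graph analysis point a bit more concrete, here is a minimal sketch of one crude way it could work: treat upvotes as directed edges, keep only the reciprocal ones, and report clusters of accounts that all upvote each other. The function name `find_upvote_rings`, the `(voter, author)` input format, and the `min_size` threshold are all assumptions made up for illustration, not anything Reddit is known to run. Real users reciprocate upvotes too, so this only catches the simplistic case.

```python
from collections import defaultdict


def find_upvote_rings(upvotes, min_size=3):
    """Return groups of accounts linked by reciprocal upvotes.

    upvotes: iterable of (voter, author) pairs (hypothetical input format).
    min_size: smallest cluster worth reporting (an arbitrary threshold).
    """
    edges = set(upvotes)

    # Keep only reciprocal pairs: A upvoted B and B upvoted A.
    mutual = defaultdict(set)
    for voter, author in edges:
        if voter != author and (author, voter) in edges:
            mutual[voter].add(author)
            mutual[author].add(voter)

    # Connected components over the reciprocal-upvote graph.
    seen = set()
    rings = []
    for start in mutual:
        if start in seen:
            continue
        stack = [start]
        component = set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(mutual[node] - component)
        seen |= component
        if len(component) >= min_size:
            rings.append(component)
    return rings
```

A real pipeline would presumably also weigh account age, timing, and how much of an account's activity stays inside the cluster, which is why mimicking genuine interaction takes so much effort.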
greenskye@lemm.ee 8 months ago
but are either unhelpful or misleading
Honestly that just sounds like a lot of Reddit users in general
TseseJuer@lemmy.world 8 months ago
yea we know that's why he said that because that's "real" reddit content
Adalast@lemmy.world 8 months ago
I was contemplating the merits of botting with the current model with slight vectorization offsets so the data becomes prone to overfitting.
I would think it would also work to post using valid, but non-standard syntax so it muddies the n-gram searches.
Daxtron2@startrek.website 8 months ago
You've probably been shadow banned for 5 of those months
ArmokGoB@lemmy.dbzer0.com 8 months ago
We should have started an all-out attack on Reddit once they started forcing open subs by removing mods. People folded like soggy tortillas.
madcaesar@lemmy.world 8 months ago
I just left and came here after 10+ years on reddit. No point wasting time and energy trying to take reddit down. They are fucked anyway. Anytime I check back for something occasionally the quality of posts / comments is just pure garbage.
PerogiBoi@lemmy.ca 8 months ago
Just like when Netflix and Disney plus and every other streaming service colluded to all raise their prices and remove account sharing.
Daxtron2@startrek.website 8 months ago
My account got locked out after I lost all my authenticators with an old phone. Reddit is one of only a few sites that would not let me change it.
Poem_for_your_sprog@lemmy.world 8 months ago
Set up a bot that just constantly posts blatantly wrong information, like "the earth is flat according to encyclopedia Britannica", "the sky is green because it's full of chlorophyll according to the UK foundation of science"
Zink@programming.dev 8 months ago
Or in line with current events, "we are sorry about your experience and will refund you triple."
boatsnhos931@lemmy.world 8 months ago
Dear God, I've posted a lot of nonsense and untrue things over the years. You guys want to do a candle light vigil tonight for ai?
jayrodtheoldbod@midwest.social 8 months ago
This announcement is just "oh by the way, the horse is now out of the barn. He left like 10 years ago but this is the announcement."
Shout out to whoever dismissed the first AI writings with "It's like a perfect Redditor. Totally confident and completely full of shit, doesn't even know that it's lying."
That doesn't happen by accident. That happens when everyone was already scraping the shit out of the site, at the very least.
magnetosphere@kbin.social 8 months ago
This is the ideal meme format. Pedro's smile is perfect.
Flumpkin@slrpnk.net 8 months ago
I'm pissed at reddit but I still hate searching for something and finding a post on reddit discussing it, only to find some of the posts being deleted or overwritten.
FIST_FILLET@lemmy.ml 8 months ago
if you're lucky, some posts have been archived on the internet archive's wayback machine. highly recommend pinning the extension to your toolbar, it'll show a number badge of how many times the current site has been archived :) addons.mozilla.org/en-US/…/wayback-machine_new
crackajack@reddthat.com 8 months ago
Why does Spez want to sell data? To buy a new yacht?
I will delete my data from Reddit then.
MECHAGIC@lemmy.world 8 months ago
Wait how do i do that?
TwanHE@lemmy.world 8 months ago
There were some scripts for it. But i can still find my comments and posts through Google after deleting them.
Don't think reddit will let you take "their" (your) content away.
EmperorHenry@discuss.tchncs.de 8 months ago
after they announced it would've been the time to start poisoning the comments. Then it would've been completely justified and moral.
Honestly, keep up the good fight. Start poisoning all open sources that are being scraped by any type of AI.
And I use the term "ai" very, very loosely. Because what's called ai now isn't real ai. It's just an automated data collection tool.
It doesn't create anything, it plagiarizes real artists.
FIST_FILLET@lemmy.ml 8 months ago
exactly, "ai" right now is just a computer parrot. why settle for blurry generic versions of the art that it is digesting and shitting back out?
byroon@lemmy.world 8 months ago
So you've contaminated the training data for an LLM by spamming a public forum? Seems like everyone loses
wildginger@lemmy.myserv.one 8 months ago
I don't lose, I get a good laugh out of watching idiots feed unreliable data to their LLMs because it was cheap
byroon@lemmy.world 8 months ago
I mean the people using the forum who have to navigate around your spam
ILikeBoobies@lemmy.ca 8 months ago
Why care?
It just seems they were correct in changing api prices
Sylvartas@lemmy.world 8 months ago
I mean, yeah, but because they fully expected to sell their userbase as training data for LLMs, not because they actually care about people using bots to post wrong information. Wouldn't that require them caring about actual people posting wrong info in the first place?
LunaCtld@lemmy.world 8 months ago
So does that mean, that this time DAN will come pre-installed?
Norgur@kbin.social 8 months ago
we need a bot that deletes comments and replaces them with some faulty grammar yoda-speak.
Valmond@lemmy.mindoki.com 8 months ago
They'll just find the signal in what you're doing. Sorry but checkmate, mate.
Ensign_Crab@lemmy.world 8 months ago
"So much for your fucking canoe!"
nickwitha_k@lemmy.sdf.org 8 months ago
I really ought to have done that.
TropicalDingdong@lemmy.world 8 months ago
I used some tools to corrupt about 10 years of comments and posts of mine.
mp04610@lemm.ee 8 months ago
While that's the correct thing to do in my opinion, it would be a mistake to assume that Reddit didn't store your original comments.
By corrupting their dataset, you may actually be helping them recognize maliciously edited comments.
khannie@lemmy.world 8 months ago
They were fairly specific about not doing that (I'd imagine largely because of GDPR).
I deleted 10 years of "content" before I left and checked their policies. They apparently actually do properly delete from their servers.
TropicalDingdong@lemmy.world 8 months ago
Yeah, I mean I knew that when I was doing it.
Sometimes all you can do is make a symbolic gesture that really does nothing, and even if it does nothing, you should still do it.
Leaving and supporting lemmy by paying for some developer fees (I'm on the patreon), posting and commenting, is probably 100x more damaging to Reddit.
flambonkscious@sh.itjust.works 8 months ago
Mass edits made rapidly are obviously suspect, too… If the same user edits anything more than a dozen comments in, say, a minute, you have to ask what's going on
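As a rough illustration of that check, here is a small sliding-window sketch that flags any user with more than a dozen edits inside a minute. The function name `flag_mass_editors`, the `(user, unix_timestamp)` input format, and the default thresholds are assumptions chosen to match the numbers above, not a known Reddit mechanism.

```python
from collections import defaultdict, deque


def flag_mass_editors(edits, max_edits=12, window_seconds=60):
    """Return users with more than max_edits edits in any window_seconds span.

    edits: iterable of (user, unix_timestamp) pairs (hypothetical format).
    """
    by_user = defaultdict(list)
    for user, timestamp in edits:
        by_user[user].append(timestamp)

    flagged = set()
    for user, stamps in by_user.items():
        window = deque()
        for timestamp in sorted(stamps):
            window.append(timestamp)
            # Drop edits that have fallen out of the sliding window.
            while timestamp - window[0] >= window_seconds:
                window.popleft()
            if len(window) > max_edits:
                flagged.add(user)
                break
    return flagged
```

A script that paces its edits slowly would slip under a fixed threshold like this, so it only catches the impatient case.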
ElCanut@jlai.lu 8 months ago
Can't post a genius idea like this one without posting the links of the tools
TropicalDingdong@lemmy.world 8 months ago
It's not my idea, but I could probably dig up the tool I used. Dollars to donuts, it doesn't work any more.
This might have been the tool I used. I don't think so because I overwrote everything with one message, but google around and you'll find similar.
github.com/adriantache/YARCO
Sabin10@lemmy.world 8 months ago
A tool like that would almost definitely require api access to function. If that was still possible, most of us wouldn't be here having this conversation.
Ragnarok314159@sopuli.xyz 8 months ago
I think Reddit caught on to this. I tried destroying my comment history (~7 years with 600k karma) with a few of the available tools on GitHub.
Found my account permabanned next time trying to login. People should attempt to eliminate/poison as much as possible, but Reddit has all the comments and modifications in a database somewhere to sell it all to whatever AI is the highest bidder.
They have to do something to make money after taking away awards. The advertising is absolute shit and not worth the $100 entry fee.
VaultBoyNewVegas@lemmy.world 8 months ago
I edited mine via a tool to say fuck Reddit and Steve Huffman is a greedy pig boy.
Octopus1348@lemy.lol 8 months ago
What do you mean by corrupt?
PlasmaDistortion@lemm.ee 8 months ago
I used a tool that edited my comments to replace them with gibberish. Supposedly Reddit still retains deleted comments but if you edit them, it only keeps the latest version. So by editing it you make the comments worthless.
TropicalDingdong@lemmy.world 8 months ago
I ran a script over all of my comments (through my browser) to edit them into something about how spez had back stabbed the community. I had tens? hundreds of thousands? of comments.
It took several hours to run, but I did a forward pass (newest to oldest) and a backwards pass (oldest to newest). It bugged out because it had to run so long but I think I got it all.
I'm not sure this will really do anything because you could pretty easily statistically isolate anyone who did what I did, and roll their account history back to a prior state in the training data.
Regardless, it was the least I could do on the way out the door.
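For the curious, the rollback part is trivial if old versions are kept at all, which this thread disagrees about. A minimal sketch, assuming a hypothetical `history` mapping from comment id to `(unix_timestamp, text)` versions and a `cutoff` marking where the mass edit started:

```python
def roll_back(history, cutoff):
    """Return each comment's latest text from strictly before cutoff.

    history: dict of comment_id -> list of (unix_timestamp, text) versions
             (a hypothetical storage format, not a known Reddit schema).
    cutoff:  timestamp at which the mass edit is assumed to have begun.
    Comments with no version before the cutoff are left out.
    """
    restored = {}
    for comment_id, versions in history.items():
        earlier = [version for version in versions if version[0] < cutoff]
        if earlier:
            # Pick the most recent version that predates the cutoff.
            restored[comment_id] = max(earlier, key=lambda v: v[0])[1]
    return restored
```

If only the latest version of an edited comment is stored, there is nothing to roll back to and this does not apply.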
JoMiran@lemmy.ml 8 months ago
It replaces them with gibberish. I did the same for my 12+ years' worth.