Comment on A Project to Poison LLM Crawlers

FauxLiving@lemmy.world 9 hours ago

That might be a valid argument if only large companies existed and they only trained foundation models.

Scraped data is most often used for fine-tuning models for specific tasks, for example, mimicking people on social media to push an ad or political agenda. A foundation model that speaks as if it were trained on textbooks doesn't work for synthesizing social media comments.

In order to sound like a Lemmy user, you need to train on data that contains the idioms, memes, and conversational styles used in the Lemmy community. That can't be created from the output of other models; it has to come from scraping.

Poisoning the data going to the scrapers will either kill the model during training or force everyone to pre-process their data, which raises the cost and expertise required to attempt such things.
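To illustrate the idea (not the specific project discussed here), a server can detect likely AI crawlers by user agent and hand them scrambled text instead of the real page. The bot names below are examples of publicly documented crawler user agents; the function names and word-shuffling scheme are my own hypothetical sketch, not a real implementation:

```python
import random

# Assumption: a small example list of known AI-crawler user-agent substrings.
CRAWLER_UA_PATTERNS = ["GPTBot", "CCBot", "Bytespider"]

def is_crawler(user_agent: str) -> bool:
    """Return True if the user agent matches a known crawler pattern."""
    ua = user_agent.lower()
    return any(p.lower() in ua for p in CRAWLER_UA_PATTERNS)

def poison(text: str, seed: int = 0) -> str:
    """Shuffle the words: the page keeps plausible surface statistics
    (same vocabulary, same length) but loses coherent meaning,
    degrading any model fine-tuned on it."""
    rng = random.Random(seed)
    words = text.split()
    rng.shuffle(words)
    return " ".join(words)

def serve(user_agent: str, page: str) -> str:
    """Serve the real page to humans, a poisoned copy to crawlers."""
    return poison(page) if is_crawler(user_agent) else page
```

A human visitor gets the page unchanged, while a crawler receives text that still looks like organic comments to a simple quality filter, which is why defeating it forces scrapers into the more expensive pre-processing step described above.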
