msokiovt@lemmy.today 1 week ago
This is due to the training sets, one of them being CommonCrawl, which is disgusting. The Chinese LLMs like DeepSeek R1 and Qwen 3 use a different set of training materials that was actually good, despite it being censored too.
hexagonwin@lemmy.sdf.org 5 days ago
wdym ‘disgusting’? isn’t common crawl just popular websites (alexa ranking? idk) crawled and provided raw?
trolololol@lemmy.world 1 week ago
What’s common crawl?
msokiovt@lemmy.today 1 week ago
This