Data Information: Sufficiently detailed information about the data used to train the system so that a skilled person can build a substantially equivalent system. Data Information shall be made available under OSI-approved terms.
In particular, this must include: (1) the complete description of all data used for training, including (if used) of unshareable data, disclosing the provenance of the data, its scope and characteristics, how the data was obtained and selected, the labeling procedures, and data processing and filtering methodologies; (2) a listing of all publicly available training data and where to obtain it; and (3) a listing of all training data obtainable from third parties and where to obtain it, including for fee.
wise_pancake@lemmy.ca 8 months ago
The model weights and research paper are, which is the accepted terminology nowadays.
It would be nice to have the training corpus and RLHF too.
TheOctonaut@mander.xyz 8 months ago
No, it isn’t. The OSI specifically requires that the training data be available, or at the very least that the source and fee for the data be given so that a user could get the same copy themselves, because that’s the purpose of something being “open source”. Open source doesn’t just mean free to download and use.
opensource.org/ai/open-source-ai-definition
As per their paper, DeepSeek R1 required a very specific training data set, because when they tried the same technique with less curated data they got “R1-Zero”, which basically ran fast and spat out a gibberish salad of English, Chinese and Python.
People are calling DeepSeek open source purely because they called themselves open source, but they seem to just be another free-to-download, black-box model. The best comparison is to Meta’s Llama, which weirdly nobody has decided is going to up-end the tech industry.
In reality, “open source” is a terrible term here. It is a very loose fit for what people are basically trying to say, which is that anyone could recreate or modify the model because they have the exact ‘recipe’.
kryptonidas@lemmings.world 8 months ago
The training corpus of these large models seems to be “the internet, YOLO”, where it’s fine for them to download every book and paper under the sun, but if a normal person does it...
Believe it or not:
Image
Stovetop@lemmy.world 8 months ago
A lot of other AI models can say the same, though. Facebook’s is. Xitter’s is. Doesn’t mean I trust them for shit.
GissaMittJobb@lemmy.ml 8 months ago
Llama has several restrictions making it quite a bit less open than Grok or DeepSeek.
ayyy@sh.itjust.works 8 months ago
I wouldn’t call it the accepted terminology at all. Just because some rich assholes try to will it into existence doesn’t mean we have to accept it.
gamer@lemm.ee 8 months ago
I think you’re conflating “open source” with “free”
What does it even mean for a research paper to be open source? That they release a docx instead of a pdf, so people can modify the formatting? Lol
The model weights were released for free, but you don’t have access to their source, so you can’t recreate them yourself. Like Microsoft Paint isn’t open source just because they release the machine instructions for free. Model weights are the AI equivalent of an exe file. To extend that analogy, quants, LoRAs, etc. are like community-made mods.
To be open source, they would have to release the training data and the code used to train it. They won’t do that because they don’t want competition. They just want to do the Facebook Llama thing, where they hope someone uses it to build the next big thing, so that Facebook can copy them and destroy them with a much better model that they didn’t release, force them to sell, or kill them with the license.
maplebar@lemmy.world 8 months ago
Let’s just redefine existing concepts to mean things that are more palatable to corporate control, why don’t we?
If you don’t have the ability to build it yourself, it’s not open source. DeepSeek is “freeware” at best. And that’s to say nothing of what the data is, where it comes from, and the legal ramifications of using it.
sem@lemmy.blahaj.zone 8 months ago
They are trying to make it accepted, but it’s still contested. Unless the training data is provided, it’s not really open.
lemmydividebyzero@reddthat.com 8 months ago
But then, people would realize that you got copyrighted material and stuff from pirating websites…
legolas@fedit.pl 8 months ago
Well, if they really are and the methodology can be replicated, we are surely about to see a crazy amount of DeepSeek competition, because imagine how many US companies in the AI and finance sectors are in possession of an even larger number of chips.
Although the question arises: if the methodology is so novel, why would these folks make it open source? Why would they share the results of years of their work with the public, losing their edge over the competition? I don’t understand.
Can somebody who actually knows how to read a machine learning codebase tell us something about DeepSeek after reading their code?
wise_pancake@lemmy.ca 8 months ago
Hugging Face already reproduced DeepSeek R1 (called Open R1) and open-sourced the entire pipeline.
legolas@fedit.pl 8 months ago
Did they? According to their repo it’s still WIP: github.com/huggingface/open-r1