Apertus was developed with due consideration to Swiss data protection laws, Swiss copyright laws, and the transparency obligations under the EU AI Act. Particular attention has been paid to data integrity and ethical standards: the training corpus builds only on data which is publicly available. It is filtered to respect machine-readable opt-out requests from websites, even retroactively, and to remove personal data, and other undesired content before training begins.
Fully open source, even the training data is provided for download. That being said, this is the only one I know of.
dogslayeggs@lemmy.world 2 days ago
Sure. My company has a database of all technical papers written by employees in the last 30-ish years. Nearly all of these contain proprietary information from other companies (we deal with tons of other companies and have access to their data), so we can’t build a public LLM nor use a public LLM. So we created an internal-only LLM that is only trained on our data.
Fmstrat@lemmy.world 1 day ago
I’d bet my lunch this internal LLM is a trained open weight model, which has lots of public data in it. Not complaining about what your company has done, as I think that makes sense, just providing a counterpoint.
utopiah@lemmy.world 1 day ago
You are solely using your own data or rather you are refining an existing LLM or rather RAG?
I’m not an expert but AFAIK training an LLM requires, by definition, a vast mount of text so I’m skeptical that ANY company publish enough papers to do so. I understand if you can’t share more about the process. Maybe me saying “AI” was too broad.
tb_@lemmy.world 1 day ago
Completely from scratch?