brucethemoose@lemmy.world 19 hours ago
Again, I don’t buy this. The training data isn’t actually that big, nor is training done on such a huge scale so frequently.
finitebanjo@lemmy.world 18 hours ago
As we approach the theoretical error-rate limit for LLMs, shown in OpenAI's 2020 scaling-laws paper and corrected by DeepMind's 2022 follow-up, the required training and power costs rise toward infinity.
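To make the "costs rise toward infinity" point concrete, here is a minimal sketch, assuming the parametric loss curve from DeepMind's 2022 Chinchilla paper, loss(N, D) ≈ E + A/N^α + B/D^β, with the paper's approximate published constants. The 50/50 split of the remaining loss gap between the parameter and data terms, and the ~6·N·D training-FLOPs rule of thumb, are simplifying assumptions of mine, and the function name is made up for illustration.

```python
# Sketch (not a real cost model): parametric fit from DeepMind's 2022
# Chinchilla paper, loss(N, D) ~ E + A / N**alpha + B / D**beta, using the
# paper's approximate published constants. The 50/50 gap split and the
# ~6*N*D FLOPs rule of thumb are simplifying assumptions for illustration.

E, A, B = 1.69, 406.4, 410.7        # irreducible loss and fitted coefficients
alpha, beta = 0.34, 0.28            # fitted exponents for params (N) and tokens (D)

def required_scale(target_loss, split=0.5):
    """N (params) and D (tokens) needed if each term may use split / (1 - split)
    of the remaining gap above the irreducible floor E."""
    gap = target_loss - E
    if gap <= 0:
        raise ValueError("target loss is at or below the irreducible floor E")
    N = (A / (gap * split)) ** (1 / alpha)
    D = (B / (gap * (1 - split))) ** (1 / beta)
    return N, D

for target in (2.1, 1.9, 1.8, 1.75, 1.71):
    N, D = required_scale(target)
    flops = 6 * N * D               # rough training compute
    print(f"target loss {target:.2f}: N~{N:.1e} params, D~{D:.1e} tokens, ~{flops:.1e} FLOPs")
```

As the target loss gets close to the floor E, both N and D, and therefore the FLOPs estimate, grow without bound; that is the sense in which the costs diverge.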
In addition to that, companies may keep many nearly identical variants of the same datasets to chase different outcomes.
Things like books and Wikipedia pages aren't that bad; maybe a few hundred petabytes could store most of them. But images and video are also valid training data, and they are much larger, and then there is all the readable code. On top of that, every user input has to be stored so it can be referenced again later if the chatbot offers that service.
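Purely as an illustration of the storage point, a back-of-the-envelope sketch: the text figure reuses the "few hundred petabytes" estimate above, while the multimodal corpus size, user count, message size, and retention window are hypothetical numbers picked only to show the arithmetic.

```python
# Back-of-the-envelope storage sketch. Only the text figure comes from the
# comment above; every other number is a hypothetical placeholder.

PB = 1e15  # bytes per petabyte

text_corpus  = 300 * PB       # books + Wikipedia, per the estimate above
image_video  = 3_000 * PB     # assumption: image/video data dwarfs text
code_corpus  = 10 * PB        # assumption: scraped public source code

# Assumed chat-log retention: users x messages/day x bytes/message x days kept
users, msgs_per_day, bytes_per_msg, days = 200e6, 10, 2_000, 365
chat_logs = users * msgs_per_day * bytes_per_msg * days

total = text_corpus + image_video + code_corpus + chat_logs
print(f"chat logs alone: ~{chat_logs / PB:,.0f} PB per retained year")
print(f"rough total:     ~{total / PB:,.0f} PB")
```

Swap in your own numbers; the point is just that the multimodal corpora and the per-user logs grow on different axes (corpus size versus users times time), so the total keeps climbing even if the text corpus stays fixed.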