The problem is that even if their books are in the data set, there’s no evidence that they were taken directly from the source. OpenAI scrapes websites, right, and O’Reilly books are often pirated because of their predatory business model (they change their textbooks every year, meaning you can’t use a previous year’s book). So it’s entirely possible, although unlikely, that the content got in there from scraping a pirate site.
Dadifer@lemmy.world 20 hours ago
For copyright, it doesn’t matter whether it was taken directly from the source.