suy

@suy@programming.dev

⁨Comment⁩ on ⁨The Irony of 'You Wouldn't Download a Car' Making a Comeback in AI Debates⁩ ⁨⁨5⁩ ⁨months⁩ ago⁩:
Wow, thanks. I had not seen this comment, yet I hinted at this in some of my earlier replies.

Yes, I think ML is fair use, but it would also be fair to force something into the public domain/open source if, in order to be built, it has to rely on fair use at an unprecedented scale.

This would be a difficult law to write, though. Current ML is very inefficient in the amount of data it requires, but it could (and should) be made better.
⁨Comment⁩ on ⁨The Irony of 'You Wouldn't Download a Car' Making a Comeback in AI Debates⁩ ⁨⁨5⁩ ⁨months⁩ ago⁩:

Now I sail the high seas myself, but I don’t think Paramount Studios would buy anyone’s defence they were only pirating their movies so they can learn the general content so they can produce their own knockoff.

We don’t know exactly how they source their data (and that is definitely shady), but if I can gain access to a movie in a legal way, I don’t see why I would not be able to gather statistics from said movie, including running a speech-to-text model to caption it, then computing statistics of how many times certain words were used, and which words followed them. This is an oversimplified explanation of what an LLM does, but it’s the fairest I can come up with, and it would be legal to do so. The models are always orders of magnitude smaller than the data they are trained on.
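
A rough sketch of the kind of statistics meant here: a toy counter of which word follows which. It is not how a real LLM is trained; the caption text and the name `count_followers` are made up for illustration.

```python
from collections import Counter, defaultdict


def count_followers(text: str) -> dict:
    """Count, for each word, which words follow it and how often."""
    words = text.lower().split()
    followers = defaultdict(Counter)
    # Walk over every pair of adjacent words.
    for current, nxt in zip(words, words[1:]):
        followers[current][nxt] += 1
    return followers


# Hypothetical caption text standing in for a speech-to-text transcript.
captions = "the ship sails and the ship sinks and the crew swims"
stats = count_followers(captions)
print(stats["the"].most_common())  # [('ship', 2), ('crew', 1)]
```

The table only records how often each pair occurred; the original sentence order is not kept anywhere.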

That said, I don’t mean to imply that I’m happy with the state of high tech companies, the AI hype, the energy consumption, or the impact on the humble people. But I’ve put a lot of thought into this (and into learning about machine learning for real), and I think this is not an ML problem, but a problem in the economic, legal and political system. AI hype is just a symptom.
⁨Comment⁩ on ⁨The Irony of 'You Wouldn't Download a Car' Making a Comeback in AI Debates⁩ ⁨⁨5⁩ ⁨months⁩ ago⁩:

It’s not AI

It’s not AGI, it’s not general intelligence, and it’s not comparable to a human (well, you can compare anything, but human and ML are just very different things in tons of ways).

But it is AI. The ghosts that chase Pacman are AI. A search algorithm is also AI, dammit. Of course an LLM is AI. Any agent that maximizes a function is AI. You are just embarrassing yourself.
⁨Comment⁩ on ⁨The Irony of 'You Wouldn't Download a Car' Making a Comeback in AI Debates⁩ ⁨⁨5⁩ ⁨months⁩ ago⁩:

But then it does go on to quote materials verbatim, which shows it’s not “just” ‘extracting patterns’.

It is just extracting patterns. It is making statistical estimates of which token (“word”, informally speaking) is likely to follow, given the previous stream.
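
To make "which token is likely to follow" concrete, here is a minimal sketch that estimates next-token probabilities from counts and samples from them. It is a big simplification: it looks only at the single previous token, whereas a real model conditions on the whole previous stream through a neural network. The function names and the sample text are illustrative assumptions.

```python
import random
from collections import Counter


def next_token_probs(tokens, previous):
    """Estimate P(next token | previous token) from raw counts."""
    counts = Counter(b for a, b in zip(tokens, tokens[1:]) if a == previous)
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}


def sample_next(tokens, previous, rng):
    """Draw one next token at random, weighted by its estimated probability."""
    probs = next_token_probs(tokens, previous)
    if not probs:
        return None  # the previous token was never seen with a follower
    return rng.choices(list(probs), weights=list(probs.values()), k=1)[0]


tokens = "to be or not to be that is the question".split()
print(next_token_probs(tokens, "be"))  # {'or': 0.5, 'that': 0.5}
print(sample_next(tokens, "to", random.Random(0)))  # be
```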

It can only reproduce passages of things it has seen many, many times. It cannot reproduce the whole work. Those two quotes can be seen elsewhere on the internet plenty of times. And it’s fair use there, so it would be fair use with a chat bot as well.

There have been papers published where researchers were able to regenerate an image that was present in the training set of Stable Diffusion. But they were only able to find that image (and others) in particular, because they were present in the training set multiple times, and the caption was the same (it was the portrait picture of some executive at a company).

when given the book and pages — quote copyrighted works

Yeah, you are not gonna be able to do that with an LLM. They will be able to quote only some passages, and only of popular books that have been quoted often enough.

Even if they started to use my service to literally copy entire books?

You cannot do that with an LLM.

Why are you defending massive corporations who could just pay up? Isn’t the whole “corporations putting profits over anything” thing a bit… seen already?

I hate that some corporations are burning money, resources and energy on this, but the solution is not to restrict fair use even further. Machine Learning is complex, but if I had to summarize it in some way, it is “just” gathering statistics of which word comes next (in the case of a text model). This is no different than getting a large corpus of text and sampling it for word frequency, letter frequency, N-gram frequency, etc. It is well known that this is fair use. You only store the copyrighted works to run the software and produce a very transformative work that is a summary many orders of magnitude smaller than the copyrighted work. This is fair use, and it should remain so. Changing that is gonna harm the public, small companies and independent researchers way more than big tech companies.
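
For illustration, the word, letter and N-gram frequencies mentioned above can all be computed by one small helper. This is a sketch on a made-up one-line corpus; `ngram_counts` is a hypothetical name, not something from a library.

```python
from collections import Counter


def ngram_counts(items, n):
    """Count every run of n consecutive items (letters, words, ...)."""
    return Counter(tuple(items[i:i + n]) for i in range(len(items) - n + 1))


corpus = "the cat sat on the mat"  # made-up stand-in for a large corpus
words = corpus.split()

word_freq = ngram_counts(words, 1)                      # word frequency
letter_freq = ngram_counts(corpus.replace(" ", ""), 1)  # letter frequency
word_bigrams = ngram_counts(words, 2)                   # N-gram frequency, N = 2

print(word_freq[("the",)])           # 2
print(letter_freq.most_common(1))    # [(('t',), 5)]
print(word_bigrams[("the", "cat")])  # 1
```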

As I said in another comment, I would very much welcome a way to force big corpos to release their models. Make a model bigger than N parameters? You needed too much fair use in one gulp: your model has to be public, and in the public domain. I would fucking welcome that! But going in the opposite direction is just risky.

I don’t understand why small individuals think that copyright is their friend, and will protect them from big tech companies. Copyright will always harm the weak and protect the powerful as a net result. It’s already a miracle that we can enjoy free software and culture by licenses that leverage copyright in our favor.
⁨Comment⁩ on ⁨The Irony of 'You Wouldn't Download a Car' Making a Comeback in AI Debates⁩ ⁨⁨5⁩ ⁨months⁩ ago⁩:
“Theft” is never a technically accurate word when dealing with so-called “intellectual property”, because copying digital content without authorization is legal in tons of cases, and because, come on, property is very explicitly exclusive. I cannot copy my house or my car, but I can make copies of my works for virtually 0 cost.

Using data for training ML models is even explicitly allowed in some jurisdictions (e.g. Japan), and is likely to be fair use everywhere else. LLMs are very transformative, and while they often can produce verbatim copies of fragments of copyrighted works, they don’t store the whole works or significant pieces of them.

Don’t get me wrong, I don’t like big companies making big money. I would not mind a law that would force models to be open sourced. But preventing them from training their models on public data by restricting fair use would harm them very little (they could pay something if they are making some profit), while small researchers or companies would never be able to compete, because they would have neither the money for the upfront costs nor the financial engineering to disguise profits and pay less.
⁨Comment⁩ on ⁨Winner's Luck⁩ ⁨⁨9⁩ ⁨months⁩ ago⁩:
Two people go on a date. The date is going well, there is chemistry between the two people. One says “if you beat me at any game we can have sex”. The two people will typically play a board or card game, and will flirt with the opportunity of sex during the game play, which is gonna be fun and exciting. Seems a good plot idea for your average romantic comedy movie or teenager’s series.

Now the joke is that the choice of game is stupid because you end up killing your date. Just with that you could make a meme/joke. Now the post is doubling down on the stupidity, insanity, etc., by making it morbid and showing that the guy still had sex with the corpse.

Here it is: my take on the issue, which is unlikely to be the only possible explanation that is not “incel shit”. I’ve wasted 10 minutes of my time, and you will likely still not agree with me, which will prove my first comment right.

Cheers.
⁨Comment⁩ on ⁨Winner's Luck⁩ ⁨⁨9⁩ ⁨months⁩ ago⁩:
Has it occurred to you that pressing the downvote button is just much easier than having to bother explaining something that should be obvious?

If it is not obvious to you that it’s not incel shit, maybe you still won’t agree even after an explanation, because you have different views (which I’m not saying are not respectable, but they are still different, so an agreement can’t be reached), so whoever replies to you would have wasted their time.

So of course people downvote without replying.
⁨Comment⁩ on ⁨Everything about TOML format - Orchard Dweller⁩ ⁨⁨1⁩ ⁨year⁩ ago⁩:
The very first moment that I had to use JSON as a configuration format, when I was desperate to find a way to fit a long string into a JSON field. JSON is great for many things, but it’s not good at all as a configuration format where you need users to make it pretty, and need features like comments or multi-line strings (because you don’t want to fix a merge conflict in a 400 character-wide line).
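
A small sketch of the difference, using only the Python standard library. The `motd` key and the helper `json_accepts` are invented for the example, and `tomllib` is only available from Python 3.11 on, so the TOML part is skipped on older versions.

```python
import json

try:
    import tomllib  # standard library since Python 3.11
except ModuleNotFoundError:
    tomllib = None


def json_accepts(text: str) -> bool:
    """Return True if the text parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# JSON has no comments, and a string value cannot contain a raw line break.
json_with_comment = '{\n  // greeting shown at startup\n  "motd": "hi"\n}'
json_multiline = '{"motd": "line one\nline two"}'
print(json_accepts(json_with_comment))  # False
print(json_accepts(json_multiline))     # False

# TOML allows both comments and multi-line strings.
toml_text = '''
# greeting shown at startup
motd = """
line one
line two
"""
'''
config = tomllib.loads(toml_text) if tomllib else None
print(config)
```

In JSON the long string has to stay on one line with `\n` escapes, which is exactly what produces the 400 character-wide lines.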
⁨Comment⁩ on ⁨Everything about TOML format - Orchard Dweller⁩ ⁨⁨1⁩ ⁨year⁩ ago⁩:
Doesn’t YAML have a (seldom used) feature of a start and end of document marker? The “YAML frontmatter” that a few markdown documents have, uses this.
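As far as I know, yes: in YAML a `---` line marks the start of a document and a `...` line marks its end. Front matter blocks are usually delimited by two `---` lines. A minimal sketch of splitting such a block from a Markdown file, without parsing the YAML itself (the function name and the sample document are made up):

```python
def split_front_matter(text: str):
    """Split a Markdown document into (front matter, body).

    Assumes the common convention: the file starts with a `---` line and the
    metadata ends at the next `---` or `...` line.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return "", text
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() in ("---", "..."):
            return "\n".join(lines[1:i]), "\n".join(lines[i + 1:])
    return "", text  # no closing marker: treat everything as body


doc = "---\ntitle: Hello\ntags: [a, b]\n---\n# Heading\nBody text\n"
meta, body = split_front_matter(doc)
print(meta)
print(body)
```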
⁨Comment⁩ on ⁨Release notes of an open source app. Someone pretty mad at Canonical for Snap⁩ ⁨⁨1⁩ ⁨year⁩ ago⁩:
Sorry, could you clarify what you mean? I don’t see the difference. Isn’t the author complaining about Canonical for the policy enforcement?
Release notes of an open source app. Someone pretty mad at Canonical for Snap (programming.dev)
Submitted ⁨⁨1⁩ ⁨year⁩ ago⁩ to ⁨programmer_humor@programming.dev⁩ | ⁨20⁩ ⁨comments⁩
⁨Comment⁩ on ⁨Correcting > Helping⁩ ⁨⁨1⁩ ⁨year⁩ ago⁩:
Sometimes that’s part of the issue (or the whole deal), but sometimes it’s not even that.

Sometimes it’s that someone asked something difficult and elaborate to answer, which has been answered a ton of times, and it’s tedious to answer again and again. But if someone answers with misinformation or even straight FUD, then one feels the urge to correct it.

I suffered that with questions in r/QtFramework. Tons of licensing questions, repeated over and over, from people who have not bothered to read a bit about such a well known and popular license as LGPL. Then someone who cares little for the nuance answers something heavy handed, and paints a wrong picture. Then I can’t let the question pass. I need to correct the shitty answer. :-(
⁨Comment⁩ on ⁨Show me a better text format for serializing⁩ ⁨⁨1⁩ ⁨year⁩ ago⁩:
Norway.

Oops. Sorry, I meant “NO”.
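
For anyone who missed the joke: the YAML 1.1 boolean type lists `NO` among its spellings, so an unquoted country code for Norway can come back as `false`. The toy resolver below only mimics that rule to show the effect. It is not a YAML parser, real parsers differ in which spellings they honor, and the names are invented.

```python
import re

# Boolean spellings listed by the YAML 1.1 bool type.
YAML_1_1_BOOL = re.compile(
    r"y|Y|yes|Yes|YES|n|N|no|No|NO|true|True|TRUE|false|False|FALSE"
    r"|on|On|ON|off|Off|OFF"
)
TRUE_WORDS = {"y", "yes", "true", "on"}


def resolve_plain_scalar(value: str):
    """Toy resolver: booleans per the YAML 1.1 list, anything else stays a string."""
    if YAML_1_1_BOOL.fullmatch(value):
        return value.lower() in TRUE_WORDS
    return value


countries = ["SE", "DK", "NO"]
print([resolve_plain_scalar(c) for c in countries])  # ['SE', 'DK', False]
```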
⁨Comment⁩ on ⁨Monaspace - Microsoft presents a new font family for code⁩ ⁨⁨1⁩ ⁨year⁩ ago⁩:
Radon, the “handwriting” one, looks as if someone wanted Comic Sans, but for code.
⁨Comment⁩ on ⁨Mercedes-Benz is using Qt framework to build new operating system for its cars⁩ ⁨⁨1⁩ ⁨year⁩ ago⁩:
I heard the rumor that Linux desktop environments use it too. Now hopefully multimedia apps with 3 letters like VLC and OBS can adopt it too.

j/k