Comment

I see some problems here.

An LLM providing “an opinion” is not a thing, as far as current tech goes. It is just statistically right or wrong, and it puts that into words, which does not fit nicely with real use cases. Also, lots of tools already have autofix that can (on demand) handle many of the minor issues you mention, without any LLM. Assuming static analysis is already in place and decent tooling is used, this would not have to reach either a human or an AI agent or anything before getting fixed with little resources.

As anecdotal evidence, we regularly look into those tools on the job. Granted, we don’t have billions of lines of code to check, but so far they are at best useless. Another piece of anecdotal evidence is the recent outburst from the curl project (and others following suit) about getting a mountain of bogus issues.

I have no doubt that there is a place for human-sounding review and advice, alongside other more common uses like completion and documentation, but ultimately these systems are not able to think, by design. The work still has to be done, and the advice can’t go much beyond platitudes. You ask how common the horrible cases are, but that might not be the correct question. Horrific comments are easy to spot and filter out. The issue is perfectly decent looking “minor fixes” that are well worded, follow guidelines, and pass all checks, while introducing an off-by-one error or suddenly swapping two parameters that happen to be compatible and make sense in context. Those, even if rare (empirically I’d say they are not that rare for now), are so much harder to spot without full human analysis, and they are a real threat.
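To make the swapped-parameter case concrete, here is a minimal sketch in Python. Everything in it is hypothetical and only for illustration: `window`, the two call sites, and `shallow_check` are made-up names, and the "suggested" call stands in for any well-worded minor fix. Both calls type-check and pass the shallow check, yet they return different data.

```python
def window(items: list, start: int, length: int) -> list:
    """Return `length` items beginning at index `start`."""
    return items[start:start + length]


def original_call(data: list) -> list:
    # Intended behavior: three items starting at index 2.
    return window(data, 2, 3)


def suggested_call(data: list) -> list:
    # Hypothetical "minor fix": the two int arguments got swapped.
    # Both are ints, so no type checker or linter complains.
    return window(data, 3, 2)


def shallow_check(result: list) -> bool:
    # The kind of check that passes either way: right type, non-empty.
    return (
        isinstance(result, list)
        and len(result) > 0
        and all(isinstance(x, int) for x in result)
    )


data = [10, 20, 30, 40, 50, 60]
print(original_call(data), shallow_check(original_call(data)))
print(suggested_call(data), shallow_check(suggested_call(data)))
```

Only a test that pins the exact expected values, or a reviewer who still has the intent of the call in mind, would notice the difference.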

Yet another anecdote… yes, that’s a lot of them. Given the current hype, I can mostly only base my findings on personal experience. I use AI-based code completion, on the assumption that the output is short enough to check at a glance and the context is small enough that it can’t make mistakes: two or three lines at a time, at most. Even in this context, while checking that the generated code matches what I was going to write, I’ve seen a handful of mistakes slip through over a few months. It makes me dread what could get through a PR system, where the codebase is not necessarily fresh in the mind of the reviewer.

This is not to say that none of that is useful, but to be useful it would require an extremely high level of trust, far higher than what we give current human intervention (which is also not great and a source of mistakes, I’m very aware of that). The goal should not be to emulate human mistakes, but to make something better.



MagicShel@lemmy.zip 2 days ago

An LLM providing “an opinion” is not a thing

Agreed, but can we just use the common parlance? Explaining completions every time is tedious, and almost everyone talking about it at this level already knows. It doesn’t think, it doesn’t know anything, but it’s a lot easier to use those words to mean something that seems analogous. But yeah, I’ve been on your side of this conversation before, so let’s just take all that as agreed.

this would not have to reach either a human or an AI agent or anything before getting fixed with little resources

There are tools that do some of this automatically. I picked really low hanging fruit that I still see every single day in multiple environments. LLMs attempt more, but they need review and acceptance by a human expert.

The issue is perfectly decent looking “minor fixes” that are well worded, follow guidelines, and pass all checks, while introducing an off-by-one error or suddenly swapping two parameters that happen to be compatible and make sense in context. Those, even if rare (empirically I’d say they are not that rare for now), are so much harder to spot without full human analysis, and they are a real threat.

I get that folks are trying to fully automate this. That’s fucking stupid. I don’t let seasoned developers commit code to my repos without review, why would I let AI? Incidentally, seasoned developers can also suggest fixes with subtle errors. And sometimes those escape into the code base, or sometimes perfectly good code that worked fine on prem goes to shit in the cloud. I just had to argue my team into fixing something that executed over 10k SQL statements in some cases on a single page load due to lazy loading. That shit worked “great” on prem but was taking up to 90 seconds in the cloud. All written by humans.
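For anyone who hasn’t been bitten by this, here is a rough sketch of how lazy loading multiplies statements. It is not our actual code: plain `sqlite3` stands in for whatever ORM is in use, and the `orders`/`customers` tables and all function names are made up for illustration. The lazy version issues one query for the list plus one per row, while the joined version issues a single query for the same result.

```python
import sqlite3


def make_db(n_orders: int) -> sqlite3.Connection:
    """Build a throwaway in-memory database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER)")
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?)",
        [(i, f"customer-{i}") for i in range(n_orders)],
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?)", [(i, i) for i in range(n_orders)])
    conn.commit()
    return conn


def count_selects(conn: sqlite3.Connection, page):
    """Run `page(conn)` and count the SELECT statements it executes."""
    seen = []

    def trace(sql: str) -> None:
        if sql.lstrip().upper().startswith("SELECT"):
            seen.append(sql)

    conn.set_trace_callback(trace)
    try:
        result = page(conn)
    finally:
        conn.set_trace_callback(None)
    return result, len(seen)


def lazy_page(conn: sqlite3.Connection):
    # One query for the orders, then one more per order to fetch its customer.
    orders = conn.execute("SELECT id, customer_id FROM orders ORDER BY id").fetchall()
    page = []
    for order_id, customer_id in orders:
        row = conn.execute(
            "SELECT name FROM customers WHERE id = ?", (customer_id,)
        ).fetchone()
        page.append((order_id, row[0]))
    return page


def joined_page(conn: sqlite3.Connection):
    # Same data fetched eagerly in a single statement.
    return conn.execute(
        "SELECT o.id, c.name FROM orders o "
        "JOIN customers c ON c.id = o.customer_id ORDER BY o.id"
    ).fetchall()


conn = make_db(50)
lazy_rows, lazy_count = count_selects(conn, lazy_page)
joined_rows, joined_count = count_selects(conn, joined_page)
print(lazy_count, joined_count, lazy_rows == joined_rows)
```

With an in-process database each extra statement is cheap, which is roughly why this kind of thing looks fine until every statement has to cross a network.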

The goal should not be to emulate human mistakes, but to make something better.

I’m sure that is someone’s goal, but LLMs aren’t going to do that. They are a different tool that helps but does not in any way replace human experts. And I’m caught in the middle of every conversation because I don’t hate them enough for one side, and I’m not hyped enough about them for the other. But I’ve been working with them for several years now, I’ve watched them grow since GPT-2, and I understand them pretty well. Well enough not to trust them to the degree some idiots do, but I still find them really handy.
