Advanced OpenAI models hallucinate more than older versions, internal report finds

⁨511⁩ ⁨likes⁩

Submitted ⁨⁨7⁩ ⁨months⁩ ago⁩ by ⁨TempermentalAnomaly@lemmy.world⁩ to ⁨technology@lemmy.world⁩

https://www.ynetnews.com/business/article/rjqvyk7jlg

source

Comments

Sort:hotnew top

TempermentalAnomaly@lemmy.world ⁨7⁩ ⁨months⁩ ago
Image

source
hansolo@lemm.ee ⁨7⁩ ⁨months⁩ ago
Can confirm. o4 seems objectively far worse at coding than o3, which wasn’t super great to begin with. It latches on to a hallucination before anything else and rides it until the wheels come off.

source
- taiyang@lemmy.world ⁨7⁩ ⁨months⁩ ago
  Yes, I was about to say the same thing until I saw your comment. I had a little bit of success learning a few tricks with o3 but trying to use o4 is a tremendous headache for coding.
  
  There might be some utility in dialing it all back so it’s more straight to what I need based more on package documentation than random redditor suggestion amalgamation.
  
  source
  - hansolo@lemm.ee ⁨7⁩ ⁨months⁩ ago
    Yeah, I think that workarounds with o3 is where we’re at until Altman figures out that just saying the latest oX mini high is “great at coding” is bad marketing when it can’t accomplish the task.
    
    source
    -> View More Comments
ShittyBeatlesFCPres@lemmy.world ⁨7⁩ ⁨months⁩ ago
I’m glad we’re putting all our eggs in this alpha-ass-level software (with tons of promise! Maybe!) instead of like high speed rail or whatever.

source
- finitebanjo@lemmy.world ⁨7⁩ ⁨months⁩ ago
  /s is mandatory please
  
  source
  - ShittyBeatlesFCPres@lemmy.world ⁨7⁩ ⁨months⁩ ago
    /s
    
    source
CosmoNova@lemmy.world ⁨7⁩ ⁨months⁩ ago
They shocked the world with GPT 3 and cling to that initial success ever since with increasing recklessness and declining results. It‘s all glue on pizza from here.

source
- Zos_Kia@lemmynsfw.com ⁨7⁩ ⁨months⁩ ago
  I think the real shocker was the step change between 3 and 4, and the hope that another step change was soon to come. It’s pretty telling that the latest batch of models was fine tuned for vibes and “empathy” rather than raw performance. They’re not getting the next a-ha moment and want to focus their customers on unquantifiables.
  
  It seems logical that this would negatively impact performance and, well, looks like it did.
  
  source
  - KeenFlame@feddit.nu ⁨7⁩ ⁨months⁩ ago
    They search for the wow moment to continue striking while the iron is hot, but are stuck. Failing to realise that it’s when companies search for filler features and do shit like “it can talk like a human now and you can customise it” and pander it as a step forward that consumers instinctively distrust the company. If there is no new progress due to data incompleteness or incompetence or whatever, they should be ware to not further monotenize this scientific breakthrough and forever ruin the new programming language that we have discovered.
    
    source
match@pawb.social ⁨7⁩ ⁨months⁩ ago
just one more terawatt-hour of electricity and it’ll be accurate and creatice i swear!!

source
- finitebanjo@lemmy.world ⁨7⁩ ⁨months⁩ ago
  /S is mandatory
  
  source
  - milicent_bystandr@lemm.ee ⁨7⁩ ⁨months⁩ ago
    Because otherwise it would be totally believable
    
    …
    
    …
    
    /s
    
    source
- surph_ninja@lemmy.world ⁨7⁩ ⁨months⁩ ago
  This particular anti-AI stance always reminds me of religion gradually losing ground to science.
  
  It’s been pointed out by some folks that if religion’s domain is only ‘what science can’t explain,’ then the domain of religion is continuously shrinking as science grows to explain more and more.
  
  If your anti-AI stance is centered on ‘it wastes power and is wrong too often,’ then your criticism becomes more irrelevant as the accuracy improves and models become more efficient.
  
  source
  - hark@lemmy.world ⁨7⁩ ⁨months⁩ ago
    The assumption here is that the AI will improve. Under the current approach to AI, that might not be the case, since it could be hitting its limitations and this article may be pointing out a symptom of those limitations.
    
    source
    -> View More Comments
ansiz@lemmy.world ⁨7⁩ ⁨months⁩ ago
This is a big reason why I continue to cringe whenever I hear one of the endless news stories or podcasts about how AI is going to revolutionize our society any day now. It’s clear they are being better with image generation but text ‘thinking’ is way too unreliable to use like human replacement knowledge workers or therapists, etc.

source
- keegomatic@lemmy.world ⁨7⁩ ⁨months⁩ ago
  This is an increasingly bad take. If you work in an industry where LLMs are becoming very useful, you would realize that hallucinations are a minor inconvenience at best for the applications they are well suited for, and the tools are getting better by leaps and bounds, week by week.
  
  source
  - FunnyUsername@lemmy.world ⁨7⁩ ⁨months⁩ ago
    you’re getting down voted because you accurately conceive of and treat LLMs the way they should be—as tools. the people down voting you do not have this perspective because the only perspective pushed to people outside of a technical career or research is “it’s artificial intelligence and it will revolutionize society”. This is essentially propaganda because the real message should be “it’s an imperfect tool like all tools but boy will it make getting a lot of certain types of work done way more efficient so we can redistribute our own efforts to other tasks quicker”
    
    source
    -> View More Comments
  - CheeseNoodle@lemmy.world ⁨7⁩ ⁨months⁩ ago
    Oh we know the edit part, the problem is all the people in power trying to use it to replace jobs wholesale with no oversight or understanding that need a human to curate the output.
    
    source
    -> View More Comments
  - primemagnus@lemmy.ca ⁨7⁩ ⁨months⁩ ago
    My pacemaker decided to one day run at 13,000 rpm. Just a minor inconvenience. That light that was supposed to be red turned green causing a massive pile up. Just a small inconvenience.
    
    If all you’re doing is re writing emails or needing a list on how to start learning python, or explain to someone what a glazier does, yeah AI must be so nice lmao.
    
    The only use for AI is for people who have zero skill and talent to look like they actually have skill and talent. You’re scraping an existence off the backs of all the collective talent to, checks notes, make rule34 galvanized. Good job?
    
    source
    -> View More Comments
BrianTheeBiscuiteer@lemmy.world ⁨7⁩ ⁨months⁩ ago
My boss says I need to be keeping up with the latest in AI and making sure my team has the best info possible to help them with their daily work (IT). This couldn’t come at a better time. 😁

source
palarith@aussie.zone ⁨7⁩ ⁨months⁩ ago
Why say hallucinate, when you should say incorrect.

Sorry boss. I wasn’t wrong. Just hallucinating

source
- primemagnus@lemmy.ca ⁨7⁩ ⁨months⁩ ago
  I may have used this line at work far before AI was a thing lol
  
  source
- SaharaMaleikuhm@feddit.org ⁨7⁩ ⁨months⁩ ago
  It can be wrong without hallucinating, but it is wrong because it is hallucinating.
  
  source
- KeenFlame@feddit.nu ⁨7⁩ ⁨months⁩ ago
  Because it’s not guessing, it’s fully presenting it as fact, and for other good reasons it’s actually a very good term for the issue inherent to all regression networks
  
  source
just_another_person@lemmy.world ⁨7⁩ ⁨months⁩ ago
No shit.

The fact that is news and not inherently understood just tells you how uninformed people are in order to sell idiots another subscription.

source
- pennomi@lemmy.world ⁨7⁩ ⁨months⁩ ago
  Why would somebody intuitively know that a newer, presumably improved, model would hallucinate more? Because there’s no fundamental reason a stronger model should have worse hallucination. In that regard, I think the news story is valuable - not everyone uses ChatGPT.
  
  Or are you suggesting that active users should know? I guess that makes more sense.
  
  source
  - HellsBelle@sh.itjust.works ⁨7⁩ ⁨months⁩ ago
    I’ve never used ChatGPT and really have no interest in it whatsoever.
    
    How about I just do some LSD. Guaranteed my hallucinations will surpass ChatGPT’s in spectacular fashion.
    
    source
  - KeenFlame@feddit.nu ⁨7⁩ ⁨months⁩ ago
    There is definitely reason a larger model would have worse hallucinations. Why do you say not? It’s a fundamental problem with data scaling in these architectures
    
    source
glowie@infosec.pub ⁨7⁩ ⁨months⁩ ago
Just a feeling, but from anecdotal experience it seems like the initial release was very good and they quickly realized just how powerful of a tool it was for the average person and now they’ve dumbed it down in many ways on purpose.

source
- slacktoid@lemmy.ml ⁨7⁩ ⁨months⁩ ago
  They had to add all the safeguards that also nerfed it.
  
  source
- clearedtoland@lemmy.world ⁨7⁩ ⁨months⁩ ago
  Agreed. There was a time when it worked impressively well, but it’s become increasingly lazy, forgetful, and confidently wrong, even missing obvious explicit prompts. If you’re using it thoughtfully as an augment, fine. But if you’re relying on it blindly, it’s risky.
  
  That said, in my experience, Anthropic and OpenAI are still miles ahead. Perplexity had me hooked for a while, but its results have nosedived lately. I know they tune their own model while drawing from OpenAI and DeepSeek vs their own true model but still, whatever they’re doing could use some undoing.
  
  source
j4k3@lemmy.world ⁨7⁩ ⁨months⁩ ago
Jan Leike left for Anthropic after Altmann’s nonsense. Jan Leike is the principal person behind all safety alignment present in all models except the 4chanGPT model. All models are cross trained in a way that propagates this alignment. Hallucinations all originate in this alignment and they all have a reason to exist if you get deep into the weeds of abstractions.

source
- unexposedhazard@discuss.tchncs.de ⁨7⁩ ⁨months⁩ ago
  Yeah, whenever two models interact or build on top of each other, the result becomes more and more distorted. They have already scraped close to 100% of the crawlable internet, so they dont know what to do now. Seems like they cant optimize much more or are simply too dumb to do it properly.
  
  source
- KeenFlame@feddit.nu ⁨7⁩ ⁨months⁩ ago
  Maybe I misunderstood, are you saying all hallucinations originate from the safety regression period? Because hallucinations appear in all architectures of current research, open models, even with clean curated data included. Fact checking itself works somewhat, but the confidence levels are off sometimes and if you crack that problem, please elaborate because it would make you rich
  
  source
  - j4k3@lemmy.world ⁨7⁩ ⁨months⁩ ago
    I’ve explored a lot of patterns and details about how models abstract. I don’t think I have ever seen a model hallucinate much of anything. It all had a reason and context. General instructions with broad scope simply lose contextual relevance and usefulness in many spaces. The model must be able to modify and tailor itself to all circumstances dynamically.
    
    source
vivendi@programming.dev ⁨7⁩ ⁨months⁩ ago
Fuck ClosedAI

I want everyone here to download an inference engine (use llama.cpp) and get on open source and open data AI RIGHT NOW!

source
- KeenFlame@feddit.nu ⁨7⁩ ⁨months⁩ ago
  Open source is always one step ahead. But they don’t have the resources and brand hype so people assume oai is cutting edge still
  
  source
- Valmond@lemmy.world ⁨7⁩ ⁨months⁩ ago
  Any pointers on how to do that?
  
  Also, what hardware do you need for this kind of stuff?
  
  source
  - vivendi@programming.dev ⁨7⁩ ⁨months⁩ ago
    First, please answer, do you want everything FOSS or are you OK with a little bit of proprietary code because we can do both
    
    source
    -> View More Comments
muhyb@programming.dev ⁨7⁩ ⁨months⁩ ago
Because they are high-er models.

source
Halcyon@discuss.tchncs.de ⁨7⁩ ⁨months⁩ ago
It’s not “hallucination”. That are false calculations, leading to incorrect text outputs. Let’s stop anthropomorphizing computers.

source
Bieren@lemmy.world ⁨7⁩ ⁨months⁩ ago
It’s learning.

source
lightnsfw@reddthat.com ⁨7⁩ ⁨months⁩ ago
Garbage in, garbage out

source