AI finds errors in 90% of Wikipedia's best articles

120 likes

Submitted 1 day ago by King@blackneon.net to technology@lemmy.world

https://en.wikipedia.org/wiki/Wikipedia:Wikipedia_Signpost/2025-12-01/Opinion

Comments

  • chronicledmonocle@lemmy.world ⁨4⁩ ⁨hours⁩ ago

    Congrats. You just burned down 4 trees in the rainforest for every article you had an LLM analyze.

    LLMs can be incredibly useful, but everybody forgets how much of an environmental nightmare this shit is.

    • Kellenved@sh.itjust.works ⁨1⁩ ⁨hour⁩ ago

      This is my number 1 reason to oppose AI. It is not worth the damage.

  • Gonzako@lemmy.world ⁨8⁩ ⁨hours⁩ ago

    “Liar thinks truth is also a lie. More at 11”

  • echodot@feddit.uk ⁨9⁩ ⁨hours⁩ ago

    The problem is that a lot of this is almost impossible to actually verify. After all, if an article says a skyscraper has 70 stories, even people working in the building may not be able to verify that.

    I have worked in a building where the elevator only went to every other floor, and I must have been in that building for at least 3 months before I noticed, because the ground floor obviously had access and the floor I worked on just happened to have an elevator stop, so it never occurred to me that there might be other floors not listed.

    For something the size of a 63-story building (or whatever it actually was), it’s not really apparent from the outside either; you’d really have to put in the effort to count the windows. Plus, oftentimes the facade looks like more stories, so even counting the windows doesn’t necessarily give you an accurate answer, not that anyone would have the inclination to do so anyway. So yeah, I’m not surprised that errors like that exist.

    More to the point, the bigger issue is whether the AI can actually prove that it is correct. In the article there was contradictory information in official sources, so how does the AI know which one was the right one? Could somebody be employed to go and check? Presumably even the building management don’t know the article is incorrect, otherwise they would have been inclined to fix it.

  • dukemirage@lemmy.world ⁨1⁩ ⁨day⁩ ago

    legitimate use of an LLM

    • anamethatisnt@sopuli.xyz ⁨1⁩ ⁨day⁩ ago

      I find that an extremely simplified way of judging whether a use of an LLM is good or not is whether its output is used as a finished product. Here the human uses it to identify possible errors and then verifies the LLM’s output before acting, and the use of AI isn’t mentioned at all in the corrections themselves.

      The only danger I see is that errors the LLM didn’t find will continue to go undiscovered, but they probably would have gone undiscovered without the use of the LLM too.
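
      As a rough illustration of that “flag, then verify by hand” loop (ask_llm is an invented placeholder for whatever model API is used; nothing here comes from the article’s actual tooling):

      import json

      def ask_llm(article_text: str) -> str:
          """Hypothetical LLM call: returns a JSON list of suspected factual errors."""
          # Stubbed so the sketch runs without any real API.
          return json.dumps([{"claim": "The tower has 70 floors.",
                              "reason": "Conflicts with the cited source."}])

      def review_article(title: str, article_text: str) -> list:
          findings = json.loads(ask_llm(article_text))
          confirmed = []
          for f in findings:
              print(f"[{title}] suspected error: {f['claim']} ({f['reason']})")
              # A human checks the claim against the sources before anything is edited.
              if input("Confirm against sources? [y/N] ").strip().lower() == "y":
                  confirmed.append(f)
          return confirmed  # only human-confirmed findings ever lead to an edit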

      • porcoesphino@mander.xyz ⁨1⁩ ⁨day⁩ ago

        I think the first part you wrote is a bit hard to parse, but I think this is related.

        I think the problematic part of most genAI use cases is validation at the end. If you’re doing something that has a large amount of exploration but a small amount of validation, like this, then it’s useful.

        A friend was using it to learn the Linux command line; that can be framed as producing a single command at the end that you copy, paste and validate. That isn’t perfect, because the explanation could still be off and it wouldn’t be validated, but I think it’s still a better use case than most.

      • shiroininja@lemmy.world ⁨1⁩ ⁨day⁩ ago

        Or it falsely flags something as an error, the human has so much faith in the system that they assume it must be correct, and they either waste time hunting for a solution or bend reality to “correct” it, in a human form of hallucinating BS.

    • ordnance_qf_17_pounder@reddthat.com ⁨1⁩ ⁨day⁩ ago

      “AI” summed up. 95% of the time it’s pointless bullshit being shoehorned into absolutely everything. 5% of the time it can be useful.

      • dukemirage@lemmy.world ⁨1⁩ ⁨day⁩ ago

        like Comic Sans

    • Treczoks@lemmy.world ⁨1⁩ ⁨day⁩ ago

      Yep. Let it flag potential problems, and have humans react to them, e.g. by reviewing and correcting things manually. AI can do a lot of things quickly and efficiently, but it must be supervised like a toddler.

      • architect@thelemmy.club ⁨2⁩ ⁨hours⁩ ago

        So… the same as most employees but cheaper.

        People here are above average and overestimate the vast majority of humanity.

      • buffing_lecturer@leminal.space ⁨18⁩ ⁨hours⁩ ago

        This is an interesting idea:

        The “at least one” in the prompt is deliberately aggressive, and seems likely to force hallucinations in case an article is definitely error-free. So, while the sample here (running the prompt only once against a small set of articles) would still be too small for it, it might be interesting to investigate using this prompt to produce a kind of article quality metric: if it repeatedly results only in invalid error findings (i.e. findings that a human reviewer disagrees with), that should indicate that the article is less likely to contain factual errors.
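
        One rough sketch of how that metric could be computed (ask_llm and human_confirms are invented placeholders; in practice a person supplies the verdicts):

        import json, random

        def ask_llm(article_text: str) -> str:
            """Hypothetical LLM call (stubbed): returns a JSON list of suspected errors."""
            return json.dumps([{"claim": "example claim"}] * random.randint(1, 3))

        def human_confirms(finding: dict) -> bool:
            """Stand-in for a human reviewer's verdict on a single finding."""
            return random.random() < 0.3  # placeholder; a person decides in practice

        def invalid_finding_rate(article_text: str, runs: int = 10) -> float:
            """Share of findings a reviewer rejects across repeated runs of the same prompt.
            A consistently high rate would hint that the article contains few real errors."""
            total = invalid = 0
            for _ in range(runs):
                for finding in json.loads(ask_llm(article_text)):
                    total += 1
                    if not human_confirms(finding):
                        invalid += 1
            return invalid / total if total else 1.0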

    • de_lancre@lemmy.world ⁨18⁩ ⁨hours⁩ ago

      Wait, you mean using a Large Language Model that was created to parse walls of text, to parse walls of text, is a legit use?

      Those kids at openai would’ve been very upset if they could read.

      • lightnsfw@reddthat.com ⁨3⁩ ⁨hours⁩ ago

        Even for that it’s mid at best. I try using co-pilot at work often and it makes shit up constantly.

      • dukemirage@lemmy.world ⁨13⁩ ⁨hours⁩ ago

        Chatbots aren’t the worst use case either, even though we are headed in the wrong direction.

    • passepartout@feddit.org ⁨1⁩ ⁨day⁩ ago

      Yes and no. I have enjoyed reading through this approach, but it seems like a slippery slope from this to “vibe knowledge” where LLMs are used for actually trying to add / infer information.

      • architect@thelemmy.club ⁨2⁩ ⁨hours⁩ ago

        The issue is that some people are lazy cheaters no matter what you do. Banning every tool because of those people isn’t helpful to the rest of humanity.

      • LastYearsIrritant@sopuli.xyz ⁨1⁩ ⁨day⁩ ago

        Don’t discard a good technique cause it can be implemented poorly.

  • Stefan_S_from_H@discuss.tchncs.de ⁨1⁩ ⁨day⁩ ago

    A tool that gives at least 40% wrong answers, used to find errors in 90% of articles?

    • AcesFullOfKings@feddit.uk ⁨1⁩ ⁨day⁩ ago

      If you read the post, it’s actually quite a good method. Having an LLM flag potential errors and then reviewing them manually as a human is quite productive.

      I’ve done exactly that on a project that relies on user-submitted content; moderating submissions at even a moderate scale is hard, but having an LLM look through them for me is easy. I can then check anything it flags and moderate manually. Neither the accuracy nor the precision is particularly high, but it’s a low-effort way to find a decent number of the things you’re looking for. In my case I was looking for abusive submissions from untrusted users; in the OP author’s case they were looking for errors. I’m quite sure this method would never find all errors, and as per the article the “errors” it flags aren’t always correct either. But the reward-to-effort ratio is high.
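
      The shape of that pipeline is small; roughly something like this sketch, where classify_with_llm is a made-up stand-in for the real model call:

      from collections import deque

      def classify_with_llm(text: str) -> bool:
          """Hypothetical call returning True if the text *might* be abusive."""
          return "scam" in text.lower()  # stub so the example runs offline

      review_queue = deque()

      def triage(submissions):
          for s in submissions:
              if classify_with_llm(s):
                  review_queue.append(s)  # a human moderator makes the final call

      triage(["hello world", "totally legit scam link"])
      print(f"{len(review_queue)} submission(s) awaiting manual review")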

      • echodot@feddit.uk ⁨9⁩ ⁨hours⁩ ago

        But we don’t know what the false positive rate is either. How many submissions were blocked that shouldn’t have been? It seems like you don’t have a way to even find that metric out unless somebody complains about it.

    • s@piefed.world ⁨1⁩ ⁨day⁩ ago

      Image

    • acosmichippo@lemmy.world ⁨1⁩ ⁨day⁩ ago

      “90% errors” isn’t accurate. It’s not that 90% of all facts on Wikipedia are wrong; 90% of the featured articles contained at least one error, so the articles were still mostly correct.

    • amateurcrastinator@lemmy.world ⁨1⁩ ⁨day⁩ ago

      Bias needs to be reinforced!

  • crypt0cler1c@infosec.pub ⁨1⁩ ⁨day⁩ ago

    This is way overblown. Wikipedia is on par with the most accurate encyclopedias, with 3-4 factual errors per article.

    • TheBlackLounge@lemmy.zip ⁨1⁩ ⁨day⁩ ago

      More like 1 error, sometimes 2, in 90% of Wikipedia’s longest and most active articles.

  • helpImTrappedOnline@lemmy.world ⁨1⁩ ⁨day⁩ ago

    The first edit was undoing vandalism that had persisted for 5 years. Someone changed the number of floors a building had from 67 to 70.

    A friendly reminder to only use Wikipedia as a summary/reference aggregate for serious research.

    This is a cool tool for checking these sorts of things: run everything through the LLM to flag errors and go after them like a whack-a-mole game instead of a hidden-object game.

    • mika_mika@lemmy.world ⁨3⁩ ⁨hours⁩ ago

      Hehe 67

  • kalkulat@lemmy.world ⁨19⁩ ⁨hours⁩ ago

    Finding inconsistencies is not so hard. Pointing them out might be a -little- useful. But resolving them based on trustworthy sources can be a -lot- harder. Most science papers require privileged access. Many news stories may have been grounded in old, mistaken histories … if not in outright guesses, distortions or even lies. (The older the history, the worse.)

    And since LLMs are usually incapable of citing sources for their own (often batshit) claims anyway, where will ‘the right answers’ come from? I’ve seen LLMs, when questioned again, apologize that their previous answers were wrong.

    • architect@thelemmy.club ⁨2⁩ ⁨hours⁩ ago

      Which LLMs are incapable of citing sources?

      • jacksilver@lemmy.world ⁨1⁩ ⁨hour⁩ ago

        All of them. If you’re seeing sources cited, it means it’s RAG (an LLM with extra bits). The extra bits make a big difference: the response is limited to a select few points of reference rather than drawing on all known knowledge about the subject.
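
        Roughly, the “extra bits” are: retrieve a few relevant passages first, then ask the model to answer only from them and cite their IDs. A toy sketch (the documents and names here are made up for illustration):

        DOCS = {
            "signpost-2025-12-01": "An LLM review found at least one error in most featured articles.",
            "tower-article": "The building has 67 storeys according to the architect's filing.",
        }

        def retrieve(query: str, k: int = 2) -> dict:
            """Toy retriever: keyword overlap instead of a real vector index."""
            words = query.lower().split()
            scored = sorted(DOCS.items(),
                            key=lambda kv: -sum(w in kv[1].lower() for w in words))
            return dict(scored[:k])

        def build_prompt(query: str) -> str:
            context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in retrieve(query).items())
            return ("Answer using only the passages below and cite their IDs.\n"
                    + context + "\n\nQuestion: " + query)

        print(build_prompt("How many storeys does the tower have?"))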

  • kepix@lemmy.world ⁨1⁩ ⁨day⁩ ago

    The tool that is mainly based on Wikipedia info?

    • x00z@lemmy.world ⁨1⁩ ⁨day⁩ ago

      The tool doesn’t just check the text for errors it would know of. It can also check sources, compare articles, and find inconsistencies within the article itself.

      There’s a list of the problems it found that often explains where it got the correct information from.

  • GeneralEmergency@lemmy.world ⁨1⁩ ⁨day⁩ ago

    No surprise.

    Wikipedia ain’t the bastion of facts that lemmites make it out to be.

    It’s a mess of personal fiefdoms run by people with way too much time on their hands and an ego to match.

    • naeap@sopuli.xyz ⁨1⁩ ⁨day⁩ ago

      Yeah, better to use grokpedia /s

      • GeneralEmergency@lemmy.world ⁨5⁩ ⁨hours⁩ ago

        I know this is sarcasm, but in case people don’t know: oh Jesus Christ, no. At least Wikipedia has some form of oversight from multiple sources and people.
