As I mentioned elsewhere (below), I am currently running similar tests across 4 different 4B models (Qwen 3-4B Hivemind, Qwen 3-4B 2507 Instruct, Phi-4-mini, Granite-4-h-micro), using both grounded and ungrounded states. I'm aiming for 10,000 runs (currently at 3,500). Not to count chickens before they hatch, but at a ctx of 8192 it's trending toward a 98-99.9% hallucination reduction for the class of questions asked (see prior post).
If that holds (testing should be completed by tomorrow), that's a useful thing to know. Likewise, if it doesn't hold, that's also useful to know.
I actually have an idea of how to make this even better / more useful. Again, the chickens have not hatched. I will share what I discover here if there is interest / submit it for peer review.
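For anyone curious what "hallucination reduction" means as a number here, a minimal sketch of the grounded-vs-ungrounded comparison might look like this. The model names above are real, but `query_model()` and `is_hallucination()` are hypothetical stand-ins for whatever harness and judge are actually being used; this is just the arithmetic, not the actual test rig.

```python
def hallucination_rate(answers, is_hallucination):
    """Fraction of answers flagged as hallucinated by the judge."""
    flagged = sum(1 for a in answers if is_hallucination(a))
    return flagged / len(answers)

def reduction_pct(ungrounded_rate, grounded_rate):
    """Percent drop in hallucination rate when grounding is added."""
    if ungrounded_rate == 0:
        return 0.0  # nothing to reduce
    return 100.0 * (ungrounded_rate - grounded_rate) / ungrounded_rate

# Illustrative numbers only: a 40% ungrounded rate falling to 0.4%
# grounded would be a 99% reduction, in the range described above.
print(reduction_pct(0.40, 0.004))
```

The per-run loop (query each model in both states, flag each answer, tally) would wrap around these two functions; the key point is that "98-99.9% reduction" is relative to the ungrounded baseline, not an absolute rate.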
HubertManne@piefed.social 1 week ago
I have been saying this for a while. I am sorta hoping we see open source LLMs that are trained on a curated list of literature. It's funny that these came out and it seemed like the makers did not take the long-known garbage in, garbage out principle into account.