As I mentioned elsewhere (below), I am currently running similar tests across 4 different 4B models (Qwen 3-4B Hivemind, Qwen 3-4B 2507 Instruct, Phi-4-mini, Granite-4-h-micro), using both grounded and ungrounded states. I'm aiming for 10,000 runs (currently at 3,500). Not to count chickens before they hatch, but at a ctx of 8192 it's trending toward a 98-99.9% hallucination reduction for the class of questions asked (see prior post).
If that holds (testing should be completed by tomorrow), that's a useful thing to know. Likewise, if it doesn't hold, that's also useful to know.
I actually have an idea of how to make this even better / more useful. Again, the chickens have not hatched. I will share what I discover here if there is interest / submit it for peer review.
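For anyone curious what "hallucination reduction" means as a number here, a minimal sketch of the grounded-vs-ungrounded comparison might look like this. The model names above are real, but `query_model()` and `is_hallucination()` are hypothetical stand-ins for whatever harness and judge are actually being used; this is just the arithmetic, not the actual test rig.

```python
def hallucination_rate(answers, is_hallucination):
    """Fraction of answers flagged as hallucinated by the judge."""
    flagged = sum(1 for a in answers if is_hallucination(a))
    return flagged / len(answers)

def reduction_pct(ungrounded_rate, grounded_rate):
    """Percent drop in hallucination rate when grounding is added."""
    if ungrounded_rate == 0:
        return 0.0  # nothing to reduce
    return 100.0 * (ungrounded_rate - grounded_rate) / ungrounded_rate

# Illustrative numbers only: a 40% ungrounded rate falling to 0.4%
# grounded would be a 99% reduction, in the range described above.
print(reduction_pct(0.40, 0.004))
```

The per-run loop (query each model in both states, flag each answer, tally) would wrap around these two functions; the key point is that "98-99.9% reduction" is relative to the ungrounded baseline, not an absolute rate.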
HubertManne@piefed.social 1 week ago
I have been saying this for a while. I am sorta hoping we see open source LLMs that are trained on a curated list of literature. It's funny that these came out and it seemed like the makers did not take the long-known garbage in, garbage out principle into account.