have you tried Perplexity? it’s probably the best AI search engine right now, although it still misunderstands context sometimes. it’s pretty good at citing its sources though
Comment on Reddit blocking all major search engines, except Google
Vanth@reddthat.com 3 months ago [deleted]
kate@lemmy.uhhoh.com 3 months ago
tal@lemmy.today 3 months ago
I haven’t used the text-based search queries myself; I’ve used LLM software, but not for this, so I don’t know what the current situation is like. My understanding is that the current approach doesn’t really permit citing a source, and there are two issues with that:
There isn’t a direct link between one source and what’s being generated; the model isn’t really structured so as to retain this.
Many different sources probably contribute to the answer.
All of the information contributes a little bit to the probability of the next word that the thing is spitting out. It’s not that the software rapidly looks through all the pages out there and then finds a single reputable source that it could then cite, the way a human might. That is, you aren’t searching an enormous database when the query comes in; you’re repeatedly predicting which word comes next in the correct response, and that probability is derived from many different cases. Maybe tens of thousands of people have made posts on a given subject; the response isn’t just a quote from one of them, and the generated text may appear in none of them.
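To make that concrete, here’s a toy sketch. It is my own illustration, not how any real LLM is built: a simple bigram counter over three made-up posts, where `posts`, `train_bigrams`, `next_word_probs` and `generate` are all hypothetical names. The point is only that the next-word probabilities pool counts from several posts, and the generated sentence ends up matching none of them.

```python
from collections import Counter, defaultdict

# Three made-up "posts" standing in for training data (illustrative only).
posts = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "a cat slept on the rug",
]

def train_bigrams(texts):
    """Count, for each word, which words follow it across all texts."""
    counts = defaultdict(Counter)
    for text in texts:
        words = text.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def next_word_probs(counts, word):
    """Turn the follower counts for `word` into probabilities."""
    total = sum(counts[word].values())
    return {w: c / total for w, c in counts[word].items()}

def generate(counts, start, max_words=10):
    """Greedily pick the most likely next word until none is known."""
    words = [start]
    while len(words) < max_words:
        followers = counts.get(words[-1])
        if not followers:
            break
        words.append(followers.most_common(1)[0][0])
    return " ".join(words)

counts = train_bigrams(posts)
probs = next_word_probs(counts, "the")   # "rug" gets 2 of 5 counts, from two posts
generated = generate(counts, "a")
print(generated)                          # a cat sat on the rug
print(any(generated in p for p in posts)) # False: no post contains it
```

A real model is vastly more complicated than word-pair counts, but the same issue shows up even here: "rug" wins because two different posts pushed its count up, so there isn’t one post you could point to as "the source".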
It might be possible to basically run a traditional search on a generated response to find an example of that text, if it amounts to a quote (which it may not!).
And if Google produces some kind of “reliability score” for a given piece of material and weights the material in the training set by that (which I’d guess they will do, if they don’t already), they could maybe use the reliability score to rank the various sources when doing that backwards search.
But there’s no guarantee that that will succeed, because they’re ultimately synthesizing the response, not just quoting it, and because it can come from many sources. There may be no single source that says what Google is handing back.
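A rough sketch of that backwards search, under a lot of assumptions: the `sources` list, its example.com URLs and the `reliability` numbers are all invented, and `backward_search` is a hypothetical helper that only does verbatim substring matching. I have no idea what Google actually does; this just shows the two outcomes described above.

```python
# Hypothetical indexed sources; URLs and reliability scores are invented.
sources = [
    {"url": "https://example.com/a", "text": "The cat sat on the mat.", "reliability": 0.9},
    {"url": "https://example.com/b", "text": "I swear the cat sat on the mat all day.", "reliability": 0.4},
    {"url": "https://example.com/c", "text": "A cat slept on the rug.", "reliability": 0.7},
]

def backward_search(generated, sources):
    """Return sources containing `generated` verbatim, most reliable first."""
    needle = generated.lower()
    hits = [s for s in sources if needle in s["text"].lower()]
    return sorted(hits, key=lambda s: s["reliability"], reverse=True)

# A response that happens to be a quote: two candidate sources, ranked by score.
quoted = backward_search("the cat sat on the mat", sources)
print([s["url"] for s in quoted])

# A synthesized response: nothing to cite, even though the sources "informed" it.
synthesized = backward_search("a cat sat on the rug", sources)
print(synthesized)  # []
```

The first query comes back with the higher-scored source first; the second comes back empty, which is the failure case: the text was synthesized, so there is no page that actually says it.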
It’s possible that methods other than the present ones will be used for generating responses in the future, and those could have very different characteristics. Like, if this takes off, I would not be surprised if the resulting system ten years down the road is considerably more complex than what is presently being done.
There’s been some discussion about developing systems that do permit this, and I believe that if you want to read up on it, the term used is “attributability”, but I have not been reading the research on it.