Comment on Please suggest some good self-hostable RAG for my LLM.
brucethemoose@lemmy.world 5 weeks ago
> I have an old Lenovo laptop with an NVIDIA graphics card.
@Maroon@lemmy.world The biggest question is what graphics card, but generally speaking this is… less than ideal.
To answer your question, Open Web UI is the new hotness: github.com/open-webui/open-webui
I personally use exui for a lot of my LLM work, but that’s because I’m an uber minimalist.
And on your setup, I would host the best model you can on kobold.cpp or the built-in llama.cpp server (just not Ollama) and use Open Web UI as your front end. You can also use llama.cpp to host an embeddings model for RAG, if you wish.
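For the embeddings part, llama.cpp's server speaks an OpenAI-style API, so wiring it into a RAG pipeline is mostly just pointing a client at it. Here's a minimal sketch from Python, assuming you've started llama-server with embeddings enabled on its default port (the port and model name are placeholders, check your own setup):

```python
# Minimal sketch: query a local llama.cpp server for embeddings via its
# OpenAI-compatible endpoint. Assumes llama-server is running with
# embeddings enabled on localhost:8080; adjust base_url and the model
# name for your actual setup.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

chunks = ["First chunk of a document.", "Second chunk of a document."]
resp = client.embeddings.create(model="local-embedding-model", input=chunks)

for chunk, item in zip(chunks, resp.data):
    print(f"{chunk[:30]!r} -> {len(item.embedding)} dimensions")
```

Open Web UI can point at an endpoint like this for its document indexing, if I remember its settings right, so you don't have to write any of this yourself.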
This is a general ranking of the “best” models for document answering and summarization: huggingface.co/…/Hallucination-evaluation-leaderb…
…But generally, I prefer not to mess with RAG retrieval and just slap the context I want into the LLM myself, and for this, the performance of your machine is kind of critical (depending on just how much “context” you want it to cover). I know this is !selfhosted, but once you get your setup dialed in, you may consider making calls to an API like Groq, Cerebras or whatever, or even renting a Runpod GPU instance if that’s in your time/money budget.
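To make “slapping the context in myself” concrete: it’s literally just pasting the document into the prompt and asking your question, no retrieval step at all. A rough Python sketch against whatever OpenAI-compatible server you end up running (the base_url, port and model name are placeholders):

```python
# Sketch of manual context stuffing: no retrieval, the whole document
# goes straight into the prompt. Points at a local OpenAI-compatible
# server (kobold.cpp / llama.cpp / TabbyAPI) or a hosted one like Groq;
# base_url and model are placeholders for illustration.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5001/v1", api_key="none")

with open("notes.txt", "r", encoding="utf-8") as f:
    document = f.read()

resp = client.chat.completions.create(
    model="local-model",
    messages=[
        {"role": "system", "content": "Answer using only the provided document."},
        {"role": "user", "content": f"Document:\n{document}\n\nQuestion: What are the key points?"},
    ],
)
print(resp.choices[0].message.content)
```

The catch is that the whole document has to fit in the model’s context window, which is exactly why your hardware (and how much context it can handle) matters here.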
kwa@lemmy.zip 5 weeks ago
I’m new to this and I was wondering why you don’t recommend ollama? It’s the first one I managed to run and it seemed decent, but if there are better alternatives I’m interested.
brucethemoose@lemmy.world 5 weeks ago
Pretty much everything has an API :P
ollama is OK because it’s easy and automated, but you can get higher performance, better VRAM efficiency, and better samplers from either kobold.cpp or tabbyAPI.
I’d recommend kobold.cpp for very short context (like 4K or less) or if you need to partially offload the model to CPU. Otherwise TabbyAPI, as it’s generally faster (but GPU only) and much better at long context thanks to its great K/V cache quantization.
They all have OpenAI-compatible APIs, though kobold.cpp also has its own web UI.
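Which also means switching between them is basically just changing the base URL your client (or Open Web UI) points at. Quick sketch, with the ports being typical defaults from memory, so double-check your server’s startup log:

```python
# The same OpenAI-compatible client code works against any of these
# backends; only base_url changes. Ports are typical defaults and may
# differ on your setup (TabbyAPI also wants its generated API key).
from openai import OpenAI

BACKENDS = {
    "kobold.cpp": "http://localhost:5001/v1",
    "llama.cpp":  "http://localhost:8080/v1",
    "tabbyAPI":   "http://localhost:5000/v1",
}

client = OpenAI(base_url=BACKENDS["kobold.cpp"], api_key="none")

resp = client.chat.completions.create(
    model="whatever-model-you-loaded",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(resp.choices[0].message.content)
```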