Comment

Comment on I've just created c/Ollama!

<- View Parent

brucethemoose@lemmy.world ⁨5⁩ ⁨months⁩ ago

Totally depends on your hardware, and what you tend to ask it. What are you running?

source

Sort:hotnew top

EncryptKeeper@lemmy.world ⁨5⁩ ⁨months⁩ ago
I’m going to go out in a limb and say they probably just want a comparable solution to Ollama.

source
- brucethemoose@lemmy.world ⁨5⁩ ⁨months⁩ ago
  OK.
  
  Then LM Studio. With Qwen3 30B IQ4_XS, low temperature sampling, open web ui frontend if you wish.
  
  That’s what I’m trying to say though, LLMs work a bajillion times better with just a little personal configuration. They are not “one click” magic boxes, they are specialized tools.
  
  Random example: on a Mac? Grab an MLX distillation, it’ll be way faster and better.
  
  Nvidia gaming PC? TabbyAPI with an exl3. Raspberry Pi? That’s important to know!
  
  What do you ask it to do? Set timers? Look at pictures? Cooking recipes? Search the web? Do you need stuff fast or accurate?
  
  This is one reason why ollama is so suboptimal, with the other being just bad defaults (Q4_0 quants, 2048 context, no imatrix or anything outside GGUF, bad sampling last I checked, chat template errors, bugs with certain models, I can go on…)
  
  source
southernbeaver@lemmy.world ⁨5⁩ ⁨months⁩ ago
My HomeAssistant is running on Unraid but I have an old NVIDIA Quadro P5000. I really want to run a vision model so that it can describe who is at my doorbell.

source
- brucethemoose@lemmy.world ⁨5⁩ ⁨months⁩ ago
  Oh actually that’s a good card for LLM serving!
  
  Use the llama.cpp server from source, it has better support for Pascal cards than anything else:
  
  github.com/ggml-org/llama.cpp/…/multimodal.md
  
  Gemma 3 is a hair too big (like 17-18GB), so I’d start with InternVL 14B Q5K XL: huggingface.co/…/InternVL3-14B-Instruct-GGUF
  
  Or Mixtral 24B IQ4_XS for more ‘text’ intelligence than vision: huggingface.co/…/Mistral-Small-3.2-24B-Instruct-2…
  
  source
WhirlpoolBrewer@lemmings.world ⁨5⁩ ⁨months⁩ ago
I have a MacBook 2 pro (Apple silicon) and would kind of like to replace Google’s Gemini as my go-to LLM. I think I’d like to run something like Mistral, probably. Currently I do have Ollama and some version of Mistral running, but I almost never used it as it’s on my laptop, not my phone.

I’m not big on LLMs and if I can find an LLM that I run locally and helps me get off of using Google Search and Gimini, that could be awesome. Currently I use a combo of Firefox, Qwant, Google Search, and Gemini for my daily needs. I’m not big into the direction Firefox is headed, I’ve heard there are arguments against Qwant, and using Gemini feels like the wrong answer for my beliefs and opinions.

I’m looking for something better without too much time being sunk into something I may only sort of like. Tall order, I know, but I figured I’d give you as much info as I can.

source
- brucethemoose@lemmy.world ⁨5⁩ ⁨months⁩ ago
  Honestly perplexity, the online service, is pretty good.
  
  But first question is: how much RAM does your Mac have? This is basically the factor for what model you can and should run.
  
  source
  - WhirlpoolBrewer@lemmings.world ⁨5⁩ ⁨months⁩ ago
    8GB
    
    source
    brucethemoose@lemmy.world ⁨5⁩ ⁨months⁩ ago
    8GB?
    
    You might be able to run Qwen3 4B: huggingface.co/mlx-community/…/main
    
    But honestly you don’t have enough RAM to spare, and even a small model might bog things down. I’d run Open Web UI or LM Studio with a free LLM API, like Gemini Flash, or pay a few bucks for something off openrouter. Or maybe Cerebras API.
    
    source
    -> View More Comments
- brucethemoose@lemmy.world ⁨5⁩ ⁨months⁩ ago
  Actually, to go ahead and answer, the “easiest” path would be LM Studio (which supports MLX quants natively and is not time intensive to install), and a DWQ quantization (which is a newer, higher quality variant of MLX quants).
  
  Probably one of these models, depending on how much RAM you have:
  
  huggingface.co/…/Magistral-Small-2506-4bit-DWQ
  
  huggingface.co/…/Qwen3-30B-A3B-4bit-DWQ-0508
  
  huggingface.co/…/GLM-4-32B-0414-4bit-DWQ
  
  With a bit more time invested, you could try to set up Open Web UI as an alterantive interface (which has its own built in web search like Gemini): openwebui.com
  
  And then use LM Studio (or some other MLX backend, or even free online API models) as the ‘engine’
  
  source
  - WhirlpoolBrewer@lemmings.world ⁨5⁩ ⁨months⁩ ago
    This is all new to me, so I’ll have to do a bit of homework on this. Thanks for the detailed and linked reply!
    
    source
    brucethemoose@lemmy.world ⁨5⁩ ⁨months⁩ ago
    I was a bit mistaken, these are the models you should consider:
    
    huggingface.co/mlx-community/Qwen3-4B-4bit-DWQ
    
    huggingface.co/AnteriorAI/…/main
    
    huggingface.co/unsloth/Jan-nano-GGUF (specifically the UD-Q4 or UD-Q5 file)
    
    These are state-of-the-art, as far as I know.
    
    source
    -> View More Comments