Comment on "I've just created c/Ollama!"
brucethemoose@lemmy.world 2 days ago
TBH you should fold this into localllama? Or open source AI?
I have very mixed (mostly bad) feelings on ollama. In a nutshell, they’re kinda Twitter attention grabbers that give zero credit/contribution to the underlying framework (llama.cpp). It’s also a highly suboptimal way for most people to run LLMs, especially if you’re willing to tweak.
They’re… slimy. I would always recommend Kobold.cpp, TabbyAPI, ik_llama.cpp, Aphrodite, or any number of other backends over them. Anything but ollama.
southernbeaver@lemmy.world 2 days ago
What would you recommend to hook up to my Home Assistant?
brucethemoose@lemmy.world 2 days ago
Totally depends on your hardware, and what you tend to ask it. What are you running?
southernbeaver@lemmy.world 12 hours ago
My Home Assistant instance is running on Unraid, but I have an old NVIDIA Quadro P5000. I really want to run a vision model so that it can describe who is at my doorbell.
EncryptKeeper@lemmy.world 2 days ago
I’m going to go out on a limb and say they probably just want a comparable solution to Ollama.
brucethemoose@lemmy.world 2 days ago
OK.
Then LM Studio, with Qwen3 30B IQ4_XS, low-temperature sampling, and an Open WebUI frontend if you wish.
That’s what I’m trying to say though, LLMs work a bajillion times better with just a little personal configuration. They are not “one click” magic boxes, they are specialized tools.
Random example: on a Mac? Grab an MLX distillation, it’ll be way faster and better.
Nvidia gaming PC? TabbyAPI with an exl3. Raspberry Pi? That’s important to know!
What do you ask it to do? Set timers? Look at pictures? Cooking recipes? Search the web? Do you need stuff fast or accurate?
This is one reason why ollama is so suboptimal, with the other being just bad defaults (Q4_0 quants, 2048 context, no imatrix or anything outside GGUF, bad sampling last I checked, chat template errors, bugs with certain models, I can go on…)
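To make that concrete, here’s a rough sketch of what “a little configuration” looks like once LM Studio’s local server is running. It exposes an OpenAI-compatible API (typically at localhost:1234, but check what your instance shows); the model name below is just a placeholder for whatever you’ve actually loaded:

```python
# Minimal sketch: query a local LM Studio server (OpenAI-compatible API)
# with low-temperature sampling. The port and model name are assumptions;
# match them to what your LM Studio instance actually reports.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1234/v1",  # LM Studio's usual local endpoint
    api_key="not-needed-locally",         # any string works for a local server
)

response = client.chat.completions.create(
    model="qwen3-30b-a3b",                # placeholder: use the model id LM Studio lists
    messages=[{"role": "user", "content": "Give me a quick weeknight pasta recipe."}],
    temperature=0.2,                      # low temperature = more consistent answers
)
print(response.choices[0].message.content)
```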
WhirlpoolBrewer@lemmings.world 1 day ago
I have an M2 MacBook Pro (Apple silicon) and would kind of like to replace Google’s Gemini as my go-to LLM. I think I’d like to run something like Mistral, probably. Currently I do have Ollama and some version of Mistral running, but I almost never use it since it’s on my laptop, not my phone.
I’m not big on LLMs, but if I can find one that runs locally and gets me off Google Search and Gemini, that would be awesome. Currently I use a combo of Firefox, Qwant, Google Search, and Gemini for my daily needs. I’m not big on the direction Firefox is headed, I’ve heard there are arguments against Qwant, and using Gemini feels like the wrong answer for my beliefs and opinions.
I’m looking for something better without too much time being sunk into something I may only sort of like. Tall order, I know, but I figured I’d give you as much info as I can.
brucethemoose@lemmy.world 1 day ago
Honestly, Perplexity, the online service, is pretty good.
But the first question is: how much RAM does your Mac have? That’s basically the deciding factor for which model you can and should run.
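As a very rough rule of thumb (ballpark numbers only, not exact): the weights take about parameters × bits-per-weight / 8 bytes, plus a few GB of headroom for context and everything else the Mac is doing. Something like:

```python
# Back-of-the-envelope sketch: rough memory needed to run a quantized model.
# These are ballpark figures; real usage also depends on context length,
# the runtime, and whatever else is running on the machine.
def approx_model_gb(params_billions: float, bits_per_weight: float, overhead_gb: float = 3.0) -> float:
    weights_gb = params_billions * bits_per_weight / 8  # e.g. 30B at 4-bit is ~15 GB of weights
    return weights_gb + overhead_gb                     # KV cache, runtime, OS headroom

for params, bits in [(8, 4), (14, 4), (30, 4), (32, 4)]:
    print(f"{params}B @ {bits}-bit: ~{approx_model_gb(params, bits):.0f} GB total")
```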
brucethemoose@lemmy.world 1 day ago
Actually, to go ahead and answer, the “easiest” path would be LM Studio (which supports MLX quants natively and is not time intensive to install), and a DWQ quantization (which is a newer, higher quality variant of MLX quants).
Probably one of these models, depending on how much RAM you have:
huggingface.co/…/Magistral-Small-2506-4bit-DWQ
huggingface.co/…/Qwen3-30B-A3B-4bit-DWQ-0508
huggingface.co/…/GLM-4-32B-0414-4bit-DWQ
With a bit more time invested, you could try to set up Open WebUI as an alternative interface (which has its own built-in web search, like Gemini): openwebui.com
And then use LM Studio (or some other MLX backend, or even free online API models) as the ‘engine’
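Or, if you’d rather skip the GUI, the same MLX quants can (as far as I know) be driven straight from Python with the mlx-lm package. The repo id below is just a placeholder; swap in whichever of the models above fits your RAM:

```python
# Minimal sketch using the mlx-lm package (Apple silicon only).
# The repo id is a placeholder; substitute the actual mlx-community
# DWQ repo you picked from the list above.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/SOME-DWQ-MODEL")  # placeholder repo id

prompt = "Summarize the pros and cons of running an LLM locally."
text = generate(model, tokenizer, prompt=prompt, max_tokens=256)
print(text)
```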
TheHobbyist@lemmy.zip 2 days ago
Perhaps give Ramalama a try?
tal@lemmy.today 2 days ago
While I don’t think that llama.cpp is specifically a risk, I think that running generative AI software in a container is probably a good idea. It’s a rapidly-moving field with a lot of people contributing a lot of code that very quickly gets run on a lot of systems by a lot of people. There’s been malware that’s shown up in extensions for (for example) ComfyUI. And the software really doesn’t need to poke around at outside data.
Also, because the software has to touch the GPU, it needs a certain amount of outside access. Containerizing that takes some extra effort.
old.reddit.com/…/psa_please_secure_your_comfyui_i…
ComfyUI users have been hit time and time again with malware from custom nodes or their dependencies. If you’re just using the vanilla nodes, or nodes you’ve personally developed or vet yourself every update, then you’re fine. But you’re probably using custom nodes. They’re the great thing about ComfyUI, but also its great security weakness.
Half a year ago the LLMVISION node was found to contain an info stealer. Just this month the ultralytics library, used in custom nodes like the Impact nodes, was compromised, and a cryptominer was shipped to thousands of users.
Granted, the developers have been doing their best to try to help all involved by spreading awareness of the malware and by setting up an automated scanner to inform users if they’ve been affected, but what’s better than knowing how to get rid of the malware is not getting the malware at all.
Ollama means sticking it in a Docker container, and that is, I think, a positive thing.
If there were a close analog, like some software package that could take a given LLM model and run in podman or Docker or something, I think that that’d be great. But I think that putting the software in a container is probably a good move relative to running it uncontainerized.
brucethemoose@lemmy.world 2 days ago
I don’t understand.
Ollama is not actually docker, right? It’s running the same llama.cpp engine, it’s just embedded inside the wrapper app, not containerized.
And basically every LLM project ships a docker container. I know for a fact llama.cpp, TabbyAPI, Aphrodite, vllm and sglang do.
You are 100% right about security though; in fact, there’s a huge concern with compromised Python packages. This one almost got me: pytorch.org/blog/compromised-nightly-dependency/
tal@lemmy.today 2 days ago
I’m sorry, you are correct. The syntax and interface mirrors docker, and one can run ollama in Docker, so I’d thought that it was a thin wrapper around Docker, but I just went to check, and you are right — it’s not running in Docker by default. Sorry, folks! Guess now I’ve got one more thing to look into getting inside a container myself.
hasnep@lemmy.ml 1 day ago
Try ramalama, it’s designed to run models inside OCI containers.
TheHobbyist@lemmy.zip 2 days ago
Indeed, Ollama is going a shady route. github.com/ggml-org/llama.cpp/pull/11016#issuecom…
I started playing with Ramalama (the name is a mouthful) and it works great. There are one or two more steps in the setup, but I’ve achieved great performance, and the project makes good use of standards (OCI, Jinja, unmodified llama.cpp, from what I understand).
Go and check it out, they are compatible with models from HF and Ollama too.
github.com/containers/ramalama