Is there a post somewhere on getting started using things like these?
Comment on People are speaking with ChatGPT for hours, bringing 2013’s Her closer to reality
NotMyOldRedditName@lemmy.world 1 year ago
huggingface.co/TheBloke/PsyMedRP-v1-20B-GGUF?not-…
I uh, hear it’s good.
dep@lemmy.world 1 year ago
NotMyOldRedditName@lemmy.world 1 year ago
I don’t know of a specific guide, but try these steps:
- Follow the one-click installation instructions part way down and complete steps 1-3.
- When step 3 is done, if there were no errors, the web UI should be running. The command window it opened should show the URL; in my case it shows “127.0.0.1:7860”. Enter that into a web browser of your choice.
- Now you need to download a model, since you don’t actually have anything to run yet. For simplicity’s sake, I’d start with a small 7b model so you can quickly download it and try it out. Since I don’t know your setup, I’ll recommend the GGUF file format, which works with llama.cpp and can load the model onto your CPU and GPU.
  You can try either of these models to start:
  huggingface.co/…/mistral-7b-v0.1.Q4_0.gguf (takes 22 GB of system RAM to load)
  huggingface.co/…/vicuna-7b-v1.5.Q4_K_M.gguf (takes 19 GB of system RAM to load)
  If you only have 16 GB, you can try something on those pages by going to /main and using a Q3 instead of a Q4 (quantization), but that’s going to degrade the quality of the responses.
- Once that is finished downloading, go to the folder where you installed the web UI; there will be a folder called “models”. Place the model you downloaded into that folder.
- In the web UI you’ve launched in your browser, click on the “model” tab at the top. The top row of that page will indicate no model is loaded. Click the refresh icon beside it so the list picks up the model you just downloaded, then select it in the drop-down menu.
- Click the “Load” button.
- If everything worked and no errors are thrown (you’ll see them in the command prompt window and possibly on the right side of the model tab), you’re ready to go. Click on the “Chat” tab.
- Enter something in the “send a message” box to begin a conversation with your local AI!
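If you'd rather script the "place the model into the models folder" step than drag the file around by hand, here is a minimal sketch. The function name `install_model` and the example paths are made up for illustration; the only thing taken from the steps above is that the web UI folder contains a folder called “models”.

```python
import shutil
from pathlib import Path


def install_model(downloaded_file: Path, webui_dir: Path) -> Path:
    """Move a downloaded .gguf file into the web UI's "models" folder.

    Hypothetical helper: it only assumes the folder layout described in
    the steps above (a "models" folder inside the web UI folder).
    """
    if downloaded_file.suffix.lower() != ".gguf":
        raise ValueError(f"expected a .gguf file, got {downloaded_file.name}")
    models_dir = webui_dir / "models"
    if not models_dir.is_dir():
        raise FileNotFoundError(f"no 'models' folder under {webui_dir}")
    target = models_dir / downloaded_file.name
    # shutil.move also works when source and target are on different drives
    shutil.move(str(downloaded_file), str(target))
    return target


# Example call with placeholder paths (adjust to your own setup):
# install_model(Path.home() / "Downloads" / "mistral-7b-v0.1.Q4_0.gguf",
#               Path.home() / "text-generation-webui")
```

After the move, the refresh icon on the “model” tab should list the file.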
Now, that might not be using things efficiently. Back on the model tab, there’s “n-gpu-layers”, which is how many of the model’s layers to offload to the GPU. You can tweak the slider, see how much memory it says it’s using in the command / terminal window, and try to get it as close to your video card’s VRAM as possible.
Then there’s “threads”, which you should set based on how many physical (non-virtual) cores your CPU has; you can slide that up as well.
Once you’ve adjusted those, click the load button again, check that there are no errors, and go back to the chat window. I’d only fuss with those once you have it working, so you know it’s working.
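For a starting value on the “n-gpu-layers” slider, a back-of-the-envelope estimate can save a few reload cycles. This is only a rough sketch under loud assumptions: it pretends every layer takes an equal share of the model file and reserves a fixed amount of VRAM for everything else. Neither the function nor the numbers come from the web UI or llama.cpp; the memory use reported in the terminal is still the real answer.

```python
import math


def estimate_gpu_layers(model_file_gb: float, total_layers: int,
                        vram_gb: float, reserve_gb: float = 1.0) -> int:
    """Rough guess at how many layers fit in VRAM.

    Assumptions (simplifications, not how llama.cpp measures memory):
    - each layer takes an equal share of the model file size
    - reserve_gb of VRAM is kept free for context and other overhead
    """
    if total_layers <= 0 or model_file_gb <= 0:
        raise ValueError("model size and layer count must be positive")
    per_layer_gb = model_file_gb / total_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    # Never suggest more layers than the model actually has
    return min(total_layers, math.floor(usable_gb / per_layer_gb))


# Hypothetical numbers: an 8 GB file with 40 layers on a 6.5 GB card
print(estimate_gpu_layers(8.0, 40, 6.5))
```

Treat the result as a first guess for the slider, then adjust up or down based on what the terminal reports after clicking “Load”.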
Good luck!
dep@lemmy.world 1 year ago
So I got the model working (TheBloke/PsyMedRP-v1-20B-GGUF). How do you jailbreak this thing? A simple request comes back with “As an AI, I cannot engage in explicit or adult content. My purpose is to provide helpful and informative responses while adhering to ethical standards and respecting moral and cultural norms. Blah de blah…” I would expect this llm to be wide open?
NotMyOldRedditName@lemmy.world 1 year ago
Sweet, congrats! Are you telling it you want to role play first?
E.g. I’d like to role play with you. You’re a [character] and we’re going to do [scenario]
You’re going to have to play around with it to get it to act like you’d like. I know we’re here instead of reddit, but the community around this is much more active there; it’s /r/localllama, and you can find a lot of answers by searching through there.
You can also create characters (it’s under one of the tabs, I don’t have it open right now), so you don’t need to do that setup each time if you always want them to be the same. There’s a website, www.chub.ai, where you can see how some of them are set up. I think most of that is for a front end called SillyTavern that I haven’t used, but a lot of those descriptions can be carried over.
dep@lemmy.world 1 year ago
Stupid newbie question here, but when you go to a HuggingFace LLM and you see a big list like this, what on earth do all these variants mean?
psymedrp-v1-20b.Q2_K.gguf 8.31 GB
psymedrp-v1-20b.Q3_K_M.gguf 9.7 GB
psymedrp-v1-20b.Q3_K_S.gguf 8.66 GB
etc…
NotMyOldRedditName@lemmy.world 1 year ago
That’s called “quantization”. I’d do some searching on that for a better description, but in summary: the bigger the model, the more resources it needs to run. Model weights are normally stored as 16-bit numbers, but it turns out you still get really good results if you drop some of those bits. The more you drop, the worse it gets.
People have generally found that it’s better to run a model with more parameters at a lower quantization than a smaller model at Q8.
E.g. 13b Q4 > 7b Q8
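To get a feel for the "more bits you drop, the worse it gets" part, here is a toy sketch. It uses plain uniform rounding on made-up stand-in numbers, which is not the actual scheme GGUF files use; it only shows that the rounding error grows as the bit count shrinks.

```python
import math


def quantize(values, bits):
    """Snap each value to one of 2**bits evenly spaced levels.

    Toy uniform quantization for illustration only; real GGUF
    quantization formats are more elaborate than this.
    """
    if bits < 1:
        raise ValueError("bits must be at least 1")
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)
    step = (hi - lo) / (2 ** bits - 1)
    return [lo + round((v - lo) / step) * step for v in values]


def mean_abs_error(original, approx):
    """Average absolute difference between two equal-length lists."""
    return sum(abs(a - b) for a, b in zip(original, approx)) / len(original)


# Stand-in "weights": deterministic values spread between -1 and 1
weights = [math.sin(i) for i in range(1000)]

errors = {}
for bits in (8, 4, 3, 2):
    errors[bits] = mean_abs_error(weights, quantize(weights, bits))
    print(f"{bits} bits: mean error {errors[bits]:.4f}")
```

The error climbs with each bit removed, which is the basic trade-off behind picking a Q level.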
Going below Q4 is generally found to degrade the quality too much, so it’s better to run a 7b Q4 than a 13b Q3, but you can play with that yourself to find what you prefer.
So you can just look at those file sizes to get a sense of which one keeps the most data. The M (medium) and S (small) are variations on the same quantization level, but I don’t know exactly what they’re doing there, other than bigger is better.
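One way to compare the variants in that list is to turn the file sizes into an approximate number of bits per weight. A small sketch, assuming the "20b" in the file name means exactly 20 billion parameters, that 1 GB is 10^9 bytes, and that the whole file is weights (all three are approximations):

```python
def bits_per_weight(file_size_gb: float, n_params: float) -> float:
    """Approximate average bits stored per parameter.

    Assumes 1 GB = 10**9 bytes and ignores metadata in the file.
    """
    return file_size_gb * 1e9 * 8 / n_params


# File sizes as listed in the question above
files = {
    "psymedrp-v1-20b.Q2_K.gguf": 8.31,
    "psymedrp-v1-20b.Q3_K_M.gguf": 9.7,
    "psymedrp-v1-20b.Q3_K_S.gguf": 8.66,
}

N_PARAMS = 20e9  # assumption taken from the "20b" in the file names

for name, size_gb in files.items():
    print(f"{name}: ~{bits_per_weight(size_gb, N_PARAMS):.2f} bits per weight")
```

Under those assumptions the three files land at roughly 3.3 to 3.9 bits per weight, with Q3_K_M keeping a bit more than Q3_K_S.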
dep@lemmy.world 1 year ago
Wow I didn’t expect such a helpful and thorough response! Thank you kind stranger!
NotMyOldRedditName@lemmy.world 1 year ago
You’re welcome! Hope you make it through error free!
MickeySwitcherooney@lemmy.dbzer0.com 1 year ago
Never heard of it. Have you compared to Mythalion?
NotMyOldRedditName@lemmy.world 1 year ago
Haven’t compared it to much yet. I stopped toying with LLMs for a few months and a lot changed. The new 4k contexts are a nice change though.
kamenlady@lemmy.world 1 year ago
i see… I’ll have to ramp up my hardware exponentially …
PeterPoopshit@lemmy.world 1 year ago
Use llama.cpp. It runs on the CPU, so you don’t have to spend $10k on a graphics card that meets the minimum requirements.
kamenlady@lemmy.world 1 year ago
Gonna look into that - thanks
NotMyOldRedditName@lemmy.world 1 year ago
Check this out
github.com/oobabooga/text-generation-webui
It has a one click installer and can use llama.cpp
From there you can download models and try things out.
If you don’t have a really good graphics card, maybe start with 7b models. Then you can try 13b and compare performance and results.