Do you host your own AI?

SuspiciousCarrot78@aussie.zone · 6 hours ago

Do you host your own AI?

rando@lemmy.ml · 10 minutes ago

Bought b70 with egpu enclosure and usb4 connection wasn’t really planning to actually run anything but now ended up with llama.cpp with openwebui - kids/parents want to/have to use chat, might as well provide local solution than them using industry options. Also started with ollama and Gemma 4 26b a4b - asked it to write script to setup llama.cpp in container.

mierdabird@lemmy.dbzer0.com · edit-2 1 hour ago

I started out playing around with code generation using Ollama/open-webui and qwen 2.5 coder 14b on a 3060 12GB, but ended up on a winding journey with an ex datacenter card called the AMD V620. Its roughly equivalent to an RX 6800XT, but with double the VRAM. At this point i’ve really done nothing productive with it but learned a lot about bios settings, GPU/ROCm drivers, and custom fan solutions/PWM controls trying to get it setup and optimized haha.

It’s pretty sick though, that amount of VRAM with 512GB/s bandwidth can run Qwen 3.6 27B dense with 100k context window at 20 tokens/sec in LM studio. Draws 300 watts at the wall on my ITX chassis (idling about 30w).

I’ve been dabbling in building an aviation weather and field condition report application using this, but my next step is to rebuild my VS Code environment into a new machine. I’m kinda enjoying just fucking around with building the hardware too though

0^2@lemmy.dbzer0.com · 25 minutes ago

I went down the same rabbit hole. I have a 6800xt however but have issues getting it to perform outside of llm chats into using tools like pi.dev

Is it worth getting a v620?

Steve@startrek.website · 2 hours ago

I recently gave it a try with qwen3.5 and deepseek coder v2. I have a RTX3090 and these are the largest models that can run comfortably on it.

Conclusion, they are both fucking useless. Free tier claude runs circles.

SuspiciousCarrot78@aussie.zone · 2 hours ago

Yeah :(

Were not there yet on consumer rigs.

brucethemoose@lemmy.world · 2 hours ago

Did you serve them with ollama?

It’s basically broken, if you did. Try the same models over API, and you’ll see what I mean.

Steve@startrek.website · 2 hours ago

Is there an alternative to ollama? The point was to run something locally.

brucethemoose@lemmy.world · 1 hour ago

Oh, and I just saw you have a 3090.

To get more specific, you can actually run way better models than Qwen 3.5 and Deepseek coder (both of which are very obsolete now). The best that’s practical depends on how much CPU RAM you have, but at the minimum you can do Qwen 3.6 27B, with a more optimal quant like ones here: https://huggingface.co/ubergarm/Qwen3.6-27B-GGUF/tree/main

Or Gemma 31B QAT: https://huggingface.co/unsloth/gemma-4-31B-it-qat-GGUF

If you have 128GB CPU RAM, I can upload my custom MiMo 2.5 quant. That should “beat” the cheapest Claude, give or take.

If you have 64GB, I’d suggest a quantization of Step 3.7.

If you have 32GB or 48, I’m not sure. I’d need to look if any “small” MoE is actually better than Qwen 27B now.

brucethemoose@lemmy.world · 2 hours ago

https://sleepingrobots.com/dreams/stop-using-ollama/

And that’s not even all of it. Basically they break models in many ways, and they’re slimey Tech Bros.

LM Studio is better, and easy.

If you’re on Nvidia, and want to run optimally, I would use the ik_llama.cpp fork. On AMD, regular llama.cpp. On a Mac, use an MLX runner (Like LM Studio) with an MLX quant (ideally an MLX-DWQ quant).

It’s all pretty technical, and… thats kinda the point. LLMs are just too performance sensitive and too finicky to not have a grasp of how they work. There is no “easy button” to run them, there can’t be.

But if you don’t have time for that and just want to see if it’s worth it, I’d suggest self hosing your own UI, and trying the dirt cheap APIs of models you can theoretically run on your setup. This will give you a “best case” taste of what they’re capable of.

brucethemoose@lemmy.world · edit-2 2 hours ago

An aside for anyone reading this:

https://sleepingrobots.com/dreams/stop-using-ollama/

And that barely scratches the surface. Please.

Use anything but Ollama. Even APIs.

comrademiao@piefed.social · 47 minutes ago

looks like extreme nitpicking without any real issues beyond some VC funding a FOSS issues.

//whyre you spamming the comment to everyone? its quite alarmist actually

D_Air1@lemmy.ml · 3 hours ago

Yeah, I’m using qwen 31b a3b on an amd 9070xt requires a bit of cpu offloading, but still plenty fast. Using it wall llama.cpp. Combine that with some mcp’s such as ddg-search to make it truly useful by actually being able to search online.

I mostly use it for small tedious tasks with well defined inputs and outputs. For example when hyprland recently changed from their own configuration language to lua. At first I started going line by line translating my config to the new lua language until I realized oh wait this is exactly the type of thing that ML is useful for. Going from the well defined hyprland configuration language to their also well defined lua syntax. It banged it out in less than a minute with only a single mistake which I easily fixed. The mistake it made was that it forgot to translate the comments to lua. It did it in less than a minute and worked first try. Where as I had made several typos and gotten a few lines wrong when I was doing it by hand.

Not to say that I couldn’t do it. I would have gotten it done in about half an hour, but less than a minute is a lot faster.

I also used it to transform a bunch of unstructured data into json data, so that I could then use purpose built tools like jq to parse that. If I’m having trouble finding certain information. I’ll ask it to find me some resources to look at.

Basically small well defined tasks and parsing data is what I use it for and it seems to be pretty good at that.

What I don’t like is the way companies try to market it to people. I don’t believe people should be trying to summarize emails or messages from loved ones, writing essays or any other creative tasks for the most part. Translating is okay. I don’t expect a machine to be able to decide things for me or to be some filter between me and others.

algernon@lemmy.ml · 5 hours ago

Yes. My Actual Intelligence lives in my head, and runs mostly on coffee.

portifornia@piefed.social · 2 hours ago

Just coffee?!? That’s cool.

Mine runs on:

coffee
spite
tortilla chips
& shame

searabbit@piefed.social · 2 hours ago

If that’s not already on a shirt it should be

tal@lemmy.today · 4 hours ago

Do you get many hallucinations?

algernon@lemmy.ml · 4 hours ago

Only when I’m deprived of coffee.

boonhet@sopuli.xyz · 4 hours ago

Would flowers work instead?

zitrone 🍋@europe.pub · 4 hours ago

As we know AI stands for “An Indian”, so if you’re not from India, its actually impossible to self host.

Well, unless you manage to trap one in your basement, but that would violate human rights and hopefully also break the laws of your country.

SuspiciousCarrot78@aussie.zone · 3 hours ago

You may be confusing Indians with gremlins. Which might explain ChatGPTs obsession with gremlins

SuspiciousCarrot78@aussie.zone · 5 hours ago

I’ll make sure to send you flowers, Algernon lol

GreenCrunch@piefed.blahaj.zone · 4 hours ago

critical security bug: if coffee is taken away my head hurts :(

ButteredBread@sh.itjust.works · 5 hours ago

That doesn’t sound artificial.

SuspiciousCarrot78@aussie.zone · 3 hours ago

Plastic flowers.

thenextguy@lemmy.world · 4 hours ago

With sufficient coffee, mine shows considerable artifice.

curbstickle@anarchist.nexus · 3 hours ago

Yep.

Ollama + about 8 different models at the moment, hosted on a mac mini with open webui as a front end.

Predominantly for transcription, translation, an extra round of security checks on code, a more context friendly home assistant interface, and a daily run of context evaluation on property I’m looking for with a lot of specific needs (acreage, min elevation change, soil type, area, etc).

surewhynotlem@lemmy.world · 2 hours ago

I have to recommend switching to llamacpp. It’s SO much faster than ollama.

curbstickle@anarchist.nexus · 2 hours ago

On the list but haven’t gotten to it yet, but I know I should. I could probably get a bit more out of that box with it, expand the context windows a bit…

irmadlad@lemmy.world · 2 hours ago

mac mini

How? What is your average response time?

curbstickle@anarchist.nexus · 57 minutes ago

Apple silicon is pretty good at it as long as you’ve got the ram for it. I wouldn’t do less than 16GB.

A few seconds for most of the tasks

async_amuro@lemmy.zip · 3 hours ago

What spec Mini do you use?

curbstickle@anarchist.nexus · 3 hours ago

Just an m2 w/ 16gb I repurposed.

Can’t really do a lot at once, and the context is limited, but it does the trick. I’d buy a few more if I saw them at the right price.

async_amuro@lemmy.zip · 1 hour ago

Nice, I’ve got a Mac Studio M1 Max with 32GB of RAM that I use with Ollama and then I host OpenWebUI and OpenCode on my Arch Server. I use the Mac as a primary workstation, so it’s a little rough when I start running a model. I’m sure I could probably do and learn more about Ollama to improve my experience, but for now it works for certain tasks.

curbstickle@anarchist.nexus · 54 minutes ago

I got mine a few years back for some iOS builds, don’t need to do them that often so it became the model host for me

e0qdk@reddthat.com · 3 hours ago

I started running LLMs a couple months ago on my own hardware. I have a Framework Desktop that I ordered last year and also recently picked up a refurbished 24GB AMD RX 7900 XTX which I’m doing some performance testing against. The dGPU is much better for dense models, and slightly faster for MoE if I’m willing to run them at a lower quant – but uses more power and has annoying coil whine. The Framework Desktop uses ~100W under load, is quieter, and for the MoE models already runs them fast enough for most of my needs – so most of my LLM use happens on that system still.

For software: I’m using ollama on the Framework currently, but I want to replace it with just using llama.cpp directly eventually. I’ve been using llama-cli for testing the dGPU. I wrote my own chat client to interact with ollama as well as a few other programs for specific tasks.

I’ve been using the LLMs for a mix of research (both personal and professional), entertainment, practical coding tasks (mostly debugging and brainstorming, plus a bit of UI prototyping, automatic generation of sequence diagrams for documentation, and light scripting), as well as automation of tedious tasks.

As an example of the latter, people often send me requests to prepare data sets by email but don’t specify the sources they want precisely so I have to go match the name against the real name in our archives; LLMs are great for mapping the imperfect name – with typos, missing prefixes, incorrect addition of spaces, addition/removal of hyphens, etc. – to the exact name I actually need to pull the data off disk when given a lookup table to compare against.

As far as models go, I’m mostly using various Qwen 3.6 and Gemma4 variants. I have multiple versions of each for different purposes. llmfan46’s uncensored Qwen 3.6 35B-A3B @ Q6_K (from Hugging Face) is my default model currently.

brucethemoose@lemmy.world · edit-2 2 hours ago

Yep.

I have a RTX 3090 + 128GB CPU RAM.

Currently I run my own custom IQ3_KT quantization of MiMo 2.5 300B, and it’s crazy good. It’s better than API models from not that long ago, and it’s served at about reading speed.

Never thought I’d ever run such a thing on my lowly desktop.

For quick scripts or code assistant, sometimes I use Qwen 27B (another custom quant, currently experimenting with exllama). Or Gemini 12B for messing with image/audio input. But TBH MiMo 2.5 with thinking disabled is smarter than 27B with it.

…And honestly, I use GLM 5.2 API a good bit.

I was lucky enough to get a yearly subscription for like $30, 6 months ago. I do self host the UIs or whatever takes the prompts, though.

Schiffsmädchenjunge@sh.itjust.works · 3 hours ago

I’ve thought about it, but I actually could never think of anything I would do with it.

frongt@lemmy.zip · 6 hours ago

Yes. Openwebui/ollama for LLM, comfyui for stable diffusion. I just dick around with it as a toy.

mesa@piefed.social · edit-2 5 hours ago

Same. Its somewhat useful on some very small scripting or tasks…but its mostly just to try out a new model or two. Its not really useful for anything big.

I will have to say…even my tiny models are about as good as Chatgpt/Claude/etc… which makes me think about how much people are spending on tokens regularly. I was able to get the same kind of python script started with my local tiny model that was comparable to the newest Claude code offerings.

Lettuce eat lettuce@lemmy.ml · 4 hours ago

What local models have you been using? And what hardware are you running them on? I’ve been playing with local LLMs a bit for exactly your use case.

I have zero interest in vibe coding or full agentic workflows. But having a local LLM generate a Bash script to help me automate parts of my home lab infrastructure would be nice.

Die4Ever@retrolemmy.com · 57 minutes ago

What are your hardware specs?

Lettuce eat lettuce@lemmy.ml · 25 minutes ago

Ryzen 7 5800 X3D Radeon RX 9070XT 32GB of DDR4 system memory.

StrawberryPigtails@discuss.tchncs.de · 4 hours ago

Yes. Currently using Gemma4:12b behind OpenWebUI and Hermes Agent plus a few lighter models for OCR and tagging in Paperless.

slazer2au@lemmy.world · 6 hours ago

Nope.

atzanteol@sh.itjust.works · 6 hours ago

I’ve tried a few times but with only 8gig of vram it’s simply not worth it.

brucethemoose@lemmy.world · 2 hours ago

How much CPU RAM do you have?

atzanteol@sh.itjust.works · 2 hours ago

64G. But CPU inference is painfully slow.

brucethemoose@lemmy.world · edit-2 57 minutes ago

Not anymore. Not with hybrid offloading, where the GPU handles dense tensors and the CPU only runs the sparse MoEs. I’m running a 300B model on a single 3090, and its faster than I can read.

You just need to use the right framework, and the right model.

I’d suggest trying ik_llama.cpp and a MoE like one of these: https://huggingface.co/models?other=ik_llama.cpp&sort=modified&search=35B

And speculative decoding like DFlash or MTP (which you can also get specific models for).

EDIT: Wrong link.

Franconian_Nomad@feddit.org · 5 hours ago

Have you tried qwen3.5-9b? It’s pretty solid for its size.

atzanteol@sh.itjust.works · 3 hours ago

Yeah, it’s “good for its size” but it’s just too flaky for me to use for any significant coding.

Franconian_Nomad@feddit.org · 53 minutes ago

Yeah, I wouldn’t use it for coding. It’s a bit dumb unfortunately.

Nednarb44@lemmy.world · 6 hours ago

I do, I use ollama. I mostly just tinker, but I use with with home assistant for a quasi Alexa like experience with the voice assistant, I use it for summarizing some YouTube transcripts in too lazy to read/watch, and I’ve tried to see how capable it is with coding.

diminou@lemmy.zip · 6 hours ago

Can you elaborate on what you are using exactly with home assistant ? And is English your primary language in that context ?

Trying to do something similar, English not primary and its a bit… Harder than it seems. Can’t figure out if it is because I’m not using English or something else. (3060 12GB BTW)

Nednarb44@lemmy.world · 6 hours ago

English is my primary, so that does make it easier. I use it for general conversion things, like asking it questions about the Titanic or making up a new story or something. It doesn’t work as well as I’d like yet, but like I said, it’s just an other thing for me to mess around with and change.