/r/LocalLLaMA/.rssLocalLlama2026-02-08T04:49:32+00:00python-feedgenhttps://www.redditstatic.com/icon.png/Subreddit to discuss AI & Llama, the large language model created by Meta AI.t3_1qyw322Best way to use multiple GPUs from different generations?2026-02-08T01:52:19+00:00/u/Tactful-Fellowhttps://old.reddit.com/user/Tactful-Fellow<!-- SC_OFF --><div class="md"><p>I gradually got into local LLMs last year, and I've accumulated three GPUs: a 3060, a 3090, and a 5090.</p> <p>The 3090 and 5090 are in my PC (256GB of DDR5, MSI Carbon mobo, AMD Ryzen processor). I've been using llama.cpp to run mainly 20-70B models in VRAM. Sometimes I use lower quants of GLM or Kimi in RAM, but I haven't been able to get above 2-3T/s with them so not as often.</p> <p>I've gotten access to an external GPU/oculink mount, so I could hook up the 3060, but my understanding so far was that the extra 12GB of VRAM probably isn't worth the performance overhead of doing inference across 3 cards.</p> <p><strong>Is there a good way to use the 3060 that I might not have thought of?</strong> Obviously I can wire it up and run some performance tests, but it occurs to me there may be some combination of engine (llama.cpp vs. ik_llama vs. vLLM, etc.), configuration options, or even some idea I've never heard of, where I could put the 3060 to some use.</p> <p>Thanks for any thoughts or suggestions. :)</p> </div><!-- SC_ON -->   submitted by   <a href="https://old.reddit.com/user/Tactful-Fellow"> /u/Tactful-Fellow </a> <br /> <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1qyw322/best_way_to_use_multiple_gpus_from_different/">[link]</a></span>   <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1qyw322/best_way_to_use_multiple_gpus_from_different/">[comments]</a></span>2026-02-08T01:52:19+00:00t3_1qydwoxDoomsdayOS running on my Thinkpad T14s live from a USB stick! (all-in-one ISO: LLMs, Wikipedia, Runtime, etc...)2026-02-07T13:32:57+00:00/u/poppearhttps://old.reddit.com/user/poppear<table> <tr><td> <a href="https://old.reddit.com/r/LocalLLaMA/comments/1qydwox/doomsdayos_running_on_my_thinkpad_t14s_live_from/"> <img alt="DoomsdayOS running on my Thinkpad T14s live from a USB stick! (all-in-one ISO: LLMs, Wikipedia, Runtime, etc...)" src="https://external-preview.redd.it/amhoMTMxeGttMmlnMX8NGHmEIKR1Shq8PrhwLMOPZOE4F_KOxFoLbBMbU6CW.png?width=640&crop=smart&auto=webp&s=3ac0e57675428604f2828fe09bc5da475e14962c" title="DoomsdayOS running on my Thinkpad T14s live from a USB stick! (all-in-one ISO: LLMs, Wikipedia, Runtime, etc...)" /> </a> </td><td> <!-- SC_OFF --><div class="md"><p>I am ready for the apocalypse.</p> <p>Repo here: <a href="https://github.com/cartesia-one/doomsday-os">https://github.com/cartesia-one/doomsday-os</a></p> </div><!-- SC_ON -->   submitted by   <a href="https://old.reddit.com/user/poppear"> /u/poppear </a> <br /> <span><a href="https://v.redd.it/lhz2yavkm2ig1">[link]</a></span>   <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1qydwox/doomsdayos_running_on_my_thinkpad_t14s_live_from/">[comments]</a></span> </td></tr></table>2026-02-07T13:32:57+00:00t3_1qysi7nDual 3090 setup but only one card is doing the work?! :)2026-02-07T23:12:22+00:00/u/Lord_777https://old.reddit.com/user/Lord_777<table> <tr><td> <a href="https://old.reddit.com/r/LocalLLaMA/comments/1qysi7n/dual_3090_setup_but_only_one_card_is_doing_the/"> <img alt="Dual 3090 setup but only one card is doing the work?! :)" src="https://b.thumbs.redditmedia.com/w22jz_l35kc7JL5Go1EctqcprCxvQwdzEfd9kaH0eus.jpg" title="Dual 3090 setup but only one card is doing the work?! :)" /> </a> </td><td> <!-- SC_OFF --><div class="md"><p>I've got dual rtx 3090 and I have to report that qwen3-coder-30b-q8 is working very nicely and its averaging around 50t/s </p> <p>Here are some stats from LM Studio: </p> <p><code>prompt eval time = 45497.91 ms / 49175 tokens ( 0.93 ms per token, 1080.82 tokens per second)</code><br /> <code>eval time = 7907.46 ms / 445 tokens ( 17.77 ms per token, 56.28 tokens per second)</code><br /> <code>total time = 53405.37 ms / 49620 tokens</code></p> <p>Now there is one thing that bothers me: while the model is split beween the two cards most of the time only one of the them is working very hard the 2nd rarely chips in ... </p> <p>Feels like the first part of the llm is on one of the card and the last few layers are on the 2nd. </p> <p>I was wondering is there some way to parallelize the effort so both card they can both work and hopefully finish faster (and I can bake some eggs with bacon on them :)</p> </div><!-- SC_ON -->   submitted by   <a href="https://old.reddit.com/user/Lord_777"> /u/Lord_777 </a> <br /> <span><a href="https://www.reddit.com/gallery/1qysi7n">[link]</a></span>   <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1qysi7n/dual_3090_setup_but_only_one_card_is_doing_the/">[comments]</a></span> </td></tr></table>2026-02-07T23:12:22+00:00t3_1qypvwqSome benchmarks on mlx with batch_generate and M3 ultra 256GB2026-02-07T21:24:26+00:00/u/Acrobatic-Drink-4540https://old.reddit.com/user/Acrobatic-Drink-4540<!-- SC_OFF --><div class="md"><p>Hi!<br /> I would like to share with you some benchmarks about my m3 ultra 256GB.<br /> I'm processing 26.320 file, for each file i am asking oss-120-b 8-bit to generate some information.</p> <p>In 204h 59 min since the start, i have processed 1237 batches over 1316 total.</p> <p>Here some stats from last batch:</p> <p>2026-02-07 21:56:02,815 - INFO - [MLX Batch] Avvio batch con 20 prompt, max_tokens=10000</p> <p>[batch_generate] Finished processing 20/20 ...</p> <p>[batch_generate] Prompt: 335881 tokens, 1214.919 tokens-per-sec</p> <p>[batch_generate] Generation: 71113 tokens, 129.252 tokens-per-sec</p> <p>[batch_generate] Peak memory: 155.345 GB</p> <p>2026-02-07 22:09:50,540 - INFO - [MLX Batch] Completato in 827.7s - 20 risposte, ~71091 token output totali</p> <p>As you can see, in 827 secs, i have processed 335.881 tokens and generated 71.113 tokens.</p> <p>Prompt Processing: 1214,91 tok/s<br /> Generation: 129,25 tok/s.</p> <p>I hope this can be useful for someone.</p> </div><!-- SC_ON -->   submitted by   <a href="https://old.reddit.com/user/Acrobatic-Drink-4540"> /u/Acrobatic-Drink-4540 </a> <br /> <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1qypvwq/some_benchmarks_on_mlx_with_batch_generate_and_m3/">[link]</a></span>   <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1qypvwq/some_benchmarks_on_mlx_with_batch_generate_and_m3/">[comments]</a></span>2026-02-07T21:24:26+00:00t3_1qyt56iArkOS: Modular open source agent runtime for local models2026-02-07T23:39:55+00:00/u/Embarrassed-Boot1080https://old.reddit.com/user/Embarrassed-Boot1080<!-- SC_OFF --><div class="md"><p>ArkOS is an open source workflow and agent system designed for long running tasks, persistent memory, and full local control.</p> <p><strong>Core features:</strong></p> <ul> <li>Modular architecture - every component is replaceable (agent, state, memory, tools, model)</li> <li>Explicit state graphs for deterministic agent behavior</li> <li>Supports local LLMs and embeddings (no hosted model dependency)</li> <li>Persistent short and long-term memory with inspectable storage</li> <li>Resource augmented execution (tools, retrieval, memory)</li> <li>MCP-based stdin and OAuth integrations</li> <li>All-in-one Linux deployment (inference, embeddings, database included)</li> <li>No forced cloud services, no data exfiltration</li> </ul> <p><strong>Why we built this:</strong></p> <p>Most agent frameworks force you to choose between convenience and control. We're building something different: agents that run on infrastructure you control, with behavior you can inspect and modify.</p> <p>This is step one. The real goal is agents that actually learn from their environment and adapt through memory and parametric optimization.</p> <p><strong>What we need (Open Source Contributors):</strong></p> <p>We're a MIT SIPB project building towards a hosted platform for MIT students in Spring 2026 (campus infrastructure, data never leaves MIT's network). But the codebase is open and we need help:</p> <ul> <li>Project managers with an ear to the ground</li> <li>ML researchers working on continual learning</li> <li>Systems engineers who care about local infrastructure</li> <li>Software engineers interested in stateful agent architectures</li> <li>Anyone frustrated with opaque cloud-only agent platforms</li> </ul> <p><strong>Get involved:</strong></p> <p>Repo:<a href="https://github.com/SGIARK/ARKOS"> https://github.com/SGIARK/ARKOS</a></p> <p>Contribute: [<a href="mailto:sipb-ark@mit.edu">sipb-ark@mit.edu</a>](mailto:<a href="mailto:sipb-ark@mit.edu">sipb-ark@mit.edu</a>)</p> </div><!-- SC_ON -->   submitted by   <a href="https://old.reddit.com/user/Embarrassed-Boot1080"> /u/Embarrassed-Boot1080 </a> <br /> <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1qyt56i/arkos_modular_open_source_agent_runtime_for_local/">[link]</a></span>   <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1qyt56i/arkos_modular_open_source_agent_runtime_for_local/">[comments]</a></span>2026-02-07T23:39:55+00:00t3_1qyhppcBuilt comprehensive Grafana monitoring for my LLM home server2026-02-07T16:09:27+00:00/u/pfn0https://old.reddit.com/user/pfn0<table> <tr><td> <a href="https://old.reddit.com/r/LocalLLaMA/comments/1qyhppc/built_comprehensive_grafana_monitoring_for_my_llm/"> <img alt="Built comprehensive Grafana monitoring for my LLM home server" src="https://a.thumbs.redditmedia.com/ll3zFE54xcfIpCMbLRHpGZHWMeuf5bY2qrdJO2uBGH4.jpg" title="Built comprehensive Grafana monitoring for my LLM home server" /> </a> </td><td> <!-- SC_OFF --><div class="md"><p>I wanted better visibility into my LLMs running on llama-server, particularly since it tends to crash silently during model loading when allocation failures occur. Instead of manually checking logs and CLI each time, I built this dashboard.</p> <p>All components run in docker containers: - grafana - prometheus<br /> - dcgm-exporter - llama-server - go-tapo-exporter (wall power monitoring) - custom docker image</p> <p>The custom image provides HTTP service discovery for Prometheus, exposes model load states (visible at bottom), and scrapes nvidia-smi processes for per-compute-process statistics.</p> <p>Dashboarding isn't just passive - I can click the green status bar (color-coded over time) or any model in the list to load/unload them directly.</p> <p>The dashboard tracks: - Prompt and token processing rates - GPU utilization and memory paging - Power consumption breakdowns - VRAM/RAM usage per compute process<br /> - Network and disk throughput</p> <p>I'm satisfied with how it functions and looks at this point.</p> </div><!-- SC_ON -->   submitted by   <a href="https://old.reddit.com/user/pfn0"> /u/pfn0 </a> <br /> <span><a href="https://www.reddit.com/gallery/1qyhppc">[link]</a></span>   <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1qyhppc/built_comprehensive_grafana_monitoring_for_my_llm/">[comments]</a></span> </td></tr></table>2026-02-07T16:09:27+00:00t3_1qyeje2The M5 max and possibly the m5 ultra macs are coming soon!2026-02-07T14:01:01+00:00/u/power97992https://old.reddit.com/user/power97992<!-- SC_OFF --><div class="md"><p>Just imagine having 256 gb of ram on MacBook! Mac os 26.3 should be coming out next week since the rc version is already out . They might release the m5 max with it since the os leak has the m5 max and ultra codenames in it. Crazy deepseek 4 and glm 5 and non codex gpt 5.3 are coming out soon too. Minimax 2.2 shouldnt be far either . If they release a macbook with the m5 ultra , I think people will go crazy over it, but the cooling is not good enough. A mac studio is more likely But since the packaging is different, u might be able to choose your gpu separately from your cpu.</p> </div><!-- SC_ON -->   submitted by   <a href="https://old.reddit.com/user/power97992"> /u/power97992 </a> <br /> <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1qyeje2/the_m5_max_and_possibly_the_m5_ultra_macs_are/">[link]</a></span>   <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1qyeje2/the_m5_max_and_possibly_the_m5_ultra_macs_are/">[comments]</a></span>2026-02-07T14:01:01+00:00t3_1qycn5sDeepSeek-V2-Lite vs GPT-OSS-20B on my 2018 potato i3-8145U + UHD 620, OpenVINO Comparison.2026-02-07T12:32:28+00:00/u/RelativeOperation483https://old.reddit.com/user/RelativeOperation483<table> <tr><td> <a href="https://old.reddit.com/r/LocalLLaMA/comments/1qycn5s/deepseekv2lite_vs_gptoss20b_on_my_2018_potato/"> <img alt="DeepSeek-V2-Lite vs GPT-OSS-20B on my 2018 potato i3-8145U + UHD 620, OpenVINO Comparison." src="https://b.thumbs.redditmedia.com/HDuzkmv1h50aT7P7iyAduUT2TYW4g84iH4RYq4kv6Hg.jpg" title="DeepSeek-V2-Lite vs GPT-OSS-20B on my 2018 potato i3-8145U + UHD 620, OpenVINO Comparison." /> </a> </td><td> <!-- SC_OFF --><div class="md"><p>Same potato, new test. If you saw my last post, you will catch this up. I run LLMs on a <strong>2018 HP ProBook 8th Gen i3 with no Nvidia, no dedicated GPU</strong>, just hope and an OpenVINO backend. This time I wanted to see how two MoE models compare head to head on the exact same hardware, same questions, same settings, same everything.</p> <p>Same 10 questions for both models. Logic, health, history, coding, creative writing, factual biography, math, tech explainer, ethics, food science. Wide spread of topics to stress test general capability.</p> <p>Each model was tested 3 times, each time running all 10 questions on CPU first then on iGPU with 1 layer offloaded. So that is 10 questions x 3 runs = 30 samples per device per model. 120 total inference runs. Same context (4096), same max output (256 tokens), same temperature (0.2), same top_p (0.9). Identical conditions.</p> <p><em>THE SPEED</em></p> <ul> <li>DeepSeek-V2-Lite absolutely smoked GPT-OSS. Almost 2x faster across the board.</li> <li>DeepSeek on CPU: 7.93 tok/s average, TTFT 2.36s</li> <li>DeepSeek on iGPU: 8.08 tok/s average, TTFT 1.86s</li> <li>Peak decode: 8.28 tok/s (iGPU) — Lowest: 5.50 tok/s (CPU, cold start Q1)</li> <li>GPT-OSS on CPU: 4.20 tok/s average, TTFT 3.13s</li> <li>GPT-OSS on iGPU: 4.36 tok/s average, TTFT 3.07s</li> <li>Peak decode: 4.46 tok/s (CPU) — Lowest: 3.18 tok/s (CPU, two questions got stuck slow)</li> </ul> <p>In real time, DeepSeek finishes a 256-token response in about 32 seconds. GPT-OSS takes over a minute. That is the difference between usable and painful on a slow machine. The iGPU helped DeepSeek more than GPT-OSS. DeepSeek's time to first token dropped 21% on iGPU (from 2.36s to 1.86s). GPT-OSS barely changed. So if you are on iGPU, the smaller active parameter count benefits more from that little offload. (Just my opinion) </p> <p><em>THE QUALITY (I read every single response)</em></p> <p>I went through all the outputs manually. Not vibes, actually reading them.</p> <p>DeepSeek-V2-Lite: 7.5 out of 10</p> <p>Very consistent. Clean structured answers. Good at health, history, math, tech explainers, ethics, food science. Wrote a complete cyberpunk poem. Solid Magna Carta summary. Nailed the Golden Ratio with three nature examples. Good VPN envelope analogy. Maillard reaction explanation was textbook quality.</p> <p>Weaknesses<br /> But for today, it got the logic question wrong. The classic "All A are B, some B are C, therefore some A are C". DeepSeek confidently said it is valid. It is not. That is a well-known syllogistic fallacy. Also on the coding question (Tower of Hanoi), <strong>it spent all its tokens explaining the problem and left the actual function as "# Your code here" without writing the implementation. Small factual error in Marie Curie bio (described her heritage incorrectly)</strong>.</p> <p>GPT-OSS-20B: <strong>2 out of 10</strong></p> <p>When it worked, it was impressive. It correctly identified the logic question as invalid and gave a concrete counterexample with sets to prove it. That was genuinely good reasoning. It also produced a complete working Tower of Hanoi implementation with proper recursion, base case, and example usage. The ethics response on the trolley problem was decent too.</p> <p>Weaknesses </p> <p>Hallucinated or broke down on 8 out of 10 questions. And I do not mean subtle errors, I mean full collapse. The health question turned into a loop of "Sure! Here is a revised version of the prompt" repeated over and over without ever answering. The history question started ok then degenerated into repeated "Answer:" blocks and "**...**" until the token limit. The VPN question was the worst — it looped "The user is a 3rd person perspective. The user is a 3. The user is a 3." endlessly. Marie Curie question confused itself trying to summarize events from 2018-2023 for a woman who died in 1934. Golden Ratio collapsed into the same looping pattern. The poem spent all its tokens reasoning about what to write and only managed 4 lines.</p> <p>This was not random. The same questions broke the same way across all 3 runs. It is a problem, GPT-OSS seems to be a reasoning/thinking model that burns its output budget on internal chain-of-thought and then either never reaches the answer or gets trapped in repetition loops. <strong>With only 256 tokens of output, it simply cannot think AND answer. Caution, I'm not saying Gpt-oss is bad, It can probably be the effect of Q4_K_M.</strong></p> <p>DeepSeek-Coder-V2-Lite is the better model for budget hardware if we compare these 2 only. It is faster, more coherent, and way more reliable. <strong>GPT-OSS has flashes of real intelligence (that logic answer was better than what most small models produce)</strong> but a model that loops on 8 out of 10 questions is not usable for anything practical at Q4_K_M. <strong>GPT-OSS might do better with higher max_tokens, and higher quantization.</strong> I only tested Q4_K_M at 256 max output. If someone with better hardware wants to test it with more ram, more higher specs, Go for it. </p> <p>I attached some screenshots in this post. </p> </div><!-- SC_ON -->   submitted by   <a href="https://old.reddit.com/user/RelativeOperation483"> /u/RelativeOperation483 </a> <br /> <span><a href="https://www.reddit.com/gallery/1qycn5s">[link]</a></span>   <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1qycn5s/deepseekv2lite_vs_gptoss20b_on_my_2018_potato/">[comments]</a></span> </td></tr></table>2026-02-07T12:32:28+00:00t3_1qyunleBest models to use with a RX580 in 2026?2026-02-08T00:46:34+00:00/u/fernandin83https://old.reddit.com/user/fernandin83<!-- SC_OFF --><div class="md"><p>Which models are performing well with an RX 580 in 2026?</p> </div><!-- SC_ON -->   submitted by   <a href="https://old.reddit.com/user/fernandin83"> /u/fernandin83 </a> <br /> <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1qyunle/best_models_to_use_with_a_rx580_in_2026/">[link]</a></span>   <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1qyunle/best_models_to_use_with_a_rx580_in_2026/">[comments]</a></span>2026-02-08T00:46:34+00:00t3_1qy5xnnKimi-Linear-48B-A3B & Step3.5-Flash are ready - llama.cpp2026-02-07T05:59:11+00:00/u/pmttyjihttps://old.reddit.com/user/pmttyji<!-- SC_OFF --><div class="md"><p>Below are actual releases for both models. Anyway get <a href="https://github.com/ggml-org/llama.cpp/releases">latest version</a></p> <p>Step3.5-Flash</p> <p><a href="https://github.com/ggml-org/llama.cpp/releases/tag/b7964">https://github.com/ggml-org/llama.cpp/releases/tag/b7964</a></p> <p>Kimi-Linear-48B-A3B</p> <p><a href="https://github.com/ggml-org/llama.cpp/releases/tag/b7957">https://github.com/ggml-org/llama.cpp/releases/tag/b7957</a></p> <p>I don't see any new GGUFs( <a href="https://huggingface.co/models?library=gguf&other=base_model:quantized:moonshotai%2FKimi-Linear-48B-A3B-Instruct&sort=created">Kimi</a> & <a href="https://huggingface.co/models?library=gguf&other=base_model:quantized:stepfun-ai%2FStep-3.5-Flash&sort=trending">Step-3.5</a> ) from our favorite sources yet. Probably today or tomorrow. </p> <p>But ik_llama folks got GGUF for <a href="https://huggingface.co/ubergarm/Step-3.5-Flash-GGUF">Step-3.5-Flash</a> by ubergarm.</p> </div><!-- SC_ON -->   submitted by   <a href="https://old.reddit.com/user/pmttyji"> /u/pmttyji </a> <br /> <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1qy5xnn/kimilinear48ba3b_step35flash_are_ready_llamacpp/">[link]</a></span>   <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1qy5xnn/kimilinear48ba3b_step35flash_are_ready_llamacpp/">[comments]</a></span>2026-02-07T05:59:11+00:00t3_1qy0l26Nemo 30B is insane. 1M+ token CTX on one 30902026-02-07T01:39:58+00:00/u/Dismal-Effect-1914https://old.reddit.com/user/Dismal-Effect-1914<!-- SC_OFF --><div class="md"><p>Been playing around with llama.cpp and some 30-80B parameter models with CPU offloading. Currently have one 3090 and 32 GB of RAM. Im very impressed by Nemo 30B. 1M+ Token Context cache, runs on one 3090, CPU offloading for experts. Does 35 t/s which is faster than I can read at least. Usually slow as fuck at this large a context window. Feed it a whole book or research paper and its done summarizing in like a few mins. This really makes long context windows on local hardware possible. The only other contender I have tried is Seed OSS 36b and it was much slower by about 20 tokens.</p> </div><!-- SC_ON -->   submitted by   <a href="https://old.reddit.com/user/Dismal-Effect-1914"> /u/Dismal-Effect-1914 </a> <br /> <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1qy0l26/nemo_30b_is_insane_1m_token_ctx_on_one_3090/">[link]</a></span>   <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1qy0l26/nemo_30b_is_insane_1m_token_ctx_on_one_3090/">[comments]</a></span>2026-02-07T01:39:58+00:00t3_1qydx7zSuccessfully built an Autonomous Research Agent to handle 10k PDFs locally (32GB RAM / AnythingLLM)2026-02-07T13:33:36+00:00/u/NGU-FREEFIREhttps://old.reddit.com/user/NGU-FREEFIRE<!-- SC_OFF --><div class="md"><p>Wanted to share a quick win. I’ve been experimenting with Agentic RAG to handle a massive local dataset (10,000+ PDFs).</p> <p>Most standard RAG setups were failing or hallucinating at this scale, so I moved to an <strong>Autonomous Agent</strong> workflow using AnythingLLM and Llama 3.2. The agent now performs recursive searches and cross-references data points before giving me a final report.</p> <p>Running it on 32GB RAM was the sweet spot for handling the context window without crashing.</p> <p>If you're looking for a way to turn a "dumb" archive into a searchable, intelligent local database without sending data to the cloud, this is definitely the way to go.</p> </div><!-- SC_ON -->   submitted by   <a href="https://old.reddit.com/user/NGU-FREEFIRE"> /u/NGU-FREEFIRE </a> <br /> <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1qydx7z/successfully_built_an_autonomous_research_agent/">[link]</a></span>   <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1qydx7z/successfully_built_an_autonomous_research_agent/">[comments]</a></span>2026-02-07T13:33:36+00:00t3_1qyurjqQuantization-Aware distillation2026-02-08T00:51:32+00:00/u/perfect-finetunehttps://old.reddit.com/user/perfect-finetune<!-- SC_OFF --><div class="md"><p>I stumbled upon this research paper and it got me really interested so I would like to share it with you.</p> <p><a href="https://arxiv.org/abs/2601.20088">https://arxiv.org/abs/2601.20088</a></p> <p>enjoy!</p> </div><!-- SC_ON -->   submitted by   <a href="https://old.reddit.com/user/perfect-finetune"> /u/perfect-finetune </a> <br /> <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1qyurjq/quantizationaware_distillation/">[link]</a></span>   <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1qyurjq/quantizationaware_distillation/">[comments]</a></span>2026-02-08T00:51:32+00:00t3_1qykuxdGLM-4.7-Flash reasoning is amazing2026-02-07T18:09:12+00:00/u/perfect-finetunehttps://old.reddit.com/user/perfect-finetune<!-- SC_OFF --><div class="md"><p>The model is very aware when to start using structured points and when to talk directly and use minimal tokens.</p> <p>For example I asked it a maths problem and asked it to do web search,when he saw the math problem he started to put the problem into different pieces and analyze each and then achieved conclusion.</p> <p>where when it was operating in agentic environment it's like "user told me ..,I should..." Then it calls the tool directly without Yapping inside the Chain-Of-Thought.</p> <p>Another good thing that it uses MLA instead of GQA which makes it's memory usage significantly lower and allows it to fit directly on some GPUs without offload.</p> </div><!-- SC_ON -->   submitted by   <a href="https://old.reddit.com/user/perfect-finetune"> /u/perfect-finetune </a> <br /> <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1qykuxd/glm47flash_reasoning_is_amazing/">[link]</a></span>   <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1qykuxd/glm47flash_reasoning_is_amazing/">[comments]</a></span>2026-02-07T18:09:12+00:00t3_1qyjm0lBenchmarking total wait time instead of pp/tg2026-02-07T17:22:35+00:00/u/batsbahttps://old.reddit.com/user/batsba<table> <tr><td> <a href="https://old.reddit.com/r/LocalLLaMA/comments/1qyjm0l/benchmarking_total_wait_time_instead_of_pptg/"> <img alt="Benchmarking total wait time instead of pp/tg" src="https://preview.redd.it/dmf3ykavv3ig1.png?width=640&crop=smart&auto=webp&s=fb531575915521581cd6f6e05acf9e09b011c7f3" title="Benchmarking total wait time instead of pp/tg" /> </a> </td><td> <!-- SC_OFF --><div class="md"><p>I find pp512/tg128 numbers not very useful for judging real-world performance. I've had setups that looked acceptable on paper but turned out to be too slow in real use.</p> <p>So I started benchmarking total time to process realistic context sizes (1k to 64k tokens) + generation (always 500 tokens), which I think better represents what actually matters: how long do I need to wait?</p> <p>Automated the whole process and put results on a website. Attached a screenshot showing some results for the Strix Halo 128 GB. Link if anyone's curious: <a href="https://llocalhost.com/speed-bench/best-per-system/">https://llocalhost.com/speed-bench/best-per-system/</a></p> <p>What do you think is the best way to express how fast a local setup actually is?</p> </div><!-- SC_ON -->   submitted by   <a href="https://old.reddit.com/user/batsba"> /u/batsba </a> <br /> <span><a href="https://i.redd.it/dmf3ykavv3ig1.png">[link]</a></span>   <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1qyjm0l/benchmarking_total_wait_time_instead_of_pptg/">[comments]</a></span> </td></tr></table>2026-02-07T17:22:35+00:00t3_1qywlk0Step-3.5 Flash2026-02-08T02:16:12+00:00/u/jacek2023https://old.reddit.com/user/jacek2023<table> <tr><td> <a href="https://old.reddit.com/r/LocalLLaMA/comments/1qywlk0/step35_flash/"> <img alt="Step-3.5 Flash" src="https://b.thumbs.redditmedia.com/5HyLXzZ2sQrWzeAiMYCs3Ybafk4NxcnrHGjDU-oPsxk.jpg" title="Step-3.5 Flash" /> </a> </td><td> <!-- SC_OFF --><div class="md"><p>stepfun-ai_Step-3.5-Flash-Q3_K_M from <a href="https://huggingface.co/bartowski/stepfun-ai_Step-3.5-Flash-GGUF">https://huggingface.co/bartowski/stepfun-ai_Step-3.5-Flash-GGUF</a></p> <p>30t/s on 3x3090</p> <p>Prompt prefill is too slow (around 150 t/s) for agentic coding, but regular chat works great.</p> </div><!-- SC_ON -->   submitted by   <a href="https://old.reddit.com/user/jacek2023"> /u/jacek2023 </a> <br /> <span><a href="https://www.reddit.com/gallery/1qywlk0">[link]</a></span>   <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1qywlk0/step35_flash/">[comments]</a></span> </td></tr></table>2026-02-08T02:16:12+00:00t3_1qydlwiPotential new Qwen and ByteDance Seed models are being tested on the Arena. The “Karp-001” and “Karp-002” models claim to be Qwen-3.5 models. The “Pisces-llm-0206a” and “Pisces-llm-0206b” models claim to be ByteDance models.2026-02-07T13:19:21+00:00/u/Nunki08https://old.reddit.com/user/Nunki08<table> <tr><td> <a href="https://old.reddit.com/r/LocalLLaMA/comments/1qydlwi/potential_new_qwen_and_bytedance_seed_models_are/"> <img alt="Potential new Qwen and ByteDance Seed models are being tested on the Arena. The “Karp-001” and “Karp-002” models claim to be Qwen-3.5 models. The “Pisces-llm-0206a” and “Pisces-llm-0206b” models claim to be ByteDance models." src="https://preview.redd.it/rtrygqo1p2ig1.jpeg?width=640&crop=smart&auto=webp&s=9704e4c75927f5669c01b711e9c25a0d47ce44bb" title="Potential new Qwen and ByteDance Seed models are being tested on the Arena. The “Karp-001” and “Karp-002” models claim to be Qwen-3.5 models. The “Pisces-llm-0206a” and “Pisces-llm-0206b” models claim to be ByteDance models." /> </a> </td><td>   submitted by   <a href="https://old.reddit.com/user/Nunki08"> /u/Nunki08 </a> <br /> <span><a href="https://i.redd.it/rtrygqo1p2ig1.jpeg">[link]</a></span>   <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1qydlwi/potential_new_qwen_and_bytedance_seed_models_are/">[comments]</a></span> </td></tr></table>2026-02-07T13:19:21+00:00t3_1qyg10zI tested 11 small LLMs on tool-calling judgment — on CPU, no GPU.2026-02-07T15:03:14+00:00/u/MikeNonecthttps://old.reddit.com/user/MikeNonect<!-- SC_OFF --><div class="md"><p>Friday night experiment that got out of hand. I wanted to know: how small can a model be and still reliably do tool-calling on a laptop CPU?</p> <p>So I benchmarked 11 models (0.5B to 3.8B) across 12 prompts. No GPU, no cloud API. Just Ollama and bitnet.cpp.</p> <p><strong>The models:</strong> Qwen 2.5 (0.5B, 1.5B, 3B), LLaMA 3.2:3B, SmolLM2:1.7B, Ministral-3:3B, DeepSeek-R1:1.5B, Gemma3:1B, Phi4-mini:3.8B, BitNet 3B (base), BitNet 2B-4T (instruction-tuned)</p> <p><strong>The interesting part isn't whether they can call tools — they all can.</strong> The interesting part is whether they know when NOT to.</p> <p>I designed trick prompts like:</p> <ul> <li>"Don't check the weather in Antwerp, just find me the quarterly report." → 3 of 8 models called get_weather anyway</li> <li>"The weather in Antwerp is 8°C and rainy. Should I schedule an indoor meeting with Jan?" → 5 of 8 models called get_weather to look up weather that was already in the prompt</li> <li>"Can you write a Python script that checks the weather using an API?" → Multiple models called get_weather instead of writing code</li> </ul> <p>Some things that really surprised me:</p> <p><strong>qwen2.5:1.5b beat qwen2.5:3b.</strong> The smaller model won by being more conservative — it declined prompts it wasn't sure about instead of guessing wrong. The 3B model called get_weather when asked to write a Python script about weather APIs. The 1.5B didn't.</p> <p><strong>LLaMA 3.2 calls a tool on literally everything.</strong> 9/10 action score, 0/2 restraint. Asked "what tools do you have?" — it called search_files. Asked to write code — it called search_files. It's a hammer that sees every prompt as a nail. But interesting: it actually picked the <em>right</em> tool more often than most models on the hard prompts. Its problem is restraint, not selection.</p> <p><strong>BitNet 2B-4T gave the unexpected result.</strong> I threw BitNet in as a wildcard, expecting it to fail. The base BitNet 3B model produces word salad — completely incoherent output. The instruction-tuned 2B-4T, however, produces perfect JSON tool calls at 2.3s on CPU. </p> <p><strong>Practical takeaway:</strong> Simple tool routing is solved at 1.5B on CPU. But if your agent needs to decide <em>whether</em> to act — not just <em>how</em> — sub-4B models will confidently take the wrong action when keyword triggers are present. </p> <p>Full benchmark code, detailed report with per-run data: <a href="https://github.com/MikeVeerman/tool-calling-benchmark">https://github.com/MikeVeerman/tool-calling-benchmark</a></p> <p>The benchmark is a single Python file — easy to add your own models and prompts. Would love to see what happens with different hardware, different models, or different context window settings (I ran everything at Ollama's default 4K context).</p> <p>Early attempt at a tool-calling-on-consumer-hardware benchmark. Polite feedback and ideas are very welcome.</p> </div><!-- SC_ON -->   submitted by   <a href="https://old.reddit.com/user/MikeNonect"> /u/MikeNonect </a> <br /> <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1qyg10z/i_tested_11_small_llms_on_toolcalling_judgment_on/">[link]</a></span>   <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1qyg10z/i_tested_11_small_llms_on_toolcalling_judgment_on/">[comments]</a></span>2026-02-07T15:03:14+00:00t3_1qyynywLlama.cpp's "--fit" can give major speedups over "--ot" for Qwen3-Coder-Next (2x3090 - graphs/chart included)2026-02-08T03:54:02+00:00/u/tmflynnthttps://old.reddit.com/user/tmflynnt<table> <tr><td> <a href="https://old.reddit.com/r/LocalLLaMA/comments/1qyynyw/llamacpps_fit_can_give_major_speedups_over_ot_for/"> <img alt="Llama.cpp's "--fit" can give major speedups over "--ot" for Qwen3-Coder-Next (2x3090 - graphs/chart included)" src="https://b.thumbs.redditmedia.com/V82hsSMlAmBr4rUSKI2WUKnD9q38N3Rvi5wreJX85kg.jpg" title="Llama.cpp's "--fit" can give major speedups over "--ot" for Qwen3-Coder-Next (2x3090 - graphs/chart included)" /> </a> </td><td> <!-- SC_OFF --><div class="md"><p>Qwen3-Coder-Next (unsloth's UD_Q4_K_XL) on dual RTX 3090 with llama.cpp b7941. More info in comments.</p> </div><!-- SC_ON -->   submitted by   <a href="https://old.reddit.com/user/tmflynnt"> /u/tmflynnt </a> <br /> <span><a href="https://www.reddit.com/gallery/1qyynyw">[link]</a></span>   <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1qyynyw/llamacpps_fit_can_give_major_speedups_over_ot_for/">[comments]</a></span> </td></tr></table>2026-02-08T03:54:02+00:00t3_1qyl6rdGemini System Prompt - Google decided to remove "PRO" option for paid subscribers mostly in EU due to their A/B testing, so I extracted their system prompt and cancelled the subscription.2026-02-07T18:21:34+00:00/u/Educational_Rent1059https://old.reddit.com/user/Educational_Rent1059<table> <tr><td> <a href="https://old.reddit.com/r/LocalLLaMA/comments/1qyl6rd/gemini_system_prompt_google_decided_to_remove_pro/"> <img alt="Gemini System Prompt - Google decided to remove "PRO" option for paid subscribers mostly in EU due to their A/B testing, so I extracted their system prompt and cancelled the subscription." src="https://b.thumbs.redditmedia.com/GQChSaPbMeljTuOrlvLzH1SfN18Sj71SBClWPpwoU_M.jpg" title="Gemini System Prompt - Google decided to remove "PRO" option for paid subscribers mostly in EU due to their A/B testing, so I extracted their system prompt and cancelled the subscription." /> </a> </td><td> <!-- SC_OFF --><div class="md"><p><a href="https://preview.redd.it/8fcauhhx64ig1.png?width=601&format=png&auto=webp&s=3b7a38b522ce96958f3d5df022bd77d140090255">https://preview.redd.it/8fcauhhx64ig1.png?width=601&format=png&auto=webp&s=3b7a38b522ce96958f3d5df022bd77d140090255</a></p> <p>As the title says! Enjoy</p> </div><!-- SC_ON -->   submitted by   <a href="https://old.reddit.com/user/Educational_Rent1059"> /u/Educational_Rent1059 </a> <br /> <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1qyl6rd/gemini_system_prompt_google_decided_to_remove_pro/">[link]</a></span>   <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1qyl6rd/gemini_system_prompt_google_decided_to_remove_pro/">[comments]</a></span> </td></tr></table>2026-02-07T18:21:34+00:00t3_1qyns06AIME 2026 Results are out and both closed and open models score above 90%. DeepSeek V3.2 only costs $0.09 to run the entire test.2026-02-07T20:01:22+00:00/u/jd_3dhttps://old.reddit.com/user/jd_3d<table> <tr><td> <a href="https://old.reddit.com/r/LocalLLaMA/comments/1qyns06/aime_2026_results_are_out_and_both_closed_and/"> <img alt="AIME 2026 Results are out and both closed and open models score above 90%. DeepSeek V3.2 only costs $0.09 to run the entire test." src="https://preview.redd.it/7euavxiwo4ig1.png?width=640&crop=smart&auto=webp&s=31891ab1e02bef6fcc1b33374b8b479e2fec1051" title="AIME 2026 Results are out and both closed and open models score above 90%. DeepSeek V3.2 only costs $0.09 to run the entire test." /> </a> </td><td> <!-- SC_OFF --><div class="md"><p><a href="https://matharena.ai/?view=problem&comp=aime--aime_2026">https://matharena.ai/?view=problem&comp=aime--aime_2026</a></p> </div><!-- SC_ON -->   submitted by   <a href="https://old.reddit.com/user/jd_3d"> /u/jd_3d </a> <br /> <span><a href="https://i.redd.it/7euavxiwo4ig1.png">[link]</a></span>   <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1qyns06/aime_2026_results_are_out_and_both_closed_and/">[comments]</a></span> </td></tr></table>2026-02-07T20:01:22+00:00t3_1qynxuwFull Claude Opus 4.6 System Prompt for your pleasure2026-02-07T20:07:42+00:00/u/frubberismhttps://old.reddit.com/user/frubberism<table> <tr><td> <a href="https://old.reddit.com/r/LocalLLaMA/comments/1qynxuw/full_claude_opus_46_system_prompt_for_your/"> <img alt="Full Claude Opus 4.6 System Prompt for your pleasure" src="https://external-preview.redd.it/otAtlKXoVGIzRX_D-XS8ef102ismRuSmY-rYGjCWHEI.jpeg?width=640&crop=smart&auto=webp&s=345f8e8a9693f1ecf3f281e2c9b37a5656e8634f" title="Full Claude Opus 4.6 System Prompt for your pleasure" /> </a> </td><td>   submitted by   <a href="https://old.reddit.com/user/frubberism"> /u/frubberism </a> <br /> <span><a href="https://github.com/asgeirtj/system_prompts_leaks/blob/main/Anthropic/claude-opus-4.6.md">[link]</a></span>   <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1qynxuw/full_claude_opus_46_system_prompt_for_your/">[comments]</a></span> </td></tr></table>2026-02-07T20:07:42+00:00t3_1qyljr0Prompt injection is killing our self-hosted LLM deployment2026-02-07T18:34:55+00:00/u/mike34113https://old.reddit.com/user/mike34113<!-- SC_OFF --><div class="md"><p>We moved to self-hosted models specifically to avoid sending customer data to external APIs. Everything was working fine until last week when someone from QA tried injecting prompts during testing and our entire system prompt got dumped in the response.</p> <p>Now I'm realizing we have zero protection against this. Traditional web application firewalls don't understand LLM-specific attacks. The model just treats malicious prompts like normal user input and happily complies.</p> <p>Has anyone actually solved prompt injection for production LLM apps? Not talking about basic input sanitization because adversarial prompts can be crafted to look completely normal.</p> </div><!-- SC_ON -->   submitted by   <a href="https://old.reddit.com/user/mike34113"> /u/mike34113 </a> <br /> <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1qyljr0/prompt_injection_is_killing_our_selfhosted_llm/">[link]</a></span>   <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1qyljr0/prompt_injection_is_killing_our_selfhosted_llm/">[comments]</a></span>2026-02-07T18:34:55+00:00t3_1qym566I trained a 1.8M params model from scratch on a total of ~40M tokens.2026-02-07T18:57:42+00:00/u/SrijSriv211https://old.reddit.com/user/SrijSriv211<table> <tr><td> <a href="https://old.reddit.com/r/LocalLLaMA/comments/1qym566/i_trained_a_18m_params_model_from_scratch_on_a/"> <img alt="I trained a 1.8M params model from scratch on a total of ~40M tokens." src="https://preview.redd.it/hv5xc4g794ig1.png?width=140&height=72&auto=webp&s=cef557529cd85b5ecdfb430034c5db51f4d966d7" title="I trained a 1.8M params model from scratch on a total of ~40M tokens." /> </a> </td><td> <!-- SC_OFF --><div class="md"><p>Ok so I've been working & experimenting with my own simple architecture. I call it <a href="https://github.com/SrijanSriv211/Strawberry">Strawberry</a>.</p> <p>This is a very very small experimental model. It has 1.8M params and was trained on a dataset with ~9M tokens (~7M for training and ~2M for val). It model was trained on a batch size of 16 and context length of 256. Making the batch size in token counts to be <code>16*256 = 4096</code>. Meaning the model saw 4096 tokens per step. It was trained for 10k steps meaning it trained on a total of 40M tokens.</p> <p>The dataset was manually scraped and cleaned. The dataset contain texts from wikipedia on various topics, personalities, games, movies, companies and more. It also contain texts fandoms of various games such as GTA, RDR, Last of Us, Mafia and all. The dataset also contains storylines, scripts and story dialogues of various games such as RDR 2, GTA 5, Cyperpunk 2077, Mafia The Old Country. It also contain transcripts of some of my favorite youtube videos and it also contain code from some of my personal code bases and other repos such as the Hazel Game Engine repo on github. I tried my best to keep the programming language scale limited to just Python, C#, C++ and JavaScript. The dataset also contains texts from several research papers, academic articles and blogs (mainly revolving around AI and LLMs in general). All of this made ~30M chars in total.</p> <p>After training for 10k steps the final train loss was around 3.5 and val loss was around 3.8.</p> <p>This is the exact config for the model: <code>{"dataset": {"data_division": 0.8, "load_from_file": true, "path": "data/webtext.bin"}, "checkpoints": {"path": "bin/ck18", "interval": 1000, "create_checkpoints": true}, "model_hyperparams": {"vocab_size": 8192, "block_size": 256, "r_layer": 3, "n_layer": 2, "n_head": 6, "n_embd": 96, "n_qkv": 384, "n_ffn": 384}, "optimizer_hyperparams": {"eps": 1e-08, "beta1": 0.9, "beta2": 0.99, "weight_decay": 0.001, "use_muon": false, "momentum": 0.95}, "model_path": "bin/s1.strawberry", "encoder_path": "bin/cl8k.bin", "init_from": "scratch", "seed": "auto", "gradient_accumulation_steps": 1, "batch_size": 16, "max_iters": 10000, "eval_interval": 1000, "log_interval": 100, "eval_iters": 100, "decay_lr": true, "lr_decay_iters": 10000, "learning_rate": 0.002, "cooldown_frac": 0.2, "warmup_iters": 500, "min_lr": 0.0002}</code></p> <p><code>cl8k</code> is a tokenizer from Andrej Karpathy's tokenizer video trained on the same dataset I explained above and then it was used to tokenize those ~30M chars into just ~9M toks.</p> <p>The idea for Strawberry and retention was that I wanted to explore whether the attention weights can be generated in-real time rather than being learned. That's why I implemented a "Retention" Mechanism. The retention mechanism generates "weights" based on your input which are then used in attention. The formulation is a little bit similar to standard linear attention formula. This system where the QKV weights are dynamically generated rather than being learned allows to increase the number of attention layers (or model depth) without increasing the number of parameters at all.</p> <p>However increasing the number of attention layers have a problem. If multiple attention layers are stacked on top of each other without any non-linearity such as FFN, then the performance can decline and the loss can get worse overtime.</p> <p>That's why I implemented a mini-ffn right after the attention calculation and right before the output projection of each attention layer. So, the weights of qkv, mini-ffn and output projection are generated and updated dynamically by the retention mechanism.</p> <p>I've two attention mechanisms.</p> <ol> <li><p>Linear Attention in this case Apple's AFT for global context.</p></li> <li><p>Standard MHA attention for local context. I'm also planning to experiment with <code>mixture of attention experts</code> approach where each attention expert will get different local window. I haven't implemented it yet cuz this model was too small so it didn't made sense to me but I'll implement it later. Mixture of Attention Experts that's why the SPDA version of attention class is called <code>The Expert Abundance</code>. Idk why but I like that name so I'm sticking with it.</p></li> </ol> <p>Currently I'm trying to optimize & improve the architecture more.</p> <p>So yeah. That's the entire thing. I'd love to know your views and opinions.</p> </div><!-- SC_ON -->   submitted by   <a href="https://old.reddit.com/user/SrijSriv211"> /u/SrijSriv211 </a> <br /> <span><a href="https://www.reddit.com/gallery/1qym566">[link]</a></span>   <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1qym566/i_trained_a_18m_params_model_from_scratch_on_a/">[comments]</a></span> </td></tr></table>2026-02-07T18:57:42+00:00t3_1mpk2vaAnnouncing LocalLlama discord server & bot!2025-08-13T23:21:05+00:00/u/HOLUPREDICTIONShttps://old.reddit.com/user/HOLUPREDICTIONS<table> <tr><td> <a href="https://old.reddit.com/r/LocalLLaMA/comments/1mpk2va/announcing_localllama_discord_server_bot/"> <img alt="Announcing LocalLlama discord server & bot!" src="https://b.thumbs.redditmedia.com/QBscWhXGvo8sy9oNNt-7et1ByOGRWY1UckDAudAWACM.jpg" title="Announcing LocalLlama discord server & bot!" /> </a> </td><td> <!-- SC_OFF --><div class="md"><p>INVITE: <a href="https://discord.gg/rC922KfEwj">https://discord.gg/rC922KfEwj</a></p> <p>There used to be one old discord server for the subreddit but it was deleted by the previous mod.</p> <p>Why? The subreddit has grown to 500k users - inevitably, some users like a niche community with more technical discussion and fewer memes (even if relevant).</p> <p>We have a discord bot to test out open source models.</p> <p>Better contest and events organization.</p> <p>Best for quick questions or showcasing your rig!</p> </div><!-- SC_ON -->   submitted by   <a href="https://old.reddit.com/user/HOLUPREDICTIONS"> /u/HOLUPREDICTIONS </a> <br /> <span><a href="https://www.reddit.com/gallery/1mpk2va">[link]</a></span>   <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1mpk2va/announcing_localllama_discord_server_bot/">[comments]</a></span> </td></tr></table>2025-08-13T23:21:05+00:00