/r/LocalLLaMA/.rssLocalLlama2026-02-10T04:38:14+00:00python-feedgenhttps://www.redditstatic.com/icon.png/Subreddit to discuss AI & Llama, the large language model created by Meta AI.t3_1r0ixumBest way to initialize AGENTS.md2026-02-09T22:34:18+00:00/u/ThatSQLguyhttps://old.reddit.com/user/ThatSQLguy<!-- SC_OFF --><div class="md"><p>AI coding tools work a lot better when they understand a repo’s stack, commands, and conventions. </p> <p><code>npx agentseed init</code></p> <p>This reads your codebase and generates <a href="http://AGENTS.md">AGENTS.md</a> automatically using static analysis (free). You can optionally add LLM summaries (free with Llama again) for richer context.</p> <p>Open source (MIT): <a href="https://github.com/avinshe/agentseed">https://github.com/avinshe/agentseed</a></p> </div><!-- SC_ON -->   submitted by   <a href="https://old.reddit.com/user/ThatSQLguy"> /u/ThatSQLguy </a> <br /> <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1r0ixum/best_way_to_initialize_agentsmd/">[link]</a></span>   <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1r0ixum/best_way_to_initialize_agentsmd/">[comments]</a></span>2026-02-09T22:34:18+00:00t3_1r0fdcnSDF Protocol — fine-tuned 1.5B + 3B models that convert web pages into structured JSON for AI agents (open weights on HuggingFace)2026-02-09T20:22:22+00:00/u/PlayfulLingonberry73https://old.reddit.com/user/PlayfulLingonberry73<!-- SC_OFF --><div class="md"><p>I've been working on an open protocol for pre-extracting structured data from web pages so AI agents don't have to re-parse HTML every time.</p> <p>The pipeline uses two small fine-tuned models running locally via Ollama:</p> <ul> <li><strong>sdf-classify</strong> (Qwen2.5-1.5B-Instruct, QLoRA): classifies content into 10 parent types / 50+ subtypes</li> <li><strong>sdf-extract</strong> (SmolLM3-3B, QLoRA): extracts entities, claims, relationships, summaries, and type-specific fields into schema-validated JSON</li> </ul> <p>Combined footprint is 2.8 GB (Q4_K_M). Runs on CPU too — just slower.</p> <p><strong>Results on 2,335 documents:</strong></p> <ul> <li>90% extraction accuracy (exact match)</li> <li>4.1x faster than monolithic 14B baseline</li> <li>99.2% token reduction from HTML (~73K tokens → ~750)</li> <li>Works on CPU, tested on dual 3090 Ti for the paper</li> </ul> <p><strong>Downstream test:</strong> gave a vanilla 7B model questions about 30 documents — scored 0.739 accuracy from SDF vs 0.352 from raw markdown. 
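</p> <p>For readers who want to picture how a two-stage pipeline like this is driven, here is a minimal Python sketch against a local Ollama server. The model names are the ones listed above (assuming Ollama models have been created with those names from the GGUFs); the prompts, file name, and validation step are assumptions, not the author's actual code:</p> <pre><code># Minimal sketch: classify a page with the 1.5B model, then extract JSON with the 3B model.
# Assumes Ollama is running locally with models named "sdf-classify" and "sdf-extract".
import json
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama generate endpoint

def run(model, prompt):
    r = requests.post(OLLAMA_URL, json={"model": model, "prompt": prompt, "stream": False}, timeout=300)
    r.raise_for_status()
    return r.json()["response"]

page_text = open("page.txt").read()  # pre-cleaned page text (placeholder file name)

# Stage 1: classify the content into one of the SDF parent types / subtypes
doc_type = run("sdf-classify", "Classify this page into an SDF type:\n\n" + page_text).strip()

# Stage 2: extract entities, claims, relationships, etc. as JSON for that type
extraction = run("sdf-extract", "Type: " + doc_type + "\nExtract SDF JSON for this page:\n\n" + page_text)

record = json.loads(extraction)  # a real pipeline would validate this against the SDF schemas
print(doc_type, sorted(record.keys()))</code></pre>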
<p>A 3B model showed a similar downstream gain (0.606 vs 0.333).</p> <p>Models (GGUF Q4_K_M + f16): <a href="https://huggingface.co/sdfprotocol">https://huggingface.co/sdfprotocol</a></p> <p>Protocol spec + schemas: <a href="https://github.com/sdfprotocol/sdf">https://github.com/sdfprotocol/sdf</a></p> <p>Whitepaper: <a href="https://doi.org/10.5281/zenodo.18559223">https://doi.org/10.5281/zenodo.18559223</a></p> <p>Training was QLoRA rank 32, alpha 64, dropout 0.05.</p> </div><!-- SC_ON -->   submitted by   <a href="https://old.reddit.com/user/PlayfulLingonberry73"> /u/PlayfulLingonberry73 </a> <br /> <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1r0fdcn/sdf_protocol_finetuned_15b_3b_models_that_convert/">[link]</a></span>   <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1r0fdcn/sdf_protocol_finetuned_15b_3b_models_that_convert/">[comments]</a></span>2026-02-09T20:22:22+00:00t3_1r015z4I managed to jailbreak 43 of 52 recent models2026-02-09T10:52:45+00:00/u/sirjoacohttps://old.reddit.com/user/sirjoaco<table> <tr><td> <a href="https://old.reddit.com/r/LocalLLaMA/comments/1r015z4/i_managed_to_jailbreak_43_of_52_recent_models/"> <img alt="I managed to jailbreak 43 of 52 recent models" src="https://external-preview.redd.it/YTA3NHl0dmhyZGlnMUNU3vkEOynofhKg3zLh75rLSPZOaY5MGdNqMt8faW6e.png?width=640&crop=smart&auto=webp&s=326b5aa6d703c3059a3c89ab8668c2775aa7efc8" title="I managed to jailbreak 43 of 52 recent models" /> </a> </td><td> <!-- SC_OFF --><div class="md"><p>GPT-5 broke at level 2.</p> <p>Full report here: <a href="http://rival.tips/jailbreak">rival.tips/jailbreak</a>. I'll be adding more models to this benchmark soon.</p> </div><!-- SC_ON -->   submitted by   <a href="https://old.reddit.com/user/sirjoaco"> /u/sirjoaco </a> <br /> <span><a href="https://v.redd.it/xmbxf1vhrdig1">[link]</a></span>   <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1r015z4/i_managed_to_jailbreak_43_of_52_recent_models/">[comments]</a></span> </td></tr></table>2026-02-09T10:52:45+00:00t3_1r0nw2aAre there any local LLMs that outperform commercial or cloud-based LLMs in certain areas or functions?2026-02-10T02:02:25+00:00/u/FX2021https://old.reddit.com/user/FX2021<!-- SC_OFF --><div class="md"><p>I'm curious if anybody has seen local LLMs outperform commercial or cloud-based LLMs in certain areas or functions. If so, which model, and how did it outperform?
</p> <p>Is there hope in the future that local LLMs could develop an edge over commercial or cloud based LLMs?</p> </div><!-- SC_ON -->   submitted by   <a href="https://old.reddit.com/user/FX2021"> /u/FX2021 </a> <br /> <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1r0nw2a/is_there_any_local_llms_that_out_perform/">[link]</a></span>   <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1r0nw2a/is_there_any_local_llms_that_out_perform/">[comments]</a></span>2026-02-10T02:02:25+00:00t3_1r082v1Qwen3-Coder-Next performance on MLX vs llamacpp2026-02-09T16:02:19+00:00/u/TrajansRowhttps://old.reddit.com/user/TrajansRow<table> <tr><td> <a href="https://old.reddit.com/r/LocalLLaMA/comments/1r082v1/qwen3codernext_performance_on_mlx_vs_llamacpp/"> <img alt="Qwen3-Coder-Next performance on MLX vs llamacpp" src="https://preview.redd.it/vb5b4b8xrhig1.png?width=140&height=103&auto=webp&s=79f29530e1b8207e4ed11e6eb8c83b32ed8e08b4" title="Qwen3-Coder-Next performance on MLX vs llamacpp" /> </a> </td><td> <!-- SC_OFF --><div class="md"><p>Ivan Fioravanti just published an excellent breakdown of performance differences between MLX-LM and llama.cpp running on the Apple M3 Ultra. These are both great options for local inference, but it seems MLX has a significant edge for most workloads.</p> <p><a href="https://preview.redd.it/vb5b4b8xrhig1.png?width=2316&format=png&auto=webp&s=31aa4012319625eb4f437d590a7f2cec4f1ce810">https://preview.redd.it/vb5b4b8xrhig1.png?width=2316&format=png&auto=webp&s=31aa4012319625eb4f437d590a7f2cec4f1ce810</a></p> <p><a href="https://x.com/ivanfioravanti/status/2020876939917971867?s=20">https://x.com/ivanfioravanti/status/2020876939917971867?s=20</a></p> </div><!-- SC_ON -->   submitted by   <a href="https://old.reddit.com/user/TrajansRow"> /u/TrajansRow </a> <br /> <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1r082v1/qwen3codernext_performance_on_mlx_vs_llamacpp/">[link]</a></span>   <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1r082v1/qwen3codernext_performance_on_mlx_vs_llamacpp/">[comments]</a></span> </td></tr></table>2026-02-09T16:02:19+00:00t3_1r0519aStrix Halo, Step-3.5-Flash-Q4_K_S imatrix, llama.cpp/ROCm/Vulkan Power & Efficiency test2026-02-09T14:04:09+00:00/u/Educational_Sun_8813https://old.reddit.com/user/Educational_Sun_8813<table> <tr><td> <a href="https://old.reddit.com/r/LocalLLaMA/comments/1r0519a/strix_halo_step35flashq4_k_s_imatrix/"> <img alt="Strix Halo, Step-3.5-Flash-Q4_K_S imatrix, llama.cpp/ROCm/Vulkan Power & Efficiency test" src="https://preview.redd.it/lf6f8di34hig1.png?width=640&crop=smart&auto=webp&s=f94c8e2dab63321d47133ed77312fadc06ceb5e6" title="Strix Halo, Step-3.5-Flash-Q4_K_S imatrix, llama.cpp/ROCm/Vulkan Power & Efficiency test" /> </a> </td><td> <!-- SC_OFF --><div class="md"><p>Hi, i did recently some quants to test best fit for strix halo, and i settled with custom imatrix <code>Q4_K_S</code> quant, builded with <code>wikitext-103-raw-v1</code>. Model has sligtly better PPL than Q4_K_M without imatrix, but it's few GB smaller. I tested it with ROCm/Vulkan backend, and <code>llama.cpp build 7966 (8872ad212)</code>, so with Step-3.5-Flash support already merged to the main branch. 
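</p> <p>For anyone who wants to reproduce a quant like this, the general llama.cpp workflow looks roughly like the sketch below (driven from Python here only for illustration). The file names are placeholders, and the exact calibration and evaluation splits behind the numbers in the table are not spelled out in the post:</p> <pre><code># Rough sketch of an imatrix quant workflow with the standard llama.cpp tools.
# Assumes llama-imatrix, llama-quantize and llama-perplexity are on PATH and that
# the source GGUF / calibration text exist; the names below are placeholders.
import subprocess

SRC = "Step-3.5-Flash-f16.gguf"
CALIB = "wikitext-103-raw-v1.txt"

def sh(*args):
    print("+", " ".join(args))
    subprocess.run(args, check=True)

# 1. Collect the importance matrix over the calibration text
sh("llama-imatrix", "-m", SRC, "-f", CALIB, "-o", "imatrix.dat")

# 2. Quantize to Q4_K_S using that imatrix
sh("llama-quantize", "--imatrix", "imatrix.dat", SRC, "Step-3.5-Flash-Q4_K_S.gguf", "Q4_K_S")

# 3. Compare perplexity against a plain Q4_K_M build
for gguf in ("Step-3.5-Flash-Q4_K_S.gguf", "Step-3.5-Flash-Q4_K_M.gguf"):
    sh("llama-perplexity", "-m", gguf, "-f", "wiki.test.raw")</code></pre>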
<p>There are some issues with tool calling with that model (and a few others) at the moment, but it seems they are not related to the quants themselves.</p> <table><thead> <tr> <th>Quantization</th> <th>Size (Binary GiB)</th> <th>Size (Decimal GB)</th> <th>PPL (Perplexity)</th> </tr> </thead><tbody> <tr> <td><strong>Q4_K_S (imatrix) THIS VERSION</strong></td> <td><strong>104 GiB</strong></td> <td><strong>111 GB</strong></td> <td><strong>2.4130</strong></td> </tr> <tr> <td>Q4_K_M (standard)</td> <td>111 GiB</td> <td>119 GB</td> <td>2.4177</td> </tr> </tbody></table> <p>ROCm is more efficient: for a full benchmark run, <strong>ROCm was 4.7x faster</strong> and <strong>consumed 65% less energy</strong> than Vulkan.</p> <ul> <li>Prompt processing: ROCm dominates prompt ingestion speed, reaching over 350 t/s for short contexts and maintaining much higher throughput as context grows.</li> <li>Token generation: Vulkan shows slightly higher raw generation speeds (t/s) for small contexts, but at a significantly higher energy cost; it is not efficient with CTX >= 8k.</li> <li>Context scaling: the model remains usable and was tested up to 131k context, though energy costs scale exponentially on the Vulkan backend compared to a more linear progression on ROCm.</li> </ul> <p><a href="https://huggingface.co/mixer3d/step-3.5-flash-imatrix-gguf">Link to this quant on HF</a></p> <p>The outcome of this ROCm/Vulkan comparison is similar to the one I performed a few months ago with Qwen3-Coder, so from now on I will test only ROCm for bigger contexts and will probably use Vulkan only as a failover on Strix Halo. <a href="https://www.reddit.com/r/LocalLLaMA/comments/1p48d7f/strix_halo_debian_13616126178_qwen3coderq8/">Link on r/LocalLLaMA to the older Qwen3-Coder benchmark</a></p> <p>Cheers</p> </div><!-- SC_ON -->   submitted by   <a href="https://old.reddit.com/user/Educational_Sun_8813"> /u/Educational_Sun_8813 </a> <br /> <span><a href="https://i.redd.it/lf6f8di34hig1.png">[link]</a></span>   <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1r0519a/strix_halo_step35flashq4_k_s_imatrix/">[comments]</a></span> </td></tr></table>2026-02-09T14:04:09+00:00t3_1r0qur4Deepseek architecture, but without all the parameters2026-02-10T04:16:13+00:00/u/silenceimpairedhttps://old.reddit.com/user/silenceimpaired<!-- SC_OFF --><div class="md"><p>I’m seeing a pattern that perhaps isn’t real, but it seems everyone is copying the latest Deepseek architecture in their latest releases.
In the process though they are also copying the parameter count (roughly), which makes the models inaccessible to most (unless you use their API or spent as much as you would to buy a used car).</p> <p>So my question is, are there smaller models using the same tech but with less parameters?</p> </div><!-- SC_ON -->   submitted by   <a href="https://old.reddit.com/user/silenceimpaired"> /u/silenceimpaired </a> <br /> <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1r0qur4/deepseek_architecture_but_without_all_the/">[link]</a></span>   <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1r0qur4/deepseek_architecture_but_without_all_the/">[comments]</a></span>2026-02-10T04:16:13+00:00t3_1r0904zACE-Step 1.5 prompt tips: how I get more controllable music output2026-02-09T16:36:29+00:00/u/Massive-Figure-9666https://old.reddit.com/user/Massive-Figure-9666<!-- SC_OFF --><div class="md"><p>I’ve been experimenting with <strong>ACE-Step 1.5</strong> lately and wanted to share a short summary of what actually helped me get more controllable and musical results, based on the official tutorial + hands-on testing.</p> <p>The biggest realization: <strong>ACE-Step works best when you treat prompts as [structured inputs], not a single sentence (same as other LLMs)</strong></p> <h1>1. Separate “Tags” from “Lyrics”</h1> <p>Instead of writing one long prompt, think in two layers:</p> <p><strong>Tags</strong> = global control</p> <p>Use comma-separated keywords to define:</p> <ul> <li>genre / vibe (<code>funk, pop, disco</code>)</li> <li>tempo (<code>112 bpm</code>, <code>up-tempo</code>)</li> <li>instruments (<code>slap bass, drum machine</code>)</li> <li>vocal type (<code>male vocals, clean, rhythmic</code>)</li> <li>era / production feel (<code>80s style, punchy, dry mix</code>)</li> </ul> <p>Being specific here matters a lot more than being poetic.</p> <h1>2. Use structured lyrics</h1> <p>Lyrics aren’t just text — section labels help a ton:</p> <p><code>[intro]</code></p> <p><code>[verse]</code></p> <p><code>[chorus]</code></p> <p><code>[bridge]</code></p> <p><code>[outro]</code></p> <p>Even very simple lines work better when the structure is clear. It pushes the model toward “song form” instead of a continuous loop.</p> <h1>3. Think rhythm, not prose</h1> <p>Short phrases, repetition, and percussive wording generate more stable results than long sentences. Treat vocals like part of the groove.</p> <h1>4. Iterate with small changes</h1> <p>If something feels off:</p> <ul> <li>tweak tags first (tempo / mood / instruments)</li> <li>then adjust one lyric section</li> </ul> <p>No need to rewrite everything each run.</p> <h1>5. 
LoRA + prompt synergy</h1> <p>LoRAs help with style, but prompts still control:</p> <ul> <li>structure</li> <li>groove</li> <li>energy</li> </ul> <p>resource: <a href="https://github.com/ace-step/ACE-Step-1.5">https://github.com/ace-step/ACE-Step-1.5</a></p> </div><!-- SC_ON -->   submitted by   <a href="https://old.reddit.com/user/Massive-Figure-9666"> /u/Massive-Figure-9666 </a> <br /> <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1r0904z/acestep_15_prompt_tips_how_i_get_more/">[link]</a></span>   <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1r0904z/acestep_15_prompt_tips_how_i_get_more/">[comments]</a></span>2026-02-09T16:36:29+00:00t3_1r0b7p8Free Strix Halo performance!2026-02-09T17:54:41+00:00/u/Potential_Block4598https://old.reddit.com/user/Potential_Block4598<!-- SC_OFF --><div class="md"><p>TL;DR: not all quants are born the same. Some quants have bf16 tensors, which doesn’t seem to work well on AMD, so find quants without bf16 tensors and you can gain anywhere between 50% and 100% more performance on both tgs and pp.</p> <p>Edit: I did some more tests. Using -ctk bf16 -ctv bf16 degrades performance by around 10% for short contexts (with flash attention; I haven’t tried with fa off yet).</p> <p>With -fa off, most models are similar (bf16 or not); with -fa on, models without bf16 are faster (slightly, although it depends on how much of the model is actually in bf16!). So it depends on the model; it is obviously not a generic boost.</p> <p>Edit 2:</p> <p><code>ggml_vulkan: Found 1 Vulkan devices:</code></p> <p><code>ggml_vulkan: 0 = AMD Radeon Graphics (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat</code></p> <p>Strix Halo (gfx1151) doesn’t advertise bf16 in the Vulkan backend, which confirms that the kernel doesn’t support models with bf16 tensors in some of their layers!</p> <p>Long detailed version:</p> <p>I was playing around with different models on my new Strix Halo PC. I have multiple quantized versions of Qwen3-Coder-Next (I absolutely love this model): two from Unsloth, two from LM Studio, and one from Qwen’s own Hugging Face GGUF page.</p> <p>When loading them I noticed bf16 in some tensors, and I know that KV quantization to bf16 isn’t good on the Halo (in fact, it doesn’t seem good at all). So I checked them: the Unsloth versions have bf16 in them, and so do the LM Studio versions. But weirdly enough, Qwen’s own GGUF quants have no bf16; I fired them up and voila, they are much, much faster.</p> <p>It seemed like a superpower, and it also isn’t well handled in the community. I love bf16, but it doesn’t work well at all on AMD (I don’t know why it is being converted to F32 for emulation; that is a waste of everything, especially if you convert it every time. Weird fallback behavior, anyways.)</p> <p>I wish I could have known this before downloading a whole quant. I have most of my GGUFs from LM Studio and Unsloth, and if I apply this check to every other model I might end up with much faster quants. That feels good, but I also feel bad that all of those hours were wasted before; anyways, I’m sharing this to spare others the same kind of waste.</p> <p>How to know if a quant has bf16: load it with llama.cpp and it will print the tensor breakdown even before loading finishes; scroll up and you will see how many Q4, Q8, F32, F16, and BF16 tensors it has. A quicker way to check a downloaded GGUF without fully loading it is sketched below.</p>
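<p>This is a sketch (not from the original post) of reading the tensor table with the <code>gguf</code> Python package (pip install gguf) instead of loading the model; the file name is a placeholder:</p> <pre><code># Count tensor types in a GGUF file to see whether any BF16 tensors are present.
from collections import Counter
from gguf import GGUFReader

reader = GGUFReader("Qwen3-Coder-Next-Q4_K_M.gguf")  # placeholder path

counts = Counter(t.tensor_type.name for t in reader.tensors)
for dtype, n in counts.most_common():
    print(dtype, n)

if counts.get("BF16", 0):
    print("This quant contains BF16 tensors (may be slow on Strix Halo / Vulkan).")</code></pre>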
<p>Good luck out there!</p> <p>(I can’t wait to find a good REAP of Minimax M2.1 with Intel round that DOESN’T have bf16 in it! It seems like the best model I can get, and if I could double the current numbers it would be usable: 20-30 tgs and around 100 pp, give or take. A thinking model that also does parallel tool calling with interleaved thinking, what else could I ask for?!)</p> <p>So cheers!</p> </div><!-- SC_ON -->   submitted by   <a href="https://old.reddit.com/user/Potential_Block4598"> /u/Potential_Block4598 </a> <br /> <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1r0b7p8/free_strix_halo_performance/">[link]</a></span>   <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1r0b7p8/free_strix_halo_performance/">[comments]</a></span>2026-02-09T17:54:41+00:00t3_1r03nyqNew PR for GLM 5. Shows more details for the architecture and parameters2026-02-09T13:03:45+00:00/u/External_Mood4719https://old.reddit.com/user/External_Mood4719<table> <tr><td> <a href="https://old.reddit.com/r/LocalLLaMA/comments/1r03nyq/new_pr_for_glm_5show_more_details_for_the/"> <img alt="New PR for GLM 5. Shows more details for the architecture and parameters" src="https://preview.redd.it/xbntmqm9wgig1.jpg?width=140&height=88&auto=webp&s=57f90442cdd4687102ce6eb308c88cf7ef31ebf1" title="New PR for GLM 5. Shows more details for the architecture and parameters" /> </a> </td><td> <!-- SC_OFF --><div class="md"><p><a href="https://github.com/huggingface/transformers/pull/43858">https://github.com/huggingface/transformers/pull/43858</a></p> <p><a href="https://preview.redd.it/xbntmqm9wgig1.jpg?width=680&format=pjpg&auto=webp&s=da75a8dd1887ada367c9152cdeb13ad50fc6796c">https://preview.redd.it/xbntmqm9wgig1.jpg?width=680&format=pjpg&auto=webp&s=da75a8dd1887ada367c9152cdeb13ad50fc6796c</a></p> <p><a href="https://preview.redd.it/wng50ssdwgig1.png?width=1323&format=png&auto=webp&s=65b30b4b03dc5c4ce8c63d4729121b22c56382dc">https://preview.redd.it/wng50ssdwgig1.png?width=1323&format=png&auto=webp&s=65b30b4b03dc5c4ce8c63d4729121b22c56382dc</a></p> </div><!-- SC_ON -->   submitted by   <a href="https://old.reddit.com/user/External_Mood4719"> /u/External_Mood4719 </a> <br /> <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1r03nyq/new_pr_for_glm_5show_more_details_for_the/">[link]</a></span>   <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1r03nyq/new_pr_for_glm_5show_more_details_for_the/">[comments]</a></span> </td></tr></table>2026-02-09T13:03:45+00:00t3_1r0akbhLLaDA2.1-flash (103B) and LLaDA2.1-mini (16B)2026-02-09T17:31:19+00:00/u/jacek2023https://old.reddit.com/user/jacek2023<table> <tr><td> <a href="https://old.reddit.com/r/LocalLLaMA/comments/1r0akbh/llada21flash_103b_and_llada21mini_16b/"> <img alt="LLaDA2.1-flash (103B) and LLaDA2.1-mini (16B)" src="https://external-preview.redd.it/5n3mc4pDM1OpAZhRMkcJ8p4UVJvhRxbwGRsFzFjYGnk.png?width=140&height=75&auto=webp&s=6d973feb83cc911296f0f4f8c40875351ac703a7" title="LLaDA2.1-flash (103B) and LLaDA2.1-mini (16B)" /> </a> </td><td> <!-- SC_OFF --><div class="md"><p><strong>note: this is a diffusion model</strong></p> <p><strong>LLaDA2.1-flash</strong> is a diffusion language model of the LLaDA series featuring the editing enhancement.
It significantly improves inference speed while delivering strong task performance.</p> <p><a href="https://preview.redd.it/0zc0kqvw7iig1.png?width=1391&format=png&auto=webp&s=c9c347ed3fe4b69f50acf4af01e3d6f96ad616f8">https://preview.redd.it/0zc0kqvw7iig1.png?width=1391&format=png&auto=webp&s=c9c347ed3fe4b69f50acf4af01e3d6f96ad616f8</a></p> <p><a href="https://preview.redd.it/biz1dmry7iig1.png?width=1372&format=png&auto=webp&s=0f9e9af10dae02d44553059f9654c8bc0683cf39">https://preview.redd.it/biz1dmry7iig1.png?width=1372&format=png&auto=webp&s=0f9e9af10dae02d44553059f9654c8bc0683cf39</a></p> <p><a href="https://huggingface.co/inclusionAI/LLaDA2.1-flash">https://huggingface.co/inclusionAI/LLaDA2.1-flash</a></p> <p><a href="https://huggingface.co/inclusionAI/LLaDA2.1-mini">https://huggingface.co/inclusionAI/LLaDA2.1-mini</a></p> </div><!-- SC_ON -->   submitted by   <a href="https://old.reddit.com/user/jacek2023"> /u/jacek2023 </a> <br /> <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1r0akbh/llada21flash_103b_and_llada21mini_16b/">[link]</a></span>   <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1r0akbh/llada21flash_103b_and_llada21mini_16b/">[comments]</a></span> </td></tr></table>2026-02-09T17:31:19+00:00t3_1qzz0vrGLM 5 is coming! spotted on vllm PR2026-02-09T08:39:31+00:00/u/External_Mood4719https://old.reddit.com/user/External_Mood4719<table> <tr><td> <a href="https://old.reddit.com/r/LocalLLaMA/comments/1qzz0vr/glm_5_is_coming_spotted_on_vllm_pr/"> <img alt="GLM 5 is coming! spotted on vllm PR" src="https://preview.redd.it/285aias7lfig1.jpg?width=140&height=78&auto=webp&s=5a644c4fce313f2c4b8643b1d8a7931145a54db1" title="GLM 5 is coming! spotted on vllm PR" /> </a> </td><td> <!-- SC_OFF --><div class="md"><p><a href="https://preview.redd.it/285aias7lfig1.jpg?width=680&format=pjpg&auto=webp&s=5287959d193fad4f96c5c80ec8b7546a7dcbe023">https://preview.redd.it/285aias7lfig1.jpg?width=680&format=pjpg&auto=webp&s=5287959d193fad4f96c5c80ec8b7546a7dcbe023</a></p> <p><a href="https://github.com/vllm-project/vllm/pull/34124">https://github.com/vllm-project/vllm/pull/34124</a></p> </div><!-- SC_ON -->   submitted by   <a href="https://old.reddit.com/user/External_Mood4719"> /u/External_Mood4719 </a> <br /> <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1qzz0vr/glm_5_is_coming_spotted_on_vllm_pr/">[link]</a></span>   <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1qzz0vr/glm_5_is_coming_spotted_on_vllm_pr/">[comments]</a></span> </td></tr></table>2026-02-09T08:39:31+00:00t3_1r0la37Qwen3-v1-8b is Capable of Solving Captchas2026-02-10T00:08:16+00:00/u/TheyCallMeDozerhttps://old.reddit.com/user/TheyCallMeDozer<table> <tr><td> <a href="https://old.reddit.com/r/LocalLLaMA/comments/1r0la37/qwen3v18b_is_capable_of_solving_captchas/"> <img alt="Qwen3-v1-8b is Capable of Solving Captchas" src="https://preview.redd.it/prijluyk6kig1.png?width=140&height=91&auto=webp&s=fd129265161fe3820389b74a0d3773a7fe0b9d58" title="Qwen3-v1-8b is Capable of Solving Captchas" /> </a> </td><td> <!-- SC_OFF --><div class="md"><p>Qwen3-v1-8b is capable of solving captchas with semi-solid accracy... 
might need to write a simple python script that finds them on the page and uses the LLM to try to solve them and input the output.</p> <p>Not sure if anyone else tried this before, just thought could be a handy thing for people to know, accidentally found it when passing it a screenshot</p> <p><a href="https://preview.redd.it/prijluyk6kig1.png?width=1038&format=png&auto=webp&s=29f55976839c594bd72eae9c2d0e6e2b9ce9a0d5">https://preview.redd.it/prijluyk6kig1.png?width=1038&format=png&auto=webp&s=29f55976839c594bd72eae9c2d0e6e2b9ce9a0d5</a></p> </div><!-- SC_ON -->   submitted by   <a href="https://old.reddit.com/user/TheyCallMeDozer"> /u/TheyCallMeDozer </a> <br /> <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1r0la37/qwen3v18b_is_capable_of_solving_captchas/">[link]</a></span>   <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1r0la37/qwen3v18b_is_capable_of_solving_captchas/">[comments]</a></span> </td></tr></table>2026-02-10T00:08:16+00:00t3_1r02o7oGLM 5 Support Is On It's Way For Transformers2026-02-09T12:16:36+00:00/u/Few_Painter_5588https://old.reddit.com/user/Few_Painter_5588<table> <tr><td> <a href="https://old.reddit.com/r/LocalLLaMA/comments/1r02o7o/glm_5_support_is_on_its_way_for_transformers/"> <img alt="GLM 5 Support Is On It's Way For Transformers" src="https://external-preview.redd.it/_RA8pRu79eov51fP28AH3ibXc2RY_CG7SQQVryJy9WU.png?width=640&crop=smart&auto=webp&s=810b321415879975e3408c463a34398fefd38bf5" title="GLM 5 Support Is On It's Way For Transformers" /> </a> </td><td> <!-- SC_OFF --><div class="md"><p>This probably means the model launch is imminent, and all evidence points to Pony Alpha on OpenRouter being a stealth deployment of GLM 5</p> </div><!-- SC_ON -->   submitted by   <a href="https://old.reddit.com/user/Few_Painter_5588"> /u/Few_Painter_5588 </a> <br /> <span><a href="https://github.com/huggingface/transformers/pull/43858">[link]</a></span>   <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1r02o7o/glm_5_support_is_on_its_way_for_transformers/">[comments]</a></span> </td></tr></table>2026-02-09T12:16:36+00:00t3_1r0or7sFemtobot: A 10MB Rust Agent for Low-Resource Machines2026-02-10T02:40:21+00:00/u/yunfoehttps://old.reddit.com/user/yunfoe<table> <tr><td> <a href="https://old.reddit.com/r/LocalLLaMA/comments/1r0or7s/femtobot_a_10mb_rust_agent_for_lowresource/"> <img alt="Femtobot: A 10MB Rust Agent for Low-Resource Machines" src="https://external-preview.redd.it/cmw5ZTJ5bnd3a2lnMa2OwS6wmI-E0GDGdMuj7R4EL-J7nO8YwfKZKjv0DlnG.png?width=640&crop=smart&auto=webp&s=d58c63cf47b8d8004de9dcc138a3388beabe0a83" title="Femtobot: A 10MB Rust Agent for Low-Resource Machines" /> </a> </td><td> <!-- SC_OFF --><div class="md"><p>I wanted to run <a href="https://github.com/openclaw/openclaw">OpenClaw</a>-style workflows on very low-resource machines (older Raspberry Pis, cheap VPS instances), but most “lightweight” stacks still end up dragging in large runtimes and slow startup costs.</p> <p>After trying <a href="https://github.com/HKUDS/nanobot">nanobot</a> and seeing disk usage climb past ~350MB once Python, virtualenvs, and dependencies were installed, I rewrote the core ideas in Rust to see how small and fast it could be.</p> <p>The result is <a href="https://github.com/enzofrasca/femtobot">femtobot</a>: a single ~10MB binary that currently supports:</p> <ul> <li>Telegram polling</li> <li>Local memory (SQLite + vector storage)</li> <li>Tool execution (shell, filesystem, web) via <a 
href="https://github.com/0xPlaygrounds/rig">rig-core</a></li> </ul> <p>The implementation was done quickly with heavy AI assistance, so the code prioritizes simplicity and size over perfect Rust idioms. It works well on constrained hardware, but there are definitely rough edges.</p> <p>Sharing in case it’s useful or interesting to others experimenting with small, local, or low-power agent setups. You are also welcome to contribute.</p> <p>Repo: <a href="https://github.com/enzofrasca/femtobot">https://github.com/enzofrasca/femtobot</a></p> </div><!-- SC_ON -->   submitted by   <a href="https://old.reddit.com/user/yunfoe"> /u/yunfoe </a> <br /> <span><a href="https://v.redd.it/nbv8vsnwwkig1">[link]</a></span>   <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1r0or7s/femtobot_a_10mb_rust_agent_for_lowresource/">[comments]</a></span> </td></tr></table>2026-02-10T02:40:21+00:00t3_1r0bd4iNew "Stealth" Model - Aurora Alpha - (Free on OpenRouter)2026-02-09T18:00:10+00:00/u/-pawixhttps://old.reddit.com/user/-pawix<table> <tr><td> <a href="https://old.reddit.com/r/LocalLLaMA/comments/1r0bd4i/new_stealth_model_aurora_alpha_free_on_openrouter/"> <img alt="New "Stealth" Model - Aurora Alpha - (Free on OpenRouter)" src="https://preview.redd.it/9t7ajm04diig1.png?width=640&crop=smart&auto=webp&s=28bf73099a957820854270db4b7e2e87db1b2055" title="New "Stealth" Model - Aurora Alpha - (Free on OpenRouter)" /> </a> </td><td> <!-- SC_OFF --><div class="md"><p>New cloaked reasoning model dropped on OpenRouter for $0/M tokens</p> </div><!-- SC_ON -->   submitted by   <a href="https://old.reddit.com/user/-pawix"> /u/-pawix </a> <br /> <span><a href="https://i.redd.it/9t7ajm04diig1.png">[link]</a></span>   <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1r0bd4i/new_stealth_model_aurora_alpha_free_on_openrouter/">[comments]</a></span> </td></tr></table>2026-02-09T18:00:10+00:00t3_1r0ekq2Who is waiting for deepseek v4 ,GLM 5 and Qwen 3.5 and MiniMax 2.2?2026-02-09T19:54:09+00:00/u/power97992https://old.reddit.com/user/power97992<!-- SC_OFF --><div class="md"><p>The title? I hope they come out soon... I'm especially waiting for DS V4, it should be pretty good, hopefully it will be reasonably fast(probably slow though since it is gonna be bigger than v3.2) via OpenRouter. Well, glm 5 is out already technically on Open Router. 
</p> </div><!-- SC_ON -->   submitted by   <a href="https://old.reddit.com/user/power97992"> /u/power97992 </a> <br /> <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1r0ekq2/who_is_waiting_for_deepseek_v4_glm_5_and_qwen_35/">[link]</a></span>   <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1r0ekq2/who_is_waiting_for_deepseek_v4_glm_5_and_qwen_35/">[comments]</a></span>2026-02-09T19:54:09+00:00t3_1r0domcQwen to the rescue2026-02-09T19:22:00+00:00/u/jacek2023https://old.reddit.com/user/jacek2023<table> <tr><td> <a href="https://old.reddit.com/r/LocalLLaMA/comments/1r0domc/qwen_to_the_rescue/"> <img alt="Qwen to the rescue" src="https://external-preview.redd.it/WEJxFtDPKCN6TKUmgiGRQqR9H_BOQlE9OiaOmXHqz_8.png?width=640&crop=smart&auto=webp&s=1b3946db287286eed978b63c0503ea93c3e10526" title="Qwen to the rescue" /> </a> </td><td> <!-- SC_OFF --><div class="md"><p>...does this mean that we are close?</p> </div><!-- SC_ON -->   submitted by   <a href="https://old.reddit.com/user/jacek2023"> /u/jacek2023 </a> <br /> <span><a href="https://github.com/ggml-org/llama.cpp/pull/19468">[link]</a></span>   <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1r0domc/qwen_to_the_rescue/">[comments]</a></span> </td></tr></table>2026-02-09T19:22:00+00:00t3_1r0khh8Step-3.5-Flash IS A BEAST2026-02-09T23:35:09+00:00/u/SennVacanhttps://old.reddit.com/user/SennVacan<!-- SC_OFF --><div class="md"><p>I was browsing around for models to run for my OpenClaw instance, and this thing is such a good model for its size. By comparison, gpt-oss-120b hung at every step, while this model does everything without me having to spell out the technical details. It's also free on OpenRouter for now, so I have been using it from there; it legitimately rivals Deepseek V3.2 at a third of the size.
I hope its api is cheap upon release </p> <p><a href="https://huggingface.co/stepfun-ai/Step-3.5-Flash">https://huggingface.co/stepfun-ai/Step-3.5-Flash</a></p> </div><!-- SC_ON -->   submitted by   <a href="https://old.reddit.com/user/SennVacan"> /u/SennVacan </a> <br /> <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1r0khh8/step35flash_is_a_beast/">[link]</a></span>   <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1r0khh8/step35flash_is_a_beast/">[comments]</a></span>2026-02-09T23:35:09+00:00t3_1r0gju0Kimi-Linear-48B-A3B-Instruct2026-02-09T21:05:29+00:00/u/jacek2023https://old.reddit.com/user/jacek2023<table> <tr><td> <a href="https://old.reddit.com/r/LocalLLaMA/comments/1r0gju0/kimilinear48ba3binstruct/"> <img alt="Kimi-Linear-48B-A3B-Instruct" src="https://a.thumbs.redditmedia.com/Bu8mu8gAAQcXdPhDs_xaj8m-19PQF2_a4_iwxZQsj70.jpg" title="Kimi-Linear-48B-A3B-Instruct" /> </a> </td><td> <!-- SC_OFF --><div class="md"><p>three days after the release we finally have a GGUF: <a href="https://huggingface.co/bartowski/moonshotai_Kimi-Linear-48B-A3B-Instruct-GGUF">https://huggingface.co/bartowski/moonshotai_Kimi-Linear-48B-A3B-Instruct-GGUF</a> - big thanks to Bartowski!</p> <p>long context looks more promising than GLM 4.7 Flash</p> </div><!-- SC_ON -->   submitted by   <a href="https://old.reddit.com/user/jacek2023"> /u/jacek2023 </a> <br /> <span><a href="https://www.reddit.com/gallery/1r0gju0">[link]</a></span>   <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1r0gju0/kimilinear48ba3binstruct/">[comments]</a></span> </td></tr></table>2026-02-09T21:05:29+00:00t3_1r0nd6mA fully local home automation voice assistant using Qwen3 ASR, LLM and TTS on an RTX 5060 Ti with 16GB VRAM2026-02-10T01:39:23+00:00/u/liampettihttps://old.reddit.com/user/liampetti<table> <tr><td> <a href="https://old.reddit.com/r/LocalLLaMA/comments/1r0nd6m/a_fully_local_home_automation_voice_assistant/"> <img alt="A fully local home automation voice assistant using Qwen3 ASR, LLM and TTS on an RTX 5060 Ti with 16GB VRAM" src="https://external-preview.redd.it/MGRhbXB0cmhta2lnMey19SmkPge57MTwSl95CCxzGWVZmEEqcz1nfiupw6bq.png?width=640&crop=smart&auto=webp&s=ce3c06c8aea8e1cd702c51582c7c4e11ddf54870" title="A fully local home automation voice assistant using Qwen3 ASR, LLM and TTS on an RTX 5060 Ti with 16GB VRAM" /> </a> </td><td> <!-- SC_OFF --><div class="md"><p>Video shows the latency and response times running everything Qwen3 (ASR&TTS 1.7B, Qwen3 4B Instruct 2507) with a Morgan Freeman voice clone on an RTX 5060 Ti with 16GB VRAM. In this example the SearXNG server is not running so it shows the model reverting to its own knowledge when unable to obtain web search information.</p> <p>I tested other smaller models for intent generation but response quality dropped dramatically on the LLM models under 4B. Kokoro (TTS) and Moonshine (ASR) are also included as options for smaller systems.</p> <p>The project comes with a bunch of tools it can use, such as Spotify, Philips Hue light control, AirTouch climate control and online weather retrieval (Australian project so uses the BOM). </p> <p>I have called the project "Fulloch". 
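</p> <p>For anyone curious how small the core loop can be, here is a rough Python sketch of the ASR → LLM → TTS round trip. The <code>transcribe()</code> and <code>speak()</code> helpers are hypothetical placeholders for whatever ASR/TTS front ends you run, and the endpoint and model name assume an OpenAI-compatible local server rather than the project's actual code (see the repo linked below for the real implementation):</p> <pre><code># Rough sketch only: wire speech-to-text, a local LLM, and text-to-speech together.
import requests

LLM_URL = "http://localhost:8080/v1/chat/completions"  # assumed OpenAI-compatible server
SYSTEM = "You are a home assistant. Answer briefly and call tools when needed."

def transcribe(wav_path):
    # placeholder: run your ASR model (e.g. Qwen3-ASR or Moonshine) here
    raise NotImplementedError

def speak(text):
    # placeholder: run your TTS model (e.g. Qwen3-TTS or Kokoro) here
    raise NotImplementedError

def ask_llm(text):
    body = {
        "model": "qwen3-4b-instruct-2507",  # assumed model name on the server
        "messages": [
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": text},
        ],
        "temperature": 0.3,
    }
    r = requests.post(LLM_URL, json=body, timeout=60)
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]

if __name__ == "__main__":
    heard = transcribe("last_utterance.wav")  # placeholder capture file
    speak(ask_llm(heard))</code></pre>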
<p>Try it out, or build your own project on top of it, from here: <a href="https://github.com/liampetti/fulloch">https://github.com/liampetti/fulloch</a></p> </div><!-- SC_ON -->   submitted by   <a href="https://old.reddit.com/user/liampetti"> /u/liampetti </a> <br /> <span><a href="https://v.redd.it/feropirhmkig1">[link]</a></span>   <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1r0nd6m/a_fully_local_home_automation_voice_assistant/">[comments]</a></span> </td></tr></table>2026-02-10T01:39:23+00:00t3_1r03wfqBad news for local bros2026-02-09T13:14:31+00:00/u/FireGuy324https://old.reddit.com/user/FireGuy324<table> <tr><td> <a href="https://old.reddit.com/r/LocalLLaMA/comments/1r03wfq/bad_news_for_local_bros/"> <img alt="Bad news for local bros" src="https://preview.redd.it/ui5ovstbygig1.jpeg?width=640&crop=smart&auto=webp&s=1eaeb40f2ac5a09ac1ba2fe03e433877561acb20" title="Bad news for local bros" /> </a> </td><td>   submitted by   <a href="https://old.reddit.com/user/FireGuy324"> /u/FireGuy324 </a> <br /> <span><a href="https://i.redd.it/ui5ovstbygig1.jpeg">[link]</a></span>   <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1r03wfq/bad_news_for_local_bros/">[comments]</a></span> </td></tr></table>2026-02-09T13:14:31+00:00t3_1r0abplDo not Let the "Coder" in Qwen3-Coder-Next Fool You! It's the Smartest, General Purpose Model of its Size2026-02-09T17:23:31+00:00/u/Iory1998https://old.reddit.com/user/Iory1998<!-- SC_OFF --><div class="md"><p>Like many of you, I like to use LLMs as tools to help improve my daily life, from editing my emails to online search.</p> <p>I also like to use them as an "inner voice" to discuss general thoughts and get constructive criticism. For instance, when I face life-related problems that might take me hours or days to figure out, a short session with an LLM can significantly quicken that process.</p> <p>Since the original Llama was leaked, I've been using LLMs locally, but I always felt they were lagging behind OpenAI or Google models. Thus, I would always go back to ChatGPT or Gemini when I needed serious output. If I needed a long chat session or help with long documents, I had no choice but to use the SOTA models, and that meant willingly leaking personal or work-related data.</p> <p>For me, Gemini-3 is the best model I've ever tried. I don't know about you, but I sometimes struggle to follow ChatGPT's logic, while I find it easy to follow Gemini's. It's like that best friend who just gets you and speaks in your language.</p> <p>Well, that was the case until I tried Qwen3-Coder-Next. For the first time, I could have stimulating and enlightening conversations with a local model. Previously, I used Qwen3-Next-80B-A3B-Thinking (not so seriously) as my local daily driver, but that model always felt a bit inconsistent; sometimes I get good output, and sometimes I get a dumb one.</p> <p>However, Qwen3-Coder-Next is more consistent, and you can feel that it's a pragmatic model trained to be a problem-solver rather than a sycophant. Unprompted, it will suggest an author, a book, or a theory that already exists and might help. I genuinely feel I am conversing with a fellow thinker rather than an echo chamber constantly paraphrasing my prompts in a more polished way. It's the closest model to Gemini-2.5/3 that I can run locally in terms of quality of experience.</p> <p><strong>For non-coders, my point is: do not sleep on Qwen3-Coder-Next simply because it has the "coder" tag attached.</strong></p> <p>I can't wait for the Qwen-3.5 models.
If Qwen3-Coder-Next is an early preview, we are in a real treat.</p> </div><!-- SC_ON -->   submitted by   <a href="https://old.reddit.com/user/Iory1998"> /u/Iory1998 </a> <br /> <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1r0abpl/do_not_let_the_coder_in_qwen3codernext_fool_you/">[link]</a></span>   <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1r0abpl/do_not_let_the_coder_in_qwen3codernext_fool_you/">[comments]</a></span>2026-02-09T17:23:31+00:00t3_1r0eo44MechaEpstein-80002026-02-09T19:57:33+00:00/u/ortegaalfredohttps://old.reddit.com/user/ortegaalfredo<table> <tr><td> <a href="https://old.reddit.com/r/LocalLLaMA/comments/1r0eo44/mechaepstein8000/"> <img alt="MechaEpstein-8000" src="https://external-preview.redd.it/xypXKrxWxdZlS8MfiDHiCuqwqkIzWDQHn3pcj2ChEio.png?width=640&crop=smart&auto=webp&s=f3c824638b39c14125f9a5dcd28ddf84eb8a3622" title="MechaEpstein-8000" /> </a> </td><td> <!-- SC_OFF --><div class="md"><p>I know it has already been done but this is my AI trained on Epstein Emails. Surprisingly hard to do, as most LLMs will refuse to generate the dataset for Epstein, lol. Everything about this is local, the dataset generation, training, etc. Done in a 16GB RTX-5000 ADA. </p> <p>Anyway, it's based on Qwen3-8B and its quite funny. GGUF available at link.<br /> Also I have it online here if you dare: <a href="https://www.neuroengine.ai/Neuroengine-MechaEpstein">https://www.neuroengine.ai/Neuroengine-MechaEpstein</a></p> </div><!-- SC_ON -->   submitted by   <a href="https://old.reddit.com/user/ortegaalfredo"> /u/ortegaalfredo </a> <br /> <span><a href="https://huggingface.co/ortegaalfredo/MechaEpstein-8000-GGUF">[link]</a></span>   <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1r0eo44/mechaepstein8000/">[comments]</a></span> </td></tr></table>2026-02-09T19:57:33+00:00t3_1mpk2vaAnnouncing LocalLlama discord server & bot!2025-08-13T23:21:05+00:00/u/HOLUPREDICTIONShttps://old.reddit.com/user/HOLUPREDICTIONS<table> <tr><td> <a href="https://old.reddit.com/r/LocalLLaMA/comments/1mpk2va/announcing_localllama_discord_server_bot/"> <img alt="Announcing LocalLlama discord server & bot!" src="https://b.thumbs.redditmedia.com/QBscWhXGvo8sy9oNNt-7et1ByOGRWY1UckDAudAWACM.jpg" title="Announcing LocalLlama discord server & bot!" /> </a> </td><td> <!-- SC_OFF --><div class="md"><p>INVITE: <a href="https://discord.gg/rC922KfEwj">https://discord.gg/rC922KfEwj</a></p> <p>There used to be one old discord server for the subreddit but it was deleted by the previous mod.</p> <p>Why? The subreddit has grown to 500k users - inevitably, some users like a niche community with more technical discussion and fewer memes (even if relevant).</p> <p>We have a discord bot to test out open source models.</p> <p>Better contest and events organization.</p> <p>Best for quick questions or showcasing your rig!</p> </div><!-- SC_ON -->   submitted by   <a href="https://old.reddit.com/user/HOLUPREDICTIONS"> /u/HOLUPREDICTIONS </a> <br /> <span><a href="https://www.reddit.com/gallery/1mpk2va">[link]</a></span>   <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1mpk2va/announcing_localllama_discord_server_bot/">[comments]</a></span> </td></tr></table>2025-08-13T23:21:05+00:00