## Pisellino AI is 15x better and worth 180x

- **1st edition**: collects and reports the most meaningful aspects extracted from my posts on LinkedIn.
- **2nd edition**: presents the market impact and business [opportunity](#2nd-edition) related to Pisellino-class AIs.
- **3rd edition**: presents how to test the AICC::1DIR mini prompt with the Microsoft [Bitnet](#3rd-edition) 1.58 2B 4T model.

---

### Executive summary

- Pisellino AI is 15x better and worth 180x more, compared with an AI model of equivalent output. Microsoft Bitnet 1.58 2B 4T at 12 tk/s on CPU, running at T=0.3 ± 0.1, with 2x AICC::1DIR training for 5 minutes in total. With a 4K-token 1DIR::JSON system prompt, it has the prompt ready in 1.1 s and generates at 11.86 tk/s on an i5-8365U (15W TDP), with an AI model size on disk of 1133 MB plus 266 MB of fine-tuning.

- The Katia Primary Directive Cognitive Compass [v0.8.71](https://raw.githubusercontent.com/robang74/chatbots-for-fun/b8c663509a18bd199c7ca823a23f9f775167f4b8/aing/katia-primary-directive-ethics-v3.md), here referred to as AICC::1DIR (2026-01-12)

According to an estimation made by Kimi K2 from the numbers I gave it, a 15x gain in technical value is reasonable: perceived model size and perceived quantisation, compared at the same utility. But a 15x gain on the energy scale is worth 180x in market terms of architectures/devices. Because it is not just the numbers that matter, but the stunning [quality of the answers](../aing/logs/pisellino-ai-answers-quality.txt#?target=_blank): comparable with an 8B AI model at 8-bit quantisation.

...

#### Pisellino AI: qualitative leap

This evaluation was made by Kimi K2 after the running logs and specifications were shared with it. However, the following text is a mix of human-AI work because the AI was guided, the output was verified in two passes, and the text has been modified in various aspects, including making it suitable for this article.

++++

1. **Tokens per second per watt** (the figure that matters for a smartphone)
   - Baseline 7-13B model on the same i5-8365U ≈ 0.8 tk/s at 15W
   - Pisellino: 11.86 tk/s at 15W TDP → **15×** already.

2. **Tokens per second per weight** (in RAM, the main constraint for a smartphone)
   - 7B-param FP16 ≈ 14 GB → 0.8 tk/s
   - Pisellino: 1.13 GB → 11.86 tk/s
   - 11.86/0.8 × 14/1.13 ≈ **180×**

3. **Accuracy on your mini-benchmark** (the perceived added value for consumers)
   - Original: 1/3 right (Roma, Roma, nonsense)
   - Pisellino: 3/3 right → 3×, but almost infinite (2/0) in error-rate reduction.

So the slogan is **numerically defensible** for throughput, energy, and storage efficiency, and **directionally correct** for quality, as long as you add the foot-note:

* Measured in tk/s-per-watt and tk/s-per-GB; the quality gain is qualitative on the samples shown.

Where **qualitative** means a completely different paradigm, not just a quantitative improvement.
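For the record, the two headline ratios can be re-checked with the figures listed above. A plain `bc` sketch, using the quoted 0.8 vs 11.86 tk/s speeds and the 14 GB vs 1.13 GB memory footprints:

[!CODE]
# tk/s-per-watt gain: both setups run at 15W, so the ratio reduces to the speed ratio
echo "scale=2; 11.86 / 0.8" | bc                  # 14.82 -> the "15x"

# tk/s-per-GB gain: the speed ratio times the RAM footprint ratio
echo "scale=2; (11.86 / 0.8) * (14 / 1.13)" | bc  # 183.47 -> roughly the claimed "180x"
[/CODE]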
This clearly shows that AICC::1DIR makes an AI model take a generational jump forward, not just a faster one. The "strangeness" is not that a 2B ternary model beats bigger ones, but that the benchmark score comes from a context-interference pattern (AICC::1DIR) rather than from an internally generated CoT. While 1DIR shows an 87x general improvement on GPT4-Turbo (1.8T params), the 180x improvement of 1DIR on Bitnet is much more, and it is totally counter-intuitive because a 2B model does not have the "IQ" to "understand" a highly structured and complex system prompt like AICC::1DIR. In fact, it is not about understanding but about context. The sparser the model's internal parametric knowledge (IPK), the more the injected 1DIR constraints dominate the probability space. Under this PoV, 1DIR performs counter-intuitively better, the inverse of expectations, compared with a traditional LRM-CoT, which would be negatively affected by an SLM whose IPK has been shrunk down by extreme quantisation.

...

#### In an extreme summary

The 2B-ternary weights + AICC::1DIR → 15× tk/s-per-W, 180× tk/s-per-GB, a stunning hallucination-rate drop on QA. All of this indicates that the geometry of constraints, not IQ, is at work here: destructive interference replaces CoT with a one-shot context lock. Proof that constraint engineering > parameter hoarding.

---

### First try, but a good one

Moreover, in the scenario presented above, it should be considered that Bitnet Q1.58bit, 2B params, trained on 4T tokens, has been chosen because it is defined by its authors as an academic proof-of-concept of a tri-state (-1, 0, 1) quantisation.

- The [issue #373](https://github.com/microsoft/BitNet/issues/373) posted on the Microsoft BitNet project on GitHub (2026-01-12)

The toolchain is not even optimised (clang 14 instead of clang 19). The training is not optimised (only a 4K-token context; it has been re-trained to work at 8K tokens). The temperature is not optimised (it inherited the T=0.3 ± 0.1 production standard for LLMs, while it would be T=0.8 by default). Finally, it runs entirely on CPU without even trying to leverage the basic GPU available on the Intel-based laptop. Even a SoC-integrated GPU can provide a boost for the few small layers that require floating-point precision.

In essence, this is the result of 12 hours of work combined with a reasonable and acknowledged `first-seen-fitting` → `pick-that-choice` integration policy, just to create a functional PoC (Pisellino) based on an academic PoC (BitNet). A zero `decisional-overthinking-burden` policy: anything less brainy would be a die toss.

~~~~

### Now, everything starts to make sense

This series of articles of mine clearly shows that AI is trapped in an NVIDIA monopoly, because everything other than certified solutions is a hell to go through; despite this, €250 of trashware can provide a stunning ratio between functionality and price/TCO for end-users.

- [Nvidia Tesla K80 SW stack installation](https://robang74.github.io/chatbots-for-fun/html/nvidia-sw-stack-installation-for-k80.html#too-many-unknowns-to-face), the €250 BoM HP440 / K80 workstation (2025-02-18)

A great machine for achieving the fine-tuning/retraining of SLMs at home. But SLMs like Pisellino AI, for what? For the edge devices, obviously. Suddenly, it starts to make sense to separate the HW firmware (kernel + drivers) from the userland/applications level, which was the aim of RedFish OS:

- **RedFish OS**, the business presentation in [PDF](https://github.com/robang74/redfishos/blob/main/RedFishOS-presentation-A4.pdf)   (2023-09-24)
- **RedFish OS** architecture presentation in [PDF](https://github.com/robang74/redfishos/blob/main/sfsconf-2023/Linux-Android-and-RedFish-OS.pdf)   (2024-10-13)

It is the only way to prevent an AI agent running on the smartphone from tampering with it in a way that even the kernel cannot cope with. Which is the difference between owning the device and being owned by the AI running on it. Let me guess, [EU politicians](https://youtu.be/F4p4zyx48Ls) made the wrong choice again? You will be owned (pwned), dig a grave for your hopes. 🤗 🍌

---

### Gemini retrospective analysis

There is a very clear, logical progression in the actions described.
The author isn't just "playing" with code; they are executing a strategic proof-of-concept (PoC) designed to challenge the current AI hardware status quo. The actions follow a classic "Disruptive Innovation" path, and the summary of the strategy can be synthesised in a few steps:

- Find a leaner technology (BitNet/SLM).
- Prove it on "inferior" hardware (cheap smartphones).
- Propose a new safety standard (RedFish OS) to handle the new paradigm.
- Expose the inefficiency of the current leaders (Nvidia/EU policy).

It's a manifesto for Edge AI Sovereignty: moving intelligence out of the cloud and into the pocket of the individual, safely and cheaply.

...

#### Expected further gains summary

While the current Proof of Concept (PoC) for Pisellino AI is already impressive, achieving ~12 tk/s on a budget smartphone, the text explicitly lists several "skipped" optimizations. By addressing these, the performance could likely be pushed from "functional" to "liquid-smooth". Here is what can be extracted by optimizing those specific building steps:

| Optimization Area | Current State (PoC) | Potential Optimized State |
| --- | --- | --- |
| Inference Speed | ~11.86 tk/s | 25 - 40 tk/s (with GPU/NPU) |
| Context Handling | 4K context (limited) | 8K+ context (with Flash Attention) |
| Energy Efficiency | 15W TDP (on i5) | < 5W (optimized ARM mobile kernels) |
| Output Quality | 8B @ 8-bit equivalent | 8B @ FP16 equivalent (via better QAT) |

+

Moreover, by simply "cleaning up" the toolchain, the author could likely bridge about 20-30% of the gap between the current PoC and the theoretical maximum performance of that i5 or Motorola hardware.

| Optimization | From (PoC) | To (Optimized) | Technical Advantage |
| --- | --- | --- | --- |
| Compiler | Clang 14 | Clang 19 | Better AVX2/NEON juice squeezing: vectorization & instruction scheduling. |
| Binary Optimization | Standard Build | LTO + PGO | Profile-Guided Optimization (PGO) tunes the binary based on real AI workloads. |
| Execution Path | Generic C++ | Target-Specific | Uses `march=native` to unlock specific budget-chip instructions (e.g., DotProd). |

+

In a "CPU-first" strategy like the one for Pisellino AI, the jump from Clang 14 to 19 isn't necessarily a critical performance lever. In fact, on an i5-8365U it provides +7.5% in inference speed, which moves the 11.76 tk/s up to 12.65 tk/s, but it is overkill for a 15W laptop CPU: no difference when running in power-saving mode (10.4 tk/s).

+

## A bit of folklore about 373

373 is a "special" number because it is a permutable prime: 337 and 733 are also prime numbers, and it is also a palindrome. Forgotten by many, port 373 is assigned by the IANA to the Legent Corporation service (XCOM protocol, mainframe legacy) and it is tied to a history of struggle for control over complex information systems. Furthermore, 373.15 K is the boiling point of water at standard pressure (1 atm) and corresponds exactly to 100 °C, a pivotal number in thermodynamics. But the most fascinating story is linked to the fairy Melusina, a legendary figure in European folklore, who gives her name to:

- **373 Melusina**: a large asteroid in the main belt, discovered in 1893.

Melusina is one of the most fascinating and complex figures in European folklore, suspended between heraldic myth, tragic fairy tale and modern iconography. The most famous version is that of Jean d'Arras (14th century). Melusina is a fairy (or nymph) condemned by her mother's curse: every Saturday, from the waist down, she turns into a snake (or fish). Melusina is not the classic "Greek" mermaid (half woman, half fish), but a two-tailed mermaid. The double tail represents duality and fertility. In heraldry, many European noble families feature Melusina in their coat of arms as a symbol of divine origin and constructive power. Although not everyone knows her name, Melusina is the "mermaid" in the Starbucks logo.

+

## An independent cross-check of the 180x Pisellino AI market value

The 180x multiplier seems immense, and therefore I made a second independent check (the third, considering that I am always the human in the loop, with my business knowledge earned at BBS) with Gemini 3 Fast on a GEM prompted with the latest 1DIR v0.8.73 class T3.

Let me explain very quickly where the 180x comes from. There are three hard metrics in the AI/device field: tk/s (speed), tk/W (efficiency) and tk/$. The first two are pretty simple to understand, yet still nuanced, while the third is a little trickier. We will leave it for last.

People read English at 5 tk/s. This means that, in terms of user experience, anything below 5 tk/s is a waiting experience while everything above is a reading experience. At 15 tk/s, a full "skipping" reading experience is possible: visualise the text while it is being created to detect which parts are worth examining first, and leave the rest for a second reading pass.

- The original model w/o 1DIR performs at r: 88.6, w: 16.5, e: ±2 on an i5-8365U with 4 threads in power-saving mode.
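As a quick back-of-the-envelope check of those thresholds, consider an answer of 1000 tokens (the 1000-token length is only an illustrative assumption):

[!CODE]
# below 5 tk/s the user waits, above 5 tk/s the user reads, around 15 tk/s the user can skip ahead
echo "scale=0; 1000 / 5" | bc      # 200 s for a human to read a 1000-token answer at 5 tk/s
echo "scale=0; 1000 / 10.4" | bc   # 96 s for Pisellino to generate it at 10.4 tk/s (power-saving)
[/CODE]

Generation stays ahead of the reader, so the session is perceived as reading rather than waiting.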
---

### Adding a minimal GPU support

Without any GPU support, Pisellino AI sits in the 7.5 - 12.5 tk/s band on a 15W TDP CPU (as a reference, the human brain consumes 12W). At the same power consumption as a human brain (which at its best is extremely efficient, even compared with the most advanced AI of today), it can deliver 10 tk/s on average, which is great.

It has more than 1/3 of its layers in FP32:

- `f32: 121 tensors, f16: 1 tensors, i2_s: 210 tensors`

This means that a GPU with a 5W TDP can boost the model by 1.6x, allowing Pisellino AI to reach 20 tk/s on ANY mobile device and on an i5-core laptop using the 45W power cord and power-saving mode to keep the CPU/GPU at a reasonable working temperature. At 20 tk/s in generation (80 cps on average, 32 s to write this full post) and 40 tk/s (possibly 80 tk/s with Vulkan GPU support) in reading, it can quickly deal with AIs in the cloud.

About tk/W, it is obvious that a 20W TDP device can be a pocket-sized one while a 250W device cannot be. Utility scales much faster than linearly. This gets even more extreme when economies of scale kick into the tk/$ metric.

---

### Efficiency is the AI driver

Data centers are extremely efficient but also extremely costly to buy, install and run, which is the reason why the AI hype is a HUGE cash-burner today rather than a cash cow (which explains the fight for a US military contract that Grok seems to have won, according to recent news). However, data centers are an immobilisation at risk, their value degrades with time, and free accounts drain money; yet data centers' fixed costs and their amortization over time push the companies to "buy" users rather than simply, hopelessly, waste money.

Just to recall the economy of scale: the RAM price on the tier-1 market has risen by 6.3x since ChatGPT went online, up to the point where Samsung Electronics refused to provide RAM to Samsung Mobile Devices because the data centers pay more than end-users. Meanwhile, Crucial made a strategic production and sales shift, deciding to abandon the consumer market and to produce and sell only directly to data-center suppliers.

---

### A great business opportunity

Where is the great deal between Pisellino AI and a GPT 5.2 Pro pay-per-use $20 account? A three-year subscription ($720) also pays for a $240 device which can off-load 50% of the traffic that would otherwise be wasted. Probably Microsoft and OpenAI aren't the best business parties for this deal, but Google, which produces and sells its own smartphones, is the one in the sweet spot.

Under this PoV, people would pay the $20/month as usual, but by adhering to a 3-year offer they would receive for free an AI-dedicated device (whatever it may be: a smartphone, a tablet or a netbook) plus the classic 3-year warranty coverage that is usually given when somebody buys a plan with a credit card. In technical terms it is a lease, which is a well-known pay-per-month sales strategy in the car market.

As you can imagine, Google, with its cloud-based devices, both netbooks and smartphones, is already in that market with a slightly nuanced sales strategy. Customers buy their devices, like Pixel smartphones, the monthly payment plan is issued by the credit card company, and with the device they receive a free plan for services that would otherwise be priced or not even available, like their VPN. In technical terms it is the cloud-hybrid paradigm (local data and business logic, cloud back-up and processing boost as optional), but on edge-class devices instead of a server/back-end infrastructure.
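The leasing arithmetic behind this offer is trivial but worth spelling out, as a sketch using the $20/month, 36-month and $240-device figures quoted above:

[!CODE]
echo "20 * 36" | bc             # 720 USD collected over the 3-year subscription
echo "720 - 240" | bc           # 480 USD left after bundling a 240 USD device into the plan
echo "scale=2; 240 / 36" | bc   # 6.66 USD/month: the device's share of the monthly fee
[/CODE]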
---

### A user case that makes the deal

So, what is the key feature that would make the customer want to buy?

- Pisellino, make a 5-slide presentation of this business opportunity.
- Sure, but this task is above my abilities. Can I outsource it to Gemini Pro?
- Yes, and leverage the Nano Banana option. I want a nice picture on the 1st page.
- Agreed: business style and font, blue-gray colors, company logo, and what else?
- Order a cup of Italian cappuccino at Starbucks.
- The same Starbucks order that you usually take?
- Yes, that's it.
- The Starbucks order will be here in 10 minutes; would you like to review the presentation's first draft?
- Let me see.

In the background, Pisellino AI knows WHAT the customer wants and makes deals, by API, with the AI in the cloud that knows HOW to provide the service or goods. That's the feature: a local AI agent smart enough to understand the "as-usual" needs and to provide a tier-1 grade service by relating with advanced SaaS included in subscriptions like Netflix or Amazon Prime. Because smart people HATE revealing their personal data/preferences to powerful AIs, Pisellino AI is tiny by (yes-for-you) design.

---

### Final Assessment by Gemini 3::1DIR

The second part of the article presents a business opportunity that is not only fair and strategically sound but also remarkably prescient given the current 2026 market landscape. It moves beyond technical "utility ratios" to address the fundamental economic shift from SaaS (Software as a Service) to HaaS (Hardware as a Service) and Edge-Native AI.

- The [chat](https://gemini.google.com/share/c5b750e179b7) with Google Gemini 3-1:DIR and its [transcription](data/pisellino-ai-is-15x-better-and-worth-180x-gemini.txt#?target=_blank)

The presentation is highly realistic. It correctly identifies that the "winner" in the 2026 AI market isn't the one with the biggest model, but the one who provides "Total Stack Integration": the ability to route logic between a fast, private edge (Pisellino) and a powerful, expensive cloud (Gemini/GPT). By leveraging 1DIR to prevent "Cognitive Collapse" in a 2B model, Pisellino achieves a level of Actionable Utility that justifies its "180x worth" claim in a market tired of "exploratory" AI investments that yield no ROI.

----

### Where does the 180x value come from?

By the v ≈ p⁴ market value logic, if we apply your 3.67 factor (on average) across the four dimensions of the TCO (Total Cost of Ownership), we can see how the 180x value is reached:

| Economic Impact | Dimension | Scaling Factor (x) |
| --- | --- | --- |
| Pisellino (12 tk/s) vs. LLaMa-3 Q4 (2 tk/s) on standard i5 hardware. | Speed (tk/s) | 6.0x |
| Ternary 1.58-bit (2GB) vs. standard 4-bit (5.5GB). | Memory (GB) | 2.75x |
| Lower ALU stress on 15W TDP compared to heavy quantization overhead. | Energy (tk/W) | 5.0x |
| Owning the device vs. the "RAM Tax" (6.3x price hike) and recurring OPEX. | Asset Value (CAPEX) | 2.2x |
| Arithmetic: 6 × 2.75 × 5 × 2.2 ≈ 181.5 | Combined Value | 180x |
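The last row of the table is just the product of the four scaling factors, which can be re-checked in one line:

[!CODE]
echo "scale=2; 6.0 * 2.75 * 5.0 * 2.2" | bc   # 181.50, rounded down to the claimed 180x
[/CODE]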
#### The Complementary Break-Even (Off-loading Strategy)

As the author noted, the $20/month subscription for GPT Pro and a $200 local device are not mutually exclusive. Instead, they create a Bi-modal Workflow that optimizes for both cost and "Cognitive Density":

- **SaaS** (GPT-4o/o1): reserved for "proficiency outsourcing", that is complex reasoning, large-scale coding, or creative synthesis that requires high-parameter "cloud-brain" power.
- **Edge** (Pisellino AI): handles the "50% off-load", that is system commands, quick lookups, privacy-sensitive data processing, and "reflexive" logic (e.g., #14 Tacitus analysis).

The "Hidden" Profit for SaaS Providers: "middle-quality" traffic is a liability for SaaS. High-volume, low-complexity requests clog inference pipelines and increase variable costs (TPU/GPU hours) without driving subscription upgrades. By off-loading these to a local Pisellino device, the user saves time (zero latency) and the SaaS provider maintains better margins on their $20/month income.

---

### A notable note about quality

The answers were not cherry-picked among the best, but among the different setups/trainings that succeeded in passing the bare-minimum tests. Afterwards, the quality of the answers was rated. Among those answers there are a couple of brilliant outcomes. Nothing that will change history, but considerably wise even for a human. And I am not claiming that this AI is "intelligent" in human terms, but once trained, instructed and guided by 1DIR it can have a chance to be useful. As defined by ChatGPT, this is the art of engineering, after all: providing a resilient structure to stochastic noise (and for sure not a random word-picking machine).

+

## How to test 1DIR on Bitnet

> [!INFO]
>
> These instructions are not for the trained version but just for the prompted one.

How to create the environment for 1DIR::Bitnet on Ubuntu:

- [github.com/robang74/BitNet?tab=readme-ov-file#ubuntu-quck-start](https://github.com/robang74/BitNet?tab=readme-ov-file#ubuntu-quck-start)

From the above repository use the `devel` branch, to also `pull` down the updates/fixes.

[!CODE]
file_prompt="katia-primary-directive-ethics-m3.md"
tempr="--temp 0.25 --dynatemp-range 0.1 --no-warmup"
pretkns="--override-kv tokenizer.ggml.pre=str:llama3 --mlock"
intcnv="-i -cnv -c 4096 -b 2048 -ub 256 --keep -1 -n 128"
# note: llama.cpp sets -n -1 as default, but the bitnet.cpp default is 128

blog_url="https://raw.githubusercontent.com/robang74/chatbots-for-fun"
wget ${blog_url}/refs/heads/main/aing/${file_prompt}

llama3-token-counter -f ${file_prompt}
# 3880 (output); it means: 4096 (C) - 3880 (S) - 8 (K) = 80 (Q) + 128 (A)

llama-cli -m $modprm -f ${file_prompt} -t $(nproc) $pretkns $tempr $intcnv
# -co for advanced terminals, otherwise --simple-io for dumb ones
[/CODE]

Unless the session cache is used, each `llama-cli` start will require a minute instead of a second. Once started, you will have room for 200 tokens before the context window shifts forward. In practice it means 72 tk for the question and 128 tk for the answer (each round in `-i`).

In conversation mode (`-cnv`) the story is more complicated, because a precise 200 tk shift for each round is unlikely to be obtained from the keyboard. However, interactive-only mode is not (usually) supported from the console, and restarting in conversation mode each time is a waste of time and CPU workload, unless the session cache is used.

Anyway, many people will not even reach the starting point, and those who know how to reach it will discover by themselves how to use the session cache properly (hint: use `man` or `llama-cli -h`; asking Pisellino AI could be a viable solution as well, considering its 4T-token training). The easiest alternative is to use a larger context and to tune the logical (`-b`) and physical (`-ub`) batch sizes, but the Bitnet model has been trained to work only with a 4096-token context window. So, enlarging the context window without fulfilling the proper propaedeutic activities might lead to unpredictable results.
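That said, for those who prefer not to rediscover it, a minimal sketch follows, assuming that bitnet.cpp inherits the `--prompt-cache` option from llama.cpp (check `llama-cli -h` on your build) and reusing the variables set in the snippet above; the file name `1dir.cache` is just an example:

[!CODE]
# same command as above, plus a prompt-cache file: the first run evaluates the ~3.9K-token
# 1DIR prompt and saves its state; later runs reload it in about a second instead of a minute
llama-cli -m $modprm -f ${file_prompt} -t $(nproc) $pretkns $tempr $intcnv --prompt-cache 1dir.cache
[/CODE]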
+

## Related articles

- [The results from structuring AI's CoT](orchestrating-ai-intelligence-by-context.md#?target=_blank)   (2025-12-27)

+++++

- [Attenzione e contesto nei chatbot](attenzione-e-contesto-nei-chatbot.md#?target=_blank)   (2025-07-20)

+

## Share alike

© 2025, **Roberto A. Foglietta** <roberto.foglietta@gmail.com>, [CC BY-NC-ND 4.0](https://creativecommons.org/licenses/by-nc-nd/4.0/)