Hacker News - Small Sites - Score >= 50 | https://news.ycombinator.com | Mon, 03 Mar 2025 04:23:19 GMT

Made a scroll bar buddy that walks down the page when you scroll (thread link) | @hello12343214
March 2, 2025 | https://focusfurnace.com/scroll_buddy.html | archive.org

(Look at your scroll bar when you scroll)

Instead of a boring scrollbar, I thought it would be fun to have an animated stick figure that walks up and down the side of your page as you scroll.

This is the first prototype I made.

I'm going to make a skateboarder, rock climber, or squirrel next. What other kinds of scroll buddies should I make?

Get a scroll buddy for your website

Warning: An embedded example on the side of this page has animation and movement that may be problematic for some readers. Readers with vestibular motion disorders may wish to enable the reduce-motion setting on their device before viewing the animation. If you have reduce motion turned on, Scroll Buddy should be hidden in most browsers.

(The page continues with lorem ipsum filler, there only to give Scroll Buddy enough content to scroll past.)

Made with simple JavaScript
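
The post doesn't ship its source, but a minimal sketch of the idea might look like this (the element id and styling are assumptions, not the author's code): map scroll progress to a vertical position for a fixed-position figure, and hide it entirely when the reader prefers reduced motion.

// Assumes an element like <div id="scroll-buddy"> styled with position: fixed; right: 0.
const buddy = document.getElementById("scroll-buddy") as HTMLElement;

// Respect the reduce-motion preference mentioned in the warning above.
if (window.matchMedia("(prefers-reduced-motion: reduce)").matches) {
  buddy.style.display = "none";
} else {
  window.addEventListener(
    "scroll",
    () => {
      // 0 at the top of the page, 1 at the bottom.
      const max = document.documentElement.scrollHeight - window.innerHeight;
      const progress = max > 0 ? window.scrollY / max : 0;
      // Walk the figure down the viewport in step with the scrollbar thumb.
      buddy.style.top = `${progress * (window.innerHeight - buddy.offsetHeight)}px`;
    },
    { passive: true }
  );
}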

The "strategic reserve" exposes crypto as the scam it always was (thread link) | @kolchinski
March 2, 2025 | https://alexkolchinski.com/2025/03/03/the-strategic-reserve-exposes-crypto-as-the-scam-it-always-was/ | archive.org

Today, President Trump announced that the US Government would begin using taxpayer dollars to systematically buy up a variety of cryptocurrencies. Crypto prices shot up on the news.

This is revealing, as crypto boosters have argued for years that cryptocurrency has legitimate economic value as a payment system outside of the government’s purview.

Instead, those same crypto boosters are now tapping the White House for money — in US Dollars, coming from US taxpayers.

Why?

Crypto has been one of the biggest speculative bubbles of all time, maybe the single biggest ever. Millions of retail investors have piled into crypto assets in the hope and expectation that prices will continue to go up. (Notice how much of the chatter around crypto is always around prices, as opposed to non-speculative uses.)

However, every bubble bursts once it runs out of gamblers to put new money in, and it may be that the crypto community believes that that time is near for crypto, as they are now turning to the biggest buyer in the world — the US Government — for help.

This shows that all the claims that crypto leaders have made for years about crypto’s value as a currency outside of government control have been self-serving lies all along: the people who have most prominently argued that position are now begging the White House to hand them USD for their crypto.

It also reveals how much crypto has turned into a cancer on our entire society.

In previous Ponzi schemes, the government has often stepped in to defuse bubbles and protect retail investors from being taken in by scammers.

But in this wave, not only has the government not stepped in to stop the scam, it has now been captured by people with a vested interest in keeping it going as long as possible.

Our president and a number of members of his inner circle hold large amounts of cryptocurrency and have a vested interest in seeing its value rise — Trump’s personal memecoin being a particularly notable example. And many other people in the corridors of power in Washington and Silicon Valley are in the same boat. “It is difficult to get a man to understand something, when his salary depends on his not understanding it”, and so some of the most prominent people in the country are now prepared to make any argument and implement any policy decision to boost the value of their crypto holdings.

How does this end?

Once the US taxpayer is tapped out, there's not going to be any larger pool of demand left to keep crypto prices up, and as in every previous speculative bubble, once confidence evaporates, prices will fall, probably precipitously. Unfortunately, as millions of people now have significant crypto holdings, and stablecoins have entangled crypto with fiat currency, the damage to the economy may be widespread.

The end of the crypto frenzy would, in the end, be a good thing. Cryptocurrency has a few legitimate uses, like helping citizens of repressive regimes avoid currency controls and reducing fees on remittances. But it has also enabled vast evil in the world. Diverting trillions of dollars away from productive investments into gambling is bad enough, but the untraceability of crypto has also enabled terrorist organizations, criminal networks, and rogue states like North Korea to fund themselves far more effectively than ever before. I’ve been hearing from my friends in the finance world that North Korea now generates a significant fraction, if not a majority, of its revenues by running crypto scams on Westerners, and that the scale of scams overall has grown by a factor of 10 since crypto became widely used. (Why do you think you’re getting so many calls and texts from scammers lately?)

I hope that the end of this frenzy of gambling and fraud comes soon. But in the meantime, let’s hope that not too much of our tax money goes to paying the scammers, and that when the collapse comes it doesn’t take down our entire economy with it.

Thanks to Alec Bell for helping edit this essay.

The Era of Solopreneurs Is Here (thread link) | @QueensGambit
March 2, 2025 | https://manidoraisamy.com/developer-forever/post/the-era-of-solopreneurs-is-here.anc-52867368-2029-4dc5-a7da-ece853a648b5.html | archive.org

DeepSeek just dropped a bombshell: $200M in annual revenue with a 500%+ profit margin—all while charging 25x less than OpenAI. But DeepSeek didn’t just build another AI model. They wrote their own parallel file system (3FS) to optimize costs—something that would have been unthinkable for a company of their size. This was possible because AI helped write the file system. Now, imagine what will happen in a couple of years—AI will be writing code, optimizing infrastructure, and even debugging itself. An engineer with AI tools can now outbuild a 100-person engineering team.

Disappearing Pillars


For years, the freemium business model, cloud computing, and AI have been converging. First, the internet killed the need for sales teams (distribution moved online). Then, serverless computing eliminated IT teams (AWS, Firebase, you name it). And now, AI is breaking the last barrier—software development itself. This shift has been happening quietly for 15 years, but AI is the final straw that breaks the camel’s back.

This kind of disruption was previously limited to narrow consumer products like WhatsApp, where a 20-member team built a product that led to a $19 billion exit. But now, the same thing is happening in business applications that require breadth. AI will be able to build complex applications that were previously impossible for small teams. Take our own experience: Neartail competes with Shopify and Square, and it’s built by one person. Formfacade is a CRM that competes with HubSpot—built by one person. A decade ago, this wouldn’t have been possible for us. But today, AI handles onboarding, customer support, and even parts of development itself. So, what does this mean for SaaS? It won’t disappear, but it’s about to get a whole lot leaner.

Double Threat to Big Companies

For large incumbents, this shift isn’t just about new competition—it’s a fundamental restructuring of how software businesses operate. They face a double threat:

  1. They must cut down their workforce, even if employees are highly skilled, creating a moral dilemma.
  2. They have to rebuild their products from scratch for the AI era - a challenge for elephants that can't dance.

Look at payments, for example. Stripe charges 3% per transaction. We’re rolling out 2% fees for both payments and order forms because we use AI to read the seller’s SMS and automate the payment processing. It won’t hurt Stripe now—they make billions off Shopify’s transaction fees alone. But it’s a slow rug pull. First, AI-first companies like us will nibble away at Shopify's revenue. Then, a few will break through and topple Shopify. And only then will incumbents like Stripe feel the pinch, as a second-order effect.

Summary

This is a massive opportunity for startups right now. While the giants are trapped in their own complexity, nimble teams can build and launch AI-native solutions that directly challenge established players. Target a bloated SaaS vertical, rebuild it from the ground up with AI at its core, and position it as the next-generation alternative.

For example, the future of CRM isn’t just software—it’s software + sales team. Startups that don’t want to hire salespeople will eagerly adopt AI-driven CRMs that automate outreach and follow-ups. Meanwhile, large companies will hesitate to fire their sales teams or switch from legacy CRMs due to vendor lock-in. But over time, startups using AI-native CRMs will scale into large companies themselves, forcing the laggards to transition or fall behind.

This is why we say, “The future is here, but not evenly distributed.” The AI-native solutions of today will become the default for the next wave of large enterprises. The opportunity isn’t in building software for existing companies—it’s in building it for the startups that will replace them. For founders starting companies today, this is Day Zero in the truest sense. The AI-native companies being built now are the ones that will define the next decade of SaaS. It’s not just disruption—it’s a complete reset.

Speedrunners are vulnerability researchers, they just don't know it yet (thread link) | @chc4
March 2, 2025 | https://zetier.com/speedrunners-are-vulnerability-researchers/ | archive.org


Thousands of video game enthusiasts are developing cybersecurity-industry experience by accident. They have a fun hobby, poring over the details of their favorite games, and they don't know they could be doing something very similar… by becoming vulnerability researchers.

That probably requires some backstory, especially from a cybersecurity company's blog!

What's a speedrun?

Basically as soon as video games were released, people have been trying to beat them faster than their friends (or enemies) can. Gamers will do this for practically any game on the planet – but the most popular games, or the ones with the most cultural weight and cultish following, naturally end up with the fiercest competition. Speedrunners will run through their favorite game hundreds or thousands of times in order to get to the top of community-driven leaderboards for the fastest time… which gives that video game's community a strong incentive to find the absolute fastest way to clear the game, no matter how strange.

"Any percent" speedruns, or "any%" more commonly, are usually one of the most popular categories of speedrun for any given game. In it, all rules are off and no weird behavior is disallowed: intentionally triggering bugs in the game, which the developers never intended for the players to be able to perform, often have the potential to shave double-digit percentages of time off existing routes by cutting out entire swathes of the game from having to be played at all. Why do 1 -> 2 -> 3 if you can do a cool trick and skip from 1 -> 3 directly?

A lot of these glitches revolve around extremely precise movement… but for the most dedicated fans, they'll go even further.

Glitch hunting is reverse engineering

Entire groups will spring up inside a game's speedrunning community dedicated to discovering new glitches, and oftentimes they'll apply engineering to it.

These enthusiasts won't just try weird things in the game over and over (although that definitely helps!) – they'll use tools that are standard in the cybersecurity industry to pull apart how software works internally, such as IDA Pro or Ghidra, to discover exactly what makes their target video game tick. On top of static analysis, they'll leverage dynamic analysis as well: glitch hunters will use dynamic introspection and debugging tools, like the Dolphin Emulator’s memory viewer or Cheat Engine, to get a GDB-like interface for figuring out the program's internal data structures and how information is recorded.

And even further, they'll develop entirely new tooling: I've seen groups like the Paper Mario: The Thousand Year Door community reverse engineer game file formats and create Ghidra program loaders, or other groups completely re-implement Ghidra disassembled code in C so they can stick it under a fuzzer in isolation. Some of the speedrun glitch hunters are incredibly technically competent, using the exact same tooling and techniques that people in the cybersecurity industry use for reverse engineering every day.

…And it’s vulnerability research

Not only do these groups do reverse engineering, but they also are doing vulnerability research. Remember, they don't only try to figure out how games work, but they try to break the game in any way possible. These glitches end up looking stunningly similar to how memory corruption exploits work for any other computer program: they'll find buffer overflows, use-after-frees, and incorrect state machine transitions in their target games.

And perhaps most impressively, they'll productize their exploits, unlike a lot of people in the cybersecurity industry. Some vulnerability researchers will develop a proof-of-concept to demonstrate a bug – but never build up the technical chops to develop that exploit further into something an adversary could actually use. They might intellectually know how to weaponize a buffer overflow or a use-after-free, but speedrunning groups, by necessity, are actually doing it. Oftentimes, actually using these glitches requires working through extremely restrictive constraints, both in what inputs they control and in what parts of the program they can influence.

Super Mario World runners will place items in extremely precise locations so that the X,Y coordinates form shellcode they can jump to with a dangling reference. Legend of Zelda: Ocarina of Time players will do heap grooming and write a function pointer using the IEEE-754 floating point number bit representation so the game “wrong warps” directly to the end credit sequence... with nothing more than a game controller and a steady hand.
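
To make the Ocarina of Time trick concrete, here is a small TypeScript illustration of the underlying idea, with a purely hypothetical address: the same 32 bits can be decoded either as an IEEE-754 float (something the game stores, like a coordinate) or as the integer value of a pointer.

// Two views over the same four bytes: one decodes them as a float, one as a uint32.
const bytes = new ArrayBuffer(4);
const asFloat = new Float32Array(bytes);
const asUint = new Uint32Array(bytes);

// Suppose a runner wants this hypothetical code address written into memory...
asUint[0] = 0x80380000;

// ...then this prints the float the game would have to store there "naturally",
// for example via a carefully manipulated position or angle.
console.log(asFloat[0]);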

Screenshot from an in-depth technical explanation of a Super Mario 64 glitch. Watch on YouTube.

Some of the game communities will even take it a step further! Tool-assisted speedruns, or "TAS" runs, will perform glitches so precise that they can't reliably be performed by human beings at all. They'll leverage frame-by-frame input recordings in order to hit the right angle on a game controller's stick, every time; they'll hit buttons on the exact video game tick, every time.

And because they have such precise control over their actions in games, they'll likewise be able to consider game glitches with exacting precision. TAS authors are able to leverage inspecting the video game with memory debuggers to craft a use-after-free with the perfect heap spray, or write multiple stages of shellcode payload in their player inventory with button presses.

There's even an entire segment at the most popular speedrunning marathon, Awesome Games Done Quick (AGDQ), called "TASbot." During it, a robot does all the inputs via a hard-wired controller to perform a tool-assisted speedrun in real time – so it can do things like get arbitrary code execution and use that to replace the video game with an entirely new one, using nothing but controller inputs.

An industry exists!

The fact these people are so technically competent only throws into stark relief how disconnected some of them are from the larger cybersecurity industry. Speedrun glitch hunters will develop heap use-after-free exploits, with accompanying technical write-ups on the level of Google Project Zero… and in doing so, refer to it as an "item storage" glitch, because they developed the knowledge from first principles without ever reading a Phrack article. They'll re-implement disassembled code from Ghidra in C for automated glitch discovery, but without any exposure to American Fuzzy Lop or the large academic body of work driving fuzzer research.

And, critically for us here at Zetier, they don't know you can get paid to do a job very close to finding video game glitches, and so they don't know to apply to our reverse engineering or vulnerability research job postings. A lot of these video game glitch hunters, even the ones writing novel Ghidra loaders or runtime memory analysis scripts, don't think of what they're doing as anything more than a fun hobby; they might go become a normal software engineer, if that. Some of them will look up "IDA Pro" on LinkedIn and see a million malware analysis job postings. No offense to my malware analysis friends, but malware reverse engineering and vulnerability research are two very different roles!

Vulnerability research in industry, unlike more “normal” malware analysis jobs, usually takes the form of an engineer spending significant time investigating exactly how a program works. Like video game glitch discovery, the work isn't just about what a program does, but how it does it – and why the authors implemented it that way, along with how that behavior may affect other parts of the program. Oftentimes, you end up building up a repertoire of small, innocuous “huh that’s weird”-style bugs that are individually useless… until you find some missing piece. And like game glitches, the most satisfying discoveries on the job come from realizing there's a fundamental gap in the authors' thinking, where you don’t just find one glitch but an entire family of glitches, all from the same root cause.

A glimpse of an arbitrary code execution (ACE) exploit walk-through. See the video.

I personally love reading the technical game glitch write-ups that come out of speedrunning communities. Lots of my coworkers, and other people in the industry, likewise enjoy them. I love glitch write-ups because they remind me of the great parts of my job: extremely deep dives into the internals of how programs work, and working around odd constraints. Exploiting vulnerabilities requires performing mental gymnastics to chain seemingly innocuous primitives, like walking around out-of-bounds in Pokemon, together in a way that allows the author to express their creativity and mastery over a piece of software.

Talking to people in speedrunning communities who love poring over assembly, or figuring out exactly what the implications of a 1-byte buffer overflow in a textbox are, only for them to shrug and explain they're reluctantly working in a non-technical industry, comes across to me as a shame. If any of these descriptions speak to you, or bring to mind one of your friends, reach out to hello@zetier.com. We'd love to chat.

New battery-free technology can power devices using ambient RF signals (thread link) | @ohjeez
March 2, 2025 | https://news.nus.edu.sg/nus-researchers-develop-new-battery-free-technology/ | archive.org

In a breakthrough for green energy, researchers demonstrated a novel technique to efficiently convert ambient radiofrequency signals into DC voltage that can power electronic devices and sensors, enabling battery-free operation.

Ubiquitous wireless technologies like Wi-Fi, Bluetooth, and 5G rely on radio frequency (RF) signals to send and receive data. A new prototype of an energy harvesting module – developed by a team led by scientists from the National University of Singapore (NUS) – can now convert ambient or ‘waste’ RF signals into direct current (DC) voltage. This can be used to power small electronic devices without the use of batteries.

RF energy harvesting technologies such as this are essential, as they reduce battery dependency, extend device lifetimes, minimise environmental impact, and enhance the feasibility of wireless sensor networks and IoT devices in remote areas where frequent battery replacement is impractical.

However, RF energy harvesting technologies face challenges due to low ambient RF signal power (typically less than -20 dBm), where current rectifier technology either fails to operate or exhibits a low RF-to-DC conversion efficiency. While improving antenna efficiency and impedance matching can enhance performance, this also increases on-chip size, presenting obstacles to integration and miniaturisation.
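
For a sense of scale, -20 dBm corresponds to just 10 microwatts, since dBm is decibels referenced to 1 mW: P = 1 mW × 10^(-20/10) = 0.01 mW = 10 µW. Rectifiers therefore have to produce useful DC output from signals carrying only millionths of a watt.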

To address these challenges, a team of NUS researchers, working in collaboration with scientists from Tohoku University (TU) in Japan and University of Messina (UNIME) in Italy, has developed a compact and sensitive rectifier technology that uses nanoscale spin-rectifier (SR) to convert ambient wireless radio frequency signals at power less than -20 dBm to a DC voltage.

The team optimised SR devices and designed two configurations: 1) a single SR-based rectenna operational between -62 dBm and -20 dBm, and 2) an array of 10 SRs in series achieving 7.8% efficiency and zero-bias sensitivity of approximately 34,500 mV/mW. Integrating the SR-array into an energy harvesting module, they successfully powered a commercial temperature sensor at -27 dBm.

“Harvesting ambient RF electromagnetic signals is crucial for advancing energy-efficient electronic devices and sensors. However, existing Energy Harvesting Modules face challenges operating at low ambient power due to limitations in existing rectifier technology,” explained Professor Yang Hyunsoo from the Department of Electrical and Computer Engineering at the NUS College of Design and Engineering, who spearheaded the project.

Prof Yang added, “For example, gigahertz Schottky diode technology has remained saturated for decades due to thermodynamic restrictions at low power, with recent efforts focused only on improving antenna efficiency and impedance-matching networks, at the expense of bigger on-chip footprints. Nanoscale spin-rectifiers, on the other hand, offer a compact technology for sensitive and efficient RF-to-DC conversion.”

Elaborating on the team’s breakthrough technology, Prof Yang said, “We optimised the spin-rectifiers to operate at low RF power levels available in the ambient, and integrated an array of such spin-rectifiers to an energy harvesting module for powering the LED and commercial sensor at RF power less than -20 dBm. Our results demonstrate that SR-technology is easy to integrate and scalable, facilitating the development of large-scale SR-arrays for various low-powered RF and communication applications.”

The experimental research was carried out in collaboration with Professor Shunsuke Fukami and his team from TU, while the simulation was carried out by Professor Giovanni Finocchio from UNIME. The results were published in the prestigious journal, Nature Electronics, on 24 July 2024.

Spin-rectifier-based technology for the low-power operation

State-of-the-art rectifiers (Schottky diodes, tunnel diodes and two-dimensional MoS2) have reached efficiencies of 40–70% at Prf ≥ -10 dBm. However, the ambient RF power available from RF sources such as Wi-Fi routers is less than -20 dBm. Developing high-efficiency rectifiers for low-power regimes (Prf < -20 dBm) is difficult due to thermodynamic constraints and high-frequency parasitic effects. Additionally, on-chip rectifiers require an external antenna and impedance-matching circuit, impeding on-chip scaling. Therefore, designing a rectifier for an Energy Harvesting Module (EHM) that is sensitive to ambient RF power with a compact on-chip design remains a significant challenge.

The nanoscale spin-rectifiers can convert the RF signal to a DC voltage using the spin-diode effect. Although the SR-based technology surpassed the Schottky diode sensitivity, the low-power efficiency is still low (< 1%). To overcome the low-power limitations, the research team studied the intrinsic properties of SR, including the perpendicular anisotropy, device geometry, and dipolar field from the polarizer layer, as well as the dynamic response, which depends on the zero-field tunnelling magnetoresistance and voltage-controlled magnetic anisotropy (VCMA). Combining these optimised parameters with an external antenna impedance-matched to a single SR, the researchers designed an ultralow-power SR-rectenna.

To improve output and achieve on-chip operation, the SRs were coupled in an array arrangement, with small co-planar waveguides on the SRs employed to couple RF power, resulting in a compact on-chip area and high efficiency. One of the key findings is that the self-parametric effect driven by the well-known VCMA in magnetic-tunnel-junction-based spin-rectifiers significantly contributes to the low-power operation of SR-arrays, while also enhancing their bandwidth and rectification voltage. In a comprehensive comparison with Schottky diode technology, both in the same ambient conditions and against previous literature, the research team found that SR technology might be the most compact, efficient, and sensitive rectifier technology.

Commenting on the significance of their results, Dr Raghav Sharma, the first author of the paper, shared, “Despite extensive global research on rectifiers and energy harvesting modules, fundamental constraints in rectifier technology remain unresolved for low ambient RF power operation. Spin-rectifier technology offers a promising alternative, surpassing current Schottky diode efficiency and sensitivity in low-power regime. This advancement benchmarks RF rectifier technologies at low power, paving the way for designing next-generation ambient RF energy harvesters and sensors based on spin-rectifiers.”

Next steps

The NUS research team is now exploring the integration of an on-chip antenna to improve the efficiency and compactness of SR technologies. The team is also developing series-parallel connections to tune impedance in large arrays of SRs, utilising on-chip interconnects to connect individual SRs. This approach aims to enhance the harvesting of RF power, potentially generating a significant rectified voltage of a few volts, thus eliminating the need for a DC-to-DC booster.

The researchers also aim to collaborate with industry and academic partners for the advancement of self-sustained smart systems based on on-chip SR rectifiers. This could pave the way for compact on-chip technologies for wireless charging and signal detection systems.

An ode to TypeScript enums (thread link) | @disintegrator
March 2, 2025 | https://blog.disintegrator.dev/posts/ode-to-typescript-enums/ | archive.org

It’s official, folks. TypeScript 5.8 is out, bringing with it the --erasableSyntaxOnly flag and the nail in the coffin for many of the near-primordial language features like Enums and Namespaces. Node.js v23 joined Deno and Bun in adding support for running TypeScript files without a build step. The one real limitation is that only files containing erasable TypeScript syntax are supported, and Enums and Namespaces (ones holding values) violate that rule because they are transpiled to JavaScript objects. So the TypeScript team made it possible to ban those features with the new compiler flag, making it easy for folks to ensure their TS code is directly runnable.

But the issues with Enums didn’t start here. Over the last few years, prominent TypeScript content creators have been making the case against enums on social media, in blog posts and in short video essays. Let me stop here and say it out loud:

In almost all ways that matter, literal unions provide better ergonomics than enums and you should consider them first.

The problem is that these arguments, like the articles I linked to there and many others out there, aren't interested in making a case for any of the strengths of enums. While I maintain my position above, I want to spend a minute eulogizing an old friend. Remember, the as const assertions needed to supplant enums were only introduced in TypeScript 3.4. That’s nearly 6 years of using enums since TypeScript 0.9!

Probably my favorite argument in steelmanning enums is that you can document their members and the documentation is available anywhere you are accessing them. This includes deprecating them, which can be so useful if you are building APIs that evolve over time.

enum PaymentMethod {
  CreditCard = "credit-card",
  DebitCard = "debit-card",
  Bitcoin = "bitcoin",
  /**
   * Use an electronic check to pay your bills. Please note that this may take
   * up to 3 business days to go through.
   *
   * @deprecated Checks will no longer be accepted after 2025-04-30
   */
  Check = "check",
}

const method = PaymentMethod.Check;

There have been many instances where a union member’s value on its own is not perfectly self-explanatory, or is at least ambiguous when living alongside similar unions in a large codebase. The documentation has to be folded into the TSDoc comment of the union type itself, which cannot reflect deprecations and is not shown when hovering over a union member.

type PaymentMethod =
  | "credit-card"
  | "debit-card"
  | "bitcoin"
  /**
   * Use an electronic check to pay your bills. Please note that this may
   * take up to 3 business days to go through.
   *
   * @deprecated Checks will no longer be accepted after 2025-04-30
   */
  | "check";

const method: PaymentMethod = "check";

There are ways to get around this limitation using object literals with a const assertion, but the reality is that these literals aren’t typically imported and used by a library's users. They tend to be built up by library authors to have an iterable/indexable mapping around when validating unknown values, or to enumerate in a UI, e.g. in error messages or to build a <select> dropdown.
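
For reference, the const-assertion workaround mentioned above looks roughly like this, a minimal sketch reusing the payment example; the derived-type line is the usual idiom for turning the object's values into a literal union:

const PaymentMethod = {
  CreditCard: "credit-card",
  DebitCard: "debit-card",
  Bitcoin: "bitcoin",
  Check: "check",
} as const;

// Derive the literal union from the object's values.
type PaymentMethod = (typeof PaymentMethod)[keyof typeof PaymentMethod];
// => "credit-card" | "debit-card" | "bitcoin" | "check"

const method: PaymentMethod = PaymentMethod.Check;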

There are a couple more quality-of-life features that enums possess, but I’m choosing not to go through them here. For me personally, the degraded inline documentation is by far the toughest pill to swallow in moving to literal unions, and I wanted to focus on that. I’m really hoping the TypeScript team finds a way to support TSDoc on union members as the world moves away from enums.

Understanding Smallpond and 3FS (thread link) | @mritchie712
March 2, 2025 | https://www.definite.app/blog/smallpond | archive.org

March 2, 2025 | 10 minute read

Mike Ritchie

I didn't have "DeepSeek releases distributed DuckDB" on my 2025 bingo card.

You may have stumbled across smallpond from Twitter/X/LinkedIn hype. From that hype, you might have concluded Databricks and Snowflake are dead 😂. Not so fast. The reality is, although this is interesting and powerful open source tech, it's unlikely to be widely used in analytics anytime soon. Here's a concise breakdown to help you cut through the noise.

We'll cover:

  1. what smallpond and its companion, 3FS, are
  2. if they're suitable for your use case and if so
  3. how you can use them

What is smallpond?

smallpond is a lightweight, distributed data processing framework recently introduced by DeepSeek. It extends DuckDB (typically a single-node analytics database) to handle larger datasets across multiple nodes. smallpond enables DuckDB to manage distributed workloads by using a distributed storage and compute system.

Key features:

  • Distributed Analytics: Allows DuckDB to handle larger-than-memory datasets by partitioning data and running analytics tasks in parallel.
  • Open Source Deployment: If you can manage to get it running, 3FS would give you powerful and performant storage at a fraction of the cost of alternatives.
  • Manual Partitioning: Data is manually partitioned by users, and smallpond distributes these partitions across nodes for parallel processing.

What is 3FS?

3FS, or Fire-Flyer File System, is a high-performance parallel file system also developed by DeepSeek. It's optimized specifically for AI and HPC workloads, offering extremely high throughput and low latency by using SSDs and RDMA networking technology. 3FS is the high-speed, distributed storage backend that smallpond leverages to get its breakneck performance. 3FS achieves a remarkable read throughput of 6.6 TiB/s on a 180-node cluster, which is significantly higher than many traditional distributed file systems.

How Can I Use It?

To start, install it the same way as any other Python package: uv pip install smallpond. Remove uv if you like pain.

But to actually get the benefits of smallpond, it'll take much more work and depends largely on your data size and infrastructure:

  • Under 10TB: smallpond is likely unnecessary unless you have very specific distributed computing needs. A single-node DuckDB instance or simpler storage solutions will be easier to run and possibly more performant. To be candid, using smallpond at a smaller scale, without Ray / 3FS, is likely slower than vanilla DuckDB and a good bit more complicated.
  • 10TB to 1PB: smallpond begins to shine. You'd set up a cluster (see below) with several nodes, leveraging 3FS or another fast storage backend to achieve rapid parallel processing.
  • Over 1PB (Petabyte-Scale): smallpond and 3FS were explicitly designed to handle massive datasets. At this scale, you'd need to deploy a larger cluster with substantial infrastructure investments.

Deployment typically involves:

  1. Setting up a compute cluster (AWS EC2, Google Compute Engine, or on-prem).
  2. Deploying 3FS on nodes with high-performance SSDs and RDMA networking.
  3. Installing smallpond via Python to run distributed DuckDB tasks across your cluster.

Steps #1 and #3 are really easy. Step #2 is very hard. 3FS is new, so there's no guide on how you would set it up on AWS or any other cloud (maybe DeepSeek will offer this?). You could certainly deploy it on bare metal, but you'd be descending into a lower level of DevOps hell.

Note: if you're in the 95% of companies in the under 10TB bucket, you should really try Definite.

I experimented with running smallpond with S3 swapped in for 3FS here, but it's unclear what, if any, performance gains you'd get over scaling up a single node for moderate-sized data.

Is smallpond for me?

tl;dr: probably not.

Whether you'd want to use smallpond depends on several factors:

  • Your Data Scale: If your dataset is under 10TB, smallpond adds unnecessary complexity and overhead. For larger datasets, it provides substantial performance advantages.
  • Infrastructure Capability: smallpond and 3FS require significant infrastructure and DevOps expertise. Without a dedicated team experienced in cluster management, this could be challenging.
  • Analytical Complexity: smallpond excels at partition-level parallelism but is less optimized for complex joins. For workloads requiring intricate joins across partitions, performance might be limited.

How Smallpond Works (Under the Hood)

Lazy DAG Execution
Smallpond uses lazy evaluation for operations like map(), filter(), and partial_sql(). It doesn't run these immediately. Instead, it builds a logical execution plan as a directed acyclic graph (DAG), where each operation becomes a node (e.g., SqlEngineNode, HashPartitionNode, DataSourceNode).

Nothing actually happens until you trigger execution explicitly with actions like:

  • write_parquet() — Writes data to disk
  • to_pandas() — Converts results to a pandas DataFrame
  • compute() — Forces computation explicitly
  • count() — Counts rows
  • take() — Retrieves a subset of rows

This lazy evaluation is efficient because it avoids unnecessary computations and optimizes the workflow.

From Logical to Execution Plan
When you finally trigger an action, the logical plan becomes an execution plan made of specific tasks (e.g., SqlEngineTask, HashPartitionTask). These tasks are the actual work units distributed and executed by Ray.

Ray Core and Distribution
Smallpond’s distribution leverages Ray Core at the Python level, using partitions for scalability. Partitioning can be done manually, and Smallpond supports:

  • Hash partitioning (based on column values)
  • Even partitioning (by files or row counts)
  • Random shuffle partitioning

Each partition runs independently within its own Ray task, using DuckDB instances to process SQL queries. This tight integration with Ray emphasizes horizontal scaling (adding more nodes) rather than vertical scaling (larger, more powerful nodes). To use it at scale, you’ll need a Ray cluster. You can run one on your own infrastructure on a cloud provider (e.g. AWS), but if you just want to test this out, it'll be easier to get started with Anyscale (founded by Ray creators).

Conclusion

smallpond and 3FS offer powerful capabilities for scaling DuckDB analytics across large datasets. However, their complexity and infrastructure demands mean they're best suited for scenarios where simpler solutions no longer suffice. If you're managing massive datasets and already have robust DevOps support, smallpond and 3FS could significantly enhance your analytics capabilities. For simpler scenarios, sticking with a single-node DuckDB instance or using managed solutions remains your best option.

Why do we have both CSRF protection and CORS? (thread link) | @smagin
March 2, 2025 | https://smagin.fyi/posts/cross-site-requests/ | archive.org

Hello, Internet. I thought about cross-site requests and realised we have both CSRF protection and CORS, and it doesn't make sense at first glance. It does make sense generally, but I need a thousand words to make it so.

CSRF stands for Cross-Site Request Forgery. It was rather popular in the earlier internet, but now it's almost a non-issue thanks to standard prevention mechanisms built into most popular web frameworks. The forgery is to make a user submit a form that sends a cross-site request. The protection is to check that the request didn't come from a third-party site.

CORS stands for Cross-Origin Resource Sharing. It's a part of the HTTP specification that describes how to permit certain cross-site requests. This includes preflight requests and response headers that state which origins are allowed to send requests.

So, by default, are cross-origin requests allowed and we need CSRF protection, or they are forbidden and we need CORS to allow them? The answer is both.

The default behaviour

The default behaviour is defined by the Same-origin policy and is enforced by browsers. The policy states that, generally speaking, cross-site writes are allowed, and cross-site reads are not. You can send a POST request by submitting a form, but your browser won't let you read the response.

There is a newer part of this spec that sort of solves CSRF. In 2019, there was an initiative to change the default cookie behaviour. Before that, cookies were always sent in cross-site requests. The default was changed to not send cookies in cross-site POST requests. To do that, a new SameSite attribute for the Set-Cookie header was introduced. The attribute value that restores the old behaviour is None, and the new default is Lax.
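
As an illustration, with a made-up cookie name and value, opting back into the old behaviour versus using the new default looks like this (browsers also require Secure when SameSite=None is set):

Set-Cookie: session=abc123; SameSite=None; Secure
Set-Cookie: session=abc123; SameSite=Lax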

In 2025, 96% of browsers support the SameSite attribute, and 75% support the new default. Notably, Safari hasn't adopted the new default, and UCBrowser doesn't support any nice things.

Sidenote: I can’t understand how UCBrowser remains relatively popular among users, given that there are settings in js builders to build for N% of the users and next to nobody puts 99% there.

Sidenote 2: Origin is not the same as Site. Origin is a combination of a scheme, a hostname, and a port. Site is a combination of scheme and effective top level domain + 1. Subdomains and ports don’t matter for sites.

Links: Same-origin policy | caniuse SameSite cookie attribute

CORS

CORS is a way to override the same origin policy per origin.

The spec describes a particular browser-server interaction. The browser sends a preflight OPTIONS request before the actual request, and the server replies with rules for that origin. The rules take the form of response headers: they may specify whether the response can be read, which headers can be sent and received, and which HTTP methods are allowed. The header names start with Access-Control. The browser then enforces the rules.
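
For example, a hypothetical preflight exchange for a cross-origin DELETE request could look like this (origin and path are made up; the header names are the standard ones):

OPTIONS /api/items/42 HTTP/1.1
Origin: https://app.example.com
Access-Control-Request-Method: DELETE
Access-Control-Request-Headers: content-type

HTTP/1.1 204 No Content
Access-Control-Allow-Origin: https://app.example.com
Access-Control-Allow-Methods: DELETE
Access-Control-Allow-Headers: content-type
Access-Control-Max-Age: 86400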

CORS applies for several types of the requests:

  • js-initiated fetch and XMLHttpRequest
  • web fonts
  • webgl textures
  • images/video frames drawn to a canvas using drawImage
  • css shapes from images

Notoriously absent from this list are form submissions, otherwise known as simple requests. This is part of the internet being backward-compatible:

The motivation is that the <form> element from HTML 4.0 (which predates cross-site fetch() and XMLHttpRequest) can submit simple requests to any origin, so anyone writing a server must already be protecting against cross-site request forgery (CSRF). Under this assumption, the server doesn’t have to opt-in (by responding to a preflight request) to receive any request that looks like a form submission, since the threat of CSRF is no worse than that of form submission. However, the server still must opt-in using Access-Control-Allow-Origin to share the response with the script.

From the CORS page on MDN.

Question to readers: How is that in line with the SameSite initiative?

CSRF protection

So, cross-site write requests are allowed, but responses won’t be shared. At the same time, as website developers, we mostly don’t want to allow that.

The standard protection is to include in a write request a user-specific token that is available only on read:

  • for forms this token is put into a hidden input,
  • for js-initiated requests the token can be stored in a cookie or in a meta tag, and is put into params or request headers.

JS-initiated requests are not allowed cross-site by default anyway, but they are allowed same-site. Adding a CSRF token to js requests lets us do the check the same way for all requests.
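
A minimal client-side sketch of the meta-tag variant might look like this; the meta tag name and the X-CSRF-Token header are common conventions rather than a standard, so treat them as assumptions:

// Read the token the server rendered into the page.
const token =
  document.querySelector<HTMLMetaElement>('meta[name="csrf-token"]')?.content ?? "";

// Attach it to every same-site write request.
async function createPost(title: string): Promise<Response> {
  return fetch("/posts", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "X-CSRF-Token": token,
    },
    body: JSON.stringify({ title }),
  });
}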

This way we still depend on the browser, in that it still has to prevent responses from being read cross-site by default, but a bit less than if we were also reading something like the Origin request header instead of checking for the token.

Question to readers: In some of the frameworks CSRF tokens are rotated. Why?

Browser is important

I want to emphasise how important browsers are in this whole security scheme. All the client state for all the sites is stored in the browser, and it decides which parts to expose and when. It's browsers that enforce the Same-origin policy, and it's browsers that don't let responses be read if the server doesn't allow it. It's browsers that decide whether to adopt the new SameSite=Lax default. It's browsers that implement CORS and send safe preflight requests before an actual PATCH or DELETE.

We really have to trust the browsers we use.

Conclusion

What I learned

The internet will become more secure, and maybe a bit less backward-compatible, when the SameSite=Lax default is adopted by 100% of browsers. Until then, we will have to live with the situation where simple POST requests are special and allowed cross-site, while other requests fall into the CORS bucket.

Thanks Nikita Skazki for reviewing this post more times than I care to admit.

This post on Hackernews

Sources

  1. Same-origin policy
  2. caniuse SameSite cookie attribute
  3. OWASP CSRF cheatsheet
  4. CORS wiki with requirements
  5. CORS spec
  6. CORS on MDN
  7. Preflight request
  8. Origin request header
  9. Origin and Site
Crossing the uncanny valley of conversational voice (thread link) | @monroewalker
March 1, 2025 | https://www.sesame.com/research/crossing_the_uncanny_valley_of_voice | archive.org

February 27, 2025

Brendan Iribe, Ankit Kumar, and the Sesame team

How do we know when someone truly understands us? It is rarely just our words—it is in the subtleties of voice: the rising excitement, the thoughtful pause, the warm reassurance.

Voice is our most intimate medium as humans, carrying layers of meaning through countless variations in tone, pitch, rhythm, and emotion.

Today’s digital voice assistants lack essential qualities to make them truly useful. Without unlocking the full power of voice, they cannot hope to effectively collaborate with us. A personal assistant who speaks only in a neutral tone has difficulty finding a permanent place in our daily lives after the initial novelty wears off.

Over time this emotional flatness becomes more than just disappointing—it becomes exhausting.

Achieving voice presence

At Sesame, our goal is to achieve “voice presence”—the magical quality that makes spoken interactions feel real, understood, and valued. We are creating conversational partners that do not just process requests; they engage in genuine dialogue that builds confidence and trust over time. In doing so, we hope to realize the untapped potential of voice as the ultimate interface for instruction and understanding.

Key components

  • Emotional intelligence: reading and responding to emotional contexts.
  • Conversational dynamics: natural timing, pauses, interruptions and emphasis.
  • Contextual awareness: adjusting tone and style to match the situation.
  • Consistent personality: maintaining a coherent, reliable and appropriate presence.

We’re not there yet

Building a digital companion with voice presence is not easy, but we are making steady progress on multiple fronts, including personality, memory, expressivity and appropriateness. This demo is a showcase of some of our work in conversational speech generation. The companions shown here have been optimized for friendliness and expressivity to illustrate the potential of our approach.

Conversational voice demo

  1. Microphone permission is required.
  2. Calls are recorded for quality review but not used for ML training and are deleted within 30 days.
  3. By using this demo, you are agreeing to our Terms of Use and Privacy Policy.
  4. We recommend using Chrome (audio quality may be degraded in iOS/Safari 17.5).

Technical post

Authors

Johan Schalkwyk, Ankit Kumar, Dan Lyth, Sefik Emre Eskimez, Zack Hodari, Cinjon Resnick, Ramon Sanabria, Raven Jiang

To create AI companions that feel genuinely interactive, speech generation must go beyond producing high-quality audio—it must understand and adapt to context in real time. Traditional text-to-speech (TTS) models generate spoken output directly from text but lack the contextual awareness needed for natural conversations. Even though recent models produce highly human-like speech, they struggle with the one-to-many problem: there are countless valid ways to speak a sentence, but only some fit a given setting. Without additional context—including tone, rhythm, and history of the conversation—models lack the information to choose the best option. Capturing these nuances requires reasoning across multiple aspects of language and prosody.

To address this, we introduce the Conversational Speech Model (CSM), which frames the problem as an end-to-end multimodal learning task using transformers. It leverages the history of the conversation to produce more natural and coherent speech. There are two key takeaways from our work. The first is that CSM operates as a single-stage model, thereby improving efficiency and expressivity. The second is our evaluation suite, which is necessary for evaluating progress on contextual capabilities and addresses the fact that common public evaluations are saturated.

Background

One approach to modeling audio with transformers is to convert continuous waveforms into discrete audio token sequences using tokenizers. Most contemporary approaches ([1], [2]) rely on two types of audio tokens:

  1. Semantic tokens: Compact speaker-invariant representations of semantic and phonetic features. Their compressed nature enables them to capture key speech characteristics at the cost of high-fidelity representation.
  2. Acoustic tokens: Encodings of fine-grained acoustic details that enable high-fidelity audio reconstruction. These tokens are often generated using Residual Vector Quantization (RVQ) [2]. In contrast to semantic tokens, acoustic tokens retain natural speech characteristics like speaker-specific identity and timbre.

A common strategy first models semantic tokens and then generates audio using RVQ or diffusion-based methods. Decoupling these steps allows for a more structured approach to speech synthesis—the semantic tokens provide a compact, speaker-invariant representation that captures high-level linguistic and prosodic information, while the second-stage reconstructs the fine-grained acoustic details needed for high-fidelity speech. However, this approach has a critical limitation; semantic tokens are a bottleneck that must fully capture prosody, but ensuring this during training is challenging.

RVQ-based methods introduce their own set of challenges. Models must account for the sequential dependency between codebooks in a frame. One method, the delay pattern (figure below) [3], shifts higher codebooks progressively to condition predictions on lower codebooks within the same frame. A key limitation of this approach is that the time-to-first-audio scales poorly because an RVQ tokenizer with N codebooks requires N backbone steps before decoding the first audio chunk. While suitable for offline applications like audiobooks, this delay is problematic in a real-time scenario.
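
As an illustration of that scheduling cost (my own sketch, not code from the post), the delay pattern can be written as a per-codebook shift; the first frame's complete set of codebooks is only available after N steps, which is exactly the time-to-first-audio problem described above.

    import numpy as np

    def apply_delay_pattern(codes: np.ndarray, pad: int = 0) -> np.ndarray:
        # codes: (n_codebooks, n_frames) matrix of discrete audio tokens.
        # Codebook k is shifted right by k steps, so the first frame's full
        # set of n_codebooks tokens is only available after n_codebooks steps.
        n_q, t = codes.shape
        out = np.full((n_q, t + n_q - 1), pad, dtype=codes.dtype)
        for k in range(n_q):
            out[k, k:k + t] = codes[k]
        return out

    # 4 codebooks, 6 frames of dummy tokens
    print(apply_delay_pattern(np.arange(24).reshape(4, 6)))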

Example of delayed pattern generation in an RVQ tokenizer with 4 codebooks

Conversational Speech Model

CSM is a multimodal, text and speech model that operates directly on RVQ tokens. Inspired by the RQ-Transformer [4], we use two autoregressive transformers. Different from the approach in [5], we split the transformers at the zeroth codebook. The first multimodal backbone processes interleaved text and audio to model the zeroth codebook. The second audio decoder uses a distinct linear head for each codebook and models the remaining N – 1 codebooks to reconstruct speech from the backbone’s representations. The decoder is significantly smaller than the backbone, enabling low-latency generation while keeping the model end-to-end.

CSM model inference process. Text (T) and audio (A) tokens are interleaved and fed sequentially into the Backbone, which predicts the zeroth level of the codebook. The Decoder then samples levels 1 through N – 1 conditioned on the predicted zeroth level. The reconstructed audio token (A) is then autoregressively fed back into the Backbone for the next step, continuing until the audio EOT symbol is emitted. This process begins again on the next inference request, with the interim audio (such as a user utterance) being represented by interleaved audio and text transcription tokens.

Both transformers are variants of the Llama architecture. Text tokens are generated via a Llama tokenizer [6], while audio is processed using Mimi, a split-RVQ tokenizer, producing one semantic codebook and N – 1 acoustic codebooks per frame at 12.5 Hz. [5] Training samples are structured as alternating interleaved patterns of text and audio, with speaker identity encoded directly in the text representation.
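
A rough sketch of the generation loop described above. The post does not give an interface for the backbone or decoder, so the callables, token layout, and names below are assumptions; only the control flow follows the description (the backbone predicts the zeroth codebook, the small decoder fills in the remaining levels, and the finished frame is fed back in until an audio EOT appears).

    from typing import Callable, List

    def generate_frames(
        history: List[int],                          # interleaved text/audio context tokens
        backbone_step: Callable[[List[int]], int],   # hypothetical: predicts the zeroth codebook
        decoder_sample: Callable[[int], List[int]],  # hypothetical: samples codebooks 1..N-1
        audio_eot: int,
        max_frames: int = 1000,
    ) -> List[List[int]]:
        frames: List[List[int]] = []
        context = list(history)
        for _ in range(max_frames):
            level0 = backbone_step(context)            # backbone models the zeroth level
            if level0 == audio_eot:
                break                                  # stop at the audio EOT symbol
            frame = [level0] + decoder_sample(level0)  # decoder fills in the remaining levels
            frames.append(frame)
            context.extend(frame)                      # feed the frame back autoregressively
        return frames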

Compute amortization

This design introduces significant infrastructure challenges during training. The audio decoder processes an effective batch size of B × S and N codebooks autoregressively, where B is the original batch size, S is the sequence length, and N is the number of RVQ codebook levels. This high memory burden slows down training even with a small model, limits model scaling, and hinders rapid experimentation, all of which are crucial for performance.

To address these challenges, we use a compute amortization scheme that alleviates the memory bottleneck while preserving the fidelity of the full RVQ codebooks. The audio decoder is trained on only a random 1/16 subset of the audio frames, while the zeroth codebook is trained on every frame. We observe no perceivable difference in audio decoder losses during training when using this approach.
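
A small sketch of the frame-sampling idea, assuming only what the paragraph above states (the 1/16 fraction and that the zeroth codebook is trained on every frame); everything else, including the function name, is made up for illustration.

    import numpy as np

    def decoder_frame_mask(batch: int, seq_len: int, fraction: float = 1 / 16,
                           seed: int | None = None) -> np.ndarray:
        # True where the audio decoder receives loss on codebooks 1..N-1;
        # the backbone's zeroth-codebook loss is computed on every frame regardless.
        rng = np.random.default_rng(seed)
        return rng.random((batch, seq_len)) < fraction

    mask = decoder_frame_mask(batch=8, seq_len=2048, seed=0)
    print(mask.mean())  # roughly 1/16 of the frames get decoder loss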

Amortized training process. The backbone transformer models the zeroth level across all frames (highlighted in blue), while the decoder predicts the remaining N – 1 levels, but only for a random 1/16th of the frames (highlighted in green). The top section highlights the specific frames modeled by the decoder for which it receives loss.

Experiments

Dataset: We use a large dataset of publicly available audio, which we transcribe, diarize, and segment. After filtering, the dataset consists of approximately one million hours of predominantly English audio.

Model Sizes: We trained three model sizes, delineated by the backbone and decoder sizes:

  • Tiny: 1B backbone, 100M decoder
  • Small: 3B backbone, 250M decoder
  • Medium: 8B backbone, 300M decoder

Each model was trained with a 2048 sequence length (~2 minutes of audio) over five epochs.

Samples

Paralinguistics

Sentences from Base TTS

Foreign words

Sentences from Base TTS

Contextual expressivity

Samples from Expresso, continuation after chime

Pronunciation correction

The pronunciation correction sentence is a recording; all other audio is generated.

Conversations with multiple speakers

Single generation using audio prompts from two speakers

Evaluation

Our evaluation suite measures model performance across four key aspects: faithfulness to text, context utilization, prosody, and latency. We report both objective and subjective metrics—objective benchmarks include word error rate and novel tests like homograph disambiguation, while subjective evaluation relies on a Comparative Mean Opinion Score (CMOS) human study using the Expresso dataset.

Objective metrics

Traditional benchmarks, such as word error rate (WER) and speaker similarity (SIM), have become saturated—modern models, including CSM, now achieve near-human performance on these metrics.

Objective metric results for Word Error Rate (top) and Speaker Similarity (bottom) tests, showing the metrics are saturated (matching human performance).

To better assess pronunciation and contextual understanding, we introduce a new set of phonetic transcription-based benchmarks.

  • Text understanding through Homograph Disambiguation: Evaluates whether the model correctly pronounced different words with the same orthography (e.g., “lead” /lɛd/ as in “metal” vs. “lead” /liːd/ as in “to guide”).
  • Audio understanding through Pronunciation Continuation Consistency: Evaluates whether the model maintains pronunciation consistency of a specific word with multiple pronunciation variants in multi-turn speech. One example is “route” (/raʊt/ or /ruːt/), which can vary based on region of the speaker and context.
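
Scoring such benchmarks can be as simple as checking whether a phoneme transcription of the generated audio matches the pronunciation expected for the context. The sketch below is an assumption about the bookkeeping only; it presumes the phoneme transcriptions have already been produced by a recognizer such as the one named in the figure caption.

    def homograph_accuracy(cases: list[dict]) -> float:
        # cases: [{"word": "lead", "expected": "l ɛ d", "transcribed": "l ɛ d"}, ...]
        # A case counts as correct when the transcription contains the
        # pronunciation expected for that context.
        if not cases:
            return 0.0
        correct = sum(1 for c in cases if c["expected"] in c["transcribed"])
        return correct / len(cases)

    print(homograph_accuracy([
        {"word": "lead", "expected": "l ɛ d", "transcribed": "l ɛ d"},
        {"word": "lead", "expected": "l iː d", "transcribed": "l ɛ d"},
    ]))  # 0.5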

Objective metric results for Homograph Disambiguation (left) and Pronunciation Consistency (right) tests, showing the accuracy percentage for each model’s correct pronunciation. Play.ht, Elevenlabs, and OpenAI generations were made with default settings and voices from their respective API documentation.

The graph above compares objective metric results across three model sizes. For Homograph accuracy we generated 200 speech samples covering 5 distinct homographs—lead, bass, tear, wound, row—with 2 variants for each and evaluated pronunciation consistency using wav2vec2-lv-60-espeak-cv-ft. For Pronunciation Consistency we generated 200 speech samples covering 10 distinct words that have common pronunciation variants—aunt, data, envelope, mobile, route, vase, either, adult, often, caramel.

In general, we observe that performance improves with larger models, supporting our hypothesis that scaling enhances the synthesis of more realistic speech.

Subjective metrics

We conducted two Comparative Mean Opinion Score (CMOS) studies using the Expresso dataset to assess the naturalness and prosodic appropriateness of generated speech for CSM-Medium. Human evaluators were presented with pairs of audio samples—one generated by the model and the other a ground-truth human recording. Listeners rated the generated sample on a 7-point preference scale relative to the reference. Expresso’s diverse expressive TTS samples, including emotional and prosodic variations, make it a strong benchmark for evaluating appropriateness to context.

In the first CMOS study we presented the generated and human audio samples with no context and asked listeners to “choose which rendition feels more like human speech.” In the second CMOS study we also provide the previous 90 seconds of audio and text context, and ask the listeners to “choose which rendition feels like a more appropriate continuation of the conversation.” Eighty people were paid to participate in the evaluation and rated on average 15 examples each.

Subjective evaluation results on the Expresso dataset. No context: listeners chose “which rendition feels more like human speech” without knowledge of the context. Context: listeners chose “which rendition feels like a more appropriate continuation of the conversation” with audio and text context. 50:50 win–loss ratio suggests that listeners have no clear preference.

The graph above shows the win-rate of ground-truth human recordings vs CSM-generated speech samples for both studies. Without conversational context (top), human evaluators show no clear preference between generated and real speech, suggesting that naturalness is saturated. However, when context is included (bottom), evaluators consistently favor the original recordings. These findings suggest a noticeable gap remains between generated and human prosody in conversational speech generation.

Open-sourcing our work

We believe that advancing conversational AI should be a collaborative effort. To that end, we’re committed to open-sourcing key components of our research, enabling the community to experiment, build upon, and improve our approach. Our models will be available under an Apache 2.0 license.

Limitations and future work

CSM is currently trained on primarily English data; some multilingual ability emerges due to dataset contamination, but it does not perform well yet. It also does not take advantage of the information present in the weights of pre-trained language models.

In the coming months, we intend to scale up model size, increase dataset volume, and expand language support to over 20 languages. We also plan to explore ways to utilize pre-trained language models, working towards large multimodal models that have deep knowledge of both speech and text.

Ultimately, while CSM generates high quality conversational prosody, it can only model the text and speech content in a conversation—not the structure of the conversation itself. Human conversations are a complex process involving turn taking, pauses, pacing, and more. We believe the future of AI conversations lies in fully duplex models that can implicitly learn these dynamics from data. These models will require fundamental changes across the stack, from data curation to post-training methodologies, and we’re excited to push in these directions.

Join us

If you’re excited about building the most natural, delightful, and inspirational voice interfaces out there, reach out—we’re hiring. Check our open roles.

]]>
https://www.sesame.com/research/crossing_the_uncanny_valley_of_voice hacker-news-small-sites-43227881 Sun, 02 Mar 2025 06:13:01 GMT
<![CDATA[Knowing CSS is mastery to front end development]]> thread link) | @tipiirai
March 1, 2025 | https://helloanselm.com/writings/knowing-css-is-mastery-to-frontend-development | archive.org

There are countless articles about why developers should not focus on frameworks too much and should instead learn to understand the underlying languages. But I think we rarely find good reasons beyond the fact that frameworks come and go. To me, the main reason is different: you won't be a master of frontend development if you don't understand the underlying mechanisms of a language.

A usual stack today is React together with countless layers between the language and the framework itself. CSS is not written natively as the styling method but via JavaScript tools that translate it into native CSS. For JavaScript, we nowadays write an opinionated framework-language mix using TypeScript, which is itself translated to native JavaScript in the end. And while we all know the comfort of these tools and languages, many things become easier if you understand a browser's ecosystem:

  • Debug JavaScript errors more easily, even in unfamiliar environments without a debugging browser extension installed
  • Debug CSS
  • Write custom CSS (and every project I’ve seen so far needs it somewhere)
  • Understand why errors occur that you may not find locally and only in client’s browsers

Over the past years I've had various situations where TypeScript developers (as they called themselves) approached me and asked whether I could help them out with CSS. I expected to have to solve a complex problem, but for me — knowing CSS very well — it was always a simple, straightforward solution or code snippet:

  • A multi-colored footer bar should not be an image; it's a simple CSS multi-step background gradient in one line of code. No need to scale an image or create an SVG, just CSS.
  • Custom icons for an input field? Welp, it’s not that easy for privacy reasons to add a pseudo-class here in certain cases. But there are many simple solutions and no need to include another bloated npm dependency that nobody understands what it does.
  • Webfonts: Dev: We can’t add another webfont style, we already serve 4MB of webfonts.
    → Me: Alright, why don’t we serve it as Variable Font?
    → Dev: Oh, what’s this?
    → Check it out, we now load 218kb async, only one file and have all our styles we have and will ever need inside.

Nowadays people can write great React and TypeScript code. Most of the time a component library like MUI, Tailwind and others are used for styling. However, nearly no one is able to judge whether the CSS in the codebase is good or far from optimal. It is magically applied by our toolchain into the HTML and we struggle to understand why the website is getting slower and slower.

Most of the performance basics I learned ten years ago are still the most relevant ones today. Yet, most developers don’t know about them because we use create-react-web-app or similar things. Put Cloudflare on top to boost performance and reduce costs. Yes, that works for your website and little project.

What companies expect when they ask for a web dashboard serving real time data for their customers is different: It should be a robust, well working application that is easy to maintain. That means we need to combine the developer experience (React, TypeScript, all the little helpers) with the knowledge of how browsers and networks work. And only then we can boost performance, write accessible code, load dynamic data in a proper and safe way and provide fallbacks in case something goes wrong.

In emergencies, like an incident with the service, I've seen the difference often enough between people who know exactly where to look, start debugging, and work from there, and those who panic trying to figure out what's going on, hoping that a restart or a re-deployment with reinstalled dependencies will bring the service back to life.

And in the end that means: if you know CSS, you also know the styling framework. If you understand JavaScript, TypeScript is not a big problem for you. And that is what makes you a Senior or Principal.

]]>
https://helloanselm.com/writings/knowing-css-is-mastery-to-frontend-development hacker-news-small-sites-43227303 Sun, 02 Mar 2025 04:32:06 GMT
<![CDATA[Learning C# and .NET after two decades of programming]]> thread link) | @Kerrick
March 1, 2025 | https://kerrick.blog/articles/2025/why-i-am-learning-c-sharp-and-dot-net-after-two-decades-of-programming/ | archive.org


I’ve been programming for over two decades, and I can’t make a full-stack enterprise web application.


The first lines of code I wrote were in GW-BASIC. When I was in eighth grade, I enrolled in a typing class. Students who finished their typing practice before class ended were given an extra credit opportunity: copying program source code. It was a fantastic test of applied accuracy, and I gladly participated. Eventually I started to pick up on some of the patterns I saw in those BASIC programs. I came up with my own programs—mad libs and simple calculators—and fell in love. I still couldn’t make a web site.

In high school, the library had a book about HTML. I made my first web pages, and my math teacher helped me put them online. I got a job bagging groceries to pay for a laptop, and used that laptop to develop simple web sites for local businesses. These were the first times I was ever paid to write code, and I was hooked. I still couldn’t make a rich web site.

When I got to college I learned JavaScript from another book, and CSS from blog posts and documentation web sites. Before I left college I took a job with the Web Design & Support department, implementing a major redesign of the school’s entire web site in HTML and CSS, with a splash of jQuery for interactivity. I still couldn’t make a web application.

After I left college I scraped together a meager living making Chrome extensions, writing Ruby for freelance clients, and working part-time at Best Buy. I still couldn’t make an enterprise web application.

By 2013 I had my first career job as a front-end developer at an enterprise Software as a Service business. Thanks to EmberJS, an amazing product team, a top-notch architect, and leadership that understood lean software, I built the front-end of our new platform that, over the next seven years, would become so successful that I’d take on brilliant apprentices, build a team, grow to Engineering Manager, and become Director of Software Engineering. But I still couldn’t make a full-stack enterprise web application.

When that company got acquired, I laid off half of my team and lost a part of myself. I could no longer stomach working in management, so I left. I had my mid-life crisis: I moved to the country, bought a farm, went back to college online, and tried to create a startup. I realized I was drifting, and that what I wanted was a steady stream of programming work on a great team. I found exactly that, thanks to the CTO of my previous employer. I am now responsible for improving and maintaining an enterprise Angular application powered by a C# / .NET back-end. It’s a bit rough around the edges, but I tidy as I go. I’m the only purely-front-end programmer on a team of twelve. I ship features our customers love, I help the team improve our processes, and I improve the existing legacy Angular application. But I still can’t make a full-stack enterprise web application.


Last quarter, I learned that our next front-end will use Blazor, not Angular. This means it will use C#, not TypeScript. This quarter, my manager gave me the gift of time. He encouraged me to spend every hour I'm not fixing urgent bugs or implementing important features learning C#, .NET, and Blazor. The company paid for an O'Reilly Learning Platform subscription, and I've collected a list of books to study at work. I'll still spend my nights and weekends improving at my craft, but instead of learning Ruby on Rails, I'll be reading generally-applicable books: Patterns of Enterprise Application Architecture, Domain-Driven Design, Working Effectively with Legacy Code, Object-Oriented Analysis & Design with Applications, Data Modeling Essentials, and Designing Data-Intensive Applications.

I’ll blog and toot about what I learn as I go, and I hope you’ll join me. I’m learning C# and .NET, but starting from two decades of programming experience and a decade of software engineering experience. I’m learning web development, but starting from a deep knowledge of HTTP, browsers, and the front-end. I’m learning architecture and object-orientation, but starting from a background in structured and functional programming.

The only thing I love more than learning is my wife. I can’t wait for this learning journey, and I’m excited to share what I learn. Subscribe to my email list and perhaps you’ll learn something too.


]]>
https://kerrick.blog/articles/2025/why-i-am-learning-c-sharp-and-dot-net-after-two-decades-of-programming/ hacker-news-small-sites-43226462 Sun, 02 Mar 2025 02:15:18 GMT
<![CDATA[Mozilla site down due to "overdue hosting payments" [fixed]]]> thread link) | @motownphilly
March 1, 2025 | https://linuxmom.net/@vkc/114089626244932902 | archive.org

Unable to extract article]]>
https://linuxmom.net/@vkc/114089626244932902 hacker-news-small-sites-43226089 Sun, 02 Mar 2025 01:20:17 GMT
<![CDATA[I'm done with coding]]> thread link) | @neelc
March 1, 2025 | https://www.neelc.org/2025/03/01/im-done-with-coding/ | archive.org

In my high school days, I was a huge server and networking person. My homelab was basically my identity, and not even a good one: consumer-level networking gear running Tomato and a then-7-year-old homebuilt desktop PC running FreeBSD.

Then I joined NYU’s Tandon School of Engineering for Computer Science. It was a full 180 into software engineering. I didn’t just code for assignments, I started with toy projects and went to major Tor contributions writing very complex patches, had two internships and ultimately a job at Microsoft.

Primarily due to “Big Data” experience at NYU CUSP, Microsoft placed me on the Viva Insights team. I’ve always hated the product, feeling it was unnecessary surveillance. I wanted out.

In fact, my disdain for Viva Insights was big enough to make me lose my passion for coding and slip into obsessive browsing and shopping, because facing the music of working on a surveillance product would bother me even more. Open source work outside of package maintenance went to zero.

I’ve tried to discuss this with my mom, and she kept telling me how “lucky” I am for working at Microsoft saying “it’s big tech” and “you’re neurodivergent” and “you won’t survive at a smaller company.” She even bought into the marketing material telling me how it’s “not surveillance.”

I’ve decided that in the shitty job market, it’s not worth being a software engineer even if I make much less. Part of it is being “specialized” in over-glorified surveillance so even if I change employers, what’s the guarantee I won’t be working on another surveillance product. Assuming I can even get another job.

In fact, I’ll just live off dividend income and try to get my new IT startup Fourplex off the ground. Sure, I won’t be able to buy shiny homelab equipment as often as I did in the past, but I at least have the guarantee I’m not working on an unethical product.

While six figures is certainly nice, it’s only nice if it’s ethically done. I’d much rather flip burgers or bag groceries than work on surveillance for six figures. After all, Edward Snowden had a “stable” federal government job (not so stable now thanks to “DOGE”) and he gave it up to stand up for the right to privacy.

And I care more for my values than the name or salary. It’s not like I use Windows at home, I haven’t since 2012. I kept self-hosting email despite having worked at Microsoft 365 and still do even now. And I sacrificed job performance for my values of strong privacy.

Little did I know that my father (who was previously a big Big Data and AI advocate) would come out to hate Viva Insights. He says it’s “bullshit” and nobody uses it. Even when I worked at Microsoft I never used it. Not even once. It’s bloatware. Microsoft is 100% better off porting Office apps to Linux (despite me using a Mac now) or beefing up cybersecurity.

]]>
https://www.neelc.org/2025/03/01/im-done-with-coding/ hacker-news-small-sites-43225901 Sun, 02 Mar 2025 00:49:27 GMT
<![CDATA[Norwegian fuel supplier refuses U.S. warships over Ukraine]]> thread link) | @hjjkjhkj
March 1, 2025 | https://ukdefencejournal.org.uk/norwegian-fuel-supplier-refuses-u-s-warships-over-ukraine/ | archive.org

Norwegian fuel company Haltbakk Bunkers has announced it will cease supplying fuel to U.S. military forces in Norway and American ships docking in Norwegian ports, citing dissatisfaction with recent U.S. policy towards Ukraine.

In a strongly worded statement, the company criticised a televised event involving U.S. President Donald Trump and Vice President J.D. Vance, referring to it as the “biggest shitshow ever presented live on TV.”

Haltbakk Bunkers praised Ukrainian President Volodymyr Zelensky for his restraint, accusing the U.S. of “putting on a backstabbing TV show” and declaring that the spectacle “made us sick.”

As a result, the company stated: “We have decided to immediately STOP as fuel provider to American forces in Norway and their ships calling Norwegian ports. No Fuel to Americans!” Haltbakk Bunkers also urged Norwegians and Europeans to follow their lead, concluding their statement with the slogan “Slava Ukraina” in support of Ukraine.

Who is Haltbakk Bunkers?

Haltbakk Bunkers is a Norwegian fuel supplier that provides marine fuel for shipping and military operations. Based in Kristiansund, Norway, the company specialises in bunkering services for vessels operating in Norwegian waters, offering fuel logistics and distribution for both civilian and military customers.

Haltbakk Bunkers plays a significant role in Norway’s maritime industry, supplying fuel to vessels calling at Norwegian ports, including NATO and allied forces.

The decision to cut off the U.S. military could have logistical implications for American naval operations in the region. Norway is a key NATO member and frequently hosts U.S. and allied forces for joint exercises and Arctic defence operations.

This announcement raises questions about the broader European stance on U.S. policy towards Ukraine and whether other businesses or governments might take similar actions. It also highlights how private companies in Europe are responding independently to geopolitical developments.

The U.S. has not yet responded to the decision, and it remains to be seen whether this will affect fuel supply chains for American forces operating in Norway and the North Atlantic region.



]]>
https://ukdefencejournal.org.uk/norwegian-fuel-supplier-refuses-u-s-warships-over-ukraine/ hacker-news-small-sites-43223872 Sat, 01 Mar 2025 21:29:36 GMT
<![CDATA[Abusing C to implement JSON parsing with struct methods]]> thread link) | @ingve
March 1, 2025 | https://xnacly.me/posts/2025/json-parser-in-c-with-methods/ | archive.org

Idea

  1. Build a JSON parser in C
  2. Instead of using standalone functions, attach functions to a struct and use them as methods
  3. Make it free of the usual C issues (segfaults, leaks, stack overflows, etc.)
  4. Provide an ergonomic API

Usage

C

#include "json.h"
#include <stdlib.h>

int main(void) {
  struct json json = json_new(JSON({
    "object" : {},
    "array" : [[]],
    "atoms" : [ "string", 0.1, true, false, null ]
  }));
  struct json_value json_value = json.parse(&json);
  json_print_value(&json_value);
  puts("");
  json_free_value(&json_value);
  return EXIT_SUCCESS;
}

Tip - Compiling C projects the easy way

Don’t take this as a guide for using make, in my projects I just use it as a command runner.

Compiler flags

These flags may be specific to gcc; I use gcc (GCC) 14.2.1 20250207, so take this with a grain of salt.

I use these flags in almost every C project I have ever started.

SH

gcc -std=c23 \
	-O2 \
	-Wall \
	-Wextra \
	-Werror \
	-fdiagnostics-color=always \
	-fsanitize=address,undefined \
	-fno-common \
	-Winit-self \
	-Wfloat-equal \
	-Wundef \
	-Wshadow \
	-Wpointer-arith \
	-Wcast-align \
	-Wstrict-prototypes \
	-Wstrict-overflow=5 \
	-Wwrite-strings \
	-Waggregate-return \
	-Wswitch-default \
	-Wno-discarded-qualifiers \
	-Wno-aggregate-return \
    main.c
  • -std=c23: set the language standard; I use ISO C23
  • -O2: optimize more than -O1
  • -Wall: enable a list of warnings
  • -Wextra: enable more warnings than -Wall
  • -Werror: convert all warnings to errors
  • -fdiagnostics-color=always: use color in diagnostics
  • -fsanitize=address,undefined: enable AddressSanitizer and UndefinedBehaviorSanitizer
  • -fno-common: place uninitialized global variables in the BSS section
  • -Winit-self: warn about uninitialized variables
  • -Wfloat-equal: warn if floating-point values are used in equality comparisons
  • -Wundef: warn if an undefined identifier is evaluated
  • -Wshadow: warn whenever a local variable or type declaration shadows another variable, parameter, or type
  • -Wpointer-arith: warn about anything that depends on the "size of" a function type or of void
  • -Wcast-align: warn whenever a pointer is cast such that the required alignment of the target is increased
  • -Wstrict-prototypes: warn if a function is declared or defined without specifying the argument types
  • -Wstrict-overflow=5: warn about cases where the compiler optimizes based on the assumption that signed overflow does not occur
  • -Wwrite-strings: give string constants the type const char[length]; warn on copying into a non-const char*
  • -Wswitch-default: warn whenever a switch statement does not have a default case
  • -Wno-discarded-qualifiers: do not warn if type qualifiers on pointers are being discarded
  • -Wno-aggregate-return: do not warn if any functions that return structures or unions are defined or called

Sourcing source files

I generally keep my header and source files in the same directory as the makefile, so I use find to find them:

SHELL

shell find . -name "*.c"

Make and Makefiles

I don’t define the build target as .PHONY because i generally never have a build directory.

Putting it all together as a makefile:

MAKE

CFLAGS := -std=c23 \
	-O2 \
	-Wall \
	-Wextra \
	-Werror \
	-fdiagnostics-color=always \
	-fsanitize=address,undefined \
	-fno-common \
	-Winit-self \
	-Wfloat-equal \
	-Wundef \
	-Wshadow \
	-Wpointer-arith \
	-Wcast-align \
	-Wstrict-prototypes \
	-Wstrict-overflow=5 \
	-Wwrite-strings \
	-Waggregate-return \
	-Wcast-qual \
	-Wswitch-default \
	-Wno-discarded-qualifiers \
	-Wno-aggregate-return

FILES := $(shell find . -name "*.c")

build:
	$(CC) $(CFLAGS) $(FILES) -o jsoninc

Variadic macros to write inline raw JSON

This doesn’t really deserve its own section, but I use #<expression> to stringify C expressions in conjunction with __VA_ARGS__:

C

#define JSON(...) #__VA_ARGS__

To enable:

C

char *raw_json = JSON({ "array" : [ [], {}] });

Inlines to:

C

char *raw_json = "{ \"array\" : [ [], {}] }";

Representing JSON values in memory

I need a structure to hold parsed JSON values, their types, and their contents.

Types of JSON values

JSON can be either one of:

  1. null
  2. true
  3. false
  4. number
  5. string
  6. array
  7. object

In C i use an enum to represent this:

C

// json.h
enum json_type {
  json_number,
  json_string,
  json_boolean,
  json_null,
  json_object,
  json_array,
};

extern char *json_type_map[];

And i use json_type_map to map all json_type values to their char* representation:

C

char *json_type_map[] = {
    [json_number] = "json_number",   [json_string] = "json_string",
    [json_boolean] = "json_boolean", [json_null] = "json_null",
    [json_object] = "json_object",   [json_array] = "json_array",
};

json_value & unions for atoms, array elements or object values and object keys

The json_value struct holds the type (as defined above), a union sharing memory space for either a boolean, a string, or a number, a list of json_value structures for array children or object values, a list of strings that are the object keys, and the length of these lists.

C

struct json_value {
  enum json_type type;
  union {
    bool boolean;
    char *string;
    double number;
  } value;
  struct json_value *values;
  char **object_keys;
  size_t length;
};

Tearing values down

Since some of the fields in json_value are heap allocated, we have to destroy / free the structure upon either no longer using it or exiting the process. json_free_value does exactly this:

C

void json_free_value(struct json_value *json_value) {
  switch (json_value->type) {
  case json_string:
    free(json_value->value.string);
    break;
  case json_object:
    for (size_t i = 0; i < json_value->length; i++) {
      free(json_value->object_keys[i]);
      json_free_value(&json_value->values[i]);
    }
    if (json_value->object_keys != NULL) {
      free(json_value->object_keys);
      json_value->object_keys = NULL;
    }
    if (json_value->values != NULL) {
      free(json_value->values);
      json_value->values = NULL;
    }
    break;
  case json_array:
    for (size_t i = 0; i < json_value->length; i++) {
      json_free_value(&json_value->values[i]);
    }
    if (json_value->values != NULL) {
      free(json_value->values);
      json_value->values = NULL;
    }
    break;
  case json_number:
  case json_boolean:
  case json_null:
  default:
    break;
  }
  json_value->type = json_null;
}

As simple as that, we ignore stack allocated JSON value variants, such as json_number, json_boolean and json_null, while freeing allocated memory space for json_string, each json_array child and json_object keys and values.

Printing json_values

A memory representation with no way to inspect it is of no use to us, thus I dumped json_print_value into main.c:

C

void json_print_value(struct json_value *json_value) {
  switch (json_value->type) {
  case json_null:
    printf("null");
    break;
  case json_number:
    printf("%f", json_value->value.number);
    break;
  case json_string:
    printf("\"%s\"", json_value->value.string);
    break;
  case json_boolean:
    printf(json_value->value.boolean ? "true" : "false");
    break;
  case json_object:
    printf("{");
    for (size_t i = 0; i < json_value->length; i++) {
      printf("\"%s\": ", json_value->object_keys[i]);
      json_print_value(&json_value->values[i]);
      if (i < json_value->length - 1) {
        printf(", ");
      }
    }
    printf("}");
    break;
  case json_array:
    printf("[");
    for (size_t i = 0; i < json_value->length; i++) {
      json_print_value(&json_value->values[i]);
      if (i < json_value->length - 1) {
        printf(", ");
      }
    }
    printf("]");
    break;
  default:
    ASSERT(0, "Unimplemented json_value case");
    break;
  }
}

Calling this function:

C

int main(void) {
  struct json_value json_value = {
      .type = json_array,
      .length = 4,
      .values =
          (struct json_value[]){
              (struct json_value){.type = json_string, .value.string = "hi"},
              (struct json_value){.type = json_number, .value.number = 161},
              (struct json_value){
                  .type = json_object,
                  .length = 1,
                  .object_keys =
                      (char *[]){
                          "key",
                      },
                  .values =
                      (struct json_value[]){
                          (struct json_value){.type = json_string,
                                              .value.string = "value"},
                      },
              },
              (struct json_value){.type = json_null},
          },
  };
  json_print_value(&json_value);
  puts("");
  return EXIT_SUCCESS;
}

Results in:

TEXT

["hi", 161.000000, {"key": "value"}, null]

json Parser struct, Function pointers and how to use them (they suck)

As contrary as it sounds, one can attach functions to structures in C very easily: just define a field of a struct as a function pointer, assign a function to it, and you have got a method, like you would in Go or Rust.

C

struct json {
  char *input;
  size_t pos;
  size_t length;
  char (*cur)(struct json *json);
  bool (*is_eof)(struct json *json);
  void (*advance)(struct json *json);
  struct json_value (*atom)(struct json *json);
  struct json_value (*array)(struct json *json);
  struct json_value (*object)(struct json *json);
  struct json_value (*parse)(struct json *json);
};

Of course you have to define a function the C way (<return type> <name>(<list of params>);) and assign it to your method field - but it is not that complicated:

C

struct json json_new(char *input) {
  ASSERT(input != NULL, "corrupted input");
  struct json j = (struct json){
      .input = input,
      .length = strlen(input) - 1,
  };

  j.cur = cur;
  j.is_eof = is_eof;
  j.advance = advance;
  j.parse = parse;
  j.object = object;
  j.array = array;
  j.atom = atom;

  return j;
}

cur, is_eof and advance are small helper functions:

C

static char cur(struct json *json) {
  ASSERT(json != NULL, "corrupted internal state");
  return json->is_eof(json) ? -1 : json->input[json->pos];
}

static bool is_eof(struct json *json) {
  ASSERT(json != NULL, "corrupted internal state");
  return json->pos > json->length;
}

static void advance(struct json *json) {
  ASSERT(json != NULL, "corrupted internal state");
  json->pos++;
  skip_whitespace(json);
}

ASSERT is a simple assertion macro:

C

#define ASSERT(EXP, context)                                                   \
  if (!(EXP)) {                                                                \
    fprintf(stderr,                                                            \
            "jsoninc: ASSERT(" #EXP "): `" context                             \
            "` failed at %s, line %d\n",                                       \
            __FILE__, __LINE__);                                               \
    exit(EXIT_FAILURE);                                                        \
  }

Failing for instance if the argument to the json_new function is a null pointer:

C

int main(void) {
  struct json json = json_new(NULL);
  return EXIT_SUCCESS;
}

Complete with a descriptive message:

TEXT

jsoninc: ASSERT(input != NULL): `corrupted input` failed at ./json.c, line 16

Parsing JSON with methods

Now that we have the whole setup out of the way, we can start with the crux of the project: parsing JSON. Normally I would have written a separate lexer and parser, but for the sake of simplicity I combined these passes into a single parser architecture.

Ignoring Whitespace

As far as we are concerned, whitespace in JSON carries no meaning - so we just use the skip_whitespace function to ignore any and all whitespace:

C

static void skip_whitespace(struct json *json) {
  while (!json->is_eof(json) &&
         (json->cur(json) == ' ' || json->cur(json) == '\t' ||
          json->cur(json) == '\n')) {
    json->pos++;
  }
}

Parsing Atoms

Since JSON has five kinds of atoms, we need to parse them into our json_value struct using the json->atom method:

C

static struct json_value atom(struct json *json) {
    ASSERT(json != NULL, "corrupted internal state");

    skip_whitespace(json);

    char cc = json->cur(json);
    if ((cc >= '0' && cc <= '9') || cc == '.' || cc == '-') {
        return number(json);
    }

    switch (cc) {
        // ... all of the atoms ...
    default:
        printf("unknown character '%c' at pos %zu\n", json->cur(json), json->pos);
        ASSERT(false, "unknown character");
        return (struct json_value){.type = json_null};
    }
}

numbers

Info

Technically, numbers in JSON should include scientific notation and other fun stuff, but let's just remember the project's simplicity and my sanity; see json.org.

C

static struct json_value number(struct json *json) {
  ASSERT(json != NULL, "corrupted internal state");
  size_t start = json->pos;
  // i don't give a fuck about scientific notation <3
  for (char cc = json->cur(json);
       ((cc >= '0' && cc <= '9') || cc == '_' || cc == '.' || cc == '-');
       json->advance(json), cc = json->cur(json))
    ;

  char *slice = malloc(sizeof(char) * (json->pos - start + 1));
  ASSERT(slice != NULL, "failed to allocate slice for number parsing")
  memcpy(slice, json->input + start, json->pos - start);
  slice[json->pos - start] = 0;
  double number = strtod(slice, NULL);
  free(slice);

  return (struct json_value){.type = json_number, .value = {.number = number}};
}

We keep track of the start of the number, advance as far as the number is still considered a number (any of 0-9 | _ | . | -). Once we hit the end we allocate a temporary string, copy the chars containing the number from the input string and terminate the string with \0. strtod is used to convert this string to a double. Once that is done we free the slice and return the result as a json_value.

null, true and false

null, true and false are unique atoms that are easy to reason about regarding their constant size and characters, so we can just assert their characters:

C

static struct json_value atom(struct json *json) {
  ASSERT(json != NULL, "corrupted internal state");

  skip_whitespace(json);

  char cc = json->cur(json);
  if ((cc >= '0' && cc <= '9') || cc == '.' || cc == '-') {
    return number(json);
  }

  switch (cc) {
  case 'n': // null
    json->pos++;
    ASSERT(json->cur(json) == 'u', "unknown atom 'n', wanted 'null'")
    json->pos++;
    ASSERT(json->cur(json) == 'l', "unknown atom 'nu', wanted 'null'")
    json->pos++;
    ASSERT(json->cur(json) == 'l', "unknown atom 'nul', wanted 'null'")
    json->advance(json);
    return (struct json_value){.type = json_null};
  case 't': // true
    json->pos++;
    ASSERT(json->cur(json) == 'r', "unknown atom 't', wanted 'true'")
    json->pos++;
    ASSERT(json->cur(json) == 'u', "unknown atom 'tr', wanted 'true'")
    json->pos++;
    ASSERT(json->cur(json) == 'e', "unknown atom 'tru', wanted 'true'")
    json->advance(json);
    return (struct json_value){.type = json_boolean,
                               .value = {.boolean = true}};
  case 'f': // false
    json->pos++;
    ASSERT(json->cur(json) == 'a', "invalid atom 'f', wanted 'false'")
    json->pos++;
    ASSERT(json->cur(json) == 'l', "invalid atom 'fa', wanted 'false'")
    json->pos++;
    ASSERT(json->cur(json) == 's', "invalid atom 'fal', wanted 'false'")
    json->pos++;
    ASSERT(json->cur(json) == 'e', "invalid atom 'fals', wanted 'false'")
    json->advance(json);
    return (struct json_value){.type = json_boolean,
                               .value = {.boolean = false}};
  // ... strings ...
  default:
    printf("unknown character '%c' at pos %zu\n", json->cur(json), json->pos);
    ASSERT(false, "unknown character");
    return (struct json_value){.type = json_null};
  }
}

strings

Info

Again, similarly to JSON numbers, JSON strings should support escapes for quotation marks and other fun stuff, but let's again just remember the project's simplicity and my sanity; see json.org.

C

static char *string(struct json *json) {
  json->advance(json);
  size_t start = json->pos;
  for (char cc = json->cur(json);
       !json->is_eof(json) && cc != '\n' && cc != '"';
       json->advance(json), cc = json->cur(json))
    ;

  char *slice = malloc(sizeof(char) * (json->pos - start + 1));
  ASSERT(slice != NULL, "failed to allocate slice for a string")

  memcpy(slice, json->input + start, json->pos - start);
  slice[json->pos - start] = 0;

  ASSERT(json->cur(json) == '"', "unterminated string");
  json->advance(json);
  return slice;
}

Pretty easy stuff: as long as we are inside the string (before \", \n or EOF) we advance; after that we copy the contents into a new slice and return that slice (this function is especially useful for object keys - that's why it is a separate function).

Parsing Arrays

Since arrays are any number of JSON values between [ and ], separated by , - this one is not that hard to implement either:

C

struct json_value array(struct json *json) {
  ASSERT(json != NULL, "corrupted internal state");
  ASSERT(json->cur(json) == '[', "invalid array start");
  json->advance(json);

  struct json_value json_value = {.type = json_array};
  json_value.values = malloc(sizeof(struct json_value));

  while (!json->is_eof(json) && json->cur(json) != ']') {
    if (json_value.length > 0) {
      if (json->cur(json) != ',') {
        json_free_value(&json_value);
      }
      ASSERT(json->cur(json) == ',',
             "expected , as the separator between array members");
      json->advance(json);
    }
    struct json_value member = json->parse(json);
    json_value.values = realloc(json_value.values,
                                sizeof(json_value) * (json_value.length + 1));
    json_value.values[json_value.length++] = member;
  }

  ASSERT(json->cur(json) == ']', "missing array end");
  json->advance(json);
  return json_value;
}

We start with space for a single element and reallocate for every new child we find. We also check for the , between array members.

A growing array would probably be better to minimize allocations, but here we are, writing unoptimized C code - still, it works :)

Parsing Objects

C

struct json_value object(struct json *json) {
  ASSERT(json != NULL, "corrupted internal state");
  ASSERT(json->cur(json) == '{', "invalid object start");
  json->advance(json);

  struct json_value json_value = {.type = json_object};
  json_value.object_keys = malloc(sizeof(char *));
  json_value.values = malloc(sizeof(struct json_value));

  while (!json->is_eof(json) && json->cur(json) != '}') {
    if (json_value.length > 0) {
      if (json->cur(json) != ',') {
        json_free_value(&json_value);
      }
      ASSERT(json->cur(json) == ',',
             "expected , as separator between object key value pairs");
      json->advance(json);
    }
    ASSERT(json->cur(json) == '"',
           "expected a string as the object key, did not get that")
    char *key = string(json);
    ASSERT(json->cur(json) == ':', "expected object key and value separator");
    json->advance(json);

    struct json_value member = json->parse(json);
    json_value.values = realloc(json_value.values, sizeof(struct json_value) *
                                                       (json_value.length + 1));
    json_value.values[json_value.length] = member;
    json_value.object_keys = realloc(json_value.object_keys,
                                     sizeof(char **) * (json_value.length + 1));
    json_value.object_keys[json_value.length] = key;
    json_value.length++;
  }

  ASSERT(json->cur(json) == '}', "missing object end");
  json->advance(json);
  return json_value;
}

Same as arrays, only instead of a single atom per element we have a string as the key, : as the separator, and a json_value as the value. Each pair is separated with ,.

]]>
https://xnacly.me/posts/2025/json-parser-in-c-with-methods/ hacker-news-small-sites-43222344 Sat, 01 Mar 2025 18:53:20 GMT
<![CDATA[Making o1, o3, and Sonnet 3.7 hallucinate for everyone]]> thread link) | @hahahacorn
March 1, 2025 | https://bengarcia.dev/making-o1-o3-and-sonnet-3-7-hallucinate-for-everyone | archive.org

A quick-fun story.

My (ops-but-sometimes-writes-scripts-to-help-out) coworker just tapped on my shoulder and asked me to look at his code that wasn't working. It was a bit something like this:

User.includes(investments: -> { where(state: :draft) })...

This is not a feature of ActiveRecord or any libraries that I'm aware of. I asked him why he thought this was valid syntax, and he pulled up his ChatGPT history. It looked something like this:

Ask: How can I dynamically preload an association with conditions in rails? (Potentially followed up with - no custom has_many associations, no preloader object, don't filter the base query, etc.)

Sometimes, you're routed to the correct answer, which is to add the filter you want on the associated record as a standard where clause and also add a .references(:association) call to the query chain, like so:

User.includes(:investments).where(investments: { state: :draft }).references(:investments) 

However, with just a few tests, you're usually routed to that bizarre, non-existent syntax of including a lambda as a keyword argument value to the association you want it applied to. I recreated this a few times below:

o3-mini
Sonnet 3.7
Sonnet 3.5

I was confused why the syntax "felt" familiar though, until my coworker pointed out I invented it while asking a question on the Rails forum two years ago.

Exploring APIs

Funny enough, my other "idea" in that thread is the other solution most LLMs hallucinate - accessing the Preloader object directly.

This doesn't work either

I didn't realize this when posting originally, but this still requires you to loop through the posts and load the query returned by the preloader into each post's association target. I didn't include that, and LLMs seem to be confused by it too.

As far as I'm aware, that forum post is the only place that you'll find that specific syntax exploration. As my comment above denotes, it would not work anyway. Why I included it in the first place is beyond me - I'm working on making my writing more concise (which is why I carved out a section to explain that, and then this, and now this explanation of that....)

Conclusion

LLMs are really smart most of the time. But once they reach niche topics and don't have sufficient context, they begin to resemble me early in my career: open StackOverflow, Ctrl+C, Ctrl+V, Leeroy Jenkins style. I can't help but find it endearing.

]]>
https://bengarcia.dev/making-o1-o3-and-sonnet-3-7-hallucinate-for-everyone hacker-news-small-sites-43222027 Sat, 01 Mar 2025 18:24:22 GMT
<![CDATA[Self-Hosting a Firefox Sync Server]]> thread link) | @shantara
February 28, 2025 | https://blog.diego.dev/posts/firefox-sync-server/ | archive.org

After switching from Firefox to LibreWolf, I became interested in the idea of self-hosting my own Firefox Sync server. Although I had seen this was possible before, I had never really looked into it—until now. I embarked on a journey to set this up, and while it wasn’t completely smooth sailing, I eventually got it working. Here’s how it went.

Finding the Right Sync Server

Initial Search: Mozilla’s Sync Server Repo

I started by searching for “firefox sync server github” and quickly found Mozilla’s syncserver repo. This is an all-in-one package designed for self-hosting a Firefox Sync server. It bundles both the tokenserver for authentication and syncstorage for storage, which sounded like exactly what I needed.

However, there were two red flags:

  1. The repository had “failed” tags in the build history.
  2. A warning was prominently displayed stating that the repository was no longer being maintained and pointing to a new project in Rust.

Switching to Rust: syncstorage-rs

With that in mind, I followed the link to syncstorage-rs, which is a modern, Rust-based version of the original project. It seemed like the more viable option, so I decided to move forward with this one. But first, I wanted to check if there was a ready-to-go Docker image to make deployment easier. Unfortunately, there wasn’t one, but the documentation did mention running it with Docker.

This is where things started to get complicated.

Diving Into Docker: Confusion and Complexity

Documentation Woes

The Docker documentation had some strange parts. For example, it mentioned:

  • Ensuring that grpcio and protobuf versions matched the versions used by google-cloud-rust-raw. This sounded odd—shouldn’t Docker handle version dependencies automatically?
  • Another confusing part was the instruction to manually copy the contents of mozilla-rust-sdk into the top-level root directory. Again, why wasn’t this step automated in the Dockerfile?

At this point, I was feeling a bit uneasy but decided to push forward. I reviewed the repo, the Dockerfile, the Makefile, and the circleci workflows. Despite all that, I was still unsure how to proceed.

A Simpler Solution: syncstorage-rs-docker

I then stumbled upon dan-r’s syncstorage-rs-docker repo, which had a much simpler Docker setup. The description explained that the author had also encountered issues with the original documentation and decided to create a Docker container for their own infrastructure.

At this point, I felt reassured that I wasn’t alone in my confusion, and decided to give this setup a try.

Setting Up the Server: Docker Compose and MariaDB

Docker Compose Setup

I copied the following services into my docker-compose.yaml:

  firefox_mariadb:
    container_name: firefox_mariadb
    image: linuxserver/mariadb:10.6.13
    volumes:
      - /data/ffsync/dbdata:/config
    restart: unless-stopped
    environment:
      MYSQL_DATABASE: syncstorage
      MYSQL_USER: sync
      MYSQL_PASSWORD: syncpass
      MYSQL_ROOT_PASSWORD: rootpass

  firefox_syncserver:
    container_name: firefox_syncserver
    build:
      context: /root/ffsync
      dockerfile: Dockerfile
      args:
        BUILDKIT_INLINE_CACHE: 1
    restart: unless-stopped
    ports:
      - "8000:8000"
    depends_on:
      - firefox_mariadb
    environment:
      LOGLEVEL: info
      SYNC_URL: https://mydomain/sync
      SYNC_CAPACITY: 5
      SYNC_MASTER_SECRET: mastersecret
      METRICS_HASH_SECRET: metricssecret
      SYNC_SYNCSTORAGE_DATABASE_URL: mysql://sync:usersync@firefox_mariadb:3306/syncstorage_rs
      SYNC_TOKENSERVER_DATABASE_URL: mysql://sync:usersync@firefox_mariadb:3306/tokenserver_rs

A few tips:

  • Be cautious with the database passwords. Avoid using special characters like "/|%" as they can cause issues during setup.
  • I added the BUILDKIT_INLINE_CACHE argument to the Docker Compose file to make better use of caching, which reduced build time while testing.

Initializing the Database

I cloned the repository and copied the Dockerfile and initdb.sh script to my server. After making some tweaks, I ran the following steps to get the database up and running:

  1. Bring up the MariaDB container:
    docker-compose up -d firefox_mariadb
    
  2. Make the initialization script executable and run it:
    chmod +x initdb.sh
    ./initdb.sh
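
For orientation, here is a hypothetical Python sketch of the kind of work an initialization script like this typically does: create the two databases referenced in the compose file and grant the sync user access. The real initdb.sh from dan-r's repo may differ in details and credentials.

    import mysql.connector  # pip install mysql-connector-python

    # Hypothetical illustration only; names and credentials mirror the compose
    # example above, not necessarily the real script.
    conn = mysql.connector.connect(
        host="127.0.0.1",      # assumes the MariaDB port is reachable from here
        user="root",
        password="rootpass",
    )
    cur = conn.cursor()
    cur.execute("CREATE USER IF NOT EXISTS 'sync'@'%' IDENTIFIED BY 'syncpass'")
    for db in ("syncstorage_rs", "tokenserver_rs"):
        cur.execute(f"CREATE DATABASE IF NOT EXISTS {db}")
        cur.execute(f"GRANT ALL PRIVILEGES ON {db}.* TO 'sync'@'%'")
    cur.execute("FLUSH PRIVILEGES")
    conn.commit()
    conn.close()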
    

Bringing the Stack Online

Finally, I brought up the entire stack with:

    docker-compose up -d

Configuring Reverse Proxy with Caddy

Next, I needed to update my Caddy reverse proxy to point to the new Sync server. I added the following configuration:

mydomain:443 {
    reverse_proxy firefox_syncserver:8000
}

After updating Caddy with the DNS entry, I restarted the proxy and the sync server was up and running.

Challenges Faced

While I eventually got everything working, there were a few notable challenges along the way:

  1. Database persistence: I had issues with persistent data when restarting the MariaDB container. Make sure to clear out old data if needed.
  2. Server storage: My server ran out of space during the build process due to the size of the Docker images and intermediate files.
  3. Following the right steps: It took me a while to figure out the right steps, and much of the time was spent experimenting with the Docker setup.

Final Thoughts

Setting up a self-hosted Firefox Sync server is not the easiest task, especially if you’re not very familiar with Docker or database management. The official documentation is confusing, but thanks to community efforts like the syncstorage-rs-docker repo, it’s doable.

In the end, it took me about two hours to get everything running, but it was worth it. If you’re looking to control your own Firefox Sync server, this guide should help you avoid some of the pitfalls I encountered.

Happy syncing!

]]>
https://blog.diego.dev/posts/firefox-sync-server/ hacker-news-small-sites-43214294 Sat, 01 Mar 2025 01:03:48 GMT
<![CDATA[AI is killing some companies, yet others are thriving – let's look at the data]]> thread link) | @corentin88
February 28, 2025 | https://www.elenaverna.com/p/ai-is-killing-some-companies-yet | archive.org

AI is quietly upending the business models of major content sites. Platforms like WebMD, G2, and Chegg - once fueled by SEO and ad revenue - are losing traffic as AI-powered search and chatbots deliver instant answers. Users no longer need to click through pages when AI summarizes everything in seconds. Brian Balfour calls this phenomenon Product-Market Fit Collapse, a fitting term, marking it as the next big shift in tech.

Key milestones accelerating this shift:
📅 Nov 30, 2022 – ChatGPT launches
📅 Mar 14, 2023 – GPT-4 released
📅 May 14, 2024 – Google rolls out AI Overviews

❗Disclaimer: I'm simply observing traffic trends from an external perspective and don’t have insight into the exact factors driving them. The timing aligns with AI, but like any business, multiple factors are at play and each case is unique.

→ The data comes from SEMRush. If you want access to trend reports like the one below, you can try it for free.

WebMD: Where every symptom leads to cancer. They're crashing and burning and the timing aligns with major AI releases. If they don’t launch AI agents (like yesterday), they’re in trouble. That said, they still pull in ~90M visits a month.

Quora: Once the go-to platform where user-generated questions got a mix of expert insights and absolute nonsense - is struggling. And it’s no surprise. AI now delivers faster, (usually) more reliable answers. Yet, despite the challenges, Quora still pulls in just under 1 billion visits a month.

Stack Overflow: The Q&A platform for developers is now facing seemingly direct competition from ChatGPT, which can generate and debug code instantly. As AI takes over, the community is fading - but they still attract around 200M visits a month.

Chegg: A popular platform for students - now getting schooled by AI. Weirdly, they’re fighting back by suing Google over AI snippets. Not sure what they expect… Google controls the traffic and that’s the risk of relying on someone else’s distribution.

G2: A software review platform is experiencing a huge drop in traffic. This one is so rough.

CNET: A technology news and reviews website is experiencing a 70% traffic drop from 4 years ago. They still pull in 50 million visits per month - an impressive volume - but that's a steep drop from the 150 million they once had.

Just look at Reddit. Many say they are impacted, but traffic says otherwise - they are CRUSHING it. Probably because people are gravitating toward authentic content and a sense of community. I know I cannot go a day without a Reddit scroll (/r/LinkedInLunatics alone is worth visiting on the daily). And look at the y-axis: their traffic is in the billions!

And even Wikipedia is managing to stay afloat (although research AI tools will probably hit it pretty hard). Also, over 5B visits a month - consider me impressed.

And you know who else is growing? Substack. User-generated content FTW.

Edited by Melissa Halim

]]>
https://www.elenaverna.com/p/ai-is-killing-some-companies-yet hacker-news-small-sites-43206491 Fri, 28 Feb 2025 15:12:54 GMT
<![CDATA[Write to Escape Your Default Setting]]> thread link) | @kolyder
February 28, 2025 | https://kupajo.com/write-to-escape-your-default-setting/ | archive.org

For those of us with woefully average gray matter, our minds have limited reach. For the past, they are enthusiastic but incompetent archivists. In the present, they reach for the most provocative fragments of ideas, often preferring distraction over clarity.

Writing provides scaffolding. Structure for the unstructured, undisciplined mind. It’s a practical tool for thinking more effectively. And sometimes, it’s the best way to truly begin to think at all.

Let’s call your mind’s default setting ‘perpetual approximation mode.’  A business idea, a scrap of gossip, a trivial fact, a romantic interest, a shower argument to reconcile something long past. We spend more time mentally rehearsing activities than actually doing them. You can spend your entire life hopping among these shiny fragments without searching for underlying meaning until tragedy, chaos, or opportunity slaps you into awareness.

Writing forces you to tidy that mental clutter. To articulate things with a level of context and coherence the mind alone can’t achieve. Writing expands your working memory, lets you be more brilliant on paper than you can be in person.

While some of this brilliance comes from enabling us to connect larger and larger ideas, much of it comes from stopping, uh, non-brilliance. Writing reveals what you don’t know, what you can’t see when an idea is only held in your head. Biases, blind spots, and assumptions you can’t grasp internally.

At its best, writing (and reading) can reveal the ugly, uncomfortable, or unrealistic parts of your ideas. It can pluck out parasitic ideas burrowed so deeply that they imperceptibly steer your feelings and beliefs. Sometimes this uprooting will reveal that the lustrous potential of a new idea is a mirage, or that your understanding of someone’s motives was incomplete, maybe projected.

If you’re repeatedly drawn to a thought, feeling, or belief, write it out. Be fast, be sloppy. Just as children ask why, why, why, you can repeat the question “why do I think/feel/believe this?” a few times. What plops onto the paper may surprise you. So too will the headspace that clears from pouring out the canned spaghetti of unconnected thoughts.

“Writing about yourself seems to be a lot like sticking a branch into clear river-water and roiling up the muddy bottom.”

~Stephen King, Different Seasons (Book)

“I write entirely to find out what I’m thinking, what I’m looking at, what I see and what it means. What I want and what I fear.”

~Joan Didion, Why I Write (Article)

]]>
https://kupajo.com/write-to-escape-your-default-setting/ hacker-news-small-sites-43206174 Fri, 28 Feb 2025 14:45:36 GMT
<![CDATA[Netboot Windows 11 with iSCSI and iPXE]]> thread link) | @terinjokes
February 28, 2025 | https://terinstock.com/post/2025/02/Netboot-Windows-11-with-iSCSI-and-iPXE/ | archive.org

A fictious screenshot of a permanent ban from a game, in the Windows 95 installer style, with a 90s-era PC and a joystick in the left banner. The text is titled "Permanent Suspension" and reads "Your account has been permanently suspended due to the use of unauthorized Operating Systems or unauthorized virtual machines. This type of behavior causes damage to our community and the game's competitive integrity. This action will not be reversed."

Purposefully ambiguous and fictious permanent ban.

(created with @foone’s The Death Generator)

My primary operating system is Linux: I have it installed on my laptop and desktop. Thanks to the amazing work of the WINE, CodeWeavers, and Valve developers, it’s also where I do PC gaming. I can spin up Windows in a virtual machine for the rare times I need to use it, and even pass through a GPU if I want to do gaming.

There is one pretty big exception: playing the AAA game ████████████████ with friends. Unfortunately, the developer only allows Windows. If you attempt to run the game on Linux or they detect you’re running in a virtual machine, your device and account are permanently banned. I would prefer not to be permanently banned.

For the past several years my desktop has also had a disk dedicated to maintaining a Windows install. I’d prefer to use the space in my PC case[1] for disks for Linux. Since I already run a home NAS, and my Windows usage is infrequent, I wondered if I could offload the Windows install to my NAS instead. This led me down the path of netbooting Windows 11 and writing up these notes on how to do a simplified “modern” version.

My first task was determining how to get a computer to boot from a NAS. My experience with network block devices is with Ceph RBD, where a device is mounted into an already running operating system. For booting over an Ethernet IP network the standard is iSCSI. A great way to boot from an iSCSI disk is with iPXE. To avoid any mistakes during this process, I removed all local drives from the system.[2]

I didn’t want to run a TFTP server on my home network, or reconfigure DHCP to provide TFTP configuration. Even if I did, the firmware for my motherboard is designed for “gamers”: there’s no PXE ROM. I can enable UEFI networking and a network boot option appears in the boot menu, but no DHCP requests are made.[3] Fortunately, iPXE is available as a bootable USB image, which loaded and started trying to fetch configuration from the network.

Hitting ctrl-b as directed on screen to drop to the iPXE shell, I could verify basic functionality was working.

iPXE 1.21.1+ (e7585fe) -- Open Source Network Boot Firmware -- https://ipxe.org
Features: DNS FTP HTTP HTTPS iSCSI NFS TFTP VLAN SRP AoE EFI Menu
iPXE> dhcp
Configuring (net0 04:20:69:91:C8:DD)...... ok
iPXE> show ${net0/ip}
192.0.2.3

I decided to use tgt as the iSCSI target daemon on my NAS[4] as the configuration seemed the least complicated. In /etc/tgt/targets.conf I configured it with two targets: one as the block device I wanted to install Windows onto and the other being the installation ISO.

<target iqn.2025-02.com.example:win-gaming>
    backing-store /dev/zvol/zroot/sans/win-gaming
    params thin-provisioning=1
</target>

<target iqn.2025-02.com.example:win11.iso>
    backing-store /opt/isos/Win11_24H2_English_x64.iso
    device-type cd
    readonly 1
</target>

Back on the PC, I could tell iPXE to use these iSCSI disks, then boot onto the DVD. As multiple network drives are being added, each must be given a different drive ID starting from 0x80.

iPXE> sanhook --drive 0x80 iscsi:nas.example.com:::1:iqn.2025-02.com.example:win-gaming
Registered SAN device 0x80
iPXE> sanhook --drive 0x81 iscsi:nas.example.com:::1:iqn.2025-02.com.example:win11.iso
Registered SAN device 0x81
iPXE> sanboot --drive 0x81
Booting from SAN device 0x81

After a minute of the Windows 11 logo and a spinner, the Windows 11 setup appears. In an ideal situation, I could immediately start installing. Unfortunately, the Windows 11 DVD does not ship drivers for my network card, and the iSCSI connection information passed to the booted system from iPXE couldn’t be used. I’m a bit impressed the GUI loaded at all, instead of just crashing.

To rectify this, I would need to build a Windows PE image that included my networking drivers. WinPE is the minimal environment used when installing Windows. Fortunately, Microsoft has made this pretty easy nowadays. I downloaded and installed the Windows Assessment and Deployment Kit and the Windows PE add-on. After running “Deployment and Imaging Tools Environment” as an administrator, I could make a folder containing a base WinPE image.

> mkdir C:\winpe
> copype amd64 C:\winpe\amd64

After mounting the image, I was able to slipstream the Intel drivers. I searched through the inf files to find the folder that supported my network card.

> imagex /mountrw C:\winpe\amd64\media\sources\boot.wim C:\winpe\amd64\mount
> dism /image:C:\winpe\amd64\mount /add-driver /driver:C:\temp\intel\PRO1000\Winx64\W11\
> imagex /unmount /commit C:\winpe\amd64\mount

This new image is what we need to boot into to install Windows. As my NAS is also running an HTTP server, I copied over the files relevant to netbooting: from “C:‍\winpe\amd64\media” I copied “boot/BCD”, “boot/boot.sdi”, and “sources/boot.wim”, preserving the folders. I also downloaded wimboot to the same directory.

You can use iPXE to execute a script fetched with HTTP, which I took advantage of to reduce the amount of typing I’d need to do at the shell. I saved the following script as “install.ipxe” in the same HTTP directory.

#!ipxe

sanhook --drive 0x80 iscsi:nas.example.com:::1:iqn.2025-02.com.example:win-gaming
sanhook --drive 0x81 iscsi:nas.example.com:::1:iqn.2025-02.com.example:win11.iso
kernel wimboot
initrd boot/BCD BCD
initrd boot/boot.sdi boot.sdi
initrd sources/boot.wim boot.wim
boot

Rebooting back to the iPXE prompt I could then boot using this script.

iPXE> dhcp
iPXE> chain http://nas.example.com/ipxe/install.ipxe

After a few seconds I was booted into WinPE with a Command Prompt. The command “wpeinit” ran automatically, configuring the network card and mounting the iSCSI disks. I found the DVD had been mounted as drive “D”, and could start the Windows Setup with “D:‍\setup.exe”.

However, after reaching the “Searching for Disks” screen the installer closed itself without any error. This seems to be a bug with the new version of setup, as restarting it and selecting the “Previous Version of Setup” on an earlier page used a version of the installer that worked.

The installation was spread across several restarts. Fortunately, once the installation files are copied over, nothing but the main disk image is required, reducing what I needed to type in the iPXE shell. The HTTP server could also be cleaned up at this point.

iPXE> dhcp
iPXE> sanboot iscsi:nas.example.com:::1:iqn.2025-02.com.example:win-gaming

After several more minutes, and a forced installation of a Windows zero-day patch, I was greeted by a Windows 11 desktop, booted over iSCSI. Task Manager even reports the C drive as being “SSD (iSCSI)”.

Booting from a USB stick and typing into an iPXE prompt every time I want to boot into Windows isn’t a great user experience. Fortunately, iPXE is also available as an EFI application which can be installed to the local EFI System Partition. I also discovered that iPXE will execute commands provided on the command line.

I reinstalled the disks used for Linux, copied over ipxe.efi to the EFI System Partition, and added a new entry to systemd-boot by creating “$ESP/loader/entries/win11.conf”

title Windows 11 (iPXE)
efi /ipxe/ipxe.efi
options prompt && dhcp && sanboot iscsi:nas.example.com:::1:iqn.2025-02.com.example:win-gaming

There seems to be a bug where the first word in the options field is ignored.[5] I used a valid iPXE command (prompt) as that first word, which also provides a clear signal should it ever start being interpreted in a future version.

After a little bit of extra setup (installing Firefox and switching to dark mode), I was able to install Steam and the game. The game took a little bit longer to install due to the slower disk speed over my network (time to upgrade to 10GbE?), but there was no noticeable delay during normal gameplay. I didn’t see any network saturation or high disk latencies in Task Manager during loading.

]]>
https://terinstock.com/post/2025/02/Netboot-Windows-11-with-iSCSI-and-iPXE/ hacker-news-small-sites-43204604 Fri, 28 Feb 2025 11:47:52 GMT
<![CDATA[Turning my ESP32 into a DNS sinkhole to fight doomscrolling]]> thread link) | @venusgirdle
February 28, 2025 | https://amanvir.com/blog/turning-my-esp32-into-a-dns-sinkhole | archive.org

Unable to extract article]]>
https://amanvir.com/blog/turning-my-esp32-into-a-dns-sinkhole hacker-news-small-sites-43204091 Fri, 28 Feb 2025 10:39:01 GMT
<![CDATA[Video encoding requires using your eyes]]> thread link) | @zdw
February 27, 2025 | https://redvice.org/2025/encoding-requires-eyes/ | archive.org

In multimedia, the quality engineers are optimizing for is perceptual. Eyes, ears, and the brain processing their signals are enormously complex, and there’s no way to replicate everything computationally. There are no “objective” metrics to be had, just various proxies with difficult tradeoffs. Modifying video is particularly thorny, since like I’ve mentioned before on this blog there are various ways to subtly bias perception that are nonetheless undesirable, and are impossible to correct for.

This means there’s no substitute for actually looking at the results. If you are a video engineer, you must look at sample output and ask yourself if you like what you see. You should do this regularly, but especially if you’re considering changing anything, and even more so if ML is anywhere in your pipeline. You cannot simply point at metrics and say “LGTM”! In this particular domain, if the metrics and skilled human judgement are in conflict, the metrics are usually wrong.

Netflix wrote a post on their engineering blog about a “deep downscaler” for video, and unfortunately it’s rife with issues. I originally saw the post due to someone citing it, and was incredibly disappointed when I clicked through and read it. Hopefully this post offers a counter to that!

I’ll walk through the details below, but they’re ultimately all irrelevant; the single image comparison Netflix posted looks like this (please ‘right-click -> open image in new tab’ so you can see the full image and avoid any browser resampling):

Downscaler comparison

Note the ringing, bizarre color shift, and seemingly fake “detail”. If the above image is their best example, this should not have shipped – the results look awful, regardless of the metrics. The blog post not acknowledging this is embarrassing, and it makes me wonder how many engineers read this and decided not to say anything.

The Post

Okay, going through this section by section:

How can neural networks fit into Netflix video encoding?

There are, roughly speaking, two steps to encode a video in our pipeline:

1. Video preprocessing, which encompasses any transformation applied to the high-quality source video prior to encoding. Video downscaling is the most pertinent example herein, which tailors our encoding to screen resolutions of different devices and optimizes picture quality under varying network conditions. With video downscaling, multiple resolutions of a source video are produced. For example, a 4K source video will be downscaled to 1080p, 720p, 540p and so on. This is typically done by a conventional resampling filter, like Lanczos.

Ignoring the awful writing[1], it’s curious that they don’t clarify what Netflix was using previously. Is Lanczos an example, or the current best option[2]? This matters because one would hope they establish a baseline to later compare the results against, and that baseline should be the best reasonable existing option.

2. Video encoding using a conventional video codec, like AV1. Encoding drastically reduces the amount of video data that needs to be streamed to your device, by leveraging spatial and temporal redundancies that exist in a video.

I once again wonder why they mention AV1, since in this case I know it’s not what the majority of Netflix’s catalog is delivered as; they definitely care about hardware decoder support. Also, this distinction between preprocessing and encoding isn’t nearly as clean as this last sentence implies, since these codecs are lossy, and in a way that is aware of the realities of perceptual quality.

We identified that we can leverage neural networks (NN) to improve Netflix video quality, by replacing conventional video downscaling with a neural network-based one. This approach, which we dub “deep downscaler,” has a few key advantages:

I’m sure that since they’re calling it a deep downscaler, it’s actually going to use deep learning, right?

1. A learned approach for downscaling can improve video quality and be tailored to Netflix content.

Putting aside my dislike of the phrase “a learned approach” here, I’m very skeptical of “tailored to Netflix content” claim. Netflix’s catalog is pretty broad, and video encoding has seen numerous attempts at content-based specialization that turned out to be worse than focusing on improving things generically and adding tuning knobs. The encoder that arguably most punched above its weight class, x264, was mostly developed on Touhou footage.

2. It can be integrated as a drop-in solution, i.e., we do not need any other changes on the Netflix encoding side or the client device side. Millions of devices that support Netflix streaming automatically benefit from this solution.

Take note of this for later: Netflix has many different clients and this assumes no changes to them.

3. A distinct, NN-based, video processing block can evolve independently, be used beyond video downscaling and be combined with different codecs.

Doubt

Of course, we believe in the transformative potential of NN throughout video applications, beyond video downscaling. While conventional video codecs remain prevalent, NN-based video encoding tools are flourishing and closing the performance gap in terms of compression efficiency. The deep downscaler is our pragmatic approach to improving video quality with neural networks.

“Closing the performance gap” is a rather optimistic framing of that, but I’ll save this for another post.

Our approach to NN-based video downscaling

The deep downscaler is a neural network architecture designed to improve the end-to-end video quality by learning a higher-quality video downscaler. It consists of two building blocks, a preprocessing block and a resizing block. The preprocessing block aims to prefilter the video signal prior to the subsequent resizing operation. The resizing block yields the lower-resolution video signal that serves as input to an encoder. We employed an adaptive network design that is applicable to the wide variety of resolutions we use for encoding.

[Figure: deep downscaler architecture, showing the preprocessing block followed by the resizing block]

I’m not sure exactly what they mean by the adaptive network design here. A friend has suggested that maybe this just means fixed weights on the preprocessing block? I am, however, extremely skeptical of their claim that the results will generate to a wide variety of resolutions. Avoiding overfitting here would be fairly challenging, and there’s nothing in the post that inspires confidence they managed to overcome those difficulties. They hand-wave this away, but it seems critical to the entire project.

During training, our goal is to generate the best downsampled representation such that, after upscaling, the mean squared error is minimized. Since we cannot directly optimize for a conventional video codec, which is non-differentiable, we exclude the effect of lossy compression in the loop. We focus on a robust downscaler that is trained given a conventional upscaler, like bicubic. Our training approach is intuitive and results in a downscaler that is not tied to a specific encoder or encoding implementation. Nevertheless, it requires a thorough evaluation to demonstrate its potential for broad use for Netflix encoding.

Finally some details! I was curious how they’d solve the lack of a reference when training a downscaling model, and this sort of explains it; they optimized for PSNR when upscaled back to the original resolution, post-downscaling. My immediate thoughts upon reading this:

  1. Hrm, PSNR isn’t great[3].
  2. Which bicubic are we actually talking about? This is not filling me with confidence that the author knows much about video.
  3. So this is like an autoencoder, but with the decoder replaced with bicubic upscaling?
  4. Doesn’t that mean the second your TV decides to upscale with bilinear this all falls apart?
  5. Does Netflix actually reliably control the upscaling method on client devices[4]? They went out of their way to specify earlier that the project assumed no changes to the clients, after all!
  6. I wouldn’t call this intuitive. To be honest, it sounds kind of dumb and brittle.
  7. Not tying this to a particular encoder is sensible, but their differentiability reason makes no sense.

The weirdest part here is the problem formulated in this way actually has a closed-form solution, and I bet it’s a lot faster to run than a neural net! ML is potentially good in more ambiguous scenarios, but here you’ve simplified things to the point that you can just do some math and write some code instead[5]!
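
To make that point concrete, here is a toy sketch of the closed-form view (an illustration added here, not anything from Netflix's post): if the fixed upscaler is linear and the objective is MSE after upscaling, the optimal linear downscaler is just the pseudo-inverse of the upscaling operator. A 1-D linear-interpolation matrix stands in for bicubic below.

    import numpy as np

    # Toy 1-D illustration: with a fixed *linear* upscaler U and an MSE
    # objective ||U D x - x||^2, the best linear downscaler D is simply the
    # Moore-Penrose pseudo-inverse of U -- no training loop required.
    n_hi, n_lo = 16, 8  # toy high-res and low-res sample counts

    def linear_upscale_matrix(n_lo, n_hi):
        """Matrix form of linear-interpolation upscaling (stand-in for bicubic)."""
        U = np.zeros((n_hi, n_lo))
        for i in range(n_hi):
            t = i * (n_lo - 1) / (n_hi - 1)  # position in low-res coordinates
            j = int(np.floor(t))
            frac = t - j
            U[i, j] += 1 - frac
            if j + 1 < n_lo:
                U[i, j + 1] += frac
        return U

    U = linear_upscale_matrix(n_lo, n_hi)
    D_opt = np.linalg.pinv(U)              # the closed-form "downscaler"

    x = np.random.rand(n_hi)               # a high-res signal
    x_hat = U @ (D_opt @ x)                # downscale, then upscale back
    print("round-trip MSE:", np.mean((x_hat - x) ** 2))

Real bicubic kernels and 2-D images just make U larger and sparser; the same least-squares argument carries through.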

Improving Netflix video quality with neural networks

The goal of the deep downscaler is to improve the end-to-end video quality for the Netflix member. Through our experimentation, involving objective measurements and subjective visual tests, we found that the deep downscaler improves quality across various conventional video codecs and encoding configurations.

Judging from the example at the start, the subjective visual tests were conducted by the dumb and blind.

For example, for VP9 encoding and assuming a bicubic upscaler, we measured an average VMAF Bjøntegaard-Delta (BD) rate gain of ~5.4% over the traditional Lanczos downscaling. We have also measured a ~4.4% BD rate gain for VMAF-NEG. We showcase an example result from one of our Netflix titles below. The deep downscaler (red points) delivered higher VMAF at similar bitrate or yielded comparable VMAF scores at a lower bitrate.

Again, what’s the actual upscaling filter being used? And while I’m glad the VMAF is good, the result looks terrible! This means the VMAF is wrong. But also, the whole reason they’re following up with VMAF is because PSNR is not great and everyone knows it; it’s just convenient to calculate. Finally, how does VP9 come into play here? I’m assuming they’re encoding the downscaled video before upscaling, but the details matter a lot.

Besides objective measurements, we also conducted human subject studies to validate the visual improvements of the deep downscaler. In our preference-based visual tests, we found that the deep downscaler was preferred by ~77% of test subjects, across a wide range of encoding recipes and upscaling algorithms. Subjects reported a better detail preservation and sharper visual look. A visual example is shown below. [note: example is the one from above]

And wow, coincidentally, fake detail and oversharpening are common destructive behaviors from ML-based filtering that unsophisticated users will “prefer” despite making the video worse. If this is the bar, just run Warpsharp on everything and call it a day[6]; I’m confident you’ll get a majority of people to say it looks better.

This example also doesn’t mention what resolution the video was downscaled to, so it’s not clear if this is even representative of actual use-cases. Once again, there are no real details about how the tests with conducted, so I have no way to judge whether the experiment structure made sense.

We also performed A/B testing to understand the overall streaming impact of the deep downscaler, and detect any device playback issues. Our A/B tests showed QoE improvements without any adverse streaming impact. This shows the benefit of deploying the deep downscaler for all devices streaming Netflix, without playback risks or quality degradation for our members.

Translating out the jargon, this means they didn’t have a large negative effect on compressability. This is unsurprising.

How do we apply neural networks at scale efficiently?

Given our scale, applying neural networks can lead to a significant increase in encoding costs. In order to have a viable solution, we took several steps to improve efficiency.

Yes, which is why the closed-form solution almost certainly is faster.

The neural network architecture was designed to be computationally efficient and also avoid any negative visual quality impact. For example, we found that just a few neural network layers were sufficient for our needs. To reduce the input channels even further, we only apply NN-based scaling on luma and scale chroma with a standard Lanczos filter.

OK cool, so it’s not actually deep. Why should words have meaning, after all? Only needing a couple layers is not too shocking when, again, there’s a closed-form solution available.

Also, while applying this to only the luma is potentially a nice idea, if it’s shifting the brightness around you can get very weird results. I imagine this is what causes the ‘fake detail’ in the example above.

We implemented the deep downscaler as an FFmpeg-based filter that runs together with other video transformations, like pixel format conversions. Our filter can run on both CPU and GPU. On a CPU, we leveraged oneDnn to further reduce latency.

OK sure, everything there runs on FFmpeg so why not this too.

Integrating neural networks into our next-generation encoding platform

The Encoding Technologies and Media Cloud Engineering teams at Netflix have jointly innovated to bring Cosmos, our next-generation encoding platform, to life. Our deep downscaler effort was an excellent opportunity to showcase how Cosmos can drive future media innovation at Netflix. The following diagram shows a top-down view of how the deep downscaler was integrated within a Cosmos encoding microservice.

[Figure: integration of the deep downscaler within a Cosmos encoding microservice]

Buzzword buzzword buzzword buzzword buzzword. I especially hate “encoding stratum function”.

A Cosmos encoding microservice can serve multiple encoding workflows. For example, a service can be called to perform complexity analysis for a high-quality input video, or generate encodes meant for the actual Netflix streaming. Within a service, a Stratum function is a serverless layer dedicated to running stateless and computationally-intensive functions. Within a Stratum function invocation, our deep downscaler is applied prior to encoding. Fueled by Cosmos, we can leverage the underlying Titus infrastructure and run the deep downscaler on all our multi-CPU/GPU environments at scale.

Why is this entire section here? This should all have been deleted. Also, once again, buzzword buzzword buzzword buzzword buzzword.

What lies ahead

The deep downscaler paves the path for more NN applications for video encoding at Netflix. But our journey is not finished yet and we strive to improve and innovate. For example, we are studying a few other use cases, such as video denoising. We are also looking at more efficient solutions to applying neural networks at scale. We are interested in how NN-based tools can shine as part of next-generation codecs. At the end of the day, we are passionate about using new technologies to improve Netflix video quality. For your eyes only!

I’m not sure a downscaler that takes a problem with a closed-form solution and produces terrible results paves the way for much of anything except more buzzword spam. I look forward to seeing what they will come up with for denoising!


Thanks to Roger Clark and Will Overman for reading a draft of this post. Errors are of course my own.

]]>
https://redvice.org/2025/encoding-requires-eyes/ hacker-news-small-sites-43201720 Fri, 28 Feb 2025 04:33:26 GMT
<![CDATA[macOS Tips and Tricks (2022)]]> thread link) | @pavel_lishin
February 27, 2025 | https://saurabhs.org/macos-tips | archive.org

Unable to extract article]]>
https://saurabhs.org/macos-tips hacker-news-small-sites-43201417 Fri, 28 Feb 2025 03:34:14 GMT
<![CDATA[Putting Andrew Ng's OCR models to the test]]> thread link) | @ritvikpandey21
February 27, 2025 | https://www.runpulse.com/blog/putting-andrew-ngs-ocr-models-to-the-test | archive.org

February 27, 2025

3 min read

Putting Andrew Ng’s OCR Models to The Test

Today, Andrew Ng, one of the legends of the AI world, released a new document extraction service that went viral on X (link here). At Pulse, we put the models to the test with complex financial statements and nested tables – the results were underwhelming to say the least, and suffer from many of the same issues we see when simply dumping documents into GPT or Claude.

Our engineering team, along with many X users, discovered alarming issues when testing complex financial statements:

  • Over 50% hallucinated values in complex financial tables
  • Missing negative signs and currency markers
  • Completely fabricated numbers in several instances
  • 30+ second processing times per document

Ground Truth

Andrew Ng OCR Output

Pulse Output

When financial decisions worth millions depend on accurate extraction, these errors aren't just inconvenient – they're potentially catastrophic.

Let’s run through some quick math: in a typical enterprise scenario with 1,000 pages containing 200 elements per page (usually repeated over tens of thousands of documents), even 99% accuracy still means 2,000 incorrect entries. That's 2,000 potential failure points that can completely compromise a data pipeline. Our customers have consistently told us they need over 99.9% accuracy for mission-critical operations. With probabilistic LLM models, each extraction introduces a new chance for error, and these probabilities compound across thousands of documents, making the failure rate unacceptably high for real-world applications where precision is non-negotiable.
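
For a back-of-the-envelope check of that math, here is a short Python sketch (an illustration, assuming independent per-field errors, which is a simplification):

    pages = 1_000
    fields_per_page = 200
    total_fields = pages * fields_per_page          # 200,000 extracted values

    for accuracy in (0.99, 0.999, 0.9999):
        expected_errors = total_fields * (1 - accuracy)
        p_all_correct = accuracy ** total_fields    # chance the whole batch is clean
        print(f"{accuracy:.2%} per-field accuracy -> "
              f"~{expected_errors:,.0f} bad fields, "
              f"P(zero errors) = {p_all_correct:.3g}")

At 99% per-field accuracy this reproduces the ~2,000 bad entries above, and the probability of an entirely clean batch is effectively zero.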

As we've detailed in our previous viral blog post, using LLMs alone for document extraction creates fundamental problems. Their nondeterministic nature means you'll get different results on each run. Their low spatial awareness makes them unsuitable for complex layouts in PDFs and slides. And their processing speed presents serious bottlenecks for large-scale document processing.

At Pulse, we've taken a different approach that delivers:

  • Accurate extraction with probability of errors slowly approaching 0
  • Complete table, chart and graph data preservation
  • Low-latency processing time per document

Our solution combines proprietary table transformer models built from the ground up with traditional computer vision algorithms. We use LLMs only for specific, controlled tasks where they excel – not as the entire extraction pipeline. 

If your organization processes financial, legal, or healthcare documents at scale and needs complete reliability (or really any industry where accuracy is non-negotiable), we'd love to show you how Pulse can transform your workflow.

Book a demo here to see the difference for yourself.

]]>
https://www.runpulse.com/blog/putting-andrew-ngs-ocr-models-to-the-test hacker-news-small-sites-43201001 Fri, 28 Feb 2025 02:24:04 GMT
<![CDATA[Crossing the uncanny valley of conversational voice]]> thread link) | @nelwr
February 27, 2025 | https://www.sesame.com/research/crossing_the_uncanny_valley_of_voice#demo | archive.org

February 27, 2025

Brendan Iribe, Ankit Kumar, and the Sesame team

How do we know when someone truly understands us? It is rarely just our words—it is in the subtleties of voice: the rising excitement, the thoughtful pause, the warm reassurance.

Voice is our most intimate medium as humans, carrying layers of meaning through countless variations in tone, pitch, rhythm, and emotion.

Today’s digital voice assistants lack essential qualities to make them truly useful. Without unlocking the full power of voice, they cannot hope to effectively collaborate with us. A personal assistant who speaks only in a neutral tone has difficulty finding a permanent place in our daily lives after the initial novelty wears off.

Over time this emotional flatness becomes more than just disappointing—it becomes exhausting.

Achieving voice presence

At Sesame, our goal is to achieve “voice presence”—the magical quality that makes spoken interactions feel real, understood, and valued. We are creating conversational partners that do not just process requests; they engage in genuine dialogue that builds confidence and trust over time. In doing so, we hope to realize the untapped potential of voice as the ultimate interface for instruction and understanding.

Key components

  • Emotional intelligence: reading and responding to emotional contexts.
  • Conversational dynamics: natural timing, pauses, interruptions and emphasis.
  • Contextual awareness: adjusting tone and style to match the situation.
  • Consistent personality: maintaining a coherent, reliable and appropriate presence.

We’re not there yet

Building a digital companion with voice presence is not easy, but we are making steady progress on multiple fronts, including personality, memory, expressivity and appropriateness. This demo is a showcase of some of our work in conversational speech generation. The companions shown here have been optimized for friendliness and expressivity to illustrate the potential of our approach.

Conversational voice demo

1. Microphone permission is required. 2. Calls are recorded for quality review but not used for ML training and are deleted within 30 days. 3. By using this demo, you are agreeing to our Terms of Use and Privacy Policy. 4. We recommend using Chrome (Audio quality may be degraded in iOS/Safari 17.5).

Technical post

Authors

Johan Schalkwyk, Ankit Kumar, Dan Lyth, Sefik Emre Eskimez, Zack Hodari, Cinjon Resnick, Ramon Sanabria, Raven Jiang

To create AI companions that feel genuinely interactive, speech generation must go beyond producing high-quality audio—it must understand and adapt to context in real time. Traditional text-to-speech (TTS) models generate spoken output directly from text but lack the contextual awareness needed for natural conversations. Even though recent models produce highly human-like speech, they struggle with the one-to-many problem: there are countless valid ways to speak a sentence, but only some fit a given setting. Without additional context—including tone, rhythm, and history of the conversation—models lack the information to choose the best option. Capturing these nuances requires reasoning across multiple aspects of language and prosody.

To address this, we introduce the Conversational Speech Model (CSM), which frames the problem as an end-to-end multimodal learning task using transformers. It leverages the history of the conversation to produce more natural and coherent speech. There are two key takeaways from our work. The first is that CSM operates as a single-stage model, thereby improving efficiency and expressivity. The second is our evaluation suite, which is necessary for evaluating progress on contextual capabilities and addresses the fact that common public evaluations are saturated.

Background

One approach to modeling audio with transformers is to convert continuous waveforms into discrete audio token sequences using tokenizers. Most contemporary approaches ([1], [2]) rely on two types of audio tokens:

  1. Semantic tokens: Compact speaker-invariant representations of semantic and phonetic features. Their compressed nature enables them to capture key speech characteristics at the cost of high-fidelity representation.
  2. Acoustic tokens: Encodings of fine-grained acoustic details that enable high-fidelity audio reconstruction. These tokens are often generated using Residual Vector Quantization (RVQ) [2]. In contrast to semantic tokens, acoustic tokens retain natural speech characteristics like speaker-specific identity and timbre.

A common strategy first models semantic tokens and then generates audio using RVQ or diffusion-based methods. Decoupling these steps allows for a more structured approach to speech synthesis—the semantic tokens provide a compact, speaker-invariant representation that captures high-level linguistic and prosodic information, while the second stage reconstructs the fine-grained acoustic details needed for high-fidelity speech. However, this approach has a critical limitation: semantic tokens are a bottleneck that must fully capture prosody, but ensuring this during training is challenging.

RVQ-based methods introduce their own set of challenges. Models must account for the sequential dependency between codebooks in a frame. One method, the delay pattern (figure below) [3], shifts higher codebooks progressively to condition predictions on lower codebooks within the same frame. A key limitation of this approach is that the time-to-first-audio scales poorly because an RVQ tokenizer with N codebooks requires N backbone steps before decoding the first audio chunk. While suitable for offline applications like audiobooks, this delay is problematic in a real-time scenario.

Example of delayed pattern generation in an RVQ tokenizer with 4 codebooks
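
For illustration, a toy sketch of the delay pattern (just the indexing idea, not the Mimi/CSM implementation): shifting codebook k right by k steps means the first complete frame is only available after N − 1 extra backbone steps.

    import numpy as np

    PAD = -1  # positions that have no token yet

    def apply_delay_pattern(tokens):
        """tokens: (n_codebooks, n_frames) -> delayed (n_codebooks, n_frames + n_codebooks - 1)."""
        n_cb, n_frames = tokens.shape
        out = np.full((n_cb, n_frames + n_cb - 1), PAD, dtype=tokens.dtype)
        for k in range(n_cb):
            out[k, k:k + n_frames] = tokens[k]  # codebook k is delayed by k steps
        return out

    tokens = np.arange(4 * 5).reshape(4, 5)     # 4 codebooks, 5 frames of toy IDs
    print(apply_delay_pattern(tokens))
    # Column 3 is the first column with all 4 codebooks filled in, which is
    # why time-to-first-audio grows with the number of codebooks.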

Conversational Speech Model

CSM is a multimodal, text and speech model that operates directly on RVQ tokens. Inspired by the RQ-Transformer [4], we use two autoregressive transformers. Different from the approach in [5], we split the transformers at the zeroth codebook. The first multimodal backbone processes interleaved text and audio to model the zeroth codebook. The second audio decoder uses a distinct linear head for each codebook and models the remaining N – 1 codebooks to reconstruct speech from the backbone’s representations. The decoder is significantly smaller than the backbone, enabling low-latency generation while keeping the model end-to-end.

CSM model inference process. Text (T) and audio (A) tokens are interleaved and fed sequentially into the Backbone, which predicts the zeroth level of the codebook. The Decoder then samples levels 1 through N – 1 conditioned on the predicted zeroth level. The reconstructed audio token (A) is then autoregressively fed back into the Backbone for the next step, continuing until the audio EOT symbol is emitted. This process begins again on the next inference request, with the interim audio (such as a user utterance) being represented by interleaved audio and text transcription tokens.
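
In pseudo-Python, the control flow in that caption looks roughly like the sketch below; backbone and decoder are random stand-ins rather than the released models, so only the loop structure is meaningful.

    import numpy as np

    rng = np.random.default_rng(0)
    N_CODEBOOKS, VOCAB = 8, 1024

    def backbone(history):             # stand-in: logits for codebook 0 of the next frame
        return rng.normal(size=VOCAB)

    def decoder(frame_so_far, level):  # stand-in: logits for codebook `level`, given lower levels
        return rng.normal(size=VOCAB)

    def generate_frame(history):
        frame = [int(np.argmax(backbone(history)))]      # zeroth (semantic) codebook
        for level in range(1, N_CODEBOOKS):              # the small decoder fills levels 1..N-1
            frame.append(int(np.argmax(decoder(frame, level))))
        return frame

    history = []                        # interleaved text/audio tokens in the real model
    for _ in range(3):                  # each finished frame is fed back autoregressively
        history.append(generate_frame(history))
    print(history)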

Both transformers are variants of the Llama architecture. Text tokens are generated via a Llama tokenizer [6], while audio is processed using Mimi, a split-RVQ tokenizer, producing one semantic codebook and N – 1 acoustic codebooks per frame at 12.5 Hz. [5] Training samples are structured as alternating interleaved patterns of text and audio, with speaker identity encoded directly in the text representation.

Compute amortization

This design introduces significant infrastructure challenges during training. The audio decoder processes an effective batch size of B × S and N codebooks autoregressively, where B is the original batch size, S is the sequence length, and N is the number of RVQ codebook levels. This high memory burden even with a small model slows down training, limits model scaling, and hinders rapid experimentation, all of which are crucial for performance.

To address these challenges, we use a compute amortization scheme that alleviates the memory bottleneck while preserving the fidelity of the full RVQ codebooks. The audio decoder is trained on only a random 1/16 subset of the audio frames, while the zeroth codebook is trained on every frame. We observe no perceivable difference in audio decoder losses during training when using this approach.

Amortized training process. The backbone transformer models the zeroth level across all frames (highlighted in blue), while the decoder predicts the remaining N – 1 levels, but only for a random 1/16th of the frames (highlighted in green). The top section highlights the specific frames modeled by the decoder for which it receives loss.
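
A toy sketch of the sampling step implied above (an illustration only): the backbone loss covers every frame, while the decoder loss is computed on roughly 1/16 of frame positions, shrinking the effective B × S × N workload.

    import numpy as np

    rng = np.random.default_rng(0)
    batch, seq_len = 4, 2048
    # Frames on which the audio decoder receives loss (about 1/16 of positions).
    decoder_mask = rng.random((batch, seq_len)) < 1 / 16
    backbone_mask = np.ones((batch, seq_len), dtype=bool)   # zeroth codebook: every frame

    print("decoder trains on", round(decoder_mask.mean() * 100, 1), "% of frames")
    print("backbone trains on", int(backbone_mask.mean() * 100), "% of frames")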

Experiments

Dataset: We use a large dataset of publicly available audio, which we transcribe, diarize, and segment. After filtering, the dataset consists of approximately one million hours of predominantly English audio.

Model Sizes: We trained three model sizes, delineated by the backbone and decoder sizes:

  • Tiny: 1B backbone, 100M decoder
  • Small: 3B backbone, 250M decoder
  • Medium: 8B backbone, 300M decoder

Each model was trained with a 2048 sequence length (~2 minutes of audio) over five epochs.

Samples

Paralinguistics

Sentences from Base TTS

Foreign words

Sentences from Base TTS

Contextual expressivity

Samples from Expresso, continuation after chime

Pronunciation correction

Pronunciation correction sentence is a recording, all other audio is generated.

Conversations with multiple speakers

Single generation using audio prompts from two speakers

Evaluation

Our evaluation suite measures model performance across four key aspects: faithfulness to text, context utilization, prosody, and latency. We report both objective and subjective metrics—objective benchmarks include word error rate and novel tests like homograph disambiguation, while subjective evaluation relies on a Comparative Mean Opinion Score (CMOS) human study using the Expresso dataset.

Objective metrics

Traditional benchmarks, such as word error rate (WER) and speaker similarity (SIM), have become saturated—modern models, including CSM, now achieve near-human performance on these metrics.

Objective metric results for Word Error Rate (top) and Speaker Similarity (bottom) tests, showing the metrics are saturated (matching human performance).

To better assess pronunciation and contextual understanding, we introduce a new set of phonetic transcription-based benchmarks.

  • Text understanding through Homograph Disambiguation: Evaluates whether the model correctly pronounced different words with the same orthography (e.g., “lead” /lɛd/ as in “metal” vs. “lead” /liːd/ as in “to guide”).
  • Audio understanding through Pronunciation Continuation Consistency: Evaluates whether the model maintains pronunciation consistency of a specific word with multiple pronunciation variants in multi-turn speech. One example is “route” (/raʊt/ or /ruːt/), which can vary based on region of the speaker and context.

Objective metric results for Homograph Disambiguation (left) and Pronunciation Consistency (right) tests, showing the accuracy percentage for each model’s correct pronunciation. Play.ht, Elevenlabs, and OpenAI generations were made with default settings and voices from their respective API documentation.

The graph above compares objective metric results across three model sizes. For Homograph accuracy we generated 200 speech samples covering 5 distinct homographs—lead, bass, tear, wound, row—with 2 variants for each and evaluated pronunciation consistency using wav2vec2-lv-60-espeak-cv-ft. For Pronunciation Consistency we generated 200 speech samples covering 10 distinct words that have common pronunciation variants—aunt, data, envelope, mobile, route, vase, either, adult, often, caramel.

In general, we observe that performance improves with larger models, supporting our hypothesis that scaling enhances the synthesis of more realistic speech.

Subjective metrics

We conducted two Comparative Mean Opinion Score (CMOS) studies using the Expresso dataset to assess the naturalness and prosodic appropriateness of generated speech for CSM-Medium. Human evaluators were presented with pairs of audio samples—one generated by the model and the other a ground-truth human recording. Listeners rated the generated sample on a 7-point preference scale relative to the reference. Expresso’s diverse expressive TTS samples, including emotional and prosodic variations, make it a strong benchmark for evaluating appropriateness to context.

In the first CMOS study we presented the generated and human audio samples with no context and asked listeners to “choose which rendition feels more like human speech.” In the second CMOS study we also provide the previous 90 seconds of audio and text context, and ask the listeners to “choose which rendition feels like a more appropriate continuation of the conversation.” Eighty people were paid to participate in the evaluation and rated on average 15 examples each.

Subjective evaluation results on the Expresso dataset. No context: listeners chose “which rendition feels more like human speech” without knowledge of the context. Context: listeners chose “which rendition feels like a more appropriate continuation of the conversation” with audio and text context. 50:50 win–loss ratio suggests that listeners have no clear preference.

The graph above shows the win-rate of ground-truth human recordings vs CSM-generated speech samples for both studies. Without conversational context (top), human evaluators show no clear preference between generated and real speech, suggesting that naturalness is saturated. However, when context is included (bottom), evaluators consistently favor the original recordings. These findings suggest a noticeable gap remains between generated and human prosody in conversational speech generation.

Open-sourcing our work

We believe that advancing conversational AI should be a collaborative effort. To that end, we’re committed to open-sourcing key components of our research, enabling the community to experiment, build upon, and improve our approach. Our models will be available under an Apache 2.0 license.

Limitations and future work

CSM is currently trained on primarily English data; some multilingual ability emerges due to dataset contamination, but it does not perform well yet. It also does not take advantage of the information present in the weights of pre-trained language models.

In the coming months, we intend to scale up model size, increase dataset volume, and expand language support to over 20 languages. We also plan to explore ways to utilize pre-trained language models, working towards large multimodal models that have deep knowledge of both speech and text.

Ultimately, while CSM generates high quality conversational prosody, it can only model the text and speech content in a conversation—not the structure of the conversation itself. Human conversations are a complex process involving turn taking, pauses, pacing, and more. We believe the future of AI conversations lies in fully duplex models that can implicitly learn these dynamics from data. These models will require fundamental changes across the stack, from data curation to post-training methodologies, and we’re excited to push in these directions.

Join us

If you’re excited about building the most natural, delightful, and inspirational voice interfaces out there, reach out—we’re hiring. Check our open roles.

]]>
https://www.sesame.com/research/crossing_the_uncanny_valley_of_voice#demo hacker-news-small-sites-43200400 Fri, 28 Feb 2025 00:55:00 GMT
<![CDATA[World-first experimental cancer treatment paves way for clinical trial]]> thread link) | @femto
February 27, 2025 | https://www.wehi.edu.au/news/world-first-experimental-cancer-treatment-paves-way-for-clinical-trial/ | archive.org

An Australian-led international clinical trial will scientifically investigate the efficacy of the approach within a large cohort of eligible glioblastoma patients and could commence within a year.

The study will trial the use of double immunotherapy. In some patients, double immunotherapy will be combined with chemotherapy.

The trial will be led by The Brain Cancer Centre, which has world-leading expertise in glioblastoma.

“I am delighted to be handing the baton to Dr Jim Whittle, a leading Australian neuro-oncologist at Peter MacCallum Cancer Centre, The Royal Melbourne Hospital and Co-Head of Research Strategy at The Brain Cancer Centre, to commence a broader scientific study to scientifically determine if – and how – this process might work in treating glioblastoma,” said Prof Long, who also secured drug access for the clinical trial.

“While we are buoyed by the results of this experimental treatment so far, a clinical trial in a large group of patients would need to happen before anyone could consider it a possible breakthrough.”

Dr Whittle, also a laboratory head at WEHI, said: “We are pleased to be able to build on this exciting work by diving into the process of designing a clinical trial, which takes time, care and accuracy.

“When that process is complete, the result will be a world first clinical trial that enables us to thoroughly test the hypothesis against a representative sample of patients.”

The Brain Cancer Centre was founded by Carrie’s Beanies 4 Brain Cancer and established in partnership with WEHI with support from the Victorian Government.

The centre brings together a growing network of world-leading oncologists, immunologists, neurosurgeons, bioinformaticians and cancer biologists.

Commencement of recruitment for the clinical trial will be announced by The Brain Cancer Centre at a later date and will be limited to eligible patients.

]]>
https://www.wehi.edu.au/news/world-first-experimental-cancer-treatment-paves-way-for-clinical-trial/ hacker-news-small-sites-43199210 Thu, 27 Feb 2025 22:24:22 GMT
<![CDATA[OpenCloud 1.0]]> thread link) | @doener
February 27, 2025 | https://opencloud.eu/en/news/opencloud-now-available-new-open-source-alternative-microsoft-sharepoint | archive.org

Unable to retrieve article]]>
https://opencloud.eu/en/news/opencloud-now-available-new-open-source-alternative-microsoft-sharepoint hacker-news-small-sites-43198572 Thu, 27 Feb 2025 21:13:42 GMT
<![CDATA[Accessing region-locked iOS features, such as EU app stores]]> thread link) | @todsacerdoti
February 27, 2025 | https://downrightnifty.me/blog/2025/02/27/eu-features-outside.html | archive.org

The European Union's Digital Markets Act obligates Apple to provide certain features to iOS users in the EU, such as third party app stores. I live in the US and was able to develop a relatively-straightforward method to spoof your location on iOS and access these features, as well as any other region-locked iOS features you might be interested in experimenting with, even if you aren't in the required region.

If you look at the reverse engineered documentation, it would seem to be difficult to fool Apple's countryd service, since it uses almost all available hardware radios to determine your location – GPS, Wi-Fi, Bluetooth, and cellular. However, Apple has developed a "priority" system, roughly ranking the relative reliability of each location determination method. Since Location Services has the highest priority value, if it returns a location result, the results from the other methods seem to be ignored. Location Services relies solely on GPS and nearby Wi-Fi access points if Airplane Mode is enabled (and Wi-Fi re-enabled). Therefore, if you can spoof Wi-Fi geolocation (or if you can spoof GPS), then you can access region-locked features from anywhere, even on the iPhone with its wide array of radios.
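
A toy illustration of that highest-priority-result-wins behavior (the priority values and source names below are placeholders, not Apple's actual numbers from the reverse-engineered documentation):

    # Placeholder priorities for illustration only.
    results = [
        {"source": "location_services", "priority": 100, "country": "NL"},  # GPS + spoofed Wi-Fi APs
        {"source": "cellular_mcc",      "priority": 75,  "country": None},  # no result in Airplane Mode
        {"source": "wifi_country_code", "priority": 50,  "country": "US"},
    ]
    candidates = [r for r in results if r["country"] is not None]
    winner = max(candidates, key=lambda r: r["priority"])
    print(winner["country"])   # "NL" -- the spoofed Location Services fix wins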

On non-cellular iPad models, it has the potential to be even easier, because they only use Location Services (which can be disabled), or Wi-Fi country codes (which can be trivially spoofed). I was able to get this spoofing method working as well. However, it's not covered here.

I tested this with:

  • 2 ESP32 units creating 25 spoofed networks each (total 50)
  • iOS 18.2.1 on an iPhone 15, and an iPad Mini 6th gen

I was amazed at how consistent and reliable spoofing is, especially accounting for the low cost of the hardware involved and the simplicity of the spoofing software and method.

Most of the work was already done by Lagrange Point and Adam Harvey, developer of the Skylift tool. I was inspired by Lagrange Point's article to experiment with this and to reproduce their results. Check out their article on enabling Hearing Aid mode on AirPods in unsupported regions!

Please note that Apple could make the checks more difficult to bypass in the future through iOS updates. They don't have much of a reason to, since the current system is most likely more than sufficient to deter the average user from doing this, but it's very possible.

Procedure

What you'll need

  • Some experience with the command line
  • An iOS/iPadOS device with a USB-C port (recent iPads, or iPhone 15+)
    • You might be able to make it work on a Lightning iPhone, but it's much easier with a USB-C port + hub
  • A USB-C hub with Ethernet, HDMI out, and several USB ports
  • A USB keyboard and mouse
  • A USB-C extension cable
  • A display with HDMI input
  • One or two "faraday pouches"; make sure one is large enough to hold your device, and if buying a second make sure it's large enough to hold the other one
    • Any other faraday cage setup allowing only the tip of a single USB-C cable to break through the cage will work too, but these pouches make it easy
    • In my testing, using two pouches did reduce the number of external Wi-Fi networks appearing on the Wi-Fi list to zero, but I was still able to make it work with only one pouch – YMMV
  • A router that you can install a VPN on
    • You'll need to plug the router directly in to the device via an Ethernet cable, so a secondary/portable router is preferred
  • Access to a VPN service with an option to place yourself in an EU country
  • One or more ESP32 dev modules (preferably at least two)
  • A small battery power supply for the ESP32 modules (a small USB power bank works)
  • A free WiGLE account

These instructions assume you're using a Unix shell, so you might have to modify some of the commands slightly if you're on Windows.

Preparing the router

  1. Install a VPN on your router placing you in your chosen target country.
  2. Select an EU member state supported by your VPN as a spoof target. I chose the Netherlands.

Preparing the device

Creating a secondary Apple ID

You can't easily change the region on your Apple ID, and you probably don't want to do that anyway. But you can create a secondary Apple ID for use only while your device thinks that it's in the EU.

  1. Enable Airplane Mode and disable Bluetooth and Wi-Fi.
  2. Connect the device to the USB-C hub, and the USB-C hub to the router via Ethernet.
  3. Change your device region to your target country in Settings → General → Language & Region → Region.
  4. Sign out of your Apple ID: Settings → Your Account → Sign Out.
    • You'll need to sign out completely (including iCloud) in order to create a new account. Your data will not be lost. When you switch accounts again in the future, you only need to sign out of the App Store ("Media & Purchases"), not iCloud as well.
  5. Create a new Apple ID.
    • You can use the same phone number that's attached to your other Apple ID, or a Google Voice number.
    • For email, you'll need to either create an iCloud email, or use a "plus-style address".
  6. Make sure the Apple ID region is correct: App Store → Your Account → Your Account → Country/Region.
  7. Install at least one free app from the App Store to initialize the account.

Getting Wi-Fi data

  1. Find a popular indoor public attraction offering free Wi-Fi within the target country using Google Maps or similar software. I chose the Rijksmuseum. Note down the GPS coordinates of the center of the building.
  2. Imagine a rectangle surrounding the building and note down the GPS coordinates of the top-left and bottom-right points.
  3. Create a free account on WiGLE.
  4. Query the WiGLE database using the online API interface with these parameters:
    1. latrange1: lesser of two latitudes you noted
    2. latrange2: greater of two latitudes you noted
    3. longrange1: lesser of two longitudes you noted
    4. longrange2: greater of two longitudes you noted
    5. closestLat: latitude of center of building
    6. closestLong: longitude of center of building
    7. resultsPerPage: 25*n where n is the number of ESP32 units you have (e.g. 50 for 2 units)
  5. Execute the request, then download the response as JSON (a Python sketch of this query appears after this list)
  6. Clone the skylift repository:
    git clone https://github.com/DownrightNifty/skylift
    
  7. Set up skylift:
    cd skylift/
    python3 -m venv .venv
    source .venv/bin/activate
    pip install -r requirements.txt
    pip install setuptools
    python setup.py develop
    
  8. Convert the JSON data to the format used by skylift:
    # replace $PATH_TO_JSON, $TARGET_LAT, and $TARGET_LONG
    python ./extras/wigle_to_skylift.py $PATH_TO_JSON ./w2s_out $TARGET_LAT $TARGET_LONG
    
  9. Create the arduino sketch(es):
    c=1
    for file in ./w2s_out/*; do
        skylift create-sketch -i "$file" -o ./out_"$c" --max-networks 25 --board esp32
        ((c++))
    done
    
  10. Use the Arduino IDE to upload each sketch to each ESP32 unit.
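
If you prefer to script step 5 instead of using the web interface, the request can be made from Python. This is only a sketch: the v2 "network/search" endpoint and the Basic-auth API name/token pair are my reading of WiGLE's API documentation, and the coordinates are placeholders for the bounding box and building center you noted down.

# Sketch only (assumes `pip install requests` and a WiGLE API name/token).
# The endpoint and auth scheme below are assumptions from WiGLE's API docs.
import json
import requests

API_NAME = "your_api_name"        # from your WiGLE account page
API_TOKEN = "your_api_token"

params = {
    "latrange1": 52.3590,         # lesser latitude of the bounding box
    "latrange2": 52.3605,         # greater latitude
    "longrange1": 4.8840,         # lesser longitude
    "longrange2": 4.8865,         # greater longitude
    "closestLat": 52.3600,        # center of the building
    "closestLong": 4.8852,
    "resultsPerPage": 50,         # 25 * number of ESP32 units
}

resp = requests.get(
    "https://api.wigle.net/api/v2/network/search",
    params=params,
    auth=(API_NAME, API_TOKEN),
)
resp.raise_for_status()

with open("wigle.json", "w") as f:
    json.dump(resp.json(), f)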

Pre-generated Wi-Fi data

If you're having trouble with acquiring the data yourself, you could try using the sample data that I generated. If a large number of people start using it, I don't know if it will continue to work indefinitely, so please use your own data if possible.

The sample data can be found under the generated/ directory in my fork of Skylift.

Placing the device in the faraday pouch

  1. Before you continue, check the device's settings:
    1. Enable Airplane Mode, disable Bluetooth, and re-enable Wi-Fi.
    2. [Optional] Disable your lock screen (this makes controlling the device externally easier).
    3. [Optional] Make sure Apple Maps is allowed to access your location "while using the app": Settings → Privacy & Security → Location Services → Maps. Required because ask-for-permission prompts annoyingly don't appear on external displays.
    4. [iPhone only] Enable AssistiveTouch: Settings → Accessibility → Touch → AssistiveTouch. Required to enable mouse support on iPhone.
    5. Make sure you're signed in to the App Store with the EU Apple ID you created earlier: Settings → Your Account → Media & Purchases. Signing in to iCloud as well is unnecessary.
  2. Connect the USB-C extension cable to the device.
  3. [⚠️ Important] Insulate the ESP32 units from the metallic faraday pouch using plastic bags or something.
  4. Connect the ESP32 units to the battery.
  5. Place the device into a faraday pouch, along with the ESP32 units and their battery. Seal it as well as possible with only the tip of the cable sticking out (reducing its ability to let in radio signals).
    • If one pouch doesn't work, try using two pouches (placing one inside the other)
  6. Connect the USB-C hub to the cable. Connect the router via Ethernet, and a keyboard, mouse, and display via HDMI.

Spoofing location and unlocking EU features

Your iOS device should now only see the spoofed Wi-Fi networks, and cannot receive a GPS signal. Since we have a cable sticking out, this isn't a perfect faraday cage and it's possible that especially strong signals such as cell signals will make their way through, but that's okay.

  1. Make sure that you can control the device inside the faraday pouch using the connected keyboard, mouse, and display, and that the device has internet access through Ethernet.
  2. [Optional] Check the nearby Wi-Fi list to make sure you can only see fake Wi-Fi networks.
    • If you see one or two nearby networks, that should still be okay; the geolocation service seems to ignore irregularities like this and returns the most likely location result, considering all nearby networks.
    • 5GHz Wi-Fi is stronger than 2.4GHz. You could temporarily disable 5GHz on your main router if that helps.
  3. Disable Location Services and then re-enable it.
  4. Open Apple Maps and check to make sure it places you inside your target country.
  5. You should now have access to EU features such as third party app stores. Try installing AltStore PAL at: https://altstore.io/download

If it doesn't work the first time around, disable Location Services and re-enable it, then try again.

Caveats

"Third party app stores" != "sideloading"

I've written at length about why third party app stores aren't the same as "sideloading". Check out my new project, "Does iOS have sideloading yet?", below!

https://www.doesioshavesideloadingyet.com/

The 30 day grace period

Once you take your device out of the faraday pouch and it realizes that it's no longer in the EU, a 30-day grace period begins during which you can use EU features freely. After the grace period, certain features will become restricted. You'll still be able to use any apps from alternative app stores you downloaded, but they'll no longer receive updates.

However, you can simply repeat the location spoof process again once each month, if you want to continue to access these features.

Acknowledgements

Appendix: Notes on Apple's Eligibility system

Apple's Eligibility system has been mostly reverse engineered and documented, but I wanted to add some of my notes here for future reference.

As noted in the Lagrange Point article, you can monitor the activity of the eligibility service by monitoring the device's system logs, either through Console.app on macOS, or libimobiledevice on other platforms. This command is especially helpful:

idevicesyslog | grep RegulatoryDomain

Here's a sample output:

Here are the different location estimate methods, sorted by priority from lowest to highest:

  • WiFiAP (1): Uses the two-digit country codes of nearby Wi-Fi access points
  • ServingCellMCC (2): Uses the MCC code of the cell tower that the device is currently connected to(?)
  • NearbyCellsMCC (3): Uses the MCC codes of nearby cell towers
  • LatLonLocation (4): Uses coordinates from Location Services (GPS/Wi-Fi)

According to the Apple Wiki article:

countryd uses a mix of all signals to decide which country is the most likely physical location of the device.

However, I found that, in practice, if conflicting information is available, countryd will simply use the estimate with the highest priority.
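
As a toy illustration of that observed behavior (my own sketch, not Apple's code), the selection amounts to keeping only the estimate whose method carries the largest priority value:

# Toy illustration of the observed behavior, not Apple's implementation.
PRIORITY = {
    "WiFiAP": 1,
    "ServingCellMCC": 2,
    "NearbyCellsMCC": 3,
    "LatLonLocation": 4,
}

def pick_country(estimates):
    """estimates maps a method name to the country it reported."""
    best_method = max(estimates, key=lambda m: PRIORITY[m])
    return estimates[best_method]

# With Location Services spoofed to the Netherlands, any real cell-tower
# estimate that leaks through the pouch is outranked and ignored:
print(pick_country({"NearbyCellsMCC": "US", "LatLonLocation": "NL"}))  # NL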

]]>
https://downrightnifty.me/blog/2025/02/27/eu-features-outside.html hacker-news-small-sites-43197163 Thu, 27 Feb 2025 18:45:18 GMT
<![CDATA[Goodbye K-9 Mail]]> thread link) | @todsacerdoti
February 27, 2025 | https://cketti.de/2025/02/26/goodbye-k9mail/ | archive.org

TL;DR: I quit my job working on Thunderbird for Android and K-9 Mail at MZLA.

My personal journey with K-9 Mail started in late 2009, shortly after getting my first Android device. The pre-installed Email app didn't work very well with my email provider. When looking for alternatives, I discovered K-9 Mail. It had many of the same issues. But it was an active open source project that accepted contributions. I started fixing the problems I was experiencing and contributed these changes to K-9 Mail. It was a very pleasant experience and so I started fixing bugs reported by other users.

In February 2010, Jesse Vincent, the founder of the K-9 Mail project, offered me commit access to the Subversion repository. According to my email archive, I replied with the following text:

Thank you! I really enjoyed writing patches for K-9 and gladly accept your offer. But I probably won’t be able to devote as much time to the project as I do right now for a very long time. I hope that’s not a big problem.

My prediction turned out to be not quite accurate. I was able to spend a lot of time working on K-9 Mail and quickly became one of the most active contributors.

In 2012, Jesse hired me to work on Kaiten Mail, a commercial closed-source fork of K-9 Mail. The only real differences between the apps were moderate changes to the user interface. So most of the features and bug fixes we created for Kaiten Mail also went into K-9 Mail. This was important to me and one of the reasons I took the job.

In early 2014, Jesse made me the K-9 Mail project leader. With Kaiten Mail, end-user support was eating up a lot of time and eventually motivation to work on the app. So we stopped working on it around the same time and the app slowly faded away.

To pay the bills, I started working as a freelancing Android developer. Maybe not surprisingly, more often than not I was contracted to work on email clients. Whenever I was working on a closed source fork of K-9 Mail, I had a discounted hourly rate that would apply when working on things that were contributed to K-9 Mail. This was mostly bug fixes, but also the odd feature every now and then.

After a contract ended in 2019, I decided to apply for a grant from the Prototype Fund to work on adding JMAP support to K-9 Mail. This allowed me to basically work full-time on the project. When the funding period ended in 2020, the COVID-19 pandemic was in full swing. At that time I didn't feel like looking for a new contract. I filled my days working on K-9 Mail to mute the feeling of despair about the world. I summarized my 2020 in the blog post My first year as a full-time open source developer.

Eventually I had to figure out how to finance this full-time open source developer lifestyle. I ended up asking K-9 Mail users to donate so I could be paid to dedicate 80% of my time to work on the app. This worked out quite nicely and I wrote about it here: 2021 in Review.

I first learned about plans to create a Thunderbird version for Android in late 2019. I was approached because one of the options considered was basing Thunderbird for Android on K-9 Mail. At the time, I wasn’t really interested in working on Thunderbird for Android. But I was more than happy to help turn the K-9 Mail code base into something that Thunderbird could use as a base for their own app. However, it seemed the times where we had availability to work on such a project never aligned. And so nothing concrete happened. But we stayed in contact.

In December 2021, it seemed to have become a priority to find a solution for the many Thunderbird users asking for an Android app. By that time, I had realized that funding an open source project via donations requires an ongoing fundraising effort. Thunderbird was already doing this for quite some time and getting pretty good at it. I, on the other hand, was not looking forward to the idea of getting better at fundraising.
So, when I was asked again whether I was interested in K-9 Mail and myself joining the Thunderbird project, I said yes. It took another six months for us to figure out the details and announce the news to the public.

Once under the Thunderbird umbrella, we worked on adding features to K-9 Mail that we wanted an initial version of Thunderbird for Android to have. The mobile team slowly grew to include another Android developer, then a manager. While organizationally the design team was its own group, there was always at least one designer available to work with the mobile team on the Android app. And then there were a bunch of other teams to do the things for which you don't need Android engineers: support, communication, donations, etc.

In October 2024, we finally released the first version of Thunderbird for Android. The months leading up to the release were quite stressful for me. All of us were working on many things at the same time to not let the targeted release date slip too much. We never worked overtime, though. And we got additional paid time off after the release ❤️

After a long vacation, we started 2025 with a more comfortable pace. However, the usual joy I felt when working on the app didn't return. I finally realized this at the beginning of February, while being sick in bed and having nothing better to do than contemplating life.
I don’t think I was close to a burnout – work wasn’t that much fun anymore, but it was far from being unbearable. I’ve been there before. And in the past it never was a problem to step away from K-9 Mail for a few months. However, it’s different when it’s your job. But since I am in the very fortunate position of being able to afford taking a couple of months off, I decided to do just that. So the question was whether to take a sabbatical or to quit.
Realistically, permanently walking away from K-9 Mail never was an option in the past. There was no one else to take over as a maintainer. It would have most likely meant the end of the project. K-9 Mail was always too important to me to let that happen.
But this is no longer an issue. There’s now a whole team behind the project and me stepping away no longer is an existential threat to the app.

I want to explore what it feels like to do something else without going back to the project being a foregone conclusion. That is why I quit my job at MZLA.

It was a great job and I had awesome coworkers. I can totally recommend working with these people and will miss doing so 😢


I have no idea what I’ll end up doing next. A coworker asked me whether I’ll stick to writing software or do something else entirely. I was quite surprised by this question. Both because in hindsight it felt like an obvious question to ask and because I’ve never even considered doing something else. I guess that means I’m very much still a software person and will be for the foreseeable future.

During my vacation I very much enjoyed being a beginner and learning about technology I haven’t worked with as a developer before (NFC smartcards, USB HID, Bluetooth LE). So I will probably start a lot of personal projects and finish few to none of them 😃

I think there’s a good chance that – after an appropriately sized break – I will return as a volunteer contributor to K-9 Mail/Thunderbird for Android.

But for now, I say: Goodbye K-9 Mail 👋


This leaves me with saying thank you to everyone who contributed to K-9 Mail and Thunderbird for Android over the years. People wrote code, translated the app, reported bugs, helped other users, gave money, promoted the app, and much more. Thank you all 🙏


]]>
https://cketti.de/2025/02/26/goodbye-k9mail/ hacker-news-small-sites-43196436 Thu, 27 Feb 2025 17:26:21 GMT
<![CDATA[Distributed systems programming has stalled]]> thread link) | @shadaj
February 27, 2025 | https://www.shadaj.me/writing/distributed-programming-stalled | archive.org

Over the last decade, we’ve seen great advancements in distributed systems, but the way we program them has seen few fundamental improvements. While we can sometimes abstract away distribution (Spark, Redis, etc.), developers still struggle with challenges like concurrency, fault tolerance, and versioning.

There are lots of people (and startups) working on this. But nearly all focus on tooling to help analyze distributed systems written in classic (sequential) programming languages. Tools like Jepsen and Antithesis have advanced the state-of-the-art for verifying correctness and fault tolerance, but tooling is no match for programming models that natively surface fundamental concepts. We’ve already seen this with Rust, which provides memory safety guarantees that are far richer than C++ with AddressSanitizer.

If you look online, there are tons of frameworks for writing distributed code. In this blog post, I’ll make the case that they only offer band-aids and sugar over three fixed underlying paradigms: external-distribution, static-location, and arbitrary-location. We’re still missing a programming model that is native to distributed systems. We’ll walk through these paradigms then reflect on what’s missing for a truly distributed programming model.


External-distribution architectures are what the vast majority of “distributed” systems look like. In this model, software is written as sequential logic that runs against a state management system with sequential semantics:

  • Stateless Services with a Distributed Database (Aurora DSQL, Cockroach)
  • Services using gossiped CRDT state (Ditto, ElectricSQL, Redis Enterprise)[1]
  • Workflows and Step Functions

These architectures are easy to write software in, because none of the underlying distribution is exposed[2] to the developer! Although this architecture results in a distributed system, we do not have a distributed programming model.

There is little need to reason about fault-tolerance or concurrency bugs (other than making sure to opt into the right consistency levels for CRDTs). So it’s clear why developers opt for this option, since it hides the distributed chaos under a clean, sequential semantics. But this comes at a clear cost: performance and scalability.

Serializing everything is tantamount to emulating a non-distributed system, but with expensive coordination protocols. The database forms a single point of failure in your system; you either hope that us-east-1 doesn’t go down or switch to a multi-writer system like Cockroach that comes with its own performance implications. Many applications are at sufficiently low scale to tolerate this, but you wouldn’t want to implement a counter like this.


Static-location architectures are the classic way to write distributed code. You compose several units—each written as local (single-machine) code that communicates with other machines using asynchronous network calls:

  • Services communicating with API calls, possibly using async / await (gRPC, REST)
  • Actors (Akka, Ray, Orleans)
  • Services polling and pushing to a shared pub/sub (Kafka)

These architectures give us full, low-level control. We’re writing a bunch of sequential, single-machine software with network calls. This is great for performance and fault-tolerance because we control what gets run where and when.

But the boundaries between networked units are rigid and opaque. Developers must make one-way decisions on how to break up their application. These decisions have a wide impact on correctness; retries and message ordering are controlled by the sender and unknown to the recipient. Furthermore, the language and tooling have limited insight into how units are composed. Jump-to-definition is often unavailable, and serialization mismatches across services can easily creep in.

Most importantly, this approach to distributed systems fundamentally eliminates semantic co-location and modularity. In sequential code, things that happen one after the other are textually placed one after the other and function calls encapsulate entire algorithms. But with static-location architectures, developers are coerced to modularize code on machine boundaries, rather than on semantic boundaries. In these architectures there is simply no way to encapsulate a distributed algorithm as a single, unified semantic unit.

Although static-location architectures offer developers the most low-level control over their system, in practice they are difficult to implement robustly without distributed systems expertise. There is a fundamental mismatch between implementation and execution: static-location software is written as single-machine code, but the correctness of the system requires reasoning about the fleet of machines as a whole. Teams building such systems often live in fear of concurrency bugs and failures, leading to mountains of legacy code that are too critical to touch.


Arbitrary-location architectures are the foundation of most “modern” approaches to distributed systems. These architectures simplify distributed systems by letting us write code as if it were running on a single machine, but at runtime the software is dynamically executed across several machines[3]:

  • Distributed SQL Engines
  • MapReduce Frameworks (Hadoop, Spark)
  • Stream Processing (Flink, Spark Streaming, Storm)
  • Durable Execution (Temporal, DBOS, Azure Durable Functions)

These architectures elegantly handle the co-location problem since there are no explicit network boundaries in the language/API to split your code across. But this simplicity comes at a significant cost: control. By letting the runtime decide how the code is distributed, we lose the ability to make decisions about how the application is scaled, where the fault domains lie, and when data is sent over the network.

Just like the external-distribution model, arbitrary-location architectures often come with a performance cost. Durable execution systems typically snapshot their state to a persistent store between every step[4]. Stream processing systems may dynamically persist data and are free to introduce asynchrony across steps. SQL users are at the mercy of the query optimizer, to which they at best can only give “hints” on distribution decisions.

We often need low-level control over where individual logic is placed for performance and correctness. Consider implementing Two-Phase Commit. This protocol has explicit, asymmetric roles for a leader that broadcasts proposals and workers that acknowledge them. To correctly implement such a protocol, we need to explicitly assign specific logic to these roles, since quorums must be determined on a single leader and each worker must atomically decide to accept or reject a proposal. It’s simply not possible to implement such a protocol in an arbitrary-location architecture without introducing unnecessary networking and coordination overhead.
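
To make the placement argument concrete, here is a toy, in-process Python sketch of those asymmetric roles (my own illustration, with no real networking, persistence, or failure handling): the vote tally exists only on the leader, and each worker's accept/reject is a local step.

# Toy sketch of the Two-Phase Commit roles; real systems add networking,
# durable logging, and timeout/recovery, none of which is shown here.
class Worker:
    def __init__(self):
        self.staged = None
        self.committed = None

    def prepare(self, proposal):
        # Local, atomic decision made on this worker only.
        self.staged = proposal
        return True

    def commit(self):
        self.committed = self.staged

    def abort(self):
        self.staged = None


class Leader:
    def __init__(self, workers):
        self.workers = workers

    def two_phase_commit(self, proposal):
        votes = [w.prepare(proposal) for w in self.workers]  # phase 1: broadcast
        if all(votes):                                       # quorum is tallied here only
            for w in self.workers:
                w.commit()                                   # phase 2: commit everywhere
            return True
        for w in self.workers:
            w.abort()
        return False


leader = Leader([Worker(), Worker(), Worker()])
assert leader.two_phase_commit({"key": "value"})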

If you’ve been following the “agentic” LLM space, you might be wondering: “Are any of these issues relevant in a world where my software is being written by an LLM?” If the static-location model is sufficiently rich to express all distributed systems, who cares if it’s painful to program in!

I'd argue that LLMs actually are a great argument why we need a new programming model. These models famously struggle under scenarios where contextually-relevant information is scattered across large bodies of text[5]. LLMs do best when semantically-relevant information is co-located.

The static-location model forces us to split up our semantically-connected distributed logic across several modules. LLMs aren’t great yet at correctness on a single machine; it is well beyond their abilities to compose several single-machine programs that work together correctly. Furthermore, LLMs make decisions sequentially; splitting up distributed logic across several networked modules is inherently challenging to the very structure of AI models.

LLMs would do far better with a programming model that retains “semantic locality”. In a hypothetical programming model where code that spans several machines can be co-located, this problem becomes trivial. All the relevant logic for a distributed algorithm would be right next to each other, and the LLM can generate distributed logic in a straight-line manner.

The other piece of the puzzle is correctness. LLMs make mistakes, and our best bet is to combine them with tools that can automatically find them[6]. Sequential models have no way to reason about the ways distributed execution might cause trouble. But a sufficiently rich distributed programming model could surface issues arising from network delays and faults (think a borrow-checker, but for distributed systems).

Although the programming models we’ve discussed each have several limitations, they also demonstrate desirable features that a native programming model for distributed systems should support. What can we learn from each model?

I’m going to skip over external-distribution, which as we discussed is not quite distributed. For applications that can tolerate the performance and semantic restrictions of this model, this is the way to go. But for a general distributed programming model, we can’t keep networking and concurrency hidden from the developer.

The static-location model seems like the right place to start, since it is at least capable of expressing all the types of distributed systems we might want to implement, even if the programming model offers us little help in reasoning about the distribution. We were missing two things that the arbitrary-location model offered:

  • Writing logic that spans several machines right next to each other, in a single function
  • Surfacing semantic information on distributed behavior such as message reordering, retries, and serialization formats across network boundaries

Each of these points has a dual, something we don't want to give up:

  • Explicit control over placement of logic on machines, with the ability to perform local, atomic computations
  • Rich options for fault tolerance guarantees and network semantics, without the language locking us into global coordination and recovery protocols

It’s time for a native programming model—a Rust-for-distributed systems, if you will—that addresses all of these.

Thanks to Tyler Hou, Joe Hellerstein, and Ramnivas Laddad for feedback on this post!

  1. This may come as a surprise. CRDTs are often marketed as a silver bullet for all distributed systems, but another perspective is they simply accelerate distributed transactions. Software running over CRDTs is still sequential.

  2. Well that’s the idea, at least. Serializability typically isn’t the default (snapshot isolation is), so concurrency bugs can sometimes be exposed.

  3. Actor frameworks don’t really count even if they support migration, since the developer still has to explicitly define the boundaries of an actor and specify where message passing happens

  4. With some optimizations when a step is a pure, deterministic function

  5. See the Needle in a Haystack Test; reasoning about distributed systems is even harder.

  6. Lean is a great example of this in action. Teams including Google and Deepseek have been using it for some time.

]]>
https://www.shadaj.me/writing/distributed-programming-stalled hacker-news-small-sites-43195702 Thu, 27 Feb 2025 16:12:42 GMT
<![CDATA[Solitaire]]> thread link) | @goles
February 27, 2025 | https://localthunk.com/blog/solitaire | archive.org

I have cited a few games as inspiration for Balatro in the past, but I wanted to talk about one in particular that hasn’t been mentioned much that arguably is the most important.

I think if I had some kind of Balatro vision board, solitaire (Klondike) would be right in the middle of it with a big red circle around it. You can probably see some of the similarities between my game and the classic solo card game. I wanted my game to have the same vibe.

If you’re somehow unfamiliar, solitaire is a group of card games characterized by solo play. Klondike is usually the variant that most people in the west associate with solitaire, but one could argue even Balatro is technically a solitaire game. Traditional solitaire games exist at the peak of game culture for me. These games are so ubiquitous and accepted by society that almost everyone has some memory of playing them. They have transcended gaming culture more than even the biggest IPs (like Tetris or Mario), and they occupy this very interesting wholesome niche. Solitaire is almost viewed as a positive pastime more than a game. That feeling interests me greatly as a game designer.

As Balatro 1.0 development drew nearer to a close in early 2024, I found myself picturing the type of person that might play my game and what a typical play session might look like for them. My fantasy was that I was playing this weird game many years later on a lazy Sunday afternoon; I play a couple of runs, enjoy my time for about an hour, then set it down and continue the rest of my day. I wanted it to feel evergreen, comforting, and enjoyable in a very low-stakes way. I think that’s one of the reasons why there isn’t a player character, health, or classic ‘enemies’ in the game as well. I wanted this game to be as low stakes as a crossword or a sudoku puzzle while still exercising the problem solving part of the brain.

Essentially I wanted to play Balatro like people play solitaire.

One of the main ways that the vibe of solitaire and my own game differ is in the meta-game Balatro has that solitaire does not. Things like achievements, stake levels, unlocks, and challenges certainly can be looked at as a way to artificially inflate playtime, but those things were added for 2 other reasons I was more concerned about:

  1. To force players to get out of their comfort zone and explore the design of the game in a way they might not if this were a fully unguided gaming experience. In solitaire this probably isn’t super useful because the game has far fewer moving parts, so the player can figure everything out by themselves, but I don’t think that’s the case with a game like Balatro. I feel like even I learned a lot from these guiding goals that I wasn’t anticipating many months after the game launched.

  2. To give the players that already enjoy the game loop a sort of checklist to work through if they so choose. They can come up with a list of goals on their own (as I see many from the community have) but I do really appreciate when I play other games and they give me tasks to accomplish and shape my long-form play around while I enjoy the shorter play sessions individually.

It’s now been over a year since launch and I am still playing Balatro almost daily. I play a couple runs before I go to bed, and I feel like I just might have accomplished the task of recreating the feeling of playing solitaire for myself. Seeing the discourse around my game has me fairly convinced that this is decidedly not how the average player has been interacting with my game, but I’m still thrilled that people are having a great time with it and I’m even more happy that I feel like this game turned out how I wanted as a player myself.

This is why you might have seen me refer to this game as ‘jazz solitaire’ in the past. I wanted to bring the old feeling of solitaire into a game with modern design bells and whistles, creating something new and yet familiar. Only time will tell if I actually accomplished that.

]]>
https://localthunk.com/blog/solitaire hacker-news-small-sites-43195516 Thu, 27 Feb 2025 15:54:36 GMT
<![CDATA[RoboPianist: Dexterous Piano Playing with Deep Reinforcement Learning (2023)]]> thread link) | @bemmu
February 27, 2025 | https://kzakka.com/robopianist/#demo | archive.org

@ Conference on Robot Learning (CoRL) 2023

1UC Berkeley

2Google DeepMind

3Stanford University

4Simon Fraser University

TLDR We train anthropomorphic robot hands to play the piano using deep RL
and release a simulated benchmark and dataset to advance high-dimensional control.

Interactive Demo

This is a demo of our simulated piano playing agent trained with reinforcement learning. It runs MuJoCo natively in your browser thanks to WebAssembly. You can use your mouse to interact with it, for example by dragging down the piano keys to generate sound or pushing the hands to perturb them. The controls section in the top right corner can be used to change songs and the simulation section to pause or reset the agent. Make sure you click the demo at least once to enable sound.

Overview

Simulation

We build our simulated piano-playing environment using the open-source MuJoCo physics engine. It consists of a full-size 88-key digital keyboard and two Shadow Dexterous Hands, each with 24 degrees of freedom.

Musical representation

We use the Musical Instrument Digital Interface (MIDI) standard to represent a musical piece as a sequence of time-stamped messages corresponding to "note-on" or "note-off" events. A message carries additional pieces of information such as the pitch of a note and its velocity.

We convert the MIDI file into a time-indexed note trajectory (also known as a piano roll), where each note is represented as a one-hot vector of length 88 (the number of keys on a piano). This trajectory is used as the goal representation for our agent, informing it which keys to press at each time step.
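
As a rough illustration (not the released RoboPianist code), sampling a list of note events into such a trajectory might look like the following; the 20 Hz rate is assumed here to match the control frequency described later.

# Illustrative only: convert (onset, offset, key) events into a piano roll
# of 88-dimensional binary vectors sampled at the control frequency.
import numpy as np

def piano_roll(notes, duration_s, control_hz=20, n_keys=88):
    """notes: list of (onset_s, offset_s, key_index) with key_index in [0, 87]."""
    n_steps = int(duration_s * control_hz)
    roll = np.zeros((n_steps, n_keys), dtype=np.float32)
    for onset, offset, key in notes:
        start, end = int(onset * control_hz), int(offset * control_hz)
        roll[start:end, key] = 1.0   # the key should be held down over these steps
    return roll

# Two overlapping notes in a one-second clip:
goal = piano_roll([(0.0, 0.5, 39), (0.25, 1.0, 43)], duration_s=1.0)
print(goal.shape)  # (20, 88)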

The interactive plot below shows the song Twinkle Twinkle Little Star encoded as a piano roll. The x-axis represents time in seconds, and the y-axis represents musical pitch as a number between 1 and 88. You can hover over each note to see what additional information it carries.

A synthesizer can be used to convert MIDI files to raw audio:

Musical evaluation

We use precision, recall and F1 scores to evaluate the proficiency of our agent. If at a given instant of time there are keys that should be "on" and keys that should be "off", precision measures how good the agent is at not hitting any of the keys that should be "off", while recall measures how good the agent is at hitting all the keys that should be "on". The F1 score combines the precision and recall into a single metric, and ranges from 0 (if either precision or recall is 0) to 1 (perfect precision and recall).
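
For a single time step, these quantities can be computed from the set of keys that should be "on" and the set of keys the agent actually pressed; a minimal sketch (illustrative, not the benchmark's exact evaluation code):

# Per-timestep precision/recall/F1 over the 88 key states (illustrative).
def key_press_f1(should_be_on, actually_on):
    """Both arguments are sets of key indices in [0, 87]."""
    true_positives = len(should_be_on & actually_on)
    precision = true_positives / len(actually_on) if actually_on else 1.0
    recall = true_positives / len(should_be_on) if should_be_on else 1.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(key_press_f1({10, 14, 17}, {10, 14, 21}))  # two of three correct keys, one wrong: ~0.67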

Piano fingering and dataset

Piano fingering refers to the assignment of fingers to notes in a piano piece (see figure below). Sheet music will typically provide sparse fingering labels for the tricky sections of a piece to help guide pianists, and pianists will often develop their own fingering preferences for a given piece.

In RoboPianist, we found that the agent struggled to learn to play the piano with a sparse reward signal due to the exploration challenge associated with the high-dimensional action space. To overcome this issue, we added human priors in the form of the fingering labels to the reward function to guide its exploration.

Since fingering labels aren't available in MIDI files by default, we used annotations from the Piano Fingering Dataset (PIG) to create 150 labeled MIDI files, which we call Repertoire-150 and release as part of our environment.

Finger numbers (1 to 9) annotated above each note. Source: PianoPlayer

MDP Formulation

We model piano-playing as a finite-horizon Markov Decision Process (MDP) defined by a tuple \( (\mathcal{S}, \mathcal{A}, \mathcal{\rho}, \mathcal{p}, r, \gamma, H) \), where \( \mathcal{S} \) is the state space, \( \mathcal{A} \) is the action space, \( \mathcal{\rho}(\cdot) \) is the initial state distribution, \( \mathcal{p} (\cdot | s, a) \) governs the dynamics, \( r(s, a) \) is the reward function, \( \gamma \) is the discount factor, and \( H \) is the horizon. The goal of the agent is to maximize its total expected discounted reward over the horizon \( \mathbb{E}\left[\sum_{t=0}^{H} \gamma^t r(s_t, a_t) \right] \).

At every time step, the agent receives proprioceptive (i.e, hand joint angles), exteroceptive (i.e., piano key states) and goal observations (i.e., piano roll) and outputs 22 target joint angles for each hand. These are fed to proportional-position actuators which convert them to torques at each joint. The agent then receives a weighted sum of reward terms, including a reward for hitting the correct keys, a reward for minimizing energy consumption, and a shaping reward for adhering to the fingering labels.

For our policy optimizer, we use a state-of-the-art model-free RL algorithm DroQ and train our agent for 5 million steps with a control frequency of 20 Hz.

Quantitative Results

With careful system design, we improve our agent's performance significantly. Specifically, adding an energy cost to the reward formulation, providing a few seconds worth of future goals rather than just the current goal, and constraining the action space helped the agent learn faster and achieve a higher F1 score. The plot below shows the additive effect of each of these design choices on three different songs of increasing difficulty.

When compared to a strong derivative-free model predictive control (MPC) baseline, Predictive Sampling, our agent achieves a much higher F1 score, averaging 0.79 across Etude-12 versus 0.43 for Predictive Sampling.

Qualitative Results

Each video below is playing real-time and shows our agent playing every song in the Etude-12 subset. In each video frame, we display the fingering labels by coloring the keys according to the corresponding finger color. When a key is pressed, it is colored green.

Debug dataset

This dataset contains "entry-level" songs (e.g., scales) and is useful for sanity checking an agent's performance. Fingering labels in this dataset were manually annotated by the authors of this paper. It is not part of the Repertoire-150 dataset.

C Major Scale

D Major Scale

Twinkle Twinkle Little Star

Etude-12 subset

Etude-12 is a subset of the full 150-large dataset and consists of 12 songs of varying difficulty. It is a subset of the full benchmark reserved for more moderate compute budgets.

Piano Sonata D845 1st Mov (F1=0.72)

Partita No. 2 6th Mov (F1=0.73)

Bagatelle Op. 3 No. 4 (F1=0.75)

French Suite No. 5 Sarabande (F1=0.89)

Waltz Op. 64 No. 1 (F1=0.78)

French Suite No. 1 Allemande (F1=0.78)

Piano Sonata No. 2 1st Mov (F1=0.79)

Kreisleriana Op. 16 No. 8 (F1=0.84)

Golliwoggs Cakewalk (F1=0.85)

Piano Sonata No. 23 2nd Mov (F1=0.87)

French Suite No. 5 Gavotte (F1=0.77)

Piano Sonata K279 1st Mov (F1=0.78)

Common failure modes

Since the Shadow Hand forearms are thicker than a human's, the agent sometimes struggles to nail down notes that are really close together. Adding full rotational and translational degrees of freedom to the hands could give them the ability to overcome this limitation, but would pose additional challenges for learning.

The agent struggles with songs that require stretching the fingers over many notes, sometimes more than 1 octave.

Acknowledgments

This work is supported in part by ONR #N00014-22-1-2121 under the Science of Autonomy program.

This website was heavily inspired by Brent Yi's.

]]>
https://kzakka.com/robopianist/#demo hacker-news-small-sites-43192751 Thu, 27 Feb 2025 09:41:23 GMT
<![CDATA[Python as a second language empathy (2018)]]> thread link) | @luu
February 26, 2025 | https://ballingt.com/python-second-language-empathy/ | archive.org

abstract


It’s different! Let’s talk about how.

Because as Python experts (you did choose to come to a Python conference so likely you’re either an expert already or in time you’re going to become one if you keep going to Python conferences) we have a responsibility to help our colleagues and collaborators who don’t know Python as well we do. The part of that responsibility I want to focus on today is when other people have experience with other programming languages but are new to Python.

I work at Dropbox, which as Guido said earlier today is a company that uses a fair bit of Python. But a lot of programmers come to Dropbox without having significant Python experience. Do these people take a few months off when they join to really focus on learning and figure out exactly how Python works, having a lot of fun while they do it? That would great (briefly shows slide of Recurse Center logo) but that’s not what usually happens. Instead they learn on the job, they start making progress right away. They’ll read some books (my favorite is Python Essential Reference, but I hear Fluent Python is terrific), watch some Python talks, read some blog posts, ask questions at work, and Google a lot. That last one is the main one, lots of Google and lots of Stack Overflow.

Learning primarily by Googling can leave you with certain blind spots. If the way that you’re learning a language is by looking up things that are confusing to you, things that aren’t obviously confusing aren’t going to come up.

We ought to be trying to understand our colleagues’ understandings of Python. This is a big thing whenever you’re teaching, whenever you’re trying to communicate with another person: trying to figure out their mental model of a situation and providing just the right conceptual stepping stones to update that model to a more useful state.

We should try to understand the understandings of Python of people coming to Python as a new language. I’m going to call this “Python-as-a-second-language empathy.”

How do we build this PaaSL empathy thing?

The best thing you can do is learn another language first, and then learn Python. Who here has another language that they knew pretty well before learning Python? (most hands go up) Great! Terrific! That’s a superpower you have that I can never have. I can never unlearn Python, become fluent in another language, and then learn Python again. You have this perspective that I can’t have. I encourage you to use that superpower to help others with backgrounds similar to your own. I’d love to see “Django for Salesforce Programmers” as a talk at a Python conference because it’s very efficient when teaching to be able to make connections to a shared existing knowledge base.

Another thing you can do to build this PAASL empathy (I’m still deciding on an acronym) is to learn language that are different than the ones you know. Every time you learn a new language you’re learning new dimensions on which someone could have a misconception.

Consider the following assignment:

a = b + c

Depending on the languages you know, you might make different assumptions about the answers to the following questions:

  • Will a always be equivalent to the sum of b and c from now on, or will that only be true right after we run this code?
  • Will b + c be evaluated right now, or when a is used later?
  • Could b and c be function calls with side effects?
  • Which will be evaluated first?
  • What does plus mean, and how do we find out?
  • Is a a new variable, and if so is it global now?
  • Does the value stored in a know the name of that variable?

These are questions you can have and ways that someone might be confused, but if you’re not familiar with languages that answer these questions in different ways you might not be able to conceive of these misunderstandings.

Another you thing you can do to build PSL empathy is listen. Listen to questions and notice patterns in them. If you work with grad students who know R and are learning Python, try to notice what questions repeatedly come up.

In a general sense, this is what my favorite PyCon speaker Ned Batchelder does a wonderful job of. Ned is a saint who spends thousands of hours in the #python irc channel repeatedly answering the same questions about Python. He does a bunch of other things like run the Boston Python Users Meetup group, and he coalesces all this interaction into talks which concisely hit all the things that are confusing about whatever that year's talk topic is.

The final idea for building Py2ndLang empathy I’ll suggest is learning the language that your collaborator knows better so you can better imagine what their experience might be like. If you colleague is coming from Java, go learn Java! For this talk I did a mediocre job of learning C++ and Java. I did some research so I could try to present to you some of the things that could be tricky if you’re coming to Python from one of these languages. I chose these languages because they’re common language for my colleagues. It’s very reasonable to assume that a programming language will work like a language you already know, because so often they do! But then when there’s a difference it’s surprising.

C++ and Java are not my background! While Python was the first language I really got deep into, I had previous exposure to programming that colored my experience learning Python. My first programming language was TI-81 Basic, then some Excel that my mom taught me. In the Starcraft scenario editor you could write programs with a trigger language, so I did some of that. In middle school I got to use Microworlds Logo, which was pretty exciting. I did a little Visual Basic, got to college and did some MATLAB and some Mathematica, and then I took a CS course where they taught us Python.

My misconceptions about Python were so different than other students’, some of whom had taken AP Computer Science with Java in high school. The languages I learned were all dynamically typed languages with function scope, so I didn’t have the “where are my types?” reaction of someone coming from Java.

Java and C++ are good languages to focus on because they’re often taught in schools, so when interviewing or working with someone right out of undergrad it can be useful to try to understand these languages.

Before we get to a list of tricky bits, there are some thinks I won’t talk about because I don’t call then “tricky.” Not that they aren’t hard, but they aren’t pernicious, they’re misunderstandings that will be bashed down pretty quickly instead of dangerously lingering on. New syntax like colons and whitespace, new keywords like yield; Python gives you feedback in the form of SyntaxErrors about the first group, and there’s something to Google for with the second. When you first see a list comprehension in Python, you know there’s something not quite normal about this syntax, so you know to research it or ask a question about it.

Let’s split things that are tricky about Python for people coming from Java or C++ into three categories: things that look similar to Java or C++ but behave differently, things that behave subtly differently, and “invisible” things that leave no trace. The first category is tricky because you might not think to look up any differences, the second because you might test for differences and at a shallow level observe none when in fact some lurk deeper. The third is tricky because there’s no piece of code in the file you’re editing that might lead you to investigate. These are pretty arbitrary categories.

Look similar, behave differently

Decorators

There’s a think in Java called an annotation that you can stick on a method or a class or some other things. It’s a way of adding some metadata to a thing. And then maybe you could do some metaprogramming-ish stuff where you look at that metadata later and make decisions about what code to run based on them. But annotations are much less powerful than Python decorators.

>>> @some_decorator
... def foo():
...     pass
... 
>>> foo
<quiz.FunctionQuestion object at 0x10ab14e48>

Here (in Python) a python decorator is above a function, but what comes out is an instance of a custom class “FunctionQuestion” - it’s important to remember that decorators are arbitrary code and they can do anything. Somebody coming from Java might miss this, thinking this is an annotation adding metadata that isn’t transforming the function at definition time.

Class body assignments create class variables

I’ve seen some interesting cool bugs before because of this. The two assignments below are two very different things:

class AddressForm:
    questions = ['name', 'address']

    def __init__(self):
        self.language = 'en'

questions is a class attribute, and language is an instance attribute. These are ideas that exist in Java and C++ with slightly different names (questions might be called a “static” variable, and language called a “member” variable), but if you see something like the top in one of those languages people might assume you’re initializing attributes on an instance; they might think the first thing is another way of doing the second.
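
One of those bugs in miniature: because the class attribute is shared, mutating it through one instance changes it for every instance, while instance attributes stay separate.

class AddressForm:
    questions = ['name', 'address']      # class attribute, shared by every instance

    def __init__(self):
        self.language = 'en'             # instance attribute, one per instance

a = AddressForm()
b = AddressForm()
a.questions.append('zip code')           # mutates the shared list...
print(b.questions)                       # ['name', 'address', 'zip code'] -- b sees it too
a.language = 'de'
print(b.language)                        # 'en' -- unaffected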

Run-time errors, not compile-time

Here I’ve slightly misspelled the word “print:”

if a == 2:
    priiiiiiiiiiiiint("not equal")

This is valid Python code, and I won’t notice anything unusual about it until a happens to be 2 when this code runs. I think people coming from languages like Java and C++ with more static checks will get bitten by this before too long and get scared of it, but there are a lot of cases for them to think about.

try:
    foo()
except ValyooooooooooError:
    print('whoops')

Here I've slightly misspelled ValueError, but I won't find out until foo() raises an exception.

try:
    foo()
except ValueError:
    priiiiiiiiiiiiiiint('whoops')

Here ValueError is fine, but the code below it won’t run until foo() raises an exception.

Conditional and Run-Time Imports

Particularly scary examples of the above issue feature imports because people may think imports work like they do in Java or C++: something that happens before a program runs.

try:
    foo()
except ValueError:
    import bar
    bar.whoops()

It’s not until foo() raises a ValueError that we’ll find out whether the bar module is syntactically valid because we hadn’t loaded it yet, or whether a file called bar.py exists at all!

Block Scope

This might blow your mind if you’re mostly familiar with Python: there’s this idea called block scope. Imagine that every time you indented you got a new set of local variables, and each time you dedented those variables went away. People who use Java or C++ are really used to this idea, they really expect that when they go out of a scope (which they use curly brackets to denote, not indentation) that those variables will go away. As Python users, we might know that in the below,

def foo():
    bunch = [1, 2, 3, 4]
    for apple in bunch:
       food = pick(apple)

    print(apple)
    print(food)

the variables apple and food “escape” the for loop, because Python has function scope, not block scope! But this sneaks up on people a lot.

Introducing Bindings

The above is sort of a special case of something Ned Batchelder has a great talk about, which is that all the statements below introduce a new local variable X:

X = ...
for X in ...
[... for X in ...]
(... for X in ...)
{... for X in ...}
class X(...):
def X(...):
def fn(X): ... ; fn(12)
with ... as X:
except ... as X:
import X
from ... import X
import ... as X
from ... import ... as X

(these examples taken from the talk linked above)

import in a function introduces a new local variable only accessible in that function! Importing in Python isn’t just telling the compiler where to find some code, but rather to run some code, stick the result of running that code in a module object, and create a new local variable with a reference to this object.
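
For example, the module object is bound to a name that lives and dies with the function call:

def load_config():
    import json                  # binds a local name "json", visible only inside load_config
    return json.loads('{"debug": true}')

print(load_config())             # {'debug': True}
# print(json.dumps({}))          # NameError: "json" was never bound at module level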

Subtle behavior differences

Assignment

Assignment with = in Python is like Java: it always binds a reference and never makes a copy (whereas assignment copies by default in C++).

Closures

A closure is a function that has references to outer scopes. (mostly - read more) C++ and Java have things like this. Lambdas in C++ require their binding behavior to be specified very precisely, so each variable might be captured by value or by reference or something else. So a C++ programmer will at least know to ask the question in Python, “how is this variable being captured?” But in Java the default behavior is to make the captured variable final, which is a little scarier because a Java programmer might assume the same about Python closures.
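
The classic surprise is that Python closures capture variables rather than values, so every lambda below sees the loop variable's final value:

callbacks = []
for i in range(3):
    callbacks.append(lambda: i)      # each lambda closes over the *variable* i

print([f() for f in callbacks])      # [2, 2, 2], not [0, 1, 2]

# A common fix: force early binding with a default argument.
callbacks = [lambda i=i: i for i in range(3)]
print([f() for f in callbacks])      # [0, 1, 2]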

GC

It’s different! We have both reference counting and garbage collection in Python. This makes it sort of like smart pointers in C++ and sort of like garbage collection in Java. And __del__ finalizer doesn’t do what you think it does in Python 2!

Explicit super()

In Java and C++ there exist cases where the parent constructor for an object will get called for you, but in Python it's necessary to call the parent method implementation yourself with super() if a class overrides a parent class method. Super is a very cooperative sort of thing in Python; a class might have a bunch of superclasses in a tree and to run all of them requires a fancy method resolution order. This works only so long as every class calls super.
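
A small example of that cooperation: the chain silently stops at whichever class forgets to call super().

class Base:
    def setup(self):
        print("Base.setup")

class Logging(Base):
    def setup(self):
        print("Logging.setup")
        super().setup()              # keeps the chain going

class Caching(Base):
    def setup(self):
        print("Caching.setup")       # forgets super(), so Base.setup never runs

class App(Logging, Caching):
    def setup(self):
        print("App.setup")
        super().setup()

App().setup()                        # App.setup, Logging.setup, Caching.setup -- then the chain stops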

I’ll translate this one to the transcript later - for now you’ll have to watch it because the visual is important: explicit super.

Invisible differences

Properties and other descriptors

It can feel odd to folks coming from C++ or Java that we don’t write methods for getters and setters in Python; we don’t have to because ordinary attribute get and set syntax can cause arbitrary code to run.

obj.attr
obj.attr = value

This is in the invisible category because, unless you go to the source code of the class, it’s easy to assume code like this only reads or writes a variable.
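
For example, a property (one kind of descriptor) lets plain attribute syntax run arbitrary code. A minimal sketch:

class Thermostat:
    def __init__(self):
        self._celsius = 20

    @property
    def fahrenheit(self):
        return self._celsius * 9 / 5 + 32

    @fahrenheit.setter
    def fahrenheit(self, value):
        self._celsius = (value - 32) * 5 / 9

t = Thermostat()
print(t.fahrenheit)      # looks like a read, but runs the getter: 68.0
t.fahrenheit = 212       # looks like a write, but runs the setter
print(t._celsius)        # 100.0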

Dynamic Attribute Lookup

Attribute lookup is super dynamic in Python! Especially when writing tests and mocking out behavior, it’s important to know (for instance) that a data descriptor on a parent class will shadow an instance variable with the same name.
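
A small sketch of that shadowing rule, using a property as the data descriptor:

class Parent:
    @property
    def name(self):
        return "from the property"

class Child(Parent):
    pass

c = Child()
c.__dict__["name"] = "from the instance dict"
print(c.name)                # "from the property" -- the data descriptor wins
print(c.__dict__["name"])    # the instance value is still there, just shadowed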

Monkeypatching

Swapping out implementations on a class or an instance is going to be new to people. It could happen completely on the other side of your program (or your test suite) but affect an object in your code.
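
A tiny made-up example of what that looks like:

class Greeter:
    def greet(self):
        return "hello"

def shout(self):
    return "HELLO!"

g = Greeter()
Greeter.greet = shout    # patch the class: existing instances are affected too
print(g.greet())         # "HELLO!"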

Metaprogramming

It takes fewer characters in Python!

get_user_class("employee")("Tom", 1)

The code above returns a class object based on the string “employee” and then creates an instance of it. It might be easy to miss this if you expect metaprogramming to take up more lines of code.
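
get_user_class isn’t defined in the talk, but here is one hedged guess at the kind of thing it might do, with made-up classes purely for illustration:

class Employee:
    def __init__(self, name, user_id):
        self.name, self.user_id = name, user_id

class Contractor(Employee):
    pass

def get_user_class(kind):
    # Classes are ordinary objects, so we can look one up by name at runtime.
    return {"employee": Employee, "contractor": Contractor}[kind]

tom = get_user_class("employee")("Tom", 1)
print(type(tom).__name__, tom.name, tom.user_id)   # Employee Tom 1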

Python 2 Whitespace Trivia

A tab counts as 8 spaces in Python 2 for the purposes of parsing significant whitespace, but editors usually display it as 4!

Should we try to teach everyone all these things right now? Maybe not! If someone is interested, sure. But I think it’s hard to hit all of these without much context. And be careful not to assume people don’t know these things; maybe they already know 80% of them. I think this cheat sheet presents things that are important to be aware of while teaching whatever other topic is most pedagogically appropriate.

I don’t have time to talk much about teaching, so I’ll point to Sasha Laundy’s talk (embedded above), which I love, and quickly quote Rose Ames: “knowledge is power; it’s measured in wats.” I think a great way to broach a misunderstanding is to present someone with a short “wat” code sample that demonstrates the misconception exists, without necessarily explaining it, because often all someone needed was to have a flaw in their model pointed out to them.

Code review is a great impetus for sending someone such a wat. I don’t have time to talk about code review so I’ll point to this terrific post by Sandya Sankarram about it.

Another thing we can do with this information is to write it in code comments. I think of comments as the place to explain why code does a thing, not to explain what that code is doing. But if you know that what’s happening in your code might surprise someone less familiar with Python, maybe you should say what it’s doing? Or maybe you should write simpler code and not do that interesting Python-specific thing.

In the same way Python library authors sometimes write code that straddles Python 2 and 3 by behaving the same in each, imagine writing Python code that, if it were Java or C++, would do the same thing. Perhaps you’d have quite unidiomatic code, but perhaps it’d be quite clear.

image from this Stack Overflow blog post

Python is becoming more popular. Maybe this means more people will understand it, and we’ll get to use all our favorite Python-specific features all the time! Or maybe it means Python becomes the lingua franca, which ought to be kept as simple and clear as possible. I imagine it will depend on the codebase. I think that as a codebase grows, tending toward code that is less surprising to people who don’t know Python well probably makes more sense.

One final use for this cheat sheet is interviewing: interviewing is a high-time-pressure communication exercise where it really helps to try to anticipate the other person’s understanding of a thing. Candidates often interview in Python but know C++ or Java better. If I can spot a misunderstanding, like initializing instance variables in the class statement, I can quickly clarify it with the candidate and we can move on. Or perhaps I don’t even need to, if the context is clear enough. And when I’m interviewing at companies, it’s helpful to remember which parts of my Python code I might need to explain to someone not as familiar with the language.

]]>
https://ballingt.com/python-second-language-empathy/ hacker-news-small-sites-43191696 Thu, 27 Feb 2025 05:46:32 GMT