---
title: Things I Learned - 18 May 2025
date: 2025-05-18T00:00:00+00:00
categories:
  - til
description: I explored storage options for data under 1GB, from GitHub Releases to MotherDuck. I also learned about encrypted LLM inference, Pandoc extensions for Markdown, and why you should always schedule data deletions instead of doing them live.
keywords: [github releases, motherduck, pandoc, openalex, encrypted inference, uv, bootstrap, systemd]
---

This week, I learned:

- Migrating birds appear to navigate using quantum effects (possibly entangled electron pairs in their eyes)! [Guardian](https://www.theguardian.com/science/2025/mar/23/they-have-no-one-to-follow-how-migrating-birds-use-quantum-mechanics-to-navigate) [ChatGPT](https://chatgpt.com/share/68282f03-3978-800c-8e46-e9979887317d)
- [DeerFlow](https://github.com/bytedance/deer-flow) is an open-source Deep Research MCP. It lets you run deep research outside the standard chatbots.
- ⭐ Today, if I had to store a bunch of data files (e.g. Parquet) under 1 GB, I would use GitHub Releases. Here are the options:
  - **GitHub Releases**. 2 GiB **per file**, unlimited total & bandwidth. 🟢 Immortal URL, versioning, easy CI publish. 🔴 Each file must stay < 2 GiB; no built-in SQL.
  - **Zenodo** (CERN). 50 GB per record; one-off bumps to 200 GB. 🟢 DOI assignment, archival mandate. 🔴 Occasional throttled bandwidth; no API for partial file reads.
  - **Hugging Face Hub**. 300 GB per repo; 50 GB per file. 🟢 Git-based, dataset tooling, lively ML community. 🔴 Large files need git-LFS; pushes via LFS can be slow.
  - **Cloudflare R2**. 10 GB storage & 1 M ops / month (free tier). 🟢 S3 API, zero-egress to Cloudflare Workers, fast. 🔴 Free storage is capped at 10 GB.
  - **Kaggle Datasets**. 20 GB per dataset, public only. 🟢 Built-in notebooks & GPU. 🔴 No programmatic SQL API; quotas sometimes change.
  - **data.world (free)**. 1 GB total, 100 MB per dataset. 🟢 Nice social features. 🔴 Too small for all but tiny datasets.
- If I had to query data in external Parquet or SQLite files, here are SQL engines as a service:
  - **MotherDuck**. 10 GB storage + 10 CU-hrs/mo compute. Native DuckDB; no credit card; GA June 2024; monthly feature drops.
  - **Datasette Cloud**. Two-month trial (or 1-yr for non-profits). SQLite backend. Great UX; but not free forever for general use.
  - **AWS Athena**. Pay per TB scanned; no free tier, and the S3 free tier ends after 12 months. Costs creep up quickly.
- Bootstrap has a [`.stretched-link`](https://getbootstrap.com/docs/5.3/helpers/stretched-link/) that makes a link cover the containing block. A clever trick that I discovered when Claude 3.5 Sonnet wrote [my code](https://github.com/sanand0/sanand0.github.io/blob/0932f2efe3ad6c950c20b2ed7534ef27d8fff304/update.js#L62).
- Discovered spray and peel paints at [ArtFriend](https://artfriendonline.com/). I had no idea that was a thing.
- [Gemini Live API](https://ai.google.dev/gemini-api/docs/live) is Gemini's real-time (low-latency voice and video) API. It supports tools, search, and code execution.
- [mcp-mem0](https://github.com/coleam00/mcp-mem0) is an MCP server for memory.
- [llm-min.txt](https://github.com/marv1nnnnn/llm-min.txt) compresses docs for LLMs to read optimally. Like a compressed [llms.txt](https://llmstxt.org/) or [context7](https://context7.com/). Usage: `GEMINI_API_KEY=... uvx llm-min -i $DIR` #ai-coding
- There's a lot of action on encrypted LLM operations.
  - OpenAI's Responses API allows reasoning tokens to be encrypted if organizations don't want their reasoning data to persist. [Ref](https://cookbook.openai.com/examples/responses_api/reasoning_items)
  - [Tinfoil](https://tinfoil.sh/) (YC X25) offers an OpenAI-compatible inference API where data is encrypted from the client to NVIDIA Hopper/Blackwell GPUs running in confidential computing mode. Prompts, model weights, and outputs are encrypted in transit and in memory, and the code running on the GPU can be verified.
  - [Modelyo](https://modelyo.com/) (Israel) offers VMs/K8s clusters with encrypted GPUs across multiple cloud providers with continuous attestation, managed on Modelyo's portal.
- ⭐ LLMs can work independently on longer and longer tasks. That's a useful metric to track. [METR: Measuring AI Ability to Complete Long Tasks](https://www.lesswrong.com/posts/deesrjitvXM4xYGZd/metr-measuring-ai-ability-to-complete-long-tasks).
- If you're looking for datasets / APIs related to research publications (especially funding), then explore:
  - Crossref [API](https://api.crossref.org/swagger-ui/index.html) and [snapshots](https://www.crossref.org/documentation/retrieve-metadata/rest-api/tips-for-using-public-data-files-and-plus-snapshots/)
  - OpenAlex [API](https://docs.openalex.org/) and [snapshots](https://docs.openalex.org/download-all-data/openalex-snapshot), run by the nonprofit [OurResearch](https://ourresearch.org/). OpenAlex is like Crossref but adds some disambiguation (e.g. of authors and institutions)
  - [OpenAIRE Graph](https://graph.openaire.eu/docs/category/downloads/) [2024](https://zenodo.org/records/13133184) / [2025](https://zenodo.org/records/14851262)
  - [Europe PMC](https://pmc.ncbi.nlm.nih.gov/articles/PMC10767826/) [dataset](https://ftp.ebi.ac.uk/pub/databases/pmc/)
- To stop Ubuntu 24 from suspending when you close the laptop lid, set one of these and restart:
  - `/etc/systemd/logind.conf`: Set `HandleLidSwitch=ignore`
  - `/etc/UPower/UPower.conf`: Set `IgnoreLid=true`
- `UV_TORCH_BACKEND=auto uv pip install torch torchvision torchaudio` installs the most appropriate PyTorch version. [Ref](https://docs.astral.sh/uv/guides/integration/pytorch/#automatic-backend-selection)
- [Cog](https://cog.readthedocs.io/en/latest/) is a Python-based templating tool. You embed Python code as comment chunks in any file, and Cog inserts that code's output into the file.
- [Cloudflare Zero Trust](https://www.cloudflare.com/en-in/zero-trust/products/access/) seems to be the easiest way to enable auth on static websites, especially if your DNS is already on Cloudflare. No cost.
- We could "fine-tune" system prompts automatically with evals, creating a "system prompt learning" paradigm -- like my [promptevals](https://github.com/gramener/promptevals). [Andrej Karpathy](https://x.com/karpathy/status/1921368644069765486)
- I was asked how to improve speed when building an enterprise ChatGPT clone using an API. Here's what I'd suggest, in order:
  - Streaming. High impact, low effort.
  - Caching RAG retrieval as well as generation. High impact, low effort.
  - UI tweaks. Loading / streaming icons and progress hints ("Retrieving context", "Generating answer", etc.)
  - Parallelize, if possible
  - Use model options where available, e.g. speculative decoding, models with higher speed, models with closer CDN, etc.
  - Shorten prompts
  - Persistent HTTP/2 Keep-Alive. Low impact, low effort (tweak server settings).
- [Cloudflare Vectorize](https://developers.cloudflare.com/vectorize/platform/pricing/), at 768 dimensions / embedding, is free for ~6.5K chunks storage at ~1,000 queries / day. For a light load like 1M 768d chunks queried 1K times a day, the cost is: [ChatGPT](https://chatgpt.com/share/6821a25a-9f80-800c-8d95-8b2200ad6de4)
- [NVIDIA parakeet](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2) is a lightweight speech-to-text model that leads benchmarks. Installing such packages continues to be a nightmare due to PyTorch (despite `uv`).
- I explored the real-time avatar space. HeyGen seems to be the easiest to use, but even that is complex and expensive ($99/mo). We may need to wait a few months for avatars to explode.
- ⭐ Model reliability is a huge enabler for performance. As models become more reliable, they can work autonomously for longer and that is another kind of scaling. [Vending Bench](https://andonlabs.com/evals/vending-bench)
- ChatGPT, Gemini, etc. have become lead generation engines. Chat Bot Optimization (CBO), is it? [WhatsApp + ChatGPT](https://chatgpt.com/share/68215e14-9870-800c-a8e0-4fe476f48cc5)
- ⭐ Never delete data live. Mark it for deletion and schedule a deletion task. That way you have time to react to mistakes. [Simon Willison](https://simonwillison.net/2025/May/14/james-cowling/)
- [Pandoc](https://pandoc.org/MANUAL.html) has several options useful when converting Markdown to HTML (`cat file.md | pandoc -f markdown -t html`). My favorites:
  - `--no-highlight` skips code highlighting. `--highlight-style=pygments` sets the highlighting style (`pygments` is the default)
  - `--wrap=none` disables line wrapping in the output (by default, Pandoc wraps output lines at `--columns`)
  - `--number-sections` numbers section headings (e.g. `1 Introduction`)
  - `--shift-heading-level-by=NUM` shifts all headings by NUM levels (e.g. `1` makes headings start at `<h2>` instead of `<h1>`)
  - `pandoc -f markdown-auto_identifiers` drops the auto-identifiers extension that generates `id=...` for each heading
  - `pandoc -f gfm` uses GitHub-Flavored Markdown. Run `pandoc --list-extensions=gfm` to see which extensions it enables.
  - Pandoc's [Markdown extension examples](https://pandoc.org/demo/example33/8-pandocs-markdown.html) are quite extensive.
  - Auto-enabled GFM extensions:
    - `alerts`: GitHub-style callouts (note, tip, important, warning, caution) via `> [!TYPE]` blocks.
    - `autolink_bare_uris`: Turns bare URLs into links, without needing `<...>`.
    - `emoji`: Parses `:smile:`-style codes into Unicode emoji characters.
    - `footnotes`: Enables footnote syntax with `[^id]` and definitions at the bottom.
    - `gfm_auto_identifiers`: Uses GitHub’s heading-ID algorithm: spaces → dashes, lowercase, removes punctuation.
    - `pipe_tables`: Enables pipe-delimited tables (`| a | b |`).
    - `raw_html`: Passes raw HTML through unchanged.
    - `strikeout`: Enables strikethrough with `~~text~~`.
    - `task_lists`: Parses `- [ ]` and `- [x]` items as checkboxes.
    - `yaml_metadata_block`: YAML front matter for document metadata, e.g. `<title>`
  - GFM extensions worth enabling:
    - `ascii_identifiers`: Strips accents/non-Latin letters in automatically generated IDs.
    - `bracketed_spans`: `[Warning]{.alert}` becomes `<span class="alert">`
    - `definition_lists`: `Term\n: Definition text` becomes a definition list
    - `fenced_divs`: `::: {.note}` block creates a `<div class="note">...</div>`
    - `implicit_figures`: Standalone images become `<figure>` with `<figcaption>`.
    - `implicit_header_references`: `[Section]` links to the heading named "Section", as if written `[Section](#section)`
    - `raw_attribute`: `` `<b>bold</b>`{=html} `` is inserted as raw HTML
    - `smart`: Converts straight quotes to curly, `--` to en-dash, `---` to em-dash, `...` to ellipsis.
    - `subscript` and `superscript`: E.g. `H~2~O` and `E = mc^2^`
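
The Pandoc options and extensions above combine into one format string (`gfm+ext-ext`) plus flags. Here is a minimal Python sketch of that composition. The helper names `pandoc_args` and `to_html` are hypothetical, and whether a given extension is accepted for `gfm` depends on the installed Pandoc version, so treat the extension choices as assumptions and check them against `pandoc --list-extensions=gfm`.

```python
import shutil
import subprocess


def pandoc_args(base="gfm", enable=(), disable=(), options=("--wrap=none",)):
    """Build a pandoc argument list for Markdown-to-HTML conversion.

    Extensions are appended to the input format as +name (enable) or -name (disable).
    """
    fmt = base + "".join(f"+{ext}" for ext in enable) + "".join(f"-{ext}" for ext in disable)
    return ["pandoc", "-f", fmt, "-t", "html", *options]


def to_html(markdown, **kwargs):
    """Convert a Markdown string via pandoc. Returns None if pandoc is not installed."""
    if shutil.which("pandoc") is None:
        return None
    result = subprocess.run(
        pandoc_args(**kwargs), input=markdown, capture_output=True, text=True, check=True
    )
    return result.stdout


# GFM plus two of the "worth enabling" extensions (assumed to be supported locally)
args = pandoc_args(enable=["smart", "fenced_divs"])
print(" ".join(args))  # pandoc -f gfm+smart+fenced_divs -t html --wrap=none

# Pandoc Markdown without automatic heading IDs
no_ids = pandoc_args(base="markdown", disable=["auto_identifiers"], options=())
print(" ".join(no_ids))  # pandoc -f markdown-auto_identifiers -t html
```

Calling `to_html("# Hi", enable=["smart"])` would then shell out to Pandoc, assuming it is on the `PATH`.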