---
title: Things I Learned - 11 Aug 2024
date: 2024-08-11T00:00:00+00:00
categories:
  - til
description: I explored Agentic RAG for complex retrieval, fine-tuning with LoRAX, and practical LLM strategies. Key takeaways include using N-shot prompting before scaling models, automating workflows via disposable apps, and leveraging context caching to significantly reduce inference costs.
keywords: [agentic rag, llamaindex, lora, prompt engineering, llmops, text-to-sql, deepseek]
---

This week, I learned:

- Embedding models can be fine-tuned. Example: #TODO
- Agentic RAG (Ravi Theja, LlamaIndex)
  - RAG via top-k retrieval fails with
    - summarization => need to read all chunks
    - comparison: compare product X vs Y => need to split and re-combine
    - structured analytics. e.g. most expensive employees => Text2SQL first
    - multi-part questions. e.g. Tell me about speed of model X AND cost of model Y and recommend => need to split and re-combine
  - Why plain RAG fails: it's single-shot. No query planning, no tools, no correction, no memory.
  - Agents that help in RAG
    - Route to the right tool
      - E.g. retrieve via vector top-k search or vector summary search or keyword search or combination?
    - One-shot query planning
      - E.g. Break query into multiple specific queries. RAG those. Then combine. #TRY - maybe in DocSearch
    - Tool use
      - E.g. Schema retrieval, Text2SQL, Calendar, Chat, APIs, Search, etc.
  - Agent orchestration
    - ReAct: An agent reasoning loop. Reason + Act. {Thought, Action, Action Input, Observation}\*.
      - [Orchestrate tools with a prompt](https://docs.llamaindex.ai/en/stable/examples/agent/react_agent/)
    - Multi-agent task solver: [Llama agents](https://github.com/run-llama/llama-agents)
      - Instead of a single agent loop, use different agents. Also allows parallelization
      - Allow services to register. (MS TaskWeaver stores tool descriptions in YAML)
  - [LlamaHub Tools](https://llamahub.ai/?tab=tools) has ideas for agents
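To make the ReAct loop above concrete, here is a minimal sketch in plain Python. It is not the LlamaIndex implementation: the tools (`vector_search`, `text2sql`), the `scripted_llm` stand-in, and the `react` function are all hypothetical names invented for illustration. A real agent would replace `scripted_llm` with an actual LLM call.

```python
import re

# Hypothetical tools. A real agent would query a vector store or a database.
def vector_search(query: str) -> str:
    return f"top-k chunks for '{query}'"

def text2sql(query: str) -> str:
    return f"SQL result for '{query}'"

TOOLS = {"vector_search": vector_search, "text2sql": text2sql}

def scripted_llm(transcript: str) -> str:
    """Stand-in for an LLM: picks the next step from the transcript so far."""
    if "Observation:" not in transcript:
        return (
            "Thought: This is structured analytics, so I need SQL.\n"
            "Action: text2sql\n"
            "Action Input: most expensive employees"
        )
    return "Thought: I have what I need.\nFinal Answer: see the SQL result"

def react(question: str, llm=scripted_llm, max_steps: int = 5) -> str:
    """Loop over {Thought, Action, Action Input, Observation} until an answer."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        reply = llm(transcript)
        transcript += reply + "\n"
        final = re.search(r"Final Answer:\s*(.*)", reply)
        if final:
            return final.group(1)
        action = re.search(r"Action:\s*(\w+)", reply)
        action_input = re.search(r"Action Input:\s*(.*)", reply)
        if not action or action.group(1) not in TOOLS:
            # Feed the error back so the model can correct itself.
            transcript += "Observation: unknown tool\n"
            continue
        argument = action_input.group(1) if action_input else ""
        transcript += f"Observation: {TOOLS[action.group(1)](argument)}\n"
    return "No answer within max_steps"

print(react("Who are the most expensive employees?"))
```

The transcript is the agent's only memory here. Each observation is appended to it, so the next "Thought" can react to the tool result, which is what single-shot RAG lacks.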
- Notes on LLM Fine-Tuning
  - ROUGE-2, BLEU, and similar metrics are NOT good. Create your own benchmarks.
  - Non-PEFT (full) fine-tuning needs 6X GPU RAM. The overhead comes from optimizer states, gradients, and activations. PEFT tunes only a subset of parameters.
  - LoRA adds extra weights without updating the base model. Each adapter is a low-rank matrix product. You can swap adapters at runtime. It saves space and is fast to train.
  - Quantization: Stick to bitsandbytes or AWQ (may be a bit better)
  - QLoRA = Quantization + LoRA
  - Predibase has open-sourced LoRA adapters in "LoRA Land". Existing adapters are pretty good.
    - The `ghcr.io/predibase/lorax:main` Docker image runs locally under Docker Compose. The `devices:` key in Docker Compose lets you specify NVIDIA GPU devices.
    - Locust is an HTTP load-testing library in Python
  - Techniques for inference optimization
    - Dynamic adapter loading: LoRAX loads the right LoRA adapter WHEN a request comes in
    - Multi-adapter batching: Process all inputs in parallel on the same GPU, but different users are post-processed using different adapters
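The "low-rank matrix product" idea behind LoRA can be sketched in a few lines of plain Python. This is a toy illustration, not how any library implements it: the function names and the 4096 x 4096 layer with rank 8 are assumptions chosen only to show the arithmetic.

```python
def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, scale=1.0):
    """y = W x + scale * B (A x). W stays frozen; only A and B are trained."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + scale * d for b, d in zip(base, delta)]

def lora_param_count(d_out, d_in, rank):
    """Trainable parameters in one adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)

# Toy 2x2 layer with a rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weights (d_out x d_in)
A = [[1.0, 1.0]]              # rank x d_in
B = [[0.5], [0.0]]            # d_out x rank
x = [2.0, 3.0]

print(lora_forward(W, A, B, x))        # [4.5, 3.0]
print(lora_param_count(4096, 4096, 8)) # 65536, versus 16777216 in the full matrix
```

Because the base weights `W` never change, swapping adapters at runtime only means swapping the small `A` and `B` matrices. That is also why many adapters can share one copy of the base model on the same GPU.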
- Notes from a 4-hour flight:
  - [What We’ve Learned From A Year of Building with LLMs](https://applied-llms.org/)
    - Strategy
      - IS IT TOO HARD/EXPENSIVE? Log it. LLMs are getting cheaper and better.
      - WILL OPENAI BUILD IT? If so, wait for it instead of building.
      - HAS A STARTUP BUILT IT? If so, use it instead. It's a generic use case, so there's no point re-inventing it.
      - FOCUSED USE CASES over generic. Build trust by starting small.
      - Tools for LLM Ops (feedback): LangSmith, Log10, LangFuse, W&B Weave, HoneyHive #TRY
      - Human in the Loop is about humans evaluating model outputs. That's different from "AI in the loop, human in the center", where AI accelerates human output (like GitHub Copilot).
    - Operations
      - CHECK EMBEDDINGS DRIFT over time. Users might be entering different inputs than before.
      - LOG AND REVIEW everything.
      - [Instructor](https://github.com/jxnl/instructor) coaxes structured output from LLM APIs. #TRY
      - IMPLICIT FEEDBACK collection is easy. Just let users edit stuff. #TRY
    - Tactical
      - Try n-shot prompting (n=5-12) before bigger models. #TRY
      - Always structure for output: Markdown, XML/HTML tags.
      - Combine RAG with Keyword search. It reduces user frustration in edge cases.
      - Prefer multiple small prompts to one big prompt. Do X. Then Y. Then Z.
      - Jitter prompts for diversity beyond temperature.
      - LLM-as-judge works better when comparing outputs (not rating 1 output). Keep length similar (LLMs prefer wordiness). Swap order and compare. Allow for ties. Ask for reason FIRST.
  - [Hermes: A Text-to-SQL solution at Swiggy](https://bytes.swiggy.com/hermes-a-text-to-sql-solution-at-swiggy-81573fb4fb6e)
    - "Hermes performed significantly better for charters with well-defined metadata and a relatively smaller number of tables."
    - "We collect feedback on the accuracy of the returned query from stakeholders directly within the Slack bot."
  - [How I use AI](https://nicholas.carlini.com/writing/2024/how-i-use-ai.html) and "Replacing my right hand with AI"
    - EMBED in every app/workflow. E.g. Auto-fix spellings. Auto-review code. Auto-ask LLM on errors and apply patch! Auto-search for answer, assess, continue.
    - PERSIST. Stick with the LLM to the end. Don't fix it yourself. It's faster. #TRY
      - INTERVENE FAST. If an LLM can't solve it by itself in 2 tries, it needs in-depth help.
    - APP-IFY one-off tasks. Disposable tools. "Write web-app to convert JSON to tab-delimited." "Extract fields as a table." "Diff JSON." #TRY
    - BEST language/frameworks preferred. CUDA in Python. Rust. C. Raspberry Pi. Arduino. Bluetooth. Modern ESM/JS. #TRY
    - TEACH examples. "Here's the LLM Foundry API." "Here's how to use gramex.data."
    - DUMP entire code. Models can handle it. Refactoring to SQLAlchemy 2, Pandas 2. API Documentation. Test case generation. #TRY
    - ASK for features & packages. Docker without root access. GPU access inside docker. Windows CLI-only C++ compiler.
    - TEST CASE writing. #TRY
    - SPEC IN DETAIL. Use these libraries. Write like this: code example.
    - SPEC _USAGE_ in detail.
      - "I will just pipe it into sqlite", or "I will just run `ffmpeg -i filename [YOUR OPTIONS]`".
      - Describe the UI, API input/output, data structure, and internal data structure.
    - HELP on usage. "ffmpeg to get audio.mp3".
  - [My benchmark for large language models](https://nicholas.carlini.com/writing/2024/my-benchmark-for-large-language-models.html)
    - LLM(text) is a useful function to have in JS and Python too. Useful as a simple `pip install llmfoundry`
    - Allow images, files in LLM()
  - Current list of #IMPOSSIBLE (or hard) things for LLMs
    - Translate technical documents to Dutch -- because they don't understand the technical terms well
    - Translate large documents (JSON to XML, English to Chinese, Python to Rust, Wrong to right spelling) -- because the output tokens are limited
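The LLM-as-judge advice from the flight notes (compare outputs, swap order, allow ties) can be sketched as below. The `judge` callable is hypothetical: it stands in for an LLM call that sees two outputs and returns `"first"`, `"second"`, or `"tie"`. In a real prompt you would also ask the model for its reason before the verdict.

```python
def pairwise_judge(a: str, b: str, judge) -> str:
    """Compare outputs A and B in both orders. Return 'A', 'B', or 'tie'.

    `judge(first, second)` is a hypothetical LLM wrapper that returns
    'first', 'second', or 'tie'.
    """
    forward = judge(a, b)   # A shown first
    backward = judge(b, a)  # B shown first
    votes_a = (forward == "first") + (backward == "second")
    votes_b = (forward == "second") + (backward == "first")
    if votes_a > votes_b:
        return "A"
    if votes_b > votes_a:
        return "B"
    return "tie"  # disagreement across orders counts as a tie

# Stub judges for illustration only.
def position_biased(first: str, second: str) -> str:
    return "first"  # always prefers whatever it sees first

def prefers_mentions_cost(first: str, second: str) -> str:
    has_first, has_second = "cost" in first, "cost" in second
    if has_first == has_second:
        return "tie"
    return "first" if has_first else "second"

print(pairwise_judge("fast model", "slow model", position_biased))              # tie
print(pairwise_judge("speed and cost", "speed only", prefers_mentions_cost))    # A
```

Swapping the order cancels out position bias: a judge that always picks the first answer ends up voting once for each side, so the result is a tie rather than a false win.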
- [micro-agent](https://github.com/BuilderIO/micro-agent) generates test cases first when asked to build an app. Then it iterates until the test cases pass.
- Alternative interfaces to YouTube: Piped.video, CloudTube, Invidious, NewPipe, FreeTube
- [DeepSeek Context Caching](https://platform.deepseek.com/api-docs/news/news0802/) cuts the input price to 1.4 cents/MTok for the repeated portions of chat messages (cache hits). That's a 10X reduction for long conversations!
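A rough back-of-the-envelope for the caching claim, under stated assumptions: the uncached price of 14 cents/MTok is inferred from the "10X" figure above, each turn adds 1,000 input tokens, every request resends the full history, and all previously sent tokens hit the cache. Real cache behaviour may be less ideal.

```python
CACHE_HIT_CENTS_PER_MTOK = 1.4    # from the note above
CACHE_MISS_CENTS_PER_MTOK = 14.0  # assumption: implied by the 10X figure

def input_cost_cents(total_tokens: int, cached_tokens: int) -> float:
    """Input cost in cents when `cached_tokens` of `total_tokens` hit the cache."""
    fresh_tokens = total_tokens - cached_tokens
    return (
        cached_tokens * CACHE_HIT_CENTS_PER_MTOK
        + fresh_tokens * CACHE_MISS_CENTS_PER_MTOK
    ) / 1_000_000

def conversation_tokens(turns: int, tokens_per_turn: int) -> tuple[int, int]:
    """Return (total input tokens, tokens repeated from earlier turns)."""
    total = sum(i * tokens_per_turn for i in range(1, turns + 1))
    repeated = sum((i - 1) * tokens_per_turn for i in range(1, turns + 1))
    return total, repeated

total, repeated = conversation_tokens(turns=10, tokens_per_turn=1000)
without_cache = input_cost_cents(total, 0)        # about 0.77 cents
with_cache = input_cost_cents(total, repeated)    # about 0.203 cents
print(total, repeated, round(without_cache, 3), round(with_cache, 3))
```

In this 10-turn example the saving is closer to 4X than 10X, because the new tokens in each turn still pay full price. The saving approaches 10X only as the repeated history dominates the input.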