# Model Guide ## Table of Contents - [Supported Models](#supported-models) - [Which Model Should I Pick?](#which-model-should-i-pick) - [Capabilities Explained](#capabilities-explained) - [RAM Requirements](#ram-requirements) - [Context Window](#context-window) - [Importing Your Own Models](#importing-your-own-models) - [Model Sources](#model-sources) - [Model Updates](#model-updates) - [Model Storage](#model-storage) --- ## Supported Models | Model | Size | Context | Capabilities | Min RAM | Best For | |:------|-----:|--------:|:-------------|--------:|:---------| | **Gemma 4 E2B** | 2.4 GB | 32K | Text · Vision · Audio · Thinking · Tools · MTP | 8 GB | All-rounder — chat, vision, audio, tool calling | | **Gemma 4 E4B** | 3.4 GB | 32K | Text · Vision · Audio · Thinking · Tools · MTP | 12 GB | Higher quality than E2B, same capabilities, needs more RAM | | **Gemma 3n E2B** | 3.4 GB | 4K | Text · Vision · Audio | 8 GB | Vision and audio tasks on 8 GB devices | | **Gemma 3n E4B** | 4.6 GB | 4K | Text · Vision · Audio | 12 GB | Higher quality vision/audio on 12 GB+ devices | | **Gemma 3 1B** | 0.5 GB | 1K | Text | 6 GB | Smallest model, fastest responses, text-only | | **Qwen 2.5 1.5B** | 1.5 GB | 4K | Text | 6 GB | Good text quality for its size, longer context than Gemma 3 1B | | **DeepSeek-R1 1.5B** | 1.7 GB | 4K | Text | 6 GB | Reasoning-focused, includes chain-of-thought | All models are downloaded from [HuggingFace](https://huggingface.co/litert-community) in `.litertlm` format and run on-device via Google's [LiteRT-LM](https://github.com/google-ai-edge/LiteRT-LM) runtime. ## Which Model Should I Pick? **Start with Gemma 4 E2B.** It's the best balance of quality, speed, and capability — multimodal (vision + audio), tool calling (experimental), thinking mode, and a 32K context window. | Your Use Case | Recommended Model | Why | |:--------------|:------------------|:----| | Chat UI (e.g. Open WebUI) | Gemma 4 E2B or E4B | Best conversational quality, thinking mode for complex questions | | Image/audio analysis | Gemma 4 E2B or Gemma 3n E2B | Both support vision and audio input | | Low-RAM device (6 GB) | Gemma 3 1B or Qwen 2.5 1.5B | Only text-capable models fit in 6 GB | | Fastest possible responses | Gemma 3 1B | Smallest model, lowest latency | | Longer conversations | Gemma 4 E2B/E4B | 32K context vs 1K–4K on smaller models | | Reasoning tasks | Gemma 4 E2B or Gemma 4 E4B | Both support chain-of-thought reasoning | ## Capabilities Explained | Capability | What It Means | |:-----------|:--------------| | **Text** | Standard text chat and completions | | **Vision** | Send images in API requests — the model can describe, analyze, and answer questions about them | | **Audio** | Send audio in API requests — the model can transcribe and respond via text to spoken content | | **Thinking** | Chain-of-thought reasoning mode — the model shows its reasoning process before answering (toggle per model in inference settings) | | **Tools** | **Experimental.** Function/tool calling via SDK schema injection (default) or prompt-based fallback. With schema injection enabled, tool schemas are registered directly with the LiteRT SDK for structured output. Best with Gemma 4 models, smaller models may not follow tool instructions reliably. See [Troubleshooting → Tool Calling](TROUBLESHOOTING.md#tool-calling-experimental) for tips | | **MTP** | Multi-Token Prediction (speculative decoding) — the model predicts multiple tokens at once for faster decode speed without quality loss. Requires a model file that includes an MTP draft head. Toggle per model in inference settings. See [FAQ → What is speculative decoding?](FAQ.md#what-is-speculative-decoding--mtp) | ## RAM Requirements The "Min RAM" column refers to your device's total RAM, not available RAM. These values come from Google's official model specifications in the [LiteRT community](https://huggingface.co/litert-community) model cards. > [!NOTE] > The model file is loaded into memory alongside the Android OS and other apps. A device with exactly the minimum RAM may experience slow performance or out-of-memory (OOM) kills — Android forcefully closing apps to free RAM. See [Troubleshooting → Server crashes](TROUBLESHOOTING.md#the-server-crashes--stops-unexpectedly) if this happens. In practice: - 8 GB+ is recommended for a smooth experience with most models - 12 GB+ is recommended for Gemma 4 E4B and Gemma 3n E4B ## Context Window The context window determines how much conversation history the model can "see" at once. | Context | Roughly Equivalent To | |--------:|:----------------------| | 1K | ~750 words — a few message exchanges | | 4K | ~3,000 words — a medium conversation | | 32K | ~24,000 words — a long conversation or very long message | > [!NOTE] > The word counts in the table above are approximate, based on ~0.75 words per token. Actual numbers vary depending on language, content type, and vocabulary — code and non-English text typically use more tokens per word. When a conversation exceeds the context window, OlliteRT can automatically compact older messages to fit. Three compaction strategies are available in Settings → Context Management (all disabled by default). See the [FAQ](FAQ.md#what-is-prompt-compaction) for details, or [Troubleshooting](TROUBLESHOOTING.md#long-conversations-fail--context-window-exceeded) if conversations are failing. ## Importing Your Own Models OlliteRT supports three ways to import models. Tap the **+** button on the Models screen to see the options: **From a local `.litertlm` file:** - Select a `.litertlm` model file from your device storage **From a model list file (`.json`):** - Select a JSON file from your device that follows the [Model Allowlist Schema](MODEL_ALLOWLIST_SCHEMA.md) — all models in the list are added to your Models screen in one go **From a model list URL (`.json`):** - Enter a URL pointing to a JSON model list (e.g. a raw GitHub link) — fetches the list and adds all models > [!TIP] > The JSON file and URL imports are one-time operations — models are added but the source is not tracked. For ongoing access to a third-party model source with automatic refresh and update detection, [add it as a model source](#model-sources) instead. ### Import Dialog When importing a `.litertlm` file, the import dialog lets you configure the model before adding it: - **Name** — editable model name (file extension is preserved automatically). If the name conflicts with an existing imported model, you're prompted to replace it. Names that conflict with built-in models are rejected - **Capabilities** — toggle vision, audio, thinking, and tools support. These tell OlliteRT what to advertise — the model itself must actually support the capability - **Max tokens** — configure the maximum output length - **Context window** — set the model's context window size You can also edit all of these settings after import via the model's inference settings (tap the gear icon on the model card). > [!IMPORTANT] > - Only `.litertlm` format is supported — **GGUF files cannot be used** (LiteRT runtime limitation) > - Imported models are copied to app storage (Android scoped storage requirement), so the file will temporarily use double the disk space > > If import fails, see [Troubleshooting → Model import fails](TROUBLESHOOTING.md#model-import-fails). ## Model Sources OlliteRT uses a model source system for managing where models come from — similar to how F-Droid manages app repositories. Each model source is a JSON URL that provides a list of models available for download. ### Built-in Source The **Official** model source is included by default and points to the [LiteRT community](https://huggingface.co/litert-community) models on HuggingFace. It cannot be removed. ### Custom Model Sources You can add custom model sources to make additional models available: 1. Go to **Settings → Model Sources** 2. Tap the **+** button 3. Choose how to add the source: - **From a model list file** — select a local JSON file from your device - **From a URL** — enter the URL of a model list JSON file (e.g. a raw GitHub link) The JSON file must follow the [Model Allowlist Schema](MODEL_ALLOWLIST_SCHEMA.md). Custom model sources can be enabled, disabled, or removed at any time. Disabled sources are hidden from the Models screen but their configuration is preserved. ### Incompatible Models The Model Sources detail screen shows all models in a source, including those incompatible with your device (shown greyed out with the reason, e.g. "Requires app version X.Y.Z"). The source list shows the count as "N models · M hidden" when some models are filtered out. ### Automatic Refresh Model sources are automatically refreshed approximately every 24 hours in the background to check for new models and updates. You can also pull-to-refresh on the Models screen to trigger an immediate refresh. > [!TIP] > If you want to create your own model source, see the [Model Allowlist Schema](MODEL_ALLOWLIST_SCHEMA.md) for the JSON format. ## Model Updates OlliteRT can detect when a newer version of a downloaded model is available in a model source: - A background worker periodically checks each enabled model source for updated model files - When an update is found, a notification is shown and the model card displays an update indicator - The `/v1/models` API response includes an `update_available` field per model - To update, download the new version — it replaces the existing model file ## Model Storage Models are stored in the app's private storage directory. You can manage them from the Models screen: - **Download** — one-tap download from HuggingFace (or from custom model sources) - **Delete** — removes the model file and frees storage - **Storage indicator** — the bottom bar shows available vs used storage ## SDK Compatibility The LiteRT SDK version bundled with OlliteRT determines which models can run. Some newer models may require an app update even if they appear in a model source. See [SDK Compatibility](SDK_COMPATIBILITY.md) for the full compatibility table.