# Getting Started with llamafile

The easiest way to try it for yourself is to download our example llamafile for the [Qwen3.5](https://huggingface.co/Qwen/Qwen3.5-0.8B/) model (license: [Apache 2.0](https://huggingface.co/Qwen/Qwen3.5-0.8B/blob/main/LICENSE)). Qwen3.5 is a recent LLM that can do more than just chat; you can also upload images and ask it questions about them. With llamafile, this all happens locally: no data ever leaves your computer.

> **NOTE**: We chose this model because it is the smallest one we have built a llamafile for, so it is the most likely to work out of the box for you. Please let us know if you still have issues with it! If, on the other hand, you have powerful hardware and/or GPUs, [feel free to choose](example_llamafiles.md) larger and more expressive models, which should provide more accurate responses.

1. Download [Qwen3.5-0.8B-Q8_0.llamafile](https://huggingface.co/mozilla-ai/llamafile_0.10.0/resolve/main/Qwen3.5-0.8B-Q8_0.llamafile) (1.77 GB).

2. Open your computer's terminal.

   - If you're using macOS, Linux, or BSD, you'll need to grant permission for your computer to execute this new file. (You only need to do this once.)

     ```sh
     chmod +x Qwen3.5-0.8B-Q8_0.llamafile
     ```

   - If you're on Windows, rename the file by adding ".exe" on the end.

3. Run the llamafile, e.g.:

   ```sh
   ./Qwen3.5-0.8B-Q8_0.llamafile
   ```

4. A chat interface will open in the terminal window. That's it: you can immediately start writing. You can also upload an image by using the `/upload` command and specifying the path to the image, or write `/help` to see the available commands.

5. While llamafile is running, you can also chat with it using [llama.cpp](https://github.com/ggml-org/llama.cpp)'s Web UI: just open a browser window and connect to <http://localhost:8080>.

6. When you're done chatting, press `Control-C` to shut down llamafile.

**Having trouble? See the [Troubleshooting](troubleshooting.md) page.**

## JSON API Quickstart

As llamafile relies on llama.cpp for serving models, it comes with all of llama.cpp's server features. When it is started, in addition to hosting a web UI chat server at <http://localhost:8080>, it also exposes endpoints compatible with the [OpenAI API](https://platform.openai.com/docs/api-reference/chat) and [Anthropic's Messages API](https://platform.claude.com/docs/en/api/messages). For further details on what fields and endpoints are available, refer to the documentation of those APIs and to the llama.cpp server's [README](https://github.com/ggml-org/llama.cpp/tree/master/tools/server).
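
To make the shape of an OpenAI-compatible request more concrete, here is a minimal standard-library sketch. It assumes the server is listening on `http://localhost:8080`, as in the examples in this document; the helper names `build_chat_request` and `send_chat_request` are hypothetical and not part of llamafile.

```python
import json
import urllib.request

# Assumption: llamafile is serving on localhost:8080, as in this document's examples.
BASE_URL = "http://localhost:8080/v1"


def build_chat_request(user_prompt, system_prompt=None, base_url=BASE_URL):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_prompt})

    body = json.dumps({"model": "LLaMA_CPP", "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer no-key",
        },
        method="POST",
    )


def send_chat_request(request, timeout=60):
    """Send the request and return the decoded JSON reply (needs a running llamafile)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

With a llamafile running, something like `send_chat_request(build_chat_request("Hello"))["choices"][0]["message"]["content"]` should return the assistant's reply. Building and sending are kept separate so the request can be inspected without a server.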

### Curl API Client Example

The simplest way to get started using the API is to copy and paste the following curl command into your terminal.

```shell
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer no-key" \
-d '{
  "model": "LLaMA_CPP",
  "messages": [
      {
          "role": "system",
          "content": "You are LLAMAfile, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."
      },
      {
          "role": "user",
          "content": "Write a limerick about python exceptions"
      }
    ]
}' | python3 -c '
import json
import sys
json.dump(json.load(sys.stdin), sys.stdout, indent=2)
print()
'
```

The response that's printed should look like the following:

```json
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "In the world of Python, where magic breaks and errors occur,\nA script fails when it should not have failed.\nWith a `KeyError`, I can't access the key,\nSo I tell you to use the `except` clause!"
      }
    }
  ],
  "created": 1773659260,
  "model": "Qwen3.5-0.8B-Q8_0.gguf",
  "system_fingerprint": "b1773565177-7f5ee5496",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 52,
    "prompt_tokens": 49,
    "total_tokens": 101
  },
  "id": "chatcmpl-KOqwN6C0oRzINGZuFqZ95bU1iPfc6RFO",
  "timings": {
    "cache_n": 0,
    "prompt_n": 49,
    "prompt_ms": 54.944,
    "prompt_per_token_ms": 1.1213061224489795,
    "prompt_per_second": 891.8171228887594,
    "predicted_n": 52,
    "predicted_ms": 405.856,
    "predicted_per_token_ms": 7.804923076923076,
    "predicted_per_second": 128.1242608215722
  }
}
```
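
To show how the fields in such a response relate to each other, the sketch below parses an abridged copy of the reply above and recomputes the generation speed from `predicted_n` and `predicted_ms`. The `summarize` helper is hypothetical. The `timings` object is a llama.cpp extension and is not part of the OpenAI response schema, so the code treats it as optional.

```python
import json

# Abridged copy of the response shown above (only the fields used here).
RESPONSE_JSON = """
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {"role": "assistant", "content": "In the world of Python..."}
    }
  ],
  "model": "Qwen3.5-0.8B-Q8_0.gguf",
  "usage": {"completion_tokens": 52, "prompt_tokens": 49, "total_tokens": 101},
  "timings": {"predicted_n": 52, "predicted_ms": 405.856}
}
"""


def summarize(response):
    """Extract the reply text, token usage, and generation speed from a response dict."""
    timings = response.get("timings", {})  # may be absent on other servers
    tokens_per_second = None
    if timings.get("predicted_ms"):
        # Generated tokens divided by generation time in seconds.
        tokens_per_second = timings["predicted_n"] / (timings["predicted_ms"] / 1000.0)
    return {
        "content": response["choices"][0]["message"]["content"],
        "total_tokens": response["usage"]["total_tokens"],
        "tokens_per_second": tokens_per_second,
    }


summary = summarize(json.loads(RESPONSE_JSON))
print(summary["total_tokens"])                  # 101
print(round(summary["tokens_per_second"], 1))   # 128.1
```

The recomputed speed matches the `predicted_per_second` value reported by the server in the full response.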

### Python API Client Example

If you've already developed your software using the [`openai` Python package](https://pypi.org/project/openai/) (published by OpenAI), you should be able to port your app to talk to llamafile instead by changing `base_url` and `api_key`. This example assumes you've run `pip3 install openai` to install OpenAI's client software. That package is just a simple Python wrapper around the OpenAI API interface, which can be implemented by any server.

```python
#!/usr/bin/env python3
from openai import OpenAI

# Point the client at the local llamafile server instead of api.openai.com.
client = OpenAI(
    base_url="http://localhost:8080/v1",  # "http://<host>:port"
    api_key="sk-no-key-required",
)
completion = client.chat.completions.create(
    model="LLaMA_CPP",
    messages=[
        {"role": "system", "content": "You are ChatGPT, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."},
        {"role": "user", "content": "Write a limerick about python exceptions"},
    ],
)
print(completion.choices[0].message)
```

The above code will print a Python object like this:

```python
ChatCompletionMessage(content="A script that crashes like a ghost,\nWhen it tries to solve the problem deep and fast.\nThe error message pops up in a bright light,\nAnd tells us what's wrong when we try to fix it.", refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=None)
```

## Using llamafile with external weights

Even though our example llamafiles have the weights built-in, you don't *have* to use llamafile that way. Instead, you can download *just* the llamafile software (without any weights included) from our releases page. You can then use it alongside any external weights you may have on hand. External weights are particularly useful for Windows users because they enable you to work around Windows' 4GB executable file size limit.

For Windows users, here's an example for the gpt-oss LLM (whose size is >12GB):

```sh
curl -L -o llamafile.exe https://huggingface.co/mozilla-ai/llamafile_0.10.0/resolve/main/llamafile_0.10.0
curl -L -o gpt-oss.gguf https://huggingface.co/unsloth/gpt-oss-20b-GGUF/resolve/main/gpt-oss-20b-Q5_K_S.gguf
./llamafile.exe -m gpt-oss.gguf
```

Windows users may need to change `./llamafile.exe` to `.\llamafile.exe` when running the above command.

## Running llamafile with models downloaded by third-party applications

This section answers the question *"I already have a model downloaded locally by application X, can I use it with llamafile?"*. The general answer is "yes, as long as those models are stored locally in GGUF format", but how hacky the procedure is depends on the application. A few examples (tested on a Mac) follow.

### LM Studio

[LM Studio](https://lmstudio.ai/) stores downloaded models in `~/.cache/lm-studio/models/lmstudio-community`, in subdirectories named after the model, minus its quantization level. So if you have downloaded e.g. the `gpt-oss-20b-MXFP4.gguf` file, it will be stored in `~/.cache/lm-studio/models/lmstudio-community/gpt-oss-20b-GGUF/` and you can run llamafile as follows:

```bash
llamafile -m ~/.cache/lm-studio/models/lmstudio-community/gpt-oss-20b-GGUF/gpt-oss-20b-MXFP4.gguf
```

### Ollama

When you download a new model with [ollama](https://ollama.com), all its metadata will be stored in a manifest file under `~/.ollama/models/manifests/registry.ollama.ai/library/`. The directory and manifest file name are the model name as returned by `ollama list`. For instance, for `llama3:latest` the manifest file will be named `~/.ollama/models/manifests/registry.ollama.ai/library/llama3/latest`.

The manifest maps each file related to the model (e.g. GGUF weights, license, prompt template) to a sha256 digest. The digest corresponding to the element whose `mediaType` is `application/vnd.ollama.image.model` is the one referring to the model's GGUF file.

Each sha256 digest is also used as a filename in the `~/.ollama/models/blobs` directory (if you look into that directory you'll see *only* those sha256-* filenames). This means you can run llamafile directly by passing the sha256 digest as the model filename. So if e.g. the `llama3:latest` GGUF file digest is `sha256-00e1317cbf74d901080d7100f57580ba8dd8de57203072dc6f668324ba545f29`, you can run llamafile as follows:

```bash
cd ~/.ollama/models/blobs
llamafile -m sha256-00e1317cbf74d901080d7100f57580ba8dd8de57203072dc6f668324ba545f29
```

**Note** that Ollama's GGUF weights do not always work with llama.cpp (see e.g. [here](https://forums.developer.nvidia.com/t/nemotron-3-super-120b-on-gb10-llama-cpp-sm-121-build-ollama-gguf-incompatibility-fix/363459)), and as llamafile relies on llama.cpp this trick might not always work for you.
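
The Ollama manifest lookup described above can be automated with a short script. The following is a sketch under two assumptions that are not spelled out in this document: that the manifest is a JSON file whose entries sit in a `layers` list, and that digests are written as `sha256:<hex>` in the manifest but as `sha256-<hex>` on disk. The function name `find_ollama_gguf` is hypothetical.

```python
import json
from pathlib import Path

# Media type that marks the GGUF weights entry in an Ollama manifest.
MODEL_MEDIA_TYPE = "application/vnd.ollama.image.model"


def find_ollama_gguf(model, tag="latest", ollama_dir=None):
    """Return the path of the blob holding the GGUF weights for `model:tag`."""
    root = Path(ollama_dir) if ollama_dir else Path.home() / ".ollama"
    manifest_path = (
        root / "models" / "manifests" / "registry.ollama.ai" / "library" / model / tag
    )
    manifest = json.loads(manifest_path.read_text())

    # Assumption: entries live in a "layers" list, each with "mediaType" and "digest".
    for layer in manifest.get("layers", []):
        if layer.get("mediaType") == MODEL_MEDIA_TYPE:
            # Assumption: "sha256:<hex>" in the manifest maps to "sha256-<hex>" on disk.
            blob_name = layer["digest"].replace(":", "-", 1)
            return root / "models" / "blobs" / blob_name

    raise LookupError(f"no GGUF entry found in manifest {manifest_path}")
```

For example, `print(find_ollama_gguf("llama3"))` would print the blob path for `llama3:latest`, which can then be passed to `llamafile -m`. The compatibility caveat in the note above still applies.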