# SMILE Serve User Guide
SMILE Serve is a production-ready inference server built on [Quarkus](https://quarkus.io/)
that brings together three complementary inference capabilities on the JVM:
| Capability | API prefix | Description |
|---|---|----------------------------------------------------------------|
| **Classic ML** | `/api/v1/models` | Serialized SMILE models (`.sml`) — classifiers and regressors |
| **ONNX Runtime** | `/api/v1/onnx` | Any model in the ONNX open format (`.onnx`) |
| **LLM Chat** | `/api/v1/chat` | Llama 3 chat completions with conversation persistence |
A React-based web UI is bundled and served from the same process.
---
## Table of Contents
1. [Quick Start with Docker](#1-quick-start-with-docker)
2. [Building and Running](#2-building-and-running)
- [Dev Mode](#21-dev-mode)
- [Packaging as a JAR](#22-packaging-as-a-jar)
- [Uber-JAR](#23-uber-jar)
- [Native Executable](#24-native-executable)
3. [Configuration Reference](#3-configuration-reference)
4. [Classic ML Inference API](#4-classic-ml-inference-api)
- [Model Format](#41-model-format)
- [List Models](#42-list-models)
- [Get Model Metadata](#43-get-model-metadata)
- [Single Inference (JSON)](#44-single-inference-json)
- [Streaming Inference (CSV / JSON-lines)](#45-streaming-inference-csv--json-lines)
- [Model IDs](#46-model-ids)
5. [ONNX Inference API](#5-onnx-inference-api)
- [Model Format](#51-model-format)
- [List ONNX Models](#52-list-onnx-models)
- [Get ONNX Model Info](#53-get-onnx-model-info)
- [Single Inference (JSON)](#54-single-inference-json)
- [Streaming Inference](#55-streaming-inference)
- [Tensor Types and Shape Resolution](#56-tensor-types-and-shape-resolution)
6. [LLM Chat API](#6-llm-chat-api)
- [Chat Completions](#61-chat-completions)
- [Conversation History API](#62-conversation-history-api)
7. [Web UI](#7-web-ui)
8. [Database](#8-database)
9. [Testing](#9-testing)
---
## 1. Quick Start with Docker
The fastest way to run SMILE Serve is via the pre-built Docker image.
Mount a local directory containing your model files and map the port:
```shell
docker run -it \
-v /path/to/model/folder:/model \
-p 8888:8080 \
ghcr.io/haifengl/smile-serve:latest
```
The service starts on port 8080 inside the container (mapped to 8888 on the host).
Place your `.sml` and `.onnx` model files in `/path/to/model/folder`; they are
discovered automatically at startup.
---
## 2. Building and Running
All commands use the Gradle wrapper from the project root.
### 2.1 Dev Mode
Live-reload development mode — changes to Java sources are reflected without
restarting. The Quarkus Dev UI is available at .
```shell
./gradlew :serve:quarkusDev \
--jvm-args="--add-opens java.base/java.lang=ALL-UNNAMED"
```
> The `--add-opens` flags are required by ONNX Runtime's Foreign Function Interface.
> The dev-mode HTTP port defaults to **8888** (configured via `%dev.quarkus.http.port`).
### 2.2 Packaging as a JAR
```shell
./gradlew :serve:build
```
This produces a Quarkus layered application in `build/quarkus-app/`.
The entry point is `build/quarkus-app/quarkus-run.jar`; the dependencies
live in `build/quarkus-app/lib/` and must be distributed together.
Run it with:
```shell
java \
--add-opens java.base/java.lang=ALL-UNNAMED \
--add-opens java.base/java.nio=ALL-UNNAMED \
--enable-native-access ALL-UNNAMED \
-jar build/quarkus-app/quarkus-run.jar
```
To run on a custom port:
```shell
java \
--add-opens java.base/java.lang=ALL-UNNAMED \
--add-opens java.base/java.nio=ALL-UNNAMED \
--enable-native-access ALL-UNNAMED \
-Dquarkus.http.port=3801 \
-jar build/quarkus-app/quarkus-run.jar
```
### 2.3 Uber-JAR
A single self-contained JAR (slower to start, simpler to deploy):
```shell
./gradlew :serve:build -Dquarkus.package.jar.type=uber-jar
java \
--add-opens java.base/java.lang=ALL-UNNAMED \
--add-opens java.base/java.nio=ALL-UNNAMED \
--enable-native-access ALL-UNNAMED \
-jar build/smile-serve-runner.jar
```
### 2.4 Native Executable
Compile to a native binary with GraalVM (sub-millisecond startup, lower memory):
```shell
./gradlew :serve:build -Dquarkus.native.enabled=true
./build/smile-serve-*-runner
```
Without a local GraalVM installation, use a Docker-based build:
```shell
./gradlew :serve:build \
-Dquarkus.native.enabled=true \
-Dquarkus.native.container-build=true
```
See the [Quarkus native build guide](https://quarkus.io/guides/gradle-tooling) for details.
---
## 3. Configuration Reference
Configuration is managed in `src/main/resources/application.properties`.
Quarkus profile prefixes (`%dev.`, `%test.`) override the base values in
the corresponding profiles.
| Property | Default | Description |
|---|---|---|
| `quarkus.http.port` | `8080` | HTTP listen port (`%dev` default: `8888`) |
| `quarkus.rest.path` | `/api/v1` | Global REST path prefix |
| `smile.serve.model` | `../model` | Path to a `.sml` file or directory of `.sml` files |
| `smile.onnx.model` | `../model` | Path to a `.onnx` file or directory of `.onnx` files |
| `smile.chat.model` | `../model/Llama3.1-8B-Instruct` | Directory containing the Llama model |
| `smile.chat.tokenizer` | `../model/Llama3.1-8B-Instruct/tokenizer.model` | SentencePiece tokenizer path |
| `smile.chat.max_seq_len` | `4096` | Maximum sequence length in tokens |
| `smile.chat.max_batch_size` | `1` | Maximum generation batch size |
| `smile.chat.device` | `0` | GPU device index (`%dev` default: `7`) |
| `quarkus.datasource.db-kind` | `postgresql` | Database backend for chat history |
| `quarkus.datasource.jdbc.url` | `jdbc:postgresql://localhost:5432/smile` | JDBC connection URL |
| `quarkus.hibernate-orm.active` | `false` | Enable ORM (set `true` when database is available) |
**Override at runtime** with `-D` system properties, for example:
```shell
java ... -Dsmile.serve.model=/data/models/rf_classifier.sml -jar quarkus-run.jar
```
---
## 4. Classic ML Inference API
### 4.1 Model Format
Classic ML models are serialized Java objects saved in `.sml` files by the
SMILE `smile.model.Model` framework. They carry:
- The trained algorithm (random forest, SVM, gradient boost, etc.)
- The input feature schema (field names and data types)
- Training / validation metrics
- Optional metadata tags (`id`, `version`, user-defined properties)
At startup, `InferenceService` scans `smile.serve.model`. If the path is a
regular `.sml` file only that model is loaded; if it is a directory every
`.sml` file in the directory is loaded.
### 4.2 List Models
Returns the IDs of all loaded models in alphabetical order.
```
GET /api/v1/models
```
**Example:**
```shell
curl http://localhost:8080/api/v1/models
```
```json
["iris_random_forest-1", "titanic_logistic-2"]
```
### 4.3 Get Model Metadata
Returns the algorithm name, input schema, and tags for a model.
```
GET /api/v1/models/{id}
```
**Example:**
```shell
curl http://localhost:8080/api/v1/models/iris_random_forest-1
```
```json
{
"id": "iris_random_forest-1",
"algorithm": "random-forest",
"schema": {
"petallength": { "type": "float", "nullable": false },
"petalwidth": { "type": "float", "nullable": false },
"sepallength": { "type": "float", "nullable": false },
"sepalwidth": { "type": "float", "nullable": false }
},
"tags": {
"smile.random_forest.trees": "200"
}
}
```
The `schema` object lists every input feature in alphabetical order — this
is the **column order** used by the CSV streaming endpoint.
### 4.4 Single Inference (JSON)
Send one sample as a JSON object and receive the prediction synchronously.
```
POST /api/v1/models/{id}
Content-Type: application/json
```
The request body is a flat JSON object whose keys are the feature names
defined in the model schema. **All non-nullable fields are required.**
**Classification example (iris):**
```shell
curl -X POST http://localhost:8080/api/v1/models/iris_random_forest-1 \
-H "Content-Type: application/json" \
-d '{
"sepallength": 5.1,
"sepalwidth": 3.5,
"petallength": 1.4,
"petalwidth": 0.2
}'
```
```json
{
"prediction": 0,
"probabilities": [0.960, 0.021, 0.019]
}
```
- `prediction` — the predicted class label (integer) or regression value (float).
- `probabilities` — posterior class probabilities for **soft classifiers**
(e.g. random forest, logistic regression). Absent for hard classifiers and
regressors.
**Error responses:**
| HTTP | Cause |
|---|---|
| `400 Bad Request` | Missing required field, or malformed JSON |
| `404 Not Found` | Unknown model ID |
### 4.5 Streaming Inference (CSV / JSON-lines)
Process many samples in a single request. The server returns results as a
[Server-Sent Events](https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events)
stream — one `data:` line per input sample.
```
POST /api/v1/models/{id}/stream
Content-Type: text/plain ← CSV mode
Content-Type: application/json ← JSON-lines mode
```
#### CSV mode (`text/plain`)
Each non-blank line is a comma-separated row of feature values in the **same
column order as the model schema** (alphabetical by field name, as shown by
`GET /api/v1/models/{id}`).
```shell
cat iris.csv | curl -X POST \
-H "Content-Type: text/plain" \
--data-binary @- \
http://localhost:8080/api/v1/models/iris_random_forest-1/stream
```
Where `iris.csv` might contain:
```
5.1,3.5,1.4,0.2
6.7,3.0,5.2,2.3
5.8,2.7,4.1,1.0
```
The response stream (SSE format):
```
data: 0 0.960 0.021 0.019
data: 2 0.012 0.051 0.937
data: 1 0.031 0.752 0.217
```
#### JSON-lines mode (`application/json`)
Each non-blank line must be a complete JSON object (one per line).
This is more verbose but supports named fields in any order.
```shell
cat iris.jsonl | curl -X POST \
-H "Content-Type: application/json" \
--data-binary @- \
http://localhost:8080/api/v1/models/iris_random_forest-1/stream
```
Where `iris.jsonl` contains:
```json
{"sepallength":5.1,"sepalwidth":3.5,"petallength":1.4,"petalwidth":0.2}
{"sepallength":6.7,"sepalwidth":3.0,"petallength":5.2,"petalwidth":2.3}
```
### 4.6 Model IDs
A model's ID is constructed as `-` from the model's embedded
metadata tags (`smile.model.Model.ID` and `smile.model.Model.VERSION`).
If those tags are absent, the file name stem is used as the name and `"1"` as
the version. For example, a file named `iris_random_forest.sml` with no ID
tag gets the ID `iris_random_forest-1`.
---
## 5. ONNX Inference API
The ONNX endpoint exposes any model in the
[ONNX open format](https://onnx.ai/) through SMILE's native ONNX Runtime
binding (`smile.onnx`). This covers models exported from PyTorch, TensorFlow,
scikit-learn (via `sklearn-onnx`), and many other frameworks.
### 5.1 Model Format
At startup, `OnnxService` scans `smile.onnx.model`. Every `.onnx` file found
is loaded into an `InferenceSession`. The model ID is the file name without
the `.onnx` extension (e.g., `resnet50.onnx` → ID `resnet50`).
### 5.2 List ONNX Models
```
GET /api/v1/onnx
```
```shell
curl http://localhost:8080/api/v1/onnx
```
```json
["resnet50", "sentiment_bert"]
```
### 5.3 Get ONNX Model Info
Returns graph metadata and the typed, shaped input/output node descriptors.
```
GET /api/v1/onnx/{id}
```
```shell
curl http://localhost:8080/api/v1/onnx/resnet50
```
```json
{
"id": "resnet50",
"graphName": "ResNet50",
"description": "Image classification model",
"version": 1,
"inputs": [
{
"name": "input",
"onnxType": "TENSOR",
"elementType": "FLOAT",
"shape": [1, 3, 224, 224]
}
],
"outputs": [
{
"name": "output",
"onnxType": "TENSOR",
"elementType": "FLOAT",
"shape": [1, 1000]
}
],
"customMeta": {}
}
```
A shape value of `-1` means that dimension is **dynamic** (determined at
inference time from the input data).
### 5.4 Single Inference (JSON)
```
POST /api/v1/onnx/{id}
Content-Type: application/json
```
The request body is a JSON object mapping each **input name** to a **flat
JSON array** of numbers. The server constructs the required ORT tensor from the
declared element type and shape.
**Example — image classification (resnet50, 1×3×224×224 = 150528 floats):**
```shell
curl -X POST http://localhost:8080/api/v1/onnx/resnet50 \
-H "Content-Type: application/json" \
-d '{"input": [0.485, 0.456, 0.406, ...]}'
```
Response — a JSON object mapping each **output name** to a flat array:
```json
{
"output": [0.001, 0.002, 0.872, 0.003, ...]
}
```
**Multi-input model example:**
```shell
curl -X POST http://localhost:8080/api/v1/onnx/bert_classifier \
-H "Content-Type: application/json" \
-d '{
"input_ids": [101, 2054, 2003, 1996, 3007, 1997, 2605, 1029, 102],
"attention_mask": [1, 1, 1, 1, 1, 1, 1, 1, 1 ],
"token_type_ids": [0, 0, 0, 0, 0, 0, 0, 0, 0 ]
}'
```
**Supported input element types:**
| ONNX type | JSON values | ORT type |
|---|---|---|
| `FLOAT` | numbers | `float[]` |
| `DOUBLE` | numbers | `double[]` |
| `INT32` | integers | `int[]` |
| `INT64` | integers | `long[]` |
| `INT8` / `UINT8` / `BOOL` | integers (0/1 for bool) | `byte[]` |
**Error responses:**
| HTTP | Cause |
|---|---|
| `400 Bad Request` | Missing input, wrong element count, non-numeric values |
| `404 Not Found` | Unknown model ID |
### 5.5 Streaming Inference
Identical in structure to the classic ML streaming endpoint but returns
JSON objects:
```
POST /api/v1/onnx/{id}/stream
Content-Type: text/plain ← CSV floats for single-input models
Content-Type: application/json ← JSON-lines for multi-input models
```
**CSV (single-input models only):**
```shell
cat features.csv | curl -X POST \
-H "Content-Type: text/plain" \
--data-binary @- \
http://localhost:8080/api/v1/onnx/my_classifier/stream
```
Each response line is a compact JSON object:
```
data: {"output":[0.02,0.95,0.03]}
data: {"output":[0.88,0.07,0.05]}
```
**JSON-lines (any number of inputs):**
```shell
cat samples.jsonl | curl -X POST \
-H "Content-Type: application/json" \
--data-binary @- \
http://localhost:8080/api/v1/onnx/bert_classifier/stream
```
### 5.6 Tensor Types and Shape Resolution
The server automatically resolves the ORT tensor shape from the model's
declared input shape and the actual array length:
- **Fully static shape** (no `-1` dimensions) — the array length must exactly
match the product of all dimensions. A mismatch returns HTTP 400.
- **Single dynamic dimension** — the unknown dimension is inferred as
`arrayLength / product(staticDimensions)`. For example, a declared shape
`[-1, 3, 224, 224]` with 150528 elements resolves to `[1, 3, 224, 224]`.
- **Multiple dynamic dimensions** — the shape is set to `[1, arrayLength]`.
- **No shape info** — the shape is set to `[1, arrayLength]`.
---
## 6. LLM Chat API
SMILE Serve includes a Java implementation of
[Llama 3](https://github.com/haifengl/smile/tree/master/deep/src/main/java/smile/llm/llama)
for on-premise LLM inference. The chat API is designed to be compatible with
the OpenAI Chat Completions interface.
The LLM is optional: if `smile.chat.model` does not exist on the file system,
`ChatService` starts in an *unavailable* state and every request to the chat
endpoints returns **HTTP 503 Service Unavailable**.
### 6.1 Chat Completions
```
POST /api/v1/chat/completions
Content-Type: application/json
```
Tokens are streamed back as Server-Sent Events. The conversation (user
message + assistant reply) is automatically persisted to the configured
database after generation finishes.
**Request body fields (`snake_case`):**
| Field | Type | Default | Description |
|---|---|---|---|
| `messages` | `Message[]` | *required* | Ordered dialog turns |
| `conversation` | `Long` | `null` | Existing conversation ID to append to |
| `max_tokens` | `int` | `2048` | Maximum new tokens to generate |
| `temperature` | `double` | `0.6` | Sampling temperature (higher = more random) |
| `top_p` | `double` | `0.9` | Nucleus-sampling threshold |
| `logprobs` | `boolean` | `false` | Include log-probabilities |
| `seed` | `long` | `0` | Random seed (0 = non-deterministic) |
| `stream` | `boolean` | `true` | Reserved; always streams |
Each `Message` has a `role` (`system`, `user`, or `assistant`) and `content`.
**Example — single-turn:**
```shell
curl -X POST http://localhost:8080/api/v1/chat/completions \
-H "Content-Type: application/json" \
-N \
-d '{
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the capital of France?"}
],
"max_tokens": 256,
"temperature": 0.7
}'
```
The response is an SSE stream of plain-text token chunks ending when
generation is complete.
**Example — continue a previous conversation:**
```shell
curl -X POST http://localhost:8080/api/v1/chat/completions \
-H "Content-Type: application/json" \
-N \
-d '{
"conversation": 42,
"messages": [
{"role": "user", "content": "What about Germany?"}
]
}'
```
### 6.2 Conversation History API
Chat history is stored in a relational database (PostgreSQL in production,
SQLite in dev mode). The API base path is `/api/v1/conversations`.
#### List conversations
```
GET /api/v1/conversations?pageIndex=0&pageSize=25
```
Returns conversations in reverse-chronological order (newest first).
Pagination parameters default to page 0 with 25 records per page.
```shell
curl "http://localhost:8080/api/v1/conversations?pageSize=10"
```
#### Get a single conversation
```
GET /api/v1/conversations/{id}
```
Returns the conversation record (metadata only, no messages). Returns 404 if
the ID does not exist.
#### Get conversation messages
```
GET /api/v1/conversations/{id}/items?pageIndex=0&pageSize=25
```
Returns the individual message turns (`role` + `content` + `createdAt`)
in chronological order.
```shell
curl http://localhost:8080/api/v1/conversations/42/items
```
```json
[
{ "id": 1, "conversationId": 42, "role": "user", "content": "What is the capital of France?", "createdAt": "2026-04-15T10:00:00Z" },
{ "id": 2, "conversationId": 42, "role": "assistant", "content": "The capital of France is Paris.", "createdAt": "2026-04-15T10:00:02Z" }
]
```
#### Create a conversation record manually
```
POST /api/v1/conversations
Content-Type: application/json
```
Useful for creating a labelled conversation before sending the first chat
message. The server records the client IP and User-Agent automatically.
#### Delete a conversation
```
DELETE /api/v1/conversations/{id}
```
Returns 204 on success, 404 if not found.
---
## 7. Web UI
A React-based web interface is bundled via [Quarkus Quinoa](https://quarkiverse.github.io/quarkiverse-docs/quarkus-quinoa/dev/).
It is served from the root URL and provides:
- **Inference UI** (`/infer`) — select a loaded SMILE model from the sidebar,
fill in the auto-generated form (derived from the model schema), and view
the prediction result.
- **Chat UI** (`/chat`) — a conversational interface for the Llama chat service
with streaming token display and Markdown/math rendering.
In dev mode the React development server runs on port **5173** and requests
are proxied to the Quarkus backend. The production build (`dist/`) is served
statically by the Quarkus process.
---
## 8. Database
Chat conversation history requires a relational database.
| Profile | Backend | URL |
|---|---|---|
| Production | PostgreSQL | `jdbc:postgresql://localhost:5432/smile` |
| Dev | SQLite | `jdbc:sqlite:./smile_serve.db` |
| Test | H2 (in-memory) | `jdbc:h2:mem:test;DB_CLOSE_DELAY=-1` |
To enable the database in production set:
```properties
quarkus.hibernate-orm.active=true
quarkus.datasource.username=
quarkus.datasource.password=
```
Hibernate ORM uses `drop-and-create` by default. Change the strategy in
production to `update` or `validate`:
```properties
quarkus.hibernate-orm.schema-management.strategy=update
```
The database is **not required** for the ML or ONNX inference endpoints — only
for chat conversation persistence.
---
## 9. Testing
```shell
./gradlew :serve:test
```
The test profile (`%test.*`) configures the service with:
- An in-memory H2 database (no external database required).
- A pre-trained iris random forest model from
`serve/src/test/resources/model/iris_random_forest.sml`.
- The ONNX model path also pointed at the test resources directory
(no `.onnx` files present by default, so `OnnxService` starts empty).
- The chat model path set to a non-existent path so `ChatService` starts
gracefully unavailable without attempting to load a GPU model.
The test class `InferenceResourceTest` covers:
| Test | Endpoint | Scenario |
|---|---|---|
| `testListModels` | `GET /models` | Returns the correct model IDs |
| `testGetModelMetadata` | `GET /models/{id}` | Returns algorithm, schema, and nullability |
| `testGetUnknownModelReturns404` | `GET /models/{id}` | 404 for unknown ID |
| `testPredictJsonReturnsPredictionAndProbabilities` | `POST /models/{id}` | Correct label + probabilities |
| `testPredictJsonWithZeroFeaturesReturnsValidPrediction` | `POST /models/{id}` | Edge case: all-zero features |
| `testPredictJsonMissingFieldReturns400` | `POST /models/{id}` | 400 for missing field |
| `testPredictUnknownModelReturns404` | `POST /models/{id}` | 404 for unknown model |
| `testStreamCsvReturnsPredictions` | `POST /models/{id}/stream` | 3 CSV rows → 3 SSE data lines |
| `testStreamJsonLinesReturnsPredictions` | `POST /models/{id}/stream` | 2 JSON-lines → 2 SSE data lines |
| `testStreamCsvTooFewColumnsEmitsNoPredictions` | `POST /models/{id}/stream` | Bad CSV closes stream |
| `testStreamUnknownModelReturns404` | `POST /models/{id}/stream` | 404 before stream starts |
---
## API Quick Reference
### Classic ML — `/api/v1/models`
| Method | Path | Description |
|---|---|---|
| `GET` | `/models` | List all loaded model IDs |
| `GET` | `/models/{id}` | Get model metadata and schema |
| `POST` | `/models/{id}` | Single JSON inference |
| `POST` | `/models/{id}/stream` | Streaming CSV or JSON-lines inference |
### ONNX — `/api/v1/onnx`
| Method | Path | Description |
|---|---|---|
| `GET` | `/onnx` | List all loaded ONNX model IDs |
| `GET` | `/onnx/{id}` | Get graph info, input/output shapes |
| `POST` | `/onnx/{id}` | Single JSON inference |
| `POST` | `/onnx/{id}/stream` | Streaming CSV or JSON-lines inference |
### Chat — `/api/v1/chat` and `/api/v1/conversations`
| Method | Path | Description |
|---|---|---|
| `POST` | `/chat/completions` | Streaming LLM chat completion (SSE) |
| `GET` | `/conversations` | List conversations (paginated) |
| `GET` | `/conversations/{id}` | Get conversation metadata |
| `POST` | `/conversations` | Create a conversation record |
| `DELETE` | `/conversations/{id}` | Delete a conversation |
| `GET` | `/conversations/{id}/items` | List message turns (paginated) |
---
*SMILE Serve is free software under the GNU General Public License v3. For commercial use enquiries contact smile.sales@outlook.com.*