[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/aurelio-labs/semantic-router/blob/main/docs/05-local-execution.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/aurelio-labs/semantic-router/blob/main/docs/05-local-execution.ipynb)

# Local Dynamic Routes

## Fully local Semantic Router with `llama.cpp` and HuggingFace Encoder

There are many reasons users might choose to roll their own LLMs rather than use a third-party service. Whether it's due to cost, privacy or compliance, Semantic Router supports the use of "local" LLMs through `llama.cpp`.

Using `llama.cpp` also enables the use of quantized GGUF models, which reduces the memory footprint of deployed models enough that even 13-billion-parameter models can run with hardware acceleration on an Apple M1 Pro chip.
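As a rough back-of-the-envelope illustration of why quantization matters, the sketch below estimates the size of the weights alone at different precisions. It assumes round parameter counts (7e9 and 13e9) and ignores per-block quantization overhead, the KV cache, and activations, so real GGUF files and runtime memory use will differ somewhat. The helper name `estimate_weight_gib` is hypothetical.

```python
GIB = 1024 ** 3  # bytes per GiB


def estimate_weight_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / GIB


# Compare 16-bit, 8-bit and 4-bit storage for two assumed model sizes
for n_params in (7e9, 13e9):
    for bits in (16, 8, 4):
        size = estimate_weight_gib(n_params, bits)
        print(f"{n_params / 1e9:.0f}B params at {bits:>2}-bit: ~{size:.1f} GiB")
```

Going from 16-bit to 4-bit weights cuts the estimate by a factor of four, which is what brings larger models within reach of consumer hardware.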

Below is an example of using Semantic Router with **Mistral-7B-Instruct**, quantized to 4 bits.

## Installing the library

> Note: if you require hardware acceleration via BLAS, CUDA, Metal, etc., please refer to the [abetlen/llama-cpp-python](https://github.com/abetlen/llama-cpp-python#installation-with-specific-hardware-acceleration-blas-cuda-metal-etc) repository README.

In [None]:
!pip install -qU "semantic-router[local]"

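If you're running on Apple silicon, you can rebuild `llama-cpp-python` with Metal hardware acceleration. The `CMAKE_ARGS` variable must be set on the same command line as the `pip install` it applies to. Build flag names have changed between `llama-cpp-python` releases, so check the README linked above for your version:

In [None]:
!CMAKE_ARGS="-DLLAMA_METAL=on" pip install -qU --force-reinstall --no-cache-dir llama-cpp-python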
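If you are unsure whether the Metal build applies to your machine, a small check like the one below can help. This is a minimal sketch using only the standard library; `is_apple_silicon` is a hypothetical helper, and a Python interpreter running under Rosetta reports `x86_64`, so it would return `False` there.

```python
import platform


def is_apple_silicon() -> bool:
    """True when running a native arm64 Python on macOS."""
    return platform.system() == "Darwin" and platform.machine() == "arm64"


print("Metal build likely applicable:", is_apple_silicon())
```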


## Download the Mistral 7B Instruct 4-bit GGUF files

We will be using Mistral 7B Instruct, quantized as a 4-bit GGUF file, which offers a good balance between performance and the ability to deploy on consumer hardware.

In [None]:
! curl -L "https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q4_0.gguf?download=true" -o ./mistral-7b-instruct-v0.2.Q4_0.gguf
! ls mistral-7b-instruct-v0.2.Q4_0.gguf

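Because the download is several gigabytes, it can be worth confirming that the file on disk is a GGUF file rather than, say, an HTML error page. GGUF files start with the four ASCII bytes `GGUF`, so a minimal check might look like the sketch below. `looks_like_gguf` is a hypothetical helper; it checks the header only, not the integrity of the whole file.

```python
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # first four bytes of a GGUF file


def looks_like_gguf(path: str) -> bool:
    """Return True if the file exists and starts with the GGUF magic bytes."""
    p = Path(path)
    if not p.is_file():
        return False
    with p.open("rb") as fh:
        return fh.read(4) == GGUF_MAGIC


print(looks_like_gguf("./mistral-7b-instruct-v0.2.Q4_0.gguf"))
```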

## Initializing Dynamic Routes

As in the `02-dynamic-routes.ipynb` notebook, we will initialize some dynamic routes that use an LLM for function calling.

In [None]:
from datetime import datetime
from zoneinfo import ZoneInfo

from semantic_router import Route
from semantic_router.utils.function_call import get_schema


def get_time(timezone: str) -> str:
    """Finds the current time in a specific timezone.

    :param timezone: The timezone to find the current time in, should
        be a valid timezone from the IANA Time Zone Database like
        "America/New_York" or "Europe/London". Do NOT put the place
        name itself like "rome", or "new york", you must provide
        the IANA format.
    :type timezone: str
    :return: The current time in the specified timezone."""
    now = datetime.now(ZoneInfo(timezone))
    return now.strftime("%H:%M")


# Build the schema that the LLM uses to generate the function arguments
time_schema = get_schema(get_time)

# Dynamic route: the LLM extracts a `timezone` argument from the query
time = Route(
    name="get_time",
    utterances=[
        "what is the time in new york city?",
        "what is the time in london?",
        "I live in Rome, what time is it?",
    ],
    function_schemas=[time_schema],
)

# Static routes: matched by similarity only, no function call
politics = Route(
    name="politics",
    utterances=[
        "isn't politics the best thing ever",
        "why don't you tell me about your political opinions",
        "don't you just love the president",
        "don't you just hate the president",
        "they're going to destroy this country!",
        "they will save the country!",
    ],
)
chitchat = Route(
    name="chitchat",
    utterances=[
        "how's the weather today?",
        "how are things going?",
        "lovely weather today",
        "the weather is horrendous",
        "let's go to the chippy",
    ],
)

routes = [politics, chitchat, time]

In [None]:
time_schema

{'name': 'get_time',
 'description': 'Finds the current time in a specific timezone.\n\n:param timezone: The timezone to find the current time in, should\n be a valid timezone from the IANA Time Zone Database like\n "America/New_York" or "Europe/London". Do NOT put the place\n name itself like "rome", or "new york", you must provide\n the IANA format.\n:type timezone: str\n:return: The current time in the specified timezone.',
 'signature': '(timezone: str) -> str',
 'output': "<class 'str'>"}
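To make the shape of this schema less opaque, here is a minimal sketch of how a similar dictionary can be assembled with the standard-library `inspect` module. It is an approximation for illustration only: `sketch_schema` and `example_fn` are hypothetical names, and the internals of `get_schema` may differ.

```python
import inspect
from typing import Any, Callable, Dict


def sketch_schema(fn: Callable[..., Any]) -> Dict[str, str]:
    """Collect a function's name, docstring, signature and return type as strings."""
    signature = inspect.signature(fn)
    return {
        "name": fn.__name__,
        "description": str(inspect.getdoc(fn)),
        "signature": str(signature),
        "output": str(signature.return_annotation),
    }


def example_fn(timezone: str) -> str:
    """Toy function used only to demonstrate the schema shape."""
    return timezone


print(sketch_schema(example_fn))
```

The docstring ends up in `description`, which is why a precise docstring (such as the IANA format instructions in `get_time`) directly shapes what the LLM is told about the function.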

## Encoders

You can use alternative encoders. In this example, however, we want to showcase a fully local Semantic Router execution, so we use a `HuggingFaceEncoder` with `sentence-transformers/all-MiniLM-L6-v2` (the default) as the embedding model.

In [None]:
from semantic_router.encoders import HuggingFaceEncoder

encoder = HuggingFaceEncoder()
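The encoder turns utterances and queries into vectors so that they can be compared. As a rough illustration of the idea only, the sketch below picks the route whose utterance vector has the highest cosine similarity to a query vector. The three-dimensional vectors are made up to stand in for real embeddings, and `toy_index`, `cosine` and `best_route` are hypothetical names rather than part of Semantic Router.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Made-up "embeddings" for two routes, two utterances each
toy_index: Dict[str, List[List[float]]] = {
    "chitchat": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "get_time": [[0.1, 0.9, 0.2], [0.0, 0.8, 0.3]],
}


def best_route(query_vec: List[float]) -> str:
    """Return the route whose closest utterance is most similar to the query."""
    scores = {
        name: max(cosine(query_vec, vec) for vec in vecs)
        for name, vecs in toy_index.items()
    }
    return max(scores, key=scores.get)


print(best_route([0.05, 0.85, 0.25]))  # prints "get_time"
```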

## `llama.cpp` LLM

From here, we can go ahead and instantiate our `llama-cpp-python` `llama_cpp.Llama` LLM, and then pass it to the `semantic_router.llms.LlamaCppLLM` wrapper class.

For `llama_cpp.Llama`, there are a couple of parameters you should pay attention to:

- `n_gpu_layers`: how many LLM layers to offload to the GPU (if you want to offload the entire model, pass `-1`, and for CPU execution, pass `0`)
- `n_ctx`: context size, which limits the number of tokens that can be passed to the LLM (this is bounded by the model's own maximum context size; the Mistral-7B-Instruct-v0.2 GGUF file used here reports a context length of 32,768 tokens in its metadata)
- `verbose`: if `False`, silences output from `llama.cpp`

> For explanations of the other parameters, refer to the `llama-cpp-python` [API Reference](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/)
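The parameter choices above can be gathered into a small helper before constructing the model. This is a sketch under stated assumptions: `build_llama_kwargs` is a hypothetical helper, and `model_max_ctx` is a value you supply yourself (for example, from the model card), not something the helper reads from the file.

```python
from typing import Any, Dict


def build_llama_kwargs(
    model_path: str,
    enable_gpu: bool,
    requested_ctx: int,
    model_max_ctx: int,
) -> Dict[str, Any]:
    """Assemble keyword arguments for llama_cpp.Llama from a few high-level choices."""
    return {
        "model_path": model_path,
        # -1 offloads every layer to the GPU, 0 keeps everything on the CPU
        "n_gpu_layers": -1 if enable_gpu else 0,
        # Never ask for more context than the model supports
        "n_ctx": min(requested_ctx, model_max_ctx),
        # Silence llama.cpp logging
        "verbose": False,
    }


kwargs = build_llama_kwargs(
    "./mistral-7b-instruct-v0.2.Q4_0.gguf",
    enable_gpu=True,
    requested_ctx=2048,
    model_max_ctx=32768,
)
print(kwargs)
# With llama-cpp-python installed, these could be passed as: Llama(**kwargs)
```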

In [None]:
from semantic_router import RouteLayer

from llama_cpp import Llama
from semantic_router.llms.llamacpp import LlamaCppLLM

enable_gpu = True # offload LLM layers to the GPU (must fit in memory)

_llm = Llama(
    model_path="./mistral-7b-instruct-v0.2.Q4_0.gguf",
    n_gpu_layers=-1 if enable_gpu else 0,
    n_ctx=2048,
)
_llm.verbose = False
llm = LlamaCppLLM(name="Mistral-7B-v0.2-Instruct", llm=_llm, max_tokens=None)

rl = RouteLayer(encoder=encoder, routes=routes, llm=llm)

llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from ./mistral-7b-instruct-v0.2.Q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = mistralai_mistral-7b-instruct-v0.2
llama_model_loader: - kv 2: llama.context_length u32 = 32768
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 1000000.000000

In [None]:
rl("how's the weather today?")

RouteChoice(name='chitchat', function_call=None, similarity_score=None)

In [None]:
out = rl("what's the time in New York right now?")
print(out)
get_time(**out.function_call[0])

from_string grammar:
root ::= object 
object ::= [{] ws object_11 [}] ws 
value ::= object | array | string | number | value_6 ws 
array ::= [[] ws array_15 []] ws 
string ::= ["] string_18 ["] ws 
number ::= number_19 number_25 number_29 ws 
value_6 ::= [t] [r] [u] [e] | [f] [a] [l] [s] [e] | [n] [u] [l] [l] 
ws ::= ws_31 
object_8 ::= string [:] ws value object_10 
object_9 ::= [,] ws string [:] ws value 
object_10 ::= object_9 object_10 | 
object_11 ::= object_8 | 
array_12 ::= value array_14 
array_13 ::= [,] ws value 
array_14 ::= array_13 array_14 | 
array_15 ::= array_12 | 
string_16 ::= [^"\] | [\] string_17 
string_17 ::= ["\/bfnrt] | [u] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] 
string_18 ::= string_16 string_18 | 
number_19 ::= number_20 number_21 
number_20 ::= [-] | 
number_21 ::= [0-9] | [1-9] number_22 
number_22 ::= [0-9] number_22 | 
number_23 ::= [.] number_24 
number_24 ::= [0-9] number_24 | [0-9] 
number_25 ::= number_23 | 
number_26 ::= [eE] number_27 number

name='get_time' function_call=[{'timezone': 'America/New_York'}] similarity_score=None


'07:50'
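The arguments in `function_call` are generated by the LLM, so there is no guarantee that the timezone string is always a valid IANA name. A defensive variant of `get_time` might look like the sketch below. `safe_get_time` is a hypothetical helper, and returning `None` on failure is just one possible design choice.

```python
from datetime import datetime
from typing import Optional
from zoneinfo import ZoneInfo, ZoneInfoNotFoundError


def safe_get_time(timezone: str) -> Optional[str]:
    """Like get_time, but returns None when the timezone cannot be resolved."""
    try:
        tz = ZoneInfo(timezone)
    except (ZoneInfoNotFoundError, ValueError):
        # Unknown or malformed key, e.g. a place name instead of an IANA name
        return None
    return datetime.now(tz).strftime("%H:%M")


print(safe_get_time("Not/A_Real_Zone"))
```

Note that on systems without a timezone database (for example, some Windows setups without the `tzdata` package), even valid names cannot be resolved, and this helper would return `None` for them as well.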

In [None]:
out = rl("what's the time in Rome right now?")
print(out)
get_time(**out.function_call[0])

name='get_time' function_call=[{'timezone': 'Europe/Rome'}] similarity_score=None


'13:51'

In [None]:
out = rl("what's the time in Bangkok right now?")
print(out)
get_time(**out.function_call[0])

name='get_time' function_call=[{'timezone': 'Asia/Bangkok'}] similarity_score=None


'18:51'

In [None]:
out = rl("what's the time in Phuket right now?")
print(out)
get_time(**out.function_call[0])

name='get_time' function_call=[{'timezone': 'Asia/Bangkok'}] similarity_score=None


'18:52'
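The calls above index into `out.function_call` directly, which works for the dynamic `get_time` route but not for static routes such as `chitchat`, where `function_call` is `None`. One way to handle both cases is a small dispatcher, sketched below. `FakeChoice` is a hypothetical stand-in that mirrors only the `name` and `function_call` fields seen in the outputs above, and `dispatch` and `handlers` are illustrative names.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional


@dataclass
class FakeChoice:
    """Stand-in for a route result, with only the fields used here."""
    name: Optional[str]
    function_call: Optional[List[Dict[str, Any]]] = None


def dispatch(choice: Any, handlers: Dict[str, Callable[..., Any]]) -> List[Any]:
    """Run the handler for a dynamic route; return [] for static or unknown routes."""
    if choice.function_call is None or choice.name not in handlers:
        return []
    handler = handlers[choice.name]
    return [handler(**kwargs) for kwargs in choice.function_call]


# Placeholder handler; in the notebook this would be the real get_time function
handlers = {"get_time": lambda timezone: f"would look up the time for {timezone}"}

print(dispatch(FakeChoice("get_time", [{"timezone": "Asia/Bangkok"}]), handlers))
print(dispatch(FakeChoice("chitchat"), handlers))
```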

## Cleanup

Once done, if you'd like to delete the downloaded model you can do so with the following:

```
! rm ./mistral-7b-instruct-v0.2.Q4_0.gguf
```
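The `rm` command is not available in every shell (for example, the default Windows command prompt), so a portable alternative using only the Python standard library may be useful. This is a minimal sketch; `remove_model` is a hypothetical helper name.

```python
from pathlib import Path


def remove_model(path: str) -> bool:
    """Delete the file if present; return True if something was removed."""
    p = Path(path)
    existed = p.exists()
    p.unlink(missing_ok=True)  # no error if the file is already gone
    return existed


print(remove_model("./mistral-7b-instruct-v0.2.Q4_0.gguf"))
```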