# Model Cards

### This notebook showcases 5 LLMs with varying licenses. To run all models, follow these [GPU Configurations](https://github.com/jackfrost1411/HUST23-SC23-LLMs/tree/master/gpu-config).

#### The layout is as follows:

* **Initial setup (important environment variable discussion)**
* **Falcon 7b/13b/40b Instruct (Apache 2.0 License)**
* **LLaMA2 7b/13b/70b (Meta License)**
* **C4AI Command-R 35b (C4AI License)**
* **Mistral 7b, Mixtral 8x7b (Apache 2.0 License)**
* **Llama3 8b/70b (Llama3 License)**

## Setting the path to the institute's centralized database containing the models

In [1]:
import os

# Two directories need to be specified:
# - HUGGINGFACE_HUB_CACHE is where intermediate data from the running model is stored;
#   it must be writable by the user.
# - LLM_CACHE_PATH is where the models are stored; it is set by the module and is used
#   in the cells that load a model.
os.environ['HUGGINGFACE_HUB_CACHE'] =  f"{os.environ['HOME']}/llm/cache"
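The two requirements described in the comments above (a writable cache directory, and an `LLM_CACHE_PATH` that the module has already set) can be checked before any model is loaded. The following is a minimal sketch, not part of the original notebook; the helper names `ensure_writable_dir` and `require_env` are hypothetical.

```python
import os
import tempfile
from pathlib import Path


def ensure_writable_dir(path: str) -> Path:
    """Create `path` if needed and confirm the current user can write to it."""
    directory = Path(path)
    directory.mkdir(parents=True, exist_ok=True)
    # Creating a temporary file (removed again on exit) is a direct write test.
    with tempfile.TemporaryFile(dir=directory):
        pass
    return directory


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; load the module that defines it first.")
    return value


# Demonstration in a scratch directory so that nothing is left behind.
with tempfile.TemporaryDirectory() as scratch:
    cache_dir = ensure_writable_dir(os.path.join(scratch, "llm", "cache"))
    print(cache_dir.is_dir())
```

In the notebook itself, one would presumably call `ensure_writable_dir(os.environ['HUGGINGFACE_HUB_CACHE'])` right after setting the variable, and `require_env('LLM_CACHE_PATH')` before building any `model_id`.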

In [2]:
# Checking the number of GPUs requested
import torch
torch.cuda.device_count()

1
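Whether a given number of GPUs is enough depends mostly on the size of the model weights. As a rough, weights-only estimate (ignoring activations, the KV cache, and framework overhead), memory is the parameter count times the bytes per parameter. The sketch below assumes the nominal parameter counts implied by the model names; the function name is hypothetical.

```python
def weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes."""
    total_bytes = params_billions * 1e9 * bits_per_param / 8
    return total_bytes / 1e9


# Nominal sizes taken from the model names used in this notebook.
for name, size in [("Falcon 40b", 40), ("LLaMA2 7b", 7), ("Llama3 8b", 8)]:
    estimates = {bits: weight_memory_gb(size, bits) for bits in (16, 8, 4)}
    print(name, estimates)
```

Under these assumptions a 40b model needs about 80 GB in bfloat16 and about 40 GB in 8-bit, which is why the larger models in this notebook are loaded quantized.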

## Falcon 40b
#### It is made available under the Apache 2.0 license.
##### To run the 7b model, change the model_id to `$LLM_CACHE_PATH/tiiuae/falcon-7b`
##### https://huggingface.co/tiiuae/falcon-40b

In [3]:
from langchain.llms import HuggingFacePipeline
import torch
from transformers import AutoConfig, AutoTokenizer, AutoModelForCausalLM, pipeline, AutoModelForSeq2SeqLM

model_id = f"{os.environ['LLM_CACHE_PATH']}/tiiuae/falcon-40b"

config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

model = AutoModelForCausalLM.from_pretrained(model_id,
                                             trust_remote_code=True,
                                             torch_dtype=torch.bfloat16,
                                             load_in_8bit=True,
                                             device_map="auto")



The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


Loading checkpoint shards:   0%|          | 0/9 [00:00<?, ?it/s]
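The deprecation warning above asks for a `BitsAndBytesConfig` object passed through `quantization_config` instead of the bare `load_in_8bit` flag. A minimal sketch of that change follows; the helper name `quantized_load_kwargs` is hypothetical, and the import is done lazily so that defining the helper does not itself require `transformers` to be importable.

```python
def quantized_load_kwargs(bits: int = 8) -> dict:
    """Build from_pretrained keyword arguments that request 8-bit or 4-bit loading."""
    if bits not in (4, 8):
        raise ValueError("bits must be 4 or 8")
    # Imported here so the helper can be defined without loading transformers.
    from transformers import BitsAndBytesConfig

    if bits == 8:
        config = BitsAndBytesConfig(load_in_8bit=True)
    else:
        config = BitsAndBytesConfig(load_in_4bit=True)
    return {"quantization_config": config, "device_map": "auto"}


# Intended use in the loading cell (not executed here):
# model = AutoModelForCausalLM.from_pretrained(
#     model_id, trust_remote_code=True, **quantized_load_kwargs(8))
```

This replaces only the deprecated flag; the other arguments of the loading cell can stay as they are.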



In [5]:
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
    max_length=1048,
    pad_token_id=tokenizer.eos_token_id
)

local_llm = HuggingFacePipeline(pipeline=pipe)

In [6]:
from langchain.chains import ConversationChain
from langchain.chains.conversation.memory import ConversationSummaryBufferMemory#Summary

memory = ConversationSummaryBufferMemory(llm=local_llm, max_token_limit=512)
memory.save_context({"input": "Hello"}, {"output": "What's up"})
conversation = ConversationChain(
    llm=local_llm, 
    memory = memory,
    verbose=False
)

conversation.prompt.template='''Below is an instruction that describes a task, paired with current conversation to provide history of conversation and \
an input that provides further context. \
Write a response that appropriately completes the request.

### Instruction:
You are an AI named Falcon. Answer the questions asked to you in a talkative manner.

### Current conversation:
{history}

### Input:
{input}

### Response:'''

class ChatBot:
    exit_commands = ("quit", "pause", "exit", "goodbye", "bye", "later", "stop")

    #Method to start the conversation
    def start_chat(self):
        user_response = input("Chat here!\n")
        while user_response == '':
            user_response = input("Chat here!\n")
        self.chat(user_response)

    #Method to handle the conversation
    def chat(self, reply):
        while not self.make_exit(reply):
            input_ = reply
            reply = input(f"{conversation.predict(input = input_)}\n")

    #Method to check for exit commands
    def make_exit(self, reply):
        for exit_command in self.exit_commands:
            if exit_command in reply.lower():
                memory.clear()
                print("Ok, have a great day!")
                return True
        return False
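Note that `make_exit` uses substring matching, so any message that merely contains an exit word (for example "stop" inside "stopwatch", or "later" in an ordinary question) ends the chat. One possible stricter check, shown here as a sketch rather than as the notebook's method, treats a reply as an exit command only when the whole message, ignoring case and surrounding punctuation, is one of the commands. The function name `is_exit_command` is hypothetical.

```python
import string

EXIT_COMMANDS = ("quit", "pause", "exit", "goodbye", "bye", "later", "stop")


def is_exit_command(reply: str) -> bool:
    """Return True only if the entire reply is one of the exit commands."""
    # Lowercase, then trim whitespace and punctuation from both ends ("Bye!" -> "bye").
    cleaned = reply.lower().strip(string.punctuation + string.whitespace)
    return cleaned in EXIT_COMMANDS


print(is_exit_command("Bye!"))
print(is_exit_command("What is a stopwatch?"))
```

The trade-off is that phrases such as "ok, bye" no longer end the conversation; whether that is preferable depends on how the chatbot is used.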

In [7]:
chatbot = ChatBot()
chatbot.start_chat()

Chat here!
 What's the capital of Utah?




KeyboardInterrupt: Interrupted by user

## LLaMA2 7b Chat
#### License: A custom commercial license is available at: https://ai.meta.com/resources/models-and-libraries/llama-downloads/
##### To run the 70b / 13b models, change the model_id to `$LLM_CACHE_PATH/meta-llama/Llama-2-70b-chat-hf/` / `$LLM_CACHE_PATH/meta-llama/Llama-2-13b-chat-hf/`
##### https://huggingface.co/meta-llama/Llama-2-70b-chat-hf

In [3]:
from transformers import LlamaTokenizer, LlamaForCausalLM, AutoConfig
import torch

model_id = f"{os.environ['LLM_CACHE_PATH']}/meta-llama/Llama-2-7b-chat-hf/"
config = AutoConfig.from_pretrained(model_id, trust_remote_code=True, use_auth_token=True)
tokenizer = LlamaTokenizer.from_pretrained(model_id)

model = LlamaForCausalLM.from_pretrained(model_id,
                                         # trust_remote_code=True,
                                         torch_dtype=torch.bfloat16,
                                         load_in_8bit=True,
                                         device_map="auto")

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [4]:
from transformers import pipeline
from langchain.llms import HuggingFacePipeline

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    # torch_dtype=torch.bfloat16,
    device_map="auto",
    pad_token_id=tokenizer.eos_token_id, 
    max_length=2048,
    # temperature=1,
    # top_p=0.95,
    # repetition_penalty=1.15
)

local_llm = HuggingFacePipeline(pipeline=pipe)

In [5]:
from langchain.chains import ConversationChain
from langchain.chains.conversation.memory import ConversationSummaryBufferMemory#Summary

memory = ConversationSummaryBufferMemory(llm=local_llm, max_token_limit=512)
memory.save_context({"input": "Hello"}, {"output": "What's up"})
conversation = ConversationChain(
    llm=local_llm, 
    memory = memory,
    verbose=False
)

conversation.prompt.template='''[INST]<<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible using the context text provided. Your answers should only answer the question once and not have any text after the answer is done.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information. 
<</SYS>>

CONTEXT: 

{history}

Question: {input}[/INST]'''

class ChatBot:
    exit_commands = ("quit", "pause", "exit", "goodbye", "bye", "later", "stop")

    #Method to start the conversation
    def start_chat(self):
        user_response = input("Chat here!\n")
        while user_response == '':
            user_response = input("Chat here!\n")
        self.chat(user_response)

    #Method to handle the conversation
    def chat(self, reply):
        while not self.make_exit(reply):
            input_ = reply
            reply = input(f"{conversation.predict(input = input_)}\n")

    #Method to check for exit commands
    def make_exit(self, reply):
        for exit_command in self.exit_commands:
            if exit_command in reply.lower():
                memory.clear()
                torch.cuda.empty_cache()
                print("Ok, have a great day!")
                return True
        return False
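To see what the model actually receives, it can help to fill the prompt template by hand. The sketch below assumes the template is filled with Python's `str.format`-style substitution and that the memory renders history with `Human:` / `AI:` prefixes; the system prompt is shortened and the history text is made up for illustration.

```python
# Shortened version of the LLaMA2 template used above; {history} and {input}
# are the two placeholders that the conversation chain fills in.
TEMPLATE = """[INST]<<SYS>>
You are a helpful, respectful and honest assistant.
<</SYS>>

CONTEXT:

{history}

Question: {input}[/INST]"""

# Hypothetical history, matching the save_context call in the cell above.
history = "Human: Hello\nAI: What's up"
prompt = TEMPLATE.format(history=history, input="What's the capital of Utah?")
print(prompt)
```

With this kind of substitution, any literal curly braces in the template text would have to be doubled (`{{` and `}}`), otherwise they are read as placeholders.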

In [None]:
chatbot = ChatBot()
chatbot.start_chat()

## C4AI Command-R
#### License: Creative Commons Attribution-NonCommercial 4.0 International Public License with Acceptable Use Addendum
#### https://cohere.com/c4ai-cc-by-nc-license
##### C4AI Command-R is a research release of a 35 billion parameter highly performant generative model. Command-R is a large language model with open weights optimized for a variety of use cases including reasoning, summarization, and question answering. Command-R has the capability for multilingual generation evaluated in 10 languages and highly performant RAG capabilities.
##### https://huggingface.co/CohereForAI/c4ai-command-r-v01-4bit

In [23]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = f"{os.environ['LLM_CACHE_PATH']}/CohereForAI/c4ai-command-r-v01-4bit"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
`low_cpu_mem_usage` was None, now set to True since model is quantized.


Loading checkpoint shards:   0%|          | 0/5 [00:00<?, ?it/s]

In [24]:
# Format message with the command-r chat template
messages = [{"role": "user", "content": "Hello, how are you?"}]
input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
## <BOS_TOKEN><|START_OF_TURN_TOKEN|><|USER_TOKEN|>Hello, how are you?<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>

gen_tokens = model.generate(
    input_ids, 
    max_new_tokens=100, 
    do_sample=True, 
    temperature=0.3,
    )

gen_text = tokenizer.decode(gen_tokens[0])
print(gen_text)

<BOS_TOKEN><|START_OF_TURN_TOKEN|><|USER_TOKEN|>Hello, how are you?<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>Hello! As an AI chatbot, I don't experience emotions or feelings, but I'm functioning properly and ready to assist you in any way I can. How can I help you today?<|END_OF_TURN_TOKEN|>
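The comment in the cell above shows the string that the Command-R chat template produces for a single user message. As an illustration of that layout only, the sketch below rebuilds the same string by hand. The role-to-token mapping for `assistant` is an assumption inferred from the generation prompt, the function name is hypothetical, and the tokenizer's own `apply_chat_template` remains the authoritative implementation.

```python
# Turn markers as they appear in the decoded output above.
ROLE_TOKENS = {"user": "<|USER_TOKEN|>", "assistant": "<|CHATBOT_TOKEN|>"}


def render_command_r(messages, add_generation_prompt=True):
    """Rebuild the prompt string layout shown in the notebook comment."""
    parts = ["<BOS_TOKEN>"]
    for message in messages:
        parts.append(
            "<|START_OF_TURN_TOKEN|>"
            + ROLE_TOKENS[message["role"]]
            + message["content"]
            + "<|END_OF_TURN_TOKEN|>"
        )
    if add_generation_prompt:
        # An open chatbot turn tells the model that it should answer next.
        parts.append("<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>")
    return "".join(parts)


print(render_command_r([{"role": "user", "content": "Hello, how are you?"}]))
```

Because `tokenizer.decode(gen_tokens[0])` decodes the prompt together with the reply, the printed output starts with exactly this string before the model's answer.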


In [12]:
from transformers import pipeline
from langchain.llms import HuggingFacePipeline

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    # torch_dtype=torch.bfloat16,
    device_map="auto",
    pad_token_id=tokenizer.eos_token_id, 
    max_length=2048,
    # temperature=1,
    # top_p=0.95,
    # repetition_penalty=1.15
)

local_llm = HuggingFacePipeline(pipeline=pipe)

In [13]:
from langchain.chains import ConversationChain
from langchain.chains.conversation.memory import ConversationSummaryBufferMemory#Summary

memory = ConversationSummaryBufferMemory(llm=local_llm, max_token_limit=512)
memory.save_context({"input": "Hello"}, {"output": "What's up"})
conversation = ConversationChain(
    llm=local_llm, 
    memory = memory,
    verbose=False
)

conversation.prompt.template='''### System:
You are Command-R, an AI that follows instructions extremely well. Help as much as you can. Remember, be safe, and don't do anything illegal. Below is the current conversation history and an input that provides further context. Write a response that appropriately completes the request.

### Current conversation:
{history}

### User:
{input}

### Assistant:'''

class ChatBot:
    exit_commands = ("quit", "pause", "exit", "goodbye", "bye", "later", "stop")

    #Method to start the conversation
    def start_chat(self):
        user_response = input("Chat here!\n")
        while user_response == '':
            user_response = input("Chat here!\n")
        self.chat(user_response)

    #Method to handle the conversation
    def chat(self, reply):
        while not self.make_exit(reply):
            input_ = reply
            reply = input(f"{conversation.predict(input = input_)}\n")

    #Method to check for exit commands
    def make_exit(self, reply):
        for exit_command in self.exit_commands:
            if exit_command in reply.lower():
                memory.clear()
                torch.cuda.empty_cache()
                print("Ok, have a great day!")
                return True
        return False

In [None]:
chatbot = ChatBot()
chatbot.start_chat()

## Mistral-7B-Instruct
#### License: Apache 2.0
#### The Mistral-7B-Instruct-v0.2 Large Language Model (LLM) is an instruct fine-tuned version of Mistral-7B-v0.2.
##### The model_id below points to the 7b model, `$LLM_CACHE_PATH/mistralai/Mistral-7B-Instruct-v0.2`
##### https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2

In [16]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = f"{os.environ['LLM_CACHE_PATH']}/mistralai/Mistral-7B-Instruct-v0.2"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

In [21]:
device = "cuda" # the device to load the model onto

messages = [
    {"role": "user", "content": "What is your favourite condiment?"},
    {"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"},
    {"role": "user", "content": "Do you have mayonnaise recipes?"}
]

encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt")

model_inputs = encodeds.to(device)
model.to(device)

generated_ids = model.generate(model_inputs, max_new_tokens=1000, do_sample=True)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<s> [INST] What is your favourite condiment? [/INST]Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!</s> [INST] Do you have mayonnaise recipes? [/INST] Certainly! Here's a classic and simple mayonnaise recipe that you can make at home.

**Homemade Mayonnaise**

* 1 egg yolk
* 1 tbsp Dijon mustard
* 1 tbsp white wine vinegar
* 1/2 cup neutral oil (like vegetable or canola oil)
* 1/2 cup light olive oil or avocado oil
* 1-2 tbsp water
* Salt to taste

Whisk together the egg yolk, mustard, and vinegar in a medium bowl until the mixture thickens slightly and turns pale yellow. With the whisk continually moving, add the neutral oil in a thin, steady stream until it is completely incorporated and the mixture has thickened to a mayonnaise consistency. Then, slowly drizzle in the olive or avocado oil, continuing to whisk continuously. Once all the oil has been added, whisk in the water, a t

In [17]:
from transformers import pipeline
from langchain.llms import HuggingFacePipeline

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    # torch_dtype=torch.bfloat16,
    device_map="auto",
    pad_token_id=tokenizer.eos_token_id, 
    max_length=2048,
    # temperature=1,
    # top_p=0.95,
    # repetition_penalty=1.15
)

local_llm = HuggingFacePipeline(pipeline=pipe)

In [18]:
from langchain.chains import ConversationChain
from langchain.chains.conversation.memory import ConversationSummaryBufferMemory#Summary

memory = ConversationSummaryBufferMemory(llm=local_llm, max_token_limit=512)
memory.save_context({"input": "Hello"}, {"output": "What's up"})
conversation = ConversationChain(
    llm=local_llm, 
    memory = memory,
    verbose=False
)

conversation.prompt.template='''### System:
You are an AI that follows instructions extremely well. Help as much as you can. Remember, be safe, and don't do anything illegal. Below is the current conversation history and an input that provides further context. Write a response that appropriately completes the request.

### Current conversation:
{history}

### User:
{input}

### Assistant:'''

class ChatBot:
    exit_commands = ("quit", "pause", "exit", "goodbye", "bye", "later", "stop")

    #Method to start the conversation
    def start_chat(self):
        user_response = input("Chat here!\n")
        while user_response == '':
            user_response = input("Chat here!\n")
        self.chat(user_response)

    #Method to handle the conversation
    def chat(self, reply):
        while not self.make_exit(reply):
            input_ = reply
            reply = input(f"{conversation.predict(input = input_)}\n")

    #Method to check for exit commands
    def make_exit(self, reply):
        for exit_command in self.exit_commands:
            if exit_command in reply.lower():
                memory.clear()
                torch.cuda.empty_cache()
                print("Ok, have a great day!")
                return True
        return False

In [20]:
chatbot = ChatBot()
chatbot.start_chat()

Chat here!
 What's the capital of France?


KeyboardInterrupt: 

## Llama 3
#### License: Llama3 License, https://huggingface.co/meta-llama/Meta-Llama-3-8B/blob/main/LICENSE
##### https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct

In [4]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = f"{os.environ['LLM_CACHE_PATH']}/meta-llama/Meta-Llama-3-8B-Instruct/"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

In [5]:
messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = model.generate(
    input_ids,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
response = outputs[0][input_ids.shape[-1]:]
print(tokenizer.decode(response, skip_special_tokens=True))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Arrrr, me hearty! Me name be Captain Chatbot, the scurviest pirate to ever sail the Seven Seas! Me be a swashbucklin' chatbot, here to regale ye with tales o' adventure, share me treasure o' knowledge, and swab the decks o' yer mind with me witty banter! So hoist the colors, me hearty, and let's set sail fer a treasure trove o' conversation!
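The warning printed above appears because `generate` received neither an attention mask nor a pad token id. For a single, unpadded prompt, every position should be attended to, so an all-ones mask is sufficient. The helper below is a sketch under that assumption (it is not suitable for padded batches); its name is hypothetical, and it relies only on the tensor method `new_ones` and on `model.generate` accepting `attention_mask` and `pad_token_id`.

```python
def generate_with_mask(model, tokenizer, input_ids, **gen_kwargs):
    """Call model.generate with an explicit attention mask and pad token id."""
    # One unpadded prompt: all tokens are real, so the mask is all ones.
    gen_kwargs.setdefault("attention_mask", input_ids.new_ones(input_ids.shape))
    # Reuse the end-of-sequence token for padding, as the warning itself does.
    gen_kwargs.setdefault("pad_token_id", tokenizer.eos_token_id)
    return model.generate(input_ids, **gen_kwargs)


# Intended use in the cell above (not executed here):
# outputs = generate_with_mask(model, tokenizer, input_ids, max_new_tokens=256,
#                              eos_token_id=terminators, do_sample=True,
#                              temperature=0.6, top_p=0.9)
```

Using `setdefault` means a caller who already supplies a mask or pad token id keeps their own values.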
