{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Subword Tokenization\n", "\n", "Implementation from [Neural Machine Translation of Rare Words with Subword Units](https://aclanthology.org/P16-1162/) (Sennrich et al., ACL 2016)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "('e', 's')\n", "('es', 't')\n", "('est', '')\n", "('l', 'o')\n", "('lo', 'w')\n", "{'low ': 5, 'low e r ': 2, 'n e w est': 6, 'w i d est': 3}\n" ] } ], "source": [ "import re, collections\n", "\n", "# Count number of pairs\n", "def get_stats(vocab):\n", " pairs = collections.defaultdict(int)\n", " for word, freq in vocab.items():\n", " symbols = word.split()\n", " for i in range(len(symbols)-1):\n", " pairs[symbols[i],symbols[i+1]] += freq\n", " return pairs\n", "\n", "# Merge most frequent pairs in the vocabulary\n", "def merge_vocab(pair, v_in):\n", " v_out = {}\n", " bigram = re.escape(' '.join(pair))\n", " p = re.compile(r'(?' : 5, 'l o w e r ' : 2,\n", "'n e w e s t ':6, 'w i d e s t ' : 3}\n", "\n", "num_merges = 5\n", "for i in range(num_merges):\n", " pairs = get_stats(vocab)\n", " best = max(pairs, key=pairs.get)\n", " vocab = merge_vocab(best, vocab)\n", " print(best)\n", "print(vocab)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using a Pre-trained Tokenizer" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import tiktoken\n", "enc = tiktoken.encoding_for_model(\"gpt-3.5-turbo\")" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[3617, 2448, 79706, 3576, 269, 273, 84314]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "token_ids = enc.encode('Prüfungsvorleistung')\n", "token_ids" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['Pr', 'ü', 'fung', 'sv', 'or', 'le', 'istung']" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[enc.decode([i]) for i in token_ids]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Phenotyping with LLMs\n", "\n", "We will show how to use ChatGPT through the OpenAPI API for zero-shot and few-shot smoking status classification, which is a kind of phenotyping task. 
Note: if you want to run the notebook yourself, make sure to provide an API key: https://github.com/openai/openai-python#usage" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "executionInfo": { "elapsed": 314, "status": "ok", "timestamp": 1695628755333, "user": { "displayName": "Florian Borchert", "userId": "04915685144402388535" }, "user_tz": -120 }, "id": "7KXiGjwnjiUt" }, "outputs": [], "source": [ "from openai import OpenAI\n", "import os\n", "\n", "# The client reads the API key from the OPENAI_API_KEY environment variable\n", "client = OpenAI()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "executionInfo": { "elapsed": 209, "status": "ok", "timestamp": 1695628769597, "user": { "displayName": "Florian Borchert", "userId": "04915685144402388535" }, "user_tz": -120 }, "id": "7uAVp_-Fk85I" }, "outputs": [], "source": [ "# Helper function to send a prompt to the OpenAI API (ChatGPT model)\n", "def get_completion(prompt, model=\"gpt-3.5-turbo\"):\n", "    messages = [{\"role\": \"user\", \"content\": prompt}]\n", "    response = client.chat.completions.create(\n", "        messages=messages,\n", "        model=model,\n", "        temperature=0, # this is the degree of randomness of the model's output\n", "    )\n", "    return response.choices[0].message.content.replace('```', '')" ] }, { "cell_type": "markdown", "metadata": { "id": "5LGDhQR3lpd-" }, "source": [ "## Zero-Shot Inference" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "text = \"Social History: No alcohol use and quit tobacco greater than 25 years ago with a 10-pack year smoking history.\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Prompt 1\n", "\n", "Describes the task" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1695628771499, "user": { "displayName": "Florian Borchert", "userId": "04915685144402388535" }, "user_tz": -120 }, "id": "zcnhvq-Zlh62" }, "outputs": [], "source": [ "prompt1 = \"What is the smoking status of the person described in this clinical note? ```{}```\"" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 35 }, "executionInfo": { "elapsed": 3, "status": "ok", "timestamp": 1695628771816, "user": { "displayName": "Florian Borchert", "userId": "04915685144402388535" }, "user_tz": -120 }, "id": "p3RLmCCJl8G4", "outputId": "3aec847a-42de-4272-bfbe-7b458883bc99" }, "outputs": [ { "data": { "text/plain": [ "'What is the smoking status of the person described in this clinical note? 
```Social History: No alcohol use and quit tobacco greater than 25 years ago with a 10-pack year smoking history.```'" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "prompt1.format(text)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 35 }, "executionInfo": { "elapsed": 2321, "status": "ok", "timestamp": 1695628774408, "user": { "displayName": "Florian Borchert", "userId": "04915685144402388535" }, "user_tz": -120 }, "id": "3MZBDOS2l9yJ", "outputId": "6419a982-b798-4981-86f9-62ebdf07f1a9" }, "outputs": [ { "data": { "text/plain": [ "'The smoking status of the person described in this clinical note is that they quit tobacco greater than 25 years ago.'" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "get_completion(prompt1.format(text))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Prompt 2\n", "\n", "Describes the task and valid response options (for classification)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1695628774408, "user": { "displayName": "Florian Borchert", "userId": "04915685144402388535" }, "user_tz": -120 }, "id": "w_A8dWBBmA54" }, "outputs": [], "source": [ "prompt2 = (\"What is the smoking status of the person described in this clinical note?\"\n", "\" The valid options are: smoker, non-smoker, ex-smoker \"\n", "\" Input: ```{}```\")" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 35 }, "executionInfo": { "elapsed": 2501, "status": "ok", "timestamp": 1695628776908, "user": { "displayName": "Florian Borchert", "userId": "04915685144402388535" }, "user_tz": -120 }, "id": "UA71CCE4m2w_", "outputId": "540fce65-dbf6-434c-8017-017a40e2713b" }, "outputs": [ { "data": { "text/plain": [ "'The smoking status of the person described in this clinical note is \"ex-smoker\".'" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "get_completion(prompt2.format(text))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Prompt 3\n", "\n", "Describes the task, valid response options, and output format (JSON)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "executionInfo": { "elapsed": 5, "status": "ok", "timestamp": 1695628776909, "user": { "displayName": "Florian Borchert", "userId": "04915685144402388535" }, "user_tz": -120 }, "id": "zvAhQpPZm4y1" }, "outputs": [], "source": [ "prompt3 = (\"What is the smoking status of the person described in this clinical note?\"\n", "\" The valid options are: current smoker, non-smoker, ex-smoker \"\n", "\" Please return the answer as a JSON of the format {{ label :