{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "name": "Copy of GPT-2.ipynb", "version": "0.3.2", "provenance": [], "include_colab_link": true }, "kernelspec": { "name": "python3", "display_name": "Python 3" }, "accelerator": "GPU" }, "cells": [ { "cell_type": "markdown", "metadata": { "id": "view-in-github", "colab_type": "text" }, "source": [ "\"Open" ] }, { "cell_type": "markdown", "metadata": { "id": "Pzxl1vYX-1kk", "colab_type": "text" }, "source": [ "Setup:\n", "\n", "1) Make sure the GPU is enabled: go to Edit -> Notebook settings -> Hardware accelerator -> GPU\n", "\n", "2) Make a copy to your Google Drive: click Copy to Drive in the toolbar" ] }, { "cell_type": "markdown", "metadata": { "id": "iW0abT07ZkhZ", "colab_type": "text" }, "source": [ "Note: Colab resets the runtime after 12 hours, so save your model checkpoints to Google Drive around the 10-11 hour mark (or earlier), then go to Runtime -> Reset all runtimes. Copy your trained model back into Colab and resume training from the previous checkpoint."
] }, { "cell_type": "markdown", "metadata": { "id": "iLXW02eIYpcB", "colab_type": "text" }, "source": [ "Clone and cd into nshepperd's fork of the GPT-2 repo: https://github.com/nshepperd/gpt-2" ] }, { "cell_type": "code", "metadata": { "id": "ICYu3w9hIJkC", "colab_type": "code", "colab": {} }, "source": [ "!git clone https://github.com/nshepperd/gpt-2.git" ], "execution_count": 0, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "6eEIs3ApZUVO", "colab_type": "code", "colab": {} }, "source": [ "%cd gpt-2" ], "execution_count": 0, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "Qtn1qZPgZLb0", "colab_type": "text" }, "source": [ "Install the requirements" ] }, { "cell_type": "code", "metadata": { "id": "434oOx0bZH6J", "colab_type": "code", "colab": {} }, "source": [ "!pip3 install -r requirements.txt" ], "execution_count": 0, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "WvUQhgK3PQ4L", "colab_type": "text" }, "source": [ "Mount Google Drive so you can save and load checkpoints later. You will have to log in to your Google account." ] }, { "cell_type": "code", "metadata": { "id": "FNpf6R4ahYSN", "colab_type": "code", "colab": {} }, "source": [ "from google.colab import drive\n", "drive.mount('/content/drive')" ], "execution_count": 0, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "o1hrgeKFYsuE", "colab_type": "text" }, "source": [ "Download the model data" ] }, { "cell_type": "code", "metadata": { "colab_type": "code", "id": "A498TySgHYyF", "colab": {} }, "source": [ "!python3 download_model.py 117M" ], "execution_count": 0, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "5UDpEGjfO8Q2", "colab_type": "code", "colab": {} }, "source": [ "!python3 download_model.py 345M" ], "execution_count": 0, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "Zq-YwRnNOBYO", "colab_type": "text" }, "source": [ "Set the output encoding to UTF-8. Use the %env magic rather than !export, since a ! command runs in a throwaway subshell and its exports do not persist." ] }, { "cell_type": "code", "metadata": { "id": "7oJPQtdLbbeK", "colab_type": "code", "colab": {} }, "source": [ "%env PYTHONIOENCODING=UTF-8" ], "execution_count": 0, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "0KzSbAvePgsI", "colab_type": "text" }, "source": [ "Fetch checkpoints if you have them saved in Google Drive" ] }, { "cell_type": "code", "metadata": { "id": "cA2Wk7yIPmS6", "colab_type": "code", "colab": {} }, "source": [ "!cp -r /content/drive/My\\ Drive/checkpoint/ /content/gpt-2/ " ], "execution_count": 0, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "0p--9zwqQRTc", "colab_type": "text" }, "source": [ "\n", "Let's get our train on! In this case the file is A Tale of Two Cities (Charles Dickens) from Project Gutenberg. To fine-tune GPT-2 on a different dataset, change this URL to another .txt file and update the corresponding path in the training cell. Note that you can use small datasets, but be careful not to run the fine-tuning for too long or you will overfit badly. 
Roughly, expect interesting results within minutes to hours for datasets in the 1-10 MB range; below that, consider stopping the run early, since fine-tuning on small datasets can be very fast." ] }, { "cell_type": "code", "metadata": { "id": "QOCvrs-DHvxa", "colab_type": "code", "colab": {} }, "source": [ "!wget https://www.gutenberg.org/files/98/98-0.txt" ], "execution_count": 0, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "yPfJ5b3CQXqr", "colab_type": "text" }, "source": [ "\n", "Start training. The cell below passes --model_name '345M' to fine-tune the 345M model; omit the flag (or pass '117M') to fine-tune the 117M model" ] }, { "cell_type": "code", "metadata": { "id": "pEn_ihcGI00T", "colab_type": "code", "colab": {} }, "source": [ "!PYTHONPATH=src ./train.py --dataset /content/gpt-2/98-0.txt --model_name '345M'" ], "execution_count": 0, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "vS1RJJDFOPnb", "colab_type": "text" }, "source": [ "Save the checkpoints to Google Drive so you can resume training later" ] }, { "cell_type": "code", "metadata": { "id": "JretqG1zOXdi", "colab_type": "code", "colab": {} }, "source": [ "!cp -r /content/gpt-2/checkpoint/ /content/drive/My\\ Drive/" ], "execution_count": 0, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "6D-i7vERWbNS", "colab_type": "text" }, "source": [ "Load your trained model for sampling below: run the cell that matches your model size (117M or 345M)" ] }, { "cell_type": "code", "metadata": { "id": "VeETvWvrbKga", "colab_type": "code", "colab": {} }, "source": [ "!cp -r /content/gpt-2/checkpoint/run1/* /content/gpt-2/models/117M/" ], "execution_count": 0, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "np0r6qfXBeUX", "colab_type": "code", "colab": {} }, "source": [ "!cp -r /content/gpt-2/checkpoint/run1/* /content/gpt-2/models/345M/" ], "execution_count": 0, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "GmnSrXqtfRbq", "colab_type": "text" }, "source": [ "Generate conditional samples from the model given a prompt you provide. Change the --top_k hyperparameter if desired (default is 40); if you're using 345M, add \"--model_name 345M\"" ] }, { "cell_type": "code", "metadata": { "id": "utJj-iY4gHwE", "colab_type": "code", "colab": {} }, "source": [ "!python3 src/interactive_conditional_samples.py --top_k 40 --model_name \"345M\"" ], "execution_count": 0, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "LeDhY97XMDXn", "colab_type": "text" }, "source": [ "To check flag descriptions, use:" ] }, { "cell_type": "code", "metadata": { "id": "pBaj2L_KMAgb", "colab_type": "code", "colab": {} }, "source": [ "!python3 src/interactive_conditional_samples.py -- --help" ], "execution_count": 0, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "K8rSqkGxg5OK", "colab_type": "text" }, "source": [ "Generate unconditional samples from the model; if you're using 345M, add \"--model_name 345M\"" ] }, { "cell_type": "code", "metadata": { "id": "LaQUEnRxWc3c", "colab_type": "code", "colab": {} }, "source": [ "!python3 src/generate_unconditional_samples.py --model_name \"345M\" | tee /tmp/samples" ], "execution_count": 0, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "VM1Hag-JL3Bt", "colab_type": "text" }, "source": [ "To check flag descriptions, use:" ] }, { "cell_type": "code", "metadata": { "id": "Sdxfye-SL66I", "colab_type": "code", "colab": {} }, "source": [ "!python3 src/generate_unconditional_samples.py -- --help" ], "execution_count": 0, "outputs": [] } ] }