{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "From:\n", "\n", "- [BERT Fine-Tuning Tutorial with PyTorch · Chris McCormick](http://mccormickml.com/2019/07/22/BERT-fine-tuning/)\n", "- [huggingface/pytorch-transformers: 👾 A library of state-of-the-art pretrained models for Natural Language Processing (NLP)](https://github.com/huggingface/pytorch-transformers)\n", "\n", "\n", "Fine-Tuning:\n", "\n", "- Easy Training: 2-4 epochs are recommended for a specific NLP task\n", "- Less Data\n", "- Good Results" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "import torch\n", "from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler\n", "from keras.preprocessing.sequence import pad_sequences\n", "from sklearn.model_selection import train_test_split\n", "from pytorch_transformers import BertTokenizer, BertConfig\n", "from pytorch_transformers import BertForSequenceClassification, BertModel\n", "from pytorch_transformers.optimization import AdamW, WarmupLinearSchedule\n", "from tqdm import tqdm, trange\n", "import pandas as pd\n", "import io\n", "import os\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```bash\n", "# download glue data\n", "$ git clone https://gist.github.com/60c2bdb54d156a41194446737ce03e2e.git download_glue_repo\n", "$ python download_glue_repo/download_glue_data.py --data_dir='glue_data'\n", "```" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# [The Corpus of Linguistic Acceptability (CoLA)](https://nyu-mll.github.io/CoLA/)\n", "data_path = \"cola_public/raw/\"\n", "train_path = os.path.join(data_path, \"in_domain_train.tsv\")" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv(train_path, 
delimiter='\\t', header=None, \n", " names=['sentence_source', 'label', 'label_notes', 'sentence'])" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
 | sentence_source | label | label_notes | sentence |
---|---|---|---|---|
668 | bc01 | 1 | NaN | I expect John to win and Harry to lose. |
7090 | sgww85 | 1 | NaN | Some people go by car, but others by bike. |
1997 | rhl07 | 1 | NaN | Martha gave Myrna an apple. |
4314 | ks08 | 0 | * | It tried to rain. |
6406 | d_98 | 1 | NaN | We didn't keep a list of the names, but the Pr... |