{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# ULMFiT on French Amazon Customer Reviews\n", "### (QRNN architecture, SentencePiece tokenizer)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Author: [Pierre Guillou](https://www.linkedin.com/in/pierreguillou)\n", "- Date: September 2019\n", "- Post on Medium: [link](https://medium.com/@pierre_guillou/nlp-fastai-french-language-model-d0e2a9e12cab)\n", "- Ref: [Fastai v1](https://docs.fast.ai/) (Deep Learning library on PyTorch)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Information**\n", "\n", "According to the article \"[MultiFiT: Efficient Multi-lingual Language Model Fine-tuning](https://arxiv.org/abs/1909.04761)\" (September 10, 2019), the QRNN architecture and the SentencePiece tokenizer give better results than AWD-LSTM and the spaCy tokenizer, respectively.\n", "\n", "Therefore, they have been used in this notebook to **fine-tune a French bidirectional Language Model** by transfer learning from a French bidirectional Language Model (also built with the QRNN architecture and the SentencePiece tokenizer) trained on a Wikipedia corpus of 100 million tokens ([lm2-french.ipynb](https://github.com/piegu/language-models/blob/master/lm2-french.ipynb)). 
\n", "\n", "This French bidirectional Language Model has been **fine-tuned on \"[French Amazon Customer Reviews](https://s3.amazonaws.com/amazon-reviews-pds/readme.html)\"** and **its encoder has been transferred to a sentiment classifier, which was then trained on the same Amazon corpus**.\n", "\n", "This process **LM General --> LM fine-tuned --> Classifier fine-tuned** is called [ULMFiT](http://nlp.fast.ai/category/classification.html).\n", "\n", "In addition, the following hyperparameter values, given at the end of the MultiFiT article, have been used:\n", "- Language Model\n", "    - (batch size) bs = 50\n", "    - (QRNN) 3 QRNN layers (default: 3) with 1152 hidden units each (default: 1152) (note: it would have been better to increase to 4 QRNN layers with 1550 hidden units, as described in the article)\n", "    - (SentencePiece) vocab of 15000 tokens\n", "    - (dropout) mult_drop = 0\n", "    - (weight decay) wd = 0.01\n", "    - (number of training epochs) 10 epochs\n", " \n", "\n", "- Sentiment Classifier\n", "    - (batch size) bs = 18\n", "    - (SentencePiece) vocab of 15000 tokens\n", "    - (dropout) mult_drop = 0.5\n", "    - (weight decay) wd = 0.01\n", "    - (number of training epochs) 20 epochs\n", "    - (loss) FlattenedLoss of weighted CrossEntropyLoss (the FlattenedLoss of LabelSmoothing CrossEntropy was tested but not kept because it gave a lower accuracy, possibly because the dataset is unbalanced)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our Bidirectional French LM ([lm-french.ipynb](https://github.com/piegu/language-models/blob/master/lm-french.ipynb)) and Sentiment Classifier with an AWD-LSTM architecture and using the spaCy tokenizer ([lm-french-classifier-amazon.ipynb](https://github.com/piegu/language-models/blob/master/lm-french-classifier-amazon.ipynb)) have better results (accuracy, perplexity and f1) than the Bidirectional French LM 
([lm2-french.ipynb](https://github.com/piegu/language-models/blob/master/lm2-french.ipynb)) and Sentiment Classifier with a QRNN architecture and using the SentencePiece tokenizer ([lm2-french-classifier-amazon.ipynb](https://github.com/piegu/language-models/blob/master/lm2-french-classifier-amazon.ipynb)).\n", "\n", "However, because of the issue described in the \"To be improved\" paragraph, all models should be retrained before drawing a final comparison." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### French Bidirectional LM (QRNN, SentencePiece)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- **About the data**: the dataset \"French Amazon Customer Reviews\" is unbalanced. Therefore, we used a weighted loss function (FlattenedLoss of weighted CrossEntropyLoss).\n", "    - neg: 25637 (11.1%)\n", "    - pos: 205047 (88.9%)\n", "\n", "\n", "- **Accuracy and Perplexity** of the fine-tuned Language Model:\n", "    - forward : (accuracy) 35.79% | (perplexity) 28.97\n", "    - backward: (accuracy) 35.22% | (perplexity) 30.27\n", " \n", "\n", "- **Accuracy** of the sentiment classifier:\n", "    - forward : (global) 93.51% | **(neg) 93.69%** | (pos) 93.48%\n", "    - backward: (global) 93.18% | (neg) 90.34% | (pos) 93.54%\n", "    - ensemble: **(global) 93.70%** | (neg) 92.36% | **(pos) 93.86%**\n", "\n", "\n", "- **f1 score** of the sentiment classifier:\n", "    - forward: 0.9624\n", "    - backward: 0.9606\n", "    - ensemble: **0.9636**\n", " \n", "\n", "(neg = negative reviews | pos = positive reviews)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### To be improved" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Out of the 230,684 reviews in our file amazon_reviews_fr.csv, we found (only after training our models) that **11,098 reviews are not in French: almost 5%** (4.8%)!\n", "\n", "We should delete these 11,098 reviews, then fine-tune the LM again, and afterwards the sentiment classifier, on the French-only reviews dataset." 
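The clean-up proposed in this paragraph could be sketched as follows. This is a minimal illustration under stated assumptions, not the notebook's own code: `split_by_language` and `toy_detect` are hypothetical names, and `toy_detect` is only a stand-in for a real language detector (the notebook itself relies on `langdetect.detect` further down). The last lines simply re-check the share of non-French reviews quoted above.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, str]


def split_by_language(rows: List[Row], detect: Callable[[str], str],
                      target: str = "fr") -> Tuple[List[Row], List[Row]]:
    """Split rows into (kept, dropped) according to the detected language.

    `detect` is any callable mapping a text to a language code. A detector
    failure is treated as "not the target language", so the row is dropped.
    """
    kept, dropped = [], []
    for row in rows:
        try:
            language = detect(row["review_body"])
        except Exception:
            language = "error"
        (kept if language == target else dropped).append(row)
    return kept, dropped


def toy_detect(text: str) -> str:
    """Hypothetical stand-in detector, for illustration only."""
    if not text.strip():
        raise ValueError("empty text")
    return "en" if " the " in f" {text.lower()} " else "fr"


reviews = [
    {"review_body": "je conseille fortement ce bouquin", "label": "pos"},
    {"review_body": "Not one of the best novels", "label": "neg"},
    {"review_body": "", "label": "neg"},
]
kept, dropped = split_by_language(reviews, toy_detect)

# Re-check the figures quoted in the text above.
total, non_french = 230_684, 11_098
remaining = total - non_french
share = f"{non_french / total:.1%}"
print(f"{share} of the reviews are not French; {remaining} reviews would remain")
```

With a real detector passed as `detect`, the `kept` rows would be the ones used to fine-tune the LM and the classifier again. Treating detector errors as non-French is a design choice of this sketch; very short reviews are the ones most likely to be affected.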
] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true }, "source": [ "## Initialisation" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "hidden": true }, "outputs": [], "source": [ "%reload_ext autoreload\n", "%autoreload 2\n", "%matplotlib inline\n", "\n", "from fastai import *\n", "from fastai.text import *\n", "from fastai.callbacks import *" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "hidden": true }, "outputs": [], "source": [ "from sklearn.metrics import f1_score\n", "\n", "@np_func\n", "def f1(inp,targ): return f1_score(targ, np.argmax(inp, axis=-1))" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "hidden": true }, "outputs": [], "source": [ "# bs=48\n", "# bs=24\n", "bs=50" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "hidden": true }, "outputs": [], "source": [ "torch.cuda.set_device(0)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "hidden": true }, "outputs": [], "source": [ "data_path = Config.data_path()" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "This will create a `{lang}wiki` folder, containing a `{lang}wiki` text file with the wikipedia contents. 
(For other languages, replace `{lang}` with the appropriate code from the [list of wikipedias](https://meta.wikimedia.org/wiki/List_of_Wikipedias).)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "hidden": true }, "outputs": [], "source": [ "lang = 'fr'" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "hidden": true }, "outputs": [], "source": [ "name = f'{lang}wiki'\n", "path = data_path/name\n", "path.mkdir(exist_ok=True, parents=True)\n", "\n", "lm_fns2 = [f'{lang}_wt_sp15', f'{lang}_wt_vocab_sp15']\n", "lm_fns2_bwd = [f'{lang}_wt_sp15_bwd', f'{lang}_wt_vocab_sp15_bwd']" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true }, "source": [ "## Data" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "- [French Amazon Customer Reviews](https://s3.amazonaws.com/amazon-reviews-pds/readme.html)\n", "- [Guide on how to download the French Amazon Customer Reviews](https://forums.fast.ai/t/ulmfit-french/29379/36)\n", "- File: amazon_reviews_multilingual_FR_v1_00.tsv.gz" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/plain": [ "[PosixPath('/home/jupyter/.fastai/data/amazon_reviews_fr/amazon_reviews_multilingual_FR_v1_00.tsv'),\n", " PosixPath('/home/jupyter/.fastai/data/amazon_reviews_fr/amazon_reviews_fr.csv')]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "name = 'amazon_reviews_fr'\n", "path_data = data_path/name\n", "path_data.ls()" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true, "hidden": true }, "source": [ "### Run this code the first time" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "hidden": true }, "outputs": [], "source": [ "# to solve display error of pandas dataframe\n", "get_ipython().config.get('IPKernelApp', {})['parent_appname'] = \"\"" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "hidden": true }, 
"outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " review_id review_body \\\n", "0 R32VYUWDIB5LKE je conseille fortement ce bouquin à ceux qui s... \n", "1 R3CCMP4EV6HAVL ce magnifique est livre , les personnages sont... \n", "2 R14NAE6UGTVTA2 Je dirais qu'il a un défaut :
On ne peut ... \n", "3 R2E7QEWSC6EWFA Je l'ai depuis quelques jours et j'en suis trè... \n", "4 R26E6I47GQRYKR je m'attendait à un bon film, car j'aime beauc... \n", "\n", " star_rating \n", "0 5 \n", "1 5 \n", "2 3 \n", "3 4 \n", "4 2 " ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fields = ['review_id', 'review_body', 'star_rating']\n", "df = pd.read_csv(path_data/'amazon_reviews_multilingual_FR_v1_00.tsv', delimiter='\\t', encoding='utf-8', usecols=fields)\n", "df = df[fields]\n", "df.loc[pd.isna(df.review_body),'review_body']='NA'\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "hidden": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "number of reviews: 253961\n", "number of identical reviews: 0\n", "number of reviews neg + pos (rating != 3): 230684\n" ] } ], "source": [ "# number of reviews\n", "print(f'number of reviews: {len(df)}')\n", "\n", "# check that no review appears twice (review_id must be unique)\n", "same = len(df) - len(df['review_id'].unique())\n", "print(f'number of identical reviews: {same}')\n", "\n", "# number of negative or positive reviews (rating != 3)\n", "num_neg_pos = len(df[df['star_rating'] != 3])\n", "print(f'number of reviews neg + pos (rating != 3): {num_neg_pos}')" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/plain": [ "[205047, 25637]" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(df_trn_val['label'].value_counts().array)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "hidden": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "pos 205047\n", "neg 25637\n", "Name: label, dtype: int64\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# categorify reviews in 2 classes neg, pos in the label column (rating != 3)\n", "df_trn_val = df[df['star_rating'] != 3].copy()\n", "df_trn_val['label'] = 'neg'\n", "df_trn_val.loc[df_trn_val['star_rating'] > 3, 'label'] = 'pos'\n", "\n", "# plot histogram\n", "x= [1,2]\n", "keys = list(df_trn_val['label'].value_counts().keys())\n", "values = list(df_trn_val['label'].value_counts().array)\n", "plt.bar(x, values) \n", "plt.xticks(x, keys)\n", "print(df_trn_val['label'].value_counts())\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "hidden": true }, "outputs": [], "source": [ "df_trn_val.to_csv (path_data/'amazon_reviews_fr.csv', index = None, header=True)" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "### Get the csv of pre-processed data " ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "hidden": true }, "outputs": [], "source": [ "df_trn_val = pd.read_csv(path_data/'amazon_reviews_fr.csv')" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "### Check that the text language of each review is French and delete the non French ones" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "hidden": true }, "outputs": [], "source": [ "from langdetect import detect, detect_langs" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "hidden": true }, "outputs": [], "source": [ "df = pd.read_csv(path_data/'amazon_reviews_fr.csv')" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "hidden": true }, "outputs": [], "source": [ "# to solve display error of pandas dataframe\n", "get_ipython().config.get('IPKernelApp', {})['parent_appname'] = \"\"" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " review_id review_body \\\n", "0 R32VYUWDIB5LKE je conseille fortement ce bouquin à ceux qui s... \n", "1 R3CCMP4EV6HAVL ce magnifique est livre , les personnages sont... \n", "2 R2E7QEWSC6EWFA Je l'ai depuis quelques jours et j'en suis trè... \n", "3 R26E6I47GQRYKR je m'attendait à un bon film, car j'aime beauc... \n", "4 R1RJMTSNCKB9LP Ne disait pas sur l'annonce que c'était un 10'... \n", "\n", " star_rating label \n", "0 5 pos \n", "1 5 pos \n", "2 4 pos \n", "3 2 neg \n", "4 2 neg " ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "hidden": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 18min 31s, sys: 11 s, total: 18min 42s\n", "Wall time: 18min 43s\n" ] } ], "source": [ "%%time\n", "list_idx = []\n", "for idx, row in df.iterrows():\n", " try:\n", " language = detect(row['review_body'])\n", " except:\n", " language = \"error\" \n", " if language != 'fr':\n", " list_idx.append(idx)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/plain": [ "(230684, 11098)" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(df), len(list_idx)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/plain": [ "15 Just great complination, there are 48 cds insi...\n", "18 I know it's a classic but really it is a marve...\n", "19 Waiting for so long to get a sequel of Bridget...\n", "66 Not one of his best science fiction novels but...\n", "117 Für die Liebhaber von Schwarzer Humor ist di...\n", "151 A great alternate look into the world of cats,...\n", "175 It's the perfect book for a screenwriter and a...\n", "214 Good delivery, on time. 
However the image of t...\n", "324 This is Frank Herbert's masterpiece and should...\n", "327 It was by sheer chance that I came across this...\n", "Name: review_body, dtype: object" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"review_body\"][list_idx][:10]" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "hidden": true }, "outputs": [], "source": [ "df2 = df.copy()" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "hidden": true }, "outputs": [], "source": [ "df2 = df.copy()\n", "df2.drop(list_idx, axis=0, inplace=True)" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/plain": [ "219586" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(df2)" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true }, "source": [ "## Fine-tuning \"forward LM\"" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true, "hidden": true }, "source": [ "### Databunch" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "hidden": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 7min 1s, sys: 3.93 s, total: 7min 5s\n", "Wall time: 3min 27s\n" ] } ], "source": [ "%%time\n", "data_lm = (TextList.from_df(df_trn_val, path, cols='review_body', \n", " processor=[OpenFileProcessor(), SPProcessor(max_vocab_sz=15000)])\n", " .split_by_rand_pct(0.1, seed=42)\n", " .label_for_lm() \n", " .databunch(bs=bs, num_workers=1))" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "hidden": true }, "outputs": [], "source": [ "data_lm.save(f'{path}/{lang}_databunch_lm_aws_sp15')" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true, "hidden": true }, "source": [ "### Training" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "hidden": true }, "outputs": [], "source": [ "data_lm 
= load_data(path, f'{lang}_databunch_lm_aws_sp15', bs=bs)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "hidden": true }, "outputs": [], "source": [ "config = awd_lstm_lm_config.copy()\n", "config['qrnn'] = True" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "hidden": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 4.04 s, sys: 1.71 s, total: 5.75 s\n", "Wall time: 6.54 s\n" ] } ], "source": [ "%%time\n", "perplexity = Perplexity()\n", "learn_lm = language_model_learner(data_lm, AWD_LSTM, config=config, pretrained_fnames=lm_fns2, drop_mult=0.3, \n", " metrics=[error_rate, accuracy, perplexity]).to_fp16()" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/html": [], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.\n" ] } ], "source": [ "learn_lm.lr_find()" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "hidden": true }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "learn_lm.recorder.plot()" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "hidden": true }, "outputs": [], "source": [ "lr = 1e-3\n", "lr *= bs/48\n", "\n", "wd = 0.01" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
epoch  train_loss  valid_loss  error_rate  accuracy  perplexity  time
0      4.364937    4.127997    0.728301    0.271699  62.053352   02:20
1      4.188382    3.942470    0.710881    0.289120  51.545666   02:20
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "learn_lm.fit_one_cycle(2, lr*10, wd=wd, moms=(0.8,0.7))" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "hidden": true }, "outputs": [], "source": [ "learn_lm.save(f'{lang}fine_tuned1_sp15')\n", "learn_lm.save_encoder(f'{lang}fine_tuned1_enc_sp15')" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
epoch  train_loss  valid_loss  error_rate  accuracy  perplexity  time
0      3.975761    3.833473    0.699623    0.300377  46.222725   03:09
1      3.848893    3.728329    0.687341    0.312659  41.609566   03:09
2      3.721791    3.643100    0.677313    0.322686  38.210094   03:09
3      3.683572    3.585832    0.670209    0.329791  36.083378   03:10
4      3.627503    3.545522    0.665744    0.334256  34.657887   03:09
5      3.588596    3.513960    0.661785    0.338215  33.581036   03:09
6      3.559013    3.486362    0.658416    0.341584  32.666874   03:09
7      3.530540    3.464996    0.655928    0.344072  31.976351   03:09
8      3.486778    3.447401    0.653625    0.346374  31.418600   03:09
9      3.500075    3.430552    0.651237    0.348763  30.893700   03:08
10     3.474257    3.416125    0.649317    0.350683  30.451197   03:09
11     3.425713    3.401726    0.647419    0.352581  30.015905   03:09
12     3.415072    3.389692    0.645582    0.354418  29.656876   03:09
13     3.381176    3.380713    0.644282    0.355718  29.391685   03:09
14     3.401082    3.372867    0.643033    0.356966  29.162067   03:08
15     3.353319    3.368499    0.642499    0.357501  29.034859   03:08
16     3.384467    3.366534    0.642141    0.357859  28.977922   03:09
17     3.331605    3.366147    0.642058    0.357942  28.966690   03:09
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "learn_lm.unfreeze()\n", "learn_lm.fit_one_cycle(18, lr, wd=wd, moms=(0.8,0.7))" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/plain": [ "28.966703060315766" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# perplexity\n", "val_loss = 3.366147\n", "np.exp(val_loss)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "hidden": true }, "outputs": [], "source": [ "learn_lm.save(f'{lang}fine_tuned2_sp15')\n", "learn_lm.save_encoder(f'{lang}fine_tuned2_enc_sp15')" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "Save best LM learner and its encoder" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "hidden": true }, "outputs": [], "source": [ "learn_lm.save(f'{lang}fine_tuned_sp15')\n", "learn_lm.save_encoder(f'{lang}fine_tuned_enc_sp15')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fine-tuning \"backward LM\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Databunch" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 7min 6s, sys: 3.71 s, total: 7min 10s\n", "Wall time: 3min 37s\n" ] } ], "source": [ "%%time\n", "data_lm = (TextList.from_df(df_trn_val, path, cols='review_body', \n", " processor=[OpenFileProcessor(), SPProcessor(max_vocab_sz=15000)])\n", " .split_by_rand_pct(0.1, seed=42)\n", " .label_for_lm() \n", " .databunch(bs=bs, num_workers=1, backwards=True))" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "data_lm.save(f'{path}/{lang}_databunch_lm_aws_sp15_bwd')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Training" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { 
"name": "stdout", "output_type": "stream", "text": [ "CPU times: user 2.41 s, sys: 196 ms, total: 2.61 s\n", "Wall time: 2.61 s\n" ] } ], "source": [ "%%time\n", "data_lm = load_data(path, f'{lang}_databunch_lm_aws_sp15_bwd', bs=bs, backwards=True)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "config = awd_lstm_lm_config.copy()\n", "config['qrnn'] = True" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 928 ms, sys: 196 ms, total: 1.12 s\n", "Wall time: 544 ms\n" ] } ], "source": [ "%%time\n", "perplexity = Perplexity()\n", "learn_lm = language_model_learner(data_lm, AWD_LSTM, config=config, pretrained_fnames=lm_fns2_bwd, drop_mult=0.3, \n", " metrics=[error_rate, accuracy, perplexity]).to_fp16()" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.\n" ] } ], "source": [ "learn_lm.lr_find()" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "learn_lm.recorder.plot()" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "lr = 1e-3\n", "lr *= bs/48\n", "\n", "wd = 0.01" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
| epoch | train_loss | valid_loss | error_rate | accuracy | perplexity | time |
|---|---|---|---|---|---|---|
| 0 | 4.793482 | 4.527842 | 0.768429 | 0.231571 | 92.558655 | 02:31 |
| 1 | 4.632069 | 4.341979 | 0.753315 | 0.246686 | 76.859566 | 02:31 |
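
The learning rate used for these two frozen epochs comes from the cell `lr = 1e-3; lr *= bs/48`, and the call then passes `lr*10`. A minimal sketch of that arithmetic, assuming the language-model batch size `bs = 50` listed in the hyperparameters at the top of the notebook; the helper name `scale_lr` and the reading of 48 as a reference batch size are my own assumptions, not part of fastai:

```python
def scale_lr(base_lr, bs, ref_bs=48):
    """Scale a base learning rate linearly with the batch size.

    Hypothetical helper: it only reproduces `lr *= bs/48` from the notebook.
    """
    return base_lr * bs / ref_bs


bs = 50                   # language-model batch size stated in the notebook header
lr = scale_lr(1e-3, bs)   # value later used for the unfrozen 18-epoch run
frozen_lr = lr * 10       # value passed to fit_one_cycle(2, lr*10, ...)

print(f"lr = {lr:.6f}, frozen_lr = {frozen_lr:.6f}")
```

With a different batch size, only `bs` needs to change for the two rates to follow.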
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "learn_lm.fit_one_cycle(2, lr*10, wd=wd, moms=(0.8,0.7))" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "learn_lm.save(f'{lang}fine_tuned1_sp15_bwd')\n", "learn_lm.save_encoder(f'{lang}fine_tuned1_enc_sp15_bwd')" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
| epoch | train_loss | valid_loss | error_rate | accuracy | perplexity | time |
|---|---|---|---|---|---|---|
| 0 | 4.296514 | 4.161376 | 0.736233 | 0.263767 | 64.159561 | 03:23 |
| 1 | 4.130993 | 3.987864 | 0.716614 | 0.283386 | 53.939629 | 03:22 |
| 2 | 3.956398 | 3.833526 | 0.698954 | 0.301046 | 46.225235 | 03:22 |
| 3 | 3.807486 | 3.726387 | 0.686238 | 0.313763 | 41.528687 | 03:22 |
| 4 | 3.754490 | 3.651143 | 0.678229 | 0.321771 | 38.518734 | 03:20 |
| 5 | 3.655463 | 3.600486 | 0.672112 | 0.327887 | 36.615993 | 03:21 |
| 6 | 3.633501 | 3.560688 | 0.667255 | 0.332745 | 35.187481 | 03:21 |
| 7 | 3.603338 | 3.529700 | 0.663621 | 0.336379 | 34.113655 | 03:21 |
| 8 | 3.567799 | 3.505535 | 0.660722 | 0.339277 | 33.299282 | 03:22 |
| 9 | 3.557265 | 3.483793 | 0.658104 | 0.341895 | 32.583061 | 03:21 |
| 10 | 3.523547 | 3.464284 | 0.655464 | 0.344535 | 31.953585 | 03:21 |
| 11 | 3.481527 | 3.450032 | 0.653492 | 0.346508 | 31.501377 | 03:20 |
| 12 | 3.483834 | 3.436066 | 0.651271 | 0.348730 | 31.064537 | 03:20 |
| 13 | 3.466180 | 3.425846 | 0.650107 | 0.349893 | 30.748672 | 03:21 |
| 14 | 3.438890 | 3.417838 | 0.648938 | 0.351062 | 30.503456 | 03:19 |
| 15 | 3.406104 | 3.412687 | 0.648142 | 0.351857 | 30.346628 | 03:10 |
| 16 | 3.402734 | 3.410627 | 0.647868 | 0.352132 | 30.284266 | 03:08 |
| 17 | 3.415850 | 3.410233 | 0.647768 | 0.352232 | 30.272303 | 03:09 |
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "learn_lm.unfreeze()\n", "learn_lm.fit_one_cycle(18, lr, wd=wd, moms=(0.8,0.7))" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "30.27229688291125" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# perplexity\n", "val_loss = 3.410233\n", "np.exp(val_loss)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "learn_lm.save(f'{lang}fine_tuned2_sp15_bwd')\n", "learn_lm.save_encoder(f'{lang}fine_tuned2_enc_sp15_bwd')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Save best LM learner and its encoder" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "learn_lm.save(f'{lang}fine_tuned_sp15_bwd')\n", "learn_lm.save_encoder(f'{lang}fine_tuned_enc_sp15_bwd')" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true }, "source": [ "## Fine-tuning \"forward Classifier\"" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "hidden": true }, "outputs": [], "source": [ "bs = 18" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true, "hidden": true }, "source": [ "### Databunch" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "hidden": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 2.16 s, sys: 408 ms, total: 2.56 s\n", "Wall time: 2.56 s\n" ] } ], "source": [ "%%time\n", "data_lm = load_data(path, f'{lang}_databunch_lm_aws_sp15', bs=bs)" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "hidden": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 7min 10s, sys: 4.84 s, total: 7min 15s\n", "Wall time: 4min 10s\n" ] } ], "source": [ "%%time\n", "data_clas = (TextList.from_df(df_trn_val, path, vocab=data_lm.vocab, 
cols='review_body', \n", " processor=[OpenFileProcessor(), SPProcessor(max_vocab_sz=15000)])\n", " .split_by_rand_pct(0.1, seed=42)\n", " .label_from_df(cols='label')\n", " .databunch(bs=bs, num_workers=1))" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "hidden": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 5.52 s, sys: 1.03 s, total: 6.55 s\n", "Wall time: 6.06 s\n" ] } ], "source": [ "%%time\n", "data_clas.save(f'{lang}_textlist_class_sp15')" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true, "hidden": true }, "source": [ "### Get weights to penalize loss function of the majority class" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "hidden": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 11.9 s, sys: 832 ms, total: 12.7 s\n", "Wall time: 12.5 s\n" ] } ], "source": [ "%%time\n", "data_clas = load_data(path, f'{lang}_textlist_class_sp15', bs=bs, num_workers=1)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/plain": [ "(207616, 23068, 230684)" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "num_trn = len(data_clas.train_ds.x)\n", "num_val = len(data_clas.valid_ds.x)\n", "num_trn, num_val, num_trn+num_val" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/plain": [ "(array([ 23071, 184545]), array([ 2566, 20502]))" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "trn_LabelCounts = np.unique(data_clas.train_ds.y.items, return_counts=True)[1]\n", "val_LabelCounts = np.unique(data_clas.valid_ds.y.items, return_counts=True)[1]\n", "trn_LabelCounts, val_LabelCounts" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/plain": [ 
"([0.888876579839704, 0.11112342016029597],\n", " [0.8887636552800416, 0.11123634471995836])" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "trn_weights = [1 - count/num_trn for count in trn_LabelCounts]\n", "val_weights = [1 - count/num_val for count in val_LabelCounts]\n", "trn_weights, val_weights" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true, "hidden": true }, "source": [ "### Training (Loss = FlattenedLoss of weighted CrossEntropyLoss)" ] }, { "cell_type": "code", "execution_count": 61, "metadata": { "hidden": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 12.6 s, sys: 284 ms, total: 12.9 s\n", "Wall time: 12.8 s\n" ] } ], "source": [ "%%time\n", "data_clas = load_data(path, f'{lang}_textlist_class_sp15', bs=bs, num_workers=1)" ] }, { "cell_type": "code", "execution_count": 62, "metadata": { "hidden": true }, "outputs": [], "source": [ "config = awd_lstm_clas_config.copy()\n", "config['qrnn'] = True" ] }, { "cell_type": "code", "execution_count": 63, "metadata": { "hidden": true }, "outputs": [], "source": [ "learn_c = text_classifier_learner(data_clas, AWD_LSTM, config=config, pretrained=False, drop_mult=0.5, \n", " metrics=[accuracy,f1]).to_fp16()\n", "learn_c.load_encoder(f'{lang}fine_tuned_enc_sp15');" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "#### Change loss function" ] }, { "cell_type": "code", "execution_count": 64, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/plain": [ "FlattenedLoss of CrossEntropyLoss()" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "learn_c.loss_func" ] }, { "cell_type": "code", "execution_count": 65, "metadata": { "hidden": true }, "outputs": [], "source": [ "loss_weights = torch.FloatTensor(trn_weights).cuda()\n", "learn_c.loss_func = partial(F.cross_entropy, weight=loss_weights)" ] }, { "cell_type": "code", 
"execution_count": 66, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/plain": [ "functools.partial(, weight=tensor([0.8889, 0.1111], device='cuda:0'))" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "learn_c.loss_func" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "#### Training" ] }, { "cell_type": "code", "execution_count": 92, "metadata": { "hidden": true }, "outputs": [], "source": [ "learn_c.freeze()" ] }, { "cell_type": "code", "execution_count": 93, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/html": [], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.\n" ] } ], "source": [ "learn_c.lr_find()" ] }, { "cell_type": "code", "execution_count": 94, "metadata": { "hidden": true }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "learn_c.recorder.plot()" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "hidden": true }, "outputs": [], "source": [ "lr = 2e-2\n", "lr *= bs/48\n", "\n", "wd = 0.01" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "hidden": true, "scrolled": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
| epoch | train_loss | valid_loss | accuracy | f1 | time |
|---|---|---|---|---|---|
| 0 | 0.437175 | 0.282718 | 0.864618 | 0.914374 | 02:58 |
| 1 | 0.387293 | 0.260588 | 0.876972 | 0.922479 | 02:56 |
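
These epochs were trained with the weighted loss built from `trn_weights = [1 - count/num_trn for count in trn_LabelCounts]`. The sketch below recomputes those weights from the label counts printed earlier (23071 negative, 184545 positive) and shows, in plain Python, how a class-weighted cross-entropy averages per-sample losses. The function `weighted_cross_entropy` is an illustrative stand-in written for this note, not the PyTorch implementation; it assumes the weighted mean is normalised by the sum of the target weights, which is how I understand `F.cross_entropy` to behave with `weight=` and the default reduction.

```python
import math

# Label counts of the training split, as printed by the notebook.
trn_label_counts = [23071, 184545]   # [neg, pos]
num_trn = sum(trn_label_counts)

# Same formula as the notebook: the rarer class gets the larger weight.
trn_weights = [1 - count / num_trn for count in trn_label_counts]


def weighted_cross_entropy(logits, targets, weights):
    """Class-weighted cross-entropy over a batch (illustrative sketch)."""
    total, norm = 0.0, 0.0
    for row, y in zip(logits, targets):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        nll = log_z - row[y]          # negative log-likelihood of the true class
        total += weights[y] * nll
        norm += weights[y]
    return total / norm


print(trn_weights)
print(weighted_cross_entropy([[2.0, -1.0], [0.5, 1.5]], [0, 1], trn_weights))
```

With two classes the weights sum to 1, so a mistake on a negative review costs roughly eight times more than a mistake on a positive one.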
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "learn_c.fit_one_cycle(2, lr, wd=wd, moms=(0.8,0.7))" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "hidden": true }, "outputs": [], "source": [ "learn_c.save(f'{lang}clas1_sp15')" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
| epoch | train_loss | valid_loss | accuracy | f1 | time |
|---|---|---|---|---|---|
| 0 | 0.433597 | 0.279311 | 0.875412 | 0.920975 | 02:50 |
| 1 | 0.366492 | 0.256188 | 0.887030 | 0.929599 | 02:45 |
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "learn_c.load(f'{lang}clas1_sp15');\n", "learn_c.fit_one_cycle(2, lr, wd=wd, moms=(0.8,0.7))" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "hidden": true }, "outputs": [], "source": [ "learn_c.save(f'{lang}clas2_sp15')" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
| epoch | train_loss | valid_loss | accuracy | f1 | time |
|---|---|---|---|---|---|
| 0 | 0.352552 | 0.237124 | 0.947416 | 0.968906 | 03:27 |
| 1 | 0.279644 | 0.209162 | 0.946463 | 0.968271 | 03:34 |
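
From this step on, the gradual-unfreezing calls pass `slice(lr/(2.6**4), lr)` to `fit_one_cycle`, so each layer group trains with its own learning rate. The sketch below shows one way to read that slice: a geometric spread from the lowest to the highest rate. It assumes 5 layer groups (a hypothetical number chosen so that the ratio between neighbouring groups is exactly 2.6) and the classifier values `lr = 2e-2 * 18/48`; the helper `geometric_lrs` is written for this note and only mimics what I understand fastai v1 to do internally.

```python
def geometric_lrs(start, stop, n_groups):
    """Spread learning rates geometrically from `start` to `stop`.

    Hypothetical helper: first group gets `start`, last group gets `stop`.
    """
    if n_groups == 1:
        return [stop]
    ratio = (stop / start) ** (1 / (n_groups - 1))
    return [start * ratio ** i for i in range(n_groups)]


bs = 18
lr = 2e-2 * bs / 48    # classifier learning rate from the notebook (0.0075)
n_groups = 5           # assumed number of layer groups

lrs = geometric_lrs(lr / (2.6 ** 4), lr, n_groups)
for i, group_lr in enumerate(lrs):
    print(f"layer group {i}: lr = {group_lr:.6f}")
```

The earliest layers, which hold the most general language knowledge, move the least, while the classifier head trains at the full rate.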
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "learn_c.load(f'{lang}clas2_sp15');\n", "learn_c.freeze_to(-2)\n", "learn_c.fit_one_cycle(2, slice(lr/(2.6**4),lr), wd=wd, moms=(0.8,0.7))" ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "hidden": true }, "outputs": [], "source": [ "learn_c.save(f'{lang}clas3_sp15')" ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
| epoch | train_loss | valid_loss | accuracy | f1 | time |
|---|---|---|---|---|---|
| 0 | 0.283462 | 0.184935 | 0.924007 | 0.953850 | 04:03 |
| 1 | 0.217490 | 0.181429 | 0.934368 | 0.960415 | 04:18 |
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "learn_c.load(f'{lang}clas3_sp15');\n", "learn_c.freeze_to(-3)\n", "learn_c.fit_one_cycle(2, slice(lr/2/(2.6**4),lr/2), wd=wd, moms=(0.8,0.7))" ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "hidden": true }, "outputs": [], "source": [ "learn_c.save(f'{lang}clas4_sp15')" ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
| epoch | train_loss | valid_loss | accuracy | f1 | time |
|---|---|---|---|---|---|
| 0 | 0.219500 | 0.176016 | 0.935062 | 0.960744 | 04:49 |
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "learn_c.load(f'{lang}clas4_sp15');\n", "learn_c.unfreeze()\n", "learn_c.fit_one_cycle(1, slice(lr/10/(2.6**4),lr/10), wd=wd, moms=(0.8,0.7))" ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "hidden": true }, "outputs": [], "source": [ "learn_c.save(f'{lang}clas5_sp15')" ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "hidden": true }, "outputs": [], "source": [ "learn_c.load(f'{lang}clas5_sp15')\n", "learn_c.save(f'{lang}clas_sp15')" ] }, { "cell_type": "code", "execution_count": 67, "metadata": { "hidden": true }, "outputs": [], "source": [ "learn_c.load(f'{lang}clas_sp15');\n", "learn_c.to_fp32().export(f'{lang}_classifier_sp15')" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true, "hidden": true }, "source": [ "### Training (Loss = FlattenedLoss of LabelSmoothing CrossEntropy)" ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "hidden": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 12.5 s, sys: 200 ms, total: 12.7 s\n", "Wall time: 12.7 s\n" ] } ], "source": [ "%%time\n", "data_clas = load_data(path, f'{lang}_textlist_class_sp15', bs=bs, num_workers=1)" ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "hidden": true }, "outputs": [], "source": [ "config = awd_lstm_clas_config.copy()\n", "config['qrnn'] = True" ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "hidden": true }, "outputs": [], "source": [ "learn_c = text_classifier_learner(data_clas, AWD_LSTM, config=config, pretrained=False, drop_mult=0.5, \n", " metrics=[accuracy,f1]).to_fp16()\n", "learn_c.load_encoder(f'{lang}fine_tuned_enc_sp15');" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "#### Change loss function" ] }, { "cell_type": "code", "execution_count": 56, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/plain": [ 
"FlattenedLoss of CrossEntropyLoss()" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "learn_c.loss_func" ] }, { "cell_type": "code", "execution_count": 57, "metadata": { "hidden": true }, "outputs": [], "source": [ "learn_c.loss_func = FlattenedLoss(LabelSmoothingCrossEntropy)" ] }, { "cell_type": "code", "execution_count": 58, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/plain": [ "FlattenedLoss of LabelSmoothingCrossEntropy()" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "learn_c.loss_func" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "#### Training" ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "hidden": true }, "outputs": [], "source": [ "learn_c.freeze()" ] }, { "cell_type": "code", "execution_count": 103, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/html": [], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.\n" ] } ], "source": [ "learn_c.lr_find()" ] }, { "cell_type": "code", "execution_count": 104, "metadata": { "hidden": true }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "learn_c.recorder.plot()" ] }, { "cell_type": "code", "execution_count": 105, "metadata": { "hidden": true }, "outputs": [], "source": [ "lr = 2e-2\n", "lr *= bs/48\n", "\n", "wd = 0.01" ] }, { "cell_type": "code", "execution_count": 109, "metadata": { "hidden": true }, "outputs": [], "source": [ "from fastai.callbacks import *" ] }, { "cell_type": "code", "execution_count": 110, "metadata": { "hidden": true, "scrolled": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
| epoch | train_loss | valid_loss | accuracy | f1 | time |
|---|---|---|---|---|---|
| 0 | 0.810097 | 0.690017 | 0.581065 | 0.708460 | 03:20 |
| 1 | 0.801332 | 0.692924 | 0.577293 | 0.705548 | 03:09 |
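
The runs in this part replace the default loss with `FlattenedLoss(LabelSmoothingCrossEntropy)`. As a rough illustration of what label smoothing does, here is a plain-Python sketch of the formulation as I understand it: a mix of the usual negative log-likelihood and the average negative log-probability over all classes, with a smoothing factor `eps` (0.1 is assumed here as a typical default). The function name is my own and this is not the fastai code.

```python
import math


def label_smoothing_ce(logits, target, eps=0.1):
    """Label-smoothing cross-entropy for one sample (illustrative sketch)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    log_probs = [v - log_z for v in logits]
    nll = -log_probs[target]                        # standard cross-entropy term
    uniform = -sum(log_probs) / len(log_probs)      # penalty for over-confidence
    return (1 - eps) * nll + eps * uniform


confident = [6.0, -6.0]   # very sure of class 0
print(label_smoothing_ce(confident, 0, eps=0.0))   # plain cross-entropy
print(label_smoothing_ce(confident, 0, eps=0.1))   # smoothed: larger loss
```

Because the smoothed loss never reaches zero, its values are not directly comparable with the weighted cross-entropy losses of the previous part; accuracy and f1 are the columns to compare.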
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Better model found at epoch 0 with accuracy value: 0.5810647010803223.\n" ] } ], "source": [ "learn_c.fit_one_cycle(2, lr, wd=wd, moms=(0.8,0.7), \n", " callbacks=[ShowGraph(learn_c),\n", " SaveModelCallback(learn_c.to_fp32(),every='improvement',mode='max',monitor='accuracy',\n", " name='bestmodel_clas_sp15_labelsmoothing')])" ] }, { "cell_type": "code", "execution_count": 111, "metadata": { "hidden": true }, "outputs": [], "source": [ "learn_c.save(f'{lang}clas1_sp15_labelsmoothing')" ] }, { "cell_type": "code", "execution_count": 114, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
| epoch | train_loss | valid_loss | accuracy | f1 | time |
|---|---|---|---|---|---|
| 0 | 0.810106 | 0.669316 | 0.607031 | 0.732028 | 03:08 |
| 1 | 0.801576 | 0.667449 | 0.610543 | 0.735263 | 03:15 |
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Better model found at epoch 0 with accuracy value: 0.6070314049720764.\n", "Better model found at epoch 1 with accuracy value: 0.6105427145957947.\n" ] } ], "source": [ "learn_c.load(f'{lang}clas1_sp15_labelsmoothing');\n", "learn_c.fit_one_cycle(2, lr, wd=wd, moms=(0.8,0.7), \n", " callbacks=[ShowGraph(learn_c),\n", " SaveModelCallback(learn_c.to_fp32(),every='improvement',mode='max',\n", " monitor='accuracy',name='bestmodel_clas_sp15_labelsmoothing')])" ] }, { "cell_type": "code", "execution_count": 115, "metadata": { "hidden": true }, "outputs": [], "source": [ "learn_c.save(f'{lang}clas2_sp15_labelsmoothing')" ] }, { "cell_type": "code", "execution_count": 116, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
| epoch | train_loss | valid_loss | accuracy | f1 | time |
|---|---|---|---|---|---|
| 0 | 0.317804 | 0.280220 | 0.951968 | 0.972167 | 03:25 |
| 1 | 0.314769 | 0.274971 | 0.953008 | 0.972907 | 03:45 |
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Better model found at epoch 0 with accuracy value: 0.9519680738449097.\n", "Better model found at epoch 1 with accuracy value: 0.9530084729194641.\n" ] } ], "source": [ "learn_c.load(f'{lang}clas2_sp15_labelsmoothing');\n", "learn_c.freeze_to(-2)\n", "learn_c.fit_one_cycle(2, slice(lr/(2.6**4),lr), wd=wd, moms=(0.8,0.7), \n", " callbacks=[ShowGraph(learn_c),\n", " SaveModelCallback(learn_c.to_fp32(),every='improvement',mode='max',\n", " monitor='accuracy',name='bestmodel_clas_sp15_labelsmoothing')])" ] }, { "cell_type": "code", "execution_count": 117, "metadata": { "hidden": true }, "outputs": [], "source": [ "learn_c.save(f'{lang}clas3_sp15_labelsmoothing')" ] }, { "cell_type": "code", "execution_count": 118, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
| epoch | train_loss | valid_loss | accuracy | f1 | time |
|---|---|---|---|---|---|
| 0 | 0.301391 | 0.272535 | 0.959598 | 0.976547 | 04:16 |
| 1 | 0.291589 | 0.270674 | 0.961332 | 0.977425 | 04:29 |
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Better model found at epoch 0 with accuracy value: 0.9595977067947388.\n", "Better model found at epoch 1 with accuracy value: 0.9613317251205444.\n" ] } ], "source": [ "learn_c.load(f'{lang}clas3_sp15_labelsmoothing');\n", "learn_c.freeze_to(-3)\n", "learn_c.fit_one_cycle(2, slice(lr/2/(2.6**4),lr/2), wd=wd, moms=(0.8,0.7), \n", " callbacks=[ShowGraph(learn_c),\n", " SaveModelCallback(learn_c.to_fp32(),every='improvement',mode='max',monitor='accuracy',\n", " name='bestmodel_clas_sp15_labelsmoothing')])" ] }, { "cell_type": "code", "execution_count": 119, "metadata": { "hidden": true }, "outputs": [], "source": [ "learn_c.save(f'{lang}clas4_sp15_labelsmoothing')" ] }, { "cell_type": "code", "execution_count": 120, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
| epoch | train_loss | valid_loss | accuracy | f1 | time |
|---|---|---|---|---|---|
| 0 | 0.290772 | 0.268086 | 0.962979 | 0.978403 | 06:32 |
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Better model found at epoch 0 with accuracy value: 0.9629790186882019.\n" ] } ], "source": [ "learn_c.load(f'{lang}clas4_sp15_labelsmoothing');\n", "learn_c.unfreeze()\n", "learn_c.fit_one_cycle(1, slice(lr/10/(2.6**4),lr/10), wd=wd, moms=(0.8,0.7), \n", " callbacks=[ShowGraph(learn_c),\n", " SaveModelCallback(learn_c.to_fp32(),every='improvement',mode='max',monitor='accuracy',\n", " name='bestmodel_clas_sp15_labelsmoothing')])" ] }, { "cell_type": "code", "execution_count": 121, "metadata": { "hidden": true }, "outputs": [], "source": [ "learn_c.save(f'{lang}clas5_sp15_labelsmoothing')" ] }, { "cell_type": "code", "execution_count": 122, "metadata": { "hidden": true }, "outputs": [], "source": [ "learn_c.load(f'{lang}clas5_sp15_labelsmoothing');\n", "learn_c.save(f'{lang}clas_sp15_labelsmoothing')" ] }, { "cell_type": "code", "execution_count": 123, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
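| epoch | train_loss | valid_loss | accuracy | f1 | time |
|---|---|---|---|---|---|
| 0 | 0.281384 | 0.265413 | 0.962849 | 0.978329 | 06:18 |
| 1 | 0.275690 | 0.285689 | 0.964106 | 0.979017 | 06:17 |
| 2 | 0.273247 | 0.272905 | 0.964280 | 0.979153 | 06:04 |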
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Better model found at epoch 0 with accuracy value: 0.962848961353302.\n", "Better model found at epoch 1 with accuracy value: 0.9641061425209045.\n", "Better model found at epoch 2 with accuracy value: 0.9642795324325562.\n" ] } ], "source": [ "learn_c.load(f'{lang}clas5_sp15_labelsmoothing');\n", "learn_c.fit_one_cycle(3, slice(lr/10/(2.6**4),lr/10), wd=wd, moms=(0.8,0.7), \n", " callbacks=[ShowGraph(learn_c),\n", " SaveModelCallback(learn_c.to_fp32(),every='improvement',mode='max',monitor='accuracy',\n", " name='bestmodel_clas_sp15_labelsmoothing')])" ] }, { "cell_type": "code", "execution_count": 124, "metadata": { "hidden": true }, "outputs": [], "source": [ "learn_c.save(f'{lang}clas6_sp15_labelsmoothing')" ] }, { "cell_type": "code", "execution_count": 59, "metadata": { "hidden": true }, "outputs": [], "source": [ "learn_c.load(f'{lang}clas6_sp15_labelsmoothing');\n", "learn_c.save(f'{lang}clas_sp15_labelsmoothing')" ] }, { "cell_type": "code", "execution_count": 60, "metadata": { "hidden": true }, "outputs": [], "source": [ "learn_c.load(f'{lang}clas_sp15_labelsmoothing');\n", "learn_c.to_fp32().export(f'{lang}_classifier_sp15_labelsmoothing')" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true, "hidden": true }, "source": [ "### Confusion matrix" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "hidden": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 13.3 s, sys: 512 ms, total: 13.8 s\n", "Wall time: 13 s\n" ] } ], "source": [ "%%time\n", "data_clas = load_data(path, f'{lang}_textlist_class_sp15', bs=bs, num_workers=1)\n", "\n", "config = awd_lstm_clas_config.copy()\n", "config['qrnn'] = True\n", "\n", "learn_c = text_classifier_learner(data_clas, AWD_LSTM, config=config, pretrained=False, drop_mult=0.5, \n", " metrics=[accuracy,f1])\n", 
"learn_c.load_encoder(f'{lang}fine_tuned_enc_sp15');\n", "\n", "learn_c.load(f'{lang}clas_sp15');\n", "\n", "# put weight on cpu\n", "loss_weights = torch.FloatTensor(trn_weights).cpu()\n", "learn_c.loss_func = partial(F.cross_entropy, weight=loss_weights)" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "hidden": true }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "preds,y,losses = learn_c.get_preds(with_loss=True)\n", "predictions = np.argmax(preds, axis = 1)\n", "\n", "interp = ClassificationInterpretation(learn_c, preds, y, losses)\n", "interp.plot_confusion_matrix()" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "hidden": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[ 2404 162]\n", " [ 1337 19165]]\n", "accuracy global: 0.9350182070400554\n", "accuracy on negative reviews: 93.68667186282151\n", "accuracy on positive reviews: 93.47868500634084\n" ] } ], "source": [ "from sklearn.metrics import confusion_matrix\n", "cm = confusion_matrix(np.array(y), np.array(predictions))\n", "print(cm)\n", "\n", "## acc\n", "print(f'accuracy global: {(cm[0,0]+cm[1,1])/(cm[0,0]+cm[0,1]+cm[1,0]+cm[1,1])}')\n", "\n", "# acc neg, acc pos\n", "print(f'accuracy on negative reviews: {cm[0,0]/(cm[0,0]+cm[0,1])*100}') \n", "print(f'accuracy on positive reviews: {cm[1,1]/(cm[1,0]+cm[1,1])*100}')" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
| text | target | prediction |
|---|---|---|
| ▁xxbos ▁xxmaj ▁voila ▁une ▁intégrale ▁bienvenue ▁que ▁xxup ▁decca ▁sort ▁judicieusement ▁de ▁ses ▁carton s , ▁ce ▁ne ▁sont ▁que ▁des ▁enregistrements ▁de ▁qualité ▁de ▁l ' immense ▁pianiste ▁( dont ▁de ▁nombreux ▁introuvable s ). ▁xxmaj ▁ses ▁concertos ▁de ▁xxmaj ▁beethoven ▁avec ▁xxmaj ▁ kn apper ts bus ch ▁sont ▁fabuleux ▁tout ▁comme ▁les ▁deux ▁concertos ▁de ▁xxmaj ▁brahms , ▁pour ▁moi ▁la ▁meilleure ▁version ▁jamais ▁enregistrée . ▁xxmaj | pos | pos |
| ▁xxbos ▁h tt p : ▁/ ▁/ ▁w ww . amazon . co . uk ▁/ ▁sony - ber n stein - edition ▁/ ▁for um ▁/ ▁ fx 1 n p 1 l t py n 3 x 6 c ▁/ ▁ t x 2 ry vi 6 c j hy mes ▁/ ▁1 ? _ en co ding = ut f 8 & asin = b 00 ll | pos | pos |
| ▁xxbos ▁xxmaj ▁même ▁si ▁je ▁vais ▁me ▁faire ▁des ▁ennemis , ▁je ▁persiste ▁et ▁signe , ▁xxmaj ▁pierce ▁xxmaj ▁brosnan ▁est ▁crédible ▁dans ▁le ▁rôle , s e an ▁xxmaj ▁connery ▁l ' acteur ▁fondateur ▁du ▁rôle ▁titre , ▁donc ▁ intouchable , ▁xxmaj ▁george ▁xxmaj ▁la z en by ▁et ▁xxmaj ▁thi mo thy ▁xxmaj ▁d al ton , ▁crédibles ▁malgré ▁un ▁passage ▁très ▁court ▁( ▁1 ▁film ▁pour | pos | pos |
| ▁xxbos ▁xxmaj ▁wa ou h , ▁au ▁boulot . ▁xxmaj ▁je ▁re ç ois ▁ce ▁bloc ▁( di vin , ▁salut ▁xxmaj ▁vol odi a !) ▁- ▁et , ▁je ▁le ▁souligne , ▁à ▁20 ▁euros ▁de ▁moins ▁que ▁proposé ▁ici ▁en ▁passant ▁par ▁un ▁vendeur ▁d ' amazon ▁xxup ▁uk ▁- ▁alors ▁même ▁que ▁j ' ent a mais ▁( tra î nais ▁depuis ▁un ▁moment , ▁je ▁suis | pos | pos |
| ▁xxbos ▁xxmaj ▁tout ▁y ▁est , ▁l ' histoire , ▁la ▁réalisation ▁du ▁très ▁grand ▁xxup ▁ridley ▁xxup ▁scott , ▁xxmaj ▁les ▁acteurs ▁et ▁surtout ▁xxup ▁russell ▁xxup ▁crowe , ▁xxmaj ▁les ▁xxmaj ▁décors , ▁enfin ▁tout ▁et ▁surtout ▁en ▁version ▁longue . ▁xxmaj ▁ gladiator ▁( ou ▁xxmaj ▁gladiateur ▁au ▁xxmaj ▁québec ▁et ▁au ▁nouveau - br un s wick ) ▁est ▁un ▁film ▁ américan o - br | pos | pos |
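
The confusion matrix printed above for the validation set, `[[2404, 162], [1337, 19165]]`, can be turned into the usual per-class figures by hand. The sketch below assumes the scikit-learn convention (rows are true labels, columns are predictions) and that class 0 is `neg` and class 1 is `pos`. Note that what the notebook prints as "accuracy on negative reviews" and "accuracy on positive reviews" is, in standard terminology, the recall of each class.

```python
# Validation confusion matrix from the notebook (rows: true, columns: predicted).
cm = [[2404, 162],
      [1337, 19165]]

tn, fp = cm[0]   # true negatives, negatives predicted as positive
fn, tp = cm[1]   # positives predicted as negative, true positives
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
recall_neg = tn / (tn + fp)        # share of negative reviews correctly found
recall_pos = tp / (tp + fn)        # share of positive reviews correctly found
precision_pos = tp / (tp + fp)     # share of "pos" predictions that are right
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

print(f"accuracy:      {accuracy:.4f}")
print(f"recall (neg):  {recall_neg:.4f}")
print(f"recall (pos):  {recall_pos:.4f}")
print(f"precision pos: {precision_pos:.4f}")
print(f"f1 (pos):      {f1_pos:.4f}")
```

Reporting recall per class is useful here because the dataset is unbalanced: a model that always answers `pos` would reach about 89% global accuracy while finding no negative review at all.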
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "learn_c.show_results()" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "hidden": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "neg tensor([0.9970, 0.0030])\n" ] } ], "source": [ "# Trying out some random sentences I made up\n", "\n", "review = 'Ce produit est bizarre.'\n", "pred = learn_c.predict(review)\n", "print(pred[0], pred[2])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fine-tuning \"backward Classifier\"" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [], "source": [ "bs = 18" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true }, "source": [ "### Databunch" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "hidden": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 2.42 s, sys: 376 ms, total: 2.79 s\n", "Wall time: 2.79 s\n" ] } ], "source": [ "%%time\n", "data_lm = load_data(path, f'{lang}_databunch_lm_aws_sp15_bwd', bs=bs, backwards=True)" ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "hidden": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 7min 26s, sys: 7.3 s, total: 7min 33s\n", "Wall time: 4min 29s\n" ] } ], "source": [ "%%time\n", "data_clas = (TextList.from_df(df_trn_val, path, vocab=data_lm.vocab, cols='review_body', \n", " processor=[OpenFileProcessor(), SPProcessor(max_vocab_sz=15000)])\n", " .split_by_rand_pct(0.1, seed=42)\n", " .label_from_df(cols='label')\n", " .databunch(bs=bs, num_workers=1, backwards=True))" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "hidden": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 5.9 s, sys: 1.1 s, total: 7 s\n", "Wall time: 7.49 s\n" ] } ], "source": [ "%%time\n", "data_clas.save(f'{lang}_textlist_class_sp15_bwd')" 
] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true }, "source": [ "### Get weights to penalize loss function of the majority class" ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "hidden": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 13.2 s, sys: 432 ms, total: 13.6 s\n", "Wall time: 13 s\n" ] } ], "source": [ "%%time\n", "data_clas = load_data(path, f'{lang}_textlist_class_sp15_bwd', bs=bs, num_workers=1, backwards=True)" ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/plain": [ "(207616, 23068, 230684)" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "num_trn = len(data_clas.train_ds.x)\n", "num_val = len(data_clas.valid_ds.x)\n", "num_trn, num_val, num_trn+num_val" ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/plain": [ "(array([ 23071, 184545]), array([ 2566, 20502]))" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "trn_LabelCounts = np.unique(data_clas.train_ds.y.items, return_counts=True)[1]\n", "val_LabelCounts = np.unique(data_clas.valid_ds.y.items, return_counts=True)[1]\n", "trn_LabelCounts, val_LabelCounts" ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/plain": [ "([0.888876579839704, 0.11112342016029597],\n", " [0.8887636552800416, 0.11123634471995836])" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "trn_weights = [1 - count/num_trn for count in trn_LabelCounts]\n", "val_weights = [1 - count/num_val for count in val_LabelCounts]\n", "trn_weights, val_weights" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Training (Loss = FlattenedLoss of weighted CrossEntropyLoss)" ] }, { "cell_type": "code", "execution_count": 66, 
"metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 12.6 s, sys: 424 ms, total: 13 s\n", "Wall time: 12.9 s\n" ] } ], "source": [ "%%time\n", "data_clas = load_data(path, f'{lang}_textlist_class_sp15_bwd', bs=bs, num_workers=1, backwards=True)" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [], "source": [ "config = awd_lstm_clas_config.copy()\n", "config['qrnn'] = True" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [], "source": [ "learn_c = text_classifier_learner(data_clas, AWD_LSTM, config=config, drop_mult=0.5, metrics=[accuracy,f1]).to_fp16()\n", "learn_c.load_encoder(f'{lang}fine_tuned_enc_sp15_bwd');" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true }, "source": [ "#### Change loss function" ] }, { "cell_type": "code", "execution_count": 69, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/plain": [ "FlattenedLoss of CrossEntropyLoss()" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "learn_c.loss_func" ] }, { "cell_type": "code", "execution_count": 70, "metadata": { "hidden": true }, "outputs": [], "source": [ "loss_weights = torch.FloatTensor(trn_weights).cuda()\n", "learn_c.loss_func = partial(F.cross_entropy, weight=loss_weights)" ] }, { "cell_type": "code", "execution_count": 71, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/plain": [ "functools.partial(, weight=tensor([0.8889, 0.1111], device='cuda:0'))" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "learn_c.loss_func" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Training" ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [], "source": [ "learn_c.freeze()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "learn_c.lr_find()" ] }, { "cell_type": "code", 
"execution_count": 74, "metadata": {}, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYkAAAEGCAYAAACQO2mwAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy8QZhcZAAAgAElEQVR4nO3de3xcdZ3/8dcnk0tzbZomvd8vgFVbLgFhQSyoiOiCLF7o6uJ1+a274ILirrv+FhTXxct6WUXF4gK6KiyCulzFGywoIKRc2lJaKKW06S1p01yaZJK5fPaPmbRDmsmlzcmcSd7Px2MenTnnOzOfbyeZTz7f7znfY+6OiIjIQApyHYCIiISXkoSIiGSlJCEiIlkpSYiISFZKEiIiklVhrgMYqdraWl+wYEGuwxARyStr1qzZ6+51I31e3iWJBQsW0NDQkOswRETyipm9ciTP03CTiIhkpSQhIiJZKUmIiEhWShIiIpKVkoSIiGSlJCEiIlkpSYiISFZKEiIiIXTP2p1s2t2R6zCUJEREwujK/36G86//Az9r2J7TOAJLEmZ2k5k1mdn6LPtXmlmbmT2Tvl0dVCwiIvkklkgSSziRAuPTd6zlM3euJRpL5CSWIJfluAW4HvjRIG0ecfd3BhiDiEje6YknAfjEm5fSEY3xnQdfYt2ONr73/pOYN7VsTGMJrJJw94eBlqBeX0RkvOpJVw1lxRE+/bbj+MEl9Wxv6eKWR7eOeSy5XuDvNDN7FtgJXOXuzw3UyMwuBS4FmDdv3hiGJyIy9qLpSmJSYQSAtyybzr2feCN1lSVjHksuJ66fAua7+wrg28AvszV099XuXu/u9XV1I17pVkQkr/RVEiVFh76i59aUMakoMuax5CxJuHu7ux9I378PKDKz2lzFIyISFtFYqpIoKRz7pNBfzpKEmc0wM0vfPyUdy75cxSMiEhY98cMriVwJbE7CzG4FVgK1ZtYIXAMUAbj7DcC7gY+bWRzoBi52dw8qHhGRfHGokhjHScLdVw2x/3pSh8iKiEiGvkoiF3MQ/eU+TYmIyKv0nScRhkoi9xGIiMir9J1drUpCREQOo0pCRESy6lElISIi2aiSEBGRrPrmJCb0yXQiIjKwnniSAoOiiOU6FCUJEZGwicYSlBRGSC9KkVNKEiIiIdMTTzIpBEtygJKEiEjo9MSSoZiPACUJEZHQicYTqiRERGRgqiRERCQrVRIiIpKVKgkREckqGk+E4oJDEGCSMLObzKzJzNYP0e5kM0uY2buDikVEJJ9MlEriFuDcwRqYWQT4MvBAgHGIiOSVCVFJuPvDQMsQzS4H7gSagopDRCTf9MSSTJoAlcSgzGw2cCFwQ65iEBEJo554cvxXEsPwTeAf3T0xVEMzu9TMGsysobm5eQxCExHJnZ5YIjSVRGEO37seuC29gFUtcJ6Zxd39l/0buvtqYDVAfX29j2mUIiJjLEyVRM6ShLsv7LtvZrcA9wyUIEREJpJE0ulNhGdOIrAkYWa3AiuBWjNrBK4BigDcXfMQIiID6O27Kt14ryTcfdUI2n4oqDhERPLJoavShSNJhCMKEREBDl3felJROIablCREREKkJ65KQkREsojGVEmIiEgWqiRERCQrVRIiIpKVKgkREcmqr5KYCEuFi4jICPVVErp8qYiIHEaVhIiIZKVKQkREsupRJSEiItlE+45uUiUhIiL9HaokwvH1HI4oREQESFUSJYUFpC/IlnNKEiIiIdITS4amigAlCRGRUOmJJygJyZIcEGCSMLObzKzJzNZn2X+Bma01s2fMrMHMzggqFhGRfNETS4bm8FcItpK4BTh3kP2/A1a4+/HAR4AfBBiLiEhe6IknQ3P4KwSYJNz9YaBlkP0H3N3TD8sBz9ZWRGSiiMYSE6aSGJ
KZXWhmG4F7SVUT2dpdmh6Samhubh67AEVExtiEqSSGw91/4e7HAe8CvjBIu9XuXu/u9XV1dWMXoIjIGFMlMYD00NRiM6vNdSwiIrmkSiLNzJZY+mwRMzsRKAb25SoeEZEwCFslURjUC5vZrcBKoNbMGoFrgCIAd78BuAi4xMxiQDfwvoyJbBGRCSlslURgScLdVw2x/8vAl4N6fxGRfBSNJXTGtYiIDKwnnmTSRDjjWkRERq4nrkpCREQG4O5EY8mJsXaTiIiMTG8iXNeSACUJEZHQiKYvOKQ5CREROUxP36VLVUmIiEh/Ybt0KShJiIiERl8loeEmERE5TFSVhIiIZKNKQkREstKchIiIZBVVJSEiItkcrCRCtFR4eCIREZngDlYSIVoqXElCRCQkVEmIiEhW0VjfGdcToJIws5vMrMnM1mfZ/34zW5u+PWpmK4KKRUQkH/TE+9ZuCs/f70FGcgtw7iD7Xwbe5O7LgS8AqwOMRUQk9PqSRJgqiSAvX/qwmS0YZP+jGQ8fB+YEFYuISD6IxhIURYxIgeU6lIPCUtN8FLg/204zu9TMGsysobm5eQzDEhEZOz3xZKiqCAhBkjCzs0gliX/M1sbdV7t7vbvX19XVjV1wIiJjKBpLhGo+AgIcbhoOM1sO/AB4u7vvy2UsIiK5pkoig5nNA34O/JW7v5CrOEREwiIaS4TqHAkIsJIws1uBlUCtmTUC1wBFAO5+A3A1MBX4rpkBxN29Pqh4RETCLoyVRJBHN60aYv/HgI8F9f4iIvkmGkuEagVYCMHEtYiIpPTEk6GbuA5XNCIiE1gYh5uUJEREQqInhIfAhisaEZEJTJWEiIhkFcaT6cIVjYjIBKZKQkREssrbQ2DNbLGZlaTvrzSzT5hZdbChiYhMLKlDYPOzkrgTSJjZEuA/gYXATwOLSkRkgoknkiSSnp+VBJB09zhwIfBNd78SmBlcWCIiE0v04FXp8rOSiJnZKuCDwD3pbUXBhCQiMvH09F3fOk+PbvowcBrwRXd/2cwWAj8OLiwRkYnlYCURsqObhrXAn7tvAD4BYGZTgEp3/1KQgYmITCR5XUmY2UNmVmVmNcCzwM1m9vVgQxMRmTiisVQlka/nSUx293bgL4Cb3f0k4C3BhSUiMrH0xPO4kgAKzWwm8F4OTVyLiMgoOVRJ5GeSuBZ4AHjJ3Z80s0XAi4M9wcxuMrMmM1ufZf9xZvaYmfWY2VUjC1tEZHzpqyTy8hBYd/+Zuy9394+nH29x94uGeNotwLmD7G8hNRn+78OJQURkPOuJ53ElYWZzzOwX6cpgj5ndaWZzBnuOuz9MKhFk29/k7k8CsZGFLCIy/kRjeVxJADcDdwGzgNnA3eltY8LMLjWzBjNraG5uHqu3FREZM3ldSQB17n6zu8fTt1uAugDjehV3X+3u9e5eX1c3Zm8rIjJmevK8kthrZh8ws0j69gFgX5CBiYhMJPleSXyE1OGvu4FdwLtJLdUhIiKjIKxzEsNdlmMbcH7mNjO7AvhmtueY2a3ASqDWzBqBa0gvCujuN5jZDKABqAKS6ddblj5pT0RkQumJJykwKCywXIfyKsNKEll8kkGShLuvGuzJ7r4bGPQIKRGRiSJ1VboIZuFKEkcz+BWunoiI5LHUVenCNR8BR5ckfNSiEBGZ4HpiydAt7gdDDDeZWQcDJwMDSgOJSERkAorGE6GsJAZNEu5eOVaBiIhMZGGtJMKXtkREJqCwVhLhi0hEZAJSJSEiIllF44nQXXAIlCREREJBlYSIiGTVo0pCRESyicaSTFIlISIiA+mJJ1VJiIjIwHpiCVUSIiIyMB3dJCIiA4rGEsQSTuWko1mYOxhKEiIiOdbWHQNgcmlRjiM5XGBJwsxuMrMmM1ufZb+Z2bfMbLOZrTWzE4OKRUQkzFq7UkmiurQ4x5EcLshK4hbg3EH2vx1Ymr5dCnwvwFhEREJrQlYS7v4w0DJIkwuAH3nK40C1mc0MKh4RkbCakE
liGGYD2zMeN6a3iYhMKK1dvQBUlylJZBro8qcDXu3OzC41swYza2hubg44LBGRsdVXSVSpkniVRmBuxuM5wM6BGrr7anevd/f6urq6MQlORGSstHXHMIPKEh0Cm+ku4JL0UU6nAm3uviuH8YiI5ERbd4zJpUUUFAw0wJJbgaUtM7sVWAnUmlkjcA1QBODuNwD3AecBm4Eu4MNBxSIiEmZ9SSKMAksS7r5qiP0O/F1Q7y8iki9au2JUhzRJ6IxrEZEca+uOhXLSGpQkRERyrj3Ew01KEiIiOdbaHQvlORKgJCEiklPuHuqJayUJEZEc6uxNkEi6koSIiBzu4JIcIVwBFpQkRERyKsxLcoCShIhIToV5BVhQkhARyam2vgsO6egmERHpT5WEiIhkpSQhIiJZtXbHKIoYZcWRXIcyICUJEZEc6juRzix8y4SDkoSISE6FeXE/UJIQEcmpthAvEw5KEiIiORXmdZsg4CRhZuea2SYz22xmnxlg/3wz+52ZrTWzh8xsTpDxiIiEzYRNEmYWAb4DvB1YBqwys2X9mv078CN3Xw5cC1wXVDwiImHU2tVLdVk4122CYCuJU4DN7r7F3XuB24AL+rVZBvwuff/BAfaLiIxbyaTT0ROfsBPXs4HtGY8b09syPQtclL5/IVBpZlP7v5CZXWpmDWbW0NzcHEiwIiJjrSMaxz28J9JBsElioIN+vd/jq4A3mdnTwJuAHUD8sCe5r3b3enevr6urG/1IRURyoLW7b5nw8CaJwgBfuxGYm/F4DrAzs4G77wT+AsDMKoCL3L0twJhEREIj7EtyQLCVxJPAUjNbaGbFwMXAXZkNzKzWzPpi+CfgpgDjEREJlYNJIqQrwEKAScLd48BlwAPA88Dt7v6cmV1rZuenm60ENpnZC8B04ItBxSMiEjatfcuEh7iSCHK4CXe/D7iv37arM+7fAdwRZAwiImE10YebRERkEGG/dCkoSYiI5Exbd4xJRQVMKgrnMuGgJCEikjNtXeFekgOUJEREcibs6zaBkoSISM60dvdSXRredZtASUJEJGfausO9bhMoSYiI5ExbV6+Gm0REZGBt3TGqQ3y2NShJiIjkRCyRpLM3oUpCREQOlw9nW4OShIhITvQlCQ03iYjIYfJhSQ5QkhARyYm2Lg03iYhIFgeHm5QkRESkP01ci4hIVn0XHJrQcxJmdq6ZbTKzzWb2mQH2zzOzB83saTNba2bnBRmPiEhYtHXHqCgppCgS7r/VA4vOzCLAd4C3A8uAVWa2rF+z/0/qsqYnkLoG9neDikdEJEzyYQVYCLaSOAXY7O5b3L0XuA24oF8bB6rS9ycDOwOMR0QkNNq6e0M/1ATBJonZwPaMx43pbZk+B3zAzBpJXQv78oFeyMwuNbMGM2tobm4OIlYRkTHV1h0L/ZFNEGySsAG2eb/Hq4Bb3H0OcB7wX2Z2WEzuvtrd6929vq6uLoBQRUTGloabUpXD3IzHczh8OOmjwO0A7v4YMAmoDTAmEZFQaM2DS5dCsEniSWCpmS00s2JSE9N39WuzDXgzgJm9hlSS0HiSiIxb7s5vNuyhpbM39Os2ARQG9cLuHjezy4AHgAhwk7s/Z2bXAg3ufhfwKeBGM7uS1FDUh9y9/5DUuJVIOpGCgUblRGQ82tJ8gM/fvYH/faGZpdMquPiUebkOaUiBJQkAd7+P1IR05rarM+5vAE4PMoYwWtfYxvUPvshvNuxhcV0FpyysOXibObk01+GJyChzd77x2xf53kObmVQY4V/euYxLTpsf+nMkIOAkIYf0xpM8tW0/333oJR5+oZmqSYW8/w3z2b6/i/95Zic/+dM2AE5fMpX3v2E+b102PS9+gERkaHesaeRbv3uR81fM4l/euYy6ypJchzRsShIB2bq3k9WPbOHl5k62tXSxq62bpMPU8mL+4dxj+atT51M5KTUemUg6z+9q58GNTdz25Hb+9idPUVdZwruOn0VFSRG9iQSxhFMUMS45bQ
HTqybluHciMlzbW7r4/N0bOGVhDd943/F5N8Rs+TYFUF9f7w0NDaP+urFEkkdf2kdxpIDaimKmVpRQXVpEwQg/UHfn9obtfP7uDQC8ZmYV82rKmFtTxuK6cs5ZNoPS4kjW5yeSzkObmvjJn7bx4KYm3CFSYBRFjFjCqSgp5NoLXsv5K2ZhNnBs7s5T21r57fN72NvRQ2t3jLauGF2xOPNryjl2RiXHTK/kmOkVzK0pU8UiEpBE0rl49WNs3NXB/Ve8kTlTynIWi5mtcff6kT5PlQSpL9XP/mIdtzc0vmp7eXGEv3/LUj5y+kIKh/FFur+zl3/6+Tp+9dxuTls0la+/b8WI5xgiBcabXzOdN79mOr3xJJECO/iXx5bmA3zqZ8/y97c9w/3rdvOvF76O2oqSg33Y3R7lF0/v4I41jWxp7qQoYtRWlDC5tIjqsiKmlpewfmcb963fRd/fBpECY3Z1KfOnphJZbUUJNWVFTCkvpraihGOmV+ZVaSwSJt9/+CWe3Lqfr793RU4TxNFQJQH88NGtXHPXc3zsjIWcfdw09nb2su9AD4+8uJffb2zidbOr+NJfLOd1sydnfY2ntu3n4z9eQ0tnL1edcyx//cZFI65ChiORdG58ZAtf//ULFEWMspJCunsTdPXGSaY/ypMXTOE9J83lvOUzqSg5/O+Art44L+45wKY9HWzb18UrLV28sq+T7S1dtHbH6P8jMa2yhNfOqmL5nGo++GcLqCkvHvV+jZS78/yuDqLxBBFLJdKSwgLmTS2jpPDwSq1xfxfrd7SxsLaCpdMqAvlsRDKt39HGhd/9I+csm8H1f3lC1sp/rBxpJTHhk8RjL+3jA//5J1YeU8eNl9S/6svD3bl//W6uues5Wjp7+egZC7n87CUH5xL63L9uF1f89zNMr5rEd99/4qDJZLS8sKeDm//4MgClRYWUFUeYXFrEW5dNZ0Ft+RG/biLptHXHaOnspak9yoZd7WzY2c5zO9t5samDmvJivnjh63nba2eMVldG7Klt+7nuvud5cuv+w/YVRYzjZlSxfM5klkyrYMPOdh7bso/G/d0H21RNKuSk+VNYPqea9miMxv3dbG/poqWzlwtPnM1lZx3+GYuMRDSW4M+//QfaumM8cMWZTAnBH1ZKEkegcX8X51//R6rLivjl351OVZYvhrbuGF+6fyO3PrGNKWVFXHb2Uj5w6jyKIwXc+MgWrrt/IyfMrebGS+qZWjF+h2ae39XOp25/lg272rnwhNl87s9fy+QxPBloS/MBvvrAJu5fv5vaihIuO2sxC2rLSSSdRNLpjiXYsKuddY1trGtso6MnzuTSIk5dVMOpi6ayYm41W5o7WfNKC09u3c/mpgOUFUeYO6WMOVNKKSgwfrNhD1PLi/nUOcfyvpPn5t0ko4TDdfc/z/f/dwu3fPhkVh47LdfhAEoSI9YRjfG+7z/O9pYufnnZ6SyuqxjyOesa2/jKAxt55MW9zK4uZcXcydy3bjfveP1MvvbeFUwqyj4hPV70xpN858HNfOfBzVSXFXPqohoW1VWwuK6cWdWl7GztZnPTAV7cc4Adrd3MmVLKsTMqOW5GJUumVVJWHKHAjIICMIwDPXHaumO0d8fo7IlTU17MrOpSplWVUFRQwPqdbTy0qZkHNzXxzPZWSosi/L8zF/OxNy6kfIChtD7JpNPU0cO0ypKsQ0vRWIKSwoJXDQOsa2zjC/ds4ImtLSydVsGxMyopihRQWGAURgqIFJAe3iqgwKCzN8GBnjgHojE6e1OvV1FSSFlxIaXFBbR3x9nX2cO+A710ROOcdVwdHz1jEQuPotobyoad7fz0iVe4b91uZlRNOngOzskLao5qfqlxfxelRRFqyotzPnQSZk9t28+7v/co762fy5cuWp7rcA5SkhiBhzY18c8/X8fu9ij/+cGTOeu4kWX6P7y4ly//aiPrdrTxN29azD+87dgJN8a9fkcb3/ztC2za00Hj/u5XzWNECoz5NW
XMnlLK9pbUnMeR/JiVFUfo6k1gBivmVHPWsdP4yzfMC3wi3d351frdfP/hLbR3x4glk8QTTizhJD1VtSSTTsKdsuJCKicVUlFSSGlxhN54ks6eOF3peaKq0iKmlqeOlouY8fuNTcSSSd583HQ+esZCTl4wZVgHRWRqj8bYtLuDl5oOEEs6ETMKC4zuWIJfPrODp7e1UlJYwFuXTaels5entu0nGksCMKNqEsfNrOS4GVUsnVZBVyzBnrYou9uj7DvQw5JpFZy6aConL6yhalIRTe1R/ueZnfz86R08v6sdSB3QMbemjHk1ZSxJJ9JjpleyqK58wPmgoMQSSXa2drO9pfvgYeYzJk9i+exqjp1RSXHh2B+1F40leMe3HqGrN8EDV56ZdXQiF5QkhqG1q5dr79nAz5/awZJpFXzl3cs5cd6UI3qtZNLZ0drN3Jr8PGJhNEVjCV7Z18XO1m5mT0kdKZX5ZdE3Uf5S8wF640kSnvqSdaCipJDJpUVMLi2itDhCS2cvu1qj7GqLsr+rlxVzJ3Pm0rpxM4zX1BHlx4+9wn89/gr7u2IURwpYVFd+8JDkY6anvsD7hr9640nW7Wjjya0tNGxt4fldHexo7c76+ovryvnLN8znohNnU12WGgfvjSdZv7ONNVv38/yudp5PJ5jeRCpxRAqMuooSqsuK2NLcSW8iSYHBwtpyXt7bSdJhxZzJ/PmKWRSYsa2l62Dy37q3k3j6iInCAuP0JbVcdNIczlk2/Ygr6237uvjZmu3cu3YXZSURlk6rZOn0ChbVlrO7LcpzO9tZv7OdzU0dxBIDf38VFxbwmhmVlBUXEk8mDyb4ZTOrOOu4aZyxpPZgJRqNJVL/L7s6WDKtgvr5U474j77r7nue7z+8hR9+5BTedEy4VqxWkhjCo5v38onbnqG1q5ePr1zMZWcvGdO/ekQydfcm+PWG3WzY2c4Lezp4IT0816esOMK8mjK27us8WAUsqivn9bMnHxy+WzqtkpKiApJJSKR/j2dNnjSsoaBYIsn2li4qSgpTVU76SzEaS/DUtv08vqWFZ7e38vrZk3nXCbNZMm3g4djeeJKX93ayaU8H63e0cc+zO9nZFqWypJB3LJ/JxafMY8WcyUPGlEg6967bxW1PbOPRl/ZhBqcvrsUMXtxzgN3t0YNtp5YX89rZk1k2s4pFdeUHz0OaXlnCztYozza2sm5HG+t3tNEbT1IYMYoiBbjDs9tb6eiJUxwp4OSFU2jvjrNxd/urks30qhLe/rqZvGP5TJZOq6CipHBY1V5Yh5n6KEkM4YU9HXzmzrX867tez7JZVUM/QWSMHeiJ88KeDjbtTt1e3tvJorpy3rCwhvoFNQfPiQmzZNJ5fMs+7niqkV+t301Xb4LXza7iA2+Yz/nHz6KsuPCw9veu28U3fvsCW5o7mVtTyntPmstFJ81hVvWhc4zaozG27u1kWuUkpleVHPGcSG88ScPWFn6/sYk/bN5LTXkxy+dUc/zcyRw3o4q1O9q4d+1OHtzUTG88efB5pUURqkoLWXXKPC4/e+lhBzRs3dvJh25+gt54kgeuPDOUR8cpSQyDu2vCTWSMdERj/PLpHfz48W1s2tNBZUkhr5s9mXk1ZcybWkZVaRE/efwVNu7u4JjpFXzyrcdwzrIZoZjfO9AT5+EXmtndFuVAT5yOaIyXmjv5/cYmzlhSyzcvPv5g0r5n7U4+c+c6IgXGjZfUc8rCmhxHPzAlCREJJXen4ZX93NHQyItNHWxr6WbvgR4gNe9xxVuW8s7ls0J/uHHfkjtX/89zVJcV8bX3HM+vntvFjx/fxgnzqvn2qhNCfVa1koSI5I2u3jh72nuYO6V0xEd35dqGne387U/WsHVfFwCXnrmIT7/t2NCvgaa1m0Qkb5QVF7KwNj+/fpbNquLuy8/g27/fzGmLpo74EPp8E+inZGbnAv9B6sp0P3D3L/Xb/w3grPTDMmCau1cHGZOIyNGqnFTEP5/3mlyHMS
YCSxJmFgG+A7wVaASeNLO70lejA8Ddr8xofzlwQlDxiIjIyAU5iHYKsNndt7h7L3AbcMEg7VcBtwYYj4iIjFCQSWI2sD3jcWN622HMbD6wEPh9lv2XmlmDmTU0NzePeqAiIjKwIJPEQMezZTuU6mLgDndPDLTT3Ve7e72719fVhetUdxGR8SzIJNEIzM14PAfYmaXtxWioSUQkdIJMEk8CS81soZkVk0oEd/VvZGbHAlOAxwKMRUREjkBgScLd48BlwAPA88Dt7v6cmV1rZudnNF0F3Ob5dlafiMgEEOh5Eu5+H3Bfv21X93v8uSBjEBGRI5d3y3KYWTPwygC7JgNtR/i4737fv7XA3iMMsf/7jLRNWPoxVJxD7R/NfkCwn8lI+jHQtoFiz7yvfgw/zqHaqB9H3o/57j7yI3/cfVzcgNVH+rjvfsa/DaMVx0jbhKUfw+nLWPUj6M9kJP0Ybuzqx5H3Y7A26sfo92OoW7hXpBqZu4/i8d1Z2oxGHCNtE5Z+DOd1JmI/Bto2UOyZ99WPoWMZbhv1Y/T7Mai8G24aC2bW4EewWmLYjJd+wPjpi/oRLurH0MZTJTGaVuc6gFEyXvoB46cv6ke4qB9DUCUhIiJZqZIQEZGslCRERCSrcZ8kzOwmM2sys/VH8NyTzGydmW02s2+ZmWXsu9zMNpnZc2b2ldGNesBYRr0fZvY5M9thZs+kb+eNfuSHxRLI55Hef5WZuZnVjl7EWWMJ4vP4gpmtTX8WvzazWaMf+WGxBNGPr5rZxnRffmFmgV9ILKB+vCf9+500s0Ant48m/iyv90EzezF9+2DG9kF/hwYU1LG1YbkBZwInAuuP4LlPAKeRWtH2fuDt6e1nAb8FStKPp+VpPz4HXJXvn0d631xSS8C8AtTmYz+Aqow2nwBuyNN+nAMUpu9/GfhynvbjNcCxwENAfRjjT8e2oN+2GmBL+t8p6ftTBuvrYLdxX0m4+8NAS+Y2M1tsZr8yszVm9oiZHdf/eWY2k9Qv7WOe+t/9EfCu9O6PA19y9570ezQF24vA+jHmAuzHN4B/IPty9KMqiH64e3tG03LGoC8B9ePXnlq7DeBxUitAByqgfjzv7puCjv1o4s/ibcBv3L3F3fcDvwHOPdLvgnGfJLJYDVzu7icBVwHfHaDNbFLLnffJvGjSMcAbzexPZva/ZnZyoNFmdx2XNxcAAAV5SURBVLT9ALgsPSxwk5lNCS7UQR1VPyy1YOQOd3826ECHcNSfh5l90cy2A+8HriY3RuPnqs9HSP3Fmguj2Y9cGE78A8l2wbcj6mugC/yFkZlVAH8G/CxjOK5koKYDbOv7y66QVBl3KnAycLuZLUpn5zExSv34HvCF9OMvAF8j9Us9Zo62H2ZWBnyW1BBHzozS54G7fxb4rJn9E6lVlK8Z5VAHNVr9SL/WZ4E48JPRjHE4RrMfuTBY/Gb2YeDv09uWAPeZWS/wsrtfSPY+HVFfJ1ySIFU9tbr78ZkbzSwCrEk/vIvUF2hmmZx50aRG4OfppPCEmSVJLbA1ltdWPep+uPuejOfdCNwTZMBZHG0/FpO69O2z6V+mOcBTZnaKu+8OOPZMo/FzlemnwL2McZJglPqRnix9J/DmsfzjKcNofx5jbcD4Adz9ZuBmADN7CPiQu2/NaNIIrMx4PIfU3EUjR9LXICdjwnIDFpAxIQQ8Crwnfd+AFVme9ySpaqFvkue89Pa/Aa5N3z+GVGlnediPmRltriR1XY+8+zz6tdnKGExcB/R5LM1oczmpS/rmYz/OBTYAdWMRf9A/V4zBxPWRxk/2ieuXSY12TEnfrxlOXweMayw/xFzcSF0WdRcQI5VJP0rqL89fAc+mf5ivzvLcemA98BJwPYfOUC8Gfpze9xRwdp7247+AdcBaUn9VzczHfvRrs5WxObopiM/jzvT2taQWb5udp/3YTOoPp2fSt7E4SiuIflyYfq0eYA/wQNjiZ4Akkd7+kfTnsBn48Eh+h/rftC
yHiIhkNVGPbhIRkWFQkhARkayUJEREJCslCRERyUpJQkREslKSkHHBzA6M8fv9wMyWjdJrJSy18ut6M7t7qFVTzazazP52NN5bZCg6BFbGBTM74O4Vo/h6hX5okbpAZcZuZj8EXnD3Lw7SfgFwj7u/bizik4lNlYSMW2ZWZ2Z3mtmT6dvp6e2nmNmjZvZ0+t9j09s/ZGY/M7O7gV+b2Uoze8jM7rDU9RF+0rf+fnp7ffr+gfTCfM+a2eNmNj29fXH68ZNmdu0wq53HOLRwYYWZ/c7MnrLUNQAuSLf5ErA4XX18Nd320+n3WWtmnx/F/0aZ4JQkZDz7D+Ab7n4ycBHwg/T2jcCZ7n4CqZVW/y3jOacBH3T3s9OPTwCuAJYBi4DTB3ifcuBxd18BPAz8dcb7/0f6/YdcIye9rtCbSZ39DhAFLnT3E0ldw+Rr6ST1GeAldz/e3T9tZucAS4FTgOOBk8zszKHeT2Q4JuICfzJxvAVYlrGKZpWZVQKTgR+a2VJSq2AWZTznN+6eua7/E+7eCGBmz5BaX+cP/d6nl0OLI64B3pq+fxqH1uv/KfDvWeIszXjtNaTW/4fU+jr/lv7CT5KqMKYP8Pxz0ren048rSCWNh7O8n8iwKUnIeFYAnObu3ZkbzezbwIPufmF6fP+hjN2d/V6jJ+N+goF/Z2J+aHIvW5vBdLv78WY2mVSy+TvgW6SuKVEHnOTuMTPbCkwa4PkGXOfu3x/h+4oMScNNMp79mtQ1GQAws75llycDO9L3PxTg+z9OapgL4OKhGrt7G6nLll5lZkWk4mxKJ4izgPnpph1AZcZTHwA+kr4GAWY228ymjVIfZIJTkpDxoszMGjNunyT1hVufnszdQGqJd4CvANeZ2R+BSIAxXQF80syeAGYCbUM9wd2fJrXq58WkLtZTb2YNpKqKjek2+4A/pg+Z/aq7/5rUcNZjZrYOuINXJxGRI6ZDYEUCkr5qXre7u5ldDKxy9wuGep5ImGhOQiQ4JwHXp49IamWMLw0rMhpUSYiISFaakxARkayUJEREJCslCRERyUpJQkREslKSEBGRrP4PPOmZ2fnnxn0AAAAASUVORK5CYII=\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "learn_c.recorder.plot()" ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [], "source": [ "lr = 2e-2\n", "lr *= bs/48\n", "\n", "wd = 0.01" ] }, { "cell_type": "code", "execution_count": 76, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
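<table><thead><tr><th>epoch</th><th>train_loss</th><th>valid_loss</th><th>accuracy</th><th>f1</th><th>time</th></tr></thead><tbody>
<tr><td>0</td><td>0.501302</td><td>0.387383</td><td>0.823348</td><td>0.885338</td><td>02:57</td></tr>
<tr><td>1</td><td>0.523619</td><td>0.373636</td><td>0.802974</td><td>0.871117</td><td>03:04</td></tr></tbody></table>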
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "learn_c.fit_one_cycle(2, lr, wd=wd, moms=(0.8,0.7))" ] }, { "cell_type": "code", "execution_count": 77, "metadata": {}, "outputs": [], "source": [ "learn_c.save(f'{lang}clas1_sp15_bwd')" ] }, { "cell_type": "code", "execution_count": 78, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
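<table><thead><tr><th>epoch</th><th>train_loss</th><th>valid_loss</th><th>accuracy</th><th>f1</th><th>time</th></tr></thead><tbody>
<tr><td>0</td><td>0.468178</td><td>0.407151</td><td>0.740550</td><td>0.822256</td><td>02:52</td></tr>
<tr><td>1</td><td>0.457562</td><td>0.391507</td><td>0.795084</td><td>0.865942</td><td>03:05</td></tr></tbody></table>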
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "learn_c.load(f'{lang}clas1_sp15_bwd');\n", "learn_c.fit_one_cycle(2, lr, wd=wd, moms=(0.8,0.7))" ] }, { "cell_type": "code", "execution_count": 79, "metadata": {}, "outputs": [], "source": [ "learn_c.save(f'{lang}clas2_sp15_bwd')" ] }, { "cell_type": "code", "execution_count": 80, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
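<table><thead><tr><th>epoch</th><th>train_loss</th><th>valid_loss</th><th>accuracy</th><th>f1</th><th>time</th></tr></thead><tbody>
<tr><td>0</td><td>0.391392</td><td>0.283015</td><td>0.895570</td><td>0.936219</td><td>03:39</td></tr>
<tr><td>1</td><td>0.363020</td><td>0.253934</td><td>0.898908</td><td>0.938087</td><td>03:27</td></tr></tbody></table>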
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "learn_c.load(f'{lang}clas2_sp15_bwd');\n", "learn_c.freeze_to(-2)\n", "learn_c.fit_one_cycle(2, slice(lr/(2.6**4),lr), wd=wd, moms=(0.8,0.7))" ] }, { "cell_type": "code", "execution_count": 81, "metadata": {}, "outputs": [], "source": [ "learn_c.save(f'{lang}clas3_sp15_bwd')" ] }, { "cell_type": "code", "execution_count": 82, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
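<table><thead><tr><th>epoch</th><th>train_loss</th><th>valid_loss</th><th>accuracy</th><th>f1</th><th>time</th></tr></thead><tbody>
<tr><td>0</td><td>0.332751</td><td>0.218908</td><td>0.912043</td><td>0.946311</td><td>04:13</td></tr>
<tr><td>1</td><td>0.285485</td><td>0.208085</td><td>0.915641</td><td>0.948510</td><td>04:21</td></tr></tbody></table>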
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "learn_c.load(f'{lang}clas3_sp15_bwd');\n", "learn_c.freeze_to(-3)\n", "learn_c.fit_one_cycle(2, slice(lr/2/(2.6**4),lr/2), wd=wd, moms=(0.8,0.7))" ] }, { "cell_type": "code", "execution_count": 83, "metadata": {}, "outputs": [], "source": [ "learn_c.save(f'{lang}clas4_sp15_bwd')" ] }, { "cell_type": "code", "execution_count": 84, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
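<table><thead><tr><th>epoch</th><th>train_loss</th><th>valid_loss</th><th>accuracy</th><th>f1</th><th>time</th></tr></thead><tbody>
<tr><td>0</td><td>0.254880</td><td>0.205224</td><td>0.922057</td><td>0.952664</td><td>04:55</td></tr></tbody></table>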
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "learn_c.load(f'{lang}clas4_sp15_bwd');\n", "learn_c.unfreeze()\n", "learn_c.fit_one_cycle(1, slice(lr/10/(2.6**4),lr/10), wd=wd, moms=(0.8,0.7))" ] }, { "cell_type": "code", "execution_count": 85, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
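<table><thead><tr><th>epoch</th><th>train_loss</th><th>valid_loss</th><th>accuracy</th><th>f1</th><th>time</th></tr></thead><tbody>
<tr><td>0</td><td>0.273147</td><td>0.203318</td><td>0.931810</td><td>0.958756</td><td>05:32</td></tr></tbody></table>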
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "learn_c.load(f'{lang}clas4_sp15_bwd');\n", "learn_c.unfreeze()\n", "learn_c.fit_one_cycle(1, slice(lr/10/(2.6**4),lr/10), wd=wd, moms=(0.8,0.7))" ] }, { "cell_type": "code", "execution_count": 86, "metadata": {}, "outputs": [], "source": [ "learn_c.save(f'{lang}clas5_sp15_bwd')" ] }, { "cell_type": "code", "execution_count": 87, "metadata": {}, "outputs": [], "source": [ "learn_c.load(f'{lang}clas5_sp15_bwd')\n", "learn_c.save(f'{lang}clas_sp15_bwd')" ] }, { "cell_type": "code", "execution_count": 88, "metadata": {}, "outputs": [], "source": [ "learn_c.load(f'{lang}clas_sp15_bwd');\n", "learn_c.to_fp32().export(f'{lang}_classifier_sp15_bwd')" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true }, "source": [ "### Confusion matrix" ] }, { "cell_type": "code", "execution_count": 89, "metadata": { "hidden": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 16 s, sys: 1.11 s, total: 17.1 s\n", "Wall time: 15.2 s\n" ] } ], "source": [ "%%time\n", "data_clas = load_data(path, f'{lang}_textlist_class_sp15_bwd', bs=bs, num_workers=1, backwards=True)\n", "\n", "config = awd_lstm_clas_config.copy()\n", "config['qrnn'] = True\n", "\n", "learn_c = text_classifier_learner(data_clas, AWD_LSTM, config=config, drop_mult=0.5, \n", " metrics=[accuracy,f1])\n", "learn_c.load_encoder(f'{lang}fine_tuned_enc_sp15_bwd');\n", "\n", "learn_c.load(f'{lang}clas_sp15_bwd');\n", "\n", "# put weight on cpu\n", "loss_weights = torch.FloatTensor(trn_weights).cpu()\n", "learn_c.loss_func = partial(F.cross_entropy, weight=loss_weights)" ] }, { "cell_type": "code", "execution_count": 90, "metadata": { "hidden": true }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "preds,y,losses = learn_c.get_preds(with_loss=True)\n", "predictions = np.argmax(preds, axis = 1)\n", "\n", "interp = ClassificationInterpretation(learn_c, preds, y, losses)\n", "interp.plot_confusion_matrix()" ] }, { "cell_type": "code", "execution_count": 91, "metadata": { "hidden": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[ 2318 248]\n", " [ 1324 19178]]\n", "accuracy global: 0.9318536500780302\n", "accuracy on negative reviews: 90.33515198752923\n", "accuracy on positive reviews: 93.54209345429713\n" ] } ], "source": [ "from sklearn.metrics import confusion_matrix\n", "cm = confusion_matrix(np.array(y), np.array(predictions))\n", "print(cm)\n", "\n", "## acc\n", "print(f'accuracy global: {(cm[0,0]+cm[1,1])/(cm[0,0]+cm[0,1]+cm[1,0]+cm[1,1])}')\n", "\n", "# acc neg, acc pos\n", "print(f'accuracy on negative reviews: {cm[0,0]/(cm[0,0]+cm[0,1])*100}') \n", "print(f'accuracy on positive reviews: {cm[1,1]/(cm[1,0]+cm[1,1])*100}')" ] }, { "cell_type": "code", "execution_count": 92, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
<table><thead><tr><th>text</th><th>target</th><th>prediction</th></tr></thead><tbody>
<tr><td>on z ur ▁c ▁xxmaj ord ▁cliff ▁xxmaj 39 : ▁7 po ▁trop ▁non ▁ma allegro ▁ ▁xxmaj . 4 . ▁50 :14 ▁14 ) za ez cat li ▁de ▁con ce viva ▁ allegro ▁( zo cher s ▁ ▁xxmaj ▁3. . 49 ▁ 17 : ▁9 to tenu s ▁so ante ▁and ▁xxmaj ▁2. . 48 ▁ 12 : ▁13 to ra ▁mode to mol ▁ ▁xxmaj ▁1.</td><td>pos</td><td>pos</td></tr>
<tr><td>▁m : \" ▁figaro ▁xxmaj ▁of ▁marriage ▁xxmaj the ▁\" overture ▁ ▁xxmaj ) ▁1968 , ▁york ▁xxmaj ▁new ▁xxmaj , ▁5 arch m ▁( ] 77 disc ▁[ ▁philharmonic ▁xxmaj ▁york ▁xxmaj ▁new ▁xxmaj ] ing play ▁[ ri ra ▁fer ▁xxmaj ▁= ▁w : \" ▁madonna ▁xxmaj ▁of wel je ▁\" interlude ▁ ▁- ) ▁1968 , ▁2 ary u br ▁fe ▁xxmaj known un ▁( ) ▁hall</td><td>pos</td><td>pos</td></tr>
<tr><td>▁\" ▁\\ ▁name ▁xxmaj ▁my ▁xxmaj ▁know ▁xxmaj you ▁\" ▁\\ ▁pour ▁visuel a ▁médi ▁autre ▁un ▁ou ▁télévision ▁la , ▁cinéma ▁le ▁pour ▁écrite ▁chanson ▁meilleure ▁: ▁2008 ▁awards ▁xxmaj ▁grammy ▁xxmaj d air ▁b ▁xxmaj ▁stuart ▁xxmaj ▁pour ▁dramatique film un ' ▁d ▁montage ▁meilleur ▁: ▁2007 ▁awards ▁xxmaj eddie ▁ ▁xxmaj our ▁park suite pour - ▁course ▁la ▁pour ▁scène ▁meilleure ▁: ▁2007 ▁awards ▁xxmaj empire ▁</td><td>pos</td><td>pos</td></tr>
<tr><td>) . ▁inconnu asin ▁ ▁xxup ▁numéro , ▁introuvable , ▁là ▁de ▁partir ▁à ▁ensuite z ▁recherche ▁le ▁vous ▁si , ▁mais asin ▁ ▁xxup ▁numéro ▁un ▁article ▁cet ▁pour indique ▁ ▁nous ▁on , ▁surcroît ▁par ▁xxmaj ! bin ja r k s ▁ ▁xxmaj ▁carrément , ▁aussi écrit ' s ▁ ▁cela , ▁à ▁échappé ▁même ▁quand ▁a ▁on ▁mais ... é ▁francis ▁non , bin ria ▁sc</td><td>pos</td><td>pos</td></tr>
<tr><td>. ▁morricone ▁xxmaj ennio ▁ ▁xxmaj ▁par ▁musique ▁en ▁mis , hara ▁sa ▁xxmaj ▁the ▁of ▁secret ▁xxmaj ▁the ▁xxmaj , ▁1988 ▁de ▁téléfilm un ' ▁d ▁thème ▁un ▁reprend ▁scène ▁même ▁cette ▁de ▁fin ▁la ▁à vient ▁inter ▁qui ▁lent ▁morceau ▁le , ▁enfin ▁xxmaj . ▁scott ▁xxmaj ▁ridley ▁xxmaj ▁de ▁préféré ▁films ▁des un ' ▁l , lou ou z ▁ ▁xxmaj ▁film ▁au ▁emprunté ▁est ouverture '</td><td>pos</td><td>pos</td></tr></tbody></table>
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "learn_c.show_results()" ] }, { "cell_type": "code", "execution_count": 93, "metadata": { "hidden": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "neg tensor([0.9985, 0.0015])\n" ] } ], "source": [ "# Trying out some random sentences I made up\n", "\n", "review = 'Ce produit est bizarre.'\n", "pred = learn_c.predict(review)\n", "print(pred[0], pred[2])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Ensemble" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "bs = 18" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "config = awd_lstm_clas_config.copy()\n", "config['qrnn'] = True" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "data_clas = load_data(path, f'{lang}_textlist_class_sp15', bs=bs, num_workers=1)\n", "learn_c = text_classifier_learner(data_clas, AWD_LSTM, config=config, drop_mult=0.5, metrics=[accuracy,f1]).to_fp16()\n", "learn_c.load(f'{lang}clas_sp15', purge=False);" ] }, { "cell_type": "code", "execution_count": 97, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(tensor(0.9351), tensor(0.9624))" ] }, "execution_count": 97, "metadata": {}, "output_type": "execute_result" } ], "source": [ "preds,targs = learn_c.get_preds(ordered=True)\n", "accuracy(preds,targs),f1(preds,targs)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data_clas_bwd = load_data(path, f'{lang}_textlist_class_sp15_bwd', bs=bs, num_workers=1, backwards=True)\n", "learn_c_bwd = text_classifier_learner(data_clas_bwd, AWD_LSTM, config=config, drop_mult=0.5, metrics=[accuracy,f1]).to_fp16()\n", "learn_c_bwd.load(f'{lang}clas_sp15_bwd', purge=False);" ] }, { "cell_type": "code", "execution_count": 99, "metadata": {}, "outputs": [ { "data": { "text/plain": [ 
"(tensor(0.9318), tensor(0.9606))" ] }, "execution_count": 99, "metadata": {}, "output_type": "execute_result" } ], "source": [ "preds_b,targs_b = learn_c_bwd.get_preds(ordered=True)\n", "accuracy(preds_b,targs_b),f1(preds_b,targs_b)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "preds_avg = (preds+preds_b)/2" ] }, { "cell_type": "code", "execution_count": 101, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(tensor(0.9370), tensor(0.9636))" ] }, "execution_count": 101, "metadata": {}, "output_type": "execute_result" } ], "source": [ "accuracy(preds_avg,targs_b),f1(preds_avg,targs_b)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import confusion_matrix\n", "\n", "predictions = np.argmax(preds_avg, axis = 1)\n", "cm = confusion_matrix(np.array(targs_b), np.array(predictions))\n", "print(cm)\n", "\n", "## acc\n", "print(f'accuracy global: {(cm[0,0]+cm[1,1])/(cm[0,0]+cm[0,1]+cm[1,0]+cm[1,1])}')\n", "\n", "# acc neg, acc pos\n", "print(f'accuracy on negative reviews: {cm[0,0]/(cm[0,0]+cm[0,1])*100}') \n", "print(f'accuracy on positive reviews: {cm[1,1]/(cm[1,0]+cm[1,1])*100}')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python [conda env:root] *", "language": "python", "name": "conda-root-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" } }, "nbformat": 4, "nbformat_minor": 2 }