{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# doc2vec\n", "\n", "This is an experimental code developed by Tomas Mikolov found and [published on the word2vec Google group](https://groups.google.com/d/msg/word2vec-toolkit/Q49FIrNOQRo/J6KG8mUj45sJ).\n", "\n", "The input format for `doc2vec` is still one big text document but every line should be one document prepended with an unique id, for example:\n", "\n", "```\n", "_*0 This is sentence 1\n", "_*1 This is sentence 2\n", "```\n", "\n", "### Requirements\n", "\n", "This notebook requires [`nltk`](http://www.nltk.org/)\n", "\n", "Download some data: http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz.\n", "\n", "You could use `make test-data` from the root of the repo.\n", "\n", "## Preprocess\n", "\n", "Merge data into one big document with an id per line and do some basic preprocessing: word tokenizer." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import os\n", "import nltk" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "directories = ['train/pos', 'train/neg', 'test/pos', 'test/neg', 'train/unsup']" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "input_file = open('../data/alldata.txt', 'w')" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "id_ = 0\n", "for directory in directories:\n", " rootdir = os.path.join('../data/aclImdb', directory)\n", " for subdir, dirs, files in os.walk(rootdir):\n", " for file_ in files:\n", " with open(os.path.join(subdir, file_), \"r\") as f:\n", " doc_id = \"_*%i\" % id_\n", " id_ = id_ + 1\n", "\n", " text = f.read()\n", " text = text\n", " tokens = nltk.word_tokenize(text)\n", " doc = \" \".join(tokens).lower()\n", " doc = doc.encode(\"ascii\", \"ignore\")\n", " input_file.write(\"%s %s\\n\" % (doc_id, doc))" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "input_file.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## doc2vec" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import word2vec" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "word2vec.doc2vec('../data/alldata.txt', '../data/doc2vec-vectors.bin', cbow=0, size=100, window=10, negative=5,\n", " hs=0, sample='1e-4', threads=12, iter_=20, min_count=1, binary=True, verbose=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prediction\n", "\n", "Is possible to load the vectors using the same wordvectors class as a regular word2vec binary file." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import word2vec" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "model = word2vec.load('../data/doc2vec-vectors.bin')" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(363652, 100)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.vectors.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The documents vector are going to be identified by the id we used in the preprocesing section, for example document 1 is going to have vector:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 0.15458781, 0.05329125, -0.03372733, -0.03253617, -0.00038198,\n", " 0.01859742, 0.05822669, 0.08649707, 0.07116864, 0.01571589,\n", " 0.11401868, 0.05802745, -0.03446389, -0.05136882, -0.03741959,\n", " 0.03825757, 0.03826223, 0.06828724, 0.05144465, -0.02519033,\n", " -0.02553019, -0.13064705, -0.01241815, 0.06572731, 0.0361409 ,\n", " 0.06780364, -0.11406382, 0.03698812, 0.01645933, 0.04006057,\n", " -0.06411161, 0.10451728, -0.09016737, 0.0612823 , 0.05084847,\n", " 0.05336419, 0.13584702, -0.05651987, 0.01329237, -0.20875289,\n", " -0.08468074, 0.10479708, 0.05787989, 0.13176955, -0.01580021,\n", " 0.07367507, -0.09352259, -0.13971043, -0.24871282, -0.14094608,\n", " 0.04577528, 0.12431064, -0.05354084, 0.14227368, -0.11196741,\n", " 0.03350607, -0.01999633, -0.07860398, 0.02664598, -0.0592974 ,\n", " -0.07686727, -0.05597671, -0.06345078, -0.27290845, -0.06400879,\n", " 0.00619202, -0.01699482, -0.05217196, 0.17153341, 0.12772463,\n", " -0.02238818, 0.09656674, 0.09250097, -0.18347315, -0.15046608,\n", " -0.04235512, -0.15838784, 0.03597225, -0.01584246, 0.00568255,\n", " 0.0409179 , 0.02537416, 0.06123869, 0.0961483 , -0.12116033,\n", " 0.07526031, 0.0303142 , -0.01561208, 0.36425424, -0.16918449,\n", " 0.07423471, 0.14127608, 0.00530258, -0.10363475, 0.00953067,\n", " -0.04992943, 0.1713592 , 0.21640711, -0.00298326, 0.05985325])" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model['_*1']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can ask for similarity words or documents on document `1`" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "indexes, metrics = model.similar('_*1')" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('_*2351', 0.8519178427298584),\n", " ('imagery.an', 0.8397190257038456),\n", " ('_*58565', 0.8342486893807792),\n", " ('_*35569', 0.8274130976031872),\n", " ('_*79830', 0.8268153781216658),\n", " ('_*37088', 0.8245519264554979),\n", " ('b\"witness', 0.8239716809480113),\n", " (\"b'ghost\", 0.822800405312356),\n", " (\"l'anticristo\", 0.8222915811568171),\n", " ('_*38763', 0.8189633692215181)]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.generate_response(indexes, metrics).tolist()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now its we just need to matching the id to the data created on the preprocessing step" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, 
"outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }