{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Can we create our own text vectorizer?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "While I was working with Medium project, I faced the problem of memory lack when I tried to create 1-4 ngramms. It was interesting to see how larger ngramms would work. But my PC (32GB RAM) can't perfom the task of creating 1-4 ngramms with CounterVectorizer." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So I was wondering how I can create ngramms of (almost) any length and any size or can we customize by ourselves what part of the sentence, text, article we can choose to create bag of words and etc." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " I`ve decided to use dictionaries as intermediate buffers between our any-sized ngramms and sparse matrix. Let see how it works. First of all we create toy list of two sentence to check if our script works properly.\t" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import itertools\n", "import os\n", "import pickle\n", "import random\n", "import re\n", "from collections import Counter, defaultdict\n", "\n", "import numpy as np\n", "import pandas as pd\n", "from scipy import sparse\n", "\n", "PATH = \"/any_path_to_data_folder/\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "texts = [\n", " \"joe lives in the center of london since his birth\",\n", " \"dann loves his job because his office is right in the center of new york\",\n", "]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lets create dictionaries with unique words for each text in texts. We will need the 'ngramm' variable which will define the amount of words in phrase. For example, \"ngramm = 2\" will give us such phrases from first sentence: 'joe', 'joe lives', 'lives', 'lives in', 'in', 'in the', etc. And using \"ngramm = 3\" we`l recieve such combinations: 'joe', 'joe lives', 'joe lives in', 'lives', 'lives in', 'lives in the', etc. Also we will add \"ntop\" variable which we will use later." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ngramm = 2\n", "ntop = 100\n", "word_stats = defaultdict(int)\n", "for i in range(len(texts)):\n", " word_stats[i] = defaultdict(int) # create sub-dictionary for current text\n", " text = texts[i]\n", " words = text.split() # split the text by every word.\n", " for n in range(ngramm):\n", " phrases = [\n", " \" \".join(word for word in words[i : i + (n + 1)])\n", " for i in range(len(words) - n)\n", " ]\n", " print(\n", " phrases\n", " ) # as we see we`re getting the desired result: the list of n-words phrases\n", " for phrase in phrases:\n", " word_stats[i][\n", " phrase\n", " ] += 1 # count all phrases in current text and add them to appropriate dictionary\n", "print(\"************ dictionary with counted phrases ************\")\n", "print(word_stats)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To use our words and phrases as bag of words in regression models we have to convert our dictionary to sparse matrix. For those who are not familiar with this type of matrix I would recommend to read this article: . But in any case we will remind ourselves why we are using sparse instead dense matrix especially with big data. Let's create dataframe where rows are the texts and columns - phrases as features. Values in cells - number of appearance of each phrase in particular text. 
But first, we have to merge all the per-text sub-dictionaries into one dictionary with the unique phrases and the sum of their occurrences over all texts. The current approach (using lists) isn't optimal, but we can compare execution times later." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "unique_ngrams = defaultdict(int)\n", "for i in range(len(texts)):\n", " cur_dic = word_stats[i]\n", " for phrase in cur_dic.keys():\n", " unique_ngrams[phrase] += cur_dic[phrase]\n", "print(unique_ngrams) # check all records in the merged dictionary\n", "\n", "df_feat_col = list(unique_ngrams.keys())\n", "df_feat_values = []\n", "for i in range(len(texts)):\n", " cur_text_values = []\n", " for col in df_feat_col:\n", " cur_text_values.append(word_stats[i][col])\n", " df_feat_values.append(cur_text_values)\n", "\n", "df_feat = pd.DataFrame(df_feat_values, columns=unique_ngrams.keys())\n", "print(df_feat[[\"his\", \"center\", \"job\", \"york\", \"in\"]])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So far our script works fine and we can build the feature table. But what about memory usage? To check, we'll load a database of film reviews and wrap the code in functions for convenience." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# You can find the data at this link: https://drive.google.com/file/d/1zvCa27XOuLyGAzYOeLGHfaccmK2llkca/view?usp=sharing\n", "with open(PATH + \"reviews\", \"rb\") as fb:\n", " new_texts = pickle.load(fb)\n", "print(len(new_texts))\n", "# we'll use 1/10 of all reviews, or 1250 records\n", "new_texts = [re.sub(r\"[^\\w\\s]\", \"\", str(x).lower()) for x in new_texts[:1250]]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def create_text_dicts(texts, ngram=2):\n", " word_stats = defaultdict(int)\n", " for i in range(len(texts)):\n", " word_stats[i] = defaultdict(int)\n", " text = texts[i]\n", " words = text.split()\n", " for n in range(ngram):\n", " phrases = [\n", " \" \".join(word for word in words[i : i + (n + 1)])\n", " for i in range(len(words) - n)\n", " ]\n", " for phrase in phrases:\n", " word_stats[i][phrase] += 1\n", " return word_stats\n", "\n", "\n", "def create_unique_dict(word_stats, texts):\n", " unique_ngrams = defaultdict(int)\n", " for i in range(len(texts)):\n", " cur_dic = word_stats[i]\n", " for phrase in cur_dic.keys():\n", " unique_ngrams[phrase] += cur_dic[phrase]\n", "\n", " return unique_ngrams\n", "\n", "\n", "def create_feat_values(word_stats, unique_ngrams, texts):\n", " df_feat_col = list(unique_ngrams.keys())\n", " df_feat_values = []\n", " for i in range(len(texts)):\n", " cur_text_values = [word_stats[i][col] for col in df_feat_col]\n", " df_feat_values.append(cur_text_values)\n", "\n", " return df_feat_col, df_feat_values" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "word_stats = create_text_dicts(new_texts, ngram=2)\n", "unique_ngrams = create_unique_dict(word_stats, new_texts)\n", "features, feat_values = create_feat_values(word_stats, unique_ngrams, new_texts)\n", "print(len(features), np.asarray(feat_values).shape)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Here is a screenshot of memory usage on my PC." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see, this approach to building the feature matrix runs out of memory on a large database. 
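If you want a number rather than a screenshot, a rough in-notebook estimate of how much the dense feature values take (using the variables from the cell above) is:\n\n```python\nimport numpy as np\n\n# note: np.asarray allocates the dense array once more, so run this only if memory allows\nprint(np.asarray(feat_values).nbytes / 1024 ** 3, \"GiB\")\n```\n\n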
Just 1/10 of the database consumed about 1/4 of the available memory. Let's look at our matrix as a dataframe and compute its sparsity, in other words the percentage of cells that are zero." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "feat_table = pd.DataFrame(feat_values, columns=features)\n", "print(feat_table.iloc[:5, :10]) # print only a small part of the matrix to avoid a crash\n", "\n", "count_zeros = feat_table.isin([0.0]).sum(\n", " axis=0\n", ") # count how many zeros are present in each feature column\n", "print(\"**********\")\n", "print(count_zeros.head())\n", "\n", "sparsity = np.sum(count_zeros) / (len(features) * len(feat_values)) * 100\n", "print(\"**********\")\n", "print(\"Sparsity: \" + str(np.round(sparsity, 2)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see, our feature table consists almost entirely of zeros, and all this useless information occupies a lot of memory. We can fix this by converting the dense matrix into a sparse one and comparing the memory used afterwards." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sparse_feat_matrix = sparse.csr_matrix(feat_table.values)\n", "print(\"Dense matrix memory: \" + str(feat_table.values.nbytes))\n", "print(\"Sparse matrix memory: \" + str(sparse_feat_matrix.data.nbytes))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We cut about 99% of the memory used by the dense matrix. Still, our script is not efficient, because we first build the full dense feature matrix and only then convert it into a sparse one. So let's change the code so that a sparse matrix is created for each text (each iteration) and these small sparse matrices are stacked into the full sparse feature matrix." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def create_feat_values(word_stats, unique_ngrams, texts):\n", " feat_col = list(unique_ngrams.keys())\n", "\n", " for i in range(len(texts)):\n", " cur_text_values = [word_stats[i][col] for col in feat_col]\n", " cur_text_values = np.asarray(cur_text_values)\n", " if i == 0:\n", " # convert the first text's dense row to a sparse matrix\n", " texts_values = sparse.csr_matrix(cur_text_values)\n", " else:\n", " cur_text_values = sparse.csr_matrix(cur_text_values)\n", " # stack the current sparse row onto the full sparse feature matrix\n", " texts_values = sparse.vstack([texts_values, cur_text_values])\n", "\n", " return feat_col, texts_values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's check that our sparse matrix contains the right values, using the toy example." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "word_stats = create_text_dicts(texts, ngram=2)\n", "unique_ngrams = create_unique_dict(word_stats, texts)\n", "features, feat_values = create_feat_values(word_stats, unique_ngrams, texts)\n", "print(features)\n", "print(len(features))\n", "print(feat_values)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's check the word 'his': it occurs once in the first text (index 0 of texts) and twice in the second (index 1). In the features list this word is at index 8 (ninth in the row). And if we look through the sparse matrix, we find that the value for text index 0 and word index 8, i.e. (0, 8), equals 1, and for text index 1 and word index 8, i.e. (1, 8), it equals 2. 
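Rather than reading the printed (row, column) pairs by eye, we can also query the matrix directly, as a quick sanity check on the toy run above:\n\n```python\nidx = features.index(\"his\")  # column index of the word 'his'\nprint(idx, feat_values[0, idx], feat_values[1, idx])  # expected: 8 1 2\n```\n\n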
We also see that the features list has length 39 and the last column index in the sparse matrix is 38, so the two agree." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What benefits do we get from creating the sparse matrix this way? First of all, we can build a sparse matrix with almost any n-gram length. Using the sklearn vectorizers with long n-grams (more than 4) can overload your system's memory. Of course, this way of vectorizing can be much slower than the sklearn ones, but when you need 5- or 6-word n-grams, time matters less. The second, no less useful, benefit is the possibility to customize the vectorizer for specific needs. We can define the exact phrase lengths (ngram) we want: 1, 2, 3, 5, 10. Or, for example, we may want 2-gram phrases built not from adjacent words but from every second word (skipping the word between positions 1 and 3, 2 and 4, etc.): 'joe in', 'lives the', 'in center' and so on; a small sketch of this idea appears after the combined function below. Another situation: we have a database of movie reviews and 5 tags for classification: comedy, detective, horror, cartoon, action movie. We want 1-5 word n-gram phrases built not from the 100,000 most common words of all reviews, but from the 20,000 most common words for each tag. Or, going further, the number of top words for each tag can depend on the distribution of reviews per tag: if there are 50,000 reviews and 30,000 of them are about comedies, we'll want 60% (30,000 / 50,000) of all top words to come from comedy reviews. Let's combine our separate functions into one and add a bit of customization to how the bag of words is built." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def create_sparse_features(texts, target, ngram, ntop=np.inf, ntop_by_class=False):\n", " \"\"\"\n", " target: list of classes corresponding to texts. List\n", " \n", " ngram: list of phrase lengths to extract. List\n", " \n", " ntop: how many of the most frequent phrases to keep. Integer. The default\n", " means infinity, so the dictionary is not cut off by ntop. A finite ntop\n", " is expected when ntop_by_class is True.\n", "\n", " ntop_by_class: choose top words from every class according\n", " to its share of the data. 
True or False.\n", " \"\"\"\n", "\n", " word_stats = defaultdict(int)\n", " for i in range(len(texts)):\n", " word_stats[i] = defaultdict(int)\n", " text = texts[i]\n", " words = text.split()\n", " for n in ngram:\n", " # ngram is now a list of explicit lengths (e.g. [1, 2, 3]) instead of\n", " # range(n), so the slicing below changes slightly\n", " phrases = [\n", " \" \".join(word for word in words[i : i + n])\n", " for i in range(len(words) - n + 1)\n", " ]\n", " for phrase in phrases:\n", " word_stats[i][phrase] += 1\n", "\n", " unique_ngrams = defaultdict(int)\n", " if ntop_by_class:\n", " # count how many times each target value occurs\n", " target_count = Counter(target)\n", " for key in target_count.keys():\n", " # the share of the current target in the whole data\n", " cur_target_dist = target_count[key] / len(texts)\n", " # how many top words we'll take from this tag's texts\n", " cur_target_top_words = int(cur_target_dist * ntop)\n", " # create an empty dictionary for the current target\n", " cur_target_ngrams = defaultdict(int)\n", " # indexes of the texts belonging to the current target\n", " cur_target_ind = np.where(np.asarray(target) == key)[0]\n", " # fill the current target dictionary from the corresponding\n", " # per-text dictionaries and count the unique phrases\n", " for n in cur_target_ind:\n", " cur_dic = word_stats[n]\n", " for phrase in cur_dic.keys():\n", " cur_target_ngrams[phrase] += cur_dic[phrase]\n", " # if the current target dictionary is longer than cur_target_top_words,\n", " # sort it by count and keep only the most frequent phrases\n", " if len(cur_target_ngrams) > cur_target_top_words:\n", " cur_target_ngrams = dict(\n", " sorted(cur_target_ngrams.items(), key=lambda kv: kv[1])\n", " )\n", " cur_target_ngrams = dict(\n", " itertools.islice(\n", " cur_target_ngrams.items(),\n", " len(cur_target_ngrams) - cur_target_top_words,\n", " len(cur_target_ngrams),\n", " )\n", " )\n", " # merge the current target dictionary into the overall unique_ngrams dictionary\n", " unique_ngrams = {**unique_ngrams, **cur_target_ngrams}\n", "\n", " else:\n", " for i in range(len(texts)):\n", " cur_dic = word_stats[i]\n", " for phrase in cur_dic.keys():\n", " unique_ngrams[phrase] += cur_dic[phrase]\n", " if len(unique_ngrams) > ntop:\n", " unique_ngrams = dict(sorted(unique_ngrams.items(), key=lambda kv: kv[1]))\n", " unique_ngrams = dict(\n", " itertools.islice(\n", " unique_ngrams.items(), len(unique_ngrams) - ntop, len(unique_ngrams)\n", " )\n", " )\n", "\n", " feat_col = list(unique_ngrams.keys())\n", " for i in range(len(texts)):\n", " cur_text_phrases = [word_stats[i][col] for col in feat_col]\n", " cur_text_phrases = np.asarray(cur_text_phrases)\n", " if i == 0:\n", " feat_values = sparse.csr_matrix(cur_text_phrases)\n", " else:\n", " cur_feat_values = sparse.csr_matrix(cur_text_phrases)\n", " feat_values = sparse.vstack([feat_values, cur_feat_values])\n", "\n", " return feat_col, feat_values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's run our code with and without using the target distribution for the ntop words."
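] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A quick aside before the run: the skip-gram variant mentioned above (pairing each word with the word two positions ahead) is not implemented in create_sparse_features. A minimal, hypothetical sketch of that idea, using only the toy texts list from the beginning, could look like this:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# hypothetical helper, not used by create_sparse_features: 2-grams that skip one word\n", "def skip_bigrams(text, gap=2):\n", "    words = text.split()\n", "    # pair each word with the word 'gap' positions ahead: 'joe in', 'lives the', ...\n", "    return [words[i] + \" \" + words[i + gap] for i in range(len(words) - gap)]\n", "\n", "\n", "print(skip_bigrams(texts[0]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The same trick can be dropped into the phrase-building loop of the vectorizer if such features turn out to be useful. Now, back to the plan: running the vectorizer with and without the per-class top-word selection."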
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "# To run the vectorizer we need tags for our review database.\n", "tags = [\"comedy\", \"detective\", \"horror\", \"cartoon\", \"action\"]\n", "target = [tags[random.randrange(len(tags))] for item in range(len(new_texts))]\n", "ngram = [1, 2, 3, 5, 10]\n", "features, feat_values = create_sparse_features(\n", " new_texts, target, ngram, ntop=5000, ntop_by_class=False\n", ")\n", "print(feat_values.shape)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "# To run the vectorizer we need tags for our review database.\n", "tags = [\"comedy\", \"detective\", \"horror\", \"cartoon\", \"action\"]\n", "target = [tags[random.randrange(len(tags))] for item in range(len(new_texts))]\n", "ngram = [1, 2, 3, 5, 10]\n", "features, feat_values = create_sparse_features(\n", " new_texts, target, ngram, ntop=5000, ntop_by_class=True\n", ")\n", "print(feat_values.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The shape of sparse matrix after the second run of script is less then set ntop value. That`s why 2/3 of most popular words in texts are intersected within the tags. But let`s check memory usage - run script without setting ntop value and with ngram values as in dense matrix above." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "# To run the vectorizer we need tags for our review database.\n", "tags = [\"comedy\", \"detective\", \"horror\", \"cartoon\", \"action\"]\n", "target = [tags[random.randrange(len(tags))] for item in range(len(new_texts))]\n", "ngram = [1, 2]\n", "features, feat_values = create_sparse_features(new_texts, target, ngram)\n", "print(feat_values.shape)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Here is the second screenshot of memory usage at my PC.\n", " \n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Unfortunately our idea that creating sparse matrix while looping and combining them in big one is not working properly. As we see the memory usage is the same as with dense matrix while looping and only after it the memory is cut off to the size of final sparse matrix. What can we do with this? Let's try to optimize our script. At first, we will not use every text`s dictionaries while creating sparse matrix. Instead of this we will create phrases once more, check if they are in unique dictionary, count their accurancies and after that create sparse matrix using numpy zeroes matrix. At the second we will save each sparse matrix to the hard drive and then read them and concatenate in one matrix." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def create_sparse_features(texts, target, ngram, ntop=np.inf, ntop_by_class=False):\n", " \"\"\"\n", " target: list of classes according to texts. List\n", " \n", " ngram: the list of lengths for desired phrases. List \n", " \n", " ntop: how many most popular words we want to use. Integer. Default value \n", " means infiniry, that`s why the dictionary isn`t cut off by ntop value./. \n", "\n", " ntop_by_class: choose top words from every class according \n", " to their distribution. 
True or False.\n", " \"\"\"\n", "\n", " word_stats = defaultdict(int)\n", " for i in range(len(texts)):\n", " word_stats[i] = defaultdict(int)\n", " text = texts[i]\n", " words = text.split()\n", " for n in ngram:\n", " phrases = [\n", " \" \".join(word for word in words[i : i + n])\n", " for i in range(len(words) - n + 1)\n", " ]\n", " for phrase in phrases:\n", " word_stats[i][phrase] += 1\n", "\n", " unique_ngrams = defaultdict(int)\n", " if ntop_by_class:\n", "\n", " target_count = Counter(target)\n", " for key in target_count.keys():\n", " cur_target_dist = target_count[key] / len(texts)\n", " cur_target_top_words = int(cur_target_dist * ntop)\n", " cur_target_ngrams = defaultdict(int)\n", " cur_target_ind = np.where(np.asarray(target) == key)[0]\n", "\n", " for n in cur_target_ind:\n", " cur_dic = word_stats[n]\n", " for phrase in cur_dic.keys():\n", " cur_target_ngrams[phrase] += cur_dic[phrase]\n", "\n", " if len(cur_target_ngrams) > cur_target_top_words:\n", " cur_target_ngrams = dict(\n", " sorted(cur_target_ngrams.items(), key=lambda kv: kv[1])\n", " )\n", " cur_target_ngrams = dict(\n", " itertools.islice(\n", " cur_target_ngrams.items(),\n", " len(cur_target_ngrams) - cur_target_top_words,\n", " len(cur_target_ngrams),\n", " )\n", " )\n", " unique_ngrams = {**unique_ngrams, **cur_target_ngrams}\n", "\n", " else:\n", " for i in range(len(texts)):\n", " cur_dic = word_stats[i]\n", " for phrase in cur_dic.keys():\n", " unique_ngrams[phrase] += cur_dic[phrase]\n", " if len(unique_ngrams) > ntop:\n", " unique_ngrams = dict(sorted(unique_ngrams.items(), key=lambda kv: kv[1]))\n", " unique_ngrams = dict(\n", " itertools.islice(\n", " unique_ngrams.items(), len(unique_ngrams) - ntop, len(unique_ngrams)\n", " )\n", " )\n", " # map each unique phrase to its column index\n", " all_phrases = unique_ngrams.keys()\n", " all_phrases_dic_ind = defaultdict(int)\n", " for key in unique_ngrams.keys():\n", " all_phrases_dic_ind[key] = len(all_phrases_dic_ind)\n", "\n", " # make sure the output folder for the per-text sparse rows exists\n", " os.makedirs(PATH + \"/sparses/\", exist_ok=True)\n", " # create a sparse matrix for each text\n", " for i in range(len(texts)):\n", " # zero-pad the file name so the files sort in text order\n", " name = \"0\" * (len(str(len(texts))) - len(str(i))) + str(i)\n", " text = texts[i]\n", " # split the current text into words\n", " words = text.split()\n", " # build the list of n-gram phrases, dropping those that are not in the\n", " # unique dictionary\n", " cur_phrases = []\n", " for ii in range(len(words)):\n", " for n in ngram:\n", " phrase = \" \".join(word for word in words[ii : ii + n])\n", " # require the full phrase length so truncated phrases at the end\n", " # of the text are not counted twice\n", " if ii + n <= len(words) and phrase in all_phrases:\n", " cur_phrases.append(phrase)\n", " # count occurrences of the selected phrases in the current text\n", " cur_phrases_count = Counter(cur_phrases)\n", " # create a one-row dense matrix of zeros, where the number of columns\n", " # equals the length of the unique phrases dictionary\n", " dense_matrix = np.zeros([1, len(all_phrases)])\n", " for key in cur_phrases_count.keys():\n", " # get the column index of the current phrase from the index dictionary\n", " all_phrases_ind = all_phrases_dic_ind[key]\n", " # write the phrase count into the dense row at that column\n", " dense_matrix[:, all_phrases_ind] = cur_phrases_count[key]\n", " # convert to sparse and save to disk\n", " csr_matrix = sparse.csr_matrix(dense_matrix)\n", " sparse.save_npz(PATH + \"/sparses/\" + str(name), csr_matrix)\n", "\n", " # load the saved sparse rows and stack them into one matrix\n", " 
sparse_files = sorted(os.listdir(PATH + \"/sparses/\"))  # sorted so the rows keep text order\n", " for i in range(len(sparse_files)):\n", " if i == 0:\n", " feat_values = sparse.load_npz(PATH + \"/sparses/\" + sparse_files[i])\n", " else:\n", " cur_sparse = sparse.load_npz(PATH + \"/sparses/\" + sparse_files[i])\n", " feat_values = sparse.vstack([feat_values, cur_sparse])\n", "\n", " return list(unique_ngrams.keys()), feat_values" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "# To run the vectorizer we need tags for our review database.\n", "tags = [\"comedy\", \"detective\", \"horror\", \"cartoon\", \"action\"]\n", "target = [tags[random.randrange(len(tags))] for item in range(len(new_texts))]\n", "ngram = [1, 2]\n", "features, feat_values = create_sparse_features(new_texts, target, ngram)\n", "print(feat_values.shape)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Here is a screenshot of memory usage on my PC with this script." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see, we get a feature matrix with the same shape as the dense one, but about 5 times faster and with almost no extra memory usage. I tried this approach with different settings and could build matrices of up to 70,000 rows and 20 million feature columns without serious memory usage. Apart from that, the script can be adapted to build almost any bag-of-words architecture you need." ] }, { "cell_type": "markdown", "metadata": {}, "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.0" } }, "nbformat": 4, "nbformat_minor": 2 }