{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "CWPK \\#65: scikit-learn Basics and Initial Encoding\n",
    "=======================================\n",
    "\n",
    "Welcome to a Lengthy Installment on Feature Engineering\n",
    "--------------------------\n",
    "\n",
    "<div style=\"float: left; width: 305px; margin-right: 10px;\">\n",
    "\n",
    "<img src=\"http://kbpedia.org/cwpk-files/cooking-with-kbpedia-305.png\" title=\"Cooking with KBpedia\" width=\"305\" />\n",
    "\n",
    "</div>\n",
    "\n",
    "We devote the next two installments of [*Cooking with Python and KBpedia*](https://www.mkbergman.com/cooking-with-python-and-kbpedia/) to the venerable [Python](https://en.wikipedia.org/wiki/Python_(programming_language)) machine learning package, [scikit-learn](https://en.wikipedia.org/wiki/Scikit-learn). Also known as 'sklearn', this package offers a wealth of classic [machine learning](https://en.wikipedia.org/wiki/Machine_learning) methods and utilities, along with abilities to construct machine learning [pipelines](https://en.wikipedia.org/wiki/Pipeline_(computing)) and collect and present results via a rich set of statistical measures. \n",
    "\n",
    "Though it may be a distinction without a difference, this installment marks our transition in this **Part VI** to packages *devoted* to machine learning. Though our earlier discussions of [gensim](https://en.wikipedia.org/wiki/Gensim) and [spaCy](https://en.wikipedia.org/wiki/SpaCy) highlighted packages that employ much machine learning in their capabilities, their focus is not strictly machine learning in the same way as our remaining packages. After two installments on classic machine learning with scikit-learn, the remaining five installments in this **Part VI** are dedicated to [deep learning](https://en.wikipedia.org/wiki/Deep_learning) with [knowledge graphs](https://en.wikipedia.org/wiki/Knowledge_graph).\n",
    "\n",
    "In this installment we will set-up scikit-learn and prep a master data file as input to it (and to subsequent packages). The efforts we detail in this installment -- one of the longest in our **CWPK** series -- falls into the discipline known as '[feature engineering](https://en.wikipedia.org/wiki/Feature_engineering)', the process of extracting and crafting numeric representations staged properly for machine learning. In this installment, we load all of the necessary pieces and then proceed through a pipeline of data wrangling steps to create our vector file of numeric representations directly usable by the machine learners. This information then prepares us for the following installment, where we will set up our experimental training sets, establish a '[gold standard](https://en.wikipedia.org/wiki/Gold_standard_(test))' by which we calculate statistical performance, and begin some initial classification. We also set up our framework for reporting and comparing results in all of our machine learning installments."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Install scikit-learn\n",
    "Because of our initial use of [Anaconda](https://en.wikipedia.org/wiki/Anaconda_(Python_distribution)), we already have scikit-learn installed. We can confirm this with our standard command:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "conda list"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will have occasion to add some extensions, but will do so in context as the need arises."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Basic Intro to sklearn\n",
    "[scikit-learn](https://scikit-learn.org/stable/index.html) has been an actively used and developed Python package since its initial release in 2007. It is a veritable Swiss army knife of machine learning algorithms and utilities. It has extensive and clear documentation and many, many online examples to help guide the way in the use of the package. However, that being said, there is much to learn about this package and much work to be done with the package in setting up raw data for proper machine learning use.\n",
    "\n",
    "scikit-learn's [API documentation](https://scikit-learn.org/stable/modules/classes.html) is the best introductory source to gain an appreciation for the scope of the package's capabilities. Here are some of the major categories of sklearn's capabilities:\n",
    "\n",
    "|methods|other functions|utilities|\n",
    "|----------------|----------------|---------------|\n",
    "|ensemble methods|dimensionality reduction|warnings and exceptions|\n",
    "|learning methods|kernel operations|metrics|\n",
    "|classification methods|dataset functions|preprocessing|\n",
    "|neural networks|feature selection|pipelines |\n",
    "|SVM|feature extraction|sample generators |\n",
    "|decision trees|decomposition |plotting |\n",
    "|linear models|regressions|splitting |\n",
    "|manifold learning|validation  | normalization  |\n",
    "|nearest neighbor|Bayesian statistics  | randomization  |\n",
    "\n",
    "sklearn offers a diversity of supervised and unsupervised learning methods, as well as dataset transformations and data loading routines. There are about a dozen supervised methods in the package, and nearly ten unsupervised methods. Multiple utilities exist for transferring datasets to other Python data science packages, including the [pandas](https://en.wikipedia.org/wiki/Pandas_(software)), gensim and [PyTorch](https://en.wikipedia.org/wiki/PyTorch) ones used in **CWPK**. Other prominent data formats readable by sklearn include [SciPy](https://en.wikipedia.org/wiki/SciPy) and its binary formats, [NumPy](https://en.wikipedia.org/wiki/NumPy) arrays, the [libSVM](https://en.wikipedia.org/wiki/LIBSVM) sparse format, and common formats including [CSV](https://en.wikipedia.org/wiki/Comma-separated_values), [Excel](https://en.wikipedia.org/wiki/Microsoft_Excel), [JSON](https://en.wikipedia.org/wiki/JSON) and [SQL](https://en.wikipedia.org/wiki/SQL). sklearn provides converters for translating string and categorial formats into numeric representations usable by the learning methods.\n",
    "\n",
    "A [user guide](https://scikit-learn.org/stable/user_guide.html#user-guide) provides examples in how to use most of these capabilities, and [related projects](https://scikit-learn.org/stable/related_projects.html) list dozens of other Python packages that work closely with sklearn. Online searches turn up tens to hundreds of examples of how to work with the package. We only touch upon a few of scikit-learn's capabilities in our series. But, clearly, the package's scope warrants having it as an essential part of your data scienct toolkit."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Prepping the Text Data\n",
    "In earlier installments we introduced the three main portions of KBpedia data in **structure**, **annotations** and **pages** that can contribute features to our machine learning efforts. What we now want to do is to consolidate these parts into a single master that may form the reference source for our learning efforts moving forward. Mixing and matching various parts of this master data will enable us to generate a variety of dataset configurations that we may test and compare.\n",
    "\n",
    "Overall, there are about **15** steps in this process of creating a master input file. This is one of the most time-consuming tasks in the entire **CWPK** effort. There is a good reason why data preparation is given such prominent recognition in most discussions of machine learning. However, it is also the case that stepwise examples about how ***exactly*** to conduct such preparations is not well documented. As a result, we try to provide more than the standard details below.\n",
    "\n",
    "Here are the major steps we undertake to prepare our KBpedia data for machine learning:\n",
    "\n",
    "**1. Assemble contributing data parts**\n",
    "\n",
    "To refresh our memories, here are the three key input files to our master data compilation:\n",
    "  - **structure** - <code>C:/1-PythonProjects/kbpedia/v300/extractions/data/graph_specs.csv</code>\n",
    "  - **annotations** - <code>C:/1-PythonProjects/kbpedia/v300/extractions/classes/Generals_annot_out.csv</code>\n",
    "  - **pages** - <code>C:/1-PythonProjects/kbpedia/v300/models/inputs/wikipedia-trigram.txt</code>.\n",
    "\n",
    "We can inspect these three input files in turn, in the order listed (correct for your own file references):  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "df = pd.read_csv(r'C:/1-PythonProjects/kbpedia/v300/extractions/data/graph_specs.csv')\n",
    "\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "df = pd.read_csv(r'C:/1-PythonProjects/kbpedia/v300/extractions/classes/Generals_annot_out.csv')\n",
    "\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the case of the **pages** file we need to convert its <code>.txt</code> extension to <code>.csv</code>, add a header row of <code>id, text</code> as its first row, and remove any commas from its ID field. (These steps were not needed for the previous two files since they were already in pandas format.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "df = pd.read_csv(r'C:/1-PythonProjects/kbpedia/v300/models/inputs/wikipedia-trigram.csv')\n",
    "\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**2. Map files**\n",
    "\n",
    "We inspect each of the three files and create a master lookup table for what each includes (not shown). We see that only the **annotations** file has the correct number of reference concepts. It also has the largest number of fields (columns). As a result, we will use that file as our target for incorporating the information in the other two files. (By the way, entries marked with 'NaN' are empty.) We will use our IDs (called 'source' in the **structure** file) as the index key for matching the extended information.\n",
    "\n",
    "Prior to mapping, we note that the **annotations** file does not have the updated underscores in place of the initial hyphens (see [**CWPK #48**](https://www.mkbergman.com/2394/cwpk-48-case-study-a-sweeping-refactoring/)), so we load up that file, and manually make that change to the <code>id, superclass</code>, and <code>subclass</code> fields. We also delete two of the fields in the **annotations** file that provide no use to our learning efforts, <code>editorialNote</code> and <code>isDefinedBy</code>. After these changes, we name the file <code>C:/1-PythonProjects/kbpedia/v300/models/inputs/kbpedia_master.csv</code>. This is the master file we will use for consolidating our input information for all learning efforts moving forward.\n",
    "\n",
    "<div style=\"background-color:#ffecec; border:1px dotted #f5aca6; vertical-align:middle; margin:15px 60px; padding:8px;\"> \n",
    "  <span style=\"font-weight: bold;\">NOTE:</span> During development of these routines I typically use temporary master files for saving each interim step, which provides the opportunity to inspect each transitional step before moving forward. I only provide the concluding versions of these steps in <code>C:/1-PythonProjects/kbpedia/v300/models/inputs/kbpedia_master.csv</code>, which is listed on GitHub under the <code>https://github.com/Cognonto/CWPK/tree/master/sandbox/models/inputs</code> directory, along with many of the other inputs noted below. If you want to inspect interim versions as outlined in the steps below, you will need to reconstruct the steps locally.</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**3. Prepare structure file**\n",
    "\n",
    "In the **structure** file, only two fields occur that are not already in the **annotations** file, namely the count of direct subclass children ('<code>weight</code>') and the <code>supertype</code> (ST). The ST field is a many-to-one, which means we need to loop over that field and combine instances into a single cell. We will look to [**CWPK #44**](https://www.mkbergman.com/2385/cwpk-44-annotation-ingest/) for some example looping code by iterating instances, only now using the standard <code>','</code> separator in place of the double pipes <code>'||'</code> used earlier. \n",
    "\n",
    "Since we need to remove the '<code>kko:</code> prefix from the **structure** file (to correspond to the convention in the **annotations** file), we make a copy of the file and rename and save it to <code>C:/1-PythonProjects/kbpedia/v300/models/inputs/kbpedia_struct_temp.csv</code>. Since we now have an altered file, we can also remove the <code>target</code> column and rename <code>source</code> &rarr; <code>id</code> and <code>weight</code> &rarr; <code>count</code>. With these changes we then pull in our earlier code and create the routine that will put all of the SuperTypes for a given reference concept (RC) on the same line, separated by the standard <code>','</code> separator. We will name the new output file <code>C:/1-PythonProjects/kbpedia/v300/models/inputs/kbpedia_structure.csv</code>. Here is the code (note the input file is sorted on <code>id</code>):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import csv\n",
    "\n",
    "in_file = r'C:/1-PythonProjects/kbpedia/v300/models/inputs/kbpedia_struct_temp.csv'\n",
    "out_file = r'C:/1-PythonProjects/kbpedia/v300/models/inputs/kbpedia_structure.csv'\n",
    "\n",
    "with open(in_file, 'r', encoding='utf8') as input:\n",
    "    reader = csv.DictReader(input, delimiter=',', fieldnames=['id', 'count', 'SuperType'])                 \n",
    "    header = ['id', 'count', 'SuperType']\n",
    "    with open(out_file, mode='w', encoding='utf8', newline='') as output:                                           \n",
    "        csv_out = csv.writer(output)\n",
    "        x_st = ''\n",
    "        x_id = ''\n",
    "        row_out = ()\n",
    "        flg = 0\n",
    "        for row in reader:\n",
    "            r_id = row['id']                \n",
    "            r_cnt = row['count']\n",
    "            r_st = row['SuperType']                                                              \n",
    "            if r_id != x_id:                               #Note 1\n",
    "                csv_out.writerow(row_out)\n",
    "                x_id = r_id                                #Note 1\n",
    "                x_cnt = r_cnt\n",
    "                x_st = r_st\n",
    "                flg = 1\n",
    "                row_out = (x_id, x_cnt, x_st)\n",
    "            elif r_id == x_id:\n",
    "                x_st = x_st + ',' + r_st                   #Note 2\n",
    "                flg = flg + 1\n",
    "                row_out = (x_id, x_cnt, x_st)\n",
    "    output.close()         \n",
    "input.close()\n",
    "print('KBpedia SuperType flattening is complete . . .')                                                          "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This routine does produce an extra line at the top of the file that needs to be removed manually (at least the code here does not handle it). This basic routine looks to find the change in name for the reference concept ID **(1)** that signals a new series is occurring. The SuperTypes encountered that share the same RC but have different names, are added to a single list string **(2)**. \n",
    "\n",
    "To see the resulting file, here's the statement:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "df = pd.read_csv(r'C:/1-PythonProjects/kbpedia/v300/models/inputs/kbpedia_structure.csv')\n",
    "\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**4. Prepare pages file**\n",
    "\n",
    "We are dealing with files here that strain the capabilities of standard spreadsheets. Large files become hard to load and sluggish (like minutes of processing with LibreOffice) even when they do. I have needed to find some alternative file editors to handle large files.\n",
    "\n",
    "<div style=\"background-color:#ffc; border:1px dotted #ff0; vertical-align:middle; margin:15px 60px; padding:8px;\"><span style=\"font-weight: bold;\">NOTE:</span> I have tried numerous large file editors that I discuss further in the <strong>CWPK #75</strong> conclusion to this series, but at present I am using the free 'hackable' <a href=\"https://github.com/atom\">Atom</a> editor. Atom is a highly configurable editor from <a href=\"https://github.com/atom\">GitHub </a> suitable to large files. It has a rich ecosystem of 'packages' that provide specific functionality, including one (<a href=\"https://atom.io/packages/tablr\">tablr</a>) that provides CSV table editing and manipulation. If Atom proves inadequate which would likely force me to purchase an alternative, my current evaluations point to <a href=\"https://www.emeditor.com/\">EmEditor</a>, which also has a limited free option and a 30-day trial of the paid version (\\$40 first year, \\$20 yearly thereafter). Finally, in order to use Atom properly on CSV files it turns out I also needed to fix a known <a href=\"https://github.com/abe33/atom-tablr/issues/100\">tablr JS issue</a>.</div>\n",
    "\n",
    "Because we want to convert our **pages** file to a single vector representation by RC (the same listing as in the **annotations** file), we will wait on finishing preparing this file for the moment. We return to this question below."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**5. Create merge routine**\n",
    "\n",
    "OK, so we now have properly formatted files for incorporation into our new master. Our files have reached a size where combining or joining them should be done programmatically, not via spreadsheet utilities. \n",
    "\n",
    "pandas offers a number of [join and merge routines](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html), some based on row [concatenations](https://datacarpentry.org/python-ecology-lesson/05-merging-data/index.html), others based on joins such as [inner or outer](http://psrc.github.io/itm-tutorial-python/04-merging-data) patterned on SQL. In our case, we want to do what is called a 'left join' wherein the left-specified file retains all of its rows, while the right file matches where it can.\n",
    "\n",
    "Clearly, there are many needs and ways for merging files, but here is one example for a 'left join':"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "file_a = r'C:/1-PythonProjects/kbpedia/v300/models/inputs/kbpedia_master_1.csv'\n",
    "file_b = r'C:/1-PythonProjects/kbpedia/v300/models/inputs/kbpedia_structure.csv'\n",
    "file_out = r'C:/1-PythonProjects/kbpedia/v300/models/inputs/kbpedia_master_2.csv'\n",
    "\n",
    "df_a = pd.read_csv(file_a)\n",
    "\n",
    "df_b = pd.read_csv(file_b)\n",
    "\n",
    "merged_inner = pd.merge(left=df_a, right=df_b, left_on='id', right_on='id')\n",
    "# Since both sources share the 'id' column name, we could skip the 'left_on' and 'right_on' parameters\n",
    "\n",
    "# What's the size of the output data?\n",
    "merged_inner.shape\n",
    "merged_inner\n",
    "\n",
    "merged_inner.to_csv(file_out)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This works great, too. This is the pattern we will use for other merges. We have the green light to continue with our data preparations."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**6, Fix columns in master file**\n",
    "\n",
    "We now return our attention to the master file. There are many small steps that must be taken to get the file into shape for encoding it for use by the <code>scikit-learn</code> machine learners. Preferably, of course, these steps get combined (once they are developed and tested) into an overall processing pipeline. We will discuss pipelines in later installments. But, for now, in order to understand many of the particulars involved, we will proceed baby step by baby step through these preparations. This does mean, however, that we need to wrap each of these steps into the standard routines of opening and closing files and identifying columns with the active data.\n",
    "\n",
    "Generally, at the top of each baby step routine, all of which use <code>pandas</code>, we open the master from the prior step and then save the results under a new master name. This enables us to run the routine multiple times as we work out the kinks and to return to prior versions should we later discover processing issues:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "df = pd.read_csv(r'C:/1-PythonProjects/kbpedia/v300/models/inputs/kbpedia_master.csv')\n",
    "out_f = r'C:/1-PythonProjects/kbpedia/v300/models/inputs/kbpedia_master_1.csv'\n",
    "\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can always invoke some of the standard <code>pandas</code> calls to see what the current version of the master file contains:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cols = list(df.columns.values)\n",
    "print(cols)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also manipulate, drop, or copy columns or change their display order:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.drop('CleanDef', axis=1, inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = df[['Unnamed: 0', 'id', 'prefLabel', 'subClassOf', 'count', 'superClassOf', 'SuperType', 'altLabel', 'def_final']]\n",
    "df.to_csv(out_f)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Copies contents in 'SuperType' column to a new column location, 'pSuperType'\n",
    "df.loc[:, 'pSuperType'] = df['SuperType']\n",
    "df.to_csv(out_f)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will first focus on the 'definitions' column in the master file. In addition to duplicating input files under an alternative output name, we also will move any changes in column data to new columns. This makes it easier to see changes and overwrites and to recover prior data. Again, once the routines are running correctly, it is possible to collapse this overkill into pipelines.\n",
    "\n",
    "One of our earlier preparatory steps was to ensure that the Cyc hrefs first mentioned in [**CWPK #48**](https://www.mkbergman.com/2394/cwpk-48-case-study-a-sweeping-refactoring/) were removed from certain KBpedia definitions that had been obtained from OpenCyc. As before, we use the amazing <code>BeautifulSoup</code> HTML parser:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Removal of Cyc hrefs\n",
    "from bs4 import BeautifulSoup \n",
    "\n",
    "cleandef = []\n",
    "for column in df[['definition']]:\n",
    "    columnContent = df[column]\n",
    "    for row in columnContent:\n",
    "        line = str(row)\n",
    "        soup = BeautifulSoup(line)                               \n",
    "        tags = soup.select(\"a[href^='http://sw.opencyc.org/']\")  \n",
    "        if tags != []:\n",
    "            for item in tags:                                    \n",
    "                item.unwrap()                                    \n",
    "                item_text = soup.get_text()                      \n",
    "        else:\n",
    "            item_text = line\n",
    "        cleandef.append(item_text)\n",
    "\n",
    "df['CleanDef'] = cleandef        \n",
    "df.to_csv(out_f)\n",
    "print('File written and closed.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This example also shows us how to loop over a <code>pandas</code> column. The routine finds the matching <code>href</code>, picks out the open and closing tags, and retains the link label while it removes the link HTML.\n",
    "\n",
    "In reviewing our data, we also observe that we have some string quoting issues and a few instances of 'smart quotes' embedded within our definitions:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Forcing quotes around all definitions\n",
    "cleandef = []\n",
    "quote = '\"'\n",
    "for column in df[['CleanDef']]:\n",
    "    columnContent = df[column]\n",
    "    for row in columnContent:\n",
    "        line = str(row)\n",
    "        if line[0] != quote:\n",
    "            line = quote + line + quote\n",
    "            line = line.replace('\"\"\"', '\"')\n",
    "#        elif line[-1] != quote:\n",
    "#            line = line + quote\n",
    "        else:\n",
    "            continue\n",
    "        cleandef.append(line)\n",
    "\n",
    "df['definition'] = cleandef\n",
    "df.drop('Unnamed: 0', axis=1, inplace=True)\n",
    "df.drop('CleanDef', axis=1, inplace=True)\n",
    "df.to_csv(out_f)\n",
    "print('File written and closed.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We also need to 'normalize' the definitions text by converting it to lower case, removing punctuation and stop words, and other refinements. There are many useful text processing techniques in the following code block:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Normalize the definitions\n",
    "from gensim.parsing.preprocessing import remove_stopwords\n",
    "from string import punctuation\n",
    "from gensim import utils\n",
    "import pandas as pd\n",
    "\n",
    "df = pd.read_csv(r'C:/1-PythonProjects/kbpedia/v300/models/inputs/kbpedia_master_1.csv')\n",
    "out_f = r'C:/1-PythonProjects/kbpedia/v300/models/inputs/kbpedia_master_3.csv'\n",
    "\n",
    "more_stops = ['b', 'c', 'category', 'com', 'd', 'f', 'formatnum', 'g', 'gave', 'gov', 'h', \n",
    "              'htm', 'html', 'http', 'https', 'id', 'isbn', 'j', 'k', 'l', 'loc', 'm', 'n', \n",
    "              'need', 'needed', 'org', 'p', 'properties', 'q', 'r', 's', 'took', 'url', 'use', \n",
    "              'v', 'w', 'www', 'y', 'z'] \n",
    "\n",
    "def is_digit(word):\n",
    "    try:\n",
    "        int(word)\n",
    "        return True\n",
    "    except ValueError:\n",
    "        return False\n",
    "    \n",
    "tokendef = []\n",
    "quote = '\"'\n",
    "for column in df[['definition']]:\n",
    "    columnContent = df[column]\n",
    "    i = 0\n",
    "    for row in columnContent:\n",
    "        line = str(row)\n",
    "        try:\n",
    "            # Lowercase the text\n",
    "            line = line.lower()\n",
    "            # Remove punctuation \n",
    "            line = line.translate(str.maketrans('', '', string.punctuation))\n",
    "            # More preliminary cleanup\n",
    "            line = line.replace(\"‘\", \"\").replace(\"’\", \"\").replace('-', ' ').replace('–', ' ').replace('↑', '')\n",
    "            # Remove stopwords            \n",
    "            line = remove_stopwords(line)\n",
    "            splitwords = line.split()\n",
    "            goodwords = [word for word in splitwords if word not in more_stops]\n",
    "            line = ' '.join(goodwords)\n",
    "            # Remove number strings (but not alphanumerics)\n",
    "            new_line = []\n",
    "            for word in line.split():\n",
    "                if not is_digit(word):\n",
    "                    new_line.append(word)\n",
    "            line = ' '.join(new_line) \n",
    "#            print(line) \n",
    "        except Exception as e:\n",
    "            print ('Exception error: ' + str(e))\n",
    "        tokendef.append(line)  \n",
    "        i = i + 1\n",
    "df['tokendef'] = tokendef\n",
    "df.drop('Unnamed: 0', axis=1, inplace=True)\n",
    "df.drop('definition', axis=1, inplace=True)\n",
    "df.to_csv(out_f)\n",
    "print('Normalization applied to ' + str(i) + ' texts')\n",
    "print('File written and closed.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To be consistent with our **page** processing steps, we also will extract bigrams from the definition text. We had earlier worked out this routine (see [**CWPK #63**](https://www.mkbergman.com/2414/cwpk-63-staging-data-sci-resources-and-preprocessing/)) that we generalize here. This also provides the core routines for preprocessing input text for evaluations:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Phrase extraction applied to 58069 texts\n",
      "File written and closed.\n"
     ]
    }
   ],
   "source": [
    "# Phrase extract bigrams, from CWPK #63\n",
    "from gensim.models.phrases import Phraser, Phrases\n",
    "import pandas as pd\n",
    "\n",
    "df = pd.read_csv(r'C:/1-PythonProjects/kbpedia/v300/models/inputs/kbpedia_definition_short.csv')\n",
    "out_f = r'C:/1-PythonProjects/kbpedia/v300/models/inputs/kbpedia_definition_bigram.csv'\n",
    "\n",
    "definition = []\n",
    "for column in df[['definition']]:\n",
    "    columnContent = df[column]\n",
    "    i = 0\n",
    "    for row in columnContent:\n",
    "        line = str(row)\n",
    "        try:\n",
    "            splitwords = line.split()\n",
    "            common_terms = ['aka']\n",
    "            ngram = Phrases(splitwords, min_count=2, threshold=10, max_vocab_size=80000000, \n",
    "                    delimiter=b'_', common_terms=common_terms)\n",
    "            ngram = Phraser(ngram)\n",
    "            line = list(ngram[splitwords])           \n",
    "            line = ', '.join(line)\n",
    "            line = line.replace(', ', ' ')\n",
    "            line = line.replace(' s ', '')         \n",
    "#            print(line) \n",
    "        except Exception as e:\n",
    "            print ('Exception error: ' + str(e))\n",
    "        definition.append(line)  \n",
    "        i = i + 1\n",
    "df['def_bigram'] = definition\n",
    "df.drop('Unnamed: 0', axis=1, inplace=True)\n",
    "df.drop('definition', axis=1, inplace=True)\n",
    "df.to_csv(out_f)\n",
    "print('Phrase extraction applied to ' + str(i) + ' texts')\n",
    "print('File written and closed.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Some of the other desired changes could be done in the open spreadsheet, so no programmatic code is provided here. We wanted to update the use of underscores in all URI identifiers, retain case, and separate multiple entries by blank space rather than commas or double pipes. We made these changes via simple find-and-replace for the <code>subClassOf</code>, <code>superClassOf</code>, and <code>SuperType</code> columns. (The <code>id</code> column is already a single token, so it was only checked for underscores.)\n",
    "\n",
    "Unlike the **pages** and definitions, which are all normalized and lowercased, we also wanted to remove punctuation, separate entries by spaces, and remove punctuation but retain case for the <code>prefLabel</code> and <code>altLabel</code> columns. Again, simple find-and-replace was used here.\n",
    "\n",
    "Thus, aside from the **pages** that we still need to merge in vector form (see below), these baby steps complete all of the text preprocessing in our master file. For these columns, we now have the inputs to the later vectorizing routines."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**7. 'Push down' SuperTypes**\n",
    "\n",
    "We discussed in [**CWPK #56**](https://www.mkbergman.com/2404/cwpk-56-graph-visualization-and-extraction/) how we wanted to evaluate the SuperType assignments for a given KBpedia reference concept (RC). Our main desire is to give each RC its most specific SuperType assignments. Some STs are general, higher-level categories that provide limited or no discriminatory power. \n",
    "\n",
    "We term the process of narrowing SuperType assignments to the lowest and most specific levels for a given RC a 'push down'. The process we have defined for this first begins by pruning any mention of a more general category within an RCs current SuperType listing, unless doing so would leave the RC without an ST assignment. We supplement this approach with a second pass where we iterate one by one over some common STs and remove them unless it would leave the RC without an ST assignment. Here is that code, which we write into a new column so that we do not lose the starting listing:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "# Narrow SuperTypes, per #CWPK #56\n",
    "import pandas as pd\n",
    "\n",
    "df = pd.read_csv(r'C:/1-PythonProjects/kbpedia/v300/models/inputs/kbpedia_master.csv')\n",
    "out_f = r'C:/1-PythonProjects/kbpedia/v300/models/inputs/kbpedia_master_1.csv'\n",
    "\n",
    "parent_set = ['SocialSystems','Products','Methodeutic','Eukaryotes','ConceptualSystems',\n",
    "              'AVInfo','Systems','Places', 'OrganicChemistry','MediativeRelations',\n",
    "              'LivingThings', 'Information','CopulativeRelations','Artifacts','Agents',\n",
    "              'TimeTypes','Symbolic','SpaceTypes','RepresentationTypes', 'RelationTypes',\n",
    "              'LocationPlace', 'OrganicMatter','NaturalMatter', 'AttributeTypes','Predications',\n",
    "              'Manifestations', 'Constituents', 'AdjunctualAttributes', 'IntrinsicAttributes',\n",
    "              'ContextualAttributes', 'DirectRelations', 'Concepts', 'KnowledgeDomains', 'Shapes',\n",
    "              'SituationTypes', 'Forms', 'Associatives', 'Denotatives', 'TopicsCategories',\n",
    "              'Indexes', 'ActionTypes', 'AreaRegion']\n",
    "\n",
    "second_pass = ['KnowledgeDomains', 'SituationTypes', 'Forms', 'Concepts', 'ActionTypes', \n",
    "               'AreaRegion', 'Artifacts', 'EventTypes']\n",
    "\n",
    "clean_st = []\n",
    "quote = '\"'\n",
    "for column in df[['SuperType']]:\n",
    "    columnContent = df[column]\n",
    "    i = 0\n",
    "    for row in columnContent:\n",
    "        line = str(row)\n",
    "        try:\n",
    "            line = line.replace(', ', ' ')\n",
    "            splitclass = line.split()\n",
    "            # Remove duplicates because dict only has uniques\n",
    "            splitclass = list(dict.fromkeys(splitclass))\n",
    "            line = ' '.join(splitclass)\n",
    "            goodclass = [word for word in splitclass if word not in parent_set]\n",
    "            test_line = ' '.join(goodclass)\n",
    "            if test_line == '':\n",
    "                clean_st.append(line)\n",
    "            else:\n",
    "                line = test_line\n",
    "                clean_st.append(line)\n",
    "        except Exception as e:\n",
    "            print ('Exception error: ' + str(e))\n",
    "        i = i + 1\n",
    "print('First pass count: ' + str(i))\n",
    "\n",
    "# Second pass\n",
    "print('Starting second pass . . . .')\n",
    "clean2_st = []\n",
    "i = 0\n",
    "length = len(clean_st)\n",
    "print('Length of clean_st: ', length)\n",
    "ln = len(second_pass)\n",
    "for row in clean_st:\n",
    "    line = str(row)\n",
    "    try_line = line\n",
    "    for item in second_pass:\n",
    "        word = str(item)\n",
    "        try_line = str(try_line)\n",
    "        try_line = try_line.strip()\n",
    "        try_line = try_line.replace(word, '')\n",
    "        try_line = try_line.strip()\n",
    "        try_line = try_line.replace('  ', ' ')\n",
    "        char = len(try_line)\n",
    "        if char < 6:\n",
    "            try_line = line\n",
    "            line = line\n",
    "        else:\n",
    "            line = try_line\n",
    "    clean2_st.append(line)                    \n",
    "    print('line: ' + str(i) + ' ' + line) \n",
    "    i = i + 1\n",
    "\n",
    "df['clean_ST'] = clean2_st\n",
    "df.drop('Unnamed: 0', axis=1, inplace=True)\n",
    "df = df[['id', 'prefLabel', 'subClassOf', 'count', 'superClassOf', 'SuperType', 'clean_ST', 'altLabel', 'def_final']]\n",
    "df.to_csv(out_f, encoding='utf-8')\n",
    "print('ST reduction applied to ' + str(i) + ' texts')\n",
    "print('File written and closed.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Depending on the number of print statements one might include, the listing above can produce a very long listing!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**8. Final text revisions**\n",
    "\n",
    "We have a few minor changes to attend to prior to the numeric encoding of our data. The first revision is based on the fact that a minor portion of both <code>altLabels</code> and <code>definitions</code> are much longer than the others. We analyzed min, max, mean and median for these two text sets. We roughly doubled the size of the rough mean and median for each set, and trimmed the strings to a maximum length of 150 and 300 characters, respectively. We employed the <code>textwrap</code> package to make the splits at word boundaries:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import textwrap\n",
    "\n",
    "df = pd.read_csv(r'C:/1-PythonProjects/kbpedia/v300/models/inputs/kbpedia_master.csv')\n",
    "out_f = r'C:/1-PythonProjects/kbpedia/v300/models/inputs/kbpedia_master_1.csv'\n",
    "\n",
    "texts = df['altLabel']\n",
    "new_texts = []\n",
    "for line in texts:\n",
    "    line = str(line)\n",
    "    length = len(line)\n",
    "    if length == 0:\n",
    "        new_line = line\n",
    "    else:\n",
    "        new_line = textwrap.shorten(line, width=150, placeholder='')\n",
    "    new_texts.append(new_line)\n",
    "    \n",
    "df['altLabel'] = new_texts\n",
    "\n",
    "df.drop('Unnamed: 0', axis=1, inplace=True)\n",
    "#df.drop('def_final', axis=1, inplace=True)\n",
    "df.to_csv(out_f, encoding='utf-8')\n",
    "print('Definitions shortened.')\n",
    "print('File written and closed.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We repeated this routine for both text sets and also did some minor punctuation clean up for the <code>altLabels</code>. We have now completed all text modifications and have a clean text master. We name this file <code>kbpedia_master_text.csv</code> and keep it for archive."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Prepare Numeric Encodings\n",
    "\n",
    "We now shift our gears from prepping and staging the text to encoding all information to a proper numeric form given its content.\n",
    "\n",
    "**9. Plan the numeric encoding**\n",
    "\n",
    "Only numbers can be represented to machine learning models. Here is one place where <code>sklearn</code> really shines with its utility functions. \n",
    "\n",
    "The manner by which we encode fields needs to be geared to the information content we can and want to convey, so context and scale (or its reciprocal, reduction) play prominent roles in helping decide what form of encoding works best. Text and language encodings, which are among the most challenging, can range from naive unique numeric identifiers to adjacency or transformed or learned representations for given contexts, including relations, categories, sentences, paragraphs or documents. A finite set of 'categories', or other fields with a discrete number of targets, can be the targets of learning representations that encompass a range of members. \n",
    "\n",
    "A sentence or document or image, for example, is often reduced to a fixed number of dimensions, sometimes ranging into the hundreds, that are represented by arrays of numbers. Initial encoding representations may be trained against a desired labeled set to adjust or transform those arrays to come into correspondence with their intended targets. Items with many dimensions can occupy sparse matrices where most or many values are zero (non-presence). To reduce the size of large arrays we may also undergo further compression or dimension reduction through techniques like [principal component analysis](https://en.wikipedia.org/wiki/Principal_component_analysis), or PCA. Many other clustering or dimension reduction techniques exist.\n",
    "\n",
    "The art or skill in machine learning often resides at the interface between raw input data and how it is transformed into these vectors recognizable by the computer. There are some automated ways to evaluate options and to help make parameter and model choice decisions, such as [grid search](https://en.wikipedia.org/wiki/Hyperparameter_optimization#Grid_search). <code>sklearn</code>, again, has many useful methods and tools in these general areas. I am drinking from a fire hose, and you will too if you poke much in this area.\n",
    "\n",
    "A general problem area, then, is often characterized by data that is heterogeneous in terms of what it captures, corresponding, if you will, to the different fields or columns one might find in a spreadsheet. We may have some numeric values that need to be normalized, some text information ranging from single category terms or identifiers to definitions or complete texts, or complex arrays that are themselves a transformation of some underlying information. At minimum, we can say that multiple techniques may be necessary for encoding multiple different input data columns.\n",
    "\n",
    "In terms of big data, <code>pandas</code> is roughly the equivalent analog to the spreadsheet. A nice tutorial provides some guidance on working jointly with [<code>pandas</code> and <code>sklearn</code>](https://www.dunderdata.com/blog/from-pandas-to-scikit-learn-an-exciting-new-worflow). One early alternative to capture this need to apply different transformations to different input data columns was the independent [sklearn-pandas](https://github.com/scikit-learn-contrib/sklearn-pandas). I have looked closely at this option and I like its syntax and approach, but <code>scikit-learn</code> eventually adopted its own [ColumnTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html#sklearn.compose.ColumnTransformer) methods, and they have become the standard and more actively developed option.\n",
    "\n",
    "ColumnTransformer is a <code>sklearn</code> method for picking individual columns from <code>pandas</code> data sets, especially for [heterogeneous data](https://scikit-learn.org/stable/modules/compose.html#columntransformer-for-heterogeneous-data), and can be combined into pipelines. This design is well suited to [mixed data types](https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html), including how to render your pipelines with HTML, and to [transform multiple columns](https://www.mikulskibartosz.name/preprocessing-the-input-panda-dataframe-using-columntransformer-in-scikit-learn/) based on <code>pandas</code> inputs. There is much to learn in this area, but perhaps start with this [pretty good overview](https://jorisvandenbossche.github.io/blog/2018/05/28/scikit-learn-columntransformer/) and then how these techniques might be combined into [pipelines with custom transformers](https://queirozf.com/entries/scikit-learn-pipelines-custom-pipelines-and-pandas-integration).\n",
    "  \n",
    "We will provide some ColumnTransformer examples, but I will also try to explain each baby step. I'll talk more about individual techniques, but here are the encoding options we have identified to transfer our KBpedia master file:\n",
    "\n",
    "|  Encoding Type  |  Applicable Fields  |\n",
    "|:---------------:|:-------------------:|\n",
    "| one-hot | clean_ST (also target)|\n",
    "| count vector | id, prefLabel, subClassOf, superClassOf |\n",
    "| tfidf | altLabel, definition |\n",
    "| doc2vec | page |\n",
    "\n",
    "We will cover these in the baby steps to follow."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**10. Category (one-hot) encode**\n",
    "\n",
    "[Category encoding](https://tutorialedge.net/python/data-science/preparing-dataset-machine-learning-scikit-learn/) is to take a column listing of strings (generally, and may also be multiple columns) and then convert the category strings to a unique number. <code>sklearn</code> has a function called [<code>LabelEncoder</code>](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html) for this function. Since a single column of unique numbers might imply order or hierarchy to some learners, this approach may be followed by a [<code>OneHotEncoder<code>](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) where each category is given its own column with a binary match (1) or not (0) assigned to each columm depending on what categories it has. Depending on the number of categories, this column array can grow to quite a large size and pose memory issues. The category approach is definitely appropriate for a score of items at the low end, and perhaps up to 500 at the upper end. Since our active <code>SuperType</code> categories number about 70, we first explore this option.\n",
    "    \n",
    "<code>scikit-learn</code> has been actively developed of late, and this initial approach has been updated with an improved <code>OneHotEncoder</code> that works directly with strings paired with the [<code>ColumnTransformer</code>](https://scikit-learn.org/stable/modules/compose.html#columntransformer-for-heterogeneous-data) estimator. If you research category encoding online you might be confused about these earlier and later descriptions. Know that the same general approach applies here of assigning an array of <code>SuperType</code> categories to each row entry in our master data.\n",
    "   \n",
    "However, as we saw before for our cleaned STs, some rows (reference concepts, or RCs) may have more than one category assigned. Though it is possible to run <code>ColumnTransformer</code> over multiple columns at once, <code>sklearn</code> produces a new output column for each input column. I was not able to find a way to convert multiple ST strings in a single column to their matching category values. I am pretty sure there is a way for experts to figure out this problem, but I was not able to do so.\n",
    "    \n",
    "Fortunately, in the process of investigating these matters I encountered the <code>pandas</code> function of [<code>get_dummies</code>](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html) that does one-hot encoding directly. More fortunately still there is also a <code>pandas</code> string function that allows multiple values to be split into individual ones, that can be applied as [<code>str.get_dummies</code>](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.get_dummies.html). Further, with a bit of other Python magic, we can take the synthetic headers derived from the unique <code>SuperType</code> classes and give them a <code>st_</code> prefix and combine them (concatenate) into a new resulting <code>pandas</code> dataframe. The net result is a very slick and short routine for category encoding our clean <code>SuperTypes</code>:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "df = pd.read_csv(r'C:/1-PythonProjects/kbpedia/v300/models/inputs/kbpedia_master.csv')\n",
    "out_f = r'C:/1-PythonProjects/kbpedia/v300/models/inputs/kbpedia_master_cleanst.csv'\n",
    "\n",
    "df_2 = pd.concat([df, df['clean_ST'].str.get_dummies(sep=', ').rename(lambda x: 'st_' + x, axis='columns')], axis=1)\n",
    "\n",
    "df_2.drop('Unnamed: 0', axis=1, inplace=True)\n",
    "\n",
    "df_2.to_csv(out_f, encoding='utf-8')\n",
    "print('Categories one-hot encoded.')\n",
    "print('File written and closed.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The result of this routine can be seen in its <code>get_dummies</code> dataframe:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "df = pd.read_csv(r'C:/1-PythonProjects/kbpedia/v300/models/inputs/kbpedia_master_1.csv')\n",
    "\n",
    "df.info()\n",
    "\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will work in the idea of <code>sklearn</code> pipelines at the end of this installment and the beginning of the next, which argues for keeping everything within this environment for grid search and other preprocessing and consistency tasks. But, we are also looking for the simplest methods and will also be using master files to drive a host of subsequent tasks. In this regard, we are not solely focused on keeping the analysis pipeline totally within <code>scikit-learn</code>, but establishing robust starting points for a diversity of machine learning platforms. In this regard, the use of the <code>get_dummies</code> approach may make some sense."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**11. Fill in missing values** and\n",
    "\n",
    "**12. Other categorical text (vocabulary) encode**\n",
    "\n",
    "<code>scikit-learn</code> machine learners do not work on unbalanced datasets. If one column has x number of items, other comparative columns need to have the same. One can pare down the training sets to the lowest common number, or one may provide estimates or 'fill-ins' for the open items. Helpfully, <code>sklearn</code> has a number of [useful utilities](https://scikit-learn.org/stable/modules/impute.html) including imputation for filling in missing values.\n",
    "\n",
    "Since, in the case of our KBpedia master, all missing values relate to text (especially for the <code>altLabels</code> category), we can simply assign a blank space as our fill in using standard [<code>pandas</code> utilities](https://pandas.pydata.org/pandas-docs/stable/reference/series.html#missing-data-handling). However, one can also do replacements, conditional fill-ins, averages and the like depending on circumstances. If you have missing data, you should consult the [package documentation](https://scikit-learn.org/stable/modules/impute.html). In our code example below (see note **(1)**), however, we limit ourselves to the filling in with blank spaces.\n",
    "\n",
    "There are a number of preprocessing encoding methods in <code>sklearn</code> useful to text, including [<code>CountVectorizer</code>](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html), [<code>HashingVectorizer</code>](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html), [<code>TfidfTransformer</code>](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html), and [<code>TfidfVectorizer</code>](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html). The <code>CountVectorizer</code> creates a matrix of text tokens and their counts. The <code>HashingVectorizer</code> uses the [hash trick](https://en.wikipedia.org/wiki/Feature_hashing) to generate unique numbers for text tokens with the least amount of memory but no ability to later look up the contributing tokens. The <code>TfidfVectorizer</code> calculates both both term frequency and inverse document frequency, while the <code>TfidfTransformer</code> first requires the <code>CountVectorizer</code> to calculate term frequency.\n",
    "\n",
    "The fields we need to encode include the <code>prefLabel</code>, the <code>id</code>, the <code>subClassOf</code> (parents), and the <code>superClassOf</code> (children) entries. The latter three all are based on the same listing of 58 K KBpedia reference concepts (RCs). The <code>prefLabel</code> has major overlap with these RCs, but the terms are not concatenated and some synonyms and other qualifiers appear in the list. Nonetheless, because there is a great deal of overlap in all four of these fields, it appears best that we use a combined vocabulary across all four fields.\n",
    "    \n",
    "The term frequency/inverse document frequency ([TF/IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)) method is a proven statistical way to indicate the importance of a term in relation to an entire corpus. Though we are dealing with a limited vocabulary for our RCs, some RCs are more often in relationships with other RCs and some RCs are more frequently used than others. While the other methods would give us our needed numerical encoding, we first pick TF/IDF to test because it appears to retain the most useful information in its encoding.\n",
    "\n",
    "After taking care of missing items, we want our coding routine, then, to construct a common vocabulary across our four subject text fields. This common vocabulary and its frequency counts is what we will use to calculate the document frequencies across all four columns. For this purpose we will use the <code>CountVectorizer</code> method. Then, for each of the four individual columns that comprise this vocabulary we will use the <code>TfidfTransformer</code> method to get the term frequencies for each entry and to construct its overall TF/IDF scores. We will need to construct this routine using the <code>ColumnTransformer</code> method described above.\n",
    "\n",
    "There are many choices and nuances to learn with this approach. While doing so, I came to realize some significant downfalls. The number of unique tokens across all four columns is about 80,000. When parsed against the 58 K RCs it results in a huge matrix, but a very sparse one. There is an average of about four tokens across all four columns for each RC, and rarely does an RC have more than seven. This unusual profile, though, does make sense since we are dealing with a controlled vocabulary of 58 K tokens for three of the columns, and close matches and synonyms with some type designators for the <code>prefLabel</code> field. So, anytime our routines needed to access information in one of these four columns, we would incur a very large memory penalty. Since memory is a limiting factor in all cases, but especially so with my office-grade equipment, this is a heavy anchor to be dragging into the picture.\n",
    "\n",
    "Our first need to create a shared vocabulary across all four input columns brought the first insight to bypass this large matrix conundrum. The <code>CountVectorizer</code> produces a tuple listing index and unique token ID. The token ID can be linked back to its text key, and the sequence of the token IDs is in text alphabetical order. Rather than needing to capture a large sparse matrix, we only need to record the IDs for the few matching terms. What is really cool about this structure is that we can reduce our memory demand by more than 14,000 times while being fully lossless with lookup to the actual text. Further, this vocabulary structure can be saved separately and incorporated in other learners and utilities. \n",
    "\n",
    "So, our initial routine lays out how we can combine the contents of our four columns, which then becomes the basis for fitting the <code>TfidfVectorizer</code>. (The dictionary creation part of this routine is based on the <code>TfidfVectorizer</code>, so either may be used.) Let's first look at this structure derived from our KBpedia master file:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Concatenation code adapted from https://github.com/scikit-learn/scikit-learn/issues/16148\n",
    "\n",
    "import pandas as pd\n",
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "\n",
    "df = pd.read_csv(r'C:/1-PythonProjects/kbpedia/v300/models/inputs/kbpedia_master.csv')\n",
    "\n",
    "df['altLabel'] = df['altLabel'].fillna(' ')              # Note 1\n",
    "df['superClassOf'] = df['superClassOf'].fillna(' ')\n",
    "\n",
    "def concat_cols(df):\n",
    "    assert len(df.columns) >= 2\n",
    "    res = df.iloc[:, 0].astype('str')\n",
    "    for key in df.columns[1:]:\n",
    "       res = res + ' ' + df[key]\n",
    "    return res \n",
    "\n",
    "tf = TfidfVectorizer().fit(concat_cols(df[['id', 'prefLabel', 'subClassOf', 'superClassOf']]))\n",
    "print(tf.vocabulary_)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This gives us a vocabulary and then a lookup basis for linking our individual columns in this group of four. Note that some of these terms are quite long, since they are the product of an already concatenated identifier creation. That also makes them very strong signals for possible text matches.\n",
    "\n",
    "Nearly all of these types of machine learners first require the data to be '[fit](https://scikit-learn.org/stable/glossary.html#term-fit)' and then '[transformed](https://scikit-learn.org/stable/glossary.html#term-transform)'. Fit means to make a new numeric representation for the input data including any format or validation checks. Transform means to convert that new representation to a form most useful to the learner at hand, such as a floating decimal for a frequency value, that may also be the result of some conversion or algorithmic changes. Each machine learner has its own approaches to fit and transform, and parameters that may be set when these functions are called may tweak methods further. These approaches may be combined together into a slighly more efficient '[fit-transform](https://scikit-learn.org/stable/glossary.html#term-fit-transform)' step, which is the approach we take in this example: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import csv\n",
    "token_gen = make_column_transformer(\n",
    "    (TfidfVectorizer(vocabulary=tf.vocabulary_),'prefLabel')\n",
    ")\n",
    "tfidf_array = token_gen.fit_transform(df)\n",
    "print(tfidf_array)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So, we surmise, then, we can loop over these results sets for each of the four columns, loop over matching token IDs for each unique ID, and then to write out a simple array for each RC entry. In the case of <code>id</code> there will be a single matching token ID. For the other three columns, there is at most a few entries. This promises to provide a very efficient encoding that we can also tie into external routines as appropriate.\n",
    "\n",
    "Alas, like much else that appears simple on the face of it, what one sees when printing this output is not the data form presented when saving it to file. For example, if one does a <code>type(tfidf-array)</code> we see that the object is actually a pretty complicated data structure, a [<code>scipy.sparse.csr_matrix</code>](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html). (<code>scikit-learn</code> is itself based on [<code>SciPy</code>](https://en.wikipedia.org/wiki/SciPy), which is itself based on [<code>NumPy</code>](https://en.wikipedia.org/wiki/NumPy).) We get hints we might be working with a different data structure when we see <code>print</code> statements that produce truncated listings in terms of rows and columns. We can not do our typical tricks on this structure, like converting it to a string or list, prior to standard string processing. What we first need to do is to get it into a manipulable form, such as a <code>pandas</code> CSV form. We need to do this for each of the four columns separately:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Adapted from https://www.biostars.org/p/452028/\n",
    "\n",
    "import pandas as pd\n",
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "from sklearn.compose import make_column_transformer\n",
    "import numpy as np\n",
    "from scipy.sparse import csr_matrix\n",
    "\n",
    "df = pd.read_csv(r'C:/1-PythonProjects/kbpedia/v300/models/inputs/kbpedia_master.csv')\n",
    "out_f = r'C:/1-PythonProjects/kbpedia/v300/models/inputs/kbpedia_master_preflabel.csv'\n",
    "out_file = open(out_f, 'w', encoding='utf8')\n",
    "\n",
    "#cols = ['id', 'prefLabel', 'subClassOf', 'superClassOf']\n",
    "cols = ['prefLabel']\n",
    "\n",
    "df['altLabel'] = df['altLabel'].fillna(' ')              \n",
    "df['superClassOf'] = df['superClassOf'].fillna(' ')\n",
    "\n",
    "def concat_cols(df):\n",
    "    assert len(df.columns) >= 2\n",
    "    res = df.iloc[:, 0].astype('str')\n",
    "    for key in df.columns[1:]:\n",
    "       res = res + ' ' + df[key]\n",
    "    return res \n",
    "\n",
    "tf = TfidfVectorizer().fit(concat_cols(df[['id', 'prefLabel', 'subClassOf', 'superClassOf']]))\n",
    "\n",
    "for c in cols:                                          # Using 'prefLabel' as our example\n",
    "    c_label = str(c)\n",
    "    print(c_label)\n",
    "    token_gen = make_column_transformer(\n",
    "         (TfidfVectorizer(vocabulary=tf.vocabulary_),c_label)\n",
    "         )\n",
    "    tokens = token_gen.fit_transform(df)\n",
    "    print(tokens)\n",
    "df_2 = pd.DataFrame(data=tokens)\n",
    "df_2.to_csv(out_f, encoding='utf-8')\n",
    "print('File written and closed.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here is a sample of what such a file output looks like:\n",
    "    \n",
    "<pre>\n",
    "0,\"  (0, 66038)\t0.6551037573323365\n",
    "  (0, 42860)\t0.7555389249595651\"\n",
    "1,\"  (0, 75910)\t1.0\"\n",
    "2,\"  (0, 50502)\t1.0\"\n",
    "3,\"  (0, 55394)\t0.7704414465093152\n",
    "  (0, 53041)\t0.637510766576247\"\n",
    "4,\"  (0, 75414)\t0.4035084561691855\n",
    "  (0, 35644)\t0.5053985169471683\n",
    "  (0, 13178)\t0.47182153985726\n",
    "  (0, 9754)\t0.5992809853435092\"\n",
    "5,\"  (0, 50446)\t0.7232652656552668\n",
    "  (0, 5844)\t0.6905703117689149\"\n",
    "6,\"  (0, 41964)\t0.5266799956122452\n",
    "  (0, 12319)\t0.8500636342191596\"\n",
    "7,\"  (0, 67750)\t0.7206261499559882\n",
    "  (0, 47278)\t0.45256791509595035\n",
    "  (0, 27998)\t0.5252430239663488\"\n",
    "</pre>\n",
    "\n",
    "Inspection of the <code>scipy.sparse.csr_matrix</code> files shows that the frequency values are separated from the index and key by a tab separator, with sometimes multiples of values. This form can certainly be processed with Python, but we can also open the files as tab-delimited in a local spreadsheet, and then delete the frequency column to get a much simpler form to wrangle. Since this only takes a few minutes, we take this path. \n",
    "\n",
    "This is the basis, then, that we need to clean up for our \"simple\" vectors, reflected in this generic routine, and we slightly change our file names to account for the difference:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import csv\n",
    "import re                                 # If we want to use regex; we don't here\n",
    "\n",
    "in_f = r'C:/1-PythonProjects/kbpedia/v300/models/inputs/kbpedia_master_superclassof.csv'\n",
    "out_f = r'C:/1-PythonProjects/kbpedia/v300/models/inputs/kbpedia_master_superclassof_ok.csv'\n",
    "\n",
    "#out_file = open(out_f, 'w', encoding='utf8')\n",
    "\n",
    "cols = ['id', 'token']\n",
    "\n",
    "with open(in_f, 'r', encoding = 'utf8') as infile, open(out_f, 'w', encoding='utf8', newline='') as outfile:\n",
    "    reader = csv.reader(infile)\n",
    "    writer = csv.writer(outfile, delimiter=' ', quoting=csv.QUOTE_NONE, escapechar='\\\\')\n",
    "    new_row = []\n",
    "    for row in reader:\n",
    "        row = str(row)\n",
    "        row = row.replace('\"', '')\n",
    "        row = row.replace(\",  (0, \",  \"','\")       \n",
    "        row = row.replace(\"'  (0', ' \", \"@'\")      \n",
    "        row = row.replace(\")\", \"\")\n",
    "        row = str(row)[1 : -1]                    # A nifty for removing square brackets around a list\n",
    "        new_row.append(row)                       # Transition here because we need to find-replace across rows\n",
    "    new_row = str(new_row)\n",
    "    new_row = new_row.replace('\"', '')            # Notice how bracketing quotes change depending on target text\n",
    "    new_row = new_row.replace(\"', @'\", \",\")       # Notice how bracketing quotes change depending on target text\n",
    "    new_row = new_row.replace(\"', '\", \"'\\n'\")\n",
    "    new_row = new_row.replace(\"''\\n\", \"\")\n",
    "    print(new_row)                                                           \n",
    "    writer.writerow([new_row])\n",
    "print('Matrix conversion complete.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The basic approach is to baby step a change, review the output, and plan the next substitution. It is a hack, but it is also pretty remarkable to be able to bridge such disparate data structures. The payoff from my attempts with Python are now really paying off. I'll always be interested in figuring out more efficient ways. But the entry point to doing so is getting real stuff done.\n",
    "\n",
    "Since our major focus here is on [data wrangling](https://en.wikipedia.org/wiki/Data_wrangling), known as feature engineering in a machine learning context, it is not enough to write these routines and cursorily test outputs. Though not needed for every intermediate step, after sufficiently accumulating changes and interim files, it is advisable to visually inspect your current work product. Automated routines can easily mask edge cases that are wrong. Since we are nearly at the point of committing to the file vector representations our learners will work from, this marks a good time to manually confirm results.\n",
    "\n",
    "In inspecting the <code>kbpedia_master_preflabel_ok.csv</code> example, I found about 25 of the 58 K reference concepts with either a format or representation problem. That inspection by scrolling through the entire listing looking for visual triggers took about thirty to forty minutes. Granted, those errors are less than 0.05% of our population, but they are errors nonetheless. The inspection of the first 5 instances in comparison to the master file (using the common <code>id</code>) took another 10 minutes or so. That led me to the hypothesis that periods ('.') caused labels to be skipped and other issues related to individual character or symbol use. The actual inspection and correction of these items took perhaps another thirty to forty minutes; about 40 percent were due to periods, the others to specific symbols or characters.\n",
    "\n",
    "The <code>id</code> file checked out well. That took just a few minutes to fast scroll through the listing looking for visual flags. Another ten to fifteen minutes showed the <code>subClassOf</code> to check out as well. This file took a bit longer because every reference concept has at least one parent. \n",
    "\n",
    "However, when checking the <code>superClassOf</code> file I turned up more than 100 errors. More than the other files, it took about forty minutes to check and record the errors. I feared checking these instances to resolution would take tremendous time, but as I began inspecting my initial sample all were returning as extremely long fields. Indeed, the problem that I had been seeing that caused me to flag the entry in the intial review was a colon (':') in the listing, a conventional indicator in Python for a truncated field or listing. The introduced colon was the apparent cause of the problem in all concepts I sampled. I determined the likely maximum length of SuperClass entries to be about 240 characters. Fortunately, we already have a script in step **8. Final text revisions** to shorten fields. We clearly overlooked those 100 instances where SuperClass entries are too long. We add this filter at **Step 8** and begin to proceed to cycle through all of the routines from that point forward. It took about an hour to repeat all steps forward from there. This case validates why committing to scripts is sound practice."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**13. An alternate TF/IDF encode**\n",
    "\n",
    "Our earlier experience with TF-IDF and <code>CountVectorizer</code> was not the easiest to set up. I wanted to see if there were perhaps better or easier ways to conduct a TF-IDF analysis. We still had the <code>definition</code> and <code>altLabel</code> columns to encode. \n",
    "\n",
    "In my initial diligence I had come across a wrapper around many interesting text functions. It is called [Texthero](https://github.com/jbesomi/texthero) and it provides a consistent interface over [NLTK](https://en.wikipedia.org/wiki/Natural_Language_Toolkit), SpaCy, Gensim, [TextBlob](https://textblob.readthedocs.io/en/dev/) and sklearn. It provides common text wrangling utilities, NLP tools, vector representations, and results visualization. If your problem area falls within the scope of this tool, you will be rewarded with a very straightforward interface. However, if you need to tweak parameters or modify what comes out of the box, Texthero is likely not the tool for you.\n",
    "\n",
    "Since TF-IDF is one of its built-in capabilities, we can show the simple approach available:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "TF/IDF calculations completed.\n",
      "File written and closed.\n"
     ]
    }
   ],
   "source": [
    "import pandas as pd\n",
    "import texthero as hero\n",
    "\n",
    "df = pd.read_csv(r'C:/1-PythonProjects/kbpedia/v300/models/inputs/kbpedia_definition_bigram.csv')\n",
    "out_f = r'C:/1-PythonProjects/kbpedia/v300/models/inputs/kbpedia_master_definition.csv'\n",
    "\n",
    "#df['altLabel'] = df['altLabel'].fillna(' ') \n",
    "texts = df['def_bigram']\n",
    " \n",
    "df['def_tfidf'] = hero.tfidf(texts, min_df=1, max_features=1000)\n",
    "\n",
    "df.drop('Unnamed: 0', axis=1, inplace=True)\n",
    "df = df[['id', 'def_tfidf']]\n",
    "df.to_csv(out_f, encoding='utf-8')\n",
    "print('TF/IDF calculations completed.')\n",
    "print('File written and closed.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We apply this code to both the <code>altLabel</code> and <code>definition</code> fields with different parameters: (df=1, 500 features) and (df=3, 1000 features) for these fields, respectively. Since these routines produce large files that can not be easily viewed, we write them out with only the frequencies and the <code>id</code> field as the mapping key."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**14. Prepare doc2vec vectors from pages for master**\n",
    "\n",
    "We discussed word and document embedding in the previous installment. We're attempting to capture a vector representation of the Wikipedia page descriptions for about 31 K of the 58 K reference concepts (RCs) in current KBpedia. We found [<code>doc2vec</code>](https://radimrehurek.com/gensim/models/doc2vec.html) to be a promising approach. Our prior installment had us representing our documents with about 1000 features. We will retain these earlier settings and build on last results.\n",
    "\n",
    "We have earlier built our model, so we need to load it, read in our source files, and calculate our <code>doc2vec</code> values per input line, that is, per RC with a Wikipedia article. To also output strings readable by the next step in the pipeline, we also need to do some formatting changes, including find and replaces for line feeds and carriage returns. As we process each line, we append it to an array (which becomes a list in Python) that will update our initial records with the new calculated vectors into a new column ('doc2vec'):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from gensim.models.doc2vec import Doc2Vec, TaggedLineDocument\n",
    "import pandas as pd\n",
    "\n",
    "in_f = r'C:\\1-PythonProjects\\kbpedia\\v300\\models\\inputs\\kbpedia-pages.csv'\n",
    "out_f = r'C:\\1-PythonProjects\\kbpedia\\v300\\models\\inputs\\kbpedia-d2v.csv'\n",
    "\n",
    "df = pd.read_csv(in_f, header=0, usecols=['id', 'prefLabel','doc2vec']) \n",
    "\n",
    "src = r'C:\\1-PythonProjects\\kbpedia\\v300\\models\\results\\wikipedia-d2v.model'\n",
    "model = Doc2Vec.load(src)\n",
    "\n",
    "doc2vec = []\n",
    "\n",
    "# For each row in df.id\n",
    "for i in df['id']:\n",
    "    array = model.docvecs[i]\n",
    "    array = str(array)\n",
    "    array = array.replace('\\r', '')\n",
    "    array = array.replace('\\n', '')\n",
    "    array = array.replace('  ', ' ')\n",
    "    array = array.replace('  ', ' ')\n",
    "    count = i + 1\n",
    "    doc2vec.append(array)\n",
    "\n",
    "df['doc2vec'] = doc2vec\n",
    "\n",
    "df.to_csv(out_f)\n",
    "print('Processing done with ', count, 'records')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One of the things we notice as we process this information is that I have been inconsistent in the use of 'id', especially since it has emerged to be the primary lookup key. Looking over the efforts to date I see that sometimes 'id' is the numeric sequence ID, sometimes it is the unique URI fragment (the standard within KBpedia), and sometimes it is a Wikipedia URI fragment with underscores for spaces. The latter is the basis for the existing doc2vec files.\n",
    "\n",
    "Conformance with the original sense of 'id' means to use the URI fragment that uniquely identifies each reference concept (RC) in KBpedia. This is the common sense I want to enforce. It has to be unique within KBpedia, and therefore is a key that any contributing file should reference in order to bring its information into the system. I need to clean up the sloppiness.\n",
    "\n",
    "This need for consistency forces me to use the merge steps noted under **Step 5** above to map the canonical 'id' (the KBpedia URI fragment) to the Wikipedia IDs used in our mappings. Again, scripts are our friend and we are able to bring this pages file into compliance without problems.\n",
    "\n",
    "After this replacing, we have now completed the representation of our input information into machine learner form. It is time for us to create the vector lookup file. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Consolidate a New Vector Master File\n",
    "\n",
    "OK, we now have successfully converted all categorical and text and non-numeric information into numeric forms readable by our learners. We need to consolidate this information since it will be the starting basis for our next machine learning efforts.\n",
    "\n",
    "In a production setting, all of these individual steps and scripts would be better organized into formal programs and routines, best embedded within the the pipeline construct of your choice (though having diverse scopes, gensim, sklearn, and spaCy all have pipeline capabilities).\n",
    "\n",
    "To best stage our machine learning tests to come, I decide to create a parallel master file, only this one using vectors rather than text or categories as its column contents. We want the column structure to roughly parallel the text structure, and we also decide to keep the page and <code>altLabel</code> information separate (but readily incorporable) to keep general file sizes manageable. Again, we use <code>id</code> as our master lookup key, specifically referring to the unique URI fragment for each KBpedia RC.\n",
    "\n",
    "**15. Merge page vectors into master**\n",
    "\n",
    "Under **Step 5** above we identified and wrote a code block for merging information from two different files based on the lookup concurrence of keys between the source and target. Care is appropriately required that key matches and cardinality be respected. It is important to be sure the matching keys from different files have the right reference and format to indeed match.\n",
    "\n",
    "Since we like the idea of a master 'accumulating' file to which all contributo files map, we use the left join method of the inner merge we first described under **Step 5**, continually using the master 'accumulating' file as our left join basis. We name the vector version of our master file <code>kbpedia_master_vec.csv</code>, and we proceed to merge all prior files with vectors into it. (We will not immediately merge the doc2vec pages file since it is quite large; we will only do this merge as needed for the later analysis.)\n",
    "\n",
    "As we've noted before, we want to proceed in column order from left to right base on our column order in the earlier master. Here is the general routine with some commented out lines used on occasion to clean up the columns to be kept or their ordering: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Merge complete.\n"
     ]
    }
   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "file_a = r'C:/1-PythonProjects/kbpedia/v300/models/inputs/kbpedia_master_vec.csv'\n",
    "file_b = r'C:/1-PythonProjects/kbpedia/v300/models/inputs/kbpedia_master_cleanst.csv'\n",
    "file_out = r'C:/1-PythonProjects/kbpedia/v300/models/inputs/kbpedia_master_vec1.csv'\n",
    "\n",
    "df_a = pd.read_csv(file_a)\n",
    "\n",
    "df_b = pd.read_csv(file_b, engine='python', encoding='utf-8', error_bad_lines=False)\n",
    "\n",
    "merged_inner = pd.merge(left=df_a, right=df_b, how='outer', left_on='id_x', right_on='id')\n",
    "# Since both sources share the 'id' column name, we could skip the 'left_on' and 'right_on' parameters\n",
    "\n",
    "# What's the size of the output data?\n",
    "merged_inner.shape\n",
    "merged_inner\n",
    "\n",
    "merged_inner.to_csv(file_out)\n",
    "print('Merge complete.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After each merge, we remove the extraneous columns by writing to file the columns we want to keep. We can also directly drop columns and do other activities such as rename. We may also need to change the datatype of a column because of default parameters in the <code>pandas</code> routines.\n",
    "\n",
    "The code block below shows some of these actions, with others commented out. The basic workflow, then, is to invoke the next file, make sure our merge conditions are appropriate to the two files undergoing a merge, save to file with a different file name, inspect those results, and make further changes until the merge is clean. Then, we move on to the next item requiring incorporation and rinse and repeat."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "df = pd.read_csv(r'C:/1-PythonProjects/kbpedia/v300/models/inputs/kbpedia_master_vec1.csv')\n",
    "file_out = r'C:/1-PythonProjects/kbpedia/v300/models/inputs/kbpedia_master_vec.csv'\n",
    "\n",
    "df = df[['id_x', 'id_token', 'pref_token', 'sub_token', 'count', 'super_token', 'alt_tfidf', 'def_tfidf']]\n",
    "#df.rename(columns={'id_x': 'id'})\n",
    "#df.drop('Unnamed: 0', axis=1, inplace=True)\n",
    "#df.drop('Unnamed: 0.1', axis=1, inplace=True)\n",
    "df.info()\n",
    "df.to_csv(file_out\n",
    "print('File written.')\n",
    "df "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also readily inspect the before and after files to make sure we are getting the results we expect:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "df = pd.read_csv(r'C:/1-PythonProjects/kbpedia/v300/models/inputs/kbpedia_master_vec.csv')\n",
    "\n",
    "df.info()\n",
    "\n",
    "df "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It is important to inspect results as this process unfolds and to check each run to make sure the number of records has not grown. If it does grow, that is due to problems in the input files of some manner that is causing the merges not to be clean, which can add more rows (records) to the merged entity. For example, one problem I found were duplicate reference concepts (RCs) that varied because of differences in capitalization (especially when merging on the basis of the KBpedia URI fragments). That caused me to reach back quite a few steps to correct the input problem. I have also flagged doing a more thorough check for nomimal duplicates for the next version release of KBpedia.\n",
    "\n",
    "Other items might include punctuation or errors due to such in earlier processing steps. One of the reasons I kept the <code>id</code> field from both files in the first step of the build was to have a readable basis for checking proper registry and to identify possible problem concepts. Once the checks were complete, I could delete the extraneous <code>id</code> column.\n",
    "\n",
    "The result of this incremental merging and assembly was to create the final <code>kbpedia_master_vec.csv</code> file, which we will have much occasion to discuss in next installments. The column structure of this final vector file is:\n",
    "\n",
    "<pre>id id_token pref_token sub_token count super_token alt_tfidf def_tfidf st_AVInfo st_ActionTypes plus 66 more STs</pre>             "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Now Ready for Use\n",
    "\n",
    "We are now ready to move to the use of <code>scikit-learn</code> for real applications, which we address in the next **CWPK** installment."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Additional Documentation\n",
    "\n",
    "Here is some additional documentation the provides background to today's installment.\n",
    "\n",
    "- [50 tips for sklearn](https://www.kaggle.com/python10pm/sklearn-50-best-tips-and-tricks)\n",
    "- [Scikit-learn Tutorial: Machine Learning in Python](https://www.dataquest.io/blog/sci-kit-learn-tutorial/) is an excellent starting point to gain a basic understanding\n",
    "- [skorch](https://github.com/skorch-dev/skorch) is a scikit-learn compatible neural network library that wraps PyTorch and has scoring routines, perhaps applicable across the board\n",
    "- [Category Encoders](https://github.com/scikit-learn-contrib/category_encoders) - a set of scikit-learn-style transformers for encoding categorical variables into numeric by means of different techniques;  encoding of category or lists functions\n",
    "- The [preprocessing module](https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing) provides means to get to some level of standardized input, using scalings and transformations among other techniques \n",
    "- [Model Evaluation](https://scikit-learn.org/stable/modules/model_evaluation.html) provides an impressive number of scoring parameters and metrics\n",
    "- [pandas and scikit-learn](https://www.ritchieng.com/pandas-scikit-learn/)\n",
    "- [Lesson 4. Introduction to List Comprehensions in Python: Write More Efficient Loops](https://www.earthdatascience.org/courses/intro-to-earth-data-science/write-efficient-python-code/loops/list-comprehensions), and\n",
    "- [Curse of Dimensionality](https://builtin.com/data-science/curse-dimensionality)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " <div style=\"background-color:#ffecec; border:1px dotted #f5aca6; vertical-align:middle; margin:15px 60px; padding:8px;\"> \n",
    "  <span style=\"font-weight: bold;\">NOTE:</span> This article is part of the <a href=\"https://www.mkbergman.com/cooking-with-python-and-kbpedia/\" style=\"font-style: italic;\">Cooking with Python and KBpedia</a> series. See the <a href=\"https://www.mkbergman.com/cooking-with-python-and-kbpedia/\"><strong>CWPK</strong> listing</a> for other articles in the series. <a href=\"http://kbpedia.org/\">KBpedia</a> has its own Web site. The <em>cowpoke</em> Python <a href=\"https://github.com/Cognonto/cowpoke\">code listing covering the series</a> is also available from GitHub.\n",
    "  </div>\n",
    "\n",
    "<div style=\"background-color:#ebf8e2; border:1px dotted #71c837; vertical-align:middle; margin:15px 60px; padding:8px;\"> \n",
    "\n",
    "<span style=\"font-weight: bold;\">NOTE:</span> This <strong>CWPK \n",
    "installment</strong> is available both as an online interactive\n",
    "file <a href=\"https://mybinder.org/v2/gh/Cognonto/CWPK/master\" ><img src=\"https://mybinder.org/badge_logo.svg\" style=\"display:inline-block; vertical-align: middle;\" /></a> or as a <a href=\"https://github.com/Cognonto/CWPK\" title=\"CWPK notebook\" alt=\"CWPK notebook\">direct download</a> to use locally. Make sure and pick the correct installment number. For the online interactive option, pick the <code>*.ipynb</code> file. It may take a bit of time for the interactive option to load.</div>\n",
    "\n",
    "<div style=\"background-color:#feeedc; border:1px dotted #f7941d; vertical-align:middle; margin:15px 60px; padding:8px;\"> \n",
    "<div style=\"float: left; margin-right: 5px;\"><img src=\"http://kbpedia.org/cwpk-files/warning.png\" title=\"Caution!\" width=\"32\" /></div>I am at best an amateur with Python. There are likely more efficient methods for coding these steps than what I provide. I encourage you to experiment -- which is part of the fun of Python -- and to <a href=\"mailto:mike@mkbergman.com\">notify me</a> should you make improvements.    \n",
    "\n",
    "</div>"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}