{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# OpenRefine Client\n",
    "\n",
    "Notebook demonstrating how to control headless OpenRefine via a Python client.\n",
    "\n",
    "Use the `dbutlerdb/refine-client-py` fork of `PaulMakepeace/refine-client-py` for Python3 support.\n",
    "\n",
    "I'm not yet convinced this is sensible or that the python client is as useable as it might be? Things like `pandas` are perfectly serviceable for working with tabular data in a notebook, so why would we want to use the OpenRefine engine?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Getting Started\n",
    "\n",
    "To start, ensure that the OpenRefine application server is running. You can start it from the notebook homepage (`New -> OpenRefine Session`.\n",
    "\n",
    "The server connection is looked for on the default port 3333. This can be hardcoded as part of the `nbopenrefineder` OpneRefine start-up command."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from open.refine import refine\n",
    "\n",
    "#What is the distinction between these two?\n",
    "#Can we make use of it in a Jupyter context somehow?\n",
    "\n",
    "server = refine.RefineServer()\n",
    "orefine = refine.Refine(server)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can list any projects that currently exist (this should be empty):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dict_items([])"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "orefine.list_projects().items()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Creating an OpenRefine Project\n",
    "\n",
    "\n",
    "*It would be useful to have a simple recipe for creating a project from a pandas dataframe.*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Create a simple test data file as a CSV file\n",
    "\n",
    "#via https://github.com/dbutlerdb/refine-client-py/blob/master/tests/data/duplicates.csv\n",
    "data = '''email,name,state,gender,purchase\n",
    "danny.baron@example1.com,Danny Baron,CA,M,TV\n",
    "melanie.white@example2.edu,Melanie White,NC,F,iPhone\n",
    "danny.baron@example1.com,D. Baron,CA,M,Winter jacket\n",
    "ben.tyler@example3.org,Ben Tyler,NV,M,Flashlight\n",
    "arthur.duff@example4.com,Arthur Duff,OR,M,Dining table\n",
    "danny.baron@example1.com,Daniel Baron,CA,M,Bike\n",
    "jean.griffith@example5.org,Jean Griffith,WA,F,Power drill\n",
    "melanie.white@example2.edu,Melanie White,NC,F,iPad\n",
    "ben.morisson@example6.org,Ben Morisson,FL,M,Amplifier\n",
    "arthur.duff@example4.com,Arthur Duff,OR,M,Night table'''\n",
    "\n",
    "\n",
    "fn = 'test.csv'\n",
    "\n",
    "with open(fn,'w') as f:\n",
    "    f.write(data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "#Create an OpenRefine project from the data file\n",
    "#Use the absolute path to the file\n",
    "p2 = orefine.new_project(project_file=os.path.abspath(fn),\n",
    "                      project_name='Test 1',\n",
    "                      project_file_name=fn)\n",
    "\n",
    "#Do we have to mediate this via a file transfer? eg could we go more directly from a pandas dataframe somehow?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'1996034361665': {'name': 'Untitled',\n",
       "  'created': '2019-01-07T13:36:27Z',\n",
       "  'modified': '2019-01-07T13:36:27Z',\n",
       "  'creator': '',\n",
       "  'contributors': '',\n",
       "  'subject': '',\n",
       "  'description': '',\n",
       "  'rowCount': 0,\n",
       "  'customMetadata': {},\n",
       "  'importOptionMetadata': [{'storeBlankRows': True,\n",
       "    'includeFileSources': False,\n",
       "    'skipDataLines': 0,\n",
       "    'guessCellValueTypes': False,\n",
       "    'headerLines': 1,\n",
       "    'ignoreLines': -1,\n",
       "    'processQuotes': True,\n",
       "    'fileSource': 'test.csv',\n",
       "    'separator': ',',\n",
       "    'storeBlankCellsAsNulls': True}]}}"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#For some reason, the project does not appear to get named?\n",
    "#There also can be a delay before the index listing shows that the data has been loaded?\n",
    "orefine.list_projects()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "def show_table(p):\n",
    "    ''' Display currently selected rows in the table. '''\n",
    "    \n",
    "    cells = [ [col['v'] for col in row['cells']] for row in p.get_rows().rows.rows_response ]\n",
    "    df = pd.DataFrame( cells )\n",
    "    \n",
    "    #The list of columns seems to include historical items\n",
    "    #But how do we also guarantee the current one? dicts are inherently unordered?\n",
    "    cols = ['Unnamed_{}'.format(i) for i in range(len(df.columns))]\n",
    "    for (k,v) in sorted(p.column_order.items(), key=lambda kv: kv[1]):\n",
    "        cols[v]=k\n",
    "        \n",
    "    #Set the column names guessed at - is there a better way?\n",
    "    df.columns = cols\n",
    "    \n",
    "    display ( df ) #columns = [n for n in p.column_order]\n",
    "    \n",
    "    \n",
    "#How do we get the full list of column names?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>email</th>\n",
       "      <th>name</th>\n",
       "      <th>state</th>\n",
       "      <th>gender</th>\n",
       "      <th>purchase</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>danny.baron@example1.com</td>\n",
       "      <td>Danny Baron</td>\n",
       "      <td>CA</td>\n",
       "      <td>M</td>\n",
       "      <td>TV</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>melanie.white@example2.edu</td>\n",
       "      <td>Melanie White</td>\n",
       "      <td>NC</td>\n",
       "      <td>F</td>\n",
       "      <td>iPhone</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>danny.baron@example1.com</td>\n",
       "      <td>D. Baron</td>\n",
       "      <td>CA</td>\n",
       "      <td>M</td>\n",
       "      <td>Winter jacket</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>ben.tyler@example3.org</td>\n",
       "      <td>Ben Tyler</td>\n",
       "      <td>NV</td>\n",
       "      <td>M</td>\n",
       "      <td>Flashlight</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>arthur.duff@example4.com</td>\n",
       "      <td>Arthur Duff</td>\n",
       "      <td>OR</td>\n",
       "      <td>M</td>\n",
       "      <td>Dining table</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>danny.baron@example1.com</td>\n",
       "      <td>Daniel Baron</td>\n",
       "      <td>CA</td>\n",
       "      <td>M</td>\n",
       "      <td>Bike</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>jean.griffith@example5.org</td>\n",
       "      <td>Jean Griffith</td>\n",
       "      <td>WA</td>\n",
       "      <td>F</td>\n",
       "      <td>Power drill</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>melanie.white@example2.edu</td>\n",
       "      <td>Melanie White</td>\n",
       "      <td>NC</td>\n",
       "      <td>F</td>\n",
       "      <td>iPad</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>ben.morisson@example6.org</td>\n",
       "      <td>Ben Morisson</td>\n",
       "      <td>FL</td>\n",
       "      <td>M</td>\n",
       "      <td>Amplifier</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>arthur.duff@example4.com</td>\n",
       "      <td>Arthur Duff</td>\n",
       "      <td>OR</td>\n",
       "      <td>M</td>\n",
       "      <td>Night table</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                        email           name state gender       purchase\n",
       "0    danny.baron@example1.com    Danny Baron    CA      M             TV\n",
       "1  melanie.white@example2.edu  Melanie White    NC      F         iPhone\n",
       "2    danny.baron@example1.com       D. Baron    CA      M  Winter jacket\n",
       "3      ben.tyler@example3.org      Ben Tyler    NV      M     Flashlight\n",
       "4    arthur.duff@example4.com    Arthur Duff    OR      M   Dining table\n",
       "5    danny.baron@example1.com   Daniel Baron    CA      M           Bike\n",
       "6  jean.griffith@example5.org  Jean Griffith    WA      F    Power drill\n",
       "7  melanie.white@example2.edu  Melanie White    NC      F           iPad\n",
       "8   ben.morisson@example6.org   Ben Morisson    FL      M      Amplifier\n",
       "9    arthur.duff@example4.com    Arthur Duff    OR      M    Night table"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "show_table(p2)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}