{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# OpenRefine Client\n", "\n", "Notebook demonstrating how to control headless OpenRefine via a Python client.\n", "\n", "Use the `dbutlerdb/refine-client-py` fork of `PaulMakepeace/refine-client-py` for Python3 support.\n", "\n", "I'm not yet convinced this is sensible or that the python client is as useable as it might be? Things like `pandas` are perfectly serviceable for working with tabular data in a notebook, so why would we want to use the OpenRefine engine?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Getting Started\n", "\n", "To start, ensure that the OpenRefine application server is running. You can start it from the notebook homepage (`New -> OpenRefine Session`.\n", "\n", "The server connection is looked for on the default port 3333. This can be hardcoded as part of the `nbopenrefineder` OpneRefine start-up command." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from open.refine import refine\n", "\n", "#What is the distinction between these two?\n", "#Can we make use of it in a Jupyter context somehow?\n", "\n", "server = refine.RefineServer()\n", "orefine = refine.Refine(server)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can list any projects that currently exist (this should be empty):" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict_items([])" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "orefine.list_projects().items()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating an OpenRefine Project\n", "\n", "\n", "*It would be useful to have a simple recipe for creating a project from a pandas dataframe.*" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "#Create a simple test data file as a CSV file\n", "\n", "#via https://github.com/dbutlerdb/refine-client-py/blob/master/tests/data/duplicates.csv\n", "data = '''email,name,state,gender,purchase\n", "danny.baron@example1.com,Danny Baron,CA,M,TV\n", "melanie.white@example2.edu,Melanie White,NC,F,iPhone\n", "danny.baron@example1.com,D. Baron,CA,M,Winter jacket\n", "ben.tyler@example3.org,Ben Tyler,NV,M,Flashlight\n", "arthur.duff@example4.com,Arthur Duff,OR,M,Dining table\n", "danny.baron@example1.com,Daniel Baron,CA,M,Bike\n", "jean.griffith@example5.org,Jean Griffith,WA,F,Power drill\n", "melanie.white@example2.edu,Melanie White,NC,F,iPad\n", "ben.morisson@example6.org,Ben Morisson,FL,M,Amplifier\n", "arthur.duff@example4.com,Arthur Duff,OR,M,Night table'''\n", "\n", "\n", "fn = 'test.csv'\n", "\n", "with open(fn,'w') as f:\n", " f.write(data)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "#Create an OpenRefine project from the data file\n", "#Use the absolute path to the file\n", "p2 = orefine.new_project(project_file=os.path.abspath(fn),\n", " project_name='Test 1',\n", " project_file_name=fn)\n", "\n", "#Do we have to mediate this via a file transfer? eg could we go more directly from a pandas dataframe somehow?" 
] }, { "cell_type": "code", "execution_count": 11, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/plain": [ "{'1996034361665': {'name': 'Untitled',\n", " 'created': '2019-01-07T13:36:27Z',\n", " 'modified': '2019-01-07T13:36:27Z',\n", " 'creator': '',\n", " 'contributors': '',\n", " 'subject': '',\n", " 'description': '',\n", " 'rowCount': 0,\n", " 'customMetadata': {},\n", " 'importOptionMetadata': [{'storeBlankRows': True,\n", " 'includeFileSources': False,\n", " 'skipDataLines': 0,\n", " 'guessCellValueTypes': False,\n", " 'headerLines': 1,\n", " 'ignoreLines': -1,\n", " 'processQuotes': True,\n", " 'fileSource': 'test.csv',\n", " 'separator': ',',\n", " 'storeBlankCellsAsNulls': True}]}}" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#For some reason, the project does not appear to get named?\n", "#There also can be a delay before the index listing shows that the data has been loaded?\n", "orefine.list_projects()" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "def show_table(p):\n", " ''' Display currently selected rows in the table. '''\n", " \n", " cells = [ [col['v'] for col in row['cells']] for row in p.get_rows().rows.rows_response ]\n", " df = pd.DataFrame( cells )\n", " \n", " #The list of columns seems to include historical items\n", " #But how do we also guarantee the current one? dicts are inherently unordered?\n", " cols = ['Unnamed_{}'.format(i) for i in range(len(df.columns))]\n", " for (k,v) in sorted(p.column_order.items(), key=lambda kv: kv[1]):\n", " cols[v]=k\n", " \n", " #Set the column names guessed at - is there a better way?\n", " df.columns = cols\n", " \n", " display ( df ) #columns = [n for n in p.column_order]\n", " \n", " \n", "#How do we get the full list of column names?" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
emailnamestategenderpurchase
0danny.baron@example1.comDanny BaronCAMTV
1melanie.white@example2.eduMelanie WhiteNCFiPhone
2danny.baron@example1.comD. BaronCAMWinter jacket
3ben.tyler@example3.orgBen TylerNVMFlashlight
4arthur.duff@example4.comArthur DuffORMDining table
5danny.baron@example1.comDaniel BaronCAMBike
6jean.griffith@example5.orgJean GriffithWAFPower drill
7melanie.white@example2.eduMelanie WhiteNCFiPad
8ben.morisson@example6.orgBen MorissonFLMAmplifier
9arthur.duff@example4.comArthur DuffORMNight table
\n", "
" ], "text/plain": [ " email name state gender purchase\n", "0 danny.baron@example1.com Danny Baron CA M TV\n", "1 melanie.white@example2.edu Melanie White NC F iPhone\n", "2 danny.baron@example1.com D. Baron CA M Winter jacket\n", "3 ben.tyler@example3.org Ben Tyler NV M Flashlight\n", "4 arthur.duff@example4.com Arthur Duff OR M Dining table\n", "5 danny.baron@example1.com Daniel Baron CA M Bike\n", "6 jean.griffith@example5.org Jean Griffith WA F Power drill\n", "7 melanie.white@example2.edu Melanie White NC F iPad\n", "8 ben.morisson@example6.org Ben Morisson FL M Amplifier\n", "9 arthur.duff@example4.com Arthur Duff OR M Night table" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_table(p2)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.6" } }, "nbformat": 4, "nbformat_minor": 2 }