{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# OpenRefine Client\n", "\n", "Notebook demonstrating how to control headless OpenRefine via a Python client.\n", "\n", "Use the `dbutlerdb/refine-client-py` fork of `PaulMakepeace/refine-client-py` for Python3 support.\n", "\n", "I'm not yet convinced this is sensible or that the python client is as useable as it might be? Things like `pandas` are perfectly serviceable for working with tabular data in a notebook, so why would we want to use the OpenRefine engine?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Getting Started\n", "\n", "To start, ensure that the OpenRefine application server is running. You can start it from the notebook homepage (`New -> OpenRefine Session`.\n", "\n", "The server connection is looked for on the default port 3333. This can be hardcoded as part of the `nbopenrefineder` OpneRefine start-up command." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from open.refine import refine\n", "\n", "#What is the distinction between these two?\n", "#Can we make use of it in a Jupyter context somehow?\n", "\n", "server = refine.RefineServer()\n", "orefine = refine.Refine(server)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can list any projects that currently exist (this should be empty):" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict_items([])" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "orefine.list_projects().items()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating an OpenRefine Project\n", "\n", "\n", "*It would be useful to have a simple recipe for creating a project from a pandas dataframe.*" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "#Create a simple test data file as a CSV file\n", "\n", "#via https://github.com/dbutlerdb/refine-client-py/blob/master/tests/data/duplicates.csv\n", "data = '''email,name,state,gender,purchase\n", "danny.baron@example1.com,Danny Baron,CA,M,TV\n", "melanie.white@example2.edu,Melanie White,NC,F,iPhone\n", "danny.baron@example1.com,D. Baron,CA,M,Winter jacket\n", "ben.tyler@example3.org,Ben Tyler,NV,M,Flashlight\n", "arthur.duff@example4.com,Arthur Duff,OR,M,Dining table\n", "danny.baron@example1.com,Daniel Baron,CA,M,Bike\n", "jean.griffith@example5.org,Jean Griffith,WA,F,Power drill\n", "melanie.white@example2.edu,Melanie White,NC,F,iPad\n", "ben.morisson@example6.org,Ben Morisson,FL,M,Amplifier\n", "arthur.duff@example4.com,Arthur Duff,OR,M,Night table'''\n", "\n", "\n", "fn = 'test.csv'\n", "\n", "with open(fn,'w') as f:\n", " f.write(data)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "#Create an OpenRefine project from the data file\n", "#Use the absolute path to the file\n", "p2 = orefine.new_project(project_file=os.path.abspath(fn),\n", " project_name='Test 1',\n", " project_file_name=fn)\n", "\n", "#Do we have to mediate this via a file transfer? eg could we go more directly from a pandas dataframe somehow?" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/plain": [ "{'1996034361665': {'name': 'Untitled',\n", " 'created': '2019-01-07T13:36:27Z',\n", " 'modified': '2019-01-07T13:36:27Z',\n", " 'creator': '',\n", " 'contributors': '',\n", " 'subject': '',\n", " 'description': '',\n", " 'rowCount': 0,\n", " 'customMetadata': {},\n", " 'importOptionMetadata': [{'storeBlankRows': True,\n", " 'includeFileSources': False,\n", " 'skipDataLines': 0,\n", " 'guessCellValueTypes': False,\n", " 'headerLines': 1,\n", " 'ignoreLines': -1,\n", " 'processQuotes': True,\n", " 'fileSource': 'test.csv',\n", " 'separator': ',',\n", " 'storeBlankCellsAsNulls': True}]}}" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#For some reason, the project does not appear to get named?\n", "#There also can be a delay before the index listing shows that the data has been loaded?\n", "orefine.list_projects()" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "def show_table(p):\n", " ''' Display currently selected rows in the table. '''\n", " \n", " cells = [ [col['v'] for col in row['cells']] for row in p.get_rows().rows.rows_response ]\n", " df = pd.DataFrame( cells )\n", " \n", " #The list of columns seems to include historical items\n", " #But how do we also guarantee the current one? dicts are inherently unordered?\n", " cols = ['Unnamed_{}'.format(i) for i in range(len(df.columns))]\n", " for (k,v) in sorted(p.column_order.items(), key=lambda kv: kv[1]):\n", " cols[v]=k\n", " \n", " #Set the column names guessed at - is there a better way?\n", " df.columns = cols\n", " \n", " display ( df ) #columns = [n for n in p.column_order]\n", " \n", " \n", "#How do we get the full list of column names?" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " | name | \n", "state | \n", "gender | \n", "purchase | \n", "|
---|---|---|---|---|---|
0 | \n", "danny.baron@example1.com | \n", "Danny Baron | \n", "CA | \n", "M | \n", "TV | \n", "
1 | \n", "melanie.white@example2.edu | \n", "Melanie White | \n", "NC | \n", "F | \n", "iPhone | \n", "
2 | \n", "danny.baron@example1.com | \n", "D. Baron | \n", "CA | \n", "M | \n", "Winter jacket | \n", "
3 | \n", "ben.tyler@example3.org | \n", "Ben Tyler | \n", "NV | \n", "M | \n", "Flashlight | \n", "
4 | \n", "arthur.duff@example4.com | \n", "Arthur Duff | \n", "OR | \n", "M | \n", "Dining table | \n", "
5 | \n", "danny.baron@example1.com | \n", "Daniel Baron | \n", "CA | \n", "M | \n", "Bike | \n", "
6 | \n", "jean.griffith@example5.org | \n", "Jean Griffith | \n", "WA | \n", "F | \n", "Power drill | \n", "
7 | \n", "melanie.white@example2.edu | \n", "Melanie White | \n", "NC | \n", "F | \n", "iPad | \n", "
8 | \n", "ben.morisson@example6.org | \n", "Ben Morisson | \n", "FL | \n", "M | \n", "Amplifier | \n", "
9 | \n", "arthur.duff@example4.com | \n", "Arthur Duff | \n", "OR | \n", "M | \n", "Night table | \n", "