{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# OpenRefine Client\n", "\n", "Notebook demonstrating how to control OpenRefine via a Python client.\n", "\n", "Use the `dbutlerdb/refine-client-py` fork of `PaulMakepeace/refine-client-py` for Python3 support.\n", "\n", "I'm not yet convinced this is sensible or that the python client is as useable as it might be? Things like `pandas` are perfectly serviceable for working with tabular data in a notebook, so why would we want to use the OpenRefine engine?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Getting Started\n", "\n", "To start, ensure that the OpenRefine application server is running. You can start it from the notebook homepage (`New -> OpenRefine Session`.\n", "\n", "The server connection is looked for on the default port 3333. This can be hardcoded as part of the `nbopenrefineder` OpneRefine start-up command." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "#We might have to spot these in the notebook server log - the terminal output where the notebook server is started\n", "#If we're in MyBinder, that's not much help though?\n", "#os.environ['OPENREFINE_HOST']='127.0.0.1'\n", "#os.environ['OPENREFINE_PORT']='39113'\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "from open.refine import refine\n", "\n", "#What is the distinction between these two?\n", "#Can we make use of it in a Jupyter context somehow?\n", "\n", "server = refine.RefineServer()\n", "orefine = refine.Refine(server)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can list any projects that currently exist:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict_items([])" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "orefine.list_projects().items()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating Projects\n", "\n", "Projects can be created via the API from a source data file. " ] }, { "cell_type": "code", "execution_count": 255, "metadata": {}, "outputs": [], "source": [ "%%capture\n", "try:\n", " import pandas as pd\n", "except:\n", " !pip install pandas\n", " import pandas as pd" ] }, { "cell_type": "code", "execution_count": 256, "metadata": {}, "outputs": [], "source": [ "#Create a simple test data file as a CSV file\n", "\n", "#via https://github.com/dbutlerdb/refine-client-py/blob/master/tests/data/duplicates.csv\n", "data = '''email,name,state,gender,purchase\n", "danny.baron@example1.com,Danny Baron,CA,M,TV\n", "melanie.white@example2.edu,Melanie White,NC,F,iPhone\n", "danny.baron@example1.com,D. Baron,CA,M,Winter jacket\n", "ben.tyler@example3.org,Ben Tyler,NV,M,Flashlight\n", "arthur.duff@example4.com,Arthur Duff,OR,M,Dining table\n", "danny.baron@example1.com,Daniel Baron,CA,M,Bike\n", "jean.griffith@example5.org,Jean Griffith,WA,F,Power drill\n", "melanie.white@example2.edu,Melanie White,NC,F,iPad\n", "ben.morisson@example6.org,Ben Morisson,FL,M,Amplifier\n", "arthur.duff@example4.com,Arthur Duff,OR,M,Night table'''\n", "\n", "\n", "fn = 'test.csv'\n", "\n", "with open(fn,'w') as f:\n", " f.write(data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Either copy and paste the above data into a new OpenRefine project as a *clipboard* item or load it in as a CSV data file.\n", "\n", "We should now be able to see the project via the project index list:" ] }, { "cell_type": "code", "execution_count": 257, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'1695820561517': {'name': 'clipboard',\n", " 'created': '2019-01-04T13:48:22Z',\n", " 'modified': '2019-01-04T13:48:24Z',\n", " 'creator': '',\n", " 'contributors': '',\n", " 'subject': '',\n", " 'description': '',\n", " 'rowCount': 10,\n", " 'customMetadata': {},\n", " 'importOptionMetadata': [{'guessCellValueTypes': False,\n", " 'ignoreLines': -1,\n", " 'processQuotes': True,\n", " 'fileSource': '(clipboard)',\n", " 'encoding': '',\n", " 'separator': ',',\n", " 'storeBlankCellsAsNulls': True,\n", " 'storeBlankRows': True,\n", " 'skipDataLines': 0,\n", " 'includeFileSources': False,\n", " 'headerLines': 1,\n", " 'limit': -1,\n", " 'projectName': 'clipboard'}]}}" ] }, "execution_count": 257, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#For some reason, the project does not appear to get named?\n", "orefine.list_projects()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Opening an OpenRefine Project\n", "\n", "To open an OpenRefine project, we need to get it's key value from the project list. (Note that different projects may have the same name.)" ] }, { "cell_type": "code", "execution_count": 258, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['__class__',\n", " '__delattr__',\n", " '__dict__',\n", " '__dir__',\n", " '__doc__',\n", " '__eq__',\n", " '__format__',\n", " '__ge__',\n", " '__getattribute__',\n", " '__gt__',\n", " '__hash__',\n", " '__init__',\n", " '__init_subclass__',\n", " '__le__',\n", " '__lt__',\n", " '__module__',\n", " '__ne__',\n", " '__new__',\n", " '__reduce__',\n", " '__reduce_ex__',\n", " '__repr__',\n", " '__setattr__',\n", " '__sizeof__',\n", " '__str__',\n", " '__subclasshook__',\n", " '__weakref__',\n", " 'default_options',\n", " 'get_project_id',\n", " 'get_project_name',\n", " 'list_projects',\n", " 'new_project',\n", " 'open_project',\n", " 'server',\n", " 'set_options',\n", " 'set_project_name']" ] }, "execution_count": 258, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dir(orefine)" ] }, { "cell_type": "code", "execution_count": 259, "metadata": {}, "outputs": [], "source": [ "KEY = '2165169842672' #Need to use a valid key\n", "KEY = list(orefine.list_projects().keys())[0]\n", "\n", "p = orefine.open_project(KEY)" ] }, { "cell_type": "code", "execution_count": 260, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['email', 'name', 'state', 'gender', 'purchase']" ] }, "execution_count": 260, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Inspect the column names\n", "p.columns" ] }, { "cell_type": "code", "execution_count": 261, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(['email', 'name', 'state', 'gender', 'purchase2'],\n", " {'email': 0,\n", " 'name': 1,\n", " 'state': 2,\n", " 'gender': 3,\n", " 'purchase': 4,\n", " 'purchase2': 4})" ] }, "execution_count": 261, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Rename a column\n", "\n", "col = p.columns[-1]\n", "p.rename_column(col,'{}2'.format(col))\n", "p.columns, p.column_order" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Hmmm, it looks like there's a duplicate there?*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Adding A Column\n", "\n", "OpenRefine allows you to add a column derived from one or more other columns.\n", "\n", "```\n", ".add_column(oldColumn, newColumn, expression=\"value\", column_insert_index=None, on_error='set-to-blank'])\n", "```\n", "\n", "The `value` keyword denotes using the cell values from the original column." ] }, { "cell_type": "code", "execution_count": 262, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'code': 'ok',\n", " 'historyEntry': {'id': 1546610143487,\n", " 'description': 'Create new column name2 based on column name by filling 10 rows with value',\n", " 'time': '2019-01-04T13:48:33Z'}}" ] }, "execution_count": 262, "metadata": {}, "output_type": "execute_result" } ], "source": [ "p.add_column('name', 'name2', column_insert_index=2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*How do we access the history from the API? Or do we have to create one ourselves from command responses?*" ] }, { "cell_type": "code", "execution_count": 263, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'code': 'ok',\n", " 'historyEntry': {'id': 1546610429767,\n", " 'description': 'Create new column NAME based on column name by filling 10 rows with toUppercase(value)',\n", " 'time': '2019-01-04T13:48:33Z'}}" ] }, "execution_count": 263, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Add a column based on the name\n", "p.add_column('name', 'NAME', \"toUppercase(value)\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For a list of supported GREL (Google Refine Expression Language) commands, see [here](https://github.com/OpenRefine/OpenRefine/wiki/GREL-String-Functions)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Delete A Column\n", "\n", "The `remove column` OpenRefine option doesn't appear to be supported?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Export CSV\n", "\n", "The export appears to work on the dataset with current filters set. This is distinct from the complere dataset." ] }, { "cell_type": "code", "execution_count": 264, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
emailnameNAMEname2stategenderpurchase2
0danny.baron@example1.comDanny BaronDANNY BARONDanny BaronCAMTV
1melanie.white@example2.eduMelanie WhiteMELANIE WHITEMelanie WhiteNCFiPhone
2danny.baron@example1.comD. BaronD. BAROND. BaronCAMWinter jacket
3ben.tyler@example3.orgBen TylerBEN TYLERBen TylerNVMFlashlight
4arthur.duff@example4.comArthur DuffARTHUR DUFFArthur DuffORMDining table
5danny.baron@example1.comDaniel BaronDANIEL BARONDaniel BaronCAMBike
6jean.griffith@example5.orgJean GriffithJEAN GRIFFITHJean GriffithWAFPower drill
7melanie.white@example2.eduMelanie WhiteMELANIE WHITEMelanie WhiteNCFiPad
8ben.morisson@example6.orgBen MorissonBEN MORISSONBen MorissonFLMAmplifier
9arthur.duff@example4.comArthur DuffARTHUR DUFFArthur DuffORMNight table
\n", "
" ], "text/plain": [ " email name NAME name2 \\\n", "0 danny.baron@example1.com Danny Baron DANNY BARON Danny Baron \n", "1 melanie.white@example2.edu Melanie White MELANIE WHITE Melanie White \n", "2 danny.baron@example1.com D. Baron D. BARON D. Baron \n", "3 ben.tyler@example3.org Ben Tyler BEN TYLER Ben Tyler \n", "4 arthur.duff@example4.com Arthur Duff ARTHUR DUFF Arthur Duff \n", "5 danny.baron@example1.com Daniel Baron DANIEL BARON Daniel Baron \n", "6 jean.griffith@example5.org Jean Griffith JEAN GRIFFITH Jean Griffith \n", "7 melanie.white@example2.edu Melanie White MELANIE WHITE Melanie White \n", "8 ben.morisson@example6.org Ben Morisson BEN MORISSON Ben Morisson \n", "9 arthur.duff@example4.com Arthur Duff ARTHUR DUFF Arthur Duff \n", "\n", " state gender purchase2 \n", "0 CA M TV \n", "1 NC F iPhone \n", "2 CA M Winter jacket \n", "3 NV M Flashlight \n", "4 OR M Dining table \n", "5 CA M Bike \n", "6 WA F Power drill \n", "7 NC F iPad \n", "8 FL M Amplifier \n", "9 OR M Night table " ] }, "execution_count": 264, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "from io import StringIO\n", "\n", "pd.read_csv( StringIO( p.export(export_format='csv') ) )" ] }, { "cell_type": "code", "execution_count": 265, "metadata": {}, "outputs": [], "source": [ "def show_export(p):\n", " display( pd.read_csv( StringIO( p.export(export_format='csv') ) ) )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Via https://stackoverflow.com/a/53742983/454773 it looks like there may be an `options` parmamter that allows for column selection. This is not currently supported by the Python client." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Display Table With Current Filters\n", "\n", "We can display the state of the table with current project filters set." ] }, { "cell_type": "code", "execution_count": 266, "metadata": {}, "outputs": [], "source": [ "def show_table(p):\n", " ''' Display currently selected rows in the table. '''\n", " \n", " cells = [ [col['v'] for col in row['cells']] for row in p.get_rows().rows.rows_response ]\n", " df = pd.DataFrame( cells )\n", " \n", " #The list of columns seems to include historical items\n", " #But how do we also guarantee the current one? dicts are inherently unordered?\n", " cols = ['Unnamed_{}'.format(i) for i in range(len(df.columns))]\n", " for (k,v) in sorted(p.column_order.items(), key=lambda kv: kv[1]):\n", " cols[v]=k\n", " \n", " #Set the column names guessed at - is there a better way?\n", " df.columns = cols\n", " \n", " display ( df ) #columns = [n for n in p.column_order]\n", " \n", " \n", "#How do we get the full list of column names?" ] }, { "cell_type": "code", "execution_count": 267, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
emailnameNAMEname2purchasegenderpurchase2
0danny.baron@example1.comDanny BaronCAMTVDanny BaronDANNY BARON
1melanie.white@example2.eduMelanie WhiteNCFiPhoneMelanie WhiteMELANIE WHITE
2danny.baron@example1.comD. BaronCAMWinter jacketD. BaronD. BARON
3ben.tyler@example3.orgBen TylerNVMFlashlightBen TylerBEN TYLER
4arthur.duff@example4.comArthur DuffORMDining tableArthur DuffARTHUR DUFF
5danny.baron@example1.comDaniel BaronCAMBikeDaniel BaronDANIEL BARON
6jean.griffith@example5.orgJean GriffithWAFPower drillJean GriffithJEAN GRIFFITH
7melanie.white@example2.eduMelanie WhiteNCFiPadMelanie WhiteMELANIE WHITE
8ben.morisson@example6.orgBen MorissonFLMAmplifierBen MorissonBEN MORISSON
9arthur.duff@example4.comArthur DuffORMNight tableArthur DuffARTHUR DUFF
\n", "
" ], "text/plain": [ " email name NAME name2 purchase \\\n", "0 danny.baron@example1.com Danny Baron CA M TV \n", "1 melanie.white@example2.edu Melanie White NC F iPhone \n", "2 danny.baron@example1.com D. Baron CA M Winter jacket \n", "3 ben.tyler@example3.org Ben Tyler NV M Flashlight \n", "4 arthur.duff@example4.com Arthur Duff OR M Dining table \n", "5 danny.baron@example1.com Daniel Baron CA M Bike \n", "6 jean.griffith@example5.org Jean Griffith WA F Power drill \n", "7 melanie.white@example2.edu Melanie White NC F iPad \n", "8 ben.morisson@example6.org Ben Morisson FL M Amplifier \n", "9 arthur.duff@example4.com Arthur Duff OR M Night table \n", "\n", " gender purchase2 \n", "0 Danny Baron DANNY BARON \n", "1 Melanie White MELANIE WHITE \n", "2 D. Baron D. BARON \n", "3 Ben Tyler BEN TYLER \n", "4 Arthur Duff ARTHUR DUFF \n", "5 Daniel Baron DANIEL BARON \n", "6 Jean Griffith JEAN GRIFFITH \n", "7 Melanie White MELANIE WHITE \n", "8 Ben Morisson BEN MORISSON \n", "9 Arthur Duff ARTHUR DUFF " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_table(p)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Table Manipulation\n", "\n", "How can we manipulate the data table?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Column Names and Column Order\n", "\n", "We can look up the current column names and column order as follows:" ] }, { "cell_type": "code", "execution_count": 268, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'email': 0,\n", " 'name': 1,\n", " 'state': 4,\n", " 'gender': 5,\n", " 'purchase': 4,\n", " 'purchase2': 6,\n", " 'name2': 3,\n", " 'NAME': 2}" ] }, "execution_count": 268, "metadata": {}, "output_type": "execute_result" } ], "source": [ "p.column_order" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Rearrange column order:" ] }, { "cell_type": "code", "execution_count": 269, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
emailnamestategenderpurchase2NAMEname2
0danny.baron@example1.comDanny BaronCAMTVDANNY BARONDanny Baron
1melanie.white@example2.eduMelanie WhiteNCFiPhoneMELANIE WHITEMelanie White
2danny.baron@example1.comD. BaronCAMWinter jacketD. BAROND. Baron
3ben.tyler@example3.orgBen TylerNVMFlashlightBEN TYLERBen Tyler
4arthur.duff@example4.comArthur DuffORMDining tableARTHUR DUFFArthur Duff
5danny.baron@example1.comDaniel BaronCAMBikeDANIEL BARONDaniel Baron
6jean.griffith@example5.orgJean GriffithWAFPower drillJEAN GRIFFITHJean Griffith
7melanie.white@example2.eduMelanie WhiteNCFiPadMELANIE WHITEMelanie White
8ben.morisson@example6.orgBen MorissonFLMAmplifierBEN MORISSONBen Morisson
9arthur.duff@example4.comArthur DuffORMNight tableARTHUR DUFFArthur Duff
\n", "
" ], "text/plain": [ " email name state gender purchase2 \\\n", "0 danny.baron@example1.com Danny Baron CA M TV \n", "1 melanie.white@example2.edu Melanie White NC F iPhone \n", "2 danny.baron@example1.com D. Baron CA M Winter jacket \n", "3 ben.tyler@example3.org Ben Tyler NV M Flashlight \n", "4 arthur.duff@example4.com Arthur Duff OR M Dining table \n", "5 danny.baron@example1.com Daniel Baron CA M Bike \n", "6 jean.griffith@example5.org Jean Griffith WA F Power drill \n", "7 melanie.white@example2.edu Melanie White NC F iPad \n", "8 ben.morisson@example6.org Ben Morisson FL M Amplifier \n", "9 arthur.duff@example4.com Arthur Duff OR M Night table \n", "\n", " NAME name2 \n", "0 DANNY BARON Danny Baron \n", "1 MELANIE WHITE Melanie White \n", "2 D. BARON D. Baron \n", "3 BEN TYLER Ben Tyler \n", "4 ARTHUR DUFF Arthur Duff \n", "5 DANIEL BARON Daniel Baron \n", "6 JEAN GRIFFITH Jean Griffith \n", "7 MELANIE WHITE Melanie White \n", "8 BEN MORISSON Ben Morisson \n", "9 ARTHUR DUFF Arthur Duff " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "p.reorder_columns(['email','name','state','gender','purchase2','NAME','name2'])\n", "show_export(p)" ] }, { "cell_type": "code", "execution_count": 270, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
emailname
0danny.baron@example1.comDanny Baron
1melanie.white@example2.eduMelanie White
2danny.baron@example1.comD. Baron
3ben.tyler@example3.orgBen Tyler
4arthur.duff@example4.comArthur Duff
5danny.baron@example1.comDaniel Baron
6jean.griffith@example5.orgJean Griffith
7melanie.white@example2.eduMelanie White
8ben.morisson@example6.orgBen Morisson
9arthur.duff@example4.comArthur Duff
\n", "
" ], "text/plain": [ " email name\n", "0 danny.baron@example1.com Danny Baron\n", "1 melanie.white@example2.edu Melanie White\n", "2 danny.baron@example1.com D. Baron\n", "3 ben.tyler@example3.org Ben Tyler\n", "4 arthur.duff@example4.com Arthur Duff\n", "5 danny.baron@example1.com Daniel Baron\n", "6 jean.griffith@example5.org Jean Griffith\n", "7 melanie.white@example2.edu Melanie White\n", "8 ben.morisson@example6.org Ben Morisson\n", "9 arthur.duff@example4.com Arthur Duff" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "p.reorder_columns(['email','name'])\n", "show_export(p)" ] }, { "cell_type": "code", "execution_count": 271, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
emailname
0danny.baron@example1.comDanny Baron
1melanie.white@example2.eduMelanie White
2danny.baron@example1.comD. Baron
3ben.tyler@example3.orgBen Tyler
4arthur.duff@example4.comArthur Duff
5danny.baron@example1.comDaniel Baron
6jean.griffith@example5.orgJean Griffith
7melanie.white@example2.eduMelanie White
8ben.morisson@example6.orgBen Morisson
9arthur.duff@example4.comArthur Duff
\n", "
" ], "text/plain": [ " email name\n", "0 danny.baron@example1.com Danny Baron\n", "1 melanie.white@example2.edu Melanie White\n", "2 danny.baron@example1.com D. Baron\n", "3 ben.tyler@example3.org Ben Tyler\n", "4 arthur.duff@example4.com Arthur Duff\n", "5 danny.baron@example1.com Daniel Baron\n", "6 jean.griffith@example5.org Jean Griffith\n", "7 melanie.white@example2.edu Melanie White\n", "8 ben.morisson@example6.org Ben Morisson\n", "9 arthur.duff@example4.com Arthur Duff" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "p.reorder_columns(['email','name','state','gender','purchase2','NAME','name2'])\n", "show_export(p)" ] }, { "cell_type": "code", "execution_count": 272, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['email', 'name']" ] }, "execution_count": 272, "metadata": {}, "output_type": "execute_result" } ], "source": [ "p.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Although the full set of columns aren't in the exportable view of the table, they do still exist:" ] }, { "cell_type": "code", "execution_count": 273, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'flagged': False,\n", " 'starred': False,\n", " 'cells': [{'v': 'danny.baron@example1.com'},\n", " {'v': 'Danny Baron'},\n", " {'v': 'CA'},\n", " {'v': 'M'},\n", " {'v': 'TV'},\n", " {'v': 'Danny Baron'},\n", " {'v': 'DANNY BARON'}],\n", " 'i': 0}" ] }, "execution_count": 273, "metadata": {}, "output_type": "execute_result" } ], "source": [ "p.get_rows().rows.rows_response[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So how do we reset that? We presumably need to unstep the project history?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sorting\n", "\n", "Data in the table can be sorted on one or more columns, each ascending or descending." ] }, { "cell_type": "code", "execution_count": 274, "metadata": {}, "outputs": [], "source": [ "from open.refine.facet import Sorting" ] }, { "cell_type": "code", "execution_count": 275, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'column': 'email',\n", " 'valueType': 'string',\n", " 'caseSensitive': False,\n", " 'reverse': False,\n", " 'errorPosition': 1,\n", " 'blankPosition': 2}]" ] }, "execution_count": 275, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sorting = Sorting('email')\n", "\n", "sorting.criteria" ] }, { "cell_type": "code", "execution_count": 276, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'column': 'name',\n", " 'valueType': 'string',\n", " 'caseSensitive': False,\n", " 'reverse': False,\n", " 'errorPosition': 1,\n", " 'blankPosition': 2},\n", " {'column': 'gender',\n", " 'valueType': 'string',\n", " 'caseSensitive': False,\n", " 'reverse': False,\n", " 'errorPosition': 1,\n", " 'blankPosition': 2}]" ] }, "execution_count": 276, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Sorting(['name', 'gender']).criteria" ] }, { "cell_type": "code", "execution_count": 277, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
emailnamestategenderpurchase2NAMEname2
0arthur.duff@example4.comArthur DuffORMDining tableArthur DuffARTHUR DUFF
1arthur.duff@example4.comArthur DuffORMNight tableArthur DuffARTHUR DUFF
2ben.morisson@example6.orgBen MorissonFLMAmplifierBen MorissonBEN MORISSON
3ben.tyler@example3.orgBen TylerNVMFlashlightBen TylerBEN TYLER
4danny.baron@example1.comDanny BaronCAMTVDanny BaronDANNY BARON
5danny.baron@example1.comD. BaronCAMWinter jacketD. BaronD. BARON
6danny.baron@example1.comDaniel BaronCAMBikeDaniel BaronDANIEL BARON
7jean.griffith@example5.orgJean GriffithWAFPower drillJean GriffithJEAN GRIFFITH
8melanie.white@example2.eduMelanie WhiteNCFiPhoneMelanie WhiteMELANIE WHITE
9melanie.white@example2.eduMelanie WhiteNCFiPadMelanie WhiteMELANIE WHITE
\n", "
" ], "text/plain": [ " email name state gender purchase2 \\\n", "0 arthur.duff@example4.com Arthur Duff OR M Dining table \n", "1 arthur.duff@example4.com Arthur Duff OR M Night table \n", "2 ben.morisson@example6.org Ben Morisson FL M Amplifier \n", "3 ben.tyler@example3.org Ben Tyler NV M Flashlight \n", "4 danny.baron@example1.com Danny Baron CA M TV \n", "5 danny.baron@example1.com D. Baron CA M Winter jacket \n", "6 danny.baron@example1.com Daniel Baron CA M Bike \n", "7 jean.griffith@example5.org Jean Griffith WA F Power drill \n", "8 melanie.white@example2.edu Melanie White NC F iPhone \n", "9 melanie.white@example2.edu Melanie White NC F iPad \n", "\n", " NAME name2 \n", "0 Arthur Duff ARTHUR DUFF \n", "1 Arthur Duff ARTHUR DUFF \n", "2 Ben Morisson BEN MORISSON \n", "3 Ben Tyler BEN TYLER \n", "4 Danny Baron DANNY BARON \n", "5 D. Baron D. BARON \n", "6 Daniel Baron DANIEL BARON \n", "7 Jean Griffith JEAN GRIFFITH \n", "8 Melanie White MELANIE WHITE \n", "9 Melanie White MELANIE WHITE " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "p.sorting = Sorting('email')\n", "p.reorder_rows()\n", "\n", "show_table(p)" ] }, { "cell_type": "code", "execution_count": 278, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'column': 'name', 'reverse': True, 'errorPosition': 1, 'blankPosition': 2}]" ] }, "execution_count": 278, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Seems we can be more explicit - but are the missing items (e.g. valueType, caseSensitive) important?\n", "Sorting([{'column':'name','reverse':True}]).criteria" ] }, { "cell_type": "code", "execution_count": 279, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
emailnamestategenderpurchase2NAMEname2
0arthur.duff@example4.comArthur DuffORMDining tableArthur DuffARTHUR DUFF
1arthur.duff@example4.comArthur DuffORMNight tableArthur DuffARTHUR DUFF
2ben.morisson@example6.orgBen MorissonFLMAmplifierBen MorissonBEN MORISSON
3ben.tyler@example3.orgBen TylerNVMFlashlightBen TylerBEN TYLER
4danny.baron@example1.comDanny BaronCAMTVDanny BaronDANNY BARON
5danny.baron@example1.comDaniel BaronCAMBikeDaniel BaronDANIEL BARON
6danny.baron@example1.comD. BaronCAMWinter jacketD. BaronD. BARON
7jean.griffith@example5.orgJean GriffithWAFPower drillJean GriffithJEAN GRIFFITH
8melanie.white@example2.eduMelanie WhiteNCFiPhoneMelanie WhiteMELANIE WHITE
9melanie.white@example2.eduMelanie WhiteNCFiPadMelanie WhiteMELANIE WHITE
\n", "
" ], "text/plain": [ " email name state gender purchase2 \\\n", "0 arthur.duff@example4.com Arthur Duff OR M Dining table \n", "1 arthur.duff@example4.com Arthur Duff OR M Night table \n", "2 ben.morisson@example6.org Ben Morisson FL M Amplifier \n", "3 ben.tyler@example3.org Ben Tyler NV M Flashlight \n", "4 danny.baron@example1.com Danny Baron CA M TV \n", "5 danny.baron@example1.com Daniel Baron CA M Bike \n", "6 danny.baron@example1.com D. Baron CA M Winter jacket \n", "7 jean.griffith@example5.org Jean Griffith WA F Power drill \n", "8 melanie.white@example2.edu Melanie White NC F iPhone \n", "9 melanie.white@example2.edu Melanie White NC F iPad \n", "\n", " NAME name2 \n", "0 Arthur Duff ARTHUR DUFF \n", "1 Arthur Duff ARTHUR DUFF \n", "2 Ben Morisson BEN MORISSON \n", "3 Ben Tyler BEN TYLER \n", "4 Danny Baron DANNY BARON \n", "5 Daniel Baron DANIEL BARON \n", "6 D. Baron D. BARON \n", "7 Jean Griffith JEAN GRIFFITH \n", "8 Melanie White MELANIE WHITE \n", "9 Melanie White MELANIE WHITE " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "p.sorting = Sorting(['email',{'column':'name','reverse':True}])\n", "p.reorder_rows()\n", "\n", "show_table(p)" ] }, { "cell_type": "code", "execution_count": 280, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'flagged': False,\n", " 'starred': False,\n", " 'cells': [{'v': 'arthur.duff@example4.com'},\n", " {'v': 'Arthur Duff'},\n", " {'v': 'OR'},\n", " {'v': 'M'},\n", " {'v': 'Dining table'},\n", " {'v': 'Arthur Duff'},\n", " {'v': 'ARTHUR DUFF'}],\n", " 'i': 0}" ] }, "execution_count": 280, "metadata": {}, "output_type": "execute_result" } ], "source": [ "p.get_rows().rows.rows_response[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Unset filters\n", "\n", "It looks like we can use `.engine.reset_all()` to reset filters." ] }, { "cell_type": "code", "execution_count": 281, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
emailnamestategenderpurchase2NAMEname2
0arthur.duff@example4.comArthur DuffORMDining tableArthur DuffARTHUR DUFF
1arthur.duff@example4.comArthur DuffORMNight tableArthur DuffARTHUR DUFF
2ben.morisson@example6.orgBen MorissonFLMAmplifierBen MorissonBEN MORISSON
3ben.tyler@example3.orgBen TylerNVMFlashlightBen TylerBEN TYLER
4danny.baron@example1.comDanny BaronCAMTVDanny BaronDANNY BARON
5danny.baron@example1.comDaniel BaronCAMBikeDaniel BaronDANIEL BARON
6danny.baron@example1.comD. BaronCAMWinter jacketD. BaronD. BARON
7jean.griffith@example5.orgJean GriffithWAFPower drillJean GriffithJEAN GRIFFITH
8melanie.white@example2.eduMelanie WhiteNCFiPhoneMelanie WhiteMELANIE WHITE
9melanie.white@example2.eduMelanie WhiteNCFiPadMelanie WhiteMELANIE WHITE
\n", "
" ], "text/plain": [ " email name state gender purchase2 \\\n", "0 arthur.duff@example4.com Arthur Duff OR M Dining table \n", "1 arthur.duff@example4.com Arthur Duff OR M Night table \n", "2 ben.morisson@example6.org Ben Morisson FL M Amplifier \n", "3 ben.tyler@example3.org Ben Tyler NV M Flashlight \n", "4 danny.baron@example1.com Danny Baron CA M TV \n", "5 danny.baron@example1.com Daniel Baron CA M Bike \n", "6 danny.baron@example1.com D. Baron CA M Winter jacket \n", "7 jean.griffith@example5.org Jean Griffith WA F Power drill \n", "8 melanie.white@example2.edu Melanie White NC F iPhone \n", "9 melanie.white@example2.edu Melanie White NC F iPad \n", "\n", " NAME name2 \n", "0 Arthur Duff ARTHUR DUFF \n", "1 Arthur Duff ARTHUR DUFF \n", "2 Ben Morisson BEN MORISSON \n", "3 Ben Tyler BEN TYLER \n", "4 Danny Baron DANNY BARON \n", "5 Daniel Baron DANIEL BARON \n", "6 D. Baron D. BARON \n", "7 Jean Griffith JEAN GRIFFITH \n", "8 Melanie White MELANIE WHITE \n", "9 Melanie White MELANIE WHITE " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "p.engine.reset_all()\n", "show_table(p)" ] }, { "cell_type": "code", "execution_count": 282, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
emailname
0arthur.duff@example4.comArthur Duff
1arthur.duff@example4.comArthur Duff
2ben.morisson@example6.orgBen Morisson
3ben.tyler@example3.orgBen Tyler
4danny.baron@example1.comDanny Baron
5danny.baron@example1.comDaniel Baron
6danny.baron@example1.comD. Baron
7jean.griffith@example5.orgJean Griffith
8melanie.white@example2.eduMelanie White
9melanie.white@example2.eduMelanie White
\n", "
" ], "text/plain": [ " email name\n", "0 arthur.duff@example4.com Arthur Duff\n", "1 arthur.duff@example4.com Arthur Duff\n", "2 ben.morisson@example6.org Ben Morisson\n", "3 ben.tyler@example3.org Ben Tyler\n", "4 danny.baron@example1.com Danny Baron\n", "5 danny.baron@example1.com Daniel Baron\n", "6 danny.baron@example1.com D. Baron\n", "7 jean.griffith@example5.org Jean Griffith\n", "8 melanie.white@example2.edu Melanie White\n", "9 melanie.white@example2.edu Melanie White" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "#The export filter still seems to be set though?\n", "show_export(p)" ] }, { "cell_type": "code", "execution_count": 283, "metadata": {}, "outputs": [], "source": [ "#Also need an example with numeric type" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Project history\n", "\n", "How do we explore a project's history?" ] }, { "cell_type": "code", "execution_count": 284, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('2019-01-04T13:48:39Z', 1546609838269, 'Reorder rows')" ] }, "execution_count": 284, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def show_history_item(h):\n", " return h.time, h.id,h.description\n", "\n", "show_history_item(p.history_entry)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating an OpenRefine Project\n", "\n", "*This doesn't seem to work?*\n", "\n", "*It would be useful to have a simple recipe for creating a project from a pandas dataframe.*" ] }, { "cell_type": "code", "execution_count": 285, "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "#Create an OpenRefine project from the data file\n", "#Use the absolute path to the file\n", "p2 = orefine.new_project(project_file=os.path.abspath(fn),\n", " project_name='Test 1',\n", " project_file_name=fn)\n", "\n", "#Do we have to mediate this via a file transfer? eg could we go more directly from a pandas dataframe somehow?" ] }, { "cell_type": "code", "execution_count": 286, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/plain": [ "{'1718022364826': {'name': 'Untitled',\n", " 'created': '2019-01-04T13:48:41Z',\n", " 'modified': '2019-01-04T13:48:41Z',\n", " 'creator': '',\n", " 'contributors': '',\n", " 'subject': '',\n", " 'description': '',\n", " 'rowCount': 0,\n", " 'customMetadata': {},\n", " 'importOptionMetadata': [{'storeBlankRows': True,\n", " 'includeFileSources': False,\n", " 'skipDataLines': 0,\n", " 'guessCellValueTypes': False,\n", " 'headerLines': 1,\n", " 'ignoreLines': -1,\n", " 'processQuotes': True,\n", " 'fileSource': 'test.csv',\n", " 'separator': ',',\n", " 'storeBlankCellsAsNulls': True}]},\n", " '1695820561517': {'name': 'clipboard',\n", " 'created': '2019-01-04T13:48:22Z',\n", " 'modified': '2019-01-04T13:48:40Z',\n", " 'creator': '',\n", " 'contributors': '',\n", " 'subject': '',\n", " 'description': '',\n", " 'rowCount': 10,\n", " 'customMetadata': {},\n", " 'importOptionMetadata': [{'guessCellValueTypes': False,\n", " 'ignoreLines': -1,\n", " 'processQuotes': True,\n", " 'fileSource': '(clipboard)',\n", " 'encoding': '',\n", " 'separator': ',',\n", " 'storeBlankCellsAsNulls': True,\n", " 'storeBlankRows': True,\n", " 'skipDataLines': 0,\n", " 'includeFileSources': False,\n", " 'headerLines': 1,\n", " 'limit': -1,\n", " 'projectName': 'clipboard'}]}}" ] }, "execution_count": 286, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#For some reason, the project does not appear to get named?\n", "#There also can be a delay before the index listing shows that the data has been loaded?\n", "orefine.list_projects()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Delete a Project\n", "\n", "Take care with this one — there is no warning..." ] }, { "cell_type": "code", "execution_count": 287, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 287, "metadata": {}, "output_type": "execute_result" } ], "source": [ "p2.delete()" ] }, { "cell_type": "code", "execution_count": 288, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['__class__',\n", " '__delattr__',\n", " '__dict__',\n", " '__dir__',\n", " '__doc__',\n", " '__eq__',\n", " '__format__',\n", " '__ge__',\n", " '__getattribute__',\n", " '__gt__',\n", " '__hash__',\n", " '__init__',\n", " '__init_subclass__',\n", " '__le__',\n", " '__lt__',\n", " '__module__',\n", " '__ne__',\n", " '__new__',\n", " '__reduce__',\n", " '__reduce_ex__',\n", " '__repr__',\n", " '__setattr__',\n", " '__sizeof__',\n", " '__str__',\n", " '__subclasshook__',\n", " '__weakref__',\n", " 'add_column',\n", " 'annotate_one_row',\n", " 'apply_operations',\n", " 'blank_down',\n", " 'clusterer_defaults',\n", " 'compute_clusters',\n", " 'compute_facets',\n", " 'delete',\n", " 'do_json',\n", " 'do_raw',\n", " 'edit',\n", " 'export',\n", " 'export_rows',\n", " 'fill_down',\n", " 'flag_row',\n", " 'get_models',\n", " 'get_operations',\n", " 'get_preference',\n", " 'get_reconciliation_service_by_name_or_url',\n", " 'get_reconciliation_services',\n", " 'get_rows',\n", " 'guess_types_of_column',\n", " 'json_data',\n", " 'mass_edit',\n", " 'move_column',\n", " 'project_name',\n", " 'project_url',\n", " 'reconcile',\n", " 'remove_rows',\n", " 'rename_column',\n", " 'reorder_columns',\n", " 'reorder_rows',\n", " 'split_column',\n", " 'star_row',\n", " 'text_transform',\n", " 'transpose_columns_into_rows',\n", " 'transpose_rows_into_columns',\n", " 'wait_until_idle']" ] }, "execution_count": 288, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dir(refine.RefineProject)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Facets and Clustering\n", "\n", "OpenRefine supports facetting on a column, including facet counts, and several clustering algorithms." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Facets" ] }, { "cell_type": "code", "execution_count": 289, "metadata": {}, "outputs": [], "source": [ "from open.refine import facet\n", "\n", "fr=p.compute_facets(facet.TextFacet('name'))" ] }, { "cell_type": "code", "execution_count": 290, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2 Arthur Duff\n", "2 Melanie White\n", "1 Ben Tyler\n", "1 D. Baron\n", "1 Danny Baron\n", "1 Jean Griffith\n", "1 Ben Morisson\n", "1 Daniel Baron\n" ] } ], "source": [ "facets = fr.facets[0]\n", "for k in sorted(facets.choices,\n", " key=lambda k: facets.choices[k].count,\n", " reverse=True):\n", " print(facets.choices[k].count, k)" ] }, { "cell_type": "code", "execution_count": 293, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'type': 'list',\n", " 'name': 'email',\n", " 'columnName': 'email',\n", " 'omitBlank': False,\n", " 'omitError': False,\n", " 'selectBlank': False,\n", " 'selectError': False,\n", " 'invert': False,\n", " 'expression': 'value',\n", " 'selection': [{'v': {'v': 'danny.baron@example1.com',\n", " 'l': 'danny.baron@example1.com'}},\n", " {'v': {'v': 'arthur.duff@example4.com', 'l': 'arthur.duff@example4.com'}}]}" ] }, "execution_count": 293, "metadata": {}, "output_type": "execute_result" } ], "source": [ "facet_value = facet.TextFacet(column='email',\n", " selection=['danny.baron@example1.com','arthur.duff@example4.com'])\n", "facet_value.as_dict()" ] }, { "cell_type": "code", "execution_count": 294, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
emailname
0arthur.duff@example4.comArthur Duff
1arthur.duff@example4.comArthur Duff
2ben.morisson@example6.orgBen Morisson
3ben.tyler@example3.orgBen Tyler
4danny.baron@example1.comDanny Baron
5danny.baron@example1.comDaniel Baron
6danny.baron@example1.comD. Baron
7jean.griffith@example5.orgJean Griffith
8melanie.white@example2.eduMelanie White
9melanie.white@example2.eduMelanie White
\n", "
" ], "text/plain": [ " email name\n", "0 arthur.duff@example4.com Arthur Duff\n", "1 arthur.duff@example4.com Arthur Duff\n", "2 ben.morisson@example6.org Ben Morisson\n", "3 ben.tyler@example3.org Ben Tyler\n", "4 danny.baron@example1.com Danny Baron\n", "5 danny.baron@example1.com Daniel Baron\n", "6 danny.baron@example1.com D. Baron\n", "7 jean.griffith@example5.org Jean Griffith\n", "8 melanie.white@example2.edu Melanie White\n", "9 melanie.white@example2.edu Melanie White" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "#The export doesn't appear to be affected?\n", "p.engine.add_facet(facet_value)\n", "p.compute_facets()\n", "\n", "show(p)" ] }, { "cell_type": "code", "execution_count": 295, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
emailnamestategenderpurchase2NAMEname2
0arthur.duff@example4.comArthur DuffORMDining tableArthur DuffARTHUR DUFF
1arthur.duff@example4.comArthur DuffORMNight tableArthur DuffARTHUR DUFF
2danny.baron@example1.comDanny BaronCAMTVDanny BaronDANNY BARON
3danny.baron@example1.comDaniel BaronCAMBikeDaniel BaronDANIEL BARON
4danny.baron@example1.comD. BaronCAMWinter jacketD. BaronD. BARON
\n", "
" ], "text/plain": [ " email name state gender purchase2 \\\n", "0 arthur.duff@example4.com Arthur Duff OR M Dining table \n", "1 arthur.duff@example4.com Arthur Duff OR M Night table \n", "2 danny.baron@example1.com Danny Baron CA M TV \n", "3 danny.baron@example1.com Daniel Baron CA M Bike \n", "4 danny.baron@example1.com D. Baron CA M Winter jacket \n", "\n", " NAME name2 \n", "0 Arthur Duff ARTHUR DUFF \n", "1 Arthur Duff ARTHUR DUFF \n", "2 Danny Baron DANNY BARON \n", "3 Daniel Baron DANIEL BARON \n", "4 D. Baron D. BARON " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "#But the table view is?\n", "show_table(p)" ] }, { "cell_type": "code", "execution_count": 296, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'type': 'list',\n", " 'name': 'email',\n", " 'columnName': 'email',\n", " 'omitBlank': False,\n", " 'omitError': False,\n", " 'selectBlank': False,\n", " 'selectError': False,\n", " 'invert': False,\n", " 'expression': 'value',\n", " 'selection': []}" ] }, "execution_count": 296, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Reset the facet filter\n", "facet_value.reset()\n", "facet_value.as_dict()" ] }, { "cell_type": "code", "execution_count": 300, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
emailnamestategenderpurchase2NAMEname2
0arthur.duff@example4.comArthur DuffORMDining tableArthur DuffARTHUR DUFF
1arthur.duff@example4.comArthur DuffORMNight tableArthur DuffARTHUR DUFF
2ben.morisson@example6.orgBen MorissonFLMAmplifierBen MorissonBEN MORISSON
3ben.tyler@example3.orgBen TylerNVMFlashlightBen TylerBEN TYLER
4danny.baron@example1.comDanny BaronCAMTVDanny BaronDANNY BARON
5danny.baron@example1.comDaniel BaronCAMBikeDaniel BaronDANIEL BARON
6danny.baron@example1.comD. BaronCAMWinter jacketD. BaronD. BARON
7jean.griffith@example5.orgJean GriffithWAFPower drillJean GriffithJEAN GRIFFITH
8melanie.white@example2.eduMelanie WhiteNCFiPhoneMelanie WhiteMELANIE WHITE
9melanie.white@example2.eduMelanie WhiteNCFiPadMelanie WhiteMELANIE WHITE
\n", "
" ], "text/plain": [ " email name state gender purchase2 \\\n", "0 arthur.duff@example4.com Arthur Duff OR M Dining table \n", "1 arthur.duff@example4.com Arthur Duff OR M Night table \n", "2 ben.morisson@example6.org Ben Morisson FL M Amplifier \n", "3 ben.tyler@example3.org Ben Tyler NV M Flashlight \n", "4 danny.baron@example1.com Danny Baron CA M TV \n", "5 danny.baron@example1.com Daniel Baron CA M Bike \n", "6 danny.baron@example1.com D. Baron CA M Winter jacket \n", "7 jean.griffith@example5.org Jean Griffith WA F Power drill \n", "8 melanie.white@example2.edu Melanie White NC F iPhone \n", "9 melanie.white@example2.edu Melanie White NC F iPad \n", "\n", " NAME name2 \n", "0 Arthur Duff ARTHUR DUFF \n", "1 Arthur Duff ARTHUR DUFF \n", "2 Ben Morisson BEN MORISSON \n", "3 Ben Tyler BEN TYLER \n", "4 Danny Baron DANNY BARON \n", "5 Daniel Baron DANIEL BARON \n", "6 D. Baron D. BARON \n", "7 Jean Griffith JEAN GRIFFITH \n", "8 Melanie White MELANIE WHITE \n", "9 Melanie White MELANIE WHITE " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_table(p)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Clustering\n", "\n", "Open Refine supports several clustering methods:\n", "\n", "- `clusterer_type: binning; refine_function: fingerprint|metaphone3|cologne-phonetic`\n", "- `clusterer_type: binning; refine_function: ngram-fingerprint; params: {'ngram-size': INT}`\n", "- `clusterer_type: knn; refine_function: levenshtein|ppm; params: {'radius': FLOAT,'blocking-ngram-size': INT}`" ] }, { "cell_type": "code", "execution_count": 291, "metadata": {}, "outputs": [], "source": [ "clusters=p.compute_clusters('name',\n", " clusterer_type='binning',\n", " refine_function='cologne-phonetic')\n", "for cluster in clusters:\n", " print(cluster)" ] }, { "cell_type": "code", "execution_count": 292, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[{'value': 'Danny Baron', 'count': 1}, {'value': 'Daniel Baron', 'count': 1}]\n" ] } ], "source": [ "clusters=p.compute_clusters('name',\n", " clusterer_type='knn',\n", " refine_function='levenshtein',\n", " params={'radius':3})\n", "for cluster in clusters:\n", " print(cluster)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So how do we then select and apply a correction amongst clustered values?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reconciliation\n", "\n", "Reconciliation allows column values to be \"reconciled\" with data from third party services." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "data='''Name\n", "Diane Abbott\n", "Boris Johnson\n", "Boris Johnstone\n", "Diana Abbot\n", "Boris Johnston\n", "Joanna Lumley\n", "Boris Johnstone\n", "'''\n", "\n", "fn = 'test2.csv'\n", "\n", "with open(fn,'w') as f:\n", " f.write(data)\n", " \n", "p=orefine.new_project(project_file=os.path.abspath(fn),\n", " project_name='Test 2',\n", " project_file_name=fn)" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'preview': {'width': 400,\n", " 'url': 'https://tools.wmflabs.org/openrefine-wikidata/en/preview?id={{id}}',\n", " 'height': 100},\n", " 'extend': {'property_settings': [{'default': 0,\n", " 'name': 'limit',\n", " 'label': 'Limit',\n", " 'type': 'number',\n", " 'help_text': 'Maximum number of values to return per row (0 for no limit)'},\n", " {'default': 'best',\n", " 'name': 'rank',\n", " 'label': 'Ranks',\n", " 'choices': [{'name': 'Any rank', 'value': 'any'},\n", " {'name': 'Only the best rank', 'value': 'best'},\n", " {'name': 'Preferred and normal ranks', 'value': 'no_deprecated'}],\n", " 'type': 'select',\n", " 'help_text': 'Filter statements by rank'},\n", " {'default': 'any',\n", " 'name': 'references',\n", " 'label': 'References',\n", " 'choices': [{'name': 'Any statement', 'value': 'any'},\n", " {'name': 'At least one reference', 'value': 'referenced'},\n", " {'name': 'At least one non-wiki reference', 'value': 'no_wiki'}],\n", " 'type': 'select',\n", " 'help_text': 'Filter statements by their references'},\n", " {'default': False,\n", " 'name': 'count',\n", " 'label': 'Return counts instead of values',\n", " 'type': 'checkbox',\n", " 'help_text': 'The number of values will be returned.'}],\n", " 'propose_properties': {'service_url': 'https://tools.wmflabs.org/openrefine-wikidata',\n", " 'service_path': '/en/propose_properties'}},\n", " 'view': {'url': 'https://www.wikidata.org/wiki/{{id}}'},\n", " 'ui': {'handler': 'ReconStandardServicePanel'},\n", " 'identifierSpace': 'http://www.wikidata.org/entity/',\n", " 'name': 'Wikidata (en)',\n", " 'suggest': {'property': {'service_url': 'https://tools.wmflabs.org/openrefine-wikidata',\n", " 'flyout_service_path': '/en/flyout/property?id=${id}',\n", " 'service_path': '/en/suggest/property'},\n", " 'type': {'service_url': 'https://tools.wmflabs.org/openrefine-wikidata',\n", " 'flyout_service_path': '/en/flyout/type?id=${id}',\n", " 'service_path': '/en/suggest/type'},\n", " 'entity': {'service_url': 'https://tools.wmflabs.org/openrefine-wikidata',\n", " 'flyout_service_path': '/en/flyout/entity?id=${id}',\n", " 'service_path': '/en/suggest/entity'}},\n", " 'defaultTypes': [{'name': 'entity', 'id': 'Q35120'}],\n", " 'url': 'https://tools.wmflabs.org/openrefine-wikidata/en/api',\n", " 'schemaSpace': 'http://www.wikidata.org/prop/direct/'}]" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "p.get_reconciliation_services()" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'preview': {'width': 400,\n", " 'url': 'https://tools.wmflabs.org/openrefine-wikidata/en/preview?id={{id}}',\n", " 'height': 100},\n", " 'extend': {'property_settings': [{'default': 0,\n", " 'name': 'limit',\n", " 'label': 'Limit',\n", " 'type': 'number',\n", " 'help_text': 'Maximum number of values to return per row (0 for no limit)'},\n", " {'default': 'best',\n", " 'name': 'rank',\n", " 'label': 'Ranks',\n", " 'choices': [{'name': 'Any rank', 'value': 'any'},\n", " {'name': 'Only the best rank', 'value': 'best'},\n", " {'name': 'Preferred and normal ranks', 'value': 'no_deprecated'}],\n", " 'type': 'select',\n", " 'help_text': 'Filter statements by rank'},\n", " {'default': 'any',\n", " 'name': 'references',\n", " 'label': 'References',\n", " 'choices': [{'name': 'Any statement', 'value': 'any'},\n", " {'name': 'At least one reference', 'value': 'referenced'},\n", " {'name': 'At least one non-wiki reference', 'value': 'no_wiki'}],\n", " 'type': 'select',\n", " 'help_text': 'Filter statements by their references'},\n", " {'default': False,\n", " 'name': 'count',\n", " 'label': 'Return counts instead of values',\n", " 'type': 'checkbox',\n", " 'help_text': 'The number of values will be returned.'}],\n", " 'propose_properties': {'service_url': 'https://tools.wmflabs.org/openrefine-wikidata',\n", " 'service_path': '/en/propose_properties'}},\n", " 'view': {'url': 'https://www.wikidata.org/wiki/{{id}}'},\n", " 'ui': {'handler': 'ReconStandardServicePanel'},\n", " 'identifierSpace': 'http://www.wikidata.org/entity/',\n", " 'name': 'Wikidata (en)',\n", " 'suggest': {'property': {'service_url': 'https://tools.wmflabs.org/openrefine-wikidata',\n", " 'flyout_service_path': '/en/flyout/property?id=${id}',\n", " 'service_path': '/en/suggest/property'},\n", " 'type': {'service_url': 'https://tools.wmflabs.org/openrefine-wikidata',\n", " 'flyout_service_path': '/en/flyout/type?id=${id}',\n", " 'service_path': '/en/suggest/type'},\n", " 'entity': {'service_url': 'https://tools.wmflabs.org/openrefine-wikidata',\n", " 'flyout_service_path': '/en/flyout/entity?id=${id}',\n", " 'service_path': '/en/suggest/entity'}},\n", " 'defaultTypes': [{'name': 'entity', 'id': 'Q35120'}],\n", " 'url': 'https://tools.wmflabs.org/openrefine-wikidata/en/api',\n", " 'schemaSpace': 'http://www.wikidata.org/prop/direct/'}" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "p.get_reconciliation_service_by_name_or_url('Wikidata (en)')" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'id': 'Q5', 'name': 'human', 'score': 5, 'count': 6},\n", " {'id': 'Q13442814',\n", " 'name': 'scholarly article',\n", " 'score': 1.833333333333333,\n", " 'count': 5},\n", " {'id': 'Q3305213', 'name': 'painting', 'score': 1.5, 'count': 2},\n", " {'id': 'Q5398426', 'name': 'television series', 'score': 0.5, 'count': 1},\n", " {'id': 'Q571', 'name': 'book', 'score': 0.3333333333333333, 'count': 1}]" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Get suggested types\n", "p.guess_types_of_column('Name',\n", " 'https://tools.wmflabs.org/openrefine-wikidata/en/api')" ] }, { "cell_type": "raw", "metadata": {}, "source": [ "reconciliation_config = {\n", " 'mode': 'standard-service',\n", " 'service': service['url'],\n", " 'identifierSpace': service['identifierSpace'],\n", " 'schemaSpace': service['schemaSpace'],\n", " 'type': {\n", " 'id': reconciliation_type['id'],\n", " 'name': reconciliation_type['name'],\n", " },\n", " 'autoMatch': True,\n", " 'columnDetails': [],\n", " }" ] }, { "cell_type": "raw", "metadata": {}, "source": [ "p.reconcile(column='Name',\n", " service='Wikidata (en)',\n", " reconciliation_type={'id': 'Q5', 'name': 'human'})\n", "#Throws an error looking for 'server_url' rather than 'url' in get_reconciliation_service_by_name_or_url ?" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "reconciliation_config = {\n", " 'mode': 'standard-service',\n", " 'service': 'https://tools.wmflabs.org/openrefine-wikidata/en/preview?id={{id}}',\n", " 'identifierSpace': 'http://www.wikidata.org/entity/',\n", " 'schemaSpace': 'http://www.wikidata.org/prop/direct/',\n", " 'type': {\n", " 'id': 'Q5',\n", " 'name': 'human',\n", " },\n", " 'autoMatch': True,\n", " 'columnDetails': [],\n", " }" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "p.reconcile(column='Name',\n", " service='Wikidata (en)',\n", " reconciliation_config=reconciliation_config)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Reconciliation is asynchronous - how do we check the actual response when it arrives?\n", "\n", "Code suggests call `wait_until_idle()`, but where? Perhaps requires a fork in the code to embed it in `refine.py::RefineProject().reconcile()` prior to providing the response?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.read_csv( StringIO( p.export(export_format='csv') ) )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "See also: [OpenRefine Style Reconciliation Containers](https://blog.ouseful.info/2015/02/02/openrefine-style-reconciliation-containers/). It would be good to weave reconciliation service recipes into Binderised demos, perhaps even using a [Jupyter kernel gateway powered API](https://blog.ouseful.info/2017/09/06/building-a-json-api-using-jupyer-notebooks-in-under-5-minutes/) to allow reconciliation service definitions via a notebook?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " ---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Debug strategies\n", "\n", "I need to figure some recipes for debugging in notebooks... Indeed, debugging anywhere!" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "#%%debug\n", "import pdb; pdb.set_trace()\n", "p.guess_types_of_column('Name', 'Wikidata (en)')\n", "#c" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pdb.pm()\n", "#locals()['s']\n", "#q" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.6" } }, "nbformat": 4, "nbformat_minor": 2 }