{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Using the openrefine-client in a Python 2 environment" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preparations\n", "\n", "First, we need a running OpenRefine server and the openrefine-client installed." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Option 1: binder\n", "\n", "This [binder](https://github.com/betatim/openrefineder) has OpenRefine, the openrefine-client and a Jupyter server proxy preinstalled. OpenRefine should be listening on the default port 3333, and the GUI should be available at the URL path `/openrefine`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "# Only run on the binder; .get() avoids a KeyError where HOSTNAME is not set\n", "if 'openrefineder' in os.environ.get('HOSTNAME', ''):\n", "    notebook = !jupyter notebook list | grep -o -E 'http\\S+'\n", "    # Build a relative link to the proxied OpenRefine GUI\n", "    openrefine_url = notebook[0].replace('?token', 'openrefine?token')\n", "    openrefine_url = openrefine_url.replace('http://0.0.0.0:8888','')\n", "    from IPython.display import display, HTML\n", "    display(HTML('<a href=\"{}\" target=\"_blank\">Click here to open OpenRefine</a>'.format(openrefine_url)))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Option 2: Local environment\n", "\n", "Ensure you have an OpenRefine server running. Then install the OpenRefine client as follows." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "pip install openrefine-client\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create a directory\n", "\n", "We will store some files, so it is cleaner to work in a new folder." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os, datetime\n", "# Timestamped working directory below the home directory\n", "path = os.path.join(os.path.expanduser('~'), datetime.datetime.now().strftime('%Y%m%d_%H%M%S'))\n", "try:\n", "    os.mkdir(path)\n", "    os.chdir(path)\n", "except OSError:\n", "    print(\"Creation of the directory %s failed\" % path)\n", "else:\n", "    print(os.getcwd())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Import module" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from google.refine import cli" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create project\n", "\n", "Download sample data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cli.download('https://git.io/fj5hF','duplicates.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Import the file into OpenRefine (and store the returned project)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "p1 = cli.create('duplicates.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## List all projects" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cli.ls()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Show project metadata" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cli.info(p1.project_id)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Export project to terminal" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cli.export(p1.project_id)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Apply rules from JSON file\n", "\n", "Download a sample JSON file (its content was previously extracted via the Undo/Redo history in the OpenRefine graphical user interface)" ] }, { "cell_type": "code", 
"execution_count": null, "metadata": {}, "outputs": [], "source": [ "cli.download('https://git.io/fj5ju','duplicates-deletion.json')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Apply transformations rules" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "cli.apply(p1.project_id, 'duplicates-deletion.json')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Export project to terminal again" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "cli.export(p1.project_id)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Export project to file\n", "\n", "Export data in Excel (.xls) format" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cli.export(p1.project_id, 'deduped.xls')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Delete project" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cli.delete(p1.project_id)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Advanced templating\n", "\n", "Create another project from the example file above" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "p2 = cli.create('duplicates.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following example code will export the columns \"name\" and \"purchase\" in JSON format from the project \"advanced\" for rows matching the regex text filter ^F$ in column \"gender\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cli.templating(p2.project_id,\n", "prefix='''{ \"events\" : [\n", "''',\n", "template=' { \"name\" : {{jsonize(cells[\"name\"].value)}}, \"purchase\" : {{jsonize(cells[\"purchase\"].value)}} }',\n", "rowSeparator=''',\n", "''',\n", "suffix='''\n", "] }''',\n", 
"filterQuery='^F$',\n", "filterColumn='gender')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is also an option to store the results in multiple files. Each file will contain the prefix, an processed row, and the suffix." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cli.templating(p2.project_id,\n", "prefix='''{ \"events\" : [\n", "''',\n", "template=' { \"name\" : {{jsonize(cells[\"name\"].value)}}, \"purchase\" : {{jsonize(cells[\"purchase\"].value)}} }',\n", "rowSeparator=''',\n", "''',\n", "suffix='''\n", "] }''',\n", "filterQuery='^F$',\n", "filterColumn='gender',\n", "output_file='advanced.json',\n", "splitToFiles=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Filenames are suffixed with the row number by default (e.g. `advanced_1.json`, `advanced_2.json` etc.). There is another option to use the value in the first column instead:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cli.templating(p2.project_id,\n", "prefix='''{ \"events\" : [\n", "''',\n", "template=' { \"name\" : {{jsonize(cells[\"name\"].value)}}, \"purchase\" : {{jsonize(cells[\"purchase\"].value)}} }',\n", "rowSeparator=''',\n", "''',\n", "suffix='''\n", "] }''',\n", "filterQuery='^F$',\n", "filterColumn='gender',\n", "output_file='advanced.json',\n", "splitToFiles=True,\n", "suffixById=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check the results in the current directory" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "os.listdir(os.getcwd())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because our project \"advanced\" contains duplicates in the first column \"email\" this command will overwrite files (e.g. `advanced_melanie.white@example2.edu.json`). When using this option, the first column should contain unique identifiers." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Delete project" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cli.delete(p2.project_id)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Getting help" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "help(cli)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The client and server can run on different machines. The host and port of the OpenRefine server can be specified as follows:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cli.refine.REFINE_HOST = 'localhost'\n", "cli.refine.REFINE_PORT = '3333'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Please file an [issue](https://github.com/opencultureconsulting/openrefine-client/issues) if you are missing a feature in the command line interface or if you have found a bug. You are also welcome to ask any questions!" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.15" } }, "nbformat": 4, "nbformat_minor": 2 }