{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Using the openrefine-client in a Linux Bash environment" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preparations\n", "\n", "First we need an OpenRefine server running and the openrefine-client installed." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Option 1: binder\n", "\n", "This [binder](https://github.com/betatim/openrefineder) has OpenRefine, the openrefine-client and a Jupyter server proxy preinstalled. OpenRefine should be listening on default port 3333 and the GUI should be available at the urlpath `/openrefine`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Option 2: Local environment\n", "\n", "Ensure you have an OpenRefine server running. Then install the OpenRefine client as follows." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "wget -nv https://github.com/opencultureconsulting/openrefine-client/releases/download/v0.3.10/openrefine-client_0-3-10_linux -O ~/.local/bin/openrefine-client\n", "chmod +x ~/.local/bin/openrefine-client" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create a directory\n", "\n", "We will store some files so it is clearer to use a new folder." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "workspace=$(date +%Y%m%d_%H%M%S)\n", "mkdir -p ~/$workspace && cd ~/$workspace && pwd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create project\n", "\n", "Download sample data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "openrefine-client --download \"https://git.io/fj5hF\" --output=duplicates.csv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Import file into OpenRefine" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "openrefine-client --create duplicates.csv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## List all projects" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "openrefine-client --list" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Show project metadata" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "openrefine-client --info \"duplicates\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Export project to terminal" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "openrefine-client --export \"duplicates\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Apply rules from json file\n", "\n", "Download sample json file (the content of this file was previously extracted via Undo/Redo history in the OpenRefine graphical user interface)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "openrefine-client --download \"https://git.io/fj5ju\" --output=duplicates-deletion.json" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Apply transformations rules" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "openrefine-client --apply duplicates-deletion.json \"duplicates\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Export project to terminal again" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "openrefine-client --export \"duplicates\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Export project to file\n", "\n", "Export data in Excel (.xls) format" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "openrefine-client --export \"duplicates\" --output deduped.xls" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Delete project" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "openrefine-client --delete \"duplicates\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Advanced templating\n", "\n", "Create another project from the example file above" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "openrefine-client --create duplicates.csv --projectName=advanced" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following example code will export the columns \"name\" and \"purchase\" in JSON format from the project \"advanced\" for rows matching the regex text filter ^F$ in column \"gender\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "openrefine-client \"advanced\" \\\n", "--prefix='{ \"events\" : [\n", "' \\\n", "--template=' { \"name\" : {{jsonize(cells[\"name\"].value)}}, \"purchase\" : {{jsonize(cells[\"purchase\"].value)}} }' \\\n", "--rowSeparator=',\n", "' \\\n", "--suffix='\n", "] }' \\\n", "--filterQuery='^F$' \\\n", "--filterColumn='gender'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is also an option to store the results in multiple files. Each file will contain the prefix, an processed row, and the suffix." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "openrefine-client \"advanced\" \\\n", "--prefix='{ \"events\" : [\n", "' \\\n", "--template=' { \"name\" : {{jsonize(cells[\"name\"].value)}}, \"purchase\" : {{jsonize(cells[\"purchase\"].value)}} }' \\\n", "--rowSeparator=',\n", "' \\\n", "--suffix='\n", "] }' \\\n", "--filterQuery='^F$' \\\n", "--filterColumn='gender' \\\n", "--output=advanced.json \\\n", "--splitToFiles=true" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Filenames are suffixed with the row number by default (e.g. `advanced_1.json`, `advanced_2.json` etc.). There is another option to use the value in the first column instead:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "openrefine-client \"advanced\" \\\n", "--prefix='{ \"events\" : [\n", "' \\\n", "--template=' { \"name\" : {{jsonize(cells[\"name\"].value)}}, \"purchase\" : {{jsonize(cells[\"purchase\"].value)}} }' \\\n", "--rowSeparator=',\n", "' \\\n", "--suffix='\n", "] }' \\\n", "--filterQuery='^F$' \\\n", "--filterColumn='gender' \\\n", "--output=advanced.json \\\n", "--splitToFiles=true \\\n", "--suffixById=true" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check the results in the current directory." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ls" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because our project \"advanced\" contains duplicates in the first column \"email\" this command will overwrite files (e.g. `advanced_melanie.white@example2.edu.json`). When using this option, the first column should contain unique identifiers." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Delete project" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "openrefine-client --delete \"advanced\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Getting help" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "openrefine-client --help" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The [openrefine-client](https://github.com/opencultureconsulting/openrefine-client) is available as a one file executable for Windows, Mac OS and Linux. Client and server can be executed on different machines (host and port of the OpenRefine server can be specified, e.g. `-H 127.0.0.1 -P 80`).\n", "\n", "Please file an [issue](https://github.com/opencultureconsulting/openrefine-client/issues) if you miss some features in the command line interface or if you have tracked a bug. And you are welcome to ask any questions!" ] } ], "metadata": { "kernelspec": { "display_name": "Bash", "language": "bash", "name": "bash" }, "language_info": { "codemirror_mode": "shell", "file_extension": ".sh", "mimetype": "text/x-sh", "name": "bash" } }, "nbformat": 4, "nbformat_minor": 2 }