{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# HKBU Library 2019 Workshop @ HKBU \n", "## Ep. 1 A flashtalk on data processing \n", "\n", "- **Date & Venue**: 10 April 2019, 4/F HKBU Library \n", "\n", "- **Facilitator**: Dr. Xinzhi Zhang (JOUR, Hong Kong Baptist University, [MSc in AI & Digital Media](http://comd.hkbu.edu.hk/masters/en/aidm))\n", " - xzzhang2@gmail.com || http://drxinzhizhang.com/ || https://github.com/xzzhang2 || @xin_zhi_zhang || https://scholar.google.com.hk/citations?user=iOFeIDIAAAAJ&hl=en \n", "\n", "- **Workshop Outcomes**\n", " 1. Installing Python\n", " 2. --> **Data processing** <--\n", " 3. Importing and knowing your data \n", " 0. [Pandas](https://pandas.pydata.org/), [Matplotlib](https://matplotlib.org/), and [Seaborn](https://seaborn.pydata.org/)\n", " 1. Getting the attributes of the your data\n", " 2. Case selection \n", " 3. Basic statistics \n", " 4. Pivot table \n", " 4. Data exploration: exploring the data by visualization \n", " 1. Univariate (unidimensional) and bivariate (two-way) data visualization\n", " 2. Multivariate (multidimensional) problem-driven data visualization and data-driven exploration\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Notes: The codes in this notebook are modified from various sources, including the [official tutorial](http://pandas.pydata.org/pandas-docs/version/0.15/10min.html), [tutorial 01](https://realpython.com/python-data-cleaning-numpy-pandas/), and [this one](https://computational-communication.com/python-data-cleaning/). All codes and data files demonstrated here are for educational purposes only and released under the MIT licence. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Importing the raw data \n", "\n", "in this notebook, we will cover some commonly used data cleaning steps. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pandas as pd " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# the raw csv data file is from here: https://github.com/realpython/python-data-cleaning/tree/master/Datasets \n", "\n", "df = pd.read_csv('data/BL-Flickr-Images-Book_2.csv')\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.shape \n", "# (the number of cases/observations, the number of variables)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.info()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.columns" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# can drop some useless columns (variables)\n", "to_drop = ['Edition Statement',\n", " 'Corporate Author',\n", " 'Corporate Contributors',\n", " 'Former owner',\n", " 'Engraver',\n", " 'Contributors',\n", " 'Issuance type',\n", " 'Shelfmarks']\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df.drop(to_drop, inplace=True, axis=1) # axis=0 along the rows (namely, index in pandas), and axis=1 along the columns.\n", "# do the same thing: df.drop(columns=to_drop, inplace=True) " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.columns # now you can see that the columns are dropped " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# whether the records are unique \n", "df['Identifier'].is_unique" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# set a new index \n", "df = df.set_index('Identifier')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# identify a place if having the identifier \n", "# loc = location-based indexing \n", "df.loc[472]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cleaning the numerical columns " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Regular Expression\n", "- A regular expression, also referred to as “regex” or “regexp”, provides a concise and flexible means for matching strings of text, such as particular characters, words, or patterns of characters. A regular expression is written in a formal language that can be interpreted by a regular expression processor.\n", "\n", "Resources: \n", "1. a general introduction: https://regexr.com/\n", "2. a quick start: https://www.regular-expressions.info/quickstart.html\n", "2. test your code: https://regex101.com/\n", "4. 
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.loc[1800:, 'Date of Publication'].head(10)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# a regular expression that captures four digits at the start of the string (the publication year) \n", "regex = r'^(\\d{4})'" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# about string extraction: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.extract.html \n", "extr = df['Date of Publication'].str.extract(r'^(\\d{4})', expand=False) \n", "extr.head()" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df['Date of Publication'] = pd.to_numeric(extr)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df['Date of Publication'].dtype" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df['Date of Publication']" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df['Date of Publication'].isnull().sum()  # how many missing values? " ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.isnull().sum()  # missing values in the entire dataset " ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Cleaning the strings " ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df['Place of Publication'].head(10)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(df.loc[4157862])\n", "print('---------------------------- another case ----------------------------') \n", "print(df.loc[4159587])" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df['Place of Publication'].unique()" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "place_df = df.groupby('Place of Publication').size()\n", "for k in place_df.index:\n", "    print(k, place_df[k])" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# let's take a look at London \n", "london_pub = []\n", "for i in df['Place of Publication']:\n", "    if 'London' in i:\n", "        london_pub.append(True)\n", "    else:\n", "        london_pub.append(False)\n", "\n", "# assign through .loc with the boolean mask (avoids the chained-assignment warning)\n", "df.loc[london_pub, 'Place of Publication'] = 'London'" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df['Place of Publication']" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "Newcastle_pub = df['Place of Publication'].isin(['Newcastle-upon-Tyne', 'Newcastle upon Tyne'])\n", "df.loc[Newcastle_pub, 'Place of Publication'] = 'Newcastle' " ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] },
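{ "cell_type": "markdown", "metadata": {}, "source": [ "*(Added sketch, not part of the original workshop materials.)* The `london_pub` mask above can also be built without an explicit loop, using the vectorized string method `Series.str.contains`; the next original cell applies the same idea, combined with `np.where`, to the Oxford records." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# a sketch (added; not part of the original workshop code):\n", "# str.contains builds the same kind of boolean mask as the loop above, element-wise\n", "london_mask = df['Place of Publication'].str.contains('London', na=False)\n", "print(london_mask.sum(), 'rows mention London')\n", "\n", "# value_counts() tallies each distinct value, sorted by frequency -- a quick check on the cleanup\n", "df['Place of Publication'].value_counts().head(10)" ] },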
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import numpy as np \n", "\n", "pub = df['Place of Publication']\n", "oxford_pub = pub.str.contains('Oxford')\n", "# np.where(condition, value_if_true, value_if_false) works element-wise:\n", "# rows mentioning Oxford become 'Oxford'; in all other rows hyphens are replaced by spaces\n", "df['Place of Publication'] = np.where(oxford_pub, 'Oxford', \n", "                                      pub.str.replace('-', ' '))" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df['Place of Publication']" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.head(20)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1" }, "toc": { "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 1 }