{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# HKBU Library 2019 Workshop @ HKBU \n", "## Ep. 1 A flashtalk on data processing \n", "\n", "- **Date & Venue**: 10 April 2019, 4/F HKBU Library \n", "\n", "- **Facilitator**: Dr. Xinzhi Zhang (JOUR, Hong Kong Baptist University, [MSc in AI & Digital Media](http://comd.hkbu.edu.hk/masters/en/aidm))\n", " - xzzhang2@gmail.com || http://drxinzhizhang.com/ || https://github.com/xzzhang2 || @xin_zhi_zhang || https://scholar.google.com.hk/citations?user=iOFeIDIAAAAJ&hl=en \n", "\n", "- **Workshop Outcomes**\n", " 1. Installing Python\n", " 2. --> **Data processing** <--\n", " 3. Importing and knowing your data \n", " 0. [Pandas](https://pandas.pydata.org/), [Matplotlib](https://matplotlib.org/), and [Seaborn](https://seaborn.pydata.org/)\n", " 1. Getting the attributes of the your data\n", " 2. Case selection \n", " 3. Basic statistics \n", " 4. Pivot table \n", " 4. Data exploration: exploring the data by visualization \n", " 1. Univariate (unidimensional) and bivariate (two-way) data visualization\n", " 2. Multivariate (multidimensional) problem-driven data visualization and data-driven exploration\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Notes: The codes in this notebook are modified from various sources, including the [official tutorial](http://pandas.pydata.org/pandas-docs/version/0.15/10min.html), [tutorial 01](https://realpython.com/python-data-cleaning-numpy-pandas/), and [this one](https://computational-communication.com/python-data-cleaning/). All codes and data files demonstrated here are for educational purposes only and released under the MIT licence. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Importing the raw data \n", "\n", "in this notebook, we will cover some commonly used data cleaning steps. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pandas as pd " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# the raw csv data file is from here: https://github.com/realpython/python-data-cleaning/tree/master/Datasets \n", "\n", "df = pd.read_csv('data/BL-Flickr-Images-Book_2.csv')\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.shape \n", "# (the number of cases/observations, the number of variables)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.info()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.columns" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# can drop some useless columns (variables)\n", "to_drop = ['Edition Statement',\n", " 'Corporate Author',\n", " 'Corporate Contributors',\n", " 'Former owner',\n", " 'Engraver',\n", " 'Contributors',\n", " 'Issuance type',\n", " 'Shelfmarks']\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df.drop(to_drop, inplace=True, axis=1) # axis=0 along the rows (namely, index in pandas), and axis=1 along the columns.\n", "# do the same thing: df.drop(columns=to_drop, inplace=True) " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.columns # now you can see that the columns are dropped " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# whether the records are unique \n", "df['Identifier'].is_unique" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# set a new index \n", "df = df.set_index('Identifier')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# identify a place if having the identifier \n", "# loc = location-based indexing \n", "df.loc[472]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cleaning the numerical columns " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Regular Expression\n", "- A regular expression, also referred to as “regex” or “regexp”, provides a concise and flexible means for matching strings of text, such as particular characters, words, or patterns of characters. A regular expression is written in a formal language that can be interpreted by a regular expression processor.\n", "\n", "Resources: \n", "1. a general introduction: https://regexr.com/\n", "2. a quick start: https://www.regular-expressions.info/quickstart.html\n", "2. test your code: https://regex101.com/\n", "4. 
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.loc[1800:, 'Date of Publication'].head(10)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# a regular expression that captures four digits at the start of the string (the publication year) \n", "regex = r'^(\\d{4})'" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# about string extraction: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.extract.html \n", "extr = df['Date of Publication'].str.extract(r'^(\\d{4})', expand=False) \n", "extr.head()" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df['Date of Publication'] = pd.to_numeric(extr)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df['Date of Publication'].dtype" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df['Date of Publication']" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df['Date of Publication'].isnull().sum()  # how many missing values? " ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.isnull().sum()  # missing values in the entire dataset " ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Cleaning the strings " ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df['Place of Publication'].head(10)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(df.loc[4157862])\n", "print('---------------------------- another case ----------------------------') \n", "print(df.loc[4159587])" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df['Place of Publication'].unique()" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "place_df = df.groupby('Place of Publication').size()\n", "for k in place_df.index:\n", "    print(k, place_df[k])" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# let's take a look at London \n", "london_pub = []\n", "for i in df['Place of Publication']:\n", "    if 'London' in i:\n", "        london_pub.append(True)\n", "    else:\n", "        london_pub.append(False)\n", "\n", "# assign through .loc with the boolean mask (avoids the chained-assignment warning)\n", "df.loc[london_pub, 'Place of Publication'] = 'London'" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df['Place of Publication']" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "Newcastle_pub = df['Place of Publication'].isin(['Newcastle-upon-Tyne', 'Newcastle upon Tyne'])\n", "df.loc[Newcastle_pub, 'Place of Publication'] = 'Newcastle' " ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] },
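{ "cell_type": "markdown", "metadata": {}, "source": [ "*(Added sketch, not part of the original workshop materials.)* The `london_pub` mask above can also be built without an explicit loop, using the vectorized string method `Series.str.contains`; the next original cell applies the same idea, combined with `np.where`, to the Oxford records." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# a sketch (added; not part of the original workshop code):\n", "# str.contains builds the same kind of boolean mask as the loop above, element-wise\n", "london_mask = df['Place of Publication'].str.contains('London', na=False)\n", "print(london_mask.sum(), 'rows mention London')\n", "\n", "# value_counts() tallies each distinct value, sorted by frequency -- a quick check on the cleanup\n", "df['Place of Publication'].value_counts().head(10)" ] },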
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import numpy as np \n", "\n", "pub = df['Place of Publication']\n", "oxford_pub = pub.str.contains('Oxford')\n", "# np.where(condition, value_if_true, value_if_false) works element-wise:\n", "# rows mentioning Oxford become 'Oxford'; in all other rows hyphens are replaced by spaces\n", "df['Place of Publication'] = np.where(oxford_pub, 'Oxford', \n", "                                      pub.str.replace('-', ' '))" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df['Place of Publication']" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.head(20)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1" }, "toc": { "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 1 }