{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Introduction\n",
    "\n",
    "In this short tutorial, I want to show how you can read differently formatted software data with Python and pandas. We use the `read_csv` and `read_excel` functions to accomplish our tasks."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Reading CSV"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Reading files with mixed separators\n",
    "\n",
    "In this section, we read a less structured data set: the output of a Git log in the following format.\n",
    "\n",
    "```\n",
    "<timestamp><whitespace><timezone><tabulator><author>\n",
    "```\n",
    "\n",
    "Each line contains two different separators: a space and a tab. Here is the content of the file `datasets/mixed_separators.txt`:\n",
    "\n",
    "```\n",
    "1514531161 -0800\tLinus Torvalds\n",
    "1514489303 -0500\tDavid S. Miller\n",
    "1514487644 -0800\tTom Herbert\n",
    "1514487643 -0800\tTom Herbert\n",
    "1514482693 -0500\tWillem de Bruijn\n",
    "```\n",
    "\n",
    "We can read this kind of data with `read_csv` by passing a regular expression as the separator, which requires the Python parsing engine:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>timestamp</th>\n",
       "      <th>timezone</th>\n",
       "      <th>author</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1514531161</td>\n",
       "      <td>-800</td>\n",
       "      <td>Linus Torvalds</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1514489303</td>\n",
       "      <td>-500</td>\n",
       "      <td>David S. Miller</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1514487644</td>\n",
       "      <td>-800</td>\n",
       "      <td>Tom Herbert</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1514487643</td>\n",
       "      <td>-800</td>\n",
       "      <td>Tom Herbert</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1514482693</td>\n",
       "      <td>-500</td>\n",
       "      <td>Willem de Bruijn</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    timestamp  timezone            author\n",
       "0  1514531161      -800    Linus Torvalds\n",
       "1  1514489303      -500   David S. Miller\n",
       "2  1514487644      -800       Tom Herbert\n",
       "3  1514487643      -800       Tom Herbert\n",
       "4  1514482693      -500  Willem de Bruijn"
      ]
     },
     "execution_count": 54,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Split on the tab and on the single space in front of the timezone offset;\n",
    "# the spaces inside author names are left alone.\n",
    "pd.read_csv(\n",
    "    \"datasets/mixed_separators.txt\",\n",
    "    sep=r\"\\t| (?=[+-][0-9]{4}\\t)\",\n",
    "    engine='python',\n",
    "    names=['timestamp', 'timezone', 'author'],\n",
    "    header=None)"
   ]
  }
 ],
 "metadata": {
  "hide_input": false,
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}