{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction\n", "\n", "This short tutorial shows how to read software data in various formats with Python and pandas, using the `read_csv` and `read_excel` functions." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Reading CSV" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Reading files with mixed separators\n", "\n", "In this section we read a less structured data set: Git log output in which each line has the following layout.\n", "\n", "```\n", "<timestamp> <timezone><TAB><author>\n", "```\n", "\n", "It contains two different separators: a space between the timestamp and the timezone, and a tab character before the author name. Here is the content of the file `datasets/mixed_separators.txt`:\n", "\n", "```\n", "1514531161 -0800\tLinus Torvalds\n", "1514489303 -0500\tDavid S. Miller\n", "1514487644 -0800\tTom Herbert\n", "1514487643 -0800\tTom Herbert\n", "1514482693 -0500\tWillem de Bruijn\n", "```\n", "\n", "We can read this kind of data with `read_csv` by passing a regular expression as the separator, which requires the Python parsing engine:\n", "\n" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>timestamp</th><th>timezone</th><th>author</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>1514531161</td><td>-0800</td><td>Linus Torvalds</td></tr>\n",
"    <tr><th>1</th><td>1514489303</td><td>-0500</td><td>David S. Miller</td></tr>\n",
"    <tr><th>2</th><td>1514487644</td><td>-0800</td><td>Tom Herbert</td></tr>\n",
"    <tr><th>3</th><td>1514487643</td><td>-0800</td><td>Tom Herbert</td></tr>\n",
"    <tr><th>4</th><td>1514482693</td><td>-0500</td><td>Willem de Bruijn</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [
"    timestamp  timezone            author\n",
"0  1514531161     -0800    Linus Torvalds\n",
"1  1514489303     -0500   David S. Miller\n",
"2  1514487644     -0800       Tom Herbert\n",
"3  1514487643     -0800       Tom Herbert\n",
"4  1514482693     -0500  Willem de Bruijn" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [
"import pandas as pd\n",
"\n",
"# Split on the space before the timezone offset or on the tab before the author,\n",
"# so that spaces inside author names are left alone.\n",
"pd.read_csv(\n",
"    \"datasets/mixed_separators.txt\",\n",
"    sep=r\" (?=[+-]\\d{4}\\t)|\\t\",\n",
"    engine='python',  # regular-expression separators need the Python engine\n",
"    names=['timestamp', 'timezone', 'author'],\n",
"    dtype={'timezone': str},  # keep the leading zero in offsets such as -0800\n",
"    header=None)" ] } ], "metadata": { "hide_input": false, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.2" } }, "nbformat": 4, "nbformat_minor": 2 }