{ "metadata": { "name": "", "signature": "sha256:58a51b073de673ca8c623fda7bb4fccac7e29f7fcc2abd810b331a6eb35be084" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Getting started - A guide to Git, GitHub and IPython Notebooks" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "What you will need for this course" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This course focuses on developing practical skills in working with data and providing students with a hands-on understanding of classical data analysis techniques. As expected, this will be a code-intensive course. We have chosen [Python](https://www.python.org/) as the language to work with, since it allows for fast prototyping and is supported by a great variety of scientific (and, specifically, data-related) libraries. For a quick introduction to Python, you can check [Lecture 1](http://nbviewer.ipython.org/github/dataminingapp/dataminingapp-lectures/blob/master/Lecture-1/Intro%20to%20Python.ipynb)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The materials of this course can be found under [this GitHub account](https://github.com/dataminingapp/). Both the lectures and the homeworks of this course are in the format of [IPython notebooks](http://ipython.org/notebook.html)." ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Installing Python" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are many ways to install Python. We highly recommend the free **Anaconda Scientific Python** distribution, which you can download from https://store.continuum.io/cshop/anaconda/. This Python distribution contains most of the packages that we will be using throughout the course. It also includes an easy-to-use but powerful packaging system, *conda*. 
For compatibility reasons, we will be using **Python 2.7**, so make sure to download the correct version of Anaconda." ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Installing Git" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One of the goals of this course is to make you familiar with the workflow of code versioning and collaboration. We will be using [GitHub](https://github.com/) to host all the materials of the course, and we will also expect you to use it when submitting your homeworks. You should download **git** from [here](http://git-scm.com/downloads) if it is not already installed on your machine and create a profile on GitHub. You can find extensive documentation on how to use **git** on the [Help Pages of GitHub](https://help.github.com), on [Atlassian](https://www.atlassian.com/git/tutorials/setting-up-a-repository), on [GitRef](http://gitref.org/) and many other sites." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "-----" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Working with Git" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![Alt text](./git-staging-area.svg)" ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Configuration" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first time we use *git* on a new machine, we need to configure our name and email:\n", "\n", "```\n", "$ git config --global user.name \"Charalampos Mavroforakis\"\n", "$ git config --global user.email \"cmav@bu.edu\"\n", "```\n", "Use the email that you used for your GitHub account." ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Creating a Repository" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After installing Git, we can configure our first repository. 
First, let's create a new directory.\n", "\n", "```\n", "$ mkdir thoughts\n", "$ cd thoughts\n", "```\n", "\n", "Now, we can create a *git* repository in this directory.\n", "\n", "```\n", "$ git init\n", "```\n", "\n", "We can check that everything is set up correctly by asking *git* to tell us the status of our project.\n", "\n", "```\n", "$ git status\n", "On branch master\n", "\n", "Initial commit\n", "\n", "nothing to commit (create/copy files and use \"git add\" to track)\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Now, create a file named ```science.txt```, edit it with your favorite text editor and add the following line:\n", "\n", "```\n", "Starting to think about data\n", "```\n", "\n", "If we check the status of our repository again, *git* tells us that there is a new file:\n", "```\n", "$ git status\n", "On branch master\n", "\n", "Initial commit\n", "\n", "Untracked files:\n", " (use \"git add <file>...\" to include in what will be committed)\n", "\n", " science.txt\n", "\n", "nothing added to commit but untracked files present (use \"git add\" to track)\n", "```\n", "\n", "The \"untracked files\" message means that there's a file in the directory that *git* isn't keeping track of. We can tell *git* that it should do so using ```git add```:\n", "\n", "```\n", "$ git add science.txt\n", "```\n", "\n", "and then check that the file is now being tracked:\n", "\n", "```\n", "$ git status\n", "On branch master\n", "\n", "Initial commit\n", "\n", "Changes to be committed:\n", " (use \"git rm --cached <file>...\" to unstage)\n", "\n", " new file: science.txt\n", "\n", "```\n", "\n", "*git* now knows that it's supposed to keep track of ```science.txt```, but it hasn't yet recorded any changes for posterity as a commit. 
To get it to do that, we need to run one more command:\n", "\n", "```\n", "$ git commit -m \"Preparing for science\"\n", "[master (root-commit) f516d22] Preparing for science\n", " 1 file changed, 1 insertion(+)\n", " create mode 100644 science.txt\n", "```\n", "\n", "When we run ```git commit```, *git* takes everything we have told it to save by using ```git add``` and stores a copy permanently inside the special ```.git``` directory. This permanent copy is called a **revision** and its short identifier is *f516d22*. (Your revision may have another identifier.)\n", "\n", "We use the -m flag (for \"message\") to record a comment that will help us remember later on what we did and why. If we just run ```git commit``` without the ```-m``` option, *git* will launch ```vim``` (or whatever other editor we configured at the start) so that we can write a longer message. If you are using Windows and you are not familiar with ```vim```, try installing [GitPad](https://github.com/github/GitPad).\n", "\n", "If we run git status now:\n", "```\n", "$ git status\n", "On branch master\n", "nothing to commit, working directory clean\n", "```\n", "\n", "it tells us everything is up to date. 
If we want to know what we've done recently, we can ask *git* to show us the project's history using ```git log```:\n", "\n", "```\n", "$ git log\n", "Author: Charalampos Mavroforakis \n", "Date: Sun Jan 25 12:48:44 2015 -0500\n", "\n", " Preparing for science\n", "```" ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Changing a file" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, suppose that we edit the file so that it reads:\n", "\n", "```\n", "Starting to think about data\n", "I need to attend CS591\n", "```\n", "\n", "Now if we run ```git status```, *git* will tell us that a file that it is tracking has been modified:\n", "\n", "```\n", "$ git status\n", "On branch master\n", "Changes not staged for commit:\n", " (use \"git add <file>...\" to update what will be committed)\n", " (use \"git checkout -- <file>...\" to discard changes in working directory)\n", "\n", " modified: science.txt\n", "\n", "no changes added to commit (use \"git add\" and/or \"git commit -a\")\n", "```\n", "\n", "The last line is the key phrase: *\"no changes added to commit\"*. We have changed this file, but we haven't told *git* that we want to save those changes (which we do with ```git add```), much less actually saved them. Let's double-check our work using ```git diff```, which shows us the differences between the current state of the file and the most recently saved version:\n", "\n", "```\n", "$ git diff\n", "diff --git a/science.txt b/science.txt\n", "index 0ac4b7b..c5b1b05 100644\n", "--- a/science.txt\n", "+++ b/science.txt\n", "@@ -1 +1,2 @@\n", " Starting to think about data\n", "+I need to attend CS591\n", "```\n", "\n", "Let's commit our change:\n", "```\n", "$ git commit -m \"Related course\"\n", "On branch master\n", "Changes not staged for commit:\n", " modified: science.txt\n", "\n", "no changes added to commit\n", "```\n", "\n", "*Whoops!* *Git* won't commit the file because we didn't use ```git add``` first. 
Let's fix that:\n", "```\n", "$ git add science.txt\n", "$ git commit -m \"Related course\"\n", "[master 1bd7277] Related course\n", " 1 file changed, 1 insertion(+)\n", "```\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Git* insists that we add files to the set we want to commit before actually committing anything because we may not want to commit everything at once. For example, suppose we're adding a few citations to our project. We might want to commit those additions, and the corresponding addition to the bibliography, but not commit the work we're doing on the analysis (which we haven't finished yet).\n", "\n", "To allow for this, *git* has a special staging area where it keeps track of things that have been added to the current change set but not yet committed. ```git add``` puts things in this area, and ```git commit``` then copies them to long-term storage (as a commit):\n", "\n", "![Alt text](./git-staging-area.svg)" ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Recovering old versions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can save changes to files and see what we have changed. But how can we restore older versions? Let's suppose we accidentally overwrite the file:\n", "\n", "```\n", "$ cat science.txt\n", "Despair! 
Nothing works\n", "```\n", "\n", "Now, ```git status``` tells us that the file has been changed, but those changes haven't been staged:\n", "\n", "```\n", "$ git status\n", "On branch master\n", "Changes not staged for commit:\n", " (use \"git add <file>...\" to update what will be committed)\n", " (use \"git checkout -- <file>...\" to discard changes in working directory)\n", "\n", " modified: science.txt\n", "\n", "no changes added to commit (use \"git add\" and/or \"git commit -a\")\n", "```\n", "We can put things back the way they were by using ```git checkout```:\n", "\n", "```\n", "$ git checkout HEAD science.txt\n", "$ cat science.txt\n", "Starting to think about data\n", "I need to attend CS591\n", "```" ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "More information" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can find more information on *git* here:\n", "* [Software Carpentry](http://swcarpentry.github.io/git-novice/01-backup.html)\n", "* [GitHub Help](https://help.github.com/)\n", "* [Atlassian Help](https://www.atlassian.com/git/tutorials/setting-up-a-repository)\n", "* [GitRef](http://gitref.org/)\n", "* [Git Ready](http://gitready.com/)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**IMPORTANT!**\n", "\n", "Never ```git add``` sensitive files (e.g. files containing *passwords* or *keys*) unless you are really sure you need to." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "-----" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "IPython Notebook" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "IPython has become a standard tool for interactive computing in Python. After installing *Anaconda*, you can access IPython (and the Notebooks) either through the Anaconda **Launcher** or the **Anaconda command prompt**. \n", "\n", "To start the IPython Notebook server, type ```ipython notebook``` in the terminal. 
Your web browser will open and load the environment." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the notebook, you can type and run code:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print \"hi!\"" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "hi!\n" ] } ], "prompt_number": 2 }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can use auto-complete (with the TAB key) and see the documentation (by adding \\`?\\`):" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import os\n", "# os.listdir?" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 19 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The errors are nicely formatted:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "1/0" ], "language": "python", "metadata": {}, "outputs": [ { "ename": "ZeroDivisionError", "evalue": "integer division or modulo by zero", "output_type": "pyerr", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[1;31mZeroDivisionError\u001b[0m Traceback (most recent call last)", "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m()\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[1;36m1\u001b[0m\u001b[1;33m/\u001b[0m\u001b[1;36m0\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[1;31mZeroDivisionError\u001b[0m: integer division or modulo by zero" ] } ], "prompt_number": 21 }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Essential Shortcuts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* *Esc / Enter*: Switch to command mode / edit mode\n", "* Execute cells\n", "    * *Shift-Enter*: Run and move to the next cell\n", "    * *Alt-Enter*: Run and insert a new cell below\n", "    * *Ctrl-Enter*: Run in place\n", "* *a / b*: Insert cell above / below\n", "* *d, d* (press *d* twice): Delete cell" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ 
"-----" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Conda Package Manager" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Anaconda also installs a package manager that makes it easy to install and update Python packages. To use it, type ```conda``` in the *Anaconda command prompt*. You can read a brief FAQ for ```conda``` [here](http://conda.pydata.org/docs/faq.html#pkg-installation)." ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Working from the Undergraduate lab" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Anaconda has been installed on the Linux machines of the Undergraduate lab as well. If you want to work from there, you need to follow these steps:\n", "\n", "1. Add the ```conda``` executable to your PATH\n", "```\n", "$ export PATH=/usr/local/anaconda/bin:$PATH\n", "```\n", "\n", "2. Create a new environment (only do this once)\n", "```\n", "$ conda create -p ~/envs/test numpy scipy networkx pandas scikit-learn matplotlib beautiful-soup ipython-notebook=2.2\n", "```\n", "You can change its name to something other than ```test```.\n", "\n", "3. Activate the environment\n", "```\n", "$ source activate ~/envs/test\n", "```\n", "\n", "4. Run IPython or IPython notebook\n", "```\n", "$ ipython2 notebook\n", "```\n", "\n", "5. Deactivate the environment when you are done\n", "```\n", "$ source deactivate\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "GitHub" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Systems like *git* allow us to move work between any two repositories. In practice, though, it's easiest to use one copy as a central hub, and to keep it on the web rather than on someone's laptop. Many programmers use hosting services like [GitHub](https://github.com/) or [Bitbucket](https://bitbucket.org/) to hold those master copies. 
For the purpose of our course, we will be using [GitHub](https://github.com/) to host the course material. You will also submit your homeworks through this platform. Next, we will cover how you can fork and clone the course's repository and how to submit your solutions to the homework. For more information on how to create your own repository on GitHub and upload code to it, please see the tutorial by [Software Carpentry](http://swcarpentry.github.io/git-novice/02-collab.html)." ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Course repositories" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The material of the course is hosted on GitHub, under [this account](https://github.com/dataminingapp/)." ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Clone the lecture repository" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to download a copy of the lectures and run them locally on your computer, you need to clone the lecture repository. To do that:\n", "\n", "1. Create a new folder for the course.\n", "```\n", "$ mkdir cs591\n", "$ cd cs591\n", "```\n", "2. Copy the clone URL from the [repository's website](https://github.com/dataminingapp/dataminingapp-lectures).\n", "3. Clone the repository with *git*.\n", "```\n", "$ git clone https://github.com/dataminingapp/dataminingapp-lectures.git\n", "```\n", "\n", "You should now have a directory named ```dataminingapp-lectures``` with the course material.\n", "\n", "To update the repository and download the **new material**, run the following from inside the ```dataminingapp-lectures``` directory:\n", "```\n", "$ git pull\n", "```" ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Fork & Clone the homework repository" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to download and submit the homework, you will need to follow these steps. You only need to do this **once**:\n", "\n", "1. 
Fork the [homework repository](https://github.com/dataminingapp/spring-2015-homeworks) under your GitHub account.\n", "2. Clone your fork locally, as above.\n", "3. Add the original repository as the ```upstream``` remote, so that you can download the changes made to it\n", "```\n", "$ git remote add upstream https://github.com/dataminingapp/spring-2015-homeworks.git\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, **every time** that you want to work on the homework, you will need to:\n", "\n", "1. Make sure your local copy is up to date with the original repository\n", "```\n", "$ git pull --rebase upstream master\n", "```\n", "\n", "2. Work on the homework. Don't forget to **commit** regularly!\n", "```\n", "<...>\n", "$ git add Homework-0.ipynb\n", "$ git commit -m \"Adds a hello-world function\"\n", "```\n", "\n", "3. Push your changes to your fork, on GitHub\n", "```\n", "$ git push\n", "```\n", "\n", "4. Before the submission deadline, make sure that the changes that you made to the homework (the commits) are reflected in your online fork." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Practice" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, practice what we have seen today by solving and submitting Homework 0." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Code for setting the style of the notebook\n", "from IPython.core.display import HTML\n", "\n", "def css_styling():\n", "    # Read the course stylesheet and wrap it so the notebook renders it\n", "    with open(\"../theme/custom.css\", \"r\") as f:\n", "        styles = f.read()\n", "    return HTML(styles)\n", "\n", "css_styling()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "\n", "\n", "\n", "\n", "" ], "metadata": {}, "output_type": "pyout", "prompt_number": 1, "text": [ "" ] } ], "prompt_number": 1 } ], "metadata": {} } ] }