{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Mining the Social Web, 3rd Edition\n", "\n", "## Appendix C: Python and Jupyter Notebook Tips & Tricks\n", "\n", "This IPython Notebook provides an interactive way to follow along with and explore the numbered examples from [_Mining the Social Web (3rd Edition)_](http://bit.ly/Mining-the-Social-Web-3E). The intent behind this notebook is to reinforce the concepts from the sample code in a fun, convenient, and effective way. This notebook assumes that you are reading along with the book and have the context of the discussion as you work through these exercises.\n", "\n", "In the somewhat unlikely event that you've somehow stumbled across this notebook outside of its context on GitHub, [you can find the full source code repository here](http://bit.ly/Mining-the-Social-Web-3E).\n", "\n", "## Copyright and Licensing\n", "\n", "You are free to use or adapt this notebook for any purpose you'd like. However, please respect the [Simplified BSD License](https://github.com/mikhailklassen/Mining-the-Social-Web-3rd-Edition/blob/master/LICENSE) that governs its use." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Try tinkering around with Python to test out Jupyter Notebook...\n", "\n", "As you work through these notebooks, it's assumed that you'll be executing each cell in turn, because some cells will define variables that cells below them will use. Here's a very simple example to illustrate how this all works..." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Execute this cell to define this variable\n", "# Either click Cell => Run from the menu or type\n", "# ctrl-Enter to execute. See the Help menu for lots\n", "# of useful tips. 
Help => IPython Help and\n", "# Help => Keyboard Shortcuts are especially\n", "# useful.\n", "\n", "message = \"I want to mine the social web!\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# The variable 'message' is defined here. Execute this cell to see for yourself\n", "print(message)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# The variable 'message' is defined here, but we'll delete it\n", "# after displaying it to illustrate an important point...\n", "print(message)\n", "del message" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# The variable 'message' is no longer defined in this cell or in the cell \n", "# two cells above. Try executing this cell or that cell to see for yourself.\n", "print(message)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Try typing in some code of your own!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Python Idioms\n", "\n", "This section of the notebook introduces a few Python idioms that are used widely throughout the book and that you might find helpful to review. This section is not intended to be a Python tutorial. It is intended to highlight some of the fundamental aspects of Python that will help you to follow along with the source code, assuming you have a general programming background. Sections 1 through 8 of the [Python Tutorial](http://docs.python.org/2/tutorial/) are what you should spend a couple of hours working through if you are looking for a gentle introduction to Python as a programming language." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Python Data Structures are like JSON\n", "\n", "If you come from a web development background, a good starting point for understanding Python data structures is to start with [JSON](http://json.org) as a reference point. 
If you don't have a web development background, think of JSON as a simple but expressive specification for representing arbitrary data structures using strings, numbers, lists, and dictionaries. The following cell introduces these fundamental data types. Execute it to follow along." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "an_integer = 23\n", "print(an_integer, type(an_integer))\n", "print()\n", "\n", "a_float = 23.0\n", "print(a_float, type(a_float))\n", "print()\n", "\n", "a_string = \"string\"\n", "print(a_string, type(a_string))\n", "print()\n", "\n", "a_list = [1,2,3]\n", "print(a_list, type(a_list))\n", "print(a_list[0]) # access the first item\n", "print()\n", "\n", "a_dict = {'a' : 1, 'b' : 2, 'c' : 3}\n", "print(a_dict, type(a_dict))\n", "print(a_dict['a']) # access the item with key 'a'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Assuming you've followed along with these fundamental data types, consider the possibilities for arbitrarily composing them to represent more complex structures:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "contacts = [\n", "    {\n", "        'name' : 'Bob',\n", "        'age' : 23,\n", "        'married' : False,\n", "        'height' : 1.8, # meters\n", "        'languages' : ['English', 'Spanish'],\n", "        'address' : '123 Maple St.',\n", "        'phone' : '(555) 555-5555'\n", "    },\n", "    \n", "    {'name' : 'Sally',\n", "     'age' : 26,\n", "     'married' : True,\n", "     'height' : 1.5, # meters\n", "     'languages' : ['English'],\n", "     'address' : '456 Elm St.',\n", "     'phone' : '(555) 555-1234'\n", "    } \n", "]\n", "\n", "for contact in contacts:\n", "    print(\"Name:\", contact['name'])\n", "    print(\"Married:\", contact['married'])\n", "    print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As alluded to previously, the data structures very much lend themselves to constructing JSON in a very 
natural way. This is often quite convenient for web application development that involves using a Python server process to send data back to a JavaScript client. The following cell illustrates the general idea." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import json\n", "\n", "print(contacts)\n", "print(type(contacts)) # list\n", "\n", "# json.dumps (dumps stands for \"dump string\") takes a Python data structure\n", "# that is serializable to JSON and dumps it as a string\n", "jsonified_contacts = json.dumps(contacts, indent=2) # indent is used for pretty-printing\n", "\n", "print(type(jsonified_contacts)) # str\n", "print(jsonified_contacts)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A couple of additional types that you'll run across regularly are tuples and the special None type. Think of a tuple as an immutable list and None as a special value that indicates an empty value, which is neither True nor False." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "a_tuple = (1,2,3)\n", "\n", "an_int = (1) # You must include a trailing comma when only one item is in the tuple\n", "\n", "a_tuple = (1,)\n", "\n", "a_tuple = (1,2,3,) # Trailing commas are ok in tuples and lists \n", "\n", "none = None\n", "\n", "print(none == None) # True\n", "print(none == True) # False\n", "print(none == False) # False\n", "\n", "print()\n", "\n", "# In general, you'll see the special 'is' operator used when comparing a value to \n", "# None, but most of the time, it works the same as '=='\n", "\n", "print(none is None) # True\n", "print(none is True) # False\n", "print(none is False) # False" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As indicated in the [python.org tutorial](http://docs.python.org/2/tutorial/controlflow.html#default-argument-values), None is often used as a default argument value in functions, which are _defined_ by the keyword _def_." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def square(x):\n", "    return x*x\n", "\n", "print(square(2)) # 4\n", "print()\n", "\n", "# The default value for L is only created once and shared amongst\n", "# calls\n", "\n", "def f1(a, L=[]):\n", "    L.append(a)\n", "    return L\n", "\n", "print(f1(1)) # [1]\n", "print(f1(2)) # [1, 2]\n", "print(f1(3)) # [1, 2, 3]\n", "print()\n", "\n", "# Each call creates a new value for L\n", "\n", "def f2(a, L=None):\n", "    if L is None:\n", "        L = []\n", "    L.append(a)\n", "    return L\n", "\n", "print(f2(1)) # [1]\n", "print(f2(2)) # [2]\n", "print(f2(3)) # [3]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### List and String Slicing\n", "\n", "For lists and strings, you'll often want to extract a particular selection using a starting and ending index. In Python, this is called _slicing_. 
The syntax involves using square brackets in the same way as when you are extracting a single value, but you include an additional parameter to indicate the boundary for the slice." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "a_list = ['a', 'b', 'c', 'd', 'e', 'f', 'g']\n", "\n", "print(a_list[0]) # a\n", "print(a_list[0:2]) # ['a', 'b']\n", "print(a_list[:2]) # Same as above. The starting index is implicitly 0\n", "print(a_list[3:]) # ['d', 'e', 'f', 'g'] Ending index is implicitly the length of the list\n", "print(a_list[-1]) # g Negative indices start at the end of the list\n", "print(a_list[-3:-1]) # ['e', 'f'] Start at the end and work backwards. (The index after the colon is still excluded)\n", "print(a_list[-3:]) # ['e', 'f', 'g'] The last three items in the list\n", "print(a_list[:-4]) # ['a', 'b', 'c'] Everything up to the last 4 items\n", "\n", "a_string = 'abcdefg'\n", "\n", "# String slicing works the very same way\n", "\n", "print(a_string[:-4]) # abc" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### List Comprehensions\n", "\n", "Think of Python's [list comprehensions](http://docs.python.org/2/tutorial/datastructures.html#list-comprehensions) idiom as a concise and efficient way to create lists. You'll often see list comprehensions used as an alternative to for loops for a common set of problems. Although they may take some getting used to, you'll soon find them to be a natural expression. See the section entitled \"Loops\" from [Python Performance Tips](http://wiki.python.org/moin/PythonSpeed/PerformanceTips) for more details on why list comprehensions may be more performant than loops or functions like map in various situations." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# One way to create a list containing 0..9:\n", "\n", "a_list = []\n", "for i in range(10):\n", "    a_list.append(i)\n", "print(a_list) \n", "    \n", "# How to do it with a list comprehension\n", "\n", "print([ i for i in range(10) ])\n", "\n", "\n", "# But what about a nested loop like this one, which\n", "# even contains a conditional expression in it:\n", "\n", "a_list = []\n", "for i in range(10):\n", "    for j in range(10, 20):\n", "        if i % 2 == 0:\n", "            a_list.append(i)\n", "\n", "print(a_list)\n", "\n", "# You can use a nested list comprehension to \n", "# achieve the very same result. When written with readable\n", "# indentation like below, note the striking similarity to\n", "# the equivalent code as presented above.\n", "\n", "print([ i\n", "        for i in range(10)\n", "            for j in range(10, 20)\n", "                if i % 2 == 0\n", "      ])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Dictionary Comprehensions\n", "\n", "In the same way that you can concisely construct lists with list comprehensions, you can concisely construct dictionaries with dictionary comprehensions. The underlying concept and the syntax are very similar to those of list comprehensions. The following example illustrates a few different ways to create the same dictionary and introduces dictionary construction syntax." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Literal syntax\n", "\n", "a_dict = { 'a' : 1, 'b' : 2, 'c' : 3 }\n", "print(a_dict)\n", "print()\n", "\n", "# Using the dict constructor\n", "\n", "a_dict = dict([('a', 1), ('b', 2), ('c', 3)])\n", "print(a_dict)\n", "print()\n", "\n", "# Dictionary comprehension syntax\n", "\n", "a_dict = { k : v for (k,v) in [('a', 1), ('b', 2), ('c', 3)] }\n", "print(a_dict)\n", "print()\n", "\n", "# A more appropriate circumstance to use dictionary comprehension would \n", "# involve more complex computation\n", "\n", "a_dict = { k : k*k for k in range(10) } # {0: 0, 1: 1, 2: 4, 3: 9, ..., 9: 81}\n", "print(a_dict)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Enumeration\n", "\n", "While iterating over a collection such as a list, it's often handy to know the index for the item that you are looping over in addition to its value. While a reasonable approach is to maintain a looping index, the enumerate function spares you the trouble." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "lst = ['a', 'b', 'c']\n", "\n", "# You could opt to maintain a looping index...\n", "i = 0\n", "for item in lst:\n", " print(i, item)\n", " i += 1\n", "\n", "# ...but the enumerate function spares you the trouble of maintaining a loop index\n", "for i, item in enumerate(lst):\n", " print(i, item)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### \\*args and \\*\\*kwargs\n", "\n", "Conceptually, Python functions accept lists of arguments that can be followed by additional keyword arguments. A common idiom that you'll see when calling functions is to _dereference_ a list or dictionary with the asterisk or double-asterisk, respectively, a special trick for satisfying the function's parameterization." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def f(a, b, c, d=None, e=None):\n", "    print(a, b, c, d, e)\n", "\n", "f(1, 2, 3) # 1 2 3 None None\n", "f(1, 2, 3, d=4) # 1 2 3 4 None\n", "f(1, 2, 3, d=4, e=5) # 1 2 3 4 5\n", "\n", "args = [1,2,3]\n", "kwargs = {'d' : 4, 'e' : 5}\n", "\n", "f(*args, **kwargs) # 1 2 3 4 5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### String Substitutions\n", "\n", "It's often clearer in code to use string substitution than to concatenate strings, although both options can get the job done. The string type's built-in format method is also very handy and adds to the readability of code. The following examples illustrate some of the common string substitutions that you'll regularly encounter in the code." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "name1, name2 = \"Bob\", \"Sally\"\n", "\n", "print(\"Hello, \" + name1 + \". My name is \" + name2)\n", "\n", "print(\"Hello, %s. My name is %s\" % (name1, name2,))\n", "\n", "print(\"Hello, {0}. My name is {1}\".format(name1, name2))\n", "print(\"Hello, {0}. My name is {1}\".format(*[name1, name2]))\n", "names = [name1, name2]\n", "print(\"Hello, {0}. My name is {1}\".format(*names))\n", "\n", "\n", "print(\"Hello, {you}. My name is {me}\".format(you=name1, me=name2))\n", "print(\"Hello, {you}. My name is {me}\".format(**{'you' : name1, 'me' : name2}))\n", "names = {'you' : name1, 'me' : name2}\n", "print(\"Hello, {you}. My name is {me}\".format(**names))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Serving Static Content\n", "\n", "IPython Notebook has some handy features for interacting with the web browser that you should know about. A few of the features that you'll see in the source code are embedding inline frames, and serving static content such as images, text files, JavaScript files, etc. 
The ability to serve static content is especially handy if you'd like to display an inline visualization for analysis, and you'll see this technique used throughout the notebook.\n", "\n", "The following cell illustrates creating and embedding an inline frame and serving the static source file for this notebook, which is serialized as JSON data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from IPython.display import IFrame, display\n", "\n", "# IPython Notebook can serve files relative to the location of\n", "# the working notebook into inline frames. Prepend the path \n", "# with the 'files' prefix\n", "\n", "static_content = 'resources/appc-pythontips/hello.txt'\n", "\n", "display(IFrame(static_content, '100%', '600px'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Shared Folders\n", "\n", "The Docker container maps the top level directory of your GitHub checkout (the directory containing `README.md`) on your host machine to its `/home/jovyan/notebooks` folder and automatically synchronizes files between the guest and host environments as an incredible convenience to you. This mapping and synchronization enables Jupyter Notebooks you are running on the guest machine to access files that you can conveniently manage on your host machine and vice versa. For example, many of the scripts in Jupyter Notebooks may write out data files and you can easily access those data files on your host environment (should you desire to do so) without needing to connect into the guest machine with an SSH session. On the flip side, you can provide data files to Jupyter Notebook, which is running on the guest machine, by copying them anywhere into your top level GitHub checkout. 
\n", "\n", "In effect, the top level directory of your GitHub checkout is automatically synchronized between the guest and host environments so that you have access to everything that is happening and can manage your source code, modified notebooks, and everything else all from your host machine. See _docker-compose.yml_ for more details on how synchronized folders can be configured.\n", "\n", "The following code snippet illustrates how to access files. Keep in mind that the code that you execute in this cell writes data to the Docker container, and it's Docker that automatically synchronizes it back to your host environment. It's a subtle but important detail. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "# The absolute path to the shared folder in the Docker container\n", "shared_folder=\"/home/jovyan/notebooks\"\n", "\n", "# List the files in the shared folder\n", "print(os.listdir(shared_folder))\n", "print()\n", "\n", "# How to read and display a snippet of the README.md file...\n", "README = os.path.join(shared_folder, \"README.md\")\n", "txt = open(README).read()\n", "print(txt[:200])\n", "\n", "# Write out a file to the guest but notice that it is available on the host\n", "# by checking the contents of your GitHub checkout\n", "f = open(os.path.join(shared_folder, \"Hello.txt\"), \"w\")\n", "f.write(\"Hello. This text is written on the guest but synchronized to the host by Docker\")\n", "f.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Copyright and Licensing\n", "\n", "You are free to use or adapt this notebook for any purpose you'd like. However, please respect the following [Simplified BSD License](https://github.com/mikhailklassen/Mining-the-Social-Web-3rd-Edition/blob/master/LICENSE) (also known as \"FreeBSD License\") that governs its use. 
Basically, you can do whatever you want with the code so long as you retain the copyright notice.\n", "\n", "Copyright (c) 2018, Matthew A. Russell & Mikhail Klassen\n", "All rights reserved.\n", "\n", "Redistribution and use in source and binary forms, with or without\n", "modification, are permitted provided that the following conditions are met: \n", "\n", "1. Redistributions of source code must retain the above copyright notice, this\n", " list of conditions and the following disclaimer. \n", "2. Redistributions in binary form must reproduce the above copyright notice,\n", " this list of conditions and the following disclaimer in the documentation\n", " and/or other materials provided with the distribution. \n", "\n", "THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS \"AS IS\" AND\n", "ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED\n", "WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE\n", "DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR\n", "ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES\n", "(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;\n", "LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND\n", "ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT\n", "(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS\n", "SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.\n", "\n", "The views and conclusions contained in the software and documentation are those\n", "of the authors and should not be interpreted as representing official policies, \n", "either expressed or implied, of the FreeBSD Project." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.12" } }, "nbformat": 4, "nbformat_minor": 1 }