{ "metadata": { "name": "", "signature": "sha256:907ef5a478fbc4c5d0efac4e14283a73189959c0714b1e5dcc210b04c93cd4fc" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Chapter 6 - Objected Oriented Programming" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "-- *A Python Course for the Humanities by Folgert Karsdorp and Maarten van Gompel*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Chapter is still in DRAFT stage

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this chapter we will introduce a new programming paradigm: **Object Oriented Programming**. We will build an application that builds a social network and computes a graph of relations between people on Twitter. The nodes of the graph will be the twitter users, and the directed edges indicate that one speaks to another. The edges will carry a weight representing the number of times messages were sent. \n", "\n", "Given a twitter corpus, we will extract who talks to whom, and whenever a connection is found, an edge is added to our graph, or an existing edge is strenghtened.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Object oriented programming is a data-centered programming paradigm that is based on the idea of grouping data and functions that act on particular data in so-called **classes**. A class can be seen as a complex data-type, a template if you will. Variables that are of that data type are said to be **objects** or **instances** of that class.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An example will clarify things:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "class Person:\n", " def __init__(self, name, age):\n", " self.name = name\n", " self.age = age" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ok, several things happen here. Here we created a class Person with a function `__init__`. Functions that start with underscores are always special functions to Python which are connected with other built-in aspects of the language. The initialisation function will be called when an object of that initialised. Let's do so:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "author = Person(\"Maarten\", 30)\n", "print(\"My name is \" + author.name)\n", "print(\"My age is \" + str(author.age))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Functions within a class are called **methods**. The initialisation method assigns the two parameters that are passed to variables that *belong to the object*, within a class definition the object is always represented by `self`.\n", "\n", "The first argument of a method is always `self`, and it will always point to the instance of the class. This first argument however is never explicitly specified when you call the method. It is implicitly passed by Python itself. That is why you see a discrepancy between the number of arguments in the instantiation and in the class definition.\n", "\n", "\n", "Any variable or methods in a class can be accessed using the period (`.`) syntax:\n", "\n", " object.variable \n", "\n", "or:\n", "\n", " object.method\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the above example we printed the name and age. We can turn this into a method as well, thus allowing any person to introduce himself/herself. Let's extend our example:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "class Person:\n", " def __init__(self, name, age):\n", " self.name = name\n", " self.age = age\n", " \n", " def introduceyourself(self):\n", " print(\"My name is \" + self.name)\n", " print(\"My age is \" + str(self.age))\n", " \n", "author = Person(\"Maarten\",30)\n", "author.introduceyourself()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Do you see what happens here? Do you understand the role of `self` and notation with the period?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Unbeknowst to you, we have already made use of countless objects and methods throughout this course. Things like strings, lists, sets, dictionaries are all objects! Isn't that a shock? :) The object oriented paradigm is ubiquitous in Python!" ] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "Exercise" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Add a variable `gender` (a string) to the Person class and adapt the initialisation method accordingly. Also add a method `ismale()` that uses this new information and returns a boolean value (True/False).\n" ] }, { "cell_type": "code", "collapsed": false, "input": [ "#adapt the code:\n", "\n", "class Person:\n", " def __init__(self, name, age):\n", " self.name = name\n", " self.age = age\n", " \n", " def introduceyourself(self):\n", " print(\"My name is \" + self.name)\n", " print(\"My age is \" + str(self.age))\n", " \n", "author = Person(\"Maarten\",30)\n", "author.introduceyourself()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Inheritance *" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One of the neat things you can do with classes is that you can build more specialised classes on top of more generic classes. `Person` for instance is a rather generic concept. We can use this generic class to build a more specialised class `Teacher`, a person that teaches a course. If you use inheritance, everything that the parent class could do, the inherited class can do as well!\n", "\n", "The syntax for inheritance is as follows, do not confuse it with parameters in a function/method definition. We also add an extra method `stateprofession()` otherwise `Teacher` would be no different than `Person`:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "class Person:\n", " def __init__(self, name, age):\n", " self.name = name\n", " self.age = age\n", " \n", " def introduceyourself(self):\n", " print(\"My name is \" + self.name)\n", " print(\"My age is \" + str(self.age))\n", "\n", " \n", "class Teacher(Person): #this class inherits the class above!\n", " def stateprofession(self):\n", " print(\"I am a teacher!\")" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "author = Teacher(\"Maarten\",30)\n", "author.introduceyourself()\n", "author.stateprofession()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "Exercise" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If the class `Person` would have already had a method `stateprofession`, then it would have been overruled (we say *overloaded*) by the one in the `Teacher` class. Edit the example above, add a print like *\"I have no profession! :'(\"* and see that nothings changes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Instead of completely overloading a method, you can also call the method of the parent class. The following example contains modified versions of all methods, adds some extra methods and variables to keep track of the courses that are taught by the teacher. The edited methods call the method of the parent class the avoid repetition of code (one of the deadly sins of computer programming):" ] }, { "cell_type": "code", "collapsed": false, "input": [ "class Teacher(Person): #this class inherits the class above!\n", " def __init__(self, name, age):\n", " self.courses = [] #initialise a new variable\n", " super().__init__(name,age) #call the init of Person\n", " \n", " def stateprofession(self):\n", " print(\"I am a teacher!\") \n", " \n", " def introduceyourself(self):\n", " super().introduceyourself() #call the introduceyourself() of the Person\n", " self.stateprofession()\n", " print(\"I teach \" + str(self.nrofcourses()) + \" course(s)\")\n", " for course in self.courses:\n", " print(\"I teach \" + course)\n", " \n", " \n", " \n", " def addcourse(self, course):\n", " self.courses.append(course)\n", " \n", " def nrofcourses(self):\n", " return len(self.courses)\n", " \n", " \n", "author = Teacher(\"Maarten\",30)\n", "author.addcourse(\"Python\")\n", "author.introduceyourself()\n" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Operator overloading" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you write your own classes, you can define what needs to happen if an operator such as for example `+`,`/` or `<` is used on your class. You can also define what happens when the keyword `in` or built-in functions such as `len()` are you used with your class. This allows for a very elegant way of programming. Each of these operators and built-in functions have an associated method which you can overload. All of these methods start, like `__init__`, with a double underscore.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For example. Let's allow comparison of tweets using the '<' and '>' operators. The methods for the opertors are respectively `__lt__` and `__gt__`, both take one argument, the other object to compare to. A tweet qualifies as greater than another if it is a newer, more recent, tweet:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "class Tweet:\n", " def __init__(self, message, time):\n", " self.message = message\n", " self.time = time # we will assume here that time is a numerical value\n", " \n", " def __lt__(self, other):\n", " return self.time < other.time\n", " \n", " def __gt__(self, other):\n", " return self.time > other.time \n", " \n", "\n", "oldtweet = Tweet(\"this is an old tweet\",20)\n", "newtweet = Tweet(\"this is a new tweet\",1000)\n", "print(newtweet > oldtweet)\n", " " ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You may not yet see much use in this, but consider for example the built-in function `sorted()`. Having such methods defined now means we can sort our tweets! And because we defined the methods `__lt__` and `__gt__` based on time. It will automatically sort them on time, from old to new:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "tweets = [newtweet,oldtweet]\n", "\n", "for tweet in sorted(tweets):\n", " print(tweet.message)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "Exercise" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Remember the `in` keyword? Used checking items in lists and keys in dictionaries? To recap:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "fruits = ['banana','pear','orange']\n", "print('pear' in fruits)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Overloading this operator is done using the `__contains__` method. It takes as extra argument the item that is being searched for ('pear' in the above example). The method should return a boolean value. For tweets, let's implement support for the `in` operator and have it check whether a certain word is in the tweet." ] }, { "cell_type": "code", "collapsed": false, "input": [ "class Tweet:\n", " def __init__(self, message, time):\n", " self.message = message\n", " self.time = time\n", "\n", " def __lt__(self, other):\n", " return self.time < other.time\n", " \n", " def __contains__(self, word):\n", " #Implement the method\n", "\n", "tweet = \"I just flushed my toilet\"\n", "#now write code to check if the word \"flushed\" is in the tweet\n", "#and print something nice if that's the case\n", "\n", "\n" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n" ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Iteration over an object" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Remember how we can iterate over lists and dictionaries using a for loop? To recap:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "fruits = ['banana','pear','orange']\n", "for fruit in fruits:\n", " print(fruit)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can do the same for our own object. We can make them support iteration. This is done by overloading the `__iter__` method. It takes no extra arguments and should be a **generator**. Which if you recall means that you should use `yield` instead of `return`. Consider the following class `TwitterUser`, if we iterate over an instance of that class, we want to iterate over all tweets. To make it more fun, let's iterate in chronologically sorted order:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "class TwitterUser:\n", " def __init__(self, name):\n", " self.name = name\n", " self.tweets = [] #This will be a list of all tweets, these should be Tweet objects\n", " \n", " def append(self, tweet):\n", " assert isinstance(tweet, Tweet) #this code will check if tweet is an instance\n", " #of the Tweet class. If not, an exception\n", " #will be raised\n", " #append the tweet to our list\n", " self.tweets.append(tweet)\n", " \n", " def __iter__(self):\n", " for tweet in sorted(self.tweets):\n", " yield tweet\n", "\n", " \n", "tweeter = TwitterUser(\"proycon\")\n", "tweeter.append(Tweet(\"My peanut butter sandwich has just fallen bottoms-down\",4)) \n", "tweeter.append(Tweet(\"Tying my shoelaces\",2)) \n", "tweeter.append(Tweet(\"Wiggling my toes\",3)) \n", "tweeter.append(Tweet(\"Staring at a bird\",1)) \n", "\n", "for tweet in tweeter:\n", " print(tweet.message)\n", "\n", " \n", " " ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "Exercise" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The method `__len__` is invoked when the built-in function `len()` is used. We want it to return the number of tweets a user has. Implement it in the example above and then run the following test, which should return `True` if you did well:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print(len(tweeter) == 4)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Extracting a social network of Twitter users" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we will turn to the practical assignment of this chapter. The extraction of a graph of who tweets whom. For this purpose we make available the dataset [twitterdata.zip](http://lst.science.ru.nl/~proycon/twitterdata.zip) , download and extract it in a location of your choice." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The program we are writing will consist of three classes: `Tweet`,`TweetUser` and `TweetGraph`. `TweetGraph` will maintain a dictionary of users (`TweetUser`), these are the nodes of our graph. `TweetUser` will in turn maintain a list of tweets (`Tweet`). `TweetUser` will also maintain a dictionary in which the keys are other TweetUser instances and the values indicate the weight of the relationship. This thus makes up the edges of our graph.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You will not have to write everything from scratch, we will provide a full skeleton in which you have to implement certain methods. We are going to use our external editor for this assignment. Copy the below code, edit it, and save it as `tweetnet.py`. When done, run the program from the command line, passing it one parameter, the directory where the *txt files from twitterdata.zip can be found: *python3 tweetnet.py /path/to/twitterdata/*" ] }, { "cell_type": "code", "collapsed": false, "input": [ "#! /usr/bin/env python3\n", "# -*- coding: utf8 -*-\n", "\n", "import sys\n", "import preprocess\n", "\n", "\n", "class Tweet: \n", " def __init__(self, message, time):\n", " self.message = message\n", " self.time = time\n", " \n", "\n", "class TwitterUser:\n", " def __init__(self, name):\n", " self.name = name\n", " self.tweets = [] #This will be a list of all tweets \n", " self.relations = {} #This will be a dictionary in which the keys are TwitterUser objects and the values are the weight of the relation (an integer) \n", " \n", " def append(self, tweet):\n", " assert isinstance(tweet, Tweet) #this is a test, if tweet is not an instance\n", " #of Tweet, it will raise an Exception.\n", " self.tweets.append(tweet)\n", " \n", " def __iter__(self):\n", " #This function, a generator, should iterate over all tweets\n", " #\n", " \n", " \n", " def __hash__(self): \n", " #For an object to be usable as a dictionary key, it must have a hash method. Call the hash() function over something that uniquely defined this object\n", " #and thus can act as a key in a dictionary. In our case, the user name is good, as no two users will have the same name: \n", " return hash(self.name)\n", " \n", " \n", " def addrelation(self, user):\n", " if user and user != self.name: #user must not be empty, and must not be the user itself\n", " if user in self.relations:\n", " #the user is already in our relations, strengthen the bond:\n", " self.relations[user] += 1\n", " elif user in graph: \n", " #the user exists in the graph, we can add a relation!\n", " self.relations[user] = 1\n", " #if the user does not exist in the graph, no relations will be added \n", " \n", " \n", " def computerelations(self, graph):\n", " for tweet in self:\n", " #tokenise the actual tweet content (use the tokeniser in preprocess!):\n", " tokens = #\n", " \n", " #Search for @username tokens, extract the username, and call self.addrelation()\n", " #\n", " \n", " \n", " def printrelations(self):\n", " #print the relations, include both users and the weight\n", " #\n", " \n", " \n", " def gephioutput(self): \n", " #produce CSV output that gephi can import\n", " for recipient, weight in self.relations.items():\n", " for i in range(0, weight):\n", " yield self.name + \",\" + recipient\n", " \n", " \n", "class TwitterGraph:\n", " def __init__(self, corpusdirectory): \n", " self.users = {} #initialisation of dictionary that will store all twitter users. They keys are the names, the values are TwitterUser objects.\n", " #the keys are the usernames (strings), and the values are\n", " # TweetUser instances\n", " \n", " #Load the twitter corpus \n", " #tip: use preprocess.find_corpusfiles and preprocess.read_corpus_file,\n", " #do not use preproces.readcorpus as that will include sentence segmentation\n", " #which we do not want\n", " \n", " #Each txt file contains the tweets of one user.\n", " #all files contain three columns, separated by a TAB (\\t). The first column\n", " #is the user, the second the time, and the third is the tweetmessage itself.\n", " #Create Tweet instances for every line that contains a @ (ignore other lines \n", " #to conserve memory). Add those tweet instances to the right TweetUser. Create\n", " #TweetUser instances as new users are encountered.\n", " \n", " #self.users[user], which user being the username (string), should be an instance of the\n", " #of TweetUser\n", " \n", " #\n", "\n", " \n", " #Compute relations between users\n", " for user in self:\n", " assert isinstance(user,TweetUser)\n", " user.computerelations(self)\n", " \n", "\n", " \n", " def __contains__(self, user):\n", " #Does this user exist?\n", " return user in self.users\n", " \n", " def __iter__(self):\n", " #Iterate over all users\n", " for user in self.users.values():\n", " yield user\n", "\n", " def __getitem__(self, user): \n", " #Retrieve the specified user\n", " return self.users[user]\n", " \n", "\n", "#this is the actual main body of the program. The program should be passed one parameter\n", "#on the command line: the directory that contains the *.txt files from twitterdata.zip.\n", "\n", "#We instantiate the graph, which will load and compute all relations\n", "twittergraph = TwitterGraph(sys.argv[1])\n", "\n", "#We output all relations:\n", "for twitteruser in twittergraph:\n", " twitteruser.printrelations()\n", "\n", "#And we output to a file so you can visualise your graph in the program GEPHI\n", "f = open('gephigraph.csv','wt',encoding='utf-8')\n", "for twitteruser in twittergraph:\n", " for line in twitteruser.gephioutput(): \n", " f.write(line + \"\\n\")\n", "f.close()\n" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ignore this, it's only here to make the page pretty:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from IPython.core.display import HTML\n", "def css_styling():\n", " styles = open(\"styles/custom.css\", \"r\").read()\n", " return HTML(styles)\n", "css_styling()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "/*\n", "Placeholder for custom user CSS\n", "\n", "mainly to be overridden in profile/static/custom/custom.css\n", "\n", "This will always be an empty file in IPython\n", "*/\n", "\n", "" ], "metadata": {}, "output_type": "pyout", "prompt_number": 1, "text": [ "" ] } ], "prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

\"Creative
Python Programming for the Humanities by http://fbkarsdorp.github.io/python-course is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Based on a work at https://github.com/fbkarsdorp/python-course.

" ] } ], "metadata": {} } ] }