{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Practical 5: Pandas

\n", "

Easing into EDA

\n", "
\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "| Complete | Part 1: Foundations | Part 2: Data | Part 3: Analysis | |\n", "| :------- | :------------------ | :----------- | :--------------- | --: |\n", "| 40% | ▓▓▓▓▓▓▓▓ | ▓░░░░░ | ░░░░░░ | 5/10\n", "\n", "This session is a kind of tour-de-`pandas`, which is Python's equivalent of the `tidyverse` meets `data.tables`, meaning that pandas is fundamental to Python's data science ecosystem and is probably one of the most-widely used libraries in the language: I get more than 285,000 questions tagged as pandas questions on StackOverflow. \n", "\n", "This week we are also going to start looking at the **[InsideAirbnb](http://insideairbnb.com/)** data which forms the core of the work that we do over the remaining 5 weeks of the term. The focus of *this* notebook is simple numeric data: no mapping or text data... yet... and direct manipulation of data types, derivation of summary statistics, and simple plotting.\n", "\n", "We hope that you will be able to draw on the past few practical sessions to develop a more intuitive understanding of how to interact with pandas since it supports both a 'dictionary-of-lists' style of interaction *and* a methods-based style of interaction with the 'Data Frame'." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " ⚠ Important: Conceptually, this practical links together all of the preceding ones; you will find data structures, classes and methods, reading CSV files from a remote location, numpy, and more than you ever wanted to know about data types in Python. Making these connections will make the remainder of term much, much easier, so it might be worth revising this practical over Reading Week so make sure it all makes sense!
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The Importance of Data Exploration\n", "\n", "After a few weeks getting to grips with Python, we're now going to start working with some real data. One of the first things that we do when working with any new data set is to familiarise ourselves with it. There are a _huge_ number of ways to do this, but there are no shortcuts to:\n", "\n", "1. Reading about the data (how it was collected, what the sample size was, etc.)\n", "2. Reviewing any accompanying metadata (data about the data, column specs, etc.)\n", "3. Looking at the data itself at the row- and column-levels\n", "4. Producing descriptive statistics\n", "5. Visualising the data using plots\n", "\n", "You should use _all_ of these together to really understand where the data came from, how it was handled, and whether there are gaps or other problems. If you're wondering which comes first, the concept of _start with a chart_ is always good... though we've obviously not _quite_ gotten there yet! This week we want you to get a handle on pandas itself, so although we will do some plotting of charts, we'll focus on 3-4 with a tiny bit of 5. There will be much more on plotting charts next week, and you should be looking into 1 and 2 yourself based on what's been written both on the [Inside Airbnb web site](http://insideairbnb.com/about.html) and in the [suggested readings](https://github.com/jreades/i2p/blob/master/bib/Bibliography.md).\n", "\n", "So although they don't need to be done now, you probably want to add both those links to your reading list!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Import Required Libraries\n", "\n", "
\n", " 🔗 Connections: This is why we spent time talking about Packages, Methods Classes in the lectures... because now we're going to be making intensive use of them.
\n", "\n", "It's always sensible to import packages these at the top of the notebook:\n", "\n", "1. Because it lets everyone know what they need to have installed to run your code.\n", "2. It's easy to run this and then skip further down the notebook if you have already done *some* of the work and saved an intermediate output." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Beyond what we provide below there are [numerous](http://lmgtfy.com/?q=introduction+to+pandas+python) useful introductions; [one of our favourites](http://www.gregreda.com/2013/10/26/intro-to-pandas-data-structures/) is from Greg Reda, and there are some [good videos](https://youtu.be/TSsSWuhBpmY) on [our YouTube channel](https://www.youtube.com/playlist?list=PLJ5Y5hxm-0W7rOOYBHf6KC6QNnWOi09kh). And of course, there’s [TONS of stuff](http://stackoverflow.com/questions/tagged/pandas) on StackOverflow. If you want an actual physical book, you might try [McKinney (2017)](http://shop.oreilly.com/product/0636920050896.do).\n", "\n", "However, one thing you will really want to bookmark is [the official documentation](http://pandas.pydata.org/pandas-docs/stable/) since you will undoubtedly need to refer to it fairly regularly. _Note_: this link is to the most recent release. Over time there will be updates published and you _may_ find that you no longer have the most up-to-date version. If you find that you are now using an older version of pandas and the methods have changed then you'll need to track down the _specific_ version of the documentation that you need from the [home page](http://pandas.pydata.org).\n", "\n", "You can always check what version you have installed like this:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(pd.__version__)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Note: the <package_name>.__version__ approach isn't guaranteed to work with every package, but it will work with most of them. Remember that variables and methods starting and ending with '__' are private and any interaction with them should be approached very, very carefully.
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Task 1: Reading and Writing Data\n", "\n", "
\n", " 🔗 Connections: You will really need to get to grips with Pandas through the lectures on Data and Pandas.
\n", "\n", "Pandas can do a *lot*, and you might be feeling a little intimidated by this, but here's the thing: we were already writing something like pandas from scratch! That's because pandas takes a **column-view of data** in the same way that our **Dictionary-of-Lists** did, it's just that it's got a lot more features than our 'simple' tool did. That's why the documentation is so much more forbidding and why pandas is so much more powerful.\n", "\n", "But at its heart, a pandas `Data Frame` (`df` for short) is a collection of `Data Series` objects (i.e. columns) with an index. Each Series is like one of our column-lists from the last notebook. And the `df` is like the dictionary that held the data together. So you've seen this before and you already _know_ what's going on... or at least you now have an _analogy_ that you can use to make sense of pandas:\n", "```python\n", "myDataFrame = {\n", " '': ,\n", " '': ,\n", " '': \n", "}\n", "``` \n", "\n", "And pandas gives us two ways to access that data:\n", "\n", "1. Using a method syntax: `myDataFrame.column_name_1`\n", "2. Using a dictionary syntax: `myDataFrame['column_name_1']`\n", "\n", "Depending on which syntax you prefer, you can use these interchangeably. The only times you *have* to choose one over the other are: a) assignment (e.g. `myDataFrame['column_name_1'] = ...`); b) columns with spaces in their names (e.g. `myDataFrame['Column Name 1')`)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 1.1: Read Remote Data\n", "\n", "
Difficulty level: Low (this time around).
\n", "\n", "You will need to do several things here to read the remote, compressed CSV file specified by `url` into a data frame called `df`. Setting `low_memory=False` ensures that pandas will try to load the entire data set _before_ guessing the data format! Obviously, with very large files this is probably a bad idea and it's possible to force a particular column type while readng in the data as well. For larger data sets there are platforms like [Dask](https://dask.org/) (see, eg, [this](https://towardsdatascience.com/why-and-how-to-use-dask-with-big-data-746e34dac7c3)), and beyond that are [other options](https://towardsdatascience.com/scaling-pandas-comparing-dask-ray-modin-vaex-and-rapids-c74c85a4e59c)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Set download URL\n", "host = 'https://orca.casa.ucl.ac.uk'\n", "path = '~jreades/data/2023-09-06-listings.csv.gz'\n", "url = f'{host}/{path}'\n", "\n", "# your code here\n", "df = pd.read_csv(??, compression='gzip', low_memory=False)\n", "print(f\"Data frame is {df.shape[0]:,} x {df.shape[1]}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You should get a data frame containing 75 columns and about 80,000 rows of data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 1.2: Inspecting the Data Frame\n", "\n", "
Difficulty level: Low.
\n", "\n", "Let's get a general sense of the data by printing out information _about_ the data frame. There are several ways to do this (and we'll see another futher on):\n", "\n", "- `df.describe(percentiles=None, include=None, exclude=None, datetime_is_numeric=False)` -- descriptive stats for all **numeric** columns\n", "- `df.info(verbose=None, buf=None, max_cols=None, memory_usage=None, show_counts=None)` -- summarises all columns, but without distribution information\n", "- `df.memory_usage(index=True, deep=False)` -- memory usage details about each column" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Quick question**: What is another term for the 0.5 percentile?\n", "\n", "**Answer**: **??**." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 1.2.1 Describing\n", "\n", "Show the 5th and 95th percentiles:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.describe(percentiles=[??]) # 0.5 is always shown" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 1.2.2 Info" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.info(verbose=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You should get that the data frame uses about 40MB of memory and has 24 float64 columns, 17 ints, and 34 objects (text)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 1.2.3 Memory Usage" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mem_usage = df.memory_usage(index=True, deep=True)\n", "mem_usage" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 1.2.4 Printing the Columns\n", "\n", "Finally, I find it *very* useful to be able to quickly print out a list of the **columns** without all of the details shown above. You just need to _print_ the _columns_ as a _list_:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(df.??.to_??())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You should get a list showing every single column. If you get `Index(['id', 'listing_url',...], dtype='object')` then you have printed the column _index_ object and you to need to tell the object to convert its output **to a list** (*hint*: Google)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 1.3: Saving the File Locally\n", "\n", "
Difficulty level: Low
\n", "\n", "Now save the file somewhere local so that you don't have to keep downloading 40MB of compressed data every time you want to start the practical. We'll be using this data for the rest of term, so you might as well save yourself some time and bandwidth! We'll talk more about data processing pipelines over the course of the term, but I'd suggest putting this data set into a `data/raw` folder because then you can have directories like `data/clean` and `data/analytical` as you move through the process of cleaning and prepping your data for analysis." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "path = os.path.join('data','raw') # A default location to save raw data\n", "fn = url.split('/')[-1] # What does this do?\n", "print(f\"Writing to: {fn}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "if not os.path.exists(path): # And what does *this* do?\n", " print(f\"Creating {path} under {os.getcwd()}\")\n", " os.makedirs(path)\n", " \n", "df.to_csv(os.path.join(path,fn), index=False)\n", "print(\"Done.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Task 2: Managing Data\n", "\n", "Although you _could_ do the next few steps as part of loading the _raw_ data, I always prefer to keep the original data set handy since I almost always discover that there are fields I didn't realise I needed when I started my work. So my approach to coding is usually:\n", "\n", "1. Download the raw file and save it locally in a `data/raw` directory.\n", "2. Load the first `nrows` of data so that I can quickly:\n", " - Check that the specification matches the data and select columns/rows accordingly.\n", " - Identify obviously invalid rows/columns and investigate further.\n", " - Check the code to fix data types and (where relevant) values works.\n", " - Write this new, smaller file ($m` << m$ and $n` << n$) out to a `data/clean` or `data/canonical` directory (depending on whether formatting the columns is so complex or takes so long on a large data set that it needs to be separated out from actual cleaning).\n", " - Test out some initial ideas for further analysis.\n", "3. Re-run the code (remove the `nrows` limit) using the full data set." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 2.1 Think Parameters\n", "\n", "
Difficulty level: Moderate; although the code here is simple, but the logic is not.
\n", "\n", "You should always be looking for ways to _avoid_ hard-coding values that might change over time, especially those linked to the date. \n", "\n", "#### 2.1.1 Parameterising File Names\n", "\n", "Ideally, if the data changes on a regular basis, I might try to work out how to make it easy to update the code to download the latest file. For instance, if the file looks like `2022-09-10-listings.csv.gz` then I might well specify the `url` as `{date}-listings.csv.gz` or `{year}-{month}-{day}-listings.csv.gz` and set up the variables that I need beforehand.\n", "\n", "Using parameters makes it easier to write robust code that doesn't have unwanted side-effects. Here's a common one: you write code to download and process a file named `20221111-data.csv.gz`. After doing all the steps in Tasks 2 and 3 below you save it to `clean-data.csv.gz`. \n", "\n", "**Quick Question**: what happens when your boss asks you to process `20221211-data.csv.gz`?\n", "\n", "**Answer**: you are going to lose every single output from the 2022-11-11 data that depends on `clean-data.csv.gz` because they are *all* going to now be generated from the 2022-12-11 data set instead. Worse, you will almost certainly have *no* way of knowing which output came from which data set (because you probably stopped tracking dates very early in your application)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.1.2 Parameterising Column Loading\n", "\n", "In order to avoid having to load lots of data that we aren't sure we need yet, we can restrict the number of rows and columns that we load. We got `cols` by copying the output of `(df.columns.to_list()` and then removing the fields that we thought we _weren't_ interested in." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cols = ['id', 'listing_url', 'last_scraped', 'name', 'description', 'host_id', 'host_name', \n", " 'host_since', 'host_location', 'host_about', 'host_is_superhost', \n", " 'host_listings_count', 'host_total_listings_count', 'host_verifications', 'latitude', 'longitude', \n", " 'property_type', 'room_type', 'accommodates', 'bathrooms', 'bathrooms_text', 'bedrooms', 'beds', \n", " 'amenities', 'price', 'minimum_nights', 'maximum_nights', 'availability_365', 'number_of_reviews', \n", " 'first_review', 'last_review', 'review_scores_rating', 'license', 'reviews_per_month']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.1.3 Parameterising Row Loading\n", "\n", "You could also do something like this to ensure that we don't load all of the rows while we're still working things out:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "del(df) # Check we've deleted the original object" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We use `del(df)` to ensure that we aren't accidentally using the 'old' data frame: after deleting it from Python's memory to free up space, we will read in a new data frame. 
See if you can set pandas to:\n", "- Use the columns specified in `cols`\n", "- Load only the first 10,000 rows using `nrows`\n", "- Don't forget to set `low_memory=False`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "testing = True\n", "\n", "if testing:\n", " df = pd.read_csv(os.path.join(path,fn), \n", " low_memory=False, usecols=??, nrows=10000)\n", "else:\n", " df = pd.read_csv(os.path.join(path,fn), \n", " low_memory=False, usecols=??)\n", "\n", "print(f\"Data frame is {df.shape[0]:,} x {df.shape[1]}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice how this code deliberately works the same for either testing _or_ operational execution -- we just flip between the option by changing the `testing` variable from `True` to `False`! We could use this `testing` variable throughout our code if we wanted to change other behaviours based on development/deployment context." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Task 3: Tidying Up: an Exploration\n", "\n", "
\n", " 🔗 Connections: You will want to refer to the Randomness lecture to understand how we can select the same random sample each time and to the session on Logic lecture to cover NaNs and NAs.
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 3.1 Randomly Selected Rows\n", "\n", "
Difficulty level: Low
\n", "\n", "I often like to start my EDA by simply printint out randomly-selected rows to get a feeling for what's in the data. Does what I see square with what I read? What does the `name` look like? What do I see in `last_scraped` and is it a sensible data? What's the `id` field for?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.sample(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.1.1 Fixing the Columns \n", "If you look very closely, you'll see that pandas isn't showing you the _full_ range of columns since there are 42! If you'd like to only look at specific columns then you can specify them after the sample method call using what looks like a nested list: `[[]]`. I'd like you to sample: 'latitude', 'longitude', 'license', 'property_type', 'room_type' and 'price' only." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.sample(3)[??]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In fact, things are a little more complicated than that: we're implicitly takeing a cut through the data by column (keeping all rows). The *real* syntax is `df[,]`, and if we don't specify both then it defaults to `df[]`. \n", "\n", "Perhaps this will help to make it more clear:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.loc[[4552, 4554, 4556, 4557],['latitude','longitude','property_type','room_type','price']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Compare that with:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.iloc[4552:4557,14:19]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This code *seems* similar: notice that we drop `sample` because we now want to select rows rather than sample them, but what are `iloc` and `loc`? The way I remember it is that `iloc` means *integer* location, while `loc` means *label* location (ok, that should therefore be `lloc`, but you get the idea).\n", "\n", "In this case, the **index** (the numbers down the left-hand side in bold) is numeric, so we can treat it as a *label* (which allows us to use `df.loc`) *or* a list-type index (which allows us to use `df.iloc`). Note, in contrast, that with `loc` we refer to the columns by *label*, whereas with `iloc` we refer to them by *location*. So `.loc` allows us to access rows and columns non-sequentially/randomly by label, while `iloc` allows us to access them as a numeric range.\n", "\n", "Notice how this works differently if we specify a non-numeric index:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.set_index('listing_url').loc[\n", " [f'https://www.airbnb.com/rooms/{x}' for x in df.id.sample(3).to_list()], # <- Try running this list comprehension separately!\n", " ['latitude','longitude','property_type','room_type','price']]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.set_index('listing_url').iloc[0:3,13:18] # <- Notice change in indexing b/c 'listing_url' is no longer a column, it's the index now!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

⚠ Stop: It's vital that you understand how this code works. By which I mean why it does something at all, not exactly how to use loc and iloc (though that is also useful).

\n", "

df.set_index(...) changes the index from the default row number to another field in the data frame. This operation returns a new data frame with listing_url as its index. Because set index returned a data frame, we can simply add another method call (iloc or loc) on to the end of that line and it returns a new data frame in turn! \n", "

The fact that each operation returns a new data frame (or data series) is why you can even do this:

\n", " \n", " df.set_index('listing_url').iloc[0:3].latitude.mean()\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.set_index('listing_url').iloc[0:3].latitude.mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.1.2 Fixing a Random Sample\n", "\n", "See if you can work out from the documentation (Google search time!) how to get the same 'random' sample every time you re-run this code block:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.sample(3, ??)[['latitude','longitude','license','review_scores_rating','price']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 3.2 Dealing with NaNs and Nulls\n", "\n", "
Difficulty level: Hard. There is a lot going on here and you should be paying close attention.
\n", "\n", "If you really dig into the data you will see that a number of data types that aren't 'appropriate' for their contents: the id columns are floats; the dates aren't dates; there's a boolean that's not a boolean... It would be nice to fix these! " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Add some columns here..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Note: I had intended to ask you to fix these by combining code from previous weeks with information provided in the lecture, but it turns out that the InsideAirbnb data set is dirty. There are a lot of NaN values and some of these are deeply problematic for some of the column types in pandas. There are also a number of challenges with other columns so, instead, I've opted to show you how I would clean this data as a first pass to get it into a format where it's tractable for further cleaning.
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.2.1 Identifying Problem Rows\n", "\n", "The reason I'm not asking you to do this part yourselves is that it took me nearly an hour just to work out why I couldn't convert some of the columns to the right data types; then I started finding rows like these:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df[df.price.isna()]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df[df.room_type.isna()]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df[~(df.price.str.startswith('$', na=False))]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that the row with index value `9104` is clearly _wrong_: if I had to guess, I'd say that it's some kind of partial extract/write process because there _are_ elements in there that look right but they are in the wrong columns. So we can _probably_ drop some of these rows, but one thing to do is look at the frequency of NaNs across the data frame _first_. So we need to look for NaNs and Nulls, but it's quite obvious that a `NaN` in the listing id is a basic problem and we should [drop these](https://chrisalbon.com/python/data_wrangling/pandas_dropping_column_and_rows/)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df[df.id.isna()][['id','listing_url','name','description','host_id','host_name','price']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As always, if you don't know that's going on, break it down:\n", "\n", "- You have seen how `[[]]` works, so that's just selecting the columns that we want to show.\n", "- You know how `df[]` works, so that isn't anything really new either!\n", "- So the only really new part is `df.id.isna()`: `df.id` is the `id` column (we could have written this `df['id']` if we wanted) and `isna()` is a test for whether or not a value is NaN. \n", "\n", "So this shows that only one row in the 10,000 row sub-sample has a NaN for its id." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you're not sure what the next line does, try breaking it down by running the inner bits before you run the `drop` command; and also try looking online for examples of how to use `df.drop` (e.g. just up above):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.drop(df[df.id.isna()].index.array, axis=0, inplace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With that really troublesome data out of the way, you can now turn to [counting NaNs or Nulls](https://www.delftstack.com/howto/python-pandas/how-to-count-the-nan-values-in-a-column-in-pandas-dataframe/#df.isnull.sum-method-to-count-nan-occurrences) in the remaining data with a view to identifying other rows that can probably be dropped." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.2.2: Counting Nulls by Column\n", "\n", "As a starting point I would look to drop the columns that contain only NaNs. Remember that we've dropped a row from the data frame so our maximum is now $n-1$)! 
Notice how this next command works:\n", "```python\n", "# returns a data frame with all values set to True/False according to Null status\n", "df.isnull() \n", "# counts these values by column (we'll see another option in a moment)\n", "df.isnull.sum(axis=0) \n", "# Sort results in descending order\n", "df.isnull.sum(axis=0).sort_values(ascending=False) \n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.isnull().sum(axis=0).sort_values(ascending=False)[:12]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The most obvious ones here are: bathrooms, license, and host_about." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.drop(columns=['bathrooms','license','host_about'], inplace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because we have dropped everything `inplace` the code simply runs and doesn't return anything." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.2.3: Counting Nulls by Row\n", "\n", "We now know that there _are_ still quite a few problems, but we do still need a way to identify the rows that are causing most of the problems.\n", "\n", "Notice here that the change from `axis=0` to `axis=1` changes the 'direction' of the `sum` from columns to rows. And we are getting back a data series because the summing operation reduces it to just one column." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.isnull().sum(axis=1).sort_values(ascending=False).head(20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So that is Series showing how many NaN values there are by index value. You should see two columns of numbers: the first is the row id, the second is the number of Nulls in that row.\n", "\n", "If we save the results to a variable called `probs` (i.e. problems) then we can decide what to do next.\n", "\n", "
Note: there's a chance that Python will complain when you try to run the third line of code. This is particularly likely if you are using Anaconda Python directly (i.e. not Docker). In that case you need to add the code listed at the start of Task 6.2.
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "probs = df.isnull().sum(axis=1)\n", "print(type(probs)) # Note that this has returned a series!\n", "probs.plot.hist(bins=30) # Oooooooh, check out what we can do with a series!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looking at this histogram, these look like two groups in the data so I would start there. I would take values greater than 3–5 as being ones that are most likely be problematic. We can use the index from `probs` to select out the rows we want to inspect from the main data frame.\n", "\n", "Here's another bit of code that bears unpacking:\n", "\n", "1. `probs > 7`: this selects only those rows in the 'probs' series whose value is greater than 7\n", "2. `probs[...].index` returns the index values from the Series, which we will then pass to the `drop` command.\n", "3. `df.drop(..., inplace=True)` will then drop the rows selected by `probs[probs>7].index`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.drop(probs[probs > 5].index, inplace=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(f\"Have reduced data frame to: {df.shape[0]} rows and {df.shape[1]} columns\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 3.3 Profiling (Optional)\n", "\n", "
Difficulty level: Low.
\n", "\n", "The Pandas Profiling tool (rebranded a year or so back as [ydata-profiling](https://github.com/ydataai/ydata-profiling)) offers an alternative way of understanding what's going on in your data. The output [looks rather nice](https://docs.profiling.ydata.ai/) and you might be tempted to ask why we didn't use this straight away on the full data set -- well, if you really want to know, see what happens when you profile all 70,000-odd rows and 70-odd columns in the raw data frame... in effect: while it's 'nice to have', the likelihood of crashing your computer increases significantly and it's a bit of a tangent, so that's why it's no longer included in the Docker image.\n", "\n", "If you *do* want to explore this then you'll need to install the library, and **this is a good chance to look at a quite sophisiticated way to install software on another machine**:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "try:\n", " from ydata_profiling import ProfileReport\n", "except ModuleNotFoundError:\n", " !mamba install -y -c conda-forge ydata-profiling\n", " from ydata_profiling import ProfileReport" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.3.1 Specify the Profiling Columns\n", "\n", "Looking back over earlier code see if you can work out how to profile `latitude`, `longitude`,and `review_scores_rating` together." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "profile = ProfileReport(??, title=\"Pandas Profiling Report\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.3.2 Profiling Targets\n", "\n", "You can write the profile either directly into the Jupyter notebook (this file) or into a separate HTML (i.e. Web) page." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "profile.to_notebook_iframe()\n", "# You can also write this profile to a web page:\n", "# profile.to_file(\"your_report.html\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Task 4. Fixing Data Types" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you want to challenge yourself, then I'd suggest trying to work out how to adapt what we saw in previous weeks using the data type dictionary to map column names to column types; however, a more straightforward way to do this is to create different for loops for each:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 4.1 Managing Memory\n", "\n", "
Difficulty level: Low.
\n", "\n", "As to _why_ you'd want to fix your data types, there are two reasons: 1) to ensure that you can make the *most* of your data; 2) to ensure that it takes up as little space as possible in memory. Some simple examples:\n", "\n", "- A column containing only `'True'` (4 bytes) and `'False'` (5 bytes) will take up much more space than a column containing only `True` and `False` (1 bit each).\n", "- A column containing only `'Red'`, `'Green'`, and `'Blue'` (3, 5, and 4 bytes each respectively) will take up much more space that a column where we use the numbers `1, 2, 3` to represent these values and have a map that tells us `1==Red`, `2==Blue`, and `3==Green`.\n", "\n", "Let's test this idea out:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "rtm = df.room_type.memory_usage(deep=True) # Room Type Memory\n", "ctm = df.room_type.astype('category').memory_usage(deep=True) # Categorical Type Memory\n", "\n", "print(f\"The raw memory usage of `room_type` is {rtm/1000:,.0f} Kb.\")\n", "print(f\"The categorical memory usage of `room_type` is {ctm/1000:,.0f} Kb.\")\n", "print(f\"That's {(ctm/rtm)*100:.0f}% of the original!\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "shm = df.host_is_superhost.memory_usage(deep=True) # Super Host Memory\n", "bhm = df.host_is_superhost.replace({'f':False, 't':True}).astype('bool').memory_usage(deep=True) # Boolean Host Memory\n", "\n", "print(f\"The raw memory usage of `host_is_superhost` is {shm/1000:,.0f} Kb.\")\n", "print(f\"The boolean memory usage of `host_is_superhost` is {bhm/1000:,.0f} Kb.\")\n", "print(f\"That's {(bhm/shm)*100:.0f}% of the original!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 4.2 Boolean Values\n", "\n", "
Difficulty level: Moderate.
\n", "\n", "Let's start with columns that are likely to be boolean:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bools = ['host_is_superhost']\n", "df.sample(5, random_state=43)[bools]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we have to `map` 't' to True and 'f' to False _before_ converting the column to a boolean type. If you simply tried to replace them with the strings 'True' and 'False', then the conversion would run into the same problem as Week 3: any string that is not `None` will convert a True boolean." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# This approach requires us to map 't' \n", "# and 'f' to True and False\n", "for b in bools:\n", " print(f\"Converting {b}\")\n", " df[b] = df[b].replace({'f':False, 't':True}).astype('bool')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.sample(5, random_state=43)[bools]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 4.3: Dates\n", "\n", "
Difficulty level: Hard.
\n", "\n", "I've found dates to be particularly challenging, though pandas has _tried_ to make this process less painful than before using the `infer_datetime_format` option. What can be particularly frustrating is if a row has a non-sensical date value (e.g. `t`, as happened in 2019/20 when I taught this module) in _one_ of the datetime columns and this causes the type conversion to fail. When that happens, pandas is not always great about communicating where a problem occurred and I eventually worked it out by trying to convert _parts_ of each series in the data frame to a datetime until I had a block that failed. I then knew that I could narrow this down using integer location indexing." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dates = ['last_scraped','host_since','first_review','last_review']\n", "df.sample(5, random_state=43)[dates]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for d in dates:\n", " print(\"Converting \" + d)\n", " df[d] = pd.to_datetime(df[d], infer_datetime_format=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.sample(5, random_state=43)[dates]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Of course, it's not actually clear there what has changed! But if you dig a little more deeply:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(df[dates[1]].dtype, \"\\n\")\n", "df.sample(5, random_state=45)[dates[1]].dt.strftime('%A %B %d, %Y')\n", "# Try some other columns and other formats!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In that line of code we:\n", "- Took a random sample (setting the state to 45)\n", "- Took the first column from the dates list (`dates[1]`)\n", "- Used the _date_ accessor method \n", "- And called `string format time` with the format `%A %B %d, %Y` (Full Day of Week, Month Name, Date, 4-digit Year)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 4.4: Categories\n", "\n", "
Difficulty level: Moderate.
\n", "\n", "We know that these are likely to be categories because there'd be no other way to allow users to effectively search Airbnb." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cats = ['property_type','room_type']\n", "df.sample(5, random_state=42)[cats]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This next piece of code is quite useful for grouping and counting operations: we are counting the occurences of each unique value in part particular column or combination of columns:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df[cats[0]].value_counts()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df[cats[1]].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
💡 Tip: one column has many different values (including Campers/RVs and Yurts!), the other has just four. If I were looking to conduct research I'd probably start with the room_type column since I may not care about hotels and therefore never even need to decide whether I care about boutique ones!
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for c in cats:\n", " print(f\"Converting {c}\")\n", " df[c] = df[c].astype('category')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(df[cats[1]].cat.categories.values)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.sample(5, random_state=42)[cats]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 4.5: Dealing with Strings\n", "\n", "
Difficulty level: Hard.
\n", "\n", "We'll have to put some more work into deal with the description and other more free-from text fields later in the term, but for now let's just deal with a straightforward one: price!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "money = ['price']\n", "df.sample(5, random_state=42)[money]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**You will get an error when you run this**, that's because I want you to do a little thinking about how to extend the code to fix the _second_ problem. It's the same type of problem, you'll just need to figure out how to solve it by adapting code you've already got!\n", "\n", "
Note: For now don't worry about what regex=False means. It will all make sense when we get to dealing with text.
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for m in money:\n", " print(f\"Converting {m}\")\n", " df[m] = df[m].str.replace('$','', regex=False).astype('float')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for m in money:\n", " print(f\"Converting {m}\")\n", " df[m] = df[m].str.replace('$','', regex=False).str.replace(??).astype('float')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.sample(5, random_state=42)[money]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.sort_values(by=money[0], ascending=False).head(10)[['id','name','price','minimum_nights']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 4.6: Dealing with Integers\n", "\n", "
Difficulty level: Hard.
\n", "\n", "This is the issue that made me abandon the idea of making you clean the data yourselves. Although _floats_ have no issues with `np.nan` in the Series, by default there are no numpy integer arrays that can cope with NaNs. This was such a major issue for Pandas that they've actually created their _own_ data type that does support NaN values in integer columns. There are a lot of integer columns, but only one of them seems to be a problem." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ints = ['id','host_id','host_listings_count','host_total_listings_count','accommodates',\n", " 'beds','minimum_nights','maximum_nights','availability_365']\n", "for i in ints:\n", " print(f\"Converting {i}\")\n", " try:\n", " df[i] = df[i].astype('float').astype('int')\n", " except ValueError as e:\n", " print(\" - !!!Converting to unsigned 16-bit integer!!!\")\n", " df[i] = df[i].astype('float').astype(pd.UInt16Dtype())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So we convert the column but using a `try / except` approach that allows to trap `ValueError` exceptions triggered by the presence of NaNs in the column. The following code tells us that there are just eight of these in the 10k sample, but they're enough to cause the code to fail if you don't trap them. The alternatives would be to: a) drop those rows; or b) leave the data as floats. For some reason the latter offends my sense of order, and the former feels like avoiding the problem rather than dealing with it." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.beds.isna().value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 4.7: Validation\n", "\n", "
Difficulty level: Low.
\n", "\n", "Ordinarily, at this point I would then output information to confirm that all of the opeations I _think_ I've undertaken were correctly applied." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 4.8: Saving\n", "\n", "At this point I would save a copy of the cleaned data. And I would only consider this data _partially_ cleaned since we've not made it any further than just ensuring that each column is in an appropriate format and that some particularly problematic rows have been dropped!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "path = os.path.join('data','clean')\n", "\n", "if not os.path.exists(path):\n", " print(f\"Creating {path} under {os.getcwd()}\")\n", " os.makedirs(path)\n", " \n", "df.to_csv(os.path.join(path,fn), index=False)\n", "print(\"Done.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Feather is an alternative format (gradually being replaced by parquet, which is more widely supported) for data interchange between R and Python: it's fast, it preserves data types, it's compressed, and it will avoid the kinds of the problems that come up when you move to/from CSV as a default." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Task 5. Selection using Criteria\n", "\n", "So far we've been taking primarily a row and column view of the data, now we want to think about selecting ranges from within the data set..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 5.1: Selecting using Data Types\n", "\n", "
Difficulty level: Low.
\n", "\n", "If we wanted to filter in/out certain columns pandas can do that! Let's try for floats and ints (*hint*: these are 64-bit data types)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.select_dtypes(include=[??])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 5.2: Selecting using Conditions\n", "\n", "
Difficulty level: Hard.
\n", "\n", "What if we wanted to find whole homes listings for more than $100/night? \n", "\n", "To do this we use a combination of the selection approaches above in combination with conditionals, but first we need to see what sort of properties there are in the data set! `groupby` is a really useful function that we'll come back to later in the term, but for now notice that it helps us to group the analysis by `room_type` so that subsequently asking for the `property_type` value counts allows the same `property_type` to appear in more than once place if it's associated with more than one `room_type`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.groupby('room_type')['property_type'].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now try to select only the `Entire home/apt` room type:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df[df.??=='??']['property_type'].value_counts().head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Your output should be:\n", "```\n", "Entire rental unit 3907\n", "Entire home 821\n", "Entire condo 246\n", "Entire townhouse 161\n", "Entire loft 77\n", "Entire serviced apartment 34\n", "Entire guest suite 17\n", "Entire cottage 8\n", "Entire guesthouse 8\n", "Boat 7\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 5.3: Arbitrary Selection Criteria\n", "\n", "
\n", " Difficulty Level: Moderate, if the previous section made sense to you.\n", "
\n", "\n", "Let's get a sense of where the mean and median value fall:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(f\"The mean price is ${df.price.??():0.2f}\")\n", "print(f\"The median price is ${df.price.??():0.2f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You should see that the mean is higher than the median price but both are _very_ roughly plausible values. Given your understanding of distributions from, say, Quantitative Methods, what can you say about the pricing distribution of Airbnb units?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You might want to have a [look at the documentation](http://pandas.pydata.org/pandas-docs/stable/api.html#series): it's rather a long list, but most of your descriptive stats are on that page in the [Cumulative / Descriptive Stats](http://pandas.pydata.org/pandas-docs/stable/api.html#computations-descriptive-stats) section, and there's also lots of information about methods for [strings](http://pandas.pydata.org/pandas-docs/stable/api.html#string-handling) and [categorical data](http://pandas.pydata.org/pandas-docs/stable/api.html#categorical)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So I would take `Entire home/apt` as the right value to filter on _together with_ the price per night from the `price` column. Why don't we set 'pricey' listings to Entire Home/Apartment listings costing more than the mean of $167.72/night (_note_: this is totally arbitrary)?\n", "\n", "Here we see two things:\n", "\n", "1. The use of the bitwise `&` (it's *not* the same as `and`).\n", "2. The fact that you need parentheses around the selection in order to make the the `&` work.\n", "\n", "But this will filter on two criteria simultaneously:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pricey = df[(??) & (df.price>df.price.??)]\n", "print(f\"Selected {pricey.shape[0]:,} rows\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This should select slightly more than 50% of the data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 5.4: Selection with an Aggregate\n", "\n", "
Difficulty level: Low.
\n", "\n", "Let's find the cheapest and most expensive listings using `min` and `max` methods:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df[df.price==df.price.??()][['price','id','listing_url','room_type','description']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You should get one or more units priced at very low levels... or even (seemingly) for free!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df[df.price==df.price.max()][['price','id','listing_url','room_type','description']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You should see one or more units priced at exceedingly high levels... and here's a way to see a few more of these budget-busting options." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.sort_values(by='price', ascending=False).head(3)[\n", " ['price','listing_url','room_type','description']\n", "]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
⚠ Stop: Ask yourself if the result is plausible.
\n", "\n", "**Quick question**: what do you make of this result?\n", "\n", "**Answer**: Try following some of the listing_urls." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 5.5: Selection with a Range\n", "\n", "
Difficulty level: Moderate
\n", "\n", "Perhaps we aren't just looking for extremes... how about all of the properties falling within the middle of the distribution? We can ask for any abitrary quantile we like, so let's go with the 25th and 75th percentile to get the middle 50% of the data. Google how to get percentiles from pandas." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dfr = df[ \n", " (df.price > df.price.quantile(??)) & \n", " (df.price < df.price.quantile(??)) ]\n", "\n", "print(f\"Lower Quartile: {df.price.quantile(??):>6.2f}\")\n", "print(f\"Upper Quartile: {df.price.quantile(??):>6.2f}\")\n", "print()\n", "print(f\"Range selected contains {dfr.shape[0]:,} rows.\")\n", "print(f\"Minimum price: {dfr.price.??():>6.2f}\")\n", "print(f\"Maximum price: {dfr.price.??():>6.2f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That example contains a few things to which you need to pay attention:\n", "1. *Again* you can see that, with mutiple selections, we had to put parentheses around each one -- this forces Python to...\n", "2. Process the `&` (bit-wise AND) that asks pandas to \"Find all the rows where condition 1 _and_ condition 2 are both `True`\". So it calculates the `True`/`False` for the left side and the `True`/`False` for the right side of the `&`, and then combines them." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Task 6: Enhancing our Understanding" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 6.1 Deriving a New Variable\n", "\n", "
Difficulty level: 🤯
\n", "\n", "Let's try calculating several derived measures of distribution for the price... these deliberately demonstrate different ways of handling this process (and notice also the little call to `apply` that can perform additional tasks)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 6.1.1 The *Z*-Score\n", "\n", "The z-score is given by $z = (x - \\bar{x})/\\sigma$." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df['z'] = (df.?? - df.??.??()) / df.??.??()\n", "df.z.describe().apply(lambda x: f\"{x:5.5f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 6.1.2 Inter-Quartile Standardisation\n", "\n", "The IQR-standardised score is given by $i = (x - Q_{1})/(Q_{3} - Q_{1})$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df['iqs'] = (df.price - ??)/(??-??)\n", "df.iqs.describe().apply(lambda x: f\"{x:5.5f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 6.1.3 Log-Normalisation\n", "\n", "The natural log of the price is gven by $ln(x)$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df['lnprice'] = np.log(??)\n", "df.lnprice.describe().apply(lambda x: f\"{x:5.5f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 6.2. Quick (and Dirty) Plotting\n", "\n", "One of the first things we should do when exploring a new dataset is plot (aka graph) the data. We've left plotting until a little later in this practical so that we could see some other basic attributes of how pandas stores data. We'll look at plotting and exploratory data analyses in much more detail next week, including using packages other than pandas. \n", "\n", "For now, let's look at the basic plotting functionality pandas provides - in conjunctions with the online documentation for both [DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/index.html) and [Series](https://pandas.pydata.org/pandas-docs/stable/reference/index.html). There are also examples of all [the different types of plots pandas can produce](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
⚠ Warning: MacOS users who are not using Docker will need to do certain things in a specific order at the start of any notebook in order to show maps or graphs. Please make a copy of the following code for any notebook that you create and make it the first code that you run in the notebook...
\n", "\n", "```python\n", "# Needed on a Mac\n", "import matplotlib as mpl\n", "mpl.use('TkAgg')\n", "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 6.3: Histograms\n", "\n", "
Difficulty level: Low
\n", "\n", "First, let's see some of the ways we could visualise the distribution of the `Series` in the dataset:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.price.plot.hist() # histogram" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If the code worked properly you should have just created a standard [histogram](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.plot.hist.html) plot (if you can't see one, ask for help). \n", "\n", "You can also pass in simple options to improve the plotting:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.price.plot.hist(bins=??); # Notice the ';' here to suppress ``" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 6.4: KDE Plots\n", "\n", "
Difficulty level: Low
\n", "\n", "Similarly, we can produce a [Kernel Density Estimate plot](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.plot.kde.html) (modifying the limits of the x-axis using `xlim`):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.price.plot.kde(xlim=(0,??)); #kernel density estimate plot" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Kind of handy, no? These aren't the _best_ looking plots, but they are all being generated on-the-fly for you by pandas with no more than a cheery `DataFrame.Series.plot.`! Since those plots are all just method calls, many of them take optional parameters to change the colour, the notation (scientific or not), and other options. For example, many of the documentation pages linked to above are rather brief, but include a link to [the general options that can be applied to all `Series.plot`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.plot.html) calls.\n", "\n", "This is why we like pandas: it allows us to be _constructively lazy_. We don't need to know _how_ a draw a KDE plot (though it always helps if you don't see what you expected), we just need to know that pandas provides a method that will do it for you. And _that_ is why it's always worth having a [look at the documentation](https://pandas.pydata.org/pandas-docs/stable/reference/index.html). \n", "\n", "#### 6.4.1 A Slight Case of Over-Plotting\n", "\n", "Generally, Jupyter is clever enough to overlay plots one on top of the other if you call them all in the same cell. We'll see ways to gain more control later, but this is still a good start!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.z.plot.kde(xlim=[-2, 10])\n", "df.iqs.plot.kde(xlim=[-2, 10])\n", "df.lnprice.plot.kde();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 6.5: Boxplots\n", "\n", "
Difficulty level: Low
\n", "\n", "A standard [boxplot](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.plot.box.html):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.price.plot.box(figsize=(4, 8)); # boxplot" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 6.6: Scatterplots\n", "\n", "
Difficulty level: Low
\n", "\n", "We can also plot two variables in a [scatter plot](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.scatter.html#pandas.DataFrame.plot.scatter) by applying a plot method to the `DataFrame` (not an individual `Series`):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.plot.scatter(x='longitude', y='latitude', c='price', s=2, cmap='viridis', figsize=(15,10))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note how the code above has the form `DataFrame.plot.`, not `DataFrame.Series.plot.` as in the prior plots. Think about why this then means we need the `x` and `y` arguments. \n", "\n", "Looking at the plot produced, it's hard to see where the high values are, so we might want to think about ways that we could make it easier to spot the big numbers... We could, for instance, also vary the size of the point in a plot by some variable, but why does the following not really work?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tmp = df[df.price>0]\n", "tmp.plot.scatter(x='longitude', y='latitude', c='price', s=(tmp.price/tmp.price.min()), cmap='viridis', figsize=(15,10))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And we can plot subsets of our data without creating a new object. See if you can work out what the following code is doing that is different from the last plot:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df[df.price > df.price.quantile(0.90)].plot.scatter(x='longitude', y='latitude', c='price', cmap='viridis', s=8)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 6.7: Hex Bin Plots\n", "\n", "
Difficulty level: Low
\n", "\n", "And pandas allows us to create 'less standard' plots, like a [hex bin plot](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.hexbin.html#pandas.DataFrame.plot.hexbin):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.plot.hexbin(x='longitude', y='latitude', gridsize=50, figsize=(10,7)) #hex bin plot" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's just a taste of what the basic plotting functionality of pandas can do. Feel free to explore more yourself and we'll also see [the seaborn package](http://seaborn.pydata.org/index.html) later." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Credits!\n", "\n", "#### Contributors:\n", "The following individuals have contributed to these teaching materials: James Millington (james.millington@kcl.ac.uk), Jon Reades (jonathan.reades@kcl.ac.uk)\n", "\n", "#### License\n", "These teaching materials are licensed under a mix of [The MIT License](https://opensource.org/licenses/mit-license.php) and the [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 license](https://creativecommons.org/licenses/by-nc-sa/4.0/).\n", "\n", "#### Acknowledgements:\n", "Supported by the [Royal Geographical Society](https://www.rgs.org/HomePage.htm) (with the Institute of British Geographers) with a Ray Y Gildea Jr Award.\n", "\n", "#### Potential Dependencies:\n", "This notebook may depend on the following libraries: pandas, matplotlib" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.5" } }, "nbformat": 4, "nbformat_minor": 4 }