{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise 11.1.\n", "\n", "Download the CSV file ‘nobel.csv’: \n", "https://edu.nl/3xmbd \n", "\n", "Using the data in this file, try to answer the following questions:\n", "* How many Nobel laureates are described in this file?\n", "* How many of these are from the Netherlands? What are the names of the Dutch Nobel prize winners?\n", "* How many female Nobel laureates are there in this data set? What are their names?\n", "* Who was the youngest person ever to receive the Nobel Prize? Who was the oldest person? \n", "Tip: create a new column called ‘Age’, in which the year in which the prize was awarded is subtracted from the year of birth. The year of birth can be extracted from the date of birth using the following code: \n", "> `nobel['Birth Year'] = pd.to_datetime(nobel['Birth Date']).dt.year`\n", "\n", "\n", "* For each discipline, calculate the average age of the Nobel laureates. \n", "* Create a list of all the unique countries, and calculate the number of Nobel laureates per country." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "nobel = pd.read_csv( 'nobel.csv' )\n", "\n", "# Each row describes a nobel laureate\n", "# The number of rows and columns can be found using 'shape'\n", "# 'shape' results in a so-called tuple, with two values. To print only the first value,\n", "# use shape[0]\n", "print( 'Number of nobel laureates: {}'.format(nobel.shape[0] ) )" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## Number of Dutch nobel laureates\n", "dutch = nobel[ nobel['Birth Country'] == 'Netherlands' ]\n", "print( '\\n\\nNumber of Dutch Nobel laureates: {}'.format(dutch.shape[0] ) )\n", "\n", "for index, row in dutch.iterrows():\n", " print( row['Full Name'] )\n", " " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## Number of Female nobel laureates\n", "female = nobel[ nobel['Sex'] == 'Female' ]\n", "print( '\\n\\nNumber of female Nobel laureates: {}'.format(female.shape[0] ) )\n", "\n", "for index, row in female.iterrows():\n", " print( row['Full Name'] )\n", " " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# extract the year of birth from the birth date\n", "nobel['Birth Year'] = pd.to_datetime(nobel['Birth Date']).dt.year\n", "\n", "# calculate age by subtracting birth year from year of nobel prize \n", "nobel['Age'] = nobel['Year'] - nobel['Birth Year']\n", "\n", "# find minimum and maximum in column 'Age'\n", "youngest = nobel['Age'].min()\n", "oldest = nobel['Age'].max()\n", "\n", "print('\\n')\n", "# print the names of these Nobel laureates, using iterrows()\n", "for index, row in nobel.iterrows():\n", " if row['Age'] == youngest:\n", " print('The youngest Nobel laureate was {}, aged {}.'.format( row['Full Name'] , int(row['Age'])))\n", " if row['Age'] == oldest:\n", " print( 'The oldest Nobel laureate was {}, aged {}.'.format( row['Full Name'] , row['Age']) )" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# For each discipline, calculate the average age of the Nobel laureates.\n", "# First, divide the data frame into groups based on 'Category', using groupby('Category')\n", "# In these groups, select only the 'Age' column, by appending ['Age']\n", "# Next calculate the mean. The result is a Series object. \n", "# You can get the indices of this Series using the 'index' property\n", "print('\\nThe average age in each category:')\n", "average_age = nobel.groupby('Category')['Age'].mean()\n", "for c in average_age.index:\n", " print( '{}: {}'.format( c , round( average_age[c] , 1 ) ) )\n", "\n", " " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create a list of all the unique countries, and calculate \n", "# the number of Nobel laureates per country.\n", "# First, divide the data frame into groups based on 'Birth Country', using groupby('Category')\n", "# In these groups, select only the 'Laureate ID' column, by appending ['Laureate ID']\n", "# Next, calculate the number of rows in each group, using count() \n", "# The result is a Series object. \n", "# You can get the indices of this Series using the 'index' property \n", "\n", "print('\\nThe number of Nobel laureates per country')\n", "countries = nobel.groupby('Birth Country')['Laureate ID'].count()\n", "for c in countries.index:\n", " print( '{}: {}'.format( c , countries[c] ) )\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise 11.2.\n", "\n", "Download the Excel file names 'Grades.xsl' from the following address: \n", "\n", "https://edu.nl/m6w8n\n", "\n", "This spreadsheet contains the partial grades received in a certain course. \n", "\n", "Write code which can calculate and print the final grade for each student. The essay and the exam both count for 40% the grade. The remaining 20% is determined by the presentation. All grades should have one digit after the decimal point. \n", "\n", "Print a list with the full names of the students, together with their final grade for the course.\n", "\n", "Additionally, answer the following questions:\n", "* Which student has received the highest mark for the essay?\n", "* Which student scored worst on the presentation? \n", "* How many students have received a 6 or lower for the exam?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "df = pd.read_excel( 'Grades.xlsx' , sheet_name = \"Sheet1\")\n", "\n", "df.columns.tolist()\n", "\n", "df['final_mark'] = round( 0.4 * df['Essay'] + 0.4 * df['Exam'] + 0.2 * df['Presentation'] , 1 ) \n", "\n", "df = df.sort_values(['Last'])\n", "\n", " \n", "best_essays = df[ df['Essay'] == df['Essay'].max() ] \n", "worst_presentations = df[ df['Presentation'] == df['Presentation'].min() ] \n", "insufficient_exam = df[ df['Exam'] < 6 ] \n", "\n", "print('Best essay:')\n", "for index , row in best_essays.iterrows():\n", " print( '{} {}: {}'.format( row['First'] , row['Last'] , row['Essay'] ) )\n", " \n", " \n", "print('\\nWorst presentation:')\n", "for index , row in worst_presentations.iterrows():\n", " print( '{} {}: {}'.format( row['First'] , row['Last'] , row['Presentation'] ) )\n", "\n", "print('\\nThe following students received an insufficient mark for the exam:')\n", "for index , row in insufficient_exam.iterrows():\n", " print( '{} {}: {}'.format( row['First'] , row['Last'] , row['Exam'] ) )\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise 11.3\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the exercise, we make use of the following data set: \n", "\n", "> Fatah-Black, Dr KJ (Leiden University) (2017): Lijsten van de leveranties, met de namen van de kopers en betaalde prijzen op de verkopingen van de VOC kamer Zeeland, 1725 – 1777. DANS. https://doi.org/10.17026/dans-23z-ez6f\n", "\n", "This data set lists the goods that have been sold at VOC auctions in Zeeland in the 18th century. For each transaction, data is available about the name of the buyer, the product that was bought, the date of the transaction and the price that was paid. The price is specified in Flemish pounds and groten. 1 Flemish pound was equal to 20 groten. \n", "\n", "For the purpose of this exercise, the original data set has been changed slightly. Some rows containing invalid data were removed. Column names and values were also translated into English. The modified CSV file can be downloaded from https://tdm.universiteitleiden.nl/Python/voc-data.csv\n", "\n", "As this data set offers information about the sale of coffee only, we can use these data to trace the historical development of the price of one pound of coffee during the period that is covered by the data set. **Try to create a CSV file containing the following two columns: the full date and the average price of a pound of coffee on that date**. \n", "\n", "Follow the steps below.\n", "\n", "Firstly, add code below to download the data set. The code to write the file to disk is already provided." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import requests\n", "\n", "# FIXME: add code to download the CSV file from the URL 'https://tdm.universiteitleiden.nl/Python/voc-data.csv'\n", "# Make sure to assign the file contents to the data_file variable\n", "response = requests.get('https://tdm.universiteitleiden.nl/Python/voc-data.csv')\n", "if response.status_code == 200:\n", " data_file = response.text\n", "\n", "# Write to voc-data.csv in the current working directory\n", "with open('voc-data.csv', 'w', encoding='utf-8') as output_file:\n", " output_file.write(data_file)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once you have downloaded the CVS file, you can access the data values in this file using the code below. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "# Let's load the data into a data frame named `voc`:\n", "voc = pd.read_csv('voc-data.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To explore the contents data set, write some code to see the number of rows and columns, the columns names and all the values in te first 5 rows. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print( voc.shape )\n", "print( voc.columns.tolist() )\n", "voc.head(5) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One difficulty in the original data set is that the year, the month and the day are captured in different columns. \n", "\n", "In a time series visualisation, dates can best be represented using a variable of the type `datetime`. Pandas has a convenient method named [`pd.to_datetime()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html) which can assemble a datetime out of multiple columns. \n", "\n", "Write some code to add a new column to the existing data frame which can can create a new `datetime` object by combining the values found in the columns named `year`, `month` and `day`. Save this new datetime object in a column named 'date'. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "voc['date'] = pd.to_datetime(voc[['year', 'month', 'day']])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, try to calculate the total price that was paid for the product in each individual transaction. Store the result of this calculation in a column named ‘total price’. The total price can be calculated by adding the values in the column ‘flemish_pounds' and the values in the column ‘groten’, divided by 20. \n", "\n", "To be able to compare the prices, they obviously need to be normalised. This can be done by dividing the total price by the amount of coffee that purchased. All quantities are given in pounds, so if we calculate the values in the ‘total price’ column by the quantities, this should result in the price per point. Save the results of this calculation in a column named ‘price_per_pound’. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "voc['total_price'] = voc['flemish_pounds'] + (voc['groten'] / 20)\n", "voc['price_per_pound'] = voc['total_price'] / voc['quantity']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data set contains a number of records with missing data. In such cases, the values in the empty columns are represented using the code `NaN`, which stands for ‘Not a Number’. To remove all the records in which at least on of the column contains a missing value, we can make use of the dropna() method. If we only want to remove records if the have a `NaN` value in a specific column, we can work with the `subset` parameter, which needs to refer to the name of this column, as follows:\n", "\n", "`df.dropna(subset = ['column_name'])`\n", "\n", "Add some code which can remove records in which the 'price_per_pound' column has a `NaN` value. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "voc = voc.dropna(subset = ['price_per_pound'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is one final operation we need to complete to arrive at the correct values. The original data set contains data about the prices paid in individual transactions. We want to know the average prices per day. Calculate the average prices per day, by making use use of the `groupby()` and the `mean()` function. \n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "voc = voc.groupby('date').mean() " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, save the data frame as a CSV file using Panda's `to_csv` method." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Use Pandas' built-in functionality\n", "voc.to_csv('prices_of_coffee_over_time.csv')" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 2 }