{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# IMDb Day 3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The format of the IMDb file is as follows:\n", "\n", "- Each record is on a separate line\n", "- Columns are separated by the `|` character\n", "- The header line starts with `#`\n", "\n", "An example of an IMDb file with the header line and the top two records is shown below:\n", "\n", "<img src=\"../../lectures/img/header_imdb.png\" alt=\"Drawing\" style=\"width: 1200px;\"/> " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Find the number of unique genres\n", "Using the data provided in `250.imdb`, find the total number of unique genres. It is recommended to use `set` to help filter out duplicates.\n", "\n", "Note: Be mindful of case sensitivity (e.g., \"Action\" and \"action\" should be considered the same genre).\n", "\n", "__Hint__: The correct answer is 22." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Find the number of movies per genre\n", "\n", "Correct answers:\n", "\n", "<img src=\"../../lectures/img/movie_dict.png\" alt=\"Drawing\" style=\"width: 500px;\"/> " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. (Optional/Extra) What is the average length of the movies (hours and minutes) in each genre?\n", "\n", "Here you have to loop twice!\n", "\n", "Correct answers:\n", "\n", "<img src=\"../../lectures/img/average_length.png\" alt=\"Drawing\" style=\"width: 500px;\"/> " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. 
(Advanced) Re-structure and write the output to a new file as below\n", "\n", "<img src=\"../../lectures/img/re-structured.png\" alt=\"Drawing\" style=\"width: 400px;\"/> \n", "\n", "Note:\n", "- Use a text editor, not a notebook, for this\n", "- Use functions as much as possible\n", "- Use `sys.argv` for input/output\n", "\n", "<br><br><br><br><br><br><br>" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Tips if you're unsure how to start\n", "\n", "As with everything in coding, there are many different ways of writing code that achieve the same end result. Below is one way of thinking about these problems; there are of course many others.\n", "\n", "### 1. Find the number of unique genres\n", "\n", "1. Create an empty list outside the loop where you will collect all the different genres\n", "2. Start by reading the file and splitting up the columns, just as you did in yesterday's exercise\n", "3. Identify the column where all the genres for a movie are listed, and split this column into a list\n", "4. Loop over this list of genres and add each genre to your empty list from step 1, UNLESS IT IS ALREADY THERE\n", "5. After looping over all lines, check the length of your list from step 1\n", "\n", "### 2. Find the number of movies per genre\n", "\n", "1. Use the code from above, but instead of creating an empty list before starting to loop over the file, create an empty dictionary\n", "2. When looping over the genres, check whether each genre is in the dictionary. If it is not, add it and assign it the value 1. If it is already present, increase its value by 1.\n", "\n", "### 3. What is the average length of the movies (hours and minutes) in each genre?\n", "\n", "1. Use the code above, but instead of assigning the value 1 to each genre initially, add the runtime of the movie as a list item\n", "2. 
For each new movie, append the runtime to the existing list, so that by the end of the loop you have, for each genre, a list of the runtimes of all movies in that genre\n", "3. Loop over the dictionary and calculate the average of each list\n", "4. Convert the average (which is in seconds) to hours and minutes by dividing appropriately\n", "5. Print the results, or save them to a variable or file\n", "\n", "### 4. Re-structure and write the output to a new file as below\n", "\n", "1. Use the code above, but instead of adding only the runtime as a list element for each genre, add a list (or tuple) of items (rating, movie, year, runtime). In the end you will have, for each genre, a list of lists (or tuples) containing all the relevant information for each movie\n", "2. Loop over the dictionary and write its content to a new file with the correct formatting" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Answers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1. 
Find the movie with the highest rating" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Movie(s) with highest rating 9.3:\n", "The Shawshank Redemption\n" ] } ], "source": [ "# Code snippet for finding the movie with the highest rating\n", "# Note that this is just one of the solutions\n", "with open('../../downloads/250.imdb', 'r') as fh:\n", "    movieList = []\n", "    highestRating = -100\n", "\n", "    for line in fh:\n", "        if not line.startswith('#'):  # skip the header line\n", "            cols = line.strip().split('|')\n", "            rating = float(cols[1].strip())\n", "            title = cols[6].strip()\n", "            movieList.append((rating, title))\n", "            if rating > highestRating:\n", "                highestRating = rating\n", "    print(\"Movie(s) with highest rating \" + str(highestRating) + \":\")\n", "    # print every movie that shares the highest rating\n", "    for i in range(len(movieList)):\n", "        if movieList[i][0] == highestRating:\n", "            print(movieList[i][1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2. 
Find the number of unique genres" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['drama', 'war', 'adventure', 'comedy', 'family', 'animation', 'biography', 'history', 'action', 'crime', 'mystery', 'thriller', 'fantasy', 'romance', 'sci-fi', 'western', 'musical', 'music', 'historical', 'sport', 'film-noir', 'horror']\n", "22\n" ] } ], "source": [ "# Code snippet for finding the number of unique genres\n", "# Note that this is just one of the solutions\n", "with open('../../downloads/250.imdb', 'r') as fh:\n", "    # empty list to start with\n", "    genres_list = []\n", "    # iterate over the file\n", "    for line in fh:\n", "        if not line.startswith('#'):\n", "            # split the line into a list of columns on the | separator\n", "            cols = line.strip().split('|')\n", "            # extract the genre column and split it into a list of genres\n", "            genres = cols[5].strip().split(',')\n", "            # add each genre (lowercased) to the list if it is not already there\n", "            for genre in genres:\n", "                if genre.lower() not in genres_list:\n", "                    genres_list.append(genre.lower())\n", "\n", "print(genres_list)\n", "print(len(genres_list))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " ### 3. (Optional/Extra) What is the average length of the movies (hours and minutes) in each genre?" 
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The average length for movies in genre drama is 2h14min\n", "The average length for movies in genre war is 2h30min\n", "The average length for movies in genre adventure is 2h13min\n", "The average length for movies in genre comedy is 1h53min\n", "The average length for movies in genre family is 1h44min\n", "The average length for movies in genre animation is 1h40min\n", "The average length for movies in genre biography is 2h30min\n", "The average length for movies in genre history is 2h47min\n", "The average length for movies in genre action is 2h18min\n", "The average length for movies in genre crime is 2h11min\n", "The average length for movies in genre mystery is 2h3min\n", "The average length for movies in genre thriller is 2h11min\n", "The average length for movies in genre fantasy is 2h2min\n", "The average length for movies in genre romance is 2h2min\n", "The average length for movies in genre sci-fi is 2h6min\n", "The average length for movies in genre western is 2h11min\n", "The average length for movies in genre musical is 1h57min\n", "The average length for movies in genre music is 2h24min\n", "The average length for movies in genre historical is 2h38min\n", "The average length for movies in genre sport is 2h17min\n", "The average length for movies in genre film-noir is 1h43min\n", "The average length for movies in genre horror is 1h59min\n" ] } ], "source": [ "# Code Snippet for calculating the average length of the movies (in hours and minutes) for each genre\n", "# Note that this is just one of the solutions\n", "with open('../../downloads/250.imdb', 'r') as fh:\n", "    genreDict = {}\n", "\n", "    for line in fh:\n", "        if not line.startswith('#'):\n", "            cols = line.strip().split('|')\n", "            genre = cols[5].strip()\n", "            glist = genre.split(',')\n", "            runtime = cols[3]  # length of movie in seconds\n", "            for entry in 
glist:\n", "                if entry.lower() not in genreDict:\n", "                    genreDict[entry.lower()] = []  # start an empty list of runtimes for a new genre\n", "                genreDict[entry.lower()].append(int(runtime))  # append runtime to the genre's list\n", "\n", "# the with block has already closed the file, so no fh.close() is needed\n", "for genre in genreDict:  # loop over the genres in the dictionary\n", "    average = sum(genreDict[genre])/len(genreDict[genre])  # calculate average length per genre\n", "    hours = int(average/3600)  # whole hours in the average\n", "    minutes = (average - (3600*hours))/60  # remaining seconds converted to minutes\n", "    print('The average length for movies in genre ' + genre + ' is ' + str(hours) + 'h' + str(round(minutes)) + 'min')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4. Re-structure and write the output to a new file as below" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Example code can be found at https://uppsala.instructure.com/courses/99844/modules/items/1111740" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.0" } }, "nbformat": 4, "nbformat_minor": 4 }