# IMDb Day 3

The format of the IMDb file is as follows:

- Each record is on a separate line
- Columns are separated by the `|` character
- The header line starts with `#`

An example of an IMDb file with the header line and the top two records is shown below:

<img src="../../lectures/img/header_imdb.png" alt="Drawing" style="width: 1200px;"/> 
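As a minimal sketch of how one such line can be taken apart in Python: the record below is made up for illustration (it is not copied from `250.imdb`), the helper name `parse_record` is hypothetical, and the column positions for rating, runtime, genres and title are the ones used in the answers further down.

```python
# A made-up record in the pipe-separated format (illustrative only, not from 250.imdb)
sample_line = "1000| 8.0|1999|7200|https://example.org/movie|Drama,War|Some Movie\n"

def parse_record(line):
    """Split one record on '|' and strip whitespace from every column."""
    return [col.strip() for col in line.strip().split('|')]

if not sample_line.startswith('#'):   # header lines start with '#'
    cols = parse_record(sample_line)
    rating = float(cols[1])           # rating as a decimal number
    runtime = int(cols[3])            # runtime in seconds
    genres = cols[5].split(',')       # genres as a list of strings
    title = cols[6]
    print(title, rating, runtime, genres)
```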

## 1. Find the number of unique genres
Using the data provided in `250.imdb`, find the total number of unique genres. A `set` is recommended here, since it discards duplicates automatically.

Note: Be mindful of case sensitivity (e.g., "Action" and "action" should be considered the same genre).

__Hint__: The correct answer is 22.

## 2. Find the number of movies per genre

Correct answers:

<img src="../../lectures/img/movie_dict.png" alt="Drawing" style="width: 500px;"/> 

## 3. (Optional/Extra) What is the average length of the movies (hours and minutes) in each genre?

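Here you have to loop twice: once over the file to collect the runtimes for each genre, and once over the collected data to calculate the averages.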

Correct answers:

<img src="../../lectures/img/average_length.png" alt="Drawing" style="width: 500px;"/> 

## 4. (Advanced) Re-structure and write the output to a new file as below

<img src="../../lectures/img/re-structured.png" alt="Drawing" style="width: 400px;"/> 

Note:
- Use a text editor, not a notebook, for this
- Use functions as much as possible
- Use `sys.argv` for input/output

<br><br><br><br><br><br><br>

## Tips if you're unsure how to start

As with all coding, there are many ways to write code that achieves the same end result. Below is one way of thinking about these problems; there are of course many others.

### 1. Find the number of unique genres

1. Create an empty list outside the loop where you will collect all the different genres
2. Start by reading the file and splitting up the columns, just as you did in yesterday's exercise
3. Identify the column where all the genres for a movie are listed, and split this column into a list
4. Loop over this list of genres and add them to your empty list from step one, UNLESS IT IS ALREADY THERE
5. After looping over all lines, check the length of your list from step 1
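The steps above use a list. The same idea with a `set`, as the exercise recommends, could be sketched as follows. The function name `unique_genres` and the sample records are made up for illustration; the genre column index (5) is the one used in the answers further down.

```python
def unique_genres(lines):
    """Return the set of lower-cased genres found in an iterable of records."""
    genres = set()
    for line in lines:
        if line.startswith('#'):          # skip the header line
            continue
        cols = line.strip().split('|')
        for genre in cols[5].strip().split(','):
            genres.add(genre.strip().lower())   # a set ignores duplicates
    return genres

# Made-up records for illustration (not from 250.imdb)
sample = [
    "# header line\n",
    "1000| 8.0|1999|7200|https://example.org/a|Drama,War|Movie A\n",
    "2000| 7.5|2005|5400|https://example.org/b|drama,Comedy|Movie B\n",
]
print(len(unique_genres(sample)))  # 3
```

Since an open file is also an iterable of lines, the same function should work when given the file handle from `open('../../downloads/250.imdb')`.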

### 2. Find the number of movies per genre

1. Use the code from above, but instead of creating an empty list before starting to loop over the file, create an empty dictionary
2. When looping over the genres, check whether each one is already in the dictionary. If it is not, add it with the value 1; if it is, increase its value by 1.
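One possible sketch of these two steps is shown here. The function name `count_movies_per_genre` and the sample records are made up for illustration, and the genre column index (5) is taken from the answers further down.

```python
def count_movies_per_genre(lines):
    """Return a dictionary mapping each lower-cased genre to its movie count."""
    counts = {}
    for line in lines:
        if line.startswith('#'):          # skip the header line
            continue
        cols = line.strip().split('|')
        for genre in cols[5].strip().split(','):
            genre = genre.strip().lower()
            if genre not in counts:
                counts[genre] = 1         # first movie seen in this genre
            else:
                counts[genre] += 1        # genre seen before: add one
    return counts

# Made-up records for illustration (not from 250.imdb)
sample = [
    "# header line\n",
    "1000| 8.0|1999|7200|https://example.org/a|Drama,War|Movie A\n",
    "2000| 7.5|2005|5400|https://example.org/b|drama,Comedy|Movie B\n",
]
print(count_movies_per_genre(sample))
```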

### 3. What is the average length of the movies (hours and minutes) in each genre?

1. Use the code above, but instead of assigning the value 1 to each genre initially, add the runtime of the movie as a list item
2. For each new movie, append the runtime to the existing list, so by the end of the loop you have, for each genre, a list of the runtimes for all movies in that genre
3. Loop over the dictionary and calculate the average of the list
4. Format the average (which is in seconds) as hours and minutes by dividing appropriately
5. Print the results, or save them to a variable or file
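Steps 3 and 4 can be sketched on their own, without reading the file. The helper name `format_runtime` and the runtimes in the dictionary are made up for illustration.

```python
def format_runtime(seconds):
    """Convert a number of seconds to a string such as '2h15min'."""
    total_minutes = round(seconds / 60)           # round to whole minutes first
    hours, minutes = divmod(total_minutes, 60)    # split into hours and leftover minutes
    return str(hours) + 'h' + str(minutes) + 'min'

# Made-up runtimes in seconds, grouped by genre
runtimes = {'drama': [7200, 9000], 'comedy': [5400]}

for genre, values in runtimes.items():
    average = sum(values) / len(values)
    print(genre, format_runtime(average))
```

Rounding to whole minutes before splitting is a design choice: it avoids results such as `1h60min` when the average falls just below a full hour.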

### 4. Re-structure and write the output to a new file as below

1. Use the code above, but instead of adding just the runtime as a list element for each genre, add a list (or tuple) of items (rating, movie, year, runtime). In the end you will have, for each genre, a list of lists (or tuples) containing all the relevant information for each movie
2. Loop over the dictionary and write the content of the dictionary to a new file with the correct formatting
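A rough skeleton of such a script, split into functions and using `sys.argv`, might look like the following. Several things here are assumptions: the script name `restructure.py` is hypothetical, the year is assumed to be in column 2, and the output layout is only a placeholder, since the required layout is given in the image above and should replace the `write` calls.

```python
import sys

def read_movies(path):
    """Group movies by genre: {genre: [(rating, title, year, runtime), ...]}."""
    movies = {}
    with open(path, 'r') as fh:
        for line in fh:
            if line.startswith('#') or not line.strip():
                continue                      # skip header and empty lines
            cols = [col.strip() for col in line.split('|')]
            # Assumption: year is in column 2; other indices follow the answers below
            record = (float(cols[1]), cols[6], cols[2], int(cols[3]))
            for genre in cols[5].split(','):
                movies.setdefault(genre.strip().lower(), []).append(record)
    return movies

def write_movies(movies, path):
    """Write one block per genre. Placeholder layout, not the required one."""
    with open(path, 'w') as out:
        for genre in sorted(movies):
            out.write('> ' + genre + '\n')
            for rating, title, year, runtime in movies[genre]:
                out.write(f'{rating}\t{title}\t{year}\t{runtime}\n')
            out.write('\n')

def main(in_path, out_path):
    write_movies(read_movies(in_path), out_path)

if __name__ == '__main__':
    if len(sys.argv) == 3:
        main(sys.argv[1], sys.argv[2])
    else:
        print('Usage: python restructure.py <input.imdb> <output.txt>')
```

Under these assumptions, it would be run from the terminal as `python restructure.py 250.imdb output.txt`.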

## Answers

### 1. Find the number of unique genres

In [4]:
# Code snippet for finding the movie with the highest rating
# Note that this is just one of the solutions
with open('../../downloads/250.imdb', 'r') as fh:
    movieList = []
    highestRating = -100

    for line in fh:
        if not line.startswith('#'):
            cols = line.strip().split('|')
            rating = float(cols[1].strip())
            title = cols[6].strip()
            movieList.append((rating, title))
            if rating > highestRating:
                highestRating = rating

    print("Movie(s) with highest rating " + str(highestRating) + ":")
    for rating, title in movieList:
        if rating == highestRating:
            print(title)

Movie(s) with highest rating 9.3:
The Shawshank Redemption


### 2. Find the number of movies per genre

In [5]:
# Code snippet for finding the number of unique genres
# Note that this is just one of the solutions
with open('../../downloads/250.imdb', 'r') as fh:
    # empty list to start with
    genres_list = []
    # iterate over the file
    for line in fh:
        if not line.startswith('#'):
            # split the line into a list of columns, removing the | separators
            cols = line.strip().split('|')
            # extract the genres column and split it into a list
            genres = cols[5].strip().split(',')
            # add each genre to the list unless it is already there
            for genre in genres:
                if genre.lower() not in genres_list:
                    genres_list.append(genre.lower())

print(genres_list)
print(len(genres_list))

['drama', 'war', 'adventure', 'comedy', 'family', 'animation', 'biography', 'history', 'action', 'crime', 'mystery', 'thriller', 'fantasy', 'romance', 'sci-fi', 'western', 'musical', 'music', 'historical', 'sport', 'film-noir', 'horror']
22


### 3. (Optional/Extra) What is the average length of the movies (hours and minutes) in each genre?

In [9]:
# Code snippet for calculating the average length of the movies (in hours and minutes) for each genre
# Note that this is just one of the solutions
with open('../../downloads/250.imdb', 'r') as fh:
    genreDict = {}

    for line in fh:
        if not line.startswith('#'):
            cols = line.strip().split('|')
            genre = cols[5].strip()
            glist = genre.split(',')
            runtime = cols[3]  # length of movie in seconds
            for entry in glist:
                if entry.lower() not in genreDict:
                    genreDict[entry.lower()] = []  # start an empty list for a new genre
                genreDict[entry.lower()].append(int(runtime))  # append runtime to the genre's list

# the file is closed automatically when the with block ends
for genre in genreDict:  # loop over the genres in the dictionary
    average = sum(genreDict[genre]) / len(genreDict[genre])  # average length per genre, in seconds
    hours = int(average / 3600)  # whole hours
    minutes = (average - (3600 * hours)) / 60  # remaining minutes
    print('The average length for movies in genre ' + genre
          + ' is ' + str(hours) + 'h' + str(round(minutes)) + 'min')

The average length for movies in genre drama is 2h14min
The average length for movies in genre war is 2h30min
The average length for movies in genre adventure is 2h13min
The average length for movies in genre comedy is 1h53min
The average length for movies in genre family is 1h44min
The average length for movies in genre animation is 1h40min
The average length for movies in genre biography is 2h30min
The average length for movies in genre history is 2h47min
The average length for movies in genre action is 2h18min
The average length for movies in genre crime is 2h11min
The average length for movies in genre mystery is 2h3min
The average length for movies in genre thriller is 2h11min
The average length for movies in genre fantasy is 2h2min
The average length for movies in genre romance is 2h2min
The average length for movies in genre sci-fi is 2h6min
The average length for movies in genre western is 2h11min
The average length for movies in genre musical is 1h57min
The average length for 

### 4. Re-structure and write the output to a new file as below

Example code can be found at https://uppsala.instructure.com/courses/99844/modules/items/1111740