{ "cells": [ { "cell_type": "markdown", "id": "4553d762-c2a5-416b-a6af-e31cf10d8060", "metadata": {}, "source": [ "# Movie Madness \n", "\n", "**Description:** \n", "You are a data analyst for a movie streaming service. You have been tasked with analyzing a dataset of movie ratings to determine which genres are the most popular among users. \n", "\n", "The dataset contains the following columns:\n", "- **user_id:** Unique identifier for each user\n", "- **movie_id:** Unique identifier for each movie\n", "- **rating:** Rating given by the user to the movie (on a scale of 1-5)\n", "- **genre:** Genre of the movie (e.g. Action, Comedy, Drama, etc.)\n", "\n", "**Your task is to:** \n", "- Load the dataset into a Pandas DataFrame\n", "- Group the data by genre and calculate the average rating for each genre\n", "- Sort the results in descending order by average rating\n", "\n", "**Data:** \n", "You can use the following sample data to get started: \n", "```\n", "user_id,movie_id,rating,genre\n", "1,101,4,Action\n", "1,102,3,Comedy\n", "2,101,5,Action\n", "2,103,4,Drama\n", "3,102,2,Comedy\n", "3,104,5,Action\n", "```" ] }, { "cell_type": "code", "execution_count": 1, "id": "b06da085-cf53-43fb-ac61-a6e8a247a0cf", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Python version 3.11.7 | packaged by Anaconda, Inc. | (main, Dec 15 2023, 18:05:47) [MSC v.1916 64 bit (AMD64)]\n", "Pandas version 2.2.1\n" ] } ], "source": [ "# import libraries\n", "import pandas as pd\n", "import sys\n", "\n", "print('Python version ' + sys.version)\n", "print('Pandas version ' + pd.__version__)" ] }, { "cell_type": "code", "execution_count": 2, "id": "8e8e4449-e430-4d50-b58a-558545995a8e", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
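<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>user_id</th>\n", "      <th>movie_id</th>\n", "      <th>rating</th>\n", "      <th>genre</th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr>\n", "      <th>0</th>\n", "      <td>1</td>\n", "      <td>101</td>\n", "      <td>4</td>\n", "      <td>Action</td>\n", "    </tr>\n", "    <tr>\n", "      <th>1</th>\n", "      <td>1</td>\n", "      <td>102</td>\n", "      <td>3</td>\n", "      <td>Comedy</td>\n", "    </tr>\n", "    <tr>\n", "      <th>2</th>\n", "      <td>2</td>\n", "      <td>101</td>\n", "      <td>5</td>\n", "      <td>Action</td>\n", "    </tr>\n", "    <tr>\n", "      <th>3</th>\n", "      <td>2</td>\n", "      <td>103</td>\n", "      <td>4</td>\n", "      <td>Drama</td>\n", "    </tr>\n", "    <tr>\n", "      <th>4</th>\n", "      <td>3</td>\n", "      <td>102</td>\n", "      <td>2</td>\n", "      <td>Comedy</td>\n", "    </tr>\n", "    <tr>\n", "      <th>5</th>\n", "      <td>3</td>\n", "      <td>104</td>\n", "      <td>5</td>\n", "      <td>Action</td>\n", "    </tr>\n", "  </tbody>\n", "</table>\n", "</div>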
" ], "text/plain": [ " user_id movie_id rating genre\n", "0 1 101 4 Action\n", "1 1 102 3 Comedy\n", "2 2 101 5 Action\n", "3 2 103 4 Drama\n", "4 3 102 2 Comedy\n", "5 3 104 5 Action" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# let's try to copy the data using the clipboard\n", "df = pd.read_clipboard(sep=\",\")\n", "df" ] }, { "cell_type": "code", "execution_count": 3, "id": "50f812d2-ce5c-4b99-be25-dbb054c2b910", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 6 entries, 0 to 5\n", "Data columns (total 4 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 user_id 6 non-null int64 \n", " 1 movie_id 6 non-null int64 \n", " 2 rating 6 non-null int64 \n", " 3 genre 6 non-null object\n", "dtypes: int64(3), object(1)\n", "memory usage: 324.0+ bytes\n" ] } ], "source": [ "# check the data types\n", "df.info()" ] }, { "cell_type": "code", "execution_count": 4, "id": "c5cafbca-5f61-46ee-8d23-4122ab9e9b1e", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "genre\n", "Action 4.666667\n", "Comedy 2.500000\n", "Drama 4.000000\n", "Name: rating, dtype: float64" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# create groupby object\n", "group = df.groupby('genre')\n", "\n", "# calculate average rating\n", "avg = group['rating'].mean()\n", "avg" ] }, { "cell_type": "markdown", "id": "7cba4c79-c497-4325-a2e8-c2bac62dae2c", "metadata": {}, "source": [ "I decided to place it all in one line. 
Yes, it is a bit ugly.\n", "\n", "Here is what I did: \n", "- I set `genre` as the index of the original DataFrame and merged in the Series of average ratings on `genre`\n", "- I renamed the Series to `average_rating` so the new column name is clear\n", "- Finally, I sorted the rows by `average_rating` in descending order" ] }, { "cell_type": "code", "execution_count": 5, "id": "9d19ba2f-f31d-43ef-a57e-03675a67eba4", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
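<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>user_id</th>\n", "      <th>movie_id</th>\n", "      <th>rating</th>\n", "      <th>average_rating</th>\n", "    </tr>\n", "    <tr>\n", "      <th>genre</th>\n", "      <th></th>\n", "      <th></th>\n", "      <th></th>\n", "      <th></th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr>\n", "      <th>Action</th>\n", "      <td>1</td>\n", "      <td>101</td>\n", "      <td>4</td>\n", "      <td>4.666667</td>\n", "    </tr>\n", "    <tr>\n", "      <th>Action</th>\n", "      <td>2</td>\n", "      <td>101</td>\n", "      <td>5</td>\n", "      <td>4.666667</td>\n", "    </tr>\n", "    <tr>\n", "      <th>Action</th>\n", "      <td>3</td>\n", "      <td>104</td>\n", "      <td>5</td>\n", "      <td>4.666667</td>\n", "    </tr>\n", "    <tr>\n", "      <th>Drama</th>\n", "      <td>2</td>\n", "      <td>103</td>\n", "      <td>4</td>\n", "      <td>4.000000</td>\n", "    </tr>\n", "    <tr>\n", "      <th>Comedy</th>\n", "      <td>1</td>\n", "      <td>102</td>\n", "      <td>3</td>\n", "      <td>2.500000</td>\n", "    </tr>\n", "    <tr>\n", "      <th>Comedy</th>\n", "      <td>3</td>\n", "      <td>102</td>\n", "      <td>2</td>\n", "      <td>2.500000</td>\n", "    </tr>\n", "  </tbody>\n", "</table>\n", "</div>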
" ], "text/plain": [ " user_id movie_id rating average_rating\n", "genre \n", "Action 1 101 4 4.666667\n", "Action 2 101 5 4.666667\n", "Action 3 104 5 4.666667\n", "Drama 2 103 4 4.000000\n", "Comedy 1 102 3 2.500000\n", "Comedy 3 102 2 2.500000" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.set_index('genre').merge(avg.rename('average_rating'), on='genre').sort_values('average_rating', ascending=False)" ] }, { "cell_type": "markdown", "id": "fc5aa17a-7477-4f2a-b73f-93cf63a6b777", "metadata": {}, "source": [ "If all you needed to see was the averages..." ] }, { "cell_type": "code", "execution_count": 5, "id": "b8f83ffe-faeb-42ba-94ff-c77ba3bcb73a", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "genre\n", "Action 4.666667\n", "Drama 4.000000\n", "Comedy 2.500000\n", "Name: rating, dtype: float64" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "avg.sort_values(ascending=False)" ] }, { "cell_type": "markdown", "id": "ad584bc5-648e-4486-90ff-cec184716d9f", "metadata": {}, "source": [ "# Summary:\n", "This tutorial guided you through the analysis of a movie ratings dataset using Pandas. It covered loading data, grouping by genre, calculating average ratings, merging data, and sorting results.\n", "\n", "### Key Takeaways:\n", "- How to load data from a clipboard into a Pandas DataFrame using `pd.read_clipboard()`\n", "- Understanding data types using `df.info()`\n", "- Grouping data by a column (genre) using `df.groupby()`\n", "- Calculating the average rating for each group using `group['rating'].mean()`\n", "- Merging data from a Series into the original DataFrame using `df.merge()` or `df.set_index().merge()`\n", "- Renaming columns using `rename()`\n", "- Sorting data in descending order using `sort_values()`" ] }, { "cell_type": "markdown", "id": "b1d5ecb0-c6c6-493d-a582-a612a25a419a", "metadata": {}, "source": [ "

This tutorial was created by HEDARO

" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.7" } }, "nbformat": 4, "nbformat_minor": 5 }