{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Let's Clean Some Data\n", "\n", "FiveThirtyEight publishes some great data sets, and its Star Wars survey is one of them. Some light cleaning should make it more usable!" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "scrolled": false }, "outputs": [], "source": [ "# import libraries\n", "import pandas as pd\n", "import numpy as np\n", "\n", "# load the survey responses (the file is ISO-8859-1 / Latin-1 encoded)\n", "star_wars = pd.read_csv(\"star_wars.csv\", encoding=\"ISO-8859-1\")" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " RespondentID Have you seen any of the 6 films in the Star Wars franchise? \\\n", "0 3292879998 Yes \n", "1 3292879538 No \n", "\n", " Do you consider yourself to be a fan of the Star Wars film franchise? \\\n", "0 Yes \n", "1 NaN \n", "\n", " Which of the following Star Wars films have you seen? Please select all that apply. \\\n", "0 Star Wars: Episode I The Phantom Menace \n", "1 NaN \n", "\n", " Unnamed: 4 \\\n", "0 Star Wars: Episode II Attack of the Clones \n", "1 NaN \n", "\n", " Unnamed: 5 \\\n", "0 Star Wars: Episode III Revenge of the Sith \n", "1 NaN \n", "\n", " Unnamed: 6 \\\n", "0 Star Wars: Episode IV A New Hope \n", "1 NaN \n", "\n", " Unnamed: 7 \\\n", "0 Star Wars: Episode V The Empire Strikes Back \n", "1 NaN \n", "\n", " Unnamed: 8 \\\n", "0 Star Wars: Episode VI Return of the Jedi \n", "1 NaN \n", "\n", " Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. \\\n", "0 3.0 \n", "1 NaN \n", "\n", " ... Unnamed: 28 Which character shot first? \\\n", "0 ... Very favorably I don't understand this question \n", "1 ... NaN NaN \n", "\n", " Are you familiar with the Expanded Universe? \\\n", "0 Yes \n", "1 NaN \n", "\n", " Do you consider yourself to be a fan of the Expanded Universe? \\\n", "0 No \n", "1 NaN \n", "\n", " Do you consider yourself to be a fan of the Star Trek franchise? 
Gender \\\n", "0 No Male \n", "1 Yes Male \n", "\n", " Age Household Income Education Location (Census Region) \n", "0 18-29 NaN High school degree South Atlantic \n", "1 18-29 $0 - $24,999 Bachelor degree West South Central \n", "\n", "[2 rows x 38 columns]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "Index(['RespondentID',\n", " 'Have you seen any of the 6 films in the Star Wars franchise?',\n", " 'Do you consider yourself to be a fan of the Star Wars film franchise?',\n", " 'Which of the following Star Wars films have you seen? Please select all that apply.',\n", " 'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8',\n", " 'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.',\n", " 'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13',\n", " 'Unnamed: 14',\n", " 'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.',\n", " 'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19',\n", " 'Unnamed: 20', 'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23',\n", " 'Unnamed: 24', 'Unnamed: 25', 'Unnamed: 26', 'Unnamed: 27',\n", " 'Unnamed: 28', 'Which character shot first?',\n", " 'Are you familiar with the Expanded Universe?',\n", " 'Do you consider yourself to be a fan of the Expanded Universe?',\n", " 'Do you consider yourself to be a fan of the Star Trek franchise?',\n", " 'Gender', 'Age', 'Household Income', 'Education',\n", " 'Location (Census Region)'],\n", " dtype='object')" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "(1186, 38)" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# explore the data frame\n", "display(star_wars.head(2))\n", "display(star_wars.columns)\n", "display(star_wars.shape)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# rename 
columns\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Yes 936\n", "No 250\n", "Name: Have you seen any of the 6 films in the Star Wars franchise?, dtype: int64\n", "Yes 552\n", "NaN 350\n", "No 284\n", "Name: Do you consider yourself to be a fan of the Star Wars film franchise?, dtype: int64\n" ] } ], "source": [ "# value counts for columns 1:2\n", "print(star_wars[\"Have you seen any of the 6 films in the Star Wars franchise?\"].value_counts(dropna=False))\n", "print(star_wars[\"Do you consider yourself to be a fan of the Star Wars film franchise?\"].value_counts(dropna=False))" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "True 936\n", "False 250\n", "Name: Have you seen any of the 6 films in the Star Wars franchise?, dtype: int64\n", "True 552\n", "NaN 350\n", "False 284\n", "Name: Do you consider yourself to be a fan of the Star Wars film franchise?, dtype: int64\n" ] } ], "source": [ "# switch values to boolean for columns 1:2\n", "yes_no_bool = {\"Yes\":True, \"No\":False}\n", "star_wars[\"Have you seen any of the 6 films in the Star Wars franchise?\"] = star_wars[\"Have you seen any of the 6 films in the Star Wars franchise?\"].map(yes_no_bool)\n", "star_wars[\"Do you consider yourself to be a fan of the Star Wars film franchise?\"] = star_wars[\"Do you consider yourself to be a fan of the Star Wars film franchise?\"].map(yes_no_bool)\n", "print(star_wars[\"Have you seen any of the 6 films in the Star Wars franchise?\"].value_counts(dropna=False))\n", "print(star_wars[\"Do you consider yourself to be a fan of the Star Wars film franchise?\"].value_counts(dropna=False))" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Star Wars: Episode I The Phantom Menace 673\n", "NaN 513\n", "Name: Which of 
the following Star Wars films have you seen? Please select all that apply., dtype: int64\n" ] } ], "source": [ "# value counts for column 3\n", "print(star_wars[\"Which of the following Star Wars films have you seen? Please select all that apply.\"].value_counts(dropna=False))" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True 673\n", "False 513\n", "Name: Which of the following Star Wars films have you seen? Please select all that apply., dtype: int64" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# switch values to boolean for columns 3:9 (a film title means seen, NaN means not seen)\n", "watch_bool = {\"Star Wars: Episode I The Phantom Menace\" : True,\n", " \"Star Wars: Episode II Attack of the Clones\" : True,\n", " \"Star Wars: Episode III Revenge of the Sith\" : True,\n", " \"Star Wars: Episode IV A New Hope\" : True,\n", " \"Star Wars: Episode V The Empire Strikes Back\" : True,\n", " \"Star Wars: Episode VI Return of the Jedi\" : True,\n", " np.nan : False}  # np.nan rather than np.NaN: the np.NaN alias was removed in NumPy 2.0\n", "\n", "for col in star_wars.columns[3:9]:\n", "    star_wars[col] = star_wars[col].map(watch_bool)\n", "\n", "# value counts for column 3\n", "display(star_wars.iloc[:,3].value_counts(dropna=False))" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['seen_ep_1', 'seen_ep_2', 'seen_ep_3', 'seen_ep_4', 'seen_ep_5',\n", " 'seen_ep_6'],\n", " dtype='object')" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# rename columns\n", "star_wars = star_wars.rename(columns={\"Which of the following Star Wars films have you seen? 
Please select all that apply.\" : \"seen_ep_1\",\n", " \"Unnamed: 4\" : \"seen_ep_2\",\n", " \"Unnamed: 5\" : \"seen_ep_3\",\n", " \"Unnamed: 6\" : \"seen_ep_4\",\n", " \"Unnamed: 7\" : \"seen_ep_5\",\n", " \"Unnamed: 8\" : \"seen_ep_6\"})\n", "\n", "display(star_wars.columns[3:9])" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dtype('float64')" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# convert rankings to float\n", "star_wars[star_wars.columns[9:15]] = star_wars[star_wars.columns[9:15]].astype(float)\n", "\n", "# star_wars.columns[9].dtype # why does this work sometimes?\n", "star_wars.iloc[:,9].dtype" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['rank_ep_1', 'rank_ep_2', 'rank_ep_3', 'rank_ep_4', 'rank_ep_5',\n", " 'rank_ep_6'],\n", " dtype='object')" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# rename more columns\n", "new_columns = {\"Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.\":\"rank_ep_1\",\n", " \"Unnamed: 10\":\"rank_ep_2\",\n", " \"Unnamed: 11\":\"rank_ep_3\",\n", " \"Unnamed: 12\":\"rank_ep_4\",\n", " \"Unnamed: 13\":\"rank_ep_5\",\n", " \"Unnamed: 14\":\"rank_ep_6\",}\n", "star_wars = star_wars.rename(columns = new_columns)\n", "\n", "display(star_wars.columns[9:15])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Closer Look" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Have you seen any of the 6 films in the Star Wars franchise? 0.789207\n", "Are you a fan of the Star Wars films? 
0.660287\n", "seen_ep_1 0.567454\n", "seen_ep_2 0.481450\n", "seen_ep_3 0.463744\n", "seen_ep_4 0.511804\n", "seen_ep_5 0.639123\n", "seen_ep_6 0.622260\n", "rank_ep_1 3.732934\n", "rank_ep_2 4.087321\n", "rank_ep_3 4.341317\n", "rank_ep_4 3.272727\n", "rank_ep_5 2.513158\n", "rank_ep_6 3.047847\n", "dtype: float64" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "Have you seen any of the 6 films in the Star Wars franchise? 936\n", "Are you a fan of the Star Wars films? 552\n", "seen_ep_1 673\n", "seen_ep_2 571\n", "seen_ep_3 550\n", "seen_ep_4 607\n", "dtype: object" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# get average value for seen any, which episode, rankings by episode\n", "means = star_wars.iloc[:,1:15].mean()\n", "means = means.rename({\"Do you consider yourself to be a fan of the Star Wars film franchise?\":\"Are you a fan of the Star Wars films?\"})\n", "display(means)\n", "\n", "# totals\n", "sums = star_wars.iloc[:,1:7].sum()\n", "sums = sums.rename({\"Do you consider yourself to be a fan of the Star Wars film franchise?\":\"Are you a fan of the Star Wars films?\"})\n", "display(sums)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True 936\n", "False 250\n", "Name: Have you seen any of the 6 films in the Star Wars franchise?, dtype: int64" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "True 552\n", "NaN 350\n", "False 284\n", "Name: Do you consider yourself to be a fan of the Star Wars film franchise?, dtype: int64" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "0.6602870813397129" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# math check :)\n", "display(star_wars[\"Have you seen any of the 6 films in the Star Wars franchise?\"].value_counts(dropna=False))\n", "# display(552/(552+284))\n", "display(star_wars[\"Do you consider yourself 
to be a fan of the Star Wars film franchise?\"].value_counts(dropna=False))\n", "display(552/(552+284))" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "# what's up with the fan NaNs? 250 are people who haven't seen any film, but what about the other 100?\n", "# TODO: make a fan-NaN group" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "# TODO: reshape means with episode as index and seen and rank as columns, then make a side-by-side barh" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# bar chart for fan & episodes seen in % of total respondents\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "means[0:8].plot.barh()\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# bar chart for rankings\n", "means[7:14].plot.barh()\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Impressions\n", "* \n", "\n", "* Episode V is the favorite, but not by much." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 1 }