{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Winning Jeopardy\n",
"\n",
"Jeopardy is a popular TV show in the US where participants answer questions to win money. It has been running for several decades and is a major force in popular culture.\n",
"\n",
"The dataset is named `jeopardy.csv` and comes from a full dataset of Jeopardy questions, which can be downloaded from [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file). The original description of this file mentions 20,000 rows taken from the beginning of the full dataset; the copy loaded in this notebook has about 217,000 rows (see the `info()` output below).\n",
"\n",
"Data Dictionary:\n",
"\n",
"> Show Number - the Jeopardy episode number of the show this question was in.\n",
"\n",
"> Air Date - the date the episode aired.\n",
"\n",
"> Round - the round of Jeopardy that the question was asked in. Jeopardy has several rounds as each episode progresses.\n",
"\n",
"> Category - the category of the question.\n",
"\n",
"> Value - the number of dollars answering the question correctly is worth.\n",
"\n",
"> Question - the text of the question.\n",
"\n",
"> Answer - the text of the answer.\n",
"\n",
"### Aim\n",
"\n",
"**Let's say we want to compete on Jeopardy, and we're looking for any edge we can get to win. In this project, we'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help us win.**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Introduction\n",
"\n",
"- We will load the data into a pandas DataFrame.\n",
"- Clean the dataset by dropping rows with null values.\n",
"- Clean the column names."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Show Number Air Date Round Category Value \\\n",
"0 4680 2004-12-31 Jeopardy! HISTORY $200 \n",
"1 4680 2004-12-31 Jeopardy! ESPN's TOP 10 ALL-TIME ATHLETES $200 \n",
"2 4680 2004-12-31 Jeopardy! EVERYBODY TALKS ABOUT IT... $200 \n",
"3 4680 2004-12-31 Jeopardy! THE COMPANY LINE $200 \n",
"4 4680 2004-12-31 Jeopardy! EPITAPHS & TRIBUTES $200 \n",
"\n",
" Question Answer \n",
"0 For the last 8 years of his life, Galileo was ... Copernicus \n",
"1 No. 2: 1912 Olympian; football star at Carlisl... Jim Thorpe \n",
"2 The city of Yuma in this state has a record av... Arizona \n",
"3 In 1963, live on \"The Art Linkletter Show\", th... McDonald's \n",
"4 Signer of the Dec. of Indep., framer of the Co... John Adams "
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"\n",
"jeopardy = pd.read_csv('jeopardy.csv')\n",
"jeopardy.dropna(inplace=True)\n",
"jeopardy.head()"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',\n",
" ' Question', ' Answer'],\n",
" dtype='object')"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"jeopardy.columns"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',\n",
" 'Answer'],\n",
" dtype='object')"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Removing spaces from column names\n",
"jeopardy.columns = [x.strip() for x in jeopardy.columns]\n",
"jeopardy.columns"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"Int64Index: 216928 entries, 0 to 216929\n",
"Data columns (total 7 columns):\n",
"Show Number 216928 non-null int64\n",
"Air Date 216928 non-null object\n",
"Round 216928 non-null object\n",
"Category 216928 non-null object\n",
"Value 216928 non-null object\n",
"Question 216928 non-null object\n",
"Answer 216928 non-null object\n",
"dtypes: int64(1), object(6)\n",
"memory usage: 13.2+ MB\n"
]
}
],
"source": [
"jeopardy.info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Normalizing Text\n",
"\n",
"Before we begin to do analysis, we need to normalize all of the text columns (the `Question` and `Answer` columns). The idea is to lowercase words and remove punctuation so `Don't` and `don't` aren't considered to be different words while comparing them."
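To make the effect of this normalization concrete, here is a minimal sketch on a few short strings. The helper name `normalize_text` is a hypothetical stand-in used only for illustration; the rule is the one described above: lowercase the text, then drop every character that is not a letter, digit, or whitespace.

```python
import re

def normalize_text(text):
    # Hypothetical helper mirroring the rule described above:
    # lowercase, then remove anything that is not a letter, digit or whitespace.
    return re.sub(r"[^a-z0-9\s]", "", text.lower())

samples = ["Don't", "don't", "McDonald's"]
normalized = [normalize_text(s) for s in samples]
print(normalized)  # ['dont', 'dont', 'mcdonalds']
```

After normalization, `Don't` and `don't` compare as equal, which is what the word-matching steps later in the notebook rely on.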
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"import re\n",
"\n",
"def normalizing_string(string):\n",
"    # Lowercase, then drop everything that is not a letter, digit or whitespace\n",
"    string = string.lower()\n",
"    string = re.sub(r\"[^A-Z0-9a-z\\s]\", \"\", string)\n",
"    return string"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Show Number Air Date Round Category Value \\\n",
"0 4680 2004-12-31 Jeopardy! HISTORY $200 \n",
"1 4680 2004-12-31 Jeopardy! ESPN's TOP 10 ALL-TIME ATHLETES $200 \n",
"2 4680 2004-12-31 Jeopardy! EVERYBODY TALKS ABOUT IT... $200 \n",
"3 4680 2004-12-31 Jeopardy! THE COMPANY LINE $200 \n",
"4 4680 2004-12-31 Jeopardy! EPITAPHS & TRIBUTES $200 \n",
"\n",
" Question Answer \\\n",
"0 For the last 8 years of his life, Galileo was ... Copernicus \n",
"1 No. 2: 1912 Olympian; football star at Carlisl... Jim Thorpe \n",
"2 The city of Yuma in this state has a record av... Arizona \n",
"3 In 1963, live on \"The Art Linkletter Show\", th... McDonald's \n",
"4 Signer of the Dec. of Indep., framer of the Co... John Adams \n",
"\n",
" clean_question clean_answer \n",
"0 for the last 8 years of his life galileo was u... copernicus \n",
"1 no 2 1912 olympian football star at carlisle i... jim thorpe \n",
"2 the city of yuma in this state has a record av... arizona \n",
"3 in 1963 live on the art linkletter show this c... mcdonalds \n",
"4 signer of the dec of indep framer of the const... john adams "
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"jeopardy['clean_question'] = jeopardy['Question'].apply(normalizing_string)\n",
"jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalizing_string)\n",
"jeopardy.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Normalizing Columns\n",
"\n",
"There are some other columns to be normalized.\n",
"\n",
"The `Value` column should be numeric to manipulate it more easily. We'll need to remove the dollar sign from the beginning of each value and convert the column from text to numeric.\n",
"\n",
"The `Air Date` column should also be a datetime, not a string, enabling us to work with it more easily."
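A small sketch of both conversions on single values, using only the standard library. The helper names `parse_value` and `parse_air_date` are hypothetical, and the inputs `"$2,000"` and `"None"` are assumed examples of what the `Value` column might contain; the date format matches the `2004-12-31` values shown in the table above. Note that this sketch keeps digits only, which is a slightly different technique from stripping punctuation and catching a failed `int()` conversion.

```python
import re
from datetime import datetime

def parse_value(raw):
    # Keep digits only, so "$2,000" becomes "2000"; anything without digits maps to 0.
    digits = re.sub(r"[^0-9]", "", raw)
    return int(digits) if digits else 0

def parse_air_date(raw):
    # Dates in the table above look like "2004-12-31" (year-month-day).
    return datetime.strptime(raw, "%Y-%m-%d")

print(parse_value("$200"))                # 200
print(parse_value("$2,000"))              # 2000
print(parse_value("None"))                # 0
print(parse_air_date("2004-12-31").year)  # 2004
```

On a whole column, `pd.to_datetime` performs the date conversion in one call, so a per-value helper like `parse_air_date` is only needed for illustration.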
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
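"def normalizing_values(value):\n",
"    # Strip the dollar sign and any other punctuation before converting\n",
"    value = re.sub(r\"[^A-Z0-9a-z\\s]\", \"\", value)\n",
"    try:\n",
"        value = int(value)\n",
"    except ValueError:\n",
"        # Entries that are not numeric are treated as 0\n",
"        value = 0\n",
"    return value"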
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Show Number int64\n",
"Air Date datetime64[ns]\n",
"Round object\n",
"Category object\n",
"Value object\n",
"Question object\n",
"Answer object\n",
"clean_question object\n",
"clean_answer object\n",
"clean_value int64\n",
"dtype: object\n"
]
},
{
"data": {
"text/plain": [
" Show Number Air Date Round Category Value \\\n",
"0 4680 2004-12-31 Jeopardy! HISTORY $200 \n",
"1 4680 2004-12-31 Jeopardy! ESPN's TOP 10 ALL-TIME ATHLETES $200 \n",
"2 4680 2004-12-31 Jeopardy! EVERYBODY TALKS ABOUT IT... $200 \n",
"3 4680 2004-12-31 Jeopardy! THE COMPANY LINE $200 \n",
"4 4680 2004-12-31 Jeopardy! EPITAPHS & TRIBUTES $200 \n",
"\n",
" Question Answer \\\n",
"0 For the last 8 years of his life, Galileo was ... Copernicus \n",
"1 No. 2: 1912 Olympian; football star at Carlisl... Jim Thorpe \n",
"2 The city of Yuma in this state has a record av... Arizona \n",
"3 In 1963, live on \"The Art Linkletter Show\", th... McDonald's \n",
"4 Signer of the Dec. of Indep., framer of the Co... John Adams \n",
"\n",
" clean_question clean_answer clean_value \n",
"0 for the last 8 years of his life galileo was u... copernicus 200 \n",
"1 no 2 1912 olympian football star at carlisle i... jim thorpe 200 \n",
"2 the city of yuma in this state has a record av... arizona 200 \n",
"3 in 1963 live on the art linkletter show this c... mcdonalds 200 \n",
"4 signer of the dec of indep framer of the const... john adams 200 "
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"jeopardy['clean_value'] = jeopardy['Value'].apply(normalizing_values)\n",
"jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])\n",
"print(jeopardy.dtypes)\n",
"jeopardy.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:\n",
"\n",
"**How often the answer is deducible from the question.**\n",
"\n",
"**How often new questions are repeats of older questions.**\n",
"\n",
"We can answer the first question by seeing how many times words in the answer also occur in the question. We can answer the second question by seeing how often complex words (6 or more characters) reoccur.\n",
"\n",
"We'll work on the first question now, and come back to the second.\n",
"\n",
"---\n",
"\n",
"### How often the answer is deducible from the question."
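The measure used for this question is the share of answer words that also appear in the question. As a rough illustration, the sketch below computes it for made-up question and answer pairs. The function name `answer_overlap` and the sample texts are assumptions for illustration, and this version ignores every `the` in the answer so that a common article does not inflate the score.

```python
def answer_overlap(clean_question, clean_answer):
    # Fraction of answer words (ignoring "the") that also occur in the question.
    question_words = set(clean_question.split())
    answer_words = [w for w in clean_answer.split() if w != "the"]
    if not answer_words:
        return 0.0
    matches = sum(1 for w in answer_words if w in question_words)
    return matches / len(answer_words)

ratio = answer_overlap("this index of uv rays was introduced in 1994", "the uv index")
print(ratio)  # 1.0
```

A score of 1.0 means every remaining answer word is already in the question, and 0.0 means none are. Averaging this score over all rows gives the overall figure.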
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.057921237245162335"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def function_ans_in_ques(row):\n",
"    split_answer = row['clean_answer'].split()\n",
"    split_question = row['clean_question'].split()\n",
"\n",
"    match_count = 0\n",
"\n",
"    # Drop one 'the' from the answer so a common article doesn't count as a match\n",
"    try:\n",
"        split_answer.remove('the')\n",
"    except ValueError:\n",
"        pass\n",
"\n",
"    # Avoid dividing by zero when nothing is left of the answer\n",
"    if len(split_answer) == 0:\n",
"        return 0\n",
"\n",
"    for element in split_answer:\n",
"        if element in split_question:\n",
"            match_count += 1\n",
"    # Fraction of answer words that also appear in the question\n",
"    return match_count/len(split_answer)\n",
"\n",
"jeopardy['answer_in_question'] = jeopardy.apply(function_ans_in_ques, axis=1)\n",
"jeopardy['answer_in_question'].mean()"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" clean_question clean_answer\n",
"14 on june 28 1994 the natl weather service began... the uv index\n",
"24 this asian political party was founded in 1885... the congress party\n",
"31 it can be a place to leave your puppy when you... a kennel\n",
"38 during the 19541955 sun sessions elvis climbed... the mystery train\n",
"53 in 1961 james brown announced all aboard for t... night train"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"jeopardy[jeopardy['answer_in_question'] != 0][['clean_question', 'clean_answer']].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*On average, only about 6% of the words in an answer also appear in its question. This isn't a huge number, and it means that we probably can't just hope that hearing a question will enable us to figure out the answer. We'll probably have to study.*\n",
"\n",
"---\n",
"\n",
"### How often new questions are repeats of older questions"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.8740091471018069"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"overlap_ratio = []\n",
"terms_repeated_in_ques = []\n",
"terms_repeated_overall = set()\n",
"terms_used = set()\n",
"\n",
"for i, rows in jeopardy.iterrows():\n",
"    split_question = rows['clean_question'].split(\" \")\n",
"    terms = [x for x in split_question if len(x) > 5]  # words with 6 or more letters\n",
"    temp = []\n",
"    match_count = 0\n",
"\n",
"    for word in terms:\n",
"        if word in terms_used:\n",
"            match_count += 1                  # word was already seen earlier\n",
"            terms_repeated_overall.add(word)  # set of repeated words (a subset of terms_used)\n",
"            temp.append(word)                 # repeated words for this row's overlap_terms entry\n",
"        terms_used.add(word)                  # record the word as seen\n",
"\n",
"    if len(terms) > 0:\n",
"        match_count /= len(terms)\n",
"\n",
"    overlap_ratio.append(match_count)\n",
"    terms_repeated_in_ques.append(temp)\n",
"\n",
"jeopardy['overlap_ratio'] = overlap_ratio\n",
"jeopardy['overlap_terms'] = terms_repeated_in_ques\n",
"\n",
"jeopardy['overlap_ratio'].mean()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*There is about an 87% overlap between the long words in a question and the words seen in earlier rows. However, the same words can be combined into different phrases with very different meanings, so this large overlap does not by itself show that questions are recycled.*\n",
"\n",
"---\n",
"\n",
"### Low-Value vs High-Value Questions\n",
"\n",
"Let's say we only want to study terms that tend to appear in high-value questions rather than low-value ones. This will help us earn more money when we're on Jeopardy.\n",
"\n",
"We can figure out which terms correspond to high-value questions using a chi-squared test. We'll first need to split the questions into two categories:\n",
"\n",
"- Low value - any row where `Value` is 800 or less.\n",
"- High value - any row where `Value` is greater than 800."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Show Number Air Date Round Category Value \\\n",
"0 4680 2004-12-31 Jeopardy! HISTORY $200 \n",
"1 4680 2004-12-31 Jeopardy! ESPN's TOP 10 ALL-TIME ATHLETES $200 \n",
"2 4680 2004-12-31 Jeopardy! EVERYBODY TALKS ABOUT IT... $200 \n",
"3 4680 2004-12-31 Jeopardy! THE COMPANY LINE $200 \n",
"4 4680 2004-12-31 Jeopardy! EPITAPHS & TRIBUTES $200 \n",
"\n",
" Question Answer \\\n",
"0 For the last 8 years of his life, Galileo was ... Copernicus \n",
"1 No. 2: 1912 Olympian; football star at Carlisl... Jim Thorpe \n",
"2 The city of Yuma in this state has a record av... Arizona \n",
"3 In 1963, live on \"The Art Linkletter Show\", th... McDonald's \n",
"4 Signer of the Dec. of Indep., framer of the Co... John Adams \n",
"\n",
" clean_question clean_answer \\\n",
"0 for the last 8 years of his life galileo was u... copernicus \n",
"1 no 2 1912 olympian football star at carlisle i... jim thorpe \n",
"2 the city of yuma in this state has a record av... arizona \n",
"3 in 1963 live on the art linkletter show this c... mcdonalds \n",
"4 signer of the dec of indep framer of the const... john adams \n",
"\n",
" clean_value answer_in_question overlap_ratio overlap_terms high_value \n",
"0 200 0.0 0.0 [] 0 \n",
"1 200 0.0 0.0 [] 0 \n",
"2 200 0.0 0.0 [] 0 \n",
"3 200 0.0 0.0 [] 0 \n",
"4 200 0.0 0.0 [] 0 "
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def high_or_low_value(row):\n",
"    # 1 for high-value questions (more than $800), 0 otherwise\n",
"    value = 0\n",
"    if row['clean_value'] > 800:\n",
"        value = 1\n",
"    return value\n",
"\n",
"jeopardy['high_value'] = jeopardy.apply(high_or_low_value, axis=1)\n",
"jeopardy.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*The `high_value` column labels each question as high value (1) or low value (0).*\n",
"\n",
"---\n",
"\n",
"### Observed Quantity of High Value vs Low Value Questions\n",
"\n",
"We will create a function that takes in a word and returns the number of high-value and low-value questions that the word appears in."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[(15, 31), (0, 2), (16, 40), (2, 1), (1, 2)]"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def high_or_low_count(word):\n",
"    # Count the high- and low-value questions whose text contains `word`\n",
"    low_count = 0\n",
"    high_count = 0\n",
"\n",
"    for i, rows in jeopardy.iterrows():\n",
"        split_question = rows['clean_question'].split(\" \")\n",
"        if word in split_question:\n",
"            if rows['high_value'] == 1:\n",
"                high_count += 1\n",
"            else:\n",
"                low_count += 1\n",
"\n",
"    return high_count, low_count\n",
"\n",
"observed_high_low = []\n",
"# Sets are unordered, so these five terms are an arbitrary sample\n",
"comparison_terms = list(terms_repeated_overall)[:5]\n",
"\n",
"for item in comparison_terms:\n",
"    observed_high_low.append(high_or_low_count(item))\n",
"\n",
"observed_high_low"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['somewhere', 'cholla', 'patriot', 'noonan', 'bharat']"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"comparison_terms"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*For each term in `comparison_terms`, `observed_high_low` holds the observed (high count, low count) pair.*\n",
"\n",
"---\n",
"\n",
"### Applying the Chi-Squared test\n",
"\n",
"We can use the chi-squared test to see whether the split between high-value and low-value questions for each term in `comparison_terms` is statistically significant.\n",
"\n",
"For that, we will first find the expected high count and low count for each term.\n",
"\n",
"Then, we will pass the observed and expected values to the `chisquare` function from `scipy.stats` to get the chi-squared statistic and the p-value."
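The expected counts and the statistic can be worked through by hand for a single term. The sketch below uses hypothetical dataset totals (1,000 questions, 300 of them high value) together with the pair `(15, 31)` as the observed counts; the function name `chi_squared_two_bins` and the totals are assumptions for illustration. With two categories the test has one degree of freedom, in which case the p-value equals `erfc(sqrt(chi2 / 2))`, so the standard library is enough here.

```python
import math

def chi_squared_two_bins(observed_high, observed_low, high_total, low_total):
    # Expected counts: split the term's total occurrences in the same
    # proportion as high- and low-value questions in the whole dataset.
    term_total = observed_high + observed_low
    share = term_total / (high_total + low_total)
    expected_high = share * high_total
    expected_low = share * low_total
    statistic = ((observed_high - expected_high) ** 2 / expected_high
                 + (observed_low - expected_low) ** 2 / expected_low)
    # With two bins there is 1 degree of freedom, so the chi-squared
    # survival function reduces to erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Hypothetical dataset totals; (15, 31) is the observed (high, low) pair.
statistic, p_value = chi_squared_two_bins(15, 31, high_total=300, low_total=700)
print(round(statistic, 3), round(p_value, 3))
```

For reference, with one degree of freedom a statistic of about 3.84 corresponds to a p-value of 0.05, so anything well below that is not significant at the usual threshold.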
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Power_divergenceResult(statistic=0.4179159369510027, pvalue=0.5179787872243642),\n",
" Power_divergenceResult(statistic=0.7899630882409683, pvalue=0.3741112870360538),\n",
" Power_divergenceResult(statistic=0.0018217776638311067, pvalue=0.9659547992512113),\n",
" Power_divergenceResult(statistic=2.174012332188078, pvalue=0.1403596010701264),\n",
" Power_divergenceResult(statistic=0.03723001319762459, pvalue=0.8469974958245368)]"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from scipy.stats import chisquare\n",
"import numpy as np\n",
"\n",
"high_value_count = jeopardy[jeopardy['high_value'] == 1].shape[0]\n",
"low_value_count = jeopardy[jeopardy['high_value'] == 0].shape[0]\n",
"\n",
"chi_squared = []\n",
"\n",
"for item in observed_high_low:\n",
"    # Share of all questions that contain this term\n",
"    total = sum(item)\n",
"    total_prop = total/jeopardy.shape[0]\n",
"    # Expected counts if the term were spread evenly across high- and low-value questions\n",
"    high_value_expected = total_prop*high_value_count\n",
"    low_value_expected = total_prop*low_value_count\n",
"\n",
"    observed = np.array([item[0], item[1]])\n",
"    expected = np.array([high_value_expected, low_value_expected])\n",
"\n",
"    chi_squared.append(chisquare(observed, expected))\n",
"\n",
"chi_squared"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Chi Squared Results\n",
"\n",
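"None of the p-values are less than 0.05, so none of these results are statistically significant. Several of the terms also appear only a handful of times, and the chi-squared test is unreliable when the expected counts are that small.\n",
"\n",
"***However, if we perform the same test for all the words, then words with a p-value less than 0.05 that appear in high-value questions more often than expected would be the most valuable to study.***"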
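One way to run the test over every term without rescanning the whole table for each word is to tally the high and low counts for all terms in a single pass. The sketch below does this on a few made-up rows. The function name `count_terms_by_value`, the `min_count` cutoff, and the sample rows are assumptions for illustration, not code from this notebook.

```python
from collections import defaultdict

def count_terms_by_value(rows, min_length=6):
    # rows: iterable of (clean_question, high_value) pairs, with high_value in {0, 1}.
    # Returns {term: [high_count, low_count]}, counting each question once per term.
    counts = defaultdict(lambda: [0, 0])
    for question, high_value in rows:
        terms = {w for w in question.split() if len(w) >= min_length}
        for term in terms:
            counts[term][0 if high_value == 1 else 1] += 1
    return dict(counts)

# Made-up sample rows, for illustration only.
rows = [
    ("this patriot signed the declaration", 1),
    ("a patriot from boston", 0),
    ("somewhere over the rainbow", 0),
]
counts = count_terms_by_value(rows)
# Keep only terms seen often enough for the test to be meaningful.
min_count = 2  # hypothetical cutoff
frequent = {t: c for t, c in counts.items() if sum(c) >= min_count}
print(frequent)  # {'patriot': [1, 1]}
```

Each remaining `[high_count, low_count]` pair can then be compared against its expected counts with `chisquare`. On real data a much larger cutoff would be sensible, since a common rule of thumb asks for expected counts of at least 5 in each bin.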
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
}
},
"nbformat": 4,
"nbformat_minor": 2
}