{ "cells": [ { "cell_type": "markdown", "metadata": { "extensions": { "jupyter_dashboards": { "version": 1, "views": { "grid_default": { "col": 0, "height": 4, "hidden": false, "row": 0, "width": 4 }, "report_default": { "hidden": false } } } } }, "source": [ "# Project: Wrangling and Analyze Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Gathering\n", "In the cell below, gather **all** three pieces of data for this project and load them in the notebook. **Note:** the methods required to gather each data are different.\n", "1. Directly download the WeRateDogs Twitter archive data (twitter_archive_enhanced.csv)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "#importing the necessary libraries\n", "import numpy as np\n", "import pandas as pd\n", "import requests\n", "import os\n", "import tweepy\n", "import json\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "extensions": { "jupyter_dashboards": { "version": 1, "views": { "grid_default": { "hidden": true }, "report_default": { "hidden": true } } } } }, "outputs": [], "source": [ "#reading and checking the csv file\n", "df1=pd.read_csv('twitter-archive-enhanced.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2. 
Use the Requests library to download the tweet image predictions (image_predictions.tsv)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "#creating a folder to hold the downloaded image-predictions file\n", "folder='twitter'\n", "if not os.path.exists(folder):\n", "    os.makedirs(folder)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "#downloading the image-predictions file\n", "url='https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'\n", "response=requests.get(url)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "#writing the downloaded content to a file in the folder so we can access it\n", "with open(os.path.join(folder, url.split('/')[-1]), mode='wb') as file:\n", "    file.write(response.content)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "#reading the downloaded tsv file from the folder it was saved to\n", "df2= pd.read_csv(os.path.join(folder, 'image-predictions.tsv'), sep='\\t')\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "3. 
Use the Tweepy library to query additional data via the Twitter API (tweet_json.txt)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "#loading the json file line by line into a dataframe\n", "new_list = [] #one dictionary per tweet\n", "\n", "with open('tweet-json.txt') as file:\n", "    for tweet in file:\n", "        data = json.loads(tweet)\n", "        tweet_id=data['id']\n", "        retweet_count=data['retweet_count']\n", "        favorite_count=data['favorite_count']\n", "\n", "        new_list.append({\"tweet_id\": tweet_id, \"retweet_count\": int(retweet_count),\n", "                         \"favourite_count\": favorite_count})\n", "\n", "df3=pd.DataFrame(new_list, columns= ['tweet_id', 'retweet_count', 'favourite_count'])\n" ] }, { "cell_type": "markdown", "metadata": { "extensions": { "jupyter_dashboards": { "version": 1, "views": { "grid_default": { "col": 4, "height": 4, "hidden": false, "row": 28, "width": 4 }, "report_default": { "hidden": false } } } } }, "source": [ "## Assessing Data\n", "In this section, detect and document at least **eight (8) quality issues and two (2) tidiness issues**. You must use **both** visual assessment and\n", "programmatic assessment to assess the data.\n", "\n", "**Note:** pay attention to the following key points when you assess the data.\n", "\n", "* You do not need to gather the tweets beyond August 1st, 2017. You can, but note that you won't be able to gather the image predictions for these tweets since you don't have access to the algorithm used.\n", "\n" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr><th></th><th>tweet_id</th><th>in_reply_to_status_id</th><th>in_reply_to_user_id</th><th>timestamp</th><th>source</th><th>text</th><th>retweeted_status_id</th><th>retweeted_status_user_id</th><th>retweeted_status_timestamp</th><th>expanded_urls</th><th>rating_numerator</th><th>rating_denominator</th><th>name</th><th>doggo</th><th>floofer</th><th>pupper</th><th>puppo</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>892420643555336193</td><td>NaN</td><td>NaN</td><td>2017-08-01 16:23:56 +0000</td><td>&lt;a href=\"http://twitter.com/download/iphone\" r...</td><td>This is Phineas. He's a mystical boy. Only eve...</td><td>NaN</td><td>NaN</td><td>NaN</td><td>https://twitter.com/dog_rates/status/892420643...</td><td>13</td><td>10</td><td>Phineas</td><td>None</td><td>None</td><td>None</td><td>None</td></tr>\n",
"<tr><th>1</th><td>892177421306343426</td><td>NaN</td><td>NaN</td><td>2017-08-01 00:17:27 +0000</td><td>&lt;a href=\"http://twitter.com/download/iphone\" r...</td><td>This is Tilly. She's just checking pup on you....</td><td>NaN</td><td>NaN</td><td>NaN</td><td>https://twitter.com/dog_rates/status/892177421...</td><td>13</td><td>10</td><td>Tilly</td><td>None</td><td>None</td><td>None</td><td>None</td></tr>\n",
"<tr><th>2</th><td>891815181378084864</td><td>NaN</td><td>NaN</td><td>2017-07-31 00:18:03 +0000</td><td>&lt;a href=\"http://twitter.com/download/iphone\" r...</td><td>This is Archie. He is a rare Norwegian Pouncin...</td><td>NaN</td><td>NaN</td><td>NaN</td><td>https://twitter.com/dog_rates/status/891815181...</td><td>12</td><td>10</td><td>Archie</td><td>None</td><td>None</td><td>None</td><td>None</td></tr>\n",
"<tr><th>3</th><td>891689557279858688</td><td>NaN</td><td>NaN</td><td>2017-07-30 15:58:51 +0000</td><td>&lt;a href=\"http://twitter.com/download/iphone\" r...</td><td>This is Darla. She commenced a snooze mid meal...</td><td>NaN</td><td>NaN</td><td>NaN</td><td>https://twitter.com/dog_rates/status/891689557...</td><td>13</td><td>10</td><td>Darla</td><td>None</td><td>None</td><td>None</td><td>None</td></tr>\n",
"<tr><th>4</th><td>891327558926688256</td><td>NaN</td><td>NaN</td><td>2017-07-29 16:00:24 +0000</td><td>&lt;a href=\"http://twitter.com/download/iphone\" r...</td><td>This is Franklin. He would like you to stop ca...</td><td>NaN</td><td>NaN</td><td>NaN</td><td>https://twitter.com/dog_rates/status/891327558...</td><td>12</td><td>10</td><td>Franklin</td><td>None</td><td>None</td><td>None</td><td>None</td></tr>\n",
"</tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " tweet_id in_reply_to_status_id in_reply_to_user_id \\\n", "0 892420643555336193 NaN NaN \n", "1 892177421306343426 NaN NaN \n", "2 891815181378084864 NaN NaN \n", "3 891689557279858688 NaN NaN \n", "4 891327558926688256 NaN NaN \n", "\n", " timestamp \\\n", "0 2017-08-01 16:23:56 +0000 \n", "1 2017-08-01 00:17:27 +0000 \n", "2 2017-07-31 00:18:03 +0000 \n", "3 2017-07-30 15:58:51 +0000 \n", "4 2017-07-29 16:00:24 +0000 \n", "\n", " source \\\n", "0 \n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
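<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr><th></th><th>tweet_id</th><th>jpg_url</th><th>img_num</th><th>p1</th><th>p1_conf</th><th>p1_dog</th><th>p2</th><th>p2_conf</th><th>p2_dog</th><th>p3</th><th>p3_conf</th><th>p3_dog</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>666020888022790149</td><td>https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg</td><td>1</td><td>Welsh_springer_spaniel</td><td>0.465074</td><td>True</td><td>collie</td><td>0.156665</td><td>True</td><td>Shetland_sheepdog</td><td>0.061428</td><td>True</td></tr>\n",
"<tr><th>1</th><td>666029285002620928</td><td>https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg</td><td>1</td><td>redbone</td><td>0.506826</td><td>True</td><td>miniature_pinscher</td><td>0.074192</td><td>True</td><td>Rhodesian_ridgeback</td><td>0.072010</td><td>True</td></tr>\n",
"<tr><th>2</th><td>666033412701032449</td><td>https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg</td><td>1</td><td>German_shepherd</td><td>0.596461</td><td>True</td><td>malinois</td><td>0.138584</td><td>True</td><td>bloodhound</td><td>0.116197</td><td>True</td></tr>\n",
"<tr><th>3</th><td>666044226329800704</td><td>https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg</td><td>1</td><td>Rhodesian_ridgeback</td><td>0.408143</td><td>True</td><td>redbone</td><td>0.360687</td><td>True</td><td>miniature_pinscher</td><td>0.222752</td><td>True</td></tr>\n",
"<tr><th>4</th><td>666049248165822465</td><td>https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg</td><td>1</td><td>miniature_pinscher</td><td>0.560311</td><td>True</td><td>Rottweiler</td><td>0.243682</td><td>True</td><td>Doberman</td><td>0.154629</td><td>True</td></tr>\n",
"<tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"<tr><th>2070</th><td>891327558926688256</td><td>https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg</td><td>2</td><td>basset</td><td>0.555712</td><td>True</td><td>English_springer</td><td>0.225770</td><td>True</td><td>German_short-haired_pointer</td><td>0.175219</td><td>True</td></tr>\n",
"<tr><th>2071</th><td>891689557279858688</td><td>https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg</td><td>1</td><td>paper_towel</td><td>0.170278</td><td>False</td><td>Labrador_retriever</td><td>0.168086</td><td>True</td><td>spatula</td><td>0.040836</td><td>False</td></tr>\n",
"<tr><th>2072</th><td>891815181378084864</td><td>https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg</td><td>1</td><td>Chihuahua</td><td>0.716012</td><td>True</td><td>malamute</td><td>0.078253</td><td>True</td><td>kelpie</td><td>0.031379</td><td>True</td></tr>\n",
"<tr><th>2073</th><td>892177421306343426</td><td>https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg</td><td>1</td><td>Chihuahua</td><td>0.323581</td><td>True</td><td>Pekinese</td><td>0.090647</td><td>True</td><td>papillon</td><td>0.068957</td><td>True</td></tr>\n",
"<tr><th>2074</th><td>892420643555336193</td><td>https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg</td><td>1</td><td>orange</td><td>0.097049</td><td>False</td><td>bagel</td><td>0.085851</td><td>False</td><td>banana</td><td>0.076110</td><td>False</td></tr>\n",
"</tbody>\n",
"</table>\n",
"<p>2075 rows × 12 columns</p>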
\n", "" ], "text/plain": [ " tweet_id jpg_url \\\n", "0 666020888022790149 https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg \n", "1 666029285002620928 https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg \n", "2 666033412701032449 https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg \n", "3 666044226329800704 https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg \n", "4 666049248165822465 https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg \n", "... ... ... \n", "2070 891327558926688256 https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg \n", "2071 891689557279858688 https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg \n", "2072 891815181378084864 https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg \n", "2073 892177421306343426 https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg \n", "2074 892420643555336193 https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg \n", "\n", " img_num p1 p1_conf p1_dog p2 \\\n", "0 1 Welsh_springer_spaniel 0.465074 True collie \n", "1 1 redbone 0.506826 True miniature_pinscher \n", "2 1 German_shepherd 0.596461 True malinois \n", "3 1 Rhodesian_ridgeback 0.408143 True redbone \n", "4 1 miniature_pinscher 0.560311 True Rottweiler \n", "... ... ... ... ... ... \n", "2070 2 basset 0.555712 True English_springer \n", "2071 1 paper_towel 0.170278 False Labrador_retriever \n", "2072 1 Chihuahua 0.716012 True malamute \n", "2073 1 Chihuahua 0.323581 True Pekinese \n", "2074 1 orange 0.097049 False bagel \n", "\n", " p2_conf p2_dog p3 p3_conf p3_dog \n", "0 0.156665 True Shetland_sheepdog 0.061428 True \n", "1 0.074192 True Rhodesian_ridgeback 0.072010 True \n", "2 0.138584 True bloodhound 0.116197 True \n", "3 0.360687 True miniature_pinscher 0.222752 True \n", "4 0.243682 True Doberman 0.154629 True \n", "... ... ... ... ... ... 
\n", "2070 0.225770 True German_short-haired_pointer 0.175219 True \n", "2071 0.168086 True spatula 0.040836 False \n", "2072 0.078253 True kelpie 0.031379 True \n", "2073 0.090647 True papillon 0.068957 True \n", "2074 0.085851 False banana 0.076110 False \n", "\n", "[2075 rows x 12 columns]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#viewing the image-predictions dataframe\n", "df2" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr><th></th><th>tweet_id</th><th>retweet_count</th><th>favourite_count</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>892420643555336193</td><td>8853</td><td>39467</td></tr>\n",
"<tr><th>1</th><td>892177421306343426</td><td>6514</td><td>33819</td></tr>\n",
"<tr><th>2</th><td>891815181378084864</td><td>4328</td><td>25461</td></tr>\n",
"<tr><th>3</th><td>891689557279858688</td><td>8964</td><td>42908</td></tr>\n",
"<tr><th>4</th><td>891327558926688256</td><td>9774</td><td>41048</td></tr>\n",
"<tr><th>...</th><td>...</td><td>...</td><td>...</td></tr>\n",
"<tr><th>2349</th><td>666049248165822465</td><td>41</td><td>111</td></tr>\n",
"<tr><th>2350</th><td>666044226329800704</td><td>147</td><td>311</td></tr>\n",
"<tr><th>2351</th><td>666033412701032449</td><td>47</td><td>128</td></tr>\n",
"<tr><th>2352</th><td>666029285002620928</td><td>48</td><td>132</td></tr>\n",
"<tr><th>2353</th><td>666020888022790149</td><td>532</td><td>2535</td></tr>\n",
"</tbody>\n",
"</table>\n",
"<p>2354 rows × 3 columns</p>\n",
"</div>
" ], "text/plain": [ " tweet_id retweet_count favourite_count\n", "0 892420643555336193 8853 39467\n", "1 892177421306343426 6514 33819\n", "2 891815181378084864 4328 25461\n", "3 891689557279858688 8964 42908\n", "4 891327558926688256 9774 41048\n", "... ... ... ...\n", "2349 666049248165822465 41 111\n", "2350 666044226329800704 147 311\n", "2351 666033412701032449 47 128\n", "2352 666029285002620928 48 132\n", "2353 666020888022790149 532 2535\n", "\n", "[2354 rows x 3 columns]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#checking the tweet-json file\n", "df3" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 2356 entries, 0 to 2355\n", "Data columns (total 17 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 tweet_id 2356 non-null int64 \n", " 1 in_reply_to_status_id 78 non-null float64\n", " 2 in_reply_to_user_id 78 non-null float64\n", " 3 timestamp 2356 non-null object \n", " 4 source 2356 non-null object \n", " 5 text 2356 non-null object \n", " 6 retweeted_status_id 181 non-null float64\n", " 7 retweeted_status_user_id 181 non-null float64\n", " 8 retweeted_status_timestamp 181 non-null object \n", " 9 expanded_urls 2297 non-null object \n", " 10 rating_numerator 2356 non-null int64 \n", " 11 rating_denominator 2356 non-null int64 \n", " 12 name 2356 non-null object \n", " 13 doggo 2356 non-null object \n", " 14 floofer 2356 non-null object \n", " 15 pupper 2356 non-null object \n", " 16 puppo 2356 non-null object \n", "dtypes: float64(4), int64(3), object(10)\n", "memory usage: 313.0+ KB\n" ] } ], "source": [ "#checking the information of df1\n", "df1.info()" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 2075 entries, 0 to 2074\n", "Data columns (total 12 
columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 tweet_id 2075 non-null int64 \n", " 1 jpg_url 2075 non-null object \n", " 2 img_num 2075 non-null int64 \n", " 3 p1 2075 non-null object \n", " 4 p1_conf 2075 non-null float64\n", " 5 p1_dog 2075 non-null bool \n", " 6 p2 2075 non-null object \n", " 7 p2_conf 2075 non-null float64\n", " 8 p2_dog 2075 non-null bool \n", " 9 p3 2075 non-null object \n", " 10 p3_conf 2075 non-null float64\n", " 11 p3_dog 2075 non-null bool \n", "dtypes: bool(3), float64(3), int64(2), object(4)\n", "memory usage: 152.1+ KB\n" ] } ], "source": [ "#checking the information of df2\n", "df2.info()" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 2354 entries, 0 to 2353\n", "Data columns (total 3 columns):\n", " # Column Non-Null Count Dtype\n", "--- ------ -------------- -----\n", " 0 tweet_id 2354 non-null int64\n", " 1 retweet_count 2354 non-null int64\n", " 2 favourite_count 2354 non-null int64\n", "dtypes: int64(3)\n", "memory usage: 55.3 KB\n" ] } ], "source": [ "#checking the information of df3\n", "df3.info()" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr><th></th><th>tweet_id</th><th>img_num</th><th>p1_conf</th><th>p2_conf</th><th>p3_conf</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>count</th><td>2.075000e+03</td><td>2075.000000</td><td>2075.000000</td><td>2.075000e+03</td><td>2.075000e+03</td></tr>\n",
"<tr><th>mean</th><td>7.384514e+17</td><td>1.203855</td><td>0.594548</td><td>1.345886e-01</td><td>6.032417e-02</td></tr>\n",
"<tr><th>std</th><td>6.785203e+16</td><td>0.561875</td><td>0.271174</td><td>1.006657e-01</td><td>5.090593e-02</td></tr>\n",
"<tr><th>min</th><td>6.660209e+17</td><td>1.000000</td><td>0.044333</td><td>1.011300e-08</td><td>1.740170e-10</td></tr>\n",
"<tr><th>25%</th><td>6.764835e+17</td><td>1.000000</td><td>0.364412</td><td>5.388625e-02</td><td>1.622240e-02</td></tr>\n",
"<tr><th>50%</th><td>7.119988e+17</td><td>1.000000</td><td>0.588230</td><td>1.181810e-01</td><td>4.944380e-02</td></tr>\n",
"<tr><th>75%</th><td>7.932034e+17</td><td>1.000000</td><td>0.843855</td><td>1.955655e-01</td><td>9.180755e-02</td></tr>\n",
"<tr><th>max</th><td>8.924206e+17</td><td>4.000000</td><td>1.000000</td><td>4.880140e-01</td><td>2.734190e-01</td></tr>\n",
"</tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " tweet_id img_num p1_conf p2_conf p3_conf\n", "count 2.075000e+03 2075.000000 2075.000000 2.075000e+03 2.075000e+03\n", "mean 7.384514e+17 1.203855 0.594548 1.345886e-01 6.032417e-02\n", "std 6.785203e+16 0.561875 0.271174 1.006657e-01 5.090593e-02\n", "min 6.660209e+17 1.000000 0.044333 1.011300e-08 1.740170e-10\n", "25% 6.764835e+17 1.000000 0.364412 5.388625e-02 1.622240e-02\n", "50% 7.119988e+17 1.000000 0.588230 1.181810e-01 4.944380e-02\n", "75% 7.932034e+17 1.000000 0.843855 1.955655e-01 9.180755e-02\n", "max 8.924206e+17 4.000000 1.000000 4.880140e-01 2.734190e-01" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#checking the descriptive statistics of df2\n", "df2.describe()" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "17 tweet_id\n", "29 tweet_id\n", "dtype: object" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#checking for duplicated columns across the 3 dataframes\n", "columns=pd.Series(list(df1)+ list(df2)+ list(df3))\n", "columns[columns.duplicated()]" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr><th></th><th>tweet_id</th><th>in_reply_to_status_id</th><th>in_reply_to_user_id</th><th>timestamp</th><th>source</th><th>text</th><th>retweeted_status_id</th><th>retweeted_status_user_id</th><th>retweeted_status_timestamp</th><th>expanded_urls</th><th>rating_numerator</th><th>rating_denominator</th><th>name</th><th>doggo</th><th>floofer</th><th>pupper</th><th>puppo</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>19</th><td>888202515573088257</td><td>NaN</td><td>NaN</td><td>2017-07-21 01:02:36 +0000</td><td>&lt;a href=\"http://twitter.com/download/iphone\" r...</td><td>RT @dog_rates: This is Canela. She attempted s...</td><td>8.874740e+17</td><td>4.196984e+09</td><td>2017-07-19 00:47:34 +0000</td><td>https://twitter.com/dog_rates/status/887473957...</td><td>13</td><td>10</td><td>Canela</td><td>None</td><td>None</td><td>None</td><td>None</td></tr>\n",
"<tr><th>32</th><td>886054160059072513</td><td>NaN</td><td>NaN</td><td>2017-07-15 02:45:48 +0000</td><td>&lt;a href=\"http://twitter.com/download/iphone\" r...</td><td>RT @Athletics: 12/10 #BATP https://t.co/WxwJmv...</td><td>8.860537e+17</td><td>1.960740e+07</td><td>2017-07-15 02:44:07 +0000</td><td>https://twitter.com/dog_rates/status/886053434...</td><td>12</td><td>10</td><td>None</td><td>None</td><td>None</td><td>None</td><td>None</td></tr>\n",
"<tr><th>36</th><td>885311592912609280</td><td>NaN</td><td>NaN</td><td>2017-07-13 01:35:06 +0000</td><td>&lt;a href=\"http://twitter.com/download/iphone\" r...</td><td>RT @dog_rates: This is Lilly. She just paralle...</td><td>8.305833e+17</td><td>4.196984e+09</td><td>2017-02-12 01:04:29 +0000</td><td>https://twitter.com/dog_rates/status/830583320...</td><td>13</td><td>10</td><td>Lilly</td><td>None</td><td>None</td><td>None</td><td>None</td></tr>\n",
"<tr><th>68</th><td>879130579576475649</td><td>NaN</td><td>NaN</td><td>2017-06-26 00:13:58 +0000</td><td>&lt;a href=\"http://twitter.com/download/iphone\" r...</td><td>RT @dog_rates: This is Emmy. She was adopted t...</td><td>8.780576e+17</td><td>4.196984e+09</td><td>2017-06-23 01:10:23 +0000</td><td>https://twitter.com/dog_rates/status/878057613...</td><td>14</td><td>10</td><td>Emmy</td><td>None</td><td>None</td><td>None</td><td>None</td></tr>\n",
"<tr><th>73</th><td>878404777348136964</td><td>NaN</td><td>NaN</td><td>2017-06-24 00:09:53 +0000</td><td>&lt;a href=\"http://twitter.com/download/iphone\" r...</td><td>RT @dog_rates: Meet Shadow. In an attempt to r...</td><td>8.782815e+17</td><td>4.196984e+09</td><td>2017-06-23 16:00:04 +0000</td><td>https://www.gofundme.com/3yd6y1c,https://twitt...</td><td>13</td><td>10</td><td>Shadow</td><td>None</td><td>None</td><td>None</td><td>None</td></tr>\n",
"<tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"<tr><th>1023</th><td>746521445350707200</td><td>NaN</td><td>NaN</td><td>2016-06-25 01:52:36 +0000</td><td>&lt;a href=\"http://twitter.com/download/iphone\" r...</td><td>RT @dog_rates: This is Shaggy. He knows exactl...</td><td>6.678667e+17</td><td>4.196984e+09</td><td>2015-11-21 00:46:50 +0000</td><td>https://twitter.com/dog_rates/status/667866724...</td><td>10</td><td>10</td><td>Shaggy</td><td>None</td><td>None</td><td>None</td><td>None</td></tr>\n",
"<tr><th>1043</th><td>743835915802583040</td><td>NaN</td><td>NaN</td><td>2016-06-17 16:01:16 +0000</td><td>&lt;a href=\"http://twitter.com/download/iphone\" r...</td><td>RT @dog_rates: Extremely intelligent dog here....</td><td>6.671383e+17</td><td>4.196984e+09</td><td>2015-11-19 00:32:12 +0000</td><td>https://twitter.com/dog_rates/status/667138269...</td><td>10</td><td>10</td><td>None</td><td>None</td><td>None</td><td>None</td><td>None</td></tr>\n",
"<tr><th>1242</th><td>711998809858043904</td><td>NaN</td><td>NaN</td><td>2016-03-21 19:31:59 +0000</td><td>&lt;a href=\"http://twitter.com/download/iphone\" r...</td><td>RT @twitter: @dog_rates Awesome Tweet! 12/10. ...</td><td>7.119983e+17</td><td>7.832140e+05</td><td>2016-03-21 19:29:52 +0000</td><td>https://twitter.com/twitter/status/71199827977...</td><td>12</td><td>10</td><td>None</td><td>None</td><td>None</td><td>None</td><td>None</td></tr>\n",
"<tr><th>2259</th><td>667550904950915073</td><td>NaN</td><td>NaN</td><td>2015-11-20 03:51:52 +0000</td><td>&lt;a href=\"http://twitter.com\" rel=\"nofollow\"&gt;Tw...</td><td>RT @dogratingrating: Exceptional talent. Origi...</td><td>6.675487e+17</td><td>4.296832e+09</td><td>2015-11-20 03:43:06 +0000</td><td>https://twitter.com/dogratingrating/status/667...</td><td>12</td><td>10</td><td>None</td><td>None</td><td>None</td><td>None</td><td>None</td></tr>\n",
"<tr><th>2260</th><td>667550882905632768</td><td>NaN</td><td>NaN</td><td>2015-11-20 03:51:47 +0000</td><td>&lt;a href=\"http://twitter.com\" rel=\"nofollow\"&gt;Tw...</td><td>RT @dogratingrating: Unoriginal idea. Blatant ...</td><td>6.675484e+17</td><td>4.296832e+09</td><td>2015-11-20 03:41:59 +0000</td><td>https://twitter.com/dogratingrating/status/667...</td><td>5</td><td>10</td><td>None</td><td>None</td><td>None</td><td>None</td><td>None</td></tr>\n",
"</tbody>\n",
"</table>\n",
"<p>181 rows × 17 columns</p>\n",
"</div>
" ], "text/plain": [ " tweet_id in_reply_to_status_id in_reply_to_user_id \\\n", "19 888202515573088257 NaN NaN \n", "32 886054160059072513 NaN NaN \n", "36 885311592912609280 NaN NaN \n", "68 879130579576475649 NaN NaN \n", "73 878404777348136964 NaN NaN \n", "... ... ... ... \n", "1023 746521445350707200 NaN NaN \n", "1043 743835915802583040 NaN NaN \n", "1242 711998809858043904 NaN NaN \n", "2259 667550904950915073 NaN NaN \n", "2260 667550882905632768 NaN NaN \n", "\n", " timestamp \\\n", "19 2017-07-21 01:02:36 +0000 \n", "32 2017-07-15 02:45:48 +0000 \n", "36 2017-07-13 01:35:06 +0000 \n", "68 2017-06-26 00:13:58 +0000 \n", "73 2017-06-24 00:09:53 +0000 \n", "... ... \n", "1023 2016-06-25 01:52:36 +0000 \n", "1043 2016-06-17 16:01:16 +0000 \n", "1242 2016-03-21 19:31:59 +0000 \n", "2259 2015-11-20 03:51:52 +0000 \n", "2260 2015-11-20 03:51:47 +0000 \n", "\n", " source \\\n", "19
Tw... \n", "2260 Tw... \n", "\n", " text retweeted_status_id \\\n", "19 RT @dog_rates: This is Canela. She attempted s... 8.874740e+17 \n", "32 RT @Athletics: 12/10 #BATP https://t.co/WxwJmv... 8.860537e+17 \n", "36 RT @dog_rates: This is Lilly. She just paralle... 8.305833e+17 \n", "68 RT @dog_rates: This is Emmy. She was adopted t... 8.780576e+17 \n", "73 RT @dog_rates: Meet Shadow. In an attempt to r... 8.782815e+17 \n", "... ... ... \n", "1023 RT @dog_rates: This is Shaggy. He knows exactl... 6.678667e+17 \n", "1043 RT @dog_rates: Extremely intelligent dog here.... 6.671383e+17 \n", "1242 RT @twitter: @dog_rates Awesome Tweet! 12/10. ... 7.119983e+17 \n", "2259 RT @dogratingrating: Exceptional talent. Origi... 6.675487e+17 \n", "2260 RT @dogratingrating: Unoriginal idea. Blatant ... 6.675484e+17 \n", "\n", " retweeted_status_user_id retweeted_status_timestamp \\\n", "19 4.196984e+09 2017-07-19 00:47:34 +0000 \n", "32 1.960740e+07 2017-07-15 02:44:07 +0000 \n", "36 4.196984e+09 2017-02-12 01:04:29 +0000 \n", "68 4.196984e+09 2017-06-23 01:10:23 +0000 \n", "73 4.196984e+09 2017-06-23 16:00:04 +0000 \n", "... ... ... \n", "1023 4.196984e+09 2015-11-21 00:46:50 +0000 \n", "1043 4.196984e+09 2015-11-19 00:32:12 +0000 \n", "1242 7.832140e+05 2016-03-21 19:29:52 +0000 \n", "2259 4.296832e+09 2015-11-20 03:43:06 +0000 \n", "2260 4.296832e+09 2015-11-20 03:41:59 +0000 \n", "\n", " expanded_urls rating_numerator \\\n", "19 https://twitter.com/dog_rates/status/887473957... 13 \n", "32 https://twitter.com/dog_rates/status/886053434... 12 \n", "36 https://twitter.com/dog_rates/status/830583320... 13 \n", "68 https://twitter.com/dog_rates/status/878057613... 14 \n", "73 https://www.gofundme.com/3yd6y1c,https://twitt... 13 \n", "... ... ... \n", "1023 https://twitter.com/dog_rates/status/667866724... 10 \n", "1043 https://twitter.com/dog_rates/status/667138269... 10 \n", "1242 https://twitter.com/twitter/status/71199827977... 
12 \n", "2259 https://twitter.com/dogratingrating/status/667... 12 \n", "2260 https://twitter.com/dogratingrating/status/667... 5 \n", "\n", " rating_denominator name doggo floofer pupper puppo \n", "19 10 Canela None None None None \n", "32 10 None None None None None \n", "36 10 Lilly None None None None \n", "68 10 Emmy None None None None \n", "73 10 Shadow None None None None \n", "... ... ... ... ... ... ... \n", "1023 10 Shaggy None None None None \n", "1043 10 None None None None None \n", "1242 10 None None None None None \n", "2259 10 None None None None None \n", "2260 10 None None None None None \n", "\n", "[181 rows x 17 columns]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#checking which rows in df1 are retweets\n", "df1[df1.retweeted_status_id.notnull()]" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#checking for duplicated rows in df1\n", "sum(df1.duplicated())" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#checking for duplicated rows in df2\n", "sum(df2.duplicated())" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#checking for duplicated rows in df3\n", "sum(df3.duplicated())" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "tweet_id 0\n", "in_reply_to_status_id 2278\n", "in_reply_to_user_id 2278\n", "timestamp 0\n", "source 0\n", "text 0\n", "retweeted_status_id 2175\n", "retweeted_status_user_id 2175\n", "retweeted_status_timestamp 2175\n", 
"expanded_urls 59\n", "rating_numerator 0\n", "rating_denominator 0\n", "name 0\n", "doggo 0\n", "floofer 0\n", "pupper 0\n", "puppo 0\n", "dtype: int64" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#checking the number of missing values in each column of df1\n", "df1.isnull().sum()" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "tweet_id 0\n", "jpg_url 0\n", "img_num 0\n", "p1 0\n", "p1_conf 0\n", "p1_dog 0\n", "p2 0\n", "p2_conf 0\n", "p2_dog 0\n", "p3 0\n", "p3_conf 0\n", "p3_dog 0\n", "dtype: int64" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#checking the number of missing values in each column of df2\n", "df2.isnull().sum()" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "tweet_id 0\n", "retweet_count 0\n", "favourite_count 0\n", "dtype: int64" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#checking the number of missing values in each column of df3\n", "df3.isnull().sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Quality issues\n", "#### df1 (twitter-archive)\n", "- Erroneous datatypes in the *tweet_id, timestamp and retweeted_status_timestamp* columns.\n", "- Some values in the *name* column are in lowercase and appear not to be dog names.\n", "- Some tweet IDs have values in the *retweeted_status_id, retweeted_status_user_id and retweeted_status_timestamp* columns, which marks them as retweets.\n", "- Missing data in the *in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id, retweeted_status_timestamp and expanded_urls* columns.\n", "\n", "#### df2 (image-prediction)\n", "- Erroneous datatype in the *tweet_id* column.\n", "- Outliers in the *p2_conf and p3_conf* columns.\n", "- Non-descriptive column names, e.g. *p1, p2, p3, p1_conf*, etc.\n", "- Inconsistent letter case in *p1, p2, p3*.\n", "\n", "#### df3 (tweet-json)\n", "- Erroneous 
datatype in the *tweet_id* column.\n" ] }, { "cell_type": "markdown", "metadata": { "extensions": { "jupyter_dashboards": { "version": 1, "views": { "grid_default": { "col": 0, "height": 7, "hidden": false, "row": 40, "width": 12 }, "report_default": { "hidden": false } } } } }, "source": [ "### Tidiness issues\n", "- Dog stages (doggo, pupper, puppo, and floofer) should be in one column.\n", "- *tweet_id* in the twitter-archive dataframe is duplicated in the other two dataframes.\n", "- The three dataframes should be merged into one." ] }, { "cell_type": "markdown", "metadata": { "extensions": { "jupyter_dashboards": { "version": 1, "views": { "grid_default": { "col": 4, "height": 4, "hidden": false, "row": 32, "width": 4 }, "report_default": { "hidden": false } } } } }, "source": [ "## Cleaning Data\n" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "# Make copies of original pieces of data\n", "df1_clean=df1.copy()\n", "df2_clean=df2.copy()\n", "df3_clean=df3.copy()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Quality\n", "### df1\n", "### Issue #1: Erroneous datatypes in the *tweet_id, timestamp, retweeted_status_timestamp* columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Define: Change *tweet_id* from *int* to *string*, and change the other two columns to *datetime*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Code" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "df1_clean['tweet_id']=df1_clean['tweet_id'].astype(str)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "#converting each timestamp column in one vectorized call\n", "date_columns=['timestamp', 'retweeted_status_timestamp']\n", "for column in date_columns:\n", "    df1_clean[column]=pd.to_datetime(df1_clean[column])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Test" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": 
"stdout", "output_type": "stream", "text": [ "<class 'pandas.core.frame.DataFrame'>\n", "RangeIndex: 2356 entries, 0 to 2355\n", "Data columns (total 17 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 tweet_id 2356 non-null object \n", " 1 in_reply_to_status_id 78 non-null float64 \n", " 2 in_reply_to_user_id 78 non-null float64 \n", " 3 timestamp 2356 non-null datetime64[ns, UTC]\n", " 4 source 2356 non-null object \n", " 5 text 2356 non-null object \n", " 6 retweeted_status_id 181 non-null float64 \n", " 7 retweeted_status_user_id 181 non-null float64 \n", " 8 retweeted_status_timestamp 181 non-null datetime64[ns, UTC]\n", " 9 expanded_urls 2297 non-null object \n", " 10 rating_numerator 2356 non-null int64 \n", " 11 rating_denominator 2356 non-null int64 \n", " 12 name 2356 non-null object \n", " 13 doggo 2356 non-null object \n", " 14 floofer 2356 non-null object \n", " 15 pupper 2356 non-null object \n", " 16 puppo 2356 non-null object \n", "dtypes: datetime64[ns, UTC](2), float64(4), int64(2), object(9)\n", "memory usage: 313.0+ KB\n" ] } ], "source": [ "df1_clean.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Issue #2: Some values in the name column are in lowercase and appear not to be dog names" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true, "extensions": { "jupyter_dashboards": { "version": 1, "views": { "grid_default": { "hidden": true }, "report_default": { "hidden": true } } } } }, "source": [ "#### Define: Replace the lowercase values in the *name* column with 'None'." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Code" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['such', 'a', 'quite', 'not', 'one', 'incredibly', 'mad', 'an', 'very', 'just', 'my', 'his', 'actually', 'getting', 'this', 'unacceptable', 'all', 'old', 'infuriating', 'the', 'by', 'officially', 'life', 'light', 'space']\n" ] } ], "source": [ "#collect the unique lowercase values in the name column\n", "lowercase_names=[]\n", "\n", "for name in df1_clean.name:\n", " if name.islower() and name not in lowercase_names:\n", " lowercase_names.append(name)\n", " \n", "print(lowercase_names)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "#replace the lowercase values with 'None', the placeholder already used for missing names\n", "df1_clean['name']=df1_clean['name'].replace(lowercase_names, 'None')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Test" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "595 None\n", "552 Rusty\n", "492 Atlas\n", "74 Terrance\n", "1826 None\n", "879 Theo\n", "37 None\n", "1002 None\n", "361 Leo\n", "1710 Penny\n", "572 None\n", "1949 None\n", "87 Nugget\n", "621 None\n", "1267 Olaf\n", "1714 None\n", "254 Charlie\n", "741 Bell\n", "1862 None\n", "172 None\n", "Name: name, dtype: object" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1_clean.name.sample(20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Issue #3: Some tweet IDs have values in the retweeted_status_id, retweeted_status_user_id and retweeted_status_timestamp columns." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Define: Remove the rows that are retweets (rows with a value in *retweeted_status_id*), since we need original tweets and not retweets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Code" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "181\n", "181\n", "181\n" ] } ], "source": [ "#count the non-null values in these columns\n", "print(df1_clean.retweeted_status_id.notnull().sum())\n", "print(df1_clean.retweeted_status_user_id.notnull().sum())\n", "print(df1_clean.retweeted_status_timestamp.notnull().sum())" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "#keep only the rows that are not retweets\n", "df1_clean=df1_clean[df1_clean.retweeted_status_id.isnull()]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Test" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0\n", "0\n", "0\n" ] } ], "source": [ "#cross-check that no retweet rows remain\n", "print(df1_clean.retweeted_status_id.notnull().sum())\n", "print(df1_clean.retweeted_status_user_id.notnull().sum())\n", "print(df1_clean.retweeted_status_timestamp.notnull().sum())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Issue #4: Missing data in the *in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id, retweeted_status_timestamp, expanded_urls* columns." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Define: Drop these columns, as they contain missing data and won't be used for further analysis." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Code" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "#group all columns into a list and drop them\n", "column=['in_reply_to_status_id', 'in_reply_to_user_id', 'retweeted_status_id', 'retweeted_status_user_id',\n", " 'retweeted_status_timestamp', 'expanded_urls']\n", "\n", "df1_clean.drop(column, axis=1, inplace=True)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Test" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "tweet_id 0\n", "timestamp 0\n", "source 0\n", "text 0\n", "rating_numerator 0\n", "rating_denominator 0\n", "name 0\n", "doggo 0\n", "floofer 0\n", "pupper 0\n", "puppo 0\n", "dtype: int64" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1_clean.isnull().sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### df2\n", "### Issue #5: Erroneous datatype in tweet_id column." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Define: Change the datatype from int to string" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Code" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": [ "#changing the datatype to string using .astype()\n", "df2_clean['tweet_id']=df2_clean.tweet_id.astype(str)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Test" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 2075 entries, 0 to 2074\n", "Data columns (total 12 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 tweet_id 2075 non-null object \n", " 1 jpg_url 2075 non-null object \n", " 2 img_num 2075 non-null int64 \n", " 3 p1 2075 non-null object \n", " 4 p1_conf 2075 non-null float64\n", " 5 p1_dog 2075 non-null bool \n", " 6 p2 2075 non-null object \n", " 7 p2_conf 2075 non-null float64\n", " 8 p2_dog 2075 non-null bool \n", " 9 p3 2075 non-null object \n", " 10 p3_conf 2075 non-null float64\n", " 11 p3_dog 2075 non-null bool \n", "dtypes: bool(3), float64(3), int64(1), object(5)\n", "memory usage: 152.1+ KB\n" ] } ], "source": [ "df2_clean.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Issue #6: Outliers in the *p2_conf and p3_conf* columns." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Define: Drop these columns, as they will not be used in further analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Code" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [], "source": [ "#columns to drop (named conf_cols to avoid shadowing the built-in list)\n", "conf_cols=['p2_conf', 'p3_conf']\n", "\n", "df2_clean.drop(conf_cols, axis=1, inplace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Test" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 2075 entries, 0 to 2074\n", "Data columns (total 10 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 tweet_id 2075 non-null object \n", " 1 jpg_url 2075 non-null object \n", " 2 img_num 2075 non-null int64 \n", " 3 p1 2075 non-null object \n", " 4 p1_conf 2075 non-null float64\n", " 5 p1_dog 2075 non-null bool \n", " 6 p2 2075 non-null object \n", " 7 p2_dog 2075 non-null bool \n", " 8 p3 2075 non-null object \n", " 9 p3_dog 2075 non-null bool \n", "dtypes: bool(3), float64(1), int64(1), object(5)\n", "memory usage: 119.7+ KB\n" ] } ], "source": [ "df2_clean.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Issue #7: Non-descriptive column names, e.g. p1, p2, p3, p1_conf, etc." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Define: Rename the columns with the following descriptive names: prediction1, p1_confidence, p1_asdog, prediction2, p2_asdog, prediction3, p3_asdog" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Code" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [], "source": [ "df2_clean=df2_clean.rename(columns={'p1': 'prediction1', 'p1_conf': 'p1_confidence', 'p1_dog': 'p1_asdog', 'p2':'prediction2', \n", " 'p2_dog':'p2_asdog', 'p3':'prediction3', 'p3_dog':'p3_asdog'})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Test" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
tweet_idjpg_urlimg_numprediction1p1_confidencep1_asdogprediction2p2_asdogprediction3p3_asdog
0666020888022790149https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg1Welsh_springer_spaniel0.465074TruecollieTrueShetland_sheepdogTrue
\n", "
" ], "text/plain": [ " tweet_id jpg_url \\\n", "0 666020888022790149 https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg \n", "\n", " img_num prediction1 p1_confidence p1_asdog prediction2 \\\n", "0 1 Welsh_springer_spaniel 0.465074 True collie \n", "\n", " p2_asdog prediction3 p3_asdog \n", "0 True Shetland_sheepdog True " ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2_clean.head(1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Issue #8: Inconsistent alphabet case in *p1, p2, p3*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Define: we change the values in the 3 columns to lowercase" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Code" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [], "source": [ "df2_clean['prediction1']=df2_clean.prediction1.str.lower()\n", "df2_clean['prediction2']=df2_clean.prediction2.str.lower()\n", "df2_clean['prediction3']=df2_clean.prediction3.str.lower()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Test" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "171 hog\n", "1747 cowboy_boot\n", "988 miniature_pinscher\n", "143 crash_helmet\n", "984 ram\n", "1529 malamute\n", "493 bath_towel\n", "331 tennis_ball\n", "421 robin\n", "23 golden_retriever\n", "1093 home_theater\n", "1297 upright\n", "2049 samoyed\n", "786 pug\n", "1434 old_english_sheepdog\n", "619 space_heater\n", "508 walking_stick\n", "127 pembroke\n", "253 wood_rabbit\n", "899 border_terrier\n", "Name: prediction1, dtype: object" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2_clean.prediction1.sample(20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### df3\n", "### Issue #9: Erroneous datatype in tweet_id column." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Define: Change the datatype from int to string using the astype() method" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Code" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [], "source": [ "df3_clean['tweet_id']=df3_clean.tweet_id.astype(str)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Test" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 2354 entries, 0 to 2353\n", "Data columns (total 3 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 tweet_id 2354 non-null object\n", " 1 retweet_count 2354 non-null int64 \n", " 2 favourite_count 2354 non-null int64 \n", "dtypes: int64(2), object(1)\n", "memory usage: 55.3+ KB\n" ] } ], "source": [ "df3_clean.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tidiness\n", "### Issue #1: Dog stages (doggo, pupper, puppo, and floofer) should be in one column." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Define: Using pd.melt, we convert the four columns of dog stages into one column." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Code" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [], "source": [ "#using the pd.melt method\n", "df1_clean=pd.melt(df1_clean, id_vars=['tweet_id', 'timestamp', 'source', 'text', 'rating_numerator', 'rating_denominator', 'name'],\n", " var_name='types', value_name='dog_stages')" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [], "source": [ "#removing the var_name column as this just repeats the types of dog_stages and also to get rid of duplicates\n", "df1_clean.drop(columns='types', inplace=True)\n", "df1_clean.drop_duplicates(inplace=True)\n", "\n", "#replacing the 'None' values with NaN\n", "df1_clean['dog_stages']=df1_clean.dog_stages.replace('None', np.nan)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Test" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "pupper 234\n", "doggo 87\n", "puppo 25\n", "floofer 10\n", "Name: dog_stages, dtype: int64" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1_clean['dog_stages'].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Issue #2: *tweet_id* in twitter-archive (df1) dataframe is duplicated in the other two dataframe." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Define: Merge the favorite count and retweet count to the twitter-archive table, joining on tweet_id." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Code" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [], "source": [ "df1_clean=pd.merge(df1_clean, df3_clean, on='tweet_id', how='left')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Test" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Int64Index: 2531 entries, 0 to 2530\n", "Data columns (total 10 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 tweet_id 2531 non-null object \n", " 1 timestamp 2531 non-null datetime64[ns, UTC]\n", " 2 source 2531 non-null object \n", " 3 text 2531 non-null object \n", " 4 rating_numerator 2531 non-null int64 \n", " 5 rating_denominator 2531 non-null int64 \n", " 6 name 2531 non-null object \n", " 7 dog_stages 356 non-null object \n", " 8 retweet_count 2531 non-null int64 \n", " 9 favourite_count 2531 non-null int64 \n", "dtypes: datetime64[ns, UTC](1), int64(4), object(5)\n", "memory usage: 217.5+ KB\n" ] } ], "source": [ "df1_clean.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Issue #3: Merging the dataframes into one." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Define: Before merging the image prediction table, we would be dropping some columns as they would not be used for further analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Code" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [], "source": [ "#dropping the prediction2, p2_asdog, prediction3, p3_asdog columns\n", "list = ['prediction2', 'p2_asdog', 'prediction3', 'p3_asdog']\n", "df2_clean.drop(list, axis=1, inplace=True)" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [], "source": [ "#then we merge the image-prediction dataframe with the twitter-archive on tweet_id\n", "df1_clean=pd.merge(df1_clean, df2_clean, on='tweet_id', how='left')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Test" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Int64Index: 2531 entries, 0 to 2530\n", "Data columns (total 15 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 tweet_id 2531 non-null object \n", " 1 timestamp 2531 non-null datetime64[ns, UTC]\n", " 2 source 2531 non-null object \n", " 3 text 2531 non-null object \n", " 4 rating_numerator 2531 non-null int64 \n", " 5 rating_denominator 2531 non-null int64 \n", " 6 name 2531 non-null object \n", " 7 dog_stages 356 non-null object \n", " 8 retweet_count 2531 non-null int64 \n", " 9 favourite_count 2531 non-null int64 \n", " 10 jpg_url 2311 non-null object \n", " 11 img_num 2311 non-null float64 \n", " 12 prediction1 2311 non-null object \n", " 13 p1_confidence 2311 non-null float64 \n", " 14 p1_asdog 2311 non-null object \n", "dtypes: datetime64[ns, UTC](1), float64(2), int64(4), object(8)\n", "memory usage: 316.4+ KB\n" ] } ], "source": [ "df1_clean.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- The 
datatype of the *img_num* column changed to float during the merge, so we need to change it back to integer. First we convert all the NaN values to 0 to avoid errors." ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [], "source": [ "df1_clean['img_num']=df1_clean.img_num.fillna(0)\n", "df1_clean['img_num']=df1_clean.img_num.astype(int)" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Int64Index: 2531 entries, 0 to 2530\n", "Data columns (total 15 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 tweet_id 2531 non-null object \n", " 1 timestamp 2531 non-null datetime64[ns, UTC]\n", " 2 source 2531 non-null object \n", " 3 text 2531 non-null object \n", " 4 rating_numerator 2531 non-null int64 \n", " 5 rating_denominator 2531 non-null int64 \n", " 6 name 2531 non-null object \n", " 7 dog_stages 356 non-null object \n", " 8 retweet_count 2531 non-null int64 \n", " 9 favourite_count 2531 non-null int64 \n", " 10 jpg_url 2311 non-null object \n", " 11 img_num 2531 non-null int32 \n", " 12 prediction1 2311 non-null object \n", " 13 p1_confidence 2311 non-null float64 \n", " 14 p1_asdog 2311 non-null object \n", "dtypes: datetime64[ns, UTC](1), float64(1), int32(1), int64(4), object(8)\n", "memory usage: 306.5+ KB\n" ] } ], "source": [ "#checking to see that the datatype has changed\n", "df1_clean.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Storing Data\n", "Save gathered, assessed, and cleaned master dataset to a CSV file named \"twitter_archive_master.csv\"." ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [], "source": [ "df1_clean.to_csv('twitter_archive_master.csv', index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Analyzing and Visualizing Data\n", "In this section, analyze and visualize your wrangled data. 
You must produce at least **three (3) insights and one (1) visualization.**" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
tweet_idtimestampsourcetextrating_numeratorrating_denominatornamedog_stagesretweet_countfavourite_countjpg_urlimg_numprediction1p1_confidencep1_asdog
08924206435553361932017-08-01 16:23:56+00:00<a href=\"http://twitter.com/download/iphone\" r...This is Phineas. He's a mystical boy. Only eve...1310PhineasNaN885339467https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg1orange0.097049False
18921774213063434262017-08-01 00:17:27+00:00<a href=\"http://twitter.com/download/iphone\" r...This is Tilly. She's just checking pup on you....1310TillyNaN651433819https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg1chihuahua0.323581True
28918151813780848642017-07-31 00:18:03+00:00<a href=\"http://twitter.com/download/iphone\" r...This is Archie. He is a rare Norwegian Pouncin...1210ArchieNaN432825461https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg1chihuahua0.716012True
38916895572798586882017-07-30 15:58:51+00:00<a href=\"http://twitter.com/download/iphone\" r...This is Darla. She commenced a snooze mid meal...1310DarlaNaN896442908https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg1paper_towel0.170278False
\n", "
" ], "text/plain": [ " tweet_id timestamp \\\n", "0 892420643555336193 2017-08-01 16:23:56+00:00 \n", "1 892177421306343426 2017-08-01 00:17:27+00:00 \n", "2 891815181378084864 2017-07-31 00:18:03+00:00 \n", "3 891689557279858688 2017-07-30 15:58:51+00:00 \n", "\n", " source \\\n", "0
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
tweet_idrating_numeratorrating_denominatorretweet_countfavourite_countimg_nump1_confidence
count2.531000e+032531.0000002531.0000002531.0000002531.0000002531.0000002311.000000
mean7.384206e+1712.92730110.4235482899.7929679048.9320431.1054920.597116
std6.693796e+1644.2522086.5087955085.33174312668.5141760.6444760.271582
min6.660209e+170.0000000.0000000.00000052.0000000.0000000.044333
25%6.783890e+1710.00000010.000000642.0000002093.5000001.0000000.367945
50%7.124382e+1711.00000010.0000001408.0000004228.0000001.0000000.596796
75%7.904598e+1712.00000010.0000003263.00000011309.5000001.0000000.846807
max8.924206e+171776.000000170.00000079515.000000132810.0000004.0000001.000000
\n", "" ], "text/plain": [ " tweet_id rating_numerator rating_denominator retweet_count \\\n", "count 2.531000e+03 2531.000000 2531.000000 2531.000000 \n", "mean 7.384206e+17 12.927301 10.423548 2899.792967 \n", "std 6.693796e+16 44.252208 6.508795 5085.331743 \n", "min 6.660209e+17 0.000000 0.000000 0.000000 \n", "25% 6.783890e+17 10.000000 10.000000 642.000000 \n", "50% 7.124382e+17 11.000000 10.000000 1408.000000 \n", "75% 7.904598e+17 12.000000 10.000000 3263.000000 \n", "max 8.924206e+17 1776.000000 170.000000 79515.000000 \n", "\n", " favourite_count img_num p1_confidence \n", "count 2531.000000 2531.000000 2311.000000 \n", "mean 9048.932043 1.105492 0.597116 \n", "std 12668.514176 0.644476 0.271582 \n", "min 52.000000 0.000000 0.044333 \n", "25% 2093.500000 1.000000 0.367945 \n", "50% 4228.000000 1.000000 0.596796 \n", "75% 11309.500000 1.000000 0.846807 \n", "max 132810.000000 4.000000 1.000000 " ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#shows the descriptive statistics of our dataset\n", "df_new.describe()" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1 1973\n", "2 226\n", "0 220\n", "3 75\n", "4 37\n", "Name: img_num, dtype: int64" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_new['img_num'].value_counts()" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
img_num1234
p1_asdog
False52642217
True14471845430
\n", "
" ], "text/plain": [ "img_num 1 2 3 4\n", "p1_asdog \n", "False 526 42 21 7\n", "True 1447 184 54 30" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_new.groupby('p1_asdog')['img_num'].value_counts().unstack()" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "pupper 234\n", "doggo 87\n", "puppo 25\n", "floofer 10\n", "Name: dog_stages, dtype: int64" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_new['dog_stages'].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Insights:\n", "1. Tweets that have one image were mostly predicted and we have more of them as dogs.\n", "\n", "2. The mean of the first prediction confidence is approximately 60%.\n", "\n", "3. We have more favorite_count than retweet_count.\n", "\n", "4. We have more dogs as pupper (smaller and younger dogs) than the rest of the other dog stages." ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "### Visualization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Q1. What dog stage has the highest count?" 
] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAU0AAAGDCAYAAACvEkIAAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/YYfK9AAAACXBIWXMAAAsTAAALEwEAmpwYAAAXjElEQVR4nO3de9RddX3n8ffHoKKCAk1kMCDBGqtoZ3AMeK84zvKGLnCqCONo7DjS1vuoHaHOKLaLWUyrVmu9FEeW1AuY2jLiZbwhYFFuASkXgcpIkAiFIAioiCb9zh/7F3KMT5LnF57znPOQ92utZz37/PbeZ3/Pznk++e29z/6dVBWSpNm5z6QLkKSFxNCUpA6GpiR1MDQlqYOhKUkdDE1J6mBoSlIHQ1Njl2RNkjuT3JHkx0m+neQPkozl/ZfkaW0btyW5Jcm3khzY5r0yydnj2K52DIam5ssLq2pXYF/geOBtwMfmeiNJHgx8AfgAsAewFHgXcNdcb0s7JkNT86qqbquq04CXAiuTPA4gyUOS/E2SdUmuTfLfN/ZEkyxK8p4kNye5JsnrklSSnWbYxKPadk6uqg1VdWdVfbWqLknyGOAjwJOT/CTJj9vzH5LkO0luT3JdkmNHnzDJK1pNP0ryP1rP+d+3efdJcnSS/9fmr0qyR5u3c5JPtvYfJ7kgyZ5j2bGaN4amJqKqzgfWAk9vTR8AHgI8AngG8Arg99q8VwPPAw4A/i1w2Fae+p+ADUlOSvK8JLuPbPMK4A+Ac6pql6rarc36advebsAhwB8mOQwgyf7Ah4CXAXu1GpeObO8NrZ5nAA8DbgU+2OatbMvvA/xG2/adW90xmnqGpibpemCPJIsYep7HVNUdVbUGeA/w8rbc4cD7q2ptVd3KcHg/o6q6HXgaUMBHgXVJTttaD6+qzqyqS6vqX6rqEuBkhhAEeDHw+ao6u6p+AbyjPfdGvw+8vdV2F3As8OLWC/4lQ1g+svV6L2z1aQEzNDVJS4FbgMXA/YBrR+Zdy6Ye3cOA60bmjU7/mqq6oqpeWVV7A49r679vS8sneWKSM9qpgdsYeoSLZ9p2Vf0M+NHI6vsCp7bD7x8DVwAbgD2BTwBfAU5Jcn2SP0ty363VrulnaGoi2tXspcDZwM0MvbJ9RxZ5OPDDNn0DsPfIvH1mu52quhL4OEN4wq/2Ejf6NHAasE9VPYThvGdm2naSBzD0Hje6DnheVe028rNzVf2wqn5ZVe+qqv2BpwAvYDgNoAXM0NS8SvLgJC8ATgE+2Q6LNwCrgOOS7JpkX+DNwCfbaquANyZZmmQ3hivvW3r+Ryd5S5K92+N9gCOBc9siNwJ7J7nfyGq7ArdU1c+THAT8x5F5nwVemOQpbZ13sSlQYQjY41rNJFmS5NA2/cwkv91OP9zO8B/Dhp79peljaGq+fD7JHQw9s7cD72XThR6A1zNckPk+Q+/z08CJbd5Hga8ClwDfAb4ErGfmALoDeCJwXpKfMoTlZcBb2vxvAJcD/5zk5tb2GuBPWn3vYAhpAKrq8lbbKQy9zjuAm9j0Eab3M/RSv9rWP7dtH+BfMYTu7QyH7Wex6T8CLVBxEGItNEmeB3ykqvbd5sJzv+1dgB8Dy6vqmvnevibPnqamXpIHJHl+kp2SLAXeCZw6j9t/YZIHJnkQ8G7gUmDNfG1f08XQ1EIQhnOJtzIcnl/BcBg9Xw5l+HjU9cBy4IjyEG2H5eG5JHWwpylJHQxNSeow04AHC8bixYtr2bJlky5D0r3MhRdeeHNVLZlp3oIOzWXLlrF69epJlyHpXibJtVua5+G5JHUwNCWpg6EpSR0MTUnqYGhKUgdDU5I6GJqS1MHQlKQOhqYkdTA0JamDoSlJHQxNSepgaEpShwU9ytFcWXb0Fy
ddwt3WHH/IpEuQtBX2NCWpg6EpSR0MTUnqYGhKUgdDU5I6GJqS1MHQlKQOhqYkdTA0JamDoSlJHQxNSepgaEpSB0NTkjoYmpLUwdCUpA6GpiR1MDQlqYOhKUkdDE1J6mBoSlIHQ1OSOhiaktTB0JSkDoamJHUwNCWpg6EpSR0MTUnqYGhKUgdDU5I6GJqS1MHQlKQOhqYkdTA0JamDoSlJHQxNSepgaEpSB0NTkjoYmpLUwdCUpA6GpiR1MDQlqcPYQjPJPknOSHJFksuTvLG175Hka0m+137vPrLOMUmuTnJVkueMqzZJ2l7j7GmuB95SVY8BngS8Nsn+wNHA6VW1HDi9PabNOwJ4LPBc4ENJFo2xPknqNrbQrKobquqiNn0HcAWwFDgUOKktdhJwWJs+FDilqu6qqmuAq4GDxlWfJG2PeTmnmWQZ8HjgPGDPqroBhmAFHtoWWwpcN7La2ta2+XMdlWR1ktXr1q0ba92StLmxh2aSXYC/A95UVbdvbdEZ2urXGqpOqKoVVbViyZIlc1WmJM3KWEMzyX0ZAvNTVfX3rfnGJHu1+XsBN7X2tcA+I6vvDVw/zvokqdc4r54H+BhwRVW9d2TWacDKNr0S+NxI+xFJ7p9kP2A5cP646pOk7bHTGJ/7qcDLgUuTXNza/hg4HliV5FXAD4CXAFTV5UlWAd9luPL+2qraMMb6JKnb2EKzqs5m5vOUAM/awjrHAceNqyZJuqe8I0iSOhiaktTB0JSkDoamJHUwNCWpg6EpSR0MTUnqYGhKUgdDU5I6GJqS1MHQlKQOhqYkdTA0JamDoSlJHQxNSepgaEpSB0NTkjoYmpLUwdCUpA6GpiR1MDQlqYOhKUkdDE1J6mBoSlIHQ1OSOhiaktTB0JSkDoamJHUwNCWpg6EpSR0MTUnqYGhKUgdDU5I6GJqS1MHQlKQOhqYkdTA0JamDoSlJHQxNSepgaEpSB0NTkjoYmpLUwdCUpA6GpiR1MDQlqYOhKUkdDE1J6mBoSlIHQ1OSOhiaktTB0JSkDoamJHUwNCWpg6EpSR0MTUnqYGhKUgdDU5I6jC00k5yY5KYkl420HZvkh0kubj/PH5l3TJKrk1yV5DnjqkuS7olx9jQ/Djx3hva/qKoD2s+XAJLsDxwBPLat86Eki8ZYmyRtl7GFZlV9E7hllosfCpxSVXdV1TXA1cBB46pNkrbXJM5pvi7JJe3wfffWthS4bmSZta1NkqbKfIfmh4HfBA4AbgDe09ozw7I10xMkOSrJ6iSr161bN5YiJWlL5jU0q+rGqtpQVf8CfJRNh+BrgX1GFt0buH4Lz3FCVa2oqhVLliwZb8GStJl5Dc0ke408fBGw8cr6acARSe6fZD9gOXD+fNYmSbOx07ieOMnJwMHA4iRrgXcCByc5gOHQew3w+wBVdXmSVcB3gfXAa6tqw7hqk6TtNbbQrKojZ2j+2FaWPw44blz1SNJc8I4gSepgaEpSB0NTkjoYmpLUwdCUpA6GpiR1MDQlqYOhKUkdDE1J6mBoSlIHQ1OSOhiaktTB0JSkDoamJHUwNCWpg6EpSR0MTUnqYGhKUodZhWaSp86mTZLu7Wbb0/zALNsk6V5tq1+sluTJwFOAJUnePDLrwcCicRYmSdNoW99GeT9gl7bcriPttwMvHldRkjStthqaVXUWcFaSj1fVtfNUkyRNrdl+7/n9k5wALBtdp6r+3TiKkqRpNdvQ/FvgI8D/BjaMrxxJmm6zDc31VfXhsVYiSQvAbD9y9Pkkr0myV5I9Nv6MtTJJmkKz7WmubL//aKStgEfMbTmSNN1mFZpVtd+4C5GkhWBWoZnkFTO1V9XfzG05kjTdZnt4fuDI9M7As4CLAENT0g5ltofnrx99nOQhwCfGUpEkTbHtHRruZ8DyuSxEkhaC2Z7T/DzD1XIYBup4DLBqXEVJ0rSa7TnNd49Mrweuraq1Y6hHkqbarA7P28AdVzKMdLQ78ItxFiVJ02q2I7cfDp
wPvAQ4HDgviUPDSdrhzPbw/O3AgVV1E0CSJcDXgc+OqzBJmkazvXp+n42B2fyoY11JuteYbU/zy0m+ApzcHr8U+NJ4SpKk6bWt7wh6JLBnVf1Rkv8APA0IcA7wqXmoT5KmyrYOsd8H3AFQVX9fVW+uqv/K0Mt833hLk6Tps63QXFZVl2zeWFWrGb76QpJ2KNsKzZ23Mu8Bc1mIJC0E2wrNC5K8evPGJK8CLhxPSZI0vbZ19fxNwKlJXsamkFzB8H3oLxpjXZI0lbb1vec3Ak9J8kzgca35i1X1jbFXJklTaLbjaZ4BnDHmWiRp6nlXjyR1MDQlqYOhKUkdDE1J6mBoSlIHQ1OSOhiaktTB0JSkDoamJHUwNCWpw9hCM8mJSW5KctlI2x5Jvpbke+337iPzjklydZKrkjxnXHVJ0j0xzp7mx4HnbtZ2NHB6VS0HTm+PSbI/cATw2LbOh5IsGmNtkrRdxhaaVfVN4JbNmg8FTmrTJwGHjbSfUlV3VdU1wNXAQeOqTZK213yf09yzqm4AaL8f2tqXAteNLLe2tf2aJEclWZ1k9bp168ZarCRtblouBGWGtpppwao6oapWVNWKJUuWjLksSfpV8x2aNybZC6D9vqm1rwX2GVlub+D6ea5NkrZpvkPzNGBlm14JfG6k/Ygk90+yH7AcOH+ea5OkbZrVyO3bI8nJwMHA4iRrgXcCxwOr2hez/QB4CUBVXZ5kFfBdYD3w2qraMK7aJGl7jS00q+rILcx61haWPw44blz1SNJcmJYLQZK0IBiaktTB0JSkDoamJHUwNCWpg6EpSR0MTUnqYGhKUgdDU5I6GJqS1MHQlKQOY7v3XAvfsqO/OOkSAFhz/CGTLkG6mz1NSepgaEpSB0NTkjoYmpLUwdCUpA6GpiR1MDQlqYOhKUkdDE1J6mBoSlIHQ1OSOhiaktTB0JSkDoamJHUwNCWpg6EpSR0MTUnqYGhKUgdDU5I6GJqS1MHQlKQOhqYkdTA0JamDoSlJHQxNSepgaEpSB0NTkjoYmpLUwdCUpA6GpiR1MDQlqYOhKUkdDE1J6mBoSlIHQ1OSOhiaktTB0JSkDoamJHUwNCWpg6EpSR0MTUnqYGhKUgdDU5I6GJqS1MHQlKQOO01io0nWAHcAG4D1VbUiyR7AZ4BlwBrg8Kq6dRL1SdKWTLKn+cyqOqCqVrTHRwOnV9Vy4PT2WJKmyjQdnh8KnNSmTwIOm1wpkjSzSYVmAV9NcmGSo1rbnlV1A0D7/dCZVkxyVJLVSVavW7dunsqVpMFEzmkCT62q65M8FPhakitnu2JVnQCcALBixYoaV4GSNJOJ9DSr6vr2+ybgVOAg4MYkewG03zdNojZJ2pp5D80kD0qy68Zp4NnAZcBpwMq22Ergc/NdmyRtyyQOz/cETk2ycfufrqovJ7kAWJXkVcAPgJdMoDZJ2qp5D82q+j7wb2Zo/xHwrPmuR5J6TNNHjiRp6hmaktTB0JSkDoamJHUwNCWpg6EpSR0MTUnqYGhKUgdDU5I6GJqS1MHQlKQOhqYkdTA0JamDoSlJHQxNSepgaEpSB0NTkjoYmpLUwdCUpA6GpiR1MDQlqYOhKUkdDE1J6mBoSlIHQ1OSOhiaktTB0JSkDoamJHXYadIFSAvJsqO/OOkS7rbm+EMmXcIOyZ6mJHUwNCWpg6EpSR0MTUnqYGhKUgdDU5I6GJqS1MHQlKQOhqYkdTA0JamDoSlJHQxNSepgaEpSB0NTkjoYmpLUwdCUpA6GpiR1MDQlqYOhKUkd/I4gSffYjvTdSfY0JamDoSlJHQxNSepgaEpSB0NTkjoYmpLUwdCUpA6GpiR1mLrQTPLcJFcluTrJ0ZOuR5JGTVVoJlkEfBB4HrA/cGSS/SdblSRtMlWhCRwEXF1V36+qXwCnAIdOuCZJutu0heZS4LqRx2tbmyRNhWkbsCMztNWvLJAcBRzVHv4kyVVjr2p2FgM339Mnyf+ag0qmyz3eL/fCfQLul5
lM09/QvluaMW2huRbYZ+Tx3sD1owtU1QnACfNZ1GwkWV1VKyZdx7Rxv8zM/fLrFso+mbbD8wuA5Un2S3I/4AjgtAnXJEl3m6qeZlWtT/I64CvAIuDEqrp8wmVJ0t2mKjQBqupLwJcmXcd2mLpTBlPC/TIz98uvWxD7JFW17aUkScD0ndOUpKlmaKpbkmOTvHXSdWg6JXlDkiuS/DDJX23ncyxJcl6S7yR5+lzXeE9M3TnNHVmSRVW1YdJ1SPfQaxhuhX4GsL0fIXoWcGVVrZztCvP192NPE0iyLMmVSU5KckmSzyZ5YJI1SRa3ZVYkObNNH5vkE0m+keR7SV7d2g9O8s0kpyb5bpKPJLlPm/fsJOckuSjJ3ybZpbWvSfKOJGcDL5nMHti2JG9vA6l8Hfit1nZAknPbPjs1ye6t/cDWdk6SP09yWWt/YJJVbd5nWk9iRZt3ZJJLk1yWLIyPbc/T+2ZB7ZckHwEewfBRwd1H2vdNcnrbT6cnefiW2pMcAPwZ8PwkFyd5wFT9/VTVDv8DLGO48+ip7fGJwFuBNcDi1rYCOLNNHwv8I/AAhrsYrgMeBhwM/JzhTbMI+Brw4rbMN4EHtfXfBryjTa8B/tuk98E29s8TgEuBBwIPBq5u++cS4BltmT8B3temLwOe0qaPBy5r028F/rpNPw5Y3/brw4AfAEsYjn6+ARw26dc9Be+bhbpf1rTX90rgr1rb54GVbfo/A/9nG+2j607V3489zU2uq6pvtelPAk/bxvKfq6o7q+pm4AyGwUYAzq9hwJENwMnteZ7EMGrTt5JcDKzkV2/T+swcvYZxeTpwalX9rKpuZ+hFPAjYrarOasucBPxOkt2AXavq26390yPP8zSGQVioqssYQhfgQIZgWVdV64FPAb8zzhc0h8b5vlnI+2VzT2bTe+ETbNpPW2ofNVV/P57T3GTzz14VQ09o438sO89i+S21B/haVR25hW3/tKPOSZntZ9NmGj9gW/O2ts60G/f75t5qS++nmdqn6u/HnuYmD0/y5DZ9JHA2Q9f/Ca3tdzdb/tAkOyf5DYbDqwta+0HtNtD7AC9tz3Mu8NQkj4S7z+09amyvZO59E3hRO7e0K/BChjfqrSNXNl8OnFVVtwJ3JHlSaz9i5HnOBg4HyDBO6m+39vOAZyRZnGFM1SOBs1gYxvm+Wcj7ZXPfZtN74WUMr29r7aOm6u/H0NzkCmBlkkuAPYAPA+8C3p/kH4DNr8qdD3yR4R/0T6tq48Ai59DO4wHXMBzWrmM4R3Nye/5zgUeP9+XMnaq6iOEQ6GLg74B/aLNWAn/eXtMBDOc1AV4FnJDkHIZewm2t/UPAkrb82xgOz2+rqhuAYxgOV/8RuKiqPjfmlzVXxvm+Wcj7ZXNvAH6v7aeXA2/cRvvdpu3vxzuCGK6CAl+oqsfNcvljgZ9U1bs3az8YeGtVvWCOS1xQkuxSVT9p00cDe1XVG1tv6b5V9fMkvwmcDjyqhgGnFxzfNzsmz2lqHA5JcgzD++tahl4CDFffz0hyX4Ye6B8u1MDUjsuepiR18JymJHUwNCWpg6EpSR0MTS0YGe5/v7zdp3xxkicmeVOSB066Nu04vBCkBaF9gPy9wMFVdVcbEON+DB+OXtFuS5TGzp6mFoq9gJur6i6AFpIbB7U4I8kZAEk+nGR165G+a+PKSZ7fRiQ6O8lfJvlCa39QkhOTXJBh7MZDW/tjk5zferSXJFk+3y9Y08mephaENhTY2Qyf9fw68JmqOivJGkZ6mkn2qKpb2gfpT2e44+SfgO8Bv1NV1yQ5mWFQkRck+Z/Ad6vqk22wkfOBxzPcnXNuVX0qwzejLqqqO+f1RWsq2dPUgtDuMHoCcBSwDvhMklfOsOjhSS4CvgM8lmF0nEcD36+qa9oyJ48s/2zg6DZ6zpkMA2w8nOG2xj9O8jZgXwNTG3lHkBaMNmzamcCZSS5luPf9bkn2YxjP8sCqujXJxxlCcFsjL/1uVV21WfsVSc
4DDgG+kuS/VNU35uaVaCGzp6kFIclvbXZe8QCGWzTvAHZtbQ9mGH3ptiR7MnzlAsCVwCPaveIwjCK00VeA1ydJ287j2+9HMPRO/5Jh/NB/PdevSQuTPU0tFLsAH2jnHdczjB5/FMNwaf83yQ1V9cwk3wEuB74PfAugqu5M8hrgy0luZjhvudGfAu8DLmnBuQZ4AUOw/qckvwT+mU0jOGkH54Ug7RA2jrzUgvGDwPeq6i8mXZcWHg/PtaN4dbvYcznwEOCvJ1uOFip7mpLUwZ6mJHUwNCWpg6EpSR0MTUnqYGhKUgdDU5I6/H/VH+UZi8GrRQAAAABJRU5ErkJggg==\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "#show the distribution of the dog stages on a bar chart\n", "stages=df_new['dog_stages'].value_counts()\n", "stages.plot(kind='bar', figsize = (5,6))\n", "plt.xticks(rotation=0)\n", "plt.title(\"Dog Stages\")\n", "plt.xlabel(\"Stages\")\n", "plt.ylabel(\"Count\");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Most dogs were classified as pupper, a smaller and usually younger dog. However, the dog stage data is incomplete, so we cannot ascertain the accuracy of this result." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Q2. What image number occurred most and how accurate was its prediction as a dog?" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "#count each image number, grouped by whether the top prediction is a dog, and plot the counts\n", "pred=df_new.groupby('p1_asdog')['img_num'].value_counts().unstack()\n", "pred.index=['False', 'True']\n", "pred.plot(kind='bar', figsize= (5,6))\n", "plt.legend(['1', '2', '3', '4'])\n", "plt.xticks(rotation=0)\n", "plt.title(\"Image Number Count with Prediction as Dog\")\n", "plt.xlabel(\"Prediction as Dog\")\n", "plt.ylabel(\"Image Number Count\");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Image number 1 occurred most often among the neural network's predictions, and more than 50% of the predictions were dogs." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Q3. What correlation exists between the favourite count and the retweet count?" ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "#scatter plot of favourite count against retweet count to inspect their correlation\n", "df_new.plot(x='favourite_count', y='retweet_count', kind='scatter')\n", "plt.title('Favorite and Retweet Count');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- The scatter plot above shows a positive correlation between the retweet count and the favourite count." ] } ], "metadata": { "extensions": { "jupyter_dashboards": { "activeView": "report_default", "version": 1, "views": { "grid_default": { "cellMargin": 10, "defaultCellHeight": 20, "maxColumns": 12, "name": "grid", "type": "grid" }, "report_default": { "name": "report", "type": "report" } } } }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.0" } }, "nbformat": 4, "nbformat_minor": 2 }