{ "cells": [ { "cell_type": "markdown", "id": "4da20fdc-6b74-4b76-98e4-a639d02fb476", "metadata": {}, "source": [ "# AN ANALYSIS TO IDENTIFY THE PEAK HOURS: ASK HN AND SHOW HN POSTS" ] }, { "cell_type": "markdown", "id": "51df4ce9-1c16-4295-875b-9f832bca7d19", "metadata": {}, "source": [ "# Project Description\n", " Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as \"posts\") are voted and commented upon, similar to reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result. We are mostly concerned with posts of `Ask HN` where the user is asking the platform a question and `Show HN` where the user is showcasing something they made or something that they find interesting on the platform.\n", " \n", " \n", "# Project Objectives\n", " This project aims to:\n", " - Identify which posts between `Ask HN` and `Show HN` are more likely to be posted\n", " - Analyze which time of the day that a user posting will receive a comment\n", " - Pinpoint the peak hours where most users are posting and commenting.\n", "\n", "The following are an example posts title for `Ask HN` and `Show HN`:\n", "\n", "Ask HN: How to improve my personal website?
\n", "Ask HN: Am I the only one outraged by Twitter shutting down share counts?
\n", "Ask HN: Aby recent changes to CSS that broke mobile?\n", "\n", "Show HN: Wio Link ESP8266 Based Web of Things Hardware Development Platform'
\n", "Show HN: Something pointless I made
\n", "Show HN: Shanhu.io, a programming playground powered by e8vm\n", "\n", "Sources: \n", "
[Hacker News](https://news.ycombinator.com/)
\n", "[Kaggle](https://www.kaggle.com/hacker-news/hacker-news-posts)
\n", "[DataQuest](https://www.dataquest.io/)" ] }, { "cell_type": "markdown", "id": "951eb4c1-944a-4d54-a28e-eb7210fec7e0", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "id": "c71f38ab-d418-4b50-b4ac-f83412b78f13", "metadata": {}, "source": [ "# Introduction\n", "\n", "Before we begin our analysis, we are going to create a function that will open, read and save the csv file in a variable named `hn`. We will also create a function that we can use for initial exploration of the dataset. And lastly we are going to import the `modules` that we are going to use on this project." ] }, { "cell_type": "code", "execution_count": 1, "id": "27804b96-259e-434d-b7b7-55441b2320a0", "metadata": {}, "outputs": [], "source": [ "def dataset_csv(dataset, header = True):\n", " '''\n", " Open and reads the csv file\n", " \n", " Parameter\n", " ---\n", " dataset: str\n", " A string type in the format of 'example.csv'\n", " header: bool, default = True\n", " While true, the result will include a header, otherwise it will return the rows only \n", " Return\n", " ---\n", " A list of lists\n", " '''\n", " \n", " from csv import reader\n", " \n", " file = open(dataset)\n", " read = reader(file)\n", " data = list(read)\n", " \n", " if header:\n", " return data\n", " if not header:\n", " return data[1:]\n", "\n", "def explore_data(dataset, start, end, rows_and_columns = True):\n", " '''\n", " Prints out an initial exploration of the data\n", " \n", " Parameter\n", " ---\n", " dataset: list\n", " A variable of the dataset\n", " start: int\n", " The starting index to be shown\n", " end: int\n", " The ending index to be shown\n", " rows_and_columns: bool, default = True\n", " While true, it will show the total number of rows and columns of the dataset\n", " Return\n", " ---\n", " A print statement of the above parameters\n", " '''\n", " print('Sample {0} no. 
rows'.format(len(dataset[start:end])))\n", " print('')\n", " \n", " for row in dataset[start:end]:\n", " print(row)\n", " print('')\n", " if rows_and_columns:\n", " print('Total number of columns: {0}\\nTotal number of rows: {1}'.format(len(dataset[0]),len(dataset[1:])))\n", " " ] }, { "cell_type": "code", "execution_count": 2, "id": "5d2e5a29-4bb7-4f0f-8973-6c7a2b2d5776", "metadata": {}, "outputs": [], "source": [ "import datetime as dt # Since we are going to deal with the hours later on in our analysis, we are going to need the datetime module\n", "from matplotlib import pyplot as plt # We are going to represent our data in a graph later on in our analysis to derive a meaning from the data" ] }, { "cell_type": "markdown", "id": "13b0333b-f890-4ef1-be5e-92f1aec67476", "metadata": {}, "source": [ "
The function `dataset_csv()` will be used to open the Hacker News dataset. We keep the *header* argument as `True` since we want to see what our headers are, and then store the header and the rows in separate variables." ] }, { "cell_type": "code", "execution_count": 3, "id": "8638aa87-7801-4a3a-840b-2e8f2a9cfa97", "metadata": {}, "outputs": [], "source": [ "hn = dataset_csv(r'C:\\Users\\Mico\\OneDrive\\Desktop\\DATASETS\\KAGGLE\\HACKER NEWS POST\\DATAQUEST\\hacker_news.csv')\n", "hn_header = hn[0] # this will extract the header from the hn dataset\n", "hn_rows = hn[1:] # this will exclude the header from the hn dataset" ] }, { "cell_type": "markdown", "id": "2f852f8f-5027-4c17-8b31-1e258875deb3", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "id": "204c6bb5-666d-4bd1-83a3-9b08a1517ca4", "metadata": {}, "source": [ "# Initial Data Exploration\n", "To begin our data analysis, we will first explore the data to get an insight into the contents of our dataset.\n", "In this section we are going to do the following:\n", "- Use `explore_data()` to display an initial overview of the dataset. \n", "    - Setting **start** to **0** to show the headers.\n", "    - Setting **end** to **5** to return the header plus the first four rows.\n", "    - Keeping **rows_and_columns** at its default of **True** to see how many columns and rows are in the dataset.\n", "- Describe the individual attributes." ] }, { "cell_type": "markdown", "id": "6a47ab06-5280-41d5-b145-1789e8be4538", "metadata": {}, "source": [ "### Data Exploration" ] }, { "cell_type": "code", "execution_count": 4, "id": "9f819849-1edf-4faf-924f-aea8d55fd2fd", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Sample 5 no. 
rows\n", "\n", "['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']\n", "\n", "['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']\n", "\n", "['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']\n", "\n", "['11964716', \"Florida DJs May Face Felony for April Fools' Water Joke\", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']\n", "\n", "['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']\n", "\n", "Total number of columns: 7\n", "Total number of rows: 20100\n" ] } ], "source": [ "explore_data(hn,0,5)" ] }, { "cell_type": "markdown", "id": "abc1e9b8-d0a9-44b3-bf32-02be2b9ec31e", "metadata": {}, "source": [ "
The data above shows the **7 attributes** of each row, which are described in the table below." ] }, { "cell_type": "markdown", "id": "4868e8bf-5bde-4ed2-b5ae-98486cdc8ab8", "metadata": {}, "source": [ "### Attributes Description" ] }, { "cell_type": "markdown", "id": "30c5bf4f-334a-4289-a02b-657eaecb24bd", "metadata": {}, "source": [ "|Index|Columns|Description|Ideal Data Type|\n", "|:---:|:---:|:---:|:---:|\n", "|0|id|The unique identifier from the hacker news post|str|\n", "|1|title|Title of the post|str|\n", "|2|url|URL of the item being linked to|str|\n", "|3|num_points|The number of points the post received (upvotes minus downvotes)|int|\n", "|4|num_comments|The number of comments the post received|int|\n", "|5|author|The name of the account that made the post|str|\n", "|6|created_at|The date and time the post was made (Timezone: Eastern Time in the US)|datetime|\n", "\n", "**Original dataset source:** [Kaggle - Hacker News Posts](https://www.kaggle.com/hacker-news/hacker-news-posts)
\n", "**Date created:** 2016-09-27
\n", "**Date updated:** 2016-09-27
\n", "**Source used in this analysis:** DataQuest - Guided Project: Exploring Hacker News Posts
\n", "**Note:** The dataset from kaggle contains more than 300,000 rows while the dataset from this analysis contains 20,101 rows. The dataset was reduced by removing submissions that did not received any comments, and then randomly sampling from the remaining submissions.
" ] }, { "cell_type": "markdown", "id": "b5f16de6-2cfb-4a6b-88cc-4516ca9d9ad0", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "id": "c7d7ced0-2b42-40e7-a12a-5647e5355183", "metadata": {}, "source": [ "# Data Cleaning\n", "\n", "As we've mentioned we're only concern with the posts that have a beginning title of `Ask HN` or `Show HN`. In this section we are going to do the following:\n", "- Since `Ask HM` amd `Show HN` are in the **title** column. We will verify if there's any missing value in this column.\n", "- Create a list of lists that contains `Ask HN` or `Show HN` in the beginning of the title." ] }, { "cell_type": "markdown", "id": "a60cc474-d8ff-48c4-a234-50033b82197a", "metadata": {}, "source": [ "### Dataset Verification" ] }, { "cell_type": "markdown", "id": "31d6dca2-169d-4a22-ada6-d4f8de9f314e", "metadata": {}, "source": [ "In order to determine the usesability of our dataset, we have to check whether there's a missing value and identify how it will affect our analysis." 
] }, { "cell_type": "code", "execution_count": 5, "id": "d77c7800-0f43-4d52-a17b-51e86f729058", "metadata": {}, "outputs": [], "source": [ "def missing_data(dataset,index):\n", " '''\n", " Count the rows with a missing value ('')\n", " \n", " Parameter\n", " ---\n", " dataset: list\n", " A variable of the dataset\n", " index: int\n", " The index value of the column to be iterated\n", " Return\n", " ---\n", " Statement of the number of rows with missing value\n", " '''\n", " \n", " missing_column = list()\n", " \n", " for row in dataset:\n", " if len(row) != len(hn_header): # This is to check if there's a row that contains missing data\n", " missing_column.append(row)\n", " elif row[index] == '': # This is to check if there's a row with an empty `created_at`\n", " missing_column.append(row)\n", " return 'The number of rows with missing data is {0}.'.format(len(missing_column))" ] }, { "cell_type": "code", "execution_count": 6, "id": "f83d4995-0afd-4f53-875d-0001f7f76c72", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Column - id , Index - 0\n", "The number of rows with missing data is 0.\n", "\n", "Column - title , Index - 1\n", "The number of rows with missing data is 0.\n", "\n", "Column - url , Index - 2\n", "The number of rows with missing data is 2440.\n", "\n", "Column - num_points , Index - 3\n", "The number of rows with missing data is 0.\n", "\n", "Column - num_comments , Index - 4\n", "The number of rows with missing data is 0.\n", "\n", "Column - author , Index - 5\n", "The number of rows with missing data is 0.\n", "\n", "Column - created_at , Index - 6\n", "The number of rows with missing data is 0.\n", "\n" ] } ], "source": [ "for columns in range(7):\n", " print('Column -',hn_header[columns],', Index - ',columns)\n", " print(missing_data(hn_rows,columns))\n", " print('')" ] }, { "cell_type": "markdown", "id": "137a9b32-a110-47bb-9fb4-243fd78cf47b", "metadata": {}, "source": [ "As we've investigated the only 
column that contains a missing value *('')* is `url`, which does not concern us since it only holds the link to the item." ] }, { "cell_type": "markdown", "id": "8f330332-fcfb-4352-87b6-9b921e40d021", "metadata": {}, "source": [ "### Datatype Conversion" ] }, { "cell_type": "markdown", "id": "39f38dff-e6a0-4a88-9d3b-7a154c52c438", "metadata": {}, "source": [ "Data usability is an integral part of our analysis. We have to make sure that the data types we are using comply with our data requirements. As shown in the **[Attributes Description](#table_cell)** table above, we have an ideal data type for each of our columns. Below, we are going to check the data type of each column and convert the column whenever its data type doesn't meet our **[data requirements](#table_cell)**." ] }, { "cell_type": "code", "execution_count": 7, "id": "db559dec-134c-441e-ab05-846a9a6c25f7", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Column - id , Index - 0\n", "<class 'str'>\n", "\n", "Column - title , Index - 1\n", "<class 'str'>\n", "\n", "Column - url , Index - 2\n", "<class 'str'>\n", "\n", "Column - num_points , Index - 3\n", "<class 'str'>\n", "\n", "Column - num_comments , Index - 4\n", "<class 'str'>\n", "\n", "Column - author , Index - 5\n", "<class 'str'>\n", "\n", "Column - created_at , Index - 6\n", "<class 'str'>\n", "\n" ] } ], "source": [ "for element in range(7):\n", "    print('Column -',hn_header[element],', Index - ',element)\n", "    print(type(hn_rows[1][element]))\n", "    print('')" ] }, { "cell_type": "markdown", "id": "1038d625-e20f-42f4-acc1-4ad94dc807e6", "metadata": {}, "source": [ "As we can see, the following columns do not meet our **[data requirements](#table_cell)**.\n", "- **num_points** , Index - 3\n", "    - Since this column represents the total number of upvotes after subtracting the number of downvotes, it should have a data type of `integer`.\n", "- **num_comments** , Index - 4\n", "    - Since this column represents the total number of comments, it should have a data type of `integer`.\n", "- **created_at** , Index - 6\n", "    - Since this column represents the date and time the post was made, it should have a data type of `datetime`.\n", " \n", "In order to comply, we have to convert these three columns to their corresponding data types." ] }, { "cell_type": "code", "execution_count": 8, "id": "7911a73b-ffff-4301-bd74-f688034e4c02", "metadata": {}, "outputs": [], "source": [ "def convert_data(dataset,index,type_of_data):\n", "    '''\n", "    Convert one column of the dataset in place\n", "\n", "    type_of_data is either the type `int` or the string 'datetime'\n", "    '''\n", "    import datetime as dt\n", "\n", "    for rows in dataset:\n", "        value = rows[index]\n", "\n", "        if type_of_data == 'datetime':\n", "            convert_to = dt.datetime.strptime(value,'%m/%d/%Y %H:%M')\n", "        elif type_of_data == int:\n", "            convert_to = int(value)\n", "        else:\n", "            # Fail loudly instead of silently reusing a previous value\n", "            raise ValueError('Unsupported type_of_data: {0}'.format(type_of_data))\n", "\n", "        rows[index] = convert_to" ] }, { "cell_type": "code", "execution_count": 9, "id": "ee003dbf-c2f7-4fff-b847-9a05a4d8b30c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Column - id , Index - 0\n", "<class 'str'>\n", "\n", "Column - title , Index - 1\n", "<class 'str'>\n", "\n", "Column - url , Index - 2\n", "<class 'str'>\n", "\n", "Column - num_points , Index - 3\n", "<class 'int'>\n", "\n", "Column - num_comments , Index - 4\n", "<class 'int'>\n", "\n", "Column - author , Index - 5\n", "<class 'str'>\n", "\n", "Column - created_at , Index - 6\n", "<class 'datetime.datetime'>\n", "\n" ] } ], "source": [ "# Convert `num_points` and `num_comments` to int, and `created_at` to datetime\n", "convert_data(hn_rows,3,int)\n", "convert_data(hn_rows,4,int)\n", "convert_data(hn_rows,6,'datetime')\n", "\n", "# To verify the changes, we print the data type of each column for one of the rows in our dataset.\n", "\n", "for element in range(7):\n", "    print('Column -',hn_header[element],', Index - ',element)\n", "    print(type(hn_rows[1][element]))\n", "    print('')" ] }, { "cell_type": "markdown", "id": "1e3b47df-fe95-4de9-bbf1-948b53e6a3cf", "metadata": {}, "source": [ "Now that we comply with our **[data requirements](#table_cell)**, we can proceed to segregate our dataset. 
Remember that we are interested in `Ask HN` and `Show HN` posts, to analyze whether the date and time of posting affects the average number of comments." ] }, { "cell_type": "markdown", "id": "501d0234-e75a-4890-a7e6-d9103f6f702a", "metadata": {}, "source": [ "### Dataset Segregation" ] }, { "cell_type": "markdown", "id": "c5938aec-44b2-4d22-8c76-7ff029cb93db", "metadata": {}, "source": [ "In order to derive meaning from our dataset, we're going to create separate lists of the following:\n", "- Posts that start with Ask HN\n", "- Posts that start with Show HN\n", "- Other posts\n", "\n", "Even though our concern is the posts that start with `Ask HN` and `Show HN`, we will still create a separate list for `Other posts` for comparison purposes further down in our analysis." ] }, { "cell_type": "code", "execution_count": 10, "id": "573ae3e9-f0c6-4e11-96c9-d63281b2694c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Posts that starts with Ask HN:\n", "Sample 3 no. rows\n", "\n", "['12296411', 'Ask HN: How to improve my personal website?', '', 2, 6, 'ahmedbaracat', datetime.datetime(2016, 8, 16, 9, 55)]\n", "\n", "['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', 28, 29, 'tkfx', datetime.datetime(2015, 11, 22, 13, 43)]\n", "\n", "['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', 1, 1, 'polskibus', datetime.datetime(2016, 5, 2, 10, 14)]\n", "\n", "Total number of columns: 7\n", "Total number of rows: 1743\n", "\n", "\n", "Posts that starts with Show HN:\n", "Sample 3 no. 
rows\n", "\n", "['10627194', 'Show HN: Wio Link ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', 26, 22, 'kfihihc', datetime.datetime(2015, 11, 25, 14, 3)]\n", "\n", "['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', 747, 102, 'dhotson', datetime.datetime(2015, 11, 29, 22, 46)]\n", "\n", "['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', 1, 1, 'h8liu', datetime.datetime(2016, 4, 28, 18, 5)]\n", "\n", "Total number of columns: 7\n", "Total number of rows: 1161\n", "\n", "\n", "Other posts:\n", "Sample 3 no. rows\n", "\n", "['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', 386, 52, 'ne0phyte', datetime.datetime(2016, 8, 4, 11, 52)]\n", "\n", "['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', 39, 10, 'josep2', datetime.datetime(2016, 1, 26, 19, 30)]\n", "\n", "['11964716', \"Florida DJs May Face Felony for April Fools' Water Joke\", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', 2, 1, 'vezycash', datetime.datetime(2016, 6, 23, 22, 20)]\n", "\n", "Total number of columns: 7\n", "Total number of rows: 17193\n" ] } ], "source": [ "ask_post = list()\n", "show_post = list()\n", "other_post = list()\n", "\n", "for row in hn_rows:\n", " title = row[1].lower() # Since there might be posts that contains `ask HN` and `show HN` rather than `Ask HN` and `Show HN`, we will convert all our title into lower cases and iterate by using `ask hn` and `show hn`\n", " if title.startswith('ask hn'):\n", " ask_post.append(row)\n", " elif title.startswith('show hn'):\n", " show_post.append(row)\n", " else:\n", " other_post.append(row)\n", " \n", "# We'll print out the first 3 rows of each list\n", "\n", "print('Posts that starts with Ask HN:')\n", "explore_data(ask_post,0,3)\n", 
"print('\\n')\n", "print('Posts that starts with Show HN:')\n", "explore_data(show_post,0,3)\n", "print('\\n')\n", "print('Other posts:')\n", "explore_data(other_post,0,3)" ] }, { "cell_type": "markdown", "id": "2816eb31-668a-45d1-ba65-517738952266", "metadata": {}, "source": [ "
After segregating the dataset, we can see that *approximately* **14%** of the dataset consists of posts that start with either `Ask HN` or `Show HN`. Note that `explore_data()` leaves the first row out of its row count, so each list above holds one more post than reported (1,744 `Ask HN`, 1,162 `Show HN` and 17,194 other posts, which add up to the 20,100 rows of the dataset). Now that we have separate datasets, we can proceed with the data analysis.\n" ] }, { "cell_type": "markdown", "id": "b5093206-1a79-4084-8275-ea1770d30bea", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "id": "e4d0bff8-e97a-4f1a-a187-c29fe11612c0", "metadata": {}, "source": [ "# Data Analysis" ] }, { "cell_type": "markdown", "id": "fd139185-d105-4b82-b21f-53fa8269a522", "metadata": {}, "source": [ "Now that we have a clean set of data, we are going to analyze it to gain the information that drives our conclusion and recommendations. In this section we are going to do the following:\n", "- Get the average number of comments for each of the segregated lists that we created above, and decide which one to focus on.\n", "- Create a frequency table for the number of comments made and posts created by the hour.\n", "- Create a line graph of the following:\n", "    - Posts created by hour\n", "    - Comments made by hour\n", "    - Average comments made per post created\n", "- Analyze the hours that have an above-average number of comments." 
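, "\n",
"\n",
"The frequency-table and average steps listed above can be sketched in miniature as follows. This is only an illustration under assumptions: the `sample` pairs are hypothetical values rather than rows from the dataset, and `posts_per_hour` and `comments_per_hour` are illustrative names that are not used elsewhere in this notebook.\n",
"\n",
"```python\n",
"from collections import defaultdict\n",
"from datetime import datetime\n",
"\n",
"# Hypothetical (created_at, num_comments) pairs, not real rows from the dataset\n",
"sample = [\n",
"    (datetime(2016, 8, 16, 9, 55), 6),\n",
"    (datetime(2016, 5, 2, 9, 14), 2),\n",
"    (datetime(2015, 11, 22, 13, 43), 29),\n",
"]\n",
"\n",
"posts_per_hour = defaultdict(int)\n",
"comments_per_hour = defaultdict(int)\n",
"\n",
"# Group by the hour in which the post was created\n",
"for created_at, num_comments in sample:\n",
"    posts_per_hour[created_at.hour] += 1\n",
"    comments_per_hour[created_at.hour] += num_comments\n",
"\n",
"# Average comments per post for each hour, in chronological order\n",
"for hour in sorted(posts_per_hour):\n",
"    average = comments_per_hour[hour] / posts_per_hour[hour]\n",
"    print(hour, posts_per_hour[hour], comments_per_hour[hour], round(average, 2))\n",
"```\n",
"\n",
"Grouping by `created_at.hour` is what turns individual posts into per-hour totals; dividing the two totals gives the average number of comments per post for each hour."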
] }, { "cell_type": "code", "execution_count": 11, "id": "cc47c93b-6b03-400e-8cfa-3d10f9057951", "metadata": {}, "outputs": [], "source": [ "def total_column(dataset, index):\n", " '''\n", " Summation of the emtire column given that the data is int\n", " \n", " Parameter\n", " ---\n", " dataset: list\n", " A variable of the dataset\n", " index: int\n", " The index number of the column to be iterated\n", " Return\n", " ---\n", " Single integer value\n", " '''\n", " \n", " total = 0\n", " \n", " for row in dataset:\n", " total_column = row[index]\n", " total += total_column\n", " \n", " average = round((total / (len(dataset))),2)\n", " \n", " return total, average" ] }, { "cell_type": "code", "execution_count": 12, "id": "a213b34a-ab88-4e7b-941f-f7c382c1736e", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ask_post dataset:\n", "The total number of comments is 24483 wtih the average 14.04 number of comments\n", "\n", "\n", "show_post dataset:\n", "The total number of comments is 11988 wtih the average 10.32 number of comments\n", "\n", "\n", "other_post dataset:\n", "The total number of comments is 462055 wtih the average 26.87 number of comments\n" ] } ], "source": [ "ask_posts_comments, show_posts_comments, other_posts_comments = total_column(ask_post,4), total_column(show_post,4),total_column(other_post,4)\n", "print('ask_post dataset:')\n", "print('The total number of comments is {0} wtih the average {1} number of comments'.format(ask_posts_comments[0],ask_posts_comments[1]))\n", "print('\\n')\n", "print('show_post dataset:')\n", "print('The total number of comments is {0} wtih the average {1} number of comments'.format(show_posts_comments[0],show_posts_comments[1]))\n", "print('\\n')\n", "print('other_post dataset:')\n", "print('The total number of comments is {0} wtih the average {1} number of comments'.format(other_posts_comments[0],other_posts_comments[1]))" ] }, { "cell_type": "markdown", "id": 
"e19afead-2a20-4bed-bebd-e2503e07d40b", "metadata": {}, "source": [ "
We can observe that, on average, `Ask HN` posts receive more comments than `Show HN` posts, while other posts receive considerably more comments than either.\n", "\n", "Since `Ask HN` posts are more likely to receive comments than `Show HN` posts, with an average of **14.04** comments, we'll focus our analysis just on these posts. A further analysis of `Show HN` and other posts is recommended outside this project to give more concrete support to our conclusion and recommendations. \n", "\n", "We will start our analysis by looking at the first five rows of the `Ask HN` dataset that we segregated. At the same time, we are going to extract the number of comments and the date and time created in order to show the frequency of the following:\n", "- Posts created by the hour of the day\n", "- Comments made by the hour of the day" ] }, { "cell_type": "code", "execution_count": 13, "id": "2c3cd2fb-ad52-4a7c-b638-43f510c9692e", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Ask HN dataset:\n", "\n", "Sample 5 no. rows\n", "\n", "['12296411', 'Ask HN: How to improve my personal website?', '', 2, 6, 'ahmedbaracat', datetime.datetime(2016, 8, 16, 9, 55)]\n", "\n", "['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', 28, 29, 'tkfx', datetime.datetime(2015, 11, 22, 13, 43)]\n", "\n", "['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', 1, 1, 'polskibus', datetime.datetime(2016, 5, 2, 10, 14)]\n", "\n", "['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', 1, 3, 'sph130', datetime.datetime(2016, 8, 2, 14, 20)]\n", "\n", "['10394168', 'Ask HN: Someone offered to buy my browser extension from me. 
What now?', '', 28, 17, 'roykolak', datetime.datetime(2015, 10, 15, 16, 38)]\n", "\n", "Total number of columns: 7\n", "Total number of rows: 1743\n", "\n", "\n", "Extracted date and time created and number of comments by the hour\n", "\n", "Sample 3 no. rows\n", "\n", "(datetime.datetime(2016, 8, 16, 9, 55), 6)\n", "\n", "(datetime.datetime(2015, 11, 22, 13, 43), 29)\n", "\n", "(datetime.datetime(2016, 5, 2, 10, 14), 1)\n", "\n" ] } ], "source": [ "print('Ask HN dataset:')\n", "print('')\n", "explore_data(ask_post,0,5)\n", "\n", "result_list = list()\n", "\n", "for row in ask_post:\n", "    result = row[6], row[4] # (created_at, num_comments)\n", "    result_list.append(result)\n", "print('\\n')\n", "print('Extracted date and time created and number of comments by the hour')\n", "print('')\n", "explore_data(result_list,0,3,rows_and_columns = False)" ] }, { "cell_type": "markdown", "id": "eecde0e5-b4b2-46c7-9547-06dbee14244f", "metadata": {}, "source": [ "Now that we have a list of tuples containing the attributes `created_at` and `num_comments`, we are going to create a frequency table for each of these attributes. Each frequency table uses the hour of the day (in 24-hour format) as the key; the values are the number of posts created in that hour and the number of comments those posts received.\n", "\n", "So we will have two frequency tables:\n", "1. Hour of the day by the number of posts created\n", "2. 
Hour of the day by the number of comments made" ] }, { "cell_type": "code", "execution_count": 14, "id": "ec00cf62-4f0e-41d2-9658-f152b78337b2", "metadata": { "tags": [] }, "outputs": [], "source": [ "counts_by_hour = dict()\n", "comments_by_hour = dict()\n", "\n", "for row in result_list:\n", "    hour = row[0].hour # hour of the day (0-23) in which the post was created\n", "    counts_by_hour[hour] = counts_by_hour.get(hour, 0) + 1\n", "    comments_by_hour[hour] = comments_by_hour.get(hour, 0) + row[1]" ] }, { "cell_type": "code", "execution_count": 15, "id": "436f6756-6c6f-46ed-92b9-7f7d2e1ee67e", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of posts created by the hour of the day:\n", "Hour:Number of posts\n", "{9: 45, 13: 85, 10: 59, 14: 107, 16: 108, 23: 68, 12: 73, 17: 100, 15: 116, 21: 109, 20: 80, 2: 58, 18: 109, 3: 54, 5: 46, 19: 110, 1: 60, 22: 71, 8: 48, 4: 47, 0: 55, 6: 44, 7: 34, 11: 58}\n", "\n", "\n", "Number of comments made by the hour of the day:\n", "Hour:Number of comments\n", "{9: 251, 13: 1253, 10: 793, 14: 1416, 16: 1814, 23: 543, 12: 687, 17: 1146, 15: 4477, 21: 1745, 20: 1722, 2: 1381, 18: 1439, 3: 421, 5: 464, 19: 1188, 1: 683, 22: 479, 8: 492, 4: 337, 0: 447, 6: 397, 7: 267, 11: 641}\n" ] } ], "source": [ "print('Number of posts created by the hour of the day:')\n", "print('Hour:Number of posts')\n", "print(counts_by_hour)\n", "print('\\n')\n", "print('Number of comments made by the hour of the day:')\n", "print('Hour:Number of comments')\n", "print(comments_by_hour)" ] }, { "cell_type": "markdown", "id": "da023216-75b3-4667-9daa-a4b51994ed8a", "metadata": {}, "source": [ "Reading insights directly from these frequency tables is difficult. A much better way of representing these numbers is a `line graph`, since it is easier to derive meaning from a line graph than from the raw numbers. 
In the graphs below, we are going to show how the number of posts created and the number of comments made vary with the hour of the day." ] }, { "cell_type": "code", "execution_count": 16, "id": "bf7b3c18-7b4d-4068-872a-64a627c12d95", "metadata": {}, "outputs": [], "source": [ "# Since we are using 24-hour time, we create a list of the hours 0 to 23. The purpose of this list is to label our x-axis, which is the `hour` in our dataset.\n", "hours_24 = list(range(24))" ] }, { "cell_type": "markdown", "id": "646ffdce-8e7a-485d-8340-bec4a474b1b1", "metadata": {}, "source": [ "#### Posts created by hour" ] }, { "cell_type": "code", "execution_count": 17, "id": "d475accb-1841-4e60-8dcf-cf2886963cec", "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "x = sorted(counts_by_hour.items(), key=lambda x: x[0])\n", "plt.plot(*zip(*x))\n", "plt.xlabel(\"Hour\")\n", "plt.ylabel(\"Number of posts\")\n", "plt.title('Posts created by hour')\n", "plt.xticks(hours_24[::2])\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "166248ca-0614-4ffd-b8bd-0a21dc96eabe", "metadata": {}, "source": [ "As we can see there is a spike of posts being created by the afternoon up to the evening. We can see `three` peaks in our line graph, which all of them are in the afternoon and evening." ] }, { "cell_type": "markdown", "id": "c1c0a120-97e2-4392-bd15-ad29991a7d21", "metadata": {}, "source": [ "#### Comments made by hour" ] }, { "cell_type": "code", "execution_count": 18, "id": "eff6b49f-a22c-4e80-a8a6-838777f87840", "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "y=sorted(comments_by_hour.items(), key=lambda x: x[0])\n", "plt.plot(*zip(*y))\n", "plt.xlabel(\"Hour\")\n", "plt.ylabel(\"Number of comments\")\n", "plt.title('Comments by hour')\n", "plt.xticks(hours_24[::2])\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "90cd36f3-660b-40f8-b47d-3930d2fd6c4f", "metadata": {}, "source": [ "We can observe from the line graph above that there is a higher surge of comments being made by `15:00` or `3:00 PM`. And a lot less throughout the day.\n", "
\n", "\n", "Now that we have an insight on posts created and comments made by the hour. We are going to look at the average number of comments per posts by the hour. Following below, we are going to create a list of lists of the average comments per posts by the hour. Since we have already sorted out our dataset above, we can directly iterate on our two list two create the list of lists." ] }, { "cell_type": "code", "execution_count": 19, "id": "5b3cb2db-fdcc-4c46-82cb-f94597e25428", "metadata": {}, "outputs": [], "source": [ "average_per_hour = list()\n", "for i in range(24):\n", " posts = x[i][1] # Since we have a tupple of (hour,posts) we are going to extract the value by specifying the index [1]\n", " comments = y[i][1] # Since we have a tupple of (hour,comments) we are going to extract the value by specifying the index [1]\n", " average = round((comments/posts),2)\n", " average_by_hour = i, average\n", " average_per_hour.append(average_by_hour)" ] }, { "cell_type": "markdown", "id": "04c376cd-1ccc-4f0f-a97b-d521e7c8b9a5", "metadata": {}, "source": [ "#### Average comments made per posts created" ] }, { "cell_type": "code", "execution_count": 20, "id": "b74163ce-a493-46ad-bd6a-e7f103b5fa42", "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plt.plot(*zip(*average_per_hour))\n", "plt.xlabel(\"Hour\")\n", "plt.ylabel(\"Number of comments\")\n", "plt.title('Average comments per posts')\n", "plt.xticks(hours_24[::2])\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "cfb0e61b-6398-4f6d-8479-4ca09f6f60d4", "metadata": {}, "source": [ "As expected there is a spike of comments made per posts created on `13:00` or `3:00PM` but we can also observe that there is are peaks that happened on the morning and the evening. Remember that `Ask HN` data has an [**average 14.04 number of comments**](#average). So we can say that the times around `02:00` or `2:00 AM`, `13:00` or `3:00PM` and `20:00` or `8:00PM` are all above average." ] }, { "cell_type": "code", "execution_count": 21, "id": "4ad609e7-fac9-454b-ba7e-0bbfc5478634", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "15:00: Average of 38.59 number of comments\n", "02:00: Average of 23.81 number of comments\n", "20:00: Average of 21.52 number of comments\n", "16:00: Average of 16.8 number of comments\n", "21:00: Average of 16.01 number of comments\n", "13:00: Average of 14.74 number of comments\n" ] } ], "source": [ "sorted(average_per_hour, key=lambda x: x[1],reverse = True)\n", "\n", "above_average_hours = list()\n", "for hours in range(24):\n", " if average_per_hour[hours][1] > ask_posts_comments[1]:\n", " avg_per_hour = average_per_hour[hours][1]\n", " hours = average_per_hour[hours][0]\n", " total_avg = hours,avg_per_hour\n", " above_average_hours.append(total_avg)\n", " \n", "top_above_avg_hours = sorted(above_average_hours,key = lambda x:x[1], reverse = True)\n", "\n", "for hours in top_above_avg_hours:\n", " time = dt.time(hours[0])\n", " formatted_time = time.strftime('%H:%M')\n", " print('{0}: Average of {1} number of comments'.format(formatted_time,hours[1]))" ] }, { "cell_type": "markdown", "id": 
"0540784d-f09d-4826-a951-993ef8605c78", "metadata": {}, "source": [ "We can see that the hours that has an above average of comments made per posts created corresponds with the [results of our line graph above](#line_graph)." ] }, { "cell_type": "markdown", "id": "1df7a0d7-d427-40ed-967b-fc7a073bc6e0", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "id": "1bc8fea1-8f53-4c63-b0ff-7f5c0c523e46", "metadata": {}, "source": [ "# Conclusion and Recommendations" ] }, { "cell_type": "markdown", "id": "c5a0de73-0981-48c8-a975-f14fff08d9d8", "metadata": {}, "source": [ "From the information that we extracted from our results, we can the **peak times** below where a user who created a posts are more likely to receive a comment:\n", "1. 15:00 or 3:00 PM\n", "2. 02:00 or 2:00 AM\n", "3. 20:00 or 8:00 PM\n", "4. 16:00 or 4:00 PM\n", "5. 21:00 or 9:00 PM\n", "6. 13:00 or 1:00 PM\n", "\n", "**Note**: These timings are in **Eastern Time in the US**. \n", "\n", "The following times below are in `Qatar Timezone (GMT +3)`\n", "\n", "1. 22:00 or 10:00 PM\n", "2. 09:00 or 9:00 AM\n", "3. 02:00 or 2:00 AM\n", "4. 23:00 or 11:00 PM\n", "5. 04:00 or 4:00 AM\n", "6. 20:00 or 8:00 PM\n", "\n", "And the following times below are in `Philippines Timezone (GMT +8)`\n", "\n", "1. 03:00 or 3:00 AM\n", "2. 14:00 or 2:00 PM\n", "3. 07:00 or 7:00 AM\n", "4. 04:00 or 4:00 AM\n", "5. 09:00 or 9:00 AM\n", "6. 13:00 or 1:00 AM\n", "\n", "Even though these times **exceeds the average number of comments by posts created** we have to remember that we have only identified the times where it is more *more likely* to receive a comment on a topic relating to `Ask HN` posts. But *we cannot guarantee* that we will receive a comment for posting on these times.\n", "\n", "Though due to the **high number of active people** on these times, we recommend to posT the `Ask HN` related posts by these hours." 
] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.5" } }, "nbformat": 4, "nbformat_minor": 5 }