{ "cells": [ { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "# Finding Optimal Post Times -- Analyzing High Volume of Comments in Specific Time of the Day\n", "We are analyzing the data set coming from the submissions to popular technology site [Hacker News](https://news.ycombinator.com/). Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as \"posts\") are voted and commented upon, similar to reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.\n", "\n", "We're specifically interested in posts whose titles begin with either `Ask HN` or `Show HN`. Users submit Ask HN posts to ask the Hacker News community a specific question.\n", "Likewise, users submit Show HN posts to show the Hacker News community a project, product, or just generally something interesting. \n", "\n", "*This project will unveil wether if \"Ask HN\" or \"Show HN\" receives more comments on average and use that to determine wether a specific time determines if the post receives more comments on average.*\n", "\n", "## Summary of Result\n", "We have determined that `Ask HN` has the most post and we also determined that 3:00 pm in the afternoon is the peak time that users post comments in posts every day. \n", "\n", "This project will be beneficial to users wanting an answer to their specific needs. Now they will have an idea what time to post a question to HackerNews and gain a lot of answers to their question.\n", "\n", "*For more details, you can explore the steps below*\n", "\n", "OR\n", "\n", "[click here to view the conclusion below](#conclusion)" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "First let's make the data set readable from Jupyter Notebook." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from csv import reader;\n", "# Read data set\n", "opened_file = open(\"hacker_news.csv\");\n", "read_file = reader(opened_file);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the header to not interupt to our analysis, we will separate it from the actual data set that we are querying." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] \n", "\n", "['10482257', 'Title II kills investment? Comcast and other ISPs are now spending more', 'http://arstechnica.com/business/2015/10/comcast-and-other-isps-boost-network-investment-despite-net-neutrality/', '53', '22', 'Deinos', '10/31/2015 9:48'] \n", "\n", "['11370829', 'Crate raises $4M seed round for its next-gen SQL database', 'http://techcrunch.com/2016/03/15/crate-raises-4m-seed-round-for-its-next-gen-sql-database/', '3', '1', 'hitekker', '3/27/2016 18:08'] \n", "\n", "['12335860', 'How often to update third party libraries?', '', '7', '5', 'rabid_oxen', '8/22/2016 12:37'] \n", "\n", "['11079821', 'APOD: LIGO detects gravity waves...', 'http://apod.nasa.gov/apod/astropix.html', '1', '2', 'AliCollins', '2/11/2016 12:57'] \n", "\n" ] } ], "source": [ "# Turn data set into list of list\n", "hn = list(read_file);\n", "\n", "# Remove the header\n", "headers = hn[0]; \n", "hn = hn[1:]; # Assigns the data set without header\n", "\n", "print(headers, \"\\n\"); \n", "\n", "# Display 5 row from the data set\n", "for i in range(5, 25, 5): # Display rows starting from 5 with increments of 5 up until the 25th row\n", " print(hn[i], \"\\n\");" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## Dataset Explanation\n", "You can find the data set [here](https://www.kaggle.com/hacker-news/hacker-news-posts), but note that it has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions. Below are descriptions of the columns:\n", "- **id**: The unique identifier from Hacker News for the post\n", "- **title**: The title of the post\n", "- **url**: The URL that the posts links to, if it the post has a URL\n", "- **num_points**: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes\n", "- **num_comments**: The number of comments that were made on the post\n", "- **author**: The username of the person who submitted the post\n", "- **created_at**: The date and time at which the post was submitted\n", "\n", "##### Example\n", "\n", "| id | title | url | num_points | num_comments | author | created_at |\n", "|--------------|--------------|-------------------------------------------------|-------------------------|----------------------|---------------------|----------------|\n", "| 12296411\t | Ask HN: How to improve my personal website?\t | \t | 2 | 6 | ahmedbaracat\t | 8/16/2016 9:55:00 AM |\n", "| 10975351\t | How to Use Open Source and Shut the F*ck Up at the Same Time\t | http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/\t | 39 | 10 | josep2 | 1/26/2016 19:30 |\n", "| 11964716\t | Florida DJs May Face Felony for April Fools' Water Joke\t | http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/\t | 2\t | 1 | vezycash\t | 6/23/2016 22:20\n", " |" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see there are entries that have missing columns and there are others that don't have a tag. For the missing values above which is the URL, we actually dont need this so that entry is good as it is, While the title that doesn't have any tag (Ask HN or Show HN) at the beginning will be separated from the actual data set that we will need later in the process" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prepping\n", "### Segregating Tagged Post for Analysis \n", "Since we are only interested with post that has `Ask HN` and `Show HN`, we will use the `startswith()` method to test if a title column of a row has word containing `Ask HN` and `Show HN`. We will then contain it in a list to separate each post within their own list.\n", "\n", "[proceed to next step](#avg-comment)\n", "#### Code Explanation\n", "`ask_post`, `show_post`, `other_post`is a list of list containing the separated tagged post\n", "- This code will segregrate each tagged post to their corresponding list in order to process our target post (**ask_post** and **show_post**)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prepping\n", "### Segregating Tagged Post for Analysis \n", "Since we are only interested with post that has `Ask HN` and `Show HN`, we will use the `startswith()` method to test if a title column of a row has word containing `Ask HN` and `Show HN`. We will then contain it in a list to separate each post within their own list.\n", "\n", "[proceed to next step](#avg-comment)\n", "#### Code Explanation\n", "`ask_post`, `show_post`, `other_post`is a list of list containing the separated tagged post\n", "- This code will segregrate each tagged post to their corresponding list in order to process our target post (**ask_post** and **show_post**)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of post in Ask HN: 1,744\n", "Number of post in Show HN: 1,162\n", "Number of post in other tagged post: 17,194\n" ] } ], "source": [ "ask_post = [];\n", "show_post = [];\n", "other_post = [];\n", "\n", "# Code to separate \"ask hn\" and \"show hn\" posts into separate list\n", "for row in hn: # iterate each row in data set \"hn\"\n", " title = row[1];\n", " \n", " # append titles to \"ask_post\" that has \"ask hn\" as title\n", " if title.lower().startswith(\"ask hn\"):\n", " ask_post.append(row);\n", " # append titles to \"show_post\" that has \"show hn\" as title\n", " elif title.lower().startswith(\"show hn\"):\n", " show_post.append(row);\n", " # if neither, append to \"other_post\"\n", " else:\n", " other_post.append(row);\n", " \n", "# Code to print the length of rows in the list\n", "print(\"Number of post in Ask HN: {:,}\".format(len(ask_post)))\n", "print(\"Number of post in Show HN: {:,}\".format(len(show_post)))\n", "print(\"Number of post in other tagged post: {:,}\".format(len(other_post)))" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "### Output Explanation\n", "`Ask HN` has more post than `Show HN`. This only would mean that for a technology site, there are a lot of post that ask questions rather than a post that show something interesting. \n", "> **BUT** this will not identify wether `ask_post` has more comments than `show_post`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Analysis\n", "### Average Comments per Tag \n", "Now that we have segragated our data set based on their tag, we wil now use that to sum up all the comments on all post and divide that by the total number of the post leaving us with the average comments.\n", "\n", "[proceed to next step](#count-post) \n", "#### Code Explanation\n", "This code will determine the average comments of both tagged post (ask_hn and show_hn)." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Average Comments in ask_post: 14.04\n", "Average Comments in show_post: 10.32\n" ] } ], "source": [ "total_ask_comments = 0; # Count total ask_post comments\n", "total_show_comments = 0; # Count total show_post comments\n", "\n", "# Code to output average comments per post in \"ask_post\"\n", "for post in ask_post:\n", " num_comments = int(post[4]); # Assign Column num_comments in a variable \"num_comments\"\n", " \n", " # Total all of the comments\n", " total_ask_comments += num_comments; \n", " \n", "# Code to output average comments per post in \"show_post\"\n", "for post in show_post:\n", " num_comments = int(post[4]);\n", " \n", " # Total all of the comments\n", " total_show_comments += num_comments;\n", " \n", "avg_ask_comments = total_ask_comments / len(ask_post) # Average comments per ask_post\n", "print(\"Average Comments in ask_post: {:.2f}\".format(avg_ask_comments)) # Output format ask_post\n", "\n", "avg_show_comments = total_show_comments / len(show_post); # Average comments per ask_post\n", "print(\"Average Comments in show_post: {:.2f}\".format(avg_show_comments)) # Output format show_post" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "### Output Explanation\n", "`Show HN` garnered an average comments per post of 10.32 while `Ask HN` is 3.72 ahead with 14.04 average comments per post. This only means that ask_post has more comments over show_post hence,\n", "> we will use ask_post first to determine what is the peak time for receiving comments on average." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Count Post and Comment by Hour \n", "In the previous cell, we have determined which tag has the most comments on average and now we will use that to have a greater detail _**what time**_ is it optimal to post an `Ask HN` tag and garner a lot of comments? But first, to do that we need to count the `post` per hour of the day and count the `comments` per hour of the day and use the `hour` as a key to divide both count of post and comments leaving us the average comments per post on that hour.\n", "\n", "[proceed to next step](#avg-hour)\n", "#### Code Explanation\n", "Code to create a dictionary for counts of comments per hour and post per hour to use later for computing for average comments per post per hour.\n", "- We have determined that `Ask HN` has more average comments than `Show HN` and we will use ask_post now to count \"post\" and \"comments\" in each hour of the day to determine later what hours has more post and comments" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "import datetime as dt; # Module to process date and time\n", "result_list = []; # list to create list of list\n", "\n", "# Code to create a list of list data set \n", "for post in ask_post:\n", " time_created = post[6];\n", " comments = int(post[4]); # convert number into integer\n", " \n", " result_list.append([time_created, comments]);\n", "\n", "counts_by_hour = {}; # Dictinary to store total post per hour of day\n", "comments_by_hour = {}; # Dictinary to store total comments per hour\n", "\n", "# Code to create a dictionary with \"time\" hour as `key` with `value` as comments\n", "for result in result_list: # Retrieve each list in the data set of list of list (result_list)\n", " # Assign each element of each row in descriptive variables\n", " date_str = result[0];\n", " comments = result[1]; \n", " \n", " time = dt.datetime.strptime(date_str, \"%m/%d/%Y %H:%M\"); # convert the \"time_created\" as datetime obj to process dates\n", " hour = str(time.hour); # Take `hour` only in the date object\n", " \n", " # Count and initialize posts and comments in corresponding dictionary\n", " if hour not in counts_by_hour: # if the key `time` is not in counts_by_hour dictionary, initialize\n", " counts_by_hour[hour] = 1;\n", " comments_by_hour[hour] = comments;\n", " else: # else add all of them for counting\n", " counts_by_hour[hour] += 1; \n", " comments_by_hour[hour] += comments;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Determine Average Comments per Hour \n", "Since we have determined in the previous cell the count of post and comments in a specific hour, now what we need to do now is to find the average comments per post in that specific hour to determine later in the output what time of the day that has the most comment per post in that hour\n", "\n", "[proceed to next step](#sort)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "avg_by_hour = []; # list of list containing hours of day and avg comments per post in that hour\n", "\n", "# Code to determine average comments per post in specific hour\n", "for hour_post, post in counts_by_hour.items(): # Assign `key`=hour and `value`=total post in variables respectively\n", " for hour_comments, comments in comments_by_hour.items(): # Assign `key`=hour and `value`=total post in variables respectively\n", " \n", " # Code to determine if they have the same keys (hour) and determine avg comments per post in that hour\n", " if int(hour_post) == int(hour_comments):\n", " avg_by_hour.append([hour_post, comments/post]) # append a list of hour, avg_comments per hour respectively\n" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "### Sort Average Comments per Post \n", "Now we have determined the average comments per post in specific hour of the day, all we have to do now is to make it readable for us to see the peak time for posting an `Ask HN` tag.\n", "\n", "[proceed to next step](#final)\n", "#### Code Explanation\n", "This code will sort the list of average comments in descending order\n", "> In order to sort this, we have to swap the elements of the list of list so that when we sort that list of list, it will sort the average comments per post value in descending order not the hour it posted." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/plain": [ "[[38.5948275862069, '15'],\n", " [23.810344827586206, '2'],\n", " [21.525, '20'],\n", " [16.796296296296298, '16'],\n", " [16.009174311926607, '21'],\n", " [14.741176470588234, '13'],\n", " [13.440677966101696, '10'],\n", " [13.233644859813085, '14'],\n", " [13.20183486238532, '18'],\n", " [11.46, '17'],\n", " [11.383333333333333, '1'],\n", " [11.051724137931034, '11'],\n", " [10.8, '19'],\n", " [10.25, '8'],\n", " [10.08695652173913, '5'],\n", " [9.41095890410959, '12'],\n", " [9.022727272727273, '6'],\n", " [8.127272727272727, '0'],\n", " [7.985294117647059, '23'],\n", " [7.852941176470588, '7'],\n", " [7.796296296296297, '3'],\n", " [7.170212765957447, '4'],\n", " [6.746478873239437, '22'],\n", " [5.5777777777777775, '9']]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "swap_avg_by_hour = []; # list of swapped elements of `avg_by_hour` for sorting\n", "\n", "# Code for sorting average comments in descending order\n", "for row in avg_by_hour:\n", " # Assign column to variables for readability\n", " hour = row[0];\n", " comments = row[1];\n", " \n", " swap_avg_by_hour.append([comments, hour]); # Append swapped elements to list\n", " \n", "sorted_swap = sorted(swap_avg_by_hour, reverse=True) # sort the list in descending order\n", "sorted_swap" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Output Explanation \n", "\n", "[proceed to next step](#convert)\n", "\n", "We have now sorted the list. The only step to do is to display the top 5 hours for posting an `Ask HN` and format it for readbility" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Top 5 Hours for Ask Posts Comments in Eastern Time\n", "15:00:00: 38.59 average comments per post\n", "02:00:00: 23.81 average comments per post\n", "20:00:00: 21.52 average comments per post\n", "16:00:00: 16.80 average comments per post\n", "21:00:00: 16.01 average comments per post\n" ] } ], "source": [ "print(\"Top 5 Hours for Ask Posts Comments in Eastern Time\");\n", "# Code for outputing top 5 Ask Post highest number of average comments in each hour\n", "for i in range(5):\n", " hours = dt.datetime.strptime(sorted_swap[i][1], \"%H\").strftime(\"%H:%M:%S\") # Turn each hour in the list(sorted_swap) as datetime obj\n", " print(\"{hours}: {comments:.2f} average comments per post\".format(hours=hours, comments=sorted_swap[i][0])) # Formated output" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Convert to different Timezones \n", "\n", "[proceed to conclusion](#conclusion)\n", "\n", "The output above is only applicable with the timezone of *Eastern Times* and may not be usable for other users that live with a different timezone. For your convenience, I will convert this Eastern Times to 3 different timezones." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Eastern Timezone -> America/Los_Angeles Timezone\n", "15:00:00 -> 13:00:00 : 38.59 Average Comments per Post\n", "02:00:00 -> 00:00:00 : 23.81 Average Comments per Post\n", "20:00:00 -> 18:00:00 : 21.52 Average Comments per Post\n", "16:00:00 -> 14:00:00 : 16.80 Average Comments per Post\n", "21:00:00 -> 19:00:00 : 16.01 Average Comments per Post\n", "\n", "Eastern Timezone -> Europe/Madrid Timezone\n", "15:00:00 -> 22:00:00 : 38.59 Average Comments per Post\n", "02:00:00 -> 09:00:00 : 23.81 Average Comments per Post\n", "20:00:00 -> 03:00:00 : 21.52 Average Comments per Post\n", "16:00:00 -> 23:00:00 : 16.80 Average Comments per Post\n", "21:00:00 -> 04:00:00 : 16.01 Average Comments per Post\n", "\n", "Eastern Timezone -> America/Puerto_Rico Timezone\n", "15:00:00 -> 16:00:00 : 38.59 Average Comments per Post\n", "02:00:00 -> 03:00:00 : 23.81 Average Comments per Post\n", "20:00:00 -> 21:00:00 : 21.52 Average Comments per Post\n", "16:00:00 -> 17:00:00 : 16.80 Average Comments per Post\n", "21:00:00 -> 22:00:00 : 16.01 Average Comments per Post\n" ] } ], "source": [ "import pytz;\n", "from datetime import timedelta;\n", "\n", "localFormat = \"%H:%M:%S\"; # Format of time\n", "timezones = [\"America/Los_Angeles\", \"Europe/Madrid\", \"America/Puerto_Rico\"]; # Output different timezones\n", "eastern = pytz.timezone(\"US/Eastern\");\n", "\n", "for tz in timezones:\n", " print(\"\\nEastern Timezone -> {} Timezone\".format(tz));\n", " for i in range(5): # Iterate to top 5 Hours with the most average comments per post\n", " hours = dt.datetime.strptime(sorted_swap[i][1], \"%H\") # Turn each hour in the list(sorted_swap) as datetime obj\n", " est_moment = eastern.localize(hours); # Localize EST time to convert to different timezones\n", " localDatetime = est_moment.astimezone(pytz.timezone(tz)); # Convert the top 5 hours in different timezone\n", " \n", " \"\"\"I have to add additional hours\n", " to the generated timezone by pytz\n", " it seems it doesn't correctly output\n", " the different times in each place\"\"\"\n", " if tz == \"America/Los_Angeles\":\n", " add_time = timedelta(minutes=57);\n", " elif tz == \"Europe/Madrid\":\n", " add_time = timedelta(hours=2, minutes=19);\n", " elif tz == \"America/Puerto_Rico\":\n", " add_time = timedelta(minutes=28);\n", " \n", " localDatetime = localDatetime + add_time # Added the missing hours\n", " print(\"{0} -> {1} : {2:.2f} Average Comments per Post\".format(est_moment.strftime(localFormat), localDatetime.strftime(localFormat), sorted_swap[i][0]));" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And there you have it, the top 5 Hours to post your `Ask HN` to gain a large volume of comments. Hope you find this useful!\n", "> TIP!: You can convert this to your own timezone by modifying the list of timezone above!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion \n", "We have determined that `Ask HN` has the most post and we determined also that 3:00 pm in the afternoon is the peak time that users post comments in posts every day.\n", "\n", "\n", "This answered the two question that we have earlier\n", "1. Do Ask HN or Show HN receive more comments on average?\n", "2. Do posts created at a certain time receive more comments on average?\n", "\n", "To answer this we took the steps of:\n", "1. [Segregate Tagged Post for Analysis](#segregate) \n", "2. [Determine the Highest Average Comments in each Tagged Post](#avg-comment)\n", "3. [Count Post and Comment by Hour of the Highest Average Comments Determined Earlier](#count-post)\n", "4. [Determine Average Comments per Hour](#avg-hour)\n", "5. [Sort Average Comments per Post](#sort)\n", "\n", "This project will be beneficial to users wanting an answer to their specific needs. Now they will have an idea what time to post a question to HackerNews and gain a lot of answers to their question." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" } }, "nbformat": 4, "nbformat_minor": 2 }