{ "cells": [ { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "# Exploring Hacker News Posts\n", "\n", "Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as \"posts\") are voted and commented upon, similar to reddit.\n", "\n", "We're specifically interested in posts whose titles begin with either Ask HN or Show HN. Users submit Ask HN posts to ask the Hacker News community a specific question. Likewise, users submit Show HN posts to show the Hacker News community a project, product, or just generally something interesting.\n", "\n", "We'll compare these two types of posts to determine the following:\n", "\n", "- Do Ask HN or Show HN receive more comments on average?\n", "- Do posts created at a certain time receive more comments on average?\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# import reader to open the data set file\n", "from csv import reader\n", "opened_file = open('hacker_news.csv')\n", "read_file = reader(opened_file)\n", "data_set = list(read_file)\n", "headers = data_set[0]\n", "hn = data_set[1:]\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To make it easier to explore the two data sets, we'll first write a function named explore_data() that we can use repeatedly to explore rows in a more readable way. We'll also add an option for our function to show the number of rows and columns for any data set." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']\n", "\n", "\n", "['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']\n", "\n", "\n", "['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']\n", "\n", "\n", "['11964716', \"Florida DJs May Face Felony for April Fools' Water Joke\", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']\n", "\n", "\n", "['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']\n", "\n", "\n", "['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']\n", "\n", "\n", "Number of rows is 20100\n", "Number of columns is 7\n" ] } ], "source": [ "# define a function to explore the data set\n", "def explore_data(dataset, start, end, rows_and_columns=False):\n", "    dataset_slice = dataset[start:end]\n", "    for row in dataset_slice:\n", "        print(row)\n", "        print('\\n')\n", "\n", "    if rows_and_columns:\n", "        print('Number of rows is', len(dataset))\n", "        print('Number of columns is', len(dataset[0]))\n", "\n", "# explore the first 5 rows of our data set\n", "print(headers)\n", "print('\\n')\n", "explore_data(hn, 0, 5, True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Calculating the Average Number of Comments for Ask HN and Show HN Posts\n", "\n", "Since we're only concerned with post titles beginning 
with Ask HN or Show HN, we'll next separate the posts into three lists: Ask HN posts, Show HN posts (including case variations of both prefixes), and all other posts." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The number of posts in ask_posts is 1744\n", "The number of posts in show_posts is 1162\n", "The number of posts in other_posts is 17194\n" ] } ], "source": [ "# create three lists for the different post types\n", "ask_posts = []\n", "show_posts = []\n", "other_posts = []\n", "\n", "# loop over the data set and append each row to ask_posts, show_posts,\n", "# or other_posts depending on how its title starts\n", "for row in hn:\n", "    title = row[1]\n", "    title = title.lower()\n", "    if title.startswith('ask hn'):\n", "        ask_posts.append(row)\n", "    elif title.startswith('show hn'):\n", "        show_posts.append(row)\n", "    else:\n", "        other_posts.append(row)\n", "\n", "# create a function that returns the number of posts in a category\n", "def num_posts(posts):\n", "    return len(posts)\n", "\n", "num_ask = num_posts(ask_posts)\n", "num_show = num_posts(show_posts)\n", "num_others = num_posts(other_posts)\n", "\n", "print('The number of posts in ask_posts is', num_ask)\n", "print('The number of posts in show_posts is', num_show)\n", "print('The number of posts in other_posts is', num_others)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we've separated the different types of posts, we'll determine whether Ask HN or Show HN posts receive more comments on average." 
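, "\n", "\n", "As an aside, the averaging step can be wrapped in a small reusable helper. The sketch below is only an illustration: the name `average_comments` and its `comments_index` parameter are assumptions that are not used elsewhere in this notebook, and the two sample rows are made up purely to show the call.\n", "\n", "```python\n", "def average_comments(posts, comments_index=4):\n", "    # Hypothetical helper: mean of the num_comments column (index 4 in hn).\n", "    if not posts:\n", "        return 0.0\n", "    total = sum(int(row[comments_index]) for row in posts)\n", "    return total / len(posts)\n", "\n", "# Made-up rows in the same column order as the data set.\n", "sample = [\n", "    ['1', 'Ask HN: A?', '', '3', '10', 'a', '8/4/2016 11:52'],\n", "    ['2', 'Ask HN: B?', '', '5', '4', 'b', '8/4/2016 12:10'],\n", "]\n", "print(average_comments(sample))  # 7.0\n", "```\n", "\n", "Calling `average_comments(ask_posts)` should give the same value as the explicit loop in the next cell."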
] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "14.038417431192661\n" ] } ], "source": [ "# compute the total number of Ask HN comments\n", "total_ask_comments = 0\n", "# loop over the ask_posts list to add up the comments\n", "for row in ask_posts:\n", "    total_ask_comments += int(row[4])\n", "\n", "avg_ask_comments = total_ask_comments/num_ask\n", "print(avg_ask_comments)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We've computed that an Ask HN post receives ~14 comments on average.\n", "Now we'll take a look at Show HN posts." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "10.31669535283993\n" ] } ], "source": [ "# compute the total number of Show HN comments\n", "total_show_comments = 0\n", "# loop over the show_posts list to add up the comments\n", "for row in show_posts:\n", "    total_show_comments += int(row[4])\n", "\n", "avg_show_comments = total_show_comments/num_show\n", "print(avg_show_comments)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We've computed that a Show HN post receives ~10 comments on average.\n", "\n", "So Ask HN posts receive more comments than Show HN posts. One plausible explanation is that people are more willing to answer questions than to discuss, criticize, or praise what other users present in Show HN posts." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Finding the Number of Ask Posts and Comments by Hour Created\n", "\n", "Since ask posts are more likely to receive comments, we'll start the hourly analysis with these posts.\n", "\n", "Next, we'll determine if ask posts created at a certain time are more likely to attract comments." 
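, "\n", "\n", "The key step is turning each `created_at` string into an hour of the day. Here is a minimal sketch of that idea, assuming every timestamp follows the `8/4/2016 11:52` pattern seen in the rows above. The helper name `tally_by_hour` is hypothetical, and `defaultdict` is swapped in for the explicit membership check used in the next cell.\n", "\n", "```python\n", "from collections import defaultdict\n", "from datetime import datetime\n", "\n", "def tally_by_hour(rows):\n", "    # rows: [created_at, num_comments] pairs (an assumed input shape).\n", "    counts = defaultdict(int)\n", "    comments = defaultdict(int)\n", "    for created_at, num_comments in rows:\n", "        hour = datetime.strptime(created_at, '%m/%d/%Y %H:%M').hour\n", "        counts[hour] += 1\n", "        comments[hour] += num_comments\n", "    return dict(counts), dict(comments)\n", "\n", "print(tally_by_hour([['8/4/2016 11:52', 52], ['1/26/2016 11:05', 10]]))\n", "# ({11: 2}, {11: 62})\n", "```\n", "\n", "Both sample posts fall in hour 11, so that hour ends up with a count of 2 and 62 comments in total."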
] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{0: 55, 1: 60, 2: 58, 3: 54, 4: 47, 5: 46, 6: 44, 7: 34, 8: 48, 9: 45, 10: 59, 11: 58, 12: 73, 13: 85, 14: 107, 15: 116, 16: 108, 17: 100, 18: 109, 19: 110, 20: 80, 21: 109, 22: 71, 23: 68}\n", "{0: 447, 1: 683, 2: 1381, 3: 421, 4: 337, 5: 464, 6: 397, 7: 267, 8: 492, 9: 251, 10: 793, 11: 641, 12: 687, 13: 1253, 14: 1416, 15: 4477, 16: 1814, 17: 1146, 18: 1439, 19: 1188, 20: 1722, 21: 1745, 22: 479, 23: 543}\n" ] } ], "source": [ "# import the datetime class using the alias dt\n", "from datetime import datetime as dt\n", "\n", "# append the date in row[6] and the number of comments in row[4] to the result list\n", "result_list = []\n", "for row in ask_posts:\n", "    a_list = []\n", "    a_list.append(row[6])\n", "    a_list.append(int(row[4]))\n", "    result_list.append(a_list)\n", "\n", "\"\"\"create dictionaries: counts_by_hour for the number of posts created at each hour\n", "and comments_by_hour for the number of comments those posts received\n", "\"\"\"\n", "counts_by_hour = {}\n", "comments_by_hour = {}\n", "\n", "# loop over the result list and add its data to the dictionaries\n", "for row in result_list:\n", "    time_str = row[0]\n", "    datetime_dt = dt.strptime(time_str, \"%m/%d/%Y %H:%M\")\n", "    hour_dt = datetime_dt.hour\n", "\n", "    if hour_dt not in counts_by_hour:\n", "        counts_by_hour[hour_dt] = 1\n", "        comments_by_hour[hour_dt] = row[1]\n", "    else:\n", "        counts_by_hour[hour_dt] += 1\n", "        comments_by_hour[hour_dt] += row[1]\n", "\n", "\n", "print(counts_by_hour)\n", "print(comments_by_hour)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have created two dictionaries: one containing the number of ask posts created during each hour of the day, and the other containing the number of comments those posts received. 
Now we are ready to calculate the average number of comments for posts created during each hour of the day.\n", "\n", "## Calculating the Average Number of Comments for Ask HN Posts by Hour" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0, 8.127272727272727]\n", "[1, 11.383333333333333]\n", "[2, 23.810344827586206]\n", "[3, 7.796296296296297]\n", "[4, 7.170212765957447]\n", "[5, 10.08695652173913]\n", "[6, 9.022727272727273]\n", "[7, 7.852941176470588]\n", "[8, 10.25]\n", "[9, 5.5777777777777775]\n", "[10, 13.440677966101696]\n", "[11, 11.051724137931034]\n", "[12, 9.41095890410959]\n", "[13, 14.741176470588234]\n", "[14, 13.233644859813085]\n", "[15, 38.5948275862069]\n", "[16, 16.796296296296298]\n", "[17, 11.46]\n", "[18, 13.20183486238532]\n", "[19, 10.8]\n", "[20, 21.525]\n", "[21, 16.009174311926607]\n", "[22, 6.746478873239437]\n", "[23, 7.985294117647059]\n" ] } ], "source": [ "avg_by_hour = []\n", "\n", "# compute the average number of comments and append the result to the avg_by_hour list\n", "for key in counts_by_hour:\n", "    a_list = []\n", "    a_list.append(key)\n", "    avg = comments_by_hour[key]/counts_by_hour[key]\n", "    a_list.append(avg)\n", "    avg_by_hour.append(a_list)\n", "\n", "for a_list in avg_by_hour:\n", "    print(a_list)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[8.127272727272727, 0]\n", "[11.383333333333333, 1]\n", "[23.810344827586206, 2]\n", "[7.796296296296297, 3]\n", "[7.170212765957447, 4]\n", "[10.08695652173913, 5]\n", "[9.022727272727273, 6]\n", "[7.852941176470588, 7]\n", "[10.25, 8]\n", "[5.5777777777777775, 9]\n", "[13.440677966101696, 10]\n", "[11.051724137931034, 11]\n", "[9.41095890410959, 12]\n", "[14.741176470588234, 13]\n", "[13.233644859813085, 14]\n", "[38.5948275862069, 15]\n", "[16.796296296296298, 16]\n", 
"[11.46, 17]\n", "[13.20183486238532, 18]\n", "[10.8, 19]\n", "[21.525, 20]\n", "[16.009174311926607, 21]\n", "[6.746478873239437, 22]\n", "[7.985294117647059, 23]\n" ] } ], "source": [ "swap_avg_by_hour = []\n", "\n", "# swap the hour and the average number of comments and append the pair to the swap_avg_by_hour list\n", "for row in avg_by_hour:\n", "    a_list = []\n", "    a_list.append(row[1])\n", "    a_list.append(row[0])\n", "    swap_avg_by_hour.append(a_list)\n", "\n", "for a_list in swap_avg_by_hour:\n", "    print(a_list)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Top 5 Hours for Ask Posts Comments\n", "15:00: 38.59 average comments per post\n", "02:00: 23.81 average comments per post\n", "20:00: 21.52 average comments per post\n", "16:00: 16.80 average comments per post\n", "21:00: 16.01 average comments per post\n" ] } ], "source": [ "# sort the swapped [average, hour] pairs in descending order\n", "sorted_swap = sorted(swap_avg_by_hour, reverse=True)\n", "\n", "string = 'Top 5 Hours for Ask Posts Comments'\n", "print(string)\n", "for row in sorted_swap[:5]:\n", "    string1 = dt.strptime(str(row[1]), '%H')\n", "    string1 = dt.strftime(string1, '%H:%M')\n", "    output = \"{0}: {1:.2f} average comments per post\".format(string1, row[0])\n", "    print(output)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "According to these results, the best hour to create an Ask HN post is 15:00 (3pm), when posts receive about 38.6 comments on average. The next best hour is 2:00 (2am), roughly 15 comments behind. Posts created at 20:00 (8pm) average about 21.5 comments, only ~2 fewer than at 2am.\n", "\n", "In the next step we'll calculate the average number of comments for Show HN posts by hour." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Calculating the Average Number of Comments for a Show HN Post by Hour\n" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{0: 31, 1: 28, 2: 30, 3: 27, 4: 26, 5: 19, 6: 16, 7: 26, 8: 34, 9: 30, 10: 36, 11: 44, 12: 61, 13: 99, 14: 86, 15: 78, 16: 93, 17: 93, 18: 61, 19: 55, 20: 60, 21: 47, 22: 46, 23: 36}\n", "{0: 487, 1: 246, 2: 127, 3: 287, 4: 247, 5: 58, 6: 142, 7: 299, 8: 165, 9: 291, 10: 297, 11: 491, 12: 720, 13: 946, 14: 1156, 15: 632, 16: 1084, 17: 911, 18: 962, 19: 539, 20: 612, 21: 272, 22: 570, 23: 447}\n" ] } ], "source": [ "# create a list with the time in row[6] and the number of comments in row[4]\n", "show_result_list = []\n", "for row in show_posts:\n", "    a_list1 = []\n", "    a_list1.append(row[6])\n", "    a_list1.append(int(row[4]))\n", "    show_result_list.append(a_list1)\n", "\n", "\"\"\"create two empty dictionaries, s_counts_by_hour for the number of show posts for each hour, and\n", "s_comments_by_hour for the number of comments for each hour\n", "\"\"\"\n", "s_counts_by_hour = {}\n", "s_comments_by_hour = {}\n", "\n", "# loop through the show_result_list and add its data to the dictionaries\n", "for row in show_result_list:\n", "    s_time_str = row[0]\n", "    s_datetime_dt = dt.strptime(s_time_str, \"%m/%d/%Y %H:%M\")\n", "    s_hour_dt = s_datetime_dt.hour\n", "\n", "    if s_hour_dt not in s_counts_by_hour:\n", "        s_counts_by_hour[s_hour_dt] = 1\n", "        s_comments_by_hour[s_hour_dt] = row[1]\n", "    else:\n", "        s_counts_by_hour[s_hour_dt] += 1\n", "        s_comments_by_hour[s_hour_dt] += row[1]\n", "\n", "\n", "print(s_counts_by_hour)\n", "print(s_comments_by_hour)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0, 15.709677419354838]\n", "[1, 8.785714285714286]\n", "[2, 4.233333333333333]\n", "[3, 
10.62962962962963]\n", "[4, 9.5]\n", "[5, 3.0526315789473686]\n", "[6, 8.875]\n", "[7, 11.5]\n", "[8, 4.852941176470588]\n", "[9, 9.7]\n", "[10, 8.25]\n", "[11, 11.159090909090908]\n", "[12, 11.80327868852459]\n", "[13, 9.555555555555555]\n", "[14, 13.44186046511628]\n", "[15, 8.102564102564102]\n", "[16, 11.655913978494624]\n", "[17, 9.795698924731182]\n", "[18, 15.770491803278688]\n", "[19, 9.8]\n", "[20, 10.2]\n", "[21, 5.787234042553192]\n", "[22, 12.391304347826088]\n", "[23, 12.416666666666666]\n" ] } ], "source": [ "s_avg_by_hour = []\n", "\n", "\"\"\"compute the average number of comments and append a list with time \n", "and the average comments to the s_avg_by_hour list\n", "\"\"\"\n", "for key in s_counts_by_hour:\n", " s_a_list = []\n", " s_a_list.append(key)\n", " s_avg = s_comments_by_hour[key]/s_counts_by_hour[key]\n", " s_a_list.append(s_avg)\n", " s_avg_by_hour.append(s_a_list)\n", " \n", "for s_a_list in s_avg_by_hour:\n", " print(s_a_list)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[15.709677419354838, 0]\n", "[8.785714285714286, 1]\n", "[4.233333333333333, 2]\n", "[10.62962962962963, 3]\n", "[9.5, 4]\n", "[3.0526315789473686, 5]\n", "[8.875, 6]\n", "[11.5, 7]\n", "[4.852941176470588, 8]\n", "[9.7, 9]\n", "[8.25, 10]\n", "[11.159090909090908, 11]\n", "[11.80327868852459, 12]\n", "[9.555555555555555, 13]\n", "[13.44186046511628, 14]\n", "[8.102564102564102, 15]\n", "[11.655913978494624, 16]\n", "[9.795698924731182, 17]\n", "[15.770491803278688, 18]\n", "[9.8, 19]\n", "[10.2, 20]\n", "[5.787234042553192, 21]\n", "[12.391304347826088, 22]\n", "[12.416666666666666, 23]\n" ] } ], "source": [ "s_swap_avg_by_hour = []\n", "\n", "#swap the index of time and comments and append the result to s_swap_avg_by_hour\n", "for row in s_avg_by_hour:\n", " sa_list = []\n", " sa_list.append(row[1])\n", " sa_list.append(row[0])\n", " 
s_swap_avg_by_hour.append(sa_list)\n", "\n", "for a_list in s_swap_avg_by_hour:\n", "    print(a_list)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Top 5 Hours for Show Posts Comments\n", "18:00: 15.77 average comments per post\n", "00:00: 15.71 average comments per post\n", "14:00: 13.44 average comments per post\n", "23:00: 12.42 average comments per post\n", "22:00: 12.39 average comments per post\n" ] } ], "source": [ "# sort the swapped [average, hour] pairs in descending order\n", "s_sorted_swap = sorted(s_swap_avg_by_hour, reverse=True)\n", "\n", "s_string = 'Top 5 Hours for Show Posts Comments'\n", "print(s_string)\n", "for row in s_sorted_swap[:5]:\n", "    sstring1 = dt.strptime(str(row[1]), '%H')\n", "    sstring1 = dt.strftime(sstring1, '%H:%M')\n", "    s_output = \"{0}: {1:.2f} average comments per post\".format(sstring1, row[0])\n", "    print(s_output)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that the best time to create a Show HN post is 18:00 (6pm). On average such a post receives almost the same number of comments as one posted at 00:00 (12am). The third best time is 14:00 (2pm). The differences between the top five averages are small (15.77 down to 12.39 comments per post). 
\n", "\n", "Now we'll calculate the average number of comments for other posts.\n", "\n", "## Calculating the Average Number of Comments for Other Posts by Hour" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{0: 611, 1: 500, 2: 441, 3: 407, 4: 454, 5: 388, 6: 408, 7: 448, 8: 496, 9: 534, 10: 591, 11: 660, 12: 789, 13: 918, 14: 958, 15: 1040, 16: 1101, 17: 1169, 18: 1084, 19: 980, 20: 911, 21: 874, 22: 758, 23: 674}\n", "{0: 16544, 1: 11536, 2: 12254, 3: 10918, 4: 10953, 5: 9768, 6: 8714, 7: 12010, 8: 13405, 9: 14732, 10: 15728, 11: 19532, 12: 23944, 13: 28363, 14: 30973, 15: 30700, 16: 27959, 17: 32727, 18: 29186, 19: 26167, 20: 21080, 21: 20635, 22: 17635, 23: 16592}\n" ] } ], "source": [ "# create a list for posts in the other category: loop through other_posts and\n", "# append the time (row[6]) and the number of comments (row[4])\n", "other_result_list = []\n", "for row in other_posts:\n", "    o_list1 = []\n", "    o_list1.append(row[6])\n", "    o_list1.append(int(row[4]))\n", "    other_result_list.append(o_list1)\n", "\n", "\"\"\"create empty dictionaries:\n", "o_counts_by_hour for the number of posts at a given hour\n", "o_comments_by_hour for the number of comments on posts created at a given hour\n", "\"\"\"\n", "o_counts_by_hour = {}\n", "o_comments_by_hour = {}\n", "\n", "# loop over other_result_list and add its data to the dictionaries\n", "for row in other_result_list:\n", "    o_time_str = row[0]\n", "    o_datetime_dt = dt.strptime(o_time_str, \"%m/%d/%Y %H:%M\")\n", "    o_hour_dt = o_datetime_dt.hour\n", "\n", "    if o_hour_dt not in o_counts_by_hour:\n", "        o_counts_by_hour[o_hour_dt] = 1\n", "        o_comments_by_hour[o_hour_dt] = row[1]\n", "    else:\n", "        o_counts_by_hour[o_hour_dt] += 1\n", "        o_comments_by_hour[o_hour_dt] += row[1]\n", "\n", "\n", "print(o_counts_by_hour)\n", "print(o_comments_by_hour)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { 
"name": "stdout", "output_type": "stream", "text": [ "[0, 27.076923076923077]\n", "[1, 23.072]\n", "[2, 27.786848072562357]\n", "[3, 26.825552825552826]\n", "[4, 24.125550660792953]\n", "[5, 25.175257731958762]\n", "[6, 21.357843137254903]\n", "[7, 26.808035714285715]\n", "[8, 27.026209677419356]\n", "[9, 27.588014981273407]\n", "[10, 26.612521150592215]\n", "[11, 29.593939393939394]\n", "[12, 30.34727503168568]\n", "[13, 30.896514161220043]\n", "[14, 32.33089770354906]\n", "[15, 29.51923076923077]\n", "[16, 25.394187102633968]\n", "[17, 27.99572284003422]\n", "[18, 26.924354243542435]\n", "[19, 26.701020408163266]\n", "[20, 23.13940724478595]\n", "[21, 23.60983981693364]\n", "[22, 23.265171503957784]\n", "[23, 24.617210682492583]\n" ] } ], "source": [ "o_avg_by_hour = []\n", "\n", "#compute the average number of comments and append it with the time to a new list\n", "for key in o_counts_by_hour:\n", " o_a_list = []\n", " o_a_list.append(key)\n", " o_avg = o_comments_by_hour[key]/o_counts_by_hour[key]\n", " o_a_list.append(o_avg)\n", " o_avg_by_hour.append(o_a_list)\n", " \n", "for o_a_list in o_avg_by_hour:\n", " print(o_a_list)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[27.076923076923077, 0]\n", "[23.072, 1]\n", "[27.786848072562357, 2]\n", "[26.825552825552826, 3]\n", "[24.125550660792953, 4]\n", "[25.175257731958762, 5]\n", "[21.357843137254903, 6]\n", "[26.808035714285715, 7]\n", "[27.026209677419356, 8]\n", "[27.588014981273407, 9]\n", "[26.612521150592215, 10]\n", "[29.593939393939394, 11]\n", "[30.34727503168568, 12]\n", "[30.896514161220043, 13]\n", "[32.33089770354906, 14]\n", "[29.51923076923077, 15]\n", "[25.394187102633968, 16]\n", "[27.99572284003422, 17]\n", "[26.924354243542435, 18]\n", "[26.701020408163266, 19]\n", "[23.13940724478595, 20]\n", "[23.60983981693364, 21]\n", "[23.265171503957784, 22]\n", "[24.617210682492583, 23]\n" ] } 
], "source": [ "o_swap_avg_by_hour = []\n", "\n", "# swap the hour and the average number of comments\n", "for row in o_avg_by_hour:\n", "    oa_list = []\n", "    oa_list.append(row[1])\n", "    oa_list.append(row[0])\n", "    o_swap_avg_by_hour.append(oa_list)\n", "\n", "for o_list in o_swap_avg_by_hour:\n", "    print(o_list)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Top 5 Hours for Other Posts Comments\n", "14:00: 32.33 average comments per post\n", "13:00: 30.90 average comments per post\n", "12:00: 30.35 average comments per post\n", "11:00: 29.59 average comments per post\n", "15:00: 29.52 average comments per post\n" ] } ], "source": [ "# sort the swapped [average, hour] pairs in descending order to create a top list\n", "o_sorted_swap = sorted(o_swap_avg_by_hour, reverse=True)\n", "\n", "o_string = 'Top 5 Hours for Other Posts Comments'\n", "print(o_string)\n", "for row in o_sorted_swap[:5]:\n", "    ostring1 = dt.strptime(str(row[1]), '%H')\n", "    ostring1 = dt.strftime(ostring1, '%H:%M')\n", "    ooutput = \"{0}: {1:.2f} average comments per post\".format(ostring1, row[0])\n", "    print(ooutput)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We found that other posts created at 14:00 (2pm) receive the most comments on average. However, the top five averages are close to one another (29.52 to 32.33), so a post created between 11:00 (11am) and 16:00 (4pm) is likely to do about as well.\n", "\n", "Now let's compare the results for all three post types." 
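, "\n", "\n", "The swap-then-sort pattern used above can also be written with a `key` function, which sorts the `[hour, average]` pairs directly. The following is a sketch under assumptions rather than the approach this notebook takes: `print_top_hours` and its parameters are hypothetical names, and the sample pairs are rounded values copied from the Ask HN results.\n", "\n", "```python\n", "def print_top_hours(hour_avg_pairs, title, n=5):\n", "    # hour_avg_pairs: [hour, average] lists, the shape of avg_by_hour.\n", "    print(title)\n", "    ranked = sorted(hour_avg_pairs, key=lambda pair: pair[1], reverse=True)\n", "    for hour, avg in ranked[:n]:\n", "        print('{:02d}:00: {:.2f} average comments per post'.format(hour, avg))\n", "\n", "print_top_hours([[9, 5.58], [15, 38.59], [2, 23.81]], 'Top 2 Hours (sample)', n=2)\n", "# Top 2 Hours (sample)\n", "# 15:00: 38.59 average comments per post\n", "# 02:00: 23.81 average comments per post\n", "```\n", "\n", "Sorting by `pair[1]` avoids building a second, swapped list, and formatting the hour with `{:02d}` avoids the round trip through `strptime` and `strftime`."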
] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Top 5 Hours for Ask Posts Comments\n", "15:00: 38.59 average comments per post\n", "02:00: 23.81 average comments per post\n", "20:00: 21.52 average comments per post\n", "16:00: 16.80 average comments per post\n", "21:00: 16.01 average comments per post\n", "\n", "\n", "Top 5 Hours for Show Posts Comments\n", "18:00: 15.77 average comments per post\n", "00:00: 15.71 average comments per post\n", "14:00: 13.44 average comments per post\n", "23:00: 12.42 average comments per post\n", "22:00: 12.39 average comments per post\n", "\n", "\n", "Top 5 Hours for Other Posts Comments\n", "14:00: 32.33 average comments per post\n", "13:00: 30.90 average comments per post\n", "12:00: 30.35 average comments per post\n", "11:00: 29.59 average comments per post\n", "15:00: 29.52 average comments per post\n" ] } ], "source": [ "#print all the results from different posts categories\n", "\n", "print(string)\n", "for row in sorted_swap[:5]:\n", " string1 = dt.strptime(str(row[1]), '%H')\n", " string1 = dt.strftime(string1, '%H:%M')\n", " output = \"{0}: {1:.2f} average comments per post\".format(string1, row[0])\n", " print(output)\n", " \n", "print('\\n')\n", "\n", "print(s_string)\n", "for row in s_sorted_swap[:5]:\n", " sstring1 = dt.strptime(str(row[1]), '%H')\n", " sstring1 = dt.strftime(sstring1, '%H:%M')\n", " s_output = \"{0}: {1:.2f} average comments per post\".format(sstring1, row[0])\n", " print(s_output)\n", "\n", "print('\\n')\n", "\n", "print(o_string)\n", "for row in o_sorted_swap[:5]:\n", " ostring1 = dt.strptime(str(row[1]), '%H')\n", " ostring1 = dt.strftime(ostring1, '%H:%M')\n", " ooutput = \"{0}: {1:.2f} average comments per post\".format(ostring1, row[0])\n", " print(ooutput)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### The result\n", "\n", "- Ask Posts on average get more comments 
than Show HN posts overall (about 14 versus 10 per post), and their best hour, 15:00 (3pm), has the highest average of any group (38.59). The next most commented hours are 2:00 (2am) and 20:00 (8pm); after that the averages decline. So the best way to get more comments on an Ask HN post is to create it at 15:00 (3pm).\n", "\n", "- Other posts receive more comments than Ask HN or Show HN posts at almost every hour: their hourly averages stay between about 21 and 32. The most commented hours are 14:00 (2pm), 13:00 (1pm), and 12:00 (12pm). The differences within the top list are small, so posting at any time between 11:00 and 16:00 should work about equally well.\n", "\n", "- Show HN posts receive the fewest comments (about 10 per post on average). The most commented hours are 18:00 (6pm), 00:00 (12am), and 14:00 (2pm).\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Determining whether Show or Ask Posts Receive More Points on Average\n", "\n", "A post's points are calculated as the number of upvotes it has received minus the number of downvotes. We want to know which type of post gets more points." ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "27.555077452667813\n" ] } ], "source": [ "show_points = 0\n", "\n", "# iterate over show_posts and add up the points\n", "for row in show_posts:\n", "    points = int(row[3])\n", "    show_points += points\n", "\n", "# find the average by dividing show_points by num_show\n", "avg_show_points = show_points/num_show\n", "print(avg_show_points)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "15.061926605504587\n" ] } ], "source": [ "ask_points = 0\n", "\n", "# iterate over ask_posts and add up the points\n", "for row in ask_posts:\n", "    points = int(row[3])\n", "    ask_points += points\n", "\n", "# find the average by dividing ask_points by num_ask\n", "avg_ask_points = ask_points/num_ask\n", "print(avg_ask_points)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can 
see that Show HN posts receive more points on average (about 27.6 versus 15.1 for Ask HN). That suggests a Show HN post is likely to earn more upvotes.\n", "\n", "Next we'll find out whether posts created at a certain time are more likely to receive more points." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Determining if Posts Created at a Certain Time are More Likely to Receive More Points" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Top 5 Hours for Ask Posts Points:\n", "15:00: 29.99 points per post\n", "13:00: 24.26 points per post\n", "16:00: 23.35 points per post\n", "17:00: 19.41 points per post\n", "10:00: 18.68 points per post\n" ] } ], "source": [ "ask_result_list = []\n", "\n", "# loop through ask_posts and append the time (row[6]) and the number of points (row[3]) to the list\n", "for row in ask_posts:\n", "    a_list = []\n", "    a_list.append(row[6])\n", "    a_list.append(int(row[3]))\n", "    ask_result_list.append(a_list)\n", "\n", "\"\"\"create two empty dictionaries:\n", "counts_ask_by_hour for the number of posts created at a given hour\n", "ask_points_by_hour for the number of points for posts at a given hour\n", "\"\"\"\n", "counts_ask_by_hour = {}\n", "ask_points_by_hour = {}\n", "\n", "# loop through ask_result_list and add its data to the dictionaries\n", "for row in ask_result_list:\n", "    time_str = row[0]\n", "    datetime_dt = dt.strptime(time_str, \"%m/%d/%Y %H:%M\")\n", "    hour_dt = datetime_dt.hour\n", "\n", "    if hour_dt not in ask_points_by_hour:\n", "        ask_points_by_hour[hour_dt] = row[1]\n", "        counts_ask_by_hour[hour_dt] = 1\n", "    else:\n", "        ask_points_by_hour[hour_dt] += row[1]\n", "        counts_ask_by_hour[hour_dt] += 1\n", "\n", "average_by_hour = []\n", "\n", "# compute the average by dividing the number of points (ask_points_by_hour) by the number of posts (counts_ask_by_hour)\n", "for key in counts_ask_by_hour:\n", "    a_list = []\n", "    a_list.append(key)\n", "    avg1 = 
ask_points_by_hour[key]/counts_ask_by_hour[key]\n", "    a_list.append(avg1)\n", "    average_by_hour.append(a_list)\n", "\n", "swap_average_by_hour = []\n", "\n", "# swap the hour and the average number of points, append the pair to swap_average_by_hour\n", "for row in average_by_hour:\n", "    av_list = []\n", "    av_list.append(row[1])\n", "    av_list.append(row[0])\n", "    swap_average_by_hour.append(av_list)\n", "\n", "# sort swap_average_by_hour in descending order to create a top list\n", "sor_ask_average = sorted(swap_average_by_hour, reverse=True)\n", "\n", "string2 = 'Top 5 Hours for Ask Posts Points:'\n", "print(string2)\n", "for row in sor_ask_average[:5]:\n", "    string3 = dt.strptime(str(row[1]), '%H')\n", "    string3 = dt.strftime(string3, '%H:%M')\n", "    output = \"{0}: {1:.2f} points per post\".format(string3, row[0])\n", "    print(output)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that Ask HN posts created at 15:00 (3pm) get considerably more points than those created at any other time.\n", "\n", "Now we'll analyse the Show HN posts." 
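, "\n", "\n", "Because the same steps are repeated for each post type and for both comments and points, they could be folded into one function. The sketch below is one possible shape, not the approach this notebook takes: `average_by_hour`, `value_index`, and `date_index` are hypothetical names, and the two sample rows are made up.\n", "\n", "```python\n", "from datetime import datetime\n", "\n", "def average_by_hour(posts, value_index, date_index=6):\n", "    # Hypothetical helper: value_index 3 is num_points, 4 is num_comments.\n", "    totals, counts = {}, {}\n", "    for row in posts:\n", "        hour = datetime.strptime(row[date_index], '%m/%d/%Y %H:%M').hour\n", "        totals[hour] = totals.get(hour, 0) + int(row[value_index])\n", "        counts[hour] = counts.get(hour, 0) + 1\n", "    return {hour: totals[hour] / counts[hour] for hour in totals}\n", "\n", "# Made-up rows in the same column order as the data set.\n", "sample = [\n", "    ['1', 'Show HN: A', '', '10', '2', 'a', '8/4/2016 11:52'],\n", "    ['2', 'Show HN: B', '', '20', '4', 'b', '8/5/2016 11:05'],\n", "]\n", "print(average_by_hour(sample, value_index=3))  # {11: 15.0}\n", "```\n", "\n", "With such a helper, `average_by_hour(show_posts, 3)` would return the hourly point averages that the next cell computes step by step."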
] }, { "cell_type": "code", "execution_count": 39, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Top 5 Hours for Show Posts Points:\n", "23:00: 42.39 points per post\n", "12:00: 41.69 points per post\n", "22:00: 40.35 points per post\n", "00:00: 37.84 points per post\n", "18:00: 36.31 points per post\n" ] } ], "source": [ "show_result_list = []\n", "\n", "#loop through the show_posts and append the time(row[6]) and number of points(row[3]) to show_result_list\n", "for row in show_posts:\n", " c_list = []\n", " c_list.append(row[6])\n", " c_list.append(int(row[3]))\n", " show_result_list.append(c_list)\n", "\n", "\"\"\"create two emoty dictionaries:\n", "counts_show_by_hour for the number of posts at a given hour\n", "show_points_by_hour for the number of points for these posts\n", "\"\"\"\n", "counts_show_by_hour = {}\n", "show_points_by_hour = {}\n", "\n", "#loop through show_result_list, append the number of points and calculate the number of posts\n", "for row in show_result_list:\n", " time_str1 = row[0]\n", " datetime_dt1 = dt.strptime(time_str1, \"%m/%d/%Y %H:%M\")\n", " hour_dt1 = datetime_dt1.hour\n", " \n", " if hour_dt1 not in show_points_by_hour:\n", " show_points_by_hour[hour_dt1] = row[1]\n", " counts_show_by_hour[hour_dt1] = 1\n", " else:\n", " show_points_by_hour[hour_dt1] += row[1]\n", " counts_show_by_hour[hour_dt1] += 1\n", " \n", "show_average_by_hour = []\n", "\n", "#compute the average number of points per post, append to show_average_by_hour\n", "for key in counts_show_by_hour:\n", " a_list = []\n", " a_list.append(key)\n", " avg2 = show_points_by_hour[key]/counts_show_by_hour[key]\n", " a_list.append(avg2)\n", " show_average_by_hour.append(a_list)\n", " \n", "swap_showp_by_hour = []\n", "\n", "#swap the index of time and number of points and append to the swap_showp_by_hour\n", "for row in show_average_by_hour:\n", " v_list = []\n", " v_list.append(row[1])\n", " 
v_list.append(row[0])\n", " swap_showp_by_hour.append(v_list)\n", "\n", "#apply the sorted function to create a top-list\n", "sor_show_pointsbh = sorted(swap_showp_by_hour, reverse=True)\n", "\n", "string4 = 'Top 5 Hours for Show Posts Points:'\n", "print(string4)\n", "\n", "for row in sor_show_pointsbh[:5]:\n", " string5 = dt.strptime(str(row[1]), '%H')\n", " string5 = dt.strftime(string5, '%H:%M')\n", " output = '{0}: {1:.2f} points per post'.format(string5, row[0])\n", " print(output)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that the average number of points Show HN posts receive differs only slightly across the top five hours (about 36 to 42 points per post).\n", "It doesn't matter much when exactly you create a Show HN post, as long as it falls in one of these periods:\n", "\n", "- from 22:00 (10pm) to 00:00 (12am)\n", "- at 12:00 (12pm)\n", "\n", "Now we will compute the average number of points other posts receive." ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Top 5 Hours for Other Posts Points:\n", "13:00: 62.53 points per post\n", "14:00: 61.79 points per post\n", "15:00: 60.54 points per post\n", "10:00: 60.48 points per post\n", "19:00: 60.01 points per post\n" ] } ], "source": [ "other_result_list = []\n", "\n", "#loop through other_posts, append the time (row[6]) and the number of points (row[3]) to the other_result_list\n", "for row in other_posts:\n", " oth_list = []\n", " oth_list.append(row[6])\n", " oth_list.append(int(row[3]))\n", " other_result_list.append(oth_list)\n", "\n", "\"\"\"create two empty dictionaries:\n", "counts_other_by_hour - for the number of posts at a given hour\n", "other_points_by_hour - for the number of points for these posts\n", "\"\"\"\n", "counts_other_by_hour = {} \n", "other_points_by_hour = {}\n", "\n", "#loop through other_result_list, accumulate the number of points and count the posts for each hour\n", "for row in 
other_result_list:\n", " time_str3 = row[0]\n", " datetime_dt3 = dt.strptime(time_str3, \"%m/%d/%Y %H:%M\")\n", " hour_dt3 = datetime_dt3.hour\n", " \n", " if hour_dt3 not in other_points_by_hour:\n", " other_points_by_hour[hour_dt3] = row[1]\n", " counts_other_by_hour[hour_dt3] = 1\n", " else:\n", " other_points_by_hour[hour_dt3] += row[1]\n", " counts_other_by_hour[hour_dt3] += 1\n", " \n", "av_other_points = []\n", "\n", "#compute the average number of points per post for each hour, append to av_other_points\n", "for key in counts_other_by_hour:\n", " f_list = []\n", " f_list.append(key)\n", " avg_f = other_points_by_hour[key]/counts_other_by_hour[key]\n", " f_list.append(avg_f)\n", " av_other_points.append(f_list)\n", " \n", "swap_other_points = []\n", "\n", "#swap the index of the time and number of points\n", "for row in av_other_points:\n", " v_list = []\n", " v_list.append(row[1])\n", " v_list.append(row[0])\n", " swap_other_points.append(v_list)\n", "\n", "#apply the sorted function to swap_other_points to create a top-list\n", "sor_other_pointsbh = sorted(swap_other_points, reverse=True)\n", "\n", "string7 = 'Top 5 Hours for Other Posts Points:'\n", "print(string7)\n", "\n", "for row in sor_other_pointsbh[:5]:\n", " string8 = dt.strptime(str(row[1]), '%H')\n", " string8 = dt.strftime(string8, '%H:%M')\n", " oth_output = '{0}: {1:.2f} points per post'.format(string8, row[0])\n", " print(oth_output)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The best period for posting in the Other category is between 13:00 (1pm) and 16:00 (4pm). Now we'll compare the results for the different post types."
] }, { "cell_type": "code", "execution_count": 42, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Top 5 Hours for Ask Posts Points:\n", "15:00: 29.99 point per post\n", "13:00: 24.26 point per post\n", "16:00: 23.35 point per post\n", "17:00: 19.41 point per post\n", "10:00: 18.68 point per post\n", "\n", "\n", "Top 5 Hours for Show Posts Points:\n", "23:00: 42.39 points per post\n", "12:00: 41.69 points per post\n", "22:00: 40.35 points per post\n", "00:00: 37.84 points per post\n", "18:00: 36.31 points per post\n", "\n", "\n", "Top 5 Hours for Other Posts Points:\n", "13:00: 62.53 points per post\n", "14:00: 61.79 points per post\n", "15:00: 60.54 points per post\n", "10:00: 60.48 points per post\n", "19:00: 60.01 points per post\n" ] } ], "source": [ "#print all the results\n", "string2 = 'Top 5 Hours for Ask Posts Points:'\n", "print(string2)\n", "for row in sor_ask_average[:5]:\n", " string3 = dt.strptime(str(row[1]), '%H')\n", " string3 = dt.strftime(string3, '%H:%M')\n", " output = \"{0}: {1:.2f} point per post\".format(string3, row[0])\n", " print(output)\n", "print('\\n')\n", "\n", "string4 = 'Top 5 Hours for Show Posts Points:'\n", "print(string4)\n", "for row in sor_show_pointsbh[:5]:\n", " string5 = dt.strptime(str(row[1]), '%H')\n", " string5 = dt.strftime(string5, '%H:%M')\n", " output = '{0}: {1:.2f} points per post'.format(string5, row[0])\n", " print(output)\n", "print('\\n')\n", "\n", "string7 = 'Top 5 Hours for Other Posts Points:'\n", "print(string7)\n", "for row in sor_other_pointsbh[:5]:\n", " string8 = dt.strptime(str(row[1]), '%H')\n", " string8 = dt.strftime(string8, '%H:%M')\n", " oth_output = '{0}: {1:.2f} points per post'.format(string8, row[0])\n", " print(oth_output)\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### The Result\n", "\n", "Posts outside the Ask HN and Show HN categories receive the most points on average. 
If you want to create such a post, the best time for it is between 13:00 (1pm) and 16:00 (4pm).\n", "\n", "Show HN posts take second place, with fewer points per post. If you want to get the maximum points, the best times to post are 23:00 (11pm), 12:00 (12pm), and 22:00 (10pm).\n", "\n", "Ask HN posts receive the fewest points: in their top five hours they average about 19 to 30 points per post, compared with about 36 to 42 for Show HN. You can get the maximum points if you create a post at 15:00 (3pm), 13:00 (1pm), or 16:00 (4pm)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.4.3" } }, "nbformat": 4, "nbformat_minor": 2 }