{ "cells": [ { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "# Ask HN vs Show HN posts: Hacker News Engagement analysis\n", "\n", "## Overview\n", "\n", "In this project, we look at the popular technology news site [Hacker News](https://news.ycombinator.com/), and analyse the posts with the titles that begin with either Ask HN or Show HN.\n", "\n", "Ask HN posts are the ones that users submit to ask the community a specific question, while, Show HN posts are the ones they submit to show the community a project, product, or something interesting.\n", "\n", "We will compare the two types of posts to determine:\n", "- Which receive more comments on average?\n", "- Do posts created at a certain time receive more comments on average?\n", "\n", "## Dataset\n", "\n", "Dataset for this project can be downloaded from [here](https://www.kaggle.com/hacker-news/hacker-news-posts). Below are the description of the columns:\n", "\n", "- id: unique identifier from Hacker News for the post\n", "- title: the title of the post\n", "- url: the URL that the posts link to (if it has a URL)\n", "- num_points: the number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes\n", "- num_comments: the number of comments on the post\n", "- author: the username of the person who submitted the post\n", "- created_at: the date and time of the post's submission (the time zone is *US Eastern Time*)\n", "\n", "**Note**: This dataset is based on approximately 20,000 rows randomly sampled from the submissions after removing posts without any comments.\n", "\n", "Let's read the dataset and display some posts.\n", "\n", "## Data Analysis" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']\n", "['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']\n", "['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']\n", "['11964716', \"Florida DJs May Face Felony for April Fools' Water Joke\", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']\n", "['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']\n", "['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']\n" ] } ], "source": [ "from csv import reader\n", "opened_file = open(\"hacker_news.csv\")\n", "read_file = reader(opened_file)\n", "hn = list(read_file)\n", "headers = hn[0]\n", "hn = hn[1:]\n", "print(headers)\n", "for post in hn[:5]:\n", " print(post)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's seperate the posts as we are only interested in Ask HN and Show HN posts." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of Ask HN posts: 1744\n", "Number of Show HN posts: 1162\n", "Number of Other posts: 17194\n", "\n", "\n", "===== Ask HN posts =====\n", "[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'], ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'], ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20'], ['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']]\n", "\n", "\n", "===== Show HN posts =====\n", "[['10627194', 'Show HN: Wio Link ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46'], ['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', 'h8liu', '4/28/2016 18:05'], ['12178806', 'Show HN: Webscope Easy way for web developers to communicate with Clients', 'http://webscopeapp.com', '3', '3', 'fastbrick', '7/28/2016 7:11'], ['10872799', 'Show HN: GeoScreenshot Easily test Geo-IP based web pages', 'https://www.geoscreenshot.com/', '1', '9', 'kpsychwave', '1/9/2016 20:45']]\n" ] } ], "source": [ "ask_posts = []\n", "show_posts = []\n", "other_posts = []\n", "for post in hn:\n", " title = post[1]\n", " title = title.lower()\n", " if title.startswith(\"ask hn\"):\n", " ask_posts.append(post)\n", " elif title.startswith(\"show hn\"):\n", " show_posts.append(post)\n", " else:\n", " other_posts.append(post)\n", "print('Number of Ask HN posts: ',len(ask_posts))\n", "print('Number of Show HN posts: ',len(show_posts))\n", "print('Number of Other posts: ',len(other_posts))\n", "print('\\n')\n", "print('='*5,'Ask HN posts','='*5)\n", "print(ask_posts[:5])\n", "print('\\n')\n", "print('='*5,'Show HN posts','='*5)\n", "print(show_posts[:5])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 1. Does Ask HN or Show HN posts receive more comments on average?\n", "\n", "Let's determine the total number of comments and average comment per post for Ask HN vs Show HN posts." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "===== Ask HN =====\n", "Total posts = 1744\n", "Total comments = 24483\n", "Average comments per post = 14.04\n", "\n", "\n", "===== Show HN =====\n", "Total posts = 1162\n", "Total comments = 11988\n", "Average comments per post = 10.32\n" ] } ], "source": [ "def get_comments_stats(posts, print_answers = True):\n", " total_comments = 0\n", " avg_comments = 0\n", " total_posts = len(posts)\n", " for post in posts:\n", " num_comments = int(post[4])\n", " num_points = int(post[3])\n", " total_comments += num_comments\n", " avg_comments = round(total_comments/total_posts,2)\n", " \n", " if print_answers:\n", " print('Total posts = ',total_posts)\n", " print('Total comments = ',total_comments)\n", " print('Average comments per post = ',avg_comments)\n", " \n", " return avg_comments, total_comments\n", "\n", "print('='*5,'Ask HN','='*5)\n", "avg_ask_comments,total_ask_comments = get_comments_stats(ask_posts)\n", "print('\\n')\n", "print('='*5,'Show HN','='*5)\n", "avg_show_comments,total_show_comments = get_comments_stats(show_posts)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After seperating the Ask HN and Show HN posts and calculating the average comments across posts, we can see that the average number of comments per post is higher for **Ask HN** posts (about 14 comments per post) than **Show HN** posts (about 10 comments per post).\n", "\n", "This answers our first question\n", "\n", "> **Ask HN posts receive more comments on average than Show HN posts**\n", "\n", "Since ask posts are more likely to receive comments, we will focus only on these posts for answering our next question.\n", "\n", "#### 2. Do posts created at a certain time receive more comments on average?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To answer this, we need to:\n", "1. Calculate the number of ask posts created in each hour of the day, along with the number of comments received\n", "2. Calculate the average number of comments ask posts receive by hour created\n", "\n", "For this, we will seperate the hour a post was made from the created_at datetime field in our dataset and then create two dictionaries - One for the number of posts and another for the number of comments by the hour." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "import datetime as dt\n", "import pytz\n", "result_list = []\n", "for post in ask_posts:\n", " created_at = dt.datetime.strptime(post[6],'%m/%d/%Y %H:%M')\n", " num_comments = int(post[4])\n", " result_list.append([created_at,num_comments])\n", "\n", "counts_by_hour = {}\n", "comments_by_hour = {}\n", "for row in result_list:\n", " created_date = row[0]\n", " created_hour = created_date.strftime('%H')\n", " num_comments = row[1]\n", " if created_hour in counts_by_hour:\n", " counts_by_hour[created_hour] += 1\n", " comments_by_hour[created_hour] += num_comments\n", " else:\n", " counts_by_hour[created_hour] = 1\n", " comments_by_hour[created_hour] = num_comments" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have the number of posts and total number of comments segmented by the hour in a day, we will determine the average number of comments per post by the hour the post was created in the day and display the top 5 hours during which Ask HN posts get more comments.\n", "\n", "**Note**: The dataset timezone as per the documentation in *US/Eastern* - So we will display both this timezone and local timezone which in my case is *Europe/London*." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/markdown": [ "**Top 5 Hours (US/Eastern) for Ask Posts Comments**" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "15:00: 38.59 average comments per post\n", "02:00: 23.81 average comments per post\n", "20:00: 21.52 average comments per post\n", "16:00: 16.80 average comments per post\n", "21:00: 16.01 average comments per post\n", "\n", "\n" ] }, { "data": { "text/markdown": [ "**Top 5 Hours (Europe/London) for Ask Posts Comments**" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "19:55: 38.59 average comments per post\n", "06:55: 23.81 average comments per post\n", "00:55: 21.52 average comments per post\n", "20:55: 16.80 average comments per post\n", "01:55: 16.01 average comments per post\n" ] } ], "source": [ "from IPython.display import display, Markdown\n", "\n", "# Function to return the results (hour in the day) sorted by highest average comments\n", "def sorted_result(freq_tbl):\n", " table = freq_tbl\n", " table_display = []\n", " for key in table:\n", " key_val_as_tuple = (table[key], key)\n", " table_display.append(key_val_as_tuple)\n", "\n", " table_sorted = sorted(table_display, reverse = True)\n", " return table_sorted\n", "\n", "# Function to sort and display the results formatted - By default displays top 5 in US/Eastern timezone but can change to local timezone and pass a Local timezone\n", "def display_result(freq_tbl, display_top_n = 5, display_in_local_time = False, local_timezone = 'Europe/London'):\n", " # Dataset timezone per documentation\n", " us_eastern = pytz.timezone('US/Eastern')\n", " # Ability to display in a different timezone by taking the parameter\n", " local_tz = pytz.timezone(local_timezone)\n", " heading_template = \"\"\n", " \n", " if display_in_local_time:\n", " heading_template = 'Top {} Hours ({}) for Ask Posts Comments'.format(display_top_n,local_tz.zone)\n", " else:\n", " heading_template = 'Top {} Hours ({}) for Ask Posts Comments'.format(display_top_n,us_eastern.zone)\n", " \n", " display(Markdown('**'+heading_template+'**'))\n", " #print(heading_template)\n", " #print('='*len(heading_template))\n", " \n", " table_sorted = sorted_result(freq_tbl)\n", " display_str_template = \"{}: {:.2f} average comments per post\"\n", " row_count = 0\n", " \n", " for entry in table_sorted:\n", " row_count += 1\n", " if row_count <= display_top_n:\n", " hour = dt.datetime.strptime(entry[1], '%H')\n", " \n", " if display_in_local_time:\n", " hour = hour.replace(tzinfo=us_eastern)\n", " hour = hour.astimezone(local_tz)\n", "\n", " hour_fmt = hour.strftime('%H:%M')\n", " print(display_str_template.format(hour_fmt,entry[0]))\n", " else:\n", " exit\n", " \n", "avg_by_hour = {}\n", "for hour in comments_by_hour:\n", " posts = counts_by_hour[hour]\n", " num_comments = comments_by_hour[hour]\n", " avg_comments = num_comments/posts\n", " avg_by_hour[hour] = avg_comments\n", "display_result(freq_tbl = avg_by_hour,display_in_local_time = False)\n", "print('\\n')\n", "display_result(freq_tbl = avg_by_hour,display_in_local_time = True,local_timezone = 'Europe/London')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This answers our last question:\n", "\n", "> **Ask HN posts created around 3 pm US/Eastern time (around 8 pm Europe/London time) seems to have highest average comments per post**\n", "\n", "## Conclusion\n", "\n", "After our data analysis on the Ask HN and Show HN posts, we conclude that Ask HN posts receive more comments and that best hour in the day for the comment activity on the Ask HN posts is around 3 pm US/Eastern time (8 pm Europe/London time)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" } }, "nbformat": 4, "nbformat_minor": 2 }