{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Exploring Hacker News Posts\n", "\n", "## Introduction\n", "\n", "Hacker News is a extremely popular site in the tech and startup world. A user can submit a post, where they are voted on commented on, very similar to reddit. The top posts can recieve hundreds of thousands of visitors. \n", "\n", "I am aiming to explore two types of posts: `Ask HN` and `Show HN`. \n", "\n", "To find out the following:\n", "\n", "* Do `Ask HN` or `Show HN` receive more comments on average?\n", "* Do posts created at a certain time receive more comments on average?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Importing and Reading the Data\n", "\n", "In the cell below I have done the following:\n", "\n", "* Imported the reader\n", "* Opended and read the file `hacker_news.csv`\n", "* Turned the file into a list of lists with the `list()` function and assigned it to a variable `hn`\n", "* Assigned only the header row to a variable `headers`, so I can easily reference the column titles if needed\n", "* Then updated the variable `hn` so that does not include the header \n", "* Finally I used the `print()` function to display the `headers`and the frist 5 rows of `hn`\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']\n", "\n" ] }, { "data": { "text/plain": [ "[['12224879',\n", " 'Interactive Dynamic Video',\n", " 'http://www.interactivedynamicvideo.com/',\n", " '386',\n", " '52',\n", " 'ne0phyte',\n", " '8/4/2016 11:52'],\n", " ['10975351',\n", " 'How to Use Open Source and Shut the Fuck Up at the Same Time',\n", " 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',\n", " '39',\n", " '10',\n", " 'josep2',\n", " '1/26/2016 19:30'],\n", " ['11964716',\n", " \"Florida DJs May 
Face Felony for April Fools' Water Joke\",\n", " 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',\n", " '2',\n", " '1',\n", " 'vezycash',\n", " '6/23/2016 22:20'],\n", " ['11919867',\n", " 'Technology ventures: From Idea to Enterprise',\n", " 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',\n", " '3',\n", " '1',\n", " 'hswarna',\n", " '6/17/2016 0:01'],\n", " ['10301696',\n", " 'Note by Note: The Making of Steinway L1037 (2007)',\n", " 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',\n", " '8',\n", " '2',\n", " 'walterbell',\n", " '9/30/2015 4:12']]" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from csv import reader\n", "file = open(\"hacker_news.csv\")\n", "read = reader(file)\n", "hn = list(read)\n", "headers = hn[0]\n", "hn = hn[1:]\n", "print(headers)\n", "print(\"\")\n", "hn[:5]\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Extracting Ask HN and Show HN Posts\n", "\n", "In the code cell below I first made three empty lists in which to store the specific posts I needed. \n", "\n", "I then looped through each row in `hn`, looking for titles that begin with \"ask hn\" or \"show hn\"; everything else belongs in the remaining list. I used the string method `startswith` so that uppercase or lowercase variations in the titles would not cause issues, first lowercasing the title column with the `lower` method and assigning the result to a variable called `title`. \n", "\n", "I then used conditional statements to find the rows whose titles start with each identified string, using the `append` method to add each matching row to the appropriate list. \n", "\n", "In the next two cells, I counted and printed my newly created lists to ensure all went well. 
" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "ask_posts = []\n", "show_posts = []\n", "other_posts = []\n", "\n", "for post in hn:\n", " title = post[1].lower()\n", " if title.startswith('ask hn'):\n", " ask_posts.append(post)\n", " if title.startswith('show hn'):\n", " show_posts.append(post)\n", " else: \n", " other_posts.append(post)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1744\n", "1162\n", "18938\n" ] } ], "source": [ "print(len(ask_posts))\n", "print(len(show_posts))\n", "print(len(other_posts))" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[['12296411',\n", " 'Ask HN: How to improve my personal website?',\n", " '',\n", " '2',\n", " '6',\n", " 'ahmedbaracat',\n", " '8/16/2016 9:55'],\n", " ['10610020',\n", " 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?',\n", " '',\n", " '28',\n", " '29',\n", " 'tkfx',\n", " '11/22/2015 13:43'],\n", " ['11610310',\n", " 'Ask HN: Aby recent changes to CSS that broke mobile?',\n", " '',\n", " '1',\n", " '1',\n", " 'polskibus',\n", " '5/2/2016 10:14'],\n", " ['12210105',\n", " 'Ask HN: Looking for Employee #3 How do I do it?',\n", " '',\n", " '1',\n", " '3',\n", " 'sph130',\n", " '8/2/2016 14:20'],\n", " ['10394168',\n", " 'Ask HN: Someone offered to buy my browser extension from me. 
What now?',\n", " '',\n", " '28',\n", " '17',\n", " 'roykolak',\n", " '10/15/2015 16:38']]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ask_posts[:5]" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[['10627194',\n", " 'Show HN: Wio Link ESP8266 Based Web of Things Hardware Development Platform',\n", " 'https://iot.seeed.cc',\n", " '26',\n", " '22',\n", " 'kfihihc',\n", " '11/25/2015 14:03'],\n", " ['10646440',\n", " 'Show HN: Something pointless I made',\n", " 'http://dn.ht/picklecat/',\n", " '747',\n", " '102',\n", " 'dhotson',\n", " '11/29/2015 22:46'],\n", " ['11590768',\n", " 'Show HN: Shanhu.io, a programming playground powered by e8vm',\n", " 'https://shanhu.io',\n", " '1',\n", " '1',\n", " 'h8liu',\n", " '4/28/2016 18:05'],\n", " ['12178806',\n", " 'Show HN: Webscope Easy way for web developers to communicate with Clients',\n", " 'http://webscopeapp.com',\n", " '3',\n", " '3',\n", " 'fastbrick',\n", " '7/28/2016 7:11'],\n", " ['10872799',\n", " 'Show HN: GeoScreenshot Easily test Geo-IP based web pages',\n", " 'https://www.geoscreenshot.com/',\n", " '1',\n", " '9',\n", " 'kpsychwave',\n", " '1/9/2016 20:45']]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "show_posts[:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Calculating the Average Number of Comments for Ask HN and Show HN Posts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this section the aim was to compare the average number of comments for the Ask HN and Show HN posts. \n", "\n", "The following tasks were completed in the cells below:\n", "* Used the `print` function to display the headers to find the right index\n", "* For each list (`ask_posts` and `show_posts`) I used a for loop to iterate over each, converting the `num_comments` column into an integer using the `int` function. 
Then added each comment count to a pre-made variable, either `total_ask_comments` or `total_show_comments`\n", "* Then computed the average for each and assigned it to a variable, either `avg_ask_comments` or `avg_show_comments`\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']\n" ] } ], "source": [ "print(headers)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The average number of comments for Ask Posts: 14.038417431192661\n", "The average number of comments for Show Posts: 10.31669535283993\n" ] } ], "source": [ "total_ask_comments = 0\n", "\n", "for a in ask_posts:\n", " num = int(a[4])\n", " total_ask_comments += num\n", " \n", "avg_ask_comments = total_ask_comments / len(ask_posts)\n", "\n", "print(\"The average number of comments for Ask Posts: \", avg_ask_comments)\n", "\n", "total_show_comments = 0\n", "\n", "for s in show_posts:\n", " num = int(s[4])\n", " total_show_comments += num\n", " \n", "avg_show_comments = total_show_comments / len(show_posts)\n", "\n", "print(\"The average number of comments for Show Posts: \", avg_show_comments)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From the analysis of the average comments for each list, it was found that Ask posts receive more comments on average than Show posts. \n", "\n", "This could be due to the intended outcome of an Ask post: if you create one, you are expecting someone to comment, i.e. answer your question. Show posts, by contrast, pose no question to answer; viewers simply look at the post. A viewer may wish to comment, but doing so is less natural than responding to a direct question. 
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Finding the Amount of Ask Posts and Comments by Hour Created\n", "\n", "In this section, I made two dictionaries: `counts_by_hour`; and `comments_by_hour`. \n", "\n", "* `counts_by_hour`: contains the number of ask posts created during each hour of the day.\n", "* `comments_by_hour`: contains the corresponding number of comments ask posts created at each hour received.\n", "\n", "A summary of the cells below:\n", "\n", "* imported the `datetime` module as `dt`\n", "* created an empty list `result_list` to store two elements from the columns: `created_at`; and `num_comments`\n", "* I iterated over `ask_posts` and appended the two elements to `result_list`\n", "* Then I created two empty dictionaries named: `counts_by_hour`; and `comments_by_hour`\n", "* Then created a for loop to iterate over the `result_list`\n", "* Before extracting and adding to the dictionaries. I parsed the date and created a datetime object using the `datetime.strptime()` method\n", "* I only wanted the hour section of the datetime object, so I used the `datetime.strftime()` method \n", "* Finally used conditional statements to compute the data, so it can be calculated and added to the create dictionary" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']\n" ] } ], "source": [ "print(headers)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}\n", "\n", "{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 
687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}\n" ] } ], "source": [ "import datetime as dt\n", "\n", "result_list = []\n", "\n", "for row in ask_posts:\n", " result_list.append([row[6], int(row[4])])\n", "\n", "counts_by_hour = {}\n", "comments_by_hour = {}\n", "\n", "for row in result_list:\n", " comment_num = row[1]\n", " created = row[0]\n", " created_dt = dt.datetime.strptime(created, '%m/%d/%Y %H:%M')\n", " created_hour = created_dt.strftime('%H')\n", " \n", " if created_hour in counts_by_hour:\n", " counts_by_hour[created_hour] += 1\n", " comments_by_hour[created_hour] += comment_num\n", " else:\n", " counts_by_hour[created_hour] = 1\n", " comments_by_hour[created_hour] = comment_num \n", " \n", "print(counts_by_hour)\n", "print(\"\")\n", "print(comments_by_hour)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Calculating the Average Number of Comments for Ask HN by Hour\n", "\n", "Next, I used the two dictionaries created above to calculate the average number of comments for posts created during each hour of the day. 
\n", "\n", "This was done by:\n", "\n", "* Creating an empty list `avg_per_hour`\n", "* Iterated over the keys of `comments_by_hour` \n", "* Then computed the average number of comments and rounding the answer to a 2 decimal place using the `round` function\n", "* Finally appending two elements the hour and the `average`" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[['09', 5.58],\n", " ['13', 14.74],\n", " ['10', 13.44],\n", " ['14', 13.23],\n", " ['16', 16.8],\n", " ['23', 7.99],\n", " ['12', 9.41],\n", " ['17', 11.46],\n", " ['15', 38.59],\n", " ['21', 16.01],\n", " ['20', 21.52],\n", " ['02', 23.81],\n", " ['18', 13.2],\n", " ['03', 7.8],\n", " ['05', 10.09],\n", " ['19', 10.8],\n", " ['01', 11.38],\n", " ['22', 6.75],\n", " ['08', 10.25],\n", " ['04', 7.17],\n", " ['00', 8.13],\n", " ['06', 9.02],\n", " ['07', 7.85],\n", " ['11', 11.05]]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "avg_per_hour = []\n", "\n", "for hour in comments_by_hour:\n", " average = round(comments_by_hour[hour] / counts_by_hour[hour], 2) # decided it was best to round the average to two decimal places\n", " avg_per_hour.append([hour, average])\n", " \n", "avg_per_hour" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sorting and Printing Values from List of Lists" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "[[5.58, '09'],\n", " [14.74, '13'],\n", " [13.44, '10'],\n", " [13.23, '14'],\n", " [16.8, '16'],\n", " [7.99, '23'],\n", " [9.41, '12'],\n", " [11.46, '17'],\n", " [38.59, '15'],\n", " [16.01, '21'],\n", " [21.52, '20'],\n", " [23.81, '02'],\n", " [13.2, '18'],\n", " [7.8, '03'],\n", " [10.09, '05'],\n", " [10.8, '19'],\n", " [11.38, '01'],\n", " [6.75, '22'],\n", " [10.25, '08'],\n", " [7.17, '04'],\n", " [8.13, '00'],\n", " [9.02, '06'],\n", " [7.85, '07'],\n", " [11.05, 
'11']]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "swap_avg_per_hour = []\n", "\n", "for row in avg_per_hour:\n", " hour = row[0]\n", " avg = row[1]\n", " swap_avg_per_hour.append([avg, hour])\n", " \n", "swap_avg_per_hour" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Top 5 Hours for Ask Posts Comments\n", " 12:00 PST, 14:00 CST, 15:00 EST: 38.59 average comments per post\n", " 23:00 PST, 01:00 CST, 02:00 EST: 23.81 average comments per post\n", " 17:00 PST, 19:00 CST, 20:00 EST: 21.52 average comments per post\n", " 13:00 PST, 15:00 CST, 16:00 EST: 16.80 average comments per post\n", " 18:00 PST, 20:00 CST, 21:00 EST: 16.01 average comments per post\n" ] } ], "source": [ "sorted_swap = sorted(swap_avg_per_hour, reverse=True)\n", "\n", "print(\"Top 5 Hours for Ask Posts Comments\")\n", "\n", "for row in sorted_swap[:5]:\n", " hour_dt = dt.datetime.strptime(row[1], '%H')\n", " hour_str = hour_dt.strftime('%H:%M')\n", " \n", " pt_hour_dt = dt.datetime.strptime(row[1], '%H') - dt.timedelta(hours=3)\n", " pt_hour_str = pt_hour_dt.strftime('%H:%M')\n", " \n", " ct_hour_dt = dt.datetime.strptime(row[1], '%H') - dt.timedelta(hours=1)\n", " ct_hour_str = ct_hour_dt.strftime('%H:%M')\n", " \n", " print(' ', '{pst_time} PST, {cst_time} CST, {est_time} EST: {avg:.2f} average comments per post'.format(pst_time=pt_hour_str, cst_time=ct_hour_str, est_time=hour_str, avg=row[0]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The results showed that posts created between 3 PM and 4 PM EST had the highest average number of comments per post. It was unclear to me why this was. \n", "\n", "I therefore decided to compare the most populous time zones in the USA (Pacific, Central, and Eastern) to see if a clearer picture emerged. 
The highest averages of comments were found in the middle of the day, possibly when most users would be active. This would explain why these times across the USA are much higher than the other results. In addition, it is worth mentioning that Hacker News was started by Y Combinator, which is located in the Pacific Time zone. \n", "\n", "It would be interesting to see which time zones the most common posts come from, to check whether that matches the results above.\n", "\n", "From the results above, if your intention is to create a post that attracts the highest possible number of comments, posting at 3 PM EST would be recommended." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Top 5 Hours for Ask Posts Comments - European Timezone Comparison\n", " 15:00 EST : 22:00 CEST: 38.59 average comments per post\n", " 02:00 EST : 09:00 CEST: 23.81 average comments per post\n", " 20:00 EST : 03:00 CEST: 21.52 average comments per post\n", " 16:00 EST : 23:00 CEST: 16.80 average comments per post\n", " 21:00 EST : 04:00 CEST: 16.01 average comments per post\n" ] } ], "source": [ "print(\"Top 5 Hours for Ask Posts Comments - European Timezone Comparison\")\n", "\n", "for row in sorted_swap[:5]:\n", " est_hour_dt = dt.datetime.strptime(row[1], '%H')\n", " est_hour_str = est_hour_dt.strftime('%H:%M')\n", " \n", " # Central European Summer Time (CEST, UTC+2) is 7 hours ahead of EST (UTC-5)\n", " cest_hour_dt = dt.datetime.strptime(row[1], '%H') + dt.timedelta(hours=7)\n", " cest_hour_str = cest_hour_dt.strftime('%H:%M')\n", " \n", " print(' ', '{est_time} EST : {cest_time} CEST: {avg:.2f} average comments per post'.format(est_time=est_hour_str, cest_time=cest_hour_str, avg=row[0]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The above results are a comparison between the US Eastern and Central European time zones. 
\n", "\n", "From analysing the results, perhaps anothe reason why 3 PM EST is has a higher amount of comments on average is due to the fact that Europe is still active. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion\n", "\n", "It can be concluded that the best time to post with the intention of gaining the most amount of comments for your post is between the hours of 3 PM - 4 PM EST.\n", "\n", "This could be due to the fact that it is during a time when two large populations (North America and Europe) are most acitve. \n", "\n", "Future add on for this project would be to compare this data collected with the following: Number of Users per country/ state, Where the highest amount of Posts come from i.e. location. This could provide further details on when it is best to post with the possibilties of other findings regarding the use general use of Hacker News for creating engagement. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.2" } }, "nbformat": 4, "nbformat_minor": 4 }