{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Scoring Company Headline Sentiment\n", "\n", "This is a quick example of scraping headlines and then using VADER to assign a sentiment score. These scores are fitted into a time series, which can then be combined with a time series of closing prices.\n", "\n", "Getting headlines from single sources such as Bloomberg or Reuters is usually a lot easier, especially when they have APIs. But Yahoo Finance headlines are semi-curated from multiple sources and may offer a better cross-section of news articles, so it could be worth the trouble." ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "import urllib\n", "import re\n", "from time import sleep\n", "import pandas as pd\n", "import numpy as np\n", "import datetime\n", "import json\n", "import sys\n", "import time\n", "import requests\n", "from selenium import webdriver\n", "from selenium.webdriver.common.keys import Keys\n", "import os\n", "%matplotlib inline\n", "today = datetime.datetime.now()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This code combines multiple json files. It could be useful to keep separate files from each scrape in case the combined file is corrupted for whatever reason." ] }, { "cell_type": "code", "execution_count": 146, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# The financials argument is for combining financial statement jsons, not shown here\n", "\n", "def merge_json_headlines (input_files = (), *args, output_file = 'data/all_headlines.json'):\n", " head_list = []\n", " for i in range(len(input_files)):\n", " with open (input_files[i]) as current_file:\n", " head_list.append(json.load(current_file))\n", " \n", " collector = head_list[0]\n", " for i in range(1, len(input_files)):\n", " for k_time,v_time in head_list[i].items():\n", " if k_time in collector:\n", " for k_head, v_head in v_time.items():\n", " collector[k_time][k_head] = head_list[i][k_time][k_head]\n", " else:\n", " collector[k_time] = head_list[i][k_time]\n", "\n", " with open('data/' + output_file, 'w') as outfile:\n", " json.dump(collector, outfile)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This was a fun scraping exercise. The headline api calls are not only not paginated, but coded. So we have to log network traffic to get the specific api calls and then access them directly to get the json headline data. This is not, strictly speaking, necessary, since we can simply scrape the list items added to the page, but there is more precision in the dates from the json.\n", "\n", "The Selenium scrolling is also not strictly necessary, but it reduces the need for contant scraping for fear that a flurry of headlines will cause one to miss a few." ] }, { "cell_type": "code", "execution_count": 200, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# We might want to pay special attention to some companies that have more headlines.\n", "# Make sure we get everything by scrolling down on the news page and finding the json injections directly.\n", "# That is, until I can decipher the API codes. Then there'll be no need for any of this.\n", "\n", "import re\n", "import pyautogui\n", "\n", "# The special list of companies we pay extra attention to. 
{ "cell_type": "code", "execution_count": 200, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# We might want to pay special attention to some companies that have more headlines.\n", "# Make sure we get everything by scrolling down on the news page and finding the json injections directly.\n", "# That is, until I can decipher the API codes. Then there'll be no need for any of this.\n", "\n", "import re\n", "import pyautogui\n", "\n", "# The special list of companies we pay extra attention to. While it would be possible to do this for every\n", "# company, it would be a waste of time.\n", "\n", "def get_more_headlines():\n", "    # It makes more sense to have a limited list since most companies don't have nearly as many stories about them\n", "    # as high profile companies like Apple or Facebook.\n", "    # with open('companies.json') as company_data:\n", "    #     data = json.load(company_data)\n", "    # tickers = [key.replace('.','-') for key,value in data.items()]\n", "    \n", "    tickers = ['AAPL','AMZN','FB','GOOG','NFLX']\n", "\n", "    # Each page usually requires about 38-40 pagedowns to get everything. 50 would be very safe.\n", "    # 10 should probably be enough as long as this is run regularly.\n", "    pagedowns = 50\n", "    \n", "    # Disable images to cut down on load time (comment the prefs option out below to keep them).\n", "    \n", "    prefs = {\"profile.managed_default_content_settings.images\": 2}\n", "    \n", "    extra_headlines = {}\n", "\n", "    # First we trigger the json injections into the page while logging network traffic through Chrome.\n", "    # The Chrome options are rebuilt for every ticker because --log-net-log needs a different file each\n", "    # time and add_argument only ever appends.\n", "    \n", "    for ticker in tickers:\n", "        \n", "        chromeOptions = webdriver.ChromeOptions()\n", "        chromeOptions.add_experimental_option(\"prefs\", prefs)\n", "        chromeOptions.add_argument('--ignore-certificate-errors')\n", "        \n", "        # For whatever reason, the log file doesn't complete in headless mode, so we use the \"next best\"\n", "        # option of moving the window out of the way quickly.\n", "        \n", "        #chromeOptions.add_argument(\"--headless\")\n", "        \n", "        # Log files stored in d:\\jsontemp\\\n", "        \n", "        chromeOptions.add_argument('--log-net-log=d:\\\\\\\\jsontemp\\\\\\\\' + ticker + '.json')\n", "        driver = webdriver.Chrome(chrome_options=chromeOptions)\n", "        \n", "        # Shift the window waaaaay to the left, out of sight\n", "        \n", "        driver.set_window_position(-3000, 0)\n", "        driver.set_script_timeout(15)\n", "\n", "        url = 'https://finance.yahoo.com/quote/' + ticker + '/news?p=' + ticker\n", "        driver.get(url)\n", "        \n", "        sleep(0.5)\n", "\n", "        elem = driver.find_element_by_tag_name(\"body\")\n", "\n", "        for i in range(pagedowns):\n", "            elem.send_keys(Keys.PAGE_DOWN)\n", "            \n", "            # On a slow connection, this time should be increased to allow for things to load properly\n", "            \n", "            time.sleep(0.1)\n", "\n", "        # Closing the tab rather than the entire browser first might increase the likelihood of the log saving properly.\n", "        \n", "        driver.find_element_by_tag_name('body').send_keys(Keys.CONTROL + 'w')\n", "        driver.close()\n", "        \n", "        # A short pause to give the net log a chance to finish writing before the next ticker starts.\n", "        \n", "        sleep(1)\n", "\n", "    # With the net logs in hand, get the relevant json urls from them. These files are rather large.\n", 
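"    # The net log Chrome writes is one big json object with an 'events' list. Somewhere in the nested\n", "    # dicts of those events are the request urls, and the ones containing '/content;' are the encoded\n", "    # headline API calls that we can simply request again directly.\n", 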
"    \n", "    for ticker in tickers:\n", "        with open('d:\\\\\\\\jsontemp\\\\\\\\' + ticker + '.json', 'r') as infile:\n", "            data = json.load(infile)\n", "        \n", "        # Get the relevant json urls which contain \"/content;\", but ignore repeats\n", "        \n", "        json_list = []\n", "        for i, e in enumerate(data['events']):\n", "            for k1, v1 in e.items():\n", "                if type(v1) == dict:\n", "                    for k2, v2 in v1.items():\n", "                        if type(v2) == str:\n", "                            if '/content;' in v2:\n", "                                if v2 not in json_list:\n", "                                    json_list.append(v2)\n", "\n", "        # Request each captured url and collect the headline items. The list is reset for every ticker\n", "        # so headlines don't leak from one ticker into the next.\n", "        news_array = []\n", "        for js in json_list:\n", "            json_news = requests.get(js).json()['data']['items']\n", "            news_array.extend(json_news)\n", "\n", "        for d in news_array:\n", "            if type(d) == dict:\n", "                d['pubtime'] = int(time.mktime(datetime.datetime.strptime(d['publishDateStr'], '%B %d, %Y').timetuple())) * 1000\n", "                if d['provider'] is not None:\n", "                    d['publisher'] = d['provider']['name']\n", "                else:\n", "                    d['publisher'] = 'N/A'\n", "        news_array = process_headlines(news_array)\n", "        extra_headlines[ticker] = news_array\n", "    \n", "    with open('data/extraheadlines_' + datetime.datetime.now().strftime('%Y-%m-%d') + '.json', 'w') as outfile:\n", "        json.dump(extra_headlines, outfile)\n", "\n", "\n", "def process_headlines(heads):\n", "    # We need to account for editor's picks articles which have a different format\n", "    reorg = {}\n", "    for head in heads:\n", "        # Drop fields we don't need; not every story has every field, so a default is supplied\n", "        for e in ['clusterInfo','editorsPick','format','storyline','i13n','id',\n", "                  'idPrefix','images','is_eligible','link','off_network','type']:\n", "            head.pop(e, None)\n", "        \n", "        head['pubtime'] = datetime.datetime.fromtimestamp(int(head['pubtime'])//1000).strftime('%Y-%m-%d')\n", "        if 'summary' not in head:\n", "            head['summary'] = '(NO SUMMARY)'\n", "        \n", "        if head['pubtime'] in reorg:\n", "            reorg[head['pubtime']][head['title']] = {'summary': head['summary'],\n", "                                                     'publisher': head['publisher'],\n", "                                                     'url': head['url']}\n", "        else:\n", "            reorg[head['pubtime']] = {head['title']: {'summary': head['summary'],\n", "                                                      'publisher': head['publisher'],\n", "                                                      'url': head['url']}}\n", "    \n", "    # Store the number of headlines each day, as a large number of headlines MIGHT have a higher influence\n", "    # on price movement in either direction.\n", "    for day in reorg:\n", "        daily_heads = len(reorg[day])\n", "        reorg[day]['daily_headlines'] = daily_heads\n", "    \n", "    return reorg" ] }, { "cell_type": "code", "execution_count": 201, "metadata": { "scrolled": false }, "outputs": [], "source": [ "get_more_headlines()" ] }, { "cell_type": "code", "execution_count": 202, "metadata": { "collapsed": true }, "outputs": [], "source": [ "files = [os.path.join('data', f) for f in sorted(os.listdir('data')) if str(f).endswith('json')]" ] }, { "cell_type": "code", "execution_count": 204, "metadata": {}, "outputs": [], "source": [ "merge_json_headlines(input_files=files, output_file='all_headlines.json')" ] }, { "cell_type": "code", "execution_count": 205, "metadata": { "collapsed": true }, "outputs": [], "source": [ "with open('data/all_headlines.json','r') as infile:\n", "    data = json.load(infile)" ] }, { "cell_type": "code", "execution_count": 206, "metadata": { "collapsed": true }, "outputs": [], "source": [ "fb = data['FB']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Text to sentiment score\n", "\n", "SentimentIntensityAnalyzer in VADER assigns compound, positive, neutral, and negative scores to input text. As the name suggests, the compound score is an overall score $\\in (-1,1)$, where lower scores are more negative and higher scores are more positive.\n"
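, "\n", "As a quick illustration of the raw output, using the `SIA` alias imported in the next cell on a made-up headline (the exact numbers depend on the VADER lexicon, so they are left out here):\n", "\n", "```python\n", "SIA().polarity_scores('Company posts great earnings and happy customers')\n", "# returns a dict with 'neg', 'neu', 'pos' and 'compound' keys; wording like this should score clearly positive\n", "```"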
 ] }, { "cell_type": "code", "execution_count": 207, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import nltk\n", "from nltk.sentiment.vader import SentimentIntensityAnalyzer as SIA\n", "from nltk.corpus import stopwords\n", "from nltk.tokenize import RegexpTokenizer, word_tokenize, sent_tokenize" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll try two methods: headlines only and headlines with summaries. Summaries usually give significantly more relevant information than headlines alone. However, studies have shown that most people only read headlines, so the results might shock you! Data scientists hate them! Actually, because Yahoo displays the summary together with the headline, it is difficult to avoid the summaries.\n", "\n", "The daily_headlines dictionary key that gives the number of daily headlines is not used in this example. Among other things, it could be used to try to predict the impact of sentiment on a percentage change in price. A plethora of positive or negative coverage could signal major movements." ] }, { "cell_type": "code", "execution_count": 208, "metadata": {}, "outputs": [], "source": [ "def get_hl_summary(ticker):\n", "    # ticker here is the per-ticker dict of {date: {headline: {...}}}\n", "    text_dict = {}\n", "    for k, v in ticker.items():\n", "        text = []\n", "        for hl in v:\n", "            if hl != 'daily_headlines':\n", "                text.append('. '.join([hl, v[hl]['summary']]))\n", "        text_dict[k] = text\n", "    return text_dict\n", "\n", "def get_hl(ticker):\n", "    text_dict = {}\n", "    for k, v in ticker.items():\n", "        text = []\n", "        for hl in v:\n", "            if hl != 'daily_headlines':\n", "                text.append(hl)\n", "        text_dict[k] = text\n", "    return text_dict" ] }, { "cell_type": "code", "execution_count": 209, "metadata": {}, "outputs": [], "source": [ "# Gets the mean compound score of the input lines.\n", "\n", "def get_score(lines):\n", "    sid = SIA()\n", "    score_list = []\n", "    for line in lines:\n", "        scores = sid.polarity_scores(line)\n", "        current_scores = []\n", "        \n", "        # only using the compound score for now, but the other scores could be useful\n", "        \n", "        for s in sorted(scores):\n", "            current_scores.append(scores[s])\n", "        score_list.append(current_scores)\n", "    score_list = pd.DataFrame(score_list)\n", "    \n", "    # sorted keys are ['compound', 'neg', 'neu', 'pos'], so column 0 holds the compound score\n", "    mean_score = np.mean(score_list, axis=0)[0]\n", "    return mean_score" ] }, { "cell_type": "code", "execution_count": 210, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def get_daily_scores(data):\n", "    all_scores = {}\n", "    for k, v in data.items():\n", "        scores = []\n", "        for text in v:\n", "            # Split each headline (plus summary) into sentences and score them\n", "            scores.append(get_score(sent_tokenize(text)))\n", "        all_scores[k] = np.mean(scores)\n", "    return all_scores" ] }, { "cell_type": "code", "execution_count": 211, "metadata": {}, "outputs": [], "source": [ "hl_summary = get_hl_summary(fb)" ] }, { "cell_type": "code", "execution_count": 212, "metadata": {}, "outputs": [], "source": [ "all_scores = get_daily_scores(hl_summary)" ] }, { "cell_type": "code", "execution_count": 213, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'2018-01-10': 0.0,\n", " '2018-01-12': 0.49698333333333333,\n", " '2018-01-13': 0.14999166666666663,\n", " '2018-01-14': 0.41883333333333334,\n", " '2018-01-15': 0.12347166666666665,\n", " '2018-01-16': 0.091691000000000009,\n", " '2018-01-17': 0.059152642276422765,\n", " '2018-01-18': 0.1148108163265306,\n", " '2018-01-19': 0.068206269841269831}" ] }, "execution_count": 213, "metadata": {}, "output_type": "execute_result" } ], "source": [ "all_scores" ] }, 
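{ "cell_type": "markdown", "metadata": {}, "source": [ "A quick aside before the headline-only comparison below: the intro mentioned lining these scores up with closing prices. A minimal sketch of that join, assuming some `get_closing_prices` helper (not defined in this notebook) that returns a date-indexed pandas Series:\n", "\n", "```python\n", "sentiment = pd.Series(all_scores, name='sentiment')\n", "sentiment.index = pd.to_datetime(sentiment.index)\n", "\n", "closes = get_closing_prices('FB')  # hypothetical helper, date-indexed closing prices\n", "combined = pd.concat([sentiment, closes.rename('close')], axis=1)\n", "```" ] }, 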
{ "cell_type": "code", "execution_count": 214, "metadata": { "collapsed": true }, "outputs": [], "source": [ "hl_only = get_hl(fb)\n", "all_scores_hl_only = get_daily_scores(hl_only)" ] }, { "cell_type": "code", "execution_count": 215, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'2018-01-10': 0.0,\n", " '2018-01-12': 0.06133333333333333,\n", " '2018-01-13': 0.214725,\n", " '2018-01-14': 0.42149999999999999,\n", " '2018-01-15': -0.013089999999999996,\n", " '2018-01-16': 0.066098000000000004,\n", " '2018-01-17': 0.063718292682926822,\n", " '2018-01-18': 0.079093877551020403,\n", " '2018-01-19': -0.016459523809523806}" ] }, "execution_count": 215, "metadata": {}, "output_type": "execute_result" } ], "source": [ "all_scores_hl_only" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The news has been slightly positive during this time period. The big discrepancy between the summary and no-summary scores was on January 12, so let's take a look at that." ] }, { "cell_type": "code", "execution_count": 216, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{\"Cramer's game plan: JP Morgan set the benchmark. Now watc...\": {'publisher': 'CNBC Videos',\n", " 'summary': \"Jim Cramer laid out the stocks and events he'll be watching as earnings season kicks into high gear with the big banks' reports.\",\n", " 'url': 'https://finance.yahoo.com/video/cramers-game-plan-jp-morgan-233900741.html'},\n", " \"Cramer: 'This time it's different' can actually make you ...\": {'publisher': 'CNBC Videos',\n", " 'summary': 'Jim Cramer said the success of Facebook, Amazon, Netflix and Alphabet proves why \"this time it\\'s different\" can help you, not hurt you.',\n", " 'url': 'https://finance.yahoo.com/video/cramer-time-different-actually-234100884.html'},\n", " \"Make money with 'this time it's different'\": {'publisher': 'CNBC Videos',\n", " 'summary': 'Jim Cramer said the success of Facebook, Amazon, Netflix and Alphabet proves why \"this time it\\'s different\" can help you, not hurt you.',\n", " 'url': 'https://finance.yahoo.com/video/money-time-different-234900972.html'},\n", " 'daily_headlines': 3}" ] }, "execution_count": 216, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fb['2018-01-12']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All Jim Cramer stories, and two of them are repeats. Sometimes less is more.\n", "\n", "Anyway, we can then classify these scores as follows:\n", "\n", "* positive: compound score >= 0.05\n", "* neutral: compound score between -0.05 and 0.05\n", "* negative: compound score <= -0.05\n", "\n", "These can be adjusted, but they are the usual guideline thresholds. We can then do the usual things with regression and ensembles to predict the direction of the stock." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.3" } }, "nbformat": 4, "nbformat_minor": 2 }