{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction to Relative Valuation using Market Comparables\n", "\n", "**Quentin Batista** \n", "*The University of Tokyo*\n", "\n", "Relative valuation is a popular technique that relies on the market value of similar assets to price a given asset. For example, if you want to price a laundry business, you might calculate the price to earnings ratio of other similar businesses and multiply this ratio by the earnings of the laundry business to obtain a valuation. Titman and Martin describe this methodology as a 4-step process:\n", "\n", "**Step 1:** Identify similar or comparable investments and recent market prices for each.\n", "\n", "**Step 2:** Calculate a valuation metric for use in valuing the asset.\n", "\n", "**Step 3:** Calculate an initial estimate of value.\n", "\n", "**Step 4:** Refine or tailor your initial valuation estimate to the specific characteristics of the investment.\n", "\n", "Some of the most commonly used multiples are:\n", "\n", "- Price to earnings (P/E)\n", "\n", "- Market to book value of equity\n", "\n", "- Enterprise Value (EV) to EBITDA\n", "\n", "- Enterprise Value (EV) to revenue\n", "\n", "- Enterprise Value (EV) to cash flow\n", "\n", "This list is not exhaustive, and it is possible for you to create your own multiple. This is particularly popular in the technology sector, where analysts have come up with multiples such as Enterprise Value to unique visitors or website hits. In doing so, you must ensure that the components of the multiple are consistent with each other. For example, you might consider using the price to sales ratio as a valuation multiple. However, because price reflects only the value of equity while sales accrue to both debt and equity holders, an implicit assumption behind this multiple is that comparable companies have identical capital structures, which is very rarely the case in practice. When this assumption is violated, the multiple becomes inconsistent because it is impacted by the amount of debt that a company has relative to its equity.\n", "\n", "Finally, a key step in applying this methodology is to determine which multiple is appropriate for the asset you are trying to value. For example, valuing young firms and startups using the P/E ratio is likely to be inappropriate if those firms have negative or highly volatile earnings. Instead, using the EV to sales ratio would likely give a better estimate. Additionally, it is important to realize that these multiples have different characteristics. For instance, EV to EBITDA ignores the firm's CapEx, depreciation, and capital structure, while the P/E ratio takes those into account. Using these multiples concurrently allows you to see the big picture and understand what is driving the valuation of an asset." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Systematic Relative Valuation Using Machine Learning\n", "\n", "### Objective\n", "\n", "In this notebook, we systematize the methodology introduced above for companies in the S&P 500 using two different Machine Learning approaches. First, we replicate Gael Varoquaux's scikit-learn example, which extracts a graphical structure from the correlations of intraday price variations and then applies Affinity Propagation to group together stocks that behave similarly. Second, we analyze each company's latest 10K using the Doc2Vec implementation in gensim to quantify similarity. In the first case, we use the average multiple in each cluster to value each company. 
In the second case, we pick the top 3 comparable companies and transform the similarity measures into weights to compute a weighted average of the comparables' multiples. We then apply this multiple to each company's financials to obtain a valuation.\n", "\n", "The hope is that this approach captures relationships between companies that human analysts would not be able to perceive. Reading through the financial statements of all the companies in the S&P 500 and forming an opinion about how these companies relate to each other would be an extremely time-consuming process. As a result, most analysts on the Street cover a specific industry group, and therefore, they might not realize that a company they are researching is actually very similar to a company classified in a different industry group. By using Machine Learning, we can alleviate this issue and, hopefully, gain some valuable insights into the structure of the stock market.\n", "\n", "### Data\n", "\n", "We scrape the list of companies in the S&P 500, along with their tickers and CIK codes, from Wikipedia. We then use the CIK code to scrape each company's latest 10K from EDGAR. The Wikipedia list contains 505 entries because some companies trade under multiple symbols (for example, Discovery Communications Inc.). A few companies' financial statements are not available on EDGAR for various reasons -- we ignore those companies. We clean the statements by removing \"Table of Contents\" markers where they exist, page numbers, line breaks, punctuation, and numbers. We scrape company fundamentals and historical prices from Yahoo! Finance. Note that we drop companies with negative multiples to simplify valuation. While Yahoo! Finance might not be a particularly reliable source for company fundamentals, it is one of the best free sources available."
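, "\n", "\n", "As a rough numerical sketch of the comparable-multiple arithmetic described above, the cell below prices a hypothetical company from three comparables. The similarity scores, peer multiples, and earnings are made up, and the weighting scheme actually implemented later in the notebook differs slightly (it normalizes the top-3 similarities against a baseline similarity)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "# Hypothetical inputs: similarity of the target company to its top 3 comparables,\n", "# the comparables' trailing P/E ratios, and the target's diluted EPS\n", "similarities = np.array([0.82, 0.74, 0.70])\n", "peer_pe = np.array([18.0, 22.0, 15.0])\n", "target_eps = 3.50\n", "\n", "# Step 2: turn the similarity scores into weights that sum to one\n", "weights = similarities / similarities.sum()\n", "\n", "# Step 3: apply the weighted-average comparable P/E to the target's earnings\n", "weighted_pe = np.dot(weights, peer_pe)\n", "estimated_price = weighted_pe * target_eps\n", "\n", "print('Weighted comparable P/E: %.2f' % weighted_pe)\n", "print('Estimated share price: %.2f' % estimated_price)"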
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Required Packages" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install numpy\n", "!pip install pandas\n", "!pip install beautifulsoup4\n", "!pip install gensim\n", "!pip install nltk\n", "!pip install sklearn\n", "!pip install bokeh\n", "!pip install pandas_datareader\n", "!pip install datetime" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "from bs4 import BeautifulSoup\n", "from urllib.request import urlopen\n", "from collections import namedtuple\n", "from gensim.models import doc2vec\n", "from gensim.similarities import docsim\n", "from gensim import corpora\n", "import itertools\n", "from string import punctuation, digits\n", "from nltk import word_tokenize\n", "from nltk.corpus import stopwords\n", "from sklearn.manifold import TSNE\n", "from bokeh.plotting import figure, show, output_notebook\n", "from bokeh.models import ColumnDataSource, LabelSet, Legend, HoverTool\n", "from bokeh.palettes import all_palettes\n", "from pandas_datareader.data import DataReader\n", "from datetime import date" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Scraping" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Scraping 10Ks from EDGAR" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "from bs4 import BeautifulSoup\n", "from urllib.request import urlopen\n", "\n", "\n", "def table_extractor(soup):\n", " \"\"\"Extract the tables from a soup object\"\"\"\n", " for table in soup.find_all(\"table\"):\n", " table.extract()\n", " return soup\n", "\n", "\n", "sp_500_wiki_link = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'\n", "soup_wiki = BeautifulSoup(urlopen(sp_500_wiki_link), 'html.parser')\n", "table_wiki = soup_wiki.find(\"table\", {\"class\": \"wikitable sortable\"})\n", "\n", "# Fail now if the right table hasn't been found\n", "header = table_wiki.findAll('th')\n", "if header[0].string != \"Ticker symbol\" or header[1].string != \"Security\":\n", " raise Exception(\"Can't parse wikipedia's table!\")\n", "\n", "# Retreive the values in the table\n", "records = []\n", "rows = table_wiki.findAll('tr')\n", "for row in rows:\n", " fields = row.findAll('td')\n", " if fields:\n", " # Get info and SEC company link\n", " symbol = fields[0].string\n", " wiki_link = \"https://en.wikipedia.org/wiki\" + fields[1].a.get('href')\n", " CIK = fields[7].string\n", " sec_company_link = \"https://www.sec.gov/cgi-bin/browse-edgar?\" + \\\n", " \"action=getcompany&CIK=\" + CIK + \\\n", " \"&type=10-K&dateb=&owner=include&count=40\"\n", " name = fields[1].a.string\n", " sector = fields[3].string\n", "\n", " # Get link for the page with latest 10-K related filings\n", " soup_comp = BeautifulSoup(urlopen(sec_company_link), 'html.parser')\n", " table_comp = soup_comp.find(\"table\", {\"class\": \"tableFile2\"})\n", " try:\n", " # Get the latest filing page\n", " filings_link = \"https://www.sec.gov\" + \\\n", " table_comp.a.get('href')\n", "\n", " # Get the link for the latest 10K\n", " soup_filings = BeautifulSoup(urlopen(filings_link),\n", " 'html.parser')\n", " table_filings = soup_filings.find(\"table\", {\"class\": \"tableFile\"})\n", " _10K_link = \"https://www.sec.gov\" + table_filings.a.get('href')\n", "\n", " # 
Extracting the text from the latest 10K\n", " try:\n", " soup_latest_10K = BeautifulSoup(urlopen(_10K_link).read(),\n", " 'html.parser')\n", " soup_latest_10K = table_extractor(soup_latest_10K)\n", " _latest_10K_txt = soup_latest_10K.get_text()\n", "\n", " except:\n", " # If the latest 10K is not available, return N/A\n", " _latest_10K_txt = np.nan\n", "\n", " except:\n", " # If the filings are not available, return N/A\n", " _10K_link = np.nan\n", " _latest_10K_txt = np.nan\n", "\n", " # Append results\n", " records.append([symbol, wiki_link, name, sector, sec_company_link,\n", " CIK, _latest_10K_txt])\n", "\n", "headers = ['Symbol', 'Wikipedia Link', 'Name', 'Sector', 'SEC Filings Link',\n", " 'CIK', 'Latest 10K']\n", "data = pd.DataFrame(records, columns=headers)\n", "\n", "# Correct ambiguous tickers for later purpose\n", "ambiguous_tickers = ['BRK.B', 'BF.B']\n", "corrected_tickers = ['BRK-B', 'BF-B']\n", "\n", "for i, ticker in enumerate(ambiguous_tickers):\n", " data['Symbol'] = data['Symbol'].replace(ticker, corrected_tickers[i])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Scraping Fundamentals from Yahoo! Finance" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "def unit_converter(data):\n", " billion = 1_000_000_000\n", " million = 1_000_000\n", " if data[-1] == 'B':\n", " return float(data[:-1])*billion\n", " elif data[-1] == 'M':\n", " return float(data[:-1])*million\n", " else:\n", " return float(data)\n", "\n", "\n", "items = ['Enterprise Value', 'Enterprise Value/Revenue', 'Diluted EPS',\n", " 'Enterprise Value/EBITDA', 'Revenue', 'EBITDA', 'Trailing P/E']\n", "\n", "for i, ticker in enumerate(data['Symbol']):\n", " key_stats_link = 'https://finance.yahoo.com/quote/' + ticker + \\\n", " '/key-statistics?p=' + ticker\n", " key_stats_soup = BeautifulSoup(urlopen(key_stats_link).read(),\n", " 'html.parser').findAll('td')\n", " for j, row in enumerate(key_stats_soup):\n", " for item in items:\n", " try:\n", " if item == row.span.string:\n", " # Dangerous assumption\n", " data.loc[i, item] = \\\n", " unit_converter(key_stats_soup[j+1].string)\n", " except:\n", " next" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Disclaimer: This scraper makes an important assumption regarding the structure of the html file. Specifically, it assumes that the data for each item of interest is the string of text located in the 'td' tag following the one where the name of the item was found." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Scraping Historical Prices from Yahoo! 
Finance" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from pandas_datareader.data import DataReader\n", "from datetime import date\n", "\n", "start = date(2013, 1, 2)\n", "end = date.today()\n", "\n", "data_source = 'yahoo'\n", "\n", "# Sometimes fail -- retry if it does\n", "historical_prices_panel = DataReader(data['Symbol'], data_source, start, end,\n", " retry_count=5)\n", "\n", "# Current price is last close\n", "last_price = \\\n", "historical_prices_panel['Close'].tail(1).T.iloc[:, 0].rename('Current Price')\n", "data = data.join(last_price, on='Symbol')\n", "\n", "close_price = historical_prices_panel['Close']\n", "open_price = historical_prices_panel['Open']" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Uncomment to save the data\n", "# data.to_csv('10K_data.csv')\n", "# historical_prices_panel['Close'].to_csv('close_price_data.csv')\n", "# historical_prices_panel['Open'].to_csv('open_price_data.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Cleaning and Preprocessing" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Uncomment to read the data\n", "# data = pd.read_csv('10K_data.csv')\n", "# close_price = pd.read_csv('close_price_data.csv')\n", "# open_price = pd.read_csv('open_price_data.csv')" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Remove companies without filings\n", "no_filings_data = data[data['Latest 10K'].isnull()]\n", "data = data[~data['Latest 10K'].isnull()]\n", "\n", "# Remove duplicates (keep first)\n", "data = data.drop_duplicates(keep='first')\n", "\n", "# Drop Google duplicate\n", "data = data[data['Symbol'] != 'GOOG']\n", "\n", "# Drop NA rows (about 60 companies)\n", "data = data.dropna()\n", "\n", "# Drop companies with negative EPS or EBITDA (about 30 companies)\n", "data = data[(data[['EBITDA', 'Diluted EPS']] > 0).all(1)]\n", "\n", "# Reset index\n", "data = data.reset_index(drop=True)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from collections import namedtuple\n", "from nltk import word_tokenize\n", "from nltk.corpus import stopwords\n", "from string import punctuation, digits\n", "\n", "\n", "def _10K_string_cleaner(_10K):\n", " # Lowercase the text\n", " _10K = _10K.lower()\n", " stopchar = punctuation + digits + '’“”'\n", " for ch in stopchar:\n", " # Replace stopchar by whitespace\n", " _10K = _10K.replace(ch, ' ')\n", " # Tokenize\n", " _10K = word_tokenize(_10K)\n", " # Remove stopwords\n", " _10K = [word for word in _10K if word not in stopwords.words('english')]\n", " return _10K\n", "\n", "\n", "corpus = []\n", "analyzedDocument = namedtuple('AnalyzedDocument', 'words tags')\n", "for i, text in enumerate(data['Latest 10K']):\n", " corpus.append(analyzedDocument(_10K_string_cleaner(text), [i]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Doc2Vec Model\n", "\n", "Doc2Vec is an implementation of the Paragraph Vector unsupervised learning algorithm proposed by Quoc Le and Tomas Mikolov for learning fixed-length feature representations from variable-length pieces of texts. It aims to overcome some of the weaknesses of the more traditional bag-of-words (BOW) methodology." 
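, "\n", "\n", "Unlike bag-of-words counts, which discard word order, the paragraph vector is trained jointly with word vectors to predict words in the document, so documents of very different lengths end up with comparable fixed-length representations. As a toy illustration, the cell below trains Doc2Vec on three made-up sentences and queries a pairwise similarity by tag; the corpus, dimensions, and parameters are purely illustrative and are not the settings used for the 10Ks below (newer gensim releases rename `size` to `vector_size`)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from gensim.models import doc2vec\n", "\n", "# Three made-up 'documents', each tagged with an integer id\n", "toy_corpus = [\n", "    doc2vec.TaggedDocument(words=['revenue', 'grew', 'in', 'our', 'gaming', 'segment'], tags=[0]),\n", "    doc2vec.TaggedDocument(words=['subscriber', 'growth', 'drove', 'revenue', 'higher'], tags=[1]),\n", "    doc2vec.TaggedDocument(words=['lower', 'oil', 'prices', 'hurt', 'drilling', 'margins'], tags=[2])\n", "]\n", "\n", "# Train a tiny model; every document is mapped to a vector of length `size`\n", "toy_model = doc2vec.Doc2Vec(toy_corpus, size=20, window=2, min_count=1)\n", "\n", "print(toy_model.docvecs[0].shape)          # fixed-length representation, here (20,)\n", "print(toy_model.docvecs.similarity(0, 1))  # cosine similarity between documents 0 and 1"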
] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from gensim.models import doc2vec\n", "from gensim.similarities import docsim\n", "\n", "model_NLP = doc2vec.Doc2Vec(corpus, size=500, window=300, min_count=1,\n", " workers=4)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": true, "scrolled": true }, "outputs": [], "source": [ "import itertools\n", "\n", "similarity_matrix_NLP = np.empty(0)\n", "length_docvecs = len(model_NLP.docvecs)\n", "for item in itertools.product(range(length_docvecs), range(length_docvecs)):\n", " similarity_matrix_NLP = np.append(similarity_matrix_NLP,\n", " model_NLP.docvecs.similarity(*item))\n", "\n", "similarity_matrix_NLP = similarity_matrix_NLP.reshape((length_docvecs,\n", " length_docvecs))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Visualization\n", "\n", "### Stock Market Structure\n", "\n", "We use three different methodologies for visualizing the structure of the stock market. First, we create a simple scatter plot of the companies' EV/EBITDA against their P/E ratio. Second, we use the correlation matrix of the S&P 500 stocks between the start of 2013 and the last price. Third, we use the similarity matrix outputted from the Doc2Vec model. In the last two cases, t-SNE is used for dimensionality reduction." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
Loading BokehJS ... (BokehJS notebook-loading output omitted)
\n", "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from bokeh.plotting import figure, show, output_notebook\n", "from bokeh.models import ColumnDataSource, LabelSet, Legend, HoverTool\n", "from bokeh.palettes import all_palettes\n", "\n", "output_notebook()\n", "\n", "category_items = data['Sector'].unique()\n", "palette = all_palettes['Viridis'][len(category_items)]\n", "colormap = dict(zip(category_items, palette))\n", "data['Color'] = data['Sector'].map(colormap)\n", "\n", "TOOLS = \"crosshair,pan,wheel_zoom,reset,tap,save,box_select\"\n", "\n", "source = ColumnDataSource(dict(x=data['Trailing P/E'],\n", " y=data['Enterprise Value/EBITDA'],\n", " color=data['Color'],\n", " label=data['Name'],\n", " ticker=data['Symbol'],\n", " sector=data['Sector']))\n", "\n", "hover = HoverTool(tooltips=[\n", " (\"index\", \"$index\"),\n", " (\"name\", \"@label\"),\n", " (\"sector\", \"@sector\"),\n", " (\"ticker\", \"@ticker\"),\n", " (\"(x,y)\", \"($x, $y)\"),\n", "])\n", "\n", "p = figure(tools=[TOOLS, hover], plot_width=800, plot_height=700)\n", "\n", "labels = LabelSet(x='x', y='y', text='label', source=source,\n", " text_font_size='8pt')\n", "\n", "# Remove \"legend='sector',\" to remove legend\n", "p.scatter(x='x', y='y', color='color', legend='sector', source=source)\n", "\n", "p.title.text = \"Current Valuation Structure\"\n", "p.title.text_font_size = \"25px\"\n", "\n", "# Uncomment the following line to add labels\n", "# p.add_layout(labels)\n", "\n", "p.xaxis.axis_label = 'Trailing P/E'\n", "p.yaxis.axis_label = 'EV/EBITDA'\n", "\n", "show(p)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is a clear positive relationship between a company's EV/EBITDA and P/E ratio. While most companies are clustered in the bottom left quadrant, there are a few outliers. We choose to keep those outliers for valuation because they still provide valuable information about companies as we will see later." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "
\n", "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from sklearn.manifold import TSNE\n", "\n", "tsne = TSNE(n_components=2, random_state=0, perplexity=10.0)\n", "Y = tsne.fit_transform(similarity_matrix_NLP)\n", "\n", "plotting_df = pd.concat([data[['Name', 'Symbol', 'Sector', 'Color']],\n", " pd.DataFrame(Y, columns=['x', 'y'])], axis=1)\n", "\n", "source = ColumnDataSource(dict(x=plotting_df['x'],\n", " y=plotting_df['y'],\n", " color=plotting_df['Color'],\n", " label=plotting_df['Name'],\n", " ticker=plotting_df['Symbol'],\n", " sector=plotting_df['Sector']))\n", "\n", "hover = HoverTool(tooltips=[\n", " (\"index\", \"$index\"),\n", " (\"name\", \"@label\"),\n", " (\"sector\", \"@sector\"),\n", " (\"ticker\", \"@ticker\"),\n", " (\"(x,y)\", \"($x, $y)\"),\n", "])\n", "\n", "p = figure(tools=[TOOLS, hover], plot_width=800, plot_height=700)\n", "\n", "labels = LabelSet(x='x', y='y', text='label', source=source,\n", " text_font_size='8pt')\n", "\n", "p.scatter(x='x', y='y', color='color', legend='sector', source=source)\n", "\n", "p.title.text = \"2D Company Similarity Visualization (NLP)\"\n", "p.title.text_font_size = \"25px\"\n", "\n", "# Uncomment the following line to add labels\n", "# p.add_layout(labels)\n", "\n", "show(p)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "While companies are mostly clustered by their industry groups, this analysis seems to capture relationships between companies which go beyond those. The fact that companies such as Activision Blizzard and Electronic Arts are close to each other suggests that the model is doing something right. However, note that it is possible that the model relies on trivial factors, for example, attributing a high similarity score to companies whose financial statements have the same auditor (say, because the writing style is similar)." 
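, "\n", "\n", "To spot-check such pairs directly, we can query the similarity matrix for a company's nearest neighbours. The helper below is a small illustrative sketch (the function name is ours, and it assumes the requested ticker survived the earlier filtering steps); the valuation section below applies the same idea." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def top_comparables(ticker, n=3):\n", "    # Row of the NLP similarity matrix corresponding to the chosen company\n", "    idx = data.index[data['Symbol'] == ticker][0]\n", "    sims = similarity_matrix_NLP[idx].copy()\n", "    sims[idx] = -np.inf  # exclude the company itself\n", "    best = np.argsort(-sims)[:n]\n", "    return data.loc[best, ['Name', 'Symbol', 'Sector']].assign(Similarity=sims[best])\n", "\n", "\n", "# For example, the filings closest to Activision Blizzard's\n", "top_comparables('ATVI')"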
] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/QBatista/anaconda/lib/python3.6/site-packages/numpy/linalg/linalg.py:1757: RuntimeWarning: invalid value encountered in slogdet\n", " sign, logdet = _umath_linalg.slogdet(a, signature=signature)\n", "/Users/QBatista/anaconda/lib/python3.6/site-packages/sklearn/covariance/graph_lasso_.py:230: RuntimeWarning: invalid value encountered in multiply\n", " * coefs)\n", "/Users/QBatista/anaconda/lib/python3.6/site-packages/sklearn/covariance/graph_lasso_.py:232: RuntimeWarning: invalid value encountered in multiply\n", " * coefs)\n", "/Users/QBatista/anaconda/lib/python3.6/site-packages/sklearn/covariance/graph_lasso_.py:252: ConvergenceWarning: graph_lasso: did not converge after 100 iteration: dual gap: -1.510e-04\n", " ConvergenceWarning)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Cluster 1: Activision Blizzard, Electronic Arts\n", "Cluster 2: Alphabet Inc Class A, Amazon.com Inc, eBay, Expedia Inc., Facebook, Inc., Microsoft Corp., Netflix Inc., Priceline.com Inc, TripAdvisor\n", "Cluster 3: American Tower Corp A, Crown Castle International Corp., SBA Communications\n", "Cluster 4: Aetna Inc, Anthem Inc., Centene Corporation, CIGNA Corp., Humana Inc., United Health Group Inc.\n", "Cluster 5: Block H&R\n", "Cluster 6: BorgWarner, Delphi Automotive, Goodyear Tire & Rubber, LKQ Corporation\n", "Cluster 7: Cadence Design Systems, Synopsys Inc.\n", "Cluster 8: AmerisourceBergen Corp, Cardinal Health, CVS Health, Express Scripts, Henry Schein, McKesson Corp., Patterson Companies\n", "Cluster 9: Carnival Corp., Royal Caribbean Cruises Ltd\n", "Cluster 10: AbbVie Inc., Alexion Pharmaceuticals, Amgen Inc, Biogen Inc., Bristol-Myers Squibb, Celgene Corp., Gilead Sciences, Illumina Inc, Mylan N.V., Regeneron, Vertex Pharmaceuticals Inc\n", "Cluster 11: Chevron Corp., Exxon Mobil Corp., Occidental Petroleum, Schlumberger Ltd., TechnipFMC\n", "Cluster 12: Chipotle Mexican Grill\n", "Cluster 13: The Clorox Company, Colgate-Palmolive, Estee Lauder Cos., Kimberly-Clark, Procter & Gamble\n", "Cluster 14: Charter Communications, Comcast Corp.\n", "Cluster 15: Cimarex Energy, Concho Resources, Devon Energy, Newfield Exploration Co, Pioneer Natural Resources, Range Resources Corp.\n", "Cluster 16: Constellation Brands, Molson Coors Brewing Company\n", "Cluster 17: Costco Wholesale Corp., Wal-Mart Stores\n", "Cluster 18: Alaska Air Group Inc, American Airlines Group, Delta Air Lines, Southwest Airlines, United Continental Holdings\n", "Cluster 19: Discovery Communications, Discovery Communications, Viacom Inc.\n", "Cluster 20: Dish Network\n", "Cluster 21: Dollar General, Dollar Tree, Kroger Co., Target Corp.\n", "Cluster 22: Albemarle Corp, Eastman Chemical, FMC Corporation, LyondellBasell, The Mosaic Company, Nucor Corp.\n", "Cluster 23: Equifax Inc.\n", "Cluster 24: Extra Space Storage, Public Storage\n", "Cluster 25: Fastenal Co, Grainger (W.W.) 
Inc.\n", "Cluster 26: Accenture plc, Automatic Data Processing, Cognizant Technology Solutions, Fidelity National Information Services, Fiserv Inc, Gartner Inc, Intuit, Paychex Inc., Verisk Analytics\n", "Cluster 27: Flowserve Corporation, Fluor Corp., Freeport-McMoRan Inc., Jacobs Engineering Group, Quanta Services Inc.\n", "Cluster 28: Carmax Inc, Ford Motor, General Motors\n", "Cluster 29: Brown-Forman Corp., Campbell Soup, Conagra Brands, General Mills, The Hershey Company, Hormel Foods Corp., JM Smucker, Kellogg Co., McCormick & Co., Mondelez International, Sysco Corp., Tyson Foods\n", "Cluster 30: Hasbro Inc.\n", "Cluster 31: Home Depot, Lowe's Cos., Newell Brands, Tractor Supply Company\n", "Cluster 32: 3M Company, Amphenol Corp, Boeing Company, Cintas Corporation, Danaher Corp., General Electric, Genuine Parts, Harris Corporation, Honeywell Int'l Inc., Illinois Tool Works, Monsanto Co., Rockwell Collins, Roper Technologies, Snap-On Inc., Textron Inc., United Technologies\n", "Cluster 33: CBOE Holdings, CME Group Inc., Intercontinental Exchange, Nasdaq, Inc.\n", "Cluster 34: Avery Dennison Corp, Ball Corp, International Paper, Packaging Corporation of America, Weyerhaeuser Corp.\n", "Cluster 35: Interpublic Group, Omnicom Group\n", "Cluster 36: Affiliated Managers Group Inc, Archer-Daniels-Midland Co, BlackRock, CBRE Group, Corning Inc., Franklin Resources, Garmin Ltd., Invesco Ltd., S&P Global, Inc., T. Rowe Price Group, TE Connectivity Ltd., Western Union Co\n", "Cluster 37: Applied Materials Inc, KLA-Tencor Corp., Lam Research, Micron Technology, Nvidia Corporation\n", "Cluster 38: Lennar Corp., Pulte Homes Inc.\n", "Cluster 39: General Dynamics, L-3 Communications Holdings, Lockheed Martin Corp., Northrop Grumman Corp., Raytheon Co.\n", "Cluster 40: Host Hotels & Resorts, Marriott Int'l., Wyndham Worldwide\n", "Cluster 41: Aon plc, Arthur J. 
Gallagher & Co., Marsh & McLennan, Willis Towers Watson\n", "Cluster 42: Acuity Brands Inc, Fortune Brands Home & Security, Leggett & Platt, Masco Corp., Mohawk Industries, Whirlpool Corp.\n", "Cluster 43: Alliance Data Systems, Cerner, Global Payments Inc, Mastercard Inc., Total System Services, Visa Inc.\n", "Cluster 44: Mattel Inc.\n", "Cluster 45: MGM Resorts International, Wynn Resorts\n", "Cluster 46: Michael Kors Holdings\n", "Cluster 47: Gap Inc., Kohl's, Macy's Inc., Nordstrom\n", "Cluster 48: CSX Corp., Kansas City Southern, Norfolk Southern Corp., Union Pacific\n", "Cluster 49: Advance Auto Parts, AutoZone Inc, O'Reilly Automotive\n", "Cluster 50: Kinder Morgan, ONEOK, Williams Cos.\n", "Cluster 51: AMETEK Inc, Caterpillar Inc., Cummins Inc., Deere & Co., Dover Corp., Eaton Corporation, Emerson Electric Company, FLIR Systems, Harley-Davidson, Ingersoll-Rand PLC, PACCAR Inc., Parker-Hannifin, Pentair Ltd., Rockwell Automation Inc., Stanley Black & Decker, United Rentals, Inc., Xylem Inc.\n", "Cluster 52: Altria Group Inc, Coca-Cola Company, Dr Pepper Snapple Group, Monster Beverage, PepsiCo Inc., Philip Morris International\n", "Cluster 53: Allergan, Plc, Johnson & Johnson, Lilly (Eli) & Co., Merck & Co., Pfizer Inc.\n", "Cluster 54: Air Products & Chemicals Inc, Ecolab, Intl Flavors & Fragrances, PPG Industries, Praxair Inc., Sealed Air, Sherwin-Williams\n", "Cluster 55: AFLAC Inc, Ameriprise Financial, Assurant Inc, Berkshire Hathaway, Hartford Financial Svc.Gp., International Business Machines, Leucadia National Corp., Lincoln National, Loews Corp., MetLife Inc., Principal Financial Group, Prudential Financial, Robert Half International, Torchmark Corp., Unum Group\n", "Cluster 56: DaVita, Laboratory Corp. of America Holding, Quest Diagnostics\n", "Cluster 57: Adobe Systems Inc, Akamai Technologies Inc, ANSYS, CA, Inc., Citrix Systems, F5 Networks, Oracle Corp., Red Hat Inc., Verisign Inc.\n", "Cluster 58: Duke Realty Corp, Federal Realty Investment Trust, General Growth Properties Inc., Iron Mountain Incorporated, Macerich, Prologis, Realty Income Corporation, Regency Centers Corporation, Simon Property Group Inc\n", "Cluster 59: HP Inc., NetApp, Seagate Technology, Western Digital\n", "Cluster 60: Signet Jewelers\n", "Cluster 61: Darden Restaurants, McDonald's Corp., Starbucks Corp., Yum! Brands Inc\n", "Cluster 62: Abbott Laboratories, Align Technology, Bard (C.R.) Inc., Baxter International Inc., Becton Dickinson, Boston Scientific, The Cooper Companies, Edwards Lifesciences, Hologic, IDEXX Laboratories, Intuitive Surgical Inc., Medtronic plc, ResMed, Stryker Corp., Zimmer Biomet Holdings\n", "Cluster 63: Analog Devices, Inc., Apple Inc., Broadcom, Cisco Systems, Intel Corp., Juniper Networks, Microchip Technology, Motorola Solutions Inc., QUALCOMM Inc., Skyworks Solutions, Texas Instruments, Xilinx Inc\n", "Cluster 64: Foot Locker Inc, L Brands Inc., Ross Stores, TJX Companies Inc., Ulta Salon Cosmetics & Fragrance Inc\n", "Cluster 65: TransDigm Group\n", "Cluster 66: Allstate Corp, Chubb Limited, Cincinnati Financial, Progressive Corp., The Travelers Companies Inc., XL Capital\n", "Cluster 67: CBS Corp., Time Warner Inc., Twenty-First Century Fox Class A, Twenty-First Century Fox Class B, The Walt Disney Company\n", "Cluster 68: Apartment Investment & Management, AvalonBay Communities, Inc., Equity Residential, Mid-America Apartments, UDR Inc\n", "Cluster 69: C. H. Robinson Worldwide, Expeditors International, FedEx Corporation, J. B. 
Hunt Transport Services, United Parcel Service\n", "Cluster 70: HCA Holdings, Universal Health Services, Inc.\n", "Cluster 71: Hanesbrands Inc, Nike, PVH Corp., Under Armour, V.F. Corp.\n", "Cluster 72: Andeavor, Marathon Petroleum, Phillips 66, Valero Energy\n", "Cluster 73: Digital Realty, Equinix, HCP Inc., Ventas Inc, Welltower Inc.\n", "Cluster 74: AT&T Inc, CenturyLink Inc, Verizon Communications\n", "Cluster 75: Alexandria Real Estate Equities Inc, Boston Properties, SL Green Realty, Vornado Realty Trust\n", "Cluster 76: Martin Marietta Materials, Vulcan Materials\n", "Cluster 77: Eversource Energy, Tapestry, Inc., Walgreens Boots Alliance\n", "Cluster 78: Republic Services Inc, Waste Management Inc.\n", "Cluster 79: Agilent Technologies Inc, PerkinElmer, Thermo Fisher Scientific, Varian Medical Systems, Waters Corporation\n", "Cluster 80: Alliant Energy Corp, Ameren Corp, American Electric Power, American Water Works Company Inc, CenterPoint Energy, CMS Energy, Consolidated Edison, Dominion Energy, DTE Energy, Duke Energy, Edison Int'l, Exelon Corp., NextEra Energy, NiSource Inc., PG&E Corp., Pinnacle West Capital, PPL Corp., Public Serv. Enterprise Inc., SCANA Corp, Sempra Energy, Southern Co., Wec Energy Group Inc, Xcel Energy Inc\n" ] } ], "source": [ "from sklearn import cluster, covariance\n", "\n", "# This code was adapted from Gael Varoquaux's work (see references)\n", "\n", "# Calculate intraday variation\n", "variation_df = close_price - open_price\n", "variation_df = variation_df.T.reindex(data['Symbol'])\n", "\n", "# Get name, sector and color from the data dataframe\n", "variation_df = \\\n", "data[['Symbol', 'Name', 'Sector', 'Color']].join(variation_df, on='Symbol')\n", "\n", "# Drop rows with NAs\n", "variation_df = variation_df.dropna(axis=0)\n", "\n", "# Data for the model\n", "var_data = variation_df.drop(['Symbol', 'Name', 'Sector', 'Color'], axis=1).T\n", "\n", "# Learn a graphical structure from the correlations\n", "edge_model = covariance.GraphLassoCV()\n", "\n", "# Standardize the time series: using correlations rather than covariance is\n", "# more efficient for structure recovery\n", "var_data /= var_data.std(axis=0)\n", "edge_model.fit(var_data)\n", "\n", "# Cluster using affinity propagation\n", "_, labels = cluster.affinity_propagation(edge_model.covariance_)\n", "n_labels = labels.max()\n", "\n", "variation_df['Cluster'] = np.nan\n", "\n", "for i in range(n_labels + 1):\n", " variation_df.loc[labels == i, 'Cluster'] = i + 1\n", " print('Cluster %i: %s' % ((i + 1),\n", " ', '.join(variation_df['Name'][labels == i])))\n", "\n", "# Find a low-dimension embedding for visualization: find the best position of\n", "# the nodes (the stocks) on a 2D plane\n", "embedding = tsne.fit_transform(var_data.T)\n", "\n", "# Display a graph of the partial correlations\n", "partial_correlations = edge_model.precision_.copy()\n", "d = 1 / np.sqrt(np.diag(partial_correlations))\n", "partial_correlations *= d\n", "partial_correlations *= d[:, np.newaxis]\n", "non_zero = (np.abs(np.triu(partial_correlations, k=1)) > 0.06)\n", "\n", "# Plot the edges\n", "start_idx, end_idx = np.where(non_zero)\n", "segments = [[embedding.T[:, start], embedding.T[:, stop]]\n", " for start, stop in zip(start_idx, end_idx)]\n", "values = np.abs(partial_correlations[non_zero])" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "
\n", "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "d1 = variation_df[['Name', 'Symbol', 'Sector', 'Color']].reset_index(drop=True)\n", "d2 = pd.DataFrame(np.stack(embedding, axis=1).T, columns=['x', 'y'])\n", "plotting_df = pd.concat([d1, d2], axis=1)\n", "\n", "source = ColumnDataSource(dict(x=plotting_df['x'],\n", " y=plotting_df['y'],\n", " color=plotting_df['Color'],\n", " label=plotting_df['Name'],\n", " ticker=plotting_df['Symbol'],\n", " sector=plotting_df['Sector']))\n", "\n", "hover = HoverTool(tooltips=[\n", " (\"index\", \"$index\"),\n", " (\"name\", \"@label\"),\n", " (\"sector\", \"@sector\"),\n", " (\"ticker\", \"@ticker\"),\n", " (\"(x,y)\", \"($x, $y)\"),\n", "])\n", "\n", "p = figure(tools=[TOOLS, hover], plot_width=800, plot_height=700)\n", "\n", "labels = LabelSet(x='x', y='y', text='label', source=source,\n", " text_font_size='8pt')\n", "\n", "p.scatter(x='x', y='y', color='color', legend='sector', source=source)\n", "\n", "p.title.text = \"2D Company Similarity Visualization (Correlation)\"\n", "p.title.text_font_size = \"25px\"\n", "\n", "# Comment this line to remove segments\n", "p.segment(*np.reshape(np.array(segments).flatten(), (len(segments), 4)).T)\n", "\n", "# Uncomment the following line to add labels\n", "# p.add_layout(labels)\n", "\n", "show(p)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note: Segments represent partial correlation coefficients whose absolute value exceeds 0.06.\n", "\n", "Here, companies are much more clearly clustered by industry groups. For example, one can see that airline companies (American Airlines Group, Delta Air Lines, etc.) are clustered together and connected to each other." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Valuation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### NLP Approach" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "# Subtract identity matrix from similarity matrix to avoid selecting the\n", "# original company\n", "sim_mat = similarity_matrix_NLP - np.eye(len(similarity_matrix_NLP))\n", "\n", "\n", "def valuation_calculator(index, data, multiple, similarity_matrix):\n", " sorted_similarity_array = -np.sort((-similarity_matrix[index]))\n", " sorted_similarity_indices = (-similarity_matrix[index]).argsort()\n", " top_3_comps = sorted_similarity_array[:3]\n", " baseline_comp = sorted_similarity_array[4]\n", " normalized_weights = ((top_3_comps-baseline_comp) /\n", " sum(top_3_comps-baseline_comp))\n", " top_3_pe = data[multiple][sorted_similarity_indices[:3]]\n", " weighted_pe = np.dot(normalized_weights, top_3_pe)\n", " if multiple == 'Trailing P/E':\n", " valuation = weighted_pe * data['Diluted EPS'][index]\n", " elif multiple == 'Enterprise Value/EBITDA':\n", " valuation = weighted_pe * data['EBITDA'][index]\n", " elif multiple == 'Enterprise Value/Revenue':\n", " valuation = weighted_pe * data['Revenue'][index]\n", " return valuation\n", "\n", "\n", "valuation_df_nlp = data[['Name', 'Symbol', 'Sector', 'Color', 'Current Price',\n", " 'Enterprise Value']].copy()\n", "\n", "for i, company in enumerate(data['Name']):\n", " valuation_df_nlp.loc[i, 'Valuation (P/E)'] = \\\n", " valuation_calculator(i, data, 'Trailing P/E', sim_mat)\n", "\n", " valuation_df_nlp.loc[i, 'Valuation (EV/EBITDA)'] = \\\n", " valuation_calculator(i, data, 'Enterprise Value/EBITDA', sim_mat)\n", "\n", " valuation_df_nlp.loc[i, 'Valuation (EV/Revenue)'] = \\\n", " valuation_calculator(i, data, 'Enterprise 
Value/Revenue', sim_mat)\n", "\n", "valuation_df_nlp['% Over/Undervaluation (EV/Revenue)'] = \\\n", "(valuation_df_nlp['Valuation (EV/Revenue)'] /\n", " valuation_df_nlp['Enterprise Value']-1)*100\n", "\n", "valuation_df_nlp['% Over/Undervaluation (EV/EBITDA)'] = \\\n", "(valuation_df_nlp['Valuation (EV/EBITDA)'] /\n", " valuation_df_nlp['Enterprise Value']-1)*100\n", "\n", "valuation_df_nlp['% Over/Undervaluation (P/E)'] = \\\n", "(valuation_df_nlp['Valuation (P/E)'] / valuation_df_nlp['Current Price']-1)*100" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 430.000000\n", "mean 101.705297\n", "std 451.641116\n", "min -89.167688\n", "25% -23.991184\n", "50% 17.955084\n", "75% 91.800569\n", "max 6113.063281\n", "Name: % Over/Undervaluation (EV/Revenue), dtype: float64" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "valuation_df_nlp['% Over/Undervaluation (EV/Revenue)'].describe()" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 430.000000\n", "mean 31.095668\n", "std 261.918370\n", "min -85.110505\n", "25% -19.046748\n", "50% 5.264212\n", "75% 41.416836\n", "max 5282.424119\n", "Name: % Over/Undervaluation (EV/EBITDA), dtype: float64" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "valuation_df_nlp['% Over/Undervaluation (EV/EBITDA)'].describe()" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 430.000000\n", "mean 70.585367\n", "std 268.240955\n", "min -97.265648\n", "25% -28.156419\n", "50% 6.538062\n", "75% 58.264523\n", "max 2774.539533\n", "Name: % Over/Undervaluation (P/E), dtype: float64" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "valuation_df_nlp['% Over/Undervaluation (P/E)'].describe()" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Add the price on the last day of 2016 to the dataframe\n", "last_day_2016 = close_price.loc['2016-12-30', :].rename('2016-12-30 Price')\n", "valuation_df_nlp = valuation_df_nlp.join(last_day_2016, on='Symbol')" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Percentage of correct predictions YTD (NLP): 0.651162790698\n", "Average Return: 0.1147698701648122\n" ] } ], "source": [ "# Calculates the percentage of correct predictions year to date\n", "valuation_df_nlp['Actual Change'] = \\\n", "valuation_df_nlp['Current Price'] > valuation_df_nlp['2016-12-30 Price']\n", "\n", "valuation_df_nlp['Return'] = \\\n", "valuation_df_nlp['Current Price'] / valuation_df_nlp['2016-12-30 Price'] - 1\n", "\n", "valuation_df_nlp['Prediction'] = \\\n", "valuation_df_nlp['Valuation (P/E)'] > valuation_df_nlp['2016-12-30 Price']\n", "\n", "avg_return = \\\n", "np.mean(valuation_df_nlp['Prediction'].replace(0, -1) *\n", " valuation_df_nlp['Return'])\n", "\n", "outcome = (valuation_df_nlp['Prediction'] == valuation_df_nlp['Actual Change'])\n", "correct_pred = sum(outcome) / len(outcome)\n", "print('Percentage of correct predictions YTD (NLP): ' + str(correct_pred))\n", "print('Average Return: ' + str(avg_return))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Disclaimer: While the system appears to be promising, there are many reasons other than luck for which this 
should not be taken at face value. First, we use the last price of 2016 as a reference, but the financial statements are published later even though they reflect the state of the company at the end of the year. Additionally, many more tests would need to be performed to evaluate the quality of these signals, which is beyond the aim of this notebook. As such, you should not consider using these results to make investment decisions." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "
\n", "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "source = \\\n", "ColumnDataSource(dict(x=valuation_df_nlp['% Over/Undervaluation (P/E)'],\n", " y=valuation_df_nlp['% Over/Undervaluation (EV/EBITDA)'],\n", " color=valuation_df_nlp['Color'],\n", " label=valuation_df_nlp['Name'],\n", " ticker=valuation_df_nlp['Symbol'],\n", " sector=valuation_df_nlp['Sector']))\n", "\n", "hover = HoverTool(tooltips=[\n", " (\"index\", \"$index\"),\n", " (\"name\", \"@label\"),\n", " (\"sector\", \"@sector\"),\n", " (\"ticker\", \"@ticker\"),\n", " (\"(x,y)\", \"($x, $y)\"),\n", "])\n", "\n", "p = figure(tools=[TOOLS, hover], plot_width=800, plot_height=700)\n", "\n", "labels = LabelSet(x='x', y='y', text='label', source=source,\n", " text_font_size='8pt')\n", "\n", "# Remove \"legend='sector',\" to remove legend\n", "p.scatter(x='x', y='y', color='color', legend='sector', source=source)\n", "\n", "p.title.text = \"Over/Undervaluation (NLP) Based On...\"\n", "p.title.text_font_size = \"25px\"\n", "\n", "# Uncomment the following line to add labels\n", "# p.add_layout(labels)\n", "\n", "p.xaxis.axis_label = 'Trailing P/E'\n", "p.yaxis.axis_label = 'EV/EBITDA'\n", "\n", "show(p)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A positive value means that the company should be trading at a higher value (as of the end of 2016). Overall, there appears to be a positive relationship between the valuations based on these two multiples. One caveat, however, is the limited number of companies. Some companies in the S&P 500 have no good comparables, and therefore, the percentage of over/undervaluation suggested by the model should not be taken at face value. For example, companies such as Netflix and Amazon appear highly overvalued based on this model. However, this is because idiosyncratic factors, which are not captured by this methodology, have a large influence on these companies. In fact, on an EV/Revenue basis, the model actually suggests that Amazon is undervalued. Conducting this analysis with a much larger dataset would partially alleviate this issue. An interesting observation is that Nvidia, which has recently performed very well due to developments in the GPU space, is still not as \"overvalued\" as Netflix on the basis of these two multiples."
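, "\n", "\n", "To see which names sit at these extremes, the largest and smallest P/E-based gaps can be pulled straight from the valuation table; this is a quick inspection added for illustration rather than part of the methodology." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cols = ['Name', 'Sector', '% Over/Undervaluation (P/E)',\n", "        '% Over/Undervaluation (EV/EBITDA)']\n", "\n", "# Names the model says should trade higher (potentially undervalued)\n", "print(valuation_df_nlp.nlargest(5, '% Over/Undervaluation (P/E)')[cols])\n", "\n", "# Names the model says should trade lower (potentially overvalued, e.g. Netflix)\n", "print(valuation_df_nlp.nsmallest(5, '% Over/Undervaluation (P/E)')[cols])"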
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Correlation Approach" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": true }, "outputs": [], "source": [ "cols_to_be_used = ['Name', 'Symbol', 'Sector', 'Color', 'Current Price',\n", " 'Enterprise Value', 'Enterprise Value/Revenue',\n", " 'Enterprise Value/EBITDA', 'Trailing P/E']\n", "\n", "df1 = variation_df[['Name', 'Cluster']].set_index('Name')\n", "val_df_corr = data[cols_to_be_used].join(df1, on='Name')\n", "val_df_corr = val_df_corr.dropna()\n", "\n", "for i in range(n_labels + 2):\n", " val_df_corr.loc[val_df_corr['Cluster'] == i, 'Cluster P/E'] = \\\n", " np.mean(val_df_corr[val_df_corr['Cluster'] == i]['Trailing P/E'])\n", "\n", " val_df_corr.loc[val_df_corr['Cluster'] == i, 'Cluster EV/EBITDA'] = \\\n", " np.mean(val_df_corr[val_df_corr['Cluster'] == i]\n", " ['Enterprise Value/EBITDA'])\n", "\n", " val_df_corr.loc[val_df_corr['Cluster'] == i, 'Cluster EV/Revenue'] = \\\n", " np.mean(val_df_corr[val_df_corr['Cluster'] == i]\n", " ['Enterprise Value/Revenue'])\n", "\n", "val_df_corr['% Over/Undervaluation (EV/Revenue)'] = \\\n", "(val_df_corr['Enterprise Value/Revenue'] /\n", " val_df_corr['Cluster EV/Revenue']-1)*100\n", "\n", "val_df_corr['% Over/Undervaluation (EV/EBITDA)'] = \\\n", "(val_df_corr['Enterprise Value/EBITDA'] /\n", " val_df_corr['Cluster EV/EBITDA'] - 1) * 100\n", "\n", "val_df_corr['% Over/Undervaluation (P/E)'] = \\\n", "(val_df_corr['Trailing P/E'] / val_df_corr['Cluster P/E'] - 1) * 100" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 411.000000\n", "mean 0.000002\n", "std 41.113803\n", "min -96.331000\n", "25% -25.817000\n", "50% -2.205000\n", "75% 17.535500\n", "max 186.307000\n", "Name: % Over/Undervaluation (EV/Revenue), dtype: float64" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "val_df_corr['% Over/Undervaluation (EV/Revenue)'].round(3).describe()" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 411.000000\n", "mean -0.000032\n", "std 45.999853\n", "min -98.236000\n", "25% -18.438500\n", "50% -4.308000\n", "75% 11.977500\n", "max 610.876000\n", "Name: % Over/Undervaluation (EV/EBITDA), dtype: float64" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "val_df_corr['% Over/Undervaluation (EV/EBITDA)'].round(3).describe()" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 411.000000\n", "mean -0.000015\n", "std 92.845076\n", "min -93.641000\n", "25% -39.437500\n", "50% -11.260000\n", "75% 11.894500\n", "max 1026.426000\n", "Name: % Over/Undervaluation (P/E), dtype: float64" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "val_df_corr['% Over/Undervaluation (P/E)'].round(3).describe()" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "
\n", "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "source = \\\n", "ColumnDataSource(dict(x=val_df_corr['% Over/Undervaluation (P/E)'],\n", " y=val_df_corr['% Over/Undervaluation (EV/EBITDA)'],\n", " color=val_df_corr['Color'],\n", " label=val_df_corr['Name'],\n", " ticker=val_df_corr['Symbol'],\n", " sector=val_df_corr['Sector']))\n", "\n", "hover = HoverTool(tooltips=[\n", " (\"index\", \"$index\"),\n", " (\"name\", \"@label\"),\n", " (\"sector\", \"@sector\"),\n", " (\"ticker\", \"@ticker\"),\n", " (\"(x,y)\", \"($x, $y)\"),\n", "])\n", "\n", "p = figure(tools=[TOOLS, hover], plot_width=800, plot_height=700)\n", "\n", "labels = LabelSet(x='x', y='y', text='label', source=source,\n", " text_font_size='8pt')\n", "\n", "# Remove \"legend='sector',\" to remove legend\n", "p.scatter(x='x', y='y', color='color', legend='sector', source=source)\n", "\n", "p.title.text = \"Over/Undervaluation (Correlation) Based On...\"\n", "p.title.text_font_size = \"25px\"\n", "\n", "# Uncomment the following line to add labels\n", "# p.add_layout(labels)\n", "\n", "p.xaxis.axis_label = 'Trailing P/E'\n", "p.yaxis.axis_label = 'EV/EBITDA'\n", "\n", "show(p)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This methodology gives valuations which are much less extreme than the NLP approach. One potential explanation for this is that clusters are closer to the traditional industry groups, and therefore, to how the market values these companies." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These results were obtained as of:" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "datetime.date(2017, 11, 28)" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "date.today()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### References\n", "\n", "Titman, Sheridan and Martin, John D. _Valuation: the Art and Science of Corporate Investment Decisions_. Prentice Hall, 2015. Print.\n", "\n", "Quoc Le and Tomas Mikolov. _Distributed Representations of Sentences and Documents_. http://arxiv.org/pdf/1405.4053v2.pdf\n", "\n", "Laurens van der Maaten and Geoffrey Hinton. _Visualizing Data using t-SNE_. http://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf \n", "\n", "Gael Varoquaux. _Visualizing the Stock Market Structure_. http://scikit-learn.org/stable/auto_examples/applications/plot_stock_market.html#sphx-glr-auto-examples-applications-plot-stock-market-py" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1" } }, "nbformat": 4, "nbformat_minor": 2 }