{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction to Relative Valuation using Market Comparables\n", "\n", "**Quentin Batista** \n", "*The University of Tokyo*\n", "\n", "Relative valuation is a popular technique which relies on the market value of similar assets to price a given asset. For example, if you want to price a laundry business, you might want to calculate the price to earnings ratio of other similar businesses and multiply this ratio by the earnings of the laundry business to obtain a valuation. Sheridan and Martin describe this methodology as a 4-step process:\n", "\n", "**Step 1:** Identify similar or comparable investments and recent market prices for each.\n", "\n", "**Step 2:** Calculate a valuation metric for use in valuing the asset.\n", "\n", "**Step 3:** Calculate an initial estimate of value.\n", "\n", "**Step 4:** Refine or tailor your initial valuation estimate to the specific characteristics of the investment.\n", "\n", "Some of the most commonly used multiples are:\n", "\n", "- Price to earnings (P/E)\n", "\n", "- Market to book value of equity\n", "\n", "- Enterprise Value (EV) to EBITDA\n", "\n", "- Enterprise Value (EV) to revenue\n", "\n", "- Enterprise Value (EV) to cash flow\n", "\n", "This list is not exhaustive, and it is possible for you to create your own multiple. This is particularly popular in the technology sector where analysts have come up with multiples such as Enterprise Value to unique visitors or website hits. In doing so, you must ensure that the components of the multiple are consistent with each other. For example, you might consider using the price to sales ratio as a valuation multiple. However, an implicit assumption behind this multiple is that comparable companies have identical capital structures, which is very rarely the case in practice. When this assumption is violated, the multiple becomes inconsistent because it is impacted by the amount of debt that a company has relative to its equity.\n", "\n", "Finally, a key step in applying this methodology is to determine which multiple is appropriate for the asset you are trying to value. For example, valuing young firms and startups using the P/E ratio is likely to be inappropriate if those firms have negative or highly volatile earnings. Instead, using the EV to sales ratio would likely give a better estimate. Additionally, it is important to realize that these multiples have different characteristics. While EV to EBITDA ignores the firm's CapEx, depreciation, and capital structure, while the P/E ratio takes those into account. Using these multiples concurrently allows you to see the big picture and understand what is driving the valuation of an asset." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Systematic Relative Valuation Using Machine Learning\n", "\n", "### Objective\n", "\n", "In this notebook, we systematize the methodology introduced above for companies in the S&P 500 using two different Machine Learning approaches. First, we replicate Gael Varoquaux's analysis on scikit-learn which extracts a graphical structure from the correlation of intraday variations and then applies Affinity Propagation to group together stocks that behave similarly. Second, we analyze the companies' latest 10K using the Doc2Vec implementation in gensim to quantify similarity. In the first case, we use the average multiple in each cluster to value each company. 
In the second case, we pick the top 3 comparable companies and transform the similarity measures into weights to compute a weighted-average comparable multiple. We then apply this multiple to each company's financials to obtain a valuation (a toy numerical example of this arithmetic follows the Data subsection below).\n", "\n", "The hope is that this approach captures relationships between companies that human analysts would not be able to perceive. Reading through the financial statements of all the companies in the S&P 500 and forming an opinion about how these companies relate to each other would be an extremely time-consuming process. As a result, most analysts on the Street cover a specific industry group, and they might therefore not realize that a company they are researching is actually very similar to a company classified in a different industry group. By using machine learning, we can alleviate this issue and, hopefully, gain some valuable insights into the structure of the stock market.\n", "\n", "### Data\n", "\n", "We scrape the list of companies in the S&P 500, along with their tickers and CIK codes, from Wikipedia. We then use the CIK codes to scrape each company's latest 10-K from EDGAR. There are 505 entries in the Wikipedia list because some companies trade under multiple symbols (for example, Discovery Communications Inc.). A few companies' financial statements are not available on EDGAR for various reasons -- we ignore those companies. We clean the data by removing \"Table of Contents\" markers where they exist, as well as page numbers, line breaks, punctuation, and numbers. We scrape company fundamentals and historical prices from Yahoo! Finance. Note that we drop companies with negative multiples to simplify valuation. While Yahoo! Finance might not be a particularly reliable source for company fundamentals, it is among the best free sources available."
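] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To make the valuation mechanics concrete before automating them, the sketch below works through Steps 2 and 3 with made-up numbers (none of them come from the scraped data): it converts the similarity scores of three hypothetical comparables into weights, computes a weighted-average P/E, and applies it to a hypothetical target's diluted EPS." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Toy illustration of the weighted-comparables arithmetic.\n", "# All numbers below are made up for the example.\n", "import numpy as np\n", "\n", "comparable_pe = np.array([18.0, 22.0, 25.0])  # P/E of three comparables\n", "similarity = np.array([0.9, 0.8, 0.6])  # similarity scores to the target\n", "\n", "weights = similarity / similarity.sum()  # normalize so weights sum to 1\n", "weighted_pe = weights @ comparable_pe  # weighted-average multiple\n", "\n", "target_eps = 4.2  # hypothetical diluted EPS of the company being valued\n", "estimated_price = weighted_pe * target_eps\n", "print(f'Weighted P/E: {weighted_pe:.2f}, implied price: {estimated_price:.2f}')"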
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Required Packages" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install numpy\n", "!pip install pandas\n", "!pip install beautifulsoup4\n", "!pip install gensim\n", "!pip install nltk\n", "!pip install sklearn\n", "!pip install bokeh\n", "!pip install pandas_datareader\n", "!pip install datetime" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "from bs4 import BeautifulSoup\n", "from urllib.request import urlopen\n", "from collections import namedtuple\n", "from gensim.models import doc2vec\n", "from gensim.similarities import docsim\n", "from gensim import corpora\n", "import itertools\n", "from string import punctuation, digits\n", "from nltk import word_tokenize\n", "from nltk.corpus import stopwords\n", "from sklearn.manifold import TSNE\n", "from bokeh.plotting import figure, show, output_notebook\n", "from bokeh.models import ColumnDataSource, LabelSet, Legend, HoverTool\n", "from bokeh.palettes import all_palettes\n", "from pandas_datareader.data import DataReader\n", "from datetime import date" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Scraping" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Scraping 10Ks from EDGAR" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "from bs4 import BeautifulSoup\n", "from urllib.request import urlopen\n", "\n", "\n", "def table_extractor(soup):\n", " \"\"\"Extract the tables from a soup object\"\"\"\n", " for table in soup.find_all(\"table\"):\n", " table.extract()\n", " return soup\n", "\n", "\n", "sp_500_wiki_link = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'\n", "soup_wiki = BeautifulSoup(urlopen(sp_500_wiki_link), 'html.parser')\n", "table_wiki = soup_wiki.find(\"table\", {\"class\": \"wikitable sortable\"})\n", "\n", "# Fail now if the right table hasn't been found\n", "header = table_wiki.findAll('th')\n", "if header[0].string != \"Ticker symbol\" or header[1].string != \"Security\":\n", " raise Exception(\"Can't parse wikipedia's table!\")\n", "\n", "# Retreive the values in the table\n", "records = []\n", "rows = table_wiki.findAll('tr')\n", "for row in rows:\n", " fields = row.findAll('td')\n", " if fields:\n", " # Get info and SEC company link\n", " symbol = fields[0].string\n", " wiki_link = \"https://en.wikipedia.org/wiki\" + fields[1].a.get('href')\n", " CIK = fields[7].string\n", " sec_company_link = \"https://www.sec.gov/cgi-bin/browse-edgar?\" + \\\n", " \"action=getcompany&CIK=\" + CIK + \\\n", " \"&type=10-K&dateb=&owner=include&count=40\"\n", " name = fields[1].a.string\n", " sector = fields[3].string\n", "\n", " # Get link for the page with latest 10-K related filings\n", " soup_comp = BeautifulSoup(urlopen(sec_company_link), 'html.parser')\n", " table_comp = soup_comp.find(\"table\", {\"class\": \"tableFile2\"})\n", " try:\n", " # Get the latest filing page\n", " filings_link = \"https://www.sec.gov\" + \\\n", " table_comp.a.get('href')\n", "\n", " # Get the link for the latest 10K\n", " soup_filings = BeautifulSoup(urlopen(filings_link),\n", " 'html.parser')\n", " table_filings = soup_filings.find(\"table\", {\"class\": \"tableFile\"})\n", " _10K_link = \"https://www.sec.gov\" + table_filings.a.get('href')\n", "\n", " # 
"            try:\n", "                soup_latest_10K = BeautifulSoup(urlopen(_10K_link).read(),\n", "                                                'html.parser')\n", "                soup_latest_10K = table_extractor(soup_latest_10K)\n", "                _latest_10K_txt = soup_latest_10K.get_text()\n", "\n", "            except Exception:\n", "                # If the latest 10-K is not available, return N/A\n", "                _latest_10K_txt = np.nan\n", "\n", "        except Exception:\n", "            # If the filings are not available, return N/A\n", "            _10K_link = np.nan\n", "            _latest_10K_txt = np.nan\n", "\n", "        # Append results\n", "        records.append([symbol, wiki_link, name, sector, sec_company_link,\n", "                        CIK, _latest_10K_txt])\n", "\n", "headers = ['Symbol', 'Wikipedia Link', 'Name', 'Sector', 'SEC Filings Link',\n", "           'CIK', 'Latest 10K']\n", "data = pd.DataFrame(records, columns=headers)\n", "\n", "# Rewrite tickers containing '.' so Yahoo! Finance recognizes them later\n", "ambiguous_tickers = ['BRK.B', 'BF.B']\n", "corrected_tickers = ['BRK-B', 'BF-B']\n", "\n", "for i, ticker in enumerate(ambiguous_tickers):\n", "    data['Symbol'] = data['Symbol'].replace(ticker, corrected_tickers[i])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Scraping Fundamentals from Yahoo! Finance" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "def unit_converter(data):\n", "    \"\"\"Convert strings like '1.2B' or '340M' to floats\"\"\"\n", "    billion = 1_000_000_000\n", "    million = 1_000_000\n", "    if data[-1] == 'B':\n", "        return float(data[:-1])*billion\n", "    elif data[-1] == 'M':\n", "        return float(data[:-1])*million\n", "    else:\n", "        return float(data)\n", "\n", "\n", "items = ['Enterprise Value', 'Enterprise Value/Revenue', 'Diluted EPS',\n", "         'Enterprise Value/EBITDA', 'Revenue', 'EBITDA', 'Trailing P/E']\n", "\n", "for i, ticker in enumerate(data['Symbol']):\n", "    key_stats_link = 'https://finance.yahoo.com/quote/' + ticker + \\\n", "        '/key-statistics?p=' + ticker\n", "    key_stats_soup = BeautifulSoup(urlopen(key_stats_link).read(),\n", "                                   'html.parser').findAll('td')\n", "    for j, row in enumerate(key_stats_soup):\n", "        for item in items:\n", "            try:\n", "                if item == row.span.string:\n", "                    # Dangerous assumption: the value sits in the next 'td'\n", "                    data.loc[i, item] = \\\n", "                        unit_converter(key_stats_soup[j+1].string)\n", "            except (AttributeError, TypeError, ValueError):\n", "                # Skip rows without a span or with unparseable values\n", "                continue" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Disclaimer: This scraper makes an important assumption about the structure of the HTML. Specifically, it assumes that the value for each item of interest is the text located in the 'td' tag immediately following the one containing the item's name." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Scraping Historical Prices from Yahoo! Finance" ] }, 
Finance" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from pandas_datareader.data import DataReader\n", "from datetime import date\n", "\n", "start = date(2013, 1, 2)\n", "end = date.today()\n", "\n", "data_source = 'yahoo'\n", "\n", "# Sometimes fail -- retry if it does\n", "historical_prices_panel = DataReader(data['Symbol'], data_source, start, end,\n", " retry_count=5)\n", "\n", "# Current price is last close\n", "last_price = \\\n", "historical_prices_panel['Close'].tail(1).T.iloc[:, 0].rename('Current Price')\n", "data = data.join(last_price, on='Symbol')\n", "\n", "close_price = historical_prices_panel['Close']\n", "open_price = historical_prices_panel['Open']" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Uncomment to save the data\n", "# data.to_csv('10K_data.csv')\n", "# historical_prices_panel['Close'].to_csv('close_price_data.csv')\n", "# historical_prices_panel['Open'].to_csv('open_price_data.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Cleaning and Preprocessing" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Uncomment to read the data\n", "# data = pd.read_csv('10K_data.csv')\n", "# close_price = pd.read_csv('close_price_data.csv')\n", "# open_price = pd.read_csv('open_price_data.csv')" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Remove companies without filings\n", "no_filings_data = data[data['Latest 10K'].isnull()]\n", "data = data[~data['Latest 10K'].isnull()]\n", "\n", "# Remove duplicates (keep first)\n", "data = data.drop_duplicates(keep='first')\n", "\n", "# Drop Google duplicate\n", "data = data[data['Symbol'] != 'GOOG']\n", "\n", "# Drop NA rows (about 60 companies)\n", "data = data.dropna()\n", "\n", "# Drop companies with negative EPS or EBITDA (about 30 companies)\n", "data = data[(data[['EBITDA', 'Diluted EPS']] > 0).all(1)]\n", "\n", "# Reset index\n", "data = data.reset_index(drop=True)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from collections import namedtuple\n", "from nltk import word_tokenize\n", "from nltk.corpus import stopwords\n", "from string import punctuation, digits\n", "\n", "\n", "def _10K_string_cleaner(_10K):\n", " # Lowercase the text\n", " _10K = _10K.lower()\n", " stopchar = punctuation + digits + '’“”'\n", " for ch in stopchar:\n", " # Replace stopchar by whitespace\n", " _10K = _10K.replace(ch, ' ')\n", " # Tokenize\n", " _10K = word_tokenize(_10K)\n", " # Remove stopwords\n", " _10K = [word for word in _10K if word not in stopwords.words('english')]\n", " return _10K\n", "\n", "\n", "corpus = []\n", "analyzedDocument = namedtuple('AnalyzedDocument', 'words tags')\n", "for i, text in enumerate(data['Latest 10K']):\n", " corpus.append(analyzedDocument(_10K_string_cleaner(text), [i]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Doc2Vec Model\n", "\n", "Doc2Vec is an implementation of the Paragraph Vector unsupervised learning algorithm proposed by Quoc Le and Tomas Mikolov for learning fixed-length feature representations from variable-length pieces of texts. It aims to overcome some of the weaknesses of the more traditional bag-of-words (BOW) methodology." 
] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from gensim.models import doc2vec\n", "from gensim.similarities import docsim\n", "\n", "model_NLP = doc2vec.Doc2Vec(corpus, size=500, window=300, min_count=1,\n", " workers=4)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": true, "scrolled": true }, "outputs": [], "source": [ "import itertools\n", "\n", "similarity_matrix_NLP = np.empty(0)\n", "length_docvecs = len(model_NLP.docvecs)\n", "for item in itertools.product(range(length_docvecs), range(length_docvecs)):\n", " similarity_matrix_NLP = np.append(similarity_matrix_NLP,\n", " model_NLP.docvecs.similarity(*item))\n", "\n", "similarity_matrix_NLP = similarity_matrix_NLP.reshape((length_docvecs,\n", " length_docvecs))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Visualization\n", "\n", "### Stock Market Structure\n", "\n", "We use three different methodologies for visualizing the structure of the stock market. First, we create a simple scatter plot of the companies' EV/EBITDA against their P/E ratio. Second, we use the correlation matrix of the S&P 500 stocks between the start of 2013 and the last price. Third, we use the similarity matrix outputted from the Doc2Vec model. In the last two cases, t-SNE is used for dimensionality reduction." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/javascript": [ "\n", "(function(global) {\n", " function now() {\n", " return new Date();\n", " }\n", "\n", " var force = true;\n", "\n", " if (typeof (window._bokeh_onload_callbacks) === \"undefined\" || force === true) {\n", " window._bokeh_onload_callbacks = [];\n", " window._bokeh_is_loading = undefined;\n", " }\n", "\n", "\n", " \n", " if (typeof (window._bokeh_timeout) === \"undefined\" || force === true) {\n", " window._bokeh_timeout = Date.now() + 5000;\n", " window._bokeh_failed_load = false;\n", " }\n", "\n", " var NB_LOAD_WARNING = {'data': {'text/html':\n", " \"\\n\"+\n", " \"BokehJS does not appear to have successfully loaded. If loading BokehJS from CDN, this \\n\"+\n", " \"may be due to a slow or bad network connection. Possible fixes:\\n\"+\n", " \"
\\n\"+\n", " \"\\n\"+\n",
" \"from bokeh.resources import INLINE\\n\"+\n",
" \"output_notebook(resources=INLINE)\\n\"+\n",
" \"
\\n\"+\n",
" \"