Publicly traded companies in the United States are required by law to file annual "10-K" and quarterly "10-Q" reports with the Securities and Exchange Commission (SEC). These reports include qualitative as well as quantitative accounts of the business, from sales figures to risk factors.
"Companies are required to disclose" important pending litigation or other legal proceedings "details. As such, 10-Ks and 10-Qs also provide useful insights into the success of a company. As such, 10-Ks and 10-Qs often hold valuable insights into a company's performance.
However, these insights can be hard to extract. In 2013, the average 10-K ran to roughly 42,000 words, and beyond the sheer length, dense terminology and boilerplate further obscure the meaning for many investors.
Fortunately, we do not need to read each company's 10-K cover to cover in order to extract meaning from it.
When major things happen to their business, companies make major textual changes to their 10-Ks. We therefore treat textual changes in 10-Ks as a signal of future share price movements.
Since the vast majority (86 percent) of textual changes carry negative sentiment, significant textual changes usually signal a decrease in the stock price (Cohen et al. 2018).
Major text changes in a company's 10-Ks over time indicate significant decreases in future returns. We can therefore short the companies with the largest text changes in their filings and long the companies with the smallest.
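As a rough preview of how such a signal could eventually be turned into a trade list, here is a minimal sketch. The example_signal DataFrame, its tickers, and its similarity numbers are all hypothetical placeholders, not output of the pipeline built below; the pandas import is included only so the snippet is self-contained.

import pandas as pd

# Hypothetical example: one "text change" score per ticker, where a lower
# similarity between consecutive 10-Ks means a larger textual change.
example_signal = pd.DataFrame({"ticker": ["AAA", "BBB", "CCC", "DDD"],
                               "cosine_similarity": [0.98, 0.72, 0.91, 0.85]})
# Rank companies by how much their latest 10-K changed (1 - similarity)
example_signal["text_change"] = 1 - example_signal["cosine_similarity"]
example_signal = example_signal.sort_values("text_change", ascending=False)
n = 1  # number of names on each side of the book (tiny, for illustration only)
shorts = example_signal.head(n)["ticker"].tolist() # largest text changes -> short
longs = example_signal.tail(n)["ticker"].tolist()  # smallest text changes -> long
print("Short:", shorts, "| Long:", longs)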
# Importing built-in libraries
import os
import re
import unicodedata
from time import gmtime, strftime
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings("ignore")
# Importing libraries you need to install
import requests
from lxml import html
import bs4 as bs
from tqdm import tqdm
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt
%matplotlib inline
import nltk
nltk.download('wordnet')
nltk.download('stopwords') # needed for nltk.corpus.stopwords used below
nltk.download('punkt') # needed for sent_tokenize / word_tokenize used below
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from nltk.corpus import stopwords
from gensim import corpora
from gensim.models import TfidfModel, LdaMulticore
from wordcloud import WordCloud
import pyLDAvis.gensim
from datetime import datetime
from collections import Counter
import seaborn as sns
from textblob import TextBlob
import pandas_datareader.data as web
stock_tickers = ["AMZN"] #for this example I am using the Amazon's Stock ticker, feel free to add many more stocks to the lists
The SEC indexes company filings by its own internal identifier, the "Central Index Key" (CIK). We'll need to convert tickers into CIKs in order to look up company filings on EDGAR.
def TickerToCik(tickers): # Helper function to convert a ticker into its SEC CIK
    _url = 'http://www.sec.gov/cgi-bin/browse-edgar?CIK={}&Find=Search&owner=exclude&action=getcompany'
    cik_re = re.compile(r'.*CIK=(\d{10}).*') # The CIK is a 10-digit number embedded in the page
    cik_dict = {}  # avoid shadowing the built-in dict
    for ticker in tqdm(tickers, desc='Mapping Tickers to CIK', unit=' Mappings'): # tqdm gives a progress bar
        results = cik_re.findall(requests.get(_url.format(ticker)).text)
        if len(results):
            cik_dict[str(ticker).lower()] = str(results[0]) # Saved in the format "amzn": '0001018724'
    return cik_dict
ciks = TickerToCik(stock_tickers)
ciks
tick_cik_df = pd.DataFrame.from_dict(data=ciks, orient= 'index')
tick_cik_df.reset_index(inplace=True)
tick_cik_df.columns = ["ticker", "cik"]
tick_cik_df['cik'] = tick_cik_df['cik'].str.lower()
tick_cik_df
P.S.: Some CIKs might be linked to multiple tickers. For the scope of this project, however, I will not be checking the uniqueness of the ticker-CIK pairing, so please check manually whether multiple tickers share the same CIK (a quick check is sketched below).
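As an optional sanity check, the small sketch below flags any CIK in tick_cik_df that maps to more than one ticker; with only one ticker in this example it will of course report nothing.

# Flag CIKs that are shared by more than one ticker (empty when only one ticker is used)
dupes = tick_cik_df[tick_cik_df['cik'].duplicated(keep=False)]
if len(dupes):
    print("CIKs shared by multiple tickers:")
    print(dupes.sort_values('cik'))
else:
    print("No duplicate CIKs found")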
Now, with the list of CIKs, we can download the 10-Ks from the EDGAR database on the official SEC website.
From past experience, there are a few technical considerations to keep in mind before proceeding:
The function below scrapes all 10-Ks for one particular CIK. This web scraper primarily depends on the 'requests' and 'BeautifulSoup' libraries.
We will create a separate directory for each CIK and put all the filings for that CIK inside it. After scraping, the file structure should look like this:
- 10Ks
    - CIK1
        - 10K #1
        - 10K #2
        ...
    - CIK2
        - 10K #1
        - 10K #2
        ...
    ...
The scraper will create a directory for each CIK, but we first need a parent directory to hold all of the 10-K files. The exact pathname depends on your local setup, so please enter your own path in the format below.
path_10k = '/Users/amruth/Desktop/HKU/Academics/Year 4 Sem 1/FINA4350/Midterm Project/'
def makefolder(x):
    try:
        os.mkdir(x + "10Ks")
    except FileExistsError:
        print("Folder/Directory for 10Ks already created")
    return x + "10Ks"

path_10k = makefolder(path_10k) # Creates the 10Ks folder inside the path above (if it does not already exist)
The Scrape10K function below scrapes all 10-Ks for a particular CIK from EDGAR in three steps:
Step 1: Scrape the webpage that contains the table listing the 10-K documents filed over the years.
Step 2: Access the table that stores the "Documents" link for each 10-K filing, then load the individual links for each of the 10-K documents one by one.
Step 3: Access the document link whose "Description" is 10-K. This loads the entire 10-K document in HTML format for that particular filing.
#Examples of the links that we will be scraping from
browse_url_base_10k = 'https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=%s&type=10-K'
filing_url_base_10k = 'http://www.sec.gov/Archives/edgar/data/%s/%s-index.html'
doc_url_base_10k = 'http://www.sec.gov/Archives/edgar/data/%s/%s/%s'
def Scrape10K(base_url, filing_url, doc_url, cik):
# Check if we've already scraped this CIK
try:
os.mkdir(cik)
except OSError:
print("The CIK has already been scraped", cik)
return
# Setting current directory for that CIK
os.chdir(cik)
print('Scraping CIK', cik)
# Request list of 10-K filings --> STEP 1 in the pictures
base_res = requests.get(base_url % cik) # STEP 1 in the pictures
# Parse the response HTML using BeautifulSoup
base_soup = bs.BeautifulSoup(base_res.text, "lxml")
# Extract all tables from the response
base_html_tables = base_soup.find_all('table')
    # Check that the table we're looking for exists; if it doesn't, exit
if len(base_html_tables)<3:
os.chdir('..')
return
# Parse the Filings table
fil_table = pd.read_html(str(base_html_tables[2]), header=0)[0]
fil_table['Filings'] = [str(y) for y in fil_table['Filings']]
# Get only 10-K and 10-K405 document filings
fil_table = fil_table[(fil_table['Filings'] == '10-K')| (fil_table['Filings'] == '10-K405') ]
# If filings table doesn't have any 10-Ks or 10-K405s, exit
if len(fil_table)==0:
os.chdir('..')
return
# Get accession number for each 10-K and 10-K405 filing
fil_table['Acc_No'] = [x.replace('\xa0',' ')
.split('Acc-no: ')[1]
.split(' ')[0] for x in fil_table['Description']]
#print(fil_table)
# Iterate through each filing and scrape the corresponding document...
for index, row in fil_table.iterrows():
        # Find the unique accession number for the filing
acc_no = str(row['Acc_No'])
# find the page with the accession number and Parse the table of documents for the filing
docs_page_html = bs.BeautifulSoup(requests.get(filing_url % (cik, acc_no)).text, 'lxml') # STEP 2 in the pictures
docs_tables = docs_page_html.find_all('table')
if len(docs_tables)==0:
continue
#converting the HTML table to a Dataframe
docs_df = pd.read_html(str(docs_tables[0]), header=0)[0]
docs_df['Type'] = [str(x) for x in docs_df['Type']]
# Get the 10-K for the filing
docs_df = docs_df[(docs_df['Type'] == '10-K')| (docs_df['Type'] == '10-K405')]
# If there aren't any 10-K, skip to the next filing
if len(docs_df)==0:
continue
elif len(docs_df)>0:
docs_df = docs_df.iloc[0]
docname = docs_df['Document']
if str(docname) != 'nan':
# STEP 3 in the pictures
#print(str(doc_url % (cik, acc_no.replace('-', ''), docname)).split()[0])
file = requests.get(str(doc_url % (cik, acc_no.replace('-', ''), docname)).split()[0])
# # Save the file in appropriate format
# if '.txt' in str(docname):
# # Save text as TXT
# date = str(row['Filing Date'])
# filename = cik + '_' + date + '.txt'
# html_file = open(filename, 'a')
# html_file.write(file.text)
# html_file.close()
# else:
# Save text as HTML
date = str(row['Filing Date'])
filename = cik + '_' + date + '.html'
html_file = open(filename, 'a')
html_file.write(file.text)
html_file.close()
# Move back to the main 10-K directory
os.chdir('..')
return
os.chdir(path_10k)
# Iterate over CIKs and scrape 10-Ks
for cik in tqdm(tick_cik_df['cik']):
Scrape10K(base_url=browse_url_base_10k, filing_url=filing_url_base_10k, doc_url=doc_url_base_10k, cik=cik)
We now have 10-Ks in HTML format for each CIK. Before computing our similarity scores, however, we need to clean the files up a bit.
The following needs to be done:
- Remove all tables whose numeric character content exceeds 15%, as well as HTML tags, XBRL tables, exhibits, ASCII-encoded PDFs, graphics, XLS, and other binary files.
- Convert each HTML file to a .txt file.
def DelTags(file_soup):
# Remove HTML tags
doc = file_soup.get_text()
# Remove newline characters
doc = doc.replace('\n', ' ')
# Replace unicode characters with their "normal" representations
doc = unicodedata.normalize('NFKD', doc)
return doc
def DelTables(file_soup):
    def GetDigitPercentage(tablestring):
        if len(tablestring) > 0:
            numbers = sum([char.isdigit() for char in tablestring])
            length = len(tablestring)
            return numbers / length
        else:
            return 1
    # Evaluate the numeric-character percentage of each table
    # and remove the table if that percentage is > 15%
    for table in file_soup.find_all('table'):
        if GetDigitPercentage(table.get_text()) > 0.15:
            table.extract()
    return file_soup
def ConvertHTML(cik):
    # Removes newlines, unicode artifacts, XBRL/numerical tables, and HTML tags from the scraped filings
# Look for files scraped for that CIK
try:
os.chdir(cik)
# ...if we didn't scrape any files for that CIK, exit
except FileNotFoundError:
print("Directory not available CIK", cik)
return
print("Parsing CIK %s..." % cik)
parsed = False # flag to tell if we've parsed anything
# Make a new directory with all the .txt files called "textonly"
try:
os.mkdir('textonly')
except OSError:
pass
    # List of files in that directory, excluding hidden files and subdirectories
file_list = [fname for fname in os.listdir() if not (fname.startswith('.') | os.path.isdir(fname))]
# Iterate over scraped files and clean
for filename in file_list:
# Check if file has already been cleaned
new_filename = filename.replace('.html', '.txt')
text_file_list = os.listdir('textonly')
if new_filename in text_file_list:
continue
# If it hasn't been cleaned already, keep going...
# Clean file
with open(filename, 'r') as file:
parsed = True
soup = bs.BeautifulSoup(file.read(), "lxml")
soup = DelTables(soup)
text = DelTags(soup)
with open('textonly/'+new_filename, 'w') as newfile:
newfile.write(text)
    # If no files needed cleaning (everything was parsed previously),
    # log that
if parsed==False:
print("Already parsed CIK", cik)
os.chdir('..')
return
os.chdir(path_10k)
# Iterate over CIKs and clean HTML filings
for cik in tqdm(tick_cik_df['cik']):
ConvertHTML(cik)
As you can see, the raw text of these documents is very messy. The cleaning above strips the HTML, and we will lowercase all the text during tokenization.
After running the two cells above, we have a cleaned plaintext version of each 10-K for each CIK. The file structure now looks like this:
- 10Ks
    - CIK1
        - 10K #1
        - 10K #2
        ...
        - textonly
After the text is cleaned up, it's time to normalize the words. The lemmetize function below lemmatizes each word in the list of words provided.
def lemmetize(words):
lemmatized_words = [WordNetLemmatizer().lemmatize(w.lower()) for w in words]
return lemmatized_words
stop_words = set(stopwords.words('english'))
fin_stop_words = ("million","including","billion","december","january")
stop_words.update(fin_stop_words)
# Removing stop words, numbers, punctuation, special characters, and very short tokens
def remove_stopwords(words):
    cleaned = [re.sub(r'[^\w\s]', '', w) for w in words]  # strip punctuation and special characters first
    filtered = [w for w in cleaned
                if w not in stop_words and not w.isnumeric()
                and not re.search(r'^\s*[0-9]', w) and len(w) > 3]
    return filtered
from nltk.stem import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
#ps = PorterStemmer()
ps = SnowballStemmer("english")
def stem_words(words):
stemmed = [ps.stem(w) for w in words]
return stemmed
from nltk.tokenize import sent_tokenize,word_tokenize
wordcount={} #dictionary to count the sentences and tokens in each 10-K document
docs = {} #dictionary to save the tokens after preprocessing for each 10-K document
file_doc=[] # Saving only the first document for the scope of this project
for cik in tqdm(tick_cik_df['cik']):
#setting directory to the .txt files folder
os.chdir(path_10k+'/'+cik+"/textonly")
#listing files in directory
files = [j for j in os.listdir()]
files.sort(reverse=True)
#iterating over each 10-K file
for file in files:
        with open(file, "r") as f:
            text = f.read()
        # Sentence tokenization
        sents = sent_tokenize(text)
file_doc = sents
tokens = word_tokenize(text.lower())
partial_lem = lemmetize(tokens)
after_stopwords = remove_stopwords(partial_lem)
docs[file] = after_stopwords
counts = {}
counts["tokens"]=len(tokens)
counts["sentences"]= len(sents)
wordcount[file] = counts
        break # Processing only the latest 10-K document for now
For the scope of this project, all the EDA will be done only on the latest 10-K document.
dataset = [lemmetize(remove_stopwords(d.lower().split())) for d in file_doc]
dictionary = corpora.Dictionary(dataset)
The corpus built below is basically a bag-of-words object: for each document (here, each sentence), it lists the id of each token together with the number of times that token occurs in the sentence.
corpus = [dictionary.doc2bow(file) for file in dataset]
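To see what one of these bag-of-words entries looks like, the short sketch below prints the first few (token, count) pairs of the first sentence; the exact tokens will of course depend on the filing you scraped.

# Peek at the bag-of-words representation of the first sentence:
# each entry is (token id, count); map the id back to its token for readability
if corpus and corpus[0]:
    for token_id, count in corpus[0][:10]:
        print(dictionary[token_id], count)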
Term Frequency-Inverse Document Frequency (TF-IDF) is also a bag-of-words model, but unlike the regular corpus, TF-IDF down-weights tokens (words) that appear frequently across documents. In simple terms, words that occur across more of the documents get smaller weights.
tfidf= TfidfModel(corpus)
#For printing the TF-IDF Model
# for doc in tfidf[corpus]:
# for id, freq in doc:
# print([dictionary[id], np.around(freq, decimals=2)])
Printing words with a TF-IDF score above 0.5; these are words that occur rarely across the documents and might therefore carry more meaning.
for doc in tfidf[corpus]:
for id, freq in doc:
if np.around(freq,decimals=2)> .5:
print([dictionary[id], np.around(freq, decimals=2) ])
The word cloud helps us visualize some of the most frequently used words in the document.
wordcloud = WordCloud(max_font_size=50, max_words=50, background_color="white", width=800, height=400).generate(" ".join(docs[list(docs.keys())[0]])) #first documents tokens from docs(which contains many tokens from different docs)
# Display the generated image:
plt.figure( figsize=(20,10), facecolor='k' )
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
We train our LDA model using gensim.models.LdaMulticore and save it to ‘lda_model’. For each topic, we will explore the words occurring in that topic and their relative weights. pyLDAvis helps us visualize this LDA model in a very user-friendly manner.
lda_model = LdaMulticore(corpus, num_topics = 6, id2word = dictionary,passes = 10,workers = 2)
lda_model.show_topics()
The area of each circle represents the importance of that topic over the entire corpus, and the distance between the circle centers indicates the similarity between topics. For each topic, the histogram on the right lists the top 30 most relevant terms. From these topics, we can try to tell a story.
P.S.: This is just an example and it might not be very accurate.
import pyLDAvis.gensim
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, corpus, dictionary)
vis
counter=Counter(docs[list(docs.keys())[0]]) #first documents tokens from docs(which contains many tokens from different docs)
most=counter.most_common()
x, y=[], []
for word,count in most[:20]:
x.append(word)
y.append(count)
plt.figure(figsize=(16,6))
sns.barplot(x=y,y=x)
Finding the sentiment value of the document using the TextBlob library.
positivity = TextBlob(" ".join(docs[list(docs.keys())[0]])) #first documents tokens from docs(which contains many tokens from different docs)
print(positivity.sentiment)
The polarity value ranges over [-1, 1]: -1 means the given string has a very negative context, whereas 1 means a very positive one. 0 means the statement is neutral, i.e. neither positive nor negative. For the document we used, the score is about 0.08, suggesting the document is more or less neutral or slightly positive. Again, this does not help too much with our quantitative analysis.
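As a quick illustration of the scale (not part of the analysis itself), the sketch below scores two made-up sentences with TextBlob; the first should come out positive and the second negative.

# Toy sentences to illustrate the polarity scale (exact values depend on TextBlob's lexicon)
print(TextBlob("Revenue growth was excellent and margins improved.").sentiment.polarity)
print(TextBlob("The company faces severe litigation risk and declining sales.").sentiment.polarity)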
def polarity(text):
return TextBlob(" ".join(text)).sentiment.polarity
sentences = pd.DataFrame(columns=["sentences","polarity_score"])
for sent in dataset:
sentences = sentences.append({'sentences':sent, 'polarity_score':polarity(sent) }, ignore_index=True)
sentences['polarity_score'].hist(figsize=(10,8))
The above graph shows us the distribution of polarity over the sentences. As we can see, most of the sentences in the 10-K document are very neutral; the wording appears to be chosen very carefully.
# Displaying the sentences that had a polarity score of over .5
sentences[sentences['polarity_score']>.5]
For this part, I will use cosine similarity and Jaccard similarity to compare a company's 10-K documents across the years and measure how much they change from one filing to the next.
def CosSimilarity(A, B):
    '''
    The input parameters A and B are sets of words from two different documents,
    and this function returns the cosine similarity score.
    '''
# Compile complete set of words in A or B
words = list(A.union(B))
# Determine which words are in A
vec_A = [1 if x in A else 0 for x in words]
# Determine which words are in B
vec_B = [1 if x in B else 0 for x in words]
# Compute cosine score using scikit-learn
array_A = np.array(vec_A).reshape(1, -1)
array_B = np.array(vec_B).reshape(1, -1)
cosine_score = cosine_similarity(array_A, array_B)[0,0]
return cosine_score
def JaccardSimilarity(A, B):
    '''
    The input parameters A and B are sets of words from two different documents,
    and this function returns the Jaccard similarity score.
    '''
# Count number of words in both A and B
intersect = len(A.intersection(B))
# Count number of words in A or B
union = len(A.union(B))
# Compute Jaccard similarity score
jaccard_score = intersect / union
return jaccard_score
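Before running these two functions on full filings, a quick toy check on two small word sets (sketched below) confirms they behave as expected.

# Toy check of the two similarity measures on small word sets
A = {"revenue", "growth", "risk", "segment"}
B = {"revenue", "growth", "risk", "litigation"}
print("Cosine:", CosSimilarity(A, B))      # 3 shared words, 4 in each set -> 0.75
print("Jaccard:", JaccardSimilarity(A, B)) # 3 shared words out of 5 total -> 0.6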
def ComputeSimilarityScores10K(cik):
# Open the directory that holds text filings for the CIK
os.chdir(cik+'/textonly')
print("Parsing CIK %s..." % cik)
    # Get the list of files over which to compute scores,
    # excluding hidden files and directories
    flist = [fname for fname in os.listdir() if not (fname.startswith('.') | os.path.isdir(fname))]
    flist.sort()
# Check if scores have already been calculated...
try:
os.mkdir('../metrics')
# ... if they have been, exit
except OSError:
print("Similarity Scores are already calculated")
os.chdir('../..')
return
    # Check that enough files exist to compute similarity scores...
    # If not, exit
    if len(flist) < 2:
        print("Not enough files to compare for CIK", cik)
        os.chdir('../..')
        return
    # Initialize dataframe to store similarity scores
    # Extract the filing dates from the names of the files in the "textonly" directory
    dates = [x[-14:-4] for x in flist]
    # Create placeholder score arrays matching the number of filings
    cosscore = [0.0] * len(dates)
    jaccardscore = [0.0] * len(dates)
    data = pd.DataFrame({'cosine_score': cosscore,
                         'jaccard_score': jaccardscore},
                        index=dates)
# Open first file
file_A = flist[0]
with open(file_A, 'r') as file:
text_A = file.read()
# Iterate over each 10-K file...
for i in range(1, len(flist)):
file_B = flist[i]
# Get file text B
with open(file_B, 'r') as file:
text_B = file.read()
# Get set of words in A, B
words_A = set(re.findall(r"[\w']+", text_A))
words_B = set(re.findall(r"[\w']+", text_B))
# Calculate similarity scores
cos_score = CosSimilarity(words_A, words_B)
jaccard_score = JaccardSimilarity(words_A, words_B)
# Store score values
date_B = file_B[-14:-4]
data.at[date_B, 'cosine_score'] = cos_score
data.at[date_B, 'jaccard_score'] = jaccard_score
# Reset value for next loop
# (We don't open the file again, for efficiency)
text_A = text_B
# Save scores
os.chdir('../metrics')
data.to_csv(cik+'_sim_scores.csv', index=True)
os.chdir('../..')
print("Metics Successfully Calulated, Check Metrics Directory")
# Calculating the similarity scores between consecutive 10-K filings and saving them in the "metrics" directory
os.chdir(path_10k)
for cik in tqdm(tick_cik_df['cik']):
ComputeSimilarityScores10K(cik)
cik = tick_cik_df['cik'][0]
sim_df = pd.read_csv(path_10k+"/"+cik+"/metrics/"+cik+"_sim_scores.csv")
new_columns = sim_df.columns.values
new_columns[0] = 'Dates'
sim_df.columns = new_columns
sim_df = sim_df.set_index('Dates')
sim_df = sim_df.fillna(0)
sim_df['cosine_score'] = sim_df['cosine_score'].astype(float)
sim_df['jaccard_score'] = sim_df['jaccard_score'].astype(float)
sim_df = pd.concat( [sim_df, sim_df.pct_change()] , axis = 1, sort=False)
sim_df
In the above dataframe, we can observe that the cosine and Jaccard similarity scores stay very high across the years and change very little from one filing to the next (the last two columns show the period-over-period percentage changes).
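To make this easier to see, one can plot the two raw score columns over the filing dates; below is a minimal sketch using the sim_df built above (its first two columns hold the raw scores, the last two the percentage changes).

# Plot the raw cosine and Jaccard scores over the filing dates
sim_df.iloc[:, :2].plot(figsize=(12, 5), marker='o')
plt.title("10-K similarity scores between consecutive filings")
plt.xlabel("Filing date")
plt.ylabel("Similarity score")
plt.show()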
tickers = [tick_cik_df['ticker'][0]]
time_series =web.DataReader(tickers, "av-monthly-adjusted", start=datetime(2001, 3, 23),end=datetime(2020, 10, 31),api_key='UGBIPKZSN5NWM5LV')
time_series
x = time_series.index.values.tolist()
y = time_series['adjusted close']
# dates = [datetime.strptime(date,"%Y-%m-%d")for date in x ]
# i = 2001
# years = [i for i in range(2021)]
# plt.xticks(x,x[::12],rotation='vertical')
# plt.locator_params(axis='x', nbins=len(x)/11)
plt.figure(figsize=(20,8))
plt.xticks(rotation=90)
plt.xlabel("Dates starting from 2001")
plt.plot(x,y)
plt.show()
From the above graph, we can see that the stock price of Amazon has steadily increased over the span of about 20 years. From the similarity scores generated in the 'sim_df' dataframe, we know that the changes in the cosine and Jaccard similarities between years are very low. This is consistent with the hypothesis we started with, that 'Major text changes in 10-K over time indicate significant decreases in future returns': AMZN made few text changes to its 10-K documents over time, and this coincided with the steady growth of the company.
However, we cannot stop here. It is important to test the hypothesis on other stocks before we come to any concrete conclusion.