## Sentiment Analysis using LDA

1. Data Collection: We start by collecting up to 20 of the most recent news summaries for each company in the Dow Jones Industrial Average from the Yahoo Finance RSS feed, using the yahoo_fin package.

2. Initial Sentiment Analysis: Perform a basic sentiment analysis on these summaries to get an initial sentiment score for each company.

3. Topic Modeling: Use Latent Dirichlet Allocation (LDA) to identify five key topics discussed in these news summaries.

4. Topic-Specific Sentiment Analysis: Calculate the average sentiment for news summaries belonging to each of these topics.

5. Weighted Sentiment Analysis: Use these topic-specific sentiment scores to recalculate a weighted sentiment score for each company.

6. Comparison: Compare the original and new weighted sentiment scores to evaluate the difference.
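
Steps 4 and 5 above can be sketched on toy data before touching real news. The snippet below is only an illustration: the tickers `AAA` and `BBB`, the topic ids, and the sentiment values are made up, and it assumes each summary already has a topic and a sentiment score. It uses the same weighting the notebook applies later, where a summary's sentiment is multiplied by the average sentiment of its topic.

```python
import pandas as pd

# Toy data: per-summary sentiment and an already-assigned topic (all values are made up).
toy = pd.DataFrame({
    "Ticker":    ["AAA", "AAA", "BBB", "BBB"],
    "Topic":     [0, 1, 0, 1],
    "Sentiment": [0.4, 0.2, 0.0, -0.2],
})

# Step 4: average sentiment of each topic, broadcast back onto every row.
toy["Topic_Mean"] = toy.groupby("Topic")["Sentiment"].transform("mean")

# Step 5: scale each summary's sentiment by its topic's average,
# then average the scaled values per ticker.
toy["Weighted_Sentiment"] = toy["Sentiment"] * toy["Topic_Mean"]
weighted_by_ticker = toy.groupby("Ticker")["Weighted_Sentiment"].mean()

print(weighted_by_ticker)
```

Using `transform("mean")` keeps the per-topic average aligned with the original rows, which avoids a row-by-row lookup.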

In [None]:
!pip install -q yahoo_fin pandas_datareader gensim textblob

In [None]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/hexuser/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /home/hexuser/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
import requests
import pandas as pd
from yahoo_fin import stock_info as info
from yahoo_fin import news
from pandas_datareader import DataReader
import numpy as np
import warnings
warnings.filterwarnings('ignore')

from gensim import corpora, models
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string


In [None]:
# Get the list of tickers that comprise the Dow Jones Industrial Average
tickers = info.tickers_dow()
tickers

['AAPL',
 'AMGN',
 'AMZN',
 'AXP',
 'BA',
 'CAT',
 'CRM',
 'CSCO',
 'CVX',
 'DIS',
 'DOW',
 'GS',
 'HD',
 'HON',
 'IBM',
 'INTC',
 'JNJ',
 'JPM',
 'KO',
 'MCD',
 'MMM',
 'MRK',
 'MSFT',
 'NKE',
 'PG',
 'TRV',
 'UNH',
 'V',
 'VZ',
 'WMT']

In [None]:
# Fetch the news summaries for each Dow ticker, one row per ticker
rows = []
for ticker in tickers:
    ticker_news = news.get_yf_rss(ticker)
    summaries = [article['summary'] for article in ticker_news]
    rows.append({'Ticker': ticker, 'Summaries': summaries})
# Build the DataFrame once; DataFrame.append is not available in pandas 2.x
dow_news_df = pd.DataFrame(rows, columns=['Ticker', 'Summaries'])
dow_news_df.head()

Unnamed: 0,Ticker,Summaries
0,AAPL,"[Magnificent Seven stocks, including AI leader..."
1,AMGN,[Amgen's shares have come under pressure this ...
2,AMZN,[Amazon.com said on Wednesday it plans to push...
3,AXP,[The pair both declared substantial improvemen...
4,BA,[Boeing’s global fleet of 787 Dreamliner jets ...


In [None]:
dow_news_df

Unnamed: 0,Ticker,Summaries
0,AAPL,"[Magnificent Seven stocks, including AI leader..."
1,AMGN,[Amgen's shares have come under pressure this ...
2,AMZN,[Amazon.com said on Wednesday it plans to push...
3,AXP,[The pair both declared substantial improvemen...
4,BA,[Boeing’s global fleet of 787 Dreamliner jets ...
5,CAT,[The bull and bear debate over the cyclical st...
6,CRM,[Key Insights Institutions' substantial holdin...
7,CSCO,[Cisco Systems (CSCO) concluded the recent tra...
8,CVX,[(Bloomberg) -- President Joe Biden’s administ...
9,DIS,[Workers who help bring Disneyland’s beloved c...
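
The frame above stores a whole list of summaries in each row, while the later steps need one row per summary. A possible way to get there without fetching the feed a second time is `DataFrame.explode`. This is a sketch on a hypothetical two-ticker frame, not the approach the notebook itself takes.

```python
import pandas as pd

# Hypothetical stand-in for dow_news_df: one row per ticker, a list of summaries per row.
nested = pd.DataFrame({
    "Ticker": ["AAA", "BBB"],
    "Summaries": [["first summary", "second summary"], ["only summary"]],
})

# explode() repeats the ticker for every element of the list-valued column.
flat = (
    nested.explode("Summaries", ignore_index=True)
          .rename(columns={"Summaries": "Summary"})
)

print(flat)
```

To cap the result at 20 summaries per ticker, `flat.groupby("Ticker").head(20)` would keep the first 20 rows of each group.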


In [None]:
from textblob import TextBlob
# Function to calculate sentiment polarity
def calculate_sentiment(text):
    return TextBlob(text).sentiment.polarity
# Calculate the average sentiment for each ticker, skipping tickers with no summaries
sentiment_rows = []
for index, row in dow_news_df.iterrows():
    ticker = row['Ticker']
    summaries = row['Summaries']
    if summaries:
        avg_sentiment = np.mean([calculate_sentiment(summary) for summary in summaries])
        sentiment_rows.append({'Ticker': ticker, 'Average Sentiment': avg_sentiment})
# Build the DataFrame once; DataFrame.append is not available in pandas 2.x
dow_sentiment_df = pd.DataFrame(sentiment_rows, columns=['Ticker', 'Average Sentiment'])
dow_sentiment_df.head()

Unnamed: 0,Ticker,Average Sentiment
0,AAPL,0.195268
1,AMGN,0.125121
2,AMZN,0.143147
3,AXP,0.158369
4,BA,0.145588


In [None]:
dow_sentiment_df

Unnamed: 0,Ticker,Average Sentiment
0,AAPL,0.195268
1,AMGN,0.125121
2,AMZN,0.143147
3,AXP,0.158369
4,BA,0.145588
5,CAT,0.099819
6,CRM,0.134925
7,CSCO,0.08852
8,CVX,0.12459
9,DIS,0.169991


In [None]:
# Collect the top 20 news summaries for each ticker, one row per summary
top20_rows = []
for ticker in tickers:
    ticker_news = news.get_yf_rss(ticker)[:20]
    for article in ticker_news:
        top20_rows.append({'Ticker': ticker, 'Summary': article['summary']})
# Build the DataFrame once; DataFrame.append is not available in pandas 2.x
dow_top20_summaries_df = pd.DataFrame(top20_rows, columns=['Ticker', 'Summary'])
dow_top20_summaries_df.head(40)

Unnamed: 0,Ticker,Summary
0,AAPL,"Magnificent Seven stocks, including AI leader ..."
1,AAPL,"So much for the ""pay or okay"" model that the F..."
2,AAPL,Apple is opening up web distribution for iOS a...
3,AAPL,These four stocks will be the cream of the cro...
4,AAPL,Apple has fixed a bug that suggested the Pales...
5,AAPL,Apple CEO Tim Cook says ‘the investment abilit...
6,AAPL,These are stocks you should always consider bu...
7,AAPL,"Amazon, Apple initiated: Wall Street's top ana..."
8,AAPL,The tech giant is no longer the world's top sm...
9,AAPL,These companies are at earlier stages in their...


In [None]:
dow_top20_summaries_df

Unnamed: 0,Ticker,Summary
0,AAPL,"Magnificent Seven stocks, including AI leader ..."
1,AAPL,"So much for the ""pay or okay"" model that the F..."
2,AAPL,Apple is opening up web distribution for iOS a...
3,AAPL,These four stocks will be the cream of the cro...
4,AAPL,Apple has fixed a bug that suggested the Pales...
...,...,...
595,WMT,The price reductions come as consumers feel th...
596,WMT,This retailer's faster growth helped fund a bi...
597,WMT,"Nichole Hart walks 20,000 steps as she searche..."
598,WMT,"Alaska Permanent, the largest U.S. state wealt..."


In [None]:
# Score each of the top 20 summaries individually (calculate_sentiment is defined above)
dow_top20_sentiment_df = dow_top20_summaries_df[['Ticker', 'Summary']].copy()
dow_top20_sentiment_df['Sentiment'] = dow_top20_sentiment_df['Summary'].apply(calculate_sentiment)
dow_top20_sentiment_df.head(40)

Unnamed: 0,Ticker,Summary,Sentiment
0,AAPL,"Magnificent Seven stocks, including AI leader ...",1.0
1,AAPL,"So much for the ""pay or okay"" model that the F...",0.233333
2,AAPL,Apple is opening up web distribution for iOS a...,0.225
3,AAPL,These four stocks will be the cream of the cro...,0.0
4,AAPL,Apple has fixed a bug that suggested the Pales...,0.1
5,AAPL,Apple CEO Tim Cook says ‘the investment abilit...,-0.125
6,AAPL,These are stocks you should always consider bu...,0.0
7,AAPL,"Amazon, Apple initiated: Wall Street's top ana...",0.5
8,AAPL,The tech giant is no longer the world's top sm...,0.25
9,AAPL,These companies are at earlier stages in their...,0.0625


In [None]:
# Function to clean and tokenize text
def clean_tokenize(text):
    stop_words = set(stopwords.words('english'))
    tokens = word_tokenize(text.lower())
    tokens = [word for word in tokens if word not in stop_words and word not in string.punctuation]
    return tokens

# Tokenize the summaries
tokenized_summaries = dow_top20_summaries_df['Summary'].apply(clean_tokenize)

# Create a dictionary and corpus from the tokenized summaries
dictionary = corpora.Dictionary(tokenized_summaries)
corpus = [dictionary.doc2bow(text) for text in tokenized_summaries]

# Apply LDA model
lda_model = models.ldamodel.LdaModel(corpus, num_topics=5, id2word=dictionary, passes=15)
topics = lda_model.print_topics(num_words=4)

topics

[(0, '0.014*"’" + 0.008*"2024" + 0.007*"\'s" + 0.006*"april"'),
 (1, '0.014*"stocks" + 0.014*"\'s" + 0.009*"trading" + 0.007*"earnings"'),
 (2, '0.012*"\'s" + 0.007*"2024" + 0.006*"stock" + 0.006*"market"'),
 (3, '0.009*"stocks" + 0.008*"earnings" + 0.008*"company" + 0.007*"\'s"'),
 (4, '0.012*"\'s" + 0.010*"’" + 0.006*"u.s." + 0.005*"rate"')]

In [None]:
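# Re-run the LDA topic modeling code after downloading the required NLTK resources.
# LDA is randomly initialized, so the topics differ from the run above unless random_state is set.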
from gensim import corpora, models
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string

# Function to clean and tokenize text
def clean_tokenize(text):
    stop_words = set(stopwords.words('english'))
    tokens = word_tokenize(text.lower())
    tokens = [word for word in tokens if word not in stop_words and word not in string.punctuation]
    return tokens

# Tokenize the summaries
tokenized_summaries = dow_top20_summaries_df['Summary'].apply(clean_tokenize)

# Create a dictionary and corpus from the tokenized summaries
dictionary = corpora.Dictionary(tokenized_summaries)
corpus = [dictionary.doc2bow(text) for text in tokenized_summaries]

# Apply LDA model
lda_model = models.ldamodel.LdaModel(corpus, num_topics=5, id2word=dictionary, passes=15)
topics = lda_model.print_topics(num_words=4)

topics

[(0, '0.013*"trading" + 0.011*"stocks" + 0.011*"day" + 0.010*"’"'),
 (1, '0.011*"\'s" + 0.011*"’" + 0.009*"stocks" + 0.007*"market"'),
 (2, '0.013*"earnings" + 0.012*"’" + 0.011*"2024" + 0.008*"company"'),
 (3, '0.011*"\'s" + 0.011*"stocks" + 0.006*"’" + 0.006*"company"'),
 (4, '0.019*"\'s" + 0.008*"said" + 0.007*"street" + 0.007*"wall"')]
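
Both runs surface tokens such as `'s` and `’` among the top topic words. The check `word not in string.punctuation` is a substring test against the ASCII punctuation string, so the two-character token `'s` and the curly apostrophe both pass through it. A stricter tokenizer could look like the sketch below. It is an illustration only: `clean_tokenize_strict` and the tiny `STOP_WORDS` set are hypothetical stand-ins (the notebook uses NLTK's tokenizer and its full English stop-word list), and it uses a regular expression instead of `word_tokenize`.

```python
import re

# A tiny stand-in stop-word list; the notebook uses NLTK's full English list.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "to", "in", "this"}

def clean_tokenize_strict(text):
    """Lowercase, normalize curly apostrophes, and keep alphabetic words only."""
    text = text.lower().replace("\u2019", "'")
    # Words are runs of letters, optionally joined by apostrophes (e.g. "amgen's").
    words = re.findall(r"[a-z]+(?:'[a-z]+)*", text)
    # Drop a trailing possessive "'s", then filter out stop words.
    words = [w[:-2] if w.endswith("'s") else w for w in words]
    return [w for w in words if w and w not in STOP_WORDS]

tokens = clean_tokenize_strict("Amgen\u2019s shares have come under pressure in 2024.")
print(tokens)
```

This variant also drops purely numeric tokens such as `2024` and splits abbreviations like `u.s.` into single letters. Whether that is desirable depends on the topics one hopes to recover.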

In [None]:
# Function to assign topics to summaries based on LDA model
def assign_topic_to_summary(summary):
    bow = dictionary.doc2bow(clean_tokenize(summary))
    topic_scores = lda_model[bow]
    dominant_topic = max(topic_scores, key=lambda x: x[1])[0]
    return dominant_topic

# Assign topics to each summary
dow_top20_summaries_df['Topic'] = dow_top20_summaries_df['Summary'].apply(assign_topic_to_summary)

# Perform sentiment analysis on each summary
dow_top20_summaries_df['Sentiment'] = dow_top20_summaries_df['Summary'].apply(calculate_sentiment)

# Group by topic and calculate average sentiment
topic_sentiment_df = dow_top20_summaries_df.groupby('Topic')['Sentiment'].mean().reset_index()

topic_sentiment_df

Unnamed: 0,Topic,Sentiment
0,0,0.124343
1,1,0.110615
2,2,0.126383
3,3,0.178993
4,4,0.12617
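
The topic assignment above picks the highest-probability entry from the list of `(topic_id, probability)` pairs that the LDA model returns for a document. The sketch below isolates that selection step on a hypothetical score list; `dominant_topic` and `example_scores` are illustrative names. As far as I know, gensim leaves out topics whose probability falls below a minimum threshold by default, so the list may be shorter than the number of topics.

```python
def dominant_topic(topic_scores):
    """Return the topic id with the highest probability from (topic_id, prob) pairs."""
    return max(topic_scores, key=lambda pair: pair[1])[0]

# Hypothetical result of lda_model[bow] for one summary.
example_scores = [(0, 0.05), (2, 0.81), (4, 0.14)]

best = dominant_topic(example_scores)
# Keeping the probability makes it possible to flag weak assignments later.
confidence = dict(example_scores)[best]

print(best, confidence)
```

Assigning a single dominant topic discards the rest of the distribution. Keeping `confidence` alongside the topic id is one way to see how clear-cut each assignment was.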


In [None]:
# Scale a summary's sentiment by the average sentiment of its dominant topic
def calculate_weighted_sentiment(row):
    topic = row['Topic']
    sentiment = row['Sentiment']
    topic_weight = topic_sentiment_df[topic_sentiment_df['Topic'] == topic]['Sentiment'].values[0]
    return sentiment * topic_weight

# Calculate weighted sentiment for each summary
dow_top20_summaries_df['Weighted_Sentiment'] = dow_top20_summaries_df.apply(calculate_weighted_sentiment, axis=1)

# Calculate new average sentiment for each company based on weighted sentiment
new_dow_sentiment_df = dow_top20_summaries_df.groupby('Ticker')['Weighted_Sentiment'].mean().reset_index()

# Merge with original dow_sentiment_df to compare
comparison_df = pd.merge(dow_sentiment_df, new_dow_sentiment_df, on='Ticker', how='inner')
comparison_df.columns = ['Ticker', 'Original_Sentiment', 'New_Weighted_Sentiment']

comparison_df

Unnamed: 0,Ticker,Original_Sentiment,New_Weighted_Sentiment
0,AAPL,0.195268,0.027609
1,AMGN,0.125121,0.018886
2,AMZN,0.143147,0.019433
3,AXP,0.158369,0.022741
4,BA,0.145588,0.021359
5,CAT,0.099819,0.012247
6,CRM,0.134925,0.018329
7,CSCO,0.08852,0.011163
8,CVX,0.12459,0.017225
9,DIS,0.169991,0.022478
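
Because each weighted score is a product of two polarity values, it sits on a much smaller scale than the original average, so comparing magnitudes directly says little. One option is to compare how the two columns rank the companies. The sketch below re-enters the ten rows shown above and computes a rank correlation by ranking both columns and taking their Pearson correlation. The choice of this metric is an assumption of the sketch, not part of the original analysis.

```python
import pandas as pd

# The ten rows displayed in the comparison table above.
comparison = pd.DataFrame({
    "Ticker": ["AAPL", "AMGN", "AMZN", "AXP", "BA", "CAT", "CRM", "CSCO", "CVX", "DIS"],
    "Original_Sentiment": [0.195268, 0.125121, 0.143147, 0.158369, 0.145588,
                           0.099819, 0.134925, 0.08852, 0.12459, 0.169991],
    "New_Weighted_Sentiment": [0.027609, 0.018886, 0.019433, 0.022741, 0.021359,
                               0.012247, 0.018329, 0.011163, 0.017225, 0.022478],
})

# How much the weighting shrinks each score.
ratio = comparison["New_Weighted_Sentiment"] / comparison["Original_Sentiment"]

# Rank correlation: Pearson correlation of the two rank vectors.
rank_corr = (
    comparison["Original_Sentiment"].rank()
    .corr(comparison["New_Weighted_Sentiment"].rank())
)

print(ratio.round(3).tolist())
print(round(rank_corr, 3))
```

On these ten rows the ratio stays between roughly 0.12 and 0.15 and the rank correlation is about 0.98: the weighting mostly rescales the scores and swaps only two pairs of neighbours (AXP with DIS, AMGN with CRM).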


## Conclusions:

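1. Nuanced Understanding: Each summary's sentiment is scaled by the average sentiment of its dominant topic, so the weighted score reflects both the tone of the article and the tone of the topic it belongs to. Because each weighted score is a product of two polarity values, it is on a smaller scale than the original average; the two columns are better compared by how they rank the companies than by magnitude.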

2. Risk Mitigation: By focusing on topic-specific sentiment, investors can potentially mitigate risks. For example, if a company has negative sentiment in a critical topic (one that might be labeled "Corporate Announcements" after inspecting its top words), it might be a red flag.

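3. Strategic Investment: The topic-weighted sentiment can be used to fine-tune investment strategies. For instance, you might prioritize companies with positive news in topics that are currently trending or are of strategic importance, such as a topic one might label "Stock Market Trends." Note that LDA topics are unlabeled, so such names have to be assigned by reading each topic's top words.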

4. Dynamic Adaptation: Topics shift over time (e.g., during earnings season or product launches). Re-fitting the topic model and recomputing the topic averages on fresh news lets the weighted sentiment scores follow those shifts and provide timely investment insights.

5. Comprehensive Analysis: Combining both general and topic-specific sentiment gives a more rounded view, allowing for better-informed investment decisions.

By using weighted sentiment scores, investors can make more nuanced and strategic decisions, potentially leading to better investment outcomes.