{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Intro to Data Science \n", "## Part V - Text Mining \n", "\n", "### Table of Contents \n", "- #### Text Mining \n", " - Theory \n", " - In practice \n", " - Vectorizing documents \n", " - Normalizing document vectors \n", " - Vectorizing large corpora \n", " - Topic modeling \n", " - Document similarity \n", "- #### Artificial Neural Networks (ANN) \n", " - Single-layer networks \n", " - Multi-layer networks \n", "\n", "---\n", "\n", "## What is Text Mining? \n", "Text mining (or text analytics) is the process of extracting structured, meaningful features from natural language texts. Since most raw text data is unstructured, we need **Natural Language Processing (NLP)** techniques, statistical modeling, and machine learning to transform it into a form suitable for analysis. \n", "\n", "## Why is Text Mining Important? \n", "Around **80% of generated data** is unstructured, meaning it doesn’t fit neatly into tables or databases. This includes: \n", "\n", "- Emails, meeting notes, reports \n", "- Social media posts, chat logs \n", "- Articles, books, and research papers \n", "- Transcripts of voice recordings and videos \n", "\n", "While these data sources may contain metadata (e.g., length, topic, or category), extracting deeper insights requires transforming them into structured formats. For example, voice recordings and videos can be **transcribed into text**, which can then be processed just like any other document. \n", "\n", "### Common Use Cases \n", "Text mining enables a wide range of applications, including: \n", "\n", "- **Document similarity analysis** (e.g., finding related articles) \n", "- **Deduplication** (e.g., identifying near-duplicate documents) \n", "- **Document clustering** (e.g., grouping news articles by topic) \n", "- **Topic extraction** (e.g., summarizing themes in a collection of documents) \n", "- **Sentiment analysis** (e.g., determining if reviews are positive or negative) \n", "- **Automated annotation** (e.g., adding tags to documents based on content) \n", "- **Text classification** (e.g., spam detection in emails) \n", "- **Text filtering** (e.g., removing offensive content) \n", "\n", "---\n", "\n", "## Tools for Text Mining \n", "### NLP Techniques \n", "- **Tokenization** – Splitting text into words or phrases \n", "- **Stemming & Lemmatization** – Reducing words to their base form (e.g., \"running\" → \"run\") \n", "- **Part-of-Speech (POS) Tagging** – Identifying grammatical roles (noun, verb, adjective, etc.) 
\n", "- **Stopword Filtering** – Removing common but uninformative words (e.g., \"the\", \"and\", \"is\") \n", "- **Bag-of-Words (BoW) Representation** – Converting text into a numerical format \n", "- **TF-IDF Transformation** – Adjusting term importance based on document frequency \n", "\n", "### Other Key Tools \n", "- **Word Embeddings (e.g., Word2Vec, GloVe)** – Capturing word meanings in vector form \n", "- **Hashing** – Efficient vectorization of large datasets \n", "- **Similarity Metrics** – Measuring how alike documents are (e.g., cosine, Jaccard, Levenshtein) \n", "- **Matrix Factorization** – Reducing dimensionality for better topic modeling " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "\n", "import numpy as np\n", "import scipy.sparse as sp\n", "import pandas as pd\n", "\n", "from sklearn.pipeline import Pipeline\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.metrics import accuracy_score" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Text Mining in Practice \n", "\n", "### 1. Reading and Examining the Data \n", "\n", "Before diving into text processing, we need to understand the raw data. \n", "The collection of text documents we analyze is called a **corpus**. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with open('./data/SMSSpamCollection', 'rb') as spamfile:\n", " corpus = [line.decode('utf-8').strip() for line in spamfile]\n", "len(corpus)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for text in corpus[:5]:\n", " print(text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that the data is in TSV format, read it accordingly." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "corpus = pd.read_csv('./data/SMSSpamCollection', sep='\\t', names=['label', 'message'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "corpus.groupby('label').describe()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "corpus['length'] = corpus.message.str.len()\n", "corpus.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "corpus['wordcount'] = corpus.message.str.split().str.len()\n", "corpus.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "corpus.length.plot(bins=20, kind='hist');" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "corpus.length.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "910 long sms???" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "corpus.loc[corpus.length > 900, 'message'].values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Is there a difference between spam and ham messages?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "corpus[['length', 'label']].hist(bins=50, by='label', sharex=True);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Why not try a simple predictor?" 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.naive_bayes import MultinomialNB" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "splitted = train_test_split(corpus.length.values[:, np.newaxis], # we need a matrix, not a vector\n", " corpus.label.values,\n", " test_size=.25,\n", " stratify=corpus.label.values,\n", " random_state=42)\n", "X_train, X_test, y_train, y_test = splitted" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pipe = Pipeline([('nb', MultinomialNB())])\n", "pipe.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "accuracy_score(y_test, pipe.predict(X_test))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our baseline accuracy is around 87%. Not bad, but we haven’t done any preprocessing yet. Let’s improve this!\n", "\n", "### 2. Preprocessing\n", "#### a) [Bag-of-words representation](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer)\n", "\n", "The **Bag-of-Words (BoW) model** converts text documents into numerical vectors by counting word occurrences. Each unique word gets a fixed position in a vector, and the corresponding value represents how often it appears in the document.\n", "\n", "Example: \n", "Let’s say we have the following documents:\n", "```python\n", "docs = [\"I like trains.\", \"Trains are like big cars.\", \"I like big cars\"]\n", "```\n", "The vocabulary (set of unique words) would be:\n", "```python\n", "features = {'I': 0, 'like': 1, 'trains': 2, 'are': 3, 'big': 4, 'cars': 5}\n", "```\n", "And the vectorized form of the documents would be:\n", "```python\n", "vectors = [[1, 1, 1, 0, 0, 0], # \"I like trains.\"\n", " [0, 1, 1, 1, 1, 1], # \"Trains are like big cars.\"\n", " [1, 1, 0, 0, 1, 1]] # \"I like big cars.\"\n", "```\n", "Each row corresponds to a document, and each column represents a word. The numbers indicate word frequency.\n", "\n", "Fortunately, we don’t need to build this from scratch — `scikit-learn` provides a built-in [`CountVectorizer`](http://scikit-learn.org/stable/modules/feature_extraction.html#the-bag-of-words-representation) for converting text into a bag-of-words representation.\n", "\n", "Let's try out our little example:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.feature_extraction.text import CountVectorizer" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cntvec = CountVectorizer()\n", "docs = [\"I like trains.\",\n", " \"Trains are like big cars.\",\n", " \"I like big cars\"]\n", "\n", "cntvec.fit_transform(docs).todense(), cntvec.vocabulary_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### N-grams\n", "\n", "N-grams are continuous sequences of $n$ words from a given text. They help capture contextual information that individual words might miss. \n", "\n", "For example, a **2-gram (bigram)** representation of the sentence `\"I like trains.\"` would be:\n", "\n", "```python\n", "[(\"I\", \"like\"), (\"like\", \"trains\")]\n", "```\n", "\n", "By increasing $n$, we can capture more context. 
A **3-gram (trigram)** for the same sentence would be:\n", "\n", "```python\n", "[(\"I\", \"like\", \"trains\")]\n", "```\n", "\n", "N-grams are especially useful for tasks like **text classification**, **speech recognition**, and **predictive text**." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cntvec = CountVectorizer(ngram_range=(2, 2))\n", "cntvec.fit_transform(docs).todense(), cntvec.vocabulary_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Minimum and Maximum Document Frequency\n", "\n", "The parameters **`min_df`** and **`max_df`** of vectorizers such as `CountVectorizer` and `TfidfVectorizer` help control which terms are included in the vocabulary by setting document-frequency thresholds: \n", "\n", "- **`min_df`** (minimum document frequency): Excludes rare words that appear in fewer than the specified number (or percentage) of documents. \n", "- **`max_df`** (maximum document frequency): Removes very common words that appear in more than the specified number (or percentage) of documents. \n", "\n", "This filtering step helps **reduce noise and improve model performance** by focusing on informative terms while ignoring overly rare or common words." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cntvec = CountVectorizer(max_df=1)\n", "cntvec.fit_transform(docs).todense(), cntvec.vocabulary_" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cntvec = CountVectorizer(min_df=3)\n", "cntvec.fit_transform(docs).todense(), cntvec.vocabulary_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Advanced Tokenization \n", "\n", "When building a vocabulary, words are analyzed and transformed to reduce redundancy and improve text representation. \n", "By default, scikit-learn's tokenizer **lowercases the text** and **drops single-character tokens**; stop word removal is optional (via the `stop_words` parameter), and no deeper transformations are applied. \n", "\n", "However, more advanced **Natural Language Processing (NLP) techniques** can help extract meaningful base words. \n", "\n", "### Lemmatization \n", "\n", "**Lemmatization** reduces words to their **dictionary root form** (a valid word you would find in a dictionary). \n", "\n", "For example: \n", "- `\"are\"` → `\"be\"` \n", "- `\"trains\"` → `\"train\"` \n", "- `\"running\"` → `\"run\"` \n", "- `\"better\"` → `\"good\"` \n", "\n", "Lemmatization uses linguistic rules and considers context, making it more accurate than simple stemming. \n", "\n", "### Stemming \n", "\n", "**Stemming** is a simpler technique that removes affixes (prefixes/suffixes) from words to obtain the root form. \n", "Unlike lemmatization, it **does not** consider context or grammar; it just chops off endings. \n", "\n", "Example: \n", "- `\"running\"` → `\"run\"` \n", "- `\"flies\"` → `\"fli\"` \n", "- `\"happily\"` → `\"happili\"` \n", "- `\"better\"` → `\"better\"` (incorrectly unchanged) \n", "\n", "Since stemming is based on crude rules, it sometimes produces non-existent words, but it can still be useful for reducing vocabulary size in search engines or text classification tasks. \n", "\n", "### Using `nltk` for Stemming \n", "\n", "A popular NLP library for stemming and other text-processing tasks is [`Natural Language Toolkit (nltk)`](https://www.nltk.org/). 
\n", "\n", "#### Installation \n", "```bash\n", "conda activate szisz_ds_2025\n", "pip install nltk\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "from nltk.stem import PorterStemmer" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "stemmer = PorterStemmer()\n", "stemmer.stem('trains')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Lemmatization\n", "\n", "Lemmatization is similar to stemming, but with an important difference: **Lemmatization always returns real words**, whereas stemming might produce non-existent ones.\n", "\n", "Lemmatization uses linguistic knowledge (e.g., dictionaries and grammar rules) to find a word’s root form, making it **more accurate than stemming** but also **slower**. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import nltk\n", "nltk.download('wordnet')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "from nltk.stem import WordNetLemmatizer" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "lemmatizer = WordNetLemmatizer()\n", "lemmatizer.lemmatize('are', pos='v')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lemmatization provides better accuracy but is computationally more expensive than stemming. \n", "\n", "We can use **lemmatization** to create **custom text analyzers** in `CountVectorizer`, allowing us to process text more effectively before vectorization!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def split_into_lemmas(message):\n", " message = ''.join([char for char in message.lower()\n", " if char.isalnum() or char.isspace()])\n", " return [lemmatizer.lemmatize(word, pos='v') \n", " for word in message.split()]\n", "\n", "[split_into_lemmas(doc) for doc in docs]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### TextBlob: A User-Friendly NLP Library \n", "\n", "[`TextBlob`](https://textblob.readthedocs.io/en/dev/) is a simple yet powerful NLP library that makes text processing easy. \n", "It provides an intuitive interface for common NLP tasks, including **lemmatization**, **sentiment analysis**, and **POS tagging**. \n", "\n", "**Why use TextBlob?**\n", "- Great for quick and easy text analysis.\n", "- Requires minimal setup.\n", "- Good for beginners in NLP.\n", "\n", "**Installation**:\n", "```bash\n", "conda activate szisz_df_2025\n", "pip install textblob\n", "python -m textblob.download_corpora\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from textblob import TextBlob" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def split_into_lemmas(message):\n", " message = message.lower()\n", " words = TextBlob(message).words\n", " return [word.lemma for word in words]\n", "\n", "[split_into_lemmas(doc) for doc in docs]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cntvec = CountVectorizer(analyzer=split_into_lemmas)\n", "cntvec.fit_transform(docs).todense(), cntvec.vocabulary_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's insert our vectorizer to our pipeline!" 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "splitted = train_test_split(corpus.message,\n", " corpus.label.values,\n", " test_size=.25,\n", " stratify=corpus.label.values,\n", " random_state=42)\n", "X_train, X_test, y_train, y_test = splitted" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pipe = Pipeline([\n", " ('cntvec', CountVectorizer(analyzer=split_into_lemmas, min_df=10, max_df=.5)),\n", " ('nb', MultinomialNB())\n", "])\n", "\n", "pipe.fit(X_train, y_train)\n", "accuracy_score(y_test, pipe.predict(X_test))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "len(pipe['cntvec'].vocabulary_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### SpaCy: A High-Performance NLP Library\n", "\n", "[`spacy`](https://spacy.io/) is a **more advanced** NLP library designed for **speed and accuracy**. \n", "It includes powerful tokenization, lemmatization, named entity recognition, and dependency parsing.\n", "\n", "**Why use SpaCy?**\n", "- **Faster and more efficient** than NLTK and TextBlob.\n", "- Supports **pre-trained language models** for various NLP tasks.\n", "- Ideal for **large-scale** NLP applications.\n", "\n", "**Installation** (*requires admin rights!*):\n", "```bash\n", "conda activate szisz_ds_2025\n", "pip install spacy\n", "python -m spacy download en_core_web_sm\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "import spacy" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "nlp = spacy.load('en_core_web_sm')\n", "[token.lemma_ for token in nlp(docs[0])]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "pd.DataFrame([\n", " {'text': token.text, \n", " 'lemma': token.lemma_, \n", " 'POS': token.pos_, \n", " 'tag': token.tag_, \n", " 'dep': token.dep_,\n", " 'shape': token.shape_,\n", " 'is_alpha': token.is_alpha, \n", " 'is_stop': token.is_stop}\n", " for token in nlp(docs[0])\n", "]).set_index('text').transpose()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "pipe = Pipeline([\n", " ('cntvec', CountVectorizer(analyzer=lambda x: [w.lemma_ for w in nlp(x)], min_df=10, max_df=.5)),\n", " ('nb', MultinomialNB())\n", "])\n", "pipe.fit(X_train, y_train)\n", "accuracy_score(y_test, pipe.predict(X_test))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### b) [TF-IDF](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) \n", "\n", "TF-IDF (Term Frequency - Inverse Document Frequency) is a method for **normalizing word counts** in a document. It helps identify words that are important **within a document** but not overly common **across all documents** in a corpus. \n", "\n", "TF-IDF is the product of two components: \n", "\n", "1. **Term Frequency (TF)** – Measures how often a word appears in a document. \n", " $$\n", " \\mathrm{tf} (t,d) = \\frac{1}{2} + \\frac{f_{t,d}}{2 \\cdot \\max\\{f_{t',d} : t' \\in d\\}}\n", " $$ \n", " where: \n", " - $( f_{t,d} )$ is the count of term $( t )$ in document $( d )$. \n", " - The denominator normalizes by the highest term frequency in ( $d$ ) to avoid bias toward longer documents. \n", "\n", "2️. 
**Inverse Document Frequency (IDF)** – Measures how **rare** a word is across documents. \n", " $$\n", " \\mathrm{idf}(t, D) = \\log \\frac{N}{|\\{d \\in D: t \\in d\\}|}\n", " $$ \n", " where: \n", " - $( N )$ is the total number of documents in the corpus. \n", " - $( |\\{d \\in D: t \\in d\\}| )$ counts how many documents contain the term $( t )$. \n", " - Words that appear in **many** documents get a **low** IDF score, while rare words get a **high** IDF score. \n", "\n", "**Why use TF-IDF?** \n", "- **Better than raw word counts** – avoids favoring common words like \"the\" or \"is.\" \n", "- **Useful for keyword extraction** – identifies important terms in a document. \n", "- **A foundation for search engines** – helps rank relevant documents based on query terms. \n", "\n", "In `scikit-learn`, we can compute TF-IDF using [`TfidfVectorizer`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.feature_extraction.text import TfidfTransformer, TfidfVectorizer" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pipe = Pipeline([\n", " ('cntvec', CountVectorizer(analyzer=split_into_lemmas, min_df=5, max_df=.9)),\n", " ('tfidf', TfidfTransformer()),\n", " ('nb', MultinomialNB())\n", "])\n", "pipe.fit(X_train, y_train)\n", "accuracy_score(y_test, pipe.predict(X_test))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.argsort([5, 3, 7, 9, 1])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for word in np.argsort(pipe['tfidf'].idf_)[-20:][::-1]:\n", " print(word, pipe['cntvec'].get_feature_names_out()[word], pipe['tfidf'].idf_[word])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### c) [Hashing](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html) \n", "\n", "When working with **very large text corpora**, a major challenge is **memory usage**. As the number of documents grows, the vocabulary size increases, requiring more memory to store word indices. \n", "\n", "To overcome this, we use the [**hashing trick**]((http://scikit-learn.org/stable/modules/feature_extraction.html#feature-hashing)), which replaces explicit vocabulary storage with a **fixed-size feature space**. Instead of keeping a dictionary of words, we apply a **hash function** that maps words directly to feature indices. This allows **constant memory usage**, regardless of corpus size! \n", "\n", "**How does it work?** \n", "- Each word is processed by a **hash function** that outputs an integer index. \n", "- The hashed index determines **where** the word contributes in the feature matrix. \n", "- Since hash functions can produce **collisions** (different words mapping to the same index), the method works best with **high-dimensional spaces** to minimize information loss. \n", "\n", "**Advantages:** \n", "- **Memory-efficient** – No need to store a growing vocabulary. \n", "- **Fast transformation** – Works well for streaming or real-time applications. \n", "- **Scalable** – Handles massive datasets without increasing memory footprint. \n", "\n", "**Disadvantages:** \n", "- **No inverse transform** – Once words are hashed, we **lose interpretability** (i.e., we can’t recover the original words). 
\n", "- **Possible collisions** – Different words may be mapped to the same index, causing minor accuracy loss. \n", "\n", "**In `scikit-learn`,** we can use [`HashingVectorizer`](http://scikit-learn.org/stable/modules/feature_extraction.html#feature-hashing) to efficiently transform text data without storing a vocabulary. Let’s try it in action! " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.feature_extraction.text import HashingVectorizer" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pipe = Pipeline([\n", " ('hash', HashingVectorizer(analyzer=split_into_lemmas, n_features=1000, alternate_sign=False)),\n", " ('nb', MultinomialNB())\n", "])\n", "pipe.fit(X_train, y_train)\n", "accuracy_score(y_test, pipe.predict(X_test))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3. Latent Semantic Indexing (LSI) \n", "\n", "_\"Latent Semantic Analysis (LSA) is a technique in natural language processing that analyzes the relationships between a set of documents and the terms they contain by identifying underlying **concepts**. The key idea is that words with similar meanings tend to appear in similar contexts.\"_ — [Wikipedia](https://en.wikipedia.org/wiki/Latent_semantic_analysis) \n", "\n", "#### How Does It Work? \n", "LSA is based on the **distributional hypothesis**, which states that words appearing in similar contexts tend to have similar meanings. To uncover these **hidden structures in text**, we apply **Singular Value Decomposition (SVD)** to a **Tf-Idf matrix** of the corpus. \n", "\n", "Mathematically, given a **term-document matrix** $ A $ (where rows are words, columns are documents, and values are Tf-Idf scores), we apply **SVD**: \n", "\n", "$$\n", "A = U \\Sigma V^T\n", "$$\n", "\n", "where: \n", "- $ U $ contains **word-topic associations** \n", "- $ \\Sigma $ contains **importance weights** of topics \n", "- $ V^T $ contains **document-topic associations** \n", "\n", "By keeping only the **top $ k $ singular values**, we reduce noise and capture the **most important latent topics** in the dataset. \n", "\n", "#### Why Use LSA? \n", "- **Reduces noise** – Helps remove irrelevant word variations (e.g., synonyms). \n", "- **Captures hidden relationships** – Groups words and documents by meaning, not just surface-level similarity. \n", "- **Improves document retrieval** – Useful in search engines and recommendation systems. \n", "\n", "#### Limitations \n", "- **Computationally expensive** – Requires matrix decomposition, which is slower than simpler methods. \n", "- **Fixed topics** – Unlike more advanced methods (e.g., LDA), LSA does not model **topic probabilities**, only a **fixed representation**. \n", "\n", "In `scikit-learn`, LSA can be implemented using [`TruncatedSVD`](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html), which efficiently performs dimensionality reduction. Let’s see it in action! 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.decomposition import TruncatedSVD\n", "from sklearn.svm import SVC" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pipe = Pipeline([\n", " ('cntvec', CountVectorizer(analyzer=split_into_lemmas)),\n", " ('tfidf', TfidfTransformer(sublinear_tf=True)),\n", " ('svd', TruncatedSVD(n_components=300, random_state=42)),\n", " ('svm', SVC(C=300))\n", "])\n", "pipe.fit(X_train, y_train)\n", "accuracy_score(y_test, pipe.predict(X_test))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "feat_names = pipe['cntvec'].get_feature_names_out()\n", "topics = pipe['svd'].components_\n", "topic_str = pipe['svd'].explained_variance_ratio_" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def get_most_important(topic, feat_names):\n", " indeces = np.argsort(topic)[::-1]\n", " terms = [feat_names[weightIndex] for weightIndex in indeces[:10]] \n", " weights = [topic[weightIndex] for weightIndex in indeces[:10]] \n", " return dict(zip(terms, weights))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for i in range(10):\n", " print(i, topic_str[i], get_most_important(topics[i], feat_names))\n", " print('-' * 80)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4. Document Similarity Metrics \n", "\n", "When comparing documents, **Euclidean distance** is often not ideal. Since text data is high-dimensional and sparse, measuring similarity using raw distances can be misleading. Instead, we use **cosine similarity**, which measures the **angle** between two document vectors rather than their absolute distance. \n", "\n", "#### Cosine Similarity Formula \n", "\n", "Given two document vectors $ \\mathbf{A} $ and $ \\mathbf{B} $, the cosine similarity is defined as:\n", "\n", "$$\n", "\\text{cosine similarity}(\\mathbf{A}, \\mathbf{B}) = \\frac{\\mathbf{A} \\cdot \\mathbf{B}}{\\|\\mathbf{A}\\| \\|\\mathbf{B}\\|}\n", "$$\n", "\n", "where: \n", "- $ \\mathbf{A} \\cdot \\mathbf{B} $ is the **dot product** of the two vectors \n", "- $ \\|\\mathbf{A}\\| $ and $ \\|\\mathbf{B}\\| $ are the **Euclidean norms** (magnitudes) of the vectors \n", "\n", "#### Why Cosine Similarity? \n", "- **Insensitive to document length** – Longer documents won’t automatically seem more dissimilar. \n", "- **Captures semantic similarity** – Focuses on relative word usage rather than absolute frequencies. \n", "- **Efficient to compute** – Especially with sparse matrix optimizations. \n", "\n", "Since **Tf-Idf vectors** already capture word importance, cosine similarity can be computed simply by taking the **dot product** of two document vectors. This makes it a natural choice for **document retrieval**, **clustering**, and **classification** tasks. \n", "\n", "Let’s see how to implement it in Python using `scikit-learn`! 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS\n", "from sklearn.feature_extraction.text import TfidfVectorizer" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def split_to_lemmas_and_filter(message):\n", " lemmas = split_into_lemmas(message)\n", " return [lemma for lemma in lemmas \n", " if lemma not in ENGLISH_STOP_WORDS]\n", " \n", "tfidf = TfidfVectorizer(analyzer=split_to_lemmas_and_filter,\n", " min_df=10,\n", " max_df=.5).fit(corpus.message)\n", "vects = tfidf.transform(corpus.message)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "vect = tfidf.transform([corpus.message[0]])\n", "corpus.message[0], vect" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sims = vects.dot(vect.T).toarray().flatten()\n", "most_similar = np.argsort(sims)[-10:][::-1]\n", "\n", "for i, index in enumerate(most_similar):\n", " print(i, sims[index])\n", " print(corpus.message[index])\n", " print('-' * 80)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5. Named Entity Recognition (NER)\n", "\n", "Named Entity Recognition (NER) is a key **Natural Language Processing (NLP)** technique that identifies and categorizes important entities in text, such as **names, locations, dates, organizations, and more**. It helps extract structured information from unstructured text. \n", "\n", "For example, in the sentence: \n", "*\"Apple Inc. is headquartered in Cupertino, California, and was founded by Steve Jobs.\"* \n", "\n", "A NER system would identify: \n", "- **Apple Inc.** → *Organization* \n", "- **Cupertino, California** → *Location* \n", "- **Steve Jobs** → *Person* \n", "\n", "#### Why is NER useful?\n", "NER is widely used for: \n", "- **Information extraction** (e.g., extracting company names from financial reports). \n", "- **Question answering** (e.g., recognizing dates, locations, and names in queries). \n", "- **Content classification** (e.g., tagging people, places, and organizations in news articles). \n", "\n", "#### Tools for NER\n", "There are several NLP libraries that provide pre-trained NER models: \n", "- **spaCy** – A fast and efficient NLP library with built-in NER models. \n", "- **NLTK** – Includes simple NER tools but requires additional training. \n", "- **TextBlob** – Provides a simpler interface for NER tasks. \n", "- **Transformers (Hugging Face)** – State-of-the-art models for entity recognition. \n", "\n", "Let's see how to use **spaCy** for Named Entity Recognition!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Example: Apply NER on a sample text from the dataset\n", "sample_text = X_train.iloc[5] # Take the first training sample\n", "doc = nlp(sample_text)\n", "\n", "# Print detected entities\n", "print(\"Original Text:\", sample_text)\n", "print(\"\\nNamed Entities:\")\n", "for ent in doc.ents:\n", " print(f\"{ent.text} → {ent.label_}\")\n", "\n", "# Visualizing NER results (Jupyter Notebook only)\n", "spacy.displacy.render(doc, style=\"ent\", jupyter=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 6. Sentiment Analysis\n", "\n", "Sentiment analysis is the process of determining whether a piece of text conveys a **positive**, **negative**, or **neutral** sentiment. 
It is widely used in **social media monitoring**, **customer feedback analysis**, and **brand reputation management**.\n", "\n", "There are two main approaches to sentiment analysis:\n", "1. **Lexicon-based methods**: Use predefined dictionaries of words with sentiment scores (e.g., VADER, TextBlob).\n", "2. **Machine learning-based methods**: Train a classifier (e.g., logistic regression, neural networks) on labeled sentiment data.\n", "\n", "In this section, we will demonstrate **sentiment analysis** using the `TextBlob` library, which provides an easy-to-use interface for extracting sentiment polarity from text." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Example function for sentiment analysis\n", "def analyze_sentiment(text):\n", "    blob = TextBlob(text)\n", "    return blob.sentiment.polarity  # Returns a value between -1 (negative) and 1 (positive)\n", "\n", "# Apply sentiment analysis to the dataset\n", "X_train_sentiment = X_train.apply(analyze_sentiment)\n", "X_test_sentiment = X_test.apply(analyze_sentiment)\n", "\n", "# Quick look at results\n", "sample_texts = X_test[:5]\n", "sample_sentiments = X_test_sentiment[:5]\n", "\n", "for text, sentiment in zip(sample_texts, sample_sentiments):\n", "    sentiment_label = \"Positive\" if sentiment > 0 else \"Negative\" if sentiment < 0 else \"Neutral\"\n", "    print(f\"Text: {text}\\nSentiment Score: {sentiment:.2f} ({sentiment_label})\\n\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Model of the Week:\n", "### Neural Networks\n", "\n", "\n", "\n", "Artificial Neural Networks (ANNs) are a supervised machine learning method used for both classification and regression tasks. Inspired by the structure and functioning of the human brain, ANNs consist of basic computational units called [**neuron**](https://en.wikipedia.org/wiki/Perceptron)s and the connections between them.\n", "\n", "Each **neuron** performs a simple computation: it takes multiple inputs, applies a weighted summation, and then passes the result through an activation function. Due to their simplicity, individual neurons can only solve **linearly separable** problems. This mechanism is mathematically expressed as:\n", "\n", "$$\n", "y = f\\left(\\sum_{i} w_{i} x_{i} \\right)\n", "$$\n", "\n", "where:\n", "- $ y $ is the neuron's output,\n", "- $ x_i $ are the input features,\n", "- $ w_i $ are the weights assigned to each input,\n", "- $ f $ is an activation function that introduces non-linearity.\n", "\n", "### Learning Process\n", "\n", "The key to neural networks is their ability to **learn** by adjusting weights based on errors. The simplest learning rule, used in the perceptron model, updates weights as follows:\n", "\n", "$$\n", "w_{i}(t+1) = w_{i}(t) + (d_{j} - y_{j}(t)) x_{j,i}\n", "$$\n", "\n", "where:\n", "- $ d_j $ is the expected (true) output for the $ j $th input,\n", "- $ y_j(t) $ is the predicted output,\n", "- $ w_i(t) $ is the weight at time step $ t $.\n", "\n", "Through repeated weight updates, the network gradually improves its accuracy in predicting outputs. However, a single-layer perceptron is still limited to solving linearly separable problems. To overcome this, we introduce **multi-layer networks**, which allow for more complex decision boundaries." 
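, "\n", "\n", "To make the update rule above concrete, here is a minimal NumPy sketch of a single neuron trained with exactly this rule on the linearly separable AND problem (a toy illustration added for clarity; the scikit-learn `Perceptron` used below is the real implementation):\n", "\n", "```python\n", "import numpy as np\n", "\n", "# Logical AND; the first column is a constant bias input of 1.\n", "X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]])\n", "d = np.array([0, 0, 0, 1])\n", "\n", "w = np.zeros(3)\n", "\n", "def step(v):  # threshold activation f\n", "    return int(v >= 0)\n", "\n", "for epoch in range(10):\n", "    for x_j, d_j in zip(X, d):\n", "        y_j = step(w @ x_j)        # predicted output y_j(t)\n", "        w = w + (d_j - y_j) * x_j  # w_i(t+1) = w_i(t) + (d_j - y_j(t)) x_{j,i}\n", "\n", "print(w, [step(w @ x_j) for x_j in X])\n", "```"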
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### **The XOR Problem & Non-Linearity Issue**\n", "To illustrate the limitations of a single-layer perceptron, consider the XOR classification problem:\n", "\n", "| $x_1$ | $x_2$ | Output |\n", "|---|---|---|\n", "| 0 | 0 | A |\n", "| 0 | 1 | B |\n", "| 1 | 0 | B |\n", "| 1 | 1 | A |" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import Perceptron" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "XOR_X, XOR_y = np.array([[0, 0], [0, 1], [1, 0], [1, 1]]), np.array([0, 1, 1, 0])\n", "df = pd.DataFrame(data=XOR_X, columns=['x', 'y'])\n", "df['label'] = XOR_y" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def plot_results_with_hyperplane(clf, clf_name, df, ax):\n", " x_min, x_max = df.x.min() - .5, df.x.max() + .5\n", " y_min, y_max = df.y.min() - .5, df.y.max() + .5\n", "\n", " xx, yy = np.meshgrid(np.arange(x_min, x_max, .02), np.arange(y_min, y_max, .02))\n", " Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])\n", " Z = Z.reshape(xx.shape)\n", " \n", " ax.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired, shading='auto')\n", " ax.scatter(df.x, df.y, c=df.label, edgecolors='k')\n", " ax.set_title(clf_name)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "perceptron = Perceptron(verbose=2, random_state=42).fit(XOR_X, XOR_y)\n", "perceptron" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots()\n", "plot_results_with_hyperplane(perceptron, 'perceptron', df, ax);" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import confusion_matrix\n", "conf_mat = confusion_matrix(XOR_y, perceptron.predict(XOR_X))\n", "conf_mat" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sns.heatmap(conf_mat, annot=True, cbar=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Solving Non-Linear Problems\n", "\n", "As we can see, a single neuron is not able to solve this non-linear problem. But they are not called **networks** for nothing! The power of artificial neural networks lies in their topology. If we connect more **neurons**, we get a (real) neural network. The neurons are organized into **layers**. The first layer is the **input layer**, followed by zero or more **hidden layer**(s), and finally, the **output layer**. Each layer can contain any number of neurons. Different topologies lead to different ANN subtypes.\n", "\n", "\n", "\n", "The simplest form of ANN is the **Multi-Layer Perceptron (MLP)**, which consists of:\n", "- An **input layer** receiving raw data\n", "- One or more **hidden layers** that transform the data\n", "- An **output layer** producing predictions\n", "\n", "To allow for non-linearity, we use output (activation) functions such as:\n", "- **Sigmoid**: $y(v_i) = \\frac{1}{1 + e^{-v_i}}$ (good for probabilities)\n", "- **Tanh**: $y(v_i) = \\tanh(v_i)$ (better for zero-centered outputs)\n", "- **ReLU**: $y(v_i) = \\max(0, v_i)$ (common in deep networks)\n", "\n", "The weight updating algorithm is called [**Backpropagation**](https://en.wikipedia.org/wiki/Backpropagation). It propagates the errors backward through the network, updating the weights of every neuron that contributed to the error. 
This follows the principles of [gradient descent](https://en.wikipedia.org/wiki/Backpropagation#Derivation), adjusting the weights based on the partial derivatives of the loss function." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.neural_network import MLPClassifier" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mlp = MLPClassifier(random_state=42).fit(XOR_X, XOR_y)\n", "mlp" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots()\n", "plot_results_with_hyperplane(mlp, 'mlp', df, ax)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "conf_mat = confusion_matrix(XOR_y, mlp.predict(XOR_X))\n", "conf_mat" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sns.heatmap(conf_mat, annot=True, cbar=False)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "A super nice tutorial can be found [here](https://github.com/rasbt/python-machine-learning-book/blob/master/code/ch12/ch12.ipynb), it is worth checking out." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Key Hyperparameters in MLPClassifier\n", "- **`hidden_layer_sizes`**: Defines the number of neurons per hidden layer (e.g., `(4,)` means one hidden layer with 4 neurons)\n", "- **`activation`**: Common choices include `\"relu\"`, `\"tanh\"`, `\"logistic\"` (sigmoid)\n", "- **`solver`**: Optimization algorithm (`\"adam\"` is recommended for most cases)\n", "- **`max_iter`**: Number of training iterations (increase if the model doesn’t converge)\n", "\n", "This simple MLP demonstrates how adding hidden layers allows the network to learn **non-linear** patterns, solving problems that a single-layer perceptron cannot." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "szisz_ds_2025", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.9" } }, "nbformat": 4, "nbformat_minor": 1 }