{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Text-Klassifikations-Beispiel\n", "\n", "Das Beispiel basiert auf einem [offenen Datensat](http://qwone.com/~jason/20Newsgroups/) von [Newsgroup-Nachtrichten](https://de.wikipedia.org/wiki/Newsgroup) und orientiert sich an [diesem offiziellen Tutorial](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html) von scikit-learn zur Textanalyse. \n", "\n", "Wir nutzen Dokumente von mehreren Newsgroups und trainieren damit einen Classifier, der dann ein Zudornung von neuen Texten auf eine dieser Gruppen machen kann. Sprich die Newsgroups stellen die Klassen dar." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# In diesem Fall liegen die Daten noch nicht als Teil von scikit-learn\n", "# vor, es wird aber eine Funktion angeboten, mit die Daten bezogen werden können.\n", "from sklearn.datasets import fetch_20newsgroups" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Festlegen von vier Newsgroups, die wir nutzen wollen.\n", "selected_categories = [\"sci.crypt\", \"sci.electronics\", \"sci.med\", \"sci.space\"]" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# Beziehen der Trainingset- und Testsets-Dokumente\n", "newsgroup_posts_train = fetch_20newsgroups(\n", " data_home=\"newsgroup_data\",\n", " subset='train',\n", " categories=selected_categories,\n", " shuffle=True, random_state=1)\n", "newsgroup_posts_test = fetch_20newsgroups(\n", " data_home=\"newsgroup_data\",\n", " subset='test',\n", " categories=selected_categories,\n", " shuffle=True, random_state=1)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "sklearn.utils._bunch.Bunch" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Die Objekte, die wir erhalten, sind scikit-learn-Bunches.\n", "type(newsgroup_posts_train)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['DESCR', 'data', 'filenames', 'target', 'target_names']" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Und haben die üblichen Atribute von Bunches\n", "dir(newsgroup_posts_train)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ ".. _20newsgroups_dataset:\n", "\n", "The 20 newsgroups text dataset\n", "------------------------------\n", "\n", "The 20 newsgroups dataset comprises around 18000 newsgroups posts on\n", "20 topics split in two subsets: one for training (or development)\n", "and the other one for testing (or for performance evaluation). The split\n", "between the train and test set is based upon a messages posted before\n", "and after a specific date.\n", "\n", "This module contains two loaders. 
The first one,\n", ":func:`sklearn.datasets.fetch_20newsgroups`,\n", "returns a list of the raw texts that can be fed to text feature\n", "extractors such as :class:`~sklearn.feature_extraction.text.CountVectorizer`\n", "with custom parameters so as to extract feature vectors.\n", "The second one, :func:`sklearn.datasets.fetch_20newsgroups_vectorized`,\n", "returns ready-to-use features, i.e., it is not necessary to use a feature\n", "extractor.\n", "\n", "**Data Set Characteristics:**\n", "\n", "    =================   ==========\n", "    Classes                     20\n", "    Samples total            18846\n", "    Dimensionality               1\n", "    Features                  text\n", "    =================   ==========\n", "\n", "Usage\n", "~~~~~\n", "\n", "The :func:`sklearn.datasets.fetch_20newsgroups` function is a data\n", "fetching / caching function that downloads the data archive from\n", "the original `20 newsgroups website`_, extracts the archive contents\n", "in the ``~/scikit_learn_data/20news_home`` folder and calls the\n", ":func:`sklearn.datasets.load_files` on either the training or\n", "testing set folder, or both of them::\n", "\n", " >>> from sklearn.datasets import fetch_20newsgroups\n", " >>> newsgroups_train = fetch_20newsgroups(subset='train')\n", "\n", " >>> from pprint import pprint\n", " >>> pprint(list(newsgroups_train.target_names))\n", " ['alt.atheism',\n", " 'comp.graphics',\n", " 'comp.os.ms-windows.misc',\n", " 'comp.sys.ibm.pc.hardware',\n", " 'comp.sys.mac.hardware',\n", " 'comp.windows.x',\n", " 'misc.forsale',\n", " 'rec.autos',\n", " 'rec.motorcycles',\n", " 'rec.sport.baseball',\n", " 'rec.sport.hockey',\n", " 'sci.crypt',\n", " 'sci.electronics',\n", " 'sci.med',\n", " 'sci.space',\n", " 'soc.religion.christian',\n", " 'talk.politics.guns',\n", " 'talk.politics.mideast',\n", " 'talk.politics.misc',\n", " 'talk.religion.misc']\n", "\n", "The real data lies in the ``filenames`` and ``target`` attributes. The target\n", "attribute is the integer index of the category::\n", "\n", " >>> newsgroups_train.filenames.shape\n", " (11314,)\n", " >>> newsgroups_train.target.shape\n", " (11314,)\n", " >>> newsgroups_train.target[:10]\n", " array([ 7,  4,  4,  1, 14, 16, 13,  3,  2,  4])\n", "\n", "It is possible to load only a sub-selection of the categories by passing the\n", "list of the categories to load to the\n", ":func:`sklearn.datasets.fetch_20newsgroups` function::\n", "\n", " >>> cats = ['alt.atheism', 'sci.space']\n", " >>> newsgroups_train = fetch_20newsgroups(subset='train', categories=cats)\n", "\n", " >>> list(newsgroups_train.target_names)\n", " ['alt.atheism', 'sci.space']\n", " >>> newsgroups_train.filenames.shape\n", " (1073,)\n", " >>> newsgroups_train.target.shape\n", " (1073,)\n", " >>> newsgroups_train.target[:10]\n", " array([0, 1, 1, 1, 0, 1, 1, 0, 0, 0])\n", "\n", "Converting text to vectors\n", "~~~~~~~~~~~~~~~~~~~~~~~~~~\n", "\n", "In order to feed predictive or clustering models with the text data,\n", "one first needs to turn the text into vectors of numerical values suitable\n", "for statistical analysis. This can be achieved with the utilities of the\n", "``sklearn.feature_extraction.text`` module as demonstrated in the following\n", "example that extracts `TF-IDF`_ vectors of unigram tokens\n", "from a subset of 20news::\n", "\n", " >>> from sklearn.feature_extraction.text import TfidfVectorizer\n", " >>> categories = ['alt.atheism', 'talk.religion.misc',\n", " ... 'comp.graphics', 'sci.space']\n", " >>> newsgroups_train = fetch_20newsgroups(subset='train',\n", " ... 
categories=categories)\n", " >>> vectorizer = TfidfVectorizer()\n", " >>> vectors = vectorizer.fit_transform(newsgroups_train.data)\n", " >>> vectors.shape\n", " (2034, 34118)\n", "\n", "The extracted TF-IDF vectors are very sparse, with an average of 159 non-zero\n", "components per sample in a more than 30000-dimensional space\n", "(less than .5% non-zero features)::\n", "\n", " >>> vectors.nnz / float(vectors.shape[0])\n", " 159.01327...\n", "\n", ":func:`sklearn.datasets.fetch_20newsgroups_vectorized` is a function which\n", "returns ready-to-use token count features instead of file names.\n", "\n", ".. _`20 newsgroups website`: http://people.csail.mit.edu/jrennie/20Newsgroups/\n", ".. _`TF-IDF`: https://en.wikipedia.org/wiki/Tf-idf\n", "\n", "\n", "Filtering text for more realistic training\n", "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n", "\n", "It is easy for a classifier to overfit on particular things that appear in the\n", "20 Newsgroups data, such as newsgroup headers. Many classifiers achieve very\n", "high F-scores, but their results would not generalize to other documents that\n", "aren't from this window of time.\n", "\n", "For example, let's look at the results of a multinomial Naive Bayes classifier,\n", "which is fast to train and achieves a decent F-score::\n", "\n", " >>> from sklearn.naive_bayes import MultinomialNB\n", " >>> from sklearn import metrics\n", " >>> newsgroups_test = fetch_20newsgroups(subset='test',\n", " ... categories=categories)\n", " >>> vectors_test = vectorizer.transform(newsgroups_test.data)\n", " >>> clf = MultinomialNB(alpha=.01)\n", " >>> clf.fit(vectors, newsgroups_train.target)\n", " MultinomialNB(alpha=0.01, class_prior=None, fit_prior=True)\n", "\n", " >>> pred = clf.predict(vectors_test)\n", " >>> metrics.f1_score(newsgroups_test.target, pred, average='macro')\n", " 0.88213...\n", "\n", "(The example :ref:`sphx_glr_auto_examples_text_plot_document_classification_20newsgroups.py` shuffles\n", "the training and test data, instead of segmenting by time, and in that case\n", "multinomial Naive Bayes gets a much higher F-score of 0.88. Are you suspicious\n", "yet of what's going on inside this classifier?)\n", "\n", "Let's take a look at what the most informative features are:\n", "\n", " >>> import numpy as np\n", " >>> def show_top10(classifier, vectorizer, categories):\n", " ...     feature_names = vectorizer.get_feature_names_out()\n", " ...     for i, category in enumerate(categories):\n", " ...         top10 = np.argsort(classifier.coef_[i])[-10:]\n", " ...         
print(\"%s: %s\" % (category, \" \".join(feature_names[top10])))\n", " ...\n", " >>> show_top10(clf, vectorizer, newsgroups_train.target_names)\n", " alt.atheism: edu it and in you that is of to the\n", " comp.graphics: edu in graphics it is for and of to the\n", " sci.space: edu it that is in and space to of the\n", " talk.religion.misc: not it you in is that and to of the\n", "\n", "\n", "You can now see many things that these features have overfit to:\n", "\n", "- Almost every group is distinguished by whether headers such as\n", " ``NNTP-Posting-Host:`` and ``Distribution:`` appear more or less often.\n", "- Another significant feature involves whether the sender is affiliated with\n", " a university, as indicated either by their headers or their signature.\n", "- The word \"article\" is a significant feature, based on how often people quote\n", " previous posts like this: \"In article [article ID], [name] <[e-mail address]>\n", " wrote:\"\n", "- Other features match the names and e-mail addresses of particular people who\n", " were posting at the time.\n", "\n", "With such an abundance of clues that distinguish newsgroups, the classifiers\n", "barely have to identify topics from text at all, and they all perform at the\n", "same high level.\n", "\n", "For this reason, the functions that load 20 Newsgroups data provide a\n", "parameter called **remove**, telling it what kinds of information to strip out\n", "of each file. **remove** should be a tuple containing any subset of\n", "``('headers', 'footers', 'quotes')``, telling it to remove headers, signature\n", "blocks, and quotation blocks respectively.\n", "\n", " >>> newsgroups_test = fetch_20newsgroups(subset='test',\n", " ... remove=('headers', 'footers', 'quotes'),\n", " ... categories=categories)\n", " >>> vectors_test = vectorizer.transform(newsgroups_test.data)\n", " >>> pred = clf.predict(vectors_test)\n", " >>> metrics.f1_score(pred, newsgroups_test.target, average='macro')\n", " 0.77310...\n", "\n", "This classifier lost over a lot of its F-score, just because we removed\n", "metadata that has little to do with topic classification.\n", "It loses even more if we also strip this metadata from the training data:\n", "\n", " >>> newsgroups_train = fetch_20newsgroups(subset='train',\n", " ... remove=('headers', 'footers', 'quotes'),\n", " ... categories=categories)\n", " >>> vectors = vectorizer.fit_transform(newsgroups_train.data)\n", " >>> clf = MultinomialNB(alpha=.01)\n", " >>> clf.fit(vectors, newsgroups_train.target)\n", " MultinomialNB(alpha=0.01, class_prior=None, fit_prior=True)\n", "\n", " >>> vectors_test = vectorizer.transform(newsgroups_test.data)\n", " >>> pred = clf.predict(vectors_test)\n", " >>> metrics.f1_score(newsgroups_test.target, pred, average='macro')\n", " 0.76995...\n", "\n", "Some other classifiers cope better with this harder version of the task. Try\n", "running :ref:`sphx_glr_auto_examples_model_selection_grid_search_text_feature_extraction.py` with and without\n", "the ``--filter`` option to compare the results.\n", "\n", ".. topic:: Data Considerations\n", "\n", " The Cleveland Indians is a major league baseball team based in Cleveland,\n", " Ohio, USA. 
In December 2020, it was reported that \"After several months of\n", " discussion sparked by the death of George Floyd and a national reckoning over\n", " race and colonialism, the Cleveland Indians have decided to change their\n", " name.\" Team owner Paul Dolan \"did make it clear that the team will not make\n", " its informal nickname -- the Tribe -- its new team name.\" \"It's not going to\n", " be a half-step away from the Indians,\" Dolan said. \"We will not have a Native\n", " American-themed name.\"\n", "\n", " https://www.mlb.com/news/cleveland-indians-team-name-change\n", "\n", ".. topic:: Recommendation\n", "\n", " - When evaluating text classifiers on the 20 Newsgroups data, you\n", " should strip newsgroup-related metadata. In scikit-learn, you can do this\n", " by setting ``remove=('headers', 'footers', 'quotes')``. The F-score will be\n", " lower because it is more realistic.\n", " - This text dataset contains data which may be inappropriate for certain NLP\n", " applications. An example is listed in the \"Data Considerations\" section\n", " above. The challenge with using current text datasets in NLP for tasks such\n", " as sentence completion, clustering, and other applications is that text\n", " that is culturally biased and inflammatory will propagate biases. This\n", " should be taken into consideration when using the dataset, reviewing the\n", " output, and the bias should be documented.\n", "\n", ".. topic:: Examples\n", "\n", " * :ref:`sphx_glr_auto_examples_model_selection_grid_search_text_feature_extraction.py`\n", "\n", " * :ref:`sphx_glr_auto_examples_text_plot_document_classification_20newsgroups.py`\n", "\n" ] } ], "source": [ "print(newsgroup_posts_train.DESCR)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "From: pmetzger@snark.shearson.com (Perry E. Metzger)\n", "Subject: Do we need the clipper for cheap security?\n", "Organization: Partnership for an America Free Drug\n", "Lines: 53\n", "\n", "amanda@intercon.com (Amanda Walker) writes:\n", ">> The answer seems obvious to me, they wouldn't. There is other hardware \n", ">> out there not compromised. DES as an example (triple DES as a better \n", ">> one.) \n", ">\n", ">So, where can I buy a DES-encrypted cellular phone? How much does it cost?\n", ">Personally, Cylink stuff is out of my budget for personal use :)...\n", "\n", "If the Clipper chip can do cheap crypto for the masses, obviously one\n", "could do the same thing WITHOUT building in back doors.\n", "\n", "Indeed, even without special engineering, you can construct a good\n", "system right now. A standard codec chip, a chip to do vocoding, a DES\n", "chip, a V32bis integrated modem module, and a small processor to do\n", "glue work, are all you need to have a secure phone. You can dump one\n", "or more of the above if you have a fast processor. With integration,\n", "you could put all of them onto a single chip -- and in the future they\n", "can be.\n", "\n", "Yes, cheap crypto is good -- but we don't need it from the government.\n", "You can do everything the clipper chip can do without needing it to be\n", "compromised. 
When the White House releases stuff saying \"this is good\n", "because it gives people privacy\", note that we didn't need them to\n", "give us privacy, the capability is available using commercial hardware\n", "right now.\n", "\n", "Indeed, were it not for the government doing everything possible to\n", "stop them, Qualcomm would have designed strong encryption right in to\n", "the CDMA cellular phone system they are pioneering. Were it not for\n", "the NSA and company, cheap encryption systems would be everywhere. As\n", "it is, they try every trick in the book to stop it. Had it not been\n", "for them, I'm sure cheap secure phones would be out right now.\n", "\n", "They aren't the ones making cheap crypto available. They are the ones\n", "keeping cheap crypto out of people's hands. When they hand you a\n", "clipper chip, what you are getting is a mess of pottage -- your prize\n", "for having traded in your birthright.\n", "\n", "And what did we buy with our birthright? Did we get safety from\n", "foreigners? No. They can read conference papers as well as anyone else\n", "and are using strong cryptography. Did we get safety from professional\n", "terrorists? I suspect that they can get cryptosystems themselves on\n", "the open market that work just fine -- most of them can't be idiots\n", "like the guys that bombed the trade center. Are we getting cheaper\n", "crypto for ourselves? No, because the market would have provided that\n", "on its own had they not deliberately sabotaged it.\n", "\n", "Someone please tell me what exactly we get in our social contract in\n", "exchange for giving up our right to strong cryptography?\n", "--\n", "Perry Metzger\t\tpmetzger@shearson.com\n", "--\n", "Laissez faire, laissez passer. Le monde va de lui meme.\n", "\n" ] } ], "source": [ "# The data, however, consists of newsgroup messages:\n", "# An example:\n", "print(newsgroup_posts_train.data[6])" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['sci.crypt', 'sci.electronics', 'sci.med', 'sci.space']\n" ] } ], "source": [ "print(newsgroup_posts_train.target_names)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'sci.crypt'" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The targets are the newsgroups; each target value is an index into target_names.\n", "newsgroup_posts_train.target_names[newsgroup_posts_train.target[6]]" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "# To count the words, but also to remove stop words and to tokenize, we use\n", "# an object of the CountVectorizer class\n", "# https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html\n", "from sklearn.feature_extraction.text import CountVectorizer" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "count_vect = CountVectorizer()" ] },
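{ "cell_type": "markdown", "metadata": {}, "source": [ "As a quick illustration, the next cell runs the vectorizer on a made-up two-document mini-corpus (purely for demonstration, not part of the original tutorial flow): `CountVectorizer` tokenizes each document, builds a vocabulary over all documents, and counts how often each token occurs." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Hypothetical mini-corpus, just to make tokenization and counting visible.\n", "demo_vectorizer = CountVectorizer()\n", "demo_counts = demo_vectorizer.fit_transform(\n", "    [\"the cat sat on the mat\", \"the dog chased the cat\"])\n", "# One column per vocabulary token, one row per document.\n", "print(demo_vectorizer.get_feature_names_out())\n", "print(demo_counts.toarray())" ] },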
{ "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "CountVectorizer()" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Fit the vectorizer on the training documents to build the vocabulary.\n", "count_vect.fit(newsgroup_posts_train.data)" ] },
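{ "cell_type": "markdown", "metadata": {}, "source": [ "The fitted vectorizer can now turn each document into a sparse vector of token counts. The variable name `X_train_counts` is an assumption here; it follows the scikit-learn tutorial linked above." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Sparse matrix: one row per document, one column per vocabulary token.\n", "X_train_counts = count_vect.transform(newsgroup_posts_train.data)\n", "X_train_counts.shape" ] },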
{ "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "TfidfTransformer(use_idf=False)" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Scale the raw counts to term frequencies; use_idf=False switches the\n", "# inverse-document-frequency weighting off, leaving plain tf scaling.\n", "from sklearn.feature_extraction.text import TfidfTransformer\n", "\n", "tf_transformer = TfidfTransformer(use_idf=False)\n", "tf_transformer.fit(X_train_counts)" ] },
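{ "cell_type": "markdown", "metadata": {}, "source": [ "Applying the fitted transformer yields the scaled feature matrix that the classifier is trained on (the name `X_train_tf` again follows the scikit-learn tutorial):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_train_tf = tf_transformer.transform(X_train_counts)" ] },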
{ "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RandomForestClassifier()" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# As classifier we use a random forest with default parameters.\n", "from sklearn.ensemble import RandomForestClassifier\n", "\n", "clf = RandomForestClassifier()\n", "clf" ] },
{ "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RandomForestClassifier()" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Train the classifier; the targets are the newsgroup indices.\n", "clf.fit(X_train_tf, newsgroup_posts_train.target)" ] },
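{ "cell_type": "markdown", "metadata": {}, "source": [ "Finally, a minimal sketch of how the trained model can be evaluated on the held-out test set: the test documents have to go through exactly the same, already fitted, vectorizer and transformer (via `transform`, not `fit_transform`) before they are passed to the classifier. The variable names below are assumptions, not part of the original flow." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn import metrics\n", "\n", "# Apply the already-fitted vectorizer and transformer to the test documents.\n", "X_test_counts = count_vect.transform(newsgroup_posts_test.data)\n", "X_test_tf = tf_transformer.transform(X_test_counts)\n", "\n", "# Predict and report per-class precision, recall, and F1.\n", "predictions = clf.predict(X_test_tf)\n", "print(metrics.classification_report(\n", "    newsgroup_posts_test.target, predictions,\n", "    target_names=newsgroup_posts_test.target_names))" ] } ], "metadata": {}, "nbformat": 4, "nbformat_minor": 4 }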