# Text classification example

This example is based on an [open dataset](http://qwone.com/~jason/20Newsgroups/) of [newsgroup messages](https://de.wikipedia.org/wiki/Newsgroup) and follows [this official tutorial](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html) from scikit-learn on text analysis.

We use documents from several newsgroups to train a classifier that can then assign new texts to one of these groups. In other words, the newsgroups are the classes.

In [2]:
# In this case the data does not ship with scikit-learn itself,
# but a function is provided that downloads it.
from sklearn.datasets import fetch_20newsgroups

In [3]:
# Select the four newsgroups we want to use.
selected_categories = ["sci.crypt", "sci.electronics", "sci.med", "sci.space"]

In [4]:
# Fetch the training-set and test-set documents
newsgroup_posts_train = fetch_20newsgroups(
 data_home="newsgroup_data",
 subset='train',
 categories=selected_categories,
 shuffle=True, random_state=1)
newsgroup_posts_test = fetch_20newsgroups(
 data_home="newsgroup_data",
 subset='test',
 categories=selected_categories,
 shuffle=True, random_state=1)

In [5]:
# The objects we get back are scikit-learn Bunches.
type(newsgroup_posts_train)

sklearn.utils.Bunch

In [6]:
# They have the usual attributes of Bunches
dir(newsgroup_posts_train)

['DESCR', 'data', 'filenames', 'target', 'target_names']
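
A Bunch behaves like a dictionary whose keys can also be read as attributes. The following is a minimal stand-in written for illustration only; `MiniBunch` and its toy contents are hypothetical and not part of scikit-learn.

```python
class MiniBunch(dict):
    """Hypothetical stand-in: a dict whose keys are also readable as attributes."""

    def __getattr__(self, key):
        try:
            return self[key]
        except KeyError:
            raise AttributeError(key) from None

    def __dir__(self):
        # Expose the keys, so that dir() lists them like attributes.
        return list(self.keys())


toy = MiniBunch(
    data=["first post", "second post"],
    target=[0, 1],
    target_names=["sci.crypt", "sci.med"],
)

# Attribute access and key access return the same object.
same_object = toy.data is toy["data"]
listed = sorted(dir(toy))
```

This is why `newsgroup_posts_train.data` and `newsgroup_posts_train["data"]` can be used interchangeably.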

In [7]:
print(newsgroup_posts_train.DESCR)

.. _20newsgroups_dataset:

The 20 newsgroups text dataset
------------------------------

The 20 newsgroups dataset comprises around 18000 newsgroups posts on
20 topics split in two subsets: one for training (or development)
and the other one for testing (or for performance evaluation). The split
between the train and test set is based upon a messages posted before
and after a specific date.

This module contains two loaders. The first one,
:func:`sklearn.datasets.fetch_20newsgroups`,
returns a list of the raw texts that can be fed to text feature
extractors such as :class:`sklearn.feature_extraction.text.CountVectorizer`
with custom parameters so as to extract feature vectors.
The second one, :func:`sklearn.datasets.fetch_20newsgroups_vectorized`,
returns ready-to-use features, i.e., it is not necessary to use a feature
extractor.

**Data Set Characteristics:**

 Classes 20
 Samples total 18846
 Dimensionality 1
 Features text

Usage
~~~~~

The :func:`sklearn.datasets.fetch_20newsgrou

In [8]:
# The data itself, however, consists of newsgroup messages.
# An example:
print(newsgroup_posts_train.data[6])

From: pmetzger@snark.shearson.com (Perry E. Metzger)
Subject: Do we need the clipper for cheap security?
Organization: Partnership for an America Free Drug
Lines: 53

amanda@intercon.com (Amanda Walker) writes:
>> The answer seems obvious to me, they wouldn't. There is other hardware 
>> out there not compromised. DES as an example (triple DES as a better 
>> one.) 
>
>So, where can I buy a DES-encrypted cellular phone? How much does it cost?
>Personally, Cylink stuff is out of my budget for personal use :)...

If the Clipper chip can do cheap crypto for the masses, obviously one
could do the same thing WITHOUT building in back doors.

Indeed, even without special engineering, you can construct a good
system right now. A standard codec chip, a chip to do vocoding, a DES
chip, a V32bis integrated modem module, and a small processor to do
glue work, are all you need to have a secure phone. You can dump one
or more of the above if you have a fast processor. With integration,
you could put

In [40]:
print(newsgroup_posts_train.target_names)

['sci.crypt', 'sci.electronics', 'sci.med', 'sci.space']


In [9]:
# The targets are the newsgroups (stored as integer indices into target_names)
newsgroup_posts_train.target_names[newsgroup_posts_train.target[6]]

'sci.crypt'
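
The lookup above works because `target` holds one integer per document and each integer is an index into `target_names`. A small sketch with made-up labels (the `target` values below are toy data, not the real dataset) shows the translation and a per-class count:

```python
from collections import Counter

# Toy stand-ins for target_names and target (values chosen for illustration).
target_names = ["sci.crypt", "sci.electronics", "sci.med", "sci.space"]
target = [0, 3, 2, 0, 1, 3, 0]

# Translate every integer label into its newsgroup name.
labels = [target_names[i] for i in target]

# Count how many documents fall into each class.
class_counts = Counter(labels)
```

Counting the classes this way is a quick check for whether the training set is roughly balanced.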

In [10]:
# To count the words and to tokenize the text we use an object of the
# CountVectorizer class. It can also remove stop words, but with the default
# stop_words=None used here no stop words are removed.
# https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
from sklearn.feature_extraction.text import CountVectorizer

In [11]:
count_vect = CountVectorizer()

In [12]:
count_vect.fit(newsgroup_posts_train.data)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
 dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
 lowercase=True, max_df=1.0, max_features=None, min_df=1,
 ngram_range=(1, 1), preprocessor=None, stop_words=None,
 strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
 tokenizer=None, vocabulary=None)
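
What fitting does can be approximated in a few lines of plain Python. This is a simplified sketch, not scikit-learn's implementation: it lowercases the text, applies the same `token_pattern` as in the repr above, and assigns column indices in alphabetical order. The helper names and the toy documents are assumptions made for illustration.

```python
import re

# Same pattern as in the repr above: tokens of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and split it into tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def build_vocabulary(docs):
    """Map every distinct token to a column index, in alphabetical order."""
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}


toy_docs = ["A DES chip is a chip.", "The modem needs a DES chip!"]
toy_vocabulary = build_vocabulary(toy_docs)
```

Note that the single-character word "a" is dropped by the pattern, and punctuation never becomes part of a token.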

In [13]:
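# Across all documents we get the following number of distinct words
# (get_feature_names() was removed in scikit-learn 1.2; newer versions
# provide get_feature_names_out() instead)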
len(count_vect.get_feature_names())

38683

In [14]:
# We can take a look at a few of them
count_vect.get_feature_names()[10000:10050]

['cellar',
 'cellphone',
 'cells',
 'cellsat',
 'cellular',
 'cellulars',
 'celluloid',
 'celp',
 'celsius',
 'cement',
 'cen',
 'censoring',
 'censorship',
 'censure',
 'census',
 'cent',
 'centaur',
 'centauri',
 'centaurs',
 'centennial',
 'center',
 'centered',
 'centerline',
 'centerpiece',
 'centers',
 'centigrade',
 'centimeter',
 'centimeters',
 'central',
 'centralia',
 'centralised',
 'centralism',
 'centralization',
 'centralize',
 'centralized',
 'centrally',
 'centre',
 'centres',
 'centrifuge',
 'centronic',
 'cents',
 'centure',
 'centuries',
 'century',
 'ceo',
 'cepek',
 'cephalopods',
 'cept',
 'ceramic',
 'cereal']

In [15]:
# or even get the vocabulary dictionary, which maps each word to its
# column index in the count matrix (not to its number of occurrences)
print(count_vect.vocabulary_)



In [16]:
# For the classifier we have to transform the documents into a matrix of word counts
X_train_counts = count_vect.transform(newsgroup_posts_train.data)

In [17]:
X_train_counts.shape

(2373, 38683)
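
The shape reads as (number of documents, size of the vocabulary). A plain-Python sketch of how one row of such a matrix could be produced is shown below; the vocabulary and documents are hypothetical, and dense lists are used for readability, whereas scikit-learn returns a sparse matrix.

```python
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def count_row(text, vocabulary):
    """Return one row of counts; tokens outside the vocabulary are ignored."""
    counts = Counter(TOKEN_PATTERN.findall(text.lower()))
    row = [0] * len(vocabulary)
    for token, n in counts.items():
        if token in vocabulary:
            row[vocabulary[token]] = n
    return row


# Hypothetical vocabulary, as a fitted vectorizer might hold it.
vocabulary = {"chip": 0, "des": 1, "modem": 2}
docs = ["DES chip, DES modem", "a chip with an antenna"]
matrix = [count_row(doc, vocabulary) for doc in docs]
shape = (len(matrix), len(vocabulary))
```

Words that were not seen during fitting, such as "antenna" here, contribute nothing to the row. The same holds when new or test documents are transformed later.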

In [18]:
# We normalize the word counts of each document (term frequency, TF).
# For this we use an object of the TfidfTransformer class with the idf part
# (inverse document frequency) switched off. With the default norm='l2',
# each row is scaled to unit Euclidean length.
from sklearn.feature_extraction.text import TfidfTransformer

In [19]:
tf_transformer = TfidfTransformer(use_idf=False)

In [20]:
tf_transformer.fit(X_train_counts)

TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=False)

In [21]:
X_train_tf = tf_transformer.transform(X_train_counts)

In [22]:
X_train_tf.shape

(2373, 38683)
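
The shape is unchanged because only the values are rescaled row by row. With `use_idf=False` and `norm='l2'`, as shown in the repr above, the rescaling amounts to the following sketch (the helper name is an assumption for illustration):

```python
import math


def l2_normalize(row):
    """Scale a row of counts so that its Euclidean length becomes 1."""
    length = math.sqrt(sum(value * value for value in row))
    if length == 0:
        return [0.0] * len(row)  # an all-zero row stays all zero
    return [value / length for value in row]


tf_row = l2_normalize([3, 4, 0])
```

For the counts `[3, 4, 0]` the Euclidean length is 5, so the row becomes roughly `[0.6, 0.8, 0.0]`. Long and short documents thereby end up on a comparable scale.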

In [23]:
# Now we can create a classifier
from sklearn.ensemble import RandomForestClassifier
tf_random_forest_classifier = RandomForestClassifier()

In [24]:
# ... and train it with the matrix.
tf_random_forest_classifier.fit(X_train_tf, newsgroup_posts_train.target)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
 criterion='gini', max_depth=None, max_features='auto',
 max_leaf_nodes=None, max_samples=None,
 min_impurity_decrease=0.0, min_impurity_split=None,
 min_samples_leaf=1, min_samples_split=2,
 min_weight_fraction_leaf=0.0, n_estimators=100,
 n_jobs=None, oob_score=False, random_state=None,
 verbose=0, warm_start=False)
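
According to the scikit-learn documentation, its random forest combines the trees by averaging their class-probability estimates rather than by counting one vote per tree. The toy sketch below shows only that combination step, with made-up per-tree probabilities; no trees are trained here.

```python
# Made-up class probabilities from three hypothetical trees for one document.
# Columns correspond to four classes, e.g. the four selected newsgroups.
tree_probabilities = [
    [0.50, 0.25, 0.25, 0.00],
    [0.25, 0.50, 0.25, 0.00],
    [0.75, 0.00, 0.25, 0.00],
]

n_trees = len(tree_probabilities)

# Average each class column over all trees.
forest_probabilities = [sum(column) / n_trees for column in zip(*tree_probabilities)]

# The predicted class is the index with the highest averaged probability.
predicted_class = max(
    range(len(forest_probabilities)), key=forest_probabilities.__getitem__
)
```

With `n_estimators=100`, as in the repr above, the same averaging would run over 100 trees.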

In [25]:
# To test how well the classifier works, we process
# the test set with the same CountVectorizer object and apply
# the same TF transformation.
X_test_counts = count_vect.transform(newsgroup_posts_test.data)
X_test_tf = tf_transformer.transform(X_test_counts)

In [26]:
X_test_counts.shape

(1579, 38683)

In [41]:
# Now we can use the score method to check the quality of the classifier
# on the test set (for classifiers, score returns the mean accuracy).
tf_random_forest_classifier.score(X_test_tf, newsgroup_posts_test.target)

0.8467384420519316
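
Mean accuracy is simply the fraction of documents whose predicted class equals the true class. A minimal sketch with toy labels (the function name and the label lists are assumptions for illustration):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Four of the five toy predictions are correct.
toy_accuracy = accuracy([0, 1, 2, 3, 1], [0, 1, 2, 1, 1])
```

Accuracy treats all classes alike, which is reasonable here as long as the four newsgroups are of similar size.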

In [28]:
# The classifier seems to work well enough.
# We can now classify lists of documents.
# We take two documents from our test set and
# additionally create a very small document of our own
# that consists of a single sentence.
docs_to_classify = [
 newsgroup_posts_test.data[1],
 newsgroup_posts_test.data[7],
 "The sun send a lot of radiation to the planets including earth"]

In [29]:
# Let us take a quick look at the two documents from the test set.
print(newsgroup_posts_test.data[1])

From: dmuntz@quip.eecs.umich.edu (Dan Muntz)
Subject: Re: new encryption
Organization: University of Michigan EECS Dept., Ann Arbor
Lines: 13

In article strnlght@netcom.com (David Sternlight) writes:
>psionic@wam.umd.edu, whose parenthesized name is either an unfortunate
>coincidence or casts serious doubt on his bona fides, posts a message in
>which he seems willing to take the word of a private firm about which he
>knows little that their new encryption algorithm is secure and contains no
>trapdoors, while seemingly distrusting that of the government about clipper.

Will someone please post the David Sternlight FAQ to alt.privacy.clipper before
someone unfamiliar with him takes him seriously and starts yet another
flame fest?

 -Dan




In [30]:
print(newsgroup_posts_test.data[7])

From: jcarey@news.weeg.uiowa.edu (John Carey)
Subject: med school
Organization: University of Iowa, Iowa City, IA, USA
Lines: 27

Actually I am entering vet school next year, but the question is 
relevant for med students too.

Memorizing large amounts has never been my strong point academically.
Since this is a major portion of medical education -- anatomy, 
histology, pathology, pharmacology, are for the most part mass 
memorization -- I am a little concerned. As I am sure most 
med students are.

Can anyone suggest techniques for this type of memorization? I 
have had reasonable success with nemonics and memory tricks like
thinking up little stories to associate unrelated things. But I have
never applied them to large amounts of "data".

Has anyone had luck with any particular books, memory systems, or
cheap software? 

Can you suggest any helpful organizational techniques? Being an
older student who returned to school this year, organization (another
one of my weak points) has been

In [31]:
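# Note: despite the variable name, these are TF features, because
# tf_transformer was created with use_idf=False.
X_to_classify_counts = count_vect.transform(docs_to_classify)
X_to_classify_tfidf = tf_transformer.transform(X_to_classify_counts)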

In [32]:
predicted_classes = tf_random_forest_classifier.predict(X_to_classify_tfidf)

In [33]:
for predicted_class in predicted_classes:
 print(newsgroup_posts_train.target_names[predicted_class])

sci.crypt
sci.med
sci.electronics


In [34]:
# To improve the classifier, we replace the term frequency with
# TF-IDF (term frequency times inverse document frequency) and build
# our matrices with it.

In [35]:
tfidf_transformer = TfidfTransformer(use_idf=True).fit(X_train_counts)

In [36]:
X_train_tfidf = tfidf_transformer.transform(X_train_counts)

In [37]:
X_test_tfidf = tfidf_transformer.transform(X_test_counts)
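
The idf weighting lowers the influence of words that occur in many documents. With `smooth_idf=True`, scikit-learn documents the weight as `ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` the number of documents containing the term. The sketch below applies that formula to toy numbers and then normalizes the row to unit length; the helper names are assumptions for illustration.

```python
import math


def smoothed_idf(n_docs, doc_freq):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


def tfidf_row(counts, doc_freqs, n_docs):
    """Weight raw counts by idf, then scale the row to unit Euclidean length."""
    weighted = [c * smoothed_idf(n_docs, df) for c, df in zip(counts, doc_freqs)]
    length = math.sqrt(sum(w * w for w in weighted))
    return [w / length for w in weighted] if length else weighted


# Toy numbers: 3 documents; term 0 occurs in all of them, term 1 in only one.
row = tfidf_row(counts=[2, 2], doc_freqs=[3, 1], n_docs=3)
```

Both terms occur twice in the toy document, yet the rarer term 1 ends up with the larger weight.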

In [38]:
tfidf_random_forest_classifier = RandomForestClassifier()
tfidf_random_forest_classifier.fit(X_train_tfidf, newsgroup_posts_train.target)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
 criterion='gini', max_depth=None, max_features='auto',
 max_leaf_nodes=None, max_samples=None,
 min_impurity_decrease=0.0, min_impurity_split=None,
 min_samples_leaf=1, min_samples_split=2,
 min_weight_fraction_leaf=0.0, n_estimators=100,
 n_jobs=None, oob_score=False, random_state=None,
 verbose=0, warm_start=False)

In [39]:
tfidf_random_forest_classifier.score(X_test_tfidf, newsgroup_posts_test.target)

0.8562381253958201
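
To put the two scores into perspective, they can be converted back into counts of correctly classified test documents. The numbers below are taken from the outputs in this notebook; only the arithmetic is new.

```python
n_test = 1579  # number of test documents, see X_test_counts.shape

tf_score = 0.8467384420519316
tfidf_score = 0.8562381253958201

# Convert both accuracies back into counts of correctly classified documents.
tf_correct = round(tf_score * n_test)
tfidf_correct = round(tfidf_score * n_test)
gain = tfidf_correct - tf_correct
```

The TF-IDF model classifies 1352 documents correctly, compared with 1337 for the TF model, a gain of 15 documents. Because both forests were created with `random_state=None`, part of this difference may be run-to-run variation rather than an effect of the idf weighting. Setting a fixed `random_state` (not done in this notebook) would make the comparison reproducible.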