{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Text Classification Example\n", "\n", "This example is based on an [open dataset](http://qwone.com/~jason/20Newsgroups/) of [newsgroup messages](https://de.wikipedia.org/wiki/Newsgroup) and follows [this official tutorial](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html) from scikit-learn on text analysis.\n", "\n", "We use documents from several newsgroups to train a classifier that can then assign new texts to one of these groups. DES as an example (triple DES as a better \n", ">> one.) \n", ">\n", ">So, where can I buy a DES-encrypted cellular phone? How much does it cost?\n", ">Personally, Cylink stuff is out of my budget for personal use :)...\n", "\n", "If the Clipper chip can do cheap crypto for the masses, obviously one\n", "could do the same thing WITHOUT building in back doors.\n", "\n", "Indeed, even without special engineering, you can construct a good\n", "system right now. A standard codec chip, a chip to do vocoding, a DES\n", "chip, a V32bis integrated modem module, and a small processor to do\n", "glue work, are all you need to have a secure phone. You can dump one\n", "or more of the above if you have a fast processor. With integration,\n", "you could put all of them onto a single chip -- and in the future they\n", "can be.\n", "\n", "Yes, cheap crypto is good -- but we don't need it from the government.\n", "You can do everything the clipper chip can do without needing it to be\n", "compromised. 
When the White House releases stuff saying \"this is good\n", "because it gives people privacy\", note that we didn't need them to\n", "give us privacy, the capability is available using commercial hardware\n", "right now.\n", "\n", "Indeed, were it not for the government doing everything possible to\n", "stop them, Qualcomm would have designed strong encryption right in to\n", "the CDMA cellular phone system they are pioneering. Were it not for\n", "the NSA and company, cheap encryption systems would be everywhere. As\n", "it is, they try every trick in the book to stop it. Had it not been\n", "for them, I'm sure cheap secure phones would be out right now.\n", "\n", "They aren't the ones making cheap crypto available. They are the ones\n", "keeping cheap crypto out of people's hands. When they hand you a\n", "clipper chip, what you are getting is a mess of pottage -- your prize\n", "for having traded in your birthright.\n", "\n", "And what did we buy with our birthright? Did we get safety from\n", "foreigners? No. They can read conference papers as well as anyone else\n", "and are using strong cryptography. Did we get safety from professional\n", "terrorists? I suspect that they can get cryptosystems themselves on\n", "the open market that work just fine -- most of them can't be idiots\n", "like the guys that bombed the trade center. Are we getting cheaper\n", "crypto for ourselves? No, because the market would have provided that\n", "on its own had they not deliberately sabotaged it.\n", "\n", "Someone please tell me what exactly we get in our social contract in\n", "exchange for giving up our right to strong cryptography?\n", "--\n", "Perry Metzger\t\tpmetzger@shearson.com\n", "--\n", "Laissez faire, laissez passer. 
Le monde va de lui meme.\n", "\n" ] } ], "source": [ "# The data, however, are newsgroup messages.\n", "# An example:\n", "print(newsgroup_posts_train.data[6])" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['sci.crypt', 'sci.electronics', 'sci.med', 'sci.space']\n" ] } ], "source": [ "print(newsgroup_posts_train.target_names)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'sci.crypt'" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The targets are the newsgroups\n", "newsgroup_posts_train.target_names[newsgroup_posts_train.target[6]]" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "# To tokenize the texts and count the words (and, if stop_words is set, to remove stop words),\n", "# we use an object of the CountVectorizer class\n", "# https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html\n", "from sklearn.feature_extraction.text import CountVectorizer" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "count_vect = CountVectorizer()" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
