{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", " \n", " \n", " \n", " \n", "
Datenanalyse mit Python, Manfred Hammerl
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 8) Variablen umkodieren\n", "\n", "Zu Beginn wieder der Import von **Pandas** und das Einlesen unseres Datenfiles." ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " sex age wohnort volksmusik hardrock\n", "0 1 50 2 2.67 3.67\n", "1 1 57 1 1.00 3.33\n", "2 2 66 3 2.00 4.33" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "daten = pd.read_csv(\"C:\\\\Datenfiles\\\\daten.csv\")\n", "\n", "daten.head(3).round(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 8.1) Typologie erstellen\n", "\n", "Hin und wieder benötigt man eine Variable, die aus einer Kombination mehrerer anderer Variablen gebildet wird. Ein typisches Beispiel wäre in Studien bei SchülerInnen oder StudentInnen das elterliche Bildungsniveau. Man erhebt dabei das Bildungsniveau der Mutter und jenes des Vaters und kann diese Informationen dann in einer Variable 'elterliches Bildungsniveau' kombinieren (z.B. beide Matura, eine/r mit Matura, beide ohne Matura - man erhält hier also 4 Kategorien).\n", "\n", "In unserem Datensatz haben wir die Variablen *sex* und *wohnort*, aus welchen wir eine solche Typologie (als neue Variable *typo*) bilden können. Uns interessiert, wieviele Frauen bzw. Männer in ländlicher Umgebung und wieviele in Städten wohnen. Die *wohnort* Kategorie 2 (kleinstädtische Umgebung) lassen wir außen vor. Die Codierung erinnert entfernt an die Syntax, mit der in SPSS solche Typologien gebildet werden können. 
Mit **loc** lassen sich Bedingungen formulieren und definierte Werte einer neuen Variable zuordnen.\n", "\n", "[5 ways to apply an IF condition in pandas DataFrame](https://datatofish.com/if-condition-in-pandas-dataframe/)" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [], "source": [ "daten.loc[(daten.sex == 1) & (daten.wohnort == 1), 'typo'] = 'frau_land' # neue Werte können natürlich auch Zahlen sein\n", "daten.loc[(daten.sex == 1) & (daten.wohnort == 3), 'typo'] = 'frau_stadt'\n", "daten.loc[(daten.sex == 2) & (daten.wohnort == 1), 'typo'] = 'mann_land'\n", "daten.loc[(daten.sex == 2) & (daten.wohnort == 3), 'typo'] = 'mann_stadt'\n", "daten.loc[(daten.wohnort==2), 'typo'] = 'ohne Zuordnung' # als Restkategorie sozusagen..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Die neue Variable *typo* wurde dem Datensatz wie unten ersichtlich angefügt und enthält unsere definierten Zuordnungen." ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " sex age wohnort volksmusik hardrock typo\n", "0 1 50 2 2.666667 3.666667 ohne Zuordnung\n", "1 1 57 1 1.000000 3.333333 frau_land\n", "2 2 66 3 2.000000 4.333333 mann_stadt\n", "3 1 50 2 2.333333 2.666667 ohne Zuordnung\n", "4 1 60 3 2.333333 3.000000 frau_stadt" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "daten.head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Nun könnten wir uns die Häufigkeiten dieser Typologie ansehen. Am besten gleich in Form einer Grafik." ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "grafik1 = daten['typo'].value_counts(normalize = False)\n", "\n", "# 'normalize=True' und '*100', dann hätte man Prozentwerte statt absoluter Häufigkeiten\n", "\n", "ax = grafik1.plot.bar(rot = 25) # 'rot' = Rotation der Y-Achsen Labels in Grad" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Als 'Abkürzung' bietet sich folgende direkte Methode (nur eine Codezeile) an (diesmal sind relative Häufigkeiten, also Prozentwerte, abgebildet):" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "ax = (daten['typo'].value_counts(normalize = True)*100).plot.bar(rot = 25) # relative Häufigkeiten, multipliziert mit 100 (d.h.: Prozent)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 8.2) Alterskategorien in neuer Variable erstellen\n", "\n", "Relativ oft muss man metrische Variablen mit sehr vielen Ausprägungen in wenige Kategorien umkodieren. Ein klassisches Beispiel sind Alterskategorien. Wenn wir die Variable *age* aus unserem Datensatz in ein paar Alterskategorien umkodieren möchten, sehen wir uns zuerst die Verteilung der Altersvariable an. Am besten betrachten wir die kumulierten (**cumsum()**) relativen (*normalize = True*) Häufigkeiten.\n", "\n", "[Recode Data](https://pythonfordatascience.org/recode-data/)\n", "\n", "[Creating a Cumulative Frequency Column in a Dataframe Python](https://stackoverflow.com/questions/38891974/creating-a-cumulative-frequency-column-in-a-dataframe-python)" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "15 0.34\n", "20 0.68\n", "21 3.06\n", "22 7.82\n", "23 13.27\n", "24 17.35\n", "25 23.13\n", "26 28.91\n", "27 32.65\n", "28 36.05\n", "Name: age, dtype: float64" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "daten['age'].value_counts(sort = False, normalize = True).cumsum().head(10).round(4)*100 # kumulierte Prozentwerte" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In der Praxis würden wir uns natürlich die gesamte Ausgabe ansehen und nicht nur - aus Platzgründen - die ersten zehn Werte wie in der Ausgabe oben. 
Doing so, one can read off from the cumulative relative frequencies which age categories to form in order to obtain categories of roughly equal size (in terms of the number of cases they contain).\n", "\n", "With the following self-written function we define four age categories of roughly similar size. We then call this function *altkat* via **apply()** on our age variable *age*. The resulting age categories are stored in the new variable *Altersgruppe*.\n", "\n", "[pandas.DataFrame.apply](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html)" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [], "source": [ "def altkat(age):\n", "    if age <= 25:\n", "        return \"15 - 25 Jahre\"\n", "    elif 26 <= age <= 35:\n", "        return \"26 - 35 Jahre\"\n", "    elif 36 <= age <= 49:\n", "        return \"36 - 49 Jahre\"\n", "    elif age > 49:\n", "        return \"50 Jahre plus\"\n", "\n", "daten['Altersgruppe'] = daten['age'].apply(altkat)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A frequency count of the newly created variable *Altersgruppe* shows that each of the four age groups contains roughly 70 to 80 people." 
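, "\n", "\n", "
The category boundaries used in *altkat* can also be passed to **pd.cut()** as explicit bin edges. The following is only a minimal sketch on a handful of made-up ages: the Series `ages` and the names `edges` and `labels` are illustrative and not part of the course data, the outer edges 14 and 120 are assumptions, and the category labels are copied from *altkat*.

```python
import pandas as pd

# made-up example ages, not taken from daten.csv
ages = pd.Series([15, 25, 26, 35, 36, 49, 50, 92])

# right-closed intervals: (14, 25], (25, 35], (35, 49], (49, 120]
# the outer edges 14 and 120 are assumptions chosen to cover all example ages
edges = [14, 25, 35, 49, 120]
labels = ['15 - 25 Jahre', '26 - 35 Jahre', '36 - 49 Jahre', '50 Jahre plus']

groups = pd.cut(ages, bins=edges, labels=labels)
print(groups.tolist())
```

Unlike *altkat*, this variant returns an ordered categorical and leaves ages outside the outer edges (here 14 and below, or above 120) as missing values.
"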
] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "15 - 25 Jahre 68\n", "26 - 35 Jahre 71\n", "36 - 49 Jahre 78\n", "50 Jahre plus 77\n", "Name: Altersgruppe, dtype: int64" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "daten['Altersgruppe'].value_counts(sort = False).sort_index() # 'sort_index()' sorts by age category" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The cumulative relative frequencies show that the first age group (15 - 25 Jahre), at just over 23%, is slightly underrepresented (with four age groups one would expect 25% per group, but such an exact split is hardly achievable in practice)." ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "15 - 25 Jahre 23.13\n", "26 - 35 Jahre 47.28\n", "36 - 49 Jahre 73.81\n", "50 Jahre plus 100.00\n", "Name: Altersgruppe, dtype: float64" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "daten['Altersgruppe'].value_counts(sort = False, normalize = True).sort_index().cumsum().round(4)*100 # percentages" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here without **cumsum()**, i.e. we obtain the relative frequency of each age group." 
] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "15 - 25 Jahre 23.13\n", "26 - 35 Jahre 24.15\n", "36 - 49 Jahre 26.53\n", "50 Jahre plus 26.19\n", "Name: Altersgruppe, dtype: float64" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "daten['Altersgruppe'].value_counts(sort = False, normalize = True).sort_index().round(4)*100 # percentages" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Checking the recoding\n", "\n", "Whenever variables are recoded, the correctness of the recoding should be verified before running further analyses on the new variables. The following (partial) display of the dataframe already gives a first indication that the recoding is likely to be correct." ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " sex age wohnort volksmusik hardrock typo Altersgruppe\n", "260 2 15 1 3.000000 3.500000 mann_land 15 - 25 Jahre\n", "176 2 20 3 4.666667 2.333333 mann_stadt 15 - 25 Jahre\n", "64 1 21 3 4.666667 1.000000 frau_stadt 15 - 25 Jahre\n", "77 1 21 3 5.000000 1.333333 frau_stadt 15 - 25 Jahre\n", "255 2 21 2 4.666667 1.000000 ohne Zuordnung 15 - 25 Jahre\n", ".. ... ... ... ... ... ... ...\n", "88 1 73 1 1.666667 4.000000 frau_land 50 Jahre plus\n", "18 1 75 3 3.000000 5.000000 frau_stadt 50 Jahre plus\n", "97 2 75 3 2.000000 3.666667 mann_stadt 50 Jahre plus\n", "136 2 77 1 2.000000 5.000000 mann_land 50 Jahre plus\n", "219 1 92 1 1.666667 4.000000 frau_land 50 Jahre plus\n", "\n", "[294 rows x 7 columns]" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "daten.sort_values(by = 'age', ascending = True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Mit einer Kreuztabelle kann man die Umkodierung genau überprüfen, ohne dafür die 294 Zeilen des Dataframes durchsehen zu müssen. Aus Platzgründen sind unten nur die ersten zehn Zeilen der Kreuztabelle dargestellt - die Umkodierung ist jedenfalls richtig, soviel sei verraten." ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "Altersgruppe 15 - 25 Jahre 26 - 35 Jahre 36 - 49 Jahre 50 Jahre plus\n", "age \n", "15 1 0 0 0\n", "20 1 0 0 0\n", "21 7 0 0 0\n", "22 14 0 0 0\n", "23 16 0 0 0\n", "24 12 0 0 0\n", "25 17 0 0 0\n", "26 0 17 0 0\n", "27 0 11 0 0\n", "28 0 10 0 0" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.crosstab(daten.age, daten.Altersgruppe).head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 8.3) Mediansplit\n", "\n", "Mit der Funktion **qcut()** kann man automatisch (ungefähr) gleich große (was die Fallzahl betrifft) Kategorien erstellen. Erstellt man 2 Kategorien, handelt es sich um einen sog. Mediansplit. Dies wird im folgenden Beispiel gezeigt. Der Parameter *retbins = True* gibt uns die Kategorieneinteilung wieder (die sog. *bins*).\n", "\n", "[pandas.qcut](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.qcut.html)" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 alt\n", "1 alt\n", "2 alt\n", "3 alt\n", "4 alt\n", " ... \n", "289 jung\n", "290 jung\n", "291 jung\n", "292 alt\n", "293 jung\n", "Name: age, Length: 294, dtype: category\n", "Categories (2, object): ['jung' < 'alt']" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "out, bins = pd.qcut(daten ['age'], 2, labels = [\"jung\", \"alt\"], retbins = True) # 'out' = output, 'bins' = klassifizierung\n", "\n", "# hier in 2 gleich große Gruppen (Mediansplit) aufgeteilt, auch 3, 4, 5, usw. Gruppen sind natürlich möglich.\n", "\n", "out" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Diese Verfahren liefert uns zwei Outputs: *out* und *bins*. *out* liefert uns eine Übersicht der zugeordneten Kategorien pro Zeile (siehe oben), *bins* liefert uns die Kategorieneinteilung (siehe unten)." 
] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([15., 36., 92.])" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bins" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The variable *age* was split into 2 groups: group 'jung' (young) from 15 to 36 and group 'alt' (old) from >36 to 92 (i.e. from 37 to 92).\n", "\n", "For comparison, here is the original age:" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 50\n", "1 57\n", "2 66\n", "3 50\n", "4 60\n", " ..\n", "289 24\n", "290 36\n", "291 31\n", "292 37\n", "293 23\n", "Name: age, Length: 294, dtype: int64" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "daten['age']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And for comparison, here is the **median** of age, which is 36 years. As mentioned, the split above is therefore a median split." ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "36.0" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "daten['age'].median()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Adding the median-split variable to the dataframe\n", "\n", "With the following line of code we can compute the median split as before and insert it directly into our dataframe as a new variable." ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [], "source": [ "daten['age_mediansplit'] = pd.qcut(daten['age'], 2, labels = [\"jung\", \"alt\"])" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " sex age wohnort volksmusik hardrock typo Altersgruppe \\\n", "0 1 50 2 2.666667 3.666667 ohne Zuordnung 50 Jahre plus \n", "1 1 57 1 1.000000 3.333333 frau_land 50 Jahre plus \n", "2 2 66 3 2.000000 4.333333 mann_stadt 50 Jahre plus \n", "\n", " age_mediansplit \n", "0 alt \n", "1 alt \n", "2 alt " ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "daten.head(3)" ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " sex age wohnort volksmusik hardrock typo Altersgruppe \\\n", "291 2 31 1 5.000000 4.666667 mann_land 26 - 35 Jahre \n", "292 1 37 3 5.000000 1.000000 frau_stadt 36 - 49 Jahre \n", "293 1 23 1 4.333333 3.333333 frau_land 15 - 25 Jahre \n", "\n", " age_mediansplit \n", "291 jung \n", "292 alt \n", "293 jung " ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "daten.tail(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 8.4) Klassifizierung nach gleich großen Intervallen\n", "\n", "Eine ähnliche, aber nicht idente, Methode liegt mit der Funktion **cut()** vor.\n", "\n", "[pandas.cut](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html#pandas.cut)\n", "\n", "Es werden hierbei nämlich keine gleich großen Gruppen (was die Fallzahl betrifft) gebildet, sondern die Funktion **cut()** richtet sich nach den Werten, d.h. z.B. von 20-40, 40-60, 60-80, usw., also jeweils 20 Abstand, egal wie viele Fälle enthalten sind." ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [], "source": [ "out1, bins1 = pd.cut(daten['age'], 3, retbins = True, labels = ['jung', 'mittel', 'alt'])" ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 mittel\n", "1 mittel\n", "2 mittel\n", "3 mittel\n", "4 mittel\n", " ... \n", "289 jung\n", "290 jung\n", "291 jung\n", "292 jung\n", "293 jung\n", "Name: age, Length: 294, dtype: category\n", "Categories (3, object): ['jung' < 'mittel' < 'alt']" ] }, "execution_count": 73, "metadata": {}, "output_type": "execute_result" } ], "source": [ "out1" ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([14.92, 40.67, 66.33, 92. 
])" ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bins1.round(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**cut()** has thus formed age categories covering the ages 15-40, 41-66 and 67-92, i.e. 26 whole years each (the interval width is (92 - 15) / 3, about 25.67). The lowest edge is placed slightly below the minimum (14.92 rather than 15) so that the youngest case falls inside the first interval." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Adding the classification variable to the dataframe\n", "\n", "With the following line of code this variable can be inserted directly into the dataframe." ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [], "source": [ "daten['Alterskategorien'] = pd.cut(daten['age'], 3, labels = ['jung', 'mittel', 'alt'])" ] }, { "cell_type": "code", "execution_count": 76, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " sex age wohnort volksmusik hardrock typo Altersgruppe \\\n", "0 1 50 2 2.666667 3.666667 ohne Zuordnung 50 Jahre plus \n", "1 1 57 1 1.000000 3.333333 frau_land 50 Jahre plus \n", "2 2 66 3 2.000000 4.333333 mann_stadt 50 Jahre plus \n", "\n", " age_mediansplit Alterskategorien \n", "0 alt mittel \n", "1 alt mittel \n", "2 alt mittel " ] }, "execution_count": 76, "metadata": {}, "output_type": "execute_result" } ], "source": [ "daten.head(3)" ] }, { "cell_type": "code", "execution_count": 77, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " sex age wohnort volksmusik hardrock typo Altersgruppe \\\n", "291 2 31 1 5.000000 4.666667 mann_land 26 - 35 Jahre \n", "292 1 37 3 5.000000 1.000000 frau_stadt 36 - 49 Jahre \n", "293 1 23 1 4.333333 3.333333 frau_land 15 - 25 Jahre \n", "\n", " age_mediansplit Alterskategorien \n", "291 jung jung \n", "292 alt jung \n", "293 jung jung " ] }, "execution_count": 77, "metadata": {}, "output_type": "execute_result" } ], "source": [ "daten.tail(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 8.5) Dummyvariablen erstellen\n", "\n", "Pandas bietet mit der Funktion **get_dummies()** ein einfache Möglichkeit, Dummyvariablen automatisch zu erstellen. Im Folgenden sollen aus der Variable *wohnort* (drei Ausprägungen) Dummyvariablen erstellt werden. Die Variable *wohnort* ist nominal (bzw. ordinal, je nach Interpretation), aber nicht dichotom, kann also bspw. in Regressionsanalysen nicht als unabhängige Variable eingesetzt werden. Dummyvariablen, zwar im Prinzip noch nominal, aber als dichotome Variablen gleichzeitig auch als intervallskalierte Variablen zu interpretieren, jedoch schon.\n", "\n", "[pandas.get_dummies](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html)" ] }, { "cell_type": "code", "execution_count": 78, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " 1 2 3\n", "0 0 1 0\n", "1 1 0 0\n", "2 0 0 1\n", "3 0 1 0\n", "4 0 0 1" ] }, "execution_count": 78, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dummy = pd.get_dummies(daten['wohnort'])\n", "\n", "dummy.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Die Dummyvariablen sind einstweilen im Objekt (Dataframe) *dummy* gespeichert. Wir erfahren gleich, wir die Dummyvariablen in unseren ursprünglichen Datensatz integrieren.\n", "\n", "Aber jedenfalls wurde richtig umkodiert, wenn wir obige Dummy-Tabelle mit untenstehendem Auszug aus den Originaldaten zum *wohnort* vergleichen:" ] }, { "cell_type": "code", "execution_count": 79, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 2\n", "1 1\n", "2 3\n", "3 2\n", "4 3\n", "Name: wohnort, dtype: int64" ] }, "execution_count": 79, "metadata": {}, "output_type": "execute_result" } ], "source": [ "daten['wohnort'].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Einfügen der Dummyvariablen ins Dataframe\n", "\n", "Das Einfügen der Dummyvariablen ins Dataframe erfolgt - wie bei den beiden vorangehenden Beispielen - leicht mit einer Codezeile. Allerdings muss man dabei vorab wissen, wieviele Dummyvariablen erstellt werden - denn genau so viele neue Variablen muss man anlegen! In unserem Fall werden 3 Dummyvariablen erstellt, also legen wird 3 neue Variablen (*D1*, *D2*, *D3*) an!" ] }, { "cell_type": "code", "execution_count": 80, "metadata": {}, "outputs": [], "source": [ "daten[['D1', 'D2', 'D3']] = pd.get_dummies(daten['wohnort'])" ] }, { "cell_type": "code", "execution_count": 81, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " sex age wohnort volksmusik hardrock typo Altersgruppe \\\n", "0 1 50 2 2.666667 3.666667 ohne Zuordnung 50 Jahre plus \n", "1 1 57 1 1.000000 3.333333 frau_land 50 Jahre plus \n", "2 2 66 3 2.000000 4.333333 mann_stadt 50 Jahre plus \n", "3 1 50 2 2.333333 2.666667 ohne Zuordnung 50 Jahre plus \n", "4 1 60 3 2.333333 3.000000 frau_stadt 50 Jahre plus \n", "\n", " age_mediansplit Alterskategorien D1 D2 D3 \n", "0 alt mittel 0 1 0 \n", "1 alt mittel 1 0 0 \n", "2 alt mittel 0 0 1 \n", "3 alt mittel 0 1 0 \n", "4 alt mittel 0 0 1 " ] }, "execution_count": 81, "metadata": {}, "output_type": "execute_result" } ], "source": [ "daten.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 8.6) Variablen umpolen\n", "\n", "Dies geht nun schon in Richtung 'Variablen berechnen', was Inhalt des nächsten Kapitels ist, passt jedoch verständlicherweise noch in dieses Kapitel zum Thema 'umkodieren von Variablen'. Oft kommt es in empirischen Studien vor, dass Items unterschiedlich gepolt sind, aber für die Auswertung in eine einheitliche Richtung gebracht werden müssen. Dies erreicht man durch eine sehr einfache Berechnung. Dabei wird die Originalvariable nicht überschrieben sondern eine neue Variable erstellt und ins Dataframe eingefügt.\n", "\n", "[Deriving New Columns & Defining Python Functions](https://mode.com/python-tutorial/defining-python-functions/)" ] }, { "cell_type": "code", "execution_count": 82, "metadata": {}, "outputs": [], "source": [ "daten[\"volksmusik_umk\"] = (daten[\"volksmusik\"]*(-1))+6 # z.B. +6 (5-stufige), +7 (6-stufige Antwortskala), usw." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Unsere Originalvariable *volksmusik* hat Ausprägungen von 1 bis 5. Will man diese Variable umpolen, multipliziert man mit -1. Die Ausprägungen reichen dann von -5 bis -1. Um wieder auf eine Spannweite von 1 bis 5 zu kommen, wird die Zahl 6 addiert. 
The variable *volksmusik_umk* thus covers the same positive range as before, but its meaning is reversed: what used to be 1 (very much like listening to folk music) is now 5 (very much like listening to folk music)." ] }, { "cell_type": "code", "execution_count": 83, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " sex age wohnort volksmusik hardrock typo Altersgruppe \\\n", "0 1 50 2 2.67 3.67 ohne Zuordnung 50 Jahre plus \n", "1 1 57 1 1.00 3.33 frau_land 50 Jahre plus \n", "2 2 66 3 2.00 4.33 mann_stadt 50 Jahre plus \n", "3 1 50 2 2.33 2.67 ohne Zuordnung 50 Jahre plus \n", "4 1 60 3 2.33 3.00 frau_stadt 50 Jahre plus \n", "\n", " age_mediansplit Alterskategorien D1 D2 D3 volksmusik_umk \n", "0 alt mittel 0 1 0 3.33 \n", "1 alt mittel 1 0 0 5.00 \n", "2 alt mittel 0 0 1 4.00 \n", "3 alt mittel 0 1 0 3.67 \n", "4 alt mittel 0 0 1 3.67 " ] }, "execution_count": 83, "metadata": {}, "output_type": "execute_result" } ], "source": [ "daten.head().round(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Kontrolle der Umkodierung mittels Kreuztabellen\n", "\n", "Visuell kann man aus obigem Ausschnitt des Dataframes schon ablesen, dass höchstwahrscheinlich korrekt umkodiert wurde. Dennoch empfiehlt es sich immer, jegliche Umkodierung zu überprüfen. Bei Umkodierungen von Items mit wenigen Ausprägungen geht das sehr übersichtlich mit Kreuztabellen. Alle Werte ausserhalb der Diagonalen müssen *0* sein! In unserem Fall haben die beiden Variablen etwas mehr Ausprägungen, sodass dies nicht ganz so übersichtlich ist." ] }, { "cell_type": "code", "execution_count": 84, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
| volksmusik (rows) / volksmusik_umk (columns) | 1.00 | 1.33 | 1.50 | 1.67 | 2.00 | 2.33 | 2.67 | 3.00 | 3.33 | 3.67 | 4.00 | 4.33 | 4.67 | 5.00 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 11 |
| 1.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 |
| 1.67 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7 | 0 | 0 |
| 2.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 14 | 0 | 0 | 0 |
| 2.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 11 | 0 | 0 | 0 | 0 |
| 2.67 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10 | 0 | 0 | 0 | 0 | 0 |
| 3.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 21 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3.33 | 0 | 0 | 0 | 0 | 0 | 0 | 27 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3.67 | 0 | 0 | 0 | 0 | 0 | 21 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4.00 | 0 | 0 | 0 | 0 | 37 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4.33 | 0 | 0 | 0 | 31 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4.50 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4.67 | 0 | 36 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5.00 | 63 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
\n", "
" ], "text/plain": [ "volksmusik_umk 1.00 1.33 1.50 1.67 2.00 2.33 2.67 3.00 3.33 3.67 \\\n", "volksmusik \n", "1.00 0 0 0 0 0 0 0 0 0 0 \n", "1.33 0 0 0 0 0 0 0 0 0 0 \n", "1.67 0 0 0 0 0 0 0 0 0 0 \n", "2.00 0 0 0 0 0 0 0 0 0 0 \n", "2.33 0 0 0 0 0 0 0 0 0 11 \n", "2.67 0 0 0 0 0 0 0 0 10 0 \n", "3.00 0 0 0 0 0 0 0 21 0 0 \n", "3.33 0 0 0 0 0 0 27 0 0 0 \n", "3.67 0 0 0 0 0 21 0 0 0 0 \n", "4.00 0 0 0 0 37 0 0 0 0 0 \n", "4.33 0 0 0 31 0 0 0 0 0 0 \n", "4.50 0 0 1 0 0 0 0 0 0 0 \n", "4.67 0 36 0 0 0 0 0 0 0 0 \n", "5.00 63 0 0 0 0 0 0 0 0 0 \n", "\n", "volksmusik_umk 4.00 4.33 4.67 5.00 \n", "volksmusik \n", "1.00 0 0 0 11 \n", "1.33 0 0 4 0 \n", "1.67 0 7 0 0 \n", "2.00 14 0 0 0 \n", "2.33 0 0 0 0 \n", "2.67 0 0 0 0 \n", "3.00 0 0 0 0 \n", "3.33 0 0 0 0 \n", "3.67 0 0 0 0 \n", "4.00 0 0 0 0 \n", "4.33 0 0 0 0 \n", "4.50 0 0 0 0 \n", "4.67 0 0 0 0 \n", "5.00 0 0 0 0 " ] }, "execution_count": 84, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.crosstab(round(daten.volksmusik, 2), round(daten.volksmusik_umk, 2))\n", "\n", "# the built-in function 'round()' is used to keep the row/column labels short and the output readable" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Checking the recoding with a correlation\n", "\n", "For variables with many distinct values, one can also simply compute the correlation between the original variable and the recoded variable. For a reversed scale, anything other than a correlation coefficient of '-1' would indicate an error in the recoding. Note that a coefficient of '-1' only confirms a perfectly linear, reversed relationship; it does not reveal a wrong offset, so checking a few individual values remains worthwhile.\n", "\n", "[pandas.DataFrame.corr](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html)" ] }, { "cell_type": "code", "execution_count": 85, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
|   | volksmusik | volksmusik_umk |
|---|---|---|
| volksmusik | 1.0 | -1.0 |
| volksmusik_umk | -1.0 | 1.0 |
\n", "
" ], "text/plain": [ " volksmusik volksmusik_umk\n", "volksmusik 1.0 -1.0\n", "volksmusik_umk -1.0 1.0" ] }, "execution_count": 85, "metadata": {}, "output_type": "execute_result" } ], "source": [ "daten[['volksmusik', 'volksmusik_umk']].corr(method = 'pearson')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A scatter plot of the two variables can likewise show visually whether the recoding succeeded. Any point that deviates from the diagonal line would indicate an error in the recoding." ] }, { "cell_type": "code", "execution_count": 86, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[Figure: scatter plot of volksmusik (x-axis) against volksmusik_umk (y-axis)]
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "ax = daten.plot.scatter(x = 'volksmusik', y = 'volksmusik_umk')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", " \n", " \n", " \n", "
https://github.com/manfred2020/DA_mit_Python
" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.9" } }, "nbformat": 4, "nbformat_minor": 2 }