{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", " \n", " \n", " \n", " \n", " \n", "
Datenanalyse mit Pythonhttps://github.com/manfred2020/DA_mit_PythonManfred Hammerl
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 6) Daten gruppieren, sortieren und filtern\n", "\n", "Zu Beginn wie üblich **Pandas** importieren und den bereits bekannten Datensatz laden. Damit sehen wir uns dann div. Möglichkeiten an, Daten zu gruppieren bzw. verschiedene Auswertungen anhand der gruppierten Daten vorzunehmen." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sexagewohnortvolksmusikhardrock
015022.673.67
115711.003.33
226632.004.33
\n", "
" ], "text/plain": [ " sex age wohnort volksmusik hardrock\n", "0 1 50 2 2.67 3.67\n", "1 1 57 1 1.00 3.33\n", "2 2 66 3 2.00 4.33" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "daten = pd.read_csv(\"C:\\\\Datenfiles\\\\daten.csv\")\n", "\n", "daten.head(3).round(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 6.1) Daten gruppieren" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### GroupBy()\n", "\n", "Die Funktion **groupby()** ist - wie wir sehen werden - sehr nützlich und vielseitig einsetzbar. Sehen wir uns für den Anfang die Mittelwerte unserer Variablen pro Geschlecht an, indem wir **groupby()** und **mean()** kombinieren.\n", "\n", "[pandas.DataFrame.groupby](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html#pandas.DataFrame.groupby)\n", "\n", "[GroupBy](https://pandas.pydata.org/pandas-docs/stable/reference/groupby.html)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
agewohnortvolksmusikhardrock
sex
137.622.0903.7713.157
239.682.2273.7672.747
\n", "
" ], "text/plain": [ " age wohnort volksmusik hardrock\n", "sex \n", "1 37.62 2.090 3.771 3.157\n", "2 39.68 2.227 3.767 2.747" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "daten.groupby(['sex']).mean().round(3) # z.B. um die Mittelwerte aller Variablen pro Geschlecht auszugeben" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Wie in obiger Tabelle ersichtlich, wurden pro Geschlecht (1 = weiblich, 2 = männlich) die Mittelwerte der übrigen Variablen ausgegeben.\n", "\n", "Analog funktioniert dies auch für Korrelationen. Unten ein Beispiel, in welchem wir **groupby()** und **corr()** kombinieren und somit getrennt nach Geschlecht die Korrelationen (in diesem Fall der Spearman Rangkorrelationskoeffizient) ausgegeben werden. Angezeigt wird nur der Korrelationskoeffizient (und nicht die Fallzahl oder die Irrtumswahrscheinlichkeit, wie man es bspw. von SPSS kennt), mehr kann **Pandas** nicht liefern." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
agewohnortvolksmusikhardrock
sex
1age1.00-0.15-0.470.33
wohnort-0.151.000.23-0.06
volksmusik-0.470.231.00-0.32
hardrock0.33-0.06-0.321.00
2age1.00-0.13-0.370.35
wohnort-0.131.000.260.05
volksmusik-0.370.261.00-0.31
hardrock0.350.05-0.311.00
\n", "
" ], "text/plain": [ " age wohnort volksmusik hardrock\n", "sex \n", "1 age 1.00 -0.15 -0.47 0.33\n", " wohnort -0.15 1.00 0.23 -0.06\n", " volksmusik -0.47 0.23 1.00 -0.32\n", " hardrock 0.33 -0.06 -0.32 1.00\n", "2 age 1.00 -0.13 -0.37 0.35\n", " wohnort -0.13 1.00 0.26 0.05\n", " volksmusik -0.37 0.26 1.00 -0.31\n", " hardrock 0.35 0.05 -0.31 1.00" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "daten.groupby(['sex']).corr(method = 'spearman').round(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Selbstverständlich kann man die Ausgabe nach Belieben optimieren, bspw. wird im Folgenden nur die Korrelation von *hardrock* und *volksmusik* pro Geschlecht ausgegeben. Beachte: Alle drei Variablen, also auch die Gruppierungsvariable, müssen bereits zu Beginn aus dem Dataframe ausgewählt werden ('daten[[.....]]'), ansonsten wird die Gruppierungsvariable nicht erkannt ('groupby([...])')!" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
hardrockvolksmusik
sex
1hardrock1.00-0.34
volksmusik-0.341.00
2hardrock1.00-0.30
volksmusik-0.301.00
\n", "
" ], "text/plain": [ " hardrock volksmusik\n", "sex \n", "1 hardrock 1.00 -0.34\n", " volksmusik -0.34 1.00\n", "2 hardrock 1.00 -0.30\n", " volksmusik -0.30 1.00" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "daten[['sex', 'hardrock', 'volksmusik']].groupby(['sex']).corr(method = 'pearson').round(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Im Zusammenhang mit Gruppierungen können viele weitere Auswertungsfunktionen angewandt werden - nicht nur die gezeigten *mean()* und *corr()*, bspw.:\n", "\n", "|Funktion| |Funktion| |\n", "|-|-|-|-|\n", "|**min()**|Minimum|**count()**|Häufigkeiten exkl. NaN|\n", "|**max()**|Maximum|**size()**|Häufigkeiten inkl. NaN|\n", "|**median()**|Median|**mad()**|Mittlere absolute Abweichung vom Median|\n", "|**var()**|Varianz|**std()**|Standardabweichung|\n", "|**sum()**|Summe|**sem()**|Standardfehler des Mittelwerts|" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Die describe() Funktion\n", "Wir kennen die **describe()** Funktion bereits aus dem vorangegangenen Kapitel. Sehen wir uns nun div. Kennzahlen der Variable *age* getrennt für beide Geschlechter an; komibinieren wir also **groupby()** mit **describe()**." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countmeanstdmin25%50%75%max
sex
1166.037.6214.1121.025.035.048.7592.0
2128.039.6813.6615.028.037.051.0077.0
\n", "
" ], "text/plain": [ " count mean std min 25% 50% 75% max\n", "sex \n", "1 166.0 37.62 14.11 21.0 25.0 35.0 48.75 92.0\n", "2 128.0 39.68 13.66 15.0 28.0 37.0 51.00 77.0" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "daten.groupby('sex').describe()['age'].round(2) # 'unstack()' auch möglich für andere Anzeige" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Man kann dies auch so schreiben (diesmal gruppiert nach *wohnort*):" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "daten.groupby('wohnort')['age'].describe().round(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Beide Schreibweisen führen zum gleichen Ergebnis. Mit den Möglichkeiten von **groupby()** und **describe()** kann man sich schnell einen guten Überblick über einige Variablen bzw. Gruppen verschaffen." ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ " wohnort\n", "count 1 95.00\n", " 2 60.00\n", " 3 139.00\n", "mean 1 40.37\n", " 2 40.43\n", " 3 36.42\n", "std 1 13.80\n", " 2 14.11\n", " 3 13.74\n", "dtype: float64" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "daten.groupby(['wohnort'])['age'].describe().unstack().loc[['count', 'mean', 'std']].round(2)\n", "\n", "# Dies funktioniert nur mit 'unstack()'!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Man kann auch 2 Gruppierungsvariablen angeben; im folgenden Fall wird wieder nach *wohnort* sowie auf einer übergeordneten Ebenen nach *sex* gruppiert." ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countmeanstdmin25%50%75%max
sexwohnort
1158.039.8814.2422.026.042.549.7592.0
235.039.0013.6322.026.037.050.0069.0
373.035.1614.0321.025.027.047.0075.0
2137.041.1413.2415.034.041.050.0077.0
225.042.4414.8121.028.045.055.0072.0
366.037.8213.3820.026.037.047.5075.0
\n", "
" ], "text/plain": [ " count mean std min 25% 50% 75% max\n", "sex wohnort \n", "1 1 58.0 39.88 14.24 22.0 26.0 42.5 49.75 92.0\n", " 2 35.0 39.00 13.63 22.0 26.0 37.0 50.00 69.0\n", " 3 73.0 35.16 14.03 21.0 25.0 27.0 47.00 75.0\n", "2 1 37.0 41.14 13.24 15.0 34.0 41.0 50.00 77.0\n", " 2 25.0 42.44 14.81 21.0 28.0 45.0 55.00 72.0\n", " 3 66.0 37.82 13.38 20.0 26.0 37.0 47.50 75.0" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "daten[['age', 'sex', 'wohnort']].groupby(['sex', 'wohnort'])['age'].describe().round(2)\n", "\n", "#'loc[]', um nur ausgewählte Kennzahlen auszugeben, funktioniert bei solcher mehrfacher Gruppierung nicht" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Im Folgenden wird nach zwei Variabeln, *sex* und *wohnort* gruppiert und jeweils ausschließlich(!) der Mittelwert des Alters ausgegeben. Mit der **describe()** Funktion geht das nicht, mit den div. anderen Auswertungsfunktionen (z.B. **mean()**, **median()**, **count()**, usw.) schon." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
wohnort123
sex
139.8839.0035.16
241.1442.4437.82
\n", "
" ], "text/plain": [ "wohnort 1 2 3\n", "sex \n", "1 39.88 39.00 35.16\n", "2 41.14 42.44 37.82" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "daten[['age', 'sex', 'wohnort']].groupby(['sex', 'wohnort'])['age'].mean().round(2).unstack()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Die aggregate() Funktion\n", "\n", "Die **aggregate()** Funktion kann hier ev. weiterhelfen, sie erweitert nämlich unseren Spielraum bez. der gezielten Ausgabe gewünschter Kennzahlen etwas. Im ersten Beispiel wird - getrennt nach Wohnort - für alle Variablen jeweils der Mittelwert und der Median ausgegeben. **aggregate()** kann übrigens auch mit **agg()** abgekürzt werden.\n", "\n", "[10 Python Pandas tips to make data analysis faster](https://towardsdatascience.com/10-python-pandas-tricks-to-make-data-analysis-more-enjoyable-cb8f55af8c30)\n", "\n", "[pandas.DataFrame.aggregate](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.aggregate.html)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sexagevolksmusikhardrock
meanmedianmeanmedianmeanmedianmeanmedian
wohnort
11.39140.37423.393.672.993.0
21.42140.43393.684.003.063.0
31.47136.42344.074.332.933.0
\n", "
" ], "text/plain": [ " sex age volksmusik hardrock \n", " mean median mean median mean median mean median\n", "wohnort \n", "1 1.39 1 40.37 42 3.39 3.67 2.99 3.0\n", "2 1.42 1 40.43 39 3.68 4.00 3.06 3.0\n", "3 1.47 1 36.42 34 4.07 4.33 2.93 3.0" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "daten.groupby('wohnort').aggregate(['mean', 'median']).round(2) # Man kann 'aggregate' oder kurz 'agg' schreiben." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Nun wählen wir sinnvoller Weise nur 2 metrischen Variablen, *volksmusik* und *hardrock* zur Ausgabe des Mittelwerts und Medians aus." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
volksmusikhardrock
meanmedianmeanmedian
wohnort
13.393.672.993.0
23.684.003.063.0
34.074.332.933.0
\n", "
" ], "text/plain": [ " volksmusik hardrock \n", " mean median mean median\n", "wohnort \n", "1 3.39 3.67 2.99 3.0\n", "2 3.68 4.00 3.06 3.0\n", "3 4.07 4.33 2.93 3.0" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "daten.groupby('wohnort').aggregate({'volksmusik' : ['mean', 'median'], 'hardrock' : ['mean', 'median']}).round(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Und nun gruppieren wir - wie wir es weiter oben schon versucht haben - nach zwei Variablen (*sex* und *wohnort*) und lassen uns für die Variable *age* den Mittelwert und den Median sowie für die Variable *hardrock* den kleinsten und den größten Wert sowie die Gesamtzahl der Nennungen anzeigen." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
agehardrock
meanmedianminmaxcount
sexwohnort
1139.8842.51.004.6758
239.0037.01.335.0035
335.1627.01.005.0073
2141.1441.01.005.0037
242.4445.01.005.0025
337.8237.01.005.0066
\n", "
" ], "text/plain": [ " age hardrock \n", " mean median min max count\n", "sex wohnort \n", "1 1 39.88 42.5 1.00 4.67 58\n", " 2 39.00 37.0 1.33 5.00 35\n", " 3 35.16 27.0 1.00 5.00 73\n", "2 1 41.14 41.0 1.00 5.00 37\n", " 2 42.44 45.0 1.00 5.00 25\n", " 3 37.82 37.0 1.00 5.00 66" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "daten.groupby(['sex','wohnort']).agg({'age' : ['mean', 'median'], 'hardrock' : ['min', 'max', 'count']}).round(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Die transform() Funktion\n", "\n", "Die **transform()** Funktion liefert in Kombination mit **groupby()** nicht den üblichen gruppierten (aggregierten) Output, sondern ermöglicht gruppenweise Berechnungen, welche dann für jede Zeile im Datensatz ausgegeben werden. Im Beispiel unten ist die Zentrierung der Variable *age* innerhalb der beiden Gruppen der Variable *sex* wiedergegeben. Um zu zeigen, dass korrekt - pro Gruppe - gerechnet wurde, wird im Anschluss auch über den gesamten Datensatz zentriert und die Ergebnisse grafisch veranschaulicht.\n", "\n", "[pandas.DataFrame.transform](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.transform.html)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "data_gruppiert = daten[['age', 'sex']].groupby(['sex']).transform(lambda x: x-x.mean()).round(2)\n", "# Hier wird der Altersmittelwert von jeder Beobachtung abgezogen, 'age' wird also zentriert - und zwar innerhalb jeder Gruppe von 'sex'!" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
age
012.38
119.38
226.32
312.38
422.38
......
289-13.62
290-3.68
291-8.68
292-0.62
293-14.62
\n", "

294 rows × 1 columns

\n", "
" ], "text/plain": [ " age\n", "0 12.38\n", "1 19.38\n", "2 26.32\n", "3 12.38\n", "4 22.38\n", ".. ...\n", "289 -13.62\n", "290 -3.68\n", "291 -8.68\n", "292 -0.62\n", "293 -14.62\n", "\n", "[294 rows x 1 columns]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_gruppiert" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Nun noch schnell den Spaltennamen (*age*) ändern, da wird später noch eine zweite Spalte mit dem Namen *age* haben werden und die Spalten natürlich unterscheiden möchten." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "data_gruppiert.rename(columns = {\"age\" : \"age_gruppiert\"}, inplace = True)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
age_gruppiert
012.38
119.38
226.32
312.38
422.38
\n", "
" ], "text/plain": [ " age_gruppiert\n", "0 12.38\n", "1 19.38\n", "2 26.32\n", "3 12.38\n", "4 22.38" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_gruppiert.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Hier nun zum Vergleich die 'normale' Zentrierung mit dem Altersmittelwert der Gesamtstichprobe:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "data_gesamt = daten['age'].transform(lambda x: x-x.mean()).round(2)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 11.48\n", "1 18.48\n", "2 27.48\n", "3 11.48\n", "4 21.48\n", " ... \n", "289 -14.52\n", "290 -2.52\n", "291 -7.52\n", "292 -1.52\n", "293 -15.52\n", "Name: age, Length: 294, dtype: float64" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_gesamt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Wir haben jetzt das Alter sowohl nach Geschlecht als auch gesamt zentriert - es liegen nun zwei zentrierte Altersvariablen vor. Diese sollen nun für die weitere Analyse in eine Dataframe gebracht werden. Ausserdem soll die Variable *sex* mit in diese Dataframe." ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "data = pd.concat([data_gruppiert, data_gesamt, daten['sex']], axis = 1)\n", "# 'axis = 1' bezieht sich auf die Spalten, 'axis = 0' würde sich auf die Zeilen beziehen" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Hier das fertige Dataframe mit den beiden zentrierten Altersvariablen sowie der Variable *sex*. Die Spalte *age* repräsentiert die über den gesamten Datensatz zentrierte Altersvariable." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
age_gruppiertagesex
012.3811.481
119.3818.481
226.3227.482
312.3811.481
422.3821.481
\n", "
" ], "text/plain": [ " age_gruppiert age sex\n", "0 12.38 11.48 1\n", "1 19.38 18.48 1\n", "2 26.32 27.48 2\n", "3 12.38 11.48 1\n", "4 22.38 21.48 1" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sehen wir uns nun anhand zweier Grafiken an, ob unsere Berechnung mit **groupby()** und **transform()** vorhin funktioniert hat, d.h. ob die Altersvariable tatsächlich pro Geschlecht zentriert wurde. Ist dies der Fall, so müsste der Mittelwert pro Geschlecht bei 0 liegen. Dagegen liegt der Mittelwert der über den gesamten Datensatz zentrierten Altersvariable pro Geschlecht wohl nicht bei 0." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAY8AAADQCAYAAAD/Ne1xAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuNCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8QVMy6AAAACXBIWXMAAAsTAAALEwEAmpwYAAASaElEQVR4nO3df7CWZZ3H8fdXoAhlFAxdBBGaxbOioeJBNC3MFsUfBaaW2pSgiZZsP2Z2JnB3R53SZbdmN500hib8VZNhbsUk689JHQ1TNLcVETkSwUlGER2TFU3iu3+cG+ZwOD+ey/PAg/B+zZx5nvu6r+u6v/fh6Geu+77PeSIzkSSpxF6NLkCS9N5jeEiSihkekqRihockqZjhIUkqZnhIkor1bXQBO8sHP/jBHDlyZKPLkKT3lCeffPKVzBzSsX2PCY+RI0eyZMmSRpchSe8pEfHHztq9bCVJKmZ4SJKKGR6SpGJ7zD0PSY3zzjvv0NrayltvvdXoUtSF/v37M3z4cPr161dTf8ND0g7X2trKwIEDGTlyJBHR6HLUQWayfv16WltbGTVqVE1jDI9dzMhZdzW6hN3GqjlnNLoEVd566y2DYxcWEey///6sW7eu5jHe85C0Uxgcu7bSfx/DQ5JUzMtWkna6el+e3RMvUX7kIx/hN7/5TdGYa6+9liuuuKIux3flIUk7WGayefPmus5ZEhxbjn/ttdfW7fiGh6Q9wtSpUznmmGM4/PDDmTdvHgA//OEPOfTQQznppJO45JJLmDlzJgDr1q3j7LPPZvz48YwfP55HH320y3nXrVvHpEmTGDduHJdeeimHHHIIr7zyCqtWreKwww7jy1/+MuPGjWPNmjXss88+W8f97Gc/Y9q0aQBMmzaNyy67jI9+9KMceuih/OpXvwLg5ptvZsqUKUyePJmmpiauvvrqrePbz/Xtb3+b8ePHM3bsWK688kqA7Y5/8cUXs3HjRo466ig+97nP9fr76WUrSXuE+fPnM3jwYDZu3Mj48eM544wz+OY3v8lTTz3FwIEDOfnkkznyyCMB+OpXv8rXv/51TjzxRFavXs2pp57KsmXLOp336quv5uSTT2b27NncfffdW4MJYPny5dx0003ceOONPda3atUqHnroIV544QU+/vGP09LSAsDjjz/OM888w4ABA7bW3dzcvHXcvffey4oVK3j88cfJTD71qU/x8MMPM2LEiO2Of8cdd/D000+/22/hNgwPSXuE66+/np///OcArFmzhttuu42JEycyePBgAM4991yef/55AO6//36effbZrWP//Oc/88YbbzBw4MDt5n3kkUe2zjt58mQGDRq0dd8hhxzCcccdV1N9n/nMZ9hrr70YPXo0H/rQh3juuecAmDRpEvvvvz8An/70p3nkkUe2C497772Xo48+GoANGzawYsUKRowYUXT8UoaHpN3egw8+yP3338/ixYsZMGAAJ510Ek1NTV2uJjZv3szixYv5wAc+0OPcmdnlvr333nub7faPw3b8bfuOj8pu2e6qvf3xZ8+ezaWXXrpN+6pVq7Y7fj15z0PSbu/1119n0KBBDBgwgOeee47HHnuMN998k4ceeojXXnuNTZs2ceedd27tf8opp/C9731v63Z3l3pOPPFEFixYALStAl577bUu+x544IEsW7aMzZs3b12tbHHHHXewefNmXnjhBVauXElTUxMA9913H6+++iobN27kF7/4BSeccMI240499VTmz5/Phg0bAPjTn/7Eyy+/3Onx+/XrxzvvvNNlfSVceUja6Xb2o7WTJ09m7ty5jB07lqamJo477jiGDRvGFVdcwYQJEzjooIMYM2YM++67L9B2ievyyy9n7NixbNq0iY997GPMnTu307mvvPJKzj//fH76058yceJEhg4dysCBA7f+z7y9OXPmcOaZZ3LwwQdzxBFHbNOnqamJiRMn8tJLLzF37lz69+8PtIXT5z//eVpaWrjgggu2uWQFbUG3bNkyjj/+eKDtRvqPfvQj+vTps93xZ8yYwdixYxk3bhw//vGP3903sxLdLbl2J83Nzfle+DAo/zxJ/eyJz/7vqpYtW8Zhhx3W6DK2s2HDBvbZZx82bdrEWWedxUUXXcRZZ51VNMfbb79Nnz596Nu3L4sXL+ZLX/pS8U3padOmceaZZ3LOOeds037zzTezZMmSbVZBO1Jn/04R8WRmNnfsW5fLVhExOSKWR0RLRMzqZH9ExPXV/t9HxLiexkbE4Ii4LyJWVK+DOsw5IiI2RMQ/1uMcJO15rrrqKo466iiOOOIIRo0axdSpU4vnWL16NePHj+fII4/kK1/5Cj/4wQ/qX+guqNcrj4joAzwPTAJagSeA8zPz2XZ9Tgf+ATgdmABcl5kTuhsbEf8OvJqZc6pQGZSZ32g3553AZuC3mfmdnup05bHnceWx69hVVx4lbrrpJq677rpt2k444QRuuOGGBlVUfyUrj3rc8zgWaMnMldWBbgemAM+26zMFuDXbkuqxiNgvIoYCI7sZOwU4qRp/C/Ag8I2q31RgJfB/dahfkno0ffp0pk+f3ugydhn1uGw1DFjTbru1aqulT3djD8zMtQDV6wEAEbE3bSFyNT2IiBkRsSQilpT8qWFJ9ben3F99ryr996lHeHT2d3w7VtFVn1rGdnQ18J+Zuf2jDB0nypyXmc2Z2TxkyJCeukvaQfr378/69esNkF3Ulg+D2vKEVy3qcdmqFTi43fZw4MUa+7yvm7EvRcTQzFxbXeLa8uDyBOCc6p7IfsDmiHgrM3fO4wiSig0fPpzW1taiDxvSzrXlY2hrVY/weAIYHRGjgD8B5wEXdOizEJhZ3dOYALxehcK6bsYuBC4E5lSvvwTIzI9umTQirgI2GBzSrq1fv341f7yp3ht6HR6ZuSkiZgL3AH2A+Zm5NCIuq/bPBRbR9qRVC/AmML27sdXUc4AFEXExsBo4t7e1SpLqoy6/YZ6Zi2gLiPZtc9u9T+DyWsdW7euBT/Rw3KveRbmSpF7yb1tJkooZHpKkYoaHJKmY4SFJKmZ4SJKKGR6SpGKGhySpmOEhSSpmeEiSihkekqRihockqZjhIUkqZnhIkooZHpKkYoaHJKmY4SFJKmZ4SJKKGR6SpGKGhySpmOEhSSpmeEiSihkekqRihockqZjhIUkqZnhIkooZHpKkYoaHJKmY4SFJKlaX8IiIyRGxPCJaImJWJ/sjIq6v9v8+Isb1NDYiBkfEfRGxonodVLVPiognI+J/q9eT63EOkqTa9To8IqIPcANwGjAGOD8ixnTodhowuvqaAXy/hrGzgAcyczTwQLUN8Arwycz8MHAhcFtvz0GSVKYeK49jgZbMXJmZfwFuB6Z06DMFuDXbPAbsFxFDexg7Bbilen8LMBUgM3+XmS9W7UuB/hHx/jqchySpRvUIj2HAmnbbrVVbLX26G3tgZq4FqF4P6OTYZwO/y8y333X1kqRifeswR3TSljX2qWVs5weNOBz4N+CUbvrMoO0yGSNGjKhlWklSDeqx8mgFDm63PRx4scY+3Y19qbq0RfX68pZOETEc+Dnwhcx8oavCMnNeZjZnZvOQIUOKTkqS1LV6hMcTwOiIGBUR7wPOAxZ26LMQ+EL11NVxwOvVpajuxi6k7YY41esvASJiP+AuYHZmPlqH+iVJhXp92SozN0XETOAeoA8wPzOXRsRl1f65wCLgdKAFeBOY3t3Yauo5wIKIuBhYDZxbtc8E/hb4l4j4l6rtlMzcujKRJO1Y9bjnQWYuoi0g2rfNbfc+gctrHVu1rwc+0Un7t4Bv9bJkSVIv+BvmkqRihockqZjhIUkqZnhIkooZHpKkYoaHJKmY4SFJKmZ4SJKKGR6SpGKGhySpmOEhSSpmeEiSihkekqRihockqZjhIUkqZnhIkooZHpKkYoaHJKmY4SFJKmZ4SJKKGR6SpGKGhySpmOEhSSpmeEiSihkekqRihockqZjhIUkqZnhIkooZHpKkYnUJj4iYHBHLI6IlImZ1sj8i4vpq/+8jYlxPYyNicETcFxErqtdB7fbNrvovj4hT63EOkqTa9To8IqIPcANwGjAGOD8ixnTodhowuvqaAXy/hrGzgAcyczTwQLVNtf884HBgMnBjNY8kaSepx8rjWKAlM1dm5l+A24EpHfpMAW7NNo8B+0XE0B7GTgFuqd7fAkxt1357Zr6dmX8AWqp5JEk7Sd86zDEMWNNuuxWYUEOfYT2MPTAz1wJk5tqIOKDdXI91Mtd2ImIGbSsdRowYUePpNNaqOWc0ugSpUyNn3dXoEnYr7/X/1uux8ohO2rLGPrWMfTfHa2vMnJeZzZnZPGTIkB6mlSTVqh7h0Qoc3G57OPBijX26G/tSdWmL6vXlguNJknageoTHE8DoiBgVEe+j7Wb2wg59FgJfqJ66Og54vbok1d3YhcCF1fsLgV+2az8vIt4fEaNouwn/eB3OQ5JUo17f88jMTRExE7gH6APMz8ylEXFZtX8usAg4nbab228C07sbW009B1gQERcDq4FzqzFLI2IB8CywCbg8M//a2/OQJNWuHjfMycxFtAVE+7a57d4ncHmtY6v29cAnuhhzDXBNL0qWJPWCv2EuSSpmeEiSihkekqRihockqZjhIUkqZnhIkooZHpKkYoaHJKmY4SFJKmZ4SJKKGR6SpGKGhySpmOEhSSpmeEiSihkekqRihockqZjhIUkqZnhIkooZHpKkYoaHJKmY4SFJKmZ4SJKKGR6SpGKGhySpmOEhSSpmeEiSihkekqRihockqVivwiMiBkfEfRGxonod1EW/yRGxPCJaImJWLeMjYnbVf3lEnFq1DYiIuyLiuYhYGhFzelO/JOnd6e3KYxbwQGaOBh6otrcREX2AG4DTgDHA+RExprvx1f7zgMOBycCN1TwA38nMvwOOBk6IiNN6eQ6SpEK9DY8pwC3V+1uAqZ30ORZoycyVmfkX4PZqXHfjpwC3Z+bbmfkHoAU4NjPfzMxfA1RzPQUM7+U5SJIK9TY8DszMtQDV6wGd9BkGrGm33Vq1dTe+uzEARMR+wCdpW7F0KiJmRMSSiFiybt26Ws9JktSDvj11iIj7gb/pZNc/1XiM6KQtezMmIvoCPwGuz8yVXU2SmfOAeQDNzc09HVOSVKMewyMz/76rfRHxUkQMzcy1ETEUeLmTbq3Awe22hwMvVu+7Gt/dGGgLhBWZ+d2e6pck1V9vL1stBC6s3l8I/LKTPk8AoyNiVES8j7Yb4Qt7GL8QOC8i3h8Ro4DRwOMAEfEtYF/ga72sXZL0LvU2POYAkyJiBTCp2iYiDoqIRQCZuQmYCdwDLAMWZObS7sZX+xcAzwJ3A5dn5l8jYjhtl8vGAE9FxNMR8cVenoMkqVCPl626k5nrgU900v4icHq77UXAolrHV/uuAa7p0NZK5/dDJEk7kb9hLkkqZnhIkooZHpKkYoaHJKmY4SFJKmZ4SJKKGR6SpGKGhySpmOEhSSpmeEiSihkekqRihockqZjhIUkqZnhIkooZHpKkYr36PA9Je45Vc85odAnahbjykCQVMzwkScUMD0lSMcNDklTM8JAkFTM8JEnFDA9JUrHIzEbXsFNExDrgj42uYzfxQeCVRhchdcGfz/o6JDOHdGzcY8JD9RMRSzKzudF1SJ3x53Pn8LKVJKmY4SFJKmZ46N2Y1+gCpG7487kTeM9DklTMlYckqZjhoZpFxPyIeDkinml0LVJ7EXFwRPw6IpZFxNKI+Gqja9rdedlKNYuIjwEbgFsz84hG1yNtERFDgaGZ+VREDASeBKZm5rMNLm235cpDNcvMh4FXG12H1FFmrs3Mp6r3bwDLgGGNrWr3ZnhI2q1ExEjgaOC3DS5lt2Z4SNptRMQ+wJ3A1zLzz42uZ3dmeEjaLUREP9qC48eZ+V+Nrmd3Z3hIes+LiAB+CCzLzP9odD17AsNDNYuInwCLgaaIaI2Iixtdk1Q5Afg8cHJEPF19nd7oonZnPqorSSrmykOSVMzwkCQVMzwkScUMD0lSMcNDklTM8JAkFTM8JEnFDA9pB4uIvSPiroj4n4h4JiI+GxHHRMRDEfFkRNwTEUMjYt+IWB4RTdW4n0TEJY2uX+pM30YXIO0BJgMvZuYZABGxL/DfwJTMXBcRnwWuycyLImImcHNEXAcMyswfNK5sqWv+hrm0g0XEocA9wALgV8BrwG+AlVWXPsDazDyl6j8POBs4MjNbd37FUs9ceUg7WGY+HxHHAKcD/wrcByzNzOM79o2IvYDDgI3AYMDw0C7Jex7SDhYRBwFvZuaPgO8AE4AhEXF8tb9fRBxedf86bZ+Cdz4wv/oz49Iux5WHtON9GPh2RGwG3gG+BGwCrq/uf/QFvhsR7wBfBI7NzDci4mHgn4ErG1S31CXveUiSinnZSpJUzPCQJBUzPCRJxQwPSVIxw0OSVMzwkCQVMzwkScUMD0lSsf8HN8DOjmZ2oEkAAAAASUVORK5CYII=\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXwAAADQCAYAAAAETtrrAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuNCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8QVMy6AAAACXBIWXMAAAsTAAALEwEAmpwYAAAM5UlEQVR4nO3dXYyc1X3H8e8v2IkhtQjGTuLYOOsLg2oo0LDmRYkalxLH4Eo2pTRw0URpyypRLFHuUNI0kaKqSEVVi4hiucLKW5UQidJYxZTGUQNtaloMgggwUAvReGRUHBfxIkBhy78XHqLNdtb2esY79pzvR1p5nnnOzjlrrb569uzMbKoKSdLoe8ewFyBJmhsGX5IaYfAlqREGX5IaYfAlqREGX5IaMW/YCzicxYsX19jY2LCXIUknjYcffvhnVbWk17kTOvhjY2Ps3r172MuQpJNGkv+a6ZxbOpLUCIMvSY0w+JLUiBN6D7+XN998k06nwxtvvDHspQzMggULWL58OfPnzx/2UiSNsJMu+J1Oh4ULFzI2NkaSYS+nb1XFwYMH6XQ6rFy5ctjLkTTCTrrgv/HGGyMTe4AknHnmmRw4cGDYS9EIGrv5nmEvYaQ8d8uGYS+hLyflHv6oxP5to/b1SDoxnZTBlyTN3km3pTPdoH9kPdl/ZJOkmXiFfww2bdrERRddxLnnnsvWrVsBuOOOOzj77LNZu3YtN9xwA5s3bwbgwIEDXHPNNaxZs4Y1a9bw4x//eJhLl9Swk/4Kfxi2bdvGokWLeP3111mzZg0bNmzgK1/5Co888ggLFy7k8ssv54ILLgDgxhtv5KabbuIjH/kIP/3pT/n4xz/Onj17hvwVSGqRwT8Gt912G3fffTcA+/bt41vf+hYf/ehHWbRoEQDXXnstzzzzDAA7d+7kySef/MXnvvzyy7zyyissXLhw7hcuqWkGf5Z+9KMfsXPnTnbt2sVpp53G2rVrOeecc2a8an/rrbfYtWsXp5566hyvVJJ+2UD28JNsS/JCksdnOJ8ktyXZm+QnST40iHmH4aWXXuKMM87gtNNO46mnnuLBBx/ktdde4/777+fFF19kcnKSu+666xfj161bx+233/6L40cffXQIq5akwf3S9uvA+sOcvxJY1f2YAL42oHnn3Pr165mcnOT888/ni1/8IpdeeinLli3j85//PJdccglXXHEFq1ev5vTTTwcObf/s3r2b888/n9WrV7Nly5YhfwWSWjWQLZ2qeiDJ2GGGbAS+WVUFPJjkPUmWVtXz/c4910+jfNe73sW99977/+4fHx9nYmKCyclJrr76atatWwfA4sWLufPOO+d0jZLUy1w9LXMZsG/Kcad738j48pe/zIUXXsh5553HypUr2bRp07CXJEm/ZK5+advrvQOq58BkgkPbPqxYseJ4rmmgbr311mEvQZIOa66u8DvAWVOOlwP7ew2sqq1VNV5V40uW9PyzjJKkYzBXwd8OfLL7bJ1LgZf62b8/9KuA0TFqX4+kE9NAtnSSfAdYCyxO0gG+BMwHqKotwA7gKmAv8Brw6WOda8GCBRw8eJAzzzxzJN5l8u33w1+wYMGwlyJpxA3qWTrXH+F8AZ8bxFzLly+n0+mM1PvHv/0XryTpeDrpXmk7f/58/zKUJB0D3y1Tkhph8CWpEQZfkhph8CWpEQZfkhph8CWpEQZfkhph8CWpEQZfkhph8CWpEQZfkhph8CWpEQZfkhph8CWpEQZfkhph8CWpEQZfkhph8CWpEQZfkhph8CWpEQZfkhph8CWpEQZfkhph8CWpEQZfkhph8CWpEQMJfpL1SZ5OsjfJzT3Or03yUpJHux9/Ooh5JUlHb16/D5DkFOCrwMeADvBQku1V9eS0of9SVb/d73ySpGMziCv8i4G9VfVsVf0c+C6wcQCPK0kaoEEEfxmwb8pxp3vfdJcleSzJvUnOHcC8kqRZ6HtLB0iP+2ra8SPAB6vq1SRXAX8PrOr5YMkEMAGwYsWKASxPkgSDucLvAGdNOV4O7J86oKperqpXu7d3APOTLO71YFW1tarGq2p8yZIlA1ieJAkGE/yHgFVJViZ5J3AdsH3qgCTvT5Lu7Yu78x4cwNySpKPU95ZOVU0m2QzcB5wCbKuqJ5J8pnt+C/C7wGeTTAKvA9dV1fRtH0nScTSIPfy3t2l2TLtvy5TbtwO3D2IuSdKx8ZW2ktQIgy9JjTD4ktQIgy9JjTD4ktQIgy9JjTD4ktQIgy9JjTD4ktQIgy9JjTD4ktQIgy9JjTD4ktQIgy9JjTD4ktQIgy9JjTD4ktQIgy9JjTD4ktQIgy9JjTD4ktQIgy9JjTD4ktQIgy9JjTD4ktQIgy9JjTD4ktSIgQQ/yfokTyfZm+TmHueT5Lbu+Z8k+dAg5pUkHb2+g5/kFOCrwJXAauD6JKunDbsSWNX9mAC+1u+8kqTZGcQV/sXA3qp6tqp+DnwX2DhtzEbgm3XIg8B7kiwdwNySpKM0iOAvA/ZNOe5075vtGEnScTRvAI+RHvfVMYw5NDCZ4NC2DytWrOhvZXNg7OZ7hr2EkfLcLRuGvYSR4v+nphrEFX4HOGvK8XJg/zGMAaCqtlbVeFWNL1myZADLkyTBYIL/ELAqycok7wSuA7ZPG7Md+GT32TqXAi9V1fMDmFuSdJT63tKpqskkm4H7gFOAbVX1RJLPdM9vAXYAVwF7gdeAT/c7ryRpdgaxh09V7eBQ1Kfet2XK7QI+N4i5JEnHxlfaSlIjDL4kNcLgS1IjDL4kNcLgS1IjDL4kNcLgS1IjDL4kNcLgS1IjDL4kNcLgS1IjDL4kNcLgS1IjDL4kNcLgS1IjDL4kNcLgS1IjDL4kNcLgS1IjDL4kNcLgS1IjDL4kNcLgS1IjDL4kNcLgS1IjDL4kNcLgS1Ij5vXzyUkWAXcCY8BzwO9V1Ys9xj0HvAL8LzBZVeP9zCtJmr1+r/BvBn5YVauAH3aPZ/KbVXWhsZek4eg3+BuBb3RvfwPY1OfjSZKOk36D/76qeh6g++97ZxhXwD8leTjJRJ9zSpKOwRH38JPsBN7f49QXZjHPh6tqf5L3Aj9I8lRVPTDDfBPABMCKFStmMYUk6XCOGPyqumKmc0n+O8nSqno+yVLghRkeY3/33xeS3A1cDPQMflVtBbYCjI+P15G/BEnS0eh3S2c78Knu7U8B358+IMm7kyx8+zawDni8z3klSbPUb/BvAT6W5D+Bj3WPSfKBJDu6Y94H/GuSx4D/AO6pqn/sc15J0iz19Tz8qjoI/FaP+/cDV3VvPwtc0M88kqT++UpbSWqEwZekRhh8SWqEwZekRhh8SWqEwZekRhh8SWqEwZekRhh8SWqEwZekRhh8SWqEwZekRhh8SWqEwZekRhh8SWqEwZekRhh8SWqEwZekRhh8SWqEwZekRhh8SWqEwZekRhh8SWrEvGEv4GT33C0bhr0ESToqXuFLUiMMviQ1wuBLUiP6Cn6Sa5M8keStJOOHGbc+ydNJ9ia5uZ85JUnHpt8r/MeB3wEemGlAklOArwJXAquB65Os7nNeSdIs9fUsnaraA5DkcMMuBvZW1bPdsd8FNgJP9jO3JGl25mIPfxmwb8pxp3ufJGkOHfEKP8lO4P09Tn2hqr5/FHP0uvyvw8w3AUx0D19N8vRRzKEjWwz8bNiLkGbg9+fgfHCmE0cMflVd0efkHeCsKcfLgf2HmW8rsLXPOTVNkt1VNeMv1qVh8vtzbszFls5DwKokK5O8E7gO2D4H80qSpuj3aZlXJ+kAlwH3JLmve/8HkuwAqKpJYDNwH7AH+F5VPdHfsiVJs5WqGbfTNUKSTHS3y6QTjt+fc8PgS1IjfGsFSWqEwR9xSbYleSHJ48NeizRVkrOS/HOSPd23aLlx2GsadW7pjLgkvwG8Cnyzqs4b9nqktyVZCiytqkeSLAQeBjZVla/CP068wh9xVfUA8D/DXoc0XVU9X1WPdG+/wqFn8fkq/OPI4EsauiRjwK8D/z7kpYw0gy9pqJL8CnAX8MdV9fKw1zPKDL6koUkyn0Ox/9uq+rthr2fUGXxJQ5FD76t+B7Cnqv5y2OtpgcEfcUm+A+wCzknSSfKHw16T1PVh4PeBy5M82v24atiLGmU+LVOSGuEVviQ1wuBLUiMMviQ1wuBLUiMMviQ1wuBLUiMMviQ1wuBLPSR5d5J7kjyW5PEkn0hyUZL7kzyc5L4kS5OcnuTpJOd0P+87SW4Y9vqlXuYNewHSCWo9sL+qNgAkOR24F9hYVQeSfAL4s6r6gySbga8n+WvgjKr6m+EtW5qZr7SVekhyNnAf8D3gH4AXgX8Dnu0OOQV4vqrWdcdvBa4BLqiqztyvWDoyr/ClHqrqmSQXAVcBfw78AHiiqi6bPjbJO4BfBV4HFgEGXyck9/ClHpJ8AHitqr4N3ApcAixJcln3/Pwk53aH38Shv9Z0PbCt+5a/0gnHK3ypt18D/iLJW8CbwGeBSeC27n7+POCvkrwJ/BFwcVW9kuQB4E+ALw1p3dKM3MOXpEa4pSNJjTD4ktQIgy9JjTD4ktQIgy9JjTD4ktQIgy9JjTD4ktSI/wNIJhF5FjCKhwAAAABJRU5ErkJggg==\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "ax = data.groupby(['sex']).mean().plot.bar(y = 'age_gruppiert', rot = 0, figsize = (6, 3))\n", "ax = data.groupby(['sex']).mean().plot.bar(y = 'age', rot = 0, figsize = (6, 3))" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "age 38.52\n", "dtype: float64" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "daten[['age']].mean().round(2) # Gesamtmittelwert des Alters aus originalen Daten\n", "# \"[['age']]\" -> \".round()\" ist möglich. \"['age']\" (was im Fall einer Variable normalerweise genügt) -> \".round()\" ist nicht möglich!" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
age
sex
137.62
239.68
\n", "
" ], "text/plain": [ " age\n", "sex \n", "1 37.62\n", "2 39.68" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "daten[['age', 'sex']].groupby(['sex']).mean().round(2) # Altersmittelwert für Frauen (1) und Männer (2) aus originalen Daten" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "age_gruppiert 0.000136\n", "age -0.002993\n", "dtype: float64" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data[['age_gruppiert', 'age']].mean() # Gesamtmittelwerte der beiden zentrierten Variablen" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
age_gruppiert
sex
10.000482
2-0.000313
\n", "
" ], "text/plain": [ " age_gruppiert\n", "sex \n", "1 0.000482\n", "2 -0.000313" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data[['age_gruppiert', 'sex']].groupby(['sex']).mean()\n", "# Mittelwerte für Frauen (1) und Männer (2) der pro Geschlecht zentrierten Variable" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
age
sex
1-0.899518
21.159688
\n", "
" ], "text/plain": [ " age\n", "sex \n", "1 -0.899518\n", "2 1.159688" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data[['age', 'sex']].groupby(['sex']).mean()\n", "# Mittelwerte für Frauen (1) und Männer (2) der für den gesamten Datensatz zentrierten Variable" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Wie in den Grafiken und weiteren Auswertungen ersichtlich, wurde tatsächlich korrekt pro Geschlecht zentriert - die Mittelwerte liegen für beide Geschlechter bei 0 (von minimalen Rundungsdifferenzen abgesehen). Zusätzlich wurde anhand der originalen Altersdaten noch der Gesamtmittelwert sowie der Mittelwert pro Geschlecht ermittelt.\n", "___\n", "Mit **groupby()** können noch weitere Funktionen sinnvoll zusammen verwendet werden, z.B. **apply()** oder **filter()**. Vgl. dazu das Buch von Jake VanderPlas (2016)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 6.2) Fälle sortieren\n", "\n", "Nach diesen umfangreicheren Erläuterungen zum Thema gruppieren folgen nun zwei kürzere Abschnitte zu häufig angewendeten Methoden: Sortieren und filtern von Daten.\n", "\n", "Beginnen wir mit einem einfachen Beispiel und sortieren wir unseren Datensatz aufsteigend nach Wohnort (drei Ausprägungen: 1, 2 und 3). Dazu benötigen wir nur die Funktion **sort_values()** und eine Angabe, nach welcher Variable sortiert werden soll.\n", "\n", "[DataFrame.sort_values](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html#pandas.DataFrame.sort_values)" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sexagewohnortvolksmusikhardrock
14612214.333.67
17125014.004.00
17016011.334.00
16712515.002.67
16114511.673.67
..................
19622434.001.33
19722634.671.33
5725832.003.33
10013134.334.67
6912734.673.00
\n", "

294 rows × 5 columns

\n", "
" ], "text/plain": [ " sex age wohnort volksmusik hardrock\n", "146 1 22 1 4.33 3.67\n", "171 2 50 1 4.00 4.00\n", "170 1 60 1 1.33 4.00\n", "167 1 25 1 5.00 2.67\n", "161 1 45 1 1.67 3.67\n", ".. ... ... ... ... ...\n", "196 2 24 3 4.00 1.33\n", "197 2 26 3 4.67 1.33\n", "57 2 58 3 2.00 3.33\n", "100 1 31 3 4.33 4.67\n", "69 1 27 3 4.67 3.00\n", "\n", "[294 rows x 5 columns]" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "daten.sort_values(by = ['wohnort']).round(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Eine aufsteigende Sortiertung ist die Standardeinstellung. Möchten wir absteigend sortieren, geben wir *ascending = False* ein." ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sexagewohnortvolksmusikhardrock
19427221.0000005.0
915632.0000005.0
1817533.0000005.0
3315322.0000005.0
19515523.3333335.0
..................
12424814.3333331.0
9023415.0000001.0
24622235.0000001.0
27723535.0000001.0
16322935.0000001.0
\n", "

294 rows × 5 columns

\n", "
" ], "text/plain": [ " sex age wohnort volksmusik hardrock\n", "194 2 72 2 1.000000 5.0\n", "9 1 56 3 2.000000 5.0\n", "18 1 75 3 3.000000 5.0\n", "33 1 53 2 2.000000 5.0\n", "195 1 55 2 3.333333 5.0\n", ".. ... ... ... ... ...\n", "124 2 48 1 4.333333 1.0\n", "90 2 34 1 5.000000 1.0\n", "246 2 22 3 5.000000 1.0\n", "277 2 35 3 5.000000 1.0\n", "163 2 29 3 5.000000 1.0\n", "\n", "[294 rows x 5 columns]" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "daten.sort_values(by = 'hardrock', ascending = False, na_position = 'first')\n", "# Mit 'na_position' lässt sich einstellen, ob fehlende Werte (falls vorhanden) am Beginn oder am Ende angezeigt werden sollen" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Selbstverständlich kann nicht nur nach einer Variable, sondern nach beliebig vielen Variablen sortiert werden. Nachfolgend wurden die beiden Variablen *volksmusik* und *hardrock* ausgewählt, wobei zuerst nach *volksmusik* aufsteigend sortiert wird und dann nach *hardrock* (die Reihenfolge der Angabe der Variablen ist somit von Relevanz!)." ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sexagewohnortvolksmusikhardrock
1515821.03.000000
115711.03.333333
3714411.03.333333
17915821.03.333333
4514911.03.666667
..................
21012315.04.333333
22714325.04.333333
13513935.04.500000
29123115.04.666667
6322535.05.000000
\n", "

294 rows × 5 columns

\n", "
" ], "text/plain": [ " sex age wohnort volksmusik hardrock\n", "15 1 58 2 1.0 3.000000\n", "1 1 57 1 1.0 3.333333\n", "37 1 44 1 1.0 3.333333\n", "179 1 58 2 1.0 3.333333\n", "45 1 49 1 1.0 3.666667\n", ".. ... ... ... ... ...\n", "210 1 23 1 5.0 4.333333\n", "227 1 43 2 5.0 4.333333\n", "135 1 39 3 5.0 4.500000\n", "291 2 31 1 5.0 4.666667\n", "63 2 25 3 5.0 5.000000\n", "\n", "[294 rows x 5 columns]" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "daten.sort_values(by = ['volksmusik', 'hardrock'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 6.3) Daten filtern\n", "\n", "Filtern gehört jedenfalls zu den wichtigen und oft angewandten Methoden. Im Folgenden werden wir sowohl aus einem Dataframe einzelne Spalten (Variablen) herausfiltern als auch innerhalb von Spalten nach bestimmten Werten filtern. Um nur bestimmte Spalten angezeigt zu bekommen, bietet **Pandas** die Funktion **filter()** an. Nachfolgend interessieren uns nur die Variablen *age* und *sex*, also filtern wir diese heraus.\n", "\n", "\n", "[DataFrame.filter](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.filter.html#pandas.DataFrame.filter)" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
agesex
0501
1571
2662
3501
4601
\n", "
" ], "text/plain": [ " age sex\n", "0 50 1\n", "1 57 1\n", "2 66 2\n", "3 50 1\n", "4 60 1" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "daten.filter(items = ['age', 'sex']).head() # Mit 'items=[]' 2 Spalten auswählen" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Allerdings benötigen wird diese Funktion gar nicht, es geht auch so:" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
agesex
0501
1571
2662
3501
4601
\n", "
" ], "text/plain": [ " age sex\n", "0 50 1\n", "1 57 1\n", "2 66 2\n", "3 50 1\n", "4 60 1" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "daten[['age', 'sex']].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Wir können **filter()** auch in Kombination mit **describe()** anwenden, um uns div. Kennzahlen nur für ausgewählte Spalten ausgeben zu lassen. Genauso geht es - siehe weiter unten - aber auch ohne **filter()**." ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
hardrockvolksmusik
count294.0294.0
mean3.03.8
std1.11.1
min1.01.0
25%2.33.0
50%3.04.0
75%3.74.7
max5.05.0
\n", "
" ], "text/plain": [ " hardrock volksmusik\n", "count 294.0 294.0\n", "mean 3.0 3.8\n", "std 1.1 1.1\n", "min 1.0 1.0\n", "25% 2.3 3.0\n", "50% 3.0 4.0\n", "75% 3.7 4.7\n", "max 5.0 5.0" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "daten.filter(items=['hardrock', 'volksmusik']).describe().round(1)" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
hardrockvolksmusik
count294.0294.0
mean3.03.8
std1.11.1
min1.01.0
25%2.33.0
50%3.04.0
75%3.74.7
max5.05.0
\n", "
" ], "text/plain": [ " hardrock volksmusik\n", "count 294.0 294.0\n", "mean 3.0 3.8\n", "std 1.1 1.1\n", "min 1.0 1.0\n", "25% 2.3 3.0\n", "50% 3.0 4.0\n", "75% 3.7 4.7\n", "max 5.0 5.0" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "daten[['hardrock', 'volksmusik']].describe().round(1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Nach Werten innerhalb von Spalten filtern\n", "\n", "Dies verlangt nach einer anderen Vorgehensweise, die Funktion **filter()** hilft uns hierbei nicht.\n", "\n", "Nachfolgend wird nun ein neues Objekt (mit Namen *testfilter*) erstellt (genauer gesagt handelt es sich dabei um eine boolesche Variable, welche die Werte *wahr* oder *falsch* annehmen kann), welchem alle Daten aus der Variable *volksmusik* zugewiesen werden, die kleiner als 1,5 sind.\n", "\n", "[How To Filter Pandas Dataframe By Values of Column?](https://cmdlinetips.com/2018/02/how-to-subset-pandas-dataframe-based-on-values-of-a-column/)" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 True\n", "2 False\n", "3 False\n", "4 False\n", " ... \n", "289 False\n", "290 False\n", "291 False\n", "292 False\n", "293 False\n", "Name: volksmusik, Length: 294, dtype: bool" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "testfilter = daten['volksmusik'] < 1.5 # boolesche Variable erstellen\n", "\n", "testfilter" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Wendet man dies auf unser Dataframe *daten* an, so werden nur mehr jene Zeilen ausgegeben, auf welche die angegebene Bedingung zutrifft." ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sexagewohnortvolksmusikhardrock
115711.0000003.333333
1125011.3333331.000000
1314711.0000004.333333
1515821.0000003.000000
3714411.0000003.333333
4514911.0000003.666667
9123831.3333331.333333
10925121.0000004.333333
11212211.3333332.666667
14716921.0000004.333333
17016011.3333334.000000
17915821.0000003.333333
19427221.0000005.000000
20626011.0000003.666667
23814821.0000005.000000
\n", "
" ], "text/plain": [ " sex age wohnort volksmusik hardrock\n", "1 1 57 1 1.000000 3.333333\n", "11 2 50 1 1.333333 1.000000\n", "13 1 47 1 1.000000 4.333333\n", "15 1 58 2 1.000000 3.000000\n", "37 1 44 1 1.000000 3.333333\n", "45 1 49 1 1.000000 3.666667\n", "91 2 38 3 1.333333 1.333333\n", "109 2 51 2 1.000000 4.333333\n", "112 1 22 1 1.333333 2.666667\n", "147 1 69 2 1.000000 4.333333\n", "170 1 60 1 1.333333 4.000000\n", "179 1 58 2 1.000000 3.333333\n", "194 2 72 2 1.000000 5.000000\n", "206 2 60 1 1.000000 3.666667\n", "238 1 48 2 1.000000 5.000000" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "daten[testfilter]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Man kann sich das Erstellen der booleschen Variable auch ersparen und mit folgender direkten Methode zum gleichen Ziel kommen." ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sexagewohnortvolksmusikhardrock
115711.0000003.333333
1125011.3333331.000000
1314711.0000004.333333
1515821.0000003.000000
3714411.0000003.333333
4514911.0000003.666667
9123831.3333331.333333
10925121.0000004.333333
11212211.3333332.666667
14716921.0000004.333333
17016011.3333334.000000
17915821.0000003.333333
19427221.0000005.000000
20626011.0000003.666667
23814821.0000005.000000
\n", "
" ], "text/plain": [ " sex age wohnort volksmusik hardrock\n", "1 1 57 1 1.000000 3.333333\n", "11 2 50 1 1.333333 1.000000\n", "13 1 47 1 1.000000 4.333333\n", "15 1 58 2 1.000000 3.000000\n", "37 1 44 1 1.000000 3.333333\n", "45 1 49 1 1.000000 3.666667\n", "91 2 38 3 1.333333 1.333333\n", "109 2 51 2 1.000000 4.333333\n", "112 1 22 1 1.333333 2.666667\n", "147 1 69 2 1.000000 4.333333\n", "170 1 60 1 1.333333 4.000000\n", "179 1 58 2 1.000000 3.333333\n", "194 2 72 2 1.000000 5.000000\n", "206 2 60 1 1.000000 3.666667\n", "238 1 48 2 1.000000 5.000000" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "daten[daten['volksmusik'] < 1.5] # direktes Filtern nach Werten einer Variable" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Selbstverständlich kann man auch nach Werten mehrerer Variablen filtern. Die jeweiligen Bedingungen werden innerhalb runder Klammern geschrieben und durch '&' verknüpft." ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sexagewohnortvolksmusikhardrock
115711.0000003.333333
1314711.0000004.333333
1515821.0000003.000000
3714411.0000003.333333
4514911.0000003.666667
11212211.3333332.666667
14716921.0000004.333333
17016011.3333334.000000
17915821.0000003.333333
23814821.0000005.000000
\n", "
" ], "text/plain": [ " sex age wohnort volksmusik hardrock\n", "1 1 57 1 1.000000 3.333333\n", "13 1 47 1 1.000000 4.333333\n", "15 1 58 2 1.000000 3.000000\n", "37 1 44 1 1.000000 3.333333\n", "45 1 49 1 1.000000 3.666667\n", "112 1 22 1 1.333333 2.666667\n", "147 1 69 2 1.000000 4.333333\n", "170 1 60 1 1.333333 4.000000\n", "179 1 58 2 1.000000 3.333333\n", "238 1 48 2 1.000000 5.000000" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "daten[(daten['volksmusik'] < 1.5) & (daten['sex'] == 1)] # gleichzeitig nach Werten von zwei Variablen filtern" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Zum Abschluss noch ein Beispiel, in welchem nach Werten aus drei Variablen gefiltert wird und der Output danach auch noch nach zwei Variablen aufsteigend sortiert wird. Wir möchten nur jene Personen auflisten, die sehr gerne Volksmusik hören, nicht gerne Hardrock hören und männlich sind. Der Output soll nach Wohnort und Alter sortiert werden." ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sexagewohnortvolksmusikhardrock
23722312.04.333333
29023612.03.666667
2224112.03.666667
20626011.03.666667
13627712.05.000000
10925121.04.333333
19427221.05.000000
6225532.05.000000
226632.04.333333
9727532.03.666667
\n", "
" ], "text/plain": [ " sex age wohnort volksmusik hardrock\n", "237 2 23 1 2.0 4.333333\n", "290 2 36 1 2.0 3.666667\n", "22 2 41 1 2.0 3.666667\n", "206 2 60 1 1.0 3.666667\n", "136 2 77 1 2.0 5.000000\n", "109 2 51 2 1.0 4.333333\n", "194 2 72 2 1.0 5.000000\n", "62 2 55 3 2.0 5.000000\n", "2 2 66 3 2.0 4.333333\n", "97 2 75 3 2.0 3.666667" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "daten[(daten['volksmusik'] < 2.5) & (daten['hardrock'] > 3.5) & (daten['sex'] == 2)].sort_values(by = ['wohnort', 'age'])\n", "\n", "# nach Werten von drei Variablen gefiltert und Output nach zwei weiteren Variablen sortiert" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.9" } }, "nbformat": 4, "nbformat_minor": 2 }