{ "nbformat_minor": 0, "nbformat": 4, "cells": [ { "source": [ "$$\n", "\\def\\CC{\\bf C}\n", "\\def\\QQ{\\bf Q}\n", "\\def\\RR{\\bf R}\n", "\\def\\ZZ{\\bf Z}\n", "\\def\\NN{\\bf N}\n", "$$\n", "# Tableaux et analyse de donn\u00e9es avec Pandas" ], "cell_type": "markdown", "metadata": {} }, { "execution_count": null, "cell_type": "code", "source": [ "from __future__ import division, print_function # Python 3\n", "from sympy import init_printing\n", "init_printing(use_latex='mathjax',use_unicode=False) # Affichage des r\u00e9sultats" ], "outputs": [], "metadata": {} }, { "source": [ "Les donn\u00e9es massives jouent et continueront de jouer un r\u00f4le important dans la soci\u00e9t\u00e9 du 21e si\u00e8cle. Dans ce chapitre, nous ferons une introduction \u00e0 une librairie de l'environnement Python qui permet de repr\u00e9senter et analyser des donn\u00e9es. Cette librairie s'appelle [pandas](http://pandas.pydata.org/), contraction des termes anglais \"panel\" et \"data\". Dans pandas, un tableau 3-dimensionnel de donn\u00e9es est appel\u00e9 un \"panel\". La librairie pandas joue le m\u00eame r\u00f4le qu'un tableur comme Microsoft Excel, LibreOffice calc ou celui qu'on retrouve dans GeoGebra.\n", "\n", "[\"image\"](http://pandas.pydata.org/)\n", "\n", "Les principales structures de donn\u00e9es de pandas sont les `Series` (pour stocker des donn\u00e9es selon une dimension) et les `DataFrame` (pour stocker des donn\u00e9es selon 2 dimensions - lignes et colonnes). On peut aussi repr\u00e9senter des donn\u00e9es selon trois dimensions ou plus avec `Panel` et `Panel4D`.\n", "\n", "Dans ce chapitre nous d\u00e9crivons les tableaux de donn\u00e9es \u00e0 une et deux dimensions. Nous verrons comment faire des calculs statistiques et cr\u00e9er des graphiques \u00e0 partir de celles-ci. Nous verrons comment importer et exporter des donn\u00e9es. Finalement, nous ferons un exemple bas\u00e9 sur le site de donn\u00e9es de la Belgique: . On trouvera plus d'informations dans la [documentation en ligne](http://pandas.pydata.org/pandas-docs/stable/) de pandas incluant une [introduction en 10 minutes](http://pandas.pydata.org/pandas-docs/stable/10min.html), les [notions de base](http://pandas.pydata.org/pandas-docs/stable/basics.html#basics) et quelques [tutoriels](http://pandas.pydata.org/pandas-docs/stable/tutorials.html).\n", "\n", "## Tableau unidimensionnel de donn\u00e9es\n", "\n", "En utilisant sympy, construisons une liste de 0 et de 1 telle qu'un 1 est \u00e0 la position `i` si et seulement si `i` est un nombre premier:" ], "cell_type": "markdown", "metadata": {} }, { "execution_count": null, "cell_type": "code", "source": [ "from sympy import isprime\n", "L = [isprime(i) for i in range(15)]\n", "L" ], "outputs": [ { "execution_count": 1, "output_type": "execute_result", "data": { "text/plain": [ "[False, False, True, True, False, True, False, True, False, False, False, \n", " True, False, True, False]" ] }, "metadata": {} } ], "metadata": {} }, { "source": [ "La librairie pandas permet de repr\u00e9senter les tableaux unidimensionnels de donn\u00e9es appel\u00e9s *s\u00e9ries*. Faisons un premier exemple. La liste Python ci-haut peut \u00eatre transform\u00e9e en une s\u00e9rie de pandas en faisant comme suit:" ], "cell_type": "markdown", "metadata": {} }, { "execution_count": null, "cell_type": "code", "source": [ "from pandas import Series\n", "s = Series(L)\n", "s" ], "outputs": [ { "execution_count": 1, "output_type": "execute_result", "data": { "text/plain": [ "0 False\n", "1 False\n", "2 True\n", "3 True\n", "4 False\n", "5 True\n", "6 False\n", "7 True\n", "8 False\n", "9 False\n", "10 False\n", "11 True\n", "12 False\n", "13 True\n", "14 False\n", "dtype: bool" ] }, "metadata": {} } ], "metadata": {} }, { "source": [ "Par d\u00e9faut, les indices sont les nombres de `0` \u00e0 `n-1` o\u00f9 `n` est la taille de la liste. On peut acc\u00e9der aux \u00e9l\u00e9ments de la s\u00e9rie de la m\u00eame fa\u00e7on qu'on le fait pour les \u00e9l\u00e9ments d'une liste:" ], "cell_type": "markdown", "metadata": {} }, { "execution_count": null, "cell_type": "code", "source": [ "s[0]" ], "outputs": [ { "execution_count": 1, "output_type": "execute_result", "data": { "text/plain": [ "False" ] }, "metadata": {} } ], "metadata": {} }, { "execution_count": null, "cell_type": "code", "source": [ "s[7]" ], "outputs": [ { "execution_count": 1, "output_type": "execute_result", "data": { "text/plain": [ "True" ] }, "metadata": {} } ], "metadata": {} }, { "source": [ "## Afficher quelques statistiques\n", "\n", "L'int\u00e9r\u00eat des s\u00e9ries de pandas par rapport aux listes Python de base est qu'un grand nombres de fonctions utiles sont disponibles sur les s\u00e9ries de pandas et qui retournent souvent d'autres s\u00e9ries. Par exemple, on peut obtenir une br\u00e8ve description statistique des \u00e9l\u00e9ments d'une s\u00e9rie avec la m\u00e9thode `describe()` :" ], "cell_type": "markdown", "metadata": {} }, { "execution_count": null, "cell_type": "code", "source": [ "s.describe()" ], "outputs": [ { "execution_count": 1, "output_type": "execute_result", "data": { "text/plain": [ "count 15\n", "unique 2\n", "top False\n", "freq 9\n", "dtype: object" ] }, "metadata": {} } ], "metadata": {} }, { "source": [ "Ci-haut, cela nous indique qu'il y a deux valeurs distinctes dans la s\u00e9rie et que `False` est la plus fr\u00e9quence avec 9 apparitions sur 15. En effet, il y 6 nombres premiers inf\u00e9rieurs \u00e0 15.\n", "\n", "On peut obtenir la s\u00e9ries des sommes cumul\u00e9es d'une s\u00e9rie avec la m\u00e9thode `cumsum()`. Ici `False` vaut z\u00e9ro et `True` vaut `1` :" ], "cell_type": "markdown", "metadata": {} }, { "execution_count": null, "cell_type": "code", "source": [ "s.cumsum()" ], "outputs": [ { "execution_count": 1, "output_type": "execute_result", "data": { "text/plain": [ "0 0\n", "1 0\n", "2 1\n", "3 2\n", "4 2\n", "5 3\n", "6 3\n", "7 4\n", "8 4\n", "9 4\n", "10 4\n", "11 5\n", "12 5\n", "13 6\n", "14 6\n", "dtype: int64" ] }, "metadata": {} } ], "metadata": {} }, { "source": [ "Il suffit de faire `s.TOUCHE_TABULATION` pour voir les nombreuses possibilit\u00e9s offertes par pandas. On y reviendra.\n", "\n", "## Op\u00e9rations sur une s\u00e9rie\n", "\n", "Les op\u00e9rations arithm\u00e9tiques sont d\u00e9finies sur les s\u00e9ries. Elle sont appliqu\u00e9es sur chaque terme:" ], "cell_type": "markdown", "metadata": {} }, { "execution_count": null, "cell_type": "code", "source": [ "t = s.cumsum()\n", "t * 1000 + 43" ], "outputs": [ { "execution_count": 1, "output_type": "execute_result", "data": { "text/plain": [ "0 43\n", "1 43\n", "2 1043\n", "3 2043\n", "4 2043\n", "5 3043\n", "6 3043\n", "7 4043\n", "8 4043\n", "9 4043\n", "10 4043\n", "11 5043\n", "12 5043\n", "13 6043\n", "14 6043\n", "dtype: int64" ] }, "metadata": {} } ], "metadata": {} }, { "source": [ "On peut aussi appliquer une fonction aux \u00e9l\u00e9ments d'une s\u00e9rie avec la m\u00e9thode `apply` :" ], "cell_type": "markdown", "metadata": {} }, { "execution_count": null, "cell_type": "code", "source": [ "def carre_plus_trois(x):" ], "outputs": [ { "execution_count": 1, "output_type": "execute_result", "data": { "text/plain": [ "... return x**2 + 3" ] }, "metadata": {} } ], "metadata": {} }, { "execution_count": null, "cell_type": "code", "source": [ "t.apply(carre_plus_trois)" ], "outputs": [ { "execution_count": 1, "output_type": "execute_result", "data": { "text/plain": [ "0 3\n", "1 3\n", "2 4\n", "3 7\n", "4 7\n", "5 12\n", "6 12\n", "7 19\n", "8 19\n", "9 19\n", "10 19\n", "11 28\n", "12 28\n", "13 39\n", "14 39\n", "dtype: int64" ] }, "metadata": {} } ], "metadata": {} }, { "source": [ "## Concat\u00e9nation de deux s\u00e9ries\n", "\n", "Avec pandas, il est possible de construire un tableau comportant plus d'une colonne. Par exemple, les nombres premiers dans la premi\u00e8re colonne et la somme cumul\u00e9e dans la deuxi\u00e8me. Une premi\u00e8re fa\u00e7on est avec la fonction `concat` qui concat\u00e8ne deux s\u00e9ries:" ], "cell_type": "markdown", "metadata": {} }, { "execution_count": null, "cell_type": "code", "source": [ "from pandas import concat\n", "concat([s, s.cumsum()])" ], "outputs": [ { "execution_count": 1, "output_type": "execute_result", "data": { "text/plain": [ "0 0\n", "1 0\n", "2 1\n", "3 1\n", "4 0\n", "5 1\n", "6 0\n", "7 1\n", "8 0\n", "9 0\n", "10 0\n", "11 1\n", "12 0\n", "13 1\n", "14 0\n", "0 0\n", "1 0\n", "2 1\n", "3 2\n", "4 2\n", "5 3\n", "6 3\n", "7 4\n", "8 4\n", "9 4\n", "10 4\n", "11 5\n", "12 5\n", "13 6\n", "14 6\n", "dtype: int64" ] }, "metadata": {} } ], "metadata": {} }, { "source": [ "La concat\u00e9nation a \u00e9t\u00e9 faite une en-dessous de l'autre et cela a aussi eu pour effet de transformer les valeurs bool\u00e9ennes en nombres entiers, car les donn\u00e9es d'une m\u00eame colonne doivent avoir le m\u00eame type. Ce n'est pas exactement ce qu'on voulait. Pour sp\u00e9cifier que la concat\u00e9nation doit \u00eatre faite en colonnes, il faut sp\u00e9cifier dans quelle direction (axe) ou veut concat\u00e9ner les donn\u00e9es. On donne alors une valeur `1` \u00e0 l'argument `axis` plut\u00f4t que `0` (la valeur par d\u00e9faut) pour obtenir ce que l'on veut:" ], "cell_type": "markdown", "metadata": {} }, { "execution_count": null, "cell_type": "code", "source": [ "concat([s, s.cumsum()], axis=1)" ], "outputs": [ { "execution_count": 1, "output_type": "execute_result", "data": { "text/plain": [ " 0 1\n", "0 False 0\n", "1 False 0\n", "2 True 1\n", "3 True 2\n", "4 False 2\n", "5 True 3\n", "6 False 3\n", "7 True 4\n", "8 False 4\n", "9 False 4\n", "10 False 4\n", "11 True 5\n", "12 False 5\n", "13 True 6\n", "14 False 6" ] }, "metadata": {} } ], "metadata": {} }, { "source": [ "Pour donner des titres plus parlant aux colonnes, il s'agit de sp\u00e9cifier une liste de titres via l'argument `keys`. Comme le nombre de nombres entiers inf\u00e9rieur \u00e0 `x` est souvent d\u00e9not\u00e9 $\\pi(x)$, on utilise `'pi_x'` pour le nom de la deuxi\u00e8me colonne:" ], "cell_type": "markdown", "metadata": {} }, { "execution_count": null, "cell_type": "code", "source": [ "keys = ['isprime', 'pi_x']\n", "df = concat([s, s.cumsum()], axis=1, keys=keys)\n", "df" ], "outputs": [ { "execution_count": 1, "output_type": "execute_result", "data": { "text/plain": [ " isprime pi_x\n", "0 False 0\n", "1 False 0\n", "2 True 1\n", "3 True 2\n", "4 False 2\n", "5 True 3\n", "6 False 3\n", "7 True 4\n", "8 False 4\n", "9 False 4\n", "10 False 4\n", "11 True 5\n", "12 False 5\n", "13 True 6\n", "14 False 6" ] }, "metadata": {} } ], "metadata": {} }, { "source": [ "Le type du tableau ci-haut est `DataFrame` pour tableau de donn\u00e9es:" ], "cell_type": "markdown", "metadata": {} }, { "execution_count": null, "cell_type": "code", "source": [ "type(df)" ], "outputs": [ { "execution_count": 1, "output_type": "execute_result", "data": { "text/plain": [ "" ] }, "metadata": {} } ], "metadata": {} }, { "source": [ "## Tableau 2-dimensionnel de donn\u00e9es\n", "\n", "Une autre fa\u00e7on de cr\u00e9er le m\u00eame tableau est en utilisant la fonction `DataFrame` directement:" ], "cell_type": "markdown", "metadata": {} }, { "execution_count": null, "cell_type": "code", "source": [ "from pandas import DataFrame" ], "outputs": [], "metadata": {} }, { "source": [ "D'abord, on calcule en Python la liste des sommes cumul\u00e9es de la liste `L` :" ], "cell_type": "markdown", "metadata": {} }, { "execution_count": null, "cell_type": "code", "source": [ "L = [isprime(i) for i in range(15)]\n", "L_cumsum = [sum(L[:i]) for i in range(1,len(L)+1)]\n", "L_cumsum" ], "outputs": [ { "execution_count": 1, "output_type": "execute_result", "data": { "text/plain": [ "[0, 0, 1, 2, 2, 3, 3, 4, 4, 4, 4, 5, 5, 6, 6]" ] }, "metadata": {} } ], "metadata": {} }, { "source": [ "On cr\u00e9e un dictionnaire qui associe des noms de colonnes \u00e0 des valeurs:" ], "cell_type": "markdown", "metadata": {} }, { "execution_count": null, "cell_type": "code", "source": [ "d = {'isprime':L, 'pi_x':L_cumsum}\n", "d" ], "outputs": [ { "execution_count": 1, "output_type": "execute_result", "data": { "text/plain": [ "{'isprime': [False, False, True, True, False, True, False, True, \n", " False, False, False, True, False, True, False],\n", " 'pi_x': [0, 0, 1, 2, 2, 3, 3, 4, 4, 4, 4, 5, 5, 6, 6]}" ] }, "metadata": {} } ], "metadata": {} }, { "source": [ "On cr\u00e9e un objet de type `DataFrame` \u00e0 partir de ce dictionnaire:" ], "cell_type": "markdown", "metadata": {} }, { "execution_count": null, "cell_type": "code", "source": [ "df = DataFrame(d)\n", "df" ], "outputs": [ { "execution_count": 1, "output_type": "execute_result", "data": { "text/plain": [ " isprime pi_x\n", "0 False 0\n", "1 False 0\n", "2 True 1\n", "3 True 2\n", "4 False 2\n", "5 True 3\n", "6 False 3\n", "7 True 4\n", "8 False 4\n", "9 False 4\n", "10 False 4\n", "11 True 5\n", "12 False 5\n", "13 True 6\n", "14 False 6" ] }, "metadata": {} } ], "metadata": {} }, { "source": [ "Comme pour les s\u00e9ries, on peut obtenir les statistiques simples pour les donn\u00e9es de chaque colonne d'un tableau de donn\u00e9es avec la m\u00e9thode `describe()` :" ], "cell_type": "markdown", "metadata": {} }, { "execution_count": null, "cell_type": "code", "source": [ "df.describe()" ], "outputs": [ { "execution_count": 1, "output_type": "execute_result", "data": { "text/plain": [ " pi_x\n", "count 15.000000\n", "mean 3.266667\n", "std 1.944467\n", "min 0.000000\n", "25% 2.000000\n", "50% 4.000000\n", "75% 4.500000\n", "max 6.000000" ] }, "metadata": {} } ], "metadata": {} }, { "source": [ "Il est aussi possible de cr\u00e9er des tableaux de donn\u00e9es en dimensions sup\u00e9rieures, mais cela d\u00e9passe le cadre de ce cours:" ], "cell_type": "markdown", "metadata": {} }, { "execution_count": null, "cell_type": "code", "source": [ "from pandas import Panel,Panel4D" ], "outputs": [], "metadata": {} }, { "source": [ "## Acc\u00e9der \u00e0 une colonne d'un tableau\n", "\n", "Le nom des colonnes peut \u00eatre utilis\u00e9 pour acc\u00e9der aux colonnes d'un tableau de la fa\u00e7on suivante sans parenth\u00e8se:" ], "cell_type": "markdown", "metadata": {} }, { "execution_count": null, "cell_type": "code", "source": [ "df.pi_x" ], "outputs": [ { "execution_count": 1, "output_type": "execute_result", "data": { "text/plain": [ "0 0\n", "1 0\n", "2 1\n", "3 2\n", "4 2\n", "5 3\n", "6 3\n", "7 4\n", "8 4\n", "9 4\n", "10 4\n", "11 5\n", "12 5\n", "13 6\n", "14 6\n", "Name: pi_x, dtype: int64" ] }, "metadata": {} } ], "metadata": {} }, { "source": [ "Comme pour un dictionnaire, on peut aussi acc\u00e9der \u00e0 une colonne avec les crochets. Il faut alors sp\u00e9cifier le nom de la colonne entre guillemets:" ], "cell_type": "markdown", "metadata": {} }, { "execution_count": null, "cell_type": "code", "source": [ "df['pi_x']" ], "outputs": [ { "execution_count": 1, "output_type": "execute_result", "data": { "text/plain": [ "0 0\n", "1 0\n", "2 1\n", "3 2\n", "4 2\n", "5 3\n", "6 3\n", "7 4\n", "8 4\n", "9 4\n", "10 4\n", "11 5\n", "12 5\n", "13 6\n", "14 6\n", "Name: pi_x, dtype: int64" ] }, "metadata": {} } ], "metadata": {} }, { "source": [ "Cela peut se combiner avec d'autres m\u00e9thodes comme l'affichage de statistiques `df.pi_x.describe()` ou encore des calculs:" ], "cell_type": "markdown", "metadata": {} }, { "execution_count": null, "cell_type": "code", "source": [ "df.pi_x * 100" ], "outputs": [ { "execution_count": 1, "output_type": "execute_result", "data": { "text/plain": [ "0 0\n", "1 0\n", "2 100\n", "3 200\n", "4 200\n", "5 300\n", "6 300\n", "7 400\n", "8 400\n", "9 400\n", "10 400\n", "11 500\n", "12 500\n", "13 600\n", "14 600\n", "Name: pi_x, dtype: int64" ] }, "metadata": {} } ], "metadata": {} }, { "source": [ "## Afficher les premi\u00e8res/derni\u00e8res lignes\n", "\n", "Parfois, on travaille avec des tableaux de tr\u00e8s grande taille et il n'est pas pratique d'afficher toutes les donn\u00e9es \u00e0 l'\u00e9cran. On construit d'abord un tableau de 1000 lignes avec les m\u00eames colonnes que le pr\u00e9c\u00e9dent:" ], "cell_type": "markdown", "metadata": {} }, { "execution_count": null, "cell_type": "code", "source": [ "L = [isprime(i) for i in range(1000)]\n", "s = Series(L)\n", "d = {'isprime':s, 'pi_x':s.cumsum()}\n", "df = DataFrame(d)" ], "outputs": [], "metadata": {} }, { "source": [ "Pour afficher les cinq premi\u00e8res lignes d'un tableau de donn\u00e9es, on utilise la m\u00e9thode `head()` :" ], "cell_type": "markdown", "metadata": {} }, { "execution_count": null, "cell_type": "code", "source": [ "df.head()" ], "outputs": [ { "execution_count": 1, "output_type": "execute_result", "data": { "text/plain": [ " isprime pi_x\n", "0 False 0\n", "1 False 0\n", "2 True 1\n", "3 True 2\n", "4 False 2" ] }, "metadata": {} } ], "metadata": {} }, { "source": [ "Pour afficher les cinq derni\u00e8res lignes d'un tableau de donn\u00e9es, on utilise la m\u00e9thode `tail()` :" ], "cell_type": "markdown", "metadata": {} }, { "execution_count": null, "cell_type": "code", "source": [ "df.tail()" ], "outputs": [ { "execution_count": 1, "output_type": "execute_result", "data": { "text/plain": [ " isprime pi_x\n", "995 False 167\n", "996 False 167\n", "997 True 168\n", "998 False 168\n", "999 False 168" ] }, "metadata": {} } ], "metadata": {} }, { "source": [ "Les deux m\u00e9thodes `head` et `tail` peuvent prendre un nombre entier en argument pour indiquer le nombre de lignes \u00e0 afficher si on veut en voir plus ou moins:" ], "cell_type": "markdown", "metadata": {} }, { "execution_count": null, "cell_type": "code", "source": [ "df.tail(10)" ], "outputs": [ { "execution_count": 1, "output_type": "execute_result", "data": { "text/plain": [ " isprime pi_x\n", "990 False 166\n", "991 True 167\n", "992 False 167\n", "993 False 167\n", "994 False 167\n", "995 False 167\n", "996 False 167\n", "997 True 168\n", "998 False 168\n", "999 False 168" ] }, "metadata": {} } ], "metadata": {} }, { "source": [ "## Sous-tableau\n", "\n", "Pour acc\u00e9der \u00e0 un sous-tableau de lignes cons\u00e9cutives, on utilise les crochets comme pour les listes Python. Ici, on affiche le sous-tableau des lignes 500 \u00e0 519. En fait, cela cr\u00e9e un nouveau tableau de 20 lignes:" ], "cell_type": "markdown", "metadata": {} }, { "execution_count": null, "cell_type": "code", "source": [ "df[500:520]" ], "outputs": [ { "execution_count": 1, "output_type": "execute_result", "data": { "text/plain": [ " isprime pi_x x_logx\n", "500 False 95 80.4556\n", "501 False 95 80.5906\n", "502 False 95 80.7256\n", "503 True 96 80.8605\n", "504 False 96 80.9954\n", "505 False 96 81.1303\n", "506 False 96 81.2651\n", "507 False 96 81.3999\n", "508 False 96 81.5346\n", "509 True 97 81.6694\n", "510 False 97 81.804\n", "511 False 97 81.9387\n", "512 False 97 82.0733\n", "513 False 97 82.2079\n", "514 False 97 82.3425\n", "515 False 97 82.477\n", "516 False 97 82.6115\n", "517 False 97 82.7459\n", "518 False 97 82.8803\n", "519 False 97 83.0147" ] }, "metadata": {} } ], "metadata": {} }, { "source": [ "Pour acc\u00e9der \u00e0 une donn\u00e9e particuli\u00e8re dans le tableau, on utilise la m\u00e9thode `at` en sp\u00e9cifiant l'indice de la ligne puis le nom de la colonne entre crochets:" ], "cell_type": "markdown", "metadata": {} }, { "execution_count": null, "cell_type": "code", "source": [ "df.at[510, 'x_logx']" ], "outputs": [ { "execution_count": 1, "output_type": "execute_result", "data": { "text/plain": [ "81.804042504952918" ] }, "metadata": {} } ], "metadata": {} }, { "execution_count": null, "cell_type": "code", "source": [ "df.at[510, 'pi_x']" ], "outputs": [ { "execution_count": 1, "output_type": "execute_result", "data": { "text/plain": [ "97" ] }, "metadata": {} } ], "metadata": {} }, { "source": [ "## Ajouter une colonne dans un tableau\n", "\n", "Supposons que l'on veuille ajouter une colonne \u00e0 un tableau. Cela se fait avec la m\u00e9thode `insert()`.\n", "\n", "Johann Carl Friedrich Gauss avait devin\u00e9 au 19e si\u00e8cle que $\\pi(x)$, le nombre de nombres premiers inf\u00e9rieurs \u00e0 $x$, \u00e9tait approximativement $x/\\log(x)$. Construisons une s\u00e9rie qui calcule cette fonction pour les 1000 premiers nombres entiers:" ], "cell_type": "markdown", "metadata": {} }, { "execution_count": null, "cell_type": "code", "source": [ "from math import log\n", "def x_sur_log_x(x): " ], "outputs": [ { "execution_count": 1, "output_type": "execute_result", "data": { "text/plain": [ "... if x > 1:\n", "... return x/log(x)\n", "... else:\n", "... return None" ] }, "metadata": {} } ], "metadata": {} }, { "execution_count": null, "cell_type": "code", "source": [ "t = Series(range(1000)).apply(x_sur_log_x)" ], "outputs": [], "metadata": {} }, { "source": [ "On ajoute la nouvelle colonne avec la m\u00e9thode `insert` en sp\u00e9cifiant la position o\u00f9 on veut l'ins\u00e9rer, le titre de la colonne et les donn\u00e9es:" ], "cell_type": "markdown", "metadata": {} }, { "execution_count": null, "cell_type": "code", "source": [ "df.insert(2, 'x_logx', t)\n", "df['x_logx'] = t # equivalent, notation comme les dictionnaires Python" ], "outputs": [], "metadata": {} }, { "source": [ "En 1838, Dirichlet a contact\u00e9 Gauss pour lui dire qu'il avait trouv\u00e9 une meilleure approximation de la fontion $\\pi(x)$ en utilisant l'int\u00e9grale de l'inverse de la fonction $\\log(x)$, c'est-\u00e0-dire par la fonction $Li(x)=\\int_2^x {1\\over\\log(t)} dt$.\n", "\n", "En utilisant sympy, calculons les 1000 premi\u00e8res valeurs de $Li(x)$ et ajoutons cette colonne dans le tableau:" ], "cell_type": "markdown", "metadata": {} }, { "execution_count": null, "cell_type": "code", "source": [ "from sympy import Li \n", "K = [Li(x).n() for x in range(1000)]\n", "df['Li_x'] = Series(K, dtype='float64')" ], "outputs": [], "metadata": {} }, { "source": [ "On peut afficher les premi\u00e8res et derni\u00e8res lignes du tableau \u00e0 quatre colonnes:" ], "cell_type": "markdown", "metadata": {} }, { "execution_count": null, "cell_type": "code", "source": [ "df.head()" ], "outputs": [ { "execution_count": 1, "output_type": "execute_result", "data": { "text/plain": [ " isprime pi_x x_logx Li_x\n", "0 False 0 NaN -1.04516378011749\n", "1 False 0 NaN -inf\n", "2 True 1 2.88539 0\n", "3 True 2 2.73072 1.11842481454970\n", "4 False 2 2.88539 1.92242131492156" ] }, "metadata": {} } ], "metadata": {} }, { "execution_count": null, "cell_type": "code", "source": [ "df.tail()" ], "outputs": [ { "execution_count": 1, "output_type": "execute_result", "data": { "text/plain": [ " isprime pi_x x_logx Li_x\n", "995 False 167 144.146 175.840407548189\n", "996 False 167 144.269 175.985266957056\n", "997 True 168 144.393 176.130105300461\n", "998 False 168 144.517 176.274922605648\n", "999 False 168 144.641 176.419718899799" ] }, "metadata": {} } ], "metadata": {} }, { "source": [ "## Visualiser les donn\u00e9es\n", "\n", "On active d'abord les dessins de matplotlib dans le notebook Jupyter:" ], "cell_type": "markdown", "metadata": {} }, { "execution_count": null, "cell_type": "code", "source": [ "%matplotlib inline" ], "outputs": [], "metadata": {} }, { "source": [ "Pour visualiser les donn\u00e9es, il suffit d'utiliser la commande `plot` :" ], "cell_type": "markdown", "metadata": {} }, { "execution_count": null, "cell_type": "code", "source": [ "df.plot()" ], "outputs": [], "metadata": {} }, { "source": [ "\"image\"\n", "\n", "On voit bien que $\\pi(x)$, le nombre de nombres premiers inf\u00e9rieurs \u00e0 $x$, se trouve bien entre les fonctions $\\pi(x)$ et $Li(x)$ sur l'intervalle `[0,1000]`.\n", "\n", "On peut visualiser qu'une partie par exemple l'intervalle `[0,100]` en choisissant d'abord un sous-tableau:" ], "cell_type": "markdown", "metadata": {} }, { "execution_count": null, "cell_type": "code", "source": [ "df[:100].plot()" ], "outputs": [], "metadata": {} }, { "source": [ "\"image\"\n", "\n", "D'autres types de graphiques peuvent \u00eatre plus adapt\u00e9es dans d'autres situations (histogrammes, tartes, etc.). Voici la liste m\u00e9thodes disponibles:" ], "cell_type": "markdown", "metadata": {} }, { "execution_count": null, "cell_type": "code", "source": [ "df.plot.area df.plot.box df.plot.hist df.plot.pie\n", "df.plot.bar df.plot.density df.plot.kde df.plot.scatter\n", "df.plot.barh df.plot.hexbin df.plot.line" ], "outputs": [], "metadata": {} }, { "source": [ "On trouvera des exemples d'utilisation de ces m\u00e9thodes de visualisation de donn\u00e9es dans la documentation de pandas:\n", "\n", "\n", "\n", "## Exporter des donn\u00e9es\n", "\n", "Il est possible d'exporter un tableau de donn\u00e9es de pandas vers plusieurs formats:" ], "cell_type": "markdown", "metadata": {} }, { "execution_count": null, "cell_type": "code", "source": [ "df.to_[TOUCHE_TABULATION]" ], "outputs": [ { "execution_count": 1, "output_type": "execute_result", "data": { "text/plain": [ "df.to_clipboard df.to_excel df.to_json df.to_period df.to_sql \n", "df.to_csv df.to_gbq df.to_latex df.to_pickle df.to_stata \n", "df.to_dense df.to_hdf df.to_msgpack df.to_records df.to_string\n", "df.to_dict df.to_html df.to_panel df.to_sparse df.to_timestamp\n", "df.to_wide df.to_xarray" ] }, "metadata": {} } ], "metadata": {} }, { "source": [ "Pour exporter vers le format `.xlsx` on fait:" ], "cell_type": "markdown", "metadata": {} }, { "execution_count": null, "cell_type": "code", "source": [ "from pandas import ExcelWriter\n", "writer = ExcelWriter('tableau.xlsx')\n", "df.to_excel(writer,'Feuille 1')\n", "writer.save()" ], "outputs": [], "metadata": {} }, { "source": [ "On peut v\u00e9rifier que Excel ouvre bien ce fichier qui se trouve dans le m\u00eame r\u00e9pertoire que le notebook Jupyter (utiliser la commande `pwd`, abbr\u00e9viation de \"present working directory\" en anglais, pour conna\u00eetre ce r\u00e9pertoire en cas de doute).\n", "\n", "Pour exporter vers le format `.csv` on fait:" ], "cell_type": "markdown", "metadata": {} }, { "execution_count": null, "cell_type": "code", "source": [ "df.to_csv('tableau.csv')" ], "outputs": [], "metadata": {} }, { "source": [ "**NOTE**: L'importation et l'exportation vers le format excel .xls exige que les librairies Python `xlrd` et `openpyxl` soit install\u00e9es. On peut les installer avec pip gr\u00e2ce \u00e0 la commande `pip install xlrd openpyxl`.\n", "\n", "## Importer des donn\u00e9es\n", "\n", "Pour importer un fichier Excel dans pandas, on fait:" ], "cell_type": "markdown", "metadata": {} }, { "execution_count": null, "cell_type": "code", "source": [ "import pandas as pd\n", "df = pd.read_excel('tableau.xlsx')\n", "df.head()" ], "outputs": [ { "execution_count": 1, "output_type": "execute_result", "data": { "text/plain": [ " isprime pi_x Li_x x_logx\n", "0 False 0 -1.045164 NaN\n", "1 False 0 -inf NaN\n", "2 True 1 0.000000 2.885390\n", "3 True 2 1.118425 2.730718\n", "4 False 2 1.922421 2.885390" ] }, "metadata": {} } ], "metadata": {} }, { "source": [ "Parfois, un fichier Excel est corrompu et il vaut mieux passer par le format `.csv`. On proc\u00e8de alors ainsi:" ], "cell_type": "markdown", "metadata": {} }, { "execution_count": null, "cell_type": "code", "source": [ "df = pandas.read_csv('tableau.csv')\n", "df.head()" ], "outputs": [ { "execution_count": 1, "output_type": "execute_result", "data": { "text/plain": [ " Unnamed: 0 isprime pi_x Li_x x_logx\n", "0 0 False 0 -1.045164 NaN\n", "1 1 False 0 -inf NaN\n", "2 2 True 1 0.000000 2.885390\n", "3 3 True 2 1.118425 2.730718\n", "4 4 False 2 1.922421 2.885390" ] }, "metadata": {} } ], "metadata": {} }, { "source": [ "Parfois, la ligne de titre n'est pas sur la premi\u00e8re ligne. \u00c0 ce moment l\u00e0, on peut sp\u00e9cifier la valeur de l'argument `header` pour dire o\u00f9 commencer la lecture du fichier en entr\u00e9e:" ], "cell_type": "markdown", "metadata": {} }, { "execution_count": null, "cell_type": "code", "source": [ "df = pandas.read_csv('tableau.csv', header=56)\n", "df.head()" ], "outputs": [ { "execution_count": 1, "output_type": "execute_result", "data": { "text/plain": [ " 55 False 16 18.6860810929 13.7248383046\n", "0 56 False 16 18.935063 13.911828\n", "1 57 False 16 19.182942 14.098263\n", "2 58 False 16 19.429748 14.284156\n", "3 59 True 17 19.675508 14.469518\n", "4 60 False 17 19.920249 14.654360" ] }, "metadata": {} } ], "metadata": {} }, { "source": [ "## Exemple: analyser des donn\u00e9es de data.gov.be\n", "\n", "Le site web contient des centaines de donn\u00e9es de toutes sortes de sujet sur la Belgique. Par exemple, \u00e0 la page\n", "\n", "> \n", "\n", "on retrouve des donn\u00e9es m\u00e9t\u00e9orologiques de Ostende depuis 2010. Sur cette page, on peut y t\u00e9l\u00e9charger le fichier `meteoostende.xls` au format excel. On peut l'importer dans pandas facilement:" ], "cell_type": "markdown", "metadata": {} }, { "execution_count": null, "cell_type": "code", "source": [ "df = pandas.read_excel('meteoostende.xls')" ], "outputs": [], "metadata": {} }, { "source": [ "Il est possible d'\u00e9crire l'URL directement ce qui \u00e9vite d'avoir \u00e0 t\u00e9l\u00e9charger le fichier:" ], "cell_type": "markdown", "metadata": {} }, { "execution_count": null, "cell_type": "code", "source": [ "url = (\"http://opendata.digitalwallonia.be/dataset/\"" ], "outputs": [ { "execution_count": 1, "output_type": "execute_result", "data": { "text/plain": [ " \"4fd7a1cf-f959-46ff-83d0-807778fe3438/resource/\"\n", " \"14306677-fb41-4472-9a23-2923f5e22d69/download/meteoostende.xls\")" ] }, "metadata": {} } ], "metadata": {} }, { "execution_count": null, "cell_type": "code", "source": [ "df = pandas.read_excel(url)" ], "outputs": [], "metadata": {} }, { "source": [ "Ce tableau de donn\u00e9es comporte 1461 lignes:" ], "cell_type": "markdown", "metadata": {} }, { "execution_count": null, "cell_type": "code", "source": [ "len(df)" ], "outputs": [ { "execution_count": 1, "output_type": "execute_result", "data": { "text/plain": [ "1461" ] }, "metadata": {} } ], "metadata": {} }, { "source": [ "et 10 colonnes dont les titres sont:" ], "cell_type": "markdown", "metadata": {} }, { "execution_count": null, "cell_type": "code", "source": [ "df.columns" ], "outputs": [ { "execution_count": 1, "output_type": "execute_result", "data": { "text/plain": [ "Index([u'P\u00e9riode', u'Date', u'Temp\u00e9rature de l'air - moyenne (\u00b0C)',\n", " u'Temp\u00e9rature de l'air - minimum (\u00b0C)',\n", " u'Temp\u00e9rature de l'air - maximum (\u00b0C)', u'Humidit\u00e9 relative (%)',\n", " u'Rayonnement solaire quotidien - horizontal (kWh/m\u00b2/j)',\n", " u'Pression atmosph\u00e9rique (kPa)', u'Vitesse du vent (m/s)',\n", " u'Temp\u00e9rature du sol (\u00b0C)'],\n", " dtype='object')" ] }, "metadata": {} } ], "metadata": {} }, { "source": [ "Les premi\u00e8res lignes permettent de se donner une id\u00e9es des donn\u00e9es. On peut aussi utiliser `df.describe()` :" ], "cell_type": "markdown", "metadata": {} }, { "execution_count": null, "cell_type": "code", "source": [ "df.head()" ], "outputs": [ { "execution_count": 1, "output_type": "execute_result", "data": { "text/plain": [ " P\u00e9riode Date Temp\u00e9rature de l'air - moyenne (\u00b0C) \\\n", "0 1 2010-01-01 3.90\n", "1 2 2010-01-02 4.11\n", "2 3 2010-01-03 3.24\n", "3 4 2010-01-04 3.83\n", "4 5 2010-01-05 3.88\n", "\n", " Temp\u00e9rature de l'air - minimum (\u00b0C) Temp\u00e9rature de l'air - maximum (\u00b0C) \\\n", "0 2.76 5.20\n", "1 2.95 5.26\n", "2 2.26 4.73\n", "3 2.40 4.68\n", "4 2.99 4.35\n", "\n", " Humidit\u00e9 relative (%) \\\n", "0 0.7465\n", "1 0.8288\n", "2 0.7919\n", "3 0.7825\n", "4 0.7757\n", "\n", " Rayonnement solaire quotidien - horizontal (kWh/m\u00b2/j) \\\n", "0 1.08\n", "1 0.65\n", "2 1.04\n", "3 0.68\n", "4 0.72\n", "\n", " Pression atmosph\u00e9rique (kPa) Vitesse du vent (m/s) \\\n", "0 100.14 7.70\n", "1 101.28 6.13\n", "2 102.02 5.46\n", "3 101.67 3.45\n", "4 100.55 4.86\n", "\n", " Temp\u00e9rature du sol (\u00b0C)\n", "0 6.15\n", "1 6.11\n", "2 5.94\n", "3 5.56\n", "4 5.42" ] }, "metadata": {} } ], "metadata": {} }, { "source": [ "Pour voir ce qu'il y a \u00e0 la 100e ligne du tableau, on utilise la m\u00e9thode `iloc`. Ce sont les donn\u00e9es m\u00e9t\u00e9o du 11 avril 2010:" ], "cell_type": "markdown", "metadata": {} }, { "execution_count": null, "cell_type": "code", "source": [ "df.iloc[100]" ], "outputs": [ { "execution_count": 1, "output_type": "execute_result", "data": { "text/plain": [ "P\u00e9riode 101\n", "Date 2010-04-11 00:00:00\n", "Temp\u00e9rature de l'air - moyenne (\u00b0C) 7.25\n", "Temp\u00e9rature de l'air - minimum (\u00b0C) 5.68\n", "Temp\u00e9rature de l'air - maximum (\u00b0C) 9.16\n", "Humidit\u00e9 relative (%) 0.8023\n", "Rayonnement solaire quotidien - horizontal (kWh/m\u00b2/j) 4.69\n", "Pression atmosph\u00e9rique (kPa) 102.56\n", "Vitesse du vent (m/s) 7.62\n", "Temp\u00e9rature du sol (\u00b0C) 7.28\n", "Name: 100, dtype: object" ] }, "metadata": {} } ], "metadata": {} }, { "source": [ "Pour afficher les moyennes par colonnes, on utilise la m\u00e9thode `mean()` :" ], "cell_type": "markdown", "metadata": {} }, { "execution_count": null, "cell_type": "code", "source": [ "df.mean()" ], "outputs": [ { "execution_count": 1, "output_type": "execute_result", "data": { "text/plain": [ "P\u00e9riode 731.000000\n", "Temp\u00e9rature de l'air - moyenne (\u00b0C) 11.013005\n", "Temp\u00e9rature de l'air - minimum (\u00b0C) 9.289713\n", "Temp\u00e9rature de l'air - maximum (\u00b0C) 12.980171\n", "Humidit\u00e9 relative (%) 0.796279\n", "Rayonnement solaire quotidien - horizontal (kWh/m\u00b2/j) 3.283337\n", "Pression atmosph\u00e9rique (kPa) 101.377502\n", "Vitesse du vent (m/s) 6.117276\n", "Temp\u00e9rature du sol (\u00b0C) 11.255428\n", "dtype: float64" ] }, "metadata": {} } ], "metadata": {} }, { "source": [ "Pour \u00e9tudier une colonne en particulier, par exemple la pression atmosph\u00e9rique, c'est-\u00e0-dire la septi\u00e8me colonne, on peut proc\u00e9der ainsi:" ], "cell_type": "markdown", "metadata": {} }, { "execution_count": null, "cell_type": "code", "source": [ "s = df.icol(7)\n", "s.head()" ], "outputs": [ { "execution_count": 1, "output_type": "execute_result", "data": { "text/plain": [ "0 100.14\n", "1 101.28\n", "2 102.02\n", "3 101.67\n", "4 100.55\n", "Name: Pression atmosph\u00e9rique (kPa), dtype: float64" ] }, "metadata": {} } ], "metadata": {} }, { "execution_count": null, "cell_type": "code", "source": [ "s.describe()" ], "outputs": [ { "execution_count": 1, "output_type": "execute_result", "data": { "text/plain": [ "count 1461.000000\n", "mean 101.377502\n", "std 0.932066\n", "min 97.470000\n", "25% 100.850000\n", "50% 101.430000\n", "75% 101.970000\n", "max 103.820000\n", "Name: Pression atmosph\u00e9rique (kPa), dtype: float64" ] }, "metadata": {} } ], "metadata": {} }, { "source": [ "Finalement, on peut dessiner l'\u00e9volution de la pression atmosph\u00e9rique en fonction de la date:" ], "cell_type": "markdown", "metadata": {} }, { "execution_count": null, "cell_type": "code", "source": [ "date = df.columns[1]\n", "pression = df.columns[7]\n", "df.plot(x=date, y=pression)" ], "outputs": [], "metadata": {} }, { "source": [ "\"image\"\n", "\n", "Pour afficher un histogramme de la pression atmosph\u00e9rique, il s'agit d'utiliser `df.plot.hist` avec les m\u00eames arguments:" ], "cell_type": "markdown", "metadata": {} }, { "execution_count": null, "cell_type": "code", "source": [ "df.plot.hist(x=date, y=pression)" ], "outputs": [], "metadata": {} }, { "source": [ "\"image\"\n", "\n", "## Filtrer les lignes d'un tableau\n", "\n", "Parfois, il est pertinent de filtrer les lignes d'un tableau `df`. La fa\u00e7on de faire est d'abord de cr\u00e9er une s\u00e9rie `s_vrai_faux` avec le m\u00eame nombre de lignes contenant des valeurs bool\u00e9ennes en utilisant `True` pour les lignes que l'on veut garder et `False` sinon. La syntaxe est la suivante: `df[s_vrai_faux]` qui retourne un tableau filtr\u00e9.\n", "\n", "Voici un premier exemple facile o\u00f9 on veut afficher que les nombres multiples de 3 d'une s\u00e9rie:" ], "cell_type": "markdown", "metadata": {} }, { "execution_count": null, "cell_type": "code", "source": [ "In [32]: s = Series(range(10))\n", "In [31]: s\n", "Out[31]:\n", "0 0\n", "1 1\n", "2 2\n", "3 3\n", "4 4\n", "5 5\n", "6 6\n", "7 7\n", "8 8\n", "9 9\n", "dtype: int64" ], "outputs": [], "metadata": {} }, { "source": [ "On cr\u00e9e une s\u00e9rie de la m\u00eame longueur qui teste si les entr\u00e9es sont multiples de trois ou non:" ], "cell_type": "markdown", "metadata": {} }, { "execution_count": null, "cell_type": "code", "source": [ "In [29]: s % 3 == 0\n", "Out[29]:\n", "0 True\n", "1 False\n", "2 False\n", "3 True\n", "4 False\n", "5 False\n", "6 True\n", "7 False\n", "8 False\n", "9 True\n", "dtype: bool" ], "outputs": [], "metadata": {} }, { "source": [ "On utilise la pr\u00e9c\u00e9dent s\u00e9rie de bool\u00e9en pour filtrer les lignes de la premi\u00e8re s\u00e9rie:" ], "cell_type": "markdown", "metadata": {} }, { "execution_count": null, "cell_type": "code", "source": [ "In [30]: s[s % 3 == 0]\n", "Out[30]:\n", "0 0\n", "3 3\n", "6 6\n", "9 9\n", "dtype: int64" ], "outputs": [], "metadata": {} }, { "source": [ "Faisons maintenant un exemple au sujet de la m\u00e9t\u00e9o de Ostende. Supposons qu'on s'int\u00e9resse \u00e0 la temp\u00e9rature moyenne les jours de No\u00ebl \u00e0 Ostende. D'abord, on cr\u00e9e une fonction qui teste si une date est bien le jour de No\u00ebl:" ], "cell_type": "markdown", "metadata": {} }, { "execution_count": null, "cell_type": "code", "source": [ "est_noel = lambda date:date.day==25 and date.month==12" ], "outputs": [], "metadata": {} }, { "source": [ "On applique cette fonction au tableau. On obtient une s\u00e9rie de vrai ou faux:" ], "cell_type": "markdown", "metadata": {} }, { "execution_count": null, "cell_type": "code", "source": [ "s_vrai_faux = df['Date'].apply(est_noel)\n", "s_vrai_faux.tail(10)" ], "outputs": [ { "execution_count": 1, "output_type": "execute_result", "data": { "text/plain": [ "1451 False\n", "1452 False\n", "1453 False\n", "1454 True\n", "1455 False\n", "1456 False\n", "1457 False\n", "1458 False\n", "1459 False\n", "1460 False\n", "Name: Date, dtype: bool" ] }, "metadata": {} } ], "metadata": {} }, { "source": [ "Finalement, on filtre le tableau avec cette s\u00e9rie. Et on affiche que les deux colonnes qui nous int\u00e9ressent (la date et la temp\u00e9rature):" ], "cell_type": "markdown", "metadata": {} }, { "execution_count": null, "cell_type": "code", "source": [ "df_noel = df[s_vrai_faux]\n", "df_noel.icol([1,2])" ], "outputs": [ { "execution_count": 1, "output_type": "execute_result", "data": { "text/plain": [ " Date Temp\u00e9rature de l'air - moyenne (\u00b0C)\n", "358 2010-12-25 3.63\n", "723 2011-12-25 10.62\n", "1089 2012-12-25 9.22\n", "1454 2013-12-25 7.23" ] }, "metadata": {} } ], "metadata": {} }, { "source": [ "## Conclusion\n", "\n", "Les outils Python tels que la librairie pandas sont utilis\u00e9s par les gens qui analysent des donn\u00e9es comme le m\u00e9dia alternatif [BuzzFeedNews](https://www.buzzfeed.com/news) qui a mis au jour en janvier 2016 \\[TennisRacket\\]\\_ le fait que des matchs de tennis de l'ATP avaient \u00e9t\u00e9 truqu\u00e9s. Les donn\u00e9es ainsi que les notebook Jupyter r\u00e9alis\u00e9s par BuzzFeedNews sont disponibles sur github \u00e0 l'adresse . On y trouvera d'autres analyses de donn\u00e9es tels que les tremblements de terre reli\u00e9s \u00e0 l'exploitation des gaz de schiste aux \u00c9tats-Unis, les mouvements des donateurs de la campagne pr\u00e9sidentielle am\u00e9ricaine lorsqu'un candidat sort de la course, ou une analyse du placement des enfants dans les cr\u00e8ches.\n", "\n", "Le lecteur d\u00e9sirant en savoir plus sur pandas est invit\u00e9 \u00e0 lire les [tutoriels en ligne](http://pandas.pydata.org/pandas-docs/stable/tutorials.html) sur pandas. La librairie pandas est utilis\u00e9e par la librairie Python de statistiques [StatsModels](http://statsmodels.sourceforge.net/) qui permet de faire encore plus comme des mod\u00e8les statistiques, des estimations et des tests statistiques." ], "cell_type": "markdown", "metadata": {} } ], "metadata": { "kernelspec": { "display_name": "python2", "name": "python2" } } }