{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# IREP Data and Assignment" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# I. Building the dataset" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import urllib.request\n", "import zipfile\n", "from tqdm import tqdm\n", "from pyproj import Proj, transform\n", "import cartopy.crs as ccrs\n", "import cartopy.feature as cfeature\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Downloading the data\n", "\n", "We start by downloading all the data. \n", "\n", "In the lab session you manually downloaded the .zip for each year, then renamed and unzipped it. That is tedious! \n", "\n", "The next cell shows how to automate this." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import requests\n", "\n", "headers = {\n", "    'Connection': 'keep-alive',\n", "    'Upgrade-Insecure-Requests': '1',\n", "    'DNT': '1',\n", "    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36',\n", "    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',\n", "    'Referer': 'http://www.georisques.gouv.fr/dossiers/irep/telechargement',\n", "    'Accept-Encoding': 'gzip, deflate',\n", "    'Accept-Language': 'en-US,en;q=0.9,fr-FR;q=0.8,fr;q=0.7,ru;q=0.6,de;q=0.5,pt;q=0.4',\n", "}\n", "\n", "url = 'http://www.georisques.gouv.fr/irep/data/'\n", "for i in tqdm(range(2003,2018)):\n", "    # Download the archive for year i and save it as ./<year>.zip\n", "    response = requests.get(url+str(i), headers=headers, verify=False)\n", "    with open('./'+str(i)+'.zip', mode='wb') as localfile:\n", "        localfile.write(response.content)\n", "\n", "    # Extract the archive into ./data/<year>/\n", "    with zipfile.ZipFile('./'+str(i)+'.zip',\"r\") as zip_ref:\n", "        zip_ref.extractall(\"./data/\"+str(i)+'/')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Explanations:\n", "\n", "To see how I found the request, see [this link](https://www.alexkras.com/copy-any-api-call-as-curl-request-with-chrome-developer-tools/). \n", "The cURL command is converted into a Python request using [this site](https://curl.trillworks.com/#python). \n", "I removed the cookies because they are not needed. \n", "\n", "We then save the response body (the .zip file) and unzip the archive." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Coordinate system\n", "\n", "Xavier Dupré uses a brute-force search to find the coordinate system used in the data. 
However, this is not necessary.\n", "\n", "A quick search on the IREP site turns up [this page](http://www.georisques.gouv.fr/dossiers/irep/form-etablissement/details/2534#/), which states that the system used is \"Lambert II Etendu\". \n", "\n", "From there we find this [pdf](http://www.ign.fr/sites/all/files/geodesie_projections.pdf) and this [page](https://spatialreference.org/ref/epsg/ntf-paris-lambert-zone-ii/).\n", "\n", "So in `pyproj` we must use `epsg:27572`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Waste\n", "\n", "For a given company and year, the data list emissions of several types of waste. It is easy to check that all waste quantities are expressed in the same unit. For a given company and year we **must therefore sum the emissions** (otherwise we end up with concentric circles in the final plot). Ideally, the sum should be weighted by the toxicity of each type of waste." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "100%|██████████| 14/14 [00:07<00:00, 1.77it/s]\n" ] } ], "source": [ "p1 = Proj(init='epsg:4326')   # longitude / latitude\n", "p2 = Proj(init='epsg:27572')  # Lambert II étendu\n", "\n", "# Initialize the dataset with the year 2003\n", "df = pd.read_csv(\"./data/2003/Prod_dechets_dangereux.csv\")\n", "df2 = pd.read_csv(\"./data/2003/etablissements.csv\")\n", "# Convert the Lambert II étendu coordinates to longitude / latitude\n", "long, lat = transform(p2, p1, df2.Coordonnees_X.values, df2.Coordonnees_Y.values)\n", "df2['LLX'] = long\n", "df2['LLY'] = lat\n", "df = df.merge(df2, on=\"Identifiant\")\n", "# Sum the waste quantities of each company for the year\n", "df = df.groupby(['Identifiant']).agg(\n", "    {'Quantite':'sum', 'Nom_Etablissement_x':'first',\n", "     'LLX':'first', 'LLY':'first'}).reset_index()\n", "df = df.rename({'Quantite': 'Quantite2003'}, axis='columns')  # Rename the Quantite column to Quantite2003\n", "\n", "for i in tqdm(range(2004,2018)):  # Add the following years one by one\n", "    df_temp = pd.read_csv(\"./data/\"+str(i)+\"/Prod_dechets_dangereux.csv\")\n", "    df2_temp = pd.read_csv(\"./data/\"+str(i)+\"/etablissements.csv\")\n", "    long, lat = transform(p2, p1, df2_temp.Coordonnees_X.values, df2_temp.Coordonnees_Y.values)\n", "    df2_temp['LLX'] = long\n", "    df2_temp['LLY'] = lat\n", "    df_temp = df_temp.merge(df2_temp, on=\"Identifiant\")\n", "    df_temp = df_temp.groupby(['Identifiant']).agg(\n", "        {'Quantite':'sum', 'Nom_Etablissement_x':'first',\n", "         'LLX':'first', 'LLY':'first'}).reset_index()\n", "    df_temp = df_temp.rename({'Quantite': 'Quantite'+str(i)}, axis='columns')\n", "    # Add the data for year i to the final DataFrame\n", "    df = df.merge(df_temp, on=[\"Identifiant\",\"Nom_Etablissement_x\",\"LLX\",\"LLY\"], how=\"outer\")\n", "\n", "# The outer merge introduces missing values (NaN); replace them with a numeric 0\n", "# so that the quantity columns stay numeric\n", "df = df.fillna(0)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
|  | Identifiant | Quantite2003 | Nom_Etablissement_x | LLX | LLY | Quantite2004 | Quantite2005 | Quantite2006 | Quantite2007 | Quantite2008 | Quantite2009 | Quantite2010 | Quantite2011 | Quantite2012 | Quantite2013 | Quantite2014 | Quantite2015 | Quantite2016 | Quantite2017 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4016 | 0.00001 | 0 | SARL JEAN CARTON | 2.492857 | 50.976482 | 0 | 0 | 112.909 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8568 | 6.00713 | 0 | Sabena Technics | 4.429484 | 43.677185 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 63 | 51.743 | 78.5 | 0 | 0 |
| 5881 | 25.08495 | 0 | SARL PHENIX RECYCLAGE | -1.414239 | 43.530031 | 0 | 0 | 0 | 0 | 0 | 36.84 | 40 | 45.06 | 32.05 | 25.35 | 0 | 0 | 0 | 0 |
| 9737 | 29.00334 | 0 | station d'épuration | -3.698529 | 47.932145 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 9 | 0 | 0 |
| 10265 | 29.16724 | 0 | LE PAPE ENVIRONNEMENT (PLUGUFFAN) | -4.179016 | 47.980242 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 921 | 0 |