{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "\"GMV\n", "\"UPM\n", "

QA: AEMET dataset 🌥️

\n", "
INESDATA-MOV
\n", "
\n", "\n", "# Análisis de calidad\n", "Este cuaderno analiza la calidad del dataset proveniente de la fuente de datos de la Agencia Estatal de Meteorología ([AEMET](https://www.aemet.es/es/portada)). La calidad del mismo se validará teniendo en cuenta los siguientes aspectos:\n", "\n", "* Análisis del dataset\n", "* Análisis de las variables\n", "* Conversiones de tipos de datos\n", "\n", "La **calidad del dato** se refiere a la medida en que los datos son adecuados para su uso, por lo que es esencial para garantizar la confiabilidad y utilidad de los datos en diversas aplicaciones y contextos. Así, en este notebook se evaluarán también las cinco dimensiones de la calidad del dato:\n", "1. **Unicidad**: Ausencia de duplicados o registros repetidos en un conjunto de datos. Los datos son únicos cuando cada registro o entidad en el conjunto de datos es único y no hay duplicados presentes.\n", "2. **Exactitud**: Los datos exactos son libres de errores y representan con precisión la realidad que están destinados a describir. Esto implica que los datos deben ser correctos y confiables para su uso en análisis y toma de decisiones.\n", "3. **Completitud**: Los datos completos contienen toda la información necesaria para el análisis y no tienen valores faltantes o nulos que puedan afectar la interpretación o validez de los resultados.\n", "4. **Consistencia**: Los datos consistentes mantienen el mismo formato, estructura y significado en todas las instancias, lo que facilita su comparación y análisis sin ambigüedad.\n", "5. **Validez**: Medida en que los datos son precisos y representan con exactitud la realidad que están destinados a describir. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Nota

\n", "

\n", "Este dataset ha sido creado ejecutando el comando create del paquete de Python inesdata_mov_datasets.
\n", "Para poder ejecutar este comando es necesario haber ejecutado antes el comando extract, que realiza la extracción de datos de la API de la AEMET y los almacena en Minio. El comando create se encargaría de descargar dichos datos y unirlos todos en un único dataset.\n", "

\n", "
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import os\n", "import pandas as pd\n", "import numpy as np\n", "\n", "import warnings\n", "warnings.filterwarnings(\"ignore\")\n", "\n", "from ydata_profiling import ProfileReport" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "ROOT_PATH = os.path.dirname(os.path.dirname(os.path.abspath(os.getcwd())))\n", "DATA_PATH = os.path.join(ROOT_PATH, \"data\", \"processed\")\n", "AEMET_DATA_PATH = os.path.join(DATA_PATH, \"aemet\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Cada fila de este dataset representa la meteorología de Madrid para una determinada fecha y hora concretos.**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

-

\n", "

\n", "Vamos a analizar la calidad del dataset generado para los días comprendidos entre 02/03/2024 - 31/03/2024 que son los días de los que disponemos información actualmente.\n", "

\n", "
" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "#####################################################################\n", "# Creacción del dataset completo con toda la información disponible #\n", "#####################################################################\n", "\n", "# Directorio con los CSV\n", "directory = AEMET_DATA_PATH\n", "\n", "aemet_data = []\n", "\n", "for root, directories, files in os.walk(directory):\n", " for filename in files:\n", " if filename.endswith('.csv'):\n", " filepath = os.path.join(root, filename)\n", " aemet_data.append(pd.read_csv(filepath))\n", "\n", "aemet_dataset = pd.concat(aemet_data, ignore_index=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## QA checks ✅" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Análisis del dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " El dataset anterior recoge datos de casi todo marzo del 2024. En este apartado se analiza si el dataset presenta alguna anomalía de manera general. " ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
estadoCielo_valueestadoCielo_descripcionprecipitacion_valueprobPrecipitacion_valueprobTormenta_valuenieve_valueprobNieve_valuetemperatura_valuesensTermica_valuehumedadRelativa_valuedireccionvelocidadvientoAndRachaMax_valuedatetime
012nPoco nuboso0NaNNaN0NaN8.05.077.0['SO']['16']NaN2024-03-02 00:00:00
112nPoco nuboso0NaNNaN0NaN8.05.077.0NaNNaN27.02024-03-02 00:00:00
215nMuy nuboso0NaNNaN0NaN8.05.080.0['O']['17']NaN2024-03-02 01:00:00
315nMuy nuboso0NaNNaN0NaN8.05.080.0NaNNaN27.02024-03-02 01:00:00
4NaNNaNNaN0.00.0NaN0.0NaNNaNNaNNaNNaNNaN2024-03-02 01:07:00
\n", "
" ], "text/plain": [ " estadoCielo_value estadoCielo_descripcion precipitacion_value \\\n", "0 12n Poco nuboso 0 \n", "1 12n Poco nuboso 0 \n", "2 15n Muy nuboso 0 \n", "3 15n Muy nuboso 0 \n", "4 NaN NaN NaN \n", "\n", " probPrecipitacion_value probTormenta_value nieve_value probNieve_value \\\n", "0 NaN NaN 0 NaN \n", "1 NaN NaN 0 NaN \n", "2 NaN NaN 0 NaN \n", "3 NaN NaN 0 NaN \n", "4 0.0 0.0 NaN 0.0 \n", "\n", " temperatura_value sensTermica_value humedadRelativa_value direccion \\\n", "0 8.0 5.0 77.0 ['SO'] \n", "1 8.0 5.0 77.0 NaN \n", "2 8.0 5.0 80.0 ['O'] \n", "3 8.0 5.0 80.0 NaN \n", "4 NaN NaN NaN NaN \n", "\n", " velocidad vientoAndRachaMax_value datetime \n", "0 ['16'] NaN 2024-03-02 00:00:00 \n", "1 NaN 27.0 2024-03-02 00:00:00 \n", "2 ['17'] NaN 2024-03-02 01:00:00 \n", "3 NaN 27.0 2024-03-02 01:00:00 \n", "4 NaN NaN 2024-03-02 01:07:00 " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Visualización del dataset\n", "aemet_dataset.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " Se observa que para la misma fecha existen varios datos. Esto se debe a que en los datos de la AEMET la variable *vientoAndRachaMax* aparece separada del resto. Lo que haremos entonces será hacer una agrupación por fecha para que los datos que nos falten de *vientoAndRachaMax* se añadan al resto. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " Veamos un ejemplo de esto para una fecha dada: 2024-03-02 00:00:00 " ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
estadoCielo_valueestadoCielo_descripcionprecipitacion_valueprobPrecipitacion_valueprobTormenta_valuenieve_valueprobNieve_valuetemperatura_valuesensTermica_valuehumedadRelativa_valuedireccionvelocidadvientoAndRachaMax_valuedatetime
012nPoco nuboso0NaNNaN0NaN8.05.077.0['SO']['16']NaN2024-03-02 00:00:00
112nPoco nuboso0NaNNaN0NaN8.05.077.0NaNNaN27.02024-03-02 00:00:00
162612nPoco nuboso0.0NaNNaN0NaN8.05.077.0['SO']['16']NaN2024-03-02 00:00:00
162712nPoco nuboso0.0NaNNaN0NaN8.05.077.0NaNNaN27.02024-03-02 00:00:00
\n", "
" ], "text/plain": [ " estadoCielo_value estadoCielo_descripcion precipitacion_value \\\n", "0 12n Poco nuboso 0 \n", "1 12n Poco nuboso 0 \n", "1626 12n Poco nuboso 0.0 \n", "1627 12n Poco nuboso 0.0 \n", "\n", " probPrecipitacion_value probTormenta_value nieve_value \\\n", "0 NaN NaN 0 \n", "1 NaN NaN 0 \n", "1626 NaN NaN 0 \n", "1627 NaN NaN 0 \n", "\n", " probNieve_value temperatura_value sensTermica_value \\\n", "0 NaN 8.0 5.0 \n", "1 NaN 8.0 5.0 \n", "1626 NaN 8.0 5.0 \n", "1627 NaN 8.0 5.0 \n", "\n", " humedadRelativa_value direccion velocidad vientoAndRachaMax_value \\\n", "0 77.0 ['SO'] ['16'] NaN \n", "1 77.0 NaN NaN 27.0 \n", "1626 77.0 ['SO'] ['16'] NaN \n", "1627 77.0 NaN NaN 27.0 \n", "\n", " datetime \n", "0 2024-03-02 00:00:00 \n", "1 2024-03-02 00:00:00 \n", "1626 2024-03-02 00:00:00 \n", "1627 2024-03-02 00:00:00 " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "aemet_dataset[aemet_dataset[\"datetime\"]==\"2024-03-02 00:00:00\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " Para esta fecha existen hasta 4 registros diferentes. De estos 4 registros, 2 de ellos son duplicados de los otros dos, así que además de disponer información incompleta se tiene información duplicada. Vemos a continuación si ocurre lo mismo para el resto de fechas: " ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "datetime\n", "2024-03-02 00:00:00 4\n", "2024-03-21 13:00:00 4\n", "2024-03-21 05:00:00 4\n", "2024-03-21 06:00:00 4\n", "2024-03-21 07:00:00 4\n", " ..\n", "2024-03-07 13:19:00 2\n", "2024-03-19 07:13:00 2\n", "2024-03-27 19:01:00 2\n", "2024-03-05 07:13:00 2\n", "2024-03-18 13:19:00 2\n", "Name: datetime, Length: 773, dtype: int64" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "aemet_dataset.groupby([\"datetime\"])[\"datetime\"].count().sort_values(ascending=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " Para eliminar la información duplicada cada variable debe tener el mismo tipo de datos para todas sus filas. Nos fijaremos en los tipos actuales para establecer el tipo de cada variable. " ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 2848 entries, 0 to 2847\n", "Data columns (total 14 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 estadoCielo_value 2624 non-null object \n", " 1 estadoCielo_descripcion 2624 non-null object \n", " 2 precipitacion_value 2624 non-null object \n", " 3 probPrecipitacion_value 224 non-null float64\n", " 4 probTormenta_value 224 non-null float64\n", " 5 nieve_value 2624 non-null object \n", " 6 probNieve_value 224 non-null float64\n", " 7 temperatura_value 2604 non-null float64\n", " 8 sensTermica_value 2604 non-null float64\n", " 9 humedadRelativa_value 2604 non-null float64\n", " 10 direccion 1302 non-null object \n", " 11 velocidad 1302 non-null object \n", " 12 vientoAndRachaMax_value 1302 non-null float64\n", " 13 datetime 2848 non-null object \n", "dtypes: float64(7), object(7)\n", "memory usage: 311.6+ KB\n" ] } ], "source": [ "aemet_dataset.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " Variables como *precipitacion_value* y *nieve_value* están declaradas como strings cuando deberían ser variables float. Si vemos los diferentes valores que poseen, se detecta que existe el valor \"Ip\" que es el que hace que estas variables se determinen como strings. Dado que \"Ip\" significa precipitación inapreciable o cantidad de nieve inapreciable se sustituirán estos valores por 0.0 " ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['0', nan, '0.1', '0.6', '6', '0.2', '1', '0.5', '4', '0.9', '0.8',\n", " '0.7', 'Ip', '0.3', '3', '2', '0.4', 0.0, 0.1, 0.6, 6.0, 0.2, 1.0,\n", " 0.5, 4.0, 0.9, 3.0, 2.0, 0.4, 0.3, 0.7, 0.8], dtype=object)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "aemet_dataset[\"precipitacion_value\"].unique()" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['0', nan, 'Ip', 0.0], dtype=object)" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "aemet_dataset[\"nieve_value\"].unique()" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "aemet_dataset = aemet_dataset.replace({'precipitacion_value': {\"Ip\":0.0}, 'nieve_value': {\"Ip\":0.0}})" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['0', nan, '0.1', '0.6', '6', '0.2', '1', '0.5', '4', '0.9', '0.8',\n", " '0.7', 0.0, '0.3', '3', '2', '0.4', 0.1, 0.6, 6.0, 0.2, 1.0, 0.5,\n", " 4.0, 0.9, 3.0, 2.0, 0.4, 0.3, 0.7, 0.8], dtype=object)" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "aemet_dataset[\"precipitacion_value\"].unique()" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['0', nan, '0.1', '0.6', '6', '0.2', '1', '0.5', '4', '0.9', '0.8',\n", " '0.7', 0.0, '0.3', '3', '2', '0.4', 0.1, 0.6, 6.0, 0.2, 1.0, 0.5,\n", " 4.0, 0.9, 3.0, 2.0, 0.4, 0.3, 0.7, 0.8], dtype=object)" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "aemet_dataset[\"precipitacion_value\"].unique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " De esta manera ya se puede seleccionar un tipo para cada variable: " ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "aemet_dataset = aemet_dataset.astype({'estadoCielo_value': 'str', 'estadoCielo_descripcion':'str',\\\n", " 'precipitacion_value': 'float64', 'probPrecipitacion_value': 'float64',\\\n", " 'probTormenta_value': 'float64', 'nieve_value': 'float64',\\\n", " 'probNieve_value': 'float64', 'temperatura_value': 'float64',\\\n", " 'sensTermica_value': 'float64', 'humedadRelativa_value': 'float64',\\\n", " 'vientoAndRachaMax_value': 'float64' \n", " })" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 2848 entries, 0 to 2847\n", "Data columns (total 14 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 estadoCielo_value 2848 non-null object \n", " 1 estadoCielo_descripcion 2848 non-null object \n", " 2 precipitacion_value 2624 non-null float64\n", " 3 probPrecipitacion_value 224 non-null float64\n", " 4 probTormenta_value 224 non-null float64\n", " 5 nieve_value 2624 non-null float64\n", " 6 probNieve_value 224 non-null float64\n", " 7 temperatura_value 2604 non-null float64\n", " 8 sensTermica_value 2604 non-null float64\n", " 9 humedadRelativa_value 2604 non-null float64\n", " 10 direccion 1302 non-null object \n", " 11 velocidad 1302 non-null object \n", " 12 vientoAndRachaMax_value 1302 non-null float64\n", " 13 datetime 2848 non-null object \n", "dtypes: float64(9), object(5)\n", "memory usage: 311.6+ KB\n" ] } ], "source": [ "aemet_dataset.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " En este punto se pueden eliminar los duplicados: " ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "aemet_dataset = aemet_dataset.drop_duplicates()" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "datetime\n", "2024-03-02 00:00:00 2\n", "2024-03-21 13:00:00 2\n", "2024-03-21 05:00:00 2\n", "2024-03-21 06:00:00 2\n", "2024-03-21 07:00:00 2\n", " ..\n", "2024-03-07 13:19:00 1\n", "2024-03-19 07:13:00 1\n", "2024-03-27 19:01:00 1\n", "2024-03-05 07:13:00 1\n", "2024-03-18 13:19:00 1\n", "Name: datetime, Length: 773, dtype: int64" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "aemet_dataset.groupby([\"datetime\"])[\"datetime\"].count().sort_values(ascending=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " También se puede ya resumir la información de dos filas en una única fila:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
estadoCielo_valueestadoCielo_descripcionprecipitacion_valueprobPrecipitacion_valueprobTormenta_valuenieve_valueprobNieve_valuetemperatura_valuesensTermica_valuehumedadRelativa_valuedireccionvelocidadvientoAndRachaMax_valuedatetime
012nPoco nuboso0.0NaNNaN0.0NaN8.05.077.0['SO']['16']NaN2024-03-02 00:00:00
112nPoco nuboso0.0NaNNaN0.0NaN8.05.077.0NaNNaN27.02024-03-02 00:00:00
\n", "
" ], "text/plain": [ " estadoCielo_value estadoCielo_descripcion precipitacion_value \\\n", "0 12n Poco nuboso 0.0 \n", "1 12n Poco nuboso 0.0 \n", "\n", " probPrecipitacion_value probTormenta_value nieve_value probNieve_value \\\n", "0 NaN NaN 0.0 NaN \n", "1 NaN NaN 0.0 NaN \n", "\n", " temperatura_value sensTermica_value humedadRelativa_value direccion \\\n", "0 8.0 5.0 77.0 ['SO'] \n", "1 8.0 5.0 77.0 NaN \n", "\n", " velocidad vientoAndRachaMax_value datetime \n", "0 ['16'] NaN 2024-03-02 00:00:00 \n", "1 NaN 27.0 2024-03-02 00:00:00 " ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "aemet_dataset[aemet_dataset[\"datetime\"]==\"2024-03-02 00:00:00\"]" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "aemet_reduced = aemet_dataset.groupby([\"datetime\"])[['direccion','velocidad','vientoAndRachaMax_value']].sum()\n", "aemet_dataset = aemet_dataset.drop(columns=[\"direccion\", \"velocidad\", \"vientoAndRachaMax_value\"])\n", "\n", "aemet_dataset = aemet_dataset.merge(aemet_reduced, how=\"inner\", on= \"datetime\").drop_duplicates()" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "datetime\n", "2024-03-02 00:00:00 1\n", "2024-03-21 07:00:00 1\n", "2024-03-22 06:00:00 1\n", "2024-03-22 07:00:00 1\n", "2024-03-22 07:13:00 1\n", " ..\n", "2024-03-13 07:00:00 1\n", "2024-03-13 07:13:00 1\n", "2024-03-13 08:00:00 1\n", "2024-03-13 09:00:00 1\n", "2024-03-31 23:00:00 1\n", "Name: datetime, Length: 773, dtype: int64" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "aemet_dataset.groupby([\"datetime\"])[\"datetime\"].count().sort_values(ascending=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " Terminamos comprobando que el dataset del que disponemos en este punto tiene sentido en términos de cantidad de registros: " ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
estadoCielo_value
monthday
3224
324
424
524
624
724
824
1124
1224
1324
1424
1524
1624
1724
1824
1924
2024
2124
2223
2323
2423
2523
2623
2723
2823
2923
3023
3122
\n", "
" ], "text/plain": [ " estadoCielo_value\n", "month day \n", "3 2 24\n", " 3 24\n", " 4 24\n", " 5 24\n", " 6 24\n", " 7 24\n", " 8 24\n", " 11 24\n", " 12 24\n", " 13 24\n", " 14 24\n", " 15 24\n", " 16 24\n", " 17 24\n", " 18 24\n", " 19 24\n", " 20 24\n", " 21 24\n", " 22 23\n", " 23 23\n", " 24 23\n", " 25 23\n", " 26 23\n", " 27 23\n", " 28 23\n", " 29 23\n", " 30 23\n", " 31 22" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "aemet_dataset_test = aemet_dataset.copy()\n", "\n", "aemet_dataset_test[\"datetime\"] = pd.to_datetime(aemet_dataset_test[\"datetime\"])\n", "\n", "aemet_dataset_test[\"day\"] = aemet_dataset_test[\"datetime\"].dt.day\n", "aemet_dataset_test[\"month\"] = aemet_dataset_test[\"datetime\"].dt.month\n", "aemet_dataset_test[\"year\"] = aemet_dataset_test[\"datetime\"].dt.year\n", "aemet_dataset_test[\"hour\"] = aemet_dataset_test[\"datetime\"].dt.hour\n", "aemet_dataset_test[\"minute\"] = aemet_dataset_test[\"datetime\"].dt.minute\n", "\n", "aemet_dataset_test[aemet_dataset_test[\"minute\"]==0].groupby([\"month\", \"day\"])[[\"estadoCielo_value\"]].count()" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "aemet_dataset.reset_index(drop=True, inplace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " De los resultados obtenidos se observa que el dataset ya presenta el número de registros que se supone (1 por hora o aproximadamente 1 por hora). Si agrupásemos por horas y minutos existen casos en los que no se dispone sólo información horaria sino que existen varios valores para la misma hora. Esto lo veremos más adelante ya que corresponde más bien con el análisis de variables. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Análisis de las variables y conversiones de tipos\n", "\n", " Realizamos conjúntamente el análisis de las variables y las conversiones a su tipo determinado de manera conjunta ya que ambas cosas van ligadas. En pasos anteriores se ha realizado ya la conversión de tipos pero de manera sencilla, en este paso estudiamos más cautelosamente cada caso." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 773 entries, 0 to 772\n", "Data columns (total 14 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 estadoCielo_value 773 non-null object \n", " 1 estadoCielo_descripcion 773 non-null object \n", " 2 precipitacion_value 661 non-null float64\n", " 3 probPrecipitacion_value 112 non-null float64\n", " 4 probTormenta_value 112 non-null float64\n", " 5 nieve_value 661 non-null float64\n", " 6 probNieve_value 112 non-null float64\n", " 7 temperatura_value 651 non-null float64\n", " 8 sensTermica_value 651 non-null float64\n", " 9 humedadRelativa_value 651 non-null float64\n", " 10 datetime 773 non-null object \n", " 11 direccion 773 non-null object \n", " 12 velocidad 773 non-null object \n", " 13 vientoAndRachaMax_value 773 non-null float64\n", "dtypes: float64(9), object(5)\n", "memory usage: 84.7+ KB\n" ] } ], "source": [ "aemet_dataset.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Estado cielo\n", "\n", " El estado del cielo lo describen dos variables: *estadoCielo_value* y *estadoCielo_descripcion*. Conociendo la correspondencia entre ambas variables se podría prescindir de una de ellas." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
estadoCielo_descripcionestadoCielo_value
Bruma82
82n
Cubierto16
16n
Cubierto con lluvia26
26n
Cubierto con lluvia escasa46
46n
Cubierto con nieve36
Cubierto con tormenta54
54n
Cubierto con tormenta y lluvia escasa64
64n
Despejado11
11n
Intervalos nubosos13
Intervalos nubosos con lluvia escasa43
43n
Muy nuboso15
15n
Muy nuboso con lluvia escasa45
45n
Muy nuboso con tormenta y lluvia escasa63n
Niebla81
Nubes altas17
17n
Nuboso14
14n
Nuboso con lluvia escasa44
Nuboso con tormenta y lluvia escasa62
Poco nuboso12
12n
nannan
\n", "
" ], "text/plain": [ "Empty DataFrame\n", "Columns: []\n", "Index: [(Bruma, 82), (Bruma, 82n), (Cubierto, 16), (Cubierto, 16n), (Cubierto con lluvia, 26), (Cubierto con lluvia, 26n), (Cubierto con lluvia escasa, 46), (Cubierto con lluvia escasa, 46n), (Cubierto con nieve, 36), (Cubierto con tormenta, 54), (Cubierto con tormenta, 54n), (Cubierto con tormenta y lluvia escasa, 64), (Cubierto con tormenta y lluvia escasa, 64n), (Despejado, 11), (Despejado, 11n), (Intervalos nubosos, 13), (Intervalos nubosos con lluvia escasa, 43), (Intervalos nubosos con lluvia escasa, 43n), (Muy nuboso, 15), (Muy nuboso, 15n), (Muy nuboso con lluvia escasa, 45), (Muy nuboso con lluvia escasa, 45n), (Muy nuboso con tormenta y lluvia escasa, 63n), (Niebla , 81), (Nubes altas, 17), (Nubes altas, 17n), (Nuboso, 14), (Nuboso, 14n), (Nuboso con lluvia escasa, 44), (Nuboso con tormenta y lluvia escasa, 62), (Poco nuboso, 12), (Poco nuboso, 12n), (nan, nan)]" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "aemet_dataset[[\"estadoCielo_value\", \"estadoCielo_descripcion\"]].groupby([\"estadoCielo_descripcion\", \"estadoCielo_value\"]).count()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " En la tabla anterior se observa que, a pesar de que por cada valor de descripción existan uno o varios valores, estos tienen el mismo significado. Por ello eliminaremos la columna *estadoCielo_descripcion* y haremos que los valores de *estadoCielo_value* sean valores numéricos para un mejor uso de esta variable en el modelo de predicción. Finalmente renombraremos esta variable a *estado_cielo*" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "# Cambiamos los valores 12n, 17n, etc. por 12, 17, etc. ya que representan el \n", "# mismo valor de estadoCielo\n", "aemet_dataset[\"estadoCielo_value\"] = aemet_dataset[\"estadoCielo_value\"].replace(to_replace=r'(\\d+)n', value=r'\\1', regex=True).replace('nan', np.nan)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
estadoCielo_descripcionestadoCielo_value
Bruma82
Cubierto16
Cubierto con lluvia26
Cubierto con lluvia escasa46
Cubierto con nieve36
Cubierto con tormenta54
Cubierto con tormenta y lluvia escasa64
Despejado11
Intervalos nubosos13
Intervalos nubosos con lluvia escasa43
Muy nuboso15
Muy nuboso con lluvia escasa45
Muy nuboso con tormenta y lluvia escasa63
Niebla81
Nubes altas17
Nuboso14
Nuboso con lluvia escasa44
Nuboso con tormenta y lluvia escasa62
Poco nuboso12
\n", "
" ], "text/plain": [ "Empty DataFrame\n", "Columns: []\n", "Index: [(Bruma, 82), (Cubierto, 16), (Cubierto con lluvia, 26), (Cubierto con lluvia escasa, 46), (Cubierto con nieve, 36), (Cubierto con tormenta, 54), (Cubierto con tormenta y lluvia escasa, 64), (Despejado, 11), (Intervalos nubosos, 13), (Intervalos nubosos con lluvia escasa, 43), (Muy nuboso, 15), (Muy nuboso con lluvia escasa, 45), (Muy nuboso con tormenta y lluvia escasa, 63), (Niebla , 81), (Nubes altas, 17), (Nuboso, 14), (Nuboso con lluvia escasa, 44), (Nuboso con tormenta y lluvia escasa, 62), (Poco nuboso, 12)]" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "aemet_dataset[[\"estadoCielo_value\", \"estadoCielo_descripcion\"]].groupby([\"estadoCielo_descripcion\", \"estadoCielo_value\"]).count()" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "# Eliminamos la columna estadoCielo_descripcion\n", "aemet_dataset.drop(columns='estadoCielo_descripcion', inplace=True)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "# Cambiamos el tipo de la variable estadoCielo_value a numérico y \n", "# cambiamos el nombre de esta columna por estado_cielo\n", "aemet_dataset['estadoCielo_value'] = pd.to_numeric(aemet_dataset['estadoCielo_value'], errors='coerce')\n", "aemet_dataset.rename(columns={'estadoCielo_value': 'estado_cielo'}, inplace=True)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "estado_cielo\n", "17.0 169\n", "11.0 163\n", "NaN 112\n", "16.0 105\n", "12.0 61\n", "46.0 47\n", "64.0 47\n", "15.0 27\n", "43.0 15\n", "14.0 10\n", "36.0 3\n", "54.0 3\n", "45.0 2\n", "26.0 2\n", "82.0 2\n", "63.0 1\n", "44.0 1\n", "81.0 1\n", "62.0 1\n", "13.0 1\n", "Name: count, dtype: int64" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "aemet_dataset[\"estado_cielo\"].value_counts(dropna=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Precipitación\n", "\n", " La precipitación la describen tres variables: *precipitacion_value*, *probPrecipitacion_value* y *probTormenta_value*. Veamos si todas ellas son necesarias para el análisis." ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 773 entries, 0 to 772\n", "Data columns (total 13 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 estado_cielo 661 non-null float64\n", " 1 precipitacion_value 661 non-null float64\n", " 2 probPrecipitacion_value 112 non-null float64\n", " 3 probTormenta_value 112 non-null float64\n", " 4 nieve_value 661 non-null float64\n", " 5 probNieve_value 112 non-null float64\n", " 6 temperatura_value 651 non-null float64\n", " 7 sensTermica_value 651 non-null float64\n", " 8 humedadRelativa_value 651 non-null float64\n", " 9 datetime 773 non-null object \n", " 10 direccion 773 non-null object \n", " 11 velocidad 773 non-null object \n", " 12 vientoAndRachaMax_value 773 non-null float64\n", "dtypes: float64(10), object(3)\n", "memory usage: 78.6+ KB\n" ] } ], "source": [ "aemet_dataset.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " Se observa que tanto *probPrecipitacion_value* como *probTormenta_value* sólo tienen valores diferentes a los nulos para el 15% de la muestra total. Si se observa el dataset detenidamente, nos damos cuenta de que estas probabilidades no están establecidas a horas determinadas (00:00:00, 01:00:00, etc.) sino a horas del tipo 00:07:00. Esto se debe a que las probabilidades se corresponden con periodos de tiempo. En verdad 00:07:00 indica que esa probabilidad se prevee entre las 00:00:00 y las 07:00:00. Vamos a observar si se pueden juntar estos datos para horas determinadas. " ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
datetimeprobPrecipitacion_value
22024-03-02 01:07:000.0
92024-03-02 07:13:00100.0
162024-03-02 13:19:00100.0
232024-03-02 19:01:0070.0
302024-03-03 01:07:000.0
\n", "
" ], "text/plain": [ " datetime probPrecipitacion_value\n", "2 2024-03-02 01:07:00 0.0\n", "9 2024-03-02 07:13:00 100.0\n", "16 2024-03-02 13:19:00 100.0\n", "23 2024-03-02 19:01:00 70.0\n", "30 2024-03-03 01:07:00 0.0" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "aemet_dataset[~aemet_dataset[\"probPrecipitacion_value\"].isnull()][[\"datetime\", \"probPrecipitacion_value\"]].head()" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "aemet_dataset_prob = aemet_dataset.copy()\n", "\n", "aemet_dataset_prob[\"datetime\"] = pd.to_datetime(aemet_dataset_prob[\"datetime\"])\n", "\n", "aemet_dataset_prob[\"day\"] = aemet_dataset_prob[\"datetime\"].dt.day\n", "aemet_dataset_prob[\"month\"] = aemet_dataset_prob[\"datetime\"].dt.month\n", "aemet_dataset_prob[\"year\"] = aemet_dataset_prob[\"datetime\"].dt.year\n", "aemet_dataset_prob[\"hour\"] = aemet_dataset_prob[\"datetime\"].dt.hour\n", "aemet_dataset_prob[\"minute\"] = aemet_dataset_prob[\"datetime\"].dt.minute\n", "aemet_dataset_prob[\"second\"] = aemet_dataset_prob[\"datetime\"].dt.second" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "aemet_dataset_prob = aemet_dataset_prob[[\"day\", \"month\", \"hour\", \"minute\", \"second\", \"probPrecipitacion_value\", \"probTormenta_value\", \"probNieve_value\"]]" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
daymonthhourminutesecondprobPrecipitacion_valueprobTormenta_valueprobNieve_value
023000NaNNaNNaN
123100NaNNaNNaN
2231700.00.00.0
323200NaNNaNNaN
423300NaNNaNNaN
523400NaNNaNNaN
623500NaNNaNNaN
723600NaNNaNNaN
823700NaNNaNNaN
9237130100.025.00.0
\n", "
" ], "text/plain": [ " day month hour minute second probPrecipitacion_value \\\n", "0 2 3 0 0 0 NaN \n", "1 2 3 1 0 0 NaN \n", "2 2 3 1 7 0 0.0 \n", "3 2 3 2 0 0 NaN \n", "4 2 3 3 0 0 NaN \n", "5 2 3 4 0 0 NaN \n", "6 2 3 5 0 0 NaN \n", "7 2 3 6 0 0 NaN \n", "8 2 3 7 0 0 NaN \n", "9 2 3 7 13 0 100.0 \n", "\n", " probTormenta_value probNieve_value \n", "0 NaN NaN \n", "1 NaN NaN \n", "2 0.0 0.0 \n", "3 NaN NaN \n", "4 NaN NaN \n", "5 NaN NaN \n", "6 NaN NaN \n", "7 NaN NaN \n", "8 NaN NaN \n", "9 25.0 0.0 " ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "aemet_dataset_prob[0:10]" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
daymonthhourminutesecondprobPrecipitacion_valueprobTormenta_valueprobNieve_value
023000NaNNaNNaN
123100NaNNaNNaN
2231700.00.00.0
3232000.00.00.0
4233000.00.00.0
5234000.00.00.0
6235000.00.00.0
7236000.00.00.0
8237000.00.00.0
9237130100.025.00.0
\n", "
" ], "text/plain": [ " day month hour minute second probPrecipitacion_value \\\n", "0 2 3 0 0 0 NaN \n", "1 2 3 1 0 0 NaN \n", "2 2 3 1 7 0 0.0 \n", "3 2 3 2 0 0 0.0 \n", "4 2 3 3 0 0 0.0 \n", "5 2 3 4 0 0 0.0 \n", "6 2 3 5 0 0 0.0 \n", "7 2 3 6 0 0 0.0 \n", "8 2 3 7 0 0 0.0 \n", "9 2 3 7 13 0 100.0 \n", "\n", " probTormenta_value probNieve_value \n", "0 NaN NaN \n", "1 NaN NaN \n", "2 0.0 0.0 \n", "3 0.0 0.0 \n", "4 0.0 0.0 \n", "5 0.0 0.0 \n", "6 0.0 0.0 \n", "7 0.0 0.0 \n", "8 0.0 0.0 \n", "9 25.0 0.0 " ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "aemet_dataset_prob[0:10].fillna(method=\"ffill\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " Como se observa en las tablas anteriores *ffill* es un buen método para conseguir los datos que faltan. El siguiente paso sería eliminar los registros con *minute!=0*. Aprovecharemos ya también para dividir datetime en día, mes, hora, minutos y segundos. " ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "aemet_dataset[\"datetime\"] = pd.to_datetime(aemet_dataset[\"datetime\"])\n", "\n", "aemet_dataset[\"day\"] = aemet_dataset[\"datetime\"].dt.day\n", "aemet_dataset[\"month\"] = aemet_dataset[\"datetime\"].dt.month\n", "aemet_dataset[\"year\"] = aemet_dataset[\"datetime\"].dt.year\n", "aemet_dataset[\"hour\"] = aemet_dataset[\"datetime\"].dt.hour\n", "aemet_dataset[\"minute\"] = aemet_dataset[\"datetime\"].dt.minute\n", "aemet_dataset[\"second\"] = aemet_dataset[\"datetime\"].dt.second\n", "\n", "aemet_dataset.drop(columns=[\"datetime\"], inplace=True)" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [], "source": [ "aemet_dataset.fillna(method=\"ffill\", inplace=True)" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": [ "aemet_dataset = aemet_dataset[aemet_dataset.minute==0]" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [], "source": [ "aemet_dataset.reset_index(drop=True, inplace=True)" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 661 entries, 0 to 660\n", "Data columns (total 18 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 estado_cielo 661 non-null float64\n", " 1 precipitacion_value 661 non-null float64\n", " 2 probPrecipitacion_value 659 non-null float64\n", " 3 probTormenta_value 659 non-null float64\n", " 4 nieve_value 661 non-null float64\n", " 5 probNieve_value 659 non-null float64\n", " 6 temperatura_value 661 non-null float64\n", " 7 sensTermica_value 661 non-null float64\n", " 8 humedadRelativa_value 661 non-null float64\n", " 9 direccion 661 non-null object \n", " 10 velocidad 661 non-null object \n", " 11 vientoAndRachaMax_value 661 non-null float64\n", " 12 day 661 non-null int32 \n", " 13 month 661 non-null int32 \n", " 14 year 661 non-null int32 \n", " 15 hour 661 non-null int32 \n", " 16 minute 661 non-null int32 \n", " 17 second 661 non-null int32 \n", "dtypes: float64(10), int32(6), object(2)\n", "memory usage: 77.6+ KB\n" ] } ], "source": [ "aemet_dataset.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Nieve\n", "\n", " Algo similar a lo que ocurría con la precipitación ocurre con la nieve. Se analizará mejor la *probNieve_value* en el EDA." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Dirección\n", "\n", "La variable *direccion* representa los valores de norte, sur, este, oeste. Como vienen en una lista, los sacamos y los convertimos a categorías." ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "direccion\n", "['SO'] 143\n", "['NE'] 135\n", "['O'] 117\n", "['S'] 104\n", "['SE'] 54\n", "['E'] 48\n", "['NO'] 37\n", "['N'] 13\n", "0 10\n", "Name: count, dtype: int64" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "aemet_dataset[\"direccion\"].value_counts()" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['SO', 'O', 'NO', 'NE', 'N', 'S', 'SE', 'E', NaN]\n", "Categories (8, object): ['E', 'N', 'NE', 'NO', 'O', 'S', 'SE', 'SO']" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "aemet_dataset[\"direccion\"] = aemet_dataset[\"direccion\"].str.replace(\"['\", \"\").str.replace(\"']\", \"\").astype('category')\n", "aemet_dataset[\"direccion\"].unique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Velocidad\n", "\n", "Los valores de la variable *velocidad* vienen en una lista, los sacamos y los convertimos a *float*." ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "velocidad\n", "['10'] 54\n", "['8'] 47\n", "['9'] 46\n", "['7'] 39\n", "['6'] 34\n", "Name: count, dtype: int64" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "aemet_dataset['velocidad'].value_counts().head()" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "velocidad\n", "10.0 54\n", "8.0 47\n", "9.0 46\n", "7.0 39\n", "6.0 34\n", "Name: count, dtype: int64" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "aemet_dataset[\"velocidad\"] = aemet_dataset[\"velocidad\"].str.replace(\"['\", \"\").str.replace(\"']\", \"\").astype(float)\n", "aemet_dataset[\"velocidad\"].value_counts().head()" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
estado_cieloprecipitacion_valueprobPrecipitacion_valueprobTormenta_valuenieve_valueprobNieve_valuetemperatura_valuesensTermica_valuehumedadRelativa_valuedireccionvelocidadvientoAndRachaMax_valuedaymonthyearhourminutesecond
012.00.0NaNNaN0.0NaN8.05.077.0SO16.027.0232024000
115.00.0NaNNaN0.0NaN8.05.080.0O17.027.0232024100
215.00.00.00.00.00.08.05.082.0O17.033.0232024200
316.00.00.00.00.00.08.05.084.0SO20.040.0232024300
416.00.00.00.00.00.08.04.086.0O25.035.0232024400
516.00.00.00.00.00.08.05.084.0SO21.037.0232024500
616.00.00.00.00.00.08.05.086.0SO23.044.0232024600
716.00.00.00.00.00.08.04.082.0SO27.051.0232024700
816.00.0100.025.00.00.08.04.081.0SO32.047.0232024800
946.00.1100.025.00.00.08.04.080.0SO28.047.0232024900
1016.00.0100.025.00.00.08.04.080.0SO30.050.02320241000
1116.00.0100.025.00.00.08.04.078.0SO29.051.02320241100
1216.00.0100.025.00.00.08.04.073.0SO31.057.02320241200
1364.00.1100.025.00.00.08.04.075.0SO35.058.02320241300
1464.00.1100.080.00.00.09.05.075.0SO35.062.02320241400
1516.00.0100.080.00.00.09.05.079.0SO38.062.02320241500
1664.00.1100.080.00.00.09.05.079.0SO38.061.02320241600
1764.00.6100.080.00.00.08.03.086.0SO39.072.02320241700
1836.06.0100.080.00.00.08.03.091.0SO45.028.02320241800
1946.00.2100.080.00.00.07.05.071.0O11.019.02320241900
\n", "
" ], "text/plain": [ " estado_cielo precipitacion_value probPrecipitacion_value \\\n", "0 12.0 0.0 NaN \n", "1 15.0 0.0 NaN \n", "2 15.0 0.0 0.0 \n", "3 16.0 0.0 0.0 \n", "4 16.0 0.0 0.0 \n", "5 16.0 0.0 0.0 \n", "6 16.0 0.0 0.0 \n", "7 16.0 0.0 0.0 \n", "8 16.0 0.0 100.0 \n", "9 46.0 0.1 100.0 \n", "10 16.0 0.0 100.0 \n", "11 16.0 0.0 100.0 \n", "12 16.0 0.0 100.0 \n", "13 64.0 0.1 100.0 \n", "14 64.0 0.1 100.0 \n", "15 16.0 0.0 100.0 \n", "16 64.0 0.1 100.0 \n", "17 64.0 0.6 100.0 \n", "18 36.0 6.0 100.0 \n", "19 46.0 0.2 100.0 \n", "\n", " probTormenta_value nieve_value probNieve_value temperatura_value \\\n", "0 NaN 0.0 NaN 8.0 \n", "1 NaN 0.0 NaN 8.0 \n", "2 0.0 0.0 0.0 8.0 \n", "3 0.0 0.0 0.0 8.0 \n", "4 0.0 0.0 0.0 8.0 \n", "5 0.0 0.0 0.0 8.0 \n", "6 0.0 0.0 0.0 8.0 \n", "7 0.0 0.0 0.0 8.0 \n", "8 25.0 0.0 0.0 8.0 \n", "9 25.0 0.0 0.0 8.0 \n", "10 25.0 0.0 0.0 8.0 \n", "11 25.0 0.0 0.0 8.0 \n", "12 25.0 0.0 0.0 8.0 \n", "13 25.0 0.0 0.0 8.0 \n", "14 80.0 0.0 0.0 9.0 \n", "15 80.0 0.0 0.0 9.0 \n", "16 80.0 0.0 0.0 9.0 \n", "17 80.0 0.0 0.0 8.0 \n", "18 80.0 0.0 0.0 8.0 \n", "19 80.0 0.0 0.0 7.0 \n", "\n", " sensTermica_value humedadRelativa_value direccion velocidad \\\n", "0 5.0 77.0 SO 16.0 \n", "1 5.0 80.0 O 17.0 \n", "2 5.0 82.0 O 17.0 \n", "3 5.0 84.0 SO 20.0 \n", "4 4.0 86.0 O 25.0 \n", "5 5.0 84.0 SO 21.0 \n", "6 5.0 86.0 SO 23.0 \n", "7 4.0 82.0 SO 27.0 \n", "8 4.0 81.0 SO 32.0 \n", "9 4.0 80.0 SO 28.0 \n", "10 4.0 80.0 SO 30.0 \n", "11 4.0 78.0 SO 29.0 \n", "12 4.0 73.0 SO 31.0 \n", "13 4.0 75.0 SO 35.0 \n", "14 5.0 75.0 SO 35.0 \n", "15 5.0 79.0 SO 38.0 \n", "16 5.0 79.0 SO 38.0 \n", "17 3.0 86.0 SO 39.0 \n", "18 3.0 91.0 SO 45.0 \n", "19 5.0 71.0 O 11.0 \n", "\n", " vientoAndRachaMax_value day month year hour minute second \n", "0 27.0 2 3 2024 0 0 0 \n", "1 27.0 2 3 2024 1 0 0 \n", "2 33.0 2 3 2024 2 0 0 \n", "3 40.0 2 3 2024 3 0 0 \n", "4 35.0 2 3 2024 4 0 0 \n", "5 37.0 2 3 2024 5 0 0 \n", "6 44.0 2 3 2024 6 0 0 \n", "7 51.0 2 3 2024 7 0 0 \n", "8 47.0 2 3 2024 8 0 0 \n", "9 47.0 2 3 2024 9 0 0 \n", "10 50.0 2 3 2024 10 0 0 \n", "11 51.0 2 3 2024 11 0 0 \n", "12 57.0 2 3 2024 12 0 0 \n", "13 58.0 2 3 2024 13 0 0 \n", "14 62.0 2 3 2024 14 0 0 \n", "15 62.0 2 3 2024 15 0 0 \n", "16 61.0 2 3 2024 16 0 0 \n", "17 72.0 2 3 2024 17 0 0 \n", "18 28.0 2 3 2024 18 0 0 \n", "19 19.0 2 3 2024 19 0 0 " ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "aemet_dataset[0:20]" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 661 entries, 0 to 660\n", "Data columns (total 18 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 estado_cielo 661 non-null float64 \n", " 1 precipitacion_value 661 non-null float64 \n", " 2 probPrecipitacion_value 659 non-null float64 \n", " 3 probTormenta_value 659 non-null float64 \n", " 4 nieve_value 661 non-null float64 \n", " 5 probNieve_value 659 non-null float64 \n", " 6 temperatura_value 661 non-null float64 \n", " 7 sensTermica_value 661 non-null float64 \n", " 8 humedadRelativa_value 661 non-null float64 \n", " 9 direccion 651 non-null category\n", " 10 velocidad 651 non-null float64 \n", " 11 vientoAndRachaMax_value 661 non-null float64 \n", " 12 day 661 non-null int32 \n", " 13 month 661 non-null int32 \n", " 14 year 661 non-null int32 \n", " 15 hour 661 non-null int32 \n", " 16 minute 661 non-null int32 \n", " 17 second 661 non-null int32 \n", "dtypes: category(1), float64(11), int32(6)\n", "memory usage: 73.4 KB\n" ] } ], "source": [ "aemet_dataset.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## PROFILING 📑" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "profile = ProfileReport(\n", " aemet_dataset,\n", " title=\"🌥️ AEMET QA\",\n", " dataset={\n", " \"description\": \"AEMET - Estado de la meteorología\",\n", " \"url\": \"https://opendata.aemet.es/dist/index.html\",\n", " },\n", " variables={\n", " \"descriptions\": {\n", " \"day\": \"Fecha (día) de la petición a la API\",\n", " \"month\": \"Fecha (mes) de la petición a la API\",\n", " \"year\": \"Fecha (año) de la petición a la API\",\n", " \"hour\": \"Fecha (hora) de la petición a la API\",\n", " \"minute\": \"Fecha (minute) de la petición a la API\",\n", " \"second\": \"Fecha (second) de la petición a la API\",\n", " \"estado_cielo\": \"Estado del cielo\",\n", " \"precipitacion_value\": \"Precipitación\",\n", " \"probPrecipitacion_value\": \"Probabilidad de precipitación\",\n", " \"probTormenta_value\": \"Probabilidad de tormenta\",\n", " \"nieve_value\": \"Nieve\",\n", " \"probNieve_value\": \"Probabilidad de nieve\",\n", " \"temperatura_value\": \"Temperatura\",\n", " \"sensTermica_value\": \"Sensación térmica\",\n", " \"humedadRelativa_value\": \"Humedad relativa\",\n", " \"direccion\": \"Dirección del viento\",\n", " \"velocidad\": \"Velocidad del viento\",\n", " \"vientoAndRachaMax_value\": \"Viento racha máxima\",\n", " }\n", " },\n", " interactions=None,\n", " explorative=True,\n", " dark_mode=True,\n", ")\n", "profile.to_file(os.path.join(ROOT_PATH, \"docs\", \"qa\", \"aemet_report.html\"))\n", "# profile.to_notebook_iframe()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3.11.7 ('venv': venv)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" }, "orig_nbformat": 4, "vscode": { "interpreter": { "hash": "7056b316d281e41af8bb677c2c15c6d2127a9166071d02d66c8bc69014a8260f" } } }, "nbformat": 4, "nbformat_minor": 2 }