{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"
\n",
"\n",
"
\n",
"
\n",
"
QA: AEMET dataset 🌥️
\n",
"
INESDATA-MOV
\n",
"
\n",
"\n",
"# Análisis de calidad\n",
"Este cuaderno analiza la calidad del dataset proveniente de la fuente de datos de la Agencia Estatal de Meteorología ([AEMET](https://www.aemet.es/es/portada)). La calidad del mismo se validará teniendo en cuenta los siguientes aspectos:\n",
"\n",
"* Análisis del dataset\n",
"* Análisis de las variables\n",
"* Conversiones de tipos de datos\n",
"\n",
"La **calidad del dato** se refiere a la medida en que los datos son adecuados para su uso, por lo que es esencial para garantizar la confiabilidad y utilidad de los datos en diversas aplicaciones y contextos. Así, en este notebook se evaluarán también las cinco dimensiones de la calidad del dato:\n",
"1. **Unicidad**: Ausencia de duplicados o registros repetidos en un conjunto de datos. Los datos son únicos cuando cada registro o entidad en el conjunto de datos es único y no hay duplicados presentes.\n",
"2. **Exactitud**: Los datos exactos son libres de errores y representan con precisión la realidad que están destinados a describir. Esto implica que los datos deben ser correctos y confiables para su uso en análisis y toma de decisiones.\n",
"3. **Completitud**: Los datos completos contienen toda la información necesaria para el análisis y no tienen valores faltantes o nulos que puedan afectar la interpretación o validez de los resultados.\n",
"4. **Consistencia**: Los datos consistentes mantienen el mismo formato, estructura y significado en todas las instancias, lo que facilita su comparación y análisis sin ambigüedad.\n",
"5. **Validez**: Medida en que los datos son precisos y representan con exactitud la realidad que están destinados a describir. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"
Nota
\n",
"
\n",
"Este dataset ha sido creado ejecutando el comando create
del paquete de Python inesdata_mov_datasets
.
\n",
"Para poder ejecutar este comando es necesario haber ejecutado antes el comando extract
, que realiza la extracción de datos de la API de la AEMET y los almacena en Minio. El comando create
se encargaría de descargar dichos datos y unirlos todos en un único dataset.\n",
"
\n",
"
"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"import warnings\n",
"warnings.filterwarnings(\"ignore\")\n",
"\n",
"from ydata_profiling import ProfileReport"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"ROOT_PATH = os.path.dirname(os.path.dirname(os.path.abspath(os.getcwd())))\n",
"DATA_PATH = os.path.join(ROOT_PATH, \"data\", \"processed\")\n",
"AEMET_DATA_PATH = os.path.join(DATA_PATH, \"aemet\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Cada fila de este dataset representa la meteorología de Madrid para una determinada fecha y hora concretos.**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"
-
\n",
"
\n",
"Vamos a analizar la calidad del dataset generado para los días comprendidos entre 02/03/2024 - 31/03/2024 que son los días de los que disponemos información actualmente.\n",
"
\n",
"
"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"#####################################################################\n",
"# Creacción del dataset completo con toda la información disponible #\n",
"#####################################################################\n",
"\n",
"# Directorio con los CSV\n",
"directory = AEMET_DATA_PATH\n",
"\n",
"aemet_data = []\n",
"\n",
"for root, directories, files in os.walk(directory):\n",
" for filename in files:\n",
" if filename.endswith('.csv'):\n",
" filepath = os.path.join(root, filename)\n",
" aemet_data.append(pd.read_csv(filepath))\n",
"\n",
"aemet_dataset = pd.concat(aemet_data, ignore_index=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## QA checks ✅"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Análisis del dataset"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" El dataset anterior recoge datos de casi todo marzo del 2024. En este apartado se analiza si el dataset presenta alguna anomalía de manera general. "
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" estadoCielo_value | \n",
" estadoCielo_descripcion | \n",
" precipitacion_value | \n",
" probPrecipitacion_value | \n",
" probTormenta_value | \n",
" nieve_value | \n",
" probNieve_value | \n",
" temperatura_value | \n",
" sensTermica_value | \n",
" humedadRelativa_value | \n",
" direccion | \n",
" velocidad | \n",
" vientoAndRachaMax_value | \n",
" datetime | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 12n | \n",
" Poco nuboso | \n",
" 0 | \n",
" NaN | \n",
" NaN | \n",
" 0 | \n",
" NaN | \n",
" 8.0 | \n",
" 5.0 | \n",
" 77.0 | \n",
" ['SO'] | \n",
" ['16'] | \n",
" NaN | \n",
" 2024-03-02 00:00:00 | \n",
"
\n",
" \n",
" 1 | \n",
" 12n | \n",
" Poco nuboso | \n",
" 0 | \n",
" NaN | \n",
" NaN | \n",
" 0 | \n",
" NaN | \n",
" 8.0 | \n",
" 5.0 | \n",
" 77.0 | \n",
" NaN | \n",
" NaN | \n",
" 27.0 | \n",
" 2024-03-02 00:00:00 | \n",
"
\n",
" \n",
" 2 | \n",
" 15n | \n",
" Muy nuboso | \n",
" 0 | \n",
" NaN | \n",
" NaN | \n",
" 0 | \n",
" NaN | \n",
" 8.0 | \n",
" 5.0 | \n",
" 80.0 | \n",
" ['O'] | \n",
" ['17'] | \n",
" NaN | \n",
" 2024-03-02 01:00:00 | \n",
"
\n",
" \n",
" 3 | \n",
" 15n | \n",
" Muy nuboso | \n",
" 0 | \n",
" NaN | \n",
" NaN | \n",
" 0 | \n",
" NaN | \n",
" 8.0 | \n",
" 5.0 | \n",
" 80.0 | \n",
" NaN | \n",
" NaN | \n",
" 27.0 | \n",
" 2024-03-02 01:00:00 | \n",
"
\n",
" \n",
" 4 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" 0.0 | \n",
" 0.0 | \n",
" NaN | \n",
" 0.0 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" 2024-03-02 01:07:00 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" estadoCielo_value estadoCielo_descripcion precipitacion_value \\\n",
"0 12n Poco nuboso 0 \n",
"1 12n Poco nuboso 0 \n",
"2 15n Muy nuboso 0 \n",
"3 15n Muy nuboso 0 \n",
"4 NaN NaN NaN \n",
"\n",
" probPrecipitacion_value probTormenta_value nieve_value probNieve_value \\\n",
"0 NaN NaN 0 NaN \n",
"1 NaN NaN 0 NaN \n",
"2 NaN NaN 0 NaN \n",
"3 NaN NaN 0 NaN \n",
"4 0.0 0.0 NaN 0.0 \n",
"\n",
" temperatura_value sensTermica_value humedadRelativa_value direccion \\\n",
"0 8.0 5.0 77.0 ['SO'] \n",
"1 8.0 5.0 77.0 NaN \n",
"2 8.0 5.0 80.0 ['O'] \n",
"3 8.0 5.0 80.0 NaN \n",
"4 NaN NaN NaN NaN \n",
"\n",
" velocidad vientoAndRachaMax_value datetime \n",
"0 ['16'] NaN 2024-03-02 00:00:00 \n",
"1 NaN 27.0 2024-03-02 00:00:00 \n",
"2 ['17'] NaN 2024-03-02 01:00:00 \n",
"3 NaN 27.0 2024-03-02 01:00:00 \n",
"4 NaN NaN 2024-03-02 01:07:00 "
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Visualización del dataset\n",
"aemet_dataset.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" Se observa que para la misma fecha existen varios datos. Esto se debe a que en los datos de la AEMET la variable *vientoAndRachaMax* aparece separada del resto. Lo que haremos entonces será hacer una agrupación por fecha para que los datos que nos falten de *vientoAndRachaMax* se añadan al resto. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" Veamos un ejemplo de esto para una fecha dada: 2024-03-02 00:00:00 "
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" estadoCielo_value | \n",
" estadoCielo_descripcion | \n",
" precipitacion_value | \n",
" probPrecipitacion_value | \n",
" probTormenta_value | \n",
" nieve_value | \n",
" probNieve_value | \n",
" temperatura_value | \n",
" sensTermica_value | \n",
" humedadRelativa_value | \n",
" direccion | \n",
" velocidad | \n",
" vientoAndRachaMax_value | \n",
" datetime | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 12n | \n",
" Poco nuboso | \n",
" 0 | \n",
" NaN | \n",
" NaN | \n",
" 0 | \n",
" NaN | \n",
" 8.0 | \n",
" 5.0 | \n",
" 77.0 | \n",
" ['SO'] | \n",
" ['16'] | \n",
" NaN | \n",
" 2024-03-02 00:00:00 | \n",
"
\n",
" \n",
" 1 | \n",
" 12n | \n",
" Poco nuboso | \n",
" 0 | \n",
" NaN | \n",
" NaN | \n",
" 0 | \n",
" NaN | \n",
" 8.0 | \n",
" 5.0 | \n",
" 77.0 | \n",
" NaN | \n",
" NaN | \n",
" 27.0 | \n",
" 2024-03-02 00:00:00 | \n",
"
\n",
" \n",
" 1626 | \n",
" 12n | \n",
" Poco nuboso | \n",
" 0.0 | \n",
" NaN | \n",
" NaN | \n",
" 0 | \n",
" NaN | \n",
" 8.0 | \n",
" 5.0 | \n",
" 77.0 | \n",
" ['SO'] | \n",
" ['16'] | \n",
" NaN | \n",
" 2024-03-02 00:00:00 | \n",
"
\n",
" \n",
" 1627 | \n",
" 12n | \n",
" Poco nuboso | \n",
" 0.0 | \n",
" NaN | \n",
" NaN | \n",
" 0 | \n",
" NaN | \n",
" 8.0 | \n",
" 5.0 | \n",
" 77.0 | \n",
" NaN | \n",
" NaN | \n",
" 27.0 | \n",
" 2024-03-02 00:00:00 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" estadoCielo_value estadoCielo_descripcion precipitacion_value \\\n",
"0 12n Poco nuboso 0 \n",
"1 12n Poco nuboso 0 \n",
"1626 12n Poco nuboso 0.0 \n",
"1627 12n Poco nuboso 0.0 \n",
"\n",
" probPrecipitacion_value probTormenta_value nieve_value \\\n",
"0 NaN NaN 0 \n",
"1 NaN NaN 0 \n",
"1626 NaN NaN 0 \n",
"1627 NaN NaN 0 \n",
"\n",
" probNieve_value temperatura_value sensTermica_value \\\n",
"0 NaN 8.0 5.0 \n",
"1 NaN 8.0 5.0 \n",
"1626 NaN 8.0 5.0 \n",
"1627 NaN 8.0 5.0 \n",
"\n",
" humedadRelativa_value direccion velocidad vientoAndRachaMax_value \\\n",
"0 77.0 ['SO'] ['16'] NaN \n",
"1 77.0 NaN NaN 27.0 \n",
"1626 77.0 ['SO'] ['16'] NaN \n",
"1627 77.0 NaN NaN 27.0 \n",
"\n",
" datetime \n",
"0 2024-03-02 00:00:00 \n",
"1 2024-03-02 00:00:00 \n",
"1626 2024-03-02 00:00:00 \n",
"1627 2024-03-02 00:00:00 "
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"aemet_dataset[aemet_dataset[\"datetime\"]==\"2024-03-02 00:00:00\"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" Para esta fecha existen hasta 4 registros diferentes. De estos 4 registros, 2 de ellos son duplicados de los otros dos, así que además de disponer información incompleta se tiene información duplicada. Vemos a continuación si ocurre lo mismo para el resto de fechas: "
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"datetime\n",
"2024-03-02 00:00:00 4\n",
"2024-03-21 13:00:00 4\n",
"2024-03-21 05:00:00 4\n",
"2024-03-21 06:00:00 4\n",
"2024-03-21 07:00:00 4\n",
" ..\n",
"2024-03-07 13:19:00 2\n",
"2024-03-19 07:13:00 2\n",
"2024-03-27 19:01:00 2\n",
"2024-03-05 07:13:00 2\n",
"2024-03-18 13:19:00 2\n",
"Name: datetime, Length: 773, dtype: int64"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"aemet_dataset.groupby([\"datetime\"])[\"datetime\"].count().sort_values(ascending=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" Para eliminar la información duplicada cada variable debe tener el mismo tipo de datos para todas sus filas. Nos fijaremos en los tipos actuales para establecer el tipo de cada variable. "
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"RangeIndex: 2848 entries, 0 to 2847\n",
"Data columns (total 14 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 estadoCielo_value 2624 non-null object \n",
" 1 estadoCielo_descripcion 2624 non-null object \n",
" 2 precipitacion_value 2624 non-null object \n",
" 3 probPrecipitacion_value 224 non-null float64\n",
" 4 probTormenta_value 224 non-null float64\n",
" 5 nieve_value 2624 non-null object \n",
" 6 probNieve_value 224 non-null float64\n",
" 7 temperatura_value 2604 non-null float64\n",
" 8 sensTermica_value 2604 non-null float64\n",
" 9 humedadRelativa_value 2604 non-null float64\n",
" 10 direccion 1302 non-null object \n",
" 11 velocidad 1302 non-null object \n",
" 12 vientoAndRachaMax_value 1302 non-null float64\n",
" 13 datetime 2848 non-null object \n",
"dtypes: float64(7), object(7)\n",
"memory usage: 311.6+ KB\n"
]
}
],
"source": [
"aemet_dataset.info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" Variables como *precipitacion_value* y *nieve_value* están declaradas como strings cuando deberían ser variables float. Si vemos los diferentes valores que poseen, se detecta que existe el valor \"Ip\" que es el que hace que estas variables se determinen como strings. Dado que \"Ip\" significa precipitación inapreciable o cantidad de nieve inapreciable se sustituirán estos valores por 0.0 "
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['0', nan, '0.1', '0.6', '6', '0.2', '1', '0.5', '4', '0.9', '0.8',\n",
" '0.7', 'Ip', '0.3', '3', '2', '0.4', 0.0, 0.1, 0.6, 6.0, 0.2, 1.0,\n",
" 0.5, 4.0, 0.9, 3.0, 2.0, 0.4, 0.3, 0.7, 0.8], dtype=object)"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"aemet_dataset[\"precipitacion_value\"].unique()"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['0', nan, 'Ip', 0.0], dtype=object)"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"aemet_dataset[\"nieve_value\"].unique()"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"aemet_dataset = aemet_dataset.replace({'precipitacion_value': {\"Ip\":0.0}, 'nieve_value': {\"Ip\":0.0}})"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['0', nan, '0.1', '0.6', '6', '0.2', '1', '0.5', '4', '0.9', '0.8',\n",
" '0.7', 0.0, '0.3', '3', '2', '0.4', 0.1, 0.6, 6.0, 0.2, 1.0, 0.5,\n",
" 4.0, 0.9, 3.0, 2.0, 0.4, 0.3, 0.7, 0.8], dtype=object)"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"aemet_dataset[\"precipitacion_value\"].unique()"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['0', nan, '0.1', '0.6', '6', '0.2', '1', '0.5', '4', '0.9', '0.8',\n",
" '0.7', 0.0, '0.3', '3', '2', '0.4', 0.1, 0.6, 6.0, 0.2, 1.0, 0.5,\n",
" 4.0, 0.9, 3.0, 2.0, 0.4, 0.3, 0.7, 0.8], dtype=object)"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"aemet_dataset[\"precipitacion_value\"].unique()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" De esta manera ya se puede seleccionar un tipo para cada variable: "
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"aemet_dataset = aemet_dataset.astype({'estadoCielo_value': 'str', 'estadoCielo_descripcion':'str',\\\n",
" 'precipitacion_value': 'float64', 'probPrecipitacion_value': 'float64',\\\n",
" 'probTormenta_value': 'float64', 'nieve_value': 'float64',\\\n",
" 'probNieve_value': 'float64', 'temperatura_value': 'float64',\\\n",
" 'sensTermica_value': 'float64', 'humedadRelativa_value': 'float64',\\\n",
" 'vientoAndRachaMax_value': 'float64' \n",
" })"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"RangeIndex: 2848 entries, 0 to 2847\n",
"Data columns (total 14 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 estadoCielo_value 2848 non-null object \n",
" 1 estadoCielo_descripcion 2848 non-null object \n",
" 2 precipitacion_value 2624 non-null float64\n",
" 3 probPrecipitacion_value 224 non-null float64\n",
" 4 probTormenta_value 224 non-null float64\n",
" 5 nieve_value 2624 non-null float64\n",
" 6 probNieve_value 224 non-null float64\n",
" 7 temperatura_value 2604 non-null float64\n",
" 8 sensTermica_value 2604 non-null float64\n",
" 9 humedadRelativa_value 2604 non-null float64\n",
" 10 direccion 1302 non-null object \n",
" 11 velocidad 1302 non-null object \n",
" 12 vientoAndRachaMax_value 1302 non-null float64\n",
" 13 datetime 2848 non-null object \n",
"dtypes: float64(9), object(5)\n",
"memory usage: 311.6+ KB\n"
]
}
],
"source": [
"aemet_dataset.info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" En este punto se pueden eliminar los duplicados: "
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"aemet_dataset = aemet_dataset.drop_duplicates()"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"datetime\n",
"2024-03-02 00:00:00 2\n",
"2024-03-21 13:00:00 2\n",
"2024-03-21 05:00:00 2\n",
"2024-03-21 06:00:00 2\n",
"2024-03-21 07:00:00 2\n",
" ..\n",
"2024-03-07 13:19:00 1\n",
"2024-03-19 07:13:00 1\n",
"2024-03-27 19:01:00 1\n",
"2024-03-05 07:13:00 1\n",
"2024-03-18 13:19:00 1\n",
"Name: datetime, Length: 773, dtype: int64"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"aemet_dataset.groupby([\"datetime\"])[\"datetime\"].count().sort_values(ascending=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" También se puede ya resumir la información de dos filas en una única fila:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" estadoCielo_value | \n",
" estadoCielo_descripcion | \n",
" precipitacion_value | \n",
" probPrecipitacion_value | \n",
" probTormenta_value | \n",
" nieve_value | \n",
" probNieve_value | \n",
" temperatura_value | \n",
" sensTermica_value | \n",
" humedadRelativa_value | \n",
" direccion | \n",
" velocidad | \n",
" vientoAndRachaMax_value | \n",
" datetime | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 12n | \n",
" Poco nuboso | \n",
" 0.0 | \n",
" NaN | \n",
" NaN | \n",
" 0.0 | \n",
" NaN | \n",
" 8.0 | \n",
" 5.0 | \n",
" 77.0 | \n",
" ['SO'] | \n",
" ['16'] | \n",
" NaN | \n",
" 2024-03-02 00:00:00 | \n",
"
\n",
" \n",
" 1 | \n",
" 12n | \n",
" Poco nuboso | \n",
" 0.0 | \n",
" NaN | \n",
" NaN | \n",
" 0.0 | \n",
" NaN | \n",
" 8.0 | \n",
" 5.0 | \n",
" 77.0 | \n",
" NaN | \n",
" NaN | \n",
" 27.0 | \n",
" 2024-03-02 00:00:00 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" estadoCielo_value estadoCielo_descripcion precipitacion_value \\\n",
"0 12n Poco nuboso 0.0 \n",
"1 12n Poco nuboso 0.0 \n",
"\n",
" probPrecipitacion_value probTormenta_value nieve_value probNieve_value \\\n",
"0 NaN NaN 0.0 NaN \n",
"1 NaN NaN 0.0 NaN \n",
"\n",
" temperatura_value sensTermica_value humedadRelativa_value direccion \\\n",
"0 8.0 5.0 77.0 ['SO'] \n",
"1 8.0 5.0 77.0 NaN \n",
"\n",
" velocidad vientoAndRachaMax_value datetime \n",
"0 ['16'] NaN 2024-03-02 00:00:00 \n",
"1 NaN 27.0 2024-03-02 00:00:00 "
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"aemet_dataset[aemet_dataset[\"datetime\"]==\"2024-03-02 00:00:00\"]"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"aemet_reduced = aemet_dataset.groupby([\"datetime\"])[['direccion','velocidad','vientoAndRachaMax_value']].sum()\n",
"aemet_dataset = aemet_dataset.drop(columns=[\"direccion\", \"velocidad\", \"vientoAndRachaMax_value\"])\n",
"\n",
"aemet_dataset = aemet_dataset.merge(aemet_reduced, how=\"inner\", on= \"datetime\").drop_duplicates()"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"datetime\n",
"2024-03-02 00:00:00 1\n",
"2024-03-21 07:00:00 1\n",
"2024-03-22 06:00:00 1\n",
"2024-03-22 07:00:00 1\n",
"2024-03-22 07:13:00 1\n",
" ..\n",
"2024-03-13 07:00:00 1\n",
"2024-03-13 07:13:00 1\n",
"2024-03-13 08:00:00 1\n",
"2024-03-13 09:00:00 1\n",
"2024-03-31 23:00:00 1\n",
"Name: datetime, Length: 773, dtype: int64"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"aemet_dataset.groupby([\"datetime\"])[\"datetime\"].count().sort_values(ascending=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" Terminamos comprobando que el dataset del que disponemos en este punto tiene sentido en términos de cantidad de registros: "
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" | \n",
" estadoCielo_value | \n",
"
\n",
" \n",
" month | \n",
" day | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 3 | \n",
" 2 | \n",
" 24 | \n",
"
\n",
" \n",
" 3 | \n",
" 24 | \n",
"
\n",
" \n",
" 4 | \n",
" 24 | \n",
"
\n",
" \n",
" 5 | \n",
" 24 | \n",
"
\n",
" \n",
" 6 | \n",
" 24 | \n",
"
\n",
" \n",
" 7 | \n",
" 24 | \n",
"
\n",
" \n",
" 8 | \n",
" 24 | \n",
"
\n",
" \n",
" 11 | \n",
" 24 | \n",
"
\n",
" \n",
" 12 | \n",
" 24 | \n",
"
\n",
" \n",
" 13 | \n",
" 24 | \n",
"
\n",
" \n",
" 14 | \n",
" 24 | \n",
"
\n",
" \n",
" 15 | \n",
" 24 | \n",
"
\n",
" \n",
" 16 | \n",
" 24 | \n",
"
\n",
" \n",
" 17 | \n",
" 24 | \n",
"
\n",
" \n",
" 18 | \n",
" 24 | \n",
"
\n",
" \n",
" 19 | \n",
" 24 | \n",
"
\n",
" \n",
" 20 | \n",
" 24 | \n",
"
\n",
" \n",
" 21 | \n",
" 24 | \n",
"
\n",
" \n",
" 22 | \n",
" 23 | \n",
"
\n",
" \n",
" 23 | \n",
" 23 | \n",
"
\n",
" \n",
" 24 | \n",
" 23 | \n",
"
\n",
" \n",
" 25 | \n",
" 23 | \n",
"
\n",
" \n",
" 26 | \n",
" 23 | \n",
"
\n",
" \n",
" 27 | \n",
" 23 | \n",
"
\n",
" \n",
" 28 | \n",
" 23 | \n",
"
\n",
" \n",
" 29 | \n",
" 23 | \n",
"
\n",
" \n",
" 30 | \n",
" 23 | \n",
"
\n",
" \n",
" 31 | \n",
" 22 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" estadoCielo_value\n",
"month day \n",
"3 2 24\n",
" 3 24\n",
" 4 24\n",
" 5 24\n",
" 6 24\n",
" 7 24\n",
" 8 24\n",
" 11 24\n",
" 12 24\n",
" 13 24\n",
" 14 24\n",
" 15 24\n",
" 16 24\n",
" 17 24\n",
" 18 24\n",
" 19 24\n",
" 20 24\n",
" 21 24\n",
" 22 23\n",
" 23 23\n",
" 24 23\n",
" 25 23\n",
" 26 23\n",
" 27 23\n",
" 28 23\n",
" 29 23\n",
" 30 23\n",
" 31 22"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"aemet_dataset_test = aemet_dataset.copy()\n",
"\n",
"aemet_dataset_test[\"datetime\"] = pd.to_datetime(aemet_dataset_test[\"datetime\"])\n",
"\n",
"aemet_dataset_test[\"day\"] = aemet_dataset_test[\"datetime\"].dt.day\n",
"aemet_dataset_test[\"month\"] = aemet_dataset_test[\"datetime\"].dt.month\n",
"aemet_dataset_test[\"year\"] = aemet_dataset_test[\"datetime\"].dt.year\n",
"aemet_dataset_test[\"hour\"] = aemet_dataset_test[\"datetime\"].dt.hour\n",
"aemet_dataset_test[\"minute\"] = aemet_dataset_test[\"datetime\"].dt.minute\n",
"\n",
"aemet_dataset_test[aemet_dataset_test[\"minute\"]==0].groupby([\"month\", \"day\"])[[\"estadoCielo_value\"]].count()"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"aemet_dataset.reset_index(drop=True, inplace=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" De los resultados obtenidos se observa que el dataset ya presenta el número de registros que se supone (1 por hora o aproximadamente 1 por hora). Si agrupásemos por horas y minutos existen casos en los que no se dispone sólo información horaria sino que existen varios valores para la misma hora. Esto lo veremos más adelante ya que corresponde más bien con el análisis de variables. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Análisis de las variables y conversiones de tipos\n",
"\n",
" Realizamos conjúntamente el análisis de las variables y las conversiones a su tipo determinado de manera conjunta ya que ambas cosas van ligadas. En pasos anteriores se ha realizado ya la conversión de tipos pero de manera sencilla, en este paso estudiamos más cautelosamente cada caso."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"RangeIndex: 773 entries, 0 to 772\n",
"Data columns (total 14 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 estadoCielo_value 773 non-null object \n",
" 1 estadoCielo_descripcion 773 non-null object \n",
" 2 precipitacion_value 661 non-null float64\n",
" 3 probPrecipitacion_value 112 non-null float64\n",
" 4 probTormenta_value 112 non-null float64\n",
" 5 nieve_value 661 non-null float64\n",
" 6 probNieve_value 112 non-null float64\n",
" 7 temperatura_value 651 non-null float64\n",
" 8 sensTermica_value 651 non-null float64\n",
" 9 humedadRelativa_value 651 non-null float64\n",
" 10 datetime 773 non-null object \n",
" 11 direccion 773 non-null object \n",
" 12 velocidad 773 non-null object \n",
" 13 vientoAndRachaMax_value 773 non-null float64\n",
"dtypes: float64(9), object(5)\n",
"memory usage: 84.7+ KB\n"
]
}
],
"source": [
"aemet_dataset.info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Estado cielo\n",
"\n",
" El estado del cielo lo describen dos variables: *estadoCielo_value* y *estadoCielo_descripcion*. Conociendo la correspondencia entre ambas variables se podría prescindir de una de ellas."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" | \n",
"
\n",
" \n",
" estadoCielo_descripcion | \n",
" estadoCielo_value | \n",
"
\n",
" \n",
" \n",
" \n",
" Bruma | \n",
" 82 | \n",
"
\n",
" \n",
" 82n | \n",
"
\n",
" \n",
" Cubierto | \n",
" 16 | \n",
"
\n",
" \n",
" 16n | \n",
"
\n",
" \n",
" Cubierto con lluvia | \n",
" 26 | \n",
"
\n",
" \n",
" 26n | \n",
"
\n",
" \n",
" Cubierto con lluvia escasa | \n",
" 46 | \n",
"
\n",
" \n",
" 46n | \n",
"
\n",
" \n",
" Cubierto con nieve | \n",
" 36 | \n",
"
\n",
" \n",
" Cubierto con tormenta | \n",
" 54 | \n",
"
\n",
" \n",
" 54n | \n",
"
\n",
" \n",
" Cubierto con tormenta y lluvia escasa | \n",
" 64 | \n",
"
\n",
" \n",
" 64n | \n",
"
\n",
" \n",
" Despejado | \n",
" 11 | \n",
"
\n",
" \n",
" 11n | \n",
"
\n",
" \n",
" Intervalos nubosos | \n",
" 13 | \n",
"
\n",
" \n",
" Intervalos nubosos con lluvia escasa | \n",
" 43 | \n",
"
\n",
" \n",
" 43n | \n",
"
\n",
" \n",
" Muy nuboso | \n",
" 15 | \n",
"
\n",
" \n",
" 15n | \n",
"
\n",
" \n",
" Muy nuboso con lluvia escasa | \n",
" 45 | \n",
"
\n",
" \n",
" 45n | \n",
"
\n",
" \n",
" Muy nuboso con tormenta y lluvia escasa | \n",
" 63n | \n",
"
\n",
" \n",
" Niebla | \n",
" 81 | \n",
"
\n",
" \n",
" Nubes altas | \n",
" 17 | \n",
"
\n",
" \n",
" 17n | \n",
"
\n",
" \n",
" Nuboso | \n",
" 14 | \n",
"
\n",
" \n",
" 14n | \n",
"
\n",
" \n",
" Nuboso con lluvia escasa | \n",
" 44 | \n",
"
\n",
" \n",
" Nuboso con tormenta y lluvia escasa | \n",
" 62 | \n",
"
\n",
" \n",
" Poco nuboso | \n",
" 12 | \n",
"
\n",
" \n",
" 12n | \n",
"
\n",
" \n",
" nan | \n",
" nan | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
"Empty DataFrame\n",
"Columns: []\n",
"Index: [(Bruma, 82), (Bruma, 82n), (Cubierto, 16), (Cubierto, 16n), (Cubierto con lluvia, 26), (Cubierto con lluvia, 26n), (Cubierto con lluvia escasa, 46), (Cubierto con lluvia escasa, 46n), (Cubierto con nieve, 36), (Cubierto con tormenta, 54), (Cubierto con tormenta, 54n), (Cubierto con tormenta y lluvia escasa, 64), (Cubierto con tormenta y lluvia escasa, 64n), (Despejado, 11), (Despejado, 11n), (Intervalos nubosos, 13), (Intervalos nubosos con lluvia escasa, 43), (Intervalos nubosos con lluvia escasa, 43n), (Muy nuboso, 15), (Muy nuboso, 15n), (Muy nuboso con lluvia escasa, 45), (Muy nuboso con lluvia escasa, 45n), (Muy nuboso con tormenta y lluvia escasa, 63n), (Niebla , 81), (Nubes altas, 17), (Nubes altas, 17n), (Nuboso, 14), (Nuboso, 14n), (Nuboso con lluvia escasa, 44), (Nuboso con tormenta y lluvia escasa, 62), (Poco nuboso, 12), (Poco nuboso, 12n), (nan, nan)]"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"aemet_dataset[[\"estadoCielo_value\", \"estadoCielo_descripcion\"]].groupby([\"estadoCielo_descripcion\", \"estadoCielo_value\"]).count()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" En la tabla anterior se observa que, a pesar de que por cada valor de descripción existan uno o varios valores, estos tienen el mismo significado. Por ello eliminaremos la columna *estadoCielo_descripcion* y haremos que los valores de *estadoCielo_value* sean valores numéricos para un mejor uso de esta variable en el modelo de predicción. Finalmente renombraremos esta variable a *estado_cielo*"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
"# Cambiamos los valores 12n, 17n, etc. por 12, 17, etc. ya que representan el \n",
"# mismo valor de estadoCielo\n",
"aemet_dataset[\"estadoCielo_value\"] = aemet_dataset[\"estadoCielo_value\"].replace(to_replace=r'(\\d+)n', value=r'\\1', regex=True).replace('nan', np.nan)"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" | \n",
"
\n",
" \n",
" estadoCielo_descripcion | \n",
" estadoCielo_value | \n",
"
\n",
" \n",
" \n",
" \n",
" Bruma | \n",
" 82 | \n",
"
\n",
" \n",
" Cubierto | \n",
" 16 | \n",
"
\n",
" \n",
" Cubierto con lluvia | \n",
" 26 | \n",
"
\n",
" \n",
" Cubierto con lluvia escasa | \n",
" 46 | \n",
"
\n",
" \n",
" Cubierto con nieve | \n",
" 36 | \n",
"
\n",
" \n",
" Cubierto con tormenta | \n",
" 54 | \n",
"
\n",
" \n",
" Cubierto con tormenta y lluvia escasa | \n",
" 64 | \n",
"
\n",
" \n",
" Despejado | \n",
" 11 | \n",
"
\n",
" \n",
" Intervalos nubosos | \n",
" 13 | \n",
"
\n",
" \n",
" Intervalos nubosos con lluvia escasa | \n",
" 43 | \n",
"
\n",
" \n",
" Muy nuboso | \n",
" 15 | \n",
"
\n",
" \n",
" Muy nuboso con lluvia escasa | \n",
" 45 | \n",
"
\n",
" \n",
" Muy nuboso con tormenta y lluvia escasa | \n",
" 63 | \n",
"
\n",
" \n",
" Niebla | \n",
" 81 | \n",
"
\n",
" \n",
" Nubes altas | \n",
" 17 | \n",
"
\n",
" \n",
" Nuboso | \n",
" 14 | \n",
"
\n",
" \n",
" Nuboso con lluvia escasa | \n",
" 44 | \n",
"
\n",
" \n",
" Nuboso con tormenta y lluvia escasa | \n",
" 62 | \n",
"
\n",
" \n",
" Poco nuboso | \n",
" 12 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
"Empty DataFrame\n",
"Columns: []\n",
"Index: [(Bruma, 82), (Cubierto, 16), (Cubierto con lluvia, 26), (Cubierto con lluvia escasa, 46), (Cubierto con nieve, 36), (Cubierto con tormenta, 54), (Cubierto con tormenta y lluvia escasa, 64), (Despejado, 11), (Intervalos nubosos, 13), (Intervalos nubosos con lluvia escasa, 43), (Muy nuboso, 15), (Muy nuboso con lluvia escasa, 45), (Muy nuboso con tormenta y lluvia escasa, 63), (Niebla , 81), (Nubes altas, 17), (Nuboso, 14), (Nuboso con lluvia escasa, 44), (Nuboso con tormenta y lluvia escasa, 62), (Poco nuboso, 12)]"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"aemet_dataset[[\"estadoCielo_value\", \"estadoCielo_descripcion\"]].groupby([\"estadoCielo_descripcion\", \"estadoCielo_value\"]).count()"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [],
"source": [
"# Eliminamos la columna estadoCielo_descripcion\n",
"aemet_dataset.drop(columns='estadoCielo_descripcion', inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"# Cambiamos el tipo de la variable estadoCielo_value a numérico y \n",
"# cambiamos el nombre de esta columna por estado_cielo\n",
"aemet_dataset['estadoCielo_value'] = pd.to_numeric(aemet_dataset['estadoCielo_value'], errors='coerce')\n",
"aemet_dataset.rename(columns={'estadoCielo_value': 'estado_cielo'}, inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"estado_cielo\n",
"17.0 169\n",
"11.0 163\n",
"NaN 112\n",
"16.0 105\n",
"12.0 61\n",
"46.0 47\n",
"64.0 47\n",
"15.0 27\n",
"43.0 15\n",
"14.0 10\n",
"36.0 3\n",
"54.0 3\n",
"45.0 2\n",
"26.0 2\n",
"82.0 2\n",
"63.0 1\n",
"44.0 1\n",
"81.0 1\n",
"62.0 1\n",
"13.0 1\n",
"Name: count, dtype: int64"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"aemet_dataset[\"estado_cielo\"].value_counts(dropna=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Precipitación\n",
"\n",
" La precipitación la describen tres variables: *precipitacion_value*, *probPrecipitacion_value* y *probTormenta_value*. Veamos si todas ellas son necesarias para el análisis."
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"RangeIndex: 773 entries, 0 to 772\n",
"Data columns (total 13 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 estado_cielo 661 non-null float64\n",
" 1 precipitacion_value 661 non-null float64\n",
" 2 probPrecipitacion_value 112 non-null float64\n",
" 3 probTormenta_value 112 non-null float64\n",
" 4 nieve_value 661 non-null float64\n",
" 5 probNieve_value 112 non-null float64\n",
" 6 temperatura_value 651 non-null float64\n",
" 7 sensTermica_value 651 non-null float64\n",
" 8 humedadRelativa_value 651 non-null float64\n",
" 9 datetime 773 non-null object \n",
" 10 direccion 773 non-null object \n",
" 11 velocidad 773 non-null object \n",
" 12 vientoAndRachaMax_value 773 non-null float64\n",
"dtypes: float64(10), object(3)\n",
"memory usage: 78.6+ KB\n"
]
}
],
"source": [
"aemet_dataset.info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" Se observa que tanto *probPrecipitacion_value* como *probTormenta_value* sólo tienen valores diferentes a los nulos para el 15% de la muestra total. Si se observa el dataset detenidamente, nos damos cuenta de que estas probabilidades no están establecidas a horas determinadas (00:00:00, 01:00:00, etc.) sino a horas del tipo 00:07:00. Esto se debe a que las probabilidades se corresponden con periodos de tiempo. En verdad 00:07:00 indica que esa probabilidad se prevee entre las 00:00:00 y las 07:00:00. Vamos a observar si se pueden juntar estos datos para horas determinadas. "
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" datetime | \n",
" probPrecipitacion_value | \n",
"
\n",
" \n",
" \n",
" \n",
" 2 | \n",
" 2024-03-02 01:07:00 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 9 | \n",
" 2024-03-02 07:13:00 | \n",
" 100.0 | \n",
"
\n",
" \n",
" 16 | \n",
" 2024-03-02 13:19:00 | \n",
" 100.0 | \n",
"
\n",
" \n",
" 23 | \n",
" 2024-03-02 19:01:00 | \n",
" 70.0 | \n",
"
\n",
" \n",
" 30 | \n",
" 2024-03-03 01:07:00 | \n",
" 0.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" datetime probPrecipitacion_value\n",
"2 2024-03-02 01:07:00 0.0\n",
"9 2024-03-02 07:13:00 100.0\n",
"16 2024-03-02 13:19:00 100.0\n",
"23 2024-03-02 19:01:00 70.0\n",
"30 2024-03-03 01:07:00 0.0"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"aemet_dataset[~aemet_dataset[\"probPrecipitacion_value\"].isnull()][[\"datetime\", \"probPrecipitacion_value\"]].head()"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [],
"source": [
"aemet_dataset_prob = aemet_dataset.copy()\n",
"\n",
"aemet_dataset_prob[\"datetime\"] = pd.to_datetime(aemet_dataset_prob[\"datetime\"])\n",
"\n",
"aemet_dataset_prob[\"day\"] = aemet_dataset_prob[\"datetime\"].dt.day\n",
"aemet_dataset_prob[\"month\"] = aemet_dataset_prob[\"datetime\"].dt.month\n",
"aemet_dataset_prob[\"year\"] = aemet_dataset_prob[\"datetime\"].dt.year\n",
"aemet_dataset_prob[\"hour\"] = aemet_dataset_prob[\"datetime\"].dt.hour\n",
"aemet_dataset_prob[\"minute\"] = aemet_dataset_prob[\"datetime\"].dt.minute\n",
"aemet_dataset_prob[\"second\"] = aemet_dataset_prob[\"datetime\"].dt.second"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [],
"source": [
"aemet_dataset_prob = aemet_dataset_prob[[\"day\", \"month\", \"hour\", \"minute\", \"second\", \"probPrecipitacion_value\", \"probTormenta_value\", \"probNieve_value\"]]"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" day | \n",
" month | \n",
" hour | \n",
" minute | \n",
" second | \n",
" probPrecipitacion_value | \n",
" probTormenta_value | \n",
" probNieve_value | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 2 | \n",
" 3 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 2 | \n",
" 2 | \n",
" 3 | \n",
" 1 | \n",
" 7 | \n",
" 0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 3 | \n",
" 2 | \n",
" 3 | \n",
" 2 | \n",
" 0 | \n",
" 0 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 4 | \n",
" 2 | \n",
" 3 | \n",
" 3 | \n",
" 0 | \n",
" 0 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 5 | \n",
" 2 | \n",
" 3 | \n",
" 4 | \n",
" 0 | \n",
" 0 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 6 | \n",
" 2 | \n",
" 3 | \n",
" 5 | \n",
" 0 | \n",
" 0 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 7 | \n",
" 2 | \n",
" 3 | \n",
" 6 | \n",
" 0 | \n",
" 0 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 8 | \n",
" 2 | \n",
" 3 | \n",
" 7 | \n",
" 0 | \n",
" 0 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 9 | \n",
" 2 | \n",
" 3 | \n",
" 7 | \n",
" 13 | \n",
" 0 | \n",
" 100.0 | \n",
" 25.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" day month hour minute second probPrecipitacion_value \\\n",
"0 2 3 0 0 0 NaN \n",
"1 2 3 1 0 0 NaN \n",
"2 2 3 1 7 0 0.0 \n",
"3 2 3 2 0 0 NaN \n",
"4 2 3 3 0 0 NaN \n",
"5 2 3 4 0 0 NaN \n",
"6 2 3 5 0 0 NaN \n",
"7 2 3 6 0 0 NaN \n",
"8 2 3 7 0 0 NaN \n",
"9 2 3 7 13 0 100.0 \n",
"\n",
" probTormenta_value probNieve_value \n",
"0 NaN NaN \n",
"1 NaN NaN \n",
"2 0.0 0.0 \n",
"3 NaN NaN \n",
"4 NaN NaN \n",
"5 NaN NaN \n",
"6 NaN NaN \n",
"7 NaN NaN \n",
"8 NaN NaN \n",
"9 25.0 0.0 "
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"aemet_dataset_prob[0:10]"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" day | \n",
" month | \n",
" hour | \n",
" minute | \n",
" second | \n",
" probPrecipitacion_value | \n",
" probTormenta_value | \n",
" probNieve_value | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 2 | \n",
" 3 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 2 | \n",
" 2 | \n",
" 3 | \n",
" 1 | \n",
" 7 | \n",
" 0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 3 | \n",
" 2 | \n",
" 3 | \n",
" 2 | \n",
" 0 | \n",
" 0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 4 | \n",
" 2 | \n",
" 3 | \n",
" 3 | \n",
" 0 | \n",
" 0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 5 | \n",
" 2 | \n",
" 3 | \n",
" 4 | \n",
" 0 | \n",
" 0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 6 | \n",
" 2 | \n",
" 3 | \n",
" 5 | \n",
" 0 | \n",
" 0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 7 | \n",
" 2 | \n",
" 3 | \n",
" 6 | \n",
" 0 | \n",
" 0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 8 | \n",
" 2 | \n",
" 3 | \n",
" 7 | \n",
" 0 | \n",
" 0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 9 | \n",
" 2 | \n",
" 3 | \n",
" 7 | \n",
" 13 | \n",
" 0 | \n",
" 100.0 | \n",
" 25.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" day month hour minute second probPrecipitacion_value \\\n",
"0 2 3 0 0 0 NaN \n",
"1 2 3 1 0 0 NaN \n",
"2 2 3 1 7 0 0.0 \n",
"3 2 3 2 0 0 0.0 \n",
"4 2 3 3 0 0 0.0 \n",
"5 2 3 4 0 0 0.0 \n",
"6 2 3 5 0 0 0.0 \n",
"7 2 3 6 0 0 0.0 \n",
"8 2 3 7 0 0 0.0 \n",
"9 2 3 7 13 0 100.0 \n",
"\n",
" probTormenta_value probNieve_value \n",
"0 NaN NaN \n",
"1 NaN NaN \n",
"2 0.0 0.0 \n",
"3 0.0 0.0 \n",
"4 0.0 0.0 \n",
"5 0.0 0.0 \n",
"6 0.0 0.0 \n",
"7 0.0 0.0 \n",
"8 0.0 0.0 \n",
"9 25.0 0.0 "
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"aemet_dataset_prob[0:10].fillna(method=\"ffill\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" Como se observa en las tablas anteriores *ffill* es un buen método para conseguir los datos que faltan. El siguiente paso sería eliminar los registros con *minute!=0*. Aprovecharemos ya también para dividir datetime en día, mes, hora, minutos y segundos. "
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [],
"source": [
"aemet_dataset[\"datetime\"] = pd.to_datetime(aemet_dataset[\"datetime\"])\n",
"\n",
"aemet_dataset[\"day\"] = aemet_dataset[\"datetime\"].dt.day\n",
"aemet_dataset[\"month\"] = aemet_dataset[\"datetime\"].dt.month\n",
"aemet_dataset[\"year\"] = aemet_dataset[\"datetime\"].dt.year\n",
"aemet_dataset[\"hour\"] = aemet_dataset[\"datetime\"].dt.hour\n",
"aemet_dataset[\"minute\"] = aemet_dataset[\"datetime\"].dt.minute\n",
"aemet_dataset[\"second\"] = aemet_dataset[\"datetime\"].dt.second\n",
"\n",
"aemet_dataset.drop(columns=[\"datetime\"], inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [],
"source": [
"aemet_dataset.fillna(method=\"ffill\", inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [],
"source": [
"aemet_dataset = aemet_dataset[aemet_dataset.minute==0]"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [],
"source": [
"aemet_dataset.reset_index(drop=True, inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"RangeIndex: 661 entries, 0 to 660\n",
"Data columns (total 18 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 estado_cielo 661 non-null float64\n",
" 1 precipitacion_value 661 non-null float64\n",
" 2 probPrecipitacion_value 659 non-null float64\n",
" 3 probTormenta_value 659 non-null float64\n",
" 4 nieve_value 661 non-null float64\n",
" 5 probNieve_value 659 non-null float64\n",
" 6 temperatura_value 661 non-null float64\n",
" 7 sensTermica_value 661 non-null float64\n",
" 8 humedadRelativa_value 661 non-null float64\n",
" 9 direccion 661 non-null object \n",
" 10 velocidad 661 non-null object \n",
" 11 vientoAndRachaMax_value 661 non-null float64\n",
" 12 day 661 non-null int32 \n",
" 13 month 661 non-null int32 \n",
" 14 year 661 non-null int32 \n",
" 15 hour 661 non-null int32 \n",
" 16 minute 661 non-null int32 \n",
" 17 second 661 non-null int32 \n",
"dtypes: float64(10), int32(6), object(2)\n",
"memory usage: 77.6+ KB\n"
]
}
],
"source": [
"aemet_dataset.info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Nieve\n",
"\n",
" Algo similar a lo que ocurría con la precipitación ocurre con la nieve. Se analizará mejor la *probNieve_value* en el EDA."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Dirección\n",
"\n",
"La variable *direccion* representa los valores de norte, sur, este, oeste. Como vienen en una lista, los sacamos y los convertimos a categorías."
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"direccion\n",
"['SO'] 143\n",
"['NE'] 135\n",
"['O'] 117\n",
"['S'] 104\n",
"['SE'] 54\n",
"['E'] 48\n",
"['NO'] 37\n",
"['N'] 13\n",
"0 10\n",
"Name: count, dtype: int64"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"aemet_dataset[\"direccion\"].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['SO', 'O', 'NO', 'NE', 'N', 'S', 'SE', 'E', NaN]\n",
"Categories (8, object): ['E', 'N', 'NE', 'NO', 'O', 'S', 'SE', 'SO']"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"aemet_dataset[\"direccion\"] = aemet_dataset[\"direccion\"].str.replace(\"['\", \"\").str.replace(\"']\", \"\").astype('category')\n",
"aemet_dataset[\"direccion\"].unique()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Velocidad\n",
"\n",
"Los valores de la variable *velocidad* vienen en una lista, los sacamos y los convertimos a *float*."
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"velocidad\n",
"['10'] 54\n",
"['8'] 47\n",
"['9'] 46\n",
"['7'] 39\n",
"['6'] 34\n",
"Name: count, dtype: int64"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"aemet_dataset['velocidad'].value_counts().head()"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"velocidad\n",
"10.0 54\n",
"8.0 47\n",
"9.0 46\n",
"7.0 39\n",
"6.0 34\n",
"Name: count, dtype: int64"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"aemet_dataset[\"velocidad\"] = aemet_dataset[\"velocidad\"].str.replace(\"['\", \"\").str.replace(\"']\", \"\").astype(float)\n",
"aemet_dataset[\"velocidad\"].value_counts().head()"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" estado_cielo | \n",
" precipitacion_value | \n",
" probPrecipitacion_value | \n",
" probTormenta_value | \n",
" nieve_value | \n",
" probNieve_value | \n",
" temperatura_value | \n",
" sensTermica_value | \n",
" humedadRelativa_value | \n",
" direccion | \n",
" velocidad | \n",
" vientoAndRachaMax_value | \n",
" day | \n",
" month | \n",
" year | \n",
" hour | \n",
" minute | \n",
" second | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 12.0 | \n",
" 0.0 | \n",
" NaN | \n",
" NaN | \n",
" 0.0 | \n",
" NaN | \n",
" 8.0 | \n",
" 5.0 | \n",
" 77.0 | \n",
" SO | \n",
" 16.0 | \n",
" 27.0 | \n",
" 2 | \n",
" 3 | \n",
" 2024 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 1 | \n",
" 15.0 | \n",
" 0.0 | \n",
" NaN | \n",
" NaN | \n",
" 0.0 | \n",
" NaN | \n",
" 8.0 | \n",
" 5.0 | \n",
" 80.0 | \n",
" O | \n",
" 17.0 | \n",
" 27.0 | \n",
" 2 | \n",
" 3 | \n",
" 2024 | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 2 | \n",
" 15.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 8.0 | \n",
" 5.0 | \n",
" 82.0 | \n",
" O | \n",
" 17.0 | \n",
" 33.0 | \n",
" 2 | \n",
" 3 | \n",
" 2024 | \n",
" 2 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 3 | \n",
" 16.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 8.0 | \n",
" 5.0 | \n",
" 84.0 | \n",
" SO | \n",
" 20.0 | \n",
" 40.0 | \n",
" 2 | \n",
" 3 | \n",
" 2024 | \n",
" 3 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 4 | \n",
" 16.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 8.0 | \n",
" 4.0 | \n",
" 86.0 | \n",
" O | \n",
" 25.0 | \n",
" 35.0 | \n",
" 2 | \n",
" 3 | \n",
" 2024 | \n",
" 4 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 5 | \n",
" 16.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 8.0 | \n",
" 5.0 | \n",
" 84.0 | \n",
" SO | \n",
" 21.0 | \n",
" 37.0 | \n",
" 2 | \n",
" 3 | \n",
" 2024 | \n",
" 5 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 6 | \n",
" 16.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 8.0 | \n",
" 5.0 | \n",
" 86.0 | \n",
" SO | \n",
" 23.0 | \n",
" 44.0 | \n",
" 2 | \n",
" 3 | \n",
" 2024 | \n",
" 6 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 7 | \n",
" 16.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 8.0 | \n",
" 4.0 | \n",
" 82.0 | \n",
" SO | \n",
" 27.0 | \n",
" 51.0 | \n",
" 2 | \n",
" 3 | \n",
" 2024 | \n",
" 7 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 8 | \n",
" 16.0 | \n",
" 0.0 | \n",
" 100.0 | \n",
" 25.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 8.0 | \n",
" 4.0 | \n",
" 81.0 | \n",
" SO | \n",
" 32.0 | \n",
" 47.0 | \n",
" 2 | \n",
" 3 | \n",
" 2024 | \n",
" 8 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 9 | \n",
" 46.0 | \n",
" 0.1 | \n",
" 100.0 | \n",
" 25.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 8.0 | \n",
" 4.0 | \n",
" 80.0 | \n",
" SO | \n",
" 28.0 | \n",
" 47.0 | \n",
" 2 | \n",
" 3 | \n",
" 2024 | \n",
" 9 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 10 | \n",
" 16.0 | \n",
" 0.0 | \n",
" 100.0 | \n",
" 25.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 8.0 | \n",
" 4.0 | \n",
" 80.0 | \n",
" SO | \n",
" 30.0 | \n",
" 50.0 | \n",
" 2 | \n",
" 3 | \n",
" 2024 | \n",
" 10 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 11 | \n",
" 16.0 | \n",
" 0.0 | \n",
" 100.0 | \n",
" 25.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 8.0 | \n",
" 4.0 | \n",
" 78.0 | \n",
" SO | \n",
" 29.0 | \n",
" 51.0 | \n",
" 2 | \n",
" 3 | \n",
" 2024 | \n",
" 11 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 12 | \n",
" 16.0 | \n",
" 0.0 | \n",
" 100.0 | \n",
" 25.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 8.0 | \n",
" 4.0 | \n",
" 73.0 | \n",
" SO | \n",
" 31.0 | \n",
" 57.0 | \n",
" 2 | \n",
" 3 | \n",
" 2024 | \n",
" 12 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 13 | \n",
" 64.0 | \n",
" 0.1 | \n",
" 100.0 | \n",
" 25.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 8.0 | \n",
" 4.0 | \n",
" 75.0 | \n",
" SO | \n",
" 35.0 | \n",
" 58.0 | \n",
" 2 | \n",
" 3 | \n",
" 2024 | \n",
" 13 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 14 | \n",
" 64.0 | \n",
" 0.1 | \n",
" 100.0 | \n",
" 80.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 9.0 | \n",
" 5.0 | \n",
" 75.0 | \n",
" SO | \n",
" 35.0 | \n",
" 62.0 | \n",
" 2 | \n",
" 3 | \n",
" 2024 | \n",
" 14 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 15 | \n",
" 16.0 | \n",
" 0.0 | \n",
" 100.0 | \n",
" 80.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 9.0 | \n",
" 5.0 | \n",
" 79.0 | \n",
" SO | \n",
" 38.0 | \n",
" 62.0 | \n",
" 2 | \n",
" 3 | \n",
" 2024 | \n",
" 15 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 16 | \n",
" 64.0 | \n",
" 0.1 | \n",
" 100.0 | \n",
" 80.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 9.0 | \n",
" 5.0 | \n",
" 79.0 | \n",
" SO | \n",
" 38.0 | \n",
" 61.0 | \n",
" 2 | \n",
" 3 | \n",
" 2024 | \n",
" 16 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 17 | \n",
" 64.0 | \n",
" 0.6 | \n",
" 100.0 | \n",
" 80.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 8.0 | \n",
" 3.0 | \n",
" 86.0 | \n",
" SO | \n",
" 39.0 | \n",
" 72.0 | \n",
" 2 | \n",
" 3 | \n",
" 2024 | \n",
" 17 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 18 | \n",
" 36.0 | \n",
" 6.0 | \n",
" 100.0 | \n",
" 80.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 8.0 | \n",
" 3.0 | \n",
" 91.0 | \n",
" SO | \n",
" 45.0 | \n",
" 28.0 | \n",
" 2 | \n",
" 3 | \n",
" 2024 | \n",
" 18 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 19 | \n",
" 46.0 | \n",
" 0.2 | \n",
" 100.0 | \n",
" 80.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 7.0 | \n",
" 5.0 | \n",
" 71.0 | \n",
" O | \n",
" 11.0 | \n",
" 19.0 | \n",
" 2 | \n",
" 3 | \n",
" 2024 | \n",
" 19 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" estado_cielo precipitacion_value probPrecipitacion_value \\\n",
"0 12.0 0.0 NaN \n",
"1 15.0 0.0 NaN \n",
"2 15.0 0.0 0.0 \n",
"3 16.0 0.0 0.0 \n",
"4 16.0 0.0 0.0 \n",
"5 16.0 0.0 0.0 \n",
"6 16.0 0.0 0.0 \n",
"7 16.0 0.0 0.0 \n",
"8 16.0 0.0 100.0 \n",
"9 46.0 0.1 100.0 \n",
"10 16.0 0.0 100.0 \n",
"11 16.0 0.0 100.0 \n",
"12 16.0 0.0 100.0 \n",
"13 64.0 0.1 100.0 \n",
"14 64.0 0.1 100.0 \n",
"15 16.0 0.0 100.0 \n",
"16 64.0 0.1 100.0 \n",
"17 64.0 0.6 100.0 \n",
"18 36.0 6.0 100.0 \n",
"19 46.0 0.2 100.0 \n",
"\n",
" probTormenta_value nieve_value probNieve_value temperatura_value \\\n",
"0 NaN 0.0 NaN 8.0 \n",
"1 NaN 0.0 NaN 8.0 \n",
"2 0.0 0.0 0.0 8.0 \n",
"3 0.0 0.0 0.0 8.0 \n",
"4 0.0 0.0 0.0 8.0 \n",
"5 0.0 0.0 0.0 8.0 \n",
"6 0.0 0.0 0.0 8.0 \n",
"7 0.0 0.0 0.0 8.0 \n",
"8 25.0 0.0 0.0 8.0 \n",
"9 25.0 0.0 0.0 8.0 \n",
"10 25.0 0.0 0.0 8.0 \n",
"11 25.0 0.0 0.0 8.0 \n",
"12 25.0 0.0 0.0 8.0 \n",
"13 25.0 0.0 0.0 8.0 \n",
"14 80.0 0.0 0.0 9.0 \n",
"15 80.0 0.0 0.0 9.0 \n",
"16 80.0 0.0 0.0 9.0 \n",
"17 80.0 0.0 0.0 8.0 \n",
"18 80.0 0.0 0.0 8.0 \n",
"19 80.0 0.0 0.0 7.0 \n",
"\n",
" sensTermica_value humedadRelativa_value direccion velocidad \\\n",
"0 5.0 77.0 SO 16.0 \n",
"1 5.0 80.0 O 17.0 \n",
"2 5.0 82.0 O 17.0 \n",
"3 5.0 84.0 SO 20.0 \n",
"4 4.0 86.0 O 25.0 \n",
"5 5.0 84.0 SO 21.0 \n",
"6 5.0 86.0 SO 23.0 \n",
"7 4.0 82.0 SO 27.0 \n",
"8 4.0 81.0 SO 32.0 \n",
"9 4.0 80.0 SO 28.0 \n",
"10 4.0 80.0 SO 30.0 \n",
"11 4.0 78.0 SO 29.0 \n",
"12 4.0 73.0 SO 31.0 \n",
"13 4.0 75.0 SO 35.0 \n",
"14 5.0 75.0 SO 35.0 \n",
"15 5.0 79.0 SO 38.0 \n",
"16 5.0 79.0 SO 38.0 \n",
"17 3.0 86.0 SO 39.0 \n",
"18 3.0 91.0 SO 45.0 \n",
"19 5.0 71.0 O 11.0 \n",
"\n",
" vientoAndRachaMax_value day month year hour minute second \n",
"0 27.0 2 3 2024 0 0 0 \n",
"1 27.0 2 3 2024 1 0 0 \n",
"2 33.0 2 3 2024 2 0 0 \n",
"3 40.0 2 3 2024 3 0 0 \n",
"4 35.0 2 3 2024 4 0 0 \n",
"5 37.0 2 3 2024 5 0 0 \n",
"6 44.0 2 3 2024 6 0 0 \n",
"7 51.0 2 3 2024 7 0 0 \n",
"8 47.0 2 3 2024 8 0 0 \n",
"9 47.0 2 3 2024 9 0 0 \n",
"10 50.0 2 3 2024 10 0 0 \n",
"11 51.0 2 3 2024 11 0 0 \n",
"12 57.0 2 3 2024 12 0 0 \n",
"13 58.0 2 3 2024 13 0 0 \n",
"14 62.0 2 3 2024 14 0 0 \n",
"15 62.0 2 3 2024 15 0 0 \n",
"16 61.0 2 3 2024 16 0 0 \n",
"17 72.0 2 3 2024 17 0 0 \n",
"18 28.0 2 3 2024 18 0 0 \n",
"19 19.0 2 3 2024 19 0 0 "
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"aemet_dataset[0:20]"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"RangeIndex: 661 entries, 0 to 660\n",
"Data columns (total 18 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 estado_cielo 661 non-null float64 \n",
" 1 precipitacion_value 661 non-null float64 \n",
" 2 probPrecipitacion_value 659 non-null float64 \n",
" 3 probTormenta_value 659 non-null float64 \n",
" 4 nieve_value 661 non-null float64 \n",
" 5 probNieve_value 659 non-null float64 \n",
" 6 temperatura_value 661 non-null float64 \n",
" 7 sensTermica_value 661 non-null float64 \n",
" 8 humedadRelativa_value 661 non-null float64 \n",
" 9 direccion 651 non-null category\n",
" 10 velocidad 651 non-null float64 \n",
" 11 vientoAndRachaMax_value 661 non-null float64 \n",
" 12 day 661 non-null int32 \n",
" 13 month 661 non-null int32 \n",
" 14 year 661 non-null int32 \n",
" 15 hour 661 non-null int32 \n",
" 16 minute 661 non-null int32 \n",
" 17 second 661 non-null int32 \n",
"dtypes: category(1), float64(11), int32(6)\n",
"memory usage: 73.4 KB\n"
]
}
],
"source": [
"aemet_dataset.info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## PROFILING 📑"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"profile = ProfileReport(\n",
" aemet_dataset,\n",
" title=\"🌥️ AEMET QA\",\n",
" dataset={\n",
" \"description\": \"AEMET - Estado de la meteorología\",\n",
" \"url\": \"https://opendata.aemet.es/dist/index.html\",\n",
" },\n",
" variables={\n",
" \"descriptions\": {\n",
" \"day\": \"Fecha (día) de la petición a la API\",\n",
" \"month\": \"Fecha (mes) de la petición a la API\",\n",
" \"year\": \"Fecha (año) de la petición a la API\",\n",
" \"hour\": \"Fecha (hora) de la petición a la API\",\n",
" \"minute\": \"Fecha (minute) de la petición a la API\",\n",
" \"second\": \"Fecha (second) de la petición a la API\",\n",
" \"estado_cielo\": \"Estado del cielo\",\n",
" \"precipitacion_value\": \"Precipitación\",\n",
" \"probPrecipitacion_value\": \"Probabilidad de precipitación\",\n",
" \"probTormenta_value\": \"Probabilidad de tormenta\",\n",
" \"nieve_value\": \"Nieve\",\n",
" \"probNieve_value\": \"Probabilidad de nieve\",\n",
" \"temperatura_value\": \"Temperatura\",\n",
" \"sensTermica_value\": \"Sensación térmica\",\n",
" \"humedadRelativa_value\": \"Humedad relativa\",\n",
" \"direccion\": \"Dirección del viento\",\n",
" \"velocidad\": \"Velocidad del viento\",\n",
" \"vientoAndRachaMax_value\": \"Viento racha máxima\",\n",
" }\n",
" },\n",
" interactions=None,\n",
" explorative=True,\n",
" dark_mode=True,\n",
")\n",
"profile.to_file(os.path.join(ROOT_PATH, \"docs\", \"qa\", \"aemet_report.html\"))\n",
"# profile.to_notebook_iframe()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.11.7 ('venv': venv)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
},
"orig_nbformat": 4,
"vscode": {
"interpreter": {
"hash": "7056b316d281e41af8bb677c2c15c6d2127a9166071d02d66c8bc69014a8260f"
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}