{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# How much does the AI think your house is worth?\n", "\n", "Let's start with a classic example from the machine learning world (thanks, Andrew Ng!): predicting the price of a house from its characteristics ('features'). The features are the attributes of the house whose price we want to predict: for example 'm²', 'n_plantas' (number of floors), 'n_habitaciones' (number of rooms), etc.\n", "\n", "Did you notice that I only mentioned numeric features? That is because ML models 'understand' numbers much better than data in any other format. In fact, many models work best with arrays of normalized numbers (for example, scaled to the range [0,1]). You may be thinking that there are non-numeric factors that will likely influence the price of the house we are predicting. For example, the location: which country is the house in? Which neighborhood? What does a square meter cost in that area? The process of turning this kind of data into numbers is called 'encoding'.\n", "\n", "But let's start at the beginning: we have chosen a dataset (the data needed to train the model) from [Kaggle](https://www.kaggle.com/datasets/yasserh/housing-prices-dataset)." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data exploration" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd # tabular data loading and manipulation\n", "import numpy as np # numerical arrays" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "dataset = pd.read_csv('dataset.csv') # read the dataset" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['price',\n", " 'area',\n", " 'bedrooms',\n", " 'bathrooms',\n", " 'stories',\n", " 'mainroad',\n", " 'guestroom',\n", " 'basement',\n", " 'hotwaterheating',\n", " 'airconditioning',\n", " 'parking',\n", " 'prefarea',\n", " 'furnishingstatus']" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(dataset.columns) # the 13 columns: the target 'price' plus 12 features" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(545, 13)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dataset.shape # (rows, columns)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The dataset has 545 records and 13 columns per record: the target `price` plus 12 features." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
|  | price | area | bedrooms | bathrooms | stories | mainroad | guestroom | basement | hotwaterheating | airconditioning | parking | prefarea | furnishingstatus |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13300000 | 7420 | 4 | 2 | 3 | yes | no | no | no | yes | 2 | yes | furnished |
| 1 | 12250000 | 8960 | 4 | 4 | 4 | yes | no | no | no | yes | 3 | no | furnished |
| 2 | 12250000 | 9960 | 3 | 2 | 2 | yes | no | yes | no | no | 2 | yes | semi-furnished |
| 3 | 12215000 | 7500 | 4 | 2 | 2 | yes | no | yes | no | yes | 3 | yes | furnished |
| 4 | 11410000 | 7420 | 4 | 1 | 2 | yes | yes | yes | no | yes | 2 | no | furnished |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 540 | 1820000 | 3000 | 2 | 1 | 1 | yes | no | yes | no | no | 2 | no | unfurnished |
| 541 | 1767150 | 2400 | 3 | 1 | 1 | no | no | no | no | no | 0 | no | semi-furnished |
| 542 | 1750000 | 3620 | 2 | 1 | 1 | yes | no | no | no | no | 0 | no | unfurnished |
| 543 | 1750000 | 2910 | 3 | 1 | 1 | no | no | no | no | no | 0 | no | furnished |
| 544 | 1750000 | 3850 | 3 | 1 | 2 | yes | no | no | no | no | 0 | no | unfurnished |
545 rows × 13 columns
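
The view that follows keeps only the numeric columns (price, area, bedrooms, bathrooms, stories, parking). The cell that produced it is not shown here, so the snippet below is only a sketch of one way to get such a view, using `select_dtypes`. The `sample` frame is a hypothetical stand-in built from the first rows above, not the full dataset.

```python
import pandas as pd

# Hypothetical stand-in: a few columns from the first three rows shown above.
sample = pd.DataFrame({
    "price": [13300000, 12250000, 12250000],
    "area": [7420, 8960, 9960],
    "bedrooms": [4, 4, 3],
    "mainroad": ["yes", "yes", "yes"],
    "furnishingstatus": ["furnished", "furnished", "semi-furnished"],
})

# Keep only the columns with a numeric dtype; text columns are dropped.
numeric_sample = sample.select_dtypes(include="number")
print(list(numeric_sample.columns))
```

With the real data, the same call would be applied to `dataset` instead of `sample`.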
|  | price | area | bedrooms | bathrooms | stories | parking |
|---|---|---|---|---|---|---|
| 0 | 13300000 | 7420 | 4 | 2 | 3 | 2 |
| 1 | 12250000 | 8960 | 4 | 4 | 4 | 3 |
| 2 | 12250000 | 9960 | 3 | 2 | 2 | 2 |
| 3 | 12215000 | 7500 | 4 | 2 | 2 | 3 |
| 4 | 11410000 | 7420 | 4 | 1 | 2 | 2 |
| ... | ... | ... | ... | ... | ... | ... |
| 540 | 1820000 | 3000 | 2 | 1 | 1 | 2 |
| 541 | 1767150 | 2400 | 3 | 1 | 1 | 0 |
| 542 | 1750000 | 3620 | 2 | 1 | 1 | 0 |
| 543 | 1750000 | 2910 | 3 | 1 | 1 | 0 |
| 544 | 1750000 | 3850 | 3 | 1 | 2 | 0 |
545 rows × 6 columns
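
Summary statistics like the ones in the next table (count, mean, std, min, quartiles, max) are what `DataFrame.describe()` returns for numeric columns. A minimal sketch, assuming a hypothetical `numeric_sample` frame built from the first five rows shown above rather than the full 545-row dataset:

```python
import pandas as pd

# Hypothetical stand-in: two numeric columns from the first five rows shown above.
numeric_sample = pd.DataFrame({
    "price": [13300000, 12250000, 12250000, 12215000, 11410000],
    "area": [7420, 8960, 9960, 7500, 7420],
})

# describe() computes count, mean, std, min, quartiles and max for each column.
summary = numeric_sample.describe()
print(summary.loc[["min", "max"]])
```

Because this sample has only five rows, its numbers will not match the full-dataset statistics in the table.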
|  | price | area | bedrooms | bathrooms | stories | parking |
|---|---|---|---|---|---|---|
| count | 5.450000e+02 | 545.000000 | 545.000000 | 545.000000 | 545.000000 | 545.000000 |
| mean | 4.766729e+06 | 5150.541284 | 2.965138 | 1.286239 | 1.805505 | 0.693578 |
| std | 1.870440e+06 | 2170.141023 | 0.738064 | 0.502470 | 0.867492 | 0.861586 |
| min | 1.750000e+06 | 1650.000000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 |
| 25% | 3.430000e+06 | 3600.000000 | 2.000000 | 1.000000 | 1.000000 | 0.000000 |
| 50% | 4.340000e+06 | 4600.000000 | 3.000000 | 1.000000 | 2.000000 | 0.000000 |
| 75% | 5.740000e+06 | 6360.000000 | 3.000000 | 2.000000 | 2.000000 | 1.000000 |
| max | 1.330000e+07 | 16200.000000 | 6.000000 | 4.000000 | 4.000000 | 3.000000 |
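
The columns live on very different scales (price reaches 1.33e7 while bedrooms tops out at 6), and several columns of the dataset hold text such as yes/no. The introduction mentioned normalizing to [0,1] and encoding non-numeric data; the snippet below is a rough sketch of both ideas, not necessarily the approach this notebook takes. It uses min-max scaling, `x' = (x - min) / (max - min)`, a yes/no to 1/0 mapping, and one-hot encoding with `pd.get_dummies`. The `sample` frame and the new variable names are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in: three columns from the first five rows of the dataset.
sample = pd.DataFrame({
    "area": [7420, 8960, 9960, 7500, 7420],
    "guestroom": ["no", "no", "no", "no", "yes"],
    "furnishingstatus": ["furnished", "furnished", "semi-furnished",
                         "furnished", "furnished"],
})

# Min-max scaling: the smallest value maps to 0 and the largest to 1.
area = sample["area"]
area_scaled = (area - area.min()) / (area.max() - area.min())

# Binary encoding: map the two text labels to 1 and 0.
guestroom_encoded = sample["guestroom"].map({"yes": 1, "no": 0})

# One-hot encoding: one indicator column per category.
furnishing_one_hot = pd.get_dummies(sample["furnishingstatus"], prefix="furnishing")

print(area_scaled.round(3).tolist())
print(guestroom_encoded.tolist())
print(list(furnishing_one_hot.columns))
```

Here the minimum and maximum come from the sample itself. In a real pipeline they would normally be computed on the training data only and reused for any new data.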