{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "mP9aApBqQJed" }, "source": [ "Creating declarative preprocessing routines\n", "===========================================" ] }, { "cell_type": "markdown", "metadata": { "id": "2V8sUvroQJej" }, "source": [ "## Introduction\n", "\n", "Declarative programming is a software engineering approach in which, unlike imperative programming, you describe the instructions you want carried out, and a separate process then turns that description into code that can be executed.\n", "\n", "Kedro is an open-source Python framework designed for creating declarative, reproducible, maintainable, and modular data science code.\n", "\n", "Kedro applies software engineering practices to help users build production-ready data science and data engineering pipelines." ] }, { "cell_type": "markdown", "metadata": { "id": "tRmZrmz7QJel" }, "source": [ "### Installation" ] }, { "cell_type": "markdown", "metadata": { "id": "3Yj6mHMaQJel" }, "source": [ "We will need to install the following libraries:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "id": "IeIXxuSaQJem", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "38478847-9b68-4d6f-a324-aadc2f911010" }, "outputs": [], "source": [ "%pip install kedro kedro-viz --quiet" ] }, { "cell_type": "markdown", "metadata": { "id": "tTLrI09WQJeo" }, "source": [ "### About the UCI census dataset\n", "\n", "The UCI census dataset is a dataset in which each record represents a person. Each record contains 14 columns describing a single person, drawn from the 1994 United States census database. This includes information such as age, marital status, and education level. The task is to determine whether a person has a high income (defined as earning more than $50,000 a year). Given the kind of data it uses, this task is often used in fairness studies, partly because the dataset's attributes are easy to understand, including some sensitive ones such as age and gender, and partly because it is a clearly real-world task."
] }, { "cell_type": "markdown", "metadata": { "id": "vWyomYxTQJep" }, "source": [ "Download the dataset:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "id": "ovck-xIsQJeq" }, "outputs": [], "source": [ "!wget https://santiagxf.blob.core.windows.net/public/datasets/uci_census.zip \\\n", " --quiet --no-clobber\n", "!mkdir -p datasets/uci_census\n", "!unzip -qq uci_census.zip -d datasets/uci_census" ] }, { "cell_type": "markdown", "metadata": { "id": "EbVjMnWWQJeq" }, "source": [ "Then load it:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "id": "K3qSdhVHQJeq" }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "\n", "train = pd.read_csv('datasets/uci_census/data/adult-train.csv')\n", "test = pd.read_csv('datasets/uci_census/data/adult-test.csv')" ] }, { "cell_type": "code", "source": [ "train" ], "metadata": { "id": "6GyGTqzJzaaY", "outputId": "68de04d3-2260-48c5-8d4c-02ef6ce1cd84", "colab": { "base_uri": "https://localhost:8080/", "height": 562 } }, "execution_count": 6, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " income age workclass fnlwgt education education-num \\\n", "0 <=50K 39 State-gov 77516 Bachelors 13 \n", "1 <=50K 50 Self-emp-not-inc 83311 Bachelors 13 \n", "2 <=50K 38 Private 215646 HS-grad 9 \n", "3 <=50K 53 Private 234721 11th 7 \n", "4 <=50K 28 Private 338409 Bachelors 13 \n", "... ... ... ... ... ... ... 
\n", "32556 <=50K 27 Private 257302 Assoc-acdm 12 \n", "32557 >50K 40 Private 154374 HS-grad 9 \n", "32558 <=50K 58 Private 151910 HS-grad 9 \n", "32559 <=50K 22 Private 201490 HS-grad 9 \n", "32560 >50K 52 Self-emp-inc 287927 HS-grad 9 \n", "\n", " marital-status occupation relationship race \\\n", "0 Never-married Adm-clerical Not-in-family White \n", "1 Married-civ-spouse Exec-managerial Husband White \n", "2 Divorced Handlers-cleaners Not-in-family White \n", "3 Married-civ-spouse Handlers-cleaners Husband Black \n", "4 Married-civ-spouse Prof-specialty Wife Black \n", "... ... ... ... ... \n", "32556 Married-civ-spouse Tech-support Wife White \n", "32557 Married-civ-spouse Machine-op-inspct Husband White \n", "32558 Widowed Adm-clerical Unmarried White \n", "32559 Never-married Adm-clerical Own-child White \n", "32560 Married-civ-spouse Exec-managerial Wife White \n", "\n", " gender capital-gain capital-loss hours-per-week native-country \n", "0 Male 2174 0 40 United-States \n", "1 Male 0 0 13 United-States \n", "2 Male 0 0 40 United-States \n", "3 Male 0 0 40 United-States \n", "4 Female 0 0 40 Cuba \n", "... ... ... ... ... ... \n", "32556 Female 0 0 38 United-States \n", "32557 Male 0 0 40 United-States \n", "32558 Female 0 0 40 United-States \n", "32559 Male 0 0 20 United-States \n", "32560 Female 15024 0 40 United-States \n", "\n", "[32561 rows x 15 columns]" ], "text/html": [ "\n", "
| | income | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | gender | capital-gain | capital-loss | hours-per-week | native-country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | <=50K | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States |
| 1 | <=50K | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States |
| 2 | <=50K | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States |
| 3 | <=50K | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States |
| 4 | <=50K | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 32556 | <=50K | 27 | Private | 257302 | Assoc-acdm | 12 | Married-civ-spouse | Tech-support | Wife | White | Female | 0 | 0 | 38 | United-States |
| 32557 | >50K | 40 | Private | 154374 | HS-grad | 9 | Married-civ-spouse | Machine-op-inspct | Husband | White | Male | 0 | 0 | 40 | United-States |
| 32558 | <=50K | 58 | Private | 151910 | HS-grad | 9 | Widowed | Adm-clerical | Unmarried | White | Female | 0 | 0 | 40 | United-States |
| 32559 | <=50K | 22 | Private | 201490 | HS-grad | 9 | Never-married | Adm-clerical | Own-child | White | Male | 0 | 0 | 20 | United-States |
| 32560 | >50K | 52 | Self-emp-inc | 287927 | HS-grad | 9 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 15024 | 0 | 40 | United-States |

32561 rows × 15 columns

| | id | company_rating | company_location | total_fleet_count | iata_approved |
|---|---|---|---|---|---|
| 0 | 3888 | 100% | Isle of Man | 1.0 | f |
| 1 | 46728 | 100% | NaN | 1.0 | f |
| 2 | 34618 | 38% | Isle of Man | 1.0 | f |
| 3 | 28619 | 100% | Bosnia and Herzegovina | 1.0 | f |
| 4 | 8240 | NaN | Chile | 1.0 | t |