{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Updated 2022-09-11 13:08:40.192717\n" ] } ], "source": [ "from datetime import datetime\n", "print(f'Updated {datetime.now()}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Preparing data for machine learning\n", "\n", "For data used in machine learning models, keep the following in mind:\n", "\n", "* Most models cannot handle missing values, so the data should not contain any.\n", "* Categorical variables must be converted to dummy variables.\n", "* Standardizing the data can help produce better models.\n", "* Values that differ greatly from the rest (outliers) are in some cases worth removing.\n", "* Suitable variable transformations may yield better models.\n", "* If the target variable is categorical and one of its classes is underrepresented in the training data, balancing the data usually gives a better model.\n", "\n", "This notebook contains examples of how to handle the points listed above." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd" ] },
{ "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['nro', 'sukup', 'ikä', 'perhe', 'koulutus', 'palveluv', 'palkka',\n", " 'johto', 'työtov', 'työymp', 'palkkat', 'työteht', 'työterv', 'lomaosa',\n", " 'kuntosa', 'hieroja'],\n", " dtype='object')" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Open the data used in the examples\n", "df = pd.read_excel('https://taanila.fi/data1.xlsx')\n", "\n", "# Show all rows and columns\n", "pd.options.display.max_rows = None\n", "pd.options.display.max_columns = None\n", "\n", "# List the variables in the data\n", "df.columns" ] },
{ "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<class 'pandas.core.frame.DataFrame'>\n", "RangeIndex: 82 entries, 0 to 81\n", "Data columns (total 16 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 nro 82 non-null int64 \n", " 1 sukup 82 non-null int64 \n", " 2 ikä 82 non-null int64 \n", " 3 perhe 82 non-null int64 \n", " 4 koulutus 81 non-null float64\n", " 5 palveluv 80 non-null float64\n", " 6 palkka 82 non-null int64 \n", " 7 johto 82 non-null int64 \n", " 8 työtov 81 non-null float64\n", " 9 työymp 82 non-null int64 \n", " 10 palkkat 82 non-null int64 \n", " 11 työteht 82 non-null int64 \n", " 12 työterv 47 non-null float64\n", " 13 lomaosa 20 non-null float64\n", " 14 kuntosa 9 non-null float64\n", " 15 hieroja 22 non-null float64\n", "dtypes: float64(7), int64(9)\n", "memory usage: 10.4 KB\n" ] } ], "source": [ "# Information about the variables\n", "df.info()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Missing values\n", "\n", "I can handle missing values in two ways:\n", "* by dropping the rows that contain missing values, or\n", "* by filling in the missing values with values that suit the situation.\n", "\n", "Rows that contain missing values can be dropped with dropna():\n", "\n", "https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html\n", "\n", "Missing values can be filled in with fillna():\n", "\n", "https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html\n", "\n", "First I look at the data with the missing values highlighted:" ] },
{ "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
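|  | nro | sukup | ikä | perhe | koulutus | palveluv | palkka | johto | työtov | työymp | palkkat | työteht | työterv | lomaosa | kuntosa | hieroja |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 38 | 1 | 1.0 | 22.0 | 3587 | 3 | 3.0 | 3 | 3 | 3 | nan | nan | nan | nan |
| 1 | 2 | 1 | 29 | 2 | 2.0 | 10.0 | 2963 | 1 | 5.0 | 2 | 1 | 3 | nan | nan | nan | nan |
| 2 | 3 | 1 | 30 | 1 | 1.0 | 7.0 | 1989 | 3 | 4.0 | 1 | 1 | 3 | 1.0 | nan | nan | nan |
| 3 | 4 | 1 | 36 | 2 | 1.0 | 14.0 | 2144 | 3 | 3.0 | 3 | 3 | 3 | 1.0 | nan | nan | nan |
| 4 | 5 | 1 | 24 | 1 | 2.0 | 4.0 | 2183 | 2 | 3.0 | 2 | 1 | 2 | 1.0 | nan | nan | nan |
| 5 | 6 | 2 | 31 | 2 | 2.0 | 14.0 | 1910 | 4 | 4.0 | 5 | 2 | 4 | 1.0 | 1.0 | nan | nan |
| 6 | 7 | 1 | 49 | 1 | 2.0 | 16.0 | 2066 | 3 | 5.0 | 4 | 2 | 2 | nan | nan | 1.0 | nan |
| 7 | 8 | 1 | 55 | 1 | 1.0 | 0.0 | 2066 | 3 | 5.0 | 3 | 1 | 3 | 1.0 | nan | nan | nan |
| 8 | 9 | 1 | 40 | 2 | 1.0 | 23.0 | 2768 | 2 | 4.0 | 4 | 2 | 4 | nan | 1.0 | nan | nan |
| 9 | 10 | 1 | 33 | 1 | 1.0 | 16.0 | 2106 | 3 | 2.0 | 1 | 1 | 1 | 1.0 | nan | nan | nan |
| 10 | 11 | 1 | 39 | 2 | 1.0 | 22.0 | 2651 | 3 | 5.0 | 3 | 1 | 3 | nan | nan | nan | nan |
| 11 | 12 | 1 | 40 | 2 | 3.0 | 21.0 | 2846 | 3 | 5.0 | 3 | 1 | 2 | nan | 1.0 | nan | 1.0 |
| 12 | 13 | 1 | 35 | 2 | 3.0 | 15.0 | 2808 | 3 | 5.0 | 3 | 1 | 4 | nan | 1.0 | nan | nan |
| 13 | 14 | 1 | 58 | 2 | 3.0 | 21.0 | 3587 | 4 | 5.0 | 4 | 1 | 3 | nan | nan | nan | nan |
| 14 | 15 | 1 | 53 | 2 | 3.0 | 12.0 | 3393 | 4 | 4.0 | 4 | 4 | 4 | nan | 1.0 | nan | nan |
| 15 | 16 | 2 | 42 | 2 | 3.0 | 23.0 | 2691 | 3 | 3.0 | 3 | 3 | 3 | 1.0 | nan | nan | 1.0 |
| 16 | 17 | 1 | 26 | 1 | 4.0 | 2.0 | 5225 | 5 | 5.0 | 5 | 4 | 5 | nan | nan | 1.0 | nan |
| 17 | 18 | 2 | 38 | 2 | 3.0 | 17.0 | 2729 | 4 | 3.0 | 4 | 2 | 1 | nan | nan | nan | nan |
| 18 | 19 | 1 | 42 | 1 | 3.0 | 20.0 | 2925 | 2 | 3.0 | 4 | 1 | 4 | 1.0 | nan | nan | 1.0 |
| 19 | 20 | 2 | 40 | 2 | 2.0 | 13.0 | 2457 | 3 | 3.0 | 4 | 3 | 2 | 1.0 | nan | nan | 1.0 |
| 20 | 21 | 2 | 40 | 2 | 3.0 | 20.0 | 2691 | 2 | 4.0 | 5 | 3 | 4 | nan | nan | nan | nan |
| 21 | 22 | 1 | 47 | 2 | 3.0 | 17.0 | 4874 | 2 | 4.0 | 3 | 2 | 4 | nan | 1.0 | nan | nan |
| 22 | 23 | 1 | 44 | 2 | 1.0 | 27.0 | 3510 | 4 | 4.0 | 4 | 4 | 4 | nan | 1.0 | nan | nan |
| 23 | 24 | 1 | 36 | 1 | 3.0 | 7.0 | 4446 | 3 | 4.0 | 3 | 4 | 5 | nan | nan | nan | nan |
| 24 | 25 | 1 | 43 | 2 | 3.0 | 1.0 | 2925 | 4 | 4.0 | 4 | 4 | 4 | nan | 1.0 | nan | nan |
| 25 | 26 | 1 | 26 | 1 | 2.0 | 3.0 | 1521 | 2 | 4.0 | 2 | 1 | 3 | 1.0 | nan | 1.0 | 1.0 |
| 26 | 27 | 1 | 26 | 1 | 2.0 | 2.0 | 1989 | 2 | 4.0 | 2 | 2 | 3 | 1.0 | nan | nan | 1.0 |
| 27 | 28 | 2 | 56 | 1 | 1.0 | 15.0 | 2223 | 3 | 4.0 | 3 | 2 | 4 | 1.0 | nan | nan | 1.0 |
| 28 | 29 | 1 | 47 | 2 | 2.0 | 23.0 | 2808 | 2 | 4.0 | 3 | 1 | 4 | 1.0 | nan | nan | nan |
| 29 | 30 | 1 | 21 | 1 | nan | nan | 1949 | 4 | 4.0 | 3 | 3 | 2 | 1.0 | nan | nan | nan |
| 30 | 31 | 1 | 21 | 1 | 3.0 | 1.0 | 2340 | 4 | 5.0 | 3 | 4 | 2 | 1.0 | nan | 1.0 | 1.0 |
| 31 | 32 | 1 | 45 | 2 | 1.0 | 24.0 | 2925 | 4 | 4.0 | 4 | 3 | 4 | nan | 1.0 | nan | nan |
| 32 | 33 | 1 | 59 | 2 | 3.0 | 15.0 | 6278 | 4 | 4.0 | 5 | 4 | 4 | nan | 1.0 | nan | nan |
| 33 | 34 | 1 | 37 | 2 | 1.0 | 14.0 | 2183 | 1 | 5.0 | 1 | 1 | 2 | 1.0 | nan | nan | 1.0 |
| 34 | 35 | 1 | 28 | 2 | 2.0 | 5.0 | 1989 | 3 | 4.0 | 3 | 3 | 3 | 1.0 | nan | 1.0 | 1.0 |
| 35 | 36 | 1 | 31 | 2 | 3.0 | 0.0 | 1559 | 2 | 4.0 | 3 | 1 | 3 | 1.0 | nan | nan | nan |
| 36 | 37 | 2 | 56 | 2 | 2.0 | 17.0 | 2729 | 5 | 5.0 | 5 | 5 | 5 | nan | nan | nan | 1.0 |
| 37 | 38 | 1 | 50 | 2 | 1.0 | 1.0 | 2027 | 5 | 5.0 | 4 | 1 | 4 | 1.0 | 1.0 | nan | nan |
| 38 | 39 | 1 | 30 | 1 | 2.0 | 10.0 | 2300 | 3 | 5.0 | 3 | 3 | 4 | nan | nan | nan | nan |
| 39 | 40 | 1 | 32 | 1 | 1.0 | 3.0 | 2106 | 1 | 5.0 | 4 | 1 | 3 | 1.0 | nan | nan | nan |
| 40 | 41 | 1 | 33 | 2 | 3.0 | 9.0 | 2846 | 3 | 3.0 | 4 | 2 | 3 | 1.0 | nan | nan | nan |
| 41 | 42 | 1 | 29 | 1 | 2.0 | 6.0 | 2534 | 3 | 4.0 | 3 | 1 | 2 | 1.0 | nan | nan | nan |
| 42 | 43 | 2 | 40 | 2 | 3.0 | 12.0 | 2144 | 4 | 4.0 | 4 | 4 | 4 | nan | 1.0 | nan | nan |
| 43 | 44 | 1 | 30 | 1 | 2.0 | 7.0 | 2223 | 2 | 3.0 | 4 | 1 | 3 | 1.0 | nan | nan | 1.0 |
| 44 | 45 | 1 | 55 | 2 | 1.0 | 35.0 | 2651 | 4 | 5.0 | 4 | 2 | 4 | 1.0 | nan | nan | 1.0 |
| 45 | 46 | 2 | 51 | 2 | 1.0 | 28.0 | 1989 | 3 | 3.0 | 2 | 2 | 3 | 1.0 | nan | nan | 1.0 |
| 46 | 47 | 2 | 22 | 1 | 3.0 | 21.0 | 1872 | 3 | 3.0 | 4 | 1 | 3 | nan | nan | 1.0 | nan |
| 47 | 48 | 1 | 34 | 2 | 1.0 | 18.0 | 2183 | 4 | 5.0 | 4 | 1 | 3 | nan | nan | nan | nan |
| 48 | 49 | 1 | 27 | 2 | 2.0 | 7.0 | 2729 | 4 | 4.0 | 3 | 3 | 5 | nan | nan | 1.0 | nan |
| 49 | 50 | 1 | 29 | 1 | 3.0 | 7.0 | 2340 | 3 | 4.0 | 3 | 2 | 3 | 1.0 | nan | nan | nan |
| 50 | 51 | 2 | 39 | 2 | 2.0 | 10.0 | 2106 | 4 | 5.0 | 5 | 4 | 5 | nan | 1.0 | nan | nan |
| 51 | 52 | 1 | 41 | 2 | 1.0 | 18.0 | 2261 | 5 | 5.0 | 5 | 2 | 5 | nan | 1.0 | nan | nan |
| 52 | 53 | 1 | 44 | 2 | 1.0 | 3.0 | 1989 | 1 | 2.0 | 2 | 1 | 1 | 1.0 | nan | nan | nan |
| 53 | 54 | 1 | 25 | 1 | 2.0 | 1.0 | 1559 | 2 | 4.0 | 3 | 1 | 2 | 1.0 | nan | nan | nan |
| 54 | 55 | 2 | 45 | 2 | 1.0 | 17.0 | 2417 | 3 | 5.0 | 4 | 3 | 3 | nan | nan | nan | 1.0 |
| 55 | 56 | 2 | 31 | 2 | 1.0 | 6.0 | 1949 | 4 | 4.0 | 4 | 3 | 3 | 1.0 | nan | nan | 1.0 |
| 56 | 57 | 1 | 61 | 2 | 2.0 | 36.0 | 3119 | 2 | nan | 2 | 1 | 5 | 1.0 | nan | nan | 1.0 |
| 57 | 58 | 1 | 38 | 2 | 2.0 | nan | 2574 | 2 | 3.0 | 1 | 1 | 2 | 1.0 | nan | nan | 1.0 |
| 58 | 59 | 1 | 20 | 1 | 2.0 | 1.0 | 2261 | 3 | 4.0 | 3 | 2 | 3 | nan | nan | nan | nan |
| 59 | 60 | 1 | 31 | 1 | 1.0 | 10.0 | 2144 | 4 | 4.0 | 3 | 1 | 3 | 1.0 | nan | nan | nan |
| 60 | 61 | 1 | 44 | 1 | 1.0 | 19.0 | 2183 | 2 | 2.0 | 1 | 1 | 2 | 1.0 | nan | nan | nan |
| 61 | 62 | 1 | 40 | 2 | 1.0 | 0.0 | 1872 | 2 | 3.0 | 1 | 2 | 3 | 1.0 | nan | nan | nan |
| 62 | 63 | 2 | 51 | 2 | 2.0 | 10.0 | 1872 | 4 | 3.0 | 2 | 2 | 3 | 1.0 | nan | nan | nan |
| 63 | 64 | 2 | 44 | 1 | 2.0 | 1.0 | 1715 | 4 | 4.0 | 3 | 2 | 3 | 1.0 | nan | nan | 1.0 |
| 64 | 65 | 2 | 35 | 2 | 2.0 | 17.0 | 2691 | 4 | 4.0 | 5 | 2 | 4 | 1.0 | nan | nan | 1.0 |
| 65 | 66 | 2 | 37 | 2 | 1.0 | 16.0 | 2027 | 5 | 5.0 | 5 | 4 | 5 | 1.0 | 1.0 | nan | nan |
| 66 | 67 | 1 | 37 | 2 | 4.0 | 8.0 | 5069 | 3 | 4.0 | 3 | 2 | 2 | 1.0 | 1.0 | nan | 1.0 |
| 67 | 68 | 1 | 33 | 2 | 3.0 | 7.0 | 2417 | 2 | 4.0 | 3 | 1 | 4 | nan | nan | nan | nan |
| 68 | 69 | 1 | 28 | 2 | 2.0 | 1.0 | 3510 | 4 | 5.0 | 3 | 1 | 4 | nan | nan | nan | nan |
| 69 | 70 | 1 | 52 | 2 | 2.0 | 22.0 | 3119 | 3 | 4.0 | 3 | 2 | 2 | 1.0 | 1.0 | nan | 1.0 |
| 70 | 71 | 1 | 34 | 2 | 2.0 | 1.0 | 2495 | 3 | 5.0 | 5 | 3 | 4 | nan | 1.0 | nan | nan |
| 71 | 72 | 1 | 46 | 2 | 2.0 | 23.0 | 3470 | 3 | 5.0 | 5 | 3 | 4 | nan | 1.0 | nan | nan |
| 72 | 73 | 2 | 40 | 2 | 3.0 | 2.0 | 2027 | 5 | 3.0 | 4 | 3 | 4 | 1.0 | nan | nan | nan |
| 73 | 74 | 1 | 45 | 2 | 1.0 | 20.0 | 2846 | 3 | 5.0 | 1 | 1 | 3 | 1.0 | nan | nan | nan |
| 74 | 75 | 1 | 40 | 1 | 1.0 | 1.0 | 1949 | 1 | 5.0 | 1 | 1 | 1 | 1.0 | nan | nan | nan |
| 75 | 76 | 1 | 37 | 1 | 2.0 | 15.0 | 1598 | 1 | 5.0 | 1 | 1 | 1 | 1.0 | nan | nan | nan |
| 76 | 77 | 1 | 39 | 1 | 2.0 | 22.0 | 2183 | 4 | 5.0 | 3 | 1 | 2 | nan | nan | nan | nan |
| 77 | 78 | 1 | 22 | 1 | 3.0 | 0.0 | 1598 | 4 | 4.0 | 4 | 3 | 4 | nan | 1.0 | 1.0 | nan |
| 78 | 79 | 1 | 33 | 1 | 1.0 | 2.0 | 1638 | 1 | 3.0 | 2 | 1 | 2 | 1.0 | nan | nan | nan |
| 79 | 80 | 1 | 27 | 1 | 2.0 | 7.0 | 2612 | 3 | 4.0 | 3 | 3 | 3 | 1.0 | nan | 1.0 | nan |
| 80 | 81 | 1 | 35 | 2 | 2.0 | 16.0 | 2808 | 3 | 4.0 | 3 | 3 | 3 | nan | nan | nan | nan |
| 81 | 82 | 2 | 35 | 2 | 2.0 | 15.0 | 2183 | 3 | 4.0 | 4 | 3 | 4 | 1.0 | nan | nan | nan |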
\n" ], "text/plain": [ "" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.style.highlight_null()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the last four columns of the example data, dropping the rows with missing values is not an option, because no data would remain afterwards.\n", "\n", "Below I drop the rows that have a missing value in 'koulutus', 'työtov' or 'palveluv', and fill the missing values of the last four variables with zeros. This way I lose 3 rows of the data." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "79" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1 = df.dropna(subset=['koulutus', 'työtov', 'palveluv'])\n", "\n", "df1 = df1.fillna({'työterv':0, 'lomaosa':0, 'kuntosa':0, 'hieroja':0})\n", "\n", "# Check how many rows remain\n", "df1.shape[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below I fill in all missing values, so the data keeps its original number of rows. I use a different replacement method for different variables (median, mean, 0)." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "82" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2 = df.fillna({'koulutus': df['koulutus'].median(), \n", " 'työtov': df['työtov'].mean(), \n", " 'palveluv': df['palveluv'].mean(), \n", " 'työterv':0, 'lomaosa':0, 'kuntosa':0, 'hieroja':0})\n", "\n", "# Check how many rows remain\n", "df2.shape[0]" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Converting categorical variables to dummy variables\n", "\n", "The get_dummies function of the pandas library converts categorical variables into dummy variables.\n", "\n", "https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html\n", "\n", "For example, the sukup variable takes the values 1 (male) and 2 (female). get_dummies turns sukup into the variables sukup_1 and sukup_2. For a man, sukup_1 is 1 and sukup_2 is 0; for a woman, sukup_2 is 1 and sukup_1 is 0.\n", "\n", "Below I convert the sukup, perhe and koulutus variables into dummy variables:" ] },
{ "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<class 'pandas.core.frame.DataFrame'>\n", "RangeIndex: 82 entries, 0 to 81\n", "Data columns (total 21 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 nro 82 non-null int64 \n", " 1 ikä 82 non-null int64 \n", " 2 palveluv 82 non-null float64\n", " 3 palkka 82 non-null int64 \n", " 4 johto 82 non-null int64 \n", " 5 työtov 82 non-null float64\n", " 6 työymp 82 non-null int64 \n", " 7 palkkat 82 non-null int64 \n", " 8 työteht 82 non-null int64 \n", " 9 työterv 82 non-null float64\n", " 10 lomaosa 82 non-null float64\n", " 11 kuntosa 82 non-null float64\n", " 12 hieroja 82 non-null float64\n", " 13 sukup_1 82 non-null uint8 \n", " 14 sukup_2 82 non-null uint8 \n", " 15 perhe_1 82 non-null uint8 \n", " 16 perhe_2 82 non-null uint8 \n", " 17 koulutus_1.0 82 non-null uint8 \n", " 18 koulutus_2.0 82 non-null uint8 \n", " 19 koulutus_3.0 82 non-null uint8 \n", " 20 koulutus_4.0 82 non-null uint8 \n", "dtypes: float64(6), int64(7), uint8(8)\n", "memory usage: 9.1 KB\n" ] } ], "source": [ "df_dummies = pd.get_dummies(data=df2, columns=['sukup', 'perhe', 'koulutus'])\n", "\n", "# Look at the variables in the data\n", "df_dummies.info()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Standardization\n", "\n", "If the variables differ in order of magnitude, scaling them can in some cases lead to better models. Standardization is a widely used scaling method. In standardization, the values of a variable are converted to z-scores, so that the variable has mean 0 and standard deviation 1. A z-score tells how many standard deviations a value lies from the mean of all values.\n", "\n", "I can do the standardization with StandardScaler, imported from sklearn.preprocessing.\n", "\n", "https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html" ] },
{ "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " nro sukup ikä perhe koulutus palveluv palkka johto \\\n", "0 1 1 0.005022 1 1.0 1.136570 1.212007 3 \n", "1 2 1 -0.921468 2 2.0 -0.251607 0.472806 1 \n", "2 3 1 -0.818525 1 1.0 -0.598651 -0.681010 3 \n", "3 4 1 -0.200865 2 1.0 0.211119 -0.497394 3 \n", "4 5 1 -1.436184 1 2.0 -0.945695 -0.451194 2 \n", "5 6 2 -0.715581 2 2.0 0.211119 -0.774594 4 \n", "6 7 1 1.137398 1 2.0 0.442481 -0.589794 3 \n", "7 8 1 1.755058 1 1.0 -1.408421 -0.589794 3 \n", "8 9 1 0.210908 2 1.0 1.252251 0.241806 2 \n", "9 10 1 -0.509695 1 1.0 0.442481 -0.542410 3 \n", "10 11 1 0.107965 2 1.0 1.136570 0.103206 3 \n", "11 12 1 0.210908 2 3.0 1.020888 0.334206 3 \n", "12 13 1 -0.303808 2 3.0 0.326800 0.289191 3 \n", "13 14 1 2.063887 2 3.0 1.020888 1.212007 4 \n", "14 15 1 1.549171 2 3.0 -0.020244 0.982191 4 \n", "15 16 2 0.416795 2 3.0 1.252251 0.150591 3 \n", "16 17 1 -1.230298 1 4.0 -1.177058 3.152407 5 \n", "17 18 2 0.005022 2 3.0 0.558163 0.195606 4 \n", "18 19 1 0.416795 1 3.0 0.905207 0.427791 2 \n", "19 20 2 0.210908 2 2.0 0.095437 -0.126609 3 \n", "20 21 2 0.210908 2 3.0 0.905207 0.150591 2 \n", "21 22 1 0.931511 2 3.0 0.558163 2.736607 2 \n", "22 23 1 0.622681 2 1.0 1.714977 1.120791 4 \n", "23 24 1 -0.200865 1 3.0 -0.598651 2.229592 3 \n", "24 25 1 0.519738 2 3.0 -1.292740 0.427791 4 \n", "25 26 1 -1.230298 1 2.0 -1.061377 -1.235410 2 \n", "26 27 1 -1.230298 1 2.0 -1.177058 -0.681010 2 \n", "27 28 2 1.858001 1 1.0 0.326800 -0.403810 3 \n", "28 29 1 0.931511 2 2.0 1.252251 0.289191 2 \n", "29 30 1 -1.745014 1 2.0 0.000000 -0.728394 4 \n", "30 31 1 -1.745014 1 3.0 -1.292740 -0.265210 4 \n", "31 32 1 0.725625 2 1.0 1.367932 0.427791 4 \n", "32 33 1 2.166831 2 3.0 0.326800 4.399808 4 \n", "33 34 1 -0.097922 2 1.0 0.211119 -0.451194 1 \n", "34 35 1 -1.024411 2 2.0 -0.830014 -0.681010 3 \n", "35 36 1 -0.715581 2 3.0 -1.408421 -1.190395 2 \n", "36 37 2 1.858001 2 2.0 0.558163 0.195606 5 \n", "37 38 1 1.240341 2 1.0 -1.292740 -0.635994 5 \n", "38 39 1 -0.818525 1 2.0 
-0.251607 -0.312594 3 \n", "39 40 1 -0.612638 1 1.0 -1.061377 -0.542410 1 \n", "40 41 1 -0.509695 2 3.0 -0.367288 0.334206 3 \n", "41 42 1 -0.921468 1 2.0 -0.714333 -0.035394 3 \n", "42 43 2 0.210908 2 3.0 -0.020244 -0.497394 4 \n", "43 44 1 -0.818525 1 2.0 -0.598651 -0.403810 2 \n", "44 45 1 1.755058 2 1.0 2.640428 0.103206 4 \n", "45 46 2 1.343284 2 1.0 1.830658 -0.681010 3 \n", "46 47 2 -1.642071 1 3.0 1.020888 -0.819610 3 \n", "47 48 1 -0.406752 2 1.0 0.673844 -0.451194 4 \n", "48 49 1 -1.127355 2 2.0 -0.598651 0.195606 4 \n", "49 50 1 -0.921468 1 3.0 -0.598651 -0.265210 3 \n", "50 51 2 0.107965 2 2.0 -0.251607 -0.542410 4 \n", "51 52 1 0.313851 2 1.0 0.673844 -0.358794 5 \n", "52 53 1 0.622681 2 1.0 -1.061377 -0.681010 1 \n", "53 54 1 -1.333241 1 2.0 -1.292740 -1.190395 2 \n", "54 55 2 0.725625 2 1.0 0.558163 -0.173994 3 \n", "55 56 2 -0.715581 2 1.0 -0.714333 -0.728394 4 \n", "56 57 1 2.372717 2 2.0 2.756109 0.657606 2 \n", "57 58 1 0.005022 2 2.0 0.000000 0.011991 2 \n", "58 59 1 -1.847958 1 2.0 -1.292740 -0.358794 3 \n", "59 60 1 -0.715581 1 1.0 -0.251607 -0.497394 4 \n", "60 61 1 0.622681 1 1.0 0.789525 -0.451194 2 \n", "61 62 1 0.210908 2 1.0 -1.408421 -0.819610 2 \n", "62 63 2 1.343284 2 2.0 -0.251607 -0.819610 4 \n", "63 64 2 0.622681 1 2.0 -1.292740 -1.005594 4 \n", "64 65 2 -0.303808 2 2.0 0.558163 0.150591 4 \n", "65 66 2 -0.097922 2 1.0 0.442481 -0.635994 5 \n", "66 67 1 -0.097922 2 4.0 -0.482970 2.967607 3 \n", "67 68 1 -0.509695 2 3.0 -0.598651 -0.173994 2 \n", "68 69 1 -1.024411 2 2.0 -1.292740 1.120791 4 \n", "69 70 1 1.446228 2 2.0 1.136570 0.657606 3 \n", "70 71 1 -0.406752 2 2.0 -1.292740 -0.081594 3 \n", "71 72 1 0.828568 2 2.0 1.252251 1.073407 3 \n", "72 73 2 0.210908 2 3.0 -1.177058 -0.635994 5 \n", "73 74 1 0.725625 2 1.0 0.905207 0.334206 3 \n", "74 75 1 0.210908 1 1.0 -1.292740 -0.728394 1 \n", "75 76 1 -0.097922 1 2.0 0.326800 -1.144195 1 \n", "76 77 1 0.107965 1 2.0 1.136570 -0.451194 4 \n", "77 78 1 -1.642071 1 3.0 -1.408421 
-1.144195 4 \n", "78 79 1 -0.509695 1 1.0 -1.177058 -1.096810 1 \n", "79 80 1 -1.127355 1 2.0 -0.598651 0.057006 3 \n", "80 81 1 -0.303808 2 2.0 0.442481 0.289191 3 \n", "81 82 2 -0.303808 2 3.0 0.326800 -0.451194 3 \n", "\n", " työtov työymp palkkat työteht työterv lomaosa kuntosa hieroja \n", "0 3.000000 3 3 3 0.0 0.0 0.0 0.0 \n", "1 5.000000 2 1 3 0.0 0.0 0.0 0.0 \n", "2 4.000000 1 1 3 1.0 0.0 0.0 0.0 \n", "3 3.000000 3 3 3 1.0 0.0 0.0 0.0 \n", "4 3.000000 2 1 2 1.0 0.0 0.0 0.0 \n", "5 4.000000 5 2 4 1.0 1.0 0.0 0.0 \n", "6 5.000000 4 2 2 0.0 0.0 1.0 0.0 \n", "7 5.000000 3 1 3 1.0 0.0 0.0 0.0 \n", "8 4.000000 4 2 4 0.0 1.0 0.0 0.0 \n", "9 2.000000 1 1 1 1.0 0.0 0.0 0.0 \n", "10 5.000000 3 1 3 0.0 0.0 0.0 0.0 \n", "11 5.000000 3 1 2 0.0 1.0 0.0 1.0 \n", "12 5.000000 3 1 4 0.0 1.0 0.0 0.0 \n", "13 5.000000 4 1 3 0.0 0.0 0.0 0.0 \n", "14 4.000000 4 4 4 0.0 1.0 0.0 0.0 \n", "15 3.000000 3 3 3 1.0 0.0 0.0 1.0 \n", "16 5.000000 5 4 5 0.0 0.0 1.0 0.0 \n", "17 3.000000 4 2 1 0.0 0.0 0.0 0.0 \n", "18 3.000000 4 1 4 1.0 0.0 0.0 1.0 \n", "19 3.000000 4 3 2 1.0 0.0 0.0 1.0 \n", "20 4.000000 5 3 4 0.0 0.0 0.0 0.0 \n", "21 4.000000 3 2 4 0.0 1.0 0.0 0.0 \n", "22 4.000000 4 4 4 0.0 1.0 0.0 0.0 \n", "23 4.000000 3 4 5 0.0 0.0 0.0 0.0 \n", "24 4.000000 4 4 4 0.0 1.0 0.0 0.0 \n", "25 4.000000 2 1 3 1.0 0.0 1.0 1.0 \n", "26 4.000000 2 2 3 1.0 0.0 0.0 1.0 \n", "27 4.000000 3 2 4 1.0 0.0 0.0 1.0 \n", "28 4.000000 3 1 4 1.0 0.0 0.0 0.0 \n", "29 4.000000 3 3 2 1.0 0.0 0.0 0.0 \n", "30 5.000000 3 4 2 1.0 0.0 1.0 1.0 \n", "31 4.000000 4 3 4 0.0 1.0 0.0 0.0 \n", "32 4.000000 5 4 4 0.0 1.0 0.0 0.0 \n", "33 5.000000 1 1 2 1.0 0.0 0.0 1.0 \n", "34 4.000000 3 3 3 1.0 0.0 1.0 1.0 \n", "35 4.000000 3 1 3 1.0 0.0 0.0 0.0 \n", "36 5.000000 5 5 5 0.0 0.0 0.0 1.0 \n", "37 5.000000 4 1 4 1.0 1.0 0.0 0.0 \n", "38 5.000000 3 3 4 0.0 0.0 0.0 0.0 \n", "39 5.000000 4 1 3 1.0 0.0 0.0 0.0 \n", "40 3.000000 4 2 3 1.0 0.0 0.0 0.0 \n", "41 4.000000 3 1 2 1.0 0.0 0.0 0.0 \n", "42 4.000000 4 4 4 0.0 1.0 0.0 
0.0 \n", "43 3.000000 4 1 3 1.0 0.0 0.0 1.0 \n", "44 5.000000 4 2 4 1.0 0.0 0.0 1.0 \n", "45 3.000000 2 2 3 1.0 0.0 0.0 1.0 \n", "46 3.000000 4 1 3 0.0 0.0 1.0 0.0 \n", "47 5.000000 4 1 3 0.0 0.0 0.0 0.0 \n", "48 4.000000 3 3 5 0.0 0.0 1.0 0.0 \n", "49 4.000000 3 2 3 1.0 0.0 0.0 0.0 \n", "50 5.000000 5 4 5 0.0 1.0 0.0 0.0 \n", "51 5.000000 5 2 5 0.0 1.0 0.0 0.0 \n", "52 2.000000 2 1 1 1.0 0.0 0.0 0.0 \n", "53 4.000000 3 1 2 1.0 0.0 0.0 0.0 \n", "54 5.000000 4 3 3 0.0 0.0 0.0 1.0 \n", "55 4.000000 4 3 3 1.0 0.0 0.0 1.0 \n", "56 4.061728 2 1 5 1.0 0.0 0.0 1.0 \n", "57 3.000000 1 1 2 1.0 0.0 0.0 1.0 \n", "58 4.000000 3 2 3 0.0 0.0 0.0 0.0 \n", "59 4.000000 3 1 3 1.0 0.0 0.0 0.0 \n", "60 2.000000 1 1 2 1.0 0.0 0.0 0.0 \n", "61 3.000000 1 2 3 1.0 0.0 0.0 0.0 \n", "62 3.000000 2 2 3 1.0 0.0 0.0 0.0 \n", "63 4.000000 3 2 3 1.0 0.0 0.0 1.0 \n", "64 4.000000 5 2 4 1.0 0.0 0.0 1.0 \n", "65 5.000000 5 4 5 1.0 1.0 0.0 0.0 \n", "66 4.000000 3 2 2 1.0 1.0 0.0 1.0 \n", "67 4.000000 3 1 4 0.0 0.0 0.0 0.0 \n", "68 5.000000 3 1 4 0.0 0.0 0.0 0.0 \n", "69 4.000000 3 2 2 1.0 1.0 0.0 1.0 \n", "70 5.000000 5 3 4 0.0 1.0 0.0 0.0 \n", "71 5.000000 5 3 4 0.0 1.0 0.0 0.0 \n", "72 3.000000 4 3 4 1.0 0.0 0.0 0.0 \n", "73 5.000000 1 1 3 1.0 0.0 0.0 0.0 \n", "74 5.000000 1 1 1 1.0 0.0 0.0 0.0 \n", "75 5.000000 1 1 1 1.0 0.0 0.0 0.0 \n", "76 5.000000 3 1 2 0.0 0.0 0.0 0.0 \n", "77 4.000000 4 3 4 0.0 1.0 1.0 0.0 \n", "78 3.000000 2 1 2 1.0 0.0 0.0 0.0 \n", "79 4.000000 3 3 3 1.0 0.0 1.0 0.0 \n", "80 4.000000 3 3 3 0.0 0.0 0.0 0.0 \n", "81 4.000000 4 3 4 1.0 0.0 0.0 0.0 " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.preprocessing import StandardScaler\n", "\n", "scaler = StandardScaler()\n", "\n", "# Standardize age (ikä), salary (palkka) and years of service (palveluv).\n", "# Assigning the NumPy array directly avoids index alignment problems.\n", "cols = ['ikä', 'palkka', 'palveluv']\n", "df2[cols] = scaler.fit_transform(df2[cols])\n", "df2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 
Outliers\n", "\n", "Outliers can be defined as values that would be improbable under a normal distribution. Removing the rows that contain such values can improve the model in some cases. Whether removal makes sense depends on many factors and has to be judged case by case.\n", "\n", "Removal can be based on z-scores (standardized values). A z-score tells how many standard deviations a value lies from the mean of all values. A common threshold is 3: if a value is more than three standard deviations from the mean, the row containing it is removed.\n", "\n", "More on removing outliers:\n", "\n", "https://stackoverflow.com/questions/23199796/detect-and-exclude-outliers-in-pandas-data-frame\n", "\n", "Below I compute the normal-distribution probability of a z-score whose absolute value exceeds three (3):" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Probability that a value lies more than three standard deviations from its mean: 0.0026997960632601866\n" ] } ], "source": [ "from scipy import stats\n", "\n", "# Two-sided tail probability P(|Z| > 3) for a standard normal variable\n", "print('Probability that a value lies more than three standard deviations from its mean:',\n", " 2*stats.norm.cdf(-3))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Above I standardized the salary (palkka) in df2. 
Let's look at the five largest and the five smallest z-scores:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "32 4.399808\n", "16 3.152407\n", "66 2.967607\n", "21 2.736607\n", "23 2.229592\n", "Name: palkka, dtype: float64" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2['palkka'].nlargest(n=5)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "25 -1.235410\n", "35 -1.190395\n", "53 -1.190395\n", "75 -1.144195\n", "77 -1.144195\n", "Name: palkka, dtype: float64" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2['palkka'].nsmallest(n=5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "According to the above, the two highest salaries would be removed, because their z-scores are greater than 3.\n", "\n", "The removal takes a single line of code. The following code works even if the z-scores have not been computed into the data beforehand:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "80" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Keep only the rows where every column is within three standard deviations\n", "df3 = df2[(np.abs(stats.zscore(df2))<3).all(axis=1)]\n", "\n", "# Count the rows that remain\n", "df3.shape[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this case, removing the outliers dropped only two rows.\n", "\n", "If you are used to lambda functions, the same can be done as follows (here I compute the z-scores without stats.zscore(); ddof=0 makes the standard deviation match the one stats.zscore() uses by default):" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "80" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df4 = df2[df2.apply(lambda x: np.abs(x-x.mean())/x.std(ddof=0)<3).all(axis = 1)]\n", "\n", "# Katson 
kuinka monta riviä jäi jäljelle (count the rows that remain)\n", "df4.shape[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Log transformation\n", "\n", "Distributions that deviate from the normal distribution can be brought closer to it with variable transformations. A widely used transformation for correcting a skewed distribution is taking logarithms.\n", "\n", "Below I take the natural logarithm of the palkka (salary) values. A histogram gives a quick check of whether the distribution moved closer to normal." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[,\n", " ]], dtype=object)" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "[Figure: histograms of palkka and palkka_log]" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "df['palkka_log'] = np.log(df['palkka'])\n", "\n", "# Histograms of the original and the log-transformed variable\n", "df[['palkka', 'palkka_log']].hist()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The logarithm of zero is undefined, so the log transformation cannot be applied directly to the years of service (palveluv) in the example data. If I add one year to the years of service (so that the variable tells which year of service the person is currently in), the log transformation works:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[,\n", " ]], dtype=object)" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "[Figure: histograms of palveluv and palveluv_log]" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# log1p(x) computes log(1 + x)\n", "df['palveluv_log'] = np.log1p(df['palveluv'])\n", "\n", "df[['palveluv', 'palveluv_log']].hist()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Variable transformations serve many purposes and are a broad, multifaceted topic. Besides the logarithm, commonly used transformations include the square root, squaring and the reciprocal." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Balancing the data\n", "\n", "With classification models it is worth balancing the training data if one of the classes is clearly underrepresented. Balancing can be done in many ways. See https://towardsdatascience.com/5-techniques-to-work-with-imbalanced-data-in-machine-learning-80836d45d30c\n", "\n", "An easy way to balance data is the **imbalanced-learn** library: https://imbalanced-learn.org/stable/.\n", "The library can be installed into Anaconda from the command line (Anaconda Prompt) with:\n", "\n", "`conda install -c conda-forge imbalanced-learn`" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.0 73\n", "1.0 9\n", "Name: kuntosa, dtype: int64" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# If the gym-usage variable (kuntosa) in df2 were the target variable,\n", "# the users (value 1.0) would be underrepresented\n", "df2['kuntosa'].value_counts()" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "kuntosa\n", "0.0 73\n", "1.0 73\n", "dtype: int64" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Import RandomOverSampler\n", "from imblearn.over_sampling import RandomOverSampler\n", "\n", "# Explanatory variables\n", "X = df2.drop('kuntosa', axis=1)\n", "\n", "# Target variable\n", "y = df2['kuntosa']\n", "\n", "# 
Balance the classes by random oversampling\n", "ros = RandomOverSampler(random_state=2)\n", "X, y = ros.fit_resample(X, y)\n", "\n", "# Check the distribution of the target variable\n", "pd.DataFrame(y).value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Further information\n", "\n", "Data analytics with Python (Data-analytiikka Pythonilla): https://tilastoapu.wordpress.com/python/" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" } }, "nbformat": 4, "nbformat_minor": 4 }