{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "<h1 style=\"color:blue\">Practical 4. </h1>\n", "<h3 style=\"color:blue\">Simple data and text analytics in Python</h3>" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this practical we look at ways of processing tabular data in Python." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "<h3 style=\"color:green\">csv - Comma Separated Values</h3>" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "CSV is a very common format for storing tabular data. You can try opening a data file in MS Excel or another spreadsheet program - on import, the program typically asks which character to use as the **delimiter**, that is, which character separates the columns in the file. As the name suggests, the standard delimiter is the comma. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "CSV files store data as a plain-text table in which a row is generally delimited by a line break and a column by a predetermined delimiter character. If a text field contains a line break or the column delimiter, the field value is enclosed in quotation marks (\"\"). Traditionally the comma is used to separate columns, the system's line-break sequence to separate rows, and the double quote \" as the quote character. However, because CSV is not strictly standardized, many variants occur in practice, which is why systems that work with CSV let the user set various format parameters.\n", "\n", "CSV stores structured data (each data point - for example a book, a product, a user - has a fixed set of fields), so historically it has gone hand in hand with databases and, in the field of this course, with machine-learning methods. CSV stands out for its compactness and platform independence (apart from the lack of standardization). For working with CSV, Python provides the [_csv_ module](https://docs.python.org/3/library/csv.html)." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The _csv_ module can both read and write files in CSV format. Data rows can be handled either as flat iterable data structures (e.g. _list_ or _tuple_) or as dictionaries." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['nimi', 'sugu', 'vanus']\n", "\n", "['Teele', 'naine', '25']\n", "['Ivan', 'mees', '87']\n", "['Arfi', 'mees', '12']\n", "['Leida', 'naine', '58']\n" ] } ], "source": [ "import csv\n", "\n", "# newline='' lets the csv module handle line endings itself\n", "with open('ilusad_inimesed.csv', newline='') as csv_file:\n", "    rows = []\n", "\n", "    reader = csv.reader(csv_file)\n", "    header = next(reader)  # the first row is the header\n", "    rows.append(header)\n", "\n", "    print(header)\n", "    print()\n", "\n", "    for row in reader:  # the remaining rows\n", "        rows.append(row)\n", "        print(row)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "In the code block above we saw that _csv.reader()_ reads CSV rows one at a time, converting each row into a Python list of strings. If we know that a field has a type other than string (e.g. an integer, a floating-point number or a date), we have to convert it to that type ourselves.\n", "\n", "The block also introduced fetching the next element from an __iterator__. Every object that can be used in a _for_ loop is iterable and provides an iterator, which lets us step through the elements of the data structure one by one. A Python iterator implements the method `__next__`, which is called through the built-in function _next()_; unlike in some other languages, there is no *has_next* method. When no elements are left, the iterator raises _StopIteration_, which is how a _for_ loop knows when to stop.\n", "\n", "In this example **next()** returns the next row. The first call gives us the first (header) row. After that we hand the _reader_ to a _for_ loop, which keeps requesting the next row until the rows run out." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# newline='' prevents extra blank lines between rows on some platforms\n", "with open('ilusad_inimesed.csv.copy', 'w', newline='') as csv_file:\n", "    writer = csv.writer(csv_file)\n", "    for row in rows:\n", "        writer.writerow(row)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "We saw that _csv.writer()_ writes _csv_ rows as long as they are available in Python as iterables. In this case the rows were lists of strings." ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "Besides _list_s we can also work with dictionaries (_dict_s)." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'nimi': 'Teele', 'sugu': 'naine', 'vanus': '25'}\n", "{'nimi': 'Ivan', 'sugu': 'mees', 'vanus': '87'}\n", "{'nimi': 'Arfi', 'sugu': 'mees', 'vanus': '12'}\n", "{'nimi': 'Leida', 'sugu': 'naine', 'vanus': '58'}\n" ] } ], "source": [ "with open('ilusad_inimesed.csv', newline='') as csv_file:\n", "    rows = []\n", "\n", "    # each row becomes a dict keyed by the header fields\n", "    reader = csv.DictReader(csv_file)\n", "    for row in reader:\n", "        rows.append(row)\n", "        print(row)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With lists we had to deal with the first row, the header, ourselves; with dictionaries it is read in by default and its values are used as the column names for the remaining rows.\n", "\n", "**Caution:** if the CSV has no header row, the first data row is silently used as the column names and is therefore missing from the data. In that case pass the column names explicitly with the _fieldnames_ argument of _csv.DictReader()_." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Fieldnames: ['nimi', 'sugu', 'vanus']\n" ] } ], "source": [ "with open('ilusad_inimesed.csv.copy', 'w', newline='') as csv_file:\n", "    # the keys of the first row dict are the column names\n", "    fieldnames = list(rows[0].keys())\n", "    print(\"Fieldnames:\", fieldnames)\n", "\n", "    writer = csv.DictWriter(csv_file, fieldnames=fieldnames)\n", "    writer.writeheader()\n", "    for row in rows:\n", "        writer.writerow(row)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The _csv_ module is suited first and foremost to reading and writing the CSV format. For more complex or larger numerical table processing, the [pandas](http://pandas.pydata.org/) library is the more useful tool." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 1. Censorship (1.5p)\n", "The file comments.csv contains 200 reader comments from an Estonian media outlet. The column \"Staatus\" records whether a moderator has banned the comment (status 1) or not (status 2). The columns \"Pos\" and \"Neg\" give the number of positive and negative ratings, respectively, that the comment has received from readers.\n", "\n", "Read the file in using the csv module and find:\n", "\n", "1) Which 10 comments have received the most ratings from readers? (0.5p)\n", "\n", "2) Do readers and moderators agree on which comments are acceptable: what proportion of the comments banned by the moderator received more negative than positive votes from readers? What proportion of the allowed comments? (1p)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Pandas" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Pandas](http://pandas.pydata.org/) is a Python library for conveniently processing tabular data - including reading and writing CSV files. 
Let us look at the following example:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "ilusad_inimesed = pd.read_csv(\"ilusad_inimesed.csv\")" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>nimi</th>\n", " <th>sugu</th>\n", " <th>vanus</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>Teele</td>\n", " <td>naine</td>\n", " <td>25</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>Ivan</td>\n", " <td>mees</td>\n", " <td>87</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>Arfi</td>\n", " <td>mees</td>\n", " <td>12</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>Leida</td>\n", " <td>naine</td>\n", " <td>58</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " nimi sugu vanus\n", "0 Teele naine 25\n", "1 Ivan mees 87\n", "2 Arfi mees 12\n", "3 Leida naine 58" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ilusad_inimesed" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see, pandas displays our CSV file as a neatly formatted table. What kind of object is it?" 
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "pandas.core.frame.DataFrame" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(ilusad_inimesed)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What can we do with it? Process the data in a way that keeps it convenient to inspect at the same time." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Data can be processed both by row and by column. Here are a few examples of the many possible operations (for a more thorough guide see [here](https://pandas.pydata.org/pandas-docs/stable/dsintro.html)):" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "# Adding columns: 'pensionär' (pensioner), 'laste arv' (number of children), 'perekonnanimi' (surname)\n", "ilusad_inimesed['pensionär'] = ilusad_inimesed['vanus'] > 65\n", "ilusad_inimesed['laste arv'] = 0\n", "ilusad_inimesed['perekonnanimi'] = ['Kask', 'Smirnov', 'Jalakas', 'Kuusepuu']" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>nimi</th>\n", " <th>sugu</th>\n", " <th>vanus</th>\n", " <th>pensionär</th>\n", " <th>laste arv</th>\n", " <th>perekonnanimi</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>Teele</td>\n", " <td>naine</td>\n", " <td>25</td>\n", " <td>False</td>\n", " <td>0</td>\n", " <td>Kask</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>Ivan</td>\n", " <td>mees</td>\n", " <td>87</td>\n", " <td>True</td>\n", " <td>0</td>\n", " <td>Smirnov</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", 
<td>Arfi</td>\n", " <td>mees</td>\n", " <td>12</td>\n", " <td>False</td>\n", " <td>0</td>\n", " <td>Jalakas</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>Leida</td>\n", " <td>naine</td>\n", " <td>58</td>\n", " <td>False</td>\n", " <td>0</td>\n", " <td>Kuusepuu</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " nimi sugu vanus pensionär laste arv perekonnanimi\n", "0 Teele naine 25 False 0 Kask\n", "1 Ivan mees 87 True 0 Smirnov\n", "2 Arfi mees 12 False 0 Jalakas\n", "3 Leida naine 58 False 0 Kuusepuu" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ilusad_inimesed" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "# Removing a column\n", "del ilusad_inimesed['laste arv']" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>nimi</th>\n", " <th>sugu</th>\n", " <th>vanus</th>\n", " <th>pensionär</th>\n", " <th>perekonnanimi</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>Teele</td>\n", " <td>naine</td>\n", " <td>25</td>\n", " <td>False</td>\n", " <td>Kask</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>Ivan</td>\n", " <td>mees</td>\n", " <td>87</td>\n", " <td>True</td>\n", " <td>Smirnov</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>Arfi</td>\n", " <td>mees</td>\n", " <td>12</td>\n", " <td>False</td>\n", " <td>Jalakas</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>Leida</td>\n", " <td>naine</td>\n", " <td>58</td>\n", " <td>False</td>\n", 
<td>Kuusepuu</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " nimi sugu vanus pensionär perekonnanimi\n", "0 Teele naine 25 False Kask\n", "1 Ivan mees 87 True Smirnov\n", "2 Arfi mees 12 False Jalakas\n", "3 Leida naine 58 False Kuusepuu" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ilusad_inimesed" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "# Transposing (swap rows and columns)\n", "ilusad_inimesed2 = ilusad_inimesed.T" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>0</th>\n", " <th>1</th>\n", " <th>2</th>\n", " <th>3</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>nimi</th>\n", " <td>Teele</td>\n", " <td>Ivan</td>\n", " <td>Arfi</td>\n", " <td>Leida</td>\n", " </tr>\n", " <tr>\n", " <th>sugu</th>\n", " <td>naine</td>\n", " <td>mees</td>\n", " <td>mees</td>\n", " <td>naine</td>\n", " </tr>\n", " <tr>\n", " <th>vanus</th>\n", " <td>25</td>\n", " <td>87</td>\n", " <td>12</td>\n", " <td>58</td>\n", " </tr>\n", " <tr>\n", " <th>pensionär</th>\n", " <td>False</td>\n", " <td>True</td>\n", " <td>False</td>\n", " <td>False</td>\n", " </tr>\n", " <tr>\n", " <th>perekonnanimi</th>\n", " <td>Kask</td>\n", " <td>Smirnov</td>\n", " <td>Jalakas</td>\n", " <td>Kuusepuu</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " 0 1 2 3\n", "nimi Teele Ivan Arfi Leida\n", "sugu naine mees mees naine\n", "vanus 25 87 12 58\n", "pensionär False True False 
False\n", "perekonnanimi Kask Smirnov Jalakas Kuusepuu" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ilusad_inimesed2" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(4, 5)" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Number of rows and columns, as a (rows, columns) tuple\n", "ilusad_inimesed.shape" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(5, 4)" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ilusad_inimesed2.shape" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>nimi</th>\n", " <th>sugu</th>\n", " <th>vanus</th>\n", " <th>pensionär</th>\n", " <th>perekonnanimi</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>Teele</td>\n", " <td>naine</td>\n", " <td>25</td>\n", " <td>False</td>\n", " <td>Kask</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>Ivan</td>\n", " <td>mees</td>\n", " <td>87</td>\n", " <td>True</td>\n", " <td>Smirnov</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>Arfi</td>\n", " <td>mees</td>\n", " <td>12</td>\n", " <td>False</td>\n", " <td>Jalakas</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>Leida</td>\n", " <td>naine</td>\n", " <td>58</td>\n", " <td>False</td>\n", " <td>Kuusepuu</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " nimi sugu vanus pensionär perekonnanimi\n", "0 
Teele naine 25 False Kask\n", "1 Ivan mees 87 True Smirnov\n", "2 Arfi mees 12 False Jalakas\n", "3 Leida naine 58 False Kuusepuu" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ilusad_inimesed" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An important function of tables is to give a good overview of the data. For this, pandas offers various filtering and sorting options, some of which we look at next." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>nimi</th>\n", " <th>sugu</th>\n", " <th>vanus</th>\n", " <th>pensionär</th>\n", " <th>perekonnanimi</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>1</th>\n", " <td>Ivan</td>\n", " <td>mees</td>\n", " <td>87</td>\n", " <td>True</td>\n", " <td>Smirnov</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>Leida</td>\n", " <td>naine</td>\n", " <td>58</td>\n", " <td>False</td>\n", " <td>Kuusepuu</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " nimi sugu vanus pensionär perekonnanimi\n", "1 Ivan mees 87 True Smirnov\n", "3 Leida naine 58 False Kuusepuu" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Select the rows where the value in column 'vanus' (age) is greater than 30\n", "ilusad_inimesed[ilusad_inimesed['vanus'] > 30]" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: 
middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>nimi</th>\n", " <th>sugu</th>\n", " <th>vanus</th>\n", " <th>pensionär</th>\n", " <th>perekonnanimi</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>Teele</td>\n", " <td>naine</td>\n", " <td>25</td>\n", " <td>False</td>\n", " <td>Kask</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>Leida</td>\n", " <td>naine</td>\n", " <td>58</td>\n", " <td>False</td>\n", " <td>Kuusepuu</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " nimi sugu vanus pensionär perekonnanimi\n", "0 Teele naine 25 False Kask\n", "3 Leida naine 58 False Kuusepuu" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Several conditions can be combined; each one needs its own parentheses\n", "ilusad_inimesed[(ilusad_inimesed['vanus'] < 65) & (ilusad_inimesed['sugu'] == 'naine')]" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>nimi</th>\n", " <th>sugu</th>\n", " <th>vanus</th>\n", " <th>pensionär</th>\n", " <th>perekonnanimi</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>1</th>\n", " <td>Ivan</td>\n", " <td>mees</td>\n", " <td>87</td>\n", " <td>True</td>\n", " <td>Smirnov</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>Leida</td>\n", " <td>naine</td>\n", 
<td>58</td>\n", " <td>False</td>\n", " <td>Kuusepuu</td>\n", " </tr>\n", " <tr>\n", " <th>0</th>\n", " <td>Teele</td>\n", " <td>naine</td>\n", " <td>25</td>\n", " <td>False</td>\n", " <td>Kask</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>Arfi</td>\n", " <td>mees</td>\n", " <td>12</td>\n", " <td>False</td>\n", " <td>Jalakas</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " nimi sugu vanus pensionär perekonnanimi\n", "1 Ivan mees 87 True Smirnov\n", "3 Leida naine 58 False Kuusepuu\n", "0 Teele naine 25 False Kask\n", "2 Arfi mees 12 False Jalakas" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Sort by 'vanus' in descending order\n", "ilusad_inimesed.sort_values(['vanus'], ascending = False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With the help of matplotlib, the data in a pandas DataFrame can also be plotted with little effort. More plotting examples can be found, for instance, [here](http://queirozf.com/entries/pandas-dataframe-plot-examples-with-matplotlib-pyplot)." 
] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<matplotlib.axes._subplots.AxesSubplot at 0x219d717e080>" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXQAAACuCAYAAAAmsfauAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4wLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvqOYd8AAAEENJREFUeJzt3X+QVeV9x/H3B1iFqAmIixJWXUy3FYkGcGPFJGrA8UetaA2aVccaS8RYU5RqI80fjtia0Uya6KQaSzWRKDOgaxyYaP0R3dR00rHdVSLBjQMliCtGVhIsNRJBvv3jnEViFvay9+6evc/9vGaYe8+559z73TO7Hx+fe57nUURgZmbVb1jRBZiZWWU40M3MEuFANzNLhAPdzCwRDnQzs0Q40M3MEuFANzNLhAPdzCwRDnQzs0SMGMwPO+SQQ6KxsXEwP9LMrOp1dHS8GRH1fR03qIHe2NhIe3v7YH6kmVnVk/RKKce5y8XMLBEOdDOzRDjQzcwSMah96DbE3fSRoisozU1vFV2BDYDt27fT1dXFtm3bii6lMCNHjqShoYG6urp+ne9AN7Mhoauri4MOOojGxkYkFV3OoIsINm/eTFdXFxMnTuzXe7jLxcyGhG3btjF27NiaDHMASYwdO7as/0NxoJvZkFGrYd6j3J/fgW5mloiS+tAlzQe+CASwCrgcGA8sBQ4GngcujYh3B6hOM6sxjQserej7rb/17Iq+31DUZwtd0gRgHtAcER8HhgMtwG3AtyKiCfgNMGcgCzUzs70rtctlBDBK0gjgQ8DrwAygNX99MXBe5cszMxscN9xwA3fdddeu7ZtuuomFCxcyc+ZMpk2bxrHHHsvy5csBWL9+PZMmTeKKK65g8uTJnH766bzzzjsAnHrqqbumOHnzzTfpmb9q9erVnHDCCUyZMoXjjjuONWvWVPxn6DPQI+I14BvABrIgfwvoALZExI78sC5gQm/nS5orqV1Se3d3d2WqNjOrsJaWFpYtW7Zr+8EHH+Tyyy/nkUce4fnnn6etrY3rrruOiABgzZo1XH311axevZrRo0fz8MMP7/X97777bq655hpWrlxJe3s7DQ0NFf8Z+uxDlzQGOBeYCGwBHgLO6uXQ6O38iFgELAJobm7u9Rgzs6JNnTqVTZs2sXHjRrq7uxkzZgzjx49n/vz5PPvsswwbNozXXnuNN954A4CJEycyZcoUAI4//njWr1+/1/efPn06t9xyC11dXZx//vk0NTVV/GcopcvlNOCXEdEdEduBHwAnAaPzLhiABmBjxaszMxtEs2fPprW1lWXLltHS0sKSJUvo7u6mo6ODlStXcuihh+66T3z//fffdd7w4cPZsSPrsBgxYgQ7d+4E+L17yi+++GJWrFjBqFGjOOOMM3jmmWcqXn8pgb4BOFHSh5TdJDkTeAloA2bnx1wGLK94dWZmg6ilpYWlS5fS2trK7Nmzeeuttxg3bhx1dXW0tbXxyit9z2Lb2NhIR0cHAK2trbv2r1u3jqOOOop58+Yxa9YsXnzxxYrX32eXS0Q8J6mV7NbEHcALZF0ojwJLJf1jvu/eildnZjWriNsM
J0+ezNatW5kwYQLjx4/nkksu4ZxzzqG5uZkpU6Zw9NFH9/ke119/PRdeeCH3338/M2bM2LV/2bJlPPDAA9TV1XHYYYdx4403Vrx+9XTwD4bm5ubwAhdDmCfnsgJ1dnYyadKkossoXG/XQVJHRDT3da5HipqZJcKBbmaWCAe6mQ0Zg9kFPBSV+/M70M1sSBg5ciSbN2+u2VDvmQ995MiR/X4PL3BhZkNCQ0MDXV1d1PKI8p4Vi/rLgW5mQ0JdXV2/V+qxjLtczMwS4UA3M0uEA93MLBEOdDOzRDjQzcwS4UA3M0uEA93MLBEOdDOzRDjQzcwSUVKgSxotqVXSLyR1Spou6WBJT0lakz+OGehizcxsz0ptod8BPB4RRwOfADqBBcDTEdEEPJ1vm5lZQfoMdEkfBk4mX2IuIt6NiC3AucDi/LDFwHkDVaSZmfWtlBb6UUA38D1JL0i6R9IBwKER8TpA/jiut5MlzZXULqm9lmdRMzMbaKUE+ghgGvCdiJgKvM0+dK9ExKKIaI6I5vr6+n6WaWZmfSkl0LuAroh4Lt9uJQv4NySNB8gfNw1MiWZmVoo+50OPiF9JelXSn0TEy8BM4KX832XArfnj8gGt1MxqWufRk4ouoSSTftFZ2GeXusDF3wBLJO0HrAMuJ2vdPyhpDrABuGBgSjQzs1KUFOgRsRJo7uWlmZUtx8zM+ssjRc3MEuFANzNLhAPdzCwRDnQzs0Q40M3MEuFANzNLhAPdzCwRDnQzs0Q40M3MEuFANzNLhAPdzCwRDnQzs0Q40M3MEuFANzNLhAPdzCwRJQe6pOH5ItE/zLcnSnpO0hpJy/LFL8zMrCD70kK/Bth9baXbgG9FRBPwG2BOJQszM7N9U1KgS2oAzgbuybcFzCBbMBpgMXDeQBRoZmalKbWFfjvwFWBnvj0W2BIRO/LtLmBCbydKmiupXVJ7d3d3WcWamdme9Rnokv4c2BQRHbvv7uXQ6O38iFgUEc0R0VxfX9/PMs3MrC+lLBL9KWCWpD8DRgIfJmuxj5Y0Im+lNwAbB65MMzPrS58t9Ij4+4hoiIhGoAV4JiIuAdqA2flhlwHLB6xKMzPrUzn3od8A/K2ktWR96vdWpiQzM+uPUrpcdomIHwM/zp+vA06ofElmZtYfHilqZpYIB7qZWSIc6GZmiXCgm5klwoFuZpYIB7qZWSIc6GZmiXCgm5klwoFuZpYIB7qZWSIc6GZmiXCgm5klYp8m5xqKGhc8WnQJJVl/69lFl2BmiXML3cwsEQ50M7NElLKm6OGS2iR1Slot6Zp8/8GSnpK0Jn8cM/DlmpnZnpTSQt8BXBcRk4ATgaslHQMsAJ6OiCbg6XzbzMwKUsqaoq9HxPP5861AJzABOBdYnB+2GDhvoIo0M7O+7VMfuqRGYCrwHHBoRLwOWegD4/ZwzlxJ7ZLau7u7y6vWzMz2qORAl3Qg8DBwbUT8b6nnRcSiiGiOiOb6+vr+1GhmZiUoKdAl1ZGF+ZKI+EG++w1J4/PXxwObBqZEMzMrRSl3uQi4F+iMiG/u9tIK4LL8+WXA8sqXZ2ZmpSplpOingEuBVZJW5vu+CtwKPChpDrABuGBgSjQzs1L0GegR8R+A9vDyzMqWY2Zm/eWRomZmiXCgm5klwoFuZpaIqp8+12yoOnbxsUWXUJJVl60qugSrELfQzcwS4UA3M0uEA93MLBEOdDOzRDjQzcwS4UA3M0uEA93MLBEOdDOzRDjQzcwS4UA3M0tEWYEu6UxJL0taK2lBpYoyM7N91+9AlzQcuBM4CzgGuEjSMZUqzMzM9k05LfQTgLURsS4i3gWWAudWpiwzM9tX5QT6BODV3ba78n1mZlaAcqbP7W1ZuviDg6S5wNx88/8kvVzGZw6WQ4A3K/mGuq2S71ZVKn4tWbinFRFrQuV/N7/g61nRd9SAXM8jSzmonEDvAg7fbbsB2PjBgyJiEbCojM8ZdJLaI6K56DpS4GtZWb6elZXa9Syny+W/
gSZJEyXtB7QAKypTlpmZ7at+t9AjYoekLwNPAMOB70bE6opVZmZm+6SsJegi4jHgsQrVMpRUVRfREOdrWVm+npWV1PVUxB98j2lmZlXIQ//NzBLhQDczS4QD3cwsEWV9KZoKSaOAa4EjI+JLkv4IaIqIfyu4NDMbIJLGASN7tiNiQ4HlVIQDPfNdYBXw6Xx7I/AQ4EDvJ0knAY3s9jsWEd8vrKAqI+krEfF1Sd+mlxHYETGvgLKSIGkW8E/AR4FNZKMwO4HJRdZVCQ70TFNEXCTpAoCI+K00MON3a4Gk+4GPASuB9/LdATjQS/dS/theaBVp+gfgROBHETFV0meBiwquqSIc6Jl3JY0kbwlJmgi8W2xJVa0ZOCZ8T2w5Pg/8EBgdEXcUXUxitkfEZknDJA2LiDYpjdmWHOiZm4HHgQZJi4FTgDnFllTVfg4cBrxedCFV7HhJRwJ/Jen7fGAyvIj4dTFlJWGLpAOBZ4ElkjYBOwquqSI8sCgnqR44iewP56cRsangkqqWpDZgCvBfwO969kfErMKKqjKS5gFXAUcBr/H7gR4RcVQhhSVA0gHANrJregnwEWBJRGwutLAKqOlAl3Tc3l6PiBcHq5aUSDqlt/0R8e+DXUu1k/SdiLiq6DqsOtR6oP9kLy9HRJw8aMWYfYCkYcCLEfHxomtJgaSt9HLHUI+I+PAgljMgaroPPSI+U3QNKZJ0IvBtYBKwH9lsnG+n8AczmCJip6SfSToihXukixYRBwFIuhn4FXA/73e7HFRgaRVT0y30HvnAomvIBhZd5YFF5ZHUTjY//kNkd7z8Jdn1/GqhhVUhSc8AnyT7PuLtfHdEhNfv7SdJz0XEn/a1rxrVdAt9Nz0Di3pa7B5YVKaIWCtpeES8B3xP0k+LrqlKLdztucgGvyVxz3SB3pN0CdnC9kF2Pd/b+ynVwXO5ZJoi4mvAdsgGFtH7mqlWmt/mq1itlPR1SfOBA4ouqhrlXyS/BZwN3AfMBO4usqYEXAxcCLyR/7sg31f13ELPeGBRZV1K1lj4MjCfbO3ZzxVaUZWR9Mdk3VYXAZuBZWRdpJ8ttLAERMR6IMkuK/ehA5LOBBYAx5B1s5wCzImIpwstrEpJ+gvgsYj4XZ8HW68k7QR+QvZ7uDbft873n/dfLcyP4xY6EBGPS+rg/YFFf+eBRWWZBdwu6VmyfsonIiKJkXiD6HNkLfQ2SY+TXUd3A5anM39Mdn4ct9BzklqAj0XELZIOB8ZFREfRdVUrSXXAWWRzknwaeCoivlhsVdUnH9V4HlnXywxgMfBIRDxZaGEJkHRARLzd95HVw4EOSPpnoA44OSImSTqYrFX5yYJLq2p5qJ8JXA58JiLqCy6pquW/lxcAn4+IGUXXU60kTQfuBQ6MiCMkfQK4MiL+uuDSyua7XDInRcSVZPM79Ex8tF+xJVUvSWdKug9YC8wG7gHGF1pUAiLi1xHxLw7zst0OnEH2ZTMR8TMgiVHh7kPPbM+HWffc5TIW2FlsSVXtC2R9vlf6i1EbiiLi1Q8seZDEfeg1HeiSRuRf1t0JPAzUS1pIdo/qwr2ebHsUES1F12C2F6/mK2pFPl5iHu9/YVrVaroPXdLzETEtfz4ZOI3sToIfRcTPCy2uCu1l8iORDVf3XC5WOEmHAHfw/t/7k8C8FOaYr/VAfyEiphZdh5kVS9K1EXF70XWUq9YDvQv45p5ej4g9vmZm6ZC0ISKOKLqOctV0HzrZtK4H4gEbZrUuiQyo9UB/PSJuLroIMytcEl0VtR7oSfxX2cz61seX9qMGuZwBUet96Aen8M22mRnUeKCbmaXEQ//NzBLhQDczS4QD3WqSpJslnbYPx39UUutA1mRWLvehm5klwi10S5qkRkmdkv5V0mpJT0oaJek+SbPzY9ZL+pqk/5TULmmapCck/Y+kL+32Pp7fx4Y0B7rVgibgzoiYDGyh9wWrX42I6WTreN5HNo/7iYAHnlnVqPWBRVYbfhkRK/Pn
HUBjL8esyB9Xka1ksxXYKmmbpNGDUKNZ2dxCt1qw+yIb79F7Q6bnmJ0fOH7nHo43G3Ic6GZmiXCgm5klwrctmpklwi10M7NEONDNzBLhQDczS4QD3cwsEQ50M7NEONDNzBLhQDczS8T/AwcwrU2u3DZwAAAAAElFTkSuQmCC\n", "text/plain": [ "<Figure size 432x144 with 1 Axes>" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "ilusad_inimesed.plot(kind='bar',x='nimi',y='vanus', figsize=(6,2))" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "# Lisaks veergudele võime ka ridadele numbrite asemel \"nimed\" panna\n", "ilusad_inimesed.index = ['a', 'b', 'c', 'd']" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>nimi</th>\n", " <th>sugu</th>\n", " <th>vanus</th>\n", " <th>pensionär</th>\n", " <th>perekonnanimi</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>a</th>\n", " <td>Teele</td>\n", " <td>naine</td>\n", " <td>25</td>\n", " <td>False</td>\n", " <td>Kask</td>\n", " </tr>\n", " <tr>\n", " <th>b</th>\n", " <td>Ivan</td>\n", " <td>mees</td>\n", " <td>87</td>\n", " <td>True</td>\n", " <td>Smirnov</td>\n", " </tr>\n", " <tr>\n", " <th>c</th>\n", " <td>Arfi</td>\n", " <td>mees</td>\n", " <td>12</td>\n", " <td>False</td>\n", " <td>Jalakas</td>\n", " </tr>\n", " <tr>\n", " <th>d</th>\n", " <td>Leida</td>\n", " <td>naine</td>\n", " <td>58</td>\n", " <td>False</td>\n", " <td>Kuusepuu</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " nimi sugu vanus pensionär perekonnanimi\n", "a Teele naine 25 False Kask\n", "b Ivan mees 87 True Smirnov\n", "c Arfi 
mees 12 False Jalakas\n", "d Leida naine 58 False Kuusepuu" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ilusad_inimesed" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Teele'" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ilusad_inimesed['nimi']['a']" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "# A DataFrame can be converted to Python dictionaries of various shapes\n", "ilusad_inimesed_dict = ilusad_inimesed.to_dict(\"split\")" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'columns': ['nimi', 'sugu', 'vanus', 'pensionär', 'perekonnanimi'],\n", " 'data': [['Teele', 'naine', 25, False, 'Kask'],\n", " ['Ivan', 'mees', 87, True, 'Smirnov'],\n", " ['Arfi', 'mees', 12, False, 'Jalakas'],\n", " ['Leida', 'naine', 58, False, 'Kuusepuu']],\n", " 'index': ['a', 'b', 'c', 'd']}" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ilusad_inimesed_dict " ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "ilusad_inimesed_dict = ilusad_inimesed.to_dict(\"list\")" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'nimi': ['Teele', 'Ivan', 'Arfi', 'Leida'],\n", " 'pensionär': [False, True, False, False],\n", " 'perekonnanimi': ['Kask', 'Smirnov', 'Jalakas', 'Kuusepuu'],\n", " 'sugu': ['naine', 'mees', 'mees', 'naine'],\n", " 'vanus': [25, 87, 12, 58]}" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ilusad_inimesed_dict " ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "array([['Teele', 'naine', 25, False, 'Kask'],\n", " ['Ivan', 'mees', 87, True, 
'Smirnov'],\n", " ['Arfi', 'mees', 12, False, 'Jalakas'],\n", " ['Leida', 'naine', 58, False, 'Kuusepuu']], dtype=object)" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ilusad_inimesed.values" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "# Csv-sse salvestamine käib lihtsalt\n", "ilusad_inimesed.to_csv(\"test1.csv\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Ülesanne 2. Andmete puhastamine (1p)\n", "Failis vanasõnad.txt on toodud hulk eesti vanasõnu, mis pärinevad originaalis Anne Hussari, Arvo Krikmanni ja Ingrid Sarve \"Vanasõnaraamatust\" (1984), kokku on korjatud aga [siit](http://www.folklore.ee/~kriku/VSR/FRAMEST.HTM). Tutvuge toodud andmefailiga - näete, et peale vanasõnade leidub seal veel natuke üht-teist. Lisaks on osa vanasõnu kirjakeelsed, osa aga murdekeelsed. \n", "\n", "Kuna järgmises kahes ülesandes on vaja seda andmestikku kasutada, siis looge omale puhastatud andmefail, mis vastaks järgmistele tingimustele:\n", "* fail on csv-formaadis\n", "* fail sisaldab igal real üht vanasõna (ja ei midagi muud)\n", "* fail sisaldab ainult kirjakeelseid vanasõnu" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Vihjeid**:\n", "* ebavajaliku eemaldamisel on abiks regulaaravaldised\n", "* murdekeelt aitab kirjakeelest eristada morfoloogiline analüsaator - kui lülitame välja oletamise, siis jätab analüsaator tundmatud sõnad analüüsimata. Seega, lisame puhastatud faili ainult laused, mille kõik sõnad saavad analüüsi ka ilma oletamiseta. 
\n", "\n", " **NB1!** Without guessing, punctuation marks are also left without an analysis, but they should not be treated as words here\n", " \n", " **NB2!** Switching off guessing only works if disambiguation and proper-name analysis are switched off as well\n", " \n", " \n", "* the number of proverbs that end up in the cleaned file should be a four-digit number, so that the following exercises also work out" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "from estnltk import Text" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[[],\n", " [{'clitic': '',\n", " 'ending': 'd',\n", " 'form': 'd',\n", " 'lemma': 'külvama',\n", " 'partofspeech': 'V',\n", " 'root': 'külva',\n", " 'root_tokens': ['külva']}],\n", " [],\n", " [],\n", " [{'clitic': '',\n", " 'ending': 'd',\n", " 'form': 'd',\n", " 'lemma': 'lõikama',\n", " 'partofspeech': 'V',\n", " 'root': 'lõika',\n", " 'root_tokens': ['lõika']}],\n", " []]" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Morphological analysis without guessing. \n", "Text(\"Kuda külvad, nõnna lõikad.\", guess = False, disambiguate = False, propername = False).analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 3. Simple noun phrases in proverbs (2p)\n", "Find which phrases consisting of an adjective and a noun (e.g. 'sinine ämber', meaning 'blue bucket') occur in Estonian proverbs. Create a CSV file in which the frequencies of such phrases in the proverbs are presented as a cross-tabulation: the rows are nouns, the columns are adjectives, and the cells contain the frequencies of the corresponding co-occurrences. 
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To do this, for each proverb:\n", "* Perform a morphological analysis of the proverb\n", "* Find whether the proverb contains a noun that follows an adjective and, if so, whether the number and case of the two words agree so that they can form a phrase (think of Estonian grammar: 'sinine ämber' should be an acceptable phrase, but 'sinistena ämber' should not, because case and number do not agree). \n", "\n", "Write to the CSV file only those nouns that occur with at least 5 different adjectives" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Hints**:\n", "* handle ambiguous analyses in whatever way you consider most correct\n", "* writing a cross-tabulation to a CSV file is considerably easier with the pandas library than with the csv library" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 4. Let's play, too (2.5p)\n", "\n", "Create a simple proverb-guessing game in which the user is shown a proverb with a gap, together with options to choose from for the word that fills the gap. \n", "\n", "* The gap should be the last noun or adjective of the proverb, whichever occurs later in the sentence.\n", "* Offer 10 nouns or adjectives (matching the gap) in the appropriate form as options, including the correct answer. 
Find the words in the appropriate form from other proverbs.\n", "* The options must not repeat\n", "* The options are in random order\n", "* Do not offer proverbs in which the word replaced by the gap (its lemma) also occurs elsewhere in the same proverb (\"Kuidas küla koerale, nõnda koer ____.\" is not interesting to guess)\n", "* Do not offer proverbs for whose gap 9 options in the same form cannot be found in other proverbs.\n", "* Proverbs are offered in a random order that is generated when the game starts, i.e. restarting the game does not immediately bring up the same questions as last time\n", "* After each guess the player is shown whether the guess was right or wrong and, if it was wrong, the correct answer\n", "* The player is shown the game score (the scoring system is up to you, e.g. whether wrong answers cost points)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Hints**:\n", "* Use the cleaned file created in Exercise 2\n", "* It makes sense to build a dictionary for the game that contains each proverb and its answer options, rather than searching for options on the fly during the game" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.3" } }, "nbformat": 4, "nbformat_minor": 1 }