{ "cells": [ { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2017-02-09T14:20:28.605732", "start_time": "2017-02-09T14:20:25.810350" } }, "outputs": [ { "data": { "text/html": [ " \n", "\n", "\n", " \n", "\n", "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Custom libraries\n", "from datascienceutils import plotter\n", "from datascienceutils import analyze\n", "from datascienceutils import predictiveModels as pm\n", "from datascienceutils import sklearnUtils as sku\n", "\n", "from IPython.display import Image\n", "# Standard libraries\n", "import json\n", "%matplotlib inline\n", "import datetime\n", "import numpy as np\n", "import pandas as pd\n", "import random\n", "\n", "from sklearn import cross_validation\n", "from sklearn import metrics\n", "\n", "from bokeh.plotting import figure, show, output_file, output_notebook, ColumnDataSource\n", "from bokeh.charts import Histogram\n", "import bokeh\n", "output_notebook()\n", "\n", "# Set pandas display options\n", "#pd.set_option('display.width', pd.util.terminal.get_terminal_size()[0])\n", "pd.set_option('display.expand_frame_repr', False)\n", "pd.set_option('max_colwidth', 800)\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2017-02-09T14:56:17.528735", "start_time": "2017-02-09T14:56:17.518161" } }, "outputs": [], "source": [ "# Data set from https://archive.ics.uci.edu/ml/machine-learning-databases/hepatitis/\n", "columns = ['class', 'age', 'sex', 'steroid', 'antivirals', 'fatigue', 'malaise', 'anorexia', \n", " 'big_liver', 'firm_liver', 'palpable_spleen', 'spiders', 'ascites', 'varices', 'bilirubin',\n", " 'alk_phosphate', 'sgot', 'albumin', 'protime', 'histology']\n", "\n", "hepatitis_df = pd.read_csv('~/DataScientist/data/Hepatitis/hepatitis.data', names=columns, na_values=['?'])\n", " \n" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "ExecuteTime": { "end_time": "2017-02-09T14:27:13.120969", "start_time": "2017-02-09T14:27:13.115893" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['1. Title: Hepatitis Domain\\n',\n", " '\\n',\n", " '2. 
Sources:\\n',\n", " ' (a) unknown\\n',\n", " ' (b) Donor: G.Gong (Carnegie-Mellon University) via \\n',\n", " ' Bojan Cestnik\\n',\n", " ' Jozef Stefan Institute\\n',\n", " ' Jamova 39\\n',\n", " ' 61000 Ljubljana\\n',\n", " ' Yugoslavia (tel.: (38)(+61) 214-399 ext.287) }\\n',\n", " ' (c) Date: November, 1988\\n',\n", " '\\n',\n", " '3. Past Usage:\\n',\n", " ' 1. Diaconis,P. & Efron,B. (1983). Computer-Intensive Methods in \\n',\n", " ' Statistics. Scientific American, Volume 248.\\n',\n", " ' -- Gail Gong reported a 80% classfication accuracy\\n',\n", " ' 2. Cestnik,G., Konenenko,I, & Bratko,I. (1987). Assistant-86: A\\n',\n", " ' Knowledge-Elicitation Tool for Sophisticated Users. In I.Bratko\\n',\n", " ' & N.Lavrac (Eds.) Progress in Machine Learning, 31-45, Sigma Press.\\n',\n", " ' -- Assistant-86: 83% accuracy\\n',\n", " '\\n',\n", " '4. Relevant Information:\\n',\n", " ' Please ask Gail Gong for further information on this database.\\n',\n", " '\\n',\n", " '5. Number of Instances: 155\\n',\n", " '\\n',\n", " '6. Number of Attributes: 20 (including the class attribute)\\n',\n", " '\\n',\n", " '7. Attribute information: \\n',\n", " ' 1. Class: DIE, LIVE\\n',\n", " ' 2. AGE: 10, 20, 30, 40, 50, 60, 70, 80\\n',\n", " ' 3. SEX: male, female\\n',\n", " ' 4. STEROID: no, yes\\n',\n", " ' 5. ANTIVIRALS: no, yes\\n',\n", " ' 6. FATIGUE: no, yes\\n',\n", " ' 7. MALAISE: no, yes\\n',\n", " ' 8. ANOREXIA: no, yes\\n',\n", " ' 9. LIVER BIG: no, yes\\n',\n", " ' 10. LIVER FIRM: no, yes\\n',\n", " ' 11. SPLEEN PALPABLE: no, yes\\n',\n", " ' 12. SPIDERS: no, yes\\n',\n", " ' 13. ASCITES: no, yes\\n',\n", " ' 14. VARICES: no, yes\\n',\n", " ' 15. BILIRUBIN: 0.39, 0.80, 1.20, 2.00, 3.00, 4.00\\n',\n", " ' -- see the note below\\n',\n", " ' 16. ALK PHOSPHATE: 33, 80, 120, 160, 200, 250\\n',\n", " ' 17. SGOT: 13, 100, 200, 300, 400, 500, \\n',\n", " ' 18. ALBUMIN: 2.1, 3.0, 3.8, 4.5, 5.0, 6.0\\n',\n", " ' 19. PROTIME: 10, 20, 30, 40, 50, 60, 70, 80, 90\\n',\n", " ' 20. 
HISTOLOGY: no, yes\\n',\n", " '\\n',\n", " ' The BILIRUBIN attribute appears to be continuously-valued. I checked\\n',\n", " ' this with the donater, Bojan Cestnik, who replied:\\n',\n", " '\\n',\n", " ' About the hepatitis database and BILIRUBIN problem I would like to '\n", " 'say\\n',\n", " ' the following: BILIRUBIN is continuous attribute (= the number of '\n", " \"it's\\n\",\n", " ' \"values\" in the ASDOHEPA.DAT file is negative!!!); \"values\" are '\n", " 'quoted\\n',\n", " ' because when speaking about the continuous attribute there is no '\n", " 'such \\n',\n", " ' thing as all possible values. However, they represent so called\\n',\n", " ' \"boundary\" values; according to these \"boundary\" values the '\n", " 'attribute\\n',\n", " ' can be discretized. At the same time, because of the continious\\n',\n", " ' attribute, one can perform some other test since the continuous\\n',\n", " ' information is preserved. I hope that these lines have at least '\n", " 'roughly \\n',\n", " ' answered your question. \\n',\n", " '\\n',\n", " '8. Missing Attribute Values: (indicated by \"?\")\\n',\n", " ' Attribute Number: Number of Missing Values:\\n',\n", " ' 1: 0\\n',\n", " ' 2: 0\\n',\n", " ' 3: 0\\n',\n", " ' 4: 1\\n',\n", " ' 5: 0\\n',\n", " ' 6: 1\\n',\n", " ' 7: 1\\n',\n", " ' 8: 1\\n',\n", " ' 9: 10\\n',\n", " '\\t\\t 10: 11\\n',\n", " '\\t\\t 11: 5\\n',\n", " '\\t\\t 12: 5\\n',\n", " '\\t\\t 13: 5\\n',\n", " '\\t\\t 14: 5\\n',\n", " '\\t\\t 15: 6\\n',\n", " '\\t\\t 16: 29\\n',\n", " '\\t\\t 17: 4\\n',\n", " '\\t\\t 18: 16\\n',\n", " '\\t\\t 19: 67\\n',\n", " '\\t\\t 20: 0\\n',\n", " '\\n',\n", " '9. 
Class Distribution:\\n',\n", " ' DIE: 32\\n',\n", " ' LIVE: 123\\n']\n" ] } ], "source": [ "from pprint import pprint\n", "import os\n", "with open(os.path.expanduser('~/DataScientist/data/Hepatitis/hepatitis.names'), 'r') as fd:\n", " pprint(fd.readlines())" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "ExecuteTime": { "end_time": "2017-02-09T14:56:24.862488", "start_time": "2017-02-09T14:56:24.827740" } }, "outputs": [ { "data": { "text/html": [ "\n", "| | class | age | sex | steroid | antivirals | fatigue | malaise | anorexia | big_liver | firm_liver | palpable_spleen | spiders | ascites | varices | bilirubin | alk_phosphate | sgot | albumin | protime | histology |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 30 | 2 | 1.0 | 2 | 2.0 | 2.0 | 2.0 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | 85.0 | 18.0 | 4.0 | NaN | 1 |
| 1 | 2 | 50 | 1 | 1.0 | 2 | 1.0 | 2.0 | 2.0 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 0.9 | 135.0 | 42.0 | 3.5 | NaN | 1 |
| 2 | 2 | 78 | 1 | 2.0 | 2 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 0.7 | 96.0 | 32.0 | 4.0 | NaN | 1 |
| 3 | 2 | 31 | 1 | NaN | 1 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 0.7 | 46.0 | 52.0 | 4.0 | 80.0 | 1 |
| 4 | 2 | 34 | 1 | 2.0 | 2 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | NaN | 200.0 | 4.0 | NaN | 1 |\n", "