{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Working with Data in Python (Part 2)\n", "\n", "For the next example, we are going to be using the Python data analysis library `pandas`, which lets the user explore tabular datasets and perform complex search, indexing, statistical, and other operations.\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example: World Population Growth" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Read in the UN population data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "data = pd.read_csv('Data/population.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Print the data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Print just the first few rows of data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Print column names" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Select columns" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "my_columns = ['Year','Series','Value']\n", "\n", "data[my_columns].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Select data based on a matching criterion." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "year = 2005\n", "\n", "data[data['Year'] == year].head(n=20)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "series = \"Population mid-year estimates (millions)\"\n", "\n", "data[data['Series'] == series]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# We can contruct more complex matching criteria. Here we want all \n", "# the mid-year population estimates for Canada.\n", "query = (data[\"Region/Country/Area\"] == \"Canada\") & \\\n", " (data[\"Series\"] == \"Population mid-year estimates (millions)\")\n", "\n", "data[query]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# We can contruct more complex matching criteria. Here we want all \n", "# the mid-year population estimates for Canada.\n", "query = (data[\"Region/Country/Area\"] == \"Germany\") & \\\n", " (data[\"Series\"] == \"Population mid-year estimates (millions)\")\n", "\n", "data[query]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "world = pd.read_csv('Data/world_population.csv')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "world.head(n=20)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "world = world[::-1]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "high = world[world[\"Variant\"] == \"High\"]\n", "med = world[world[\"Variant\"] == \"Medium\"]\n", "low = world[world[\"Variant\"] == \"Low\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Plot the world population by year for the three scenarios" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import 
matplotlib.pyplot as plt\n", "\n", "# Get the data for each variant, store as arrays\n", "years_h = high[\"Year(s)\"].values\n", "years_m = med[\"Year(s)\"].values\n", "years_l = low[\"Year(s)\"].values\n", "\n", "# Population in thousands, convert to billions\n", "pop_h = high[\"Value\"].values / 1.0e6\n", "pop_m = med[\"Value\"].values / 1.0e6\n", "pop_l = low[\"Value\"].values / 1.0e6\n", "\n", "# Plot population against against years\n", "plt.plot(years_l, pop_l)\n", "plt.plot(years_m, pop_m)\n", "plt.plot(years_h, pop_h)\n", "plt.legend([\"Low\", \"Medium\", \"High\"])\n", "plt.grid(True, alpha=0.3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Learn More\n", "\n", "You can learn more about `pandas` by visiting the [homepage](https://pandas.pydata.org/).\n", "\n", "For a 10-minute tutorial, read \"[10 minutes to pandas](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html)\"." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction to Data Mining\n", "\n", "As summarized on [Wikipedia](https://en.wikipedia.org/wiki/Data_mining), data mining involves six common classes of tasks:\n", "\n", "* **Anomaly detection** (outlier/change/deviation detection) – The identification of unusual data records, that might be interesting or data errors that require further investigation.\n", "\n", "* **Association rule learning** (dependency modelling) – Searches for relationships between variables. For example, a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. 
This is sometimes referred to as market basket analysis.\n", "\n", "* **Clustering** – is the task of discovering groups and structures in the data that are in some way or another \"similar\", without using known structures in the data.\n", "\n", "* **Classification** – is the task of generalizing known structure to apply to new data. For example, an e-mail program might attempt to classify an e-mail as \"legitimate\" or as \"spam\".\n", "\n", "* **Regression** – attempts to find a function which models the data with the least error that is, for estimating the relationships among data or datasets.\n", "\n", "* **Summarization** – providing a more compact representation of the data set, including visualization and report generation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Clustering" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "from sklearn import cluster, datasets\n", "from sklearn.cluster import KMeans\n", "from sklearn.preprocessing import StandardScaler\n", "\n", "n_samples = 1500\n", "\n", "blobs = datasets.make_blobs(n_samples=n_samples, centers=3, random_state=8)\n", "\n", "X, y = blobs\n", "\n", "# normalize dataset for easier parameter selection\n", "X = StandardScaler().fit_transform(X)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.shape(X)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(X[:10,:])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.scatter(X[:,0],X[:,1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Apply K-Means Clustering\n", "\n", "Read more about the nature of the algorithm [here](https://en.wikipedia.org/wiki/K-means_clustering)." 
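] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a rough sketch of what the algorithm does: pick `k` initial centers, assign each point to its nearest center, move each center to the mean of its assigned points, and repeat. The NumPy version below is illustrative only (random initialization, fixed number of iterations), not scikit-learn's implementation:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "def kmeans_sketch(X, k, n_iter=20, seed=0):\n", "    rng = np.random.default_rng(seed)\n", "    # Start from k randomly chosen data points\n", "    centers = X[rng.choice(len(X), size=k, replace=False)]\n", "    for _ in range(n_iter):\n", "        # Distances from every point to every center, shape (n, k)\n", "        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)\n", "        labels = d.argmin(axis=1)\n", "        # Move each center to the mean of its assigned points\n", "        # (keep the old center if a cluster ends up empty)\n", "        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)\n", "                            else centers[j] for j in range(k)])\n", "    return labels, centers"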
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "kmeans = KMeans(n_clusters=3)\n", "\n", "kmeans.fit(X)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "predicted_categories = kmeans.predict(X)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for i in range(kmeans.n_clusters):\n", "    plt.scatter(X[predicted_categories == i, 0],\n", "                X[predicted_categories == i, 1], label='Category {}'.format(i))\n", "plt.legend()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(kmeans.cluster_centers_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise\n", "\n", "What if you didn't know the true number of clusters ahead of time? Could you somehow measure the quality of your model?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Outlier Detection" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install -U scikit-learn" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "from sklearn.neighbors import LocalOutlierFactor\n", "\n", "# Generate training data\n", "X = 0.3 * np.random.randn(100, 2)\n", "\n", "# Generate some abnormal novel observations\n", "X_outliers = np.random.uniform(low=-4, high=4, size=(20, 2))\n", "X = np.vstack([X + 2, X - 2, X_outliers])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.scatter(X[:, 0], X[:, 1], c='white', edgecolor='k', s=20)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Fit the model. novelty=True exposes the public decision_function,\n", "# which we use below to score a grid of points.\n", "clf = LocalOutlierFactor(n_neighbors=20, contamination='auto', novelty=True)\n", "clf.fit(X)\n", "\n", "# Note: scikit-learn intends predict() for new data when novelty=True;\n", "# here we score the training points purely for illustration.\n", "y_pred = clf.predict(X)\n", "y_pred_outliers = y_pred[200:]\n", "\n", "# Plot the level sets of the decision function\n", "xx, yy = np.meshgrid(np.linspace(-5, 5, 50), np.linspace(-5, 5, 50))\n", "Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])\n", "Z = Z.reshape(xx.shape)\n", "\n", "plt.title(\"Local Outlier Factor (LOF)\")\n", "plt.contourf(xx, yy, Z, cmap=plt.cm.Blues_r)\n", "\n", "a = plt.scatter(X[:200, 0], X[:200, 1], c='white', edgecolor='k', s=20)\n", "b = plt.scatter(X[200:, 0], X[200:, 1], c='red', edgecolor='k', s=20)\n", "\n", "plt.axis('tight')\n", "plt.xlim((-5, 5))\n", "plt.ylim((-5, 5))\n", "plt.legend([a, b],\n", "           [\"normal observations\",\n", "            \"abnormal observations\"],\n", "           loc=\"upper left\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Regression\n", "\n", "This example has been adapted from Haydar Ali Ismail's [Medium post](https://medium.com/@haydar_ai/learning-data-science-day-9-linear-regression-on-boston-housing-dataset-cd62a80775ef), titled \"Learning Data Science: Day 9 - Linear Regression on Boston Housing Dataset\"."
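] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note: `load_boston` was deprecated in scikit-learn 1.0 and removed in 1.2, so the cells below assume an older scikit-learn. On newer versions, a similar table can be fetched from OpenML; the sketch below is one possible substitute (it downloads data from openml.org):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Fallback for scikit-learn >= 1.2, where load_boston no longer exists.\n", "# Requires network access; fetches the Boston housing table from OpenML.\n", "from sklearn.datasets import fetch_openml\n", "\n", "boston_df = fetch_openml(name='boston', version=1, as_frame=True).frame\n", "boston_df.head()"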
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "from sklearn.datasets import load_boston\n", "boston = load_boston()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "boston" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(boston.keys())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(boston.data.shape)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(boston.feature_names)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(boston.DESCR)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bos = pd.DataFrame(boston.data)\n", "bos.columns = boston.feature_names" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bos.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# The price\n", "boston.target" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bos['PRICE'] = boston.target" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bos.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Regression using a linear model" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sklearn\n", "from sklearn.model_selection import train_test_split\n", "\n", "X = bos.drop('PRICE', axis = 1)\n", "Y = bos['PRICE']\n", "\n", "X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.33, random_state = 5)\n", "print(X_train.shape)\n", "print(X_test.shape)\n", "print(Y_train.shape)\n", "print(Y_test.shape)" ] }, { "cell_type": "code", 
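"execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Before fitting, a toy illustration of what linear regression does:\n", "# choose weights w minimizing the squared error, here via the normal\n", "# equations (X^T X) w = X^T y on a tiny 1-D problem (illustrative only;\n", "# LinearRegression below handles the real multi-feature case).\n", "import numpy as np\n", "\n", "Xt = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])  # bias column + feature\n", "yt = np.array([1.0, 3.0, 5.0])                       # exactly y = 1 + 2x\n", "w = np.linalg.solve(Xt.T @ Xt, Xt.T @ yt)\n", "print(w)  # approximately [1. 2.]" ] }, { "cell_type": "code",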
"execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import LinearRegression\n", "\n", "model = LinearRegression()\n", "model.fit(X_train, Y_train)\n", "\n", "Y_pred = model.predict(X_test)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "\n", "plt.scatter(Y_test, Y_pred)\n", "plt.xlabel(\"True Prices\")\n", "plt.ylabel(\"Predicted Prices\")\n", "plt.title(\"True Prices vs Predicted prices\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mse = sklearn.metrics.mean_squared_error(Y_test, Y_pred)\n", "print(mse)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Regression using a Decision Tree" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.tree import DecisionTreeRegressor\n", "\n", "model = DecisionTreeRegressor()\n", "model.fit(X_train, Y_train)\n", "\n", "Y_pred = model.predict(X_test)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "\n", "plt.scatter(Y_test, Y_pred)\n", "plt.\n", "plt.xlabel(\"True Prices\")\n", "plt.ylabel(\"Predicted Prices\")\n", "plt.title(\"True Prices vs Predicted prices\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.bar(range(len(boston.feature_names)), model.feature_importances_)\n", "plt.xticks(range(13), boston.feature_names, rotation='vertical');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Reminder of what the factors were" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "| Factor | Description |\n", "| ------ | ------------|\n", "| CRIM | per capita crime rate by town |\n", "| ZN | proportion of residential land zoned for lots over 25,000 sq.ft. 
|\n", "| INDUS | proportion of non-retail business acres per town |\n", "| CHAS | Charles River dummy variable (= 1 if tract bounds river; 0 otherwise) |\n", "| NOX | nitric oxides concentration (parts per 10 million) |\n", "| RM | average number of rooms per dwelling |\n", "| AGE | proportion of owner-occupied units built prior to 1940 |\n", "| DIS | weighted distances to five Boston employment centres |\n", "| RAD | index of accessibility to radial highways |\n", "| TAX | full-value property-tax rate per \\$10,000 |\n", "| PTRATIO| pupil-teacher ratio by town |\n", "| B | 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town |\n", "| LSTAT | \\% lower status of the population |\n", "| MEDV | Median value of owner-occupied homes in $1000's |" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3.6", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.7" } }, "nbformat": 4, "nbformat_minor": 2 }