{ "cells": [ { "cell_type": "markdown", "id": "37767433-c4ee-4dcd-b08c-961bfee0e08b", "metadata": {}, "source": [ "# cuML\n", "\n", "Another package we are going to explore is [cuML](https://github.com/rapidsai/cuml). Similar to `scikit-learn` you can use `cuml` to train machine learning models on your data to make predictions. As with other packages in the RAPIDS suite of tools the API of `cuml` is the same as `scikit-learn` but the underlying code has been implemented to run on the GPU." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!git clone https://github.com/rapidsai/rapidsai-csp-utils.git\n", "!python rapidsai-csp-utils/colab/pip-install.py\n" ] }, { "cell_type": "code", "execution_count": null, "id": "2332c71a-919f-473c-a17a-49649d031c53", "metadata": {}, "outputs": [], "source": [ "import cudf\n" ] }, { "cell_type": "markdown", "id": "ae63fed6-3a26-4498-8a18-c9a48955b4c5", "metadata": {}, "source": [ "Let's look at training a K Nearest Neigbours model to predict whether someone has diabetes based on some other attributes such as their blood pressure, glucose levels, BMI, etc.\n", "\n", "We start by loading in our data to a GPU dataframe with `cudf`." ] }, { "cell_type": "code", "execution_count": null, "id": "0ad9c9f9-7a12-481a-8071-099a20ac2ee5", "metadata": {}, "outputs": [], "source": [ "df = cudf.read_csv(\"https://github.com/jacobtomlinson/gpu-python-tutorial/raw/main/data/diabetes.csv\")\n", "df.head()\n" ] }, { "cell_type": "markdown", "id": "8a6603cb-c2fa-4fc1-8899-b93cc6cd2f30", "metadata": {}, "source": [ "Next we need to create two separate tables. One containing the attributes of the patient except the diabetes column, and one with just the diabetes column. " ] }, { "cell_type": "code", "execution_count": null, "id": "628dd21c-57f6-416b-b987-fc7b26038edb", "metadata": {}, "outputs": [], "source": [ "X = df.drop(columns=[\"Outcome\"])\n", "X.head()\n" ] }, { "cell_type": "code", "execution_count": null, "id": "7941ed26-7e13-4712-8e0e-f3066b6f1998", "metadata": {}, "outputs": [], "source": [ "y = df[\"Outcome\"].values\n", "y[0:5]\n" ] }, { "cell_type": "markdown", "id": "8216b7b1-96e6-43ba-907b-c6c38385ef7f", "metadata": {}, "source": [ "Next we need to use the `train_test_split` method from `cuml` to split our data into two sets.\n", "\n", "The first larger set will be used to train out model. We will take 80% of the data from each table and call them `X_train` and `y_train`. When the model is trained it will be able to see both sets of data in order to perform clustering.\n", "\n", "The other 20% of the data will be called `X_test` and `y_test`. Once our model is trained we will feel our `X_test` data through our model to predict whether those people have diabetes. We can then compare those pridictions with the actual `y_test` data to see how accurate our model is.\n", "\n", "We also set `random_state` to `1` to make the random selection consistent, just for the purposes of this tutorial. We also set `statify` which means that if 75% of people in our intial data have diabetes then 75% of people in our training set will be guarantted to have diabetes." ] }, { "cell_type": "code", "execution_count": null, "id": "442519ad-1ef7-453a-bc48-59acc231d7ae", "metadata": {}, "outputs": [], "source": [ "from cuml.model_selection import train_test_split\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1, stratify=y)\n" ] }, { "cell_type": "markdown", "id": "1a48f7c7-22a3-4b15-b679-800619e9e66b", "metadata": {}, "source": [ "Now that we have our training data we can import our `KNeighborsClassifier` from `cuml` and fit our model." ] }, { "cell_type": "code", "execution_count": null, "id": "2120bacd-f9e1-4d19-9a29-e4215669d1d0", "metadata": {}, "outputs": [], "source": [ "from cuml.neighbors import KNeighborsClassifier\n", "\n", "knn = KNeighborsClassifier(n_neighbors = 3)\n", "\n", "knn.fit(X_train,y_train)\n" ] }, { "cell_type": "markdown", "id": "40916ef1-e47c-4d92-b8a9-2022de7abad9", "metadata": {}, "source": [ "Fitting our model happened on our GPU and now we can make some predictions. Let's predict the first five people from our test set." ] }, { "cell_type": "code", "execution_count": null, "id": "a5c1bac6-bce3-4c61-92e6-603c76c1aace", "metadata": {}, "outputs": [], "source": [ "knn.predict(X_test)[0:5]\n" ] }, { "cell_type": "markdown", "id": "82e267a1-c286-4fca-a716-1e1901130d9b", "metadata": {}, "source": [ "We can see here that our new model thinks that the first patient has diabetes but the rest do not.\n", "\n", "Let's run the whole test set through the scoring function along with the actual answers and see how well our model performs." ] }, { "cell_type": "code", "execution_count": null, "id": "0dfc2d6b-9248-4f3d-a770-aec1f3a1b548", "metadata": {}, "outputs": [], "source": [ "knn.score(X_test, y_test)\n" ] }, { "cell_type": "markdown", "id": "f2455f63-46a5-4d6f-8e22-51eb1812afe9", "metadata": {}, "source": [ "Congratulations you just trained a machine learning model on the GPU in Python and achieved a score of 69% accuracy. There are a bunch of things we could do here to improve this score, but that is beyond the scope of this tutorial." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.13" } }, "nbformat": 4, "nbformat_minor": 5 }