{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Poker Induction Rule Problem" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The problem is about the famous Cards Game, Poker.\n", "\n", "In Poker, there is something called Poker Hands, The hand consists of 5 cards which determines the score of each player. Tranditionally in our life, the rules for calculating the hands can be seen here https://en.wikipedia.org/wiki/List_of_poker_hands\n", "\n", "Straight flush > Four of a kind > Full house > Flush > Straight > Three of a kind > Two pair > One pair > High card(Nothing).\n", "\n", "Well, the problem considers us as Aliens that don't know anything about this game and its rules, It wants from us to predict the rules of the games given a dataset that contains the 5 cards and the class(Poker Hands)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The dataset is taken from here https://archive.ics.uci.edu/ml/datasets/Poker+Hand, we have 10 features and a label for each record. Each consecutive pair represents a card (Suit {Heart, Spade, Diamond, Club}, Rank {Ace, 2, 3, ... Q, K}). The label is represented as numerical value (0 - 9) which indciates the poker hand.\n", "\n", "1: One pair; one pair of equal ranks within five cards\n", "\n", "2: Two pairs; two pairs of equal ranks within five cards \n", "\n", "3: Three of a kind; three equal ranks within five cards \n", "\n", "4: Straight; five cards, sequentially ranked with no gaps\n", "\n", "5: Flush; five cards with the same suit \n", "\n", "6: Full house; pair + different rank three of a kind \n", "\n", "7: Four of a kind; four equal ranks within five cards \n", "\n", "8: Straight flush; straight + flush \n", "\n", "9: Royal flush; {Ace, King, Queen, Jack, Ten} + flush \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Solution" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When solving problems on Kaggle using Python, make sure that you have Pandas, NumPy, SciPy, Scikit-learn libraries installed in your Python package, they contains many utilities and built-in algorithms that make the solving easier.\n", "\n", "First, let's import the modules that we will use in the project." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pandas as pnd\n", "from sklearn.cross_validation import cross_val_score\n", "from sklearn.ensemble import RandomForestClassifier\n", "from sklearn.linear_model import LinearRegression\n", "from sklearn.neighbors import KNeighborsClassifier" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using the awesome Panda library, we can parse the .csv file of the training set and hold the data in a table" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def getTrainingData():\n", " print(\"Get training data ...\\n\")\n", "\n", " trainingData = pnd.read_csv(\"./train.csv\")\n", " trainingData['id'] = range(1, len(trainingData) + 1) #For 1-base index\n", "\n", " return trainingData" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Second, We need to extract the features and the label from the table" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Get training data ...\n", "\n" ] } ], "source": [ "trainingData = getTrainingData()\n", "labels = trainingData['hand']\n", "features = trainingData.drop(['id', 'hand'], axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When dealing with Machine Learning algorithms, you need to calculate the effiency of the algorithm with the data, this could be done using several techniques such as K-Fold cross validation https://en.wikipedia.org/wiki/Cross-validation_(statistics)#k-fold_cross-validation and Precision and recall https://en.wikipedia.org/wiki/Precision_and_recall.\n", "\n", "I've used K-fold cross validation for this problem." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def kFoldCrossValidation(kFold):\n", " trainingData = getTrainingData()\n", " label = trainingData['hand']\n", " features = trainingData.drop(['id'], axis=1)\n", " crossValidationResult = dict()\n", "\n", " print(\"Start Cross Validation ...\\n\")\n", "\n", " randomForest = RandomForestClassifier(n_estimators=100)\n", " kNearestNeighbour = KNeighborsClassifier(n_neighbors=100)\n", " crossValidationResult['RF'] = cross_val_score(randomForest, trainingData, label, cv=kFold).mean()\n", " crossValidationResult['KNN'] = cross_val_score(kNearestNeighbour, trainingData, label, cv=kFold).mean()\n", "\n", " print(\"KNN: %s\\n\" % str(crossValidationResult['KNN']))\n", " print(\"RF: %s\\n\" % str(crossValidationResult['RF']))\n", " print(\"\\n\")\n", "\n", " return crossValidationResult['KNN'], crossValidationResult['RF']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I've decided to use K Nearest Neighbour and Random Forest according to the recommendation and the benchmark of the problem. Above, I've created instances from the Random Forest and K Nearest Neighbour modules, then get the score of each one to help me to decide which one is better." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "if __name__ == '__main__':\n", " trainingData = getTrainingData()\n", " labels = trainingData['hand']\n", " features = trainingData.drop(['id', 'hand'], axis=1)\n", "\n", " KNN, RF = kFoldCrossValidation(5)\n", " classifier = None\n", "\n", " if KNN > RF:\n", " classifier = KNeighborsClassifier(n_neighbors=100)\n", " else:\n", " classifier = RandomForestClassifier(n_estimators=10, n_jobs=-1)\n", "\n", " testData, result = getTestData()\n", "\n", " print(\"Classification in progress ...\\n\")\n", "\n", " classifier.fit(features, labels)\n", " result.insert(1, 'hand', classifier.predict(testData))\n", " result.to_csv(\"./results.csv\", index=False)\n", "\n", " print(\"Classification Ends ...\\n\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I've made a condition to decide which classifier will be used according to the calculated score in the previous step." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Result" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The best score I've got is 0.5624 from Random Forest which is close to the benchmark of this algorithm which is 0.62408. But the best score for this problem is 1.0000.\n", "\n", "Well, this is was the first problem I've ever solved on Kaggle (Year ago), I was trying to begin and learn how to use Python libraries and submit the result, so, I haven't tried to improve the accuracy. Later, I've found that to improve the accuracy, we may make some preprocessing on data, tuning the parameters, regularization and other things that help on getting better accuracy." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Important Note" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Don't use any module as a black box when you don't know how it works, the libraries are made to use it to avoid wasting the time on re-code them again, but, you MUST know what is your code doing." ] } ], "metadata": { "celltoolbar": "Raw Cell Format", "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.9" } }, "nbformat": 4, "nbformat_minor": 0 }