{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "## Otto Group Product Classification Challenge\n",
    "\n",
    "In this competition we have 93 anonimized features describing products of Otto Group. The challenge is to classify products into correct categories.\n",
    "Features represent counts of different events. There is little which can be done with features themselves - no high correlation between features, no skeweresness. So this competition is about modelling.\n",
    "\n",
    "The metric to calculate the accuracy of predictions is multiclass loss. To simplify I'll calculate log loss.\n",
    "\n",
    "As this is multiple classufication problem, the approach differs from binary classification. The result of prediction is probabilities of belonging to each class for products."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import sklearn\n",
    "from sklearn.preprocessing import LabelEncoder\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "from sklearn.metrics import log_loss\n",
    "from sklearn.calibration import CalibratedClassifierCV\n",
    "import numpy as np\n",
    "import xgboost as xgb"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "data = pd.read_csv('../input/train.csv')\n",
    "test = pd.read_csv('../input/test.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "X_train = data.drop('id', axis=1)\n",
    "X_train = X_train.drop('target', axis=1)\n",
    "Y_train = LabelEncoder().fit_transform(data.target.values)\n",
    "X_test = test.drop('id', axis=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "Xtrain, Xtest, ytrain, ytest = train_test_split(X_train, Y_train, test_size=0.20, random_state=36)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "At first I tried several other models (logistic regression, SVM and others), but they weren't good enough. Random Forest proved to be better. Parameters were obtained with CVgridsearch"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Loss on validation set:  0.488063577439\n"
     ]
    }
   ],
   "source": [
    "clf = RandomForestClassifier(n_estimators=300, n_jobs=-1, criterion = 'gini')\n",
    "calibrated_clf = CalibratedClassifierCV(clf, method='isotonic', cv=5)\n",
    "calibrated_clf.fit(Xtrain, ytrain)\n",
    "\n",
    "y_val = calibrated_clf.predict_proba(Xtest)\n",
    "y_submit = calibrated_clf.predict_proba(X_test)\n",
    "print(\"Loss on validation set: \", log_loss(ytest, y_val, eps=1e-15, normalize=True))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "I decided to add XGBoost to improve the model and it helped. The final result was using weighted predictions from both models."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "params = {\"objective\": \"multi:softprob\", \"num_class\": 9}\n",
    "gbm = xgb.train(params, xgb.DMatrix(X_train, Y_train), 20)\n",
    "Y_pred = gbm.predict(xgb.DMatrix(X_test))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "y = 0.2 * Y_pred + 0.8 * y_submit"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "sample = pd.read_csv('../input/sampleSubmission.csv')\n",
    "preds = pd.DataFrame(y, index=sample.id.values, columns=sample.columns[1:])\n",
    "preds.to_csv('Otto.csv', index_label='id')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "This competition has already ended, but people still can submit their solutions and see their scores. Top places have score of ~0.38.\n",
    "\n",
    "My ensemble model got a score of 0.48926."
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [Root]",
   "language": "python",
   "name": "Python [Root]"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}