{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction\n", "In this article we present our Cats-Dogs-Classifier, which can tell whether a given image shows a dog or a cat with \n", "an accuracy of 80%. We achieve this by reproducing the results of the paper [Machine Learning Attacks \n", "Against the Asirra CAPTCHA (Philippe Golle)](http://xenon.stanford.edu/~pgolle/papers/dogcat.pdf). The \"Cats vs Dogs - Classification problem\" raised a lot of interest in the context of the [Kaggle \"Dogs vs. Cats\" competition]( http://www.kaggle.com/c/dogs-vs-cats). Our classifier is built on top of the python scientific eco-system.\n", "We expect our reader to have read the mentioned paper.\n", "\n", "# Organisation of this article\n", "In the spirit of \"open science\", we make our source code easily available to you, so that you can play with it and reproduce the results easily. We therefore placed the whole source code in a GitHub repository.\n", "You can use this notebook to play interactively with the data, but we also make it available as a [static html page](https://github.com/Safadurimo/cats-and-dogs) for those who just want to have a quick look.\n", "\n", "Along this documentation, you will find the following files in the repository:\n", "\n", "* download.sh Script which downloads the data\n", "* resize.sh Script which resizes the data\n", "* README.md README with installation instructions\n", "\n", "# Used tools\n", "We performed our calculation using a 64core/512gb compute server, having four 16-core AMD \"Abu Dhabi\" 6376 CPUs (2.3GHz standard clockrate).\n", "\n", "We used the following software along with its version numbers:\n", "\n", "* Python 2.7.3\n", "* sklearn 0.14.1\n", "* scipy 0.10.1\n", "* numpy 1.8.1\n" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import os\n", "import multiprocessing\n", "import re\n", "import time\n", "import random\n", "from glob import glob\n", "import itertools\n", "import pickle\n", "\n", "import numpy as np\n", "\n", "import skimage\n", "from skimage import io\n", "\n", "from sklearn import cross_validation\n", "from sklearn import svm\n", "from sklearn import preprocessing\n", "from sklearn.linear_model.logistic import LogisticRegression\n" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Data\n", "We use as input data the data from the kaggle competition. The Data came as images of different sizes. As the features we will use are based on fixed sizes images and the Asirra challenge also presents images of fixed size, we resize all the images to 250×250 pixels. If the picture is not a square, we use a white background. That seems to be the same way as the picture are presented in the asirra challenge, so our data should be very similar to the data used in the article. The script \"resize.sh\" will do the job for us. We will shuffle the list of filenames randomly and take the first 10,000 files for all our calculations." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "def build_file_list(dir):\n", " \"\"\" Given a directory, it builds a shuffled list of the file \"\"\"\n", " random.seed(42)\n", " image_filenames = glob('{}/*.jpg'.format(dir))\n", " image_filenames.sort() # make the function independent of the order your operating system returns the files\n", " random.shuffle(image_filenames)\n", " return image_filenames\n", "\n", "def build_labels(file_list,n_samples=None):\n", " \"\"\" build the labels from the filenames: cats corresponds to a 1, dogs corresonds to a -1 \"\"\"\n", " if(n_samples==None): n_samples=len(file_list)\n", " n_samples=max(n_samples,len(file_list))\n", " file_list = file_list[:n_samples]\n", " y = np.zeros(n_samples,dtype=np.int32)\n", " for (i,f) in enumerate(file_list):\n", " if \"dog\" in str(f): \n", " y[i]=-1\n", " else:\n", " y[i]=1\n", " assert(\"cat\" in str(f)) \n", " return y\n" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 2 }, { "cell_type": "code", "collapsed": false, "input": [ "file_list = build_file_list(\"data/train_resized\")\n", "pickle.dump(file_list, open(\"file_list.pkl\",\"wb\"))\n", "\n", "y=build_labels(file_list,n_samples=None)\n", "np.save('y',y)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 3 }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Building Color feature\n", "The following functions build the feature matrices for the color features exactly as described in the paper." ] }, { "cell_type": "code", "collapsed": false, "input": [ "def file_to_rgb(filename):\n", " \"\"\" return an image in rgb format: a gray scale image will be converted, a rgb image will be left untouched\"\"\"\n", " bild = io.imread(filename)\n", " if (bild.ndim==2):\n", " rgb_bild= skimage.color.gray2rgb(bild)\n", " else:\n", " rgb_bild = bild\n", " return rgb_bild\n", "\n", "def hsv_to_feature(hsv,N,C_h,C_s,C_v):\n", " \"\"\" Takes an hsv picture and returns a feature vector for it.\n", " The vector is built as described in the paper 'Machine Learning Attacks Against the Asirra CAPTCHA' \"\"\" \n", " res = np.zeros((N,N,C_h,C_s,C_v))\n", " cell_size= 250/N\n", " h_range = np.arange(0.0,1.0,1.0/C_h)\n", " h_range = np.append(h_range,1.0)\n", " s_range = np.arange(0.0,1.0,1.0/C_s)\n", " s_range = np.append(s_range,1.0)\n", " v_range = np.arange(0.0,1.0,1.0/C_v)\n", " v_range = np.append(v_range,1.0)\n", " for i in range(N):\n", " for j in range(N):\n", " cell= hsv[i*cell_size:i*cell_size+cell_size,j*cell_size:j*cell_size+cell_size,:]\n", " # check for h\n", " for h in range(C_h):\n", " h_cell = np.logical_and(cell[:,:,0]>=h_range[h],cell[:,:,0]=s_range[s],cell[:,:,1]=v_range[v],cell[:,:,2]