{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction\n", "In this article we present our Cats-Dogs-Classifier, which can tell whether a given image shows a dog or a cat with \n", "an accuracy of 80%. We achieve this by reproducing the results of the paper [Machine Learning Attacks \n", "Against the Asirra CAPTCHA (Philippe Golle)](http://xenon.stanford.edu/~pgolle/papers/dogcat.pdf). The \"Cats vs Dogs - Classification problem\" raised a lot of interest in the context of the [Kaggle \"Dogs vs. Cats\" competition]( http://www.kaggle.com/c/dogs-vs-cats). Our classifier is built on top of the python scientific eco-system.\n", "We expect our reader to have read the mentioned paper.\n", "\n", "# Organisation of this article\n", "In the spirit of \"open science\", we make our source code easily available to you, so that you can play with it and reproduce the results easily. We therefore placed the whole source code in a GitHub repository.\n", "You can use this notebook to play interactively with the data, but we also make it available as a [static html page](https://github.com/Safadurimo/cats-and-dogs) for those who just want to have a quick look.\n", "\n", "Along this documentation, you will find the following files in the repository:\n", "\n", "* download.sh Script which downloads the data\n", "* resize.sh Script which resizes the data\n", "* README.md README with installation instructions\n", "\n", "# Used tools\n", "We performed our calculation using a 64core/512gb compute server, having four 16-core AMD \"Abu Dhabi\" 6376 CPUs (2.3GHz standard clockrate).\n", "\n", "We used the following software along with its version numbers:\n", "\n", "* Python 2.7.3\n", "* sklearn 0.14.1\n", "* scipy 0.10.1\n", "* numpy 1.8.1\n" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import os\n", "import multiprocessing\n", "import re\n", "import time\n", "import random\n", "from glob import glob\n", "import itertools\n", "import pickle\n", "\n", "import numpy as np\n", "\n", "import skimage\n", "from skimage import io\n", "\n", "from sklearn import cross_validation\n", "from sklearn import svm\n", "from sklearn import preprocessing\n", "from sklearn.linear_model.logistic import LogisticRegression\n" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Data\n", "We use as input data the data from the kaggle competition. The Data came as images of different sizes. As the features we will use are based on fixed sizes images and the Asirra challenge also presents images of fixed size, we resize all the images to 250×250 pixels. If the picture is not a square, we use a white background. That seems to be the same way as the picture are presented in the asirra challenge, so our data should be very similar to the data used in the article. The script \"resize.sh\" will do the job for us. We will shuffle the list of filenames randomly and take the first 10,000 files for all our calculations." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "def build_file_list(dir):\n", " \"\"\" Given a directory, it builds a shuffled list of the file \"\"\"\n", " random.seed(42)\n", " image_filenames = glob('{}/*.jpg'.format(dir))\n", " image_filenames.sort() # make the function independent of the order your operating system returns the files\n", " random.shuffle(image_filenames)\n", " return image_filenames\n", "\n", "def build_labels(file_list,n_samples=None):\n", " \"\"\" build the labels from the filenames: cats corresponds to a 1, dogs corresonds to a -1 \"\"\"\n", " if(n_samples==None): n_samples=len(file_list)\n", " n_samples=max(n_samples,len(file_list))\n", " file_list = file_list[:n_samples]\n", " y = np.zeros(n_samples,dtype=np.int32)\n", " for (i,f) in enumerate(file_list):\n", " if \"dog\" in str(f): \n", " y[i]=-1\n", " else:\n", " y[i]=1\n", " assert(\"cat\" in str(f)) \n", " return y\n" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 2 }, { "cell_type": "code", "collapsed": false, "input": [ "file_list = build_file_list(\"data/train_resized\")\n", "pickle.dump(file_list, open(\"file_list.pkl\",\"wb\"))\n", "\n", "y=build_labels(file_list,n_samples=None)\n", "np.save('y',y)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 3 }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Building Color feature\n", "The following functions build the feature matrices for the color features exactly as described in the paper." ] }, { "cell_type": "code", "collapsed": false, "input": [ "def file_to_rgb(filename):\n", " \"\"\" return an image in rgb format: a gray scale image will be converted, a rgb image will be left untouched\"\"\"\n", " bild = io.imread(filename)\n", " if (bild.ndim==2):\n", " rgb_bild= skimage.color.gray2rgb(bild)\n", " else:\n", " rgb_bild = bild\n", " return rgb_bild\n", "\n", "def hsv_to_feature(hsv,N,C_h,C_s,C_v):\n", " \"\"\" Takes an hsv picture and returns a feature vector for it.\n", " The vector is built as described in the paper 'Machine Learning Attacks Against the Asirra CAPTCHA' \"\"\" \n", " res = np.zeros((N,N,C_h,C_s,C_v))\n", " cell_size= 250/N\n", " h_range = np.arange(0.0,1.0,1.0/C_h)\n", " h_range = np.append(h_range,1.0)\n", " s_range = np.arange(0.0,1.0,1.0/C_s)\n", " s_range = np.append(s_range,1.0)\n", " v_range = np.arange(0.0,1.0,1.0/C_v)\n", " v_range = np.append(v_range,1.0)\n", " for i in range(N):\n", " for j in range(N):\n", " cell= hsv[i*cell_size:i*cell_size+cell_size,j*cell_size:j*cell_size+cell_size,:]\n", " # check for h\n", " for h in range(C_h):\n", " h_cell = np.logical_and(cell[:,:,0]>=h_range[h],cell[:,:,0]=s_range[s],cell[:,:,1]=v_range[v],cell[:,:,2]