{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# GPyOpt: armed bandits optimization\n", "\n", "### Written by Javier Gonzalez, Amazon Reseach Cambridge, UK.\n", "\n", "\n", "*Last updated Monday, 22 May 2016.*\n", "\n", "In this notebook we will see how to do armed bandits optimization with GPyOpt. To do this will use data of weather forecasts at weather stations across more that 10.000 locations in the United States. The project [OpenWeatherMap project](http://openweathermap.org/) provides an API service to download this information and at that [dataset](https://github.com/WeatherStudy/weather_study) it is possible to find the weather forecasts for these stations. In this notebook we will use the file target_day_20140422.dat that contains the weather forecasts for each station in the United States for the April 22, 2014. The latitude and longitude of the stations is available as well as the forecasts for the next 7 days.\n", "\n", "We start by loading the packages that we will need in out analysis." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import matplotlib as mpl\n", "import matplotlib.pyplot as plt\n", "import numpy as np \n", "import GPyOpt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we load the data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "filename='./data/target_day_20140422.dat'\n", "f = open(filename, 'r')\n", "contents = f.readlines()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we process the dataset. We selected only the data corresponding to the day April 22, 2014 and we remove the stations in Alaska and the US islands. The fist part of the next cell was taken from this [matplotlib tutorial](https://github.com/ginaschmalzle/pyladies_matplotlib_ipython_notebooks/blob/master/Matplotlib_tutorial.ipynb)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## Create a dictionary for the forecasted\n", "forecast_dict = {}\n", "for line in range(1, len(contents)):\n", " line_split = contents[line].split(' ')\n", " try:\n", " forecast_dict[line_split[0], line_split[1]][line_split[2]] = {'MaxT':float(line_split[3]),\n", " 'MinT':float(line_split[4][:-1])}\n", " except:\n", " forecast_dict[line_split[0], line_split[1]] = {}\n", " forecast_dict[line_split[0], line_split[1]][line_split[2]] = {'MaxT':float(line_split[3]),\n", " 'MinT':float(line_split[4][:-1])}\n", "keys = forecast_dict.keys()\n", "day_out = '0' # 0-7\n", "temp = 'MaxT' # MaxT or MinT\n", "temperature = []; lat = []; lon = []\n", "for key in keys:\n", " temperature.append(float(forecast_dict[key][day_out][temp]))\n", " lat.append(float(key[0]))\n", " lon.append(float(key[1]))\n", " \n", "## Create numpy arrays for the analyisis and remove Alaska and the islands\n", "lon = np.array(lon)\n", "lat = np.array(lat)\n", "sel = np.logical_and(np.logical_and(lat>24,lat<51),np.logical_and(lon> -130, lon <-65))\n", "stations_coordinates_all = np.array([lon,lat]).T\n", "stations_maxT_all = np.array([temperature]).T\n", "stations_coordinates = stations_coordinates_all[sel,:]\n", "stations_maxT = stations_maxT_all[sel,:]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Check the total number of stations.\n", "stations_maxT.shape[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The array *stations_coordinates* contains the longitude and latitude of the weather stations and *stations_maxT* contains the maximum temperature value recorded in those locations on the April 22, 2014. Next we make a plot of all available stations." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(12,7))\n", "sc = plt.scatter(stations_coordinates[:,0],stations_coordinates[:,1], c='b', s=2, edgecolors='none')\n", "plt.title('US weather stations',size=25)\n", "plt.xlabel('Logitude',size=15)\n", "plt.ylabel('Latitude',size=15)\n", "plt.ylim((25,50))\n", "plt.xlim((-128,-65))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our goal is to find the **coldest stations** in this map using the **minumum number of queries**. We use the full dataset to create this objective function." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Class that defines the function to optimize given the available locations\n", "class max_Temp(object):\n", " def __init__(self,stations_coordinates,stations_maxT):\n", " self.stations_coordinates = stations_coordinates\n", " self.stations_maxT = stations_maxT\n", "\n", " def f(self,x):\n", " return np.dot(0.5*(self.stations_coordinates == x).sum(axis=1),self.stations_maxT)[:,None]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The class *max_Temp* returns the temperature of each station everytime is queried with the coordinates of one of the available stations. To use it for this optimization example we create and instance of it." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Objective function given the current inputs\n", "func = max_Temp(stations_coordinates,stations_maxT)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our design space is now the coordinates of the weather stations. We crete it:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "domain = [{'name': 'stations', 'type': 'bandit', 'domain':stations_coordinates }] # armed bandit with the locations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we create the GPyOpt object. We will initilize the process with 50 stations, assume that the data are noisy, and we won't normalize the outputs. A seed is used for reproducibility" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from numpy.random import seed\n", "seed(123)\n", "myBopt = GPyOpt.methods.BayesianOptimization(f=func.f, # function to optimize \n", " domain=domain, \n", " initial_design_numdata = 5,\n", " acquisition_type='EI',\n", " exact_feval = True,\n", " normalize_Y = False,\n", " optimize_restarts = 10,\n", " acquisition_weight = 2,\n", " de_duplication = True) " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "myBopt.model.model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We run the optimization for a maximum of 50 iterations" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Run the optimization\n", "max_iter = 50 # evaluation budget\n", "myBopt.run_optimization(max_iter) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "GPyOpt prints a message to say that the optimization was stopped because the same location was selected twice. Let's have a look to the results. We plot the map with the true temperature of the stations, the coldest one and the best found location. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(15,7))\n", "jet = plt.cm.get_cmap('jet')\n", "sc = plt.scatter(stations_coordinates[:,0],stations_coordinates[:,1], c=stations_maxT[:, 0], vmin=0, vmax =35, cmap=jet, s=3, edgecolors='none')\n", "cbar = plt.colorbar(sc, shrink = 1)\n", "cbar.set_label(temp)\n", "plt.plot(myBopt.x_opt[0],myBopt.x_opt[1],'ko',markersize=10, label ='Best found')\n", "plt.plot(myBopt.X[:,0],myBopt.X[:,1],'k.',markersize=8, label ='Observed stations')\n", "plt.plot(stations_coordinates[np.argmin(stations_maxT),0],stations_coordinates[np.argmin(stations_maxT),1],'k*',markersize=15, label ='Coldest station')\n", "plt.legend()\n", "plt.ylim((25,50))\n", "plt.xlim((-128,-65))\n", "\n", "plt.title('Max. temperature: April, 22, 2014',size=25)\n", "plt.xlabel('Longitude',size=15)\n", "plt.ylabel('Latitude',size=15)\n", "plt.text(-125,28,'Total stations =' + str(stations_maxT.shape[0]),size=20)\n", "plt.text(-125,26.5,'Sampled stations ='+ str(myBopt.X.shape[0]),size=20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The coldest and the selected locations are very close. Note that, in total, only three evaluations were necesary to find this stations. Of course, different results can be found with different initilizations, models, acquisition, etc. To finish, we plot the value of the temperature in the best found station over the histogram of all temperatures." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(8,5))\n", "xx= plt.hist(stations_maxT,bins =50)\n", "plt.title('Distribution of max. temperatures',size=25)\n", "plt.vlines(min(stations_maxT),0,1000,lw=3,label='Coldest station')\n", "plt.vlines(myBopt.fx_opt,0,1000,lw=3,linestyles=u'dotted',label='Best found')\n", "plt.legend()\n", "plt.xlabel('Max. temperature',size=15)\n", "plt.xlabel('Frequency',size=15)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "We see that it is indeed one of the coldest stations." ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1" } }, "nbformat": 4, "nbformat_minor": 1 }