# Deep Learning Tutorial

#### Abram Hindle
#### http://softwareprocess.ca/

Slides stolen gracefully from Ben Zittlau

Slide content under CC-BY-SA 4.0 and MIT License for source code or the same license as Python3 or Keras. Slide Source code is MIT License as well.




# Intro
### What is machine learning?

Building a function from data to classify, predict, group, or represent data.



# Intro
### Machine Learning

There are a few kinds of tasks or functions that could help us here.

* Classification: given some input, predict the class that it belongs
 to. Given a point, is it in the red or in the blue?
* Regression: given a point, what will its value be? This is
 appropriate when the function has a continuous output or a large
 number of discrete outputs.
* Representation: learn a smaller representation of the input
 data. E.g. we have 300 features; let's describe them in a 128-bit hash.




# Intro
### Motivational Example

Imagine we have this data:

![2 crescent slices](images/slice.png "A function we want to learn
 f(x,y) -> z where z is red")

[See src/genslice.py to see how we made it.](src/genslice.py)



In [1]:
# purpose: make a slight difference of circles be the dataset to learn.
import numpy as np
gY, gX = np.meshgrid(np.arange(1000)/1000.0,np.arange(1000)/1000.0)

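def intersect_circle(cx,cy,radius):
 equation = (gX - float(cx)) ** 2 + (gY - float(cy)) ** 2
 # NOTE: this compares against radius**3, not radius**2, so the effective
 # radius is sqrt(radius**3): rad = 0.3 gives about 0.1643 (the commented-out value below)
 matches = equation < (radius**3) 
 return matches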

# rad = 0.1643167672515498
rad = 0.3
x = intersect_circle(0.5,0.5,rad) ^ intersect_circle(0.51,0.51,rad)

def plotit(x):
 import matplotlib.pyplot as plt
 plt.imshow(x)
 plt.savefig('new-slice.png') # was slice.png
 plt.imshow(x)
 plt.savefig('new-slice.pdf') # was slice.pdf
 plt.show()

# plotit(x)

def mkcol(x):
 return x.reshape((x.shape[0]*x.shape[1],1))

# make the data set
big = np.concatenate((mkcol(gX),mkcol(gY),mkcol(1*x)),axis=1)
np.savetxt("new-big-slice.csv", big, delimiter=",")

# make a 50/50 data set
nots = big[big[0:,2]==0.0,]
np.random.shuffle(nots)
nots = nots[0:1000,]
trues = big[big[0:,2]==1.0,]
np.random.shuffle(trues)
trues = trues[0:1000,]
small = np.concatenate((trues,nots))
np.savetxt("new-small-slice.csv", small, delimiter=",")


# Intro
### Make your own function

``` python
def in_circle(x,y,cx,cy,radius):
 return (x - float(cx)) ** 2 + (y - float(cy)) ** 2 < radius**2

def mysolution(pt,outer=0.3):
 return in_circle(pt[0],pt[1],0.5,0.5,outer) and not in_circle(pt[0],pt[1],0.5,0.5,0.1)
```

```
>>> myclasses = np.apply_along_axis(mysolution,1,test[0])
>>> print("My classifier!")
My classifier!
>>> print("%s / %s " % (sum(myclasses == test[1]),len(test[1])))
181 / 200 
>>> print(theautil.classifications(myclasses,test[1]))
[('tp', 91), ('tn', 90), ('fp', 19), ('fn', 0)]
```
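
The `theautil.classifications` helper is not shown on these slides. As a rough sketch, assuming it simply counts true/false positives and negatives for binary labels, the tuples above could be produced and summarised like this. `confusion_counts` and `precision_recall` are hypothetical names for illustration, not the actual `theautil` code.

``` python
import numpy as np

def confusion_counts(predicted, actual):
    """Count tp/tn/fp/fn for binary labels (hypothetical stand-in, not theautil itself)."""
    predicted = np.asarray(predicted).astype(bool)
    actual = np.asarray(actual).astype(bool)
    return [
        ("tp", int(np.sum(predicted & actual))),
        ("tn", int(np.sum(~predicted & ~actual))),
        ("fp", int(np.sum(predicted & ~actual))),
        ("fn", int(np.sum(~predicted & actual))),
    ]

def precision_recall(counts):
    """precision = tp / (tp + fp), recall = tp / (tp + fn)."""
    c = dict(counts)
    predicted_pos = c["tp"] + c["fp"]
    actual_pos = c["tp"] + c["fn"]
    precision = c["tp"] / predicted_pos if predicted_pos else 0.0
    recall = c["tp"] / actual_pos if actual_pos else 0.0
    return precision, recall

# tiny made-up example
counts = confusion_counts([1, 1, 0, 0, 1], [1, 0, 0, 0, 1])
print(counts)
print(precision_recall(counts))
```

With the counts printed above (tp 91, fp 19, fn 0), precision is 91/110, roughly 0.83, while recall is 1.0: the hand-written classifier never misses a positive but over-predicts them.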



# Intro 
### An example classifier

1-NN: 1 Nearest Neighbor.

Given the data, we produce a function that
outputs the CLASS of the nearest neighbour to the input data.

Whichever training point is closest decides the class. 3-NN is 3
nearest neighbours: the 3 closest points vote on the class instead.
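
The voting idea can be sketched as follows. This is a minimal illustration, assuming majority voting over squared Euclidean distance; the function name `kNN` and the toy points are made up for this example and are not part of `src/slice-classifier.py`.

``` python
import collections
import numpy as np

def kNN(data, labels, k=3):
    """Return a classifier that lets the k nearest training points vote."""
    data = np.asarray(data, dtype=float)
    labels = np.asarray(labels)
    def func(point):
        # squared Euclidean distance from the query point to every training point
        d = np.sum((data - np.asarray(point, dtype=float)) ** 2, axis=1)
        nearest = np.argsort(d)[:k]
        votes = collections.Counter(labels[nearest].tolist())
        return votes.most_common(1)[0][0]
    return func

# tiny made-up example: two clusters of three points each
points = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [1.0, 1.0], [0.9, 1.0], [1.0, 0.9]]
classes = [0, 0, 0, 1, 1, 1]
learner = kNN(points, classes, k=3)
print(learner([0.2, 0.2]), learner([0.8, 0.8]))
```

With two classes, an odd `k` avoids tied votes; `k=1` reduces to the 1-NN case.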



# Intro
### An example classifier: 1-NN

[src/slice-classifier.py](src/slice-classifier.py)

``` python
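def euclid(pt1,pt2):
    # squared Euclidean distance; skipping the square root does not change which point is nearest
    return sum([ (pt1[i] - pt2[i])**2 for i in range(0,len(pt1)) ])

def oneNN(data,labels):
    def func(input):
        distance = None
        label = None
        # scan every training point and remember the label of the closest one
        for i in range(0,len(data)):
            d = euclid(input,data[i])
            if distance is None or d < distance:
                distance = d
                label = labels[i]
        return label
    return func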
```




# Intro
### An example classifier: 1-NN

``` python
>>> learner = oneNN(train[0],train[1])
>>> 
>>> oneclasses = np.apply_along_axis(learner,1,test[0])
>>> print("1-NN classifier!")
1-NN classifier!
>>> print("%s / %s " % (sum(oneclasses == test[1]),len(test[1])))
198 / 200 
>>> print(theautil.classifications(oneclasses,test[1]))
[('tp', 91), ('tn', 107), ('fp', 2), ('fn', 0)]

```

1-NN performs very well in this example, but it relies on Euclidean
distance, and the 50/50 dataset is heavily biased toward the positive
class compared with the function it was sampled from.

Thus we showed a simple learner that classifies data.



# Intro

* That's really interesting performance and it worked but will it
 scale and continue to work?

* 1-NN doesn't work for all problems. It depends on the distance
 measure (here Euclidean distance) being meaningful for the data, and
 it has to keep and scan the whole training set.

* What if our problem is non-linear?




In [1]:
#
# The MIT License (MIT)
# 
# Copyright (c) 2016 Abram Hindle , Leif Johnson 
# 
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
# 
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
# 
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.

# first off we load up some modules we want to use
# import keras
import tensorflow.keras as keras
import scipy
import math
import numpy as np
import numpy.random as rnd
import logging
import sys
import collections
import theautil

# setup logging
logging.basicConfig(stream = sys.stderr, level=logging.INFO)

mupdates = 1000
data = np.loadtxt("small-slice.csv", delimiter=",")
inputs = data[0:,0:2].astype(np.float32)
outputs = data[0:,2:3].astype(np.int32)

theautil.joint_shuffle(inputs,outputs)

train_and_valid, test = theautil.split_validation(90, inputs, outputs)
train, valid = theautil.split_validation(90, train_and_valid[0], train_and_valid[1])
# train and valid are (inputs, labels) tuples, so count the rows of the inputs
print("Train: %s Valid: %s"%(len(train[0]),len(valid[0])))
print("Train X: %s Y^:%s"%(train[0].shape,train[1].shape))
print("Valid X: %s Y^:%s"%(valid[0].shape,valid[1].shape))


Train: 1620 Valid: 180
Train X: (1620, 2) Y^:(1620, 1)
Valid X: (180, 2) Y^:(180, 1)


In [2]:

def linit(x):
 return x.reshape((len(x),))

mltrain = (train[0],linit(train[1]))
mlvalid = (valid[0],linit(valid[1]))
mltest = (test[0] ,linit(test[1]))

# my solution
def in_circle(x,y,cx,cy,radius):
 return (x - float(cx)) ** 2 + (y - float(cy)) ** 2 < radius**2

def mysolution(pt,outer=0.3):
 return in_circle(pt[0],pt[1],0.5,0.5,outer) and \
 not in_circle(pt[0],pt[1],0.5,0.5,0.1)

# apply my classifier
myclasses = np.apply_along_axis(mysolution,1,mltest[0])
print("My classifier!")
print("%s / %s " % (sum(myclasses == mltest[1]),len(mltest[1])))
print(theautil.classifications(myclasses,mltest[1]))

My classifier!
176 / 200 
[('tp', 94), ('tn', 82), ('fp', 24), ('fn', 0)]


In [3]:
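def euclid(pt1,pt2):
    # squared Euclidean distance; skipping the square root does not change which point is nearest
    return sum([ (pt1[i] - pt2[i])**2 for i in range(0,len(pt1)) ])

def oneNN(data,labels):
    def func(input):
        distance = None
        label = None
        # scan every training point and remember the label of the closest one
        for i in range(0,len(data)):
            d = euclid(input,data[i])
            if distance is None or d < distance:
                distance = d
                label = labels[i]
        return label
    return func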

learner = oneNN(mltrain[0],mltrain[1])

oneclasses = np.apply_along_axis(learner,1,mltest[0])
print("1-NN classifier!")
print("%s / %s " % (sum(oneclasses == mltest[1]),len(mltest[1])))
print(theautil.classifications(oneclasses,mltest[1]))

toneclasses = np.apply_along_axis(learner,1,mltrain[0])
print("1-NN classifier! ON TRAIN")
print("%s / %s " % (sum(toneclasses == mltrain[1]),len(mltrain[1])))
print(theautil.classifications(toneclasses,mltrain[1]))



1-NN classifier!
198 / 200 
[('tp', 94), ('tn', 104), ('fp', 2), ('fn', 0)]
1-NN classifier! ON TRAIN
1620 / 1620 
[('tp', 816), ('tn', 804), ('fp', 0), ('fn', 0)]



# Intro

* Neural networks are popular
 * Creating AI for Go
 * Labeling Images with cats and dogs
 * Speech Recognition
 * Text summarization
 * [Guitar Transcription](https://peerj.com/preprints/1193.pdf)
 * Learn audio from video[1](https://archive.org/details/DeepLearningBitmaptoPCM/)[2](http://softwareprocess.es/blog/blog/2015/08/10/deep-learning-bitmaps-to-pcm/)

* Neural networks can not only classify; they can also create content
 and produce complicated outputs.

* Neural networks are generative!


# Intro
### Machine Learning: Neural Networks

Neural networks or "Artificial Neural Networks" are a flexible class
of non-linear machine learners. They have been found to be quite
effective as of late.

Neural networks are composed of neurons. These neurons try to emulate
biological neurons in the most metaphorical of senses. Given a set of
inputs they produce an output.




## Neurons

Each neuron applies an activation function.

* Rectified Linear Units (ReLU) have been shown to train quite well and
 achieve good results. But they are not differentiable at x = 0.
 f(x) = max(0,x)
* Sigmoid functions were the classical neural network neuron, but they
 are slow and have fallen out of favour. They will work when nothing
 else will. f(x) = 1/(1 + e^-x)
* Softplus is a smooth approximation of ReLU that is slower to compute
 but differentiable everywhere. f(x) = ln(1 + e^x)
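
The three functions listed above can be written directly in NumPy. This is only an illustrative sketch; the function names are chosen here for clarity, and Keras ships its own implementations.

``` python
import numpy as np

def relu(x):
    # f(x) = max(0, x)
    return np.maximum(0.0, x)

def sigmoid(x):
    # f(x) = 1 / (1 + e^-x)
    return 1.0 / (1.0 + np.exp(-x))

def softplus(x):
    # f(x) = ln(1 + e^x), written as log(e^0 + e^x) for numerical stability
    return np.logaddexp(0.0, x)

x = np.array([-2.0, 0.0, 2.0])
print(relu(x))
print(sigmoid(x))
print(softplus(x))
```

One property worth knowing: the derivative of softplus is the sigmoid, which is why softplus is smooth where ReLU has a kink.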




## Neurons

![Rectifier and Sigmoid and Softplus](images/Rectifier_and_softplus_functions.svg)






## Neurons

The input to a neuron? The outputs of the connected nodes, each
multiplied by its weight, summed together with a bias.

neuron(inputs) = neuron_f( sum(weights * inputs) + bias )

![Neuron example](images/neuron.png)


In [4]:
import numpy
# this is a neuron
def tanh(weights,inputs,bias=0):
 return numpy.tanh(numpy.sum(weights * inputs) + bias)
# XOR
train_x = numpy.array([[0,0],[0,1],[1,0],[1,1]])
train_y = numpy.array([0,1,1,0])
def rms(train_y_hat, train_y):
 return numpy.sqrt(numpy.mean((train_y_hat - train_y)**2))
best = None
weights = numpy.array([0.4,0.3,0.6])
train_y_hat = [tanh(weights[0:2],t,weights[2]) for t in train_x]
res = (rms(train_y_hat, train_y), train_y_hat, weights)
print(res)

# random search: draw weights uniformly from [-2, 2) and keep the best set seen so far
for i in range(0,10000):
    weights = 4*numpy.random.rand(3)-2
    train_y_hat = [tanh(weights[0:2],t,weights[2]) for t in train_x]
    res = (rms(train_y_hat, train_y), train_y_hat, weights)
    #print(res)
    if best is None or res[0] < best[0]:
        print(f"Better {i}",res)
        best = res
print("Best result!")
print(best)


(0.5404427087207618, [0.5370495669980353, 0.7162978701990245, 0.7615941559557649, 0.8617231593133063], array([0.4, 0.3, 0.6]))
Better 0 (1.4866137432899655, [-0.6647188082502699, -0.9918813653217701, -0.8539141268422795, -0.9968157451060176], array([-0.46920893, -1.95011547, -0.80122097]))
Better 2 (1.4838992342049964, [-0.5421077685734538, -0.9073291027427597, -0.9711837558377038, -0.9952272157156234], array([-1.50558934, -0.90506655, -0.60713579]))
Better 3 (1.342187408032217, [-0.36881990441710016, -0.9462765269984749, -0.5336489328059542, -0.9642451257769672], array([-0.20817668, -1.40785515, -0.38705651]))
Better 4 (0.980351397497156, [-0.8888400983919069, -0.35220470123465963, 0.3338958373396271, 0.8844017301173606], array([ 1.76358079, 1.0484158 , -1.41637425]))
Better 7 (0.7086348569632173, [-0.7922954726760598, 0.5944221722158192, 0.49450272250389854, 0.9802553004416915], array([ 1.61957104, 1.76204491, -1.07756797]))
Better 13 (0.5189404356173829, [-0.08211296642629633, 0.701

In [5]:
# now with 3 neurons

# input1---+--->neuron1---\
#           \ /            \
#            X              +--->neuron3--->result
#           / \            /
# input2---+--->neuron2---/

import numpy
# this is a neuron
def tanh(weights,inputs,bias=0):
 return numpy.tanh(numpy.sum(weights * inputs) + bias)
# XOR
train_x = numpy.array([[0,0],[0,1],[1,0],[1,1]])
train_y = numpy.array([0,1,1,0])
def rms(train_y_hat, train_y):
 return numpy.sqrt(numpy.mean((train_y_hat - train_y)**2))
best = None

def random_weights():
 return 4*numpy.random.rand(9)-2
# input1---+--->neuron1---\
#           \ /            \
#            X              +--->neuron3--->result
#           / \            /
# input2---+--->neuron2---/
def network(weights,t):
 inputs = [tanh(weights[0:2],t,weights[2]), #neuron1
 tanh(weights[4:6],t,weights[6])] #neuron2
 w = numpy.array([weights[3],weights[7]]) #weights for neuron3
 return tanh(w,inputs,weights[8]) #neuron3
 

# random search over all 9 parameters of the 3-neuron network
for i in range(0,100000):
    weights = random_weights()
    train_y_hat = [network(weights,t) for t in train_x]
    res = (rms(train_y_hat, train_y), train_y_hat, weights)
    #print(res)
    if best is None or res[0] < best[0]:
        print(f"Better {i}",res)
        best = res
print("Best result!")
print(best)

Better 0 (1.0392997571367157, [-0.9885546885269915, -0.37812464265386203, 0.1748236970017731, 0.8736086691473437], array([-1.80513234, -1.25921525, 1.54565005, -1.14505175, 0.99035643,
 1.03340689, -0.07836691, 1.77347142, -1.39458737]))
Better 1 (1.0346776683852625, [0.1125591802577348, -0.5852577386616211, -0.1746527036830924, -0.6137677643081333], array([ 1.69555135, 1.77740615, -1.7353888 , -0.3835196 , 0.88772935,
 -1.44837477, 1.13666868, 0.36490135, -0.54412153]))
Better 3 (0.5366106228234248, [0.6423068618632775, 0.819742337575398, 0.47827851567542007, 0.6592113509660609], array([-1.94740825, 1.21970186, 0.04489612, 0.3554685 , -0.2692329 ,
 -0.2536154 , 0.85163114, -0.68409661, 1.21948298]))
Better 40 (0.5128192232498399, [-0.07388123393620917, 0.7816640898528522, 0.9981015976723836, 0.9994006266853941], array([-1.96901213, 0.22350755, 0.05445328, -1.94101921, 0.92330994,
 0.88761934, -0.24228396, 1.91535551, 0.48676225]))
Better 138 (0.49253392181860345, [0.019065442846213504



## Multi-layer perceptron

Single hidden layer neural network.

![Multi-layer perceptron](images/20160208141015.png)






## Deep Learning

There's nothing particularly crazy about deep learning other than it has more hidden layers.

These hidden layers let the network build up intermediate representations and address the intricacies of complex functions. But each hidden layer adds more weights, and with them a lot of search space.




## Deep Learning

![Deep network, multiple layers](images/20160208141143.png)






## Search

How do we find the different weights?

Well, we need to search a large space. A 2x3x2 network will have
2*3 + 3*2 = 12 weights + 5 biases (3 hidden, 2 output), resulting in 17
parameters. That's already a large search space.
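
The count can be checked mechanically: a fully connected layer going from `a` nodes to `b` nodes has `a*b` weights and `b` biases. The helper below is a small sketch written for this tutorial (`count_parameters` is a made-up name, not a Keras function).

``` python
def count_parameters(layer_sizes):
    """Return (weights, biases, total) for a fully connected network."""
    # one weight per connection between adjacent layers
    weights = sum(a * b for a, b in zip(layer_sizes[:-1], layer_sizes[1:]))
    # one bias per non-input node
    biases = sum(layer_sizes[1:])
    return weights, biases, weights + biases

print(count_parameters([2, 3, 2]))
print(count_parameters([2, 16, 32, 2]))
```

The first call gives 12 weights, 5 biases and 17 parameters. The second uses the layer sizes of the 2-16-32-2 network trained later in this tutorial and gives 658 parameters, matching the total in its Keras model summary.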

Most search algorithms measure their error at a certain point
(the difference between prediction and actual) and then choose a direction
in their search space to travel. They do this in different ways.
One way is to sample points around the current position in order to
estimate a gradient, or slope, and then follow that slope.
The most common way is to calculate the gradient symbolically by computing the derivatives, which avoids sampling altogether (as in stochastic gradient descent).

Here's a 3D demo of different search algorithms.

[Different Search Parameters](http://www.robertsdionne.com/bouncingball/)
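
To make the two ways of getting a gradient concrete, here is a small sketch on a made-up two-parameter error surface. Everything in it (the `error` function, its minimum at (1, -2), the learning rate) is an assumption chosen for illustration. It uses plain gradient descent; stochastic gradient descent additionally estimates the gradient from small batches of training data.

``` python
import numpy as np

def error(w):
    # made-up bowl-shaped error surface with its minimum at (1, -2)
    return (w[0] - 1.0) ** 2 + (w[1] + 2.0) ** 2

def analytic_gradient(w):
    # derivatives worked out by hand (what symbolic differentiation would give)
    return np.array([2.0 * (w[0] - 1.0), 2.0 * (w[1] + 2.0)])

def sampled_gradient(f, w, eps=1e-5):
    # central differences: probe a small step either side along each axis
    grad = np.zeros_like(w)
    for i in range(len(w)):
        step = np.zeros_like(w)
        step[i] = eps
        grad[i] = (f(w + step) - f(w - step)) / (2.0 * eps)
    return grad

start = np.array([4.0, 3.0])
print(analytic_gradient(start), sampled_gradient(error, start))

# follow the slope downhill
w = start.copy()
learning_rate = 0.1
for _ in range(200):
    w = w - learning_rate * analytic_gradient(w)
print(w, error(w))
```

Sampling needs two error evaluations per parameter per step, which is why computing derivatives directly scales much better to networks with many weights.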





## Let's deep learn on our problem

![2 crescent slices](images/slice.png "A function we want to learn
 f(x,y) -> z where z is red")

Please open [slice-classifier](./src/slice-classifier.py) and a python
interpreter such as bpython. Search for Part 3 around line 100.




In [6]:

print('''
########################################################################
# Part 3. Let's start using neural networks!
########################################################################
''')
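import tensorflow as tf
from tensorflow.keras.optimizers import SGD
from tensorflow import convert_to_tensor as tft
# import everything from tensorflow.keras rather than mixing it with the standalone keras package
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam
from sklearn.preprocessing import OneHotEncoder

# NOTE: scikit-learn 1.2 renamed `sparse` to `sparse_output`; newer releases need sparse_output=False
enc = OneHotEncoder(handle_unknown='ignore',sparse=False)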
enc.fit(train[1])
train_y = enc.transform(train[1]).astype('float64')
valid_y = enc.transform(valid[1]).astype('float64')
test_y = enc.transform(test[1]).astype('float64')

print(train[0].shape)
print(train[1].shape)
print(train_y.shape)

print(train[0][0:10])
print(train[1][0:10])
print(train_y[0:10])


########################################################################
# Part 3. Let's start using neural networks!
########################################################################

(1620, 2)
(1620, 1)
(1620, 2)
[[0.867 0.463]
 [0.65 0.046]
 [0.384 0.736]
 [0.379 0.392]
 [0.562 0.657]
 [0.358 0.443]
 [0.034 0.993]
 [0.464 0.663]
 [0.386 0.573]
 [0.485 0.144]]
[[0]
 [0]
 [0]
 [1]
 [1]
 [1]
 [0]
 [1]
 [0]
 [0]]
[[1. 0.]
 [1. 0.]
 [1. 0.]
 [0. 1.]
 [0. 1.]
 [0. 1.]
 [1. 0.]
 [0. 1.]
 [1. 0.]
 [1. 0.]]
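
The last two arrays printed above show what the encoder does: each integer label becomes a row with a single 1 in the column of its class. A minimal NumPy sketch of the same mapping, assuming the labels are integers 0..k-1 (`one_hot` is a name made up for this illustration, not part of scikit-learn):

``` python
import numpy as np

def one_hot(labels, num_classes):
    """Map integer labels 0..num_classes-1 to one-hot rows."""
    labels = np.asarray(labels).reshape(-1)
    # row i of the identity matrix is the one-hot vector for class i
    return np.eye(num_classes)[labels]

labels = np.array([[0], [1], [1], [0]])
encoded = one_hot(labels, 2)
print(encoded)
# argmax along each row recovers the original integer label
print(np.argmax(encoded, axis=1))
```

The `argmax` step is the same trick used later to turn the network's softmax outputs back into class labels.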


In [7]:
%%time

# rerunning this will produce different results
# try different combos here
net = Sequential([
 Dense(16,input_shape=(2,),activation="sigmoid"),
 Dense(32,activation="sigmoid"),
 Dense(2,activation="softmax")
])

# opt = SGD()
opt = Adam() # or e.g. Adam(learning_rate=0.1)
net.compile(loss="categorical_crossentropy", optimizer=opt, metrics=["accuracy"])
print(net.summary())
# net.fit(x=train[0],y=train_y, epochs=100)
history = net.fit(x=train[0], y=train_y, 
 validation_data=(valid[0], valid_y),
 epochs=100, batch_size=4)


Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense (Dense)               (None, 16)                48        
                                                                 
 dense_1 (Dense)             (None, 32)                544       
                                                                 
 dense_2 (Dense)             (None, 2)                 66        
                                                                 
=================================================================
Total params: 658
Trainable params: 658
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/100
...
Epoch 50/100

In [8]:
print("Learner on the test set")
score = net.evaluate(test[0], test_y)
print("Scores: %s" % score)
predictit = net.predict(test[0])
#print(predictit.shape)
#print(predictit[0:10,])
#print(net.__class__)
#classify = net.predict_classes(test[0])
#predict_x=net.predict(test[0]) 
#classify=np.argmax(predict_x,axis=1)

def predict_classes(net,test):
 predict_x=net.predict(test) 
 classify=np.argmax(predict_x,axis=1)
 return classify
 
classify = predict_classes(net, test[0])
print("%s / %s " % (np.sum(classify == mltest[1]),len(mltest[1])))
print(collections.Counter(classify))
print(theautil.classifications(classify,mltest[1]))

Learner on the test set
Scores: [0.2566322684288025, 0.9049999713897705]
181 / 200 
Counter({1: 113, 0: 87})
[('tp', 94), ('tn', 87), ('fp', 19), ('fn', 0)]


Let's try this on unseen data.

In [9]:

def real_function(pt):
 rad = 0.1643167672515498
 in1 = in_circle(pt[0],pt[1],0.5,0.5,rad)
 in2 = in_circle(pt[0],pt[1],0.51,0.51,rad)
 return in1 ^ in2

print("And now on more unseen data that isn't 50/50")

bigtest = np.random.uniform(size=(3000,2)).astype(np.float32)
biglab = np.apply_along_axis(real_function,1,bigtest).astype(np.int32)

classify = predict_classes(net,bigtest)
print("%s / %s " % (sum(classify == biglab),len(biglab)))
print(collections.Counter(classify))
print(theautil.classifications(classify,biglab))


And now on more unseen data that isn't 50/50
2378 / 3000 
Counter({0: 2343, 1: 657})
[('tp', 35), ('tn', 2343), ('fp', 622), ('fn', 0)]




## Now let's discuss posing problems for neural networks

* Scaling inputs: Scaling can sometimes help, so can
 standardization. This means constraining values or re-centering
 them. It depends on your problem and it is worth trying.

* E.g. min max scaling:

``` python
def min_max_scale(data):
 '''scales data by minimum and maximum values between 0 and 1'''
 dmin = np.min(data)
 return (data - dmin)/(np.max(data) - dmin)
```
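
Standardization, the other option mentioned above, can be sketched in the same spirit. This is a minimal example under the assumption that each column is a feature and that the statistics are taken from the training data only; `fit_standardizer` and `apply_standardizer` are hypothetical names, not functions from `posing.py`.

``` python
import numpy as np

def fit_standardizer(train):
    '''computes per-column mean and standard deviation from the training data'''
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # avoid dividing by zero for constant columns
    return mean, std

def apply_standardizer(data, mean, std):
    '''re-centers data to mean 0 and rescales it to standard deviation 1'''
    return (data - mean) / std

# made-up data: two features on very different scales
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[2.0, 40.0]])
mean, std = fit_standardizer(train)
train_z = apply_standardizer(train, mean, std)
test_z = apply_standardizer(test, mean, std)
print(train_z.mean(axis=0), train_z.std(axis=0))
print(test_z)
```

Reusing the training mean and standard deviation on the test data keeps both sets on the same scale without letting test data influence the preprocessing.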



## The problem

* [posing.py](src/posing.py) tries to show the problem of taking
 random input data and determining what distribution it comes from,
 that is, what function could produce these random values.

* Let's open up [posing.py](src/posing.py) and get an interpreter
 going.



In [10]:
# Demonstration of how to pose the problem and how different formulations
# lead to different results!

# first off we load up some modules we want to use
import tensorflow.keras as keras
import scipy
import math
import numpy as np
import numpy.random as rnd
import logging
import sys
from numpy.random import power, normal, lognormal, uniform
# import everything from tensorflow.keras rather than mixing it with the standalone keras package
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.optimizers import Adam
from sklearn.preprocessing import OneHotEncoder
import theautil
import tensorflow as tf

# What are we going to do?
# - we're going to generate data derived from 4 different distributions
# - we're going to scale that data
# - we're going to create an MLP (1 hidden layer neural network)
# - we're going to train it to classify data as belonging to one of these distributions

# maximum number of iterations before we bail
mupdates = 1000

# setup logging
logging.basicConfig(stream = sys.stderr, level=logging.INFO)

# how we pose our problem to the neural network matters.

# let's make the task easier by scaling all values between 0 and 1
def min_max_scale(data):
 '''scales data by minimum and maximum values between 0 and 1'''
 dmin = np.min(data)
 return (data - dmin)/(np.max(data) - dmin)

# how many samples per each distribution
bsize = 100 

# poor man's enum
LOGNORMAL=0
POWER=1
NORM=2
UNIFORM=3



## Experiment 1

* Given a single sample, what distribution does it come from?




In [14]:
print('''
########################################################################
# Experiment 1: can we classify single samples?
#
#
#########################################################################
''')

def make_dataset1():
 '''Make a dataset of single samples with labels from which distribution they come from'''
 # now lets make some samples 
 lns = min_max_scale(lognormal(size=bsize)) #log normal
 powers = min_max_scale(power(0.1,size=bsize)) #power law
 norms = min_max_scale(normal(size=bsize)) #normal
 uniforms = min_max_scale(uniform(size=bsize)) #uniform
 # add our data together
 data = np.concatenate((lns,powers,norms,uniforms))
 
 # concatenate our labels
 labels = np.concatenate((
 (np.repeat(LOGNORMAL,bsize)),
 (np.repeat(POWER,bsize)),
 (np.repeat(NORM,bsize)),
 (np.repeat(UNIFORM,bsize))))
 tsize = len(labels)
 
 # make sure dimensionality and types are right
 data = data.reshape((len(data),1))
 data = data.astype(np.float32)
 labels = labels.astype(np.int32)
 labels = labels.reshape((len(data),))
 
 return data, labels, tsize

# this will be the training data and validation data
data, labels, tsize = make_dataset1()


# this is the test data, this is kept separate to prove we can
# actually work on the data we claim we can.
#
# Without test data, you might just have great performance on the
# train set.
test_data, test_labels, _ = make_dataset1()



########################################################################
# Experiment 1: can we classify single samples?
#
#
#########################################################################



In [15]:
# utilities

# now lets shuffle
# If we're going to select a validation set we probably want to shuffle
def joint_shuffle(arr1,arr2):
 assert len(arr1) == len(arr2)
 indices = np.arange(len(arr1))
 np.random.shuffle(indices)
 arr1[0:len(arr1)] = arr1[indices]
 arr2[0:len(arr2)] = arr2[indices]

# our data and labels are shuffled together
joint_shuffle(data,labels)

def split_validation(percent, data, labels):
 ''' 
 split_validation splits a dataset of data and labels into
 2 partitions at the percent mark
 percent should be an int between 1 and 99
 '''
 s = int(percent * len(data) / 100)
 tdata = data[0:s]
 vdata = data[s:]
 tlabels = labels[0:s]
 vlabels = labels[s:]
 return ((tdata,tlabels),(vdata,vlabels))

# make a validation set from the train set
train1, valid1 = split_validation(90, data, labels)

print(train1[0].shape)
print(train1[1].shape)

enc1 = OneHotEncoder(handle_unknown='ignore',sparse=False)
enc1.fit(train1[1].reshape(len(train1[1]),1))
train1_y = enc1.transform(train1[1].reshape(len(train1[1]),1))
print(train1_y.shape)
valid1_y = enc1.transform(valid1[1].reshape(len(valid1[1]),1))
print(valid1_y.shape)
test1_y = enc1.transform(test_labels.reshape(len(test_labels),1))
print(test1_y.shape)

(360, 1)
(360,)
(360, 4)
(40, 4)
(400, 4)


In [16]:
# build our classifier

print("We're building a MLP of 1 input layer node, 4 hidden layer nodes, and an output layer of 4 nodes. The output layer has 4 nodes because we have 4 classes that the neural network will output.")
cnet = Sequential()
cnet.add(Dense(4,input_shape=(1,),activation="sigmoid"))
cnet.add(Dense(4,activation="softmax"))
copt = SGD(learning_rate=0.1) # `lr` is a deprecated alias for `learning_rate`
# opt = Adam(learning_rate=0.1)
cnet.compile(loss="categorical_crossentropy", optimizer=copt, metrics=["accuracy"])
history = cnet.fit(train1[0], train1_y, validation_data=(valid1[0], valid1_y),
	 epochs=100, batch_size=16)

#score = cnet.evaluate(test_data, test_labels)
#print("Scores: %s" % score)
classify = predict_classes(cnet,test_data)
print(theautil.classifications(classify,test_labels))
score = cnet.evaluate(test_data, test1_y)
print("Scores: %s" % score)


We're building a MLP of 1 input layer node, 4 hidden layer nodes, and an output layer of 4 nodes. The output layer has 4 nodes because we have 4 classes that the neural network will output.
Epoch 1/100
 1/23 [>.............................] - ETA: 3s - loss: 1.3987 - accuracy: 0.3125
...
Epoch 78/100


## Experiment 2

* Given 40 samples, what distribution do they come from?



In [20]:
print('''
########################################################################
# Experiment 2: can we classify a sample of data?
#
#
#########################################################################
''')
print("In this example we're going to input 40 values from a single distribution, and we'll see if we can classify the distribution.")

width=40

def make_widedataset(width=width):
 # we're going to make rows of 40 features unsorted
 wlns = min_max_scale(lognormal(size=(bsize,width))) #log normal
 wpowers = min_max_scale(power(0.1,size=(bsize,width))) #power law
 wnorms = min_max_scale(normal(size=(bsize,width))) #normal
 wuniforms = min_max_scale(uniform(size=(bsize,width))) #uniform
 
 wdata = np.concatenate((wlns,wpowers,wnorms,wuniforms))
 
 # concatenate our labels
 wlabels = np.concatenate((
 (np.repeat(LOGNORMAL,bsize)),
 (np.repeat(POWER,bsize)),
 (np.repeat(NORM,bsize)),
 (np.repeat(UNIFORM,bsize))))
 
 joint_shuffle(wdata,wlabels)
 wdata = wdata.astype(np.float32)
 wlabels = wlabels.astype(np.int32)
 wlabels = wlabels.reshape((len(wdata),)) # use wdata here, not the global `data` from Experiment 1
 return wdata, wlabels

# make our train sets
wdata, wlabels = make_widedataset()
# make our test sets
test_wdata, test_wlabels = make_widedataset()

# split out our validation set
wtrain, wvalid = split_validation(90, wdata, wlabels)
print("At this point we have a weird decision to make, how many neurons in the hidden layer?")

encwc = OneHotEncoder(handle_unknown='ignore', sparse=False)
encwc.fit(wtrain[1].reshape(len(wtrain[1]),1))
wtrain_y = encwc.transform(wtrain[1].reshape(len(wtrain[1]),1))
wvalid_y = encwc.transform(wvalid[1].reshape(len(wvalid[1]),1))
wtest_y = encwc.transform(test_wlabels.reshape(len(test_wlabels),1))

# wcnet = theanets.Classifier([width,width/4,4]) #267
wcnet = Sequential()
wcnet.add(Dense(width,input_shape=(width,),activation="sigmoid"))
wcnet.add(Dense(int(width/4),activation="sigmoid"))
wcnet.add(Dense(4,activation="softmax"))
wcnet.compile(loss="categorical_crossentropy", optimizer=SGD(learning_rate=0.1), metrics=["accuracy"])
history = wcnet.fit(wtrain[0], wtrain_y, validation_data=(wvalid[0], wvalid_y),
	 epochs=100, batch_size=16)
score = wcnet.evaluate(test_wdata, wtest_y)
print(wcnet.metrics_names)
print("Scores: %s" % score)




########################################################################
# Experiment 2: can we classify a sample of data?
#
#
#########################################################################

In this example we're going to input 40 values from a single distribution, and we'll see if we can classify the distribution.
At this point we have a weird decision to make, how many neurons in the hidden layer?
Epoch 1/100
 1/23 [>.............................] - ETA: 3s - loss: 1.5122 - accuracy: 0.2500
...
Epoch 78/100

In [21]:

classify = predict_classes(wcnet,test_wdata)
print(theautil.classifications(classify,test_wlabels))
score = wcnet.evaluate(test_wdata, wtest_y)
print("Scores: %s" % score)

# You could try some of these alternative layer setups
#
# [width,4]                                  #248
# [width,width/2,4]                          #271
# [width,width,4]                            #289
# [width,width*2,4]                          #292
# [width,width/2,width/4,4]                  #270
# [width,width/2,width/4,width/8,width/16,4] #232
# [width,width*8,4]                          #304

print("Ok that was neat, it definitely worked better, it had more data though.")

print("But what if we help it out, and we sort the values so that the first and last bins are always the min and max values?")


[('tp', 95), ('tn', 96), ('fp', 4), ('fn', 5)]
Scores: [0.46964675188064575, 0.7275000214576721]
Ok that was neat, it definitely worked better, it had more data though.
But what if we help it out, and we sort the values so that the first and last bins are always the min and max values?




## Experiment 3

* Given 40 sorted samples, what distribution do they come from?
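As a small illustration of why sorting might help, the sketch below sorts each row of some made-up uniform samples. After sorting, a given column always holds the same order statistic (smallest, second smallest, and so on), so the first and last inputs are always the row's min and max. The array names and sizes here are assumptions for illustration, not the tutorial's data.

```python
import numpy as np

rng = np.random.default_rng(0)            # fixed seed so the sketch is repeatable
samples = rng.uniform(size=(3, 5))        # 3 hypothetical rows of 5 values each

sorted_rows = np.sort(samples, axis=1)    # sort within each row; returns a copy

# Column 0 now holds each row's minimum and the last column its maximum.
print(sorted_rows[:, 0], samples.min(axis=1))
print(sorted_rows[:, -1], samples.max(axis=1))
```

Note that `np.sort` returns a new array, whereas the `ndarray.sort` method sorts in place.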


In [22]:
print('''
########################################################################
# Experiment 3: can we classify a SORTED sample of data?
#
#
#########################################################################
''')


print("Sorting the data")
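# Sort each row in place. This assumes wtrain and wvalid are views (slices) of
# wdata, so they see the sorted values; if they are copies, re-split after sorting.
wdata.sort(axis=1)
test_wdata.sort(axis=1)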


swcnet = Sequential()
swcnet.add(Dense(width,input_shape=(width,),activation="sigmoid"))
swcnet.add(Dense(int(width/4),activation="sigmoid"))
swcnet.add(Dense(4,activation="softmax"))
# learning_rate replaces the deprecated lr argument
swcnet.compile(loss="categorical_crossentropy", optimizer=SGD(learning_rate=0.1), metrics=["accuracy"])
history = swcnet.fit(wtrain[0], wtrain_y, validation_data=(wvalid[0], wvalid_y),
	 epochs=100, batch_size=16)
score = swcnet.evaluate(test_wdata, wtest_y)
print("Scores: %s" % score)




########################################################################
# Experiment 3: can we classify a SORTED sample of data?
#
#
#########################################################################

Sorting the data
Epoch 1/100
...

In [23]:

classify = predict_classes(swcnet,test_wdata)
print(theautil.classifications(classify,test_wlabels))
score = swcnet.evaluate(test_wdata, wtest_y)
print("Scores: %s" % score)


[('tp', 97), ('tn', 96), ('fp', 4), ('fn', 3)]
Scores: [0.17662428319454193, 0.9825000166893005]


That was an improvement. What if we do binning instead? First, let's train the sorted network for another 100 epochs and see whether it improves further.

In [24]:
history = swcnet.fit(wtrain[0], wtrain_y, validation_data=(wvalid[0], wvalid_y),
	 epochs=100, batch_size=16)
score = swcnet.evaluate(test_wdata, wtest_y)
print("Scores: %s" % score)

Epoch 1/100
...

In [25]:
classify = predict_classes(swcnet, test_wdata)
print(theautil.classifications(classify,test_wlabels))
score = swcnet.evaluate(test_wdata, wtest_y)
print("Scores: %s" % score)

[('tp', 98), ('tn', 95), ('fp', 5), ('fn', 2)]
Scores: [0.0653819590806961, 0.9825000166893005]


## Experiment 4

* Given a histogram of 40 samples, what distribution do they come from?
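To make "histogrammed samples" concrete, here is a minimal sketch on a tiny made-up row. It assumes the values have already been scaled into [0, 1]. The helper name `to_histogram` is hypothetical and only used for this illustration.

```python
import numpy as np

def to_histogram(row, lo=0.0, hi=1.0):
    """Hypothetical helper: bin a row into len(row) equal-width bins over
    [lo, hi] and return the fraction of values that fall in each bin."""
    counts, _edges = np.histogram(row, bins=len(row), range=(lo, hi))
    return counts / float(len(row))

row = np.array([0.05, 0.10, 0.15, 0.90])
hist = to_histogram(row)
print(hist)   # three values land in the first bin, one in the last
```

The result has the same length as the input row, but the order of the samples no longer matters; only how many fall in each interval.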



In [26]:
print('''
########################################################################
# Experiment 4: can we classify a discretized histogram of sample data?
#
#
#########################################################################
'''
)
# let's try actual binning
import collections

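def bin(row):
    # histogram with one bin per sample over [0,1], scaled to fractions of the row
    # (note: this name shadows Python's built-in bin())
    return np.histogram(row,bins=len(row),range=(0.0,1.0))[0]/float(len(row))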

print("Apply the histogram to all the data rows")
bdata = np.apply_along_axis(bin,1,wdata).astype(np.float32)
blabels = wlabels

# ensure we have our test data
test_bdata = np.apply_along_axis(bin,1,test_wdata).astype(np.float32)
test_blabels = test_wlabels

# helper data: (class id, display name, sample generator) for each distribution
enum_funcs = [
    (LOGNORMAL,"log normal",lambda size: lognormal(size=size)),
    (POWER,"power",lambda size: power(0.1,size=size)),
    (NORM,"normal",lambda size: normal(size=size)),
    (UNIFORM,"uniform",lambda size: uniform(size=size)),
]

# uses enum_funcs to evaluate PER CLASS how well our classifier operates
def classify_test(bnet,ntests=1000):
    for tup in enum_funcs:
        enum, name, func = tup
        # draw ntests fresh rows from this distribution and scale them
        lns = min_max_scale(func(size=(ntests,width)))
        blns = np.apply_along_axis(bin,1,lns).astype(np.float32)
        # every row in this batch has the same true label
        blns_labels = np.repeat(enum,ntests).astype(np.int32)
        classification = predict_classes(bnet,blns)
        classified = theautil.classifications(classification,blns_labels)
        print("Name:%s Tests:[%s] Count:%s -- Res:%s" % (name,ntests, collections.Counter(classification),classified ))

# train & valid
btrain, bvalid = split_validation(90, bdata, blabels)

# note: scikit-learn >= 1.2 renames this argument to sparse_output=False
encb = OneHotEncoder(handle_unknown='ignore', sparse=False)
encb.fit(btrain[1].reshape(len(btrain[1]),1))
btrain_y = encb.transform(btrain[1].reshape(len(btrain[1]),1))
bvalid_y = encb.transform(bvalid[1].reshape(len(bvalid[1]),1))
btest_y = encb.transform(test_blabels.reshape(len(test_blabels),1))



# similar network structure
# bnet = theanets.Classifier([width,width/2,4])

bnet = Sequential()
bnet.add(Dense(width,input_shape=(width,),activation="sigmoid"))
bnet.add(Dense(int(width/4),activation="sigmoid"))
bnet.add(Dense(4,activation="softmax"))
# learning_rate replaces the deprecated lr argument
bnet.compile(loss="categorical_crossentropy", optimizer=SGD(learning_rate=0.1), metrics=["accuracy"])
history = bnet.fit(btrain[0], btrain_y, validation_data=(bvalid[0], bvalid_y),
	 epochs=100, batch_size=16)
score = bnet.evaluate(test_bdata, btest_y)
print("Scores: %s" % score)




########################################################################
# Experiment 4: can we classify a discretized histogram of sample data?
#
#
#########################################################################

Apply the histogram to all the data rows
Epoch 1/100
 1/23 [>.............................] - ETA: 3s - loss: 1.6236 - accuracy: 0.3125

...

In [None]:
history = bnet.fit(btrain[0], btrain_y, validation_data=(bvalid[0], bvalid_y),
	 epochs=100, batch_size=16)
score = bnet.evaluate(test_bdata, btest_y)
print("Scores: %s" % score)

Epoch 1/100
...

In [None]:

classify = predict_classes(bnet,test_bdata)
print(theautil.classifications(classify,test_blabels))
score = bnet.evaluate(test_bdata, btest_y)
print("Scores: %s" % score)

classify_test(bnet)

## Representation: Inputs

* For discrete values consider discrete input neurons. E.g. if you have 3 letters as your input you should have 3 * 26 input neurons. 
* Each group of 26 neurons is "one-hot" -- exactly 1 neuron is set to 1 to indicate that 1 discrete value. 
* An input of AAA would be: 
 * 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
* ZZZ would be 
 * 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
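The encoding above can be sketched in a few lines of NumPy. This is a minimal illustration that assumes uppercase A-Z input only; the helper name `one_hot_letters` is hypothetical.

```python
import numpy as np

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def one_hot_letters(text):
    """Hypothetical helper: encode uppercase A-Z text as len(text) * 26 inputs."""
    vec = np.zeros(len(text) * len(ALPHABET), dtype=np.float32)
    for position, letter in enumerate(text):
        # each letter position owns its own block of 26 neurons
        vec[position * len(ALPHABET) + ALPHABET.index(letter)] = 1.0
    return vec

aaa = one_hot_letters("AAA")
zzz = one_hot_letters("ZZZ")
print(aaa.shape)              # 3 * 26 input values
print(np.flatnonzero(aaa))    # indices of the neurons set to 1 for AAA
print(np.flatnonzero(zzz))    # indices of the neurons set to 1 for ZZZ
```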





## Representation: Inputs

* For groups of elements consider representing them as their counts.
* E.g. 3 cats, 4 dogs, 1 car as: 3 4 1 on 3 input neurons.
* Neural networks work well with distributions as inputs and distributions as outputs
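A count representation might look like the sketch below. The category list and its order are assumptions made for this example; dividing by the total turns the counts into a distribution.

```python
from collections import Counter
import numpy as np

CATEGORIES = ["cat", "dog", "car"]   # hypothetical fixed order of input neurons

def count_vector(items):
    """Hypothetical helper: one input value per category, holding its count."""
    counts = Counter(items)
    return np.array([counts[c] for c in CATEGORIES], dtype=np.float32)

group = ["cat"] * 3 + ["dog"] * 4 + ["car"]
vec = count_vector(group)
print(vec)                # counts for cat, dog, car
print(vec / vec.sum())    # the same group as a distribution that sums to 1
```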



## Representation: Words

* Words can be represented as word counts, where your vector holds the count of each word per document -- you might have a large vocabulary so watch out!
* n-grams are popular too with one-hot encoding
* Embeddings (a dense vector representation) are popular too. Autoencoded words!
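As a rough sketch of the word-count representation, the code below builds a vocabulary from two toy documents and turns each document into a count vector. The documents and the whitespace tokenization are assumptions for illustration; real text usually needs more careful tokenization.

```python
from collections import Counter
import numpy as np

docs = ["the cat sat", "the dog sat on the cat"]   # hypothetical toy documents

# Fix a vocabulary so every document maps to a vector of the same length.
vocab = sorted({word for doc in docs for word in doc.split()})
index = {word: i for i, word in enumerate(vocab)}

def word_counts(doc):
    """Hypothetical helper: count of each vocabulary word in one document."""
    vec = np.zeros(len(vocab), dtype=np.float32)
    for word, n in Counter(doc.split()).items():
        vec[index[word]] = n
    return vec

X = np.stack([word_counts(d) for d in docs])   # one row per document
print(vocab)
print(X)
```

The vector length equals the vocabulary size, which is why a large vocabulary quickly leads to a very wide input layer.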



## Representation: Images

* Each input neuron can represent one pixel, with its intensity scaled to the range 0 to 1
* You can have images as output too!
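A minimal sketch of the pixel representation, assuming a small made-up 8-bit grayscale image: flatten it to one value per input neuron and scale to 0..1. The reverse mapping shows how an image-shaped output could be turned back into pixels.

```python
import numpy as np

# A hypothetical 2x3 grayscale image with 8-bit pixel values (0..255).
image = np.array([[0, 128, 255],
                  [64, 32, 16]], dtype=np.uint8)

# One value per input neuron, scaled into the range 0..1.
inputs = image.astype(np.float32).reshape(-1) / 255.0
print(inputs.shape, inputs.min(), inputs.max())

# Reverse direction: scale back up, round, and restore the image shape.
restored = np.rint(inputs * 255.0).astype(np.uint8).reshape(image.shape)
print(restored)
```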





## Representation: Outputs

* Do not ask the neural network to distinguish discrete values on 1 neuron. Don't expect 1 neuron to output 0.25 for A and 0.9 for B and 1.0 for C. Use 3 neurons!
* Distribution outputs are good
* Interpreting the output value directly is fine for regression problems
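With one output neuron per class, decoding a prediction is just picking the largest output. The sketch below uses made-up softmax outputs for the A/B/C example; the numbers are illustrative, not results from any network in this tutorial.

```python
import numpy as np

CLASSES = ["A", "B", "C"]

# Hypothetical softmax outputs: one row per input, one column per class.
outputs = np.array([[0.7, 0.2, 0.1],
                    [0.1, 0.3, 0.6]])

# The predicted class is the neuron with the largest output in each row.
predicted = [CLASSES[i] for i in outputs.argmax(axis=1)]
print(predicted)

# Training targets use the same layout: one-hot rows, here for A and C.
targets = np.eye(len(CLASSES))[[0, 2]]
print(targets)
```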



# Tuning

* The parameters you chose were probably not the best ones!
* You could grid search, that is, try all combinations. But that takes a lot of time.
* We want to engage in hyper-parameter tuning.
 * We want to find parameters for our network that perform well.
* Grid Search
 * Step 1: choose the parameter space
 * Step 2: choose a method of selecting parameters
 * Step 3: get next combination of parameters
 * Step 4: evaluate the parameters
 * Step 5: If current performance is better than prior performances keep this set of parameters
 * Step 6: go to step 3 until all parameter combinations are exhausted.
 * Step 7: report results
* Random Search
 * Step 1: choose the parameter space
 * Step 2: choose a method of selecting parameters
 * Step 3: randomly choose parameters
 * Step 4: evaluate the parameters
 * Step 5: If current performance is better than prior performances keep this set of parameters
 * Step 6: go to step 3 until satisfied (N iterations or M seconds)
 * Step 7: report results
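The two procedures above can be sketched in plain Python. This is an illustrative sketch only: the parameter space and the `evaluate` function are made-up stand-ins for "train a network and return its accuracy", and it is independent of the `Search` module used in the next cell.

```python
import itertools
import random

# Hypothetical parameter space and scoring function, for illustration only.
space = {"lr": [0.1, 0.01, 0.001], "batch_size": [8, 16, 32]}

def evaluate(params):
    # Stand-in for "train a network and return its validation accuracy".
    return 1.0 - abs(params["lr"] - 0.01) - abs(params["batch_size"] - 16) / 100.0

def grid_search(space, evaluate):
    """Try every combination and keep the best (score, params) pair."""
    names = sorted(space)
    best = None
    for values in itertools.product(*(space[n] for n in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if best is None or score > best[0]:
            best = (score, params)
    return best

def random_search(space, evaluate, iterations=20, seed=0):
    """Sample combinations at random for a fixed number of iterations."""
    rng = random.Random(seed)
    best = None
    for _ in range(iterations):
        params = {name: rng.choice(values) for name, values in space.items()}
        score = evaluate(params)
        if best is None or score > best[0]:
            best = (score, params)
    return best

grid_best = grid_search(space, evaluate)
random_best = random_search(space, evaluate)
print(grid_best)
print(random_best)
```

Grid search is guaranteed to find the best combination in the space but its cost grows with the product of the list sizes. Random search can stop at any budget, at the price of possibly missing the best combination.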
 

In [None]:
# let's tune
print('''
########################################################################
# Experiment 5: Can we tune the binned data?
#
#
#########################################################################
'''
)


import Search

# 1 repetition
state = {"reps":1}
params = {"batch_size":[1,4,8,16,32,64],
          "lr":[1.0,0.1,0.01,0.001,0.0001],
          "activation":["sigmoid","tanh","relu"],
          "optimizer":["SGD","Adam"],
          "epochs":[25],
          "arch":[
              [width],
              [width,width],
              [width,int(width/4)],
              [2*width,width],
              [int(width/4),int(width/8)],
              [int(width/4)]]
         }
def get_optimizer(x):
    # map the optimizer name from params to its Keras class (defaults to SGD)
    if x == "Adam":
        return Adam
    return SGD

# build, train, and score one network for a given parameter combination
def f(state,params):
    bnet = Sequential()
    arch = params["arch"]
    bnet.add(Dense(arch[0],input_shape=(width,),activation=params["activation"]))
    for layer in arch[1:]:
        bnet.add(Dense(int(layer),activation=params["activation"]))
    bnet.add(Dense(4,activation="softmax"))
    optimizer = get_optimizer(params["optimizer"])
    bnet.compile(loss="categorical_crossentropy",
                 optimizer=optimizer(learning_rate=params["lr"]), metrics=["accuracy"])
    history = bnet.fit(btrain[0], btrain_y,
                       validation_data=(bvalid[0], bvalid_y),
                       epochs=params["epochs"], batch_size=params["batch_size"])
    classify = predict_classes(bnet, test_bdata)
    print(theautil.classifications(classify,test_blabels))
    # the returned score is test-set accuracy (score[1]); score[0] is the loss
    score = bnet.evaluate(test_bdata, btest_y)
    print("Scores: %s" % score)
    return score[1]


In [None]:
# set the heuristic function to f
state["f"] = f
# random search for 60 seconds
random_results = Search.random_search(state,params,Search.heuristic_function,time=60)
# sort ascending by score so the best result is last
random_results = sorted(random_results, key=lambda x: x['Score'])
print(random_results[-1])


In [None]:
from pprint import pprint
pprint(random_results[-10:]) # the 10 best results (list is sorted ascending by score)


## References

* [Theanets Documentation](https://theanets.readthedocs.org/en/stable/)
* [A Practical Guide to Training Restricted Boltzmann Machines](https://www.cs.toronto.edu/~hinton/absps/guideTR.pdf)
* [MLP](http://deeplearning.net/tutorial/mlp.html#mlp)
* [Deep Learning Tutorials](http://www.iro.umontreal.ca/~pift6266/H10/notes/deepintro.html)
* [Deep Learning Tutorials](http://deeplearning.net/tutorial/)
* [Coursera: Hinton's Neural Networks for Machine Learning](https://www.coursera.org/course/neuralnets)
* [The Next Generation of Neural Networks](https://www.youtube.com/watch?v=AyzOUbkUf3M)
* [Geoffrey Hinton: "Introduction to Deep Learning & Deep Belief Nets"](https://www.youtube.com/watch?v=GJdWESd543Y)
* Bengio's Deep Learning
 [(1)](https://www.youtube.com/watch?v=JuimBuvEWBg)[(2)](https://www.youtube.com/watch?v=Fl-W7_z3w3o)
* [Nvidia's Deep Learning tutorials](https://developer.nvidia.com/deep-learning-courses)
* [Udacity Deep Learning MOOC](https://www.udacity.com/course/deep-learning--ud730)
