# Segmenting and Clustering Neighborhoods in Toronto City

GitHub does not render folium maps, so to see the fully rendered notebook, open this link:
https://nbviewer.jupyter.org/github/Mr-Piyush-Kumar/Data_Science_Projects/blob/master/Toronto_City_Neighborhood_Clustring/TorrontoCityNeighborhoodClustring.ipynb

### Introduction

This notebook is part of the IBM Data Science Capstone Project. In this project, I explore the venues near each Toronto neighborhood, use machine learning to group the neighborhoods into clusters, and then show these clusters on a map of Toronto City.

### Created by: Piyush Kumar

# Part - 1 of this Project

### Objective
Displaying the Toronto City neighborhoods dataset after scraping it from a Wikipedia page and cleaning it.

In [1]:
# Importing libraries.
import pandas as pd
import numpy as np

In [2]:
# Fetch all the tables on the Wikipedia page
# 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'.

tabels = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')

In [3]:
len(tabels)

3

As we can see, there are three tables. Let's check which one we need.

In [4]:
# First table.
# NOTE: each element of the tabels list is a DataFrame.

tabels[0].head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


As we can see, this is the table we need, so there is no need to check the other tables.
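Instead of inspecting each table by eye, the wanted table could also be selected by its column names. The following is a minimal sketch, assuming the right table is the first one that has the `Postcode`, `Borough` and `Neighbourhood` columns. `pick_table` is a hypothetical helper (not part of pandas), and the two small frames are made-up stand-ins for the list returned by `pd.read_html`.

```python
import pandas as pd

def pick_table(tables, required_columns):
    """Return the first DataFrame that contains every required column."""
    required = set(required_columns)
    for table in tables:
        if required.issubset(table.columns):
            return table
    raise ValueError('No table contains the columns {}'.format(sorted(required)))

# Small stand-in frames (hypothetical data) that mimic the list returned by pd.read_html.
sample_tables = [
    pd.DataFrame({'Code': ['NL', 'NS']}),
    pd.DataFrame({'Postcode': ['M3A'], 'Borough': ['North York'], 'Neighbourhood': ['Parkwoods']}),
]

picked = pick_table(sample_tables, ['Postcode', 'Borough', 'Neighbourhood'])
print(picked)
```

Selecting by column names keeps the notebook working if the page later gains or reorders tables, as long as the column names themselves stay the same.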

In [5]:
# Store the first table in a separate DataFrame.

Toronto_df = tabels[0]
del tabels # delete the list of tables, as it is no longer required.
Toronto_df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


In [6]:
# Data Wrangling # Data Preprocessing

# 1- Changing DataFrame's columns name according to project instructions
Toronto_df.columns = ['PostalCode','Borough','Neighborhood']

# 2- Removing those rows whose Borough is Not assigned
Toronto_df = Toronto_df[Toronto_df['Borough']!='Not assigned']

# 3- Grouping of Postal Code
temp_lst = []
for name, group in Toronto_df.groupby('PostalCode'):
    temp_lst.append([name, group['Borough'].unique()[0],", ".join(set(group['Neighborhood'].values))])
Toronto_df = pd.DataFrame(temp_lst, columns = ['PostalCode','Borough','Neighborhood'])

# 4- Replacing Not assigned values in Neighborhood with the corresponding Borough
not_assigned = Toronto_df['Neighborhood'].str.contains('Not assigned') # rows where neighborhood is Not assigned
# .loc assigns in place, covers every affected postal code, and avoids chained assignment
Toronto_df.loc[not_assigned, 'Neighborhood'] = Toronto_df.loc[not_assigned, 'Borough']

Toronto_df

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Rouge Hill, Highland Creek, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"Kennedy Park, East Birchmount Park, Ionview"
7,M1L,Scarborough,"Oakridge, Clairlea, Golden Mile"
8,M1M,Scarborough,"Scarborough Village West, Cliffcrest, Cliffside"
9,M1N,Scarborough,"Cliffside West, Birch Cliff"
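The wrangling steps above (drop unassigned boroughs, combine neighborhoods per postal code, fall back to the borough name) can also be written with `groupby(...).agg(...)`. This is only a sketch of an alternative on made-up sample rows, not the code that produced the table above; the names `raw`, `cleaned` and `grouped` are illustrative.

```python
import pandas as pd

# Hypothetical sample rows in the same shape as the scraped table.
raw = pd.DataFrame({
    'PostalCode': ['M1B', 'M1B', 'M7A', 'M9Z'],
    'Borough': ['Scarborough', 'Scarborough', "Queen's Park", 'Not assigned'],
    'Neighborhood': ['Rouge', 'Malvern', 'Not assigned', 'Not assigned'],
})

# Drop rows without a borough.
cleaned = raw[raw['Borough'] != 'Not assigned']

# One row per postal code; neighborhoods are joined in their original order.
grouped = (cleaned.groupby('PostalCode', as_index=False)
                  .agg({'Borough': 'first', 'Neighborhood': ', '.join}))

# Fall back to the borough name where no neighborhood is assigned.
mask = grouped['Neighborhood'] == 'Not assigned'
grouped.loc[mask, 'Neighborhood'] = grouped.loc[mask, 'Borough']
print(grouped)
```

Joining with `', '.join` directly, rather than through `set(...)`, keeps the neighborhood order stable between runs.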


In [7]:
print('No. of rows in Toronto Data Frame are ',Toronto_df.shape[0],'.')

No. of rows in Toronto Data Frame are  103 .


# Part - 2 of this Project

### Objective
Getting geographical co-ordinates of each neighborhood.

In [8]:
coordinates_data = pd.read_csv('http://cocl.us/Geospatial_data') # Downloading coordinates data.
coordinates_data.columns = ['PostalCode','Latitude','Longitude']
coordinates_data.head()

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [9]:
# merging Toronto_df and coordinates_data DataFrame together
Toronto_df = Toronto_df.merge(coordinates_data, how='left',on='PostalCode')
Toronto_df

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Highland Creek, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"Kennedy Park, East Birchmount Park, Ionview",43.727929,-79.262029
7,M1L,Scarborough,"Oakridge, Clairlea, Golden Mile",43.711112,-79.284577
8,M1M,Scarborough,"Scarborough Village West, Cliffcrest, Cliffside",43.716316,-79.239476
9,M1N,Scarborough,"Cliffside West, Birch Cliff",43.692657,-79.264848
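A left merge silently leaves `NaN` coordinates for any postal code that is missing from the coordinates file. The sketch below shows one way to check for that, using the `indicator` and `validate` parameters of `DataFrame.merge`; the two small frames are hypothetical stand-ins for `Toronto_df` and `coordinates_data`.

```python
import pandas as pd

# Hypothetical stand-ins for Toronto_df and coordinates_data.
neighborhoods = pd.DataFrame({'PostalCode': ['M1B', 'M1C', 'M9Z'],
                              'Borough': ['Scarborough', 'Scarborough', 'Example']})
coordinates = pd.DataFrame({'PostalCode': ['M1B', 'M1C'],
                            'Latitude': [43.806686, 43.784535],
                            'Longitude': [-79.194353, -79.160497]})

# indicator=True adds a _merge column; validate raises if a postal code repeats on either side.
merged = neighborhoods.merge(coordinates, how='left', on='PostalCode',
                             indicator=True, validate='one_to_one')

# Rows marked 'left_only' found no coordinates.
missing = merged.loc[merged['_merge'] == 'left_only', 'PostalCode'].tolist()
print('Postal codes without coordinates:', missing)
```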


# Part - 3 of this Project

### Objective
Exploring and clustering the neighborhoods in Toronto, restricted to those boroughs whose names contain the word Toronto.

In [10]:
# Getting Latitude and Longitude of Toronto City.
!conda install -c conda-forge geopy --yes # Installing geopy library, this library helps in getting Latitude and Longitude of a given address.
from geopy.geocoders import Nominatim # Nominatim converts an address into latitude and longitude values.

Solving environment: done

## Package Plan ##

  environment location: /opt/conda/envs/Python36

  added / updated specs: 
    - geopy


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    ca-certificates-2019.11.28 |       hecc5488_0         145 KB  conda-forge
    certifi-2019.11.28         |           py36_0         149 KB  conda-forge
    openssl-1.1.1d             |       h516909a_0         2.1 MB  conda-forge
    geographiclib-1.50         |             py_0          34 KB  conda-forge
    geopy-1.21.0               |             py_0          58 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         2.5 MB

The following NEW packages will be INSTALLED:

    geographiclib:   1.50-py_0         conda-forge
    geopy:           1.21.0-py_0       conda-forge

The following packages will be UPDATED:

    ca-

In [11]:
address = 'Toronto, CA'

geolocator = Nominatim(user_agent="Toronto_explorer")
To_location = geolocator.geocode(address)
To_latitude = To_location.latitude
To_longitude = To_location.longitude
print('The geographical coordinates of Toronto City are {}, {}.'.format(To_latitude, To_longitude))

The geographical coordinates of Toronto City are 43.653963, -79.387207.


### Create a map of Toronto City with all neighborhoods superimposed on top.

In [12]:
!conda install -c conda-forge folium=0.5.0 --yes # Installing Folium Library 
import folium # map rendering library

Solving environment: done

## Package Plan ##

  environment location: /opt/conda/envs/Python36

  added / updated specs: 
    - folium=0.5.0


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    altair-4.0.1               |             py_0         575 KB  conda-forge
    vincent-0.4.4              |             py_1          28 KB  conda-forge
    branca-0.3.1               |             py_0          25 KB  conda-forge
    folium-0.5.0               |             py_0          45 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         673 KB

The following NEW packages will be INSTALLED:

    altair:  4.0.1-py_0 conda-forge
    branca:  0.3.1-py_0 conda-forge
    folium:  0.5.0-py_0 conda-forge
    vincent: 0.4.4-py_1 conda-forge


Downloading and Extracting Packages
altair-4.0.1         | 575 KB    | #####

In [13]:
Toronto_map = folium.Map(location=[To_latitude,To_longitude], zoom_start = 11)

# add markers to map
for lat, lng, borough, neighborhood in zip(Toronto_df['Latitude'], Toronto_df['Longitude'], Toronto_df['Borough'], Toronto_df['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(Toronto_map)  
    
Toronto_map

As per the objective, I have to show only those neighborhoods whose borough name contains the word Toronto.

In [14]:
# Filter the dataset, keeping only those rows whose Borough column contains the word Toronto.

New_Toronto_df = Toronto_df[Toronto_df['Borough'].apply(lambda x: 'Toronto' in str(x))]
New_Toronto_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
37,M4E,East Toronto,The Beaches,43.676357,-79.293031
41,M4K,East Toronto,"Riverdale, The Danforth West",43.679557,-79.352188
42,M4L,East Toronto,"The Beaches West, India Bazaar",43.668999,-79.315572
43,M4M,East Toronto,Studio District,43.659526,-79.340923
44,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879


In [15]:
# list of unique Borough in New Data set
Borough_lst = New_Toronto_df.Borough.unique().tolist()
Borough_lst

['East Toronto', 'Central Toronto', 'Downtown Toronto', 'West Toronto']

### Create a map of Toronto City with the filtered neighborhoods superimposed on top

In [16]:
Toronto_Map = folium.Map(location=[To_latitude,To_longitude],zoom_start=11)

for lat,lon,borough,neighborhood in zip(New_Toronto_df['Latitude'],New_Toronto_df['Longitude'],New_Toronto_df['Borough'],New_Toronto_df['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label,parse_html=True)
    folium.CircleMarker(
        [lat,lon],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(Toronto_Map)

Toronto_Map

Next, we are going to start utilizing the Foursquare API to explore the neighborhoods and segment them.

### Define Foursquare Credentials and Version

In [17]:
CLIENT_ID = 'MUINF3SJELTWX0T2R3GWA5P5R3QYAGI2PDFGFR0HCERWTFNH' # my Foursquare ID
CLIENT_SECRET = 'TNBO5TIGKMZR0RR1ARMSUHBMPJ2V0JZBNZQ2G2220FAMS05U' # my Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('My credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

My credentials:
CLIENT_ID: MUINF3SJELTWX0T2R3GWA5P5R3QYAGI2PDFGFR0HCERWTFNH
CLIENT_SECRET:TNBO5TIGKMZR0RR1ARMSUHBMPJ2V0JZBNZQ2G2220FAMS05U
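Hard-coding and printing API credentials exposes them to anyone who can read the notebook. A common alternative is to read them from environment variables and mask them when printing. The sketch below assumes the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET`, which are illustrative choices, and uses placeholder values; `load_foursquare_credentials` and `mask` are hypothetical helpers.

```python
import os

def load_foursquare_credentials(env=os.environ):
    """Read the credentials from environment variables instead of the notebook source."""
    try:
        return env['FOURSQUARE_CLIENT_ID'], env['FOURSQUARE_CLIENT_SECRET']
    except KeyError as missing:
        raise RuntimeError('Set the environment variable {} first.'.format(missing))

def mask(secret, visible=4):
    """Show only the first few characters when printing a secret."""
    return secret[:visible] + '*' * (len(secret) - visible)

# Demonstration with a plain dict standing in for os.environ (placeholder values).
demo_env = {'FOURSQUARE_CLIENT_ID': 'EXAMPLEID123',
            'FOURSQUARE_CLIENT_SECRET': 'EXAMPLESECRET456'}
client_id, client_secret = load_foursquare_credentials(demo_env)
print('CLIENT_ID:', mask(client_id))
print('CLIENT_SECRET:', mask(client_secret))
```

In the notebook itself, calling `load_foursquare_credentials()` without an argument would read from the real environment.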


### Let's explore the first neighborhood in our dataframe.
Get the neighborhood's name.

In [18]:
New_Toronto_df = New_Toronto_df.reset_index(drop=True)
New_Toronto_df.loc[0, 'Neighborhood']

'The Beaches'

Get the latitude and longitude values of The Beaches.

In [19]:
Beaches_latitude = New_Toronto_df.loc[0, 'Latitude'] # neighborhood latitude value
Beaches_longitude = New_Toronto_df.loc[0, 'Longitude'] # neighborhood longitude value

print('Latitude and longitude values of The Beaches are {}, {}.'.format(
                                                               Beaches_latitude, 
                                                               Beaches_longitude))

Latitude and longitude values of The Beaches are 43.67635739999999, -79.2930312.


### Now, let's get the top 100 venues that are in The Beaches within a radius of 500 meters.

In [20]:
# First, let's create the GET request URL and store it in url.
LIMIT = 100 # no. of venues
radius = 500

url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    Beaches_latitude, 
    Beaches_longitude, 
    radius, 
    LIMIT)

url 

'https://api.foursquare.com/v2/venues/explore?&client_id=MUINF3SJELTWX0T2R3GWA5P5R3QYAGI2PDFGFR0HCERWTFNH&client_secret=TNBO5TIGKMZR0RR1ARMSUHBMPJ2V0JZBNZQ2G2220FAMS05U&v=20180605&ll=43.67635739999999,-79.2930312&radius=500&limit=100'
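The same URL can be assembled from named parameters rather than a long format string, which makes it harder to mix up the argument order. This is a sketch using the standard library's `urlencode`; `build_explore_url` is a hypothetical helper, the endpoint and parameter names are taken from the cell above, and the credentials are placeholders.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=100):
    """Assemble the venues/explore URL used above from named parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)

# Placeholder credentials; the coordinates are those of The Beaches from the cell above.
example_url = build_explore_url('YOUR_ID', 'YOUR_SECRET', '20180605',
                                43.67635739999999, -79.2930312)
print(example_url)
```

Note that `urlencode` percent-encodes the comma in `ll` as `%2C`, which is the encoded form of the same character.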

Send the GET request and examine the results.

In [21]:
import requests # importing request handling library

results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5e38fe3e006dce001ce3e09a'},
 'response': {'headerLocation': 'The Beaches',
  'headerFullLocation': 'The Beaches, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 5,
  'suggestedBounds': {'ne': {'lat': 43.680857404499996,
    'lng': -79.28682091449052},
   'sw': {'lat': 43.67185739549999, 'lng': -79.29924148550948}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4bd461bc77b29c74a07d9282',
       'name': 'Glen Manor Ravine',
       'location': {'address': 'Glen Manor',
        'crossStreet': 'Queen St.',
        'lat': 43.67682094413784,
        'lng': -79.29394208780985,
        'labeledLatLngs': [{'label': 'display',
          'lat': 43.67682094413784,
          'lng': -79.29394208780985}],
        'distanc

In [22]:
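# function that extracts the category of the venue
def get_category_type(row):
    # the column is called 'categories' or 'venue.categories', depending on how the JSON was flattened
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']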
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Now we are ready to clean the json and structure it into a pandas dataframe.

In [23]:
venues = results['response']['groups'][0]['items']

# flatten the JSON into a pandas dataframe; pd.json_normalize (pandas >= 1.0)
# replaces the older pandas.io.json.json_normalize import
nearby_venues = pd.json_normalize(venues)

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Glen Manor Ravine,Trail,43.676821,-79.293942
1,The Big Carrot Natural Food Market,Health Food Store,43.678879,-79.297734
2,Grover Pub and Grub,Pub,43.679181,-79.297215
3,Upper Beaches,Neighborhood,43.680563,-79.292869
4,Seaspray Restaurant,Asian Restaurant,43.678888,-79.298167


And how many venues were returned by Foursquare?

In [24]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

5 venues were returned by Foursquare.


# Explore Neighborhoods in Toronto

Let's create a function to repeat the same process for all the selected neighborhoods in Toronto.

In [25]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print('Processing ',name,'.....')
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
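`getNearbyVenues` indexes `v['venue']['categories'][0]` directly, which would raise an `IndexError` for a venue that comes back without any category. The sketch below shows one way the row extraction could be made tolerant of that case. `extract_venue_rows` is a hypothetical helper, and the sample items are made up in the shape of the response shown earlier.

```python
def extract_venue_rows(name, lat, lng, items):
    """Turn the 'items' list of one explore response into flat tuples."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # Use None when no category is returned for a venue.
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'], category))
    return rows

# Hypothetical items shaped like the response shown earlier.
sample_items = [
    {'venue': {'name': 'Glen Manor Ravine',
               'location': {'lat': 43.676821, 'lng': -79.293942},
               'categories': [{'name': 'Trail'}]}},
    {'venue': {'name': 'Uncategorized Place',
               'location': {'lat': 43.677, 'lng': -79.294},
               'categories': []}},
]
rows = extract_venue_rows('The Beaches', 43.676357, -79.293031, sample_items)
print(rows)
```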

### Now write the code to run the above function on each neighborhood and create a new dataframe called Filtered_Toronto_Venues.

In [26]:
Filtered_Toronto_Venues = getNearbyVenues(names=New_Toronto_df['Neighborhood'],
                                   latitudes=New_Toronto_df['Latitude'],
                                   longitudes=New_Toronto_df['Longitude']
                                  )
Filtered_Toronto_Venues.head()

Processing  The Beaches .....
Processing  Riverdale, The Danforth West .....
Processing  The Beaches West, India Bazaar .....
Processing  Studio District .....
Processing  Lawrence Park .....
Processing  Davisville North .....
Processing  North Toronto West .....
Processing  Davisville .....
Processing  Summerhill East, Moore Park .....
Processing  South Hill, Summerhill West, Rathnelly, Forest Hill SE, Deer Park .....
Processing  Rosedale .....
Processing  St. James Town, Cabbagetown .....
Processing  Church and Wellesley .....
Processing  Harbourfront .....
Processing  Ryerson, Garden District .....
Processing  St. James Town .....
Processing  Berczy Park .....
Processing  Central Bay Street .....
Processing  Richmond, Adelaide, King .....
Processing  Toronto Islands, Harbourfront East, Union Station .....
Processing  Design Exchange, Toronto Dominion Centre .....
Processing  Victoria Hotel, Commerce Court .....
Processing  Roselawn .....
Processing  Forest Hill West, Forest Hill Nor

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,The Beaches,43.676357,-79.293031,Glen Manor Ravine,43.676821,-79.293942,Trail
1,The Beaches,43.676357,-79.293031,The Big Carrot Natural Food Market,43.678879,-79.297734,Health Food Store
2,The Beaches,43.676357,-79.293031,Grover Pub and Grub,43.679181,-79.297215,Pub
3,The Beaches,43.676357,-79.293031,Upper Beaches,43.680563,-79.292869,Neighborhood
4,The Beaches,43.676357,-79.293031,Seaspray Restaurant,43.678888,-79.298167,Asian Restaurant


#### Let's check the size of the resulting dataframe

In [27]:
print(Filtered_Toronto_Venues.shape)

(1714, 7)


Let's check how many venues were returned for each neighborhood

In [28]:
Filtered_Toronto_Venues.groupby('Neighborhood').count().iloc[:,0]

Neighborhood
Berczy Park                                                                                                    56
Business Reply Mail Processing Centre 969 Eastern                                                              16
Central Bay Street                                                                                             83
Chinatown, Kensington Market, Grange Park                                                                      87
Christie                                                                                                       19
Church and Wellesley                                                                                           82
Davisville                                                                                                     32
Davisville North                                                                                                8
Design Exchange, Toronto Dominion Centre                                   

## Let's find out how many unique categories can be curated from all the returned venues

In [29]:
print('There are {} unique categories.'.format(len(Filtered_Toronto_Venues['Venue Category'].unique())))

There are 230 unique categories.


# Analyze Each Neighborhood

In [30]:
# one hot encoding
Toronto_onehot = pd.get_dummies(Filtered_Toronto_Venues[['Venue Category']], prefix="", prefix_sep="")

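# add neighborhood column back to dataframe
# NOTE: one venue category is itself named 'Neighborhood', so this assignment
# overwrites that dummy column instead of appending a new one
Toronto_onehot['Neighborhood'] = Filtered_Toronto_Venues['Neighborhood'] 

# move the last column to the front (because of the overwrite above, the last
# column here is 'Yoga Studio', not 'Neighborhood')
fixed_columns = [Toronto_onehot.columns[-1]] + list(Toronto_onehot.columns[:-1])
Toronto_onehot = Toronto_onehot[fixed_columns]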

Toronto_onehot.head()

Unnamed: 0,Yoga Studio,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Theater,Theme Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Women's Store
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Next, let's group the rows by neighborhood and take the mean of the frequency of occurrence of each category.

In [31]:
Toronto_grouped = Toronto_onehot.groupby('Neighborhood').mean().reset_index()
Toronto_grouped

Unnamed: 0,Neighborhood,Yoga Studio,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Theater,Theme Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Women's Store
0,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.017857,0.0,0.0,0.0,0.0
1,Business Reply Mail Processing Centre 969 Eastern,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Central Bay Street,0.012048,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.012048,...,0.0,0.0,0.0,0.0,0.0,0.012048,0.0,0.0,0.012048,0.0
3,"Chinatown, Kensington Market, Grange Park",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.045977,0.0,0.068966,0.011494,0.0
4,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Church and Wellesley,0.012195,0.012195,0.0,0.0,0.0,0.0,0.0,0.0,0.012195,...,0.012195,0.012195,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Davisville,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.03125,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,Davisville North,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,"Design Exchange, Toronto Dominion Centre",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,...,0.01,0.0,0.0,0.0,0.01,0.01,0.0,0.0,0.01,0.0
9,"Dovercourt Village, Dufferin",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [32]:
Toronto_grouped.shape

(39, 230)
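The grouped frame has 230 columns even though 230 unique categories plus the `Neighborhood` column would suggest 231. A likely explanation is that one returned category is itself called `Neighborhood` (see the "Upper Beaches" venue above), so it collides with the column of the same name. The sketch below shows an alternative that computes the same per-neighborhood frequencies with `pd.crosstab`, which keeps the neighborhood in the index and so avoids the collision. The sample rows are made up for illustration.

```python
import pandas as pd

# Hypothetical venue rows; one category is deliberately called 'Neighborhood'.
venues_sample = pd.DataFrame({
    'Neighborhood': ['The Beaches', 'The Beaches', 'The Beaches', 'The Beaches',
                     'Christie', 'Christie'],
    'Venue Category': ['Trail', 'Pub', 'Neighborhood', 'Pub', 'Café', 'Park'],
})

# Rows are neighborhoods, columns are categories, values are relative frequencies.
freq = pd.crosstab(venues_sample['Neighborhood'], venues_sample['Venue Category'],
                   normalize='index')
print(freq)
```

Each row of `freq` sums to 1, which matches the meaning of the means computed from the one-hot encoding above.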

### Let's print each neighborhood along with the top 5 most common venues

In [33]:
num_top_venues = 5

for hood in Toronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = Toronto_grouped[Toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Berczy Park----
          venue  freq
0   Coffee Shop  0.09
1  Cocktail Bar  0.05
2          Café  0.04
3    Steakhouse  0.04
4      Beer Bar  0.04


----Business Reply Mail Processing Centre 969 Eastern----
            venue  freq
0     Pizza Place  0.06
1   Auto Workshop  0.06
2      Comic Shop  0.06
3      Restaurant  0.06
4  Farmers Market  0.06


----Central Bay Street----
                venue  freq
0         Coffee Shop  0.16
1                Café  0.05
2      Ice Cream Shop  0.05
3  Italian Restaurant  0.05
4      Sandwich Place  0.04


----Chinatown, Kensington Market, Grange Park----
                           venue  freq
0                           Café  0.07
1          Vietnamese Restaurant  0.07
2             Chinese Restaurant  0.06
3                            Bar  0.06
4  Vegetarian / Vegan Restaurant  0.05


----Christie----
           venue  freq
0  Grocery Store  0.21
1           Café  0.16
2           Park  0.11
3     Baby Store  0.05
4      Nightclub  0.05


--

### Let's put that into a pandas dataframe
First, let's write a function to sort the venues in descending order.

In [34]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Now let's create the new dataframe and display the top 10 venues for each neighborhood.

In [35]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = Toronto_grouped['Neighborhood']

for ind in np.arange(Toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(Toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Berczy Park,Coffee Shop,Cocktail Bar,Bakery,Cheese Shop,Beer Bar,Seafood Restaurant,Farmers Market,Steakhouse,Café,Gourmet Shop
1,Business Reply Mail Processing Centre 969 Eastern,Park,Auto Workshop,Comic Shop,Pizza Place,Burrito Place,Restaurant,Brewery,Light Rail Station,Smoke Shop,Farmers Market
2,Central Bay Street,Coffee Shop,Italian Restaurant,Café,Ice Cream Shop,Juice Bar,Sandwich Place,Burger Joint,Japanese Restaurant,Salad Place,Department Store
3,"Chinatown, Kensington Market, Grange Park",Café,Vietnamese Restaurant,Bar,Chinese Restaurant,Coffee Shop,Vegetarian / Vegan Restaurant,Dumpling Restaurant,Mexican Restaurant,Cocktail Bar,Burger Joint
4,Christie,Grocery Store,Café,Park,Restaurant,Candy Store,Nightclub,Baby Store,Gas Station,Coffee Shop,Bank
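The column names above rely on a `try`/`except` around a three-element suffix list, which works for 1 to 10 but would produce "11st" style errors only by luck of the range. A small helper that handles any positive integer is sketched below; `ordinal` and `column_names` are hypothetical names, not part of the notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

# Build the same column names as the cell above for the top 10 venues.
column_names = ['Neighborhood'] + ['{} Most Common Venue'.format(ordinal(i))
                                   for i in range(1, 11)]
print(column_names)
```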


## Cluster Neighborhoods

Run k-means to cluster the neighborhoods into 5 clusters.

In [36]:
from sklearn.cluster import KMeans #importing KMeans

# set number of clusters
kclusters = 5

Toronto_grouped_clustering = Toronto_grouped.drop('Neighborhood', axis=1) # axis must be passed by keyword in recent pandas

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(Toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)
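The first ten labels are all 0, which hints that one cluster may hold most of the neighborhoods. Counting the members of each cluster is a quick way to check how balanced the result is. The sketch below uses a made-up label array as a stand-in for `kmeans.labels_`; the counts it prints are not the notebook's actual results.

```python
import numpy as np

# Hypothetical label array standing in for kmeans.labels_.
labels = np.array([0, 0, 0, 0, 4, 0, 0, 0, 3, 0, 4, 0, 1, 2])

# Number of neighborhoods assigned to each cluster label.
sizes = np.bincount(labels, minlength=5)
for cluster, size in enumerate(sizes):
    print('Cluster {}: {} neighborhoods'.format(cluster, size))
```

If most neighborhoods land in a single cluster, trying a different number of clusters may be worth considering.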

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [37]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

Toronto = New_Toronto_df

Toronto = Toronto.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

Toronto.head() 

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4E,East Toronto,The Beaches,43.676357,-79.293031,0,Asian Restaurant,Health Food Store,Pub,Trail,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant
1,M4K,East Toronto,"Riverdale, The Danforth West",43.679557,-79.352188,0,Greek Restaurant,Coffee Shop,Italian Restaurant,Ice Cream Shop,Furniture / Home Store,Restaurant,Bookstore,Grocery Store,Pub,Pizza Place
2,M4L,East Toronto,"The Beaches West, India Bazaar",43.668999,-79.315572,0,Sandwich Place,Italian Restaurant,Steakhouse,Fast Food Restaurant,Sushi Restaurant,Ice Cream Shop,Liquor Store,Burrito Place,Burger Joint,Fish & Chips Shop
3,M4M,East Toronto,Studio District,43.659526,-79.340923,0,Café,Coffee Shop,Gastropub,Brewery,Bakery,Italian Restaurant,American Restaurant,Comfort Food Restaurant,Sandwich Place,Cheese Shop
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879,4,Park,Bus Line,Swim School,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant


#### Finally, let's visualize the resulting clusters

In [38]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# create map
map_clusters = folium.Map(location=[To_latitude, To_longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(Toronto['Latitude'], Toronto['Longitude'], Toronto['Neighborhood'], Toronto['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## Examining Clusters

Now I can examine each cluster and determine the venue categories that distinguish it. Based on these defining categories, I can then assign a name to each cluster.
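Besides reading the tables row by row, the defining categories can be summarized per cluster. The sketch below counts how often each category appears as the most common venue within a cluster. It runs on a made-up miniature of the `Toronto` dataframe, so the values are illustrative only; `sample`, `summary` and `top_per_cluster` are hypothetical names.

```python
import pandas as pd

# Hypothetical miniature of the Toronto dataframe built above.
sample = pd.DataFrame({
    'Cluster Labels': [0, 0, 0, 4, 4],
    '1st Most Common Venue': ['Coffee Shop', 'Café', 'Coffee Shop', 'Park', 'Park'],
})

# For each cluster, count how often each category is the most common venue.
summary = sample.groupby('Cluster Labels')['1st Most Common Venue'].value_counts()
print(summary)

# Most frequent top category per cluster, a starting point for naming it.
top_per_cluster = (sample.groupby('Cluster Labels')['1st Most Common Venue']
                         .agg(lambda s: s.value_counts().idxmax()))
print(top_per_cluster)
```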

In [39]:
Toronto.loc[Toronto['Cluster Labels'] == 0, Toronto.columns[[1] + list(range(5, Toronto.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,East Toronto,0,Asian Restaurant,Health Food Store,Pub,Trail,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant
1,East Toronto,0,Greek Restaurant,Coffee Shop,Italian Restaurant,Ice Cream Shop,Furniture / Home Store,Restaurant,Bookstore,Grocery Store,Pub,Pizza Place
2,East Toronto,0,Sandwich Place,Italian Restaurant,Steakhouse,Fast Food Restaurant,Sushi Restaurant,Ice Cream Shop,Liquor Store,Burrito Place,Burger Joint,Fish & Chips Shop
3,East Toronto,0,Café,Coffee Shop,Gastropub,Brewery,Bakery,Italian Restaurant,American Restaurant,Comfort Food Restaurant,Sandwich Place,Cheese Shop
5,Central Toronto,0,Hotel,Park,Gym,Breakfast Spot,Dance Studio,Sandwich Place,Department Store,Food & Drink Shop,Diner,Dessert Shop
6,Central Toronto,0,Clothing Store,Coffee Shop,Yoga Studio,Bagel Shop,Fast Food Restaurant,Diner,Dessert Shop,Mexican Restaurant,Chinese Restaurant,Café
7,Central Toronto,0,Sandwich Place,Dessert Shop,Pizza Place,Coffee Shop,Italian Restaurant,Gym,Café,Sushi Restaurant,Pharmacy,Brewery
9,Central Toronto,0,Coffee Shop,Pub,Sushi Restaurant,Pizza Place,Sports Bar,Supermarket,Fried Chicken Joint,Health & Beauty Service,American Restaurant,Restaurant
11,Downtown Toronto,0,Coffee Shop,Park,Pub,Restaurant,Italian Restaurant,Café,Market,Bakery,Pizza Place,Beer Store
12,Downtown Toronto,0,Coffee Shop,Sushi Restaurant,Gay Bar,Japanese Restaurant,Restaurant,Hotel,Fast Food Restaurant,Gastropub,Burger Joint,Gym


In Cluster 0, almost all venues are related to food and drink services.
Name: Food Services

In [40]:
Toronto.loc[Toronto['Cluster Labels'] == 1, Toronto.columns[[1] + list(range(5, Toronto.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
22,Central Toronto,1,Garden,Home Service,Ice Cream Shop,Women's Store,Department Store,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop


Cluster 1, Name: Home Services

In [41]:
Toronto.loc[Toronto['Cluster Labels'] == 2, Toronto.columns[[1] + list(range(5, Toronto.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
23,Central Toronto,2,Jewelry Store,Trail,Mexican Restaurant,Sushi Restaurant,Women's Store,Department Store,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop


Cluster 2, Name: Garments Store

In [42]:
Toronto.loc[Toronto['Cluster Labels'] == 3, Toronto.columns[[1] + list(range(5, Toronto.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
8,Central Toronto,3,Tennis Court,Women's Store,Deli / Bodega,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant,Dog Run


Cluster 3, Name: Playgrounds

In [43]:
Toronto.loc[Toronto['Cluster Labels'] == 4, Toronto.columns[[1] + list(range(5, Toronto.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
4,Central Toronto,4,Park,Bus Line,Swim School,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant
10,Downtown Toronto,4,Park,Trail,Playground,Dance Studio,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant,Dog Run


Cluster 4, Name: Transports