# Exercise 11


## Phishing Detection

Phishing is the act of defrauding an online user to obtain personal information by posing as a trustworthy institution or entity. Users often struggle to tell legitimate sites from malicious ones because phishing pages are built to look nearly identical to the originals. Better detection tools are therefore needed to combat attackers.

In [2]:
import pandas as pd
import zipfile
with zipfile.ZipFile('../datasets/phishing.csv.zip', 'r') as z:
    f = z.open('phishing.csv')
    data = pd.read_csv(f, index_col=False)
data.head()

Unnamed: 0,url,phishing
0,http://www.subalipack.com/contact/images/sampl...,1
1,http://fasc.maximecapellot-gypsyjazz-ensemble....,1
2,http://theotheragency.com/confirmer/confirmer-...,1
3,http://aaalandscaping.com/components/com_smart...,1
4,http://paypal.com.confirm-key-21107316126168.s...,1


In [3]:
data.phishing.value_counts()

1    20000
0    20000
Name: phishing, dtype: int64

In [4]:
data.url[data.phishing==1].sample(50, random_state=1).tolist()

['http://dothan.com.co/gold/austspark/index.htm\n',
 'http://78.142.63.63/%7Enetsysco/process/fc1d9c7ea4773b7ff90925c2902cb5f2\n',
 'http://verify95.5gbfree.com/coverme2010/\n',
 'http://www.racom.com/uploads/productscat/bookmark/ii.php?.rand=13vqcr8bp0gud&cbcxt=mai&email=abuse@tradinghouse.ca\n',
 'http://www.cleanenergytci.com/components/update.logon.l3an7lofamerica/2342343234532534546347677898765432876543345687656543876/\n',
 'http://209.148.89.163/-/santander.co.uk/weblegn/AccountLogin.php\n',
 'http://senevi.com/confirmation/\n',
 'http://www.hellenkeller.cl/tmp/new/noticias/Modulo_de_Atualizacao_Bradesco/index2.php?id=PSO1AM04L3Q6PSBNVJ82QUCO0L5GBSY2KM2U9BYUEO14HCRDVZEMTRB3DGJO9HPT4ROC4M8HA8LRJD5FCJ27AD0NTSC3A3VDUJQX6XFG519OED4RW6Y8J8VC19EAAAO5UF21CHGHIP7W4AO1GM8ZU4BUBQ6L2UQVARVM\n',
 'http://internet-sicherheit.co/de/konflikt/src%3Dde/AZ00276ZZ75/we%3Dhs_0_2/sicherheit/konto_verifizieren/verifizierung.php\n',
 'http://alen.co/docs/cleaner\n',
 'http://rattanhouse.co/Atualizacao_

In [5]:
# Flag URLs that contain each keyword (literal match, not regex)
keywords = ['https', 'login', '.php', '.html', '@', 'sign']
for keyword in keywords:
    data['keyword_' + keyword] = data.url.str.contains(keyword, regex=False).astype(int)

In [6]:
data['lenght'] = data.url.str.len() - 2
domain = data.url.str.split('/', expand=True).iloc[:, 2]
data['lenght_domain'] = domain.str.len()
domain.head(12)

0                                    www.subalipack.com
1             fasc.maximecapellot-gypsyjazz-ensemble.nl
2                                    theotheragency.com
3                                    aaalandscaping.com
4     paypal.com.confirm-key-21107316126168.securepp...
5                              lcthomasdeiriarte.edu.co
6                                       livetoshare.org
7                                            www.i-m.co
8                                     manuelfernando.co
9                                www.bladesmithnews.com
10                                      www.rasbaek.com
11                                      199.231.190.160
Name: 2, dtype: object

In [7]:
# A domain is treated as an IP address if only digits remain after removing the dots
data['isIP'] = domain.str.replace('.', '', regex=False).str.isnumeric().astype(int)
data['count_com'] = data.url.str.count('com')
data.sample(15, random_state=4)

Unnamed: 0,url,phishing,keyword_https,keyword_login,keyword_.php,keyword_.html,keyword_@,keyword_sign,lenght,lenght_domain,isIP,count_com
28607,http://pennstatehershey.org/web/ibd/home/event...,0,0,0,0,0,0,0,80,20,0,0
3689,http://guiadesanborja.com/multiprinter/muestra...,1,0,1,1,0,0,0,81,18,0,1
6405,http://paranaibaweb.com/faleconosco/accounting...,1,0,0,0,1,0,0,65,16,0,1
35355,http://courts.delaware.gov/Jury%20Services/Hel...,0,0,0,0,0,0,0,94,19,0,0
16520,http://erpa.co/tmp/getproductrequest.htm\n,1,0,0,0,0,0,0,39,7,0,0
16196,http://pulapulapipoca.com/components/com_media...,1,0,1,1,0,0,0,239,18,0,4
3810,http://www.dag.or.kr/zboard/icon/visa/img/Atua...,1,0,0,0,0,0,0,62,13,0,0
3005,http://www.amazingdressup.com/wp-content/theme...,1,0,0,0,1,0,0,94,22,0,1
9003,http://web.indosuksesfutures.com/content_file/...,1,0,0,0,0,0,0,80,25,0,1
34704,http://www.nutritionaltree.com/subcat.aspx?cid...,0,0,0,0,0,0,0,69,23,0,1


In [8]:
X = data.drop(['url', 'phishing'], axis=1)

In [9]:
y = data.phishing

# Exercise 11.1

Create five more features.
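
As one possible starting point, the sketch below derives five simple count-based features from the URL string. The feature names (`count_dots`, `count_hyphens`, `count_digits`, `count_slashes`, `count_domain_dots`), the helper `add_url_features`, and the toy URLs are illustrative assumptions, not part of the original notebook; other choices may work just as well.

```python
import pandas as pd


def add_url_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with five extra count-based URL features (hypothetical names)."""
    out = df.copy()
    url = out['url'].str.strip()            # drop the trailing newline seen in the data
    domain = url.str.split('/').str[2]      # same domain extraction idea as the notebook
    out['count_dots'] = url.str.count(r'\.')
    out['count_hyphens'] = url.str.count('-')
    out['count_digits'] = url.str.count(r'\d')
    out['count_slashes'] = url.str.count('/')
    out['count_domain_dots'] = domain.str.count(r'\.')
    return out


# Toy URLs, made up for illustration only
toy = pd.DataFrame({'url': ['http://www.example.com/a-b/index.php\n',
                            'http://10.0.0.1/login/12\n']})
features = add_url_features(toy)
print(features.drop(columns='url'))
```

In the notebook, the same function could be applied to `data` before building `X`.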

# Exercise 11.2

* Standardize the features
* Create a linear SVM
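
A minimal sketch of these two steps, assuming scikit-learn is available. To keep the block self-contained it uses a synthetic dataset (`X_demo`, `y_demo`) as a stand-in for the notebook's `X` and `y`; the split ratio, `C=1.0`, and `max_iter` are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for the notebook's X and y (assumption for self-containment)
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo)

# The pipeline fits the scaler on the training data only, then trains the linear SVM
linear_svm = make_pipeline(StandardScaler(), LinearSVC(C=1.0, max_iter=10000))
linear_svm.fit(X_train, y_train)

accuracy = linear_svm.score(X_test, y_test)
print(f'Linear SVM test accuracy: {accuracy:.3f}')
```

Wrapping the scaler and the classifier in one pipeline avoids leaking test-set statistics into the standardization step.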


# Exercise 11.3

Test SVMs using the different kernels ('poly', 'rbf', 'sigmoid').
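
One way to compare the kernels is to cross-validate an `SVC` per kernel, as sketched below. The synthetic data again stands in for the notebook's `X` and `y`, and the 5-fold setting and default hyperparameters are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the notebook's X and y (assumption for self-containment)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

kernel_scores = {}
for kernel in ['poly', 'rbf', 'sigmoid']:
    # Standardize inside the pipeline so each CV fold is scaled independently
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    kernel_scores[kernel] = cross_val_score(model, X_demo, y_demo, cv=5).mean()

for kernel, score in kernel_scores.items():
    print(f'{kernel}: mean CV accuracy = {score:.3f}')
```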


# Exercise 11.4

Using the best SVM, find the parameters that give the best performance, searching over:

'C': [0.1, 1, 10, 100, 1000], 'gamma': [0.01, 0.001, 0.0001]
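
The grid above can be searched with `GridSearchCV`, as in the sketch below. It assumes, purely for illustration, that the RBF kernel was the best one; the synthetic data, 3-fold cross-validation, and accuracy scoring are also assumptions. The `svc__` prefix is how scikit-learn addresses parameters of the `SVC` step inside a pipeline built with `make_pipeline`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the notebook's X and y (assumption for self-containment)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# Parameter grid from the exercise, prefixed with the pipeline step name
param_grid = {
    'svc__C': [0.1, 1, 10, 100, 1000],
    'svc__gamma': [0.01, 0.001, 0.0001],
}

pipeline = make_pipeline(StandardScaler(), SVC(kernel='rbf'))  # 'rbf' is an assumed choice
search = GridSearchCV(pipeline, param_grid, cv=3, scoring='accuracy')
search.fit(X_demo, y_demo)

print('Best parameters:', search.best_params_)
print(f'Best mean CV accuracy: {search.best_score_:.3f}')
```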

# Exercise 11.5

Compare the results with other methods.
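
A possible comparison, sketched under the assumption that logistic regression and a random forest are reasonable baselines to set against the SVM. The choice of models, their hyperparameters, and the synthetic data are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the notebook's X and y (assumption for self-containment)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# Candidate models; scaling is applied only where the method is scale-sensitive
models = {
    'svm_rbf': make_pipeline(StandardScaler(), SVC(kernel='rbf')),
    'logistic_regression': make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    'random_forest': RandomForestClassifier(n_estimators=50, random_state=0),
}

comparison = {
    name: cross_val_score(model, X_demo, y_demo, cv=3).mean()
    for name, model in models.items()
}

for name, score in sorted(comparison.items(), key=lambda item: -item[1]):
    print(f'{name}: mean CV accuracy = {score:.3f}')
```

Using the same cross-validation folds and metric for every model keeps the comparison fair.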