{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 13 - Model Deployment\n",
"\n",
"by [Alejandro Correa Bahnsen](albahnsen.com/)\n",
"\n",
"version 0.1, May 2016\n",
"\n",
"## Part of the class [Machine Learning for Security Informatics](https://github.com/albahnsen/ML_SecurityInformatics)\n",
"\n",
"\n",
"\n",
"This notebook is licensed under a [Creative Commons Attribution-ShareAlike 3.0 Unported License]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Agenda:\n",
"\n",
"1. Creating and saving a model\n",
"2. Running the model in batch\n",
"3. Exposing the model as an API"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 1: Phishing Detection\n",
"\n",
"Phishing, by definition, is the act of defrauding an online user in order to obtain personal information by posing as a trustworthy institution or entity. Users usually have a hard time differentiating between legitimate and malicious sites because they are made to look exactly the same. Therefore, there is a need to create better tools to combat attackers."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import zipfile\n",
"with zipfile.ZipFile('../datasets/phishing.csv.zip', 'r') as z:\n",
" f = z.open('phishing.csv')\n",
" data = pd.read_csv(f, index_col=False)"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"
\n",
" \n",
" \n",
" | \n",
" url | \n",
" phishing | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" http://www.subalipack.com/contact/images/sampl... | \n",
" 1 | \n",
"
\n",
" \n",
" 1 | \n",
" http://fasc.maximecapellot-gypsyjazz-ensemble.... | \n",
" 1 | \n",
"
\n",
" \n",
" 2 | \n",
" http://theotheragency.com/confirmer/confirmer-... | \n",
" 1 | \n",
"
\n",
" \n",
" 3 | \n",
" http://aaalandscaping.com/components/com_smart... | \n",
" 1 | \n",
"
\n",
" \n",
" 4 | \n",
" http://paypal.com.confirm-key-21107316126168.s... | \n",
" 1 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" url phishing\n",
"0 http://www.subalipack.com/contact/images/sampl... 1\n",
"1 http://fasc.maximecapellot-gypsyjazz-ensemble.... 1\n",
"2 http://theotheragency.com/confirmer/confirmer-... 1\n",
"3 http://aaalandscaping.com/components/com_smart... 1\n",
"4 http://paypal.com.confirm-key-21107316126168.s... 1"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.head()"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"1 20000\n",
"0 20000\n",
"Name: phishing, dtype: int64"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.phishing.value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Creating features"
]
},
{
"cell_type": "code",
"execution_count": 92,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"['http://dothan.com.co/gold/austspark/index.htm\\n',\n",
" 'http://78.142.63.63/%7Enetsysco/process/fc1d9c7ea4773b7ff90925c2902cb5f2\\n',\n",
" 'http://verify95.5gbfree.com/coverme2010/\\n',\n",
" 'http://www.racom.com/uploads/productscat/bookmark/ii.php?.rand=13vqcr8bp0gud&cbcxt=mai&email=abuse@tradinghouse.ca\\n',\n",
" 'http://www.cleanenergytci.com/components/update.logon.l3an7lofamerica/2342343234532534546347677898765432876543345687656543876/\\n',\n",
" 'http://209.148.89.163/-/santander.co.uk/weblegn/AccountLogin.php\\n',\n",
" 'http://senevi.com/confirmation/\\n',\n",
" 'http://www.hellenkeller.cl/tmp/new/noticias/Modulo_de_Atualizacao_Bradesco/index2.php?id=PSO1AM04L3Q6PSBNVJ82QUCO0L5GBSY2KM2U9BYUEO14HCRDVZEMTRB3DGJO9HPT4ROC4M8HA8LRJD5FCJ27AD0NTSC3A3VDUJQX6XFG519OED4RW6Y8J8VC19EAAAO5UF21CHGHIP7W4AO1GM8ZU4BUBQ6L2UQVARVM\\n',\n",
" 'http://internet-sicherheit.co/de/konflikt/src%3Dde/AZ00276ZZ75/we%3Dhs_0_2/sicherheit/konto_verifizieren/verifizierung.php\\n',\n",
" 'http://alen.co/docs/cleaner\\n',\n",
" 'http://rattanhouse.co/Atualizacao_Bradesco/cadastro2013.php?2MAS2XACUJPI3U8D9ZDDG2G9YJICVABQ3K73KWDKYK0NA0AWWWCOUEDUJRXHRKPNMUYLDV89RA6OCG2MQUS0TAUXX9IOGJUEIXPDS5B0RM18OF1H860UAMJOY6ICUR81VSEKKJFPBYNLYGUXBGJ1HEHKOMLTM01P658M\\n',\n",
" 'http://steamcommunily.co/p.php?login=true\\n',\n",
" 'http://www.nyyg.com/Bradesco/5W9SQ394.html\\n',\n",
" 'http://wp.tipografiacentral.com.co/sparkde/index.html\\n',\n",
" 'http://www.entrerev.com/component/.secure.wpa/.www.paypal.com.returnUrl=/cgi-bin/5RF3S6y0K349/PayPal.co.uk/dispute_centre/sotmks/npsw&st.payment.decline.centre/ipoi/secure-codes.paypal.account4738154login.complete-infrmations.login.accountSecure26/securities/\\n',\n",
" 'http://x.co/SecurCent\\n',\n",
" 'http://dejatequerer.co/united.com/index.html\\n',\n",
" 'http://www.speakeasymovies.com/components/com_wrapper/.amazon.co.uk/\\n',\n",
" 'http://www.culturaespanola.com.br/bt/www.paypal.com/paypal.com.com/index-new.php\\n',\n",
" 'http://www.agroassistance.com/components/com_content/c05354aa285b6a932a57086ba13762a1/\\n',\n",
" 'http://www.estranetsrl.com.ar/bbvacambios.html\\n',\n",
" 'http://osfsw.cba.pl/content/classic/html/ibpf/bradesco/?UOREEIYGQTERIRVSJTUHMVMZJWWYSVNYQOFSPWVFTEJEEKMJWHFERRYTFRWPSYYWGFIGJUPLZMZLTNSKOGMQQSHSXPLMXILVSM\\n',\n",
" 'http://bitcrush.co/~geetha5/natwest/natwest/ibcarregister-natwst.html\\n',\n",
" 'http://cannot-hide-from-PhishTank.zenith-services.com/controllare/auth/\\n',\n",
" 'http://nova.pymesonline.co/fr.php\\n',\n",
" 'http://comococino.com/wp-content/uploads/2013/01/paypal.com/us/cgi-bin/webscr.htm?\\n',\n",
" 'http://www.fundacionchwinqlal.com.gt/imgs/Notas/img/_New/Agencias_Bradesco/Public_201133.php?KSR6YOU359CY1USIRMSBI8CFJF7TVREFJ6KIUFKZNXXNRP7JBYVU79APNGJI8YYR5I0YXUXLRU0JKF4WEYQL81BUGVDOTBFXUPVSKSEBNNU84X4IWT54UFYABCY5OE3J5XBOQQ1EDVMHTPZPJ4TEJSOU5NZS32B8ZNWQ\\n',\n",
" 'http://flightripe.com/confirmation/update/billing/9a523c6017caa3406af9d5c2c0cb1854/\\n',\n",
" 'http://accademiazerootto.it/templates/zerootto-new/html/com_content/category/bompreco.php\\n',\n",
" 'http://santanderseguranca.zapto.org/Clientesx/\\n',\n",
" 'http://www.muttico.com/components/com_media/p3rs0na4l/53f8b14c76c890e1806b8f9d97f12f80/\\n',\n",
" 'http://us.fxlhtvf.ml/login/en/login.html.asp?refhttp:%2F%2Futddirect.com%2Fcomponents%2Fcom_content%2Fviews%2Fcategories%2Fmenu.html\\n',\n",
" 'http://conferencistainternacional.com.co/urruirrhyttjk/Index.htm\\n',\n",
" 'http://www.creativesovereign.com/components/com_newsfeeds/views/.../perfil/\\n',\n",
" 'http://villamarina.com.co/administrator/servers/BankofAmerica/security-update/SecMeasure/account-overview.cgi/presentation/jskeys/sas/signonScreen.do/\\n',\n",
" 'http://www.vipturismolondres.com/com.br/?atendimento=Cliente&/LgSgkszm64/B8aNzHa8Aj.php\\n',\n",
" 'http://www.enoxia.fr/components/com_content/tamfidelidade01.php\\n',\n",
" 'http://gobbva.com/bb/empresa/index.php?tarjeta=\\n',\n",
" 'http://paypal-com-confim.sharmikelectric.com/s4575234bf5055889415\\n',\n",
" 'http://paypal.com.au.au.webapps.mpp.homes.konyadosemeciler.com/confirm/login.australia/au/webapps/mpp/home/initthi.php?cmd=SignIn&co_partnerId=2&pUserId=&siteid=0&pageType=&pa1=&i1=&bshowgif=&UsingSSL=&ru=&pp=&pa2=&errmsg=&runame=%5C%5C%5C%5C\\n',\n",
" 'http://www.bbvabancocontinental.ya.st\\n',\n",
" 'http://www.giannielectric.com/company/components/com_poll/assets/a/a5643cded2383f7568719482a943e1a5\\n',\n",
" 'http://cooperativasanjose.com.co/plugins/josetta_ext/k2category/section/first.php\\n',\n",
" 'http://appleid-apple-com-confirm-oyns-uattw6w61x3oka3pq.scientificcollectables.com/3c43e3d92e0b8a48f09f5fbb25d008a9/index1.php?cmd=https://connect.paypal.com/WebObjects/iTunesConnect.woa?login-processing=t&login_access=13409884065d3a174c294a9bf21bf71c23a3\\n',\n",
" 'http://consultoriojuridico.co/pp/www.paypal.com/\\n',\n",
" 'http://lovetodo.in.th/administrator/components/com_content/models/key/\\n',\n",
" 'http://lnk.co/io6u45y45?erydh?mario.Carelli@poste.it\\n',\n",
" 'http://www2.bancobbvacontnental.com/Centroll/informe/03/14/datitarlz/WUJFQ0VSUkFATVVOSVpMQVcuQ09N\\n',\n",
" 'http://lfcintl.com/components/com_user/zzxc/bpd.com.do/app/do/personas/289302294350311363178310441412402464323394411438376403437407/banco.popular.php?Personal\\n',\n",
" 'http://procuraduria.videoteca.com.co/update/apple.com/.cgi-bin/WebObjects/MyAppleIdwoa/wa/sign_in.html?appId=4129.returnURL=DaHR0cDovL3N0b3JlLmFwcGxlLmNvbS91c3wxYW9zZmU4OGZjNWIyNThhYWVhOTM5MzVjZjI2NTk1OGE3MWUwY2Y0MmI2OA%26r%3DSDHCD9JUYKX777H9KT\\n']"
]
},
"execution_count": 92,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.url[data.phishing==1].sample(50, random_state=1).tolist()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Contain any of the following:\n",
"* https\n",
"* login\n",
"* .php\n",
"* .html\n",
"* @\n",
"* sign\n",
"* ?"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"keywords = ['https', 'login', '.php', '.html', '@', 'sign']"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"for keyword in keywords:\n",
" data['keyword_' + keyword] = data.url.str.contains(keyword).astype(int)"
]
},
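{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because `str.contains` interprets its pattern as a regular expression by default, the keyword features can differ from a plain substring test. The following is a minimal standard-library sketch of that difference, not part of the original pipeline; the helper `keyword_flags` and the `example.com` URL are hypothetical names used only for illustration.\n",
"\n",
"```python\n",
"import re\n",
"\n",
"keywords = ['https', 'login', '.php', '.html', '@', 'sign']\n",
"\n",
"def keyword_flags(url, literal=True):\n",
"    # Return a 0/1 flag per keyword, using substring or regex matching\n",
"    flags = {}\n",
"    for keyword in keywords:\n",
"        if literal:\n",
"            found = keyword in url\n",
"        else:\n",
"            found = re.search(keyword, url) is not None\n",
"        flags['keyword_' + keyword] = int(found)\n",
"    return flags\n",
"\n",
"# Hypothetical URL: it contains 'xphp' but no literal '.php'\n",
"url = 'http://example.com/xphp/index.htm'\n",
"print(keyword_flags(url, literal=True)['keyword_.php'])\n",
"print(keyword_flags(url, literal=False)['keyword_.php'])\n",
"```\n",
"\n",
"Whichever matching rule is chosen, the same rule has to be applied at training time and at prediction time."
]
},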
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* Lenght of the url\n",
"* Lenght of domain\n",
"* is IP?\n",
"* Number of .com"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"data['lenght'] = data.url.str.len() - 2"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"domain = data.url.str.split('/', expand=True).iloc[:, 2]"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"data['lenght_domain'] = domain.str.len()"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0 www.subalipack.com\n",
"1 fasc.maximecapellot-gypsyjazz-ensemble.nl\n",
"2 theotheragency.com\n",
"3 aaalandscaping.com\n",
"4 paypal.com.confirm-key-21107316126168.securepp...\n",
"5 lcthomasdeiriarte.edu.co\n",
"6 livetoshare.org\n",
"7 www.i-m.co\n",
"8 manuelfernando.co\n",
"9 www.bladesmithnews.com\n",
"10 www.rasbaek.com\n",
"11 199.231.190.160\n",
"Name: 2, dtype: object"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"domain.head(12)"
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"data['isIP'] = (domain.str.replace('.', '') * 1).str.isnumeric().astype(int)"
]
},
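{
"cell_type": "markdown",
"metadata": {},
"source": [
"The digits-only check is a heuristic: a host made of digits with no dots at all would also be flagged. As an alternative sketch (a swapped-in technique, not the method used to train the model in this notebook), Python's standard `ipaddress` module can validate the host directly. The helper name `is_ip_host` is hypothetical.\n",
"\n",
"```python\n",
"import ipaddress\n",
"from urllib.parse import urlsplit\n",
"\n",
"def is_ip_host(url):\n",
"    # Return 1 if the host part of the URL is a valid IPv4 or IPv6 address\n",
"    try:\n",
"        host = urlsplit(url.strip()).hostname\n",
"    except ValueError:\n",
"        return 0\n",
"    if host is None:\n",
"        return 0\n",
"    try:\n",
"        ipaddress.ip_address(host)\n",
"    except ValueError:\n",
"        return 0\n",
"    return 1\n",
"\n",
"print(is_ip_host('http://199.231.190.160/login.php'))\n",
"print(is_ip_host('http://www.subalipack.com/contact/'))\n",
"```\n",
"\n",
"Swapping this in would change the `isIP` feature, so the model would need to be retrained to keep training and serving features identical."
]
},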
{
"cell_type": "code",
"execution_count": 68,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"data['count_com'] = data.url.str.count('com')"
]
},
{
"cell_type": "code",
"execution_count": 69,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" url | \n",
" phishing | \n",
" keyword_sign | \n",
" keyword_https | \n",
" keyword_login | \n",
" keyword_.php | \n",
" keyword_.html | \n",
" keyword_@ | \n",
" count_com | \n",
" lenght | \n",
" lenght_domain | \n",
" isIP | \n",
"
\n",
" \n",
" \n",
" \n",
" 28607 | \n",
" http://pennstatehershey.org/web/ibd/home/event... | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 80 | \n",
" 20 | \n",
" 0 | \n",
"
\n",
" \n",
" 3689 | \n",
" http://guiadesanborja.com/multiprinter/muestra... | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
" 81 | \n",
" 18 | \n",
" 0 | \n",
"
\n",
" \n",
" 6405 | \n",
" http://paranaibaweb.com/faleconosco/accounting... | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
" 1 | \n",
" 65 | \n",
" 16 | \n",
" 0 | \n",
"
\n",
" \n",
" 35355 | \n",
" http://courts.delaware.gov/Jury%20Services/Hel... | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 94 | \n",
" 19 | \n",
" 0 | \n",
"
\n",
" \n",
" 16520 | \n",
" http://erpa.co/tmp/getproductrequest.htm\\n | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 39 | \n",
" 7 | \n",
" 0 | \n",
"
\n",
" \n",
" 16196 | \n",
" http://pulapulapipoca.com/components/com_media... | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" 4 | \n",
" 239 | \n",
" 18 | \n",
" 0 | \n",
"
\n",
" \n",
" 3810 | \n",
" http://www.dag.or.kr/zboard/icon/visa/img/Atua... | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 62 | \n",
" 13 | \n",
" 0 | \n",
"
\n",
" \n",
" 3005 | \n",
" http://www.amazingdressup.com/wp-content/theme... | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
" 1 | \n",
" 94 | \n",
" 22 | \n",
" 0 | \n",
"
\n",
" \n",
" 9003 | \n",
" http://web.indosuksesfutures.com/content_file/... | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
" 80 | \n",
" 25 | \n",
" 0 | \n",
"
\n",
" \n",
" 34704 | \n",
" http://www.nutritionaltree.com/subcat.aspx?cid... | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
" 69 | \n",
" 23 | \n",
" 0 | \n",
"
\n",
" \n",
" 12561 | \n",
" http://www.formation-continue-loiret.fr/compon... | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 5 | \n",
" 122 | \n",
" 32 | \n",
" 0 | \n",
"
\n",
" \n",
" 10885 | \n",
" http://191.91.128.205/httpss/bancolombiaa.olb.... | \n",
" 1 | \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
" 1 | \n",
" 1 | \n",
" 0 | \n",
" 2 | \n",
" 451 | \n",
" 14 | \n",
" 1 | \n",
"
\n",
" \n",
" 2633 | \n",
" http://www.sternies-hp.de/components/com_conte... | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 2 | \n",
" 85 | \n",
" 18 | \n",
" 0 | \n",
"
\n",
" \n",
" 22253 | \n",
" http://www.silive.com/northshore/index.ssf/200... | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
" 1 | \n",
" 85 | \n",
" 14 | \n",
" 0 | \n",
"
\n",
" \n",
" 4720 | \n",
" http://www.dineo.co.za/components/com_content/... | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" 3 | \n",
" 172 | \n",
" 15 | \n",
" 0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" url phishing \\\n",
"28607 http://pennstatehershey.org/web/ibd/home/event... 0 \n",
"3689 http://guiadesanborja.com/multiprinter/muestra... 1 \n",
"6405 http://paranaibaweb.com/faleconosco/accounting... 1 \n",
"35355 http://courts.delaware.gov/Jury%20Services/Hel... 0 \n",
"16520 http://erpa.co/tmp/getproductrequest.htm\\n 1 \n",
"16196 http://pulapulapipoca.com/components/com_media... 1 \n",
"3810 http://www.dag.or.kr/zboard/icon/visa/img/Atua... 1 \n",
"3005 http://www.amazingdressup.com/wp-content/theme... 1 \n",
"9003 http://web.indosuksesfutures.com/content_file/... 1 \n",
"34704 http://www.nutritionaltree.com/subcat.aspx?cid... 0 \n",
"12561 http://www.formation-continue-loiret.fr/compon... 1 \n",
"10885 http://191.91.128.205/httpss/bancolombiaa.olb.... 1 \n",
"2633 http://www.sternies-hp.de/components/com_conte... 1 \n",
"22253 http://www.silive.com/northshore/index.ssf/200... 0 \n",
"4720 http://www.dineo.co.za/components/com_content/... 1 \n",
"\n",
" keyword_sign keyword_https keyword_login keyword_.php \\\n",
"28607 0 0 0 0 \n",
"3689 0 0 1 1 \n",
"6405 0 0 0 0 \n",
"35355 0 0 0 0 \n",
"16520 0 0 0 0 \n",
"16196 0 0 1 1 \n",
"3810 0 0 0 0 \n",
"3005 0 0 0 0 \n",
"9003 0 0 0 0 \n",
"34704 0 0 0 0 \n",
"12561 0 0 0 0 \n",
"10885 0 1 0 1 \n",
"2633 0 0 0 0 \n",
"22253 0 0 0 0 \n",
"4720 0 0 0 1 \n",
"\n",
" keyword_.html keyword_@ count_com lenght lenght_domain isIP \n",
"28607 0 0 0 80 20 0 \n",
"3689 0 0 1 81 18 0 \n",
"6405 1 0 1 65 16 0 \n",
"35355 0 0 0 94 19 0 \n",
"16520 0 0 0 39 7 0 \n",
"16196 0 0 4 239 18 0 \n",
"3810 0 0 0 62 13 0 \n",
"3005 1 0 1 94 22 0 \n",
"9003 0 0 1 80 25 0 \n",
"34704 0 0 1 69 23 0 \n",
"12561 0 0 5 122 32 0 \n",
"10885 1 0 2 451 14 1 \n",
"2633 0 0 2 85 18 0 \n",
"22253 1 0 1 85 14 0 \n",
"4720 0 0 3 172 15 0 "
]
},
"execution_count": 69,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.sample(15, random_state=4)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create Model"
]
},
{
"cell_type": "code",
"execution_count": 70,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"X = data.drop(['url', 'phishing'], axis=1)"
]
},
{
"cell_type": "code",
"execution_count": 71,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"y = data.phishing"
]
},
{
"cell_type": "code",
"execution_count": 72,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn.cross_validation import cross_val_score"
]
},
{
"cell_type": "code",
"execution_count": 73,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"clf = RandomForestClassifier(n_jobs=-1, n_estimators=100)"
]
},
{
"cell_type": "code",
"execution_count": 74,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([ 0.80625, 0.81175, 0.8085 , 0.79475, 0.8025 , 0.816 ,\n",
" 0.80375, 0.80525, 0.80175, 0.794 ])"
]
},
"execution_count": 74,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cross_val_score(clf, X, y, cv=10)"
]
},
{
"cell_type": "code",
"execution_count": 75,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',\n",
" max_depth=None, max_features='auto', max_leaf_nodes=None,\n",
" min_samples_leaf=1, min_samples_split=2,\n",
" min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1,\n",
" oob_score=False, random_state=None, verbose=0,\n",
" warm_start=False)"
]
},
"execution_count": 75,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"clf.fit(X, y)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Save model"
]
},
{
"cell_type": "code",
"execution_count": 76,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from sklearn.externals import joblib"
]
},
{
"cell_type": "code",
"execution_count": 79,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"['22_clf_rf.pkl']"
]
},
"execution_count": 79,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"joblib.dump(clf, '22_clf_rf.pkl', compress=3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 2: Model in batch\n",
"\n",
"See 22_model_deployment.py"
]
},
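{
"cell_type": "markdown",
"metadata": {},
"source": [
"The contents of the deployment module are not shown in this notebook. The following is only a sketch of what a batch scoring helper could look like, assuming it rebuilds the Part 1 features in the same column order. The names `url_features` and `score_urls` are hypothetical, and `clf` stands for any fitted model with a `predict_proba` method, such as the forest saved in Part 1.\n",
"\n",
"```python\n",
"import re\n",
"\n",
"KEYWORDS = ['https', 'login', '.php', '.html', '@', 'sign']\n",
"\n",
"def url_features(url):\n",
"    # Mirror the notebook's feature columns, in the same order\n",
"    row = [int(re.search(k, url) is not None) for k in KEYWORDS]\n",
"    domain = url.split('/')[2]\n",
"    row.append(len(url) - 2)                                # lenght\n",
"    row.append(len(domain))                                 # lenght_domain\n",
"    row.append(int(domain.replace('.', '').isnumeric()))    # isIP\n",
"    row.append(url.count('com'))                            # count_com\n",
"    return row\n",
"\n",
"def score_urls(urls, clf):\n",
"    # Score a whole batch of URLs with one predict_proba call\n",
"    probabilities = clf.predict_proba([url_features(u) for u in urls])\n",
"    return [p[1] for p in probabilities]\n",
"\n",
"print(url_features('http://199.231.190.160/a/login.php'))\n",
"```\n",
"\n",
"With the saved model loaded through `joblib.load`, a call such as `score_urls(list_of_urls, clf)` would return one phishing probability per URL."
]
},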
{
"cell_type": "code",
"execution_count": 131,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from m22_model_deployment import predict_proba"
]
},
{
"cell_type": "code",
"execution_count": 132,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.89000000000000001"
]
},
"execution_count": 132,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"predict_proba('http://www.vipturismolondres.com/com.br/?atendimento=Cliente&/LgSgkszm64/B8aNzHa8Aj.php')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 3: API\n",
"\n",
"Flask is considered more Pythonic than Django because Flask web application code is in most cases more explicit. Flask is easy to get started with as a beginner because there is little boilerplate code for getting a simple app up and running."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First we need to install some libraries \n",
"\n",
"```\n",
"pip install flask-restplus\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Load Flask"
]
},
{
"cell_type": "code",
"execution_count": 87,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from flask import Flask\n",
"from flask.ext.restplus import Api\n",
"from flask.ext.restplus import fields\n",
"from sklearn.externals import joblib\n",
"from flask.ext.restplus import Resource\n",
"from sklearn.externals import joblib\n",
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create api"
]
},
{
"cell_type": "code",
"execution_count": 128,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"app = Flask(__name__)\n",
"\n",
"api = Api(\n",
" app, \n",
" version='1.0', \n",
" title='Phishing Prediction API',\n",
" description='Phishing Prediction API')\n",
"\n",
"ns = api.namespace('predict', \n",
" description='Phishing Classifier')\n",
" \n",
"parser = api.parser()\n",
"\n",
"parser.add_argument(\n",
" 'URL', \n",
" type=str, \n",
" required=True, \n",
" help='URL to be analyzed', \n",
" location='args')\n",
"\n",
"resource_fields = api.model('Resource', {\n",
" 'result': fields.String,\n",
"})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Load model and create function that predicts an URL"
]
},
{
"cell_type": "code",
"execution_count": 129,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"clf = joblib.load('22_clf_rf.pkl') \n",
"\n",
"@ns.route('/')\n",
"class PhishingApi(Resource):\n",
"\n",
" @api.doc(parser=parser)\n",
" @api.marshal_with(resource_fields)\n",
" def get(self):\n",
" args = parser.parse_args()\n",
" result = self.predict_proba(args)\n",
"\n",
" return result, 200\n",
"\n",
" def predict_proba(self, args):\n",
" url = args['URL']\n",
" \n",
" url_ = pd.DataFrame([url], columns=['url'])\n",
" \n",
" # Create features\n",
" keywords = ['https', 'login', '.php', '.html', '@', 'sign']\n",
" for keyword in keywords:\n",
" url_['keyword_' + keyword] = url_.url.str.contains(keyword).astype(int)\n",
" \n",
" url_['lenght'] = url_.url.str.len() - 2\n",
" domain = url_.url.str.split('/', expand=True).iloc[:, 2]\n",
" url_['lenght_domain'] = domain.str.len()\n",
" url_['isIP'] = (url_.url.str.replace('.', '') * 1).str.isnumeric().astype(int)\n",
" url_['count_com'] = url_.url.str.count('com')\n",
"\n",
" # Make prediction\n",
" p1 = clf.predict_proba(url_.drop('url', axis=1))[0,1]\n",
"\n",
" print('url=', url,'| p1=', p1)\n",
"\n",
" return {\n",
" \"result\": p1\n",
" }"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Run API"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"app.run(debug=True, use_reloader=False, host='0.0.0.0', port=5000)"
]
},
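{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once the server is running, the endpoint can also be called from code. Because the URL being analyzed travels inside a query string, percent-encoding it keeps its own `?`, `&` and `/` characters from being read as part of the request. The following is a minimal standard-library sketch; the helper name `build_request_url` is hypothetical, and the base address assumes the host and port passed to `app.run`.\n",
"\n",
"```python\n",
"from urllib.parse import urlencode\n",
"\n",
"def build_request_url(target, base='http://localhost:5000/predict/'):\n",
"    # Percent-encode the target so it survives as a single URL parameter\n",
"    return base + '?' + urlencode({'URL': target})\n",
"\n",
"request_url = build_request_url('http://consultoriojuridico.co/pp/www.paypal.com/')\n",
"print(request_url)\n",
"\n",
"# With the server running, the request itself could look like this:\n",
"# import json, urllib.request\n",
"# with urllib.request.urlopen(request_url) as response:\n",
"#     print(json.load(response))\n",
"```\n",
"\n",
"The network call is left commented out so that the cell runs even when no server is listening."
]
},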
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Check using \n",
"\n",
"* http://localhost:5000/predict/?URL=http://consultoriojuridico.co/pp/www.paypal.com/\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.1"
}
},
"nbformat": 4,
"nbformat_minor": 0
}