{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "
\n", "## PE File Classification Exercise\n", "In this notebook we're going to explore, understand and classify PE (Portable Executable) files as either 'benign' or 'malicious'.\n", "http://en.wikipedia.org/wiki/Portable_Executable\n", "The primary motivation is to explore the nexus of IPython, Pandas and scikit-learn, with PE file classification as a vehicle for that exploration. The exercise intentionally shows what machine learning experts might call a naive approach; this is for clarity and conciseness. Recommendations for deeper materials and resources are given in the conclusion.\n", "\n", "
| | check_sum | compile_date | datadir_IMAGE_DIRECTORY_ENTRY_BASERELOC_size | datadir_IMAGE_DIRECTORY_ENTRY_EXPORT_size | datadir_IMAGE_DIRECTORY_ENTRY_IAT_size | datadir_IMAGE_DIRECTORY_ENTRY_IMPORT_size | datadir_IMAGE_DIRECTORY_ENTRY_RESOURCE_size | debug_size | export_size | generated_check_sum | iat_rva | major_version | minor_version | number_of_bound_import_symbols | number_of_bound_imports | number_of_export_symbols | number_of_import_symbols | number_of_imports | number_of_rva_and_sizes | number_of_sections | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 97308 | 1383744221 | 3044 | 0 | 592 | 140 | 7368 | 28 | 0 | 97308 | 50424 | 0 | 0 | 0 | 0 | 0 | 142 | 6 | 16 | 5 | ... |
| 1 | 103233 | 1383102953 | 60 | 0 | 1008 | 60 | 872 | 28 | 0 | 103233 | 53248 | 5 | 1 | 0 | 0 | 0 | 124 | 2 | 16 | 8 | ... |
| 2 | 26573 | 1386271379 | 360 | 0 | 208 | 100 | 2588 | 28 | 0 | 25971 | 8804 | 0 | 0 | 0 | 0 | 0 | 48 | 4 | 16 | 5 | ... |
| 3 | 0 | 1373925025 | 12 | 0 | 8 | 83 | 11904 | 28 | 0 | 54015 | 35064 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 16 | 4 | ... |
| 4 | 50003 | 1378865704 | 360 | 0 | 208 | 100 | 2588 | 28 | 0 | 59485 | 8804 | 0 | 0 | 0 | 0 | 0 | 48 | 4 | 16 | 5 | ... |

5 rows × 108 columns
\n", "\n", "
| | check_sum | compile_date | datadir_IMAGE_DIRECTORY_ENTRY_BASERELOC_size | datadir_IMAGE_DIRECTORY_ENTRY_EXPORT_size | datadir_IMAGE_DIRECTORY_ENTRY_IAT_size | datadir_IMAGE_DIRECTORY_ENTRY_IMPORT_size | datadir_IMAGE_DIRECTORY_ENTRY_RESOURCE_size | debug_size | export_size | generated_check_sum | iat_rva | major_version | minor_version | number_of_bound_import_symbols | number_of_bound_imports | number_of_export_symbols | number_of_import_symbols | number_of_imports | number_of_rva_and_sizes | number_of_sections | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 50.000000 | 5.000000e+01 | 50.000000 | 50.000000 | 50.000000 | 50.000000 | 50.00000 | 50.000000 | 50.000000 | 50.000000 | 50.000000 | 50.000000 | 50.000000 | 50.000000 | 50.000000 | 50.000000 | 50.000000 | 50.000000 | 50 | 50.00000 | ... |
| mean | 25235.660000 | 1.035770e+09 | 415.280000 | 14.640000 | 126.720000 | 456.160000 | 9615.64000 | 3.920000 | 14.640000 | 86998.520000 | 43982.640000 | 0.960000 | 0.120000 | 0.140000 | 0.740000 | 0.240000 | 44.560000 | 3.740000 | 16 | 4.32000 | ... |
| std | 45704.015095 | 3.202979e+08 | 1061.159532 | 55.908365 | 180.722252 | 1060.814846 | 19062.02003 | 9.814275 | 55.908365 | 30119.209943 | 44546.311213 | 2.137708 | 0.328261 | 0.404566 | 2.028471 | 0.893514 | 46.412595 | 3.445257 | 0 | 1.75476 | ... |
| min | 0.000000 | 2.099200e+06 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 26104.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 16 | 1.00000 | ... |
| 25% | 0.000000 | 9.372855e+08 | 0.000000 | 0.000000 | 0.000000 | 40.000000 | 0.00000 | 0.000000 | 0.000000 | 68094.000000 | 14593.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 9.250000 | 1.000000 | 16 | 3.00000 | ... |
| 50% | 0.000000 | 1.172916e+09 | 0.000000 | 0.000000 | 44.000000 | 100.000000 | 1152.00000 | 0.000000 | 0.000000 | 82579.000000 | 25835.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 26.000000 | 2.500000 | 16 | 4.00000 | ... |
| 75% | 36417.000000 | 1.219691e+09 | 14.000000 | 0.000000 | 231.000000 | 186.000000 | 5938.00000 | 0.000000 | 0.000000 | 108406.000000 | 55948.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 70.500000 | 5.000000 | 16 | 5.00000 | ... |
| max | 150326.000000 | 1.382647e+09 | 4612.000000 | 313.000000 | 748.000000 | 6234.000000 | 84152.00000 | 28.000000 | 313.000000 | 164776.000000 | 189824.000000 | 10.000000 | 1.000000 | 2.000000 | 8.000000 | 4.000000 | 180.000000 | 18.000000 | 16 | 9.00000 | ... |

8 rows × 199 columns
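
The row labels in the summary table above (count, mean, std, min, 25%, 50%, 75%, max) look like the output of a describe-style call, although the code cell that produced it is not shown here. As a rough illustration of what each row means, the sketch below computes the same eight statistics for a single column using only the Python standard library. The helper name `describe_column` is hypothetical, the sample standard deviation and linear quantile interpolation are assumptions about how the table was computed, and the input values are the five `number_of_sections` entries from the first table, used purely as sample data (they are not the 50 rows summarized above).

```python
import statistics


def describe_column(values):
    """Return describe-style summary statistics for one numeric column.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    data = sorted(values)
    # method="inclusive" interpolates linearly between the observed points
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "count": len(data),
        "mean": statistics.mean(data),
        "std": statistics.stdev(data),  # sample standard deviation (n - 1)
        "min": data[0],
        "25%": q1,
        "50%": median,
        "75%": q3,
        "max": data[-1],
    }


# number_of_sections values from the five rows shown in the first table
number_of_sections = [5, 8, 5, 4, 5]
summary = describe_column(number_of_sections)
for name, value in summary.items():
    print(f"{name:>5}  {value}")
```

A column whose `std` is 0 and whose `min` equals its `max`, such as `number_of_rva_and_sizes` in the table above, takes the same value in every row and so carries no information for separating the two classes.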
\n", "