:page_with_curl: Model & Classification extractor (`SemaClassifier`)
====

When a new sample has to be evaluated, its SCDG is first built as described in the README of the SCDG. Then, `gspan` is applied to extract the biggest common subgraph, and a similarity score is computed to decide whether the graph is considered part of the family.

The similarity score `S` between graphs `G'` and `G''` measures how much of `G'` appears in `G''`, given that `G''` is a subgraph of `G'`.

Another classifier we use is the Support Vector Machine (`SVM`) with the INRIA graph kernel or the Weisfeiler-Lehman extension graph kernel.

### How to use?

Launch the container:

```bash
docker run --rm --name="sema-classifier" -v ${PWD}/InputFolder:/sema-classifier/application/database -it sema-classifier ../docker_startup.sh 1
```

The volume corresponds to the folder containing the inputs that will be accessible from the container.

Then run the script:

```
python3 SemaClassifier.py FOLDER/FILE

usage: update_readme_usage.py [-h] [--threshold THRESHOLD] [--biggest_subgraph BIGGEST_SUBGRAPH] [--support SUPPORT]
                              [--ctimeout CTIMEOUT] [--epoch EPOCH] [--sepoch SEPOCH] [--data_scale DATA_SCALE]
                              [--vector_size VECTOR_SIZE] [--batch_size BATCH_SIZE] (--classification | --detection)
                              (--wl | --inria | --dl | --gspan) [--bancteian] [--delf] [--FeakerStealer] [--gandcrab]
                              [--ircbot] [--lamer] [--nitol] [--RedLineStealer] [--sfone] [--sillyp2p] [--simbot]
                              [--Sodinokibi] [--sytro] [--upatre] [--wabot] [--RemcosRAT] [--verbose_classifier]
                              [--train] [--nthread NTHREAD]
                              binaries

Classification module arguments

optional arguments:
  -h, --help            show this help message and exit
  --classification      By malware family
  --detection           Cleanware vs Malware
  --wl                  TODO
  --inria               TODO
  --dl                  TODO
  --gspan               TODOe

Global classifiers parameters:
  --threshold THRESHOLD
                        Threshold used for the classifier [0..1] (default : 0.45)

Gspan options:
  --biggest_subgraph BIGGEST_SUBGRAPH
                        Biggest subgraph consider for Gspan (default: 5)
  --support SUPPORT     Support used for the gpsan classifier [0..1] (default : 0.75)
  --ctimeout CTIMEOUT   Timeout for gspan classifier (default : 3sec)

Deep Learning options:
  --epoch EPOCH         Only for deep learning model: number of epoch (default: 5) Always 1 for FL model
  --sepoch SEPOCH       Only for deep learning model: starting epoch (default: 1)
  --data_scale DATA_SCALE
                        Only for deep learning model: data scale value (default: 0.9)
  --vector_size VECTOR_SIZE
                        Only for deep learning model: Size of the vector used (default: 4)
  --batch_size BATCH_SIZE
                        Only for deep learning model: Batch size for the model (default: 1)

Malware familly:
  --bancteian
  --delf
  --FeakerStealer
  --gandcrab
  --ircbot
  --lamer
  --nitol
  --RedLineStealer
  --sfone
  --sillyp2p
  --simbot
  --Sodinokibi
  --sytro
  --upatre
  --wabot
  --RemcosRAT

Global parameter:
  --verbose_classifier  Verbose output during train/classification (default : False)
  --train               Launch training process, else classify/detect new sample with previously computed model
  --nthread NTHREAD     Number of thread used (default: max)
  binaries              Name of the folder containing binary'signatures to analyze (Default: output/save-SCDG/, only that for ToolChain)
```

#### Example

Train models on the input dataset:

```bash
python3 SemaClassifier.py --train test-set/autoit
```

Classify the input dataset using previously computed models:

```bash
python3 SemaClassifier.py test-set/autoit
```

#### Tests

To run the classifier tests, run the following inside the Docker container:

```bash
python3 classifier_tests.py configs/config_test.ini
```
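
#### Scripting the command line

The usage text above lists one flag group for the task (`--classification` or `--detection`) and one for the method (`--wl`, `--inria`, `--dl` or `--gspan`). When the classifier is driven from another script, it can help to assemble the command in one place. The following is a minimal sketch, not part of the toolchain: `build_classifier_command` is a hypothetical helper, and it assumes that one flag from each group is passed explicitly. Only flags that appear in the usage text are used.

```python
from typing import List, Optional

# Flag groups copied from the usage text; each group is mutually exclusive.
MODES = ("classification", "detection")
METHODS = ("wl", "inria", "dl", "gspan")


def build_classifier_command(
    binaries: str,
    mode: str = "classification",
    method: str = "gspan",
    train: bool = False,
    threshold: Optional[float] = None,
) -> List[str]:
    """Return an argv list for SemaClassifier.py (hypothetical helper)."""
    if mode not in MODES:
        raise ValueError(f"mode must be one of {MODES}")
    if method not in METHODS:
        raise ValueError(f"method must be one of {METHODS}")
    # The usage text documents the threshold range as [0..1].
    if threshold is not None and not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be within [0, 1]")

    cmd = ["python3", "SemaClassifier.py", f"--{mode}", f"--{method}"]
    if threshold is not None:
        cmd += ["--threshold", str(threshold)]
    if train:
        cmd.append("--train")
    # The positional `binaries` argument goes last.
    cmd.append(binaries)
    return cmd


if __name__ == "__main__":
    # Print the training and classification commands without running them.
    print(" ".join(build_classifier_command("test-set/autoit", train=True)))
    print(" ".join(build_classifier_command("test-set/autoit")))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` inside the container. Building a list instead of a shell string avoids quoting problems with folder names that contain spaces.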