# Experiments
TODO: 24-27 June 2019:
* Create pipeline
    * with initial unigrams baseline
    * evaluation measure (e.g. precision-recall or AUROC)
## Set up environment
* import libraries
* load csv data

In [1]:
import random
import pandas as pd
from nltk.corpus import treebank
from sklearn.model_selection import train_test_split

description_df = pd.read_csv('./data/description.csv')
installation_df = pd.read_csv('./data/installation.csv')
invocation_df = pd.read_csv('./data/invocation.csv')
citation_df = pd.read_csv('./data/citation.csv')

## Data Preview
Check that the CSV data was imported successfully.

In [2]:
print("Number of description entries: {}".format(len(description_df)))
description_df.head()

Number of description entries: 281


| | URL | excerpt |
|---|---|---|
| 0 | https://github.com/GoogleChrome/puppeteer | Puppeteer is a Node library which provides a h... |
| 1 | https://github.com/JimmySuen/integral-human-pose | The major contributors of this repository incl... |
| 2 | https://github.com/JimmySuen/integral-human-pose | Integral Regression is initially described in ... |
| 3 | https://github.com/JimmySuen/integral-human-pose | We build a 3D pose estimation system based mai... |
| 4 | https://github.com/JimmySuen/integral-human-pose | The Integral Regression is also known as soft-... |


In [3]:
print("Number of installation entries: {}".format(len(installation_df)))
installation_df.head()

Number of installation entries: 800


| | URL | excerpt |
|---|---|---|
| 0 | https://github.com/GoogleChrome/puppeteer | Installation |
| 1 | https://github.com/GoogleChrome/puppeteer | To use Puppeteer in your project, run: |
| 2 | https://github.com/GoogleChrome/puppeteer | npm i puppeteer |
| 3 | https://github.com/GoogleChrome/puppeteer | # or "yarn add puppeteer" |
| 4 | https://github.com/GoogleChrome/puppeteer | puppeteer-core |


In [4]:
print("Number of invocation entries: {}".format(len(invocation_df)))
invocation_df.head()

Number of invocation entries: 1118


| | URL | excerpt |
|---|---|---|
| 0 | https://github.com/JimmySuen/integral-human-pose | Usage |
| 1 | https://github.com/JimmySuen/integral-human-pose | We have placed some example config files in ex... |
| 2 | https://github.com/JimmySuen/integral-human-pose | Train |
| 3 | https://github.com/JimmySuen/integral-human-pose | For Integral Human Pose Regression, cd to pyto... |
| 4 | https://github.com/JimmySuen/integral-human-pose | Integral Regression |


In [5]:
print("Number of citation entries: {}".format(len(citation_df)))
citation_df.head()

Number of citation entries: 309


| | URL | excerpt |
|---|---|---|
| 0 | https://github.com/JimmySuen/integral-human-pose | If you find Integral Regression useful in your... |
| 1 | https://github.com/JimmySuen/integral-human-pose | @article{sun2017integral, |
| 2 | https://github.com/JimmySuen/integral-human-pose | title={Integral human pose regression}, |
| 3 | https://github.com/JimmySuen/integral-human-pose | author={Sun, Xiao and Xiao, Bin and Liang, Shu... |
| 4 | https://github.com/JimmySuen/integral-human-pose | journal={arXiv preprint arXiv:1711.08229}, |


Each data set currently contains only positive samples of its respective category. A classifier also needs negative samples, which act as a control against which the positives are distinguished. For each category, the negative samples consist of excerpts from the other three categories plus text unrelated to repository documentation. For example, for the description classifier, the positive samples are the excerpts labelled as descriptions, and the negative samples are excerpts labelled as installation, invocation, or citation, together with unrelated text such as sentences from the Treebank corpus.

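Because many more negative samples are available than positive samples, the negatives are randomly subsampled. The target mix is about 40% positive and 60% negative, with the negative share split evenly: 15% from each of the three other categories and 15% from unrelated text (e.g. Treebank). Relative to the number of positive samples, this means drawing 0.15 / 0.40 = 0.375 times as many samples from each negative source.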
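
The sampling targets above can be checked with a little arithmetic. The sketch below is only an illustration: the helper name `negative_quota` and its parameters are hypothetical, and the count of 281 positives is the number of description entries reported in the data preview.

```python
def negative_quota(n_positive, pos_share=0.40, neg_share_per_source=0.15):
    """Number of negatives to draw from each source for n_positive positives."""
    # 15% per negative source relative to 40% positives gives a ratio of 0.375.
    ratio = neg_share_per_source / pos_share
    return int(n_positive * ratio)


n_positive = 281   # description entries reported in the data preview
n_sources = 4      # installation, invocation, citation, and Treebank text
per_source = negative_quota(n_positive)
total = n_positive + n_sources * per_source
positive_fraction = n_positive / total

print(per_source, total, round(positive_fraction, 3))
```

Because the quota is truncated to an integer, the resulting positive share lands slightly above 40% rather than exactly on it.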

*Question: Treebank sentences are already tokenized (split by word). Does nltk provide the sentences in untokenized form, or can the existing token split be reused when a tokenizer is applied later?*
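
One possible angle on this question, offered as a sketch rather than a settled answer: joining a pre-tokenized sentence with single spaces is reversible with a plain whitespace split, so the existing token boundaries are not lost. The example sentence below is hand-written in Treebank style (hypothetical, not drawn from the corpus), and the regular expression is the default `token_pattern` of scikit-learn's `CountVectorizer`, applied here with the standard `re` module to show which tokens that default would keep.

```python
import re

# Hand-written tokens in Treebank style (hypothetical example sentence).
tokens = ["You", "'d", "see", "her", "correcting", "homework", ",", "he", "said", "."]

# Joining with single spaces keeps the token boundaries recoverable.
joined = " ".join(tokens)
recovered = joined.split()

# CountVectorizer's default token_pattern keeps only runs of two or more
# word characters, so punctuation and one-letter fragments are dropped.
default_pattern = r"(?u)\b\w\w+\b"
vectorizer_tokens = re.findall(default_pattern, joined.lower())

print(recovered == tokens)
print(vectorizer_tokens)
```

If the original split should be preserved exactly, `CountVectorizer` accepts a callable through its `tokenizer` parameter, so something like `str.split` could be passed in. nltk also ships a `TreebankWordDetokenizer` in `nltk.tokenize.treebank` for producing more natural-looking strings; neither option is used in this notebook.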
## Description Classifier

In [6]:
# 0.375 = 15% / 40%: negatives drawn from each source per positive sample.
neg_quant = int(len(description_df) * .375)
# Treebank sentences are pre-tokenized, so join the tokens back into strings.
treebank_sents = random.sample(list(treebank.sents()), neg_quant)
treebank_background = pd.DataFrame({"excerpt": [' '.join(sent) for sent in treebank_sents]}).assign(description=False)
description_corpus = pd.concat([
    description_df.assign(description=True),
    installation_df.sample(neg_quant).assign(description=False),
    invocation_df.sample(neg_quant).assign(description=False),
    citation_df.sample(neg_quant).assign(description=False),
    treebank_background,
], sort=False)
# Keyword arguments replace the positional axis, which newer pandas versions reject.
description_corpus = description_corpus.drop(columns='URL').dropna(axis=0).reset_index(drop=True)
description_corpus

| | excerpt | description |
|---|---|---|
| 0 | Puppeteer is a Node library which provides a h... | True |
| 1 | The major contributors of this repository incl... | True |
| 2 | Integral Regression is initially described in ... | True |
| 3 | We build a 3D pose estimation system based mai... | True |
| 4 | The Integral Regression is also known as soft-... | True |
| 5 | This is an official implementation for Integra... | True |
| 6 | The original implementation is based on our in... | True |
| 7 | LibGEOS is a LGPL-licensed package for manipul... | True |
| 8 | Among other things, it allows you to parse Wel... | True |
| 9 | This repository contains the experiments in th... | True |


## Description Classifier pipeline
### Train-test split

In [7]:
X, y = description_corpus.excerpt, description_corpus.description
X_train, X_test, y_train, y_test = train_test_split(X, y)

### Count Vectorizer and Logistic Regression in Pipeline

In [8]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

def display_accuracy_score(y_test, y_pred_class):
    """Print and return the fraction of test samples classified correctly."""
    score = accuracy_score(y_test, y_pred_class)
    print('accuracy score: {:.2%}'.format(score))
    return score

def display_null_accuracy(y_test):
    """Print and return the accuracy of always predicting the majority class."""
    null_accuracy = pd.Series(y_test).value_counts().max() / float(len(y_test))
    print('null accuracy: {:.2%}'.format(null_accuracy))
    return null_accuracy

def display_accuracy_difference(y_test, y_pred_class):
    """Compare the model's accuracy against the null-accuracy baseline."""
    null_accuracy = display_null_accuracy(y_test)
    # Named model_accuracy to avoid shadowing sklearn's accuracy_score.
    model_accuracy = display_accuracy_score(y_test, y_pred_class)
    difference = model_accuracy - null_accuracy
    if difference > 0:
        print('model is {:.2%} more accurate than null accuracy'.format(difference))
    elif difference < 0:
        print('model is {:.2%} less accurate than null accuracy'.format(abs(difference)))
    else:
        print('model is exactly as accurate as null accuracy')
    return null_accuracy, model_accuracy

pipeline = make_pipeline(CountVectorizer(), LogisticRegression())
pipeline.fit(X_train, y_train)
y_pred_class = pipeline.predict(X_test)
y_pred_vals = pipeline.predict_proba(X_test)
#print(y_pred_vals)
#print("X_test: {}, y_pred: {}".format(X_test, y_pred_class))
#results_df = pd.DataFrame({"x_test": X_test, "y_pred": y_pred_vals[:,1], "y_TF_pred": y_pred_class, "y_actual": y_test})
results_df = pd.DataFrame({"x_test": X_test,  "y_TF_pred": y_pred_class, "y_actual": y_test})
print(results_df)
print(confusion_matrix(y_test, y_pred_class))
print('-' * 75 + '\nClassification Report\n')
print(classification_report(y_test, y_pred_class))
display_accuracy_difference(y_test, y_pred_class)


                                                x_test  y_TF_pred  y_actual
488                           tin = _meshfix.PyTMesh()      False     False
597  Lord Chilver , 63-year-old chairman of English...      False     False
686  `` You 'd see her correcting homework in the s...      False     False
417                                             header      False     False
529  title = {{PyVista}: 3D plotting and mesh analy...      False     False
566             @inproceedings{pumarola2018ganimation,      False     False
282                 pip install opencv-python==3.2.0.6      False     False
361                                pip install empymod      False     False
365  A C++ compiler for the Python extension, and C...       True     False
2    Integral Regression is initially described in ...      False      True
561  booktitle = {Proceedings of the International ...      False     False
101  The writing functionality in segyio is largely...       True      True
595  `` You 



(0.6971428571428572, 0.8285714285714286)
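
The TODO list mentions AUROC, and the cell above already computes class probabilities in `y_pred_vals` without using them. As a rough sketch of what that metric measures, the hand-rolled function below computes AUROC as the probability that a randomly chosen positive sample is scored above a randomly chosen negative one. The function name `auroc` and the toy labels and scores are hypothetical and not taken from the notebook's output.

```python
def auroc(y_true, y_score):
    """Probability that a random positive is scored above a random negative.

    Ties count as half a win. The pairwise comparison is quadratic in the
    number of samples, which is acceptable for a few hundred test rows.
    """
    pos = [s for t, s in zip(y_true, y_score) if t]
    neg = [s for t, s in zip(y_true, y_score) if not t]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and scores for illustration only.
y_true = [True, True, False, False, False]
y_score = [0.9, 0.4, 0.6, 0.3, 0.1]
print(auroc(y_true, y_score))
```

In the notebook itself, the same quantity should be obtainable with scikit-learn's `roc_auc_score(y_test, y_pred_vals[:, 1])`, where column 1 holds the predicted probability of the `True` class.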

In [9]:
len(description_df)

281