{ "cells": [ { "attachments": {}, "cell_type": "markdown", "id": "dbdc0d5a", "metadata": {}, "source": [ "##
Predicting Funding Outcomes Using XGBoost and Campaign Narratives: Kickstarter Crowdfunding Campaigns" ] },
{ "attachments": {}, "cell_type": "markdown", "id": "2c651ce2", "metadata": {}, "source": [ "I examined the effectiveness of the communication that entrepreneurs use in their crowdfunding pitches to secure funding for their ventures. I used Natural Language Processing (NLP) and an XGBoost machine learning model to predict funding success from the campaigns' text.\n", "\n", "Data were collected for all Kickstarter campaigns launched in the US in 2016. The campaign text descriptions were web scraped from [kickstarter.com](https://www.kickstarter.com/). The resulting dataset amounted to 21,711 campaigns: 9,717 funded and 11,994 unfunded.\n", "\n", "The resulting model produced a precision of 0.74 for the funded class, a solid result considering that the text predictors had no a priori theoretical basis. That is, of the 1,836 campaigns that the model flagged as funded in a hold-out test set, 1,363 were in fact funded, using the text data only." ] },
{ "cell_type": "code", "execution_count": 1, "id": "a3caaa3d", "metadata": {}, "outputs": [], "source": [ "# Importing Python libraries\n", "import re\n", "import pandas as pd\n", "import numpy as np\n", "import nltk\n", "from nltk.corpus import stopwords\n", "from nltk.stem import WordNetLemmatizer\n", "from sklearn.datasets import load_files\n", "\n", "# The stopword list and WordNet data are used below; download them once if missing:\n", "# nltk.download('stopwords'); nltk.download('wordnet')" ] },
{ "attachments": {}, "cell_type": "markdown", "id": "e9812abd", "metadata": {}, "source": [ "## Importing Data\n", "I used the load_files function from the sklearn.datasets module to import the dataset. load_files automatically divides the dataset into data and target sets: the texts from the \"neg\" (unfunded) and \"pos\" (funded) subfolders were loaded into the docs variable, while the target categories were stored in y (coded 0 and 1)." ] },
{ "cell_type": "code", "execution_count": 2, "id": "aad5066c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Samples per class: [11994 9717]\n" ] } ], "source": [ "data = load_files(r\"text files\", random_state=0)\n", "docs, y = data.data, data.target\n", "print(\"Samples per class: {}\".format(np.bincount(y)))" ] },
{ "attachments": {}, "cell_type": "markdown", "id": "1a533b70", "metadata": {}, "source": [ "## Text Processing\n", "I removed special characters, digits, and unwanted spaces from the text using regular expressions. I started by replacing all non-word characters (punctuation and other symbols) with spaces.\n", "\n", "I then removed stray single characters. For instance, when I removed the punctuation mark from \"David's\" and replaced it with a space, I got \"David\" and the single character \"s,\" which has no meaning on its own. To remove such single characters, I used the \\s+[a-zA-Z]\\s+ regular expression, which substituted every single character surrounded by spaces with a single space.\n", "\n", "I used the ^[a-zA-Z]\\s+ regular expression to replace a single character at the beginning of the document with a single space. Replacing single characters with spaces can leave multiple consecutive spaces, which is not ideal.\n", "\n", "I therefore used the regular expression \\s+ to collapse one or more spaces into a single space. Finally, when a dataset is read in bytes format, the letter \"b\" is prefixed to every string; the regex ^b\\s+ removes it from the start of a string.\n", "\n",
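"To make these substitutions concrete, the snippet below applies a slightly simplified version of them to a single made-up string (the sample text and the exact order of the steps are chosen for illustration only); the full cleaning loop used on the scraped campaign descriptions follows in the next code cell.\n", "\n", "```python\n", "import re\n", "\n", "# A hypothetical byte-decoded campaign description\n", "sample = str(b\"David's 3 cats launched a game!\")\n", "\n", "sample = re.sub(r'\\W', ' ', sample)              # punctuation and other non-word characters -> spaces\n", "sample = re.sub(r'\\d', ' ', sample)              # digits -> spaces\n", "sample = re.sub(r'\\s+[a-zA-Z]\\s+', ' ', sample)  # stray single characters (e.g., the orphaned 's')\n", "sample = re.sub(r'^[a-zA-Z]\\s+', ' ', sample)    # a single character at the start (here the bytes prefix 'b')\n", "sample = re.sub(r'\\s+', ' ', sample)             # collapse runs of whitespace\n", "print(sample.lower().strip())                    # david cats launched game\n", "```\n", "\n",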
"The next step was to convert the text to lower case so that the same word with different capitalization is treated as a single term.\n", "\n", "The final preprocessing step was lemmatization, which reduces each word to its dictionary root form; for instance, \"cats\" was converted into \"cat.\" This avoids creating features that are syntactically different but semantically the same, such as separate features for \"cats\" and \"cat.\"" ] },
{ "cell_type": "code", "execution_count": 3, "id": "1c6e4e88", "metadata": {}, "outputs": [], "source": [ "# Remove special characters, digits, and unwanted spaces from the text,\n", "# then lowercase and lemmatize every document\n", "lemmatizer = WordNetLemmatizer()\n", "clean_docs = []\n", "\n", "for sen in range(0, len(docs)):\n", "    # replace non-word characters (punctuation, symbols) with spaces\n", "    doc = re.sub(r'\\W', ' ', str(docs[sen]))\n", "\n", "    # remove single characters surrounded by spaces\n", "    doc = re.sub(r'\\s+[a-zA-Z]\\s+', ' ', doc)\n", "\n", "    # remove a single character at the start of the document\n", "    doc = re.sub(r'^[a-zA-Z]\\s+', ' ', doc)\n", "\n", "    # substitute multiple spaces with a single space\n", "    doc = re.sub(r'\\s+', ' ', doc, flags=re.I)\n", "\n", "    # remove the 'b' prefix left over from the bytes representation\n", "    doc = re.sub(r'^b\\s+', '', doc)\n", "\n", "    # remove the byte-order-mark remnant 'xef xbb xbf'\n", "    doc = re.sub('xef xbb xbf', '', doc)\n", "\n", "    # remove digits\n", "    doc = re.sub(r'\\d', '', doc)\n", "\n", "    # remove the leading 'n' left over from an escaped newline\n", "    doc = re.sub('n', '', doc, count=1)\n", "\n", "    # remove the encoding remnants 'xc xa'\n", "    doc = re.sub('xc xa', '', doc)\n", "\n", "    # convert to lowercase\n", "    doc = doc.lower()\n", "\n", "    # lemmatization\n", "    doc = doc.split()\n", "    doc = [lemmatizer.lemmatize(word) for word in doc]\n", "    doc = ' '.join(doc)\n", "\n", "    # append the cleaned document to the list\n", "    clean_docs.append(doc)\n", "\n", "X = clean_docs" ] },
{ "attachments": {}, "cell_type": "markdown", "id": "4c45e187", "metadata": {}, "source": [ "## Converting Text to Numbers\n", "\n", "I used tf–idf, a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. TF stands for \"term frequency\" and IDF stands for \"inverse document frequency.\" The tf–idf value increases proportionally with the number of times a word appears in a document and is offset by the number of documents in the corpus that contain the word, which adjusts for the fact that some words are frequent in general. Term frequency is calculated as Term frequency = (Number of occurrences of a word) / (Total words in the document). The tf–idf value of a word in a particular document is therefore high when the word occurs often in that document but rarely in the rest of the corpus.\n", "\n", "## Training the Text Classification Model and Predicting Funding Outcomes\n", "\n", "To determine the predictive capacity of the model, I split the data into a training set and a testing set. I used scikit-learn's Pipeline combined with RandomizedSearchCV to assemble several steps that can be cross-validated while setting different parameters. Because tf–idf makes use of the statistical properties of the training data, wrapping TfidfVectorizer and the classifier in a single pipeline ensures that the vectorizer is refit on each cross-validation training fold, so the results of the search remain valid. The preprocessing step in the pipeline is therefore TfidfVectorizer, and I used XGBoost for classification.\n", "\n",
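"As a quick, self-contained illustration of the tf–idf weighting described above (the three toy documents are invented for this example and are not part of the Kickstarter data):\n", "\n", "```python\n", "from sklearn.feature_extraction.text import TfidfVectorizer\n", "import pandas as pd\n", "\n", "toy_corpus = ['funding a board game', 'funding an art project', 'a community art space']\n", "vec = TfidfVectorizer()\n", "weights = vec.fit_transform(toy_corpus)\n", "\n", "# Words shared by several documents (e.g., 'funding', 'art') receive lower weights\n", "# than words unique to a single document (e.g., 'game', 'community').\n", "print(pd.DataFrame(weights.toarray(), columns=vec.get_feature_names_out()).round(2))\n", "```\n", "\n",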
"RandomizedSearchCV sampled 20 random combinations of the parameters in the grid (100 model fits in total across 5-fold cross-validation) and returned the best combination according to the precision scoring metric." ] },
{ "cell_type": "code", "execution_count": 4, "id": "1c3c6b22", "metadata": {}, "outputs": [], "source": [ "# Import ML libraries\n", "from sklearn.feature_extraction.text import TfidfVectorizer\n", "from sklearn.pipeline import make_pipeline\n", "from sklearn.model_selection import train_test_split, RandomizedSearchCV\n", "from sklearn.metrics import classification_report, confusion_matrix\n", "import xgboost as xgb" ] },
{ "cell_type": "code", "execution_count": 5, "id": "b53c52cc", "metadata": {}, "outputs": [], "source": [ "# Create training and testing sets\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, stratify=y)" ] },
{ "cell_type": "code", "execution_count": 6, "id": "976429ea", "metadata": {}, "outputs": [], "source": [ "# Create the pipeline: tf-idf features followed by an XGBoost classifier\n", "pipe = make_pipeline(TfidfVectorizer(stop_words=stopwords.words('english')),\n", "                     xgb.XGBClassifier(objective='binary:logistic',\n", "                                       random_state=0, n_jobs=-1))" ] },
{ "cell_type": "code", "execution_count": 7, "id": "76695725", "metadata": {}, "outputs": [], "source": [ "# Specify the hyperparameter space\n", "parameters = {\n", "    'xgbclassifier__learning_rate': np.arange(.05, 0.1, 1),\n", "    'xgbclassifier__max_depth': np.arange(1, 4, 10),\n", "    'xgbclassifier__n_estimators': np.arange(50, 200),\n", "    'xgbclassifier__eta': [0.001, 0.01, 0.1],\n", "    'xgbclassifier__colsample_bytree': [0.1, 0.5, 0.8],\n", "    'tfidfvectorizer__ngram_range': [(1, 1), (1, 2), (1, 3)],\n", "    'tfidfvectorizer__max_features': [500, 1000, 1500],\n", "    'tfidfvectorizer__min_df': [5, 7],\n", "    'tfidfvectorizer__max_df': [0.5, 0.7]\n", "}" ] },
{ "cell_type": "code", "execution_count": 8, "id": "b3573b8a", "metadata": {}, "outputs": [], "source": [ "# Instantiate the RandomizedSearchCV object\n", "grid = RandomizedSearchCV(estimator=pipe, param_distributions=parameters, n_iter=20,\n", "                          scoring='precision', cv=5, verbose=3, n_jobs=-1)" ] },
{ "cell_type": "code", "execution_count": 9, "id": "c2f600c8", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Fitting 5 folds for each of 20 candidates, totalling 100 fits\n" ] } ], "source": [ "# Fit to the training set\n", "model = grid.fit(X_train, y_train)" ] },
{ "attachments": {}, "cell_type": "markdown", "id": "f8c8c074", "metadata": {}, "source": [ "## Results\n", "The model produced a precision of 0.74 for funded campaigns, a solid result considering that the text predictors had no a priori theoretical basis. That is, of the 1,836 campaigns that the model flagged as funded in the hold-out test set, 1,363 were in fact funded, using the text data only; it also recovered 1,363 of the 2,429 truly funded campaigns, a recall of 0.56. Because the test set score (0.74) was only slightly lower than the best cross-validation score obtained during training (0.76), the model does not appear to be overfit and its performance estimate can be considered reliable."
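, "\n", "\n", "As a sanity check on these figures, the funded-class precision and recall can be recovered directly from the confusion matrix printed by the cell below; the counts in this small sketch are copied from that output.\n", "\n", "```python\n", "# Confusion matrix counts from the output below (rows = true class, columns = predicted class)\n", "tn, fp = 2526, 473    # truly unfunded: correctly classified vs. wrongly flagged as funded\n", "fn, tp = 1066, 1363   # truly funded: missed vs. correctly classified\n", "\n", "precision_funded = tp / (tp + fp)   # 1363 / 1836 ≈ 0.74\n", "recall_funded = tp / (tp + fn)      # 1363 / 2429 ≈ 0.56\n", "print(round(precision_funded, 2), round(recall_funded, 2))\n", "```"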
] }, { "cell_type": "code", "execution_count": 10, "id": "767f8bba", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Best cross-validation score: 0.76\n", "Test set score: 0.74\n", " precision recall f1-score support\n", "\n", " 0 0.70 0.84 0.77 2999\n", " 1 0.74 0.56 0.64 2429\n", "\n", " accuracy 0.72 5428\n", " macro avg 0.72 0.70 0.70 5428\n", "weighted avg 0.72 0.72 0.71 5428\n", "\n", "Best params:\n", "{'xgbclassifier__n_estimators': 199, 'xgbclassifier__max_depth': 1, 'xgbclassifier__learning_rate': 0.05, 'xgbclassifier__eta': 0.01, 'xgbclassifier__colsample_bytree': 0.8, 'tfidfvectorizer__ngram_range': (1, 1), 'tfidfvectorizer__min_df': 5, 'tfidfvectorizer__max_features': 1500, 'tfidfvectorizer__max_df': 0.5}\n", "\n", "Confusion matrix:\n", "[[2526 473]\n", " [1066 1363]]\n" ] } ], "source": [ "# Get performance metrics\n", "predict = model.predict(X_test)\n", "print(\"Best cross-validation score: {:.2f}\".format(model.best_score_))\n", "print(\"Test set score: {:.2f}\".format(model.score(X_test, y_test)))\n", "print (classification_report(y_test, predict))\n", "print(\"Best params:\\n{}\\n\".format(model.best_params_))\n", "print(\"Confusion matrix:\\n{}\".format(confusion_matrix(y_test, predict)))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.9.12 ('base')", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.2" }, "vscode": { "interpreter": { "hash": "2f2c054dce7f3ebd1e76d58cbab2bcb69b44b6a1fb131f28c8fa296b61f3b710" } } }, "nbformat": 4, "nbformat_minor": 5 }