{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# PC Session 1 " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# **Machine Learning for Prediction**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Author:**\n", "[Anthony Strittmatter](http://www.anthonystrittmatter.com)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We estimate the hedonic prices of used-cars. For this purpose, we web-scrape data from the online auction platform *MyLemons*. We restrict the sample to BMW 320 series, Opel Astra, Mercedes C-class, VW Golf, and VW Passat. We select used-cars with a mileage between 10,000-200,000 km and an age between 1-20 years. \n", "\n", "We obtain the following variables:\n", "\n", "\n", "|Variable name| Description|\n", "|:----|:----|\n", "|**Outcome variables** ||\n", "|*first_price*| First asking price in 1,000 CHF |\n", "|*final_price*| Transaction price in 1,000 CHF|\n", "|*overprice*| Dummy indicating *first_price > final_price* |\n", "|**Baseline covariates**| |\n", "|*bmw_320, opel_astra, mercedes_c, vw_golf, vw_passat*| Dummies for the car make and model|\n", "|*mileage*| Mileage of the used car (in 1,000 km)|\n", "|*age_car_years*| Age of the used car (in years)|\n", "|*diesel*| Dummy for diesel engines |\n", "|*private_seller*| Dummy for private seller (as opposed to professional used car sellers) |\n", "|*other_car_owner*| Number of previous caar owners |\n", "|*guarantee*| Dummy indicating that the seller offers a guarantee for the used car|\n", "|*maintenance_cert*| Dummy indicating that the seller has a complete maintenace certificate for the used car|\n", "|*inspection*| Categorial variable for the duration until next general inspection (3 categories: new, 1-2 years, < 1 year) |\n", "|*pm_green*| Dummy indicating that the used car has low particular matter emissions|\n", "|*co2_em*| CO2 emssion (in g/km)|\n", "|*euro_norm*| EURO emission norm under which the car is registered |\n", "|*page_title* | Text in the title of the used car offer |\n", "\n", "\n", "Furthermore, we generate some transformations of our covariates for later analysis. The transformed covariates are:\n", "\n", "|Variable name| Description|\n", "|:----|:----|\n", "|**Additional covariates** ||\n", "|*mileage2, mileage3, mileage4, age_car_years2, age_car_years3, age_car_years4*| Squared, cubic, and quadratic *mileage* and *age_car_years* |\n", "|*dur_next_ins_0*| Dummy indicating that the duration until the next general inspection is less than a years |\n", "|*dur_next_ins_1_2*| Dummy indicating that the duration until the next general inspection is between 1 and 2 years |\n", "|*new_inspection*| Dummy indicating that the used car has a new general inspection |\n", "|*euro_1, euro_2, euro_3, euro_4, euro_5, euro_6*| Dummies for EURO emission norms |\n", "\n", "We store the prepared data in the file *used_cars.csv*. 
\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load Packages" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "######################## Load Packages ########################\n", "\n", "# List of required packages\n", "pkgs <- c('tidyverse','glmnet','corrplot','plotmo')\n", "\n", "# Load packages\n", "for(pkg in pkgs){\n", " library(pkg, character.only = TRUE)\n", "}\n", "\n", "set.seed(10101) # set starting value for random number generator\n", "\n", "print('All packages successfully installed and loaded.')\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load Data Frame" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We load the data frame and label the covariates. We select a subsample of 300 used-cars in order to decrease the computation time while you are testing your code. We can use the entire sample of 104,719 used cars after we are finised with programming." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "######################## Load Data Frame ########################\n", "\n", "# Load data frame\n", "data_raw <- read.csv(\"Data/used_cars.csv\",header=TRUE, sep=\",\")\n", "\n", "# Outcome Variable\n", "outcomes <- c(\"first_price\")\n", "\n", "# Covariates/Features\n", "covariates <- c(\"mileage\",\"mileage2\", \"mileage3\", \"mileage4\", \"age_car_years\", \"age_car_years2\", \n", " \"age_car_years3\", \"age_car_years4\", \"other_car_owner\", \"co2_em\", \"euro_1\", \n", " \"euro_2\", \"euro_3\", \"euro_4\", \"euro_6\", \"dur_next_ins_0\", \n", " \"dur_next_ins_1_2\", \"bmw_320\", \"opel_astra\", \"mercedes_c\", \n", " \"vw_passat\", \"diesel\", \"private_seller\", \"guarantee\", \n", " \"maintenance_cert\", \"pm_green\") \n", "\n", "all_variables <- c(outcomes, covariates)\n", "\n", "# Selection of Subsample size, max. 104,721 observations\n", "# Select smaller subsample to decrease computation time\n", "n_obs <- 300\n", "df <- data_raw %>%\n", " dplyr::sample_n(n_obs) %>%\n", " dplyr::select(all_variables)\n", "\n", "print('Data frame successfully loaded.')\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Descriptive Statistics" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Means and Standard Deviations" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "######################## Table with Descriptive Statistics ########################\n", "\n", "desc <- fBasics::basicStats(df) %>% t() %>% as.data.frame() %>% \n", " select(Mean, Stdev, Minimum, Maximum, nobs)\n", "print(round(desc, digits=1))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Correlation Matrix" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "######################## Correlation Matrix ########################\n", "\n", "corr = cor(df)\n", "corrplot(corr, type = \"upper\", tl.col = \"black\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Take Hold-Out-Sample " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We want to compare the relative prediction power of different estimation procedures based on the out-of-sample MSE and $R^2$. For this purpose, we create an hold-out-sample. Additionally, we generate 100 random variables which are unrelated to the used-car prices. These variables create additional noise in the estimation. 
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "######################## Take Hold-Out-Sample ########################\n", "\n", "df_part <- modelr::resample_partition(df, c(obs = 0.8, hold_out = 0.2))\n", "df_obs <- as.data.frame(df_part$obs) # Training and estimation sample\n", "df_hold_out <- as.data.frame(df_part$hold_out) # Hold-out-sample\n", "\n", "# Outcomes\n", "first_price_obs <- as.matrix(df_obs[,1])\n", "first_price_hold_out <- as.matrix(df_hold_out[,1])\n", "\n", "# Generate some noise covariates to disturb the estimation\n", "noise <- paste0(\"noise\", 1:100) # names noise1, ..., noise100\n", "noise_obs <- matrix(data = rnorm(nrow(df_obs)*100), nrow = nrow(df_obs), ncol = 100)\n", "colnames(noise_obs) <- noise\n", "noise_hold_out <- matrix(data = rnorm(nrow(df_hold_out)*100), nrow = nrow(df_hold_out), ncol = 100)\n", "colnames(noise_hold_out) <- noise\n", "covariates <- c(covariates, noise)\n", "\n", "# Covariates/Features\n", "covariates_obs <- as.matrix(cbind(df_obs[,c(2:ncol(df_obs))],noise_obs))\n", "covariates_hold_out <- as.matrix(cbind(df_hold_out[,c(2:ncol(df_hold_out))],noise_hold_out))\n", "\n", "print('The data is now ready for your first analysis!')\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "# OLS" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We estimate the used-car prices using an OLS model which includes all (relevant and irrelevant) covariates.\n", "\n", "**Replace the parameters marked with question marks.**" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "######################## OLS Model ######################## \n", "\n", "# Setup the formula of the linear regression model\n", "sumx <- paste(covariates, collapse = \" + \") \n", "linear <- paste(\"first_price_obs\", sumx, sep=\" ~ \")\n", "linear <- as.formula(linear)\n", "\n", "# Setup the data for linear regression\n", "data <- as.data.frame(cbind(first_price_obs,covariates_obs))\n", "\n", "# Estimate OLS model\n", "ols <- lm(?formula?, ?data?)\n", "# Some variables might be dropped because of perfect collinearity\n", "# (126 covariates, including the 100 noise terms, on only 240 training observations)\n", "\n", "# In-sample fitted values\n", "fit1_in <- predict.lm(?model?)\n" ] },
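{ "cell_type": "markdown", "metadata": {}, "source": [ "Once the placeholders are filled in and the model is fitted, a quick check (a sketch, assuming the fitted object is called *ols* as above) shows which covariates *lm()* dropped because of collinearity: they receive an *NA* coefficient." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Sketch: list covariates dropped by lm() due to perfect collinearity\n", "# (requires the fitted `ols` object from the cell above)\n", "dropped <- names(coef(ols))[is.na(coef(ols))]\n", "print(paste0(\"Number of dropped covariates: \", length(dropped)))\n", "print(dropped)\n" ] },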
{ "cell_type": "markdown", "metadata": {}, "source": [ "Extrapolate the fitted values to the hold-out-sample.\n", "\n", "**Replace the parameters marked with question marks.**" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Out-of-sample fitted values\n", "fit1_out <- predict.lm(?model?, newdata = data.frame(covariates_hold_out))\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Evaluate the in- and out-of-sample performance using the MSE and $R^2$." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# In-sample performance measures\n", "mse1_in <- round(mean((first_price_obs - fit1_in)^2),digits=3)\n", "rsquared_in <- round(1-mean((first_price_obs - fit1_in)^2)/mean((first_price_obs - mean(first_price_obs))^2),digits=3)\n", "print(paste0(\"In-Sample MSE OLS: \", mse1_in))\n", "print(paste0(\"In-Sample R-squared OLS: \", rsquared_in))\n", "\n", "# Out-of-sample performance measures\n", "mse1_out <- round(mean((first_price_hold_out - fit1_out)^2),digits=3)\n", "rsquared_out <- round(1-mean((first_price_hold_out - fit1_out)^2)/mean((first_price_hold_out - mean(first_price_hold_out))^2),digits=3)\n", "print(paste0(\"Out-of-Sample MSE OLS: \", mse1_out))\n", "print(paste0(\"Out-of-Sample R-squared OLS: \", rsquared_out))\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "# LASSO" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Standard Lasso" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The LASSO minimises the objective function\n", "\\begin{equation*}\n", "\\min_{\\beta} \\left\\{ \\sum_{i=1}^{N} \\left( Y_i- \\beta_0 -\\sum_{j=1}^{p}X_{ij}\\beta_j \\right)^2 + \\lambda \\sum_{j=1}^{p} |\\beta_j| \\right\\}.\n", "\\end{equation*}\n", "First, we have to find the optimal tuning parameter $\\lambda$ via cross-validation (CV); a sketch for inspecting the resulting CV curve follows after the next cell.\n", "\n", "**Replace the parameters marked with question marks.**" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "######################## CV-LASSO ######################## \n", "p = 1 # 1 for LASSO, 0 for Ridge\n", "\n", "set.seed(10101)\n", "lasso.linear <- cv.glmnet(?covariates?, ?outcome?, alpha=p, \n", "                          nlambda = 100, type.measure = 'mse')\n", "# nlambda specifies the number of different lambda values on the grid (log-scale)\n", "# type.measure specifies that the optimality criterion is the MSE in the CV-samples\n", "\n", "# Plot MSE in CV-Samples for different values of lambda\n", "plot(?model?)\n", "\n", "# Optimal Lambda\n", "print(paste0(\"Lambda minimising CV-MSE: \", round(lasso.linear$lambda.min,digits=8)))\n", "# The 1 standard error rule reduces the number of included covariates\n", "print(paste0(\"Lambda 1 standard error rule: \", round(lasso.linear$lambda.1se,digits=8)))\n", "\n", "# Number of Non-Zero Coefficients\n", "print(paste0(\"Number of selected covariates (lambda.min): \",lasso.linear$glmnet.fit$df[lasso.linear$glmnet.fit$lambda==lasso.linear$lambda.min]))\n", "print(paste0(\"Number of selected covariates (lambda.1se): \",lasso.linear$glmnet.fit$df[lasso.linear$glmnet.fit$lambda==lasso.linear$lambda.1se]))\n" ] },
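{ "cell_type": "markdown", "metadata": {}, "source": [ "The 1 standard error rule picks the largest $\\lambda$ whose CV-MSE lies within one standard error of the minimal CV-MSE. A small sketch (assuming the *lasso.linear* object from the cell above) tabulates the CV curve behind the plot directly:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Sketch: tabulate the cross-validation curve of lasso.linear\n", "cv_tab <- data.frame(lambda  = lasso.linear$lambda,   # grid of lambda values\n", "                     cv_mse  = lasso.linear$cvm,      # mean CV-MSE\n", "                     cv_se   = lasso.linear$cvsd,     # standard error of the CV-MSE\n", "                     nonzero = lasso.linear$nzero)    # number of selected covariates\n", "head(cv_tab)\n" ] },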
"glmcoef<-coef(lasso.linear,lasso.linear$lambda.1se)\n", "coef.increase<-dimnames(glmcoef[glmcoef[,1]>0,0])[[1]]\n", "coef.decrease<-dimnames(glmcoef[glmcoef[,1]<0,0])[[1]]\n", "\n", "lambda_min = lasso.linear$glmnet.fit$lambda[29]/lasso.linear$glmnet.fit$lambda[1]\n", "set.seed(10101)\n", "mod <- glmnet(covariates_obs, first_price_obs, lambda.min.ratio = lambda_min, alpha=p)\n", "maxcoef<-coef(mod,s=lambda_min)\n", "coef<-dimnames(maxcoef[maxcoef[,1]!=0,0])[[1]]\n", "allnames<-dimnames(maxcoef[maxcoef[,1]!=0,0])[[1]][order(maxcoef[maxcoef[,1]!=0,ncol(maxcoef)],decreasing=TRUE)]\n", "allnames<-setdiff(allnames,allnames[grep(\"Intercept\",allnames)])\n", "\n", "#assign colors\n", "cols<-rep(\"gray\",length(allnames))\n", "cols[allnames %in% coef.increase]<-\"red\" \n", "cols[allnames %in% coef.decrease]<- \"green\"\n", "\n", "plot_glmnet(mod,label=TRUE,s=lasso.linear$lambda.1se,col= cols)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Plot Lasso Coefficients\n", "\n", "**Replace parameters in questionsmarks.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "######################## Plot LASSO Coefficients ########################\n", "\n", "print('LASSO coefficients')\n", "\n", "glmcoef<-coef(?model?, ?lambda?)\n", "print(glmcoef)\n", "# the LASSO coefficients are biased because of the penalty term\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## In-Sample Perforamce Measures\n", "\n", "**Replace parameters in questionsmarks.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "######################## In-Sample Performance of LASSO ######################## \n", "\n", "# Estimate LASSO model \n", "# Use Lambda that minizes CV-MSE\n", "set.seed(10101)\n", "lasso.fit.min <- glmnet(?covariates?, ?outcome?, lambda = lasso.linear$lambda.min)\n", "yhat.lasso.min <- predict(?model?, ?covariates?)\n", "\n", "# Use 1 standard error rule?covariates?, ?outcome?,\n", "set.seed(10101)\n", "lasso.fit.1se <- glmnet(?covariates?, ?outcome?, lambda = ?lambda?)\n", "yhat.lasso.1se <- predict(?model?, ?covariates?)\n", "\n", "# In-sample performance measures\n", "print(paste0(\"In-Sample MSE OLS: \", mse1_in))\n", "print(paste0(\"In-Sample R-squared OLS: \", rsquared_in))\n", "\n", "mse2_in <- round(mean((first_price_obs - yhat.lasso.min)^2),digits=3)\n", "rsquared2_in <- round(1-mean((first_price_obs - yhat.lasso.min)^2)/mean((first_price_obs - mean(first_price_obs))^2),digits=3)\n", "print(paste0(\"In-Sample MSE Lasso (lambda.min): \", mse2_in))\n", "print(paste0(\"In-Sample R-squared Lasso (lambda.min): \", rsquared2_in))\n", "\n", "mse3_in <- round(mean((first_price_obs - yhat.lasso.1se)^2),digits=3)\n", "rsquared3_in <- round(1-mean((first_price_obs - yhat.lasso.1se)^2)/mean((first_price_obs - mean(first_price_obs))^2),digits=3)\n", "print(paste0(\"In-Sample MSE Lasso(lambda.1se): \", mse3_in))\n", "print(paste0(\"In-Sample R-squared Lasso (lambda.1se): \", rsquared3_in))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Out-of-Sample Perforamce Measures\n", "\n", "**Replace parameters in questionsmarks.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "######################## Out-of-Sample Performance of LASSO ######################## \n", "\n", "# Extrapolate Lasso fitted values to hold-out-sample\n", "yhat.lasso.min <- predict(?model?, ?covariates?)\n", "yhat.lasso.1se <- predict(?model?, 
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Out-of-Sample Performance Measures\n", "\n", "**Replace the parameters marked with question marks.**" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "######################## Out-of-Sample Performance of LASSO ######################## \n", "\n", "# Extrapolate the Lasso fitted values to the hold-out-sample\n", "yhat.lasso.min <- predict(?model?, ?covariates?)\n", "yhat.lasso.1se <- predict(?model?, ?covariates?)\n", "\n", "# Out-of-sample performance measures\n", "print(paste0(\"Out-of-Sample MSE OLS: \", mse1_out))\n", "print(paste0(\"Out-of-Sample R-squared OLS: \", rsquared_out))\n", "\n", "mse2_out <- round(mean((first_price_hold_out - yhat.lasso.min)^2),digits=3)\n", "rsquared2_out <- round(1-mean((first_price_hold_out - yhat.lasso.min)^2)/mean((first_price_hold_out - mean(first_price_hold_out))^2),digits=3)\n", "print(paste0(\"Out-of-Sample MSE Lasso (lambda.min): \", mse2_out))\n", "print(paste0(\"Out-of-Sample R-squared Lasso (lambda.min): \", rsquared2_out))\n", "\n", "mse3_out <- round(mean((first_price_hold_out - yhat.lasso.1se)^2),digits=3)\n", "rsquared3_out <- round(1-mean((first_price_hold_out - yhat.lasso.1se)^2)/mean((first_price_hold_out - mean(first_price_hold_out))^2),digits=3)\n", "print(paste0(\"Out-of-Sample MSE Lasso (lambda.1se): \", mse3_out))\n", "print(paste0(\"Out-of-Sample R-squared Lasso (lambda.1se): \", rsquared3_out))\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We could improve the performance of the LASSO prediction by adding more covariates (e.g., interactions). We can check the performance of the Ridge estimator by setting *p = 0*." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Honest Inference Procedure" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Create a training and an estimation sample." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "######################## Separate Training and Estimation Sample ######################## \n", "\n", "df_obs2 <- as.data.frame(cbind(first_price_obs,covariates_obs))\n", "df_part2 <- modelr::resample_partition(df_obs2, c(obs = 0.5, hold_out = 0.5))\n", "df_train <- as.data.frame(df_part2$obs) # Training sample\n", "df_est <- as.data.frame(df_part2$hold_out) # Estimation sample\n", "\n", "# Outcomes\n", "first_price_train <- as.matrix(df_train[,1])\n", "first_price_est <- as.matrix(df_est[,1])\n", "\n", "# Covariates/Features\n", "covariates_train <- as.matrix(df_train[,c(2:ncol(df_obs2))])\n", "covariates_est <- as.matrix(df_est[,c(2:ncol(df_obs2))])\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Use the $\\lambda$ cross-validated in the training sample to fit a LASSO model in the estimation sample. Extrapolate the fitted values from the estimation sample to the hold-out-sample; one possible completion is sketched after the next cell.\n", "\n", "**Replace the parameters marked with question marks.**" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Cross-validate Lasso model\n", "set.seed(10101)\n", "lasso.linear2 <- ?cv.glmnet command?\n", "plot(?model?)\n", "\n", "# Optimal Lambda\n", "# The 1 standard error rule reduces the number of included covariates\n", "print(paste0(\"Lambda 1 standard error rule: \", round(lasso.linear2$lambda.1se,digits=8)))\n", "\n", "# Number of Non-Zero Coefficients\n", "print(paste0(\"Number of selected covariates (lambda.1se): \",lasso.linear2$glmnet.fit$df[lasso.linear2$glmnet.fit$lambda==lasso.linear2$lambda.1se]))\n", "\n", "# Estimate LASSO model \n", "set.seed(10101)\n", "lasso.fit2.1se <- ?glmnet command?\n", "yhat2.lasso.1se <- predict(?model?, ?covariates?)\n" ] },
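{ "cell_type": "markdown", "metadata": {}, "source": [ "For orientation, one possible completion of the placeholders above (a sketch, left commented out so that it does not overwrite your own solution):" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# One possible completion (sketch): cross-validate lambda on the training half,\n", "# fit on the estimation half, and predict into the hold-out-sample.\n", "# lasso.linear2 <- cv.glmnet(covariates_train, first_price_train, alpha=p,\n", "#                            nlambda = 100, type.measure = 'mse')\n", "# lasso.fit2.1se <- glmnet(covariates_est, first_price_est,\n", "#                          lambda = lasso.linear2$lambda.1se)\n", "# yhat2.lasso.1se <- predict(lasso.fit2.1se, covariates_hold_out)\n" ] },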
{ "cell_type": "markdown", "metadata": {}, "source": [ "Evaluate the performance in the hold-out-sample." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Out-of-sample performance measures\n", "mse4_out <- round(mean((first_price_hold_out - yhat2.lasso.1se)^2),digits=3)\n", "rsquared4_out <- round(1-mean((first_price_hold_out - yhat2.lasso.1se)^2)/mean((first_price_hold_out - mean(first_price_hold_out))^2),digits=3)\n", "\n", "print(paste0(\"Out-of-Sample MSE OLS: \", mse1_out))\n", "print(paste0(\"Out-of-Sample R-squared OLS: \", rsquared_out))\n", "\n", "print(paste0(\"Out-of-Sample MSE Lasso (lambda.1se): \", mse3_out))\n", "print(paste0(\"Out-of-Sample R-squared Lasso (lambda.1se): \", rsquared3_out))\n", "\n", "print(paste0(\"Out-of-Sample MSE Honest Lasso (lambda.1se): \", mse4_out))\n", "print(paste0(\"Out-of-Sample R-squared Honest Lasso (lambda.1se): \", rsquared4_out))\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Cross-Fitting Procedure" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Switch the training and estimation samples and repeat the same procedure as above. Use the $\\lambda$ cross-validated in the estimation sample to fit a LASSO model in the training sample. Extrapolate the fitted values from the training sample to the hold-out-sample.\n", "\n", "**Replace the parameters marked with question marks.**" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "######################## Cross-Fitting ######################## \n", "\n", "# Cross-validate Lasso model\n", "set.seed(10101)\n", "lasso.linear3 <- ?cv.glmnet command?\n", "\n", "plot(?model?)\n", "\n", "# Optimal Lambda\n", "# The 1 standard error rule reduces the number of included covariates\n", "print(paste0(\"Lambda 1 standard error rule: \", round(lasso.linear3$lambda.1se,digits=8)))\n", "\n", "# Number of Non-Zero Coefficients\n", "print(paste0(\"Number of selected covariates (lambda.1se): \",lasso.linear3$glmnet.fit$df[lasso.linear3$glmnet.fit$lambda==lasso.linear3$lambda.1se]))\n", "\n", "# Estimate LASSO model \n", "set.seed(10101)\n", "lasso.fit3.1se <- ?glmnet command?\n", "yhat3.lasso.1se_B <- predict(?model?, ?covariates?)\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Take the average of the fitted values extrapolated from the training and estimation samples; the general formula is given below." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Take average of fitted values from both cross-fitting samples\n", "yhat3.lasso.1se <- 0.5*(yhat2.lasso.1se + yhat3.lasso.1se_B)\n" ] },
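{ "cell_type": "markdown", "metadata": {}, "source": [ "In general, cross-fitting with $K$ folds averages the predictions from the $K$ rotations of the training and estimation roles,\n", "\\begin{equation*}\n", "\\hat{Y}_i = \\frac{1}{K} \\sum_{k=1}^{K} \\hat{Y}_i^{(k)},\n", "\\end{equation*}\n", "which is what the cell above computes for the special case $K=2$." ] },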
{ "cell_type": "markdown", "metadata": {}, "source": [ "Evaluate the performance in the hold-out-sample." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Out-of-sample performance measures\n", "mse5_out <- round(mean((first_price_hold_out - yhat3.lasso.1se)^2),digits=3)\n", "rsquared5_out <- round(1-mean((first_price_hold_out - yhat3.lasso.1se)^2)/mean((first_price_hold_out - mean(first_price_hold_out))^2),digits=3)\n", "\n", "print(paste0(\"Out-of-Sample MSE OLS: \", mse1_out))\n", "print(paste0(\"Out-of-Sample R-squared OLS: \", rsquared_out))\n", "\n", "print(paste0(\"Out-of-Sample MSE Lasso (lambda.1se): \", mse3_out))\n", "print(paste0(\"Out-of-Sample R-squared Lasso (lambda.1se): \", rsquared3_out))\n", "\n", "print(paste0(\"Out-of-Sample MSE Honest Lasso (lambda.1se): \", mse4_out))\n", "print(paste0(\"Out-of-Sample R-squared Honest Lasso (lambda.1se): \", rsquared4_out))\n", "\n", "print(paste0(\"Out-of-Sample MSE Cross-Fitted Honest Lasso (lambda.1se): \", mse5_out))\n", "print(paste0(\"Out-of-Sample R-squared Cross-Fitted Honest Lasso (lambda.1se): \", rsquared5_out))\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "# Extra Exercises:\n", "\n", "1. Estimate the Post-Lasso coefficients. Do they differ from the Lasso coefficients? Do the performances of the Lasso and Post-Lasso estimators differ?\n", "\n", "2. Predict the used-car prices using a Ridge instead of a Lasso model. Which estimator shows the better performance?\n", "\n", "3. How do the results change when you increase the sample size to 104,721 observations?\n", "\n", "4. Replace the outcome variable 'first_price' with the 'overprice' dummy. Fit a linear and a logit Lasso model. How do the models differ from each other?" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "R", "language": "R", "name": "ir" }, "language_info": { "codemirror_mode": "r", "file_extension": ".r", "mimetype": "text/x-r-source", "name": "R", "pygments_lexer": "r", "version": "3.4.4" } }, "nbformat": 4, "nbformat_minor": 2 }