![](ISL_fig_4_6.pdf){width=700px height=350px}

Here $\pi_1 = \pi_2 = \pi_3 = 1/3$. The dashed lines are the Bayes decision boundaries. Were they known, they would yield the fewest misclassification errors among all possible classifiers. The black lines are the LDA decision boundaries.

:::

- LDA on the credit `Default` data.

::: {.panel-tabset}

#### Python

```{python}
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Pipeline
pipe_lda = Pipeline(steps = [
    ("col_tf", col_tf),
    ("model", LinearDiscriminantAnalysis())
])

# Fit LDA
lda_fit = pipe_lda.fit(X, y)

# Predicted labels from LDA
lda_pred = lda_fit.predict(X)

# Confusion matrix
lda_cfm = pd.crosstab(
    lda_pred,
    y,
    margins = True,
    rownames = ['Predicted Default Status'],
    colnames = ['True Default Status']
)
lda_cfm
```

Overall training accuracy of LDA (using 0.5 as the threshold) is

```{python}
(lda_cfm.loc['Yes', 'Yes'] + lda_cfm.loc['No', 'No']) / lda_cfm.loc['All', 'All']
```

The area under the ROC curve (AUC) of LDA is

```{python}
lda_auc = roc_auc_score(
    y,
    lda_fit.predict_proba(X)[:, 1]
)
lda_auc
```

#### R

```{r}
library(MASS)

# Fit LDA
lda_mod <- lda(
  default ~ balance + income + student,
  data = Default
)
lda_mod

# Predicted labels from LDA
lda_pred <- predict(lda_mod, Default)

# Confusion matrix
lda_cfm <- table(Predicted = lda_pred$class, Default = Default$default)

# Accuracy
(lda_cfm['Yes', 'Yes'] + lda_cfm['No', 'No']) / sum(lda_cfm)
```

:::

### Quadratic discriminant analysis (QDA)

- In LDA, the normal distributions for all classes share the same covariance $\boldsymbol{\Sigma}$.

- If we instead assume that the normal distribution for class $k$ has its own covariance $\boldsymbol{\Sigma}_k$, we obtain **quadratic discriminant analysis** (QDA).

- The discriminant function takes the form
$$
\delta_k(x) = - \frac{1}{2} (x - \mu_k)^T \boldsymbol{\Sigma}_k^{-1} (x - \mu_k) + \log \pi_k - \frac{1}{2} \log |\boldsymbol{\Sigma}_k|,
$$
which is a quadratic function of $x$.

::: {#fig-lda-vs-qda}

![](ISL_fig_4_9.pdf){width=700px height=350px}

The Bayes (purple dashed), LDA (black dotted), and QDA (green solid) decision boundaries for a two-class problem. Left: $\boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2$. Right: $\boldsymbol{\Sigma}_1 \ne \boldsymbol{\Sigma}_2$.

:::

::: {.panel-tabset}

#### Python

```{python}
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

# Pipeline
pipe_qda = Pipeline(steps = [
    ("col_tf", col_tf),
    ("model", QuadraticDiscriminantAnalysis())
])

# Fit QDA
qda_fit = pipe_qda.fit(X, y)

# Predicted labels from QDA
qda_pred = qda_fit.predict(X)

# Confusion matrix from QDA
qda_cfm = pd.crosstab(
    qda_pred,
    y,
    margins = True,
    rownames = ['Predicted Default Status'],
    colnames = ['True Default Status']
)
qda_cfm
```

Overall training accuracy of QDA (using 0.5 as the threshold) is

```{python}
(qda_cfm.loc['Yes', 'Yes'] + qda_cfm.loc['No', 'No']) / qda_cfm.loc['All', 'All']
```

The area under the ROC curve (AUC) of QDA is

```{python}
qda_auc = roc_auc_score(
    y,
    qda_fit.predict_proba(X)[:, 1]
)
qda_auc
```

#### R

```{r}
# Fit QDA
qda_mod <- qda(
  default ~ balance + income + student,
  data = Default
)
qda_mod

# Predicted labels from QDA
qda_pred <- predict(qda_mod, Default)

# Confusion matrix
qda_cfm <- table(Predicted = qda_pred$class, Default = Default$default)

# Accuracy
(qda_cfm['Yes', 'Yes'] + qda_cfm['No', 'No']) / sum(qda_cfm)
```

:::

### Naive Bayes

- If we assume $f_k(x) = \prod_{j=1}^p f_{kj}(x_j)$ (conditional independence) within each class, we get **naive Bayes**. For Gaussian features this means the $\boldsymbol{\Sigma}_k$ are diagonal.

- Naive Bayes is useful when $p$ is large, where LDA and QDA break down.

- It can handle *mixed* feature vectors (both continuous and categorical). If $X_j$ is qualitative, replace the density $f_{kj}(x_j)$ with a probability mass function (histogram) over the discrete categories.

- Despite its strong assumptions, naive Bayes often produces good classification results.
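The conditional-independence factorization can be made concrete by hand. Below is a minimal sketch on synthetic two-feature data (not the `Default` set; all names are made up for illustration): estimate per-feature Gaussians within each class, then compute the posterior as proportional to $\pi_k \prod_j f_{kj}(x_j)$.

```{python}
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Synthetic two-class training data with p = 2 independent features
X0 = rng.normal(loc=[0.0, 0.0], scale=[1.0, 2.0], size=(100, 2))
X1 = rng.normal(loc=[2.0, 1.0], scale=[1.5, 1.0], size=(100, 2))

# Per-feature mean and standard deviation in each class (diagonal covariance)
mu0, sd0 = X0.mean(axis=0), X0.std(axis=0)
mu1, sd1 = X1.mean(axis=0), X1.std(axis=0)
pi0 = pi1 = 0.5  # equal priors

x0 = np.array([1.0, 0.5])  # a test point

# Unnormalized posteriors: pi_k * prod_j f_kj(x0_j)
post0 = pi0 * np.prod(norm.pdf(x0, mu0, sd0))
post1 = pi1 * np.prod(norm.pdf(x0, mu1, sd1))
prob1 = post1 / (post0 + post1)  # Pr(Y = 1 | X = x0)
print(prob1)
```

This hand-rolled posterior should agree closely with `sklearn.naive_bayes.GaussianNB` fit on the same data, since both estimate class-wise per-feature Gaussians.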
::: {.panel-tabset}

#### Python

```{python}
from sklearn.naive_bayes import GaussianNB

# Pipeline
pipe_nb = Pipeline(steps = [
    ("col_tf", col_tf),
    ("model", GaussianNB())
])

# Fit naive Bayes classifier
nb_fit = pipe_nb.fit(X, y)

# Predicted labels from the NB classifier
nb_pred = nb_fit.predict(X)

# Confusion matrix of the NB classifier
nb_cfm = pd.crosstab(
    nb_pred,
    y,
    margins = True,
    rownames = ['Predicted Default Status'],
    colnames = ['True Default Status']
)
nb_cfm
```

Overall training accuracy of the naive Bayes classifier (using 0.5 as the threshold) is

```{python}
(nb_cfm.loc['Yes', 'Yes'] + nb_cfm.loc['No', 'No']) / nb_cfm.loc['All', 'All']
```

The area under the ROC curve (AUC) of naive Bayes is

```{python}
nb_auc = roc_auc_score(
    y,
    nb_fit.predict_proba(X)[:, 1]
)
nb_auc
```

#### R

```{r}
library(e1071)

# Fit naive Bayes classifier
nb_mod <- naiveBayes(
  default ~ balance + income + student,
  data = Default
)
nb_mod

# Predicted labels from naive Bayes
nb_pred <- predict(nb_mod, Default)

# Confusion matrix
nb_cfm <- table(Predicted = nb_pred, Default = Default$default)

# Accuracy
(nb_cfm['Yes', 'Yes'] + nb_cfm['No', 'No']) / sum(nb_cfm)
```

:::

## $K$-nearest neighbor (KNN) classifier

- KNN is a nonparametric classifier.

- Given a positive integer $K$ and a test observation $x_0$, the KNN classifier first identifies the $K$ points in the training data that are closest to $x_0$, denoted $\mathcal{N}_0$. It estimates the conditional probability by
$$
\operatorname{Pr}(Y = j \mid X = x_0) = \frac{1}{K} \sum_{i \in \mathcal{N}_0} I(y_i = j)
$$
and then classifies $x_0$ to the class with the largest estimated probability.

- We illustrate KNN with $K = 5$ on the credit `Default` data.
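The neighborhood estimate above is easy to compute directly. A minimal NumPy sketch on synthetic data (the names `X_train`, `knn_prob`, etc. are made up, not from the text):

```{python}
import numpy as np

rng = np.random.default_rng(1)
X_train = rng.normal(size=(50, 2))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)

def knn_prob(x0, X_train, y_train, K=5):
    """Estimated Pr(Y = 1 | X = x0) from the K nearest neighbors."""
    dist = np.linalg.norm(X_train - x0, axis=1)  # Euclidean distances to x0
    N0 = np.argsort(dist)[:K]                    # indices of the K closest points
    return np.mean(y_train[N0] == 1)             # fraction of neighbors labeled 1

p1 = knn_prob(np.array([0.5, 0.5]), X_train, y_train, K=5)
print(p1)
```

This matches what `KNeighborsClassifier(n_neighbors=5).predict_proba` returns on the same data (up to distance ties, which do not occur here).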
::: {.panel-tabset}

#### Python

```{python}
from sklearn.neighbors import KNeighborsClassifier

# Pipeline
pipe_knn = Pipeline(steps = [
    ("col_tf", col_tf),
    ("model", KNeighborsClassifier(n_neighbors = 5))
])

# Fit KNN with K = 5
knn_fit = pipe_knn.fit(X, y)

# Predicted labels from KNN
knn_pred = knn_fit.predict(X)

# Confusion matrix of KNN
knn_cfm = pd.crosstab(
    knn_pred,
    y,
    margins = True,
    rownames = ['Predicted Default Status'],
    colnames = ['True Default Status']
)
knn_cfm
```

Overall training accuracy of the KNN classifier with $K = 5$ (using 0.5 as the threshold) is

```{python}
(knn_cfm.loc['Yes', 'Yes'] + knn_cfm.loc['No', 'No']) / knn_cfm.loc['All', 'All']
```

The area under the ROC curve (AUC) of KNN ($K = 5$) is

```{python}
knn_auc = roc_auc_score(
    y,
    knn_fit.predict_proba(X)[:, 1]
)
knn_auc
```

#### R

```{r}
library(class)

X_default <- Default %>%
  mutate(x_student = as.integer(student == "Yes")) %>%
  dplyr::select(x_student, balance, income)

# KNN prediction with K = 5
knn_pred <- knn(
  train = X_default,
  test = X_default,
  cl = Default$default,
  k = 5
)

# Confusion matrix
knn_cfm <- table(Predicted = knn_pred, Default = Default$default)

# Accuracy
(knn_cfm['Yes', 'Yes'] + knn_cfm['No', 'No']) / sum(knn_cfm)
```

:::

## Evaluation of classification performance: false positive, false negative, ROC, and AUC

- Let's summarize the classification performance of the different classifiers on the training data.
::: {.panel-tabset}

#### Python

```{python}
# Confusion matrix of the null classifier (always predicts 'No')
null_cfm = pd.DataFrame(
    data = {
        'No': [9667, 0, 9667],
        'Yes': [333, 0, 333],
        'All': [10000, 0, 10000]
    },
    index = ['No', 'Yes', 'All']
)
null_pred = np.repeat('No', 10000)

# Fitted classifiers
classifiers = [logit_fit, lda_fit, qda_fit, nb_fit, knn_fit]

# Confusion matrices
cfms = [logit_cfm, lda_cfm, qda_cfm, nb_cfm, knn_cfm, null_cfm]
```

```{python}
sm_df = pd.DataFrame(
    data = {
        'acc': [(cfm.loc['Yes', 'Yes'] + cfm.loc['No', 'No']) / cfm.loc['All', 'All'] for cfm in cfms],
        'fpr': [cfm.loc['Yes', 'No'] / cfm.loc['All', 'No'] for cfm in cfms],
        'fnr': [cfm.loc['No', 'Yes'] / cfm.loc['All', 'Yes'] for cfm in cfms],
        'auc': np.append(
            [roc_auc_score(y, c.predict_proba(X)[:, 1]) for c in classifiers],
            roc_auc_score(y, np.repeat(0, 10000))
        )
    },
    index = ['logit', 'LDA', 'QDA', 'NB', 'KNN', 'Null']
)
sm_df.sort_values('acc')
```

#### R

:::

- There are two types of classification errors:
    - **False positive rate**: the fraction of negative examples that are classified as positive.
    - **False negative rate**: the fraction of positive examples that are classified as negative.

![](./classification_measures.png)

- The table above shows the training classification performance of the classifiers at their default thresholds. Varying the threshold changes the false positive rate (1 − specificity) and the true positive rate (sensitivity). Plotting these pairs over all thresholds gives the **receiver operating characteristic (ROC)** curve, and the overall performance of a classifier is summarized by the **area under the ROC curve (AUC)**.
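The threshold-to-ROC construction can be sketched directly: each threshold yields one (FPR, TPR) point, and the AUC is the area under the resulting curve. A minimal example on synthetic scores (made-up data, not the `Default` set):

```{python}
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score, auc

rng = np.random.default_rng(3)
y_true = rng.integers(0, 2, size=200)
# Positives get higher scores on average, so the classifier is informative
scores = np.where(y_true == 1,
                  rng.normal(1.0, 1.0, size=200),
                  rng.normal(0.0, 1.0, size=200))

# Each threshold gives one (FPR, TPR) point on the ROC curve
fpr, tpr, thresholds = roc_curve(y_true, scores)

# AUC = area under the (FPR, TPR) curve; agrees with roc_auc_score
auc_val = auc(fpr, tpr)
print(auc_val, roc_auc_score(y_true, scores))
```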
```{python}
from sklearn.metrics import roc_curve
from sklearn.metrics import RocCurveDisplay

# plt.rcParams.update({'font.size': 12})

# ROC curves of the fitted classifiers
for idx, m in enumerate(classifiers):
    plt.figure()
    RocCurveDisplay.from_estimator(m, X, y, name = sm_df.iloc[idx].name)
    plt.show()

# ROC curve of the null classifier (always 'No' or always 'Yes')
plt.figure()
RocCurveDisplay.from_predictions(
    y,
    np.repeat(0, 10000),
    pos_label = 'Yes',
    name = 'Null Classifier'
)
plt.show()
```

- See [sklearn.metrics](https://scikit-learn.org/stable/modules/model_evaluation.html) for other popular metrics for classification tasks.

## Comparison between classifiers

- For a two-class problem, one can show that for LDA
$$
\log \left( \frac{p_1(x)}{1 - p_1(x)} \right) = \log \left( \frac{p_1(x)}{p_2(x)} \right) = c_0 + c_1 x_1 + \cdots + c_p x_p,
$$
so it has the same form as logistic regression. The difference is in how the parameters are estimated.

- Logistic regression uses the conditional likelihood based on $\operatorname{Pr}(Y \mid X)$ (known as **discriminative learning**). LDA, QDA, and naive Bayes use the full likelihood based on $\operatorname{Pr}(X, Y)$ (known as **generative learning**).

- Despite these differences, in practice the results are often very similar.

- Logistic regression can also fit quadratic boundaries like QDA, by explicitly including quadratic terms in the model.

- Logistic regression is very popular for classification, especially when $K = 2$.

- LDA is useful when $n$ is small or the classes are well separated and the Gaussian assumptions are reasonable; it also handles $K > 2$ naturally.

- Naive Bayes is useful when $p$ is very large.

- LDA is a special case of QDA (with $\boldsymbol{\Sigma}_k = \boldsymbol{\Sigma}$ for all $k$).

- Under the normal assumption with a common diagonal covariance across classes, naive Bayes yields a linear decision boundary and is thus a special case of LDA.

- The KNN classifier is nonparametric and can dominate LDA and logistic regression when the decision boundary is highly nonlinear, provided that $n$ is very large and $p$ is small.
- See ISL Section 4.5 for theoretical and empirical comparisons of logistic regression, LDA, QDA, NB, and KNN.
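The point that logistic regression can fit quadratic boundaries by adding quadratic terms can be sketched on synthetic data (all names here are made up for illustration): a circle-shaped class boundary is invisible to a linear logistic fit but easy once degree-2 features are included.

```{python}
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(2)
X_sim = rng.normal(size=(500, 2))
# Quadratic ground-truth boundary: inside vs. outside a circle
y_sim = (X_sim[:, 0] ** 2 + X_sim[:, 1] ** 2 > 1.5).astype(int)

# Plain logistic regression: a linear boundary cannot separate a circle
linear_fit = LogisticRegression(max_iter=1000).fit(X_sim, y_sim)

# Degree-2 features let the linear model in the expanded feature
# space express a quadratic boundary in the original space
quad_fit = make_pipeline(
    PolynomialFeatures(degree=2),
    LogisticRegression(max_iter=1000)
).fit(X_sim, y_sim)

print(linear_fit.score(X_sim, y_sim), quad_fit.score(X_sim, y_sim))
```

The quadratic fit's training accuracy is near 1, while the linear fit can do little better than the majority-class rate on this data.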