--- title: "Banachewicz, Massaron :: The Kaggle Book: Data Analysis and Machine Learning for Competitive Data Science" author: ["Konrod Banachewicz Luca Massaron Anthony Goldbloom"] lastmod: 2024-09-29T20:53:59-04:00 draft: false creator: "Emacs 29.4 (Org mode 9.8 + ox-hugo)" --- Book on Kaggle. I don't particularly care about competitions but am hoping to maybe find some decent general ML optimization tips. Thus, I'm skipping the first couple of beginning chapters. ## Competition Tasks and Metrics {#competition-tasks-and-metrics} Interestingly, there's a metadata you can mine in Kaggle, the book shows the most common evaluation metrics from 2015-2021 ### Metric Table {#metric-table} | algorithm | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | Total | |-----------------------------|------|------|------|------|------|------|------|-------| | AUC | 4 | 4 | 1 | 3 | 3 | 2 | 0 | 17 | | LogLoss | 2 | 2 | 5 | 2 | 3 | 2 | 0 | 16 | | MAP@{K} | 1 | 3 | 0 | 4 | 1 | 0 | 1 | 10 | | CategorizationAccuracy | 1 | 0 | 4 | 0 | 1 | 2 | 0 | 8 | | MulticlassLoss | 2 | 3 | 2 | 0 | 1 | 0 | 0 | 8 | | RMSLE | 2 | 1 | 3 | 1 | 1 | 0 | 0 | 8 | | QuadraticWeightedKappa | 3 | 0 | 0 | 1 | 2 | 1 | 0 | 7 | | MeanFScoreBeta | 1 | 0 | 1 | 2 | 1 | 2 | 0 | 7 | | MeanBestErrorAtK | 0 | 0 | 2 | 2 | 1 | 1 | 0 | 6 | | MCRMSLE | 0 | 0 | 1 | 0 | 0 | 5 | 0 | 6 | | MCAUC | 1 | 0 | 1 | 0 | 0 | 3 | 0 | 5 | | RMSE | 1 | 1 | 0 | 3 | 0 | 0 | 0 | 5 | | Dice | 0 | 1 | 1 | 0 | 2 | 1 | 0 | 5 | | GoogleGlobalAP | 0 | 0 | 1 | 2 | 1 | 1 | 0 | 5 | | MacroFScore | 0 | 0 | 0 | 1 | 0 | 2 | 1 | 4 | | Score | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 3 | | CRPS | 2 | 0 | 0 | 0 | 1 | 0 | 0 | 3 | | OpenImagesObjectDetectionAP | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 3 | | MeanFScore | 0 | 0 | 1 | 0 | 0 | 0 | 2 | 3 | | RSNAObjectDetectionAP | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 2 | ### Handling Never Before Seen Metrics {#handling-never-before-seen-metrics} There are tips from a grandmaster showing how to understand certain metrics - - - Additionally, tips are given on his workflow - Understand the problem statement - Curating the dataset for the problem statement - Deep dive into the data - Build a simple pipeline - Engineer features / hyperparameters, different models - Read discussions / Domain knowledge, tweak features? - Ensemble models - Deploy #### Regression {#regression} Estimation of a continuous value. **MSE** = 1/n \* SSE (The square of differences between your predictions and the 'real' values) **R^2** = SSE / SST (sum of squares total, variance of the response) R squared compares the squared errors of the model against the suqared errors from the simplest model possible, the average of the response. **RMSE**, square root of MSE - interesting due to MSE, large prediction errors penalized due to squaring, however, in the RMSE, the root effect diminishes it. Of course ,outliers still affect a lot, but just a thing to note. Discussed is using MSE, then square root and squaring the results. `TransformedTargetRegressor` in scikitlearn can help. **RMSLE**, adds a log error. MCRMSLE is another variant. what you care for this is the scale of your predictions with respect to the scale of the ground truth. Log transform to target before fitting it can help, reversing with an exponential function. **MAE** - mean absolute error, is, as it says, the absolute value of the difference between the prediction and the target. Slower convergence since you're optimizing for the median vs. mean (L1 vs. L2 norm) Most of the weird targets are just variations of these standard ones. 
#### Classification {#classification}

##### Binary {#binary}

**Accuracy** - simply # correct / total answers; can be misleading if classes are unbalanced, e.g. if one class is 99% of the data, you can simply predict that class and be 99% correct.

**Precision** - TP / (TP + FP) - correctness when predicting a positive.

**Recall (Sensitivity)** - TP / (TP + FN)

The precision/recall trade-off occurs because of the threshold for the prediction: if you increase the threshold, you get more precision but less recall.

Average precision computes the mean precision over all recall values from 0 to 1. Useful for object detection, but also on tabular data; it can prove valuable for fraud detection problems.

This is why most use the **F1 Score**, which is the harmonic mean of precision and recall. In some cases the **F-beta** score is used, which is the weighted harmonic mean of precision and recall.

**Log loss / ROC-AUC** - log loss is also known as cross-entropy in deep learning. The implication is that the objective is to estimate as correctly as possible the probability of an example belonging to the positive class. ROC-AUC is the "area under the curve" of the Receiver Operating Characteristic (ROC): the true positive rate plotted against the false positive rate (equivalent to one minus the true negative rate).

**Matthews Correlation Coefficient (MCC)** - interesting because it can be read as the positive and negative precision summed and offset by 1, then multiplied by a ratio: sqrt((TP + FP) \* (TN + FN) / ((TP + FN) \* (TN + FP))), i.e. predicted positives times predicted negatives over actual positives times actual negatives. This rewards high precision on both the positive and negative class, in proportion to the ground truth, scoring from -1 to 1. Works well even when classes are quite imbalanced.

##### Multi-Class {#multi-class}

Generally you can use the binary metrics and summarize them using some averaging strategy.

**Macro averaging** - compute the F1 score for each class, then average. Each class counts as much as the others, leading to equal penalization whichever class the model performs poorly on.

**Micro averaging** - sum all the contributions from each class to compute an aggregated F1 score.

**Weighting** - same as macro, but take a weighted mean, with the weights summing to 1. Useful for taking into account the frequency of the positive cases that are relevant to your problem.

##### Object {#object}

**Intersection over Union (IoU) / Jaccard Index** - generally two masks to compare, using 1 for the segmented area and 0 otherwise. The overlap between the prediction and the ground truth mask, divided by the area of their union.

**Dice Coefficient** - the area of overlap between prediction and ground truth, doubled, then divided by the sum of the prediction and ground truth areas. IoU tends to penalize the overall average more if a single prediction is badly wrong.

##### Recommendation / Multi-label Classification {#recommendation-multi-label-classification}

**Mean Average Precision at K (MAP@{K})** - this seems to be a complex metric: the mean of the average precision at K, with K as the cutoff. The average of P@k is computed over all values of k ranging from 1 to K, using the top prediction, then the top two, the top three, etc. up to K.

#### Optimizing Metrics {#optimizing-metrics}

Discusses optimizing for the metric, since a lot of out-of-the-box algorithms don't let you choose your evaluation function (or offer only a limited set) - and their default objective might not align with the competition's metric.
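Not the book's code - a minimal sketch of pointing scikit-learn's model selection at the metric you actually care about via `make_scorer`; the choice of MCC, the synthetic data, and the model are my own assumptions.

```python
# Minimal sketch: wrap a competition metric (here MCC) as a scikit-learn scorer
# so hyperparameter search optimizes that metric instead of default accuracy.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import matthews_corrcoef, make_scorer
from sklearn.model_selection import GridSearchCV

# Imbalanced toy data (90% / 10%) to mimic the situations discussed above.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

mcc_scorer = make_scorer(matthews_corrcoef, greater_is_better=True)

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid={"max_depth": [3, 5, None]},
    scoring=mcc_scorer,  # model selection driven by MCC
    cv=5,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Gradient boosting libraries similarly accept custom evaluation functions for early stopping, though the exact hook differs per library.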
##### Post-Processing {#post-processing}

This means that your predictions are transformed by means of a function into something else that evaluates better. An example was treating an (ordinal) classification problem as regression, using boundaries as thresholds, then optimizing to find a better set of boundaries.

When your predicted probabilities are misaligned with the training distribution of the target, there's a calibration wrapper in scikit-learn (`CalibratedClassifierCV`) that pipes your predictions through a post-processing function:

- Sigmoid (Platt scaling - basically a logistic regression)
- Isotonic regression (non-parametric regression - overfits if there are few examples)

## Good Validation {#good-validation}

Adaptive overfitting - basically what you get from using the test set over and over; there should be a final hold-out.

Bias vs. variance - a model that isn't complicated/expressive enough to capture the complexity of the problem yields biased predictions, vs. an overcomplicated model producing scattershot predictions because it records details and noise.

Good overfitting definition - "The process of learning elements of the training set that have no generalization."

### Strategies {#strategies}

The basic train-test split, with ~80/20 - there's a chance of extracting a non-representative sample. You can use stratification, which ensures that the proportions of certain features are respected in the sampled data.

Probabilistic approaches are (usually) more useful for competitions, though more computationally expensive - law of large numbers: repeated sampling reduces the error of the estimate. Examples are k-fold, subsampling, and bootstrapping.

#### k-fold cross-validation {#k-fold-cross-validation}

Fold the data k times, train and predict against each held-out fold, then average the scores. An important thing to note is that this measures generalizability, and the score shouldn't be compared directly against one from a simple train/test split.

k = n is leave-one-out, but the resulting models are highly correlated with each other, and the estimate is more representative of the dataset itself than of how the model would perform on unknown data.

Choosing k: the smaller it is, the more bias; the higher it is, the more correlated the folds become, and you lose out on the interesting property of estimating performance on unseen data. Commonly set to 5/7/10.

Stratified k-fold is an option when you need to preserve the distribution of a variable / the proportion of small classes (spam / fraud datasets, etc.). For multi-label data, scikit-multilearn provides IterativeStratification.

Interestingly, you can make good use of stratification in regression problems, helping your regressor fit during cross-validation, by creating a discrete proxy (binning) for the target. Sturges' rule, `np.floor(1 + np.log2(len(X_train)))`, is helpful for determining how many bins. Another option is cluster analysis on the features and then using the predicted clusters as strata (PCA -> k-means).

GroupKFold - for non-i.i.d. data with grouping among the examples.

For time-based splits, using a fixed lookback seems to be suggested (essentially windowing the training -and- validation sets), because otherwise an increasing time window just shows decreasing bias, which gets confused with model performance.

Nested cross-validation! - basically cross-validating inside cross-validation (an inner loop for tuning, an outer one for evaluation).

#### Subsampling & Bootstrap {#subsampling-and-bootstrap}

Subsampling is similar to k-fold, but you choose the number of samples yourself; there are no fixed folds.

Devised to derive the error distribution of an estimate, the bootstrap draws a sample, with replacement, that is the same size as the available data. Interesting discussion of the 632 method, but it seems intractable for machine learning.
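Not from the book verbatim - a minimal sketch of using the bootstrap to estimate the error distribution of an evaluation metric, with a placeholder dataset and model.

```python
# Minimal sketch: bootstrap the evaluation metric by resampling (with
# replacement) a set of held-out predictions to gauge its standard error.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, weights=[0.9, 0.1], random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)

probs = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_va)[:, 1]

rng = np.random.default_rng(0)
scores = []
for _ in range(1000):
    # draw a bootstrap sample of the validation rows (same size, with replacement)
    idx = rng.integers(0, len(y_va), len(y_va))
    if len(np.unique(y_va[idx])) < 2:  # AUC needs both classes present
        continue
    scores.append(roc_auc_score(y_va[idx], probs[idx]))

print(f"AUC {np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```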
While not often used, bootstrapping is mentioned as an alternative to cross-validation when, due to outliers or heterogeneous examples, the evaluation metric has a large standard error in CV. Also worth looking into: the linked discussion on how to identify the target when the context is similar across images.

### Adversarial Validation {#adversarial-validation}

This is an interesting technique meant for competitions, -however- it can also be used to detect model/data drift. It essentially involves training a classifier to distinguish the train and test datasets and looking at the ROC-AUC to see how similar they are, with ~0.50 indicating they're indistinguishable. There are a couple of strategies if they're uneven, like removing the variables that give the split away, or subsampling the training data to mimic the test set.

### Leakage (Data) {#leakage--data}

Another competition-specific thing, but worth noting: basically metadata, ordering, etc. that might give away something about the target.

## Tabular {#tabular}

The usefulness of synthetic data is discussed.

### EDA {#eda}

Auto-profiling solutions are discussed: AutoViz, Pandas Profiling, Sweetviz.

UMAP and t-SNE - the linked resources appear helpful. It's noted that they're generally more revealing than classical methods based on variance restructuring like PCA / SVD.

Reducing the memory size of pandas DataFrames (or just use something else, I suppose) - see the downcasting sketch below.

### Feature Engineering {#feature-engineering}
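Picking up the memory-reduction note from the EDA section: this is not the book's exact helper, just a minimal sketch of the downcasting approach commonly shared on Kaggle (the name `reduce_mem_usage` and the implementation details are my own).

```python
# Minimal sketch: downcast numeric columns of a DataFrame to the smallest
# dtype that still holds their values, and report the memory saved.
import numpy as np
import pandas as pd

def reduce_mem_usage(df: pd.DataFrame) -> pd.DataFrame:
    start_mb = df.memory_usage(deep=True).sum() / 1024 ** 2
    for col in df.columns:
        if pd.api.types.is_integer_dtype(df[col]):
            df[col] = pd.to_numeric(df[col], downcast="integer")
        elif pd.api.types.is_float_dtype(df[col]):
            df[col] = pd.to_numeric(df[col], downcast="float")
    end_mb = df.memory_usage(deep=True).sum() / 1024 ** 2
    print(f"{start_mb:.2f} MB -> {end_mb:.2f} MB")
    return df

# Usage on a toy frame
df = pd.DataFrame({"a": np.arange(100_000, dtype="int64"),
                   "b": np.random.rand(100_000)})
df = reduce_mem_usage(df)
```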