{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Amazon SageMaker Autopilot Data Exploration Report\n", "\n", "This report contains insights about the dataset you provided as input to the AutoML job.\n", "This data report was generated by **automl-dm-1675608463** AutoML job.\n", "To check for any issues with your data and possible improvements that can be made to it,\n", "consult the sections below for guidance.\n", "You can use information about the predictive power of each feature in the **Data Sample** section and\n", "from the correlation matrix in the **Cross Column Statistics** section to help select a subset of the data\n", "that is most significant for making predictions.\n", "\n", "**Note**: SageMaker Autopilot data reports are subject to change and updates.\n", "It is not recommended to parse the report using automated tools, as they may be impacted by such changes.\n", "\n", "## Dataset Summary\n", "**Note**: due to timeouts while processing your data, it's possible that some issues with your data were not detected.\n", "See further details in the sections below. To avoid such timeouts, try reducing the number of columns in the dataset.\n", "\n", "\n", "\n", "
\n", "Dataset Properties\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
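| Rows | Columns | Duplicate rows | Target column | Missing target values | Invalid target values | Detected problem type |\n",
"|------|---------|----------------|---------------|-----------------------|-----------------------|-----------------------|\n",
"| 7110 | 2 | 0.03% | sentiment | 0.00% | 0.00% | MulticlassClassification |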
\n", "
\n", "Detected Column Types\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
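| | Numeric | Categorical | Text | Datetime | Sequence |\n",
"|---|---------|-------------|------|----------|----------|\n",
"| **Column Count** | 0 | 0 | 1 | 0 | 0 |\n",
"| **Percentage** | 0.00% | 0.00% | 100.00% | 0.00% | 0.00% |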
\n", "
\n", "\n", "---\n", "\n", "## Report Contents\n", "\n", "1. [Target Analysis](#Target-Analysis)
\n", "1. [Data Sample](#Data-Sample)
\n", "1. [Feature Summary](#Feature-Summary)
\n", "1. [Duplicate Rows](#Duplicate-Rows)
\n", "1. [Cross Column Statistics](#Cross-Column-Statistics)
\n", "1. [Anomalous Rows](#Anomalous-Rows)
\n", "1. [Missing Values](#Missing-Values)
\n", "1. [Cardinality](#Cardinality)
\n", "1. [Descriptive Stats](#Descriptive-Stats)
\n", "1. [Definitions](#Definitions)
\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Target Analysis\n", "\n", "\n", "\n", "The column **sentiment** is used as the target column.\n", "See the distribution of values (labels) in the target column below:\n", "\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
| Number of Classes | Invalid Percentage | Missing Percentage |\n",
"|-------------------|--------------------|--------------------|\n",
"| 3 | 0.00% | 0.00% |
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
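| Target Label | Frequency Percentage | Label Count |\n",
"|--------------|----------------------|-------------|\n",
"| -1 | 33.33% | 2370 |\n",
"| 0 | 33.33% | 2370 |\n",
"| 1 | 33.33% | 2370 |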
\n", "
\n", "\n", "
\n", " \n", "
Histogram of the target column labels.
\n", "
\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Sample\n", "\n", "The following table contains a random sample of **10** rows from the dataset.\n", "The top two rows provide the type and prediction power of each column.\n", "Verify the input headers correctly align with the columns of the dataset sample.\n", "If they are incorrect, update the header names of your input dataset in Amazon Simple Storage Service (Amazon S3).\n", "\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
| | sentiment | review_body |\n",
"|---|-----------|-------------|\n",
"| **Prediction Power** | - | 0.345992 |\n",
"| **Column Types** | - | text |\n",
"| **0** | 0 | The color and embellishment of this top are beautiful. this would look great on anyone super lean otherwise you just look really wide as there is no frame in the torso of this top at all. the back picture shows it well. i am 5'9 142 lbs and i ordered the small and medium and both were huge! with regrets going back. |\n",
"| **1** | 1 | The model does not look as good in this shirt as it looks in real life. it is flowy flattering and elegant not to mention warm and soft. i hope it holds up! |\n",
"| **2** | 1 | Love the clean lines the texture and the slight sheen on this skirt which does run a bit large in the waist. the online photos are very true to what the fabric looks like. it is a stiffer but lightweight fabric so the skirt stands away from the body. the quality of the finish is excellent and i really appreciate the hidden zipper and pockets. . fit: runs large in the waist. the size 2 i have measures 27\" on the top inside of the waistband and i think the model is wearing a size too loose... |\n",
"| **3** | 1 | I was so excited to see these pants on sale. they fit beautifully and drape well. the hidden lining of 'shorts' is a great feature and i will get 3 seasons of use from these. now if i could only get them in plain black too .... |\n",
"| **4** | -1 | First of all it arrived rolled in a tight ball the size of a large starbucks coffee. i don't know why they do that.... you're not going to be immediately impressed by something that is packaged like it's cheap! second i'm usually a 0-4 in retailer and mine is no where near as loose-fitting as the photo. also he sleeves and back are a very thin waffle knit... like a flimsy t-shirt. and see-thru. the sleeves form-fit to my arms so i couldn't layer under it but i couldn't wear it alone with ... |\n",
"| **5** | -1 | To be clear i am not built like your models but the biggest problem was the way in which it tented out from the bustline down |\n",
"| **6** | 1 | I like these pants. they fit nice but they are sheer which is not a problem for me. my only issue was that the print is not symmetrical as shown in the picture. the print is what drew me to these pants. the pattern is all over the place in the ones i received. they are nice so i will keep them at the sale price but i don't think they would be worth it at full price. |\n",
"| **7** | 0 | Not sure about this yet. reordered larger size because it hits much higher than shown. belly peeked out from under it! ordered 6 and i'm normally a small/6. like the design but really short. |
\n", "
\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Feature Summary\n", "\n", "\n", "Prediction power is measured by stratified splitting the data into 80%/20% training and validation folds. We fit a\n", "model for each feature separately on the training fold after applying minimal feature pre-processing and measure\n", "prediction performance on the validation data. Higher prediction power scores, toward 1, indicate columns that are\n", "more useful for predicting the target on their own. Lower scores, toward 0 point to columns that contain little useful\n", "information for predicting the target on their own.\n", "
\n", "\n", "
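
The stratified 80%/20% split described above can be sketched in plain Python. This is only an illustration of the splitting step: the function name `stratified_split`, the `val_fraction` and `seed` parameters, and the per-class rounding are assumptions, and the per-feature model and scoring that Autopilot applies afterwards are not reproduced here.

```python
import random
from collections import defaultdict

def stratified_split(labels, val_fraction=0.2, seed=0):
    '''Return (train_indices, val_indices) with similar label proportions in both folds.'''
    # Group row indices by their target label.
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    rng = random.Random(seed)
    train, val = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        # Assumption: each class contributes round(20%) of its rows to validation.
        n_val = round(len(indices) * val_fraction)
        val.extend(indices[:n_val])
        train.extend(indices[n_val:])
    return sorted(train), sorted(val)

# Three balanced classes, mirroring the -1 / 0 / 1 labels of the sentiment target.
labels = [-1, 0, 1] * 10
train_idx, val_idx = stratified_split(labels)
print(len(train_idx), len(val_idx))  # 24 6
```

Splitting per class keeps the label distribution of the validation fold close to that of the training fold, which matters when a score is computed separately for each feature.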
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Duplicate Rows\n", "\n", "\n", "
⚠️ Low severity insight: “Duplicate rows”
\n", "0.03% of the rows were found to be duplicates when testing a random sample of 7110 rows from the dataset. Some data sources could include valid duplicates, but in some cases these duplicates could point to problems in data collection. Unintended duplicate rows could disrupt the automatic hyperparameter tuning of Amazon SageMaker Autopilot and result in a sub-par model. They should therefore be removed for more accurate results. This preprocessing can be done with Amazon SageMaker Data Wrangler using the “Drop duplicates” transform under “Manage rows”.\n", "
\n", "\n", "A sample of duplicate rows is presented below.\n", "The number of occurrences of a row is given in the leftmost **Duplicate count** column.\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
| Duplicate count | review_body | sentiment |\n",
"|-----------------|-------------|-----------|\n",
"| 2 | I purchased this and another eva franco dress during retailer's recent 20% off sale. i was looking for dresses that were work appropriate but that would also transition well to happy hour or date night. they both seemed to be just what i was looking for. i ordered a 4 regular and a 6 regular as i am usually in between sizes. the 4 was definitely too small. the 6 fit technically but was very ill fitting. not only is the dress itself short but it is very short-waisted. i am only 5'3\" but... | -1 |
\n", "
\n", "
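
As a rough local check before re-running the job, duplicate rows can be counted and dropped with the standard library. This is a minimal sketch, not the method Autopilot or Data Wrangler uses: the function name `duplicate_summary` is hypothetical, rows are assumed to be hashable tuples of `(review_body, sentiment)`, and the percentage is assumed to count every row that belongs to a duplicated group.

```python
from collections import Counter

def duplicate_summary(rows):
    '''Return (duplicate counts, percent of rows in duplicated groups, deduplicated rows).'''
    counts = Counter(rows)
    # Keep only rows that occur more than once, with their occurrence count.
    duplicated = {row: n for row, n in counts.items() if n > 1}
    # Assumption: every member of a duplicated group counts toward the percentage.
    pct = 100.0 * sum(duplicated.values()) / len(rows) if rows else 0.0
    # dict.fromkeys preserves first-seen order while dropping repeats.
    deduplicated = list(dict.fromkeys(rows))
    return duplicated, pct, deduplicated

rows = [('nice dress', 1), ('too small', -1), ('nice dress', 1), ('it is ok', 0)]
duplicated, pct, deduplicated = duplicate_summary(rows)
print(duplicated)         # {('nice dress', 1): 2}
print(pct)                # 50.0
print(len(deduplicated))  # 3
```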
\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cross Column Statistics\n", "\n", "Amazon SageMaker Autopilot calculates Pearson’s correlation between columns in your dataset.\n", "Removing highly correlated columns can reduce overfitting and training time.\n", "Pearson’s correlation is in the range [-1, 1] where 0 implies no correlation, 1 implies perfect correlation,\n", "and -1 implies perfect inverse correlation.\n", "\n", "No correlation scores were calculated because the calculation took too long to complete.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Anomalous Rows\n", "\n", "Anomalous rows are detected using the Isolation forest algorithm on a sample of\n", "**7110**\n", "randomly chosen\n", "rows after basic preprocessing. The isolation forest algorithm associates an anomaly score to each row of the dataset\n", "it is trained on. Rows with negative anomaly scores are usually considered anomalous and rows with positive anomaly\n", "scores are considered non-anomalous. When investigating an anomalous row, look for any unusual values -\n", "in particular any that might have resulted from errors in the gathering and processing of data.\n", "Deciphering whether a row is indeed anomalous, contains errors, or is in fact valid requires domain knowledge and\n", "application of business logic.\n", "\n", "There were no anomalous rows found in the dataset.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Missing Values\n", "Within the data sample, the following columns contained missing values, such as: `nan`, white spaces, or empty fields.\n", "\n", "SageMaker Autopilot will attempt to fill in missing values using various techniques. 
For example,\n", "missing values can be replaced with a new 'unknown' category for `Categorical` features\n", "and missing `Numerical` values can be replaced with the **mean** or **median** of the column.\n", "\n", "We found that **0 of the 2** columns contained missing values.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Cardinality\n", "For `String` features, it is important to count the number of unique values to determine whether to treat a feature as `Categorical` or `Text`\n", "and then process the feature according to its type.\n", "\n", "For example, SageMaker Autopilot counts the number of unique entries and the number of unique words.\n", "The following string column would have **3** total entries, **2** unique entries, and **3** unique words.\n", "\n", "| | String Column |\n", "|-------|-------------------|\n", "| **0** | \"red blue\" |\n", "| **1** | \"red blue\" |\n", "| **2** | \"red blue yellow\" |\n", "\n", "If the feature is `Categorical`, SageMaker Autopilot can look at the total number of unique entries and transform it using techniques such as one-hot encoding.\n", "If the field contains a `Text` string, we look at the number of unique words, or the vocabulary size, in the string.\n", "We can then use the unique words to compute text-based features, such as Term Frequency-Inverse Document Frequency (tf-idf).\n", "\n", "**Note:** If the number of unique values is too high, we risk data transformations expanding the dataset to too many features.\n", "In that case, SageMaker Autopilot will attempt to reduce the dimensionality of the post-processed data,\n", "such as by capping the number of vocabulary words for tf-idf, applying Principal Component Analysis (PCA), or using other dimensionality reduction techniques.\n", "\n", "The table below shows **2 of the 2** columns ranked by the number of unique entries.\n", "\n", "
💡 Suggested Action Items\n", "\n", "- Verify that the number of unique values of a feature is as expected.\n", " One explanation for an unexpected number of unique values could be multiple encodings of a value.\n", " For example, `US` and `U.S.` will count as two different words.\n", " You could correct the error at the data source or pre-process your dataset in your S3 bucket.\n", "- If the number of unique values seems too high for Categorical variables,\n", " investigate whether multiple unique values can be grouped into a smaller set of possible values.\n", "
\n", "\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
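| | Number of Unique Entries | Number of Unique Words (if Text) |\n",
"|---|--------------------------|----------------------------------|\n",
"| **sentiment** | 3 | n/a |\n",
"| **review_body** | 7108 | 17986 |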
\n", "
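
The counts in the `red blue` example from this section can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the helper name `cardinality_stats` is hypothetical, and words are split on whitespace, which may differ from the tokenization Autopilot applies.

```python
def cardinality_stats(values):
    '''Return (total entries, unique entries, unique words) for a list of strings.'''
    unique_entries = set(values)
    unique_words = set()
    for value in values:
        # Assumption: a word is any whitespace-separated token.
        unique_words.update(value.split())
    return len(values), len(unique_entries), len(unique_words)

column = ['red blue', 'red blue', 'red blue yellow']
print(cardinality_stats(column))  # (3, 2, 3)
```

Note that under this tokenization `US` and `U.S.` are different tokens, which is the kind of inflated word count the suggested action items warn about.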
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Descriptive Stats\n", "For each of the input features that has at least one numeric value, several descriptive statistics are computed from the data sample.\n", "\n", "SageMaker Autopilot may treat numerical features as `Categorical` if the number of unique entries is sufficiently low.\n", "For `Numerical` features, we may apply numerical transformations such as normalization, log and quantile transforms,\n", "and binning to manage outlier values and difference in feature scales.\n", "\n", "We found **1 of the 2** columns contained at least one numerical value.\n", "The table below shows the **1** columns which have the largest percentage of numerical values.\n", "Percentage of outliers is calculated only for columns which Autopilot detected to be of numeric type. Percentage of outliers is\n", "not calculated for the target column.\n", "\n", "
💡 Suggested Action Items\n", "\n", "- Investigate the origin of the data field. Are some values non-finite (e.g. infinity, nan)?\n", " Are they missing or is it an error in data input?\n", "- Missing and extreme values may indicate a bug in the data collection process.\n", " Verify that the numerical descriptions align with expectations.\n", " For example, use domain knowledge to check that the range of values for a feature meets expectations.\n", "
\n", "\n", "\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
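| | % of Numerical Values | Mean | Median | Min | Max |\n",
"|---|-----------------------|------|--------|-----|-----|\n",
"| **sentiment** | 100.0% | 0.0 | 0.0 | -1.0 | 1.0 |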
\n", "
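
The Definitions section describes outliers in terms of the median and a robust standard deviation (RSTD) computed on values clipped to the 5th and 95th percentiles. A minimal sketch of that rule follows. The function name `robust_outliers`, the inclusive percentile method, and the use of the population standard deviation are assumptions, since the report does not specify them.

```python
import statistics

def robust_outliers(values, n_rstd=5.0):
    '''Return the values lying outside median +/- n_rstd * RSTD.'''
    # With n=20 the first and last cut points are the 5th and 95th percentiles.
    cuts = statistics.quantiles(values, n=20, method='inclusive')
    low, high = cuts[0], cuts[-1]
    # Clip every value into [5th percentile, 95th percentile].
    clipped = [min(max(v, low), high) for v in values]
    # Assumption: population standard deviation of the clipped values.
    rstd = statistics.pstdev(clipped)
    median = statistics.median(values)
    lower, upper = median - n_rstd * rstd, median + n_rstd * rstd
    return [v for v in values if v < lower or v > upper]

data = list(range(1, 101)) + [10000]
print(robust_outliers(data))  # [10000]
```

Because the standard deviation is taken after clipping, a single extreme value such as `10000` barely widens the bounds and is therefore still flagged.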
\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Definitions\n", "\n", "### Feature types\n", "\n", "**Numeric:** Numeric values, either floats or integers. For example: age, income. When training a machine learning model, it is assumed that numeric values are ordered and a distance is defined between them. For example, 3 is closer to 4 than to 10 and 3 < 4 < 10.\n", "\n", "**Categorical:** The column entries belong to a set of unique values that is usually much smaller than the number of rows in the dataset. For example, a column from a dataset with 100 rows with the unique values \"Dog\", \"Cat\" and \"Mouse\". The values could be numeric, textual, or a combination of both. For example, \"Horse\", \"House\", 8, \"Love\" and 3.1 are\n", "all valid values and can be found in the same categorical column. When manipulating a column of categorical values, a machine learning model does not assume that they are ordered or that a distance function is defined on them, even if all of the values are numbers.\n", "\n", "**Binary:** A special case of a categorical column for which the cardinality of the set of unique values is 2.\n", "\n", "**Text:** A text column that contains many non-numeric unique values, often human-readable text. In extreme cases, all the elements of the column are unique, so no two entries are the same.\n", "\n", "**Datetime:** This column contains date and/or time information.\n", "\n", "### Feature statistics\n", "\n", "**Prediction power:** Prediction power of a column (feature) is a measure of how useful it is for predicting the target variable. It is measured using a stratified split into 80%/20% training and validation folds. We fit a model for each feature separately on the training fold after applying minimal feature pre-processing and measure prediction performance on the validation data. The scores are normalized to the range [0,1]. 
A higher prediction power score near 1 indicates that a column is more useful for predicting the target on its own. A lower score near 0 indicates that a column contains little useful information for predicting the target on its own. Although it is possible that a column that is uninformative on its own can be useful in predicting the target when used in tandem with other features, a low score usually indicates the feature is redundant. A score of 1 implies perfect predictive abilities, which often indicates an error called target leakage. The cause is typically a column present in the dataset that is hard or impossible to obtain at prediction time, such as a duplicate of the target.\n", "\n", "**Outliers:** Outliers are detected using two statistics that are robust to outliers: median and robust standard deviation (RSTD). RSTD is derived by clipping the feature values to the range [5 percentile, 95 percentile] and calculating the standard deviation of the clipped vector. All values larger than median + 5 * RSTD or smaller than median - 5 * RSTD are considered to be outliers.\n", "\n", "**Skew:** Skew measures the symmetry of the distribution and is defined as the third moment of the distribution divided by the third power of the standard deviation. The skewness of the normal distribution or any other symmetric distribution is zero. Positive values imply that the right tail of the distribution is longer than the left tail. Negative values imply that the left tail of the distribution is longer than the right tail. As a rule of thumb, a distribution is considered skewed when the absolute value of the skew is larger than 3.\n", "\n", "**Kurtosis:** Pearson's kurtosis measures the heaviness of the tail of the distribution and is defined as the fourth moment of the distribution divided by the fourth power of the standard deviation. The kurtosis of the normal distribution is 3. 
Thus, kurtosis values lower than 3 imply that the distribution is more concentrated around the mean and the tails are lighter than the tails of the normal distribution. Kurtosis values higher than 3 imply heavier tails than the normal distribution or that the data contains outliers.\n", "\n", "**Missing Values:** Empty strings and strings composed of only white spaces are considered missing.\n", "\n", "**Valid values:**\n", "\n", "* **Numeric features / regression target:** All values that can be cast to finite floats are valid. Missing values are not valid.\n", "* **Categorical / binary / text features / classification target:** All values that are not missing are valid.\n", "* **Datetime features:** All values that can be cast to a datetime object are valid. Missing values are not valid.\n", "\n", "**Invalid values:** Values that are either missing or that could not be cast to the desired type. See the definition of valid values for more information." ] } ], "metadata": { "kernelspec": { "display_name": "conda_python3", "language": "python", "name": "conda_python3" } }, "nbformat": 4, "nbformat_minor": 2 }