default / train
Overview
Descriptive Statistics
| column | type | count | missing | missing_% | unique | mean | median | std | se | cv | mad | min | max | range | p5 | q1 | q3 | p95 | iqr | skewness | kurtosis | top | freq |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| observation.state | text | 187507 | 0 | 0.0000 | 187507 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
| action | text | 187507 | 0 | 0.0000 | 187507 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
| timestamp | numeric | 187507 | 0 | 0.0000 | 352 | 6.3121 | 5.3000 | 4.8659 | 0.0112 | 0.7709 | 3.3000 | 0.0000 | 35.1000 | 35.1000 | 0.4000 | 2.4000 | 9.3000 | 15.0000 | 6.9000 | 1.0530 | 1.4028 | nan | nan |
| episode_index | numeric | 187507 | 0 | 0.0000 | 1995 | 1004.9688 | 998.0000 | 573.2126 | 1.3238 | 0.5704 | 500.0000 | 0.0000 | 1994.0000 | 1994.0000 | 108.0000 | 501.0000 | 1503.0000 | 1890.0000 | 1002.0000 | -0.0093 | -1.2072 | nan | nan |
| frame_index | numeric | 187507 | 0 | 0.0000 | 352 | 63.1209 | 53.0000 | 48.6586 | 0.1124 | 0.7709 | 33.0000 | 0.0000 | 351.0000 | 351.0000 | 4.0000 | 24.0000 | 93.0000 | 150.0000 | 69.0000 | 1.0530 | 1.4028 | nan | nan |
| next.reward | boolean | 187507 | 0 | 0.0000 | 1 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 0.0 | 187507.0000 |
| next.done | boolean | 187507 | 0 | 0.0000 | 2 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | False | 185512.0000 |
| index | numeric | 187507 | 0 | 0.0000 | 187507 | 93753.0000 | 93753.0000 | 54128.7528 | 125.0027 | 0.5774 | 46877.0000 | 0.0000 | 187506.0000 | 187506.0000 | 9375.3000 | 46876.5000 | 140629.5000 | 178130.7000 | 93753.0000 | -0.0000 | -1.2000 | nan | nan |
| task_index | categorical | 187507 | 0 | 0.0000 | 3 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 2 | 67436.0000 |
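
The `se` and `cv` columns can be recomputed from the other summary values. The sketch below assumes the usual definitions (standard error = std / sqrt(n), coefficient of variation = std / mean); the report does not state its formulas, and the function names are illustrative only.

```python
import math

def standard_error(std: float, n: int) -> float:
    """Standard error of the mean, assuming se = std / sqrt(n)."""
    return std / math.sqrt(n)

def coefficient_of_variation(std: float, mean: float) -> float:
    """Coefficient of variation, assuming cv = std / mean."""
    if mean == 0:
        raise ValueError("cv is undefined for a zero mean")
    return std / mean

# Summary values copied from the timestamp row of the table above.
se_timestamp = standard_error(4.8659, 187507)
cv_timestamp = coefficient_of_variation(4.8659, 6.3121)
print(round(se_timestamp, 4), round(cv_timestamp, 4))
```

Under these assumed definitions, the rounded results agree with the `se` and `cv` entries reported for timestamp.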
Distribution Histograms
Boxplots
Distribution Analysis
Normality Tests & Shape
| column | n | skewness | skew_type | kurtosis | kurt_type | normality_test | normality_p | is_normal_0.05 | shapiro_p | dagostino_p | ks_p | anderson_stat | anderson_5pct_cv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| timestamp | 187507 | 1.0530 | high skew | 1.4028 | leptokurtic | dagostino | 0.0000 | False | NaN | 0.0000 | 0.0000 | 3223.8222 | 0.7520 |
| episode_index | 187507 | -0.0093 | symmetric | -1.2072 | platykurtic | dagostino | 0.0000 | False | NaN | 0.0000 | 0.0000 | 2148.6268 | 0.7520 |
| frame_index | 187507 | 1.0530 | high skew | 1.4028 | leptokurtic | dagostino | 0.0000 | False | NaN | 0.0000 | 0.0000 | 3223.8205 | 0.7520 |
| index | 187507 | -0.0000 | symmetric | -1.2000 | platykurtic | dagostino | 0.0000 | False | NaN | 0.0000 | 0.0000 | 2084.9274 | 0.7520 |
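
The skewness and kurtosis columns can be illustrated with moment-based estimators. This is a minimal sketch assuming population (biased) moments and excess kurtosis (normal = 0); the report may apply a bias correction, so small differences are possible. A consecutive integer range stands in for the index column.

```python
def skew_and_excess_kurtosis(values):
    """Return (skewness, excess kurtosis) from population central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    if m2 == 0:
        raise ValueError("moments are undefined for a constant column")
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0

# A run of consecutive integers behaves like the index column:
# symmetric (skewness 0) and flat (excess kurtosis close to -1.2).
skew, kurt = skew_and_excess_kurtosis(list(range(1000)))
print(skew, kurt)
```

A uniform-like column is therefore expected to be labelled symmetric and platykurtic, as index and episode_index are above.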
Violin Plots
Q-Q Plots
Correlation Analysis
Correlation Heatmap (Pearson)
Correlation Heatmap (Spearman)
Variance Inflation Factor (VIF)
| column | VIF | multicollinearity |
|---|---|---|
| episode_index | 266.2900 | severe |
| index | -0.0000 | invalid (VIF < 1) |
| frame_index | -57054521769846057730048.0000 | invalid (VIF < 1) |
| timestamp | -57054521769871039004672.0000 | invalid (VIF < 1) |
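
A VIF is defined as 1 / (1 - R²), where R² comes from regressing one column on the others, so a valid value is never below 1. The sketch below is a hypothetical helper, not the report's implementation; it shows how perfect collinearity (R² = 1) can be reported as infinite rather than passed through a division that yields huge or negative numbers.

```python
import math

def vif_from_r_squared(r_squared: float, tol: float = 1e-12) -> float:
    """VIF = 1 / (1 - R^2); returns inf when the fit is (numerically) perfect."""
    if not 0.0 <= r_squared <= 1.0:
        raise ValueError("R^2 must lie in [0, 1]")
    if 1.0 - r_squared <= tol:
        return math.inf  # perfect collinearity: the VIF is unbounded
    return 1.0 / (1.0 - r_squared)

print(vif_from_r_squared(0.0))   # no collinearity
print(vif_from_r_squared(0.9))   # the common "VIF > 10" threshold sits near here
print(vif_from_r_squared(1.0))   # perfectly collinear column
```

With timestamp and frame_index correlated at r=1.0, an explicit guard of this kind is one way to avoid the unstable values shown in the table.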
Missing Data
| column | missing_count | missing_ratio | missing_% | dtype |
|---|---|---|---|---|
| observation.state | 0 | 0.0000 | 0.0000 | object |
| action | 0 | 0.0000 | 0.0000 | object |
| timestamp | 0 | 0.0000 | 0.0000 | float32 |
| episode_index | 0 | 0.0000 | 0.0000 | int64 |
| frame_index | 0 | 0.0000 | 0.0000 | int64 |
| next.reward | 0 | 0.0000 | 0.0000 | float32 |
| next.done | 0 | 0.0000 | 0.0000 | bool |
| index | 0 | 0.0000 | 0.0000 | int64 |
| task_index | 0 | 0.0000 | 0.0000 | int64 |
Missing Data Matrix
Outlier Detection
| column | q1 | q3 | iqr | lower_bound | upper_bound | outlier_count | outlier_% | min_outlier | max_outlier |
|---|---|---|---|---|---|---|---|---|---|
| timestamp | 2.4000 | 9.3000 | 6.9000 | -7.9500 | 19.6500 | 2712.0000 | 1.4500 | 19.7000 | 35.1000 |
| episode_index | 501.0000 | 1503.0000 | 1002.0000 | -1002.0000 | 3006.0000 | 0.0000 | 0.0000 | nan | nan |
| frame_index | 24.0000 | 93.0000 | 69.0000 | -79.5000 | 196.5000 | 2712.0000 | 1.4500 | 197.0000 | 351.0000 |
| index | 46876.5000 | 140629.5000 | 93753.0000 | -93753.0000 | 281259.0000 | 0.0000 | 0.0000 | nan | nan |
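
The lower and upper bounds in this table are consistent with the standard 1.5 × IQR rule. A minimal sketch, assuming that rule (the multiplier `k` is a parameter of this illustration, not a documented setting of the report):

```python
def iqr_fences(q1: float, q3: float, k: float = 1.5):
    """Return (lower, upper) outlier fences: q1 - k*iqr and q3 + k*iqr."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles copied from the timestamp row above.
lower, upper = iqr_fences(2.4, 9.3)
print(lower, upper)

def count_outliers(values, q1, q3, k=1.5):
    """Count values strictly outside the fences."""
    lo, hi = iqr_fences(q1, q3, k)
    return sum(1 for v in values if v < lo or v > hi)
```

Applied to the timestamp quartiles, the fences reproduce the tabulated bounds up to floating-point rounding.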
Categorical Analysis
Summary
| column | count | unique | top_value | top_frequency | top_% | entropy | norm_entropy |
|---|---|---|---|---|---|---|---|
| task_index | 187507 | 3 | 2 | 67436 | 35.9600 | 1.5806 | 0.9973 |
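
The `norm_entropy` value appears to be the Shannon entropy in bits divided by its maximum, log2(number of categories). The sketch below assumes that definition; the example counts are hypothetical, since the report lists only the top category's frequency.

```python
import math

def shannon_entropy_bits(counts):
    """Shannon entropy (base 2) of a list of category counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def normalized_entropy(counts):
    """Entropy divided by log2(k), where k is the number of non-empty categories."""
    k = sum(1 for c in counts if c > 0)
    if k <= 1:
        return 0.0
    return shannon_entropy_bits(counts) / math.log2(k)

# Check against the table: 1.5806 bits over 3 categories.
ratio_from_table = 1.5806 / math.log2(3)

# Hypothetical, perfectly balanced counts give a normalized entropy of 1.
balanced = normalized_entropy([100, 100, 100])
print(round(ratio_from_table, 4), balanced)
```

A value close to 1 means the three task_index categories are nearly balanced.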
Feature Importance
| column | variance | std | cv | range |
|---|---|---|---|---|
| index | 2929921879.6667 | 54128.7528 | 0.5774 | 187506.0000 |
| episode_index | 328572.6540 | 573.2126 | 0.5704 | 1994.0000 |
| frame_index | 2367.6604 | 48.6586 | 0.7709 | 351.0000 |
| timestamp | 23.6766 | 4.8659 | 0.7709 | 35.1000 |
PCA Analysis
Variance Explained
| component | variance_ratio | cumulative_ratio | eigenvalue |
|---|---|---|---|
| PC1 | 0.5058 | 0.5058 | 2.0232 |
| PC2 | 0.4942 | 1.0000 | 1.9768 |
| PC3 | 0.0000 | 1.0000 | 0.0001 |
| PC4 | 0.0000 | 1.0000 | 0.0000 |
Loadings
| | PC1 | PC2 | PC3 | PC4 |
|---|---|---|---|---|
| timestamp | 0.5003 | -0.4997 | -0.0001 | 0.7071 |
| episode_index | 0.4996 | 0.5004 | -0.7071 | 0.0000 |
| frame_index | 0.5003 | -0.4997 | -0.0001 | -0.7071 |
| index | 0.4998 | 0.5002 | 0.7071 | -0.0000 |
PCA Scree Plot
PCA Loadings
Warnings
- High correlation: timestamp <-> frame_index (r=1.0)
- High correlation: episode_index <-> index (r=0.9999)
Auto-Generated Insights
Executive Summary
Dataset contains 187,507 rows and 9 columns (4 numeric, 1 categorical, 2 boolean, 2 text). 4 high-priority finding(s) detected. 5 moderate observations noted. Key highlights:
- 2 column pair(s) with |r| > 0.9
- 2 likely confounded correlation(s) detected
- 4/4 numeric columns are non-normal
Insight Details
Near-perfect linear relationships detected. Top pair: 'timestamp' ↔ 'frame_index' (r=1.000).
- Consider dropping one column from each pair to reduce redundancy
- Verify these are not data leakage or duplicate columns
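
One way to verify a suspected duplicate pair is to test whether one column is a constant multiple of the other. In the summary statistics, every timestamp value is one tenth of the matching frame_index value, which suggests timestamp = frame_index / 10. The helper below is a hypothetical sketch, and the sample values are made up for illustration.

```python
def constant_scale(xs, ys, tol=1e-6):
    """Return s if ys[i] == s * xs[i] for all i (within tol), else None."""
    scale = None
    for x, y in zip(xs, ys):
        if x != 0:
            scale = y / x  # estimate the scale from the first non-zero pair
            break
    if scale is None:
        return None
    if all(abs(y - scale * x) <= tol for x, y in zip(xs, ys)):
        return scale
    return None

# Illustrative values only: frames sampled at an assumed 10 frames per second.
frame_index = [0, 1, 2, 3, 4]
timestamp = [0.0, 0.1, 0.2, 0.3, 0.4]
scale = constant_scale(frame_index, timestamp)
print(scale)
```

If the check returns a scale on the full columns, one of the two columns carries no additional information and can be dropped.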
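Raw correlation differs significantly from partial correlation, suggesting confounding variables. Top: 'episode_index' ↔ 'index' (raw r=1.00, partial r=8183723376125764.00). A partial correlation must lie in [-1, 1], so a value of this size is a numerical artifact of a near-singular correlation matrix ('timestamp' and 'frame_index' have r=1.0) and should not be read as a measured effect.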
- Do not assume causal relationship from raw correlation for these pairs
- Investigate which variables are confounders
Most numeric columns fail normality tests (α=0.05). Non-parametric methods may be more appropriate.
- Prefer non-parametric tests (Kruskal-Wallis, Mann-Whitney) over t-tests/ANOVA
- Consider power transforms if normality is needed for downstream models
Distribution fitting reveals non-Normal best fits. Most common: beta (2 columns). Full counts: {'beta': 2, 'lognorm': 1, 'uniform': 1}.
- Use the identified distributions for parametric modeling or simulation
- Transform columns toward normality if Gaussian assumptions are needed
K-Means identifies 3 clusters with moderate separation (silhouette=0.40). Cluster sizes: {'cluster_0': 1895, 'cluster_1': 1882, 'cluster_2': 1223}. The sizes sum to 5,000, so clustering covered a subset of the 187,507 rows.
- Profile each cluster to understand segment characteristics
- Use cluster labels as a feature for downstream modelling
VIF > 10 detected for: ['episode_index']. Worst: 'episode_index' (VIF=266.3). Redundant information may cause model instability.
- Remove one column from each highly correlated pair
- Apply PCA or regularization (Ridge/Lasso) to handle collinearity
Top interaction: 'timestamp' × 'episode_index' (strength=0.73). Product features may improve model performance.
- Create interaction (product) features for the top pairs
Box-Cox / Yeo-Johnson transforms can significantly reduce skewness for columns: ['timestamp', 'frame_index'].
- Apply the recommended transform (Box-Cox or Yeo-Johnson) in preprocessing
A small fraction of rows are flagged by multiple anomaly detection methods.
- Review flagged rows for data entry errors or special cases
All columns are fully populated — no imputation needed.
Insight Severity Distribution
Top Insights
Advanced Distribution Analysis
Best-Fit Distribution
| column | best_distribution | aic | bic | ks_statistic | ks_p_value | fit_quality |
|---|---|---|---|---|---|---|
| timestamp | beta | 28275.5600 | 28301.6300 | 0.0363 | 0.0000 | poor |
| episode_index | beta | 75977.6500 | 76003.7200 | 0.0150 | 0.2102 | good |
| frame_index | lognorm | 20261.0000 | 20280.5500 | 0.5098 | 0.0000 | poor |
| index | uniform | 121419.0200 | 121432.0600 | 0.0136 | 0.3124 | good |
Jarque-Bera Normality Test
| column | jb_statistic | p_value | is_normal_0.05 | skewness | kurtosis |
|---|---|---|---|---|---|
| timestamp | 50024.3984 | 0.0000 | False | 1.0530 | 1.4028 |
| episode_index | 11389.0254 | 0.0000 | False | -0.0093 | -1.2072 |
| frame_index | 50024.3684 | 0.0000 | False | 1.0530 | 1.4028 |
| index | 11250.4200 | 0.0000 | False | -0.0000 | -1.2000 |
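
The Jarque-Bera statistic combines the two shape measures in the table: JB = n/6 · (S² + K²/4), with S the skewness and K the excess kurtosis. Under the null hypothesis it follows a chi-square distribution with 2 degrees of freedom, whose upper tail is exp(-JB/2). A minimal sketch using the rounded values from the index row:

```python
import math

def jarque_bera(n: int, skewness: float, excess_kurtosis: float) -> float:
    """JB = n/6 * (S^2 + K^2/4)."""
    return n / 6.0 * (skewness ** 2 + excess_kurtosis ** 2 / 4.0)

def jarque_bera_p_value(jb: float) -> float:
    """Upper tail of a chi-square distribution with 2 degrees of freedom."""
    return math.exp(-jb / 2.0)

# Rounded values from the index row above.
jb_index = jarque_bera(187507, 0.0, -1.2)
p_index = jarque_bera_p_value(jb_index)
print(round(jb_index, 2), p_index)
```

Rows with non-zero skewness may differ slightly from the table because the inputs here are rounded to four decimals.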
Power Transform Recommendation
| column | original_skewness | recommended_method | optimal_lambda | transformed_skewness | needs_transform | improvement |
|---|---|---|---|---|---|---|
| timestamp | 1.0530 | yeo-johnson | 0.2569 | -0.0496 | True | 1.0034 |
| episode_index | -0.0093 | yeo-johnson | 0.7184 | -0.2834 | False | -0.2741 |
| frame_index | 1.0530 | yeo-johnson | 0.3990 | -0.0965 | True | 0.9565 |
| index | -0.0000 | yeo-johnson | 0.7071 | -0.2916 | False | -0.2916 |
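
For reference, the Yeo-Johnson transform for a single value and a given lambda can be written out directly. This sketch follows the standard piecewise definition and reuses the `optimal_lambda` reported for timestamp; it does not estimate lambda, and the sample inputs are only illustrative.

```python
import math

def yeo_johnson(x: float, lam: float) -> float:
    """Yeo-Johnson power transform of one value for a fixed lambda."""
    if x >= 0:
        if abs(lam) < 1e-12:
            return math.log1p(x)
        return ((x + 1.0) ** lam - 1.0) / lam
    if abs(lam - 2.0) < 1e-12:
        return -math.log1p(-x)
    return -(((1.0 - x) ** (2.0 - lam) - 1.0) / (2.0 - lam))

# Lambda taken from the timestamp row; inputs are the reported min, median, and max.
lam_timestamp = 0.2569
transformed = [yeo_johnson(t, lam_timestamp) for t in (0.0, 5.3, 35.1)]
print(transformed)
```

Because lambda is below 1, large timestamps are compressed more than small ones, which is what reduces the right skew.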
KDE Bandwidth Analysis
| column | n | std | iqr | silverman_bandwidth | scotts_bandwidth |
|---|---|---|---|---|---|
| timestamp | 187507.0000 | 4.8659 | 6.9000 | 0.3862 | 0.2967 |
| episode_index | 187507.0000 | 573.2126 | 1002.0000 | 45.4941 | 34.9517 |
| frame_index | 187507.0000 | 48.6586 | 69.0000 | 3.8619 | 2.9670 |
| index | 187507.0000 | 54128.7528 | 93753.0000 | 4296.0273 | 3300.5092 |
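
The tabulated bandwidths are numerically consistent with two rules of thumb: Silverman's 0.9 · min(std, iqr/1.34) · n^(-1/5), and Scott's 3.49 · std · n^(-1/3). The report does not state its formulas, so the constants below are an inference from the numbers, not a documented specification.

```python
def silverman_bandwidth(std: float, iqr: float, n: int) -> float:
    """Silverman's rule of thumb: 0.9 * min(std, iqr / 1.34) * n^(-1/5)."""
    return 0.9 * min(std, iqr / 1.34) * n ** (-1.0 / 5.0)

def scott_bandwidth(std: float, n: int) -> float:
    """Scott's rule as it appears to be used here: 3.49 * std * n^(-1/3)."""
    return 3.49 * std * n ** (-1.0 / 3.0)

# Inputs copied from the timestamp row above.
h_silverman = silverman_bandwidth(4.8659, 6.9, 187507)
h_scott = scott_bandwidth(4.8659, 187507)
print(round(h_silverman, 4), round(h_scott, 4))
```

Both results agree with the timestamp row to about three decimal places.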
Best-Fit Distribution Overlay
ECDF Plot
Power Transform Comparison
Advanced Correlation Analysis
Partial Correlation Matrix
| | timestamp | episode_index | frame_index | index |
|---|---|---|---|---|
| timestamp | 1.0000 | 6.0442 | -1.0000 | -99974086307298640.0000 |
| episode_index | 6.0442 | 1.0000 | -6.0442 | 8183723376125764.0000 |
| frame_index | -1.0000 | -6.0442 | 1.0000 | 99974086307314160.0000 |
| index | -99974086307298656.0000 | 8183723376125764.0000 | 99974086307314160.0000 | 1.0000 |
Mutual Information Matrix
| | timestamp | episode_index | frame_index | index |
|---|---|---|---|---|
| timestamp | 0.0000 | 0.0362 | 5.0693 | 0.0353 |
| episode_index | 0.0362 | 0.0000 | 0.0000 | 6.3049 |
| frame_index | 5.0693 | 0.0000 | 0.0000 | 0.0000 |
| index | 0.0353 | 6.3049 | 0.0000 | 0.0000 |
Bootstrap Correlation 95% CI
| | col_a | col_b | pearson_r | ci_lower | ci_upper | ci_width | significant |
|---|---|---|---|---|---|---|---|
| 0 | timestamp | episode_index | 0.0221 | -0.0039 | 0.0485 | 0.0524 | False |
| 1 | timestamp | frame_index | 1.0000 | 1.0000 | 1.0000 | 0.0000 | True |
| 2 | timestamp | index | 0.0225 | -0.0048 | 0.0508 | 0.0556 | False |
| 3 | episode_index | frame_index | 0.0221 | -0.0045 | 0.0496 | 0.0541 | False |
| 4 | episode_index | index | 0.9999 | 0.9999 | 1.0000 | 0.0000 | True |
| 5 | frame_index | index | 0.0225 | -0.0054 | 0.0500 | 0.0554 | False |
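
A percentile bootstrap interval for Pearson's r can be sketched as follows. The resample count, seed, and percentile method are assumptions of this illustration (the report does not document its settings), and the data are synthetic.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation; returns None if either sample has zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)

def bootstrap_ci(xs, ys, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for Pearson's r."""
    rng = random.Random(seed)
    n = len(xs)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample rows with replacement
        r = pearson_r([xs[i] for i in idx], [ys[i] for i in idx])
        if r is not None:  # skip degenerate resamples
            stats.append(r)
    stats.sort()
    lower = stats[int(alpha / 2 * len(stats))]
    upper = stats[int((1 - alpha / 2) * len(stats)) - 1]
    return lower, upper

# Synthetic, perfectly linear pair (mimicking frame_index and timestamp).
frames = list(range(50))
times = [f / 10 for f in frames]
lower, upper = bootstrap_ci(frames, times)
significant = lower > 0 or upper < 0  # interval excludes zero
print(lower, upper, significant)
```

A pair is marked significant when its interval excludes zero, which matches the pattern of the `significant` column above.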
Distance Correlation Matrix
| | timestamp | episode_index | frame_index | index |
|---|---|---|---|---|
| timestamp | 1.0000 | 0.0360 | 1.0000 | 0.0361 |
| episode_index | 0.0360 | 1.0000 | 0.0360 | 0.9999 |
| frame_index | 1.0000 | 0.0360 | 1.0000 | 0.0361 |
| index | 0.0361 | 0.9999 | 0.0361 | 1.0000 |
Partial Correlation Heatmap
Mutual Information Heatmap
Bootstrap Correlation CI
Correlation Network
Distance Correlation Heatmap
Clustering Analysis
K-Means Summary
DBSCAN Summary
Hierarchical Clustering
Cluster Profiles
| | timestamp | episode_index | frame_index | index |
|---|---|---|---|---|
| cluster_0 | 4.1473 | 1508.9013 | 41.4728 | 141229.8786 |
| cluster_1 | 4.3556 | 481.7476 | 43.5563 | 44448.0032 |
| cluster_2 | 13.2153 | 1064.5127 | 132.1529 | 99369.0989 |
Elbow & Silhouette
Cluster Scatter
Dendrogram
Dimensionality Reduction
t-SNE Embedding
Factor Analysis
Factor Loadings
| | factor_1 | factor_2 |
|---|---|---|
| timestamp | 1.0000 | -0.0000 |
| episode_index | 0.0221 | -0.9997 |
| frame_index | 1.0000 | 0.0000 |
| index | 0.0225 | -0.9997 |
PCA-Weighted Feature Contribution
| column | contribution_score | rank |
|---|---|---|
| timestamp | 0.5000 | 4.0000 |
| episode_index | 0.5000 | 2.0000 |
| frame_index | 0.5000 | 3.0000 |
| index | 0.5000 | 1.0000 |
PCA Biplot
Explained Variance Curve
Factor Loadings Heatmap
Feature Engineering Insights
Interaction Detection
| | col_a | col_b | interaction_strength | corr_product_a | corr_product_b | corr_a_b | recommendation |
|---|---|---|---|---|---|---|---|
| 0 | timestamp | episode_index | 0.7259 | 0.7480 | 0.5459 | 0.0221 | Strong interaction |
| 1 | episode_index | frame_index | 0.7259 | 0.5459 | 0.7480 | 0.0221 | Strong interaction |
| 2 | timestamp | index | 0.7217 | 0.7442 | 0.5495 | 0.0225 | Strong interaction |
| 3 | frame_index | index | 0.7217 | 0.7442 | 0.5495 | 0.0225 | Strong interaction |
Binning Analysis
| column | n_bins | equal_width_entropy | equal_freq_entropy | max_entropy | recommended_method | skewness |
|---|---|---|---|---|---|---|
| timestamp | 10 | 2.2300 | 3.3211 | 3.3219 | equal_frequency | 1.0530 |
| episode_index | 10 | 3.3208 | 3.3219 | 3.3219 | equal_width | -0.0093 |
| frame_index | 10 | 2.2300 | 3.3211 | 3.3219 | equal_frequency | 1.0530 |
| index | 10 | 3.3219 | 3.3219 | 3.3219 | equal_width | -0.0000 |
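
The comparison in this table can be reproduced by binning a column both ways and measuring the entropy of the bin counts; with 10 bins the maximum is log2(10) ≈ 3.3219 bits. The sketch below uses a synthetic right-skewed column, and the binning helpers are illustrative rather than the report's own.

```python
import bisect
import math

def entropy_bits(counts):
    """Shannon entropy (base 2) of bin counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def equal_width_counts(values, n_bins):
    """Counts per bin when the range is split into bins of equal width."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [len(values)] + [0] * (n_bins - 1)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        counts[min(int((v - lo) / width), n_bins - 1)] += 1
    return counts

def equal_frequency_counts(values, n_bins):
    """Counts per bin when bin edges are taken at empirical quantiles."""
    ordered = sorted(values)
    n = len(ordered)
    edges = [ordered[(k * n) // n_bins] for k in range(1, n_bins)]
    counts = [0] * n_bins
    for v in values:
        counts[bisect.bisect_right(edges, v)] += 1
    return counts

# Synthetic right-skewed data: squares of 0..99.
skewed = [x * x for x in range(100)]
h_width = entropy_bits(equal_width_counts(skewed, 10))
h_freq = entropy_bits(equal_frequency_counts(skewed, 10))
print(h_width, h_freq, math.log2(10))
```

Equal-frequency binning keeps the entropy near its maximum on skewed data, which is why it is recommended for timestamp and frame_index.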
Advanced Anomaly Detection
Isolation Forest
Local Outlier Factor
Consensus (>=2/3 agree)
Anomaly Scatter
Consensus Anomaly Comparison
Statistical Tests
Levene's Test (Equality of Variances)
| | col_a | col_b | levene_stat | p_value | log_var_ratio | adjusted_p | significant_0.05 | stars |
|---|---|---|---|---|---|---|---|---|
| 0 | timestamp | episode_index | 564897.5197 | 0.0000 | 9.5380 | 0.0000 | True | *** |
| 1 | timestamp | frame_index | 222079.0332 | 0.0000 | 4.6052 | 0.0000 | True | *** |
| 2 | timestamp | index | 562425.8915 | 0.0000 | 18.6338 | 0.0000 | True | *** |
| 3 | episode_index | frame_index | 482754.0326 | 0.0000 | 4.9329 | 0.0000 | True | *** |
| 4 | episode_index | index | 550576.4333 | 0.0000 | 9.0957 | 0.0000 | True | *** |
| 5 | frame_index | index | 561596.5666 | 0.0000 | 14.0286 | 0.0000 | True | *** |
Kruskal-Wallis Test
| | grouping_col | numeric_col | n_groups | h_statistic | p_value | eta_squared | effect_magnitude | adjusted_p | reject_h0_0.05 | stars | interpretation |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | task_index | timestamp | 3 | 10663.7432 | 0.0000 | 0.0569 | small | 0.0000 | True | *** | Significant (η²=0.0569, small) |
| 1 | task_index | episode_index | 3 | 625.0241 | 0.0000 | 0.0033 | small | 0.0000 | True | *** | Significant (η²=0.0033, small) |
| 2 | task_index | frame_index | 3 | 10663.6852 | 0.0000 | 0.0569 | small | 0.0000 | True | *** | Significant (η²=0.0569, small) |
| 3 | task_index | index | 3 | 625.0238 | 0.0000 | 0.0033 | small | 0.0000 | True | *** | Significant (η²=0.0033, small) |
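
The `eta_squared` column is consistent with the common H-based effect size η² = (H - k + 1) / (n - k), where k is the number of groups and n the number of rows. A minimal sketch under that assumption:

```python
def kruskal_eta_squared(h_statistic: float, n_groups: int, n: int) -> float:
    """Eta-squared effect size from a Kruskal-Wallis H statistic."""
    return (h_statistic - n_groups + 1) / (n - n_groups)

# H statistics copied from rows 0 and 1 of the table above.
eta_timestamp = kruskal_eta_squared(10663.7432, 3, 187507)
eta_episode = kruskal_eta_squared(625.0241, 3, 187507)
print(round(eta_timestamp, 4), round(eta_episode, 4))
```

With 187,507 rows even a tiny effect yields a p-value of 0.0000, so the effect size is the more informative column here.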
Mann-Whitney U Test
| | col_a | col_b | u_statistic | p_value | rank_biserial_r | effect_magnitude | adjusted_p | significant_0.05 | stars |
|---|---|---|---|---|---|---|---|---|---|
| 0 | timestamp | episode_index | 95185168.0000 | 0.0000 | 0.9946 | large | 0.0000 | True | *** |
| 1 | timestamp | frame_index | 2516704372.0000 | 0.0000 | 0.8568 | large | 0.0000 | True | *** |
| 2 | timestamp | index | 1278510.0000 | 0.0000 | 0.9999 | large | 0.0000 | True | *** |
| 3 | episode_index | frame_index | 34126631417.0000 | 0.0000 | -0.9413 | large | 0.0000 | True | *** |
| 4 | episode_index | index | 188532431.5000 | 0.0000 | 0.9893 | large | 0.0000 | True | *** |
| 5 | frame_index | index | 11929357.5000 | 0.0000 | 0.9993 | large | 0.0000 | True | *** |
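
The `rank_biserial_r` column matches r = 1 - 2U / (n1 · n2), with n1 = n2 = 187,507 because each comparison pairs two full columns. A minimal sketch under that assumption:

```python
def rank_biserial(u_statistic: float, n1: int, n2: int) -> float:
    """Rank-biserial correlation from a Mann-Whitney U statistic."""
    return 1.0 - 2.0 * u_statistic / (n1 * n2)

n = 187507
# U statistics copied from rows 0 and 3 of the table above.
r_row0 = rank_biserial(95185168.0, n, n)
r_row3 = rank_biserial(34126631417.0, n, n)
print(round(r_row0, 4), round(r_row3, 4))
```

The sign depends only on which column is treated as the first sample; the magnitude shows how little the two value ranges overlap.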
Chi-Square Goodness of Fit
| column | n_categories | chi2_stat | p_value | cramers_v | effect_magnitude | uniform_0.05 | interpretation |
|---|---|---|---|---|---|---|---|
| task_index | 3 | 1112.7892 | 0.0000 | 0.0545 | small | False | Non-uniform distribution |
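
The goodness-of-fit test compares the observed category counts with a uniform expectation, and the reported `cramers_v` is consistent with sqrt(χ² / (n · (k - 1))). The sketch below assumes those definitions; the sample counts in the usage lines are hypothetical.

```python
import math

def chi_square_uniform(counts):
    """Chi-square statistic against equal expected counts in every category."""
    n = sum(counts)
    expected = n / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)

def cramers_v_gof(chi2: float, n: int, k: int) -> float:
    """Cramer's V for a one-way goodness-of-fit test with k categories."""
    return math.sqrt(chi2 / (n * (k - 1)))

# Statistic and sizes copied from the table above.
v_task = cramers_v_gof(1112.7892, 187507, 3)

# Hypothetical counts: perfectly uniform data give a statistic of 0.
chi2_balanced = chi_square_uniform([50, 50, 50])
print(round(v_task, 4), chi2_balanced)
```

The test rejects uniformity because n is large, while the small V indicates the imbalance between tasks is minor.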
Grubbs Outlier Test
| column | suspect_value | grubbs_statistic | critical_value | is_outlier | n |
|---|---|---|---|---|---|
| timestamp | 35.1000 | 5.9163 | 5.1454 | True | 187507 |
| episode_index | 0.0000 | 1.7532 | 5.1454 | False | 187507 |
| frame_index | 351.0000 | 5.9163 | 5.1454 | True | 187507 |
| index | 0.0000 | 1.7320 | 5.1454 | False | 187507 |
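
The Grubbs statistic is the distance of the most extreme value from the mean, in standard deviations: G = |x - mean| / std. The sketch below recomputes it from the descriptive statistics and compares it with the critical value listed in the table; deriving the critical value itself requires a t-distribution quantile and is not shown.

```python
def grubbs_statistic(suspect: float, mean: float, std: float) -> float:
    """G = |suspect - mean| / std."""
    if std <= 0:
        raise ValueError("std must be positive")
    return abs(suspect - mean) / std

CRITICAL_VALUE = 5.1454  # copied from the table above (n = 187507)

# Means and standard deviations come from the Descriptive Statistics table.
g_timestamp = grubbs_statistic(35.1, 6.3121, 4.8659)
g_episode = grubbs_statistic(0.0, 1004.9688, 573.2126)
print(round(g_timestamp, 4), g_timestamp > CRITICAL_VALUE)
print(round(g_episode, 4), g_episode > CRITICAL_VALUE)
```

Only the maximum timestamp (and its frame_index counterpart) exceeds the critical value, in line with the `is_outlier` column.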
Data Profiling Summary
Column Roles
| column | primary_role | confidence | secondary_role | properties |
|---|---|---|---|---|
| observation.state | id | 0.8500 | NaN | {'unique_ratio': 1.0} |
| action | id | 0.8500 | NaN | {'unique_ratio': 1.0} |
| timestamp | timestamp | 0.7000 | NaN | {'dtype': 'float32', 'hint': 'monotonic numeric with time-like name'} |
| episode_index | numeric_feature | 0.8500 | NaN | {'dtype': 'int64'} |
| frame_index | numeric_feature | 0.8500 | NaN | {'dtype': 'int64'} |
| next.reward | constant | 1.0000 | NaN | {'n_unique': 1} |
| next.done | binary | 0.9000 | NaN | {'n_unique': 2, 'values': [False, True]} |
| index | id | 0.9000 | NaN | {'unique_ratio': 1.0} |
| task_index | categorical_feature | 0.8500 | NaN | {'n_unique': 3, 'unique_ratio': 0.0} |
ML Readiness
Blocking Issues
- 1 constant column(s) — remove before modelling
- Extreme multicollinearity: VIF=266 for 'episode_index' — remove or combine
Suggestions
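- Review 3 ID-like column(s) before modelling: observation.state, action, index. Only index is a plain row counter (0 to 187506); observation.state and action are flagged solely because every value is unique (unique_ratio 1.0), so inspect their contents before removing them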