{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# spark-df-profiling Meteorites example"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Source of data: https://data.nasa.gov/Space-Science/Meteorite-Landings/gh4g-9sfh\n",
"\n",
"I previously converted the downloaded CSV to a [Parquet](https://parquet.apache.org/) table, but the storage format does not matter: as long as your data is loaded into a Spark DataFrame, you are good to go."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Import library"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import spark_df_profiling"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create the DataFrame"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"DataFrame[name: string, id: bigint, nametype: string, recclass: string, mass_g: double, fall: string, reclat: double, reclong: double, GeoLocation: string, source: string, reclat_city: double, year: date]"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
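"# `sqlContext` is assumed to be predefined by the PySpark notebook session\n",
"# Cache the DataFrame because the cells below scan it more than once\n",
"df = sqlContext.read.parquet(\"/Users/Julio/Downloads/Meteorite_Landings.parquet\").cache()\n",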
"\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Spark DataFrames have a built-in `.describe()` method. Let's see what it shows:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-------+------------------+------+---------+----------+-------------------+\n",
"|summary| id|mass_g| reclat| reclong| reclat_city|\n",
"+-------+------------------+------+---------+----------+-------------------+\n",
"| count| 45726| 45726| 45726| 45726| 45726|\n",
"| mean|26883.906202160695| NaN| NaN| NaN| NaN|\n",
"| stddev| 16863.44556599258| NaN| NaN| NaN| NaN|\n",
"| min| 1| 0.0|-87.36667|-165.43333|-103.79172917787167|\n",
"| max| 57458| NaN| NaN| NaN| NaN|\n",
"+-------+------------------+------+---------+----------+-------------------+\n",
"\n"
]
}
],
"source": [
"df.describe().show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Generate the report"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's profile the same DataFrame with `spark_df_profiling`:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"report = spark_df_profiling.ProfileReport(df)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
"\n",
" \n",
"\n",
"
\n",
"
\n",
"
Overview
\n",
" \n",
"\n",
" \n",
"
\n",
"
\n",
"
Dataset info
\n",
"
\n",
" Number of variables | \n",
" 12 |
\n",
" Number of observations | \n",
" 45726 |
\n",
" Total Missing (%) | \n",
" 4.1% |
\n",
" Total size in memory | \n",
" 0.0 B |
\n",
" Average record size in memory | \n",
" 0.0 B |
\n",
"
\n",
"
\n",
"
\n",
"
Variables types
\n",
"
\n",
" Numeric | \n",
" 4 |
\n",
" Categorical | \n",
" 4 |
\n",
" Date | \n",
" 1 |
\n",
" Text (Unique) | \n",
" 1 |
\n",
" Rejected | \n",
" 2 |
\n",
"
\n",
"
\n",
"
\n",
"
Warnings
\n",
"
GeoLocation has 7315 / 19.0% missing values [Missing]
GeoLocation has a high cardinality: 17100 distinct values [Warning]
mass_g is highly skewed (γ1 = 76.916)
recclass has a high cardinality: 466 distinct values [Warning]
reclat has 7315 / 19.0% missing values [Missing]
reclat has 6438 / 14.1% zeros
reclat_city is highly correlated with reclat (ρ = 0.99423) [Rejected]
reclong has 7315 / 19.0% missing values [Missing]
reclong has 6214 / 13.6% zeros
source has constant value NASA [Rejected]
\n",
"
\n",
"
\n",
"\n",
"\n",
"
\n",
"
Variables
\n",
" \n",
"\n",
"
\n",
"
\n",
"
GeoLocation
Categorical
\n",
"
\n",
"\n",
"
\n",
"\n",
"
\n",
" Distinct count | \n",
" 17100 |
\n",
" Unique (%) | \n",
" 44.5% |
\n",
" Missing (%) | \n",
" 19.0% |
\n",
" Missing (n) | \n",
" 7315 |
\n",
" Infinite (%) | \n",
" 0.0% |
\n",
" Infinite (n) | \n",
" 0 |
\n",
"
\n",
"\n",
"\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
" \n",
" (0.000000, 0.000000) | \n",
" \n",
" \n",
"6214\n",
" \n",
" | \n",
"
\n",
"\n",
" (-71.500000, 35.666670) | \n",
" \n",
" \n",
" \n",
" 4761\n",
" | \n",
"
\n",
"\n",
" (-84.000000, 168.000000) | \n",
" \n",
" \n",
" \n",
" 3040\n",
" | \n",
"
\n",
"\n",
" Other values (17097) | \n",
" \n",
" \n",
"24396\n",
" \n",
" | \n",
"
\n",
"\n",
" (Missing) | \n",
" \n",
" \n",
"7315\n",
" \n",
" | \n",
"
\n",
"\n",
"
\n",
"
\n",
"\n",
"\n",
"
\n",
" \n",
"\n",
"
\n",
"
\n",
" \n",
" Value | \n",
" Count | \n",
" Frequency (%) | \n",
" | \n",
"
\n",
"\n",
" \n",
"\n",
" (0.000000, 0.000000) | \n",
" 6214 | \n",
" 13.6% | \n",
" \n",
" \n",
" | \n",
"
\n",
"\n",
"\n",
" (-71.500000, 35.666670) | \n",
" 4761 | \n",
" 10.4% | \n",
" \n",
" \n",
" | \n",
"
\n",
"\n",
"\n",
" (-84.000000, 168.000000) | \n",
" 3040 | \n",
" 6.6% | \n",
" \n",
" \n",
" | \n",
"
\n",
"\n",
"\n",
" (-72.000000, 26.000000) | \n",
" 1505 | \n",
" 3.3% | \n",
" \n",
" \n",
" | \n",
"
\n",
"\n",
"\n",
" (-79.683330, 159.750000) | \n",
" 657 | \n",
" 1.4% | \n",
" \n",
" \n",
" | \n",
"
\n",
"\n",
"\n",
" (-76.716670, 159.666670) | \n",
" 637 | \n",
" 1.4% | \n",
" \n",
" \n",
" | \n",
"
\n",
"\n",
"\n",
" (-76.183330, 157.166670) | \n",
" 539 | \n",
" 1.2% | \n",
" \n",
" \n",
" | \n",
"
\n",
"\n",
"\n",
" (-79.683330, 155.750000) | \n",
" 473 | \n",
" 1.0% | \n",
" \n",
" \n",
" | \n",
"
\n",
"\n",
"\n",
" (-84.216670, 160.500000) | \n",
" 263 | \n",
" 0.6% | \n",
" \n",
" \n",
" | \n",
"
\n",
"\n",
"\n",
" (-86.366670, -70.000000) | \n",
" 226 | \n",
" 0.5% | \n",
" \n",
" \n",
" | \n",
"
\n",
"\n",
"\n",
" (0.000000, 35.666670) | \n",
" 223 | \n",
" 0.5% | \n",
" \n",
" \n",
" | \n",
"
\n",
"\n",
"\n",
" (-86.716670, -141.500000) | \n",
" 217 | \n",
" 0.5% | \n",
" \n",
" \n",
" | \n",
"
\n",
"\n",
"\n",
" (-85.666670, 175.000000) | \n",
" 185 | \n",
" 0.4% | \n",
" \n",
" \n",
" | \n",
"
\n",
"\n",
"\n",
" (-24.850000, -70.533330) | \n",
" 178 | \n",
" 0.4% | \n",
" \n",
" \n",
" | \n",
"
\n",
"\n",
"\n",
" (-85.633330, -68.700000) | \n",
" 105 | \n",
" 0.2% | \n",
" \n",
" \n",
" | \n",
"
\n",
"\n",
"\n",
" (-72.954880, 160.473280) | \n",
" 74 | \n",
" 0.2% | \n",
" \n",
" \n",
" | \n",
"
\n",
"\n",
"\n",
" (58.583330, 13.433330) | \n",
" 64 | \n",
" 0.1% | \n",
" \n",
" \n",
" | \n",
"
\n",
"\n",
"\n",
" (-76.716670, 159.333330) | \n",
" 42 | \n",
" 0.1% | \n",
" \n",
" \n",
" | \n",
"
\n",
"\n",
"\n",
" (-72.778890, 75.313610) | \n",
" 39 | \n",
" 0.1% | \n",
" \n",
" \n",
" | \n",
"
\n",
"\n",
"\n",
" (-72.983890, 75.246390) | \n",
" 38 | \n",
" 0.1% | \n",
" \n",
" \n",
" | \n",
"
\n",
"\n",
"\n",
" Other values (17080) | \n",
" 18931 | \n",
" 41.4% | \n",
" \n",
" \n",
" | \n",
"
\n",
"\n",
"\n",
" (Missing) | \n",
" 7315 | \n",
" 16.0% | \n",
" \n",
" \n",
" | \n",
"
\n",
"\n",
"
\n",
"\n",
"
\n",
"\n",
"\n",
"
\n",
"
\n",
"
fall
Categorical
\n",
"
\n",
"\n",
"
\n",
"\n",
"
\n",
" Distinct count | \n",
" 2 |
\n",
" Unique (%) | \n",
" 0.0% |
\n",
" Missing (%) | \n",
" 0.0% |
\n",
" Missing (n) | \n",
" 0 |
\n",
" Infinite (%) | \n",
" 0.0% |
\n",
" Infinite (n) | \n",
" 0 |
\n",
"
\n",
"\n",
"\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
" \n",
" Found | \n",
" \n",
" \n",
"44609\n",
" \n",
" | \n",
"
\n",
"\n",
" Fell | \n",
" \n",
" \n",
" \n",
" 1117\n",
" | \n",
"
\n",
"\n",
"
\n",
"
\n",
"\n",
"\n",
"
\n",
" \n",
"\n",
"
\n",
"
\n",
" \n",
" Value | \n",
" Count | \n",
" Frequency (%) | \n",
" | \n",
"
\n",
"\n",
" \n",
"\n",
" Found | \n",
" 44609 | \n",
" 97.6% | \n",
" \n",
" \n",
" | \n",
"
\n",
"\n",
"\n",
" Fell | \n",
" 1117 | \n",
" 2.4% | \n",
" \n",
" \n",
" | \n",
"
\n",
"\n",
"
\n",
"\n",
"
\n",
"\n",
"\n",
"
\n",
"
id
Numeric
\n",
"\n",
"
\n",
"
\n",
"
\n",
"
\n",
" Distinct count | \n",
" 45716 |
\n",
" Unique (%) | \n",
" 100.0% |
\n",
" Missing (%) | \n",
" 0.0% |
\n",
" Missing (n) | \n",
" 0 |
\n",
" Infinite (%) | \n",
" 0.0% |
\n",
" Infinite (n) | \n",
" 0 |
\n",
"
\n",
"\n",
"
\n",
"
\n",
"
\n",
"\n",
" Mean | \n",
" 26884 |
\n",
" Minimum | \n",
" 1 |
\n",
" Maximum | \n",
" 57458 |
\n",
" Zeros (%) | \n",
" 0.0% |
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
Quantile statistics
\n",
"
\n",
" Minimum | \n",
" 1 |
\n",
" 5-th percentile | \n",
" 2388.8 |
\n",
" Q1 | \n",
" 12681 |
\n",
" Median | \n",
" 24256 |
\n",
" Q3 | \n",
" 40654 |
\n",
" 95-th percentile | \n",
" 54891 |
\n",
" Maximum | \n",
" 57458 |
\n",
" Range | \n",
" 57457 |
\n",
" Interquartile range | \n",
" 27972 |
\n",
"
\n",
"
Descriptive statistics
\n",
"
\n",
" Standard deviation | \n",
" 16863 |
\n",
" Coef of variation | \n",
" 0.62727 |
\n",
" Kurtosis | \n",
" -1.1601 |
\n",
" Mean | \n",
" 26884 |
\n",
" MAD | \n",
" 14490 |
\n",
" Skewness | \n",
" 0.26652 |
\n",
" Sum | \n",
" 1229300000 |
\n",
" Variance | \n",
" 284380000 |
\n",
" Memory size | \n",
" 0.0 B |
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
mass_g
Numeric
\n",
"
\n",
"\n",
"
\n",
"
\n",
"
\n",
"
\n",
" Distinct count | \n",
" 12577 |
\n",
" Unique (%) | \n",
" 27.6% |
\n",
" Missing (%) | \n",
" 0.3% |
\n",
" Missing (n) | \n",
" 131 |
\n",
" Infinite (%) | \n",
" 0.0% |
\n",
" Infinite (n) | \n",
" 0 |
\n",
"
\n",
"\n",
"
\n",
"
\n",
"
\n",
"\n",
" Mean | \n",
" 13278 |
\n",
" Minimum | \n",
" 0 |
\n",
" Maximum | \n",
" 60000000 |
\n",
" Zeros (%) | \n",
" 0.0% |
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
Quantile statistics
\n",
"
\n",
" Minimum | \n",
" 0 |
\n",
" 5-th percentile | \n",
" 1.0978 |
\n",
" Q1 | \n",
" 7.1907 |
\n",
" Median | \n",
" 32.598 |
\n",
" Q3 | \n",
" 202.86 |
\n",
" 95-th percentile | \n",
" 3999.9 |
\n",
" Maximum | \n",
" 60000000 |
\n",
" Range | \n",
" 60000000 |
\n",
" Interquartile range | \n",
" 195.67 |
\n",
"
\n",
"
Descriptive statistics
\n",
"
\n",
" Standard deviation | \n",
" 574930 |
\n",
" Coef of variation | \n",
" 43.298 |
\n",
" Kurtosis | \n",
" 6797.7 |
\n",
" Mean | \n",
" 13278 |
\n",
" MAD | \n",
" 25113 |
\n",
" Skewness | \n",
" 76.916 |
\n",
" Sum | \n",
" 605430000 |
\n",
" Variance | \n",
" 330540000000 |
\n",
" Memory size | \n",
" 0.0 B |
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
name
Categorical, Unique
\n",
"
\n",
"\n",
"
\n",
" \n",
" \n",
" First 3 values | \n",
"
\n",
" \n",
" \n",
" \n",
" Abee | \n",
"
\n",
" \n",
" Asco | \n",
"
\n",
" \n",
" Aleppo | \n",
"
\n",
" \n",
"
\n",
"
\n",
" \n",
" \n",
" Last 3 values | \n",
"
\n",
" \n",
" \n",
" \n",
" Allende | \n",
"
\n",
" \n",
" Alessandria | \n",
"
\n",
" \n",
" Akaba | \n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
First 20 values
\n",
"
\n",
" \n",
" \n",
" 1 | \n",
" Abee | \n",
"
\n",
" \n",
" 2 | \n",
" Asco | \n",
"
\n",
" \n",
" 3 | \n",
" Aleppo | \n",
"
\n",
" \n",
" 4 | \n",
" Al Rais | \n",
"
\n",
" \n",
" 5 | \n",
" Arbol Solo | \n",
"
\n",
" \n",
" 6 | \n",
" Ash Creek | \n",
"
\n",
" \n",
" 7 | \n",
" Northwest Africa 5815 | \n",
"
\n",
" \n",
" 8 | \n",
" Anlong | \n",
"
\n",
" \n",
" 9 | \n",
" Aomori | \n",
"
\n",
" \n",
" 10 | \n",
" Aldsworth | \n",
"
\n",
" \n",
" 11 | \n",
" Akyumak | \n",
"
\n",
" \n",
" 12 | \n",
" Aachen | \n",
"
\n",
" \n",
" 13 | \n",
" Ambapur Nagla | \n",
"
\n",
" \n",
" 14 | \n",
" Alta'ameem | \n",
"
\n",
" \n",
" 15 | \n",
" Aarhus | \n",
"
\n",
" \n",
" 16 | \n",
" Archie | \n",
"
\n",
" \n",
" 17 | \n",
" Almahata Sitta | \n",
"
\n",
" \n",
" 18 | \n",
" Andhara | \n",
"
\n",
" \n",
" 19 | \n",
" Adzhi-Bogdo (stone) | \n",
"
\n",
" \n",
" 20 | \n",
" Aïr | \n",
"
\n",
" \n",
"
\n",
"
Last 20 values
\n",
"
\n",
" \n",
" \n",
" 45707 | \n",
" Alais | \n",
"
\n",
" \n",
" 45708 | \n",
" Arroyo Aguiar | \n",
"
\n",
" \n",
" 45709 | \n",
" Aguada | \n",
"
\n",
" \n",
" 45710 | \n",
" Angra dos Reis (stone) | \n",
"
\n",
" \n",
" 45711 | \n",
" Alexandrovsky | \n",
"
\n",
" \n",
" 45712 | \n",
" Akwanga | \n",
"
\n",
" \n",
" 45713 | \n",
" Alfianello | \n",
"
\n",
" \n",
" 45714 | \n",
" Appley Bridge | \n",
"
\n",
" \n",
" 45715 | \n",
" Achiras | \n",
"
\n",
" \n",
" 45716 | \n",
" Adhi Kot | \n",
"
\n",
" \n",
" 45717 | \n",
" Akbarpur | \n",
"
\n",
" \n",
" 45718 | \n",
" Andover | \n",
"
\n",
" \n",
" 45719 | \n",
" Acapulco | \n",
"
\n",
" \n",
" 45720 | \n",
" Albareto | \n",
"
\n",
" \n",
" 45721 | \n",
" Apt | \n",
"
\n",
" \n",
" 45722 | \n",
" Agen | \n",
"
\n",
" \n",
" 45723 | \n",
" Andura | \n",
"
\n",
" \n",
" 45724 | \n",
" Allende | \n",
"
\n",
" \n",
" 45725 | \n",
" Alessandria | \n",
"
\n",
" \n",
" 45726 | \n",
" Akaba | \n",
"
\n",
" \n",
"
\n",
"
\n",
"\n",
"
\n",
"
\n",
"
nametype
Categorical
\n",
"
\n",
"\n",
"
\n",
"\n",
"
\n",
" Distinct count | \n",
" 2 |
\n",
" Unique (%) | \n",
" 0.0% |
\n",
" Missing (%) | \n",
" 0.0% |
\n",
" Missing (n) | \n",
" 0 |
\n",
" Infinite (%) | \n",
" 0.0% |
\n",
" Infinite (n) | \n",
" 0 |
\n",
"
\n",
"\n",
"\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
" \n",
" Valid | \n",
" \n",
" \n",
"45651\n",
" \n",
" | \n",
"
\n",
"\n",
" Relict | \n",
" \n",
" \n",
" \n",
" 75\n",
" | \n",
"
\n",
"\n",
"
\n",
"
\n",
"\n",
"\n",
"
\n",
" \n",
"\n",
"
\n",
"
\n",
" \n",
" Value | \n",
" Count | \n",
" Frequency (%) | \n",
" | \n",
"
\n",
"\n",
" \n",
"\n",
" Valid | \n",
" 45651 | \n",
" 99.8% | \n",
" \n",
" \n",
" | \n",
"
\n",
"\n",
"\n",
" Relict | \n",
" 75 | \n",
" 0.2% | \n",
" \n",
" \n",
" | \n",
"
\n",
"\n",
"
\n",
"\n",
"
\n",
"\n",
"\n",
"
\n",
"
\n",
"
recclass
Categorical
\n",
"
\n",
"\n",
"
\n",
"\n",
"
\n",
" Distinct count | \n",
" 466 |
\n",
" Unique (%) | \n",
" 1.0% |
\n",
" Missing (%) | \n",
" 0.0% |
\n",
" Missing (n) | \n",
" 0 |
\n",
" Infinite (%) | \n",
" 0.0% |
\n",
" Infinite (n) | \n",
" 0 |
\n",
"
\n",
"\n",
"\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
" \n",
" L6 | \n",
" \n",
" \n",
"8287\n",
" \n",
" | \n",
"
\n",
"\n",
" H5 | \n",
" \n",
" \n",
"7143\n",
" \n",
" | \n",
"
\n",
"\n",
" L5 | \n",
" \n",
" \n",
" \n",
" 4797\n",
" | \n",
"
\n",
"\n",
" Other values (463) | \n",
" \n",
" \n",
"25499\n",
" \n",
" | \n",
"
\n",
"\n",
"
\n",
"
\n",
"\n",
"\n",
"
\n",
" \n",
"\n",
"
\n",
"
\n",
" \n",
" Value | \n",
" Count | \n",
" Frequency (%) | \n",
" | \n",
"
\n",
"\n",
" \n",
"\n",
" L6 | \n",
" 8287 | \n",
" 18.1% | \n",
" \n",
" \n",
" | \n",
"
\n",
"\n",
"\n",
" H5 | \n",
" 7143 | \n",
" 15.6% | \n",
" \n",
" \n",
" | \n",
"
\n",
"\n",
"\n",
" L5 | \n",
" 4797 | \n",
" 10.5% | \n",
" \n",
" \n",
" | \n",
"
\n",
"\n",
"\n",
" H6 | \n",
" 4529 | \n",
" 9.9% | \n",
" \n",
" \n",
" | \n",
"
\n",
"\n",
"\n",
" H4 | \n",
" 4211 | \n",
" 9.2% | \n",
" \n",
" \n",
" | \n",
"
\n",
"\n",
"\n",
" LL5 | \n",
" 2766 | \n",
" 6.0% | \n",
" \n",
" \n",
" | \n",
"
\n",
"\n",
"\n",
" LL6 | \n",
" 2043 | \n",
" 4.5% | \n",
" \n",
" \n",
" | \n",
"
\n",
"\n",
"\n",
" L4 | \n",
" 1253 | \n",
" 2.7% | \n",
" \n",
" \n",
" | \n",
"
\n",
"\n",
"\n",
" H4/5 | \n",
" 428 | \n",
" 0.9% | \n",
" \n",
" \n",
" | \n",
"
\n",
"\n",
"\n",
" CM2 | \n",
" 416 | \n",
" 0.9% | \n",
" \n",
" \n",
" | \n",
"
\n",
"\n",
"\n",
" H3 | \n",
" 386 | \n",
" 0.8% | \n",
" \n",
" \n",
" | \n",
"
\n",
"\n",
"\n",
" L3 | \n",
" 365 | \n",
" 0.8% | \n",
" \n",
" \n",
" | \n",
"
\n",
"\n",
"\n",
" CO3 | \n",
" 335 | \n",
" 0.7% | \n",
" \n",
" \n",
" | \n",
"
\n",
"\n",
"\n",
" Ureilite | \n",
" 300 | \n",
" 0.7% | \n",
" \n",
" \n",
" | \n",
"
\n",
"\n",
"\n",
" Iron, IIIAB | \n",
" 285 | \n",
" 0.6% | \n",
" \n",
" \n",
" | \n",
"
\n",
"\n",
"\n",
" LL4 | \n",
" 268 | \n",
" 0.6% | \n",
" \n",
" \n",
" | \n",
"
\n",
"\n",
"\n",
" CV3 | \n",
" 256 | \n",
" 0.6% | \n",
" \n",
" \n",
" | \n",
"
\n",
"\n",
"\n",
" Diogenite | \n",
" 241 | \n",
" 0.5% | \n",
" \n",
" \n",
" | \n",
"
\n",
"\n",
"\n",
" Howardite | \n",
" 240 | \n",
" 0.5% | \n",
" \n",
" \n",
" | \n",
"
\n",
"\n",
"\n",
" LL | \n",
" 225 | \n",
" 0.5% | \n",
" \n",
" \n",
" | \n",
"
\n",
"\n",
"\n",
" Other values (446) | \n",
" 6952 | \n",
" 15.2% | \n",
" \n",
" \n",
" | \n",
"
\n",
"\n",
"
\n",
"\n",
"
\n",
"\n",
"\n",
"
\n",
"
\n",
"
reclat
Numeric
\n",
"
\n",
"\n",
"
\n",
"
\n",
"
\n",
"
\n",
" Distinct count | \n",
" 12739 |
\n",
" Unique (%) | \n",
" 33.2% |
\n",
" Missing (%) | \n",
" 19.0% |
\n",
" Missing (n) | \n",
" 7315 |
\n",
" Infinite (%) | \n",
" 0.0% |
\n",
" Infinite (n) | \n",
" 0 |
\n",
"
\n",
"\n",
"
\n",
"
\n",
"
\n",
"\n",
" Mean | \n",
" -39.107 |
\n",
" Minimum | \n",
" -87.367 |
\n",
" Maximum | \n",
" 81.167 |
\n",
" Zeros (%) | \n",
" 14.1% |
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
Quantile statistics
\n",
"
\n",
" Minimum | \n",
" -87.367 |
\n",
" 5-th percentile | \n",
" -84.355 |
\n",
" Q1 | \n",
" -76.714 |
\n",
" Median | \n",
" -71.529 |
\n",
" Q3 | \n",
" -0.16289 |
\n",
" 95-th percentile | \n",
" 34.494 |
\n",
" Maximum | \n",
" 81.167 |
\n",
" Range | \n",
" 168.53 |
\n",
" Interquartile range | \n",
" 76.551 |
\n",
"
\n",
"
Descriptive statistics
\n",
"
\n",
" Standard deviation | \n",
" 46.386 |
\n",
" Coef of variation | \n",
" -1.1861 |
\n",
" Kurtosis | \n",
" -1.4768 |
\n",
" Mean | \n",
" -39.107 |
\n",
" MAD | \n",
" 43.937 |
\n",
" Skewness | \n",
" 0.4913 |
\n",
" Sum | \n",
" -1502100 |
\n",
" Variance | \n",
" 2151.7 |
\n",
" Memory size | \n",
" 0.0 B |
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
reclat_city
Highly correlated
\n",
"
\n",
"\n",
"
\n",
"
This variable is highly correlated with reclat
and should be ignored for analysis
\n",
"\n",
"
\n",
"
\n",
"
\n",
" Correlation | \n",
" 0.99423 |
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
reclong
Numeric
\n",
"
\n",
"\n",
"
\n",
"
\n",
"
\n",
"
\n",
" Distinct count | \n",
" 14641 |
\n",
" Unique (%) | \n",
" 38.1% |
\n",
" Missing (%) | \n",
" 19.0% |
\n",
" Missing (n) | \n",
" 7315 |
\n",
" Infinite (%) | \n",
" 0.0% |
\n",
" Infinite (n) | \n",
" 0 |
\n",
"
\n",
"\n",
"
\n",
"
\n",
"
\n",
"\n",
" Mean | \n",
" 61.053 |
\n",
" Minimum | \n",
" -165.43 |
\n",
" Maximum | \n",
" 354.47 |
\n",
" Zeros (%) | \n",
" 13.6% |
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
Quantile statistics
\n",
"
\n",
" Minimum | \n",
" -165.43 |
\n",
" 5-th percentile | \n",
" -90.466 |
\n",
" Q1 | \n",
" -0.0024196 |
\n",
" Median | \n",
" 35.666 |
\n",
" Q3 | \n",
" 157.17 |
\n",
" 95-th percentile | \n",
" 167.72 |
\n",
" Maximum | \n",
" 354.47 |
\n",
" Range | \n",
" 519.91 |
\n",
" Interquartile range | \n",
" 157.17 |
\n",
"
\n",
"
Descriptive statistics
\n",
"
\n",
" Standard deviation | \n",
" 80.655 |
\n",
" Coef of variation | \n",
" 1.3211 |
\n",
" Kurtosis | \n",
" -0.73145 |
\n",
" Mean | \n",
" 61.053 |
\n",
" MAD | \n",
" 67.606 |
\n",
" Skewness | \n",
" -0.17437 |
\n",
" Sum | \n",
" 2345100 |
\n",
" Variance | \n",
" 6505.3 |
\n",
" Memory size | \n",
" 0.0 B |
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
source
Constant
\n",
"
\n",
"\n",
"
\n",
"
This variable is constant and should be ignored for analysis
\n",
"\n",
"
\n",
"
\n",
"
\n",
" Constant value | \n",
" NASA |
\n",
"
\n",
"
\n",
"
\n",
"
year
Date
\n",
"\n",
"
\n",
"
\n",
" Distinct count | \n",
" 244 |
\n",
" Unique (%) | \n",
" 0.5% |
\n",
" Missing (%) | \n",
" 0.7% |
\n",
" Missing (n) | \n",
" 312 |
\n",
" Infinite (%) | \n",
" 0.0% |
\n",
" Infinite (n) | \n",
" 0 |
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" Minimum | \n",
" 1688-01-01 |
\n",
" Maximum | \n",
" 2101-01-01 |
\n",
"
\n",
"
\n",
"\n",
"\n",
"
\n",
"\n",
"
\n",
"
Sample
\n",
" \n",
"\n",
" \n",
"
\n",
"
\n",
"
\n",
" \n",
" \n",
" | \n",
" name | \n",
" id | \n",
" nametype | \n",
" recclass | \n",
" mass_g | \n",
" fall | \n",
" reclat | \n",
" reclong | \n",
" GeoLocation | \n",
" source | \n",
" reclat_city | \n",
" year | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Aachen | \n",
" 1 | \n",
" Valid | \n",
" L5 | \n",
" 21.0 | \n",
" Fell | \n",
" 50.77500 | \n",
" 6.08333 | \n",
" (50.775000, 6.083330) | \n",
" NASA | \n",
" 44.917728 | \n",
" 1880-01-01 | \n",
"
\n",
" \n",
" 1 | \n",
" Aarhus | \n",
" 2 | \n",
" Valid | \n",
" H6 | \n",
" 720.0 | \n",
" Fell | \n",
" 56.18333 | \n",
" 10.23333 | \n",
" (56.183330, 10.233330) | \n",
" NASA | \n",
" 58.489277 | \n",
" 1951-01-01 | \n",
"
\n",
" \n",
" 2 | \n",
" Abee | \n",
" 6 | \n",
" Valid | \n",
" EH4 | \n",
" 107000.0 | \n",
" Fell | \n",
" 54.21667 | \n",
" -113.00000 | \n",
" (54.216670, -113.000000) | \n",
" NASA | \n",
" 53.753995 | \n",
" 1952-01-01 | \n",
"
\n",
" \n",
" 3 | \n",
" Acapulco | \n",
" 10 | \n",
" Valid | \n",
" Acapulcoite | \n",
" 1914.0 | \n",
" Fell | \n",
" 16.88333 | \n",
" -99.90000 | \n",
" (16.883330, -99.900000) | \n",
" NASA | \n",
" 17.311136 | \n",
" 1976-01-01 | \n",
"
\n",
" \n",
" 4 | \n",
" Achiras | \n",
" 370 | \n",
" Valid | \n",
" L6 | \n",
" 780.0 | \n",
" Fell | \n",
" -33.16667 | \n",
" -64.95000 | \n",
" (-33.166670, -64.950000) | \n",
" NASA | \n",
" -29.350844 | \n",
" 1902-01-01 | \n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"\n",
"
\n",
"\n"
],
"text/plain": [
""
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"report"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Save report to file"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"report.to_file(\"/tmp/example.html\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.1"
}
},
"nbformat": 4,
"nbformat_minor": 0
}