# spark-df-profiling Meteorites example

Source of data: https://data.nasa.gov/Space-Science/Meteorite-Landings/gh4g-9sfh

I converted the downloaded CSV to a [Parquet](https://parquet.apache.org/) table beforehand, but the file format doesn't matter: as long as your data is loaded into a Spark DataFrame, you are good to go.

## Import library

In [1]:
import spark_df_profiling

## Create the DataFrame

In [2]:
df = sqlContext.read.parquet("/Users/Julio/Downloads/Meteorite_Landings.parquet").cache()

df

DataFrame[name: string, id: bigint, nametype: string, recclass: string, mass_g: double, fall: string, reclat: double, reclong: double, GeoLocation: string, source: string, reclat_city: double, year: date]

Spark DataFrames have a built-in `.describe()` method. Let's see what it shows:

In [3]:
df.describe().show()

+-------+------------------+------+---------+----------+-------------------+
|summary|                id|mass_g|   reclat|   reclong|        reclat_city|
+-------+------------------+------+---------+----------+-------------------+
|  count|             45726| 45726|    45726|     45726|              45726|
|   mean|26883.906202160695|   NaN|      NaN|       NaN|                NaN|
| stddev| 16863.44556599258|   NaN|      NaN|       NaN|                NaN|
|    min|                 1|   0.0|-87.36667|-165.43333|-103.79172917787167|
|    max|             57458|   NaN|      NaN|       NaN|                NaN|
+-------+------------------+------+---------+----------+-------------------+
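The `NaN` entries in the `mean`, `stddev` and `max` rows are what you would expect if these columns store missing values as `NaN` rather than `null`. In that case `describe()` counts them as present (hence 45726 everywhere), arithmetic involving `NaN` yields `NaN`, and Spark SQL orders `NaN` above every other number, so it wins `max` but never `min`. The plain-Python sketch below mimics that behaviour on a few made-up values; `spark_like_key` is a hypothetical helper written for this illustration, not part of Spark or `spark_df_profiling`.

```python
import math


def spark_like_key(x):
    # Hypothetical sort key: NaN compares greater than any other float,
    # which is how Spark SQL orders NaN values.
    is_nan = math.isnan(x)
    return (is_nan, 0.0 if is_nan else x)


# Made-up sample: three real masses and one NaN standing in for a missing value.
values = [0.0, 21.0, float("nan"), 720.0]

count = len(values)                          # NaN is not null, so it is counted
lowest = min(values, key=spark_like_key)     # a real number
highest = max(values, key=spark_like_key)    # NaN
mean = sum(values) / count                   # NaN propagates through the sum

# Dropping NaN first gives usable statistics.
finite = [v for v in values if not math.isnan(v)]
finite_mean = sum(finite) / len(finite)

print(count, lowest, highest, mean, finite_mean)
```

Filtering or replacing the `NaN` values before calling `describe()` would give finite statistics, much like `finite_mean` here.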



## Generate the report

Now let's use `spark_df_profiling`:

In [4]:
report = spark_df_profiling.ProfileReport(df)

In [5]:
report

**Dataset info**

| Statistic | Value |
|---|---|
| Number of variables | 12 |
| Number of observations | 45726 |
| Total Missing (%) | 4.1% |
| Total size in memory | 0.0 B |
| Average record size in memory | 0.0 B |
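As a rough check on the 4.1% total, the sketch below sums the per-variable missing counts that appear further down in this report and divides by the total number of cells. It assumes the figure is the share of missing cells over all cells, which the report itself does not state. The mapping of counts to column names is inferred from the values in each block, and the two rejected variables show no missing count, so they are left out.

```python
# Counts copied from the report output in this notebook.
n_rows = 45726
n_vars = 12

# Column names are inferred from the values shown in each block (assumption).
missing_per_variable = {
    "GeoLocation": 7315,
    "mass_g": 131,
    "reclat": 7315,
    "reclong": 7315,
    "year": 312,
}

total_missing = sum(missing_per_variable.values())
total_cells = n_rows * n_vars
total_missing_pct = 100.0 * total_missing / total_cells

print(f"{total_missing} of {total_cells} cells missing = {total_missing_pct:.1f}%")
```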

**Variables types**

| Type | Count |
|---|---|
| Numeric | 4 |
| Categorical | 4 |
| Date | 1 |
| Text (Unique) | 1 |
| Rejected | 2 |

**GeoLocation** (Categorical)

| Statistic | Value |
|---|---|
| Distinct count | 17100 |
| Unique (%) | 44.5% |
| Missing (%) | 19.0% |
| Missing (n) | 7315 |
| Infinite (%) | 0.0% |
| Infinite (n) | 0 |
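The 19.0% missing figure can look odd next to 7315 missing rows out of 45726, which is about 16%. The numbers are consistent with both percentages being divided by the count of non-missing rows instead of all rows. This is only an observation from the values above, not documented behaviour of `spark_df_profiling`, and the short calculation below is a sketch under that assumption.

```python
# Values copied from the summary table above.
n_rows = 45726
n_missing = 7315
n_distinct = 17100

n_present = n_rows - n_missing  # rows that actually have a value

# Assumed denominators: non-missing rows reproduce the reported figures.
unique_pct = 100.0 * n_distinct / n_present
missing_pct = 100.0 * n_missing / n_present

# For comparison: the share of missing rows over all rows.
missing_pct_all_rows = 100.0 * n_missing / n_rows

print(f"unique: {unique_pct:.1f}%, missing: {missing_pct:.1f}%, "
      f"missing over all rows: {missing_pct_all_rows:.1f}%")
```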

| Value | Count |
|---|---|
| (0.000000, 0.000000) | 6214 |
| (-71.500000, 35.666670) | 4761 |
| (-84.000000, 168.000000) | 3040 |
| Other values (17097) | 24396 |
| (Missing) | 7315 |

| Value | Count | Frequency (%) |
|---|---|---|
| (0.000000, 0.000000) | 6214 | 13.6% |
| (-71.500000, 35.666670) | 4761 | 10.4% |
| (-84.000000, 168.000000) | 3040 | 6.6% |
| (-72.000000, 26.000000) | 1505 | 3.3% |
| (-79.683330, 159.750000) | 657 | 1.4% |
| (-76.716670, 159.666670) | 637 | 1.4% |
| (-76.183330, 157.166670) | 539 | 1.2% |
| (-79.683330, 155.750000) | 473 | 1.0% |
| (-84.216670, 160.500000) | 263 | 0.6% |
| (-86.366670, -70.000000) | 226 | 0.5% |

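**fall** (Categorical)

| Statistic | Value |
|---|---|
| Distinct count | 2 |
| Unique (%) | 0.0% |
| Missing (%) | 0.0% |
| Missing (n) | 0 |
| Infinite (%) | 0.0% |
| Infinite (n) | 0 |

| Value | Count |
|---|---|
| Found | 44609 |
| Fell | 1117 |

| Value | Count | Frequency (%) |
|---|---|---|
| Found | 44609 | 97.6% |
| Fell | 1117 | 2.4% |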

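**id** (Numeric)

| Statistic | Value |
|---|---|
| Distinct count | 45716 |
| Unique (%) | 100.0% |
| Missing (%) | 0.0% |
| Missing (n) | 0 |
| Infinite (%) | 0.0% |
| Infinite (n) | 0 |

| Statistic | Value |
|---|---|
| Mean | 26884 |
| Minimum | 1 |
| Maximum | 57458 |
| Zeros (%) | 0.0% |

Quantile statistics

| Statistic | Value |
|---|---|
| Minimum | 1.0 |
| 5-th percentile | 2388.8 |
| Q1 | 12681.0 |
| Median | 24256.0 |
| Q3 | 40654.0 |
| 95-th percentile | 54891.0 |
| Maximum | 57458.0 |
| Range | 57457.0 |
| Interquartile range | 27972.0 |

Descriptive statistics

| Statistic | Value |
|---|---|
| Standard deviation | 16863 |
| Coef of variation | 0.62727 |
| Kurtosis | -1.1601 |
| Mean | 26884 |
| MAD | 14490 |
| Skewness | 0.26652 |
| Sum | 1229300000 |
| Variance | 284380000 |
| Memory size | 0.0 B |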

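**mass_g** (Numeric)

| Statistic | Value |
|---|---|
| Distinct count | 12577 |
| Unique (%) | 27.6% |
| Missing (%) | 0.3% |
| Missing (n) | 131 |
| Infinite (%) | 0.0% |
| Infinite (n) | 0 |

| Statistic | Value |
|---|---|
| Mean | 13278 |
| Minimum | 0 |
| Maximum | 60000000 |
| Zeros (%) | 0.0% |

Quantile statistics

| Statistic | Value |
|---|---|
| Minimum | 0.0 |
| 5-th percentile | 1.0978 |
| Q1 | 7.1907 |
| Median | 32.598 |
| Q3 | 202.86 |
| 95-th percentile | 3999.9 |
| Maximum | 60000000.0 |
| Range | 60000000.0 |
| Interquartile range | 195.67 |

Descriptive statistics

| Statistic | Value |
|---|---|
| Standard deviation | 574930 |
| Coef of variation | 43.298 |
| Kurtosis | 6797.7 |
| Mean | 13278 |
| MAD | 25113 |
| Skewness | 76.916 |
| Sum | 605430000 |
| Variance | 330540000000 |
| Memory size | 0.0 B |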

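**name** (Text, unique)

First 3 values: Abee, Asco, Aleppo

Last 3 values: Allende, Alessandria, Akaba

First 10 values

| Index | Value |
|---|---|
| 1 | Abee |
| 2 | Asco |
| 3 | Aleppo |
| 4 | Al Rais |
| 5 | Arbol Solo |
| 6 | Ash Creek |
| 7 | Northwest Africa 5815 |
| 8 | Anlong |
| 9 | Aomori |
| 10 | Aldsworth |

Last 10 values

| Index | Value |
|---|---|
| 45707 | Alais |
| 45708 | Arroyo Aguiar |
| 45709 | Aguada |
| 45710 | Angra dos Reis (stone) |
| 45711 | Alexandrovsky |
| 45712 | Akwanga |
| 45713 | Alfianello |
| 45714 | Appley Bridge |
| 45715 | Achiras |
| 45716 | Adhi Kot |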

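**nametype** (Categorical)

| Statistic | Value |
|---|---|
| Distinct count | 2 |
| Unique (%) | 0.0% |
| Missing (%) | 0.0% |
| Missing (n) | 0 |
| Infinite (%) | 0.0% |
| Infinite (n) | 0 |

| Value | Count |
|---|---|
| Valid | 45651 |
| Relict | 75 |

| Value | Count | Frequency (%) |
|---|---|---|
| Valid | 45651 | 99.8% |
| Relict | 75 | 0.2% |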

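**recclass** (Categorical)

| Statistic | Value |
|---|---|
| Distinct count | 466 |
| Unique (%) | 1.0% |
| Missing (%) | 0.0% |
| Missing (n) | 0 |
| Infinite (%) | 0.0% |
| Infinite (n) | 0 |

| Value | Count |
|---|---|
| L6 | 8287 |
| H5 | 7143 |
| L5 | 4797 |
| Other values (463) | 25499 |

| Value | Count | Frequency (%) |
|---|---|---|
| L6 | 8287 | 18.1% |
| H5 | 7143 | 15.6% |
| L5 | 4797 | 10.5% |
| H6 | 4529 | 9.9% |
| H4 | 4211 | 9.2% |
| LL5 | 2766 | 6.0% |
| LL6 | 2043 | 4.5% |
| L4 | 1253 | 2.7% |
| H4/5 | 428 | 0.9% |
| CM2 | 416 | 0.9% |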
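The frequency column and the "Other values" row for the meteorite classes above can be reproduced from the raw counts. The sketch below assumes each frequency is the count divided by all 45726 rows, and that "Other values" groups every class outside the top three. Both assumptions match the figures shown, but neither is taken from the library's documentation.

```python
# Counts copied from the class tables above (L6, H5, L5 are the top three).
n_rows = 45726
n_distinct = 466
top_counts = {"L6": 8287, "H5": 7143, "L5": 4797}

# Frequency (%) as count over all rows, rounded like the report.
frequency_pct = {
    name: round(100.0 * count / n_rows, 1) for name, count in top_counts.items()
}

# "Other values" collapses the remaining classes into a single row.
other_distinct = n_distinct - len(top_counts)
other_count = n_rows - sum(top_counts.values())

print(frequency_pct)
print(f"Other values ({other_distinct}): {other_count}")
```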

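**reclat** (Numeric)

| Statistic | Value |
|---|---|
| Distinct count | 12739 |
| Unique (%) | 33.2% |
| Missing (%) | 19.0% |
| Missing (n) | 7315 |
| Infinite (%) | 0.0% |
| Infinite (n) | 0 |

| Statistic | Value |
|---|---|
| Mean | -39.107 |
| Minimum | -87.367 |
| Maximum | 81.167 |
| Zeros (%) | 14.1% |

Quantile statistics

| Statistic | Value |
|---|---|
| Minimum | -87.367 |
| 5-th percentile | -84.355 |
| Q1 | -76.714 |
| Median | -71.529 |
| Q3 | -0.16289 |
| 95-th percentile | 34.494 |
| Maximum | 81.167 |
| Range | 168.53 |
| Interquartile range | 76.551 |

Descriptive statistics

| Statistic | Value |
|---|---|
| Standard deviation | 46.386 |
| Coef of variation | -1.1861 |
| Kurtosis | -1.4768 |
| Mean | -39.107 |
| MAD | 43.937 |
| Skewness | 0.4913 |
| Sum | -1502100 |
| Variance | 2151.7 |
| Memory size | 0.0 B |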
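Several of the latitude figures above are derived from the others using the standard definitions: range is maximum minus minimum, interquartile range is Q3 minus Q1, the coefficient of variation is the standard deviation over the mean, and variance is the standard deviation squared. A small sketch using the rounded values printed in the report, so the results agree only up to rounding:

```python
# Rounded values copied from the latitude statistics above.
minimum, maximum = -87.367, 81.167
q1, q3 = -76.714, -0.16289
mean, std = -39.107, 46.386

value_range = maximum - minimum      # reported as 168.53
iqr = q3 - q1                        # reported as 76.551
coef_of_variation = std / mean       # reported as -1.1861
variance = std ** 2                  # reported as 2151.7

print(f"range={value_range:.2f} iqr={iqr:.3f} "
      f"cv={coef_of_variation:.4f} variance={variance:.1f}")
```

The negative coefficient of variation simply reflects the negative mean latitude; most of the recorded positions are in the southern hemisphere.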

**reclat_city** (Rejected)

| Statistic | Value |
|---|---|
| Correlation | 0.99423 |

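**reclong** (Numeric)

| Statistic | Value |
|---|---|
| Distinct count | 14641 |
| Unique (%) | 38.1% |
| Missing (%) | 19.0% |
| Missing (n) | 7315 |
| Infinite (%) | 0.0% |
| Infinite (n) | 0 |

| Statistic | Value |
|---|---|
| Mean | 61.053 |
| Minimum | -165.43 |
| Maximum | 354.47 |
| Zeros (%) | 13.6% |

Quantile statistics

| Statistic | Value |
|---|---|
| Minimum | -165.43 |
| 5-th percentile | -90.466 |
| Q1 | -0.0024196 |
| Median | 35.666 |
| Q3 | 157.17 |
| 95-th percentile | 167.72 |
| Maximum | 354.47 |
| Range | 519.91 |
| Interquartile range | 157.17 |

Descriptive statistics

| Statistic | Value |
|---|---|
| Standard deviation | 80.655 |
| Coef of variation | 1.3211 |
| Kurtosis | -0.73145 |
| Mean | 61.053 |
| MAD | 67.606 |
| Skewness | -0.17437 |
| Sum | 2345100 |
| Variance | 6505.3 |
| Memory size | 0.0 B |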

**source** (Rejected)

| Statistic | Value |
|---|---|
| Constant value | NASA |

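**year** (Date)

| Statistic | Value |
|---|---|
| Distinct count | 244 |
| Unique (%) | 0.5% |
| Missing (%) | 0.7% |
| Missing (n) | 312 |
| Infinite (%) | 0.0% |
| Infinite (n) | 0 |

| Statistic | Value |
|---|---|
| Minimum | 1688-01-01 |
| Maximum | 2101-01-01 |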

**Sample**

| | name | id | nametype | recclass | mass_g | fall | reclat | reclong | GeoLocation | source | reclat_city | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Aachen | 1 | Valid | L5 | 21.0 | Fell | 50.775 | 6.08333 | (50.775000, 6.083330) | NASA | 44.917728 | 1880-01-01 |
| 1 | Aarhus | 2 | Valid | H6 | 720.0 | Fell | 56.18333 | 10.23333 | (56.183330, 10.233330) | NASA | 58.489277 | 1951-01-01 |
| 2 | Abee | 6 | Valid | EH4 | 107000.0 | Fell | 54.21667 | -113.0 | (54.216670, -113.000000) | NASA | 53.753995 | 1952-01-01 |
| 3 | Acapulco | 10 | Valid | Acapulcoite | 1914.0 | Fell | 16.88333 | -99.9 | (16.883330, -99.900000) | NASA | 17.311136 | 1976-01-01 |
| 4 | Achiras | 370 | Valid | L6 | 780.0 | Fell | -33.16667 | -64.95 | (-33.166670, -64.950000) | NASA | -29.350844 | 1902-01-01 |


## Save report to file

In [6]:
report.to_file("/tmp/example.html")