Pandas Profiling Report

Dataset statistics

Number of variables	4
Number of observations	1393570
Missing cells	459028
Missing cells (%)	8.2%
Duplicate rows	0
Duplicate rows (%)	0.0%
Total size in memory	42.5 MiB
Average record size in memory	32.0 B
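
The missing-cell percentage above appears to be taken over all cells (variables × observations) rather than over rows. A minimal sketch of that arithmetic, assuming this interpretation and using only the figures reported in this section (the variable names are illustrative):

```python
# Figures copied from the "Dataset statistics" table above.
n_variables = 4
n_observations = 1_393_570
missing_cells = 459_028

# Assumption: the report divides missing cells by the total number of cells.
total_cells = n_variables * n_observations
missing_cells_pct = 100 * missing_cells / total_cells

# The same count divided by the number of rows gives a per-column missing rate.
missing_per_row_pct = 100 * missing_cells / n_observations

print(f"total cells: {total_cells}")
print(f"missing cells: {missing_cells_pct:.1f}%")
print(f"missing relative to rows: {missing_per_row_pct:.1f}%")
```

The per-row figure agrees with the 32.9% missing rate that the report gives for `price`, which is consistent with `price` being the only column that contains missing values.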

Variable types

Numeric	1
Categorical	2
Boolean	1

Warnings

`date` has a high cardinality: 365 distinct values	High cardinality
`price` has a high cardinality: 669 distinct values	High cardinality
`price` has 459028 (32.9%) missing values	Missing
`date` is uniformly distributed	Uniform

Reproduction

Analysis started	2021-10-01 03:59:22.354937
Analysis finished	2021-10-01 03:59:44.050034
Duration	21.7 seconds
Software version	pandas-profiling v3.0.0
Download configuration	config.json

listing_id
Real number (ℝ_≥0)

Distinct	3818
Distinct (%)	0.3%
Missing	0
Missing (%)	0.0%
Infinite	0
Infinite (%)	0.0%
Mean	5550111.419

Minimum	3335
Maximum	10340165
Zeros	0
Zeros (%)	0.0%
Negative	0
Negative (%)	0.0%
Memory size	10.6 MiB

Quantile statistics

Minimum	3335
5-th percentile	430453
Q1	3258213
median	6118244.5
Q3	8035212
95-th percentile	9666446
Maximum	10340165
Range	10336830
Interquartile range (IQR)	4776999

Descriptive statistics

Standard deviation	2962273.53
Coefficient of variation (CV)	0.5337322635
Kurtosis	-1.104322694
Mean	5550111.419
Median Absolute Deviation (MAD)	2287820
Skewness	-0.3096605895
Sum	7.73446877 × 10¹²
Variance	8.775064467 × 10¹²
Monotonicity	Not monotonic
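
Several of the `listing_id` statistics above follow from others in the same tables. The sketch below recomputes them from the reported (rounded) figures as a consistency check; it assumes the usual definitions (range = max − min, IQR = Q3 − Q1, CV = standard deviation / mean, sum = mean × count, variance = standard deviation squared), and the variable names are illustrative.

```python
# Figures copied from the listing_id statistics above.
minimum, maximum = 3335, 10340165
q1, q3 = 3258213, 8035212
mean = 5550111.419
std = 2962273.53
n_observations = 1_393_570

value_range = maximum - minimum   # Range
iqr = q3 - q1                     # Interquartile range (IQR)
cv = std / mean                   # Coefficient of variation (CV)
total = mean * n_observations     # Sum, approximate because mean is rounded
variance = std ** 2               # Variance, approximate because std is rounded

print(value_range, iqr)
print(f"CV ≈ {cv:.6f}, sum ≈ {total:.4e}, variance ≈ {variance:.4e}")
```

Because the inputs are rounded, the derived sum and variance agree with the reported values only to the displayed precision.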

Histogram with fixed size bins (bins=50)

Common values

Value	Count	Frequency (%)
241032	365	< 0.1%
9299824	365	< 0.1%
8597687	365	< 0.1%
2309250	365	< 0.1%
7420339	365	< 0.1%
1742425	365	< 0.1%
4559985	365	< 0.1%
6304139	365	< 0.1%
2610187	365	< 0.1%
8508341	365	< 0.1%
Other values (3808)	1389920	99.7%

Minimum 5 values

Value	Count	Frequency (%)
3335	365	< 0.1%
4291	365	< 0.1%
5682	365	< 0.1%
6606	365	< 0.1%
7369	365	< 0.1%
9419	365	< 0.1%
9460	365	< 0.1%
9531	365	< 0.1%
9534	365	< 0.1%
9596	365	< 0.1%

Maximum 5 values

Value	Count	Frequency (%)
10340165	365	< 0.1%
10339145	365	< 0.1%
10339144	365	< 0.1%
10334184	365	< 0.1%
10332096	365	< 0.1%
10331249	365	< 0.1%
10319529	365	< 0.1%
10318171	365	< 0.1%
10310373	365	< 0.1%
10309898	365	< 0.1%

date
Categorical

HIGH CARDINALITY
UNIFORM

Distinct	365
Distinct (%)	< 0.1%
Missing	0
Missing (%)	0.0%
Memory size	10.6 MiB

2016-01-04	3818
2016-09-11	3818
2016-09-09	3818
2016-09-08	3818
2016-09-07	3818
Other values (360)	1374480

Length

Max length	10
Median length	10
Mean length	10
Min length	10

Characters and Unicode

Total characters	13935700
Distinct characters	11
Distinct categories	2
Distinct scripts	1
Distinct blocks	1

The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Unique

Unique	0
Unique (%)	0.0%

Sample

1st row	2016-01-04
2nd row	2016-01-05
3rd row	2016-01-06
4th row	2016-01-07
5th row	2016-01-08

Common Values

Value	Count	Frequency (%)
2016-01-04	3818	0.3%
2016-09-11	3818	0.3%
2016-09-09	3818	0.3%
2016-09-08	3818	0.3%
2016-09-07	3818	0.3%
2016-09-06	3818	0.3%
2016-09-05	3818	0.3%
2016-09-04	3818	0.3%
2016-09-03	3818	0.3%
2016-09-02	3818	0.3%
Other values (355)	1355390	97.3%

Length

Histogram of lengths of the category

Most occurring characters

Value	Count	Frequency (%)
0	3096398	22.2%
-	2787140	20.0%
1	2596240	18.6%
2	2218258	15.9%
6	1637922	11.8%
3	320712	2.3%
7	263442	1.9%
5	255806	1.8%
8	255806	1.8%
4	251988	1.8%

Most occurring categories

Value	Count	Frequency (%)
Decimal Number	11148560	80.0%
Dash Punctuation	2787140	20.0%

Most frequent character per category

Decimal Number

Value	Count	Frequency (%)
0	3096398	27.8%
1	2596240	23.3%
2	2218258	19.9%
6	1637922	14.7%
3	320712	2.9%
7	263442	2.4%
5	255806	2.3%
8	255806	2.3%
4	251988	2.3%
9	251988	2.3%

Dash Punctuation

Value	Count	Frequency (%)
-	2787140	100.0%

Most occurring scripts

Value	Count	Frequency (%)
Common	13935700	100.0%

Most frequent character per script

Common

Value	Count	Frequency (%)
0	3096398	22.2%
-	2787140	20.0%
1	2596240	18.6%
2	2218258	15.9%
6	1637922	11.8%
3	320712	2.3%
7	263442	1.9%
5	255806	1.8%
8	255806	1.8%
4	251988	1.8%

Most occurring blocks

Value	Count	Frequency (%)
ASCII	13935700	100.0%

Most frequent character per block

ASCII

Value	Count	Frequency (%)
0	3096398	22.2%
-	2787140	20.0%
1	2596240	18.6%
2	2218258	15.9%
6	1637922	11.8%
3	320712	2.3%
7	263442	1.9%
5	255806	1.8%
8	255806	1.8%
4	251988	1.8%

available
Boolean

Distinct	2
Distinct (%)	< 0.1%
Missing	0
Missing (%)	0.0%
Memory size	1.3 MiB

True	934542
False	459028

Common Values

Value	Count	Frequency (%)
True	934542	67.1%
False	459028	32.9%

price
Categorical

HIGH CARDINALITY
MISSING

Distinct	669
Distinct (%)	0.1%
Missing	459028
Missing (%)	32.9%
Memory size	10.6 MiB

$150.00	36646
$100.00	31755
$75.00	29820
$125.00	27538
$65.00	26415
Other values (664)	782368

Length

Max length	9
Median length	7
Mean length	6.555124328
Min length	6

Characters and Unicode

Total characters	6126039
Distinct characters	13
Distinct categories	3
Distinct scripts	1
Distinct blocks	1

Unique

Unique	66
Unique (%)	< 0.1%

Sample

1st row	$85.00
2nd row	$85.00
3rd row	$85.00
4th row	$85.00
5th row	$85.00
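
The report treats `price` as Categorical because the values are strings such as `$85.00`, and the character tables show `$`, `.` and `,` alongside the digits. A minimal sketch of one way to turn such strings into numbers before re-profiling, assuming `,` is a thousands separator and missing values arrive as `None` or NaN; the helper name `parse_price` and the example value `$1,250.00` are hypothetical and do not come from the report.

```python
from decimal import Decimal


def parse_price(text):
    """Convert a price string such as '$85.00' to a Decimal.

    Returns None for missing values (None or a float NaN).
    Assumes '$' is a prefix and ',' is a thousands separator.
    """
    if text is None:
        return None
    if isinstance(text, float) and text != text:  # NaN is not equal to itself
        return None
    cleaned = str(text).strip().replace("$", "").replace(",", "")
    return Decimal(cleaned)


# Illustrative inputs: '$85.00' appears in the sample; '$1,250.00' is made up.
for raw in ["$85.00", "$1,250.00", float("nan")]:
    print(repr(raw), "->", parse_price(raw))
```

`Decimal` is used here to avoid binary floating-point rounding on currency values; converting to `float` instead would be a reasonable choice if only summary statistics are needed.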

Common Values

Value	Count	Frequency (%)
$150.00	36646	2.6%
$100.00	31755	2.3%
$75.00	29820	2.1%
$125.00	27538	2.0%
$65.00	26415	1.9%
$90.00	24942	1.8%
$95.00	24327	1.7%
$99.00	23629	1.7%
$85.00	23455	1.7%
$80.00	19817	1.4%
Other values (659)	666198	47.8%
(Missing)	459028	32.9%

Length

Histogram of lengths of the category

Value	Count	Frequency (%)
150.00	36646	3.9%
100.00	31755	3.4%
75.00	29820	3.2%
125.00	27538	2.9%
65.00	26415	2.8%
90.00	24942	2.7%
95.00	24327	2.6%
99.00	23629	2.5%
85.00	23455	2.5%
80.00	19817	2.1%
Other values (659)	666198	71.3%

Most occurring characters

Value	Count	Frequency (%)
0	2298464	37.5%
$	934542	15.3%
.	934542	15.3%
1	431978	7.1%
5	427863	7.0%
9	263820	4.3%
2	216940	3.5%
7	138000	2.3%
8	128196	2.1%
6	119875	2.0%
Other values (3)	231819	3.8%

Most occurring categories

Value	Count	Frequency (%)
Decimal Number	4256258	69.5%
Other Punctuation	935239	15.3%
Currency Symbol	934542	15.3%

Most frequent character per category

Decimal Number

Value	Count	Frequency (%)
0	2298464	54.0%
1	431978	10.1%
5	427863	10.1%
9	263820	6.2%
2	216940	5.1%
7	138000	3.2%
8	128196	3.0%
6	119875	2.8%
4	119803	2.8%
3	111319	2.6%

Other Punctuation

Value	Count	Frequency (%)
.	934542	99.9%
,	697	0.1%

Currency Symbol

Value	Count	Frequency (%)
$	934542	100.0%

Most occurring scripts

Value	Count	Frequency (%)
Common	6126039	100.0%

Most frequent character per script

Common

Value	Count	Frequency (%)
0	2298464	37.5%
$	934542	15.3%
.	934542	15.3%
1	431978	7.1%
5	427863	7.0%
9	263820	4.3%
2	216940	3.5%
7	138000	2.3%
8	128196	2.1%
6	119875	2.0%
Other values (3)	231819	3.8%

Most occurring blocks

Value	Count	Frequency (%)
ASCII	6126039	100.0%

Most frequent character per block

ASCII

Value	Count	Frequency (%)
0	2298464	37.5%
$	934542	15.3%
.	934542	15.3%
1	431978	7.1%
5	427863	7.0%
9	263820	4.3%
2	216940	3.5%
7	138000	2.3%
8	128196	2.1%
6	119875	2.0%
Other values (3)	231819	3.8%

Interactions

Interaction plot: listing_id (x-axis) against listing_id (y-axis)

Pearson's r

Pearson's correlation coefficient (r) is a measure of linear correlation between two variables. Its value lies between -1 and +1, with -1 indicating total negative linear correlation, 0 indicating no linear correlation, and +1 indicating total positive linear correlation. Furthermore, r is invariant under separate changes in location and scale of the two variables, implying that for a linear function the angle to the x-axis does not affect r.

To calculate r for two variables X and Y, one divides the covariance of X and Y by the product of their standard deviations.
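
The calculation described above can be sketched in plain Python. This is an illustrative implementation under the stated definition (population covariance divided by the product of population standard deviations), not the code the report itself runs; the function name `pearson_r` is an assumption.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r: covariance of X and Y over the product of their standard deviations."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("xs and ys must have the same length, at least 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    if std_x == 0 or std_y == 0:
        raise ValueError("r is undefined when a variable is constant")
    return cov / (std_x * std_y)


xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = pearson_r(xs, ys)

# Shifting and rescaling xs (with a positive factor) leaves r unchanged.
r_rescaled = pearson_r([3 * x + 7 for x in xs], ys)
print(r, r_rescaled)
```

The last two lines illustrate the invariance under changes in location and scale mentioned above.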

Spearman's ρ

Spearman's rank correlation coefficient (ρ) is a measure of monotonic correlation between two variables, and is therefore better at catching nonlinear monotonic correlations than Pearson's r. Its value lies between -1 and +1, with -1 indicating total negative monotonic correlation, 0 indicating no monotonic correlation, and +1 indicating total positive monotonic correlation.

To calculate ρ for two variables X and Y, one divides the covariance of the rank variables of X and Y by the product of their standard deviations.
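
As a sketch of the rank-based calculation described above: replace each value by its rank, then apply the Pearson formula to the ranks. The helper names are illustrative, and assigning tied values the average of their ranks is an assumption (a common convention) that the text above does not spell out.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def _pearson(xs, ys):
    """Covariance over the product of standard deviations (population form)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov / (sx * sy)


def spearman_rho(xs, ys):
    """Spearman's ρ: Pearson correlation of the rank variables."""
    return _pearson(average_ranks(xs), average_ranks(ys))


# A nonlinear but strictly increasing relationship: ρ is (numerically) 1.
xs = [1, 2, 3, 4, 5]
ys = [x ** 3 for x in xs]
print(spearman_rho(xs, ys))
```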

Kendall's τ

Similarly to Spearman's rank correlation coefficient, the Kendall rank correlation coefficient (τ) measures ordinal association between two variables. Its value lies between -1 and +1, with -1 indicating total negative correlation, 0 indicating no correlation, and +1 indicating total positive correlation.

To calculate τ for two variables X and Y, one determines the number of concordant and discordant pairs of observations. τ is given by the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs.
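
The pair-counting definition above can be written directly as a short function. This sketch implements that simple form, in which tied pairs count as neither concordant nor discordant; statistical libraries often use a tie-corrected variant instead, so results may differ on data with ties. The function name `kendall_tau` is illustrative.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """(concordant pairs - discordant pairs) / total number of pairs."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("xs and ys must have the same length, at least 2")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1   # both variables move in the same direction
        elif product < 0:
            discordant += 1   # the variables move in opposite directions
        # product == 0 means a tie in x or y; it counts toward neither
    n = len(xs)
    total_pairs = n * (n - 1) // 2
    return (concordant - discordant) / total_pairs


# Three observations: two concordant pairs and one discordant pair.
tau = kendall_tau([1, 2, 3], [1, 3, 2])
print(tau)
```

The pairwise loop is quadratic in the number of observations, so this form is intended for illustration rather than for a dataset of the size profiled here.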

Phik (φk)

Phik (φk) is a new and practical correlation coefficient that works consistently between categorical, ordinal and interval variables, captures non-linear dependency, and reverts to the Pearson correlation coefficient in the case of a bivariate normal input distribution. Extensive documentation is available from the Phik project.

A simple visualization of nullity by column.

The nullity matrix is a data-dense display that lets you quickly pick out visual patterns in data completion.

The dendrogram allows you to more fully correlate variable completion, revealing trends deeper than the pairwise ones visible in the correlation heatmap.

First rows

	listing_id	date	available	price
0	241032	2016-01-04	t	$85.00
1	241032	2016-01-05	t	$85.00
2	241032	2016-01-06	f	NaN
3	241032	2016-01-07	f	NaN
4	241032	2016-01-08	f	NaN
5	241032	2016-01-09	f	NaN
6	241032	2016-01-10	f	NaN
7	241032	2016-01-11	f	NaN
8	241032	2016-01-12	f	NaN
9	241032	2016-01-13	t	$85.00

Last rows

	listing_id	date	available	price
1393560	10208623	2016-12-24	f	NaN
1393561	10208623	2016-12-25	f	NaN
1393562	10208623	2016-12-26	f	NaN
1393563	10208623	2016-12-27	f	NaN
1393564	10208623	2016-12-28	f	NaN
1393565	10208623	2016-12-29	f	NaN
1393566	10208623	2016-12-30	f	NaN
1393567	10208623	2016-12-31	f	NaN
1393568	10208623	2017-01-01	f	NaN
1393569	10208623	2017-01-02	f	NaN
