// Benjamin Skinner
// LPO 9951: PhD Student Research Practicum
// Fall 2015

// QUICK NOTE RE: QUANTILE CUTS USING -egen- WITH -cut-

// When discussing -egen- with -cut- in class the other day, I bumbled around
// an explanation regarding the different types of cuts one can make to
// continuous data. Hopefully this quick note will clear things up.

// header stuff
clear all
set more off
set seed 316810

// generate fake data
set obs 1000
gen loginc = rnormal(10,1)
gen inc = exp(loginc)

// look at distribution of fake income variables
sum loginc inc
hist loginc, name(h_loginc)
hist inc, name(h_inc)
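
// For a numeric check on what the histograms show (an optional sketch, not
// part of what we ran in class), compare the mean, median, and skewness of
// each variable. A mean well above the median and a large positive skewness
// point to a long right tail.
tabstat loginc inc, statistics(mean median skewness)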

// Notice the right skew of income when it isn't logged? Let's say that we
// want to categorize our income values into four groups. There are two general
// ways to do this, and they rest on qualitatively different ideas about how
// those groups should be created.

// (1) Bins that hold equal numbers of observations, with cut points taken
//     from the data at hand
// (2) Bins that may hold unequal numbers of observations, with cut points
//     derived from theory or from a population

// Neither is right or wrong in general. Which one is appropriate depends on
// the needs of your analysis. As I'll show below, the method used to cut the
// income variable substantively changes how individual values are coded.

// (1) Equally-sized bins ------------------------------------------------------

// create equal groups; show
egen inc_q = cut(inc), group(4) icodes
table inc_q

// This is what we did in class. Stata divides the distribution of income
// into four groups with (roughly) equal numbers of observations by setting
// the cuts at the sample 25th, 50th, and 75th percentiles, and codes the
// groups 0 through 3. In this scenario, the data are assumed fixed and the
// cut points are determined by the values seen in the data.
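
// To see those cut points for yourself, here is a quick sketch (not something
// we ran in class): tabstat lists the count and income range within each
// group, and _pctile returns the sample quartiles in r(r1)-r(r3). The group
// boundaries should sit at or very near those quartiles.
tabstat inc, by(inc_q) statistics(n min max)
_pctile inc, nquantiles(4)
display "quartiles of inc: " r(r1) ", " r(r2) ", " r(r3)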

// (2) Unequally-sized bins ----------------------------------------------------

// create unequal groups based on set cuts; show
egen inc_q2 = cut(inc), at(0,25000,50000,75000,1000000000) icodes
table inc_q2

// This approach is different. Instead of fixing the data and allowing the cut
// points to change, we fix the cut points and require the data to bin
// accordingly. Notice that this time we had to supply n + 1 cut points, where
// n = # of groups. This requires some knowledge about the range of the
// variable: I know that no values are below zero or above 1 billion. If you
// are not careful, Stata will return missing values for observations that
// fall below the first cut point or at or above the last one.
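
// A minimal sketch of that pitfall. The variable name inc_q2_low and the top
// cut of 100,000 are made up for illustration. With the top cut set too low,
// every income at or above it should come back missing, so the two counts
// below should match.
egen inc_q2_low = cut(inc), at(0,25000,50000,75000,100000) icodes
count if missing(inc_q2_low)
count if inc >= 100000 & !missing(inc)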

// Why do it this second way? Perhaps your cut points are fixed by a theoretical
// framework. Or maybe you need to align the income categories in this dataset
// with those found in another. In both cases, it makes more sense to treat the
// cut points as fixed and the data as realizations of a random variable that
// should be binned accordingly.

// compare binning through cross table -----------------------------------------

// cross-table
table inc_q inc_q2

// Notice how we get very different categorizations. If the bins were the same,
// we would see a diagonal table: 250 in each diagonal cell and nothing off the
// diagonal. This would mean that all income values coded as 0 using the first
// method would be 0 using the second method, and likewise for the other codes.

// Instead, we have very different codings. All the values coded 0 by the first
// method are also coded 0 by the second, but the second method also codes all
// formerly 1 values as 0. (NB: your counts may differ slightly from mine.
// Setting the seed at the top should make the draws reproducible, but
// differences can remain, for example across Stata versions that use
// different random-number generators.) The two upper quantile groups, 2 and
// 3, are each split across more than one of the fixed-cut categories.
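
// One way to put a number on the disagreement (a sketch, not part of the
// class example): count the observations whose codes differ across the two
// methods, then add row percentages to the cross-tab to see how each
// quantile group spreads over the fixed-cut categories.
count if inc_q != inc_q2
tabulate inc_q inc_q2, row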

// FINAL THOUGHT

// Both choices make sense within their respective contexts. The takeaway here
// is that in both situations you have grouped a continuous variable into four
// categories, but the grouping means something quite different in each case.
// Just be clear about what you are doing in your documentation/labeling.
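
// As a sketch of that labeling (the label name inc_q2_lbl and the label text
// are only suggestions), variable and value labels can record which method
// produced each grouping:
label variable inc_q "Income quartile group (sample-based cut points)"
label variable inc_q2 "Income bracket (fixed cut points)"
label define inc_q2_lbl 0 "0 to <25,000" 1 "25,000 to <50,000" ///
    2 "50,000 to <75,000" 3 "75,000 and up"
label values inc_q2 inc_q2_lbl
tabulate inc_q2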

exit