* Using by
* --------
* One major feature that gquantile adds is `by()`. It should be
* internally consistent if the user specifies the option `strict`.
* For example,
clear
set obs 1000000
gen group = int(runiform() * 100)
gen x = runiform()
local popts pctile strict by(group) cutby
gquantiles pctile = x, `popts' nq(10) genp(perc) groupid(id)
local xopts xtile strict by(group) cutby
gquantiles xtile = x, `xopts' nq(10)
gquantiles xtile2 = x, `xopts' cutpoints(pctile)
gquantiles xtile3 = x, `xopts' cutquantiles(perc)
assert xtile == xtile2
assert xtile == xtile3
* However, there is no requirement for the user to do so:
sysuse auto, clear
gquantiles xtile = price, xtile by(foreign) nq(50)
* Computing many quantiles
* ------------------------
* Stata's _pctile caps the number of quantiles to 1001. pctile uses
* _pctile internally, so to compute more than 1001 percentiles it needs
* to loop over various runs of _pctile in a very inefficient way. This
* inefficiency carries over to xtile because that command uses pctile
* internally. (Presumably this is the reason for the limit in the user-written
* fastxtile).
* The following executes with no errors in a reasonable amount of time
clear
set obs 1000000
gen x = runiform()
_pctile x, nq(1001)
pctile p1 = x, nq(1001)
gquantiles p2 = x, pctile nq(1001)
* However, if you increase `nq` the runtimes become excessive:
drop p*
timer clear
timer on 90
pctile p1 = x, nq(5001)
timer off 90
timer on 10
gquantiles p2 = x, pctile nq(5001)
timer off 10
assert p1 == p2
timer list
* 61 seconds for only 1,00,000 observations! This is in Stata/MP with 8 cores.
* gquantiles scales nicely, by contrast:
drop p*
timer clear
timer on 10
gquantiles p2 = x, pctile nq(`=_N + 1')
timer off 10
clear
set obs 100000000
gen x = runiform() * 100
timer on 20
gquantiles p2 = x, pctile nq(`=_N + 1')
timer off 20
timer list
* That's right, gquantiles computed 100M quantiles for 100M observations in 36
* seconds, faster than pctile could compute 5,000 quantiles for 1M obsevations.
* As a side-note, using mata can afford a massive speedup, obviating the need to
* call C in case gquantiles does not do something you want. Consider:
clear all
timer clear
mata:
void function mata_pctile (string scalar newvar,
string scalar sourcevar,
real scalar nq)
{
real scalar N
real colvector X, quantiles, qpositions, qties, qtiesix, Q, Qties
X = st_data(., sourcevar)
N = rows(X)
_sort(X, 1)
quantiles = ((1::(nq - 1)) * N / nq)
qpositions = ceil(quantiles)
qties = (qpositions :== quantiles)
Q = X[qpositions]
if ( any(qties) ) {
qtiesix = selectindex(qties)
Qties = X[qpositions[qtiesix] :+ 1]
Q[qtiesix] = (Q[qtiesix] + Qties) / 2
}
st_addvar("`:set type'", newvar)
st_store((1::(nq - 1)), newvar, Q)
}
end
set obs 1000000
gen x = runiform()
timer on 80
mata: mata_pctile("p0", "x", 5001)
timer off 80
timer on 90
pctile p1 = x, nq(5001)
timer off 90
timer on 10
gquantiles p2 = x, pctile nq(5001)
timer off 10
assert p0 == p1
assert p1 == p2
timer list
* Just by using mata we speeded up Stata 50 times! The mata solution does not
* scale as well as gquantiles, however:
clear
timer clear
set obs 10000000
gen x = runiform()
timer on 80
mata: mata_pctile("p0", "x", 5001)
timer off 80
timer on 10
gquantiles p2 = x, pctile nq(5001)
timer off 10
timer list
* With just 10M observations, gquantiles is still a 5x improvement over mata
* when computing many quantiles.
* Computation methods
* -------------------
* Computing quantiles involves selecting elements from an unordered
* array, which can be done in two ways: Using a selection algorithm on
* the unsorted variable or sorting and then selecting elements of the
* sorted varaible.
* The internal selection algorithm of gquantiles is very fast and on average
* will run in linear O(N) time (see quickselect) The sorting algorithm runs in
* O(N log(N)) time (see quicksort). Clearly, with few quantiles we can see
* the selection algorithm will be faster. However, with a large number of
* quantiles running multiple iterations of the selection algorithm is clearly
* slower than doing a single sort.
clear
timer clear
set obs 10000000
gen x = rnormal() * 100
timer on 10
gquantiles p1 = x, pctile nq(2) method(1)
timer off 10
timer on 20
gquantiles p2 = x, pctile nq(2) method(2)
timer off 20
assert p1 == p2
timer list
* We can see that method 2 was more than 3 times faster for a single quantile.
timer clear
timer on 10
gquantiles p1 = x, pctile nq(10) method(1) replace
timer off 10
timer on 20
gquantiles p2 = x, pctile nq(10) method(2) replace
timer off 20
timer list
* While method 2 was still faster, computing 10 quantiles took twice the
* time it took to compute 1. By contrast, method 1 took essentially the same
* time. This is because after sorting the data, selecting elements is nearly
* instantaneous.
timer clear
timer on 10
gquantiles p1 = x, pctile nq(100) method(1) replace
timer off 10
timer on 20
gquantiles p2 = x, pctile nq(100) method(2) replace
timer off 20
timer list
* With 100 quantiles we can see that the performance of method 2 is now much
* worse than method 1. Internally, gquantiles will try to switch between the
* two methods based on the nubmer of observations and the number of quantiles.
* You might be tempted to always specify method 2 for few quantiles, but there
* is a second way in which it is slower than sorting:
timer clear
replace x = int(x)
timer on 10
gquantiles p1 = x, pctile nq(10) method(1) replace
timer off 10
timer on 20
gquantiles p2 = x, pctile nq(10) method(2) replace
timer off 20
timer list
* What happened? While both commands are faster, now method 1 is faster than
* method 2, whereas before it was 50% slower. This is because the specific
* sorting algorithm I use handles duplicates better than the selection
* algorithm.
timer clear
timer on 10
gquantiles p1 = x, pctile nq(100) method(1) replace
timer off 10
timer on 20
gquantiles p2 = x, pctile nq(100) method(2) replace
timer off 20
timer list
* Again, both are faster with duplicates, but method 1 is much faster.
* Multiple subcommands
* --------------------
* gquantiles allows the user to compute several things at once:
sysuse auto, clear
gquantiles price, _pctile xtile(x1) pctile(p1) binfreq nq(10)
matrix list r(quantiles_binfreq)
l price x1 p1 in 1/10
* Specifying quantiles and cutoffs
* --------------------------------
* gquantiles allows for several ways to specify cutoffs
sysuse auto, clear
gquantiles price, _pctile p(10(10)99)
matrix p0 = r(quantiles_used)
gquantiles p1 = price, pctile nq(10) genp(g1) xtile(x1)
gquantiles x2 = price, xtile cutpoints(p1)
gquantiles x3 = price, xtile cutquantiles(g1)
qui glevelsof p1
gquantiles x4 = price, xtile cutoffs(`r(levels)')
qui glevelsof g1
gquantiles x5 = price, xtile quantiles(`r(levels)')
matrix list p0
l p1 g1 x? in 1/10