* Using by
* --------

* One major feature that gquantile adds is `by()`. It should be
* internally consistent if the user specifies the option `strict`.
* For example,

clear
set obs 1000000
gen group = int(runiform() * 100)
gen x = runiform()

local popts pctile strict by(group) cutby
gquantiles pctile = x, `popts' nq(10) genp(perc) groupid(id)

local xopts xtile strict by(group) cutby
gquantiles xtile  = x, `xopts' nq(10)
gquantiles xtile2 = x, `xopts' cutpoints(pctile)
gquantiles xtile3 = x, `xopts' cutquantiles(perc)

assert xtile == xtile2
assert xtile == xtile3

* However, there is no requirement for the user to do so:

sysuse auto, clear
gquantiles xtile = price, xtile by(foreign) nq(50)

* Computing many quantiles
* ------------------------

* Stata's _pctile caps the number of quantiles to 1001. pctile uses
* _pctile internally, so to compute more than 1001 percentiles it needs
* to loop over various runs of _pctile in a very inefficient way.  This
* inefficiency carries over to xtile because that command uses pctile
* internally. (Presumably this is the reason for the limit in the user-written
* fastxtile).

* The following executes with no errors in a reasonable amount of time

clear
set obs 1000000
gen x = runiform()
_pctile x, nq(1001)
pctile p1 = x, nq(1001)
gquantiles p2 = x, pctile nq(1001)

* However, if you increase `nq` the runtimes become excessive:

drop p*
timer clear

timer on 90
pctile p1 = x, nq(5001)
timer off 90

timer on 10
gquantiles p2 = x, pctile nq(5001)
timer off 10

assert p1 == p2
timer list

* 61 seconds for only 1,00,000 observations! This is in Stata/MP with 8 cores.
* gquantiles scales nicely, by contrast:

drop p*
timer clear

timer on 10
gquantiles p2 = x, pctile nq(`=_N + 1')
timer off 10

clear
set obs 100000000
gen x = runiform() * 100

timer on 20
gquantiles p2 = x, pctile nq(`=_N + 1')
timer off 20

timer list

* That's right, gquantiles computed 100M quantiles for 100M observations in 36
* seconds, faster than pctile could compute 5,000 quantiles for 1M obsevations.
* As a side-note, using mata can afford a massive speedup, obviating the need to
* call C in case gquantiles does not do something you want. Consider:

clear all
timer clear

mata:
void function mata_pctile (string scalar newvar,
                           string scalar sourcevar,
                           real scalar nq)
{
    real scalar N
    real colvector X, quantiles, qpositions, qties, qtiesix, Q, Qties

    X = st_data(., sourcevar)
    N = rows(X)
    _sort(X, 1)
    quantiles  = ((1::(nq - 1)) * N / nq)
    qpositions = ceil(quantiles)
    qties      = (qpositions :== quantiles)
    Q          = X[qpositions]

    if ( any(qties) ) {
        qtiesix = selectindex(qties)
        Qties = X[qpositions[qtiesix] :+ 1]
        Q[qtiesix] = (Q[qtiesix] + Qties) / 2
    }

    st_addvar("`:set type'", newvar)
    st_store((1::(nq - 1)), newvar, Q)
}
end

set obs 1000000
gen x = runiform()

timer on 80
mata: mata_pctile("p0", "x", 5001)
timer off 80

timer on 90
pctile p1 = x, nq(5001)
timer off 90

timer on 10
gquantiles p2 = x, pctile nq(5001)
timer off 10

assert p0 == p1
assert p1 == p2
timer list

* Just by using mata we speeded up Stata 50 times! The mata solution does not
* scale as well as gquantiles, however:

clear
timer clear
set obs 10000000
gen x = runiform()

timer on 80
mata: mata_pctile("p0", "x", 5001)
timer off 80

timer on 10
gquantiles p2 = x, pctile nq(5001)
timer off 10

timer list

* With just 10M observations, gquantiles is still a 5x improvement over mata
* when computing many quantiles.

* Computation methods
* -------------------

* Computing quantiles involves selecting elements from an unordered
* array, which can be done in two ways: Using a selection algorithm on
* the unsorted variable or sorting and then selecting elements of the
* sorted varaible.

* The internal selection algorithm of gquantiles is very fast and on average
* will run in linear O(N) time (see quickselect) The sorting algorithm runs in
* O(N log(N)) time (see quicksort).  Clearly, with few quantiles we can see
* the selection algorithm will be faster.  However, with a large number of
* quantiles running multiple iterations of the selection algorithm is clearly
* slower than doing a single sort.

clear
timer clear

set obs 10000000
gen x = rnormal() * 100

timer on 10
gquantiles p1 = x, pctile nq(2) method(1)
timer off 10

timer on 20
gquantiles p2 = x, pctile nq(2) method(2)
timer off 20

assert p1 == p2
timer list

* We can see that method 2 was more than 3 times faster for a single quantile.

timer clear

timer on 10
gquantiles p1 = x, pctile nq(10) method(1) replace
timer off 10

timer on 20
gquantiles p2 = x, pctile nq(10) method(2) replace
timer off 20

timer list

* While method 2 was still faster, computing 10 quantiles took twice the
* time it took to compute 1. By contrast, method 1 took essentially the same
* time. This is because after sorting the data, selecting elements is nearly
* instantaneous.

timer clear

timer on 10
gquantiles p1 = x, pctile nq(100) method(1) replace
timer off 10

timer on 20
gquantiles p2 = x, pctile nq(100) method(2) replace
timer off 20

timer list

* With 100 quantiles we can see that the performance of method 2 is now much
* worse than method 1. Internally, gquantiles will try to switch between the
* two methods based on the nubmer of observations and the number of quantiles.
* You might be tempted to always specify method 2 for few quantiles, but there
* is a second way in which it is slower than sorting:

timer clear
replace x = int(x)

timer on 10
gquantiles p1 = x, pctile nq(10) method(1) replace
timer off 10

timer on 20
gquantiles p2 = x, pctile nq(10) method(2) replace
timer off 20

timer list

* What happened? While both commands are faster, now method 1 is faster than
* method 2, whereas before it was 50% slower. This is because the specific
* sorting algorithm I use handles duplicates better than the selection
* algorithm.

timer clear

timer on 10
gquantiles p1 = x, pctile nq(100) method(1) replace
timer off 10

timer on 20
gquantiles p2 = x, pctile nq(100) method(2) replace
timer off 20

timer list

* Again, both are faster with duplicates, but method 1 is much faster.

* Multiple subcommands
* --------------------

* gquantiles allows the user to compute several things at once:

sysuse auto, clear
gquantiles price, _pctile xtile(x1) pctile(p1) binfreq nq(10)
matrix list r(quantiles_binfreq)
l price x1 p1 in 1/10

* Specifying quantiles and cutoffs
* --------------------------------

* gquantiles allows for several ways to specify cutoffs

sysuse auto, clear

gquantiles price, _pctile p(10(10)99)
matrix p0 = r(quantiles_used)

gquantiles p1 = price, pctile nq(10) genp(g1) xtile(x1)
gquantiles x2 = price, xtile cutpoints(p1)
gquantiles x3 = price, xtile cutquantiles(g1)

qui glevelsof p1
gquantiles x4 = price, xtile cutoffs(`r(levels)')

qui glevelsof g1
gquantiles x5 = price, xtile quantiles(`r(levels)')

matrix list p0
l p1 g1 x? in 1/10