# xan stats ```txt Computes descriptive statistics on CSV data. By default, statistics are reported for *every* column in the CSV data. The default set of statistics corresponds to statistics that can be computed efficiently on a stream of data in constant memory, but more can be selected using flags documented hereafter. If you have more specific needs or want to perform custom aggregations, please be sure to check the `xan agg` command instead. Here is what the CSV output will look like: field (default) - Name of the described column count (default) - Number of non-empty values contained by the column count_empty (default) - Number of empty values contained by the column type (default) - Most likely type of the column types (default) - Pipe-separated list of all types witnessed in the column sum (default) - Sum of numerical values mean (default) - Mean of numerical values q1 (-q, -A) - First quartile of numerical values median (-q, -A) - Second quartile, i.e. median, of numerical values q3 (-q, -A) - Third quartile of numerical values variance (default) - Population variance of numerical values stddev (default) - Population standard deviation of numerical values min (default) - Minimum numerical value max (default) - Maximum numerical value approx_cardinality (-a) - Approximation of the number of distinct string values approx_q1 (-a) - Approximation of the first quartile of numerical values approx_median (-a) - Approximation of the median of numerical values approx_q3 (-a) - Approximation of the third quartile of numerical values cardinality (-c, -A) - Number of distinct string values mode (-c, -A) - Most frequent string value (tie breaking is arbitrary & random!) tied_for_mode (-c, -A) - Number of values tied for mode lex_first (default) - First string in lexical order lex_last (default) - Last string in lexical order min_length (default) - Minimum string length max_length (default) - Maximum string length Stats can be computed in parallel using the -p/--parallel or -t/--threads flags. But this cannot work on streams or gzipped files, unless a `.gzi` index (as created by `bgzip -i`) can be found beside it. Parallelization is not compatible with the -g/--groupby option. Usage: xan stats [options] [] stats options: -s, --select Select a subset of columns to compute stats for. See 'xan select --help' for the format details. This is provided here because piping 'xan select' into 'xan stats' will disable the use of indexing. -g, --groupby If given, will compute stats per group as defined by the given column selection. -A, --all Shorthand for -cq. -c, --cardinality Show cardinality and modes. This requires storing all CSV data in memory. -q, --quartiles Show quartiles. This requires storing all CSV data in memory. -a, --approx Compute approximated statistics. --nulls Include empty values in the population size for computing mean and standard deviation. -p, --parallel Whether to use parallelization to speed up computation. Will automatically select a suitable number of threads to use based on your number of cores. Use -t, --threads if you want to indicate the number of threads yourself. -t, --threads Parellize computations using this many threads. Use -p, --parallel if you want the number of threads to be automatically chosen instead. Common options: -h, --help Display this message -o, --output Write output to instead of stdout. -n, --no-headers When set, the first row will NOT be interpreted as column names. i.e., They will be included in statistics. -d, --delimiter The field delimiter for reading CSV data. Must be a single character. ```