# xan parallel

```txt
Parallel processing of CSV data.

This command usually parallelizes computation over multiple files, but it is
also able to automatically chunk CSV files and bgzipped CSV files (when a
`.gzi` index can be found) when the number of available threads is greater
than the number of files to read. This means the command is quite capable of
parallelizing work over a single CSV file.

To process a single CSV file in parallel:

    $ xan parallel count docs.csv

To process multiple files at once, give their paths as multiple arguments to
the command, or feed them through stdin, either one path per line or in a CSV
column when using the --path-column flag:

Multiple arguments through shell glob:

    $ xan parallel count data/**/docs.csv

One path per line, fed through stdin:

    $ ls data/**/docs.csv | xan parallel count

Paths from a CSV column through stdin:

    $ cat filelist.csv | xan parallel count --path-column path

Note that sometimes you might find it useful to use the `split` or `partition`
command to preemptively split a large file into manageable chunks, if you can
spare the disk space.

This command has multiple subcommands that each perform some typical parallel
reduce operation:

    - `count`: counts the number of rows in the whole dataset.
    - `cat`: preprocesses the files and redirects the concatenated rows to
      your output (e.g. searching all the files in parallel and retrieving
      the results).
    - `freq`: builds frequency tables in parallel. See "xan freq -h" for an
      example of output.
    - `stats`: computes well-known statistics in parallel. See "xan stats -h"
      for an example of output.
    - `agg`: parallelizes a custom aggregation. See "xan agg -h" for more
      details.
    - `groupby`: parallelizes a custom grouped aggregation. See "xan groupby -h"
      for more details.
    - `map`: writes the result of the given preprocessing to a new file beside
      the original one. This subcommand takes a filename template where `{}`
      will be replaced by the name of each target file without any extension
      (`.csv` or `.csv.gz` would be stripped, for instance). This subcommand
      is unable to leverage CSV file chunking.

For instance, the following command:

    $ xan parallel map '{}_freq.csv' -P 'freq -s Category' *.csv

will create a file suffixed "_freq.csv" for each CSV file in the current
directory, containing its frequency table for the "Category" column.

Finally, preprocessing on each file can be done using two different methods:

1. Using only xan subcommands with -P, --preprocess:

    $ xan parallel count -P "search -s name John | slice -l 10" file.csv

2. Using a shell subcommand passed to "$SHELL -c" with -H, --shell-preprocess:

    $ xan parallel count -H "xan search -s name John | xan slice -l 10" file.csv

The second preprocessing option will of course not work in DOS-based shells
and PowerShell on Windows.

Usage:
    xan parallel count [options] [...]
    xan parallel cat [options] [...]
    xan parallel freq [options] [...]
    xan parallel stats [options] [...]
    xan parallel agg [options] [...]
    xan parallel groupby [options] [...]
    xan parallel map