# xan tokenize

```txt
Tokenize the given text column by splitting it either into words, sentences
or paragraphs.

# tokenize words

Tokenize the given text column by splitting it into word pieces (think
words, numbers, hashtags etc.).

This tokenizer is able to distinguish between the following types of tokens
(that you can filter using --keep and --drop):
"word", "number", "hashtag", "mention", "emoji", "punct", "url" and "email".

The command will by default emit one row per row in the input file, with the
tokens added in a new "tokens" column containing the processed and filtered
tokens joined by a space (or any character given to --sep).

However, when giving a column name to -T, --token-type, the command will
instead emit one row per token, with the token in a new "token" column, along
with a new column containing the token's type.
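For instance, one possible invocation emitting one row per token along with
its type in a dedicated column (the "token_type" column name is arbitrary):

    $ xan tokenize words text -T token_type file.csv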
This subcommand also exposes many ways to filter and process the resulting
tokens, as well as ways to refine a vocabulary iteratively in tandem with the
"xan vocab" command.

Finally, if you still need some processing not covered by the command's
flags, you can use -F/--flatmap, which lets you evaluate an expression over
each token in order to filter, transform or split them:

Filtering tokens out:

    $ xan tokenize words text -F 'token.startswith("Dé") && token'

Splitting tokens:

    $ xan tokenize words text -F 'token.split("-")'

Transforming tokens:

    $ xan tokenize words text -F 'replace(_, /é/, "e")'

# tokenize sentences

Tokenize the given text by splitting it into sentences, emitting one row per
sentence with a new "sentence" column at the end.

# tokenize paragraphs

Tokenize the given text by splitting it into paragraphs, emitting one row per
paragraph, with a new "paragraph" column at the end.

---

Note that the command will always drop the text column from the output unless
you pass --keep-text to the command.

Tips:

You can easily pipe the command into "xan vocab" to create a vocabulary:

    $ xan tokenize words text file.csv | xan vocab doc-token > vocab.csv

You can easily keep the tokens in a separate file using the "tee" command:

    $ xan tokenize words text file.csv | tee tokens.csv | xan vocab doc-token > vocab.csv

Usage:
    xan tokenize words [options] <column> [<input>]
    xan tokenize sentences [options] <column> [<input>]
    xan tokenize paragraphs [options] <column> [<input>]
    xan tokenize --help

tokenize options:
    -c, --column <name>      Name for the token column. Will default to
                             "tokens", "token" when -T, --token-type is
                             provided, "paragraphs" or "sentences".
    -p, --parallel           Whether to use parallelization to speed up
                             computations. Will automatically select a
                             suitable number of threads to use based on your
                             number of cores. Use -t, --threads if you want
                             to indicate the number of threads yourself.
    -t, --threads <threads>  Parallelize computations using this many
                             threads. Use -p, --parallel if you want the
                             number of threads to be automatically chosen
                             instead.
    --keep-text              Force keeping the text column in the output.

tokenize words options:
    -S, --simple             Use a simpler, more performant variant of the
                             tokenizer, but one unable to infer token types
                             nor handle subtle cases.
    -N, --ngrams <n>         If given, will output token ngrams using the
                             given n or the given range of n values using a
                             comma as separator, e.g. "1,3". This cannot be
                             used with -T, --token-type.
    -T, --token-type <name>  Name of a column to add containing the type of
                             the tokens. This cannot be used with
                             -N, --ngrams.
    -D, --drop <types>       Types of tokens to drop from the results,
                             separated by comma, e.g. "word,number". Cannot
                             work with -K, --keep. See the list of recognized
                             types above.
    -K, --keep <types>       Types of tokens to keep in the results,
                             separated by comma, e.g. "word,number". Cannot
                             work with -D, --drop. See the list of recognized
                             types above.
    -m, --min-token <n>      Minimum character count of a token to be
                             included in the output.
    -M, --max-token <n>      Maximum character count of a token to be
                             included in the output.
    --stoplist <path>        Path to a .txt stoplist containing one word per
                             line.
    -J, --filter-junk        Whether to apply some heuristics to filter out
                             words that look like junk.
    -L, --lower              Whether to normalize token case using lower
                             case.
    -U, --unidecode          Whether to normalize token text to ascii.
    --split-hyphens          Whether to split tokens by hyphens.
    --stemmer <name>         Stemmer to normalize the tokens. Can be one of:
                                 - "s": a basic stemmer removing typical
                                   plural inflections in most European
                                   languages.
                                 - "carry": a stemmer targeting the French
                                   language.
    -V, --vocab <path>       Path to a CSV file containing allowed
                             vocabulary (or "-" for stdin).
    --vocab-token <col>      Column of vocabulary file containing allowed
                             tokens. [default: token]
    --vocab-token-id <col>   Column of vocabulary file containing a token id
                             to emit in place of the token itself.
    --sep <char>             Character used to join tokens in the output
                             cells. Will default to a space.
    --ngrams-sep <char>      Separator to be used to join ngram tokens.
                             [default: §]
    -u, --uniq               Sort and deduplicate the tokens.
    -F, --flatmap <expr>     Evaluate an expression for each extracted token
                             and return nothing, a transformed token, or a
                             list of tokens. The evaluated expression will
                             understand the "token" identifier as the
                             currently processed token and "token_type" as
                             its type. The expression will run after any of
                             the command's preprocessing toggled through
                             flags, but before deduplication.

tokenize paragraphs options:
    -A, --aerated            Force paragraphs to be separated by a blank
                             line, instead of just a single line break.

tokenize sentences options:
    --squeeze                Collapse consecutive whitespace to produce a
                             tidy output.

Common options:
    -h, --help               Display this message
    -o, --output <file>      Write output to <file> instead of stdout.
    -n, --no-headers         When set, the first row will not be interpreted
                             as headers.
    -d, --delimiter <arg>    The field delimiter for reading CSV data. Must
                             be a single character.
```
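As a sketch of how the word tokenizer's normalization flags compose (the `stopwords.txt` file name is hypothetical), the pipeline below lowercases tokens, transliterates them to ascii, filters out junk-looking words, drops stoplisted words, then builds a vocabulary as in the tips above:

```txt
$ xan tokenize words text -L -U -J --stoplist stopwords.txt file.csv \
    | xan vocab doc-token > vocab.csv
```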