# xan search
```txt
Search for (or replace) patterns in CSV data (be sure to check out `xan grep` for
a faster but coarser equivalent).
This command has several flags to select the way to perform a match:
* (default): matching a substring (e.g. "john" in "My name is john")
* -e, --exact: exact match
* -r, --regex: using a regular expression
* -u, --url-prefix: matching by url prefix (e.g. "lemonde.fr/business")
* -N, --non-empty: finding non-empty cells (does not need a pattern)
* -E, --empty: finding empty cells (does not need a pattern)
Searching for rows with any column containing "john":
$ xan search "john" file.csv > matches.csv
Searching for rows where any column has *exactly* the value "john":
$ xan search -e "john" file.csv > matches.csv
Keeping only rows where selection is not fully empty:
$ xan search -s user_id --non-empty file.csv > users-with-id.csv
Keeping only rows where selection has any empty column:
$ xan search -s user_id --empty file.csv > users-without-id.csv
When using a regular expression, be sure to mind bash escape rules (prefer single
quotes around your expression and don't forget to use backslashes when needed):
$ xan search -r '\bfran[cç]' file.csv
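Matching by url prefix with -u (assuming, for the sake of the example, that the file has a "url" column):
$ xan search -u 'lemonde.fr/business' -s url file.csv > matches.csv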
To restrict the columns that will be searched, you can use the -s, --select flag.
All search modes (except -u/--url-prefix) can also be made case-insensitive
using -i, --ignore-case.
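For instance, a case-insensitive search restricted to a selection (the column names here are only illustrative):
$ xan search -i -s name,surname john file.csv > matches.csv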
# Searching multiple patterns at once
This command is also able to search for multiple patterns at once.
To do so, you can either use the -P, --add-pattern flag or feed a text file
with one pattern per line to the --patterns flag. You can also feed a CSV file
to the --patterns flag, in which case you will need to indicate the column
containing the patterns using the --pattern-column flag.
Giving additional patterns:
$ xan search disc -P tape -P vinyl file.csv > matches.csv
One pattern per line of text file:
$ xan search --patterns patterns.txt file.csv > matches.csv
CSV column containing patterns:
$ xan search --patterns people.csv --pattern-column name tweets.csv > matches.csv
Feeding patterns through stdin (using "-"):
$ cat patterns.txt | xan search --patterns - file.csv > matches.csv
Feeding CSV column as patterns through stdin (using "-"):
$ xan slice -l 10 people.csv | xan search --patterns - --pattern-column name file.csv > matches.csv
# Going further than just filtering
Now this command is also able to perform search-adjacent operations:
- Replacing matches with -R/--replace or --replacement-column
- Reporting in a new column whether a match was found with -f/--flag
- Reporting the total number of matches in a new column with -c/--count
- Reporting a breakdown of the number of matches per query given through --patterns
with -B/--breakdown.
- Reporting unique matches of multiple queries given through --patterns
using -U/--unique-matches.
For instance:
Reporting whether a match was found (instead of filtering):
$ xan search -s headline -i france -f france_match file.csv
Reporting number of matches:
$ xan search -s headline -i france -c france_count file.csv
Removing thousands separators (usually commas "," in English) from numerical columns:
$ xan search , --replace '' -s 'count_*' file.csv
Replacing color names with their French counterparts:
$ printf 'english,french\nred,rouge\ngreen,vert\n' | \
    xan search -e \
    --patterns - --pattern-column english --replacement-column french \
    -s color file.csv > translated.csv
Computing a breakdown of matches per query:
$ xan search -B -s headline --patterns queries.csv \
    --pattern-column query --name-column name file.csv > breakdown.csv
Reporting unique matches per query in a new column:
$ xan search -U matches -s headline,text --patterns queries.csv \
    --pattern-column query --name-column name file.csv > matches.csv
# Regarding parallelization
Finally, this command can leverage multithreading to run faster using
the -p/--parallel or -t/--threads flags. That said, the speedup gained from
parallelization can vary a lot, depending on the complexity and number of
queries as well as on the size of the haystacks. That is to say `xan search --empty`
would not be significantly faster when parallelized whereas `xan search -i eternity`
definitely would.
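For instance, picking the number of threads yourself (4 is an arbitrary choice here):
$ xan search -i eternity -t 4 file.csv > matches.csv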
Also, you might want to try `xan parallel cat` instead because it could be
faster in some scenarios at the cost of an increase in memory usage (and it
won't work on streams and unindexed gzipped data).
For instance, the following `search` command:
$ xan search -i eternity -p file.csv
Would directly translate to:
$ xan parallel cat -P 'search -i eternity' -F file.csv
Usage:
    xan search [options] --non-empty [<input>]
    xan search [options] --empty [<input>]
    xan search [options] --patterns <path> [<input>]
    xan search [options] <pattern> [-P <pattern>...] [<input>]
    xan search --help
search mode options:
-e, --exact Perform an exact match.
-r, --regex Use a regex to perform the match.
-E, --empty Search for empty cells, i.e. filter out
any completely non-empty selection.
-N, --non-empty Search for non-empty cells, i.e. filter out
any completely empty selection.
-u, --url-prefix Match by url prefix, i.e. cells must contain urls
matching the searched url prefix. Urls are first
reordered using a scheme called a LRU, that you can
read about here:
https://github.com/medialab/ural?tab=readme-ov-file#about-lrus
search options:
-i, --ignore-case Case insensitive search.
-v, --invert-match Select only rows that did not match
-s, --select Select the columns to search. See 'xan select -h'
for the full syntax.
-A, --all Only return a row when ALL columns from the given selection
match the desired pattern, instead of returning a row
when ANY column matches.
-f, --flag Instead of filtering rows, add a new column indicating if any match
was found.
-c, --count Report the number of non-overlapping pattern matches in a new column with
the given name. Will still filter out rows with 0 matches, unless --left
is used. Does not work with -v/--invert-match.
--overlapping When used with -c/--count or -B/--breakdown, return the count of
overlapping matches. Note that this can sometimes be one order of
magnitude slower than counting non-overlapping matches.
-R, --replace If given, the command will not filter rows but will instead
replace matches with the given replacement.
Does not work with --replacement-column.
Regex replacement string syntax can be found here:
https://docs.rs/regex/latest/regex/struct.Regex.html#replacement-string-syntax
-l, --limit Maximum number of rows to return. Useful to avoid downstream
buffering sometimes (e.g. when searching for very few
rows in a big file before piping to `view` or `flatten`).
Does not work with -p/--parallel nor -t/--threads.
--left Rows without any matches will be kept in the output when
using -U/--unique-matches, or -B/--breakdown, or -c/--count.
-p, --parallel Whether to use parallelization to speed up computation.
Will automatically select a suitable number of threads to use
based on your number of cores. Use -t, --threads if you want to
indicate the number of threads yourself.
-t, --threads Parallelize computations using this many threads. Use -p, --parallel
if you want the number of threads to be automatically chosen instead.
multiple patterns options:
-P, --add-pattern Manually add patterns to query without needing to feed a file
to the --patterns flag.
-B, --breakdown When used with --patterns, will count the total number of
non-overlapping matches per pattern and write this count in
one additional column per pattern. Each added column will be
named after its pattern, unless you provide the --name-column flag.
Will not include rows that have no matches in the output, unless
the --left flag is used. You might want to use it with --overlapping
sometimes when your patterns are themselves overlapping or you might
be surprised by the tallies.
-U, --unique-matches When used with --patterns, will add a column containing a list of
unique matched patterns for each row, separated by the --sep character.
Will not include rows that have no matches in the output unless
the --left flag is used. Patterns can also be given a name through
the --name-column flag.
--sep Character to use to join pattern matches when using -U/--unique-matches.
[default: |]
--patterns Path to a text file (use "-" for stdin), containing multiple
patterns, one per line, to search at once.
--pattern-column When given a column name, --patterns file will be considered a CSV
and patterns to search will be extracted from the given column.
--replacement-column When given with both --patterns & --pattern-column, indicates the
column containing a replacement when a match occurs. Does not
work with -R/--replace.
Regex replacement string syntax can be found here:
https://docs.rs/regex/latest/regex/struct.Regex.html#replacement-string-syntax
--name-column When given with -B/--breakdown, --patterns & --pattern-column,
indicates the column containing a pattern's name that will be used
as column name in the appended breakdown.
Common options:
-h, --help Display this message
-o, --output Write output to the given file instead of stdout.
-n, --no-headers When set, the first row will not be interpreted
as headers. (i.e., They are not searched, analyzed,
sliced, etc.)
-d, --delimiter The field delimiter for reading CSV data.
Must be a single character.
```
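As a complement to the help above, here is a small sketch combining a few of the documented flags; the file and column names are made up for the example:

```bash
# Keep a row only when ALL selected columns match the pattern (-A),
# instead of the default behavior of keeping rows where ANY column matches.
xan search -A -s name,alias john file.csv > matches.csv

# Count matches per row in a new column, but keep rows with zero
# matches in the output thanks to --left.
xan search -s headline -i france -c france_count --left file.csv > counted.csv
```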