# SiteOne Crawler
SiteOne Crawler is a powerful and easy-to-use **website analyzer, cloner, and converter** designed for developers seeking security and performance insights, SEO specialists identifying optimization opportunities, and website owners needing reliable backups and offline versions.
**Now rewritten in Rust** for maximum performance, minimal resource usage, and zero runtime dependencies. The transition from PHP+Swoole to Rust resulted in **25% faster execution** and **30% lower memory consumption** while producing identical output.
**Discover the SiteOne Crawler advantage:**
* **Run Anywhere:** Single native binary for **🪟 Windows**, **🍎 macOS**, and **🐧 Linux** (x64 & arm64). No runtime dependencies.
* **Work Your Way:** Launch the binary without arguments for an **interactive wizard** 🧙 with 10 preset modes, use the extensive **command-line interface** 📟 ([releases](https://github.com/janreges/siteone-crawler/releases), [▶️ video](https://www.youtube.com/watch?v=25T_yx13naA&list=PL9mElgTe-s1Csfg0jXWmDS0MHFN7Cpjwp)) for automation and power, or enjoy the intuitive **desktop GUI application** 💻 ([GUI app](https://github.com/janreges/siteone-crawler-gui), [▶️ video](https://www.youtube.com/watch?v=rFW8LNEVNdw)) for visual control.
* **Rich Output Formats:** Interactive **HTML audit report** 📊 with sortable tables and quality scoring (0.0-10.0) (see [nextjs.org sample](https://crawler.siteone.io/html/2024-08-23/forever/cl8xw4r-fdag8wg-44dd.html)), detailed **JSON** for programmatic consumption, and human-readable **text** for terminal. Send HTML reports directly to your inbox via **built-in SMTP mailer** 📧.
* **CI/CD Integration:** Built-in **quality gate** (`--ci`) with configurable thresholds: exit code 10 on failure enables automated deployment blocking. Also useful for **cache warming**: crawling the entire site after deployment populates your reverse proxy/CDN cache.
* **Offline & Markdown Power:** Create complete **offline clones** 💾 for browsing without a server ([nextjs.org clone](https://crawler.siteone.io/examples-exports/nextjs.org/)) or convert entire websites into clean **Markdown** 📝, perfect for backups, documentation, or feeding content to AI models ([examples](https://github.com/janreges/siteone-crawler-markdown-examples/)).
* **Deep Crawling & Analysis:** Thoroughly crawl every page and asset, identify errors (404s, redirects), generate **sitemaps** 🗺️, and even get **email summaries** 📧 (watch [▶️ video example](https://www.youtube.com/watch?v=PHIFSOmk0gk)).
* **Learn More:** Dive into the 🌐 [Project Website](https://crawler.siteone.io/), explore the detailed [Documentation](https://crawler.siteone.io/configuration/command-line-options/), or check the [JSON](docs/JSON-OUTPUT.md)/[Text](docs/TEXT-OUTPUT.md) output specs.
GIF animation of the crawler in action (also available as a [▶️ video](https://www.youtube.com/watch?v=25T_yx13naA&list=PL9mElgTe-s1Csfg0jXWmDS0MHFN7Cpjwp)):

## Table of contents
- [✨ Features](#-features)
  * [🕷️ Crawler](#️-crawler)
  * [🛠️ Dev/DevOps assistant](#️-devdevops-assistant)
  * [📊 Analyzer](#-analyzer)
  * [📧 Reporter](#-reporter)
  * [💾 Offline website generator](#-offline-website-generator)
  * [📝 Website to markdown converter](#-website-to-markdown-converter)
  * [🗺️ Sitemap generator](#️-sitemap-generator)
- [🚀 Installation](#-installation)
  * [📦 Pre-built binaries](#-pre-built-binaries)
  * [🍺 Homebrew (macOS / Linux)](#-homebrew-macos--linux)
  * [🐧 Debian / Ubuntu (apt)](#-debian--ubuntu-apt)
  * [🎩 Fedora / RHEL (dnf)](#-fedora--rhel-dnf)
  * [🦎 openSUSE / SLES (zypper)](#-opensuse--sles-zypper)
  * [🏔️ Alpine Linux (apk)](#️-alpine-linux-apk)
  * [🔨 Build from source](#-build-from-source)
- [▶️ Usage](#️-usage)
  * [Interactive wizard](#interactive-wizard)
  * [Basic example](#basic-example)
  * [CI/CD example](#cicd-example)
  * [Fully-featured example](#fully-featured-example)
  * [⚙️ Arguments](#️-arguments)
    + [Basic settings](#basic-settings)
    + [Output settings](#output-settings)
    + [Resource filtering](#resource-filtering)
    + [Advanced crawler settings](#advanced-crawler-settings)
    + [File export settings](#file-export-settings)
    + [Mailer options](#mailer-options)
    + [Upload options](#upload-options)
    + [Offline exporter options](#offline-exporter-options)
    + [Markdown exporter options](#markdown-exporter-options)
    + [Sitemap options](#sitemap-options)
    + [Expert options](#expert-options)
    + [Fastest URL analyzer](#fastest-url-analyzer)
    + [SEO and OpenGraph analyzer](#seo-and-opengraph-analyzer)
    + [Slowest URL analyzer](#slowest-url-analyzer)
    + [Built-in HTTP server](#built-in-http-server)
    + [HTML-to-Markdown conversion](#html-to-markdown-conversion)
    + [CI/CD settings](#cicd-settings)
- [🏆 Quality Scoring](#-quality-scoring)
- [🔄 CI/CD Integration](#-cicd-integration)
- [📋 Output Examples](#-output-examples)
- [🧪 Testing](#-testing)
- [⚠️ Disclaimer](#️-disclaimer)
- [📄 License](#-license)
## ✨ Features
In short, the main benefits can be summarized in these points:
- **🕷️ Crawler** - very powerful crawler of the entire website reporting useful information about each URL (status code,
  response time, size, custom headers, titles, etc.)
- **🛠️ Dev/DevOps assistant** - offers stress/load testing with configurable concurrent workers (`--workers`) and request
  rate (`--max-reqs-per-sec`), cache warming, localhost testing, and rich URL/content-type filtering
- **📊 Analyzer** - analyzes all webpages and reports strange or erroneous behaviour and useful statistics (404, redirects, bad
  practices, SEO and security issues, heading structures, etc.)
- **📧 Reporter** - interactive **HTML audit report**, structured **JSON**, and colored **text** output; built-in
  **SMTP mailer** sends HTML reports directly to your inbox
- **💾 Offline website generator** - clone entire websites to browsable local HTML files (no server needed) including all
  assets. Supports **multi-domain clones**: include subdomains or external domains with intelligent cross-linking.
- **📝 Website to markdown converter** - export the entire website to browsable text markdown (viewable on GitHub or any
  text editor), or generate a **single-file markdown** with smart header/footer deduplication, ideal for **feeding to AI
  tools**. Includes a **built-in web server** that renders markdown exports as styled HTML pages.
  Also supports **standalone HTML-to-Markdown conversion** of local files (`--html-to-markdown`).
  See [markdown examples](https://github.com/janreges/siteone-crawler-markdown-examples/).
- **🗺️ Sitemap generator** - allows you to generate `sitemap.xml` and `sitemap.txt` files with a list of all pages on your
  website
- **🏆 Quality scoring** - automatic quality scoring (0.0-10.0) across 5 categories: Performance, SEO, Security, Accessibility, Best Practices
- **🔄 CI/CD quality gate** - configurable thresholds with exit code 10 on failure for automated pipelines; also
  useful as a **post-deployment cache warmer** for reverse proxies and CDNs
The following features are summarized in greater detail:
### 🕷️ Crawler
- **all major platforms** supported without dependencies (🐧 Linux, 🪟 Windows, 🍎 macOS, arm64) as a single native binary
- has incredible **🚀 native Rust performance** with async I/O and multi-threaded crawling
- provides simulation of **different device types** (desktop/mobile/tablet) thanks to predefined User-Agents
- will crawl **all files**, styles, scripts, fonts, images, documents, etc. on your website
- will respect the `robots.txt` file and will not crawl the pages that are not allowed
- has a **beautiful interactive** and **🎨 colourful output**
- it will **clearly warn you** ⚠️ of any wrong use of the tool (e.g. input parameters validation or wrong permissions)
- as the `--url` parameter, you can also specify a `sitemap.xml` file (or [sitemap index](https://www.sitemaps.org/protocol.html#index)),
  which will be processed as a list of URLs. In sitemap-only mode, the crawler follows only URLs from
  the sitemap and does not discover additional links from HTML pages. Gzip-compressed sitemaps (`*.xml.gz`)
  are fully supported, both as direct URLs and when referenced from sitemap index files.
- respects the HTML `<base>` tag when resolving relative URLs on pages that use it.
### 🛠️ Dev/DevOps assistant
- allows testing **public** and **local projects on specific ports** (e.g. `http://localhost:3000/`)
- works as a **stress/load tester**: configure the number of **concurrent workers** (`--workers`) and the **maximum
  requests per second** (`--max-reqs-per-sec`) to simulate various traffic levels and test your infrastructure's
  resilience against high load or DoS scenarios
- combine with **rich filtering options**: include/ignore URLs by regex (`--include-regex`, `--ignore-regex`), disable
  specific asset types (`--disable-javascript`, `--disable-images`, etc.), or limit crawl depth (`--max-depth`) to focus
  the load on specific parts of your website
- will help you **warm up the application cache** or the **cache on the reverse proxy** of the entire website
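The load-testing and cache-warming bullets above can be sketched as a small dry-run helper. It only prints candidate commands so they can be reviewed before anything hits a server. The function name `load_test_cmd` and the `CRAWLER` variable are hypothetical, the worker counts, rates, and regexes are arbitrary illustration values, and only flags named in this section are used.

```shell
# Print (do not run) a crawl command for a given worker count and request
# rate, restricted to URLs matching a regex. Images are skipped so the load
# concentrates on pages. CRAWLER is a hypothetical override for the binary path.
load_test_cmd() {
    url=$1; workers=$2; rate=$3; regex=$4
    printf '%s --url=%s --workers=%s --max-reqs-per-sec=%s --include-regex=%s --disable-images\n' \
        "${CRAWLER:-./siteone-crawler}" "$url" "$workers" "$rate" "'$regex'"
}

# Gentle cache warm-up of one section of a public site (illustrative values).
load_test_cmd https://mydomain.tld/ 2 10 '/^.*\/technologies.*/'

# Heavier run against a local project on a specific port (illustrative values).
load_test_cmd http://localhost:3000/ 20 200 '/^.*\/api\/.*/'
```

Printing first and running second is a deliberate choice here: high worker counts and request rates can look like a DoS to the target, so it is worth checking the exact command before executing it.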
### 📊 Analyzer
- will **find the weak points** or **strange behavior** of your website
- built-in analyzers cover SEO, security headers, accessibility, best practices, performance, SSL/TLS, caching, and more
### 📧 Reporter
Three output formats:
- **Interactive HTML report**: a self-contained `.html` file with sortable tables, quality scores, color-coded
  findings, and sections for SEO, security, accessibility, performance, headers, redirects, 404s, and more. Open it
  in any browser, no server needed.
- **JSON output**: structured data with all crawled URLs, response details, analysis findings, scores, and CI/CD gate
  results. Ideal for programmatic consumption, dashboards, and integrations.
- **Text output**: human-readable colored terminal output with tables, progress bars, and summaries.

Additional reporting features:
- **Built-in SMTP mailer**: send the HTML audit report directly to one or more email addresses via your own SMTP
  server. Configure sender, recipients, subject template, and SMTP credentials via CLI options.
- will provide you with data for **SEO analysis**, just add the `Title`, `Keywords` and `Description` extra columns
- will provide useful **summaries and statistics** at the end of the processing
### 💾 Offline website generator
- will help you **export the entire website** to offline form, where it is possible to browse the site through local
  HTML files (without HTTP server) including all documents, images, styles, scripts, fonts, etc.
- supports **multi-domain clones**: include subdomains (`*.mysite.tld`) or entirely different domains in a single
  offline export. All URLs across included domains are **intelligently rewritten to relative paths**, so the resulting
  offline version cross-links pages between domains seamlessly and you get one unified browsable clone.
- you can **limit which assets** are downloaded and exported (see the `--disable-*` directives). For some types of
  websites, the best result is achieved with the `--disable-javascript` option.
- you can specify by `--allowed-domain-for-external-files` (short `-adf`) from which **external domains** it is possible
to **download** assets (JS, CSS, fonts, images, documents) including `*` option for all domains.
- you can specify by `--allowed-domain-for-crawling` (short `-adc`) which **other domains** should be included in the
**crawling** if there are any links pointing to them. You can enable e.g. `mysite.*` to export all language mutations
that have a different TLD or `*.mysite.tld` to export all subdomains.
- you can use `--single-page` to **export only one page** to which the URL is given (and its assets), but do not follow
other pages.
- you can use `--single-foreign-page` to **export only one page** from another domain (if allowed by `--allowed-domain-for-crawling`),
but do not follow other pages.
- you can use `--replace-content` to **replace content** in HTML/JS/CSS with `foo -> bar` or regexp in PCRE format, e.g.
`/card[0-9]/i -> card`. Can be specified multiple times.
- you can use `--replace-query-string` to **replace chars in query string** in the filename.
- you can use `--max-depth` to set the **maximum crawling depth** (for pages, not assets). `1` means `/about` or `/about/`,
`2` means `/about/contacts` etc.
- you can use it to **export your website to a static form** and host it on GitHub Pages, Netlify, Vercel, etc. as a
static backup and part of your **disaster recovery plan** or **archival/legal needs**
- works great with **older conventional websites** but also **modern ones**, built on frameworks like Next.js, Nuxt.js,
SvelteKit, Astro, Gatsby, etc. When a JS framework is detected, the export also performs some framework-specific code
modifications for optimal results.
- **try it** for your website, and you will be very pleasantly surprised :-)
### 📝 Website to markdown converter
Two export modes:
- **Multi-file markdown**: exports the entire website with all subpages to a directory of **browsable `.md` files**.
  The markdown renders nicely when uploaded to GitHub, viewed in VS Code, or any text editor. Links between pages are
  converted to relative `.md` links so you can navigate between files. Optionally includes images and other files
  (PDF, etc.).
- **Single-file markdown**: combines all pages into **one large markdown file** with smart removal of duplicate website
  headers and footers across pages. Ideal for **feeding entire website content to AI tools** (ChatGPT, Claude, etc.)
  that process markdown more effectively than raw HTML.

Smart conversion features:
- **collapsible accordions**: large link lists (menus, navigation, footer links with 8+ items) are automatically
  collapsed into `<details>` accordions with contextual labels ("Menu", "Links") for better readability
- content before the main heading (typically h1), such as the site header and navigation, is moved to the end of the
  page below a `---` separator, so the actual page content comes first
- you can set multiple selectors (CSS-like) to **remove unwanted elements** from the exported markdown
- **code block detection** and **syntax highlighting** for popular programming languages
- HTML tables are converted to proper **markdown tables**
Built-in web server:
- use `--serve-markdown=` to start a **built-in HTTP server** that renders your markdown export as styled HTML
  pages with tables, dark/light mode, breadcrumb navigation, and accordion support, perfect for browsing and sharing
  the export locally or on a network
Standalone HTML-to-Markdown conversion:
- use `--html-to-markdown=` to convert a **local HTML file** directly to Markdown without crawling any website
- outputs clean Markdown to **stdout** (pipe-friendly) or to a file with `--html-to-markdown-output=`
- uses the same conversion pipeline as `--markdown-export-dir`, including all cleanup, accordion collapsing, code language detection, and implicit exclusions (cookie banners, `aria-hidden` elements, `role="menu"` dropdowns)
- respects `--markdown-disable-images`, `--markdown-disable-files`, `--markdown-exclude-selector`, and `--markdown-move-content-before-h1-to-end`
- does **not** rewrite links (`.html` → `.md`) since the file is standalone with no site context

💡 Tip: you can push the exported markdown folder to your GitHub repository, where it will be automatically rendered as browsable
documentation. You can look at the [examples](https://github.com/janreges/siteone-crawler-markdown-examples/) of websites converted to markdown.
See all available [markdown exporter options](#markdown-exporter-options) and [HTML-to-Markdown conversion options](#html-to-markdown-conversion).
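Because standalone conversion writes Markdown to stdout, it composes with ordinary shell redirection. The loop below is a minimal sketch of converting a folder of saved HTML files. The helper `md_path`, the `CRAWLER` variable, and the `SRC_DIR` directory are assumptions made for illustration; only the `--html-to-markdown=` option comes from the text above.

```shell
# Map an input HTML path to the Markdown path written next to it.
md_path() {
    printf '%s.md\n' "${1%.html}"
}

CRAWLER="${CRAWLER:-./siteone-crawler}"   # hypothetical override for the binary path
SRC_DIR="${SRC_DIR:-./html}"              # hypothetical directory of saved HTML files

for html in "$SRC_DIR"/*.html; do
    [ -f "$html" ] || continue            # skip when the glob matches nothing
    [ -x "$CRAWLER" ] || break            # nothing to do without the binary
    # Markdown goes to stdout, so a plain redirect is enough.
    "$CRAWLER" --html-to-markdown="$html" > "$(md_path "$html")"
done
```

Since links are not rewritten in standalone mode, any `.html` links between the converted files will still point at the HTML originals.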
### 🗺️ Sitemap generator
- will help you create a `sitemap.xml` and `sitemap.txt` for your website
- you can set the priority of individual pages based on the number of slashes in the URL

Don't hesitate and try it. You will love it as we do! ❤️
## 🚀 Installation
### 📦 Pre-built binaries
Download pre-built binaries from [🐙 GitHub releases](https://github.com/janreges/siteone-crawler/releases) for all major platforms (🐧 Linux, 🪟 Windows, 🍎 macOS, x64 & arm64).
The binary is self-contained: no runtime dependencies required.
```bash
# Linux / macOS: download, extract, run
./siteone-crawler --url=https://my.domain.tld
```
**🐧 Linux binary variants:**
For Linux, two binary variants are provided:

| Variant | Compatibility | Performance |
|---------|--------------|-------------|
| **glibc** (primary) | Requires glibc 2.39+ (Ubuntu 24.04+, Debian 13+, Fedora 40+) | Full native performance |
| **musl** (compatible) | Any Linux distribution (statically linked, no dependencies) | ~50-80% slower due to musl memory allocator |

The **glibc** variant is recommended for current distributions because it offers the best performance. If you are running an older distribution (e.g. Ubuntu 22.04, Debian 12) and encounter a `GLIBC_2.xx not found` error, use the **musl** variant instead. The musl binary is fully statically linked and runs on any Linux system regardless of the installed glibc version. The performance difference is mainly noticeable during CPU-intensive operations like offline and markdown exports.
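If you would rather check before downloading than wait for a `GLIBC_2.xx not found` error, the rule in the table can be sketched as a small shell check. The helper `pick_variant` is hypothetical, and the script assumes `getconf GNU_LIBC_VERSION` is available, which is normally the case on glibc-based systems and not on musl-based ones.

```shell
# Decide which Linux build to download, given a glibc version string.
# Prints "glibc" for 2.39 or newer, otherwise "musl".
pick_variant() {
    major=${1%%.*}
    rest=${1#*.}
    minor=${rest%%.*}
    if [ "$major" -gt 2 ] || { [ "$major" -eq 2 ] && [ "$minor" -ge 39 ]; }; then
        echo glibc
    else
        echo musl
    fi
}

# getconf reports e.g. "glibc 2.39" on glibc systems; where the call fails
# (for example on musl-based systems) we fall back to version 0.0.
detected=$(getconf GNU_LIBC_VERSION 2>/dev/null | awk '{print $2}' || true)
echo "Suggested variant: $(pick_variant "${detected:-0.0}")"
```

Falling back to `musl` whenever the version cannot be determined is the conservative choice, since the static binary runs regardless of the installed libc.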
**Note for macOS users**: If macOS refuses to start the crawler from your Downloads folder, move the entire crawler folder **via the terminal** to another location, for example your home folder `~`.
### 🍺 Homebrew (macOS / Linux)
```bash
brew install janreges/tap/siteone-crawler
siteone-crawler --url=https://my.domain.tld
```
### 🐧 Debian / Ubuntu (apt)
```bash
curl -1sLf 'https://dl.cloudsmith.io/public/janreges/siteone-crawler/setup.deb.sh' | sudo -E bash
sudo apt-get install siteone-crawler
```
> **Older distributions (Ubuntu 22.04, Debian 11/12, etc.):** If you get a `GLIBC_X.XX not found` error, install the statically linked variant instead:
> ```bash
> sudo apt-get install siteone-crawler-static
> ```
> See [Linux binary variants](#-pre-built-binaries) for details on the performance difference.
### 🎩 Fedora / RHEL (dnf)
```bash
curl -1sLf 'https://dl.cloudsmith.io/public/janreges/siteone-crawler/setup.rpm.sh' | sudo -E bash
sudo dnf install siteone-crawler
```
> **Older distributions:** If you get a `GLIBC_X.XX not found` error, use `sudo dnf install siteone-crawler-static` instead.
> See [Linux binary variants](#-pre-built-binaries) for details.
### 🦎 openSUSE / SLES (zypper)
```bash
curl -1sLf 'https://dl.cloudsmith.io/public/janreges/siteone-crawler/setup.rpm.sh' | sudo -E bash
sudo zypper install siteone-crawler
```
> **Older distributions:** If you get a `GLIBC_X.XX not found` error, use `sudo zypper install siteone-crawler-static` instead.
> See [Linux binary variants](#-pre-built-binaries) for details.
### 🏔️ Alpine Linux (apk)
```bash
curl -1sLf 'https://dl.cloudsmith.io/public/janreges/siteone-crawler/setup.alpine.sh' | sudo -E bash
sudo apk add siteone-crawler
```
### 🔨 Build from source
Requires [Rust](https://www.rust-lang.org/tools/install) 1.85 or later.
```bash
git clone https://github.com/janreges/siteone-crawler.git
cd siteone-crawler
# Build optimized release binary
cargo build --release
# Run
./target/release/siteone-crawler --url=https://my.domain.tld
```
**Build statically linked (musl) binary:**
```bash
# Install musl toolchain (Ubuntu/Debian)
sudo apt-get install musl-tools
rustup target add x86_64-unknown-linux-musl
# Build static binary (no system dependencies)
cargo build --release --target x86_64-unknown-linux-musl
# Run: works on any Linux distribution
./target/x86_64-unknown-linux-musl/release/siteone-crawler --url=https://my.domain.tld
```
## ▶️ Usage
### Interactive wizard
Run the binary **without any arguments** and an interactive wizard will guide you through the
configuration. Choose from 10 preset modes, enter the target URL, fine-tune settings with
arrow keys, and the crawler starts immediately, with no need to remember CLI flags.
```
? Choose a crawl mode:
❯ Quick Audit             Fast site health overview - crawls all pages and assets
  SEO Analysis            Extract titles, descriptions, keywords, and OpenGraph tags
  Performance Test        Measure response times with cache disabled - find bottlenecks
  Security Check          Check SSL/TLS, security headers, and redirects site-wide
  Offline Clone           Download entire website with all assets for offline browsing
  Markdown Export         Convert pages to Markdown for AI models or documentation
  Stress Test             High-concurrency load test with cache-busting random params
  Single Page             Deep analysis of a single URL - SEO, security, performance
  Large Site Crawl        High-throughput HTML-only crawl for large sites (100k+ pages)
  Custom                  Start from defaults and configure every option manually
  ──────────────────────────────────────
  Browse offline export   Serve a previously exported offline site via HTTP
  Browse markdown export  Serve a previously exported markdown site via HTTP
[↑↓ to move, enter to select, type to filter]
```
After selecting a preset and entering the URL, the wizard shows a settings form where you can
adjust workers, timeout, content types, export options, and more. A configuration summary with the
equivalent CLI command is displayed before the crawl starts, so you can copy it for future use
without the wizard.
If existing offline or markdown exports are detected in `./tmp/`, the wizard also offers to
**serve them via the built-in HTTP server** directly from the menu.
### Basic example
To run the crawler from the command line, provide the required arguments:
```bash
./siteone-crawler --url=https://mydomain.tld/ --device=mobile
```
### CI/CD example
```bash
# Fail deployment if quality score < 7.0 or any 5xx errors
./siteone-crawler --url=https://mydomain.tld/ --ci --ci-min-score=7.0 --ci-max-5xx=0
echo $? # 0 = pass, 10 = fail
```
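In a pipeline script you may want to branch on the result rather than just echo it. The following is a minimal sketch under stated assumptions: `gate_decision` and the `CRAWLER` variable are hypothetical names, and treating every status other than the documented `0` and `10` as a tool error is an assumption of this example, not documented behaviour.

```shell
# Translate the crawler's exit status into a pipeline decision.
# 0 (pass) and 10 (gate failed) are documented above; any other value is
# treated here as a tool or environment error (assumption of this sketch).
gate_decision() {
    case "$1" in
        0)  echo "pass" ;;
        10) echo "fail" ;;
        *)  echo "error" ;;
    esac
}

CRAWLER="${CRAWLER:-./siteone-crawler}"   # hypothetical override for the binary path
if [ -x "$CRAWLER" ]; then
    status=0
    "$CRAWLER" --url=https://mydomain.tld/ --ci --ci-min-score=7.0 --ci-max-5xx=0 || status=$?
    echo "Quality gate: $(gate_decision "$status")"
    # Propagate a non-zero status so the surrounding job is blocked.
    [ "$status" -eq 0 ] || exit "$status"
fi
```

Capturing the status with `|| status=$?` keeps the script working under `set -e`, where a bare failing command would abort before the decision is printed.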
### Fully-featured example
```bash
./siteone-crawler --url=https://mydomain.tld/ \
--output=text \
--workers=2 \
--max-reqs-per-sec=10 \
--memory-limit=2048M \
--resolve='mydomain.tld:443:127.0.0.1' \
--timeout=5 \
--proxy=proxy.mydomain.tld:8080 \
--http-auth=myuser:secretPassword123 \
--user-agent="My User-Agent String" \
--extra-columns="DOM,X-Cache(10),Title(40),Keywords(50),Description(50>),Heading1=xpath://h1/text()(20>),ProductPrice=regexp:/Price:\s*\$?(\d+(?:\.\d{2})?)/i#1(10)" \
--accept-encoding="gzip, deflate" \
--url-column-size=100 \
--max-queue-length=3000 \
--max-visited-urls=10000 \
--max-url-length=5000 \
--max-non200-responses-per-basename=10 \
--include-regex="/^.*\/technologies.*/" \
--include-regex="/^.*\/fashion.*/" \
--ignore-regex="/^.*\/downloads\/.*\.pdf$/i" \
--analyzer-filter-regex="/^.*$/i" \
--remove-query-params \
--keep-query-param=page \
--add-random-query-params \
--transform-url="live-site.com -> local-site.local" \
--transform-url="/cdn\.live-site\.com/ -> local-site.local/cdn" \
--show-scheme-and-host \
--do-not-truncate-url \
--output-html-report=tmp/myreport.html \
--html-report-options="summary,seo-opengraph,visited-urls,security,redirects" \
--output-json-file=/dir/report.json \
--output-text-file=/dir/report.txt \
--add-timestamp-to-output-file \
--add-host-to-output-file \
--offline-export-dir=tmp/mydomain.tld \
--replace-content='/]+>/ -> ' \
--ignore-store-file-error \
--sitemap-xml-file=/dir/sitemap.xml \
--sitemap-txt-file=/dir/sitemap.txt \
--sitemap-base-priority=0.5 \
--sitemap-priority-increase=0.1 \
--markdown-export-dir=tmp/mydomain.tld.md \
--markdown-export-single-file=tmp/mydomain.tld.combined.md \
--markdown-move-content-before-h1-to-end \
--markdown-disable-images \
--markdown-disable-files \
--markdown-remove-links-and-images-from-single-file \
--markdown-exclude-selector='.exclude-me' \
--markdown-replace-content='/]+>/ -> ' \
--markdown-replace-query-string='/[a-z]+=[^&]*(&|$)/i -> $1__$2' \
--mail-to=your.name@my-mail.tld \
--mail-to=your.friend.name@my-mail.tld \
--mail-from=crawler@my-mail.tld \
--mail-from-name="SiteOne Crawler" \
--mail-subject-template="Crawler Report for %domain% (%date%)" \
--mail-smtp-host=smtp.my-mail.tld \
--mail-smtp-port=25 \
--mail-smtp-user=smtp.user \
--mail-smtp-pass=secretPassword123 \
--ci --ci-min-score=7.0 --ci-min-security=8.0
```
## ⚙️ Arguments
For a clearer list, I recommend going to the documentation: 🌐 https://crawler.siteone.io/configuration/command-line-options/
### Basic settings
| Parameter | Description |
|-----------|-------------|
| `--url=` | Required. HTTP or HTTPS URL address of the website or sitemap XML to be crawled. Use quotation marks `''` if the URL contains query parameters. |
| `--single-page` | Load only one page to which the URL is given (and its assets), but do not follow other pages. |
| `--max-depth=` | Maximum crawling depth (for pages, not assets). Default is `0` (no limit). `1` means `/about` or `/about/`, `2` means `/about/contacts` etc. |
| `--device=` | Device type for choosing a predefined User-Agent. Ignored when `--user-agent` is defined. Supported values: `desktop`, `mobile`, `tablet`. Default is `desktop`. |
| `--user-agent=` | Custom User-Agent header. Use quotation marks. If specified, it takes precedence over the device parameter. If you add `!` at the end, the siteone-crawler/version will not be added as a signature at the end of the final user-agent. |
| `--timeout=` | Request timeout in seconds. Default is `5`. |
| `--proxy=` | HTTP proxy to use in `host:port` format. Host can be hostname, IPv4 or IPv6. |
| `--http-auth=` | Basic HTTP authentication in `username:password` format. |
| `--config-file=` | Load CLI options from a config file. One option per line, `#` comments allowed. Without this flag, auto-discovers `~/.siteone-crawler.conf` or `/etc/siteone-crawler.conf`. CLI arguments override config file values. |
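To make the `--config-file=` row concrete, the sketch below writes a small config file and counts its active lines. It assumes each line is written exactly like the corresponding CLI option (`--name=value`); the table only states "one option per line" with `#` comments, so treat the exact line syntax as an assumption. The `CONF` path and the option values are illustrative.

```shell
# Write a sample config file; the option values are illustrative only and the
# "--name=value" line syntax is an assumption based on "one option per line".
CONF="${CONF:-./siteone-crawler.conf}"   # hypothetical path, passed via --config-file
cat > "$CONF" <<'EOF'
# Crawl politely by default
--workers=2
--max-reqs-per-sec=10
--timeout=5
--device=mobile
EOF

# Count the active (non-comment, non-empty) option lines.
grep -cv -e '^#' -e '^$' "$CONF"

# CLI arguments override values from the file, for example:
#   ./siteone-crawler --config-file="$CONF" --url=https://mydomain.tld/ --timeout=10
```

Keeping shared defaults in the file and passing only the per-run differences on the command line relies on the documented rule that CLI arguments take precedence.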
### Output settings
| Parameter | Description |
|-----------|-------------|
| `--output=` | Output type. Supported values: `text`, `json`. Default is `text`. |
| `--extra-columns=` | Comma delimited list of extra columns added to output table. You can specify HTTP headers (e.g. `X-Cache`), predefined values (`Title`, `Keywords`, `Description`, `DOM`), or custom extraction from text files (HTML, JS, CSS, TXT, JSON, XML, etc.) using XPath or regexp. For custom extraction, use the format `Custom_column_name=method:pattern#group(length)`, where `method` is `xpath` or `regexp`, `pattern` is the extraction pattern, an optional `#group` specifies the capturing group (or node index for XPath) to return (defaulting to the entire match or first node), and an optional `(length)` sets the maximum output length (append `>` to disable truncation). For example, use `Heading1=xpath://h1/text()(20>)` to extract the text of the first H1 element from the HTML document, and `ProductPrice=regexp:/Price:\s*\$?(\d+(?:\.\d{2})?)/i#1(10)` to extract a numeric price (e.g., "29.99") from a string like "Price: $29.99". |
| `--url-column-size=` | Basic URL column width. By default, it is calculated from the size of your terminal window. |
| `--rows-limit=` | Max. number of rows to display in tables with analysis results. Default is `200`. |
| `--timezone=` | Timezone for datetimes in HTML reports and timestamps in output folders/files, e.g. `Europe/Prague`. Default is `UTC`. |
| `--do-not-truncate-url` | In the text output, long URLs are truncated by default to `--url-column-size` so the table does not wrap due to long URLs. With this option, you can turn off the truncation. |
| `--show-scheme-and-host` | On text output, show scheme and host also for origin domain URLs. |
| `--hide-progress-bar` | Hide progress bar visible in text and JSON output for more compact view. |
| `--hide-columns=` | Hide specified columns from the progress table. Comma-separated list of column names: `type`, `time`, `size`, `cache`. Example: `--hide-columns=cache` or `--hide-columns=cache,type`. |
| `--no-color` | Disable colored output. |
| `--force-color` | Force colored output regardless of support detection. |
| `--show-inline-criticals` | Show criticals from the analyzer directly in the URL table. |
| `--show-inline-warnings` | Show warnings from the analyzer directly in the URL table. |
### Resource filtering
| Parameter | Description |
|-----------|-------------|
| `--disable-all-assets` | Disables crawling of all assets and files and only crawls pages in href attributes. Shortcut for calling all other `--disable-*` flags. |
| `--disable-javascript` | Disables JavaScript downloading and removes all JavaScript code from HTML, including `onclick` and other `on*` handlers. |
| `--disable-styles` | Disables CSS file downloading and at the same time removes all style definitions
by `