# SiteOne Crawler

SiteOne Crawler is a powerful and easy-to-use **website analyzer, cloner, and converter** designed for developers seeking security and performance insights, SEO specialists identifying optimization opportunities, and website owners needing reliable backups and offline versions.

**Now rewritten in Rust** for maximum performance, minimal resource usage, and zero runtime dependencies. The transition from PHP+Swoole to Rust resulted in **25% faster execution** and **30% lower memory consumption** while producing identical output.

**Discover the SiteOne Crawler advantage:**

* **Run Anywhere:** Single native binary for **🪟 Windows**, **🍎 macOS**, and **🐧 Linux** (x64 & arm64). No runtime dependencies.
* **Work Your Way:** Launch the binary without arguments for an **interactive wizard** 🧙 with 10 preset modes, use the extensive **command-line interface** 📟 ([releases](https://github.com/janreges/siteone-crawler/releases), [▶️ video](https://www.youtube.com/watch?v=25T_yx13naA&list=PL9mElgTe-s1Csfg0jXWmDS0MHFN7Cpjwp)) for automation and power, or enjoy the intuitive **desktop GUI application** 💻 ([GUI app](https://github.com/janreges/siteone-crawler-gui), [▶️ video](https://www.youtube.com/watch?v=rFW8LNEVNdw)) for visual control.
* **Rich Output Formats:** Interactive **HTML audit report** 📊 with sortable tables and quality scoring (0.0-10.0) (see [nextjs.org sample](https://crawler.siteone.io/html/2024-08-23/forever/cl8xw4r-fdag8wg-44dd.html)), detailed **JSON** for programmatic consumption, and human-readable **text** for terminal. Send HTML reports directly to your inbox via **built-in SMTP mailer** 📧.
* **CI/CD Integration:** Built-in **quality gate** (`--ci`) with configurable thresholds: exit code 10 on failure enables automated deployment blocking. Also useful for **cache warming**: crawling the entire site after deployment populates your reverse proxy/CDN cache.
* **Offline & Markdown Power:** Create complete **offline clones** 💾 for browsing without a server ([nextjs.org clone](https://crawler.siteone.io/examples-exports/nextjs.org/)) or convert entire websites into clean **Markdown** 📝, perfect for backups, documentation, or feeding content to AI models ([examples](https://github.com/janreges/siteone-crawler-markdown-examples/)).
* **Deep Crawling & Analysis:** Thoroughly crawl every page and asset, identify errors (404s, redirects), generate **sitemaps** 🗺️, and even get **email summaries** 📧 (watch [▶️ video example](https://www.youtube.com/watch?v=PHIFSOmk0gk)).
* **Learn More:** Dive into the 🌐 [Project Website](https://crawler.siteone.io/), explore the detailed [Documentation](https://crawler.siteone.io/configuration/command-line-options/), or check the [JSON](docs/JSON-OUTPUT.md)/[Text](docs/TEXT-OUTPUT.md) output specs.

GIF animation of the crawler in action (also available as a [▶️ video](https://www.youtube.com/watch?v=25T_yx13naA&list=PL9mElgTe-s1Csfg0jXWmDS0MHFN7Cpjwp)):

![SiteOne Crawler](docs/siteone-crawler-command-line.gif)

## Table of contents

- [✨ Features](#-features)
    * [🕷️ Crawler](#️-crawler)
    * [🛠️ Dev/DevOps assistant](#️-devdevops-assistant)
    * [📊 Analyzer](#-analyzer)
    * [📧 Reporter](#-reporter)
    * [💾 Offline website generator](#-offline-website-generator)
    * [📝 Website to markdown converter](#-website-to-markdown-converter)
    * [🗺️ Sitemap generator](#️-sitemap-generator)
- [🚀 Installation](#-installation)
    * [📦 Pre-built binaries](#-pre-built-binaries)
    * [🍺 Homebrew (macOS / Linux)](#-homebrew-macos--linux)
    * [🐧 Debian / Ubuntu (apt)](#-debian--ubuntu-apt)
    * [🎩 Fedora / RHEL (dnf)](#-fedora--rhel-dnf)
    * [🦎 openSUSE / SLES (zypper)](#-opensuse--sles-zypper)
    * [🏔️ Alpine Linux (apk)](#️-alpine-linux-apk)
    * [🔨 Build from source](#-build-from-source)
- [▶️ Usage](#️-usage)
    * [Interactive wizard](#interactive-wizard)
    * [Basic example](#basic-example)
    * [CI/CD example](#cicd-example)
    * [Fully-featured example](#fully-featured-example)
    * [⚙️ Arguments](#️-arguments)
        + [Basic settings](#basic-settings)
        + [Output settings](#output-settings)
        + [Resource filtering](#resource-filtering)
        + [Advanced crawler settings](#advanced-crawler-settings)
        + [File export settings](#file-export-settings)
        + [Mailer options](#mailer-options)
        + [Upload options](#upload-options)
        + [Offline exporter options](#offline-exporter-options)
        + [Markdown exporter options](#markdown-exporter-options)
        + [Sitemap options](#sitemap-options)
        + [Expert options](#expert-options)
        + [Fastest URL analyzer](#fastest-url-analyzer)
        + [SEO and OpenGraph analyzer](#seo-and-opengraph-analyzer)
        + [Slowest URL analyzer](#slowest-url-analyzer)
        + [Built-in HTTP server](#built-in-http-server)
        + [HTML-to-Markdown conversion](#html-to-markdown-conversion)
        + [CI/CD settings](#cicd-settings)
- [🏆 Quality Scoring](#-quality-scoring)
- [🔄 CI/CD Integration](#-cicd-integration)
- [📄 Output Examples](#-output-examples)
- [🧪 Testing](#-testing)
- [⚠️ Disclaimer](#️-disclaimer)
- [📜 License](#-license)

## ✨ Features

In short, the main benefits can be summarized in these points:

- **🕷️ Crawler** - very powerful crawler of the entire website reporting useful information about each URL (status code, response time, size, custom headers, titles, etc.)
- **🛠️ Dev/DevOps assistant** - offers stress/load testing with configurable concurrent workers (`--workers`) and request rate (`--max-reqs-per-sec`), cache warming, localhost testing, and rich URL/content-type filtering
- **📊 Analyzer** - analyzes all webpages and reports strange or error behaviour and useful statistics (404, redirects, bad practices, SEO and security issues, heading structures, etc.)
- **📧 Reporter** - interactive **HTML audit report**, structured **JSON**, and colored **text** output; built-in **SMTP mailer** sends HTML reports directly to your inbox
- **💾 Offline website generator** - clone entire websites to browsable local HTML files (no server needed) including all assets. Supports **multi-domain clones**: include subdomains or external domains with intelligent cross-linking.
- **📝 Website to markdown converter** - export the entire website to browsable text markdown (viewable on GitHub or any text editor), or generate a **single-file markdown** with smart header/footer deduplication, ideal for **feeding to AI tools**. Includes a **built-in web server** that renders markdown exports as styled HTML pages. Also supports **standalone HTML-to-Markdown conversion** of local files (`--html-to-markdown`). See [markdown examples](https://github.com/janreges/siteone-crawler-markdown-examples/).
- **🗺️ Sitemap generator** - allows you to generate `sitemap.xml` and `sitemap.txt` files with a list of all pages on your website
- **🏆 Quality scoring** - automatic quality scoring (0.0-10.0) across 5 categories: Performance, SEO, Security, Accessibility, Best Practices
- **🔄 CI/CD quality gate** - configurable thresholds with exit code 10 on failure for automated pipelines; also useful as a **post-deployment cache warmer** for reverse proxies and CDNs

The following sections describe these features in greater detail:

### 🕷️ Crawler

- **all major platforms** supported without dependencies (🐧 Linux, 🪟 Windows, 🍎 macOS, arm64) as a single native binary
- has incredible **🚀 native Rust performance** with async I/O and multi-threaded crawling
- provides simulation of **different device types** (desktop/mobile/tablet) thanks to predefined User-Agents
- will crawl **all files**, styles, scripts, fonts, images, documents, etc. on your website
- will respect the `robots.txt` file and will not crawl the pages that are not allowed
- has a **beautiful interactive** and **🎨 colourful output**
- it will **clearly warn you** ⚠️ of any wrong use of the tool (e.g. input parameters validation or wrong permissions)
- the `--url` parameter also accepts a `sitemap.xml` file (or [sitemap index](https://www.sitemaps.org/protocol.html#index)), which will be processed as a list of URLs. In sitemap-only mode, the crawler follows only URLs from the sitemap; it does not discover additional links from HTML pages. Gzip-compressed sitemaps (`*.xml.gz`) are fully supported, both as direct URLs and when referenced from sitemap index files.
- respects the HTML `<base>` tag when resolving relative URLs on pages that use it.

### 🛠️ Dev/DevOps assistant

- allows testing **public** and **local projects on specific ports** (e.g. `http://localhost:3000/`)
- works as a **stress/load tester**: configure the number of **concurrent workers** (`--workers`) and the **maximum requests per second** (`--max-reqs-per-sec`) to simulate various traffic levels and test your infrastructure's resilience against high load or DoS scenarios
- combine with **rich filtering options**: include/ignore URLs by regex (`--include-regex`, `--ignore-regex`), disable specific asset types (`--disable-javascript`, `--disable-images`, etc.), or limit crawl depth (`--max-depth`) to focus the load on specific parts of your website
- will help you **warm up the application cache** or the **cache on the reverse proxy** of the entire website

### 📊 Analyzer

- will **find the weak points** or **strange behavior** of your website
- built-in analyzers cover SEO, security headers, accessibility, best practices, performance, SSL/TLS, caching, and more

### 📧 Reporter

Three output formats:

- **Interactive HTML report**: a self-contained `.html` file with sortable tables, quality scores, color-coded findings, and sections for SEO, security, accessibility, performance, headers, redirects, 404s, and more. Open it in any browser; no server is needed.
- **JSON output**: structured data with all crawled URLs, response details, analysis findings, scores, and CI/CD gate results. Ideal for programmatic consumption, dashboards, and integrations.
- **Text output**: human-readable colored terminal output with tables, progress bars, and summaries.

Additional reporting features:

- **Built-in SMTP mailer**: send the HTML audit report directly to one or more email addresses via your own SMTP server. Configure sender, recipients, subject template, and SMTP credentials via CLI options.
- will provide you with data for **SEO analysis**, just add the `Title`, `Keywords` and `Description` extra columns
- will provide useful **summaries and statistics** at the end of the processing

### 💾 Offline website generator

- will help you **export the entire website** to offline form, where it is possible to browse the site through local HTML files (without HTTP server) including all documents, images, styles, scripts, fonts, etc.
- supports **multi-domain clones**: include subdomains (`*.mysite.tld`) or entirely different domains in a single offline export. All URLs across included domains are **intelligently rewritten to relative paths**, so the resulting offline version cross-links pages between domains seamlessly and you get one unified browsable clone.
- you can **limit what assets** you want to download and export (see `--disable-*` directives); for some types of websites, the best result is achieved with the `--disable-javascript` option.
- you can specify by `--allowed-domain-for-external-files` (short `-adf`) from which **external domains** it is possible to **download** assets (JS, CSS, fonts, images, documents) including `*` option for all domains.
- you can specify by `--allowed-domain-for-crawling` (short `-adc`) which **other domains** should be included in the **crawling** if there are any links pointing to them. You can enable e.g. `mysite.*` to export all language versions that have a different TLD or `*.mysite.tld` to export all subdomains.
- you can use `--single-page` to **export only one page** to which the URL is given (and its assets), but do not follow other pages.
- you can use `--single-foreign-page` to **export only one page** from another domain (if allowed by `--allowed-domain-for-crawling`), but do not follow other pages.
- you can use `--replace-content` to **replace content** in HTML/JS/CSS with `foo -> bar` or regexp in PCRE format, e.g. `/card[0-9]/i -> card`. Can be specified multiple times.
- you can use `--replace-query-string` to **replace chars in query string** in the filename.
- you can use `--max-depth` to set the **maximum crawling depth** (for pages, not assets). `1` means `/about` or `/about/`, `2` means `/about/contacts` etc.
- you can use it to **export your website to a static form** and host it on GitHub Pages, Netlify, Vercel, etc. as a static backup and part of your **disaster recovery plan** or **archival/legal needs**
- works great with **older conventional websites** but also **modern ones**, built on frameworks like Next.js, Nuxt.js, SvelteKit, Astro, Gatsby, etc. When a JS framework is detected, the export also performs some framework-specific code modifications for optimal results.
- **try it** for your website, and you will be very pleasantly surprised :-)

### 📝 Website to markdown converter

Two export modes:

- **Multi-file markdown**: exports the entire website with all subpages to a directory of **browsable `.md` files**. The markdown renders nicely when uploaded to GitHub, viewed in VS Code, or any text editor. Links between pages are converted to relative `.md` links so you can navigate between files. Optionally includes images and other files (PDF, etc.).
- **Single-file markdown**: combines all pages into **one large markdown file** with smart removal of duplicate website headers and footers across pages. Ideal for **feeding entire website content to AI tools** (ChatGPT, Claude, etc.) that process markdown more effectively than raw HTML.

Smart conversion features:

- **collapsible accordions**: large link lists (menus, navigation, footer links with 8+ items) are automatically collapsed into `<details>` accordions with contextual labels ("Menu", "Links") for better readability
- content before the main heading (typically h1), such as the site header and navigation, is moved to the end of the page below a `---` separator, so the actual page content comes first
- you can set multiple selectors (CSS-like) to **remove unwanted elements** from the exported markdown
- **code block detection** and **syntax highlighting** for popular programming languages
- HTML tables are converted to proper **markdown tables**

Built-in web server:

- use `--serve-markdown=` to start a **built-in HTTP server** that renders your markdown export as styled HTML pages with tables, dark/light mode, breadcrumb navigation, and accordion support, which is perfect for browsing and sharing the export locally or on a network

Standalone HTML-to-Markdown conversion:

- use `--html-to-markdown=` to convert a **local HTML file** directly to Markdown without crawling any website
- outputs clean Markdown to **stdout** (pipe-friendly) or to a file with `--html-to-markdown-output=`
- uses the same conversion pipeline as `--markdown-export-dir`, including all cleanup, accordion collapsing, code language detection, and implicit exclusions (cookie banners, `aria-hidden` elements, `role="menu"` dropdowns)
- respects `--markdown-disable-images`, `--markdown-disable-files`, `--markdown-exclude-selector`, and `--markdown-move-content-before-h1-to-end`
- does **not** rewrite links (`.html` → `.md`) since the file is standalone with no site context

💡 Tip: you can push the exported markdown folder to your GitHub repository, where it will be automatically rendered as browsable documentation. You can look at the [examples](https://github.com/janreges/siteone-crawler-markdown-examples/) of websites converted to markdown.

See all available [markdown exporter options](#markdown-exporter-options) and [HTML-to-Markdown conversion options](#html-to-markdown-conversion).
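The standalone conversion described above writes Markdown to stdout, so it can be looped over a folder of local HTML files. The following is a minimal sketch under stated assumptions: `convert_html_dir` and the `CRAWLER_BIN` variable are hypothetical names introduced here for illustration, and only the `--html-to-markdown=` flag comes from this README.

```bash
# Assumption: CRAWLER_BIN points at the siteone-crawler binary (this path is illustrative).
CRAWLER_BIN="${CRAWLER_BIN:-./siteone-crawler}"

# Hypothetical helper: convert every *.html file in a directory
# to a sibling .md file by redirecting the converter's stdout.
convert_html_dir() {
  src_dir="$1"
  for html_file in "$src_dir"/*.html; do
    [ -e "$html_file" ] || continue        # skip when the glob matches nothing
    md_file="${html_file%.html}.md"        # page.html -> page.md
    "$CRAWLER_BIN" --html-to-markdown="$html_file" > "$md_file" || return 1
  done
}
```

A call such as `convert_html_dir ./saved-pages` would leave one `.md` file next to each `.html` file; the helper stops at the first file whose conversion command returns a non-zero status.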
### 🗺️ Sitemap generator

- will help you create a `sitemap.xml` and `sitemap.txt` for your website
- you can set the priority of individual pages based on the number of slashes in the URL

Don't hesitate and try it. You will love it as we do! ❤️

## 🚀 Installation

### 📦 Pre-built binaries

Download pre-built binaries from [🐙 GitHub releases](https://github.com/janreges/siteone-crawler/releases) for all major platforms (🐧 Linux, 🪟 Windows, 🍎 macOS, x64 & arm64).

The binary is self-contained; no runtime dependencies are required.

```bash
# Linux / macOS: download, extract, run
./siteone-crawler --url=https://my.domain.tld
```

**🐧 Linux binary variants:** For Linux, two binary variants are provided:

| Variant | Compatibility | Performance |
|---------|--------------|-------------|
| **glibc** (primary) | Requires glibc 2.39+ (Ubuntu 24.04+, Debian 13+, Fedora 40+) | Full native performance |
| **musl** (compatible) | Any Linux distribution (statically linked, no dependencies) | ~50–80% slower due to musl memory allocator |

The **glibc** variant is recommended for current distributions because it offers the best performance. If you are running an older distribution (e.g. Ubuntu 22.04, Debian 12) and encounter a `GLIBC_2.xx not found` error, use the **musl** variant instead. The musl binary is fully statically linked and runs on any Linux system regardless of the installed glibc version. The performance difference is mainly noticeable during CPU-intensive operations like offline and markdown exports.

**Note for macOS users**: If macOS refuses to start the crawler from your Downloads folder, move the entire folder with the crawler **via the terminal** to another location, for example to your home folder `~`.

### 🍺 Homebrew (macOS / Linux)

```bash
brew install janreges/tap/siteone-crawler
siteone-crawler --url=https://my.domain.tld
```

### 🐧 Debian / Ubuntu (apt)

```bash
curl -1sLf 'https://dl.cloudsmith.io/public/janreges/siteone-crawler/setup.deb.sh' | sudo -E bash
sudo apt-get install siteone-crawler
```

> **Older distributions (Ubuntu 22.04, Debian 11/12, etc.):** If you get a `GLIBC_X.XX not found` error, install the statically linked variant instead:
> ```bash
> sudo apt-get install siteone-crawler-static
> ```
> See [Linux binary variants](#-pre-built-binaries) for details on the performance difference.

### 🎩 Fedora / RHEL (dnf)

```bash
curl -1sLf 'https://dl.cloudsmith.io/public/janreges/siteone-crawler/setup.rpm.sh' | sudo -E bash
sudo dnf install siteone-crawler
```

> **Older distributions:** If you get a `GLIBC_X.XX not found` error, use `sudo dnf install siteone-crawler-static` instead.
> See [Linux binary variants](#-pre-built-binaries) for details.

### 🦎 openSUSE / SLES (zypper)

```bash
curl -1sLf 'https://dl.cloudsmith.io/public/janreges/siteone-crawler/setup.rpm.sh' | sudo -E bash
sudo zypper install siteone-crawler
```

> **Older distributions:** If you get a `GLIBC_X.XX not found` error, use `sudo zypper install siteone-crawler-static` instead.
> See [Linux binary variants](#-pre-built-binaries) for details.

### 🏔️ Alpine Linux (apk)

```bash
curl -1sLf 'https://dl.cloudsmith.io/public/janreges/siteone-crawler/setup.alpine.sh' | sudo -E bash
sudo apk add siteone-crawler
```

### 🔨 Build from source

Requires [Rust](https://www.rust-lang.org/tools/install) 1.85 or later.
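To decide between the two Linux variants of the pre-built binaries before downloading, you can compare the installed glibc version with the 2.39 minimum stated for the glibc build. This is only a sketch: `pick_variant` is a hypothetical helper, and it assumes `getconf GNU_LIBC_VERSION` (available on glibc systems) and a `sort` that supports `-V` version ordering.

```bash
# Hypothetical helper: prints "glibc" when the given version is at least 2.39,
# otherwise "musl" (also used when no glibc version could be detected).
pick_variant() {
  required_version="2.39"
  detected_version="$1"
  # sort -V orders version strings; if the required version sorts first (or is equal),
  # the detected glibc is new enough for the glibc build.
  lowest_version="$(printf '%s\n%s\n' "$required_version" "$detected_version" | sort -V | head -n 1)"
  if [ -n "$detected_version" ] && [ "$lowest_version" = "$required_version" ]; then
    echo "glibc"
  else
    echo "musl"
  fi
}

# getconf prints e.g. "glibc 2.39" on glibc systems; on other systems the result stays empty.
system_glibc="$(getconf GNU_LIBC_VERSION 2>/dev/null | awk '{print $2}')"
pick_variant "$system_glibc"
```

Falling back to `musl` whenever detection fails mirrors the compatibility column above: the statically linked build is the one described as running on any Linux distribution.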
```bash
git clone https://github.com/janreges/siteone-crawler.git
cd siteone-crawler

# Build optimized release binary
cargo build --release

# Run
./target/release/siteone-crawler --url=https://my.domain.tld
```

**Build statically linked (musl) binary:**

```bash
# Install musl toolchain (Ubuntu/Debian)
sudo apt-get install musl-tools
rustup target add x86_64-unknown-linux-musl

# Build static binary (no system dependencies)
cargo build --release --target x86_64-unknown-linux-musl

# Run (works on any Linux distribution)
./target/x86_64-unknown-linux-musl/release/siteone-crawler --url=https://my.domain.tld
```

## ▶️ Usage

### Interactive wizard

Run the binary **without any arguments** and an interactive wizard will guide you through the configuration. Choose from 10 preset modes, enter the target URL, fine-tune settings with arrow keys, and the crawler starts immediately, with no need to remember CLI flags.

```
? Choose a crawl mode:
❯ Quick Audit             Fast site health overview - crawls all pages and assets
  SEO Analysis            Extract titles, descriptions, keywords, and OpenGraph tags
  Performance Test        Measure response times with cache disabled - find bottlenecks
  Security Check          Check SSL/TLS, security headers, and redirects site-wide
  Offline Clone           Download entire website with all assets for offline browsing
  Markdown Export         Convert pages to Markdown for AI models or documentation
  Stress Test             High-concurrency load test with cache-busting random params
  Single Page             Deep analysis of a single URL - SEO, security, performance
  Large Site Crawl        High-throughput HTML-only crawl for large sites (100k+ pages)
  Custom                  Start from defaults and configure every option manually
  ──────────────────────────────────────
  Browse offline export   Serve a previously exported offline site via HTTP
  Browse markdown export  Serve a previously exported markdown site via HTTP
[↑↓ to move, enter to select, type to filter]
```

After selecting a preset and entering the URL, the wizard shows a settings form where you can adjust workers, timeout, content types, export options, and more. A configuration summary with the equivalent CLI command is displayed before the crawl starts, so you can copy it for future use without the wizard.

If existing offline or markdown exports are detected in `./tmp/`, the wizard also offers to **serve them via the built-in HTTP server** directly from the menu.

### Basic example

To run the crawler from the command line, provide the required arguments:

```bash
./siteone-crawler --url=https://mydomain.tld/ --device=mobile
```

### CI/CD example

```bash
# Fail deployment if quality score < 7.0 or any 5xx errors
./siteone-crawler --url=https://mydomain.tld/ --ci --ci-min-score=7.0 --ci-max-5xx=0
echo $? # 0 = pass, 10 = fail
```

### Fully-featured example

```bash
./siteone-crawler --url=https://mydomain.tld/ \
    --output=text \
    --workers=2 \
    --max-reqs-per-sec=10 \
    --memory-limit=2048M \
    --resolve='mydomain.tld:443:127.0.0.1' \
    --timeout=5 \
    --proxy=proxy.mydomain.tld:8080 \
    --http-auth=myuser:secretPassword123 \
    --user-agent="My User-Agent String" \
    --extra-columns="DOM,X-Cache(10),Title(40),Keywords(50),Description(50>),Heading1=xpath://h1/text()(20>),ProductPrice=regexp:/Price:\s*\$?(\d+(?:\.\d{2})?)/i#1(10)" \
    --accept-encoding="gzip, deflate" \
    --url-column-size=100 \
    --max-queue-length=3000 \
    --max-visited-urls=10000 \
    --max-url-length=5000 \
    --max-non200-responses-per-basename=10 \
    --include-regex="/^.*\/technologies.*/" \
    --include-regex="/^.*\/fashion.*/" \
    --ignore-regex="/^.*\/downloads\/.*\.pdf$/i" \
    --analyzer-filter-regex="/^.*$/i" \
    --remove-query-params \
    --keep-query-param=page \
    --add-random-query-params \
    --transform-url="live-site.com -> local-site.local" \
    --transform-url="/cdn\.live-site\.com/ -> local-site.local/cdn" \
    --show-scheme-and-host \
    --do-not-truncate-url \
    --output-html-report=tmp/myreport.html \
    --html-report-options="summary,seo-opengraph,visited-urls,security,redirects" \
    --output-json-file=/dir/report.json \
    --output-text-file=/dir/report.txt \
    --add-timestamp-to-output-file \
    --add-host-to-output-file \
    --offline-export-dir=tmp/mydomain.tld \
    --replace-content='/]+>/ -> ' \
    --ignore-store-file-error \
    --sitemap-xml-file=/dir/sitemap.xml \
    --sitemap-txt-file=/dir/sitemap.txt \
    --sitemap-base-priority=0.5 \
    --sitemap-priority-increase=0.1 \
    --markdown-export-dir=tmp/mydomain.tld.md \
    --markdown-export-single-file=tmp/mydomain.tld.combined.md \
    --markdown-move-content-before-h1-to-end \
    --markdown-disable-images \
    --markdown-disable-files \
    --markdown-remove-links-and-images-from-single-file \
    --markdown-exclude-selector='.exclude-me' \
    --markdown-replace-content='/]+>/ -> ' \
    --markdown-replace-query-string='/[a-z]+=[^&]*(&|$)/i -> $1__$2' \
    --mail-to=your.name@my-mail.tld \
    --mail-to=your.friend.name@my-mail.tld \
    --mail-from=crawler@my-mail.tld \
    --mail-from-name="SiteOne Crawler" \
    --mail-subject-template="Crawler Report for %domain% (%date%)" \
    --mail-smtp-host=smtp.my-mail.tld \
    --mail-smtp-port=25 \
    --mail-smtp-user=smtp.user \
    --mail-smtp-pass=secretPassword123 \
    --ci --ci-min-score=7.0 --ci-min-security=8.0
```

## ⚙️ Arguments

For a clearer list, I recommend going to the documentation: 🌐 https://crawler.siteone.io/configuration/command-line-options/

### Basic settings

| Parameter | Description |
|-----------|-------------|
| `--url=` | Required. HTTP or HTTPS URL address of the website or sitemap xml to be crawled. Use quotation marks `''` if the URL contains query parameters. |
| `--single-page` | Load only one page to which the URL is given (and its assets), but do not follow other pages. |
| `--max-depth=` | Maximum crawling depth (for pages, not assets). Default is `0` (no limit). `1` means `/about` or `/about/`, `2` means `/about/contacts` etc. |
| `--device=` | Device type for choosing a predefined User-Agent. Ignored when `--user-agent` is defined. Supported values: `desktop`, `mobile`, `tablet`. Default is `desktop`. |
| `--user-agent=` | Custom User-Agent header. Use quotation marks. If specified, it takes precedence over the device parameter. If you add `!` at the end, the siteone-crawler/version will not be added as a signature at the end of the final user-agent. |
| `--timeout=` | Request timeout in seconds. Default is `5`. |
| `--proxy=` | HTTP proxy to use in `host:port` format. Host can be hostname, IPv4 or IPv6. |
| `--http-auth=` | Basic HTTP authentication in `username:password` format. |
| `--config-file=` | Load CLI options from a config file. One option per line, `#` comments allowed. Without this flag, auto-discovers `~/.siteone-crawler.conf` or `/etc/siteone-crawler.conf`. CLI arguments override config file values. |

### Output settings

| Parameter | Description |
|-----------|-------------|
| `--output=` | Output type. Supported values: `text`, `json`. Default is `text`. |
| `--extra-columns=` | Comma delimited list of extra columns added to output table. You can specify HTTP headers (e.g. `X-Cache`), predefined values (`Title`, `Keywords`, `Description`, `DOM`), or custom extraction from text files (HTML, JS, CSS, TXT, JSON, XML, etc.) using XPath or regexp. For custom extraction, use the format `Custom_column_name=method:pattern#group(length)`, where `method` is `xpath` or `regexp`, `pattern` is the extraction pattern, an optional `#group` specifies the capturing group (or node index for XPath) to return (defaulting to the entire match or first node), and an optional `(length)` sets the maximum output length (append `>` to disable truncation). For example, use `Heading1=xpath://h1/text()(20>)` to extract the text of the first H1 element from the HTML document, and `ProductPrice=regexp:/Price:\s*\$?(\d+(?:\.\d{2})?)/i#1(10)` to extract a numeric price (e.g., "29.99") from a string like "Price: $29.99". |
| `--url-column-size=` | Basic URL column width. By default, it is calculated from the size of your terminal window. |
| `--rows-limit=` | Max. number of rows to display in tables with analysis results. Default is `200`. |
| `--timezone=` | Timezone for datetimes in HTML reports and timestamps in output folders/files, e.g. `Europe/Prague`. Default is `UTC`. |
| `--do-not-truncate-url` | In the text output, long URLs are truncated by default to `--url-column-size` so the table does not wrap due to long URLs. With this option, you can turn off the truncation. |
| `--show-scheme-and-host` | On text output, show scheme and host also for origin domain URLs. |
| `--hide-progress-bar` | Hide progress bar visible in text and JSON output for more compact view. |
| `--hide-columns=` | Hide specified columns from the progress table. Comma-separated list of column names: `type`, `time`, `size`, `cache`. Example: `--hide-columns=cache` or `--hide-columns=cache,type`. |
| `--no-color` | Disable colored output. |
| `--force-color` | Force colored output regardless of support detection. |
| `--show-inline-criticals` | Show criticals from the analyzer directly in the URL table. |
| `--show-inline-warnings` | Show warnings from the analyzer directly in the URL table. |

### Resource filtering

| Parameter | Description |
|-----------|-------------|
| `--disable-all-assets` | Disables crawling of all assets and files and only crawls pages in href attributes. Shortcut for calling all other `--disable-*` flags. |
| `--disable-javascript` | Disables JavaScript downloading and removes all JavaScript code from HTML, including `onclick` and other `on*` handlers. |
| `--disable-styles` | Disables CSS file downloading and at the same time removes all style definitions by `