# coURLan: Clean, filter, normalize, and sample URLs [![Python package](https://img.shields.io/pypi/v/courlan.svg)](https://pypi.python.org/pypi/courlan) [![Python versions](https://img.shields.io/pypi/pyversions/courlan.svg)](https://pypi.python.org/pypi/courlan) [![Code Coverage](https://img.shields.io/codecov/c/github/adbar/courlan.svg)](https://codecov.io/gh/adbar/courlan) [![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff) ## Why coURLan? > "It is important for the crawler to visit 'important' pages first, > so that the fraction of the Web that is visited (and kept up to date) > is more meaningful." (Cho et al. 1998) > > "Given that the bandwidth for conducting crawls is neither infinite > nor free, it is becoming essential to crawl the Web in not only a > scalable, but efficient way, if some reasonable measure of quality or > freshness is to be maintained." (Edwards et al. 2001) This library provides an additional "brain" for web crawling, scraping and document management. It facilitates web navigation through a set of filters, enhancing the quality of resulting document collections: - Save bandwidth and processing time by steering clear of pages deemed low-value - Identify specific pages based on language or text content - Pinpoint pages relevant for efficient link gathering Additional utilities needed include URL storage, filtering, and deduplication. ## Features Separate the wheat from the chaff and optimize document discovery and retrieval: - URL handling - Validation - Normalization - Sampling - Heuristics for link filtering - Spam, trackers, and content-types - Locales and internationalization - Web crawling (frontier, scheduling) - Data store specifically designed for URLs - Usable with Python or on the command-line **Let the coURLan fish up juicy bits for you!** Courlan bird Here is a [courlan](https://en.wiktionary.org/wiki/courlan) (source: [Limpkin at Harn's Marsh by Russ](https://commons.wikimedia.org/wiki/File:Limpkin,_harns_marsh_(33723700146).jpg), CC BY 2.0). ## Installation This package requires Python 3.10 or higher and is tested on Linux, macOS and Windows systems. Courlan is available on the package repository [PyPI](https://pypi.org/) and can notably be installed with the Python package manager `pip`: ``` bash $ pip install courlan $ pip install --upgrade courlan # to make sure you have the latest version $ pip install git+https://github.com/adbar/courlan.git # latest available code (see build status above) ``` The last version to support Python 3.6 and 3.7 is `courlan==1.2.0`. The last version to support Python 3.8 and 3.9 is `courlan==1.3.2`. ## Python Most filters revolve around the `strict` and `language` arguments. ### check_url() All useful operations chained in `check_url(url)`: ``` python >>> from courlan import check_url # return url and domain name >>> check_url('https://github.com/adbar/courlan') ('https://github.com/adbar/courlan', 'github.com') # filter out bogus domains >>> check_url('http://666.0.0.1/') >>> # tracker removal >>> check_url('http://test.net/foo.html?utm_source=twitter#gclid=123') ('http://test.net/foo.html', 'test.net') # use strict for further trimming >>> my_url = 'https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.org' >>> check_url(my_url, strict=True) ('https://httpbin.org/redirect-to', 'httpbin.org') # check for redirects (HEAD request) >>> url, domain_name = check_url(my_url, with_redirects=True) # include navigation pages instead of discarding them >>> check_url('http://www.example.org/page/10/', with_nav=True) # remove trailing slash >>> check_url('https://github.com/adbar/courlan/', trailing_slash=False) ``` Language-aware heuristics, notably internationalization in URLs, are available in `lang_filter(url, language)`: ``` python # optional language argument >>> url = 'https://www.un.org/en/about-us' # success: returns clean URL and domain name >>> check_url(url, language='en') ('https://www.un.org/en/about-us', 'un.org') # failure: doesn't return anything >>> check_url(url, language='de') >>> # optional argument: strict >>> url = 'https://en.wikipedia.org/' >>> check_url(url, language='de', strict=False) ('https://en.wikipedia.org', 'wikipedia.org') >>> check_url(url, language='de', strict=True) >>> ``` Define stricter restrictions on the expected content type with `strict=True`. This also blocks certain platforms and page types where machines get lost. ``` python # strict filtering: blocked as it is a major platform >>> check_url('https://www.twitch.com/', strict=True) >>> ``` ### Sampling by domain name ``` python >>> from courlan import sample_urls >>> my_urls = ['https://example.org/' + str(x) for x in range(100)] >>> my_sample = sample_urls(my_urls, 10) # optional: exclude_min=None, exclude_max=None, strict=False, verbose=False ``` ### Web crawling and URL handling Link extraction and preprocessing: ``` python >>> from courlan import extract_links >>> doc = 'Link' >>> url = "https://example.org" >>> extract_links(doc, url) {'https://example.org/test/link.html'} # other options: external_bool, no_filter, language, strict, redirects, ... ``` The `filter_links()` function provides additional filters for crawling purposes: use of robots.txt rules and link prioritization. It returns two lists: regular links and priority (navigation) links. ``` python >>> from courlan import filter_links >>> doc = '1Tag' >>> links, links_priority = filter_links(doc, "https://example.org") >>> links ['https://example.org/page1.html'] >>> links_priority ['https://example.org/tag/listing'] ``` Determine if a link leads to another host: ``` python >>> from courlan import is_external >>> is_external('https://github.com/', 'https://www.microsoft.com/') True # default >>> is_external('https://google.com/', 'https://www.google.co.uk/', ignore_suffix=True) False # taking suffixes into account >>> is_external('https://google.com/', 'https://www.google.co.uk/', ignore_suffix=False) True ``` Other useful functions dedicated to URL handling: - `extract_domain(url, fast=True)`: find domain and subdomain or just domain with `fast=False` - `get_base_url(url)`: strip the URL of some of its parts - `get_host_and_path(url)`: decompose URLs in two parts: protocol + host/domain and path - `get_hostinfo(url)`: extract domain and host info (protocol + host/domain) - `fix_relative_urls(baseurl, url)`: prepend necessary information to relative links ``` python >>> from courlan import * >>> url = 'https://www.un.org/en/about-us' >>> get_base_url(url) 'https://www.un.org' >>> get_host_and_path(url) ('https://www.un.org', '/en/about-us') >>> get_hostinfo(url) ('un.org', 'https://www.un.org') >>> fix_relative_urls('https://www.un.org', 'en/about-us') 'https://www.un.org/en/about-us' ``` Other filters dedicated to crawl frontier management: - `is_not_crawlable(url)`: check for deep web or pages generally not usable in a crawling context - `is_navigation_page(url)`: check for navigation and overview pages ``` python >>> from courlan import is_navigation_page, is_not_crawlable >>> is_navigation_page('https://www.randomblog.net/category/myposts') True >>> is_not_crawlable('https://www.randomblog.net/login') True ``` See also [URL management page](https://trafilatura.readthedocs.io/en/latest/url-management.html) of the Trafilatura documentation. ### Python helpers Helper function, scrub and normalize: ``` python >>> from courlan import clean_url >>> clean_url('HTTPS://WWW.DWDS.DE:443/') 'https://www.dwds.de' ``` Basic scrubbing only: ``` python >>> from courlan import scrub_url ``` Basic canonicalization/normalization only, i.e. modifying and standardizing URLs in a consistent manner: ``` python >>> from urllib.parse import urlparse >>> from courlan import normalize_url >>> my_url = normalize_url(urlparse(my_url)) # passing URL strings directly also works >>> my_url = normalize_url(my_url) # remove unnecessary components and re-order query elements >>> normalize_url('http://test.net/foo.html?utm_source=twitter&post=abc&page=2#fragment', strict=True) 'http://test.net/foo.html?page=2&post=abc' ``` Basic URL validation only: ``` python >>> from courlan import validate_url >>> validate_url('http://1234') (False, None) >>> validate_url('http://www.example.org/') (True, ParseResult(scheme='http', netloc='www.example.org', path='/', params='', query='', fragment='')) ``` ### Troubleshooting Courlan uses an internal cache to speed up URL parsing. It can be reset as follows: ``` python >>> from courlan.meta import clear_caches >>> clear_caches() ``` ## UrlStore class The `UrlStore` class allow for storing and retrieving domain-classified URLs, where a URL like `https://example.org/path/testpage` is stored as the path `/path/testpage` within the domain `https://example.org`. It features the following methods: - URL management - `add_urls(urls=None, appendleft=None, visited=False)`: Add a list of URLs to the (possibly) existing one. Optional: append certain URLs to the left, specify if the URLs have already been visited. - `add_from_html(htmlstring, url, external=False, lang=None, with_nav=True)`: Extract and filter links in a HTML string. - `discard(domains)`: Declare domains void and prune the store. - `dump_urls()`: Return a list of all known URLs. - `print_urls()`: Print all URLs in store (URL + TAB + visited or not). - `print_unvisited_urls()`: Print all unvisited URLs in store. - `get_all_counts()`: Return all download counts for the hosts in store. - `get_known_domains()`: Return all known domains as a list. - `get_unvisited_domains()`: Find all domains for which there are unvisited URLs. - `total_url_number()`: Find number of all URLs in store. - `is_known(url)`: Check if the given URL has already been stored. - `has_been_visited(url)`: Check if the given URL has already been visited. - `filter_unknown_urls(urls)`: Take a list of URLs and return the currently unknown ones. - `filter_unvisited_urls(urls)`: Take a list of URLs and return the currently unvisited ones. - `find_known_urls(domain)`: Get all already known URLs for the given domain (ex. `https://example.org`). - `find_unvisited_urls(domain)`: Get all unvisited URLs for the given domain. - `reset()`: Re-initialize the URL store. - Crawling and downloads - `get_url(domain)`: Retrieve a single URL and consider it to be visited (with corresponding timestamp). - `get_rules(domain)`: Return the stored crawling rules for the given website. - `store_rules(website, rules)`: Store crawling rules for a given website. - `get_crawl_delay()`: Return the delay as extracted from robots.txt, or a given default. - `get_download_urls(time_limit=10, max_urls=10000)`: Get a list of immediately downloadable URLs according to the given time limit per domain. - `establish_download_schedule(max_urls=100, time_limit=10)`: Get up to the specified number of URLs along with a suitable backoff schedule (in seconds). - `download_threshold_reached(threshold)`: Find out if the download limit (in seconds) has been reached for one of the websites in store. - `unvisited_websites_number()`: Return the number of websites for which there are still URLs to visit. - `is_exhausted_domain(domain)`: Tell if all known URLs for the website have been visited. - Persistance - `write(filename)`: Save the store to disk. - `load_store(filename)`: Read a UrlStore from disk (separate function, not class method). - Optional settings: - `compressed=True`: activate compression of URLs and rules - `language=XX`: focus on a particular target language (two-letter code) - `strict=True`: stricter URL filtering - `verbose=True`: dump URLs if interrupted (requires use of `signal`) ## Command-line The main fonctions are also available through a command-line utility: ``` bash $ courlan --inputfile url-list.txt --outputfile cleaned-urls.txt $ courlan --help usage: courlan [-h] -i INPUTFILE -o OUTPUTFILE [-d DISCARDEDFILE] [-v] [-p PARALLEL] [--strict] [-l LANGUAGE] [-r] [--sample SAMPLE] [--exclude-max EXCLUDE_MAX] [--exclude-min EXCLUDE_MIN] Command-line interface for Courlan options: -h, --help show this help message and exit I/O: Manage input and output -i INPUTFILE, --inputfile INPUTFILE name of input file (required) -o OUTPUTFILE, --outputfile OUTPUTFILE name of output file (required) -d DISCARDEDFILE, --discardedfile DISCARDEDFILE name of file to store discarded URLs (optional) -v, --verbose increase output verbosity -p PARALLEL, --parallel PARALLEL number of parallel processes (not used for sampling) Filtering: Configure URL filters --strict perform more restrictive tests -l LANGUAGE, --language LANGUAGE use language filter (ISO 639-1 code) -r, --redirects check redirects Sampling: Use sampling by host, configure sample size --sample SAMPLE size of sample per domain --exclude-max EXCLUDE_MAX exclude domains with more than n URLs --exclude-min EXCLUDE_MIN exclude domains with less than n URLs ``` ## License *coURLan* is distributed under the [Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0.html). Versions prior to v1 were under GPLv3+ license. ## Settings `courlan` is optimized for English and German but its generic approach is also usable in other contexts. Details of strict URL filtering can be reviewed and changed in the file `settings.py`. To override the default settings, clone the repository and [re-install the package locally](https://packaging.python.org/tutorials/installing-packages/#installing-from-a-local-src-tree). ## Author Initially launched to create text databases for research purposes at the Berlin-Brandenburg Academy of Sciences (DWDS and ZDL units), this package continues to be maintained but its future development depends on community support. **If you value this software or depend on it for your product, consider sponsoring it and contributing to its codebase**. Your support [on GitHub](https://github.com/sponsors/adbar) or [ko-fi.com](https://ko-fi.com/adbarbaresi) will help maintain and enhance this package. Visit the [Contributing page](https://github.com/adbar/courlan/blob/master/CONTRIBUTING.md) for more information. Reach out via the software repository or the [contact page](https://adrien.barbaresi.eu/) for inquiries, collaborations, or feedback. For more on Courlan's' software ecosystem see [this graphic](https://github.com/adbar/trafilatura/blob/master/docs/software-ecosystem.png). ## Similar work These Python libraries perform similar handling and normalization tasks but do not entail language or content filters. They also do not primarily focus on crawl optimization: - [furl](https://github.com/gruns/furl) - [ural](https://github.com/medialab/ural) - [yarl](https://github.com/aio-libs/yarl) ## References - Cho, J., Garcia-Molina, H., & Page, L. (1998). Efficient crawling through URL ordering. *Computer networks and ISDN systems*, 30(1-7), 161–172. - Edwards, J., McCurley, K. S., and Tomlin, J. A. (2001). "An adaptive model for optimizing performance of an incremental web crawler". In *Proceedings of the 10th international conference on World Wide Web - WWW'01*, pp. 106–113.