# coURLan: Clean, filter, normalize, and sample URLs
## Why coURLan?
> "It is important for the crawler to visit 'important' pages first,
> so that the fraction of the Web that is visited (and kept up to date)
> is more meaningful." (Cho et al. 1998)
>
> "Given that the bandwidth for conducting crawls is neither infinite
> nor free, it is becoming essential to crawl the Web in not only a
> scalable, but efficient way, if some reasonable measure of quality or
> freshness is to be maintained." (Edwards et al. 2001)
This library provides an additional "brain" for web crawling, scraping
and document management. It facilitates web navigation through a set of
filters, enhancing the quality of resulting document collections:
- Save bandwidth and processing time by steering clear of pages deemed
low-value
- Identify specific pages based on language or text content
- Pinpoint pages relevant for efficient link gathering
Additional utilities needed include URL storage, filtering, and
deduplication.
## Features
Separate the wheat from the chaff and optimize document discovery and
retrieval:
- URL handling
    - Validation
    - Normalization
    - Sampling
- Heuristics for link filtering
    - Spam, trackers, and content-types
    - Locales and internationalization
    - Web crawling (frontier, scheduling)
- Data store specifically designed for URLs
- Usable with Python or on the command-line
**Let the coURLan fish up juicy bits for you!**
Here is a [courlan](https://en.wiktionary.org/wiki/courlan) (source:
[Limpkin at Harn's Marsh by
Russ](https://commons.wikimedia.org/wiki/File:Limpkin,_harns_marsh_(33723700146).jpg),
CC BY 2.0).
## Installation
This package requires Python 3.10 or higher and is tested on Linux, macOS
and Windows systems.
Courlan is available on the package repository [PyPI](https://pypi.org/)
and can be installed with the Python package manager `pip`:
``` bash
$ pip install courlan
$ pip install --upgrade courlan # to make sure you have the latest version
$ pip install git+https://github.com/adbar/courlan.git # latest available code (see build status above)
```
The last version to support Python 3.6 and 3.7 is `courlan==1.2.0`.
The last version to support Python 3.8 and 3.9 is `courlan==1.3.2`.
## Python
Most filters revolve around the `strict` and `language` arguments.
### check_url()
All useful operations chained in `check_url(url)`:
``` python
>>> from courlan import check_url
# return url and domain name
>>> check_url('https://github.com/adbar/courlan')
('https://github.com/adbar/courlan', 'github.com')
# filter out bogus domains
>>> check_url('http://666.0.0.1/')
>>>
# tracker removal
>>> check_url('http://test.net/foo.html?utm_source=twitter#gclid=123')
('http://test.net/foo.html', 'test.net')
# use strict for further trimming
>>> my_url = 'https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.org'
>>> check_url(my_url, strict=True)
('https://httpbin.org/redirect-to', 'httpbin.org')
# check for redirects (HEAD request)
>>> url, domain_name = check_url(my_url, with_redirects=True)
# include navigation pages instead of discarding them
>>> check_url('http://www.example.org/page/10/', with_nav=True)
# remove trailing slash
>>> check_url('https://github.com/adbar/courlan/', trailing_slash=False)
```
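To make the tracker removal step more concrete, here is a minimal sketch
that uses only the standard library. It is not how `check_url()` is
implemented: the function name `strip_trackers` and the list of
prefixes are assumptions chosen for illustration.

``` python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: a small, illustrative list of tracking parameter prefixes
TRACKER_PREFIXES = ("utm_", "gclid", "fbclid")


def strip_trackers(url: str) -> str:
    """Drop tracking parameters from the query and discard the fragment."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKER_PREFIXES)
    ]
    # An empty fragment is passed on purpose: it often carries trackers too
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


print(strip_trackers("http://test.net/foo.html?utm_source=twitter#gclid=123"))
# http://test.net/foo.html
```

The library itself covers many more cases (validation, domain
extraction, redirects), so this sketch only conveys the general idea.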
Language-aware heuristics, notably for internationalized URLs, are
available in `lang_filter(url, language)` and through the `language`
argument of `check_url()`:
``` python
# optional language argument
>>> url = 'https://www.un.org/en/about-us'
# success: returns clean URL and domain name
>>> check_url(url, language='en')
('https://www.un.org/en/about-us', 'un.org')
# failure: doesn't return anything
>>> check_url(url, language='de')
>>>
# optional argument: strict
>>> url = 'https://en.wikipedia.org/'
>>> check_url(url, language='de', strict=False)
('https://en.wikipedia.org', 'wikipedia.org')
>>> check_url(url, language='de', strict=True)
>>>
```
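As a rough illustration of what a language hint in a URL can look like,
the following sketch checks whether the first path segment is a
two-letter code matching the target language. The helper
`path_language_matches` is hypothetical and far simpler than the
heuristics shipped with the library.

``` python
import re
from urllib.parse import urlsplit

# Assumption: a language hint is a two-letter first path segment, e.g. /en/
LANG_SEGMENT = re.compile(r"^[a-z]{2}$")


def path_language_matches(url: str, language: str) -> bool:
    """Return False only if the path carries a hint for another language."""
    segments = [seg for seg in urlsplit(url).path.lower().split("/") if seg]
    if segments and LANG_SEGMENT.match(segments[0]):
        return segments[0] == language
    # No hint found: keep the URL rather than filter it out
    return True


print(path_language_matches("https://www.un.org/en/about-us", "en"))  # True
print(path_language_matches("https://www.un.org/en/about-us", "de"))  # False
```

Short segments that are not language codes (such as `/tv/`) would be
misread by this sketch, which is one reason why real filters need more
context.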
Define stricter restrictions on the expected content type with
`strict=True`. This also blocks certain platforms and page types
where machines get lost.
``` python
# strict filtering: blocked as it is a major platform
>>> check_url('https://www.twitch.com/', strict=True)
>>>
```
### Sampling by domain name
``` python
>>> from courlan import sample_urls
>>> my_urls = ['https://example.org/' + str(x) for x in range(100)]
>>> my_sample = sample_urls(my_urls, 10)
# optional: exclude_min=None, exclude_max=None, strict=False, verbose=False
```
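The idea behind sampling by domain can be sketched with the standard
library alone: group URLs by host, then draw at most a fixed number of
URLs per group. The function `sample_by_host` and its `seed` parameter
are assumptions for illustration and do not mirror the internals of
`sample_urls()`.

``` python
import random
from collections import defaultdict
from urllib.parse import urlsplit


def sample_by_host(urls, size, seed=None):
    """Keep at most `size` randomly chosen URLs for each host."""
    rng = random.Random(seed)
    by_host = defaultdict(list)
    for url in urls:
        by_host[urlsplit(url).netloc].append(url)
    sample = []
    for host in sorted(by_host):
        group = by_host[host]
        # Small groups are kept whole, larger ones are sampled
        sample.extend(group if len(group) <= size else rng.sample(group, size))
    return sample


urls = ["https://example.org/" + str(x) for x in range(100)]
print(len(sample_by_host(urls, 10, seed=1)))  # 10
```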
### Web crawling and URL handling
Link extraction and preprocessing:
``` python
>>> from courlan import extract_links
>>> doc = '<html><body><a href="test/link.html">Link</a></body></html>'
>>> url = "https://example.org"
>>> extract_links(doc, url)
{'https://example.org/test/link.html'}
# other options: external_bool, no_filter, language, strict, redirects, ...
```
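For readers who want to see the basic mechanics of link extraction, here
is a minimal sketch based on `html.parser` and `urljoin` from the
standard library. The names `LinkCollector` and `collect_links` are
hypothetical, and none of the filters applied by `extract_links()` are
reproduced here.

``` python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the targets of <a href="..."> tags as absolute URLs."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links against the page URL
            self.links.add(urljoin(self.base_url, href))


def collect_links(html, base_url):
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


page = '<html><body><a href="test/link.html">Link</a></body></html>'
print(collect_links(page, "https://example.org"))
# {'https://example.org/test/link.html'}
```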
The `filter_links()` function provides additional filters for crawling
purposes: use of robots.txt rules and link prioritization. It returns two
lists: regular links and priority (navigation) links.
``` python
>>> from courlan import filter_links
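>>> doc = '<html><body><a href="https://example.org/page1.html">1</a><a href="https://example.org/tag/listing">Tag</a></body></html>'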
>>> links, links_priority = filter_links(doc, "https://example.org")
>>> links
['https://example.org/page1.html']
>>> links_priority
['https://example.org/tag/listing']
```
Determine if a link leads to another host:
``` python
>>> from courlan import is_external
>>> is_external('https://github.com/', 'https://www.microsoft.com/')
True
# default
>>> is_external('https://google.com/', 'https://www.google.co.uk/', ignore_suffix=True)
False
# taking suffixes into account
>>> is_external('https://google.com/', 'https://www.google.co.uk/', ignore_suffix=False)
True
```
Other useful functions dedicated to URL handling:
- `extract_domain(url, fast=True)`: find domain and subdomain or just
domain with `fast=False`
- `get_base_url(url)`: strip the URL of some of its parts
- `get_host_and_path(url)`: decompose URLs in two parts: protocol +
host/domain and path
- `get_hostinfo(url)`: extract domain and host info (protocol +
host/domain)
- `fix_relative_urls(baseurl, url)`: prepend necessary information to
relative links
``` python
>>> from courlan import *
>>> url = 'https://www.un.org/en/about-us'
>>> get_base_url(url)
'https://www.un.org'
>>> get_host_and_path(url)
('https://www.un.org', '/en/about-us')
>>> get_hostinfo(url)
('un.org', 'https://www.un.org')
>>> fix_relative_urls('https://www.un.org', 'en/about-us')
'https://www.un.org/en/about-us'
```
Other filters dedicated to crawl frontier management:
- `is_not_crawlable(url)`: check for deep web or pages generally not
usable in a crawling context
- `is_navigation_page(url)`: check for navigation and overview pages
``` python
>>> from courlan import is_navigation_page, is_not_crawlable
>>> is_navigation_page('https://www.randomblog.net/category/myposts')
True
>>> is_not_crawlable('https://www.randomblog.net/login')
True
```
See also [URL management page](https://trafilatura.readthedocs.io/en/latest/url-management.html)
of the Trafilatura documentation.
### Python helpers
Helper function, scrub and normalize:
``` python
>>> from courlan import clean_url
>>> clean_url('HTTPS://WWW.DWDS.DE:443/')
'https://www.dwds.de'
```
Basic scrubbing only:
``` python
>>> from courlan import scrub_url
```
Basic canonicalization/normalization only, i.e. modifying and
standardizing URLs in a consistent manner:
``` python
>>> from urllib.parse import urlparse
>>> from courlan import normalize_url
>>> my_url = normalize_url(urlparse(my_url))
# passing URL strings directly also works
>>> my_url = normalize_url(my_url)
# remove unnecessary components and re-order query elements
>>> normalize_url('http://test.net/foo.html?utm_source=twitter&post=abc&page=2#fragment', strict=True)
'http://test.net/foo.html?page=2&post=abc'
```
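The kind of canonicalization described above can be approximated with
`urllib.parse`. The sketch below lowercases scheme and host, drops
default ports and fragments, and sorts query parameters. The function
`normalize_sketch` is an assumption for illustration: it keeps trailing
slashes, ignores user information and IPv6 literals, and does not remove
trackers, so its output can differ from `normalize_url()`.

``` python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_sketch(url: str) -> str:
    """Return a canonical form of the URL (simplified illustration)."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only if it is not the default one for the scheme
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{parts.port}"
    # Sorting makes URLs with re-ordered parameters compare equal
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    return urlunsplit((scheme, host, parts.path, query, ""))


print(normalize_sketch("http://test.net/foo.html?post=abc&page=2#fragment"))
# http://test.net/foo.html?page=2&post=abc
```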
Basic URL validation only:
``` python
>>> from courlan import validate_url
>>> validate_url('http://1234')
(False, None)
>>> validate_url('http://www.example.org/')
(True, ParseResult(scheme='http', netloc='www.example.org', path='/', params='', query='', fragment=''))
```
### Troubleshooting
Courlan uses an internal cache to speed up URL parsing. It can be reset
as follows:
``` python
>>> from courlan.meta import clear_caches
>>> clear_caches()
```
## UrlStore class
The `UrlStore` class allows for storing and retrieving domain-classified
URLs, where a URL like `https://example.org/path/testpage` is stored as
the path `/path/testpage` within the domain `https://example.org`. It
features the following methods:
- URL management
    - `add_urls(urls=None, appendleft=None, visited=False)`: Add a
      list of URLs to the (possibly) existing one. Optional:
      append certain URLs to the left, specify if the URLs have
      already been visited.
    - `add_from_html(htmlstring, url, external=False, lang=None, with_nav=True)`:
      Extract and filter links in an HTML string.
    - `discard(domains)`: Declare domains void and prune the store.
    - `dump_urls()`: Return a list of all known URLs.
    - `print_urls()`: Print all URLs in store (URL + TAB + visited or not).
    - `print_unvisited_urls()`: Print all unvisited URLs in store.
    - `get_all_counts()`: Return all download counts for the hosts in store.
    - `get_known_domains()`: Return all known domains as a list.
    - `get_unvisited_domains()`: Find all domains for which there are unvisited URLs.
    - `total_url_number()`: Find number of all URLs in store.
    - `is_known(url)`: Check if the given URL has already been stored.
    - `has_been_visited(url)`: Check if the given URL has already been visited.
    - `filter_unknown_urls(urls)`: Take a list of URLs and return the currently unknown ones.
    - `filter_unvisited_urls(urls)`: Take a list of URLs and return the currently unvisited ones.
    - `find_known_urls(domain)`: Get all already known URLs for the
      given domain (e.g. `https://example.org`).
    - `find_unvisited_urls(domain)`: Get all unvisited URLs for the given domain.
    - `reset()`: Re-initialize the URL store.
- Crawling and downloads
    - `get_url(domain)`: Retrieve a single URL and consider it to
      be visited (with corresponding timestamp).
    - `get_rules(domain)`: Return the stored crawling rules for the given website.
    - `store_rules(website, rules)`: Store crawling rules for a given website.
    - `get_crawl_delay()`: Return the delay as extracted from robots.txt, or a given default.
    - `get_download_urls(time_limit=10, max_urls=10000)`: Get a list of immediately
      downloadable URLs according to the given time limit per domain.
    - `establish_download_schedule(max_urls=100, time_limit=10)`:
      Get up to the specified number of URLs along with a suitable
      backoff schedule (in seconds).
    - `download_threshold_reached(threshold)`: Find out if the
      download limit (in seconds) has been reached for one of the
      websites in store.
    - `unvisited_websites_number()`: Return the number of websites
      for which there are still URLs to visit.
    - `is_exhausted_domain(domain)`: Tell if all known URLs for
      the website have been visited.
- Persistence
    - `write(filename)`: Save the store to disk.
    - `load_store(filename)`: Read a UrlStore from disk (separate function, not class method).
- Optional settings:
    - `compressed=True`: activate compression of URLs and rules
    - `language=XX`: focus on a particular target language (two-letter code)
    - `strict=True`: stricter URL filtering
    - `verbose=True`: dump URLs if interrupted (requires use of `signal`)
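
The storage scheme described at the top of this section (paths grouped
under their domain, each with a visited flag) can be pictured with a
small stand-in written with the standard library. `TinyUrlStore` is a
hypothetical class for illustration only. It borrows two method names
from the list above but shares no code with `UrlStore` and omits
filtering, compression, timestamps and crawling rules.

``` python
from collections import defaultdict
from urllib.parse import urlsplit


class TinyUrlStore:
    """Toy store: domain -> {path: visited flag}."""

    def __init__(self):
        self.paths = defaultdict(dict)

    def add_urls(self, urls, visited=False):
        for url in urls:
            parts = urlsplit(url)
            domain = f"{parts.scheme}://{parts.netloc}"
            path = parts.path or "/"
            if parts.query:
                path += "?" + parts.query
            # Known paths keep their current visited flag
            self.paths[domain].setdefault(path, visited)

    def get_url(self, domain):
        """Return one unvisited URL for the domain and mark it as visited."""
        for path, visited in self.paths.get(domain, {}).items():
            if not visited:
                self.paths[domain][path] = True
                return domain + path
        return None


store = TinyUrlStore()
store.add_urls(["https://example.org/path/testpage"])
print(dict(store.paths))
# {'https://example.org': {'/path/testpage': False}}
print(store.get_url("https://example.org"))
# https://example.org/path/testpage
```

Grouping by domain means the host part is stored once per website, and
per-domain questions such as "is anything left to visit here?" become a
lookup in a single dictionary.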
## Command-line
The main functions are also available through a command-line utility:
``` bash
$ courlan --inputfile url-list.txt --outputfile cleaned-urls.txt
$ courlan --help
usage: courlan [-h] -i INPUTFILE -o OUTPUTFILE [-d DISCARDEDFILE] [-v]
[-p PARALLEL] [--strict] [-l LANGUAGE] [-r] [--sample SAMPLE]
[--exclude-max EXCLUDE_MAX] [--exclude-min EXCLUDE_MIN]
Command-line interface for Courlan
options:
-h, --help show this help message and exit
I/O:
Manage input and output
-i INPUTFILE, --inputfile INPUTFILE
name of input file (required)
-o OUTPUTFILE, --outputfile OUTPUTFILE
name of output file (required)
-d DISCARDEDFILE, --discardedfile DISCARDEDFILE
name of file to store discarded URLs (optional)
-v, --verbose increase output verbosity
-p PARALLEL, --parallel PARALLEL
number of parallel processes (not used for sampling)
Filtering:
Configure URL filters
--strict perform more restrictive tests
-l LANGUAGE, --language LANGUAGE
use language filter (ISO 639-1 code)
-r, --redirects check redirects
Sampling:
Use sampling by host, configure sample size
--sample SAMPLE size of sample per domain
--exclude-max EXCLUDE_MAX
exclude domains with more than n URLs
--exclude-min EXCLUDE_MIN
exclude domains with less than n URLs
```
## License
*coURLan* is distributed under the [Apache 2.0
license](https://www.apache.org/licenses/LICENSE-2.0.html).
Versions prior to v1 were released under the GPLv3+ license.
## Settings
`courlan` is optimized for English and German but its generic approach
is also usable in other contexts.
Details of strict URL filtering can be reviewed and changed in the file
`settings.py`. To override the default settings, clone the repository and
[re-install the package
locally](https://packaging.python.org/tutorials/installing-packages/#installing-from-a-local-src-tree).
## Author
Initially launched to create text databases for research purposes
at the Berlin-Brandenburg Academy of Sciences (DWDS and ZDL units),
this package continues to be maintained but its future development
depends on community support.
**If you value this software or depend on it for your product, consider
sponsoring it and contributing to its codebase**. Your support
[on GitHub](https://github.com/sponsors/adbar) or [ko-fi.com](https://ko-fi.com/adbarbaresi)
will help maintain and enhance this package.
Visit the [Contributing page](https://github.com/adbar/courlan/blob/master/CONTRIBUTING.md)
for more information.
Reach out via the software repository or the [contact
page](https://adrien.barbaresi.eu/) for inquiries, collaborations, or
feedback.
For more on Courlan's software ecosystem see [this
graphic](https://github.com/adbar/trafilatura/blob/master/docs/software-ecosystem.png).
## Similar work
These Python libraries perform similar handling and normalization tasks
but do not entail language or content filters. They also do not
primarily focus on crawl optimization:
- [furl](https://github.com/gruns/furl)
- [ural](https://github.com/medialab/ural)
- [yarl](https://github.com/aio-libs/yarl)
## References
- Cho, J., Garcia-Molina, H., & Page, L. (1998). Efficient crawling
through URL ordering. *Computer networks and ISDN systems*, 30(1-7),
161–172.
- Edwards, J., McCurley, K. S., & Tomlin, J. A. (2001). An
adaptive model for optimizing performance of an incremental web
crawler. In *Proceedings of the 10th international conference on
World Wide Web - WWW'01*, pp. 106–113.