# pup pup is a command line tool for processing HTML. It reads from stdin, prints to stdout, and allows the user to filter parts of the page using [CSS selectors](https://developer.mozilla.org/en-US/docs/Web/Guide/CSS/Getting_started/Selectors). Inspired by [jq](http://stedolan.github.io/jq/), pup aims to be a fast and flexible way of exploring HTML from the terminal. ## Install Direct downloads are available through the [releases page](https://github.com/EricChiang/pup/releases/latest). If you have Go installed on your computer just run `go get`. go get github.com/ericchiang/pup If you're on OS X, use [Homebrew](http://brew.sh/) to install (no Go required). brew install https://raw.githubusercontent.com/EricChiang/pup/master/pup.rb ## Quick start ```bash $ curl -s https://news.ycombinator.com/ ``` Ew, HTML. Let's run that through some pup selectors: ```bash $ curl -s https://news.ycombinator.com/ | pup 'table table tr:nth-last-of-type(n+2) td.title a' ``` Okay, how about only the links? ```bash $ curl -s https://news.ycombinator.com/ | pup 'table table tr:nth-last-of-type(n+2) td.title a attr{href}' ``` Even better, let's grab the titles too: ```bash $ curl -s https://news.ycombinator.com/ | pup 'table table tr:nth-last-of-type(n+2) td.title a json{}' ``` ## Basic Usage ```bash $ cat index.html | pup [flags] '[selectors] [display function]' ``` ## Examples Download a webpage with wget. ```bash $ wget http://en.wikipedia.org/wiki/Robots_exclusion_standard -O robots.html ``` ####Clean and indent By default pup will fill in missing tags and properly indent the page. ```bash $ cat robots.html # nasty looking HTML $ cat robots.html | pup --color # cleaned, indented, and colorful HTML ``` ####Filter by tag ```bash $ cat robots.html | pup 'title' Robots exclusion standard - Wikipedia, the free encyclopedia ``` ####Filter by id ```bash $ cat robots.html | pup 'span#See_also' See also ``` ####Filter by attribute ```bash $ cat robots.html | pup 'th[scope="row"]' Exclusion standards Related marketing topics Search marketing related topics Search engine spam Linking People Other ``` ####Pseudo Classes CSS selectors have a group of specifiers called ["pseudo classes"]( https://developer.mozilla.org/en-US/docs/Web/CSS/Pseudo-classes) which are pretty cool. pup implements a majority of the relevant ones them. Here are some examples. ```bash $ cat robots.html | pup 'a[rel]:empty' ``` ```bash $ cat robots.html | pup ':contains("History")' History History ``` ```bash $ cat robots.html | pup ':parent-of([action="edit"])' Edit links ``` For a complete list, view the [implemented selectors](#implemented-selectors) section. ####`+`, `>`, and `,` These are intermediate characters that declare special instructions. For instance, a comma `,` allows pup to specify multiple groups of selectors. ```bash $ cat robots.html | pup 'title, h1 span[dir="auto"]' Robots exclusion standard - Wikipedia, the free encyclopedia Robots exclusion standard ``` ####Chain selectors together When combining selectors, the HTML nodes selected by the previous selector will be passed to the next ones. ```bash $ cat robots.html | pup 'h1#firstHeading'

Robots exclusion standard

``` ```bash $ cat robots.html | pup 'h1#firstHeading span' Robots exclusion standard ``` ## Implemented Selectors For further examples of these selectors head over to [MDN]( https://developer.mozilla.org/en-US/docs/Web/CSS/Reference). ```bash pup '.class' pup '#id' pup 'element' pup 'selector + selector' pup 'selector > selector' pup '[attribute]' pup '[attribute="value"]' pup '[attribute*="value"]' pup '[attribute~="value"]' pup '[attribute^="value"]' pup '[attribute$="value"]' pup ':empty' pup ':first-child' pup ':first-of-type' pup ':last-child' pup ':last-of-type' pup ':only-child' pup ':only-of-type' pup ':contains("text")' pup ':nth-child(n)' pup ':nth-of-type(n)' pup ':nth-last-child(n)' pup ':nth-last-of-type(n)' pup ':not(selector)' pup ':parent-of(selector)' ``` You can mix and match selectors as you wish. ```bash cat index.html | pup 'element#id[attribute="value"]:first-of-type' ``` ## Display Functions Non-HTML selectors which effect the output type are implemented as functions which can be provided as a final argument. #### `text{}` Print all text from selected nodes and children in depth first order. ```bash $ cat robots.html | pup '.mw-headline text{}' History About the standard Disadvantages Alternatives Examples Nonstandard extensions Crawl-delay directive Allow directive Sitemap Host Universal "*" match Meta tags and headers See also References External links ``` #### `attr{attrkey}` Print the values of all attributes with a given key from all selected nodes. ```bash $ cat robots.html | pup '.catlinks div attr{id}' mw-normal-catlinks mw-hidden-catlinks ``` #### `json{}` Print HTML as JSON. ```bash $ cat robots.html | pup 'div#p-namespaces a' Article Talk ``` ```bash $ cat robots.html | pup 'div#p-namespaces a json{}' [ { "accesskey": "c", "href": "/wiki/Robots_exclusion_standard", "tag": "a", "text": "Article", "title": "View the content page [c]" }, { "accesskey": "t", "href": "/wiki/Talk:Robots_exclusion_standard", "tag": "a", "text": "Talk", "title": "Discussion about the content page [t]" } ] ``` Use the `-i` / `--indent` flag to control the intent level. ```bash $ cat robots.html | pup -i 4 'div#p-namespaces a json{}' [ { "accesskey": "c", "href": "/wiki/Robots_exclusion_standard", "tag": "a", "text": "Article", "title": "View the content page [c]" }, { "accesskey": "t", "href": "/wiki/Talk:Robots_exclusion_standard", "tag": "a", "text": "Talk", "title": "Discussion about the content page [t]" } ] ``` If the selectors only return one element the results will be printed as a JSON object, not a list. ```bash $ cat robots.html | pup --indent 4 'title json{}' { "tag": "title", "text": "Robots exclusion standard - Wikipedia, the free encyclopedia" } ``` Because there is no universal standard for converting HTML/XML to JSON, a method has been chosen which hopefully fits. The goal is simply to get the output of pup into a more consumable format. ## Flags Run `pup --help` for a list of further options