{ "cells": [ { "cell_type": "markdown", "id": "47e4a028", "metadata": {}, "source": [ "# Downloading Text from the Internet" ] }, { "cell_type": "markdown", "id": "d22cfae0", "metadata": {}, "source": [ "Here is a simple example of how to download and reformat text from the Internet." ] }, { "cell_type": "markdown", "id": "c1f802f0", "metadata": {}, "source": [ "Let's start by using `curl` to get some text data.\n", "We need the `-L` because this is a URL with a redirect built in." ] }, { "cell_type": "code", "execution_count": 21, "id": "2abed08d", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " % Total % Received % Xferd Average Speed Time Time Time Current\r\n", " Dload Upload Total Spent Left Speed\r\n", "100 336 0 336 0 0 5905 0 --:--:-- --:--:-- --:--:-- 22400\r\n", "100 467k 100 467k 0 0 182k 0 0:00:02 0:00:02 --:--:-- 255k\r\n" ] } ], "source": [ "!curl -L 'http://goo.gl/g3aE4' > tomsawyer.html" ] }, { "cell_type": "markdown", "id": "dcf4c30e", "metadata": {}, "source": [ "For single pages, `curl` is generally the best tool to use.\n", "For whole directory trees and mirroring, `wget` is what people usually use." ] }, { "cell_type": "markdown", "id": "8038bc1c", "metadata": {}, "source": [ "If we look at it, we got the page in HTML format." ] }, { "cell_type": "code", "execution_count": 22, "id": "6471e588", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\r\n", "\r\n", "\r\n", "\r\n", "The Adventures of Tom Sawyer \r\n", "\r\n", "\r\n", "
Twain, Mark, 1835-1910. The Adventures of Tom Sawyer \r\n", "
\r\n", "Electronic Text Center, University of Virginia Library
\r\n" ] } ], "source": [ "!head tomsawyer.html" ] }, { "cell_type": "markdown", "id": "b55d4388", "metadata": {}, "source": [ "Now let's convert it to text format (we could also have used `lynx -dump URL` directly).\n", "There are several other tools for converting HTML to text; they may or may not work better." ] }, { "cell_type": "code", "execution_count": 23, "id": "212d3e8f", "metadata": { "collapsed": false }, "outputs": [], "source": [ "!lynx -dump tomsawyer.html > tomsawyer.txt" ] }, { "cell_type": "markdown", "id": "732de11b", "metadata": {}, "source": [ "The output is pretty good, but has some extra spaces at the beginning." ] }, { "cell_type": "code", "execution_count": 24, "id": "262528d6", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Twain, Mark, 1835-1910. The Adventures of Tom Sawyer\r\n", " Electronic Text Center, University of Virginia Library\r\n", "\r\n", " | [1]Table of Contents for this work |\r\n", "\r\n", " | [2]All on-line databases | [3]Etext Center Homepage |\r\n", " __________________________________________________________________\r\n", "\r\n", " About the electronic version\r\n", " The Adventures of Tom Sawyer\r\n" ] } ], "source": [ "!head tomsawyer.txt" ] }, { "cell_type": "code", "execution_count": 25, "id": "e11f51b3", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " curtain of a second-story window. Was the sacred presence there? He\r\n", " climbed the fence, threaded his stealthy way through the plants, till\r\n", " he stood under that window; he looked up at it long, and with emotion;\r\n", " then he laid him down on the ground under it, disposing himself upon\r\n", " his back, with his hands clasped upon his breast and holding his poor\r\n", " wilted flower. And thus he would\r\n", " _____________________________________________________\r\n", "\r\n", " -43-\r\n", "\r\n", " die -- out in the cold world, with no shelter over his homeless head,\r\n", " no friendly hand to wipe the death-damps from his brow, no loving face\r\n", " to bend pityingly over him when the great agony came. And thus she\r\n", " would see him when she looked out upon the glad morning, and oh! would\r\n", " she drop one little tear upon his poor, lifeless form, would she heave\r\n", " one little sigh to see a bright young life so rudely blighted, so\r\n", " untimely cut down?\r\n", "\r\n", " The window went up, a maid-servant's discordant voice profaned the\r\n", " holy calm, and a deluge of water drenched the prone martyr's remains!\r\n", "\r\n" ] } ], "source": [ "!sed '1000,1020!d' tomsawyer.txt" ] }, { "cell_type": "markdown", "id": "04d6068a", "metadata": {}, "source": [ "Let's fix these with `sed`." ] }, { "cell_type": "code", "execution_count": 26, "id": "d6cdd4cf", "metadata": { "collapsed": false }, "outputs": [], "source": [ "!sed 's/^ *//' -i tomsawyer.txt" ] }, { "cell_type": "markdown", "id": "9948e20c", "metadata": {}, "source": [ "Also, there's a header at the beginning of the file (we can see that from the web page)." ] }, { "cell_type": "code", "execution_count": 27, "id": "86c376c0", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "About the print version\r\n", "The Adventures of Tom Sawyer\r\n", "Mark Twain\r\n", "Harper and Brothers\r\n", "New York and London\r\n", "1903\r\n", "\r\n", "Author's National Edition: The Writings of Mark Twain, Vol. XII\r\n", "\r\n", "Spell-check not performed due to presence of dialect.\r\n", "Published: 1876\r\n" ] } ], "source": [ "!sed '/About the print/,+10!d' tomsawyer.txt" ] }, { "cell_type": "code", "execution_count": 28, "id": "a604f6ef", "metadata": { "collapsed": false }, "outputs": [], "source": [ "!sed '1,/About the print/d' -i tomsawyer.txt" ] }, { "cell_type": "code", "execution_count": 29, "id": "a3ff9f2c", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The Adventures of Tom Sawyer\r\n", "Mark Twain\r\n", "Harper and Brothers\r\n", "New York and London\r\n", "1903\r\n", "\r\n", "Author's National Edition: The Writings of Mark Twain, Vol. XII\r\n", "\r\n", "Spell-check not performed due to presence of dialect.\r\n", "Published: 1876\r\n" ] } ], "source": [ "!head tomsawyer.txt" ] }, { "cell_type": "markdown", "id": "bc5eda7f", "metadata": {}, "source": [ "# Other Converters" ] }, { "cell_type": "markdown", "id": "6d0008bb", "metadata": {}, "source": [ "If you don't like that kind of editing, there are a bunch of other converters you can use.\n", "To find them, use `apt-cache search` or `synaptic` on Ubuntu or Debian.\n", "Other Linux distributions have their own search tools.\n", "You can also check [Freecode](http://www.freecode.com/) (AKA Freshmeat)" ] }, { "cell_type": "code", "execution_count": 33, "id": "8b6a29fd", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "html2text - advanced HTML to text converter\r\n", "poppler-utils - PDF utilities (based on Poppler)\r\n", "xmlto - XML-to-any converter\r\n", "wap-wml-tools - Wireless Markup Language development and test tools\r\n", "highlight - Universal source code to formatted text converter\r\n", "highlight-common - source code to formatted text converter (architecture independent files)\r\n", "html2ps - HTML to PostScript converter\r\n", "khmerconverter - converts between legacy Khmer encodings and Unicode\r\n", "libghc-pandoc-dev - general markup converter\r\n", "libghc-pandoc-doc - general markup converter\r\n", "libghc-pandoc-prof - general markup converter\r\n", "libhighlight-perl - perl bindings for highlight source code to formatted text converter\r\n", "mira-assembler - Whole Genome Shotgun and EST Sequence Assembler\r\n", "pandoc - general markup converter\r\n", "php-text-wiki - transforms Wiki and BBCode markup into XHTML, LaTeX or plain text markup\r\n", "pod2pdf - Plain Old Documentation to Portable Document Format converter\r\n", "python-pdfminer - PDF parser and analyser\r\n", "python-zope.app.renderer - Zope 3 Text Renderer Framework\r\n", "src2tex - A converter from source program files to TeX format files\r\n", "stx2any - Converter from structured plain text to other formats\r\n", "t2html - text to HTML converter implemented in Perl\r\n", "txt2html - Text to HTML converter\r\n", "uni2ascii - UTF-8 to 7-bit ASCII and vice versa converter\r\n", "unrtf - RTF to other formats converter\r\n", "vilistextum - a HTML to text converter\r\n", "wp2x - WordPerfect 5.x documents to whatever converter\r\n", "yodl - Your Own Document Language (Yodl) is a pre-document language\r\n", "yodl-doc - Documenation for Your Own Document Language (Yodl)\r\n" ] } ], "source": [ "!apt-cache search html text converter\n", "# install with \"sudo apt-get install ...\" if needed" ] }, { "cell_type": "code", "execution_count": 31, "id": "102decb1", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "his speckled straw hat. He now looked exceedingly improved and uncomfortable.\r\n", "He was fully as uncomfortable as he looked; for there was a restraint about\r\n", "whole clothes and cleanliness that galled him. He hoped that Mary would forget\r\n", "his shoes, but the hope was blighted; she coated them thoroughly with tallow,\r\n", "as was the\r\n", "\r\n", "===============================================================================\r\n", "\r\n", " -48-\r\n", "\r\n", "\r\n", "custom, and brought them out. He lost his temper and said he was always being\r\n", "made to do everything he didn't want to do. But Mary said, persuasively:\r\n", "   \"Please, Tom -- that's a good boy.\"\r\n", "   So he got into the shoes snarling. Mary was soon ready, and the three\r\n", "children set out for Sunday-school -- a place that Tom hated with his whole\r\n", "heart; but Sid and Mary were fond of it.\r\n", "   Sabbath-school hours were from nine to half-past ten; and then church\r\n", "service. Two of the children always remained for the sermon voluntarily, and\r\n", "the other always remained too -- for stronger reasons. The church's high-\r\n", "backed, uncushioned pews would seat about three hundred persons; the edifice\r\n" ] } ], "source": [ "!html2text tomsawyer.html | sed '1000,1020!d'" ] }, { "cell_type": "code", "execution_count": 32, "id": "ae0a1239", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "pityingly over him when the great agony came. And thus she would see him\r\n", "when she looked out upon the glad morning, and oh! would she drop one little\r\n", "tear upon his poor, lifeless form, would she heave one little sigh to see a\r\n", "bright young life so rudely blighted, so untimely cut down?\r\n", "\r\n", "���The window went up, a maid-servant's discordant voice profaned the holy\r\n", "calm, and a deluge of water drenched the prone martyr's remains!\r\n", "\r\n", "���The strangling hero sprang up with a relieving snort. There was a whiz as\r\n", "of a missile in the air, mingled with the murmur of a curse, a sound as of\r\n", "shivering glass followed, and a small, vague form went over the fence and\r\n", "shot away in the gloom.\r\n", "\r\n", "���Not long after, as Tom, all undressed for bed, was surveying his drenched\r\n", "garments by the light of a tallow dip, Sid woke up; but if he had any dim\r\n", "idea of making any \"references to allusions,\" he thought better of it and\r\n", "held his peace, for there was danger in Tom's eye.\r\n", "\r\n", "���Tom turned in without the added vexation of prayers, and Sid made mental\r\n", "note of the omission.\r\n", "\r\n" ] } ], "source": [ "!vilistextum tomsawyer.html - | sed '1000,1020!d'" ] } ], "metadata": {}, "nbformat": 4, "nbformat_minor": 5 }