{ "cells": [ { "cell_type": "markdown", "id": "47e4a028", "metadata": {}, "source": [ "# Downloading Text from the Internet" ] }, { "cell_type": "markdown", "id": "d22cfae0", "metadata": {}, "source": [ "Here is a simple example of how to download and reformat text from the Internet." ] }, { "cell_type": "markdown", "id": "c1f802f0", "metadata": {}, "source": [ "Let's start by using `curl` to get some text data.\n", "We need the `-L` flag so that `curl` follows redirects, because this shortened URL redirects to the actual page." ] }, { "cell_type": "code", "execution_count": 21, "id": "2abed08d", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " % Total % Received % Xferd Average Speed Time Time Time Current\r\n", " Dload Upload Total Spent Left Speed\r\n", "100 336 0 336 0 0 5905 0 --:--:-- --:--:-- --:--:-- 22400\r\n", "100 467k 100 467k 0 0 182k 0 0:00:02 0:00:02 --:--:-- 255k\r\n" ] } ], "source": [ "!curl -L 'http://goo.gl/g3aE4' > tomsawyer.html" ] }, { "cell_type": "markdown", "id": "dcf4c30e", "metadata": {}, "source": [ "For single pages, `curl` is generally the best tool.\n", "For whole directory trees and mirroring, people usually use `wget`." ] }, { "cell_type": "markdown", "id": "8038bc1c", "metadata": {}, "source": [ "If we look at the file, we see that the page came back in HTML format." ] }, { "cell_type": "code", "execution_count": 22, "id": "6471e588", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\r\n", "\r\n", "\r\n", "
\r\n", "