{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 03 Fetching files on an FTP server with `ftplib`\n", "A great deal of on-line data resources reside on FTP servers. And while the `urllib` module will usually work with those servers, the `ftplib` module has some additional commands that allow us more control in listing and retrieving entire batches of files on these servers. \n", "\n", "Here we explore how this is done. We'll look at an example where we want to [potentially] download all of EPA's StreamCat data stored here:
\n", "`ftp://newftp.epa.gov/EPADataCommons/ORD/NHDPlusLandscapeAttributes/StreamCat/States`\n", "\n", "These data include a vast collection of attributes for National Hydrographic Data catchments. We will explore how we can instruct Python to download just Rhode Island data (again, because the files are small...) from the StreamCat ftp server. However, we'll design this notebook so that it can easily be tweaked to download any state.\n", "\n", "The process for grapping data on a ftp server this way is a tad more involved than in previous examples. It requires the following steps:\n", "* Creating a link to the ftp server\n", "* Logging in to the server\n", "* Navigating to the remote folder where the files we want are stored\n", "* Creating a list of the files we want \n", " * (unless it's just one file and know its name...)\n", "* Iterating through this list and downloading each one individually\n", "\n", "This last step has a few sub-steps, but we'll get to that below.\n", "\n", "---\n", "This link has some good description of the ftplib module:
\n", "http://www.pythonforbeginners.com/code-snippets-source-code/how-to-use-ftp-in-python" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we'll do some boilerplate stuff. This includes importing the libraries (we'll need the `os` module too), setting some variables, and creating a place where our downloads will live." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#import the modules\n", "import os\n", "import ftplib" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Set variables for the address of the web server and the folder holding our data\n", "ftpURL = \"newftp.epa.gov\"\n", "ftpDirectory = \"/EPADataCommons/ORD/NHDPlusLandscapeAttributes/StreamCat/States/\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Create the base output folder, if it doesn't exist\n", "outFolder = \"./StreamCat\"\n", "if not os.path.exists(outFolder): os.mkdir(outFolder)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Set the state as a variable, so it's easily changed\n", "state = 'RI'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Make a subfolder for the state in the outFolder, to facilitate file management\n", "stateFolder = outFolder + os.sep + state\n", "if not os.path.exists(stateFolder): os.mkdir(stateFolder)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we've defined the state and created some folder to hold the data, we'll finally dive into the ftplib module to access the server and grab our data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first step is to log into the server. Most ftp servers are \"anonymous ftp\" servers meaning, they allow anyone to log in. However, proper etiquette dictates you login as \"anonymous\" and use your email so they have a record of who is accessing their data...\n", "\n", "Below, we log into the EPA ftp server, creating an 'ftp' object which is our programmatic link to the server. Once we link to it, we log in. You should change \"user@duke.edu\" to your email address." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ftp = ftplib.FTP(ftpURL)\n", "ftp.login(\"anonymous\",\"user@duke.edu\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we are logged in, we can send ftp commands to get information if we want. For example, the `welcome` command returns a message the server admin wants you to see when you log into the server: " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(ftp.welcome)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What we need to do is navigate to the folder holding the files we want, which we found by navigating the ftp server in a web browser - and stored in the `ftpDirectory` variable. \n", "\n", "We get there using the ftp `cwd` (change working directory) command: " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ftp.cwd(ftpDirectory)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's get a list of the files in this file, done with the `nlst` command." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "files = ftp.nlst()\n", "len(files) #Over 2000 files here!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Create a list of just Rhode Island files, that is one's that end in \"_RI.zip\"\n", "RI_files = ftp.nlst(\"*_RI.zip\")\n", "len(RI_files) #49 files; that's better..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have a list of files, we could write a loop that iterates through each of this. However, for simplicity, let's just grab the first file in the list and download that." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#First, let's get the file we want to download. We'll just get the first item in the list\n", "f = RI_files[0] \n", "print(f)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is where the ftp module is much less intuitive. The `ftp.retbinary` command initiates the download of a binary file, but it requires a \"callback function\", which is an action telling the ftp module what to do with this stream of downloaded information. For our callback function, we'll be writing the binary contents to a file.\n", "\n", "*Don't worry too much about this syntax; just don't modify it (other than perhaps the output filename) and you'll be fine.*" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#First we need to create a filename to which the remote contents will be written\n", "outFilename = stateFolder + os.sep + f\n", "print (outFilename)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Ok, here's the command that actually grabs the file and writes it locally\n", "ftp.retrbinary(\"RETR \" + f,open(outFilename,'wb').write)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Did it work?\n", "os.listdir(stateFolder)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Close the ftp connection\n", "ftp.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "► ** So what would it take to download the same file for, say, North Carolina? Not much! **" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "► ** How Might you tweak this script to grab all the RI files? **" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }