Coverage for nltk.downloader: 16%

# Natural Language Toolkit: Corpus & Model Downloader

# Author: Edward Loper <edloper@gradient.cis.upenn.edu>

# URL: <http://www.nltk.org/>

# For license information, see LICENSE.TXT

"""

The NLTK corpus and module downloader. This module defines several

interfaces which can be used to download corpora, models, and other

data packages that can be used with NLTK.

Downloading Packages

====================

If called with no arguments, ``download()`` will display an interactive

interface which can be used to download and install new packages.

If Tkinter is available, then a graphical interface will be shown,

otherwise a simple text interface will be provided.

Individual packages can be downloaded by calling the ``download()``

function with a single argument, giving the package identifier for the

package that should be downloaded:

>>> download('treebank') # doctest: +SKIP

[nltk_data] Downloading package 'treebank'...

[nltk_data] Unzipping corpora/treebank.zip.

NLTK also provides a number of "package collections", consisting of

a group of related packages. To download all packages in a

collection, simply call ``download()`` with the collection's

identifier:

>>> download('all-corpora') # doctest: +SKIP

[nltk_data] Downloading package 'abc'...

[nltk_data] Unzipping corpora/abc.zip.

[nltk_data] Downloading package 'alpino'...

[nltk_data] Unzipping corpora/alpino.zip.

...

[nltk_data] Downloading package 'words'...

[nltk_data] Unzipping corpora/words.zip.

Download Directory

==================

By default, packages are installed in either a system-wide directory

(if Python has sufficient access to write to it) or in the current

user's home directory. However, the ``download_dir`` argument may be

used to specify a different installation target, if desired.

See ``Downloader.default_download_dir()`` for a more detailed

description of how the default download directory is chosen.
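
As a rough illustration of the rule above (a sketch only, not the logic
of ``Downloader.default_download_dir()`` itself), the helper below
returns the first candidate directory that exists and is writable, and
otherwise falls back to a directory under the user's home. The name
``pick_download_dir``, the candidate paths, and the ``nltk_data``
fallback name are assumptions made for this example.

```python
import os
import tempfile


def pick_download_dir(system_dirs, home_dir):
    # Hypothetical helper: prefer the first existing, writable
    # system-wide directory; otherwise fall back to the user's home.
    for candidate in system_dirs:
        if os.path.isdir(candidate) and os.access(candidate, os.W_OK):
            return candidate
    return os.path.join(home_dir, 'nltk_data')


# Usage: a temporary directory stands in for a writable system directory.
writable = tempfile.mkdtemp()
missing = os.path.join(writable, 'does-not-exist')
chosen = pick_download_dir([missing, writable], os.path.expanduser('~'))
fallback = pick_download_dir([missing], '/home/example')
```

An explicit ``download_dir`` argument bypasses this kind of search
entirely, which is why it is the simplest option when the target
location is already known.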

NLTK Download Server

====================

Before downloading any packages, the corpus and module downloader

contacts the NLTK download server, to retrieve an index file

describing the available packages. By default, this index file is

loaded from ``http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml``.

If necessary, it is possible to create a new ``Downloader`` object,

specifying a different URL for the package index file.
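
A minimal sketch of reading such an index with
``xml.etree.ElementTree``. The inline XML is a made-up sample: the
``package`` attributes and the ``item``/``ref`` children correspond to
what ``Package`` and ``Collection`` read in this module, but the
enclosing ``nltk_data``, ``packages`` and ``collections`` elements and
the example URL are assumptions.

```python
from xml.etree import ElementTree

# Made-up sample index; the element layout is an assumption.
SAMPLE_INDEX = (
    '<nltk_data>'
    '<packages>'
    '<package id="abc" subdir="corpora" size="100" unzipped_size="250"'
    ' url="http://example.org/packages/corpora/abc.zip" unzip="1"/>'
    '</packages>'
    '<collections>'
    '<collection id="all-corpora" name="All the corpora">'
    '<item ref="abc"/>'
    '</collection>'
    '</collections>'
    '</nltk_data>'
)

root = ElementTree.fromstring(SAMPLE_INDEX)

# Map each package id to its attribute dictionary.
packages = dict((p.get('id'), dict(p.attrib))
                for p in root.findall('packages/package'))

# Map each collection id to the ids of its direct children.
collections = dict((c.get('id'), [i.get('ref') for i in c.findall('item')])
                   for c in root.findall('collections/collection'))
```

All attribute values arrive as strings, which is why numeric fields
such as ``size`` have to be converted with ``int()`` before use.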

Usage::

python nltk/downloader.py [-d DATADIR] [-q] [-f] [-k] PACKAGE_IDS

or::

python -m nltk.downloader [-d DATADIR] [-q] [-f] [-k] PACKAGE_IDS

"""

#----------------------------------------------------------------------

from __future__ import print_function

"""

0 1 2 3

[label][----][label][----]

[column ][column ]

Notes

=====

Handling data files. Some questions:

* Should the data files be kept zipped or unzipped? I say zipped.

* Should the data files be kept in svn at all? Advantages: history;

automatic version numbers; 'svn up' could be used rather than the

downloader to update the corpora. Disadvantages: they're big,

which makes working from svn a bit of a pain. And we're planning

to potentially make them much bigger. I don't think we want

people to have to download 400MB corpora just to use nltk from svn.

* Compromise: keep the data files in trunk/data rather than in

trunk/nltk. That way you can check them out in svn if you want

to; but you don't need to, and you can use the downloader instead.

* Also: keep models in mind. When we change the code, we'd

potentially like the models to get updated. This could require a

little thought.

* So, let's assume we have a trunk/data directory, containing a bunch

of packages. The packages should be kept as zip files, because we

really shouldn't be editing them much (well -- we may edit models

more, but they tend to be binary-ish files anyway, where diffs

aren't that helpful). So we'll have trunk/data, with a bunch of

files like abc.zip and treebank.zip and propbank.zip. For each

package we could also have e.g. treebank.xml and propbank.xml,

describing the contents of the package (name, copyright, license,

etc). Collections would also have .xml files. Finally, we would

pull all these together to form a single index.xml file. Some

directory structure wouldn't hurt. So how about::

    /trunk/data/ ....................... root of data svn
      index.xml ........................ main index file
      src/ ............................. python scripts
      packages/ ........................ dir for packages
        corpora/ ....................... zip & xml files for corpora
        grammars/ ...................... zip & xml files for grammars
        taggers/ ....................... zip & xml files for taggers
        tokenizers/ .................... zip & xml files for tokenizers
        etc.
      collections/ ..................... xml files for collections

Where the root (/trunk/data) would contain a makefile; and src/

would contain a script to update the index.xml file. It could also

contain scripts to rebuild some of the various model files. The

script that builds index.xml should probably check that each zip

file expands entirely into a single subdir, whose name matches the

package's uid.
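
The check described above could look roughly like this. It is a sketch
under stated assumptions: ``expands_into_single_subdir`` and
``make_zip`` are hypothetical names, and the archives are built in
memory purely for the example.

```python
import io
import zipfile


def expands_into_single_subdir(zip_file, uid):
    # True if the archive is non-empty and every member lives under '<uid>/'.
    with zipfile.ZipFile(zip_file) as zf:
        names = zf.namelist()
    prefix = uid + '/'
    return bool(names) and all(name.startswith(prefix) for name in names)


def make_zip(members):
    # Build a small in-memory zip archive for the usage example.
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, 'w') as zf:
        for name in members:
            zf.writestr(name, 'sample text')
    buf.seek(0)
    return buf


good = expands_into_single_subdir(make_zip(['abc/README', 'abc/a.txt']), 'abc')
bad = expands_into_single_subdir(make_zip(['abc/README', 'stray.txt']), 'abc')
```

Here ``good`` is true because both members sit under ``abc/``, while
``bad`` is false because ``stray.txt`` would be unpacked next to the
package directory rather than inside it.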

Changes I need to make:

- in index: change "size" to "filesize" or "compressed-size"

- in index: add "unzipped-size"

- when checking status: check both compressed & uncompressed size.

uncompressed size is important to make sure we detect a problem

if something got partially unzipped. define new status values

to differentiate stale vs corrupt vs corruptly-uncompressed??

(we shouldn't need to re-download the file if the zip file is ok

but it didn't get uncompressed fully.)

- add other fields to the index: author, license, copyright, contact,

etc.
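
One possible shape for the status check sketched in the list above,
written against the standard library only. The function names and the
status strings (``'stale'``, ``'partially-unzipped'``, ``'ok'``) are
hypothetical; they are not values defined by this module.

```python
import hashlib
import os
import shutil
import tempfile
import zipfile


def md5_hexdigest(path):
    # MD5 of a file, read in blocks so large archives are not loaded whole.
    digest = hashlib.md5()
    with open(path, 'rb') as infile:
        for block in iter(lambda: infile.read(1024 * 16), b''):
            digest.update(block)
    return digest.hexdigest()


def unzipped_size(zip_path):
    # Total uncompressed size recorded in the archive's directory.
    with zipfile.ZipFile(zip_path) as zf:
        return sum(info.file_size for info in zf.infolist())


def installed_size(root):
    # Total size of the files actually present under ``root``.
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            total += os.path.getsize(os.path.join(dirpath, filename))
    return total


def check_status(zip_path, unzip_root, size, checksum):
    # Hypothetical status values: 'stale', 'partially-unzipped', 'ok'.
    if (os.path.getsize(zip_path) != size
            or md5_hexdigest(zip_path) != checksum):
        return 'stale'
    if installed_size(unzip_root) != unzipped_size(zip_path):
        return 'partially-unzipped'  # the zip is fine; only unzip again
    return 'ok'


# Usage: build a tiny archive, unzip it, then simulate a partial unzip.
workdir = tempfile.mkdtemp()
zip_path = os.path.join(workdir, 'abc.zip')
with zipfile.ZipFile(zip_path, 'w') as zf:
    zf.writestr('abc/a.txt', 'hello')
    zf.writestr('abc/b.txt', 'world!')
size, checksum = os.path.getsize(zip_path), md5_hexdigest(zip_path)

unzip_root = os.path.join(workdir, 'unzipped')
with zipfile.ZipFile(zip_path) as zf:
    zf.extractall(unzip_root)
status_full = check_status(zip_path, unzip_root, size, checksum)

os.remove(os.path.join(unzip_root, 'abc', 'b.txt'))
status_partial = check_status(zip_path, unzip_root, size, checksum)
status_stale = check_status(zip_path, unzip_root, size + 1, checksum)
shutil.rmtree(workdir)
```

Keeping the two comparisons separate is what allows a partially
unzipped package to be repaired without downloading the zip file again.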

the current grammars/ package would become a single new package (e.g.

toy-grammars or book-grammars).

xml file should have:

- authorship info

- license info

- copyright info

- contact info

- info about what type of data/annotation it contains?

- recommended corpus reader?

collections can contain other collections. they can also contain

multiple package types (corpora & models). Have a single 'basics'

package that includes everything we talk about in the book?

n.b.: there will have to be a fallback to the punkt tokenizer, in case

they didn't download that model.

default: unzip or not?

"""

import time, os, zipfile, sys, textwrap, threading, itertools

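try:
    from hashlib import md5
except ImportError:
    # Python < 2.5: hashlib is unavailable, fall back to the md5 module.
    from md5 import md5

try:
    TKINTER = True
    from Tkinter import Tk, Frame, Label, Entry, Button, Canvas, Menu, IntVar
    from tkMessageBox import showerror
    from nltk.draw.table import Table
    from nltk.draw.util import ShowText
except Exception:
    # Tkinter (or the nltk.draw helpers) could not be loaded, so only
    # the text interface is available.
    TKINTER = False
    TclError = ValueError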

from xml.etree import ElementTree

import nltk

from nltk import compat

#urllib2 = nltk.internals.import_from_stdlib('urllib2')

######################################################################

# Directory entry objects (from the data server's index file)

######################################################################

class Package(object):
    """
    A directory entry for a downloadable package. These entries are
    extracted from the XML index file that is downloaded by
    ``Downloader``. Each package consists of a single file; but if
    that file is a zip file, then it can be automatically decompressed
    when the package is installed.
    """
    def __init__(self, id, url, name=None, subdir='',
                 size=None, unzipped_size=None,
                 checksum=None, svn_revision=None,
                 copyright='Unknown', contact='Unknown',
                 license='Unknown', author='Unknown',
                 unzip=True,
                 **kw):
        self.id = id
        """A unique identifier for this package."""

        self.name = name or id
        """A string name for this package."""

        self.subdir = subdir
        """The subdirectory where this package should be installed.
        E.g., ``'corpora'`` or ``'taggers'``."""

        self.url = url
        """A URL that can be used to download this package's file."""

        self.size = int(size)
        """The filesize (in bytes) of the package file."""

        self.unzipped_size = int(unzipped_size)
        """The total filesize of the files contained in the package's
        zipfile."""

        self.checksum = checksum
        """The MD5 checksum of the package file."""

        self.svn_revision = svn_revision
        """A subversion revision number for this package."""

        self.copyright = copyright
        """Copyright holder for this package."""

        self.contact = contact
        """Name & email of the person who should be contacted with
        questions about this package."""

        self.license = license
        """License information for this package."""

        self.author = author
        """Author of this package."""

        ext = os.path.splitext(url.split('/')[-1])[1]
        self.filename = os.path.join(subdir, id+ext)
        """The filename that should be used for this package's file. It
        is formed by joining ``self.subdir`` with ``self.id``, and
        using the same extension as ``url``."""

        self.unzip = bool(int(unzip)) # '0' or '1'
        """A flag indicating whether this corpus should be unzipped by
        default."""

        # Include any other attributes provided by the XML file.
        self.__dict__.update(kw)

    @staticmethod
    def fromxml(xml):
        if isinstance(xml, compat.string_types):
            # parse() returns an ElementTree; the attributes live on
            # its root element.
            xml = ElementTree.parse(xml).getroot()
        return Package(**xml.attrib)

    def __repr__(self):
        return '<Package %s>' % self.id

class Collection(object):
    """
    A directory entry for a collection of downloadable packages.
    These entries are extracted from the XML index file that is
    downloaded by ``Downloader``.
    """
    def __init__(self, id, children, name=None, **kw):
        self.id = id
        """A unique identifier for this collection."""

        self.name = name or id
        """A string name for this collection."""

        self.children = children
        """A list of the ``Collections`` or ``Packages`` directly
        contained by this collection."""

        self.packages = None
        """A list of ``Packages`` contained by this collection or any
        collections it recursively contains."""

        # Include any other attributes provided by the XML file.
        self.__dict__.update(kw)

    @staticmethod
    def fromxml(xml):
        if isinstance(xml, compat.string_types):
            # parse() returns an ElementTree; the attributes and the
            # <item> children live on its root element.
            xml = ElementTree.parse(xml).getroot()
        children = [child.get('ref') for child in xml.findall('item')]
        return Collection(children=children, **xml.attrib)

    def __repr__(self):
        return '<Collection %s>' % self.id

######################################################################

# Message Passing Objects

######################################################################

class DownloaderMessage(object):
    """A status message object, used by ``incr_download`` to
    communicate its progress."""

class StartCollectionMessage(DownloaderMessage):
    """Data server has started working on a collection of packages."""
    def __init__(self, collection): self.collection = collection

class FinishCollectionMessage(DownloaderMessage):
    """Data server has finished working on a collection of packages."""
    def __init__(self, collection): self.collection = collection

class StartPackageMessage(DownloaderMessage):
    """Data server has started working on a package."""