{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# gzip, zipfile, tarfile 模块：处理压缩文件"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import os, shutil, glob\n",
    "import zlib, gzip, bz2, zipfile, tarfile"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "gzip "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## zilb 模块"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`zlib` 提供了对字符串进行压缩和解压缩的功能："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "x�+��,V\u0000�D����\u0012�⒢̼t\u0000S�\u0007�\n",
      "this is a test string\n"
     ]
    }
   ],
   "source": [
    "orginal = \"this is a test string\"\n",
    "\n",
    "compressed = zlib.compress(orginal)\n",
    "\n",
    "print compressed\n",
    "print zlib.decompress(compressed)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "同时提供了两种校验和的计算方法："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1407780813\n"
     ]
    }
   ],
   "source": [
    "print zlib.adler32(orginal) & 0xffffffff"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "4236695221\n"
     ]
    }
   ],
   "source": [
    "print zlib.crc32(orginal) & 0xffffffff"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## gzip 模块"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`gzip` 模块可以产生 `.gz` 格式的文件，其压缩方式由 `zlib` 模块提供。\n",
    "\n",
    "我们可以通过 `gzip.open` 方法来读写 `.gz` 格式的文件： "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "content = \"Lots of content here\"\n",
    "with gzip.open('file.txt.gz', 'wb') as f:\n",
    "    f.write(content)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "读："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Lots of content here\n"
     ]
    }
   ],
   "source": [
    "with gzip.open('file.txt.gz', 'rb') as f:\n",
    "    file_content = f.read()\n",
    "\n",
    "print file_content"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "将压缩文件内容解压出来："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "with gzip.open('file.txt.gz', 'rb') as f_in, open('file.txt', 'wb') as f_out:\n",
    "    shutil.copyfileobj(f_in, f_out)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "此时，目录下应有 `file.txt` 文件，内容为："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Lots of content here\n"
     ]
    }
   ],
   "source": [
    "with open(\"file.txt\") as f:\n",
    "    print f.read()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "os.remove(\"file.txt.gz\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### bz2 模块"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`bz2` 模块提供了另一种压缩文件的方法："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "BZh91AY&SY*\u001c",
      "�v\u0000\u0000\t��@\u0000\"�\u001c",
      "\u0000 \u00001\u00000\"zi\u000f�\u0015\u001b�FLT`�軒)�P�˰\n",
      "this is a test string\n"
     ]
    }
   ],
   "source": [
    "orginal = \"this is a test string\"\n",
    "\n",
    "compressed = bz2.compress(orginal)\n",
    "\n",
    "print compressed\n",
    "print bz2.decompress(compressed)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## zipfile 模块"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "产生一些 `file.txt` 的复制："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "for i in range(10):\n",
    "    shutil.copy(\"file.txt\", \"file.txt.\" + str(i))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "将这些复制全部压缩到一个 `.zip` 文件中："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "f = zipfile.ZipFile('files.zip','w')\n",
    "\n",
    "for name in glob.glob(\"*.txt.[0-9]\"):\n",
    "    f.write(name)\n",
    "    os.remove(name)\n",
    "    \n",
    "f.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "解压这个 `.zip` 文件，用 `namelist` 方法查看压缩文件中的子文件名："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['file.txt.9', 'file.txt.6', 'file.txt.2', 'file.txt.1', 'file.txt.5', 'file.txt.4', 'file.txt.3', 'file.txt.7', 'file.txt.8', 'file.txt.0']\n"
     ]
    }
   ],
   "source": [
    "f = zipfile.ZipFile('files.zip','r')\n",
    "print f.namelist()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "使用 `f.read(name)` 方法来读取 `name` 文件中的内容："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "file.txt.9 content: Lots of content here\n",
      "file.txt.6 content: Lots of content here\n",
      "file.txt.2 content: Lots of content here\n",
      "file.txt.1 content: Lots of content here\n",
      "file.txt.5 content: Lots of content here\n",
      "file.txt.4 content: Lots of content here\n",
      "file.txt.3 content: Lots of content here\n",
      "file.txt.7 content: Lots of content here\n",
      "file.txt.8 content: Lots of content here\n",
      "file.txt.0 content: Lots of content here\n"
     ]
    }
   ],
   "source": [
    "for name in f.namelist():\n",
    "    print name, \"content:\", f.read(name)\n",
    "\n",
    "f.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "可以用 `extract(name)` 或者 `extractall()` 解压单个或者全部文件。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## tarfile 模块"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "支持 `.tar` 格式文件的读写：\n",
    "\n",
    "例如可以这样将 `file.txt` 写入："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "f = tarfile.open(\"file.txt.tar\", \"w\")\n",
    "f.add(\"file.txt\")\n",
    "f.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "清理生成的文件："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "os.remove(\"file.txt\")\n",
    "os.remove(\"file.txt.tar\")\n",
    "os.remove(\"files.zip\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}