{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Files\n",
    " - data is usually stored in secondary storage medium such as hard drive, flash drive, cd-rw, etc. using named locations called files\n",
    " - files can be organized into folders\n",
    " - use open() built-in function to work with files"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "help(open)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## write data to a file\n",
    "- open file with a name\n",
    "- write data\n",
    "- close file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# old school\n",
    "fw = open('test.txt', 'a') # w is for write mode\n",
    "fw.write('words\\n=====\\n')\n",
    "fw.write('apple\\nball\\ncat\\ndog\\n')\n",
    "fw.write('elephant\\n')\n",
    "fw.close()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# newer and better syntax\n",
    "with open('words.txt', 'w') as fw:\n",
    "    fw.write('apple\\nball\\ncat\\ndog\\n')\n",
    "    fw.write('elephant\\n')\n",
    "    fw.write('zebra\\n')\n",
    "# file will be automatically closed when with block is finished executing"
   ]
  },
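  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of how the `'w'` and `'a'` modes differ. The file name `mode_demo.txt` is a hypothetical scratch file chosen for this example and is not used elsewhere in the notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 'mode_demo.txt' is a hypothetical scratch file used only for this sketch\n",
    "with open('mode_demo.txt', 'w') as f:   # 'w' creates the file or truncates it\n",
    "    f.write('first\\n')\n",
    "with open('mode_demo.txt', 'a') as f:   # 'a' appends to the end of the file\n",
    "    f.write('second\\n')\n",
    "with open('mode_demo.txt', 'w') as f:   # 'w' again discards both earlier lines\n",
    "    f.write('third\\n')\n",
    "with open('mode_demo.txt') as f:\n",
    "    print(f.read())                     # only the last write remains"
   ]
  },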
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## read data from a file\n",
    "- open file with its name; can provide relative or absolute path\n",
    "- read in various ways; one line at a time, all lines, bytes, whole file, etc.\n",
    "- use data\n",
    "- close file"
   ]
  },
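  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The reading styles listed above can be sketched as follows. The file name `read_demo.txt` is an assumption made for this example; the cell creates it first so that it runs on its own."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 'read_demo.txt' is a hypothetical file created just for this sketch\n",
    "with open('read_demo.txt', 'w') as f:\n",
    "    f.write('apple\\nball\\ncat\\n')\n",
    "\n",
    "with open('read_demo.txt') as f:\n",
    "    first = f.readline()      # one line, including its trailing newline\n",
    "    rest = f.read()           # everything that is left, as one string\n",
    "print(repr(first), repr(rest))\n",
    "\n",
    "with open('read_demo.txt') as f:\n",
    "    for line in f:            # iterate lazily, one line at a time\n",
    "        print(line.rstrip('\\n'))\n",
    "\n",
    "with open('read_demo.txt', 'rb') as f:\n",
    "    head = f.read(5)          # first 5 bytes, returned as a bytes object\n",
    "print(head)"
   ]
  },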
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# read whole file as list of lines\n",
    "fr = open('words.txt', 'r') # 'r' or read mode by default; file must exist\n",
    "data = fr.readlines()\n",
    "fr.close()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "help(fr)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['apple\\n', 'ball\\n', 'cat\\n', 'dog\\n', 'elephant\\n', 'zebra\\n']"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "data.sort(reverse=True)\n",
    "with open('sorted_words.txt', 'w') as newFile: \n",
    "    for word in data:\n",
    "        newFile.write(word)\n",
    "    "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## fetch a page from the web"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "('teaching.html', <http.client.HTTPMessage at 0x1055d8860>)"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import urllib.request\n",
    "url = 'http://org.coloradomesa.edu/~rbasnet/teaching.html'\n",
    "localfile = 'teaching.html'\n",
    "urllib.request.urlretrieve(url, localfile)"
   ]
  },
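  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As an alternative sketch, `urllib.request.urlopen` can be used as a context manager so that the response body is read explicitly and network errors are handled. This assumes the URL above is still reachable; the timeout value and the output file name `teaching_copy.html` are arbitrary choices for illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import urllib.request\n",
    "\n",
    "url = 'http://org.coloradomesa.edu/~rbasnet/teaching.html'\n",
    "try:\n",
    "    # the timeout (in seconds) is an arbitrary value chosen for this sketch\n",
    "    with urllib.request.urlopen(url, timeout=10) as response:\n",
    "        html_bytes = response.read()\n",
    "    with open('teaching_copy.html', 'wb') as f:  # hypothetical output name\n",
    "        f.write(html_bytes)\n",
    "    print('saved', len(html_bytes), 'bytes')\n",
    "except OSError as err:\n",
    "    # urllib.error.URLError and socket timeouts are both OSError subclasses\n",
    "    print('download failed:', err)"
   ]
  },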
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## reading the whole file at once"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# read /usr/share/dict/words file in linux\n",
    "# windows path might be \" c:/temp/words.txt\" or c:\\\\temp\\words.txt\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "There are 2537 words in the file.\n"
     ]
    }
   ],
   "source": [
    "with open(localfile) as f:\n",
    "    data = f.read()\n",
    "words = data.split(' ')\n",
    "print('There are {0} words in the file.'.format(len(words)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['<!DOCTYPE', 'html>\\n<html', 'lang=\"en\">\\n<head>\\n<meta', 'charset=\"utf-8\"', '/>\\n<title>Teaching', 'Interests', 'and', 'Current', 'Semester', 'Schedule</title>\\n<link']\n"
     ]
    }
   ],
   "source": [
    "print(words[:10])"
   ]
  },
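  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The tokens printed above keep embedded newlines (for example `'html>\\n<html'`) because `split(' ')` splits only on single space characters. A small sketch on a made-up string shows how this differs from `split()` with no argument, which splits on any run of whitespace:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sample = 'one  two\\nthree'   # made-up text with a double space and a newline\n",
    "print(sample.split(' '))     # ['one', '', 'two\\nthree']\n",
    "print(sample.split())        # ['one', 'two', 'three']"
   ]
  },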
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## parsing HTML using BeautifulSoup library\n",
    "- https://www.crummy.com/software/BeautifulSoup/bs4/doc/#\n",
    "- Alternative is nltk (Natural Language Toolkit) library\n",
    "- http://www.nltk.org/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "collapsed": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "\n",
      "\n",
      "Teaching Interests and Current Semester Schedule\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "Dr. Ram B. Basnet\n",
      "\n",
      "\n",
      "\n",
      "Home\n",
      "Research\n",
      "Teaching\n",
      "Useful Links\n",
      "Contact\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\t\t\t \n",
      "\t\t\n",
      "\n",
      "Spring 2017 Schedule\n",
      "Fall 2016 Schedule\n",
      "Spring 2016 Schedule\n",
      "Fall 2015 Schedule\n",
      "Spring 2015 Schedule\n",
      "Fall 2014 Schedule\n",
      "Spring 2014 Schedule\n",
      "Fall 2013 Schedule\n",
      "\n",
      "\n",
      "\t\t\t \n",
      "\t\t\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "Teaching Fall 2017 Schedule\n",
      "\n",
      "\n",
      "Teaching Interests\n",
      "\n",
      "\n",
      "\n",
      "            \t\tInformation Assurance/Computer Security/Cyber Security\n",
      "            \t\n",
      "\n",
      "            \t\tProgramming Languages: Python, C++, Java, JavaScript, HTML, CSS\n",
      "            \t\n",
      "\n",
      "            \t\tWeb Design and Secure Web Application Development\n",
      "            \t\n",
      "\n",
      "\n",
      "Current Teaching Schedule\n",
      "\n",
      "\n",
      "\n",
      "Day/Time\n",
      "Monday\n",
      "Tuesday\n",
      "Wednesday\n",
      "Thursday\n",
      "Friday\n",
      "\n",
      "\n",
      "\n",
      "            \t\t\t8:00 AM\n",
      "            \t\t\n",
      " \n",
      " \n",
      " \n",
      " \n",
      " \n",
      "\n",
      "\n",
      "\n",
      "            \t\t\t8:30 AM\n",
      "            \t\t\n",
      " \n",
      " \n",
      " \n",
      " \n",
      " \n",
      "\n",
      "\n",
      "\n",
      "            \t\t\t9:00 AM\n",
      "            \t\t\n",
      "Office Hour\n",
      "Adv. Prog. Python WS 118\n",
      "Office Hour\n",
      "Adv. Prog. Python WS 118\n",
      "Office Hour\n",
      "\n",
      "\n",
      "\n",
      "            \t\t\t9:30 AM\n",
      "            \t\t\n",
      "\n",
      "\n",
      "\n",
      "            \t\t\t10:00 AM\n",
      "            \t\t\n",
      "CS 2 WS 120\n",
      "CS 2 WS 120\n",
      "CS 2 WS 120\n",
      "CS 2 WS 120\n",
      "Office Hour\n",
      "\n",
      "\n",
      "\n",
      "            \t\t\t10:30 AM\n",
      "            \t\t\n",
      "\n",
      "\n",
      "\n",
      "            \t\t\t11:00 AM\n",
      "            \t\t\n",
      " \n",
      "Office Hour\n",
      " \n",
      " \n",
      " \n",
      "\n",
      "\n",
      "\n",
      "            \t\t\t11:30 AM\n",
      "            \t\t\n",
      " \n",
      " \n",
      " \n",
      " \n",
      "\n",
      "\n",
      "\n",
      "            \t\t\t12:00 PM\n",
      "            \t\t\n",
      " \n",
      " \n",
      " \n",
      " \n",
      " \n",
      "\n",
      "\n",
      "\n",
      "            \t\t\t12:30 PM\n",
      "            \t\t\n",
      " \n",
      " \n",
      " \n",
      " \n",
      " \n",
      "\n",
      "\n",
      "\n",
      "            \t\t\t1:00 PM\n",
      "            \t\t\n",
      "Computer Security WS 118\n",
      " \n",
      "Computer Security WS 118\n",
      " \n",
      "Computer Security WS 118\n",
      "\n",
      "\n",
      "\n",
      "            \t\t\t1:30 PM\n",
      "            \t\t\n",
      "\n",
      "\n",
      "\n",
      "            \t\t\t2:00 PM\n",
      "            \t\t\n",
      " \n",
      " \n",
      " \n",
      " \n",
      " \n",
      "\n",
      "\n",
      "\n",
      "            \t\t\t2:30 PM\n",
      "            \t\t\n",
      " \n",
      " \n",
      " \n",
      " \n",
      " \n",
      "\n",
      "\n",
      "\n",
      "            \t\t\t3:00 PM\n",
      "            \t\t\n",
      "Net/App Security WS 118\n",
      " \n",
      "Net/App Security WS 118\n",
      " \n",
      "Net/App Security WS 118\n",
      "\n",
      "\n",
      "\n",
      "            \t\t\t3:30 PM\n",
      "            \t\t\n",
      " \n",
      " \n",
      "\n",
      "\n",
      "\n",
      "            \t\t\t4:00 PM\n",
      "            \t\t\n",
      " \n",
      " \n",
      " \n",
      " \n",
      " \n",
      "\n",
      "\n",
      "\n",
      "            \t\t\t4:30 PM\n",
      "            \t\t\n",
      " \n",
      " \n",
      " \n",
      " \n",
      " \n",
      "\n",
      "\n",
      "\n",
      "            \t\t\t5:00 PM\n",
      "            \t\t\n",
      " \n",
      " \n",
      " \n",
      " \n",
      " \n",
      "\n",
      "\n",
      "\n",
      "            \t\t\t5:30 PM\n",
      "            \t\t\n",
      " \n",
      " \n",
      " \n",
      " \n",
      " \n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "xhtml\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "  (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){\n",
      "  (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),\n",
      "  m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)\n",
      "  })(window,document,'script','//www.google-analytics.com/analytics.js','ga');\n",
      "\n",
      "  ga('create', 'UA-46738331-1', 'coloradomesa.edu');\n",
      "  ga('send', 'pageview');\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n"
     ]
    }
   ],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "localfile = 'teaching.html'\n",
    "with open(localfile) as f:\n",
    "    soup = BeautifulSoup(f.read(), 'lxml')\n",
    "text = soup.get_text()\n",
    "print(text)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# break into lines and remove leading and trailing space on each line\n",
    "lines = [line.strip() for line in text.splitlines()]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "collapsed": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['', '', '', 'Teaching Interests and Current Semester Schedule', '', '', '', '', '', 'Dr. Ram B. Basnet', '', '', '', 'Home', 'Research', 'Teaching', 'Useful Links', 'Contact', '', '', '', '', '', '', '', 'Spring 2017 Schedule', 'Fall 2016 Schedule', 'Spring 2016 Schedule', 'Fall 2015 Schedule', 'Spring 2015 Schedule', 'Fall 2014 Schedule', 'Spring 2014 Schedule', 'Fall 2013 Schedule', '', '', '', '', '', '', '', '', 'Teaching Fall 2017 Schedule', '', '', 'Teaching Interests', '', '', '', 'Information Assurance/Computer Security/Cyber Security', '', '', 'Programming Languages: Python, C++, Java, JavaScript, HTML, CSS', '', '', 'Web Design and Secure Web Application Development', '', '', '', 'Current Teaching Schedule', '', '', '', 'Day/Time', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', '', '', '', '8:00 AM', '', '', '', '', '', '', '', '', '', '8:30 AM', '', '', '', '', '', '', '', '', '', '9:00 AM', '', 'Office Hour', 'Adv. Prog. Python WS 118', 'Office Hour', 'Adv. Prog. 
Python WS 118', 'Office Hour', '', '', '', '9:30 AM', '', '', '', '', '10:00 AM', '', 'CS 2 WS 120', 'CS 2 WS 120', 'CS 2 WS 120', 'CS 2 WS 120', 'Office Hour', '', '', '', '10:30 AM', '', '', '', '', '11:00 AM', '', '', 'Office Hour', '', '', '', '', '', '', '11:30 AM', '', '', '', '', '', '', '', '', '12:00 PM', '', '', '', '', '', '', '', '', '', '12:30 PM', '', '', '', '', '', '', '', '', '', '1:00 PM', '', 'Computer Security WS 118', '', 'Computer Security WS 118', '', 'Computer Security WS 118', '', '', '', '1:30 PM', '', '', '', '', '2:00 PM', '', '', '', '', '', '', '', '', '', '2:30 PM', '', '', '', '', '', '', '', '', '', '3:00 PM', '', 'Net/App Security WS 118', '', 'Net/App Security WS 118', '', 'Net/App Security WS 118', '', '', '', '3:30 PM', '', '', '', '', '', '', '4:00 PM', '', '', '', '', '', '', '', '', '', '4:30 PM', '', '', '', '', '', '', '', '', '', '5:00 PM', '', '', '', '', '', '', '', '', '', '5:30 PM', '', '', '', '', '', '', '', '', '', '', '', '', 'xhtml', '', '', '', '', \"(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){\", '(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),', 'm=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)', \"})(window,document,'script','//www.google-analytics.com/analytics.js','ga');\", '', \"ga('create', 'UA-46738331-1', 'coloradomesa.edu');\", \"ga('send', 'pageview');\", '', '', '', '']\n"
     ]
    }
   ],
   "source": [
    "print(lines)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# create list of words by spliting multi-word elements\n",
    "words = [word.strip() for line in lines for word in line.split()]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "collapsed": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['Teaching', 'Interests', 'and', 'Current', 'Semester', 'Schedule', 'Dr.', 'Ram', 'B.', 'Basnet', 'Home', 'Research', 'Teaching', 'Useful', 'Links', 'Contact', 'Spring', '2017', 'Schedule', 'Fall', '2016', 'Schedule', 'Spring', '2016', 'Schedule', 'Fall', '2015', 'Schedule', 'Spring', '2015', 'Schedule', 'Fall', '2014', 'Schedule', 'Spring', '2014', 'Schedule', 'Fall', '2013', 'Schedule', 'Teaching', 'Fall', '2017', 'Schedule', 'Teaching', 'Interests', 'Information', 'Assurance/Computer', 'Security/Cyber', 'Security', 'Programming', 'Languages:', 'Python,', 'C++,', 'Java,', 'JavaScript,', 'HTML,', 'CSS', 'Web', 'Design', 'and', 'Secure', 'Web', 'Application', 'Development', 'Current', 'Teaching', 'Schedule', 'Day/Time', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', '8:00', 'AM', '8:30', 'AM', '9:00', 'AM', 'Office', 'Hour', 'Adv.', 'Prog.', 'Python', 'WS', '118', 'Office', 'Hour', 'Adv.', 'Prog.', 'Python', 'WS', '118', 'Office', 'Hour', '9:30', 'AM', '10:00', 'AM', 'CS', '2', 'WS', '120', 'CS', '2', 'WS', '120', 'CS', '2', 'WS', '120', 'CS', '2', 'WS', '120', 'Office', 'Hour', '10:30', 'AM', '11:00', 'AM', 'Office', 'Hour', '11:30', 'AM', '12:00', 'PM', '12:30', 'PM', '1:00', 'PM', 'Computer', 'Security', 'WS', '118', 'Computer', 'Security', 'WS', '118', 'Computer', 'Security', 'WS', '118', '1:30', 'PM', '2:00', 'PM', '2:30', 'PM', '3:00', 'PM', 'Net/App', 'Security', 'WS', '118', 'Net/App', 'Security', 'WS', '118', 'Net/App', 'Security', 'WS', '118', '3:30', 'PM', '4:00', 'PM', '4:30', 'PM', '5:00', 'PM', '5:30', 'PM', 'xhtml', \"(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){\", '(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new', 'Date();a=s.createElement(o),', 'm=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)', \"})(window,document,'script','//www.google-analytics.com/analytics.js','ga');\", \"ga('create',\", \"'UA-46738331-1',\", \"'coloradomesa.edu');\", \"ga('send',\", 
\"'pageview');\"]\n"
     ]
    }
   ],
   "source": [
    "print(words)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "There are 185 words in the file.\n"
     ]
    }
   ],
   "source": [
    "print('There are {0} words in the file.'.format(len(words)))"
   ]
  },
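  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The word list above still contains JavaScript from the page's `<script>` element. One possible way to drop it, sketched here on a small made-up HTML string rather than on the downloaded page, is to remove `script` and `style` tags before calling `get_text()`. The built-in `'html.parser'` is used in this sketch so that `lxml` is not required; the variable names are chosen to avoid overwriting `soup` and `words`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "\n",
    "# made-up HTML standing in for the downloaded page\n",
    "demo_html = '<html><body><h1>Teaching</h1><script>ga(1);</script><p>Office Hour</p></body></html>'\n",
    "demo_soup = BeautifulSoup(demo_html, 'html.parser')\n",
    "for tag in demo_soup(['script', 'style']):  # calling the soup is shorthand for find_all\n",
    "    tag.decompose()                         # remove the tag and its contents from the tree\n",
    "clean_words = demo_soup.get_text(separator=' ').split()\n",
    "print(clean_words)"
   ]
  },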
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## working with binary files"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "with open('cmu-logo.png', 'rb') as rbf: #rb - read binary mode\n",
    "    data = rbf.read() # read the whole binary file\n",
    "    with open('cmu-logo-copy.png', 'wb') as wbf:\n",
    "        wbf.write(data) # write the whole binary file\n",
    "    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "14055\n"
     ]
    }
   ],
   "source": [
    "print(len(data))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "14088\n"
     ]
    }
   ],
   "source": [
    "# find the size of the data in bytes\n",
    "import sys\n",
    "print(sys.getsizeof(data))"
   ]
  },
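  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Reading a whole file with `read()` is fine for a small image, but for large files a chunked copy keeps memory use bounded. A minimal sketch, assuming the hypothetical file names `blob_demo.bin` and `blob_demo_copy.bin` and an arbitrary chunk size:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "# create a small binary file so the sketch runs on its own (hypothetical names)\n",
    "with open('blob_demo.bin', 'wb') as f:\n",
    "    f.write(bytes(range(256)) * 100)\n",
    "\n",
    "chunk_size = 4096                      # arbitrary choice for illustration\n",
    "with open('blob_demo.bin', 'rb') as src, open('blob_demo_copy.bin', 'wb') as dst:\n",
    "    while True:\n",
    "        chunk = src.read(chunk_size)   # returns b'' at end of file\n",
    "        if not chunk:\n",
    "            break\n",
    "        dst.write(chunk)\n",
    "\n",
    "# os.path.getsize reports each file's size on disk in bytes\n",
    "print(os.path.getsize('blob_demo.bin'), os.path.getsize('blob_demo_copy.bin'))"
   ]
  },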
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## exercises\n",
    "1. Write a program that reads a file and writes out a new file with the lines in reversed order (i.e. the first line in the old file becomes the last one in the new file.)\n",
    "2. Write a program that reads a file and prints only those lines that contain the substring snake.\n",
    "3. Write a program that reads a text file and produces an output file which is a copy of the file, except the first five columns of each line contain a four digit line number, followed by a space. Start numbering the first line in the output file at 1. Ensure that every line number is formatted to the same width in the output file. Use one of your Python programs as test data for this exercise: your output should be a printed and numbered listing of the Python program.\n",
    "4. Write a program that undoes the numbering of the previous exercise: it should read a file with numbered lines and produce another file without line numbers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}