{ "metadata": { "name": "", "signature": "sha256:2f588d849599d9daa03443c69bcf3f653019270e7d6328a1be4720c5521afcc2" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "code", "collapsed": false, "input": [ "import pandas as pd\n", "import numpy as np\n", "import sklearn as sk\n", "import pickle\n", "print 'pandas version: ',pd.__version__\n", "print 'numpy version:',np.__version__\n", "print 'sklearn version:',sk.__version__" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "pandas version: 0.13.1\n", "numpy version: 1.8.1\n", "sklearn version: 0.14.1\n" ] } ], "prompt_number": 4 }, { "cell_type": "code", "collapsed": false, "input": [ "import sys\n", "home_dir='/home/ubuntu/UCSD_BigData'\n", "sys.path.append(home_dir+'/utils')\n", "\n", "from find_waiting_flow import *\n", "from AWS_keypair_management import *\n", "\n", "!pwd" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "/home/ubuntu/UCSD_BigData/notebooks/weather.mapreduce\r\n" ] } ], "prompt_number": 5 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## managing AWS Credentials ##\n", "It can get confusing with the AWS credentials. 
So I created a few scripts that should make the\n", "process easier and less error-prone.\n", "\n", "Before continuing, download your credentials from AWS using the IAM console.\n", "\n", "The credentials are downloaded into files with names such as credentials.csv, credentials (2).csv, etc.\n", "\n", "The LaunchNotebookServer.py script has an option that copies these files to your EC2 instance.\n", "The option is -A, and it is used as follows:\n", "* `LaunchNotebookServer.py -A ~/Downloads/credentials\*.csv`\n", "\n", "The `\*`, rather than `*`, is intentional: it stops your local shell from expanding the pattern, so the script receives it unchanged.\n", "\n", "The files will be copied through `scp` to the directory `/home/ubuntu/Vault` on your EC2 instance.\n", "\n", "You can then run the following command, which reads through the files and checks which of the credentials are active. It then\n", "generates a dictionary `Cred` that holds all of the key pairs that are currently active.\n", "\n", "The following cells print out all of your secret key pairs. Make sure that you delete the cell output before you commit the notebook to GitHub." 
] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "commented out on purpose" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# creds = pickle.load(open('/home/ubuntu/Vault/Creds.pkl'))\n", "# config = creds['mrjob']\n", "\n", "# key_id = config['key_id']\n", "# secret_key = config['secret_key']\n", "# s3_bucket = config['s3_logs']\n", "# s3_scratch= config['s3_scratch']\n", "\n", "# print config\n", "# print key_id\n", "# print secret_key\n", "# print s3_bucket\n", "# print s3_scratch" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 41 }, { "cell_type": "code", "collapsed": false, "input": [ "def get_job_flow_id():\n", "    job_flow_id=find_waiting_flow(key_id,secret_key)\n", "    return job_flow_id\n", "\n", "# get_job_flow_id()\n", "DEBUG = True # if DEBUG is True, remove all existing files and regenerate them" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 8 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## using map-reduce to calculate the total number of measurements for each station, to be used later as weights ##" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%%writefile totalNumber.py\n", "\n", "from mrjob.job import MRJob\n", "\n", "class totalNumberOfStations(MRJob):\n", "    def mapper(self, _, line):\n", "        self.increment_counter('MrJob Counters','mapper',1)\n", "        elements=line.split(',')\n", "        if len(elements) != 368:\n", "            yield 'corrupted data', 1\n", "        else:\n", "            yield 'useful data', 1\n", "            yield(elements[0],1)\n", "\n", "    def combiner(self, key, counts):\n", "        self.increment_counter('MrJob Counters','combiner',1)\n", "        yield key, sum(counts)\n", "\n", "    def reducer(self, key, counts):\n", "        self.increment_counter('MrJob Counters','reducer',1)\n", "        yield key, sum(counts)\n", "\n", "if __name__ == '__main__':\n", "    totalNumberOfStations.run()" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ 
"Overwriting totalNumber.py\n" ] } ], "prompt_number": 9 }, { "cell_type": "code", "collapsed": false, "input": [ "# run this job on EMR, unless the output file already exists\n", "import os\n", "if not os.path.exists(\"all_station_count\"):\n", "    !python totalNumber.py -r emr --emr-job-flow-id=$job_flow_id hdfs:/weather/weather.csv > all_station_count" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 11 }, { "cell_type": "code", "collapsed": false, "input": [ "!head -10 all_station_count" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "\"AJ000037668\"\t28\r\n", "\"AJ000037679\"\t29\r\n", "\"AJ000037734\"\t56\r\n", "\"AJ000037756\"\t105\r\n", "\"AJ000037844\"\t110\r\n", "\"AJ000037866\"\t85\r\n", "\"AJ000037888\"\t27\r\n", "\"AJ000037899\"\t94\r\n", "\"AJ000037907\"\t154\r\n", "\"AM000037627\"\t52\r\n" ] } ], "prompt_number": 8 }, { "cell_type": "code", "collapsed": false, "input": [ "import os,sys,re,pickle,coding\n", "from numpy import *\n", "\n", "!gunzip stations.pkl.gz\n", "stations=pickle.load(open(\"stations.pkl\", 'rb'))\n", "!gzip stations.pkl\n", "\n", "stations.shape\n", "type(stations)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 21, "text": [ "pandas.core.frame.DataFrame" ] } ], "prompt_number": 21 }, { "cell_type": "code", "collapsed": false, "input": [ "stations.head()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n",
"|  | latitude | longitude | elevation | state | name | GSNFLAG | HCNFLAG | WMOID |\n",
"|---|---|---|---|---|---|---|---|---|\n",
"| ACW00011604 | 17.1167 | -61.7833 | 10.1 | NaN | ST JOHNS COOLIDGE FLD | NaN | NaN | NaN |\n",
"| ACW00011647 | 17.1333 | -61.7833 | 19.2 | NaN | ST JOHNS | NaN | NaN | NaN |\n",
"| AE000041196 | 25.3330 | 55.5170 | 34.0 | NaN | SHARJAH INTER. AIRP | GSN | NaN | 41196 |\n",
"| AF000040930 | 35.3170 | 69.0170 | 3366.0 | NaN | NORTH-SALANG | GSN | NaN | 40930 |\n",
"| AG000060390 | 36.7167 | 3.2500 | 24.0 | NaN | ALGER-DAR EL BEIDA | GSN | NaN | 60390 |\n",
"\n",
"5 rows \u00d7 8 columns
\n",
"\n",
"|  | count |\n",
"|---|---|\n",
"| CA004035200 | 87 |\n",
"| USS0006H19S | 134 |\n",
"| USC00390043 | 1061 |\n",
"| UY000001086 | 20 |\n",
"| UY000001084 | 20 |\n",
"\n",
"5 rows \u00d7 1 columns
\n",
"\n",
"|  | count | latitude | longitude | elevation | state | name | GSNFLAG | HCNFLAG | WMOID |\n",
"|---|---|---|---|---|---|---|---|---|---|\n",
"| CA004035200 | 87 | 49.1700 | -104.5800 | 695.0 | NaN | MINTON | NaN | NaN | NaN |\n",
"| USS0006H19S | 134 | 41.3333 | -106.5000 | 2572.5 | Y | SOUTH BRUSH CREEK | NaN | NaN | NaN |\n",
"| USC00390043 | 1061 | 43.4892 | -99.0631 | 512.1 | D | ACADEMY 2NE | NaN | HCN | NaN |\n",
"| UY000001086 | 20 | -30.6500 | -56.1700 | 70.0 | NaN | CHARQUEADA | NaN | NaN | NaN |\n",
"| UY000001084 | 20 | -30.6500 | -56.3800 | 190.0 | NaN | GUAYUBIRA | NaN | NaN | NaN |\n",
"\n",
"5 rows \u00d7 9 columns
\n",
"\n",
"|  | latitude | longitude | count |\n",
"|---|---|---|---|\n",
"| CA004035200 | 49.1700 | -104.5800 | 87 |\n",
"| USS0006H19S | 41.3333 | -106.5000 | 134 |\n",
"| USC00390043 | 43.4892 | -99.0631 | 1061 |\n",
"| UY000001086 | -30.6500 | -56.1700 | 20 |\n",
"| UY000001084 | -30.6500 | -56.3800 | 20 |\n",
"\n",
"5 rows \u00d7 3 columns
\n", "Throttling
\r\n",
"