{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Tokyo Photographs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "tags": [
     "remove-input"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "# Tokyo data\n",
       "\n",
       "This dataset contains a sample of geotagged images uploaded to Flickr for the Tokyo region. The original extract (generated by Meixu Chen, `meixu@liverpool.ac.uk`) is stored for archival purposes as `tokyo.csv`.\n",
       "\n",
       "- `Source`: Yahoo Flickr Creative Commons 100 Million Dataset\n",
       "- `URL`:\n",
       "\n",
       "> [https://yahooresearch.tumblr.com/post/89783581601/one-hundred-million-creative-commons-flickr-images](https://yahooresearch.tumblr.com/post/89783581601/one-hundred-million-creative-commons-flickr-images)\n",
       "\n",
       "- `Processing`: transformations applied to the original extract, including a random subset to a more manageable size for pedagogical purposes, are documented in `tokyo_cleaning.ipynb`\n",
       "    - Clean file: `tokyo_clean.csv`\n",
       "\n",
       "## Metadata\n",
       "\n",
       "For every record, the following information is provided:\n",
       "\n",
       "* `user_id`: the unique id number of each Flickr user.\n",
       "\n",
       "* `longitude`: longitude of the geotagged Flickr photo in decimal format,\n",
       "under WGS1984 geographic coordinate system.\n",
       "\n",
       "* `latitude`: latitude of the geotagged Flickr photo in decimal format,\n",
       "under WGS1984 geographic coordinate system.\n",
       "\n",
       "* `date_taken`: the date when the photo was taken.\n",
       "\n",
       "* `photo/video_page_url`: an url link where the photo/video content is\n",
       "available.\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from IPython.display import display_markdown\n",
    "\n",
    "display_markdown(open(\"README.md\").read(), raw=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "metadata": {
    "collapsed": true,
    "jupyter": {
     "outputs_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "\n",
    "import geopandas as gpd\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "metadata": {
    "collapsed": true,
    "jupyter": {
     "outputs_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "db = pd.read_csv('data/tokyo.csv')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Randomly subsetting"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 73,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Set the \"seed\" so every run produces the generates the same random numbers\n",
    "np.random.seed(1234)\n",
    "# Create a sequence of length equal to the number of rows in the table\n",
    "ri = np.arange(len(db))\n",
    "# Randomly reorganize (shuffle) the values\n",
    "np.random.shuffle(ri)\n",
    "# Reindex the table by using only the first 10,000 numbers \n",
    "# of the (now randomly arranged) sequence\n",
    "db = db.iloc[ri[:10000], :]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Reproject XY coordinates in separate columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 74,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 431 ms, sys: 4.86 ms, total: 436 ms\n",
      "Wall time: 436 ms\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "pts = db.apply(lambda r: Point(r.longitude, r.latitude), axis=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 75,
   "metadata": {
    "collapsed": true,
    "jupyter": {
     "outputs_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "gdb = gpd.GeoDataFrame(db.assign(geometry=pts), \\\n",
    "                       crs={'init' :'epsg:4326'})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 76,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 529 ms, sys: 7.46 ms, total: 536 ms\n",
      "Wall time: 747 ms\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "gdb = gdb.to_crs(epsg=3857)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 77,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 2.13 s, sys: 20.3 ms, total: 2.15 s\n",
      "Wall time: 2.16 s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "xys = gdb['geometry'].apply(lambda pt: pd.Series({'x': pt.x, 'y': pt.y}))\n",
    "gdb['x'] = xys['x']\n",
    "gdb['y'] = xys['y']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 79,
   "metadata": {
    "collapsed": true,
    "jupyter": {
     "outputs_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "gdb.drop('geometry', axis=1).to_csv('tokyo_clean.csv', index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Download link\n",
    "\n",
    "{download}`[Download the *tokyo_clean.csv* file] <tokyo_clean.csv>`"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}