{
 "metadata": {
  "name": "",
  "signature": "sha256:50d58774222efed73d8bb4fe800d1a37f2b749a21eb1cf193636035806453a70"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "Lithology classification from images"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Can we extract text from images using [Tesseract](https://code.google.com/p/tesseract-ocr/), then process that text for lithological data? "
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<img src=\"Samples.png\" />"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We'll use a Python wrapper for Tesseract.\n",
      "\n",
      "[This library](https://code.google.com/p/python-tesseract/) looks cool but I can't get it to work."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "I also tried a lightweight one, [pytesseract](https://pypi.python.org/pypi/pytesseract), but it failed with an image error. "
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Let's try using [PyOCR](https://github.com/jflesch/pyocr). "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from PIL import Image"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 1
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import pyocr\n",
      "import pyocr.builders"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 2
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "tools = pyocr.get_available_tools()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 3
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "tool = tools[0]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 4
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "tool.get_available_languages()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 5,
       "text": [
        "['eng']"
       ]
      }
     ],
     "prompt_number": 5
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "text = tool.image_to_string(Image.open('Samples.png'), builder=pyocr.builders.TextBuilder())"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 6
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print text"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "0.0 m\n",
        "\n",
        "15.0 m\n",
        "\n",
        "30.0 m\n",
        "\n",
        "75.0 In\n",
        "\n",
        "80.0 m\n",
        "\n",
        "85.0 m ~\n",
        "\n",
        "15.0 m\n",
        "\n",
        "30.0\n",
        "\n",
        "75.0\n",
        "\n",
        "80.0\n",
        "\n",
        "85.0\n",
        "\n",
        "90.0 m\n",
        "\n",
        "SAMPLE AND CORE DESCRIPTIONS\n",
        "\n",
        "Chevron Irving Bras d\u20190r #Z\n",
        "B! A. Berti\n",
        "\n",
        "No samples.\n",
        "\n",
        "Drift, subangular to subroundad small pebbles of\n",
        "quartz, granite, metesediment, ett.\n",
        "\n",
        "No samples .\n",
        "\n",
        "Fractured anhydrite and dolomite in more or less\n",
        "equal amounts, fractures 2-3 mm apart, <l mm\n",
        "wide, healed with nnhydrite and/nr gypsum; lOZ of\n",
        "sample is white to milky-white gypsum; anhydrite\n",
        "l#him to light grey microcrystslline to medium\n",
        "grained, dolomite, light grey-brown to brown,\n",
        "earthy to finely crystalline, trace ZZ porosity,\n",
        "no fluorescence.\n",
        "\n",
        "Fractured anhydrite and dolomite as above with\n",
        "about 102 gypsum in sample; crate smell, nan-ew\n",
        "black arte-ks in dolomite (Totganic or bitumen).\n",
        "\n",
        "Anhydrite, white to light tu medium brown with\n",
        "irregular stringera dolomite .tattered throughout\n",
        "rather than fractured as above, dolomite, light\n",
        "brown, miorocryatalline, trato porosity.\n"
       ]
      }
     ],
     "prompt_number": 7
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "This is the preformance with no training. \n",
      "\n",
      "If training is required, [this might help](https://pypi.python.org/pypi/TesseractTrainer). "
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Try a web API"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "For all I know this is also using Tesseract. [This Mashape app](https://www.mashape.com/imagevision/text-recognition-ocr-for-images) claims to do text recognition."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import unirest"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 2
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "response = unirest.post(\"https://imagevision-text-search-v1.p.mashape.com/textSearch/detectText\",\n",
      "  headers={\n",
      "    \"X-Mashape-Key\": \"PUT MASHAPE KEY HERE\",\n",
      "    \"Content-Type\": \"application/x-www-form-urlencoded\"\n",
      "  },\n",
      "  params={\n",
      "    \"objecturl\": \"https://dl.dropboxusercontent.com/u/14965965/Samples.png\"\n",
      "  }\n",
      ")"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 29
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "response.body"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 6,
       "text": [
        "{u'imageDimensions': [860, 686],\n",
        " u'message': u'SUCCESS',\n",
        " u'objectUrl': u'https://dl.dropboxusercontent.com/u/14965965/Samples.png',\n",
        " u'resource': u'detectText',\n",
        " u'text': [{u'id': 1,\n",
        "   u'textCoordinates': [69, 259, 134, 274],\n",
        "   u'textString': u'300m'}],\n",
        " u'textDetected': True,\n",
        " u'transactionId': u'1234',\n",
        " u'version': u'IV-1.2.22.29'}"
       ]
      }
     ],
     "prompt_number": 6
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "OK, that didn't work very well."
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Text theme extraction"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Try recognizing hashtags in the Tesseract text, using [the Aylien app](http://subsurfwiki.org/wiki/Aylien) in Mashape. "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import urllib"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 7
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "s = urllib.quote_plus(text)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 10
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "base_url = \"https://aylien-text.p.mashape.com/hashtags?text=\""
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 11
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "headers = {\"X-Mashape-Authorization\": \"PUT MASHAPE KEY HERE\"}"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 28
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "r = unirest.get(base_url+s, headers=headers)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 13
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "r.body['hashtags']"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 14,
       "text": [
        "[u'#Dolomite',\n",
        " u'#Anhydrite',\n",
        " u'#Gypsum',\n",
        " u'#Porosity',\n",
        " u'#Bitumen',\n",
        " u'#Granite',\n",
        " u'#Fluorescence',\n",
        " u'#Crystal']"
       ]
      }
     ],
     "prompt_number": 14
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "parts = text.split('\\n\\n')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 15
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "parts"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 16,
       "text": [
        "['\\n0.0 m',\n",
        " '15.0 m',\n",
        " '30.0 m',\n",
        " '75.0 In',\n",
        " '80.0 m',\n",
        " '85.0 m ~',\n",
        " '15.0 m',\n",
        " '30.0',\n",
        " '75.0',\n",
        " '80.0',\n",
        " '85.0',\n",
        " '90.0 m',\n",
        " 'SAMPLE AND CORE DESCRIPTIONS',\n",
        " 'Chevron Irving Bras d\\xe2\\x80\\x990r #Z\\nB! A. Berti',\n",
        " 'No samples.',\n",
        " 'Drift, subangular to subroundad small pebbles of\\nquartz, granite, metesediment, ett.',\n",
        " 'No samples .',\n",
        " 'Fractured anhydrite and dolomite in more or less\\nequal amounts, fractures 2-3 mm apart, <l mm\\nwide, healed with nnhydrite and/nr gypsum; lOZ of\\nsample is white to milky-white gypsum; anhydrite\\nl#him to light grey microcrystslline to medium\\ngrained, dolomite, light grey-brown to brown,\\nearthy to finely crystalline, trace ZZ porosity,\\nno fluorescence.',\n",
        " 'Fractured anhydrite and dolomite as above with\\nabout 102 gypsum in sample; crate smell, nan-ew\\nblack arte-ks in dolomite (Totganic or bitumen).',\n",
        " 'Anhydrite, white to light tu medium brown with\\nirregular stringera dolomite .tattered throughout\\nrather than fractured as above, dolomite, light\\nbrown, miorocryatalline, trato porosity.\\n']"
       ]
      }
     ],
     "prompt_number": 16
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "results = [unirest.get(base_url+urllib.quote_plus(e), headers=headers) for e in parts]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 23
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "hashtags = [r.body['hashtags'] for r in results]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 20
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "hashtags"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 22,
       "text": [
        "[[],\n",
        " [],\n",
        " [],\n",
        " [],\n",
        " [],\n",
        " [],\n",
        " [],\n",
        " [],\n",
        " [],\n",
        " [],\n",
        " [],\n",
        " [],\n",
        " [u'#CORE', u'#CongressOfRacialEquality'],\n",
        " [u'#ChevronCorporation'],\n",
        " [],\n",
        " [u'#Granite'],\n",
        " [],\n",
        " [u'#Dolomite',\n",
        "  u'#Anhydrite',\n",
        "  u'#Gypsum',\n",
        "  u'#Porosity',\n",
        "  u'#Fluorescence',\n",
        "  u'#Crystal'],\n",
        " [u'#Dolomite', u'#Anhydrite', u'#Gypsum', u'#Bitumen'],\n",
        " [u'#Dolomite', u'#Anhydrite', u'#Porosity']]"
       ]
      }
     ],
     "prompt_number": 22
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [],
     "language": "python",
     "metadata": {},
     "outputs": []
    }
   ],
   "metadata": {}
  }
 ]
}