{ "metadata": { "name": "", "signature": "sha256:50d58774222efed73d8bb4fe800d1a37f2b749a21eb1cf193636035806453a70" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Lithology classification from images" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Can we extract text from images using [Tesseract](https://code.google.com/p/tesseract-ocr/), then process that text for lithological data? " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll use a Python wrapper for Tesseract.\n", "\n", "[python-tesseract](https://code.google.com/p/python-tesseract/) looks promising, but I couldn't get it to work." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I also tried a lightweight wrapper, [pytesseract](https://pypi.python.org/pypi/pytesseract), but it failed with an image error. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's try using [PyOCR](https://github.com/jflesch/pyocr). 
" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from PIL import Image" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "code", "collapsed": false, "input": [ "import pyocr\n", "import pyocr.builders" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 2 }, { "cell_type": "code", "collapsed": false, "input": [ "tools = pyocr.get_available_tools()" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 3 }, { "cell_type": "code", "collapsed": false, "input": [ "tool = tools[0]" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 4 }, { "cell_type": "code", "collapsed": false, "input": [ "tool.get_available_languages()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 5, "text": [ "['eng']" ] } ], "prompt_number": 5 }, { "cell_type": "code", "collapsed": false, "input": [ "text = tool.image_to_string(Image.open('Samples.png'), builder=pyocr.builders.TextBuilder())" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 6 }, { "cell_type": "code", "collapsed": false, "input": [ "print text" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "0.0 m\n", "\n", "15.0 m\n", "\n", "30.0 m\n", "\n", "75.0 In\n", "\n", "80.0 m\n", "\n", "85.0 m ~\n", "\n", "15.0 m\n", "\n", "30.0\n", "\n", "75.0\n", "\n", "80.0\n", "\n", "85.0\n", "\n", "90.0 m\n", "\n", "SAMPLE AND CORE DESCRIPTIONS\n", "\n", "Chevron Irving Bras d\u20190r #Z\n", "B! A. Berti\n", "\n", "No samples.\n", "\n", "Drift, subangular to subroundad small pebbles of\n", "quartz, granite, metesediment, ett.\n", "\n", "No samples .\n", "\n", "Fractured anhydrite and dolomite in more or less\n", "equal amounts, fractures 2-3 mm apart,