{
"metadata": {
"name": "",
"signature": "sha256:50d58774222efed73d8bb4fe800d1a37f2b749a21eb1cf193636035806453a70"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Lithology classification from images"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Can we extract text from images using [Tesseract](https://code.google.com/p/tesseract-ocr/), then process that text for lithological data? "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We'll use a Python wrapper for Tesseract.\n",
"\n",
"[python-tesseract](https://code.google.com/p/python-tesseract/) looks promising, but I couldn't get it to work."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I also tried a lighter-weight wrapper, [pytesseract](https://pypi.python.org/pypi/pytesseract), but it failed with an image error."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's try using [PyOCR](https://github.com/jflesch/pyocr). "
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from PIL import Image"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 1
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import pyocr\n",
"import pyocr.builders"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 2
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# List the OCR engines that PyOCR can find on this system (e.g. Tesseract).\n",
"tools = pyocr.get_available_tools()"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 3
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Use the first engine found; this raises IndexError if none is installed.\n",
"tool = tools[0]"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 4
},
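{
"cell_type": "markdown",
"metadata": {},
"source": [
"If no OCR engine is installed, `pyocr.get_available_tools()` returns an empty list and `tools[0]` fails with a bare `IndexError`. A minimal sketch of a friendlier check follows; the helper name `pick_first_tool` is hypothetical and not part of PyOCR."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def pick_first_tool(tools):\n",
"    \"\"\"Return the first OCR tool in the list, or raise a clear error if the list is empty.\"\"\"\n",
"    if not tools:\n",
"        raise RuntimeError('No OCR tool found; is Tesseract installed and on the PATH?')\n",
"    return tools[0]\n",
"\n",
"# Hypothetical usage with the list obtained above:\n",
"# tool = pick_first_tool(tools)"
],
"language": "python",
"metadata": {},
"outputs": []
},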
{
"cell_type": "code",
"collapsed": false,
"input": [
"tool.get_available_languages()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 5,
"text": [
"['eng']"
]
}
],
"prompt_number": 5
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Run OCR on the scanned page; TextBuilder returns the result as one plain string.\n",
"text = tool.image_to_string(Image.open('Samples.png'), builder=pyocr.builders.TextBuilder())"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 6
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print(text)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"0.0 m\n",
"\n",
"15.0 m\n",
"\n",
"30.0 m\n",
"\n",
"75.0 In\n",
"\n",
"80.0 m\n",
"\n",
"85.0 m ~\n",
"\n",
"15.0 m\n",
"\n",
"30.0\n",
"\n",
"75.0\n",
"\n",
"80.0\n",
"\n",
"85.0\n",
"\n",
"90.0 m\n",
"\n",
"SAMPLE AND CORE DESCRIPTIONS\n",
"\n",
"Chevron Irving Bras d\u20190r #Z\n",
"B! A. Berti\n",
"\n",
"No samples.\n",
"\n",
"Drift, subangular to subroundad small pebbles of\n",
"quartz, granite, metesediment, ett.\n",
"\n",
"No samples .\n",
"\n",
"Fractured anhydrite and dolomite in more or less\n",
"equal amounts, fractures 2-3 mm apart,