{
 "metadata": {
  "name": "syllabification_evaluation"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%pylab inline\n",
      "from __future__ import print_function\n",
      "import segeval as se"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "Welcome to pylab, a matplotlib-based Python environment [backend: module://IPython.zmq.pylab.backend_inline].\n",
        "For more information, type 'help(pylab)'.\n"
       ]
      }
     ],
     "prompt_number": 4
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Comparing word syllabifications\n",
      "===============================\n",
      "\n",
      "Syllabification is the task of placing syllable boundaries within a word, e.g., the word *syllable* could be segmented into three syllables using two boundaries as *syl\u2027la\u2027ble*.\n",
      "\n",
      "To evaluate whether an automatic syllabifier is as good as a human syllabifier, we must evaluate how similar two segmentations are.  For this, we can use **Boundary Similarity (B)** and **Boundary Edit Distance (BED)**.  BED is an edit distance for boundaries placed within a segmentation that can quantify the number of near misses and full misses between two solutions, and B is a normalization of BED.  These metrics are described in:\n",
      "\n",
      "```Chris Fournier. 2013. Evaluating Text Segmentation using Boundary Edit Distance. Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Stroudsburg, PA, USA.\n",
      "```\n",
      "\n",
      "An implementation of these metrics is provided by the [SegEval package](http://segeval.readthedocs.org/en/latest/)."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Data\n",
      "----\n",
      "\n",
      "Let's assume that we already ran an automatic segmenter upon a few words, and we can compare them to what we can assume is a manual solution provided by [a version of the  CMU Pronouncing Dictionary that has been augmented with syllable boundaries](http://webdocs.cs.ualberta.ca/~kondrak/cmudict.html)."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "SYLLABLE_BOUNDARY = u'\u00b7'\n",
      "\n",
      "word_order = ['automatic', 'segmentation', 'is', 'fun']\n",
      "\n",
      "manual_solution = {\n",
      "    'automatic'    : u'au\u00b7to\u00b7ma\u00b7tic',\n",
      "    'segmentation' : u'seg\u00b7men\u00b7ta\u00b7tion',\n",
      "    'is'           : u'is',\n",
      "    'fun'          : u'fun'\n",
      "}\n",
      "\n",
      "automatic_solution = {\n",
      "    'automatic'    : u'au\u00b7tom\u00b7a\u00b7tic',\n",
      "    'segmentation' : u'seg\u00b7ment\u00b7ation',\n",
      "    'is'           : u'is',\n",
      "    'fun'          : u'f\u00b7un'\n",
      "}"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 26
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We need a way to convert these syllabifications into something that the segeval package can understand.  Specifically, we want to create a tuple containing the size (in letters) of each syllable."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def convert_to_masses(syllabification):\n",
      "    '''\n",
      "    Take a syllabification, e.g., au\u00b7to\u00b7ma\u00b7tic, and produce a\n",
      "    tuple of segment masses, e.g., (2,2,2,3).\n",
      "    '''\n",
      "    syllables = syllabification.split(SYLLABLE_BOUNDARY)\n",
      "    return tuple([len(syllable) for syllable in syllables])\n",
      "\n",
      "print(u'The \\'{0}\\' syllabification results in this tuple of masses: {1}'.format(automatic_solution['automatic'], convert_to_masses(automatic_solution['automatic'])))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "The 'au\u00b7tom\u00b7a\u00b7tic' syllabification results in this tuple of masses: (2, 3, 1, 3)\n"
       ]
      }
     ],
     "prompt_number": 27
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "BED is a bit difficult to interpret, so we want a way to read a high-level summary of its results."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def bed_to_str(bed):\n",
      "    '''\n",
      "    Take a tuple containing three types of boundary edits and create a string summary.\n",
      "    '''\n",
      "    string = list()\n",
      "    additions      = len(bed[0])\n",
      "    substitutions  = len(bed[1])\n",
      "    transpositions = len(bed[2])\n",
      "    if additions > 0:\n",
      "        string.append('{0} miss(es)'.format(additions))\n",
      "    if substitutions > 0:\n",
      "        string.append('{0} sub(s)'.format(substitutions))\n",
      "    if transpositions > 0:\n",
      "        string.append('{0} near'.format(transpositions))\n",
      "    return ', '.join(string)\n",
      "\n",
      "word = 'automatic'\n",
      "manual = se.boundary_string_from_masses(convert_to_masses(manual_solution[word]))\n",
      "auto = se.boundary_string_from_masses(convert_to_masses(automatic_solution[word]))\n",
      "bed = se.boundary_edit_distance(manual, auto)\n",
      "print('BED of two segmentations of \\'{0}\\' can be interpreted as \\'{1}\\'.'.format(word, bed_to_str(bed)))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "BED of two segmentations of 'automatic' can be interpreted as '1 near'.\n"
       ]
      }
     ],
     "prompt_number": 28
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now, let's compare the automatic and manual syllabifiers against eachother using BED and B."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "title = u'{0: <13}  {1: <15}  {2: <14}  {3: <18}  {4:}'.format('Word', 'Manual Solution', 'Automatic Solution', 'BED', 'B')\n",
      "print(title)\n",
      "print(''.join(['-'] * (len(title) + 3)))\n",
      "\n",
      "all_b = list()\n",
      "\n",
      "for word in word_order:\n",
      "    manual_segmentation = convert_to_masses(manual_solution[word])\n",
      "    manual_segmentation = se.boundary_string_from_masses(manual_segmentation)\n",
      "    automatic_segmentation = convert_to_masses(automatic_solution[word])\n",
      "    automatic_segmentation = se.boundary_string_from_masses(automatic_segmentation)\n",
      "    \n",
      "    b = se.boundary_similarity(manual_segmentation, automatic_segmentation, boundary_format=se.BoundaryFormat.sets)\n",
      "    bed = se.boundary_edit_distance(manual_segmentation, automatic_segmentation)\n",
      "    \n",
      "    all_b.append(b)\n",
      "    \n",
      "    values = {\n",
      "        'b'      : b,\n",
      "        'bed'    : bed_to_str(bed),\n",
      "        'word'   : word,\n",
      "        'manual' : manual_solution[word],\n",
      "        'auto'   : automatic_solution[word]\n",
      "    }\n",
      "    \n",
      "    print(u'{word: <13}  {manual: <15}  {auto: <18}  {bed: <18}  {b:.2f}'.format(**values))\n",
      "\n",
      "mean_b = sum(all_b) / len(all_b)\n",
      "std_b = sqrt(sum([pow(b - mean_b, 2) for b in all_b]) / len(all_b))\n",
      "std_err_b = float(std_b) / sqrt(len(all_b))\n",
      "\n",
      "print('\\nOverall mean B = {0:.4f} +/- {1:.4f}, n={2}'.format(mean_b, std_err_b, len(all_b)))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Word           Manual Solution  Automatic Solution  BED                 B\n",
        "----------------------------------------------------------------------------\n",
        "automatic      au\u00b7to\u00b7ma\u00b7tic     au\u00b7tom\u00b7a\u00b7tic        1 near              0.83\n",
        "segmentation   seg\u00b7men\u00b7ta\u00b7tion  seg\u00b7ment\u00b7ation      1 miss(es), 1 near  0.50\n",
        "is             is               is                                      1.00\n",
        "fun            fun              f\u00b7un                1 miss(es)          0.00\n",
        "\n",
        "Overall mean B = 0.5833 +/- 0.1909, n=4\n"
       ]
      }
     ],
     "prompt_number": 29
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "In conclusion, the overall mean similarity (B) of the syllabifications produced by the hypothetical automatic segmenter appears to be not so great, but unfortunately, with so few syllabifications to test, we cannot get a very accurate measurement of the similarity."
     ]
    }
   ],
   "metadata": {}
  }
 ]
}