{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "a57cac02",
   "metadata": {},
   "source": [
    "# Regular Expression Class based on OpenFST"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "44d783c1",
   "metadata": {},
   "source": [
    "The goal of this exercise is to write a small regular expression class\n",
    "that internally uses OpenFST to perform the matching."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "058dfa5c",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "class OpenRE:\n",
    "    def __init__(self,regex=None,cost=0.0):\n",
    "        if regex is not None:\n",
    "            self.add(regex,cost)\n",
    "            self.compile()\n",
    "        # IMPLEMENT ME\n",
    "    def add(self,regex,cost=0.0):\n",
    "        \"\"\"Add a regular expression to the overall\n",
    "        regular expression using a disjunction.\"\"\"\n",
    "        # IMPLEMENT ME\n",
    "    def compile(self):\n",
    "        \"\"\"After adding component regular expressions,\n",
    "        compile the internal fst.\"\"\"\n",
    "        # IMPLEMENT ME\n",
    "        self.fst = something\n",
    "    def cost(self,s):\n",
    "        \"\"\"Match the given string against the compiled\n",
    "        regular expression and return the cost. Returns\n",
    "        `inf` if there is no match.\"\"\"\n",
    "        # IMPLEMENT ME\n",
    "        return cost"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "df247c82",
   "metadata": {},
   "source": [
    "Your package should understand the following expressions:\n",
    "\n",
    "- \"ABC\" - simple strings\n",
    "- \"AB|CD\" - alternation\n",
    "- \"AB*C\" - regex star (zero or more repeats)\n",
    "- \"AB+C\" - regex plus (one or more repeats)\n",
    "- \"A(B|C)*D\" - parentheses and optional operators\n",
    "\n",
    "Assume that expressions are implicitly anchored at the beginning and end\n",
    "(no partial matches).\n",
    "\n",
    "It's OK if you limit yourself to ASCII strings. Use `ord` to encode characters\n",
    "to integers. Do not worry about escape characters or wildcards."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "599cd492",
   "metadata": {},
   "source": [
    "# Unit Tests"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4ef79ae4",
   "metadata": {},
   "source": [
    "Write a set of unit tests demonstrating that your code works."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0e32cd81",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "assert OpenRE(\"abc\").cost(\"abc\") == 0\n",
    "assert OpenRE(\"abC\").cost(\"abc\") == inf\n",
    "assert OpenRE(\"ab\").cost(\"abc\") == inf # no anchoring\n",
    "assert OpenRE(\"(a|b)\").cost(\"a\") == 0\n",
    "assert OpenRE(\"(a|b)\").cost(\"b\") == 0\n",
    "assert OpenRE(\"a|b\").cost(\"a\") == 0\n",
    "# etc."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cb2ff370",
   "metadata": {},
   "source": [
    "# Parsing"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "32eb7e36",
   "metadata": {},
   "source": [
    "For parsing the regular expression itself, you may want to use the `pyparsing` module.\n",
    "\n",
    "Here is a simple example of how you might go about this. Note that this is _not_ a correct\n",
    "regular expression parser yet and that you may want to generate a different kind of structure.\n",
    "\n",
    "Read the documentation to figure out how to deal with whitespace and more characters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3459ffe3",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from pyparsing import *\n",
    "postfix = Literal('+') | Literal('*')\n",
    "alt = Literal( '|' )\n",
    "lpar  = Literal( '(' ).suppress()\n",
    "rpar  = Literal( ')' ).suppress()\n",
    "lit = Regex('[^()|+*]+')\n",
    "expr = Forward()\n",
    "term = lit | alt + expr | Group( lpar + expr + rpar + Optional(postfix) )\n",
    "expr << ZeroOrMore( term  )\n",
    "expr.parseString(\"hello, (world|there)+|(a(b)c)\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "730deba5",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {},
 "nbformat": 4,
 "nbformat_minor": 5
}