{ "cells": [ { "cell_type": "markdown", "id": "a57cac02", "metadata": {}, "source": [ "# Regular Expression Class based on OpenFST" ] }, { "cell_type": "markdown", "id": "44d783c1", "metadata": {}, "source": [ "The goal of this exercise is to write a small regular expression class\n", "that internally uses OpenFST to perform the matching." ] }, { "cell_type": "code", "execution_count": null, "id": "058dfa5c", "metadata": { "collapsed": false }, "outputs": [], "source": [ "class OpenRE:\n", " def __init__(self,regex=None,cost=0.0):\n", " if regex is not None:\n", " self.add(regex,cost)\n", " self.compile()\n", " # IMPLEMENT ME\n", " def add(self,regex,cost=0.0):\n", " \"\"\"Add a regular expression to the overall\n", " regular expression using a disjunction.\"\"\"\n", " # IMPLEMENT ME\n", " def compile(self):\n", " \"\"\"After adding component regular expressions,\n", " compile the internal fst.\"\"\"\n", " # IMPLEMENT ME\n", " self.fst = something\n", " def cost(self,s):\n", " \"\"\"Match the given string against the compiled\n", " regular expression and return the cost. Returns\n", " `inf` if there is no match.\"\"\"\n", " # IMPLEMENT ME\n", " return cost" ] }, { "cell_type": "markdown", "id": "df247c82", "metadata": {}, "source": [ "Your package should understand the following expressions:\n", "\n", "- \"ABC\" - simple strings\n", "- \"AB|CD\" - alternation\n", "- \"AB*C\" - regex star (zero or more repeats)\n", "- \"AB+C\" - regex plus (one or more repeats)\n", "- \"A(B|C)*D\" - parentheses and optional operators\n", "\n", "Assume that expressions are implicitly anchored at the beginning and end\n", "(no partial matches).\n", "\n", "It's OK if you limit yourself to ASCII strings. Use `ord` to encode characters\n", "to integers. Do not worry about escape characters or wildcards." ] }, { "cell_type": "markdown", "id": "599cd492", "metadata": {}, "source": [ "# Unit Tests" ] }, { "cell_type": "markdown", "id": "4ef79ae4", "metadata": {}, "source": [ "Write a set of unit tests demonstrating that your code works." ] }, { "cell_type": "code", "execution_count": null, "id": "0e32cd81", "metadata": { "collapsed": false }, "outputs": [], "source": [ "assert OpenRE(\"abc\").cost(\"abc\") == 0\n", "assert OpenRE(\"abC\").cost(\"abc\") == inf\n", "assert OpenRE(\"ab\").cost(\"abc\") == inf # no anchoring\n", "assert OpenRE(\"(a|b)\").cost(\"a\") == 0\n", "assert OpenRE(\"(a|b)\").cost(\"b\") == 0\n", "assert OpenRE(\"a|b\").cost(\"a\") == 0\n", "# etc." ] }, { "cell_type": "markdown", "id": "cb2ff370", "metadata": {}, "source": [ "# Parsing" ] }, { "cell_type": "markdown", "id": "32eb7e36", "metadata": {}, "source": [ "For parsing the regular expression itself, you may want to use the `pyparsing` module.\n", "\n", "Here is a simple example of how you might go about this. Note that this is _not_ a correct\n", "regular expression parser yet and that you may want to generate a different kind of structure.\n", "\n", "Read the documentation to figure out how to deal with whitespace and more characters." ] }, { "cell_type": "code", "execution_count": null, "id": "3459ffe3", "metadata": { "collapsed": false }, "outputs": [], "source": [ "from pyparsing import *\n", "postfix = Literal('+') | Literal('*')\n", "alt = Literal( '|' )\n", "lpar = Literal( '(' ).suppress()\n", "rpar = Literal( ')' ).suppress()\n", "lit = Regex('[^()|+*]+')\n", "expr = Forward()\n", "term = lit | alt + expr | Group( lpar + expr + rpar + Optional(postfix) )\n", "expr << ZeroOrMore( term )\n", "expr.parseString(\"hello, (world|there)+|(a(b)c)\")" ] }, { "cell_type": "code", "execution_count": null, "id": "730deba5", "metadata": { "collapsed": false }, "outputs": [], "source": [] } ], "metadata": {}, "nbformat": 4, "nbformat_minor": 5 }