{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "# Tutorial: Generating Candidates from Richly Formatted Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Running locally?\n", "\n", "If you're running this tutorial interactively on your own machine, you'll need to create a new PostgreSQL database named `intro_candidates`.\n", "\n", "If you already have the database `intro_candidates` in your postgresql, please uncomment the first line to drop it. Otherwise, download our database snapshots by executing `./download_data.sh` in the intro tutorial directory." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#! dropdb --if-exists intro_candidates\n", "! createdb intro_candidates\n", "! psql intro_candidates < data/intro_candidates.sql > /dev/null" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Generating Candidates from Richly Formatted Data\n", "\n", "A `Candidate` object represents a potential instance of a fact that you would like to extract from your data. For example, if we were trying to extract a `(Part Number, Storage Temperature)` tuple from transistor datasheets, a `Candidate` may be a mention of a `Part Number` found in a header, and a mention of numerical value found in a Table cell. In this tutorial, we will show you first how you _define_ a relation such as the part number and storage temperature example above. And how you then provide _matchers_ and _throttlers_ to generate candidates from the Fonduer Data Model. \n", "\n", "To kick things off, we first connect to the `intro_candidates` database that we imported using the `Meta` class of Fonduer." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2\n", "%matplotlib inline\n", "import os\n", "import sys\n", "import logging\n", "from pprint import pprint\n", "\n", "from fonduer import Meta, init_logging\n", "\n", "PARALLEL = 4 # assuming a quad-core machine\n", "ATTRIBUTE = \"intro_candidates\"\n", "conn_string = f'postgresql://localhost:5432/{ATTRIBUTE}'\n", "\n", "# Configure logging for Fonduer\n", "init_logging(log_dir=\"logs\")\n", "\n", "session = Meta.init(conn_string).Session()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Defining Relations\n", "\n", "All Fonduer applications require the user to provide a custom `Candidate` class definition.\n", "\n", "In this tutorial, we define a `Part_Attr` relation to represent a relationship between a a transistor part number and some electrical attribute of that particular transistor.\n", "\n", "\n", "\n", "This `Candidate` is made of two `Mention`s, a `Part` and an `Attr`. Each `Mention` is made up of a`Span` of text (i.e., sequences of words or characters) that represent the mention of a part number and the mention of a maximum storage temperature. We start by defining the `Mentions` which will make up the `Candidate`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from fonduer.candidates.models import mention_subclass\n", "\n", "Part = mention_subclass(\"Part\")\n", "Attr = mention_subclass(\"Attr\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In many traditional knowledge base construction systems, these two spans may come from the same `Sentence`. 
For example, if we were looking for a `Spouse` relation, we might have a candidate like this one from [Snorkel](https://github.com/HazyResearch/snorkel/):\n", "\n", "\n", "\n", "However, in Fonduer, we target _richly formatted data_. When working with richly formatted data, relations rarely come from the same `Sentence`. Instead, they can come from distant parts of a document, like a `Span` of text in a page header paired with a numerical value found in a table dozens of pages later.\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generating Mentions\n", "\n", "Now that we've defined the `Mention` classes we want to extract, the next step is to generate those mentions. To do so, Fonduer allows users to provide _matchers_. These are valuable because, with richly formatted data, a naive cross-product of all `Spans` of text in a document would result in an intractable combinatorial explosion of candidates. Matchers serve to limit the number of `Mentions` generated, and they operate on individual `Spans`.\n", "\n", "### Matchers\n", "One convenient way to think about matchers is as a way to define what each component of your relation looks like. In our example, we provide one matcher to define what a `part` looks like, and another to define what a valid `attr` looks like (in our case, this means defining what a valid maximum storage temperature looks like).\n", "\n", "Fonduer provides some pre-built matchers to help make this easier, as documented on [Read the Docs](https://fonduer.readthedocs.io/en/stable/user/candidates.html#matchers). For example, Fonduer provides ways to leverage dictionaries, regular expressions, NLP tags, or arbitrary functions.\n", "\n", "Importantly, matchers should be **as specific as possible** while still maintaining high recall.\n", "\n", "#### Writing a simple temperature matcher\n", "For the `attr` mention, we are looking for maximum storage temperatures. By inspecting our data (or relying on some domain experience), we have concluded that maximum storage temperatures are expressed as integers in the range 150 to 205, and only appear as multiples of 5. We can easily express this pattern as a regular expression." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from fonduer.candidates.matchers import RegexMatchSpan\n", "\n", "attr_matcher = RegexMatchSpan(rgx=r'(?:[1][5-9]|20)[05]')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Writing an advanced transistor part matcher\n", "In contrast, transistor part numbers are complex expressions. For this tutorial, suppose they are complex enough that we want to tackle the definition from a few different angles. In this case, we leverage:\n", "1. Common [naming conventions](https://en.wikipedia.org/wiki/Transistor#Part_numbering_standards.2Fspecifications) defined by manufacturers, expressed as regular expressions\n", "2. A dictionary of part numbers\n", "3. A user-defined function\n", "\n", "To start, let's construct a regular expression matcher that captures the naming conventions linked above."
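, "\n", "\n", "Before that, here's a quick sanity check of the `attr_matcher` pattern defined earlier, run against a few hand-picked values (a minimal illustration; these test strings are our own, not from the dataset):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import re\n", "\n", "# Illustrative spot-check: the pattern should accept only multiples of 5 from 150 to 205\n", "for value in ['150', '175', '205', '149', '210', '152']:\n", "    print(value, bool(re.fullmatch(r'(?:[1][5-9]|20)[05]', value)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, the transistor naming conventions as regular expressions:"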
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "### Transistor Naming Conventions as Regular Expressions ###\n", "eeca_rgx = r'([ABC][A-Z][WXYZ]?[0-9]{3,5}(?:[A-Z]){0,5}[0-9]?[A-Z]?(?:-[A-Z0-9]{1,7})?(?:[-][A-Z0-9]{1,2})?(?:\\/DG)?)'\n", "jedec_rgx = r'(2N\\d{3,4}[A-Z]{0,5}[0-9]?[A-Z]?)'\n", "jis_rgx = r'(2S[ABCDEFGHJKMQRSTVZ]{1}[\\d]{2,4})'\n", "others_rgx = r'((?:NSVBC|SMBT|MJ|MJE|MPS|MRF|RCA|TIP|ZTX|ZT|ZXT|TIS|TIPL|DTC|MMBT|SMMBT|PZT|FZT|STD|BUV|PBSS|KSC|CXT|FCX|CMPT){1}[\\d]{2,4}[A-Z]{0,5}(?:-[A-Z0-9]{0,6})?(?:[-][A-Z0-9]{0,1})?)'\n", "\n", "part_rgx = '|'.join([eeca_rgx, jedec_rgx, jis_rgx, others_rgx])\n", "part_rgx_matcher = RegexMatchSpan(rgx=part_rgx, longest_match_only=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, suppose that we have been provided a dictionary of known part numbers (e.g. from crowdsourcing or a collaborator data source). We can use a `DictionaryMatch` which will only match `Spans` of text that appear in the dictionary provided. In our case, we have a transistor part number dictionary from Digikey.com." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import csv\n", "from fonduer.candidates.matchers import DictionaryMatch\n", "\n", "def get_digikey_parts_set(path):\n", " \"\"\"\n", " Reads in the digikey part dictionary and yeilds each part.\n", " \"\"\"\n", " all_parts = set()\n", " with open(path, \"r\") as csvinput:\n", " reader = csv.reader(csvinput)\n", " for line in reader:\n", " (part, url) = line\n", " all_parts.add(part)\n", " return all_parts\n", "\n", "### Dictionary of known transistor parts ###\n", "dict_path = 'data/digikey_part_dictionary.csv'\n", "part_dict_matcher = DictionaryMatch(d=get_digikey_parts_set(dict_path))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Futhermore, we can provide user-defined functions as matchers as well! As an example, here we use patterns we notice in document filenames as an indication of whether a `Span` of text is a valid transistor part number. This is particularly useful in a dataset where files are named after their parts (e.g. `bc546.pdf`).\n", "\n", "Note that in the code below, we also demonstrate how to use the `Intersect` class to perform an intersection of two different matchers." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from builtins import range\n", "from fonduer.candidates.matchers import LambdaFunctionMatcher, Intersect\n", "\n", "def common_prefix_length_diff(str1, str2):\n", " for i in range(min(len(str1), len(str2))):\n", " if str1[i] != str2[i]:\n", " return min(len(str1), len(str2)) - i\n", " return 0\n", "\n", "def part_file_name_conditions(attr):\n", " file_name = attr.sentence.document.name\n", " if len(file_name.split('_')) != 2: return False\n", " if attr.get_span()[0] == '-': return False\n", " name = attr.get_span().replace('-', '')\n", " return any(char.isdigit() for char in name) and any(char.isalpha() for char in name) and common_prefix_length_diff(file_name.split('_')[1], name) <= 2\n", "\n", "add_rgx = '^[A-Z0-9\\-]{5,15}$'\n", "\n", "part_file_name_lambda_matcher = LambdaFunctionMatcher(func=part_file_name_conditions)\n", "part_file_name_matcher = Intersect(RegexMatchSpan(rgx=add_rgx, longest_match_only=True), part_file_name_lambda_matcher)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "At this point, we've created three separate `Matchers` which we want to combine into a single `Matcher` that can be used to define the `part` component of our `Part_Attr` class. We can do so using the `Union` class." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from fonduer.candidates.matchers import Union\n", "\n", "part_matcher = Union(part_rgx_matcher, part_dict_matcher, part_file_name_matcher)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Thus, the `attr_matcher` and `part_matcher` define each component of our relation schema.\n", "\n", "## Define Mentions's MentionSpaces\n", "Next, in order to define the \"space\" of all candidates that are even considered from the document, we need to define a `MentionSpace` for each component of the relation we wish to extract.\n", "\n", "In the case of transistor part numbers, the `MentionSpace` can be quite complex due to the need to handle implicit part numbers that are implied in text like \"BC546A/B/C...BC548A/B/C\", which refers to 9 unique part numbers. To handle these, we consider all n-grams up to 3 words long.\n", "\n", "In contrast, the `MentionSpace` for temperature values is simpler: we only need to process different unicode representations of a (-), and don't need to look at more than two words at a time.\n", "\n", "When no special preproessing like this is needed, we could have used the default OmniNgrams class provided by fonduer. For example, if we were looking to match polarities, which only take the form of \"NPN\" or \"PNP\", we could've used attr_ngrams = MentionNgrams(n_max=1)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from fonduer.candidates import MentionNgrams\n", "\n", "part_ngrams = MentionNgrams(n_max=3)\n", "attr_ngrams = MentionNgrams(n_max=2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we're ready to generate our Mentions using the Matchers and MentionSpaces defined above." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from fonduer.parser.models import Document, Sentence\n", "from fonduer.candidates.models import Mention\n", "from fonduer.candidates import MentionExtractor\n", "\n", "docs = session.query(Document).all()\n", "\n", "mention_extractor = MentionExtractor(\n", " session,\n", " [Part, Attr],\n", " [part_ngrams, attr_ngrams],\n", " [part_matcher, attr_matcher],\n", " parallelism=PARALLEL\n", ")\n", "mention_extractor.apply(docs)\n", "print(f\"Num Mentions: {session.query(Mention).count()}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Candidate Extraction\n", "Next, we can extract our `Candidates`. First, we need to define our `Candidate` as a tuple of `Mentions`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from fonduer.candidates.models import candidate_subclass\n", "\n", "PartAttr = candidate_subclass(\"PartAttr\", [Part, Attr])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Throttlers\n", "`Throttlers` allow us to further prune excess candidates and avoid annecessarily materializing invalid candidates. But, unlike `Matchers`, which operate on `Mentions`, `Throttlers` operate on `Candidates`. Like `Matchers`, `Throttlers` act as hard filters and should be as specific as possible while maintining complete recall, if possible.\n", "\n", "Because `Throttlers` operate on `Candidates`, users can leverage the `data_model_utils` functions provided by Fonduer to write throttling functions using information from multiple modalites of the data. Check the full `data_model_utils` API on [Read the Docs](http://fonduer.readthedocs.io/en/stable/user/data_model_utils.html).\n", "\n", "To make this concrete, here we create a `Throttler` that discards candidates if they are in the same `Table`, but the `part` and `attr` are not vertically or horizontally aligned." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import re\n", "from fonduer.utils.data_model_utils import *\n", "\n", "def stg_temp_filter(c):\n", " (part, attr) = c\n", " if same_table((part, attr)):\n", " return (is_horz_aligned((part, attr)) or is_vert_aligned((part, attr)))\n", " return True\n", "\n", "temp_throttler = stg_temp_filter" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Running the `CandidateExtractor`\n", "\n", "Now, we have all the component necessary to perform candidate extraction. We have defined the \"space\" of things to consider for each candidate, provided matchers that signal when a valid mention is seen, and a throttler to prunes away excess candidates. We now can define the `CandidateExtractor` with the contexts to extract from, the matchers, and the throttler to use. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from fonduer.candidates import CandidateExtractor\n", "\n", "\n", "candidate_extractor = CandidateExtractor(session, [PartAttr], throttlers=[temp_throttler], parallelism=PARALLEL)\n", "\n", "%time candidate_extractor.apply(docs, split=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Inspecting Candidates\n", "\n", "Once you have run the `CandidateExtractor`, just like other elements of the Fonduer Data Model, you can query and inspect the `Candidates`." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_cands = session.query(PartAttr).all()\n", "print(f\"Number of candidates: {len(train_cands)}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cand = train_cands[0]\n", "print(cand)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that our candidate is made up of two `Mentions`, one representing the `part`, and one the `attr` (maximum storage temperature in this case). We can look at each of those individually by simply calling their respective attributes" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(cand.part)\n", "print(cand.attr)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can get the raw text of each candidate:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(cand.part.context.get_span())\n", "print(cand.attr.context.get_span())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Also notice that each `Span` contains the sentence it was found in, which provides you access to the full data model of the document in which it was found. You can explore the data model as was shown in the first tutorial. For example, we can access the Sentence:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(cand.part.context.sentence)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The document:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(cand.part.context.sentence.document)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or even the the document's `Tables`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(cand.part.context.sentence.document.tables)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The full data model can be accessed from each candidate, which allows you to write powerful `Matchers` and `Throttlers` that leverage the structure and multimodality of your input documents." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }