{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Parts-of-Speech Tagging - Working with tags and Numpy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this lecture notebook you will create a matrix using some tag information and then modify it using different approaches.\n", "This will serve as hands-on experience working with Numpy and as an introduction to some elements used for POS tagging." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Some information on tags" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For this notebook you will be using a toy example including only three tags (or states). In a real world application there are many more tags which can be found [here](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Define tags for Adverb, Noun and To (the preposition) , respectively\n", "tags = ['RB', 'NN', 'TO']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this week's assignment you will construct some dictionaries that provide useful information of the tags and words you will be working with. \n", "\n", "One of these dictionaries is the `transition_counts` which counts the number of times a particular tag happened next to another. The keys of this dictionary have the form `(previous_tag, tag)` and the values are the frequency of occurrences.\n", "\n", "Another one is the `emission_counts` dictionary which will count the number of times a particular pair of `(tag, word)` appeared in the training dataset.\n", "\n", "In general think of `transition` when working with tags only and of `emission` when working with tags and words.\n", "\n", "In this notebook you will be looking at the first one:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Define 'transition_counts' dictionary\n", "# Note: values are the same as the ones in the assignment\n", "transition_counts = {\n", " ('NN', 'NN'): 16241,\n", " ('RB', 'RB'): 2263,\n", " ('TO', 'TO'): 2,\n", " ('NN', 'TO'): 5256,\n", " ('RB', 'TO'): 855,\n", " ('TO', 'NN'): 734,\n", " ('NN', 'RB'): 2431,\n", " ('RB', 'NN'): 358,\n", " ('TO', 'RB'): 200\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that there are 9 combinations of the 3 tags used. Each tag can appear after the same tag so you should include those as well.\n", "\n", "### Using Numpy for matrix creation\n", "\n", "Now you will create a matrix that includes these frequencies using Numpy arrays:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Store the number of tags in the 'num_tags' variable\n", "num_tags = len(tags)\n", "\n", "# Initialize a 3X3 numpy array with zeros\n", "transition_matrix = np.zeros((num_tags, num_tags))\n", "\n", "# Print matrix\n", "transition_matrix" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Visually you can see the matrix has the correct dimensions. Don't forget you can check this too using the `shape` attribute:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Print shape of the matrix\n", "transition_matrix.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before filling this matrix with the values of the `transition_counts` dictionary you should sort the tags so that their placement in the matrix is consistent:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create sorted version of the tag's list\n", "sorted_tags = sorted(tags)\n", "\n", "# Print sorted list\n", "sorted_tags" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To fill this matrix with the correct values you can use a `double for loop`. You could also use `itertools.product` to one line this double loop:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Loop rows\n", "for i in range(num_tags):\n", " # Loop columns\n", " for j in range(num_tags):\n", " # Define tag pair\n", " tag_tuple = (sorted_tags[i], sorted_tags[j])\n", " # Get frequency from transition_counts dict and assign to (i, j) position in the matrix\n", " transition_matrix[i, j] = transition_counts.get(tag_tuple)\n", "\n", "# Print matrix\n", "transition_matrix" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looks like this worked fine. However the matrix can be hard to read as `Numpy` is more about efficiency, rather than presenting values in a pretty format. \n", "\n", "For this you can use a `Pandas DataFrame`. In particular, a function that takes the matrix as input and prints out a pretty version of it will be very useful:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Define 'print_matrix' function\n", "def print_matrix(matrix):\n", " print(pd.DataFrame(matrix, index=sorted_tags, columns=sorted_tags))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that the tags are not a parameter of the function. This is because the `sorted_tags` list will not change in the rest of the notebook so it is safe to use the variable previously declared. To test this function simply run: " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Print the 'transition_matrix' by calling the 'print_matrix' function\n", "print_matrix(transition_matrix)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That is a lot better, isn't it? \n", "\n", "As you may have already deducted this matrix is not symmetrical." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Working with Numpy for matrix manipulation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that you got the matrix set up it is time to see how a matrix can be manipulated after being created. \n", "\n", "`Numpy` allows vectorized operations which means that operations that would normally include looping over the matrix can be done in a simpler manner. This is consistent with treating numpy arrays as matrices since you get support for common matrix operations. You can do matrix multiplication, scalar multiplication, vector addition and many more!\n", "\n", "For instance try scaling each value in the matrix by a factor of $\\frac{1}{10}$. Normally you would loop over each value in the matrix, updating them accordingly. But in Numpy this is as easy as dividing the whole matrix by 10:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Scale transition matrix\n", "transition_matrix = transition_matrix/10\n", "\n", "# Print scaled matrix\n", "print_matrix(transition_matrix)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another trickier example is to normalize each row so that each value is equal to $\\frac{value}{sum \\,of \\,row}$.\n", "\n", "This can be easily done with vectorization. First you will compute the sum of each row:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Compute sum of row for each row\n", "rows_sum = transition_matrix.sum(axis=1, keepdims=True)\n", "\n", "# Print sum of rows\n", "rows_sum" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that the `sum()` method was used. This method does exactly what its name implies. Since the sum of the rows was desired the axis was set to `1`. In Numpy `axis=1` refers to the columns so the sum is done by summing each column of a particular row, for each row. \n", "\n", "Also the `keepdims` parameter was set to `True` so the resulting array had shape `(3, 1)` rather than `(3,)`. This was done so that the axes were consistent with the desired operation. \n", "\n", "When working with Numpy, always remember to check the shape of the arrays you are working with, many unexpected errors happen because of axes not being consistent. The `shape` attribute is your friend for these cases." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Normalize transition matrix\n", "transition_matrix = transition_matrix / rows_sum\n", "\n", "# Print normalized matrix\n", "print_matrix(transition_matrix)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that the normalization that was carried out forces the sum of each row to be equal to `1`. You can easily check this by running the `sum` method on the resulting matrix:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "transition_matrix.sum(axis=1, keepdims=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For a final example you are asked to modify each value of the diagonal of the matrix so that they are equal to the `log` of the sum of the current row plus the current value. When doing mathematical operations like this one don't forget to import the `math` module. \n", "\n", "This can be done using a standard `for loop` or `vectorization`. You'll see both in action:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import math\n", "\n", "# Copy transition matrix for for-loop example\n", "t_matrix_for = np.copy(transition_matrix)\n", "\n", "# Copy transition matrix for numpy functions example\n", "t_matrix_np = np.copy(transition_matrix)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Using a for-loop" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Loop values in the diagonal\n", "for i in range(num_tags):\n", " t_matrix_for[i, i] = t_matrix_for[i, i] + math.log(rows_sum[i])\n", "\n", "# Print matrix\n", "print_matrix(t_matrix_for)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Using vectorization" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Save diagonal in a numpy array\n", "d = np.diag(t_matrix_np)\n", "\n", "# Print shape of diagonal\n", "d.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can save the diagonal in a numpy array using Numpy's `diag()` function. Notice that this array has shape `(3,)` so it is inconsistent with the dimensions of the `rows_sum` array which are `(3, 1)`. You'll have to reshape before moving forward. For this you can use Numpy's `reshape()` function, specifying the desired shape in a tuple:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Reshape diagonal numpy array\n", "d = np.reshape(d, (3,1))\n", "\n", "# Print shape of diagonal\n", "d.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that the diagonal has the correct shape you can do the vectorized operation by applying the `math.log()` function to the `rows_sum` array and adding the diagonal. \n", "\n", "To apply a function to each element of a numpy array use Numpy's `vectorize()` function providing the desired function as a parameter. This function returns a vectorized function that accepts a numpy array as a parameter. \n", "\n", "To update the original matrix you can use Numpy's `fill_diagonal()` function." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Perform the vectorized operation\n", "d = d + np.vectorize(math.log)(rows_sum)\n", "\n", "# Use numpy's 'fill_diagonal' function to update the diagonal\n", "np.fill_diagonal(t_matrix_np, d)\n", "\n", "# Print the matrix\n", "print_matrix(t_matrix_np)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To perform a sanity check that both methods yield the same result you can compare both matrices. Notice that this operation is also vectorized so you will get the equality check for each element in both matrices:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Check for equality\n", "t_matrix_for == t_matrix_np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Congratulations on finishing this lecture notebook!** Now you should be more familiar with some elements used by a POS tagger such as the `transition_counts` dictionary and with working with Numpy.\n", "\n", "**Keep it up!**" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.1" } }, "nbformat": 4, "nbformat_minor": 4 }