{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise 2\n",
    "Choose *one* of the following exercises."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2a) The fairy tale"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here is the beginning of a fairy tale:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> A boy called Peter lived with his 2 parents in a village on the hillside. His\n",
    "parents, like most of the other people in the village, were sheep farmers.\n",
    "There were 430 sheep in the village, 33 sheep dogs and 21 humans. Close to the village,\n",
    "there were also 5 wolves.\n",
    "Everybody in the village took turns to look after the sheep, and when Peter was\n",
    "10 years old, he was considered old enough to take his turn at shepherding."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Copy it and use python and regular expressions to find:\n",
    "\n",
    "- one word consisting of exactly two letters.\n",
    "\n",
    "- all words that contain 'o'\n",
    "- all numbers written with digits (ie. \"2\" and \"43\" but not \"eleven\")\n",
    "\n",
    "Finally, rename \"Peter\" to \"Petter\".\n",
    "\n",
    "**Hint**:\n",
    "Use `finditer` to find all instances.\n",
    "\n",
    "---------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2b) Finding variants in protein sequences"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The given fasta file (`proteins.fasta`) contains data on aminoacid sequences from 1000 different people. Each line represents one individual.\n",
    "\n",
    "We are interested in finding an amino acid substitution in the sequence \"TPLTVETLAKT\", where \"VE\" is changed to \"Lx\" (x here means anything)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- How many individuals have got this change?\n",
    "- What variants are there (which values can `x` take)?\n",
    "- Which variation is the most common one?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Hints\n",
    "\n",
    "- Start by expressing the pattern you're looking for as a regular expression.\n",
    "- Keep track of all variations of \"Lx\" you get."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Bonus exercise: Exercise 2c"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### IMDB titles again"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We're back with IMDB. Again, feel free to reuse your previous code!\n",
    "\n",
    "Use regular expressions to solve the problems and print the results using formatting."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**a)** Change titles\n",
    "\n",
    "A lot of movie titles contains the word \"and\". Some people find the ampersand \"&\" a lot prettier. Pretend you're one of them and replace all \"and\" with \"&\" in the titles. Print a table with the changed titles, approximately like this:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```\n",
    "-- Old title ---- | -- New title ---- \n",
    "\n",
    "   Me and you     |    Me & you\n",
    "   Cat and dog    |    Cat & dog\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**b)** Find common titles\n",
    "\n",
    "Titles like \"the grave\", \"the crown\", \"the road\" are popular for movies. The pattern is \"the\" + one word (a noun). Find the three most popular such words.\n",
    "\n",
    "The output should look like this:\n",
    "```\n",
    "* Noun       Count\n",
    "- Crown      3\n",
    "- Cat        1\n",
    "- Dog        1\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**c)** Find alliterations\n",
    "\n",
    "Alliteration is a fancy word for repetitions of a sound (or letter) in the beginning of each word, like in *\"Seven sisters slept soundly\"* or *\"Veni, vidi, vici\"*. It is used in poetry as well as in rhetoric. Surely there are some movie titles using alleration as well.\n",
    "\n",
    "Find all titles that are alliterations!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will consider a title as an alliteration if it contains the same letter in the beginning of every word.\n",
    "\n",
    "To make life a bit easier, we only look for alliterations with two words. Case is not important."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Print the output like this:\n",
    "```\n",
    "Title            |    Letter  \n",
    "My milkshake     |         M    \n",
    "Toxic toy        |         T\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Hints:\n",
    "\n",
    "For each task\n",
    "\n",
    "- first print the header line\n",
    "- then try to figure out what your pattern should look like\n",
    "- now loop through the lines of the input file and search for the pattern\n",
    "- if you get a match, decide wheather you should print it now or save it for later\n",
    "- consider using a `Counter()` from the module `collections` for task **b)**\n",
    "\n",
    "If you find it tricky to get started with the regular expressions, have a look at the documentation:\n",
    "\n",
    "- https://docs.python.org/3/howto/regex.html\n",
    "- https://docs.python.org/3/library/re.html\n",
    "\n",
    "Try listing a few examples of strings you want to match on a piece of paper. For **c)**, you have\n",
    "\n",
    "`My milkshake`\n",
    "\n",
    "`Toxic toy`\n",
    "\n",
    "Step through each line and abstract as much as you can. What characters are important? Which one's do we need to keep track of?\n",
    "\n",
    "\n",
    "#### Capturing groups\n",
    "\n",
    "For **c)**, and possibly **b)**, you will need to use [capturing groups](https://docs.python.org/3.7/howto/regex.html#grouping):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "the group: cat\n",
      "the whole match: a cat\n"
     ]
    }
   ],
   "source": [
    "import re\n",
    "\n",
    "p = re.compile('a ([a-z]*)')  # the parentheses creates a group\n",
    "match = p.search('I see a cat over there.')\n",
    "print('the group:', match.group(1))  # what's in the captured group\n",
    "print('the whole match:', match.group())  # what's in the whole match"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Back references\n",
    "Task **c)** also requires back references (see the end of the section about [grouping](https://docs.python.org/3.7/howto/regex.html#grouping)):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "the group: sleep\n",
      "the whole match: sleep and sleep\n"
     ]
    }
   ],
   "source": [
    "import re\n",
    "\n",
    "# look for a word X, the word 'and' and then the word X again\n",
    "p = re.compile(r'([a-z]+) and \\1') match = p.search('I eat and drink, I sleep and sleep.')\n",
    "print('the group:', match.group(1))  # what's in the captured group\n",
    "print('the whole match:', match.group())  # what's in the whole match"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "**d)** Update your code for finding alliterations to also allow and find titles with more than two words. \n",
    "Print the output like this:\n",
    "```\n",
    "Title                   |    Letter  | Repetitions\n",
    "Mary maid me milkshake  |         M  | 4\n",
    "The toxic toy           |         T  | 3\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**e)** So far we have required alliterations to start with the same *letter*. Usually, the definition is more complex as it is based on sound rather than spelling.\n",
    "\n",
    "Include the following criteria in your code:\n",
    "\n",
    "- all vowels are considered to be equivalent. *\"An elephant ate olives\"* is an alliteration.\n",
    "- allow [functions words]( https://en.wikipedia.org/wiki/Function_word) to occur in an alliteration, even if they do not match the current letter. This means that 'The sun sets and Sony speaks' is an alliteration. You do not have to list all function words - just allow five to ten of your choice."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}