{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Activity 10 - More on Text Analytics\n", "\n", "In this notebook we give a simplified version of how next word prediction can be performed. The same idea is at the core of larger and more complex models, such as GPT, which are trained on highly diverse and rich text datasets." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "### Here are the imports that you will require\n", "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "import urllib.request\n", "def load_data():\n", "    # the data is a plain-text file of movie dialogue lines, with fields separated by ' +++$+++ '\n", "    file_name = './data/movie_lines.txt'\n", "    # parse the file line by line and put the records into a pandas dataframe\n", "    #data = pd.read_csv(file_name, sep='+++$+++')\n", "    lines = []\n", "    with open(file_name, 'r', encoding='utf-8', errors=\"replace\") as f:\n", "        for line in f:\n", "            # split each line into its five fields\n", "            line = line.split(\" +++$+++ \")\n", "            # strip the trailing newline from the text field\n", "            line[4] = line[4].split('\\n')[0]\n", "            lines.append(line)\n", "    data = pd.DataFrame.from_records(lines, columns=['ID1', 'ID2', 'ID3', 'ID4', 'Text'])\n", "    return data\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
|  | ID1 | ID2 | ID3 | ID4 | Text |
|---|---|---|---|---|---|
| 0 | L1045 | u0 | m0 | BIANCA | They do not! |
| 1 | L1044 | u2 | m0 | CAMERON | They do to! |
| 2 | L985 | u0 | m0 | BIANCA | I hope so. |
| 3 | L984 | u2 | m0 | CAMERON | She okay? |
| 4 | L925 | u0 | m0 | BIANCA | Let's go. |
| ... | ... | ... | ... | ... | ... |
| 304708 | L666371 | u9030 | m616 | DURNFORD | Lord Chelmsford seems to want me to stay back ... |
| 304709 | L666370 | u9034 | m616 | VEREKER | I'm to take the Sikali with the main column to... |
| 304710 | L666369 | u9030 | m616 | DURNFORD | Your orders, Mr Vereker? |
| 304711 | L666257 | u9030 | m616 | DURNFORD | Good ones, yes, Mr Vereker. Gentlemen who can ... |
| 304712 | L666256 | u9034 | m616 | VEREKER | Colonel Durnford... William Vereker. I hear yo... |

304713 rows × 5 columns
\n", "