{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "kernelspec": { "display_name": "TensorFlow 2.3 on Python 3.6 (CUDA 10.1)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" }, "colab": { "name": "11-1.natural_language_preprocessing.ipynb", "provenance": [] }, "accelerator": "GPU" }, "cells": [ { "cell_type": "markdown", "metadata": { "id": "HMKLN4ZMX3rZ" }, "source": [ "# Natural Language Processing and Word Vectors" ] }, { "cell_type": "markdown", "metadata": { "id": "5Bnpl2KdX3re" }, "source": [ "In this notebook, we clean a natural language dataset (books from Project Gutenberg) and embed its words as vectors using word2vec.\n", "\n", "**Note:** Some, all, or none of these preprocessing steps may help a given downstream application." ] }, { "cell_type": "markdown", "metadata": { "id": "eUPm3F1qX3re" }, "source": [ "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/rickiepark/dl-illustrated/blob/master/notebooks/11-1.natural_language_preprocessing.ipynb)" ] }, { "cell_type": "markdown", "metadata": { "id": "H_gmDM7QX3re" }, "source": [ "#### Load the libraries." 
] }, { "cell_type": "code", "metadata": { "id": "aBWUr7XuX3rf", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "4a22a728-f298-4366-c6de-8a3ab3fb37bc" }, "source": [ "import nltk\n", "from nltk import word_tokenize, sent_tokenize\n", "from nltk.corpus import stopwords\n", "from nltk.stem.porter import *\n", "nltk.download('gutenberg')\n", "nltk.download('punkt')\n", "nltk.download('stopwords')\n", "\n", "import string\n", "\n", "import gensim\n", "from gensim.models.phrases import Phraser, Phrases\n", "from gensim.models.word2vec import Word2Vec\n", "\n", "from sklearn.manifold import TSNE\n", "\n", "import pandas as pd\n", "from bokeh.io import output_notebook, output_file\n", "from bokeh.plotting import show, figure\n", "%matplotlib inline" ], "execution_count": 1, "outputs": [ { "output_type": "stream", "name": "stderr", "text": [ "[nltk_data] Downloading package gutenberg to /root/nltk_data...\n", "[nltk_data] Unzipping corpora/gutenberg.zip.\n", "[nltk_data] Downloading package punkt to /root/nltk_data...\n", "[nltk_data] Unzipping tokenizers/punkt.zip.\n", "[nltk_data] Downloading package stopwords to /root/nltk_data...\n", "[nltk_data] Unzipping corpora/stopwords.zip.\n" ] } ] }, { "cell_type": "markdown", "metadata": { "id": "gKKgKNeVX3rg" }, "source": [ "#### Load the data." 
] }, { "cell_type": "code", "metadata": { "id": "q6kb0SVbX3rg" }, "source": [ "from nltk.corpus import gutenberg" ], "execution_count": 2, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "DcYwrUrvX3rg", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "4bc229f5-45e4-432b-df26-21804c3dc5a9" }, "source": [ "len(gutenberg.fileids())" ], "execution_count": 3, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "18" ] }, "metadata": {}, "execution_count": 3 } ] }, { "cell_type": "code", "metadata": { "id": "6l95Y900X3rg", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "dbc47dd5-5794-4e7c-f271-6f0d460607d1" }, "source": [ "gutenberg.fileids()" ], "execution_count": 4, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "['austen-emma.txt',\n", " 'austen-persuasion.txt',\n", " 'austen-sense.txt',\n", " 'bible-kjv.txt',\n", " 'blake-poems.txt',\n", " 'bryant-stories.txt',\n", " 'burgess-busterbrown.txt',\n", " 'carroll-alice.txt',\n", " 'chesterton-ball.txt',\n", " 'chesterton-brown.txt',\n", " 'chesterton-thursday.txt',\n", " 'edgeworth-parents.txt',\n", " 'melville-moby_dick.txt',\n", " 'milton-paradise.txt',\n", " 'shakespeare-caesar.txt',\n", " 'shakespeare-hamlet.txt',\n", " 'shakespeare-macbeth.txt',\n", " 'whitman-leaves.txt']" ] }, "metadata": {}, "execution_count": 4 } ] }, { "cell_type": "code", "metadata": { "id": "Qwpx7SPpX3rh", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "b4c7a551-42a2-4ea0-ae79-68c7d297c576" }, "source": [ "len(gutenberg.words())" ], "execution_count": 5, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "2621613" ] }, "metadata": {}, "execution_count": 5 } ] }, { "cell_type": "code", "metadata": { "id": "ONtHOR8sX3rh" }, "source": [ "gberg_sent_tokens = sent_tokenize(gutenberg.raw())" ], "execution_count": 6, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "woB9yIScX3rh", "colab": { "base_uri": 
"https://localhost:8080/" }, "outputId": "4b436f45-1b44-4488-ba03-cd68b1c1c45e" }, "source": [ "gberg_sent_tokens[0:6]" ], "execution_count": 7, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "['[Emma by Jane Austen 1816]\\n\\nVOLUME I\\n\\nCHAPTER I\\n\\n\\nEmma Woodhouse, handsome, clever, and rich, with a comfortable home\\nand happy disposition, seemed to unite some of the best blessings\\nof existence; and had lived nearly twenty-one years in the world\\nwith very little to distress or vex her.',\n", " \"She was the youngest of the two daughters of a most affectionate,\\nindulgent father; and had, in consequence of her sister's marriage,\\nbeen mistress of his house from a very early period.\",\n", " 'Her mother\\nhad died too long ago for her to have more than an indistinct\\nremembrance of her caresses; and her place had been supplied\\nby an excellent woman as governess, who had fallen little short\\nof a mother in affection.',\n", " \"Sixteen years had Miss Taylor been in Mr. 
Woodhouse's family,\\nless as a governess than a friend, very fond of both daughters,\\nbut particularly of Emma.\",\n", " 'Between _them_ it was more the intimacy\\nof sisters.',\n", " \"Even before Miss Taylor had ceased to hold the nominal\\noffice of governess, the mildness of her temper had hardly allowed\\nher to impose any restraint; and the shadow of authority being\\nnow long passed away, they had been living together as friend and\\nfriend very mutually attached, and Emma doing just what she liked;\\nhighly esteeming Miss Taylor's judgment, but directed chiefly by\\nher own.\"]" ] }, "metadata": {}, "execution_count": 7 } ] }, { "cell_type": "code", "metadata": { "id": "ExtGFWUSX3rh", "colab": { "base_uri": "https://localhost:8080/", "height": 54 }, "outputId": "28367ec7-8d80-4eb9-87ae-a1fcd40a1d4e" }, "source": [ "gberg_sent_tokens[1]" ], "execution_count": 8, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "\"She was the youngest of the two daughters of a most affectionate,\\nindulgent father; and had, in consequence of her sister's marriage,\\nbeen mistress of his house from a very early period.\"" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "string" } }, "metadata": {}, "execution_count": 8 } ] }, { "cell_type": "code", "metadata": { "id": "EsuwAXtwX3ri", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "9837a556-35be-441f-c2f0-d366fd262024" }, "source": [ "word_tokenize(gberg_sent_tokens[1])" ], "execution_count": 9, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "['She',\n", " 'was',\n", " 'the',\n", " 'youngest',\n", " 'of',\n", " 'the',\n", " 'two',\n", " 'daughters',\n", " 'of',\n", " 'a',\n", " 'most',\n", " 'affectionate',\n", " ',',\n", " 'indulgent',\n", " 'father',\n", " ';',\n", " 'and',\n", " 'had',\n", " ',',\n", " 'in',\n", " 'consequence',\n", " 'of',\n", " 'her',\n", " 'sister',\n", " \"'s\",\n", " 'marriage',\n", " ',',\n", " 'been',\n", " 
'mistress',\n", " 'of',\n", " 'his',\n", " 'house',\n", " 'from',\n", " 'a',\n", " 'very',\n", " 'early',\n", " 'period',\n", " '.']" ] }, "metadata": {}, "execution_count": 9 } ] }, { "cell_type": "code", "metadata": { "id": "UQ1xafglX3ri", "colab": { "base_uri": "https://localhost:8080/", "height": 36 }, "outputId": "84bc51f4-118d-40b8-9071-2ad3b7b180de" }, "source": [ "word_tokenize(gberg_sent_tokens[1])[14]" ], "execution_count": 10, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "'father'" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "string" } }, "metadata": {}, "execution_count": 10 } ] }, { "cell_type": "code", "metadata": { "id": "k2mLQ1FGX3ri" }, "source": [ "# Handle newline characters and tokenize into sentences and words in one step.\n", "gberg_sents = gutenberg.sents()" ], "execution_count": 11, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "febzvJRsX3ri", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "2de10ca7-3aba-49ce-9385-fa71dd01a6ff" }, "source": [ "gberg_sents[0:6]" ], "execution_count": 12, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "[['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']'],\n", " ['VOLUME', 'I'],\n", " ['CHAPTER', 'I'],\n", " ['Emma',\n", " 'Woodhouse',\n", " ',',\n", " 'handsome',\n", " ',',\n", " 'clever',\n", " ',',\n", " 'and',\n", " 'rich',\n", " ',',\n", " 'with',\n", " 'a',\n", " 'comfortable',\n", " 'home',\n", " 'and',\n", " 'happy',\n", " 'disposition',\n", " ',',\n", " 'seemed',\n", " 'to',\n", " 'unite',\n", " 'some',\n", " 'of',\n", " 'the',\n", " 'best',\n", " 'blessings',\n", " 'of',\n", " 'existence',\n", " ';',\n", " 'and',\n", " 'had',\n", " 'lived',\n", " 'nearly',\n", " 'twenty',\n", " '-',\n", " 'one',\n", " 'years',\n", " 'in',\n", " 'the',\n", " 'world',\n", " 'with',\n", " 'very',\n", " 'little',\n", " 'to',\n", " 'distress',\n", " 'or',\n", " 'vex',\n", " 'her',\n", " '.'],\n", " ['She',\n", " 'was',\n", " 'the',\n", " 'youngest',\n", " 
'of',\n", " 'the',\n", " 'two',\n", " 'daughters',\n", " 'of',\n", " 'a',\n", " 'most',\n", " 'affectionate',\n", " ',',\n", " 'indulgent',\n", " 'father',\n", " ';',\n", " 'and',\n", " 'had',\n", " ',',\n", " 'in',\n", " 'consequence',\n", " 'of',\n", " 'her',\n", " 'sister',\n", " \"'\",\n", " 's',\n", " 'marriage',\n", " ',',\n", " 'been',\n", " 'mistress',\n", " 'of',\n", " 'his',\n", " 'house',\n", " 'from',\n", " 'a',\n", " 'very',\n", " 'early',\n", " 'period',\n", " '.'],\n", " ['Her',\n", " 'mother',\n", " 'had',\n", " 'died',\n", " 'too',\n", " 'long',\n", " 'ago',\n", " 'for',\n", " 'her',\n", " 'to',\n", " 'have',\n", " 'more',\n", " 'than',\n", " 'an',\n", " 'indistinct',\n", " 'remembrance',\n", " 'of',\n", " 'her',\n", " 'caresses',\n", " ';',\n", " 'and',\n", " 'her',\n", " 'place',\n", " 'had',\n", " 'been',\n", " 'supplied',\n", " 'by',\n", " 'an',\n", " 'excellent',\n", " 'woman',\n", " 'as',\n", " 'governess',\n", " ',',\n", " 'who',\n", " 'had',\n", " 'fallen',\n", " 'little',\n", " 'short',\n", " 'of',\n", " 'a',\n", " 'mother',\n", " 'in',\n", " 'affection',\n", " '.']]" ] }, "metadata": {}, "execution_count": 12 } ] }, { "cell_type": "code", "metadata": { "id": "cuAzseHwX3ri", "colab": { "base_uri": "https://localhost:8080/", "height": 36 }, "outputId": "a238a0a6-7423-4d05-d211-1ccc7f55cdea" }, "source": [ "gberg_sents[4][14]" ], "execution_count": 13, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "'father'" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "string" } }, "metadata": {}, "execution_count": 13 } ] }, { "cell_type": "markdown", "metadata": { "id": "HQFdAveQX3rj" }, "source": [ "#### Preprocess a single sentence." ] }, { "cell_type": "markdown", "metadata": { "id": "FVOf2srnX3rj" }, "source": [ "##### Tokenize the sentence." 
] }, { "cell_type": "code", "metadata": { "id": "rgwGGxBdX3rj", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "1e0bf438-f7d5-4745-cb15-f9471edb4763" }, "source": [ "gberg_sents[4]" ], "execution_count": 14, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "['She',\n", " 'was',\n", " 'the',\n", " 'youngest',\n", " 'of',\n", " 'the',\n", " 'two',\n", " 'daughters',\n", " 'of',\n", " 'a',\n", " 'most',\n", " 'affectionate',\n", " ',',\n", " 'indulgent',\n", " 'father',\n", " ';',\n", " 'and',\n", " 'had',\n", " ',',\n", " 'in',\n", " 'consequence',\n", " 'of',\n", " 'her',\n", " 'sister',\n", " \"'\",\n", " 's',\n", " 'marriage',\n", " ',',\n", " 'been',\n", " 'mistress',\n", " 'of',\n", " 'his',\n", " 'house',\n", " 'from',\n", " 'a',\n", " 'very',\n", " 'early',\n", " 'period',\n", " '.']" ] }, "metadata": {}, "execution_count": 14 } ] }, { "cell_type": "markdown", "metadata": { "id": "83hicu8AX3rj" }, "source": [ "##### Convert to lowercase." ] }, { "cell_type": "code", "metadata": { "id": "SwqoeTV4X3rj", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "9daa01ac-1b4e-4bad-b232-fbf959c9d53e" }, "source": [ "[w.lower() for w in gberg_sents[4]]" ], "execution_count": 15, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "['she',\n", " 'was',\n", " 'the',\n", " 'youngest',\n", " 'of',\n", " 'the',\n", " 'two',\n", " 'daughters',\n", " 'of',\n", " 'a',\n", " 'most',\n", " 'affectionate',\n", " ',',\n", " 'indulgent',\n", " 'father',\n", " ';',\n", " 'and',\n", " 'had',\n", " ',',\n", " 'in',\n", " 'consequence',\n", " 'of',\n", " 'her',\n", " 'sister',\n", " \"'\",\n", " 's',\n", " 'marriage',\n", " ',',\n", " 'been',\n", " 'mistress',\n", " 'of',\n", " 'his',\n", " 'house',\n", " 'from',\n", " 'a',\n", " 'very',\n", " 'early',\n", " 'period',\n", " '.']" ] }, "metadata": {}, "execution_count": 15 } ] }, { "cell_type": "markdown", "metadata": { "id": "bBJUakWyX3rk" }, "source": [ "##### Remove stop words and punctuation." 
] }, { "cell_type": "code", "metadata": { "id": "yZmbu1dEX3rk" }, "source": [ "stpwrds = stopwords.words('english') + list(string.punctuation)" ], "execution_count": 16, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "rOxjrwY_X3rk", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "8f7e3a81-b55f-40bd-c2aa-fd37500a455e" }, "source": [ "stpwrds" ], "execution_count": 17, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "['i',\n", " 'me',\n", " 'my',\n", " 'myself',\n", " 'we',\n", " 'our',\n", " 'ours',\n", " 'ourselves',\n", " 'you',\n", " \"you're\",\n", " \"you've\",\n", " \"you'll\",\n", " \"you'd\",\n", " 'your',\n", " 'yours',\n", " 'yourself',\n", " 'yourselves',\n", " 'he',\n", " 'him',\n", " 'his',\n", " 'himself',\n", " 'she',\n", " \"she's\",\n", " 'her',\n", " 'hers',\n", " 'herself',\n", " 'it',\n", " \"it's\",\n", " 'its',\n", " 'itself',\n", " 'they',\n", " 'them',\n", " 'their',\n", " 'theirs',\n", " 'themselves',\n", " 'what',\n", " 'which',\n", " 'who',\n", " 'whom',\n", " 'this',\n", " 'that',\n", " \"that'll\",\n", " 'these',\n", " 'those',\n", " 'am',\n", " 'is',\n", " 'are',\n", " 'was',\n", " 'were',\n", " 'be',\n", " 'been',\n", " 'being',\n", " 'have',\n", " 'has',\n", " 'had',\n", " 'having',\n", " 'do',\n", " 'does',\n", " 'did',\n", " 'doing',\n", " 'a',\n", " 'an',\n", " 'the',\n", " 'and',\n", " 'but',\n", " 'if',\n", " 'or',\n", " 'because',\n", " 'as',\n", " 'until',\n", " 'while',\n", " 'of',\n", " 'at',\n", " 'by',\n", " 'for',\n", " 'with',\n", " 'about',\n", " 'against',\n", " 'between',\n", " 'into',\n", " 'through',\n", " 'during',\n", " 'before',\n", " 'after',\n", " 'above',\n", " 'below',\n", " 'to',\n", " 'from',\n", " 'up',\n", " 'down',\n", " 'in',\n", " 'out',\n", " 'on',\n", " 'off',\n", " 'over',\n", " 'under',\n", " 'again',\n", " 'further',\n", " 'then',\n", " 'once',\n", " 'here',\n", " 'there',\n", " 'when',\n", " 'where',\n", " 'why',\n", " 'how',\n", " 
'all',\n", " 'any',\n", " 'both',\n", " 'each',\n", " 'few',\n", " 'more',\n", " 'most',\n", " 'other',\n", " 'some',\n", " 'such',\n", " 'no',\n", " 'nor',\n", " 'not',\n", " 'only',\n", " 'own',\n", " 'same',\n", " 'so',\n", " 'than',\n", " 'too',\n", " 'very',\n", " 's',\n", " 't',\n", " 'can',\n", " 'will',\n", " 'just',\n", " 'don',\n", " \"don't\",\n", " 'should',\n", " \"should've\",\n", " 'now',\n", " 'd',\n", " 'll',\n", " 'm',\n", " 'o',\n", " 're',\n", " 've',\n", " 'y',\n", " 'ain',\n", " 'aren',\n", " \"aren't\",\n", " 'couldn',\n", " \"couldn't\",\n", " 'didn',\n", " \"didn't\",\n", " 'doesn',\n", " \"doesn't\",\n", " 'hadn',\n", " \"hadn't\",\n", " 'hasn',\n", " \"hasn't\",\n", " 'haven',\n", " \"haven't\",\n", " 'isn',\n", " \"isn't\",\n", " 'ma',\n", " 'mightn',\n", " \"mightn't\",\n", " 'mustn',\n", " \"mustn't\",\n", " 'needn',\n", " \"needn't\",\n", " 'shan',\n", " \"shan't\",\n", " 'shouldn',\n", " \"shouldn't\",\n", " 'wasn',\n", " \"wasn't\",\n", " 'weren',\n", " \"weren't\",\n", " 'won',\n", " \"won't\",\n", " 'wouldn',\n", " \"wouldn't\",\n", " '!',\n", " '\"',\n", " '#',\n", " '$',\n", " '%',\n", " '&',\n", " \"'\",\n", " '(',\n", " ')',\n", " '*',\n", " '+',\n", " ',',\n", " '-',\n", " '.',\n", " '/',\n", " ':',\n", " ';',\n", " '<',\n", " '=',\n", " '>',\n", " '?',\n", " '@',\n", " '[',\n", " '\\\\',\n", " ']',\n", " '^',\n", " '_',\n", " '`',\n", " '{',\n", " '|',\n", " '}',\n", " '~']" ] }, "metadata": {}, "execution_count": 17 } ] }, { "cell_type": "code", "metadata": { "id": "tOGsv4pBX3rk", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "53ac252d-9b01-45e2-b5ef-54e4421fa89c" }, "source": [ "[w.lower() for w in gberg_sents[4] if w.lower() not in stpwrds]" ], "execution_count": 18, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "['youngest',\n", " 'two',\n", " 'daughters',\n", " 'affectionate',\n", " 'indulgent',\n", " 'father',\n", " 'consequence',\n", " 'sister',\n", " 'marriage',\n", " 
'mistress',\n", " 'house',\n", " 'early',\n", " 'period']" ] }, "metadata": {}, "execution_count": 18 } ] }, { "cell_type": "markdown", "metadata": { "id": "ejpP_769X3rk" }, "source": [ "##### Stem the words." ] }, { "cell_type": "code", "metadata": { "id": "N6tM1CgLX3rk" }, "source": [ "stemmer = PorterStemmer()" ], "execution_count": 19, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "jcVrJAE6X3rl", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "c91f1a22-a5fc-4958-8c54-21f5d891791a" }, "source": [ "[stemmer.stem(w.lower()) for w in gberg_sents[4] \n", " if w.lower() not in stpwrds]" ], "execution_count": 20, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "['youngest',\n", " 'two',\n", " 'daughter',\n", " 'affection',\n", " 'indulg',\n", " 'father',\n", " 'consequ',\n", " 'sister',\n", " 'marriag',\n", " 'mistress',\n", " 'hous',\n", " 'earli',\n", " 'period']" ] }, "metadata": {}, "execution_count": 20 } ] }, { "cell_type": "markdown", "metadata": { "id": "jlSJcCI4X3rl" }, "source": [ "##### Handle bigrams." ] }, { "cell_type": "code", "metadata": { "id": "HSS4P9ckX3rl" }, "source": [ "phrases = Phrases(gberg_sents) # train the detector" ], "execution_count": 21, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "5AGNGlURX3rl" }, "source": [ "bigram = Phraser(phrases) # create a Phraser object that transforms sentences more efficiently" ], "execution_count": 22, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "MntLiNLDX3rl", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "f688951b-3e0d-4e96-eec5-be80b2a453a8" }, "source": [ "bigram.phrasegrams # output the bigram counts and scores" 
], "execution_count": 23, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "{(b'two', b'daughters'): (19, 11.966987886528115),\n", " (b'her', b'sister'): (195, 17.796341912611076),\n", " (b\"'\", b's'): (9781, 31.066694850762417),\n", " (b'very', b'early'): (24, 11.01230173457644),\n", " (b'Her', b'mother'): (14, 13.529621959045564),\n", " (b'long', b'ago'): (38, 63.224356392701125),\n", " (b'more', b'than'): (541, 29.024006819814797),\n", " (b'had', b'been'): (1256, 22.306349272800997),\n", " (b'an', b'excellent'): (54, 39.064443355850045),\n", " (b'Miss', b'Taylor'): (48, 453.76578390553544),\n", " (b'very', b'fond'): (28, 24.134631699685762),\n", " (b'passed', b'away'): (25, 12.350716162995981),\n", " (b'too', b'much'): (173, 31.376458650431253),\n", " (b'did', b'not'): (935, 11.728586903044414),\n", " (b'any', b'means'): (27, 14.097169263925728),\n", " (b'wedding', b'-'): (15, 17.469774011299435),\n", " (b'Her', b'father'): (18, 13.129762639674155),\n", " (b'after', b'dinner'): (21, 21.5288614259916),\n", " (b'self', b'-'): (124, 47.79087603091109),\n", " (b'sixteen', b'years'): (12, 107.04772502472798),\n", " (b'five', b'years'): (42, 40.129339674923365),\n", " (b'years', b'old'): (176, 54.73622181125361),\n", " (b'seven', b'years'): (51, 52.59487691468612),\n", " (b'each', b'other'): (236, 79.41799630087341),\n", " (b'a', b'mile'): (48, 12.783277635060301),\n", " (b'must', b'be'): (601, 10.230138529643797),\n", " (b'difference', b'between'): (44, 220.52858240070225),\n", " (b'could', b'not'): (1049, 10.871141494497287),\n", " (b'having', b'been'): (49, 11.538186246573854),\n", " (b'miles', b'off'): (16, 34.78731999672721),\n", " (b'at', b'Hartfield'): (66, 27.282624103216843),\n", " (b'her', b'husband'): (158, 27.544796053941575),\n", " (b'in', b'spite'): (96, 13.442110585867532),\n", " (b'Emma', b'could'): (61, 11.335276219802779),\n", " (b'every', b'body'): (127, 36.973121045951494),\n", " (b'no', b'means'): (80, 
32.57409228176136),\n", " (b'his', b'own'): (773, 10.402539077343869),\n", " (b'obliged', b'to'): (179, 10.436780686118585),\n", " (b'able', b'to'): (348, 11.446995392578943),\n", " (b'very', b'much'): (234, 16.21051090525822),\n", " (b'have', b'been'): (986, 17.98145273154076),\n", " (b'great', b'deal'): (181, 118.04185550664424),\n", " (b'\"', b'Poor'): (30, 10.125733768993836),\n", " (b'agree', b'with'): (25, 13.61194200678363),\n", " (b'-', b'humoured'): (22, 33.94127522195319),\n", " (b'for', b'ever'): (555, 12.476295381735138),\n", " (b'This', b'is'): (353, 11.381193790408082),\n", " (b'three', b'times'): (36, 35.42629642564782),\n", " (b'my', b'dear'): (253, 24.47929874292135),\n", " (b'How', b'often'): (12, 12.37814885769022),\n", " (b'My', b'dear'): (85, 84.80821711166878),\n", " (b'so', b'far'): (98, 10.161780363169663),\n", " (b'\"', b'No'): (351, 15.063925495032132),\n", " (b'We', b'must'): (68, 18.765920000462394),\n", " (b'last', b'night'): (63, 23.5929422985217),\n", " (b'doubt', b'whether'): (12, 22.92446435569112),\n", " (b'anywhere', b'else'): (6, 16.100335841295465),\n", " (b'I', b'am'): (2428, 16.95154402454624),\n", " (b'very', b'glad'): (46, 18.284606842044536),\n", " (b'am', b'sure'): (282, 65.14555013642888),\n", " (b'very', b'pretty'): (39, 20.06847092419522),\n", " (b'be', b'able'): (121, 11.34777742133673),\n", " (b'immediately', b'afterwards'): (10, 41.0611372267814),\n", " (b'sensible', b'man'): (17, 14.541599717835169),\n", " (b'intimate', b'friend'): (6, 21.899079320113316),\n", " (b'connected', b'with'): (31, 18.3761217091579),\n", " (b'than', b'usual'): (30, 28.952390048051893),\n", " (b'Brunswick', b'Square'): (11, 10881.466275659823),\n", " (b'some', b'time'): (146, 12.92674187618596),\n", " (b'poor', b'Isabella'): (10, 41.30301208842583),\n", " (b'It', b'is'): (777, 11.70604053201059),\n", " (b'am', b'afraid'): (65, 25.627764827764825),\n", " (b'moonlight', b'night'): (6, 14.74558893657606),\n", " (b'Look', b'at'): (33, 
13.630663096064623),\n", " (b'\"', b'Well'): (311, 21.191639295191656),\n", " (b'vast', b'deal'): (11, 61.90490490490491),\n", " (b'an', b'hour'): (150, 41.75817958294139),\n", " (b'pretty', b'well'): (20, 17.716673032849503),\n", " (b'tolerably', b'well'): (7, 18.357847866419295),\n", " (b'\"', b'Ah'): (83, 17.2797604782697),\n", " (b'Ah', b'!'): (68, 37.53350320557592),\n", " (b\"'\", b'Tis'): (64, 23.239682149440206),\n", " (b'Miss', b'Woodhouse'): (173, 294.5313833270494),\n", " (b'you', b'please'): (93, 13.036170437015532),\n", " (b'any', b'rate'): (47, 83.92156482630273),\n", " (b',\"', b'said'): (2583, 36.033065722366544),\n", " (b'My', b'dearest'): (7, 26.665660572611245),\n", " (b'so', b'much'): (483, 20.564737038651),\n", " (b'much', b'less'): (38, 19.104713600467313),\n", " (b'any', b'body'): (93, 21.71675477576872),\n", " (b'has', b'been'): (263, 29.261102552816904),\n", " (b'been', b'used'): (29, 14.094306941975477),\n", " (b'Well', b',\"'): (60, 12.493728094244245),\n", " (b'tell', b'you'): (296, 11.61233454195183),\n", " (b'Every', b'body'): (21, 72.20115873502328),\n", " (b'\"', b'Dear'): (39, 20.048952862607795),\n", " (b'every', b'thing'): (240, 27.277565476570334),\n", " (b'very', b'sorry'): (32, 20.256026727225084),\n", " (b'turned', b'away'): (50, 19.344906255300334),\n", " (b'divided', b'between'): (10, 35.82858268446422),\n", " (b'knows', b'how'): (13, 14.801172739783404),\n", " (b'how', b'much'): (110, 15.41788827060771),\n", " (b'four', b'years'): (21, 16.257841484533913),\n", " (b'years', b'ago'): (56, 163.33385119704198),\n", " (b'any', b'thing'): (383, 35.72856040672197),\n", " (b'need', b'not'): (107, 13.47902882398845),\n", " (b'his', b'wife'): (263, 10.871008962598552),\n", " (b'Ever', b'since'): (8, 99.63963480128893),\n", " (b'leave', b'off'): (18, 10.507399991635475),\n", " (b'you', b'mean'): (142, 10.574149798763328),\n", " (b'young', b'lady'): (73, 113.30676689703485),\n", " (b'depend', b'upon'): (28, 66.33781993881054),\n", " 
(b'quarrel', b'with'): (21, 10.691561721691869),\n", " (b'-', b'hearted'): (45, 49.03796213698087),\n", " (b'their', b'own'): (279, 10.1646586470654),\n", " (b'You', b'are'): (231, 12.600380088963897),\n", " (b'more', b'likely'): (16, 11.177045646480481),\n", " (b'have', b'done'): (272, 12.664289823059754),\n", " (b',\"', b'rejoined'): (6, 11.95680754804532),\n", " (b'any', b'longer'): (32, 16.396440186651585),\n", " (b'very', b'well'): (171, 13.844638642769485),\n", " (b'young', b'man'): (260, 25.86418892544471),\n", " (b'dine', b'with'): (22, 13.884180846919302),\n", " (b'much', b'better'): (38, 10.763540796435533),\n", " (b'I', b'dare'): (138, 13.676667311275946),\n", " (b'dare', b'say'): (114, 128.21273285427898),\n", " (b'Depend', b'upon'): (17, 92.29609730617116),\n", " (b'take', b'care'): (59, 72.94080901625021),\n", " (b'CHAPTER', b'II'): (11, 335.55615843733045),\n", " (b'entering', b'into'): (14, 16.437697132934044),\n", " (b'never', b'seen'): (42, 14.015410764872522),\n", " (b'refrain', b'from'): (9, 12.438191682463382),\n", " (b'at', b'once'): (263, 21.418483948514538),\n", " (b'three', b'years'): (77, 37.35580371637182),\n", " (b'any', b'other'): (138, 10.208550533393701),\n", " (b'twenty', b'years'): (69, 85.2916825593849),\n", " (b'an', b'easy'): (18, 10.427619035266346),\n", " (b'according', b'to'): (747, 12.093503438006989),\n", " (b'had', b'begun'): (25, 12.033151826531775),\n", " (b'passed', b'through'): (43, 31.462657712657712),\n", " (b'its', b'being'): (58, 16.0647072143383),\n", " (b'deal', b'better'): (14, 19.993210914263546),\n", " (b'fine', b'young'): (13, 10.40328871973337),\n", " (b'belonging', b'to'): (35, 10.51254678910311),\n", " (b'Frank', b'Churchill'): (151, 1750.703455229379),\n", " (b'Miss', b'Bates'): (113, 400.43190484184277),\n", " (b'a', b'few'): (404, 11.554768952918474),\n", " (b'few', b'days'): (53, 35.91581912291018),\n", " (b'I', b'suppose'): (210, 12.338337969117697),\n", " (b'very', b'handsome'): (21, 
19.759725217669143),\n", " (b'an', b'irresistible'): (7, 11.369243496644911),\n", " (b'good', b'sense'): (28, 17.373623742833203),\n", " (b'had', b'already'): (64, 11.99089493884663),\n", " (b'She', b'felt'): (26, 13.338526859809706),\n", " (b'most', b'fortunate'): (6, 11.471739412714017),\n", " (b'long', b'enough'): (38, 15.189751032711843),\n", " (b'know', b'how'): (120, 12.783055562146046),\n", " (b'dear', b'Emma'): (31, 28.3901872294143),\n", " (b'at', b'Randalls'): (39, 27.034148473861507),\n", " (b'few', b'weeks'): (19, 134.4705370732768),\n", " (b'no', b'longer'): (113, 44.45405534922727),\n", " (b'CHAPTER', b'III'): (10, 354.19816723940437),\n", " (b'Donwell', b'Abbey'): (9, 753.4937557112396),\n", " (b'card', b'-'): (18, 15.662556010130528),\n", " (b'drawing', b'-'): (53, 20.08500964173348),\n", " (b'-', b'room'): (116, 10.86355694820301),\n", " (b'thrown', b'away'): (11, 14.820859395595177),\n", " (b'After', b'these'): (18, 11.092149558498896),\n", " (b'an', b'invitation'): (11, 10.459704016913319),\n", " (b'old', b'lady'): (16, 10.88616203950411),\n", " (b'those', b'who'): (150, 15.975875581352883),\n", " (b'as', b'possible'): (81, 11.709669181717734),\n", " (b'young', b'ladies'): (44, 113.63645786708754),\n", " (b'-', b'fashioned'): (31, 34.93954802259887),\n", " (b'Goddard', b\"'\"): (34, 15.295062282208422),\n", " (b'found', b'herself'): (27, 11.226219866395585),\n", " (b's', b'sake'): (142, 28.092409785884055),\n", " (b'much', b'pleased'): (18, 13.27637741183309),\n", " (b'be', b'allowed'): (32, 10.133422261278234),\n", " (b'Miss', b'Smith'): (58, 165.24557352585305),\n", " (b'Harriet', b'Smith'): (31, 180.55133848365074),\n", " (b'several', b'years'): (10, 17.577623156769786),\n", " (b'pretty', b'girl'): (10, 40.456222524597024),\n", " (b'blue', b'eyes'): (28, 35.5958547926145),\n", " (b'They', b'were'): (188, 10.65352659402321),\n", " (b'due', b'time'): (18, 21.041915854217102),\n", " (b'its', b'own'): (54, 10.834528109355436),\n", " (b'better', 
b'than'): (170, 42.51429907056041),\n", " (b'body', b'else'): (31, 39.47041163749192),\n", " (b'apple', b'-'): (26, 28.22040417209909),\n", " (b'You', b'need'): (16, 14.652845388359971),\n", " (b'half', b'-'): (179, 14.008021557447472),\n", " (b'much', b'more'): (159, 10.556098666120453),\n", " (b'little', b'girl'): (50, 35.0572859257393),\n", " (b'at', b'last'): (420, 22.7569403988895),\n", " (b'CHAPTER', b'IV'): (8, 335.55615843733045),\n", " (b'every', b'respect'): (14, 12.225158144438588),\n", " (b'guided', b'by'): (14, 23.954886635563895),\n", " (b'different', b'sort'): (8, 14.491056783566354),\n", " (b'-', b'Mill'): (7, 12.705290190035953),\n", " (b'good', b'deal'): (62, 39.17741946239356),\n", " (b'very', b'happy'): (43, 11.360922087440127),\n", " (b'drink', b'tea'): (7, 32.50446757069274),\n", " (b'large', b'enough'): (11, 10.829225583329684),\n", " (b'had', b'taken'): (121, 10.962706594062695),\n", " (b'doing', b'something'): (9, 10.717002712046511),\n", " (b'three', b'miles'): (9, 16.651991868276856),\n", " (b'thing', b'else'): (26, 12.21037677485836),\n", " (b'very', b'obliging'): (14, 25.349647483193962),\n", " (b'on', b'purpose'): (35, 10.833519216418129),\n", " (b'very', b'clever'): (15, 21.695644242373216),\n", " (b'\"', b'You'): (493, 11.717232331679762),\n", " (b'know', b'what'): (219, 10.688016314679755),\n", " (b'Miss', b'Nash'): (13, 337.6861647669101),\n", " (b'does', b'not'): (211, 13.23044396496933),\n", " (b'\"', b'Oh'): (496, 20.296981145444178),\n", " (b'Oh', b'yes'): (11, 23.468344823224335),\n", " (b'very', b'entertaining'): (7, 16.05477673935618),\n", " (b'soon', b'as'): (271, 12.011817412794189),\n", " (b'Oh', b'!'): (285, 31.12744137552917),\n", " (b'have', b'seen'): (204, 13.438992082992083),\n", " (b'on', b'horseback'): (21, 54.889830696518516),\n", " (b'their', b'families'): (95, 35.2636280696419),\n", " (b'no', b'doubt'): (117, 40.19018109445808),\n", " (b'very', b'respectable'): (9, 10.88459439956351),\n", " (b'respectable', 
b'young'): (8, 27.705368476069587),\n", " (b'very', b'odd'): (16, 18.20644784875443),\n", " (b'perfectly', b'right'): (12, 16.999175371083012),\n", " (b'years', b'hence'): (9, 17.99121428987025),\n", " (b'young', b'woman'): (57, 30.400597455143597),\n", " (b'very', b'desirable'): (9, 14.595251581232889),\n", " (b'Dear', b'Miss'): (9, 32.27882457330758),\n", " (b'thirty', b'years'): (35, 72.53374931093936),\n", " (b'can', b'afford'): (11, 26.391976955083752),\n", " (b'good', b'luck'): (21, 51.578277957902856),\n", " (b'acquainted', b'with'): (88, 27.731238215638285),\n", " (b'your', b'own'): (181, 10.134839838816427),\n", " (b'\"', b'Yes'): (349, 27.04643052838071),\n", " (b'next', b'day'): (100, 33.668880662020904),\n", " (b'an', b'opportunity'): (34, 39.4962781888654),\n", " (b'few', b'yards'): (15, 127.2018593936402),\n", " (b'Robert', b'Martin'): (31, 1963.7493893502685),\n", " (b'few', b'minutes'): (86, 316.3684419939749),\n", " (b'Only', b'think'): (9, 11.416782816581593),\n", " (b'been', b'able'): (40, 15.917912445432831),\n", " (b'-', b'morrow'): (134, 31.19170723124743),\n", " (b'should', b'happen'): (13, 20.434509648427174),\n", " (b'Do', b'you'): (187, 17.543227495018566),\n", " (b'compared', b'with'): (25, 15.313434757631585),\n", " (b'\"', b'Certainly'): (39, 25.246829530691297),\n", " (b'You', b'must'): (84, 12.865027201533609),\n", " (b'an', b'old'): (158, 10.46791414565501),\n", " (b'old', b'man'): (201, 11.807281822173856),\n", " (b'more', b'valuable'): (10, 17.666198180904065),\n", " (b',\"', b'replied'): (256, 68.63356681944518),\n", " (b'very', b'bad'): (36, 15.601820655800676),\n", " (b'deal', b'too'): (15, 12.720185939364022),\n", " (b'no', b'more'): (553, 17.351013056535997),\n", " (b'very', b'agreeable'): (21, 21.40636898580824),\n", " (b'fixed', b'on'): (31, 10.723013373773426),\n", " (b'same', b'time'): (104, 18.367692434617606),\n", " (b'pleasing', b'young'): (8, 23.351667715544366),\n", " (b'CHAPTER', b'V'): (7, 236.13211149293625),\n", " 
(b'very', b'differently'): (14, 48.16433021806853),\n", " (b'\"', b'Perhaps'): (40, 10.879276747151517),\n", " (b'ever', b'since'): (60, 42.92864084409275),\n", " (b'twelve', b'years'): (22, 39.38985552857956),\n", " (b'very', b'neatly'): (7, 22.935395341937397),\n", " (b'ten', b'years'): (32, 36.45901703775031),\n", " (b'being', b'able'): (20, 14.369127392027657),\n", " (b'her', b'mother'): (239, 11.54378934477518),\n", " (b'have', b'spoken'): (82, 11.590388219544845),\n", " (b'Yes', b',\"'): (107, 26.304976605699704),\n", " (b'\"', b'Thank'): (43, 24.57613576706762),\n", " (b'Thank', b'you'): (46, 27.91929097443667),\n", " (b'\"', b'Why'): (191, 10.5490954241727),\n", " (b'could', b'possibly'): (21, 30.82056265729735),\n", " (b'How', b'can'): (53, 16.088895681879897),\n", " (b'much', b'mistaken'): (9, 10.09912469789013),\n", " (b'Very', b'well'): (40, 84.54272043745728),\n", " (b'oh', b'!'): (27, 22.811112601435184),\n", " (b'look', b'at'): (154, 10.282146776177967),\n", " (b'any', b'harm'): (10, 10.323684561965813),\n", " (b'\"', b'Very'): (70, 19.596720843150475),\n", " (b'an', b'angel'): (52, 25.819647520741913),\n", " (b'an', b'end'): (127, 18.271533362878365),\n", " (b'many', b'years'): (54, 19.71931776771305),\n", " (b',\"', b'cried'): (297, 34.72447183030883),\n", " (b'much', b'obliged'): (40, 42.98951729507285),\n", " (b'John', b'Knightley'): (58, 175.90626358469606),\n", " (b'ill', b'-'): (100, 22.2768930345429),\n", " (b'cared', b'for'): (14, 11.00409252669039),\n", " (b'I', b'assure'): (105, 13.117682643839952),\n", " (b'assure', b'you'): (125, 32.47647355375373),\n", " (b'soon', b'afterwards'): (36, 80.81087688682625),\n", " (b'CHAPTER', b'VI'): (6, 151.79921453117328),\n", " (b'most', b'agreeable'): (13, 28.296957218027913),\n", " (b'no', b'scruple'): (10, 27.38112104843708),\n", " (b'infinitely', b'superior'): (7, 278.5720720720721),\n", " (b'am', b'glad'): (34, 17.03178537511871),\n", " (b'Exactly', b'so'): (9, 26.019985274008626),\n", " (b'Did', 
b'you'): (73, 15.110106642904366),\n", " (b'very', b'interesting'): (15, 17.838640821506868),\n", " (b'No', b'sooner'): (14, 65.68793372043619),\n", " (b'Don', b\"'\"): (134, 25.40609319726938),\n", " (b\"'\", b't'): (2200, 30.670409254097855),\n", " (b't', b'pretend'): (9, 22.21571621014818),\n", " (b'why', b'should'): (57, 22.803466076696164),\n", " (b'cannot', b'imagine'): (13, 50.341789024899015),\n", " (b'back', b'again'): (74, 19.21576017940612),\n", " (b'almost', b'every'): (37, 10.072406158544343),\n", " (b'higher', b'than'): (34, 46.27235316124205),\n", " (b'ten', b'times'): (16, 32.71691507115478),\n", " (b'dear', b'Isabella'): (6, 10.167866890269968),\n", " (b'must', b'allow'): (12, 16.270925888340134),\n", " (b'sitting', b'down'): (24, 17.45900165672385),\n", " (b'fore', b'-'): (11, 15.528688010043942),\n", " (b'must', b'confess'): (10, 12.590597413596536),\n", " (b'depended', b'on'): (14, 20.583686511194443),\n", " (b'no', b'sooner'): (26, 28.750177100858938),\n", " (b'after', b'breakfast'): (11, 12.174058544459536),\n", " (b'sooner', b'than'): (12, 16.389570366331988),\n", " (b'at', b'home'): (154, 15.749758060408496),\n", " (b'at', b'least'): (301, 41.37119228766489),\n", " (b'Upon', b'my'): (40, 19.147455857896038),\n", " (b'Will', b'you'): (82, 14.912362397847469),\n", " (b\"'\", b'd'): (2523, 30.547355552456107),\n", " (b'She', b'paused'): (9, 28.954070883468326),\n", " (b'replied', b'Emma'): (16, 16.331281539133734),\n", " (b'can', b'hardly'): (33, 26.42101103314215),\n", " (b'am', b'persuaded'): (12, 14.509837439249205),\n", " (b'Are', b'you'): (83, 16.465572091753142),\n", " (b'I', b'beg'): (52, 10.231792462195163),\n", " (b'beg', b'your'): (38, 44.83857997838066),\n", " (b'your', b'pardon'): (41, 42.18341802803452),\n", " (b'dear', b'Miss'): (28, 19.298196342757286),\n", " (b'little', b'while'): (54, 10.255019543477895),\n", " (b'`', b'No'): (8, 11.44562481492449),\n", " (b'entered', b'into'): (98, 58.47508652207685),\n", " (b'older', 
b'than'): (15, 59.83493943264058),\n", " (b'advise', b'you'): (16, 10.800315623690194),\n", " (b'run', b'away'): (46, 41.7018298679982),\n", " (b'At', b'last'): (91, 78.82934999295966),\n", " (b'\"', b'Indeed'): (53, 16.736517172263895),\n", " (b'Dear', b'me'): (17, 11.639793716121261),\n", " (b'have', b'borne'): (26, 11.407140974967064),\n", " (b'good', b'opinion'): (19, 14.834724620994052),\n", " (b'good', b'natured'): (17, 34.33179126572909),\n", " (b'thank', b'you'): (59, 16.88776624795194),\n", " (b'merely', b'because'): (10, 11.301263472594304),\n", " (b'Emma', b'felt'): (19, 16.552506003089487),\n", " (b'no', b'difficulty'): (15, 13.258227033980061),\n", " (b'protest', b'against'): (6, 10.1573458158824),\n", " (b'Let', b'us'): (117, 32.264510238685276),\n", " (b'cried', b'Emma'): (27, 14.204966402031332),\n", " (b'\"', b'Has'): (17, 11.45654449291874),\n", " (b'next', b'morning'): (62, 76.77909286541964),\n", " (b'dear', b'sir'): (21, 30.139015802234486),\n", " (b'am', b'going'): (42, 10.08102475989074),\n", " (b'sat', b'down'): (150, 58.92371592771372),\n", " (b'depends', b'upon'): (8, 18.878747176262287),\n", " (b'has', b'happened'): (14, 16.88674150485437),\n", " (b'presently', b'added'): (6, 24.987070707070707),\n", " (b'could', b'afford'): (11, 16.18079539508111),\n", " (b'Certainly', b',\"'): (12, 17.04952187406462),\n", " (b'stood', b'up'): (80, 10.286479413623711),\n", " (b'\"', b'Nonsense'): (8, 10.024476431303897),\n", " (b'are', b'mistaken'): (17, 10.345444812472815),\n", " (b'does', b'seem'): (9, 14.179296113722343),\n", " (b'few', b'moments'): (43, 388.7952484944742),\n", " (b'nobody', b'knows'): (7, 38.60362047440699),\n", " (b'very', b'likely'): (29, 25.18396351271557),\n", " (b'all', b'probability'): (16, 13.453835276434582),\n", " (b'no', b'harm'): (25, 27.989590405069016),\n", " (b'cannot', b'help'): (16, 20.515795346592878),\n", " (b'very', b'different'): (29, 17.756435103435404),\n", " (b'common', b'sense'): (26, 145.7873644507308),\n", " 
(b'.--', b'She'): (78, 30.441114197863847),\n", " (b'-', b'natured'): (60, 48.04187853107345),\n", " (b'an', b'hundred'): (183, 30.382299526934904),\n", " (b'exactly', b'what'): (25, 14.38771426976121),\n", " (b'every', b'man'): (307, 12.828982038889638),\n", " (b'be', b'satisfied'): (66, 12.273087213735986),\n", " (b'less', b'than'): (85, 36.58696933460825),\n", " (b'large', b'fortune'): (9, 38.26326372776489),\n", " (b'no', b'use'): (40, 14.694534962661235),\n", " (b'these', b'words'): (111, 26.137791068580544),\n", " (b'well', b'acquainted'): (9, 11.682266824085007),\n", " (b'twenty', b'thousand'): (48, 77.04216497473693),\n", " (b'thousand', b'pounds'): (47, 448.5710831721469),\n", " (b'Good', b'morning'): (10, 19.263571686664424),\n", " (b'walked', b'off'): (15, 10.917066798474792),\n", " (b'cast', b'down'): (44, 15.322932013410156),\n", " (b'its', b'effects'): (7, 29.72292312498498),\n", " (b'deal', b'more'): (28, 10.737653188633582),\n", " (b'longer', b'than'): (31, 18.302452061748884),\n", " (b'perfectly', b'satisfied'): (12, 93.75833838690116),\n", " (b'three', b'hundred'): (77, 49.30380882147769),\n", " (b'looking', b'at'): (106, 12.747194191690063),\n", " (b'next', b'moment'): (21, 24.667637262918568),\n", " (b'ready', b'wit'): (8, 70.13003213003213),\n", " (b'very', b'pleasant'): (19, 11.645952038911219),\n", " (b'an', b'idea'): (37, 14.230889818929686),\n", " (b'Give', b'me'): (67, 25.719795758473396),\n", " (b'arrive', b'at'): (10, 11.926830209056545),\n", " (b'very', b'superior'): (13, 10.70318449290412),\n", " (b'pre', b'-'): (17, 49.32642073778665),\n", " (b'have', b'chosen'): (38, 12.418273092369478),\n", " (b'without', b'exception'): (8, 55.118538324420676),\n", " (b'her', b'cheeks'): (13, 10.316214846771857),\n", " (b'sit', b'down'): (54, 36.05010576005672),\n", " (b'reason', b'why'): (20, 38.7228669226916),\n", " (b'could', b'hardly'): (46, 23.719372787618898),\n", " (b'It', b'seemed'): (50, 10.045558012557422),\n", " (b'an', b'offering'): (70, 
11.10916276306153),\n", " (b'let', b'us'): (282, 32.25392476944686),\n", " (b'Have', b'you'): (81, 15.50084824691241),\n", " (b'\"', b'Aye'): (59, 19.5070892717265),\n", " (b'Very', b'true'): (18, 128.88708979271206),\n", " (b'can', b'easily'): (12, 15.162057467882711),\n", " (b'Nobody', b'could'): (9, 15.935631828488972),\n", " (b'dear', b'mother'): (21, 13.46066364121149),\n", " (b'those', b'things'): (85, 16.270079692293628),\n", " (b'next', b'week'): (12, 51.50149900066622),\n", " (b'Why', b'should'): (43, 13.317115021941754),\n", " (b'.--', b'Poor'): (10, 33.94982433025911),\n", " (b'taken', b'away'): (75, 33.88283563223159),\n", " (b'stay', b'longer'): (9, 32.079572569768644),\n", " (b'three', b'days'): (97, 38.36047093343905),\n", " (b'cannot', b'bear'): (17, 21.20561660980335),\n", " (b'We', b'are'): (133, 13.053617112780595),\n", " (b'four', b'o'): (8, 13.636624231911327),\n", " (b'o', b\"'\"): (216, 29.052217020325397),\n", " (b\"'\", b'clock'): (67, 18.374166774488803),\n", " (b'ask', b'whether'): (9, 11.353068061866079),\n", " (b're', b'-'): (54, 17.206410583993414),\n", " (b'Of', b'course'): (52, 64.5306321807182),\n", " (b'ran', b'away'): (24, 15.68573713220816),\n", " (b'who', b'lived'): (27, 14.515587325296064),\n", " (b'A', b'few'): (48, 27.799197568033183),\n", " (b'.--', b'Emma'): (18, 10.09086002610704),\n", " (b'thus', b'began'): (15, 10.569989454451607),\n", " (b'Never', b'mind'): (10, 58.8196690127449),\n", " (b'good', b'fortune'): (18, 19.836146064643472),\n", " (b'Those', b'who'): (17, 18.71431093178666),\n", " (b'Jane', b'Fairfax'): (111, 897.7114059953714),\n", " (b'nothing', b'else'): (45, 34.14858386055199),\n", " (b'present', b'instance'): (6, 15.748153806977339),\n", " (b'These', b'are'): (118, 23.59039770019026),\n", " (b'once', b'more'): (124, 21.461766513970687),\n", " (b'still', b'greater'): (11, 12.724373482572735),\n", " (b'here', b'comes'): (14, 13.200933526553092),\n", " (b'turned', b'back'): (42, 28.86287494639118),\n", " 
(b'will', b'bring'): (144, 12.922132077825832),\n", " (b'each', b'side'): (37, 22.901925567260612),\n", " (b'waiting', b'for'): (50, 11.228665843561624),\n", " (b'still', b'remained'): (10, 13.211098151305023),\n", " (b'she', b'hoped'): (24, 14.771949542264762),\n", " (b'ten', b'minutes'): (39, 192.59908585456108),\n", " (b'most', b'favourable'): (6, 12.483951713835843),\n", " (b'ten', b'days'): (21, 17.36327679451949),\n", " (b'many', b'months'): (10, 10.650833562965003),\n", " (b'little', b'ones'): (53, 64.86319239593576),\n", " (b'-', b'tempered'): (21, 31.944729620661825),\n", " (b'passed', b'over'): (52, 23.114649934790215),\n", " (b'sir', b',\"'): (108, 26.986181179154077),\n", " (b'cannot', b'deny'): (9, 34.78674185428415),\n", " (b'talking', b'about'): (24, 19.191470943716723),\n", " (b'never', b'forget'): (18, 21.690516659921762),\n", " (b'cannot', b'tell'): (30, 18.289343248058184),\n", " (b'two', b'years'): (53, 12.946058959906635),\n", " (b'indeed', b'!--'): (19, 15.372686467521769),\n", " (b'most', b'amiable'): (8, 17.20760911907103),\n", " (b',\"', b'observed'): (18, 11.71111972171562),\n", " (b'our', b'lives'): (16, 19.655161454360538),\n", " (b'think', b'differently'): (6, 12.463321241434905),\n", " (b'shake', b'hands'): (13, 44.840574981420055),\n", " (b'How', b'long'): (48, 18.446894705078492),\n", " (b'South', b'End'): (8, 1381.451973194341),\n", " (b'perfectly', b'convinced'): (8, 75.6828750917843),\n", " (b'tells', b'me'): (15, 12.933104129023622),\n", " (b'bad', b'cold'): (7, 14.30739511156867),\n", " (b'far', b'off'): (67, 27.665534339247987),\n", " (b'am', b'sorry'): (32, 26.675629043853345),\n", " (b'Ah', b'!\"'): (14, 17.412606445880755),\n", " (b'an', b'interval'): (9, 11.127344698843956),\n", " (b'perfectly', b'well'): (16, 14.848259303721488),\n", " (b'He', b'paused'): (14, 28.882402391182513),\n", " (b'can', b'tell'): (51, 12.431003638264086),\n", " (b'morrow', b'morning'): (14, 22.316414535277676),\n", " (b'own', b'feelings'): (16, 
10.723971700076298),\n", " (b'sore', b'throat'): (7, 129.1624895572264),\n", " (b'&', b'c'): (17, 4365.388235294118),\n", " (b'well', b'satisfied'): (17, 19.87189717498996),\n", " (b'looked', b'at'): (184, 13.030941652621292),\n", " (b'well', b'pleased'): (21, 19.25167566515881),\n", " (b'set', b'forward'): (22, 30.104688954112678),\n", " (b'eldest', b'daughter'): (10, 86.18032329988851),\n", " (b'short', b'time'): (23, 10.749540101684603),\n", " (b'Ha', b'!'): (35, 53.03328712107136),\n", " (b'\"', b'Quite'): (25, 17.43387205444156),\n", " (b',\"', b'continued'): (103, 41.18031481403468),\n", " (b'dining', b'-'): (20, 27.58385370205174),\n", " (b'such', b'circumstances'): (10, 10.213103979019891),\n", " (b'enter', b'into'): (108, 69.92082379259851),\n", " (b'gone', b'through'): (24, 11.359011093968116),\n", " (b'turn', b'away'): (49, 25.693197151088075),\n", " (b',\"', b'repeated'): (29, 16.0233360034719),\n", " (b'several', b'times'): (18, 100.66348633961887),\n", " (b'great', b'curiosity'): (13, 12.762317494711862),\n", " (b'upper', b'end'): (11, 50.59973817705776),\n", " (b'an', b'odd'): (25, 26.958000043591028),\n", " (b'In', b'short'): (23, 17.859418769192267),\n", " (b'dearest', b'Emma'): (8, 41.199369337360096),\n", " (b'continued', b'Mrs'): (17, 12.8091714520948),\n", " (b'go', b'home'): (35, 10.887794606718625),\n", " (b'covered', b'with'): (53, 10.080615337595193),\n", " (b'hardly', b'knew'): (16, 30.25775488600073),\n", " (b'knew', b'how'): (29, 10.771742964262854),\n", " (b'set', b'off'): (42, 12.87641808850673),\n", " (b'can', b'get'): (31, 10.915918025964645),\n", " (b'got', b'home'): (12, 13.023221532639205),\n", " (b'most', b'extraordinary'): (16, 46.22770238588718),\n", " (b'an', b'inch'): (27, 63.920413436692506),\n", " (b'at', b'ease'): (36, 17.60627316575014),\n", " (b'tete', b'-'): (7, 10.750630160799654),\n", " (b'well', b'known'): (33, 14.650399763103346),\n", " (b'Smith', b'!--'): (9, 18.89143450635386),\n", " (b'extremely', b'sorry'): (8, 
73.47101219705371),\n", " (b'Every', b'thing'): (18, 22.468754541491062),\n", " (b'many', b'weeks'): (10, 20.758257250268528),\n", " (b'Am', b'I'): (43, 15.552324542536647),\n", " (b'madam', b',\"'): (13, 15.249261800405625),\n", " (b'extremely', b'well'): (16, 29.94818401937046),\n", " (b'!--', b'Such'): (10, 19.885720533004065),\n", " (b'poor', b'Harriet'): (13, 12.146024108216924),\n", " (b'-', b'headed'): (29, 37.26885122410547),\n", " (b'an', b'instant'): (95, 43.26164345230693),\n", " (b'thirty', b'thousand'): (18, 42.25669624085443),\n", " (b'so', b'easily'): (19, 10.348857779435248),\n", " (b'worth', b'having'): (9, 21.66509020844281),\n", " (b'poor', b'girl'): (11, 16.403616188855242),\n", " (b'laugh', b'at'): (34, 11.791298047589994),\n", " (b'knowing', b'what'): (21, 15.750760884791218),\n", " (b'many', b'days'): (49, 14.230461886034641),\n", " (b'whole', b'party'): (15, 21.73407276203329),\n", " (b'six', b'weeks'): (6, 18.238468797923794),\n", " (b'too', b'late'): (56, 87.81582024724356),\n", " (b'-', b'minded'): (19, 20.81504988580358),\n", " (b'her', b'companions'): (36, 11.854753785126626),\n", " (b'drew', b'near'): (34, 135.3933203484773),\n", " (b'three', b'months'): (35, 82.38812730639594),\n", " (b'other', b'side'): (133, 28.11421841631497),\n", " (b'an', b'unnatural'): (10, 19.22739708991419),\n", " (b'get', b'rid'): (18, 302.70680372001954),\n", " (b'watering', b'-'): (10, 20.552675307411103),\n", " (b'while', b'ago'): (10, 15.477310722473048),\n", " (b'at', b'Weymouth'): (16, 43.731710766540665),\n", " (b'present', b'occasion'): (9, 32.73631972474029),\n", " (b'No', b',\"'): (86, 11.567334989477068),\n", " (b'their', b'hearts'): (49, 18.284844184258766),\n", " (b'break', b'through'): (12, 11.455134820459898),\n", " (b'burst', b'forth'): (18, 49.665727664726894),\n", " (b'young', b'men'): (141, 28.05854810702794),\n", " (b'-', b'bred'): (21, 27.951638418079096),\n", " (b'nobody', b'else'): (20, 96.33066107291948),\n", " (b'something', b'else'): 
(35, 38.925897096435115),\n", " (b'walking', b'together'): (9, 11.844821972381299),\n", " (b'burst', b'out'): (19, 11.102331509877693),\n", " (b'-', b'sized'): (12, 27.1752040175769),\n", " (b'how', b'long'): (57, 10.246965742926971),\n", " (b'Miss', b'Fairfax'): (125, 273.2315441060061),\n", " (b'extremely', b'happy'): (7, 19.519300571284287),\n", " (b'don', b\"'\"): (693, 30.893027225924477),\n", " (b'ma', b\"'\"): (213, 29.826951267640514),\n", " (b's', b'handwriting'): (7, 11.483028817587641),\n", " (b'Ma', b\"'\"): (15, 17.287522502879252),\n", " (b'without', b'seeming'): (8, 24.2521568627451),\n", " (b'Colonel', b'Campbell'): (28, 896.7839354391274),\n", " (b'those', b'days'): (84, 24.942982765152095),\n", " (b'Miss', b'Campbell'): (12, 75.31725733771769),\n", " (b'most', b'charming'): (7, 10.480354525195523),\n", " (b'caught', b'hold'): (10, 25.412146614069687),\n", " (b'four', b'months'): (9, 21.513976100607053),\n", " (b'may', b'guess'): (11, 13.367124175942939),\n", " (b'Bless', b'me'): (14, 16.960842272062408),\n", " (b'running', b'away'): (12, 11.650622091724552),\n", " (b'My', b'father'): (32, 11.189057813492585),\n", " (b'five', b'minutes'): (37, 145.59430269856685),\n", " (b'nine', b'years'): (9, 22.04328958038157),\n", " (b'hundred', b'pounds'): (11, 62.910379437794575),\n", " (b'more', b'honourable'): (10, 10.64811945150382),\n", " (b'rather', b'than'): (75, 19.03838981947655),\n", " (b'few', b'months'): (17, 59.138874943221204),\n", " (b'she', b'wished'): (31, 11.81925902403813),\n", " (b'without', b'feeling'): (12, 13.66868744277099),\n", " (b'twelve', b'thousand'): (24, 59.18646236299163),\n", " (b'passed', b'between'): (17, 19.791026625704045),\n", " (b\",'\", b'said'): (250, 30.38023579892928),\n", " (b'Miss', b'Hawkins'): (18, 356.68101153504875),\n", " (b'dear', b'Jane'): (14, 28.08747388500318),\n", " (b'three', b'minutes'): (10, 10.882525806031556),\n", " (b'have', b'suffered'): (29, 12.705290190035953),\n", " (b'hour', b'ago'): (10, 
35.65917844869341),\n", " (b'looked', b'round'): (26, 11.609514648854784),\n", " (b'help', b'thinking'): (10, 32.39549502357255),\n", " (b'a', b'series'): (16, 10.46445052916564),\n", " (b'laughed', b'at'): (28, 12.564141746945062),\n", " (b'weeks', b'ago'): (7, 66.07864088043594),\n", " (b'She', b'wished'): (10, 10.48200653568184),\n", " (b'twenty', b'miles'): (8, 32.07957256976865),\n", " (b'elder', b'sister'): (6, 20.892905405405404),\n", " (b'alas', b'!'): (23, 57.08877378327094),\n", " (b'no', b'fault'): (13, 10.496096401900884),\n", " (b'driven', b'away'): (10, 15.10864307317955),\n", " (b'setting', b'off'): (8, 17.077411634756995),\n", " (b'little', b'farther'): (16, 13.37803343166175),\n", " (b'spot', b'where'): (18, 40.48947421434327),\n", " (b'front', b'door'): (15, 45.13527518483108),\n", " (b'they', b'parted'): (20, 10.44325386818452),\n", " (b'without', b'delay'): (8, 17.83246828143022),\n", " (b'six', b'months'): (21, 149.72732500075657),\n", " (b'months', b'ago'): (7, 33.90422411666347),\n", " (b'leaned', b'back'): (6, 12.550583460172502),\n", " (b'at', b'Oxford'): (10, 14.908537761320684),\n", " (b'turned', b'round'): (20, 11.2590300905922),\n", " (b'pass', b'through'): (62, 22.188217566016075),\n", " (b'clock', b'struck'): (16, 287.9462433862434),\n", " (b'four', b'hours'): (15, 47.34066169603626),\n", " (b'faster', b'than'): (10, 39.88995962176039),\n", " (b'musical', b'society'): (6, 113.41096644049148),\n", " (b'worth', b'while'): (16, 39.41555130656469),\n", " (b'mixed', b'with'): (23, 10.67000615370459),\n", " (b'extremely', b'glad'): (10, 72.79072504708098),\n", " (b'knew', b'nothing'): (25, 12.449045733530072),\n", " (b'make', b'amends'): (9, 67.18048992450166),\n", " (b'amends', b'for'): (12, 15.10365640918289),\n", " (b'oftener', b'than'): (9, 50.683713401766134),\n", " (b'old', b'woman'): (61, 19.444732663616787),\n", " (b'post', b'-'): (19, 12.228841807909605),\n", " (b'just', b'going'): (25, 13.2603307202772),\n", " (b'At', b'least'): 
(17, 28.03569269825919),\n", " (b'their', b'lives'): (30, 14.906122976297906),\n", " (b'six', b'days'): (15, 14.208028157365117),\n", " (b'may', b'prove'): (9, 10.69369934075435),\n", " (b'stronger', b'than'): (29, 58.08695243797917),\n", " (b'particular', b'friend'): (10, 19.818171330419286),\n", " (b'Hum', b'!'): (7, 26.958587619877946),\n", " (b'good', b'tidings'): (14, 31.69088424528839),\n", " (b'among', b'themselves'): (30, 14.912909361688993),\n", " (b'next', b'summer'): (8, 20.514042459088895),\n", " (b'breaking', b'up'): (12, 10.278411586632634),\n", " (b'perfectly', b'safe'): (6, 16.445856823742155),\n", " (b'two', b'ladies'): (12, 10.207136726744569),\n", " (b'same', b'moment'): (29, 15.561086589572348),\n", " (b'well', b'worth'): (11, 11.682266824085007),\n", " (b',\"', b'added'): (51, 22.000525888403388),\n", " (b'little', b'girls'): (15, 24.596997116436313),\n", " (b'be', b'ashamed'): (88, 15.575445327520244),\n", " (b'been', b'staying'): (9, 13.958784759841098),\n", " (b'shut', b'up'): (64, 33.02129026711253),\n", " (b'too', b'large'): (19, 13.75905031306614),\n", " (b'At', b'first'): (26, 13.6225491635793),\n", " (b'worse', b'than'): (55, 50.56473754871035),\n", " (b'opposite', b'side'): (9, 21.485013505649793),\n", " (b'short', b'pause'): (11, 86.62091182855941),\n", " (b'large', b'party'): (8, 14.159924899255097),\n", " (b'six', b'years'): (19, 24.750919080861966),\n", " (b'who', b'knows'): (23, 17.15478502080444),\n", " (b'extremely', b'fond'): (6, 34.254458845685164),\n", " (b'or', b'twice'): (16, 10.480088120657514),\n", " (b'somebody', b'else'): (14, 146.9730657512543),\n", " (b'five', b'couple'): (7, 31.78281426662555),\n", " (b'\"', b'Don'): (54, 12.435426459085848),\n", " (b'bad', b'news'): (7, 32.31086729362592),\n", " (b'baked', b'apples'): (6, 613.5218253968253),\n", " (b'will', b'send'): (73, 16.24812706949644),\n", " (b'William', b'Larkins'): (13, 5074.297435897436),\n", " (b'low', b'voice'): (39, 55.284472898891764),\n", " (b'one', 
b'leg'): (17, 10.74596003475239),\n", " (b'an', b'immediate'): (12, 14.761679056127669),\n", " (b'Tell', b'me'): (40, 25.093033554681703),\n", " (b',\"', b'resumed'): (18, 28.980058972381027),\n", " (b'many', b'times'): (18, 11.52332014677216),\n", " (b'Nothing', b'can'): (12, 15.51466345550789),\n", " (b'few', b'words'): (18, 11.709874520256639),\n", " (b'no', b'objection'): (17, 30.845671058647493),\n", " (b'It', b'seems'): (24, 18.2281699492411),\n", " (b'astonished', b'at'): (22, 13.98318024510078),\n", " (b'four', b'times'): (13, 17.904877713359248),\n", " (b'other', b'end'): (40, 10.241169643435896),\n", " (b'few', b'hours'): (23, 78.0796666877091),\n", " (b'an', b'extraordinary'): (13, 10.356142591003286),\n", " (b'look', b'forward'): (11, 11.731760911835845),\n", " (b'Alas', b'!'): (16, 24.207711332135293),\n", " (b'immediately', b'followed'): (7, 12.604814218453825),\n", " (b'wait', b'till'): (17, 29.399581656260896),\n", " (b'-', b'bye'): (37, 39.93091202582728),\n", " (b'contrast', b'between'): (12, 166.24462365591398),\n", " (b'dared', b'not'): (22, 11.783553500216318),\n", " (b'three', b'weeks'): (9, 21.40970383064167),\n", " (b'-', b'sighted'): (11, 32.25189048239896),\n", " (b'Maple', b'Grove'): (31, 16731.716961498438),\n", " (b'My', b'brother'): (17, 11.063412365232326),\n", " (b'at', b'Maple'): (10, 11.542093750699882),\n", " (b'almost', b'fancy'): (9, 16.725625422582826),\n", " (b'left', b'behind'): (27, 29.19181841393264),\n", " (b'barouche', b'-'): (7, 13.975819209039548),\n", " (b'-', b'landau'): (7, 19.96545601291364),\n", " (b'whose', b'name'): (60, 31.461202630580967),\n", " (b'most', b'serious'): (8, 10.186904598490049),\n", " (b'We', b'cannot'): (19, 11.868641936045467),\n", " (b'waited', b'for'): (38, 10.711948477309232),\n", " (b'E', b'.,'): (6, 566.3278388278388),\n", " (b'person', b'who'): (30, 10.40945693009978),\n", " (b'greater', b'part'): (10, 16.18384415693171),\n", " (b'drew', b'back'): (11, 15.195773696172985),\n", " (b'Her', 
b'manners'): (6, 10.694300338936156),\n", " (b'third', b'time'): (23, 13.583823987016235),\n", " (b'very', b'extraordinary'): (12, 11.127072987672598),\n", " (b'better', b'acquainted'): (7, 13.449978251413658),\n", " (b'According', b'to'): (45, 10.652714079624486),\n", " (b'have', b'committed'): (34, 16.773728020950244),\n", " (b'hardly', b'less'): (8, 13.000147148472811),\n", " (b'will', b'shew'): (48, 12.349874144320705),\n", " (b'little', b'boys'): (17, 17.749724946185125),\n", " (b'easily', b'believe'): (8, 21.824887069452284),\n", " (b'my', b'lord'): (180, 36.09264842223215),\n", " (b'\"', b'Excuse'): (11, 17.184816739378107),\n", " (b'Excuse', b'me'): (13, 37.690760604583126),\n", " (b'put', b'forth'): (36, 10.696233535526662),\n", " (b'drawing', b'near'): (8, 18.702897235831365),\n", " (b'great', b'joy'): (24, 10.008185299508712),\n", " (b'eight', b'o'): (9, 65.15276021913189),\n", " (b'spread', b'abroad'): (14, 194.3118977796397),\n", " (b'few', b'lines'): (10, 43.57841479226563),\n", " (b'good', b'news'): (18, 24.79518258080434),\n", " (b'most', b'likely'): (12, 19.419480443744643),\n", " (b'talk', b'about'): (35, 21.824887069452284),\n", " (b'tells', b'us'): (12, 30.44567755366135),\n", " (b'dear', b'madam'): (10, 68.52258121703674),\n", " (b'eleven', b'years'): (6, 16.99170238487746),\n", " (b'your', b'sister'): (79, 15.965251961999174),\n", " (b'two', b'hours'): (17, 15.078877429107843),\n", " (b'two', b'months'): (19, 19.98674940210717),\n", " (b'door', b'opened'): (19, 32.960457440450135),\n", " (b'Who', b'can'): (30, 10.97976511828241),\n", " (b'began', b'talking'): (9, 13.575011249766773),\n", " (b'mean', b'?\"'): (59, 26.41782366663845),\n", " (b'In', b'spite'): (9, 11.90627917946151),\n", " (b'many', b'hours'): (12, 13.124575551782682),\n", " (b'few', b'steps'): (8, 22.235285657785926),\n", " (b'most', b'excellent'): (10, 12.940681654585935),\n", " (b'later', b'than'): (8, 11.966987886528115),\n", " (b'whole', b'story'): (18, 
36.39536252354049),\n", " (b'whole', b'history'): (8, 19.2441498630819),\n", " (b'lined', b'with'): (12, 13.540300206747927),\n", " (b'-', b'plaister'): (9, 17.469774011299435),\n", " (b'Lord', b'bless'): (12, 16.7014274691358),\n", " (b'these', b'things'): (325, 42.22085680150196),\n", " (b'laid', b'down'): (26, 12.408697820671941),\n", " (b'forty', b'years'): (63, 161.26670263465516),\n", " (b'faint', b'smile'): (6, 24.271193092621665),\n", " (b'turned', b'towards'): (11, 11.117376349756114),\n", " (b'totally', b'different'): (7, 158.32821300563236),\n", " (b'Box', b'Hill'): (18, 8589.305555555555),\n", " (b'some', b'surprise'): (19, 22.39760968543046),\n", " (b'may', b'depend'): (9, 21.16461327857632),\n", " (b',\"', b'interrupted'): (25, 30.235605293907703),\n", " (b'whatever', b'else'): (8, 18.739735159540622),\n", " (b'mid', b'-'): (22, 25.824883321051338),\n", " (b'larger', b'than'): (11, 19.006392525662303),\n", " (b'were', b'assembled'): (17, 18.151746404461402),\n", " (b'insisted', b'on'): (9, 11.61131033964815),\n", " (b'clothed', b'with'): (36, 12.659106066308777),\n", " (b'twenty', b'minutes'): (7, 11.181261808550069),\n", " (b'quite', b'alone'): (14, 11.341420176217916),\n", " (b'etc', b'.,'): (11, 3964.2948717948716),\n", " (b'As', b'soon'): (46, 20.83462134191184),\n", " (b'without', b'knowing'): (18, 34.56996044031647),\n", " (b',\"', b'whispered'): (18, 26.71599186516376),\n", " (b'shan', b\"'\"): (18, 22.473779253743025),\n", " (b'looking', b'round'): (23, 17.251931821351857),\n", " (b'Pardon', b'me'): (7, 10.14751247046469),\n", " (b',\"', b'answered'): (143, 22.57516648996616),\n", " (b'An', b'old'): (15, 19.494934210941096),\n", " (b'Shall', b'we'): (25, 13.344625941350367),\n", " (b'old', b'age'): (36, 48.23121704303023),\n", " (b'an', b'infant'): (9, 23.772054583893908),\n", " (b'be', b'forgiven'): (36, 18.33341939975366),\n", " (b'lie', b'down'): (41, 31.292094007783852),\n", " (b'four', b'miles'): (7, 16.3062279175236),\n", " (b'great', 
b'hurry'): (16, 19.653968941856267),\n", " (b'without', b'waiting'): (12, 19.24774354186119),\n", " (b'comes', b'back'): (13, 15.420512101235838),\n", " (b'heightened', b'by'): (7, 11.875072007373555),\n", " (b'In', b'fact'): (28, 39.68320704393532),\n", " (b'cut', b'off'): (213, 155.50163439778729),\n", " (b'never', b'mind'): (29, 11.197398746902147),\n", " (b'trembling', b'voice'): (7, 11.95022270316229),\n", " (b'More', b'than'): (14, 24.23315047021944),\n", " (b'time', b'past'): (23, 13.326716908397632),\n", " (b'second', b'time'): (44, 20.945614179827096),\n", " (b'five', b'hundred'): (65, 85.8882839842231),\n", " (b'turning', b'away'): (13, 10.732346458879267),\n", " (b'an', b'arrow'): (10, 14.857534114933694),\n", " (b'--', b'oh'): (17, 12.811006767021128),\n", " (b'presented', b'themselves'): (6, 11.894714571472534),\n", " (b'at', b'random'): (11, 18.668082066349374),\n", " (b'far', b'distant'): (10, 26.63579981049186),\n", " (b'few', b'seconds'): (9, 96.54294969363461),\n", " (b'passing', b'through'): (10, 15.455340630779228),\n", " (b'will', b'heal'): (10, 10.577600656791981),\n", " (b'rose', b'early'): (12, 50.25143069404622),\n", " (b'east', b'wind'): (22, 148.38828510938603),\n", " (b'gone', b'mad'): (10, 20.604717798360767),\n", " (b'freed', b'from'): (10, 25.54271506220159),\n", " (b'sinned', b'against'): (43, 81.41747505543238),\n", " (b'locked', b'up'): (11, 13.137819321259757),\n", " (b'deep', b'sigh'): (7, 47.76258881680568),\n", " (b'ten', b'thousand'): (75, 127.07863651308065),\n", " (b'happier', b'than'): (10, 23.413671951902835),\n", " (b'contend', b'with'): (14, 10.67000615370459),\n", " (b'had', b'formerly'): (10, 11.686041677689511),\n", " (b'little', b'boy'): (63, 26.45202064896755),\n", " (b'fancying', b'herself'): (6, 29.03972577009767),\n", " (b'right', b'hand'): (196, 45.67341533298018),\n", " (b'surrounded', b'by'): (18, 30.407381352214102),\n", " (b'infinitely', b'more'): (8, 12.605071134482898),\n", " (b'such', b'cases'): (8, 
14.91498581087056),\n", " (b'No', b'wonder'): (13, 19.253810919251713),\n", " (b'poor', b'fellow'): (31, 72.63322416713721),\n", " (b'Poor', b'fellow'): (7, 45.43103764921947),\n", " (b'days', b'ago'): (11, 15.442862018162295),\n", " (b'help', b'laughing'): (7, 20.6971218206158),\n", " (b'draw', b'near'): (18, 83.03474416971349),\n", " (b'at', b'intervals'): (30, 31.386395286990908),\n", " (b'into', b'temptation'): (8, 11.801423582619314),\n", " (b'stood', b'before'): (56, 10.39258282946439),\n", " (b'Sir', b'Walter'): (136, 1001.3265848443274),\n", " (b'Walter', b'Elliot'): (16, 158.52745152870995),\n", " (b'Kellynch', b'Hall'): (24, 4945.357744107744),\n", " (b'arising', b'from'): (7, 10.217086024880635),\n", " (b'Charles', b'Musgrove'): (14, 248.92084078711986),\n", " (b'first', b'year'): (71, 36.52590150555186),\n", " (b'Lady', b'Elliot'): (12, 34.95647609819121),\n", " (b'seventeen', b'years'): (7, 50.975107154632376),\n", " (b'an', b'awful'): (13, 15.611498532706445),\n", " (b'Lady', b'Russell'): (147, 1370.6424223505542),\n", " (b'Anne', b'Elliot'): (23, 69.51776079136691),\n", " (b'Miss', b'Elliot'): (48, 81.92993320516614),\n", " (b'everybody', b'else'): (20, 116.64529027877325),\n", " (b'her', b'mistress'): (30, 10.118550146240644),\n", " (b'Mr', b'Elliot'): (174, 154.42474881796687),\n", " (b'Mr', b'Shepherd'): (26, 153.51099290780144),\n", " (b'anybody', b'else'): (21, 167.7979955569876),\n", " (b'reference', b'to'): (30, 10.087797423886824),\n", " (b'an', b'honest'): (28, 22.11150665340132),\n", " (b'descend', b'into'): (11, 19.585341264772477),\n", " (b'Mrs', b'Clay'): (66, 287.0487212850306),\n", " (b'Miss', b'Anne'): (19, 13.817194691451805),\n", " (b'their', b'fathers'): (151, 21.038778684865168),\n", " (b'an', b'example'): (14, 17.829040937920432),\n", " (b'Admiral', b'Croft'): (14, 1020.8859134262656),\n", " (b'Mrs', b'Croft'): (41, 207.3760688537417),\n", " (b'walked', b'along'): (8, 11.42244112667385),\n", " (b'Frederick', b'Wentworth'): (6, 
23.25274477365017),\n", " (b'either', b'side'): (18, 19.36359410488185),\n", " (b'Captain', b'Wentworth'): (196, 976.2801057938673),\n", " (b'eldest', b'son'): (15, 39.75891221190009),\n", " (b'removed', b'from'): (36, 14.303920434832891),\n", " (b'good', b'humour'): (23, 57.21965210954848),\n", " (b'The', b'Crofts'): (8, 10.24976796605675),\n", " (b'startled', b'by'): (14, 14.177381886354143),\n", " (b'most', b'important'): (14, 33.80609933127228),\n", " (b'replied', b'Anne'): (11, 13.874646644430818),\n", " (b'at', b'Uppercross'): (20, 13.940450893702454),\n", " (b'Great', b'House'): (13, 1177.9619047619049),\n", " (b'left', b'alone'): (16, 14.135954084898053),\n", " (b'Mr', b'Musgrove'): (21, 32.3891325695581),\n", " (b'Miss', b'Musgroves'): (22, 227.52634882160712),\n", " (b'Mrs', b'Musgrove'): (66, 156.77276316336287),\n", " (b'flower', b'-'): (23, 13.524986331328593),\n", " (b'grown', b'up'): (19, 12.21909069739544),\n", " (b'their', b'faces'): (62, 22.65730692397282),\n", " (b'surprised', b'at'): (27, 14.056621317816642),\n", " (b'ere', b'long'): (20, 33.88286215209292),\n", " (b'anything', b'else'): (30, 74.81176994319226),\n", " (b'quite', b'different'): (12, 19.349519727167486),\n", " (b'their', b'sakes'): (13, 16.878317708546554),\n", " (b'twentieth', b'year'): (13, 185.54755475547557),\n", " (b'on', b'board'): (69, 34.25748298789808),\n", " (b'eight', b'years'): (22, 61.898344402053596),\n", " (b'-', b'bone'): (22, 13.4993708269132),\n", " (b'their', b'heads'): (77, 21.621494582846132),\n", " (b'Your', b'sister'): (11, 24.014833799316555),\n", " (b'dressing', b'-'): (19, 29.64567711008389),\n", " (b'up', b'stairs'): (15, 14.512707389763687),\n", " (b'waited', b'till'): (7, 12.054695723363611),\n", " (b'third', b'part'): (39, 80.67984559777145),\n", " (b'Phoo', b'!'): (7, 23.96318899544706),\n", " (b'dear', b'fellow'): (11, 20.631526271893247),\n", " (b'good', b'cheer'): (14, 58.85449931267844),\n", " (b'Mrs', b'Harville'): (24, 84.64015847289754),\n", " 
(b'\"', b'Ay'): (34, 18.45776612748019),\n", " (b'fifteen', b'years'): (9, 52.059683902603275),\n", " (b'Charles', b'Hayter'): (33, 2649.332925336598),\n", " (b'came', b'near'): (42, 11.627447632578933),\n", " (b'Her', b'husband'): (9, 21.944148747427437),\n", " (b'two', b'hundred'): (102, 34.52951381693766),\n", " (b'Dr', b'Shirley'): (9, 1086.3943785682916),\n", " (b'went', b'up'): (206, 10.893037774183895),\n", " (b'within', b'reach'): (7, 18.479352178330245),\n", " (b'-', b'yard'): (19, 16.03782532184866),\n", " (b'turn', b'back'): (15, 10.596177405398922),\n", " (b'walking', b'along'): (8, 19.124729409339242),\n", " (b'leaning', b'against'): (12, 23.700473570392266),\n", " (b'trodden', b'under'): (9, 72.2464953271028),\n", " (b'under', b'foot'): (15, 22.229690869877782),\n", " (b'Louisa', b'Musgrove'): (15, 189.5280416794361),\n", " (b'provoke', b'me'): (18, 16.48970776450512),\n", " (b'Very', b'good'): (11, 10.325350756610252),\n", " (b'good', b'humoured'): (9, 26.157555250079305),\n", " (b'Captain', b'Harville'): (37, 475.4296696696697),\n", " (b'at', b'Lyme'): (24, 20.29341259451412),\n", " (b'earnest', b'desire'): (6, 32.79056203605514),\n", " (b'Captain', b'Benwick'): (56, 811.83861003861),\n", " (b'an', b'officer'): (9, 10.895525017618041),\n", " (b'place', b'where'): (114, 29.883792170944712),\n", " (b'-', b'coat'): (34, 13.074153453617642),\n", " (b'an', b'introduction'): (7, 10.895525017618041),\n", " (b'preceding', b'evening'): (6, 48.94191199746755),\n", " (b'an', b'agony'): (10, 15.203058164118199),\n", " (b'catching', b'hold'): (10, 135.53144860837168),\n", " (b'raised', b'up'): (35, 20.917757019882853),\n", " (b'could', b'scarcely'): (17, 17.384325631078877),\n", " (b'passed', b'along'): (11, 16.544408774745854),\n", " (b'leaning', b'over'): (16, 33.66105049605383),\n", " (b't', b'talk'): (19, 11.570685526118844),\n", " (b'Camden', b'Place'): (29, 11505.67441860465),\n", " (b'straight', b'forward'): (6, 11.997865942380443),\n", " (b'same', 
b'hour'): (17, 12.921871463147081),\n", " (b'-', b'glasses'): (16, 16.354682053131388),\n", " (b'poring', b'over'): (6, 41.31128924515698),\n", " (b'thirty', b'feet'): (8, 12.120929017084244),\n", " (b'Colonel', b'Wallis'): (23, 967.3885461023725),\n", " (b'Mrs', b'Wallis'): (11, 54.17933330413071),\n", " (b'-', b'haired'): (38, 60.68447814451383),\n", " (b'at', b'length'): (74, 17.89024531358482),\n", " (b'carried', b'away'): (73, 68.93872057625379),\n", " (b'greater', b'than'): (56, 48.18287227996847),\n", " (b'Miss', b'Carteret'): (12, 320.09834368530016),\n", " (b'contact', b'with'): (11, 11.02567302549474),\n", " (b'Lady', b'Dalrymple'): (25, 1027.2923588039866),\n", " (b'Laura', b'Place'): (7, 777.4104336895035),\n", " (b'be', b'established'): (41, 13.256300819785361),\n", " (b'Mrs', b'Smith'): (64, 112.00140587397476),\n", " (b'Westgate', b'Buildings'): (7, 8589.305555555555),\n", " (b'buried', b'him'): (40, 10.212164360501543),\n", " (b'at', b'liberty'): (25, 14.03156495183123),\n", " (b'human', b'nature'): (9, 43.511573911208046),\n", " (b'five', b'thousand'): (31, 37.911149464312665),\n", " (b'whose', b'names'): (9, 20.291362480518416),\n", " (b'her', b'ladyship'): (21, 34.12286449316844),\n", " (b'-', b'maker'): (21, 26.620608017218185),\n", " (b'old', b'gentleman'): (31, 27.054706126823717),\n", " (b'almost', b'entirely'): (13, 36.1061120233534),\n", " (b'lower', b'part'): (8, 15.927696983224877),\n", " (b'staring', b'at'): (33, 25.046343439018745),\n", " (b'an', b'oath'): (37, 44.98797426629384),\n", " (b'wiser', b'than'): (8, 29.3735157214781),\n", " (b'prejudice', b'against'): (7, 39.17833386126069),\n", " (b'both', b'sides'): (30, 99.67218081951572),\n", " (b'my', b'soul'): (234, 16.443748679233014),\n", " (b'rejoice', b'over'): (13, 10.117050427385381),\n", " (b'same', b'instant'): (19, 25.16281097419205),\n", " (b'every', b'one'): (375, 14.671605951506955),\n", " (b'their', b'seats'): (11, 12.658738281409915),\n", " (b'their', b'mouths'): (24, 
26.497528436510585),\n", " (b'short', b'silence'): (9, 19.427323846323),\n", " (b'-', b'blooded'): (7, 17.469774011299435),\n", " (b'general', b'character'): (6, 10.833683694205032),\n", " (b'fifty', b'pounds'): (6, 35.38131472052177),\n", " (b'be', b'saved'): (61, 13.40991848436365),\n", " (b'threw', b'himself'): (8, 10.030058440961655),\n", " (b'some', b'moments'): (13, 21.006453804347828),\n", " (b'exclaimed', b'Mrs'): (11, 12.149305043956582),\n", " (b'compassion', b'on'): (20, 13.012675380640166),\n", " (b'an', b'explanation'): (12, 16.343287526427062),\n", " (b'our', b'hearts'): (17, 14.94442027934851),\n", " (b'minutes', b'afterwards'): (7, 22.217312424781305),\n", " (b'make', b'haste'): (26, 50.38536744337624),\n", " (b\"'\", b'n'): (26, 22.53339140030468),\n", " (b'n', b\"'\"): (20, 16.0952795716462),\n", " (b'rising', b'sun'): (7, 13.065928609910946),\n", " (b'-', b'faced'): (30, 42.60920490560838),\n", " (b'an', b'atonement'): (66, 89.61263272917309),\n", " (b'atonement', b'for'): (64, 24.316159515907604),\n", " (b'\"', b'Look'): (42, 10.092670148523652),\n", " (b'Look', b'here'): (14, 26.312064784218066),\n", " ...}" ] }, "metadata": {}, "execution_count": 23 } ] }, { "cell_type": "code", "metadata": { "id": "oxaBepqAX3rm" }, "source": [ "tokenized_sentence = \"Jon lives in New York City\".split()" ], "execution_count": 24, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "4kmHOCsxX3rm", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "c9ff3145-1fa7-4548-b71f-6363ec4d21a6" }, "source": [ "tokenized_sentence" ], "execution_count": 25, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "['Jon', 'lives', 'in', 'New', 'York', 'City']" ] }, "metadata": {}, "execution_count": 25 } ] }, { "cell_type": "code", "metadata": { "id": "wjUMjN4MX3rm", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "0c135cd5-7e42-465a-c310-076f47d7990c" }, "source": [ "bigram[tokenized_sentence]" ], "execution_count": 
26, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "['Jon', 'lives', 'in', 'New_York', 'City']" ] }, "metadata": {}, "execution_count": 26 } ] }, { "cell_type": "markdown", "metadata": { "id": "eSBmTPl_X3rm" }, "source": [ "#### Preprocess the corpus." ] }, { "cell_type": "code", "metadata": { "id": "RkZHLZ48X3rm" }, "source": [ "# Following Maas et al. (2011):\n", "# - keep stop words (they can indicate sentiment)\n", "# - skip stemming (the model learns similar representations for words that share a stem)\n", "lower_sents = []\n", "for s in gberg_sents:\n", "    lower_sents.append([w.lower() for w in s if w.lower()\n", "                        not in list(string.punctuation)])" ], "execution_count": 27, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "edDv8S3vX3rm", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "4428dbbe-0242-4d65-e428-7df58f2eeab0" }, "source": [ "lower_sents[0:5]" ], "execution_count": 28, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "[['emma', 'by', 'jane', 'austen', '1816'],\n", " ['volume', 'i'],\n", " ['chapter', 'i'],\n", " ['emma',\n", " 'woodhouse',\n", " 'handsome',\n", " 'clever',\n", " 'and',\n", " 'rich',\n", " 'with',\n", " 'a',\n", " 'comfortable',\n", " 'home',\n", " 'and',\n", " 'happy',\n", " 'disposition',\n", " 'seemed',\n", " 'to',\n", " 'unite',\n", " 'some',\n", " 'of',\n", " 'the',\n", " 'best',\n", " 'blessings',\n", " 'of',\n", " 'existence',\n", " 'and',\n", " 'had',\n", " 'lived',\n", " 'nearly',\n", " 'twenty',\n", " 'one',\n", " 'years',\n", " 'in',\n", " 'the',\n", " 'world',\n", " 'with',\n", " 'very',\n", " 'little',\n", " 'to',\n", " 'distress',\n", " 'or',\n", " 'vex',\n", " 'her'],\n", " ['she',\n", " 'was',\n", " 'the',\n", " 'youngest',\n", " 'of',\n", " 'the',\n", " 'two',\n", " 'daughters',\n", " 'of',\n", " 'a',\n", " 'most',\n", " 'affectionate',\n", " 'indulgent',\n", " 'father',\n", " 'and',\n", " 'had',\n", " 'in',\n", " 'consequence',\n", " 'of',\n", " 'her',\n", " 'sister',\n", " 's',\n", " 'marriage',\n", 
'been',\n", " 'mistress',\n", " 'of',\n", " 'his',\n", " 'house',\n", " 'from',\n", " 'a',\n", " 'very',\n", " 'early',\n", " 'period']]" ] }, "metadata": {}, "execution_count": 28 } ] }, { "cell_type": "code", "metadata": { "id": "z0AV2_lpX3rn" }, "source": [ "lower_bigram = Phraser(Phrases(lower_sents))" ], "execution_count": 29, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "8S-rlcShX3rn", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "ab9d7a6d-4ce0-4ffc-e525-1d38e8e96be2" }, "source": [ "lower_bigram.phrasegrams # miss taylor, mr woodhouse, mr weston" ], "execution_count": 30, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "{(b'two', b'daughters'): (19, 11.080894472729938),\n", " (b'her', b'sister'): (201, 16.93985297075414),\n", " (b'very', b'early'): (25, 10.517085686126077),\n", " (b'her', b'mother'): (253, 10.708214678014947),\n", " (b'long', b'ago'): (38, 59.22693146255728),\n", " (b'more', b'than'): (562, 28.530162383333433),\n", " (b'had', b'been'): (1260, 21.58337149316804),\n", " (b'an', b'excellent'): (58, 37.41890603576534),\n", " (b'sixteen', b'years'): (15, 131.43021613989356),\n", " (b'miss', b'taylor'): (48, 420.4375727213963),\n", " (b'mr', b'woodhouse'): (132, 104.1999395195192),\n", " (b'very', b'fond'): (30, 24.18592621729314),\n", " (b'passed', b'away'): (25, 11.75157033589844),\n", " (b'too', b'much'): (177, 30.363341094363626),\n", " (b'did', b'not'): (977, 10.846285856844784),\n", " (b'any', b'means'): (28, 14.29426622702947),\n", " (b'after', b'dinner'): (22, 18.607525024015455),\n", " (b'mr', b'weston'): (162, 91.63366549621855),\n", " (b'five', b'years'): (42, 37.66459722425521),\n", " (b'years', b'old'): (176, 48.59949606902839),\n", " (b'seven', b'years'): (53, 50.334976394001785),\n", " (b'each', b'other'): (239, 71.31335962645632),\n", " (b'well', b'informed'): (8, 14.185145241835274),\n", " (b'a', b'mile'): (49, 11.700207443348628),\n", " (b'difference', b'between'): (44, 
207.8695602382043),\n", " (b'mrs', b'weston'): (249, 180.67939002300875),\n", " (b'could', b'not'): (1059, 10.213417567175872),\n", " (b'having', b'been'): (49, 10.723839064161645),\n", " (b'sixteen', b'miles'): (6, 105.04149305555556),\n", " (b'miles', b'off'): (16, 32.99209331376903),\n", " (b'at', b'hartfield'): (67, 25.556203673424893),\n", " (b'her', b'husband'): (168, 26.67864790728878),\n", " (b'in', b'spite'): (105, 13.346546665784308),\n", " (b'emma', b'could'): (61, 10.886178015450438),\n", " (b'every', b'body'): (148, 39.26143302367287),\n", " (b'no', b'means'): (80, 26.766268123208324),\n", " (b'able', b'to'): (349, 10.854560919017272),\n", " (b'very', b'much'): (241, 15.432046093862464),\n", " (b'have', b'been'): (986, 17.20636681938197),\n", " (b'great', b'deal'): (182, 110.17005431763471),\n", " (b'agree', b'with'): (26, 13.12659190346559),\n", " (b'good', b'humoured'): (30, 149.07578968117085),\n", " (b'for', b'ever'): (565, 10.477931089592536),\n", " (b'three', b'times'): (41, 38.14473048229484),\n", " (b'my', b'dear'): (340, 26.34347480365353),\n", " (b'last', b'night'): (70, 23.230729926576135),\n", " (b'doubt', b'whether'): (12, 19.56454034377786),\n", " (b'anywhere', b'else'): (6, 15.306592794980771),\n", " (b'i', b'am'): (2445, 16.33041776770322),\n", " (b'very', b'glad'): (46, 16.952677708033637),\n", " (b'am', b'sure'): (282, 60.92117707314396),\n", " (b'very', b'pretty'): (40, 18.02800780155912),\n", " (b'be', b'able'): (121, 10.915362249869167),\n", " (b'immediately', b'afterwards'): (10, 37.531108492029034),\n", " (b'mr', b'knightley'): (277, 179.5673552402442),\n", " (b'sensible', b'man'): (17, 13.46925645592164),\n", " (b'intimate', b'friend'): (6, 20.194893190921228),\n", " (b'connected', b'with'): (31, 16.865252849915358),\n", " (b'elder', b'brother'): (6, 14.418048803736536),\n", " (b'than', b'usual'): (30, 27.96507779799145),\n", " (b'brunswick', b'square'): (11, 2374.2537606278615),\n", " (b'some', b'time'): (149, 
11.67826679527302),\n", " (b'poor', b'isabella'): (11, 43.037237258598005),\n", " (b'am', b'afraid'): (65, 24.404507887593226),\n", " (b'moonlight', b'night'): (6, 13.233573928258966),\n", " (b'look', b'at'): (188, 10.167753397802551),\n", " (b'vast', b'deal'): (11, 58.66570782159018),\n", " (b'an', b'hour'): (155, 40.46495358980144),\n", " (b'pretty', b'well'): (22, 13.99149138187848),\n", " (b'tolerably', b'well'): (7, 13.77985537778284),\n", " (b'miss', b'woodhouse'): (173, 272.89862807742907),\n", " (b'you', b'please'): (94, 10.458101345278243),\n", " (b'any', b'rate'): (47, 81.39512045124775),\n", " (b'very', b'true'): (50, 13.110826412772036),\n", " (b',\"', b'said'): (2585, 35.20909116868378),\n", " (b'my', b'dearest'): (20, 15.98957177137179),\n", " (b'so', b'much'): (501, 16.68921990823078),\n", " (b'much', b'less'): (40, 18.957027837059744),\n", " (b'any', b'body'): (93, 20.814200280371566),\n", " (b'has', b'been'): (266, 28.015388177187834),\n", " (b'been', b'used'): (29, 13.604916942276521),\n", " (b'dear', b'emma'): (33, 26.724674718257425),\n", " (b'every', b'thing'): (258, 26.81380543889372),\n", " (b'very', b'sorry'): (34, 20.451800038697975),\n", " (b'turned', b'away'): (50, 18.475990225490946),\n", " (b'divided', b'between'): (10, 32.86826379834854),\n", " (b'how', b'much'): (142, 13.223254805593912),\n", " (b'four', b'years'): (23, 17.088037732474124),\n", " (b'years', b'ago'): (56, 157.9226942623328),\n", " (b'any', b'thing'): (384, 34.60382364241243),\n", " (b'oh', b'dear'): (22, 13.071118312364975),\n", " (b'need', b'not'): (108, 12.811139000583829),\n", " (b'ever', b'since'): (68, 42.54674314875834),\n", " (b'leave', b'off'): (19, 10.45663799403504),\n", " (b'match', b'making'): (6, 19.51486904915495),\n", " (b'young', b'lady'): (73, 46.68733304737199),\n", " (b'depend', b'upon'): (45, 79.94120406154416),\n", " (b'more', b'likely'): (16, 10.639357811071253),\n", " (b'have', b'done'): (272, 12.079650113150391),\n", " (b',\"', b'rejoined'): (6, 
10.723078831702821),\n", " (b'mr', b'elton'): (214, 139.40990783410138),\n", " (b'any', b'longer'): (32, 15.647155677008353),\n", " (b'very', b'well'): (211, 12.391214426766558),\n", " (b'young', b'man'): (266, 24.286534956791346),\n", " (b'dine', b'with'): (23, 13.166490359099223),\n", " (b'much', b'better'): (40, 10.497942188943838),\n", " (b'i', b'dare'): (138, 13.033593291934318),\n", " (b'dare', b'say'): (115, 119.19858224619772),\n", " (b'take', b'care'): (71, 74.57188636223579),\n", " (b'chapter', b'ii'): (11, 279.3347183748846),\n", " (b'entering', b'into'): (14, 14.76425085407516),\n", " (b'never', b'seen'): (42, 13.0265069565268),\n", " (b'mrs', b'churchill'): (59, 72.7025861493478),\n", " (b'refrain', b'from'): (10, 13.332723666813575),\n", " (b'at', b'once'): (270, 18.77244440167035),\n", " (b'three', b'years'): (80, 36.12190386580951),\n", " (b'mother', b's'): (212, 10.433577049626086),\n", " (b'twenty', b'years'): (71, 82.70079278294149),\n", " (b'according', b'to'): (792, 11.428318841579392),\n", " (b'had', b'begun'): (25, 11.498991764971958),\n", " (b'passed', b'through'): (45, 29.981665287433785),\n", " (b'its', b'being'): (58, 15.03567583737666),\n", " (b'deal', b'better'): (14, 18.324643289810204),\n", " (b'belonging', b'to'): (36, 10.00745855749289),\n", " (b'mr', b'frank'): (50, 51.13428151809727),\n", " (b'frank', b'churchill'): (151, 1615.1483580779638),\n", " (b'mrs', b'perry'): (11, 23.55291278198416),\n", " (b'miss', b'bates'): (113, 368.53088940274097),\n", " (b'a', b'few'): (452, 11.993753858188938),\n", " (b'few', b'days'): (53, 34.43918034342092),\n", " (b'i', b'suppose'): (210, 11.32108319421537),\n", " (b'very', b'handsome'): (21, 18.293355102534452),\n", " (b'an', b'irresistible'): (7, 10.743738402393657),\n", " (b'good', b'sense'): (28, 15.484646541076454),\n", " (b'had', b'already'): (64, 11.161440716387299),\n", " (b'long', b'enough'): (39, 14.424680952514514),\n", " (b'at', b'randalls'): (39, 24.915008599181263),\n", " (b'few', 
b'weeks'): (19, 130.1719018932874),\n", " (b'no', b'longer'): (117, 37.27195664059191),\n", " (b'mr', b'perry'): (36, 95.91613823715916),\n", " (b'chapter', b'iii'): (10, 294.8533138401559),\n", " (b'donwell', b'abbey'): (9, 737.1781906792567),\n", " (b'card', b'table'): (7, 52.05532134560785),\n", " (b'drawing', b'room'): (49, 219.78168548972988),\n", " (b'thrown', b'away'): (11, 13.99171343117908),\n", " (b'mrs', b'goddard'): (58, 292.68153482471274),\n", " (b'an', b'invitation'): (13, 13.178985773602884),\n", " (b'mrs', b'bates'): (30, 54.66699555102586),\n", " (b'those', b'who'): (174, 14.383367684078921),\n", " (b'as', b'possible'): (81, 10.51537185153746),\n", " (b'young', b'ladies'): (47, 111.39299334578259),\n", " (b'old', b'fashioned'): (38, 181.02950323229942),\n", " (b'coming', b'back'): (15, 10.421088167057363),\n", " (b'goddard', b's'): (34, 30.051743214173165),\n", " (b'found', b'herself'): (27, 10.853841368001415),\n", " (b's', b'sake'): (143, 27.305132602365575),\n", " (b'much', b'pleased'): (18, 12.52916167550209),\n", " (b'miss', b'smith'): (58, 148.87908909420122),\n", " (b'harriet', b'smith'): (31, 171.76221256523925),\n", " (b'several', b'years'): (10, 16.07162969101959),\n", " (b'pretty', b'girl'): (10, 36.10104059762763),\n", " (b'blue', b'eyes'): (28, 33.954794112767054),\n", " (b'due', b'time'): (18, 19.500789650495978),\n", " (b'its', b'own'): (54, 10.339621687357946),\n", " (b'an', b'egg'): (7, 16.473732217003608),\n", " (b'better', b'than'): (175, 40.469371165789006),\n", " (b'body', b'else'): (31, 37.603598047511824),\n", " (b'much', b'more'): (163, 10.196819531234667),\n", " (b'little', b'girl'): (54, 33.82072690767634),\n", " (b'at', b'last'): (512, 25.084326428238846),\n", " (b'chapter', b'iv'): (8, 252.7314118629908),\n", " (b'every', b'respect'): (14, 10.883823423596287),\n", " (b'guided', b'by'): (14, 22.059530561317086),\n", " (b'different', b'sort'): (8, 13.936709152334153),\n", " (b'abbey', b'mill'): (11, 
1868.3654143077713),\n", " (b'good', b'deal'): (62, 34.98896475458069),\n", " (b'very', b'happy'): (45, 10.60950295929063),\n", " (b'mrs', b'martin'): (8, 10.98253798261059),\n", " (b'drink', b'tea'): (7, 29.382235819735826),\n", " (b'large', b'enough'): (11, 10.328422669853191),\n", " (b'had', b'taken'): (121, 10.546982214197534),\n", " (b'mr', b'martin'): (37, 92.33536178249174),\n", " (b'three', b'miles'): (9, 15.396961522801304),\n", " (b'thing', b'else'): (26, 11.724198736581018),\n", " (b'very', b'obliging'): (14, 23.82950204145935),\n", " (b'on', b'purpose'): (36, 10.243987813963042),\n", " (b'very', b'clever'): (15, 19.348740973834513),\n", " (b'miss', b'nash'): (13, 312.8837750484809),\n", " (b'does', b'not'): (218, 11.75597879763697),\n", " (b'oh', b'yes'): (33, 23.312359983486996),\n", " (b'very', b'entertaining'): (7, 15.092017959590923),\n", " (b'soon', b'as'): (277, 10.479021912326246),\n", " (b'have', b'seen'): (204, 12.72963685420248),\n", " (b'on', b'horseback'): (21, 51.189898049832905),\n", " (b'their', b'families'): (95, 33.21834852311409),\n", " (b'no', b'doubt'): (125, 34.594842365372315),\n", " (b'very', b'respectable'): (9, 10.061345306393948),\n", " (b'respectable', b'young'): (8, 26.124309153713302),\n", " (b'very', b'odd'): (20, 23.338172102460184),\n", " (b'perfectly', b'right'): (12, 15.18214895111914),\n", " (b'six', b'years'): (23, 29.166705499538022),\n", " (b'years', b'hence'): (10, 17.922302200894578),\n", " (b'young', b'woman'): (57, 28.12555022760024),\n", " (b'very', b'desirable'): (9, 13.720016326900836),\n", " (b'dear', b'miss'): (39, 23.615886026541766),\n", " (b'thirty', b'years'): (36, 70.51736596736596),\n", " (b'can', b'afford'): (11, 24.000753694092754),\n", " (b'good', b'luck'): (24, 50.83866673742493),\n", " (b'acquainted', b'with'): (88, 25.94064590446771),\n", " (b'harriet', b's'): (91, 10.391341493029481),\n", " (b'next', b'day'): (103, 31.5528616052661),\n", " (b'an', b'opportunity'): (36, 39.48600763353957),\n", " 
(b'few', b'yards'): (15, 121.4937751004016),\n", " (b'robert', b'martin'): (31, 1822.1955287848953),\n", " (b'few', b'minutes'): (86, 306.2550554916762),\n", " (b'been', b'able'): (40, 15.56410454288213),\n", " (b'should', b'happen'): (13, 19.568675965231453),\n", " (b'compared', b'with'): (28, 15.20617102370327),\n", " (b'well', b'bred'): (15, 56.080806770046436),\n", " (b'an', b'old'): (175, 10.0932285327629),\n", " (b'old', b'man'): (225, 11.391586430765056),\n", " (b'more', b'valuable'): (10, 16.9262510630679),\n", " (b',\"', b'replied'): (256, 67.14742919145301),\n", " (b'very', b'bad'): (37, 14.502840081288573),\n", " (b'deal', b'too'): (15, 11.983343236284412),\n", " (b'no', b'more'): (597, 15.083646516961776),\n", " (b'good', b'humour'): (28, 64.86811388829325),\n", " (b'very', b'agreeable'): (21, 20.122690612787896),\n", " (b'fixed', b'on'): (32, 10.217338522043262),\n", " (b'same', b'time'): (104, 17.450627768409742),\n", " (b'pleasing', b'young'): (8, 22.39226498889711),\n", " (b'chapter', b'v'): (7, 186.22314558325638),\n", " (b'very', b'differently'): (14, 45.276053878772764),\n", " (b'twelve', b'years'): (25, 44.13701288280007),\n", " (b'very', b'neatly'): (7, 21.56002565655846),\n", " (b'ten', b'years'): (32, 33.47750788468987),\n", " (b'being', b'able'): (20, 13.768660452584244),\n", " (b'have', b'spoken'): (82, 11.097296183547952),\n", " (b'yes', b',\"'): (117, 21.319257322202063),\n", " (b'thank', b'you'): (105, 18.729503350352864),\n", " (b'could', b'possibly'): (21, 29.277733420435023),\n", " (b'grown', b'up'): (21, 13.110608209041132),\n", " (b'any', b'harm'): (11, 11.883432028204146),\n", " (b'an', b'angel'): (58, 20.85448584795839),\n", " (b'excuse', b'me'): (31, 17.30538907663604),\n", " (b'an', b'end'): (129, 17.098851519880977),\n", " (b'many', b'years'): (55, 18.586925601179164),\n", " (b',\"', b'cried'): (297, 33.94188638327614),\n", " (b'much', b'obliged'): (41, 42.51058597592393),\n", " (b'mrs', b'john'): (39, 22.746858271991545),\n", 
" (b'john', b'knightley'): (58, 169.27026599029787),\n", " (b'be', b'satisfied'): (68, 11.946162525033142),\n", " (b'ill', b'humour'): (6, 26.632582093494147),\n", " (b'i', b'assure'): (105, 12.733360622358129),\n", " (b'assure', b'you'): (126, 28.43663511862174),\n", " (b'soon', b'afterwards'): (38, 78.59659889385321),\n", " (b'chapter', b'vi'): (6, 126.3657059314954),\n", " (b'most', b'agreeable'): (13, 26.484526154519585),\n", " (b'no', b'scruple'): (10, 22.499181900667867),\n", " (b'infinitely', b'superior'): (7, 270.28769265132905),\n", " (b'am', b'glad'): (34, 16.093602872722435),\n", " (b'very', b'interesting'): (15, 16.768908843989912),\n", " (b'no', b'sooner'): (40, 38.535832829867296),\n", " (b'don', b't'): (830, 258.780833955456),\n", " (b't', b'pretend'): (9, 21.45528368794326),\n", " (b'why', b'should'): (100, 17.7206514366753),\n", " (b'cannot', b'imagine'): (13, 46.70673151150224),\n", " (b'back', b'again'): (74, 17.518919801292988),\n", " (b'an', b'artist'): (10, 15.444123953440881),\n", " (b'higher', b'than'): (34, 44.04193566200464),\n", " (b'ten', b'times'): (17, 32.73356326503009),\n", " (b'mr', b'john'): (33, 14.765125870249578),\n", " (b'must', b'allow'): (12, 15.1281361623089),\n", " (b'sitting', b'down'): (24, 16.50991895653123),\n", " (b'must', b'confess'): (10, 11.475198100360734),\n", " (b'depended', b'on'): (14, 19.196211768687338),\n", " (b'after', b'breakfast'): (11, 10.176904121801002),\n", " (b'sooner', b'than'): (12, 15.493843103397815),\n", " (b'at', b'home'): (158, 14.817295862949209),\n", " (b'at', b'least'): (318, 40.00546011393844),\n", " (b'yes', b'indeed'): (18, 13.328024495550268),\n", " (b'replied', b'emma'): (16, 15.977656729388833),\n", " (b'can', b'hardly'): (33, 23.714099741177424),\n", " (b'am', b'persuaded'): (12, 13.900999100678101),\n", " (b'beg', b'your'): (40, 41.76970841001303),\n", " (b'your', b'pardon'): (42, 35.8319536079339),\n", " (b'tell', b'me'): (198, 10.69565177169633),\n", " (b'entered', b'into'): (99, 
56.4162431009376),\n", " (b'older', b'than'): (15, 56.232480761366595),\n", " (b'run', b'away'): (47, 37.87484965056332),\n", " (b'have', b'borne'): (26, 10.466767991300909),\n", " (b'good', b'opinion'): (19, 13.341942820780586),\n", " (b'good', b'natured'): (66, 159.1384054846499),\n", " (b'emma', b'felt'): (19, 16.140645126868346),\n", " (b'no', b'difficulty'): (16, 11.983774780776779),\n", " (b'let', b'us'): (399, 31.769872582196264),\n", " (b'cried', b'emma'): (27, 13.88482906384106),\n", " (b'bond', b'street'): (8, 172.3757834757835),\n", " (b'some', b'weeks'): (10, 11.200775302864253),\n", " (b'next', b'morning'): (69, 78.1475383699458),\n", " (b'without', b'ceremony'): (6, 10.070891174806086),\n", " (b'dear', b'sir'): (24, 14.956727816809783),\n", " (b'sat', b'down'): (150, 55.27743735890119),\n", " (b'depends', b'upon'): (8, 17.986770913847433),\n", " (b'has', b'happened'): (14, 15.99137490529135),\n", " (b'presently', b'added'): (6, 18.898019740129936),\n", " (b'could', b'afford'): (11, 15.539720046230899),\n", " (b'does', b'seem'): (9, 12.372353151679363),\n", " (b'few', b'moments'): (43, 372.3196333721985),\n", " (b'nobody', b'knows'): (7, 29.65223357592688),\n", " (b'very', b'likely'): (33, 27.440032653801673),\n", " (b'good', b'tempered'): (10, 28.20352777751881),\n", " (b'all', b'probability'): (16, 12.433663070384382),\n", " (b'no', b'harm'): (25, 22.74642565781806),\n", " (b'cannot', b'help'): (16, 18.92381204221828),\n", " (b'very', b'different'): (29, 16.464019592281005),\n", " (b'common', b'sense'): (30, 161.56777397991883),\n", " (b'an', b'hundred'): (186, 29.194636402849085),\n", " (b'every', b'man'): (327, 11.997511974569205),\n", " (b'less', b'than'): (90, 36.539494213739246),\n", " (b'large', b'fortune'): (9, 31.791868637110017),\n", " (b'no', b'use'): (43, 12.937029592884024),\n", " (b'these', b'words'): (121, 23.298153994257166),\n", " (b'twenty', b'thousand'): (49, 74.70455718935906),\n", " (b'thousand', b'pounds'): (48, 
447.52175109658555),\n", " (b'walked', b'off'): (15, 10.460708308551986),\n", " (b'cast', b'down'): (44, 13.997539984885172),\n", " (b'its', b'effects'): (8, 41.35228049391716),\n", " (b'deal', b'more'): (29, 10.514188895646884),\n", " (b'longer', b'than'): (32, 18.06310234103062),\n", " (b'perfectly', b'satisfied'): (12, 88.02579290850896),\n", " (b'three', b'hundred'): (78, 46.954702502955406),\n", " (b'well', b'known'): (41, 14.093033909096086),\n", " (b'destin', b'd'): (7, 66.59024873431653),\n", " (b'looking', b'at'): (107, 11.578403307709019),\n", " (b'next', b'moment'): (22, 24.11220301236594),\n", " (b'ready', b'wit'): (8, 64.00497196657146),\n", " (b'very', b'pleasant'): (20, 11.43334693908403),\n", " (b'an', b'idea'): (37, 13.223062649099885),\n", " (b'arrive', b'at'): (10, 10.99191555846232),\n", " (b'nobody', b'could'): (24, 14.34346357032523),\n", " (b'have', b'chosen'): (38, 11.818762830211462),\n", " (b'without', b'exception'): (8, 49.916591040343214),\n", " (b'her', b'cheeks'): (14, 10.713121301309494),\n", " (b'sit', b'down'): (61, 35.66534421470248),\n", " (b'reason', b'why'): (21, 17.06402120878811),\n", " (b'could', b'hardly'): (47, 23.031181176009962),\n", " (b'an', b'offering'): (71, 10.631678549435184),\n", " (b'can', b'easily'): (12, 13.582516083099756),\n", " (b'dear', b'mother'): (25, 13.62829366741899),\n", " (b'those', b'things'): (86, 14.8616035851137),\n", " (b'next', b'week'): (13, 53.88002448934157),\n", " (b'taken', b'away'): (75, 32.3942512515154),\n", " (b'stay', b'longer'): (10, 35.24389533529055),\n", " (b'three', b'days'): (100, 36.85571892072123),\n", " (b'cannot', b'bear'): (17, 14.755609208857674),\n", " (b'o', b'clock'): (67, 157.92042603351013),\n", " (b'ask', b'whether'): (11, 13.903668723357807),\n", " (b'ran', b'away'): (24, 14.924494326591024),\n", " (b'who', b'lived'): (27, 11.875088544445298),\n", " (b'never', b'mind'): (39, 14.649520876892138),\n", " (b'good', b'fortune'): (20, 17.889094761740502),\n", " (b'jane', 
b'fairfax'): (111, 878.2730646508635),\n", " (b'nothing', b'else'): (45, 30.00884087783841),\n", " (b'present', b'instance'): (6, 15.177578767810555),\n", " (b'once', b'more'): (141, 21.75868665121339),\n", " (b'still', b'greater'): (11, 11.400593546360994),\n", " (b'here', b'comes'): (20, 16.698101230888117),\n", " (b'turned', b'back'): (42, 27.420219470101824),\n", " (b'will', b'bring'): (145, 11.54256820986443),\n", " (b'each', b'side'): (37, 20.499807297291575),\n", " (b'still', b'remained'): (10, 12.133783892186747),\n", " (b'she', b'hoped'): (30, 15.572180252968563),\n", " (b'ten', b'minutes'): (42, 194.73753664413653),\n", " (b'most', b'favourable'): (6, 11.684349774052759),\n", " (b'ten', b'days'): (21, 15.980164743558),\n", " (b'little', b'ones'): (53, 58.130829972277546),\n", " (b'mr', b'wingfield'): (9, 102.72308998302208),\n", " (b'passed', b'over'): (52, 21.874958076020096),\n", " (b'yes', b'sir'): (25, 17.04815440969287),\n", " (b'sir', b',\"'): (121, 14.215738794028884),\n", " (b'cannot', b'deny'): (9, 32.95419389978213),\n", " (b'talking', b'about'): (25, 19.03311706185364),\n", " (b'never', b'forget'): (18, 19.811960968040946),\n", " (b'cannot', b'tell'): (30, 16.076959296294408),\n", " (b'two', b'years'): (53, 11.958267348869844),\n", " (b'indeed', b'!--'): (21, 14.904457500400301),\n", " (b'dear', b'madam'): (15, 86.09958447176685),\n", " (b'madam', b\",'\"): (7, 23.782979559748426),\n", " (b'most', b'amiable'): (9, 20.90883643777862),\n", " (b',\"', b'observed'): (18, 11.457536285929041),\n", " (b'five', b'times'): (10, 11.197582958562359),\n", " (b'our', b'lives'): (16, 17.148659373051412),\n", " (b'think', b'differently'): (6, 11.912561527859815),\n", " (b'grow', b'up'): (20, 10.504684266328379),\n", " (b'shake', b'hands'): (15, 50.540371218069744),\n", " (b'how', b'long'): (106, 12.747654023789359),\n", " (b'perfectly', b'convinced'): (8, 72.52055615486036),\n", " (b'tells', b'me'): (15, 12.056004001139769),\n", " (b'bad', b'cold'): (7, 
12.93191412052622),\n", " (b'far', b'off'): (83, 31.654147573003083),\n", " (b'am', b'sorry'): (32, 25.556309428082436),\n", " (b'mrs', b'campbell'): (9, 25.554140665420718),\n", " (b'ah', b'!\"'): (15, 17.2162909678631),\n", " (b'an', b'interval'): (11, 15.772722335428986),\n", " (b'perfectly', b'well'): (16, 10.916140577050193),\n", " (b'ill', b'judged'): (6, 16.561437604357703),\n", " (b'can', b'tell'): (52, 10.598771687692016),\n", " (b'morrow', b'morning'): (14, 21.336997025943646),\n", " (b'own', b'feelings'): (16, 10.441084233959339),\n", " (b'sore', b'throat'): (9, 237.4797370228633),\n", " (b'well', b'satisfied'): (17, 14.614998127951495),\n", " (b'looked', b'at'): (184, 11.997273688809486),\n", " (b'well', b'pleased'): (26, 18.6180031299088),\n", " (b'set', b'forward'): (22, 28.094460681216024),\n", " (b'eldest', b'daughter'): (10, 81.36619150080689),\n", " (b'short', b'time'): (23, 10.057907277428889),\n", " (b',\"', b'continued'): (103, 40.16032072000802),\n", " (b'dining', b'room'): (18, 272.72909153952844),\n", " (b'enter', b'into'): (110, 32.02988305119611),\n", " (b'half', b'hour'): (17, 17.306277530939532),\n", " (b'gone', b'through'): (24, 10.232780528794587),\n", " (b'turn', b'away'): (53, 24.325152043340207),\n", " (b'own', b'sake'): (17, 10.579315711019701),\n", " (b'an', b'effort'): (8, 12.355299162752706),\n", " (b',\"', b'repeated'): (29, 15.364411460350311),\n", " (b'several', b'times'): (19, 99.0012388966807),\n", " (b'great', b'curiosity'): (13, 12.092919602258533),\n", " (b'upper', b'end'): (11, 46.57012007389162),\n", " (b'an', b'odd'): (25, 25.474843634541664),\n", " (b'dearest', b'emma'): (8, 38.81440851937388),\n", " (b'continued', b'mrs'): (17, 12.451539878373788),\n", " (b'go', b'home'): (37, 10.007255826550702),\n", " (b'judge', b'between'): (11, 14.608117243710463),\n", " (b'hardly', b'knew'): (16, 28.982515807625983),\n", " (b'set', b'off'): (42, 12.027600179663624),\n", " (b'got', b'home'): (12, 12.615003589161626),\n", " 
(b'most', b'extraordinary'): (16, 42.01871937976666),\n", " (b'an', b'inch'): (28, 63.14930683184716),\n", " (b'at', b'ease'): (36, 15.97262729589056),\n", " (b'three', b'quarters'): (8, 46.1908845684039),\n", " (b'smith', b'!--'): (9, 17.971811322996494),\n", " (b'extremely', b'sorry'): (8, 71.2760936150161),\n", " (b'many', b'weeks'): (10, 19.402474377557436),\n", " (b'madam', b',\"'): (14, 12.063463685665674),\n", " (b'extremely', b'well'): (16, 22.290942522884006),\n", " (b'without', b'knowing'): (19, 31.149500610446733),\n", " (b'poor', b'harriet'): (15, 13.183343429017174),\n", " (b'an', b'instant'): (99, 42.5420557252291),\n", " (b'thirty', b'thousand'): (18, 40.06880794701987),\n", " (b'somebody', b'else'): (17, 161.27922164467546),\n", " (b'worth', b'having'): (9, 20.026446445121145),\n", " (b'poor', b'girl'): (16, 25.656814519548806),\n", " (b'laugh', b'at'): (34, 10.68487881101924),\n", " (b'knowing', b'what'): (22, 10.324648874148785),\n", " (b'many', b'days'): (50, 13.474789292130438),\n", " (b'whole', b'party'): (15, 20.807448930462893),\n", " (b'six', b'weeks'): (6, 16.914705060106233),\n", " (b'too', b'late'): (56, 82.78532737735924),\n", " (b'her', b'companions'): (37, 11.248377921136292),\n", " (b'drew', b'near'): (34, 127.06577013042501),\n", " (b'three', b'months'): (36, 79.13878668714453),\n", " (b'other', b'side'): (133, 27.098888471327907),\n", " (b'an', b'unnatural'): (10, 18.16955759228339),\n", " (b'get', b'rid'): (18, 208.69503038021702),\n", " (b'watering', b'place'): (9, 80.21198462150338),\n", " (b'while', b'ago'): (10, 12.96130710105312),\n", " (b'at', b'weymouth'): (16, 40.30369038102851),\n", " (b'present', b'occasion'): (9, 30.82215995924605),\n", " (b'their', b'hearts'): (50, 16.688643994865927),\n", " (b'break', b'through'): (12, 10.13448685951659),\n", " (b'burst', b'forth'): (18, 46.35221285874242),\n", " (b'young', b'men'): (142, 26.258476552154963),\n", " (b'nobody', b'else'): (21, 79.07262286913833),\n", " (b'something', 
b'else'): (35, 35.09791978466929),\n", " (b'walking', b'together'): (9, 11.011119603989226),\n", " (b'burst', b'out'): (19, 10.336463766135193),\n", " (b'mrs', b'cole'): (30, 133.5308579852927),\n", " (b'mr', b'cole'): (23, 75.77932867599989),\n", " (b'miss', b'fairfax'): (125, 253.16322047491198),\n", " (b'extremely', b'happy'): (7, 17.871217379746273),\n", " (b'ma', b'am'): (216, 180.33592685038826),\n", " (b's', b'handwriting'): (7, 11.116318806496656),\n", " (b'without', b'seeming'): (8, 22.51140380250772),\n", " (b'colonel', b'campbell'): (28, 852.6897671568628),\n", " (b'those', b'days'): (84, 22.563719575520683),\n", " (b'mrs', b'dixon'): (14, 66.64403730356881),\n", " (b'mr', b'dixon'): (22, 99.22116646087359),\n", " (b'miss', b'campbell'): (12, 69.78535178777393),\n", " (b'caught', b'hold'): (10, 22.460761166547872),\n", " (b'four', b'months'): (12, 35.223787622984226),\n", " (b'may', b'guess'): (11, 12.384171115697546),\n", " (b'running', b'away'): (12, 10.939419925249963),\n", " (b'five', b'minutes'): (37, 138.27388748830532),\n", " (b'nine', b'years'): (9, 20.394343883776585),\n", " (b'hundred', b'pounds'): (11, 61.548167237462266),\n", " (b'rather', b'than'): (78, 18.774280659635497),\n", " (b'few', b'months'): (17, 56.65512828516137),\n", " (b'she', b'wished'): (41, 13.111265209712547),\n", " (b'without', b'feeling'): (12, 12.941338417866758),\n", " (b'ill', b'health'): (7, 24.482125154267912),\n", " (b'twelve', b'thousand'): (24, 56.81398141741622),\n", " (b'mr', b'churchill'): (19, 14.856645245478399),\n", " (b'passed', b'between'): (17, 18.633976326622793),\n", " (b\",'\", b'said'): (252, 29.90465644929233),\n", " (b'miss', b'hawkins'): (18, 330.483487394958),\n", " (b'dear', b'jane'): (15, 27.279076268282562),\n", " (b'three', b'minutes'): (10, 10.2220491437685),\n", " (b'have', b'suffered'): (30, 12.671631950727493),\n", " (b'hour', b'ago'): (10, 34.5823521342509),\n", " (b'ford', b's'): (10, 13.291250746898175),\n", " (b'looked', b'round'): (26, 
11.112235527475757),\n", " (b'help', b'thinking'): (10, 29.995785987665336),\n", " (b'can', b't'): (299, 33.884042781775456),\n", " (b'human', b'nature'): (10, 30.588113365891143),\n", " (b'brown', b's'): (105, 12.277058922837673),\n", " (b'laughed', b'at'): (28, 11.579269824945039),\n", " (b'weeks', b'ago'): (7, 64.64782562239554),\n", " (b'twenty', b'miles'): (8, 30.04364737817797),\n", " (b'elder', b'sister'): (6, 19.200272911906577),\n", " (b'driven', b'away'): (10, 13.818976228325019),\n", " (b'setting', b'off'): (8, 15.77704088728183),\n", " (b'little', b'farther'): (16, 12.063492840311763),\n", " (b'spot', b'where'): (18, 30.983518539673284),\n", " (b'front', b'door'): (16, 47.781783068175294),\n", " (b'they', b'parted'): (23, 10.793474382760099),\n", " (b'without', b'delay'): (8, 16.40116562754134),\n", " (b'six', b'months'): (21, 137.42102349350557),\n", " (b'months', b'ago'): (7, 32.826357051786346),\n", " (b'leaned', b'back'): (6, 11.97267240526368),\n", " (b'ill', b'disposed'): (6, 23.462036606173417),\n", " (b'at', b'oxford'): (10, 13.739894448077902),\n", " (b'turned', b'round'): (21, 11.459424650853364),\n", " (b'pass', b'through'): (64, 20.610331926490034),\n", " (b'clock', b'struck'): (16, 271.616904052565),\n", " (b'\"\\'', b'tis'): (7, 65.71510806994678),\n", " (b'four', b'hours'): (15, 44.14409747555815),\n", " (b'parlour', b'door'): (6, 13.546761301300853),\n", " (b'faster', b'than'): (10, 38.52966274389933),\n", " (b'musical', b'society'): (6, 97.72879987078016),\n", " (b'worth', b'while'): (16, 32.80928460158145),\n", " (b'kind', b'hearted'): (6, 22.29983045849919),\n", " (b'mixed', b'with'): (29, 12.314926306023153),\n", " (b'extremely', b'glad'): (10, 69.6487855416139),\n", " (b'knew', b'nothing'): (25, 11.012602701827063),\n", " (b'make', b'amends'): (9, 63.131759488717876),\n", " (b'amends', b'for'): (12, 12.758421973797969),\n", " (b'oftener', b'than'): (9, 48.95533619224856),\n", " (b'old', b'woman'): (61, 16.85474514064448),\n", " 
(b'just', b'going'): (25, 11.441649458022253),\n", " (b'their', b'lives'): (30, 13.550608371899687),\n", " (b'six', b'days'): (24, 24.79935497788804),\n", " (b'may', b'prove'): (10, 12.110371414160014),\n", " (b'stronger', b'than'): (29, 55.48271435121504),\n", " (b'particular', b'friend'): (10, 18.112011830422627),\n", " (b'good', b'tidings'): (14, 28.459923484587165),\n", " (b'among', b'themselves'): (31, 14.391325508421081),\n", " (b'next', b'summer'): (9, 24.61759739599227),\n", " (b'on', b'tuesday'): (9, 11.261777570963238),\n", " (b'breaking', b'up'): (13, 11.204996550750268),\n", " (b'after', b'tea'): (9, 10.099806363302507),\n", " (b'perfectly', b'safe'): (6, 15.18214895111914),\n", " (b'two', b'ladies'): (13, 10.349667933920346),\n", " (b'same', b'moment'): (29, 15.168003008335596),\n", " (b'mr', b'cox'): (13, 82.17847198641766),\n", " (b',\"', b'added'): (51, 21.446157663405643),\n", " (b'little', b'girls'): (15, 21.933623346021392),\n", " (b'be', b'ashamed'): (88, 14.83924678279627),\n", " (b'been', b'staying'): (9, 13.648522445296638),\n", " (b'shut', b'up'): (64, 31.30735269741568),\n", " (b'too', b'large'): (19, 13.112807771198575),\n", " (b'an', b'elderly'): (7, 12.355299162752706),\n", " (b'worse', b'than'): (59, 50.382285453457605),\n", " (b'opposite', b'side'): (9, 20.5505493945621),\n", " (b'short', b'pause'): (11, 83.26224770642202),\n", " (b'large', b'party'): (8, 13.460266963292549),\n", " (b'who', b'knows'): (35, 23.27234193136448),\n", " (b'extremely', b'fond'): (6, 32.59205990088343),\n", " (b'five', b'couple'): (8, 45.27718326723041),\n", " (b'mr', b'william'): (9, 12.681862960866923),\n", " (b'bow', b'window'): (8, 25.074140074595938),\n", " (b'bad', b'news'): (9, 59.57160439127652),\n", " (b'baked', b'apples'): (6, 587.9873663751215),\n", " (b'mrs', b'wallis'): (14, 79.25236868532507),\n", " (b'will', b'send'): (73, 14.065402297907488),\n", " (b'william', b'larkins'): (13, 4596.687559354225),\n", " (b'low', b'voice'): (39, 
51.584875095916104),\n", " (b'one', b'leg'): (18, 10.201271801948558),\n", " (b'an', b'immediate'): (12, 13.30570679065676),\n", " (b',\"', b'resumed'): (18, 27.880004962427332),\n", " (b'many', b'times'): (21, 13.085195623230133),\n", " (b'few', b'words'): (18, 11.20685248087905),\n", " (b'no', b'objection'): (18, 27.458185258366093),\n", " (b'astonished', b'at'): (22, 12.45750429959063),\n", " (b'four', b'times'): (13, 16.70830356064136),\n", " (b'c', b'.,'): (6, 484.806891025641),\n", " (b'few', b'hours'): (23, 74.55299835706462),\n", " (b'an', b'extraordinary'): (14, 10.692085813920611),\n", " (b'immediately', b'followed'): (7, 11.605013810035294),\n", " (b'wait', b'till'): (20, 30.636127032993738),\n", " (b'good', b'bye'): (45, 141.49566478212827),\n", " (b'contrast', b'between'): (12, 157.767666232073),\n", " (b'dared', b'not'): (22, 11.170731204026199),\n", " (b'three', b'weeks'): (11, 30.165475636508674),\n", " (b'self', b'command'): (15, 129.45887538514208),\n", " (b'mrs', b'elton'): (142, 115.93946807097048),\n", " (b'maple', b'grove'): (31, 6513.877432712216),\n", " (b'mr', b'suckling'): (10, 67.58098025198821),\n", " (b'almost', b'fancy'): (9, 14.733690490685499),\n", " (b'left', b'behind'): (27, 27.450614763395492),\n", " (b'barouche', b'landau'): (7, 17286.828571428574),\n", " (b'whose', b'name'): (60, 27.21462469955437),\n", " (b'mr', b'e'): (10, 15.659007619363122),\n", " (b'e', b'.,'): (6, 189.192933083177),\n", " (b'good', b'breeding'): (8, 36.830489215348095),\n", " (b'greater', b'part'): (10, 15.073969804175594),\n", " (b'drew', b'back'): (11, 14.116251307516126),\n", " (b'third', b'time'): (23, 12.455612727466114),\n", " (b'very', b'extraordinary'): (13, 11.60924458430071),\n", " (b'better', b'acquainted'): (7, 12.58662367380903),\n", " (b'have', b'committed'): (34, 16.060120198292402),\n", " (b'drawing', b'rooms'): (8, 127.10903361344538),\n", " (b'hardly', b'less'): (8, 12.215771125528304),\n", " (b'will', b'shew'): (48, 
11.000113735916194),\n", " (b'little', b'boys'): (17, 15.382021567339674),\n", " (b'post', b'office'): (12, 378.6226533166458),\n", " (b'easily', b'believe'): (8, 20.68038054004785),\n", " (b'put', b'forth'): (41, 11.449193667080172),\n", " (b'mrs', b'bragge'): (6, 46.54504192630203),\n", " (b'drawing', b'near'): (8, 17.917525467898603),\n", " (b'great', b'joy'): (26, 10.054633379705905),\n", " (b'spread', b'abroad'): (14, 187.6671836228288),\n", " (b'few', b'lines'): (10, 41.418332420591454),\n", " (b'good', b'news'): (18, 22.239175181945157),\n", " (b'most', b'likely'): (12, 18.05763146899063),\n", " (b'talk', b'about'): (37, 21.43960312714548),\n", " (b'tells', b'us'): (12, 28.498862810540196),\n", " (b'sixty', b'five'): (6, 24.21802825921627),\n", " (b'eleven', b'years'): (6, 15.56410454288213),\n", " (b'your', b'sister'): (93, 17.156892959360377),\n", " (b'two', b'hours'): (18, 15.05994294248296),\n", " (b'two', b'months'): (20, 19.807816544517244),\n", " (b'twenty', b'four'): (17, 24.69076638463422),\n", " (b'door', b'opened'): (19, 31.970356671070014),\n", " (b'began', b'talking'): (9, 12.980535814851565),\n", " (b'mean', b'?\"'): (59, 24.673034270425305),\n", " (b'pretty', b'soon'): (13, 15.281994613759855),\n", " (b'many', b'hours'): (12, 12.100088566367637),\n", " (b'few', b'steps'): (8, 21.35632765436747),\n", " (b'most', b'excellent'): (11, 13.620613450895787),\n", " (b'surrounded', b'by'): (19, 28.432283834586467),\n", " (b'later', b'than'): (9, 15.131649368513193),\n", " (b'whole', b'story'): (18, 33.8121045120022),\n", " (b'another', b'minute'): (9, 10.472151066186651),\n", " (b'whole', b'history'): (9, 21.86545480828304),\n", " (b'lined', b'with'): (12, 12.666009731414164),\n", " (b'court', b'plaister'): (9, 660.5229257641921),\n", " (b'these', b'things'): (366, 38.76493340077559),\n", " (b'laid', b'down'): (26, 11.768166775804247),\n", " (b'forty', b'years'): (68, 158.55517564110565),\n", " (b'faint', b'smile'): (6, 23.553371223917786),\n", " 
(b'turned', b'towards'): (11, 10.359842928650108),\n", " (b'totally', b'different'): (7, 152.78762626262628),\n", " (b'box', b'hill'): (18, 162.9717796241427),\n", " (b'some', b'surprise'): (19, 20.088187863437586),\n", " (b'may', b'depend'): (9, 14.38565331621432),\n", " (b',\"', b'interrupted'): (25, 29.580907121938818),\n", " (b'whatever', b'else'): (9, 19.419818171605563),\n", " (b'larger', b'than'): (11, 18.09218946235273),\n", " (b'were', b'assembled'): (17, 17.322670000548754),\n", " (b'insisted', b'on'): (9, 10.828632279772345),\n", " (b'clothed', b'with'): (37, 11.957971920341324),\n", " (b'twenty', b'minutes'): (8, 15.956791968492862),\n", " (b'quite', b'alone'): (14, 10.13386446620588),\n", " (b'etc', b'.,'): (11, 3723.316923076923),\n", " (b',\"', b'whispered'): (18, 25.73538919608677),\n", " (b'shan', b't'): (20, 201.14328457446808),\n", " (b'looking', b'round'): (23, 16.131457721715485),\n", " (b',\"', b'answered'): (143, 22.031536656699593),\n", " (b'yes', b'yes'): (31, 34.42415794264907),\n", " (b'old', b'age'): (51, 59.178054288059265),\n", " (b'an', b'infant'): (10, 23.760190697601356),\n", " (b'be', b'forgiven'): (36, 17.634811346477495),\n", " (b'lie', b'down'): (41, 29.047526585175245),\n", " (b'mrs', b'smallridge'): (7, 65.16305869682283),\n", " (b'four', b'miles'): (7, 15.174533507223114),\n", " (b'great', b'hurry'): (16, 18.475293836783866),\n", " (b'without', b'waiting'): (12, 17.062783773875278),\n", " (b'comes', b'back'): (13, 14.131678904573523),\n", " (b'heightened', b'by'): (7, 10.935493782533257),\n", " (b'cut', b'off'): (217, 148.51915132090852),\n", " (b'trembling', b'voice'): (7, 11.589231329132108),\n", " (b'time', b'past'): (23, 12.303250492267754),\n", " (b'second', b'time'): (44, 18.990965084469945),\n", " (b'five', b'hundred'): (67, 84.28885553403468),\n", " (b'an', b'arrow'): (10, 13.72811018083634),\n", " (b'presented', b'themselves'): (6, 11.270798405424538),\n", " (b'at', b'random'): (13, 22.939649861138758),\n", " 
(b'far', b'distant'): (12, 33.91691492087898),\n", " (b'few', b'seconds'): (9, 91.12033132530121),\n", " (b'passing', b'through'): (12, 17.88838955740177),\n", " (b'domestic', b'happiness'): (6, 55.01354791780324),\n", " (b'western', b'sun'): (7, 33.12559540104024),\n", " (b'rose', b'early'): (12, 38.93680417015252),\n", " (b'east', b'wind'): (22, 116.46036526681686),\n", " (b'gone', b'mad'): (10, 19.017413169888417),\n", " (b'freed', b'from'): (11, 28.570122143171943),\n", " (b'sinned', b'against'): (43, 77.52306997194648),\n", " (b'locked', b'up'): (11, 12.679338202164779),\n", " (b'deep', b'sigh'): (7, 42.620386024232175),\n", " (b'ten', b'thousand'): (82, 129.3626084662696),\n", " (b'happier', b'than'): (11, 25.47675658984364),\n", " (b'nay', b'nay'): (7, 32.486187548659025),\n", " (b'had', b'formerly'): (10, 11.277857307953266),\n", " (b'little', b'boy'): (67, 21.623607468339106),\n", " (b'fancying', b'herself'): (6, 25.9628819086852),\n", " (b'right', b'hand'): (199, 41.801008401685465),\n", " (b'infinitely', b'more'): (8, 12.077108866621423),\n", " (b'such', b'cases'): (8, 12.89145596590909),\n", " (b'poor', b'fellow'): (38, 75.64792734629856),\n", " (b'days', b'ago'): (11, 14.965717112586058),\n", " (b'help', b'laughing'): (7, 15.773064991266716),\n", " (b'draw', b'near'): (18, 72.06606928524963),\n", " (b'at', b'intervals'): (34, 32.97574667538696),\n", " (b'into', b'temptation'): (8, 11.357116041596276),\n", " (b'sir', b'walter'): (136, 503.2387873015873),\n", " (b'walter', b'elliot'): (16, 153.52777393310265),\n", " (b'kellynch', b'hall'): (25, 1284.9930975894658),\n", " (b'charles', b'musgrove'): (14, 242.12321031569587),\n", " (b'first', b'year'): (71, 32.9850895198761),\n", " (b'lady', b'elliot'): (12, 19.25745581528584),\n", " (b'seventeen', b'years'): (7, 49.286331052460085),\n", " (b'an', b'awful'): (13, 14.535646073826713),\n", " (b'thirteen', b'years'): (7, 35.844604401789155),\n", " (b'lady', b'russell'): (147, 757.7061090581977),\n", " 
(b'anne', b'elliot'): (23, 67.77714022553583),\n", " (b'miss', b'elliot'): (48, 75.64966706405747),\n", " (b'everybody', b'else'): (22, 107.67396310951992),\n", " (b'russell', b's'): (30, 10.258347891901277),\n", " (b'mr', b'elliot'): (174, 150.17475957725546),\n", " (b'mr', b'shepherd'): (26, 56.76802341167009),\n", " (b'ill', b'used'): (8, 18.889563018388817),\n", " (b'anybody', b'else'): (21, 154.67714824401622),\n", " (b'an', b'honest'): (29, 20.450150338349307),\n", " (b'descend', b'into'): (11, 16.714246249896405),\n", " (b'mrs', b'clay'): (66, 167.01456220614256),\n", " (b'therefore', b'thus'): (66, 16.596178785600454),\n", " (b'miss', b'anne'): (19, 12.802348709267878),\n", " (b'their', b'fathers'): (151, 19.01913287990061),\n", " (b'an', b'example'): (14, 16.848135221935507),\n", " (b'admiral', b'croft'): (14, 929.5580402867872),\n", " (b'mrs', b'croft'): (41, 202.2301821625536),\n", " (b'walked', b'along'): (8, 10.917931320713858),\n", " (b'frederick', b'wentworth'): (6, 22.564294771388084),\n", " (b'either', b'side'): (18, 17.751889049381603),\n", " (b'captain', b'wentworth'): (196, 617.1163877348314),\n", " (b'eldest', b'son'): (15, 33.79917323054578),\n", " (b'removed', b'from'): (36, 13.162880053223592),\n", " (b'startled', b'by'): (14, 12.543654632905795),\n", " (b'most', b'important'): (14, 31.640805582833135),\n", " (b'replied', b'anne'): (11, 13.574215887165527),\n", " (b'at', b'uppercross'): (20, 12.847693509891025),\n", " (b'left', b'alone'): (17, 14.251469219988458),\n", " (b'mr', b'musgrove'): (21, 31.607104610160636),\n", " (b'miss', b'musgroves'): (22, 210.8149825783972),\n", " (b'mrs', b'musgrove'): (66, 152.88256078869972),\n", " (b'piano', b'forte'): (7, 11524.55238095238),\n", " (b'their', b'faces'): (63, 20.62178822688734),\n", " (b'surprised', b'at'): (28, 13.306003044454387),\n", " (b'ere', b'long'): (23, 32.475628447890266),\n", " (b'anything', b'else'): (31, 72.7257403863046),\n", " (b'quite', b'different'): (12, 
17.841743196562472),\n", " (b'their', b'sakes'): (13, 15.501895977453241),\n", " (b'twentieth', b'year'): (13, 176.01134545454545),\n", " (b'on', b'board'): (70, 31.771507904193165),\n", " (b'eight', b'years'): (24, 64.58208896529253),\n", " (b'their', b'heads'): (79, 20.78152721615108),\n", " (b'dressing', b'room'): (14, 228.86357331988398),\n", " (b'up', b'stairs'): (15, 14.006245688437838),\n", " (b'waited', b'till'): (7, 10.336362859827453),\n", " (b'third', b'part'): (39, 74.57648218907926),\n", " (b'dear', b'fellow'): (11, 17.04239197791674),\n", " (b'good', b'cheer'): (15, 59.63031587246834),\n", " (b'mrs', b'harville'): (24, 82.53987434930892),\n", " (b'fifteen', b'years'): (10, 60.350609451991936),\n", " (b'charles', b'hayter'): (33, 2576.9838758746578),\n", " (b'came', b'near'): (42, 11.125296005240008),\n", " (b'mansion', b'house'): (8, 28.450109717868337),\n", " (b'two', b'hundred'): (105, 33.2716152575442),\n", " (b'dr', b'shirley'): (9, 1057.7604895104896),\n", " (b'went', b'up'): (207, 10.546879068033903),\n", " (b'within', b'reach'): (7, 16.566879329700722),\n", " (b'turn', b'back'): (16, 10.061923328702072),\n", " (b'walking', b'along'): (8, 17.574038573254327),\n", " (b'leaning', b'against'): (13, 24.291194507733536),\n", " (b'trodden', b'under'): (9, 65.1314925453469),\n", " (b'under', b'foot'): (15, 20.58754074709241),\n", " (b'louisa', b'musgrove'): (15, 183.23410054512416),\n", " (b'provoke', b'me'): (18, 15.672805201481697),\n", " (b'captain', b'harville'): (37, 300.52383391540553),\n", " (b'at', b'lyme'): (26, 20.671363587556005),\n", " (b'earnest', b'desire'): (6, 31.26493385696569),\n", " (b'sea', b'shore'): (14, 26.870717986676535),\n", " (b'captain', b'benwick'): (56, 513.1712788957259),\n", " (b'an', b'officer'): (9, 10.08595850020629),\n", " (b'place', b'where'): (125, 24.673951016868287),\n", " (b'breakfast', b'table'): (9, 46.889526097570425),\n", " (b'great', b'coat'): (15, 13.6532963251306),\n", " (b'mean', b'while'): (30, 
21.602178501755198),\n", " (b'preceding', b'evening'): (6, 46.9167959057072),\n", " (b'dark', b'blue'): (7, 10.384974511251094),\n", " (b'an', b'agony'): (10, 14.366626933433379),\n", " (b'catching', b'hold'): (10, 119.13968966603655),\n", " (b'raised', b'up'): (35, 19.63919232400523),\n", " (b'every', b'one'): (395, 13.121967878025114),\n", " (b'could', b'scarcely'): (18, 16.706187581507773),\n", " (b'passed', b'along'): (11, 15.7751212389842),\n", " (b'leaning', b'over'): (17, 32.991030289811604),\n", " (b't', b'talk'): (19, 10.789294958017445),\n", " (b'camden', b'place'): (29, 304.80554156171286),\n", " (b'straight', b'forward'): (6, 11.08942448680352),\n", " (b'same', b'hour'): (17, 12.502011213202374),\n", " (b'looking', b'glasses'): (6, 21.695316982214575),\n", " (b'poring', b'over'): (6, 39.40595284616386),\n", " (b'thirty', b'feet'): (8, 11.48226847165992),\n", " (b'colonel', b'wallis'): (23, 919.8228040540541),\n", " (b'at', b'length'): (101, 22.856835240701436),\n", " (b'carried', b'away'): (73, 65.90036455897334),\n", " (b'greater', b'than'): (58, 46.9242105417191),\n", " (b'miss', b'carteret'): (12, 296.5877450980392),\n", " (b'lady', b'dalrymple'): (25, 567.8984418997559),\n", " (b'laura', b'place'): (7, 20.594969024440058),\n", " (b'be', b'established'): (41, 12.515027407177577),\n", " (b'mrs', b'smith'): (79, 133.20625258466546),\n", " (b'westgate', b'buildings'): (7, 3878.455128205128),\n", " (b'at', b'liberty'): (25, 11.17821921199558),\n", " (b'five', b'thousand'): (31, 35.86192793881296),\n", " (b'whose', b'names'): (9, 17.589365660794233),\n", " (b'her', b'ladyship'): (22, 32.97701536370165),\n", " (b'ladyship', b's'): (10, 11.322176562172519),\n", " (b'old', b'gentleman'): (32, 24.080053832071062),\n", " (b'their', b'minds'): (18, 15.50189597745324),\n", " (b'almost', b'entirely'): (13, 33.521094767168066),\n", " (b'lower', b'part'): (8, 14.361011772897019),\n", " (b'staring', b'at'): (33, 23.083022672770873),\n", " (b'ay', b'ay'): (6, 
102.04739416427729),\n", " (b'an', b'oath'): (39, 43.30723417872082),\n", " (b'wiser', b'than'): (8, 28.371842565962236),\n", " (b'prejudice', b'against'): (7, 37.304334422590784),\n", " (b'both', b'sides'): (31, 93.46785578477042),\n", " (b'my', b'soul'): (259, 16.066883986530073),\n", " (b'same', b'instant'): (19, 24.469748442934566),\n", " (b'their', b'seats'): (12, 12.918246647877702),\n", " (b'their', b'mouths'): (24, 24.960679963695895),\n", " (b'short', b'silence'): (9, 17.74307917888563),\n", " (b'fifty', b'pounds'): (6, 33.295124367158266),\n", " (b'be', b'saved'): (61, 12.820271978084348),\n", " (b'hard', b'hearted'): (6, 30.946703493427446),\n", " (b'some', b'moments'): (14, 21.245341542207033),\n", " (b'exclaimed', b'mrs'): (11, 11.847828853967787),\n", " (b'compassion', b'on'): (20, 12.135536175606939),\n", " (b'an', b'explanation'): (12, 15.444123953440883),\n", " (b'our', b'hearts'): (21, 17.06660837126648),\n", " (b'minutes', b'afterwards'): (7, 21.36625761454931),\n", " (b'make', b'haste'): (38, 69.71834069521798),\n", " (b'n', b't'): (19, 110.43160721735502),\n", " (b'rising', b'sun'): (7, 11.452998409934125),\n", " (b'an', b'atonement'): (66, 83.74147210310166),\n", " (b'atonement', b'for'): (65, 20.656492719482426),\n", " (b'next', b'instant'): (8, 11.76775260138092),\n", " (b'she', b'doted'): (7, 14.81087366282343),\n", " (b'god', b'forbid'): (30, 46.14475859838801),\n", " (b'i', b'll'): (384, 11.175120311389023),\n", " (b'll', b'answer'): (12, 13.01399029006883),\n", " (b'market', b'place'): (13, 39.585135267754914),\n", " (b'poured', b'out'): (53, 41.58992740848729),\n", " (b'at', b'norland'): (19, 17.421149186996885),\n", " (b'many', b'generations'): (11, 16.534282513048943),\n", " (b'seven', b'thousand'): (27, 31.25935371753323),\n", " (b'mr', b'dashwood'): (15, 10.19078273641092),\n", " (b'john', b'dashwood'): (37, 157.76252403767805),\n", " (b'four', b'thousand'): (45, 51.452722885418765),\n", " (b'three', b'thousand'): (45, 
26.103457945941283),\n", " (b'mrs', b'dashwood'): (121, 149.9784684291954),\n", " (b'miss', b'dashwoods'): (23, 254.21806722689078),\n", " (b'edward', b'ferrars'): (13, 135.88747894441326),\n", " (b'younger', b'brother'): (8, 31.679093146238024),\n", " (b'few', b'miles'): (7, 14.237551769578314),\n", " (b'replied', b'elinor'): (26, 38.56266294368484),\n", " (b'mrs', b'ferrars'): (73, 170.4264612070751),\n", " (b'barton', b'park'): (12, 511.69179654464176),\n", " (b'from', b'whence'): (44, 15.9501908897463),\n", " (b'barton', b'cottage'): (7, 104.58755401901469),\n", " (b'sir', b'john'): (113, 127.7876444705192),\n", " (b'at', b'barton'): (35, 22.230840455317054),\n", " (b'lady', b'middleton'): (95, 500.3860397158689),\n", " (b'be', b'fulfilled'): (39, 13.72615909174315),\n", " (b'their', b'arrival'): (15, 11.39845292459797),\n", " (b'present', b'case'): (12, 20.32475765428544),\n", " (b'mrs', b'jennings'): (229, 317.3157640888764),\n", " (b'colonel', b'brandon'): (132, 1667.5337022569443),\n", " (b'now', b'therefore'): (145, 11.124733982335009),\n", " (b'ill', b'natured'): (11, 147.81083061889248),\n", " (b'blue', b'sky'): (11, 51.839749814359976),\n", " (b'rose', b'up'): (112, 34.00672106200659),\n", " (b'at', b'allenham'): (8, 10.991915558462322),\n", " (b'miss', b'dashwood'): (70, 131.14424102974525),\n", " (b'cried', b'marianne'): (34, 25.079372634443647),\n", " (b'mr', b'willoughby'): (36, 36.68681785107931),\n", " (b'miss', b'marianne'): (31, 20.9166764174024),\n", " (b'aye', b'aye'): (36, 468.905225),\n", " (b'have', b'erred'): (10, 17.591206708068757),\n", " (b'pronounce', b'him'): (19, 17.59404209004578),\n", " (b'by', b'reason'): (70, 10.359296240180047),\n", " (b'an', b'everlasting'): (37, 33.79227121436637),\n", " (b'seven', b'days'): (103, 82.78022840230078),\n", " (b'by', b'accident'): (20, 16.659541309328006),\n", " (b'went', b'out'): (262, 11.679327162094795),\n", " (b'won', b't'): (219, 217.3972873683645),\n", " (b'miss', b'williams'): (6, 
72.63373349339736),\n", " (b'laughed', b'heartily'): (6, 104.96859819569742),\n", " (b'considerable', b'time'): (10, 11.403522990282188),\n", " (b'two', b'sides'): (13, 12.98652600625674),\n", " (b'at', b'delaford'): (11, 13.190298670154785),\n", " (b'two', b'thousand'): (76, 23.9670207602225),\n", " (b'seven', b'hundred'): (50, 63.02139464474197),\n", " (b'can', b'possibly'): (10, 14.130878533659684),\n", " (b'burst', b'into'): (18, 13.516567683308244),\n", " (b'turning', b'round'): (15, 21.605294920047708),\n", " (b'mr', b'ferrars'): (26, 41.48432480083584),\n", " (b'combe', b'magna'): (11, 18907.46875),\n", " (b'mrs', b'palmer'): (37, 135.40375833106043),\n", " (b'mr', b'palmer'): (35, 100.05495777567086),\n", " (b'without', b'ceasing'): (8, 82.0058281377067),\n", " (b'stared', b'at'): (14, 13.490078185385578),\n", " (b't', b'think'): (70, 10.296761958921383),\n", " (b'miss', b'steeles'): (29, 406.7489075630252),\n", " (b'most', b'beautiful'): (16, 16.30577169961094),\n", " (b'human', b'beings'): (7, 191.5286483064261),\n", " (b'sugar', b'plums'): (24, 5662.926600985222),\n", " (b'two', b'boys'): (13, 13.239510279105899),\n", " (b'miss', b'steele'): (27, 243.16510778224333),\n", " (b'lucy', b'steele'): (10, 285.9352551984877),\n", " (b'i', b'm'): (438, 16.271176474972332),\n", " (b'm', b'sure'): (88, 102.49205464802071),\n", " (b'robert', b'ferrars'): (7, 95.96177636796193),\n", " (b'mr', b'pratt'): (8, 85.60257498585173),\n", " (b'at', b'longstaple'): (7, 16.48787333769348),\n", " (b'poor', b'edward'): (10, 12.172941195406368),\n", " (b'their', b'names'): (36, 14.302344503007454),\n", " (b'latter', b'end'): (12, 42.204171316964285),\n", " (b'i', b've'): (218, 13.517978176897723),\n", " (b'lifted', b'up'): (151, 72.07476264919077),\n", " (b'third', b'day'): (65, 33.791685361045076),\n", " (b'starting', b'up'): (9, 11.47178218291099),\n", " (b't', b'know'): (147, 12.457680257147297),\n", " (b'returned', b'home'): (11, 12.907360988149462),\n", " (b'berkeley', 
b'street'): (16, 1449.984531590414),\n", " (b'conduit', b'street'): (6, 203.71683501683503),\n", " (b'lit', b'up'): (23, 36.13611387616962),\n", " (b'as', b'follows'): (17, 14.27876809314034),\n", " (b'having', b'received'): (10, 11.384897636609963),\n", " (b'who', b'cares'): (8, 12.52599580423441),\n", " (b'miss', b'grey'): (10, 10.250728517213336),\n", " (b'fifty', b'thousand'): (11, 20.37397014255248),\n", " (b'why', b'don'): (28, 12.795521166648276),\n", " (b'thousand', b'times'): (12, 12.06372712383394),\n", " (b'walked', b'across'): (7, 13.912300670276734),\n", " (b'fourteen', b'years'): (6, 15.56410454288213),\n", " (b'your', b'sakes'): (16, 32.49086604178871),\n", " (b'bartlett', b's'): (6, 10.189958905955267),\n", " (b'dressing', b'gown'): (6, 509.2920875420876),\n", " (b'wild', b'beasts'): (16, 105.2408127767236),\n", " (b'miss', b'morton'): (15, 267.59796550199025),\n", " (b'six', b'hundred'): (66, 132.00536142208233),\n", " (b'harley', b'street'): (16, 1540.6085648148148),\n", " (b'most', b'high'): (60, 22.250238368104544),\n", " (b'filled', b'with'): (114, 14.087704701266775),\n", " (b'two', b'thirds'): (6, 42.47676214546476),\n", " (b'public', b'school'): (6, 43.03876796130318),\n", " (b\",'\", b'says'): (13, 42.67600070534298),\n", " (b'fell', b'upon'): (62, 13.948924382167398),\n", " (b's', b'office'): (29, 12.330706575273602),\n", " (b'yes', b'ma'): (9, 15.105959603525328),\n", " (b'come', b'near'): (47, 11.571120236270598),\n", " (b'give', b'ear'): (30, 35.78164549476025),\n", " (b'reminds', b'me'): (7, 14.199293601342392),\n", " (b'ten', b'guineas'): (6, 26.425532844164923),\n", " (b'south', b'east'): (7, 17.113977399691684),\n", " (b'mr', b'harris'): (10, 85.60257498585173),\n", " (b'quicker', b'than'): (6, 12.23883404806214),\n", " (b'bent', b'over'): (10, 11.258843670332531),\n", " (b'justified', b'by'): (17, 11.526601554562083),\n", " (b'from', b'thence'): (103, 27.33487278970147),\n", " (b'latter', b'days'): (12, 29.775541338582677),\n", " 
(b'sprung', b'up'): (13, 24.090742584113084),\n", " (b'or', b'later'): (13, 13.913940352137981),\n", " (b'living', b'creature'): (16, 82.52032187670486),\n", " (b'first', b'month'): (33, 25.952660129387272),\n", " (b'have', b'transgressed'): (20, 23.003885695166836),\n", " ...}" ] }, "metadata": {}, "execution_count": 30 } ] }, { "cell_type": "code", "metadata": { "id": "LokTaRisX3rn", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "dddb8524-7f28-4fe4-e621-953093b63f1d" }, "source": [ "lower_bigram[\"jon lives in new york city\".split()]" ], "execution_count": 31, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "['jon', 'lives', 'in', 'new_york', 'city']" ] }, "metadata": {}, "execution_count": 31 } ] }, { "cell_type": "code", "metadata": { "id": "ByHsD6URX3rn", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "ddf9997f-beeb-47a5-98bf-d980c90bfc37" }, "source": [ "lower_bigram = Phraser(Phrases(lower_sents, \n", " min_count=32, threshold=64))\n", "lower_bigram.phrasegrams" ], "execution_count": 32, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "{(b'miss', b'taylor'): (48, 156.44188752424046),\n", " (b'mr', b'woodhouse'): (132, 82.04719647206235),\n", " (b'mr', b'weston'): (162, 75.8750096465504),\n", " (b'mrs', b'weston'): (249, 160.68617883193815),\n", " (b'great', b'deal'): (182, 93.36445281155484),\n", " (b'mr', b'knightley'): (277, 161.7426545362494),\n", " (b'miss', b'woodhouse'): (173, 229.03991999355654),\n", " (b'years', b'ago'): (56, 74.31656200580369),\n", " (b'mr', b'elton'): (214, 121.40001543448062),\n", " (b'dare', b'say'): (115, 89.940748422131),\n", " (b'frank', b'churchill'): (151, 1316.456538433409),\n", " (b'miss', b'bates'): (113, 276.3981670520557),\n", " (b'drawing', b'room'): (49, 84.9156512119411),\n", " (b'mrs', b'goddard'): (58, 143.57962085740624),\n", " (b'miss', b'smith'): (58, 73.03502483866474),\n", " (b'few', b'minutes'): (86, 
204.17003699445084),\n", " (b'john', b'knightley'): (58, 83.03824369335366),\n", " (b'don', b't'): (830, 250.31164302600473),\n", " (b'good', b'natured'): (66, 88.70009486029666),\n", " (b'few', b'moments'): (43, 107.77673597616273),\n", " (b'thousand', b'pounds'): (48, 166.5197213382644),\n", " (b'o', b'clock'): (67, 89.14862759956216),\n", " (b'jane', b'fairfax'): (111, 654.5620010133792),\n", " (b'miss', b'fairfax'): (125, 196.20149586805675),\n", " (b'ma', b'am'): (216, 157.25976559465136),\n", " (b'mrs', b'elton'): (142, 93.09008385260405),\n", " (b'forty', b'years'): (68, 90.60295750920322),\n", " (b'cut', b'off'): (217, 129.60397638852865),\n", " (b'ten', b'thousand'): (82, 84.00169380926596),\n", " (b'sir', b'walter'): (136, 399.5178158730159),\n", " (b'lady', b'russell'): (147, 613.6352291668503),\n", " (b'mr', b'elliot'): (174, 126.18234236668802),\n", " (b'mrs', b'clay'): (66, 93.09008385260405),\n", " (b'captain', b'wentworth'): (196, 529.8800397304311),\n", " (b'mrs', b'musgrove'): (66, 85.21323060353755),\n", " (b'charles', b'hayter'): (33, 92.03513842409491),\n", " (b'captain', b'benwick'): (56, 241.49236653916512),\n", " (b'mrs', b'smith'): (79, 84.60397123620643),\n", " (b'mrs', b'dashwood'): (121, 115.06968698446889),\n", " (b'mrs', b'ferrars'): (73, 102.75713102191294),\n", " (b'sir', b'john'): (113, 95.84073335288942),\n", " (b'lady', b'middleton'): (95, 350.27022780110826),\n", " (b'mrs', b'jennings'): (229, 279.06788181030646),\n", " (b'colonel', b'brandon'): (132, 1313.0186631944446),\n", " (b'miss', b'dashwood'): (70, 76.66894090969721),\n", " (b'won', b't'): (219, 189.9686576536643),\n", " (b'm', b'sure'): (88, 69.15126578661638),\n", " (b'six', b'hundred'): (66, 73.57675882542294),\n", " (b'gathered', b'together'): (84, 103.28151426020274),\n", " (b'thou', b'shalt'): (1282, 66.88288454162686),\n", " (b'burnt', b'offerings'): (86, 299.1594956644355),\n", " (b'sweet', b'savour'): (43, 286.1811575507396),\n", " (b'unleavened', b'bread'): (43, 
237.7023822279367),\n", " (b'burnt', b'offering'): (184, 297.52711249720966),\n", " (b'afar', b'off'): (52, 108.1430971616501),\n", " (b'take', b'heed'): (58, 86.38525449882758),\n", " (b'sent', b'messengers'): (43, 79.21620881736811),\n", " (b'thus', b'saith'): (444, 144.03010304370085),\n", " (b'without', b'blemish'): (46, 83.71428289057559),\n", " (b'peace', b'offerings'): (83, 176.2591765391338),\n", " (b'sin', b'offering'): (118, 129.96187065094136),\n", " (b'meat', b'offering'): (122, 210.66899051760493),\n", " (b'fine', b'flour'): (36, 86.07753592260636),\n", " (b'high', b'places'): (99, 129.81341185361669),\n", " (b'fig', b'tree'): (37, 121.73822937625756),\n", " (b'fir', b'tree'): (36, 72.67953992612391),\n", " (b'mercy', b'endureth'): (41, 269.07896427336067),\n", " (b'chief', b'priests'): (65, 116.32043880243987),\n", " (b'jesus', b'christ'): (199, 172.16959234722393),\n", " (b'holy', b'ghost'): (90, 313.0330942696068),\n", " (b'o', b'er'): (82, 108.1508294008294),\n", " (b'couldn', b't'): (89, 171.76280480516377),\n", " (b'didn', b't'): (180, 220.5126379038613),\n", " (b'little', b'jackal'): (61, 69.81311821111686),\n", " (b'wasn', b't'): (58, 120.22357238933725),\n", " (b'isn', b't'): (63, 131.967022683778),\n", " (b'doesn', b't'): (53, 106.26437675632276),\n", " (b'wouldn', b't'): (58, 120.22357238933725),\n", " (b'father', b'brown'): (207, 91.68353015338629),\n", " (b'buster', b'bear'): (142, 479.8780734011104),\n", " (b'green', b'forest'): (66, 336.3801160984384),\n", " (b'little', b'joe'): (111, 133.28894187197614),\n", " (b'joe', b'otter'): (47, 1271.6246321984027),\n", " (b'farmer', b'brown'): (100, 386.0549863003415),\n", " (b'mock', b'turtle'): (56, 2528.8986415882964),\n", " (b'dr', b'bull'): (65, 680.7926554828151),\n", " (b'guinea', b'hen'): (51, 905.88975571316),\n", " (b'sir', b'arthur'): (71, 131.42033416875523),\n", " (b'miss', b'somers'): (49, 160.06322751322753),\n", " (b'mr', b'gresham'): (49, 87.31462648556875),\n", " (b'mrs', 
b'theresa'): (67, 170.20201898423875),\n", " (b'de', b'grey'): (77, 603.2159473590925),\n", " (b'dr', b'middleton'): (40, 162.73238300161378),\n", " (b'moby', b'dick'): (84, 4115.911564625851),\n", " (b'sperm', b'whale'): (183, 297.3696872050255),\n", " (b'mast', b'heads'): (37, 77.73653510124372),\n", " (b'wee', b'l'): (35, 450.40124069478907)}" ] }, "metadata": {}, "execution_count": 32 } ] }, { "cell_type": "code", "metadata": { "id": "WeuAsMs3X3rn" }, "source": [ "clean_sents = []\n", "for s in lower_sents:\n", " clean_sents.append(lower_bigram[s])" ], "execution_count": 33, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "RJdsgzxXX3ro", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "d935034e-1962-43a3-f491-b4da195976f5" }, "source": [ "clean_sents[0:9]" ], "execution_count": 34, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "[['emma', 'by', 'jane', 'austen', '1816'],\n", " ['volume', 'i'],\n", " ['chapter', 'i'],\n", " ['emma',\n", " 'woodhouse',\n", " 'handsome',\n", " 'clever',\n", " 'and',\n", " 'rich',\n", " 'with',\n", " 'a',\n", " 'comfortable',\n", " 'home',\n", " 'and',\n", " 'happy',\n", " 'disposition',\n", " 'seemed',\n", " 'to',\n", " 'unite',\n", " 'some',\n", " 'of',\n", " 'the',\n", " 'best',\n", " 'blessings',\n", " 'of',\n", " 'existence',\n", " 'and',\n", " 'had',\n", " 'lived',\n", " 'nearly',\n", " 'twenty',\n", " 'one',\n", " 'years',\n", " 'in',\n", " 'the',\n", " 'world',\n", " 'with',\n", " 'very',\n", " 'little',\n", " 'to',\n", " 'distress',\n", " 'or',\n", " 'vex',\n", " 'her'],\n", " ['she',\n", " 'was',\n", " 'the',\n", " 'youngest',\n", " 'of',\n", " 'the',\n", " 'two',\n", " 'daughters',\n", " 'of',\n", " 'a',\n", " 'most',\n", " 'affectionate',\n", " 'indulgent',\n", " 'father',\n", " 'and',\n", " 'had',\n", " 'in',\n", " 'consequence',\n", " 'of',\n", " 'her',\n", " 'sister',\n", " 's',\n", " 'marriage',\n", " 'been',\n", " 'mistress',\n", " 'of',\n", " 'his',\n", " 
'house',\n", " 'from',\n", " 'a',\n", " 'very',\n", " 'early',\n", " 'period'],\n", " ['her',\n", " 'mother',\n", " 'had',\n", " 'died',\n", " 'too',\n", " 'long',\n", " 'ago',\n", " 'for',\n", " 'her',\n", " 'to',\n", " 'have',\n", " 'more',\n", " 'than',\n", " 'an',\n", " 'indistinct',\n", " 'remembrance',\n", " 'of',\n", " 'her',\n", " 'caresses',\n", " 'and',\n", " 'her',\n", " 'place',\n", " 'had',\n", " 'been',\n", " 'supplied',\n", " 'by',\n", " 'an',\n", " 'excellent',\n", " 'woman',\n", " 'as',\n", " 'governess',\n", " 'who',\n", " 'had',\n", " 'fallen',\n", " 'little',\n", " 'short',\n", " 'of',\n", " 'a',\n", " 'mother',\n", " 'in',\n", " 'affection'],\n", " ['sixteen',\n", " 'years',\n", " 'had',\n", " 'miss_taylor',\n", " 'been',\n", " 'in',\n", " 'mr_woodhouse',\n", " 's',\n", " 'family',\n", " 'less',\n", " 'as',\n", " 'a',\n", " 'governess',\n", " 'than',\n", " 'a',\n", " 'friend',\n", " 'very',\n", " 'fond',\n", " 'of',\n", " 'both',\n", " 'daughters',\n", " 'but',\n", " 'particularly',\n", " 'of',\n", " 'emma'],\n", " ['between',\n", " '_them_',\n", " 'it',\n", " 'was',\n", " 'more',\n", " 'the',\n", " 'intimacy',\n", " 'of',\n", " 'sisters'],\n", " ['even',\n", " 'before',\n", " 'miss_taylor',\n", " 'had',\n", " 'ceased',\n", " 'to',\n", " 'hold',\n", " 'the',\n", " 'nominal',\n", " 'office',\n", " 'of',\n", " 'governess',\n", " 'the',\n", " 'mildness',\n", " 'of',\n", " 'her',\n", " 'temper',\n", " 'had',\n", " 'hardly',\n", " 'allowed',\n", " 'her',\n", " 'to',\n", " 'impose',\n", " 'any',\n", " 'restraint',\n", " 'and',\n", " 'the',\n", " 'shadow',\n", " 'of',\n", " 'authority',\n", " 'being',\n", " 'now',\n", " 'long',\n", " 'passed',\n", " 'away',\n", " 'they',\n", " 'had',\n", " 'been',\n", " 'living',\n", " 'together',\n", " 'as',\n", " 'friend',\n", " 'and',\n", " 'friend',\n", " 'very',\n", " 'mutually',\n", " 'attached',\n", " 'and',\n", " 'emma',\n", " 'doing',\n", " 'just',\n", " 'what',\n", " 'she',\n", " 'liked',\n", " 'highly',\n", 
" 'esteeming',\n", " 'miss_taylor',\n", " 's',\n", " 'judgment',\n", " 'but',\n", " 'directed',\n", " 'chiefly',\n", " 'by',\n", " 'her',\n", " 'own']]" ] }, "metadata": {}, "execution_count": 34 } ] }, { "cell_type": "code", "metadata": { "id": "jv5XOwKJX3ro", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "7b46d61c-3590-48e3-d793-5f7711576fcd" }, "source": [ "clean_sents[6] " ], "execution_count": 35, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "['sixteen',\n", " 'years',\n", " 'had',\n", " 'miss_taylor',\n", " 'been',\n", " 'in',\n", " 'mr_woodhouse',\n", " 's',\n", " 'family',\n", " 'less',\n", " 'as',\n", " 'a',\n", " 'governess',\n", " 'than',\n", " 'a',\n", " 'friend',\n", " 'very',\n", " 'fond',\n", " 'of',\n", " 'both',\n", " 'daughters',\n", " 'but',\n", " 'particularly',\n", " 'of',\n", " 'emma']" ] }, "metadata": {}, "execution_count": 35 } ] }, { "cell_type": "markdown", "metadata": { "id": "bJp5e9iPX3ro" }, "source": [ "#### Run word2vec" ] }, { "cell_type": "code", "metadata": { "id": "UocG-ELbX3ro" }, "source": [ "# You can use max_vocab_size instead of min_count.\n", "# model = Word2Vec(sentences=clean_sents, size=64, \n", "# sg=1, window=10, iter=5,\n", "# min_count=10, workers=4)\n", "# model.save('clean_gutenberg_model.w2v')" ], "execution_count": 36, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "qQMmGpQGX3ro" }, "source": [ "#### Explore the model" ] }, { "cell_type": "code", "metadata": { "id": "qWr1qYRTYa6u", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "fa290faf-0244-44e8-bfe4-41f6d4b335f0" }, "source": [ "# If you are running in Colab, run the following command to download the trained model.\n", "!wget https://git.io/Jt02A -O clean_gutenberg_model.w2v" ], "execution_count": 37, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "--2022-12-07 02:50:02-- https://git.io/Jt02A\n", "Resolving git.io (git.io)... 140.82.112.22\n", "Connecting to git.io (git.io)|140.82.112.22|:443... 
connected.\n", "HTTP request sent, awaiting response... 302 Found\n", "Location: https://github.com/rickiepark/dl-illustrated/raw/master/notebooks/clean_gutenberg_model.w2v [following]\n", "--2022-12-07 02:50:03-- https://github.com/rickiepark/dl-illustrated/raw/master/notebooks/clean_gutenberg_model.w2v\n", "Resolving github.com (github.com)... 20.205.243.166\n", "Connecting to github.com (github.com)|20.205.243.166|:443... connected.\n", "HTTP request sent, awaiting response... 302 Found\n", "Location: https://raw.githubusercontent.com/rickiepark/dl-illustrated/master/notebooks/clean_gutenberg_model.w2v [following]\n", "--2022-12-07 02:50:04-- https://raw.githubusercontent.com/rickiepark/dl-illustrated/master/notebooks/clean_gutenberg_model.w2v\n", "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...\n", "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.\n", "HTTP request sent, awaiting response... 200 OK\n", "Length: 8609925 (8.2M) [application/octet-stream]\n", "Saving to: ‘clean_gutenberg_model.w2v’\n", "\n", "clean_gutenberg_mod 100%[===================>] 8.21M --.-KB/s in 0.02s \n", "\n", "2022-12-07 02:50:05 (372 MB/s) - ‘clean_gutenberg_model.w2v’ saved [8609925/8609925]\n", "\n" ] } ] }, { "cell_type": "code", "metadata": { "id": "jwr2wxsqX3ro" }, "source": [ "# Load the saved model here to skip model training.\n", "model = gensim.models.Word2Vec.load('clean_gutenberg_model.w2v') " ], "execution_count": 38, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "oQjyCyzoX3ro", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "f7edcbc1-6fe5-41aa-cd1e-5ca8e4e926aa" }, "source": [ "len(model.wv.vocab) # would be about ten thousand if no preprocessing had been performed" 
], "execution_count": 39, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "10329" ] }, "metadata": {}, "execution_count": 39 } ] }, { "cell_type": "code", "metadata": { "id": "BCz7vVRzX3rp", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "def1dec2-a381-4766-92fa-f6af5f596967" }, "source": [ "model.wv['dog']" ], "execution_count": 40, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "array([ 0.38401067, 0.01232518, -0.37594706, -0.00112308, 0.38663676,\n", " 0.01287549, 0.398965 , 0.0096426 , -0.10419296, -0.02877572,\n", " 0.3207022 , 0.27838793, 0.62772304, 0.34408906, 0.23356602,\n", " 0.24557391, 0.3398472 , 0.07168821, -0.18941355, -0.10122284,\n", " -0.35172758, 0.4038952 , -0.12179806, 0.096336 , 0.00641343,\n", " 0.02332107, 0.7743452 , 0.03591069, -0.20103034, -0.1688079 ,\n", " -0.01331445, -0.29832968, 0.08522387, -0.02750671, 0.32494134,\n", " -0.14266558, -0.4192913 , -0.09291836, -0.23813559, 0.38258648,\n", " 0.11036541, 0.005807 , -0.16745028, 0.34308755, -0.20224966,\n", " -0.77683043, 0.05146591, -0.5883941 , -0.0718769 , -0.18120563,\n", " 0.00358319, -0.29351747, 0.153776 , 0.48048878, 0.22479494,\n", " 0.5465321 , 0.29695514, 0.00986911, -0.2450937 , -0.19344331,\n", " 0.3541134 , 0.3426432 , -0.10496043, 0.00543602], dtype=float32)" ] }, "metadata": {}, "execution_count": 40 } ] }, { "cell_type": "code", "metadata": { "id": "1yyBUgt8X3rp", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "fb467e2e-aa1d-4633-ec04-2d77912af759" }, "source": [ "len(model.wv['dog'])" ], "execution_count": 41, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "64" ] }, "metadata": {}, "execution_count": 41 } ] }, { "cell_type": "code", "metadata": { "id": "0m6tFIjOX3rp", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "712b2b6d-2397-4968-dabb-4e03e9db2ced" }, "source": [ "model.wv.most_similar('dog', topn=3)" ], "execution_count": 42, "outputs": 
[ { "output_type": "execute_result", "data": { "text/plain": [ "[('puppy', 0.7834004163742065),\n", " ('cage', 0.7651870846748352),\n", " ('brahmin', 0.7646074295043945)]" ] }, "metadata": {}, "execution_count": 42 } ] }, { "cell_type": "code", "metadata": { "id": "5z2Z5NwKX3rp", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "c3ccd52a-934d-4ac6-df0f-8a8d566c1958" }, "source": [ "model.wv.most_similar('eat', topn=3)" ], "execution_count": 43, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "[('drink', 0.8292896747589111),\n", " ('bread', 0.8157557845115662),\n", " ('meat', 0.763256311416626)]" ] }, "metadata": {}, "execution_count": 43 } ] }, { "cell_type": "code", "metadata": { "id": "Y8s5dW9TX3rp", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "2e170d01-c40a-432e-80f9-4c528ef71a5c" }, "source": [ "model.wv.most_similar('day', topn=3)" ], "execution_count": 44, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "[('morning', 0.7578363418579102),\n", " ('night', 0.7324314713478088),\n", " ('week', 0.7262506484985352)]" ] }, "metadata": {}, "execution_count": 44 } ] }, { "cell_type": "code", "metadata": { "id": "5VZXl2AFX3rp", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "1e0cf9c4-7486-4635-9934-4657f340b4e3" }, "source": [ "model.wv.most_similar('father', topn=3)" ], "execution_count": 45, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "[('mother', 0.8257375359535217),\n", " ('brother', 0.7275018692016602),\n", " ('sister', 0.7177823781967163)]" ] }, "metadata": {}, "execution_count": 45 } ] }, { "cell_type": "code", "metadata": { "id": "-QAETP_pX3rq", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "ee3c0f2e-40d5-4686-d4ef-4f1dd6d4d9e8" }, "source": [ "model.wv.most_similar('ma_am', topn=3) " ], "execution_count": 46, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "[('madam', 
0.8472708463668823),\n", " ('nancy', 0.8370794057846069),\n", " ('betty', 0.8337127566337585)]" ] }, "metadata": {}, "execution_count": 46 } ] }, { "cell_type": "code", "metadata": { "id": "lKMdxKf7X3rq", "colab": { "base_uri": "https://localhost:8080/", "height": 91 }, "outputId": "bff2e7cc-d742-4878-b007-6560c3a86697" }, "source": [ "model.wv.doesnt_match(\"mother father sister brother dog\".split())" ], "execution_count": 47, "outputs": [ { "output_type": "stream", "name": "stderr", "text": [ "/usr/local/lib/python3.8/dist-packages/gensim/models/keyedvectors.py:895: FutureWarning: arrays to stack must be passed as a \"sequence\" type such as list or tuple. Support for non-sequence iterables such as generators is deprecated as of NumPy 1.16 and will raise an error in the future.\n", " vectors = vstack(self.word_vec(word, use_norm=True) for word in used_words).astype(REAL)\n" ] }, { "output_type": "execute_result", "data": { "text/plain": [ "'dog'" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "string" } }, "metadata": {}, "execution_count": 47 } ] }, { "cell_type": "code", "metadata": { "id": "3B89P3aRX3rq", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "dc681784-6474-43cd-d6a9-953607bef453" }, "source": [ "model.wv.similarity('father', 'dog')" ], "execution_count": 48, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "0.44234338" ] }, "metadata": {}, "execution_count": 48 } ] }, { "cell_type": "code", "metadata": { "id": "pJ7eUIXHX3rq", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "e7596d2d-3ad8-41c8-da21-415d3bcda3d7" }, "source": [ "model.wv.most_similar(positive=['father', 'woman'], negative=['man']) " ], "execution_count": 49, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "[('mother', 0.7650133371353149),\n", " ('husband', 0.7556628584861755),\n", " ('sister', 0.7482180595397949),\n", " ('daughter', 0.7390402555465698),\n", " ('wife', 
0.7284981608390808),\n", " ('sarah', 0.6856439113616943),\n", " ('daughters', 0.6652647256851196),\n", " ('conceived', 0.6637862920761108),\n", " ('rebekah', 0.6580977439880371),\n", " ('dearly', 0.6398962736129761)]" ] }, "metadata": {}, "execution_count": 49 } ] }, { "cell_type": "code", "metadata": { "id": "q5t2LmYFX3rq", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "4045179e-618c-4792-9d6a-7dad09ee05b8" }, "source": [ "model.wv.most_similar(positive=['husband', 'woman'], negative=['man']) " ], "execution_count": 50, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "[('wife', 0.707526445388794),\n", " ('sister', 0.6973985433578491),\n", " ('maid', 0.6911259889602661),\n", " ('daughter', 0.6799546480178833),\n", " ('mother', 0.6583081483840942),\n", " ('child', 0.6433471441268921),\n", " ('conceived', 0.6391384601593018),\n", " ('harlot', 0.6089693307876587),\n", " ('daughters', 0.6069822907447815),\n", " ('marriage', 0.5894294381141663)]" ] }, "metadata": {}, "execution_count": 50 } ] }, { "cell_type": "markdown", "metadata": { "id": "jFT8g2JGX3rr" }, "source": [ "#### Reduce the dimensionality of the word vectors with t-SNE" 
] }, { "cell_type": "code", "metadata": { "id": "TcQMVFcMX3rr" }, "source": [ "tsne = TSNE(n_components=2, n_iter=1000)" ], "execution_count": 51, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "ZAaVW5pIX3rr", "outputId": "74ef6c27-a08c-4efd-b4f1-3b8f278e61f8", "colab": { "base_uri": "https://localhost:8080/" } }, "source": [ "X_2d = tsne.fit_transform(model.wv[model.wv.vocab])" ], "execution_count": 52, "outputs": [ { "output_type": "stream", "name": "stderr", "text": [ "/usr/local/lib/python3.8/dist-packages/sklearn/manifold/_t_sne.py:780: FutureWarning: The default initialization in TSNE will change from 'random' to 'pca' in 1.2.\n", " warnings.warn(\n", "/usr/local/lib/python3.8/dist-packages/sklearn/manifold/_t_sne.py:790: FutureWarning: The default learning rate in TSNE will change from 200.0 to 'auto' in 1.2.\n", " warnings.warn(\n" ] } ] }, { "cell_type": "code", "metadata": { "id": "puhbXq--X3rr" }, "source": [ "coords_df = pd.DataFrame(X_2d, columns=['x','y'])\n", "coords_df['token'] = model.wv.vocab.keys()" ], "execution_count": 53, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "bdT_5jTbX3rr", "colab": { "base_uri": "https://localhost:8080/", "height": 206 }, "outputId": "8bb10aa2-a3e5-430e-8f94-7b626a42bf04" }, "source": [ "coords_df.head()" ], "execution_count": 54, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " x y token\n", "0 7.573422 63.395519 emma\n", "1 -47.785515 18.459305 by\n", "2 4.948685 63.898209 jane\n", "3 -15.264286 19.649439 volume\n", "4 -25.539066 26.625835 i" ], "text/html": [ "\n", "
<table>\n", "<thead>\n", "<tr><th></th><th>x</th><th>y</th><th>token</th></tr>\n", "</thead>\n", "<tbody>\n", "<tr><th>0</th><td>7.573422</td><td>63.395519</td><td>emma</td></tr>\n", "<tr><th>1</th><td>-47.785515</td><td>18.459305</td><td>by</td></tr>\n", "<tr><th>2</th><td>4.948685</td><td>63.898209</td><td>jane</td></tr>\n", "<tr><th>3</th><td>-15.264286</td><td>19.649439</td><td>volume</td></tr>\n", "<tr><th>4</th><td>-25.539066</td><td>26.625835</td><td>i</td></tr>\n", "</tbody>\n", "</table>\n", "