\n",
"\n",
" \n",
"## Author\n",
"Author: [Sergey Ustyantsev]() (@schokoro).\n",
"\n",
"This material is subject to the terms and conditions of the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license. Free use is permitted for any non-commercial purpose."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Problem description\n",
"\n",
"In literature and history, questions of authorship attribution arise again and again. Were Plato's dialogues written by one person? Did the same genius write all of Shakespeare's plays, or were those works created by different people? For a long time, there was a debate about the authorship of the novel \"And Quiet Flows the Don\" and whether Mikhail Sholokhov really wrote it; the linguistic examinations carried out in 1984, 1999 and 2007 confirmed Sholokhov's authorship.\n",
"A similar situation now surrounds the novels \"The Twelve Chairs\" and \"The Golden Calf\". Officially, their authors are Ilya Ilf and Yevgeny Petrov, but there is an alternative hypothesis that the real author of those novels was Mikhail Bulgakov. This version has been studied from literary and historical points of view \\[1\\], \\[2\\], but so far I have not come across linguistic studies on the subject. In this project, I will try to fill that gap and, so to speak, test harmony with the help of algebra.\n",
"There are now many ways to compute numerical characteristics of a text that reflect its author's style. Such characteristics, called the author's invariant, allow us to recognise the author. One of these methods is the object of research in this project.\n",
"### Problem definition \n",
"In this project, we will\n",
"* Investigate a method of authorship recognition based on character trigram frequencies.\n",
"* Determine the optimal parameters of the method for the problem at hand.\n",
"* Confirm or refute the hypothesis that the author of the novels \"The Twelve Chairs\" and \"The Golden Calf\" was Mikhail Bulgakov."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Problem formalisation. Solution methods.\n",
"We will formulate the problem of recognising the author of a text with a limited set of alternatives as follows.\n",
"There are a set of texts $T = \\{ t_{1},..., t_{k} \\}$ and a set of authors $A = \\{ a_{1},..., a_{n}\\}$. For some subset of the texts $T^{'} \\subseteq T$, the authors are known: $D = \\{(t_{i},\\ a_{i})\\}^{l}_{i=1}$. It is necessary to find the real authors (elements of $A$) of the remaining texts (anonymous or disputed) $T^{''}=\\{t_{l+1},...,\\ t_{k}\\} \\subseteq T$. \n",
"The main methods used in this project are described in \\[3\\], \\[4\\]. The model parameters found in those works differ somewhat but do not contradict each other. The essence of the method is as follows.\n",
"A text can be viewed as a hierarchical structure and analysed at any level as a sequence of individual constituent elements (characters, word forms, grammar classes, etc.) or of groups of elements of length N, called N-grams. Analysis becomes more complicated at higher levels of the hierarchy: morphological analysis uses information obtained during lexical analysis, syntactic analysis uses information obtained during morphological analysis, and so on. In this study, we confine ourselves to character-level features; namely, we will use character trigram counts as a tool."
]
},
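To make the trigram machinery concrete before the notebook's own implementation, here is a minimal, self-contained sketch (an illustration, not the project's code) of counting overlapping character trigrams with a sliding window of width 3:

```python
from collections import Counter

def char_trigrams(text: str) -> Counter:
    """Count overlapping character trigrams with a sliding window of width 3."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

counts = char_trigrams("abracadabra")
print(counts.most_common(3))
```

A string of length n produces n - 2 overlapping trigrams, which is why the counting loops in the notebook run up to `length - 2`.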
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As the machine learning algorithms, the works mentioned above used a Support Vector Machine with a linear kernel in one case and with a radial basis function kernel in the other. A common shortcoming of those works was the lack of hyper-parameter tuning. Primarily, this concerns the regularization parameter, which was left at its default value of $C = 1$. The modelling performed in this project shows that the optimal values of $C$ lie in the range $10^{-9} ... 10^{-1}$ and depend on the model and the specific dataset."
]
},
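The kind of search just described can be sketched with `GridSearchCV` over a logarithmic grid of C values. The synthetic dataset and the exact grid bounds below are assumptions made for the example; the notebook itself performs the real search inside the `Recognizer` class.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# toy two-class data standing in for the trigram-frequency features
X, y = make_classification(n_samples=200, n_features=20, random_state=42)
pipe = make_pipeline(StandardScaler(), SVC(kernel='linear', random_state=42))
# sweep C over the range discussed above, 1e-9 ... 1e-1
grid = GridSearchCV(pipe, {'svc__C': np.logspace(-9, -1, 9)},
                    cv=5, scoring='accuracy')
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```

Note the `svc__` prefix: when the model sits inside a pipeline, grid parameters are addressed through the pipeline step name.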
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Bearing in mind that the size of the corpus is limited, we will calculate trigram frequencies in substrings of length 1000 to 25000 symbols. In each substring, we will calculate the frequencies of the $N$ most frequent trigrams, with $N \\in \\{100 ... 2000\\}$. In addition, to estimate trigram probabilities, we will apply Laplace smoothing and check whether it affects the model's performance."
]
},
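The Laplace (add-one) smoothing mentioned above can be sketched as follows: each tracked trigram receives one pseudo-count, so a trigram absent from a substring still gets a small nonzero probability. This is an illustration of the idea, not the notebook's exact implementation.

```python
def smoothed_freqs(counts, tracked):
    """Laplace-smoothed probability estimates over the tracked trigrams."""
    total = sum(counts.get(t, 0) for t in tracked)
    # add 1 to every count; add len(tracked) to the denominator to renormalize
    return {t: (counts.get(t, 0) + 1) / (total + len(tracked)) for t in tracked}

# 'cde' never occurs, yet its smoothed probability is 1/7 rather than 0
probs = smoothed_freqs({'abc': 3, 'bcd': 1}, tracked=['abc', 'bcd', 'cde'])
print(probs)
```

The smoothed estimates still sum to 1 over the tracked trigrams, so they remain a valid probability distribution.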
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Rationale for the method and preparation of the text corpus\n",
"\n",
"The primary requirements for a text corpus prepared for linguistic studies are representativeness and balance \\[5\\]. The first requirement means that the language model presented in the corpus is statistically significant. The second means that the texts gathered in the corpus cover the data evenly. \n",
"To recognise the authors, we prepared a corpus of literary works written in the 1920s and 1930s. This choice was made for the following reasons. In the 1920s, the territories where Russian-speaking people lived (first of all, the Soviet Union) went through significant social changes, and the general way of life changed. \n",
"The Russian language reflected these changes, in the forms of Newspeak among others. All this also affected the language model, making it different both from the model established later, in the middle of the 20th century, and from the language people spoke before the revolutionary events of 1917.\n",
"As mentioned above, the corpus of texts includes works written in the second half of the 1920s and the first half of the 1930s. Perhaps these restrictions are redundant, but they give us a language sample for a well-defined period. \n",
"To train the models we will use the \"train\" corpus, which is a subset of the \"big\" corpus. It includes works by Mikhail Bulgakov, Arkady Gaidar, Maxim Gorky, Alexander Grin, Ilya Ilf and Evgeny Petrov (in collaboration), Valentin Kataev, Yuri Olesha and Andrei Platonov. We used the \"big\" corpus to build the language model, that is, to calculate the frequencies of character trigrams. Besides the authors presented in the \"train\" corpus, the \"big\" corpus includes a set of other authors from the same period. \n",
"All texts are in lowercase. We removed punctuation marks, numbers and spaces, which allows trigrams spanning the boundary of two words to be taken into account. Thus, each work became a single string consisting of alphabetic characters only. \n",
"We prepared the \"test\" corpus by the same principle. The \"test\" corpus is used neither to calculate trigram frequencies nor to train models, but only for testing. It includes works by the same authors as the \"train\" corpus. The disputed works of Ilf and Petrov have been placed in a separate directory. \n",
"All prepared data [can be downloaded here](https://drive.google.com/open?id=1fBhxBPXocqlkojMCBGvRnPkYiCH-JMV3)\n"
]
},
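The normalisation described above can be sketched with a regular expression; the exact character class below (Latin plus Cyrillic letters) is an assumption about how the published corpus was cleaned, not the project's own preprocessing code.

```python
import re

def normalize(text: str) -> str:
    """Lowercase the text and drop everything except letters, so that
    trigrams can span former word boundaries."""
    return re.sub(r'[^a-zа-яё]', '', text.lower())

print(normalize('Остап Бендер, 1927!'))  # 'остапбендер'
```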
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"### Estimating trigrams and comparing authors by trigram frequencies\n",
"\n",
"\n",
"Let's estimate the number of trigrams found for the authors included in the \"big\" corpus and the \"train\" corpus."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import re\n",
"from tqdm import tqdm_notebook\n",
"import pandas as pd\n",
"from matplotlib import pyplot as plt\n",
"from os import path, listdir\n",
"import seaborn as sns\n",
"import numpy as np\n",
"from sklearn.svm import SVC\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV\n",
"from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer\n",
"from sklearn.pipeline import make_pipeline\n",
"from sklearn.preprocessing import MinMaxScaler, StandardScaler\n",
"import pdb\n",
"import warnings\n",
"from pylab import rcParams\n",
"import random\n",
"from collections import defaultdict\n",
"from gc import collect\n",
"rcParams['figure.figsize'] = 12, 6\n",
"warnings.simplefilter('ignore')\n",
"%matplotlib inline\n",
"#%config InlineBackend.figure_format = 'svg'\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def make_freq_dict(file, f_dict, verbose=False):\n",
" with open(file, 'r') as file_obj:\n",
" string = file_obj.read()\n",
" length = len(string)\n",
" if verbose:\n",
" print(f'{path.split(file)[1]} - {length} symbols')\n",
" for i in range(length-2):\n",
" ngram = string[i:i + 3]\n",
" f_dict[ngram] = f_dict.get(ngram, 0) + 1\n",
" if verbose: \n",
"        print(f'The dictionary contains {len(f_dict)} trigrams')\n",
" return f_dict"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"f_dict = {}\n",
"PATH = '../corpus/'\n",
"files = listdir(PATH)\n",
"for file in files:\n",
" f_dict = make_freq_dict(path.join(PATH, file), f_dict)\n",
"df = pd.Series(f_dict) \n",
"df /= df.sum()\n",
"df.sort_values(ascending=False, inplace=True)\n",
"rcParams['figure.figsize'] = 12, 5\n",
"df.head(100).plot(kind='bar');"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The graph shows that seven trigrams are significantly more frequent than the others. Perhaps we should treat them as \"garbage\" because of their high relative frequency and discard them; on the other hand, they may serve as distinctive features. We will verify this in the experiments below as well.\n",
"Let's compile individual frequency vocabularies for the following authors: Bulgakov, Ilf & Petrov, Olesha and Platonov, and display the 25 most frequent trigrams jointly and the 25 most frequent trigrams for each author individually.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"authors = ['Bulgakov', 'IlfPetrov', 'Olesha', 'Platonov']\n",
"colors = ['#ff7f0e','#98df8a', '#d62728', '#17becf']\n",
"#mp_colors = cm.set_array(np.array(colors))\n",
"all_dict = {}\n",
"for author in authors:\n",
" f_dict = {}\n",
" PATH = '../corpus/'\n",
" files = listdir(PATH)\n",
" for file in files:\n",
" if file.split('_')[0] == author:\n",
" f_dict = make_freq_dict(path.join(PATH, file), f_dict)\n",
" series_dict = pd.Series(f_dict)\n",
" series_dict /= series_dict.sum()\n",
" all_dict[author] = series_dict\n",
" \n",
"particular_fdict = pd.DataFrame(all_dict)\n",
"particular_fdict = particular_fdict.loc[df.head(25).index]\n",
"particular_fdict.plot(kind='bar', grid=True, color=colors);\n",
"\n",
"fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(12, 10));\n",
"\n",
"axes = axes.flatten()\n",
"for i, author in enumerate(authors):\n",
" df = particular_fdict[author].sort_values(ascending=False)\n",
" axes[i].bar(df.index, df.values, color=colors[i]);\n",
" axes[i].tick_params(axis='x', direction='in',labelrotation=90, grid_alpha=0.5);\n",
" axes[i].legend([author]);\n",
" axes[i].grid();"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The graphs show that different trigram frequencies correspond to different authors, and each author has their own \"top 25\", which means these frequencies can be used as *features* for recognition. \n",
"I wrote the `Recognizer` class to work with the corpus. The main class methods are described below."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"class Recognizer:\n",
" def __init__(self, corpus_path):\n",
" \"\"\"\n",
" :param corpus_path: path to corpus directory with string-books\n",
" \"\"\"\n",
" self.path = corpus_path\n",
" self.authors = ['Bulgakov', 'Gaidar', 'Gorky', 'Grin', 'IlfPetrov',\n",
" 'Kataev', 'Olesha', 'Platonov']\n",
" self.f_dict = defaultdict(int)\n",
" self.string_corps = {writer: '' for writer in self.authors}\n",
" self.make_dict_and_corps()\n",
" self.trigrams = []\n",
" self.estimators = {}\n",
"\n",
" def set_lang_model(self, string_size=5000, lang_model_size=100, skip=0, smoothing=False):\n",
" \"\"\"\n",
" Sets parameters of language model\n",
" :param string_size: length of substring\n",
"        :param lang_model_size: number of the most frequent trigrams used for analysis\n",
"        :param skip: number of initial trigrams to skip\n",
" :param smoothing: on/off Laplacian smoothing\n",
" \"\"\"\n",
" self.estimators = {}\n",
" self.string_size=string_size\n",
" self.trigrams = list(self.all_trigrams[skip: lang_model_size + skip].index)\n",
" self.smoothing = smoothing\n",
"\n",
" def make_dict_and_corps(self):\n",
" \"\"\"\n",
" Make dictionary and corpus for all authors in self.authors\n",
" \"\"\"\n",
" books = listdir(self.path)\n",
" for book in books:\n",
" author = book.split('_')[0]\n",
" with open(path.join(self.path, book)) as book_obj:\n",
" string = book_obj.read()\n",
" length = len(string)\n",
" for i in range(length - 2): # make a trigrams from string\n",
" ngram = string[i:i + 3]\n",
" self.f_dict[ngram] += 1\n",
" if author in self.authors:\n",
" self.string_corps[author] += string\n",
" self.all_trigrams = pd.Series(self.f_dict)\n",
" self.all_trigrams.sort_values(ascending=False, inplace=True)\n",
"\n",
" def makesample(self, author, count=0, reverse=False, random_state=None):\n",
" \"\"\"\n",
"        :param author: author whose texts are sampled\n",
"        :param count: number of substrings to sample (0 means use all available)\n",
"        :param reverse: if True, the method returns subsamples for all authors except *author*\n",
"        :param random_state: seed for random sampling\n",
"        :return: dataframe with trigram frequencies\n",
" \"\"\"\n",
" df_list = []\n",
" if reverse:\n",
"\n",
" count *= len(self.authors) - 1\n",
" string = ''\n",
" for person in self.string_corps:\n",
"            if person != author:\n",
" string += self.string_corps[person]\n",
" else:\n",
" string = self.string_corps[author]\n",
" len_ = len(string)\n",
" if count * self.string_size > len_:\n",
" raise ValueError(f\"{author} hasn't got a required text size ({count * self.string_size}) symbols.\")\n",
" if count == 0:\n",
" count = len_ // self.string_size\n",
" string_list = [string[i * self.string_size: ((i + 1) * self.string_size)] for i in range(len_ // self.string_size)] # make a substrings\n",
" random.seed(random_state)\n",
" string_list = random.sample(string_list, count)\n",
" for string in string_list:\n",
" df_list.append(self.evaluate_trigrams(string))\n",
"\n",
" return pd.DataFrame(df_list)\n",
"\n",
" def evaluate_trigrams(self, substring):\n",
" \"\"\"\n",
" Calculates frequencies of top-trigrams in string\n",
" :param substring: input string (str)\n",
" :return: dictionary, where keys are trigrams and values are trigrams frequencies\n",
" \"\"\"\n",
" dyct = {key: 0 for key in self.trigrams}\n",
" len_ = len(substring)\n",
" count = 0\n",
" for i in range(len_ - 2):\n",
" ngram = substring[i:i + 3]\n",
" if ngram in dyct:\n",
" dyct[ngram] += 1\n",
" count += 1\n",
" if self.smoothing:\n",
" return {key: (dyct[key] + 1)/(count + len(self.trigrams)) for key in dyct}\n",
" else:\n",
" return {key: dyct[key]/count for key in dyct}\n",
"\n",
" def one_against_one(self, model, params, author_1, author_2, scoring,\n",
" count=25, verbose=False, random_state=None):\n",
" \"\"\"\n",
"        Selects the optimal regularization parameter C for the ML-model, \n",
" performs cross-validation of the model with the given parameters and authors.\n",
" :param model: ML-model (sklearn model object)\n",
" :param params: parameters of ML-model (dict)\n",
" :param author_1: author_1 (str)\n",
" :param author_2: author_2 (str)\n",
" :param scoring: scoring\n",
" :param count: number of substrings in sample\n",
" :param verbose: \n",
" :return: mean value and standard deviation for cross_validation\n",
"\n",
" \"\"\"\n",
" if random_state:\n",
" random_state1 = (hash(author_1) * random_state) % hash(author_2)\n",
" random_state2 = (hash(author_2) * random_state) % hash(author_1)\n",
" else:\n",
" random_state1 = random_state2 = None\n",
" model_name = re.findall(r'\\w+', str(model).lower())[-1] # it's a kind of magic\n",
" cv = 5 if count > 5 else count\n",
" df1 = self.makesample(author=author_1, count=count, random_state=random_state1)\n",
" df2 = self.makesample(author=author_2, count=count, random_state=random_state2)\n",
"        y = np.hstack((np.ones(df1.shape[0]).astype(int), np.zeros(df2.shape[0]).astype(int)))\n",
" pipe = make_pipeline(StandardScaler(), model(**params))\n",
" X = pd.concat([df1, df2], axis=0)\n",
" grid_params = {model_name + '__C': np.logspace(-8, 1, 10)}\n",
" grid = GridSearchCV(pipe, param_grid=grid_params, scoring=scoring, n_jobs=-1, cv=cv, verbose=False).fit(X, y)\n",
" cross_val = cross_val_score(grid.best_estimator_, X, y, cv=count, n_jobs=-1, scoring=scoring)\n",
" mean_ = cross_val.mean()\n",
" std_ = cross_val.std()\n",
" if verbose:\n",
" print(f'| mean = {cross_val.mean():.2f}, std = {cross_val.std():.2f} |')\n",
" return np.array([mean_, std_])\n",
"\n",
" def make_df(self, text_path):\n",
" \"\"\"\n",
"        Gets a path to a text file, cuts it into substrings of 'self.string_size' characters and calculates trigram\n",
"        frequencies. Returns a dataframe with trigram frequencies.\n",
" :param text_path: path to text file\n",
" :return: pd.DataFrame\n",
" \"\"\"\n",
" with open(text_path, 'r') as text_obj:\n",
" string = text_obj.read()\n",
" df_list = []\n",
" len_ = len(string)\n",
" string_list = [string[i * self.string_size: ((i + 1) * self.string_size)] \\\n",
" for i in range(len_ // self.string_size)]\n",
" for string in string_list:\n",
" df_list.append(self.evaluate_trigrams(string))\n",
" return pd.DataFrame(df_list)\n",
"\n",
" @staticmethod\n",
" def softmax(X):\n",
" \"\"\"\n",
"        My own softmax with numpy and vectorizing :7\n",
" \"\"\"\n",
" y = np.atleast_2d(X)\n",
" y = y - np.expand_dims(np.max(y, axis=1), 1)\n",
" y = np.exp(y)\n",
" ax_sum = np.expand_dims(np.sum(y, axis=1), 1)\n",
" p = y / ax_sum\n",
" if len(X.shape) == 1:\n",
" p = p.flatten()\n",
" return p\n",
"\n",
" def fit(self, model, params, mode, train_size=25, random_state=None):\n",
" \"\"\"\n",
" Fit the ML-models according to the given training data.\n",
" :param model: ML-model (sklearn model object)\n",
" :param params: parameters of ML-model\n",
" :param mode: 'oao' - one-against-one or 'oaa' - one-against-all\n",
" :param train_size: length of string for evaluate trigrams\n",
" :return: None\n",
" \"\"\"\n",
" self.mode = mode\n",
" self.random_state = random_state\n",
" self.train_size = train_size\n",
" self.model = model\n",
" self.params = params\n",
" model_name = re.findall(r'\\w+', str(model).lower())[-1]\n",
" cv = 5 if train_size > 5 else train_size\n",
" if not self.trigrams:\n",
" raise RuntimeWarning(' Set a language model first!')\n",
" n_authors = len(self.authors)\n",
" self.estimators = {}\n",
" for i in range(n_authors):\n",
"            rs = None if random_state is None else random_state + i\n",
"            df1 = self.makesample(author=self.authors[i], count=train_size, random_state=rs)\n",
" if mode == 'oao': # one against one\n",
" for j in range(i):\n",
"                    rs2 = None if random_state is None else random_state + 3*i + 5*j\n",
"                    df0 = self.makesample(author=self.authors[j], count=train_size, random_state=rs2)\n",
" X = pd.concat([df1, df0], axis=0)\n",
"                    y = np.hstack((np.ones(df1.shape[0]).astype(int),\n",
"                                   np.zeros(df0.shape[0]).astype(int)))\n",
" pipe = make_pipeline(StandardScaler(), model(**params))\n",
" grid_params = {model_name + '__C': np.logspace(-9, 1, 21)}\n",
" grid = GridSearchCV(pipe, param_grid=grid_params,\n",
" scoring='accuracy',\n",
" n_jobs=-1,\n",
" cv=cv).fit(X, y)\n",
" self.estimators[frozenset([self.authors[i], self.authors[j]])] = {\n",
" 'estimator': grid.best_estimator_,\n",
" 'target': [self.authors[j], self.authors[i]]\n",
" }\n",
" elif mode == 'oaa': # one against all\n",
" df0 = self.makesample(author=self.authors[i], count=train_size, reverse=True)\n",
" X = pd.concat([df1, df0], axis=0)\n",
"                y = np.hstack((np.ones(df1.shape[0]).astype(int),\n",
"                               np.zeros(df0.shape[0]).astype(int)))\n",
" pipe = make_pipeline(StandardScaler(), model(**params))\n",
" grid_params = {model_name + '__C': np.logspace(-9, 1, 21)}\n",
" grid = GridSearchCV(pipe, param_grid=grid_params, scoring='roc_auc', n_jobs=-1).fit(X, y)\n",
" self.estimators[self.authors[i]] = grid\n",
" return\n",
"\n",
" def predict(self, text_path):\n",
" \"\"\"\n",
"        Returns the author from the corpus that best matches the text. The choice is made by\n",
"        majority voting based on accuracy.\n",
"        :param text_path: path to text file\n",
"        :return: author of the input text (str)\n",
" \"\"\"\n",
" if not self.estimators:\n",
" raise RuntimeWarning(\"Run 'fit'-method first\")\n",
" test_df = self.make_df(text_path)\n",
" result = defaultdict(int)\n",
" df_result = pd.DataFrame(index=test_df.index, columns=self.authors).fillna(0)\n",
" for key in self.estimators:\n",
" if self.mode == 'oaa': # one against all\n",
" pred = self.estimators[key].predict(test_df).sum()\n",
" result[key] += pred\n",
" df_result[key] = self.estimators[key].predict(test_df)\n",
" elif self.mode == 'oao': # one against one\n",
" estimator = self.estimators[key]['estimator']\n",
" target = self.estimators[key]['target']\n",
" df_result[target[1]] += estimator.predict(test_df)\n",
"                df_result[target[0]] += 1 - estimator.predict(test_df)\n",
" else:\n",
" raise ValueError(\"incorrect parameter value\")\n",
" author_sum = df_result.sum()\n",
" return author_sum[author_sum == author_sum.max()].index[0]\n",
"\n",
" def hypothesis_test(self, text_path, author_1, author_2, softmax=False):\n",
" \"\"\"\n",
"        Returns the probabilities that a work belongs to each of two authors.\n",
"        :param text_path: path to text file\n",
"        :param author_1: first author\n",
"        :param author_2: second author\n",
"        :param softmax: if True (in 'oaa' mode), combine the two models' scores with a softmax\n",
"        :return: pd.Series of probabilities\n",
" \"\"\"\n",
" if not self.estimators:\n",
" raise RuntimeWarning(\"Run 'fit'-method first\")\n",
" if self.random_state:\n",
" random_state1 = (hash(author_1) * self.random_state) % hash(author_2)\n",
" random_state2 = (hash(author_2) * self.random_state) % hash(author_1)\n",
" else:\n",
" random_state1 = random_state2 = None\n",
"\n",
" keys = list(self.estimators.keys())\n",
" test_df = self.make_df(text_path)\n",
"\n",
" if self.mode == 'oao':\n",
" dict_estimator = self.estimators[frozenset([author_1, author_2])]\n",
" target = dict_estimator['target']\n",
" pred = dict_estimator['estimator'].predict(test_df)\n",
" pred = pred.sum()/test_df.shape[0]\n",
" return pd.Series({target[1]: pred, target[0]: 1 - pred})\n",
"\n",
" elif self.mode == 'oaa':\n",
" if softmax:\n",
" pred1 = self.estimators[author_1].predict_proba(test_df)\n",
" pred2 = self.estimators[author_2].predict_proba(test_df)\n",
" pred = self.softmax(np.vstack((pred1[: ,1], pred2[: ,1].T)).T)\n",
" #print(pred)\n",
" #pdb.set_trace()\n",
" return pd.Series(pred.mean(axis=0), index=[author_1, author_2])\n",
" else:\n",
" model = self.model\n",
" model_name = re.findall(r'\\w+', str(model).lower())[-1]\n",
" cv = 5 if self.train_size > 5 else self.train_size\n",
" df1 = self.makesample(author=author_1, count=self.train_size, random_state=random_state1)\n",
" df0 = self.makesample(author=author_2, count=self.train_size, random_state=random_state2)\n",
"\n",
" X = pd.concat([df1, df0], axis=0)\n",
"                y = np.hstack((np.zeros(df1.shape[0]).astype(int), np.ones(df0.shape[0]).astype(int)))\n",
" pipe = make_pipeline(StandardScaler(), model(**self.params))\n",
" grid_params = {model_name + '__C': np.logspace(-9, 1, 21)}\n",
" grid = GridSearchCV(pipe, param_grid=grid_params, scoring='roc_auc', n_jobs=-1, cv=cv).fit(X, y)\n",
" #pdb.set_trace()\n",
" pred = grid.predict_proba(test_df)\n",
" return pd.Series(pred.mean(axis=0), index=[author_1, author_2])\n",
" else:\n",
" raise KeyError\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Class description\n",
"The class constructor has one mandatory parameter: the path to the corpus of texts used to train the model. \n",
"* The ***set_lang_model*** method adjusts the parameters of the language model: the number of trigrams used, the length of the substrings on which trigram counts are calculated, the number of initial trigrams to skip, and whether smoothing is switched on.\n",
"* The ***one_against_one*** method takes a machine learning model class, two authors and the sample size. The `makesample` method creates text samples of the given size for each author. The method then selects the optimal regularization parameter C for the ML model with the help of `GridSearchCV` and, finally, estimates the probability of correct author recognition by cross-validating the best estimator. The number of folds equals the number of substrings in each author's sample. Since the sample sizes are relatively small, cross-validation does not take much time.\n",
"* The ***fit*** method trains M classifiers for N authors on subsamples of a given size: $M = \\frac{N(N-1)}{2}$ for the `one against one` scheme and $M=N$ for the `one against all` scheme.\n",
"* The ***predict*** method receives the path to a text file and returns the author of the text using the trained classifiers.\n",
"* The ***hypothesis_test*** method takes the path to a text file and two authors, and returns the probabilities that the text belongs to each of the two authors."
]
},
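The classifier counts for the two schemes can be sanity-checked with a small helper mirroring the formulas above (illustrative code, not part of the class): for the 8 authors of the "train" corpus, the one-against-one scheme trains 28 models and the one-against-all scheme trains 8.

```python
def n_classifiers(n_authors: int, mode: str) -> int:
    """Number of binary classifiers trained for a given voting scheme."""
    if mode == 'oao':   # one against one: one model per pair of authors
        return n_authors * (n_authors - 1) // 2
    if mode == 'oaa':   # one against all: one model per author
        return n_authors
    raise ValueError("mode must be 'oao' or 'oaa'")

print(n_classifiers(8, 'oao'), n_classifiers(8, 'oaa'))  # 28 8
```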
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"recognizer = Recognizer('../corpus/')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will fix the substring length at 5000 characters and estimate the probability of correct author recognition for different numbers of the most common trigrams, in the range of 100 to 3000, conducting experiments with the `SVM` (linear and rbf kernels) and with logistic regression. Since the SVM algorithm does not predict class probabilities but only returns class labels, we use `'accuracy'` as the metric; for `LogisticRegression` we also use `'accuracy'`, for a fair comparison with SVM."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def calculate_errors(recognizer, model, params, scoring='accuracy', length_string=5000,\n",
" n_3gramms=300, random_state=None):\n",
" recognizer.set_lang_model(lang_model_size=n_3gramms, skip=0, smoothing=False, string_size=length_string)\n",
" errors = {}\n",
" authors = recognizer.authors\n",
" n_authors = len(authors)\n",
" for i in range(n_authors):\n",
" for j in range(i):\n",
" mean, std = recognizer.one_against_one(model, params, authors[i], authors[j],\n",
" scoring, 20, random_state=random_state)\n",
"            errors['_'.join([authors[i], authors[j]])] = mean\n",
" return errors"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#SVM with linear kernel\n",
"params = {'random_state':42, 'kernel': 'linear'} \n",
"df_list = {}\n",
"for ngrams in tqdm_notebook(range(100, 3100, 100)):\n",
" df_list[ngrams] = calculate_errors(recognizer, SVC, params, n_3gramms=ngrams, random_state=ngrams)\n",
"df_errors_svl = pd.DataFrame(df_list)\n",
"f, ax = plt.subplots(1, 1)\n",
"sns.boxplot(data=df_errors_svl, ax=ax);\n",
"sns.pointplot(data=df_errors_svl.values, linestyles='', scale=1, color='k', errwidth=1.5, capsize=0.2, markers='x')\n",
"sns.pointplot(data=df_errors_svl.values, linestyles='--', scale=0.4, color='k', errwidth=0, capsize=0, ax=ax)\n",
"ax.set_ylabel('P'); ax.set_xlabel('3-grams'); ax.set_xticklabels([t for t in range(100, 3100, 100)]);\n",
"plt.gcf().autofmt_xdate(); ax.grid(); plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#SVM with rbf kernel\n",
"params = {'random_state':42, 'kernel': 'rbf'} \n",
"df_list = {}\n",
"for ngrams in tqdm_notebook(range(100, 3100, 100)):\n",
" df_list[ngrams] = calculate_errors(recognizer, SVC, params, n_3gramms=ngrams, random_state=ngrams)\n",
"df_errors_svr = pd.DataFrame(df_list)\n",
"f, ax = plt.subplots(1, 1)\n",
"sns.boxplot(data=df_errors_svr, ax=ax);\n",
"sns.pointplot(data=df_errors_svr.values, linestyles='', scale=1, color='k', errwidth=1.5, capsize=0.2, markers='x')\n",
"sns.pointplot(data=df_errors_svr.values, linestyles='--', scale=0.4, color='k', errwidth=0, capsize=0, ax=ax)\n",
"ax.set_ylabel('P'); ax.set_xlabel('3-grams'); ax.set_xticklabels([t for t in range(100, 3100, 100)]);\n",
"plt.gcf().autofmt_xdate(); ax.grid(); plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Logistic regression\n",
"params={'random_state':42, 'class_weight':'balanced', 'solver':'liblinear'} \n",
"df_list = {}\n",
"for ngrams in tqdm_notebook(range(100, 3100, 100)):\n",
" df_list[ngrams] = calculate_errors(recognizer, LogisticRegression, params, n_3gramms=ngrams, random_state=ngrams) #, scoring='roc_auc'\n",
"df_errors_lr = pd.DataFrame(df_list)\n",
"f, ax = plt.subplots(1, 1)\n",
"sns.boxplot(data=df_errors_lr, ax=ax);\n",
"sns.pointplot(data=df_errors_lr.values, linestyles='', scale=1, color='k', errwidth=1.5, capsize=0.2, markers='x')\n",
"sns.pointplot(data=df_errors_lr.values, linestyles='--', scale=0.4, color='k', errwidth=0, capsize=0, ax=ax)\n",
"ax.set_ylabel('P'); ax.set_xlabel('3-grams'); ax.set_xticklabels([t for t in range(100, 3100, 100)]);\n",
"plt.gcf().autofmt_xdate(); ax.grid(); plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#summary graph for trigrams\n",
"trigramms = df_errors_lr.columns\n",
"df1 = pd.DataFrame({'trigramms': trigramms, 'P': df_errors_svl.mean()})\n",
"df2 = pd.DataFrame({'trigramms': trigramms, 'P': df_errors_svr.mean()})\n",
"df3 = pd.DataFrame({'trigramms': trigramms, 'P': df_errors_lr.mean()})\n",
"f, ax = plt.subplots(1, 1)\n",
"x_col='trigramms'\n",
"y_col = 'P'\n",
"sns.pointplot(ax=ax,x=x_col,y=y_col,data=df1,color='blue', linestyles='-' , scale=.8, markers='.') #, capsize=1\n",
"sns.pointplot(ax=ax,x=x_col,y=y_col,data=df2,color='green', linestyles='-', scale=.8, markers='.')\n",
"sns.pointplot(ax=ax,x=x_col,y=y_col,data=df3,color='red', linestyles='-', scale=.8, markers='.')\n",
"ax.legend(handles=ax.lines[::len(df1)+1], labels=[\"SVC_linear\",\"SVC_rbf\",\"LogReg\"]); ax.grid()\n",
"ax.set_xticklabels([t for t in range(100, 3100, 100)]); plt.gcf().autofmt_xdate(); plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Linear SVM and logistic regression show slightly better results; SVM with the rbf kernel works slightly worse, especially at 1000 trigrams and above. \n",
"We will now fix the number of trigrams at 300 and estimate the probability of correct author recognition for fragments of various lengths, from 1,000 to 30,000 characters."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#SVM with linear kernel\n",
"params = {'random_state':42, 'kernel': 'linear'}\n",
"df_list = {}\n",
"for length in tqdm_notebook(range(1000, 31000, 1000)):\n",
" df_list[length] = calculate_errors(recognizer, SVC, params, length_string=length, random_state=length)\n",
"df_errors_svl = pd.DataFrame(df_list)\n",
"f, ax = plt.subplots(1, 1)\n",
"sns.boxplot(data=df_errors_svl, ax=ax);\n",
"sns.pointplot(data=df_errors_svl.values, linestyles='', scale=1, color='k', errwidth=1.5, capsize=0.2, markers='x')\n",
"sns.pointplot(data=df_errors_svl.values, linestyles='--', scale=0.4, color='k', errwidth=0, capsize=0, ax=ax)\n",
"ax.set_ylabel('P'); ax.set_xlabel('length of string'); ax.set_xticklabels([t for t in range(1000, 31000, 1000)])\n",
"plt.gcf().autofmt_xdate(); ax.grid(); plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#SVM with rbf kernel\n",
"params = {'random_state':42, 'kernel': 'rbf'} \n",
"df_list = {}\n",
"for length in tqdm_notebook(range(1000, 31000, 1000)):\n",
" df_list[length] = calculate_errors(recognizer, SVC, params, length_string=length, random_state=length) \n",
" \n",
"df_errors_svr = pd.DataFrame(df_list)\n",
"f, ax = plt.subplots(1, 1)\n",
"sns.boxplot(data=df_errors_svr, ax=ax);\n",
"sns.pointplot(data=df_errors_svr.values, linestyles='', scale=1, color='k', errwidth=1.5, capsize=0.2, markers='x')\n",
"sns.pointplot(data=df_errors_svr.values, linestyles='--', scale=0.4, color='k', errwidth=0, capsize=0, ax=ax)\n",
"ax.set_ylabel('P'); ax.set_xlabel('length of string'); ax.set_xticklabels([t for t in range(1000, 31000, 1000)])\n",
"plt.gcf().autofmt_xdate(); ax.grid(); plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Logistic regression\n",
"params={'random_state':42, 'class_weight':'balanced', 'solver':'liblinear'} #, 'C': 1e-7\n",
"df_list = {}\n",
"for length in tqdm_notebook(range(1000, 31000, 1000)):\n",
" df_list[length] = calculate_errors(recognizer, LogisticRegression, params, length_string=length, random_state=length) \n",
"df_errors_lr = pd.DataFrame(df_list)\n",
"f, ax = plt.subplots(1, 1)\n",
"sns.boxplot(data=df_errors_lr, ax=ax);\n",
"sns.pointplot(data=df_errors_lr.values, linestyles='', scale=1, color='k', errwidth=1.5, capsize=0.2, markers='x')\n",
"sns.pointplot(data=df_errors_lr.values, linestyles='--', scale=0.4, color='k', errwidth=0, capsize=0, ax=ax)\n",
"ax.set_ylabel('P'); ax.set_xlabel('length of string'); ax.set_xticklabels([t for t in range(1000, 31000, 1000)])\n",
"plt.gcf().autofmt_xdate(); ax.grid(); plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#summary graph for string lengths\n",
"string_length = df_errors_lr.columns\n",
"df1 = pd.DataFrame({'string_length': string_length, 'P': df_errors_svl.mean()})\n",
"df2 = pd.DataFrame({'string_length': string_length, 'P': df_errors_svr.mean()})\n",
"df3 = pd.DataFrame({'string_length': string_length, 'P': df_errors_lr.mean()})\n",
"f, ax = plt.subplots(1, 1)\n",
"x_col='string_length'\n",
"y_col = 'P'\n",
"sns.pointplot(ax=ax,x=x_col,y=y_col,data=df1,color='blue', linestyles='-' , scale=.8, markers='.') #, capsize=1\n",
"sns.pointplot(ax=ax,x=x_col,y=y_col,data=df2,color='green', linestyles='-', scale=.8, markers='.')\n",
"sns.pointplot(ax=ax,x=x_col,y=y_col,data=df3,color='red', linestyles='-', scale=.8, markers='.')\n",
"ax.legend(handles=ax.lines[::len(df1)+1], labels=[\"SVC_linear\",\"SVC_rbf\",\"LogReg\"])\n",
"ax.set_xticklabels([t for t in range(1000, 31000, 1000)])\n",
"plt.gcf().autofmt_xdate()\n",
"ax.grid()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here, the results of all three models are approximately equal. At string lengths up to 2,000 characters, logistic regression and SVM with the rbf kernel perform slightly better; beyond that, all algorithms show approximately the same results. For strings 3,000 characters long, the probability of correct author recognition rises to $p > 0.9$."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's estimate the effect of the initial seven trigrams on the quality of author recognition. We will use linear SVM; the number of analysed trigrams is 250, and the substring length is 5000."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"recognizer.set_lang_model(lang_model_size=250, skip=0, smoothing=False, string_size=5000)\n",
"n_mean = 0\n",
"n = 500\n",
"for i in tqdm_notebook(range(n)):\n",
" random.seed(i)\n",
" author1, author2 = random.sample(recognizer.authors, 2)\n",
" params = {'random_state':n, 'kernel': 'linear'}\n",
" mean, std = recognizer.one_against_one(SVC, params, author1, author2, 'accuracy', 25, random_state=i)\n",
" n_mean += mean\n",
"print(f'{n_mean / n:.5f}')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"recognizer.set_lang_model(lang_model_size=250, skip=7, smoothing=False, string_size=5000)\n",
"n_mean = 0\n",
"n = 500\n",
"for i in tqdm_notebook(range(n)):\n",
" random.seed(i)\n",
" author1, author2 = random.sample(recognizer.authors, 2)\n",
" params = {'random_state':n, 'kernel': 'linear'}\n",
" mean, std = recognizer.one_against_one(SVC, params, author1, author2, 'accuracy', 25, random_state=i)\n",
" n_mean += mean\n",
"print(f'{n_mean / n:.5f}')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When the initial seven trigrams are skipped, the probability of recognising the author changes only slightly."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The evaluation of the effect of Laplace's smoothing\n",
"\n",
"When the frequencies of N-grams in a text are used as features for author identification, the researcher may encounter sparse data: some features are absent from the corpus under study, because the corpus is only a sample rather than the full population of texts. This problem is most acute for short texts and high-order N-grams. The standard remedy is a probability-smoothing technique (Laplace, Good-Turing, Katz back-off, Kneser-Ney, etc. \\[6\\]) that assigns non-zero probabilities to unseen events. In \\[3\\], \\[4\\], Laplace smoothing proved to be the best.\n",
"To calculate the N-gram probabilities with additive smoothing, we use the following expression: \n",
"$P_{ADD}(w_{i}|w_{i-n+1}^{i-1})=\\frac{\\delta+c(w_{i-n+1}^{i})}{\\delta V + \\sum \\limits_{w_{i}} c(w_{i-n+1}^{i})}$, \n",
"where $c(\\cdot)$ is the number of occurrences of a given N-gram and $V$ is the number of N-grams in the vocabulary used.\n",
"Laplace smoothing is the particular case of additive smoothing with $\\delta = 1$.\n",
"To assess the effect of Laplace smoothing, we will conduct 500 independent experiments with smoothing turned on and off for randomly selected pairs of authors and compare the probabilities of correct recognition.\n",
"We will run these experiments in two settings:\n",
"1. the number of trigrams is 1,000 and the substring length is 4,000; here the data are sparse and smoothing may be effective;\n",
"2. the number of trigrams is 300 and the substring length is 20,000; here the data are not sparse.\n",
"We expect that, in the first case, the results with smoothing will differ from the results without smoothing.\n"
]
},
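{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of the formula above (the helper `additive_smooth` is ours for illustration and is not part of the `Recognizer` class):\n",
"```python\n",
"from collections import Counter\n",
"\n",
"def additive_smooth(counts, vocab_size, delta=1.0):\n",
"    # P_ADD = (delta + c) / (delta * V + total count); Laplace when delta = 1\n",
"    total = sum(counts.values())\n",
"    return {w: (delta + c) / (delta * vocab_size + total)\n",
"            for w, c in counts.items()}\n",
"\n",
"counts = Counter({'abc': 5, 'bcd': 3, 'cde': 0})\n",
"probs = additive_smooth(counts, vocab_size=3)\n",
"```\n",
"With $\\delta = 1$ the unseen trigram `'cde'` receives the small non-zero probability $1/11$ instead of zero, which is exactly what smoothing is meant to achieve."
]
},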
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"recognizer.set_lang_model(lang_model_size=1000, skip=0, smoothing=False, string_size=4000)\n",
"n_mean = 0\n",
"n = 500\n",
"for i in tqdm_notebook(range(n)):\n",
"    random.seed(i*42)\n",
"    author1, author2 = random.sample(recognizer.authors, 2)\n",
"    params = {'random_state':n, 'kernel': 'rbf'}\n",
"    mean, std = recognizer.one_against_one(SVC, params, author1, author2, 'accuracy', 10, random_state=i)\n",
"    n_mean += mean\n",
"print(f'{n_mean / n:.5f}')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"recognizer.set_lang_model(lang_model_size=1000, skip=0, smoothing=True, string_size=4000)\n",
"n_mean = 0\n",
"n = 500\n",
"for i in tqdm_notebook(range(n)):\n",
" random.seed(i*42)\n",
" author1, author2 = random.sample(recognizer.authors, 2)\n",
" params = {'random_state':n, 'kernel': 'rbf'}\n",
" mean, std = recognizer.one_against_one(SVC, params, author1, author2, 'accuracy', 10, random_state=i)\n",
" n_mean += mean\n",
"print(f'{n_mean / n:.5f}')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"recognizer.set_lang_model(lang_model_size=300, skip=0, smoothing=False, string_size=20000)\n",
"n_mean = 0\n",
"n = 500\n",
"for i in tqdm_notebook(range(n)):\n",
" random.seed(i*42)\n",
" author1, author2 = random.sample(recognizer.authors, 2)\n",
" params = {'random_state':n, 'kernel': 'rbf'}\n",
" mean, std = recognizer.one_against_one(SVC, params, author1, author2, 'accuracy', 10, random_state=i)\n",
" n_mean += mean\n",
"print(f'{n_mean / n:.5f}')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"recognizer.set_lang_model(lang_model_size=300, skip=0, smoothing=True, string_size=20000)\n",
"n_mean = 0\n",
"n = 500\n",
"for i in tqdm_notebook(range(n)):\n",
" random.seed(i*42)\n",
" author1, author2 = random.sample(recognizer.authors, 2)\n",
" params = {'random_state':n, 'kernel': 'rbf'}\n",
" mean, std = recognizer.one_against_one(SVC, params, author1, author2, 'accuracy', 10, random_state=i)\n",
" n_mean += mean\n",
"print(f'{n_mean / n:.5f}')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As we can see, smoothing has only a slight effect on the result in both cases. Perhaps this is because \\[3\\] and \\[4\\] used samples of a much smaller size (two per author) and did not regularise the model."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Author Recognition\n",
"Until now, we have tried to predict the author of a single text sample for all $M = \\frac{N(N-1)}{2}$ possible pairs of authors, where $N$ is the number of authors.\n",
"In general, a text consists of several fragments, and the number of candidate authors may exceed two.\n",
"Let's consider the multiclass classification problem. Initially, all classifiers were trained on the training corpus to recognise all authors from the list `recognizer.authors`. The size of the training subsample was set by the `train_size` parameter of the `fit` method. For the SVM-based classifiers, we applied the `one against one` scheme, and each classifier was trained on equal subsamples for each of the two authors. For `LogisticRegression`, the `one against all` scheme was applied, and each classifier was trained on the subsample *of its* author and the subsamples of the other $N-1$ authors; its training sample was therefore unbalanced.\n",
"Author recognition is implemented by the `predict` method and works as follows. The entire text sample is divided into substrings of the selected length, $m$ substrings in total. We count the trigrams in each substring and obtain a matrix of size $m \\times l$, where $l$ is the number of trigrams accounted for in the language model (*lang_model_size*). The matrix is passed to the `predict` method of each classifier, so for each substring we get a predicted author. The votes are summed over classifiers and substrings, and the final author is chosen by majority voting over the entire text.\n",
"For `LogisticRegression` in the `one against all` scheme, we chose the `predict` method instead of `predict_proba`, since it turned out to give a more accurate final result.\n",
"In previous experiments, we obtained excellent results with substrings of 20,000 characters and 2,000 trigrams, so we will use these values below. We will not use smoothing or skip the initial trigrams.\n",
"We will recognise the author of each text in the test corpus and display the results in a table. In case of an incorrect result, we will also display the name of the misclassified file."
]
},
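{
"cell_type": "markdown",
"metadata": {},
"source": [
"The voting scheme described above can be sketched roughly as follows (the `Stub` classifiers and the name `majority_vote` are ours for illustration; the real voters are the fitted scikit-learn models):\n",
"```python\n",
"import numpy as np\n",
"\n",
"def majority_vote(classifiers, X):\n",
"    # X is an (m, l) matrix of trigram counts for m substrings;\n",
"    # each classifier votes once per substring, and the author\n",
"    # with the most votes over the whole text wins\n",
"    votes = {}\n",
"    for clf in classifiers:\n",
"        for label in clf.predict(X):\n",
"            votes[label] = votes.get(label, 0) + 1\n",
"    return max(votes, key=votes.get)\n",
"\n",
"class Stub:\n",
"    def __init__(self, label):\n",
"        self.label = label\n",
"    def predict(self, X):\n",
"        return [self.label] * len(X)\n",
"\n",
"X = np.zeros((4, 300))\n",
"clfs = [Stub('Bulgakov'), Stub('Bulgakov'), Stub('IlfPetrov')]\n",
"author = majority_vote(clfs, X)\n",
"```\n",
"Here `author` is `'Bulgakov'`: two of the three stub classifiers vote for him on every one of the four substrings."
]
},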
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"test_corpus = '../test_corpus/'\n",
"ilf_petrov = '../IlfPetrov/'\n",
"string_size = 20000\n",
"train_size=15\n",
"lang_model_size=2000\n",
"\n",
"recognizer.set_lang_model(lang_model_size=lang_model_size,\n",
" skip=0, smoothing=False,\n",
" string_size=string_size)\n",
"\n",
"def predict_author(recognizer, PATH='../test_corpus/'):\n",
"    files = listdir(PATH)\n",
"    true = 0\n",
"    total = 0  # renamed from `all` to avoid shadowing the builtin\n",
"    print(' ________________________________________ ')\n",
"    print('| real author | predicted author| matched ')\n",
"\n",
"    for file in files:\n",
"        author = file.split('_')[0]\n",
"        if author in recognizer.authors:\n",
"            total += 1\n",
"            pred_author = recognizer.predict(path.join(PATH, file))\n",
"            print(f'| {author}{\" \" * (12 - len(author))}| {pred_author}{\" \" * (15 - len(pred_author))}| {author == pred_author}', end=\" \")\n",
"            if author == pred_author:\n",
"                true += 1\n",
"                print()\n",
"            else:\n",
"                print(file)\n",
"    print('|________________________________________ ')\n",
"    if PATH != '../IlfPetrov/':\n",
"        print(f'Accuracy {100 * true / total:.2f}%', end='\\n\\n')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### SVM with *linear* kernel"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"params = {'random_state':42, 'kernel':'linear'} \n",
"recognizer.fit(model=SVC, params=params, mode='oao', train_size=train_size, random_state=42)\n",
"predict_author(recognizer=recognizer, PATH=test_corpus)\n",
"predict_author(recognizer=recognizer, PATH=ilf_petrov)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### SVM with *rbf* kernel"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"params = {'random_state':42,'kernel':'rbf'}\n",
"recognizer.fit(model=SVC, params=params, mode='oao', train_size=train_size, random_state=42)\n",
"predict_author(recognizer=recognizer, PATH=test_corpus)\n",
"predict_author(recognizer=recognizer, PATH=ilf_petrov)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### LogisticRegression"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"params={'random_state':42, 'class_weight':'balanced', 'solver':'liblinear'}\n",
"recognizer.fit(model=LogisticRegression, params=params, mode='oaa', train_size=train_size, random_state=42)\n",
"predict_author(recognizer=recognizer, PATH=test_corpus)\n",
"predict_author(recognizer=recognizer, PATH=ilf_petrov)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Linear SVM and LogisticRegression show the best results on the test corpus, and both identify Bulgakov as the author of the novel “The Twelve Chairs”, but linear SVM identifies Ilf and Petrov as the authors of “The Golden Calf”. The SVM with the rbf kernel shows a worse result, but it recognises Bulgakov as the author of both novels."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Authorship Hypothesis Testing\n",
"\n",
"In the previous experiment, we could not unambiguously identify the author of the novels \"The Twelve Chairs\" and \"The Golden Calf\". Let us pose the question strictly: who wrote them, Ilf and Petrov, or Bulgakov?\n",
"To choose between two proposed authors, we will use the `hypothesis_test` method of the `Recognizer` class. With the `one against one` scheme everything is simple: we already have trained classifiers for every pair of authors, and it is enough to apply them.\n",
"If the classifiers were trained in the `one against all` scheme, we have two possibilities:\n",
"* for the classifiers trained on the texts of each of the two authors, we can call the `predict_proba` method and turn the obtained probabilities into a choice via `softmax`;\n",
"* we can train a new classifier on the texts of the two authors, i.e. reduce the task to the `one against one` scheme. In this case, it is reasonable to use `roc_auc` as the metric.\n",
"We will try all of these approaches.\n",
"\n",
"The experiment is set up as follows. For each text in the test corpus, we pit the real author against every other author, one at a time. We calculate the `accuracy` and record the names of the files where the author is identified incorrectly. First, we experiment with the test corpus; then we check the works of Ilf and Petrov."
]
},
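{
"cell_type": "markdown",
"metadata": {},
"source": [
"The first option (combining the `predict_proba` outputs of the two `one against all` classifiers via softmax) might look roughly like this sketch; the exact internals of `hypothesis_test` may differ:\n",
"```python\n",
"import numpy as np\n",
"\n",
"def softmax(z):\n",
"    # numerically stable softmax over a 1-D score vector\n",
"    e = np.exp(z - np.max(z))\n",
"    return e / e.sum()\n",
"\n",
"# hypothetical per-substring probabilities that each of the two\n",
"# one-against-all classifiers assigns to 'its' author\n",
"p_author1 = np.array([0.7, 0.6, 0.8])\n",
"p_author2 = np.array([0.4, 0.5, 0.3])\n",
"scores = softmax(np.array([p_author1.mean(), p_author2.mean()]))\n",
"```\n",
"The entry of `scores` belonging to the first author is larger here, so the first author would be preferred."
]
},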
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def test_hyppo(recognizer, PATH='../test_corpus/', softmax=False):\n",
"    files = listdir(PATH)\n",
"    true = 0\n",
"    total = 0  # renamed from `all` to avoid shadowing the builtin\n",
"    print('________________________________________________')\n",
"    for file in files:\n",
"        real_author = file.split('_')[0]\n",
"        if real_author not in recognizer.authors:\n",
"            continue\n",
"        another_authors = recognizer.authors.copy()\n",
"        another_authors.remove(real_author)\n",
"        for author in another_authors:\n",
"            total += 1\n",
"            pred = recognizer.hypothesis_test(path.join(PATH, file), author, real_author, softmax=softmax)\n",
"            pred_author = pred[pred == pred.max()].index[0]\n",
"            if real_author == pred_author:\n",
"                true += 1\n",
"            else:\n",
"                print(f'{file} - {author}')\n",
"    if PATH != '../IlfPetrov/':\n",
"        print(f'Accuracy {100 * true / total:.2f}% in {total} experiments')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### SVM with *linear* kernel"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"params = {'random_state':42, 'kernel':'linear'}\n",
"recognizer.fit(model=SVC, params=params, mode='oao', train_size=train_size, random_state=42)\n",
"test_hyppo(recognizer=recognizer)\n",
"test_hyppo(recognizer=recognizer, PATH=ilf_petrov)\n",
"print()\n",
"print('12 Chairs')\n",
"print(recognizer.hypothesis_test('../IlfPetrov/IlfPetrov_12chairs_string.txt', 'IlfPetrov', 'Bulgakov'), end='\\n\\n')\n",
"print('The Golden Calf')\n",
"print(recognizer.hypothesis_test('../IlfPetrov/IlfPetrov_золотой_теленок_string.txt', 'IlfPetrov', 'Bulgakov'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### SVM with *rbf* kernel"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"params = {'random_state':42, 'kernel':'rbf'}\n",
"recognizer.fit(model=SVC, params=params, mode='oao', train_size=train_size, random_state=42)\n",
"test_hyppo(recognizer=recognizer)\n",
"test_hyppo(recognizer=recognizer, PATH=ilf_petrov)\n",
"print()\n",
"print('12 Chairs')\n",
"print(recognizer.hypothesis_test('../IlfPetrov/IlfPetrov_12chairs_string.txt', 'IlfPetrov', 'Bulgakov'), end='\\n\\n')\n",
"print('The Golden Calf')\n",
"print(recognizer.hypothesis_test('../IlfPetrov/IlfPetrov_золотой_теленок_string.txt', 'IlfPetrov', 'Bulgakov'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### LogisticRegression as *one-against-one*"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"params={'random_state':42, 'class_weight':'balanced', 'solver':'liblinear'}\n",
"recognizer.fit(model=LogisticRegression, params=params, mode='oaa', train_size=train_size, random_state=42)\n",
"test_hyppo(recognizer=recognizer)\n",
"test_hyppo(recognizer=recognizer, PATH=ilf_petrov)\n",
"print()\n",
"print(\"'12 Chairs'\")\n",
"print(recognizer.hypothesis_test('../IlfPetrov/IlfPetrov_12chairs_string.txt', 'IlfPetrov', 'Bulgakov'), end='\\n\\n')\n",
"print(\"'The Golden Calf'\")\n",
"print(recognizer.hypothesis_test('../IlfPetrov/IlfPetrov_золотой_теленок_string.txt', 'IlfPetrov', 'Bulgakov'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### LogisticRegression with the *softmax*"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"params={'random_state':42, 'class_weight':'balanced', 'solver':'liblinear'}\n",
"recognizer.fit(model=LogisticRegression, params=params, mode='oaa', train_size=train_size, random_state=42)\n",
"test_hyppo(recognizer=recognizer, softmax=True)\n",
"test_hyppo(recognizer=recognizer, PATH=ilf_petrov, softmax=True)\n",
"print()\n",
"print(\"'12 Chairs'\")\n",
"print(recognizer.hypothesis_test('../IlfPetrov/IlfPetrov_12chairs_string.txt', 'IlfPetrov', 'Bulgakov'), end='\\n\\n')\n",
"print(\"'The Golden Calf'\")\n",
"print(recognizer.hypothesis_test('../IlfPetrov/IlfPetrov_золотой_теленок_string.txt', 'IlfPetrov', 'Bulgakov'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We obtained the best accuracy with the linear-kernel SVM and with LogisticRegression in the *one-against-one* scheme. In the first case, the algorithm recognises Bulgakov as the author of \"The Twelve Chairs\", but by a minimal margin; in the second, we get a draw of 0.5 : 0.5 for both works."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I would also like to mention that:\n",
"\n",
"1. In most experiments, Mikhail Bulgakov was recognised as the real author of Valentin Kataev's \"Erendorf Island\". The story was written in 1924, when Kataev and Bulgakov worked together at the Moscow newspaper \"Gudok\". Bulgakov was published little at the time, so he could have written the story and asked Kataev to publish it under his name for a fee (though this is mere speculation).\n",
"\n",
"2. Many classifiers have difficulty recognising the author of the novel \"The Three Fat Men\", as if Olesha had not written it. Perhaps the novel was edited in the middle of the twentieth century. For the same reason, we excluded Lev Kassil's novel \"The Black Book and Schwambrania\" from the training corpus, suspecting that it was significantly revised in the 1950s (though this, too, is speculation)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Conclusions\n",
" \n",
"In this project, we have considered one of the methods of recognising the authors of books, based on counting the most frequent trigrams. We attempted to recognise the authors of the novels \"The Twelve Chairs\" and \"The Golden Calf\". Based on the results of the project, we can conclude that:\n",
"\n",
"1. The proposed method shows good results in recognising authors, but it strongly depends on parameters such as the number of trigrams used and the length of the substrings analysed. With small changes in the parameters, the results can vary; although the overall picture changes little, for individual texts the results may be radically opposite.\n",
"2. The method is suitable for large texts only. We analysed strings of 20,000 characters; for small texts, this method is not appropriate.\n",
"3. We used three algorithms: logistic regression and two kinds of SVM, with linear and rbf kernels. The best results were obtained with logistic regression; the worst with the rbf-kernel SVM.\n",
"4. The authorship of the novels \"The Twelve Chairs\" and \"The Golden Calf\" cannot be unambiguously established, but the hypothesis of Bulgakov's authorship seems reasonable to us. More research is needed in this area; perhaps the use of higher-level structures (word forms, grammatical classes, syntactic constructions) will give more reliable results."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### References\n",
"\n",
"1. Ирина Амлински \"12 стульев от Михаила Булгакова\", Berlin, Kirschner-Verlag, 2013\n",
"2. Дмитрий Галковский \"Что необходимо знать о Михаиле Булгакове\"\n",
"3. [А.С. Романов \"Методика идентификации автора текста на основе аппарата опорных векторов\"](https://docplayer.ru/29386403-Metodika-identifikacii-avtora-teksta-na-osnove-apparata-opornyh-vektorov.html)\n",
"4. [Романов А. С. \"Идентификация автора текста с помощью аппарата опорных векторов в случае двух возможных альтернатив\"](http://www.dialog-21.ru/media/1617/67.pdf)\n",
"5. Михаил Копотев \"Введение в корпусную лингвистику\", Praha, Animedia Company, 2014\n",
"6. S.F. Chen, J. Goodman \"An Empirical Study of Smoothing Techniques for Language Modeling\", Computer Speech & Language, 1999, Vol. 13, No. 4.\n",
"7. В.П.Фоменко, Т.Г.Фоменко [\"Авторский инвариант русских литературных текстов\"](http://chronologia.org/seven2_2/add3.html)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.6"
}
},
"nbformat": 4,
"nbformat_minor": 2
}