{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Approach to categorical variables" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Categorical variables are the essence of many real-world tasks. Every business task you're will ever solve will include categorical variables. So it's better to have a good taste of them." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For demonstration purposes, I will use 2 models RF and Linear as they have different nature and would better highlight differences in category treating.\n", "\n", "Dataset from kaggle medium competition, where we should predict a number of claps (likes) to the article." ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "from sklearn.metrics import mean_absolute_error\n", "from sklearn.linear_model import Ridge\n", "from sklearn.ensemble import RandomForestRegressor\n", "import feather\n", "from sklearn.preprocessing import StandardScaler\n", "from sklearn.manifold import TSNE\n", "from sklearn.decomposition import PCA\n", "from sklearn.model_selection import train_test_split\n", "import warnings\n", "warnings.filterwarnings('ignore')" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "def add_date_parts(df, date_column= 'published'):\n", " df['hour'] = df[date_column].dt.hour\n", " df['month'] = df[date_column].dt.month\n", " df['weekday'] = df[date_column].dt.weekday\n", " df['year'] = df[date_column].dt.year\n", " df['week'] = df[date_column].dt.week\n", " df['working_day'] = (df['weekday'] < 5).astype('int')" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "PATH_TO_DATA = '../../data/medium/'\n", "train_df = feather.read_dataframe(PATH_TO_DATA +'medium_train')\n", "train_df.set_index('id', inplace=True)\n", "add_date_parts(train_df)" ] }, { "cell_type": "code", 
"execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " content \\\n", "id \n", "358381 Patricio Barríacronista del proyecto Supay Was... \n", "\n", " published \\\n", "id \n", "358381 2017-06-30 23:40:35.633 \n", "\n", " title author \\\n", "id \n", "358381 Salamancas, antepasados y espíritus guardianes... Patricio Barría \n", "\n", " domain tags length \\\n", "id \n", "358381 medium.com Chile Etnografia ValleDeElqui Brujeria Postcol... 12590 \n", "\n", " url \\\n", "id \n", "358381 https://medium.com/@patopullayes/salamancas-an... \n", "\n", " image_url lang \\\n", "id \n", "358381 https://cdn-images-1.medium.com/max/1200/1*e5C... SPANISH \n", "\n", " log_recommends hour month weekday year week working_day \n", "id \n", "358381 3.09104 23 6 4 2017 26 1 " ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_df.head(1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The text is not the purpose of this tutorial, so I'll drop it" ] }, { "cell_type": "code", "execution_count": 460, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " author domain lang log_recommends hour month \\\n", "id \n", "358381 Patricio Barría medium.com SPANISH 3.09104 23 6 \n", "\n", " weekday year week working_day \n", "id \n", "358381 4 2017 26 1 " ] }, "execution_count": 460, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_df = train_df[['author','domain','lang','log_recommends','hour','month','weekday','year','week','working_day']]\n", "train_df.head(1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Basic approach LE." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "LE (label encoding) is the most simple. We have some categories (country for example) ['Russia', 'USA', 'GB']. But algoritms do not work with strings, they need numbers. Ok, we can do it ['Russia', 'USA', 'GB'] -> [0, 1, 2]. Relly simple. Let's try." ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "autor_to_int = dict((zip(train_df.author.unique(), range(train_df.author.unique().shape[0]))))\n", "domain_to_int = dict((zip(train_df.domain.unique(), range(train_df.domain.unique().shape[0]))))\n", "lang_to_int = dict((zip(train_df.lang.unique(), range(train_df.lang.unique().shape[0]))))\n", "train_df_le = train_df.copy()" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " author domain lang log_recommends hour month weekday year \\\n", "id \n", "358381 0 0 0 3.09104 23 6 4 2017 \n", "401900 1 0 1 1.09861 23 6 4 2017 \n", "146566 2 0 2 1.38629 23 6 4 2017 \n", "28970 3 0 3 2.77259 22 6 4 2017 \n", "102763 4 0 1 1.38629 22 6 4 2017 \n", "\n", " week working_day \n", "id \n", "358381 26 1 \n", "401900 26 1 \n", "146566 26 1 \n", "28970 26 1 \n", "102763 26 1 " ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_df_le['author'] = train_df_le['author'].apply(lambda aut: autor_to_int[aut])\n", "train_df_le['domain'] = train_df_le['domain'].apply(lambda aut: domain_to_int[aut])\n", "train_df_le['lang'] = train_df_le['lang'].apply(lambda aut: lang_to_int[aut])\n", "train_df_le.head()" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " author domain lang hour month weekday year week working_day\n", "id \n", "358381 0 0 0 23 6 4 2017 26 1" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y = train_df_le.log_recommends\n", "X = train_df_le.drop('log_recommends', axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### RF label encoded" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1.5075966789005786" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_train, X_val,y_train,y_val = train_test_split(X,y, test_size=0.2)\n", "rf = RandomForestRegressor()\n", "rf.fit(X_train, y_train)\n", "preds = rf.predict(X_val)\n", "mean_absolute_error(y_val, preds)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### LR label encoded" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Linear models like scaled input" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [], "source": [ "scaler = StandardScaler()\n", "X = scaler.fit_transform(X)\n", "X_train, X_val,y_train,y_val = train_test_split(X,y, test_size=0.2)" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1.5689939074034462" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ridge = Ridge()\n", "ridge.fit(X_train, y_train)\n", "preds = ridge.predict(X_val)\n", "mean_absolute_error(y_val, preds)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It seems linear model perform worse. Yes, it is, because of their nature. Linear model tries to find weight W that would be multiplied with input X, y = W*X + b. 
With LE (mapping ['Russia', 'USA', 'GB'] -> [0, 1, 2]) we are telling our model that the weight for \"Russia\" doesn't matter because X == 0, and that GB is two times bigger than USA.\n", "\n", "So it's not OK to use LE with linear models." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### One-hot-encoding (OHE)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can treat each category as a thing of its own. ['Russia', 'USA', 'GB'] is converted to 3 binary features, each taking the value 0 or 1.\n", "\n", "This way every category gets its own independent weight, but the dimensionality blows up." ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [], "source": [ "train_df_ohe = train_df.copy()\n", "y = train_df_ohe.log_recommends\n", "X = train_df_ohe.drop('log_recommends', axis=1)\n", "X[X.columns] = X[X.columns].astype('category')\n", "X = pd.get_dummies(X, prefix=X.columns)" ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(62313, 31729)" ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Boom! We had 9 dimensions, and now we have about 31.7k. (Yes, I treat the date parts such as weekday, year and week as categories too.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### RF ohe" ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1.4044947415210292" ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_train, X_val,y_train,y_val = train_test_split(X,y, test_size=0.2)\n", "rf = RandomForestRegressor(n_jobs=-1)\n", "rf.fit(X_train, y_train)\n", "preds = rf.predict(X_val)\n", "mean_absolute_error(y_val, preds)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The score improved, but training time and memory consumption jumped drastically. 
(It needed more than 20 GB of RAM.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### LR ohe" ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1.14977547763283" ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ridge = Ridge()\n", "ridge.fit(X_train, y_train)\n", "preds = ridge.predict(X_val)\n", "mean_absolute_error(y_val, preds)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Wow! A significant improvement: MAE dropped from about 1.57 with LE to about 1.15 with OHE." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Categorical embeddings" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You probably knew everything above already.\n", "\n", "Now it's time to try something new: the neural network (NN) approach to categorical variables.\n", "\n", "In Kaggle competitions with heavy use of categorical data, tree ensemble methods such as XGBoost usually work best. Why, in the age of rising NNs, haven't they conquered this area yet?
\n", "In principle a neural network can approximate any continuous function and piecewise continuous function. However, it is not suitable to approximate arbitrary non-continuous functions as it assumes a certain level of continuity in its general form. During the training phase the continuity of the data guarantees the convergence of the optimization, and during the prediction phase it ensures that slightly changing the values of the input keeps the output stable.
\n", "Trees make no such assumption about data continuity and can divide the states of a variable as finely as necessary." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An NN is in some ways close to a linear model. What did we do for the linear model? We used OHE, but it blew up our dimensionality. For many real-world tasks, where a feature may have a cardinality in the millions, this gets even harder. Secondly, we lose some information with such a transformation. In our example, language is a feature. With OHE, \"SPANISH\" -> [1,0,0,...,0] and \"ENGLISH\" -> [0,1,0,...,0]. Every pair of languages ends up at the same distance from each other, yet Spanish and English are arguably more similar than English and Chinese. We want to capture this inner relation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The solution to these problems is to use embeddings, which translate large sparse vectors into a lower-dimensional space that preserves semantic relationships." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How it works in the NLP field:\n", "\n", "| feature | vector |\n", "|------|------|\n", "| puppy | [0.9, 1.0, 0.0] |\n", "| dog | [1.0, 0.2, 0.0] |\n", "| kitten | [0.0, 1.0, 0.9] |\n", "| cat | [0.0, 0.2, 1.0] |\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that words share values along some dimensions, which we can interpret as, say, \"dogness\" or \"size\"." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To do this, all we need is an embedding matrix.\n", "\n", "We start with OHE and obtain N rows with M columns, where M is the number of distinct category values. Multiplying a one-hot row by the embedding matrix simply picks the row of the matrix that encodes our category. We then use this vector, which represents some rich properties of the initial category, instead of the one-hot row. \n", "We can obtain the embeddings with NN magic: we train an embedding matrix of size MxP, where P is a number we pick (a hyperparameter). 
Google's rule of thumb suggests picking P of about M**0.25." ] }, { "cell_type": "code", "execution_count": 464, "metadata": {}, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 464, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from IPython.display import Image\n", "Image(url='https://habrastorage.org/webt/of/jy/gd/ofjygd5fmbpxwz8x6boeu2nnpk4.png')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I'll use Keras, but the choice is not important; it's just a tool." ] }, { "cell_type": "code", "execution_count": 153, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import keras\n", "from keras.models import Sequential\n", "from keras.layers import Input, Embedding, Dense, Dropout, BatchNormalization\n", "from keras.models import Model\n", "import matplotlib.pyplot as plt\n" ] }, { "cell_type": "code", "execution_count": 430, "metadata": {}, "outputs": [], "source": [ "class EmbeddingMapping():\n", "    \"\"\"\n", "    Helper class for handling categorical variables.\n", "    An instance of this class should be defined for each categorical variable we want to use.\n", "    \"\"\"\n", "    def __init__(self, series):\n", "        # Get a list of unique values\n", "        values = series.unique().tolist()\n", "\n", "        # Map each value to an integer in 1..len(values); 0 is reserved for unseen values\n", "        self.embedding_dict = {value: int_value + 1 for int_value, value in enumerate(values)}\n", "\n", "        # num_values is used as the input_dim of the embedding layer,\n", "        # so the valid indices are 0..len(values)\n", "        self.num_values = len(values) + 1\n", "\n", "    def get_mapping(self, value):\n", "        # If the value was seen in the training set, return its integer mapping;\n", "        # otherwise return 0, the index shared by all unseen values\n", "        return self.embedding_dict.get(value, 0)" ] }, { "cell_type": "code", "execution_count": 439, "metadata": {}, "outputs": [], "source": [ "# Convert the categorical features to integer indices\n", "author_mapping = EmbeddingMapping(train_df['author'])\n", "domain_mapping = EmbeddingMapping(train_df['domain'])\n", "lang_mapping = EmbeddingMapping(train_df['lang'])\n", "X_emb = X_emb.assign(author_mapping=X_emb['author'].apply(author_mapping.get_mapping))\n", "X_emb = X_emb.assign(lang_mapping=X_emb['lang'].apply(lang_mapping.get_mapping))\n", "X_emb = X_emb.assign(domain_mapping=X_emb['domain'].apply(domain_mapping.get_mapping))" ] }, { "cell_type": "code", "execution_count": 441, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " author domain lang log_recommends hour month \\\n", "id \n", "114160 Gautham Krishna medium.com ENGLISH 1.79176 19 7 \n", "\n", " weekday year week working_day author_mapping lang_mapping \\\n", "id \n", "114160 0 2016 28 1 18582 4 \n", "\n", " domain_mapping \n", "id \n", "114160 1 " ] }, "execution_count": 441, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_emb.sample(1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_emb = train_df.copy()" ] }, { "cell_type": "code", "execution_count": 435, "metadata": {}, "outputs": [], "source": [ "X_train, X_val,y_train,y_val = train_test_split(X_emb,y, test_size=0.2)" ] }, { "cell_type": "code", "execution_count": 274, "metadata": {}, "outputs": [], "source": [ "# Keras functional API\n", "#Input\n", "author_input = Input(shape=(1,), dtype='int32') \n", "lang_input = Input(shape=(1,), dtype='int32')\n", "domain_input = Input(shape=(1,), dtype='int32')\n", "\n", "# It's google's fule of thumb N_embeddings == N_originall_dim**0.25\n", "# Let’s define the embedding layer and flatten it\n", "# Originally 31331 unique authors\n", "author_embedings = Embedding(output_dim=13, input_dim=author_mapping.num_values, input_length=1)(author_input)\n", "author_embedings = keras.layers.Reshape((13,))(author_embedings)\n", "# Originally 62 unique langs\n", "lang_embedings = Embedding(output_dim=3, input_dim=lang_mapping.num_values, input_length=1)(lang_input)\n", "lang_embedings = keras.layers.Reshape((3,))(lang_embedings)\n", "# Originally 221 unique domains\n", "domain_embedings = Embedding(output_dim=4, input_dim=domain_mapping.num_values, input_length=1)(domain_input)\n", "domain_embedings = keras.layers.Reshape((4,))(domain_embedings)\n", "\n", "\n", "# Concatenate continuous and embeddings inputs\n", "all_input = keras.layers.concatenate([lang_embedings, author_embedings, domain_embedings])" ] }, { "cell_type": "code", "execution_count": 
475, "metadata": {}, "outputs": [], "source": [ "# Fully connected layers to train the NN and learn the embeddings\n", "units=25\n", "dense1 = Dense(units=units, activation='relu')(all_input)\n", "dense1 = Dropout(0.5)(dense1)\n", "dense2 = Dense(units, activation='relu')(dense1)\n", "dense2 = Dropout(0.5)(dense2)\n", "predictions = Dense(1)(dense2)" ] }, { "cell_type": "code", "execution_count": 443, "metadata": {}, "outputs": [], "source": [ "epochs = 40\n", "model = Model(inputs=[lang_input, author_input, domain_input], outputs=predictions)\n", "model.compile(loss='mae', optimizer='adagrad')\n", "\n", "history = model.fit([X_train['lang_mapping'], X_train['author_mapping'], X_train['domain_mapping']], y_train, \n", " epochs=epochs, batch_size=128, verbose=0,\n", " validation_data=([X_val['lang_mapping'], X_val['author_mapping'], X_val['domain_mapping']], y_val))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "At this point we have trained an NN, but we are not going to use it for predictions. What we want are the weights of its embedding layers.\n", "\n", "Each category value has its own distinct embedding vector. Let's extract them and use them as features in our simple models." 
] }, { "cell_type": "code", "execution_count": 461, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[,\n", " ,\n", " ,\n", " ,\n", " ,\n", " ,\n", " ,\n", " ,\n", " ,\n", " ,\n", " ,\n", " ,\n", " ,\n", " ,\n", " ]" ] }, "execution_count": 461, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.layers" ] }, { "cell_type": "code", "execution_count": 444, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(222, 4)" ] }, "execution_count": 444, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.layers[5].get_weights()[0].shape" ] }, { "cell_type": "code", "execution_count": 445, "metadata": {}, "outputs": [], "source": [ "lang_embedding = model.layers[3].get_weights()[0]\n", "lang_emb_cols = [f'lang_emb_{i}' for i in range(lang_embedding.shape[1])]" ] }, { "cell_type": "code", "execution_count": 446, "metadata": {}, "outputs": [], "source": [ "author_embedding = model.layers[4].get_weights()[0]\n", "aut_emb_cols = [f'aut_emb_{i}' for i in range(author_embedding.shape[1])]" ] }, { "cell_type": "code", "execution_count": 447, "metadata": {}, "outputs": [], "source": [ "domain_embedding = model.layers[5].get_weights()[0]\n", "dom_emb_cols = [f'dom_emb_{i}' for i in range(domain_embedding.shape[1])]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we have embeddings, and all we need is to take a row that corresponds to our examples." 
] }, { "cell_type": "code", "execution_count": 448, "metadata": {}, "outputs": [], "source": [ "def get_author_vector(aut_num):\n", " return author_embedding[aut_num,:]\n", "def get_lang_vector(lang_num):\n", " return lang_embedding[lang_num,:]\n", "def get_domain_vector(dom_num):\n", " return domain_embedding[dom_num,:]" ] }, { "cell_type": "code", "execution_count": 449, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([-0.01509277, -0.03493742, -0.04596788], dtype=float32)" ] }, "execution_count": 449, "metadata": {}, "output_type": "execute_result" } ], "source": [ "get_lang_vector(4)" ] }, { "cell_type": "code", "execution_count": 450, "metadata": {}, "outputs": [], "source": [ "lang_emb = pd.DataFrame(X_emb['lang_mapping'].apply(get_lang_vector).values.tolist(), columns=lang_emb_cols)\n", "lang_emb.index = X_emb.index\n", "X_emb[lang_emb_cols] = lang_emb" ] }, { "cell_type": "code", "execution_count": 451, "metadata": {}, "outputs": [], "source": [ "aut_emb = pd.DataFrame(X_emb['author_mapping'].apply(get_author_vector).values.tolist(), columns=aut_emb_cols)\n", "aut_emb.index = X_emb.index\n", "X_emb[aut_emb_cols] = aut_emb" ] }, { "cell_type": "code", "execution_count": 452, "metadata": {}, "outputs": [], "source": [ "dom_emb = pd.DataFrame(X_emb['domain_mapping'].apply(get_domain_vector).values.tolist(), columns=dom_emb_cols)\n", "dom_emb.index = X_emb.index\n", "X_emb[dom_emb_cols] = dom_emb" ] }, { "cell_type": "code", "execution_count": 453, "metadata": {}, "outputs": [], "source": [ "X_emb.drop(['author', 'lang', 'domain', 'log_recommends',\n", " 'author_mapping', 'lang_mapping', 'domain_mapping',],axis=1, inplace=True)" ] }, { "cell_type": "code", "execution_count": 454, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['hour', 'month', 'weekday', 'year', 'week', 'working_day', 'lang_emb_0',\n", " 'lang_emb_1', 'lang_emb_2', 'aut_emb_0', 'aut_emb_1', 'aut_emb_2',\n", " 'aut_emb_3', 'aut_emb_4', 'aut_emb_5', 'aut_emb_6', 
'aut_emb_7',\n", " 'aut_emb_8', 'aut_emb_9', 'aut_emb_10', 'aut_emb_11', 'aut_emb_12',\n", " 'dom_emb_0', 'dom_emb_1', 'dom_emb_2', 'dom_emb_3'],\n", " dtype='object')" ] }, "execution_count": 454, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_emb.columns" ] }, { "cell_type": "code", "execution_count": 455, "metadata": {}, "outputs": [], "source": [ "X_train, X_val,y_train,y_val = train_test_split(X_emb,y, test_size=0.2)" ] }, { "cell_type": "code", "execution_count": 456, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.7075810844493988" ] }, "execution_count": 456, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rf = RandomForestRegressor(n_jobs=-1)\n", "rf.fit(X_train, y_train)\n", "preds = rf.predict(X_val)\n", "mean_absolute_error(y_val, preds)" ] }, { "cell_type": "code", "execution_count": 458, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.6837744334988985" ] }, "execution_count": 458, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ridge = Ridge()\n", "ridge.fit(X_train, y_train)\n", "preds = ridge.predict(X_val)\n", "mean_absolute_error(y_val, preds)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It looks like a success: MAE dropped to about 0.71 for RF and 0.68 for Ridge. Keep in mind that the embeddings were trained with the target on a different random split, so some rows in this validation set were seen during embedding training and the scores are likely optimistic." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One nice property of embeddings is that categories now have a similarity (distance) to each other. Let's look at the graph." ] }, { "cell_type": "code", "execution_count": 526, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", " \n", " Loading BokehJS ...\n", "
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/javascript": [ "\n", "(function(root) {\n", " function now() {\n", " return new Date();\n", " }\n", "\n", " var force = true;\n", "\n", " if (typeof (root._bokeh_onload_callbacks) === \"undefined\" || force === true) {\n", " root._bokeh_onload_callbacks = [];\n", " root._bokeh_is_loading = undefined;\n", " }\n", "\n", " var JS_MIME_TYPE = 'application/javascript';\n", " var HTML_MIME_TYPE = 'text/html';\n", " var EXEC_MIME_TYPE = 'application/vnd.bokehjs_exec.v0+json';\n", " var CLASS_NAME = 'output_bokeh rendered_html';\n", "\n", " /**\n", " * Render data to the DOM node\n", " */\n", " function render(props, node) {\n", " var script = document.createElement(\"script\");\n", " node.appendChild(script);\n", " }\n", "\n", " /**\n", " * Handle when an output is cleared or removed\n", " */\n", " function handleClearOutput(event, handle) {\n", " var cell = handle.cell;\n", "\n", " var id = cell.output_area._bokeh_element_id;\n", " var server_id = cell.output_area._bokeh_server_id;\n", " // Clean up Bokeh references\n", " if (id != null && id in Bokeh.index) {\n", " Bokeh.index[id].model.document.clear();\n", " delete Bokeh.index[id];\n", " }\n", "\n", " if (server_id !== undefined) {\n", " // Clean up Bokeh references\n", " var cmd = \"from bokeh.io.state import curstate; print(curstate().uuid_to_server['\" + server_id + \"'].get_sessions()[0].document.roots[0]._id)\";\n", " cell.notebook.kernel.execute(cmd, {\n", " iopub: {\n", " output: function(msg) {\n", " var id = msg.content.text.trim();\n", " if (id in Bokeh.index) {\n", " Bokeh.index[id].model.document.clear();\n", " delete Bokeh.index[id];\n", " }\n", " }\n", " }\n", " });\n", " // Destroy server and session\n", " var cmd = \"import bokeh.io.notebook as ion; ion.destroy_server('\" + server_id + \"')\";\n", " cell.notebook.kernel.execute(cmd);\n", " }\n", " }\n", "\n", " /**\n", " * Handle when a new output is added\n", " */\n", " 
function handleAddOutput(event, handle) {\n", " var output_area = handle.output_area;\n", " var output = handle.output;\n", "\n", " // limit handleAddOutput to display_data with EXEC_MIME_TYPE content only\n", " if ((output.output_type != \"display_data\") || (!output.data.hasOwnProperty(EXEC_MIME_TYPE))) {\n", " return\n", " }\n", "\n", " var toinsert = output_area.element.find(\".\" + CLASS_NAME.split(' ')[0]);\n", "\n", " if (output.metadata[EXEC_MIME_TYPE][\"id\"] !== undefined) {\n", " toinsert[toinsert.length - 1].firstChild.textContent = output.data[JS_MIME_TYPE];\n", " // store reference to embed id on output_area\n", " output_area._bokeh_element_id = output.metadata[EXEC_MIME_TYPE][\"id\"];\n", " }\n", " if (output.metadata[EXEC_MIME_TYPE][\"server_id\"] !== undefined) {\n", " var bk_div = document.createElement(\"div\");\n", " bk_div.innerHTML = output.data[HTML_MIME_TYPE];\n", " var script_attrs = bk_div.children[0].attributes;\n", " for (var i = 0; i < script_attrs.length; i++) {\n", " toinsert[toinsert.length - 1].firstChild.setAttribute(script_attrs[i].name, script_attrs[i].value);\n", " }\n", " // store reference to server id on output_area\n", " output_area._bokeh_server_id = output.metadata[EXEC_MIME_TYPE][\"server_id\"];\n", " }\n", " }\n", "\n", " function register_renderer(events, OutputArea) {\n", "\n", " function append_mime(data, metadata, element) {\n", " // create a DOM node to render to\n", " var toinsert = this.create_output_subarea(\n", " metadata,\n", " CLASS_NAME,\n", " EXEC_MIME_TYPE\n", " );\n", " this.keyboard_manager.register_events(toinsert);\n", " // Render to node\n", " var props = {data: data, metadata: metadata[EXEC_MIME_TYPE]};\n", " render(props, toinsert[toinsert.length - 1]);\n", " element.append(toinsert);\n", " return toinsert\n", " }\n", "\n", " /* Handle when an output is cleared or removed */\n", " events.on('clear_output.CodeCell', handleClearOutput);\n", " events.on('delete.Cell', handleClearOutput);\n", "\n", " /* 
Handle when a new output is added */\n", " events.on('output_added.OutputArea', handleAddOutput);\n", "\n", " /**\n", " * Register the mime type and append_mime function with output_area\n", " */\n", " OutputArea.prototype.register_mime_type(EXEC_MIME_TYPE, append_mime, {\n", " /* Is output safe? */\n", " safe: true,\n", " /* Index of renderer in `output_area.display_order` */\n", " index: 0\n", " });\n", " }\n", "\n", " // register the mime type if in Jupyter Notebook environment and previously unregistered\n", " if (root.Jupyter !== undefined) {\n", " var events = require('base/js/events');\n", " var OutputArea = require('notebook/js/outputarea').OutputArea;\n", "\n", " if (OutputArea.prototype.mime_types().indexOf(EXEC_MIME_TYPE) == -1) {\n", " register_renderer(events, OutputArea);\n", " }\n", " }\n", "\n", " \n", " if (typeof (root._bokeh_timeout) === \"undefined\" || force === true) {\n", " root._bokeh_timeout = Date.now() + 5000;\n", " root._bokeh_failed_load = false;\n", " }\n", "\n", " var NB_LOAD_WARNING = {'data': {'text/html':\n", " \"
\\n\"+\n", " \"

\\n\"+\n", " \"BokehJS does not appear to have successfully loaded. If loading BokehJS from CDN, this \\n\"+\n", " \"may be due to a slow or bad network connection. Possible fixes:\\n\"+\n", " \"

\\n\"+\n", " \"
    \\n\"+\n", " \"
  • re-rerun `output_notebook()` to attempt to load from CDN again, or
  • \\n\"+\n", " \"
  • use INLINE resources instead, as so:
  • \\n\"+\n", " \"
\\n\"+\n", " \"\\n\"+\n", " \"from bokeh.resources import INLINE\\n\"+\n", " \"output_notebook(resources=INLINE)\\n\"+\n", " \"\\n\"+\n", " \"
\"}};\n", "\n", " function display_loaded() {\n", " var el = document.getElementById(\"2497\");\n", " if (el != null) {\n", " el.textContent = \"BokehJS is loading...\";\n", " }\n", " if (root.Bokeh !== undefined) {\n", " if (el != null) {\n", " el.textContent = \"BokehJS \" + root.Bokeh.version + \" successfully loaded.\";\n", " }\n", " } else if (Date.now() < root._bokeh_timeout) {\n", " setTimeout(display_loaded, 100)\n", " }\n", " }\n", "\n", "\n", " function run_callbacks() {\n", " try {\n", " root._bokeh_onload_callbacks.forEach(function(callback) { callback() });\n", " }\n", " finally {\n", " delete root._bokeh_onload_callbacks\n", " }\n", " console.info(\"Bokeh: all callbacks have finished\");\n", " }\n", "\n", " function load_libs(js_urls, callback) {\n", " root._bokeh_onload_callbacks.push(callback);\n", " if (root._bokeh_is_loading > 0) {\n", " console.log(\"Bokeh: BokehJS is being loaded, scheduling callback at\", now());\n", " return null;\n", " }\n", " if (js_urls == null || js_urls.length === 0) {\n", " run_callbacks();\n", " return null;\n", " }\n", " console.log(\"Bokeh: BokehJS not loaded, scheduling load and callback at\", now());\n", " root._bokeh_is_loading = js_urls.length;\n", " for (var i = 0; i < js_urls.length; i++) {\n", " var url = js_urls[i];\n", " var s = document.createElement('script');\n", " s.src = url;\n", " s.async = false;\n", " s.onreadystatechange = s.onload = function() {\n", " root._bokeh_is_loading--;\n", " if (root._bokeh_is_loading === 0) {\n", " console.log(\"Bokeh: all BokehJS libraries loaded\");\n", " run_callbacks()\n", " }\n", " };\n", " s.onerror = function() {\n", " console.warn(\"failed to load library \" + url);\n", " };\n", " console.log(\"Bokeh: injecting script tag for BokehJS library: \", url);\n", " document.getElementsByTagName(\"head\")[0].appendChild(s);\n", " }\n", " };var element = document.getElementById(\"2497\");\n", " if (element == null) {\n", " console.log(\"Bokeh: ERROR: autoload.js configured 
with elementid '2497' but no matching script tag was found. \")\n", " return false;\n", " }\n", "\n", " var js_urls = [\"https://cdn.pydata.org/bokeh/release/bokeh-1.0.1.min.js\", \"https://cdn.pydata.org/bokeh/release/bokeh-widgets-1.0.1.min.js\", \"https://cdn.pydata.org/bokeh/release/bokeh-tables-1.0.1.min.js\", \"https://cdn.pydata.org/bokeh/release/bokeh-gl-1.0.1.min.js\"];\n", "\n", " var inline_js = [\n", " function(Bokeh) {\n", " Bokeh.set_log_level(\"info\");\n", " },\n", " \n", " function(Bokeh) {\n", " \n", " },\n", " function(Bokeh) {\n", " console.log(\"Bokeh: injecting CSS: https://cdn.pydata.org/bokeh/release/bokeh-1.0.1.min.css\");\n", " Bokeh.embed.inject_css(\"https://cdn.pydata.org/bokeh/release/bokeh-1.0.1.min.css\");\n", " console.log(\"Bokeh: injecting CSS: https://cdn.pydata.org/bokeh/release/bokeh-widgets-1.0.1.min.css\");\n", " Bokeh.embed.inject_css(\"https://cdn.pydata.org/bokeh/release/bokeh-widgets-1.0.1.min.css\");\n", " console.log(\"Bokeh: injecting CSS: https://cdn.pydata.org/bokeh/release/bokeh-tables-1.0.1.min.css\");\n", " Bokeh.embed.inject_css(\"https://cdn.pydata.org/bokeh/release/bokeh-tables-1.0.1.min.css\");\n", " }\n", " ];\n", "\n", " function run_inline_js() {\n", " \n", " if ((root.Bokeh !== undefined) || (force === true)) {\n", " for (var i = 0; i < inline_js.length; i++) {\n", " inline_js[i].call(root, root.Bokeh);\n", " }if (force === true) {\n", " display_loaded();\n", " }} else if (Date.now() < root._bokeh_timeout) {\n", " setTimeout(run_inline_js, 100);\n", " } else if (!root._bokeh_failed_load) {\n", " console.log(\"Bokeh: BokehJS failed to load within specified timeout.\");\n", " root._bokeh_failed_load = true;\n", " } else if (force !== true) {\n", " var cell = $(document.getElementById(\"2497\")).parents('.cell').data().cell;\n", " cell.output_area.append_execute_result(NB_LOAD_WARNING)\n", " }\n", "\n", " }\n", "\n", " if (root._bokeh_is_loading === 0) {\n", " console.log(\"Bokeh: BokehJS loaded, going 
straight to plotting\");\n", " run_inline_js();\n", " } else {\n", " load_libs(js_urls, function() {\n", " console.log(\"Bokeh: BokehJS plotting callback run at\", now());\n", " run_inline_js();\n", " });\n", " }\n", "}(window));" ], "application/vnd.bokehjs_load.v0+json": "\n(function(root) {\n function now() {\n return new Date();\n }\n\n var force = true;\n\n if (typeof (root._bokeh_onload_callbacks) === \"undefined\" || force === true) {\n root._bokeh_onload_callbacks = [];\n root._bokeh_is_loading = undefined;\n }\n\n \n\n \n if (typeof (root._bokeh_timeout) === \"undefined\" || force === true) {\n root._bokeh_timeout = Date.now() + 5000;\n root._bokeh_failed_load = false;\n }\n\n var NB_LOAD_WARNING = {'data': {'text/html':\n \"
\\n\"+\n \"

\\n\"+\n \"BokehJS does not appear to have successfully loaded. If loading BokehJS from CDN, this \\n\"+\n \"may be due to a slow or bad network connection. Possible fixes:\\n\"+\n \"

\\n\"+\n \"
    \\n\"+\n \"
  • re-rerun `output_notebook()` to attempt to load from CDN again, or
  • \\n\"+\n \"
  • use INLINE resources instead, as so:
  • \\n\"+\n \"
\\n\"+\n \"\\n\"+\n \"from bokeh.resources import INLINE\\n\"+\n \"output_notebook(resources=INLINE)\\n\"+\n \"\\n\"+\n \"
\"}};\n\n function display_loaded() {\n var el = document.getElementById(\"2497\");\n if (el != null) {\n el.textContent = \"BokehJS is loading...\";\n }\n if (root.Bokeh !== undefined) {\n if (el != null) {\n el.textContent = \"BokehJS \" + root.Bokeh.version + \" successfully loaded.\";\n }\n } else if (Date.now() < root._bokeh_timeout) {\n setTimeout(display_loaded, 100)\n }\n }\n\n\n function run_callbacks() {\n try {\n root._bokeh_onload_callbacks.forEach(function(callback) { callback() });\n }\n finally {\n delete root._bokeh_onload_callbacks\n }\n console.info(\"Bokeh: all callbacks have finished\");\n }\n\n function load_libs(js_urls, callback) {\n root._bokeh_onload_callbacks.push(callback);\n if (root._bokeh_is_loading > 0) {\n console.log(\"Bokeh: BokehJS is being loaded, scheduling callback at\", now());\n return null;\n }\n if (js_urls == null || js_urls.length === 0) {\n run_callbacks();\n return null;\n }\n console.log(\"Bokeh: BokehJS not loaded, scheduling load and callback at\", now());\n root._bokeh_is_loading = js_urls.length;\n for (var i = 0; i < js_urls.length; i++) {\n var url = js_urls[i];\n var s = document.createElement('script');\n s.src = url;\n s.async = false;\n s.onreadystatechange = s.onload = function() {\n root._bokeh_is_loading--;\n if (root._bokeh_is_loading === 0) {\n console.log(\"Bokeh: all BokehJS libraries loaded\");\n run_callbacks()\n }\n };\n s.onerror = function() {\n console.warn(\"failed to load library \" + url);\n };\n console.log(\"Bokeh: injecting script tag for BokehJS library: \", url);\n document.getElementsByTagName(\"head\")[0].appendChild(s);\n }\n };var element = document.getElementById(\"2497\");\n if (element == null) {\n console.log(\"Bokeh: ERROR: autoload.js configured with elementid '2497' but no matching script tag was found. 
\")\n return false;\n }\n\n var js_urls = [\"https://cdn.pydata.org/bokeh/release/bokeh-1.0.1.min.js\", \"https://cdn.pydata.org/bokeh/release/bokeh-widgets-1.0.1.min.js\", \"https://cdn.pydata.org/bokeh/release/bokeh-tables-1.0.1.min.js\", \"https://cdn.pydata.org/bokeh/release/bokeh-gl-1.0.1.min.js\"];\n\n var inline_js = [\n function(Bokeh) {\n Bokeh.set_log_level(\"info\");\n },\n \n function(Bokeh) {\n \n },\n function(Bokeh) {\n console.log(\"Bokeh: injecting CSS: https://cdn.pydata.org/bokeh/release/bokeh-1.0.1.min.css\");\n Bokeh.embed.inject_css(\"https://cdn.pydata.org/bokeh/release/bokeh-1.0.1.min.css\");\n console.log(\"Bokeh: injecting CSS: https://cdn.pydata.org/bokeh/release/bokeh-widgets-1.0.1.min.css\");\n Bokeh.embed.inject_css(\"https://cdn.pydata.org/bokeh/release/bokeh-widgets-1.0.1.min.css\");\n console.log(\"Bokeh: injecting CSS: https://cdn.pydata.org/bokeh/release/bokeh-tables-1.0.1.min.css\");\n Bokeh.embed.inject_css(\"https://cdn.pydata.org/bokeh/release/bokeh-tables-1.0.1.min.css\");\n }\n ];\n\n function run_inline_js() {\n \n if ((root.Bokeh !== undefined) || (force === true)) {\n for (var i = 0; i < inline_js.length; i++) {\n inline_js[i].call(root, root.Bokeh);\n }if (force === true) {\n display_loaded();\n }} else if (Date.now() < root._bokeh_timeout) {\n setTimeout(run_inline_js, 100);\n } else if (!root._bokeh_failed_load) {\n console.log(\"Bokeh: BokehJS failed to load within specified timeout.\");\n root._bokeh_failed_load = true;\n } else if (force !== true) {\n var cell = $(document.getElementById(\"2497\")).parents('.cell').data().cell;\n cell.output_area.append_execute_result(NB_LOAD_WARNING)\n }\n\n }\n\n if (root._bokeh_is_loading === 0) {\n console.log(\"Bokeh: BokehJS loaded, going straight to plotting\");\n run_inline_js();\n } else {\n load_libs(js_urls, function() {\n console.log(\"Bokeh: BokehJS plotting callback run at\", now());\n run_inline_js();\n });\n }\n}(window));" }, "metadata": {}, "output_type": 
"display_data" } ], "source": [ "import bokeh.models as bm, bokeh.plotting as pl\n", "from bokeh.io import output_notebook\n", "output_notebook()\n", "\n", "def draw_vectors(x, y, radius=10, alpha=0.25, color='blue',\n", " width=600, height=400, show=True, **kwargs):\n", " \"\"\" draws an interactive plot for data points with auxilirary info on hover \"\"\"\n", " if isinstance(color, str): color = [color] * len(x)\n", " data_source = bm.ColumnDataSource({ 'x' : x, 'y' : y, 'color': color, **kwargs })\n", "\n", " fig = pl.figure(active_scroll='wheel_zoom', width=width, height=height)\n", " fig.scatter('x', 'y', size=radius, color='color', alpha=alpha, source=data_source)\n", "\n", " fig.add_tools(bm.HoverTool(tooltips=[(key, \"@\" + key) for key in kwargs.keys()]))\n", " if show: pl.show(fig)\n", " return fig" ] }, { "cell_type": "code", "execution_count": 503, "metadata": {}, "outputs": [], "source": [ "langs_vectors = [get_lang_vector(l) for l in lang_mapping.embedding_dict.values()]" ] }, { "cell_type": "code", "execution_count": 504, "metadata": {}, "outputs": [], "source": [ "lang_tsne = TSNE().fit_transform(langs_vectors )" ] }, { "cell_type": "code", "execution_count": 505, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "
\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/javascript": [ "(function(root) {\n", " function embed_document(root) {\n", " \n", " var docs_json = {\"d939f5f4-54fa-4657-bedf-e24c360be656\":{\"roots\":{\"references\":[{\"attributes\":{\"below\":[{\"id\":\"1647\",\"type\":\"LinearAxis\"}],\"left\":[{\"id\":\"1652\",\"type\":\"LinearAxis\"}],\"plot_height\":400,\"renderers\":[{\"id\":\"1647\",\"type\":\"LinearAxis\"},{\"id\":\"1651\",\"type\":\"Grid\"},{\"id\":\"1652\",\"type\":\"LinearAxis\"},{\"id\":\"1656\",\"type\":\"Grid\"},{\"id\":\"1665\",\"type\":\"BoxAnnotation\"},{\"id\":\"1675\",\"type\":\"GlyphRenderer\"}],\"title\":{\"id\":\"1725\",\"type\":\"Title\"},\"toolbar\":{\"id\":\"1663\",\"type\":\"Toolbar\"},\"x_range\":{\"id\":\"1639\",\"type\":\"DataRange1d\"},\"x_scale\":{\"id\":\"1643\",\"type\":\"LinearScale\"},\"y_range\":{\"id\":\"1641\",\"type\":\"DataRange1d\"},\"y_scale\":{\"id\":\"1645\",\"type\":\"LinearScale\"}},\"id\":\"1638\",\"subtype\":\"Figure\",\"type\":\"Plot\"},{\"attributes\":{},\"id\":\"1661\",\"type\":\"ResetTool\"},{\"attributes\":{\"callback\":null},\"id\":\"1639\",\"type\":\"DataRange1d\"},{\"attributes\":{},\"id\":\"1662\",\"type\":\"HelpTool\"},{\"attributes\":{\"callback\":null},\"id\":\"1641\",\"type\":\"DataRange1d\"},{\"attributes\":{\"data_source\":{\"id\":\"1637\",\"type\":\"ColumnDataSource\"},\"glyph\":{\"id\":\"1673\",\"type\":\"Scatter\"},\"hover_glyph\":null,\"muted_glyph\":null,\"nonselection_glyph\":{\"id\":\"1674\",\"type\":\"Scatter\"},\"selection_glyph\":null,\"view\":{\"id\":\"1676\",\"type\":\"CDSView\"}},\"id\":\"1675\",\"type\":\"GlyphRenderer\"},{\"attributes\":{},\"id\":\"1643\",\"type\":\"LinearScale\"},{\"attributes\":{\"bottom_units\":\"screen\",\"fill_alpha\":{\"value\":0.5},\"fill_color\":{\"value\":\"lightgrey\"},\"left_units\":\"screen\",\"level\":\"overlay\",\"line_alpha\":{\"value\":1.0},\"line_color\":{\"value\":\"black\"},\"line_dash\":[4,4],\"line_width\":{\"value\
":2},\"plot\":null,\"render_mode\":\"css\",\"right_units\":\"screen\",\"top_units\":\"screen\"},\"id\":\"1665\",\"type\":\"BoxAnnotation\"},{\"attributes\":{\"formatter\":{\"id\":\"1729\",\"type\":\"BasicTickFormatter\"},\"plot\":{\"id\":\"1638\",\"subtype\":\"Figure\",\"type\":\"Plot\"},\"ticker\":{\"id\":\"1648\",\"type\":\"BasicTicker\"}},\"id\":\"1647\",\"type\":\"LinearAxis\"},{\"attributes\":{},\"id\":\"1648\",\"type\":\"BasicTicker\"},{\"attributes\":{\"plot\":null,\"text\":\"\"},\"id\":\"1725\",\"type\":\"Title\"},{\"attributes\":{\"plot\":{\"id\":\"1638\",\"subtype\":\"Figure\",\"type\":\"Plot\"},\"ticker\":{\"id\":\"1648\",\"type\":\"BasicTicker\"}},\"id\":\"1651\",\"type\":\"Grid\"},{\"attributes\":{},\"id\":\"1727\",\"type\":\"BasicTickFormatter\"},{\"attributes\":{\"source\":{\"id\":\"1637\",\"type\":\"ColumnDataSource\"}},\"id\":\"1676\",\"type\":\"CDSView\"},{\"attributes\":{},\"id\":\"1729\",\"type\":\"BasicTickFormatter\"},{\"attributes\":{\"formatter\":{\"id\":\"1727\",\"type\":\"BasicTickFormatter\"},\"plot\":{\"id\":\"1638\",\"subtype\":\"Figure\",\"type\":\"Plot\"},\"ticker\":{\"id\":\"1653\",\"type\":\"BasicTicker\"}},\"id\":\"1652\",\"type\":\"LinearAxis\"},{\"attributes\":{},\"id\":\"1645\",\"type\":\"LinearScale\"},{\"attributes\":{},\"id\":\"1731\",\"type\":\"Selection\"},{\"attributes\":{},\"id\":\"1653\",\"type\":\"BasicTicker\"},{\"attributes\":{},\"id\":\"1732\",\"type\":\"UnionRenderers\"},{\"attributes\":{\"dimension\":1,\"plot\":{\"id\":\"1638\",\"subtype\":\"Figure\",\"type\":\"Plot\"},\"ticker\":{\"id\":\"1653\",\"type\":\"BasicTicker\"}},\"id\":\"1656\",\"type\":\"Grid\"},{\"attributes\":{\"callback\":null,\"renderers\":\"auto\",\"tooltips\":[[\"token\",\"@token\"]]},\"id\":\"1677\",\"type\":\"HoverTool\"},{\"attributes\":{\"fill_alpha\":{\"value\":0.1},\"fill_color\":{\"value\":\"#1f77b4\"},\"line_alpha\":{\"value\":0.1},\"line_color\":{\"value\":\"#1f77b4\"},\"size\":{\"units\":\"screen\",\"value\":10},\"x\":{\"field\":\"x\"},\"
y\":{\"field\":\"y\"}},\"id\":\"1674\",\"type\":\"Scatter\"},{\"attributes\":{\"fill_alpha\":{\"value\":0.25},\"fill_color\":{\"field\":\"color\"},\"line_alpha\":{\"value\":0.25},\"line_color\":{\"field\":\"color\"},\"size\":{\"units\":\"screen\",\"value\":10},\"x\":{\"field\":\"x\"},\"y\":{\"field\":\"y\"}},\"id\":\"1673\",\"type\":\"Scatter\"},{\"attributes\":{\"active_drag\":\"auto\",\"active_inspect\":\"auto\",\"active_multi\":null,\"active_scroll\":{\"id\":\"1658\",\"type\":\"WheelZoomTool\"},\"active_tap\":\"auto\",\"tools\":[{\"id\":\"1657\",\"type\":\"PanTool\"},{\"id\":\"1658\",\"type\":\"WheelZoomTool\"},{\"id\":\"1659\",\"type\":\"BoxZoomTool\"},{\"id\":\"1660\",\"type\":\"SaveTool\"},{\"id\":\"1661\",\"type\":\"ResetTool\"},{\"id\":\"1662\",\"type\":\"HelpTool\"},{\"id\":\"1677\",\"type\":\"HoverTool\"}]},\"id\":\"1663\",\"type\":\"Toolbar\"},{\"attributes\":{\"callback\":null,\"data\":{\"color\":[\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\"],\"token\":[\"SPANISH\",\"PORTUGUESE\",\"TURKISH\",\"ENGLISH\",\"THAI\",\"Korean\",\"FRENCH\",\"RUSSIAN\",\"ITALIAN\",\"SERBIAN\",\"VIETNAMESE\",\"CZECH\",\"UKRAINIAN\",\"DUTCH\",\"INDONESIAN\",\"Japanese\",\"SLOVAK\",\"BULGARIAN\",\"GERMAN\",\"ChineseT\",\"MONGOLIAN\",\"HUNGARIAN\",\"POLISH\",\"UNK\",\"NORWEGIAN_N\",\"DANISH\",\"NORWEGIAN\",\"FINNISH\",\"BENGALI\",\"ARMENIAN\",\"ARABIC\",\"MALAY\",\"ESPERANTO\",\"Chinese\",\"SLOVENIAN\",\"GEORGIAN\",\"LAOTHIAN\",\"GREEK\",\"ROMANIAN\",\"CATALAN\",\"WELSH\",\"SWED
ISH\",\"HINDI\",\"CROATIAN\",\"UZBEK\",\"HEBREW\",\"Unknown\",\"ESTONIAN\",\"BOSNIAN\",\"TAGALOG\",\"AZERBAIJANI\",\"ALBANIAN\",\"KANNADA\",\"MACEDONIAN\",\"ICELANDIC\",\"LITHUANIAN\",\"GALICIAN\",\"LATIN\",\"LATVIAN\",\"TAMIL\",\"BURMESE\",\"SANSKRIT\"],\"x\":{\"__ndarray__\":\"9pmRwRpc5kAr16xA7Cu5QQNlPUHsBiRBfSCYv6QQv0C8T25AJFy8wbFxwsG6Xd7BaV2cv5twvsFxC6FB30GGQBpWzMAe5rrA8MX4wfHAnEFCjdlBtdSFwUyFt8F3LcFBd6ekQcEWW74hKmfBi0zVwESlD8GwX39B3ULhwXhLdsGAjsQ/i8mewXJ1usCUsQhBjzpAwR9ZN8FD6DXABezbv2mbIEEk9afBcdBeQf1UPkHLk6I/uXOAQFlPzEGZWwpB5tmUQUafc8FC8BjBqwSJQapvZ0GVH1RBpLkCQXw7bL9z5pxBFjzwQTfdZcA4NExA8XBmQd0s5UE=\",\"dtype\":\"float32\",\"shape\":[62]},\"y\":{\"__ndarray__\":\"Xj6bwSXo/0BYWbdB1y+yQbuHxkHO8AbBXPWGwHcmT0ALtMrADsSgwduA3sG+FbfBbtYfQBeRGcI2VI1BFeJFQTmuqMHrSLy/vufvwebB5UGU8ntB/vg0wdtfZMEFnAJCSt8kQeFXosFKakvAF1VxwVcj1sHDyFJBPgQLwpB/68GtIZdBcqPJwaTOI0H3NZBBoh+xwSJphMHindnBCOVvwYkLUkFdDgTCf5D7QXTgeD8WuexA9t44wTr4GUJkDHLBX59eQM3rDsLnmhXBaoq3QYO/jMAYDvhArCQ2wGgcVEEfKBVCan4IQl8IF8FnEnK/aU6TQda/2kE=\",\"dtype\":\"float32\",\"shape\":[62]}},\"selected\":{\"id\":\"1731\",\"type\":\"Selection\"},\"selection_policy\":{\"id\":\"1732\",\"type\":\"UnionRenderers\"}},\"id\":\"1637\",\"type\":\"ColumnDataSource\"},{\"attributes\":{},\"id\":\"1657\",\"type\":\"PanTool\"},{\"attributes\":{},\"id\":\"1658\",\"type\":\"WheelZoomTool\"},{\"attributes\":{\"overlay\":{\"id\":\"1665\",\"type\":\"BoxAnnotation\"}},\"id\":\"1659\",\"type\":\"BoxZoomTool\"},{\"attributes\":{},\"id\":\"1660\",\"type\":\"SaveTool\"}],\"root_ids\":[\"1638\"]},\"title\":\"Bokeh Application\",\"version\":\"1.0.1\"}};\n", " var render_items = [{\"docid\":\"d939f5f4-54fa-4657-bedf-e24c360be656\",\"roots\":{\"1638\":\"8e83688c-2379-48ad-9c9d-a6a8cb161df8\"}}];\n", " root.Bokeh.embed.embed_items_notebook(docs_json, render_items);\n", "\n", " }\n", " if (root.Bokeh !== undefined) {\n", " embed_document(root);\n", " } else {\n", " var attempts = 0;\n", " var timer = setInterval(function(root) {\n", " if (root.Bokeh 
!== undefined) {\n", " embed_document(root);\n", " clearInterval(timer);\n", " }\n", " attempts++;\n", " if (attempts > 100) {\n", " console.log(\"Bokeh: ERROR: Unable to run BokehJS code because BokehJS library is missing\");\n", " clearInterval(timer);\n", " }\n", " }, 10, root)\n", " }\n", "})(window);" ], "application/vnd.bokehjs_exec.v0+json": "" }, "metadata": { "application/vnd.bokehjs_exec.v0+json": { "id": "1638" } }, "output_type": "display_data" }, { "data": { "text/html": [ "
Figure(id='1638', ...)
\n", "\n" ], "text/plain": [ "Figure(id='1638', ...)" ] }, "execution_count": 505, "metadata": {}, "output_type": "execute_result" } ], "source": [ "draw_vectors(lang_tsne[:, 0], lang_tsne[:, 1], token=list(lang_mapping.embedding_dict.keys()))" ] }, { "cell_type": "code", "execution_count": 518, "metadata": {}, "outputs": [], "source": [ "langs_vectors_pca = PCA(n_components=2).fit_transform(langs_vectors)" ] }, { "cell_type": "code", "execution_count": 519, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "
\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/javascript": [ "(function(root) {\n", " function embed_document(root) {\n", " \n", " var docs_json = {\"43a9d2a9-b7d4-4890-825d-114dc01b4432\":{\"roots\":{\"references\":[{\"attributes\":{\"below\":[{\"id\":\"2317\",\"type\":\"LinearAxis\"}],\"left\":[{\"id\":\"2322\",\"type\":\"LinearAxis\"}],\"plot_height\":400,\"renderers\":[{\"id\":\"2317\",\"type\":\"LinearAxis\"},{\"id\":\"2321\",\"type\":\"Grid\"},{\"id\":\"2322\",\"type\":\"LinearAxis\"},{\"id\":\"2326\",\"type\":\"Grid\"},{\"id\":\"2335\",\"type\":\"BoxAnnotation\"},{\"id\":\"2345\",\"type\":\"GlyphRenderer\"}],\"title\":{\"id\":\"2431\",\"type\":\"Title\"},\"toolbar\":{\"id\":\"2333\",\"type\":\"Toolbar\"},\"x_range\":{\"id\":\"2309\",\"type\":\"DataRange1d\"},\"x_scale\":{\"id\":\"2313\",\"type\":\"LinearScale\"},\"y_range\":{\"id\":\"2311\",\"type\":\"DataRange1d\"},\"y_scale\":{\"id\":\"2315\",\"type\":\"LinearScale\"}},\"id\":\"2308\",\"subtype\":\"Figure\",\"type\":\"Plot\"},{\"attributes\":{\"plot\":{\"id\":\"2308\",\"subtype\":\"Figure\",\"type\":\"Plot\"},\"ticker\":{\"id\":\"2318\",\"type\":\"BasicTicker\"}},\"id\":\"2321\",\"type\":\"Grid\"},{\"attributes\":{\"plot\":null,\"text\":\"\"},\"id\":\"2431\",\"type\":\"Title\"},{\"attributes\":{\"formatter\":{\"id\":\"2433\",\"type\":\"BasicTickFormatter\"},\"plot\":{\"id\":\"2308\",\"subtype\":\"Figure\",\"type\":\"Plot\"},\"ticker\":{\"id\":\"2323\",\"type\":\"BasicTicker\"}},\"id\":\"2322\",\"type\":\"LinearAxis\"},{\"attributes\":{},\"id\":\"2323\",\"type\":\"BasicTicker\"},{\"attributes\":{},\"id\":\"2433\",\"type\":\"BasicTickFormatter\"},{\"attributes\":{\"dimension\":1,\"plot\":{\"id\":\"2308\",\"subtype\":\"Figure\",\"type\":\"Plot\"},\"ticker\":{\"id\":\"2323\",\"type\":\"BasicTicker\"}},\"id\":\"2326\",\"type\":\"Grid\"},{\"attributes\":{},\"id\":\"2435\",\"type\":\"BasicTickFormatter\"},{\"attributes\":{\"fill_alpha\":{\"value\":0.1},\"fill_color\":{\"value\"
:\"#1f77b4\"},\"line_alpha\":{\"value\":0.1},\"line_color\":{\"value\":\"#1f77b4\"},\"size\":{\"units\":\"screen\",\"value\":10},\"x\":{\"field\":\"x\"},\"y\":{\"field\":\"y\"}},\"id\":\"2344\",\"type\":\"Scatter\"},{\"attributes\":{},\"id\":\"2437\",\"type\":\"Selection\"},{\"attributes\":{\"fill_alpha\":{\"value\":0.25},\"fill_color\":{\"field\":\"color\"},\"line_alpha\":{\"value\":0.25},\"line_color\":{\"field\":\"color\"},\"size\":{\"units\":\"screen\",\"value\":10},\"x\":{\"field\":\"x\"},\"y\":{\"field\":\"y\"}},\"id\":\"2343\",\"type\":\"Scatter\"},{\"attributes\":{},\"id\":\"2438\",\"type\":\"UnionRenderers\"},{\"attributes\":{\"active_drag\":\"auto\",\"active_inspect\":\"auto\",\"active_multi\":null,\"active_scroll\":{\"id\":\"2328\",\"type\":\"WheelZoomTool\"},\"active_tap\":\"auto\",\"tools\":[{\"id\":\"2327\",\"type\":\"PanTool\"},{\"id\":\"2328\",\"type\":\"WheelZoomTool\"},{\"id\":\"2329\",\"type\":\"BoxZoomTool\"},{\"id\":\"2330\",\"type\":\"SaveTool\"},{\"id\":\"2331\",\"type\":\"ResetTool\"},{\"id\":\"2332\",\"type\":\"HelpTool\"},{\"id\":\"2347\",\"type\":\"HoverTool\"}]},\"id\":\"2333\",\"type\":\"Toolbar\"},{\"attributes\":{},\"id\":\"2327\",\"type\":\"PanTool\"},{\"attributes\":{},\"id\":\"2318\",\"type\":\"BasicTicker\"},{\"attributes\":{},\"id\":\"2332\",\"type\":\"HelpTool\"},{\"attributes\":{},\"id\":\"2328\",\"type\":\"WheelZoomTool\"},{\"attributes\":{\"overlay\":{\"id\":\"2335\",\"type\":\"BoxAnnotation\"}},\"id\":\"2329\",\"type\":\"BoxZoomTool\"},{\"attributes\":{},\"id\":\"2330\",\"type\":\"SaveTool\"},{\"attributes\":{},\"id\":\"2315\",\"type\":\"LinearScale\"},{\"attributes\":{},\"id\":\"2331\",\"type\":\"ResetTool\"},{\"attributes\":{\"callback\":null},\"id\":\"2309\",\"type\":\"DataRange1d\"},{\"attributes\":{\"bottom_units\":\"screen\",\"fill_alpha\":{\"value\":0.5},\"fill_color\":{\"value\":\"lightgrey\"},\"left_units\":\"screen\",\"level\":\"overlay\",\"line_alpha\":{\"value\":1.0},\"line_color\":{\"value\":\"black\"},\"line_das
h\":[4,4],\"line_width\":{\"value\":2},\"plot\":null,\"render_mode\":\"css\",\"right_units\":\"screen\",\"top_units\":\"screen\"},\"id\":\"2335\",\"type\":\"BoxAnnotation\"},{\"attributes\":{},\"id\":\"2313\",\"type\":\"LinearScale\"},{\"attributes\":{\"data_source\":{\"id\":\"2307\",\"type\":\"ColumnDataSource\"},\"glyph\":{\"id\":\"2343\",\"type\":\"Scatter\"},\"hover_glyph\":null,\"muted_glyph\":null,\"nonselection_glyph\":{\"id\":\"2344\",\"type\":\"Scatter\"},\"selection_glyph\":null,\"view\":{\"id\":\"2346\",\"type\":\"CDSView\"}},\"id\":\"2345\",\"type\":\"GlyphRenderer\"},{\"attributes\":{\"formatter\":{\"id\":\"2435\",\"type\":\"BasicTickFormatter\"},\"plot\":{\"id\":\"2308\",\"subtype\":\"Figure\",\"type\":\"Plot\"},\"ticker\":{\"id\":\"2318\",\"type\":\"BasicTicker\"}},\"id\":\"2317\",\"type\":\"LinearAxis\"},{\"attributes\":{\"source\":{\"id\":\"2307\",\"type\":\"ColumnDataSource\"}},\"id\":\"2346\",\"type\":\"CDSView\"},{\"attributes\":{\"callback\":null},\"id\":\"2311\",\"type\":\"DataRange1d\"},{\"attributes\":{\"callback\":null,\"data\":{\"color\":[\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\",\"blue\"],\"token\":[\"SPANISH\",\"PORTUGUESE\",\"TURKISH\",\"ENGLISH\",\"THAI\",\"Korean\",\"FRENCH\",\"RUSSIAN\",\"ITALIAN\",\"SERBIAN\",\"VIETNAMESE\",\"CZECH\",\"UKRAINIAN\",\"DUTCH\",\"INDONESIAN\",\"Japanese\",\"SLOVAK\",\"BULGARIAN\",\"GERMAN\",\"ChineseT\",\"MONGOLIAN\",\"HUNGARIAN\",\"POLISH\",\"UNK\",\"NORWEGIAN_N\",\"DANISH\",\"NORWEGIAN\",\"FINNISH\",\"BEN
GALI\",\"ARMENIAN\",\"ARABIC\",\"MALAY\",\"ESPERANTO\",\"Chinese\",\"SLOVENIAN\",\"GEORGIAN\",\"LAOTHIAN\",\"GREEK\",\"ROMANIAN\",\"CATALAN\",\"WELSH\",\"SWEDISH\",\"HINDI\",\"CROATIAN\",\"UZBEK\",\"HEBREW\",\"Unknown\",\"ESTONIAN\",\"BOSNIAN\",\"TAGALOG\",\"AZERBAIJANI\",\"ALBANIAN\",\"KANNADA\",\"MACEDONIAN\",\"ICELANDIC\",\"LITHUANIAN\",\"GALICIAN\",\"LATIN\",\"LATVIAN\",\"TAMIL\",\"BURMESE\",\"SANSKRIT\"],\"x\":{\"__ndarray__\":\"5QFTkIpiv7/yXG/yTVqiPyynGRACjLE/AO6iipXLvj/2QhZuxty9Pz8/6SadPWC/gvFGljtMib8q8pNJIPaTP2krI3Hy04i/eb1TW/Dvwb9qiBHFohPFv5y9BOSNAcO/0kCVx9G2ez81bw592VvQv1j4pT4vsro/AzRAscC7oT+qvS6gPpuyv1/YgaDic5G/F6Lidag30b/SJ6nEJPbFP/n+xs9ksbw/Y4QxMPl3sr+oC1HWCwm7vxMaWBXLYdE/CZUhiuESsz9fx4rxjo2sv9QrTD4hcaW/nXVhwpu8qr9gxRrb87O2vz8KbMNX4rI/7UeQ62ka0b9Q3zON6MLDv77ydFeBHbE/mGO2uvQLxb9Gf6J+uP6iP/IDosxMxrE/MJfqX8E6tr+08fE+Kdexv0+5kJ58v7C/c7DeH+VzqL/14Oqe+WGtP40BdpKhLcy/6QMow2UZwz8uX6FtVoKHP6uw5A5BPJ4/YjcMdIUfmb8O3vtDHM3MPwhBYv8KqZO/d+Utx/aTqD9y7xNBnTzHv59P2GoAwZ2/2eeypkF4uz+WxysYq5Fivw7zAF2Giac/eufa8vanVD8vvpm159ykP/HbkWd7M9E/hpWidqV10z/uDCojwKuPv26+kT43R2E/X9cRpNButz8woClOPYrLPw==\",\"dtype\":\"float64\",\"shape\":[62]},\"y\":{\"__ndarray__\":\"qaP/dKC4lL/dd/xNbdhgv1pY+riIKJe/H8XsiQaagb96uOLrmOt0v5RsAxx/W6m/8uIL24x7bb9YQLAsMIlwv+h3erb2T5O/LwTmjh4Rl7+o8WfJyhn1vtTg96lA1Za/BG7jDCqMIb8ET+UTQ3+UvzDG5E5xspE/OUYWeZJRmT+tK6KCPeqWPyikPgSVr4G/EJoHyth+nL+0m6X2YKahP6DYq6SYy7G/8uKrbo73o7/TS5l3V9ScvxSxQcNRJ7U/dEVoqMtVlL8yVSunh0KOv431KCX4BLS/pqQLTA4NoD80yugG+uOhP3o3HPFgBkm/iXfrjCbIlb9swE6Qvc2pP7chtmbe2ZW/VxzVshgbm79B+oy2ssKlP1SZMM5aKlA/H2IM3q6OiD/GaWIeQeegP2agfNiSHLU/iuQeWXRXjT884IeFZ0epP2WIkbUMOao/cLYr7z5XtT/xQGejciekv0oEqoMjYyE/Ynha1wTGmr+TD18nHq19P4efJ8gaRXa/i2+9D+eZqb9xkRguxMG1P0Hqzn8fkY4/sDL9494ciT+/uLyGeCClv2Z8tdXtimA/hYw+injojL8wGRvFSaNtP1yZSE4+/6y/S59o1CuOhb/m1t+p0d6oPzrfhCT4toq/QtpwS7Trej8ZUODsJ1Oivw==\",\"dtype\":\"float64\",\"shape\":[62]}},\"selected\":{\"id\":\"2437\",\"type\":\"Selection\"},\"selection_policy\":{\"id\":\"2438\",\"type\":\"UnionRenderers\"}},\"id
\":\"2307\",\"type\":\"ColumnDataSource\"},{\"attributes\":{\"callback\":null,\"renderers\":\"auto\",\"tooltips\":[[\"token\",\"@token\"]]},\"id\":\"2347\",\"type\":\"HoverTool\"}],\"root_ids\":[\"2308\"]},\"title\":\"Bokeh Application\",\"version\":\"1.0.1\"}};\n", " var render_items = [{\"docid\":\"43a9d2a9-b7d4-4890-825d-114dc01b4432\",\"roots\":{\"2308\":\"5bcbdfcb-dc44-4daf-95f8-8fe92c958a2c\"}}];\n", " root.Bokeh.embed.embed_items_notebook(docs_json, render_items);\n", "\n", " }\n", " if (root.Bokeh !== undefined) {\n", " embed_document(root);\n", " } else {\n", " var attempts = 0;\n", " var timer = setInterval(function(root) {\n", " if (root.Bokeh !== undefined) {\n", " embed_document(root);\n", " clearInterval(timer);\n", " }\n", " attempts++;\n", " if (attempts > 100) {\n", " console.log(\"Bokeh: ERROR: Unable to run BokehJS code because BokehJS library is missing\");\n", " clearInterval(timer);\n", " }\n", " }, 10, root)\n", " }\n", "})(window);" ], "application/vnd.bokehjs_exec.v0+json": "" }, "metadata": { "application/vnd.bokehjs_exec.v0+json": { "id": "2308" } }, "output_type": "display_data" }, { "data": { "text/html": [ "
Figure(id='2308', ...)
\n", "\n" ], "text/plain": [ "Figure(id='2308', ...)" ] }, "execution_count": 519, "metadata": {}, "output_type": "execute_result" } ], "source": [ "draw_vectors(langs_vectors_pca[:, 0], langs_vectors_pca[:, 1], token=list(lang_mapping.embedding_dict.keys()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This time the graph doesn't look meaningful, but the score speaks for itself." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cat2Vec" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another approach that came from NLP is Word2Vec, renamed Cat2Vec when applied to categories. Its usefulness has not been firmly confirmed, but some papers argue for it (links below)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Distributional semantics, in the words of John Rupert Firth, says: \"You shall know a word by the company it keeps\". Words that share the same context are in some way similar. We can suppose that categories are likewise related through their co-occurrence. Take weather and city: the city \"Philadelphia\" may be similar to the weather \"always sunny\", or \"Moscow\" to \"snowy\"." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First we apply feature encoding; then we can make a \"sentence\" from each row.\n", "\n", "In the example below, imagine an article described by \"Monday January 2018 English_language Medium.com\". That is our sentence, and perhaps English co-occurs with Medium more often than Chinese does with hackernoon.com. (A weak example, but it illustrates the idea.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The only caveat is \"word\" order. Word2Vec relies on word order, but for a categorical \"sentence\" order doesn't matter, so it's better to shuffle the words within each sentence.\n", "\n", "Let's implement it." 
] }, { "cell_type": "code", "execution_count": 417, "metadata": {}, "outputs": [], "source": [ "X_w2v = train_df.copy()" ] }, { "cell_type": "code", "execution_count": 418, "metadata": {}, "outputs": [], "source": [ "month_int_to_name = {1:'jan',2:'feb',3:'march',4:'apr',5:'may',6:'june',7:'jul',8:'aug',9:'sept',10:'okt',11:'nov',12:'dec',}\n", "weekday_int_to_day = {0:'mon',1:'thus',2:'wen',3:'thusd',4:'fri',5:'sut',6:'sun',}" ] }, { "cell_type": "code", "execution_count": 419, "metadata": {}, "outputs": [], "source": [ "working_day_int_to_day = {1: 'work',0:'not_work'}" ] }, { "cell_type": "code", "execution_count": 420, "metadata": {}, "outputs": [], "source": [ "X_w2v.month = X_w2v.month.apply(lambda x : month_int_to_name[x])" ] }, { "cell_type": "code", "execution_count": 421, "metadata": {}, "outputs": [], "source": [ "X_w2v.weekday = X_w2v.weekday.apply(lambda x : weekday_int_to_day[x])" ] }, { "cell_type": "code", "execution_count": 422, "metadata": {}, "outputs": [], "source": [ "X_w2v.working_day = X_w2v.working_day.apply(lambda x : working_day_int_to_day[x])" ] }, { "cell_type": "code", "execution_count": 371, "metadata": {}, "outputs": [], "source": [ "all_list = list()\n", "for ind, r in X_w2v.iterrows():\n", " values_list = [str(val).replace(' ', '_') for val in r.values]\n", " all_list.append(values_list)" ] }, { "cell_type": "code", "execution_count": 523, "metadata": {}, "outputs": [], "source": [ "from gensim.models import Word2Vec\n", "model = Word2Vec(all_list, \n", " size=32, # embedding vector size\n", " min_count=5, # consider words that occurred at least 5 times\n", " window=5).wv" ] }, { "cell_type": "code", "execution_count": 525, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('may', 0.9750568866729736),\n", " ('jan', 0.9696922302246094),\n", " ('apr', 0.9657937288284302),\n", " ('feb', 0.9636536240577698),\n", " ('march', 0.9605866074562073),\n", " ('jul', 0.8678117990493774),\n", " ('aug', 0.842918872833252),\n", " ('sept', 
0.8228173851966858),\n", " ('okt', 0.7803250551223755),\n", " ('nov', 0.77550208568573)]" ] }, "execution_count": 525, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.most_similar('june')" ] }, { "cell_type": "code", "execution_count": 378, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['medium.com', 'Carlos_E._Perez', 'Regalos_bodas,_bautizos,_comuniones', 'Ploum', 'Ash_Rust', 'Leandro_Demori', 'Dave_Mckenna', 'Steve_Krakauer', 'Raul_Kuk', 'Carolina_Lacerda']\n" ] } ], "source": [ "words = sorted(model.vocab.keys(), \n", " key=lambda word: model.vocab[word].count,\n", " reverse=True)[:1000]\n", "\n", "print(words[::100])" ] }, { "cell_type": "code", "execution_count": 379, "metadata": {}, "outputs": [], "source": [ "word_vectors = np.array([model.get_vector(wrd) for wrd in words])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Draw a graph as usual" ] }, { "cell_type": "code", "execution_count": 529, "metadata": {}, "outputs": [], "source": [ "word_tsne = TSNE().fit_transform(word_vectors )" ] }, { "cell_type": "code", "execution_count": 383, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "
\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/javascript": [ "(function(root) {\n", " function embed_document(root) {\n", " \n", " var docs_json = {\"1f7fc17c-34db-4f9b-8646-6bd1f046e2b2\":{\"roots\":{\"references\":[{\"attributes\":{\"below\":[{\"id\":\"1012\",\"type\":\"LinearAxis\"}],\"left\":[{\"id\":\"1017\",\"type\":\"LinearAxis\"}],\"plot_height\":400,\"renderers\":[{\"id\":\"1012\",\"type\":\"LinearAxis\"},{\"id\":\"1016\",\"type\":\"Grid\"},{\"id\":\"1017\",\"type\":\"LinearAxis\"},{\"id\":\"1021\",\"type\":\"Grid\"},{\"id\":\"1030\",\"type\":\"BoxAnnotation\"},{\"id\":\"1040\",\"type\":\"GlyphRenderer\"}],\"title\":{\"id\":\"1045\",\"type\":\"Title\"},\"toolbar\":{\"id\":\"1028\",\"type\":\"Toolbar\"},\"x_range\":{\"id\":\"1004\",\"type\":\"DataRange1d\"},\"x_scale\":{\"id\":\"1008\",\"type\":\"LinearScale\"},\"y_range\":{\"id\":\"1006\",\"type\":\"DataRange1d\"},\"y_scale\":{\"id\":\"1010\",\"type\":\"LinearScale\"}},\"id\":\"1003\",\"subtype\":\"Figure\",\"type\":\"Plot\"},{\"attributes\":{},\"id\":\"1008\",\"type\":\"LinearScale\"},{\"attributes\":{},\"id\":\"1052\",\"type\":\"UnionRenderers\"},{\"attributes\":{\"plot\":{\"id\":\"1003\",\"subtype\":\"Figure\",\"type\":\"Plot\"},\"ticker\":{\"id\":\"1013\",\"type\":\"BasicTicker\"}},\"id\":\"1016\",\"type\":\"Grid\"},{\"attributes\":{\"bottom_units\":\"screen\",\"fill_alpha\":{\"value\":0.5},\"fill_color\":{\"value\":\"lightgrey\"},\"left_units\":\"screen\",\"level\":\"overlay\",\"line_alpha\":{\"value\":1.0},\"line_color\":{\"value\":\"black\"},\"line_dash\":[4,4],\"line_width\":{\"value\":2},\"plot\":null,\"render_mode\":\"css\",\"right_units\":\"screen\",\"top_units\":\"screen\"},\"id\":\"1030\",\"type\":\"BoxAnnotation\"},{\"attributes\":{\"active_drag\":\"auto\",\"active_inspect\":\"auto\",\"active_multi\":null,\"active_scroll\":{\"id\":\"1023\",\"type\":\"WheelZoomTool\"},\"active_tap\":\"auto\",\"tools\":[{\"id\":\"1022\",\"type\":\"PanTool\"},{\"id\":\"1023\",
\"type\":\"WheelZoomTool\"},{\"id\":\"1024\",\"type\":\"BoxZoomTool\"},{\"id\":\"1025\",\"type\":\"SaveTool\"},{\"id\":\"1026\",\"type\":\"ResetTool\"},{\"id\":\"1027\",\"type\":\"HelpTool\"},{\"id\":\"1042\",\"type\":\"HoverTool\"}]},\"id\":\"1028\",\"type\":\"Toolbar\"},{\"attributes\":{\"callback\":null},\"id\":\"1004\",\"type\":\"DataRange1d\"},{\"attributes\":{},\"id\":\"1010\",\"type\":\"LinearScale\"},{\"attributes\":{\"fill_alpha\":{\"value\":0.1},\"fill_color\":{\"value\":\"#1f77b4\"},\"line_alpha\":{\"value\":0.1},\"line_color\":{\"value\":\"#1f77b4\"},\"size\":{\"units\":\"screen\",\"value\":10},\"x\":{\"field\":\"x\"},\"y\":{\"field\":\"y\"}},\"id\":\"1039\",\"type\":\"Scatter\"},{\"attributes\":{\"formatter\":{\"id\":\"1049\",\"type\":\"BasicTickFormatter\"},\"plot\":{\"id\":\"1003\",\"subtype\":\"Figure\",\"type\":\"Plot\"},\"ticker\":{\"id\":\"1013\",\"type\":\"BasicTicker\"}},\"id\":\"1012\",\"type\":\"LinearAxis\"},{\"attributes\":{},\"id\":\"1013\",\"type\":\"BasicTicker\"},{\"attributes\":{\"formatter\":{\"id\":\"1047\",\"type\":\"BasicTickFormatter\"},\"plot\":{\"id\":\"1003\",\"subtype\":\"Figure\",\"type\":\"Plot\"},\"ticker\":{\"id\":\"1018\",\"type\":\"BasicTicker\"}},\"id\":\"1017\",\"type\":\"LinearAxis\"},{\"attributes\":{},\"id\":\"1025\",\"type\":\"SaveTool\"},{\"attributes\":{\"data_source\":{\"id\":\"1002\",\"type\":\"ColumnDataSource\"},\"glyph\":{\"id\":\"1038\",\"type\":\"Scatter\"},\"hover_glyph\":null,\"muted_glyph\":null,\"nonselection_glyph\":{\"id\":\"1039\",\"type\":\"Scatter\"},\"selection_glyph\":null,\"view\":{\"id\":\"1041\",\"type\":\"CDSView\"}},\"id\":\"1040\",\"type\":\"GlyphRenderer\"},{\"attributes\":{\"callback\":null},\"id\":\"1006\",\"type\":\"DataRange1d\"},{\"attributes\":{},\"id\":\"1026\",\"type\":\"ResetTool\"},{\"attributes\":{},\"id\":\"1018\",\"type\":\"BasicTicker\"},{\"attributes\":{\"overlay\":{\"id\":\"1030\",\"type\":\"BoxAnnotation\"}},\"id\":\"1024\",\"type\":\"BoxZoomTool\"},{\"attributes\":{},\"id
\":\"1049\",\"type\":\"BasicTickFormatter\"},{\"attributes\":{\"dimension\":1,\"plot\":{\"id\":\"1003\",\"subtype\":\"Figure\",\"type\":\"Plot\"},\"ticker\":{\"id\":\"1018\",\"type\":\"BasicTicker\"}},\"id\":\"1021\",\"type\":\"Grid\"},{\"attributes\":{},\"id\":\"1022\",\"type\":\"PanTool\"},{\"attributes\":{},\"id\":\"1051\",\"type\":\"Selection\"},{\"attributes\":{},\"id\":\"1023\",\"type\":\"WheelZoomTool\"},{\"attributes\":{\"callback\":null,\"renderers\":\"auto\",\"tooltips\":[[\"token\",\"@token\"]]},\"id\":\"1042\",\"type\":\"HoverTool\"},{\"attributes\":{},\"id\":\"1047\",\"type\":\"BasicTickFormatter\"},{\"attributes\":{\"fill_alpha\":{\"value\":0.25},\"fill_color\":{\"field\":\"color\"},\"line_alpha\":{\"value\":0.25},\"line_color\":{\"field\":\"color\"},\"size\":{\"units\":\"screen\",\"value\":10},\"x\":{\"field\":\"x\"},\"y\":{\"field\":\"y\"}},\"id\":\"1038\",\"type\":\"Scatter\"},{\"attributes\":{\"plot\":null,\"text\":\"\"},\"id\":\"1045\",\"type\":\"Title\"},{\"attributes\":{\"callback\":null,\"data\":{\"color\":[\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"gre
en\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"gre
en\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"gre
en\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"gre
en\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"gre
en\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\",\"green\"],\"token\":[\"medium.com\",\"work\",\"ENGLISH\",\"2016\",\"2017\",\"2015\",\"thus\",\"mon\",\"wen\",\"not_work\",\"thusd\",\"fri\",\"may\",\"june\",\"apr\",\"march\",\"jan\",\"feb\",\"sun\",\"sut\",\"dec\",\"nov\",\"okt\",\"2014\",\"aug\",\"sept\",\"jul\",\"PORTUGUESE\",\"hackernoon.com\",\"SPANISH\",\"FRENCH\",\"2013\",\"ITALIAN\",\"RUSSIAN\",\"TURKISH\",\"Japanese\",\"GERMAN\",\"War_Is_Boring\",\"INDONESIAN\",\"Jon_Westenberg_\\ud83c\\udf08\",\"ChineseT\",\"THAI\",\"jw-webmagazine.com\",\"Korean\",\"James_Altucher\",\"Ethan_Siegel\",\"DUTCH\",\"Larry_Kim\",\"Darius_Foroux\",\"2012\",\"Naho_B_M\",\"Thomas_Oppong\",\"Eric_Elliott\",\"POLISH\",\"Josh_Spector\",\"Benjamin_P._Hardy\",\"The_Obama_White_House\",\"Netflix_Technology_Blog\",\"Christopher_D._Connors\",\"blog.medium.com\",\"Nancy_Pong\",\"UKRAINIAN\",\"The_Awl\",\"Helena_Price\",\"Julie_Zhuo\",\"Marcus_Brancaglione\",\"World_Economic_Forum\",\"Jeff_Jarvis\",\"VIETNAMESE\",\"Caitlin_J
ohnstone\",\"Srinivas_Rao\",\"UNK\",\"CZECH\",\"Brandon_Anderson\",\"Matthew_Gault\",\"thecoffeelicious.com\",\"Winnie_Lim\",\"Enrique_Dans\",\"Tobias_van_Schneider\",\"Cursos_educacion\",\"Elle_Kaplan\",\"HUNGARIAN\",\"Nafeez_Ahmed\",\"Anil_Dash\",\"umair_haque\",\"David_Axe\",\"Onur_Karapinar\",\"Chinese\",\"Robert_Beckhusen\",\"Joseph_Trevithick\",\"Brett_Berry\",\"Samej_Spenser\",\"Matter\",\"\\u00c9ditions_Numeriklivres\",\"Todd_Brison\",\"Jeff_Goins\",\"2011\",\"Steven_Levy\",\"The_Hairpin\",\"Aglaia_Berlutti\",\"Carlos_E._Perez\",\"Daniel_Holliday\",\"NORWEGIAN\",\"Emma_Lindsay\",\"Herbert_Lui\",\"Peter_Fritz_Walter\",\"Daniel_Christian_Wahl\",\"Ezinne_Ukoha\",\"Washington_Post\",\"Buffer\",\"Jordan_Greenhall\",\"@DFRLab\",\"Chad_Grills\",\"Taka_Umada\",\"Richard_Kenneth_Eng\",\"mauludSADIQ\",\"ROMANIAN\",\"Fr\\u00e9d\\u00e9ric_(Films_de_Lover)\",\"Lyman_Stone\",\"Shaunta_Grimes\",\"Salesforce\",\"Scott_Santens\",\"Dan_Hill\",\"Jared_M._Spool\",\"Quinn_Norton\",\"Sean_McCabe\",\"Human_Parts\",\"Benjamin_Foley\",\"Ramy_Khuffash\",\"Stephen_Orban\",\"Oliver_\\u201cShiny\\u201d_Blakemore\",\"byrslf.co\",\"Japan_Wireless\",\"Marcin_Wichary\",\"Augusta_Khalil_Ibrahim\",\"Nir_Eyal\",\"Ameet_Ranadive\",\"Laetitia_Vitaud\",\"James_Clear\",\"InVision\",\"Nicole_Dieker\",\"Jornalistas_Livres\",\"David_Kadavy\",\"Lessig\",\"Greg_Muender\",\"Sandeep_Kashyap\",\"Jon_Moore\",\"Floris_Koot\",\"Pascal_Kott\\u00e9\",\"Intercom\",\"Piyorot\",\"Dr._Lisa_Galarneau\",\"Leah_Stella_Stephens\",\"Gino_Sorcinelli\",\"Zeynep_Tufekci\",\"The_Scratch_Team\",\"Beautyon\",\"towardsdatascience.com\",\"DANISH\",\"Chris_Messina\",\"Peter_Diamandis\",\"Holly_Wood,_PhD_\\ud83c\\udf39\",\"Hillary_Clinton\",\"Spencer_Carli\",\"Graham_Brown-Martin\",\"SLOVAK\",\"COMMU\",\"Tim_Boucher\",\"GREEK\",\"artplusmarketing.com\",\"Michael_Tracey\",\"Chloe_Leb\",\"Jessica_Semaan\",\"Dave_Boice\",\"Hunter_Walk\",\"Andreas_Sandre\",\"MONGOLIAN\",\"Eric_Jorgenson\",\"Shane_Parrish\",\"Mitchell_Harper\",\"Nick
_Crocker\",\"Fagner_Brack\",\"Andrew_Dobbs\",\"Kevin_Knodell\",\"Pete_Brook\",\"Patsaraporn_R.\",\"Sara_Benincasa\",\"Martin_McClellan\",\"Tiago_Forte\",\"praxis.fortelabs.co\",\"Paul_Ford\",\"Josh_Stearns\",\"WebdesignerDepot\",\"Adam_Elkus\",\"Peter_D\\u00f6rrie\",\"Mike_Sturm\",\"swardley\",\"Jane_Hwangbo\",\"Nassim_Nicholas_Taleb\",\"Joe_Brewer\",\"Regalos_bodas,_bautizos,_comuniones\",\"AirbnbEng\",\"The_Billfold\",\"Ryan_Hussey\",\"Eric_Koester\",\"Vaidehi_Joshi\",\"Alexander_Lozhechkin\",\"Gabriel_Machuret\",\"Ernio_Hernandez\",\"Richard_D._Bartlett\",\"Matt_Steel\",\"Paul_Mason\",\"Mike_\\u201cDJ\\u201d_Pizzo\",\"Clement_Vouillon\",\"Per_Harald_Borgen\",\"CATALAN\",\"SWEDISH\",\"SayaAppGossip\",\"We_Shall_Burn_Bright\",\"Abby_Norman\",\"Sarah_Cooper\",\"Jeff_Escalante\",\"David_Fuentes\",\"Rajen_Sanghvi\",\"The_Academy\",\"AJ+\",\"Mike_Fishbein\",\"Yui_Yamagishi\",\"USAID_Water_Team\",\"Adam_Marx\",\"Jonas_Ellison\",\"Ben_Johnson\",\"Zdravko_Cvijetic\",\"Ryan_Hoover\",\"Tim_Cigelske\",\"Rob_Gordon\",\"Refinery29\",\"AWS_Startups\",\"Medium\",\"Cake\",\"Jimmy_Song\",\"TimboGolden\",\"Alex_Rowe\",\"Lauren_Holliday\",\"Pixel_Magazine\",\"John_Zeratsky\",\"steve_blank\",\"Kickstarter\",\"Harry_Keller\",\"Alexander_Obenauer\",\"A.H._Chu\",\"Mosaic_Ventures\",\"Ita\\u00fa\",\"Brandon_Richey\",\"Cortney_Harding\",\"Kingsley_Uyi_Idehen\",\"Puntero_Izquierdo\",\"Daniel_Reagan_Diggins\",\"Writing_For_Research\",\"Yusuke-s\",\"Niklas_Goeke\",\"Allison_Washington\",\"Are_You_Syrious?\",\"Angus_Hervey\",\"Founder_Collective\",\"Oxford_University\",\"David_Gilbertson\",\"Ugo_Valentini\",\"John_Fawkes\",\"the_grugq\",\"Defiant\",\"McGraw-Hill_Education\",\"Hiut_Denim_Co\",\"Haje_Jan_Kamps\",\"Nathan_Curtis\",\"Andy_Dunn\",\"Arthur_Juliani\",\"Dylan_Robertson\",\"Jeffrey_Zeldman\",\"Jan_Chipchase\",\"Tor_Chanon\",\"Laurence_McCahill\",\"Mikael_Cho\",\"Brian_Honigman\",\"Thomas_Euler\",\"Adrian_Hanft\",\"John_Swain\",\"Ester_Bloom\",\"Ramya_Menon\",\"Varaha_Mihira\",\"Homela
nd_Is_Not_A_Series.\",\"Robert_Christgau\",\"ACLU_National\",\"Topica_Founder_Institute\",\"Zat_Rana\",\"Bas_Grasmayer\",\"Seyi_Fabode\",\"Aram_Rasa_Taghavi\",\"Poornima_Vijayashanker\",\"Kanvashrama_Trust\",\"Ploum\",\"Natalia_Babaeva\",\"Keith_Parkins\",\"Bradley_Nice\",\"ARABIC\",\"Ben_Pobjie\",\"Simon_de_la_Rouviere\",\"April_Walsh\",\"Bruce_Sterling\",\"Max_Deutsch\",\"Joe_Birch\",\"Yonatan_Zunger\",\"Junaid_Mubeen\",\"UNICEF\",\"Uri_Shaked\",\"Thaddeus_Howze\",\"Isaac_Morehouse\",\"Heath_Houston\",\"Transit\",\"Your_Fat_Friend\",\"Midori_Kocak\",\"Heather_Nann\",\"\\u00c1gua_de_Salsicha\",\"The_Leith_Agency\",\"Dakota_Shane\",\"Memo_Akten\",\"Femke\",\"Dan_Abelow\",\"Ali_Mese\",\"Wil_Wheaton\",\"Anne-Laure_Fr\\u00e9ant\",\"Paul_Cantor\",\"Jason_CranfordTeague\",\"brightthemag.com\",\"Chris_McCann\",\"Siftery\",\"Gary_Vaynerchuk\",\"R._Philip_Bouchard\",\"Daniel_Eckler\",\"Requests_for_Startups\",\"Spotlight_Central\",\"Patrik_Horv\\u00e1th\",\"Nader_Dabit\",\"Andy_Raskin\",\"Adam_Thomas\",\"Valigia_Blu\",\"David_Piepgrass\",\"Christina_Wodtke\",\"extranewsfeed.com\",\"Kevin_Rose\",\"Christian_Carvalho_Cruz\",\"Ajai_Raj\",\"Mark_Humphries\",\"John_Cutler\",\"SF_Ali\",\"Niko_Canner\",\"Justin_Jackson\",\"GrrlScientist\",\"Tom_Whitwell\",\"Britton_MDG\",\"Matt_Cetti-Roberts\",\"Alexander_Muse\",\"SantMat\",\"ma\\u0159enka_cerny\",\"Omar_El_Gabry\",\"Amanda_Machado\",\"Martino_Pietropoli\",\"UF_J-School\",\"Gutbloom\",\"Luke_Trayser\",\"Craig_Mod\",\"elizabeth_tobey\",\"Stef_Lewandowski\",\"Iv\\u00e1n_Lasso\",\"Wolox_Engineering\",\"Doug_Bierend\",\"Michael_Mangialardi\",\"David_Smooke\",\"Bharat_Tiwari\",\"Buster_Benson\",\"Alex_Voloshyn\",\"Zaron_Burnett_III\",\"Vitaliy_Verbenko\",\"Robert_Joffred\",\"Richard_Reis\",\"Ian_Lake\",\"Hanna_Brooks_Olsen\",\"Christian_Beck\",\"Rocco_Balsamo\",\"clubeorganico.com\",\"Creative_Social\",\"Scott_Belsky\",\"Flavia_Dzodan\",\"S._Paiva\",\"Gabriella_Bregman\",\"dex_digital\",\"Reynolds_Sandbox\",\"Smiley_Poswolsky\",\"Charl
es_Chu\",\"uxdesign.cc\",\"Ash_Rust\",\"Tom_Mitchell\",\"David_Siegel\",\"MailChimp\",\"Matheus_Ferreirinha\",\"Shubham_Datta\",\"Dan_Abramov\",\"David_Silver\",\"Gerard_Sans\",\"Mark_Manson\",\"Kate_Seabury\",\"jason\",\"Addy_Osmani\",\"G\\u00fcrcan_\\u00d6zt\\u00fcrk\",\"Singularity_University\",\"Tienda_de_Regalos\",\"Product_Hunt\",\"Medium_France\",\"Oliver_Cameron\",\"Terry_Mun\",\"Doc_Huston\",\"Tubik_Studio\",\"Elizabeth_Grattan\",\"Democratize\",\"Henry_Wismayer\",\"Blink\",\"Yulya_Besplemennova\",\"Arjan_Haring_\\ud83d\\udd2e\\ud83d\\udd28\",\"ConsenSys\",\"Ju_Do_Vale\",\"Chris_Hill\",\"Chen_Ye\",\"Jim_Salmons\",\"Matt_Quinn\",\"Pop_Topoi\",\"Kyle_Mizokami\",\"Adam_Rawnsley\",\"Vager_Saadullah\",\"Adam_Geitgey\",\"Edison_Nation\",\"Ali_Muzaffar\",\"Adam_Smith\",\"Matheus_Lima\",\"Willy_Braun\",\"VEON_Careers\",\"App\\u2019n\\u2019roll\",\"Rohan_\\u201cJava_Mac\\u201d_McLeod\",\"William_Belk\",\"Marcin_Treder\",\"MIT_Media_Lab\",\"Yann_Girard\",\"Doc_Searls\",\"Jessi_Hempel\",\"Francesco_Marconi\",\"gk_\",\"Knight_Foundation\",\"Bruno_Rodrigues\",\"Vartika\",\"Jean-baptiste_Jlt\",\"Tim_Monreal\",\"Anthony_J._Williams\",\"Cody_Charles\",\"julian_rogers\",\"Marcel_Kampman\",\"Carlos_Torres\",\"Javier_\\u2018Sim\\u00f3n\\u2019_Cuello\",\"Artem_Zavyalov\",\"PublishersLunch\",\"Tiffany_Sun\",\"Stephanie_Georgopulos\",\"Harry_Alford\",\"Master_of_Project_Academy\",\"Un.Dici\",\"Jay_Baer\",\"Rodrigo_Martinez\",\"library.gv.com\",\"Jonathan_Guthrie\",\"Gilberto_Miranda_Junior\",\"Brian_Scott_MacKenzie\",\"Vandini_Sharma\",\"AFS_Intercultural_Program\",\"Lee_Constantine\",\"Carol_Patrocinio\",\"Caio_Gondo\",\"ImmanuelTolstoyevski\",\"Sean_Kim\",\"Chris_Danilo\",\"Adrian_Chmielarz\",\"Lomography\",\"Medium_Japan\",\"Maike_Robert\",\"Michael_Sippey\",\"Illya_Klymov\",\"Gustavo_Tanaka\",\"Belle_Beth_Cooper\",\"Tony_Stubblebine\",\"Matt_Cooper-Wright\",\"LookBook\\u2122_WeLiveStyle\",\"Benjamin_Hoguet\",\"Jessamyn_West\",\"Leandro_Demori\",\"Andr\\u00e9_Camargo\",\"Hamz
a_Khan\",\"Christopher_Pierznik\",\"Collectors_Weekly\",\"Jin_Young_Kim\",\"Kelly_Dessaint\",\"Alore\",\"Rachel_Sklar\",\"The_Physics_arXiv_Blog\",\"Daniel_Jeffries\",\"Stowe_Boyd\",\"Josh_Santiago\",\"OECD\",\"Jordi_Romero\",\"Nathan_Bashaw\",\"Gerry_McGovern\",\"FocusEcuador\",\"Arthur_Chiaravalli\",\"Lauren_Modery\",\"MTV_News_Staff\",\"Afiur_Rahman_Fahim\",\"Rose_Powell\",\"Suzane_Jardim\",\"Experiential_Theology\",\"Leo_Burnett\",\"Nomad_Pass\",\"Andrew_Merle\",\"Kas_Tebbetts\",\"Silvia_Killingsworth\",\"Brad_Feld\",\"Josiah_Brown\",\"Peter_Coffin\",\"Karlijn_Willems\",\"Edward_Sullivan\",\"fi3200/EricSoucy\",\"Tom_Nixon\",\"Martin_Rezny\",\"Ryan_Holiday\",\"La_Tizza\",\"Perry_K._Wong\",\"Jordan_Morgan\",\"Snippets\",\"Mathieu_Grizard\",\"Ryan_Holmes\",\"Danny_Ngan\",\"Bertrand_Maltaverne\",\"blog.prototypr.io\",\"Patrice_Bonfy\",\"Ravi_Raman\",\"Arthur_Ferraz\",\"Owen_Blacker\",\"adam_nicholas_phillips\",\"The_Startup_Grind_Team\",\"Loic_Le_Meur\",\"Zarina_Zabrisky\",\"Owen_Williams\",\"Editora_Dublinense\",\"Susan_Crawford\",\"Jenn_Sutherland-Miller\",\"Alex_Danco\",\"blog.producthunt.com\",\"Valentin_Decker\",\"Laura_Annabelle\",\"First_Draft\",\"lance_weiler\",\"Lucas_Panek\",\"Joshua_\\uc2a4\\ud06c\\ub78c_Partogi\",\"Lauren_Smiley\",\"Ray_Yamazaki\",\"The_Animalist\",\"MySwedish\",\"ESTONIAN\",\"Zak_Slayback\",\"Dave_Gray\",\"Vagner_Vargas\",\"Kate_Lee\",\"Cem_ARGUN\",\"Lau_Dos_Santos\",\"Laurenellen_McCann!!\",\"Kyle_Pfister\",\"Javier_Escribano\",\"Jory_MacKay\",\"CamMi_Pham\",\"Asghar_Bukhari\",\"Lon_Shapiro\",\"Nicholas_C._Zakas\",\"Jasky_Singh\",\"Samuel_Mound\",\"Derrick_Harris\",\"Munindra_(Munnan)_Misra\",\"Sabine_Hossenfelder\",\"Alexis_Goldstein\",\"Nils_Parker\",\"Kim_Boekbinder\",\"2010\",\"Run_for_Something\",\"The_U.S._Digital_Service\",\"Emmet_Savage\",\"Bravo!\",\"Dave_Mckenna\",\"AthensLive_News\",\"Sean_Everett\",\"Cloudflare\",\"Meduza\",\"Peter_Schroeder_\\ud83d\\ude80\",\"Amy_Siskind\",\"coolmccool\",\"Geoff_Pilkington\",\"@DFF_Shane\"
,\"Ryan_Riley\",\"Tel\\u00e9fonoRoto\",\"Alex_Taussig\",\"Fraze_Craze\",\"Son_of_Baldwin\",\"Marco_Zander\",\"Ben_Werdmuller\",\"Matej_\\u2018Retro\\u2019_Jan\",\"\\u0420\\u0430\\u0434\\u0438\\u043e_\\u0421\\u0432\\u043e\\u0431\\u043e\\u0434\\u0430\",\"BENGALI\",\"Jason_Wojciechowski\",\"M.G._Siegler\",\"Chanda_Prescod-Weinstein\",\"Aymen_El_Amri\",\"Sam_Rye\",\"Tobias_Stone\",\"RankWatch\",\"Ron_Tuch\",\"Sebastian_Eschweiler\",\"Kevin_William_David\",\"Ronald_C._Flores-Gunkle\",\"shift.newco.co\",\"Trent_Lapinski\",\"National_Institute_of_Standards_and_Technology\",\"Mark_David\",\"android.jlelse.eu\",\"Harris_Sockel\",\"Doctors_Without_Borders\",\"RSA\",\"Sam_Thorogood\",\"Peter_Nowell\",\"Alex_Schult\",\"Dan_Sanchez\",\"Dion_Almaer\",\"Michael_Simmons\",\"Gerard_Mclean\",\"Tal_Kol\",\"Frederico_Mattos\",\"Royal_Montgomery\",\"Tommy_Darker\",\"Dominic_Williams\",\"Traditional_Tradesman\",\"Heidi_K._Isern\",\"Dave_Balter\",\"Brandon_E._Miller\",\"A._Sharif\",\"ProofHub\",\"Sara_Chipps\",\"Massimo_Sormonta\",\"VAWAA\",\"farmdrop\",\"Ernst-Jan_Pfauth\",\"Leah_Hood\",\"Coletivo_Trama\",\"Angela_Obias\",\"Filipe_Dalmatti_Lima\",\"Stella_J._McKenna\",\"Emlyn_O'Regan\",\"Anders_Emil_M\\u00f8ller\",\"Bonnitta_Roy\",\"Brian_Geddes\",\"Mirah_Curzer\",\"The_Pendulum\",\"CauseLabs\",\"The_Emotional_Businessist\",\"Marco_Venturi\",\"Michel_Weststrate\",\"The_Local_News_Lab\",\"Henry_Ward\",\"W._Ian_O'Byrne\",\"medium.freecodecamp.org\",\"Eric_Reidy\",\"Jeff_Bullas\",\"Alex_Steffen\",\"Monica_Cainarca\",\"Jennifer_Brandel\",\"Modelo\",\"Jean-Charles_Kurdali\",\"Tim_O'Reilly\",\"Al_Jazeera_English\",\"Penguin_Random_House\",\"Sapataria_Radical\",\"salvo_fedele\",\"Leonard_Kim\",\"Charles_Scalfani\",\"Tomas_Laurinavicius\",\"Hank_Green\",\"Mark\",\"Christopher_Daniels_(Notorious_DCI)\",\"The_Huffington_Post\",\"Steve_Krakauer\",\"HubSpot_Academy\",\"Hazel_Gale\",\"PopLand_Security\",\"Rex_Sorgatz\",\"Darin_Stevenson\",\"Medea,_Malm\\u00f6_University\",\"Catchlight\",\"Mike_Hearn\
",\"Aspie_Savant\",\"Paul_Richard_Huard\",\"Steve_Weintz\",\"Cynthia_Koo\",\"Lucky_Peach\",\"Dr_Jacques_COULARDEAU\",\"Ros\\u00e1rio_Pereira_Fernandes\",\"Jennifer_Pahlka\",\"Justin_Lawler\",\"Giovanni_Toschi\",\"Hugh_Forrest\",\"BULGARIAN\",\"Toky\",\"Evernote\",\"Wylinka\",\"Michael_Haupt\",\"Alex_Bretas\",\"Jasmine_Ramratan\",\"Nabeena_Mali\",\"Cloudways\",\"sarah_miller\",\"Amy_Sterling_Casil\",\"Danielle_Newnham\",\"James_Allworth\",\"TapFuse\",\"Bibblio.org\",\"Sasha_Stone\",\"Jeff_Gothelf\",\"Matt_Kandler\",\"Will_Gibbs\",\"ThoughtWorks_Brasil\",\"Stuart_James\",\"Febin_John_James\",\"Alexei_Ledenev\",\"Melissa_Chu\",\"Stanford_Alumni\",\"Tomas_Trajan\",\"Boffy\",\"University_of_Cambridge\",\"Noticieros_Televisa\",\"Node.js_Foundation\",\"Aur\\u00e9lien_Herv\\u00e9\",\"Jordan_Bray\",\"Jim_Cummings\",\"Valeriano_Donzelli_(Vale)\",\"Steve_Bryant\",\"Comune_di_Bergamo\",\"Regina_Anaejionu\",\"Fabio_Farro\",\"ABC_News\",\"Andrew_Coyle\",\"egi_syahban\",\"Amit_Shekhar\",\"Christian_Hernandez\",\"Inkbot_Design\",\"Joel_Thoms\",\"Hieu_Pham\",\"Nadia_Eghbal\",\"Martin_Sokk\",\"Rafael_Izzo\",\"Mario_Fraioli\",\"Michelle_Nickolaisen\",\"Ryan_McGeehan\",\"Alina_Simone\",\"Gustavo_Martin\",\"Chet_Haase\",\"Faris\",\"VA_Innovation_(VACI)\",\"Sathyvelu_Kunashegaran\",\"Jonathan_Rowson\",\"Platform_&_Stream\",\"Mother_Jones\",\"Byron_Crawford\",\"Tamyka_Bell\",\"Joshua_Lasky\",\"Creative_Economy\",\"Hans_Christian_Berge\",\"Arnau_Perendreu\",\"Luis_Fernando_Molina\",\"Alan_Soon\",\"Slack_API\",\"Rudy_Rosciglione\",\"Patipan_Injai\",\"Rodolfo_Pinotti\",\"Luke_Chesser\",\"Nima_Gardideh\",\"Jonathan_Albright\",\"pete\",\"Duncan_Green\",\"Atomic\",\"The_Daily_Signal\",\"Raul_Kuk\",\"Nathan_Benaich\",\"Maxwell_Anderson\",\"Felipe_Castro\",\"Hello_Fears\",\"Ev_Williams\",\"Josh_Pigford\",\"Michon_Neal\",\"\\u0421\\u043b\\u043e\\u0431\\u043e\\u0434\\u0438\\u043d_\\u041c\\u0438\\u0445\\u0430\\u0438\\u043b\",\"Erika_Hall\",\"Anne_Currie\",\"Roger_Taylor\",\"Arnaud_Sahuguet\",\"Liam_
Hogan\",\"Jean-Luc_Raymond\",\"Dominique_Matti\",\"Phil_Forbes\",\"Sarah_Mock\",\"Martijn_van_den_Broeck\",\"Rafe_Furst\",\"Paul-Olivier_Dehaye\",\"I_Blog_In_Jordans\",\"George_Gally\",\"Medium_em_Portugu\\u00eas\",\"BRIO\",\"QASymphony\",\"Tony_Brasunas\",\"Duncan_Weldon\",\"Alex_Bauer\",\"Second_Home\",\"Silver_Keskkula\",\"Martin_Shelton\",\"Aaron_Loeb\",\"Julia_Haslanger\",\"Vikram\",\"Joe_Bagel\",\"Classy\",\"Gianluca_Licciardi\",\"M_Aan_Mansyur\",\"Medium_en_espa\\u00f1ol\",\"Brass_Stories\",\"HBO_PR\",\"Alexandra_Samuel\",\"Raghav_Haran\",\"Andrew_Watts\",\"Panisuan_Joe_Chasinga\",\"Bobbie_Johnson\",\"Alex_Gershon\",\"Vinicius_Reis\",\"THX_Ltd.\",\"Virginia_Heffernan\",\"-Ixca-\",\"Matthew_Lew\",\"diesdas.digital\",\"builttoadapt.io\",\"Oliver_Lindberg\",\"J.B._Handley,_Jr.\",\"Ian_Warner\",\"Tom_Farr\",\"EricaJoy\",\"Simon_Owens\",\"Andreia_Paralta_Carqueija\",\"Matthew_Trinetti\",\"enso\",\"Michele_Connolly\",\"Shaun_Lind\",\"Isfandiyar_Shaheen\",\"Paul\",\"Jordan_Stodart\",\"The_Conversation\",\"usersnap\",\"Galen_Buckwalter,_PhD\",\"wtfeconomy.com\",\"Ryan_Hanley\",\"Arin_Basu\",\"Rachel_Glickhouse\",\"Clive_Thompson\",\"chatbotsmagazine.com\",\"Matt_Schlicht\",\"Iv\\u00e1n_Fanego\",\"Cristina_Juesas\",\"Bruce_Nappi\",\"Interstellar_Raccoons\",\"Guillermo_Peris\",\"John_Herrman\",\"Medium_T\\u00fcrk\\u00e7e\",\"Dead_Beat_Books\",\"DJ_Louie_XIV\",\"Mosaic\",\"Jason_Smith\",\"IGNITION_Staff\",\"Todd_Moy\",\"Ahmet_\\u00d6zkale\",\"Depict\",\"Christopher_Pitt\",\"Farrar,_Straus_&_Giroux\",\"2009\",\"2008\",\"Marcus_K._Dowling\",\"Ken_Grady\",\"Carolina_Lacerda\",\"Plataforma_Bodisatva\",\"StellarPeers\",\"Cody_Engel\",\"ThunderPuff\",\"Larry_Cornett,_Ph.D.\",\"Brad_Stulberg\",\"Charlie_Deets\",\"Thais_Weiller\",\"Ray_Hennessy\",\"Shoutem\",\"GapYearStories\",\"Dejan_Atanasov\",\"David_J_Bland\",\"Kamela_Hutzley_Dolinova\",\"Thomas_Schranz\",\"H\\u00e9ctor_Delgado\",\"Andy_Adams\",\"Fairbank_Center_Blog\",\"Richard_Burton\",\"Caterina_Kostoula\",\"Jeremy_Puma\
",\"Matt_Wesson\",\"NORWEGIAN_N\",\"Vicente_L_Ruiz\",\"Anna_Loboda\",\"Snarf\",\"Gillian_Rhodes\",\"Jon_Pincus\",\"Lincoln_W_Daniel\",\"Allan_Ishac\",\"American_Experience_|_PBS\",\"Mattias_Lehman\",\"Jacob_Molz\",\"Bianca_Strul\",\"Team_TBN\",\"Andrew\",\"Chance_Taken\",\"Cory_Sellar\",\"mono_\\uf8ff\",\"Shelley_Bernstein\",\"Pat_Heery\",\"Eduardo_Rabelo\",\"Kyle_Young\",\"EntreArtes_Comunicaci\\u00f3n\",\"ScholarMatch\",\"Andrei_Reina\",\"Concepts\",\"Vlad_Balin\",\"Lucas_Kalikowski\",\"Tom_Martin\",\"Dan_Gillmor\",\"Gretchen_Rubin\",\"Aura_Wilming\",\"Merah_Muda_Memudar\",\"Claire_Power\",\"Lola_Phoenix\",\"Dr._Cameron_Sepah\",\"Stephanie_Hays\",\"Editor\",\"Skyscanner_Growth\",\"SOM\",\"Sarah_Blackwood\",\"ICRC\",\"Andrew_Grant-Thomas\",\"Rodrigo_Aguiar\",\"United_States_Geospatial_Intelligence_Foundation\",\"Perch\\u00e9_Mi_Piace\",\"Moys\\u00e9s_Pinto_Neto\",\"@TheDovBaron\",\"Peter_Chang\",\"Savan_Patel\",\"Education_Elements\",\"MARGEM_CULTURAL\",\"Mathias_Lafeldt\",\"David_Byttow\",\"John_Saito\",\"Jacob_Cass\",\"\\u0410\\u043d\\u0434\\u0440\\u0435\\u0439_\\u0421\\u0442\\u0451\\u043f\\u0438\\u043d\",\"if_me\",\"Alexey_Ezhikov\",\"Bonni_Rambatan\",\"Arman_Anaturk\",\"Andrew_McLaughlin\",\"Recruta_Stone\",\"Privacy_International\",\"Margaret_Gould_Stewart\",\"Brett_Mazoch\",\"Victoria\",\"Ta\\u00eds_Bravo\",\"exedre\",\"Guilherme_Braga_Alves\",\"Kenny_Chen\",\"Marc_Hemeon\",\"Peter_Abualzolof\",\"Dave_Pell\",\"Emotive_Brand\",\"International_Med._Corps\",\"Matt_Anderson\",\"Linus_Ekenstam\"],\"x\":{\"__ndarray__\":\"oDhFQbQSub8Bj+FAbo/zwG/09sCowufAfdq7QUPIukFXc7lBdMhiQcRWt0F9+bVBd9MhQvohH0JE0xxC0oUiQi5qHUJJxR9CEV/cQSZh3EGkux5CAeYdQlgEIUJ0xeDAF5cfQiGeHEISqiJCD4LfQPr1REGwuNBABI3KQJiy28A1yr1ARCa4QKKWwkCRWNdA0KvFQGrslT+DrsdAPv3nPgi4xUDWys9AqM5ZQcU6xUCVRuK/rMexPz0q1UCzM9a/XGcCwHWcM8Dn4FdBaikEv5mzLUCmx9ZAuXetP5Gtcb/0ChxA49JlwNzyIL+WyR5BMwUOQqxbzkAa7pZAITgJwO78iEA71AtCOLpTv0z+L0DW8/BAcUV9QK/IU0BlZOVA98/YQLrAwr9Tmg1AKdgjQcbd2z/fBAA/q+VJv8sgC0Lzmie/CQSfQAfkgEDKQa5A/NB
nPpKECUH0lw1C383AQItdEcE6PaZA3R3SQBECBEIAXpdA3woGQt4KnUBgjw9AfZCiv1HClkAEq9FAGNkIQl8H+D40at8/7yvzQEWyfECtAgbAcLPFQImOxL7rLRNAkczSQNcHsEArLcFAdAI3QMZtX8AAMgZC1TEIQDWvZT5l1gPAjeMCQj85kUBCEts+sU1awNwsPEDF/oe+W1VhwMgZtkDQbh1BM2zcQEF9h7/O1Zo/Po1MQLrT08AMjxpBtFZKQS95t0AjhOY/vQewvpItG0FuXoBBfQ6lQMUggj6LMgs/rszVQZrXcsAPN/VASG/pQOEMiUCT2pBAj2Wmv05NB0IW9FM+QdtQQXkmGMHlBOXAN7S3Plbe5EBUK7Q/0guaQaGxBUGWgPRA5KAOQUKVgL8N6ZPAl+sswRUkb8DIjCzBJD9VwcuC2EGeNIfBmtFowHbiKUGjoKM/5kr+Qa93PUENDVa/AUioQLTFEkE1CghApy2gQGwPQEDyDB1A3q9KwXpqJUGzE/A/BtYEwWzonUCFxFZBI9IuQfk7ykDeXEhBcuxIQRPi2UB9Nqm/2dEVQCw5PT+Ke2PA8e4kQe52ksBNGxbB8lgkwQItiz8uffdBne+6wNHA9EADOQ9BM4fOQAPicMHo5vhBmJYmwXaLJcEXe03AmC2BweL7ccADTpzAdZSdQEzOGsH7uxBBL+7DQDS1+EGnPgBClXagwX5077+gOZ3BzfrhQTrM8kCc49RA/MWbQMpLq77yPPFBf/C8P9UZrMCXijhAeqvQv0Mz57+fq7tAzN+wwPiPu0FgpiFACzfrvyYizj5ptgHALZviwBONG8Ez/RHAbt9owYVSND/upBdBgP0pwJW72sDZPDXBQAEFQQ88wL+d5EvBwhbNQAV0lEElzA/BMYP0QCM3D0E2A2bBIk5HwEGCW0FNaDzBfvr7PzaAL0Hx7MnBemJGQYytC8AImI1B8mgAQgaSyTyPeoHBEv9zwVJVu0ANtNdARKtMQXPfAcCQMDxBW5v2wNwLDkEIzXdA54nswHYZj0GvD9jAgMQkQZ7ZSr6Z4C5BVo6SwMTkFcG5uY3AltgbwdCMN8EDhT7A+SqvwdjkvcCdFxZBQf0ZQfP7gcFlSVBBgJZDv0NVU0EbFQjB8bT8QfDx+kEDrGfAUNoHwe0DiMDJshZAnt17QUEjgsFpitvA10nRwHz5qMHKO5bAEriwwN5v0EGstRFBvqmxwEtTnMDYbIlAHkcDwQWCJsFwf/VBaKCpvftxp0FY36LBCq+FP+7x6cAToY3AZdJ9wWUmacF0AhC/2g/9QVr7S8EARThBeFAGwFTKy0D3UBjBfPKPwYsEJcENGPHAhdf7wH0SKsFx8DJBFo1vwa6XqUFZTyXBBuzsQQIZK0Hc8BJBIhBBQQiUc8FFOJ5Bq8RJwa0RM0H0x25BDn2CwBhWh8Ew7zLBqyuIPyLelUAqkUpBc3NpwSKFlsEZdlZBKOsBQbNaNsFcBdpBRhnzQecYDED+B4XBouRAQPkLhMGkPSm/7sWCQApw5EFBMWo/q2I8wUukrsGw1GxB6rYswUj7ukDH9uFBOTsmQDrXx7+GnsXAZ9DBwX5vXcHnm8/A7rtDQWZwdkEnLoJBocSQQYiDqsGVCeVAQNHbQVlyY8GypQJA4TD4QDUTR0CPqWJAkucjQMzYXUFLdFLANhw6wVI4EkHszAxBoyQvQV+vFsH96GdAkPCiwS3CacDN5FXByCG5wekxRsE9wrxBCZ2lQBhMx0HtqFFBal/0QVYz2MAiH13BLBD3wDlMmcEQRo/BENeBQc6cgMETpk3BnyIGQLiwLEHHYmLBZb3PQWl1E74cE35An+3KP/ov2r4GjOVBzIsNQHMaXkG2wg3B6AYuwD7RscDZakPB0m5mwVl4OkHzIO7AUr99wQPXukDovCfB9iFoQQkOB0FQ5+/AjhbePzTRIEHrNEPBh2KCPgNYEMHFM+/AM5IjQQJiTEFJ+uZBCZV
aPzRWBMGYzDjAdtxQwc6FacFOOVtB0IdrQSFiYkE4JVDA+LQUwSI5U8FEF29BPqmTP/YB+UGYUTXBomnawAVAQEFnywlBTEaFQYKdmMCP/SfB49yJwdLfmMCOZWpA9X7CQZBIrEEKsZ2+7zMQwWRQtsGZK/TAe7uwQfnBnkGI9HLBph/tQZ3fT0AvqzHBRckCwWQqh8HYB8RBmm75QVSvzD/6W5VBnf7FQV+YEcBBlhzBAd5+wSweUUEhd7tAVxWmwfLAZsGDa1PBhy2MQXnT90DceQ7BoycgwbpZxMGX519BS289wVqS3kF9zhfBD1FdP4qjW8GlMB7BOieOwR9nJEH1aXrB4cgwwe/x2T3SWeXA5tQlwBkvmsE6BWZBfqZfwelgtMHyiXVBDvWxwba4k0HkRtlAmipowPcp1D9tOOdBBQcZwDWkXcE/5fy/HMy8QU9DV0GnYZPABpJ2wL42AEHHqjFBzioswV9pyEHpi8zAsTwbwMEJdcFnCSbBb3tCQGNkeMHbh8NBdij5P24A+8Cx+xjBGKtPQT4J0UFnQw3BLmIlweoIacESsF5B2xV+Qc39ir9iwMtBBZyCwFrGzME9QrjBDt31wF+JcMBsDq9BGFyjvy9HvEELoKJB/tFsPWXTdsEn6bbA1p0UwZugXMH8l5RA49CewcD8OcHRYj9AUyqJQfR5QcFF/y7AljlswcnqcMBWxQZAhi6wv6B18MBH/6fAzpo/wc2d50AOpo1BAfuUwc4Lw8FqAMC+0180wcKgwEH7y3RBRSuDwXzcpkCl40jBnDznQCCvpMG5Q7xBhmW6wfTsjkFt+ydA543SQbx0y7+cv6ZAYi/pQaDBs8An4fQ/hwJ6wQRVisCSLSBAsP0wQb1et8H1Pc3BIUOWwXnGecFuUCfBUtlGwRQIEkG+xqPButrHQKmdyUD3nEfB8yQnwOjElcExN3PBoqNwwWPMS8FZLrbBlIhpwco+EEByVrzAi34YwHWyU0Gj0X5BzD5JQf61dcHBrd/A9mwowQyFisG2CEbAT4AvwfUtxMC768hAP62UwebLzkEy2LO+PoqIwV6HfME9cZhAllkKQWbZr8AyQbZAwvdXwV9hx8G4RCdBqjcSQfYRIUHA0s0/EBSdwXAssD2SqK/AnCTWQTVQqcEXrJLAm7tKweJ2lMF7TCJAxaunwTn500Aqv4zAleTWQSRREz/a9pbB7lu+QcHpzb/nbpnBdjMfwT4wdEHk39pBQEOOwaEO1UBJmxpBH5VhwWvghD/ml63A6jelQCQIEcET6kPBWZdUwZUoMcHyNAHBOeRbwVT8DEAg6rXBxUKgwfNtZsHTP1K/0QqaPmxNh0DIIa7BnHocwPM3m0EfyyjAn+mRwAYMjkEYc2FBeY2fwVOSvUHdhhLBWOyWQZQozT4mLbdAqP2FQR8kmsEypHpBo33fwERhgMBeEKU/j/lPwcQxyEAiJHfBv832wFYn58B2AB3BwplLwHUJZEFyO4LBxl1zQR3It8ETkKnBBhGnwTqLg0D8vYW/4pIkwQ0SrUFcPH8/5vu/wVdPvcBJe+0/NvxCwaP1IcHAUelB2CggwHVzLkHaI4PBBk9jwRRgLEE9QhjBfdG8P61O7sAb0IFBxGo3wSKE+sCm1l/BasVgQZH1bMFz578/hnUeP2T3xb+M3tJBY10KwPDIkMHEJTDBbzNWwVnmrsHbt5RANxqbwYHpgMHTPsHAUFzgwIxGfz9tRLs/AEM/QZHsScHUXQnA6HwGwYIzzL9qOWZBaxCBQTwZtEAKjW1BvVp3vqk+LMFXDG1AxWbbwDS5ScEdo/5AhKNjwNOX4MAsHgLBcZwrwLoI57474rHBQu6YwZTPsUEaNIzBlEURwCl+7MAkIxdAWr4TwXFb20ANAtfAcJ6dwQQu4MAN+xjBiB6WwXo7aUCAlh3BEvh5QVchUUEHOU5BbtQvwb2CMMEwPrnBo++kwOaCNUCHQJ7BTcJZwc/EukCw9bPBiQY
iwewo6z+F/4zBvJ+1QTcYgUGuiNxBqJerQZ3hob8IzLHBTkK+P38JOEALLxvAWvlowdQkz0BEWZVBZEwjwbt/scBsgaJB95OowEhEeMAVOz3ALQY7we5PjsFN5L3A1FvlwJ6jYcEj2ExA2QTSPRDOP8F4zrw/jPVDwYNYir7oqU/A/EW6P3AMNsHqRQ/BXhS5wArasT5HNIfA5m2XQOo0ScE3IpfATI2rwdMlQECogM7BgNqpQcKGEcGj6LO/QfE8wUl1sUECY3xAz++jQf6grcFC8LHB/fqPwR20PsC94vG+BatlwCGgr0Ejb7nAskV1wRJeNcC7tqDBDmT+wEHsMUCYMC6/eYdNQQdi20Cr7y3BviRmQIbDCL4iUSRA77wzQPUQQj/vZUdBhAtdQM2UMMAitaHAJixqwKU6jMD8Qoy/JwdbwfDnXEFY8nDBF2rFPZ/KSsFPRoTBFZ14QL/mEcC1oTu/NoNFwJ3fvUHlpWxBeabiQI6s0UDAwpXBYCmdQc19u7+HDQPBm9YFQPcdgsEt74E+5ZKYv0RcX8GaWJvAqhafQdDiSz5k27dA1NmbQI7cDcFqqIBBJOHiv8Zw9ECIUgvBXTNcQJy2eUEij35ArGcUwC+Dt8FVEddAEVOAQV8Gp0G4HZc/2uWSwRSFr8F7f6/BQ6qFwdXjwcAMlJ3AQEmSwdejq8HJDb5B/YYKwXF7lkB5BzlBfVVTwbWQakFMb7Q/SFE5wS9NLUC9RADBYQ9cPzbFmcEH2rjAjS2sQcZEjsGl6KtBisGFwSF0m743rzHAG/Z2QcFm2j/IBKbAJAdYwXP9IsGN/nVB422qQW/7MEGBjbjAmTzewER8X0AQ8B7BIY7jQJdeh0C0f5zByUqQwQ==\",\"dtype\":\"float32\",\"shape\":[1000]},\"y\":{\"__ndarray__\":\"rak9wl6cSsLEyFfCtHQ/wvCUQMKg5DvCR1nYwQld28GCSdbBQnInwqgZ28E749fBJViZwa6cncE1vpTBGr6UwQvjmMGqqZTBRc5jwTm+Y8F38ILB/xOJwWM+icEs/znCXqSOwYv2jcHIUI7BfShcwtGwPcJBBFvCUKxdwr3VOMKzhFnCJChcwiuyVMIs9FPCkqNSwn7gRsJ00k/CxG5GwqdzTMKfhUrCQPcjwoQdRsL3NEbCAoA/wmP7RcJhQ0LCCGpEwspVPMIN3iLCWbZAwuJ2PMJdJ0LCy2Y3wpbHPcLikzzCh0YZwh9dOsJ3uyzCAfSJwJqfPsJP9QrCSzcywtr4K8KOjwLAmXE2wt6rCsI1NjnCO/g0wiUPM8JMAjvCMmU6wml+L8K5pjDCUjcpwlJhNcKMnCzC3aMtwtYlVMBd0TXCCl41wrkKKcL94gXCyK0mwooVJ8JTzkvAQ5A2wkgP78FUsyfCSAYIwvHUaT947RbC99u/v2hpIMLGZCzCBUYdwuiSIsLWxhLC9m5twIDqK8IpTiLClG4zwohIE8KmQAnCfhCgwUU3AsKoahvCT6IZwqCiKML+BO3BUjftwdUKFMIF/ojAXg0fwrccm8H98yXCOQAcwBq4DsJ99RvC2qoMwldvE8Kc1OrByQ4Owu+TD8IGhNvBkc3gwcLl/8Et5BHCObANwupsuMEatCbCpcoVwpS018E+ivrBkxoQwiP6D8I62+fBeRcRwjiEC8JexQnCEkBAwA12B8Li0vTBDm36wRxAAMLY2gLCcnIEwtosJ7+I5arBP8Lowd1jx8H0i9jB6TYewrT9ysE4NwjCqQhRwW3fFsLPYizCiju1wfQS28GW+NLBzBC8wUxQqcFF5KzBQvXUwSOFL0A8CnrBCt8BwrwjD8IvfB/CG0IjwOz6f8H3wZvBy+S1wZPxvcEcXAHC/nNKwRSEycExN9zBy2nOwbxyzsEbWxPCVimKwDSCpcH5mRPCwPTIwRYPur66hAvCvjoPwgO61MEHyPLBe5Wiwbax0cFHpMTAwK/ZwZT
NpMHZapvBmLfXwXBR08EXFcE/zwbAwUG8SsEX6t/BWyLPvwcjiMFNuw7A6cXUwVU8s8Gi/wbCU9HEwSp14MH4gZHBCKn4wd1zw8CcegPC4fYkwgsluMCpDxi/gwoXwVsC8ME4bjPBD9/NQCw+1sGJvmzBUtSewT2ZicH7sWk9I4CjwenYN8FEYcjBLuC+we99wEA/eM/B+khYwVNZYb8T3NnBHn0lwWhn58EvWLfBws6WwVKbv8EqNJvBL6CkwaTRvsFB8v7BDOKDwBsRsMExudLBBmXqwVao7MB3/0XBfFoJwV7ARsHK7DpAfA/OvyqQdsGTzqHBy6fLwZi1BcL76onBgi/xwR4NR77S0otAA7pSwQ/UjcFSEnnBtm0LQMVHjsEkF7TBqo7hwNvvScEXb4DBBzwQQNPnn8F8Mu/BurZUwWU/hcHAHbfB5THlwESXwcCfvqrAZdTywdvJMcHV3HvBoJLfwYgascEi8oLBD/BSwfbJ5sFNg/E+BTzHwPYm3D6oFQXC3gxRwcYkssE1qrfBsA+BPyVrrMFtkSvBNu5TQAU2O8Der9rBD81gwXrBAcI+/ZDBx6s/wKW3GkC7t8nBUvBkwTZiG0BDx9bAt2XfwXdVgUBk1q3Bz/iHwcBRncHwPerBQAVxwSi9mMHjsIzAkLG4QZnjCkGtBMRBa7+EwUgG7sAs25HBK+6ZQQG4HcE5Uhs/E4SAQAbHrMC1AARBKSJSwWawPcEUvKVA+vYZQbj7icGlId09356Awe/wpcFT8wHChvNewar+hkDuSjbB3/dhwPi0nsH/c6HAWrcIwiNfZ0DtUklBB/BcwUkKmcEtmqjBfWnzwKvqB8H7XmnBWDAVwXLr98DhLSJAbqQaQArNyEErT8LA7yO5QGcacMFCq6tA0JTsv3799cDmO+PAjhR6wNlb6r+QsdjBiHkywXZUBUDF2PLALfkwQSstEUHAZabBb+a/wUs+ckHLuz9AM35bQL5Dh8HGI7vB6ox/QKazH0EGGao+Oj9rwYkHrsHpeAhBh9iCQUO3wcBO37RA3EFyPc9Jv8AyisLA/L92wLChJ8HTI7bBPESGwfQymMGEwlJAFcPPQcsMh8FyTwnBvRBnv8YayUE31pnBpwfVQbfwhcF8UsHBbPhFwLFsLsHHATO/ChKIwAePMEEYwP3BGTmHP2cykMFyK8DBaVjLQTWgRcByJ6DAE1UjQXxUhMB3YhjBGI54wSm0msDJBL3AqWgUQdb8BkHLdV+/qrw6QQSNEcEPfDBBJwkywS0MUcCAuaFBGpyMQZKZL8CI/jVBrvFxQRlTfkGEYo9BUcfqQSwehcHrudW+iQg1wS0UKcBXyTm/ZxyEQUCTrcBs96NBPu2YQUoGr8GUuMZAgjNMwUDPNsHgXDG/1cAvQdOIiUCwygpBHOOwQKcm60BviDRBIYkyP+uXcUEOoBtB8iFjwO5CdUG0CGBAWhydwWqaD0CN7eVBFvjGQEPb/MEAIo3B6VWIvxuvUkCoxcDAQBPjQRTZI8EkURm/gJv2QM9tTD/KYW/BkM7IQS5Y0UGrvzTB+QKiP6VWGEHyTyVBpP9qQFIepcBwJVZBiFSSwVuWCEGKrAdBM3aiQAINR0FkJgpBamBsQYJwxMGPpaRBYQ+bwP3+UcDJLKRBqhVJwJdEHkCD6A5CPzdpwYmbS0GMZN9BOC1/wYz1tEG4/2zBrs5JwcK7MT8V3CdC3Pp/QAFrBEKH00zAMJ/4wNDik0GYSDrAOtILQkxX7EF7rwtB2IOFQSKgn78y+IvBozQWwCCezb9Nh1HBmSsswCuZ/T+dOp1BiN80QdrR3b/lKQY+8+8CQXxOaMCHaqVB6k/+P4mqjEFpGTbAyPWTQBZjqkAkpEjBNQS/QTyvxEAihhPBM1qMQblBTEFC8Y5ACO21Qe1vM8EvDkFBiM6sQDC6fMDWTFhBoh/+wcZtSL4piZdB6xGxQbvylUHr1idB8Y8cP5KtG0L5CX8/DSUVQgI
tx0Dh3dk/fVpCQbRNA8ESW5ZAzKOkwEMgEUDbEClB0c4gQBoFNcGUtCnBlXpoQXf3RUErIYZAVUqiQU3Hp8BJ7KNAF0Y+QVV1sUGLM8q//wCrQTRkHkJIIS3BAm4nwXMhhEFVDllBbJcQQcCplz67UUFB6RDhvt/x9kBM/6JBKJC6QR8JTUAL7yXBd2ebPMV08T887xPBlDBKP4Cvi8B4RhtBxyTPQT/VBb/tymrAnSPSP9Py6kFt/rJAwP0OQE7evMCsnwdBsCU8QW2a80D2+5FAtAXCQQTZGkCOI7BAQ/4cQP5E+UGZ9OFB+mbVPySm2bzrldVBR9GMPyUa/UE/GohBqv4kQiRB/0Av18BBGKoAQZ/qBkEoqExBf/XCQbVY+0DG39FBlw8FQvgZi8FQEqxAx/qOQSDee0EI8RZBXlgRQp99KUF7/CVAA0zdQZckI0GMTJVBwmCWwF3bDUDLoXDBao+7QMjKukHel+FAX69GQWjMnEG5DL5BivDMQTKgUUHOjRRBlG/yQSF3+L5W9W9BYCgLQvMBH0LS8cdBNCjCQJP1vEA1X+BBRjX1vxPo+UEnMW7BR/AYwD/hFUEII1jAIPmMQOIG20G6CN5AC3jYQKMTXsHp5v9BgNiYQbRRwkBVuzFBrxz1QcbKgEGMj8xBu/GrQbXlpUD33fhBVm7jQQUrEkEFKh5Cf6SMQXnWA0ImRgNCUoC0QUJIzkHaTnhBhFryPyMbFUGJJGRBCC5cQVkL9kHwBeRBnLPWQSGZaUF75MJBo4EEQa37GUERw5dBLIDdv3XfzUDLWRdC0dRkQdzr4kFY4QRCqAlHwbUe0UHowY9BoJIeQmD7gkH7+ohBdTgdQVjMRUCuXTC/dKhVQEb9MUBVugJCgl44QUO4BEF/OvZA4LFewQprK0GP659B9IjEQAQvPUEofDRBhzv+QHv0ikEalAJC2gXGQdi+p0ERuBxBAfzIQd3Ed0FS1qFAlV81QdFqskHo8dRBK8fmQUDyyMBTPLBB/6uyQR4n90FETU/BIx3av4EIqUGdRhlC9tasQadf2UHA6lZA8poQQsjdRkC0dBRB7/IOQlxtsUFIs5tBXEpMwEsThEFaiXjAvMWxQd2LAEKjCuxBqJtJwYuYyEElC2Y/IXvTwKSGJkAdOgpC4C4zQZks90FyxjDBDJbbQEMltkDMUY9BlIxzQFtM5kEJyxxCGg+eQYizkUHJKK1B7UwBQgLv10EAwihCkVjvQaDZHsBz51BBfkgXwCXsPEFVyIJBsWF+QC2ECEKOa+lBsMADQeM2tEGapwBCM+roQbZDM0EHUFxBtHSEQLiDHkI5u4pBEDtkwU7cJ0Et27xBFq0jQIzzf7/vN+hBIjKSwEUXE0KvW09BEJg/QdfXFkJNq6NB9/sfQTj4GkKF/9ZBnLtgQaWuhT/VxDZBhYF/QYEUB0KtOYpBrg7VQVhCD0IO6KVBnmYBQqvVDUHSz0hBPpTzQcmIAkL0l2NBcPEXQpbUg0H/kidC6wMYQuePckGzsAtCwqYGQr7RbEE/PPFBm7ISQthAEkKXQDhAYTq3Qf6P8EHGCfBBHqMIQi9X0kF8sfBBCK3FQU9iFULYrb0/pD0NQm4LnUF7CCFCKx3RQQuHg0G1dNRByKJRQeOvKUIj3nFBYIIXQPJcWkGn0gpCHgarQNmb4UHMnalBykCDQW1c6UBfydhBNtkJQp6JcUEDfCFCx9WmQUxBW0Gd3QFBA40MQpsimEHGKg1CdnC6QZtnw0FPKfBBDmQhQkMtF0J8xqdBbk5nQWOBs0E125dBz7IAQgFoFUI6ElNBfZQoQou8qUAcRBVC1SwcQtyRwUF3fwdBuIgFQmOB1UG13Im/148HQXlpWEEJKIHAG4UQQvYMikClvrXAVgESQkp8FkIT2KdBS5mJQTJw5EF7gRhCVJ8AQmJs6EFNZC9Bv3MeQZKVFEKsWGdBIIK6QEt5BkJf1a1B+eDIQcjxukFCWlhBoUQCQv2
Vl0HsogVC0tpnQaQBJ0E1pQRCZvkhQh5Lj0GZu4RBTcLgv2rTR0Edn6dB0rvQQJ1sMEFbc6FBnN0EQje8AEKYKKBBqEwpQcjs8kEhzRFBekLsQdYen0HlBL9BdQO6QC9qQsGcmJJBAZIbQouF1T+cnxFBxzhpQbxTb0GWgQ1CU7alQD2h+D9gjBJBJmRYQeUXpUFE2hlCAQEEQaBAFELuK9NB1Xx2vb84EEJsKaZA671nQR+3zEHpIxtCprcTQgwyzEF3cIRBcir6QY6vs0GT2GFBC+mWQQ==\",\"dtype\":\"float32\",\"shape\":[1000]}},\"selected\":{\"id\":\"1051\",\"type\":\"Selection\"},\"selection_policy\":{\"id\":\"1052\",\"type\":\"UnionRenderers\"}},\"id\":\"1002\",\"type\":\"ColumnDataSource\"},{\"attributes\":{\"source\":{\"id\":\"1002\",\"type\":\"ColumnDataSource\"}},\"id\":\"1041\",\"type\":\"CDSView\"},{\"attributes\":{},\"id\":\"1027\",\"type\":\"HelpTool\"}],\"root_ids\":[\"1003\"]},\"title\":\"Bokeh Application\",\"version\":\"1.0.1\"}};\n", " var render_items = [{\"docid\":\"1f7fc17c-34db-4f9b-8646-6bd1f046e2b2\",\"roots\":{\"1003\":\"73d3cd95-cc8e-4d45-8458-e4aff03d22c3\"}}];\n", " root.Bokeh.embed.embed_items_notebook(docs_json, render_items);\n", "\n", " }\n", " if (root.Bokeh !== undefined) {\n", " embed_document(root);\n", " } else {\n", " var attempts = 0;\n", " var timer = setInterval(function(root) {\n", " if (root.Bokeh !== undefined) {\n", " embed_document(root);\n", " clearInterval(timer);\n", " }\n", " attempts++;\n", " if (attempts > 100) {\n", " console.log(\"Bokeh: ERROR: Unable to run BokehJS code because BokehJS library is missing\");\n", " clearInterval(timer);\n", " }\n", " }, 10, root)\n", " }\n", "})(window);" ], "application/vnd.bokehjs_exec.v0+json": "" }, "metadata": { "application/vnd.bokehjs_exec.v0+json": { "id": "1003" } }, "output_type": "display_data" }, { "data": { "text/html": [ "
Figure(
id = '1003', …)
above = [],
aspect_scale = 1,
background_fill_alpha = {'value': 1.0},
background_fill_color = {'value': '#ffffff'},
below = [LinearAxis(id='1012', ...)],
border_fill_alpha = {'value': 1.0},
border_fill_color = {'value': '#ffffff'},
css_classes = [],
disabled = False,
extra_x_ranges = {},
extra_y_ranges = {},
h_symmetry = True,
height = None,
hidpi = True,
js_event_callbacks = {},
js_property_callbacks = {},
left = [LinearAxis(id='1017', ...)],
lod_factor = 10,
lod_interval = 300,
lod_threshold = 2000,
lod_timeout = 500,
match_aspect = False,
min_border = 5,
min_border_bottom = None,
min_border_left = None,
min_border_right = None,
min_border_top = None,
name = None,
outline_line_alpha = {'value': 1.0},
outline_line_cap = 'butt',
outline_line_color = {'value': '#e5e5e5'},
outline_line_dash = [],
outline_line_dash_offset = 0,
outline_line_join = 'bevel',
outline_line_width = {'value': 1},
output_backend = 'canvas',
plot_height = 400,
plot_width = 600,
renderers = [LinearAxis(id='1012', ...), Grid(id='1016', ...), LinearAxis(id='1017', ...), Grid(id='1021', ...), BoxAnnotation(id='1030', ...), GlyphRenderer(id='1040', ...)],
right = [],
sizing_mode = 'fixed',
subscribed_events = [],
tags = [],
title = Title(id='1045', ...),
title_location = 'above',
toolbar = Toolbar(id='1028', ...),
toolbar_location = 'right',
toolbar_sticky = True,
v_symmetry = False,
width = None,
x_range = DataRange1d(id='1004', ...),
x_scale = LinearScale(id='1008', ...),
y_range = DataRange1d(id='1006', ...),
y_scale = LinearScale(id='1010', ...))
\n", "\n" ], "text/plain": [ "Figure(id='1003', ...)" ] }, "execution_count": 383, "metadata": {}, "output_type": "execute_result" } ], "source": [ "draw_vectors(word_tsne[:, 0], word_tsne[:, 1], color='green', token=words)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The categories are intermingled, but years, days, and languages sit apart from the cloud of authors." ] }, { "cell_type": "code", "execution_count": 386, "metadata": {}, "outputs": [], "source": [ "def get_phrase_embedding(phrase):\n", "    \"\"\"\n", "    Convert a phrase to a vector by averaging its word embeddings. See description above.\n", "    \"\"\"\n", "    # 1. tokenize the phrase on whitespace\n", "    # 2. average the word vectors of all tokens in the phrase,\n", "    #    skipping words that are not in the model's vocabulary\n", "    # 3. if all words are missing from the vocabulary, return zeros\n", "\n", "    vector = np.zeros([model.vector_size], dtype='float32')\n", "    word_count = 0\n", "\n", "    for word in phrase.split():\n", "        if word in model.vocab:\n", "            vector += model.get_vector(word)\n", "            word_count += 1\n", "\n", "    if word_count:\n", "        vector /= word_count\n", "\n", "    return vector" ] }, { "cell_type": "code", "execution_count": 423, "metadata": {}, "outputs": [], "source": [ "new_features = list()\n", "for ph in all_list:\n", "    vector = get_phrase_embedding(' '.join(ph))\n", "    new_features.append(vector)" ] }, { "cell_type": "code", "execution_count": 424, "metadata": {}, "outputs": [], "source": [ "new_features = pd.DataFrame(new_features)\n", "new_features.index = X_w2v.index\n", "X_w2v = pd.concat([X_w2v, new_features], axis=1)" ] }, { "cell_type": "code", "execution_count": 425, "metadata": {}, "outputs": [], "source": [ "X_w2v.drop(['author','domain','lang','working_day','year','month','weekday','log_recommends'], axis=1, inplace=True)" ] }, { "cell_type": "code", "execution_count": 426, "metadata": {}, "outputs": [], "source": [ "X_train, X_val, y_train, y_val = train_test_split(X_w2v, y, test_size=0.2)" ] }, { "cell_type": "code", "execution_count": 428, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1.5360024689310734" ] }, "execution_count": 428, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rf = RandomForestRegressor(n_jobs=-1)\n", "rf.fit(X_train, y_train)\n", "preds = rf.predict(X_val)\n", "mean_absolute_error(y_val, preds)" ] }, { "cell_type": "code", "execution_count": 429, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1.496423317408919" ] }, "execution_count": 429, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ridge = Ridge()\n", "ridge.fit(X_train, y_train)\n", "preds = ridge.predict(X_val)\n", "mean_absolute_error(y_val, preds)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is a poor result, but I cut a lot of features that could have helped the algorithm work." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now you know that categorical variables are a tricky beast, and that we can get a lot out of them with embeddings and Cat2Vec techniques. These techniques work not only with neural networks but also with simpler models, so they can be used in low-latency production systems." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "https://arxiv.org/ftp/arxiv/papers/1603/1603.04259.pdf ITEM2VEC: NEURAL ITEM EMBEDDING FOR COLLABORATIVE FILTERING
\n", "https://openreview.net/pdf?id=HyNxRZ9xg CAT2VEC: LEARNING DISTRIBUTED REPRESENTATION OF MULTI-FIELD CATEGORICAL DATA
\n", "https://arxiv.org/pdf/1604.06737v1.pdf Entity Embeddings of Categorical Variables
\n", "https://developers.google.com/machine-learning/crash-course/embeddings/video-lecture Embeddings
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.6" } }, "nbformat": 4, "nbformat_minor": 2 }