{ "cells": [ { "cell_type": "markdown", "id": "4b423346-1284-4333-95b7-e62d0e624b74", "metadata": {}, "source": [ "# 1. Test the fastText results using TokenEmbedding('wiki.en')." ] }, { "cell_type": "code", "execution_count": 5, "id": "ee2c5c01-15ef-4949-a055-872d0cb630d0", "metadata": { "tags": [] }, "outputs": [], "source": [ "import os\n", "import torch\n", "from torch import nn\n", "import sys\n", "sys.path.append('/home/jovyan/work/d2l_solutions/notebooks/exercises/d2l_utils/')\n", "import d2l\n", "\n", "class TokenEmbedding:\n", " \"\"\"Token Embedding.\"\"\"\n", " def __init__(self, embedding_name):\n", " self.idx_to_token, self.idx_to_vec = self._load_embedding(\n", " embedding_name)\n", " self.unknown_idx = 0\n", " self.token_to_idx = {token: idx for idx, token in\n", " enumerate(self.idx_to_token)}\n", "\n", " def _load_embedding(self, embedding_name):\n", " idx_to_token, idx_to_vec = [''], []\n", " data_dir = d2l.download_extract(embedding_name)\n", " # GloVe website: https://nlp.stanford.edu/projects/glove/\n", " # fastText website: https://fasttext.cc/\n", " with open(os.path.join(data_dir, 'vec.txt'), 'r') as f:\n", " for line in f:\n", " elems = line.rstrip().split(' ')\n", " token, elems = elems[0], [float(elem) for elem in elems[1:]]\n", " # Skip header information, such as the top row in fastText\n", " if len(elems) > 1:\n", " idx_to_token.append(token)\n", " idx_to_vec.append(elems)\n", " idx_to_vec = [[0] * len(idx_to_vec[0])] + idx_to_vec\n", " return idx_to_token, torch.tensor(idx_to_vec)\n", "\n", " def __getitem__(self, tokens):\n", " indices = [self.token_to_idx.get(token, self.unknown_idx)\n", " for token in tokens]\n", " vecs = self.idx_to_vec[torch.tensor(indices)]\n", " return vecs\n", "\n", " def __len__(self):\n", " return len(self.idx_to_token)\n", " \n", "def knn(W, x, k):\n", " # Add 1e-9 for numerical stability\n", " cos = torch.mv(W, x.reshape(-1,)) / (\n", " torch.sqrt(torch.sum(W * W, axis=1) + 1e-9) *\n", " torch.sqrt((x * x).sum()))\n", " _, topk = torch.topk(cos, k=k)\n", " return topk, [cos[int(i)] for i in topk]\n", "\n", "def get_similar_tokens(query_token, k, embed):\n", " topk, cos = knn(embed.idx_to_vec, embed[[query_token]], k + 1)\n", " for i, c in zip(topk[1:], cos[1:]): # Exclude the input word\n", " print(f'cosine sim={float(c):.3f}: {embed.idx_to_token[int(i)]}')\n", " \n", "def get_analogy(token_a, token_b, token_c, embed):\n", " vecs = embed[[token_a, token_b, token_c]]\n", " x = vecs[1] - vecs[0] + vecs[2]\n", " topk, cos = knn(embed.idx_to_vec, x, 1)\n", " return embed.idx_to_token[int(topk[0])] # Remove unknown words" ] }, { "cell_type": "code", "execution_count": 7, "id": "2dd2afdb-86bd-4c05-8ca1-b9516ae9c5f9", "metadata": { "tags": [] }, "outputs": [], "source": [ "d2l.DATA_HUB['wiki.en'] = (d2l.DATA_URL + 'wiki.en.zip',\n", " 'c1816da3821ae9f43899be655002f6c723e91b88')\n", "ft_wiki = TokenEmbedding('wiki.en')\n", "get_similar_tokens('chip', 3, ft_wiki)" ] }, { "cell_type": "code", "execution_count": null, "id": "2fb35d02-e2c0-4b87-a7fa-fad8f1598c2d", "metadata": {}, "outputs": [], "source": [ "get_analogy('man', 'woman', 'son', ft_wiki)" ] }, { "cell_type": "markdown", "id": "295ed257-9760-4473-8b0f-8528709f10e6", "metadata": {}, "source": [ "# 2. When the vocabulary is extremely large, how can we find similar words or complete a word analogy faster?" 
] }, { "cell_type": "markdown", "id": "48ead29b-cad8-4c8a-b9ea-482b88478c54", "metadata": {}, "source": [ "When the vocabulary is extremely large, finding similar words or completing a word analogy can be very time-consuming, because we need to compare the word vectors of every word in the vocabulary with the query word or the analogy words. To speed up this process, we can use some techniques such as:\n", "\n", "- Indexing: We can use some data structures or algorithms, such as hash tables, trees, or approximate nearest neighbor search, to index the word vectors and reduce the search space. This way, we can find the most similar words or the best analogy words without scanning the whole vocabulary.\n", "- Dimensionality reduction: We can use some methods, such as principal component analysis, singular value decomposition, or autoencoders, to reduce the dimensionality of the word vectors and preserve the most important information. This way, we can reduce the computational cost and memory usage of the similarity or analogy tasks.\n", "- Subword information: We can use some models, such as fastText or BPE, to represent words as sequences of subwords, such as character n-grams, and learn embeddings for each subword. This way, we can handle out-of-vocabulary words and capture the morphological information of words, which can improve the quality and efficiency of the similarity or analogy tasks." ] } ], "metadata": { "kernelspec": { "display_name": "Python [conda env:d2l]", "language": "python", "name": "conda-env-d2l-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.4" } }, "nbformat": 4, "nbformat_minor": 5 }