{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "\n", "\n", "***\n", "***\n", "\n", "# Introduction to Word Embeddings\n", "Analyzing Meaning through Word Embeddings\n", "\n", "***\n", "***" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "**Using vectors to represent things**\n", "- one of the most fascinating ideas in machine learning. \n", "- Word2vec is a method to efficiently create word embeddings. \n", " - Mikolov et al. (2013). [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/pdf/1301.3781.pdf)\n", " - Mikolov et al. (2013). [Distributed representations of words and phrases and their compositionality](https://arxiv.org/pdf/1310.4546.pdf)\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "***\n", "***\n", "\n", "## The Geometry of Culture\n", "\n", "Analyzing Meaning through Word Embeddings\n", "\n", "***\n", "***\n", "\n", "Austin C. Kozlowski; Matt Taddy; James A. Evans\n", "\n", "https://arxiv.org/abs/1803.09288\n", "\n", "Word embeddings represent **semantic relations** between words as **geometric relationships** between vectors in a high-dimensional space, operationalizing a relational model of meaning consistent with contemporary theories of identity and culture. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "\n", "- Dimensions induced by word differences (e.g. 
man - woman, rich - poor, black - white, liberal - conservative) in these vector spaces closely correspond to dimensions of cultural meaning, \n", "- Macro-cultural investigation with a longitudinal analysis of the coevolution of gender and class associations in the United States over the 20th century \n", "\n", "The success of these high-dimensional models motivates a move towards \"high-dimensional theorizing\" of meanings, identities and cultural processes." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## HistWords \n", "\n", "HistWords is a collection of tools and datasets for analyzing language change using word vector embeddings. \n", "\n", "- The goal of this project is to facilitate quantitative research in diachronic linguistics, history, and the digital humanities.\n", "\n", "\n", "- We used the historical word vectors in HistWords to study the semantic evolution of more than 30,000 words across 4 languages. \n", "\n", "- This study led us to propose two statistical laws that govern the evolution of word meaning \n", "\n", "\n", "https://nlp.stanford.edu/projects/histwords/\n", "\n", "https://github.com/williamleif/histwords\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "**Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change**\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Word embeddings quantify 100 years of gender and ethnic stereotypes\n", "\n", "http://www.pnas.org/content/early/2018/03/30/1720347115\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## The Illustrated Word2vec\n", "\n", "Jay Alammar. 
https://jalammar.github.io/illustrated-word2vec/" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Personality Embeddings\n", "\n", "> What are you like?\n", "\n", "**Big Five personality traits**: openness to experience, conscientiousness, extraversion, agreeableness, and neuroticism\n", "- the five-factor model (FFM) \n", "- **the OCEAN model**\n", "\n" ] },
{ "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2019-04-05T09:08:26.103877Z", "start_time": "2019-04-05T09:08:26.096732Z" }, "slideshow": { "slide_type": "subslide" } }, "source": [ "- Openness: imagination, aesthetic sensitivity, rich emotions, preference for novelty, creativity, and intellect.\n", "- Conscientiousness: competence, fairness, orderliness, dutifulness, achievement striving, self-discipline, prudence, and restraint.\n", "- Extraversion: warmth, sociability, assertiveness, activity, adventurousness, and optimism.\n", "- Agreeableness: trust, altruism, straightforwardness, compliance, modesty, and empathy.\n", "- Neuroticism (or, inversely, emotional stability): how well a person regulates emotions such as anxiety, hostility, depression, self-consciousness, impulsiveness, and vulnerability, i.e. the ability to stay emotionally stable." ] },
{ "cell_type": "code", "execution_count": 15, "metadata": { "ExecuteTime": { "end_time": "2019-04-05T09:37:14.866873Z", "start_time": "2019-04-05T09:37:14.862287Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "# Personality Embeddings: What are you like?\n", "jay = [-0.4, 0.8, 0.5, -0.2, 0.3]\n", "john = [-0.3, 0.2, 0.3, -0.4, 0.9]\n", "mike = [-0.5, -0.4, -0.2, 0.7, -0.1]" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Cosine Similarity\n", "The cosine of the angle between two non-zero vectors can be derived from the Euclidean dot product formula:\n", "\n", "$$\n", "\\mathbf{A}\\cdot\\mathbf{B}\n", "=\\left\\|\\mathbf{A}\\right\\|\\left\\|\\mathbf{B}\\right\\|\\cos\\theta\n", "$$\n" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "$$\n", "\\text{similarity} = \\cos(\\theta) = {\\mathbf{A} \\cdot \\mathbf{B} \\over \\|\\mathbf{A}\\| \\|\\mathbf{B}\\|} = \\frac{ \\sum\\limits_{i=1}^{n}{A_i B_i} }{ \\sqrt{\\sum\\limits_{i=1}^{n}{A_i^2}}
\\sqrt{\\sum\\limits_{i=1}^{n}{B_i^2}} },\n", "$$\n", "\n", "where $A_i$ and $B_i$ are the components of the vectors $\\mathbf{A}$ and $\\mathbf{B}$, respectively." ] },
{ "cell_type": "code", "execution_count": 10, "metadata": { "ExecuteTime": { "end_time": "2019-04-05T09:17:02.584345Z", "start_time": "2019-04-05T09:17:02.577116Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "-0.4999999999999999" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from numpy import dot\n", "from numpy.linalg import norm\n", "\n", "def cos_sim(a, b):\n", "    # dot product divided by the product of the Euclidean norms\n", "    return dot(a, b)/(norm(a)*norm(b))\n", "\n", "cos_sim([1, 0, -1], [-1,-1, 0])" ] },
{ "cell_type": "code", "execution_count": 8, "metadata": { "ExecuteTime": { "end_time": "2019-04-05T09:14:47.865304Z", "start_time": "2019-04-05T09:14:47.857879Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "array([[-0.5]])" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.metrics.pairwise import cosine_similarity\n", "\n", "cosine_similarity([[1, 0, -1]], [[-1,-1, 0]])" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "$$\\text{cosine distance} = 1 - \\text{cosine similarity}$$" ] },
{ "cell_type": "code", "execution_count": 13, "metadata": { "ExecuteTime": { "end_time": "2019-04-05T09:25:29.120401Z", "start_time": "2019-04-05T09:25:29.113611Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "-0.5" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from scipy import spatial\n", "# spatial.distance.cosine computes the cosine distance\n", "# between 1-D arrays, so 1 - distance recovers the similarity.\n", "1 - spatial.distance.cosine([1, 0, -1], [-1,-1, 0])" ] },
{ "cell_type": "code", "execution_count": 11, "metadata": { "ExecuteTime": { "end_time": "2019-04-05T09:17:48.677658Z", "start_time":
"2019-04-05T09:17:48.672323Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "0.6582337075311759" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cos_sim(jay, john)" ] },
{ "cell_type": "code", "execution_count": 12, "metadata": { "ExecuteTime": { "end_time": "2019-04-05T09:18:04.442036Z", "start_time": "2019-04-05T09:18:04.437385Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "-0.3683509554826695" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cos_sim(jay, mike)" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Cosine similarity works for any number of dimensions. \n", "- We can represent people (and things) as vectors of numbers (which is great for machines!).\n", "- We can easily calculate how similar vectors are to each other." ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Word Embeddings\n" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Google News Word2Vec\n", "\n", "You can download Google’s pre-trained model here.\n", "\n", "- It’s 1.5GB! \n", "- It includes word vectors for a vocabulary of 3 million words and phrases. \n", "- It was trained on roughly 100 billion words from a Google News dataset. \n", "- Each vector has 300 dimensions.\n", "\n", "http://mccormickml.com/2016/04/12/googles-pretrained-word2vec-model-in-python/" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Using the **Gensim** library in Python, we can \n", "- add and subtract word vectors, \n", "- find the most similar words to the resulting vector.\n" ] },
{ "cell_type": "code", "execution_count": 16, "metadata": { "ExecuteTime": { "end_time": "2019-04-05T09:48:32.663015Z", "start_time": "2019-04-05T09:46:33.523568Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "import gensim\n", "# Load Google's pre-trained Word2Vec model.\n", "filepath = '/Users/datalab/bigdata/GoogleNews-vectors-negative300.bin'\n", "model = gensim.models.KeyedVectors.load_word2vec_format(filepath, binary=True)" ] },
{ "cell_type": "code", "execution_count": 24, "metadata": { "ExecuteTime": { "end_time": "2019-04-05T09:54:56.388102Z", "start_time": "2019-04-05T09:54:56.383705Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "array([ 0.24316406, -0.07714844, -0.10302734, -0.10742188,  0.11816406,\n", "       -0.10742188, -0.11425781,  0.02563477,  0.11181641,  0.04858398],\n", "      dtype=float32)" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model['woman'][:10]" ] },
{ "cell_type": "code", "execution_count": 20, "metadata": { "ExecuteTime": { "end_time": "2019-04-05T09:52:13.063134Z", "start_time": "2019-04-05T09:52:12.809519Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "[('man', 0.7664012312889099),\n", " ('girl', 0.7494640946388245),\n", " ('teenage_girl', 0.7336829900741577),\n", " ('teenager', 0.631708562374115),\n", " ('lady', 0.6288785934448242),\n", " ('teenaged_girl', 0.6141784191131592),\n", " ('mother', 0.607630729675293),\n", " ('policewoman', 0.6069462299346924),\n", " ('boy', 0.5975908041000366),\n", " ('Woman', 0.5770983099937439)]" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.most_similar('woman')" ] },
{ "cell_type": "code", "execution_count": 18, "metadata": { "ExecuteTime": { "end_time": "2019-04-05T09:51:43.612745Z", "start_time": "2019-04-05T09:51:43.607316Z" }, "slideshow":
{ "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "0.76640123" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.similarity('woman', 'man')" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "ExecuteTime": { "end_time": "2019-04-05T09:54:38.743342Z", "start_time": "2019-04-05T09:54:38.738722Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "0.76640123" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cos_sim(model['woman'], model['man'])" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "ExecuteTime": { "end_time": "2019-04-05T09:49:03.690483Z", "start_time": "2019-04-05T09:48:48.812748Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "[('queen', 0.7118192911148071),\n", " ('monarch', 0.6189674139022827),\n", " ('princess', 0.5902431607246399),\n", " ('crown_prince', 0.5499460697174072),\n", " ('prince', 0.5377321243286133)]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.most_similar(positive=['woman', 'king'], negative=['man'], topn=5)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "$$King- Queen = Man - Woman$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Now that we’ve looked at trained word embeddings, \n", "\n", "- let’s learn more about the training process. 
\n", "- But before we get to word2vec, we need to look at a conceptual parent of word embeddings: **the neural language model**.\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# The neural language model\n", "\n", "“You shall know a word by the company it keeps” J.R. Firth\n", "\n", "\n", "\n", "> Bengio 2003 A Neural Probabilistic Language Model. Journal of Machine Learning Research. 3:1137–1155\n", "\n", "After being trained, early neural language models (Bengio 2003) would calculate a prediction in three steps:\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "" ] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2019-04-05T10:08:02.288041Z", "start_time": "2019-04-05T10:08:02.282384Z" }, "slideshow": { "slide_type": "subslide" } }, "source": [ "The output of the neural language model is a probability score for all the words the model knows. \n", "- We're referring to the probability as a percentage here, \n", "- but 40% would actually be represented as 0.4 in the output vector.\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Language Model Training\n", "\n", "- We get a lot of text data (say, all Wikipedia articles, for example). then\n", "- We have a window (say, of three words) that we slide against all of that text.\n", "- The sliding window generates training samples for our model" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "As this window slides against the text, we (virtually) generate a dataset that we use to train a model. 
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Instead of only looking two words before the target word, we can also look at two words after it.\n", "\n", "\n", "\n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "If we do this, the dataset we’re virtually building and training the model against would look like this:\n", "\n", "\n", "\n", "This is called a **Continuous Bag of Words** (CBOW) https://arxiv.org/pdf/1301.3781.pdf" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Skip-gram\n", "Instead of guessing a word based on its context (the words before and after it), this other architecture tries to guess neighboring words using the current word. \n", "\n", "\n", "\n", "https://arxiv.org/pdf/1301.3781.pdf" ] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2019-04-05T10:48:55.984057Z", "start_time": "2019-04-05T10:48:55.979592Z" }, "slideshow": { "slide_type": "subslide" } }, "source": [ "\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "The pink boxes are in different shades because this sliding window actually creates four separate samples in our training dataset.\n", "\n", "\n", "- We then slide our window to the next position:\n", "- Which generates our next four examples:\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": { 
"ExecuteTime": { "end_time": "2019-04-05T10:45:28.851644Z", "start_time": "2019-04-05T10:45:28.847193Z" }, "slideshow": { "slide_type": "subslide" } }, "source": [ "\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Negative Sampling\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "And switch it to a model that takes the input and output word, and outputs a score indicating **if they’re neighbors or not** \n", "- 0 for “not neighbors”, 1 for “neighbors”.\n", "\n", "\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "we need to introduce negative samples to our dataset\n", "- samples of words that are not neighbors. \n", "- Our model needs to return 0 for those samples.\n", "- This leads to a great tradeoff of computational and statistical efficiency.\n", "\n", "## Skipgram with Negative Sampling (SGNS)\n" ] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2019-04-05T10:48:55.984057Z", "start_time": "2019-04-05T10:48:55.979592Z" }, "slideshow": { "slide_type": "subslide" } }, "source": [ "## Word2vec Training Process\n", "\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Pytorch word2vec \n", "https://github.com/jojonki/word2vec-pytorch/blob/master/word2vec.ipynb\n", "\n", "https://github.com/bamtercelboo/pytorch_word2vec/blob/master/model.py" ] }, { "cell_type": "code", "execution_count": 118, "metadata": { "ExecuteTime": { "end_time": "2019-04-07T07:43:56.056098Z", "start_time": "2019-04-07T07:43:56.045854Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 118, "metadata": {}, "output_type": "execute_result" } 
], "source": [ "# see http://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html\n", "import torch\n", "from torch.autograd import Variable\n", "import torch.nn as nn\n", "import torch.nn.functional as F\n", "import torch.optim as optim\n", "\n", "torch.manual_seed(1)" ] },
{ "cell_type": "code", "execution_count": 202, "metadata": { "ExecuteTime": { "end_time": "2019-04-07T09:08:08.645360Z", "start_time": "2019-04-07T09:08:08.641731Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "text = \"\"\"We are about to study the idea of a computational process.\n", "Computational processes are abstract beings that inhabit computers.\n", "As they evolve, processes manipulate other abstract things called data.\n", "The evolution of a process is directed by a pattern of rules\n", "called a program. People create programs to direct processes. In effect,\n", "we conjure the spirits of the computer with our spells.\"\"\"\n", "\n", "text = text.replace(',', '').replace('.', '').lower().split()" ] },
{ "cell_type": "code", "execution_count": 203, "metadata": { "ExecuteTime": { "end_time": "2019-04-07T09:08:11.597040Z", "start_time": "2019-04-07T09:08:11.590972Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "vocab_size: 44\n" ] } ], "source": [ "# By deriving a set from `text`, we deduplicate the word list\n", "vocab = set(text)\n", "vocab_size = len(vocab)\n", "print('vocab_size:', vocab_size)\n", "\n", "# word -> index and index -> word lookup tables\n", "w2i = {w: i for i, w in enumerate(vocab)}\n", "i2w = {i: w for i, w in enumerate(vocab)}" ] },
{ "cell_type": "code", "execution_count": 204, "metadata": { "ExecuteTime": { "end_time": "2019-04-07T09:08:13.785456Z", "start_time": "2019-04-07T09:08:13.773647Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "cbow sample (['we', 'are', 'to', 'study'], 'about')\n" ] } ], "source": [ "# context window size is two\n", "def create_cbow_dataset(text):\n", "    data = []\n", "    for i in range(2, len(text) - 2):\n", "        context = [text[i - 2], text[i - 1],\n", "                   text[i + 1], text[i + 2]]\n", "        target = text[i]\n", "        data.append((context, target))\n", "    return data\n", "\n", "cbow_train = create_cbow_dataset(text)\n", "print('cbow sample', cbow_train[0])\n" ] },
{ "cell_type": "code", "execution_count": 205, "metadata": { "ExecuteTime": { "end_time": "2019-04-07T09:08:14.703431Z", "start_time": "2019-04-07T09:08:14.671677Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "skipgram sample ('about', 'we', 1)\n" ] } ], "source": [ "def create_skipgram_dataset(text):\n", "    import random\n", "    data = []\n", "    for i in range(2, len(text) - 2):\n", "        # positive samples: the four words inside the +/-2 window\n", "        data.append((text[i], text[i-2], 1))\n", "        data.append((text[i], text[i-1], 1))\n", "        data.append((text[i], text[i+1], 1))\n", "        data.append((text[i], text[i+2], 1))\n", "        # negative sampling: draw positions outside the +/-2 window\n", "        for _ in range(4):\n", "            if i >= 3 and (random.random() < 0.5 or i >= len(text) - 3):\n", "                rand_id = random.randint(0, i-3)\n", "            else:\n", "                rand_id = random.randint(i+3, len(text)-1)\n", "            data.append((text[i], text[rand_id], 0))\n", "    return data\n", "\n", "\n", "skipgram_train = create_skipgram_dataset(text)\n", "print('skipgram sample', skipgram_train[0])" ] },
{ "cell_type": "code", "execution_count": 206, "metadata": { "ExecuteTime": { "end_time": "2019-04-07T09:08:15.358380Z", "start_time": "2019-04-07T09:08:15.335746Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "class CBOW(nn.Module):\n", "    def __init__(self, vocab_size, embd_size, context_size, hidden_size):\n", "        super(CBOW, self).__init__()\n", "        self.embeddings = nn.Embedding(vocab_size, embd_size)\n", "        self.linear1 = nn.Linear(2*context_size*embd_size, hidden_size)\n", "        self.linear2 = nn.Linear(hidden_size, vocab_size)\n", "\n", "    def forward(self, inputs):\n", "        # concatenate the context embeddings into one row vector\n", "        embedded = self.embeddings(inputs).view((1, -1))\n", "        hid = F.relu(self.linear1(embedded))\n", "        out = self.linear2(hid)\n", "        log_probs = F.log_softmax(out, dim = 1)\n", "        return log_probs\n", "\n", "    def extract(self, inputs):\n", "        embeds = self.embeddings(inputs)\n", "        return embeds" ] },
{ "cell_type": "code", "execution_count": 207, "metadata": { "ExecuteTime": { "end_time": "2019-04-07T09:08:15.952938Z", "start_time": "2019-04-07T09:08:15.934330Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "class SkipGram(nn.Module):\n", "    def __init__(self, vocab_size, embd_size):\n", "        super(SkipGram, self).__init__()\n", "        self.embeddings = nn.Embedding(vocab_size, embd_size)\n", "\n", "    def forward(self, focus, context):\n", "        embed_focus = self.embeddings(focus).view((1, -1)) # input\n", "        embed_ctx = self.embeddings(context).view((1, -1)) # output\n", "        score = torch.mm(embed_focus, torch.t(embed_ctx)) # input*output\n", "        log_probs = F.logsigmoid(score) # log of the sigmoid of the score\n", "        return log_probs\n", "\n", "    def extract(self, focus):\n", "        embed_focus = self.embeddings(focus)\n", "        return embed_focus" ] },
{ "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2019-04-07T08:24:20.526644Z", "start_time": "2019-04-07T08:24:20.523217Z" }, "slideshow": { "slide_type": "subslide" } }, "source": [ "`torch.mm` performs a matrix multiplication of two matrices.\n", "\n", "`torch.t` expects `input` to be a matrix (2-D tensor) and transposes dimensions 0\n", "and 1. It can be seen as shorthand for `transpose(input, 0, 1)`."
] }, { "cell_type": "code", "execution_count": 208, "metadata": { "ExecuteTime": { "end_time": "2019-04-07T09:08:17.249149Z", "start_time": "2019-04-07T09:08:17.245376Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "embd_size = 100\n", "learning_rate = 0.001\n", "n_epoch = 30\n", "CONTEXT_SIZE = 2 # 2 words to the left, 2 to the right" ] },
{ "cell_type": "code", "execution_count": 209, "metadata": { "ExecuteTime": { "end_time": "2019-04-07T09:08:18.005531Z", "start_time": "2019-04-07T09:08:17.973458Z" }, "code_folding": [], "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "def train_cbow():\n", "    hidden_size = 64\n", "    losses = []\n", "    loss_fn = nn.NLLLoss()\n", "    model = CBOW(vocab_size, embd_size, CONTEXT_SIZE, hidden_size)\n", "    print(model)\n", "    optimizer = optim.SGD(model.parameters(), lr=learning_rate)\n", "    for epoch in range(n_epoch):\n", "        total_loss = .0\n", "        for context, target in cbow_train:\n", "            ctx_idxs = [w2i[w] for w in context]\n", "            ctx_var = Variable(torch.LongTensor(ctx_idxs))\n", "\n", "            model.zero_grad()\n", "            log_probs = model(ctx_var)\n", "\n", "            loss = loss_fn(log_probs, Variable(torch.LongTensor([w2i[target]])))\n", "\n", "            loss.backward()\n", "            optimizer.step()\n", "\n", "            total_loss += loss.data.item()\n", "        # one accumulated loss value per epoch\n", "        losses.append(total_loss)\n", "    return model, losses" ] },
{ "cell_type": "code", "execution_count": 210, "metadata": { "ExecuteTime": { "end_time": "2019-04-07T09:08:18.803221Z", "start_time": "2019-04-07T09:08:18.773964Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "def train_skipgram():\n", "    losses = []\n", "    loss_fn = nn.MSELoss()\n", "    model = SkipGram(vocab_size, embd_size)\n", "    print(model)\n", "    optimizer = optim.SGD(model.parameters(), lr=learning_rate)\n", "\n", "    for epoch in range(n_epoch):\n", "        total_loss = .0\n", "        for in_w, out_w, target in skipgram_train:\n", "            in_w_var = Variable(torch.LongTensor([w2i[in_w]]))\n", "            out_w_var = Variable(torch.LongTensor([w2i[out_w]]))\n", "\n", "            model.zero_grad()\n", "            log_probs = model(in_w_var, out_w_var)\n", "            # forward() returns log-sigmoid scores; exponentiate to get the\n", "            # neighbor probability before comparing it with the 0/1 label\n", "            probs = torch.exp(log_probs[0])\n", "            loss = loss_fn(probs, Variable(torch.Tensor([target])))\n", "\n", "            loss.backward()\n", "            optimizer.step()\n", "\n", "            total_loss += loss.data.item()\n", "        # one accumulated loss value per epoch\n", "        losses.append(total_loss)\n", "    return model, losses" ] },
{ "cell_type": "code", "execution_count": 211, "metadata": { "ExecuteTime": { "end_time": "2019-04-07T09:08:24.429022Z", "start_time": "2019-04-07T09:08:19.718142Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CBOW(\n", "  (embeddings): Embedding(44, 100)\n", "  (linear1): Linear(in_features=400, out_features=64, bias=True)\n", "  (linear2): Linear(in_features=64, out_features=44, bias=True)\n", ")\n", "SkipGram(\n", "  (embeddings): Embedding(44, 100)\n", ")\n" ] } ], "source": [ "cbow_model, cbow_losses = train_cbow()\n", "sg_model, sg_losses = train_skipgram()" ] },
{ "cell_type": "code", "execution_count": 212, "metadata": { "ExecuteTime": { "end_time": "2019-04-07T09:08:24.968373Z", "start_time": "2019-04-07T09:08:24.430942Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import matplotlib.pyplot as plt\n", "\n", "# Plot the per-epoch training loss of both models side by side\n", "plt.figure(figsize= (10, 4))\n", "plt.subplot(121)\n", "plt.plot(range(n_epoch), cbow_losses, 'r-o', label = 'CBOW Losses')\n", "plt.legend()\n", "plt.subplot(122)\n", "plt.plot(range(n_epoch), sg_losses, 'g-s', label = 'SkipGram Losses')\n", "plt.legend()\n", "plt.tight_layout()\n" ] },
{ "cell_type": "code", "execution_count": 213, "metadata": { "ExecuteTime": { "end_time": "2019-04-07T09:08:46.483856Z", "start_time": "2019-04-07T09:08:46.477458Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "100" ] }, "execution_count": 213, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cbow_vec = cbow_model.extract(Variable(torch.LongTensor([v for v in w2i.values()])))\n", "cbow_vec = cbow_vec.data.numpy()\n", "len(cbow_vec[0])" ] },
{ "cell_type": "code", "execution_count": 214, "metadata": { "ExecuteTime": { "end_time": "2019-04-07T09:08:59.082813Z", "start_time": "2019-04-07T09:08:59.076630Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "100" ] }, "execution_count": 214, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sg_vec = sg_model.extract(Variable(torch.LongTensor([v for v in w2i.values()])))\n", "sg_vec = sg_vec.data.numpy()\n", "len(sg_vec[0])" ] },
{ "cell_type": "code", "execution_count": 217, "metadata": { "ExecuteTime": { "end_time": "2019-04-07T09:09:21.040471Z", "start_time": "2019-04-07T09:09:20.289414Z" }, "code_folding": [ 0 ], "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# 利用PCA算法进行降维\n", "from sklearn.decomposition import PCA\n", "X_reduced = PCA(n_components=2).fit_transform(sg_vec)\n", "\n", "# 绘制所有单词向量的二维空间投影\n", "import matplotlib.pyplot as plt\n", "import matplotlib\n", "\n", "fig = plt.figure(figsize = (20, 10))\n", "ax = fig.gca()\n", "ax.set_facecolor('black')\n", "ax.plot(X_reduced[:, 0], X_reduced[:, 1], '.', markersize = 1, alpha = 0.4, color = 'white')\n", "# 绘制几个特殊单词的向量\n", "words = list(w2i.keys())\n", "# 设置中文字体,否则无法在图形上显示中文\n", "for w in words:\n", " if w in w2i:\n", " ind = w2i[w]\n", " xy = X_reduced[ind]\n", " plt.plot(xy[0], xy[1], '.', alpha =1, color = 'red')\n", " plt.text(xy[0], xy[1], w, alpha = 1, color = 'white', fontsize = 20)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# NGram词向量模型\n", "\n", "本文件是集智AI学园http://campus.swarma.org 出品的“火炬上的深度学习”第VI课的配套源代码\n", "\n", "原理:利用一个人工神经网络来根据前N个单词来预测下一个单词,从而得到每个单词的词向量\n", "\n", "以刘慈欣著名的科幻小说《三体》为例,来展示利用NGram模型训练词向量的方法\n", "- 预处理分为两个步骤:1、读取文件、2、分词、3、将语料划分为N+1元组,准备好训练用数据\n", "- 在这里,我们并没有去除标点符号,一是为了编程简洁,而是考虑到分词会自动将标点符号当作一个单词处理,因此不需要额外考虑。" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "ExecuteTime": { "end_time": "2019-04-07T06:12:24.844179Z", "start_time": "2019-04-07T06:12:24.840626Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "with open(\"../data/3body.txt\", 'r') as f:\n", " text = str(f.read())" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "ExecuteTime": { "end_time": "2019-04-07T06:12:26.106850Z", "start_time": "2019-04-07T06:12:25.963215Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "7754\n" ] } ], "source": [ "import jieba, re\n", "temp = jieba.lcut(text)\n", "words = []\n", "for i in temp:\n", " #过滤掉所有的标点符号\n", " i = re.sub(\"[\\s+\\.\\!\\/_,$%^*(+\\\"\\'””《》]+|[+——!,。?、~@#¥%……&*():]+\", \"\", 
i)\n", " if len(i) > 0:\n", " words.append(i)\n", "print(len(words))" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "ExecuteTime": { "end_time": "2019-04-07T06:12:25.377559Z", "start_time": "2019-04-07T06:12:25.373378Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "'八万五千三体时(约8.6个地球年)后。\\n\\n元首下令召开三体世界全体执政官紧急会议,这很不寻常,一定有什么重大的事件发生。\\n\\n两万三体时前,三体舰队启航了,它们只知道目标的大致方向,却不知道它的距离。也'" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text[:100]" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "ExecuteTime": { "end_time": "2019-04-07T06:14:48.293883Z", "start_time": "2019-04-07T06:14:48.286653Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "八万五千 三体 时 约 86 个 地球 年 后 元首 下令 召开 三体 世界 全体 执政官 紧急会议 这 很 不 寻常 一定 有 什么 重大 的 事件 发生 两万 三体 时前 三体 舰队 启航 了 它们 只 知道 目标 的 大致 方向 却 不 知道 它 的 距离 也许 目标\n" ] } ], "source": [ "print(*words[:50])" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "ExecuteTime": { "end_time": "2019-04-07T06:12:26.856673Z", "start_time": "2019-04-07T06:12:26.844921Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[(['八万五千', '三体'], '时'), (['三体', '时'], '约'), (['时', '约'], '86')]\n" ] } ], "source": [ "trigrams = [([words[i], words[i + 1]], words[i + 2]) for i in range(len(words) - 2)]\n", "# 打印出前三个元素看看\n", "print(trigrams[:3])" ] }, { "cell_type": "code", "execution_count": 84, "metadata": { "ExecuteTime": { "end_time": "2019-04-07T07:01:07.086627Z", "start_time": "2019-04-07T07:01:07.074396Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2000\n" ] } ], "source": [ "# 得到词汇表\n", "vocab = set(words)\n", "print(len(vocab))\n", "word_to_idx = {i:[k, 0] for k, i in enumerate(vocab)} \n", "idx_to_word = {k:i for k, i in 
enumerate(vocab)}\n", "# Count how often each word occurs in the corpus\n", "for w in words:\n", "    word_to_idx[w][1] +=1" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Building the NGram neural network model (a three-layer network)\n", "\n", "1. Input layer: an embedding layer. Conceptually, it first maps the index of each input word to a one-hot vector (e.g. 001000) whose dimension is the vocabulary size,\n", "and then passes that vector through a linear layer to obtain the word's vector representation of size embedding_dim. In practice `nn.Embedding` does this as an equivalent table lookup.\n", "2. Linear layer: from context_size * embedding_dim dimensions (the concatenated context embeddings) to 128 dimensions, followed by a ReLU nonlinearity.\n", "3. Linear layer: from 128 dimensions to the vocabulary size, followed by log softmax, which gives the predicted log-probability of each word." ] }, { "cell_type": "code", "execution_count": 87, "metadata": { "ExecuteTime": { "end_time": "2019-04-07T07:01:38.819578Z", "start_time": "2019-04-07T07:01:38.783151Z" }, "code_folding": [], "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "import torch.nn as nn\n", "import torch.nn.functional as F\n", "import torch.optim as optim\n", "from torch.autograd import Variable\n", "\n", "import torch\n", "\n", "class NGram(nn.Module):\n", "    def __init__(self, vocab_size, embedding_dim, context_size):\n", "        super(NGram, self).__init__()\n", "        self.embeddings = nn.Embedding(vocab_size, embedding_dim) # embedding layer\n", "        self.linear1 = nn.Linear(context_size * embedding_dim, 128) # hidden linear layer\n", "        self.linear2 = nn.Linear(128, vocab_size) # output linear layer\n", "\n", "    def forward(self, inputs):\n", "        # Look up the embedding of each context word and concatenate them into one row vector\n", "        embeds = self.embeddings(inputs).view(1, -1)\n", "        # Linear layer followed by ReLU\n", "        out = F.relu(self.linear1(embeds))\n", "\n", "        # Linear layer followed by log softmax\n", "        out = self.linear2(out)\n", "        log_probs = F.log_softmax(out, dim = 1)\n", "        return log_probs\n", "\n", "    def extract(self, inputs):\n", "        # Return the embedding vectors for the given word indices\n", "        embeds = self.embeddings(inputs)\n", "        return embeds" ] }, { "cell_type": "code", "execution_count": 89, "metadata": { "ExecuteTime": { "end_time": "2019-04-07T07:14:30.758701Z", "start_time": "2019-04-07T07:02:06.973444Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "第0轮,损失函数为:56704.61\n", "第1轮,损失函数为:53935.28\n", "第2轮,损失函数为:52241.16\n", 
"第3轮,损失函数为:51008.51\n", "第4轮,损失函数为:50113.76\n", "第5轮,损失函数为:49434.07\n", "第6轮,损失函数为:48879.33\n", "第7轮,损失函数为:48404.71\n", "第8轮,损失函数为:47983.95\n", "第9轮,损失函数为:47600.01\n", "第10轮,损失函数为:47240.32\n", "第11轮,损失函数为:46897.53\n", "第12轮,损失函数为:46566.24\n", "第13轮,损失函数为:46241.59\n", "第14轮,损失函数为:45920.18\n", "第15轮,损失函数为:45599.50\n", "第16轮,损失函数为:45277.74\n", "第17轮,损失函数为:44953.10\n", "第18轮,损失函数为:44624.41\n", "第19轮,损失函数为:44290.34\n", "第20轮,损失函数为:43950.63\n", "第21轮,损失函数为:43604.48\n", "第22轮,损失函数为:43251.90\n", "第23轮,损失函数为:42891.99\n", "第24轮,损失函数为:42524.64\n", "第25轮,损失函数为:42149.46\n", "第26轮,损失函数为:41766.14\n", "第27轮,损失函数为:41374.89\n", "第28轮,损失函数为:40975.62\n", "第29轮,损失函数为:40568.36\n", "第30轮,损失函数为:40153.31\n", "第31轮,损失函数为:39730.61\n", "第32轮,损失函数为:39300.70\n", "第33轮,损失函数为:38863.39\n", "第34轮,损失函数为:38419.11\n", "第35轮,损失函数为:37968.16\n", "第36轮,损失函数为:37510.99\n", "第37轮,损失函数为:37048.06\n", "第38轮,损失函数为:36579.82\n", "第39轮,损失函数为:36106.78\n", "第40轮,损失函数为:35629.46\n", "第41轮,损失函数为:35148.57\n", "第42轮,损失函数为:34665.39\n", "第43轮,损失函数为:34180.25\n", "第44轮,损失函数为:33693.93\n", "第45轮,损失函数为:33207.48\n", "第46轮,损失函数为:32721.72\n", "第47轮,损失函数为:32237.36\n", "第48轮,损失函数为:31755.00\n", "第49轮,损失函数为:31275.05\n", "第50轮,损失函数为:30798.38\n", "第51轮,损失函数为:30325.62\n", "第52轮,损失函数为:29857.59\n", "第53轮,损失函数为:29394.65\n", "第54轮,损失函数为:28937.08\n", "第55轮,损失函数为:28485.72\n", "第56轮,损失函数为:28041.07\n", "第57轮,损失函数为:27603.33\n", "第58轮,损失函数为:27173.14\n", "第59轮,损失函数为:26750.82\n", "第60轮,损失函数为:26336.92\n", "第61轮,损失函数为:25931.60\n", "第62轮,损失函数为:25534.87\n", "第63轮,损失函数为:25147.07\n", "第64轮,损失函数为:24768.02\n", "第65轮,损失函数为:24397.92\n", "第66轮,损失函数为:24036.68\n", "第67轮,损失函数为:23684.69\n", "第68轮,损失函数为:23341.30\n", "第69轮,损失函数为:23006.46\n", "第70轮,损失函数为:22680.18\n", "第71轮,损失函数为:22361.95\n", "第72轮,损失函数为:22051.86\n", "第73轮,损失函数为:21749.46\n", "第74轮,损失函数为:21454.48\n", "第75轮,损失函数为:21167.06\n", "第76轮,损失函数为:20886.72\n", "第77轮,损失函数为:20613.04\n", "第78轮,损失函数为:20346.13\n", "第79轮,损失函数为:20085.52\n", "第80轮,损失函数为:19831.27\n", "第81轮,损失函数为:19583.16\n", "第82轮,损失函数为:19341.03\n", 
"第83轮,损失函数为:19104.43\n", "第84轮,损失函数为:18873.11\n", "第85轮,损失函数为:18646.91\n", "第86轮,损失函数为:18425.87\n", "第87轮,损失函数为:18209.80\n", "第88轮,损失函数为:17998.34\n", "第89轮,损失函数为:17791.97\n", "第90轮,损失函数为:17589.94\n", "第91轮,损失函数为:17392.24\n", "第92轮,损失函数为:17199.04\n", "第93轮,损失函数为:17009.97\n", "第94轮,损失函数为:16824.82\n", "第95轮,损失函数为:16643.87\n", "第96轮,损失函数为:16466.76\n", "第97轮,损失函数为:16293.54\n", "第98轮,损失函数为:16123.99\n", "第99轮,损失函数为:15957.75\n" ] } ], "source": [ "losses = [] #纪录每一步的损失函数\n", "criterion = nn.NLLLoss() #运用负对数似然函数作为目标函数(常用于多分类问题的目标函数)\n", "model = NGram(len(vocab), 10, 2) #定义NGram模型,向量嵌入维数为10维,N(窗口大小)为2\n", "optimizer = optim.SGD(model.parameters(), lr=0.001) #使用随机梯度下降算法作为优化器 \n", "#循环100个周期\n", "for epoch in range(100):\n", " total_loss = torch.Tensor([0])\n", " for context, target in trigrams:\n", " # 准备好输入模型的数据,将词汇映射为编码\n", " context_idxs = [word_to_idx[w][0] for w in context]\n", " # 包装成PyTorch的Variable\n", " context_var = Variable(torch.LongTensor(context_idxs))\n", " # 清空梯度:注意PyTorch会在调用backward的时候自动积累梯度信息,故而每隔周期要清空梯度信息一次。\n", " optimizer.zero_grad()\n", " # 用神经网络做计算,计算得到输出的每个单词的可能概率对数值\n", " log_probs = model(context_var)\n", " # 计算损失函数,同样需要把目标数据转化为编码,并包装为Variable\n", " loss = criterion(log_probs, Variable(torch.LongTensor([word_to_idx[target][0]])))\n", " # 梯度反传\n", " loss.backward()\n", " # 对网络进行优化\n", " optimizer.step()\n", " # 累加损失函数值\n", " total_loss += loss.data\n", " losses.append(total_loss)\n", " print('第{}轮,损失函数为:{:.2f}'.format(epoch, total_loss.numpy()[0]))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ " 12m 24s!!!" 
] }, { "cell_type": "code", "execution_count": 91, "metadata": { "ExecuteTime": { "end_time": "2019-04-07T07:17:40.676666Z", "start_time": "2019-04-07T07:17:39.894446Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "# Extract the learned vector of every word from the trained model.\n", "# Row i of vec is the embedding of word index i, independent of dict iteration order.\n", "vec = model.extract(torch.arange(len(vocab)))\n", "vec = vec.detach().numpy()\n", "\n", "# Reduce the word vectors to two dimensions with PCA\n", "from sklearn.decomposition import PCA\n", "X_reduced = PCA(n_components=2).fit_transform(vec)" ] }, { "cell_type": "code", "execution_count": 107, "metadata": { "ExecuteTime": { "end_time": "2019-04-07T07:21:25.499695Z", "start_time": "2019-04-07T07:21:24.948708Z" }, "code_folding": [ 0 ], "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# 绘制所有单词向量的二维空间投影\n", "import matplotlib.pyplot as plt\n", "import matplotlib\n", "\n", "fig = plt.figure(figsize = (20, 10))\n", "ax = fig.gca()\n", "ax.set_facecolor('black')\n", "ax.plot(X_reduced[:, 0], X_reduced[:, 1], '.', markersize = 1, alpha = 0.4, color = 'white')\n", "# 绘制几个特殊单词的向量\n", "words = ['智子', '地球', '三体', '质子', '科学', '世界', '文明', '太空', '加速器', '平面', '宇宙', '信息']\n", "# 设置中文字体,否则无法在图形上显示中文\n", "zhfont1 = matplotlib.font_manager.FontProperties(fname='/Library/Fonts/华文仿宋.ttf', size = 35)\n", "for w in words:\n", " if w in word_to_idx:\n", " ind = word_to_idx[w][0]\n", " xy = X_reduced[ind]\n", " plt.plot(xy[0], xy[1], '.', alpha =1, color = 'red')\n", " plt.text(xy[0], xy[1], w, fontproperties = zhfont1, alpha = 1, color = 'white')" ] }, { "cell_type": "code", "execution_count": 109, "metadata": { "ExecuteTime": { "end_time": "2019-04-07T07:22:21.974225Z", "start_time": "2019-04-07T07:22:21.911757Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "['局部', '一场', '来', '错误', '一生', '正中', '航行', '地面', '只是', '政府']" ] }, "execution_count": 109, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 定义计算cosine相似度的函数\n", "import numpy as np\n", "def cos_similarity(vec1, vec2):\n", " norm1 = np.linalg.norm(vec1)\n", " norm2 = np.linalg.norm(vec2)\n", " norm = norm1 * norm2\n", " dot = np.dot(vec1, vec2)\n", " result = dot / norm if norm > 0 else 0\n", " return result\n", " \n", "# 在所有的词向量中寻找到与目标词(word)相近的向量,并按相似度进行排列\n", "def find_most_similar(word, vectors, word_idx):\n", " vector = vectors[word_to_idx[word][0]]\n", " simi = [[cos_similarity(vector, vectors[num]), key] for num, key in enumerate(word_idx.keys())]\n", " sort = sorted(simi)[::-1]\n", " words = [i[1] for i in sort]\n", " return words\n", "\n", "# 与智子靠近的词汇\n", "find_most_similar('智子', vec, word_to_idx)[:10]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { 
"slide_type": "slide" } }, "source": [ "# Gensim Word2vec " ] }, { "cell_type": "code", "execution_count": 110, "metadata": { "ExecuteTime": { "end_time": "2019-04-07T07:23:43.680735Z", "start_time": "2019-04-07T07:23:43.140121Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "import gensim as gensim\n", "from gensim.models import Word2Vec\n", "from gensim.models.keyedvectors import KeyedVectors\n", "from gensim.models.word2vec import LineSentence" ] }, { "cell_type": "code", "execution_count": 112, "metadata": { "ExecuteTime": { "end_time": "2019-04-07T07:24:20.881447Z", "start_time": "2019-04-07T07:24:18.860234Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "f = open(\"../data/三体.txt\", 'r')\n", "lines = []\n", "for line in f:\n", " temp = jieba.lcut(line)\n", " words = []\n", " for i in temp:\n", " #过滤掉所有的标点符号\n", " i = re.sub(\"[\\s+\\.\\!\\/_,$%^*(+\\\"\\'””《》]+|[+——!,。?、~@#¥%……&*():;‘]+\", \"\", i)\n", " if len(i) > 0:\n", " words.append(i)\n", " if len(words) > 0:\n", " lines.append(words)" ] }, { "cell_type": "code", "execution_count": 113, "metadata": { "ExecuteTime": { "end_time": "2019-04-07T07:24:42.626293Z", "start_time": "2019-04-07T07:24:41.970277Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "# 调用gensim Word2Vec的算法进行训练。\n", "# 参数分别为:size: 嵌入后的词向量维度;window: 上下文的宽度,min_count为考虑计算的单词的最低词频阈值\n", "model = Word2Vec(lines, size = 20, window = 2 , min_count = 0)" ] }, { "cell_type": "code", "execution_count": 115, "metadata": { "ExecuteTime": { "end_time": "2019-04-07T07:24:58.455724Z", "start_time": "2019-04-07T07:24:58.449700Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "[('地球', 0.9994192123413086),\n", " ('组织', 0.9993805289268494),\n", " ('文明', 0.9993024468421936),\n", " ('中', 0.9992458820343018),\n", " ('发展', 0.9992369413375854),\n", " ('的', 0.9992075562477112),\n", " ('与', 0.9990826845169067),\n", " ('社会', 
0.9990127682685852),\n", " ('一次', 0.9990046620368958),\n", " ('状态', 0.9989502429962158)]" ] }, "execution_count": 115, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.wv.most_similar('三体', topn = 10)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "ExecuteTime": { "end_time": "2019-04-23T03:59:03.438346Z", "start_time": "2019-04-23T03:59:03.263853Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], 
"source": [ "# 将词向量投影到二维空间\n", "rawWordVec = []\n", "word2ind = {}\n", "for i, w in enumerate(model.wv.vocab):\n", " rawWordVec.append(model[w])\n", " word2ind[w] = i\n", "rawWordVec = np.array(rawWordVec)\n", "X_reduced = PCA(n_components=2).fit_transform(rawWordVec)" ] }, { "cell_type": "code", "execution_count": 117, "metadata": { "ExecuteTime": { "end_time": "2019-04-07T07:26:15.113405Z", "start_time": "2019-04-07T07:26:14.643527Z" }, "code_folding": [ 10 ], "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# 绘制星空图\n", "# 绘制所有单词向量的二维空间投影\n", "fig = plt.figure(figsize = (15, 10))\n", "ax = fig.gca()\n", "ax.set_facecolor('black')\n", "ax.plot(X_reduced[:, 0], X_reduced[:, 1], '.', markersize = 1, alpha = 0.3, color = 'white')\n", "# 绘制几个特殊单词的向量\n", "words = ['智子', '地球', '三体', '质子', '科学', '世界', '文明', '太空', '加速器', '平面', '宇宙', '进展','的']\n", "# 设置中文字体,否则无法在图形上显示中文\n", "zhfont1 = matplotlib.font_manager.FontProperties(fname='/Library/Fonts/华文仿宋.ttf', size=26)\n", "for w in words:\n", " if w in word2ind:\n", " ind = word2ind[w]\n", " xy = X_reduced[ind]\n", " plt.plot(xy[0], xy[1], '.', alpha =1, color = 'red')\n", " plt.text(xy[0], xy[1], w, fontproperties = zhfont1, alpha = 1, color = 'yellow')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# END" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python [conda env:anaconda]", "language": "python", "name": "conda-env-anaconda-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.4" }, "latex_envs": { "LaTeX_envs_menu_present": true, "autoclose": false, "autocomplete": true, "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 1, "hotkeys": { "equation": "Ctrl-E", "itemize": "Ctrl-I" }, "labels_anchors": false, "latex_user_defs": false, "report_style_numbering": false, "user_envs_cfg": false }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": false, "sideBar": false, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": { "height": "517px", "left": "953px", "top": "134px", "width": "255px" }, "toc_section_display": false, "toc_window_display": 
false } }, "nbformat": 4, "nbformat_minor": 2 }