{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Doc2vec from scratch in PyTorch\n", "===============================\n", "\n", "Here we implement this useful algorithm with a library we know and trust. With luck this will be more accessible than reading the papers but more in-depth than typical \"install gensim and just do what I say\" tutorials, and still easy to understand for anyone whose maths skills have atrophied to nothing (like me). This is all based on the great work by [Nejc Ilenic](https://github.com/inejc/paragraph-vectors), on the papers referenced there, and on gensim's source.\n", "\n", "`doc2vec` descends from `word2vec`, which in its basic form is a model trained to predict the missing word in a context. Given a sentence like \"the cat ___ on the mat\", it should predict \"sat\", and in doing so learn a useful representation of words. We can then extract the internal weights and re-use them as \"word embeddings\": vectors giving each word a position in N-dimensional space that is hopefully close to similar words and an appropriate distance from related words.\n", "\n", "`doc2vec`, or \"Paragraph vectors\", extends the `word2vec` idea by simply adding a document id to each context. This helps the network learn associations between contexts and produces vectors that position each paragraph (document) in space." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First we need to load the data. We'll begin by overfitting on a tiny dataset just to check all the parts fit together." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| | text | tokens |
|---|---|---|
| 0 | In the week before their departure to Arrakis, when all the final scurrying about had reached a ... | [in, the, week, before, their, departure, to, arrakis, when, all, the, final, scurrying, about, ... |
| 1 | It was a warm night at Castle Caladan, and the ancient pile of stone that had served the Atreide... | [it, was, a, warm, night, at, castle, caladan, and, the, ancient, pile, of, stone, that, had, se... |
| 2 | The old woman was let in by the side door down the vaulted passage by Paul's room and she was al... | [the, old, woman, was, let, in, by, the, side, door, down, the, vaulted, passage, by, paul, room... |
| 3 | By the half-light of a suspensor lamp, dimmed and hanging near the floor, the awakened boy could... | [by, the, half, light, of, a, suspensor, lamp, dimmed, and, hanging, near, the, floor, the, awak... |

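
The introduction describes the training task as predicting a missing word from its context, with a document id added to each context. Given token lists like those in the table above, that idea can be sketched as follows. This is only an illustration: the function name `training_examples`, the symmetric `window` parameter, and the choice to skip positions without a full window are assumptions, not necessarily what the notebook does.

```python
def training_examples(docs, window=2):
    """Yield (doc_id, context_words, target_word) triples.

    Hypothetical helper: `docs` is a list of token lists, and `window`
    is the number of context words taken on each side of the target.
    Positions too close to either end of a document are skipped.
    """
    for doc_id, tokens in enumerate(docs):
        for i in range(window, len(tokens) - window):
            # Words before and after position i; the word at i is the target.
            context = tokens[i - window:i] + tokens[i + 1:i + window + 1]
            yield doc_id, context, tokens[i]


# Toy input using the example sentence from the introduction.
docs = [["the", "cat", "sat", "on", "the", "mat"]]
examples = list(training_examples(docs, window=2))
for example in examples:
    print(example)
```

Each triple pairs a document id with the words surrounding a gap and the word that fills it, which is the extra signal that lets a model learn one vector per document alongside the word vectors.
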
| | text | tokens | length | clean_tokens | clean_length |
|---|---|---|---|---|---|
| 0 | In the week before their departure to Arrakis, when all the final scurrying about had reached a ... | [in, the, week, before, their, departure, to, arrakis, when, all, the, final, scurrying, about, ... | 32 | [in, the, week, before, their, departure, to, arrakis, when, all, the, final, scurrying, about, ... | 32 |
| 1 | It was a warm night at Castle Caladan, and the ancient pile of stone that had served the Atreide... | [it, was, a, warm, night, at, castle, caladan, and, the, ancient, pile, of, stone, that, had, se... | 39 | [it, was, a, warm, night, at, castle, caladan, and, the, ancient, pile, of, stone, that, had, se... | 39 |
| 2 | The old woman was let in by the side door down the vaulted passage by Paul's room and she was al... | [the, old, woman, was, let, in, by, the, side, door, down, the, vaulted, passage, by, paul, room... | 34 | [the, old, woman, was, let, in, by, the, side, door, down, the, vaulted, passage, by, paul, room... | 34 |
| 3 | By the half-light of a suspensor lamp, dimmed and hanging near the floor, the awakened boy could... | [by, the, half, light, of, a, suspensor, lamp, dimmed, and, hanging, near, the, floor, the, awak... | 53 | [by, the, half, light, of, a, suspensor, lamp, dimmed, and, hanging, near, the, floor, the, awak... | 53 |

| | scores | loss |
|---|---|---|
| 0 | [1, -1, -1, -1] | tensor(1.2530) |
| 1 | [0.5, -1, -1, -1] | tensor(1.4139) |
| 2 | [0, -1, -1, -1] | tensor(1.6329) |
| 3 | [0, 0, 0, 0] | tensor(2.7726) |
| 4 | [0, 0, 0, 1] | tensor(3.3927) |
| 5 | [0, 1, 1, 1] | tensor(4.6329) |
| 6 | [0.5, 1, 1, 1] | tensor(4.4139) |
| 7 | [1, 1, 1, 1] | tensor(4.2530) |

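
The losses in the table above appear consistent with a negative-sampling loss in which the first score belongs to the true word and the remaining scores belong to sampled noise words. The sketch below assumes that reading and uses only the standard library, so the function names `log_sigmoid` and `negative_sampling_loss` are illustrative rather than taken from the notebook. In PyTorch the same quantity could be computed on tensors with `torch.nn.functional.logsigmoid`.

```python
import math


def log_sigmoid(x):
    """log(sigmoid(x)), written as -log(1 + exp(-x))."""
    return -math.log1p(math.exp(-x))


def negative_sampling_loss(scores):
    """Loss for one example, assuming scores[0] is the true word's score
    and scores[1:] are the scores of the sampled noise words."""
    positive, negatives = scores[0], scores[1:]
    # Reward a high score for the true word and low scores for the noise words.
    return -(log_sigmoid(positive) + sum(log_sigmoid(-s) for s in negatives))


# The score vectors from the table above.
table_scores = [
    [1, -1, -1, -1],
    [0.5, -1, -1, -1],
    [0, -1, -1, -1],
    [0, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 1, 1, 1],
    [0.5, 1, 1, 1],
    [1, 1, 1, 1],
]
losses = [negative_sampling_loss(s) for s in table_scores]
for scores, loss in zip(table_scores, losses):
    print(scores, round(loss, 4))
```

Under this assumption the loss is lowest when the true word scores high and the noise words score low, and it rises as the noise scores rise, which matches the ordering in the table.
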
| | doc_id | similarity | text |
|---|---|---|---|
| 1 | 1 | 1.000000 | It was a warm night at Castle Caladan, and the ancient pile of stone that had served the Atreide... |
| 0 | 0 | 0.177416 | In the week before their departure to Arrakis, when all the final scurrying about had reached a ... |
| 3 | 3 | 0.081760 | By the half-light of a suspensor lamp, dimmed and hanging near the floor, the awakened boy could... |
| 2 | 2 | -0.044768 | The old woman was let in by the side door down the vaulted passage by Paul's room and she was al... |

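
A ranking like the one above, where the query document scores 1.0 against itself and other scores can be negative, is what cosine similarity between document vectors would produce. The sketch below assumes cosine similarity is the measure used. The names `cosine_similarity` and `most_similar` are hypothetical, and the vectors are made-up toy values, not the notebook's learned embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(doc_vectors, doc_id, top_n=None):
    """Return (doc_id, similarity) pairs sorted from most to least similar.

    The query document itself is included, so it ranks first with a
    similarity of 1.0. With top_n=None all documents are returned.
    """
    query = doc_vectors[doc_id]
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(doc_vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy two-dimensional "document vectors" (illustrative values only).
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
ranking = most_similar(toy_vectors, doc_id=0)
for doc, similarity in ranking:
    print(doc, round(similarity, 6))
```

Cosine similarity ignores vector length and compares direction only, so documents are ranked by how closely their vectors point the same way.
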
| | text | tokens | group |
|---|---|---|---|
| 0 | Claxton hunting first major medal British hurdler Sarah Claxton is confident she can win her fi... | [claxton, hunting, first, major, medal, british, hurdler, sarah, claxton, is, confident, she, ca... | sport |
| 1 | O'Sullivan could run in Worlds Sonia O'Sullivan has indicated that she would like to participat... | [could, run, in, worlds, sonia, has, indicated, that, she, would, like, to, participate, in, nex... | sport |
| 2 | Greene sets sights on world title Maurice Greene aims to wipe out the pain of losing his Olympi... | [greene, sets, sights, on, world, title, maurice, greene, aims, to, wipe, out, the, pain, of, lo... | sport |
| 3 | IAAF launches fight against drugs The IAAF - athletics' world governing body - has met anti-dop... | [iaaf, launches, fight, against, drugs, the, iaaf, athletics, world, governing, body, has, met, ... | sport |

| | doc_id | similarity | text |
|---|---|---|---|
| 0 | 0 | 1.000000 | Claxton hunting first major medal British hurdler Sarah Claxton is confident she can win her fi... |
| 37 | 37 | 0.504319 | Radcliffe proves doubters wrong This won't go down as one of the greatest marathons of Paula's ... |
| 41 | 41 | 0.499603 | Radcliffe enjoys winning comeback Paula Radcliffe made a triumphant return to competitive runni... |
| 1545 | 1545 | 0.499484 | Search wars hit desktop PCs Another front in the on-going battle between Microsoft and Google i... |
| 1266 | 1266 | 0.490500 | Student 'inequality' exposed Teenagers from well-off backgrounds are six times more likely to g... |
| 19 | 19 | 0.442955 | Edwards tips Idowu for Euro gold World outdoor triple jump record holder and BBC pundit Jonatha... |
| 348 | 348 | 0.430447 | Italy aim to rattle England Italy coach John Kirwan believes his side can upset England as the ... |
| 251 | 251 | 0.429918 | Ferguson rues failure to cut gap Boss Sir Alex Ferguson was left ruing Manchester United's fail... |
| 24 | 24 | 0.429485 | El Guerrouj targets cross country Double Olympic champion Hicham El Guerrouj is set to make a r... |
| 464 | 464 | 0.412518 | Henin-Hardenne beaten on comeback Justine Henin-Hardenne lost to Elena Dementieva in a comeback... |