{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Phonetic similarity lookup\n", "\n", "(Incomplete!)\n", "\n", "In a previous notebook, we discussed [how to quickly find words with meanings similar to other words](understanding-word-vectors.ipynb). In this notebook, I demonstrate how to find words that *sound like* other words.\n", "\n", "I'm going to make use of some of [my recent research](https://github.com/aparrish/phonetic-similarity-vectors) on phonetic similarity. The algorithm I made uses phoneme transcriptions from [the CMU pronouncing dictionary](http://www.speech.cs.cmu.edu/cgi-bin/cmudict) along with information about articulatory/acoustic features of those phonemes to produce vector representations of the *sound* of every word in the dictionary.\n", "\n", "In this notebook, I show how to build a fast approximate nearest-neighbor lookup of words by their phonetic similarity. Then I show a few potential applications in creative language generation using that lookup, plus a bit of vector arithmetic." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prerequisites\n", "\n", "You'll need the `numpy`, `scipy`, `spacy` and `simpleneighbors` packages to run this code, along with spaCy's `en_core_web_md` model." 
] }, { "cell_type": "code", "execution_count": 260, "metadata": {}, "outputs": [], "source": [ "import random\n", "from collections import defaultdict\n", "import numpy as np\n", "import spacy\n", "from simpleneighbors import SimpleNeighbors" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can download the phonetic similarity vectors using the following command:" ] }, { "cell_type": "code", "execution_count": 275, "metadata": {}, "outputs": [], "source": [ "!curl -L -s 'https://github.com/aparrish/phonetic-similarity-vectors/blob/master/cmudict-0.7b-simvecs?raw=true' >cmudict-0.7b-simvecs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading the data\n", "\n", "The vector file looks like this:\n", "\n", "    WARNER -1.178800 1.883123 -1.101779 -0.698869 -0.109708 -0.482693 -0.291353 1.179281 0.191032 -1.192597 -0.684268 -1.132983 0.072473 -0.626924 0.569412 -1.639735 -3.000464 -1.414111 1.806220 -1.075352 1.274347 -0.111253 0.675737 -0.579840 -1.111530 -0.960682 -1.664172 0.872162 1.311749 -0.182414 3.062428 -1.333462 1.375817 0.947289 1.699605 1.799368 2.434342 0.382153 0.383062 2.583699 -0.756335 1.862328 -0.189235 -2.033432 -0.609034 -0.782589 0.394311 -1.056266 -1.288209 0.055472" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [], "source": [ "word_vecs = []\n", "for line in open(\"./cmudict-0.7b-simvecs\", encoding='latin1'):\n", "    line = line.strip()\n", "    # split the word from its vector at the first run of whitespace\n", "    word, vec = line.split(None, 1)\n", "    # drop alternate-pronunciation markers such as \"(1)\"\n", "    word = word.rstrip('(0123)').lower()\n", "    vec = tuple(float(n) for n in vec.split())\n", "    word_vecs.append((word, vec))" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "133859" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(word_vecs)" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [], "source": [ "group_by_vec = defaultdict(list)\n", "for word, vec in word_vecs:\n", "    
group_by_vec[vec].append(word)" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "113694" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(group_by_vec)" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [], "source": [ "lookup = {}\n", "for word, vec in word_vecs:\n", "    if word in lookup:\n", "        continue\n", "    lookup[word] = np.array(vec)" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "125071" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(lookup)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "nlp = spacy.load('en_core_web_md')" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": [ "nns = SimpleNeighbors(50)\n", "# several words can share one vector, so index a single representative per vector\n", "for vec, words in group_by_vec.items():\n", "    sort_by_prob = sorted(words, key=lambda x: nlp.vocab[x].prob)\n", "    nns.add_one(sort_by_prob[0], vec)\n", "nns.build(50)" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['parrish',\n", " 'perished',\n", " 'parish',\n", " 'buresh',\n", " 'parrishes',\n", " \"paris'\",\n", " 'barrish',\n", " 'marish',\n", " 'cherish',\n", " 'perishing',\n", " 'garish',\n", " 'maresh']" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nns.nearest(lookup['parrish'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Random walk" ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "allison allinson allison's ilalis's alissa isolate oscillate ocelot assad facade futch sutch suss tussle solicitous solicits tussle tussles suss genesis janus dishon zisson sissom cynicism nissei taisei 
chace chaste tastes chests sets stet stetz test's pests pastes missteps misstates pastes misstates misstates allstate's tastes pastes mists mistrust trust's strutz constructs " ] } ], "source": [ "current = 'allison'\n", "for i in range(50):\n", "    print(current, end=\" \")\n", "    # skip the first result, which is usually the current word (or one sharing its vector)\n", "    current = random.choice(nns.nearest(lookup[current])[1:])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Replacement" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [], "source": [ "frost_doc = nlp(open(\"frost.txt\").read())" ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "thuy lodes diverging inning eh colello woodward,\n", "unbend sarra eh toogood knoche travels boeve\n", "ende gyi awan traveller, lall aah stowed\n", "edmond tookes downtime urwin ass farb ige ee couldn't\n", "khuu hewell h. belt engh leitha undergrowth;\n", "\n", "then cookout the futher, as juts eye's fier,\n", "anand misbehaving hypercard uther geter claymore,\n", "because h. wah's glassey edmond wounded beware;\n", "xio ahs four's jass yother gassing geniere\n", "ahead whorl jim relay abide uther simm,\n", "\n", "edmond goethe that norling equality loye\n", "in. reeves' mono steppes hedge janardhan brakke.\n", "ayo, eh speck rather thirst form otherness deady!\n", "whet renewing haugh woy needs amman tucci byway,\n", "aux undoubted f. 
oooh shooed ivor cupp gapp.\n", "\n", "uhh schill bedke tailing matthes whiz oooh thigh\n", "somewheres gauges ende eases rench:\n", "toupee inroads diverged inning aue woodis, unland I—\n", "aigner put leitha one letsch raveled baye,\n", "odland sajak has mabe lall the difference.\n", "\n" ] } ], "source": [ "output = []\n", "for word in frost_doc:\n", "    if word.text.lower() in lookup:\n", "        new_word = random.choice(nns.nearest(lookup[word.text.lower()]))\n", "        output.append(new_word)\n", "    else:\n", "        output.append(word.text)\n", "    output.append(word.whitespace_)\n", "print(''.join(output))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tinting sound" ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [], "source": [ "frost_doc = nlp(open(\"frost.txt\").read())" ] }, { "cell_type": "code", "execution_count": 87, "metadata": {}, "outputs": [], "source": [ "tint_word = 'soap'\n", "tint_vec = lookup[tint_word]\n", "tint_factor = 0.4" ] }, { "cell_type": "code", "execution_count": 88, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "tope rogues survivor's in. aue yellow woodwork,\n", "anand sa aw kote topknot travel busch\n", "england soapy urwin travenol, nall aw stowed\n", "oakland choke downe youn ahs fart edge aue tooke\n", "tope wickware chipote ghent ein posa choke;\n", "\n", "them choke ertha suther, ige justo eye's serr,\n", "umland heavy soper otha mater claymore,\n", "soco's schoepf swatch grassi unland footnoted swimwear;\n", "zhou aase fornoff that judge psychopath gehr\n", "hieb sworn zemke nilly taub ertha simm,\n", "\n", "earned putsch zag phoning emotionally sope\n", "innate reeves' lomonaco stake heid radant black.\n", "oooh, oie kepp judge scherf fuhr another's jade!\n", "hait lowing hah wye ladd's nahm touche erway,\n", "i. 
soot efface oooh photo's ever upham tabak.\n", "\n", "aw chalet beebe sellick this' which ee saye\n", "footware outages anand rages hentz:\n", "souk rototilles sope ame i. woodwork, earned I—\n", "ae cooke schoepf one selloff sope bip,\n", "odland jap hass mib auld soak referenced.\n", "\n" ] } ], "source": [ "output = []\n", "for word in frost_doc:\n", "    if word.text.lower() in lookup:\n", "        vec = lookup[word.text.lower()]\n", "        # blend the word's own sound vector with the tint word's vector\n", "        target_vec = (vec * (1-tint_factor)) + (tint_vec * tint_factor)\n", "        new_word = random.choice(nns.nearest(target_vec))\n", "        output.append(new_word)\n", "    else:\n", "        output.append(word.text)\n", "    output.append(word.whitespace_)\n", "print(''.join(output))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Picking synonyms based on sound" ] }, { "cell_type": "code", "execution_count": 98, "metadata": {}, "outputs": [], "source": [ "from scipy.spatial.distance import cosine" ] }, { "cell_type": "code", "execution_count": 100, "metadata": {}, "outputs": [], "source": [ "def cosine_similarity(a, b):\n", "    # scipy's cosine() returns a distance and expects 1-D arrays\n", "    return 1 - cosine(a, b)" ] }, { "cell_type": "code", "execution_count": 101, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.9746318461970761" ] }, "execution_count": 101, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cosine_similarity(np.array([1,2,3]), np.array([4,5,6]))" ] }, { "cell_type": "code", "execution_count": 89, "metadata": {}, "outputs": [], "source": [ "semantic_nns = SimpleNeighbors(300)\n", "for item in nlp.vocab:\n", "    if item.has_vector and item.prob > -15 and item.is_lower:\n", "        semantic_nns.add_one(item.text, item.vector)\n", "semantic_nns.build(50)" ] }, { "cell_type": "code", "execution_count": 255, "metadata": {}, "outputs": [], "source": [ "def soundalike_synonym(word, target_vec, n=5):\n", "    return sorted(\n", "        [item for item in semantic_nns.nearest(nlp.vocab[word].vector, 50) if item in lookup],\n", "        key=lambda x: cosine_similarity(target_vec, lookup[x]), 
reverse=True)[:n]" ] }, { "cell_type": "code", "execution_count": 257, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['chimp', 'hippo', 'toad', 'platypus', 'shark']" ] }, "execution_count": 257, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soundalike_synonym('mastodon', lookup['soap'])" ] }, { "cell_type": "code", "execution_count": 168, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['velociraptor', 'dinosaur', 'caveman', 'dino', 'skeleton']" ] }, "execution_count": 168, "metadata": {}, "output_type": "execute_result" } ], "source": [ "semantic_nns.nearest(nlp.vocab['mastodon'].vector, 5)" ] }, { "cell_type": "code", "execution_count": 259, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "fog → grille\n", "willingly → gladly\n", "tolerates → gravitate\n", "casino → grand\n", "farmland → graze\n", "micromanage → discretionary\n", "grok → query\n", "crappy → crap\n", "arguably → greatest\n", "naughty → brunette\n", "prior → preceding\n", "encountered → initially\n", "dandruff → tanning\n", "gendered → transcends\n", "airborne → aircraft\n", "natures → glamour\n" ] } ], "source": [ "target_vec = lookup['green']\n", "words = random.sample(semantic_nns.corpus, 16)\n", "for item in words:\n", " print(item, \"→\", soundalike_synonym(item, target_vec, 1)[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Soundalike synonym replacement" ] }, { "cell_type": "code", "execution_count": 131, "metadata": {}, "outputs": [], "source": [ "frost_doc = nlp(open(\"frost.txt\").read())" ] }, { "cell_type": "code", "execution_count": 151, "metadata": {}, "outputs": [], "source": [ "target_word = 'soap'\n", "target_vec = lookup[target_word]" ] }, { "cell_type": "code", "execution_count": 152, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Two motorists emerged in a silver spruce,\n", "And sorry I could not tours both\n", "And not one oasis, long I 
looked\n", "And thought down one as far as I not\n", "To where it crook in the foliage;\n", "\n", "Then stopped the particular, as just as but,\n", "And thought perhaps the make purported,\n", "Because it did tree and chose duds;\n", "Though as for that the turning there\n", "took pajama them really about the not,\n", "\n", "And both that night equally stood\n", "In fig no take took strut violet.\n", "Oh, I stopped the same for another summer!\n", "Yet knows how it resulted on to take,\n", "I admit if I not ever say back.\n", "\n", "I abide also surprised this with a sigh\n", "Somewhere child and mos hence:\n", "Two crossing transformed in a laminate, and I—\n", "I saw the same less embark by,\n", "And that could it most the because.\n", "\n" ] } ], "source": [ "output = []\n", "for word in frost_doc:\n", "    if word.is_alpha \\\n", "            and word.pos_ in ('NOUN', 'VERB', 'ADJ') \\\n", "            and word.text.lower() in lookup:\n", "        new_word = random.choice(soundalike_synonym(word.text.lower(), target_vec))\n", "        output.append(new_word)\n", "    else:\n", "        output.append(word.text)\n", "    output.append(word.whitespace_)\n", "print(''.join(output))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.8" } }, "nbformat": 4, "nbformat_minor": 2 }