{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Using PCA to visualize the MtG universe\n", "\n", "In this notebook, we're going to scrape Magic: The Gathering's Gatherer card database and then perform principal component analysis (PCA) to visualize hidden relationships between cards. Our goal is to see how much of the card-to-card variation can be compressed into two dimensions and plotted, and what the resulting card groupings look like.\n", "\n", "\n", "\n", "This data set is very high-dimensional -- there are over 100 unique mechanics in the game, and the game state has many different elements (hand, battlefield, mana pool, etc.). Translating the 13,000 unique card texts into structured data is also a challenging NLP-related task." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Warning: This notebook is long...so, for the impatient:\n", "\n", "Here is what we will be working towards, *a programmatic mapping of every Magic card ever made* across two pseudo-axes:\n", "\n", "\n", "\n", "We will show that while Magic cards can differ in thousands of ways, they can be roughly categorized based on two simple measures: how \"creature-y\" are they, and how much do they relate to the board or non-board state?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Implementation details\n", "\n", "Pretty baller, right? We will interpret and grok this graph later, but for now, let's do this...\n", "\n", "*LEERRRROOYYY JENNNKIINNNNNSSS*." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Outline\n", "Here's a breakdown of the four steps that we'll go through to accomplish this task:\n", "1. 
Scrape + clean the data using `requests`, `web` from `pattern`, and `pandas`\n", "2. Extract features from the data using `fuzzywuzzy` and domain knowledge\n", "3. Perform and analyze PCA using `sklearn`\n", "4. Visualize + interpret results using the `plotly` graphing library" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*First some boring imports and settings (feel free to skip over)*" ] }, { "cell_type": "code", "execution_count": 172, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# boring imports\n", "\n", "%matplotlib inline\n", "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pylab as plt\n", "\n", "import requests\n", "from pattern import web\n", "requests.packages.urllib3.disable_warnings()\n", "\n", "import re, string\n", "from sets import Set\n", "from collections import Counter\n", "from fuzzywuzzy import fuzz\n", "\n", "database = {}\n", "pd.set_option('display.max_rows', 10)\n", "\n", "# Silly helper functions\n", "\n", "# True if s can be parsed as an integer (e.g. the '3' in a mana cost)\n", "def isInt(s):\n", "    try: \n", "        int(s)\n", "        return True\n", "    except ValueError:\n", "        return False\n", "\n", "# True if any value in l is an integer or one of the five color names\n", "def anyIntOrColor(l):\n", "    for val in l:\n", "        if isInt(val) or (val in ['Black', 'Red', 'Green', 'Blue', 'White']): return True\n", "    return False" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# (1) -- Scrape baby, scrape\n", "\n", "Our first order of business is scraping the data from the Gatherer database using `requests` and `web` from `pattern`. In its simplest form, every Magic card has a name, text, type, mana cost, and power/toughness (if it's a creature). An example is Hypnotic Specter, a powerful creature in the early days of Magic:\n", "\n", "\n", "\n", "To scrape the relevant card features, we will construct card URLs using the card's `multiverse_id` and then look on the loaded page for the unique HTML elements that correspond to each feature we want to obtain." 
] }, { "cell_type": "code", "execution_count": 173, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# grabCard scrapes:\n", "\n", "# name, types, text (lowercased), mana cost,\n", "# cmc, power and toughness, and rarity.\n", "\n", "# and adds it to the global card database\n", "\n", "def grabCard(multiverse_id):\n", "    url = \"http://gatherer.wizards.com/Pages/Card/Details.aspx?multiverseid=\" + str(multiverse_id)\n", "    dom = web.Element(requests.get(url).text)\n", "    \n", "    # card name, card type\n", "    cardName = dom('div.cardImage img')[0].attributes['alt'] if dom('div.cardImage img') else ''\n", "    cardType = [element.strip() for element in \\\n", "                dom('div#ctl00_ctl00_ctl00_MainContent_SubContent_SubContent_typeRow div.value')[0].content.split(u'\\u2014')]\n", "    \n", "    # extract, parse, clean text into a list\n", "    cardText = []\n", "    pattern = re.compile('[\\W_]+')\n", "    for line in dom('div.cardtextbox'):\n", "        for element in line:\n", "            cardText.append(element)\n", "    \n", "    # mana symbols are <img> tags: keep their alt text; lowercase everything else\n", "    for i in xrange(len(cardText)):\n", "        if cardText[i].type == 'element' and cardText[i].tag == 'img':\n", "            cardText[i] = cardText[i].attributes['alt']\n", "        else:\n", "            cardText[i] = str(cardText[i]).strip().lower()\n", "            pattern.sub('', cardText[i])  # NOTE: the result is discarded, so punctuation is kept\n", "    \n", "    # mana symbols\n", "    manaCost = [element.attributes['alt'] for element in dom('div#ctl00_ctl00_ctl00_MainContent_SubContent_SubContent_manaRow div.value img')]\n", "    cmc = int(dom('div#ctl00_ctl00_ctl00_MainContent_SubContent_SubContent_cmcRow div.value')[0].content.strip()) \\\n", "        if dom('div#ctl00_ctl00_ctl00_MainContent_SubContent_SubContent_cmcRow div.value') else np.nan\n", "    \n", "    # rarity\n", "    rarity = dom('div #ctl00_ctl00_ctl00_MainContent_SubContent_SubContent_rarityRow div.value span')[0].content.lower()\n", "    \n", "    # p/t\n", "    power = np.nan\n", "    power = [_.strip() for _ in dom('div#ctl00_ctl00_ctl00_MainContent_SubContent_SubContent_ptRow div.value')[0].content.split(' / ')][0] 
\\\n", "        if dom('div#ctl00_ctl00_ctl00_MainContent_SubContent_SubContent_ptRow div.value') else np.nan\n", "    # NaN never compares equal to anything, so test identity with `is not` rather than `!=`\n", "    power = float(power) if power != '*' and power is not np.nan else np.nan\n", "    toughness = [_.strip() for _ in dom('div#ctl00_ctl00_ctl00_MainContent_SubContent_SubContent_ptRow div.value')[0].content.split(' / ')][1] \\\n", "        if dom('div#ctl00_ctl00_ctl00_MainContent_SubContent_SubContent_ptRow div.value') else np.nan\n", "    toughness = float(toughness) if (toughness != '*' and toughness != '7-*' and toughness is not np.nan) else np.nan\n", "    \n", "    # add data\n", "    database[cardName] = {\n", "        'cardType' : cardType,\n", "        'cardText' : cardText,\n", "        'manaCost' : manaCost,\n", "        'cmc' : cmc,\n", "        'rarity': rarity,\n", "        'power' : power,\n", "        'toughness' : toughness\n", "    }" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Perform the scraping\n", "\n", "We'll iterate through a range of `multiverse_id`s to scrape a desired number of cards. Note that it takes around 1 minute per 500 `multiverse_id`s. Given that there are 13k+ cards (and multiple versions of each -- see below), we'll limit our scraping to the first ~600 `multiverse_id`s, starting with the very first Magic set: Alpha." ] }, { "cell_type": "code", "execution_count": 174, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Grabbed 100\n", "Grabbed 200\n", "Grabbed 300\n", "Grabbed 400\n", "Grabbed 500\n", "Done!\n" ] } ], "source": [ "cardsToScrape = 600\n", "\n", "# ids run from 1 to cardsToScrape - 1; report progress every 100 ids\n", "for i in xrange(1, cardsToScrape):\n", "    grabCard(i)\n", "    if (i % 100 == 0): print \"Grabbed \" + str(i)\n", "\n", "print \"Done!\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "At this point, we now have at most `cardsToScrape` cards and associated values in a local `dict` using the `cardName` as the key. 
(Note that we have fewer than `cardsToScrape` cards: we're iterating over `multiverse_id`s, some ids don't correspond to a card page, and cards reprinted under the same name overwrite the same `dict` key.)\n", "\n", "### Note for potential future work\n", "\n", "*There are other aspects represented on the Gatherer database, such as set and community ratings, but we leave these to future work. Annoyingly, a card printed in multiple sets has a different page (and consequently a different set of ratings) for each set; though this would require more work, it'd be super interesting if you could predict a card's community interest (# ratings) and favorability (average rating).*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Making the data usable\n", "\n", "We'll now put this into a `pandas` dataframe for cleaning, variable creation, and initial analysis/spot checking/understanding." ] }, { "cell_type": "code", "execution_count": 175, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "

| | toughness | power | cmc | rarity | cardType | cardText | manaCost | cardName |
|---|---|---|---|---|---|---|---|---|
| Air Elemental | 4 | 4 | 5 | uncommon | [Creature, Elemental] | [flying] | [3, Blue, Blue] | Air Elemental |
| Ancestral Recall | NaN | NaN | 1 | rare | [Instant] | [target player draws three cards.] | [Blue] | Ancestral Recall |
| Animate Artifact | NaN | NaN | 4 | uncommon | [Enchantment, Aura] | [enchant artifact, as long as enchanted artifa... | [3, Blue] | Animate Artifact |
| Animate Dead | NaN | NaN | 2 | uncommon | [Enchantment, Aura] | [enchant creature card in a graveyard, when an... | [1, Black] | Animate Dead |
| Animate Wall | NaN | NaN | 1 | rare | [Enchantment, Aura] | [enchant wall, enchanted wall can attack as th... | [White] | Animate Wall |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Winter Orb | NaN | NaN | 2 | rare | [Artifact] | [players can't untap more than one land during... | [2] | Winter Orb |
| Wooden Sphere | NaN | NaN | 1 | uncommon | [Artifact] | [whenever a player casts a green spell, you ma... | [1] | Wooden Sphere |
| Word of Command | NaN | NaN | 2 | rare | [Instant] | [look at target opponent's hand and choose a c... | [Black, Black] | Word of Command |
| Wrath of God | NaN | NaN | 4 | rare | [Sorcery] | [destroy all creatures. they can't be regenera... | [2, White, White] | Wrath of God |
| Zombie Master | 3 | 2 | 3 | rare | [Creature, Zombie] | [other zombie creatures have swampwalk., other... | [1, Black, Black] | Zombie Master |

296 rows × 8 columns
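
As a rough sketch of how a table like the one above could be built from the scraped `dict` (an assumption, not necessarily the exact cell used in this notebook): turn every entry into one row, use the card name as the index, and copy it into a `cardName` column. `toy_database` is a hypothetical stand-in for `database`, filled with two cards copied from the table.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the scraped `database` dict
# (values copied from the first two rows of the table above)
toy_database = {
    'Air Elemental': {
        'cardType': ['Creature', 'Elemental'],
        'cardText': ['flying'],
        'manaCost': ['3', 'Blue', 'Blue'],
        'cmc': 5,
        'rarity': 'uncommon',
        'power': 4.0,
        'toughness': 4.0,
    },
    'Ancestral Recall': {
        'cardType': ['Instant'],
        'cardText': ['target player draws three cards.'],
        'manaCost': ['Blue'],
        'cmc': 1,
        'rarity': 'rare',
        'power': np.nan,
        'toughness': np.nan,
    },
}

# One row per card: the dict keys become the index,
# the inner keys become the columns
names = list(toy_database.keys())
df = pd.DataFrame([toy_database[name] for name in names], index=names)
df['cardName'] = df.index

# Quick spot check: cards without power are the non-creatures
non_creatures = df[df['power'].isnull()]
```

Keeping list-valued columns such as `manaCost` as plain Python lists is convenient for spot checking, but they usually need to be expanded into numeric columns before any analysis.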

| color | | Artifact | Black | Blue | Green | Red | Variable Colorless | White | cmc | colorlessMana | mana_Black | mana_Blue | mana_Green | mana_Red | mana_White | power | toughness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Artifact | count | 62 | 62 | 62 | 62 | 62 | 62 | 62 | 47.000000 | 62.000000 | 62 | 62 | 62 | 62 | 62.00 | 5.000000 | 5.000000 |
| | mean | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 2.361702 | 1.790323 | 0 | 0 | 0 | 0 | 0.00 | 2.400000 | 5.000000 |
| | std | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1.673652 | 1.775397 | 0 | 0 | 0 | 0 | 0.00 | 2.302173 | 1.414214 |
| | min | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0.000000 | 0.000000 | 0 | 0 | 0 | 0 | 0.00 | 0.000000 | 3.000000 |
| | 25% | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1.000000 | 0.000000 | 0 | 0 | 0 | 0 | 0.00 | 0.000000 | 4.000000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| White | min | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1.000000 | 0.000000 | 0 | 0 | 0 | 0 | 1.00 | 1.000000 | 1.000000 |
| | 25% | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1.000000 | 0.000000 | 0 | 0 | 0 | 0 | 1.00 | 1.500000 | 1.000000 |
| | 50% | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 2.000000 | 1.000000 | 0 | 0 | 0 | 0 | 1.00 | 2.000000 | 2.000000 |
| | 75% | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 3.000000 | 1.000000 | 0 | 0 | 0 | 0 | 1.75 | 3.000000 | 4.500000 |
| | max | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 6.000000 | 3.000000 | 0 | 0 | 0 | 0 | 3.00 | 6.000000 | 6.000000 |

48 rows × 16 columns
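
The count/mean/std/min/quartile rows above are what a per-group `describe()` produces. Here is a minimal sketch of that kind of summary, under stated assumptions: `primary_color` is a hypothetical helper (first colored symbol in the mana cost, `'Artifact'` if there is none), and the eight cards and their costs are copied from the first table. The rule actually used to build the `color` column may well differ.

```python
import pandas as pd

COLORS = ['Black', 'Red', 'Green', 'Blue', 'White']

def primary_color(mana_cost):
    # Hypothetical rule: the first colored symbol decides the color;
    # a cost with no colored symbol is labeled 'Artifact'
    for symbol in mana_cost:
        if symbol in COLORS:
            return symbol
    return 'Artifact'

# Mana costs and converted mana costs copied from the first table
mana_costs = {
    'Air Elemental': ['3', 'Blue', 'Blue'],
    'Ancestral Recall': ['Blue'],
    'Animate Dead': ['1', 'Black'],
    'Zombie Master': ['1', 'Black', 'Black'],
    'Animate Wall': ['White'],
    'Wrath of God': ['2', 'White', 'White'],
    'Winter Orb': ['2'],
    'Wooden Sphere': ['1'],
}
cmc = {
    'Air Elemental': 5, 'Ancestral Recall': 1, 'Animate Dead': 2,
    'Zombie Master': 3, 'Animate Wall': 1, 'Wrath of God': 4,
    'Winter Orb': 2, 'Wooden Sphere': 1,
}

cards = pd.DataFrame({'cmc': pd.Series(cmc)})
cards['color'] = pd.Series(
    {name: primary_color(cost) for name, cost in mana_costs.items()}
)

# Summary statistics of cmc within each color group
summary = cards.groupby('color')['cmc'].describe()
```

Depending on the pandas version, `describe()` on a grouped frame either stacks the statistics under each group (as in the table above) or lays them out as columns, so the printed shape may differ while the numbers stay the same.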

| Primary Type | | toughness | power | cmc | colorlessMana | Variable Colorless | mana_Blue | Blue | mana_Black | Black | mana_Red | ... | Artifact | Creature | Enchantment | Instant | Land | Sorcery | basic land | common | rare | uncommon |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Artifact | count | 1 | 1 | 43.000000 | 43.000000 | 43 | 43 | 43 | 43.00 | 43.00 | 43 | ... | 43 | 43 | 43 | 43 | 43 | 43 | 43 | 43.00 | 43.000000 | 43.000000 |
| | mean | 6 | 3 | 2.116279 | 2.116279 | 0 | 0 | 0 | 0.00 | 0.00 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 | 0.604651 | 0.395349 |
| | std | NaN | NaN | 1.499354 | 1.499354 | 0 | 0 | 0 | 0.00 | 0.00 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 | 0.494712 | 0.494712 |
| | min | 6 | 3 | 0.000000 | 0.000000 | 0 | 0 | 0 | 0.00 | 0.00 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 | 0.000000 | 0.000000 |
| | 25% | 6 | 3 | 1.000000 | 1.000000 | 0 | 0 | 0 | 0.00 | 0.00 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 | 0.000000 | 0.000000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Sorcery | min | NaN | NaN | 1.000000 | 0.000000 | 0 | 0 | 0 | 0.00 | 0.00 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0.00 | 0.000000 | 0.000000 |
| | 25% | NaN | NaN | 1.250000 | 0.000000 | 0 | 0 | 0 | 0.00 | 0.00 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0.00 | 0.000000 | 0.000000 |
| | 50% | NaN | NaN | 2.000000 | 1.000000 | 0 | 0 | 0 | 0.00 | 0.00 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0.00 | 0.000000 | 0.000000 |
| | 75% | NaN | NaN | 3.000000 | 2.000000 | 1 | 0 | 0 | 0.75 | 0.75 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0.75 | 1.000000 | 0.750000 |
| | max | NaN | NaN | 4.000000 | 3.000000 | 1 | 3 | 1 | 3.00 | 1.00 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1.00 | 1.000000 | 1.000000 |

48 rows × 26 columns
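
The 0/1 columns in the table above (`Artifact`, `Creature`, ..., `common`, `rare`, `uncommon`) look like indicator variables for card type and rarity. One plausible way to build such columns, sketched here with `pd.get_dummies`, is shown below. Taking the first word of the first type entry as the "Primary Type" is an assumption for illustration, and the five cards are copied from the first table.

```python
import pandas as pd

# Types and rarities copied from the first table
card_types = {
    'Air Elemental': ['Creature', 'Elemental'],
    'Ancestral Recall': ['Instant'],
    'Animate Artifact': ['Enchantment', 'Aura'],
    'Wrath of God': ['Sorcery'],
    'Winter Orb': ['Artifact'],
}
rarity = pd.Series({
    'Air Elemental': 'uncommon',
    'Ancestral Recall': 'rare',
    'Animate Artifact': 'uncommon',
    'Wrath of God': 'rare',
    'Winter Orb': 'rare',
})

# Assumed rule: the primary type is the first word of the first type entry
primary = pd.Series(
    {name: types[0].split()[0] for name, types in card_types.items()},
    name='Primary Type',
)

# One 0/1 indicator column per type and per rarity
type_flags = pd.get_dummies(primary).astype(int)
rarity_flags = pd.get_dummies(rarity).astype(int)
features = pd.concat([primary, type_flags, rarity_flags], axis=1)

# The mean of an indicator within a group is a share,
# e.g. the fraction of rares per primary type
rare_share = features.groupby('Primary Type')['rare'].mean()
```

Read this way, a `rare` mean of 0.604651 in the Artifact block would mean that about 60% of the artifacts in the scraped sample are rare.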