{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# geoRx: Geoscience prescription\n",
    "\n",
    "This notebook accompanies\n",
    "\n",
    "> Hall, M (2017). Three data analytics party tricks. _The Leading Edge_ **36** (3).\n",
    "\n",
    "Inspired by and/or based on [**science concierge**](https://github.com/titipata/science_concierge) and [**Chris Clark's repo**](https://github.com/groveco/content-engine) on content-based recommendation.\n",
    "\n",
    "A version of this code is running at [georx.geosci.ai](http://georx.geosci.ai), where you can try it out.\n",
    "\n",
    "\n",
    "## Load data\n",
    "\n",
    "This dataset consists of 1000 randomly selected articles from the journal _Geophysics_, published between 1936 and 2016. It represents about 10% of the total number of articles published in that time. It was collected from seg.org with permission, and processed into a CSV file of titles, abstracts, and DOIs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "df = pd.read_csv('data/title_abstract_doi.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>abstract</th>\n",
       "      <th>doi</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Magnetic And Gravity Anomaly Patterns Related ...</td>\n",
       "      <td>A study of the features of gravity and magneti...</td>\n",
       "      <td>10.1190/1.1444192</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Inversion For Permeability Distribution From T...</td>\n",
       "      <td>Understanding reservoir properties plays a key...</td>\n",
       "      <td>10.1190/geo2014-0203.1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Quantifying Background Magnetic-Field Inhomoge...</td>\n",
       "      <td>Nuclear magnetic resonance measurements provid...</td>\n",
       "      <td>10.1190/geo2012-0488.1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Families Of Salt Domes In The Gulf Coastal Pro...</td>\n",
       "      <td>If two fluids of different densities are super...</td>\n",
       "      <td>10.1190/1.1439806</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Attribute-Guided Well-Log Interpolation Applie...</td>\n",
       "      <td>Several approaches exist to use trends in 3D s...</td>\n",
       "      <td>10.1190/1.2996302</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                               title  \\\n",
       "0  Magnetic And Gravity Anomaly Patterns Related ...   \n",
       "1  Inversion For Permeability Distribution From T...   \n",
       "2  Quantifying Background Magnetic-Field Inhomoge...   \n",
       "3  Families Of Salt Domes In The Gulf Coastal Pro...   \n",
       "4  Attribute-Guided Well-Log Interpolation Applie...   \n",
       "\n",
       "                                            abstract                     doi  \n",
       "0  A study of the features of gravity and magneti...       10.1190/1.1444192  \n",
       "1  Understanding reservoir properties plays a key...  10.1190/geo2014-0203.1  \n",
       "2  Nuclear magnetic resonance measurements provid...  10.1190/geo2012-0488.1  \n",
       "3  If two fluids of different densities are super...       10.1190/1.1439806  \n",
       "4  Several approaches exist to use trends in 3D s...       10.1190/1.2996302  "
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1000"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Recommender\n",
    "\n",
    "This is a simple class (to learn more about classes, read up on object-oriented programming). It has four methods (functions) that implement the workflow:\n",
    "\n",
    "- `__init__()`: Instantiates the class with some 'hyperparameters'.\n",
    "- `_preprocess()`: Does some basic preprocessing (tokenizing and stemming) on the abstracts.\n",
    "- `fit()`: Constructs the model, which consists of two main pieces:\n",
    "  - A 100-dimensional 'semantic' space, in which each document is a point with 100 coordinates.\n",
    "  - A spatial index (a k-d tree) over those points, so we can find an article's nearest neighbours quickly.\n",
    "- `recommend()`: Takes a list of 'liked' articles, finds their midpoint (mean) in the semantic space, and looks up the closest articles to that midpoint."
   ]
  },
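  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The two pieces that `fit()` builds, and the midpoint query that `recommend()` makes, can be sketched on a toy corpus. This is only a minimal sketch under stated assumptions, not the class itself: the six one-line 'abstracts' are made up for illustration, the space has 2 dimensions instead of 100, and there is no stemming step.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "from sklearn.pipeline import make_pipeline\n",
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "from sklearn.decomposition import TruncatedSVD\n",
    "from sklearn.preprocessing import Normalizer\n",
    "from sklearn.neighbors import KDTree\n",
    "\n",
    "# Hypothetical toy corpus: three 'seismic' docs, then three 'potential field' docs.\n",
    "docs = [\n",
    "    'seismic wave velocity inversion',\n",
    "    'seismic wave amplitude analysis',\n",
    "    'seismic velocity model inversion',\n",
    "    'gravity anomaly basin survey',\n",
    "    'gravity anomaly magnetic survey',\n",
    "    'magnetic anomaly basin mapping',\n",
    "]\n",
    "\n",
    "# TF-IDF, then SVD down to 2 dimensions, then scale each row to unit length.\n",
    "lsa = make_pipeline(TfidfVectorizer(),\n",
    "                    TruncatedSVD(n_components=2, random_state=0),\n",
    "                    Normalizer(copy=False))\n",
    "X = lsa.fit_transform(docs)  # One row (point) per document.\n",
    "\n",
    "# Spatial index for fast nearest-neighbour queries.\n",
    "tree = KDTree(X, metric='euclidean')\n",
    "\n",
    "# 'Like' docs 0 and 1, query at their midpoint, and ask for 3 neighbours.\n",
    "likes = [0, 1]\n",
    "q = X[likes].mean(axis=0).reshape(1, -1)\n",
    "dist, idx = tree.query(q, k=3)\n",
    "print(idx[0], dist[0])\n",
    "```\n",
    "\n",
    "Documents on the same topic should land close together in the reduced space, so the neighbours of the midpoint would be expected to come mostly from the 'seismic' group. Note that the query returns the liked documents themselves too, which is why `recommend()` has to filter them out."
   ]
  },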
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "from nltk.stem.porter import PorterStemmer\n",
    "from nltk.tokenize import RegexpTokenizer\n",
    "\n",
    "from sklearn.pipeline import make_pipeline\n",
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "from sklearn.decomposition import TruncatedSVD\n",
    "from sklearn.preprocessing import Normalizer\n",
    "from sklearn.neighbors import BallTree, KDTree\n",
    "\n",
    "STEMMER = PorterStemmer()\n",
    "TOKENIZER = RegexpTokenizer(r'\\w+')\n",
    "\n",
    "class ContentRx(object):\n",
    "    \"\"\"\n",
    "    A simple class to implement a scikit-learn-like API,\n",
    "    and to hold the data.\n",
    "    \"\"\"\n",
    "    def __init__(self,\n",
    "                 components=100,\n",
    "                 return_scores=True,\n",
    "                 metric='euclidean',\n",
    "                 centroid='median',   # NB: stored but unused; recommend() always takes the mean\n",
    "                 ngram_range=(1,2),   # Can be very slow above (1,2)\n",
    "                 ignore_fewer_than=1, # ignore terms found in fewer docs than this\n",
    "                 ):\n",
    "        self.components = components\n",
    "        self.return_scores = return_scores\n",
    "        self.centroid = centroid\n",
    "        self.metric = metric\n",
    "        self.ngram_range = ngram_range\n",
    "        self.ignore_fewer_than = ignore_fewer_than\n",
    "\n",
    "    def _preprocess(self, text):\n",
    "        \"\"\"\n",
    "        Stem and tokenize a piece of text (e.g. an abstract).\n",
    "        \"\"\"\n",
    "        out = [STEMMER.stem(token) for token in TOKENIZER.tokenize(text)]\n",
    "        return ' '.join(out)\n",
    "\n",
    "    def fit(self, data):\n",
    "        \"\"\"\n",
    "        Algorithm for latent semantic analysis:\n",
    "        * Create a tf-idf matrix (e.g. unigrams and bigrams) from the docs.\n",
    "        * Reduce it to `components` dimensions with truncated SVD, then normalize.\n",
    "        * Build a KD-tree on the result for fast nearest-neighbour queries.\n",
    "        \"\"\"\n",
    "        data = [self._preprocess(item) for item in data]\n",
    "\n",
    "        # Build LSA pipeline: TF-IDF then normalized SVD reduction.\n",
    "        tfidf = TfidfVectorizer(ngram_range=self.ngram_range,\n",
    "                                min_df=self.ignore_fewer_than,\n",
    "                                stop_words='english',\n",
    "                                )\n",
    "        svd = TruncatedSVD(n_components=self.components)\n",
    "        normalize = Normalizer(copy=False)\n",
    "        lsa = make_pipeline(tfidf, svd, normalize)\n",
    "        self.X = lsa.fit_transform(data)\n",
    "\n",
    "        # Build and store distance tree.\n",
    "        # metrics: see KDTree.valid_metrics\n",
    "        self.tree = KDTree(self.X, metric=self.metric)\n",
    "\n",
    "        return\n",
    "\n",
    "    def recommend(self, likes, n_recommend=10):\n",
    "        \"\"\"\n",
    "        Makes a recommendation.\n",
    "        \"\"\"\n",
    "        # Make the query from the input document idxs.\n",
    "        # Science Concierge uses the Rocchio algorithm, but I don't think\n",
    "        # I care about 'dislikes', so the query is just the mean of the likes.\n",
    "        vecs = np.array([self.X[idx] for idx in likes])\n",
    "        q = np.mean(vecs, axis=0).reshape(1, -1)\n",
    "\n",
    "        # Get the matches and their distances.\n",
    "        dist, idx = self.tree.query(q, k=n_recommend+len(likes))\n",
    "\n",
    "        # Get rid of the original likes, which may or may not be in the result.\n",
    "        ind, dist = zip(*[(i, d)\n",
    "                          for d, i in zip(np.squeeze(dist), np.squeeze(idx))\n",
    "                          if i not in likes])\n",
    "\n",
    "        # Trim to n_recommend: if the likes weren't in the result,\n",
    "        # this removes the most distant results.\n",
    "        if self.return_scores:\n",
    "            return list(ind)[:n_recommend], list(1 - np.array(dist))[:n_recommend]\n",
    "        return list(ind)[:n_recommend]\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Instantiate the model:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "crx = ContentRx(ngram_range=(1,2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Train the model by fitting to our dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "crx.fit(df.abstract)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The model is trained!\n",
    "\n",
    "## Make recommendations\n",
    "\n",
    "First, let's find some papers we like. (Remember this is only a subset of 1000 papers.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[79, 127]"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "s = [i for i, t in enumerate(df.title) if 'spectral decomp' in t.lower()]\n",
    "s"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "('Seismic Spectral Decomposition Using Deconvolutive Short-Time Fourier Transform Spectrogram',\n",
       " 'Maximum Entropy Spectral Decomposition Of A Seismogram Into Its Minimum Entropy Component Plus Noise')"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.title[79], df.title[127]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can get our recommendations:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "idx, scores = crx.recommend(likes=s, n_recommend=10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[737, 257, 718, 164, 863, 252, 721, 642, 766, 355]"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "idx"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>abstract</th>\n",
       "      <th>doi</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>737</th>\n",
       "      <td>Empirical Mode Decomposition For Seismic Time-...</td>\n",
       "      <td>Time-frequency analysis plays a significant ro...</td>\n",
       "      <td>10.1190/geo2012-0199.1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>257</th>\n",
       "      <td>Choice Of Operator Length For Maximum Entropy ...</td>\n",
       "      <td>Empirical evidence based on maximum entropy sp...</td>\n",
       "      <td>10.1190/1.1440902</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>718</th>\n",
       "      <td>Reservoir Characterization Based On Seismic Sp...</td>\n",
       "      <td>The seismic frequency spectrum provides a usef...</td>\n",
       "      <td>10.1190/geo2011-0323.1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>164</th>\n",
       "      <td>Seismic Sequence Analysis And Attribute Extrac...</td>\n",
       "      <td>The variation of frequency content of a seismi...</td>\n",
       "      <td>10.1190/1.1487136</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>863</th>\n",
       "      <td>Ergodicity Of Stationary White Gaussian Processes</td>\n",
       "      <td>Stationary time series is an important concept...</td>\n",
       "      <td>10.1190/1.1444502</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>252</th>\n",
       "      <td>Sparse Time-Frequency Representation For Seism...</td>\n",
       "      <td>Attenuation of random noise is a major concern...</td>\n",
       "      <td>10.1190/geo2015-0341.1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>721</th>\n",
       "      <td>Maximum‐Entropy Spatial Processing Of Array Data</td>\n",
       "      <td>The procedure of maximum‐entropy spectral anal...</td>\n",
       "      <td>10.1190/1.1440471</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>642</th>\n",
       "      <td>Predictive Deconvolution And The Zero‐Phase So...</td>\n",
       "      <td>Predictive deconvolution is commonly applied t...</td>\n",
       "      <td>10.1190/1.1441674</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>766</th>\n",
       "      <td>Spectrum Of The Potential Field Due To Randoml...</td>\n",
       "      <td>Covariance and spectral density functions of a...</td>\n",
       "      <td>10.1190/1.1439933</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>355</th>\n",
       "      <td>Theory Of Nonstationary Linear Filtering In Th...</td>\n",
       "      <td>A general linear theory describes the extensio...</td>\n",
       "      <td>10.1190/1.1444318</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                 title  \\\n",
       "737  Empirical Mode Decomposition For Seismic Time-...   \n",
       "257  Choice Of Operator Length For Maximum Entropy ...   \n",
       "718  Reservoir Characterization Based On Seismic Sp...   \n",
       "164  Seismic Sequence Analysis And Attribute Extrac...   \n",
       "863  Ergodicity Of Stationary White Gaussian Processes   \n",
       "252  Sparse Time-Frequency Representation For Seism...   \n",
       "721   Maximum‐Entropy Spatial Processing Of Array Data   \n",
       "642  Predictive Deconvolution And The Zero‐Phase So...   \n",
       "766  Spectrum Of The Potential Field Due To Randoml...   \n",
       "355  Theory Of Nonstationary Linear Filtering In Th...   \n",
       "\n",
       "                                              abstract                     doi  \n",
       "737  Time-frequency analysis plays a significant ro...  10.1190/geo2012-0199.1  \n",
       "257  Empirical evidence based on maximum entropy sp...       10.1190/1.1440902  \n",
       "718  The seismic frequency spectrum provides a usef...  10.1190/geo2011-0323.1  \n",
       "164  The variation of frequency content of a seismi...       10.1190/1.1487136  \n",
       "863  Stationary time series is an important concept...       10.1190/1.1444502  \n",
       "252  Attenuation of random noise is a major concern...  10.1190/geo2015-0341.1  \n",
       "721  The procedure of maximum‐entropy spectral anal...       10.1190/1.1440471  \n",
       "642  Predictive deconvolution is commonly applied t...       10.1190/1.1441674  \n",
       "766  Covariance and spectral density functions of a...       10.1190/1.1439933  \n",
       "355  A general linear theory describes the extensio...       10.1190/1.1444318  "
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.iloc[idx]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have a score for each recommendation too. Each score is 1 minus the distance from the query point, so a higher score means a closer match:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " 22.6 Empirical Mode Decomposition For Seismic Time-Frequency Analysis\n",
      " 10.7 Choice Of Operator Length For Maximum Entropy Spectral Analysis\n",
      "  9.8 Reservoir Characterization Based On Seismic Spectral Variations\n",
      "  7.5 Seismic Sequence Analysis And Attribute Extraction Using Quadratic Time‐Frequency Representations\n",
      "  7.2 Ergodicity Of Stationary White Gaussian Processes\n",
      "  6.9 Sparse Time-Frequency Representation For Seismic Noise Reduction Using Low-Rank And Sparse Decomposition\n",
      "  3.9 Maximum‐Entropy Spatial Processing Of Array Data\n",
      "  1.6 Predictive Deconvolution And The Zero‐Phase Source\n",
      "  1.2 Spectrum Of The Potential Field Due To Randomly Distributed Sources\n",
      "  1.0 Theory Of Nonstationary Linear Filtering In The Fourier Domain With Application To Time‐Variant Filtering\n"
     ]
    }
   ],
   "source": [
    "# Use `score`, not `s`, so we don't overwrite the list of likes defined earlier.\n",
    "for i, score in zip(idx, scores):\n",
    "    print('{:.1f}'.format(100*score).rjust(5), df.title[i])"
   ]
  },
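  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To get a feel for the range of these scores, here is a minimal sketch of the same `1 - distance` calculation on hand-made 2D unit vectors. The vectors and the helper name `score_from_distance` are assumptions for illustration only. The sketch also assumes the query is itself unit-length, which holds when a single article is liked; the mean of several liked articles is generally shorter than that.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "def score_from_distance(a, b):\n",
    "    # Same arithmetic as recommend(): 1 minus the Euclidean distance.\n",
    "    return 1 - np.linalg.norm(np.asarray(a) - np.asarray(b))\n",
    "\n",
    "q = np.array([1.0, 0.0])           # A unit-length query.\n",
    "same = np.array([1.0, 0.0])        # Identical direction.\n",
    "ortho = np.array([0.0, 1.0])       # Unrelated (orthogonal).\n",
    "opposite = np.array([-1.0, 0.0])   # Opposite direction.\n",
    "\n",
    "for name, v in [('same', same), ('orthogonal', ortho), ('opposite', opposite)]:\n",
    "    cos = float(q @ v)\n",
    "    # For two unit vectors, distance**2 equals 2 - 2*cos(angle),\n",
    "    # so the two printed numbers should agree.\n",
    "    print(name, round(score_from_distance(q, v), 3), round(1 - np.sqrt(2 - 2 * cos), 3))\n",
    "```\n",
    "\n",
    "Under these assumptions the score runs from 1 for an identical document down to -1 for an opposite one, and an unrelated document scores below zero, so the small positive percentages above are still closer than 'unrelated'."
   ]
  },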
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<hr />\n",
    "\n",
    "&copy; Agile Geoscience 2017 &mdash; licensed under Apache 2.0"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [conda env:python3]",
   "language": "python",
   "name": "conda-env-python3-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}