{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# geoRx: Geoscience prescription\n", "\n", "This notebook accompanies\n", "\n", "> Hall, M (2017). Three data analytics party tricks. _The Leading Edge_ **36** (3).\n", "\n", "Inspired by and/or based on [**science concierge**](https://github.com/titipata/science_concierge) and [**Chris Clark's repo**](https://github.com/groveco/content-engine) on content-based recommendation.\n", "\n", "A version of this code is running at [georx.geosci.ai](http://georx.geosci.ai) where you can try it out. \n", "\n", "\n", "## Load data\n", "\n", "This dataset is 1000 random articles from the journal _Geophysics_ from 1936 to 2016. It represents about 10% of the total number of articles published in that time. It was collected from seg.org with permission, and processed into a CSV file of titles, abstracts, and DOIs." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df = pd.read_csv('data/title_abstract_doi.csv')" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>title</th>\n",
"      <th>abstract</th>\n",
"      <th>doi</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>Magnetic And Gravity Anomaly Patterns Related ...</td>\n",
"      <td>A study of the features of gravity and magneti...</td>\n",
"      <td>10.1190/1.1444192</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>Inversion For Permeability Distribution From T...</td>\n",
"      <td>Understanding reservoir properties plays a key...</td>\n",
"      <td>10.1190/geo2014-0203.1</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>Quantifying Background Magnetic-Field Inhomoge...</td>\n",
"      <td>Nuclear magnetic resonance measurements provid...</td>\n",
"      <td>10.1190/geo2012-0488.1</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>3</th>\n",
"      <td>Families Of Salt Domes In The Gulf Coastal Pro...</td>\n",
"      <td>If two fluids of different densities are super...</td>\n",
"      <td>10.1190/1.1439806</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>4</th>\n",
"      <td>Attribute-Guided Well-Log Interpolation Applie...</td>\n",
"      <td>Several approaches exist to use trends in 3D s...</td>\n",
"      <td>10.1190/1.2996302</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " title \\\n", "0 Magnetic And Gravity Anomaly Patterns Related ... \n", "1 Inversion For Permeability Distribution From T... \n", "2 Quantifying Background Magnetic-Field Inhomoge... \n", "3 Families Of Salt Domes In The Gulf Coastal Pro... \n", "4 Attribute-Guided Well-Log Interpolation Applie... \n", "\n", " abstract doi \n", "0 A study of the features of gravity and magneti... 10.1190/1.1444192 \n", "1 Understanding reservoir properties plays a key... 10.1190/geo2014-0203.1 \n", "2 Nuclear magnetic resonance measurements provid... 10.1190/geo2012-0488.1 \n", "3 If two fluids of different densities are super... 10.1190/1.1439806 \n", "4 Several approaches exist to use trends in 3D s... 10.1190/1.2996302 " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "1000" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Recommender\n", "\n", "This is a simple class (to learn more about classes, read up on object oriented programming). It has 4 methods (functions) that implement the workflow:\n", "\n", "- `__init__()`: Instantiates the class with some 'hyperparameters'.\n", "- `_preprocess()`: Does some basic preprocessing on the abstracts.\n", "- `fit`: Constructs the model, which consists of two main pieces:\n", " - A 100-dimensional 'semantic' space, in which each document is a point with 100 coordinates.\n", " - A look-up table of the distances between points, so we can find an article's neighbours quickly.\n", "- `recommend`: Takes a list of 'liked' articles, finds their midpoint in the semantic space, and looks up the closest articles to that midpoint." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [], "source": [
"import numpy as np\n",
"\n",
"from nltk.stem.porter import PorterStemmer\n",
"from nltk.tokenize import RegexpTokenizer\n",
"\n",
"from sklearn.pipeline import make_pipeline\n",
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"from sklearn.decomposition import TruncatedSVD\n",
"from sklearn.preprocessing import Normalizer\n",
"from sklearn.neighbors import BallTree, KDTree\n",
"\n",
"STEMMER = PorterStemmer()\n",
"TOKENIZER = RegexpTokenizer(r'\\w+')\n",
"\n",
"class ContentRx(object):\n",
"    \"\"\"\n",
"    A simple class to implement a scikit-learn-like API,\n",
"    and to hold the data.\n",
"    \"\"\"\n",
"    def __init__(self,\n",
"                 components=100,\n",
"                 return_scores=True,\n",
"                 metric='euclidean',\n",
"                 centroid='median',\n",
"                 ngram_range=(1,2),    # Can be very slow above (1,2)\n",
"                 ignore_fewer_than=1,  # Ignore terms found in fewer docs than this\n",
"                ):\n",
"        self.components = components\n",
"        self.return_scores = return_scores\n",
"        self.centroid = centroid\n",
"        self.metric = metric\n",
"        self.ngram_range = ngram_range\n",
"        self.ignore_fewer_than = ignore_fewer_than\n",
"\n",
"    def _preprocess(self, text):\n",
"        \"\"\"\n",
"        Tokenize and stem a piece of text (e.g. an abstract).\n",
"        \"\"\"\n",
"        out = [STEMMER.stem(token) for token in TOKENIZER.tokenize(text)]\n",
"        return ' '.join(out)\n",
"\n",
"    def fit(self, data):\n",
"        \"\"\"\n",
"        Latent semantic analysis, followed by a neighbour look-up:\n",
"        * Compute tf-idf (e.g. on unigrams and bigrams) for each doc.\n",
"        * Reduce to `components` dimensions with truncated SVD, then normalize.\n",
"        * Store the points in a KDTree for fast nearest-neighbour queries.\n",
"        \"\"\"\n",
"        data = [self._preprocess(item) for item in data]\n",
"\n",
"        # Build LSA pipeline: TF-IDF then normalized SVD reduction.\n",
"        tfidf = TfidfVectorizer(ngram_range=self.ngram_range,\n",
"                                min_df=self.ignore_fewer_than,\n",
"                                stop_words='english',\n",
"                               )\n",
"        svd = TruncatedSVD(n_components=self.components)\n",
"        normalize = Normalizer(copy=False)\n",
"        lsa = make_pipeline(tfidf, svd, normalize)\n",
"        self.X = lsa.fit_transform(data)\n",
"\n",
"        # Build and store distance tree.\n",
"        # Valid metrics: see KDTree.valid_metrics\n",
"        self.tree = KDTree(self.X, metric=self.metric)\n",
"\n",
"        return\n",
"\n",
"    def recommend(self, likes, n_recommend=10):\n",
"        \"\"\"\n",
"        Recommend the n_recommend articles nearest to the mean of the liked ones.\n",
"        \"\"\"\n",
"        # Make the query from the input document idxs.\n",
"        # Science Concierge uses Rocchio algorithm,\n",
"        # but I don't think I care about 'dislikes'.\n",
"        # NB the `centroid` setting is not used here; the query is always the mean.\n",
"        vecs = np.array([self.X[idx] for idx in likes])\n",
"        q = np.mean(vecs, axis=0).reshape(1, -1)\n",
"\n",
"        # Get the matches and their distances.\n",
"        dist, idx = self.tree.query(q, k=n_recommend+len(likes))\n",
"\n",
"        # Get rid of the original likes, which may or may not be in the result.\n",
"        ind, dist = zip(*[(i, d)\n",
"                          for d, i in zip(np.squeeze(dist), np.squeeze(idx))\n",
"                          if i not in likes])\n",
"\n",
"        # If the likes weren't in the result, we remove the most distant results.\n",
"        if self.return_scores:\n",
"            return list(ind)[:n_recommend], list(1 - np.array(dist))[:n_recommend]\n",
"        return list(ind)[:n_recommend]\n"
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Instantiate the model:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [], "source": [ "crx = ContentRx(ngram_range=(1,2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Train the model by fitting to our
dataset:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [], "source": [ "crx.fit(df.abstract)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The model is trained!\n", "\n", "## Make recommendations\n", "\n", "First, let's find some papers we like. (Remember this is only a subset of 1000 papers.)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[79, 127]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s = [i for i, t in enumerate(df.title) if 'spectral decomp' in t.lower()]\n", "s" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "('Seismic Spectral Decomposition Using Deconvolutive Short-Time Fourier Transform Spectrogram',\n", " 'Maximum Entropy Spectral Decomposition Of A Seismogram Into Its Minimum Entropy Component Plus Noise')" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.title[79], df.title[127]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can get our recommendations:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [], "source": [ "idx, scores = crx.recommend(likes=s, n_recommend=10)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[737, 257, 718, 164, 863, 252, 721, 642, 766, 355]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "idx" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>title</th>\n",
"      <th>abstract</th>\n",
"      <th>doi</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>737</th>\n",
"      <td>Empirical Mode Decomposition For Seismic Time-...</td>\n",
"      <td>Time-frequency analysis plays a significant ro...</td>\n",
"      <td>10.1190/geo2012-0199.1</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>257</th>\n",
"      <td>Choice Of Operator Length For Maximum Entropy ...</td>\n",
"      <td>Empirical evidence based on maximum entropy sp...</td>\n",
"      <td>10.1190/1.1440902</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>718</th>\n",
"      <td>Reservoir Characterization Based On Seismic Sp...</td>\n",
"      <td>The seismic frequency spectrum provides a usef...</td>\n",
"      <td>10.1190/geo2011-0323.1</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>164</th>\n",
"      <td>Seismic Sequence Analysis And Attribute Extrac...</td>\n",
"      <td>The variation of frequency content of a seismi...</td>\n",
"      <td>10.1190/1.1487136</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>863</th>\n",
"      <td>Ergodicity Of Stationary White Gaussian Processes</td>\n",
"      <td>Stationary time series is an important concept...</td>\n",
"      <td>10.1190/1.1444502</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>252</th>\n",
"      <td>Sparse Time-Frequency Representation For Seism...</td>\n",
"      <td>Attenuation of random noise is a major concern...</td>\n",
"      <td>10.1190/geo2015-0341.1</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>721</th>\n",
"      <td>Maximum‐Entropy Spatial Processing Of Array Data</td>\n",
"      <td>The procedure of maximum‐entropy spectral anal...</td>\n",
"      <td>10.1190/1.1440471</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>642</th>\n",
"      <td>Predictive Deconvolution And The Zero‐Phase So...</td>\n",
"      <td>Predictive deconvolution is commonly applied t...</td>\n",
"      <td>10.1190/1.1441674</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>766</th>\n",
"      <td>Spectrum Of The Potential Field Due To Randoml...</td>\n",
"      <td>Covariance and spectral density functions of a...</td>\n",
"      <td>10.1190/1.1439933</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>355</th>\n",
"      <td>Theory Of Nonstationary Linear Filtering In Th...</td>\n",
"      <td>A general linear theory describes the extensio...</td>\n",
"      <td>10.1190/1.1444318</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " title \\\n", "737 Empirical Mode Decomposition For Seismic Time-... \n", "257 Choice Of Operator Length For Maximum Entropy ... \n", "718 Reservoir Characterization Based On Seismic Sp... \n", "164 Seismic Sequence Analysis And Attribute Extrac... \n", "863 Ergodicity Of Stationary White Gaussian Processes \n", "252 Sparse Time-Frequency Representation For Seism... \n", "721 Maximum‐Entropy Spatial Processing Of Array Data \n", "642 Predictive Deconvolution And The Zero‐Phase So... \n", "766 Spectrum Of The Potential Field Due To Randoml... \n", "355 Theory Of Nonstationary Linear Filtering In Th... \n", "\n", " abstract doi \n", "737 Time-frequency analysis plays a significant ro... 10.1190/geo2012-0199.1 \n", "257 Empirical evidence based on maximum entropy sp... 10.1190/1.1440902 \n", "718 The seismic frequency spectrum provides a usef... 10.1190/geo2011-0323.1 \n", "164 The variation of frequency content of a seismi... 10.1190/1.1487136 \n", "863 Stationary time series is an important concept... 10.1190/1.1444502 \n", "252 Attenuation of random noise is a major concern... 10.1190/geo2015-0341.1 \n", "721 The procedure of maximum‐entropy spectral anal... 10.1190/1.1440471 \n", "642 Predictive deconvolution is commonly applied t... 10.1190/1.1441674 \n", "766 Covariance and spectral density functions of a... 10.1190/1.1439933 \n", "355 A general linear theory describes the extensio... 
10.1190/1.1444318 " ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.iloc[idx]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have a score for each recommendation too. Each score is 1 minus the distance from the query point, so higher means closer (shown here multiplied by 100):" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [
" 22.6 Empirical Mode Decomposition For Seismic Time-Frequency Analysis\n",
" 10.7 Choice Of Operator Length For Maximum Entropy Spectral Analysis\n",
"  9.8 Reservoir Characterization Based On Seismic Spectral Variations\n",
"  7.5 Seismic Sequence Analysis And Attribute Extraction Using Quadratic Time‐Frequency Representations\n",
"  7.2 Ergodicity Of Stationary White Gaussian Processes\n",
"  6.9 Sparse Time-Frequency Representation For Seismic Noise Reduction Using Low-Rank And Sparse Decomposition\n",
"  3.9 Maximum‐Entropy Spatial Processing Of Array Data\n",
"  1.6 Predictive Deconvolution And The Zero‐Phase Source\n",
"  1.2 Spectrum Of The Potential Field Due To Randomly Distributed Sources\n",
"  1.0 Theory Of Nonstationary Linear Filtering In The Fourier Domain With Application To Time‐Variant Filtering\n"
] } ], "source": [
"# Use `score` as the loop variable so the list of likes, `s`, is not overwritten.\n",
"for i, score in zip(idx, scores):\n",
"    print('{:.1f}'.format(100*score).rjust(5), df.title[i])"
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "© Agile Geoscience 2017 — licensed under Apache 2.0" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python [conda env:python3]", "language": "python", "name": "conda-env-python3-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 1 }