{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Topic Modeling with MALLET"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We'd like to test how [Taylor Salo](https://www.github.com/tsalo) integrated MALLET into NeuroSynth, and whether that integration works in a Docker container.\n",
    "\n",
    "First, let's import some dependencies and load some text to work with.\n",
    "\n",
    "For testing, we'll use an XML file downloaded separately from PubMed. In the spirit of NeuroSynth, we downloaded [Tal Yarkoni's](https://www.ncbi.nlm.nih.gov/pubmed/?term=tal+yarkoni) bibliography. Thanks, Tal!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Warning: Some articles do not have abstracts on PubMed!\n",
      "Only articles with complete data will be included.\n"
     ]
    }
   ],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "import pandas as pd\n",
    "\n",
    "with open('../neurosynth/tests/data/yarkoni_pubmed.xml') as infile:\n",
    "    xml_file = infile.read()\n",
    "\n",
    "# The 'lxml' parser treats the file as HTML and lowercases tag names, so\n",
    "# PubMed's <ArticleTitle> is matched below as 'articletitle'.\n",
    "soup = BeautifulSoup(xml_file, 'lxml')\n",
    "\n",
    "# Sanity check on the parsed object (isinstance is not stripped by python -O).\n",
    "if not isinstance(soup, BeautifulSoup):\n",
    "    print('Check file type! Must be HTML or XML.')\n",
    "\n",
    "titles = soup.find_all('articletitle')\n",
    "abstracts = soup.find_all('abstract')\n",
    "\n",
    "if len(titles) != len(abstracts):\n",
    "    print('Warning: Some articles do not have abstracts on PubMed!')\n",
    "    print('Only articles with complete data will be included.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Three articles do not have abstracts:\n",
    "\n",
    "1. Pain in the ACC?\n",
    "2. Introduction to the special issue on reliability and replication in cognitive and affective neuroscience research.\n",
    "3. Establishing homology between monkey and human brains.\n",
    "\n",
    "This may be because they're commentaries. We'll need to filter the results to keep only articles with abstracts, then load the matching articles into a pandas DataFrame."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>abstract</th>\n",
       "      <th>pmid</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Compassion is critical for societal wellbeing....</td>\n",
       "      <td>27018610</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Open access, open data, open source and other ...</td>\n",
       "      <td>27387362</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>The functional organization of human medial fr...</td>\n",
       "      <td>27307242</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Social scientists often seek to demonstrate th...</td>\n",
       "      <td>27031707</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Decades of animal and human neuroimaging resea...</td>\n",
       "      <td>26831091</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                            abstract      pmid\n",
       "0  Compassion is critical for societal wellbeing....  27018610\n",
       "1  Open access, open data, open source and other ...  27387362\n",
       "2  The functional organization of human medial fr...  27307242\n",
       "3  Social scientists often seek to demonstrate th...  27031707\n",
       "4  Decades of animal and human neuroimaging resea...  26831091"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "abstracts = []\n",
    "pmids = []\n",
    "\n",
    "articles = soup.find_all('pubmedarticle')\n",
    "for a in articles:\n",
    "    # Skip articles that have no abstract.\n",
    "    if a.find('abstract') is not None:\n",
    "        # This is a little messy, but pulls out the\n",
    "        # results in plain text without another loop.\n",
    "        # Note: only the first <AbstractText> section is kept.\n",
    "        abstracts.append(a.find_all('abstracttext')[0].get_text())\n",
    "        pmids.append(a.find_all(idtype='pubmed')[0].get_text())\n",
    "\n",
    "df = pd.DataFrame({'pmid': pmids,\n",
    "                   'abstract': abstracts})\n",
    "\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have a test dataset! Let's see how it plays with MALLET. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "MALLET toolbox found!\n",
      "Abstracts folder not found. Creating abstract files...\n",
      "Generating topics...\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>terms</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>topic</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>topic_000</th>\n",
       "      <td>decision making neuroimaging gains demonstrate...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>topic_001</th>\n",
       "      <td>social anxiety disorder generalized type givin...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>topic_002</th>\n",
       "      <td>orthographic visual language widespread neighb...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>topic_003</th>\n",
       "      <td>connectivity findings global found state lpfc ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>topic_004</th>\n",
       "      <td>data human provide coactivation brain map api ...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                       terms\n",
       "topic                                                       \n",
       "topic_000  decision making neuroimaging gains demonstrate...\n",
       "topic_001  social anxiety disorder generalized type givin...\n",
       "topic_002  orthographic visual language widespread neighb...\n",
       "topic_003  connectivity findings global found state lpfc ...\n",
       "topic_004  data human provide coactivation brain map api ..."
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import os\n",
    "import sys\n",
    "\n",
    "# Add the parent directory (which contains the neurosynth package)\n",
    "# to the import path.\n",
    "sys.path.append(os.path.abspath('..'))\n",
    "from neurosynth.analysis.reduce import topic_models\n",
    "\n",
    "weights_df, keys_df = topic_models(df)\n",
    "keys_df.head()"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [default]",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}