{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Topic Modeling with MALLET" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'd like to test how [Taylor Salo](https://www.github.com/tsalo) integrated MALLET into NeuroSynth, and whether that integration works in a Docker container.\n", "\n", "First, let's import some dependencies and load some text to work with.\n", "\n", "For testing, we'll use an XML file downloaded separately from PubMed. In the spirit of NeuroSynth, we downloaded [Tal Yarkoni's](https://www.ncbi.nlm.nih.gov/pubmed/?term=tal+yarkoni) bibliography. Thanks, Tal!" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Warning: Some articles do not have abstracts on PubMed!\n", "Only articles with complete data will be included.\n" ] } ], "source": [ "from bs4 import BeautifulSoup\n", "import pandas as pd\n", "\n", "with open('../neurosynth/tests/data/yarkoni_pubmed.xml') as infile:\n", "    xml_file = infile.read()\n", "# The 'lxml' parser lowercases tag names, hence 'articletitle' below.\n", "soup = BeautifulSoup(xml_file, 'lxml')\n", "\n", "if not isinstance(soup, BeautifulSoup):\n", "    print('Check file type! Must be HTML or XML.')\n", "\n", "titles = soup.find_all('articletitle')\n", "abstracts = soup.find_all('abstract')\n", "\n", "# A count mismatch means at least one article has no abstract.\n", "if len(titles) != len(abstracts):\n", "    print('Warning: Some articles do not have abstracts on PubMed!')\n", "    print('Only articles with complete data will be included.')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Three articles do not have abstracts:\n", "\n", "1. Pain in the ACC?\n", "2. Introduction to the special issue on reliability and\n", "   replication in cognitive and affective neuroscience research.\n", "3. Establishing homology between monkey and human brains.\n", "\n", "This may be because they are commentaries. We'll need to filter the results to consider only articles with abstracts. 
Then, load the articles that have abstracts into a pandas DataFrame." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table>\n",
"  <thead>\n",
"    <tr><th></th><th>abstract</th><th>pmid</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>Compassion is critical for societal wellbeing....</td><td>27018610</td></tr>\n",
"    <tr><th>1</th><td>Open access, open data, open source and other ...</td><td>27387362</td></tr>\n",
"    <tr><th>2</th><td>The functional organization of human medial fr...</td><td>27307242</td></tr>\n",
"    <tr><th>3</th><td>Social scientists often seek to demonstrate th...</td><td>27031707</td></tr>\n",
"    <tr><th>4</th><td>Decades of animal and human neuroimaging resea...</td><td>26831091</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " abstract pmid\n", "0 Compassion is critical for societal wellbeing.... 27018610\n", "1 Open access, open data, open source and other ... 27387362\n", "2 The functional organization of human medial fr... 27307242\n", "3 Social scientists often seek to demonstrate th... 27031707\n", "4 Decades of animal and human neuroimaging resea... 26831091" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "abstracts = []\n", "pmids = []\n", "\n", "articles = soup.find_all('pubmedarticle')\n", "for a in articles:\n", " if a.find_all('abstract')!= []:\n", " # This is a little messy, but pulls out the\n", " # results in plain text without another loop.\n", " abstracts.append(a.find_all('abstracttext')[0].get_text())\n", " pmids.append(a.find_all(idtype='pubmed')[0].get_text())\n", "\n", "df = pd.DataFrame({'pmid': pmids,\n", " 'abstract': abstracts})\n", "\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have a test dataset! Let's see how it plays with MALLET. " ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MALLET toolbox found!\n", "Abstracts folder not found. Creating abstract files...\n", "Generating topics...\n" ] }, { "data": { "text/html": [ "
<div>\n",
"<table>\n",
"  <thead>\n",
"    <tr><th></th><th>terms</th></tr>\n",
"    <tr><th>topic</th><th></th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>topic_000</th><td>decision making neuroimaging gains demonstrate...</td></tr>\n",
"    <tr><th>topic_001</th><td>social anxiety disorder generalized type givin...</td></tr>\n",
"    <tr><th>topic_002</th><td>orthographic visual language widespread neighb...</td></tr>\n",
"    <tr><th>topic_003</th><td>connectivity findings global found state lpfc ...</td></tr>\n",
"    <tr><th>topic_004</th><td>data human provide coactivation brain map api ...</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " terms\n", "topic \n", "topic_000 decision making neuroimaging gains demonstrate...\n", "topic_001 social anxiety disorder generalized type givin...\n", "topic_002 orthographic visual language widespread neighb...\n", "topic_003 connectivity findings global found state lpfc ...\n", "topic_004 data human provide coactivation brain map api ..." ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import os\n", "import subprocess\n", "import shutil\n", "import sys\n", "sys.path.append(os.path.abspath('..'))\n", "from neurosynth.analysis.reduce import topic_models\n", "\n", "weights_df, keys_df = topic_models(df)\n", "keys_df.head()" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python [default]", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.12" } }, "nbformat": 4, "nbformat_minor": 0 }