## Mechanisms notebook

In [25]:
import numpy as np
import pdb
from Mechanisms_bot.src import pubmed_io
import Mechanisms_bot.tweet_mechanism as tweet_mechanism
from Mechanisms_bot.src import abstract_parser 
import importlib
import pickle
import re

### Query Pubmed for sentences
Get PMIDs for articles whose abstracts mention a mechanism and a lack of clarity

In [3]:
pubmed_ids = pubmed_io.get_pubmed_ids(99000)
len(pubmed_ids)

99000

Get the abstracts for those articles

In [4]:
importlib.reload(pubmed_io)
pubmed_abstracts = pubmed_io.get_pubmed_abstracts( pubmed_ids)
len(pubmed_abstracts)

98536

### All of the above in one function

In [None]:
tweet_mechanism.get_recent_mech_sentences(90000)

PubMed returns XML, structured as PubmedArticleSet / PubmedArticle / MedlineCitation. MedlineCitation has children: PMID and Article

Article has children: Journal, ArticleTitle, and Abstract

Abstract has child: AbstractText
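The nesting described above can be walked with the standard library's `xml.etree.ElementTree`. This is a minimal sketch, not the code in `pubmed_io`: the function name `extract_abstracts` and the hand-written `SAMPLE_XML` (including its placeholder PMIDs) are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hand-written sample that mirrors the nesting described above.
# It is NOT real PubMed output; PMIDs and text are placeholders.
SAMPLE_XML = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>00000001</PMID>
      <Article>
        <Journal><Title>Example Journal</Title></Journal>
        <ArticleTitle>An example title</ArticleTitle>
        <Abstract>
          <AbstractText>First part.</AbstractText>
          <AbstractText>The mechanism is unclear.</AbstractText>
        </Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>00000002</PMID>
      <Article>
        <Journal><Title>Example Journal</Title></Journal>
        <ArticleTitle>No abstract here</ArticleTitle>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""


def extract_abstracts(xml_text):
    """Return {PMID: abstract text}, skipping articles without an abstract.

    Hypothetical helper; the real pubmed_io module may work differently.
    """
    root = ET.fromstring(xml_text.strip())
    abstracts = {}
    # iter() finds PubmedArticle elements regardless of the root element's name
    for article in root.iter("PubmedArticle"):
        pmid = article.findtext("MedlineCitation/PMID")
        parts = article.findall("MedlineCitation/Article/Abstract/AbstractText")
        if pmid is None or not parts:
            continue
        # An abstract may be split over several AbstractText elements
        abstracts[pmid] = " ".join("".join(p.itertext()).strip() for p in parts)
    return abstracts


print(extract_abstracts(SAMPLE_XML))
```

Joining every `AbstractText` element keeps structured abstracts (background, methods, results) together as one string, which is convenient for the sentence search in the next step.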

Make sure that the mechanism and lack of clarity are in the same sentence

In [5]:
good_sent = list(map(abstract_parser.get_mech_sent, pubmed_abstracts, pubmed_ids) )
good_sent = [x for x in good_sent if x['mech_sent']]
len(good_sent)

24945
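For readers without access to `abstract_parser`, the same-sentence check could be sketched roughly as below. Everything here is an assumption: the function name `find_mech_sentence`, the keyword patterns, and the naive sentence splitter are illustrative and may differ from what `get_mech_sent` actually does. Only the returned dictionary keys (`PMID`, `mech_sent`) follow the notebook.

```python
import re

# Assumed keyword patterns; the real parser may use different ones.
MECH_PATTERN = re.compile(r"\bmechanisms?\b", re.IGNORECASE)
UNCLEAR_PATTERN = re.compile(r"\b(unclear|unknown|understood)\b", re.IGNORECASE)


def find_mech_sentence(abstract, pmid):
    """Return the first sentence that mentions both a mechanism and a lack of clarity."""
    # Naive split: break after '.', '!' or '?' when followed by whitespace.
    # This will mis-split on abbreviations such as "e.g. ".
    sentences = re.split(r"(?<=[.!?])\s+", abstract.strip())
    for sentence in sentences:
        if MECH_PATTERN.search(sentence) and UNCLEAR_PATTERN.search(sentence):
            return {"PMID": pmid, "mech_sent": sentence}
    # An empty string marks "no match", so callers can filter on truthiness
    return {"PMID": pmid, "mech_sent": ""}


example = "Drug X lowers Y. However, the mechanism remains unclear. More work is needed."
print(find_mech_sentence(example, "1"))
```

Returning an empty `mech_sent` for non-matching abstracts lets the result be filtered with the same `if x['mech_sent']` test used in the cell above.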

Save and load, as necessary

In [8]:
with open('good_sent.pickle', 'wb') as pickle_file:
    pickle.dump(good_sent, pickle_file)

### Quick stats on sentences

In [2]:
with open('good_sent.pickle', 'rb') as pickle_file:
    good_sent = pickle.load(pickle_file)

In [54]:
unknown = {'understood', 'unclear', 'unknown'}
end_unknown = sum([1 for x in good_sent if x['mech_sent'].split(' ')[-1][:-1] in unknown])
print('Number of sentences ending with variant of "unknown": {}'.format(end_unknown) )

Number of sentences ending with variant of "unknown": 19459


In [55]:
however = sum([1 for x in good_sent if re.findall('however', x['mech_sent'].lower()) ])
print('Number of sentences containing "however": {}'.format(however) )

Number of sentences containing "however": 8889


In [56]:
although = sum([1 for x in good_sent if re.findall('although', x['mech_sent'].lower()) ])
print('Number of sentences containing "although": {}'.format(although) )

Number of sentences containing "although": 2822


In [57]:
while_count = sum([1 for x in good_sent if re.findall('while', x['mech_sent'].lower()) ])
print('Number of sentences containing "while": {}'.format(while_count) )

Number of sentences containing "while": 612
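The counts above use substring matching, so a search for "while" also matches words such as "meanwhile". A stricter variant with word boundaries might look like the sketch below; `count_with_word` and the three sample sentences are made up for illustration and are not part of the bot's code.

```python
import re


def count_with_word(sentences, word):
    """Count sentence dicts whose 'mech_sent' contains `word` as a whole word."""
    pattern = re.compile(r"\b{}\b".format(re.escape(word)), re.IGNORECASE)
    return sum(1 for s in sentences if pattern.search(s["mech_sent"]))


# Made-up sentences in the same dict layout as good_sent
sample = [
    {"PMID": "1", "mech_sent": "While X is known, the mechanism is unclear."},
    {"PMID": "2", "mech_sent": "Meanwhile, the mechanism remains unknown."},
    {"PMID": "3", "mech_sent": "However, the mechanism is not understood."},
]

substring_count = sum(1 for s in sample if "while" in s["mech_sent"].lower())
whole_word_count = count_with_word(sample, "while")
print(substring_count, whole_word_count)
```

On the sample, the substring search counts both "While" and "Meanwhile", whereas the whole-word search counts only the first.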


In [61]:
sent_length = [len(x['mech_sent']) for x in good_sent]
sent_min = good_sent[np.argmin(sent_length)]
print('Shortest sentence: {0}, PMID: {1}'.format(sent_min['mech_sent'], sent_min['PMID']) )

Shortest sentence: The mechanism is unclear., PMID: 25602775


In [63]:
# These are long sentences due to punctuation
for i, cur_sent in enumerate(good_sent):
    if cur_sent['PMID'] in {'26526306', '25454993', '26351365', '26648182'}:
        print(i)

1255
2721
4914
14158
16001


A fairly long sentence:

In [47]:
# Only look before index 1255, the first of the punctuation-inflated sentences listed above
sent_length = [len(x['mech_sent']) for x in good_sent[:1255]]
sent_max = good_sent[np.argmax(sent_length)]
sent_max

{'PMID': '26663484',
 'mech_sent': "Although an increasing number of studies have identified misregulated miRNAs in the neurodegenerative diseases (NDDs) Alzheimer's disease, Parkinson's disease, Huntington's disease, and amyotrophic lateral sclerosis, which suggests that alterations in the miRNA regulatory pathway could contribute to disease pathogenesis, the molecular mechanisms underlying the pathological implications of misregulated miRNA expression and the regulation of the key genes involved in NDDs remain largely unknown."}

A sample of mechanism sentences:

In [18]:
good_sent[:100]

[{'PMID': '26752988',
 'mech_sent': 'Lithium and valproate modulate disturbances in intracellular calcium homeostasis implicated in the pathophysiology of bipolar disorder, but the molecular mechanisms are not fully understood.'},
 {'PMID': '26752791',
 'mech_sent': 'Sin Nombre virus, SNV), the mechanism is largely unknown.'},
 {'PMID': '26752716',
 'mech_sent': 'Orchestrated trophoblast differentiation is necessary to establish and maintain a normal pregnancy, however the molecular mechanisms that guide this process remain largely unknown.'},
 {'PMID': '26752685',
 'mech_sent': 'However, the functional consequences of this outside of the HIF pathway remain unclear.'},
 {'PMID': '26752654',
 'mech_sent': 'Yet the physicochemical mechanisms underlying such mineral formation and growth in atheromata remain unknown.'},
 {'PMID': '26752649',
 'mech_sent': 'Although defects in intestinal barrier function are a key pathogenic factor in patients with inflammatory bowel diseases (IBDs), the mo

### Tweeting!

In [10]:
importlib.reload(tweet_mechanism)
new_sent, tweeted_sentences = tweet_mechanism.main();

INFO:Mechanisms_bot.tweet_mechanism:Getting recent papers
INFO:Mechanisms_bot.tweet_mechanism:Got 100 recent PubMed IDs
INFO:Mechanisms_bot.tweet_mechanism:Getting abstracts
INFO:Mechanisms_bot.tweet_mechanism:Downloaded 100 abstracts
INFO:Mechanisms_bot.tweet_mechanism:Found 22 mechanisms sentences

