`) instead of cells. The strategy is to keep only added blocks that are preceded by an empty cell (`diff-empty`) and deleted blocks that are followed by one.\n",
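"\n",
"As a rough illustration, here is a minimal, self-contained sketch of that check on a simplified, hand-made diff row (the markup is illustrative only, not verbatim MediaWiki output):\n",
"\n",
"```python\n",
"from bs4 import BeautifulSoup\n",
"\n",
"# simplified stand-in for one row of a diff table:\n",
"# the empty cell on the left marks a purely added line\n",
"row_html = '''\n",
"<tr>\n",
"  <td class=\"diff-empty\"></td>\n",
"  <td class=\"diff-addedline\">Mahayana</td>\n",
"</tr>\n",
"'''\n",
"\n",
"soup = BeautifulSoup(row_html, 'html.parser')\n",
"for tr in soup.find_all(\"tr\"):\n",
"    # no inline <ins>/<del> markup and an empty cell present => pure addition\n",
"    if len(tr.find_all(\"ins\")) == 0 and len(tr.find_all(\"td\", \"diff-empty\")) > 0:\n",
"        print tr.find(\"td\", \"diff-addedline\").get_text()   # -> Mahayana\n",
"```"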
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"plus/minus overview"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"- Tibetan\n",
"- others\n",
"+ [[\n",
"+ ]]\n",
"+ Mahayana\n",
"+ , altustic\n",
"+ all sentient beings\n"
]
}
],
"source": [
"def extract(diff_html):\n",
" diff = { \"added\": [],\n",
" \"deleted\" : [] }\n",
"\n",
" d = BeautifulSoup(diff_html, 'html.parser')\n",
"\n",
" tr = d.find_all(\"tr\")\n",
"\n",
" for what in [ [\"added\", \"ins\"], [\"deleted\", \"del\"] ]:\n",
" a = []\n",
"\n",
" # checking block \n",
" # we also check this is not only context showing for non-substition edits\n",
" a = [ t.find(\"td\", \"diff-%sline\" % (what[0])) for t in tr if len(t.find_all(what[1])) == 0 and len(t.find_all(\"td\", \"diff-empty\")) > 0 ]\n",
"\n",
" # checking inline\n",
" a.extend(d.find_all(what[1]))\n",
"\n",
" # filtering empty extractions\n",
" a = [ x for x in a if x != None ]\n",
"\n",
" # registering\n",
" diff[what[0]] = [ tag.get_text() for tag in a ]\n",
"\n",
" return diff\n",
"\n",
"def print_plusminus_overview(diff):\n",
" for minus in diff[\"deleted\"]:\n",
" print \"- %s\" % (minus)\n",
"\n",
" for plus in diff[\"added\"]:\n",
" print \"+ %s\" % (plus)\n",
" \n",
"display(HTML(\"plus/minus overview\"))\n",
"\n",
"diff = extract(diff)\n",
"print_plusminus_overview(diff)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Natural language processing\n",
"\n",
"We are now going to proceed to a little bit of language processing. NLTK provides very usefull starter tools to manipulate bits of natural language. The core of the workflow is about tokenization and normalization.\n",
"\n",
"The first stem is to be able to count words correctly, it is were normalization intervens:\n",
"\n",
"- **stemming** is the process of reducing a word to its roots. For example, you may want to transform \"gods\" to \"god\", \"is\" to \"be\", etc\n",
"- **lemmatization** is closely related to stemming. Whereas the first one is a context-free procedure, lemmatization take care of variables related to grammar like the position in the phrase to have a less agressive approach.\n",
"\n",
"Right now, we apply lemmatization without the grammatical information. This is just in order to prepare advanced NLP work."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def normalize(word):\n",
" lemmatizer = nltk.WordNetLemmatizer()\n",
" stemmer = nltk.stem.porter.PorterStemmer()\n",
"\n",
" word = word.lower()\n",
" word = stemmer.stem_word(word)\n",
" word = lemmatizer.lemmatize(word)\n",
"\n",
" return word"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The process of counting stems is mainly about mapping the result of the **tokenization** of plus/minus contents. Dividing sentences into parts and words can be a very tedious work without the right parser or if you are looking for a universal grammar. It is also very related to the language itself. For example parsing english or german is very different. For the moment, we are going to use the [Punkt tokenizer](http://www.nltk.org/api/nltk.tokenize.html) because it is now all about english sentences.\n",
"\n",
"Tokenization, stemming and lemmatization are very sensitive points. It is possible to develop more precise strategies depending on what you are looking for. We are going to let it fuzzy to give space to later use and keep a broad mindset about what can be done with diff information. Meanwhile, for counting purpose, the basic implementation of these methods are largely sufficient."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"def count_stems(sentences, inflections=None):\n",
" stems = defaultdict(int)\n",
"\n",
" ignore_list = \"{}()[]<>./,;\\\"':!?=*&%\"\n",
" \n",
" if inflections == None:\n",
" inflections = defaultdict(dict)\n",
"\n",
" for sentence in sentences:\n",
" for word in nltk.word_tokenize(sentence):\n",
" old = word\n",
" word = normalize(word)\n",
" if not(word in ignore_list):\n",
" stems[word] += 1\n",
"\n",
" # keeping track of inflection usages\n",
" inflections[word].setdefault(old,0)\n",
" inflections[word][old] += 1\n",
"\n",
" return stems\n",
"\n",
"def print_plusminus_terms_overview(stems):\n",
" print \"\\n%s|%s\\n\" % (\"+\"*len(stems[\"added\"].items()), \"-\"*len(stems[\"deleted\"].items()))\n",
"\n",
"def print_plusminus_terms(stems):\n",
" for k in stems.keys():\n",
" display(HTML(\"%s:\" % (k)))\n",
" \n",
" for term in stems[k]:\n",
" print term"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"plus/minus ---> terms"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"deleted:"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"other\n",
"tibetan\n"
]
},
{
"data": {
"text/html": [
"added:"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"altust\n",
"mahayana\n",
"all\n",
"sentient\n",
"be\n"
]
}
],
"source": [
"inflections = defaultdict(dict)\n",
"\n",
"display(HTML(\"plus/minus ---> terms\"))\n",
"\n",
"stems = {}\n",
"stems[\"added\"] = count_stems(diff[\"added\"], inflections)\n",
"stems[\"deleted\"] = count_stems(diff[\"deleted\"], inflections)\n",
"\n",
"print_plusminus_terms(stems)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## inflections\n",
"\n",
"We have also kept trace of inflections. This is not very important over one diff but it is interesting if you have collected inflections over a large set of words. For example, you might want to use the most common inflection instead of the stem form to produce more readable/pretty words cloud."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false,
"scrolled": false
},
"outputs": [
{
"data": {
"text/html": [
"inflections"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[be] beings (1)\n",
"[mahayana] Mahayana (1)\n",
"[sentient] sentient (1)\n",
"[altust] altustic (1)\n",
"[all] all (1)\n",
"[other] others (1)\n",
"[tibetan] Tibetan (1)\n"
]
}
],
"source": [
"display(HTML(\"inflections\"))\n",
"\n",
"for stem, i in inflections.iteritems():\n",
" print \"[%s] %s\" % (stem, \", \".join(map(lambda x: \"%s (%s)\" % (x[0], x[1]), i.items())))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## See also\n",
"\n",
"This procedure is extensively used in the [words of wisdom and love](words%20of%20wisdom%20and%20love.ipynb) notebook about counting reccuring terms in diff of love, ethics, wisdom and morality pages."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.9"
}
},
"nbformat": 4,
"nbformat_minor": 0
}