{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Unsupervised classification of philosophical genres\n", "\n", "This notebook is part of the work being done for the [Trace of Theory project](https://github.com/htrc/ACS-TT), a collaboration between researchers of [NovelTM](http://novel-tm.ca/) and the HathiTrust Research Center ([HTRC](https://www.hathitrust.org/)).\n", "\n", "Here, we'll use unsupervised techniques to identify clusters of similar texts within a corpus of about 3,200 philosophical texts. These texts were previously identified in the HathiTrust public domain corpus using a list of philosophical keywords. The idea now is to look for something like philosophical \"genres\" within this subcorpus of philosophical texts and to compare the computational results to human labels. Our features will mix word-count data with measures of form and with textual metadata, so that we're examining not just subject matter, but also style and (minimal) context.\n", "\n", "The work below is almost exclusively about methods. 
There's not a lot of analysis, and the notebook ends with suggestions for things to try, rather than conclusions about philosophical genre.\n", "\n", "## Roadmap\n", "\n", "* Download feature data for the 3,200 philosophical texts from the HathiTrust Research Center\n", "* Parse feature data\n", "* Calculate other features derived from the same sources\n", "* Reduce the dimensions of the feature space in order to have some hope of clustering the texts\n", "* Perform [_k_-means](https://en.wikipedia.org/wiki/K-means_clustering) and [DBSCAN](https://en.wikipedia.org/wiki/DBSCAN) clustering on the dimension-reduced features\n", "* Visualize the clustering output alongside the human labels; use both static plots (via Matplotlib/[Seaborn](http://stanford.edu/~mwaskom/software/seaborn/)) and interactives (via [Bokeh](http://bokeh.pydata.org/en/latest/))\n", "\n", "\n", "## Imports" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ " \n", "