{ "metadata": { "name": "", "signature": "sha256:cb98a6c4d23f840942c5cc6016b5dd12655aaf177df4bea161fd0c5c312bf6af" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "In this post, we'll apply a little principal component analysis, a technique widely used in machine learning, to images of Japanese characters, the *kanji*.\n", "\n", "We'll go through several steps:\n", "\n", "- first, we produce rasterized images of the individual characters using `matplotlib`\n", "- next, we use `scikit-learn` to compute a principal component analysis (PCA) on the image data obtained in the previous step\n", "- finally, we'll explore some of the applications of the data obtained through this analysis" ] }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Making the images" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a first step, let's produce rasterized images of Japanese characters. You might ask, which characters? It turns out that the Japanese have standardized the list of characters usually encountered in newspapers, movies, posters, and books. This list is called the Joyo kanji list, and it has a page on [Wikipedia](http://en.wikipedia.org/wiki/List_of_j\u014dy\u014d_kanji).\n", "\n", "From this webpage, I copied and pasted the content of the character table into a file, which I placed on my hard drive. It is aptly named `kanji_list.csv`. I'll import the character data using `pandas`, which seems to be everyone's favorite when it comes to reading CSV files." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "import pandas as pd" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 2 }, { "cell_type": "code", "collapsed": false, "input": [ "df = pd.read_csv(\"kanji_list.csv\", sep='\\t', header=None)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 9 }, { "cell_type": "markdown", "metadata": {}, "source": [ "To get a glimpse of the content of the file, we can display its *head*:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "df.head(10)" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | \u4e9c | \u4e9e | \u4e8c | 7 | S | NaN | sub- | \u30a2 |
| 1 | a | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 2 | \u54c0 | NaN | \u53e3 | 9 | S | NaN | pathetic | \u30a2\u30a4\u3001\u3042\u308f-\u308c\u3001\u3042\u308f-\u308c\u3080 |
| 3 | ai, awa-re, awa-remu | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 3 | \u6328 | NaN | \u624b | 10 | S | 2010 | push open | \u30a2\u30a4 |
| 5 | ai | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 6 | 4 | \u611b | NaN | \u5fc3 | 13 | 4 | NaN | love | \u30a2\u30a4 |
| 7 | ai | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 8 | 5 | \u66d6 | NaN | \u65e5 | 17 | S | 2010 | not clear | \u30a2\u30a4 |
| 9 | ai | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |