{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Keras implementation of Character Aware Neural Models by Yoon Kim et al.\n", "\n", "\n", "## Introduction/Disclaimer\n", "\n", "
\n", "This notebook tries to understand the structure and implement the character RNN given in [Kim, 2016](https://arxiv.org/abs/1508.06615). While there is [code](https://github.com/yoonkim/lstm-char-cnn) available online by the authors, that is available in Torch (and there are several PyTorch implementations available). The current notebook tries to aid people in recreating such models in Keras, rather than having a model that works out-of-the-box. I am also currently not very confident about my approach myself, so please if you have any comments, open an issue ticket.
\n", "\n", "\n", "## Description\n", "\n", "The model is a character-input sequential model for next-word prediction. It accepts as input a word as a series of characters (represented as integers), and spits a probability distribution of the next word. It differs from other character RNN models in that it considers the whole word at each timestep, instead of a single character, and from word RNN models in that it accepts characters-as-integers and not words-as-integers. Below we show how we pre-process a corpus `.txt` file to feed into the model, how we build the model into keras and finally how we train & evaluate it.\n", "\n", "## Requirements\n", "Python packages `keras`, `numpy` and `tensorflow` (might also work with Theano but I haven't tried it). Also a GPU with CUDA is *highly* recommended.\n", "\n", "### Preprocessing\n", "\n", "Assume we have a corpus `train.txt` for training, `valid.txt` for validation and `test.txt` for testing. We split each corpus into words.\n", "\n", " Note: The corpus we use here is the same as the corpus used by Kim, Y. et al. in their repo. The prepared corpus have some words that appear only once (e.g. some very rare proper names) and in that case they have been erplaced with the token `