{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Using random cross-validation for news categorization\n", "## by [Andres Soto](https://www.linkedin.com/in/andres-soto-villaverde-36198a5/)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In my previous blog post, [News Categorization using Multinomial Naive Bayes](http://nbviewer.jupyter.org/github/andressotov/News-Categorization-MNB/blob/master/News%20Categorization%20MNB.ipynb), I tried to show how to predict the category (business, entertainment, etc.) of a news article given only its headline, using the Multinomial Naive Bayes algorithm. \n", "In that experiment, the classification algorithm was trained with just one set of data. Although the training set was selected at random, it is only one sample of the possible results. This time, I will test the algorithm with several randomly selected sets in order to determine how reliable the results are." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The dataset used for the project comes from the UCI Machine Learning Repository." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Lichman, M. (2013). [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml). Irvine, CA: University of California, School of Information and Computer Science." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The dataset contains the headlines, URLs, and categories of 422,419 collected news stories, which are labelled as follows:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Label | Category\t| News | Percent\n", "-------|------------|----------|----------\n", "b\t| business\t|