{ "cells": [ { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "# Lecture 3: Data Visualization\n", "\n", "We've now learned the basics of R and how to use it to manipulate and clean data. Let's dive into the field where R is arguably most useful: data visualization.\n", "\n", "In this lecture we will learn:\n", "\n", "1. Why data visualization is important,\n", "2. Notable techniques used to visualize data,\n", "3. The challenges of data visualization, and\n", "4. Useful visual tools available in R." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Why is Data Visualization Important?\n", "\n", "Visualizing data is crucial in communicating ideas. We process visual information more readily than abstract information. Since much of the output of data analytics is abstract, visualization makes complex patterns easier to digest and makes the resulting insights easier to present to people from non-technical backgrounds.\n", "\n", "Many people avoid data visualization because the process can be time-consuming, and good visuals are perceived to be hard to make. But many latent trends in a dataset only become noticeable through visualization. Skipping visualization altogether can lead to poorly informed model and parameter selection. When in doubt, visualize!\n", "\n", "Note that there are typically two types of visualizations: __distributional__ (using histograms or box plots to assess the distribution of a variable) and __correlational__ (using line plots or scatter plots to understand the relationship between two variables).\n", "\n", "The process of data visualization usually works in the following fashion:\n", "\n", "* Simple data analysis (correlations, summary statistics)\n", "* Data visualization\n", "* Identification of patterns\n", "* Secondary analysis or implementation\n", "\n", "Let's see how this works using the built-in `iris` dataset in R. 
This famous dataset was popularized by R.A. Fisher, who analyzed measurements collected by Edgar Anderson [1]." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "'data.frame':\t150 obs. of 5 variables:\n", " $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...\n", " $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...\n", " $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...\n", " $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...\n", " $ Species : Factor w/ 3 levels \"setosa\",\"versicolor\",..: 1 1 1 1 1 1 1 1 1 1 ...\n" ] }, { "data": { "text/plain": [ " Sepal.Length Sepal.Width Petal.Length Petal.Width \n", " Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100 \n", " 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300 \n", " Median :5.800 Median :3.000 Median :4.350 Median :1.300 \n", " Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199 \n", " 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800 \n", " Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500 \n", " Species \n", " setosa :50 \n", " versicolor:50 \n", " virginica :50 \n", " \n", " \n", " " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "data(iris)\n", "str(iris)\n", "summary(iris)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This dataset is a good example of a classification problem: we can use it to train an algorithm that predicts `Species` from `Sepal.Length`, `Sepal.Width`, `Petal.Length`, and `Petal.Width`.\n", "\n", "### First Step: Correlational Analysis" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
| | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width |
|---|---|---|---|---|
| Sepal.Length | 1.0000000 | -0.1175698 | 0.8717538 | 0.8179411 |
| Sepal.Width | -0.1175698 | 1.0000000 | -0.4284401 | -0.3661259 |
| Petal.Length | 0.8717538 | -0.4284401 | 1.0000000 | 0.9628654 |
| Petal.Width | 0.8179411 | -0.3661259 | 0.9628654 | 1.0000000 |