{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Visualization with Swing\n", "\n", "A picture is worth a thousand words. In machine learning, we usually handle high-dimensional data, which is impossible to draw on display directly. But a variety of statistical plots are tremendously valuable for us to grasp the characteristics of many data points. Smile provides data visualization tools such as plots and maps for researchers to understand information more easily and quickly.\n", "\n", "Smile provides many advanced interactive statistical plots with Java's Swing graphics library. To render Swing plot canvas in Notebook, we generate an image and embedded it into HTML. Therefore, we lose the interactive functionality. To fully leverage Swing-based plots, we recommend the users to use Smile's shell." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import $ivy.`com.github.haifengl::smile-scala:3.1.0`\n", "import $ivy.`org.slf4j:slf4j-simple:2.0.13` \n", "\n", "import java.lang.Math._\n", "import java.awt.Color.{BLACK, BLUE, CYAN, DARK_GRAY, GRAY, GREEN, LIGHT_GRAY, MAGENTA, ORANGE, PINK, RED, WHITE, YELLOW}\n", "import smile.plot.swing.Palette.{DARK_RED, VIOLET_RED, DARK_GREEN, LIGHT_GREEN, PASTEL_GREEN, FOREST_GREEN, GRASS_GREEN, NAVY_BLUE, SLATE_BLUE, ROYAL_BLUE, CADET_BLUE, MIDNIGHT_BLUE, SKY_BLUE, STEEL_BLUE, DARK_BLUE, DARK_MAGENTA, DARK_CYAN, PURPLE, LIGHT_PURPLE, DARK_PURPLE, GOLD, BROWN, SALMON, TURQUOISE, BURGUNDY, PLUM}\n", "import smile.plot.swing._\n", "import smile.plot.show\n", "import smile.interpolation._\n", "import smile.math.matrix._\n", "import smile.stat.distribution._\n", "import smile._\n", "import smile.plot.Render._" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's plot a heart. Math is beautiful, isn't it?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "val heart = -314 to 314 map { i =>\n", " val t = i / 100.0\n", " val x = 16 * pow(sin(t), 3)\n", " val y = 13 * cos(t) - 5 * cos(2*t) - 2 * cos(3*t) - cos(4*t)\n", " Array(x, y)\n", "}\n", "\n", "show(line(heart.toArray, color = RED))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the function `plot` returns a `PlotCanvas` that encapsulates the plot specification. The function `show` does the renderring job (with the help of implict argument `display` that we defined earlier)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Scatter Plot\n", "\n", "A scatter plot displays data as a collection of points. The points can be color-coded, which is very useful for classification tasks. The user can use `plot` functions to draw scatter plot easily.\n", "```\n", "def plot(x: Array[Array[Double]], mark: Char = '*', color: Color = Color.BLACK): Canvas\n", "\n", "def plot(x: Array[Array[Double]], y: Array[String], mark: Char): Canvas\n", "\n", "def plot(x: Array[Array[Double]], y: Array[Int], mark: Char): Canvas\n", "```\n", "The legends are as follows.\n", "\n", "- . : dot\n", "- \\+ : \\+\n", "- \\- : \\-\n", "- | : |\n", "- \\* : star\n", "- x : x\n", "- o : circle\n", "- O : large circle\n", "- @ : solid circle\n", "- \\# : large solid circle\n", "- s : square\n", "- S : large square\n", "- q : solid square\n", "- Q : large solid square\n", "\n", "For any other char, the data point will be drawn as a dot.\n", "\n", "The functions return a `PlotCanvas`, which can be used to control the plot programmatically. The user can also use the popup context menu by right mouse click to print, change the title, axis labels, and font, etc.\n", "\n", "On the desktop, the user can zoom in/out by mouse wheel. For 2D plot, the user can shift the coordinates by moving mouse after double click. The user can also select an area by mouse for detailed view. For 3D plot, the user can rotate the view by dragging mouse." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "val iris = read.arff(\"../data/weka/iris.arff\")\n", "val canvas = plot(iris, \"sepallength\", \"sepalwidth\", \"class\", '*')\n", "canvas.setAxisLabels(\"sepallength\", \"sepalwidth\")\n", "show(canvas)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this example, we plot the first two columns of Iris data. We use the class label for legend and color coding. It is also easy to draw a 3D plot." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "val canvas = plot(iris, \"sepallength\", \"sepalwidth\", \"petallength\", \"class\", '*')\n", "canvas.setAxisLabels(\"sepallength\", \"sepalwidth\", \"petallength\")\n", "show(canvas)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "However, the Iris data has four attributes. So even 3D plot is not sufficient to see the whole picture. A general practice is plot all the attribute pairs. For example," ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "show(splom(iris, '*', \"class\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Box Plot\n", "\n", "The box plot is a standardized way of displaying the distribution of data based on the five number summary: minimum, first quartile, median, third quartile, and maximum.\n", "\n", "Box plots can be useful to display differences between populations without making any assumptions of the underlying statistical distribution: they are non-parametric. The spacings between the different parts of the box help indicate the degree of dispersion (spread) and skewness in the data, and identify outliers." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "val groups = (iris(\"sepallength\").toDoubleArray zip iris(\"class\").toStringArray).groupBy(_._2)\n", "val labels = groups.keys.toArray\n", "val data = groups.values.map { a => a.map(_._1) }.toArray\n", "val canvas = boxplot(data, labels)\n", "canvas.setAxisLabels(\"\", \"sepallength\")\n", "show(canvas)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Histogram\n", "\n", "A histogram is a graphical representation of the distribution of numerical data. The range of values is divided into a series of consecutive, non-overlapping intervals/bins. The bins must be adjacent, and are usually equal size.\n", "```\n", "def hist(data: Array[Double], k: Int = 10, prob: Boolean = false, color: Color = Color.BLUE): Canvas\n", "\n", "def hist(data: Array[Double], breaks: Array[Double], prob: Boolean, color: Color): Canvas\n", "``` \n", "where k is the number of bins (10 by default), or you can also specify an array of the breakpoints between bins.\n", "\n", "Let's apply the histogram to an interesting data: the wisdom of crowds. The original experiment took place about a hundred years ago at a county fair in England. The fair had a guess the weight of the ox contest. Francis Galton calculated the average of all guesses, which is right to within one pound.\n", "\n", "Recently, NPR Planet Money ran the experiment again. NPR posted a couple of pictures of a cow (named Penelope) and asked people to guess her weight. They got over 17,000 responses. The average of guesses was 1,287 pounds, which is pretty close to Penelope's weight 1,355 pounds." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "val cow = read.csv(\"../data/npr/cow.txt\", header = false)(\"V1\").toDoubleArray\n", "val canvas = hist(cow, 50)\n", "canvas.setAxisLabels(\"Weight\", \"Probability\")\n", "show(canvas)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The histogram gives a rough sense of the distribution of crowd guess, which has a long tail. Filter out the weights over 3500 pounds, the histogram shows more details." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "val canvas = hist(cow.filter(_ <= 3500), 50)\n", "canvas.setAxisLabels(\"Weight\", \"Probability\")\n", "show(canvas)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Smile also supports histograms that display the distribution of 2-dimensional data.\n", "```\n", "def hist3(data: Array[Array[Double]], xbins: Int = 10, ybins: Int = 10, prob: Boolean = false, palette: Array[Color] = Palette.jet(16)): Canvas\n", "``` \n", "Here we generate a data set from a 2-dimensional Gaussian distribution." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "val gauss = new MultivariateGaussianDistribution(Array(0.0, 0.0), new Matrix(Array(Array(1.0, 0.6), Array(0.6, 2.0))))\n", "val data = (0 until 10000) map { i: Int => gauss.rand }\n", "show(hist3(data.toArray, 50, 50))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Q-Q Plot\n", "\n", "A Q–Q plot (\"Q\" stands for quantile) is a probability plot for comparing two probability distributions by plotting their quantiles against each other. A point (x, y) on the plot corresponds to one of the quantiles of the second distribution (y-coordinate) plotted against the same quantile of the first distribution (x-coordinate).\n", "```\n", "def qqplot(x: Array[Double]): Canvas\n", "\n", "def qqplot(x: Array[Double], d: Distribution): Canvas\n", "def qqplot(x: Array[Double], y: Array[Double]): Canvas\n", "\n", "def qqplot(x: Array[Int], d: DiscreteDistribution): Canvas\n", "def qqplot(x: Array[Int], y: Array[Int]): Canvas\n", "``` \n", "Smile supports the Q-Q plot of samples to a given distribution and also of two sample sets. The second distribution/samples is optional. If missing, we assume it the standard Gaussian distribution.\n", "\n", "In what follows, we generate a random sample set from standard Gaussian distribution and draw its Q-Q plot." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "val gauss = new GaussianDistribution(0.0, 1.0)\n", "val data = (0 until 1000) map { i: Int => gauss.rand }\n", "show(qqplot(data.toArray))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In fact, this is also a good visual way to verify the quality of our random number generator." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Heatmap\n", "\n", "A heat map is a graphical representation of data where the values in a matrix are represented as colors. In cluster analysis, researchers often employs the heat map by permuting the rows and the columns of a matrix to place similar values near each other according to the clustering.\n", "```\n", "def heatmap(z: Array[Array[Double]], palette: Array[Color] = Palette.jet(16)): Canvas\n", "\n", "def heatmap(x: Array[Double], y: Array[Double], z: Array[Array[Double]], palette: Array[Color]): Canvas\n", "\n", "def heatmap(rowLabels: Array[String], columnLabels: Array[String], z: Array[Array[Double]], palette: Array[Color]): Canvas\n", "``` \n", "where `z` is the matrix to display and the optional parameters `x` and `y` are the coordinates of data matrix cells, which must be in ascending order. Alternatively, one can also provide labels as the coordinates, which is a common practice in cluster analysis.\n", "\n", "In what follows, we display the heat map of a matrix. We starts with a small `4 x 4` matrix and enlarge it with bicubic interpolation. We also use the helper class Palette to generate the color scheme. This class provides many other color schemes." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "// the matrix to display\n", "val z = Array(\n", " Array(1.0, 2.0, 4.0, 1.0),\n", " Array(6.0, 3.0, 5.0, 2.0),\n", " Array(4.0, 2.0, 1.0, 5.0),\n", " Array(5.0, 4.0, 2.0, 3.0)\n", ")\n", "\n", "// make the matrix larger with bicubic interpolation\n", "val x = Array(0.0, 1.0, 2.0, 3.0)\n", "val y = Array(0.0, 1.0, 2.0, 3.0)\n", "val bicubic = new BicubicInterpolation(x, y, z)\n", "val Z = Array.ofDim[Double](101, 101)\n", "for (i <- 0 to 100) {\n", " for (j <- 0 to 100)\n", " Z(i)(j) = bicubic.interpolate(i * 0.03, j * 0.03)\n", "}\n", "\n", "show(heatmap(Z, Palette.jet(256)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A special case of heat map is to draw the sparsity pattern of a matrix.\n", "```\n", "def spy(matrix: SparseMatrix, k: Int = 1): Canvas\n", "``` \n", "The structure of sparse matrix is critical in solving linear systems. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "val sparse = SparseMatrix.text(java.nio.file.Paths.get(\"../data/matrix/mesh2em5.txt\"))\n", "val canvas = spy(sparse)\n", "canvas.setTitle(\"mesh2em5\")\n", "show(canvas)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Contour\n", "\n", "A contour plot represents a 3-dimensional surface by plotting constant z slices, called contours, on a 2-dimensional format. That is, given a value for z, lines are drawn for connecting the (x, y) coordinates where that z value occurs.\n", "```\n", "def contour(z: Array[Array[Double]]): Canvas\n", "def contour(z: Array[Array[Double]], levels: Array[Double]): Canvas\n", "def contour(x: Array[Double], y: Array[Double], z: Array[Array[Double]]): Canvas\n", "def contour(x: Array[Double], y: Array[Double], z: Array[Array[Double]], levels: Array[Double]): Canvas\n", "``` \n", "Similar to heatmap, the parameters x and y are the coordinates of data matrix cells, which must be in ascending order. The slice values can be automatically determined from the data, or provided through the parameter levels.\n", "\n", "Contours are often jointly used with the heat map. In the following example, we add the contour lines to the previous heat map exampl." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "val canvas = heatmap(Z, Palette.jet(256))\n", "canvas.add(Contour.of(Z))\n", "show(canvas)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This example also shows how to mix multiple plots together." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Surface\n", "Besides heat map and contour, we can also visualize a matrix with the three-dimensional shaded surface.\n", "```\n", "def surface(z: Array[Array[Double]], palette: Array[Color] = Palette.jet(16)): Canvas\n", "\n", "def surface(x: Array[Double], y: Array[Double], z: Array[Array[Double]], palette: Array[Color]): Canvas\n", "``` \n", "The usage is similar with heatmap and contour functions." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "show(surface(Z, Palette.jet(256, 1.0f)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Wireframe\n", "\n", "The wireframe model is a visual presentation of a three-dimensional physical object. A wireframe model consists of two tables, the vertex table and the edge table. Each entry of the vertex table records a vertex and its coordinate values, while each entry of the edge table has two components giving the two incident vertices of that edge.\n", "```\n", "def wireframe(vertices: Array[Array[Double]], edges: Array[Array[Int]]): Canvas\n", "``` \n", "where vertices is an `n x 2` or `n x 3` array which are coordinates of `n` vertices, and edges is an `m x 2` array of which each row is the vertex indices of two end points of each edge." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "val (vertices, edges) = read.wavefront(\"../data/wavefront/teapot.obj\")\n", "show(wireframe(vertices, edges))" ] } ], "metadata": { "kernelspec": { "display_name": "Scala (2.13)", "language": "scala", "name": "scala213" }, "language_info": { "codemirror_mode": "text/x-scala", "file_extension": ".scala", "mimetype": "text/x-scala", "name": "scala", "nbconvert_exporter": "script", "version": "2.13.1" } }, "nbformat": 4, "nbformat_minor": 4 }