"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data and Statistics Packages"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Contents\n",
"\n",
"- [Data and Statistics Packages](#Data-and-Statistics-Packages) \n",
" - [Overview](#Overview) \n",
" - [DataFrames](#DataFrames) \n",
" - [Statistics and Econometrics](#Statistics-and-Econometrics) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Overview\n",
"\n",
"This lecture explores some of the key packages for working with data and doing statistics in Julia\n",
"\n",
"In particular, we will examine the `DataFrame` object in detail (i.e., construction, manipulation, querying, visualization, and nuances like missing data)\n",
"\n",
"While Julia is not an ideal language for pure cookie-cutter statistical analysis, it has many useful packages to provide those tools as part of a more general solution\n",
"\n",
"Examples include `GLM.jl` and `FixedEffectModels.jl`, which we discuss\n",
"\n",
"This list is not exhaustive, and others can be found in organizations such as [JuliaStats](https://github.com/JuliaStats), [JuliaData](https://github.com/JuliaData/), and [QueryVerse](https://github.com/queryverse)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Setup"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hide-output": true
},
"outputs": [],
"source": [
"using InstantiateFromURL\n",
"github_project(\"QuantEcon/quantecon-notebooks-julia\", version = \"0.2.0\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hide-output": true
},
"outputs": [],
"source": [
"using LinearAlgebra, Statistics, Compat\n",
"using DataFrames, RDatasets, DataFramesMeta, CategoricalArrays, Query, VegaLite\n",
"using DataVoyager, GLM, FixedEffectModels"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## DataFrames\n",
"\n",
"A useful package for working with data is [DataFrames.jl](https://github.com/JuliaStats/DataFrames.jl)\n",
"\n",
"The most important data type provided is a `DataFrame`, a two dimensional array for storing heterogeneous data\n",
"\n",
"Although data can be heterogeneous within a `DataFrame`, the contents of the columns must be homogeneous\n",
"(of the same type)\n",
"\n",
"This is analogous to a `data.frame` in R, a `DataFrame` in Pandas (Python) or, more loosely, a spreadsheet in Excel\n",
"\n",
"There are a few different ways to create a DataFrame"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Constructing a DataFrame\n",
"\n",
"The first is to set up columns and construct a dataframe by assigning names"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hide-output": false
},
"outputs": [],
"source": [
"using DataFrames, RDatasets # RDatasets provides good standard data examples from R\n",
"\n",
"# note use of missing\n",
"commodities = [\"crude\", \"gas\", \"gold\", \"silver\"]\n",
"last_price = [4.2, 11.3, 12.1, missing]\n",
"df = DataFrame(commod = commodities, price = last_price)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Columns of the `DataFrame` can be accessed by name using a symbol `df[!, :col]` or a struct-style `df.col`, as below"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hide-output": false
},
"outputs": [],
"source": [
"df[!, :price]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hide-output": false
},
"outputs": [],
"source": [
"df.price"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that the type of this array has values `Union{Missing, Float64}` since it was created with a `missing` value"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hide-output": false
},
"outputs": [],
"source": [
"df.commod"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `DataFrames.jl` package provides a number of methods for acting on `DataFrame`’s, such as `describe`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hide-output": false
},
"outputs": [],
"source": [
"DataFrames.describe(df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"While often data will be generated all at once, or read from a file, you can add to a `DataFrame` by providing the key parameters"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hide-output": false
},
"outputs": [],
"source": [
"nt = (commod = \"nickel\", price= 5.1)\n",
"push!(df, nt)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Named tuples can also be used to construct a `DataFrame`, and have it properly deduce all types"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hide-output": false
},
"outputs": [],
"source": [
"nt = (t = 1, col1 = 3.0)\n",
"df2 = DataFrame([nt])\n",
"push!(df2, (t=2, col1 = 4.0))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Working with Missing\n",
"\n",
"As we discussed in [fundamental types](getting_started_julia/fundamental_types.ipynb#missing), the semantics of `missing` are that mathematical operations will not silently ignore it\n",
"\n",
"In order to allow `missing` in a column, you can create/load the `DataFrame`\n",
"from a source with `missing`’s, or call `allowmissing!` on a column"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hide-output": false
},
"outputs": [],
"source": [
"allowmissing!(df2, :col1) # necessary to add in a for col1\n",
"push!(df2, (t=3, col1 = missing))\n",
"push!(df2, (t=4, col1 = 5.1))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see the propagation of `missing` to caller functions, as well as a way to efficiently calculate with non-missing data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hide-output": false
},
"outputs": [],
"source": [
"@show mean(df2.col1)\n",
"@show mean(skipmissing(df2.col1))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And to replace the `missing`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hide-output": false
},
"outputs": [],
"source": [
"df2.col1 .= coalesce.(df2.col1, 0.0) # replace all missing with 0.0"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Manipulating and Transforming DataFrames\n",
"\n",
"One way to do an additional calculation with a `DataFrame` is to tuse the `@transform` macro from `DataFramesMeta.jl`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hide-output": false
},
"outputs": [],
"source": [
"using DataFramesMeta\n",
"f(x) = x^2\n",
"df2 = @transform(df2, col2 = f.(:col1))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Categorical Data\n",
"\n",
"For data that is [categorical](https://juliadata.github.io/DataFrames.jl/stable/man/categorical.html#Categorical-Data-1)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hide-output": false
},
"outputs": [],
"source": [
"using CategoricalArrays\n",
"id = [1, 2, 3, 4]\n",
"y = [\"old\", \"young\", \"young\", \"old\"]\n",
"y = CategoricalArray(y)\n",
"df = DataFrame(id=id, y=y)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hide-output": false
},
"outputs": [],
"source": [
"levels(df.y)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Visualization, Querying, and Plots\n",
"\n",
"The `DataFrame` (and similar types that fulfill a standard generic interface) can fit into a variety of packages\n",
"\n",
"One set of them is the [QueryVerse](https://github.com/queryverse)\n",
"\n",
"**Note:** The QueryVerse, in the same spirit as R’s tidyverse, makes heavy use of the pipeline syntax `|>`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hide-output": false
},
"outputs": [],
"source": [
"x = 3.0\n",
"f(x) = x^2\n",
"g(x) = log(x)\n",
"\n",
"@show g(f(x))\n",
"@show x |> f |> g; # pipes nest function calls"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To give an example directly from the source of the LINQ inspired [Query.jl](http://www.queryverse.org/Query.jl/stable/)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hide-output": false
},
"outputs": [],
"source": [
"using Query\n",
"\n",
"df = DataFrame(name=[\"John\", \"Sally\", \"Kirk\"], age=[23., 42., 59.], children=[3,5,2])\n",
"\n",
"x = @from i in df begin\n",
" @where i.age>50\n",
" @select {i.name, i.children}\n",
" @collect DataFrame\n",
"end"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"While it is possible to just use the `Plots.jl` library, there may be better options for displaying tabular data – such as [VegaLite.jl](https://github.com/queryverse/VegaLite.jl)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hide-output": false
},
"outputs": [],
"source": [
"using RDatasets, VegaLite\n",
"iris = dataset(\"datasets\", \"iris\")\n",
"\n",
"iris |> @vlplot(\n",
" :point,\n",
" x=:PetalLength,\n",
" y=:PetalWidth,\n",
" color=:Species\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Another useful tool for exploring tabular data is [DataVoyager.jl](https://github.com/queryverse/DataVoyager.jl)"
]
},
{
"cell_type": "markdown",
"metadata": {
"hide-output": false
},
"source": [
"```julia\n",
"using DataVoyager\n",
"iris |> Voyager()\n",
"```\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `Voyager()` function creates a separate window for analysis"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Statistics and Econometrics\n",
"\n",
"While Julia is not intended as a replacement for R, Stata, and similar specialty languages, it has a growing number of packages aimed at statistics and econometrics\n",
"\n",
"Many of the packages live in the [JuliaStats organization](https://github.com/JuliaStats/)\n",
"\n",
"A few to point out\n",
"\n",
"- [StatsBase](https://github.com/JuliaStats/StatsBase.jl) has basic statistical functions such as geometric and harmonic means, auto-correlations, robust statistics, etc. \n",
"- [StatsFuns](https://github.com/JuliaStats/StatsFuns.jl) has a variety of mathematical functions and constants such as pdf and cdf of many distributions, softmax, etc. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### General Linear Models\n",
"\n",
"To run linear regressions and similar statistics, use the [GLM](http://juliastats.github.io/GLM.jl/latest/) package"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hide-output": false
},
"outputs": [],
"source": [
"using GLM\n",
"\n",
"x = randn(100)\n",
"y = 0.9 .* x + 0.5 * rand(100)\n",
"df = DataFrame(x=x, y=y)\n",
"ols = lm(@formula(y ~ x), df) # R-style notation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Fixed Effects\n",
"\n",
"While Julia may be overkill for estimating a simple linear regression,\n",
"fixed-effects estimation with dummies for multiple variables are much more computationally intensive\n",
"\n",
"For a 2-way fixed-effect, taking the example directly from the documentation using [cigarette consumption data](https://github.com/johnmyleswhite/RDatasets.jl/blob/master/doc/plm/rst/Cigar.rst)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hide-output": false
},
"outputs": [],
"source": [
"using FixedEffectModels\n",
"cigar = dataset(\"plm\", \"Cigar\")\n",
"cigar.StateCategorical = categorical(cigar.State)\n",
"cigar.YearCategorical = categorical(cigar.Year)\n",
"fixedeffectresults = reg(cigar, @model(Sales ~ NDI, fe = StateCategorical + YearCategorical,\n",
" weights = Pop, vcov = cluster(StateCategorical)))\n",
"regtable(fixedeffectresults)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To explore the data use the interactive DataVoyager and VegaLite"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hide-output": false
},
"outputs": [],
"source": [
"# cigar |> Voyager()\n",
"\n",
"cigar |> @vlplot(\n",
" :point,\n",
" x=:Price,\n",
" y=:Sales,\n",
" color=:Year,\n",
" size=:NDI\n",
")"
]
}
],
"metadata": {
"filename": "data_statistical_packages.rst",
"kernelspec": {
"display_name": "Julia 1.2",
"language": "julia",
"name": "julia-1.2"
},
"title": "Data and Statistics Packages"
},
"nbformat": 4,
"nbformat_minor": 2
}