{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Julia is getting more and more interesting as a way to translate math into beautiful-looking code. After some halted attempts earlier at Julia 1.0 and 1.4, I'm giving it a try again at 1.6 after it seems like Julia is looking more and more stable. So here I'll write down a simple task that would be trivial to do in Python, except I'll do it in Julia instead. \n", "\n", "I want to:\n", "\n", "1. Read a CSV file from the web (i.e. in a github repository) as a dataframe\n", "2. Do some data cleaning - text replacement, fill na's, etc" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Load a CSV File" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "using HTTP, CSV, DataFrames, PyCall, DataFramesMeta" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Use a GET command to get some data, in this case it's a String\n", "# r = HTTP.get(\"https://api.github.com/users/banditkings\").body |> String;" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "url = \"https://ocw.mit.edu/courses/sloan-school-of-management/15-071-the-analytics-edge-spring-2017/linear-regression/assignment-2/climate_change.csv\";" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Docstrings must be placed before a function call. On the plus side, it seems to render the docstring really nicely:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "\"\"\"\n", "Basic function call example. In this case we're using it to fetch a CSV file from\n", "github here.\n", "\n", "You can even use latex in here! \\$f(x) = Ax + b\\$\n", "\n", "and HTML!!\n", "\n", "Parameters\n", "----------\n", "url : String\n", "* url address for the CSV file that you want to read in as a dataframe\n", "\n", "Returns\n", "-------\n", "dataframe\n", "\"\"\" \n", "function get_data(url)\n", " body = HTTP.get(url).body\n", " df = CSV.read(body, DataFrame);\n", "end;" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/latex": [ "Basic function call example. In this case we're using it to fetch a CSV file from github here.\n", "\n", "You can even use latex in here! $f(x) = Ax + b$\n", "\n", "and HTML!!\n", "\n", "\\subsection{Parameters}\n", "url : String\n", "\n", "\\begin{itemize}\n", "\\item url address for the CSV file that you want to read in as a dataframe\n", "\n", "\\end{itemize}\n", "\\subsection{Returns}\n", "dataframe\n", "\n" ], "text/markdown": [ "Basic function call example. In this case we're using it to fetch a CSV file from github here.\n", "\n", "You can even use latex in here! $f(x) = Ax + b$\n", "\n", "and HTML!!\n", "\n", "## Parameters\n", "\n", "url : String\n", "\n", " * url address for the CSV file that you want to read in as a dataframe\n", "\n", "## Returns\n", "\n", "dataframe\n" ], "text/plain": [ " Basic function call example. In this case we're using it to fetch a CSV file\n", " from github here.\n", "\n", " You can even use latex in here! \u001b[35mf(x) = Ax + b\u001b[39m\n", "\n", " and HTML!!\n", "\n", "\u001b[1m Parameters\u001b[22m\n", "\u001b[1m ============\u001b[22m\n", "\n", " url : String\n", "\n", " • url address for the CSV file that you want to read in as a\n", " dataframe\n", "\n", "\u001b[1m Returns\u001b[22m\n", "\u001b[1m =========\u001b[22m\n", "\n", " dataframe" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "?get_data() # display the docstring we just generated" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***\n", "### Control Flows and `try`:`catch`\n", "\n", "Our little `get_data` function is fine but I don't want to constantly be downloading this file if I don't have to...how about something like python's `try`:`except` and learn a little about how Julia handles control flows?\n", "\n", "* **Goal:** I want to try to load from CSV file if it's available, and load from a url if it doesn't exist. \n", "\n", "* Reference: [Control Flows](https://docs.julialang.org/en/v1/manual/control-flow/) and `try`:`catch` [here](https://docs.julialang.org/en/v1/manual/control-flow/#The-try/catch-statement)!" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "try\n", " df = CSV.read(\"climate_change.csv\", DataFrame);\n", "catch\n", " println(\"No CSV Found, Downloading from source\")\n", " df = get_data(url);\n", " # We can also save the csv file locally\n", " CSV.write(\"climate_change.csv\", df)\n", "end;\n", "\n", "df = CSV.read(\"climate_change.csv\", DataFrame);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Python vs Julia: appending an item to a list (and PyCall in Jupyter)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another key difference from python is appending lists together. In python, you'll create a list like:\n", "\n", "```python\n", ">>> foo = [1]\n", ">>> foo = foo + ['a', 'b']\n", ">>> foo\n", "[1, 'a', 'b']\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Actually why don't we use the `PyCall` module to demonstrate the above? With `PyCall`, you can call Python and get functions and variables out. I tried this a while back and it seemed to be really bad at supporting pandas dataframes but things like numpy arrays are fine. " ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3-element Vector{Any}:\n", " 1\n", " \"a\"\n", " \"b\"" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Example of using Python within a jupyter notebook cell with a Julia kernel\n", "\n", "py\"\"\"\n", "foo = [1]\n", "foo = foo + ['a', 'b']\n", "\"\"\"\n", "\n", "# Reference an existing variable or function\n", "py\"foo\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now here's the same thing in Julia:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2-element Vector{Any}:\n", " 1\n", " \"a\"" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "foo = Any[1]; # need to set to Any or it'll assume it's an Integer and it'll fail on the two strings there\n", "append!(foo, [\"a\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### What is that `!` symbol?\n", "\n", "Okay notice the exclamation mark `!` after `append` - this is special Julia notation for functions that modify something inplace. In this case, `append!` adds an item to the list in place. You could use `push!` if adding a single item. And there's no `append` function, only `append!`. How do you do something where you don't modify it in place? You can't, but you can make a copy:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3-element Vector{Any}:\n", " 1\n", " \"a\"\n", " \"b\"" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bar = foo |> copy\n", "append!(bar, [\"b\"])" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "false" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "foo == bar" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## DataFrame Manipulation\n", "\n", "With Base Julia:\n", "[Documentation](https://dataframes.juliadata.org/stable/man/working_with_dataframes/)\n", "\n", "`head` and `tail` become `first` and `last` with Julia." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

5 rows × 11 columns (omitted printing of 1 columns)

YearMonthMEICO2CH4N2OCFC-11CFC-12TSIAerosols
Int64Int64Float64Float64Float64Float64Float64Float64Float64Float64
1198352.556345.961638.59303.677191.324350.1131366.10.0863
2198362.167345.521633.71303.746192.057351.8481366.120.0794
3198371.741344.151633.22303.795192.818353.7251366.290.0731
4198381.13342.251631.35303.839193.602355.6331366.420.0673
5198390.428340.171648.4303.901194.392357.4651366.230.0619
" ], "text/latex": [ "\\begin{tabular}{r|ccccccccccc}\n", "\t& Year & Month & MEI & CO2 & CH4 & N2O & CFC-11 & CFC-12 & TSI & Aerosols & \\\\\n", "\t\\hline\n", "\t& Int64 & Int64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & \\\\\n", "\t\\hline\n", "\t1 & 1983 & 5 & 2.556 & 345.96 & 1638.59 & 303.677 & 191.324 & 350.113 & 1366.1 & 0.0863 & $\\dots$ \\\\\n", "\t2 & 1983 & 6 & 2.167 & 345.52 & 1633.71 & 303.746 & 192.057 & 351.848 & 1366.12 & 0.0794 & $\\dots$ \\\\\n", "\t3 & 1983 & 7 & 1.741 & 344.15 & 1633.22 & 303.795 & 192.818 & 353.725 & 1366.29 & 0.0731 & $\\dots$ \\\\\n", "\t4 & 1983 & 8 & 1.13 & 342.25 & 1631.35 & 303.839 & 193.602 & 355.633 & 1366.42 & 0.0673 & $\\dots$ \\\\\n", "\t5 & 1983 & 9 & 0.428 & 340.17 & 1648.4 & 303.901 & 194.392 & 357.465 & 1366.23 & 0.0619 & $\\dots$ \\\\\n", "\\end{tabular}\n" ], "text/plain": [ "\u001b[1m5×11 DataFrame\u001b[0m\n", "\u001b[1m Row \u001b[0m│\u001b[1m Year \u001b[0m\u001b[1m Month \u001b[0m\u001b[1m MEI \u001b[0m\u001b[1m CO2 \u001b[0m\u001b[1m CH4 \u001b[0m\u001b[1m N2O \u001b[0m\u001b[1m CFC-11 \u001b[0m\u001b[1m CFC-12 \u001b[0m\u001b[1m TSI\u001b[0m ⋯\n", "\u001b[1m \u001b[0m│\u001b[90m Int64 \u001b[0m\u001b[90m Int64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Flo\u001b[0m ⋯\n", "─────┼──────────────────────────────────────────────────────────────────────────\n", " 1 │ 1983 5 2.556 345.96 1638.59 303.677 191.324 350.113 136 ⋯\n", " 2 │ 1983 6 2.167 345.52 1633.71 303.746 192.057 351.848 136\n", " 3 │ 1983 7 1.741 344.15 1633.22 303.795 192.818 353.725 136\n", " 4 │ 1983 8 1.13 342.25 1631.35 303.839 193.602 355.633 136\n", " 5 │ 1983 9 0.428 340.17 1648.4 303.901 194.392 357.465 136 ⋯\n", "\u001b[36m 3 columns omitted\u001b[0m" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "first(df, 5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also pipe commands as well with Julia's base pipe `|>`" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

DataFrameRow (11 columns)

YearMonthMEICO2CH4N2OCFC-11CFC-12TSIAerosols
Int64Int64Float64Float64Float64Float64Float64Float64Float64Float64
308200812-0.666385.561812.88322.182244.204535.0051365.690.0046
" ], "text/latex": [ "\\begin{tabular}{r|ccccccccccc}\n", "\t& Year & Month & MEI & CO2 & CH4 & N2O & CFC-11 & CFC-12 & TSI & Aerosols & \\\\\n", "\t\\hline\n", "\t& Int64 & Int64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & \\\\\n", "\t\\hline\n", "\t308 & 2008 & 12 & -0.666 & 385.56 & 1812.88 & 322.182 & 244.204 & 535.005 & 1365.69 & 0.0046 & $\\dots$ \\\\\n", "\\end{tabular}\n" ], "text/plain": [ "\u001b[1mDataFrameRow\u001b[0m\n", "\u001b[1m Row \u001b[0m│\u001b[1m Year \u001b[0m\u001b[1m Month \u001b[0m\u001b[1m MEI \u001b[0m\u001b[1m CO2 \u001b[0m\u001b[1m CH4 \u001b[0m\u001b[1m N2O \u001b[0m\u001b[1m CFC-11 \u001b[0m\u001b[1m CFC-12 \u001b[0m\u001b[1m TSI\u001b[0m ⋯\n", "\u001b[1m \u001b[0m│\u001b[90m Int64 \u001b[0m\u001b[90m Int64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Flo\u001b[0m ⋯\n", "─────┼──────────────────────────────────────────────────────────────────────────\n", " 308 │ 2008 12 -0.666 385.56 1812.88 322.182 244.204 535.005 136 ⋯\n", "\u001b[36m 3 columns omitted\u001b[0m" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df |> last" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How about the last 5 by pipe `|>`? Well it's not as good as R's `dplyr` unfortunately...the base pipe function in Julia can't handle multiple parameters. There are other libraries like `Pipe` and `Chain` that make use of macros, but compared to `dplyr` or even method chaining in Python, they both are too clunky. \n", "\n", "Libraries like the [Query](https://github.com/queryverse/Query.jl) from the [Queryverse](https://www.queryverse.org) attempt to address this but one of the many issues of trying to do work in a new language that's decades behind Python and R. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Indexing and Slicing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Indexing is just like pandas like:\n", "```Julia\n", "df[rows, columns]\n", "```" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

6 rows × 11 columns (omitted printing of 1 columns)

YearMonthMEICO2CH4N2OCFC-11CFC-12TSIAerosols
Int64Int64Float64Float64Float64Float64Float64Float64Float64Float64
1198390.428340.171648.4303.901194.392357.4651366.230.0619
21983100.002340.31663.79303.97195.171359.1741366.060.0569
3198311-0.176341.531658.23304.032195.921360.7581366.110.0524
4198312-0.176343.071654.31304.082196.609362.1741366.060.0486
519841-0.339344.051658.98304.13197.219363.3591365.430.0451
619842-0.565344.771656.48304.194197.759364.2961365.660.0416
" ], "text/latex": [ "\\begin{tabular}{r|ccccccccccc}\n", "\t& Year & Month & MEI & CO2 & CH4 & N2O & CFC-11 & CFC-12 & TSI & Aerosols & \\\\\n", "\t\\hline\n", "\t& Int64 & Int64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & \\\\\n", "\t\\hline\n", "\t1 & 1983 & 9 & 0.428 & 340.17 & 1648.4 & 303.901 & 194.392 & 357.465 & 1366.23 & 0.0619 & $\\dots$ \\\\\n", "\t2 & 1983 & 10 & 0.002 & 340.3 & 1663.79 & 303.97 & 195.171 & 359.174 & 1366.06 & 0.0569 & $\\dots$ \\\\\n", "\t3 & 1983 & 11 & -0.176 & 341.53 & 1658.23 & 304.032 & 195.921 & 360.758 & 1366.11 & 0.0524 & $\\dots$ \\\\\n", "\t4 & 1983 & 12 & -0.176 & 343.07 & 1654.31 & 304.082 & 196.609 & 362.174 & 1366.06 & 0.0486 & $\\dots$ \\\\\n", "\t5 & 1984 & 1 & -0.339 & 344.05 & 1658.98 & 304.13 & 197.219 & 363.359 & 1365.43 & 0.0451 & $\\dots$ \\\\\n", "\t6 & 1984 & 2 & -0.565 & 344.77 & 1656.48 & 304.194 & 197.759 & 364.296 & 1365.66 & 0.0416 & $\\dots$ \\\\\n", "\\end{tabular}\n" ], "text/plain": [ "\u001b[1m6×11 DataFrame\u001b[0m\n", "\u001b[1m Row \u001b[0m│\u001b[1m Year \u001b[0m\u001b[1m Month \u001b[0m\u001b[1m MEI \u001b[0m\u001b[1m CO2 \u001b[0m\u001b[1m CH4 \u001b[0m\u001b[1m N2O \u001b[0m\u001b[1m CFC-11 \u001b[0m\u001b[1m CFC-12 \u001b[0m\u001b[1m TSI\u001b[0m ⋯\n", "\u001b[1m \u001b[0m│\u001b[90m Int64 \u001b[0m\u001b[90m Int64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Flo\u001b[0m ⋯\n", "─────┼──────────────────────────────────────────────────────────────────────────\n", " 1 │ 1983 9 0.428 340.17 1648.4 303.901 194.392 357.465 136 ⋯\n", " 2 │ 1983 10 0.002 340.3 1663.79 303.97 195.171 359.174 136\n", " 3 │ 1983 11 -0.176 341.53 1658.23 304.032 195.921 360.758 136\n", " 4 │ 1983 12 -0.176 343.07 1654.31 304.082 196.609 362.174 136\n", " 5 │ 1984 1 -0.339 344.05 1658.98 304.13 197.219 363.359 136 ⋯\n", " 6 │ 1984 2 -0.565 344.77 1656.48 304.194 197.759 364.296 136\n", "\u001b[36m 3 columns omitted\u001b[0m" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# get Rows 5-10 (INCLUSIVE), and all columns\n", "df[5:10, :]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can exclude columns with `Not()`, and you can use regex, too. What a dream!!" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "

DataFrameRow (7 columns)

YearMonthMEIN2OTSIAerosolsTemp
Int64Int64Float64Float64Float64Float64Float64
1198352.556303.6771366.10.08630.109
" ], "text/latex": [ "\\begin{tabular}{r|ccccccc}\n", "\t& Year & Month & MEI & N2O & TSI & Aerosols & Temp\\\\\n", "\t\\hline\n", "\t& Int64 & Int64 & Float64 & Float64 & Float64 & Float64 & Float64\\\\\n", "\t\\hline\n", "\t1 & 1983 & 5 & 2.556 & 303.677 & 1366.1 & 0.0863 & 0.109 \\\\\n", "\\end{tabular}\n" ], "text/plain": [ "\u001b[1mDataFrameRow\u001b[0m\n", "\u001b[1m Row \u001b[0m│\u001b[1m Year \u001b[0m\u001b[1m Month \u001b[0m\u001b[1m MEI \u001b[0m\u001b[1m N2O \u001b[0m\u001b[1m TSI \u001b[0m\u001b[1m Aerosols \u001b[0m\u001b[1m Temp \u001b[0m\n", "\u001b[1m \u001b[0m│\u001b[90m Int64 \u001b[0m\u001b[90m Int64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\n", "─────┼────────────────────────────────────────────────────────────\n", " 1 │ 1983 5 2.556 303.677 1366.1 0.0863 0.109" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Exclude all columns that have a capital C in it\n", "df[1, Not(r\"C.*\")]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Checking out the data:\n", "\n", "* `size(df)` : To get dataframe dimensions you need to use `size` instead of `length` or `len` or `shape`:\n", "* `describe(df)` : Gives you summary info" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(308, 11)" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "size(df)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

11 rows × 7 columns

variablemeanminmedianmaxnmissingeltype
SymbolFloat64RealFloat64RealInt64DataType
1Year1995.6619831996.020080Int64
2Month6.5519517.0120Int64
3MEI0.275555-1.6350.23753.0010Float64
4CO2363.227340.17361.735388.50Float64
5CH41749.821629.891764.041814.180Float64
6N2O312.392303.677311.507322.1820Float64
7CFC-11251.973191.324258.344271.4940Float64
8CFC-12497.525350.113528.356543.8130Float64
9TSI1366.071365.431365.981367.320Float64
10Aerosols0.01665710.00160.005750.14940Float64
11Temp0.256776-0.2820.2480.7390Float64
" ], "text/latex": [ "\\begin{tabular}{r|ccccccc}\n", "\t& variable & mean & min & median & max & nmissing & eltype\\\\\n", "\t\\hline\n", "\t& Symbol & Float64 & Real & Float64 & Real & Int64 & DataType\\\\\n", "\t\\hline\n", "\t1 & Year & 1995.66 & 1983 & 1996.0 & 2008 & 0 & Int64 \\\\\n", "\t2 & Month & 6.55195 & 1 & 7.0 & 12 & 0 & Int64 \\\\\n", "\t3 & MEI & 0.275555 & -1.635 & 0.2375 & 3.001 & 0 & Float64 \\\\\n", "\t4 & CO2 & 363.227 & 340.17 & 361.735 & 388.5 & 0 & Float64 \\\\\n", "\t5 & CH4 & 1749.82 & 1629.89 & 1764.04 & 1814.18 & 0 & Float64 \\\\\n", "\t6 & N2O & 312.392 & 303.677 & 311.507 & 322.182 & 0 & Float64 \\\\\n", "\t7 & CFC-11 & 251.973 & 191.324 & 258.344 & 271.494 & 0 & Float64 \\\\\n", "\t8 & CFC-12 & 497.525 & 350.113 & 528.356 & 543.813 & 0 & Float64 \\\\\n", "\t9 & TSI & 1366.07 & 1365.43 & 1365.98 & 1367.32 & 0 & Float64 \\\\\n", "\t10 & Aerosols & 0.0166571 & 0.0016 & 0.00575 & 0.1494 & 0 & Float64 \\\\\n", "\t11 & Temp & 0.256776 & -0.282 & 0.248 & 0.739 & 0 & Float64 \\\\\n", "\\end{tabular}\n" ], "text/plain": [ "\u001b[1m11×7 DataFrame\u001b[0m\n", "\u001b[1m Row \u001b[0m│\u001b[1m variable \u001b[0m\u001b[1m mean \u001b[0m\u001b[1m min \u001b[0m\u001b[1m median \u001b[0m\u001b[1m max \u001b[0m\u001b[1m nmissing \u001b[0m\u001b[1m elt\u001b[0m ⋯\n", "\u001b[1m \u001b[0m│\u001b[90m Symbol \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Real \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Real \u001b[0m\u001b[90m Int64 \u001b[0m\u001b[90m Dat\u001b[0m ⋯\n", "─────┼──────────────────────────────────────────────────────────────────────────\n", " 1 │ Year 1995.66 1983 1996.0 2008 0 Int ⋯\n", " 2 │ Month 6.55195 1 7.0 12 0 Int\n", " 3 │ MEI 0.275555 -1.635 0.2375 3.001 0 Flo\n", " 4 │ CO2 363.227 340.17 361.735 388.5 0 Flo\n", " 5 │ CH4 1749.82 1629.89 1764.04 1814.18 0 Flo ⋯\n", " 6 │ N2O 312.392 303.677 311.507 322.182 0 Flo\n", " 7 │ CFC-11 251.973 191.324 258.344 271.494 0 Flo\n", " 8 │ CFC-12 497.525 350.113 528.356 543.813 0 Flo\n", " 9 │ TSI 1366.07 1365.43 1365.98 1367.32 0 Flo ⋯\n", " 10 │ Aerosols 0.0166571 0.0016 0.00575 0.1494 0 Flo\n", " 11 │ Temp 0.256776 -0.282 0.248 0.739 0 Flo\n", "\u001b[36m 1 column omitted\u001b[0m" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "describe(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Filtering DataFrames Based on Conditions\n", "\n", "How do you conditionally filter a dataframe? One way to do it is through indexing, similar to how you would do it in Pandas. \n", "\n", "Note that you can reference a dataframe column (like a pandas series) both with and without quotes, so `df.Year` is the same as `df.\"Year\"`. Handy for dealing with spaces!" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

4 rows × 11 columns (omitted printing of 1 columns)

YearMonthMEICO2CH4N2OCFC-11CFC-12TSIAerosols
Int64Int64Float64Float64Float64Float64Float64Float64Float64Float64
119851-0.561345.251663.42305.301208.537373.6231365.610.0164
219852-0.602346.061666.21305.243209.543374.6811365.710.016
319853-0.737347.661678.34305.165210.368376.0041365.660.0141
419854-0.484348.21675.24305.093211.111377.6351365.510.0138
" ], "text/latex": [ "\\begin{tabular}{r|ccccccccccc}\n", "\t& Year & Month & MEI & CO2 & CH4 & N2O & CFC-11 & CFC-12 & TSI & Aerosols & \\\\\n", "\t\\hline\n", "\t& Int64 & Int64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & \\\\\n", "\t\\hline\n", "\t1 & 1985 & 1 & -0.561 & 345.25 & 1663.42 & 305.301 & 208.537 & 373.623 & 1365.61 & 0.0164 & $\\dots$ \\\\\n", "\t2 & 1985 & 2 & -0.602 & 346.06 & 1666.21 & 305.243 & 209.543 & 374.681 & 1365.71 & 0.016 & $\\dots$ \\\\\n", "\t3 & 1985 & 3 & -0.737 & 347.66 & 1678.34 & 305.165 & 210.368 & 376.004 & 1365.66 & 0.0141 & $\\dots$ \\\\\n", "\t4 & 1985 & 4 & -0.484 & 348.2 & 1675.24 & 305.093 & 211.111 & 377.635 & 1365.51 & 0.0138 & $\\dots$ \\\\\n", "\\end{tabular}\n" ], "text/plain": [ "\u001b[1m4×11 DataFrame\u001b[0m\n", "\u001b[1m Row \u001b[0m│\u001b[1m Year \u001b[0m\u001b[1m Month \u001b[0m\u001b[1m MEI \u001b[0m\u001b[1m CO2 \u001b[0m\u001b[1m CH4 \u001b[0m\u001b[1m N2O \u001b[0m\u001b[1m CFC-11 \u001b[0m\u001b[1m CFC-12 \u001b[0m\u001b[1m TSI\u001b[0m ⋯\n", "\u001b[1m \u001b[0m│\u001b[90m Int64 \u001b[0m\u001b[90m Int64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Flo\u001b[0m ⋯\n", "─────┼──────────────────────────────────────────────────────────────────────────\n", " 1 │ 1985 1 -0.561 345.25 1663.42 305.301 208.537 373.623 136 ⋯\n", " 2 │ 1985 2 -0.602 346.06 1666.21 305.243 209.543 374.681 136\n", " 3 │ 1985 3 -0.737 347.66 1678.34 305.165 210.368 376.004 136\n", " 4 │ 1985 4 -0.484 348.2 1675.24 305.093 211.111 377.635 136\n", "\u001b[36m 3 columns omitted\u001b[0m" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[(df.Year.==1985) .& (df.\"Month\".<=4), :]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What is that `.` doing in front of the `<=`, `==` and `&` sign?\n", "\n", "The `.` sign there is doing some heavy lifting. It's an 'elementwise' operator, meaning that it does the comparison across every element in the vector `df.Year` and `df.Month`. If you don't have them this is what it would return for the `df.Year` comparison:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "false" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.Year==1985 # Returns true if ALL values of Year are equal to 1985" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When what we really want is something like a list (vector) of values that compares each element and returns either `true` or `false` for each element in each vector." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Bool[0, 0, 0, 0, 0]\n" ] } ], "source": [ "println((df.Year.==1985 )[1:5])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another example" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Bool[1, 0, 0]\n" ] } ], "source": [ "println(([1,1,0].==[1,0,1]))" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Bool[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]\n" ] } ], "source": [ "println((df.Year.==1985) .& (df.\"Month\".<=4))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`DataFrames` also has a `filter` and `subset` option to index based on row values, but they're a little messy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `filter(f, df)` function lets you filter by passing in a function, like:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

DataFrameRow (11 columns)

YearMonthMEICO2CH4N2OCFC-11CFC-12TSIAerosols
Int64Int64Float64Float64Float64Float64Float64Float64Float64Float64
119851-0.561345.251663.42305.301208.537373.6231365.610.0164
" ], "text/latex": [ "\\begin{tabular}{r|ccccccccccc}\n", "\t& Year & Month & MEI & CO2 & CH4 & N2O & CFC-11 & CFC-12 & TSI & Aerosols & \\\\\n", "\t\\hline\n", "\t& Int64 & Int64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & \\\\\n", "\t\\hline\n", "\t1 & 1985 & 1 & -0.561 & 345.25 & 1663.42 & 305.301 & 208.537 & 373.623 & 1365.61 & 0.0164 & $\\dots$ \\\\\n", "\\end{tabular}\n" ], "text/plain": [ "\u001b[1mDataFrameRow\u001b[0m\n", "\u001b[1m Row \u001b[0m│\u001b[1m Year \u001b[0m\u001b[1m Month \u001b[0m\u001b[1m MEI \u001b[0m\u001b[1m CO2 \u001b[0m\u001b[1m CH4 \u001b[0m\u001b[1m N2O \u001b[0m\u001b[1m CFC-11 \u001b[0m\u001b[1m CFC-12 \u001b[0m\u001b[1m TSI\u001b[0m ⋯\n", "\u001b[1m \u001b[0m│\u001b[90m Int64 \u001b[0m\u001b[90m Int64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Flo\u001b[0m ⋯\n", "─────┼──────────────────────────────────────────────────────────────────────────\n", " 1 │ 1985 1 -0.561 345.25 1663.42 305.301 208.537 373.623 136 ⋯\n", "\u001b[36m 3 columns omitted\u001b[0m" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "filter(:Year=> ==(1985), df) |> first" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### What is the colon `:` doing in front of `Year`?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another nuance with Julia is this little colon in front of a column name, and that's because `Year` is a `Symbol` type. This is a special thing you can do for performance, but you can still index columns by string too - handy if your column name has spaces in it." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

DataFrameRow (11 columns)

YearMonthMEICO2CH4N2OCFC-11CFC-12TSIAerosols
Int64Int64Float64Float64Float64Float64Float64Float64Float64Float64
119851-0.561345.251663.42305.301208.537373.6231365.610.0164
" ], "text/latex": [ "\\begin{tabular}{r|ccccccccccc}\n", "\t& Year & Month & MEI & CO2 & CH4 & N2O & CFC-11 & CFC-12 & TSI & Aerosols & \\\\\n", "\t\\hline\n", "\t& Int64 & Int64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & \\\\\n", "\t\\hline\n", "\t1 & 1985 & 1 & -0.561 & 345.25 & 1663.42 & 305.301 & 208.537 & 373.623 & 1365.61 & 0.0164 & $\\dots$ \\\\\n", "\\end{tabular}\n" ], "text/plain": [ "\u001b[1mDataFrameRow\u001b[0m\n", "\u001b[1m Row \u001b[0m│\u001b[1m Year \u001b[0m\u001b[1m Month \u001b[0m\u001b[1m MEI \u001b[0m\u001b[1m CO2 \u001b[0m\u001b[1m CH4 \u001b[0m\u001b[1m N2O \u001b[0m\u001b[1m CFC-11 \u001b[0m\u001b[1m CFC-12 \u001b[0m\u001b[1m TSI\u001b[0m ⋯\n", "\u001b[1m \u001b[0m│\u001b[90m Int64 \u001b[0m\u001b[90m Int64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Flo\u001b[0m ⋯\n", "─────┼──────────────────────────────────────────────────────────────────────────\n", " 1 │ 1985 1 -0.561 345.25 1663.42 305.301 208.537 373.623 136 ⋯\n", "\u001b[36m 3 columns omitted\u001b[0m" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "filter(:\"Year\"=> ==(1985), df) |> first" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What the heck is the `=>` doing after the column?\n", "\n", "More nuance, and this is usually the part where I go back to Pandas or Dplyr. What's actually happening here is that it's assigning a `Pair` like `key=>value`. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Filtering Rows " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The row logic makes sense to me more, but it's not as performant" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.190525 seconds (834.12 k allocations: 43.820 MiB, 99.68% compilation time)\n" ] }, { "data": { "text/html": [ "

DataFrameRow (11 columns)

YearMonthMEICO2CH4N2OCFC-11CFC-12TSIAerosols
Int64Int64Float64Float64Float64Float64Float64Float64Float64Float64
1198352.556345.961638.59303.677191.324350.1131366.10.0863
" ], "text/latex": [ "\\begin{tabular}{r|ccccccccccc}\n", "\t& Year & Month & MEI & CO2 & CH4 & N2O & CFC-11 & CFC-12 & TSI & Aerosols & \\\\\n", "\t\\hline\n", "\t& Int64 & Int64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & \\\\\n", "\t\\hline\n", "\t1 & 1983 & 5 & 2.556 & 345.96 & 1638.59 & 303.677 & 191.324 & 350.113 & 1366.1 & 0.0863 & $\\dots$ \\\\\n", "\\end{tabular}\n" ], "text/plain": [ "\u001b[1mDataFrameRow\u001b[0m\n", "\u001b[1m Row \u001b[0m│\u001b[1m Year \u001b[0m\u001b[1m Month \u001b[0m\u001b[1m MEI \u001b[0m\u001b[1m CO2 \u001b[0m\u001b[1m CH4 \u001b[0m\u001b[1m N2O \u001b[0m\u001b[1m CFC-11 \u001b[0m\u001b[1m CFC-12 \u001b[0m\u001b[1m TSI\u001b[0m ⋯\n", "\u001b[1m \u001b[0m│\u001b[90m Int64 \u001b[0m\u001b[90m Int64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Flo\u001b[0m ⋯\n", "─────┼──────────────────────────────────────────────────────────────────────────\n", " 1 │ 1983 5 2.556 345.96 1638.59 303.677 191.324 350.113 13 ⋯\n", "\u001b[36m 3 columns omitted\u001b[0m" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "@time begin\n", " filter(row -> (row.Year in [1983, 1984]), df) |> first\n", "end" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Remember that `->` is an anonymous function and `filter` is expecting a function as its first argument.\n", "\n", "\n", "* See Julia for Data Science for a good introductiont to `filter` and `subset`: [julia for data science](https://juliadatascience.io/filter_subset)\n", "* Another Good writeup here in this article [why do we have both subset and filter functions?](https://bkamins.github.io/julialang/2021/05/07/subset.html)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This row filtering is equivalent to:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.098306 seconds (370.27 k allocations: 18.121 MiB, 99.74% compilation time)\n" ] }, { "data": { "text/html": [ "

DataFrameRow (11 columns)

YearMonthMEICO2CH4N2OCFC-11CFC-12TSIAerosols
Int64Int64Float64Float64Float64Float64Float64Float64Float64Float64
1198352.556345.961638.59303.677191.324350.1131366.10.0863
" ], "text/latex": [ "\\begin{tabular}{r|ccccccccccc}\n", "\t& Year & Month & MEI & CO2 & CH4 & N2O & CFC-11 & CFC-12 & TSI & Aerosols & \\\\\n", "\t\\hline\n", "\t& Int64 & Int64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & \\\\\n", "\t\\hline\n", "\t1 & 1983 & 5 & 2.556 & 345.96 & 1638.59 & 303.677 & 191.324 & 350.113 & 1366.1 & 0.0863 & $\\dots$ \\\\\n", "\\end{tabular}\n" ], "text/plain": [ "\u001b[1mDataFrameRow\u001b[0m\n", "\u001b[1m Row \u001b[0m│\u001b[1m Year \u001b[0m\u001b[1m Month \u001b[0m\u001b[1m MEI \u001b[0m\u001b[1m CO2 \u001b[0m\u001b[1m CH4 \u001b[0m\u001b[1m N2O \u001b[0m\u001b[1m CFC-11 \u001b[0m\u001b[1m CFC-12 \u001b[0m\u001b[1m TSI\u001b[0m ⋯\n", "\u001b[1m \u001b[0m│\u001b[90m Int64 \u001b[0m\u001b[90m Int64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Flo\u001b[0m ⋯\n", "─────┼──────────────────────────────────────────────────────────────────────────\n", " 1 │ 1983 5 2.556 345.96 1638.59 303.677 191.324 350.113 13 ⋯\n", "\u001b[36m 3 columns omitted\u001b[0m" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "@time begin\n", " df[(df.Year .==1983) .| (df.Year .==1984), :] |> first\n", "end" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or also:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.071469 seconds (159.67 k allocations: 8.747 MiB, 99.50% compilation time)\n" ] }, { "data": { "text/html": [ "

DataFrameRow (11 columns)

YearMonthMEICO2CH4N2OCFC-11CFC-12TSIAerosols
Int64Int64Float64Float64Float64Float64Float64Float64Float64Float64
1198352.556345.961638.59303.677191.324350.1131366.10.0863
" ], "text/latex": [ "\\begin{tabular}{r|ccccccccccc}\n", "\t& Year & Month & MEI & CO2 & CH4 & N2O & CFC-11 & CFC-12 & TSI & Aerosols & \\\\\n", "\t\\hline\n", "\t& Int64 & Int64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & \\\\\n", "\t\\hline\n", "\t1 & 1983 & 5 & 2.556 & 345.96 & 1638.59 & 303.677 & 191.324 & 350.113 & 1366.1 & 0.0863 & $\\dots$ \\\\\n", "\\end{tabular}\n" ], "text/plain": [ "\u001b[1mDataFrameRow\u001b[0m\n", "\u001b[1m Row \u001b[0m│\u001b[1m Year \u001b[0m\u001b[1m Month \u001b[0m\u001b[1m MEI \u001b[0m\u001b[1m CO2 \u001b[0m\u001b[1m CH4 \u001b[0m\u001b[1m N2O \u001b[0m\u001b[1m CFC-11 \u001b[0m\u001b[1m CFC-12 \u001b[0m\u001b[1m TSI\u001b[0m ⋯\n", "\u001b[1m \u001b[0m│\u001b[90m Int64 \u001b[0m\u001b[90m Int64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Flo\u001b[0m ⋯\n", "─────┼──────────────────────────────────────────────────────────────────────────\n", " 1 │ 1983 5 2.556 345.96 1638.59 303.677 191.324 350.113 13 ⋯\n", "\u001b[36m 3 columns omitted\u001b[0m" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# using a list comprehension works for the `in` operator here\n", "@time begin\n", " df[[ year in [1983,1984] for year in df.Year], :] |> first\n", "end" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And you can combine indexing like this:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.084675 seconds (455.85 k allocations: 23.343 MiB, 99.34% compilation time)\n" ] }, { "data": { "text/html": [ "

4 rows × 11 columns (omitted printing of 1 columns)

YearMonthMEICO2CH4N2OCFC-11CFC-12TSIAerosols
Int64Int64Float64Float64Float64Float64Float64Float64Float64Float64
119841-0.339344.051658.98304.13197.219363.3591365.430.0451
219842-0.565344.771656.48304.194197.759364.2961365.660.0416
3198430.131345.461655.77304.285198.249365.0441366.170.0383
4198440.331346.771657.68304.389198.723365.6921365.570.0352
" ], "text/latex": [ "\\begin{tabular}{r|ccccccccccc}\n", "\t& Year & Month & MEI & CO2 & CH4 & N2O & CFC-11 & CFC-12 & TSI & Aerosols & \\\\\n", "\t\\hline\n", "\t& Int64 & Int64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & \\\\\n", "\t\\hline\n", "\t1 & 1984 & 1 & -0.339 & 344.05 & 1658.98 & 304.13 & 197.219 & 363.359 & 1365.43 & 0.0451 & $\\dots$ \\\\\n", "\t2 & 1984 & 2 & -0.565 & 344.77 & 1656.48 & 304.194 & 197.759 & 364.296 & 1365.66 & 0.0416 & $\\dots$ \\\\\n", "\t3 & 1984 & 3 & 0.131 & 345.46 & 1655.77 & 304.285 & 198.249 & 365.044 & 1366.17 & 0.0383 & $\\dots$ \\\\\n", "\t4 & 1984 & 4 & 0.331 & 346.77 & 1657.68 & 304.389 & 198.723 & 365.692 & 1365.57 & 0.0352 & $\\dots$ \\\\\n", "\\end{tabular}\n" ], "text/plain": [ "\u001b[1m4×11 DataFrame\u001b[0m\n", "\u001b[1m Row \u001b[0m│\u001b[1m Year \u001b[0m\u001b[1m Month \u001b[0m\u001b[1m MEI \u001b[0m\u001b[1m CO2 \u001b[0m\u001b[1m CH4 \u001b[0m\u001b[1m N2O \u001b[0m\u001b[1m CFC-11 \u001b[0m\u001b[1m CFC-12 \u001b[0m\u001b[1m TSI\u001b[0m ⋯\n", "\u001b[1m \u001b[0m│\u001b[90m Int64 \u001b[0m\u001b[90m Int64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Flo\u001b[0m ⋯\n", "─────┼──────────────────────────────────────────────────────────────────────────\n", " 1 │ 1984 1 -0.339 344.05 1658.98 304.13 197.219 363.359 136 ⋯\n", " 2 │ 1984 2 -0.565 344.77 1656.48 304.194 197.759 364.296 136\n", " 3 │ 1984 3 0.131 345.46 1655.77 304.285 198.249 365.044 136\n", " 4 │ 1984 4 0.331 346.77 1657.68 304.389 198.723 365.692 136\n", "\u001b[36m 3 columns omitted\u001b[0m" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "@time begin \n", " filter(row -> (row.Year in [1983, 1984]) & (row.Month <= 4), df)\n", "end" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or equivalently with plain indexing:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.206541 seconds (533.77 k allocations: 27.184 MiB, 13.27% gc time, 99.64% compilation time)\n" ] }, { "data": { "text/html": [ "

DataFrameRow (11 columns)

YearMonthMEICO2CH4N2OCFC-11CFC-12TSIAerosols
Int64Int64Float64Float64Float64Float64Float64Float64Float64Float64
1198352.556345.961638.59303.677191.324350.1131366.10.0863
" ], "text/latex": [ "\\begin{tabular}{r|ccccccccccc}\n", "\t& Year & Month & MEI & CO2 & CH4 & N2O & CFC-11 & CFC-12 & TSI & Aerosols & \\\\\n", "\t\\hline\n", "\t& Int64 & Int64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & \\\\\n", "\t\\hline\n", "\t1 & 1983 & 5 & 2.556 & 345.96 & 1638.59 & 303.677 & 191.324 & 350.113 & 1366.1 & 0.0863 & $\\dots$ \\\\\n", "\\end{tabular}\n" ], "text/plain": [ "\u001b[1mDataFrameRow\u001b[0m\n", "\u001b[1m Row \u001b[0m│\u001b[1m Year \u001b[0m\u001b[1m Month \u001b[0m\u001b[1m MEI \u001b[0m\u001b[1m CO2 \u001b[0m\u001b[1m CH4 \u001b[0m\u001b[1m N2O \u001b[0m\u001b[1m CFC-11 \u001b[0m\u001b[1m CFC-12 \u001b[0m\u001b[1m TSI\u001b[0m ⋯\n", "\u001b[1m \u001b[0m│\u001b[90m Int64 \u001b[0m\u001b[90m Int64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Flo\u001b[0m ⋯\n", "─────┼──────────────────────────────────────────────────────────────────────────\n", " 1 │ 1983 5 2.556 345.96 1638.59 303.677 191.324 350.113 13 ⋯\n", "\u001b[36m 3 columns omitted\u001b[0m" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# using a list comprehension works because there's no elementwise `.in` or `.∈` operator here\n", "@time begin\n", " condition1 = [year in [1983,1984] for year in df.Year]\n", " condition2 = [month <= 5 for month in df.Month]\n", " df[condition1 .& condition2, :] |> first\n", "end" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So the nice thing about `filter` is that you can apply functions and stuff to it, like you could do calculations and then do a boolean off of the calculation, etc etc. Either way it's not too slow." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Subset(df, args)\n", "\n", "Returns a copy of the dataframe where args are true" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

8 rows × 11 columns (omitted printing of 1 columns)

YearMonthMEICO2CH4N2OCFC-11CFC-12TSIAerosols
Int64Int64Float64Float64Float64Float64Float64Float64Float64Float64
1198352.556345.961638.59303.677191.324350.1131366.10.0863
2198362.167345.521633.71303.746192.057351.8481366.120.0794
3198371.741344.151633.22303.795192.818353.7251366.290.0731
4198381.13342.251631.35303.839193.602355.6331366.420.0673
5198390.428340.171648.4303.901194.392357.4651366.230.0619
61983100.002340.31663.79303.97195.171359.1741366.060.0569
7198311-0.176341.531658.23304.032195.921360.7581366.110.0524
8198312-0.176343.071654.31304.082196.609362.1741366.060.0486
" ], "text/latex": [ "\\begin{tabular}{r|ccccccccccc}\n", "\t& Year & Month & MEI & CO2 & CH4 & N2O & CFC-11 & CFC-12 & TSI & Aerosols & \\\\\n", "\t\\hline\n", "\t& Int64 & Int64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & \\\\\n", "\t\\hline\n", "\t1 & 1983 & 5 & 2.556 & 345.96 & 1638.59 & 303.677 & 191.324 & 350.113 & 1366.1 & 0.0863 & $\\dots$ \\\\\n", "\t2 & 1983 & 6 & 2.167 & 345.52 & 1633.71 & 303.746 & 192.057 & 351.848 & 1366.12 & 0.0794 & $\\dots$ \\\\\n", "\t3 & 1983 & 7 & 1.741 & 344.15 & 1633.22 & 303.795 & 192.818 & 353.725 & 1366.29 & 0.0731 & $\\dots$ \\\\\n", "\t4 & 1983 & 8 & 1.13 & 342.25 & 1631.35 & 303.839 & 193.602 & 355.633 & 1366.42 & 0.0673 & $\\dots$ \\\\\n", "\t5 & 1983 & 9 & 0.428 & 340.17 & 1648.4 & 303.901 & 194.392 & 357.465 & 1366.23 & 0.0619 & $\\dots$ \\\\\n", "\t6 & 1983 & 10 & 0.002 & 340.3 & 1663.79 & 303.97 & 195.171 & 359.174 & 1366.06 & 0.0569 & $\\dots$ \\\\\n", "\t7 & 1983 & 11 & -0.176 & 341.53 & 1658.23 & 304.032 & 195.921 & 360.758 & 1366.11 & 0.0524 & $\\dots$ \\\\\n", "\t8 & 1983 & 12 & -0.176 & 343.07 & 1654.31 & 304.082 & 196.609 & 362.174 & 1366.06 & 0.0486 & $\\dots$ \\\\\n", "\\end{tabular}\n" ], "text/plain": [ "\u001b[1m8×11 DataFrame\u001b[0m\n", "\u001b[1m Row \u001b[0m│\u001b[1m Year \u001b[0m\u001b[1m Month \u001b[0m\u001b[1m MEI \u001b[0m\u001b[1m CO2 \u001b[0m\u001b[1m CH4 \u001b[0m\u001b[1m N2O \u001b[0m\u001b[1m CFC-11 \u001b[0m\u001b[1m CFC-12 \u001b[0m\u001b[1m TSI\u001b[0m ⋯\n", "\u001b[1m \u001b[0m│\u001b[90m Int64 \u001b[0m\u001b[90m Int64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Flo\u001b[0m ⋯\n", "─────┼──────────────────────────────────────────────────────────────────────────\n", " 1 │ 1983 5 2.556 345.96 1638.59 303.677 191.324 350.113 136 ⋯\n", " 2 │ 1983 6 2.167 345.52 1633.71 303.746 192.057 351.848 136\n", " 3 │ 1983 7 1.741 344.15 1633.22 303.795 192.818 353.725 136\n", " 4 │ 1983 8 1.13 342.25 1631.35 303.839 193.602 355.633 136\n", " 5 │ 1983 9 0.428 340.17 1648.4 303.901 194.392 357.465 136 ⋯\n", " 6 │ 1983 10 0.002 340.3 1663.79 303.97 195.171 359.174 136\n", " 7 │ 1983 11 -0.176 341.53 1658.23 304.032 195.921 360.758 136\n", " 8 │ 1983 12 -0.176 343.07 1654.31 304.082 196.609 362.174 136\n", "\u001b[36m 3 columns omitted\u001b[0m" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "subset(df, :Year=> ByRow(==(1983)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Subset with multiple conditions\n", "\n", "To do multiple conditions (including an OR operator) we need to use a function. Both `filter` and `subset` are powerful because of the ability to pass functions as arguments." ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

4 rows × 11 columns (omitted printing of 1 columns)

YearMonthMEICO2CH4N2OCFC-11CFC-12TSIAerosols
Int64Int64Float64Float64Float64Float64Float64Float64Float64Float64
119841-0.339344.051658.98304.13197.219363.3591365.430.0451
219842-0.565344.771656.48304.194197.759364.2961365.660.0416
3198430.131345.461655.77304.285198.249365.0441366.170.0383
4198440.331346.771657.68304.389198.723365.6921365.570.0352
" ], "text/latex": [ "\\begin{tabular}{r|ccccccccccc}\n", "\t& Year & Month & MEI & CO2 & CH4 & N2O & CFC-11 & CFC-12 & TSI & Aerosols & \\\\\n", "\t\\hline\n", "\t& Int64 & Int64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & \\\\\n", "\t\\hline\n", "\t1 & 1984 & 1 & -0.339 & 344.05 & 1658.98 & 304.13 & 197.219 & 363.359 & 1365.43 & 0.0451 & $\\dots$ \\\\\n", "\t2 & 1984 & 2 & -0.565 & 344.77 & 1656.48 & 304.194 & 197.759 & 364.296 & 1365.66 & 0.0416 & $\\dots$ \\\\\n", "\t3 & 1984 & 3 & 0.131 & 345.46 & 1655.77 & 304.285 & 198.249 & 365.044 & 1366.17 & 0.0383 & $\\dots$ \\\\\n", "\t4 & 1984 & 4 & 0.331 & 346.77 & 1657.68 & 304.389 & 198.723 & 365.692 & 1365.57 & 0.0352 & $\\dots$ \\\\\n", "\\end{tabular}\n" ], "text/plain": [ "\u001b[1m4×11 DataFrame\u001b[0m\n", "\u001b[1m Row \u001b[0m│\u001b[1m Year \u001b[0m\u001b[1m Month \u001b[0m\u001b[1m MEI \u001b[0m\u001b[1m CO2 \u001b[0m\u001b[1m CH4 \u001b[0m\u001b[1m N2O \u001b[0m\u001b[1m CFC-11 \u001b[0m\u001b[1m CFC-12 \u001b[0m\u001b[1m TSI\u001b[0m ⋯\n", "\u001b[1m \u001b[0m│\u001b[90m Int64 \u001b[0m\u001b[90m Int64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Flo\u001b[0m ⋯\n", "─────┼──────────────────────────────────────────────────────────────────────────\n", " 1 │ 1984 1 -0.339 344.05 1658.98 304.13 197.219 363.359 136 ⋯\n", " 2 │ 1984 2 -0.565 344.77 1656.48 304.194 197.759 364.296 136\n", " 3 │ 1984 3 0.131 345.46 1655.77 304.285 198.249 365.044 136\n", " 4 │ 1984 4 0.331 346.77 1657.68 304.389 198.723 365.692 136\n", "\u001b[36m 3 columns omitted\u001b[0m" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "function my_filter(year, month)\n", " year_match = (year in [1983, 1984])\n", " month_match = (month<=4)\n", " return year_match & month_match\n", "end;\n", "\n", "# apply this filter for each row\n", "subset(df, [:Year, :Month]=> ByRow(my_filter))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`subset` is nice because it can handle missing values, whereas `filter` won't. So in short, for most of your use cases you should stick with indexing or using `subset`. " ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.000168 seconds (507 allocations: 39.328 KiB)\n" ] }, { "data": { "text/html": [ "

4 rows × 11 columns (omitted printing of 1 columns)

YearMonthMEICO2CH4N2OCFC-11CFC-12TSIAerosols
Int64Int64Float64Float64Float64Float64Float64Float64Float64Float64
119841-0.339344.051658.98304.13197.219363.3591365.430.0451
219842-0.565344.771656.48304.194197.759364.2961365.660.0416
3198430.131345.461655.77304.285198.249365.0441366.170.0383
4198440.331346.771657.68304.389198.723365.6921365.570.0352
" ], "text/latex": [ "\\begin{tabular}{r|ccccccccccc}\n", "\t& Year & Month & MEI & CO2 & CH4 & N2O & CFC-11 & CFC-12 & TSI & Aerosols & \\\\\n", "\t\\hline\n", "\t& Int64 & Int64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & \\\\\n", "\t\\hline\n", "\t1 & 1984 & 1 & -0.339 & 344.05 & 1658.98 & 304.13 & 197.219 & 363.359 & 1365.43 & 0.0451 & $\\dots$ \\\\\n", "\t2 & 1984 & 2 & -0.565 & 344.77 & 1656.48 & 304.194 & 197.759 & 364.296 & 1365.66 & 0.0416 & $\\dots$ \\\\\n", "\t3 & 1984 & 3 & 0.131 & 345.46 & 1655.77 & 304.285 & 198.249 & 365.044 & 1366.17 & 0.0383 & $\\dots$ \\\\\n", "\t4 & 1984 & 4 & 0.331 & 346.77 & 1657.68 & 304.389 & 198.723 & 365.692 & 1365.57 & 0.0352 & $\\dots$ \\\\\n", "\\end{tabular}\n" ], "text/plain": [ "\u001b[1m4×11 DataFrame\u001b[0m\n", "\u001b[1m Row \u001b[0m│\u001b[1m Year \u001b[0m\u001b[1m Month \u001b[0m\u001b[1m MEI \u001b[0m\u001b[1m CO2 \u001b[0m\u001b[1m CH4 \u001b[0m\u001b[1m N2O \u001b[0m\u001b[1m CFC-11 \u001b[0m\u001b[1m CFC-12 \u001b[0m\u001b[1m TSI\u001b[0m ⋯\n", "\u001b[1m \u001b[0m│\u001b[90m Int64 \u001b[0m\u001b[90m Int64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Flo\u001b[0m ⋯\n", "─────┼──────────────────────────────────────────────────────────────────────────\n", " 1 │ 1984 1 -0.339 344.05 1658.98 304.13 197.219 363.359 136 ⋯\n", " 2 │ 1984 2 -0.565 344.77 1656.48 304.194 197.759 364.296 136\n", " 3 │ 1984 3 0.131 345.46 1655.77 304.285 198.249 365.044 136\n", " 4 │ 1984 4 0.331 346.77 1657.68 304.389 198.723 365.692 136\n", "\u001b[36m 3 columns omitted\u001b[0m" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "@time begin\n", " subset(df, [:Year, :Month]=> ByRow(my_filter))\n", "end" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### `@subset` \n", "\n", "Instead of the filter and subset stuff in base dataframes, we can use DataFramesMeta:" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

1 rows × 11 columns (omitted printing of 1 columns)

YearMonthMEICO2CH4N2OCFC-11CFC-12TSIAerosols
Int64Int64Float64Float64Float64Float64Float64Float64Float64Float64
1198352.556345.961638.59303.677191.324350.1131366.10.0863
" ], "text/latex": [ "\\begin{tabular}{r|ccccccccccc}\n", "\t& Year & Month & MEI & CO2 & CH4 & N2O & CFC-11 & CFC-12 & TSI & Aerosols & \\\\\n", "\t\\hline\n", "\t& Int64 & Int64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & \\\\\n", "\t\\hline\n", "\t1 & 1983 & 5 & 2.556 & 345.96 & 1638.59 & 303.677 & 191.324 & 350.113 & 1366.1 & 0.0863 & $\\dots$ \\\\\n", "\\end{tabular}\n" ], "text/plain": [ "\u001b[1m1×11 DataFrame\u001b[0m\n", "\u001b[1m Row \u001b[0m│\u001b[1m Year \u001b[0m\u001b[1m Month \u001b[0m\u001b[1m MEI \u001b[0m\u001b[1m CO2 \u001b[0m\u001b[1m CH4 \u001b[0m\u001b[1m N2O \u001b[0m\u001b[1m CFC-11 \u001b[0m\u001b[1m CFC-12 \u001b[0m\u001b[1m TSI\u001b[0m ⋯\n", "\u001b[1m \u001b[0m│\u001b[90m Int64 \u001b[0m\u001b[90m Int64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Flo\u001b[0m ⋯\n", "─────┼──────────────────────────────────────────────────────────────────────────\n", " 1 │ 1983 5 2.556 345.96 1638.59 303.677 191.324 350.113 13 ⋯\n", "\u001b[36m 3 columns omitted\u001b[0m" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "@subset(df, (:Year .<1984), (:Month .==5))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can do `@rsubset` to filter across rows so you don't need to do the elementwise `.` every time:" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

1 rows × 11 columns (omitted printing of 1 columns)

YearMonthMEICO2CH4N2OCFC-11CFC-12TSIAerosols
Int64Int64Float64Float64Float64Float64Float64Float64Float64Float64
1198352.556345.961638.59303.677191.324350.1131366.10.0863
" ], "text/latex": [ "\\begin{tabular}{r|ccccccccccc}\n", "\t& Year & Month & MEI & CO2 & CH4 & N2O & CFC-11 & CFC-12 & TSI & Aerosols & \\\\\n", "\t\\hline\n", "\t& Int64 & Int64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & \\\\\n", "\t\\hline\n", "\t1 & 1983 & 5 & 2.556 & 345.96 & 1638.59 & 303.677 & 191.324 & 350.113 & 1366.1 & 0.0863 & $\\dots$ \\\\\n", "\\end{tabular}\n" ], "text/plain": [ "\u001b[1m1×11 DataFrame\u001b[0m\n", "\u001b[1m Row \u001b[0m│\u001b[1m Year \u001b[0m\u001b[1m Month \u001b[0m\u001b[1m MEI \u001b[0m\u001b[1m CO2 \u001b[0m\u001b[1m CH4 \u001b[0m\u001b[1m N2O \u001b[0m\u001b[1m CFC-11 \u001b[0m\u001b[1m CFC-12 \u001b[0m\u001b[1m TSI\u001b[0m ⋯\n", "\u001b[1m \u001b[0m│\u001b[90m Int64 \u001b[0m\u001b[90m Int64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Flo\u001b[0m ⋯\n", "─────┼──────────────────────────────────────────────────────────────────────────\n", " 1 │ 1983 5 2.556 345.96 1638.59 303.677 191.324 350.113 13 ⋯\n", "\u001b[36m 3 columns omitted\u001b[0m" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "@rsubset(df, :Year<1984, :Month==5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So in the above example we are using an AND statement, it'll keep doing the ANDs across each item there. So if you want to do something fancier like an OR statement you'll need to wrap it in parenthesis:" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.134172 seconds (490.13 k allocations: 26.135 MiB, 99.18% compilation time)\n" ] }, { "data": { "text/html": [ "

2 rows × 11 columns (omitted printing of 1 columns)

YearMonthMEICO2CH4N2OCFC-11CFC-12TSIAerosols
Int64Int64Float64Float64Float64Float64Float64Float64Float64Float64
1198352.556345.961638.59303.677191.324350.1131366.10.0863
219853-0.737347.661678.34305.165210.368376.0041365.660.0141
" ], "text/latex": [ "\\begin{tabular}{r|ccccccccccc}\n", "\t& Year & Month & MEI & CO2 & CH4 & N2O & CFC-11 & CFC-12 & TSI & Aerosols & \\\\\n", "\t\\hline\n", "\t& Int64 & Int64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & \\\\\n", "\t\\hline\n", "\t1 & 1983 & 5 & 2.556 & 345.96 & 1638.59 & 303.677 & 191.324 & 350.113 & 1366.1 & 0.0863 & $\\dots$ \\\\\n", "\t2 & 1985 & 3 & -0.737 & 347.66 & 1678.34 & 305.165 & 210.368 & 376.004 & 1365.66 & 0.0141 & $\\dots$ \\\\\n", "\\end{tabular}\n" ], "text/plain": [ "\u001b[1m2×11 DataFrame\u001b[0m\n", "\u001b[1m Row \u001b[0m│\u001b[1m Year \u001b[0m\u001b[1m Month \u001b[0m\u001b[1m MEI \u001b[0m\u001b[1m CO2 \u001b[0m\u001b[1m CH4 \u001b[0m\u001b[1m N2O \u001b[0m\u001b[1m CFC-11 \u001b[0m\u001b[1m CFC-12 \u001b[0m\u001b[1m TSI\u001b[0m ⋯\n", "\u001b[1m \u001b[0m│\u001b[90m Int64 \u001b[0m\u001b[90m Int64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Flo\u001b[0m ⋯\n", "─────┼──────────────────────────────────────────────────────────────────────────\n", " 1 │ 1983 5 2.556 345.96 1638.59 303.677 191.324 350.113 136 ⋯\n", " 2 │ 1985 3 -0.737 347.66 1678.34 305.165 210.368 376.004 136\n", "\u001b[36m 3 columns omitted\u001b[0m" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "@time begin\n", " @rsubset(df, (:Year<1984 && :Month==5) || (:Year==1985 && :Month==3))\n", "end" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or you can always pass a custom function too, just like before (this is fast!)" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.000125 seconds (508 allocations: 39.375 KiB)\n" ] }, { "data": { "text/html": [ "

4 rows × 11 columns (omitted printing of 1 columns)

YearMonthMEICO2CH4N2OCFC-11CFC-12TSIAerosols
Int64Int64Float64Float64Float64Float64Float64Float64Float64Float64
119841-0.339344.051658.98304.13197.219363.3591365.430.0451
219842-0.565344.771656.48304.194197.759364.2961365.660.0416
3198430.131345.461655.77304.285198.249365.0441366.170.0383
4198440.331346.771657.68304.389198.723365.6921365.570.0352
" ], "text/latex": [ "\\begin{tabular}{r|ccccccccccc}\n", "\t& Year & Month & MEI & CO2 & CH4 & N2O & CFC-11 & CFC-12 & TSI & Aerosols & \\\\\n", "\t\\hline\n", "\t& Int64 & Int64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & Float64 & \\\\\n", "\t\\hline\n", "\t1 & 1984 & 1 & -0.339 & 344.05 & 1658.98 & 304.13 & 197.219 & 363.359 & 1365.43 & 0.0451 & $\\dots$ \\\\\n", "\t2 & 1984 & 2 & -0.565 & 344.77 & 1656.48 & 304.194 & 197.759 & 364.296 & 1365.66 & 0.0416 & $\\dots$ \\\\\n", "\t3 & 1984 & 3 & 0.131 & 345.46 & 1655.77 & 304.285 & 198.249 & 365.044 & 1366.17 & 0.0383 & $\\dots$ \\\\\n", "\t4 & 1984 & 4 & 0.331 & 346.77 & 1657.68 & 304.389 & 198.723 & 365.692 & 1365.57 & 0.0352 & $\\dots$ \\\\\n", "\\end{tabular}\n" ], "text/plain": [ "\u001b[1m4×11 DataFrame\u001b[0m\n", "\u001b[1m Row \u001b[0m│\u001b[1m Year \u001b[0m\u001b[1m Month \u001b[0m\u001b[1m MEI \u001b[0m\u001b[1m CO2 \u001b[0m\u001b[1m CH4 \u001b[0m\u001b[1m N2O \u001b[0m\u001b[1m CFC-11 \u001b[0m\u001b[1m CFC-12 \u001b[0m\u001b[1m TSI\u001b[0m ⋯\n", "\u001b[1m \u001b[0m│\u001b[90m Int64 \u001b[0m\u001b[90m Int64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Flo\u001b[0m ⋯\n", "─────┼──────────────────────────────────────────────────────────────────────────\n", " 1 │ 1984 1 -0.339 344.05 1658.98 304.13 197.219 363.359 136 ⋯\n", " 2 │ 1984 2 -0.565 344.77 1656.48 304.194 197.759 364.296 136\n", " 3 │ 1984 3 0.131 345.46 1655.77 304.285 198.249 365.044 136\n", " 4 │ 1984 4 0.331 346.77 1657.68 304.389 198.723 365.692 136\n", "\u001b[36m 3 columns omitted\u001b[0m" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "@time begin\n", " @rsubset(df, my_filter(:Year, :Month))\n", "end" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Nice, it's super fast too! I think this is the way." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## @chain from DataFramesMeta\n", "\n", "Instead of the clunky filtering and indexing, you can try out some of the querying libraries that Julia has to offer. They are all going to be different than pandas and dplyr, but that might not be a bad thing. The downside is that there isn't one obvious/clear standard yet in Julia, so for the practical business user or accidental coder it's still tough sledding. \n", "\n", "It's November 2021 and we've had stuff like Query and Queryverse for a while now, with DataFramesMeta in the mix. I'll show off `@chain` macro stuff here from the DataFramesMeta package, and we've already been familiar with `@rsubset` from earlier." ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [], "source": [ "df2 = rename(df,:CO2 => \"Carbon Dioxide\");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But there's more! Here's [a guide for people coming from dplyr](https://juliadata.github.io/DataFramesMeta.jl/stable/dplyr/) from the DataFramesMeta folks." ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

4 rows × 3 columns

MEIAerosolsCarbon Dioxide
Float64Float64Float64
1-0.5610.0164345.25
2-0.6020.016346.06
3-0.7370.0141347.66
4-0.4840.0138348.2
" ], "text/latex": [ "\\begin{tabular}{r|ccc}\n", "\t& MEI & Aerosols & Carbon Dioxide\\\\\n", "\t\\hline\n", "\t& Float64 & Float64 & Float64\\\\\n", "\t\\hline\n", "\t1 & -0.561 & 0.0164 & 345.25 \\\\\n", "\t2 & -0.602 & 0.016 & 346.06 \\\\\n", "\t3 & -0.737 & 0.0141 & 347.66 \\\\\n", "\t4 & -0.484 & 0.0138 & 348.2 \\\\\n", "\\end{tabular}\n" ], "text/plain": [ "\u001b[1m4×3 DataFrame\u001b[0m\n", "\u001b[1m Row \u001b[0m│\u001b[1m MEI \u001b[0m\u001b[1m Aerosols \u001b[0m\u001b[1m Carbon Dioxide \u001b[0m\n", "\u001b[1m \u001b[0m│\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\n", "─────┼───────────────────────────────────\n", " 1 │ -0.561 0.0164 345.25\n", " 2 │ -0.602 0.016 346.06\n", " 3 │ -0.737 0.0141 347.66\n", " 4 │ -0.484 0.0138 348.2" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "@chain df2 begin\n", " @rsubset(:Year==1985, :Month<=4)\n", " @select(:MEI, :Aerosols, :\"Carbon Dioxide\")\n", "end" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using our previous OR filter:" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

4 rows × 3 columns

MEIAerosolsCarbon Dioxide
Float64Float64Float64
1-0.3390.0451344.05
2-0.5650.0416344.77
30.1310.0383345.46
40.3310.0352346.77
" ], "text/latex": [ "\\begin{tabular}{r|ccc}\n", "\t& MEI & Aerosols & Carbon Dioxide\\\\\n", "\t\\hline\n", "\t& Float64 & Float64 & Float64\\\\\n", "\t\\hline\n", "\t1 & -0.339 & 0.0451 & 344.05 \\\\\n", "\t2 & -0.565 & 0.0416 & 344.77 \\\\\n", "\t3 & 0.131 & 0.0383 & 345.46 \\\\\n", "\t4 & 0.331 & 0.0352 & 346.77 \\\\\n", "\\end{tabular}\n" ], "text/plain": [ "\u001b[1m4×3 DataFrame\u001b[0m\n", "\u001b[1m Row \u001b[0m│\u001b[1m MEI \u001b[0m\u001b[1m Aerosols \u001b[0m\u001b[1m Carbon Dioxide \u001b[0m\n", "\u001b[1m \u001b[0m│\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\u001b[90m Float64 \u001b[0m\n", "─────┼───────────────────────────────────\n", " 1 │ -0.339 0.0451 344.05\n", " 2 │ -0.565 0.0416 344.77\n", " 3 │ 0.131 0.0383 345.46\n", " 4 │ 0.331 0.0352 346.77" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "@chain df2 begin\n", " @rsubset(my_filter(:Year, :Month))\n", " @select(:MEI, :Aerosols, :\"Carbon Dioxide\")\n", "end" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Julia 1.7.0", "language": "julia", "name": "julia-1.7" }, "language_info": { "file_extension": ".jl", "mimetype": "application/julia", "name": "julia", "version": "1.7.0" } }, "nbformat": 4, "nbformat_minor": 4 }