{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Julia is getting more and more interesting as a way to translate math into beautiful-looking code. After some halted attempts earlier at Julia 1.0 and 1.4, I'm giving it a try again at 1.6 after it seems like Julia is looking more and more stable. So here I'll write down a simple task that would be trivial to do in Python, except I'll do it in Julia instead. \n", "\n", "I want to:\n", "\n", "1. Read a CSV file from the web (i.e. in a github repository) as a dataframe\n", "2. Do some data cleaning - text replacement, fill na's, etc" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Load a CSV File" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "using HTTP, CSV, DataFrames, PyCall, DataFramesMeta" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Use a GET command to get some data, in this case it's a String\n", "# r = HTTP.get(\"https://api.github.com/users/banditkings\").body |> String;" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "url = \"https://ocw.mit.edu/courses/sloan-school-of-management/15-071-the-analytics-edge-spring-2017/linear-regression/assignment-2/climate_change.csv\";" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Docstrings must be placed before a function call. On the plus side, it seems to render the docstring really nicely:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "\"\"\"\n", "Basic function call example. In this case we're using it to fetch a CSV file from\n", "github here.\n", "\n", "You can even use latex in here! \\$f(x) = Ax + b\\$\n", "\n", "and HTML!!\n", "\n", "Parameters\n", "----------\n", "url : String\n", "* url address for the CSV file that you want to read in as a dataframe\n", "\n", "Returns\n", "-------\n", "dataframe\n", "\"\"\" \n", "function get_data(url)\n", " body = HTTP.get(url).body\n", " df = CSV.read(body, DataFrame);\n", "end;" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/latex": [ "Basic function call example. In this case we're using it to fetch a CSV file from github here.\n", "\n", "You can even use latex in here! $f(x) = Ax + b$\n", "\n", "and HTML!!\n", "\n", "\\subsection{Parameters}\n", "url : String\n", "\n", "\\begin{itemize}\n", "\\item url address for the CSV file that you want to read in as a dataframe\n", "\n", "\\end{itemize}\n", "\\subsection{Returns}\n", "dataframe\n", "\n" ], "text/markdown": [ "Basic function call example. In this case we're using it to fetch a CSV file from github here.\n", "\n", "You can even use latex in here! $f(x) = Ax + b$\n", "\n", "and HTML!!\n", "\n", "## Parameters\n", "\n", "url : String\n", "\n", " * url address for the CSV file that you want to read in as a dataframe\n", "\n", "## Returns\n", "\n", "dataframe\n" ], "text/plain": [ " Basic function call example. In this case we're using it to fetch a CSV file\n", " from github here.\n", "\n", " You can even use latex in here! \u001b[35mf(x) = Ax + b\u001b[39m\n", "\n", " and HTML!!\n", "\n", "\u001b[1m Parameters\u001b[22m\n", "\u001b[1m ============\u001b[22m\n", "\n", " url : String\n", "\n", " • url address for the CSV file that you want to read in as a\n", " dataframe\n", "\n", "\u001b[1m Returns\u001b[22m\n", "\u001b[1m =========\u001b[22m\n", "\n", " dataframe" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "?get_data() # display the docstring we just generated" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***\n", "### Control Flows and `try`:`catch`\n", "\n", "Our little `get_data` function is fine but I don't want to constantly be downloading this file if I don't have to...how about something like python's `try`:`except` and learn a little about how Julia handles control flows?\n", "\n", "* **Goal:** I want to try to load from CSV file if it's available, and load from a url if it doesn't exist. \n", "\n", "* Reference: [Control Flows](https://docs.julialang.org/en/v1/manual/control-flow/) and `try`:`catch` [here](https://docs.julialang.org/en/v1/manual/control-flow/#The-try/catch-statement)!" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "try\n", " df = CSV.read(\"climate_change.csv\", DataFrame);\n", "catch\n", " println(\"No CSV Found, Downloading from source\")\n", " df = get_data(url);\n", " # We can also save the csv file locally\n", " CSV.write(\"climate_change.csv\", df)\n", "end;\n", "\n", "df = CSV.read(\"climate_change.csv\", DataFrame);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Python vs Julia: appending an item to a list (and PyCall in Jupyter)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another key difference from python is appending lists together. In python, you'll create a list like:\n", "\n", "```python\n", ">>> foo = [1]\n", ">>> foo = foo + ['a', 'b']\n", ">>> foo\n", "[1, 'a', 'b']\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Actually why don't we use the `PyCall` module to demonstrate the above? With `PyCall`, you can call Python and get functions and variables out. I tried this a while back and it seemed to be really bad at supporting pandas dataframes but things like numpy arrays are fine. " ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3-element Vector{Any}:\n", " 1\n", " \"a\"\n", " \"b\"" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Example of using Python within a jupyter notebook cell with a Julia kernel\n", "\n", "py\"\"\"\n", "foo = [1]\n", "foo = foo + ['a', 'b']\n", "\"\"\"\n", "\n", "# Reference an existing variable or function\n", "py\"foo\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now here's the same thing in Julia:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2-element Vector{Any}:\n", " 1\n", " \"a\"" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "foo = Any[1]; # need to set to Any or it'll assume it's an Integer and it'll fail on the two strings there\n", "append!(foo, [\"a\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### What is that `!` symbol?\n", "\n", "Okay notice the exclamation mark `!` after `append` - this is special Julia notation for functions that modify something inplace. In this case, `append!` adds an item to the list in place. You could use `push!` if adding a single item. And there's no `append` function, only `append!`. How do you do something where you don't modify it in place? You can't, but you can make a copy:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3-element Vector{Any}:\n", " 1\n", " \"a\"\n", " \"b\"" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bar = foo |> copy\n", "append!(bar, [\"b\"])" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "false" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "foo == bar" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## DataFrame Manipulation\n", "\n", "With Base Julia:\n", "[Documentation](https://dataframes.juliadata.org/stable/man/working_with_dataframes/)\n", "\n", "`head` and `tail` become `first` and `last` with Julia." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
5 rows × 11 columns (omitted printing of 1 columns)
Year | Month | MEI | CO2 | CH4 | N2O | CFC-11 | CFC-12 | TSI | Aerosols | |
---|---|---|---|---|---|---|---|---|---|---|
Int64 | Int64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | |
1 | 1983 | 5 | 2.556 | 345.96 | 1638.59 | 303.677 | 191.324 | 350.113 | 1366.1 | 0.0863 |
2 | 1983 | 6 | 2.167 | 345.52 | 1633.71 | 303.746 | 192.057 | 351.848 | 1366.12 | 0.0794 |
3 | 1983 | 7 | 1.741 | 344.15 | 1633.22 | 303.795 | 192.818 | 353.725 | 1366.29 | 0.0731 |
4 | 1983 | 8 | 1.13 | 342.25 | 1631.35 | 303.839 | 193.602 | 355.633 | 1366.42 | 0.0673 |
5 | 1983 | 9 | 0.428 | 340.17 | 1648.4 | 303.901 | 194.392 | 357.465 | 1366.23 | 0.0619 |
DataFrameRow (11 columns)
Year | Month | MEI | CO2 | CH4 | N2O | CFC-11 | CFC-12 | TSI | Aerosols | |
---|---|---|---|---|---|---|---|---|---|---|
Int64 | Int64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | |
308 | 2008 | 12 | -0.666 | 385.56 | 1812.88 | 322.182 | 244.204 | 535.005 | 1365.69 | 0.0046 |
6 rows × 11 columns (omitted printing of 1 columns)
Year | Month | MEI | CO2 | CH4 | N2O | CFC-11 | CFC-12 | TSI | Aerosols | |
---|---|---|---|---|---|---|---|---|---|---|
Int64 | Int64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | |
1 | 1983 | 9 | 0.428 | 340.17 | 1648.4 | 303.901 | 194.392 | 357.465 | 1366.23 | 0.0619 |
2 | 1983 | 10 | 0.002 | 340.3 | 1663.79 | 303.97 | 195.171 | 359.174 | 1366.06 | 0.0569 |
3 | 1983 | 11 | -0.176 | 341.53 | 1658.23 | 304.032 | 195.921 | 360.758 | 1366.11 | 0.0524 |
4 | 1983 | 12 | -0.176 | 343.07 | 1654.31 | 304.082 | 196.609 | 362.174 | 1366.06 | 0.0486 |
5 | 1984 | 1 | -0.339 | 344.05 | 1658.98 | 304.13 | 197.219 | 363.359 | 1365.43 | 0.0451 |
6 | 1984 | 2 | -0.565 | 344.77 | 1656.48 | 304.194 | 197.759 | 364.296 | 1365.66 | 0.0416 |
DataFrameRow (7 columns)
Year | Month | MEI | N2O | TSI | Aerosols | Temp | |
---|---|---|---|---|---|---|---|
Int64 | Int64 | Float64 | Float64 | Float64 | Float64 | Float64 | |
1 | 1983 | 5 | 2.556 | 303.677 | 1366.1 | 0.0863 | 0.109 |
11 rows × 7 columns
variable | mean | min | median | max | nmissing | eltype | |
---|---|---|---|---|---|---|---|
Symbol | Float64 | Real | Float64 | Real | Int64 | DataType | |
1 | Year | 1995.66 | 1983 | 1996.0 | 2008 | 0 | Int64 |
2 | Month | 6.55195 | 1 | 7.0 | 12 | 0 | Int64 |
3 | MEI | 0.275555 | -1.635 | 0.2375 | 3.001 | 0 | Float64 |
4 | CO2 | 363.227 | 340.17 | 361.735 | 388.5 | 0 | Float64 |
5 | CH4 | 1749.82 | 1629.89 | 1764.04 | 1814.18 | 0 | Float64 |
6 | N2O | 312.392 | 303.677 | 311.507 | 322.182 | 0 | Float64 |
7 | CFC-11 | 251.973 | 191.324 | 258.344 | 271.494 | 0 | Float64 |
8 | CFC-12 | 497.525 | 350.113 | 528.356 | 543.813 | 0 | Float64 |
9 | TSI | 1366.07 | 1365.43 | 1365.98 | 1367.32 | 0 | Float64 |
10 | Aerosols | 0.0166571 | 0.0016 | 0.00575 | 0.1494 | 0 | Float64 |
11 | Temp | 0.256776 | -0.282 | 0.248 | 0.739 | 0 | Float64 |
4 rows × 11 columns (omitted printing of 1 columns)
Year | Month | MEI | CO2 | CH4 | N2O | CFC-11 | CFC-12 | TSI | Aerosols | |
---|---|---|---|---|---|---|---|---|---|---|
Int64 | Int64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | |
1 | 1985 | 1 | -0.561 | 345.25 | 1663.42 | 305.301 | 208.537 | 373.623 | 1365.61 | 0.0164 |
2 | 1985 | 2 | -0.602 | 346.06 | 1666.21 | 305.243 | 209.543 | 374.681 | 1365.71 | 0.016 |
3 | 1985 | 3 | -0.737 | 347.66 | 1678.34 | 305.165 | 210.368 | 376.004 | 1365.66 | 0.0141 |
4 | 1985 | 4 | -0.484 | 348.2 | 1675.24 | 305.093 | 211.111 | 377.635 | 1365.51 | 0.0138 |
DataFrameRow (11 columns)
Year | Month | MEI | CO2 | CH4 | N2O | CFC-11 | CFC-12 | TSI | Aerosols | |
---|---|---|---|---|---|---|---|---|---|---|
Int64 | Int64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | |
1 | 1985 | 1 | -0.561 | 345.25 | 1663.42 | 305.301 | 208.537 | 373.623 | 1365.61 | 0.0164 |
DataFrameRow (11 columns)
Year | Month | MEI | CO2 | CH4 | N2O | CFC-11 | CFC-12 | TSI | Aerosols | |
---|---|---|---|---|---|---|---|---|---|---|
Int64 | Int64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | |
1 | 1985 | 1 | -0.561 | 345.25 | 1663.42 | 305.301 | 208.537 | 373.623 | 1365.61 | 0.0164 |
DataFrameRow (11 columns)
Year | Month | MEI | CO2 | CH4 | N2O | CFC-11 | CFC-12 | TSI | Aerosols | |
---|---|---|---|---|---|---|---|---|---|---|
Int64 | Int64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | |
1 | 1983 | 5 | 2.556 | 345.96 | 1638.59 | 303.677 | 191.324 | 350.113 | 1366.1 | 0.0863 |
DataFrameRow (11 columns)
Year | Month | MEI | CO2 | CH4 | N2O | CFC-11 | CFC-12 | TSI | Aerosols | |
---|---|---|---|---|---|---|---|---|---|---|
Int64 | Int64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | |
1 | 1983 | 5 | 2.556 | 345.96 | 1638.59 | 303.677 | 191.324 | 350.113 | 1366.1 | 0.0863 |
DataFrameRow (11 columns)
Year | Month | MEI | CO2 | CH4 | N2O | CFC-11 | CFC-12 | TSI | Aerosols | |
---|---|---|---|---|---|---|---|---|---|---|
Int64 | Int64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | |
1 | 1983 | 5 | 2.556 | 345.96 | 1638.59 | 303.677 | 191.324 | 350.113 | 1366.1 | 0.0863 |
4 rows × 11 columns (omitted printing of 1 columns)
Year | Month | MEI | CO2 | CH4 | N2O | CFC-11 | CFC-12 | TSI | Aerosols | |
---|---|---|---|---|---|---|---|---|---|---|
Int64 | Int64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | |
1 | 1984 | 1 | -0.339 | 344.05 | 1658.98 | 304.13 | 197.219 | 363.359 | 1365.43 | 0.0451 |
2 | 1984 | 2 | -0.565 | 344.77 | 1656.48 | 304.194 | 197.759 | 364.296 | 1365.66 | 0.0416 |
3 | 1984 | 3 | 0.131 | 345.46 | 1655.77 | 304.285 | 198.249 | 365.044 | 1366.17 | 0.0383 |
4 | 1984 | 4 | 0.331 | 346.77 | 1657.68 | 304.389 | 198.723 | 365.692 | 1365.57 | 0.0352 |
DataFrameRow (11 columns)
Year | Month | MEI | CO2 | CH4 | N2O | CFC-11 | CFC-12 | TSI | Aerosols | |
---|---|---|---|---|---|---|---|---|---|---|
Int64 | Int64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | |
1 | 1983 | 5 | 2.556 | 345.96 | 1638.59 | 303.677 | 191.324 | 350.113 | 1366.1 | 0.0863 |
8 rows × 11 columns (omitted printing of 1 columns)
Year | Month | MEI | CO2 | CH4 | N2O | CFC-11 | CFC-12 | TSI | Aerosols | |
---|---|---|---|---|---|---|---|---|---|---|
Int64 | Int64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | |
1 | 1983 | 5 | 2.556 | 345.96 | 1638.59 | 303.677 | 191.324 | 350.113 | 1366.1 | 0.0863 |
2 | 1983 | 6 | 2.167 | 345.52 | 1633.71 | 303.746 | 192.057 | 351.848 | 1366.12 | 0.0794 |
3 | 1983 | 7 | 1.741 | 344.15 | 1633.22 | 303.795 | 192.818 | 353.725 | 1366.29 | 0.0731 |
4 | 1983 | 8 | 1.13 | 342.25 | 1631.35 | 303.839 | 193.602 | 355.633 | 1366.42 | 0.0673 |
5 | 1983 | 9 | 0.428 | 340.17 | 1648.4 | 303.901 | 194.392 | 357.465 | 1366.23 | 0.0619 |
6 | 1983 | 10 | 0.002 | 340.3 | 1663.79 | 303.97 | 195.171 | 359.174 | 1366.06 | 0.0569 |
7 | 1983 | 11 | -0.176 | 341.53 | 1658.23 | 304.032 | 195.921 | 360.758 | 1366.11 | 0.0524 |
8 | 1983 | 12 | -0.176 | 343.07 | 1654.31 | 304.082 | 196.609 | 362.174 | 1366.06 | 0.0486 |
4 rows × 11 columns (omitted printing of 1 columns)
Year | Month | MEI | CO2 | CH4 | N2O | CFC-11 | CFC-12 | TSI | Aerosols | |
---|---|---|---|---|---|---|---|---|---|---|
Int64 | Int64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | |
1 | 1984 | 1 | -0.339 | 344.05 | 1658.98 | 304.13 | 197.219 | 363.359 | 1365.43 | 0.0451 |
2 | 1984 | 2 | -0.565 | 344.77 | 1656.48 | 304.194 | 197.759 | 364.296 | 1365.66 | 0.0416 |
3 | 1984 | 3 | 0.131 | 345.46 | 1655.77 | 304.285 | 198.249 | 365.044 | 1366.17 | 0.0383 |
4 | 1984 | 4 | 0.331 | 346.77 | 1657.68 | 304.389 | 198.723 | 365.692 | 1365.57 | 0.0352 |
4 rows × 11 columns (omitted printing of 1 columns)
Year | Month | MEI | CO2 | CH4 | N2O | CFC-11 | CFC-12 | TSI | Aerosols | |
---|---|---|---|---|---|---|---|---|---|---|
Int64 | Int64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | |
1 | 1984 | 1 | -0.339 | 344.05 | 1658.98 | 304.13 | 197.219 | 363.359 | 1365.43 | 0.0451 |
2 | 1984 | 2 | -0.565 | 344.77 | 1656.48 | 304.194 | 197.759 | 364.296 | 1365.66 | 0.0416 |
3 | 1984 | 3 | 0.131 | 345.46 | 1655.77 | 304.285 | 198.249 | 365.044 | 1366.17 | 0.0383 |
4 | 1984 | 4 | 0.331 | 346.77 | 1657.68 | 304.389 | 198.723 | 365.692 | 1365.57 | 0.0352 |
1 rows × 11 columns (omitted printing of 1 columns)
Year | Month | MEI | CO2 | CH4 | N2O | CFC-11 | CFC-12 | TSI | Aerosols | |
---|---|---|---|---|---|---|---|---|---|---|
Int64 | Int64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | |
1 | 1983 | 5 | 2.556 | 345.96 | 1638.59 | 303.677 | 191.324 | 350.113 | 1366.1 | 0.0863 |
1 rows × 11 columns (omitted printing of 1 columns)
Year | Month | MEI | CO2 | CH4 | N2O | CFC-11 | CFC-12 | TSI | Aerosols | |
---|---|---|---|---|---|---|---|---|---|---|
Int64 | Int64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | |
1 | 1983 | 5 | 2.556 | 345.96 | 1638.59 | 303.677 | 191.324 | 350.113 | 1366.1 | 0.0863 |
2 rows × 11 columns (omitted printing of 1 columns)
Year | Month | MEI | CO2 | CH4 | N2O | CFC-11 | CFC-12 | TSI | Aerosols | |
---|---|---|---|---|---|---|---|---|---|---|
Int64 | Int64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | |
1 | 1983 | 5 | 2.556 | 345.96 | 1638.59 | 303.677 | 191.324 | 350.113 | 1366.1 | 0.0863 |
2 | 1985 | 3 | -0.737 | 347.66 | 1678.34 | 305.165 | 210.368 | 376.004 | 1365.66 | 0.0141 |
4 rows × 11 columns (omitted printing of 1 columns)
Year | Month | MEI | CO2 | CH4 | N2O | CFC-11 | CFC-12 | TSI | Aerosols | |
---|---|---|---|---|---|---|---|---|---|---|
Int64 | Int64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | |
1 | 1984 | 1 | -0.339 | 344.05 | 1658.98 | 304.13 | 197.219 | 363.359 | 1365.43 | 0.0451 |
2 | 1984 | 2 | -0.565 | 344.77 | 1656.48 | 304.194 | 197.759 | 364.296 | 1365.66 | 0.0416 |
3 | 1984 | 3 | 0.131 | 345.46 | 1655.77 | 304.285 | 198.249 | 365.044 | 1366.17 | 0.0383 |
4 | 1984 | 4 | 0.331 | 346.77 | 1657.68 | 304.389 | 198.723 | 365.692 | 1365.57 | 0.0352 |
4 rows × 3 columns
MEI | Aerosols | Carbon Dioxide | |
---|---|---|---|
Float64 | Float64 | Float64 | |
1 | -0.561 | 0.0164 | 345.25 |
2 | -0.602 | 0.016 | 346.06 |
3 | -0.737 | 0.0141 | 347.66 |
4 | -0.484 | 0.0138 | 348.2 |
4 rows × 3 columns
MEI | Aerosols | Carbon Dioxide | |
---|---|---|---|
Float64 | Float64 | Float64 | |
1 | -0.339 | 0.0451 | 344.05 |
2 | -0.565 | 0.0416 | 344.77 |
3 | 0.131 | 0.0383 | 345.46 |
4 | 0.331 | 0.0352 | 346.77 |