{ "cells": [ { "cell_type": "markdown", "id": "e355db8b-ebb6-4ea6-97b5-3b9fdadc302c", "metadata": {}, "source": [ "# 106 examples of Red Amber" ] }, { "cell_type": "markdown", "id": "f20f4970-db38-4d96-9a36-d4cf9d007596", "metadata": {}, "source": [ "Last update is on March 20, 2023, for RedAmber Version 0.4.1 and Arrow version 11.0.0 ." ] }, { "cell_type": "markdown", "id": "f6e927d0-b59a-4c4e-9f8a-4fa08f9a6b2f", "metadata": {}, "source": [ "## 1. Install" ] }, { "cell_type": "markdown", "id": "85eacfe6-fa11-4749-844f-5914d6cd7dbc", "metadata": {}, "source": [ "Install requirements before you install RedAmber.\n", "\n", "- Ruby (>= 3.0)\n", "\n", "- Apache Arrow (~> 11.0.0)\n", "- Apache Arrow GLib (~> 11.0.0)\n", "- Apache Parquet GLib (~> 11.0.0) # if you need IO from/to Parquet resource.\n", "\n", " See [Apache Arrow install document](https://arrow.apache.org/install/).\n", " \n", " - Minimum installation example for the latest Ubuntu:\n", " ```\n", " sudo apt update\n", " sudo apt install -y -V ca-certificates lsb-release wget\n", " wget https://apache.jfrog.io/artifactory/arrow/$(lsb_release --id --short | tr 'A-Z' 'a-z')/apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb\n", " sudo apt install -y -V ./apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb\n", " sudo apt update\n", " sudo apt install -y -V libarrow-dev\n", " sudo apt install -y -V libarrow-glib-dev\n", " ```\n", " - On Fedora 38 (Rawhide):\n", " ```\n", " sudo dnf update\n", " sudo dnf -y install gcc-c++ libarrow-devel libarrow-glib-devel ruby-devel\n", " \n", "\n", " - On macOS, you can install Apache Arrow C++ library using Homebrew:\n", " ```\n", " brew install apache-arrow\n", " ```\n", " and GLib (C) package with:\n", "\n", " ```\n", " brew install apache-arrow-glib\n", " ```\n", "\n", "If you prepared Apache Arrow, add these lines to your Gemfile:\n", "\n", "```ruby\n", "gem 'red-arrow', '~> 11.0.0'\n", "gem 'red_amber'\n", "gem 'red-parquet', '~> 11.0.0' # Optional, if you use IO from/to parquet\n", "gem 'rover-df', '~> 0.3.0' # Optional, if you use IO from/to Rover::DataFrame\n", "gem 'red-datasets-arrow' # Optional, recommended if you use Red Datasets\n", "gem 'red-arrow-numo-narray' # Optional, recommended if you use inputs from Numo::NArray\n", "```\n", "\n", "And then execute `bundle install` or install it yourself as `gem install red_amber`.\n" ] }, { "cell_type": "markdown", "id": "8c08c45d-0818-4b43-bc65-4d43dd8b6b66", "metadata": {}, "source": [ "## 2. Require" ] }, { "cell_type": "code", "execution_count": null, "id": "74b76022-03ea-40ae-bac8-fc8743659042", "metadata": { "tags": [] }, "outputs": [], "source": [ "require 'red_amber' # require 'red-amber' is also OK\n", "include RedAmber\n", "{RedAmber: VERSION, Arrow: Arrow::VERSION}" ] }, { "cell_type": "markdown", "id": "d8fb6289-39ea-4fa9-a165-b87ee6d125e9", "metadata": { "tags": [] }, "source": [ "## 3. Initialize" ] }, { "cell_type": "code", "execution_count": null, "id": "51f81824-626a-4741-a29b-30ea357fe7b5", "metadata": { "tags": [] }, "outputs": [], "source": [ "# From a Hash\n", "DataFrame.new(x: [1, 2, 3], y: %w[A B C])" ] }, { "cell_type": "code", "execution_count": null, "id": "20b696eb-c199-444d-a957-e0b1081f1506", "metadata": { "tags": [] }, "outputs": [], "source": [ "# From a schema and a row-oriented array\n", "DataFrame.new({ x: :uint8, y: :string }, [[1, 'A'], [2, 'B'], [3, 'C']])" ] }, { "cell_type": "code", "execution_count": null, "id": "21eab151-f977-4474-a6d1-576169e24b26", "metadata": { "tags": [] }, "outputs": [], "source": [ "# From an Arrow::Table\n", "table = Arrow::Table.new(x: [1, 2, 3], y: %w[A B C])\n", "DataFrame.new(table)" ] }, { "cell_type": "code", "execution_count": null, "id": "aa09d3da-f332-45cd-92ca-712c6a679035", "metadata": { "tags": [] }, "outputs": [], "source": [ "# From a Rover::DataFrame\n", "require 'rover'\n", "rover = Rover::DataFrame.new(x: [1, 2, 3], y: %w[A B C])\n", "DataFrame.new(rover)" ] }, { "cell_type": "code", "execution_count": null, "id": "cd2c3677-00fb-48fe-bb94-18bc0815db72", "metadata": { "tags": [] }, "outputs": [], "source": [ "# from a red-datasets\n", "require 'datasets-arrow'\n", "dataset = Datasets::Penguins.new\n", "penguins = DataFrame.new(dataset) # Since 0.2.2 . If it is older, it must be `dataset.to_arrow`." ] }, { "cell_type": "code", "execution_count": null, "id": "2e4619b7-bf6d-4081-9066-b186da8fdf5b", "metadata": { "tags": [] }, "outputs": [], "source": [ "dataset = Datasets::Rdatasets.new('datasets', 'mtcars')\n", "mtcars = DataFrame.new(dataset)" ] }, { "cell_type": "markdown", "id": "f1f2cb95-fe5d-4176-a94f-1bbb05be88fa", "metadata": {}, "source": [ "(New from 0.2.3 with Arrow 10.0.0) It is possible to initialize by objects responsible to `to_arrow` since 0.2.3 . Arrays in Numo::NArray is responsible to `to_arrow` with `red-arrow-numo-narray` gem. This feature is proposed by the Red Data Tools member @kojix2 and implemented by @kou in Arrow 10.0.0 and Red Arrow Numo::NArray 0.0.6. Thanks!" ] }, { "cell_type": "code", "execution_count": null, "id": "2a7738ed-487f-4b61-b7a4-fa84a4dc1a76", "metadata": { "tags": [] }, "outputs": [], "source": [ "require 'arrow-numo-narray'\n", "\n", "DataFrame.new(numo: Numo::DFloat.new(3).rand)" ] }, { "cell_type": "markdown", "id": "507995c9-4d91-481f-831b-756fff7781e2", "metadata": {}, "source": [ "Another example by Numo::NArray is [#73. Introduce columns from numo/narray](#73.-Introduce-columns-from-numo/narray)." ] }, { "cell_type": "markdown", "id": "e1f77a54-3a43-4d17-bb6f-332ef13832a3", "metadata": {}, "source": [ "## 4. Load" ] }, { "cell_type": "markdown", "id": "0fed4f43-3fbb-43e5-af0d-f93401deea78", "metadata": {}, "source": [ "`RedAmber::DataFrame` delegates `#load` to `Arrow::Table#load`. We can load from `[.arrow, .arrows, .csv, .csv.gz, .tsv]` files." ] }, { "cell_type": "markdown", "id": "ffdbce32-99be-4e1f-af1a-f4c65f8a8d71", "metadata": {}, "source": [ "`load` accepts following options:\n", "\n", "`load(input, format: nil, compression: nil, schema: nil, skip_lines: nil)`\n", "\n", "- `format` [:arrow_file, :batch, :arrows, :arrow_stream, :stream, :csv, :tsv]\n", "- `compression` [:gzip, nil]\n", "- `schema` [Arrow::Schema]\n", "- `skip_lines` [Regexp]" ] }, { "cell_type": "markdown", "id": "bc496298-8a04-41c3-9659-156c5fd99ef6", "metadata": {}, "source": [ "Load from a file 'comecome.csv';" ] }, { "cell_type": "code", "execution_count": null, "id": "ce9f7237-9196-41a0-ba04-6ef4430a8a0c", "metadata": { "tags": [] }, "outputs": [], "source": [ "file = Tempfile.open(['comecome', '.csv']) do |f|\n", " f.puts(<<~CSV)\n", " name,age\n", " Yasuko,68\n", " Rui,49\n", " Hinata,28\n", " CSV\n", " f\n", "end\n", "\n", "DataFrame.load(file)" ] }, { "cell_type": "markdown", "id": "2025f16f-49cc-40f3-9078-2d8fd0d778eb", "metadata": {}, "source": [ "Load from a Buffer;" ] }, { "cell_type": "code", "execution_count": null, "id": "3726317a-697f-49ba-a97f-bd8ae592987c", "metadata": { "tags": [] }, "outputs": [], "source": [ "DataFrame.load(Arrow::Buffer.new(<<~BUFFER), format: :csv)\n", " name,age\n", " Yasuko,68\n", " Rui,49\n", " Hinata,28\n", "BUFFER" ] }, { "cell_type": "markdown", "id": "8be3f54f-2c91-4f2f-9957-666d600682e9", "metadata": {}, "source": [ "Load from a Buffer skipping comment line;" ] }, { "cell_type": "code", "execution_count": null, "id": "7f9f88ae-52d6-422f-9c84-43afb2d65892", "metadata": { "tags": [] }, "outputs": [], "source": [ "DataFrame.load(Arrow::Buffer.new(<<~BUFFER), format: :csv, skip_lines: /^#/)\n", " # comment\n", " name,age\n", " Yasuko,68\n", " Rui,49\n", " Hinata,28\n", "BUFFER" ] }, { "cell_type": "markdown", "id": "29875147-1371-4575-a565-69c3534c15f2", "metadata": {}, "source": [ "## 5. Load from a URI" ] }, { "cell_type": "code", "execution_count": null, "id": "916b86e2-e3a2-4ebb-8770-9e8a29c46523", "metadata": { "tags": [] }, "outputs": [], "source": [ "uri = URI(\"https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv\")\n", "DataFrame.load(uri)" ] }, { "cell_type": "markdown", "id": "e6abe64d-e97f-437e-9c54-18f9e06e9668", "metadata": {}, "source": [ "## 6. Save\n", "\n", "`#save` accepts same options as `#load`. See [#4. Load](#4.-Load)." ] }, { "cell_type": "code", "execution_count": null, "id": "91c0fb62-7990-47f1-9fb6-b0529bc1783f", "metadata": { "tags": [] }, "outputs": [], "source": [ "penguins.save(\"penguins.arrow\")\n", "penguins.save(\"penguins.arrows\")\n", "penguins.save(\"penguins.csv\")\n", "penguins.save(\"penguins.csv.gz\")\n", "penguins.save(\"penguins.tsv\")\n", "penguins.save(\"penguins.feather\")" ] }, { "cell_type": "markdown", "id": "65bd806c-9a96-4c1b-a115-f28965a60e0d", "metadata": {}, "source": [ "(Since 0.3.0) `DataFrame#save` returns self." ] }, { "cell_type": "markdown", "id": "d1d30973-9e2f-406a-9f42-9e6e4c966baf", "metadata": {}, "source": [ "## 7. to_s/inspect" ] }, { "cell_type": "markdown", "id": "a7bc9cb7-eae4-495f-831e-b747e486d0bd", "metadata": {}, "source": [ "`to_s` or `inspect` (it uses to_s inside) shows a preview of the dataframe.\n", "\n", "It shows first 5 and last 3 rows if it has many rows. Columns are also omitted if line is exceeded 80 letters." ] }, { "cell_type": "code", "execution_count": null, "id": "af6d29ef-2e1c-4a08-a8b2-d69acda79ec5", "metadata": { "tags": [] }, "outputs": [], "source": [ "df = DataFrame.new(\n", " x: [1, 2, 3, 4, 5],\n", " y: [1, 2, 3, 0/0.0, nil],\n", " s: %w[A B C D] << nil,\n", " b: [true, false, true, false, nil]\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "cdff2e60-bd0a-4d12-b348-201a49bbbbbe", "metadata": { "tags": [] }, "outputs": [], "source": [ "p penguins; nil" ] }, { "cell_type": "markdown", "id": "cb44df38-58f7-479c-b7a4-c9c305639292", "metadata": {}, "source": [ "## 8. Show table" ] }, { "cell_type": "markdown", "id": "4f5c7221-fd6c-4dcd-81ce-6d1da0673014", "metadata": {}, "source": [ "`#table` shows Arrow::Table object. The alias is `#to_arrow`." ] }, { "cell_type": "code", "execution_count": null, "id": "fc710035-8134-4b18-89fe-8c58b95e0e0e", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.table" ] }, { "cell_type": "code", "execution_count": null, "id": "2634fb7b-194f-4277-94ba-05f39c497ffa", "metadata": { "tags": [] }, "outputs": [], "source": [ "penguins.to_arrow" ] }, { "cell_type": "code", "execution_count": null, "id": "9dba2a67-ede7-4663-907b-9b2dd5db1605", "metadata": { "tags": [] }, "outputs": [], "source": [ "# This is a Red Arrow's feature\n", "puts df.table.to_s(format: :column)" ] }, { "cell_type": "code", "execution_count": null, "id": "d1cc17b8-1cfc-4986-9dec-7bca02be32f0", "metadata": { "tags": [] }, "outputs": [], "source": [ "# This is also a Red Arrow's feature\n", "puts df.table.to_s(format: :list)" ] }, { "cell_type": "markdown", "id": "16e4ae6b-2399-43f0-be8e-65669b95c7b6", "metadata": {}, "source": [ "## 9. TDR" ] }, { "cell_type": "markdown", "id": "2d14eb4b-9026-4cc5-a71a-598946d40b67", "metadata": {}, "source": [ "TDR means 'Transposed Dataframe Representation'. It shows columns in lateral just the same shape as initializing by a Hash. TDR has some information which is useful for the exploratory data processing.\n", "\n", "- DataFrame shape: n_rows x n_columns\n", "- Data types\n", "- Levels: number of unique elements\n", "- Data preview: same data is aggregated if level is smaller (tally mode)\n", "- Show counts of abnormal element: NaN and nil\n", "\n", "It is similar to dplyr's (or Polars's) `glimpse()` so we have an alias `#glimpse` (since 0.4.0)." ] }, { "cell_type": "code", "execution_count": null, "id": "8050462f-7c60-41b7-a011-af11763784dc", "metadata": { "tags": [] }, "outputs": [], "source": [ "# use the same dataframe as #7\n", "df.tdr" ] }, { "cell_type": "code", "execution_count": null, "id": "bb616ffe-c19a-4b02-a011-601ceb3db656", "metadata": { "tags": [] }, "outputs": [], "source": [ "penguins.tdr" ] }, { "cell_type": "markdown", "id": "73b8dc18-079f-4d40-8d0e-239f010550da", "metadata": {}, "source": [ "`#tdr` has some options:\n", "\n", "`limit` : to limit a number of variables to show. Default value is `limit=10`." ] }, { "cell_type": "code", "execution_count": null, "id": "0962845d-e642-4d2a-9607-43e197b46bc5", "metadata": { "tags": [] }, "outputs": [], "source": [ "penguins.tdr(3)" ] }, { "cell_type": "markdown", "id": "1324f876-4579-42af-95c6-b90c67b05089", "metadata": {}, "source": [ "By default `#tdr` shows 9 variables at maximum. `#tdr(:all)` will show all variables." ] }, { "cell_type": "code", "execution_count": null, "id": "ea6e2b69-aa98-4e0e-887d-58d9f27a34ca", "metadata": { "tags": [] }, "outputs": [], "source": [ "mtcars.tdr(:all)" ] }, { "cell_type": "markdown", "id": "32ebd6cb-1eae-4df2-acc8-86ef3c04c605", "metadata": {}, "source": [ "(Since 0.4.0) `#tdra` method is short cut for `#tdr(:all)`" ] }, { "cell_type": "code", "execution_count": null, "id": "039f5e56-a914-469e-80b6-4475b5ce513d", "metadata": { "tags": [] }, "outputs": [], "source": [ "mtcars.tdra" ] }, { "cell_type": "markdown", "id": "573606c4-23b9-4b38-8c92-a04f1c1e8781", "metadata": {}, "source": [ "`elements` : max number of elements to show in observations. Default value is `elements: 5`." ] }, { "cell_type": "code", "execution_count": null, "id": "f957d2bd-e8c0-42a1-a3b4-0a9478e740bf", "metadata": { "tags": [] }, "outputs": [], "source": [ "penguins.tdr(elements: 3) # Show first 3 items in data" ] }, { "cell_type": "markdown", "id": "d37ece79-1999-49eb-a2d1-831184ee6509", "metadata": {}, "source": [ "`tally` : max level to use tally mode. Level means size of `tally`ed hash. Default value is `tally: 5`." ] }, { "cell_type": "code", "execution_count": null, "id": "9c1c472c-3d15-4bca-9a1b-7f86c63d3ed8", "metadata": { "tags": [] }, "outputs": [], "source": [ "penguins.tdr(tally: 0) # Don't use tally mode" ] }, { "cell_type": "markdown", "id": "e3c38037-90a1-4fc5-9904-41fc74085908", "metadata": {}, "source": [ "`#tdr_str` returns a String. `#tdr` do the same thing as `puts #tdr_str`" ] }, { "cell_type": "code", "execution_count": null, "id": "c14999d7-3462-4210-a2a5-ea61130ed034", "metadata": { "tags": [] }, "outputs": [], "source": [ "puts penguins.tdr_str" ] }, { "cell_type": "markdown", "id": "c9b8951e-1603-4455-a837-b5b341c43444", "metadata": {}, "source": [ "(Since 0.4.0) `#glimpse` is an alias for `#tdr`." ] }, { "cell_type": "code", "execution_count": null, "id": "6865e4be-a63b-4adf-a306-c8102b0b8da3", "metadata": { "tags": [] }, "outputs": [], "source": [ "mtcars.glimpse(:all, elements: 10)" ] }, { "cell_type": "markdown", "id": "21d68764-1bc1-4915-99b6-5ae938b85999", "metadata": {}, "source": [ "## 10. Size and shape" ] }, { "cell_type": "code", "execution_count": null, "id": "487399f8-a3ef-467f-aa7f-ecbaee5fcb75", "metadata": { "tags": [] }, "outputs": [], "source": [ "# same as n_rows, n_obs\n", "df.size" ] }, { "cell_type": "code", "execution_count": null, "id": "dc7441c3-7c85-4ce1-a20e-de8f41f280b4", "metadata": { "tags": [] }, "outputs": [], "source": [ "# same as n_cols, n_vars\n", "df.n_keys" ] }, { "cell_type": "code", "execution_count": null, "id": "3d42fea6-801a-45f4-8e22-ea9d76ae070f", "metadata": { "tags": [] }, "outputs": [], "source": [ "# [df.size, df.n_keys], [df.n_rows, df.n_cols]\n", "df.shape" ] }, { "cell_type": "markdown", "id": "bc5caa94-325f-4014-9c90-8ac909c2b378", "metadata": {}, "source": [ "## 11. Keys" ] }, { "cell_type": "code", "execution_count": null, "id": "bb47775f-fed0-42e6-8781-aa8b721d6112", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.keys" ] }, { "cell_type": "code", "execution_count": null, "id": "3d540ab0-3e52-47b7-b338-b4e0b3d929cb", "metadata": { "tags": [] }, "outputs": [], "source": [ "penguins.keys" ] }, { "cell_type": "markdown", "id": "decc6a61-9994-4d60-9827-b257cafafb70", "metadata": {}, "source": [ "## 12. Types" ] }, { "cell_type": "code", "execution_count": null, "id": "bf9cd2bc-a213-427e-bc00-f2083b0e0471", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.types" ] }, { "cell_type": "code", "execution_count": null, "id": "b1ecb891-98b5-4919-9f37-1847202007d8", "metadata": { "tags": [] }, "outputs": [], "source": [ "penguins.types" ] }, { "cell_type": "markdown", "id": "869b3670-62f8-4c23-807b-d6d100a1981e", "metadata": {}, "source": [ "## 13. Data type classes" ] }, { "cell_type": "code", "execution_count": null, "id": "776ab4db-073b-4b30-931a-8ec77284cdc4", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.type_classes" ] }, { "cell_type": "code", "execution_count": null, "id": "0546a5d0-cab1-4ca8-a2e5-0637d0fd48b6", "metadata": { "tags": [] }, "outputs": [], "source": [ "penguins.type_classes" ] }, { "cell_type": "markdown", "id": "1c2513f6-909e-47fd-a543-66c4f424f44e", "metadata": { "tags": [] }, "source": [ "## 14. Indices\n", "\n", "Another example of `indices` is [62. Custom index](#62.-Custom-index)." ] }, { "cell_type": "code", "execution_count": null, "id": "e6e9d7ef-1471-4f23-9210-56045c9fabd5", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.indexes\n", "# or\n", "df.indices" ] }, { "cell_type": "markdown", "id": "cd209f6f-4bd6-4eb1-9f31-25a4f27eed7f", "metadata": {}, "source": [ "`#indices` returns Vector since 0.2.3 ." ] }, { "cell_type": "markdown", "id": "3908395f-b086-4fbb-9855-e1ce233f0595", "metadata": {}, "source": [ "## 15. To an Array or a Hash" ] }, { "cell_type": "markdown", "id": "22cb724e-cf61-40d9-a58b-9cc793e83645", "metadata": {}, "source": [ "DataFrame#to_a returns an array of row-oriented data without a header." ] }, { "cell_type": "code", "execution_count": null, "id": "4054daad-9266-4002-8942-c0891050cb4d", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.to_a" ] }, { "cell_type": "markdown", "id": "f6abae59-fe31-4056-9de8-7c36e35235de", "metadata": {}, "source": [ "If you need a column-oriented array with keys, use `.to_h.to_a`" ] }, { "cell_type": "code", "execution_count": null, "id": "d3631290-eb74-4d21-a469-86381c668c7f", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.to_h" ] }, { "cell_type": "code", "execution_count": null, "id": "08c45e92-f640-4e62-bc96-ee259d0ecff4", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.to_h.to_a" ] }, { "cell_type": "markdown", "id": "39b65fc0-4405-4414-9a74-91c724ef587c", "metadata": {}, "source": [ "## 16. Schema" ] }, { "cell_type": "markdown", "id": "e734dfc4-bcc8-45d7-af59-3945800d1246", "metadata": {}, "source": [ "Schema is keys and value types pairs as a Hash." ] }, { "cell_type": "code", "execution_count": null, "id": "36db7842-e9b0-4473-84d4-3aef987d427f", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.schema" ] }, { "cell_type": "markdown", "id": "3e61237d-ac67-45bb-827c-a769dff61809", "metadata": {}, "source": [ "## 17. Vector" ] }, { "cell_type": "markdown", "id": "27402307-aaad-49c8-88ca-65346668601d", "metadata": {}, "source": [ "Each variable (column in the table) is represented by a Vector object." ] }, { "cell_type": "code", "execution_count": null, "id": "6c9ba041-231d-4057-a280-acf620b68525", "metadata": { "tags": [] }, "outputs": [], "source": [ "df[:x] # This syntax will come later" ] }, { "cell_type": "markdown", "id": "3e13d06d-b432-45b2-9745-0c6ef9228e23", "metadata": {}, "source": [ "Or create new Vector by the constructor." ] }, { "cell_type": "code", "execution_count": null, "id": "3e18a4e0-238c-4800-8bda-a88a57dde3e9", "metadata": { "tags": [] }, "outputs": [], "source": [ "Vector.new(1, 2, 3, 4, 5)" ] }, { "cell_type": "code", "execution_count": null, "id": "3bd55d9d-b988-46b2-bc11-e3dc5f4adc6c", "metadata": { "tags": [] }, "outputs": [], "source": [ "Vector.new(1..5)" ] }, { "cell_type": "code", "execution_count": null, "id": "19688e6e-b59b-4a84-8c07-57e87cd0e242", "metadata": { "tags": [] }, "outputs": [], "source": [ "Vector.new([1, 2, 3], [4, 5])" ] }, { "cell_type": "code", "execution_count": null, "id": "076bd0e2-01ab-4497-9b9b-84f72a4805bc", "metadata": { "tags": [] }, "outputs": [], "source": [ "array = Arrow::Array.new([1, 2, 3, 4, 5])\n", "Vector.new(array)" ] }, { "cell_type": "markdown", "id": "22091661-e78a-4c66-9e48-4c3c676469b4", "metadata": {}, "source": [ "- TODO: `Vector[1..5]` as a constructor" ] }, { "cell_type": "markdown", "id": "b729bdba-87a2-4282-bd0e-319fe17f42da", "metadata": {}, "source": [ "## 18. Vectors" ] }, { "cell_type": "markdown", "id": "f5ddd840-2f84-467b-a9bb-feb769573b69", "metadata": {}, "source": [ "Returns an Array of Vectors as a DataFrame." ] }, { "cell_type": "code", "execution_count": null, "id": "d3ae03f2-e2fe-4a15-abe1-331185448d61", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.vectors" ] }, { "cell_type": "markdown", "id": "8ac88ff3-0cb6-43d6-a999-0c2e8c6defb7", "metadata": { "tags": [] }, "source": [ "## 19. Variables\n", "\n", "Returns key and Vector pairs as a Hash." ] }, { "cell_type": "code", "execution_count": null, "id": "3351a216-6fe5-485e-8686-53c1e754fa2e", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.variables" ] }, { "cell_type": "markdown", "id": "3b518c1c-eda7-406f-a885-b2344b1726eb", "metadata": {}, "source": [ "## 20. Select columns by #[ ]" ] }, { "cell_type": "markdown", "id": "767b4e49-19eb-4d5f-b030-91bd78f0f5b9", "metadata": {}, "source": [ "`DataFrame#[]` is overloading column operations and row operations.\n", "\n", "- For columns (variables)\n", " - Key in a Symbol: `df[:symbol]`\n", " - Key in a String: `df[\"string\"]`\n", " - Keys in an Array: `df[:symbol1, \"string\", :symbol2]`\n", " - Keys by indeces: `df[df.keys[0]`, `df[df.keys[1,2]]`, `df[df.keys[1..]]`" ] }, { "cell_type": "code", "execution_count": null, "id": "ccf60edc-cccf-49e3-a503-1ca532247130", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Keys in a Symbol and a String\n", "df[:x, 'y']" ] }, { "cell_type": "code", "execution_count": null, "id": "8500f8c0-ff5a-4537-9f47-03d675e31b18", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Keys in a Range\n", "df[:x..:y]" ] }, { "cell_type": "code", "execution_count": null, "id": "db35cae1-35c2-47de-a7e8-906161f21282", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Keys with a index Range, and a symbol\n", "df[df.keys[2..], :x]" ] }, { "cell_type": "markdown", "id": "03e14403-f7bc-4350-9e7b-715901164331", "metadata": {}, "source": [ "## 21. Select rows by #[ ]\n", "`DataFrame#[]` is overloading column operations and row operations.\n", "\n", "- For rows (observations)\n", " - Select rows by a Index: `df[index]`\n", " - Select rows by Indices: `df[indices]` # Array, Arrow::Array, Vectors are acceptable for indices\n", " - Select rows by Ranges: `df[range]`\n", " - Select rows by Booleans: `df[booleans]` # Array, Arrow::Array, Vectors are acceptable for booleans" ] }, { "cell_type": "code", "execution_count": null, "id": "e3bc60a7-611e-4fd8-9770-8e0d167d3fee", "metadata": { "tags": [] }, "outputs": [], "source": [ "# indices\n", "df[0, 2, 1]" ] }, { "cell_type": "code", "execution_count": null, "id": "2b8b3801-ae37-4629-9db5-ff937941c895", "metadata": { "tags": [] }, "outputs": [], "source": [ "# including a Range\n", "# negative indices are also acceptable\n", "df[1..2, -1]" ] }, { "cell_type": "code", "execution_count": null, "id": "3f6f8d73-a66c-4773-9bf5-0878c700f2d6", "metadata": { "tags": [] }, "outputs": [], "source": [ "# booleans\n", "# length of boolean should be the same as self\n", "df[false, true, true, false, true]" ] }, { "cell_type": "code", "execution_count": null, "id": "abe57279-54fd-48ec-a1a4-c7453211e776", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Arrow::Array\n", "indices = Arrow::UInt8Array.new([0,2,4])\n", "df[indices]" ] }, { "cell_type": "code", "execution_count": null, "id": "2266611f-23d8-4645-a1e8-b07c2370fb3f", "metadata": { "tags": [] }, "outputs": [], "source": [ "# By a Vector as indices\n", "indices = Vector.new(df.indices)\n", "# indices > 1 returns a boolean Vector\n", "df[indices > 1]" ] }, { "cell_type": "code", "execution_count": null, "id": "0ea2da7e-aeca-4874-be4a-6af563aa378b", "metadata": { "tags": [] }, "outputs": [], "source": [ "# By a Vector as booleans\n", "booleans = df[:b]" ] }, { "cell_type": "code", "execution_count": null, "id": "9f842890-6359-4266-9a23-2f8f813ef548", "metadata": { "tags": [] }, "outputs": [], "source": [ "df[booleans]" ] }, { "cell_type": "markdown", "id": "98a04874-cb2c-44c0-b410-b330b9d12b0f", "metadata": {}, "source": [ "## 22. empty?" ] }, { "cell_type": "code", "execution_count": null, "id": "7b1ab319-90a7-4f09-8629-04dcd94076cb", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.empty?" ] }, { "cell_type": "code", "execution_count": null, "id": "3f9f8771-87dd-44eb-8aac-6a3ed8b4c183", "metadata": { "tags": [] }, "outputs": [], "source": [ "DataFrame.new" ] }, { "cell_type": "code", "execution_count": null, "id": "1e09c32f-20a8-4175-827f-cdb98063535a", "metadata": { "tags": [] }, "outputs": [], "source": [ "DataFrame.new.empty?" ] }, { "cell_type": "markdown", "id": "86b826dd-10e6-4087-9162-b89ac6561a61", "metadata": {}, "source": [ "## 23. Select columns by pick" ] }, { "cell_type": "markdown", "id": "b5aefd22-4e96-4dc5-91d2-e6826256bda6", "metadata": { "tags": [] }, "source": [ "`DataFrame#pick` accepts an Array of keys to pick up columns (variables) and creates a new DataFrame. You can change the order of columns at a same time.\n", "\n", "The name `pick` comes from the action to pick variables(columns) according to the label keys." ] }, { "cell_type": "code", "execution_count": null, "id": "68124521-b823-424d-9e06-d11aa927d618", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.pick(:s, :y)\n", "# or\n", "df.pick([:s, :y]) # OK too." ] }, { "cell_type": "markdown", "id": "a76dca00-da8f-4959-be18-7a1015a9d13c", "metadata": {}, "source": [ "Or use a boolean Array of lengeh `n_key` to `pick`. This style preserves the order of variables." ] }, { "cell_type": "code", "execution_count": null, "id": "b91f8925-529c-43c9-93ba-e21bcac0f2f7", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.pick(false, true, true, false)\n", "# or\n", "df.pick([false, true, true, false])\n", "# or\n", "df.pick(Vector.new([false, true, true, false]))" ] }, { "cell_type": "markdown", "id": "5f903182-745b-4923-99d8-14a9b9c6ea4c", "metadata": {}, "source": [ "`#pick` also accepts a block in the context of self.\n", "\n", "Next example is picking up numeric variables." ] }, { "cell_type": "code", "execution_count": null, "id": "37bb0a49-c38a-484c-91d4-3e23ab43a727", "metadata": { "tags": [] }, "outputs": [], "source": [ "# reciever is required with the argument style\n", "df.pick(df.vectors.map(&:numeric?))\n", "\n", "# with a block\n", "df.pick { vectors.map(&:numeric?) }" ] }, { "cell_type": "markdown", "id": "1692e8c3-aafa-4951-84a6-ff26e4f8bbde", "metadata": {}, "source": [ "`pick` also accepts numeric indexes.\n", "\n", "(Since 0.2.1)" ] }, { "cell_type": "code", "execution_count": null, "id": "dd929ef0-8af4-4bb6-963a-b3f1abd73bf0", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.pick(0, 3)" ] }, { "cell_type": "markdown", "id": "7c1815e4-de6c-425e-8602-b8dd66836250", "metadata": {}, "source": [ "## 24. Reject columns by drop" ] }, { "cell_type": "markdown", "id": "d1ab045e-66f9-4922-8bf2-35aee7f2812e", "metadata": { "tags": [] }, "source": [ "`DataFrame#drop` accepts an Array keys to drop columns (variables) to create a remainer DataFrame.\n", "\n", "The name `drop` comes from the pair word of `pick`." ] }, { "cell_type": "code", "execution_count": null, "id": "7ccace08-62b0-4b0b-93fb-81edf673abf7", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.drop(:x, :b)\n", "# df.drop([:x, :b]) #is OK too." ] }, { "cell_type": "markdown", "id": "2085b349-95c5-4607-b029-f7c3d630ac1c", "metadata": {}, "source": [ "Or use a boolean Array of lengeh `n_key` to `drop`." ] }, { "cell_type": "code", "execution_count": null, "id": "785c02f1-1e16-4722-9961-4b49223c8290", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.drop(true, false, false, true)\n", "# df.drop([true, false, false, true]) # is OK too" ] }, { "cell_type": "markdown", "id": "d246161e-02cc-40fb-8921-26b37eb3956f", "metadata": {}, "source": [ "`#drop` also accepts a block in the context of self.\n", "\n", "Next example will drop variables which have nil or NaN values." ] }, { "cell_type": "code", "execution_count": null, "id": "069932e3-d393-4ede-9eb5-7aac8625e0c0", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.drop { vectors.map { |v| v.is_na.any } }" ] }, { "cell_type": "markdown", "id": "88b064d6-7d90-4a0b-b9c8-d92e103269fb", "metadata": {}, "source": [ "Argument style is also acceptable but it requires the reciever 'df'." ] }, { "cell_type": "code", "execution_count": null, "id": "3003a5c2-0966-4f2c-9643-59e8b546c8aa", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.drop(df.vectors.map { |v| v.is_na.any })" ] }, { "cell_type": "markdown", "id": "e30c2b72-fc05-405c-97dc-7e31b88abebf", "metadata": {}, "source": [ "`drop` also accepts numeric indexes.\n", "\n", "(Since 0.2.1)" ] }, { "cell_type": "code", "execution_count": null, "id": "0b6eebe0-145d-4660-8be4-e6f09eb656aa", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.drop(0, 3)" ] }, { "cell_type": "markdown", "id": "0f6dc86c-828d-4f9f-8b07-fce63c30fdca", "metadata": {}, "source": [ "## 25. Pick/drop and nil" ] }, { "cell_type": "markdown", "id": "0a108878-565b-400e-9a47-a15aae09429c", "metadata": {}, "source": [ "When `pick` or `drop` is used with booleans, nil in the booleans is treated as false. This behavior is aligned with Ruby's `BasicObject#!`." ] }, { "cell_type": "code", "execution_count": null, "id": "7c01fbb4-9bfa-4afc-8e6b-45c97c0beb03", "metadata": { "tags": [] }, "outputs": [], "source": [ "booleans = [true, true, false, nil]\n", "booleans_invert = booleans.map(&:!) # => [false, false, true, true] because nil.! is true\n", "df.pick(booleans) == df.drop(booleans_invert)" ] }, { "cell_type": "markdown", "id": "12a24264-9b7a-42a1-a541-e292e3876e35", "metadata": {}, "source": [ "## 26. Vector#invert, #primitive_invert\n", "\n", "For the boolean Vector;" ] }, { "cell_type": "code", "execution_count": null, "id": "ea352e12-7e8a-43be-b8ac-797adbc47708", "metadata": { "tags": [] }, "outputs": [], "source": [ "vector = Vector.new(booleans)" ] }, { "cell_type": "markdown", "id": "2a0f82e0-157b-4185-9254-0618be291f9b", "metadata": {}, "source": [ "nil is converted to nil by `Vector#invert`." ] }, { "cell_type": "code", "execution_count": null, "id": "596c521f-12bf-4448-9e5d-e1b4a2c3d896", "metadata": { "tags": [] }, "outputs": [], "source": [ "vector.invert\n", "# or\n", "!vector" ] }, { "cell_type": "markdown", "id": "a1aec910-3055-4627-a02b-22d45f2ceb70", "metadata": {}, "source": [ "So `df.pick(booleans) != df.drop(booleans.invert)` when booleans have any nils.\n", "\n", "On the other hand, `Vector#primitive_invert` follows Ruby's `BasicObject#!`'s behavior. Then pick and drop keep 'MECE' behavior." ] }, { "cell_type": "code", "execution_count": null, "id": "4dcaba48-1cea-4ce9-b4a9-b079b43af7ec", "metadata": { "tags": [] }, "outputs": [], "source": [ "vector.primitive_invert" ] }, { "cell_type": "code", "execution_count": null, "id": "c7ae4dad-275a-49e0-a0b0-bf3686248070", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.pick(vector) == df.drop(vector.primitive_invert)" ] }, { "cell_type": "markdown", "id": "9a6cec74-43f0-4a72-8262-25b1e311f602", "metadata": {}, "source": [ "## 27. Pick/drop, #[] and #v" ] }, { "cell_type": "markdown", "id": "32c8f74d-b3ce-4305-9af7-6ea70052c773", "metadata": {}, "source": [ "When `pick` or `drop` select a single column (variable), it returns a `DataFrame` with one column (variable)." ] }, { "cell_type": "code", "execution_count": null, "id": "e13aee24-cac6-41ad-b8a3-0ec26edbe5d1", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.pick(:x) # or\n", "df.drop(:y, :s, :b)" ] }, { "cell_type": "markdown", "id": "3e47b9d2-929e-4674-9690-0a1fdf7b0a7d", "metadata": {}, "source": [ "In contrast, when `[]` selects a single column (variable), it returns a `Vector`." ] }, { "cell_type": "code", "execution_count": null, "id": "60d228be-7357-434d-9d39-ee72c110e6fe", "metadata": { "tags": [] }, "outputs": [], "source": [ "df[:x]" ] }, { "cell_type": "markdown", "id": "6d973934-e08b-4b45-8efb-52f9167e7238", "metadata": {}, "source": [ "This behavior may be useful to use with DataFrame manipulation verbs (like pick, drop, slice, remove, assign, rename)." ] }, { "cell_type": "code", "execution_count": null, "id": "6beefc5a-dc47-42cc-a283-456073c4251e", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.pick { keys.select { |key| df[key].numeric? } }" ] }, { "cell_type": "markdown", "id": "365a4de2-dbe2-4eb8-8c3a-06dd573f88ca", "metadata": {}, "source": [ "`df#v` method is same as `df#[]` to pick a Vector. But a little bit faster and easy to use in the block." ] }, { "cell_type": "code", "execution_count": null, "id": "6f08d8be-22c4-42d0-8519-23c1be71b777", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.v(:x)" ] }, { "cell_type": "markdown", "id": "34c9bcb0-889a-4190-b2b8-49765cd059c2", "metadata": {}, "source": [ "## 28. Slice\n", "\n", "Another example of `slice` is [#66. Row index label by slice_by](#66.-Row-index-label-by-slice_by)." ] }, { "cell_type": "markdown", "id": "9a428ba8-c306-4ab8-8607-51174e8e6ebe", "metadata": {}, "source": [ "`slice` selects rows (records) to create a subset of a DataFrame." ] }, { "cell_type": "markdown", "id": "6016d6d4-72d6-4ae2-b7dd-3d526c91ae61", "metadata": {}, "source": [ "`slice(indeces)` accepts indices as arguments. Indices should be Integers, Floats or Ranges of Integers. Negative index from the tail like Ruby's Array is also acceptable." ] }, { "cell_type": "code", "execution_count": null, "id": "9cdce2e4-7876-4be6-bd1f-bc8ab6e6c871", "metadata": { "tags": [] }, "outputs": [], "source": [ "# returns 5 rows from the start and 5 rows from the end\n", "penguins.slice(0...5, -5..-1)" ] }, { "cell_type": "code", "execution_count": null, "id": "93c3f6f0-7bc9-4909-8f32-20e8c1ddfd3a", "metadata": { "tags": [] }, "outputs": [], "source": [ "# slice accepts Float index\n", "# 33% of 344 observations in index => 113.52 th data ??\n", "indexed_penguins = penguins.assign_left { [:index, indexes] } # #assign_left and assigner by Array is 0.2.0 feature\n", "indexed_penguins.slice(penguins.size * 0.33)" ] }, { "cell_type": "markdown", "id": "8139bb28-89f8-4058-b824-dde33ead0b60", "metadata": {}, "source": [ "Indices in Vectors or Arrow::Arrays are also acceptable." ] }, { "cell_type": "markdown", "id": "6f79db8c-c706-4d60-949b-3f644474d375", "metadata": {}, "source": [ "Another way to select in `slice` is to use booleans. An alias for this feature is `filter`.\n", "- Booleans is an Array, Arrow::Array, Vector or their Array.\n", "- Each data type must be boolean.\n", "- Size of booleans must be same as the size of self." ] }, { "cell_type": "code", "execution_count": null, "id": "f58ca131-7375-4489-90ce-6ba54b898eb5", "metadata": { "tags": [] }, "outputs": [], "source": [ "# make boolean Vector to check over 40\n", "booleans = penguins[:bill_length_mm] > 40" ] }, { "cell_type": "code", "execution_count": null, "id": "176ab365-c66a-4712-97b9-4381a536321b", "metadata": { "tags": [] }, "outputs": [], "source": [ "penguins.slice(booleans)" ] }, { "cell_type": "markdown", "id": "3264a182-6b72-461a-b712-c3b708c53516", "metadata": {}, "source": [ "`slice` accepts a block.\n", "- We can't use both arguments and a block at a same time.\n", "- The block should return indeces in any length or a boolean Array with a same length as `size`.\n", "- Block is called in the context of self. So reciever 'self' can be omitted in the block." ] }, { "cell_type": "code", "execution_count": null, "id": "c95d3426-0bbb-430e-8d83-6e22434d99ed", "metadata": { "tags": [] }, "outputs": [], "source": [ "# return a DataFrame with bill_length_mm is in 2*std range around mean\n", "penguins.slice do\n", " min = bill_length_mm.mean - bill_length_mm.std\n", " max = bill_length_mm.mean + bill_length_mm.std\n", " bill_length_mm.to_a.map { |e| (min..max).include? e }\n", "end" ] }, { "cell_type": "markdown", "id": "4fa42801-64f5-4432-856b-85c26a68515d", "metadata": {}, "source": [ "## 29. Slice and nil option" ] }, { "cell_type": "markdown", "id": "31017a7e-0923-4283-bc92-246ebe2591c3", "metadata": {}, "source": [ "`Arrow::Table#slice` uses `#filter` method with a option `Arrow::FilterOptions.null_selection_behavior = :emit_null`. This will propagate nil at the same row." ] }, { "cell_type": "code", "execution_count": null, "id": "8e4a8108-154b-4621-acd1-704ddf229d61", "metadata": { "tags": [] }, "outputs": [], "source": [ "hash = { a: [1, 2, 3], b: %w[A B C], c: [1.0, 2, 3] }\n", "table = Arrow::Table.new(hash)\n", "table.slice([true, false, nil])" ] }, { "cell_type": "markdown", "id": "dbb57c5a-e949-42b8-a82c-9affb3fe5b7b", "metadata": {}, "source": [ "Whereas in RedAmber, `DataFrame#slice` with booleans containing nil is treated as false. This behavior comes from `Allow::FilterOptions.null_selection_behavior = :drop`. This is a default value for `Arrow::Table.filter` method." ] }, { "cell_type": "code", "execution_count": null, "id": "851c3bf6-b9e9-41bd-92c5-5372ed934549", "metadata": { "tags": [] }, "outputs": [], "source": [ "RedAmber::DataFrame.new(table).slice([true, false, nil]).table" ] }, { "cell_type": "markdown", "id": "56398a3d-6146-43af-8b96-fec37730fc49", "metadata": {}, "source": [ "## 30. Remove" ] }, { "cell_type": "markdown", "id": "9e042a97-8a5d-412e-8e4a-fda382225a2d", "metadata": {}, "source": [ "Slice and reject rows (observations) to create a remainer DataFrame." ] }, { "cell_type": "markdown", "id": "2b4cbb97-eef3-4db8-8f25-c44c208ec554", "metadata": {}, "source": [ "`#remove(indeces)` accepts indeces as arguments. Indeces should be an Integer or a Range of Integer." ] }, { "cell_type": "code", "execution_count": null, "id": "17e38ab8-886b-4114-bcaf-ee18df7d00cd", "metadata": { "tags": [] }, "outputs": [], "source": [ "# returns 6th to 339th obs. Remainer of 1st example of #30\n", "penguins.remove(0...5, -5..-1)" ] }, { "cell_type": "markdown", "id": "def1c1c4-6b60-4864-ae24-c797fbf008a7", "metadata": {}, "source": [ "`remove(booleans)` accepts booleans as a argument in an Array, a Vector or an Arrow::BooleanArray . Booleans must be same length as `#size`." ] }, { "cell_type": "code", "execution_count": null, "id": "6f169420-7eb2-457f-8d59-7a5c90aa3fa5", "metadata": { "tags": [] }, "outputs": [], "source": [ "# remove all observation contains nil\n", "removed = penguins.remove { vectors.map(&:is_nil).reduce(&:|) }" ] }, { "cell_type": "markdown", "id": "5f1864c9-4ae4-4fcd-9840-ea424ef5e27d", "metadata": {}, "source": [ "`remove {block}` is also acceptable. We can't use both arguments and a block at a same time. The block should return indeces or a boolean Array with a same length as size. Block is called in the context of self." ] }, { "cell_type": "code", "execution_count": null, "id": "a6807c65-25e5-4ee1-8d1b-6018c46b3999", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Remove data in 2*std range around mean\n", "penguins.remove do\n", " vector = self[:bill_length_mm]\n", " min = vector.mean - vector.std\n", " max = vector.mean + vector.std\n", " vector.to_a.map { |e| (min..max).include? e }\n", "end" ] }, { "cell_type": "markdown", "id": "591e6b22-da98-4336-b22e-c7bc9bcf2ebf", "metadata": {}, "source": [ "## 31. Remove and nil" ] }, { "cell_type": "markdown", "id": "67926d1b-c76e-4cb7-b679-6545d850e7e4", "metadata": {}, "source": [ "When `remove` used with booleans, nil in booleans is treated as false. This behavior is aligned with Ruby's `nil#!`." ] }, { "cell_type": "code", "execution_count": null, "id": "8575614e-f702-4ee4-ac7b-745e9b32e803", "metadata": { "tags": [] }, "outputs": [], "source": [ "df = RedAmber::DataFrame.new(a: [1, 2, nil], b: %w[A B C], c: [1.0, 2, 3])" ] }, { "cell_type": "code", "execution_count": null, "id": "932a5e71-8cef-44e5-a789-ce97329bc001", "metadata": { "tags": [] }, "outputs": [], "source": [ "booleans = df[:a] < 2" ] }, { "cell_type": "code", "execution_count": null, "id": "74cf6aa6-8913-433d-97ad-bba2d548afe5", "metadata": { "tags": [] }, "outputs": [], "source": [ "booleans_invert = booleans.to_a.map(&:!)" ] }, { "cell_type": "code", "execution_count": null, "id": "5e466a06-cb17-4dc1-a5b0-34bfd3ffb78b", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.slice(booleans) == df.remove(booleans_invert)" ] }, { "cell_type": "markdown", "id": "8bca0b06-2d08-4c28-8b4c-4fd088f2d2d3", "metadata": {}, "source": [ "Whereas `Vector#invert` returns nil for elements nil. This will bring different result. (See #26)" ] }, { "cell_type": "code", "execution_count": null, "id": "077b216f-0a08-413e-95c9-12789d15a9ba", "metadata": { "tags": [] }, "outputs": [], "source": [ "booleans.invert" ] }, { "cell_type": "code", "execution_count": null, "id": "b3df62a6-c4a3-44cb-bde6-f6be12b120c8", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.remove(booleans.invert)" ] }, { "cell_type": "markdown", "id": "e05f00b6-3bae-4650-8bbc-d4e0692f6f85", "metadata": {}, "source": [ "We have `#primitive_invert` method in Vector. This method returns the same result as `.to_a.map(&:!)` above." ] }, { "cell_type": "code", "execution_count": null, "id": "296ca3cd-a6da-4603-a576-d8c36a810e4f", "metadata": { "tags": [] }, "outputs": [], "source": [ "booleans.primitive_invert" ] }, { "cell_type": "code", "execution_count": null, "id": "ba5b8c0b-b94e-4209-adcd-258ea3b87bfd", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.remove(booleans.primitive_invert)" ] }, { "cell_type": "code", "execution_count": null, "id": "2446792f-0b0a-4642-acae-b4fec89261c1", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.slice(booleans) == df.remove(booleans.primitive_invert)" ] }, { "cell_type": "markdown", "id": "7c23a4ad-0c17-4178-b58a-abfd8153d49b", "metadata": {}, "source": [ "## 32. Remove nil" ] }, { "cell_type": "markdown", "id": "84c7238b-1029-416f-b495-9d045f77b22c", "metadata": {}, "source": [ "Remove any observations containing nil." ] }, { "cell_type": "code", "execution_count": null, "id": "de4bb615-d14d-4c90-ab54-db2f375b9f00", "metadata": { "tags": [] }, "outputs": [], "source": [ "penguins.remove_nil" ] }, { "cell_type": "markdown", "id": "4a4ae8f9-dcf8-4dad-bb77-af076e9cadb5", "metadata": {}, "source": [ "The roundabout way for this is to use `#remove`." ] }, { "cell_type": "code", "execution_count": null, "id": "27a3da5f-0ea2-4c5d-a6c3-c0e20f2224a3", "metadata": { "tags": [] }, "outputs": [], "source": [ "penguins.remove { vectors.map(&:is_nil).reduce(&:|) }" ] }, { "cell_type": "markdown", "id": "4f2a58fd-f033-44f6-9eb4-ed893a2b5d1d", "metadata": {}, "source": [ "## 33. Rename" ] }, { "cell_type": "markdown", "id": "c0d39506-8ae5-48e7-9dd2-acf38d4ec1a9", "metadata": {}, "source": [ "Rename keys (column names) to create a updated DataFrame." ] }, { "cell_type": "markdown", "id": "3f6924ec-e86c-4089-ae40-6783027d3ce0", "metadata": {}, "source": [ "`#rename(key_pairs)` accepts key_pairs as arguments. key_pairs should be a Hash of `{existing_key => new_key}` or an Array of Array `[[existing_key, new_key], ...]` ." ] }, { "cell_type": "code", "execution_count": null, "id": "9396c96d-83d7-4b92-a4ca-27bc9e4d7b9d", "metadata": { "tags": [] }, "outputs": [], "source": [ "h = { name: %w[Yasuko Rui Hinata], age: [68, 49, 28] }\n", "comecome = RedAmber::DataFrame.new(h)" ] }, { "cell_type": "code", "execution_count": null, "id": "fad279c6-1ca0-4493-bd69-0e9ef011bff7", "metadata": { "tags": [] }, "outputs": [], "source": [ "comecome.rename(age: :age_in_1993)\n", "# comecome.rename(:age, :age_in_1993) # is also OK\n", "# comecome.rename([:age, :age_in_1993]) # is also OK" ] }, { "cell_type": "markdown", "id": "9dabb005-9822-4c4b-aaa5-fa6f28f2ed43", "metadata": {}, "source": [ "`#rename {block}` is also acceptable. We can't use both arguments and a block at a same time. The block should return key_pairs as a Hash of `{existing_key => new_key}` or an Array of Array `[[existing_key, new_key], ...]`. Block is called in the context of self." ] }, { "cell_type": "markdown", "id": "aabbba20-6ef8-4da2-8dc0-0cb243cf3b23", "metadata": {}, "source": [ "Symbol key and String key are distinguished." ] }, { "cell_type": "markdown", "id": "07f98b31-6123-4466-b4f8-f995c7cde474", "metadata": {}, "source": [ "## 34. Assign" ] }, { "cell_type": "markdown", "id": "99f6787f-2b36-4360-b155-1c2d7874d25e", "metadata": {}, "source": [ "Another example of `assign` is [64. Assign revised](#64.-Assign-revised), [#65. Variations of assign](#65.-Variations-of-assign) .\n", "\n", "Assign new or updated columns (variables) and create a updated DataFrame.\n", "\n", "- Columns with new keys will append new variables at right (bottom in TDR).\n", "- Columns with exisiting keys will update corresponding vectors." ] }, { "cell_type": "markdown", "id": "b4b22da0-4ee2-4196-88e1-1cfea6a72f4d", "metadata": {}, "source": [ "`#assign(key_pairs)` accepts pairs of key and array_like values as arguments. The pairs should be a Hash of `{key => array_like}` or an Array of Array `[[key, array_like], ... ]`. `array_like` is one of `Vector`, `Array` or `Arrow::Array`. " ] }, { "cell_type": "code", "execution_count": null, "id": "56dcfed8-a6f9-4d8c-bac3-e8ce7c0674a7", "metadata": { "tags": [] }, "outputs": [], "source": [ "comecome = RedAmber::DataFrame.new( name: %w[Yasuko Rui Hinata], age: [68, 49, 28] )" ] }, { "cell_type": "code", "execution_count": null, "id": "8da8d282-8798-44d5-bb7b-7fa2df922308", "metadata": { "tags": [] }, "outputs": [], "source": [ "# update :age and add :brother\n", "assigner = { age: [97, 78, 57], brother: ['Santa', nil, 'Momotaro'] }\n", "comecome.assign(assigner)" ] }, { "cell_type": "markdown", "id": "e6d3ddfc-b16d-4b20-83df-357e9cdb32e6", "metadata": {}, "source": [ "`#assign {block}` is also acceptable. We can't use both arguments and a block at a same time. The block should return pairs of key and array_like values as a Hash of `{key => array_like}` or an Array of Array `[[key, array_like], ... ]`. `array_like` is one of `Vector`, `Array` or `Arrow::Array`. Block is called in the context of self." ] }, { "cell_type": "code", "execution_count": null, "id": "8d69edd0-7ad7-4318-8033-1785ce2543db", "metadata": { "tags": [] }, "outputs": [], "source": [ "df = RedAmber::DataFrame.new(\n", " index: [0, 1, 2, 3, nil],\n", " float: [0.0, 1.1, 2.2, Float::NAN, nil],\n", " string: ['A', 'B', 'C', 'D', nil])" ] }, { "cell_type": "code", "execution_count": null, "id": "e884af01-d82b-42e7-8e92-62baf19919cb", "metadata": { "tags": [] }, "outputs": [], "source": [ "# update numeric variables\n", "df.assign do\n", " vectors.select(&:numeric?).map { |v| [v.key, -v] }\n", "end" ] }, { "cell_type": "markdown", "id": "7b8e2090-628f-4b17-8929-cbb5e0285dff", "metadata": {}, "source": [ "In this example, columns :x and :y are updated. Column :x returns complements for #negate method because :x is :uint8 type." ] }, { "cell_type": "code", "execution_count": null, "id": "9452f8db-5f23-4044-ac87-ac5695fae8ae", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.types" ] }, { "cell_type": "markdown", "id": "c5c83559-f4d8-4ed2-8b20-5c50eb1faa14", "metadata": {}, "source": [ "## 35. Coerce (Vector)" ] }, { "cell_type": "markdown", "id": "77bdfc69-b728-4335-b76e-e4be92f94310", "metadata": {}, "source": [ "Vector has coerce method." ] }, { "cell_type": "code", "execution_count": null, "id": "2bfbe584-be54-486b-af32-e76b37c10e49", "metadata": { "tags": [] }, "outputs": [], "source": [ "vector = RedAmber::Vector.new(1,2,3)" ] }, { "cell_type": "code", "execution_count": null, "id": "ce35d901-38a8-4f13-b2d1-29b83f6c5438", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Vector's `#*` method\n", "vector * -1" ] }, { "cell_type": "code", "execution_count": null, "id": "7d5fc2be-f590-4678-92e9-faa27b618266", "metadata": { "tags": [] }, "outputs": [], "source": [ "# coerced calculation\n", "-1 * vector" ] }, { "cell_type": "code", "execution_count": null, "id": "fa90a6af-add7-42f2-9707-7d726575aeb6", "metadata": { "tags": [] }, "outputs": [], "source": [ "# `@-` operator\n", "-vector" ] }, { "cell_type": "markdown", "id": "4820b527-44e9-4738-aa0e-73604078b3b0", "metadata": { "tags": [] }, "source": [ "## 36. to_ary (Vector)" ] }, { "cell_type": "markdown", "id": "8507dcc4-74e3-44ad-aa54-cf43d55f2131", "metadata": {}, "source": [ "`Vector#to_ary` will enable implicit conversion to an Array." ] }, { "cell_type": "code", "execution_count": null, "id": "b12bd7c8-2981-426c-8ae3-154504a8ea15", "metadata": { "tags": [] }, "outputs": [], "source": [ "Array(Vector.new([3, 4, 5]))" ] }, { "cell_type": "code", "execution_count": null, "id": "c0cb5a98-7cdf-43a8-b2f7-f9df1961c761", "metadata": { "tags": [] }, "outputs": [], "source": [ "[1, 2] + Vector.new([3, 4, 5])" ] }, { "cell_type": "code", "execution_count": null, "id": "03497daf-3afb-429d-9bf8-0f335fee8221", "metadata": { "tags": [] }, "outputs": [], "source": [ "[1, 2, Vector.new([3, 4, 5])].flatten" ] }, { "cell_type": "markdown", "id": "216dde4f-e4d8-4f29-903a-8cbf75de5b8e", "metadata": {}, "source": [ "## 37. Fill nil (Vector)" ] }, { "cell_type": "markdown", "id": "1959d0d7-6d09-4fa5-9365-1e2f7fc35d61", "metadata": {}, "source": [ "`Vector#fill_nil_forward` or `Vector#fill_nil_backward` will\n", "propagate the last valid observation forward (or backward).\n", "Or preserve nil if all previous values are nil or at the end." ] }, { "cell_type": "code", "execution_count": null, "id": "d003b06a-859f-4de0-9e35-803efac85169", "metadata": { "tags": [] }, "outputs": [], "source": [ "integer = Vector.new([0, 1, nil, 3, nil])\n", "integer.fill_nil_forward" ] }, { "cell_type": "code", "execution_count": null, "id": "c5d74006-d364-4e86-8a5e-9e96e87a96e0", "metadata": { "tags": [] }, "outputs": [], "source": [ "integer.fill_nil_backward" ] }, { "cell_type": "markdown", "id": "347785a6-eab0-4864-a871-2c320005211e", "metadata": {}, "source": [ "## 38. all?/any? (Vector)" ] }, { "cell_type": "markdown", "id": "f82a6f5d-03d3-4645-85f5-d25999165378", "metadata": {}, "source": [ "`Vector#all?` returns true if all elements is true.\n", "\n", "`Vector#any?` returns true if exists any true.\n", "\n", "These are unary aggregation function." ] }, { "cell_type": "code", "execution_count": null, "id": "ebad37ad-0a09-48b1-ba3a-4e030a917837", "metadata": { "tags": [] }, "outputs": [], "source": [ "booleans = Vector.new([true, true, nil])\n", "booleans.all?" ] }, { "cell_type": "code", "execution_count": null, "id": "97fc24da-03d4-406d-b353-562896775d60", "metadata": { "tags": [] }, "outputs": [], "source": [ "booleans.any?" ] }, { "cell_type": "markdown", "id": "0ff3b22d-9f7c-42f2-8d18-c89a06af681b", "metadata": {}, "source": [ "If these methods are used with option `skip_nulls: false` nil is considered." ] }, { "cell_type": "code", "execution_count": null, "id": "3e0e5800-665a-4a05-b2cb-d152f3f077de", "metadata": { "tags": [] }, "outputs": [], "source": [ "booleans.all?(skip_nulls: false)" ] }, { "cell_type": "code", "execution_count": null, "id": "3e43f0c4-a254-4735-ac28-de14d2670c67", "metadata": { "tags": [] }, "outputs": [], "source": [ "booleans.any?(skip_nulls: false)" ] }, { "cell_type": "markdown", "id": "abc71a85-7958-4a21-91cf-8c96c0784525", "metadata": {}, "source": [ "## 39. count/count_uniq (Vector)" ] }, { "cell_type": "markdown", "id": "3d556118-4105-4d12-806d-ba56c6ae3d1b", "metadata": {}, "source": [ "`Vector#count` counts element.\n", "\n", "`Vector#count_uniq` counts unique element. `#count_distinct` is an alias (Arrow's name).\n", "\n", "These are unary aggregation function." ] }, { "cell_type": "code", "execution_count": null, "id": "2af73e32-1d7e-4f80-b54e-c40ef08b7034", "metadata": { "tags": [] }, "outputs": [], "source": [ "string = Vector.new(%w[A B A])\n", "string.count" ] }, { "cell_type": "code", "execution_count": null, "id": "fe6d8d85-27b0-438f-b1b4-1b15e9eb05f9", "metadata": { "tags": [] }, "outputs": [], "source": [ "string.count_uniq # count_distinct is also OK" ] }, { "cell_type": "markdown", "id": "70abed9f-665a-4ea7-939e-4b185ee53755", "metadata": {}, "source": [ "## 40. stddev/variance (Vector)" ] }, { "cell_type": "markdown", "id": "965de338-b3be-4d33-92e1-5ad7e2ed18f0", "metadata": {}, "source": [ "These are unary element-wise function." ] }, { "cell_type": "code", "execution_count": null, "id": "0afec200-f377-432b-a260-ae5a0c5ce794", "metadata": { "tags": [] }, "outputs": [], "source": [ "integers = Vector.new([1, 2, 3, nil])\n", "integers.stddev" ] }, { "cell_type": "code", "execution_count": null, "id": "2e40ac09-cb7f-4978-87e8-53f84f16f7c7", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Unbiased standard deviation\n", "integers.sd" ] }, { "cell_type": "code", "execution_count": null, "id": "e6158e3b-4af8-467c-a355-8e9f2e579548", "metadata": { "tags": [] }, "outputs": [], "source": [ "integers.variance" ] }, { "cell_type": "code", "execution_count": null, "id": "d64d39f2-d979-49f1-9946-65890f40d646", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Unbiased variance\n", "integers.var" ] }, { "cell_type": "markdown", "id": "25023f5a-798a-40a5-ab84-a6615602f747", "metadata": {}, "source": [ "## 41. negate (Vector)" ] }, { "cell_type": "markdown", "id": "00ddf322-ef50-40a1-86a6-22bf3d43f007", "metadata": {}, "source": [ "These are unary element-wise function." ] }, { "cell_type": "code", "execution_count": null, "id": "ab5a357a-e98c-40a1-9b89-0b38645e416f", "metadata": { "tags": [] }, "outputs": [], "source": [ "double = Vector.new([1.0, -2, 3])\n", "double.negate" ] }, { "cell_type": "code", "execution_count": null, "id": "8a06c856-d61c-4752-a296-1fa207ffd9a1", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Same as #negate\n", "-double" ] }, { "cell_type": "markdown", "id": "9b145724-d165-4ef3-8a06-2948dd0c7dbb", "metadata": {}, "source": [ "## 42. round (Vector)" ] }, { "cell_type": "markdown", "id": "b780c2f3-935c-4b2f-b18a-b277cf7c24b7", "metadata": {}, "source": [ "Otions for `#round`;\n", "\n", "- `:n-digits` The number of digits to show.\n", "- `round_mode` Specify rounding mode.\n", "\n", "This is a unary element-wise function." ] }, { "cell_type": "code", "execution_count": null, "id": "e7a069b0-3547-4cd2-a2f0-0740f186b191", "metadata": { "tags": [] }, "outputs": [], "source": [ "double = RedAmber::Vector.new([15.15, 2.5, 3.5, -4.5, -5.5])" ] }, { "cell_type": "code", "execution_count": null, "id": "5ee84b24-8830-4788-a404-d5e1cca22abf", "metadata": { "tags": [] }, "outputs": [], "source": [ "double.round" ] }, { "cell_type": "code", "execution_count": null, "id": "20adb1ad-473c-4245-b959-7848c239fb76", "metadata": { "tags": [] }, "outputs": [], "source": [ "double.round(mode: :half_to_even)" ] }, { "cell_type": "code", "execution_count": null, "id": "d2777ad8-2c24-48e4-8f5f-77403e3109ea", "metadata": { "tags": [] }, "outputs": [], "source": [ "double.round(mode: :towards_infinity)" ] }, { "cell_type": "code", "execution_count": null, "id": "a8ab2735-74cb-4cfe-a5a2-61bfa90c72ac", "metadata": { "tags": [] }, "outputs": [], "source": [ "double.round(mode: :half_up)" ] }, { "cell_type": "code", "execution_count": null, "id": "3575481c-40ed-405f-a69c-7581d4dce2cf", "metadata": { "tags": [] }, "outputs": [], "source": [ "double.round(mode: :half_towards_zero)" ] }, { "cell_type": "code", "execution_count": null, "id": "a86e4c5c-aced-4a88-b692-4e26b90f1653", "metadata": { "tags": [] }, "outputs": [], "source": [ "double.round(mode: :half_towards_infinity)" ] }, { "cell_type": "code", "execution_count": null, "id": "73f51bab-ff46-4b99-96a5-8c6547ad9d35", "metadata": { "tags": [] }, "outputs": [], "source": [ "double.round(mode: :half_to_odd)" ] }, { "cell_type": "code", "execution_count": null, "id": "a12c684c-4a63-4dac-a81b-969978812a24", "metadata": { "tags": [] }, "outputs": [], "source": [ "double.round(n_digits: 0)" ] }, { "cell_type": "code", "execution_count": null, "id": "17370f2b-0957-411b-8145-56aa9fc956ac", "metadata": { "tags": [] }, "outputs": [], "source": [ "double.round(n_digits: 1)" ] }, { "cell_type": "code", "execution_count": null, "id": "53072cff-b28b-4672-b30a-8ca37562bc21", "metadata": { "tags": [] }, "outputs": [], "source": [ "double.round(n_digits: -1)" ] }, { "cell_type": "markdown", "id": "51dedfce-51c7-4e5b-b890-a90ad9cf7596", "metadata": {}, "source": [ "## 43. and/or (Vector)" ] }, { "cell_type": "markdown", "id": "b2c4869b-6ebf-476c-b2fd-a4b9c0638dc5", "metadata": {}, "source": [ "RedAmber select `and_kleene`/`or_kleene` as default `&`/`|` method.\n", "\n", "These are unary element-wise function." ] }, { "cell_type": "code", "execution_count": null, "id": "2d4f5853-1ed9-4d8b-87a9-b5c1faac5fae", "metadata": { "tags": [] }, "outputs": [], "source": [ "bool_self = Vector.new([true, true, true, false, false, false, nil, nil, nil])\n", "bool_other = Vector.new([true, false, nil, true, false, nil, true, false, nil])\n", "\n", "bool_self & bool_other # same as bool_self.and_kleene(bool_other)" ] }, { "cell_type": "code", "execution_count": null, "id": "236c9733-8d45-467e-b288-e6c18b9c39d2", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Ruby's primitive `&&`\n", "bool_self && bool_other" ] }, { "cell_type": "code", "execution_count": null, "id": "4e984a9c-7d9c-465d-bf26-0c685dedd4bf", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Arrow's default `and`\n", "bool_self.and_org(bool_other)" ] }, { "cell_type": "code", "execution_count": null, "id": "0120ebf5-355d-41f5-83d5-49b9802f337b", "metadata": { "tags": [] }, "outputs": [], "source": [ "bool_self | bool_other # same as bool_self.or_kleene(bool_other)" ] }, { "cell_type": "code", "execution_count": null, "id": "24ceee23-79df-4fcd-afd8-f3839a087785", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Ruby's primitive `||`\n", "bool_self || bool_other" ] }, { "cell_type": "code", "execution_count": null, "id": "c152d04b-71a0-4b18-acd1-b5ab9e413d00", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Arrow's default `or`\n", "bool_self.or_org(bool_other)" ] }, { "cell_type": "markdown", "id": "beede237-c5ed-4e12-a432-ec7e4546d786", "metadata": {}, "source": [ "## 44. is_finite/is_nan/is_nil/is_na (Vector)" ] }, { "cell_type": "markdown", "id": "77418efd-c0d7-4d63-a7db-2d43fafd386e", "metadata": {}, "source": [ "These are unary element-wise function." ] }, { "cell_type": "code", "execution_count": null, "id": "19558f9e-fdc4-46e5-90d0-724e4e8fbd8e", "metadata": { "tags": [] }, "outputs": [], "source": [ "double = Vector.new([Math::PI, Float::INFINITY, -Float::INFINITY, Float::NAN, nil])" ] }, { "cell_type": "code", "execution_count": null, "id": "d90a7168-1f87-4363-9589-c1f161babc7d", "metadata": { "tags": [] }, "outputs": [], "source": [ "double.is_finite" ] }, { "cell_type": "code", "execution_count": null, "id": "7d88049b-695f-4b0c-a105-8fb5797a58b1", "metadata": { "tags": [] }, "outputs": [], "source": [ "double.is_inf" ] }, { "cell_type": "code", "execution_count": null, "id": "7d86a7b5-84bf-4031-9811-4076281920cf", "metadata": { "tags": [] }, "outputs": [], "source": [ "double.is_na" ] }, { "cell_type": "code", "execution_count": null, "id": "d562f826-7a37-4c57-8f92-777555987246", "metadata": { "tags": [] }, "outputs": [], "source": [ "double.is_nil" ] }, { "cell_type": "code", "execution_count": null, "id": "e460dc6b-e48f-4462-9ce8-aa6069ebae27", "metadata": { "tags": [] }, "outputs": [], "source": [ "double.is_valid" ] }, { "cell_type": "markdown", "id": "2cca75eb-f0e8-4f85-89cb-3601512e76b0", "metadata": {}, "source": [ "## 45. Prime-th rows" ] }, { "cell_type": "code", "execution_count": null, "id": "0751e820-a22d-45b5-9005-df523d2353be", "metadata": { "tags": [] }, "outputs": [], "source": [ "# prime-th rows ... Don't ask me what it means.\n", "require 'prime'\n", "penguins.assign_left(:index, penguins.indices + 1) # since 0.2.0\n", " .slice { Vector.new(Prime.each(size).to_a) - 1 }" ] }, { "cell_type": "markdown", "id": "c9e8de1a-ad8f-4fdc-a65c-4d3db7123530", "metadata": {}, "source": [ "## 46. Slice by Enumerator" ] }, { "cell_type": "markdown", "id": "32dd53a3-a822-4ae1-afe2-b5aa2bfbd3e3", "metadata": {}, "source": [ "Slice accepts Enumerator." ] }, { "cell_type": "code", "execution_count": null, "id": "b2a118fa-f3c0-4f31-9b45-6db27ccbebe6", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Select every 10 samples\n", "penguins.assign_left(index: penguins.indices) # 0.2.0 feature\n", " .slice(0.step(by: 10, to: 340))" ] }, { "cell_type": "code", "execution_count": null, "id": "361d5d09-6711-4211-a26e-a92bfd1a8add", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Select every 2 samples by step 100\n", "penguins.assign_left(index: penguins.indices) # 0.2.0 feature\n", " .slice { 0.step(by: 100, to: 300).map { |i| i..(i+1) } }" ] }, { "cell_type": "markdown", "id": "db312c2c-3a7c-4765-bfad-b3313b173a79", "metadata": {}, "source": [ "## 47. Output mode" ] }, { "cell_type": "markdown", "id": "714ed8df-5aa3-4ac4-8b0d-6390aff73c8c", "metadata": {}, "source": [ "Output mode of `DataFrame#inspect` and `DataFrame#to_iruby` is Table mode by default. If you prefer other mode set the environment variable `RED_AMBER_OUTPUT_MODE` ." ] }, { "cell_type": "code", "execution_count": null, "id": "a721804b-006e-44c6-8d38-885eae747eaa", "metadata": { "tags": [] }, "outputs": [], "source": [ "ENV['RED_AMBER_OUTPUT_MODE'] = 'Table' # or nil (default)\n", "penguins # Almost same as `puts penguins.to_s` in any mode" ] }, { "cell_type": "code", "execution_count": null, "id": "7846a498-5f06-452e-b9d5-eba47c73035c", "metadata": { "tags": [] }, "outputs": [], "source": [ "penguins[:species]" ] }, { "cell_type": "code", "execution_count": null, "id": "cc74055b-4b4f-4645-afd3-6b7aacd3d414", "metadata": { "tags": [] }, "outputs": [], "source": [ "ENV['RED_AMBER_OUTPUT_MODE'] = 'Plain' # Since 0.2.2\n", "penguins" ] }, { "cell_type": "code", "execution_count": null, "id": "172b65fa-d9b0-49e9-a9ba-5e6fbfb8219b", "metadata": { "tags": [] }, "outputs": [], "source": [ "penguins[:species]" ] }, { "cell_type": "code", "execution_count": null, "id": "c2109e65-7817-4745-957d-c8947795cf2a", "metadata": { "tags": [] }, "outputs": [], "source": [ "ENV['RED_AMBER_OUTPUT_MODE'] = 'Minimum' # Since 0.2.2\n", "penguins" ] }, { "cell_type": "code", "execution_count": null, "id": "0df9bfd3-774d-4f41-999d-86c618386a59", "metadata": { "tags": [] }, "outputs": [], "source": [ "penguins[:species]" ] }, { "cell_type": "code", "execution_count": null, "id": "e4c9f70c-a4b1-4a81-bbc4-e9b14a6b6cb0", "metadata": { "tags": [] }, "outputs": [], "source": [ "ENV['RED_AMBER_OUTPUT_MODE'] = 'TDR'\n", "penguins" ] }, { "cell_type": "code", "execution_count": null, "id": "2786e9a7-e321-43c5-b56e-9f2ca9d62f8b", "metadata": { "tags": [] }, "outputs": [], "source": [ "penguins[:species]" ] }, { "cell_type": "code", "execution_count": null, "id": "02f977cc-3ab0-4c4f-84e8-f7a42948929c", "metadata": { "tags": [] }, "outputs": [], "source": [ "ENV['RED_AMBER_OUTPUT_MODE'] = nil" ] }, { "cell_type": "markdown", "id": "c6781904-c014-4ef0-86c5-8a758ca3ca1c", "metadata": {}, "source": [ "## 48. Empty key" ] }, { "cell_type": "markdown", "id": "38912e38-bddd-47cc-adfd-5c4f780636ea", "metadata": {}, "source": [ "Empty key `:\"\"` will be automatically renamed to `:unnamed1`.\n", "\n", "If `:unnamed1` was used, `:unnamed1.succ` will be used.\n", "\n", "(Since 0.1.8)" ] }, { "cell_type": "code", "execution_count": null, "id": "13569004-bb23-45fa-8d11-fe5f367641a6", "metadata": { "tags": [] }, "outputs": [], "source": [ "df = DataFrame.new(\"\": [1, 2], unnamed1: [3, 4])" ] }, { "cell_type": "markdown", "id": "a4343343-d59a-4b71-b08f-4b737a6d4258", "metadata": {}, "source": [ "## 49. Grouping" ] }, { "cell_type": "markdown", "id": "8ddf0937-7b0e-4ee6-8116-8e78c308e76f", "metadata": {}, "source": [ "`DataFrame#group` takes group_key as an argument, and creates `Group` class. \n", "\n", "Group class inspects counts of each unique elements. \n", "\n", "(Since 0.1.7)" ] }, { "cell_type": "code", "execution_count": null, "id": "ee602e52-7988-4fab-b5e3-c466acf01c98", "metadata": { "tags": [] }, "outputs": [], "source": [ "group = penguins.group(:species)" ] }, { "cell_type": "markdown", "id": "a48ab0e9-60e3-4c07-a27b-9c53783318c1", "metadata": {}, "source": [ "The instance of `Group` class has methods to summary functions.\n", "\n", "It returns `function(key)` style summarized columns as a result." ] }, { "cell_type": "code", "execution_count": null, "id": "20b23ada-b895-4921-b57b-8d46b451e494", "metadata": { "tags": [] }, "outputs": [], "source": [ "group.count" ] }, { "cell_type": "markdown", "id": "584d8fa1-1e70-4f31-bdc2-5d4956a28cb5", "metadata": {}, "source": [ "If count result is same in multiple columns, count column is aggregated to one column `:count`." ] }, { "cell_type": "code", "execution_count": null, "id": "e6936488-9f23-47bd-8492-537c5be1afb3", "metadata": { "tags": [] }, "outputs": [], "source": [ "penguins.pick(:species, :bill_length_mm, :bill_depth_mm).group(:species).count" ] }, { "cell_type": "markdown", "id": "7c90b59b-918c-492c-ac5b-46b62b31d136", "metadata": {}, "source": [ "Grouping key comes first (leftmost) in the columns." ] }, { "cell_type": "markdown", "id": "21c8a6fa-3f95-41ea-817f-ffc1011e8df9", "metadata": {}, "source": [ "## 50. Grouping with a block" ] }, { "cell_type": "markdown", "id": "db89c8b4-8189-410d-abe9-eac8f440065c", "metadata": {}, "source": [ "`DataFrame#group` takes a block and we can specify multiple functions.\n", "\n", "Inside the block is the context of instance of Group. So we can use summary functions without the reciever.\n", "\n", "(Since 0.1.8)" ] }, { "cell_type": "code", "execution_count": null, "id": "913f576b-ec86-4e94-af05-7c656ea24cc2", "metadata": { "tags": [] }, "outputs": [], "source": [ "penguins.group(:species) { [count(:species), mean(:body_mass_g)] }" ] }, { "cell_type": "markdown", "id": "e6dc3206-d327-4242-9908-253ad4a74759", "metadata": {}, "source": [ "`Group#summarize` accepts same block as `DataFrame#group`." ] }, { "cell_type": "code", "execution_count": null, "id": "67c7fc55-7b30-469c-bd0c-cda5732863fe", "metadata": { "tags": [] }, "outputs": [], "source": [ "group.summarize { [count(:species), mean] }" ] }, { "cell_type": "markdown", "id": "88682c8d-0ab6-4cee-8e59-5b6ad0ae5dac", "metadata": {}, "source": [ "## 51. Vector#shift" ] }, { "cell_type": "markdown", "id": "c6dd1bc0-7146-4884-aac5-9874e829235e", "metadata": {}, "source": [ "`Vector#shift(amount = 1, fill: nil)`\n", "\n", "Shift vector's values by specified `amount`. Shifted space is filled by value `fill`.\n", "\n", "(Since 0.1.8)" ] }, { "cell_type": "code", "execution_count": null, "id": "013f2db6-3e1d-481f-a908-57605729b51d", "metadata": { "tags": [] }, "outputs": [], "source": [ "vector = RedAmber::Vector.new([1, 2, 3, 4, 5])\n", "vector.shift" ] }, { "cell_type": "code", "execution_count": null, "id": "7625acd7-d6a0-4775-b5e0-ca87f95f4f28", "metadata": { "tags": [] }, "outputs": [], "source": [ "vector.shift(-2)" ] }, { "cell_type": "code", "execution_count": null, "id": "34a9ac2a-2e3f-44bc-8ba7-c4487dc3528e", "metadata": { "tags": [] }, "outputs": [], "source": [ "vector.shift(fill: Float::NAN)" ] }, { "cell_type": "markdown", "id": "39f305a6-5718-4e81-b08f-564fa93861cc", "metadata": { "tags": [] }, "source": [ "## 52. From the Pandas cookbook (if-then)\n", "\n", "https://pandas.pydata.org/docs/user_guide/cookbook.html#if-then" ] }, { "cell_type": "markdown", "id": "4f1d41dd-714a-4c68-af3b-a9080847f833", "metadata": {}, "source": [ "```python\n", "# by Python Pandas\n", "df = pd.DataFrame(\n", " {\"AAA\": [4, 5, 6, 7], \"BBB\": [10, 20, 30, 40], \"CCC\": [100, 50, -30, -50]}\n", ")\n", "df.loc[df.AAA >= 5, \"BBB\"] = -1\n", "\n", "# returns =>\n", " AAA BBB CCC\n", "0 4 10 100\n", "1 5 -1 50\n", "2 6 -1 -30\n", "3 7 -1 -50\n", "```" ] }, { "cell_type": "code", "execution_count": null, "id": "24774ccc-8f0f-4ce4-9ba0-bebed8781c38", "metadata": { "tags": [] }, "outputs": [], "source": [ "# RedAmber\n", "df = DataFrame.new(\n", " \"AAA\": [4, 5, 6, 7], \"BBB\": [10, 20, 30, 40], \"CCC\": [100, 50, -30, -50] # You can omit {}\n", ")\n", "\n", "df.assign(BBB: df[:BBB].replace(df[:AAA] >= 5, -1))" ] }, { "cell_type": "markdown", "id": "c76ca2e3-e239-440b-923b-dfabb1a1c007", "metadata": {}, "source": [ "If you want to replace both :BBB and :CCC ;" ] }, { "cell_type": "code", "execution_count": null, "id": "3f97227b-cbee-4515-b76d-3514401967d9", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.assign do\n", " replacer = v(:AAA) >= 5 # Boolean Vector\n", " {\n", " BBB: v(:BBB).replace(replacer, -1),\n", " CCC: v(:CCC).replace(replacer, -2)\n", " }\n", "end" ] }, { "cell_type": "markdown", "id": "4bf560f9-4393-4ed5-9994-24f8a93470fb", "metadata": { "tags": [] }, "source": [ "## 53. From the Pandas cookbook (Splitting)\n", "Split a frame with a boolean criterion\n", "\n", "https://pandas.pydata.org/docs/user_guide/cookbook.html#splitting" ] }, { "cell_type": "markdown", "id": "fcf06cad-4a34-45e2-b8f8-8422914eb319", "metadata": {}, "source": [ "```python\n", "# by Python Pandas\n", "df = pd.DataFrame(\n", " {\"AAA\": [4, 5, 6, 7], \"BBB\": [10, 20, 30, 40], \"CCC\": [100, 50, -30, -50]}\n", ")\n", "df[df.AAA <= 5]\n", "\n", "# returns =>\n", " AAA BBB CCC\n", "0 4 10 100\n", "1 5 20 50\n", "\n", "df[df.AAA > 5]\n", "\n", "# returns =>\n", " AAA BBB CCC\n", "2 6 30 -30\n", "3 7 40 -50\n", "```" ] }, { "cell_type": "code", "execution_count": null, "id": "b08e74d4-aba8-4fb5-a815-5d5384e92f81", "metadata": { "tags": [] }, "outputs": [], "source": [ "# RedAmber\n", "df = DataFrame.new(\n", " # You can omit outer {}\n", " \"AAA\": [4, 5, 6, 7], \"BBB\": [10, 20, 30, 40], \"CCC\": [100, 50, -30, -50]\n", ")\n", "\n", "df.slice(df[:AAA] <= 5)\n", "# df[df[:AAA] <= 5] # is also OK" ] }, { "cell_type": "code", "execution_count": null, "id": "caa72796-ff7e-4275-849f-04698114ee08", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.remove(df[:AAA] <= 5)\n", "# df.slice(df[:AAA] > 5) # do the same thing" ] }, { "cell_type": "markdown", "id": "ba7588e9-bbac-4547-a56c-3eea9f819460", "metadata": { "tags": [] }, "source": [ "## 54. From the Pandas cookbook (Building criteria)\n", "Split a frame with a boolean criterion\n", "\n", "https://pandas.pydata.org/docs/user_guide/cookbook.html#building-criteria" ] }, { "cell_type": "markdown", "id": "c6d01b08-1af7-47b0-a9e4-f27ab41fe24e", "metadata": {}, "source": [ "```python\n", "# by Python Pandas\n", "df = pd.DataFrame(\n", " {\"AAA\": [4, 5, 6, 7], \"BBB\": [10, 20, 30, 40], \"CCC\": [100, 50, -30, -50]}\n", ")\n", "\n", "# and\n", "df.loc[(df[\"BBB\"] < 25) & (df[\"CCC\"] >= -40), \"AAA\"]\n", "\n", "# returns a series =>\n", "0 4\n", "1 5\n", "Name: AAA, dtype: int64\n", "\n", "# or\n", "df.loc[(df[\"BBB\"] > 25) | (df[\"CCC\"] >= -40), \"AAA\"]\n", "\n", "# returns a series =>\n", "0 4\n", "1 5\n", "2 6\n", "3 7\n", "Name: AAA, dtype: int64\n", "```" ] }, { "cell_type": "code", "execution_count": null, "id": "46066e96-91e5-4a96-9840-7e4ce6f06818", "metadata": { "tags": [] }, "outputs": [], "source": [ "# RedAmber\n", "df = DataFrame.new(\n", " # You can omit {}\n", " \"AAA\": [4, 5, 6, 7], \"BBB\": [10, 20, 30, 40], \"CCC\": [100, 50, -30, -50]\n", ")\n", "\n", "df.slice( (df[:BBB] < 25) & (df[:CCC] >= 40) ).pick(:AAA)" ] }, { "cell_type": "code", "execution_count": null, "id": "dc8304c2-be28-420e-b2d5-a4b636eaac8b", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.slice( (df[:BBB] > 25) | (df[:CCC] >= 40) ).pick(:AAA)\n", "# df[ (df[:BBB] > 25) | (df[:CCC] >= 40) ][:AAA)] # also OK" ] }, { "cell_type": "markdown", "id": "79616705-f497-4bb4-ad1d-5ee93c0093ce", "metadata": {}, "source": [ "```python\n", "# by Python Pandas\n", "# or (with assignment)\n", "df.loc[(df[\"BBB\"] > 25) | (df[\"CCC\"] >= 75), \"AAA\"] = 0.1\n", "df\n", "\n", "# returns a dataframe =>\n", " AAA BBB CCC\n", "0 0.1 10 100\n", "1 5.0 20 50\n", "2 0.1 30 -30\n", "3 0.1 40 -50\n", "```" ] }, { "cell_type": "code", "execution_count": null, "id": "7ab3b044-5a0f-4a38-8a42-aed1228b6462", "metadata": { "tags": [] }, "outputs": [], "source": [ "# df.assign(AAA: df[:AAA].replace((df[:BBB] > 25) | (df[:CCC] >= 75), 0.1)) # by one liner\n", "\n", "booleans = (df[:BBB] > 25) | (df[:CCC] >= 75)\n", "replaced = df[:AAA].replace(booleans, 0.1)\n", "df.assign(AAA: replaced)" ] }, { "cell_type": "markdown", "id": "6cd4a47c-f619-462b-94c0-4c488761d5b0", "metadata": {}, "source": [ "```python\n", "# by Python Pandas\n", "# Select rows with data closest to certain value using argsort\n", "df = pd.DataFrame(\n", " {\"AAA\": [4, 5, 6, 7], \"BBB\": [10, 20, 30, 40], \"CCC\": [100, 50, -30, -50]}\n", ")\n", "aValue = 43.0\n", "df.loc[(df.CCC - aValue).abs().argsort()]\n", "\n", "# returns a dataframe =>\n", " AAA BBB CCC\n", "1 5 20 50\n", "0 4 10 100\n", "2 6 30 -30\n", "3 7 40 -50\n", "```" ] }, { "cell_type": "code", "execution_count": null, "id": "13cb1d45-2d13-4708-ad76-57efd72e609b", "metadata": { "tags": [] }, "outputs": [], "source": [ "a_value = 43\n", "df[(df[:CCC] - a_value).abs.sort_indexes]\n", "# df.slice (df[:CCC] - a_value).abs.sort_indexes # also OK" ] }, { "cell_type": "markdown", "id": "9330117b-40e0-4574-8900-7857622daad4", "metadata": {}, "source": [ "```python\n", "# by Python Pandas\n", "# Dynamically reduce a list of criteria using a binary operators\n", "df = pd.DataFrame(\n", " {\"AAA\": [4, 5, 6, 7], \"BBB\": [10, 20, 30, 40], \"CCC\": [100, 50, -30, -50]}\n", ")\n", "Crit1 = df.AAA <= 5.5\n", "Crit2 = df.BBB == 10.0\n", "Crit3 = df.CCC > -40.0\n", "AllCrit = Crit1 & Crit2 & Crit3\n", "\n", "import functools\n", "\n", "CritList = [Crit1, Crit2, Crit3]\n", "AllCrit = functools.reduce(lambda x, y: x & y, CritList)\n", "df[AllCrit]\n", "\n", "# returns a dataframe =>\n", " AAA BBB CCC\n", "0 4 10 100\n", "```" ] }, { "cell_type": "code", "execution_count": null, "id": "40336e62-d411-4655-8ec5-8d30876ada47", "metadata": { "tags": [] }, "outputs": [], "source": [ "crit1 = df[:AAA] <= 5.5\n", "crit2 = df[:BBB] == 10.0\n", "crit3 = df[:CCC] >= -40.0\n", "df[crit1 & crit2 & crit3]" ] }, { "cell_type": "markdown", "id": "62e03375-573d-4368-a26e-1be3a4d58cf8", "metadata": { "tags": [] }, "source": [ "## 55. From the Pandas cookbook (Dataframes)\n", "\n", "https://pandas.pydata.org/docs/user_guide/cookbook.html#dataframes" ] }, { "cell_type": "markdown", "id": "462ce855-6e78-49a6-8b5f-48a5864b6397", "metadata": {}, "source": [ "```python\n", "# by Python Pandas\n", "# Using both row labels and value conditionals\n", "df = pd.DataFrame(\n", " {\"AAA\": [4, 5, 6, 7], \"BBB\": [10, 20, 30, 40], \"CCC\": [100, 50, -30, -50]}\n", ")\n", "df[(df.AAA <= 6) & (df.index.isin([0, 2, 4]))] \n", "\n", "# returns =>\n", " AAA BBB CCC\n", "0 4 10 100\n", "2 6 30 -30\n", "```" ] }, { "cell_type": "code", "execution_count": null, "id": "9da915f1-c772-4071-b6d6-292d8ffea857", "metadata": { "tags": [] }, "outputs": [], "source": [ "# RedAmber\n", "df = DataFrame.new(\n", " \"AAA\": [4, 5, 6, 7], \"BBB\": [10, 20, 30, 40], \"CCC\": [100, 50, -30, -50]\n", ")\n", "\n", "df[(df[:AAA] <= 6) & df.indices.map { |i| [0, 2, 4].include? i }]" ] }, { "cell_type": "markdown", "id": "26bfea62-12e4-4a01-976e-6184ecafa2fd", "metadata": {}, "source": [ "```python\n", "# by Python Pandas\n", "# Use loc for label-oriented slicing and iloc positional slicing GH2904\n", "df = pd.DataFrame(\n", " {\"AAA\": [4, 5, 6, 7], \"BBB\": [10, 20, 30, 40], \"CCC\": [100, 50, -30, -50]},\n", " index=[\"foo\", \"bar\", \"boo\", \"kar\"],\n", ")\n", "\n", "# There are 2 explicit slicing methods, with a third general case\n", "# 1. Positional-oriented (Python slicing style : exclusive of end)\n", "# 2. Label-oriented (Non-Python slicing style : inclusive of end)\n", "# 3. General (Either slicing style : depends on if the slice contains labels or positions)\n", "\n", "df.loc[\"bar\":\"kar\"] # Label\n", "# returns =>\n", " AAA BBB CCC\n", "bar 5 20 50\n", "boo 6 30 -30\n", "kar 7 40 -50\n", "\n", "# Generic\n", "df[0:3]\n", "# returns =>\n", " AAA BBB CCC\n", "foo 4 10 100\n", "bar 5 20 50\n", "boo 6 30 -30\n", "\n", "df[\"bar\":\"kar\"]\n", "# returns =>\n", " AAA BBB CCC\n", "bar 5 20 50\n", "boo 6 30 -30\n", "kar 7 40 -50\n", "```" ] }, { "cell_type": "code", "execution_count": null, "id": "ccabc137-33d3-47e8-ad09-88a285765380", "metadata": { "tags": [] }, "outputs": [], "source": [ "# RedAmber does not have row index. Use a new column as indexes.\n", "labeled = df.assign_left(index: %w[foo bar boo kar])\n", "# labeled = df.assign(index: %w[foo bar boo kar]).pick { [keys[-1], keys[0...-1]] } # until v0.1.8" ] }, { "cell_type": "code", "execution_count": null, "id": "f0871131-725e-4e33-a3cc-1fccd610a4b2", "metadata": { "tags": [] }, "outputs": [], "source": [ "labeled[1..3]" ] }, { "cell_type": "code", "execution_count": null, "id": "dffc55f0-481e-4076-aa3f-89f1294655b9", "metadata": { "tags": [] }, "outputs": [], "source": [ "labeled.slice do\n", " v = v(:index)\n", " v.index(\"bar\")..v.index(\"kar\")\n", "end" ] }, { "cell_type": "markdown", "id": "ee4f4498-c9d8-4cb9-8b1b-faa19f968f30", "metadata": {}, "source": [ "`slice_by` returns the same result as above.\n", "\n", "(Since 0.2.1)" ] }, { "cell_type": "code", "execution_count": null, "id": "97c76bdc-6c1b-4b7d-92bf-d5108fe479a0", "metadata": { "tags": [] }, "outputs": [], "source": [ "labeled.slice_by(:index, keep_key: true) { \"bar\"..\"kar\"}" ] }, { "cell_type": "markdown", "id": "d72b067c-d15d-499b-8d2a-8b79f5287e97", "metadata": {}, "source": [ "```python\n", "# by Python Pandas\n", "# Ambiguity arises when an index consists of integers with a non-zero start or non-unit increment.\n", "df2 = pd.DataFrame(data=data, index=[1, 2, 3, 4]) # Note index starts at 1.\n", "\n", "df2.iloc[1:3] # Position-oriented\n", "# returns =>\n", " AAA BBB CCC\n", "2 5 20 50\n", "3 6 30 -30\n", "\n", "df2.loc[1:3] # Label-oriented\n", "# returns =>\n", " AAA BBB CCC\n", "1 4 10 100\n", "2 5 20 50\n", "3 6 30 -30\n", "```" ] }, { "cell_type": "code", "execution_count": null, "id": "f023a23a-ea1d-415c-b395-195801351433", "metadata": { "tags": [] }, "outputs": [], "source": [ "# RedAmber only have an implicit integer index 0...size,\n", "# does not happen any ambiguity unless you create a new column and use it for indexes :-)." ] }, { "cell_type": "markdown", "id": "5cada26c-a37a-4eff-bfa5-66635d278671", "metadata": {}, "source": [ "```python\n", "# by Python Pandas\n", "# Using inverse operator (~) to take the complement of a mask\n", "df[~((df.AAA <= 6) & (df.index.isin([0, 2, 4])))]\n", "\n", "# returns =>\n", " AAA BBB CCC\n", "1 5 20 50\n", "3 7 40 -50\n", "```" ] }, { "cell_type": "code", "execution_count": null, "id": "29b81efe-5b9b-4640-89b1-422ae94cf01d", "metadata": { "tags": [] }, "outputs": [], "source": [ "# RedAmber offers #! method for boolean Vector.\n", "df[!((df[:AAA] <= 6) & df.indices.map { |i| [0, 2, 4].include? i })]\n", "\n", "# or\n", "# df[((df[:AAA] <= 6) & df.indices.map { |i| [0, 2, 4].include? i }).invert]" ] }, { "cell_type": "markdown", "id": "4fb7103f-00ed-4e74-81e5-7d480004e681", "metadata": {}, "source": [ "If you have `nil` in your data, consider #primitive_invert for consistent result. See example #26." ] }, { "cell_type": "markdown", "id": "a1780372-b566-41f5-84d6-6213e3f9efa7", "metadata": { "tags": [] }, "source": [ "## 56. From the Pandas cookbook (New columns)\n", "\n", "https://pandas.pydata.org/docs/user_guide/cookbook.html#new-columns" ] }, { "cell_type": "markdown", "id": "5f5f9d8b-7550-44f6-9f79-64f940f5000a", "metadata": {}, "source": [ "```python\n", "# by Python Pandas\n", "# Efficiently and dynamically creating new columns using applymap\n", "df = pd.DataFrame({\"AAA\": [1, 2, 1, 3], \"BBB\": [1, 1, 2, 2], \"CCC\": [2, 1, 3, 1]})\n", "df\n", "\n", "# returns =>\n", " AAA BBB CCC\n", "0 1 1 2\n", "1 2 1 1\n", "2 1 2 3\n", "3 3 2 1\n", "\n", "source_cols = df.columns # Or some subset would work too\n", "new_cols = [str(x) + \"_cat\" for x in source_cols]\n", "categories = {1: \"Alpha\", 2: \"Beta\", 3: \"Charlie\"}\n", "df[new_cols] = df[source_cols].applymap(categories.get)\n", "df\n", "\n", "# returns =>\n", " AAA BBB CCC AAA_cat BBB_cat CCC_cat\n", "0 1 1 2 Alpha Alpha Beta\n", "1 2 1 1 Beta Alpha Alpha\n", "2 1 2 3 Alpha Beta Charlie\n", "3 3 2 1 Charlie Beta Alpha\n", "```" ] }, { "cell_type": "code", "execution_count": null, "id": "265d63e6-0c01-4080-8d5b-c3153be595a5", "metadata": { "tags": [] }, "outputs": [], "source": [ "# RedAmber\n", "df = DataFrame.new({\"AAA\": [1, 2, 1, 3], \"BBB\": [1, 1, 2, 2], \"CCC\": [2, 1, 3, 1]})" ] }, { "cell_type": "code", "execution_count": null, "id": "4be751d7-69a8-4400-b094-9932bf3d577b", "metadata": { "tags": [] }, "outputs": [], "source": [ "categories = {1 => \"Alpha\", 2 => \"Beta\", 3 => \"Charlie\"}\n", "\n", "# Creating a Hash from keys\n", "df.assign do\n", " keys.each_with_object({}) do |key, h|\n", " h[\"#{key}_cat\"] = v(key).to_a.map { |x| categories[x] }\n", " end\n", "end\n", "\n", "# Creating an Array from vectors, from v0.2.0\n", "df.assign do\n", " vectors.map do |v|\n", " [\"#{v.key}_cat\", v.to_a.map { |x| categories[x] } ]\n", " end\n", "end" ] }, { "cell_type": "markdown", "id": "aa8214e5-a897-406e-bb5e-95ad9fea0cdd", "metadata": {}, "source": [ "```python\n", "# by Python Pandas\n", "# Keep other columns when using min() with groupby\n", "df = pd.DataFrame(\n", " {\"AAA\": [1, 1, 1, 2, 2, 2, 3, 3], \"BBB\": [2, 1, 3, 4, 5, 1, 2, 3]}\n", ")\n", "df\n", "\n", "# returns =>\n", " AAA BBB\n", "0 1 2\n", "1 1 1\n", "2 1 3\n", "3 2 4\n", "4 2 5\n", "5 2 1\n", "6 3 2\n", "7 3 3\n", "\n", "# Method 1 : idxmin() to get the index of the minimums\n", "df.loc[df.groupby(\"AAA\")[\"BBB\"].idxmin()]\n", "\n", "# returns =>\n", " AAA BBB\n", "1 1 1\n", "5 2 1\n", "6 3 2\n", "\n", "# Method 2 : sort then take first of each\n", "df.sort_values(by=\"BBB\").groupby(\"AAA\", as_index=False).first()\n", "\n", "# returns =>\n", " AAA BBB\n", "0 1 1\n", "1 2 1\n", "2 3 2\n", "\n", "# Notice the same results, with the exception of the index.\n", "```" ] }, { "cell_type": "code", "execution_count": null, "id": "5dd5f7dc-21a3-4397-9ef6-c5bc630ff858", "metadata": { "tags": [] }, "outputs": [], "source": [ "# RedAmber\n", "df = DataFrame.new(AAA: [1, 1, 1, 2, 2, 2, 3, 3], BBB: [2, 1, 3, 4, 5, 1, 2, 3])" ] }, { "cell_type": "code", "execution_count": null, "id": "000c1632-8faf-407f-bb7d-e583ef573442", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.group(:AAA).min\n", "\n", "# Add `.rename { [keys[-1], :BBB] }` if you want." ] }, { "cell_type": "markdown", "id": "5aa5b4ab-2fb3-4a22-804e-55d6351dd427", "metadata": { "tags": [] }, "source": [ "## 57. Summary/describe" ] }, { "cell_type": "code", "execution_count": null, "id": "610be94e-b7ce-43f5-a5c1-ddef745d6bac", "metadata": { "tags": [] }, "outputs": [], "source": [ "penguins.summary\n", "# or\n", "penguins.describe" ] }, { "cell_type": "markdown", "id": "c06cbb0e-fd75-4567-88df-bd9902fc94e2", "metadata": {}, "source": [ "If you need a variables in row, use `transpose`. (Since 0.2.0)" ] }, { "cell_type": "code", "execution_count": null, "id": "42406c73-6480-4a30-a940-930bf6804fff", "metadata": { "tags": [] }, "outputs": [], "source": [ "penguins.summary.transpose(name: :stats)" ] }, { "cell_type": "markdown", "id": "982a735e-c887-4c15-be82-4f3a28138fa7", "metadata": {}, "source": [ "## 58. Quantile/Quantiles" ] }, { "cell_type": "markdown", "id": "d0832e36-7281-4d3c-a35d-5a89aecc341e", "metadata": {}, "source": [ "`Vector#quantile(prob)` returns quantile at probability `prob`.\n", "\n", "(Since 0.2.0)" ] }, { "cell_type": "code", "execution_count": null, "id": "88c6ed5d-b41e-4c17-b93f-47ce15702974", "metadata": { "tags": [] }, "outputs": [], "source": [ "penguins[:bill_depth_mm].quantile # default is prob = 0.5" ] }, { "cell_type": "markdown", "id": "07368c10-65d4-4081-ace7-12bebca493c2", "metadata": {}, "source": [ "`Vector#quantiles` accepts an Array for multiple quantiles. Returns a DataFrame." ] }, { "cell_type": "code", "execution_count": null, "id": "3cbac413-fc36-42a2-abd3-1343e3b88467", "metadata": { "tags": [] }, "outputs": [], "source": [ "penguins[:bill_depth_mm].quantiles([0.05, 0.95])" ] }, { "cell_type": "markdown", "id": "54800864-c33d-4fec-818b-1a12a7e4d015", "metadata": {}, "source": [ "## 59. Transpose" ] }, { "cell_type": "markdown", "id": "517281cf-3f29-4725-816c-5a866e9cc9cc", "metadata": {}, "source": [ "`DataFrame#transpose` creates transposed DataFrame for wide type dataframe.\n", "\n", "(Since 0.2.0)" ] }, { "cell_type": "code", "execution_count": null, "id": "a2fe66cb-e86d-402e-8724-eced2420e3d0", "metadata": { "tags": [] }, "outputs": [], "source": [ "uri = URI(\"https://raw.githubusercontent.com/heronshoes/red_amber/master/test/entity/import_cars.tsv\")\n", "import_cars = RedAmber::DataFrame.load(uri)" ] }, { "cell_type": "code", "execution_count": null, "id": "b4ef2b48-33de-4982-b082-1966749fcf65", "metadata": { "tags": [] }, "outputs": [], "source": [ "import_cars.transpose" ] }, { "cell_type": "markdown", "id": "2d8c6d1e-06c6-4ae6-9d76-e9692567de59", "metadata": {}, "source": [ "Default name of created column is `:NAME`.\n", "\n", "We can name the column from the keys in original by the option `name:`." ] }, { "cell_type": "code", "execution_count": null, "id": "31679a91-11b4-42a8-95cc-58644a3003ba", "metadata": { "tags": [] }, "outputs": [], "source": [ "import_cars.transpose(key: :Year, name: :Manufacturer)" ] }, { "cell_type": "markdown", "id": "556e5c2d-bb01-4e16-9bf6-bdfd302d5b2a", "metadata": {}, "source": [ "You can specify index column by option `:key` even if it is in the middle of the original DataFrame." ] }, { "cell_type": "code", "execution_count": null, "id": "31edb594-93fb-493f-8036-14e8273596ed", "metadata": { "tags": [] }, "outputs": [], "source": [ "# locate `:Year` in the middle\n", "df = import_cars.pick(1..2, 0, 3..)" ] }, { "cell_type": "code", "execution_count": null, "id": "ffa7cecc-5298-49d6-b483-b2b5cf2e1820", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.transpose(key: :Year)" ] }, { "cell_type": "markdown", "id": "8c0e6fd9-2355-43dc-bfdb-254fdbd405fb", "metadata": {}, "source": [ "## 60. To_long" ] }, { "cell_type": "markdown", "id": "8970bf89-6841-44aa-8faa-8c2e97842a8d", "metadata": {}, "source": [ "`DataFrame#to_long(*keep_keys)` reshapes wide DataFrame to the long DataFrame.\n", "\n", "- Parameter `keep_keys` specifies the key names to keep.\n", "\n", "(Since 0.2.0)" ] }, { "cell_type": "code", "execution_count": null, "id": "f188cf35-3364-45aa-ad3c-1e079ed7b1a3", "metadata": { "tags": [] }, "outputs": [], "source": [ "uri = URI(\"https://raw.githubusercontent.com/heronshoes/red_amber/master/test/entity/import_cars.tsv\")\n", "import_cars = RedAmber::DataFrame.load(uri)" ] }, { "cell_type": "code", "execution_count": null, "id": "bee776ae-7a82-41aa-8a31-c53af6b9ad9f", "metadata": { "tags": [] }, "outputs": [], "source": [ "import_cars.to_long(:Year)" ] }, { "cell_type": "markdown", "id": "771aac21-e739-4eea-9d75-cc99f691f87e", "metadata": {}, "source": [ "- Option `:name` specify the key of the column which is come **from key names**. Default is `:NAME`.\n", "- Option `:value` specify the key of the column which is come **from values**. Default is `:VALUE`." ] }, { "cell_type": "code", "execution_count": null, "id": "b7f7aeab-d545-4a28-82ce-db844029cc9c", "metadata": { "tags": [] }, "outputs": [], "source": [ "import_cars.to_long(:Year, name: :Manufacturer, value: :Num_of_imported)" ] }, { "cell_type": "markdown", "id": "44361299-695f-4aef-8847-9b8a9dddc3d1", "metadata": {}, "source": [ "## 61. To_wide" ] }, { "cell_type": "markdown", "id": "7fa18ac8-99e1-4d55-bed4-cc02469496b5", "metadata": {}, "source": [ "`DataFrame#to_wide(*keep_keys)` reshapes long DataFrame to a wide DataFrame.\n", "\n", "- Option `:name` specify the key of the column which will be expanded **to key name**. Default is `:NAME`.\n", "- Option `:value` specify the key of the column which will be expanded **to values**. Default is `:VALUE`.\n", "\n", "(Since 0.2.0)" ] }, { "cell_type": "code", "execution_count": null, "id": "1d59c42f-c5ec-4df2-9fbe-592092ad2f8b", "metadata": { "tags": [] }, "outputs": [], "source": [ "import_cars.to_long(:Year).to_wide" ] }, { "cell_type": "code", "execution_count": null, "id": "84b7909a-200c-4aec-b192-25016399a7c4", "metadata": { "tags": [] }, "outputs": [], "source": [ "import_cars.to_long(:Year).to_wide(name: :NAME, value: :VALUE)\n", "# is also OK" ] }, { "cell_type": "markdown", "id": "f2919aae-12ba-448b-a370-7910845fa470", "metadata": {}, "source": [ "## 62. Custom index\n", "\n", "Another example of `indices` is [14. Indices](#14.-Indices)." ] }, { "cell_type": "markdown", "id": "0a623030-3aad-459d-a04b-23dc52ea4088", "metadata": {}, "source": [ "We can set the start of indices by the option.\n", "\n", "(Since 0.2.1)" ] }, { "cell_type": "code", "execution_count": null, "id": "c33c39f6-b48c-45d7-a755-43c33da21b9d", "metadata": { "tags": [] }, "outputs": [], "source": [ "df = DataFrame.new(x: [0, 1, 2, 3, 4])\n", "df.indices" ] }, { "cell_type": "code", "execution_count": null, "id": "fe797234-3759-4cd6-a9fc-02919afb9fed", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.indices(1)" ] }, { "cell_type": "markdown", "id": "a3e52c9f-704a-4032-bde9-a834b2a3e3f2", "metadata": {}, "source": [ "You can put the first value which accepts `#succ` method." ] }, { "cell_type": "code", "execution_count": null, "id": "ab17dad0-de80-412d-a890-d0f85ee630ef", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.indices(\"a\")" ] }, { "cell_type": "markdown", "id": "cda03584-a0f3-4e16-bc34-b33bf83bf156", "metadata": { "tags": [] }, "source": [ "## 63. Method missing\n", "\n", "`RedAmber::DataFrame` has `#method_missing` to enable to call key names as methods.\n", "\n", "This feature is limited to what can be called as a method (`:key` is OK, not allowed for the keys `:Key`, `:\"key.1\"`, `:\"1key\"`, etc. ). But it will be convenient in many cases.\n", "\n", "(Since 0.2.1)" ] }, { "cell_type": "code", "execution_count": null, "id": "1e12170d-73d9-4adc-acca-a0d4ea697f24", "metadata": { "tags": [] }, "outputs": [], "source": [ "df = DataFrame.new(x: [1, 2, 3])\n", "df.x.sum" ] }, { "cell_type": "code", "execution_count": null, "id": "b80ea144-b668-4df1-a28a-2c19728f4365", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Some ways to pull a Vector\n", "df[:x] # Formal style\n", "\n", "df.v(:x) # #v method\n", "\n", "df.x # method" ] }, { "cell_type": "code", "execution_count": null, "id": "c8dca8e8-ccef-4485-a871-a251a0809d16", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.x.sum" ] }, { "cell_type": "markdown", "id": "cea71378-f0af-4c00-a7a9-6d5e1160ebe6", "metadata": { "tags": [] }, "source": [ "## 64. Assign revised\n", "\n", "Another example of `assign` is [#34. Assign](#34.-Assign), [#65. Variations of assign](#65.-Variations-of-assign) ." ] }, { "cell_type": "code", "execution_count": null, "id": "6a395669-0b37-4bbf-9852-fcc434dbfbeb", "metadata": { "tags": [] }, "outputs": [], "source": [ "df = DataFrame.new(x: [1, 2, 3])\n", "\n", "# Assign by a Hash\n", "df.assign(y: df.x / 10.0)" ] }, { "cell_type": "code", "execution_count": null, "id": "3099b006-9a9a-4f27-9286-e00affcf1f2d", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Assign by separated key and value\n", "df.assign(:y) { x / 10.0 }" ] }, { "cell_type": "code", "execution_count": null, "id": "aa14333d-0fd0-43a9-ad3c-1a694b03d91e", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Separated keys and values\n", "df.assign(:y, :z) { [x * 10, x / 10.0] }" ] }, { "cell_type": "markdown", "id": "a33a9394-3d89-4855-a2bc-4a2a95ad3f08", "metadata": {}, "source": [ "## 65. Variations of assign\n", "\n", "Another example of `assign` is [#34. Assign](#34.-Assign), [#64. Assign revised](#64.-Assign-revised) ." ] }, { "cell_type": "code", "execution_count": null, "id": "59013a14-8c0d-4dec-b936-5a997ae4ca95", "metadata": { "tags": [] }, "outputs": [], "source": [ "df = DataFrame.new(x: [1, 2, 3])" ] }, { "cell_type": "code", "execution_count": null, "id": "c30ba70a-0185-4b23-826d-56f810baad93", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Hash args\n", "df.assign(y: df[:x] * 10, z: df[:x] / 10.0)\n", "\n", "# Hash\n", "hash = {y: df[:x] * 10, z: df[:x] / 10.0}\n", "df.assign(hash)\n", "\n", "# Array\n", "array = [[:y, df[:x] * 10], [:z, df[:x] / 10.0]]\n", "df.assign(array)\n", "\n", "# Array\n", "df.assign [\n", " [:y, df[:x] * 10],\n", " [:z, df[:x] / 10.0]\n", "]\n", "\n", "# Hash\n", "df.assign({\n", " y: df[:x] * 10,\n", " z: df[:x] / 10.0\n", "})\n", "\n", "# Block, Hash\n", "df.assign { {y: df[:x] * 10, z: df[:x] / 10.0} }\n", "\n", "# Block, Array\n", "df.assign { [[:y, df[:x] * 10], [:z, df[:x] / 10.0]] }\n", "\n", "# Block, Array, method\n", "#df.assign { [:y, x * 10], [:z, x / 10.0]] }\n", "\n", "# Separated\n", "#df.assign(:y, :z) { [x * 10, x / 10.0] }" ] }, { "cell_type": "markdown", "id": "0472a570-c0e5-4ec8-bd26-23914a48f23d", "metadata": { "tags": [] }, "source": [ "## 66. Row index label by slice_by\n", "\n", "Another example of `slice` is [#28. Slice](#28.-Slice).\n", "\n", "(Since 0.2.1)" ] }, { "cell_type": "code", "execution_count": null, "id": "d7a6cb36-53e7-4503-88c0-301531d8b877", "metadata": { "tags": [] }, "outputs": [], "source": [ "df = DataFrame.new(num: [1.1, 2.2, 3.3, 4.4, 5.5])\n", " .assign_left(:label) { indices(\"a\") }" ] }, { "cell_type": "markdown", "id": "86377e12-acf7-4529-baf6-46e54a86aa5a", "metadata": {}, "source": [ "`slice_by(key) { row_selector }` selects rows in column `key` with `row_selector`." ] }, { "cell_type": "code", "execution_count": null, "id": "2455da80-64af-4a4e-936f-4b7651dcf5ba", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.slice_by(:label) { \"b\"..\"d\" }" ] }, { "cell_type": "code", "execution_count": null, "id": "366587f4-0fde-4204-bdea-c3dc5c58b155", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.slice_by(:label) { [\"c\", \"b\", \"e\"] }" ] }, { "cell_type": "markdown", "id": "c074498f-6e27-45e3-9f9a-f1c0d92fb742", "metadata": {}, "source": [ "If the option `keep_key:` set to `true`, index label column is preserved." ] }, { "cell_type": "code", "execution_count": null, "id": "115b01f9-3722-4a2e-a6d0-7feadb058659", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.slice_by(:label, keep_key: true) { \"b\"..\"d\" }" ] }, { "cell_type": "markdown", "id": "8285cb78-5ef5-4fd4-baa1-d862f92418d5", "metadata": {}, "source": [ "## 67. Simpson's paradox in COVID-19 data\n", "\n", "https://www.rdocumentation.org/packages/openintro/versions/2.3.0/topics/simpsons_paradox_covid" ] }, { "cell_type": "code", "execution_count": null, "id": "47e2d348-6a36-4dfb-8697-959ef1442c11", "metadata": { "tags": [] }, "outputs": [], "source": [ "require 'datasets-arrow'\n", "\n", "ds = Datasets::Rdatasets.new('openintro', 'simpsons_paradox_covid')\n", "df = RedAmber::DataFrame.new(ds.to_arrow)" ] }, { "cell_type": "markdown", "id": "3d28b955-a57e-42fb-ae2f-647ee760ad47", "metadata": {}, "source": [ "Create group and count by vaccine status and outcome." ] }, { "cell_type": "code", "execution_count": null, "id": "1b5d9c45-a749-4e04-9f8a-5eeb0652771b", "metadata": { "tags": [] }, "outputs": [], "source": [ "count = df.group(:vaccine_status, :outcome).count" ] }, { "cell_type": "markdown", "id": "a6b6f501-f795-4fbf-b135-17609c5faebb", "metadata": {}, "source": [ "Reshape to human readable wide table." ] }, { "cell_type": "code", "execution_count": null, "id": "fcec849e-4144-4035-9ce6-45d4bc124f11", "metadata": { "tags": [] }, "outputs": [], "source": [ "all_count = count.to_wide(name: :vaccine_status, value: :count)" ] }, { "cell_type": "markdown", "id": "caccb2cc-64a3-4b89-b604-eca4780c3c52", "metadata": {}, "source": [ "Compute death or survived ratio for vaccine status." ] }, { "cell_type": "code", "execution_count": null, "id": "de42a142-38f5-426b-b9ed-706dba19015c", "metadata": { "tags": [] }, "outputs": [], "source": [ "all_count.assign do\n", " {\n", " \"vaccinated_%\": 100.0 * vaccinated / vaccinated.sum,\n", " \"unvaccinated_%\": 100.0 * unvaccinated / unvaccinated.sum\n", " }\n", "end" ] }, { "cell_type": "markdown", "id": "a18599a7-a21e-45a5-aa55-0170414ba3d0", "metadata": {}, "source": [ "Death ratio for vaccinated is higher than unvaccinated. Is it true?\n", "\n", "Next, do the same thing above for each age group. Temporally create methods." ] }, { "cell_type": "code", "execution_count": null, "id": "296b4570-7746-4444-8c0a-3bc4a7bdf929", "metadata": { "tags": [] }, "outputs": [], "source": [ "def make_covid_table(df)\n", " df.group(:vaccine_status, :outcome)\n", " .count\n", " .to_wide(name: :vaccine_status, value: :count)\n", " .assign do\n", " {\n", " \"vaccinated_%\": (100.0 * vaccinated / vaccinated.sum).round(n_digits: 3),\n", " \"unvaccinated_%\": (100.0 * unvaccinated / unvaccinated.sum).round(n_digits: 3)\n", " }\n", " end\n", "end" ] }, { "cell_type": "code", "execution_count": null, "id": "0443fe4f-e023-4496-9538-3d8772debfb6", "metadata": { "tags": [] }, "outputs": [], "source": [ "# under 50\n", "make_covid_table(df[df[:age_group] == \"under 50\"])" ] }, { "cell_type": "code", "execution_count": null, "id": "3c451727-5d61-495d-aae9-fb8486b6067f", "metadata": { "tags": [] }, "outputs": [], "source": [ "# 50 +\n", "make_covid_table(df[df[:age_group] == \"50 +\"])" ] }, { "cell_type": "markdown", "id": "78c3c909-5abf-4989-bd27-e86543a14431", "metadata": {}, "source": [ "Death ratio for vaccinated is lower than unvaccinated for grouped subset by age. This is an exaple of \"Simpson's paradox\" ." ] }, { "cell_type": "code", "execution_count": null, "id": "aa974dd3-d1fd-48a8-ae41-dc502f757baf", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Vaccine status vs age\n", "# 50+ is highly vaccinated.\n", "df.group(:vaccine_status, :age_group).count.to_wide(name: :age_group, value: :count)" ] }, { "cell_type": "code", "execution_count": null, "id": "26f341ec-50f8-4e53-9d8b-3ab6635e2caf", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Outcome vs age\n", "# 50+ also has higher death rate.\n", "df.group(:outcome, :age_group).count.to_wide(name: :age_group, value: :count)" ] }, { "cell_type": "markdown", "id": "184a5b21-dabd-4428-84d4-aa3e3f8d4666", "metadata": {}, "source": [ "## 68. Clean up dirty data" ] }, { "cell_type": "code", "execution_count": null, "id": "3f34f465-7cdc-4bea-bffc-1ba2f741820d", "metadata": { "tags": [] }, "outputs": [], "source": [ "file = Tempfile.open(['dirty_data', '.csv']) do |f|\n", " f.puts(<<~CSV)\n", " height,weight\n", " 154.9,52.2\n", " 156.8cm,51.1kg\n", " 152,49\n", " 148.5cm,45.4kg\n", " 155cm,\n", " ,49.9kg\n", " 1.58m,49.8kg\n", " 166.8cm,53.6kg\n", " CSV\n", " f\n", "end\n", "\n", "df = DataFrame.load(file)" ] }, { "cell_type": "markdown", "id": "b7704f5e-8e8d-439c-bbd5-535d6e9f8911", "metadata": {}, "source": [ "It was loaded as String Vectors." ] }, { "cell_type": "code", "execution_count": null, "id": "e7abb6f9-8ca3-4868-aa0a-80f721475235", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.schema" ] }, { "cell_type": "markdown", "id": "f7bc2ef6-7d08-4d65-8439-b02fb9c6e5dc", "metadata": {}, "source": [ "First for the `:weight` column. Replacing \"\" to NaN causes casting to Float." ] }, { "cell_type": "code", "execution_count": null, "id": "6435e95f-db41-4d26-84bd-4880de3fbd32", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.assign do\n", " {\n", " weight: weight.replace(weight == \"\", Float::NAN)\n", " }\n", "end" ] }, { "cell_type": "markdown", "id": "fcc1cd2b-7ff1-4e64-a6a9-0026209455cb", "metadata": {}, "source": [ "Apply same conversion for `:height` followed by unit conversion by `if_else`." ] }, { "cell_type": "code", "execution_count": null, "id": "c88379f4-b063-42e9-8f76-676f364a79af", "metadata": { "tags": [] }, "outputs": [], "source": [ "df = df.assign do\n", " {\n", " weight: weight.replace(weight == '', Float::NAN),\n", " height: height.replace(height == '', Float::NAN)\n", " .then { |h| (h < 10).if_else(h * 100, h) }\n", " }\n", "end\n", "puts df.schema\n", "df" ] }, { "cell_type": "markdown", "id": "8f16dc4c-2ecd-4f36-b7ba-8a05487b1a26", "metadata": {}, "source": [ "We got clean data, then compute BMI as a new column." ] }, { "cell_type": "code", "execution_count": null, "id": "bb061dab-008c-431c-9f7d-c74aa33e5f5a", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.assign(:BMI) { (weight / height ** 2 * 10000).round(n_digits: 1) }" ] }, { "cell_type": "markdown", "id": "8fe6cc1a-530d-4883-8641-7aebd97ebd16", "metadata": {}, "source": [ "## 69. From the Pandas cookbook - Multiindexing\n", "\n", "(Updated on v0.3.0)\n", "https://pandas.pydata.org/docs/user_guide/cookbook.html#multiindexing" ] }, { "cell_type": "markdown", "id": "a451139d-9b7a-4cbd-b1f1-4b99ac6f81b8", "metadata": {}, "source": [ "```python\n", "# by Python Pandas\n", "# Efficiently and dynamically creating new columns using applymap\n", "\n", "df = pd.DataFrame(\n", " {\n", " \"row\": [0, 1, 2],\n", " \"One_X\": [1.1, 1.1, 1.1],\n", " \"One_Y\": [1.2, 1.2, 1.2],\n", " \"Two_X\": [1.11, 1.11, 1.11],\n", " \"Two_Y\": [1.22, 1.22, 1.22],\n", " }\n", ")\n", "df\n", "\n", "# =>\n", " row One_X One_Y Two_X Two_Y\n", "0 0 1.1 1.2 1.11 1.22\n", "1 1 1.1 1.2 1.11 1.22\n", "2 2 1.1 1.2 1.11 1.22\n", "\n", "# As Labelled Index\n", "df = df.set_index(\"row\")\n", "df\n", "\n", "# =>\n", " One_X One_Y Two_X Two_Y\n", "row \n", "0 1.1 1.2 1.11 1.22\n", "1 1.1 1.2 1.11 1.22\n", "2 1.1 1.2 1.11 1.22\n", "\n", "# With Hierarchical Columns\n", "df.columns = pd.MultiIndex.from_tuples([tuple(c.split(\"_\")) for c in df.columns])\n", "df\n", "\n", "# =>\n", " One Two \n", " X Y X Y\n", "row \n", "0 1.1 1.2 1.11 1.22\n", "1 1.1 1.2 1.11 1.22\n", "2 1.1 1.2 1.11 1.22\n", "\n", "# Now stack & Reset\n", "df = df.stack(0).reset_index(1)\n", "df\n", "\n", "# =>\n", " level_1 X Y\n", "row \n", "0 One 1.10 1.20\n", "0 Two 1.11 1.22\n", "1 One 1.10 1.20\n", "1 Two 1.11 1.22\n", "2 One 1.10 1.20\n", "2 Two 1.11 1.22\n", "\n", "# And fix the labels (Notice the label 'level_1' got added automatically)\n", "df.columns = [\"Sample\", \"All_X\", \"All_Y\"]\n", "df\n", "\n", "# =>\n", " Sample All_X All_Y\n", "row \n", "0 One 1.10 1.20\n", "0 Two 1.11 1.22\n", "1 One 1.10 1.20\n", "1 Two 1.11 1.22\n", "2 One 1.10 1.20\n", "2 Two 1.11 1.22\n", "```" ] }, { "cell_type": "markdown", "id": "6af6de4e-3875-4e05-8642-20d07bf3a363", "metadata": {}, "source": [ "(Until 0.2.3)\n", "This is an example before `Vector#split` has introduced. See [84. Vector#split](#84.-Vector#split) ." ] }, { "cell_type": "code", "execution_count": null, "id": "1193c48a-528b-4e22-a8f7-24dba63e3fa7", "metadata": { "tags": [] }, "outputs": [], "source": [ "df = RedAmber::DataFrame.new(\n", " \"row\": [0, 1, 2],\n", " \"One_X\": [1.1, 1.1, 1.1],\n", " \"One_Y\": [1.2, 1.2, 1.2],\n", " \"Two_X\": [1.11, 1.11, 1.11],\n", " \"Two_Y\": [1.22, 1.22, 1.22],\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "9a9bcbd1-112f-4e58-b065-90fec7270a0d", "metadata": { "tags": [] }, "outputs": [], "source": [ "x = df.pick(:row, :One_X, :Two_X)\n", " .to_long(:row, name: :Sample, value: :All_X)" ] }, { "cell_type": "code", "execution_count": null, "id": "949e64a9-5f45-481b-b4b4-dc3bc4da4326", "metadata": { "tags": [] }, "outputs": [], "source": [ "y = df.pick(:row, :One_Y, :Two_Y)\n", " .to_long(:row, name: :Sample, value: :All_Y)" ] }, { "cell_type": "code", "execution_count": null, "id": "d2f5959c-c490-45fe-bcbd-3b8325760f6d", "metadata": { "tags": [] }, "outputs": [], "source": [ "x.pick(:row)\n", " .assign [\n", " [:Sample, x[:Sample].each.map { |x| x.split(\"_\").first }],\n", " [:All_X, x[:All_X]],\n", " [:All_Y, y[:All_Y]]\n", " ]" ] }, { "cell_type": "markdown", "id": "9e5fbc75-6d6f-4cd5-8392-0f640ffe24b0", "metadata": {}, "source": [ "(Since 0.3.0)\n", "This example will use `Vector#split_to_columns`." ] }, { "cell_type": "code", "execution_count": null, "id": "f027bf68-7f7c-4ed4-b14a-978d3349c5e4", "metadata": { "tags": [] }, "outputs": [], "source": [ "df = RedAmber::DataFrame.new(\n", " \"row\": [0, 1, 2],\n", " \"One_X\": [1.1, 1.1, 1.1],\n", " \"One_Y\": [1.2, 1.2, 1.2],\n", " \"Two_X\": [1.11, 1.11, 1.11],\n", " \"Two_Y\": [1.22, 1.22, 1.22],\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "5e99e3e7-a79c-40c3-a825-448fad4d64ce", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.to_long(:row)" ] }, { "cell_type": "markdown", "id": "3bd06408-6186-40b4-bfbf-67d60c456abe", "metadata": {}, "source": [ "`Vector#split_to_colums` returns two splitted Vectors." ] }, { "cell_type": "code", "execution_count": null, "id": "ddc2eb91-a450-4222-8b55-012b30d917fc", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.to_long(:row, name: :Sample)\n", " .assign(:Sample, :xy) { v(:Sample).split_to_columns('_') }" ] }, { "cell_type": "code", "execution_count": null, "id": "abb5e884-381b-4cb7-afac-39e08948b868", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.to_long(:row, name: :Sample)\n", " .assign(:Sample, :xy) { v(:Sample).split_to_columns('_') }\n", " .to_wide(name: :xy, value: :VALUE)" ] }, { "cell_type": "markdown", "id": "7e65c417-692c-4145-96dd-deef12f46ed4", "metadata": { "tags": [] }, "source": [ "## 70. From the Pandas cookbook (Arithmetic)\n", "\n", "https://pandas.pydata.org/docs/user_guide/cookbook.html#arithmetic" ] }, { "cell_type": "markdown", "id": "47c3ac27-37f4-4e6b-a5f7-95d818e21cdc", "metadata": {}, "source": [ "```python\n", "# by Python Pandas\n", "cols = pd.MultiIndex.from_tuples(\n", " [(x, y) for x in [\"A\", \"B\", \"C\"] for y in [\"O\", \"I\"]]\n", ")\n", "\n", "df = pd.DataFrame(np.random.randn(2, 6), index=[\"n\", \"m\"], columns=cols)\n", "df\n", "\n", "# =>\n", " A B C \n", " O I O I O I\n", "n 0.469112 -0.282863 -1.509059 -1.135632 1.212112 -0.173215\n", "m 0.119209 -1.044236 -0.861849 -2.104569 -0.494929 1.071804\n", "\n", "df = df.div(df[\"C\"], level=1)\n", "df\n", "\n", "# =>\n", " A B C \n", " O I O I O I\n", "n 0.387021 1.633022 -1.244983 6.556214 1.0 1.0\n", "m -0.240860 -0.974279 1.741358 -1.963577 1.0 1.0\n", "```" ] }, { "cell_type": "markdown", "id": "56ce988f-7d7d-4bbd-8919-be3eb4242b77", "metadata": {}, "source": [ "This is a tentative example. This work may be refined by the coming feature which treats multiple key header easily." ] }, { "cell_type": "code", "execution_count": null, "id": "a9e14b16-c7d8-4539-90f9-25eb89f4789f", "metadata": { "tags": [] }, "outputs": [], "source": [ "require \"arrow-numo-narray\"\n", "\n", "values = Numo::DFloat.new(6, 2).rand_norm" ] }, { "cell_type": "markdown", "id": "03aa2096-4411-4665-9bfc-dfedc41e75af", "metadata": {}, "source": [ "For consistency with the pandas result, we will use same data of them." ] }, { "cell_type": "code", "execution_count": null, "id": "a470ced5-dadb-48cb-a43f-8161ac3df769", "metadata": { "tags": [] }, "outputs": [], "source": [ "values = [\n", " [0.469112, -0.282863, -1.509059, -1.135632, 1.212112, -0.173215],\n", " [0.119209, -1.044236, -0.861849, -2.104569, -0.494929, 1.071804]\n", "].transpose" ] }, { "cell_type": "code", "execution_count": null, "id": "4288a84f-c3db-4fa7-b73e-9e21077304c5", "metadata": { "tags": [] }, "outputs": [], "source": [ "keys = %w[A B C].product(%w[O I]).map(&:join)" ] }, { "cell_type": "code", "execution_count": null, "id": "a1199c40-1508-4573-a427-cd7f7d580701", "metadata": { "tags": [] }, "outputs": [], "source": [ "df = RedAmber::DataFrame.new(index: %w[n m])\n", " .assign(*keys) { values }" ] }, { "cell_type": "code", "execution_count": null, "id": "bab052e6-5b26-40ba-95b6-c87ac444cb80", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.assign do\n", " assigner = {}\n", " %w[A B C].each do |abc|\n", " %w[O I].each do |oi|\n", " key = \"#{abc}#{oi}\".to_sym\n", " assigner[key] = v(key) / v(\"C#{oi}\".to_sym)\n", " end\n", " end\n", " assigner\n", "end" ] }, { "cell_type": "code", "execution_count": null, "id": "24de5e41-c775-472d-ab93-24ef9981febe", "metadata": { "tags": [] }, "outputs": [], "source": [ "coords = [[\"AA\", \"one\"], [\"AA\", \"six\"], [\"BB\", \"one\"], [\"BB\", \"two\"], [\"BB\", \"six\"]].transpose\n", "df = RedAmber::DataFrame.new(MyData: [11, 22, 33, 44, 55])\n", " .assign_left(:label1, :label2) { coords }" ] }, { "cell_type": "markdown", "id": "0de59f72-d401-40f1-bdf3-b38168dae359", "metadata": {}, "source": [ "## 71. From the Pandas cookbook (Slicing)\n", "\n", "https://pandas.pydata.org/docs/user_guide/cookbook.html#slicing" ] }, { "cell_type": "markdown", "id": "4de08a4c-864a-4dbe-a7ab-427bfa211881", "metadata": {}, "source": [ "```python\n", "# by Python Pandas\n", "coords = [(\"AA\", \"one\"), (\"AA\", \"six\"), (\"BB\", \"one\"), (\"BB\", \"two\"), (\"BB\", \"six\")]\n", "index = pd.MultiIndex.from_tuples(coords)\n", "df = pd.DataFrame([11, 22, 33, 44, 55], index, [\"MyData\"])\n", "df\n", "\n", "# =>\n", " MyData\n", "AA one 11\n", " six 22\n", "BB one 33\n", " two 44\n", " six 55\n", "```" ] }, { "cell_type": "markdown", "id": "5876deb1-8574-48c0-af6a-00447ba5ebe7", "metadata": {}, "source": [ "To take the cross section of the 1st level and 1st axis the index:\n", "\n", "```python\n", "# by Python Pandas\n", "# Note : level and axis are optional, and default to zero\n", "df.xs(\"BB\", level=0, axis=0)\n", "\n", "# =>\n", " MyData\n", "one 33\n", "two 44\n", "six 55\n", "```" ] }, { "cell_type": "code", "execution_count": null, "id": "6941b8ef-6e40-4103-ab93-bf370d80ebf8", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.slice { label1 == \"BB\" }.drop(:label1)" ] }, { "cell_type": "markdown", "id": "cf62d474-5a15-4af3-a0fc-99455bf0a205", "metadata": {}, "source": [ "…and now the 2nd level of the 1st axis.\n", "\n", "```python\n", "# by Python Pandas\n", "df.xs(\"six\", level=1, axis=0)\n", "\n", "# =>\n", " MyData\n", "AA 22\n", "BB 55\n", "```" ] }, { "cell_type": "code", "execution_count": null, "id": "9968d610-cadc-4b0a-8364-8c95e6ed6b0d", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.slice { label2 == \"six\" }.drop(:label2)" ] }, { "cell_type": "markdown", "id": "b7d9d33d-6959-492d-b24f-9952b36cf9f6", "metadata": {}, "source": [ "```python\n", "# by Python Pandas\n", "import itertools\n", "\n", "index = list(itertools.product([\"Ada\", \"Quinn\", \"Violet\"], [\"Comp\", \"Math\", \"Sci\"]))\n", "headr = list(itertools.product([\"Exams\", \"Labs\"], [\"I\", \"II\"]))\n", "indx = pd.MultiIndex.from_tuples(index, names=[\"Student\", \"Course\"])\n", "cols = pd.MultiIndex.from_tuples(headr) # Notice these are un-named\n", "data = [[70 + x + y + (x * y) % 3 for x in range(4)] for y in range(9)]\n", "df = pd.DataFrame(data, indx, cols)\n", "df\n", "\n", "# =>\n", " Exams Labs \n", " I II I II\n", "Student Course \n", "Ada Comp 70 71 72 73\n", " Math 71 73 75 74\n", " Sci 72 75 75 75\n", "Quinn Comp 73 74 75 76\n", " Math 74 76 78 77\n", " Sci 75 78 78 78\n", "Violet Comp 76 77 78 79\n", " Math 77 79 81 80\n", " Sci 78 81 81 81\n", "```" ] }, { "cell_type": "code", "execution_count": null, "id": "78e5753a-8d9c-4132-9671-6b09b3a06a20", "metadata": { "tags": [] }, "outputs": [], "source": [ "indexes = %w[Ada Quinn Violet].product(%w[Comp Math Sci]).transpose\n", "df = RedAmber::DataFrame.new(%w[Student Course].zip(indexes))\n", " .assign do\n", " assigner = {}\n", " keys = %w[Exams Labs].product(%w[I II]).map { |a| a.join(\"/\") } \n", " keys.each.with_index do |key, x|\n", " assigner[key] = (0...9).map { |y| 70 + x + y + (x * y) % 3 }\n", " end\n", " assigner\n", " end" ] }, { "cell_type": "markdown", "id": "2fc29a2d-fbe8-42f9-83dc-1ddfb7563ce0", "metadata": {}, "source": [ "```python\n", "# by Python Pandas\n", "All = slice(None)\n", "\n", "df.loc[\"Violet\"]\n", "\n", "# =>\n", " Exams Labs \n", " I II I II\n", "Course \n", "Comp 76 77 78 79\n", "Math 77 79 81 80\n", "Sci 78 81 81 81\n", "```" ] }, { "cell_type": "code", "execution_count": null, "id": "1c7662a7-3208-40ee-9f64-235c68ef57b3", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.slice(df[:Student] == \"Violet\").drop(:Student)" ] }, { "cell_type": "markdown", "id": "af5bb6db-cf2d-4887-9d73-7da651f9a3ea", "metadata": {}, "source": [ "```python\n", "# by Python Pandas\n", "df.loc[(All, \"Math\"), All]\n", "\n", "# =>\n", " Exams Labs \n", " I II I II\n", "Student Course \n", "Ada Math 71 73 75 74\n", "Quinn Math 74 76 78 77\n", "Violet Math 77 79 81 80\n", "```" ] }, { "cell_type": "code", "execution_count": null, "id": "28baea84-76c8-4473-b1ff-0d8f6fde2cca", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.slice(df[:Course] == \"Math\")" ] }, { "cell_type": "markdown", "id": "e12bcd4f-be10-4ee4-b55e-dc57e3de9698", "metadata": {}, "source": [ "```python\n", "# by Python Pandas\n", "df.loc[(slice(\"Ada\", \"Quinn\"), \"Math\"), All]\n", "\n", "# =>\n", " Exams Labs \n", " I II I II\n", "Student Course \n", "Ada Math 71 73 75 74\n", "Quinn Math 74 76 78 77\n", "```" ] }, { "cell_type": "code", "execution_count": null, "id": "11387d5b-a470-4ebf-8260-55db8f5f829a", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.slice(df[:Course] == \"Math\")\n", " .slice { (v(:Student) == \"Ada\") | (v(:Student) == \"Quinn\") }" ] }, { "cell_type": "markdown", "id": "61270946-535c-4a42-a155-a1c0ca025416", "metadata": {}, "source": [ "```python\n", "# by Python Pandas\n", "df.loc[(All, \"Math\"), (\"Exams\")]\n", "\n", "# =>\n", " I II\n", "Student Course \n", "Ada Math 71 73\n", "Quinn Math 74 76\n", "Violet Math 77 79\n", "```" ] }, { "cell_type": "code", "execution_count": null, "id": "7dbdd733-6cbb-48e1-ab8d-45587d455b8b", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.slice(df[:Course] == \"Math\")\n", " .pick {\n", " [:Student, :Course].concat keys.select { |key| key.to_s.start_with?(\"Exams\") }\n", " }" ] }, { "cell_type": "markdown", "id": "54be92ae-32b8-44f8-926c-190ab7592364", "metadata": {}, "source": [ "```python\n", "# by Python Pandas\n", "df.loc[(All, \"Math\"), (All, \"II\")]\n", "\n", "# =>\n", " Exams Labs\n", " II II\n", "Student Course \n", "Ada Math 73 74\n", "Quinn Math 76 77\n", "Violet Math 79 80\n", "```" ] }, { "cell_type": "code", "execution_count": null, "id": "5323da02-4dbe-43ac-89b8-3f9de0ae25c1", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.slice(df[:Course] == \"Math\")\n", " .pick {\n", " [:Student, :Course].concat keys.select { |key| key.to_s.end_with?(\"II\") }\n", " }" ] }, { "cell_type": "markdown", "id": "cd2f29b9-de8f-4680-b2bd-b26093391cdb", "metadata": {}, "source": [ "## 72. Vector#map\n", "\n", "`Vector#map` method accepts a block and return yielded results from the block in a Vector." ] }, { "cell_type": "code", "execution_count": null, "id": "f0dd0d14-0a98-429e-875b-71ed7feba524", "metadata": { "tags": [] }, "outputs": [], "source": [ "v = Vector.new(1, 2, 3, 4)\n", "v.map { |x| x / 100.0 }" ] }, { "cell_type": "markdown", "id": "3799e606-e6aa-4302-9fb7-1e4f8a0718e3", "metadata": {}, "source": [ "If no block is given, return a Enumerator." ] }, { "cell_type": "code", "execution_count": null, "id": "179ea66f-081e-4a16-b758-cf760c756cf2", "metadata": { "tags": [] }, "outputs": [], "source": [ "v.map" ] }, { "cell_type": "markdown", "id": "7459005f-db30-4237-8c4c-24ab507301fb", "metadata": {}, "source": [ "If you need ruby's map from a Vector, try `.each.map` ." ] }, { "cell_type": "code", "execution_count": null, "id": "abedbdb5-0dce-4f56-a962-cbe18a26edc0", "metadata": { "tags": [] }, "outputs": [], "source": [ "v.each.map { |x| x / 100.0 }" ] }, { "cell_type": "markdown", "id": "64b6277d-2485-4a25-8b8a-3d0fcc7299a3", "metadata": {}, "source": [ "Alias for `#map` is `#collect`" ] }, { "cell_type": "markdown", "id": "134b3d00-4221-41e1-9450-a47609dfc379", "metadata": {}, "source": [ "Similar method is `Vector#filter/#select`." ] }, { "cell_type": "markdown", "id": "f754a76c-a37d-4078-909e-bdfee893a000", "metadata": {}, "source": [ "## 73. Introduce columns from numo/narray\n", "\n", "(Until 0.2.2 w/Arrow 9.0.0) We couldn't construct the DataFrame with Numo/NArray in current version. But following trick will enable it." ] }, { "cell_type": "code", "execution_count": null, "id": "0f1957da-66d4-4c18-8f17-427701d50a7e", "metadata": { "tags": [] }, "outputs": [], "source": [ "DataFrame.new(index: Array(1..10))\n", " .assign do\n", " {\n", " x0: Numo::DFloat.new(size).rand_norm(0, 2),\n", " x1: Numo::DFloat.new(size).rand_norm(5, 2),\n", " x2: Numo::DFloat.new(size).rand_norm(10, 2),\n", " y0: Numo::DFloat.new(size).rand_norm(100, 10),\n", " y1: Numo::DFloat.new(size).rand_norm(200, 10),\n", " y2: Numo::DFloat.new(size).rand_norm(300, 10)\n", " }\n", " end" ] }, { "cell_type": "markdown", "id": "11af697f-ad7e-411b-a0ca-dccb6c601579", "metadata": {}, "source": [ "If you do not need the index column, try this." ] }, { "cell_type": "code", "execution_count": null, "id": "ce44bcf5-0183-4b3c-8d5f-e2f40cf73b6f", "metadata": { "tags": [] }, "outputs": [], "source": [ "DataFrame.new(_: Array(1..10))\n", " .assign do\n", " {\n", " x0: Numo::DFloat.new(size).rand_norm(0, 2),\n", " x1: Numo::DFloat.new(size).rand_norm(5, 2),\n", " x2: Numo::DFloat.new(size).rand_norm(10, 2),\n", " y0: Numo::DFloat.new(size).rand_norm(100, 10),\n", " y1: Numo::DFloat.new(size).rand_norm(200, 10),\n", " y2: Numo::DFloat.new(size).rand_norm(300, 10)\n", " }\n", " end\n", " .drop(:_)" ] }, { "cell_type": "markdown", "id": "5dfee7ae-3f56-4b4f-acfd-46323e4cce8e", "metadata": {}, "source": [ "(New from 0.2.3 with Aroow 10.0.0) It is possible to initialize by objects responsible to `to_arrow` since 0.2.3 . Arrays in Numo::NArray is responsible to `to_arrow` with `red-arrow-numo-narray` gem. This feature is proposed by the Red Data Tools member @kojix2 and implemented by @kou in Arrow 10.0.0 and Red Arrow Numo::NArray 0.0.6. Thanks!" ] }, { "cell_type": "code", "execution_count": null, "id": "ccdbba55-0d4b-481e-aebb-db14d6388de3", "metadata": { "tags": [] }, "outputs": [], "source": [ "require 'arrow-numo-narray'\n", "\n", "size = 10\n", "DataFrame.new(\n", " x0: Numo::DFloat.new(size).rand_norm(0, 2),\n", " x1: Numo::DFloat.new(size).rand_norm(5, 2),\n", " x2: Numo::DFloat.new(size).rand_norm(10, 2),\n", " y0: Numo::DFloat.new(size).rand_norm(100, 10),\n", " y1: Numo::DFloat.new(size).rand_norm(200, 10),\n", " y2: Numo::DFloat.new(size).rand_norm(300, 10)\n", ")" ] }, { "cell_type": "markdown", "id": "c04cccb1-fa5d-4247-97a3-a9fe5e967bb1", "metadata": {}, "source": [ "## 74. Join (mutating joins)\n", "\n", "(Since 0.2.3)" ] }, { "cell_type": "code", "execution_count": null, "id": "713c5acd-6afb-4f62-a5d9-ea2719a86897", "metadata": { "tags": [] }, "outputs": [], "source": [ "df = DataFrame.new(\n", " KEY: %w[A B C],\n", " X1: [1, 2, 3]\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "a037ee2c-a4d9-45c3-a90b-65e7f08aa0c9", "metadata": { "tags": [] }, "outputs": [], "source": [ "other = DataFrame.new(\n", " KEY: %w[A B D],\n", " X2: [true, false, nil]\n", ")" ] }, { "cell_type": "markdown", "id": "83ee42eb-e775-4b95-9395-e588970384a7", "metadata": {}, "source": [ "Inner join will join data leaving only the matching records." ] }, { "cell_type": "code", "execution_count": null, "id": "d88e686a-f7cd-40fd-9516-d9fa5c2d7b45", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.inner_join(other, :KEY)" ] }, { "cell_type": "markdown", "id": "4932dc10-fd57-47ee-af7a-e51eaef79872", "metadata": {}, "source": [ "If we omit join keys, common keys are automatically chosen (natural key)." ] }, { "cell_type": "code", "execution_count": null, "id": "d3f096db-7421-4a7d-b2e2-f4d59c40190b", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.inner_join(other)" ] }, { "cell_type": "markdown", "id": "8d73c66c-109b-4f85-a06e-43b02d3fc98a", "metadata": {}, "source": [ "Full join will join data leaving all records." ] }, { "cell_type": "code", "execution_count": null, "id": "f6b77037-3983-4163-bcc4-b51d69c99709", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.full_join(other)" ] }, { "cell_type": "markdown", "id": "cfa938bb-2b69-4732-bda2-44e9d81b129f", "metadata": {}, "source": [ "Left join will join matching values to self from other (type: left_outer)." ] }, { "cell_type": "code", "execution_count": null, "id": "c7cf2218-505e-4f9d-8a55-2ec70a9faf34", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.left_join(other)" ] }, { "cell_type": "markdown", "id": "0bcb210e-18f8-4b84-be7f-d9df67a5ef99", "metadata": {}, "source": [ "Right join will join matching values from self to other (type: right_outer)." ] }, { "cell_type": "code", "execution_count": null, "id": "c9461244-0177-402b-a0ab-764f105a9b8e", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.right_join(other)" ] }, { "cell_type": "markdown", "id": "05133084-abec-47e0-b778-57d9eb8e9125", "metadata": {}, "source": [ "Left join will join matching values to self from other." ] }, { "cell_type": "code", "execution_count": null, "id": "a37a3319-f4fe-43cc-950d-098c0e5f65c3", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.left_join(other)" ] }, { "cell_type": "markdown", "id": "9f40ba45-936a-474d-9f4e-eb73ac47f6eb", "metadata": {}, "source": [ "## 75. Join (filtering joins)\n", "\n", "(Since 0.2.3)" ] }, { "cell_type": "markdown", "id": "81eb44d9-a868-4067-a040-74c0a85f925f", "metadata": {}, "source": [ "Semi join will return records of self that have a match in other." ] }, { "cell_type": "code", "execution_count": null, "id": "e646b16e-5972-48eb-a25d-95ba75c2df62", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.semi_join(other)" ] }, { "cell_type": "markdown", "id": "245c8243-9a55-4145-ba3f-52f2cf3f4558", "metadata": {}, "source": [ "Anti join will return records of self that do not have a match in other." ] }, { "cell_type": "code", "execution_count": null, "id": "dd675476-df61-433e-9d18-9250da381eb7", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.anti_join(other)" ] }, { "cell_type": "markdown", "id": "0f80b55d-96c7-4a5d-aca2-42adce1173ad", "metadata": {}, "source": [ "## 76. Partial joins\n", "\n", "(Since 0.2.3)" ] }, { "cell_type": "code", "execution_count": null, "id": "90efb2aa-bf52-49ee-9a68-982a595644d9", "metadata": { "tags": [] }, "outputs": [], "source": [ "df2 = DataFrame.new(\n", " KEY1: %w[A B C],\n", " KEY2: %w[s t u],\n", " X: [1, 2, 3]\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "327200d2-1077-4995-aed6-128db085fe7b", "metadata": { "tags": [] }, "outputs": [], "source": [ "other2 = DataFrame.new(\n", " KEY1: %w[A B D],\n", " KEY2: %w[s u v],\n", " Y: [3, 2, 1]\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "a45ff487-8f60-48db-acb6-8cc904ea82a9", "metadata": { "tags": [] }, "outputs": [], "source": [ "# natural join\n", "df2.inner_join(other2)\n", "# Same as df2.inner_join(other2, [:KEY1, :KEY2])" ] }, { "cell_type": "markdown", "id": "ca1f60fa-1226-4850-a126-16f6f2849c9b", "metadata": {}, "source": [ "Partial join enables some part of common keys as join keys.\n", "\n", "Common keys of other not used as join keys will renamed as `:suffix`. Default suffix is '.1'." ] }, { "cell_type": "code", "execution_count": null, "id": "1c8c7714-af29-4547-9640-0e89a87517ac", "metadata": { "tags": [] }, "outputs": [], "source": [ "# partial join\n", "df2.inner_join(other2, :KEY1)" ] }, { "cell_type": "code", "execution_count": null, "id": "454467b6-1472-4444-9b0d-169c03f46fc6", "metadata": { "tags": [] }, "outputs": [], "source": [ "df2.inner_join(other2, :KEY1, suffix: '_')" ] }, { "cell_type": "markdown", "id": "7d031322-7b27-4851-9331-583e180abf87", "metadata": {}, "source": [ "## 77. Order of record in join\n", "\n", "Order of records is not guaranteed to be preserved before or after join. This is a similar property to RDB. Records behave like a set.\n", "\n", "If you want to preserve the order of records, it is recommended to add an index or sort.\n", "\n", "(Since 0.2.3)" ] }, { "cell_type": "code", "execution_count": null, "id": "daa1ff25-8070-4005-898e-5ba4b46fdf52", "metadata": { "tags": [] }, "outputs": [], "source": [ "df2" ] }, { "cell_type": "code", "execution_count": null, "id": "123966db-76e1-486f-a8b5-90448c09b413", "metadata": { "tags": [] }, "outputs": [], "source": [ "other2" ] }, { "cell_type": "code", "execution_count": null, "id": "fbe3b05a-0c79-4199-9b27-f0ff0f508a5f", "metadata": { "tags": [] }, "outputs": [], "source": [ "df2.full_join(other2, :KEY2)" ] }, { "cell_type": "markdown", "id": "7f070549-97e9-4339-8ce0-c6d2f0615691", "metadata": {}, "source": [ "## 78. Set operations\n", "\n", "Keys in self and other must be same in set operations.\n", "\n", "(Since 0.2.3)" ] }, { "cell_type": "code", "execution_count": null, "id": "9e12fb71-aeea-4df5-a070-c7c698ead596", "metadata": { "tags": [] }, "outputs": [], "source": [ "df = DataFrame.new(\n", " KEY1: %w[A B C],\n", " KEY2: [1, 2, 3]\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "477ba406-877a-486f-a7a0-ad3f5fc3e648", "metadata": { "tags": [] }, "outputs": [], "source": [ "other = DataFrame.new(\n", " KEY1: %w[A B D],\n", " KEY2: [1, 4, 5]\n", ")" ] }, { "cell_type": "markdown", "id": "ddbc81ff-5c6d-450c-afc4-7fb85b8af4bb", "metadata": {}, "source": [ "Intersect will select records appearing in both self and other." ] }, { "cell_type": "code", "execution_count": null, "id": "60cd67e7-59a2-4626-aad1-11c0f0b9dd4e", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.intersect(other)" ] }, { "cell_type": "markdown", "id": "9a5cc761-3d0f-45db-842d-ff1d0d824164", "metadata": {}, "source": [ "Union will select records appearing in both self or other." ] }, { "cell_type": "code", "execution_count": null, "id": "cb430af3-5da7-4f21-9efa-afc3fd3a008c", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.union(other)" ] }, { "cell_type": "markdown", "id": "c03df9f4-5830-4dab-b9c1-c38bd78bce07", "metadata": {}, "source": [ "Difference will select records appearing in self but not in other.\n", "\n", "It has an alias `#setdiff`." ] }, { "cell_type": "code", "execution_count": null, "id": "ba100d82-e241-46f6-8e9f-716279380685", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.difference(other)" ] }, { "cell_type": "markdown", "id": "29f1e26b-1aac-4921-910d-5dc2db28b80f", "metadata": {}, "source": [ "## 79. Join (big method)\n", "\n", "Undocumented big method `join` supports all mutating joins, filtering joins and set operations.\n", "\n", "|category|method of RedAmber|:type in join method|requirement|\n", "|-|-|-|-|\n", "|mutating joins|#inner_join|:inner||\n", "|mutating joins|#full_join|:full_outer||\n", "|mutating joins|#left_join|:left_outer||\n", "|mutating joins|#right_join|:right_outer||\n", "|-|-|:right_semi||\n", "|-|-|:right_anti||\n", "|filtering joins|#semi_join|:left_semi||\n", "|filtering joins|#anti_join|:left_anti||\n", "|set operations|#intersect|:inner|must have same keys with self and other|\n", "|set operations|#union|:full_outer|must have same keys with self and other|\n", "|set operations|#difference|:left_anti|must have same keys with self and other|\n", "\n", "(Since 0.2.3)" ] }, { "cell_type": "code", "execution_count": null, "id": "458a5bb3-b74d-47f4-8d1d-2c536ba8ba2a", "metadata": { "tags": [] }, "outputs": [], "source": [ "df = DataFrame.new(\n", " KEY: %w[A B C],\n", " X1: [1, 2, 3]\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "764f982a-fead-4c10-be8d-fcfd46f4c70e", "metadata": { "tags": [] }, "outputs": [], "source": [ "other = DataFrame.new(\n", " KEY: %w[A B D],\n", " X2: [true, false, nil]\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "72428371-7f92-4bd8-be39-a492b2256db8", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.join(other, :KEY, type: :inner)\n", "# Same as df.inner_join(other)" ] }, { "cell_type": "markdown", "id": "948567b8-510e-4911-bb52-6e34c6803e38", "metadata": {}, "source": [ "## 80. Force order for #join\n", "We can use `:force_order` option to ensure unique order for `join` families.\n", "(`#full_join`, `#right_join` and `#join with :full_outer or :right_outer`)\n", "\n", "If this option is true (by default) it will append index to the source and sort after joining. It will cause some degradation in performance.\n", "\n", "(Since 0.4.0)" ] }, { "cell_type": "code", "execution_count": null, "id": "cec0a165-56de-4364-bc3c-3955d9b74732", "metadata": { "tags": [] }, "outputs": [], "source": [ "df2 = DataFrame.new(\n", " KEY1: %w[A B C],\n", " KEY2: %w[s t u],\n", " X: [1, 2, 3]\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "db088b29-9ed2-4f14-9b35-a6e13e8521c6", "metadata": { "tags": [] }, "outputs": [], "source": [ "right2 = DataFrame.new(\n", " KEY1: %w[A B D],\n", " KEY2: %w[s u v],\n", " Y: [3, 2, 1]\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "12931ea7-d999-4fda-b8cb-aa8193cc373e", "metadata": { "tags": [] }, "outputs": [], "source": [ "df2.full_join(right2, :KEY2)" ] }, { "cell_type": "code", "execution_count": null, "id": "fbb9158c-ebca-40dc-a452-977e77ccb3fd", "metadata": { "tags": [] }, "outputs": [], "source": [ "df2.full_join(right2, :KEY2, force_order: false)" ] }, { "cell_type": "code", "execution_count": null, "id": "db29c564-b6e6-4e6e-9501-61aaf7ab5560", "metadata": { "tags": [] }, "outputs": [], "source": [ "df2.full_join(right2, { left: :KEY2, right: 'KEY2' })" ] }, { "cell_type": "code", "execution_count": null, "id": "29476ca0-f3f3-48a2-bb0d-2d355a49d3d5", "metadata": { "tags": [] }, "outputs": [], "source": [ "df2.full_join(right2, { left: :KEY2, right: 'KEY2' }, force_order: false)" ] }, { "cell_type": "markdown", "id": "bcb02520-547b-4e0e-815d-96257aea158a", "metadata": {}, "source": [ "## 81. Binding DataFrames in vertical (concatenate)\n", "\n", "Concatenate another DataFrame or Table onto the bottom of self. The shape and data type of other must be the same as self.\n", "\n", "The alias is `concat`.\n", "(Since 0.2.3)" ] }, { "cell_type": "code", "execution_count": null, "id": "9ef28118-357b-4ea5-9dce-09b870f679c6", "metadata": { "tags": [] }, "outputs": [], "source": [ "df = DataFrame.new(x: [1, 2], y: ['A', 'B'])" ] }, { "cell_type": "code", "execution_count": null, "id": "bd2e8c68-15b4-4575-b951-449b73aab8a1", "metadata": { "tags": [] }, "outputs": [], "source": [ "other = DataFrame.new(x: [3, 4], y: ['C', 'D'])" ] }, { "cell_type": "code", "execution_count": null, "id": "12a96c86-ec57-4ed4-b964-31eba1cdecfa", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.concatenate(other)" ] }, { "cell_type": "markdown", "id": "0542225b-39a3-409a-bdf7-66078c21bf0d", "metadata": {}, "source": [ "## 82. Binding DataFrames in lateral (merge)\n", "\n", "Concatenate another DataFrame or Table onto the bottom of self. The shape and data type of other must be the same as self.\n", "\n", "(Since 0.2.3)" ] }, { "cell_type": "code", "execution_count": null, "id": "e40ee7c7-fe98-426b-ade2-0cfd6d3619ea", "metadata": { "tags": [] }, "outputs": [], "source": [ "df = DataFrame.new(x: [1, 2], y: [3, 4])" ] }, { "cell_type": "code", "execution_count": null, "id": "7a11fdb8-ddb8-4451-be2e-9ba5bf383162", "metadata": { "tags": [] }, "outputs": [], "source": [ "other = DataFrame.new(a: ['A', 'B'], b: ['C', 'D'])" ] }, { "cell_type": "code", "execution_count": null, "id": "f913f3c9-a084-43d6-a270-5b6443270386", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.merge(other)" ] }, { "cell_type": "markdown", "id": "5dcfab2a-38a9-4282-8ce1-4741e8a71b84", "metadata": {}, "source": [ "## 83. Join - larger example by nycflight13\n", "\n", "(Since 0.2.3)\n", "\n", "'nycflights13' dataset is a large dataset. It will take a while for the first run to fetch and prepare red-datasets cache." ] }, { "cell_type": "code", "execution_count": null, "id": "0c620ef9-2f78-4169-92e1-ed30bcc879c2", "metadata": { "tags": [] }, "outputs": [], "source": [ "require 'datasets-arrow'\n", "\n", "package = 'nycflights13'\n", "\n", "airlines = DataFrame.new(Datasets::Rdatasets.new(package, 'airlines'))\n", "airports = DataFrame.new(Datasets::Rdatasets.new(package, 'airports'))\n", "flights = DataFrame.new(Datasets::Rdatasets.new(package, 'flights'))\n", " .pick(%i[month day carrier flight tailnum origin dest air_time distance])\n", "planes = DataFrame.new(Datasets::Rdatasets.new(package, 'planes'))\n", "weather = DataFrame.new(Datasets::Rdatasets.new(package, 'weather'))\n", "\n", "flights" ] }, { "cell_type": "code", "execution_count": null, "id": "6b9eec6f-c2da-4acc-94db-27894c188dc7", "metadata": { "tags": [] }, "outputs": [], "source": [ "airlines" ] }, { "cell_type": "code", "execution_count": null, "id": "f5c46b59-1f09-4aed-ae23-b27dd8da8c1e", "metadata": {}, "outputs": [], "source": [ "# inner join\n", "flights.inner_join(airlines, :carrier)\n", "# flights.inner_join(airlines) # natural join (same result)" ] }, { "cell_type": "markdown", "id": "3614be12-be76-4e4e-99e6-dea1667104fe", "metadata": { "tags": [] }, "source": [ "## 84. Vector#split_to_columns\n", "\n", "Another example using in the DataFrame operation is in [## 69. From the Pandas cookbook (Multiindexing)](#69.-From-the-Pandas-cookbook---Multiindexing).\n", "\n", "`self` must be a String type Vector.\n", "\n", "(Since 0.3.0)" ] }, { "cell_type": "code", "execution_count": null, "id": "88c40a3c-1c25-4e50-9bb6-bde0788f99ea", "metadata": { "tags": [] }, "outputs": [], "source": [ "v = Vector.new(['a b', 'c d', 'e f'])" ] }, { "cell_type": "code", "execution_count": null, "id": "fbb233ea-ece1-4b0e-af42-f283ffb128f5", "metadata": { "tags": [] }, "outputs": [], "source": [ "v.split_to_columns" ] }, { "cell_type": "markdown", "id": "a1e74ffa-fa77-4fdf-b6a1-bd07083987a7", "metadata": {}, "source": [ "`#split` accepts `sep` argument as a separator. `sep` is passed to `String#split(sep)`." ] }, { "cell_type": "code", "execution_count": null, "id": "ecfecabc-ca09-4af7-ae43-cdbf20213dd5", "metadata": { "tags": [] }, "outputs": [], "source": [ "Vector.new('ab', 'cd', 'ef')\n", " .split_to_columns('')" ] }, { "cell_type": "markdown", "id": "8a2cac53-7b2a-4174-8b34-2332c31b058f", "metadata": {}, "source": [ "nil will separated as nil." ] }, { "cell_type": "code", "execution_count": null, "id": "31a86cab-895f-47d9-ae07-678fa4ba0e32", "metadata": { "tags": [] }, "outputs": [], "source": [ "Vector.new(nil, 'c d', 'e f')\n", " .split_to_columns" ] }, { "cell_type": "markdown", "id": "b63e2987-ea71-4001-8a90-1fb8c75e9f9d", "metadata": {}, "source": [ "## 85. Vector#split_to_rows\n", "\n", "`#split_to_rows` will separate strings and flatten into row.\n", "\n", "(Since 0.3.0)" ] }, { "cell_type": "code", "execution_count": null, "id": "2f51e0b9-2de0-414d-8f9f-e2ac7045db3a", "metadata": { "tags": [] }, "outputs": [], "source": [ "v = Vector.new(['a b', 'c d', 'e f'])" ] }, { "cell_type": "code", "execution_count": null, "id": "1f77a7d7-a654-4a2e-9361-070636cf8465", "metadata": { "tags": [] }, "outputs": [], "source": [ "v.split_to_rows" ] }, { "cell_type": "markdown", "id": "1a21c6cd-7c34-49e6-9c8d-6608e07b3ed1", "metadata": {}, "source": [ "## 86. Vector#merge\n", "(Since 0.3.0)" ] }, { "cell_type": "markdown", "id": "f2127771-e253-45ff-879d-ac269412319d", "metadata": {}, "source": [ "`Vector#merge(other)` merges `self` and `other` if they are String Vector." ] }, { "cell_type": "code", "execution_count": null, "id": "6c2b9366-515d-4df9-a51f-3c4424a2ed18", "metadata": { "tags": [] }, "outputs": [], "source": [ "vector = Vector.new(%w[a c e])\n", "other = Vector.new(%w[b d f])\n", "vector.merge(other)" ] }, { "cell_type": "markdown", "id": "434492e1-10b5-468e-9810-ab5a60f0e582", "metadata": {}, "source": [ "If `other` is scalar, it will be appended to each elements of `self`." ] }, { "cell_type": "code", "execution_count": null, "id": "eb0fe846-109d-4090-8f22-dc2477f60476", "metadata": { "tags": [] }, "outputs": [], "source": [ "vector.merge('x')" ] }, { "cell_type": "markdown", "id": "e378701a-178b-4f39-b583-28d049eb1af2", "metadata": {}, "source": [ "Option `:sep` is used to concatenating elements. Its default value is ' '." ] }, { "cell_type": "code", "execution_count": null, "id": "741553a7-e3cb-444c-b165-afe0cb9afc6e", "metadata": { "tags": [] }, "outputs": [], "source": [ "vector.merge('x', sep: '')" ] }, { "cell_type": "markdown", "id": "3bf81029-8ee3-43bb-871c-d48faf1335e8", "metadata": {}, "source": [ "## 87. Separate a variable (column) in a DataFrame\n", "(Since 0.3.0)\n", "\n", "R's separate operation.\n", "\n", "https://tidyr.tidyverse.org/reference/separate.html" ] }, { "cell_type": "code", "execution_count": null, "id": "2621b75e-1657-4160-8fc3-1eb2d486a8cd", "metadata": { "tags": [] }, "outputs": [], "source": [ "df = DataFrame.new(xyz: [nil, 'x.y', 'x.z', 'y.z'])" ] }, { "cell_type": "markdown", "id": "7a03e1b1-fdb8-4509-ab1b-fffbae7399ca", "metadata": {}, "source": [ "Instead of `separate(:xyz, [:a, :b])` we will do:" ] }, { "cell_type": "code", "execution_count": null, "id": "5710ce48-aee9-4fa7-b242-4ff0edbdf7db", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.assign(:A, :B) { xyz.split_to_columns('.') }\n", " .drop(:xyz)" ] }, { "cell_type": "markdown", "id": "a0091bf2-b763-4b11-a038-691904db1db7", "metadata": {}, "source": [ "If you need :B only, instead of `separate(:xyz, [nil, :B])` we will do:" ] }, { "cell_type": "code", "execution_count": null, "id": "9200eecd-0057-4ad7-bd06-403c026b32d7", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.assign(:A, :B) { xyz.split_to_columns('.') }\n", " .pick(:B)" ] }, { "cell_type": "markdown", "id": "3d1bc19e-8d25-49fd-b128-91f1877d05d5", "metadata": {}, "source": [ "When splitted length is not equal, split returns max size of Vector Array filled with nil." ] }, { "cell_type": "code", "execution_count": null, "id": "783043c4-f2d6-4a7c-9fb5-db283bb02028", "metadata": { "tags": [] }, "outputs": [], "source": [ "df = DataFrame.new(xyz: ['x', 'x y', 'x y z', nil])\n", "df.assign(:x, :y, :z) { xyz.split_to_columns }" ] }, { "cell_type": "markdown", "id": "72a9b017-379f-458d-9acf-1e3058b23325", "metadata": {}, "source": [ "Split limiting max 2 elemnts." ] }, { "cell_type": "code", "execution_count": null, "id": "240a561b-ace6-45e3-af2d-1d06ec5202e0", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.assign(:x, :yz) { xyz.split_to_columns(' ', 2) }" ] }, { "cell_type": "markdown", "id": "bdb2f78f-72a5-4c6b-823a-a762a1737788", "metadata": {}, "source": [ "Another example:" ] }, { "cell_type": "code", "execution_count": null, "id": "362c1732-b257-46cb-af24-a0f652fb655a", "metadata": { "tags": [] }, "outputs": [], "source": [ "df = DataFrame.new(id: 1..3, 'month-year': %w[8-2022 9-2022 10-2022])\n", " .assign(:month, :year) { v(:'month-year').split_to_columns('-') }" ] }, { "cell_type": "markdown", "id": "1e3a37be-5103-48b3-9b0c-6529f6f0b399", "metadata": {}, "source": [ "Split between the letters." ] }, { "cell_type": "code", "execution_count": null, "id": "195ca0af-fa79-4f4f-aa53-63eac214cafe", "metadata": { "tags": [] }, "outputs": [], "source": [ "df = DataFrame.new(id: 1..3, yearmonth: %w[202209 202210 202211])\n", " .assign(:year, :month) { yearmonth.split_to_columns(/(?=..$)/) }" ] }, { "cell_type": "markdown", "id": "24e0a399-5d4d-4dd1-ac4d-73d1722d9715", "metadata": {}, "source": [ "## 88. Unite variables (columns) in a DataFrame\n", "(Since 0.3.0)\n", "\n", "R's unite operation." ] }, { "cell_type": "code", "execution_count": null, "id": "9fac7414-dd2e-40aa-92cf-1b21096ee308", "metadata": { "tags": [] }, "outputs": [], "source": [ "df = DataFrame.new(id: 1..3, year: %w[2022 2022 2022], month: %w[09 10 11])" ] }, { "cell_type": "code", "execution_count": null, "id": "2f4fa388-e16a-4c04-9c5d-5ec22fb37f34", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.assign(:yearmonth) { year.merge(month, sep: '') }\n", " .pick(:id, :yearmonth)" ] }, { "cell_type": "code", "execution_count": null, "id": "022be86b-4fde-4d24-9054-1390679554c7", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Or directly create:\n", "DataFrame.new(id: 1..3, yearmonth: df.year.merge(df.month, sep: ''))" ] }, { "cell_type": "markdown", "id": "b72d088d-76c3-4248-b37e-b8c6a680be77", "metadata": {}, "source": [ "## 89. Separate variable and lengthen into several rows.\n", "(Since 0.3.0)\n", "\n", "R's separate_rows operation." ] }, { "cell_type": "code", "execution_count": null, "id": "4414c910-9bb7-4ffb-b049-1fd658d9fea8", "metadata": { "tags": [] }, "outputs": [], "source": [ "df = DataFrame.new(id: 1..3, yearmonth: %w[202209 202210 202211])\n", " .assign(:year, :month) { yearmonth.split_to_columns(/(?=..$)/) }\n", " .drop(:yearmonth)\n", " .to_long(:id)" ] }, { "cell_type": "markdown", "id": "121ca2d7-ed1b-44be-abfc-83638ecb7e41", "metadata": {}, "source": [ "Another example with different list size." ] }, { "cell_type": "code", "execution_count": null, "id": "a406957d-9a0c-4f51-8639-d2b92a7dbf19", "metadata": { "tags": [] }, "outputs": [], "source": [ "df = DataFrame.new(\n", " x: 1..3,\n", " y: ['a', 'd,e,f', 'g,h'],\n", " z: ['1', '2,3,4', '5,6'],\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "b57572db-8045-4b06-89ea-2e53328e73b4", "metadata": { "tags": [] }, "outputs": [], "source": [ "sizes = df.y.split(',').list_sizes\n", "a = sizes.to_a.map.with_index(1) { |n, i| [i] * n }.flatten" ] }, { "cell_type": "code", "execution_count": null, "id": "6fbf285d-25b7-4470-bdb0-d9077c31ecdd", "metadata": { "tags": [] }, "outputs": [], "source": [ "DataFrame.new(\n", " x: a,\n", " y: df.y.split_to_rows(','),\n", " z: df.z.split_to_rows(',')\n", ")" ] }, { "cell_type": "markdown", "id": "cc149d4e-809a-4d1b-ab31-284e15289757", "metadata": {}, "source": [ "Another way to use `#split_to_columns`." ] }, { "cell_type": "code", "execution_count": null, "id": "d3e60c70-9634-447a-80fa-f37f02f98b14", "metadata": { "tags": [] }, "outputs": [], "source": [ "xy = df.pick(:x, :y)\n", " .assign(:y, :y1, :y2) { v(:y).split_to_columns(',') }\n", " .to_long(:x, value: :y)\n", " .remove_nil" ] }, { "cell_type": "code", "execution_count": null, "id": "4e143d12-030c-44dd-a2a2-7eac0f12f809", "metadata": { "tags": [] }, "outputs": [], "source": [ "xz = df.pick(:x, :z)\n", " .assign(:z, :z1, :z2) { v(:z).split_to_columns(',') }\n", " .to_long(:x, value: :z)\n", " .remove_nil" ] }, { "cell_type": "code", "execution_count": null, "id": "0cf219e0-5d65-4b36-98da-e8dfa5679f04", "metadata": { "tags": [] }, "outputs": [], "source": [ "xy.pick(:x, :y).merge(xz.pick(:z))" ] }, { "cell_type": "markdown", "id": "75de904d-f6e2-4c5c-8e2f-6402ce1a9db6", "metadata": {}, "source": [ "Get all combinations of :y and :z." ] }, { "cell_type": "code", "execution_count": null, "id": "bae1fac7-afbd-4b7e-a527-ad27114d49fe", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.assign(:y, :y1, :y2) { v(:y).split_to_columns(',') }\n", " .to_long(:x, :z, value: :y)\n", " .drop(:NAME)\n", " .assign(:z, :z1, :z2) { v(:z).split_to_columns(',') }\n", " .to_long(:x, :y, value: :z)\n", " .drop(:NAME)\n", " .drop_nil" ] }, { "cell_type": "markdown", "id": "b11fa65f-e3eb-41ee-bc7c-6b5ed7b8c579", "metadata": {}, "source": [ "## 90. Vector#propagate\n", "\n", "Spread the return value of an aggregate function as if it is a element-wise function.\n", "\n", "It has an alias `#expand`.\n", "\n", "(Since 0.4.0)" ] }, { "cell_type": "code", "execution_count": null, "id": "167d64c8-d9a0-4887-8d68-657e807409d8", "metadata": { "tags": [] }, "outputs": [], "source": [ "vec = Vector.new(1, 2, 3, 4)\n", "vec.propagate(:mean)" ] }, { "cell_type": "markdown", "id": "388bf355-eabc-4e16-ba21-26e3aaffc935", "metadata": {}, "source": [ "Block is also available." ] }, { "cell_type": "code", "execution_count": null, "id": "f6dfab50-5531-4189-8b6d-db7010f681e0", "metadata": { "tags": [] }, "outputs": [], "source": [ "vec.propagate { |v| v.mean.round }" ] }, { "cell_type": "markdown", "id": "91c238ba-faa2-4281-833b-f99dcce698a0", "metadata": {}, "source": [ "## 91. Vector#sort / #sort_indices\n", "\n", "`#sort` will arrange values in Vector.\n", "\n", "Accepts :sort order option:\n", " - `:+`, `:ascending` or without argument will sort in increasing order.\n", " - `:-` or `:descending` will sort in decreasing order.\n", "\n", "(Since 0.4.0)" ] }, { "cell_type": "code", "execution_count": null, "id": "dd5b7b22-2e26-46a1-b4ae-55d54289987b", "metadata": { "tags": [] }, "outputs": [], "source": [ "vector = Vector.new(%w[B D A E C])\n", "vector.sort\n", "# same as vector.sort(:+)\n", "# same as vector.sort(:ascending)" ] }, { "cell_type": "markdown", "id": "deaa8410-4da7-4f3c-a328-f0afe3339dc7", "metadata": {}, "source": [ "Sort in decreasing order;" ] }, { "cell_type": "code", "execution_count": null, "id": "2eb1248d-5738-4b5d-80f5-d0b734e11f5f", "metadata": { "tags": [] }, "outputs": [], "source": [ "vector.sort(:-)\n", "# same as vector.sort(:descending)" ] }, { "cell_type": "markdown", "id": "54627eb6-0e71-4a2f-a6b7-3125829ec310", "metadata": {}, "source": [ "## 93. Vector#rank\n", "\n", "Returns numerical rank of self.\n", " \n", "Note: This method is currently fixed to the default behavior because `RankOptions` in C++ function is not implemented in C GLib yet.\n", " - Nil values are considered greater than any value.\n", " - NaN values are considered greater than any value but smaller than nil values.\n", " - Tiebreakers are ranked in order of appearance.\n", "\n", "(Since 0.4.0, Experimental feature)" ] }, { "cell_type": "markdown", "id": "b50acfc9-a559-43af-b408-848cbd39c9aa", "metadata": {}, "source": [ "Rank of float Vector;" ] }, { "cell_type": "code", "execution_count": null, "id": "98541e5a-5da4-4b1c-8557-ff6f44b93c02", "metadata": { "tags": [] }, "outputs": [], "source": [ "float = Vector.new(0.1, nil, Float::NAN, 0.2, 0.1)" ] }, { "cell_type": "code", "execution_count": null, "id": "3e1083e8-49fc-4460-9e2d-3a98869b4c80", "metadata": { "tags": [] }, "outputs": [], "source": [ "float.rank" ] }, { "cell_type": "markdown", "id": "30147320-6c52-435d-ad30-b8b8da394f67", "metadata": {}, "source": [ "Rank of string Vector;" ] }, { "cell_type": "code", "execution_count": null, "id": "c0ecb442-eddd-4acb-a3e2-65a6d5dfbcca", "metadata": { "tags": [] }, "outputs": [], "source": [ "string = Vector.new(\"A\", \"B\", nil, \"A\", \"C\")" ] }, { "cell_type": "code", "execution_count": null, "id": "4485494f-242d-4315-bc49-b4835ba0568c", "metadata": { "tags": [] }, "outputs": [], "source": [ "string.rank" ] }, { "cell_type": "markdown", "id": "31560ca0-9273-4c15-8195-140b4309fe61", "metadata": {}, "source": [ "## 93. Vector#sample\n", "Pick up elements at random.\n", "\n", "(Since 0.4.0)" ] }, { "cell_type": "markdown", "id": "7bddfc6d-3d43-46d4-9913-e4fab9566e82", "metadata": {}, "source": [ "Return a randomly selected element. This is one of an aggregation function." ] }, { "cell_type": "code", "execution_count": null, "id": "2054c75c-dffa-42a0-a4f6-9ab2603471ae", "metadata": { "tags": [] }, "outputs": [], "source": [ "v = Vector.new('A'..'H')" ] }, { "cell_type": "markdown", "id": "5f35c963-acff-4ec0-9973-22baeb6ade19", "metadata": {}, "source": [ "Returns scalar without any arguments." ] }, { "cell_type": "code", "execution_count": null, "id": "ce464ab6-da74-4d71-b771-e63d16069341", "metadata": { "tags": [] }, "outputs": [], "source": [ "v.sample" ] }, { "cell_type": "markdown", "id": "8fda39a4-c50c-44bb-90a8-c08c4cbc3425", "metadata": {}, "source": [ "`sample(n)` will pick up `n` elements at random. `n` is a positive number of elements to pick." ] }, { "cell_type": "markdown", "id": "cfeaf629-d137-430c-8dde-36dbc02de2e6", "metadata": {}, "source": [ "If n is smaller or equal to size, elements are picked by non-repeating.\n", "\n", "If n == 1 (in case of `sample(1)`), it returns a Vector of size == 1 not a scalar." ] }, { "cell_type": "code", "execution_count": null, "id": "c5353fd0-e1d7-4b50-84a2-609c18767208", "metadata": { "tags": [] }, "outputs": [], "source": [ "v.sample(1)" ] }, { "cell_type": "markdown", "id": "26a8dc42-5b64-4a19-b943-2121dbfd87b6", "metadata": {}, "source": [ "Sample same size of self: every element is picked in random order." ] }, { "cell_type": "code", "execution_count": null, "id": "11966428-5b7c-435a-93b0-793a1a3875f9", "metadata": { "tags": [] }, "outputs": [], "source": [ "v.sample(8)" ] }, { "cell_type": "markdown", "id": "fe7ac056-d740-4aca-9c9d-b93870e29362", "metadata": {}, "source": [ "If n is greater than `size`, some elements are picked repeatedly." ] }, { "cell_type": "code", "execution_count": null, "id": "249b4709-a8ff-4ffe-a925-ad03f19c08ef", "metadata": { "tags": [] }, "outputs": [], "source": [ "v.sample(9)" ] }, { "cell_type": "markdown", "id": "b842d546-b5ae-433e-859c-ef67f21fe02e", "metadata": {}, "source": [ "`sample(prop)` will pick up elements by proportion `prop` at random. `prop` must be positive float.\n", " - Absolute number of elements to pick:`prop*size` is rounded (by `half: :up``)\n", " - If prop is smaller or equal to 1.0, elements are picked by non-repeating." ] }, { "cell_type": "markdown", "id": "d8dc3adb-fca4-4e7e-b44a-8d7d1fea3b38", "metadata": {}, "source": [ "If picked element is only one, it returns a Vector of size == 1 not a scalar." ] }, { "cell_type": "code", "execution_count": null, "id": "40e54a11-dbb0-47d7-9d99-1fa7ec7df8e0", "metadata": { "tags": [] }, "outputs": [], "source": [ "v.sample(0.1)" ] }, { "cell_type": "markdown", "id": "3a650ca1-dec1-4230-bf94-c8ce7ab2e22c", "metadata": {}, "source": [ "Sample same size of self: every element is picked in random order." ] }, { "cell_type": "code", "execution_count": null, "id": "26ff63ce-6703-4a8e-b8b0-928a5cd5d626", "metadata": { "tags": [] }, "outputs": [], "source": [ "v.sample(1.0)" ] }, { "cell_type": "markdown", "id": "b4ffb449-64a2-4925-82c8-5b55f668081f", "metadata": {}, "source": [ "If prop is greater than 1.0, some elements are picked repeatedly." ] }, { "cell_type": "code", "execution_count": null, "id": "030420b0-27a5-4b7a-a73c-a2cd7744e7d3", "metadata": { "tags": [] }, "outputs": [], "source": [ "# 2 times over sampling\n", "sampled = v.sample(2.0)" ] }, { "cell_type": "code", "execution_count": null, "id": "3b50b6a2-e8fb-4f05-bd71-2fe91db94b5a", "metadata": { "tags": [] }, "outputs": [], "source": [ "sampled.tally" ] }, { "cell_type": "markdown", "id": "9050c361-23c6-490b-9269-8766aa051df2", "metadata": {}, "source": [ "## 94. Vector#concatenate\n", "\n", "Concatenate other array-like to self.\n", "\n", "(Since 0.4.0)" ] }, { "cell_type": "markdown", "id": "8e796d36-47d1-4bd3-981c-2ad88927581f", "metadata": {}, "source": [ "Concatenate to string;" ] }, { "cell_type": "code", "execution_count": null, "id": "9dd54c26-3cff-48e9-abe4-bb96d168e013", "metadata": { "tags": [] }, "outputs": [], "source": [ "string = Vector.new(%w[A B])" ] }, { "cell_type": "code", "execution_count": null, "id": "d2598c46-70e9-4029-8023-f373133fbbd7", "metadata": { "tags": [] }, "outputs": [], "source": [ "string.concatenate([1, 2])" ] }, { "cell_type": "markdown", "id": "643f5b62-b176-48ba-ab9a-4e19e776c639", "metadata": {}, "source": [ "Concatenate to integer;" ] }, { "cell_type": "code", "execution_count": null, "id": "d8ab2a40-ed9c-4144-9ac1-6ef7c4b13dd8", "metadata": { "tags": [] }, "outputs": [], "source": [ "integer = Vector.new(1, 2)" ] }, { "cell_type": "code", "execution_count": null, "id": "dd3c17af-2df8-4c55-8428-9ff884bf16c7", "metadata": { "tags": [] }, "outputs": [], "source": [ "integer.concatenate([\"A\", \"B\"])" ] }, { "cell_type": "markdown", "id": "41585cb1-848d-4b55-b2de-2ad4666bf941", "metadata": {}, "source": [ "## 95. Vector#resolve\n", "\n", "Return other as a Vector which is same data type as self.\n", "\n", "(Since 0.4.0)" ] }, { "cell_type": "markdown", "id": "e2266c45-afea-4636-aca5-3e4136d2b409", "metadata": {}, "source": [ "Integer to String;" ] }, { "cell_type": "code", "execution_count": null, "id": "2c82d3d8-d28c-431a-ada3-2f7e1d50a973", "metadata": { "tags": [] }, "outputs": [], "source": [ "Vector.new('A').resolve([1, 2])" ] }, { "cell_type": "markdown", "id": "d0a86725-5c8d-45ad-9fe0-08bda4ca957c", "metadata": {}, "source": [ "String to Ineger;" ] }, { "cell_type": "code", "execution_count": null, "id": "9052b0c2-99e3-48e4-9b4a-f95fb413ed94", "metadata": { "tags": [] }, "outputs": [], "source": [ "Vector.new(1).resolve(['A'])" ] }, { "cell_type": "markdown", "id": "31868026-4548-4e49-aa92-873cffd0d9c5", "metadata": {}, "source": [ "Upcast to uint16;" ] }, { "cell_type": "code", "execution_count": null, "id": "84a6fa26-cdab-42fc-b89b-a2488d673fe7", "metadata": { "tags": [] }, "outputs": [], "source": [ "vector = Vector.new(256)" ] }, { "cell_type": "markdown", "id": "7c683a98-578b-456e-88fb-3d28b3ccd370", "metadata": {}, "source": [ "Not a uint8 Vector;" ] }, { "cell_type": "code", "execution_count": null, "id": "790f70f8-c13f-4305-a1e8-643d5a925a33", "metadata": { "tags": [] }, "outputs": [], "source": [ "vector.resolve([1, 2])" ] }, { "cell_type": "markdown", "id": "87fff016-c410-4c59-8855-623e75195d1d", "metadata": {}, "source": [ "## 96. SubFrames\n", "\n", "`SubFrames` is a new concept of DataFrame collection. It represents ordered subsets of a DataFrame collected by some rules. It includes both grouping and windowing concepts in a unified manner, and also covers broader cases more flexibly.\n", "\n", "(Since 0.4.0)" ] }, { "cell_type": "code", "execution_count": null, "id": "3a1e91df-59b1-43ab-bb5f-281578cfae8c", "metadata": { "tags": [] }, "outputs": [], "source": [ "dataframe = DataFrame.new(\n", " x: [*1..6],\n", " y: %w[A A B B B C],\n", " z: [false, true, false, nil, true, false]\n", ")\n", "p dataframe; nil" ] }, { "cell_type": "code", "execution_count": null, "id": "de3b5be4-1ebc-45d7-9ce9-cfbd38418c12", "metadata": { "tags": [] }, "outputs": [], "source": [ "sf = SubFrames.new(dataframe, [[0, 1], [2, 3, 4], [5]])" ] }, { "cell_type": "markdown", "id": "e1cfb4d4-cbcb-4927-9fb0-86b0f0ae52e1", "metadata": {}, "source": [ "Source DataFrame (univarsal set)." ] }, { "cell_type": "code", "execution_count": null, "id": "85989134-edf7-42af-99b6-77bd1090e8d0", "metadata": { "tags": [] }, "outputs": [], "source": [ "sf.baseframe" ] }, { "cell_type": "markdown", "id": "05c579bc-c446-42f5-a841-2a22003ab7f3", "metadata": {}, "source": [ "Size of subsets." ] }, { "cell_type": "code", "execution_count": null, "id": "20040f97-4152-4440-964d-538a43977746", "metadata": { "tags": [] }, "outputs": [], "source": [ "sf.size" ] }, { "cell_type": "markdown", "id": "419c9e4f-f45a-4509-b0f5-c8253382dad1", "metadata": {}, "source": [ "Sizes of each subsets." ] }, { "cell_type": "code", "execution_count": null, "id": "7e87a3e7-d33a-43b6-a1e4-e8a773741b5f", "metadata": { "tags": [] }, "outputs": [], "source": [ "sf.sizes" ] }, { "cell_type": "markdown", "id": "dd4b9b59-546f-4e1a-974b-f1ddb7d99896", "metadata": {}, "source": [ "`#each` will return an Enumerator or iterates each subset as a DataFrame." ] }, { "cell_type": "code", "execution_count": null, "id": "1e8b54a7-f6d6-4418-b6ee-26eb9e29517b", "metadata": { "tags": [] }, "outputs": [], "source": [ "sf.each" ] }, { "cell_type": "code", "execution_count": null, "id": "54e75770-f086-435c-9973-6acc1fd2453c", "metadata": { "tags": [] }, "outputs": [], "source": [ "sf.each.next" ] }, { "cell_type": "markdown", "id": "7af37556-ce78-43c8-8966-a485ccf787bc", "metadata": {}, "source": [ "`SubFrames.new` also accepts a block." ] }, { "cell_type": "code", "execution_count": null, "id": "b1c10be6-edcb-4907-af8c-79ecc35d2597", "metadata": { "tags": [] }, "outputs": [], "source": [ "usf = SubFrames.new(dataframe) { |df| [df.indices] }" ] }, { "cell_type": "markdown", "id": "2ad1370b-6a60-4f13-a9ad-7335c976f177", "metadata": {}, "source": [ "`#universal?` tests if self is an univarsal set." ] }, { "cell_type": "code", "execution_count": null, "id": "de56f4ba-5ecc-47a3-9008-479e91734025", "metadata": { "tags": [] }, "outputs": [], "source": [ "usf.universal?" ] }, { "cell_type": "markdown", "id": "edcefc08-6942-49f0-a765-1d4e1385c000", "metadata": {}, "source": [ "`#empty?` tests if self is an empty set." ] }, { "cell_type": "code", "execution_count": null, "id": "bb948026-dc14-4495-baac-1f33b68dd3ca", "metadata": { "tags": [] }, "outputs": [], "source": [ "esf = SubFrames.new(dataframe, [])" ] }, { "cell_type": "code", "execution_count": null, "id": "ef2a4fdf-d719-4226-a358-423221175361", "metadata": { "tags": [] }, "outputs": [], "source": [ "esf.empty?" ] }, { "cell_type": "markdown", "id": "d51a65cc-546b-4182-b204-4ea916a1749f", "metadata": {}, "source": [ "`SubFrames.new` also accepts boolean filters even from the block." ] }, { "cell_type": "code", "execution_count": null, "id": "9a8163fd-64f0-49c0-af5d-34ed3ba56343", "metadata": { "tags": [] }, "outputs": [], "source": [ "small = dataframe.x < 4\n", "large = !small\n", "small_large = SubFrames.new(dataframe) { [small, large] }" ] }, { "cell_type": "markdown", "id": "9205012a-36f4-41e0-bab7-d4e0da5410df", "metadata": {}, "source": [ "## 97. SubFrames#concatenate\n", "\n", "`SubFrames#concatenate` (or alias `#concat`) will concatenate SubFrames to create a DataFrame.\n", "\n", "(Since 0.4.0)" ] }, { "cell_type": "code", "execution_count": null, "id": "fa04d65b-94bd-44f3-b067-062513f9e89f", "metadata": { "tags": [] }, "outputs": [], "source": [ "sf.concatenate" ] }, { "cell_type": "markdown", "id": "8dfdd0ad-3795-4e72-9b4b-68523e1ae1cd", "metadata": {}, "source": [ "## 98. SubFrames.by_group\n", "\n", "Create SubFrames by Group object.\n", "\n", "(Since 0.4.0)" ] }, { "cell_type": "code", "execution_count": null, "id": "e71ca5ea-9c70-43b5-af7d-0c6f6266bf8e", "metadata": { "tags": [] }, "outputs": [], "source": [ "p dataframe; nil" ] }, { "cell_type": "code", "execution_count": null, "id": "2b88344c-340d-4477-b053-19a2827c7c36", "metadata": { "tags": [] }, "outputs": [], "source": [ "group = Group.new(dataframe, [:y])\n", "sf = SubFrames.by_group(group)" ] }, { "cell_type": "markdown", "id": "3587b482-d043-4739-b0ca-af133d915506", "metadata": {}, "source": [ "## 99. DataFrame#sub_by_value\n", "\n", "Grouping by value. It is corresponding to Group processing.\n", "\n", "Create SubFrames from keys and group by values in columns specified by the key.\n", "\n", "(Since 0.4.0)" ] }, { "cell_type": "code", "execution_count": null, "id": "7ce04a2a-d907-4c3b-9818-5d2b4c48b8d3", "metadata": { "tags": [] }, "outputs": [], "source": [ "dataframe.sub_by_value(keys: :y)" ] }, { "cell_type": "markdown", "id": "46bd06d9-6f10-45f5-9c9d-9e0c640fb829", "metadata": {}, "source": [ "## 100. DataFrame#sub_by_window\n", "\n", "Create SubFrames by window in `size` rolling `from` by `step`.\n", "\n", "Default values is `from: 0`, `size: nil` and `step: 1`.\n", "\n", "(Since 0.4.0)" ] }, { "cell_type": "code", "execution_count": null, "id": "09c66f0b-b6b0-413b-9f68-3fb596da309a", "metadata": { "tags": [] }, "outputs": [], "source": [ "dataframe.sub_by_window(size: 4, step: 2)" ] }, { "cell_type": "markdown", "id": "b6354ca3-9408-4d79-a83b-ec2f6cf8cd06", "metadata": {}, "source": [ "## 101. DataFrame#sub_by_enum\n", "\n", "Create SubFrames by Grouping/Windowing by posion. The position is specified by `Array`'s enumerator method such as `each_slice` or `each_cons`.\n", "\n", "(Since 0.4.0)" ] }, { "cell_type": "markdown", "id": "3a5c1b44-2d82-41be-b88f-e87b11fd554c", "metadata": {}, "source": [ "Create a SubFrames object sliced by 3 rows. This is MECE (Mutually Exclusive and Collectively Exhaustive) SubFrames." ] }, { "cell_type": "code", "execution_count": null, "id": "f34ab0c5-5802-4281-aa0c-05c0036214a4", "metadata": { "tags": [] }, "outputs": [], "source": [ "dataframe.sub_by_enum(:each_slice, 3)" ] }, { "cell_type": "markdown", "id": "30094486-7552-4c1b-b1db-f6e5b4b1ea7d", "metadata": {}, "source": [ "Create a SubFrames object for each consecutive 3 rows." ] }, { "cell_type": "code", "execution_count": null, "id": "ab90c3c6-43d0-4695-be00-d62cac963f0e", "metadata": { "tags": [] }, "outputs": [], "source": [ "dataframe.sub_by_enum(:each_cons, 4)" ] }, { "cell_type": "markdown", "id": "046789f7-3253-48be-b0f7-fb035b0b7192", "metadata": {}, "source": [ "## 102. DataFrame#sub_by_kernel\n", "\n", "Create SubFrames by windowing with a kernel and step.\n", "Kernel is a boolean Array and it behaves like a masked window.\n", "\n", "(Since 0.4.0)" ] }, { "cell_type": "code", "execution_count": null, "id": "21de6c9a-014b-4738-a352-93b4e61a1f34", "metadata": { "tags": [] }, "outputs": [], "source": [ "kernel = [true, false, false, true]\n", "dataframe.sub_by_kernel(kernel, step: 2)" ] }, { "cell_type": "markdown", "id": "2b68e71f-b815-48e8-ad5d-8ba07ebc72ad", "metadata": {}, "source": [ "## 103. DataFrame#build_subframes\n", "\n", "Generic builder of sub-dataframe from self.\n", "\n", "(Sice 0.4.0)" ] }, { "cell_type": "code", "execution_count": null, "id": "3096b251-d6c7-4790-b382-64cd10ce5dc7", "metadata": { "tags": [] }, "outputs": [], "source": [ "dataframe.build_subframes([[0, 2, 4], [1, 3, 5]])" ] }, { "cell_type": "markdown", "id": "0c8ba2ec-b079-4fcf-a0d7-a4637d0632ba", "metadata": {}, "source": [ "`#build_subframes` also accepts a block." ] }, { "cell_type": "code", "execution_count": null, "id": "179e600f-591e-4257-89f1-10d4c2c06fbd", "metadata": { "tags": [] }, "outputs": [], "source": [ "dataframe.build_subframes do |df|\n", " even = df.indices.map(&:even?)\n", " [even, !even]\n", "end" ] }, { "cell_type": "markdown", "id": "86b2df32-e497-41a2-9edd-df863bbb050e", "metadata": {}, "source": [ "## 104. SubFrames#aggregate\n", "- `SubFrames#aggregate(group_keys, aggregations)`\n", " Aggregate SubFrames to create a DataFrame.\n", "\n", "This API will cahnge in 0.4.1.\n", "\n", "(Since 0.4.0)" ] }, { "cell_type": "code", "execution_count": null, "id": "d2659324-f63f-42f8-af04-d800088b098d", "metadata": { "tags": [] }, "outputs": [], "source": [ "sf = dataframe.sub_by_value(keys: :y)" ] }, { "cell_type": "code", "execution_count": null, "id": "3c6320bd-a0b0-4c51-a302-0c668c514292", "metadata": { "tags": [] }, "outputs": [], "source": [ "sf.aggregate(:y, {z: :count})" ] }, { "cell_type": "markdown", "id": "86379ccb-03fa-4a0f-83b4-5e1ed769304a", "metadata": {}, "source": [ "## 105. Vector#modulo\n", "\n", "(Since 0.4.1)" ] }, { "cell_type": "markdown", "id": "84023114-3e44-4a74-89d1-ff1ca8a0eda8", "metadata": {}, "source": [ "`#%` is an alias of `#modulo`." ] }, { "cell_type": "code", "execution_count": null, "id": "fb05b349-e32c-47e7-af3e-27173d1aeffc", "metadata": { "tags": [] }, "outputs": [], "source": [ "vector = Vector.new(5, -3, 1)\n", "vector % 3" ] }, { "cell_type": "markdown", "id": "4ba911de-b8d0-4e43-ad75-6790d6f7366d", "metadata": {}, "source": [ "`#%` and `#modulo` is equivalent to `self-divisor*(self/divisor).floor`." ] }, { "cell_type": "code", "execution_count": null, "id": "15bc7bb9-0c99-488d-9063-ccb74bb64fda", "metadata": { "tags": [] }, "outputs": [], "source": [ "vector.modulo(-2)" ] }, { "cell_type": "markdown", "id": "3473fbbc-4317-48af-9434-c9aec1701461", "metadata": { "tags": [] }, "source": [ "## 106. Grouped DataFrame as a list\n", "This API was introduced in 0.2.3, and supply a new DataFrame group (experimental).\n", "\n", "This additional API will treat a grouped DataFrame as a list of DataFrames. I think this API has pros such as:\n", "\n", "- API is easy to understand and flexible.\n", "- It has good compatibility with Ruby's primitive Enumerables.\n", "- We can only use non hash-ed aggregation functions.\n", "- Do not need grouped DataFrame state, nor `#ungroup` method.\n", "- May be useful for concurrent operations.\n", "\n", "This feature is implemented by Ruby, so it is pretty slow and experimental. Use original Group API for practical purpose.\n", "\n", "(Since 0.2.3, experimental feature => This was upgraded to SubFrames feature)" ] }, { "cell_type": "code", "execution_count": null, "id": "389d5edc-b69d-4004-af7e-3419f8787d58", "metadata": {}, "outputs": [], "source": [ "enum = penguins.group(:island).each" ] }, { "cell_type": "code", "execution_count": null, "id": "e1b4a71b-c35a-4ecc-ab4e-ea18a752f931", "metadata": {}, "outputs": [], "source": [ "enum.to_a" ] }, { "cell_type": "code", "execution_count": null, "id": "ff69fc1e-ca8f-46c8-a434-7d8990fa605b", "metadata": {}, "outputs": [], "source": [ "array = enum.map do |df|\n", " DataFrame.new(island: [df.island[0]]).assign do\n", " df.variables.each_with_object({}) do |(key, vec), hash|\n", " next unless vec.numeric?\n", " hash[\"mean(#{key})\"] = [vec.mean]\n", " end\n", " end\n", "end" ] }, { "cell_type": "code", "execution_count": null, "id": "de50bff8-5081-4aff-bb9b-820c4111707e", "metadata": {}, "outputs": [], "source": [ "array.reduce { |a, df| a.concat df }" ] } ], "metadata": { "kernelspec": { "display_name": "Ruby 3.0.2", "language": "ruby", "name": "ruby" }, "language_info": { "file_extension": ".rb", "mimetype": "application/x-ruby", "name": "ruby", "version": "3.0.2" } }, "nbformat": 4, "nbformat_minor": 5 }