{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 40 kotlin-dataframe puzzles\n",
"inspired by [100 pandas puzzles](https://github.com/ajcr/100-pandas-puzzles)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Importing kotlin-dataframe\n",
"### Getting started\n",
"Difficulty: easy\n",
"\n",
"**1.** Import kotlin-dataframe"
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2024-09-18T10:52:03.220499028Z",
"start_time": "2024-09-18T10:51:58.917122418Z"
}
},
"source": [
"%useLatestDescriptors\n",
"%use dataframe"
],
"outputs": [],
"execution_count": 1
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## DataFrame Basics\n",
"### A few of the fundamental routines for selecting, sorting, adding and aggregating data in DataFrames\n",
"Difficulty: easy\n",
"\n",
"Consider the following columns:\n",
"```[kotlin]\n",
"val animal by columnOf(\"cat\", \"cat\", \"snake\", \"dog\", \"dog\", \"cat\", \"snake\", \"cat\", \"dog\", \"dog\")\n",
"val age by columnOf(2.5, 3.0, 0.5, Double.NaN, 5.0, 2.0, 4.5, Double.NaN, 7, 3)\n",
"val visits by columnOf(1, 3, 2, 3, 2, 3, 1, 1, 2, 1)\n",
"val priority by columnOf(\"yes\", \"yes\", \"no\", \"yes\", \"no\", \"no\", \"no\", \"yes\", \"no\", \"no\")\n",
"```\n",
"**2.** Create a DataFrame df from this columns."
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2024-09-18T10:52:06.499780842Z",
"start_time": "2024-09-18T10:52:03.239216461Z"
}
},
"source": [
"val animal by columnOf(\"cat\", \"cat\", \"snake\", \"dog\", \"dog\", \"cat\", \"snake\", \"cat\", \"dog\", \"dog\")\n",
"val age by columnOf(2.5, 3.0, 0.5, Double.NaN, 5.0, 2.0, 4.5, Double.NaN, 7.0, 3.0)\n",
"val visits by columnOf(1, 3, 2, 3, 2, 3, 1, 1, 2, 1)\n",
"val priority by columnOf(\"yes\", \"yes\", \"no\", \"yes\", \"no\", \"no\", \"no\", \"yes\", \"no\", \"no\")\n",
"\n",
"val df = dataFrameOf(animal, age, visits, priority)\n",
"df"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
"
\n",
" \n",
" \n",
" \n",
" | animal | age | visits | priority |
|---|
| cat | 2.500000 | 1 | yes |
| cat | 3.000000 | 3 | yes |
| snake | 0.500000 | 2 | no |
| dog | NaN | 3 | yes |
| dog | 5.000000 | 2 | no |
| cat | 2.000000 | 3 | no |
| snake | 4.500000 | 1 | no |
| cat | NaN | 1 | yes |
| dog | 7.000000 | 2 | no |
| dog | 3.000000 | 1 | no |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"animal\",\"age\",\"visits\",\"priority\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"}],\"nrow\":10,\"ncol\":4},\"kotlin_dataframe\":[{\"animal\":\"cat\",\"age\":2.5,\"visits\":1,\"priority\":\"yes\"},{\"animal\":\"cat\",\"age\":3.0,\"visits\":3,\"priority\":\"yes\"},{\"animal\":\"snake\",\"age\":0.5,\"visits\":2,\"priority\":\"no\"},{\"animal\":\"dog\",\"age\":NaN,\"visits\":3,\"priority\":\"yes\"},{\"animal\":\"dog\",\"age\":5.0,\"visits\":2,\"priority\":\"no\"},{\"animal\":\"cat\",\"age\":2.0,\"visits\":3,\"priority\":\"no\"},{\"animal\":\"snake\",\"age\":4.5,\"visits\":1,\"priority\":\"no\"},{\"animal\":\"cat\",\"age\":NaN,\"visits\":1,\"priority\":\"yes\"},{\"animal\":\"dog\",\"age\":7.0,\"visits\":2,\"priority\":\"no\"},{\"animal\":\"dog\",\"age\":3.0,\"visits\":1,\"priority\":\"no\"}]}"
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 2
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**3.** Display a summary of the basic information about this DataFrame and its data."
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2024-09-18T10:52:06.699653196Z",
"start_time": "2024-09-18T10:52:06.578366707Z"
}
},
"source": [
"df.schema()"
],
"outputs": [
{
"data": {
"text/plain": [
"animal: String\n",
"age: Double\n",
"visits: Int\n",
"priority: String"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 3
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2024-09-18T10:52:07.516910347Z",
"start_time": "2024-09-18T10:52:07.151245989Z"
}
},
"source": [
"df.describe()"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | name | type | count | unique | nulls | top | freq | mean | std | min | median | max |
|---|
| animal | String | 10 | 3 | 0 | cat | 4 | null | null | cat | dog | snake |
| age | Double | 10 | 8 | 0 | 3.000000 | 2 | NaN | NaN | 0.500000 | 3.750000 | NaN |
| visits | Int | 10 | 3 | 0 | 1 | 4 | 1.900000 | 0.875595 | 1 | 2 | 3 |
| priority | String | 10 | 2 | 0 | no | 6 | null | null | no | no | yes |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"name\",\"type\",\"count\",\"unique\",\"nulls\",\"top\",\"freq\",\"mean\",\"std\",\"min\",\"median\",\"max\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Any\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Comparable<*>\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double?\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double?\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Comparable<*>\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Comparable<*>\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Comparable<*>\"}],\"nrow\":4,\"ncol\":12},\"kotlin_dataframe\":[{\"name\":\"animal\",\"type\":\"String\",\"count\":10,\"unique\":3,\"nulls\":0,\"top\":\"cat\",\"freq\":4,\"mean\":null,\"std\":null,\"min\":\"cat\",\"median\":\"dog\",\"max\":\"snake\"},{\"name\":\"age\",\"type\":\"Double\",\"count\":10,\"unique\":8,\"nulls\":0,\"top\":\"3.0\",\"freq\":2,\"mean\":NaN,\"std\":NaN,\"min\":\"0.5\",\"median\":\"3.75\",\"max\":\"NaN\"},{\"name\":\"visits\",\"type\":\"Int\",\"count\":10,\"unique\":3,\"nulls\":0,\"top\":\"1\",\"freq\":4,\"mean\":1.9,\"std\":0.8755950357709131,\"min\":\"1\",\"median\":\"2\",\"max\":\"3\"},{\"name\":\"priority\",\"type\":\"String\",\"count\":10,\"unique\":2,\"nulls\":0,\"top\":\"no\",\"freq\":6,\"mean\":null,\"std\":null,\"min\":\"no\",\"median\":\"no\",\"max\":\"yes\"}]}"
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 4
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**4.** Return the first 3 rows of the DataFrame df."
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2024-09-18T10:52:07.853959161Z",
"start_time": "2024-09-18T10:52:07.654485332Z"
}
},
"source": [
"df[0 ..< 3] // df[0..2]\n",
"\n",
"// or equivalently\n",
"\n",
"df.head(3)\n",
"\n",
"// or\n",
"\n",
"df.take(3)"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | animal | age | visits | priority |
|---|
| cat | 2.500000 | 1 | yes |
| cat | 3.000000 | 3 | yes |
| snake | 0.500000 | 2 | no |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"animal\",\"age\",\"visits\",\"priority\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"}],\"nrow\":3,\"ncol\":4},\"kotlin_dataframe\":[{\"animal\":\"cat\",\"age\":2.5,\"visits\":1,\"priority\":\"yes\"},{\"animal\":\"cat\",\"age\":3.0,\"visits\":3,\"priority\":\"yes\"},{\"animal\":\"snake\",\"age\":0.5,\"visits\":2,\"priority\":\"no\"}]}"
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 5
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**5.** Select \"animal\" and \"age\" columns from the DataFrame df."
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2024-09-18T10:52:08.396578877Z",
"start_time": "2024-09-18T10:52:08.236415219Z"
}
},
"source": [
"df[animal, age]"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | animal | age |
|---|
| cat | 2.500000 |
| cat | 3.000000 |
| snake | 0.500000 |
| dog | NaN |
| dog | 5.000000 |
| cat | 2.000000 |
| snake | 4.500000 |
| cat | NaN |
| dog | 7.000000 |
| dog | 3.000000 |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"animal\",\"age\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"}],\"nrow\":10,\"ncol\":2},\"kotlin_dataframe\":[{\"animal\":\"cat\",\"age\":2.5},{\"animal\":\"cat\",\"age\":3.0},{\"animal\":\"snake\",\"age\":0.5},{\"animal\":\"dog\",\"age\":NaN},{\"animal\":\"dog\",\"age\":5.0},{\"animal\":\"cat\",\"age\":2.0},{\"animal\":\"snake\",\"age\":4.5},{\"animal\":\"cat\",\"age\":NaN},{\"animal\":\"dog\",\"age\":7.0},{\"animal\":\"dog\",\"age\":3.0}]}"
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 6
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**6.** Select the data in rows [3, 4, 8] and in columns [\"animal\", \"age\"]."
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2024-09-18T10:52:08.881797656Z",
"start_time": "2024-09-18T10:52:08.731032495Z"
}
},
"source": [
"df[3, 4, 8][animal, age]"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | animal | age |
|---|
| dog | NaN |
| dog | 5.000000 |
| dog | 7.000000 |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"animal\",\"age\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"}],\"nrow\":3,\"ncol\":2},\"kotlin_dataframe\":[{\"animal\":\"dog\",\"age\":NaN},{\"animal\":\"dog\",\"age\":5.0},{\"animal\":\"dog\",\"age\":7.0}]}"
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 7
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**7.** Select only the rows where the number of visits is grater than 2."
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2024-09-18T10:52:09.700002103Z",
"start_time": "2024-09-18T10:52:09.490718599Z"
}
},
"source": [
"df.filter { visits > 2 }"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | animal | age | visits | priority |
|---|
| cat | 3.000000 | 3 | yes |
| dog | NaN | 3 | yes |
| cat | 2.000000 | 3 | no |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"animal\",\"age\",\"visits\",\"priority\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"}],\"nrow\":3,\"ncol\":4},\"kotlin_dataframe\":[{\"animal\":\"cat\",\"age\":3.0,\"visits\":3,\"priority\":\"yes\"},{\"animal\":\"dog\",\"age\":NaN,\"visits\":3,\"priority\":\"yes\"},{\"animal\":\"cat\",\"age\":2.0,\"visits\":3,\"priority\":\"no\"}]}"
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 8
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**8.** Select the rows where the age is missing, i.e. it is NaN."
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2024-09-18T10:52:10.376245935Z",
"start_time": "2024-09-18T10:52:10.254929046Z"
}
},
"source": [
"df.filter { age.isNaN() }"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | animal | age | visits | priority |
|---|
| dog | NaN | 3 | yes |
| cat | NaN | 1 | yes |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"animal\",\"age\",\"visits\",\"priority\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"}],\"nrow\":2,\"ncol\":4},\"kotlin_dataframe\":[{\"animal\":\"dog\",\"age\":NaN,\"visits\":3,\"priority\":\"yes\"},{\"animal\":\"cat\",\"age\":NaN,\"visits\":1,\"priority\":\"yes\"}]}"
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 9
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"N**9.** Select the rows where the animal is a cat and the age is less than 3."
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2024-09-18T10:52:10.939482997Z",
"start_time": "2024-09-18T10:52:10.804266534Z"
}
},
"source": [
"df.filter { animal == \"cat\" && age < 3 }"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | animal | age | visits | priority |
|---|
| cat | 2.500000 | 1 | yes |
| cat | 2.000000 | 3 | no |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"animal\",\"age\",\"visits\",\"priority\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"}],\"nrow\":2,\"ncol\":4},\"kotlin_dataframe\":[{\"animal\":\"cat\",\"age\":2.5,\"visits\":1,\"priority\":\"yes\"},{\"animal\":\"cat\",\"age\":2.0,\"visits\":3,\"priority\":\"no\"}]}"
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 10
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**10.** Select the rows where age is between 2 and 4 (inclusive)."
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2024-09-18T10:52:11.497572835Z",
"start_time": "2024-09-18T10:52:11.349495496Z"
}
},
"source": [
"df.filter { age in 2.0..4.0 }"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | animal | age | visits | priority |
|---|
| cat | 2.500000 | 1 | yes |
| cat | 3.000000 | 3 | yes |
| cat | 2.000000 | 3 | no |
| dog | 3.000000 | 1 | no |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"animal\",\"age\",\"visits\",\"priority\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"}],\"nrow\":4,\"ncol\":4},\"kotlin_dataframe\":[{\"animal\":\"cat\",\"age\":2.5,\"visits\":1,\"priority\":\"yes\"},{\"animal\":\"cat\",\"age\":3.0,\"visits\":3,\"priority\":\"yes\"},{\"animal\":\"cat\",\"age\":2.0,\"visits\":3,\"priority\":\"no\"},{\"animal\":\"dog\",\"age\":3.0,\"visits\":1,\"priority\":\"no\"}]}"
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 11
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**11.** Change tha age in row 5 to 1.5"
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2024-09-18T10:52:12.041722746Z",
"start_time": "2024-09-18T10:52:11.881182942Z"
}
},
"source": [
"df.update { age }.at(5).with { 1.5 }"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | animal | age | visits | priority |
|---|
| cat | 2.500000 | 1 | yes |
| cat | 3.000000 | 3 | yes |
| snake | 0.500000 | 2 | no |
| dog | NaN | 3 | yes |
| dog | 5.000000 | 2 | no |
| cat | 1.500000 | 3 | no |
| snake | 4.500000 | 1 | no |
| cat | NaN | 1 | yes |
| dog | 7.000000 | 2 | no |
| dog | 3.000000 | 1 | no |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"animal\",\"age\",\"visits\",\"priority\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"}],\"nrow\":10,\"ncol\":4},\"kotlin_dataframe\":[{\"animal\":\"cat\",\"age\":2.5,\"visits\":1,\"priority\":\"yes\"},{\"animal\":\"cat\",\"age\":3.0,\"visits\":3,\"priority\":\"yes\"},{\"animal\":\"snake\",\"age\":0.5,\"visits\":2,\"priority\":\"no\"},{\"animal\":\"dog\",\"age\":NaN,\"visits\":3,\"priority\":\"yes\"},{\"animal\":\"dog\",\"age\":5.0,\"visits\":2,\"priority\":\"no\"},{\"animal\":\"cat\",\"age\":1.5,\"visits\":3,\"priority\":\"no\"},{\"animal\":\"snake\",\"age\":4.5,\"visits\":1,\"priority\":\"no\"},{\"animal\":\"cat\",\"age\":NaN,\"visits\":1,\"priority\":\"yes\"},{\"animal\":\"dog\",\"age\":7.0,\"visits\":2,\"priority\":\"no\"},{\"animal\":\"dog\",\"age\":3.0,\"visits\":1,\"priority\":\"no\"}]}"
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 12
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**12.** Calculate the sum of all visits in df (i.e. the total number of visits)."
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2024-09-18T10:52:12.502683994Z",
"start_time": "2024-09-18T10:52:12.418093861Z"
}
},
"source": [
"df.visits.sum()"
],
"outputs": [
{
"data": {
"text/plain": [
"19"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 13
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**13.** Calculate the mean age for each different animal in df."
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2024-09-18T10:52:13.121108114Z",
"start_time": "2024-09-18T10:52:12.959992501Z"
}
},
"source": [
"df.groupBy { animal }.mean { age }"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | animal | age |
|---|
| cat | NaN |
| snake | 2.500000 |
| dog | NaN |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"animal\",\"age\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"}],\"nrow\":3,\"ncol\":2},\"kotlin_dataframe\":[{\"animal\":\"cat\",\"age\":NaN},{\"animal\":\"snake\",\"age\":2.5},{\"animal\":\"dog\",\"age\":NaN}]}"
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 14
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**14.** Append a new row to df with your choice of values for each column. Then delete that row to return the original DataFrame."
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2024-09-18T10:52:13.572565851Z",
"start_time": "2024-09-18T10:52:13.477561023Z"
}
},
"source": [
"val modifiedDf = df.append(\"dog\", 5.5, 2, \"no\")\n",
"modifiedDf.dropLast()"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | animal | age | visits | priority |
|---|
| cat | 2.500000 | 1 | yes |
| cat | 3.000000 | 3 | yes |
| snake | 0.500000 | 2 | no |
| dog | NaN | 3 | yes |
| dog | 5.000000 | 2 | no |
| cat | 2.000000 | 3 | no |
| snake | 4.500000 | 1 | no |
| cat | NaN | 1 | yes |
| dog | 7.000000 | 2 | no |
| dog | 3.000000 | 1 | no |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"animal\",\"age\",\"visits\",\"priority\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"}],\"nrow\":10,\"ncol\":4},\"kotlin_dataframe\":[{\"animal\":\"cat\",\"age\":2.5,\"visits\":1,\"priority\":\"yes\"},{\"animal\":\"cat\",\"age\":3.0,\"visits\":3,\"priority\":\"yes\"},{\"animal\":\"snake\",\"age\":0.5,\"visits\":2,\"priority\":\"no\"},{\"animal\":\"dog\",\"age\":NaN,\"visits\":3,\"priority\":\"yes\"},{\"animal\":\"dog\",\"age\":5.0,\"visits\":2,\"priority\":\"no\"},{\"animal\":\"cat\",\"age\":2.0,\"visits\":3,\"priority\":\"no\"},{\"animal\":\"snake\",\"age\":4.5,\"visits\":1,\"priority\":\"no\"},{\"animal\":\"cat\",\"age\":NaN,\"visits\":1,\"priority\":\"yes\"},{\"animal\":\"dog\",\"age\":7.0,\"visits\":2,\"priority\":\"no\"},{\"animal\":\"dog\",\"age\":3.0,\"visits\":1,\"priority\":\"no\"}]}"
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 15
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**15.** Count the number of each type of animal in df."
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2024-09-18T10:52:13.908195752Z",
"start_time": "2024-09-18T10:52:13.824927928Z"
}
},
"source": [
"df.groupBy { animal }.count()"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | animal | count |
|---|
| cat | 4 |
| snake | 2 |
| dog | 4 |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"animal\",\"count\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"}],\"nrow\":3,\"ncol\":2},\"kotlin_dataframe\":[{\"animal\":\"cat\",\"count\":4},{\"animal\":\"snake\",\"count\":2},{\"animal\":\"dog\",\"count\":4}]}"
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 16
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**16.** Sort df first by the values in the 'age' in descending order, then by the value in the 'visits' column in ascending order."
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2024-09-18T10:52:14.315051455Z",
"start_time": "2024-09-18T10:52:14.179581316Z"
}
},
"source": [
"df.sortBy { age.desc() and visits }"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | animal | age | visits | priority |
|---|
| cat | NaN | 1 | yes |
| dog | NaN | 3 | yes |
| dog | 7.000000 | 2 | no |
| dog | 5.000000 | 2 | no |
| snake | 4.500000 | 1 | no |
| dog | 3.000000 | 1 | no |
| cat | 3.000000 | 3 | yes |
| cat | 2.500000 | 1 | yes |
| cat | 2.000000 | 3 | no |
| snake | 0.500000 | 2 | no |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"animal\",\"age\",\"visits\",\"priority\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"}],\"nrow\":10,\"ncol\":4},\"kotlin_dataframe\":[{\"animal\":\"cat\",\"age\":NaN,\"visits\":1,\"priority\":\"yes\"},{\"animal\":\"dog\",\"age\":NaN,\"visits\":3,\"priority\":\"yes\"},{\"animal\":\"dog\",\"age\":7.0,\"visits\":2,\"priority\":\"no\"},{\"animal\":\"dog\",\"age\":5.0,\"visits\":2,\"priority\":\"no\"},{\"animal\":\"snake\",\"age\":4.5,\"visits\":1,\"priority\":\"no\"},{\"animal\":\"dog\",\"age\":3.0,\"visits\":1,\"priority\":\"no\"},{\"animal\":\"cat\",\"age\":3.0,\"visits\":3,\"priority\":\"yes\"},{\"animal\":\"cat\",\"age\":2.5,\"visits\":1,\"priority\":\"yes\"},{\"animal\":\"cat\",\"age\":2.0,\"visits\":3,\"priority\":\"no\"},{\"animal\":\"snake\",\"age\":0.5,\"visits\":2,\"priority\":\"no\"}]}"
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 17
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**17.** The 'priority' column contains the values 'yes' and 'no'. Replace this column with a column of boolean values: 'yes' should be True and 'no' should be False."
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2024-09-18T10:52:14.804758202Z",
"start_time": "2024-09-18T10:52:14.696941579Z"
}
},
"source": [
"df.convert { priority }.with { it == \"yes\" }"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | animal | age | visits | priority |
|---|
| cat | 2.500000 | 1 | true |
| cat | 3.000000 | 3 | true |
| snake | 0.500000 | 2 | false |
| dog | NaN | 3 | true |
| dog | 5.000000 | 2 | false |
| cat | 2.000000 | 3 | false |
| snake | 4.500000 | 1 | false |
| cat | NaN | 1 | true |
| dog | 7.000000 | 2 | false |
| dog | 3.000000 | 1 | false |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"animal\",\"age\",\"visits\",\"priority\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Boolean\"}],\"nrow\":10,\"ncol\":4},\"kotlin_dataframe\":[{\"animal\":\"cat\",\"age\":2.5,\"visits\":1,\"priority\":true},{\"animal\":\"cat\",\"age\":3.0,\"visits\":3,\"priority\":true},{\"animal\":\"snake\",\"age\":0.5,\"visits\":2,\"priority\":false},{\"animal\":\"dog\",\"age\":NaN,\"visits\":3,\"priority\":true},{\"animal\":\"dog\",\"age\":5.0,\"visits\":2,\"priority\":false},{\"animal\":\"cat\",\"age\":2.0,\"visits\":3,\"priority\":false},{\"animal\":\"snake\",\"age\":4.5,\"visits\":1,\"priority\":false},{\"animal\":\"cat\",\"age\":NaN,\"visits\":1,\"priority\":true},{\"animal\":\"dog\",\"age\":7.0,\"visits\":2,\"priority\":false},{\"animal\":\"dog\",\"age\":3.0,\"visits\":1,\"priority\":false}]}"
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 18
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**18.** In the 'animal' column, change the 'dog' entries to 'corgi'."
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2024-09-18T10:52:15.350396770Z",
"start_time": "2024-09-18T10:52:15.245727234Z"
}
},
"source": [
"df.update { animal }.where { it == \"dog\" }.with { \"corgi\" }"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | animal | age | visits | priority |
|---|
| cat | 2.500000 | 1 | yes |
| cat | 3.000000 | 3 | yes |
| snake | 0.500000 | 2 | no |
| corgi | NaN | 3 | yes |
| corgi | 5.000000 | 2 | no |
| cat | 2.000000 | 3 | no |
| snake | 4.500000 | 1 | no |
| cat | NaN | 1 | yes |
| corgi | 7.000000 | 2 | no |
| corgi | 3.000000 | 1 | no |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"animal\",\"age\",\"visits\",\"priority\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"}],\"nrow\":10,\"ncol\":4},\"kotlin_dataframe\":[{\"animal\":\"cat\",\"age\":2.5,\"visits\":1,\"priority\":\"yes\"},{\"animal\":\"cat\",\"age\":3.0,\"visits\":3,\"priority\":\"yes\"},{\"animal\":\"snake\",\"age\":0.5,\"visits\":2,\"priority\":\"no\"},{\"animal\":\"corgi\",\"age\":NaN,\"visits\":3,\"priority\":\"yes\"},{\"animal\":\"corgi\",\"age\":5.0,\"visits\":2,\"priority\":\"no\"},{\"animal\":\"cat\",\"age\":2.0,\"visits\":3,\"priority\":\"no\"},{\"animal\":\"snake\",\"age\":4.5,\"visits\":1,\"priority\":\"no\"},{\"animal\":\"cat\",\"age\":NaN,\"visits\":1,\"priority\":\"yes\"},{\"animal\":\"corgi\",\"age\":7.0,\"visits\":2,\"priority\":\"no\"},{\"animal\":\"corgi\",\"age\":3.0,\"visits\":1,\"priority\":\"no\"}]}"
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 19
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**19.** For each animal type and each number of visits, find the mean age. In other words, each row is an animal, each column is a number of visits and the values are the mean ages."
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2024-09-18T10:52:15.873577985Z",
"start_time": "2024-09-18T10:52:15.735301003Z"
}
},
"source": [
"df.pivot { visits }.groupBy { animal }.mean(skipNA = true) { age }"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | animal | visits | | |
|---|
| 1 | 3 | 2 |
|---|
| cat | 2.500000 | 2.500000 | null |
| snake | 4.500000 | null | 0.500000 |
| dog | 3.000000 | NaN | 6.000000 |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"animal\",\"visits\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ColumnGroup\"}],\"nrow\":3,\"ncol\":2},\"kotlin_dataframe\":[{\"animal\":\"cat\",\"visits\":{\"data\":{\"1\":2.5,\"3\":2.5,\"2\":null},\"metadata\":{\"kind\":\"ColumnGroup\",\"columns\":[\"1\",\"3\",\"2\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double?\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double?\"}]}}},{\"animal\":\"snake\",\"visits\":{\"data\":{\"1\":4.5,\"3\":null,\"2\":0.5},\"metadata\":{\"kind\":\"ColumnGroup\",\"columns\":[\"1\",\"3\",\"2\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double?\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double?\"}]}}},{\"animal\":\"dog\",\"visits\":{\"data\":{\"1\":3.0,\"3\":NaN,\"2\":6.0},\"metadata\":{\"kind\":\"ColumnGroup\",\"columns\":[\"1\",\"3\",\"2\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double?\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double?\"}]}}}]}"
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 20
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## DataFrame: beyond the basics\n",
"### Slightly trickier: you may need to combine two or more methods to get the right answer\n",
"Difficulty: medium\n",
"\n",
"The previous section was tour through some basic but essential DataFrame operations. Below are some ways that you might need to cut your data, but for which there is no single \"out of the box\" method."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**20.** You have a DataFrame df with a column 'A' of integers. For example:\n",
"```kotlin\n",
"val df = dataFrameOf(\"A\")(1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7)\n",
"```\n",
"How do you filter out rows which contain the same integer as the row immediately above?\n",
"\n",
"You should be left with a column containing the following values:\n",
"```\n",
"1, 2, 3, 4, 5, 6, 7\n",
"```"
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2024-09-18T10:52:16.459416497Z",
"start_time": "2024-09-18T10:52:16.272503788Z"
}
},
"source": [
"val df = dataFrameOf(\"A\")(1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7)\n",
"df"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"A\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"}],\"nrow\":11,\"ncol\":1},\"kotlin_dataframe\":[{\"A\":1},{\"A\":2},{\"A\":2},{\"A\":3},{\"A\":4},{\"A\":5},{\"A\":5},{\"A\":5},{\"A\":6},{\"A\":7},{\"A\":7}]}"
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 21
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2024-09-18T10:52:16.946693546Z",
"start_time": "2024-09-18T10:52:16.842037901Z"
}
},
"source": [
"df.filter { prev()?.A != A }"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"A\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"}],\"nrow\":7,\"ncol\":1},\"kotlin_dataframe\":[{\"A\":1},{\"A\":2},{\"A\":3},{\"A\":4},{\"A\":5},{\"A\":6},{\"A\":7}]}"
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 22
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2024-09-18T10:52:17.478718422Z",
"start_time": "2024-09-18T10:52:17.384539811Z"
}
},
"source": [
"df.filter { diffOrNull { A } != 0 }"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"A\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"}],\"nrow\":7,\"ncol\":1},\"kotlin_dataframe\":[{\"A\":1},{\"A\":2},{\"A\":3},{\"A\":4},{\"A\":5},{\"A\":6},{\"A\":7}]}"
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 23
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We could use `distinct()` here but it won't work as desired if A is [1, 1, 2, 2, 1, 1] for example."
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2024-09-18T10:52:17.977821482Z",
"start_time": "2024-09-18T10:52:17.926194720Z"
}
},
"source": [
"df.distinct()"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"A\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"}],\"nrow\":7,\"ncol\":1},\"kotlin_dataframe\":[{\"A\":1},{\"A\":2},{\"A\":3},{\"A\":4},{\"A\":5},{\"A\":6},{\"A\":7}]}"
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 24
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**21.** Given a DataFrame of random numetic values:\n",
"```kotlin\n",
"val df = dataFrameOf(\"a\", \"b\", \"c\").randomDouble(5) // this is a 5x3 DataFrame of double values\n",
"```\n",
"\n",
"how do you subtract the row mean from each element in the row?"
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2024-09-18T10:52:18.675812178Z",
"start_time": "2024-09-18T10:52:18.412419237Z"
}
},
"source": [
"val df = dataFrameOf(\"a\", \"b\", \"c\").randomDouble(5)\n",
"df"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | a | b | c |
|---|
| 0.521123 | 0.918268 | 0.565521 |
| 0.809771 | 0.543012 | 0.597897 |
| 0.927176 | 0.888330 | 0.874393 |
| 0.058951 | 0.728707 | 0.034816 |
| 0.804097 | 0.762586 | 0.937529 |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"a\",\"b\",\"c\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"}],\"nrow\":5,\"ncol\":3},\"kotlin_dataframe\":[{\"a\":0.5211226771979798,\"b\":0.9182680768278705,\"c\":0.5655205458312264},{\"a\":0.8097706752385012,\"b\":0.5430119946891596,\"c\":0.5978969669180416},{\"a\":0.92717591604802,\"b\":0.8883296755272901,\"c\":0.8743928104323854},{\"a\":0.05895073364289227,\"b\":0.7287071638136583,\"c\":0.03481563927773523},{\"a\":0.8040971099148696,\"b\":0.7625857381225364,\"c\":0.9375285852183818}]}"
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 25
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2024-09-18T10:52:19.249067555Z",
"start_time": "2024-09-18T10:52:19.126155154Z"
}
},
"source": [
"df.update { colsOf() }\n",
" .with { it - rowMean() }"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | a | b | c |
|---|
| -0.147181 | 0.249964 | -0.102783 |
| 0.159544 | -0.107215 | -0.052330 |
| 0.030543 | -0.008303 | -0.022240 |
| -0.215207 | 0.454549 | -0.239342 |
| -0.030640 | -0.072151 | 0.102791 |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"a\",\"b\",\"c\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"}],\"nrow\":5,\"ncol\":3},\"kotlin_dataframe\":[{\"a\":-0.14718108942104569,\"b\":0.24996431020884502,\"c\":-0.10278322078779911},{\"a\":0.15954412962326703,\"b\":-0.10721455092607457,\"c\":-0.05232957869719257},{\"a\":0.030543115378788177,\"b\":-0.008303125141941647,\"c\":-0.02223999023684642},{\"a\":-0.21520711193520298,\"b\":0.45454931823556305,\"c\":-0.23934220630036002},{\"a\":-0.03064003450372632,\"b\":-0.07215140629605954,\"c\":0.10279144079978586}]}"
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 26
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**22.** Suppose you have DataFrame with 10 columns of real numbers, for example:\n",
"```kotlin\n",
"val names = ('a'..'j').map { it.toString() }\n",
"val df = dataFrameOf(names).randomDouble(5)\n",
"```\n",
"\n",
"Which column of number has the smallest sum? Return that column's label."
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2024-09-18T10:52:19.965175100Z",
"start_time": "2024-09-18T10:52:19.598014715Z"
}
},
"source": [
"val names = ('a'..'j').map { it.toString() }\n",
"val df = dataFrameOf(names).randomDouble(5)\n",
"df"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | a | b | c | d | e | f | g | h | i | j |
|---|
| 0.612169 | 0.108263 | 0.185536 | 0.430257 | 0.959346 | 0.399817 | 0.498650 | 0.417456 | 0.404601 | 0.847830 |
| 0.901659 | 0.040860 | 0.003512 | 0.870346 | 0.182568 | 0.532544 | 0.287011 | 0.339791 | 0.622038 | 0.822349 |
| 0.531465 | 0.423177 | 0.887166 | 0.351157 | 0.430013 | 0.179277 | 0.953403 | 0.222690 | 0.281130 | 0.657052 |
| 0.203136 | 0.526475 | 0.518940 | 0.409763 | 0.121049 | 0.888950 | 0.438593 | 0.773491 | 0.669085 | 0.436394 |
| 0.531437 | 0.142654 | 0.976179 | 0.982954 | 0.079222 | 0.669538 | 0.932285 | 0.987870 | 0.487531 | 0.246299 |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"a\",\"b\",\"c\",\"d\",\"e\",\"f\",\"g\",\"h\",\"i\",\"j\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"}],\"nrow\":5,\"ncol\":10},\"kotlin_dataframe\":[{\"a\":0.6121691192874636,\"b\":0.10826292862147413,\"c\":0.18553643751876292,\"d\":0.43025739391233997,\"e\":0.9593458354189062,\"f\":0.39981729336045646,\"g\":0.49865033558033545,\"h\":0.4174560831405012,\"i\":0.40460104231272565,\"j\":0.847830490738343},{\"a\":0.9016585558415995,\"b\":0.040860274926845075,\"c\":0.0035116923991718174,\"d\":0.8703463095918256,\"e\":0.1825680938352957,\"f\":0.5325438772189126,\"g\":0.2870112242064805,\"h\":0.33979061054001425,\"i\":0.6220380832306763,\"j\":0.8223494500148123},{\"a\":0.5314650843323452,\"b\":0.42317657849111134,\"c\":0.887165677595343,\"d\":0.3511566149965749,\"e\":0.4300127181218244,\"f\":0.17927725177230414,\"g\":0.9534025504792901,\"h\":0.22269045312684088,\"i\":0.2811296671250362,\"j\":0.657051852732853},{\"a\":0.2031364555817572,\"b\":0.5264750146974752,\"c\":0.5189402998748279,\"d\":0.4097625055097335,\"e\":0.12104893217793067,\"f\":0.8889504457529872,\"g\":0.4385927587062547,\"h\":0.7734914188553896,\"i\":0.6690853272120499,\"j\":0.4363942666192582},{\"a\":0.5314370292128995,\"b\":0.14265383652679098,\"c\":0.9761787154668559,\"d\":0.9829539883096804,\"e\":0.07922175780989915,\"f\":0.669537947529444,\"g\":0.9322850054669332,\"h\":0.987869801636424,\"i\":0.48753118856457334,\"j\":0.24629935762383492}]}"
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 27
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2024-09-18T10:52:20.639190842Z",
"start_time": "2024-09-18T10:52:20.535081265Z"
}
},
"source": [
"df.sum().transpose().minBy(\"value\")[\"name\"]"
],
"outputs": [
{
"data": {
"text/plain": [
"b"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 28
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**23.** How do you count how many unique rows a DataFrame has (i.e. ignore all rows that are duplicates)?"
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2024-09-18T10:52:21.616937975Z",
"start_time": "2024-09-18T10:52:21.390718075Z"
}
},
"source": [
"val df = dataFrameOf(\"a\", \"b\", \"c\").randomInt(30, 0..2)\n",
"df.distinct().count()"
],
"outputs": [
{
"data": {
"text/plain": [
"19"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 29
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**24.** In the cell below, you have a DataFrame `df` that consists of 10 columns of floating-point numbers. Exactly 5 entries in each row are NaN values.\n",
"\n",
"For each row of the DataFrame, find the *column* which contains the *third* NaN value.\n",
"\n",
"You should return a column of column labels: `e, c, d, h, d`"
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2024-09-18T10:52:22.366340356Z",
"start_time": "2024-09-18T10:52:22.051474456Z"
}
},
"source": [
"val nan = Double.NaN\n",
"val names = ('a'..'j').map { it.toString() }\n",
"val data = listOf(\n",
" 0.04, nan, nan, 0.25, nan, 0.43, 0.71, 0.51, nan, nan,\n",
" nan, nan, nan, 0.04, 0.76, nan, nan, 0.67, 0.76, 0.16,\n",
" nan, nan, 0.5, nan, 0.31, 0.4, nan, nan, 0.24, 0.01,\n",
" 0.49, nan, nan, 0.62, 0.73, 0.26, 0.85, nan, nan, nan,\n",
" nan, nan, 0.41, nan, 0.05, nan, 0.61, nan, 0.48, 0.68,\n",
")\n",
"val df = dataFrameOf(names)(*data.toTypedArray())\n",
"df"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | a | b | c | d | e | f | g | h | i | j |
|---|
| 0.040000 | NaN | NaN | 0.250000 | NaN | 0.430000 | 0.710000 | 0.510000 | NaN | NaN |
| NaN | NaN | NaN | 0.040000 | 0.760000 | NaN | NaN | 0.670000 | 0.760000 | 0.160000 |
| NaN | NaN | 0.500000 | NaN | 0.310000 | 0.400000 | NaN | NaN | 0.240000 | 0.010000 |
| 0.490000 | NaN | NaN | 0.620000 | 0.730000 | 0.260000 | 0.850000 | NaN | NaN | NaN |
| NaN | NaN | 0.410000 | NaN | 0.050000 | NaN | 0.610000 | NaN | 0.480000 | 0.680000 |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"a\",\"b\",\"c\",\"d\",\"e\",\"f\",\"g\",\"h\",\"i\",\"j\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"}],\"nrow\":5,\"ncol\":10},\"kotlin_dataframe\":[{\"a\":0.04,\"b\":NaN,\"c\":NaN,\"d\":0.25,\"e\":NaN,\"f\":0.43,\"g\":0.71,\"h\":0.51,\"i\":NaN,\"j\":NaN},{\"a\":NaN,\"b\":NaN,\"c\":NaN,\"d\":0.04,\"e\":0.76,\"f\":NaN,\"g\":NaN,\"h\":0.67,\"i\":0.76,\"j\":0.16},{\"a\":NaN,\"b\":NaN,\"c\":0.5,\"d\":NaN,\"e\":0.31,\"f\":0.4,\"g\":NaN,\"h\":NaN,\"i\":0.24,\"j\":0.01},{\"a\":0.49,\"b\":NaN,\"c\":NaN,\"d\":0.62,\"e\":0.73,\"f\":0.26,\"g\":0.85,\"h\":NaN,\"i\":NaN,\"j\":NaN},{\"a\":NaN,\"b\":NaN,\"c\":0.41,\"d\":NaN,\"e\":0.05,\"f\":NaN,\"g\":0.61,\"h\":NaN,\"i\":0.48,\"j\":0.68}]}"
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 30
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2024-09-18T10:52:22.818316268Z",
"start_time": "2024-09-18T10:52:22.640257239Z"
}
},
"source": [
"df.mapToColumn(\"res\") { \n",
" namedValuesOf()\n",
" .filter { it.value.isNaN() }.drop(2)\n",
" .firstOrNull()?.name \n",
"}"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"res\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"}],\"nrow\":5,\"ncol\":1},\"kotlin_dataframe\":[{\"res\":\"e\"},{\"res\":\"c\"},{\"res\":\"d\"},{\"res\":\"h\"},{\"res\":\"d\"}]}"
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 31
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**25.** A DataFrame has a column of groups 'grps' and and column of integer values 'vals':\n",
"```kotlin\n",
"val grps by column(\"a\", \"a\", \"a\", \"b\", \"b\", \"c\", \"a\", \"a\", \"b\", \"c\", \"c\", \"c\", \"b\", \"b\", \"c\")\n",
"val vals by column(12, 345, 3, 1, 45, 14, 4, 52, 54, 23, 235, 21, 57, 3, 87)\n",
"\n",
"val df = dataFrameOf(grps, vals)\n",
"```\n",
"\n",
"For each group, find the sum of the three greatest values. You should end up with the answer as follows:\n",
"```\n",
"grps\n",
"a 409\n",
"b 156\n",
"c 345\n",
"```"
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2024-09-18T10:52:23.878779112Z",
"start_time": "2024-09-18T10:52:23.581892967Z"
}
},
"source": [
"val grps by columnOf(\"a\", \"a\", \"a\", \"b\", \"b\", \"c\", \"a\", \"a\", \"b\", \"c\", \"c\", \"c\", \"b\", \"b\", \"c\")\n",
"val vals by columnOf(12, 345, 3, 1, 45, 14, 4, 52, 54, 23, 235, 21, 57, 3, 87)\n",
"\n",
"val df = dataFrameOf(grps, vals)\n",
"df"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | grps | vals |
|---|
| a | 12 |
| a | 345 |
| a | 3 |
| b | 1 |
| b | 45 |
| c | 14 |
| a | 4 |
| a | 52 |
| b | 54 |
| c | 23 |
| c | 235 |
| c | 21 |
| b | 57 |
| b | 3 |
| c | 87 |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"grps\",\"vals\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"}],\"nrow\":15,\"ncol\":2},\"kotlin_dataframe\":[{\"grps\":\"a\",\"vals\":12},{\"grps\":\"a\",\"vals\":345},{\"grps\":\"a\",\"vals\":3},{\"grps\":\"b\",\"vals\":1},{\"grps\":\"b\",\"vals\":45},{\"grps\":\"c\",\"vals\":14},{\"grps\":\"a\",\"vals\":4},{\"grps\":\"a\",\"vals\":52},{\"grps\":\"b\",\"vals\":54},{\"grps\":\"c\",\"vals\":23},{\"grps\":\"c\",\"vals\":235},{\"grps\":\"c\",\"vals\":21},{\"grps\":\"b\",\"vals\":57},{\"grps\":\"b\",\"vals\":3},{\"grps\":\"c\",\"vals\":87}]}"
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 32
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2024-09-18T10:52:24.442610911Z",
"start_time": "2024-09-18T10:52:24.293394383Z"
}
},
"source": [
"df.groupBy { grps }.aggregate { \n",
" vals.sortDesc().take(3).sum() into \"res\"\n",
"}"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"grps\",\"res\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"}],\"nrow\":3,\"ncol\":2},\"kotlin_dataframe\":[{\"grps\":\"a\",\"res\":409},{\"grps\":\"b\",\"res\":156},{\"grps\":\"c\",\"res\":345}]}"
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 33
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**26.** The DataFrame `df` constructed below has two integer columns 'A' and 'B'. The values in 'A' are between 1 and 100 (inclusive).\n",
"\n",
"For each group of 10 consecutive integers in 'A' (i.e. `(0, 10]`, `(10, 20]`, ...), calculate the sum of the corresponding values in column 'B'.\n",
"\n",
"The answer as follows:\n",
"\n",
"```\n",
"A\n",
"(0, 10] 635\n",
"(10, 20] 360\n",
"(20, 30] 315\n",
"(30, 40] 306\n",
"(40, 50] 750\n",
"(50, 60] 284\n",
"(60, 70] 424\n",
"(70, 80] 526\n",
"(80, 90] 835\n",
"(90, 100] 852\n",
"```"
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2024-09-18T10:52:25.048873951Z",
"start_time": "2024-09-18T10:52:24.836670183Z"
}
},
"source": [
"import kotlin.random.Random\n",
"\n",
"val random = Random(42)\n",
"val list = List(200) { random.nextInt(1, 101) }\n",
"val df = dataFrameOf(\"A\", \"B\")(*list.toTypedArray())\n",
"df"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | A | B |
|---|
| 34 | 41 |
| 42 | 3 |
| 42 | 33 |
| 22 | 41 |
| 70 | 88 |
| 53 | 68 |
| 80 | 4 |
| 59 | 59 |
| 45 | 1 |
| 27 | 14 |
| 70 | 8 |
| 11 | 52 |
| 51 | 60 |
| 46 | 43 |
| 17 | 17 |
| 17 | 42 |
| 56 | 29 |
| 58 | 49 |
| 48 | 7 |
| 73 | 52 |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"A\",\"B\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"}],\"nrow\":100,\"ncol\":2},\"kotlin_dataframe\":[{\"A\":34,\"B\":41},{\"A\":42,\"B\":3},{\"A\":42,\"B\":33},{\"A\":22,\"B\":41},{\"A\":70,\"B\":88},{\"A\":53,\"B\":68},{\"A\":80,\"B\":4},{\"A\":59,\"B\":59},{\"A\":45,\"B\":1},{\"A\":27,\"B\":14},{\"A\":70,\"B\":8},{\"A\":11,\"B\":52},{\"A\":51,\"B\":60},{\"A\":46,\"B\":43},{\"A\":17,\"B\":17},{\"A\":17,\"B\":42},{\"A\":56,\"B\":29},{\"A\":58,\"B\":49},{\"A\":48,\"B\":7},{\"A\":73,\"B\":52}]}"
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 34
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2024-09-18T10:52:25.551994210Z",
"start_time": "2024-09-18T10:52:25.364294809Z"
}
},
"source": [
"df.groupBy { A.map { (it - 1) / 10 } }.sum { B }\n",
" .sortBy { A }\n",
" .convert { A }.with { \"(${it * 10}, ${it * 10 + 10}]\" }"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | A | B |
|---|
| (0, 10] | 353 |
| (10, 20] | 873 |
| (20, 30] | 321 |
| (30, 40] | 322 |
| (40, 50] | 432 |
| (50, 60] | 754 |
| (60, 70] | 405 |
| (70, 80] | 561 |
| (80, 90] | 657 |
| (90, 100] | 527 |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"A\",\"B\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"}],\"nrow\":10,\"ncol\":2},\"kotlin_dataframe\":[{\"A\":\"(0, 10]\",\"B\":353},{\"A\":\"(10, 20]\",\"B\":873},{\"A\":\"(20, 30]\",\"B\":321},{\"A\":\"(30, 40]\",\"B\":322},{\"A\":\"(40, 50]\",\"B\":432},{\"A\":\"(50, 60]\",\"B\":754},{\"A\":\"(60, 70]\",\"B\":405},{\"A\":\"(70, 80]\",\"B\":561},{\"A\":\"(80, 90]\",\"B\":657},{\"A\":\"(90, 100]\",\"B\":527}]}"
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 35
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## DataFrames: harder problems\n",
"\n",
"### These might require a bit of thinking outside the box...\n",
"\n",
"Difficulty: hard"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**27.** Consider a DataFrame `df` where there is an integer column 'X':\n",
"```kotlin\n",
"val df = dataFrameOf(\"X\")(7, 2, 0, 3, 4, 2, 5, 0, 3 , 4)\n",
"```\n",
"For each value, count the difference back to the previous zero (or the start of the column, whichever is closer). These values should therefore be\n",
"\n",
"```\n",
"[1, 2, 0, 1, 2, 3, 4, 0, 1, 2]\n",
"```\n",
"\n",
"Make this a new column 'Y'."
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2024-09-18T10:52:26.112747794Z",
"start_time": "2024-09-18T10:52:25.969539618Z"
}
},
"source": [
"val df = dataFrameOf(\"X\")(7, 2, 0, 3, 4, 2, 5, 0, 3, 4)\n",
"df"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"X\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"}],\"nrow\":10,\"ncol\":1},\"kotlin_dataframe\":[{\"X\":7},{\"X\":2},{\"X\":0},{\"X\":3},{\"X\":4},{\"X\":2},{\"X\":5},{\"X\":0},{\"X\":3},{\"X\":4}]}"
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 36
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2024-09-18T10:52:26.582340039Z",
"start_time": "2024-09-18T10:52:26.470043972Z"
}
},
"source": [
"df.mapToColumn(\"Y\") {\n",
" if (it.X == 0) 0 else (prev()?.newValue() ?: 0) + 1\n",
"}"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"Y\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"}],\"nrow\":10,\"ncol\":1},\"kotlin_dataframe\":[{\"Y\":1},{\"Y\":2},{\"Y\":0},{\"Y\":1},{\"Y\":2},{\"Y\":3},{\"Y\":4},{\"Y\":0},{\"Y\":1},{\"Y\":2}]}"
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 37
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**28.** Consider the DataFrame constructed below which contains rows and columns of numerical data.\n",
"\n",
"Create a list of the column-row index locations of the 3 largest values in this DataFrame. In thi case, the answer should be:\n",
"```\n",
"[(0, d), (2, c), (3, f)]\n",
"```"
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2024-09-18T10:52:27.309555464Z",
"start_time": "2024-09-18T10:52:26.981417395Z"
}
},
"source": [
"val names = ('a'..'h').map { it.toString() } // val names = (0..7).map { it.toString() }\n",
"val random = Random(30)\n",
"val list = List(64) { random.nextInt(1, 101) }\n",
"val df = dataFrameOf(names)(*list.toTypedArray())\n",
"df"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | a | b | c | d | e | f | g | h |
|---|
| 43 | 88 | 66 | 100 | 9 | 59 | 74 | 23 |
| 6 | 63 | 43 | 58 | 4 | 85 | 9 | 25 |
| 49 | 59 | 100 | 52 | 28 | 1 | 19 | 81 |
| 92 | 41 | 13 | 57 | 28 | 97 | 63 | 39 |
| 4 | 59 | 72 | 65 | 50 | 35 | 14 | 31 |
| 55 | 74 | 33 | 66 | 17 | 39 | 80 | 38 |
| 18 | 64 | 91 | 39 | 80 | 55 | 65 | 2 |
| 19 | 76 | 75 | 18 | 32 | 97 | 1 | 32 |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"a\",\"b\",\"c\",\"d\",\"e\",\"f\",\"g\",\"h\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"}],\"nrow\":8,\"ncol\":8},\"kotlin_dataframe\":[{\"a\":43,\"b\":88,\"c\":66,\"d\":100,\"e\":9,\"f\":59,\"g\":74,\"h\":23},{\"a\":6,\"b\":63,\"c\":43,\"d\":58,\"e\":4,\"f\":85,\"g\":9,\"h\":25},{\"a\":49,\"b\":59,\"c\":100,\"d\":52,\"e\":28,\"f\":1,\"g\":19,\"h\":81},{\"a\":92,\"b\":41,\"c\":13,\"d\":57,\"e\":28,\"f\":97,\"g\":63,\"h\":39},{\"a\":4,\"b\":59,\"c\":72,\"d\":65,\"e\":50,\"f\":35,\"g\":14,\"h\":31},{\"a\":55,\"b\":74,\"c\":33,\"d\":66,\"e\":17,\"f\":39,\"g\":80,\"h\":38},{\"a\":18,\"b\":64,\"c\":91,\"d\":39,\"e\":80,\"f\":55,\"g\":65,\"h\":2},{\"a\":19,\"b\":76,\"c\":75,\"d\":18,\"e\":32,\"f\":97,\"g\":1,\"h\":32}]}"
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 38
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2024-09-18T10:52:27.907407295Z",
"start_time": "2024-09-18T10:52:27.741141235Z"
}
},
"source": [
"df.add(\"index\") { index() }\n",
" .gather { dropLast() }.into(\"name\", \"vals\")\n",
" .sortByDesc(\"vals\").take(3)[\"index\", \"name\"]"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"index\",\"name\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"}],\"nrow\":3,\"ncol\":2},\"kotlin_dataframe\":[{\"index\":0,\"name\":\"d\"},{\"index\":2,\"name\":\"c\"},{\"index\":3,\"name\":\"f\"}]}"
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 39
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**29.** You are given the DataFrame below with a column of group IDs, 'grps', and a column of corresponding integer values, 'vals'.\n",
"\n",
"```kotlin\n",
"val random = Random(31)\n",
"val lab = listOf(\"A\", \"B\")\n",
"\n",
"val vals by columnOf(List(15) { random.nextInt(-30, 30) })\n",
"val grps by columnOf(List(15) { lab[random.nextInt(0, 2)] })\n",
"\n",
"val df = dataFrameOf(vals, grps)\n",
"```\n",
"\n",
"Create a new column 'patched_values' which contains the same values as the 'vals' any negative values in 'vals' with the group mean:\n",
"\n",
"```\n",
"vals grps patched_vals\n",
" -17 B 21.0\n",
" -7 B 21.0\n",
" 28 B 28.0\n",
" 16 B 16.0\n",
" -21 B 21.0\n",
" 19 B 19.0\n",
" -2 B 21.0\n",
" -19 B 21.0\n",
" 16 A 16.0\n",
" 9 A 9.0\n",
" -14 A 16.0\n",
" -19 A 16.0\n",
" -22 A 16.0\n",
" -1 A 16.0\n",
" 23 A 23.0\n",
"```"
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2024-09-18T10:52:28.554746117Z",
"start_time": "2024-09-18T10:52:28.233611264Z"
}
},
"source": [
"val random = Random(31)\n",
"val lab = listOf(\"A\", \"B\")\n",
"\n",
"val vals by columnOf(*Array(15) { random.nextInt(-30, 30) })\n",
"val grps by columnOf(*Array(15) { lab[random.nextInt(0, 2)] })\n",
"\n",
"val df = dataFrameOf(vals, grps)\n",
"df"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | vals | grps |
|---|
| -17 | B |
| -7 | B |
| 16 | A |
| 28 | B |
| 9 | A |
| 16 | B |
| -21 | B |
| -14 | A |
| -19 | A |
| -22 | A |
| 19 | B |
| -2 | B |
| -1 | A |
| -19 | B |
| 23 | A |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"vals\",\"grps\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"}],\"nrow\":15,\"ncol\":2},\"kotlin_dataframe\":[{\"vals\":-17,\"grps\":\"B\"},{\"vals\":-7,\"grps\":\"B\"},{\"vals\":16,\"grps\":\"A\"},{\"vals\":28,\"grps\":\"B\"},{\"vals\":9,\"grps\":\"A\"},{\"vals\":16,\"grps\":\"B\"},{\"vals\":-21,\"grps\":\"B\"},{\"vals\":-14,\"grps\":\"A\"},{\"vals\":-19,\"grps\":\"A\"},{\"vals\":-22,\"grps\":\"A\"},{\"vals\":19,\"grps\":\"B\"},{\"vals\":-2,\"grps\":\"B\"},{\"vals\":-1,\"grps\":\"A\"},{\"vals\":-19,\"grps\":\"B\"},{\"vals\":23,\"grps\":\"A\"}]}"
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 40
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2024-09-18T10:52:29.312198503Z",
"start_time": "2024-09-18T10:52:28.985947343Z"
}
},
"source": [
"val means = df.filter { vals >= 0 }\n",
" .groupBy { grps }.mean()\n",
" .pivot { grps }.values { vals }\n",
"\n",
"df.add(\"patched_values\") {\n",
" if (vals < 0) means[grps] else vals.toDouble()\n",
"}"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | vals | grps | patched_values |
|---|
| -17 | B | 21.000000 |
| -7 | B | 21.000000 |
| 16 | A | 16.000000 |
| 28 | B | 28.000000 |
| 9 | A | 9.000000 |
| 16 | B | 16.000000 |
| -21 | B | 21.000000 |
| -14 | A | 16.000000 |
| -19 | A | 16.000000 |
| -22 | A | 16.000000 |
| 19 | B | 19.000000 |
| -2 | B | 21.000000 |
| -1 | A | 16.000000 |
| -19 | B | 21.000000 |
| 23 | A | 23.000000 |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"vals\",\"grps\",\"patched_values\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Any\"}],\"nrow\":15,\"ncol\":3},\"kotlin_dataframe\":[{\"vals\":-17,\"grps\":\"B\",\"patched_values\":\"21.0\"},{\"vals\":-7,\"grps\":\"B\",\"patched_values\":\"21.0\"},{\"vals\":16,\"grps\":\"A\",\"patched_values\":\"16.0\"},{\"vals\":28,\"grps\":\"B\",\"patched_values\":\"28.0\"},{\"vals\":9,\"grps\":\"A\",\"patched_values\":\"9.0\"},{\"vals\":16,\"grps\":\"B\",\"patched_values\":\"16.0\"},{\"vals\":-21,\"grps\":\"B\",\"patched_values\":\"21.0\"},{\"vals\":-14,\"grps\":\"A\",\"patched_values\":\"16.0\"},{\"vals\":-19,\"grps\":\"A\",\"patched_values\":\"16.0\"},{\"vals\":-22,\"grps\":\"A\",\"patched_values\":\"16.0\"},{\"vals\":19,\"grps\":\"B\",\"patched_values\":\"19.0\"},{\"vals\":-2,\"grps\":\"B\",\"patched_values\":\"21.0\"},{\"vals\":-1,\"grps\":\"A\",\"patched_values\":\"16.0\"},{\"vals\":-19,\"grps\":\"B\",\"patched_values\":\"21.0\"},{\"vals\":23,\"grps\":\"A\",\"patched_values\":\"23.0\"}]}"
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 41
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**30.** Implement a rolling mean over groups with window size 3, which ignores NaN value. For example consider the following DataFrame:\n",
"```kotlin\n",
"val group by columnOf(\"a\", \"a\", \"b\", \"b\", \"a\", \"b\", \"b\", \"b\", \"a\", \"b\", \"a\", \"b\")\n",
"val value by columnOf(1.0, 2.0, 3.0, Double.NaN, 2.0, 3.0, Double.NaN, 1.0, 7.0, 3.0, Double.NaN, 8.0)\n",
"\n",
"val df = dataFrameOf(group, value)\n",
"df\n",
"\n",
"group value\n",
"a 1.0\n",
"a 2.0\n",
"b 3.0\n",
"b NaN\n",
"a 2.0\n",
"b 3.0\n",
"b NaN\n",
"b 1.0\n",
"a 7.0\n",
"b 3.0\n",
"a NaN\n",
"b 8.0\n",
"```\n",
"The goal is:\n",
"```\n",
"1.000000\n",
"1.500000\n",
"3.000000\n",
"3.000000\n",
"1.666667\n",
"3.000000\n",
"3.000000\n",
"2.000000\n",
"3.666667\n",
"2.000000\n",
"4.500000\n",
"4.000000\n",
"```\n",
"E.g. the first window of size three for group 'b' has values 3.0, NaN and 3.0 and occurs at row index 5. Instead of being NaN the value in the new column at this row index should be 3.0 (just the two non-NaN values are used to compute the mean (3+3)/2)"
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2024-09-18T10:52:29.916179065Z",
"start_time": "2024-09-18T10:52:29.599419695Z"
}
},
"source": [
"val groups by columnOf(\"a\", \"a\", \"b\", \"b\", \"a\", \"b\", \"b\", \"b\", \"a\", \"b\", \"a\", \"b\")\n",
"val value by columnOf(1.0, 2.0, 3.0, Double.NaN, 2.0, 3.0, Double.NaN, 1.0, 7.0, 3.0, Double.NaN, 8.0)\n",
"\n",
"val df = dataFrameOf(groups, value)\n",
"df"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | groups | value |
|---|
| a | 1.000000 |
| a | 2.000000 |
| b | 3.000000 |
| b | NaN |
| a | 2.000000 |
| b | 3.000000 |
| b | NaN |
| b | 1.000000 |
| a | 7.000000 |
| b | 3.000000 |
| a | NaN |
| b | 8.000000 |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"groups\",\"value\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"}],\"nrow\":12,\"ncol\":2},\"kotlin_dataframe\":[{\"groups\":\"a\",\"value\":1.0},{\"groups\":\"a\",\"value\":2.0},{\"groups\":\"b\",\"value\":3.0},{\"groups\":\"b\",\"value\":NaN},{\"groups\":\"a\",\"value\":2.0},{\"groups\":\"b\",\"value\":3.0},{\"groups\":\"b\",\"value\":NaN},{\"groups\":\"b\",\"value\":1.0},{\"groups\":\"a\",\"value\":7.0},{\"groups\":\"b\",\"value\":3.0},{\"groups\":\"a\",\"value\":NaN},{\"groups\":\"b\",\"value\":8.0}]}"
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 42
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2024-09-18T10:52:30.421331767Z",
"start_time": "2024-09-18T10:52:30.204783491Z"
}
},
"source": [
"df.add(\"id\") { index() }\n",
" .groupBy { groups }.add(\"res\") {\n",
" relative(-2..0).value.filter { !it.isNaN() }.mean()\n",
" }.concat()\n",
" .sortBy(\"id\")\n",
" .remove(\"id\")"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | groups | value | res |
|---|
| a | 1.000000 | 1.000000 |
| a | 2.000000 | 1.500000 |
| b | 3.000000 | 3.000000 |
| b | NaN | 3.000000 |
| a | 2.000000 | 1.666667 |
| b | 3.000000 | 3.000000 |
| b | NaN | 3.000000 |
| b | 1.000000 | 2.000000 |
| a | 7.000000 | 3.666667 |
| b | 3.000000 | 2.000000 |
| a | NaN | 4.500000 |
| b | 8.000000 | 4.000000 |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"groups\",\"value\",\"res\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"}],\"nrow\":12,\"ncol\":3},\"kotlin_dataframe\":[{\"groups\":\"a\",\"value\":1.0,\"res\":1.0},{\"groups\":\"a\",\"value\":2.0,\"res\":1.5},{\"groups\":\"b\",\"value\":3.0,\"res\":3.0},{\"groups\":\"b\",\"value\":NaN,\"res\":3.0},{\"groups\":\"a\",\"value\":2.0,\"res\":1.6666666666666667},{\"groups\":\"b\",\"value\":3.0,\"res\":3.0},{\"groups\":\"b\",\"value\":NaN,\"res\":3.0},{\"groups\":\"b\",\"value\":1.0,\"res\":2.0},{\"groups\":\"a\",\"value\":7.0,\"res\":3.6666666666666665},{\"groups\":\"b\",\"value\":3.0,\"res\":2.0},{\"groups\":\"a\",\"value\":NaN,\"res\":4.5},{\"groups\":\"b\",\"value\":8.0,\"res\":4.0}]}"
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 43
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Date\n",
"Difficulty: easy/medium"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**31.** Create a column Of LocalDate that contains each day of 2015 and column of random numbers."
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2024-09-18T10:52:30.956054521Z",
"start_time": "2024-09-18T10:52:30.931369415Z"
}
},
"source": [
"import kotlinx.datetime.*"
],
"outputs": [],
"execution_count": 44
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2024-09-18T10:52:31.891343150Z",
"start_time": "2024-09-18T10:52:31.475342485Z"
}
},
"source": [
"class DateRangeIterator(first: LocalDate, last: LocalDate, val step: Int) : Iterator {\n",
" private val finalElement: LocalDate = last\n",
" private var hasNext: Boolean = if (step > 0) first <= last else first >= last\n",
" private var next: LocalDate = if (hasNext) first else finalElement\n",
"\n",
" override fun hasNext(): Boolean = hasNext\n",
"\n",
" override fun next(): LocalDate {\n",
" val value = next\n",
" if (value == finalElement) {\n",
" if (!hasNext) throw kotlin.NoSuchElementException()\n",
" hasNext = false\n",
" } else {\n",
" next = next.plus(step, DateTimeUnit.DayBased(1))\n",
" }\n",
" return value\n",
" }\n",
"}\n",
"\n",
"operator fun ClosedRange.iterator() = DateRangeIterator(this.start, this.endInclusive, 1)\n",
"\n",
"fun ClosedRange.toList(): List {\n",
" return when (val size = this.start.daysUntil(this.endInclusive)) {\n",
" 0 -> emptyList()\n",
" 1 -> listOf(iterator().next())\n",
" else -> {\n",
" val dest = ArrayList(size)\n",
" for (item in this) {\n",
" dest.add(item)\n",
" }\n",
" dest\n",
" }\n",
" }\n",
"}"
],
"outputs": [],
"execution_count": 45
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2024-09-18T10:52:32.488075544Z",
"start_time": "2024-09-18T10:52:32.142058759Z"
}
},
"source": [
"val start = LocalDate(2015, 1, 1)\n",
"val end = LocalDate(2016, 1, 1)\n",
"\n",
"val days = (start..end).toList()\n",
"\n",
"val dti = days.toColumn(\"dti\")\n",
"val s = List(dti.size()) { Random.nextDouble() }.toColumn(\"s\")\n",
"val df = dataFrameOf(dti, s)\n",
"df.head()"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | dti | s |
|---|
| 2015-01-01 | 0.396569 |
| 2015-01-02 | 0.449204 |
| 2015-01-03 | 0.130729 |
| 2015-01-04 | 0.997274 |
| 2015-01-05 | 0.415122 |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"dti\",\"s\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlinx.datetime.LocalDate\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"}],\"nrow\":5,\"ncol\":2},\"kotlin_dataframe\":[{\"dti\":\"2015-01-01\",\"s\":0.396568897051362},{\"dti\":\"2015-01-02\",\"s\":0.44920443325559345},{\"dti\":\"2015-01-03\",\"s\":0.13072945780673562},{\"dti\":\"2015-01-04\",\"s\":0.9972744993532556},{\"dti\":\"2015-01-05\",\"s\":0.4151224123938473}]}"
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 46
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**32.** Find the sum of the values in s for every Wednesday."
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2024-09-18T10:52:32.741899525Z",
"start_time": "2024-09-18T10:52:32.639067352Z"
}
},
"source": [
"df.filter { dti.dayOfWeek == DayOfWeek.TUESDAY }.sum { s }"
],
"outputs": [
{
"data": {
"text/plain": [
"26.523862856482456"
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 47
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**33.** For each calendar month in s, find the mean of values."
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2024-09-18T10:52:33.171104553Z",
"start_time": "2024-09-18T10:52:33.040389976Z"
}
},
"source": [
"df.groupBy { dti.map { it.month } named \"month\" }.mean()"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | month | s |
|---|
| JANUARY | 0.546164 |
| FEBRUARY | 0.477099 |
| MARCH | 0.512146 |
| APRIL | 0.385120 |
| MAY | 0.516046 |
| JUNE | 0.489589 |
| JULY | 0.551648 |
| AUGUST | 0.534199 |
| SEPTEMBER | 0.501065 |
| OCTOBER | 0.488557 |
| NOVEMBER | 0.450117 |
| DECEMBER | 0.532484 |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"month\",\"s\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"java.time.Month\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"}],\"nrow\":12,\"ncol\":2},\"kotlin_dataframe\":[{\"month\":\"JANUARY\",\"s\":0.5461639596015749},{\"month\":\"FEBRUARY\",\"s\":0.4770993733131084},{\"month\":\"MARCH\",\"s\":0.5121464835105507},{\"month\":\"APRIL\",\"s\":0.38511996646500574},{\"month\":\"MAY\",\"s\":0.5160459873989691},{\"month\":\"JUNE\",\"s\":0.4895893543924396},{\"month\":\"JULY\",\"s\":0.5516481447494934},{\"month\":\"AUGUST\",\"s\":0.5341986038499763},{\"month\":\"SEPTEMBER\",\"s\":0.5010645849219175},{\"month\":\"OCTOBER\",\"s\":0.48855698032237854},{\"month\":\"NOVEMBER\",\"s\":0.4501166270215522},{\"month\":\"DECEMBER\",\"s\":0.5324840945916145}]}"
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 48
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**34.** For each group of four consecutive calendar months in s, find the date on which the highest value occurred."
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2024-09-18T10:52:33.529975669Z",
"start_time": "2024-09-18T10:52:33.383566764Z"
}
},
"source": [
"df.add(\"month4\") {\n",
" when (dti.monthNumber) {\n",
" in 1..4 -> 1\n",
" in 5..8 -> 2\n",
" else -> 3\n",
" }\n",
"}.groupBy(\"month4\").aggregate { maxBy(s) into \"max\" }"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | month4 | max | | |
|---|
| dti | s | month4 |
|---|
| 1 | 2015-01-04 | 0.997274 | 1 |
| 2 | 2015-05-14 | 0.998469 | 2 |
| 3 | 2015-12-06 | 0.994667 | 3 |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"month4\",\"max\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ColumnGroup\"}],\"nrow\":3,\"ncol\":2},\"kotlin_dataframe\":[{\"month4\":1,\"max\":{\"data\":{\"dti\":\"2015-01-04\",\"s\":0.9972744993532556,\"month4\":1},\"metadata\":{\"kind\":\"ColumnGroup\",\"columns\":[\"dti\",\"s\",\"month4\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlinx.datetime.LocalDate\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"}]}}},{\"month4\":2,\"max\":{\"data\":{\"dti\":\"2015-05-14\",\"s\":0.9984692529594049,\"month4\":2},\"metadata\":{\"kind\":\"ColumnGroup\",\"columns\":[\"dti\",\"s\",\"month4\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlinx.datetime.LocalDate\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"}]}}},{\"month4\":3,\"max\":{\"data\":{\"dti\":\"2015-12-06\",\"s\":0.9946668289629869,\"month4\":3},\"metadata\":{\"kind\":\"ColumnGroup\",\"columns\":[\"dti\",\"s\",\"month4\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlinx.datetime.LocalDate\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"}]}}}]}"
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 49
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**35.** Create a column consisting of the third Thursday in each month for the years 2015 and 2016."
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2024-09-18T10:52:33.795279118Z",
"start_time": "2024-09-18T10:52:33.769492393Z"
}
},
"source": [
"import java.time.temporal.WeekFields\n",
"import java.util.*"
],
"outputs": [],
"execution_count": 50
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2024-09-18T10:52:34.286933382Z",
"start_time": "2024-09-18T10:52:34.146796587Z"
}
},
"source": [
"val start = LocalDate(2015, 1, 1)\n",
"val end = LocalDate(2016, 12, 31)\n",
"\n",
"(start..end).toList().toColumn(\"3thu\").filter {\n",
" it.toJavaLocalDate()[WeekFields.of(Locale.ENGLISH).weekOfMonth()] == 3\n",
" && it.dayOfWeek.value == 4\n",
"}"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | 3thu |
|---|
| 2015-01-15 |
| 2015-02-19 |
| 2015-03-19 |
| 2015-04-16 |
| 2015-05-14 |
| 2015-06-18 |
| 2015-07-16 |
| 2015-08-13 |
| 2015-09-17 |
| 2015-10-15 |
| 2015-11-19 |
| 2015-12-17 |
| 2016-01-14 |
| 2016-02-18 |
| 2016-03-17 |
| 2016-04-14 |
| 2016-05-19 |
| 2016-06-16 |
| 2016-07-14 |
| 2016-08-18 |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"3thu\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlinx.datetime.LocalDate\"}],\"nrow\":24,\"ncol\":1},\"kotlin_dataframe\":[{\"3thu\":\"2015-01-15\"},{\"3thu\":\"2015-02-19\"},{\"3thu\":\"2015-03-19\"},{\"3thu\":\"2015-04-16\"},{\"3thu\":\"2015-05-14\"},{\"3thu\":\"2015-06-18\"},{\"3thu\":\"2015-07-16\"},{\"3thu\":\"2015-08-13\"},{\"3thu\":\"2015-09-17\"},{\"3thu\":\"2015-10-15\"},{\"3thu\":\"2015-11-19\"},{\"3thu\":\"2015-12-17\"},{\"3thu\":\"2016-01-14\"},{\"3thu\":\"2016-02-18\"},{\"3thu\":\"2016-03-17\"},{\"3thu\":\"2016-04-14\"},{\"3thu\":\"2016-05-19\"},{\"3thu\":\"2016-06-16\"},{\"3thu\":\"2016-07-14\"},{\"3thu\":\"2016-08-18\"}]}"
},
"execution_count": 51,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 51
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Cleaning Data\n",
"### Making a DataFrame easier to work with\n",
"Difficulty: *easy/medium*\n",
"\n",
"It happens all the time: someone gives you data containing malformed strings, lists and missing data. How do you tidy it up so you can get on with the analysis?\n",
"\n",
"Take this monstrosity as the DataFrame to use in the following puzzles:\n",
"```kotlin\n",
"val fromTo = listOf(\"LoNDon_paris\", \"MAdrid_miLAN\", \"londON_StockhOlm\", \"Budapest_PaRis\", \"Brussels_londOn\").toColumn(\"From_To\")\n",
"val flightNumber = listOf(10045.0, Double.NaN, 10065.0, Double.NaN, 10085.0).toColumn(\"FlightNumber\")\n",
"val recentDelays = listOf(listOf(23, 47), listOf(), listOf(24, 43, 87), listOf(13), listOf(67, 32)).toColumn(\"RecentDelays\")\n",
"val airline = listOf(\"KLM(!)\", \" (12)\", \"(British Airways. )\", \"12. Air France\", \"'Swiss Air'\").toColumn(\"Airline\")\n",
"\n",
"val df = dataFrameOf(fromTo, flightNumber, recentDelays, airline)\n",
"```\n",
"\n",
"It looks like this:\n",
"```\n",
"From_To FlightNumber RecentDelays Airline\n",
"LoNDon_paris 10045.000000 [23, 47] KLM(!)\n",
"MAdrid_miLAN NaN [] {Air France} (12)\n",
"londON_StockhOlm 10065.000000 [24, 43, 87] (British Airways. )\n",
"Budapest_PaRis NaN [13] 12. Air France\n",
"Brussels_londOn 10085.000000 [67, 32] 'Swiss Air'\n",
"```"
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2024-09-18T10:52:34.958980526Z",
"start_time": "2024-09-18T10:52:34.606531125Z"
}
},
"source": [
"val fromTo = listOf(\"LoNDon_paris\", \"MAdrid_miLAN\", \"londON_StockhOlm\", \"Budapest_PaRis\", \"Brussels_londOn\").toColumn(\"From_To\")\n",
"val flightNumber = listOf(10045.0, Double.NaN, 10065.0, Double.NaN, 10085.0).toColumn(\"FlightNumber\")\n",
"val recentDelays = listOf(listOf(23, 47), listOf(), listOf(24, 43, 87), listOf(13), listOf(67, 32)).toColumn(\"RecentDelays\")\n",
"val airline = listOf(\"KLM(!)\", \"{Air France} (12)\", \"(British Airways. )\", \"12. Air France\", \"'Swiss Air'\").toColumn(\"Airline\")\n",
"\n",
"var df = dataFrameOf(fromTo, flightNumber, recentDelays, airline)\n",
"df"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | From_To | FlightNumber | RecentDelays | Airline |
|---|
| LoNDon_paris | 10045.000000 | [23, 47] | KLM(!) |
| MAdrid_miLAN | NaN | [ ] | {Air France} (12) |
| londON_StockhOlm | 10065.000000 | [24, 43, 87] | (British Airways. ) |
| Budapest_PaRis | NaN | [13] | 12. Air France |
| Brussels_londOn | 10085.000000 | [67, 32] | 'Swiss Air' |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"From_To\",\"FlightNumber\",\"RecentDelays\",\"Airline\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.collections.List\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"}],\"nrow\":5,\"ncol\":4},\"kotlin_dataframe\":[{\"From_To\":\"LoNDon_paris\",\"FlightNumber\":10045.0,\"RecentDelays\":[23,47],\"Airline\":\"KLM(!)\"},{\"From_To\":\"MAdrid_miLAN\",\"FlightNumber\":NaN,\"RecentDelays\":[],\"Airline\":\"{Air France} (12)\"},{\"From_To\":\"londON_StockhOlm\",\"FlightNumber\":10065.0,\"RecentDelays\":[24,43,87],\"Airline\":\"(British Airways. )\"},{\"From_To\":\"Budapest_PaRis\",\"FlightNumber\":NaN,\"RecentDelays\":[13],\"Airline\":\"12. Air France\"},{\"From_To\":\"Brussels_londOn\",\"FlightNumber\":10085.0,\"RecentDelays\":[67,32],\"Airline\":\"'Swiss Air'\"}]}"
},
"execution_count": 52,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 52
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**36.** Some values in the FlightNumber column are missing (they are NaN). These numbers are meant to increase by 10 with each row, so 10055 and 10075 need to be put in place. Modify df to fill in these missing numbers and make the column an integer column (instead of a float column)."
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2024-09-18T10:52:35.447484264Z",
"start_time": "2024-09-18T10:52:35.282119765Z"
}
},
"source": [
"df = df.fillNaNs { FlightNumber }\n",
" .with { prev()!!.FlightNumber + (next()!!.FlightNumber - prev()!!.FlightNumber) / 2 }\n",
" .convert { FlightNumber }.toInt()\n",
"df"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | From_To | FlightNumber | RecentDelays | Airline |
|---|
| LoNDon_paris | 10045 | [23, 47] | KLM(!) |
| MAdrid_miLAN | 10055 | [ ] | {Air France} (12) |
| londON_StockhOlm | 10065 | [24, 43, 87] | (British Airways. ) |
| Budapest_PaRis | 10075 | [13] | 12. Air France |
| Brussels_londOn | 10085 | [67, 32] | 'Swiss Air' |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"From_To\",\"FlightNumber\",\"RecentDelays\",\"Airline\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.collections.List\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"}],\"nrow\":5,\"ncol\":4},\"kotlin_dataframe\":[{\"From_To\":\"LoNDon_paris\",\"FlightNumber\":10045,\"RecentDelays\":[23,47],\"Airline\":\"KLM(!)\"},{\"From_To\":\"MAdrid_miLAN\",\"FlightNumber\":10055,\"RecentDelays\":[],\"Airline\":\"{Air France} (12)\"},{\"From_To\":\"londON_StockhOlm\",\"FlightNumber\":10065,\"RecentDelays\":[24,43,87],\"Airline\":\"(British Airways. )\"},{\"From_To\":\"Budapest_PaRis\",\"FlightNumber\":10075,\"RecentDelays\":[13],\"Airline\":\"12. Air France\"},{\"From_To\":\"Brussels_londOn\",\"FlightNumber\":10085,\"RecentDelays\":[67,32],\"Airline\":\"'Swiss Air'\"}]}"
},
"execution_count": 53,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 53
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**37.** The **From_To** column would be better as two separate columns! Split each string on the underscore delimiter **_** to give a new two columns. Assign the correct names 'From' and 'To' to this columns."
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2024-09-18T10:52:36.064043820Z",
"start_time": "2024-09-18T10:52:35.754355770Z"
}
},
"source": [
"var df2 = df.split { From_To }.by(\"_\").into(\"From\", \"To\")\n",
"df2"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | From | To | FlightNumber | RecentDelays | Airline |
|---|
| LoNDon | paris | 10045 | [23, 47] | KLM(!) |
| MAdrid | miLAN | 10055 | [ ] | {Air France} (12) |
| londON | StockhOlm | 10065 | [24, 43, 87] | (British Airways. ) |
| Budapest | PaRis | 10075 | [13] | 12. Air France |
| Brussels | londOn | 10085 | [67, 32] | 'Swiss Air' |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"From\",\"To\",\"FlightNumber\",\"RecentDelays\",\"Airline\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.collections.List\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"}],\"nrow\":5,\"ncol\":5},\"kotlin_dataframe\":[{\"From\":\"LoNDon\",\"To\":\"paris\",\"FlightNumber\":10045,\"RecentDelays\":[23,47],\"Airline\":\"KLM(!)\"},{\"From\":\"MAdrid\",\"To\":\"miLAN\",\"FlightNumber\":10055,\"RecentDelays\":[],\"Airline\":\"{Air France} (12)\"},{\"From\":\"londON\",\"To\":\"StockhOlm\",\"FlightNumber\":10065,\"RecentDelays\":[24,43,87],\"Airline\":\"(British Airways. )\"},{\"From\":\"Budapest\",\"To\":\"PaRis\",\"FlightNumber\":10075,\"RecentDelays\":[13],\"Airline\":\"12. Air France\"},{\"From\":\"Brussels\",\"To\":\"londOn\",\"FlightNumber\":10085,\"RecentDelays\":[67,32],\"Airline\":\"'Swiss Air'\"}]}"
},
"execution_count": 54,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 54
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**38.** Notice how the capitalisation of the city names is all mixed up in this temporary DataFrame 'temp'. Standardise the strings so that only the first letter is uppercase (e.g. \"londON\" should become \"London\".)"
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2024-09-18T10:52:36.907444202Z",
"start_time": "2024-09-18T10:52:36.705260055Z"
}
},
"source": [
"df2 = df2.update { From and To }.with { it.lowercase().replaceFirstChar(Char::uppercase) }\n",
"df2"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | From | To | FlightNumber | RecentDelays | Airline |
|---|
| London | Paris | 10045 | [23, 47] | KLM(!) |
| Madrid | Milan | 10055 | [ ] | {Air France} (12) |
| London | Stockholm | 10065 | [24, 43, 87] | (British Airways. ) |
| Budapest | Paris | 10075 | [13] | 12. Air France |
| Brussels | London | 10085 | [67, 32] | 'Swiss Air' |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"From\",\"To\",\"FlightNumber\",\"RecentDelays\",\"Airline\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.collections.List\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"}],\"nrow\":5,\"ncol\":5},\"kotlin_dataframe\":[{\"From\":\"London\",\"To\":\"Paris\",\"FlightNumber\":10045,\"RecentDelays\":[23,47],\"Airline\":\"KLM(!)\"},{\"From\":\"Madrid\",\"To\":\"Milan\",\"FlightNumber\":10055,\"RecentDelays\":[],\"Airline\":\"{Air France} (12)\"},{\"From\":\"London\",\"To\":\"Stockholm\",\"FlightNumber\":10065,\"RecentDelays\":[24,43,87],\"Airline\":\"(British Airways. )\"},{\"From\":\"Budapest\",\"To\":\"Paris\",\"FlightNumber\":10075,\"RecentDelays\":[13],\"Airline\":\"12. Air France\"},{\"From\":\"Brussels\",\"To\":\"London\",\"FlightNumber\":10085,\"RecentDelays\":[67,32],\"Airline\":\"'Swiss Air'\"}]}"
},
"execution_count": 55,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 55
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**39.** In the **Airline** column, you can see some extra punctuation and symbols have appeared around the airline names. Pull out just the airline name. E.g. `'(British Airways. )'` should become `'British Airways'`."
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2024-09-18T10:52:37.488390391Z",
"start_time": "2024-09-18T10:52:37.360873144Z"
}
},
"source": [
"df2 = df2.update { Airline }.with {\n",
" \"([a-zA-Z\\\\s]+)\".toRegex().find(it)?.value ?: \"\"\n",
"}\n",
"df2"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | From | To | FlightNumber | RecentDelays | Airline |
|---|
| London | Paris | 10045 | [23, 47] | KLM |
| Madrid | Milan | 10055 | [ ] | Air France |
| London | Stockholm | 10065 | [24, 43, 87] | British Airways |
| Budapest | Paris | 10075 | [13] | Air France |
| Brussels | London | 10085 | [67, 32] | Swiss Air |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"From\",\"To\",\"FlightNumber\",\"RecentDelays\",\"Airline\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.collections.List\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"}],\"nrow\":5,\"ncol\":5},\"kotlin_dataframe\":[{\"From\":\"London\",\"To\":\"Paris\",\"FlightNumber\":10045,\"RecentDelays\":[23,47],\"Airline\":\"KLM\"},{\"From\":\"Madrid\",\"To\":\"Milan\",\"FlightNumber\":10055,\"RecentDelays\":[],\"Airline\":\"Air France\"},{\"From\":\"London\",\"To\":\"Stockholm\",\"FlightNumber\":10065,\"RecentDelays\":[24,43,87],\"Airline\":\"British Airways\"},{\"From\":\"Budapest\",\"To\":\"Paris\",\"FlightNumber\":10075,\"RecentDelays\":[13],\"Airline\":\" Air France\"},{\"From\":\"Brussels\",\"To\":\"London\",\"FlightNumber\":10085,\"RecentDelays\":[67,32],\"Airline\":\"Swiss Air\"}]}"
},
"execution_count": 56,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 56
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**40.** In the **RecentDelays** column, the values have been entered into the DataFrame as a list. We would like each first value in its own column, each second value in its own column, and so on. If there isn't an Nth value, the value should be `null`.\n",
"\n",
"Expand the column of lists into columns named 'delays_' and replace the unwanted RecentDelays column in `df` with 'delays'."
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2024-09-18T10:52:38.630557973Z",
"start_time": "2024-09-18T10:52:38.337486955Z"
}
},
"source": [
"val prep_df = df2.split { RecentDelays }.into { \"delay_$it\" }\n",
"prep_df"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | From | To | FlightNumber | delay_1 | delay_2 | delay_3 | Airline |
|---|
| London | Paris | 10045 | 23 | 47 | null | KLM |
| Madrid | Milan | 10055 | null | null | null | Air France |
| London | Stockholm | 10065 | 24 | 43 | 87 | British Airways |
| Budapest | Paris | 10075 | 13 | null | null | Air France |
| Brussels | London | 10085 | 67 | 32 | null | Swiss Air |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"From\",\"To\",\"FlightNumber\",\"delay_1\",\"delay_2\",\"delay_3\",\"Airline\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int?\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int?\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int?\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"}],\"nrow\":5,\"ncol\":7},\"kotlin_dataframe\":[{\"From\":\"London\",\"To\":\"Paris\",\"FlightNumber\":10045,\"delay_1\":23,\"delay_2\":47,\"delay_3\":null,\"Airline\":\"KLM\"},{\"From\":\"Madrid\",\"To\":\"Milan\",\"FlightNumber\":10055,\"delay_1\":null,\"delay_2\":null,\"delay_3\":null,\"Airline\":\"Air France\"},{\"From\":\"London\",\"To\":\"Stockholm\",\"FlightNumber\":10065,\"delay_1\":24,\"delay_2\":43,\"delay_3\":87,\"Airline\":\"British Airways\"},{\"From\":\"Budapest\",\"To\":\"Paris\",\"FlightNumber\":10075,\"delay_1\":13,\"delay_2\":null,\"delay_3\":null,\"Airline\":\" Air France\"},{\"From\":\"Brussels\",\"To\":\"London\",\"FlightNumber\":10085,\"delay_1\":67,\"delay_2\":32,\"delay_3\":null,\"Airline\":\"Swiss Air\"}]}"
},
"execution_count": 57,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 57
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The DataFrame looks much better now!\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Kotlin",
"language": "kotlin",
"name": "kotlin"
},
"language_info": {
"codemirror_mode": "text/x-kotlin",
"file_extension": ".kt",
"mimetype": "text/x-kotlin",
"name": "kotlin",
"nbconvert_exporter": "",
"pygments_lexer": "kotlin",
"version": "1.8.0-dev-707"
}
},
"nbformat": 4,
"nbformat_minor": 1
}