{
"cells": [
{
"metadata": {},
"cell_type": "markdown",
"source": [
"# 40 kotlin-dataframe puzzles\n",
"inspired by [100 pandas puzzles](https://github.com/ajcr/100-pandas-puzzles)"
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"## Importing kotlin-dataframe\n",
"### Getting started\n",
"Difficulty: easy\n",
"\n",
"**1.** Import kotlin-dataframe"
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-05-27T15:30:17.081486Z",
"start_time": "2025-05-27T15:30:10.835561Z"
}
},
"cell_type": "code",
"source": "%use dataframe@kc25",
"outputs": [],
"execution_count": 1
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"## DataFrame Basics\n",
"### A few of the fundamental routines for selecting, sorting, adding and aggregating data in DataFrames\n",
"Difficulty: easy\n",
"\n",
"Consider the following columns:\n",
"```[kotlin]\n",
"val animal by columnOf(\"cat\", \"cat\", \"snake\", \"dog\", \"dog\", \"cat\", \"snake\", \"cat\", \"dog\", \"dog\")\n",
"val age by columnOf(2.5, 3.0, 0.5, Double.NaN, 5.0, 2.0, 4.5, Double.NaN, 7, 3)\n",
"val visits by columnOf(1, 3, 2, 3, 2, 3, 1, 1, 2, 1)\n",
"val priority by columnOf(\"yes\", \"yes\", \"no\", \"yes\", \"no\", \"no\", \"no\", \"yes\", \"no\", \"no\")\n",
"```\n",
"**2.** Create a DataFrame df from this columns."
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-05-27T15:30:19.983006Z",
"start_time": "2025-05-27T15:30:17.102569Z"
}
},
"cell_type": "code",
"source": [
"val animal by columnOf(\"cat\", \"cat\", \"snake\", \"dog\", \"dog\", \"cat\", \"snake\", \"cat\", \"dog\", \"dog\")\n",
"val age by columnOf(2.5, 3.0, 0.5, Double.NaN, 5.0, 2.0, 4.5, Double.NaN, 7.0, 3.0)\n",
"val visits by columnOf(1, 3, 2, 3, 2, 3, 1, 1, 2, 1)\n",
"val priority by columnOf(\"yes\", \"yes\", \"no\", \"yes\", \"no\", \"no\", \"no\", \"yes\", \"no\", \"no\")\n",
"\n",
"val df = dataFrameOf(animal, age, visits, priority)\n",
"df"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
"
\n",
" \n",
" \n",
" \n",
" | animal | age | visits | priority |
|---|
| cat | 2.500000 | 1 | yes |
| cat | 3.000000 | 3 | yes |
| snake | 0.500000 | 2 | no |
| dog | NaN | 3 | yes |
| dog | 5.000000 | 2 | no |
| cat | 2.000000 | 3 | no |
| snake | 4.500000 | 1 | no |
| cat | NaN | 1 | yes |
| dog | 7.000000 | 2 | no |
| dog | 3.000000 | 1 | no |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"animal\",\"age\",\"visits\",\"priority\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"}],\"nrow\":10,\"ncol\":4},\"kotlin_dataframe\":[{\"animal\":\"cat\",\"age\":2.5,\"visits\":1,\"priority\":\"yes\"},{\"animal\":\"cat\",\"age\":3.0,\"visits\":3,\"priority\":\"yes\"},{\"animal\":\"snake\",\"age\":0.5,\"visits\":2,\"priority\":\"no\"},{\"animal\":\"dog\",\"age\":NaN,\"visits\":3,\"priority\":\"yes\"},{\"animal\":\"dog\",\"age\":5.0,\"visits\":2,\"priority\":\"no\"},{\"animal\":\"cat\",\"age\":2.0,\"visits\":3,\"priority\":\"no\"},{\"animal\":\"snake\",\"age\":4.5,\"visits\":1,\"priority\":\"no\"},{\"animal\":\"cat\",\"age\":NaN,\"visits\":1,\"priority\":\"yes\"},{\"animal\":\"dog\",\"age\":7.0,\"visits\":2,\"priority\":\"no\"},{\"animal\":\"dog\",\"age\":3.0,\"visits\":1,\"priority\":\"no\"}]}"
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 2
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**3.** Display a summary of the basic information about this DataFrame and its data."
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-05-27T15:30:20.132727Z",
"start_time": "2025-05-27T15:30:20.006270Z"
}
},
"cell_type": "code",
"source": "df.schema()",
"outputs": [
{
"data": {
"text/plain": [
"animal: String\n",
"age: Double\n",
"visits: Int\n",
"priority: String"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 3
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-05-27T15:30:20.574635Z",
"start_time": "2025-05-27T15:30:20.137260Z"
}
},
"cell_type": "code",
"source": "df.describe()",
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | name | type | count | unique | nulls | top | freq | mean | std | min | p25 | median | p75 | max |
|---|
| animal | String | 10 | 3 | 0 | cat | 4 | null | null | cat | cat | dog | dog | snake |
| age | Double | 10 | 8 | 0 | 3.000000 | 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| visits | Int | 10 | 3 | 0 | 1 | 4 | 1.900000 | 0.875595 | 1 | 1.000000 | 2.000000 | 3.000000 | 3 |
| priority | String | 10 | 2 | 0 | no | 6 | null | null | no | no | no | yes | yes |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"name\",\"type\",\"count\",\"unique\",\"nulls\",\"top\",\"freq\",\"mean\",\"std\",\"min\",\"p25\",\"median\",\"p75\",\"max\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Comparable<*>\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double?\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double?\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Comparable<*>\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Comparable<*>\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Comparable<*>\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Comparable<*>\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Comparable<*>\"}],\"nrow\":4,\"ncol\":14},\"kotlin_dataframe\":[{\"name\":\"animal\",\"type\":\"String\",\"count\":10,\"unique\":3,\"nulls\":0,\"top\":\"cat\",\"freq\":4,\"mean\":null,\"std\":null,\"min\":\"cat\",\"p25\":\"cat\",\"median\":\"dog\",\"p75\":\"dog\",\"max\":\"snake\"},{\"name\":\"age\",\"type\":\"Double\",\"count\":10,\"unique\":8,\"nulls\":0,\"top\":\"3.0\",\"freq\":2,\"mean\":NaN,\"std\":NaN,\"min\":\"NaN\",\"p25\":\"NaN\",\"median\":\"NaN\",\"p75\":\"NaN\",\"max\":\"NaN\"},{\"name\":\"visits\",\"type\":\"Int\",\"count\":10,\"unique\":3,\"nulls\":0,\"top\":\"1\",\"freq\":4,\"mean\":1.9,\"std\":0.8755950357709131,\"min\":\"1\",\"p25\":\"1.0\",\"median\":\"2.0\",\"p75\":\"3.0\",\"max\":\"3\"},{\"name\":\"priority\",\"type\":\"String\",\"count\":10,\"unique\":2,\"nulls\":0,\"top\":\"no\",\"freq\":6,\"mean\":null,\"std\":null,\"min\":\"no\",\"p25\":\"no\",\"median\":\"no\",\"p75\":\"yes\",\"max\":\"yes\"}]}"
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 4
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**4.** Return the first 3 rows of the DataFrame df."
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-05-27T15:30:20.882965Z",
"start_time": "2025-05-27T15:30:20.580482Z"
}
},
"cell_type": "code",
"source": [
"df[0 ..< 3] // df[0..2]\n",
"\n",
"// or equivalently\n",
"\n",
"df.head(3)\n",
"\n",
"// or\n",
"\n",
"df.take(3)"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | animal | age | visits | priority |
|---|
| cat | 2.500000 | 1 | yes |
| cat | 3.000000 | 3 | yes |
| snake | 0.500000 | 2 | no |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"animal\",\"age\",\"visits\",\"priority\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"}],\"nrow\":3,\"ncol\":4},\"kotlin_dataframe\":[{\"animal\":\"cat\",\"age\":2.5,\"visits\":1,\"priority\":\"yes\"},{\"animal\":\"cat\",\"age\":3.0,\"visits\":3,\"priority\":\"yes\"},{\"animal\":\"snake\",\"age\":0.5,\"visits\":2,\"priority\":\"no\"}]}"
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 5
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**5.** Select \"animal\" and \"age\" columns from the DataFrame df."
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-05-27T15:30:21.059871Z",
"start_time": "2025-05-27T15:30:20.889361Z"
}
},
"cell_type": "code",
"source": "df[animal, age]",
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | animal | age |
|---|
| cat | 2.500000 |
| cat | 3.000000 |
| snake | 0.500000 |
| dog | NaN |
| dog | 5.000000 |
| cat | 2.000000 |
| snake | 4.500000 |
| cat | NaN |
| dog | 7.000000 |
| dog | 3.000000 |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"animal\",\"age\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"}],\"nrow\":10,\"ncol\":2},\"kotlin_dataframe\":[{\"animal\":\"cat\",\"age\":2.5},{\"animal\":\"cat\",\"age\":3.0},{\"animal\":\"snake\",\"age\":0.5},{\"animal\":\"dog\",\"age\":NaN},{\"animal\":\"dog\",\"age\":5.0},{\"animal\":\"cat\",\"age\":2.0},{\"animal\":\"snake\",\"age\":4.5},{\"animal\":\"cat\",\"age\":NaN},{\"animal\":\"dog\",\"age\":7.0},{\"animal\":\"dog\",\"age\":3.0}]}"
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 6
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**6.** Select the data in rows [3, 4, 8] and in columns [\"animal\", \"age\"]."
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-05-27T15:30:21.265897Z",
"start_time": "2025-05-27T15:30:21.064874Z"
}
},
"cell_type": "code",
"source": "df[3, 4, 8][animal, age]",
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | animal | age |
|---|
| dog | NaN |
| dog | 5.000000 |
| dog | 7.000000 |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"animal\",\"age\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"}],\"nrow\":3,\"ncol\":2},\"kotlin_dataframe\":[{\"animal\":\"dog\",\"age\":NaN},{\"animal\":\"dog\",\"age\":5.0},{\"animal\":\"dog\",\"age\":7.0}]}"
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 7
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**7.** Select only the rows where the number of visits is greater than 2."
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-05-27T15:30:21.584916Z",
"start_time": "2025-05-27T15:30:21.272107Z"
}
},
"cell_type": "code",
"source": "df.filter { visits > 2 }",
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | animal | age | visits | priority |
|---|
| cat | 3.000000 | 3 | yes |
| dog | NaN | 3 | yes |
| cat | 2.000000 | 3 | no |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"animal\",\"age\",\"visits\",\"priority\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"}],\"nrow\":3,\"ncol\":4},\"kotlin_dataframe\":[{\"animal\":\"cat\",\"age\":3.0,\"visits\":3,\"priority\":\"yes\"},{\"animal\":\"dog\",\"age\":NaN,\"visits\":3,\"priority\":\"yes\"},{\"animal\":\"cat\",\"age\":2.0,\"visits\":3,\"priority\":\"no\"}]}"
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 8
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**8.** Select the rows where the age is missing, i.e. it is NaN."
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-05-27T15:30:21.759770Z",
"start_time": "2025-05-27T15:30:21.590204Z"
}
},
"cell_type": "code",
"source": "df.filter { age.isNaN() }",
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | animal | age | visits | priority |
|---|
| dog | NaN | 3 | yes |
| cat | NaN | 1 | yes |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"animal\",\"age\",\"visits\",\"priority\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"}],\"nrow\":2,\"ncol\":4},\"kotlin_dataframe\":[{\"animal\":\"dog\",\"age\":NaN,\"visits\":3,\"priority\":\"yes\"},{\"animal\":\"cat\",\"age\":NaN,\"visits\":1,\"priority\":\"yes\"}]}"
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 9
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**9.** Select the rows where the animal is a cat and the age is less than 3."
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-05-27T15:30:21.986808Z",
"start_time": "2025-05-27T15:30:21.763451Z"
}
},
"cell_type": "code",
"source": "df.filter { animal == \"cat\" && age < 3 }",
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | animal | age | visits | priority |
|---|
| cat | 2.500000 | 1 | yes |
| cat | 2.000000 | 3 | no |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"animal\",\"age\",\"visits\",\"priority\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"}],\"nrow\":2,\"ncol\":4},\"kotlin_dataframe\":[{\"animal\":\"cat\",\"age\":2.5,\"visits\":1,\"priority\":\"yes\"},{\"animal\":\"cat\",\"age\":2.0,\"visits\":3,\"priority\":\"no\"}]}"
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 10
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**10.** Select the rows where age is between 2 and 4 (inclusive)."
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-05-27T15:30:22.229607Z",
"start_time": "2025-05-27T15:30:21.992245Z"
}
},
"cell_type": "code",
"source": "df.filter { age in 2.0..4.0 }",
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | animal | age | visits | priority |
|---|
| cat | 2.500000 | 1 | yes |
| cat | 3.000000 | 3 | yes |
| cat | 2.000000 | 3 | no |
| dog | 3.000000 | 1 | no |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"animal\",\"age\",\"visits\",\"priority\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"}],\"nrow\":4,\"ncol\":4},\"kotlin_dataframe\":[{\"animal\":\"cat\",\"age\":2.5,\"visits\":1,\"priority\":\"yes\"},{\"animal\":\"cat\",\"age\":3.0,\"visits\":3,\"priority\":\"yes\"},{\"animal\":\"cat\",\"age\":2.0,\"visits\":3,\"priority\":\"no\"},{\"animal\":\"dog\",\"age\":3.0,\"visits\":1,\"priority\":\"no\"}]}"
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 11
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**11.** Change the age in row 5 to 1.5"
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-05-27T15:30:22.608884Z",
"start_time": "2025-05-27T15:30:22.234356Z"
}
},
"cell_type": "code",
"source": "df.update { age }.at(5).with { 1.5 }",
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | animal | age | visits | priority |
|---|
| cat | 2.500000 | 1 | yes |
| cat | 3.000000 | 3 | yes |
| snake | 0.500000 | 2 | no |
| dog | NaN | 3 | yes |
| dog | 5.000000 | 2 | no |
| cat | 1.500000 | 3 | no |
| snake | 4.500000 | 1 | no |
| cat | NaN | 1 | yes |
| dog | 7.000000 | 2 | no |
| dog | 3.000000 | 1 | no |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"animal\",\"age\",\"visits\",\"priority\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"}],\"nrow\":10,\"ncol\":4},\"kotlin_dataframe\":[{\"animal\":\"cat\",\"age\":2.5,\"visits\":1,\"priority\":\"yes\"},{\"animal\":\"cat\",\"age\":3.0,\"visits\":3,\"priority\":\"yes\"},{\"animal\":\"snake\",\"age\":0.5,\"visits\":2,\"priority\":\"no\"},{\"animal\":\"dog\",\"age\":NaN,\"visits\":3,\"priority\":\"yes\"},{\"animal\":\"dog\",\"age\":5.0,\"visits\":2,\"priority\":\"no\"},{\"animal\":\"cat\",\"age\":1.5,\"visits\":3,\"priority\":\"no\"},{\"animal\":\"snake\",\"age\":4.5,\"visits\":1,\"priority\":\"no\"},{\"animal\":\"cat\",\"age\":NaN,\"visits\":1,\"priority\":\"yes\"},{\"animal\":\"dog\",\"age\":7.0,\"visits\":2,\"priority\":\"no\"},{\"animal\":\"dog\",\"age\":3.0,\"visits\":1,\"priority\":\"no\"}]}"
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 12
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**12.** Calculate the sum of all visits in df (i.e. the total number of visits)."
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-05-27T15:30:22.756365Z",
"start_time": "2025-05-27T15:30:22.614060Z"
}
},
"cell_type": "code",
"source": "df.visits.sum()",
"outputs": [
{
"data": {
"text/plain": [
"19"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 13
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**13.** Calculate the mean age for each different animal in df."
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-05-27T15:30:23.030422Z",
"start_time": "2025-05-27T15:30:22.759863Z"
}
},
"cell_type": "code",
"source": "df.groupBy { animal }.mean { age }",
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | animal | age |
|---|
| cat | NaN |
| snake | 2.500000 |
| dog | NaN |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"animal\",\"age\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"}],\"nrow\":3,\"ncol\":2},\"kotlin_dataframe\":[{\"animal\":\"cat\",\"age\":NaN},{\"animal\":\"snake\",\"age\":2.5},{\"animal\":\"dog\",\"age\":NaN}]}"
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 14
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**14.** Append a new row to df with your choice of values for each column. Then delete that row to return the original DataFrame."
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-05-27T15:30:23.175586Z",
"start_time": "2025-05-27T15:30:23.033965Z"
}
},
"cell_type": "code",
"source": [
"val modifiedDf = df.append(\"dog\", 5.5, 2, \"no\")\n",
"modifiedDf.dropLast()"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | animal | age | visits | priority |
|---|
| cat | 2.500000 | 1 | yes |
| cat | 3.000000 | 3 | yes |
| snake | 0.500000 | 2 | no |
| dog | NaN | 3 | yes |
| dog | 5.000000 | 2 | no |
| cat | 2.000000 | 3 | no |
| snake | 4.500000 | 1 | no |
| cat | NaN | 1 | yes |
| dog | 7.000000 | 2 | no |
| dog | 3.000000 | 1 | no |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"animal\",\"age\",\"visits\",\"priority\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"}],\"nrow\":10,\"ncol\":4},\"kotlin_dataframe\":[{\"animal\":\"cat\",\"age\":2.5,\"visits\":1,\"priority\":\"yes\"},{\"animal\":\"cat\",\"age\":3.0,\"visits\":3,\"priority\":\"yes\"},{\"animal\":\"snake\",\"age\":0.5,\"visits\":2,\"priority\":\"no\"},{\"animal\":\"dog\",\"age\":NaN,\"visits\":3,\"priority\":\"yes\"},{\"animal\":\"dog\",\"age\":5.0,\"visits\":2,\"priority\":\"no\"},{\"animal\":\"cat\",\"age\":2.0,\"visits\":3,\"priority\":\"no\"},{\"animal\":\"snake\",\"age\":4.5,\"visits\":1,\"priority\":\"no\"},{\"animal\":\"cat\",\"age\":NaN,\"visits\":1,\"priority\":\"yes\"},{\"animal\":\"dog\",\"age\":7.0,\"visits\":2,\"priority\":\"no\"},{\"animal\":\"dog\",\"age\":3.0,\"visits\":1,\"priority\":\"no\"}]}"
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 15
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**15.** Count the number of each type of animal in df."
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-05-27T15:30:23.347199Z",
"start_time": "2025-05-27T15:30:23.178343Z"
}
},
"cell_type": "code",
"source": "df.groupBy { animal }.count()",
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | animal | count |
|---|
| cat | 4 |
| snake | 2 |
| dog | 4 |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"animal\",\"count\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"}],\"nrow\":3,\"ncol\":2},\"kotlin_dataframe\":[{\"animal\":\"cat\",\"count\":4},{\"animal\":\"snake\",\"count\":2},{\"animal\":\"dog\",\"count\":4}]}"
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 16
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**16.** Sort df first by the values in the 'age' in descending order, then by the value in the 'visits' column in ascending order."
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-05-27T15:30:23.567352Z",
"start_time": "2025-05-27T15:30:23.351583Z"
}
},
"cell_type": "code",
"source": "df.sortBy { age.desc() and visits }",
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | animal | age | visits | priority |
|---|
| cat | NaN | 1 | yes |
| dog | NaN | 3 | yes |
| dog | 7.000000 | 2 | no |
| dog | 5.000000 | 2 | no |
| snake | 4.500000 | 1 | no |
| dog | 3.000000 | 1 | no |
| cat | 3.000000 | 3 | yes |
| cat | 2.500000 | 1 | yes |
| cat | 2.000000 | 3 | no |
| snake | 0.500000 | 2 | no |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"animal\",\"age\",\"visits\",\"priority\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"}],\"nrow\":10,\"ncol\":4},\"kotlin_dataframe\":[{\"animal\":\"cat\",\"age\":NaN,\"visits\":1,\"priority\":\"yes\"},{\"animal\":\"dog\",\"age\":NaN,\"visits\":3,\"priority\":\"yes\"},{\"animal\":\"dog\",\"age\":7.0,\"visits\":2,\"priority\":\"no\"},{\"animal\":\"dog\",\"age\":5.0,\"visits\":2,\"priority\":\"no\"},{\"animal\":\"snake\",\"age\":4.5,\"visits\":1,\"priority\":\"no\"},{\"animal\":\"dog\",\"age\":3.0,\"visits\":1,\"priority\":\"no\"},{\"animal\":\"cat\",\"age\":3.0,\"visits\":3,\"priority\":\"yes\"},{\"animal\":\"cat\",\"age\":2.5,\"visits\":1,\"priority\":\"yes\"},{\"animal\":\"cat\",\"age\":2.0,\"visits\":3,\"priority\":\"no\"},{\"animal\":\"snake\",\"age\":0.5,\"visits\":2,\"priority\":\"no\"}]}"
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 17
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**17.** The 'priority' column contains the values 'yes' and 'no'. Replace this column with a column of boolean values: 'yes' should be True and 'no' should be False."
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-05-27T15:30:23.774099Z",
"start_time": "2025-05-27T15:30:23.571054Z"
}
},
"cell_type": "code",
"source": "df.convert { priority }.with { it == \"yes\" }",
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | animal | age | visits | priority |
|---|
| cat | 2.500000 | 1 | true |
| cat | 3.000000 | 3 | true |
| snake | 0.500000 | 2 | false |
| dog | NaN | 3 | true |
| dog | 5.000000 | 2 | false |
| cat | 2.000000 | 3 | false |
| snake | 4.500000 | 1 | false |
| cat | NaN | 1 | true |
| dog | 7.000000 | 2 | false |
| dog | 3.000000 | 1 | false |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"animal\",\"age\",\"visits\",\"priority\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Boolean\"}],\"nrow\":10,\"ncol\":4},\"kotlin_dataframe\":[{\"animal\":\"cat\",\"age\":2.5,\"visits\":1,\"priority\":true},{\"animal\":\"cat\",\"age\":3.0,\"visits\":3,\"priority\":true},{\"animal\":\"snake\",\"age\":0.5,\"visits\":2,\"priority\":false},{\"animal\":\"dog\",\"age\":NaN,\"visits\":3,\"priority\":true},{\"animal\":\"dog\",\"age\":5.0,\"visits\":2,\"priority\":false},{\"animal\":\"cat\",\"age\":2.0,\"visits\":3,\"priority\":false},{\"animal\":\"snake\",\"age\":4.5,\"visits\":1,\"priority\":false},{\"animal\":\"cat\",\"age\":NaN,\"visits\":1,\"priority\":true},{\"animal\":\"dog\",\"age\":7.0,\"visits\":2,\"priority\":false},{\"animal\":\"dog\",\"age\":3.0,\"visits\":1,\"priority\":false}]}"
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 18
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**18.** In the 'animal' column, change the 'dog' entries to 'corgi'."
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-05-27T15:30:23.969806Z",
"start_time": "2025-05-27T15:30:23.776709Z"
}
},
"cell_type": "code",
"source": "df.update { animal }.where { it == \"dog\" }.with { \"corgi\" }",
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | animal | age | visits | priority |
|---|
| cat | 2.500000 | 1 | yes |
| cat | 3.000000 | 3 | yes |
| snake | 0.500000 | 2 | no |
| corgi | NaN | 3 | yes |
| corgi | 5.000000 | 2 | no |
| cat | 2.000000 | 3 | no |
| snake | 4.500000 | 1 | no |
| cat | NaN | 1 | yes |
| corgi | 7.000000 | 2 | no |
| corgi | 3.000000 | 1 | no |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"animal\",\"age\",\"visits\",\"priority\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"}],\"nrow\":10,\"ncol\":4},\"kotlin_dataframe\":[{\"animal\":\"cat\",\"age\":2.5,\"visits\":1,\"priority\":\"yes\"},{\"animal\":\"cat\",\"age\":3.0,\"visits\":3,\"priority\":\"yes\"},{\"animal\":\"snake\",\"age\":0.5,\"visits\":2,\"priority\":\"no\"},{\"animal\":\"corgi\",\"age\":NaN,\"visits\":3,\"priority\":\"yes\"},{\"animal\":\"corgi\",\"age\":5.0,\"visits\":2,\"priority\":\"no\"},{\"animal\":\"cat\",\"age\":2.0,\"visits\":3,\"priority\":\"no\"},{\"animal\":\"snake\",\"age\":4.5,\"visits\":1,\"priority\":\"no\"},{\"animal\":\"cat\",\"age\":NaN,\"visits\":1,\"priority\":\"yes\"},{\"animal\":\"corgi\",\"age\":7.0,\"visits\":2,\"priority\":\"no\"},{\"animal\":\"corgi\",\"age\":3.0,\"visits\":1,\"priority\":\"no\"}]}"
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 19
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"**19.** For each animal type and each number of visits, find the mean age.\n",
"\n",
"In other words, each row should be an animal, there should be a column for each of the number of visits and the values should be the mean ages."
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-05-27T15:30:24.262755Z",
"start_time": "2025-05-27T15:30:23.974180Z"
}
},
"cell_type": "code",
"source": "df.pivot { visits }.groupBy { animal }.mean(skipNaN = true) { age }",
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | animal | visits | | |
|---|
| 1 | 3 | 2 |
|---|
| cat | 2.500000 | 2.500000 | null |
| snake | 4.500000 | null | 0.500000 |
| dog | 3.000000 | NaN | 6.000000 |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"animal\",\"visits\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ColumnGroup\"}],\"nrow\":3,\"ncol\":2},\"kotlin_dataframe\":[{\"animal\":\"cat\",\"visits\":{\"data\":{\"1\":2.5,\"3\":2.5,\"2\":null},\"metadata\":{\"kind\":\"ColumnGroup\",\"columns\":[\"1\",\"3\",\"2\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double?\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double?\"}]}}},{\"animal\":\"snake\",\"visits\":{\"data\":{\"1\":4.5,\"3\":null,\"2\":0.5},\"metadata\":{\"kind\":\"ColumnGroup\",\"columns\":[\"1\",\"3\",\"2\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double?\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double?\"}]}}},{\"animal\":\"dog\",\"visits\":{\"data\":{\"1\":3.0,\"3\":NaN,\"2\":6.0},\"metadata\":{\"kind\":\"ColumnGroup\",\"columns\":[\"1\",\"3\",\"2\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double?\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double?\"}]}}}]}"
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 20
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"## DataFrame: beyond the basics\n",
"### Slightly trickier: you may need to combine two or more methods to get the right answer\n",
"Difficulty: medium\n",
"\n",
"The previous section was tour through some basic but essential DataFrame operations.\n",
"Below are some ways that you might need to cut your data, but for which there is no single \"out-of-the-box\" method."
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"**20.** You have a DataFrame df with a column 'A' of integers. For example:\n",
"```kotlin\n",
"val df = dataFrameOf(\"A\")(1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7)\n",
"```\n",
"How do you filter out rows which contain the same integer as the row immediately above?\n",
"\n",
"You should be left with a column containing the following values:\n",
"```\n",
"1, 2, 3, 4, 5, 6, 7\n",
"```"
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-05-27T15:30:24.606456Z",
"start_time": "2025-05-27T15:30:24.267278Z"
}
},
"cell_type": "code",
"source": [
"val df = dataFrameOf(\"A\")(1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7)\n",
"df"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"A\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"}],\"nrow\":11,\"ncol\":1},\"kotlin_dataframe\":[{\"A\":1},{\"A\":2},{\"A\":2},{\"A\":3},{\"A\":4},{\"A\":5},{\"A\":5},{\"A\":5},{\"A\":6},{\"A\":7},{\"A\":7}]}"
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 21
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-05-27T15:30:24.775391Z",
"start_time": "2025-05-27T15:30:24.610229Z"
}
},
"cell_type": "code",
"source": "df.filter { prev()?.A != A }",
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"A\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"}],\"nrow\":7,\"ncol\":1},\"kotlin_dataframe\":[{\"A\":1},{\"A\":2},{\"A\":3},{\"A\":4},{\"A\":5},{\"A\":6},{\"A\":7}]}"
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 22
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-05-27T15:30:24.950268Z",
"start_time": "2025-05-27T15:30:24.779040Z"
}
},
"cell_type": "code",
"source": "df.filter { diffOrNull { A } != 0 }",
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"A\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"}],\"nrow\":7,\"ncol\":1},\"kotlin_dataframe\":[{\"A\":1},{\"A\":2},{\"A\":3},{\"A\":4},{\"A\":5},{\"A\":6},{\"A\":7}]}"
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 23
},
{
"metadata": {},
"cell_type": "markdown",
"source": "We could use `distinct()` here but it won't work as desired if A is [1, 1, 2, 2, 1, 1] for example."
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-05-27T15:30:25.077735Z",
"start_time": "2025-05-27T15:30:24.955302Z"
}
},
"cell_type": "code",
"source": "df.distinct()",
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"A\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"}],\"nrow\":7,\"ncol\":1},\"kotlin_dataframe\":[{\"A\":1},{\"A\":2},{\"A\":3},{\"A\":4},{\"A\":5},{\"A\":6},{\"A\":7}]}"
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 24
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"**21.** Given a DataFrame of random numeric values:\n",
"```kotlin\n",
"val df = dataFrameOf(\"a\", \"b\", \"c\").randomDouble(5) // this is a 5x3 DataFrame of double values\n",
"```\n",
"\n",
"how do you subtract the row mean from each element in the row?"
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-05-27T15:30:25.536312Z",
"start_time": "2025-05-27T15:30:25.081475Z"
}
},
"cell_type": "code",
"source": [
"val df = dataFrameOf(\"a\", \"b\", \"c\").randomDouble(5)\n",
"df"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | a | b | c |
|---|
| 0.402014 | 0.523536 | 0.961745 |
| 0.829755 | 0.209906 | 0.653028 |
| 0.554656 | 0.840338 | 0.608483 |
| 0.117750 | 0.923225 | 0.309282 |
| 0.449417 | 0.681448 | 0.647284 |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"a\",\"b\",\"c\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"}],\"nrow\":5,\"ncol\":3},\"kotlin_dataframe\":[{\"a\":0.4020143497479448,\"b\":0.5235357385266791,\"c\":0.9617451689850356},{\"a\":0.8297554032907936,\"b\":0.2099064473439166,\"c\":0.6530275274758358},{\"a\":0.554655750996573,\"b\":0.840338244339218,\"c\":0.6084827094033441},{\"a\":0.11775037322750681,\"b\":0.9232249599911567,\"c\":0.3092823696479352},{\"a\":0.44941681806776124,\"b\":0.6814483943758095,\"c\":0.6472835537872337}]}"
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 25
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-05-27T15:30:25.898186Z",
"start_time": "2025-05-27T15:30:25.541176Z"
}
},
"cell_type": "code",
"source": [
"df.update { colsOf() }\n",
" .with { it - rowMean() }"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | a | b | c |
|---|
| -0.227084 | -0.105563 | 0.332647 |
| 0.265526 | -0.354323 | 0.088798 |
| -0.113170 | 0.172513 | -0.059343 |
| -0.332336 | 0.473139 | -0.140804 |
| -0.143299 | 0.088732 | 0.054567 |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"a\",\"b\",\"c\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"}],\"nrow\":5,\"ncol\":3},\"kotlin_dataframe\":[{\"a\":-0.22708406933860836,\"b\":-0.10556268055987406,\"c\":0.3326467498984824},{\"a\":0.2655256105872782,\"b\":-0.35432334535959875,\"c\":0.08879773477232045},{\"a\":-0.11316981724980535,\"b\":0.1725126760928396,\"c\":-0.05934285884303425},{\"a\":-0.3323355277280261,\"b\":0.4731390590356238,\"c\":-0.1408035313075977},{\"a\":-0.14329943734250694,\"b\":0.08873213896554133,\"c\":0.0545672983769655}]}"
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 26
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"**22.** Suppose you have a DataFrame with 10 columns of real numbers, for example:\n",
"```kotlin\n",
"val names = ('a'..'j').map { it.toString() }\n",
"val df = dataFrameOf(names).randomDouble(5)\n",
"```\n",
"\n",
"Which column of numbers has the smallest sum? Return that column's label."
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-05-27T15:30:26.872554Z",
"start_time": "2025-05-27T15:30:25.902235Z"
}
},
"cell_type": "code",
"source": [
"val names = ('a'..'j').map { it.toString() }\n",
"val df = dataFrameOf(names).randomDouble(5)\n",
"df"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | a | b | c | d | e | f | g | h | i | j |
|---|
| 0.858917 | 0.218979 | 0.034557 | 0.738621 | 0.179618 | 0.633212 | 0.878549 | 0.029619 | 0.610139 | 0.047784 |
| 0.201999 | 0.152280 | 0.712317 | 0.107476 | 0.739056 | 0.528339 | 0.574178 | 0.531298 | 0.381376 | 0.363998 |
| 0.328992 | 0.981267 | 0.663216 | 0.410589 | 0.665165 | 0.292244 | 0.570291 | 0.501647 | 0.822739 | 0.333988 |
| 0.626584 | 0.899953 | 0.921292 | 0.504896 | 0.053930 | 0.667949 | 0.884781 | 0.918130 | 0.652596 | 0.053050 |
| 0.366349 | 0.104409 | 0.765868 | 0.436629 | 0.282638 | 0.378790 | 0.586033 | 0.608915 | 0.597396 | 0.145855 |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"a\",\"b\",\"c\",\"d\",\"e\",\"f\",\"g\",\"h\",\"i\",\"j\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"}],\"nrow\":5,\"ncol\":10},\"kotlin_dataframe\":[{\"a\":0.8589170822611669,\"b\":0.21897931088234834,\"c\":0.034557377186825056,\"d\":0.7386207843010325,\"e\":0.17961773479447896,\"f\":0.6332121021221402,\"g\":0.8785488181741078,\"h\":0.029619242234364407,\"i\":0.610139123615794,\"j\":0.04778397629111675},{\"a\":0.20199924201162456,\"b\":0.15228048794140292,\"c\":0.7123173164329571,\"d\":0.10747585812880023,\"e\":0.739056031535428,\"f\":0.5283393728344726,\"g\":0.5741780981169422,\"h\":0.5312984747745336,\"i\":0.38137580024457507,\"j\":0.3639978021317887},{\"a\":0.3289924703129842,\"b\":0.9812666588518684,\"c\":0.6632164235521054,\"d\":0.41058934948716486,\"e\":0.6651649498688819,\"f\":0.29224426349817056,\"g\":0.5702906320797264,\"h\":0.5016469503750937,\"i\":0.8227386503855655,\"j\":0.3339878226406737},{\"a\":0.6265836735959608,\"b\":0.8999534457071927,\"c\":0.9212919001805675,\"d\":0.5048959673204629,\"e\":0.05392971810146818,\"f\":0.6679494229021548,\"g\":0.8847811975034712,\"h\":0.9181301607122112,\"i\":0.6525964870925838,\"j\":0.053049734609139754},{\"a\":0.36634851355440723,\"b\":0.1044085008382113,\"c\":0.7658679056448057,\"d\":0.4366294933165755,\"e\":0.28263768398058176,\"f\":0.3787900688859698,\"g\":0.5860331514703689,\"h\":0.6089146474803173,\"i\":0.5973958951583465,\"j\":0.14585500694861353}]}"
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 27
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-05-27T15:30:27.080743Z",
"start_time": "2025-05-27T15:30:26.879745Z"
}
},
"cell_type": "code",
"source": "df.sum().transpose().minBy(\"value\")[\"name\"]",
"outputs": [
{
"data": {
"text/plain": [
"j"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 28
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**23.** How do you count how many unique rows a DataFrame has (i.e. ignore all rows that are duplicates)?"
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-05-27T15:30:27.502006Z",
"start_time": "2025-05-27T15:30:27.084006Z"
}
},
"cell_type": "code",
"source": [
"val df = dataFrameOf(\"a\", \"b\", \"c\").randomInt(30, 0..2)\n",
"df.distinct().count()"
],
"outputs": [
{
"data": {
"text/plain": [
"16"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 29
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"**24.** In the cell below, you have a DataFrame `df` that consists of 10 columns of floating-point numbers. Exactly 5 entries in each row are NaN values.\n",
"\n",
"For each row of the DataFrame, find the *column* which contains the *third* NaN value.\n",
"\n",
"You should return a column of column labels: `e, c, d, h, d`"
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-05-27T15:30:27.980465Z",
"start_time": "2025-05-27T15:30:27.506186Z"
}
},
"cell_type": "code",
"source": [
"val nan = Double.NaN\n",
"val names = ('a'..'j').map { it.toString() }\n",
"val data = listOf(\n",
" 0.04, nan, nan, 0.25, nan, 0.43, 0.71, 0.51, nan, nan,\n",
" nan, nan, nan, 0.04, 0.76, nan, nan, 0.67, 0.76, 0.16,\n",
" nan, nan, 0.5, nan, 0.31, 0.4, nan, nan, 0.24, 0.01,\n",
" 0.49, nan, nan, 0.62, 0.73, 0.26, 0.85, nan, nan, nan,\n",
" nan, nan, 0.41, nan, 0.05, nan, 0.61, nan, 0.48, 0.68,\n",
")\n",
"val df = dataFrameOf(names)(*data.toTypedArray())\n",
"df"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | a | b | c | d | e | f | g | h | i | j |
|---|
| 0.040000 | NaN | NaN | 0.250000 | NaN | 0.430000 | 0.710000 | 0.510000 | NaN | NaN |
| NaN | NaN | NaN | 0.040000 | 0.760000 | NaN | NaN | 0.670000 | 0.760000 | 0.160000 |
| NaN | NaN | 0.500000 | NaN | 0.310000 | 0.400000 | NaN | NaN | 0.240000 | 0.010000 |
| 0.490000 | NaN | NaN | 0.620000 | 0.730000 | 0.260000 | 0.850000 | NaN | NaN | NaN |
| NaN | NaN | 0.410000 | NaN | 0.050000 | NaN | 0.610000 | NaN | 0.480000 | 0.680000 |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"a\",\"b\",\"c\",\"d\",\"e\",\"f\",\"g\",\"h\",\"i\",\"j\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"}],\"nrow\":5,\"ncol\":10},\"kotlin_dataframe\":[{\"a\":0.04,\"b\":NaN,\"c\":NaN,\"d\":0.25,\"e\":NaN,\"f\":0.43,\"g\":0.71,\"h\":0.51,\"i\":NaN,\"j\":NaN},{\"a\":NaN,\"b\":NaN,\"c\":NaN,\"d\":0.04,\"e\":0.76,\"f\":NaN,\"g\":NaN,\"h\":0.67,\"i\":0.76,\"j\":0.16},{\"a\":NaN,\"b\":NaN,\"c\":0.5,\"d\":NaN,\"e\":0.31,\"f\":0.4,\"g\":NaN,\"h\":NaN,\"i\":0.24,\"j\":0.01},{\"a\":0.49,\"b\":NaN,\"c\":NaN,\"d\":0.62,\"e\":0.73,\"f\":0.26,\"g\":0.85,\"h\":NaN,\"i\":NaN,\"j\":NaN},{\"a\":NaN,\"b\":NaN,\"c\":0.41,\"d\":NaN,\"e\":0.05,\"f\":NaN,\"g\":0.61,\"h\":NaN,\"i\":0.48,\"j\":0.68}]}"
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 30
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-05-27T15:30:28.263132Z",
"start_time": "2025-05-27T15:30:27.984548Z"
}
},
"cell_type": "code",
"source": [
"df.mapToColumn(\"res\") { \n",
" namedValuesOf()\n",
" .filter { it.value.isNaN() }.drop(2)\n",
" .firstOrNull()?.name \n",
"}"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"res\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"}],\"nrow\":5,\"ncol\":1},\"kotlin_dataframe\":[{\"res\":\"e\"},{\"res\":\"c\"},{\"res\":\"d\"},{\"res\":\"h\"},{\"res\":\"d\"}]}"
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 31
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"**25.** A DataFrame has a column of groups 'grps' and a column of integer values 'vals':\n",
"```kotlin\n",
"val grps by column(\"a\", \"a\", \"a\", \"b\", \"b\", \"c\", \"a\", \"a\", \"b\", \"c\", \"c\", \"c\", \"b\", \"b\", \"c\")\n",
"val vals by column(12, 345, 3, 1, 45, 14, 4, 52, 54, 23, 235, 21, 57, 3, 87)\n",
"\n",
"val df = dataFrameOf(grps, vals)\n",
"```\n",
"\n",
"For each group, find the sum of the three greatest values. You should end up with the answer as follows:\n",
"```\n",
"grps\n",
"a 409\n",
"b 156\n",
"c 345\n",
"```"
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-05-27T15:30:28.797177Z",
"start_time": "2025-05-27T15:30:28.266153Z"
}
},
"cell_type": "code",
"source": [
"val grps by columnOf(\"a\", \"a\", \"a\", \"b\", \"b\", \"c\", \"a\", \"a\", \"b\", \"c\", \"c\", \"c\", \"b\", \"b\", \"c\")\n",
"val vals by columnOf(12, 345, 3, 1, 45, 14, 4, 52, 54, 23, 235, 21, 57, 3, 87)\n",
"\n",
"val df = dataFrameOf(grps, vals)\n",
"df"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | grps | vals |
|---|
| a | 12 |
| a | 345 |
| a | 3 |
| b | 1 |
| b | 45 |
| c | 14 |
| a | 4 |
| a | 52 |
| b | 54 |
| c | 23 |
| c | 235 |
| c | 21 |
| b | 57 |
| b | 3 |
| c | 87 |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"grps\",\"vals\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"}],\"nrow\":15,\"ncol\":2},\"kotlin_dataframe\":[{\"grps\":\"a\",\"vals\":12},{\"grps\":\"a\",\"vals\":345},{\"grps\":\"a\",\"vals\":3},{\"grps\":\"b\",\"vals\":1},{\"grps\":\"b\",\"vals\":45},{\"grps\":\"c\",\"vals\":14},{\"grps\":\"a\",\"vals\":4},{\"grps\":\"a\",\"vals\":52},{\"grps\":\"b\",\"vals\":54},{\"grps\":\"c\",\"vals\":23},{\"grps\":\"c\",\"vals\":235},{\"grps\":\"c\",\"vals\":21},{\"grps\":\"b\",\"vals\":57},{\"grps\":\"b\",\"vals\":3},{\"grps\":\"c\",\"vals\":87}]}"
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 32
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-05-27T15:30:29.016458Z",
"start_time": "2025-05-27T15:30:28.800910Z"
}
},
"cell_type": "code",
"source": [
"df.groupBy { grps }.aggregate { \n",
" vals.sortDesc().take(3).sum() into \"res\"\n",
"}"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"grps\",\"res\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"}],\"nrow\":3,\"ncol\":2},\"kotlin_dataframe\":[{\"grps\":\"a\",\"res\":409},{\"grps\":\"b\",\"res\":156},{\"grps\":\"c\",\"res\":345}]}"
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 33
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"**26.** The DataFrame `df` constructed below has two integer columns 'A' and 'B'. The values in 'A' are between 1 and 100 (inclusive).\n",
"\n",
"For each group of 10 consecutive integers in 'A' (i.e. `(0, 10]`, `(10, 20]`, ...), calculate the sum of the corresponding values in column 'B'.\n",
"\n",
"The answer is as follows:\n",
"\n",
"```\n",
"A\n",
"(0, 10] 635\n",
"(10, 20] 360\n",
"(20, 30] 315\n",
"(30, 40] 306\n",
"(40, 50] 750\n",
"(50, 60] 284\n",
"(60, 70] 424\n",
"(70, 80] 526\n",
"(80, 90] 835\n",
"(90, 100] 852\n",
"```"
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-05-27T15:37:21.268836Z",
"start_time": "2025-05-27T15:37:21.028807Z"
}
},
"cell_type": "code",
"source": [
"import kotlin.random.Random\n",
"\n",
"val random = Random(42)\n",
"val list = List(200) { random.nextInt(1, 101) }\n",
"val df = dataFrameOf(\"A\", \"B\")(*list.toTypedArray())\n",
"df"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | A | B |
|---|
| 34 | 41 |
| 42 | 3 |
| 42 | 33 |
| 22 | 41 |
| 70 | 88 |
| 53 | 68 |
| 80 | 4 |
| 59 | 59 |
| 45 | 1 |
| 27 | 14 |
| 70 | 8 |
| 11 | 52 |
| 51 | 60 |
| 46 | 43 |
| 17 | 17 |
| 17 | 42 |
| 56 | 29 |
| 58 | 49 |
| 48 | 7 |
| 73 | 52 |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"A\",\"B\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"}],\"nrow\":100,\"ncol\":2},\"kotlin_dataframe\":[{\"A\":34,\"B\":41},{\"A\":42,\"B\":3},{\"A\":42,\"B\":33},{\"A\":22,\"B\":41},{\"A\":70,\"B\":88},{\"A\":53,\"B\":68},{\"A\":80,\"B\":4},{\"A\":59,\"B\":59},{\"A\":45,\"B\":1},{\"A\":27,\"B\":14},{\"A\":70,\"B\":8},{\"A\":11,\"B\":52},{\"A\":51,\"B\":60},{\"A\":46,\"B\":43},{\"A\":17,\"B\":17},{\"A\":17,\"B\":42},{\"A\":56,\"B\":29},{\"A\":58,\"B\":49},{\"A\":48,\"B\":7},{\"A\":73,\"B\":52}]}"
},
"execution_count": 58,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 58
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-05-27T15:37:32.793588Z",
"start_time": "2025-05-27T15:37:32.442772Z"
}
},
"cell_type": "code",
"source": [
"df.groupBy { A.map { (it - 1) / 10 } }.sum { B }\n",
" .sortBy { A }\n",
" .convert { A }.with { \"(${it * 10}, ${it * 10 + 10}]\" }"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | A | B |
|---|
| (0, 10] | 353 |
| (10, 20] | 873 |
| (20, 30] | 321 |
| (30, 40] | 322 |
| (40, 50] | 432 |
| (50, 60] | 754 |
| (60, 70] | 405 |
| (70, 80] | 561 |
| (80, 90] | 657 |
| (90, 100] | 527 |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"A\",\"B\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"}],\"nrow\":10,\"ncol\":2},\"kotlin_dataframe\":[{\"A\":\"(0, 10]\",\"B\":353},{\"A\":\"(10, 20]\",\"B\":873},{\"A\":\"(20, 30]\",\"B\":321},{\"A\":\"(30, 40]\",\"B\":322},{\"A\":\"(40, 50]\",\"B\":432},{\"A\":\"(50, 60]\",\"B\":754},{\"A\":\"(60, 70]\",\"B\":405},{\"A\":\"(70, 80]\",\"B\":561},{\"A\":\"(80, 90]\",\"B\":657},{\"A\":\"(90, 100]\",\"B\":527}]}"
},
"execution_count": 59,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 59
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"## DataFrames: harder problems\n",
"\n",
"### These might require a bit of thinking outside the box...\n",
"\n",
"Difficulty: hard"
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"**27.** Consider a DataFrame `df` where there is an integer column 'X':\n",
"```kotlin\n",
"val df = dataFrameOf(\"X\")(7, 2, 0, 3, 4, 2, 5, 0, 3 , 4)\n",
"```\n",
"For each value, count the difference back to the previous zero (or the start of the column, whichever is closer). These values should therefore be\n",
"\n",
"```\n",
"[1, 2, 0, 1, 2, 3, 4, 0, 1, 2]\n",
"```\n",
"\n",
"Make this a new column 'Y'."
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-05-27T15:37:52.912785Z",
"start_time": "2025-05-27T15:37:52.750200Z"
}
},
"cell_type": "code",
"source": [
"val df = dataFrameOf(\"X\")(7, 2, 0, 3, 4, 2, 5, 0, 3, 4)\n",
"df"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"X\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"}],\"nrow\":10,\"ncol\":1},\"kotlin_dataframe\":[{\"X\":7},{\"X\":2},{\"X\":0},{\"X\":3},{\"X\":4},{\"X\":2},{\"X\":5},{\"X\":0},{\"X\":3},{\"X\":4}]}"
},
"execution_count": 60,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 60
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-05-27T16:45:16.496105Z",
"start_time": "2025-05-27T16:45:16.392098Z"
}
},
"cell_type": "code",
"source": [
"df.mapToColumn(\"Y\") {\n",
" if (it.X == 0) 0 else (prev()?.newValue() ?: 0) + 1\n",
"}"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"Y\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"}],\"nrow\":10,\"ncol\":1},\"kotlin_dataframe\":[{\"Y\":1},{\"Y\":2},{\"Y\":0},{\"Y\":1},{\"Y\":2},{\"Y\":3},{\"Y\":4},{\"Y\":0},{\"Y\":1},{\"Y\":2}]}"
},
"execution_count": 62,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 62
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"**28.** Consider the DataFrame constructed below, which contains rows and columns of numerical data.\n",
"\n",
"Create a list of the column-row index locations of the three largest values in this DataFrame.\n",
"\n",
"In this case, the answer should be:\n",
"```\n",
"[(0, d), (2, c), (3, f)]\n",
"```"
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-05-27T15:30:30.574254Z",
"start_time": "2025-05-27T15:30:30.071670Z"
}
},
"cell_type": "code",
"source": [
"val names = ('a'..'h').map { it.toString() } // val names = (0..7).map { it.toString() }\n",
"val random = Random(30)\n",
"val list = List(64) { random.nextInt(1, 101) }\n",
"val df = dataFrameOf(names)(*list.toTypedArray())\n",
"df"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | a | b | c | d | e | f | g | h |
|---|
| 43 | 88 | 66 | 100 | 9 | 59 | 74 | 23 |
| 6 | 63 | 43 | 58 | 4 | 85 | 9 | 25 |
| 49 | 59 | 100 | 52 | 28 | 1 | 19 | 81 |
| 92 | 41 | 13 | 57 | 28 | 97 | 63 | 39 |
| 4 | 59 | 72 | 65 | 50 | 35 | 14 | 31 |
| 55 | 74 | 33 | 66 | 17 | 39 | 80 | 38 |
| 18 | 64 | 91 | 39 | 80 | 55 | 65 | 2 |
| 19 | 76 | 75 | 18 | 32 | 97 | 1 | 32 |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"a\",\"b\",\"c\",\"d\",\"e\",\"f\",\"g\",\"h\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"}],\"nrow\":8,\"ncol\":8},\"kotlin_dataframe\":[{\"a\":43,\"b\":88,\"c\":66,\"d\":100,\"e\":9,\"f\":59,\"g\":74,\"h\":23},{\"a\":6,\"b\":63,\"c\":43,\"d\":58,\"e\":4,\"f\":85,\"g\":9,\"h\":25},{\"a\":49,\"b\":59,\"c\":100,\"d\":52,\"e\":28,\"f\":1,\"g\":19,\"h\":81},{\"a\":92,\"b\":41,\"c\":13,\"d\":57,\"e\":28,\"f\":97,\"g\":63,\"h\":39},{\"a\":4,\"b\":59,\"c\":72,\"d\":65,\"e\":50,\"f\":35,\"g\":14,\"h\":31},{\"a\":55,\"b\":74,\"c\":33,\"d\":66,\"e\":17,\"f\":39,\"g\":80,\"h\":38},{\"a\":18,\"b\":64,\"c\":91,\"d\":39,\"e\":80,\"f\":55,\"g\":65,\"h\":2},{\"a\":19,\"b\":76,\"c\":75,\"d\":18,\"e\":32,\"f\":97,\"g\":1,\"h\":32}]}"
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 38
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-05-27T15:30:30.825800Z",
"start_time": "2025-05-27T15:30:30.577145Z"
}
},
"cell_type": "code",
"source": [
"df.add(\"index\") { index() }\n",
" .gather { dropLast() }.into(\"name\", \"vals\")\n",
" .sortByDesc(\"vals\").take(3)[\"index\", \"name\"]"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"index\",\"name\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"}],\"nrow\":3,\"ncol\":2},\"kotlin_dataframe\":[{\"index\":0,\"name\":\"d\"},{\"index\":2,\"name\":\"c\"},{\"index\":3,\"name\":\"f\"}]}"
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 39
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"**29.** You are given the DataFrame below with a column of group IDs, 'grps', and a column of corresponding integer values 'vals'.\n",
"\n",
"```kotlin\n",
"val random = Random(31)\n",
"val lab = listOf(\"A\", \"B\")\n",
"\n",
"val vals by columnOf(List(15) { random.nextInt(-30, 30) })\n",
"val grps by columnOf(List(15) { lab[random.nextInt(0, 2)] })\n",
"\n",
"val df = dataFrameOf(vals, grps)\n",
"```\n",
"\n",
"Create a new column 'patched_values' which contains the same values as the 'vals' any negative values in 'vals' with the group mean:\n",
"\n",
"```\n",
"vals grps patched_vals\n",
" -17 B 21.0\n",
" -7 B 21.0\n",
" 28 B 28.0\n",
" 16 B 16.0\n",
" -21 B 21.0\n",
" 19 B 19.0\n",
" -2 B 21.0\n",
" -19 B 21.0\n",
" 16 A 16.0\n",
" 9 A 9.0\n",
" -14 A 16.0\n",
" -19 A 16.0\n",
" -22 A 16.0\n",
" -1 A 16.0\n",
" 23 A 23.0\n",
"```"
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-05-27T16:46:49.985159Z",
"start_time": "2025-05-27T16:46:49.674750Z"
}
},
"cell_type": "code",
"source": [
"val random = Random(31)\n",
"val lab = listOf(\"A\", \"B\")\n",
"\n",
"val vals by columnOf(*Array(15) { random.nextInt(-30, 30) })\n",
"val grps by columnOf(*Array(15) { lab[random.nextInt(0, 2)] })\n",
"\n",
"val df = dataFrameOf(vals, grps)\n",
"df"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | vals | grps |
|---|
| -17 | B |
| -7 | B |
| 16 | A |
| 28 | B |
| 9 | A |
| 16 | B |
| -21 | B |
| -14 | A |
| -19 | A |
| -22 | A |
| 19 | B |
| -2 | B |
| -1 | A |
| -19 | B |
| 23 | A |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"vals\",\"grps\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"}],\"nrow\":15,\"ncol\":2},\"kotlin_dataframe\":[{\"vals\":-17,\"grps\":\"B\"},{\"vals\":-7,\"grps\":\"B\"},{\"vals\":16,\"grps\":\"A\"},{\"vals\":28,\"grps\":\"B\"},{\"vals\":9,\"grps\":\"A\"},{\"vals\":16,\"grps\":\"B\"},{\"vals\":-21,\"grps\":\"B\"},{\"vals\":-14,\"grps\":\"A\"},{\"vals\":-19,\"grps\":\"A\"},{\"vals\":-22,\"grps\":\"A\"},{\"vals\":19,\"grps\":\"B\"},{\"vals\":-2,\"grps\":\"B\"},{\"vals\":-1,\"grps\":\"A\"},{\"vals\":-19,\"grps\":\"B\"},{\"vals\":23,\"grps\":\"A\"}]}"
},
"execution_count": 63,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 63
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-05-27T16:46:58.594859Z",
"start_time": "2025-05-27T16:46:58.263921Z"
}
},
"cell_type": "code",
"source": [
"val means = df.filter { vals >= 0 }\n",
" .groupBy { grps }.mean()\n",
" .pivot { grps }.values { vals }\n",
"\n",
"df.add(\"patched_values\") {\n",
" if (vals < 0) means[grps] else vals.toDouble()\n",
"}"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | vals | grps | patched_values |
|---|
| -17 | B | 21.000000 |
| -7 | B | 21.000000 |
| 16 | A | 16.000000 |
| 28 | B | 28.000000 |
| 9 | A | 9.000000 |
| 16 | B | 16.000000 |
| -21 | B | 21.000000 |
| -14 | A | 16.000000 |
| -19 | A | 16.000000 |
| -22 | A | 16.000000 |
| 19 | B | 19.000000 |
| -2 | B | 21.000000 |
| -1 | A | 16.000000 |
| -19 | B | 21.000000 |
| 23 | A | 23.000000 |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"vals\",\"grps\",\"patched_values\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Any\"}],\"nrow\":15,\"ncol\":3},\"kotlin_dataframe\":[{\"vals\":-17,\"grps\":\"B\",\"patched_values\":\"21.0\"},{\"vals\":-7,\"grps\":\"B\",\"patched_values\":\"21.0\"},{\"vals\":16,\"grps\":\"A\",\"patched_values\":\"16.0\"},{\"vals\":28,\"grps\":\"B\",\"patched_values\":\"28.0\"},{\"vals\":9,\"grps\":\"A\",\"patched_values\":\"9.0\"},{\"vals\":16,\"grps\":\"B\",\"patched_values\":\"16.0\"},{\"vals\":-21,\"grps\":\"B\",\"patched_values\":\"21.0\"},{\"vals\":-14,\"grps\":\"A\",\"patched_values\":\"16.0\"},{\"vals\":-19,\"grps\":\"A\",\"patched_values\":\"16.0\"},{\"vals\":-22,\"grps\":\"A\",\"patched_values\":\"16.0\"},{\"vals\":19,\"grps\":\"B\",\"patched_values\":\"19.0\"},{\"vals\":-2,\"grps\":\"B\",\"patched_values\":\"21.0\"},{\"vals\":-1,\"grps\":\"A\",\"patched_values\":\"16.0\"},{\"vals\":-19,\"grps\":\"B\",\"patched_values\":\"21.0\"},{\"vals\":23,\"grps\":\"A\",\"patched_values\":\"23.0\"}]}"
},
"execution_count": 64,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 64
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"**30.** Implement a rolling mean over groups with window size 3, which ignores NaN value. For example, consider the following DataFrame:\n",
"```kotlin\n",
"val group by columnOf(\"a\", \"a\", \"b\", \"b\", \"a\", \"b\", \"b\", \"b\", \"a\", \"b\", \"a\", \"b\")\n",
"val value by columnOf(1.0, 2.0, 3.0, Double.NaN, 2.0, 3.0, Double.NaN, 1.0, 7.0, 3.0, Double.NaN, 8.0)\n",
"\n",
"val df = dataFrameOf(group, value)\n",
"df\n",
"\n",
"group value\n",
"a 1.0\n",
"a 2.0\n",
"b 3.0\n",
"b NaN\n",
"a 2.0\n",
"b 3.0\n",
"b NaN\n",
"b 1.0\n",
"a 7.0\n",
"b 3.0\n",
"a NaN\n",
"b 8.0\n",
"```\n",
"The goal is:\n",
"```\n",
"1.000000\n",
"1.500000\n",
"3.000000\n",
"3.000000\n",
"1.666667\n",
"3.000000\n",
"3.000000\n",
"2.000000\n",
"3.666667\n",
"2.000000\n",
"4.500000\n",
"4.000000\n",
"```\n",
"E.g., the first window of size three for group 'b' has values 3.0, NaN and 3.0 and occurs at row index 5.\n",
"Instead of being NaN, the value in the new column at this row index should be 3.0 (just the two non-NaN values are used to compute the mean (3+3)/2)"
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-05-27T16:48:53.445808Z",
"start_time": "2025-05-27T16:48:53.105156Z"
}
},
"cell_type": "code",
"source": [
"val groups by columnOf(\"a\", \"a\", \"b\", \"b\", \"a\", \"b\", \"b\", \"b\", \"a\", \"b\", \"a\", \"b\")\n",
"val value by columnOf(1.0, 2.0, 3.0, Double.NaN, 2.0, 3.0, Double.NaN, 1.0, 7.0, 3.0, Double.NaN, 8.0)\n",
"\n",
"val df = dataFrameOf(groups, value)\n",
"df"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | groups | value |
|---|
| a | 1.000000 |
| a | 2.000000 |
| b | 3.000000 |
| b | NaN |
| a | 2.000000 |
| b | 3.000000 |
| b | NaN |
| b | 1.000000 |
| a | 7.000000 |
| b | 3.000000 |
| a | NaN |
| b | 8.000000 |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"groups\",\"value\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"}],\"nrow\":12,\"ncol\":2},\"kotlin_dataframe\":[{\"groups\":\"a\",\"value\":1.0},{\"groups\":\"a\",\"value\":2.0},{\"groups\":\"b\",\"value\":3.0},{\"groups\":\"b\",\"value\":NaN},{\"groups\":\"a\",\"value\":2.0},{\"groups\":\"b\",\"value\":3.0},{\"groups\":\"b\",\"value\":NaN},{\"groups\":\"b\",\"value\":1.0},{\"groups\":\"a\",\"value\":7.0},{\"groups\":\"b\",\"value\":3.0},{\"groups\":\"a\",\"value\":NaN},{\"groups\":\"b\",\"value\":8.0}]}"
},
"execution_count": 65,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 65
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-05-27T16:48:55.924811Z",
"start_time": "2025-05-27T16:48:55.634229Z"
}
},
"cell_type": "code",
"source": [
"df.add(\"id\") { index() }\n",
" .groupBy { groups }.add(\"res\") {\n",
" relative(-2..0).value.filter { !it.isNaN() }.mean()\n",
" }.concat()\n",
" .sortBy(\"id\")\n",
" .remove(\"id\")"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | groups | value | res |
|---|
| a | 1.000000 | 1.000000 |
| a | 2.000000 | 1.500000 |
| b | 3.000000 | 3.000000 |
| b | NaN | 3.000000 |
| a | 2.000000 | 1.666667 |
| b | 3.000000 | 3.000000 |
| b | NaN | 3.000000 |
| b | 1.000000 | 2.000000 |
| a | 7.000000 | 3.666667 |
| b | 3.000000 | 2.000000 |
| a | NaN | 4.500000 |
| b | 8.000000 | 4.000000 |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"groups\",\"value\",\"res\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"}],\"nrow\":12,\"ncol\":3},\"kotlin_dataframe\":[{\"groups\":\"a\",\"value\":1.0,\"res\":1.0},{\"groups\":\"a\",\"value\":2.0,\"res\":1.5},{\"groups\":\"b\",\"value\":3.0,\"res\":3.0},{\"groups\":\"b\",\"value\":NaN,\"res\":3.0},{\"groups\":\"a\",\"value\":2.0,\"res\":1.6666666666666667},{\"groups\":\"b\",\"value\":3.0,\"res\":3.0},{\"groups\":\"b\",\"value\":NaN,\"res\":3.0},{\"groups\":\"b\",\"value\":1.0,\"res\":2.0},{\"groups\":\"a\",\"value\":7.0,\"res\":3.6666666666666665},{\"groups\":\"b\",\"value\":3.0,\"res\":2.0},{\"groups\":\"a\",\"value\":NaN,\"res\":4.5},{\"groups\":\"b\",\"value\":8.0,\"res\":4.0}]}"
},
"execution_count": 66,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 66
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"## Date\n",
"Difficulty: easy/medium"
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**31.** Create a `LocalDate` column that contains each day of 2015 and a column of random numbers."
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-05-27T16:49:28.314963Z",
"start_time": "2025-05-27T16:49:28.269963Z"
}
},
"cell_type": "code",
"source": "import kotlinx.datetime.*",
"outputs": [],
"execution_count": 67
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-05-27T16:52:19.345798Z",
"start_time": "2025-05-27T16:52:19.152897Z"
}
},
"cell_type": "code",
"source": [
"class DateRangeIterator(first: LocalDate, last: LocalDate, val step: Int) : Iterator {\n",
" private val finalElement: LocalDate = last\n",
" private var hasNext: Boolean = if (step > 0) first <= last else first >= last\n",
" private var next: LocalDate = if (hasNext) first else finalElement\n",
"\n",
" override fun hasNext(): Boolean = hasNext\n",
"\n",
" override fun next(): LocalDate {\n",
" val value = next\n",
" if (value == finalElement) {\n",
" if (!hasNext) throw kotlin.NoSuchElementException()\n",
" hasNext = false\n",
" } else {\n",
" next = next.plus(step, DateTimeUnit.DayBased(1))\n",
" }\n",
" return value\n",
" }\n",
"}\n",
"\n",
"operator fun ClosedRange.iterator() = DateRangeIterator(this.start, this.endInclusive, 1)\n",
"\n",
"fun ClosedRange.toList(): List {\n",
" return when (val size = this.start.daysUntil(this.endInclusive)) {\n",
" 0 -> emptyList()\n",
" 1 -> listOf(iterator().next())\n",
" else -> {\n",
" val dest = ArrayList(size)\n",
" for (item in this) {\n",
" dest.add(item)\n",
" }\n",
" dest\n",
" }\n",
" }\n",
"}"
],
"outputs": [
{
"ename": "org.jetbrains.kotlinx.jupyter.exceptions.ReplCompilerException",
"evalue": "at Cell In[69], line 1, column 77: This class shouldn't be used in Kotlin. Use kotlin.collections.Iterator or kotlin.collections.MutableIterator instead.\nat Cell In[69], line 22, column 38: This class shouldn't be used in Kotlin. Use kotlin.collections.List or kotlin.collections.MutableList instead.\nat Cell In[69], line 24, column 14: Type mismatch: inferred type is kotlin.collections.List??> but java.util.List was expected\nat Cell In[69], line 25, column 14: Type mismatch: inferred type is kotlin.collections.List but java.util.List was expected\nLine_176.jupyter.kts (26:17 - 32:10) Type mismatch: inferred type is ArrayList but List was expected",
"output_type": "error",
"traceback": [
"org.jetbrains.kotlinx.jupyter.exceptions.ReplCompilerException: at Cell In[69], line 1, column 77: This class shouldn't be used in Kotlin. Use kotlin.collections.Iterator or kotlin.collections.MutableIterator instead.",
"at Cell In[69], line 22, column 38: This class shouldn't be used in Kotlin. Use kotlin.collections.List or kotlin.collections.MutableList instead.",
"at Cell In[69], line 24, column 14: Type mismatch: inferred type is kotlin.collections.List??> but java.util.List was expected",
"at Cell In[69], line 25, column 14: Type mismatch: inferred type is kotlin.collections.List but java.util.List was expected",
"Line_176.jupyter.kts (26:17 - 32:10) Type mismatch: inferred type is ArrayList but List was expected",
"\tat org.jetbrains.kotlinx.jupyter.repl.impl.JupyterCompilerImpl.compileSync(JupyterCompilerImpl.kt:208)",
"\tat org.jetbrains.kotlinx.jupyter.repl.impl.InternalEvaluatorImpl.eval(InternalEvaluatorImpl.kt:126)",
"\tat org.jetbrains.kotlinx.jupyter.repl.impl.CellExecutorImpl$execute$1$result$1.invoke(CellExecutorImpl.kt:80)",
"\tat org.jetbrains.kotlinx.jupyter.repl.impl.CellExecutorImpl$execute$1$result$1.invoke(CellExecutorImpl.kt:78)",
"\tat org.jetbrains.kotlinx.jupyter.repl.impl.ReplForJupyterImpl.withHost(ReplForJupyterImpl.kt:778)",
"\tat org.jetbrains.kotlinx.jupyter.repl.impl.CellExecutorImpl.execute-L4Nmkdk(CellExecutorImpl.kt:78)",
"\tat org.jetbrains.kotlinx.jupyter.repl.execution.CellExecutor$DefaultImpls.execute-L4Nmkdk$default(CellExecutor.kt:13)",
"\tat org.jetbrains.kotlinx.jupyter.repl.impl.ReplForJupyterImpl.evaluateUserCode-wNURfNM(ReplForJupyterImpl.kt:600)",
"\tat org.jetbrains.kotlinx.jupyter.repl.impl.ReplForJupyterImpl.evalExImpl(ReplForJupyterImpl.kt:458)",
"\tat org.jetbrains.kotlinx.jupyter.repl.impl.ReplForJupyterImpl.access$evalExImpl(ReplForJupyterImpl.kt:140)",
"\tat org.jetbrains.kotlinx.jupyter.repl.impl.ReplForJupyterImpl$evalEx$1.invoke(ReplForJupyterImpl.kt:451)",
"\tat org.jetbrains.kotlinx.jupyter.repl.impl.ReplForJupyterImpl$evalEx$1.invoke(ReplForJupyterImpl.kt:450)",
"\tat org.jetbrains.kotlinx.jupyter.repl.impl.ReplForJupyterImpl.withEvalContext(ReplForJupyterImpl.kt:431)",
"\tat org.jetbrains.kotlinx.jupyter.repl.impl.ReplForJupyterImpl.evalEx(ReplForJupyterImpl.kt:450)",
"\tat org.jetbrains.kotlinx.jupyter.messaging.IdeCompatibleMessageRequestProcessor$processExecuteRequest$1$response$1$1.invoke(IdeCompatibleMessageRequestProcessor.kt:159)",
"\tat org.jetbrains.kotlinx.jupyter.messaging.IdeCompatibleMessageRequestProcessor$processExecuteRequest$1$response$1$1.invoke(IdeCompatibleMessageRequestProcessor.kt:158)",
"\tat org.jetbrains.kotlinx.jupyter.streams.BlockingSubstitutionEngine.withDataSubstitution(SubstitutionEngine.kt:70)",
"\tat org.jetbrains.kotlinx.jupyter.streams.StreamSubstitutionManager.withSubstitutedStreams(StreamSubstitutionManager.kt:118)",
"\tat org.jetbrains.kotlinx.jupyter.messaging.IdeCompatibleMessageRequestProcessor.withForkedIn(IdeCompatibleMessageRequestProcessor.kt:335)",
"\tat org.jetbrains.kotlinx.jupyter.messaging.IdeCompatibleMessageRequestProcessor.access$withForkedIn(IdeCompatibleMessageRequestProcessor.kt:54)",
"\tat org.jetbrains.kotlinx.jupyter.messaging.IdeCompatibleMessageRequestProcessor$evalWithIO$1$1.invoke(IdeCompatibleMessageRequestProcessor.kt:349)",
"\tat org.jetbrains.kotlinx.jupyter.streams.BlockingSubstitutionEngine.withDataSubstitution(SubstitutionEngine.kt:70)",
"\tat org.jetbrains.kotlinx.jupyter.streams.StreamSubstitutionManager.withSubstitutedStreams(StreamSubstitutionManager.kt:118)",
"\tat org.jetbrains.kotlinx.jupyter.messaging.IdeCompatibleMessageRequestProcessor.withForkedErr(IdeCompatibleMessageRequestProcessor.kt:324)",
"\tat org.jetbrains.kotlinx.jupyter.messaging.IdeCompatibleMessageRequestProcessor.access$withForkedErr(IdeCompatibleMessageRequestProcessor.kt:54)",
"\tat org.jetbrains.kotlinx.jupyter.messaging.IdeCompatibleMessageRequestProcessor$evalWithIO$1.invoke(IdeCompatibleMessageRequestProcessor.kt:348)",
"\tat org.jetbrains.kotlinx.jupyter.streams.BlockingSubstitutionEngine.withDataSubstitution(SubstitutionEngine.kt:70)",
"\tat org.jetbrains.kotlinx.jupyter.streams.StreamSubstitutionManager.withSubstitutedStreams(StreamSubstitutionManager.kt:118)",
"\tat org.jetbrains.kotlinx.jupyter.messaging.IdeCompatibleMessageRequestProcessor.withForkedOut(IdeCompatibleMessageRequestProcessor.kt:316)",
"\tat org.jetbrains.kotlinx.jupyter.messaging.IdeCompatibleMessageRequestProcessor.evalWithIO(IdeCompatibleMessageRequestProcessor.kt:347)",
"\tat org.jetbrains.kotlinx.jupyter.messaging.IdeCompatibleMessageRequestProcessor$processExecuteRequest$1$response$1.invoke(IdeCompatibleMessageRequestProcessor.kt:158)",
"\tat org.jetbrains.kotlinx.jupyter.messaging.IdeCompatibleMessageRequestProcessor$processExecuteRequest$1$response$1.invoke(IdeCompatibleMessageRequestProcessor.kt:157)",
"\tat org.jetbrains.kotlinx.jupyter.execution.JupyterExecutorImpl$Task.execute(JupyterExecutorImpl.kt:41)",
"\tat org.jetbrains.kotlinx.jupyter.execution.JupyterExecutorImpl$executorThread$1.invoke(JupyterExecutorImpl.kt:83)",
"\tat org.jetbrains.kotlinx.jupyter.execution.JupyterExecutorImpl$executorThread$1.invoke(JupyterExecutorImpl.kt:80)",
"\tat kotlin.concurrent.ThreadsKt$thread$thread$1.run(Thread.kt:30)",
""
]
}
],
"execution_count": 69
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-05-27T16:53:23.549560Z",
"start_time": "2025-05-27T16:53:23.213518Z"
}
},
"cell_type": "code",
"source": [
"val start = LocalDate(2015, 1, 1)\n",
"val end = LocalDate(2016, 1, 1)\n",
"\n",
"val days = (start..end).toList()\n",
"\n",
"val dti = days.toColumn(\"dti\")\n",
"val s = List(dti.size()) { Random.nextDouble() }.toColumn(\"s\")\n",
"val df = dataFrameOf(dti, s)\n",
"df.head()"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | dti | s |
|---|
| 2015-01-01 | 0.819283 |
| 2015-01-02 | 0.327622 |
| 2015-01-03 | 0.392450 |
| 2015-01-04 | 0.017273 |
| 2015-01-05 | 0.589880 |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"dti\",\"s\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlinx.datetime.LocalDate\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"}],\"nrow\":5,\"ncol\":2},\"kotlin_dataframe\":[{\"dti\":\"2015-01-01\",\"s\":0.81928274269362},{\"dti\":\"2015-01-02\",\"s\":0.32762203844192006},{\"dti\":\"2015-01-03\",\"s\":0.392450181987287},{\"dti\":\"2015-01-04\",\"s\":0.017272799890585944},{\"dti\":\"2015-01-05\",\"s\":0.5898804785792099}]}"
},
"execution_count": 70,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 70
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**32.** Find the sum of the values in `s` for every Wednesday."
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-05-27T16:55:21.227753Z",
"start_time": "2025-05-27T16:55:21.108232Z"
}
},
"cell_type": "code",
"source": "df.filter { dti.dayOfWeek == DayOfWeek.WEDNESDAY }.sum { s }",
"outputs": [
{
"data": {
"text/plain": [
"25.126838407617647"
]
},
"execution_count": 72,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 72
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**33.** For each calendar month in `s`, find the mean of values."
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-05-27T15:30:34.271110Z",
"start_time": "2025-05-27T15:30:34.067258Z"
}
},
"cell_type": "code",
"source": "df.groupBy { dti.map { it.month } named \"month\" }.mean()",
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | month | s |
|---|
| JANUARY | 0.392389 |
| FEBRUARY | 0.473040 |
| MARCH | 0.516771 |
| APRIL | 0.432887 |
| MAY | 0.524052 |
| JUNE | 0.607268 |
| JULY | 0.490782 |
| AUGUST | 0.457970 |
| SEPTEMBER | 0.573128 |
| OCTOBER | 0.421887 |
| NOVEMBER | 0.518313 |
| DECEMBER | 0.534615 |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"month\",\"s\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"java.time.Month\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"}],\"nrow\":12,\"ncol\":2},\"kotlin_dataframe\":[{\"month\":\"JANUARY\",\"s\":0.3923887924273048},{\"month\":\"FEBRUARY\",\"s\":0.47303992769923914},{\"month\":\"MARCH\",\"s\":0.5167708223091557},{\"month\":\"APRIL\",\"s\":0.4328866673188899},{\"month\":\"MAY\",\"s\":0.5240523639675101},{\"month\":\"JUNE\",\"s\":0.6072684099398505},{\"month\":\"JULY\",\"s\":0.4907819426878258},{\"month\":\"AUGUST\",\"s\":0.45797032992733294},{\"month\":\"SEPTEMBER\",\"s\":0.5731276549789772},{\"month\":\"OCTOBER\",\"s\":0.42188734256119925},{\"month\":\"NOVEMBER\",\"s\":0.5183134742269732},{\"month\":\"DECEMBER\",\"s\":0.5346147584915001}]}"
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 48
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**34.** For each group of four consecutive calendar months in `s`, find the date on which the highest value occurred."
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-05-27T16:56:57.022328Z",
"start_time": "2025-05-27T16:56:56.818392Z"
}
},
"cell_type": "code",
"source": [
"df.add(\"month4\") {\n",
" when (dti.monthNumber) {\n",
" in 1..4 -> 1\n",
" in 5..8 -> 2\n",
" else -> 3\n",
" }\n",
"}.groupBy(\"month4\").aggregate { maxBy { s } into \"max\" }"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | month4 | max | | |
|---|
| dti | s | month4 |
|---|
| 1 | 2015-03-13 | 0.990105 | 1 |
| 2 | 2015-05-18 | 0.998461 | 2 |
| 3 | 2015-10-19 | 0.976761 | 3 |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"month4\",\"max\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ColumnGroup\"}],\"nrow\":3,\"ncol\":2},\"kotlin_dataframe\":[{\"month4\":1,\"max\":{\"data\":{\"dti\":\"2015-03-13\",\"s\":0.9901045724932529,\"month4\":1},\"metadata\":{\"kind\":\"ColumnGroup\",\"columns\":[\"dti\",\"s\",\"month4\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlinx.datetime.LocalDate\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"}]}}},{\"month4\":2,\"max\":{\"data\":{\"dti\":\"2015-05-18\",\"s\":0.9984614867813051,\"month4\":2},\"metadata\":{\"kind\":\"ColumnGroup\",\"columns\":[\"dti\",\"s\",\"month4\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlinx.datetime.LocalDate\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"}]}}},{\"month4\":3,\"max\":{\"data\":{\"dti\":\"2015-10-19\",\"s\":0.9767606402162385,\"month4\":3},\"metadata\":{\"kind\":\"ColumnGroup\",\"columns\":[\"dti\",\"s\",\"month4\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlinx.datetime.LocalDate\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"}]}}}]}"
},
"execution_count": 73,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 73
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**35.** Create a column consisting of the third Thursday in each month for the years 2015 and 2016."
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-05-27T15:30:34.543148Z",
"start_time": "2025-05-27T15:30:34.503931Z"
}
},
"cell_type": "code",
"source": [
"import java.time.temporal.WeekFields\n",
"import java.util.*"
],
"outputs": [],
"execution_count": 50
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-05-27T16:58:54.878240Z",
"start_time": "2025-05-27T16:58:54.696376Z"
}
},
"cell_type": "code",
"source": [
"val start = LocalDate(2015, 1, 1)\n",
"val end = LocalDate(2016, 12, 31)\n",
"\n",
"(start..end).toList().toColumn(\"thirdThursday\").filter {\n",
" it.toJavaLocalDate()[WeekFields.of(Locale.ENGLISH).weekOfMonth()] == 3\n",
" && it.dayOfWeek.value == 4\n",
"}"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | thirdThursday |
|---|
| 2015-01-15 |
| 2015-02-19 |
| 2015-03-19 |
| 2015-04-16 |
| 2015-05-14 |
| 2015-06-18 |
| 2015-07-16 |
| 2015-08-13 |
| 2015-09-17 |
| 2015-10-15 |
| 2015-11-19 |
| 2015-12-17 |
| 2016-01-14 |
| 2016-02-18 |
| 2016-03-17 |
| 2016-04-14 |
| 2016-05-19 |
| 2016-06-16 |
| 2016-07-14 |
| 2016-08-18 |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"thirdThursday\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlinx.datetime.LocalDate\"}],\"nrow\":24,\"ncol\":1},\"kotlin_dataframe\":[{\"thirdThursday\":\"2015-01-15\"},{\"thirdThursday\":\"2015-02-19\"},{\"thirdThursday\":\"2015-03-19\"},{\"thirdThursday\":\"2015-04-16\"},{\"thirdThursday\":\"2015-05-14\"},{\"thirdThursday\":\"2015-06-18\"},{\"thirdThursday\":\"2015-07-16\"},{\"thirdThursday\":\"2015-08-13\"},{\"thirdThursday\":\"2015-09-17\"},{\"thirdThursday\":\"2015-10-15\"},{\"thirdThursday\":\"2015-11-19\"},{\"thirdThursday\":\"2015-12-17\"},{\"thirdThursday\":\"2016-01-14\"},{\"thirdThursday\":\"2016-02-18\"},{\"thirdThursday\":\"2016-03-17\"},{\"thirdThursday\":\"2016-04-14\"},{\"thirdThursday\":\"2016-05-19\"},{\"thirdThursday\":\"2016-06-16\"},{\"thirdThursday\":\"2016-07-14\"},{\"thirdThursday\":\"2016-08-18\"}]}"
},
"execution_count": 75,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 75
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"## Cleaning Data\n",
"### Making a dataframe easier to work with\n",
"Difficulty: *easy/medium*\n",
"\n",
"It happens all the time: someone gives you data containing malformed strings, lists and missing data. How do you tidy it up so you can get on with the analysis?\n",
"\n",
"Take this monstrosity of a dataframe to use in the following puzzles:\n",
"```kotlin\n",
"val fromTo = listOf(\"LoNDon_paris\", \"MAdrid_miLAN\", \"londON_StockhOlm\", \"Budapest_PaRis\", \"Brussels_londOn\").toColumn(\"From_To\")\n",
"val flightNumber = listOf(10045.0, Double.NaN, 10065.0, Double.NaN, 10085.0).toColumn(\"FlightNumber\")\n",
"val recentDelays = listOf(listOf(23, 47), listOf(), listOf(24, 43, 87), listOf(13), listOf(67, 32)).toColumn(\"RecentDelays\")\n",
"val airline = listOf(\"KLM(!)\", \" (12)\", \"(British Airways. )\", \"12. Air France\", \"'Swiss Air'\").toColumn(\"Airline\")\n",
"\n",
"val df = dataFrameOf(fromTo, flightNumber, recentDelays, airline)\n",
"```\n",
"\n",
"It looks like this:\n",
"```\n",
"From_To FlightNumber RecentDelays Airline\n",
"LoNDon_paris 10045.000000 [23, 47] KLM(!)\n",
"MAdrid_miLAN NaN [] {Air France} (12)\n",
"londON_StockhOlm 10065.000000 [24, 43, 87] (British Airways. )\n",
"Budapest_PaRis NaN [13] 12. Air France\n",
"Brussels_londOn 10085.000000 [67, 32] 'Swiss Air'\n",
"```"
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-05-27T16:59:59.448315Z",
"start_time": "2025-05-27T16:59:59.051459Z"
}
},
"cell_type": "code",
"source": [
"val fromTo = listOf(\"LoNDon_paris\", \"MAdrid_miLAN\", \"londON_StockhOlm\", \"Budapest_PaRis\", \"Brussels_londOn\").toColumn(\"From_To\")\n",
"val flightNumber = listOf(10045.0, Double.NaN, 10065.0, Double.NaN, 10085.0).toColumn(\"FlightNumber\")\n",
"val recentDelays = listOf(listOf(23, 47), listOf(), listOf(24, 43, 87), listOf(13), listOf(67, 32)).toColumn(\"RecentDelays\")\n",
"val airline = listOf(\"KLM(!)\", \"{Air France} (12)\", \"(British Airways. )\", \"12. Air France\", \"'Swiss Air'\").toColumn(\"Airline\")\n",
"\n",
"var df = dataFrameOf(fromTo, flightNumber, recentDelays, airline)\n",
"df"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | From_To | FlightNumber | RecentDelays | Airline |
|---|
| LoNDon_paris | 10045.000000 | [23, 47] | KLM(!) |
| MAdrid_miLAN | NaN | [ ] | {Air France} (12) |
| londON_StockhOlm | 10065.000000 | [24, 43, 87] | (British Airways. ) |
| Budapest_PaRis | NaN | [13] | 12. Air France |
| Brussels_londOn | 10085.000000 | [67, 32] | 'Swiss Air' |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"From_To\",\"FlightNumber\",\"RecentDelays\",\"Airline\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Double\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.collections.List\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"}],\"nrow\":5,\"ncol\":4},\"kotlin_dataframe\":[{\"From_To\":\"LoNDon_paris\",\"FlightNumber\":10045.0,\"RecentDelays\":[23,47],\"Airline\":\"KLM(!)\"},{\"From_To\":\"MAdrid_miLAN\",\"FlightNumber\":NaN,\"RecentDelays\":[],\"Airline\":\"{Air France} (12)\"},{\"From_To\":\"londON_StockhOlm\",\"FlightNumber\":10065.0,\"RecentDelays\":[24,43,87],\"Airline\":\"(British Airways. )\"},{\"From_To\":\"Budapest_PaRis\",\"FlightNumber\":NaN,\"RecentDelays\":[13],\"Airline\":\"12. Air France\"},{\"From_To\":\"Brussels_londOn\",\"FlightNumber\":10085.0,\"RecentDelays\":[67,32],\"Airline\":\"'Swiss Air'\"}]}"
},
"execution_count": 76,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 76
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"**36.** Some values in the `FlightNumber` column are missing (they are NaN).\n",
"These numbers are meant to increase by 10 with each row, so 10,055 and 10,075 need to be put in the right place.\n",
"Modify `df` to fill in these missing numbers and make the column an integer column (instead of a float column)."
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-05-27T17:00:49.820227Z",
"start_time": "2025-05-27T17:00:49.584566Z"
}
},
"cell_type": "code",
"source": [
"df = df.fillNaNs { FlightNumber }\n",
" .with { prev()!!.FlightNumber + (next()!!.FlightNumber - prev()!!.FlightNumber) / 2 }\n",
" .convert { FlightNumber }.toInt()\n",
"df"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | From_To | FlightNumber | RecentDelays | Airline |
|---|
| LoNDon_paris | 10045 | [23, 47] | KLM(!) |
| MAdrid_miLAN | 10055 | [ ] | {Air France} (12) |
| londON_StockhOlm | 10065 | [24, 43, 87] | (British Airways. ) |
| Budapest_PaRis | 10075 | [13] | 12. Air France |
| Brussels_londOn | 10085 | [67, 32] | 'Swiss Air' |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"From_To\",\"FlightNumber\",\"RecentDelays\",\"Airline\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.collections.List\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"}],\"nrow\":5,\"ncol\":4},\"kotlin_dataframe\":[{\"From_To\":\"LoNDon_paris\",\"FlightNumber\":10045,\"RecentDelays\":[23,47],\"Airline\":\"KLM(!)\"},{\"From_To\":\"MAdrid_miLAN\",\"FlightNumber\":10055,\"RecentDelays\":[],\"Airline\":\"{Air France} (12)\"},{\"From_To\":\"londON_StockhOlm\",\"FlightNumber\":10065,\"RecentDelays\":[24,43,87],\"Airline\":\"(British Airways. )\"},{\"From_To\":\"Budapest_PaRis\",\"FlightNumber\":10075,\"RecentDelays\":[13],\"Airline\":\"12. Air France\"},{\"From_To\":\"Brussels_londOn\",\"FlightNumber\":10085,\"RecentDelays\":[67,32],\"Airline\":\"'Swiss Air'\"}]}"
},
"execution_count": 77,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 77
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"**37.** The **From_To** column can better be two separate columns!\n",
"\n",
"Split each string by the underscore delimiter **_**.\n",
"Assign the correct names 'From' and 'To' to these columns."
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-05-27T17:02:53.148779Z",
"start_time": "2025-05-27T17:02:52.889704Z"
}
},
"cell_type": "code",
"source": [
"var df2 = df.split { From_To }.by(\"_\").into(\"From\", \"To\")\n",
"df2"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | From | To | FlightNumber | RecentDelays | Airline |
|---|
| LoNDon | paris | 10045 | [23, 47] | KLM(!) |
| MAdrid | miLAN | 10055 | [ ] | {Air France} (12) |
| londON | StockhOlm | 10065 | [24, 43, 87] | (British Airways. ) |
| Budapest | PaRis | 10075 | [13] | 12. Air France |
| Brussels | londOn | 10085 | [67, 32] | 'Swiss Air' |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"From\",\"To\",\"FlightNumber\",\"RecentDelays\",\"Airline\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.collections.List\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"}],\"nrow\":5,\"ncol\":5},\"kotlin_dataframe\":[{\"From\":\"LoNDon\",\"To\":\"paris\",\"FlightNumber\":10045,\"RecentDelays\":[23,47],\"Airline\":\"KLM(!)\"},{\"From\":\"MAdrid\",\"To\":\"miLAN\",\"FlightNumber\":10055,\"RecentDelays\":[],\"Airline\":\"{Air France} (12)\"},{\"From\":\"londON\",\"To\":\"StockhOlm\",\"FlightNumber\":10065,\"RecentDelays\":[24,43,87],\"Airline\":\"(British Airways. )\"},{\"From\":\"Budapest\",\"To\":\"PaRis\",\"FlightNumber\":10075,\"RecentDelays\":[13],\"Airline\":\"12. Air France\"},{\"From\":\"Brussels\",\"To\":\"londOn\",\"FlightNumber\":10085,\"RecentDelays\":[67,32],\"Airline\":\"'Swiss Air'\"}]}"
},
"execution_count": 78,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 78
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"**38.** Notice how the capitalization of the city names is all mixed up in this temporary DataFrame 'temp'.\n",
"Standardize the strings so that only the first letter is uppercase (e.g. \"londON\" should become \"London\".)"
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-05-27T17:03:40.050237Z",
"start_time": "2025-05-27T17:03:39.711618Z"
}
},
"cell_type": "code",
"source": [
"df2 = df2.update { From and To }.with { it.lowercase().replaceFirstChar { it.uppercase() } }\n",
"df2"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | From | To | FlightNumber | RecentDelays | Airline |
|---|
| London | Paris | 10045 | [23, 47] | KLM(!) |
| Madrid | Milan | 10055 | [ ] | {Air France} (12) |
| London | Stockholm | 10065 | [24, 43, 87] | (British Airways. ) |
| Budapest | Paris | 10075 | [13] | 12. Air France |
| Brussels | London | 10085 | [67, 32] | 'Swiss Air' |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"From\",\"To\",\"FlightNumber\",\"RecentDelays\",\"Airline\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.collections.List\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"}],\"nrow\":5,\"ncol\":5},\"kotlin_dataframe\":[{\"From\":\"London\",\"To\":\"Paris\",\"FlightNumber\":10045,\"RecentDelays\":[23,47],\"Airline\":\"KLM(!)\"},{\"From\":\"Madrid\",\"To\":\"Milan\",\"FlightNumber\":10055,\"RecentDelays\":[],\"Airline\":\"{Air France} (12)\"},{\"From\":\"London\",\"To\":\"Stockholm\",\"FlightNumber\":10065,\"RecentDelays\":[24,43,87],\"Airline\":\"(British Airways. )\"},{\"From\":\"Budapest\",\"To\":\"Paris\",\"FlightNumber\":10075,\"RecentDelays\":[13],\"Airline\":\"12. Air France\"},{\"From\":\"Brussels\",\"To\":\"London\",\"FlightNumber\":10085,\"RecentDelays\":[67,32],\"Airline\":\"'Swiss Air'\"}]}"
},
"execution_count": 80,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 80
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"**39.** In the **Airline** column, you can see some extra punctuation and symbols have appeared around the airline names.\n",
"Pull out just the airline name. E.g. `'(British Airways. )'` should become `'British Airways'`."
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-05-27T17:04:01.551298Z",
"start_time": "2025-05-27T17:04:01.439299Z"
}
},
"cell_type": "code",
"source": [
"df2 = df2.update { Airline }.with {\n",
" \"([a-zA-Z\\\\s]+)\".toRegex().find(it)?.value ?: \"\"\n",
"}\n",
"df2"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | From | To | FlightNumber | RecentDelays | Airline |
|---|
| London | Paris | 10045 | [23, 47] | KLM |
| Madrid | Milan | 10055 | [ ] | Air France |
| London | Stockholm | 10065 | [24, 43, 87] | British Airways |
| Budapest | Paris | 10075 | [13] | Air France |
| Brussels | London | 10085 | [67, 32] | Swiss Air |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"From\",\"To\",\"FlightNumber\",\"RecentDelays\",\"Airline\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.collections.List\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"}],\"nrow\":5,\"ncol\":5},\"kotlin_dataframe\":[{\"From\":\"London\",\"To\":\"Paris\",\"FlightNumber\":10045,\"RecentDelays\":[23,47],\"Airline\":\"KLM\"},{\"From\":\"Madrid\",\"To\":\"Milan\",\"FlightNumber\":10055,\"RecentDelays\":[],\"Airline\":\"Air France\"},{\"From\":\"London\",\"To\":\"Stockholm\",\"FlightNumber\":10065,\"RecentDelays\":[24,43,87],\"Airline\":\"British Airways\"},{\"From\":\"Budapest\",\"To\":\"Paris\",\"FlightNumber\":10075,\"RecentDelays\":[13],\"Airline\":\" Air France\"},{\"From\":\"Brussels\",\"To\":\"London\",\"FlightNumber\":10085,\"RecentDelays\":[67,32],\"Airline\":\"Swiss Air\"}]}"
},
"execution_count": 81,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 81
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"**40.** In the **RecentDelays** column, the values have been entered into the DataFrame as a list.\n",
"We would like each first value to be in its own column, each second value in its own column, and so on.\n",
"If a certain value is missing, the value should be `null`.\n",
"\n",
"Expand the column of lists into columns named 'delays_' and replace the unwanted `RecentDelays` column in `df` with 'delays'."
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-05-27T17:05:05.593051Z",
"start_time": "2025-05-27T17:05:05.391206Z"
}
},
"cell_type": "code",
"source": [
"val prep_df = df2.split { RecentDelays }.into { \"delay_$it\" }\n",
"prep_df"
],
"outputs": [
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" | From | To | FlightNumber | delay_1 | delay_2 | delay_3 | Airline |
|---|
| London | Paris | 10045 | 23 | 47 | null | KLM |
| Madrid | Milan | 10055 | null | null | null | Air France |
| London | Stockholm | 10065 | 24 | 43 | 87 | British Airways |
| Budapest | Paris | 10075 | 13 | null | null | Air France |
| Brussels | London | 10085 | 67 | 32 | null | Swiss Air |
\n",
" \n",
" \n",
" "
],
"application/kotlindataframe+json": "{\"$version\":\"2.1.1\",\"metadata\":{\"columns\":[\"From\",\"To\",\"FlightNumber\",\"delay_1\",\"delay_2\",\"delay_3\",\"Airline\"],\"types\":[{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int?\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int?\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.Int?\"},{\"kind\":\"ValueColumn\",\"type\":\"kotlin.String\"}],\"nrow\":5,\"ncol\":7},\"kotlin_dataframe\":[{\"From\":\"London\",\"To\":\"Paris\",\"FlightNumber\":10045,\"delay_1\":23,\"delay_2\":47,\"delay_3\":null,\"Airline\":\"KLM\"},{\"From\":\"Madrid\",\"To\":\"Milan\",\"FlightNumber\":10055,\"delay_1\":null,\"delay_2\":null,\"delay_3\":null,\"Airline\":\"Air France\"},{\"From\":\"London\",\"To\":\"Stockholm\",\"FlightNumber\":10065,\"delay_1\":24,\"delay_2\":43,\"delay_3\":87,\"Airline\":\"British Airways\"},{\"From\":\"Budapest\",\"To\":\"Paris\",\"FlightNumber\":10075,\"delay_1\":13,\"delay_2\":null,\"delay_3\":null,\"Airline\":\" Air France\"},{\"From\":\"Brussels\",\"To\":\"London\",\"FlightNumber\":10085,\"delay_1\":67,\"delay_2\":32,\"delay_3\":null,\"Airline\":\"Swiss Air\"}]}"
},
"execution_count": 82,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 82
},
{
"metadata": {},
"cell_type": "markdown",
"source": "The dataframe looks much better now!\n"
},
{
"metadata": {},
"cell_type": "code",
"outputs": [],
"execution_count": null,
"source": ""
}
],
"metadata": {
"kernelspec": {
"display_name": "Kotlin",
"language": "kotlin",
"name": "kotlin"
},
"language_info": {
"codemirror_mode": "text/x-kotlin",
"file_extension": ".kt",
"mimetype": "text/x-kotlin",
"name": "kotlin",
"nbconvert_exporter": "",
"pygments_lexer": "kotlin",
"version": "1.8.0-dev-707"
},
"ktnbPluginMetadata": {
"projectLibraries": false
}
},
"nbformat": 4,
"nbformat_minor": 1
}