{ "cells": [ { "cell_type": "markdown", "id": "95f0a171", "metadata": {}, "source": [ "(boolean-data)=\n", "# Boolean Data\n", "\n", "## Introduction\n", "\n", "In this chapter, we'll introduce boolean data: data that can be `True` or `False` (which can also be encoded as 1s or 0s). We'll first look at Python's built-in `True` and `False` values before seeing how booleans work in data frames.\n", "\n", "## Booleans\n", "\n", "Some of the most important operations you will perform are with `True` and `False` values, which make up the boolean data type. These are built-in Python values, just as numbers such as `1` are.\n", "\n", "### Boolean Variables and Conditions\n", "\n", "Assigning the value `True` or `False` to a variable works the same way as any other assignment:" ] }, { "cell_type": "code", "execution_count": null, "id": "7e35b9fc", "metadata": {}, "outputs": [], "source": [ "bool_variable = True\n", "bool_variable" ] }, { "cell_type": "markdown", "id": "b32b1251", "metadata": {}, "source": [ "There are two types of operation that are associated with booleans: boolean operations, in which existing booleans are combined, and condition operations, which create a boolean when evaluated.\n", "\n", "Boolean operators that return booleans are as follows:\n", "\n", "| Operator | Description |\n", "| :---: | :--- |\n", "| `x and y` | are `x` and `y` both True? |\n", "| `x or y` | is at least one of `x` and `y` True? |\n", "| `not x` | is `x` False? |\n", "\n", "These behave as you'd expect: `True and False` evaluates to `False`, while `True or False` evaluates to `True`. There's also the `not` keyword, which flips a boolean. For example," ] }, { "cell_type": "code", "execution_count": null, "id": "590cd75d", "metadata": {}, "outputs": [], "source": [ "not True" ] }, { "cell_type": "markdown", "id": "dcd0a01b", "metadata": {}, "source": [ "evaluates to `False`, as you would expect.\n", "\n", "Conditions are expressions that evaluate as booleans. A simple example is `10 == 20`. 
The `==` is an operator that compares the objects on either side and returns `True` if they have the same *value*, though be careful when using it to compare different data types.\n", "\n", "Here's a table of conditions that return booleans:\n", "\n", "| Operator | Description |\n", "| :-------- | :----------------------------------- |\n", "| `x == y` | is `x` equal to `y`? |\n", "| `x != y` | is `x` not equal to `y`? |\n", "| `x > y` | is `x` greater than `y`? |\n", "| `x >= y` | is `x` greater than or equal to `y`? |\n", "| `x < y` | is `x` less than `y`? |\n", "| `x <= y` | is `x` less than or equal to `y`? |\n", "| `x is y` | is `x` the same object as `y`? |\n", "\n", "\n", "As you can see from the table, the opposite of `==` is `!=`, which you can read as 'not equal to the value of'. Here's an example of `==`:" ] }, { "cell_type": "code", "execution_count": null, "id": "51622575", "metadata": {}, "outputs": [], "source": [ "boolean_condition = 10 == 20\n", "print(boolean_condition)" ] }, { "cell_type": "markdown", "id": "e4367753", "metadata": {}, "source": [ "```{admonition} Exercise\n", "What does `not (not True)` evaluate to?\n", "```" ] }, { "cell_type": "markdown", "id": "91dd45b6", "metadata": {}, "source": [ "The real power of conditions comes when we start to use them in more complex examples. Some of the keywords used to build and act on conditions are `if`, `else`, `and`, `or`, `in`, `not`, and `is`. 
Here's an example showing how some of these conditional keywords work:" ] }, { "cell_type": "code", "execution_count": null, "id": "0c550daa", "metadata": {}, "outputs": [], "source": [ "name = \"Ada\"\n", "score = 99\n", "\n", "if name == \"Ada\" and score > 90:\n", "    print(\"Ada, you achieved a high score.\")\n", "\n", "if name == \"Smith\" or score > 90:\n", "    print(\"You could be called Smith or have a high score\")\n", "\n", "if name != \"Smith\" and score > 90:\n", "    print(\"You are not called Smith and you have a high score\")" ] }, { "cell_type": "markdown", "id": "6e01a3f9", "metadata": {}, "source": [ "All three of these conditions evaluate as True, and so all three messages get printed. Given that `==` and `!=` test for equality and inequality, respectively, you may be wondering what the keywords `is` and `is not` are for. Remember that everything in Python is an object, and that two distinct objects can hold the same value. `==` and `!=` compare *values*, while `is` and `is not` compare *identity*: whether two names refer to the very same object. For example," ] }, { "cell_type": "code", "execution_count": null, "id": "7420e1c1", "metadata": {}, "outputs": [], "source": [ "name_list = [\"Ada\", \"Adam\"]\n", "name_list_two = [\"Ada\", \"Adam\"]\n", "\n", "# Compare values\n", "print(name_list == name_list_two)\n", "\n", "# Compare objects\n", "print(name_list is name_list_two)" ] }, { "cell_type": "markdown", "id": "3a78e664", "metadata": {}, "source": [ "Note that code with lots of branching `if` statements is hard to follow, both for you and for anyone else who reads your code. Some automatic code checkers will pick this up and tell you that your code is too complex. Almost all of the time, there's a way to rewrite your code without lots of branching logic that will be better and clearer than having many nested `if` statements." ] }, { "cell_type": "markdown", "id": "56b4f2ef", "metadata": {}, "source": [ "One of the most useful conditional keywords is `in`. 
This one must pop up ten times a day in most coders' lives because it tests whether a value is a member of a collection, such as a list or a string." ] }, { "cell_type": "code", "execution_count": null, "id": "39caa7be", "metadata": {}, "outputs": [], "source": [ "name_list = [\"Lovelace\", \"Smith\", \"Hopper\", \"Babbage\"]\n", "\n", "print(\"Lovelace\" in name_list)\n", "\n", "print(\"Bob\" in name_list)" ] }, { "cell_type": "markdown", "id": "e044d14a", "metadata": {}, "source": [ "```{admonition} Exercise\n", "Check if \"a\" is in the string \"Walloping weasels\" using `in`. Is \"a\" `in` \"Anodyne\"?\n", "```" ] }, { "cell_type": "markdown", "id": "19f0782e", "metadata": {}, "source": [ "The opposite is `not in`, which returns `True` when the value is absent.\n", "\n", "Finally, one conditional construct you're bound to use at *some* point is the `if`...`else` structure:" ] }, { "cell_type": "code", "execution_count": null, "id": "95794e71", "metadata": {}, "outputs": [], "source": [ "score = 98\n", "\n", "if score == 100:\n", "    print(\"Top marks!\")\n", "elif score > 90 and score < 100:\n", "    print(\"High score!\")\n", "elif score > 10 and score <= 90:\n", "    pass\n", "else:\n", "    print(\"Better luck next time.\")" ] }, { "cell_type": "markdown", "id": "94a359e7", "metadata": {}, "source": [ "Note that this does nothing if the score is between 11 and 90 (`pass` is a placeholder statement that does nothing), and prints a message otherwise.\n", "\n", "```{admonition} Exercise\n", "Create a new `if` ... `elif` ... `else` statement that prints \"well done\" if a score is over 90, \"good\" if between 40 and 90, and \"bad luck\" otherwise.\n", "```\n", "\n", "One nice feature of Python is that you can chain multiple comparisons in a single expression, so `1 < a < b` means `1 < a and a < b`."
] }, { "cell_type": "code", "execution_count": null, "id": "cd1cd061", "metadata": {}, "outputs": [], "source": [ "a, b = 3, 6\n", "\n", "1 < a < b < 20" ] }, { "cell_type": "markdown", "id": "4f06e97c", "metadata": {}, "source": [ "### Conditions in list comprehensions\n", "\n", "List comprehensions are an incredibly useful pattern in Python. Here's a simple one that produces a list of the first 12 integers starting from 0:" ] }, { "cell_type": "code", "execution_count": null, "id": "59638407", "metadata": {}, "outputs": [], "source": [ "[x for x in range(12)]" ] }, { "cell_type": "markdown", "id": "b58abd5e", "metadata": {}, "source": [ "Booleans bring conditionality to the table. We'll add an `if` clause followed by a condition that evaluates to either True or False depending on the value of `x`; only values for which the condition is True are kept. So, for example, we can ask for only those numbers that are divisible by 2:" ] }, { "cell_type": "code", "execution_count": null, "id": "8e8072ea", "metadata": {}, "outputs": [], "source": [ "[x for x in range(12) if x % 2 == 0]" ] }, { "cell_type": "markdown", "id": "3587832f", "metadata": {}, "source": [ "This trick even works with an `else` clause, but note that we have moved both `if` and `else` before the `for x in ...` part. Here, `x if ... else ...` is a conditional expression that chooses the value for each element rather than filtering elements out:" ] }, { "cell_type": "code", "execution_count": null, "id": "ec01f460", "metadata": {}, "outputs": [], "source": [ "[x if x % 2 == 0 else \"Not divisible by 2\" for x in range(12)]" ] }, { "cell_type": "markdown", "id": "6c48ec42", "metadata": {}, "source": [ "### Truthy and Falsy Values" ] }, { "cell_type": "markdown", "id": "baf6cfd4", "metadata": {}, "source": [ "Python objects can be used where a boolean value is expected, such as when a list, `listy`, is used with `if listy`. Built-in Python objects that are empty are usually evaluated as `False`, and are said to be 'falsy'. 
In contrast, when these built-in objects are not empty, they evaluate as `True` and are said to be 'truthy'.\n", "Let's see some examples:" ] }, { "cell_type": "code", "execution_count": null, "id": "dc605a93", "metadata": {}, "outputs": [], "source": [ "listy = []\n", "other_listy = [1, 2, 3]\n", "\n", "if not listy:\n", "    print(\"Falsy\")\n", "else:\n", "    print(\"Truthy\")" ] }, { "cell_type": "code", "execution_count": null, "id": "da8fe682", "metadata": {}, "outputs": [], "source": [ "if not other_listy:\n", "    print(\"Falsy\")\n", "else:\n", "    print(\"Truthy\")" ] }, { "cell_type": "markdown", "id": "ad7b8743", "metadata": {}, "source": [ "This approach doesn't just work on lists; it works for many other truthy and falsy objects:" ] }, { "cell_type": "code", "execution_count": null, "id": "d80ba0be", "metadata": {}, "outputs": [], "source": [ "if not 0:\n", "    print(\"Falsy\")\n", "else:\n", "    print(\"Truthy\")" ] }, { "cell_type": "code", "execution_count": null, "id": "1973d44d", "metadata": {}, "outputs": [], "source": [ "if not [0, 0, 0]:\n", "    print(\"Falsy\")\n", "else:\n", "    print(\"Truthy\")" ] }, { "cell_type": "markdown", "id": "2d836d8c", "metadata": {}, "source": [ "Note that zero is falsy (it's the numeric equivalent of nothing), but a list of three zeros is not an empty list, so it evaluates as truthy. `None` is falsy too:" ] }, { "cell_type": "code", "execution_count": null, "id": "62840c4a", "metadata": {}, "outputs": [], "source": [ "if not None:\n", "    print(\"Falsy\")\n", "else:\n", "    print(\"Truthy\")" ] }, { "cell_type": "markdown", "id": "1e69d8ff", "metadata": {}, "source": [ "Knowing what is truthy or falsy is useful in practice; imagine you'd like to default to a specific behaviour if a list called `list_vals` doesn't have any values in it. You now know you can do it simply with `if not list_vals`."
] }, { "cell_type": "markdown", "id": "303b1038", "metadata": {}, "source": [ "### any() and all()\n", "\n", "Of course, there is a big wide world of booleans out there; they don't always occur on their own. That's why the built-in functions `any()` and `all()` exist. These apply to *iterables* of booleans, like a list of booleans.\n", "\n", "`any()` takes an iterable of booleans and returns `True` if at least one value is true:" ] }, { "cell_type": "code", "execution_count": null, "id": "bdcb09a5", "metadata": {}, "outputs": [], "source": [ "any([True, False, False])" ] }, { "cell_type": "markdown", "id": "541862c4", "metadata": {}, "source": [ "`all()` takes an iterable of booleans and returns `True` only if *all* values are true:" ] }, { "cell_type": "code", "execution_count": null, "id": "2f666185", "metadata": {}, "outputs": [], "source": [ "all([True, True, True, True])" ] }, { "cell_type": "markdown", "id": "767f254b", "metadata": {}, "source": [ "Both of these also work for 1s and 0s, because 1 is truthy and 0 is falsy. Here, `all()` returns `False` because the list contains zeros:" ] }, { "cell_type": "code", "execution_count": null, "id": "78777d9c", "metadata": {}, "outputs": [], "source": [ "all([0, 0, 0, 1])" ] }, { "cell_type": "markdown", "id": "8e06930b", "metadata": {}, "source": [ "## Booleans in **pandas** data frames\n", "\n", "### Operations on booleans in data frames\n", "\n", "Quite often, you will run into a scenario where you're working with data that have True or False values in a data frame. It is easy to create a column of booleans in a **pandas** data frame:" ] }, { "cell_type": "code", "execution_count": null, "id": "7f338fd7", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "df = pd.DataFrame.from_dict(\n", "    {\n", "        \"bool_col_1\": [False] * 3 + [True, True],\n", "        \"bool_col_2\": [True, False, True, False, True],\n", "    }\n", ")\n", "df" ] }, { "cell_type": "markdown", "id": "c7a0cd32", "metadata": {}, "source": [ "We can perform operations on these just like regular **pandas** data frame columns. 
These accept `&` (and), `|` (or), `==` (equal), and `!=` (not equal) as operations:" ] }, { "cell_type": "code", "execution_count": null, "id": "9cdaec7a", "metadata": {}, "outputs": [], "source": [ "df[\"bool_col_1\"] | df[\"bool_col_2\"]" ] }, { "cell_type": "markdown", "id": "25dede69", "metadata": {}, "source": [ "Quite often, it's useful to have a count of the number of true values. If you take the sum of boolean columns in a **pandas** data frame, it will tot up the number of `True` values:" ] }, { "cell_type": "code", "execution_count": null, "id": "89ee3e44", "metadata": {}, "outputs": [], "source": [ "df.sum()" ] }, { "cell_type": "markdown", "id": "f2dc1d93", "metadata": {}, "source": [ "And if you ever get data formatted as 1s and 0s rather than True and False, it's easy to convert by changing the data type:" ] }, { "cell_type": "code", "execution_count": null, "id": "5e30cee7", "metadata": {}, "outputs": [], "source": [ "df = pd.DataFrame.from_dict({\"bool_col\": [0, 1, 0, 1, 1]})\n", "df[\"bool_col\"].astype(bool)" ] }, { "cell_type": "markdown", "id": "98e401db", "metadata": {}, "source": [ "### Creating booleans from comparisons using columns\n", "\n", "\n", "It's also possible to create boolean columns from numerical (or some other) columns. Let's use the diamonds dataset to demonstrate this:" ] }, { "cell_type": "code", "execution_count": null, "id": "9f63005f", "metadata": {}, "outputs": [], "source": [ "diamonds = pd.read_csv(\n", " \"https://github.com/mwaskom/seaborn-data/raw/master/diamonds.csv\"\n", ")\n", "diamonds.head()" ] }, { "cell_type": "markdown", "id": "67c2e2cf", "metadata": {}, "source": [ "We're going to create a new boolean variable for whenever the price is above 1000." 
] }, { "cell_type": "code", "execution_count": null, "id": "7a27f0a0", "metadata": {}, "outputs": [], "source": [ "diamonds[\"expensive\"] = diamonds[\"price\"] > 1000\n", "diamonds.sample(10)" ] }, { "cell_type": "markdown", "id": "8383cab5", "metadata": {}, "source": [ "Of course, this could also have been achieved in a call to `assign()`:" ] }, { "cell_type": "code", "execution_count": null, "id": "c78a6d47", "metadata": {}, "outputs": [], "source": [ "diamonds.assign(expensive=lambda x: x[\"price\"] > 1000).head()" ] }, { "cell_type": "markdown", "id": "622ff486", "metadata": {}, "source": [ "Another use of booleans that is quite useful when it comes to data frames is the `.isin()` method. For example, you might want a True or False value for each column name in a data frame, according to whether it appears in a given list:" ] }, { "cell_type": "code", "execution_count": null, "id": "d12f537d", "metadata": {}, "outputs": [], "source": [ "diamonds.columns.isin([\"x\", \"y\", \"z\"])" ] }, { "cell_type": "markdown", "id": "6de309a4", "metadata": {}, "source": [ "### any() and all() in data frames\n", "\n", "A **pandas** column of booleans behaves a lot like a list of booleans, and we can apply the same logic to it via the built-in **pandas** `.any()` and `.all()` methods. We expect some entries for `\"expensive\"` to be true, so `.any()` should return `True`:" ] }, { "cell_type": "code", "execution_count": null, "id": "35e73305", "metadata": {}, "outputs": [], "source": [ "diamonds[\"expensive\"].any()" ] }, { "cell_type": "markdown", "id": "a64edf0c", "metadata": {}, "source": [ "### Logical subsetting\n", "\n", "Although we've been effectively using this all along, it's useful to make it explicit: booleans can be used to logically subset a data frame. 
Let's say we only want the bits of a data frame where `x` is greater than `y`:" ] }, { "cell_type": "code", "execution_count": null, "id": "d1bbb0fa", "metadata": {}, "outputs": [], "source": [ "diamonds[diamonds[\"x\"] > diamonds[\"y\"]]" ] }, { "cell_type": "markdown", "id": "9b5906ba", "metadata": {}, "source": [ "The expression `diamonds[\"x\"] > diamonds[\"y\"]` creates a column of booleans that is used to filter to just the rows where the condition is true." ] } ], "metadata": { "interpreter": { "hash": "9d7534ecd9fbc7d385378f8400cf4d6cb9c6175408a574f1c99c5269f08771cc" }, "jupytext": { "cell_metadata_filter": "-all", "encoding": "# -*- coding: utf-8 -*-", "formats": "md:myst", "main_language": "python" }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" }, "toc-showtags": true }, "nbformat": 4, "nbformat_minor": 5 }