{
"cells": [
{
"cell_type": "markdown",
"id": "506cc8e4",
"metadata": {},
"source": [
"[Back to Table of Contents](https://www.shannonmburns.com/Psyc158/intro.html)\n",
"\n",
"[Previous: Chapter 4 - Cleaning Data](https://colab.research.google.com/github/smburns47/Psyc158/blob/main/chapter-4.ipynb)"
]
},
{
"cell_type": "markdown",
"id": "c7614989",
"metadata": {},
"source": [
"# Chapter 5 - Describing Data \n",
"\n",
"Over the last four chapters you've developed the basic proficiency in coding and data management that will enable you to explore and analyze data in many contexts. These skills are an important foundation for the rest of the course, so it is highly recommended that you practice with them and return to them to review frequently (and keep this url handy to come back to in future classes). \n",
"\n",
"Next, we are going to turn to how we can apply these skills for statistical insights on the data we've been working with. In this chapter, we will learn about the use of descriptive statistics for summarizing datasets. \n",
"\n",
"## 5.1 The concept of a distribution\n",
"\n",
"Assuming we have a tidy dataset with many variables, it is a good idea to look at the variation in your measures. This can give you clues about what kind of cleaning is needed, if any errors in data collection happened, and what sort of analyses will best suit the data. This leads us to one of the most fundamental concepts in statistics, the concept of a **distribution**.\n",
"\n",
"A distribution reflects the specific pattern of variation in a variable or set of variables. It is how the data are divided among different possible values. Thinking about distributions requires you to think abstractly, at a higher level, about your data. You must shift your thinking from a focus on the individual observations in your data set (e.g., the 60 people you have sampled) to a focus on all the observations as a group, and the pattern of how they vary. The concept of a distribution allows us to see the whole as greater than the sum of the parts; the forest, and not just the trees.\n",
"\n",
"The features of a forest cannot be seen in a single tree. Measuring the height of a single tree does not allow you to see characteristics of the distribution of height. You can know the height of that one tree, but not the minimum, maximum, or average height of trees in the forest based on a single measurement. Statistics such as the mean do not themselves constitute a distribution; they are features of a distribution, features that don’t apply to individual trees.\n",
"\n",
"Note that not just any bunch of numbers can be thought of as a distribution. The numbers must all be measures of the same attribute. So, for example, if you have measures of height and weight on a sample of 60 people, you can’t just lump the height and weight numbers into a single distribution. You can, however, examine the distribution of height and the distribution of weight separately."
]
},
{
"cell_type": "markdown",
"id": "253f750f",
"metadata": {},
"source": [
"## 5.2 Visualizing distributions\n",
"\n",
"### Histograms\n",
"When first learning about distributions, it can help to visualize them. Below are some examples of what distributions can look like: \n",
"\n",
"
\n",
"\n",
"This type of image is called a **histogram**. On the x-axis is some value that an observation can have on a variable. In the examples above we see (clockwise from upper left): anxiety scores from the Tetris Memories study; the number of intrusive memories those participants experienced; how long a sample of students slept per night, measured in hours; and the heights of a sample of people, measured in inches. The y-axis represents the *frequency* of some score or range of scores in a sample. So, in the first histogram (in purple), the height of the bars does not represent what STAI score someone got, but instead represents the number of participants in this sample who got a certain score.\n",
"\n",
"There are lots of ways to make histograms in R. In this course we will use the package ```ggformula``` to make our visualizations. It is a large and flexible package, so it can be good to read through a guide to [everything you can do with it](http://www.mosaic-web.org/ggformula/articles/pkgdown/ggformula-long.html) for help or inspiration. ```ggformula``` is a weird name, but that’s what the authors of this package called it. Because of that, many of the ```ggformula``` commands are going to start with ```gf_```; the ```g``` stands for \"graphical\" and the ```f``` stands for \"formula\". \n",
"\n",
"We will start by making a histogram with the ```gf_histogram()``` function. Here is how to make a basic histogram of ```intrusive_memories``` from the ```tetrismemories``` data frame."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3156255a",
"metadata": {
"scrolled": true,
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"install.packages(\"ggformula\")\n",
"library(ggformula)\n",
"\n",
"tetrismemories <- read.csv(\"https://raw.githubusercontent.com/smburns47/Psyc158/main/tetrismemories.csv\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "89693d51",
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"gf_histogram(gformula = ~ intrusive_memories, data = tetrismemories)"
]
},
{
"cell_type": "markdown",
"id": "a315feb8",
"metadata": {},
"source": [
"```gf_formula``` takes two arguments. The first is a *formula* (using the argument name ```gformula```). In this formula, notice that the variable we want to display, ```intrusive_memories```, is placed after a ~ symbol (tilde). The ~ works like an equal sign in the equation for graphing a line (e.g., ```y = mx + b``` from geometry class). So in R, whenever you put something before the ~, its values go on the y-axis and whenever you put something after the ~, its values go on the x-axis. A histogram is a special case where the y-axis is just a count related to the variable on the x-axis, not a different variable. Thus, the ```gformula``` argument in ```gf_formula``` is the formula ``` ~ instrusive_memories```, which tells R \"plot the variable intrusive_memories on the x-axis.\"\n",
"\n",
"The second argument is ```data = tetrismemories```, which tells R which data frame to find ```intrusive_memories``` in.\n",
"\n",
"This is an example of using named arguments to explicitly tell R which argument values should be what. By using the named argument syntax with ```=``` signs, you can put these name/value pairs in any order inside the function call; e.g. ```gf_formula(data = tetrismemories, gformula = ~ intrusive_memories)``` would still work. You can also leave off the argument names ```gformula =``` and ```data =```, but in that case you have to enter the argument values in the specific order this function expects by default (formula first, then data). [Looking up](https://www.rdocumentation.org/packages/ggformula/versions/0.6/topics/gf_histogram) a function online or using the ```?``` symbol before a function name will always tell you how to use a function if you forget.\n",
"\n",
"Now say we wanted to change the histogram to one that plots the variable ```STAI_T``` on the x-axis instead. Rewrite the code below to do so."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "81a3c649",
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"#Change the plotted variable to STAI_T\n",
"gf_histogram(gformula = ~ intrusive_memories, data = tetrismemories)"
]
},
{
"cell_type": "markdown",
"id": "d935c7aa",
"metadata": {},
"source": [
"Because the variable on the x-axis is often measured on a continuous scale with many possible values, the bars in the histograms usually represent a range of values, called bins. We’ll illustrate this idea of bins by creating a simple outcome variable called ```outcome```, and displaying it on a histogram."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ec5f7cd4",
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"outcome <- c(1,2,3,4,5)\n",
"tiny_data <- data.frame(outcome) #gf_histogram only works on dataframe objects, not vectors\n",
"\n",
"#Write some code below here to plot a histogram of the variable \"outcome\" in the dataframe \"tiny_data\"\n"
]
},
{
"cell_type": "markdown",
"id": "9b9ad922",
"metadata": {},
"source": [
"This histogram shows gaps between the bars because by default ```gf_histogram()``` sets up 30 bins, even though we only have five possible numbers in our variable. If we change the number of bins to 5, then we’ll get rid of the gaps between the bars. To do this, we add in a new argument that ```gf_histogram()``` can accept: ```bins```. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f614f66a",
"metadata": {
"scrolled": true,
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"gf_histogram(gformula = ~ outcome, data = tiny_data, bins = 5)"
]
},
{
"cell_type": "markdown",
"id": "b7be81bb",
"metadata": {},
"source": [
"Now we only have one bin per data value. However, since there are exactly one of each type of value, this histogram looks like a rectangle. To help distinguish things more, we can change the colors of the histogram, adding in the arguments ```color``` and ```fill```. What do you think each of these arguments does?"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8193db8d",
"metadata": {
"scrolled": true,
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"gf_histogram(gformula = ~ outcome, data = tiny_data, bins = 5, color = \"gray\", fill = \"blue\")"
]
},
{
"cell_type": "markdown",
"id": "13726bf0",
"metadata": {},
"source": [
"In chapter 7, we'll talk much more about how to tweak the details of our data visualizations."
]
},
{
"cell_type": "markdown",
"id": "8aa21a6b",
"metadata": {},
"source": [
"### Density plots\n",
"**Density plots** are a lot like histograms, except on the y-axis is the *proportion* (or percentage) of the data that falls into a bin, rather than the raw count number. The only real difference between a histogram and density plot is how the y-axis is represented.\n",
"\n",
"The function ```gf_dhistogram``` will plot a density plot (note the d before the word histogram, to distinguish it from the regular histogram function). "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e1e98cc3",
"metadata": {
"scrolled": true,
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"gf_dhistogram(gformula = ~ outcome, data = tiny_data, bins = 5, color = \"gray\", fill = \"blue\")"
]
},
{
"cell_type": "markdown",
"id": "2cde09c8",
"metadata": {},
"source": [
"If everything worked right, the general shape of this plot should match the one above. The only difference is, the bins show a density of 0.2 instead of 1, since 1 value out of 5 means 20% of the sample is in that bin.\n",
"\n",
"In the window below, write some code to make a density plot of the ```tetris_total_score``` variable in the ```tetrismemories``` data."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0c49152e",
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"#Finish the code below to make a density plot of Age in mindsetmatters\n",
"gf_dhistogram(#set formula here, #set data frame here, #set bin and color options here)\n"
]
},
{
"cell_type": "markdown",
"id": "581d3b0b",
"metadata": {},
"source": [
"You may have gotten a warning by running this that said ```Warning message: \"Removed 36 rows containing non-finite values (`stat_bin()`).”```. Warnings in R mean your code can still run, but the program thinks something happened that you should pay attention to. In this case, there are 36 observations that have missing data in ```tetris_total_score```. Since it's impossible to plot something that doesn't exist, this function ignored those rows when plotting. Some functions are able to manage missing values this way, others will break completely with an error when you try to pass data with missing values. As you use R more you'll start getting a sense of which ones behave which way. You'll also start learning what error messages mean so you can learn to use new functions correctly. \n",
"\n",
"In summary, the very first thing you should always do when analyzing data is to examine the distributions of your variables. If you skip this step, and go directly to the application of more complex statistical procedures without knowning the nature of your data, you do so at your own peril. Histograms are a key tool for examining distributions of variables to make sure the data are ok, and that your planned analysis will be appropriate. \n",
"\n",
"Besides visualizing a distribution, we can also describe it with specific metrics. In general, a distribution can be described with three things: its shape, center, and variability."
]
},
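{
"cell_type": "markdown",
"id": "f3a91c20",
"metadata": {},
"source": [
"Here is the quick check promised above. This is a minimal sketch using base R's ```is.na()```; it assumes ```tetrismemories``` is still loaded from earlier in the chapter."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f3a91c21",
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"#is.na() returns TRUE for each missing value; summing the TRUEs counts them\n",
"#This should match the 36 removed rows mentioned in the warning\n",
"sum(is.na(tetrismemories$tetris_total_score))"
]
},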
{
"cell_type": "markdown",
"id": "876572f0",
"metadata": {},
"source": [
"## 5.3 Shape\n",
"\n",
"Look back at the density plot you made for ```tetrismemories$intrusive_memories```, and take note of the general shape the whole dataset takes. Where is the peak of the the distribution? What are the most infrequent values? Statisticians describe the shapes of distributions using a few key features. Distributions can be **symmetrical**, or they can be **skewed**. A symmetrical distribution has a peak generally in the center of the x-axis, and the left side of the shape is close to a mirror image of the right side. If a distribution is skewed, it can be skewed left (the skinny longer tail is on the left, like someone stretched out that side) or skewed right (the skinny longer tail is on the right). \n",
"\n",
"It's rare to find a distribution perfectly symmetrical, but the more skewed something is the more we should take note of it. The distribution above has a fairly large skew to the right.\n",
"\n",
"Other ways to talk about the shape of distributions include:\n",
"- **uniform**, meaning the number of observations is evenly distributed across the possible scores and there is no peak in the data (remember the rectangle histogram we made earlier?) \n",
"- **unimodal**, meaning that there is one peak in the data \n",
"- **bimodal** (or multimodal), having two (or more) clear peaks with only a few data values in between\n",
"\n",
"Do you think the ```intrusive_memories``` distribution above is uniform, unimodal, or bimodal?\n",
"\n",
"You will often see distributions that match a particular shape and skewness. These special distribution types have names. For instance, the **normal** distribution is symmetric and unimodal, looking like a bell:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0a31bf94",
"metadata": {
"scrolled": true,
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"normal_data <- rnorm(n = 1000, mean = 0, sd = 1) \n",
"#rnorm is a function that generates data in a normal distribution;\n",
"# we'll talk about simulating data like this in a later chapter\n",
"\n",
"gf_dhistogram(gformula = ~ data_pts, data = data.frame(data_pts = normal_data))"
]
},
{
"cell_type": "markdown",
"id": "0191f0fa",
"metadata": {},
"source": [
"Usually, distributions are kind of lumpy and jagged, so many of these features should be thought of with the word “roughly” in front of them. Even if a distribution doesn’t have exactly the same number of observations across all possible scores — but has roughly the same number — we should still call that distribution uniform. If you look at the density plot above, you might see some extra lumpiness along the sides of the bell curve. Some people might initially think this is a bimodal distribution. But statisticians would consider it roughly unimodal and roughly normal because the lumps are quite small compared to the main peak.\n",
"\n",
"If a distribution is unimodal, it is often useful to notice where the center of the distribution lies. If lots of observations are clustered around the middle, then the value of that middle could be a handy summary of the sample of scores, letting you make statements such as, “Most of the data in this sample are around this middle value.”"
]
},
{
"cell_type": "markdown",
"id": "e99514ad",
"metadata": {},
"source": [
"## 5.4 Center\n",
"\n",
"There are specific numbers that represent what that middle point is for a distribution. We mentioned in chapter 1 that one of the central principles of statistics is the idea that we can better understand the world by throwing away information, and that’s exactly what we are doing when we summarize a dataset and describe its center value. In most situations, after checking out the shape of the data distribution, the next thing that you’ll want to calculate is a measure of **central tendency**. That is, you’d like to know something about where the “average” or “typical” value of your data lies. The three most commonly used measures are the mean, median, and mode.\n",
"\n",
"### Mean\n",
"The **mean** of a set of observations is a traditional average: add all of the values up, and then divide by the total number of values. If a student's five exam scores were 76, 91, 86, 80 and 92, the mean of those scores would be:\n",
"\n",
"$$\\frac{76 + 91 + 86 + 80 + 93}{5} = 85.2$$\n",
"\n",
"To calculate the mean in R, you could type out that exact formula above and have R work for you like a calculator: ```(76 + 91 + 86 + 80 + 93)/5```. However, that’s not the only way to do the calculation, and when the number of observations starts to become large, it’s very tedious. Besides, in almost every real world scenario, you’ve already got the actual numbers stored in a variable of some kind, like ```test_scores <- c(76,91,86,80,93)```. Under those circumstances, you can use a combination of the ```sum()``` function and the ```length()``` function:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f6162918",
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"test_scores <- c(76,91,86,80,93)\n",
"#This computes a mean\n",
"sum(test_scores) / length(test_scores)"
]
},
{
"cell_type": "markdown",
"id": "df3c89ee",
"metadata": {},
"source": [
"Although it’s pretty easy to calculate the mean like this, we can do it in an even easier way. Since the mean is such a common metric to compute, R also provides us with the ```mean()``` function. Simply pass a vector to this fuction to return the mean. Try it in the code block below."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7c7841e1",
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"#Use mean() to calculate the mean of test_scores:\n",
"\n",
"#Evaluate the code below. Do you think it will be the same as your mean calculation?\n",
"sum(test_scores) / length(test_scores)"
]
},
{
"cell_type": "markdown",
"id": "5464a561",
"metadata": {},
"source": [
"### Median\n",
"The mean is a really useful way of describing what the central tendency of a distribution is. However, sometimes it doesn't work as well as one might think. Imagine two distributions: ```dist1 <- c(10, 10, 11, 9, 11, 12, 8, 9, 10)``` and ```dist2 <- c(10, 10, 11, 9, 11, 12, 8, 9, 10, 100).``` Plot these as density plots below."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "664e3471",
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"dist1 <- c(10, 10, 11, 9, 11, 12, 8, 9, 10)\n",
"dist2 <- c(10, 10, 11, 9, 11, 12, 8, 9, 10, 100)\n",
"\n",
"#Write code to plot the density plot of each of these distributions, using df_dhistogram()\n"
]
},
{
"cell_type": "markdown",
"id": "5c47b5c1",
"metadata": {},
"source": [
"While ```dist2``` has an obvious outlier, most people would say a value that best represents the majority of the data is still around 10, like in ```dist1```. However, now calculate the mean of each distribution:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5dd2927e",
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"#Calculate mean of dist1\n",
"\n",
"#Calculate mean of dist2\n"
]
},
{
"cell_type": "markdown",
"id": "69a05f75",
"metadata": {},
"source": [
"The mean of ```dist2``` is way off of the mean of ```dist1```. This sort of situation is not uncommon - in fact, the mean of any distribution with appreciable skew will fall towards the long tail rather than the peak of the rest of the data. In this case, a better measure of central tendency can be the **median**. \n",
"\n",
"The median is even easier to describe than the mean. The median of a set of observations is just the middle value, when the values are in numerical order. As before, let’s imagine we were interested in a student's five test scores. To figure out the median, we sort these numbers into ascending order, 76,80,**86**,91,93. The value in the middle, bolded here, is the median. \n",
"\n",
"It's easy to find the middle value in a set of numbers that are an odd-numbered length. But what should we do if ```test_scores``` had 6 values instead? E.g., 76,80,**86,91**,93,94. Now there are *two* middle numbers, 86 and 91. In this case, the median is defined as the average of these: 88.5. \n",
"\n",
"As before, it’s very tedious to do this by hand when you’ve got lots of numbers. Luckily in R, we have the ```median()``` function:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "22241fc3",
"metadata": {
"scrolled": true,
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"median(test_scores)"
]
},
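{
"cell_type": "markdown",
"id": "f3a91c22",
"metadata": {},
"source": [
"To check the even-length case described above, you can hand ```median()``` the six-score example directly. A minimal sketch:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f3a91c23",
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"#With an even number of values, median() averages the two middle numbers\n",
"six_scores <- c(76, 80, 86, 91, 93, 94)\n",
"median(six_scores) #(86 + 91)/2 = 88.5"
]
},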
{
"cell_type": "markdown",
"id": "da0b586c",
"metadata": {},
"source": [
"### Mode\n",
"The last common measure of central tendency is the **mode**. This is the most frequent value within our data - what we've been calling the \"peak\" of a distribution. In a somewhat normal distribution, this is near the middle of the curve with the mean and median. But in a really weird distribution, where, say, the most extreme values are the most common, the mode could be way off. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7d6a5dbc",
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"#Visualize a weird distribution. Where is the mean versus the median versus the mode?\n",
"weird_dist <- data.frame(data_pts = c(10,20,20,30,30,30,40,40,40,40,50,50,50,50,50))\n",
"gf_dhistogram(gformula = ~ data_pts, data = weird_dist)"
]
},
{
"cell_type": "markdown",
"id": "f1bd4a5a",
"metadata": {},
"source": [
"You may have tried to use a function like ```mode()``` in the code block above to calculate the mode of the dataframe ```weird_dist```. However, in R ```mode()``` is already taken up by a different function that does something else. So we'll need a different method to describe our distribution. Given that the mode is the most frequent data value in a distribution, we can use the ```table()``` function to list the number of data points with each unique value, and identify the mode that way:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bad2b055",
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"table(weird_dist$data_pts)"
]
},
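{
"cell_type": "markdown",
"id": "f3a91c24",
"metadata": {},
"source": [
"The output shows that 50 occurs most often, so the mode here is 50. If you want R to pull that value out for you, one option is to combine ```which.max()``` and ```names()``` (a sketch using only base R functions):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f3a91c25",
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"#which.max() finds the table entry with the highest count;\n",
"#names() retrieves the data value that entry corresponds to\n",
"freq_table <- table(weird_dist$data_pts)\n",
"as.numeric(names(which.max(freq_table))) #the mode: 50"
]
},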
{
"cell_type": "markdown",
"id": "7efa3f3d",
"metadata": {},
"source": [
"Because the mode can be far away from the middle of the value range in distributions like this, it may be better to think of the mode as more of a measure of \"typicality\" rather than a measure of \"central tendency.\"\n",
"\n",
"### Which measure of central tendency to use?\n",
"When do you pick between mean, median, and mode? There's no hard and fast rule - for each use case, you'll want to think about the message you're trying to send with your summary, as well as the nature of the underlying data. But here are some things to think about when making your decision:\n",
"\n",
"- The mean is a good default - many statistical tools are based on the mean, and lots of people know what it is. \n",
"- As already mentioned, if you have a highly skewed distribution, median may be better than mean for describing the middle of the data.\n",
"- If your data are categorical, it doesn't make sense to calculate the mean or the median. Both the mean and the median rely on the idea that the numbers assigned to values are meaningful - i.e., can you say what the mean is of {apple, orange, banana, watermelon}? If data values aren't related numerically, it's best to use mode. \n",
"- If your data are ordinal scale, you’re more likely to want to use the median than the mean. For instance, when measuring the typical number of children that an American household has, you can only have whole-numbered children. However, if we calculate the [average child number per family in 2020](https://www.statista.com/statistics/718084/average-number-of-own-children-per-family), we get 1.93. A partial amount of a child doesn't really make sense when discussing the typical number of children one might find in a household. Instead, it would make more sense to use the median: 2. "
]
},
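{
"cell_type": "markdown",
"id": "f3a91c26",
"metadata": {},
"source": [
"To see how much these three measures can disagree, here is the quick comparison promised above, computed on the ```weird_dist``` data (a sketch reusing the ```table()``` trick for the mode):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f3a91c27",
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"#All three measures of central tendency on the same skewed sample\n",
"mean(weird_dist$data_pts) #pulled toward the long left tail: about 36.7\n",
"median(weird_dist$data_pts) #the middle observation: 40\n",
"as.numeric(names(which.max(table(weird_dist$data_pts)))) #the most frequent value: 50"
]
},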
{
"cell_type": "markdown",
"id": "5ea73ff8",
"metadata": {},
"source": [
"## 5.5 Variability\n",
"\n",
"Lastly we come to variability. Variability refers to how spread out (or wide) the distribution is. Central tendency tells us what the most typical values are like, but we don't want to completely ignore the infrequent scores. Two distributions can have a mean of 60, but if the range of one is 50 to 70 while the range of another is 2 to 200, those are obviously different distributions of data. Variability also has multiple measures we can use. \n",
"\n",
"### Range\n",
"The **range** of a variable is very simple: it’s the biggest value minus the smallest value. For ```test_scores``` data, the maximum value is 93, and the minimum value is 76. We can calculate these values in R using the ```max()``` and ```min()``` functions:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f025c9a9",
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"#Max of test_scores\n",
"max(test_scores)\n",
"\n",
"#Min of test_scores\n",
"min(test_scores)\n",
"\n",
"#Calculate the range of test_scores by subtracting the min from the max\n"
]
},
{
"cell_type": "markdown",
"id": "7857dc83",
"metadata": {},
"source": [
"The other possibility is to use the ```range()``` function, which outputs both the minimum value and the maximum value in a vector, like this:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "30f8af23",
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"range(test_scores)"
]
},
{
"cell_type": "markdown",
"id": "ddcefe7d",
"metadata": {},
"source": [
"Although the range is the simplest way to quantify the notion of variability, it’s one of the worst. Recall from our discussion of the mean that we want our summary measure to be robust (meaning it isn't affected by outliers very much). If the dataset has one or two extreme values in it, we’d like our statistics not to be unduly influenced by these cases. If we look once again at our toy example of a data set containing a very extreme outlier, ```dist2 <- c(10, 10, 11, 9, 11, 12, 8, 9, 10, 100)```, it is clear that the range is not robust, since this has a range of 92 unless the outlier were removed; then we would have a range of only 4."
]
},
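{
"cell_type": "markdown",
"id": "f3a91c28",
"metadata": {},
"source": [
"You can verify this yourself. A minimal sketch, assuming ```dist2``` is still defined from the median section:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f3a91c29",
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"#Range with the outlier included: 100 - 8 = 92\n",
"max(dist2) - min(dist2)\n",
"\n",
"#Range with the outlier removed: 12 - 8 = 4\n",
"no_outlier <- dist2[dist2 != 100] #drop the extreme value\n",
"max(no_outlier) - min(no_outlier)"
]
},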
{
"cell_type": "markdown",
"id": "25c56a53",
"metadata": {},
"source": [
"### Interquartile range\n",
"The **interquartile range (IQR)** is like the range, but instead of calculating the difference between the biggest and smallest value, it calculates the difference between the 25th quantile and the 75th quantile. What is a quantile? \n",
"the 25th quantile of a data set is the smallest number x such that 25% of the data is less than x. This is also called percentiles, which you may have heard before. \n",
"\n",
"In fact, we’ve already come across the idea: the median of a data set is its 50th quantile / percentile! R actually\n",
"provides you with a way of calculating quantiles, using the (surprise, surprise) ```quantile()``` function. Let’s\n",
"use it to calculate the median of ```STAI_T``` in the ```tetrismemories``` dataset:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "650c3dd9",
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"quantile(tetrismemories$STAI_T, probs = 0.5, na.rm = TRUE)\n",
"median(tetrismemories$STAI_T, na.rm = TRUE)"
]
},
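{
"cell_type": "markdown",
"id": "f3a91c2a",
"metadata": {},
"source": [
"Following the same pattern, you can ask ```quantile()``` for the 25th and 75th quantiles and subtract them to get the IQR. Base R also has an ```IQR()``` function that does this subtraction for you; here is a quick sketch on the same variable:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f3a91c2b",
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"#The 25th and 75th quantiles of STAI_T (na.rm = TRUE skips missing values)\n",
"quantile(tetrismemories$STAI_T, probs = c(0.25, 0.75), na.rm = TRUE)\n",
"\n",
"#IQR() computes the difference between those two quantiles directly\n",
"IQR(tetrismemories$STAI_T, na.rm = TRUE)"
]
},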
{
"cell_type": "markdown",
"id": "89506fe8",
"metadata": {},
"source": [
"