{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "#Chapter 11. Null Hypothesis Significance Testing" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## Contents\n", "### 11.1 NHST for the bias of a coin\n", "### 11.2 Prior knowledge about the coin\n", "### 11.3 Confidence interval and highest density interval\n", "### 11.4 Multiple comparisons\n", "### 11.5 What a sampling distribution is good for\n", "## -----------------------------------------------------------------------------------------------------------------------------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " In null hypothesis significance testing(NHST), the goal of inference is to decide whether a particular value of a model parameter can be rejected. \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For example, we might want to know whether a coin is fair, which in NHST becomes the question of whether we can reject the hypothesis that the bias of the coin has the specific value 0.5.\n", " To make the logic of NHST concrete, suppose we have a coin that we want to test for fairness. We decide that we will conduct an experiment wherein we flip the coin N = 26 times, and we observe how many times it comes up heads. If the coin is fair, it should usually come up heads about 13 times out of 26 flips. Only rarely will it come up with far\n", "fewer or far greater than 13 heads. \n", " Suppose we now conduct our experiment: We flip the coin N = 26 times and we happen to observe z = 8 heads. All we need to do is figure out the probability of getting that few heads if the coin were truly fair. If the probability of getting so few heads is sufficiently tiny, then we doubt that the coin is truly fair.\n", "Notice that this reasoning depends on the notion of repeating the intended experiment, because we are computing the probability of getting 8 heads if we were to repeat an experiment with N = 26. In other words, we are figuring out the probability of getting 8 heads relative to the space of all possible outcomes when N = 26. Why do we restrict consideration to N = 26? Because that was the intention of the experimenter." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The problem with NHST is that the interpretation of the observed outcome depends on the space of possible outcomes when the experiment is repeated. Why is that a problem? Because the definition of the space of possible outcomes depends on the intentions of the experimenter. \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " ## Intention!!!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " If the experimenter intended to flip the coin exactly N = 26 times, then the space of possibilities is all samples with N = 26. But if the experimenter intended to flip the coin for one minute (and merely happened to make 26 flips during that time) then the space of possibilities is all samples that could occur when flipping the coin for one minute. Some of those possibilities would have N = 26, but some would have N = 23, and some would have N = 32, etc. On the other hand, the experimenter might have intended to flip the coin until observing 8 heads, and it just happened to take 26 flips to get there. In this case, the space of possibilities is all samples that have the 8th head as the last flip. Notice that for any of those intended experiments (fixed N, fixed time, or fixed z), the actually-observed data are the same: z = 8 and N = 26. 
But the probability of the observed data is different relative to each experiment space. The space of possibilities is determined by what the experimenter had in mind while flipping the coin." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Do the observed data depend on what the experimenter had in mind? " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### We certainly hope not! \n", "A good experiment is founded on the principle that the data are insulated from the experimenter’s intentions. The coin “knows” only that it was flipped 26 times, regardless of what the experimenter had in mind while doing the flipping. Therefore our conclusion about the coin should not depend on what the experimenter had in mind while flipping it. This chapter explains some of the gory details of NHST, to bring mathematical rigor to the above comments, and to bring rigor mortis to NHST. You’ll see how NHST is committed to the notion that the covert intentions of the experimenter are crucial to interpreting the data, even though the data are not supposed to be influenced by the covert intentions of the experimenter.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 11.1 NHST for the bias of a coin\n", "### 11.1.1 When the experimenter intends to fix N\n", "\n", "Now for some of the mathematical details of NHST. Suppose we intend to flip a coin N = 26 times and we happen to observe z = 8 heads. This result seems to suggest that the coin is biased, because the result is less than the 13 heads that we would expect to get from a fair coin. But someone who is skeptical about the claim that the coin is biased, i.e., a defender of the null hypothesis that the coin is fair, would argue that the seemingly biased result could have happened merely by chance from a genuinely fair coin. Because a “false alarm” (a Type I error), i.e., rejection of a null hypothesis when it is really true, is considered to be very costly in scientific practice, we decide that we will only reject the null hypothesis if the probability that it could generate the result is very small, conventionally less than 5% (the significance level). In other words, to reject the null hypothesis, we need to show that the probability of getting something as extreme as z = 8, when N = 26, is less than 5%." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What is the probability of getting a particular number of heads when N is fixed? The\n", "answer is provided by the binomial probability distribution, which states that the probability\n", "of getting z heads out of N flips is" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##\n", "where the notation\n", "\u0010Nz\n", "\u0011\n", "will be defined below. The binomial distribution is derived by the\n", "following logic. Consider any specific sequence of N flips with z heads. The probability of\n", "that specific sequence is simply the product of the individual flips, which is the product of\n", "Bernoulli probabilities\n", "Q\n", "i θyi (1 − θ)1−yi = θz(1 − θ)N−z, which we first saw in Section 5.1,\n", "p. 66. But there are many different specific sequences with z heads. \n", "Let’s count how many\n", "ways there are. Consider allocating z heads to N flips in the sequence. The first head could\n", "go in any one of the N slots. The second head could go in any one of the remaining N − 1\n", "slots. The third head could go in any one of the remaining N − 2 slots. And so on, until the\n", "zth head could go in any one of the remaining N−(z−1) slots. Multiplying those possibilities\n", "together means that there are N · (N − 1) · . . . · (N − (z − 1)) ways of allocating z heads to N\n", "flips. As an algebraic convenience, notice that N · (N − 1) · . . . · (N − (z − 1)) = N!/(N − z)!.\n", "\n", "In this counting of the allocations, we’ve counted different orderings of the same allocation\n", "separately. For example, putting the 1st head in the 1st slot and the 2nd head in the second\n", "slot was counted as a different allocation than putting the 1st head in the 2nd slot and the\n", "2nd head in the 1st slot. In the space of possible outcomes, there is no meaningful difference\n", "in these allocations, because they both have a head in the 1st and 2nd slots. Therefore we\n", "get rid of this duplicate counting by dividing out by the number of ways of permuting the\n", "z heads among their z slots. The number of permutations of z items is z!. Putting this all together, the number of ways of allocating z heads among N flips, without duplicate\n", "counting of equivalent allocations, is N!/[(N − z)!z!]. This factor is also called the number\n", "of ways of choosing z items from N possibilities, or “N choose z” for \u0010 short, and is denoted Nz\n", "\u0011\n", "\n", "##\n", "\n", ". Thus, the overall probability of getting z heads in N flips is the probability of any\n", "particular sequence of z heads in N flips times the number of ways of choosing z slots from\n", "among the N possible flips. The product appears in Equation 11.1. An illustration of a\n", "binomial probability distribution is provided in the right panel of Figure 11.1, for N = 26\n", "and θ = .5. Notice that the abscissa ranges from z = 0 to z = 26, because in N = 26 flips it\n", "is possible to get anywhere from no heads to all heads.\n", "\n", "## Bernoulli -> Binomial\n", "\n", "\n", "**The binomial probability distribution in Figure 11.1 is also called a sampling distribution(or Empirical distribution).**\n", "Sampling with replacement(ex. Bootstrapping)\n", "Sampling without replacement\n", "\n", "##\n", "This terminology stems from the idea that any set of N flips is a representative sample\n", "of the behavior of the coin. If we were to repeatedly run experiments with a fair coin, such\n", "that in every experiment we flip the coin exactly N times, then, in the long run, the probability\n", "of getting each possible z would be the distribution shown in Figure 11.1. 
To describe it carefully, we would call it “the probability distribution of the possible sample outcomes”, but that’s usually just abbreviated as “the sampling distribution”.\n", "\n", "The left side of Figure 11.1 shows the null hypothesis. It shows the probability distribution for the two states of the coin. According to the null hypothesis, the coin is fair, whereby p(y = heads) = θ = .5. The two panels in the figure are connected by an implication arrow to denote the fact that when the sample size N is fixed, the sampling distribution on the right is implied.\n", "##\n", "\n", "Our goal is to determine whether the probability of getting the observed result, z = 8, is tiny enough that we can reject the null hypothesis. By using the binomial probability formula in Equation 11.1, we determine that the probability of getting exactly z = 8 heads in N = 26 flips is 2.3%." ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "0.0232797116041184" ], "text/latex": [ "0.0232797116041184" ], "text/markdown": [ "0.0232797116041184" ], "text/plain": [ "[1] 0.02327971" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# P(z = 8 heads | N = 26, theta = 0.5), as in Equation 11.1\n", "dbinom(8, 26, 0.5, log = FALSE)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
    \n", "\t
  1. 11
  2. \n", "\t
  3. 19
  4. \n", "\t
  5. 14
  6. \n", "\t
  7. 12
  8. \n", "\t
  9. 11
  10. \n", "\t
  11. 12
  12. \n", "\t
  13. 13
  14. \n", "\t
  15. 15
  16. \n", "
\n" ], "text/latex": [ "\\begin{enumerate*}\n", "\\item 11\n", "\\item 19\n", "\\item 14\n", "\\item 12\n", "\\item 11\n", "\\item 12\n", "\\item 13\n", "\\item 15\n", "\\end{enumerate*}\n" ], "text/markdown": [ "1. 11\n", "2. 19\n", "3. 14\n", "4. 12\n", "5. 11\n", "6. 12\n", "7. 13\n", "8. 15\n", "\n", "\n" ], "text/plain": [ "[1] 11 19 14 12 11 12 13 15" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x<-rbinom(8, 26, 0.5)\n", "x" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [ { "ename": "ERROR", "evalue": "Error in png(tf, width, height, \"in\", pointsize, bg, res, type = \"cairo\", : unable to load winCairo.dll: was it built?\n", "output_type": "error", "traceback": [ "Error in png(tf, width, height, \"in\", pointsize, bg, res, type = \"cairo\", : unable to load winCairo.dll: was it built?\n" ] }, { "ename": "ERROR", "evalue": "Error in jpeg(tf, width, height, \"in\", pointsize, quality, bg, res, type = \"cairo\", : unable to load winCairo.dll: was it built?\n", "output_type": "error", "traceback": [ "Error in jpeg(tf, width, height, \"in\", pointsize, quality, bg, res, type = \"cairo\", : unable to load winCairo.dll: was it built?\n" ] }, { "data": { "text/plain": [ "Plot with title \"Histogram of x\"" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "hist(x)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " Figure 11.1 shows this probability as the height of the bar over z = 8. However, we do not want to determine the probability of only the actuallyobserved result. After all, for large N, any specific result z can be very improbable.\n", " For example, if we flip a fair coin N = 1000 times, the probability of getting exactly z = 500 heads is only 2.5%, even though z = 500 is precisely what we would expect if the coin were fair. " ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "0.0252250181783608" ], "text/latex": [ "0.0252250181783608" ], "text/markdown": [ "0.0252250181783608" ], "text/plain": [ "[1] 0.02522502" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dbinom(500, 1000, 0.5, log = FALSE)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Therefore, instead of determining the probability of getting exactly the result z from the null hypothesis, we determine the probability of getting z or a result even more extreme than what we would expect. \n", "\n", "\n", "The reason for considering more extreme outcomes is this: If we would reject the null hypothesis because the result z is too far from what we would expect, then any other potential result, that has an even more extreme value, would also cause us to reject the null hypothesis. \n", "\n", "\n", "Therefore we want to know the probability of getting the actual outcome or an outcome more extreme relative to what we expect. This total probability is referred to as “the p value”. If this p value is less than a critical amount, then we reject the null hypothesis.\n", "\n", "\n", "\n", "The **critical probability(유의확률)** is conventionally set to 5%. In other words, we will reject the null hypothesis whenever the total probability of the observed z or an outcome more extreme is less than 5%. 
\n", "\n", "\n", "\n", "Notice that this decision rule will cause us to reject the null hypothesis 5% of the time when the null hypothesis is true, because the null hypothesis itself generates those extreme values 5% of the time, just by chance. The critical probability, 5%, is the proportion of false alarms that we are willing to tolerate in our decision process. We set the\n", "critical z values such that the false alarm rate is no greater than 5%.\n", "\n", "\n", "\n", "We also have to be careful to consider both directions of deviation from what we would expect. If we flip a coin and it comes up heads almost all the time, we suspect that it is biased. But if the coin comes up heads almost never, we also suspect that it is biased. Therefore we have to establish the range of all possible extreme values, high or low, that\n", "would cause us to reject the null hypothesis. We let half the false alarms be due to high values, and half be due to low values. Because we want the total false alarm rate to be no greater than 5%, we will reject the null hypothesis only when z is so high that it would reach or exceed that value less than 2.5% of the time by chance alone, or when z is so low that it would be that small or smaller by chance only 2.5% of the time. Figure 11.1 shows these extreme values of z as the darkly shaded bars in the tails of the distribution. The total probability mass of these bars does not exceed 2.5% in either tail. If the actually observed z falls among any of these darkly shaded extreme values, we reject the null hypothesis. For\n", "our specific situation, where the experimenter intended N = 26, we need a value of z that is 19 or greater, or 7 or less, to reject the hypothesis that θ = 0.5.\n", "\n", "\n", "\n", "Here’s the conclusion for our particular case. The actual observation had z = 8, and so we would not reject the null hypothesis that θ = .5. In NHST parlance, we would say that the result “has failed to reach significance”. This does not mean we accept the null hypothesis; we merely suspend judgment regarding rejection of this particular hypothesis. Notice that we have not determined any degree of belief in the hypothesis that θ = .5. The\n", "hypothesis might be true or might be false; we suspend judgment.\n", "\n", "\n", "\n", "\n", "It is worth reiterating how this conclusion was reached: We considered the space of all possible outcomes if the intended experiment were repeated, and we determined the probabilities of extreme outcomes in this space of possibilities. We then examined whether the one actually observed outcome fell into the extreme zones of the space of possible outcomes." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 11.1.2 When the experimenter intends to fix z" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Suppose that the experimenter did not intend to stop flipping when N flips were reached.\n", "Instead, the intention was to stop when z heads were reached. This scenario can happen\n", "in many real-life situations.\n", "\n", "\n", "\n", "For example, widgets on an assembly line can be checked for defects until z defective widgets are identified. In this situation, z is fixed in advance and N is the random variable. We don’t talk about the probability of getting z heads out of N flips, we instead talk about the probability of taking N flips to get z heads.\n", "\n", "\n", "\n", "\n", "What is the probability of taking N flips to get z heads? 
To answer this question, consider this: We know that the N-th flip is the z-th head, because that is what signalled us to stop flipping. Therefore the previous N − 1 flips had z − 1 heads in some random sequence.\n", "\n", "##\n", "The probability of getting z − 1 heads in N − 1 flips is $\\binom{N-1}{z-1} \\theta^{z-1} (1-\\theta)^{N-z}$. \n", "##\n", "\n", "The probability that the last flip comes up heads is θ. Therefore, the probability that it takes N flips to get z heads is" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "0.000808164244517682" ], "text/latex": [ "0.000808164244517682" ], "text/markdown": [ "0.000808164244517682" ], "text/plain": [ "[1] 0.0008081642" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dnbinom(8, 26, 0.5, log = FALSE)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Figure 11.2 shows an example of this probability distribution. This distribution is sometimes\n", "called the **“negative binomial”**. Notice that values of N start at z and rise to infinity,\n", "because it takes at least z flips to get z heads, and it might take a huge number of flips to\n", "finally get the zth flip.\n", "\n", "\n", "\n", "\n", "If the coin is biased to come up heads rarely, then it will take a large number of flips\n", "until we get z heads. If the coin is biased to come up heads frequently, then it will take a\n", "small number of flips until we get z heads. Figure 11.2 shows the values of observed N for\n", "which the probability of getting that result, or something more extreme, is less than 2.5%\n", "in each tail. These extreme values are marked as dark bars in the sampling distribution.2 If\n", "the observed N falls in either of these extreme tails, we reject the null hypothesis.\n", "\n", "\n", "\n", "Here is the conclusion for our specific example: The actual observation is N = 26,\n", "which falls in the extreme tail of the sampling distribution, and therefore we reject the null\n", "hypothesis. In other words, in the space of possible outcomes, the null hypothesis predicts\n", "that it is very rare for a fair coin to need 26 flips to get 8 heads; so rare, in fact, that we\n", "reject the null hypothesis. Notice that while we have rejected the null hypothesis, we still\n", "have no particular degree of disbelief in it, nor do we have any particular degree of belief in\n", "any other hypothesis. All we know is that the actual observation lies in an extreme end of\n", "the space of possibilities if the intended experiment were repeated." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 11.1.3 Soul searching\n", "\n", "Let’s summarize the situation. We watch the experimenter flip the coin. We see the same\n", "results as the experimenter, and we observe z = 8 heads out of N = 26 flips. According to\n", "NHST, if the intention of the experimenter was to stop when N = 26, then we do not reject\n", "the null hypothesis. If the intention of the experimenter was to stop when z = 8, then we\n", "do reject the null hypothesis. In other words, for us to draw a conclusion from the data, we need to know the experimenter’s intentions. *** It shows you other examples of this dependence of the decision on the experimenter’s intentions.***\n", "\n", "\n", "\n", "\n", "\n", "Notice that the actual observed events are the same regardless of how the experimenter\n", "decided to stop flipping coins; In either case we observe z heads in N flips. An outside\n", "observer of the flipping experiment, who is not privy to the covert intentions of the flipper,\n", "simply sees N flips, of which z were heads. It could be that the flipper intended to flip N\n", "times and then stop. Or it could be that the flipper intended to keep flipping until getting z\n", "heads. Or it could be that the flipper intend to flip for one minute.\n", "\n", "\n", "\n", "\n", "\n", "In real research, the actual reason for stopping is often neither because a pre-planned N\n", "was reached nor because a pre-planned z was reached, nor because time ran out. Instead,\n", "real researchers will sometimes monitor the data as they are collected, and do “preliminary”\n", "analyses on the data collected so far. If the currently collected data show significance, then\n", "data collection stops. 
If the data are close to significance, then data collection continues. With these intentions, the probability of getting significance is inflated because there is a chance of rejecting the null at every step along the way. In particular, if the experimenter intended to allow additional data to be collected after the preliminary inspection, but did not end up collecting additional data, the true probability of falsely rejecting the null is still inflated, because the potential data space is larger and there are more opportunities for rejecting the null.\n", "\n", "The solution to this mess is simple. All we have to do, to determine whether or not to reject the null hypothesis, is search the soul of the experimenter to discover his or her true intentions about the experiment. Thus, when an experimenter reports his or her results, s/he can sign an Affidavit of Intent, or testify before Congress under oath. \n", "\n", "Or, perhaps advances in fMRI will one day give us objective measures of subconscious intent. Then NHST will be on solid ground. Right? Wrong.\n", "\n", "In all of these scenarios, the coin itself has no idea what the flipper’s intention is, and the propensity of the coin to come up heads does not depend on the intentions of the flipper. Indeed, we carefully design experiments to insulate the coins from the intentions of the experimenter. Therefore our inference about the coin should not depend on the intentions of the experimenter.\n", "\n", "A defender of NHST might be tempted to argue that I’m quibbling over trivial differences in the critical values. The critical values for the two cases I described above are very similar. Unfortunately, different intentions do not always lead to small differences in critical values. \n", "\n", "**For example, when experimenters check their data after every flip of the coin to see if the result so far is “significant” by fixed-N critical values, the false alarm rate skyrockets. If you check at every flip to see if conventional 5% critical values have been exceeded, then the actual false alarm rate with 10 flips is 5.5%, with 20 flips it’s 10.7%, with 30 flips it’s 14.9%, with 40 flips it’s 15.4%, and with 50 flips the true false alarm rate is 17.1%. In other words, if you are willing to flip the coin up to 50 times, and along the way you check at every flip to see if you can reject the hypothesis that the coin is fair, using critical values that are supposed to keep the false alarm rate down to 5% or less, then you actually have a 17.1% chance of falsely rejecting the hypothesis even when the coin is truly fair.**\n", "\n", "You have to change the critical values quite a lot if you intend to check after every flip. Another situation in which critical values change dramatically is when experimenters intend to make multiple comparisons across different conditions in an experiment, as will be discussed later. Thus, we are not quibbling over tiny differences in critical values. Depending on the intentions, the critical values can change dramatically.\n", "\n", "More fundamentally, the argument that it’s okay to rely on intentions whenever different intentions lead to nearly the same critical values still fully admits that experimenter intentions influence the interpretation of data. It’s like arguing that we shouldn’t worry about the butchers putting their fingers on the scale, because no matter which butcher does it, the cheating is about the same. Admitting that experimenter intention influences the interpretation of data contradicts a basic premise of the data collection, that experimenter intentions have no influence on the data.
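\n", "\n", "To make the inflation concrete, here is a small Monte Carlo sketch (my addition; it assumes a conventional two-tailed binomial test applied after every flip, so the exact numbers differ somewhat from the critical-value analysis quoted above):\n", "\n", "```r\n", "# estimate the false alarm rate when a fair coin is tested after every flip\n", "set.seed(47405)\n", "n_reps <- 10000; max_flips <- 50\n", "false_alarm <- replicate(n_reps, {\n", "  flips <- rbinom(max_flips, 1, 0.5)\n", "  z <- cumsum(flips); n <- 1:max_flips\n", "  p <- 2 * pmin(pbinom(z, n, 0.5), pbinom(n - z, n, 0.5))  # two-tailed p at each step\n", "  any(p < 0.05)                                            # rejected anywhere along the way?\n", "})\n", "mean(false_alarm)   # well above .05, in line with the inflation described above\n", "```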
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 11.1.4 Bayesian analysis\n", "\n", "The Bayesian interpretation of data does not depend on the covert intentions of the data collector. In general, for data that are independent across trials, the probability of the conjoint set of data is simply the product of the probabilities of the individual outcomes. \n", "\n", "Thus, for $z = \\sum_{i=1}^{N} y_i$ heads in N flips, the likelihood is $\\prod_{i=1}^{N} \\theta^{y_i} (1-\\theta)^{1-y_i} = \\theta^{z} (1-\\theta)^{N-z}$, regardless of the experimenter’s private reasons for collecting those data. \n", "\n", "The likelihood function captures everything we assume to influence the data. In the case of the coin, we assume that the bias of the coin is the only influence on its outcome, and that the flips are independent. The Bernoulli likelihood function completely captures those assumptions.
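\n", "\n", "A small illustration of this point (my addition): under either stopping rule, the sampling probability of the data differs from the Bernoulli likelihood only by a constant that does not involve θ, so Bayesian conclusions are unaffected:\n", "\n", "```r\n", "theta <- seq(0.1, 0.9, by = 0.2); z <- 8; N <- 26\n", "lik <- theta^z * (1 - theta)^(N - z)     # Bernoulli likelihood at several theta values\n", "dbinom(z, N, theta) / lik                # fixed-N probability: constant ratio choose(26, 8)\n", "dnbinom(N - z, z, theta) / lik           # fixed-z probability: constant ratio choose(25, 7)\n", "```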
\n", "\n", "**In summary, the NHST analysis and conclusion depend on the covert intentions of the experimenter, because those intentions define the space of all possible (unobserved) data.**\n", "\n", "This dependence of the analysis on the experimenter’s intentions conflicts with the opposite assumption that the experimenter’s intentions have no effect on the observed data. \n", "\n", "**The Bayesian analysis does not depend on the space of possible unobserved data. The Bayesian analysis operates only with the actual data obtained.**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 11.2 Prior knowledge about the coin\n", "\n", "Suppose that we are not flipping a coin, but we are flipping a flat-headed nail. In a social science setting, this is like asking a survey question about left or right handedness of the respondent, which we know is far from 50/50, as opposed to asking a survey question about male or female sex of the respondent, which we know is close to 50/50. When we flip the nail, it can land with its point touching the ground (which I’ll call tails) or it can land balanced on its head with its point sticking up (which I’ll call heads). We believe, just by looking at the nail and our previous experience with nails, that it will not come up heads and tails equally often. Indeed, with its narrow head, the nail will very probably come to rest with its point touching the ground, i.e., “tails”. In other words, we have a strong prior belief that the nail is tail-biased. Suppose we flip the nail 26 times and it comes up heads on 8 flips. Is the nail “fair”? Would we use it to determine who gets to kick off at the Super Bowl?\n", "\n", "Case where we know the prior:\n", "ex)\n", "\n", "Case where we do not know the prior:\n", "ex)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 11.2.1 NHST analysis\n", "\n", "The NHST analysis does not care if we are flipping coins or nails. The analysis proceeds the same way as before. To determine whether the nail is biased, we first declare the experimenter’s intentions and then compute the probability of getting 8 heads or fewer if the nail were fair. As we saw in the previous section, if we declare that the intention was to flip the nail 26 times, then an outcome of 8 heads means we do not reject the hypothesis that the nail is fair. Let me say that again: We have a nail for which we have a strong prior belief that it is tail biased. We flip the nail 26 times, and find it comes up heads 8 times.\n", "\n", "**We conclude, therefore, that we cannot reject the null hypothesis that the nail can come up heads or tails 50/50.** Huh? This is a nail we’re talking about. How can you not reject the null hypothesis?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 11.2.2 Bayesian analysis \n", "\n", "The Bayesian statistician starts the analysis with an expression of the prior knowledge. We know from prior experience that the narrow-headed nail is biased to show tails, so we express that knowledge in a prior. In a scientific setting, the prior is established by appealing to publicly accessible and reputable previous research. In our present toy example involving a nail, suppose that we represent our prior beliefs by a fictitious previous sample that had 95% tails in a sample size of 20. That translates into a beta(θ|2, 20) prior distribution. If we wanted to go through the trouble, we could instead derive a prior from established theories regarding the mechanics of such objects, after making physical measurements of the nail such as its length, diameter, mass, rigidity, etc. In any case, to make the analysis convincing to an audience of peers, the prior must be agreeable to that audience. Suppose that the agreed prior for the nail is beta(θ|2, 20); then the posterior distribution is beta(θ|2+8, 20+18), as shown in the right side of Figure 11.3. The posterior beliefs clearly do not include the nail being fair.\n", "\n", "The differing inferences for a coin and a nail make good intuitive sense. Our posterior beliefs about the bias of the object should depend on our prior knowledge of the object: 8 heads in 26 flips of a narrow-headed nail should leave us with a different opinion than 8 heads in 26 flips of a coin." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 11.2.2.1 Priors are overt and should influence\n", "\n", "Some people might assert that prior beliefs are just as mysterious as the experimenter’s intentions. But this assertion is wrong. Prior beliefs are not capricious and idiosyncratic. Prior beliefs are overt, explicitly debated, and consensual. A Bayesian analyst might have personal priors that differ from what most people think, but if the analysis is supposed to convince an audience, then the analysis must use priors that the audience finds palatable. It is the job of the Bayesian analyst to make cogent arguments for the particular prior that is used. The research will not get published if the reviewers and editors think that the prior is untenable. Perhaps the researcher and the reviewers will have to agree to disagree about the prior, but even in that case the prior is an explicit part of the argument, and the analysis should be run with both priors in order to assess the robustness of the posterior. Science is a cumulative process, and new research is always presented in the context of previous research. A Bayesian analysis acknowledges this obvious fact, but it is ignored by NHST."
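, "\n", "\n", "A quick sketch (my addition) of the nail example's prior-to-posterior update in base R; the central 95% interval is used as an easy stand-in for the HDI:\n", "\n", "```r\n", "# prior beta(2, 20): a fictitious previous sample of mostly tails\n", "a <- 2; b <- 20\n", "z <- 8; N <- 26                            # observed: 8 heads in 26 flips\n", "post_a <- a + z; post_b <- b + (N - z)     # posterior beta(10, 38)\n", "qbeta(c(0.025, 0.975), post_a, post_b)     # central 95% interval; it excludes 0.5\n", "```"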
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Beta distribution?
\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some people might wonder, if subjective priors are allowed for Bayesian analyses, then\n", "why not allow subjective intentions for NHST? \n", "\n", "\n", "**Because the subjective intentions in the data collector’s mind do not influence the data and therefore should not influence the analysis.Subjective prior beliefs, on the other hand, are not about how beliefs influence the data,\n", "but about how the data influence beliefs: Prior beliefs are the starting point from which we move in the light of new data.**\n", "\n", "\n", "\n", "\n", "Bayesian analysis tells us how much we should change our beliefs relative to our prior\n", "beliefs. Bayesian analysis does not tell us what our prior beliefs should be. Nevertheless,\n", "the priors are overt, public, and cumulative. Bayesian analysis provides an intellectually\n", "coherent method for determining the degree to which beliefs should change, and the conclusion\n", "is influenced by exactly what it should be influenced by, namely the priors and the\n", "observed data. The conclusion is not influenced by what it should not be influenced by,\n", "namely the experimenter’s covert intention while gathering the data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 11.3 Confidence interval and highest density interval\n", "### 11.3.1 NHST confidence interval\n", "\n", "\n", "The primary goal of NHST is determining whether a particular ”null” value of a parameter\n", "can be rejected. One can also ask what range of parameter values would not be rejected.\n", "This range of non-rejectable parameter values is called the **confidence interval.** \n", "\n", "(There are different ways of defining an NHST confidence interval; this one is conceptually the most\n", "general and coherent with NHST precepts.) \n", "\n", "\n", "The 95% confidence interval consists of all values of θ that would not be rejected by a (two-tailed) significance test that allows 5% false alarms.\n", "\n", "\n", "\n", "\n", "For example, in a previous section we found that θ = .5 would not be rejected when\n", "z = 8 and N = 26, for a flipper who intended to stop when N = 26. The question is, which\n", "other values of θ would we not reject? Figure 11.4 shows the sampling distribution for\n", "different values of θ. The upper row shows the case of θ = 0.144, for which the sampling\n", "distribution has z = 8 snug against the upper rejection tail. In fact, if θ is nudged any smaller,\n", "the rejection tail includes z = 8, which means that smaller values of θ can be rejected. The\n", "lower row of Figure 11.4 shows the case of θ = 0.517, for which the sampling distribution\n", "has z = 8 snug against the lower rejection tail. If θ is nudged any larger, the rejection tail\n", "includes z = 8, which means that larger values of θ can be rejected. \n", "\n", "\n", "\n", "\n", "**\n", "In summary, the range\n", "of θ values we would not reject is θ ∈ [.144, .517]. This is the 95% confidence interval\n", "when z = 8 and N = 26, for a flipper who intended to stop when N = 26.**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also determine the confidence interval for the experimenter who intended to\n", "stop **when z = 8.** Figure 11.5 shows the sampling distribution for different values of θ. The upper row shows the case of θ = 0.144, for which the sampling distribution has N = 26\n", "snug against the lower rejection tail. \n", "\n", "\n", "In fact, if θ is nudged any smaller, the rejection tail includes N = 26, which means that smaller values of θ can be rejected. The lower row of Figure 11.5 shows the case of θ = 0.493, for which the sampling distribution has N = 26\n", "snug against the upper rejection tail. If θ is nudged any larger, the rejection tail includes N = 26, which means that larger values of θ can be rejected. In summary, the range of θ values we would not reject is θ ∈ [.144, .493]. This is the 95% confidence interval when z = 8 and N = 26, for a flipper who intended to stop when z = 8." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have just seen that the NHST confidence interval depends on the covert intentions\n", "of the experimenter. When the intention was to stop when N = 26, then the range of\n", "biases that would not be rejected is θ ∈ [.144, .517]. But when the intention was to stop\n", "when z = 8, then the range of biases that would not be rejected is θ ∈ [.144, .493] (the\n", "fact that the lower ends of the confidence intervals are the same is merely an accidental\n", "coincidence for this case). The confidence interval depends on the experimenter’s intention\n", "because those intentions dictate the space of possible unobserved data relative to which the\n", "actually observed data are judged. If the experimenter had other intentions, such as flipping for a fixed duration, then the confidence interval would be yet something different. Thus,\n", "the interpretation of the NHST confidence interval is as convoluted as the interpretation of\n", "NHST itself, because the confidence interval is merely the significance test conducted at\n", "every candidate value of θ.\n", "\n", "\n", "**\n", "The confidence interval tells us something about the probability of extreme unobserved\n", "data values that we might have gotten if we repeated the experiment according to the covert\n", "intentions of the experimenter. But the confidence interval tells us little about the believability\n", "of any particular θ value, which is what we want to know.**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 11.3.2 Bayesian HDI\n", "\n", "A concept in Bayesian inference, that is somewhat analogous to the NHST confidence interval,\n", "is the ***highest density interval (HDI)**, which was introduced in Section 3.3.5, p. 34.\n", "\n", "\n", "\n", "\n", "Let’s consider the HDI when we flip a coin and observe z = 8 and N = 26. Suppose we\n", "have a prior informed by the fact that the coin appears to be authentic, which we express\n", "here, for illustrative purposes, as a beta(θ|11, 11) distribution. The left side of Figure 11.3\n", "shows that the 95% HDI goes from θ = 0.261 to θ = 0.533. These limits span the 95% most\n", "believable values of the bias. Moreover, the posterior density shows exactly how believable\n", "each bias is. In particular, we can see that θ = .5 is within the 95% HDI, which we might\n", "use as a criterion if we are forced to categorically declare whether or not fairness is credible." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Beta distribution\n", "A probability density of that form is called a beta distribution. Formally, a beta distribution has two parameters, called a and b, and the density itself is defined as\n", "
\n", "
\n", "where B(a, b) is simply a normalizing constant that ensures that the area under the beta\n", "density integrates to 1.0, as all probability density functions must. In other words, the\n", "normalizer for the beta distribution is B(a, b) ie. beta function." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Application: Bayesian inference\n", "\n", "The use of Beta distributions in Bayesian inference is due to the fact that they provide a family of conjugate prior probability distributions for binomial (including Bernoulli) and geometric distributions. The domain of the beta distribution can be viewed as a probability, and in fact the beta distribution is often used to describe the distribution of a probability value p.\n", "\n", "\n", "베이즈 추론(Bayesian inference)은 통계적 추론의 한 방법으로, 추론해야 하는 대상의 사전 확률과 추가적인 관측을 통해 해당 대상의 사후 확률을 추론하는 방법이다. 베이즈 추론은 베이즈 확률론을 기반으로 하며, 이는 추론하는 대상을 확률변수로 보아 그 변수의 확률분포를 추정하는 것을 의미한다.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### There are at least three advantages of the HDI over an NHST confidence interval.\n", "\n", "\n", "**First,the HDI has a direct interpretation in terms of the believabilities of values of θ.** The HDI\n", "is explicitly about p(θ|D), which is exactly what we want to know. The NHST confidence\n", "interval, on the other hand, has no direct relationship with what we want to know; there’s\n", "no clear relationship between the probability of rejecting the value θ and the believability\n", "of θ. \n", "> ☞ likelihood 개념\n", "\n", "**Second, the HDI has no dependence on the intention of the experimenter during data\n", "collection**, because the likelihood has no dependence on the intention of the experimenter\n", "during data collection. The NHST confidence interval, in contrast, tells us about probabilities\n", "of data relative to what might have been if we replicated the experimenter’s covert\n", "intentions. \n", "\n", "\n", "\n", "**Third, the HDI is responsive to the analyst’s prior beliefs, as it should be.** The\n", "Bayesian analysis indicates how much the new data should alter our beliefs. The prior beliefs\n", "are overt and publicly decided. The NHST analysis, on the contrary, is ignorant of,\n", "and unresponsive to, the accumulated prior knowledge of the scientific community." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 11.4 Multiple comparisons\n", "\n", "In most experiments there are multiple conditions or treatments. \n", "\n", "\n", "For example, some people were trained with labels such that tall rectangle\n", "indicated category A, while short rectangles indicated category B. This was a case of a\n", "filtration condition because the lateral position could be filtered out of consideration; only the height mattered for correct classification. Other people were trained with labels such\n", "that tallest rectangles or rightmost line segments indicated category A, while other figures\n", "indicated category B. This was a case of a condensation condition because both dimensions\n", "of variation had to be considered and condensed into a single categorical response. These\n", "conditions were studied because different theories of learning predict that some conditions\n", "should be easier to learn than others. Therefore the goal of data analysis is to determine\n", "how different the learning performance is across the four conditions.\n", "When comparing multiple conditions, the constraint in NHST is to keep the overall\n", "false alarm rate down to the desired level, e.g. 5%. Abiding by this constraint depends on\n", "the number of comparisons that are to be made, which in turn depends on the intentions of\n", "the experimenter. 
In a Bayesian analysis, however, there is just one posterior distribution over the parameters that describe the conditions. That posterior distribution is unaffected by the intentions of the experimenter, and the posterior distribution can be examined from multiple perspectives, in whatever ways are suggested by insight and curiosity. The next two sections expand on NHST and Bayesian approaches to multiple comparisons. I will often use the terms “condition” and “group” interchangeably." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 11.4.1 NHST correction for experimentwise error\n", "\n", "When there are multiple groups, it often makes sense to compare each group to every other group. With four groups, for example, there are six different pairwise comparisons we can make; e.g., groups 1 vs 2, 2 vs 3, 1 vs 3, etc. In NHST, we have to take into account which comparisons we intend to run for the whole experiment. The problem is that each comparison involves a decision with the potential for false alarm. Suppose we set a criterion for rejecting the null such that each decision has a “per-comparison” (PC) false alarm rate of $\\alpha_{PC}$, e.g., 5%. Our goal is to determine the overall false alarm rate when we conduct several comparisons. To get there, we do a little algebra. First, suppose the null hypothesis is true, which means that the groups are identical, and we get apparent differences in the samples by chance alone. This means that we get a false alarm on a proportion $\\alpha_{PC}$ of replications of a comparison test. Therefore, we do not get a false alarm on the complementary proportion $1 - \\alpha_{PC}$ of replications. If we run c independent comparison tests, then the probability of not getting a false alarm on any of the tests is $(1 - \\alpha_{PC})^c$. Consequently, the probability of getting at least one false alarm is $1 - (1 - \\alpha_{PC})^c$. We call that probability of getting at least one false alarm, across all the comparisons in the experiment, the “experimentwise” false alarm rate, denoted $\\alpha_{EW}$. Here’s the rub: $\\alpha_{EW}$ is greater than $\\alpha_{PC}$. For example, if $\\alpha_{PC} = .05$ and $c = 6$, then $\\alpha_{EW} = 1 - (1 - \\alpha_{PC})^c = .26$. Thus, even when the null hypothesis is true, and there are really no differences between groups, if we conduct six independent comparisons, we have a 26% chance of rejecting the null hypothesis for at least one of the comparisons. Usually not all comparisons are structurally independent of each other, so the false alarm rate does not increase so rapidly, but it does increase whenever additional comparison tests are conducted.
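\n", "\n", "A one-line check of that experimentwise rate (my addition):\n", "\n", "```r\n", "alpha_pc <- 0.05; c_comparisons <- 6\n", "1 - (1 - alpha_pc)^c_comparisons   # ≈ 0.265: a 26% chance of at least one false alarm\n", "```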
\n", "\n", "One way to keep the experimentwise false alarm rate down to 5% is by reducing the permitted false alarm rate for the individual comparisons, i.e., setting a more stringent criterion for rejecting the null hypothesis in individual comparisons. One often-used re-setting is the Bonferroni correction, which sets $\\alpha_{PC} = \\alpha_{EW}^{desired} / c$. For example, if the desired experimentwise false alarm rate is .05, and there are 6 comparisons planned, then we set each individual comparison’s false alarm rate to .05/6. This is a conservative correction, because the actual experimentwise false alarm rate will usually be much less than $\\alpha_{EW}^{desired}$. \n", "\n", "There are many different corrections available to the discerning NHST aficionado (e.g., Maxwell & Delaney, 2004, Ch. 5). Not only do the correction factors depend on the structural relationships of the comparisons, but the correction factors also depend on whether the analyst intended to conduct the comparison before seeing the data, or was provoked into conducting the comparison only after seeing the data. If the comparison was intended in advance, it is called a planned comparison. If the comparison was thought of only after seeing a trend in the data, it is called a post-hoc comparison. Why should it matter whether a comparison is planned or post-hoc? Because even when the null hypothesis is true, and there are no real differences between groups, there will always be a highest and lowest random sample among the groups. If we don’t plan in advance which groups to compare, but do compare whichever two groups happen to be farthest apart, we have an inflated chance of declaring groups to be different that aren’t truly different.\n", "\n", "The point, for our purposes, is not which correction to use. The point is that the NHST analyst must make some correction, and the correction depends on the number and type of comparisons that the analyst intends to make. This creates a problem because two analysts can come to the same data but draw different conclusions because of the variety of comparisons that they find interesting enough to conduct, and what provoked their interest. The creative and inquisitive analyst, who wants to conduct many comparisons either because of deep thinking about implications of theory, or because of provocative unexpected trends in the data, is penalized for being thoughtful. A large set of comparisons can be conducted only at the cost of using a more stringent threshold for each of the comparisons. The uninquisitive analyst is rewarded with an easier criterion for achieving significance. This seems to be a counterproductive incentive structure: You have a higher chance of getting a “significant” result, and getting your work published, if you feign narrow-mindedness under the pretense of protecting the world from false alarms.\n", "\n", "To make this concrete, consider again the filtration/condensation experiment from Section 9.3.1, p. 178. The theory relating category structure to learning difficulty predicts that the filtration structures should be easier than the condensation structures, that the two condensation structures should be approximately equally difficult, and that the two filtration structures might be somewhat different in difficulty. Theory implies, therefore, three planned comparisons. But what if the analyst was less thoughtful, or took a more broad-brush approach, and planned only one comparison: the average filtration versus the average condensation. This single comparison would indeed address the primary theoretical issue, without worrying about ancillary nuances. The broad-brusher would be rewarded with a less stringent criterion for the test to achieve significance. On the other hand, suppose that upon seeing the data, the detail-oriented analyst discovers that the slower of the two filtration groups is not much faster than the faster of the two condensation groups. The two groups should therefore be compared. 
The analyst can treat this as a post-hoc comparison, or the analyst can realize that it would have made sense to plan to compare each of the filtration groups individually against each of the condensation groups. After all, it’s just as post-hoc to notice that the slower of the filtration groups is clearly much faster than the faster of the two condensation groups, and decide therefore not to compare them. So, we might as well be honest about it, and realize that the comparisons should have been planned in the first place. All this leaves the NHST analyst walking on the quicksand of soul searching. Was the comparison truly planned or post-hoc? Did the analyst commit premeditated exclusion of comparisons that should have been planned, or was the analyst merely superficial, or was the exclusion post-hoc? This problem is not solved by picking a story and sticking to it, because it still presumes that the analyst’s intentions should influence the interpretation of the data.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 11.4.2 Just one Bayesian posterior no matter how you look at it\n", "\n", "The data from an experiment, or from an observational study, are carefully collected so as to be totally insulated from the experimenter’s intentions regarding subsequent tests. Indeed, the data should be uninfluenced by the presence or absence of any other condition or subject in the experiment! For example, it doesn’t matter to an individual in a filtration group whether or not the experiment includes the other filtration group, or the condensation groups, or still yet other conditions, or how many subjects there are in the groups. Moreover, the data are uninfluenced by the experimenter’s intentions regarding the other groups and sample size. So why should our interpretation of the data depend on the experimenter’s intentions if the data themselves are not influenced by the experimenter’s intentions?\n", "\n", "In a Bayesian analysis, the interpretation of the data is indeed uninfluenced by the experimenter’s intentions. A Bayesian analysis yields a posterior distribution over the parameters of the model. The posterior distribution is the complete implication of the data. The posterior distribution can be examined in as many different ways as the analyst deems interesting; various comparisons of groups are merely different perspectives on the posterior distribution." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For example, in the case of the filtration-condensation experiment, the Bayesian analysis\n", "(see BRugs code on p. 9.5.2) yields a posterior distribution over a high-dimensional\n", "parameter space, which includes the four μj parameters that describe the learning accuracies\n", "of the four conditions. (The other parameters were the individual learning biases,\n", "denoted θ ji, and the four κj parameters that described how strongly the individual accuracies\n", "depended on the condition’s μj.) Let’s collapse across the other parameters and focus\n", "on the four conditions’s μj parameters. If we want to determine whether group 1 tends to\n", "be more accurate than group 2, we determine how much of the posterior distribution has\n", "μ1 > μ2. Figure 9.16 showed a histogram of that difference, and that figure is repeated as\n", "Figure 11.6 for convenience. The left panel shows the posterior distribution of μ1−μ2, from\n", "which we can ascertain whether a difference of zero is credible. The posterior distribution\n", "also tells us the believability of each candidate difference of μ’s.\n", "The other two panels of Figure 11.6 show other comparisons of the μj parameters.\n", "Those histograms merely summarize the posterior distribution from other perspectives. The\n", "posterior distribution itself is unchanged by how we look at it. We can examine any other\n", "comparison of μj parameters without worrying about what motivated us to consider it, because the posterior distribution is unchanged by those motivations.\n", "In summary, the Bayesian posterior distribution is appropriately insensitive to the experimenter’s\n", "covert intentions to compare or not compare various groups. The Bayesian\n", "posterior also directly tells us the believabilities of the magnitudes of differences, unlike\n", "NHST which tells us only about whether a difference is extreme in a space of possibilities\n", "determined by the experimenter’s intentions." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 11.4.3 How Bayesian analysis mitigates false alarms\n", "\n", "\n", "No analysis is immune to false alarms, because randomly sampled data will occasionally\n", "contain accidental coincidences of outlying values. Bayesian analysis eschews the use of p\n", "values as a criterion for decision making, however, because the probability of false alarm\n", "depends dramatically the experimenter’s intentions. Bayesian analysis instead accepts the\n", "fact that the posterior is the best inference we can make, given the observed data and the\n", "prior beliefs.\n", "How, then, does a Bayesian analysis address the problem of false alarms? By incorporating\n", "prior knowledge into the structure of the model. Specifically, if we know that different\n", "groups have some overarching commonality, even if their specific treatments are different,\n", "we can nevertheless model the different group parameters as having been drawn from an\n", "overarching distribution that expresses the commonality. An example of this was described\n", "in the right side of Figure 9.17, p. 183, where the group κc parameters were modeled by\n", "an overarching distribution. If several of the groups yield similar data, this similarity informs\n", "the overarching distribution, which in turn implies that any outlying groups should\n", "be estimated to be a little more similar than they would be otherwise. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "### 11.5 What a sampling distribution is good for\n",
"\n",
"I hope to have made it clear that sampling distributions aren’t as useful as posterior distributions\n",
"for making inferences about hypotheses from a set of observed data. The reason is that\n",
"sampling distributions tell us the probabilities of possible data if we run an intended experiment\n",
"given a particular hypothesis, rather than the believabilities of possible hypotheses\n",
"given that we have a particular set of data. Nevertheless, sampling distributions are appropriate\n",
"and useful for other applications. Two of those applications are described in the\n",
"following sections." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 11.5.1 Planning an experiment\n",
"\n",
"Until this point in the book, I have emphasized analysis of data that have already been\n",
"obtained. But a crucial part of conducting research is planning the study before actually\n",
"obtaining the data. When planning research, we have some hypothesis about how the world\n",
"might be, and we want to gather data that will inform us about the viability of that hypothesis.\n",
"Typically we have some notion already about the experimental treatments or observational\n",
"settings, and we want to plan how many observations we’ll probably need to make,\n",
"or how long we’ll need to run the study, in order to have reasonably reliable evidence one\n",
"way or the other.\n",
"For example, suppose that my theory suggests a coin should be biased with θ = .60.\n",
"Perhaps the coin is actually a population of voters: flipping the coin means polling a person\n",
"in the population, and the outcome heads means a preference for candidate A. The theory\n",
"regarding the bias may have come from previous polls regarding political attitudes. We would like\n",
"to plan a survey of the population that will give us precise posterior beliefs about the true\n",
"preference for candidate A. Suppose our intended survey will sample people until we obtain\n",
"z = 100 people in favor of candidate A. By simulating the experiment over and over,\n",
"using the hypothesized θ = .60, we can generate expected data, and then derive a Bayesian\n",
"posterior distribution for every set of simulated data. For every posterior distribution, we\n",
"determine some measure of accuracy, such as the width of the 95% HDI. From many simulated\n",
"experiments, we get a sampling distribution of HDI widths. From the sampling\n",
"distribution of HDI widths, we can decide whether z = 100 typically yields high enough\n",
"accuracy for our purposes. If not, we repeat the simulation with a larger z. Once we know\n",
"how big z needs to be to get the accuracy we seek, we can decide whether or not it is feasible\n",
"to conduct such a study.\n",
"Notice that we used the intended experiment to generate a space of possible data in\n",
"order to anticipate what is likely to happen when the data are analyzed with Bayesian methods.\n",
"For any single set of data (simulated or actual), we recognize that the individual data\n",
"points in the set are insulated from the intentions of the design, and we conduct a Bayesian\n",
"analysis of the data set. The use of a distribution of possible sample data, from an intended\n",
"experiment, is perfectly appropriate here because it is exactly the implications of this hypothetical\n",
"data distribution that we want to find out about.\n",
"The issues of research design will be explored in greater depth in Chapter 13, which is\n",
"entirely devoted to this topic." ] },
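{ "cell_type": "markdown", "metadata": {}, "source": [ "Here is a small R sketch of that planning simulation. The stopping rule (sample until z = 100 heads), the generating value θ = .60, and the uniform Beta(1,1) prior follow the text; the betaHDI helper is a generic numerical search for the narrowest 95% interval of a Beta posterior, written for this sketch rather than taken from any package. If the typical width is too wide for our purposes, we increase zStop and rerun." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Simulate the intended experiment (sample until z = 100 heads) under\n",
"# the hypothesized theta = .60, and collect the 95% HDI widths of the\n",
"# resulting posteriors.\n",
"betaHDI <- function(a, b, mass = 0.95) {\n",
"  # Narrowest interval holding `mass` of a Beta(a, b) distribution,\n",
"  # found by searching over the probability left in the lower tail.\n",
"  width <- function(lowTail) qbeta(lowTail + mass, a, b) - qbeta(lowTail, a, b)\n",
"  opt <- optimize(width, interval = c(0, 1 - mass))\n",
"  c(lo = qbeta(opt$minimum, a, b), hi = qbeta(opt$minimum + mass, a, b))\n",
"}\n",
"\n",
"set.seed(47)\n",
"theta <- 0.60 ; zStop <- 100 ; nSims <- 500\n",
"hdiWidths <- replicate(nSims, {\n",
"  nTails <- rnbinom(1, size = zStop, prob = theta)  # tails before the 100th head\n",
"  N <- zStop + nTails                               # total flips in this experiment\n",
"  hdi <- betaHDI(1 + zStop, 1 + N - zStop)          # posterior from Beta(1,1) prior\n",
"  as.numeric(hdi[\"hi\"] - hdi[\"lo\"])\n",
"})\n",
"\n",
"summary(hdiWidths)   # typical posterior accuracy when stopping at z = 100" ] },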
{ "cell_type": "markdown", "metadata": {}, "source": [ "### 11.5.2 Exploring model predictions (posterior predictive check)\n",
"\n",
"A Bayesian analysis only indicates the relative veracities of the various parameter values or\n",
"models under consideration. The posterior distribution only tells us which parameter values\n",
"are relatively less bad than the others. The posterior does not tell us whether the least bad\n",
"parameter values are actually any good.\n",
"For example, suppose we believe that a coin is a heavily biased trick coin, and either\n",
"comes up heads 99% of the time, or else comes up tails 99% of the time; we just don’t know\n",
"which direction of bias it has. Now we flip the coin 40 times and it comes up heads on 30 of\n",
"those flips. It turns out that the 99%-head model has a far bigger posterior probability than\n",
"the 99%-tail model. But it is also the case that the 99%-head model is a terrible model of a\n",
"coin that comes up heads 30 out of 40 flips!\n",
"One way to evaluate whether the least unbelievable parameter values are any good is\n",
"via a posterior predictive check. A posterior predictive check is an inspection of patterns\n",
"in simulated data that are generated by typical posterior parameter values. Back in Exercise\n",
"5.8, p. 81, we explored an example of a posterior predictive check, and another example\n",
"appeared in Section 7.4.2, p. 118. The idea of a posterior predictive check is as follows: If the\n",
"posterior parameter values really are good descriptions of the data, then the predicted data\n",
"from the model should actually “look like” real data. If the patterns in the predicted data do\n",
"not mirror the patterns in the actual data, then we are motivated to invent models that can\n",
"produce the patterns of interest.\n",
"This use of the posterior predictive check is suspiciously like null hypothesis significance\n",
"testing: We start with a hypothesis (i.e., the least unbelievable parameter values), and\n",
"we generate simulated data as if we were repeating our intended experiment over and over.\n",
"Then we see if the actual data are typical or atypical in the space of simulated data. If we\n",
"were to go further, and determine critical values for false alarm rates and then reject the\n",
"model if the actual data fell in its extreme tails, then we would indeed be doing NHST. But\n",
"we don’t go that far. Instead, the goal of the posterior predictive check is to drive intuitions\n",
"about the qualitative manner in which the model succeeds or fails, and about what sort of\n",
"novel model formulation might better capture the trends in the data. Once we invent another\n",
"model, we can use Bayesian methods to compare it quantitatively with the other\n",
"models." ] },
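{ "cell_type": "markdown", "metadata": {}, "source": [ "Both steps of the trick-coin example fit in a few lines of R: the posterior over the two candidate biases, followed by a posterior predictive check of the winning value. Only z, N, and the two θ values come from the text; the simulation details are illustrative. Note how decisive the posterior is, and yet how badly the winning model predicts the actual data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Posterior over the two trick-coin models, given z = 30 heads in N = 40.\n",
"z <- 30 ; N <- 40\n",
"thetas <- c(0.99, 0.01)\n",
"prior  <- c(0.5, 0.5)\n",
"lik    <- dbinom(z, N, thetas)\n",
"post   <- lik * prior / sum(lik * prior)\n",
"post   # the theta = .99 model utterly dominates the theta = .01 model\n",
"\n",
"# Posterior predictive check on the winning parameter value:\n",
"set.seed(47)\n",
"zPred <- rbinom(10000, N, 0.99)   # heads counts that the model predicts\n",
"mean(zPred <= z)                  # observed z = 30 is essentially never predicted\n",
"hist(zPred, breaks = seq(-0.5, N + 0.5, 1),\n",
"     main = \"Predicted z under theta = .99\", xlab = \"heads in 40 flips\")" ] }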
], "metadata": { "kernelspec": { "display_name": "R", "language": "", "name": "ir" }, "language_info": { "codemirror_mode": "r", "file_extension": ".r", "mimetype": "text/x-r-source", "name": "R", "pygments_lexer": "r", "version": "3.2.1" } }, "nbformat": 4, "nbformat_minor": 0 }