{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Statistical Inference" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "### The Method of Comparison\n", "The basic method to mitigate confounding is to compare (at least) two groups, one that receives _treatment_ and a _control group_ that does not (or that gets a different treatment).\n", "To minimize bias, the treatment group and the control group should be as similar as possible but for the fact that\n", "one gets treatment and the other does not.\n", "\n", "If subjects _self-select_ for treatment, that generally results in bias. So does allowing the experimenter flexibility to select the groups.\n", "The best way to minimize bias, and to be able to quantify the uncertainty in the resulting inferences, is to assign subjects to treatment or control _randomly_.\n", "\n", "For human subjects, the mere fact of receiving treatment (even a treatment with no real effect) can\n", "produce changes in response. This is called _the placebo effect_.\n", "For that reason, it is important that human subjects be _blind_ to whether they are treated or not, for instance,\n", "by giving subjects in the control group a _placebo_.\n", "That makes the treatment and control groups more similar.\n", "Both groups receive something: the difference is in _what_ they\n", "receive, rather than _whether_ they receive anything.\n", "\n", "Also, subjective elements can deliberately or inadvertently enter the assessment of subjects' responses to treatment,\n", "making it important for the people assessing the responses to be _blind_ to which subjects received treatment.\n", "When neither the subjects nor the assessors know who was treated, the experiment is _double blind_.\n", "\n", "See [SticiGui: Does Treatment Have an Effect?](http://www.stat.berkeley.edu/~stark/SticiGui/Text/experiments.htm) for more discussion."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Example: Smoking and Cancer\n", "See Freedman, 2009, Chapter 1.\n", "\n", "Smokers get more heart attacks, lung cancer, and other diseases than non-smokers.\n", "Is it because they smoke?\n", "\n", "Most smokers are male, and gender matters for many of those diseases.\n", "So does age, exposure to other environmental agents such as air pollution, etc.\n", "How can we tell whether smoking is responsible for the increased morbidity and mortality?\n", "\n", "### Example: HIP trial of the early 1960s\n", "See Freedman, 2009.\n", "700,000 members of a NY health plan, including 62,000 women between age 40 and 64, who were\n", "randomly assigned to be screened for breast cancer or not.\n", "This is a controlled, randomized experiment.\n", "\n", "\n", "### Example: Snow's study of the origins of Cholera\n", "See [SticiGui](http://www.stat.berkeley.edu/~stark/SticiGui/Text/experiments.htm#cholera)\n", "This is a natural experiment, but a spectacularly good one. Indeed, it helped establish the germ theory\n", "of disease.\n", "\n", "### Example: Yule's study of the Causes of Pauperism\n", "Freedman, 2009.\n", "This is a regression model applied to data from an observational study in an attempt to make causal inferences." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The 2-sample problem \n", "\n", "Suppose we have a group of $N$ individuals who are randomized into two groups, a _treatment_ group of size $N_t$ and a _control_ group of size $N_c = N - N_t$.\n", "Label the individuals from $1$ to $N$.\n", "Let ${\\mathcal T}$ denote the labels of individuals assigned to treatment and ${\\mathcal C}$ denote \n", "the labels of those assigned to control.\n", "\n", "For each of the $N$ individuals, we measure a quantitative (real-valued) response.\n", "Each individual $i$ has two _potential responses_: the response $c_i$ the individual would have if assigned to \n", "the control group, and the response $t_i$ the individual would have if assigned to the treatment group.\n", "\n", "We assume that individual $i$'s response depends _only_ on that individual's assignment, and not on anyone else's assignment.\n", "This is the assumption of _non-interference_. \n", "In some cases, this assumption is reasonable; in others, it is not.\n", "\n", "For instance, imagine testing a vaccine for a communicable disease.\n", "If you and I have contact, whether you get the disease might depend on whether I am vaccinated (and _vice versa_), since if the vaccine protects me from illness, I won't infect you.\n", "Similarly, suppose we are testing the effectiveness of an advertisement for a product.\n", "If you and I are connected and you buy the product, I might be more likely to buy it, even if I don't\n", "see the advertisement.\n", "\n", "Conversely, suppose that \"treatment\" is exposure to a carcinogen, and the response is whether the\n", "subject contracts cancer. 
\n", "On the assumption that cancer is not communicable, my exposure and your disease\n", "status have no connection.\n", "\n", "The _strong null hypothesis_ is that individual by individual, treatment makes no difference whatsoever: $c_i = t_i$ for all $i$.\n", "\n", "If so, any differences between statistics computed for the treatment and control groups are entirely due to the luck of the draw: which individuals happened to be assigned to treatment and which to control.\n", "\n", "We can find the _null distribution_ of any statistic computed from the responses of the two groups: if the strong null hypothesis is true, we know what individual $i$'s response would have been whether assigned to treatment or to control—namely, the same.\n", "\n", "For instance, suppose we suspect that treatment tends to increase response: in general, $t_i \\ge c_i$.\n", "Then we might expect $\\bar{c} = \\frac{1}{N_c} \\sum_{i \\in {\\mathcal C}} c_i$ to be less than\n", "$\\bar{t} = \\frac{1}{N_t} \\sum_{i \\in {\\mathcal T}} t_i$.\n", "How large a difference between $\\bar{c}$ and $\\bar{t}$ would be evidence that treatment increases the response,\n", "beyond what might happen by chance through the luck of the draw?\n", "\n", "This amounts to asking whether the observed difference in means between the two groups is a high percentile\n", "of the distribution of that difference in means, calculated on the assumption that the null hypothesis is true.\n", "\n", "Because of how subjects are assigned to treatment or to control, all allocations of $N_t$ subjects to\n", "treatment are equally likely.\n", "\n", "One way to partition the $N$ subjects randomly into a group of size $N_c$ and a group of size $N_t$ is\n", "to permute the $N$ subjects at random, then take the first $N_c$ in the permuted list to be the control\n", "group, and the remaining $N_t$ to be the treatment group.\n", "\n", "[Note: TO DO discussion of how to construct a random permutation. 
Issues with assigning random numbers to\n", "all items of the list. Compare with Knuth's algorithm, also for computational efficiency.]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Aside: Random number generation\n", "\n", "[To do.]\n", "Most computers cannot generate true random numbers (there are rare exceptions that have _hardware random\n", "number generators_).\n", "Instead, they generate _pseudo-random numbers_ using algorithms called _pseudo-random number generators_ (PRNGs).\n", "\n", "Current high-end statistics packages and programming languages (e.g., R, Python) use the Mersenne Twister\n", "PRNG by default.\n", "The Mersenne Twister has a very long period ($2^{19937}-1$) and passed the [DIEHARD tests](https://en.wikipedia.org/wiki/Diehard_tests) for equidistribution, etc.\n", "However, it is not adequate for cryptography (its internal state, and hence all its future values, can be reconstructed\n", "from a relatively small number of observations).\n", "_Linear congruential_ generators are generally not adequate for statistics.\n", "In particular, beware the algorithms in _Numerical Recipes_ and the Excel PRNG.\n", "\n", "Even the Mersenne Twister runs into trouble generating random permutations of long vectors using the\n", "naive approach (assign a random number to each element of the vector, then sort by those numbers).\n", "The period of the Mersenne Twister is about $4.3 \\times 10^{6001}$. That's less than $2081!$, the number of\n", "permutations of 2081 objects, so the generator cannot produce every permutation of a vector that long.\n", "\n", "[To do: explain Knuth's algorithm]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Significance level and power\n", "\n", "[To do.]\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Permutation tests\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Gender Bias in Teaching Evaluations\n", "MacNell, Driscoll, and Hunt (2014. 
[What's in a Name: Exposing Gender Bias in Student Ratings of Teaching](http://link.springer.com/article/10.1007%2Fs10755-014-9313-4), _Innovative Higher Education_) conducted a controlled, randomized experiment on\n", "the effect of students' perception of instructors' gender on teaching evaluations\n", "in an online course.\n", "Students in the class did not know the instructors' true genders.\n", "\n", "MacNell et al. randomized 43 students in an online course into four groups: two taught by a female\n", "instructor and two by a male instructor.\n", "One of the groups taught by each instructor was led to believe the instructor was male;\n", "the other was led to believe the instructor was female.\n", "Comparable instructor biographies were given to all students.\n", "Instructors treated the groups identically, including returning assignments at the same time.\n", "\n", "When students thought the instructor was female, students rated the instructor lower, on average,\n", "in every regard.\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
| Characteristic | F - M |
|---|---|
| Caring | -0.52 |
| Consistent | -0.47 |
| Enthusiastic | -0.57 |
| Fair | -0.76 |
| Feedback | -0.47 |
| Helpful | -0.46 |
| Knowledgeable | -0.35 |
| Praise | -0.67 |
| Professional | -0.61 |
| Prompt | -0.80 |
| Respectful | -0.61 |
| Responsive | -0.22 |
\n", "\n", "Those results are for a 5-point scale, so a difference of 0.8 is 16% of the entire scale." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "MacNell et al. graciously shared their data.\n", "The evaluation data are coded as follows:\n", "\n", "    Group\n", "        3 (8 students) - TA identified as male, true TA gender female\n", "        4 (12 students) - TA identified as male, true TA gender male\n", "        5 (12 students) - TA identified as female, true TA gender female\n", "        6 (11 students) - TA identified as female, true TA gender male\n", "    tagender - 1 if TA is actually male, 0 if actually female\n", "    taidgender - 1 if TA is identified as male, 0 if identified as female\n", "    gender - 1 if student is male, 0 if student is female\n", "\n", "There are grades for 47 students but evaluations for only 43 (4 did not respond). \n", "The grades are not linked to the evaluations, per the IRB protocol.\n", "\n", "Let's think about the experiment: the students were assigned at random to four groups,\n", "with each group equally likely " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Interlude: partitioning sets into more than two subsets\n", "\n", "Recall that there are $n \\choose k$ ways of picking a subset of size $k$ from $n$ items;\n", "hence there are $n \\choose k$ ways of dividing a set into a subset of size $k$ and one of size $n-k$\n", "(once you select those that belong to the subset of size $k$, the rest must be in the complementary\n", "subset of size $n-k$).\n", "\n", "In this problem, we are partitioning 43 things into 4 subsets, one of size 8, one of size 11, and\n", "two of size 12.\n", "How many ways are there of doing that?\n", "\n", "\n", "Recall the [Fundamental Rule of Counting](http://www.stat.berkeley.edu/~stark/SticiGui/SticiGui/Text/counting.htm#fundamental_rule): \n", "If a set of choices, $T_1, T_2, \\ldots, T_k$, could result, respectively, \n", "in $n_1, n_2, \\ldots, n_k$ possible outcomes, the entire set of 
$k$ choices has\n", "$\\prod_{i=1}^k n_i$ possible outcomes.\n", "\n", "We can think of the allocation of students to the four groups as choosing 8 of the 43\n", "students for the first group, then 11 of the remaining 35 for the second, \n", "then 12 of the remaining 24 for the third.\n", "The fourth group must contain the remaining 12.\n", "\n", "The number of ways of doing that is\n", "\n", "$$ \n", " {43 \\choose 8}{35 \\choose 11}{24 \\choose 12} =\n", " \\frac{43!}{8! 35!} \\frac{35!}{11! 24!} \\frac{24!}{12! 12!} = \\frac{43!}{8! 11! 12! 12!}.\n", "$$\n", "\n", "Does the number depend on the order in which we made the choices?\n", "Suppose we made the choices in a different order: first 12 students for one group, then\n", "8 for the second, then 12 for the third (the fourth gets the remaining 11 students).\n", "The number of ways of doing that is\n", "$$ \n", " {43 \\choose 12}{31 \\choose 8}{23 \\choose 12} =\n", " \\frac{43!}{12! 31!} \\frac{31!}{8! 23!} \\frac{23!}{12! 11!} = \\frac{43!}{8! 11! 12! 12!}.\n", "$$\n", "The number does not depend on the order in which we make the choices.\n", "\n", "By the same reasoning, the number of ways of dividing a set of $n$ objects into\n", "$m$ subsets of sizes $k_1, \\ldots, k_m$ is given by the _multinomial coefficient_\n", "$$\n", " {n \\choose k_1, k_2, \\ldots, k_m} =\n", " {n \\choose k_1}{n-k_1 \\choose k_2} {n-k_1-k_2 \\choose k_3} \\cdots {n - \\sum_{i=1}^{m-2} k_i \\choose k_{m-1}}\n", "$$\n", "\n", "$$ = \\frac{n! (n-k_1)! (n-k_1 - k_2)! \\cdots \n", " (n - k_1 - \\cdots - k_{m-2})!}{k_1! (n-k_1)! k_2! (n-k_1-k_2)! 
\\cdots k_{m-1}! k_m!}\n", " = \\frac{n!}{\\prod_{i=1}^m k_i!}.\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will check how surprising it would be for the means to be so much lower when the TA is identified as female, if in fact there is \"no real difference\" in how they were rated, and the apparent difference is just due to the luck of the draw: which students happened to end up in which section.\n", "\n", "In the actual randomization, all $43 \\choose 8, 11, 12, 12$ allocations\n", "were equally likely.\n", "But there might be real differences between the two instructors.\n", "Hence, we'd like to use each of them as his or her own \"control.\"\n", "\n", "[Aside: Neyman model for causal inference; non-interference.]\n", "Each student's potential responses are represented by a ticket with 4 numbers:\n", "\n", "+ the rating that the student would assign to instructor 1 if instructor 1 is identified as male\n", "+ the rating that the student would assign to instructor 1 if instructor 1 is identified as female\n", "+ the rating that the student would assign to instructor 2 if instructor 2 is identified as male\n", "+ the rating that the student would assign to instructor 2 if instructor 2 is identified as female\n", "\n", "The null hypothesis is that the first two numbers are equal and the second two numbers are equal,\n", "but the first two numbers might be different from the second two numbers.\n", "This corresponds to the hypothesis that \n", "students assigned to a given TA would rate him or her the same, whether that TA seemed to be male or female.\n", "For all students assigned to instructor 1, we know both of the first two numbers if the null hypothesis\n", "is true; but we know neither of the second two numbers.\n", "Similarly, if the null hypothesis is true, we know both of the second two numbers for all students\n", "assigned to instructor 2, but we know neither of the first two numbers for those students.\n", "\n", "Because of how the 
randomization was performed, all allocations \n", "of students to sections that keep the number of students in each section the same are equally likely, so\n", "in particular all allocations that keep the same students assigned to each actual instructor\n", "are equally likely.\n", "\n", "Hence, all ${20 \\choose 8}$ ways of splitting the 20 students assigned to the female instructor into two groups, one with 8 students and one with 12, are equally likely. Similarly, all\n", "${23 \\choose 12}$ ways of splitting the 23 students assigned to the male instructor into two groups, one with 12 students and one with 11, are equally likely.\n", "We can thus imagine shuffling the female TA's students between her sections, and the male TA's students\n", "between his sections, and examine the distribution of the difference between the mean score for the sections where the\n", "TA was identified as male and the mean score for the sections where the TA was identified as\n", "female.\n", "\n", "If the difference is rarely as large as the observed mean difference, the observed mean difference gives\n", "evidence that being identified as female really does lower the scores." ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "1.63604973262356e+23" ], "text/latex": [ "1.63604973262356e+23" ], "text/markdown": [ "1.63604973262356e+23" ], "text/plain": [ "[1] 1.63605e+23" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/html": [ "
\n" ], "text/latex": [ "\\begin{enumerate*}\n", "\\item 37\n", "\\item 40\n", "\\item 4\n", "\\item 30\n", "\\item 18\n", "\\item 9\n", "\\item 25\n", "\\item 8\n", "\\item 14\n", "\\item 17\n", "\\item 16\n", "\\item 32\n", "\\item 24\n", "\\item 11\n", "\\item 36\n", "\\item 6\n", "\\item 23\n", "\\item 38\n", "\\item 22\n", "\\item 35\n", "\\item 7\n", "\\item 34\n", "\\item 28\n", "\\item 39\n", "\\item 3\n", "\\item 5\n", "\\item 41\n", "\\item 26\n", "\\item 43\n", "\\item 21\n", "\\item 2\n", "\\item 33\n", "\\item 12\n", "\\item 27\n", "\\item 31\n", "\\item 42\n", "\\item 1\n", "\\item 13\n", "\\item 19\n", "\\item 29\n", "\\item 20\n", "\\item 15\n", "\\item 10\n", "\\end{enumerate*}\n" ], "text/markdown": [ "1. 37\n", "2. 40\n", "3. 4\n", "4. 30\n", "5. 18\n", "6. 9\n", "7. 25\n", "8. 8\n", "9. 14\n", "10. 17\n", "11. 16\n", "12. 32\n", "13. 24\n", "14. 11\n", "15. 36\n", "16. 6\n", "17. 23\n", "18. 38\n", "19. 22\n", "20. 35\n", "21. 7\n", "22. 34\n", "23. 28\n", "24. 39\n", "25. 3\n", "26. 5\n", "27. 41\n", "28. 26\n", "29. 43\n", "30. 21\n", "31. 2\n", "32. 33\n", "33. 12\n", "34. 27\n", "35. 31\n", "36. 42\n", "37. 1\n", "38. 13\n", "39. 19\n", "40. 29\n", "41. 20\n", "42. 15\n", "43. 10\n", "\n", "\n" ], "text/plain": [ " [1] 37 40 4 30 18 9 25 8 14 17 16 32 24 11 36 6 23 38 22 35 7 34 28 39 3\n", "[26] 5 41 26 43 21 2 33 12 27 31 42 1 13 19 29 20 15 10" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Number of allocations to 8, 11, 12, 12\n", "choose(43,8)*choose(35,11)*choose(24,12) # big number!\n", "\n", "# Random sampling using random permutations\n", "\n", "prompt <- 1:43; # dummy data for illustration\n", "x <- runif(43);\n", "i <- order(x);\n", "prompt[i]" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
\n" ], "text/latex": [ "\\begin{tabular}{r|llllllllllllllllllll}\n", " & group & professional & respect & caring & enthusiastic & communicate & helpful & feedback & prompt & consistent & fair & responsive & praised & knowledgeable & clear & overall & gender & age & tagender & taidgender\\\\\n", "\\hline\n", "\t1 & 3 & 5 & 5 & 4 & 4 & 4 & 3 & 4 & 4 & 4 & 4 & 4 & 4 & 3 & 5 & 4 & 2 & 1990 & 0 & 1\\\\\n", "\t2 & 3 & 4 & 4 & 4 & 4 & 5 & 5 & 5 & 5 & 3 & 4 & 5 & 5 & 5 & 5 & 4 & 1 & 1992 & 0 & 1\\\\\n", "\t3 & 3 & 5 & 5 & 5 & 5 & 5 & 5 & 5 & 5 & 5 & 5 & 5 & 5 & 5 & 5 & 5 & 2 & 1991 & 0 & 1\\\\\n", "\t4 & 3 & 5 & 5 & 5 & 5 & 5 & 3 & 5 & 5 & 5 & 5 & 3 & 5 & 5 & 5 & 5 & 2 & 1991 & 0 & 1\\\\\n", "\t5 & 3 & 5 & 5 & 5 & 5 & 5 & 5 & 5 & 3 & 4 & 5 & 5 & 5 & 5 & 5 & 5 & 2 & 1992 & 0 & 1\\\\\n", "\\end{tabular}\n" ], "text/plain": [ " group professional respect caring enthusiastic communicate helpful feedback\n", "1 3 5 5 4 4 4 3 4\n", "2 3 4 4 4 4 5 5 5\n", "3 3 5 5 5 5 5 5 5\n", "4 3 5 5 5 5 5 3 5\n", "5 3 5 5 5 5 5 5 5\n", " prompt consistent fair responsive praised knowledgeable clear overall gender\n", "1 4 4 4 4 4 3 5 4 2\n", "2 5 3 4 5 5 5 5 4 1\n", "3 5 5 5 5 5 5 5 5 2\n", "4 5 5 5 3 5 5 5 5 2\n", "5 3 4 5 5 5 5 5 5 2\n", " age tagender taidgender\n", "1 1990 0 1\n", "2 1992 0 1\n", "3 1991 0 1\n", "4 1991 0 1\n", "5 1992 0 1" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ " group professional respect caring enthusiastic \n", " Min. :3.000 Min. :1.000 Min. :1.000 Min. :1.00 Min. :1.000 \n", " 1st Qu.:3.000 1st Qu.:4.000 1st Qu.:4.000 1st Qu.:3.50 1st Qu.:4.000 \n", " Median :4.000 Median :5.000 Median :5.000 Median :4.00 Median :4.000 \n", " Mean :4.465 Mean :4.326 Mean :4.326 Mean :3.93 Mean :3.907 \n", " 3rd Qu.:6.000 3rd Qu.:5.000 3rd Qu.:5.000 3rd Qu.:5.00 3rd Qu.:4.500 \n", " Max. :6.000 Max. :5.000 Max. :5.000 Max. :5.00 Max. :5.000 \n", " communicate helpful feedback prompt \n", " Min. :1.000 Min. :1.000 Min. 
:1.000 Min. :1.000 \n", " 1st Qu.:4.000 1st Qu.:3.000 1st Qu.:4.000 1st Qu.:4.000 \n", " Median :4.000 Median :4.000 Median :4.000 Median :4.000 \n", " Mean :3.953 Mean :3.744 Mean :3.953 Mean :3.977 \n", " 3rd Qu.:5.000 3rd Qu.:5.000 3rd Qu.:5.000 3rd Qu.:5.000 \n", " Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000 \n", " consistent fair responsive praised knowledgeable \n", " Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.00 \n", " 1st Qu.:3.000 1st Qu.:3.500 1st Qu.:3.000 1st Qu.:4.000 1st Qu.:4.00 \n", " Median :4.000 Median :4.000 Median :4.000 Median :4.000 Median :4.00 \n", " Mean :3.744 Mean :3.907 Mean :3.767 Mean :4.209 Mean :4.14 \n", " 3rd Qu.:5.000 3rd Qu.:5.000 3rd Qu.:4.500 3rd Qu.:5.000 3rd Qu.:5.00 \n", " Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.00 \n", " clear overall gender age \n", " Min. :1.000 Min. :1.000 Min. :1.000 Min. :1982 \n", " 1st Qu.:3.000 1st Qu.:4.000 1st Qu.:1.000 1st Qu.:1990 \n", " Median :4.000 Median :4.000 Median :2.000 Median :1990 \n", " Mean :3.721 Mean :3.953 Mean :1.535 Mean :1990 \n", " 3rd Qu.:5.000 3rd Qu.:5.000 3rd Qu.:2.000 3rd Qu.:1991 \n", " Max. :5.000 Max. :5.000 Max. :2.000 Max. :2012 \n", " tagender taidgender \n", " Min. :0.0000 Min. :0.0000 \n", " 1st Qu.:0.0000 1st Qu.:0.0000 \n", " Median :1.0000 Median :1.0000 \n", " Mean :0.5349 Mean :0.5349 \n", " 3rd Qu.:1.0000 3rd Qu.:1.0000 \n", " Max. :1.0000 Max. :1.0000 " ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# the data are in a .csv file called \"Macnell-RatingsData.csv\" in the directory Data\n", "ratings <- read.csv(\"Data/Macnell-RatingsData.csv\", as.is=T); # reads a .csv file into a DataFrame\n", "ratings[1:5,]\n", "summary(ratings) # summary statistics for the data. Note the issue with \"age\"" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "
sprintf {base}R Documentation
\n", "\n", "

Use C-style String Formatting Commands

\n", "\n", "

Description

\n", "\n", "

A wrapper for the C function sprintf, that returns a character\n", "vector containing a formatted combination of text and variable values.\n", "

\n", "\n", "\n", "

Usage

\n", "\n", "
\n",
       "sprintf(fmt, ...)\n",
       "gettextf(fmt, ..., domain = NULL)\n",
       "
\n", "\n", "\n", "

Arguments

\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
fmt\n", "

a character vector of format strings, each of up to 8192 bytes.

\n", "
...\n", "

values to be passed into fmt. Only logical,\n", "integer, real and character vectors are supported, but some coercion\n", "will be done: see the ‘Details’ section.

\n", "
domain\n", "

see gettext.

\n", "
\n", "\n", "\n", "

Details

\n", "\n", "

sprintf is a wrapper for the system sprintf C-library\n", "function. Attempts are made to check that the mode of the values\n", "passed match the format supplied, and R's special values (NA,\n", "Inf, -Inf and NaN) are handled correctly.\n", "

\n", "

gettextf is a convenience function which provides C-style\n", "string formatting with possible translation of the format string.\n", "

\n", "

The arguments (including fmt) are recycled if possible a whole\n", "number of times to the length of the longest, and then the formatting\n", "is done in parallel. Zero-length arguments are allowed and will give\n", "a zero-length result. All arguments are evaluated even if unused, and\n", "hence some types (e.g., \"symbol\" or \"language\", see\n", "typeof) are not allowed.\n", "

\n", "

The following is abstracted from Kernighan and Ritchie (see\n", "References): however the actual implementation will follow the C99\n", "standard and fine details (especially the behaviour under user error)\n", "may depend on the platform.\n", "

\n", "

The string fmt contains normal characters,\n", "which are passed through to the output string, and also conversion\n", "specifications which operate on the arguments provided through\n", ".... The allowed conversion specifications start with a\n", "% and end with one of the letters in the set\n", "aAdifeEgGosxX%. These letters denote the following types:\n", "

\n", "\n", "
\n", "
d, i, o, x, X

Integer\n", "value, o being octal, \n", "x and X being hexadecimal (using the same case for\n", "a-f as the code). Numeric variables with exactly integer\n", "values will be coerced to integer. Formats d and i\n", "can also be used for logical variables, which will be converted to\n", "0, 1 or NA.\n", "

\n", "
\n", "
f

Double precision value, in “fixed\n", "point” decimal notation of the form "[-]mmm.ddd". The number of\n", "decimal places ("d") is specified by the precision: the default is 6;\n", "a precision of 0 suppresses the decimal point. Non-finite values\n", "are converted to NA, NaN or (perhaps a sign followed\n", "by) Inf.\n", "

\n", "
\n", "
e, E

Double precision value, in\n", "“exponential” decimal notation of the\n", "form [-]m.ddde[+-]xx or [-]m.dddE[+-]xx.\n", "

\n", "
\n", "
g, G

Double precision value, in %e or\n", "%E format if the exponent is less than -4 or greater than or\n", "equal to the precision, and %f format otherwise.\n", "(The precision (default 6) specifies the number of\n", "significant digits here, whereas in %f, %e, it is\n", "the number of digits after the decimal point.)\n", "

\n", "
\n", "
a, A

Double precision value, in binary notation\n", "of the form [-]0xh.hhhp[+-]d. This is a binary fraction\n", "expressed in hex multiplied by a (decimal) power of 2. The number\n", "of hex digits after the decimal point is specified by the precision:\n", "the default is enough digits to represent exactly the internal\n", "binary representation. Non-finite values are converted to NA,\n", "NaN or (perhaps a sign followed by) Inf. Format\n", "%a uses lower-case for x, p and the hex\n", "values: format %A uses upper-case.\n", "

\n", "

This should be supported on all platforms as it is a feature of C99.\n", "The format is not uniquely defined: although it would be possible\n", "to make the leading h always zero or one, this is not\n", "always done. Most systems will suppress trailing zeros, but a few\n", "do not. On a well-written platform, for normal numbers there will\n", "be a leading one before the decimal point plus (by default) 13\n", "hexadecimal digits, hence 53 bits. The treatment of denormalized\n", "(aka ‘subnormal’) numbers is very platform-dependent.\n", "

\n", "
\n", "
s

Character string. Character NAs are\n", "converted to \"NA\".\n", "

\n", "
\n", "
%

Literal % (none of the extra formatting\n", "characters given below are permitted in this case).\n", "

\n", "
\n", "
\n", "\n", "

Conversion by as.character is used for non-character\n", "arguments with s and by as.double for\n", "non-double arguments with f, e, E, g, G. NB: the length is\n", "determined before conversion, so do not rely on the internal\n", "coercion if this would change the length. The coercion is done only\n", "once, so if length(fmt) > 1 then all elements must expect the\n", "same types of arguments.\n", "

\n", "

In addition, between the initial % and the terminating\n", "conversion character there may be, in any order:\n", "

\n", "\n", "
\n", "
m.n

Two numbers separated by a period, denoting the\n", "field width (m) and the precision (n).

\n", "
\n", "
-

Left adjustment of converted argument in its field.

\n", "
\n", "
+

Always print number with sign: by default only\n", "negative numbers are printed with a sign.

\n", "
\n", "
a space

Prefix a space if the first character is not a sign.

\n", "
\n", "
0

For numbers, pad to the field width with leading\n", "zeros. For characters, this zero-pads on some platforms and is\n", "ignored on others.

\n", "
\n", "
#

specifies “alternate output” for numbers, its\n", "action depending on the type:\n", "For x or X, 0x or 0X will be prefixed\n", "to a non-zero result. For e, e, f, g\n", "and G, the output will always have a decimal point; for\n", "g and G, trailing zeros will not be removed.\n", "

\n", "
\n", "
\n", "\n", "

Further, immediately after % may come 1$ to 99$\n", "to refer to numbered argument: this allows arguments to be\n", "referenced out of order and is mainly intended for translators of\n", "error messages. If this is done it is best if all formats are\n", "numbered: if not the unnumbered ones process the arguments in order.\n", "See the examples. This notation allows arguments to be used more than\n", "once, in which case they must be used as the same type (integer,\n", "double or character).\n", "

\n", "

A field width or precision (but not both) may be indicated by an\n", "asterisk *: in this case an argument specifies the desired\n", "number. A negative field width is taken as a '-' flag followed by a\n", "positive field width. A negative precision is treated as if the\n", "precision were omitted. The argument should be integer, but a double\n", "argument will be coerced to integer.\n", "

\n", "

There is a limit of 8192 bytes on elements of fmt, and on\n", "strings included from a single %letter conversion\n", "specification.\n", "

\n", "

Field widths and precisions of %s conversions are interpreted\n", "as bytes, not characters, as described in the C standard.\n", "

\n", "

The C doubles used for R numerical vectors have signed zeros, which\n", "sprintf may output as -0, -0.000 ....\n", "

\n", "\n", "\n", "

Value

\n", "\n", "

A character vector of length that of the longest input. If any\n", "element of fmt or any character argument is declared as UTF-8,\n", "the element of the result will be in UTF-8 and have the encoding\n", "declared as UTF-8. Otherwise it will be in the current locale's\n", "encoding.\n", "

\n", "\n", "\n", "

Warning

\n", "\n", "

The format string is passed down to the OS's sprintf function, and\n", "incorrect formats can cause the latter to crash the R process. R\n", "does perform sanity checks on the format, but not all possible user\n", "errors on all platforms have been tested, and some might be terminal.\n", "

\n", "

The behaviour on inputs not documented here is ‘undefined’,\n", "which means it is allowed to differ by platform.\n", "

\n", "\n", "\n", "

Author(s)

\n", "\n", "

Original code by Jonathan Rougier.\n", "

\n", "\n", "\n", "

References

\n", "\n", "

Kernighan, B. W. and Ritchie, D. M. (1988)\n", "The C Programming Language. Second edition, Prentice Hall.\n", "Describes the format options in table B-1 in the Appendix.\n", "

\n", "

The C Standards, especially ISO/IEC 9899:1999 for ‘C99’. Links\n", "can be found at http://developer.r-project.org/Portability.html.\n", "

\n", "

man sprintf on a Unix-alike system.\n", "

\n", "\n", "\n", "

See Also

\n", "\n", "

formatC for a way of formatting vectors of numbers in a\n", "similar fashion.\n", "

\n", "

paste for another way of creating a vector combining\n", "text and values.\n", "

\n", "

gettext for the mechanisms for the automated translation\n", "of text.\n", "

\n", "\n", "\n", "

Examples

\n", "\n", "
\n",
       "## be careful with the format: most things in R are floats\n",
       "## only integer-valued reals get coerced to integer.\n",
       "\n",
       "sprintf(\"%s is %f feet tall\\n\", \"Sven\", 7.1)      # OK\n",
       "try(sprintf(\"%s is %i feet tall\\n\", \"Sven\", 7.1)) # not OK\n",
       "    sprintf(\"%s is %i feet tall\\n\", \"Sven\", 7  )  # OK\n",
       "\n",
       "## use a literal % :\n",
       "\n",
       "sprintf(\"%.0f%% said yes (out of a sample of size %.0f)\", 66.666, 3)\n",
       "\n",
       "## various formats of pi :\n",
       "\n",
       "sprintf(\"%f\", pi)\n",
       "sprintf(\"%.3f\", pi)\n",
       "sprintf(\"%1.0f\", pi)\n",
       "sprintf(\"%5.1f\", pi)\n",
       "sprintf(\"%05.1f\", pi)\n",
       "sprintf(\"%+f\", pi)\n",
       "sprintf(\"% f\", pi)\n",
       "sprintf(\"%-10f\", pi) # left justified\n",
       "sprintf(\"%e\", pi)\n",
       "sprintf(\"%E\", pi)\n",
       "sprintf(\"%g\", pi)\n",
       "sprintf(\"%g\",   1e6 * pi) # -> exponential\n",
       "sprintf(\"%.9g\", 1e6 * pi) # -> \"fixed\"\n",
       "sprintf(\"%G\", 1e-6 * pi)\n",
       "\n",
       "## no truncation:\n",
       "sprintf(\"%1.f\", 101)\n",
       "\n",
       "## re-use one argument three times, show difference between %x and %X\n",
       "xx <- sprintf(\"%1$d %1$x %1$X\", 0:15)\n",
       "xx <- matrix(xx, dimnames = list(rep(\"\", 16), \"%d%x%X\"))\n",
       "noquote(format(xx, justify = \"right\"))\n",
       "\n",
       "## More sophisticated:\n",
       "\n",
       "sprintf(\"min 10-char string '%10s'\",\n",
       "        c(\"a\", \"ABC\", \"and an even longer one\"))\n",
       "\n",
       "## Platform-dependent bad example from qdapTools 1.0.0:\n",
       "## may pad with spaces or zeroes.\n",
       "sprintf(\"%09s\", month.name)\n",
       "\n",
       "n <- 1:18\n",
       "sprintf(paste0(\"e with %2d digits = %.\", n, \"g\"), n, exp(1))\n",
       "\n",
       "## Using arguments out of order\n",
       "sprintf(\"second %2$1.0f, first %1$5.2f, third %3$1.0f\", pi, 2, 3)\n",
       "\n",
       "## Using asterisk for width or precision\n",
       "sprintf(\"precision %.*f, width '%*.3f'\", 3, pi, 8, pi)\n",
       "\n",
       "## Asterisk and argument re-use, 'e' example reiterated:\n",
       "sprintf(\"e with %1$2d digits = %2$.*1$g\", n, exp(1))\n",
       "\n",
       "## re-cycle arguments\n",
       "sprintf(\"%s %d\", \"test\", 1:3)\n",
       "\n",
       "## binary output showing rounding/representation errors\n",
       "x <- seq(0, 1.0, 0.1); y <- c(0,.1,.2,.3,.4,.5,.6,.7,.8,.9,1)\n",
       "cbind(x, sprintf(\"%a\", x), sprintf(\"%a\", y))\n",
       "
\n", "\n", "
[Package base version 3.1.3 ]
" ], "text/latex": [ "\\inputencoding{utf8}\n", "\\HeaderA{sprintf}{Use C-style String Formatting Commands}{sprintf}\n", "\\aliasA{gettextf}{sprintf}{gettextf}\n", "\\keyword{print}{sprintf}\n", "\\keyword{character}{sprintf}\n", "%\n", "\\begin{Description}\\relax\n", "A wrapper for the C function \\code{sprintf}, that returns a character\n", "vector containing a formatted combination of text and variable values.\n", "\\end{Description}\n", "%\n", "\\begin{Usage}\n", "\\begin{verbatim}\n", "sprintf(fmt, ...)\n", "gettextf(fmt, ..., domain = NULL)\n", "\\end{verbatim}\n", "\\end{Usage}\n", "%\n", "\\begin{Arguments}\n", "\\begin{ldescription}\n", "\\item[\\code{fmt}] a character vector of format strings, each of up to 8192 bytes.\n", "\\item[\\code{...}] values to be passed into \\code{fmt}. Only logical,\n", "integer, real and character vectors are supported, but some coercion\n", "will be done: see the `Details' section.\n", "\\item[\\code{domain}] see \\code{\\LinkA{gettext}{gettext}}.\n", "\\end{ldescription}\n", "\\end{Arguments}\n", "%\n", "\\begin{Details}\\relax\n", "\\code{sprintf} is a wrapper for the system \\code{sprintf} C-library\n", "function. Attempts are made to check that the mode of the values\n", "passed match the format supplied, and \\R{}'s special values (\\code{NA},\n", "\\code{Inf}, \\code{-Inf} and \\code{NaN}) are handled correctly.\n", "\n", "\\code{gettextf} is a convenience function which provides C-style\n", "string formatting with possible translation of the format string.\n", "\n", "The arguments (including \\code{fmt}) are recycled if possible a whole\n", "number of times to the length of the longest, and then the formatting\n", "is done in parallel. Zero-length arguments are allowed and will give\n", "a zero-length result. 
All arguments are evaluated even if unused, and\n", "hence some types (e.g., \\code{\"symbol\"} or \\code{\"language\"}, see\n", "\\code{\\LinkA{typeof}{typeof}}) are not allowed.\n", "\n", "The following is abstracted from Kernighan and Ritchie (see\n", "References): however the actual implementation will follow the C99\n", "standard and fine details (especially the behaviour under user error)\n", "may depend on the platform.\n", "\n", "The string \\code{fmt} contains normal characters,\n", "which are passed through to the output string, and also conversion\n", "specifications which operate on the arguments provided through\n", "\\code{...}. The allowed conversion specifications start with a\n", "\\code{\\%} and end with one of the letters in the set\n", "\\code{aAdifeEgGosxX\\%}. These letters denote the following types:\n", "\n", "\\begin{description}\n", "\n", "\\item[\\code{d}, \\code{i}, \\code{o}, \\code{x}, \\code{X}] Integer\n", "value, \\code{o} being octal, \n", "\\code{x} and \\code{X} being hexadecimal (using the same case for\n", "\\code{a-f} as the code). Numeric variables with exactly integer\n", "values will be coerced to integer. Formats \\code{d} and \\code{i}\n", "can also be used for logical variables, which will be converted to\n", "\\code{0}, \\code{1} or \\code{NA}.\n", "\n", "\\item[\\code{f}] Double precision value, in ``\\bold{f}ixed\n", "point'' decimal notation of the form \"[-]mmm.ddd\". The number of\n", "decimal places (\"d\") is specified by the precision: the default is 6;\n", "a precision of 0 suppresses the decimal point. 
Non-finite values\n", "are converted to \\code{NA}, \\code{NaN} or (perhaps a sign followed\n", "by) \\code{Inf}.\n", "\n", "\\item[\\code{e}, \\code{E}] Double precision value, in\n", "``\\bold{e}xponential'' decimal notation of the\n", "form \\code{[-]m.ddde[+-]xx} or \\code{[-]m.dddE[+-]xx}.\n", "\n", "\\item[\\code{g}, \\code{G}] Double precision value, in \\code{\\%e} or\n", "\\code{\\%E} format if the exponent is less than -4 or greater than or\n", "equal to the precision, and \\code{\\%f} format otherwise.\n", "(The precision (default 6) specifies the number of\n", "\\emph{significant} digits here, whereas in \\code{\\%f, \\%e}, it is\n", "the number of digits after the decimal point.)\n", "\n", "\\item[\\code{a}, \\code{A}] Double precision value, in binary notation\n", "of the form \\code{[-]0xh.hhhp[+-]d}. This is a binary fraction\n", "expressed in hex multiplied by a (decimal) power of 2. The number\n", "of hex digits after the decimal point is specified by the precision:\n", "the default is enough digits to represent exactly the internal\n", "binary representation. Non-finite values are converted to \\code{NA},\n", "\\code{NaN} or (perhaps a sign followed by) \\code{Inf}. Format\n", "\\code{\\%a} uses lower-case for \\code{x}, \\code{p} and the hex\n", "values: format \\code{\\%A} uses upper-case.\n", "\n", "This should be supported on all platforms as it is a feature of C99.\n", "The format is not uniquely defined: although it would be possible\n", "to make the leading \\code{h} always zero or one, this is not\n", "always done. Most systems will suppress trailing zeros, but a few\n", "do not. On a well-written platform, for normal numbers there will\n", "be a leading one before the decimal point plus (by default) 13\n", "hexadecimal digits, hence 53 bits. The treatment of denormalized\n", "(aka `subnormal') numbers is very platform-dependent.\n", "\n", "\\item[\\code{s}] Character string. 
Character \\code{NA}s are\n", "converted to \\code{\"NA\"}.\n", "\n", "\\item[\\code{\\%}] Literal \\code{\\%} (none of the extra formatting\n", "characters given below are permitted in this case).\n", "\n", "\n", "\\end{description}\n", "\n", "Conversion by \\code{\\LinkA{as.character}{as.character}} is used for non-character\n", "arguments with \\code{s} and by \\code{\\LinkA{as.double}{as.double}} for\n", "non-double arguments with \\code{f, e, E, g, G}. NB: the length is\n", "determined before conversion, so do not rely on the internal\n", "coercion if this would change the length. The coercion is done only\n", "once, so if \\code{length(fmt) > 1} then all elements must expect the\n", "same types of arguments.\n", "\n", "In addition, between the initial \\code{\\%} and the terminating\n", "conversion character there may be, in any order:\n", "\n", "\\begin{description}\n", "\n", "\\item[\\code{m.n}] Two numbers separated by a period, denoting the\n", "field width (\\code{m}) and the precision (\\code{n}).\n", "\\item[\\code{-}] Left adjustment of converted argument in its field.\n", "\\item[\\code{+}] Always print number with sign: by default only\n", "negative numbers are printed with a sign.\n", "\\item[a space] Prefix a space if the first character is not a sign.\n", "\\item[\\code{0}] For numbers, pad to the field width with leading\n", "zeros. For characters, this zero-pads on some platforms and is\n", "ignored on others.\n", "\\item[\\code{\\#}] specifies ``alternate output'' for numbers, its\n", "action depending on the type:\n", "For \\code{x} or \\code{X}, \\code{0x} or \\code{0X} will be prefixed\n", "to a non-zero result. 
For \\code{e}, \\code{e}, \\code{f}, \\code{g}\n", "and \\code{G}, the output will always have a decimal point; for\n", "\\code{g} and \\code{G}, trailing zeros will not be removed.\n", "\n", "\n", "\\end{description}\n", "\n", "Further, immediately after \\code{\\%} may come \\code{1\\$} to \\code{99\\$}\n", "to refer to numbered argument: this allows arguments to be\n", "referenced out of order and is mainly intended for translators of\n", "error messages. If this is done it is best if all formats are\n", "numbered: if not the unnumbered ones process the arguments in order.\n", "See the examples. This notation allows arguments to be used more than\n", "once, in which case they must be used as the same type (integer,\n", "double or character).\n", "\n", "A field width or precision (but not both) may be indicated by an\n", "asterisk \\code{*}: in this case an argument specifies the desired\n", "number. A negative field width is taken as a '-' flag followed by a\n", "positive field width. A negative precision is treated as if the\n", "precision were omitted. The argument should be integer, but a double\n", "argument will be coerced to integer.\n", "\n", "There is a limit of 8192 bytes on elements of \\code{fmt}, and on\n", "strings included from a single \\code{\\%}\\emph{letter} conversion\n", "specification.\n", "\n", "Field widths and precisions of \\code{\\%s} conversions are interpreted\n", "as bytes, not characters, as described in the C standard.\n", "\n", "The C doubles used for \\R{} numerical vectors have signed zeros, which\n", "\\code{sprintf} may output as \\code{-0}, \\code{-0.000} \\dots.\n", "\\end{Details}\n", "%\n", "\\begin{Value}\n", "A character vector of length that of the longest input. If any\n", "element of \\code{fmt} or any character argument is declared as UTF-8,\n", "the element of the result will be in UTF-8 and have the encoding\n", "declared as UTF-8. 
Otherwise it will be in the current locale's\n", "encoding.\n", "\\end{Value}\n", "%\n", "\\begin{Section}{Warning}\n", "The format string is passed down the OS's \\code{sprintf} function, and\n", "incorrect formats can cause the latter to crash the \\R{} process . \\R{}\n", "does perform sanity checks on the format, but not all possible user\n", "errors on all platforms have been tested, and some might be terminal.\n", "\n", "The behaviour on inputs not documented here is `undefined',\n", "which means it is allowed to differ by platform.\n", "\\end{Section}\n", "%\n", "\\begin{Author}\\relax\n", "Original code by Jonathan Rougier.\n", "\\end{Author}\n", "%\n", "\\begin{References}\\relax\n", "Kernighan, B. W. and Ritchie, D. M. (1988)\n", "\\emph{The C Programming Language.} Second edition, Prentice Hall.\n", "Describes the format options in table B-1 in the Appendix.\n", "\n", "The C Standards, especially ISO/IEC 9899:1999 for `C99'. Links\n", "can be found at \\url{http://developer.r-project.org/Portability.html}.\n", "\n", "\\command{man sprintf} on a Unix-alike system.\n", "\\end{References}\n", "%\n", "\\begin{SeeAlso}\\relax\n", "\\code{\\LinkA{formatC}{formatC}} for a way of formatting vectors of numbers in a\n", "similar fashion.\n", "\n", "\\code{\\LinkA{paste}{paste}} for another way of creating a vector combining\n", "text and values.\n", "\n", "\\code{\\LinkA{gettext}{gettext}} for the mechanisms for the automated translation\n", "of text.\n", "\\end{SeeAlso}\n", "%\n", "\\begin{Examples}\n", "\\begin{ExampleCode}\n", "## be careful with the format: most things in R are floats\n", "## only integer-valued reals get coerced to integer.\n", "\n", "sprintf(\"%s is %f feet tall\\n\", \"Sven\", 7.1) # OK\n", "try(sprintf(\"%s is %i feet tall\\n\", \"Sven\", 7.1)) # not OK\n", " sprintf(\"%s is %i feet tall\\n\", \"Sven\", 7 ) # OK\n", "\n", "## use a literal % :\n", "\n", "sprintf(\"%.0f%% said yes (out of a sample of size %.0f)\", 66.666, 3)\n", "\n", "## 
various formats of pi :\n", "\n", "sprintf(\"%f\", pi)\n", "sprintf(\"%.3f\", pi)\n", "sprintf(\"%1.0f\", pi)\n", "sprintf(\"%5.1f\", pi)\n", "sprintf(\"%05.1f\", pi)\n", "sprintf(\"%+f\", pi)\n", "sprintf(\"% f\", pi)\n", "sprintf(\"%-10f\", pi) # left justified\n", "sprintf(\"%e\", pi)\n", "sprintf(\"%E\", pi)\n", "sprintf(\"%g\", pi)\n", "sprintf(\"%g\", 1e6 * pi) # -> exponential\n", "sprintf(\"%.9g\", 1e6 * pi) # -> \"fixed\"\n", "sprintf(\"%G\", 1e-6 * pi)\n", "\n", "## no truncation:\n", "sprintf(\"%1.f\", 101)\n", "\n", "## re-use one argument three times, show difference between %x and %X\n", "xx <- sprintf(\"%1$d %1$x %1$X\", 0:15)\n", "xx <- matrix(xx, dimnames = list(rep(\"\", 16), \"%d%x%X\"))\n", "noquote(format(xx, justify = \"right\"))\n", "\n", "## More sophisticated:\n", "\n", "sprintf(\"min 10-char string '%10s'\",\n", " c(\"a\", \"ABC\", \"and an even longer one\"))\n", "\n", "## Platform-dependent bad example from qdapTools 1.0.0:\n", "## may pad with spaces or zeroes.\n", "sprintf(\"%09s\", month.name)\n", "\n", "n <- 1:18\n", "sprintf(paste0(\"e with %2d digits = %.\", n, \"g\"), n, exp(1))\n", "\n", "## Using arguments out of order\n", "sprintf(\"second %2$1.0f, first %1$5.2f, third %3$1.0f\", pi, 2, 3)\n", "\n", "## Using asterisk for width or precision\n", "sprintf(\"precision %.*f, width '%*.3f'\", 3, pi, 8, pi)\n", "\n", "## Asterisk and argument re-use, 'e' example reiterated:\n", "sprintf(\"e with %1$2d digits = %2$.*1$g\", n, exp(1))\n", "\n", "## re-cycle arguments\n", "sprintf(\"%s %d\", \"test\", 1:3)\n", "\n", "## binary output showing rounding/representation errors\n", "x <- seq(0, 1.0, 0.1); y <- c(0,.1,.2,.3,.4,.5,.6,.7,.8,.9,1)\n", "cbind(x, sprintf(\"%a\", x), sprintf(\"%a\", y))\n", "\\end{ExampleCode}\n", "\\end{Examples}" ], "text/plain": [ "sprintf package:base R Documentation\n", "\n", "_\bU_\bs_\be _\bC-_\bs_\bt_\by_\bl_\be _\bS_\bt_\br_\bi_\bn_\bg _\bF_\bo_\br_\bm_\ba_\bt_\bt_\bi_\bn_\bg 
_\bC_\bo_\bm_\bm_\ba_\bn_\bd_\bs\n", "\n", "_\bD_\be_\bs_\bc_\br_\bi_\bp_\bt_\bi_\bo_\bn:\n", "\n", " A wrapper for the C function ‘sprintf’, that returns a character\n", " vector containing a formatted combination of text and variable\n", " values.\n", "\n", "_\bU_\bs_\ba_\bg_\be:\n", "\n", " sprintf(fmt, ...)\n", " gettextf(fmt, ..., domain = NULL)\n", " \n", "_\bA_\br_\bg_\bu_\bm_\be_\bn_\bt_\bs:\n", "\n", " fmt: a character vector of format strings, each of up to 8192\n", " bytes.\n", "\n", " ...: values to be passed into ‘fmt’. Only logical, integer, real\n", " and character vectors are supported, but some coercion will\n", " be done: see the ‘Details’ section.\n", "\n", " domain: see ‘gettext’.\n", "\n", "_\bD_\be_\bt_\ba_\bi_\bl_\bs:\n", "\n", " ‘sprintf’ is a wrapper for the system ‘sprintf’ C-library\n", " function. Attempts are made to check that the mode of the values\n", " passed match the format supplied, and R's special values (‘NA’,\n", " ‘Inf’, ‘-Inf’ and ‘NaN’) are handled correctly.\n", "\n", " ‘gettextf’ is a convenience function which provides C-style string\n", " formatting with possible translation of the format string.\n", "\n", " The arguments (including ‘fmt’) are recycled if possible a whole\n", " number of times to the length of the longest, and then the\n", " formatting is done in parallel. Zero-length arguments are allowed\n", " and will give a zero-length result. 
All arguments are evaluated\n", " even if unused, and hence some types (e.g., ‘\"symbol\"’ or\n", " ‘\"language\"’, see ‘typeof’) are not allowed.\n", "\n", " The following is abstracted from Kernighan and Ritchie (see\n", " References): however the actual implementation will follow the C99\n", " standard and fine details (especially the behaviour under user\n", " error) may depend on the platform.\n", "\n", " The string ‘fmt’ contains normal characters, which are passed\n", " through to the output string, and also conversion specifications\n", " which operate on the arguments provided through ‘...’. The\n", " allowed conversion specifications start with a ‘%’ and end with\n", " one of the letters in the set ‘aAdifeEgGosxX%’. These letters\n", " denote the following types:\n", "\n", " ‘d’, ‘i’, ‘o’, ‘x’, ‘X’ Integer value, ‘o’ being octal, ‘x’ and\n", " ‘X’ being hexadecimal (using the same case for ‘a-f’ as the\n", " code). Numeric variables with exactly integer values will be\n", " coerced to integer. Formats ‘d’ and ‘i’ can also be used for\n", " logical variables, which will be converted to ‘0’, ‘1’ or\n", " ‘NA’.\n", "\n", " ‘f’ Double precision value, in “*f*ixed point” decimal notation of\n", " the form \"[-]mmm.ddd\". The number of decimal places (\"d\") is\n", " specified by the precision: the default is 6; a precision of\n", " 0 suppresses the decimal point. Non-finite values are\n", " converted to ‘NA’, ‘NaN’ or (perhaps a sign followed by)\n", " ‘Inf’.\n", "\n", " ‘e’, ‘E’ Double precision value, in “*e*xponential” decimal\n", " notation of the form ‘[-]m.ddde[+-]xx’ or ‘[-]m.dddE[+-]xx’.\n", "\n", " ‘g’, ‘G’ Double precision value, in ‘%e’ or ‘%E’ format if the\n", " exponent is less than -4 or greater than or equal to the\n", " precision, and ‘%f’ format otherwise. 
(The precision\n", " (default 6) specifies the number of _significant_ digits\n", " here, whereas in ‘%f, %e’, it is the number of digits after\n", " the decimal point.)\n", "\n", " ‘a’, ‘A’ Double precision value, in binary notation of the form\n", " ‘[-]0xh.hhhp[+-]d’. This is a binary fraction expressed in\n", " hex multiplied by a (decimal) power of 2. The number of hex\n", " digits after the decimal point is specified by the precision:\n", " the default is enough digits to represent exactly the\n", " internal binary representation. Non-finite values are\n", " converted to ‘NA’, ‘NaN’ or (perhaps a sign followed by)\n", " ‘Inf’. Format ‘%a’ uses lower-case for ‘x’, ‘p’ and the hex\n", " values: format ‘%A’ uses upper-case.\n", "\n", " This should be supported on all platforms as it is a feature\n", " of C99. The format is not uniquely defined: although it\n", " would be possible to make the leading ‘h’ always zero or one,\n", " this is not always done. Most systems will suppress trailing\n", " zeros, but a few do not. On a well-written platform, for\n", " normal numbers there will be a leading one before the decimal\n", " point plus (by default) 13 hexadecimal digits, hence 53 bits.\n", " The treatment of denormalized (aka ‘subnormal’) numbers is\n", " very platform-dependent.\n", "\n", " ‘s’ Character string. Character ‘NA’s are converted to ‘\"NA\"’.\n", "\n", " ‘%’ Literal ‘%’ (none of the extra formatting characters given\n", " below are permitted in this case).\n", "\n", " Conversion by ‘as.character’ is used for non-character arguments\n", " with ‘s’ and by ‘as.double’ for non-double arguments with ‘f, e,\n", " E, g, G’. 
NB: the length is determined before conversion, so do\n", " not rely on the internal coercion if this would change the length.\n", " The coercion is done only once, so if ‘length(fmt) > 1’ then all\n", " elements must expect the same types of arguments.\n", "\n", " In addition, between the initial ‘%’ and the terminating\n", " conversion character there may be, in any order:\n", "\n", " ‘m.n’ Two numbers separated by a period, denoting the field width\n", " (‘m’) and the precision (‘n’).\n", "\n", " ‘-’ Left adjustment of converted argument in its field.\n", "\n", " ‘+’ Always print number with sign: by default only negative\n", " numbers are printed with a sign.\n", "\n", " a space Prefix a space if the first character is not a sign.\n", "\n", " ‘0’ For numbers, pad to the field width with leading zeros. For\n", " characters, this zero-pads on some platforms and is ignored\n", " on others.\n", "\n", " ‘#’ specifies “alternate output” for numbers, its action depending\n", " on the type: For ‘x’ or ‘X’, ‘0x’ or ‘0X’ will be prefixed to\n", " a non-zero result. For ‘e’, ‘e’, ‘f’, ‘g’ and ‘G’, the\n", " output will always have a decimal point; for ‘g’ and ‘G’,\n", " trailing zeros will not be removed.\n", "\n", " Further, immediately after ‘%’ may come ‘1$’ to ‘99$’ to refer to\n", " numbered argument: this allows arguments to be referenced out of\n", " order and is mainly intended for translators of error messages.\n", " If this is done it is best if all formats are numbered: if not the\n", " unnumbered ones process the arguments in order. See the examples.\n", " This notation allows arguments to be used more than once, in which\n", " case they must be used as the same type (integer, double or\n", " character).\n", "\n", " A field width or precision (but not both) may be indicated by an\n", " asterisk ‘*’: in this case an argument specifies the desired\n", " number. A negative field width is taken as a '-' flag followed by\n", " a positive field width. 
A negative precision is treated as if the\n", " precision were omitted. The argument should be integer, but a\n", " double argument will be coerced to integer.\n", "\n", " There is a limit of 8192 bytes on elements of ‘fmt’, and on\n", " strings included from a single ‘%’_letter_ conversion\n", " specification.\n", "\n", " Field widths and precisions of ‘%s’ conversions are interpreted as\n", " bytes, not characters, as described in the C standard.\n", "\n", " The C doubles used for R numerical vectors have signed zeros,\n", " which ‘sprintf’ may output as ‘-0’, ‘-0.000’ ....\n", "\n", "_\bV_\ba_\bl_\bu_\be:\n", "\n", " A character vector of length that of the longest input. If any\n", " element of ‘fmt’ or any character argument is declared as UTF-8,\n", " the element of the result will be in UTF-8 and have the encoding\n", " declared as UTF-8. Otherwise it will be in the current locale's\n", " encoding.\n", "\n", "_\bW_\ba_\br_\bn_\bi_\bn_\bg:\n", "\n", " The format string is passed down the OS's ‘sprintf’ function, and\n", " incorrect formats can cause the latter to crash the R process . R\n", " does perform sanity checks on the format, but not all possible\n", " user errors on all platforms have been tested, and some might be\n", " terminal.\n", "\n", " The behaviour on inputs not documented here is ‘undefined’, which\n", " means it is allowed to differ by platform.\n", "\n", "_\bA_\bu_\bt_\bh_\bo_\br(_\bs):\n", "\n", " Original code by Jonathan Rougier.\n", "\n", "_\bR_\be_\bf_\be_\br_\be_\bn_\bc_\be_\bs:\n", "\n", " Kernighan, B. W. and Ritchie, D. M. (1988) _The C Programming\n", " Language._ Second edition, Prentice Hall. Describes the format\n", " options in table B-1 in the Appendix.\n", "\n", " The C Standards, especially ISO/IEC 9899:1999 for ‘C99’. 
Links\n", " can be found at .\n", "\n", " ‘man sprintf’ on a Unix-alike system.\n", "\n", "_\bS_\be_\be _\bA_\bl_\bs_\bo:\n", "\n", " ‘formatC’ for a way of formatting vectors of numbers in a similar\n", " fashion.\n", "\n", " ‘paste’ for another way of creating a vector combining text and\n", " values.\n", "\n", " ‘gettext’ for the mechanisms for the automated translation of\n", " text.\n", "\n", "_\bE_\bx_\ba_\bm_\bp_\bl_\be_\bs:\n", "\n", " ## be careful with the format: most things in R are floats\n", " ## only integer-valued reals get coerced to integer.\n", " \n", " sprintf(\"%s is %f feet tall\\n\", \"Sven\", 7.1) # OK\n", " try(sprintf(\"%s is %i feet tall\\n\", \"Sven\", 7.1)) # not OK\n", " sprintf(\"%s is %i feet tall\\n\", \"Sven\", 7 ) # OK\n", " \n", " ## use a literal % :\n", " \n", " sprintf(\"%.0f%% said yes (out of a sample of size %.0f)\", 66.666, 3)\n", " \n", " ## various formats of pi :\n", " \n", " sprintf(\"%f\", pi)\n", " sprintf(\"%.3f\", pi)\n", " sprintf(\"%1.0f\", pi)\n", " sprintf(\"%5.1f\", pi)\n", " sprintf(\"%05.1f\", pi)\n", " sprintf(\"%+f\", pi)\n", " sprintf(\"% f\", pi)\n", " sprintf(\"%-10f\", pi) # left justified\n", " sprintf(\"%e\", pi)\n", " sprintf(\"%E\", pi)\n", " sprintf(\"%g\", pi)\n", " sprintf(\"%g\", 1e6 * pi) # -> exponential\n", " sprintf(\"%.9g\", 1e6 * pi) # -> \"fixed\"\n", " sprintf(\"%G\", 1e-6 * pi)\n", " \n", " ## no truncation:\n", " sprintf(\"%1.f\", 101)\n", " \n", " ## re-use one argument three times, show difference between %x and %X\n", " xx <- sprintf(\"%1$d %1$x %1$X\", 0:15)\n", " xx <- matrix(xx, dimnames = list(rep(\"\", 16), \"%d%x%X\"))\n", " noquote(format(xx, justify = \"right\"))\n", " \n", " ## More sophisticated:\n", " \n", " sprintf(\"min 10-char string '%10s'\",\n", " c(\"a\", \"ABC\", \"and an even longer one\"))\n", " \n", " ## Platform-dependent bad example from qdapTools 1.0.0:\n", " ## may pad with spaces or zeroes.\n", " sprintf(\"%09s\", month.name)\n", " \n", " n <- 1:18\n", " 
sprintf(paste0(\"e with %2d digits = %.\", n, \"g\"), n, exp(1))\n", " \n", " ## Using arguments out of order\n", " sprintf(\"second %2$1.0f, first %1$5.2f, third %3$1.0f\", pi, 2, 3)\n", " \n", " ## Using asterisk for width or precision\n", " sprintf(\"precision %.*f, width '%*.3f'\", 3, pi, 8, pi)\n", " \n", " ## Asterisk and argument re-use, 'e' example reiterated:\n", " sprintf(\"e with %1$2d digits = %2$.*1$g\", n, exp(1))\n", " \n", " ## re-cycle arguments\n", " sprintf(\"%s %d\", \"test\", 1:3)\n", " \n", " ## binary output showing rounding/representation errors\n", " x <- seq(0, 1.0, 0.1); y <- c(0,.1,.2,.3,.4,.5,.6,.7,.8,.9,1)\n", " cbind(x, sprintf(\"%a\", x), sprintf(\"%a\", y))\n", " " ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# we're going to want to format the printed output. The relevant function is \"sprintf\". \n", "# Here's the documentation.\n", "help(sprintf)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1] professional: mean difference -0.608696\n", "[1] respect: mean difference -0.608696\n", "[1] caring: mean difference -0.523913\n", "[1] enthusiastic: mean difference -0.573913\n", "[1] communicate: mean difference -0.567391\n", "[1] helpful: mean difference -0.456522\n", "[1] feedback: mean difference -0.473913\n", "[1] prompt: mean difference -0.797826\n", "[1] consistent: mean difference -0.456522\n", "[1] fair: mean difference -0.760870\n", "[1] responsive: mean difference -0.219565\n", "[1] praised: mean difference -0.671739\n", "[1] knowledgeable: mean difference -0.354348\n", "[1] clear: mean difference -0.413043\n", "[1] overall: mean difference -0.473913\n" ] } ], "source": [ "# Let's try to reproduce the MacNell et al. 
results\n", "character <- setdiff(names(ratings),c(\"group\",\"gender\",\"tagender\",\"taidgender\",\"age\"));\n", "for (ch in character) {\n", " print(sprintf('%s: mean difference %f', \n", " ch,\n", " mean(ratings[ratings$taidgender==0,ch]) - mean(ratings[ratings$taidgender==1,ch])), quote=F)\n", "}" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
  group grade tagender taidgender
1     3  77.4        0          1
2     3 89.02        0          1
3     3  53.5        0          1
4     3 88.32        0          1
5     3 90.02        0          1
\n" ], "text/latex": [ "\\begin{tabular}{r|llll}\n", " & group & grade & tagender & taidgender\\\\\n", "\\hline\n", "\t1 & 3 & 77.4 & 0 & 1\\\\\n", "\t2 & 3 & 89.02 & 0 & 1\\\\\n", "\t3 & 3 & 53.5 & 0 & 1\\\\\n", "\t4 & 3 & 88.32 & 0 & 1\\\\\n", "\t5 & 3 & 90.02 & 0 & 1\\\\\n", "\\end{tabular}\n" ], "text/plain": [ " group grade tagender taidgender\n", "1 3 77.40 0 1\n", "2 3 89.02 0 1\n", "3 3 53.50 0 1\n", "4 3 88.32 0 1\n", "5 3 90.02 0 1" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ " group grade tagender taidgender \n", " Min. :3.000 Min. :49.46 Min. :0.0000 Min. :0.0000 \n", " 1st Qu.:3.500 1st Qu.:75.20 1st Qu.:0.0000 1st Qu.:0.0000 \n", " Median :5.000 Median :80.13 Median :1.0000 Median :0.0000 \n", " Mean :4.532 Mean :79.01 Mean :0.5106 Mean :0.4894 \n", " 3rd Qu.:6.000 3rd Qu.:85.09 3rd Qu.:1.0000 3rd Qu.:1.0000 \n", " Max. :6.000 Max. :95.10 Max. :1.0000 Max. :1.0000 " ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grades <- read.csv(\"Data/Macnell-GradeData.csv\",as.is = T); # reads a .csv file into a DataFrame\n", "grades[1:5,]\n", "summary(grades) # summary statistics for the data. 
Note the issue with \"age\"" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# simulate the distribution of the mean difference\n", "simPermu <- function(m1, f1, m2, f2, iter) {\n", " n1f <- length(f1); # number of students assigned to instructor 1 when instructor 1\n", " # was identified as female\n", " n2f <- length(f2); # number of students assigned to instructor 2 when instructor 2\n", " # was identified as female\n", " z1 <- c(m1, f1); # pooled responses for instructor 1\n", " z2 <- c(m2, f2); # pooled responses for instructor 2\n", " ts <- abs(mean(c(m1,m2)) - mean(c(f1,f2))) # test statistic\n", " sum(replicate(iter, { # replicate() repeats the 2nd argument\n", " zp1 <- sample(z1); # permute the gender labels within instructor 1\n", " zp2 <- sample(z2); # permute the gender labels within instructor 2\n", " abs(mean(c(zp1[1:n1f],zp2[1:n2f])) - mean(c(zp1[-(1:n1f)],zp2[-(1:n2f)]))) > ts\n", " }))/iter\n", "}" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
<ol>\n", "\t<li>'professional'</li>\n", "\t<li>'respect'</li>\n", "\t<li>'caring'</li>\n", "\t<li>'enthusiastic'</li>\n", "\t<li>'communicate'</li>\n", "\t<li>'helpful'</li>\n", "\t<li>'feedback'</li>\n", "\t<li>'prompt'</li>\n", "\t<li>'consistent'</li>\n", "\t<li>'fair'</li>\n", "\t<li>'responsive'</li>\n", "\t<li>'praised'</li>\n", "\t<li>'knowledgeable'</li>\n", "\t<li>'clear'</li>\n", "\t<li>'overall'</li>\n", "</ol>
\n" ], "text/latex": [ "\\begin{enumerate*}\n", "\\item 'professional'\n", "\\item 'respect'\n", "\\item 'caring'\n", "\\item 'enthusiastic'\n", "\\item 'communicate'\n", "\\item 'helpful'\n", "\\item 'feedback'\n", "\\item 'prompt'\n", "\\item 'consistent'\n", "\\item 'fair'\n", "\\item 'responsive'\n", "\\item 'praised'\n", "\\item 'knowledgeable'\n", "\\item 'clear'\n", "\\item 'overall'\n", "\\end{enumerate*}\n" ], "text/markdown": [ "1. 'professional'\n", "2. 'respect'\n", "3. 'caring'\n", "4. 'enthusiastic'\n", "5. 'communicate'\n", "6. 'helpful'\n", "7. 'feedback'\n", "8. 'prompt'\n", "9. 'consistent'\n", "10. 'fair'\n", "11. 'responsive'\n", "12. 'praised'\n", "13. 'knowledgeable'\n", "14. 'clear'\n", "15. 'overall'\n", "\n", "\n" ], "text/plain": [ " [1] \"professional\" \"respect\" \"caring\" \"enthusiastic\" \n", " [5] \"communicate\" \"helpful\" \"feedback\" \"prompt\" \n", " [9] \"consistent\" \"fair\" \"responsive\" \"praised\" \n", "[13] \"knowledgeable\" \"clear\" \"overall\" " ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" }, { "name": "stdout", "output_type": "stream", "text": [ "[1] professional: diff of means=0.608696, est. p=0.058400\n", "[1] respect: diff of means=0.608696, est. p=0.058500\n", "[1] caring: diff of means=0.523913, est. p=0.113600\n", "[1] enthusiastic: diff of means=0.573913, est. p=0.069000\n", "[1] communicate: diff of means=0.567391, est. p=0.081600\n", "[1] helpful: diff of means=0.456522, est. p=0.205300\n", "[1] feedback: diff of means=0.473913, est. p=0.180600\n", "[1] prompt: diff of means=0.797826, est. p=0.012700\n", "[1] consistent: diff of means=0.456522, est. p=0.238700\n", "[1] fair: diff of means=0.760870, est. p=0.011100\n", "[1] responsive: diff of means=0.219565, est. p=0.537400\n", "[1] praised: diff of means=0.671739, est. p=0.008700\n", "[1] knowledgeable: diff of means=0.354348, est. p=0.229600\n", "[1] clear: diff of means=0.413043, est. 
p=0.271700\n", "[1] overall: diff of means=0.473913, est. p=0.142200\n" ] } ], "source": [ "# It's good practice to set the seed of the random number generator, so that your work will be\n", "# reproducible. I'm using the date of this lecture as the seed. \n", "# Don't reset the seed repeatedly in your analysis! That compromises the pseudorandom behavior of the PRNG.\n", "# R uses the Mersenne Twister PRNG, which is good enough for general statistical purposes, but not for cryptography\n", "#\n", "set.seed(20150630); # set the seed so that the analysis is reproducible\n", "iter <- 10^4; # iterations to estimate p-value\n", "\n", "characteristics <- setdiff(names(ratings),c(\"group\",\"gender\",\"tagender\",\"taidgender\",\"age\"));\n", "characteristics\n", "\n", "male1 <- ratings[ratings$taidgender == 1 & ratings$tagender == 1,][characteristics];\n", "female1 <- ratings[ratings$taidgender == 0 & ratings$tagender == 1,][characteristics];\n", "\n", "male2 <- ratings[ratings$taidgender == 1 & ratings$tagender == 0,][characteristics];\n", "female2 <- ratings[ratings$taidgender == 0 & ratings$tagender == 0,][characteristics];\n", "\n", "for (ch in characteristics) {\n", " sp <- simPermu(unlist(male1[ch]), unlist(female1[ch]), unlist(male2[ch]), unlist(female2[ch]), iter);\n", " print(sprintf(\"%s: diff of means=%f, est. p=%f\", \n", " ch, \n", " mean(c(unlist(male1[ch]),unlist(male2[ch])))- mean(c(unlist(female1[ch]),unlist(female2[ch]))), \n", " sp), quote = FALSE);\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Assignment\n", "\n", "+ Read MacNell et al. (2014)\n", "+ Repeat the analysis above but using a different test statistic, the two-sample $t$ statistic with unpooled variance estimates. 
This test statistic is \n", "$$ \\frac{\\left | \\bar{f} - \\bar{m} \\right |}{\\sqrt{\\frac{s_m^2}{n_m} + \\frac{s_f^2}{n_f}}},\n", "$$\n", "where:\n", " - $\\bar{f}$ is the mean score when the TAs were identified as female\n", " - $\\bar{m}$ is the mean score when the TAs were identified as male\n", " - $n_f$ is the number of scores when the TAs were identified as female\n", " - $n_m$ is the number of scores when the TAs were identified as male\n", " - $s_f$ is the sample standard deviation of the scores when the TAs were identified as female\n", " - $s_m$ is the sample standard deviation of the scores when the TAs were identified as male\n", "+ Analyze the grade data\n", " - repeat the analysis as above, comparing grades by identified TA gender and by true TA gender\n", " - try a different randomization, randomizing across instructors, not only within instructors\n", " - test whether the true instructor gender has an effect on student grades\n", " - explain the results of all these tests\n", " - use $10^5$ iterations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Confidence intervals and the uncertainty of the estimated $p$ value\n", "\n", "[To do: show that $\\mbox{SE}(\\hat{p}) \\le 0.5/\\sqrt{iter}$]\n", "\n", "We are estimating the \"degree of surprise\" $p$ by simulation. 
The true probability will differ from the\n", "estimate, in general.\n", "\n", "How can we tell how much larger the true $p$ might be?\n", "\n", "We can make a _confidence interval_ for the true $p$ based on the estimated $p$.\n", "Recall that a confidence interval with confidence level $1-\\alpha$ for a parameter\n", "is a random interval computed using a method that has probability at least $1-\\alpha$\n", "of containing the true value of the parameter.\n", "\n", "[TO DO: explain duality between tests and confidence sets.]\n", "\n", "Notation: Suppose $X$ is a random variable that has a probability distribution that depends\n", "on some parameter $\\theta \\in \\Theta$.\n", "Then ${\\mathbb P}_\\eta (X \\in A)$ means the probability that $X \\in A$, computed on the assumption\n", "that the true value of $\\theta$ is $\\eta$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
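A sketch of why $\\mbox{SE}(\\hat{p}) \\le 0.5/\\sqrt{iter}$, assuming the $iter$ simulated permutations are drawn independently, so that the number of simulated statistics exceeding the observed one has a $\\mbox{Binomial}(iter, p)$ distribution and $\\hat{p}$ is that count divided by $iter$:\n", "\n", "$$\n", " \\mbox{SE}(\\hat{p}) = \\sqrt{\\frac{p(1-p)}{iter}} \\le \\sqrt{\\frac{1/4}{iter}} = \\frac{0.5}{\\sqrt{iter}},\n", "$$\n", "\n", "because $p(1-p) \\le 1/4$ for every $p \\in [0, 1]$, with equality at $p = 1/2$.\n", "For example, $iter = 10^4$ gives $\\mbox{SE}(\\hat{p}) \\le 0.005$.\n", "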
\n", "### Theorem: duality of tests and confidence sets\n", "\n", "Let $\\{ A_\\eta \\}_{\\eta \\in \\Theta}$ be a family of acceptance regions for testing the hypothesis that \n", "$\\theta = \\eta$ at significance level $\\alpha$.\n", "\n", "Define ${\\mathcal I}(X) = \\{ \\eta \\in \\Theta: X \\in A_\\eta \\}$.\n", "\n", "Then\n", "$$\n", " {\\mathbb P}_\\eta \\left ( {\\mathcal I}(X) \\ni \\eta \\right ) \\ge 1- \\alpha, \\;\\;\\forall \\eta \\in \\Theta.\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Illustration: upper confidence interval for Binomial $p$\n", "\n", "Suppose $X \\sim \\mbox{Binomial}(n, p)$, with $n$ fixed.\n", "Define $x_\\eta$ as follows:\n", "\n", "$$ x_\\eta \\equiv \\min \\{ x: {\\mathbb P}_\\eta (X > x) \\le \\alpha \\}. $$\n", "\n", "[To do: explain why we use a one-sided test, with that direction.]" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# upper confidence bound for binomial p: starting from p = x/n, step p up by inc\n", "# until P_p(X <= x) drops below 1-cl\n", "binoUpperCL <- function(n, x, cl = 0.975, inc=0.000001, p=x/n) {\n", " if (x < n) {\n", " f <- pbinom(x, n, p, lower.tail = TRUE);\n", " while (f >= 1-cl) { # this could be sped up using Brent's method, e.g.\n", " p <- p + inc;\n", " f <- pbinom(x, n, p, lower.tail = TRUE)\n", " }\n", " p\n", " } else {\n", " 1.0\n", " }\n", "}" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "0.00206699999999996" ], "text/latex": [ "0.00206699999999996" ], "text/markdown": [ "0.00206699999999996" ], "text/plain": [ "[1] 0.002067" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## Example: 10**4 samples give 13 successes, so the estimated p is 0.0013. \n", "## What's an upper 95% confidence bound for the true p?\n", "binoUpperCL(10**4, 13, cl=0.95)" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "
" ], "text/latex": [ "\\inputencoding{utf8}\n", "\\HeaderA{sd}{Standard Deviation}{sd}\n", "\\keyword{univar}{sd}\n", "%\n", "\\begin{Description}\\relax\n", "This function computes the standard deviation of the values in\n", "\\code{x}.\n", "If \\code{na.rm} is \\code{TRUE} then missing values are removed before\n", "computation proceeds.\n", "\\end{Description}\n", "%\n", "\\begin{Usage}\n", "\\begin{verbatim}\n", "sd(x, na.rm = FALSE)\n", "\\end{verbatim}\n", "\\end{Usage}\n", "%\n", "\\begin{Arguments}\n", "\\begin{ldescription}\n", "\\item[\\code{x}] a numeric vector or an \\R{} object which is coercible to one\n", "by \\code{as.vector(x, \"numeric\")}.\n", "\\item[\\code{na.rm}] logical. Should missing values be removed?\n", "\\end{ldescription}\n", "\\end{Arguments}\n", "%\n", "\\begin{Details}\\relax\n", "Like \\code{\\LinkA{var}{var}} this uses denominator \\eqn{n - 1}{}.\n", "\n", "The standard deviation of a zero-length vector (after removal of\n", "\\code{NA}s if \\code{na.rm = TRUE}) is not defined and gives an error.\n", "The standard deviation of a length-one vector is \\code{NA}.\n", "\\end{Details}\n", "%\n", "\\begin{SeeAlso}\\relax\n", "\\code{\\LinkA{var}{var}} for its square, and \\code{\\LinkA{mad}{mad}}, the most\n", "robust alternative.\n", "\\end{SeeAlso}\n", "%\n", "\\begin{Examples}\n", "\\begin{ExampleCode}\n", "sd(1:2) ^ 2\n", "\\end{ExampleCode}\n", "\\end{Examples}" ], "text/plain": [ "sd package:stats R Documentation\n", "\n", "_\bS_\bt_\ba_\bn_\bd_\ba_\br_\bd _\bD_\be_\bv_\bi_\ba_\bt_\bi_\bo_\bn\n", "\n", "_\bD_\be_\bs_\bc_\br_\bi_\bp_\bt_\bi_\bo_\bn:\n", "\n", " This function computes the standard deviation of the values in\n", " ‘x’. 
If ‘na.rm’ is ‘TRUE’ then missing values are removed before\n", " computation proceeds.\n", "\n", "_\bU_\bs_\ba_\bg_\be:\n", "\n", " sd(x, na.rm = FALSE)\n", " \n", "_\bA_\br_\bg_\bu_\bm_\be_\bn_\bt_\bs:\n", "\n", " x: a numeric vector or an R object which is coercible to one by\n", " ‘as.vector(x, \"numeric\")’.\n", "\n", " na.rm: logical. Should missing values be removed?\n", "\n", "_\bD_\be_\bt_\ba_\bi_\bl_\bs:\n", "\n", " Like ‘var’ this uses denominator n - 1.\n", "\n", " The standard deviation of a zero-length vector (after removal of\n", " ‘NA’s if ‘na.rm = TRUE’) is not defined and gives an error. The\n", " standard deviation of a length-one vector is ‘NA’.\n", "\n", "_\bS_\be_\be _\bA_\bl_\bs_\bo:\n", "\n", " ‘var’ for its square, and ‘mad’, the most robust alternative.\n", "\n", "_\bE_\bx_\ba_\bm_\bp_\bl_\be_\bs:\n", "\n", " sd(1:2) ^ 2\n", " " ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "help(\"sd\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "R", "language": "", "name": "ir" }, "language_info": { "codemirror_mode": "r", "file_extension": ".r", "mimetype": "text/x-r-source", "name": "R", "pygments_lexer": "r", "version": "3.1.3" } }, "nbformat": 4, "nbformat_minor": 0 }