{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "73f5c201",
   "metadata": {},
   "source": [
    "# Randomized Response Surveys"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1feaddc4",
   "metadata": {},
   "source": [
    "## Overview\n",
    "\n",
    "Social stigmas can inhibit people from confessing potentially embarrassing activities or opinions.\n",
    "\n",
    "When people are reluctant to participate a sample survey about personally  sensitive issues, they  might decline  to participate, and even if they do participate, they might choose to provide incorrect answers to  sensitive questions.\n",
    "\n",
    "These problems induce  **selection**  biases that present challenges to interpreting and designing surveys.\n",
    "\n",
    "To illustrate how social scientists have thought about estimating the prevalence of such embarrassing activities and opinions, this lecture describes a classic approach  of S. L. Warner [[Warner, 1965](https://python.quantecon.org/zreferences.html#id9)].\n",
    "\n",
    "Warner  used elementary  probability to construct a  way to protect the privacy  of **individual** respondents to surveys while still    estimating  the fraction of a **collection** of individuals   who have a socially stigmatized characteristic or who engage in  a socially stigmatized activity.\n",
    "\n",
    "Warner’s idea was to  add **noise** between the respondent’s answer and the **signal** about that answer that the  survey maker ultimately receives.\n",
    "\n",
    "Knowing about the structure of the noise assures the respondent that the survey maker does not observe his answer.\n",
    "\n",
    "Statistical properties of the  noise injection procedure provide the respondent  **plausible deniability**.\n",
    "\n",
    "Related ideas underlie  modern **differential privacy** systems.\n",
    "\n",
    "(See [https://en.wikipedia.org/wiki/Differential_privacy](https://en.wikipedia.org/wiki/Differential_privacy))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "39650e5a",
   "metadata": {},
   "source": [
    "## Warner’s Strategy\n",
    "\n",
    "As usual, let’s bring in the Python modules we’ll be using."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9df575a4",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f8f9140a",
   "metadata": {},
   "source": [
    "Suppose that every person in population either belongs to Group A or Group B.\n",
    "\n",
    "We want to estimate the proportion $ \\pi $ who belong to Group A while protecting individual respondents’ privacy.\n",
    "\n",
    "Warner [[Warner, 1965](https://python.quantecon.org/zreferences.html#id9)] proposed and analyzed the following procedure.\n",
    "\n",
    "- A  random sample of $ n $ people is drawn with replacement from the population and  each person is interviewed.  \n",
    "- Draw $ n $ random samples from the population with replacement and interview each person.  \n",
    "- Prepare a **random spinner** that with $ p $ probability points to the Letter A and with $ (1-p) $ probability points to the Letter B.  \n",
    "- Each subject spins a random spinner and sees an outcome (A or B)  that the interviewer  does **not observe**.  \n",
    "- The subject   states whether he belongs to the group to which the spinner points.  \n",
    "- If the spinner points to the group that the spinner belongs, the subject reports “yes”; otherwise he reports “no”.  \n",
    "- The  subject answers the question truthfully.  \n",
    "\n",
    "\n",
    "Warner constructed a maximum  likelihood estimators of the proportion of the population in set A.\n",
    "\n",
    "Let\n",
    "\n",
    "- $ \\pi $ : True probability of A in the population  \n",
    "- $ p $ : Probability that the spinner points to A  \n",
    "- $ X_{i}=\\begin{cases}1,\\text{ if the } i\\text{th} \\ \\text{ subject  says yes}\\\\0,\\text{ if the } i\\text{th} \\ \\text{  subject  says no}\\end{cases} $  \n",
    "\n",
    "\n",
    "Index the sample set so that  the first $ n_1 $ report “yes”, while the second $ n-n_1 $ report “no”.\n",
    "\n",
    "The likelihood function of a sample set is\n",
    "\n",
    "\n",
    "<a id='equation-eq-one'></a>\n",
    "$$\n",
    "L=\\left[\\pi p + (1-\\pi)(1-p)\\right]^{n_{1}}\\left[(1-\\pi) p +\\pi (1-p)\\right]^{n-n_{1}} \\tag{15.1}\n",
    "$$\n",
    "\n",
    "The log of the likelihood function is:\n",
    "\n",
    "\n",
    "<a id='equation-eq-two'></a>\n",
    "$$\n",
    "\\log(L)= n_1 \\log \\left[\\pi p + (1-\\pi)(1-p)\\right] + (n-n_{1}) \\log \\left[(1-\\pi) p +\\pi (1-p)\\right] \\tag{15.2}\n",
    "$$\n",
    "\n",
    "The first-order necessary condition for maximizing the log likelihood function with respect to  $ \\pi $ is:\n",
    "\n",
    "$$\n",
    "\\frac{(n-n_1)(2p-1)}{(1-\\pi) p +\\pi (1-p)}=\\frac{n_1 (2p-1)}{\\pi p + (1-\\pi)(1-p)}\n",
    "$$\n",
    "\n",
    "or\n",
    "\n",
    "\n",
    "<a id='equation-eq-3'></a>\n",
    "$$\n",
    "\\pi p + (1-\\pi)(1-p)=\\frac{n_1}{n} \\tag{15.3}\n",
    "$$\n",
    "\n",
    "If  $ p \\neq \\frac{1}{2} $, then the maximum likelihood estimator (MLE) of $ \\pi $ is:\n",
    "\n",
    "\n",
    "<a id='equation-eq-four'></a>\n",
    "$$\n",
    "\\hat{\\pi}=\\frac{p-1}{2p-1}+\\frac{n_1}{(2p-1)n} \\tag{15.4}\n",
    "$$\n",
    "\n",
    "We compute the mean and variance of the MLE estimator $ \\hat \\pi $ to be:\n",
    "\n",
    "\n",
    "<a id='equation-eq-five'></a>\n",
    "$$\n",
    "\\begin{aligned}\n",
    "\\mathbb{E}(\\hat{\\pi})&= \\frac{1}{2 p-1}\\left[p-1+\\frac{1}{n} \\sum_{i=1}^{n} \\mathbb{E} X_i \\right] \\\\\n",
    "&=\\frac{1}{2 p-1} \\left[ p -1 + \\pi p + (1-\\pi)(1-p)\\right] \\\\\n",
    "&=\\pi\n",
    "\\end{aligned} \\tag{15.5}\n",
    "$$\n",
    "\n",
    "and\n",
    "\n",
    "\n",
    "<a id='equation-eq-six'></a>\n",
    "$$\n",
    "\\begin{aligned}\n",
    "Var(\\hat{\\pi})&=\\frac{n Var(X_i)}{(2p - 1 )^2 n^2} \\\\\n",
    "&= \\frac{\\left[\\pi p + (1-\\pi)(1-p)\\right]\\left[(1-\\pi) p +\\pi (1-p)\\right]}{(2p - 1 )^2 n^2}\\\\\n",
    "&=\\frac{\\frac{1}{4}+(2 p^2 - 2 p +\\frac{1}{2})(- 2 \\pi^2 + 2 \\pi -\\frac{1}{2})}{(2p - 1 )^2 n^2}\\\\\n",
    "&=\\frac{1}{n}\\left[\\frac{1}{16(p-\\frac{1}{2})^2}-(\\pi-\\frac{1}{2})^2 \\right]\n",
    "\\end{aligned} \\tag{15.6}\n",
    "$$\n",
    "\n",
    "Equation [(15.5)](#equation-eq-five) indicates  that $ \\hat{\\pi} $ is an **unbiased estimator** of $ \\pi $ while equation [(15.6)](#equation-eq-six) tell us the variance of the estimator.\n",
    "\n",
    "To compute a  confidence interval, first  rewrite [(15.6)](#equation-eq-six) as:\n",
    "\n",
    "\n",
    "<a id='equation-eq-seven'></a>\n",
    "$$\n",
    "Var(\\hat{\\pi})=\\frac{\\frac{1}{4}-(\\pi-\\frac{1}{2})^2}{n}+\\frac{\\frac{1}{16(p-\\frac{1}{2})^2}-\\frac{1}{4}}{n} \\tag{15.7}\n",
    "$$\n",
    "\n",
    "This equation indicates that the variance of $ \\hat{\\pi} $ can be represented as a sum of the variance due to sampling plus the variance due to the random device.\n",
    "\n",
    "From the expressions above we can find that:\n",
    "\n",
    "- When $ p $ is $ \\frac{1}{2} $, expression [(15.1)](#equation-eq-one) degenerates to a constant.  \n",
    "- When $ p $ is $ 1 $ or $ 0 $, the randomized estimate degenerates to an estimator without randomized sampling.  \n",
    "\n",
    "\n",
    "We shall only discuss  situations in which $ p \\in (\\frac{1}{2},1) $\n",
    "\n",
    "(a situation in which $ p \\in (0,\\frac{1}{2}) $ is symmetric).\n",
    "\n",
    "From expressions [(15.5)](#equation-eq-five) and [(15.7)](#equation-eq-seven) we can deduce that:\n",
    "\n",
    "- The MSE of $ \\hat{\\pi} $  decreases as $ p $ increases.  "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "be401ecc",
   "metadata": {},
   "source": [
    "## Comparing Two Survey Designs\n",
    "\n",
    "Let’s compare the preceding randomized-response method with a stylized non-randomized response method.\n",
    "\n",
    "In our non-randomized response method, we suppose that:\n",
    "\n",
    "- Members of Group A tells the truth with probability  $ T_a $ while the members of Group B tells the truth with probability $ T_b $  \n",
    "- $ Y_i $ is $ 1 $ or $ 0 $ according to whether the sample’s $ i\\text{th} $ member’s report is in Group A or not.  \n",
    "\n",
    "\n",
    "Then we can estimate $ \\pi $ as:\n",
    "\n",
    "\n",
    "<a id='equation-eq-eight'></a>\n",
    "$$\n",
    "\\hat{\\pi}=\\frac{\\sum_{i=1}^{n}Y_i}{n} \\tag{15.8}\n",
    "$$\n",
    "\n",
    "We calculate the expectation, bias, and variance of the estimator to be:\n",
    "\n",
    "\n",
    "<a id='equation-eq-nine'></a>\n",
    "$$\n",
    "\\begin{aligned}\n",
    "\\mathbb{E}(\\hat{\\pi})&=\\pi T_a + \\left[ (1-\\pi)(1-T_b)\\right]\\\\\n",
    "\\end{aligned} \\tag{15.9}\n",
    "$$\n",
    "\n",
    "\n",
    "<a id='equation-eq-ten'></a>\n",
    "$$\n",
    "\\begin{aligned}\n",
    "Bias(\\hat{\\pi})&=\\mathbb{E}(\\hat{\\pi}-\\pi)\\\\\n",
    "&=\\pi [T_a + T_b -2 ] + [1- T_b] \\\\\n",
    "\\end{aligned} \\tag{15.10}\n",
    "$$\n",
    "\n",
    "\n",
    "<a id='equation-eq-eleven'></a>\n",
    "$$\n",
    "\\begin{aligned}\n",
    "Var(\\hat{\\pi})&=\\frac{ \\left[ \\pi T_a + (1-\\pi)(1-T_b)\\right]  \\left[1- \\pi T_a -(1-\\pi)(1-T_b)\\right] }{n}\n",
    "\\end{aligned} \\tag{15.11}\n",
    "$$\n",
    "\n",
    "It is useful to define a\n",
    "\n",
    "$$\n",
    "\\text{MSE Ratio}=\\frac{\\text{Mean Square Error Randomized}}{\\text{Mean Square Error Regular}}\n",
    "$$\n",
    "\n",
    "We can compute  MSE Ratios for different survey designs associated with different parameter values.\n",
    "\n",
    "The following Python code computes  objects we want to stare at in order to make comparisons\n",
    "under  different values of $ \\pi_A $ and $ n $:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4881ecfb",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "class Comparison:\n",
    "    def __init__(self, A, n):\n",
    "        self.A = A\n",
    "        self.n = n\n",
    "        TaTb = np.array([[0.95,  1], [0.9,   1], [0.7,    1], \n",
    "                         [0.5,   1], [1,  0.95], [1,    0.9], \n",
    "                         [1,   0.7], [1,   0.5], [0.95, 0.95], \n",
    "                         [0.9, 0.9], [0.7, 0.7], [0.5,  0.5]])\n",
    "        self.p_arr = np.array([0.6, 0.7, 0.8, 0.9])\n",
    "        self.p_map = dict(zip(self.p_arr, [f\"MSE Ratio: p = {x}\" for x in self.p_arr]))\n",
    "        self.template = pd.DataFrame(columns=self.p_arr)\n",
    "        self.template[['T_a','T_b']] = TaTb\n",
    "        self.template['Bias'] = None\n",
    "    \n",
    "    def theoretical(self):\n",
    "        A = self.A\n",
    "        n = self.n\n",
    "        df = self.template.copy()\n",
    "        df['Bias'] = A * (df['T_a'] + df['T_b'] - 2) + (1 - df['T_b'])\n",
    "        for p in self.p_arr:\n",
    "            df[p] = (1 / (16 * (p - 1/2)**2) - (A - 1/2)**2) / n / \\\n",
    "                    (df['Bias']**2 + ((A * df['T_a'] + (1 - A) * (1 - df['T_b'])) * (1 - A * df['T_a'] - (1 - A) * (1 - df['T_b'])) / n))\n",
    "            df[p] = df[p].round(2)\n",
    "        df = df.set_index([\"T_a\", \"T_b\", \"Bias\"]).rename(columns=self.p_map)\n",
    "        return df\n",
    "        \n",
    "    def MCsimulation(self, size=1000, seed=123456):\n",
    "        A = self.A\n",
    "        n = self.n\n",
    "        df = self.template.copy()\n",
    "        np.random.seed(seed)\n",
    "        sample = np.random.rand(size, self.n) <= A\n",
    "        random_device = np.random.rand(size, n)\n",
    "        mse_rd = {}\n",
    "        for p in self.p_arr:\n",
    "            spinner = random_device <= p\n",
    "            rd_answer = sample * spinner + (1 - sample) * (1 - spinner)\n",
    "            n1 = rd_answer.sum(axis=1)\n",
    "            pi_hat = (p - 1) / (2 * p - 1) + n1 / n / (2 * p - 1)\n",
    "            mse_rd[p] = np.sum((pi_hat - A)**2)\n",
    "        for inum, irow in df.iterrows():\n",
    "            truth_a = np.random.rand(size, self.n) <= irow.T_a\n",
    "            truth_b = np.random.rand(size, self.n) <= irow.T_b\n",
    "            trad_answer = sample * truth_a + (1 - sample) * (1 - truth_b)\n",
    "            pi_trad = trad_answer.sum(axis=1) / n\n",
    "            df.loc[inum, 'Bias'] = pi_trad.mean() - A\n",
    "            mse_trad = np.sum((pi_trad - A)**2)\n",
    "            for p in self.p_arr:\n",
    "                df.loc[inum, p] = (mse_rd[p] / mse_trad).round(2)\n",
    "        df = df.set_index([\"T_a\", \"T_b\", \"Bias\"]).rename(columns=self.p_map)\n",
    "        return df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fb722f64",
   "metadata": {},
   "source": [
    "Let’s put the code to work for parameter values\n",
    "\n",
    "- $ \\pi_A=0.6 $  \n",
    "- $ n=1000 $  \n",
    "\n",
    "\n",
    "We can generate MSE Ratios theoretically using the above formulas.\n",
    "\n",
    "We can also perform   Monte Carlo simulations  of a MSE Ratio."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0c67bc08",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "cp1 = Comparison(0.6, 1000)\n",
    "df1_theoretical = cp1.theoretical()\n",
    "df1_theoretical"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "be7f3c47",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "df1_mc = cp1.MCsimulation()\n",
    "df1_mc"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6fa13323",
   "metadata": {},
   "source": [
    "The theoretical calculations  do a good job of predicting  Monte Carlo results.\n",
    "\n",
    "We see that in many situations, especially when the bias is not small, the MSE of the randomized-sampling  methods is smaller than that of the non-randomized sampling method.\n",
    "\n",
    "These differences become larger as  $ p $ increases.\n",
    "\n",
    "By adjusting  parameters $ \\pi_A $ and $ n $, we can study outcomes in different situations.\n",
    "\n",
    "For example, for another situation described in Warner [[Warner, 1965](https://python.quantecon.org/zreferences.html#id9)]:\n",
    "\n",
    "- $ \\pi_A=0.5 $  \n",
    "- $ n=1000 $  \n",
    "\n",
    "\n",
    "we can use the code"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3404f794",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "cp2 = Comparison(0.5, 1000)\n",
    "df2_theoretical = cp2.theoretical()\n",
    "df2_theoretical"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d498b74f",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "df2_mc = cp2.MCsimulation()\n",
    "df2_mc"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "503214f4",
   "metadata": {},
   "source": [
    "We can also revisit a calculation in the  concluding section of Warner [[Warner, 1965](https://python.quantecon.org/zreferences.html#id9)] in which\n",
    "\n",
    "- $ \\pi_A=0.6 $  \n",
    "- $ n=2000 $  \n",
    "\n",
    "\n",
    "We use the code"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "df1ef684",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "cp3 = Comparison(0.6, 2000)\n",
    "df3_theoretical = cp3.theoretical()\n",
    "df3_theoretical"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "166f0ae5",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "df3_mc = cp3.MCsimulation()\n",
    "df3_mc"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "30af1f18",
   "metadata": {},
   "source": [
    "Evidently, as $ n $ increases, the randomized response method does  better performance in more situations."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b596e77a",
   "metadata": {},
   "source": [
    "## Concluding Remarks\n",
    "\n",
    "[This QuantEcon lecture](https://python.quantecon.org/util_rand_resp.html)  describes some alternative randomized response surveys.\n",
    "\n",
    "That lecture presents a utilitarian analysis of those alternatives conducted by Lars Ljungqvist\n",
    "[[Ljungqvist, 1993](https://python.quantecon.org/zreferences.html#id10)]."
   ]
  }
 ],
 "metadata": {
  "date": 1714442509.356489,
  "filename": "rand_resp.md",
  "kernelspec": {
   "display_name": "Python",
   "language": "python3",
   "name": "python3"
  },
  "title": "Randomized Response Surveys"
 },
 "nbformat": 4,
 "nbformat_minor": 5
}