{ "cells": [ { "cell_type": "markdown", "id": "73f5c201", "metadata": {}, "source": [ "# Randomized Response Surveys" ] }, { "cell_type": "markdown", "id": "1feaddc4", "metadata": {}, "source": [ "## Overview\n", "\n", "Social stigmas can inhibit people from confessing potentially embarrassing activities or opinions.\n", "\n", "When people are reluctant to take part in a sample survey about personally sensitive issues, they might decline to participate, and even if they do participate, they might provide incorrect answers to sensitive questions.\n", "\n", "These problems induce **selection** biases that present challenges to interpreting and designing surveys.\n", "\n", "To illustrate how social scientists have thought about estimating the prevalence of such embarrassing activities and opinions, this lecture describes a classic approach of S. L. Warner [[Warner, 1965](https://python.quantecon.org/zreferences.html#id9)].\n", "\n", "Warner used elementary probability to construct a way to protect the privacy of **individual** respondents to surveys while still estimating the fraction of a **collection** of individuals who have a socially stigmatized characteristic or who engage in a socially stigmatized activity.\n", "\n", "Warner’s idea was to add **noise** between the respondent’s answer and the **signal** about that answer that the survey maker ultimately receives.\n", "\n", "Knowing about the structure of the noise assures the respondent that the survey maker does not observe his answer.\n", "\n", "Statistical properties of the noise injection procedure provide the respondent with **plausible deniability**.\n", "\n", "Related ideas underlie modern **differential privacy** systems.\n", "\n", "(See [https://en.wikipedia.org/wiki/Differential_privacy](https://en.wikipedia.org/wiki/Differential_privacy))" ] }, { "cell_type": "markdown", "id": "39650e5a", "metadata": {}, "source": [ "## Warner’s Strategy\n", "\n", "As usual, let’s bring in the Python modules we’ll 
be using." ] }, { "cell_type": "code", "execution_count": null, "id": "9df575a4", "metadata": { "hide-output": false }, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "markdown", "id": "f8f9140a", "metadata": {}, "source": [ "Suppose that every person in the population belongs to either Group A or Group B.\n", "\n", "We want to estimate the proportion $ \\pi $ who belong to Group A while protecting individual respondents’ privacy.\n", "\n", "Warner [[Warner, 1965](https://python.quantecon.org/zreferences.html#id9)] proposed and analyzed the following procedure.\n", "\n", "- A random sample of $ n $ people is drawn with replacement from the population and each person is interviewed. \n", "- Prepare a **random spinner** that points to the letter A with probability $ p $ and to the letter B with probability $ 1-p $. \n", "- Each subject spins the spinner and sees an outcome (A or B) that the interviewer does **not observe**. \n", "- The subject states whether he belongs to the group to which the spinner points. \n", "- If the spinner points to the group to which the subject belongs, the subject reports “yes”; otherwise he reports “no”. \n", "- The subject answers the question truthfully. 
\n", "\n", "\n", "Warner constructed a maximum likelihood estimator of the proportion of the population in Group A.\n", "\n", "Let\n", "\n", "- $ \\pi $ : True probability of A in the population \n", "- $ p $ : Probability that the spinner points to A \n", "- $ X_{i}=\\begin{cases}1,\\text{ if the } i\\text{th} \\ \\text{ subject says yes}\\\\0,\\text{ if the } i\\text{th} \\ \\text{ subject says no}\\end{cases} $ \n", "\n", "\n", "Index the sample set so that the first $ n_1 $ subjects report “yes”, while the remaining $ n-n_1 $ report “no”.\n", "\n", "The likelihood function of a sample set is\n", "\n", "\n", "\n", "$$\n", "L=\\left[\\pi p + (1-\\pi)(1-p)\\right]^{n_{1}}\\left[(1-\\pi) p +\\pi (1-p)\\right]^{n-n_{1}} \\tag{15.1}\n", "$$\n", "\n", "The log of the likelihood function is:\n", "\n", "\n", "\n", "$$\n", "\\log(L)= n_1 \\log \\left[\\pi p + (1-\\pi)(1-p)\\right] + (n-n_{1}) \\log \\left[(1-\\pi) p +\\pi (1-p)\\right] \\tag{15.2}\n", "$$\n", "\n", "The first-order necessary condition for maximizing the log likelihood function with respect to $ \\pi $ is:\n", "\n", "$$\n", "\\frac{(n-n_1)(2p-1)}{(1-\\pi) p +\\pi (1-p)}=\\frac{n_1 (2p-1)}{\\pi p + (1-\\pi)(1-p)}\n", "$$\n", "\n", "or\n", "\n", "\n", "\n", "$$\n", "\\pi p + (1-\\pi)(1-p)=\\frac{n_1}{n} \\tag{15.3}\n", "$$\n", "\n", "If $ p \\neq \\frac{1}{2} $, then the maximum likelihood estimator (MLE) of $ \\pi $ is:\n", "\n", "\n", "\n", "$$\n", "\\hat{\\pi}=\\frac{p-1}{2p-1}+\\frac{n_1}{(2p-1)n} \\tag{15.4}\n", "$$\n", "\n", "We compute the mean and variance of the MLE $ \\hat \\pi $ to be:\n", "\n", "\n", "\n", "$$\n", "\\begin{aligned}\n", "\\mathbb{E}(\\hat{\\pi})&= \\frac{1}{2 p-1}\\left[p-1+\\frac{1}{n} \\sum_{i=1}^{n} \\mathbb{E} X_i \\right] \\\\\n", "&=\\frac{1}{2 p-1} \\left[ p -1 + \\pi p + (1-\\pi)(1-p)\\right] \\\\\n", "&=\\pi\n", "\\end{aligned} \\tag{15.5}\n", "$$\n", "\n", "and\n", "\n", "\n", "\n", "$$\n", "\\begin{aligned}\n", "Var(\\hat{\\pi})&=\\frac{n Var(X_i)}{(2p - 1 )^2 n^2} 
\\\\\n", "&= \\frac{\\left[\\pi p + (1-\\pi)(1-p)\\right]\\left[(1-\\pi) p +\\pi (1-p)\\right]}{(2p - 1 )^2 n}\\\\\n", "&=\\frac{\\frac{1}{4}+(2 p^2 - 2 p +\\frac{1}{2})(- 2 \\pi^2 + 2 \\pi -\\frac{1}{2})}{(2p - 1 )^2 n}\\\\\n", "&=\\frac{1}{n}\\left[\\frac{1}{16(p-\\frac{1}{2})^2}-(\\pi-\\frac{1}{2})^2 \\right]\n", "\\end{aligned} \\tag{15.6}\n", "$$\n", "\n", "Equation [(15.5)](#equation-eq-five) indicates that $ \\hat{\\pi} $ is an **unbiased estimator** of $ \\pi $ while equation [(15.6)](#equation-eq-six) tells us the variance of the estimator.\n", "\n", "To compute a confidence interval, first rewrite [(15.6)](#equation-eq-six) as:\n", "\n", "\n", "\n", "$$\n", "Var(\\hat{\\pi})=\\frac{\\frac{1}{4}-(\\pi-\\frac{1}{2})^2}{n}+\\frac{\\frac{1}{16(p-\\frac{1}{2})^2}-\\frac{1}{4}}{n} \\tag{15.7}\n", "$$\n", "\n", "This equation indicates that the variance of $ \\hat{\\pi} $ can be represented as a sum of the variance due to sampling plus the variance due to the random device.\n", "\n", "From the expressions above we can find that:\n", "\n", "- When $ p $ is $ \\frac{1}{2} $, expression [(15.1)](#equation-eq-one) degenerates to a constant. \n", "- When $ p $ is $ 1 $ or $ 0 $, the randomized estimate degenerates to an estimator without randomized sampling. \n", "\n", "\n", "We shall only discuss situations in which $ p \\in (\\frac{1}{2},1) $\n", "\n", "(a situation in which $ p \\in (0,\\frac{1}{2}) $ is symmetric).\n", "\n", "From expressions [(15.5)](#equation-eq-five) and [(15.7)](#equation-eq-seven) we can deduce that:\n", "\n", "- The MSE of $ \\hat{\\pi} $ decreases as $ p $ increases. 
" ] }, { "cell_type": "markdown", "id": "be401ecc", "metadata": {}, "source": [ "## Comparing Two Survey Designs\n", "\n", "Let’s compare the preceding randomized-response method with a stylized non-randomized response method.\n", "\n", "In our non-randomized response method, we suppose that:\n", "\n", "- Members of Group A tells the truth with probability $ T_a $ while the members of Group B tells the truth with probability $ T_b $ \n", "- $ Y_i $ is $ 1 $ or $ 0 $ according to whether the sample’s $ i\\text{th} $ member’s report is in Group A or not. \n", "\n", "\n", "Then we can estimate $ \\pi $ as:\n", "\n", "\n", "\n", "$$\n", "\\hat{\\pi}=\\frac{\\sum_{i=1}^{n}Y_i}{n} \\tag{15.8}\n", "$$\n", "\n", "We calculate the expectation, bias, and variance of the estimator to be:\n", "\n", "\n", "\n", "$$\n", "\\begin{aligned}\n", "\\mathbb{E}(\\hat{\\pi})&=\\pi T_a + \\left[ (1-\\pi)(1-T_b)\\right]\\\\\n", "\\end{aligned} \\tag{15.9}\n", "$$\n", "\n", "\n", "\n", "$$\n", "\\begin{aligned}\n", "Bias(\\hat{\\pi})&=\\mathbb{E}(\\hat{\\pi}-\\pi)\\\\\n", "&=\\pi [T_a + T_b -2 ] + [1- T_b] \\\\\n", "\\end{aligned} \\tag{15.10}\n", "$$\n", "\n", "\n", "\n", "$$\n", "\\begin{aligned}\n", "Var(\\hat{\\pi})&=\\frac{ \\left[ \\pi T_a + (1-\\pi)(1-T_b)\\right] \\left[1- \\pi T_a -(1-\\pi)(1-T_b)\\right] }{n}\n", "\\end{aligned} \\tag{15.11}\n", "$$\n", "\n", "It is useful to define a\n", "\n", "$$\n", "\\text{MSE Ratio}=\\frac{\\text{Mean Square Error Randomized}}{\\text{Mean Square Error Regular}}\n", "$$\n", "\n", "We can compute MSE Ratios for different survey designs associated with different parameter values.\n", "\n", "The following Python code computes objects we want to stare at in order to make comparisons\n", "under different values of $ \\pi_A $ and $ n $:" ] }, { "cell_type": "code", "execution_count": null, "id": "4881ecfb", "metadata": { "hide-output": false }, "outputs": [], "source": [ "class Comparison:\n", " def __init__(self, A, n):\n", " self.A = A\n", " 
self.n = n\n", " TaTb = np.array([[0.95, 1], [0.9, 1], [0.7, 1], \n", " [0.5, 1], [1, 0.95], [1, 0.9], \n", " [1, 0.7], [1, 0.5], [0.95, 0.95], \n", " [0.9, 0.9], [0.7, 0.7], [0.5, 0.5]])\n", " self.p_arr = np.array([0.6, 0.7, 0.8, 0.9])\n", " self.p_map = dict(zip(self.p_arr, [f\"MSE Ratio: p = {x}\" for x in self.p_arr]))\n", " self.template = pd.DataFrame(columns=self.p_arr)\n", " self.template[['T_a','T_b']] = TaTb\n", " self.template['Bias'] = None\n", " \n", " def theoretical(self):\n", " A = self.A\n", " n = self.n\n", " df = self.template.copy()\n", " df['Bias'] = A * (df['T_a'] + df['T_b'] - 2) + (1 - df['T_b'])\n", " for p in self.p_arr:\n", " df[p] = (1 / (16 * (p - 1/2)**2) - (A - 1/2)**2) / n / \\\n", " (df['Bias']**2 + ((A * df['T_a'] + (1 - A) * (1 - df['T_b'])) * (1 - A * df['T_a'] - (1 - A) * (1 - df['T_b'])) / n))\n", " df[p] = df[p].round(2)\n", " df = df.set_index([\"T_a\", \"T_b\", \"Bias\"]).rename(columns=self.p_map)\n", " return df\n", " \n", " def MCsimulation(self, size=1000, seed=123456):\n", " A = self.A\n", " n = self.n\n", " df = self.template.copy()\n", " np.random.seed(seed)\n", " sample = np.random.rand(size, self.n) <= A\n", " random_device = np.random.rand(size, n)\n", " mse_rd = {}\n", " for p in self.p_arr:\n", " spinner = random_device <= p\n", " rd_answer = sample * spinner + (1 - sample) * (1 - spinner)\n", " n1 = rd_answer.sum(axis=1)\n", " pi_hat = (p - 1) / (2 * p - 1) + n1 / n / (2 * p - 1)\n", " mse_rd[p] = np.sum((pi_hat - A)**2)\n", " for inum, irow in df.iterrows():\n", " truth_a = np.random.rand(size, self.n) <= irow.T_a\n", " truth_b = np.random.rand(size, self.n) <= irow.T_b\n", " trad_answer = sample * truth_a + (1 - sample) * (1 - truth_b)\n", " pi_trad = trad_answer.sum(axis=1) / n\n", " df.loc[inum, 'Bias'] = pi_trad.mean() - A\n", " mse_trad = np.sum((pi_trad - A)**2)\n", " for p in self.p_arr:\n", " df.loc[inum, p] = (mse_rd[p] / mse_trad).round(2)\n", " df = df.set_index([\"T_a\", \"T_b\", 
\"Bias\"]).rename(columns=self.p_map)\n", " return df" ] }, { "cell_type": "markdown", "id": "fb722f64", "metadata": {}, "source": [ "Let’s put the code to work for parameter values\n", "\n", "- $ \\pi_A=0.6 $ \n", "- $ n=1000 $ \n", "\n", "\n", "We can generate MSE Ratios theoretically using the above formulas.\n", "\n", "We can also perform Monte Carlo simulations of a MSE Ratio." ] }, { "cell_type": "code", "execution_count": null, "id": "0c67bc08", "metadata": { "hide-output": false }, "outputs": [], "source": [ "cp1 = Comparison(0.6, 1000)\n", "df1_theoretical = cp1.theoretical()\n", "df1_theoretical" ] }, { "cell_type": "code", "execution_count": null, "id": "be7f3c47", "metadata": { "hide-output": false }, "outputs": [], "source": [ "df1_mc = cp1.MCsimulation()\n", "df1_mc" ] }, { "cell_type": "markdown", "id": "6fa13323", "metadata": {}, "source": [ "The theoretical calculations do a good job of predicting Monte Carlo results.\n", "\n", "We see that in many situations, especially when the bias is not small, the MSE of the randomized-sampling methods is smaller than that of the non-randomized sampling method.\n", "\n", "These differences become larger as $ p $ increases.\n", "\n", "By adjusting parameters $ \\pi_A $ and $ n $, we can study outcomes in different situations.\n", "\n", "For example, for another situation described in Warner [[Warner, 1965](https://python.quantecon.org/zreferences.html#id9)]:\n", "\n", "- $ \\pi_A=0.5 $ \n", "- $ n=1000 $ \n", "\n", "\n", "we can use the code" ] }, { "cell_type": "code", "execution_count": null, "id": "3404f794", "metadata": { "hide-output": false }, "outputs": [], "source": [ "cp2 = Comparison(0.5, 1000)\n", "df2_theoretical = cp2.theoretical()\n", "df2_theoretical" ] }, { "cell_type": "code", "execution_count": null, "id": "d498b74f", "metadata": { "hide-output": false }, "outputs": [], "source": [ "df2_mc = cp2.MCsimulation()\n", "df2_mc" ] }, { "cell_type": "markdown", "id": "503214f4", "metadata": {}, 
"source": [ "We can also revisit a calculation in the concluding section of Warner [[Warner, 1965](https://python.quantecon.org/zreferences.html#id9)] in which\n", "\n", "- $ \\pi_A=0.6 $ \n", "- $ n=2000 $ \n", "\n", "\n", "We use the code" ] }, { "cell_type": "code", "execution_count": null, "id": "df1ef684", "metadata": { "hide-output": false }, "outputs": [], "source": [ "cp3 = Comparison(0.6, 2000)\n", "df3_theoretical = cp3.theoretical()\n", "df3_theoretical" ] }, { "cell_type": "code", "execution_count": null, "id": "166f0ae5", "metadata": { "hide-output": false }, "outputs": [], "source": [ "df3_mc = cp3.MCsimulation()\n", "df3_mc" ] }, { "cell_type": "markdown", "id": "30af1f18", "metadata": {}, "source": [ "Evidently, as $ n $ increases, the randomized response method does better performance in more situations." ] }, { "cell_type": "markdown", "id": "b596e77a", "metadata": {}, "source": [ "## Concluding Remarks\n", "\n", "[This QuantEcon lecture](https://python.quantecon.org/util_rand_resp.html) describes some alternative randomized response surveys.\n", "\n", "That lecture presents a utilitarian analysis of those alternatives conducted by Lars Ljungqvist\n", "[[Ljungqvist, 1993](https://python.quantecon.org/zreferences.html#id10)]." ] } ], "metadata": { "date": 1714442509.356489, "filename": "rand_resp.md", "kernelspec": { "display_name": "Python", "language": "python3", "name": "python3" }, "title": "Randomized Response Surveys" }, "nbformat": 4, "nbformat_minor": 5 }