{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Factors of Underrepresentation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Copyright 2021 Allen B. Downey\n", "\n", "License: [Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/)\n", "\n", "[Click here to run this notebook on Colab](https://colab.research.google.com/github/AllenDowney/ProbablyOverthinkingIt2/blob/master/nlsy.ipynb)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I recently encountered a 2009 paper by Ceci, Williams, and Barnett, \"[Women's underrepresentation in science: sociocultural and biological considerations](http://www2.psych.utoronto.ca/users/psy3001/files/Ceci%20&%20Williams.pdf)\", which lists in the abstract these \"factors unique to underrepresentation [of women] in math-intensive fields\": \n", "\n", "> (a) Math-proficient women disproportionately prefer careers in non–math-intensive fields and are more likely to leave math-intensive careers as they advance; \n", ">\n", "> (b) more men than women score in the extreme math-proficient range on gatekeeper tests, such as the SAT Mathematics and the Graduate Record Examinations Quantitative Reasoning sections; \n", ">\n", "> (c) women with high math competence are disproportionately more likely to have high verbal competence, allowing greater choice of professions; and \n", ">\n", "> (d) in some math-intensive fields, women with children are penalized in promotion rates.\n", "\n", "To people familiar with this area of research, none of these are surprising, but the third caught my attention because [I recently looked at the correlation between math and verbal scores on the SAT and ACT](https://www.allendowney.com/blog/2021/04/07/berkson-goes-to-college/). In general, they are highly correlated, with $r$ around 0.7, and they are equally correlated for men and women. 
So I was curious to know where this claim comes from and, if it is true, how big a factor it might be." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As evidence, Ceci et al. summarize results from \"a tracking study of 1,100 high–mathematics aptitude students who expressed a goal of majoring in mathematics or science in college\", which found that:\n", "\n", "> One determinant of who switched out of math/science fields was the asymmetry between their verbal and mathematics abilities. Women's verbal abilities on average were nearly as strong as their mathematics abilities (only 61 points difference between their SAT-V and SAT-M), leading them to enter professions that prized verbal reasoning (e.g., law), whereas men's verbal abilities were an average of 115 points lower than their mathematics ability, possibly leading them to view mathematics as their only strength.\n", "\n", "And they cite [Achter, Lubinski, Benbow, & Eftekhari-Sanjani, 1999](https://www.researchgate.net/publication/232571718_Assessing_Vocational_Preferences_among_Gifted_Adolescents_Adds_Incremental_Validity_to_Abilities_A_Discriminant_Analysis_of_Educational_Outcomes_over_a_10-Year_Interval) and\n", "[Wai, Lubinski, & Benbow, 2005](https://www.psychologytoday.com/files/attachments/56143/creativity-and-occupational-accomplishments-among-intellectually-precocious-youths.pdf)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I don't have access to their data, but I ran a similar analysis with data from the [National Longitudinal Survey of Youth 1997](https://www.nlsinfo.org/content/cohorts/nlsy97) (NLSY97), which \"follows the lives of a sample of [8,984] American youth born between 1980-84\". 
The public data set includes the participants' scores on several standardized tests, including the SAT and ACT.\n", "Assuming that most participants took these exams when they were 17, they probably took them between 1997 and 2001.\n", "\n", "I found that the pattern described by Ceci et al. also appears in this dataset. Although the correlation between math and verbal scores is the same for men and women, the slope of the regression line is not.\n", "In a group of male and female test-takers with the same math score, the verbal scores for the female test-takers are higher, on average.\n", "Near the high end of the range, the difference is about 34 points, which is a little smaller than the difference in the previous study, 54 points.\n", "\n", "So we might ask:\n", "\n", "1. Is this a big enough difference that it seems likely to affect career choices? For example, suppose Student A has scores M 750 V 660 and Student B has scores M 750 V 690. Do you think A would be substantially more likely than B to \"view mathematics as their only strength\"?\n", "\n", "2. 
If we assume that the answer is yes, and that both students make career choices accordingly, how big an effect would this have on the sex ratios we see in math-intensive fields?\n", "\n", "I don't have the data to answer the first question, but we can use the data we have, and a model of the filtering processes, to put an upper bound on the second.\n", "\n", "To summarize the results, the largest effect I found for factor (c) is that it might increase the sex ratio in a math-intensive field by 5-15%.\n", "For example, if the minimum for a math-intensive job is 700 on the math section, the sex ratio among the people who meet this requirement is 1.8:1.\n", "\n", "Now suppose that everyone who meets this standard takes a math-intensive job, EXCEPT the people who also get 700 or more on the verbal section.\n", "If all of those people choose a different career, the sex ratio of the ones left in the math-intensive job goes up to 2.0:1.\n", "\n", "To see what happens as we move farther into the tail, I used the data to create a Gaussian model, and used the model to simulate test scores beyond the range of the SAT.\n", "With this model, we see that the effect of factor (b) increases as we make the requirements stricter.\n", "\n", "For example, if the threshold score is 800 for the math and verbal sections, the sex ratio among the people who meet the math requirement is 4.6:1.\n", "If the people who also meet the verbal requirement choose different careers, the sex ratio among the people left behind is 4.9:1 (an increase of about 7%).\n", "So it seems like the effect of factor (c) gets smaller as we go farther into the tails of the distributions.\n", "\n", "Finally, I used the model to decompose factor (b) into two parts, the difference in means and the difference in variance.\n", "When the threshold is 800, the contribution of these two parts is about equal; that is:\n", "\n", "* If we set the means to be the same, but preserve the difference in variance, the sex ratio among 
people who meet the math requirement is about 2.2:1.\n", "\n", "* If we set the variances to be the same, but preserve the difference in means, the sex ratio among people who meet the math requirement is about 2.2:1." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data\n", "\n", "The following cell downloads the data file." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from os.path import basename, exists\n", "\n", "def download(url):\n", "    filename = basename(url)\n", "    if not exists(filename):\n", "        from urllib.request import urlretrieve\n", "        local, _ = urlretrieve(url, filename)\n", "        print('Downloaded ' + local)\n", "\n", "download('https://github.com/AllenDowney/ProbablyOverthinkingIt2/raw/master/' +\n", "         'nlsy/stand_test_corr.csv.gz')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next I'll import the libraries we need and read the data into a Pandas `DataFrame`." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "\n", "np.random.seed(17)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(8984, 29)" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nlsy = pd.read_csv('stand_test_corr.csv.gz')\n", "nlsy.shape" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " R0000100 R0536300 R0536401 R0536402 R1235800 R1482600 R5473600 \\\n", "0 1 2 9 1981 1 4 -4 \n", "1 2 1 7 1982 1 2 -4 \n", "2 3 2 9 1983 1 2 -4 \n", "3 4 2 2 1981 1 2 -4 \n", "4 5 1 10 1982 1 2 -4 \n", "\n", " R5473700 R7237300 R7237400 ... R9794001 R9829600 S1552600 S1552700 \\\n", "0 -4 -4 -4 ... 1998 45070 -4 -4 \n", "1 -4 -4 -4 ... 1998 58483 -4 -4 \n", "2 -4 -4 -4 ... -4 27978 -4 -4 \n", "3 -4 -4 -4 ... -4 37012 -4 -4 \n", "4 -4 -4 -4 ... -4 -4 -4 -4 \n", "\n", " Z9033700 Z9033800 Z9033900 Z9034000 Z9034100 Z9034200 \n", "0 4 3 3 2 -4 -4 \n", "1 4 5 4 5 -4 -4 \n", "2 2 4 2 4 -4 -4 \n", "3 -4 -4 -4 -4 -4 -4 \n", "4 2 3 6 3 -4 -4 \n", "\n", "[5 rows x 29 columns]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nlsy.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The [columns are documented in the codebook](https://www.nlsinfo.org/investigator/pages/search?s=NLSY97). The following dictionary maps the current column names to more memorable names. " ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "d = {'R9793200': 'psat_math',\n", " 'R9793300': 'psat_verbal',\n", " 'R9793400': 'act_comp',\n", " 'R9793500': 'act_eng',\n", " 'R9793600': 'act_math',\n", " 'R9793700': 'act_read',\n", " 'R9793800': 'sat_verbal',\n", " 'R9793900': 'sat_math',\n", " 'R0536300': 'sex',\n", " }\n", "\n", "nlsy.rename(columns=d, inplace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are 4599 male and 4385 female participants." 
] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1 4599\n", "2 4385\n", "Name: sex, dtype: int64" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nlsy['sex'].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## SAT\n", "\n", "The SAT data includes a few cases where the scores are less than 200, so let's clean those up." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "varnames = ['sat_verbal', 'sat_math']\n", "\n", "for varname in varnames:\n", " invalid = (nlsy[varname] < 200)\n", " nlsy.loc[invalid, varname] = np.nan" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "male = nlsy[nlsy['sex'] == 1]\n", "female = nlsy[nlsy['sex'] == 2]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1400 of the participants took the SAT. Their average and standard deviation are close to the national average (500) and standard deviation (100)." 
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 1400.000000\n", "mean 501.678571\n", "std 108.343678\n", "min 200.000000\n", "25% 430.000000\n", "50% 500.000000\n", "75% 570.000000\n", "max 800.000000\n", "Name: sat_verbal, dtype: float64" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nlsy['sat_verbal'].describe()" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 1399.000000\n", "mean 503.213009\n", "std 109.901382\n", "min 200.000000\n", "25% 430.000000\n", "50% 500.000000\n", "75% 580.000000\n", "max 800.000000\n", "Name: sat_math, dtype: float64" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nlsy['sat_math'].describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "On the verbal section, the male and female averages are roughly the same.\n", "The male scores are a little more variable.\n" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 649.000000\n", "mean 502.681048\n", "std 111.764295\n", "min 200.000000\n", "25% 430.000000\n", "50% 500.000000\n", "75% 580.000000\n", "max 800.000000\n", "Name: sat_verbal, dtype: float64" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "male['sat_verbal'].describe()" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 751.000000\n", "mean 500.812250\n", "std 105.365425\n", "min 200.000000\n", "25% 430.000000\n", "50% 490.000000\n", "75% 570.000000\n", "max 800.000000\n", "Name: sat_verbal, dtype: float64" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "female['sat_verbal'].describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "On the math section the male 
average is substantially higher (518 compared to 491) and the male scores are substantially more variable (std 115 compared to 104)." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 648.000000\n", "mean 517.608025\n", "std 114.682496\n", "min 220.000000\n", "25% 440.000000\n", "50% 520.000000\n", "75% 600.000000\n", "max 800.000000\n", "Name: sat_math, dtype: float64" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "male['sat_math'].describe()" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 751.000000\n", "mean 490.792277\n", "std 104.089408\n", "min 200.000000\n", "25% 420.000000\n", "50% 480.000000\n", "75% 550.000000\n", "max 790.000000\n", "Name: sat_math, dtype: float64" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "female['sat_math'].describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The correlation between the sections is high, about 0.73 overall." ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " sat_verbal sat_math\n", "sat_verbal 1.000000 0.734739\n", "sat_math 0.734739 1.000000" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nlsy[varnames].corr()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And the correlation is about the same for both groups." ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " sat_verbal sat_math\n", "sat_verbal 1.000000 0.736466\n", "sat_math 0.736466 1.000000" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "male[varnames].corr()" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " sat_verbal sat_math\n", "sat_verbal 1.000000 0.742472\n", "sat_math 0.742472 1.000000" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "female[varnames].corr()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Regression\n", "\n", "Although the correlations are the same, the regression lines are not." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "def decorate(**options):\n", " \"\"\"Decorate the current axes.\n", " \n", " Call decorate with keyword arguments like\n", " decorate(title='Title',\n", " xlabel='x',\n", " ylabel='y')\n", " \n", " The keyword arguments can be any of the axis properties\n", " https://matplotlib.org/api/axes_api.html\n", " \"\"\"\n", " ax = plt.gca()\n", " ax.set(**options)\n", " \n", " handles, labels = ax.get_legend_handles_labels()\n", " if handles:\n", " ax.legend(handles, labels)\n", "\n", " plt.tight_layout()" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from statsmodels.nonparametric.smoothers_lowess import lowess\n", "\n", "def make_lowess(x, y):\n", " \"\"\"Use LOWESS to compute a smooth line.\n", " \n", " series: pd.Series\n", " \n", " returns: pd.Series\n", " \"\"\"\n", " #y = series.values\n", " #x = series.index.values\n", "\n", " smooth = lowess(y, x)\n", " index, data = np.transpose(smooth)\n", "\n", " return pd.Series(data, index=index) " ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "def plot_lowess(df, color='C0', **options):\n", " x, y = df['sat_math'], df['sat_verbal']\n", " plt.plot(x, y, '.', \n", " ms=2, alpha=0.2, color=color)\n", " smooth = make_lowess(x, y)\n", " smooth.plot(color=color, **options)\n", "\n", " decorate(xlabel='SAT Math',\n", " ylabel='SAT Math - Verbal',\n", " title='Difference vs math score')" ] }, { "cell_type": "markdown", "metadata": {}, 
"source": [ "The following figure shows a scatterplot of verbal and math scores for male and female participants, along with a local regression line." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plot_lowess(female, 'C1', label='female')\n", "plot_lowess(male, label='male')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For a given math score, the female participants have a higher verbal score, and this gap seems to be wider at the high end of the range.\n", "\n", "For male participants who got a 750 on the math section, the average score on the verbal section is 659." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "658.8914941639335" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "smooth = make_lowess(male['sat_math'], male['sat_verbal'])\n", "smooth[750]" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "91.10850583606646" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "750 - smooth[750]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For female participants who got a 750 on the math section, the average score one the verbal section is 693." 
] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "750.0 693.277281\n", "750.0 693.277281\n", "750.0 693.277281\n", "dtype: float64" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "smooth = make_lowess(female['sat_math'], female['sat_verbal'])\n", "smooth[750]" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "750.0 56.722719\n", "750.0 56.722719\n", "750.0 56.722719\n", "dtype: float64" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "750 - smooth[750]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is consistent with the result Ceci et al. reported from a previous study:\n", "\n", "> Women's verbal abilities on average were nearly as strong as their mathematics abilities (only 61 points difference between their SAT-V and SAT-M), leading them to enter professions that prized verbal reasoning (e.g., law), whereas men's verbal abilities were an average of 115 points lower than their mathematics ability, possibly leading them to view mathematics as their only strength.\n", "\n", "In the NLSY dataset, these differences are a little smaller, 57 points for women and 91 for men.\n", "So we might ask:\n", "\n", "1. Is this a big enough difference that it seems likely to affect career choices? For example, suppose Student A has scores M 750 V 660 and Student B has scores M 750 V 690. Do you think A would be substantially more likely than B to \"view mathematics as their only strength\"?\n", "\n", "2. 
If we assume that the answer is yes, and that both students make career choices accordingly, how big an effect would this have on the sex ratios we see in math-intensive fields?\n", "\n", "I don't have the data to answer the first question, but we can use the data we have, and a model of the filtering processes, to put an upper bound on the second.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Simulating the filtering process\n", "\n", "As a simple model of the world, let's suppose that there are two jobs:\n", "\n", "* A math-intensive job that requires a math SAT score of 700 or more.\n", "\n", "* A math-and-verbal-intensive job that requires a math SAT score of 700 or more AND a verbal SAT score of 700 or more.\n", "\n", "And let's suppose that these jobs are so appealing that:\n", "\n", "* If someone is qualified for the math-and-verbal job, they will choose to do it;\n", "\n", "* Otherwise, if they are qualified for the math job, they will choose to do it;\n", "\n", "* Otherwise they will do something else.\n", "\n", "The following function simulates this filtering process, computing:\n", "\n", "* The number of people who meet the math requirement, and their percentage of the population.\n", "\n", "* The percentage of people who meet the verbal requirement, given that they meet the math requirement.\n", "\n", "* The number of people who meet the math requirement and NOT the verbal requirement, and their percentage of the population." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "def simulate_filter(df, thresh_math, thresh_verbal):\n", " subset = df.dropna(subset=['sat_math', 'sat_verbal'])\n", " \n", " high_math = subset['sat_math'] >= thresh_math\n", " high_verbal = subset['sat_verbal'] >= thresh_verbal\n", " \n", " n = len(subset)\n", " n_math = high_math.sum()\n", " n_math_and_verbal = (high_math & high_verbal).sum()\n", " n_math_no_verbal = (high_math & ~high_verbal).sum()\n", " \n", " result = 
dict(n=n, n_math=n_math, n_math_no_verbal=n_math_no_verbal,\n", " pct_math=n_math/n*100,\n", " pct_math_no_verbal=n_math_no_verbal/n*100,\n", " pct_verbal_given_math = n_math_and_verbal/n_math*100,\n", " )\n", " \n", " return result" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Among male participants, 6.2% meet the math requirement; 30% of them also meet the verbal requirement, which means that 4.3% of male participants meet the math requirement and NOT the verbal requirement." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'n': 648,\n", " 'n_math': 40,\n", " 'n_math_no_verbal': 28,\n", " 'pct_math': 6.172839506172839,\n", " 'pct_math_no_verbal': 4.320987654320987,\n", " 'pct_verbal_given_math': 30.0}" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "percents_male = simulate_filter(male, 700, 700)\n", "percents_male" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Among female participants, 3.5% meet the math requirement; 38% of them also meet the verbal requirement, which means that 2.1% of female participants meet the math requirement and NOT the verbal requirement." 
] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'n': 750,\n", " 'n_math': 26,\n", " 'n_math_no_verbal': 16,\n", " 'pct_math': 3.4666666666666663,\n", " 'pct_math_no_verbal': 2.1333333333333333,\n", " 'pct_verbal_given_math': 38.46153846153847}" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "percents_female = simulate_filter(female, 700, 700)\n", "percents_female" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Predicting sex ratios\n", "\n", "We can use the percentages in the previous section to compute the sex ratios we would see in the math-intensive job, which requires a high math score only.\n", "\n", "In the following function, `sex_ratio1` is the ratio of the percentage of men to the percentage of women who meet the math requirement; `sex_ratio2` is the same ratio for people who meet the math requirement and NOT the verbal requirement.\n", "`verbal_ratio` compares the percentage of women and men who meet the verbal requirement, given that they meet the math requirement, and `factor_c` is `sex_ratio2` divided by `sex_ratio1`." ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "def compute_ratios(pct1, pct2):\n", "    result = {}\n", "    result['sex_ratio1'] = pct1['pct_math'] / pct2['pct_math']\n", "    result['verbal_ratio'] = pct2['pct_verbal_given_math'] / pct1['pct_verbal_given_math']\n", "    result['sex_ratio2'] = pct1['pct_math_no_verbal'] / pct2['pct_math_no_verbal']\n", "    result['factor_c'] = result['sex_ratio2'] / result['sex_ratio1']\n", "    return result" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'factor_c': 1.1374999999999997,\n", " 'sex_ratio1': 1.7806267806267808,\n", " 'sex_ratio2': 2.025462962962963,\n", " 'verbal_ratio': 1.2820512820512822}\n" ] } ], "source": [ "from pprint import pprint\n", "\n", "pprint(compute_ratios(percents_male, percents_female))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These results are consistent with factors (b) and (c) as listed by Ceci et al.:\n", "\n", "> (b) more men than women 
score in the extreme math-proficient range on gatekeeper tests, such as the SAT Mathematics and the Graduate Record Examinations Quantitative Reasoning sections; \n", ">\n", "> (c) women with high math competence are disproportionately more likely to have high verbal competence, allowing greater choice of professions; and \n", "\n", "In this dataset, men are 1.8 times as likely as women to have an SAT math score of 700 or more.\n", "But women who meet the math requirement are 1.3 times as likely as men to ALSO meet the verbal requirement.\n", "\n", "So in a job that has a math requirement but no verbal requirement, we expect to find a sex ratio near 2:1, that is, 2 men for every 1 woman.\n", "\n", "Under these conditions, the effect of factor (c) is to increase the sex ratio in the math-intensive job from 1.8 to 2.0, an increase of 14%.\n", "\n", "For this analysis, I set both threshold scores to 700 so that the number of participants who meet the requirements is big enough to make reasonable estimates of these ratios.\n", "But if we make the thresholds any higher, we get into very small sample sizes.\n", "So, before we go on, I will extrapolate the dataset using a Gaussian model." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Extending the model into the tails\n", "\n", "The following function takes a dataset and computes the summary statistics we'll use as parameters of the model:\n", "\n", "* The mean and standard deviation of both test scores.\n", "\n", "* A regression model of verbal scores as a function of math scores, including the slope and intercept of the best-fit line and the standard deviation of the residuals." 
] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "from scipy.stats import linregress\n", "\n", "def run_regress(df):\n", " subset = df.dropna(subset=['sat_math', 'sat_verbal'])\n", "\n", " x = subset['sat_math']\n", " y = subset['sat_verbal']\n", " model = linregress(x, y)._asdict()\n", " \n", " model['x_bar'] = x.mean()\n", " model['y_bar'] = y.mean()\n", " model['std_x'] = x.std()\n", " model['std_y'] = y.std()\n", " model['std_resid'] = np.sqrt(y.std()**2 * (1-model['rvalue']**2))\n", " model['diff'] = 750 - (model['slope'] * 750 + model['intercept'])\n", " \n", " return model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here are the results for male and female participants." ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'slope': 0.7178981765301716,\n", " 'intercept': 131.23421699076414,\n", " 'rvalue': 0.7364655846849872,\n", " 'pvalue': 9.366475010436844e-112,\n", " 'stderr': 0.025944535141853017,\n", " 'intercept_stderr': 13.754270755516483,\n", " 'x_bar': 517.608024691358,\n", " 'y_bar': 502.8240740740741,\n", " 'std_x': 114.68249590043382,\n", " 'std_y': 111.79117721035935,\n", " 'std_resid': 75.62393800388467,\n", " 'diff': 80.34215061160717}" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model_male = run_regress(male)\n", "model_male" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'slope': 0.7529532783615287,\n", " 'intercept': 131.58465185439144,\n", " 'rvalue': 0.7424724177662607,\n", " 'pvalue': 2.7404858746767917e-132,\n", " 'stderr': 0.024838864388664287,\n", " 'intercept_stderr': 12.454336396031112,\n", " 'x_bar': 490.53333333333336,\n", " 'y_bar': 500.93333333333334,\n", " 'std_x': 103.91653903494971,\n", " 'std_y': 105.38344168763642,\n", " 'std_resid': 70.59390551775333,\n", " 'diff': 53.70038937446202}" ] }, 
"execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model_female = run_regress(female)\n", "model_female" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have seen most of these statistics before, but the regression parameters are new. Note that the slope of the regression line is 0.75 for women and 0.72 for men, which means that each additional point on the math section corresponds to a bigger fraction of a point on the verbal section.\n", "And that's consistent with what we saw using local regression." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Resampling\n", "\n", "Now we can use these models to simulate larger datasets, which will make it possible to explore farther into the tails.\n", "The following function takes a model and uses Gaussian distributions to generate a sample with the same parameters." ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "from scipy.stats import norm\n", "\n", "def resample_normal(model, n):\n", " mu_x = model['x_bar']\n", " sigma_x = model['std_x']\n", " xs = norm(mu_x, sigma_x).rvs(n)\n", " over700 = norm(mu_x, sigma_x).sf(700)\n", " print(over700)\n", " \n", " mu_y = xs * model['slope'] + model['intercept']\n", " sigma_y = model['std_resid']\n", " ys = norm(mu_y, sigma_y).rvs(n)\n", " \n", " return pd.DataFrame(dict(sat_math=xs, sat_verbal=ys))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's a sample based on the male model." ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.05587141804055689\n" ] }, { "data": { "text/html": [ "
<table>\n", "  <thead>\n",
"    <tr><th></th><th>sat_math</th><th>sat_verbal</th></tr>\n", "  </thead>\n", "  <tbody>\n",
"    <tr><th>0</th><td>549.290886</td><td>608.282316</td></tr>\n",
"    <tr><th>1</th><td>304.914648</td><td>329.386366</td></tr>\n",
"    <tr><th>2</th><td>589.158561</td><td>633.972882</td></tr>\n",
"    <tr><th>3</th><td>648.955182</td><td>619.365723</td></tr>\n",
"    <tr><th>4</th><td>636.555616</td><td>588.038710</td></tr>\n",
"  </tbody>\n", "</table>\n", "
" ], "text/plain": [ " sat_math sat_verbal\n", "0 549.290886 608.282316\n", "1 304.914648 329.386366\n", "2 589.158561 633.972882\n", "3 648.955182 619.365723\n", "4 636.555616 588.038710" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_male = resample_normal(model_male, 1000000)\n", "sample_male.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And we can confirm that the summary statistics are about right." ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'slope': 0.7176117527952254,\n", " 'intercept': 131.42994588189276,\n", " 'rvalue': 0.7354400554847134,\n", " 'pvalue': 0.0,\n", " 'stderr': 0.0006611645370343535,\n", " 'intercept_stderr': 0.3505110789379023,\n", " 'x_bar': 517.6181002913617,\n", " 'y_bar': 502.8787781105116,\n", " 'std_x': 114.5514635830689,\n", " 'std_y': 111.77454362738786,\n", " 'std_resid': 75.73728964910423,\n", " 'diff': 80.36123952168816}" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "run_regress(sample_male)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Compared to the model it is based on." ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'slope': 0.7178981765301716,\n", " 'intercept': 131.23421699076414,\n", " 'rvalue': 0.7364655846849872,\n", " 'pvalue': 9.366475010436844e-112,\n", " 'stderr': 0.025944535141853017,\n", " 'intercept_stderr': 13.754270755516483,\n", " 'x_bar': 517.608024691358,\n", " 'y_bar': 502.8240740740741,\n", " 'std_x': 114.68249590043382,\n", " 'std_y': 111.79117721035935,\n", " 'std_resid': 75.62393800388467,\n", " 'diff': 80.34215061160717}" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model_male" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's a sample based on the female model." 
] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.021914621151376847\n" ] }, { "data": { "text/html": [ "
<table>\n", "  <thead>\n",
"    <tr><th></th><th>sat_math</th><th>sat_verbal</th></tr>\n", "  </thead>\n", "  <tbody>\n",
"    <tr><th>0</th><td>381.189235</td><td>361.843294</td></tr>\n",
"    <tr><th>1</th><td>518.926570</td><td>604.141411</td></tr>\n",
"    <tr><th>2</th><td>500.096698</td><td>514.488709</td></tr>\n",
"    <tr><th>3</th><td>447.117311</td><td>387.088183</td></tr>\n",
"    <tr><th>4</th><td>604.412694</td><td>532.195418</td></tr>\n",
"  </tbody>\n", "</table>\n", "
" ], "text/plain": [ " sat_math sat_verbal\n", "0 381.189235 361.843294\n", "1 518.926570 604.141411\n", "2 500.096698 514.488709\n", "3 447.117311 387.088183\n", "4 604.412694 532.195418" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_female = resample_normal(model_female, 1000000)\n", "sample_female.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And we can confirm that the summary statistics are about right." ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'slope': 0.7523444964587589,\n", " 'intercept': 131.92380626866048,\n", " 'rvalue': 0.7424020891829097,\n", " 'pvalue': 0.0,\n", " 'stderr': 0.0006789274327209064,\n", " 'intercept_stderr': 0.3403908546824145,\n", " 'x_bar': 490.46814940959257,\n", " 'y_bar': 500.92481916527976,\n", " 'std_x': 103.96375681543968,\n", " 'std_y': 105.35606164222709,\n", " 'std_resid': 70.58377592684528,\n", " 'diff': 53.81782138727044}" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "run_regress(sample_female)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Compared to the model it's based on." 
] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'slope': 0.7529532783615287,\n", " 'intercept': 131.58465185439144,\n", " 'rvalue': 0.7424724177662607,\n", " 'pvalue': 2.7404858746767917e-132,\n", " 'stderr': 0.024838864388664287,\n", " 'intercept_stderr': 12.454336396031112,\n", " 'x_bar': 490.53333333333336,\n", " 'y_bar': 500.93333333333334,\n", " 'std_x': 103.91653903494971,\n", " 'std_y': 105.38344168763642,\n", " 'std_resid': 70.59390551775333,\n", " 'diff': 53.70038937446202}" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model_female" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Computing percents and ratios\n", "\n", "Now let's check whether the samples we generated yield similar results when we compute sex ratios.\n", "Here's what we get when we use the male sample to simulate the filtering process." ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'n': 1000000,\n", " 'n_math': 55456,\n", " 'n_math_no_verbal': 36272,\n", " 'pct_math': 5.545599999999999,\n", " 'pct_math_no_verbal': 3.6271999999999998,\n", " 'pct_verbal_given_math': 34.593190998268895}" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "thresh = 700\n", "percents_male = simulate_filter(sample_male, thresh, thresh)\n", "percents_male" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And let's compare it to the results with the actual data." 
] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'n': 648,\n", " 'n_math': 40,\n", " 'n_math_no_verbal': 28,\n", " 'pct_math': 6.172839506172839,\n", " 'pct_math_no_verbal': 4.320987654320987,\n", " 'pct_verbal_given_math': 30.0}" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "simulate_filter(male, thresh, thresh)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It turns out that there are discrepancies bigger than we would expect due to random sampling. For example, in the original dataset, 6.2% of male participants meet the math requirement; in the resampled data, it's about 5.5%.\n", "\n", "We'll see what's going on in the next section, but first let's run the same analysis with the sample based on the female model:\n" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "{'n': 1000000,\n", " 'n_math': 21834,\n", " 'n_math_no_verbal': 12312,\n", " 'pct_math': 2.1834,\n", " 'pct_math_no_verbal': 1.2312,\n", " 'pct_verbal_given_math': 43.61088211046991}" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "percents_female = simulate_filter(sample_female, thresh, thresh)\n", "percents_female" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And compare it with the results from the female data." 
] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'n': 750,\n", " 'n_math': 26,\n", " 'n_math_no_verbal': 16,\n", " 'pct_math': 3.4666666666666663,\n", " 'pct_math_no_verbal': 2.1333333333333333,\n", " 'pct_verbal_given_math': 38.46153846153847}" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "simulate_filter(female, thresh, thresh)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Again, there are non-negligible differences.\n", "And these differences also affect the predicted sex ratios." ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'sex_ratio1': 2.5398919116973526,\n", " 'verbal_ratio': 1.2606782101325165,\n", " 'sex_ratio2': 2.946068875893437,\n", " 'factor_c': 1.1599189958932723}" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "compute_ratios(percents_male, percents_female)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the Gaussian model, male participants are 2.5 times more likely to meet the math requirement (compared to 1.8 in the actual data) and the sex ratio we expect in the math-intensive job is about 2.9 (compared to 2.0 in the actual data)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What's going on?\n", "\n", "To see what's going on, let's compare the distribution of math scores in the original data and in the Gaussian model.\n", "Here are the distributions for the male participants." ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "import seaborn as sns\n", "\n", "sns.kdeplot(male['sat_math'], cut=0, label='data')\n", "sns.kdeplot(sample_male['sat_math'], label='model')\n", "\n", "decorate()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Based on the mean and standard deviation of SAT scores, we expect the tails to extend below 200 and above 800.\n", "But SAT scores are truncated at these bounds, and the scores are somewhat less accurate above 700 and below 300, compared to scores closer to the mean.\n", "So people who might score 810 on an exam with a wider range end up spread out in the 700s, more or less at random, based on the results from a small number of questions.\n", "As a result, the Gaussian model departs from the data in the tails.\n", "\n", "We see the same effect for female participants." ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "sns.kdeplot(female['sat_math'], cut=0, label='data')\n", "sns.kdeplot(sample_female['sat_math'], label='model')\n", "\n", "decorate()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The truncation of SAT scores at the high end has a substantial effect on the predicted sex ratios in a math-intensive job.\n", "In particular, it seems to mitigate the effect of the filtering processes, compared to a test that extends farther into the tails.\n", "\n", "Nevertheless, we can use the Gaussian model to see what happens if we increase the threshold, assuming that it is based on a test that extends farther into the tails as, for example, the GRE might." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Higher threshold\n", "\n", "If we increase the thresholds to 750, fewer people satisfy the requirements." ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'n': 1000000,\n", " 'n_math': 21185,\n", " 'n_math_no_verbal': 15609,\n", " 'pct_math': 2.1185,\n", " 'pct_math_no_verbal': 1.5609,\n", " 'pct_verbal_given_math': 26.320509794666037}" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "thresh = 750\n", "percents_male = simulate_filter(sample_male, thresh, thresh)\n", "percents_male" ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "{'n': 1000000,\n", " 'n_math': 6311,\n", " 'n_math_no_verbal': 4136,\n", " 'pct_math': 0.6311,\n", " 'pct_math_no_verbal': 0.41359999999999997,\n", " 'pct_verbal_given_math': 34.46363492315005}" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "percents_female = simulate_filter(sample_female, thresh, thresh)\n", "percents_female" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And the sex 
ratio in math-intensive jobs gets higher." ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'sex_ratio1': 3.3568372682617653,\n", " 'verbal_ratio': 1.3093832601272128,\n", " 'sex_ratio2': 3.773936170212766,\n", " 'factor_c': 1.1242535364745228}" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "compute_ratios(percents_male, percents_female)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With this threshold, male participants are 3.4 times more likely to meet the math requirement, but female participants who meet the math requirement are 1.3 times more likely to ALSO meet the verbal requirement.\n", "\n", "If all people who meet both requirements choose a different job, the sex ratio we expect to see in the math-intensive job is about 3.8:1.\n", "The effect of factor (c) is to increase the sex ratio by about 12%.\n", "\n", "If we increase the threshold to 800, fewer than 1% of people meet either requirement." 
] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'n': 1000000,\n", " 'n_math': 6828,\n", " 'n_math_no_verbal': 5426,\n", " 'pct_math': 0.6828,\n", " 'pct_math_no_verbal': 0.5426,\n", " 'pct_verbal_given_math': 20.53309900410076}" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "thresh = 800\n", "percents_male = simulate_filter(sample_male, thresh, thresh)\n", "percents_male" ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "{'n': 1000000,\n", " 'n_math': 1480,\n", " 'n_math_no_verbal': 1100,\n", " 'pct_math': 0.148,\n", " 'pct_math_no_verbal': 0.11,\n", " 'pct_verbal_given_math': 25.675675675675674}" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "percents_female = simulate_filter(sample_female, thresh, thresh)\n", "percents_female" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And the sex ratios are higher." 
] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'sex_ratio1': 4.613513513513514,\n", " 'verbal_ratio': 1.2504530207811235,\n", " 'sex_ratio2': 4.932727272727273,\n", " 'factor_c': 1.0691910315811897}" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "compute_ratios(percents_male, percents_female)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With these thresholds, male participants are 4.6 times more likely to meet the math requirement, but female participants who meet the math requirement are 1.25 times more likely to ALSO meet the verbal requirement.\n", "\n", "If all people who meet both requirements choose a different job, the sex ratio we expect to see in the math-intensive job is about 4.9:1.\n", "The effect of factor (c) is to increase the sex ratio by about 7%.\n", "\n", "In summary, the stricter the math and verbal requirements are, the larger the effect of factor (b).\n", "At every level, factor (c) has the effect of increasing the sex ratio we expect in a math-intensive job, but the increase is only 7-14%, and might get smaller as the requirements get stricter.\n", "\n", "And this is probably an upper bound on the effect of factor (c), since it assumes that everyone who qualifies for the verbal-intensive job chooses to do it instead of the math-intensive job." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What if?\n", "\n", "Having made this model, we can use it to answer a question related to factor (b):\n", "\n", "> (b) more men than women score in the extreme math-proficient range on gatekeeper tests, such as the SAT Mathematics and the Graduate Record Examinations Quantitative Reasoning sections; \n", "\n", "There are more men in the tail of the distribution because their average is higher, but also because their variance is higher.\n", "So we might wonder what part of the sex ratio in the tail is explained by the difference in the means and what part by the difference in variance.\n", "\n", "Continuing the previous example with threshold 800, the sex ratio among people who meet the math requirement is 4.6:1." ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'sex_ratio1': 4.613513513513514,\n", " 'verbal_ratio': 1.2504530207811235,\n", " 'sex_ratio2': 4.932727272727273,\n", " 'factor_c': 1.0691910315811897}" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "compute_ratios(percents_male, percents_female)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's see what would happen if the male and female means were the same; the next two cells set both to 502." 
] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'slope': 0.7178981765301716,\n", " 'intercept': 131.23421699076414,\n", " 'rvalue': 0.7364655846849872,\n", " 'pvalue': 9.366475010436844e-112,\n", " 'stderr': 0.025944535141853017,\n", " 'intercept_stderr': 13.754270755516483,\n", " 'x_bar': 502,\n", " 'y_bar': 502.8240740740741,\n", " 'std_x': 114.68249590043382,\n", " 'std_y': 111.79117721035935,\n", " 'std_resid': 75.62393800388467,\n", " 'diff': 80.34215061160717}" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model_male1 = model_male.copy()\n", "model_male1['x_bar'] = 502\n", "model_male1" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'slope': 0.7529532783615287,\n", " 'intercept': 131.58465185439144,\n", " 'rvalue': 0.7424724177662607,\n", " 'pvalue': 2.7404858746767917e-132,\n", " 'stderr': 0.024838864388664287,\n", " 'intercept_stderr': 12.454336396031112,\n", " 'x_bar': 502,\n", " 'y_bar': 500.93333333333334,\n", " 'std_x': 103.91653903494971,\n", " 'std_y': 105.38344168763642,\n", " 'std_resid': 70.59390551775333,\n", " 'diff': 53.70038937446202}" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model_female1 = model_female.copy()\n", "model_female1['x_bar'] = 502\n", "model_female1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following function takes the two counter-factual models, generates samples from each, and computes the sex ratio in the tail of the distribution." 
] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [], "source": [ "def run_analysis(model_male, model_female, thresh=800):\n", " sample_male = resample_normal(model_male, 1000000)\n", " percents_male = simulate_filter(sample_male, thresh, thresh)\n", "\n", " sample_female = resample_normal(model_female, 1000000)\n", " percents_female = simulate_filter(sample_female, thresh, thresh)\n", "\n", " ratios = compute_ratios(percents_male, percents_female)\n", " pprint(ratios)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If the male and female means are the same, but the male variance is higher, the sex ratio in the tail is about 2.2." ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.042128223945598495\n", "0.02836565600695388\n", "{'factor_c': 1.084553701087678,\n", " 'sex_ratio1': 2.218451242829828,\n", " 'sex_ratio2': 2.4060295060936494,\n", " 'verbal_ratio': 1.3285781038520206}\n" ] } ], "source": [ "run_analysis(model_male1, model_female1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's see what happens if the distributions have different means and the same variance." 
] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.04713207262057749\n", "0.02732096935190762\n", "{'factor_c': 1.115579785162076,\n", " 'sex_ratio1': 2.1703637976929904,\n", " 'sex_ratio2': 2.4212139791538934,\n", " 'verbal_ratio': 1.4338670688894304}\n" ] } ], "source": [ "model_male2 = model_male.copy()\n", "model_male2['std_x'] = 109\n", "\n", "model_female2 = model_female.copy()\n", "model_female2['std_x'] = 109\n", "\n", "run_analysis(model_male2, model_female2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If the distributions have the same variance and different means, the sex ratio in the tail is about 2.2.\n", "So the contributions of the mean and variance are roughly equal." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And, just to confirm, if the mean and variance are the same, the ratio is close to 1." ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.034645799449790106\n", "0.034645799449790106\n", "{'factor_c': 1.1460984634238311,\n", " 'sex_ratio1': 1.0064516129032257,\n", " 'sex_ratio2': 1.1534926470588236,\n", " 'verbal_ratio': 1.5245267054468534}\n" ] } ], "source": [ "model_male3 = model_male.copy()\n", "model_male3['x_bar'] = 502\n", "model_male3['std_x'] = 109\n", "\n", "model_female3 = model_female.copy()\n", "model_female3['x_bar'] = 502\n", "model_female3['std_x'] = 109\n", "\n", "run_analysis(model_male3, model_female3)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": 
"3.9.2" } }, "nbformat": 4, "nbformat_minor": 2 }