# Bite Size Bayes


Copyright 2020 Allen B. Downey

MIT License: https://opensource.org/licenses/MIT

In [1]:
import pandas as pd
import numpy as np

The dataset includes variables I selected from the General Social Survey, available from this project on the GSS site: https://gssdataexplorer.norc.org/projects/54786

I also store the data in the GitHub repository for this book; the following cell downloads it, if necessary.

In [2]:
# Download and unpack the data file, if necessary

import os

if not os.path.exists('gss_bayes.tar.gz'):
    !wget https://github.com/AllenDowney/BiteSizeBayes/raw/master/gss_bayes.tar.gz
    !tar -xzf gss_bayes.tar.gz
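
The `!wget` and `!tar` lines only work in IPython or Jupyter on a system that has those tools. If you are running somewhere else, here's a minimal sketch of the same idea using only the standard library. The function name `download_and_extract` and the `dest` parameter are my own, not part of the book's code, and unlike the cell above this version unpacks the archive on every call.

```python
import os
import tarfile
from urllib.request import urlretrieve


def download_and_extract(url, filename, dest='.'):
    """Download `filename` from `url` if it is missing, then unpack it.

    Hypothetical helper, not part of utils.py.
    """
    # Skip the download if the archive is already on disk
    if not os.path.exists(filename):
        urlretrieve(url, filename)

    # Unpack the gzipped tar archive into `dest`
    with tarfile.open(filename, 'r:gz') as tar:
        tar.extractall(dest)
```

To fetch the data for this notebook, you could call it as `download_and_extract('https://github.com/AllenDowney/BiteSizeBayes/raw/master/gss_bayes.tar.gz', 'gss_bayes.tar.gz')`.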

`utils.py` provides `read_stata`, which reads the data from the Stata format.

In [3]:
from utils import read_stata

gss = read_stata('GSS.dct', 'GSS.dat')
gss.rename(columns={'id_': 'caseid'}, inplace=True)
gss.index = gss['caseid']
gss.head()

| caseid | year | relig | srcbelt | region | adults | wtssall | ballot | cohort | feminist | polviews | partyid | race | sex | educ | age | indus10 | occ10 | caseid | realinc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1972 | 3 | 3 | 3 | 1 | 0.4446 | 0 | 1949 | 0 | 0 | 2 | 1 | 2 | 16 | 23 | 5170 | 520 | 1 | 18951.0 |
| 2 | 1972 | 2 | 3 | 3 | 2 | 0.8893 | 0 | 1902 | 0 | 0 | 1 | 1 | 1 | 10 | 70 | 6470 | 7700 | 2 | 24366.0 |
| 3 | 1972 | 1 | 3 | 3 | 2 | 0.8893 | 0 | 1924 | 0 | 0 | 3 | 1 | 2 | 12 | 48 | 7070 | 4920 | 3 | 24366.0 |
| 4 | 1972 | 5 | 3 | 3 | 2 | 0.8893 | 0 | 1945 | 0 | 0 | 1 | 1 | 2 | 17 | 27 | 5170 | 800 | 4 | 30458.0 |
| 5 | 1972 | 1 | 3 | 3 | 2 | 0.8893 | 0 | 1911 | 0 | 0 | 0 | 1 | 2 | 12 | 61 | 6680 | 5020 | 5 | 50763.0 |
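
I don't show the body of `read_stata` here, so the following is not the code in `utils.py`. It's a rough sketch of the general idea, assuming the `.dat` file is fixed-width text and the `.dct` dictionary says which character positions belong to which variable. The column positions and the three-column layout below are made up for illustration; only the values come from the first two rows above.

```python
import io

import pandas as pd

# Toy fixed-width data: caseid in columns 0-3, year in 4-7, age in 8-9.
# These positions are hypothetical, not taken from GSS.dct.
raw = "   1197223\n   2197270\n"

colspecs = [(0, 4), (4, 8), (8, 10)]
names = ['caseid', 'year', 'age']

# read_fwf slices each line at the given positions and infers numeric types
toy = pd.read_fwf(io.StringIO(raw), colspecs=colspecs, names=names, header=None)
print(toy)
```

In a real reader, `colspecs` and `names` would be parsed from the dictionary file rather than typed in by hand.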


In [4]:
def replace_invalid(series, bad_vals, replacement=np.nan):
    """Replace invalid values with NaN

    Modifies series in place.

    series: Pandas Series
    bad_vals: list of values to replace
    replacement: value to substitute for each bad value
    """
    series.replace(bad_vals, replacement, inplace=True)

The following cell replaces invalid responses for the variables we'll use.

In [5]:
replace_invalid(gss['feminist'], [0, 8, 9])
replace_invalid(gss['polviews'], [0, 8, 9])
replace_invalid(gss['partyid'], [8, 9])
replace_invalid(gss['indus10'], [0, 9997, 9999])
replace_invalid(gss['age'], [0, 98, 99])
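
One caveat: whether an in-place change to `gss['feminist']` shows up in `gss` depends on the version of pandas. With Copy-on-Write enabled (the default as of pandas 3.0, as far as I know), a column selected this way behaves like a separate object, so the in-place version might leave the DataFrame unchanged. Here's a sketch of an alternative that returns a new Series and assigns it back. The name `replaced` and the toy data are mine, for illustration only.

```python
import numpy as np
import pandas as pd


def replaced(series, bad_vals, replacement=np.nan):
    """Return a copy of `series` with `bad_vals` replaced.

    Hypothetical helper; it does not modify `series`.
    """
    return series.replace(bad_vals, replacement)


# Toy data standing in for two GSS columns
df = pd.DataFrame({'polviews': [0, 4, 8, 5, 9],
                   'partyid': [0, 8, 3, 9, 7]})

# Assigning the result back works regardless of Copy-on-Write
df['polviews'] = replaced(df['polviews'], [0, 8, 9])
df['partyid'] = replaced(df['partyid'], [8, 9])
print(df)
```

Notice that 0 is treated as invalid for `polviews` but stays in `partyid`, which matches the lists of bad values in the cell above.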

In [6]:
def values(series):
    """Make a series of values and the number of times they appear.
    
    series: Pandas Series
    
    returns: Pandas Series
    """
    return series.value_counts(dropna=False).sort_index()
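
To see what `values` does, here's a small example with made-up data. I repeat the function so the example runs on its own. Passing `dropna=False` keeps a count of the missing values, and `sort_index` orders the result by value, with `NaN` at the end.

```python
import numpy as np
import pandas as pd


def values(series):
    """Same function as above, repeated so this example is self-contained."""
    return series.value_counts(dropna=False).sort_index()


# Toy responses: one 1, three 2s, and two missing values
toy = pd.Series([2, 1, np.nan, 2, 2, np.nan])
counts = values(toy)
print(counts)
```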

### feminist

https://gssdataexplorer.norc.org/variables/1698/vshow

This question was asked in only one year, so we have relatively few valid responses.

In [7]:
values(gss['feminist'])

1.0      298
2.0     1083
NaN    61085
Name: feminist, dtype: int64

### polviews

https://gssdataexplorer.norc.org/variables/178/vshow


In [8]:
values(gss['polviews'])

1.0     1560
2.0     6236
3.0     6754
4.0    20515
5.0     8407
6.0     7876
7.0     1733
NaN     9385
Name: polviews, dtype: int64

### partyid

https://gssdataexplorer.norc.org/variables/141/vshow

In [9]:
values(gss['partyid'])

0.0     9999
1.0    12942
2.0     7485
3.0     9474
4.0     5462
5.0     9661
6.0     6063
7.0      995
NaN      385
Name: partyid, dtype: int64

### race

https://gssdataexplorer.norc.org/variables/82/vshow

In [10]:
values(gss['race'])

1    50340
2     8802
3     3324
Name: race, dtype: int64

### sex

https://gssdataexplorer.norc.org/variables/81/vshow

In [11]:
values(gss['sex'])

1    27562
2    34904
Name: sex, dtype: int64

### age



In [12]:
values(gss['age'])

18.0     219
19.0     835
20.0     870
21.0     987
22.0    1042
        ... 
86.0     172
87.0     143
88.0     113
89.0     335
NaN      221
Name: age, Length: 73, dtype: int64

### indus10

https://gssdataexplorer.norc.org/variables/17/vshow

In [13]:
values(gss['indus10'])

170.0      458
180.0      444
190.0       37
270.0       69
280.0       36
          ... 
9770.0      13
9780.0       8
9790.0      53
9870.0      22
NaN       4704
Name: indus10, Length: 271, dtype: int64

## Select subset

Here's the subset of the data with valid responses for the variables we'll use.

In [14]:
varnames = ['year', 'age', 'sex', 'polviews', 'partyid', 'indus10']

valid = gss.dropna(subset=varnames)
valid.shape

(49290, 19)
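
The `subset` argument matters here: `dropna` only drops a row if one of the listed columns is missing, so the many `NaN` values in `feminist` don't cost us any rows. Here's a small sketch with made-up data that shows the difference.

```python
import numpy as np
import pandas as pd

# Toy frame: `feminist` is mostly missing, like the real column
toy = pd.DataFrame({
    'age': [23.0, np.nan, 48.0, 27.0],
    'polviews': [4.0, 5.0, np.nan, 6.0],
    'feminist': [np.nan, np.nan, 1.0, np.nan],
})

# Only rows missing `age` or `polviews` are dropped
valid_toy = toy.dropna(subset=['age', 'polviews'])
print(valid_toy.shape)   # (2, 3)

# Without `subset`, every row has at least one NaN and is dropped
print(toy.dropna().shape)   # (0, 3)
```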

In [15]:
subset = valid[varnames]
subset.head()

| caseid | year | age | sex | polviews | partyid | indus10 |
|---|---|---|---|---|---|---|
| 1 | 1974 | 21.0 | 1 | 4.0 | 2.0 | 4970.0 |
| 2 | 1974 | 41.0 | 1 | 5.0 | 0.0 | 9160.0 |
| 5 | 1974 | 58.0 | 2 | 6.0 | 1.0 | 2670.0 |
| 6 | 1974 | 30.0 | 1 | 5.0 | 4.0 | 6870.0 |
| 7 | 1974 | 48.0 | 1 | 5.0 | 4.0 | 7860.0 |


## Save the data

In [20]:
subset.to_csv('gss_bayes.csv')

In [21]:
!ls -l gss_bayes.csv

-rw-rw-r-- 1 downey downey 1546290 Jan 21 10:11 gss_bayes.csv
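
Because `to_csv` writes the index as the first column, a later notebook can restore it by passing `index_col='caseid'` to `read_csv`. Here's a sketch of the round trip using a toy DataFrame and an in-memory buffer instead of a file; with the real data, you would pass `'gss_bayes.csv'` in place of the buffer.

```python
import io

import pandas as pd

# Toy frame with `caseid` as the index, like `subset`
toy = pd.DataFrame(
    {'year': [1974, 1974], 'age': [21.0, 41.0], 'sex': [1, 1]},
    index=pd.Index([1, 2], name='caseid'),
)

# Write to an in-memory buffer, then rewind and read it back
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

restored = pd.read_csv(buffer, index_col='caseid')
print(restored)
```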
