{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Gender Guessing Methods\n",
"\n",
"Now that we've got our author dataset and gender inferences, it's time to do some exploratory data analysis. I'm going to do this in [Julia](http://julialang.org). I wrote two functions, `importauthors()` and `getgenderprob()`. The first reads the CSV files written by `write_names_to_file` (found in [xml parsing](xml_parsing.py)) into Julia `DataFrames`; the second looks up the stored gender inferences for a column of names."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"WARNING: Method definition importauthors(String, String) in module Main at /Users/ksb/computation/science/gender-comp-bio/src/dataimport.jl:5 overwritten at /Users/ksb/computation/science/gender-comp-bio/src/dataimport.jl:5.\n",
"WARNING: Method definition getgenderprob(DataFrames.DataFrame, String, Symbol) in module Main at /Users/ksb/computation/science/gender-comp-bio/src/dataimport.jl:14 overwritten at /Users/ksb/computation/science/gender-comp-bio/src/dataimport.jl:14.\n"
]
},
{
"data": {
"text/plain": [
"getgenderprob (generic function with 1 method)"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"include(\"../src/dataimport.jl\")"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"5×9 DataFrames.DataFrame\n",
"│ Row │ ID │ Date │ Journal │\n",
"├─────┼──────────┼────────────┼────────────────────────────────────────┤\n",
"│ 1 │ 26605382 │ 2015-11-24 │ \"IEEE/ACM Trans Comput Biol Bioinform\" │\n",
"│ 2 │ 26605382 │ 2015-11-24 │ \"IEEE/ACM Trans Comput Biol Bioinform\" │\n",
"│ 3 │ 26605382 │ 2015-11-24 │ \"IEEE/ACM Trans Comput Biol Bioinform\" │\n",
"│ 4 │ 26357062 │ 2015-09-11 │ \"IEEE/ACM Trans Comput Biol Bioinform\" │\n",
"│ 5 │ 26357062 │ 2015-09-11 │ \"IEEE/ACM Trans Comput Biol Bioinform\" │\n",
"\n",
"│ Row │ Author_First_Name │ Author_Last_Name │ Author_Initials │ Position │\n",
"├─────┼───────────────────┼──────────────────┼─────────────────┼──────────┤\n",
"│ 1 │ \"Yufei\" │ \"Huang\" │ NA │ \"first\" │\n",
"│ 2 │ \"Xiaoning\" │ \"Qian\" │ NA │ \"last\" │\n",
"│ 3 │ \"Yidong\" │ \"Chen\" │ NA │ \"second\" │\n",
"│ 4 │ \"Haiying\" │ \"Wang\" │ NA │ \"first\" │\n",
"│ 5 │ \"Huiru\" │ \"Zheng\" │ NA │ \"last\" │\n",
"\n",
"│ Row │ Title │\n",
"├─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤\n",
"│ 1 │ \"Selected Articles from the 2012 IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS 2012).\" │\n",
"│ 2 │ \"Selected Articles from the 2012 IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS 2012).\" │\n",
"│ 3 │ \"Selected Articles from the 2012 IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS 2012).\" │\n",
"│ 4 │ \"Organized Modularity in the Interactome: Evidence from the Analysis of Dynamic Organization in the Cell Cycle.\" │\n",
"│ 5 │ \"Organized Modularity in the Interactome: Evidence from the Analysis of Dynamic Organization in the Cell Cycle.\" │\n",
"\n",
"│ Row │ Dataset │\n",
"├─────┼─────────┤\n",
"│ 1 │ \"comp\" │\n",
"│ 2 │ \"comp\" │\n",
"│ 3 │ \"comp\" │\n",
"│ 4 │ \"comp\" │\n",
"│ 5 │ \"comp\" │"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"using DataFrames\n",
"\n",
"bio = importauthors(\"../data/pubdata/bio.csv\", \"bio\")\n",
"comp = importauthors(\"../data/pubdata/comp.csv\", \"comp\")\n",
"\n",
"comp[1:5, :]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's combine all the data - we can subset it again later. I'm also going to clear the `bio` and `comp` variables to free up some memory.\n",
"\n",
"We'll also use the `getgenderprob()` function to add two columns per API: the probability that the author is female (`P`), and the number of times that name shows up in the API's database (`Count`), which gives us some sense of how certain we can be in the result.\n",
"\n",
"Finally, we'll use `pool!`, which makes the representation of categorical data (data that takes distinct rather than continuous values) a bit more efficient in memory (and will make queries faster later on)."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"alldata = vcat(bio, comp)\n",
"bio = 0   # rebind so the original DataFrames can be garbage-collected\n",
"comp = 0\n",
"\n",
"alldata[:izeP], alldata[:izeCount] = getgenderprob(\n",
" alldata, \"../data/genders/genderize_genders.json\", :Author_First_Name)\n",
"alldata[:apiP], alldata[:apiCount] = getgenderprob(\n",
" alldata, \"../data/genders/genderAPI_genders.json\", :Author_First_Name)\n",
"\n",
"pool!(alldata)"
]
},
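{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the idea behind `pool!` a bit more concrete, here is a minimal plain-Julia sketch of pooling: each distinct value is stored once, and every row keeps only a small integer reference to it. The helper `pool_values` and the toy `toypositions` vector are made up for illustration. This is not how `DataFrames` implements pooling internally, just the general technique."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Hypothetical helper (not part of DataFrames): pool a vector of repeated values\n",
"function pool_values(values)\n",
"    distinct = unique(values)  # each distinct value, in order of first appearance\n",
"    index = Dict(value => i for (i, value) in enumerate(distinct))\n",
"    refs = [index[v] for v in values]  # one small integer per row\n",
"    return distinct, refs\n",
"end\n",
"\n",
"toypositions = [\"first\", \"last\", \"second\", \"first\", \"last\"]  # toy data\n",
"distinctvals, refs = pool_values(toypositions)\n",
"\n",
"# the original vector can be rebuilt from the distinct values and the references\n",
"distinctvals[refs] == toypositions"
]
},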
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In Julia, we can [subset our dataframes](http://dataframesjl.readthedocs.io/en/latest/subsets.html) pretty easily. For example, we can pull back out rows for our different datasets."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"5×13 DataFrames.DataFrame\n",
"│ Row │ ID │ Date │ Journal │\n",
"├─────┼──────────┼────────────┼───────────────────────────────────────────────┤\n",
"│ 1 │ 26466425 │ 2015-10-15 │ \"Southeast Asian J. Trop. Med. Public Health\" │\n",
"│ 2 │ 26466425 │ 2015-10-15 │ \"Southeast Asian J. Trop. Med. Public Health\" │\n",
"│ 3 │ 26466425 │ 2015-10-15 │ \"Southeast Asian J. Trop. Med. Public Health\" │\n",
"│ 4 │ 26466425 │ 2015-10-15 │ \"Southeast Asian J. Trop. Med. Public Health\" │\n",
"│ 5 │ 26466425 │ 2015-10-15 │ \"Southeast Asian J. Trop. Med. Public Health\" │\n",
"\n",
"│ Row │ Author_First_Name │ Author_Last_Name │ Author_Initials │ Position │\n",
"├─────┼───────────────────┼──────────────────┼─────────────────┼───────────────┤\n",
"│ 1 │ \"Suwit\" │ \"Chotinun\" │ NA │ \"first\" │\n",
"│ 2 │ \"Prapas\" │ \"Patchanee\" │ NA │ \"last\" │\n",
"│ 3 │ \"Suvichai\" │ \"Rojanasthien\" │ NA │ \"second\" │\n",
"│ 4 │ \"Pakpoom\" │ \"Tadee\" │ NA │ \"penultimate\" │\n",
"│ 5 │ \"Fred\" │ \"Unger\" │ NA │ \"other\" │\n",
"\n",
"│ Row │ Title │\n",
"├─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤\n",
"│ 1 │ \"PREVALENCE AND ANTIMICROBIAL RESISTANCE OF SALMONELLA ISOLATED FROM CARCASSES, PROCESSING FACILITIES AND THE ENVIRONMENT SURROUNDING SMALL SCALE POULTRY SLAUGHTERHOUSES IN THAILAND.\" │\n",
"│ 2 │ \"PREVALENCE AND ANTIMICROBIAL RESISTANCE OF SALMONELLA ISOLATED FROM CARCASSES, PROCESSING FACILITIES AND THE ENVIRONMENT SURROUNDING SMALL SCALE POULTRY SLAUGHTERHOUSES IN THAILAND.\" │\n",
"│ 3 │ \"PREVALENCE AND ANTIMICROBIAL RESISTANCE OF SALMONELLA ISOLATED FROM CARCASSES, PROCESSING FACILITIES AND THE ENVIRONMENT SURROUNDING SMALL SCALE POULTRY SLAUGHTERHOUSES IN THAILAND.\" │\n",
"│ 4 │ \"PREVALENCE AND ANTIMICROBIAL RESISTANCE OF SALMONELLA ISOLATED FROM CARCASSES, PROCESSING FACILITIES AND THE ENVIRONMENT SURROUNDING SMALL SCALE POULTRY SLAUGHTERHOUSES IN THAILAND.\" │\n",
"│ 5 │ \"PREVALENCE AND ANTIMICROBIAL RESISTANCE OF SALMONELLA ISOLATED FROM CARCASSES, PROCESSING FACILITIES AND THE ENVIRONMENT SURROUNDING SMALL SCALE POULTRY SLAUGHTERHOUSES IN THAILAND.\" │\n",
"\n",
"│ Row │ Dataset │ izeP │ izeCount │ apiP │ apiCount │\n",
"├─────┼─────────┼──────┼──────────┼──────┼──────────┤\n",
"│ 1 │ \"bio\" │ 0.0 │ 2 │ 0.02 │ 306 │\n",
"│ 2 │ \"bio\" │ 0.0 │ 1 │ 0.03 │ 58 │\n",
"│ 3 │ \"bio\" │ NA │ 0 │ NA │ 0 │\n",
"│ 4 │ \"bio\" │ 0.0 │ 1 │ 0.02 │ 116 │\n",
"│ 5 │ \"bio\" │ 0.02 │ 966 │ 0.04 │ 49394 │"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"alldata = alldata[!isna(alldata[:Journal]), :] # remove rows where there's no Journal\n",
"\n",
"biodata = alldata[alldata[:Dataset] .== \"bio\", :] # get all columns for rows where the Dataset column is \"bio\"\n",
"compdata = alldata[alldata[:Dataset] .== \"comp\", :]\n",
"\n",
"biodata[1:5, :] # get the first 5 rows, and all columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we're going to use the plotting package [`Plots`](https://juliaplots.github.io/), which allows us to use different plotting backends to take a look. First, we need to reshape the data a little bit to make it easier to plot."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"10×3 DataFrames.DataFrame\n",
"│ Row │ Position │ Dataset │ x1 │\n",
"├─────┼───────────────┼─────────┼──────────┤\n",
"│ 1 │ \"first\" │ \"bio\" │ 0.375575 │\n",
"│ 2 │ \"first\" │ \"comp\" │ 0.316339 │\n",
"│ 3 │ \"last\" │ \"bio\" │ 0.244588 │\n",
"│ 4 │ \"last\" │ \"comp\" │ 0.207292 │\n",
"│ 5 │ \"other\" │ \"bio\" │ 0.368088 │\n",
"│ 6 │ \"other\" │ \"comp\" │ 0.330706 │\n",
"│ 7 │ \"penultimate\" │ \"bio\" │ 0.279369 │\n",
"│ 8 │ \"penultimate\" │ \"comp\" │ 0.236174 │\n",
"│ 9 │ \"second\" │ \"bio\" │ 0.378591 │\n",
"│ 10 │ \"second\" │ \"comp\" │ 0.321791 │"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"genderize_byposition = by(alldata, [:Position, :Dataset], df -> mean(dropna(df[:izeP])))\n",
"genderapi_byposition = by(alldata, [:Position, :Dataset], df -> mean(dropna(df[:apiP])))"
]
},
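{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `by` calls above split the rows into (`Position`, `Dataset`) groups and average the non-missing probabilities in each group. As a rough sketch of the same split-apply-combine idea in plain Julia, here it is on a few made-up rows, where `NaN` stands in for `NA` (an assumption of this example, not something the real data uses). The names `toyrows`, `probsbygroup` and `groupmeans` are hypothetical."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Toy rows: (position, dataset, P(female)); NaN marks a name with no guess\n",
"toyrows = [(\"first\", \"bio\", 0.9), (\"first\", \"bio\", 0.1), (\"first\", \"comp\", NaN),\n",
"           (\"first\", \"comp\", 0.2), (\"last\", \"bio\", 0.0), (\"last\", \"bio\", NaN)]\n",
"\n",
"# split: collect the guessed probabilities for each (position, dataset) group\n",
"probsbygroup = Dict{Tuple{String,String},Vector{Float64}}()\n",
"for (position, dataset, p) in toyrows\n",
"    isnan(p) && continue  # same role as dropna: skip names without a guess\n",
"    push!(get!(probsbygroup, (position, dataset), Float64[]), p)\n",
"end\n",
"\n",
"# apply and combine: average within each group\n",
"groupmeans = Dict(key => sum(ps) / length(ps) for (key, ps) in probsbygroup)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that in this sketch a group whose names are all unguessed simply never appears in the result."
]
},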
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
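"using Plots, StatPlots  # assumption: groupedbar comes from StatPlots (later renamed StatsPlots)\n",
"\n",
"# one column of mean P(female) per dataset, one row per author position\n",
"ys = hcat([genderize_byposition[genderize_byposition[:Dataset] .== x, :x1] for x in levels(genderize_byposition[:Dataset])]...)\n",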
"\n",
"groupedbar(ys, bar_position=:dodge, \n",
" ylims=(0,0.6), xticks=(1:5,levels(genderize_byposition[:Position])),\n",
" lab=levels(genderize_byposition[:Dataset]),\n",
" xlabel=\"Author Position\",\n",
" ylabel=\"Proportion Female\",\n",
" title=\"By Position, Genderize.io\")"
]
},
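{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `hcat` line in the plotting cell turns the long summary table (one row per position and dataset) into a matrix with one column per dataset, which is the shape `groupedbar` wants. A small sketch of that reshaping on made-up values, in plain Julia (the vectors `toyposition`, `toydataset` and `toymean` are hypothetical):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Toy long-format summary, with positions in the same order for every dataset\n",
"toyposition = [\"first\", \"first\", \"last\", \"last\"]\n",
"toydataset  = [\"bio\", \"comp\", \"bio\", \"comp\"]\n",
"toymean     = [0.38, 0.32, 0.24, 0.21]\n",
"\n",
"# pick out each dataset's values, then glue them side by side:\n",
"# one column per dataset, one row per position\n",
"wide = hcat([toymean[toydataset .== d] for d in unique(toydataset)]...)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This only lines up correctly when every dataset lists the same positions in the same order, which holds for the summary tables here because they come out sorted by position."
]
},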
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ys = hcat([genderapi_byposition[genderapi_byposition[:Dataset] .== x, :x1] for x in levels(genderapi_byposition[:Dataset])]...)\n",
"\n",
"groupedbar(ys, bar_position=:dodge, \n",
" ylims=(0,0.6), xticks=(1:5,levels(genderapi_byposition[:Position])),\n",
" lab=levels(genderapi_byposition[:Dataset]),\n",
" xlabel=\"Author Position\",\n",
" ylabel=\"Proportion Female\",\n",
" title=\"By Position, GenderAPI\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The good news:\n",
"- both of the gender inferences look pretty similar.\n",
"- this recapitulates previously published data that:\n",
" - Women are less likely to be authors than men\n",
" - Women are less likely to be first authors than second authors\n",
" - Women are less likely to be last authors than first authors\n",
" \n",
"The bad news - this recapitulates previously published data that women are under-represented in biology publishing. \n",
"\n",
"New finding: It seems to be worse in computational biology than in all of biology, though not by as much as I expected. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Methods discrepancies\n",
"\n",
"Using genderize, it looks like women are better represented than when using genderAPI. Which one is better? \n",
"\n",
"A couple of things to consider:\n",
"1. How many of our names can the service guess?\n",
"2. What proportion of authors can the service guess? (This is a different question, since common names account for many authors.)\n",
"3. For names that can be guessed, how certain can we be that the gender assignment is correct?\n",
"\n",
"To answer these, we'll start by reshaping our dataframe to show the stats for each name, including the number of times each name shows up."
]
},
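{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before working with the real data, it may help to see why questions 1 and 2 can give different answers. In the toy sketch below, the author list is made up and `guesscount` borrows three counts from the `izeCount` column shown earlier; a count of 0 is assumed to mean the service returned no guess. The names `authornames`, `guesscount` and `isguessed` are hypothetical."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Toy data: one entry per author; a common name appears several times\n",
"authornames = [\"Fred\", \"Fred\", \"Fred\", \"Suwit\", \"Suvichai\"]\n",
"guesscount = Dict(\"Fred\" => 966, \"Suwit\" => 2, \"Suvichai\" => 0)\n",
"\n",
"# assumption: a count of 0 means the service could not guess this name\n",
"isguessed(name) = guesscount[name] > 0\n",
"\n",
"# question 1: share of distinct names with a guess\n",
"uniquenames = unique(authornames)\n",
"namecoverage = count(isguessed, uniquenames) / length(uniquenames)\n",
"\n",
"# question 2: share of authors (rows) with a guess\n",
"authorcoverage = count(isguessed, authornames) / length(authornames)\n",
"\n",
"(namecoverage, authorcoverage)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here two of the three distinct names are guessable, but four of the five authors are, because the guessable name `Fred` is also the most common one."
]
},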
{
"cell_type": "code",
"execution_count": 88,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"