{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Gender Stats\n",
"\n",
"Since the GenderAPI data seems a bit more comprehensive ([see here](./gender_methods.ipynb)), that's what we'll use going forward. This first block recapitulates what I did at the beginning of the last notebook."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"
ID
Date
Journal
Author_First_Name
Author_Last_Name
Author_Initials
Position
1
26466425
2015-10-15
Southeast Asian J. Trop. Med. Public Health
Suwit
Chotinun
NA
first
2
26466425
2015-10-15
Southeast Asian J. Trop. Med. Public Health
Prapas
Patchanee
NA
last
3
26466425
2015-10-15
Southeast Asian J. Trop. Med. Public Health
Suvichai
Rojanasthien
NA
second
4
26466425
2015-10-15
Southeast Asian J. Trop. Med. Public Health
Pakpoom
Tadee
NA
penultimate
5
26466425
2015-10-15
Southeast Asian J. Trop. Med. Public Health
Fred
Unger
NA
other
"
],
"text/plain": [
"5×7 DataFrames.DataFrame\n",
"│ Row │ ID │ Date │ Journal │\n",
"├─────┼──────────┼────────────┼───────────────────────────────────────────────┤\n",
"│ 1 │ 26466425 │ 2015-10-15 │ \"Southeast Asian J. Trop. Med. Public Health\" │\n",
"│ 2 │ 26466425 │ 2015-10-15 │ \"Southeast Asian J. Trop. Med. Public Health\" │\n",
"│ 3 │ 26466425 │ 2015-10-15 │ \"Southeast Asian J. Trop. Med. Public Health\" │\n",
"│ 4 │ 26466425 │ 2015-10-15 │ \"Southeast Asian J. Trop. Med. Public Health\" │\n",
"│ 5 │ 26466425 │ 2015-10-15 │ \"Southeast Asian J. Trop. Med. Public Health\" │\n",
"\n",
"│ Row │ Author_First_Name │ Author_Last_Name │ Author_Initials │ Position │\n",
"├─────┼───────────────────┼──────────────────┼─────────────────┼───────────────┤\n",
"│ 1 │ \"Suwit\" │ \"Chotinun\" │ NA │ \"first\" │\n",
"│ 2 │ \"Prapas\" │ \"Patchanee\" │ NA │ \"last\" │\n",
"│ 3 │ \"Suvichai\" │ \"Rojanasthien\" │ NA │ \"second\" │\n",
"│ 4 │ \"Pakpoom\" │ \"Tadee\" │ NA │ \"penultimate\" │\n",
"│ 5 │ \"Fred\" │ \"Unger\" │ NA │ \"other\" │"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"using StatPlots\n",
"\n",
"\n",
"include(\"../src/dataimport.jl\") # `importauthors()` and `getgenderprob()` functions\n",
"\n",
"bio = importauthors(\"../data/pubdata/bio.csv\", \"bio\")\n",
"comp = importauthors(\"../data/pubdata/comp.csv\", \"comp\")\n",
"alldata = vcat(bio, comp)\n",
"bio = 0 # to free up memory\n",
"comp = 0\n",
"\n",
"alldata[:Pfemale], alldata[:Count] = getgenderprob(alldata, \"../data/genders/genderAPI_genders.json\", :Author_First_Name)\n",
"\n",
"pool!(alldata)\n",
"alldata = alldata[!isna(alldata[:Journal]), :] # remove rows where there's no Journal\n",
"\n",
"alldata[1:5, 1:7]\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"
Dataset
MeanPF
1
bio
0.34048754243415824
2
comp
0.29633429724211385
"
],
"text/plain": [
"2×2 DataFrames.DataFrame\n",
"│ Row │ Dataset │ MeanPF │\n",
"├─────┼─────────┼──────────┤\n",
"│ 1 │ \"bio\" │ 0.340488 │\n",
"│ 2 │ \"comp\" │ 0.296334 │"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"means = by(alldata, [:Dataset], df -> DataFrame(MeanPF = mean(dropna(df[:Pfemale]))))"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
""
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"sys:1: MatplotlibDeprecationWarning: The set_axis_bgcolor function was deprecated in version 2.0. Use set_facecolor instead.\n"
]
}
],
"source": [
"bar(means, :MeanPF,\n",
" xaxis=(\"Dataset\", ([1,2], means[:Dataset])),\n",
" yaxis=(\"Percent Female\", (0, 0.6), 0:0.1:0.6),\n",
" legend=false,\n",
" grid=false,\n",
" title=\"Proportion of Female Authors\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Currently, the author positions are ordered in the dataframe in alphabetical order (first, last, other, penultimate, second), so I'm going to define a \"less than\" function to do a custom sort (thanks [Stack Overflow](http://stackoverflow.com/questions/37932963/efficient-custom-ordering-in-julia-dataframes)!). This function needs to return `true` for the call `x < y` exactly when `x` comes before `y` in the following order: [\"first\", \"second\", \"other\", \"penultimate\", \"last\"]."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"true\n",
"false\n"
]
}
],
"source": [
"order = Dict(key => ix for (ix, key) in enumerate([\"first\", \"second\", \"other\", \"penultimate\", \"last\"]))\n",
"\n",
"\n",
"function authororder(pos1, pos2)\n",
" return order[pos1] < order[pos2]\n",
"end\n",
"\n",
"println(authororder(\"first\", \"second\"))\n",
"println(authororder(\"second\", \"first\"))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So now we can sort the dataframe using our custom function and the `lt` keyword."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"
ID
Date
Journal
Author_First_Name
Author_Last_Name
Author_Initials
Position
1
26466425
2015-10-15
Southeast Asian J. Trop. Med. Public Health
Suwit
Chotinun
NA
first
2
26466421
2015-10-15
Southeast Asian J. Trop. Med. Public Health
Chariya
Chomvarin
NA
first
3
26466418
2015-10-15
Southeast Asian J. Trop. Med. Public Health
Meng-Bin
Tang
NA
first
4
26460400
2015-10-11
Ann. Hum. Genet.
Christopher
Steele
D
first
5
26255944
2015-08-10
Mutat. Res.
Sara
Skiöld
NA
first
"
],
"text/plain": [
"5×7 DataFrames.DataFrame\n",
"│ Row │ ID │ Date │ Journal │\n",
"├─────┼──────────┼────────────┼───────────────────────────────────────────────┤\n",
"│ 1 │ 26466425 │ 2015-10-15 │ \"Southeast Asian J. Trop. Med. Public Health\" │\n",
"│ 2 │ 26466421 │ 2015-10-15 │ \"Southeast Asian J. Trop. Med. Public Health\" │\n",
"│ 3 │ 26466418 │ 2015-10-15 │ \"Southeast Asian J. Trop. Med. Public Health\" │\n",
"│ 4 │ 26460400 │ 2015-10-11 │ \"Ann. Hum. Genet.\" │\n",
"│ 5 │ 26255944 │ 2015-08-10 │ \"Mutat. Res.\" │\n",
"\n",
"│ Row │ Author_First_Name │ Author_Last_Name │ Author_Initials │ Position │\n",
"├─────┼───────────────────┼──────────────────┼─────────────────┼──────────┤\n",
"│ 1 │ \"Suwit\" │ \"Chotinun\" │ NA │ \"first\" │\n",
"│ 2 │ \"Chariya\" │ \"Chomvarin\" │ NA │ \"first\" │\n",
"│ 3 │ \"Meng-Bin\" │ \"Tang\" │ NA │ \"first\" │\n",
"│ 4 │ \"Christopher\" │ \"Steele\" │ \"D\" │ \"first\" │\n",
"│ 5 │ \"Sara\" │ \"Skiöld\" │ NA │ \"first\" │"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sort!(alldata, cols=:Position, lt=authororder)\n",
"alldata[1:5, 1:7]"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
""
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"byposition = by(alldata, [:Position, :Dataset], df -> mean(dropna(df[:Pfemale])))\n",
"sort!(byposition, cols=:Position, lt=authororder)\n",
"ys = hcat([byposition[byposition[:Dataset] .== x, :x1] for x in levels(byposition[:Dataset])]...)\n",
"\n",
"# label the x axis in the same custom order used for the sort (levels() does not follow it)\n",
"groupedbar(ys, bar_position=:dodge,\n",
"    xaxis=(\"Author Position\", (1:5, [\"first\", \"second\", \"other\", \"penultimate\", \"last\"])),\n",
" yaxis=(\"Percent Female\", (0, 0.6), 0:0.1:0.6),\n",
" legend=false,\n",
" grid=false,\n",
" title=\"Proportion of Female Authors\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To reiterate, these data suggest:\n",
"\n",
"- Women are less likely to be authors than men\n",
"- Women are less likely to be first authors than second authors\n",
"- Women are less likely to be last authors than first authors\n",
"\n",
"This recapitulates previously published data showing that women are under-represented in biology publishing.\n",
"\n",
"New finding: under-representation seems to be worse in computational biology than in biology as a whole, though not by as much as I expected."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Additional Analysis\n",
"\n",
"Let's see how this holds up. We can't do the \"normal\" sorts of statistics that folks often do (like t-tests, chi-squared tests, etc.), since we're not taking a random sample of a population; we're looking at the whole population. An alternative is to use [bootstrap analysis](https://en.wikipedia.org/wiki/Bootstrapping_(statistics)), where we randomly resample the data many times and compute statistics on those resamples. Julia has a [bootstrap package](https://github.com/juliangehring/Bootstrap.jl), but it seems to be under active development and a bit broken right now, so I decided to roll my own (my needs are modest).\n",
"\n",
"The function below takes a DataArray `a` (like a column in a DataFrame) and draws `|a|` samples from it with replacement. It does this `n` times, recording the mean of each resample, and then returns the mean of those means along with a 95% confidence interval taken from their 2.5% and 97.5% quantiles (because of the [central limit theorem](https://en.wikipedia.org/wiki/Central_limit_theorem), I can assume the resampled means are approximately normally distributed)."
]
},
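{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough formalization of the procedure just described (the notation is introduced here for illustration, and it ignores the dropping of `NA` values): write $m = |a|$ and, for each replicate $b = 1, \\dots, n$, draw indices $i_{b,1}, \\dots, i_{b,m}$ uniformly at random, with replacement, from $\\{1, \\dots, m\\}$. Then\n",
"\n",
"$$\\bar{a}^{*}_{b} = \\frac{1}{m} \\sum_{j=1}^{m} a_{i_{b,j}}, \\qquad \\hat{\\mu} = \\frac{1}{n} \\sum_{b=1}^{n} \\bar{a}^{*}_{b}, \\qquad \\mathrm{CI}_{95\\%} = \\left[ Q_{0.025}(\\bar{a}^{*}),\\; Q_{0.975}(\\bar{a}^{*}) \\right]$$\n",
"\n",
"where $Q_{p}(\\bar{a}^{*})$ is the empirical $p$-quantile of the $n$ resampled means. An interval built this way is usually called a percentile interval."
]
},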
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"bootstrap (generic function with 1 method)"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Bootstrap the mean of `a` with `n` resamples.\n",
"# Returns (mean of the resampled means, [2.5% quantile, 97.5% quantile]).\n",
"function bootstrap{T<:Number}(a::DataArray{T}, n::Int)\n",
"    means = Float64[]\n",
"    for x in 1:n\n",
"        # draw |a| values with replacement, drop NAs, and record the mean\n",
"        push!(means, mean(dropna(\n",
"                    sample!(a, similar(a, length(a)))\n",
"                )))\n",
"    end\n",
"    means = dropna(means)\n",
"    return (mean(means), [quantile(means, .025), quantile(means, .975)])\n",
"end"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"This function takes a DataArray of values `a` (like, say, a group of gender probabilities), and resamples it with replacement `n` times. In other words, if I had `t = [1, 2, 3, 4, 5]` and I called `bootstrap(t, 1000)`, it would take 5 samples at random: sometimes it would be `[5, 2, 4, 3, 3]`, sometimes `[1, 1, 1, 1, 1]`, etc. It would do this 1000 times, getting the mean each time."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"(2.9941999999999993,[1.8,4.4])"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"t = @data([1,2,3,4,5])\n",
"(m, ci) = bootstrap(t, 1000)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The estimate for the mean is ~3, just as it should be, but I also get a confidence interval: in 95% of the resamples, the mean was between 1.8 and 4.4.\n",
"\n",
"I can use this to look at the author genders:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"
Dataset
Position
Mean
Lower
Upper
1
bio
first
0.37554899967549205
0.37341618040791885
0.37760934519023026
2
bio
last
0.2445807802726165
0.2425217412944195
0.24667229687530196
3
bio
other
0.36811105488635143
0.36672005662522206
0.36955886435787955
4
bio
penultimate
0.2794508567134617
0.2769456272599794
0.28192603507064085
5
bio
second
0.37867720975402264
0.376309021032719
0.38124814138903007
6
comp
first
0.31628457043398167
0.31216277156822253
0.32071948984831855
7
comp
last
0.20728100188266085
0.2035118072511263
0.21131039035313787
8
comp
other
0.33066245511496395
0.3278958261091234
0.3334795748392778
9
comp
penultimate
0.23616028172389428
0.2314216315959131
0.24072637668427668
10
comp
second
0.3219207849916515
0.3173954530484164
0.3265934968596715
"
],
"text/plain": [
"10×5 DataFrames.DataFrame\n",
"│ Row │ Dataset │ Position │ Mean │ Lower │ Upper │\n",
"├─────┼─────────┼───────────────┼──────────┼──────────┼──────────┤\n",
"│ 1 │ \"bio\" │ \"first\" │ 0.375549 │ 0.373416 │ 0.377609 │\n",
"│ 2 │ \"bio\" │ \"last\" │ 0.244581 │ 0.242522 │ 0.246672 │\n",
"│ 3 │ \"bio\" │ \"other\" │ 0.368111 │ 0.36672 │ 0.369559 │\n",
"│ 4 │ \"bio\" │ \"penultimate\" │ 0.279451 │ 0.276946 │ 0.281926 │\n",
"│ 5 │ \"bio\" │ \"second\" │ 0.378677 │ 0.376309 │ 0.381248 │\n",
"│ 6 │ \"comp\" │ \"first\" │ 0.316285 │ 0.312163 │ 0.320719 │\n",
"│ 7 │ \"comp\" │ \"last\" │ 0.207281 │ 0.203512 │ 0.21131 │\n",
"│ 8 │ \"comp\" │ \"other\" │ 0.330662 │ 0.327896 │ 0.33348 │\n",
"│ 9 │ \"comp\" │ \"penultimate\" │ 0.23616 │ 0.231422 │ 0.240726 │\n",
"│ 10 │ \"comp\" │ \"second\" │ 0.321921 │ 0.317395 │ 0.326593 │"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"by(alldata, [:Dataset, :Position]) do df\n",
" (m, ci) = bootstrap(df[:Pfemale], 1000)\n",
" return DataFrame(Mean=m, Lower=ci[1], Upper=ci[2])\n",
"end"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"That code takes the data, subsets it by `:Dataset` and `:Position` (that's the first two columns) and then returns the mean and 95% confidence intervals for each subset."
]
},
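{
"cell_type": "markdown",
"metadata": {},
"source": [
"For readers on a current Julia (1.x), where this notebook's Julia 0.5 method syntax no longer runs, the same resampling idea can be sketched with only the standard library. This is a minimal sketch, not the code used for the results above: `bootstrap_mean` and its `rng` keyword are illustrative names, and it drops `missing` values before resampling rather than after.\n",
"\n",
"```julia\n",
"using Random, Statistics\n",
"\n",
"# Percentile bootstrap of the mean (hypothetical helper, not part of this notebook's code).\n",
"# Returns (mean of the resampled means, (2.5% quantile, 97.5% quantile)).\n",
"function bootstrap_mean(values, n::Integer; rng=Random.default_rng())\n",
"    clean = collect(skipmissing(values))   # drop missing values up front\n",
"    # each replicate draws length(clean) values with replacement and records their mean\n",
"    means = [mean(rand(rng, clean, length(clean))) for _ in 1:n]\n",
"    return (mean(means), (quantile(means, 0.025), quantile(means, 0.975)))\n",
"end\n",
"\n",
"# fixed seed so the example is repeatable\n",
"m, (lower, upper) = bootstrap_mean([1, 2, 3, 4, 5, missing], 1000; rng=MersenneTwister(1))\n",
"```\n",
"\n",
"Passing the random number generator explicitly makes a run repeatable, which is handy when comparing intervals between groups."
]
},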
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Based on Journal Specialty\n",
"\n",
"Another way to do this is to split by journals that tend to publish computational biology articles vs. those that are more generalist. Here we'll only use the articles in the \"bio\" dataset to avoid double-dipping (the \"comp\" dataset is almost entirely a subset of \"bio\")."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"There are 202816 articles in the \"bio\" dataset\n",
"There are 42880 articles in the \"comp\" dataset\n",
"There are 236 articles in the \"comp\" dataset that aren't in the \"bio\" dataset\n"
]
}
],
"source": [
"bioids = Set(levels(alldata[alldata[:Dataset] .== \"bio\", :ID]))\n",
"compids = Set(levels(alldata[alldata[:Dataset] .== \"comp\", :ID]))\n",
"\n",
"println(\"There are $(length(bioids)) articles in the \\\"bio\\\" dataset\")\n",
"println(\"There are $(length(compids)) articles in the \\\"comp\\\" dataset\")\n",
"\n",
"dif = length(setdiff(compids, bioids))\n",
"\n",
"println(\"There are $dif articles in the \\\"comp\\\" dataset that aren't in the \\\"bio\\\" dataset\")"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"10×5 DataFrames.DataFrame\n",
"│ Row │ Journal │ Position │ Mean │ Lower │ Upper │\n",
"├─────┼──────────────────────┼───────────────┼──────────┼──────────┼──────────┤\n",
"│ 1 │ \"PLoS Biol.\" │ \"first\" │ 0.332263 │ 0.279012 │ 0.389686 │\n",
"│ 2 │ \"PLoS Comput. Biol.\" │ \"first\" │ 0.243255 │ 0.224564 │ 0.261672 │\n",
"│ 3 │ \"PLoS Biol.\" │ \"second\" │ 0.329098 │ 0.264278 │ 0.393998 │\n",
"│ 4 │ \"PLoS Comput. Biol.\" │ \"second\" │ 0.24995 │ 0.228304 │ 0.271633 │\n",
"│ 5 │ \"PLoS Biol.\" │ \"other\" │ 0.327756 │ 0.305247 │ 0.351926 │\n",
"│ 6 │ \"PLoS Comput. Biol.\" │ \"other\" │ 0.292379 │ 0.27278 │ 0.312326 │\n",
"│ 7 │ \"PLoS Biol.\" │ \"penultimate\" │ 0.217716 │ 0.153413 │ 0.284721 │\n",
"│ 8 │ \"PLoS Comput. Biol.\" │ \"penultimate\" │ 0.200221 │ 0.176161 │ 0.224715 │\n",
"│ 9 │ \"PLoS Biol.\" │ \"last\" │ 0.24128 │ 0.190627 │ 0.303766 │\n",
"│ 10 │ \"PLoS Comput. Biol.\" │ \"last\" │ 0.1768 │ 0.16034 │ 0.194609 │"
]
},
{
"data": {
"text/html": [
""
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"plosfocus = alldata[(alldata[:Dataset] .== \"bio\")&\n",
" ((alldata[:Journal] .== String(\"PLoS Biol.\"))|\n",
" (alldata[:Journal] .== String(\"PLoS Comput. Biol.\"))), :]\n",
"\n",
"df = by(plosfocus, [:Journal, :Position]) do df\n",
" (m, ci) = bootstrap(df[:Pfemale], 1000)\n",
" return DataFrame(Mean=m, Lower=ci[1], Upper=ci[2])\n",
"end\n",
"\n",
"sort!(df, cols=:Position, lt=authororder)\n",
"show(df)\n",
"ys = hcat([df[df[:Journal] .== x, :Mean] for x in [\"PLoS Biol.\", \"PLoS Comput. Biol.\"]]...)\n",
"\n",
"groupedbar(ys, bar_position=:dodge, lab=[\"PLoS Bio\", \"PLoS Comp Bio\"],\n",
" xaxis=(\"Author Position\", (1:5, [\"first\", \"second\", \"other\", \"penultimate\", \"last\"])),\n",
" yaxis=(\"Percent Female\", (0, 0.6), 0:0.1:0.6),\n",
" grid=false,\n",
" title=\"Proportion of Female Authors in PLoS Journals\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Alternatively, we can look at all the journals and see where each one falls in terms of the proportion of female authors. Here, I'm plotting all of the journals with more than 1000 authors (each author is a row, so I can just count the rows for a particular journal), and coloring them grey if they have \"comput\", \"omic\", \"informatic\", or \"system\" in their title."
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
""
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"bio = alldata[alldata[:Dataset] .== \"bio\", :]\n",
"c = countmap(bio[:Journal])\n",
"\n",
"journals = bio[map(x -> c[x] > 1000, bio[:Journal]), :]\n",
"\n",
"function checktitle(t::String)\n",
"    t = lowercase(t) # journal abbreviations are capitalized (e.g. \"Comput.\")\n",
"    for f in [\"comput\", \"omic\", \"informatic\", \"system\"]\n",
"        if contains(t, f)\n",
"            return true\n",
"        end\n",
"    end\n",
"    return false\n",
"end\n",
"\n",
"data = by(journals, :Journal) do df\n",
" m = mean(dropna(df[:Pfemale]))\n",
" return DataFrame(Mean=m)\n",
"end\n",
"\n",
"data[:Color] = [checktitle(x) ? :grey : :black for x in data[:Journal]]\n",
"data = data[!isna(data[:Mean]), :]\n",
"sort!(data, cols=:Mean)\n",
"bar(data[:Mean], color=data[:Color], grid=false, legend=false, bar_edges=false)"
]
},
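{
"cell_type": "markdown",
"metadata": {},
"source": [
"The title check above can also be sketched in current Julia (1.x), where `occursin` replaces `contains` for substring tests. This is only an illustrative sketch: `KEYWORDS` and `iscomputational` are hypothetical names, and titles are lowercased before matching so that abbreviations such as \"Comput.\" are caught.\n",
"\n",
"```julia\n",
"# Hypothetical helper: true if the journal title contains any keyword, ignoring case.\n",
"const KEYWORDS = [\"comput\", \"omic\", \"informatic\", \"system\"]\n",
"\n",
"iscomputational(title::AbstractString) = any(k -> occursin(k, lowercase(title)), KEYWORDS)\n",
"\n",
"# a few journal abbreviations that appear in this notebook\n",
"flags = [iscomputational(t) for t in [\"PLoS Comput. Biol.\", \"PLoS Biol.\", \"Mutat. Res.\"]]\n",
"```\n",
"\n",
"Plain substring matching is deliberately loose: \"omic\" also matches a title containing \"economic\", so the flagged set is worth a quick manual skim."
]
},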
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Subsetting Conclusions\n",
"\n",
"With a few exceptions, each of these subsets shows a similar trend: women are less likely to be authors in computational biology publications."
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"## Data from the arXiv\n",
"\n",
"Previous work suggests that women are far less likely to publish in computer science. Unfortunately, PubMed doesn't index computer science research.\n",
"\n",
"The arXiv has preprints in many fields, including computer science. The sorts of papers posted there are likely to be different, so we can't compare directly to the stuff on PubMed, but there's also quantitative biology..."
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"collapsed": false
},
"outputs": [
{
"ename": "LoadError",
"evalue": "SystemError: opening file ../data/pubdata/arxivcs.csv: No such file or directory",
"output_type": "error",
"traceback": [
"SystemError: opening file ../data/pubdata/arxivcs.csv: No such file or directory",
"",
" in #systemerror#51 at ./error.jl:34 [inlined]",
" in systemerror(::String, ::Bool) at ./error.jl:34",
" in open(::String, ::Bool, ::Bool, ::Bool, ::Bool, ::Bool) at ./iostream.jl:89",
" in open(::String, ::String) at ./iostream.jl:101",
" in #readtable#79(::Bool, ::Char, ::Array{Char,1}, ::Char, ::Array{String,1}, ::Array{String,1}, ::Array{String,1}, ::Bool, ::Int64, ::Array{Symbol,1}, ::Array{Any,1}, ::Bool, ::Char, ::Bool, ::Int64, ::Array{Int64,1}, ::Bool, ::Symbol, ::Bool, ::Bool, ::DataFrames.#readtable, ::String) at /Users/ksb/.julia/v0.5/DataFrames/src/dataframe/io.jl:941",
" in readtable(::String) at /Users/ksb/.julia/v0.5/DataFrames/src/dataframe/io.jl:930",
" in importauthors(::String, ::String) at /Users/ksb/computation/science/gender-comp-bio/src/dataimport.jl:5"
]
}
],
"source": [
"\n",
"arxivcs = importauthors(\"../data/pubdata/arxivcs.csv\", \"arxivcs\")\n",
"arxivbio = importauthors(\"../data/pubdata/arxivbio.csv\", \"arxivbio\")\n",
"\n",
"arxiv = vcat(arxivbio, arxivcs)\n",
"arxivcs = 0\n",
"arxivbio = 0\n",
"\n",
"pool!(arxiv)\n",
"arxiv = arxiv[!isna(arxiv[:Author_Name]), :]\n",
"\n",
"arxiv[:Pfemale], arxiv[:Count] = getgenderprob(arxiv, \"../data/genders/genderAPI_genders.json\", :Author_Name)\n",
"\n",
"\n",
"arxivbyposition = bystats(arxiv, [:Dataset, :Position])"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"image/png": "",
"image/svg+xml": [
"\n",
"\n"
],
"text/html": [
"\n",
"\n"
],
"text/plain": [
"Plot(...)"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"plot(arxivbyposition, x=:Position, y=:Mean, color=:Dataset,\n",
" Scale.color_discrete_manual(my_colors...),\n",
" Guide.title(\"Female Authors in arXiv\"),\n",
" Geom.bar(position = :dodge),\n",
" Scale.x_discrete(levels=[\"first\", \"second\", \"other\", \"penultimate\", \"last\"]),\n",
" Theme(bar_spacing=2mm))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Julia 0.5.1-pre",
"language": "julia",
"name": "julia-0.5"
},
"language_info": {
"file_extension": ".jl",
"mimetype": "application/julia",
"name": "julia",
"version": "0.5.1"
}
},
"nbformat": 4,
"nbformat_minor": 0
}