{ "cells": [ { "cell_type": "markdown", "metadata": { "kernel": "SoS" }, "source": [ "# Clean Bibliography\n", "\n", "The goal of this notebook is to clean your `.bib` file so that it contains only the references cited in your paper, with each author's full first name. The full first names will then be used to query the probabilistic gender classifier, [Gender API](https://gender-api.com), and to estimate probabilistic race using the [ethnicolr package](https://ethnicolr.readthedocs.io/).\n", "\n", "The only file you need is your manuscript's bibliography in `.bib` format. __Your `.bib` must contain only references cited in the manuscript__. Otherwise, the estimated proportions will be inaccurate.\n", "\n", "If you intend to analyze the reference list of a published paper instead of your own manuscript in progress, search for the paper on [Web of Knowledge](http://apps.webofknowledge.com/) (you will need institutional access). Next, [download the .bib file from Web of Science following these instructions, but start from Step 4 and on Step 6 select BibTeX instead of Plain Text](https://github.com/jdwor/gendercitation/blob/master/Step0_PullingWOSdata.pdf).\n", "\n", "If you are not using LaTeX, collect and organize only the references you have cited in your manuscript using your reference manager of choice (e.g. Mendeley, Zotero, EndNote, ReadCube, etc.) and export that selected bibliography as a `.bib` file. __Please try to export your .bib in an output style that uses full first names (rather than only first initials) and full author lists (rather than abbreviated author lists with \"et al.\").__ If only first initials are included, our code will automatically retrieve about 70% of the full names using the article title or DOI. 
\n", "\n", " * [Export `.bib` from Mendeley](https://blog.mendeley.com/2011/10/25/howto-use-mendeley-to-create-citations-using-latex-and-bibtex/)\n", " * [Export `.bib` from Zotero](https://libguides.mit.edu/ld.php?content_id=34248570)\n", " * [Export `.bib` from EndNote](https://www.reed.edu/cis/help/LaTeX/EndNote.html). Note: Please export full first names by either [choosing an output style that does so by default (e.g. in MLA style)](https://canterbury.libguides.com/endnote/basics-output) or by [customizing an output style.](http://bibliotek.usn.no/cite-and-write/endnote/how-to-use/how-to-show-the-author-s-full-name-in-the-reference-list-article185897-28181.html)\n", " * [Export `.bib` from Read Cube Papers](https://support.papersapp.com/support/solutions/articles/30000024634-how-can-i-export-references-from-readcube-papers-)\n", "\n", "For those working in LaTeX, we can use an optional `.aux` file to automatically filter your `.bib` and check that it only contains entries cited in your manuscript.\n", "\n", "| Input | Output |\n", "|-----------------------|-------------------------------------------------------------------------------------------------------------------------------|\n", "| `.bib` file(s) **(REQUIRED)** | `cleanedBib.csv`: table of author first names, titles, and .bib keys |\n", "| `.aux` file (OPTIONAL) | `predictions.csv`: table of author first names, estimated gender classification, and confidence |\n", "| `.tex` file (OPTIONAL) | `race_gender_citations.pdf`: heat map of your citations broken down by probabilistic gender and race estimations |\n", "| | `yourTexFile_gendercolor.tex`: your `.tex` file modified to compile a .pdf with in-line citations color-coded by gender pairs |\n", "\n", "## 1. 
Import functions\n", "\n", "Upload your `.bib` file(s) and _optionally_ an `.aux` file generated from compiling your LaTeX manuscript and your `.tex` file\n", "\n", "![upload button](img/upload.png)\n", "\n", "![confirm upload button](img/confirmUpload.png)\n", "\n", "Then, run the code block below. (click to select the block and then press Ctrl+Enter; or click the block and press the Run button in the top menubar)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "Python 3" }, "outputs": [], "source": [ "import numpy as np\n", "import bibtexparser\n", "from bibtexparser.bparser import BibTexParser\n", "import glob\n", "import subprocess\n", "import os\n", "from pybtex.database.input import bibtex\n", "import csv\n", "from pylatexenc.latex2text import LatexNodes2Text \n", "import unicodedata\n", "import re\n", "import pandas as pd\n", "from habanero import Crossref\n", "import string\n", "from time import sleep\n", "import tqdm\n", "import matplotlib.pylab as plt\n", "import matplotlib.gridspec as gridspec\n", "import json\n", "import pickle\n", "from urllib.request import urlopen\n", "from ethnicolr import census_ln, pred_census_ln,pred_wiki_name\n", "from pybtex.database import parse_file\n", "import seaborn as sns\n", "\n", "\n", "def checkcites_output(aux_file):\n", " '''take in aux file for tex document, return list of citation keys\n", " that are in .bib file but not in document'''\n", "\n", " result = subprocess.run(['texlua', 'checkcites.lua', aux_file[0]], stdout=subprocess.PIPE)\n", " result = result.stdout.decode('utf-8')\n", " unused_array_raw = result.split('\\n')\n", " # process array of unused references + other output \n", " unused_array_final = list()\n", " for x in unused_array_raw:\n", " if len(x) > 0: # if line is not empty\n", " if x[0] == '-': # and if first character is a '-', it's a citation key\n", " unused_array_final.append(x[2:]) # truncate '- ' \n", " if 
\"------------------------------------------------------------------------\" in unused_array_final:\n", " return(result)\n", " else:\n", " return(unused_array_final)\n", "\n", "\n", "def removeMiddleName(line):\n", " arr = line.split()\n", " last = arr.pop()\n", " n = len(arr)\n", " if n == 4:\n", " first, middle = ' '.join(arr[:2]), ' '.join(arr[2:])\n", " elif n == 3:\n", " first, middle = arr[0], ' '.join(arr[1:])\n", " elif n == 2:\n", " first, middle = arr\n", " elif n==1:\n", " return line\n", " return(str(first + ' ' + middle))\n", "\n", "\n", "def returnFirstName(line):\n", " arr = line.split()\n", " n = len(arr)\n", " if n == 4:\n", " first, middle = ' '.join(arr[:2]), ' '.join(arr[2:])\n", " elif n == 3:\n", " first, middle = arr[0], ' '.join(arr[1:])\n", " elif n == 2:\n", " first, middle = arr\n", " elif n==1:\n", " return line\n", " return(str(middle))\n", "\n", "\n", "def convertLatexSpecialChars(latex_text):\n", " return LatexNodes2Text().latex_to_text(latex_text)\n", "\n", "\n", "def convertSpecialCharsToUTF8(text):\n", " data = LatexNodes2Text().latex_to_text(text)\n", " return unicodedata.normalize('NFD', data).encode('ascii', 'ignore').decode('utf-8')\n", "\n", "\n", "def namesFromXref(doi, title, authorPos):\n", " '''Use DOI and article titles to query Crossref for author list'''\n", " if authorPos == 'first':\n", " idx = 0\n", " elif authorPos == 'last':\n", " idx = -1\n", " # get cross ref data\n", " authors = ['']\n", " # first try DOI\n", " if doi != \"\":\n", " works = cr.works(query = title, select = [\"DOI\",\"author\"], limit=1, filter = {'doi': doi})\n", " if works['message']['total-results'] > 0:\n", " authors = works['message']['items'][0]['author']\n", " elif title != '': \n", " works = cr.works(query = f'title:\"{title}\"', select = [\"title\",\"author\"], limit=10)\n", " cnt = 0\n", " name = ''\n", " # check that you grabbed the proper paper\n", " if works['message']['items'][cnt]['title'][0].lower() == title.lower():\n", " authors 
= works['message']['items'][0]['author']\n", "\n", " # check the all fields are available\n", " if not 'given' in authors[idx]:\n", " name = ''\n", " else:\n", " # trim initials\n", " name = authors[idx]['given'].replace('.',' ').split()[0]\n", "\n", " return name\n", "\n", "\n", "def namesFromXrefSelfCite(doi, title):\n", " selfCiteCheck = 0\n", " # get cross ref data\n", " authors = ['']\n", " # first try DOI\n", " if doi != \"\":\n", " works = cr.works(query = title, select = [\"DOI\",\"author\"], limit=1, filter = {'doi': doi})\n", " if works['message']['total-results'] > 0:\n", " authors = works['message']['items'][0]['author']\n", " \n", " for i in authors:\n", " if i != \"\":\n", " first = i['given'].replace('.',' ').split()[0]\n", " last = i['family'].replace('.',' ').split()[0]\n", " authors = removeMiddleName(last + \", \" + first)\n", " if authors in removeMiddleName(yourFirstAuthor) or authors in removeMiddleName(convertSpecialCharsToUTF8(yourFirstAuthor)) or authors in removeMiddleName(yourLastAuthor) or authors in removeMiddleName(convertSpecialCharsToUTF8(yourLastAuthor)):\n", " selfCiteCheck += 1\n", " return selfCiteCheck\n", "\n", "\n", "cr = Crossref()\n", "homedir = '/home/jovyan/'\n", "bib_files = glob.glob(homedir + '*.bib')\n", "paper_aux_file = glob.glob(homedir + '*.aux')\n", "paper_bib_file = 'library_paper.bib'\n", "try:\n", " tex_file = glob.glob(homedir + \"*.tex\")[0]\n", "except:\n", " print('No optional .tex file found.')" ] }, { "cell_type": "markdown", "metadata": { "kernel": "SoS" }, "source": [ "### 2. 
Define the _first_ and _last_ author of your paper.\n", "\n", "For example: \n", "```\n", "yourFirstAuthor = 'Teich, Erin G.'\n", "yourLastAuthor = 'Bassett, Danielle S.'\n", "```\n", "\n", "And optionally, define any co-first or co-last author(s), making sure to keep the square brackets to define a list.\n", "\n", "For example:\n", "```\n", "optionalEqualContributors = ['Dworkin, Jordan', 'Stiso, Jennifer']\n", "```\n", "\n", "or \n", "\n", "```\n", "optionalEqualContributors = ['Dworkin, Jordan']\n", "```\n", "\n", "If you are analyzing published papers' reference lists from Web of Science, change the variable checkingPublishedArticle to True:\n", "```\n", "checkingPublishedArticle = True\n", "```\n", "\n", "Then, run the code block below. (click to select the block and then press Ctrl+Enter; or click the block and press the Run button in the top menubar)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "Python 3" }, "outputs": [], "source": [ "yourFirstAuthor = 'LastName, FirstName OptionalMiddleInitial'\n", "yourLastAuthor = 'LastName, FirstName OptionalMiddleInitial'\n", "optionalEqualContributors = ['LastName, FirstName OptionalMiddleInitial', 'LastName, FirstName OptionalMiddleInitial']\n", "checkingPublishedArticle = False\n", "\n", "if (yourFirstAuthor == 'LastName, FirstName OptionalMiddleInitial') or (yourLastAuthor == 'LastName, FirstName OptionalMiddleInitial'):\n", " raise ValueError(\"Please enter your manuscript's first and last author names\")\n", "\n", "if paper_aux_file:\n", " if optionalEqualContributors == ['LastName, FirstName OptionalMiddleInitial', 'LastName, FirstName OptionalMiddleInitial']:\n", " citing_authors = np.array([yourFirstAuthor, yourLastAuthor])\n", " else:\n", " citing_authors = np.array([yourFirstAuthor, yourLastAuthor, optionalEqualContributors])\n", " print(checkcites_output(paper_aux_file))\n", " unused_in_paper = checkcites_output(paper_aux_file) # get citations in library not used in 
paper\n", " print(\"Unused citations: \", unused_in_paper.count('=>'))\n", " \n", " \n", " parser = BibTexParser()\n", " parser.ignore_nonstandard_types = False\n", " parser.common_strings = True\n", " \n", " bib_data = None\n", " for bib_file in bib_files:\n", " with open(bib_file) as bibtex_file:\n", " if bib_data is None:\n", " bib_data = bibtexparser.bparser.BibTexParser(common_strings=True, ignore_nonstandard_types=False).parse_file(bibtex_file)\n", " else:\n", " bib_data_extra = bibtexparser.bparser.BibTexParser(common_strings=True, ignore_nonstandard_types=False).parse_file(bibtex_file)\n", " bib_data.entries_dict.update(bib_data_extra.entries_dict)\n", " bib_data.entries.extend(bib_data_extra.entries)\n", " \n", " all_library_citations = list(bib_data.entries_dict.keys())\n", " print(\"All citations: \", len(all_library_citations))\n", " \n", " for k in all_library_citations:\n", " if re.search('\\\\b'+ k + '\\\\b', unused_in_paper.replace('\\n',' ').replace('=>',' ')) != None:\n", " del bib_data.entries_dict[k] # remove from entries dictionary if not in paper\n", " \n", " in_paper_mask = [re.search('\\\\b'+ bib_data.entries[x]['ID'] + '\\\\b', unused_in_paper.replace('\\n',' ').replace('=>',' ')) == None for x in range(len(bib_data.entries))]\n", " bib_data.entries = [bib_data.entries[x] for x in np.where(in_paper_mask)[0]] # replace entries list with entries only in paper\n", " del bib_data.comments\n", " \n", " duplicates = []\n", " for key in bib_data.entries_dict.keys():\n", " count = str(bib_data.entries).count(\"'ID\\': \\'\"+ key + \"\\'\")\n", " if count > 1:\n", " duplicates.append(key)\n", " \n", " if len(duplicates) > 0:\n", " raise ValueError(\"In your .bib file, please remove duplicate entries or duplicate entry ID keys for:\", ' '.join(map(str, duplicates)))\n", "\n", " if os.path.exists(paper_bib_file):\n", " os.remove(paper_bib_file)\n", " \n", " with open(paper_bib_file, 'w') as bibtex_file:\n", " bibtexparser.dump(bib_data, 
bibtex_file)\n", " \n", " # define first author and last author names of citing paper -- will exclude citations of these authors\n", " # beware of latex symbols within author names\n", " # in_paper_citations = list(bib_data.entries_dict.keys())\n", " in_paper_citations = [bib_data.entries[x]['ID'] for x in range(len(bib_data.entries))] # get list of citation keys in paper\n", " \n", " # extract author list for every cited paper\n", " cited_authors = [bib_data.entries_dict[x]['author'] for x in in_paper_citations]\n", " # find citing authors in cited author list\n", " # using nested list comprehension, make a citing author -by- citation array of inclusion\n", " self_cite_mask = np.array([[str(citing_author) in authors for authors in cited_authors] for citing_author in citing_authors])\n", " self_cite_mask = np.any(self_cite_mask,axis=0) # collapse across citing authors such that any coauthorship by either citing author -> exclusion\n", " \n", " print(\"Self-citations: \", [bib_data.entries[x]['ID'] for x in np.where(self_cite_mask)[0]]) # print self citations\n", " for idx,k in enumerate(in_paper_citations):\n", " if self_cite_mask[idx]:\n", " del bib_data.entries_dict[k] # delete citation from dictionary if self citation\n", " bib_data.entries = [bib_data.entries[x] for x in np.where(np.invert(self_cite_mask))[0]] # replace entries list with entries that aren't self citations\n", " \n", " paper_bib_file_excl_sc = os.path.splitext(paper_bib_file)[0] + '_noselfcite.bib'\n", " \n", " if os.path.exists(paper_bib_file_excl_sc):\n", " os.remove(paper_bib_file_excl_sc)\n", " \n", " with open(paper_bib_file_excl_sc, 'w') as bibtex_file:\n", " bibtexparser.dump(bib_data, bibtex_file)\n", " \n", "if glob.glob(homedir + '*_noselfcite.bib'): # os.path.exists() does not expand wildcards, so use glob to find the filtered .bib\n", " ID = glob.glob(homedir + '*_noselfcite.bib')\n", "else:\n", " ID = glob.glob(homedir + '*bib')\n", " with open(ID[0]) as bibtex_file:\n", " bib_data = bibtexparser.bparser.BibTexParser(common_strings=True, 
ignore_nonstandard_types=False).parse_file(bibtex_file)\n", " duplicates = []\n", " for key in bib_data.entries_dict.keys():\n", " count = str(bib_data.entries).count(\"'ID\\': \\'\"+ key + \"\\'\")\n", " if count > 1:\n", " duplicates.append(key)\n", " \n", " if len(duplicates) > 0:\n", " raise ValueError(\"In your .bib file, please remove duplicate entries or duplicate entry ID keys for:\", ' '.join(map(str, duplicates)))\n", "\n", "if checkingPublishedArticle == True:\n", " FA = []\n", " LA = []\n", " counter = 1\n", " selfCiteCount = 0\n", " titleCount = 1 # \n", " counterNoDOI = list() # row index (titleCount) of entries with no DOI\n", " outPath = homedir + 'cleanedBib.csv'\n", "\n", " if os.path.exists(outPath):\n", " os.remove(outPath)\n", "\n", " with open(outPath, 'w', newline='') as csvfile:\n", " writer = csv.writer(csvfile, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)\n", " writer.writerow(['Article', 'FA', 'LA', 'Title', 'SelfCite', 'CitationKey'])\n", " \n", " citedArticleDOI = list()\n", " citedArticleNoDOI = list()\n", " allArticles = list()\n", " for entry in bib_data.entries:\n", " my_string= entry['cited-references'].split('\\n')\n", " for citedArticle in my_string:\n", " allArticles.append(citedArticle)\n", " if citedArticle.partition(\"DOI \")[-1]=='':\n", " citedArticleNoDOI.append(citedArticle)\n", " counterNoDOI.append(titleCount)\n", " else:\n", " line = citedArticle.partition(\"DOI \")[-1].replace(\"DOI \",\"\").rstrip(\".\")\n", " line = ''.join( c for c in line if c not in '{[}] ')\n", " if \",\" in line:\n", " line = line.partition(\",\")[-1]\n", " citedArticleDOI.append(line)\n", " with open('citedArticlesDOI.csv', 'a', newline='') as csvfile:\n", " writer = csv.writer(csvfile, delimiter=',')\n", " writer.writerow([line])\n", " titleCount += 1\n", "\n", " articleNum = 0\n", " for doi in citedArticleDOI:\n", " try:\n", " FA = namesFromXref(doi, '', 'first')\n", " except UnboundLocalError:\n", " sleep(1)\n", " continue\n", 
"\n", " try:\n", " LA = namesFromXref(doi, '', 'last')\n", " except UnboundLocalError:\n", " sleep(1)\n", " continue\n", "\n", " try:\n", " selfCiteCount = namesFromXrefSelfCite(doi, '')\n", " except UnboundLocalError:\n", " sleep(1)\n", " continue\n", "\n", " with open(outPath, 'a', newline='') as csvfile: \n", " if selfCiteCount == 0:\n", " writer = csv.writer(csvfile, delimiter=',')\n", " getArticleIndex = [i for i, s in enumerate(allArticles) if doi in s]\n", " writer.writerow([counter, convertSpecialCharsToUTF8(FA), convertSpecialCharsToUTF8(LA), allArticles[[i for i, s in enumerate(allArticles) if doi in s][0]], '', ''])\n", " print(str(counter) + \": \" + doi )\n", " counter += 1\n", " else:\n", " print(str(articleNum) + \": \" + doi + \"\\t\\t\\t <-- self-citation\" )\n", " articleNum += 1\n", "\n", " if len(citedArticleNoDOI)>0:\n", " print()\n", " for elem in citedArticleNoDOI:\n", " with open(outPath, 'a', newline='') as csvfile: \n", " writer = csv.writer(csvfile, delimiter=',')\n", " writer.writerow([counter, '', '', elem, '', ''])\n", " print(str(counter) + \": \" + elem )\n", " counter += 1\n", " print()\n", " raise ValueError(\"WARNING: No article DOI was provided for the last \" + str(len(citedArticleNoDOI)) + \" listed papers. Please manually search for these articles. IF AND ONLY IF your citing paper's first and last author are not co-authors in the paper that was cited, enter the first name of the first and last authors of the paper that was cited manually. 
Then, continue to the next code block.\")\n", "else:\n", " FA = []\n", " LA = []\n", " parser = bibtex.Parser()\n", " bib_data = parser.parse_file(ID[0])\n", " counter = 1\n", " nameCount = 0\n", " outPath = homedir + 'cleanedBib.csv'\n", "\n", " if os.path.exists(outPath):\n", " os.remove(outPath)\n", "\n", " with open(outPath, 'w', newline='') as csvfile:\n", " writer = csv.writer(csvfile, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)\n", " writer.writerow(['Article', 'FA', 'LA', 'Title', 'SelfCite', 'CitationKey'])\n", "\n", " for key in bib_data.entries.keys():\n", " diversity_bib_titles = ['The extent and drivers of gender imbalance in neuroscience reference lists','The gender citation gap in international relations','Quantitative evaluation of gender bias in astronomical publications from citation counts', '\\# CommunicationSoWhite', '{Just Ideas? The Status and Future of Publication Ethics in Philosophy: A White Paper}','Gendered citation patterns across political science and social science methodology fields','Gender Diversity Statement and Code Notebook v1.0']\n", " if bib_data.entries[key].fields['title'] in diversity_bib_titles:\n", " continue\n", "\n", " try:\n", " author = bib_data.entries[key].persons['author']\n", " except:\n", " author = bib_data.entries[key].persons['editor']\n", " FA = author[0].rich_first_names\n", " LA = author[-1].rich_first_names\n", " FA = convertLatexSpecialChars(str(FA)[7:-3]).translate(str.maketrans('', '', string.punctuation)).replace('Protected',\"\").replace(\" \",'')\n", " LA = convertLatexSpecialChars(str(LA)[7:-3]).translate(str.maketrans('', '', string.punctuation)).replace('Protected',\"\").replace(\" \",'')\n", "\n", " # check that we got a name (not an initial) from the bib file, if not try using the title in the crossref API\n", " try:\n", " title = bib_data.entries[key].fields['title'].replace(',', '').replace(',', '').replace('{','').replace('}','')\n", " except:\n", " title = ''\n", " try:\n", " 
doi = bib_data.entries[key].fields['doi']\n", " except:\n", " doi = ''\n", " if FA == '' or len(FA.split('.')[0]) <= 1:\n", " while True:\n", " try:\n", " FA = namesFromXref(doi, title, 'first')\n", " except UnboundLocalError:\n", " sleep(1)\n", " continue\n", " break\n", " if LA == '' or len(LA.split('.')[0]) <= 1:\n", " while True:\n", " try:\n", " LA = namesFromXref(doi, title, 'last')\n", " except UnboundLocalError:\n", " sleep(1)\n", " continue\n", " break\n", "\n", " if (yourFirstAuthor!='LastName, FirstName OptionalMiddleInitial') and (yourLastAuthor!='LastName, FirstName OptionalMiddleInitial'):\n", " selfCiteCheck1 = [s for s in author if removeMiddleName(yourLastAuthor) in str([convertLatexSpecialChars(str(s.rich_last_names)[7:-3]).replace(\"', Protected('\",\"\").replace(\"'), '\", \"\"), convertLatexSpecialChars(str(s.rich_first_names)[7:-3]).replace(\"', Protected('\",\"\").replace(\"'), '\", \"\")]).replace(\"'\", \"\")]\n", " selfCiteCheck1a = [s for s in author if removeMiddleName(yourLastAuthor) in str([convertSpecialCharsToUTF8(str(s.rich_last_names)[7:-3]).replace(\"', Protected('\",\"\").replace(\"'), '\", \"\"), convertSpecialCharsToUTF8(str(s.rich_first_names)[7:-3]).replace(\"', Protected('\",\"\").replace(\"'), '\", \"\")]).replace(\"'\", \"\")]\n", " selfCiteCheck1b = [s for s in author if removeMiddleName(yourLastAuthor) in str([convertSpecialCharsToUTF8(str(s.rich_last_names)[7:-3]).replace(\"', Protected('\",\"\").replace(\"'), '\", \"\"), LA]).replace(\"'\", \"\")]\n", "\n", " selfCiteCheck2 = [s for s in author if removeMiddleName(yourFirstAuthor) in str([convertLatexSpecialChars(str(s.rich_last_names)[7:-3]).replace(\"', Protected('\",\"\").replace(\"'), '\", \"\"), convertLatexSpecialChars(str(s.rich_first_names)[7:-3]).replace(\"', Protected('\",\"\").replace(\"'), '\", \"\")]).replace(\"'\", \"\")]\n", " selfCiteCheck2a = [s for s in author if removeMiddleName(yourFirstAuthor) in 
str([convertSpecialCharsToUTF8(str(s.rich_last_names)[7:-3]).replace(\"', Protected('\",\"\").replace(\"'), '\", \"\"), convertSpecialCharsToUTF8(str(s.rich_first_names)[7:-3]).replace(\"', Protected('\",\"\").replace(\"'), '\", \"\")]).replace(\"'\", \"\")]\n", " selfCiteCheck2b = [s for s in author if removeMiddleName(yourFirstAuthor) in str([convertSpecialCharsToUTF8(str(s.rich_last_names)[7:-3]).replace(\"', Protected('\",\"\").replace(\"'), '\", \"\"), FA]).replace(\"'\", \"\")]\n", "\n", " nameCount = 0\n", " if optionalEqualContributors != ('LastName, FirstName OptionalMiddleInitial', 'LastName, FirstName OptionalMiddleInitial'):\n", " for name in optionalEqualContributors:\n", " selfCiteCheck3 = [s for s in author if removeMiddleName(name) in str([convertLatexSpecialChars(str(s.rich_last_names)[7:-3]).replace(\"', Protected('\",\"\").replace(\"'), '\", \"\"), convertLatexSpecialChars(str(s.rich_first_names)[7:-3]).replace(\"', Protected('\",\"\").replace(\"'), '\", \"\")]).replace(\"'\", \"\")]\n", " selfCiteCheck3a = [s for s in author if removeMiddleName(name) in str([convertSpecialCharsToUTF8(str(s.rich_last_names)[7:-3]).replace(\"', Protected('\",\"\").replace(\"'), '\", \"\"), convertSpecialCharsToUTF8(str(s.rich_first_names)[7:-3]).replace(\"', Protected('\",\"\").replace(\"'), '\", \"\")]).replace(\"'\", \"\")]\n", " if len(selfCiteCheck3)>0:\n", " nameCount += 1\n", " if len(selfCiteCheck3a)>0:\n", " nameCount += 1\n", " selfCiteChecks = [selfCiteCheck1, selfCiteCheck1a, selfCiteCheck1b, selfCiteCheck2, selfCiteCheck2a, selfCiteCheck2b]\n", " if sum([len(check) for check in selfCiteChecks]) + nameCount > 0:\n", " selfCite = 'Y'\n", " if len(FA) < 2:\n", " print(str(counter) + \": \" + key + \"\\t\\t <-- self-citation <-- ***NAME MISSING OR POSSIBLY INCOMPLETE***\")\n", " else:\n", " print(str(counter) + \": \" + key + \" <-- self-citation\")\n", " else:\n", " selfCite= 'N'\n", " if len(FA) < 2:\n", " print(str(counter) + \": \" + key + \"\\t\\t <-- 
***NAME MISSING OR POSSIBLY INCOMPLETE***\")\n", " else:\n", " print(str(counter) + \": \" + key)\n", " else:\n", " selfCite = 'NA'\n", "\n", " with open(outPath, 'a', newline='') as csvfile:\n", " writer = csv.writer(csvfile, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)\n", " writer.writerow([counter, convertSpecialCharsToUTF8(FA), convertSpecialCharsToUTF8(LA), title, selfCite, key])\n", " counter += 1" ] }, { "cell_type": "markdown", "metadata": { "kernel": "SoS" }, "source": [ "## 3. Estimate gender and race of authors from cleaned bibliography\n", "\n", "### Checkpoint for cleaned bibliography and using Gender API to estimate genders by first names\n", "After registering for a [gender-api](https://gender-api.com/) (free account available), use your 500 free monthly search credits by __pasting your API key in the code for the line indicated below__ (replace only YOUR ACCOUNT KEY HERE):\n", "\n", "```genderAPI_key <- '&key=YOUR ACCOUNT KEY HERE'```\n", "\n", "[You can find your key in your account's profile page.](https://gender-api.com/en/account/overview#my-api-key)\n", "\n", "__NOTE__: Please edit your .bib file using the information printed by the code and provided in cleanedBib.csv. Edit directly within the Binder environment by clicking the `Edit` button, making modifications, and saving the file.\n", "\n", "![edit button](img/manualEdit.png)\n", "\n", "Common issues include: \n", "\n", "* The bibliography entry did not include a last author because the author list was truncated by \"and Others\" or \"et al.\" \n", "* Some older journal articles provide only first initials rather than full first names, in which case you will need to go digging via Google to identify that person. \n", "* In rare cases where the author cannot be identified even after searching by hand, replace the first name with \"UNKNOWNNAME\" so that the classifier will estimate the gender as unknown. \n", "\n", "__NOTE__: your free account has 500 queries per month. 
This box contains the code that will use your limited API credits/queries if it runs without error. Re-running all code repeatedly will repeatedly use these credits.\n", "\n", "Then, run the code blocks below. (click to select the block and then press Ctrl+Enter; or click the block and press the Run button in the top menubar)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "R" }, "outputs": [], "source": [ "genderAPI_key <- '&key=YOUR ACCOUNT KEY HERE'\n", "\n", "fileConn<-file(\"genderAPIkey.txt\")\n", "writeLines(c(genderAPI_key), fileConn)\n", "close(fileConn)\n", "\n", "names=read.csv(\"/home/jovyan/cleanedBib.csv\",stringsAsFactors=F)\n", "setwd('/home/jovyan/')\n", "\n", "require(rjson)\n", "gendFA=NULL;gendLA=NULL\n", "gendFA_conf=NULL;gendLA_conf=NULL\n", "\n", "namesIncompleteFA=NULL\n", "namesIncompleteLA=NULL\n", "incompleteKeys=list()\n", "incompleteRows=list()\n", "\n", "for(i in 1:nrow(names)){\n", " if (nchar(names$FA[i])<2 || grepl(\"\\\\.\", names$FA[i])){\n", " namesIncompleteFA[i] = i+1\n", " incompleteKeys = c(incompleteKeys, names$CitationKey[i])\n", " incompleteRows = c(incompleteRows, i+1)\n", " }\n", " namesIncompleteFA = namesIncompleteFA[!is.na(namesIncompleteFA)]\n", " \n", " if (nchar(names$LA[i])<2 || grepl(\"\\\\.\", names$LA[i])){\n", " namesIncompleteLA[i] = i+1\n", " incompleteKeys = c(incompleteKeys, names$CitationKey[i])\n", " incompleteRows = c(incompleteRows, i+1)\n", " }\n", " namesIncompleteLA = namesIncompleteLA[!is.na(namesIncompleteLA)]\n", "}\n", "\n", "write.table(incompleteKeys[2:length(incompleteKeys)], \"incompleteKeys.csv\", sep=\",\", col.names=FALSE)\n", "write.table(incompleteRows[2:length(incompleteRows)], \"incompleteRows.csv\", sep=\",\", col.names=FALSE)\n", "\n", "if (length(names$CitationKey[which(names$SelfCite==\"Y\")]>0)){\n", " print(paste(\"STOP: Please remove self-citations by searching for the following citation keys in your .bib file: \"))\n", " 
print(paste(names$CitationKey[which(names$SelfCite==\"Y\")]))\n", "}\n", "\n", "if (length(namesIncompleteFA)>0 || length(namesIncompleteLA)>0){\n", " print(paste(\"STOP: Please revise incomplete full first names or empty cells by searching for the following citation keys in your .bib file: \"))\n", " print(paste(incompleteKeys))\n", " print(paste(\"Do not continue without revising the incomplete names in the citations of your .bib file as indicated above. For more info, see rows\", paste(unique(c(namesIncompleteFA, namesIncompleteLA))), \"of cleanedBib.csv\"))\n", "}" ] }, { "cell_type": "markdown", "metadata": { "kernel": "SoS" }, "source": [ "## 4. Describe the proportions of genders in your reference list and compare it to published base rates in neuroscience.\n", "\n", "__NOTE__: your free GenderAPI account has 500 queries per month. This box contains the code that will use your limited API credits/queries if it runs without error. Re-running all code repeatedly will repeatedly use these credits.\n", "\n", "Run the code blocks below. 
(click to select the block and then press Ctrl+Enter; or click the block and press the Run button in the top menubar)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "Python 3" }, "outputs": [], "source": [ "from ethnicolr import pred_fl_reg_name\n", "f = open(\"genderAPIkey.txt\", \"r\")\n", "genderAPI_key = f.readline().replace('\\n', '')\n", "\n", "import argparse\n", "parser = argparse.ArgumentParser()\n", "parser.add_argument('-bibfile',action='store',dest='bibfile',default=' '.join(bib_files))\n", "parser.add_argument('-homedir',action='store',dest='homedir',default='/home/jovyan/')\n", "parser.add_argument('-authors',action='store',dest='authors', default=(yourFirstAuthor+' '+yourLastAuthor).replace(',',''))\n", "parser.add_argument('-method',action='store',dest='method',default='florida')\n", "parser.add_argument('-font',action='store',dest='font',default='Palatino') # hey, we all have our favorite\n", "parser.add_argument('-gender_key',action='store',dest='gender_key',default=genderAPI_key)\n", "r = parser.parse_args()\n", "locals().update(r.__dict__)\n", "bibfile = parse_file(bibfile)\n", "\n", "\n", "def gender_base():\n", "\t\"\"\"\n", "\tfor unknown gender, fill with base rates\n", "\tyou will never / can't run this (that file is too big to share)\n", "\t\"\"\"\n", "\tmain_df = pd.read_csv('/%s/data/NewArticleData2019.csv'%(homedir),header=0)\n", "\n", "\n", "\tgender_base = {}\n", "\tfor year in np.unique(main_df.PY.values):\n", "\t\tydf = main_df[main_df.PY==year].AG\n", "\t\tfa = np.array([x[0] for x in ydf.values])\n", "\t\tla = np.array([x[1] for x in ydf.values])\n", "\n", "\t\tfa_m = len(fa[fa=='M'])/ len(fa[fa!='U'])\n", "\t\tfa_w = len(fa[fa=='W'])/ len(fa[fa!='U'])\n", "\n", "\t\tla_m = len(la[la=='M'])/ len(la[la!='U'])\n", "\t\tla_w = len(la[la=='W'])/ len(la[la!='U'])\n", "\n", "\t\tgender_base[year] = [fa_m,fa_w,la_m,la_w]\n", "\n", "\tgender_base[2020] = [fa_m,fa_w,la_m,la_w]\n", "\n", "\twith open(homedir + 
'/data/gender_base' + '.pkl', 'wb') as f:\n", "\t\tpickle.dump(gender_base, f, pickle.HIGHEST_PROTOCOL)\n", "\n", "\n", "with open(homedir + 'data/gender_base' + '.pkl', 'rb') as f:\n", "\tgender_base = pickle.load(f)\n", "\n", "authors = authors.split(' ')\n", "print ('first author is %s %s '%(authors[1],authors[0]))\n", "print ('last author is %s %s '%(authors[3],authors[2]))\n", "print (\"we don't count these, but check the predictions file to ensure your names did not slip through!\")\n", "\n", "citation_matrix = np.zeros((8,8))\n", "matrix_idxs = {'white_m':0,'api_m':1,'hispanic_m':2,'black_m':3,'white_f':4,'api_f':5,'hispanic_f':6,'black_f':7}\n", "\n", "asian = [0,1,2]\n", "black = [3,4]\n", "white = [5,6,7,8,9,11,12]\n", "hispanic = [10]\n", "print ('looping through your references, predicting gender and race')\n", "\n", "columns=['CitationKey','Author','Gender','W','A', 'GendCat']\n", "paper_df = pd.DataFrame(columns=columns)\n", "\n", "gender = []\n", "race = []\n", "\n", "\n", "idx = 0\n", "for paper in tqdm.tqdm(bibfile.entries,total=len(bibfile.entries)): \n", "\tif 'author' not in bibfile.entries[paper].persons.keys():\n", "\t\tcontinue #some editorials have no authors\n", "\tif 'year' not in bibfile.entries[paper].fields.keys():\n", "\t\tyear = 2020\n", "\telse: year = int(bibfile.entries[paper].fields['year']) \n", "\t\n", "\tif year not in gender_base.keys():\n", "\t\tgb = gender_base[1995]\n", "\telse:\n", "\t\tgb = gender_base[year]\n", "\n", "\tfa = bibfile.entries[paper].persons['author'][0]\n", "\ttry:fa_fname = fa.first_names[0] \n", "\texcept:fa_fname = fa.last_names[0] #for people like Plato\n", "\tfa_lname = fa.last_names[0] \n", "\n", "\tla = bibfile.entries[paper].persons['author'][-1]\n", "\ttry:la_fname = la.first_names[0] \n", "\texcept:la_fname = la.last_names[0] #for people like Plato\n", "\tla_lname = la.last_names[0]\n", "\n", "\tif fa_fname.lower().strip() == authors[1].lower().strip():\n", "\t\tif fa_lname.lower().strip() == 
authors[0].lower().strip() :\n", "\t\t\tcontinue\n", "\n", "\tif fa_fname.lower().strip() == authors[3].lower().strip() :\n", "\t\tif fa_lname.lower().strip() == authors[2].lower().strip() :\n", "\t\t\tcontinue\n", "\n", "\tif la_fname.lower().strip() == authors[1].lower().strip() :\n", "\t\tif la_lname.lower().strip() == authors[0].lower().strip() :\n", "\t\t\tcontinue\n", "\t\n", "\tif la_fname.lower().strip() == authors[3].lower().strip() :\n", "\t\tif la_lname.lower().strip() == authors[2].lower().strip() :\n", "\t\t\tcontinue\n", "\n", "\tfa_fname = fa_fname.encode(\"ascii\", errors=\"ignore\").decode() \n", "\tfa_lname = fa_lname.encode(\"ascii\", errors=\"ignore\").decode()\n", "\tla_fname = la_fname.encode(\"ascii\", errors=\"ignore\").decode() \n", "\tla_lname = la_lname.encode(\"ascii\", errors=\"ignore\").decode() \n", "\n", "\tnames = [{'lname': fa_lname,'fname':fa_fname}]\n", "\tfa_df = pd.DataFrame(names,columns=['fname','lname'])\n", "\tasian,hispanic,black,white = pred_fl_reg_name(fa_df,'lname','fname').values[0][-4:]\n", "\tfa_race = [white,asian,hispanic,black]\n", "\t\n", "\tnames = [{'lname': la_lname,'fname':la_fname}]\n", "\tla_df = pd.DataFrame(names,columns=['fname','lname'])\n", "\tasian,hispanic,black,white = pred_fl_reg_name(la_df,'lname','fname').values[0][-4:]\n", "\tla_race = [white,asian,hispanic,black]\n", "\n", "\turl = \"https://gender-api.com/get?key=\" + gender_key + \"&name=%s\" %(fa_fname)\n", "\tresponse = urlopen(url)\n", "\tdecoded = response.read().decode('utf-8')\n", "\tfa_gender = json.loads(decoded)\n", "\tif fa_gender['gender'] == 'female':\n", "\t\tfa_g = [0,fa_gender['accuracy']/100.]\n", "\tif fa_gender['gender'] == 'male':\n", "\t\tfa_g = [fa_gender['accuracy']/100.,0]\n", "\tif fa_gender['gender'] == 'unknown':\n", "\t\tfa_g = gb[:2]\n", "\n", "\turl = \"https://gender-api.com/get?key=\" + gender_key + \"&name=%s\" %(la_fname)\n", "\tresponse = urlopen(url)\n", "\tdecoded = response.read().decode('utf-8')\n", 
"\tla_gender = json.loads(decoded)\n", "\tif la_gender['gender'] == 'female':\n", "\t\tla_g = [0,la_gender['accuracy']/100.]\n", "\t\n", "\tif la_gender['gender'] == 'male':\n", "\t\tla_g = [la_gender['accuracy']/100.,0]\n", "\n", "\tif la_gender['gender'] == 'unknown':\n", "\t\tla_g = gb[2:] \n", "\t\n", "\tfa_data = np.array([paper,'%s,%s'%(fa_fname,fa_lname),'%s,%s'%(fa_gender['gender'],fa_gender['accuracy']),fa_race[0],np.sum(fa_race[1:]), '']).reshape(1,6)\n", "\t# DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent\n", "\tpaper_df = pd.concat([paper_df,pd.DataFrame(fa_data,columns=columns)],ignore_index=True)\n", "\tla_data = np.array([paper,'%s,%s'%(la_fname,la_lname),'%s,%s'%(la_gender['gender'],la_gender['accuracy']),la_race[0],np.sum(la_race[1:]), '%s%s' % (fa_gender['gender'], la_gender['gender'])]).reshape(1,6)\n", "\tpaper_df = pd.concat([paper_df,pd.DataFrame(la_data,columns=columns)],ignore_index=True)\n", "\n", "\tmm = fa_g[0]*la_g[0]\n", "\twm = fa_g[1]*la_g[0]\n", "\tmw = fa_g[0]*la_g[1]\n", "\tww = fa_g[1]*la_g[1]\n", "\tmm,wm,mw,ww = [mm,wm,mw,ww]/np.sum([mm,wm,mw,ww])\n", "\t\n", "\tgender.append([mm,wm,mw,ww])\n", "\n", "\tww = fa_race[0] * la_race[0]\n", "\taw = np.sum(fa_race[1:]) * la_race[0]\n", "\twa = fa_race[0] * np.sum(la_race[1:])\n", "\taa = np.sum(fa_race[1:]) * np.sum(la_race[1:])\n", "\n", "\trace.append([ww,aw,wa,aa])\n", "\n", "\tpaper_matrix = np.zeros((2,8))\n", "\tpaper_matrix[0] = np.outer(fa_g,fa_race).flatten() \n", "\tpaper_matrix[1] = np.outer(la_g,la_race).flatten() \n", "\n", "\tpaper_matrix = np.outer(paper_matrix[0],paper_matrix[1]) \n", "\n", "\tcitation_matrix = citation_matrix + paper_matrix\n", "\tidx = idx + 1\n", "\n", "mm,wm,mw,ww = np.mean(gender,axis=0)*100\n", "WW,aw,wa,aa = np.mean(race,axis=0)*100\n", "\n", "statement = \"Recent work in several fields of science has identified a bias in citation practices such that papers from women and other minority scholars\\\n", "are under-cited relative to the number of such papers in the field (1-5). 
Here we sought to proactively consider choosing references that reflect the \\\n", "diversity of the field in thought, form of contribution, gender, race, ethnicity, and other factors. First, we obtained the predicted gender of the first \\\n", "and last author of each reference by using databases that store the probability of a first name being carried by a woman (5, 6). By this measure \\\n", "(and excluding self-citations to the first and last authors of our current paper), our references contain ww% woman (first)/woman (last), \\\n", "MW% man/woman, WM% woman/man, and MM% man/man. This method is limited in that a) names, pronouns, and social media profiles used to construct the \\\n", "databases may not, in every case, be indicative of gender identity and b) it cannot account for intersex, non-binary, or transgender people. \\\n", "Second, we obtained the predicted racial/ethnic category of the first and last author of each reference by using databases that store the probability of a \\\n", "first and last name being carried by an author of color (7,8). By this measure (and excluding self-citations), our references contain AA% author of \\\n", "color (first)/author of color (last), WA% white author/author of color, AW% author of color/white author, and WW% white author/white author. This method \\\n", "is limited in that a) the names and Florida Voter Data used to make the predictions may not be indicative of racial/ethnic identity, and b) \\\n", "it cannot account for Indigenous and mixed-race authors, or those who may face differential biases due to the ambiguous racialization or ethnicization of their names. 
\\\n", "We look forward to future work that could help us to better understand how to support equitable practices in science.\"\n", "\n", "statement = statement.replace('MM',str(np.around(mm,2)))\n", "statement = statement.replace('WM',str(np.around(wm,2)))\n", "statement = statement.replace('MW',str(np.around(mw,2)))\n", "statement = statement.replace('ww',str(np.around(ww,2)))\n", "statement = statement.replace('WW',str(np.around(WW,2)))\n", "statement = statement.replace('AW',str(np.around(aw,2)))\n", "statement = statement.replace('WA',str(np.around(wa,2)))\n", "statement = statement.replace('AA',str(np.around(aa,2)))" ] }, { "cell_type": "markdown", "metadata": { "kernel": "Python 3" }, "source": [ "## 5. Print the Diversity Statement and visualize your results\n", "\n", "The example template can be copied and pasted into your manuscript. We have included it in our methods or references section. If you are using LaTeX, [the bibliography file can be found here](https://github.com/dalejn/cleanBib/blob/master/diversityStatement/).\n", "\n", "### Additional info about the neuroscience benchmark\n", "For the top 5 neuroscience journals (Nature Neuroscience, Neuron, Brain, Journal of Neuroscience, and Neuroimage), the expected gender proportions in reference lists as reported by [Dworkin et al.](https://www.biorxiv.org/content/10.1101/2020.01.03.894378v1.full.pdf) are 58.4% for man/man, 9.4% for man/woman, 25.5% for woman/man, and 6.7% for woman/woman. Expected proportions were calculated by randomly sampling papers from 28,505 articles in the 5 journals, estimating gender breakdowns using probabilistic name classification tools, and regressing for relevant article variables like publication date, journal, number of authors, review article or not, and first-/last-author seniority. See [Dworkin et al.](https://www.biorxiv.org/content/10.1101/2020.01.03.894378v1.full.pdf) for more details. 
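\n\nFor intuition about how the over/under-citation heat map in the code below is computed: each cell reports $100 \\times (\\mathrm{observed} - \\mathrm{expected})/\\mathrm{expected}$, rounded up to an integer, where the expected matrix is estimated from the literature as described above. As a worked example with hypothetical numbers, if a category made up 4.5% of your citations but 6.7% were expected, that cell would read $100 \\times (4.5 - 6.7)/6.7 \\approx -32.8$ (displayed as $-32$ after rounding up), i.e. that category is cited roughly a third less often than the benchmark predicts. 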
\n", "\n", "Using a similar random draw model regressing for relevant variables, the expected race proportions in reference lists as reported by Bertolero et al. were 51.8% for white/white, 12.8% for white/author-of-color, 23.5% for author-of-color/white, and 11.9% for author-of-color/author-of-color. \n", "\n", "This box does NOT contain code that will use your limited API credits/queries.\n", "\n", "Run the code block below. (click to select the block and then press Ctrl+Enter; or click the block and press the Run button in the top menubar)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "Python 3" }, "outputs": [], "source": [ "print (statement)\n", "\n", "\n", "cmap = sns.diverging_palette(220, 10, as_cmap=True)\n", "names = ['white_m','api_m','hispanic_m','black_m','white_w','api_w','hispanic_w','black_w']\n", "plt.close()\n", "sns.set(style='white')\n", "fig, axes = plt.subplots(ncols=2,nrows=1,figsize=(7.5,4))\n", "axes = axes.flatten()\n", "plt.sca(axes[0])\n", "heat = sns.heatmap(np.around((citation_matrix/citation_matrix.sum())*100,2),annot=True,ax=axes[0],annot_kws={\"size\": 8},cmap=cmap,vmax=1,vmin=0)\n", "axes[0].set_ylabel('first author',labelpad=0) \n", "heat.set_yticklabels(names,rotation=0)\n", "axes[0].set_xlabel('last author',labelpad=1) \n", "heat.set_xticklabels(names,rotation=90) \n", "heat.set_title('percentage of citations') \n", "\n", "citation_matrix_sum = citation_matrix / np.sum(citation_matrix) \n", "\n", "expected = np.load('/%s/data/expected_matrix_florida.npy'%(homedir))\n", "expected = expected/np.sum(expected)\n", "\n", "percent_overunder = np.ceil( ((citation_matrix_sum - expected) / expected)*100)\n", "plt.sca(axes[1])\n", "heat = sns.heatmap(np.around(percent_overunder,2),annot=True,ax=axes[1],fmt='g',annot_kws={\"size\": 8},vmax=50,vmin=-50,cmap=cmap)\n", "axes[1].set_ylabel('',labelpad=0) \n", "heat.set_yticklabels('')\n", "axes[1].set_xlabel('last author',labelpad=1) \n", 
"heat.set_xticklabels(names,rotation=90) \n", "heat.set_title('percentage over/under-citations')\n", "plt.tight_layout()\n", "\n", "plt.savefig('/home/jovyan/race_gender_citations.pdf')\n", "\n", "\n", "paper_df.to_csv('/home/jovyan/predictions.csv')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "R" }, "outputs": [], "source": [ "# Plot a histogram #\n", "names <- read.csv('/home/jovyan/predictions.csv', header=T)\n", "total_citations <- nrow(na.omit(names))\n", "names$GendCat <- gsub(\"female\", \"W\", names$GendCat, fixed=T)\n", "names$GendCat <- gsub(\"male\", \"M\", names$GendCat, fixed=T)\n", "names$GendCat <- gsub(\"unknown\", \"U\", names$GendCat, fixed=T)\n", "gend_cats <- unique(names$GendCat) # get a vector of all the gender categories in your paper\n", "\n", "# Create an empty data frame that will be used to plot the histogram. This will have the gender category (e.g., WW, MM) in the first column and the percentage (e.g., number of WW citations divided by total number of citations * 100) in the second column #\n", "dat_for_plot <- data.frame(gender_category = NA,\n", " number = NA,\n", " percentage = NA)\n", "\n", "\n", "### Loop through each gender category from your paper, calculate the citation percentage of each gender category, and save the gender category and its citation percentage in dat_for_plot data frame ###\n", "if (length(names$GendCat) != 1) {\n", " \n", " for (i in 1:length(gend_cats)){\n", " \n", " # Create an empty temporary data frame that will be binded to the dat_for_plot data frame\n", " temp_df <- data.frame(gender_category = NA,\n", " number = NA,\n", " percentage = NA)\n", " \n", " # Get the gender category, the number of citations with that category, and calculate the percentage of citations with that category\n", " gend_cat <- gend_cats[i]\n", " number_gend_cat <- length(names$GendCat[names$GendCat == gend_cat])\n", " perc_gend_cat <- (number_gend_cat / total_citations) * 100\n", " \n", " # Bind 
this information to the original data frame\n", " temp_df$gender_category <- gend_cat\n", " temp_df$number <- number_gend_cat\n", " temp_df$percentage <- perc_gend_cat\n", " dat_for_plot <- rbind(dat_for_plot, temp_df)\n", " \n", " }\n", " \n", "}\n", "\n", "\n", "# Create a data frame with only the WW, MW, WM, MM categories and their base rates - to plot percent citations relative to benchmarks\n", "dat_for_baserate_plot <- subset(dat_for_plot, gender_category == 'WW' | gender_category == 'MW' | gender_category == 'WM' | gender_category == 'MM')\n", "dat_for_baserate_plot$baserate <- c(6.7, 9.4, 25.5, 58.4)\n", "dat_for_baserate_plot$citation_rel_to_baserate <- dat_for_baserate_plot$percentage - dat_for_baserate_plot$baserate\n", "\n", "\n", "# Plot the Histogram of Number of Papers per category against predicted gender category #\n", "\n", "library(ggplot2)\n", "\n", "dat_for_plot = dat_for_plot[-1:-2,]\n", "\n", "dat_for_plot$gender_category <- factor(dat_for_plot$gender_category, levels = dat_for_plot$gender_category)\n", "ggplot(dat_for_plot[-c(1),], aes(x = gender_category, y = number, fill = gender_category)) +\n", " geom_bar(stat = 'identity', width = 0.75, na.rm = TRUE, show.legend = TRUE) + \n", " scale_x_discrete(limits = c('WW', 'MW', 'WM', 'MM', 'UW', 'UM', 'WU', 'MU', 'UU')) +\n", " geom_text(aes(label = number), vjust = -0.3, color = 'black', size = 2.5) +\n", " theme(legend.position = 'right') + theme_minimal() +\n", " xlab('Predicted gender category') + ylab('Number of papers') + ggtitle(\"\") + theme_classic(base_size=15)\n", "\n", "\n", "# Plot the Histogram of % citations relative to benchmarks against predicted gender category\n", "ggplot(dat_for_baserate_plot, aes(x = gender_category, y = citation_rel_to_baserate, fill = gender_category)) +\n", " geom_bar(stat = 'identity', width = 0.75, na.rm = TRUE, show.legend = TRUE) +\n", " scale_x_discrete(limits = c('WW', 'MW', 'WM', 'MM')) +\n", " geom_text(aes(label = round(citation_rel_to_baserate, 
digits = 2)), vjust = -0.3, color = 'black', size = 2.5) +\n", " theme(legend.position = 'right') + theme_minimal() +\n", " xlab('Predicted gender category') + ylab('% of citations relative to benchmarks') + ggtitle(\"\") + theme_classic(base_size=15)" ] }, { "cell_type": "markdown", "metadata": { "kernel": "SoS" }, "source": [ "### (OPTIONAL) Color-code your .tex file using the estimated gender classifications\n", "\n", "Running this code-block will optionally output your uploaded `.tex` file with color-coding for gender pair classifications. You can find the [example below's pre-print here.](https://www.biorxiv.org/content/10.1101/664250v1)\n", "\n", "![Color-coded .tex file, Eli Cornblath](img/texColors.png)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "kernel": "Python 3" }, "outputs": [], "source": [ "cite_gender = pd.read_csv(homedir+'Authors.csv') # output of getReferenceGends.ipynb\n", "cite_gender.index = cite_gender.CitationKey\n", "cite_gender['Color'] = '' # what color to make each gender category\n", "colors = {'MM':'red','MW':'blue','WW':'green','WM':'magenta','UU':'black',\n", "'MU':'black','UM':'black','UW':'black','WU':'black'}\n", "for idx in cite_gender.index: # loop through each citation key and set color\n", " cite_gender.loc[idx,'Color'] = colors[cite_gender.loc[idx,'GendCat']]\n", "cite_gender.loc[cite_gender.index[cite_gender.SelfCite=='Y'],'Color'] = 'black' # make self citations black\n", "\n", "fin = open(homedir+tex_file)\n", "texdoc=fin.readlines()\n", "with open(homedir+tex_file[:-4]+'_gendercolor.tex','w') as fout:\n", " for i in range(len(texdoc)):\n", " s = texdoc[i]\n", " cite_instances = re.findall('\\\\\\\\cite\\{.*?\\}',s)\n", " cite_keys = re.findall('\\\\\\\\cite\\{(.*?)\\}',s)\n", " cite_keys = [x.split(',') for x in cite_keys]\n", " cite_keys_sub = [['\\\\textcolor{' + cite_gender.loc[x.strip(),'Color'] + '}{\\\\cite{'+x.strip()+'}}' for x in cite_instance] for cite_instance in cite_keys]\n", " 
cite_keys_sub = ['\\\\textsuperscript{,}'.join(x) for x in cite_keys_sub]\n", " for idx,cite_instance in enumerate(cite_instances):\n", " s = s.replace(cite_instances[idx],cite_keys_sub[idx])\n", " fout.write(s)\n", " # place color key after abstract\n", " if '\\\\section*{Introduction}\\n' in s: \n", " l = ['\\\\textcolor{' + colors[k] + '}{'+k+'}' for k in colors.keys()]\n", " fout.write('\\tKey: '+ ', '.join(l)+'.\\n')" ] } ], "metadata": { "kernelspec": { "display_name": "SoS", "language": "sos", "name": "sos" }, "language_info": { "codemirror_mode": "sos", "file_extension": ".sos", "mimetype": "text/x-sos", "name": "sos", "nbconvert_exporter": "sos_notebook.converter.SoS_Exporter", "pygments_lexer": "sos" }, "sos": { "kernels": [ [ "Python 3", "python3", "python3", "", { "name": "ipython", "version": 3 } ], [ "R", "ir", "R", "", "r" ] ], "panel": { "displayed": true, "height": 0 }, "version": "0.21.7" } }, "nbformat": 4, "nbformat_minor": 1 }