{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "75a6920e",
   "metadata": {},
   "source": [
    "# 5. Journals indexed in Scopus by their disciplinary distribution"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "36d5a09d",
   "metadata": {},
   "source": [
    "### Notebook objectives:\n",
    "1. Determine the disciplinary distribution of Scopus journals for the sake of comparison to OJS.\n",
    "*Updated 9/22/2022"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "950a5897",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Python 3.10.5\r\n"
     ]
    }
   ],
   "source": [
    "!python --version"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "b48bbcdc",
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "import pandas as pd\n",
    "import requests\n",
    "import json\n",
    "import os\n",
    "import re\n",
    "from easynmt import EasyNMT\n",
    "from tqdm import tqdm"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "5b396012",
   "metadata": {},
   "outputs": [],
   "source": [
    "scopus = pd.read_excel(os.path.join('data', 'scopus_jan2021.xlsx'))\n",
    "scopus = scopus.drop_duplicates(subset=[\"Print-ISSN\", \"E-ISSN\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "82907140",
   "metadata": {},
   "source": [
    "### Education isn't a defined subject area in the Scopus data, so I approximate the number of Education journals using a multilngual string search for \"education,\" \"teach,\" and \"learn\" in journal titles:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "cc3e4750",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "49 \n",
      " ['AFR', 'ARA', 'ARM', 'AZE', 'BAQ', 'BOS', 'BUL', 'CAT', 'CHI', 'CHN', 'CZE', 'DAN', 'DUT', 'ENF', 'ENG', 'EST', 'FIN', 'FRE', 'GER', 'GLE', 'GLG', 'GRE', 'HEB', 'HUN', 'ICE', 'IND', 'ITA', 'JPN', 'KOR', 'LAV', 'LIT', 'MAC', 'MAO', 'MAY', 'NOR', 'PER', 'POL', 'POR', 'RUM', 'RUS', 'SCC', 'SCR', 'SLO', 'SLV', 'SPA', 'SWE', 'THA', 'TUR', 'UKR']\n"
     ]
    }
   ],
   "source": [
    "langs = scopus['Article language in source (three-letter ISO language codes)'].unique().tolist()\n",
    "langs = [re.split(r\"[^a-zA-Z]+\", l) for l in langs if isinstance(l, str)]\n",
    "langs = sorted(list(set([l for subl in langs for l in subl])))\n",
    "print(len(langs), \"\\n\", langs)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "0d91bc16",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['AFR: 13', 'ARA: 24', 'ARM: 1', 'AZE: 3', 'BAQ: 6', 'BOS: 9', 'BUL: 18', 'CAT: 33', 'CHI: 562', 'CHN: 1', 'CZE: 131', 'DAN: 14', 'DUT: 71', 'ENF: 1', 'ENG: 27125', 'EST: 19', 'FIN: 18', 'FRE: 1197', 'GER: 997', 'GLE: 5', 'GLG: 2', 'GRE: 38', 'HEB: 7', 'HUN: 44', 'ICE: 3', 'IND: 4', 'ITA: 535', 'JPN: 199', 'KOR: 69', 'LAV: 8', 'LIT: 19', 'MAC: 2', 'MAO: 1', 'MAY: 12', 'NOR: 20', 'PER: 46', 'POL: 190', 'POR: 474', 'RUM: 53', 'RUS: 412', 'SCC: 15', 'SCR: 109', 'SLO: 64', 'SLV: 61', 'SPA: 1348', 'SWE: 23', 'THA: 3', 'TUR: 134', 'UKR: 26']\n"
     ]
    }
   ],
   "source": [
    "print([f\"{lang}: {scopus.iloc[:, 7].str.contains(lang).sum()}\" for lang in langs])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c54d7503",
   "metadata": {},
   "source": [
    "#### Save a list of ISO-639-1 language codes with Latin scripts, because the Scopus data only feature titles written in Latin scripts:\n",
    "(Transliteration is used by Scopus, but transliterating \"Education\" to Chinese \"Jiaoyu\" returns no titles. I will skip transliteration because the success rate seems so low.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "9c723f0f",
   "metadata": {},
   "outputs": [],
   "source": [
    "iso639_1 = [\"af\", \"ca\", \"cs\", \"da\", \"nl\", \"et\", \"fi\", \"fr\", \"de\", \"hu\", \"id\", \"it\", \"lv\", \"lt\", \"no\", \"pl\",\n",
    "            \"pt\", \"ro\", \"sr\", \"sk\", \"sl\", \"es\", \"sv\", \"tr\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "e9cc392b",
   "metadata": {},
   "outputs": [],
   "source": [
    "model = EasyNMT(\"opus-mt\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "a31a362e",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/jball/opt/anaconda3/envs/tmp/lib/python3.10/site-packages/transformers/models/marian/tokenization_marian.py:194: UserWarning: Recommended: pip install sacremoses.\n",
      "  warnings.warn(\"Recommended: pip install sacremoses.\")\n",
      "/Users/jball/opt/anaconda3/envs/tmp/lib/python3.10/site-packages/transformers/generation_utils.py:1227: UserWarning: Neither `max_length` nor `max_new_tokens` has been set, `max_length` will default to 512 (`self.config.max_length`). Controlling `max_length` via the config is deprecated and `max_length` will be removed from the config in v5 of Transformers -- we recommend using `max_new_tokens` to control the maximum length of the generation.\n",
      "  warnings.warn(\n",
      "Exception: Helsinki-NLP/opus-mt-en-lv is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'\n",
      "If this is a private repository, make sure to pass a token having permission to this repo with `use_auth_token` or log in with `huggingface-cli login` and pass `use_auth_token=True`.\n",
      "Exception: Helsinki-NLP/opus-mt-en-lt is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'\n",
      "If this is a private repository, make sure to pass a token having permission to this repo with `use_auth_token` or log in with `huggingface-cli login` and pass `use_auth_token=True`.\n",
      "Exception: Helsinki-NLP/opus-mt-en-no is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'\n",
      "If this is a private repository, make sure to pass a token having permission to this repo with `use_auth_token` or log in with `huggingface-cli login` and pass `use_auth_token=True`.\n",
      "Exception: Helsinki-NLP/opus-mt-en-pl is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'\n",
      "If this is a private repository, make sure to pass a token having permission to this repo with `use_auth_token` or log in with `huggingface-cli login` and pass `use_auth_token=True`.\n",
      "Exception: Helsinki-NLP/opus-mt-en-pt is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'\n",
      "If this is a private repository, make sure to pass a token having permission to this repo with `use_auth_token` or log in with `huggingface-cli login` and pass `use_auth_token=True`.\n",
      "Exception: Helsinki-NLP/opus-mt-en-sr is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'\n",
      "If this is a private repository, make sure to pass a token having permission to this repo with `use_auth_token` or log in with `huggingface-cli login` and pass `use_auth_token=True`.\n",
      "Exception: Helsinki-NLP/opus-mt-en-sl is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'\n",
      "If this is a private repository, make sure to pass a token having permission to this repo with `use_auth_token` or log in with `huggingface-cli login` and pass `use_auth_token=True`.\n",
      "Exception: Helsinki-NLP/opus-mt-en-tr is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'\n",
      "If this is a private repository, make sure to pass a token having permission to this repo with `use_auth_token` or log in with `huggingface-cli login` and pass `use_auth_token=True`.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['Opvoeding', 'leer', 'onderrig', 'educació', 'learn', 'ensenyeu', 'vzdělávání', 'učit se', 'Učit', 'uddannelse', 'lær', 'underviser', 'onderwijs', 'leren', 'lesgeven', 'haridus', 'õpi', 'õpetamine', 'koulutus', 'opi', 'opettaa', 'éducation', 'apprendre', 'enseigner', 'Bildung', 'lernen', 'Unterricht', 'oktatás', 'tanulj!', 'Tanárnő!', 'pendidikan', 'belajar', 'mengajar', 'istruzione', 'imparare', 'Insegna', 'educaţie', 'Învaţă', 'Predă', 'vzdelávanie', 'učiť sa', 'vyučovať', 'Educación', 'aprender', 'enseñar', 'utbildning', 'lära dig', 'lära ut']\n"
     ]
    }
   ],
   "source": [
    "doc = [\"education\",\n",
    "       \"learn\",\n",
    "       \"teach\"]\n",
    "edu = []\n",
    "\n",
    "for code in iso639_1:\n",
    "    try:\n",
    "        edu.extend(model.translate(doc, target_lang=code))\n",
    "    except OSError:\n",
    "        continue\n",
    "print(edu)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "73aaa63c",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1097"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "scopus.iloc[:, 1].str.contains(\n",
    "    \"ducat|teach|learn|opvoeding|onderrig|educaci|enseny|vzdela|ucit se|uddanel|undervis|onderwijs|leren|lesgev|haridus|opetami|koulutus|opettaa|apprend|enseign|bildung|lernen|unterricht|oktatas|tanulj|tanarno|pendidikan|belajar|mengajar|istruzione|imparare|insenga|invata|vyuco|aprend|ensena|utbild|lara dig|lara ut|jiaoyu\", \n",
    "    regex=True, case=False).sum()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "21091c5c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Education journals: 2.6%\n"
     ]
    }
   ],
   "source": [
    "ed = 1097 / 41957\n",
    "print(f\"Education journals: {round(ed*100, 1)}%\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "48676a34",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Index(['Top level:\\n\\nLife Sciences', 'Top level:\\n\\nSocial Sciences',\n",
      "       'Top level:\\n\\nPhysical Sciences', 'Top level:\\n\\nHealth Sciences',\n",
      "       '1000 \\nGeneral', '1100\\nAgricultural and Biological Sciences',\n",
      "       '1200\\nArts and Humanities',\n",
      "       '1300\\nBiochemistry, Genetics and Molecular Biology',\n",
      "       '1400\\nBusiness, Management and Accounting',\n",
      "       '1500\\nChemical Engineering', '1600\\nChemistry',\n",
      "       '1700\\nComputer Science', '1800\\nDecision Sciences',\n",
      "       '1900\\nEarth and Planetary Sciences',\n",
      "       '2000\\nEconomics, Econometrics and Finance', '2100\\nEnergy',\n",
      "       '2200\\nEngineering', '2300\\nEnvironmental Science',\n",
      "       '2400\\nImmunology and Microbiology', '2500\\nMaterials Science',\n",
      "       '2600\\nMathematics', '2700\\nMedicine', '2800\\nNeuroscience',\n",
      "       '2900\\nNursing', '3000\\nPharmacology, Toxicology and Pharmaceutics',\n",
      "       '3100\\nPhysics and Astronomy', '3200\\nPsychology',\n",
      "       '3300\\nSocial Sciences', '3400\\nVeterinary', '3500\\nDentistry',\n",
      "       '3600\\nHealth Professions'],\n",
      "      dtype='object')\n",
      "<class 'pandas.core.frame.DataFrame'>\n",
      "Int64Index: 41958 entries, 0 to 42473\n",
      "Data columns (total 31 columns):\n",
      " #   Column                                             Non-Null Count  Dtype \n",
      "---  ------                                             --------------  ----- \n",
      " 0   Top level:\n",
      "\n",
      "Life Sciences                          7666 non-null   object\n",
      " 1   Top level:\n",
      "\n",
      "Social Sciences                        13940 non-null  object\n",
      " 2   Top level:\n",
      "\n",
      "Physical Sciences                      14039 non-null  object\n",
      " 3   Top level:\n",
      "\n",
      "Health Sciences                        14744 non-null  object\n",
      " 4   1000 \n",
      "General                                      148 non-null    object\n",
      " 5   1100\n",
      "Agricultural and Biological Sciences          3031 non-null   object\n",
      " 6   1200\n",
      "Arts and Humanities                           5424 non-null   object\n",
      " 7   1300\n",
      "Biochemistry, Genetics and Molecular Biology  3105 non-null   object\n",
      " 8   1400\n",
      "Business, Management and Accounting           2007 non-null   object\n",
      " 9   1500\n",
      "Chemical Engineering                          1081 non-null   object\n",
      " 10  1600\n",
      "Chemistry                                     1312 non-null   object\n",
      " 11  1700\n",
      "Computer Science                              2274 non-null   object\n",
      " 12  1800\n",
      "Decision Sciences                             513 non-null    object\n",
      " 13  1900\n",
      "Earth and Planetary Sciences                  2250 non-null   object\n",
      " 14  2000\n",
      "Economics, Econometrics and Finance           1412 non-null   object\n",
      " 15  2100\n",
      "Energy                                        717 non-null    object\n",
      " 16  2200\n",
      "Engineering                                   5236 non-null   object\n",
      " 17  2300\n",
      "Environmental Science                         2636 non-null   object\n",
      " 18  2400\n",
      "Immunology and Microbiology                   902 non-null    object\n",
      " 19  2500\n",
      "Materials Science                             1917 non-null   object\n",
      " 20  2600\n",
      "Mathematics                                   1929 non-null   object\n",
      " 21  2700\n",
      "Medicine                                      13779 non-null  object\n",
      " 22  2800\n",
      "Neuroscience                                  793 non-null    object\n",
      " 23  2900\n",
      "Nursing                                       897 non-null    object\n",
      " 24  3000\n",
      "Pharmacology, Toxicology and Pharmaceutics    1231 non-null   object\n",
      " 25  3100\n",
      "Physics and Astronomy                         1542 non-null   object\n",
      " 26  3200\n",
      "Psychology                                    1566 non-null   object\n",
      " 27  3300\n",
      "Social Sciences                               8704 non-null   object\n",
      " 28  3400\n",
      "Veterinary                                    317 non-null    object\n",
      " 29  3500\n",
      "Dentistry                                     257 non-null    object\n",
      " 30  3600\n",
      "Health Professions                            702 non-null    object\n",
      "dtypes: object(31)\n",
      "memory usage: 10.2+ MB\n"
     ]
    }
   ],
   "source": [
    "print(scopus.columns[23:])\n",
    "scopus.iloc[:, 23:].info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "d543d992",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(6752, 54)"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "scopus[(scopus[\"1700\\nComputer Science\"].notnull()) | \n",
    "       (scopus[\"2200\\nEngineering\"].notnull())].shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "a4b176db",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "% CS & Engineering journals: 16.1%\n"
     ]
    }
   ],
   "source": [
    "csen = 6752 / 41957\n",
    "print(f\"% CS & Engineering journals: {round(csen*100, 1)}%\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "f071aeb3",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(1929, 54)"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "scopus[scopus[\"2600\\nMathematics\"].notnull()].shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "1139f13f",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "% Math journals: 4.6%\n"
     ]
    }
   ],
   "source": [
    "math = 1929 / 41957\n",
    "print(f\"% Math journals: {round(math*100, 1)}%\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "6523d922",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "% Med-Health journals: 35.1%\n"
     ]
    }
   ],
   "source": [
    "medh = scopus[scopus[\"Top level:\\n\\nHealth Sciences\"].notnull()].shape[0] / 41957\n",
    "print(f\"% Med-Health journals: {round(medh*100, 1)}%\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e087c8b1",
   "metadata": {},
   "source": [
    "### Use <a href=\"https://docs.openalex.org/\">OpenAlex</a> to try and disaggregate the \"Social Sciences\" journals:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "d16348d8",
   "metadata": {},
   "outputs": [],
   "source": [
    "e_socsci = scopus[(scopus[\"Top level:\\n\\nSocial Sciences\"].notnull()) &\n",
    "                      (scopus[\"E-ISSN\"].notnull())][\"E-ISSN\"]\n",
    "e_socsci = list(zip(e_socsci.index, [str(issn)[:4] + \"-\" + str(issn)[4:] for issn in e_socsci]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "b5b2a84c",
   "metadata": {},
   "outputs": [],
   "source": [
    "print_socsci = scopus[(scopus[\"Top level:\\n\\nSocial Sciences\"].notnull()) &\n",
    "                      (scopus[\"Print-ISSN\"].notnull())][\"Print-ISSN\"]\n",
    "print_socsci = list(zip(print_socsci.index, [str(issn)[:4] + \"-\" + str(issn)[4:] for issn in print_socsci]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "d6855385",
   "metadata": {},
   "outputs": [],
   "source": [
    "ss_issns = e_socsci + print_socsci"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "620495d5",
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_subjects(list_of_tuples):\n",
    "    \n",
    "    idx2subject = []\n",
    "    error_issns = []\n",
    "    \n",
    "    for i, v in tqdm(list_of_tuples):\n",
    "        query = \"https://api.openalex.org/venues/issn:\" + v\n",
    "        \n",
    "        try:\n",
    "            response = json.loads(\n",
    "                requests.get(query).content.decode()\n",
    "            )\n",
    "            subject = response[\"x_concepts\"][0][\"display_name\"]\n",
    "        except:\n",
    "            error_issns.append(v)\n",
    "            \n",
    "        idx2subject.append(\n",
    "            (i, subject)\n",
    "        )\n",
    "        \n",
    "    return idx2subject, error_issns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "0acdd037",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|███████████████████████████████████| 19908/19908 [2:39:52<00:00,  2.08it/s]\n"
     ]
    }
   ],
   "source": [
    "idx2subject, errors = get_subjects(ss_issns)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "e5273ea4",
   "metadata": {},
   "outputs": [],
   "source": [
    "with open(os.path.join(\"data\", \"idx2subject_ss.json\"), \"w\") as outfile:\n",
    "    json.dump(idx2subject, outfile)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "a06bc5ee",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'History', 'Ecology', 'Art', 'Linguistics', 'Genetics', 'Humanities', 'Common value auction', 'Business', 'Macroeconomics', 'Economics', 'Chemistry', 'Visual arts', 'Biology', 'Physics', 'Demographic economics', 'Finance', 'Materials science', 'Economic geography', 'Population', 'Computer science', 'Psychology', 'Law', 'Nanotechnology', 'Thermodynamics', 'Outbreak', 'Geology', 'Environmental science', 'Astronomy', 'Poison control', 'Geophysics', 'Political science', 'Medicine', 'Nursing', 'Mathematics', 'Geography', 'Archaeology', 'Philosophy', 'Engineering', 'Monetary policy', 'Sociology'}\n"
     ]
    }
   ],
   "source": [
    "print(set([t[1] for t in idx2subject]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "9f60d919",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "% Philosophy journals: 5.6%\n",
      "% Linguistics journals: 0.0%\n",
      "% Humanities journals: 0.1%\n",
      "Phil: 2365, Ling: 3, Hum: 23\n"
     ]
    }
   ],
   "source": [
    "d = {}\n",
    "phil = 0\n",
    "ling = 0\n",
    "hum = 0\n",
    "\n",
    "for idx, subject in idx2subject:\n",
    "    \n",
    "    if idx not in d:\n",
    "        d[idx] = subject\n",
    "        \n",
    "        match subject:\n",
    "            \n",
    "            case \"Philosophy\":\n",
    "                phil += 1\n",
    "            \n",
    "            case \"Linguistics\":\n",
    "                ling += 1\n",
    "                \n",
    "            case \"Humanities\":\n",
    "                hum += 1\n",
    "\n",
    "print(f\"% Philosophy journals: {round(phil / 41957 * 100, 1)}%\")\n",
    "print(f\"% Linguistics journals: {round(ling / 41957 * 100, 1)}%\")\n",
    "print(f\"% Humanities journals: {round(hum / 41957 * 100, 1)}%\")\n",
    "print(f\"Phil: {phil}, Ling: {ling}, Hum: {hum}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "1bad2e92",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "13940\n"
     ]
    }
   ],
   "source": [
    "print(len(d))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "id": "7f03481d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "% History journals: 1.9%\n",
      "% Art + Visual Arts journals: 3.4%\n",
      "% Sociology journals: 1.6%\n",
      "Hist: 792, Art: 1437, Vis Art: 1, Soc: 656\n"
     ]
    }
   ],
   "source": [
    "hist = 0\n",
    "art = 0\n",
    "visa = 0\n",
    "soc = 0\n",
    "\n",
    "for idx, subject in d.items():\n",
    "    \n",
    "    match subject:\n",
    "        case \"History\":\n",
    "            hist += 1\n",
    "        case \"Art\":\n",
    "            art += 1\n",
    "        case \"Visual arts\":\n",
    "            visa += 1\n",
    "        case \"Sociology\":\n",
    "            soc += 1\n",
    "            \n",
    "print(f\"% History journals: {round(hist / 41957 * 100, 1)}%\")\n",
    "print(f\"% Art + Visual Arts journals: {round((art+visa) / 41957 * 100, 1)}%\")\n",
    "print(f\"% Sociology journals: {round(soc / 41957 * 100, 1)}%\")\n",
    "print(f\"Hist: {hist}, Art: {art}, Vis Art: {visa}, Soc: {soc}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d2819b41",
   "metadata": {},
   "source": [
    "### Final step: I need to give a rough estimate of the proportions of Scopus-indexed journals falling under the rubrics of \"Language, communication, and culture\" and \"Philosophy and religion.\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1be088cc",
   "metadata": {},
   "source": [
    "First, I will assume that a \"Philosophy\" classification from OpenAlex genuinely indicates a philosophy journal only if the journal is also classified by Scopus as \"Arts and Humanities\":"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "id": "017e7339",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(13940, 54)\n"
     ]
    }
   ],
   "source": [
    "alex = scopus[scopus.index.isin(\n",
    "    list(set([t[0] for t in idx2subject]))\n",
    ")]\n",
    "print(alex.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "id": "72d0bc7e",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2365\n"
     ]
    }
   ],
   "source": [
    "phil_indices = [idx for idx, subject in d.items() if subject == \"Philosophy\"]\n",
    "print(len(phil_indices))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "id": "8d810e9f",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1694\n",
      "% Philosophy journals, double checked: 4.0\n"
     ]
    }
   ],
   "source": [
    "philn = alex[\n",
    "    (alex.index.isin(phil_indices)) & (alex[\"1200\\nArts and Humanities\"].notnull())\n",
    "].shape[0]\n",
    "print(philn)\n",
    "print(f\"% Philosophy journals, double checked: {round(philn / 41957 * 100, 1)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d079275f",
   "metadata": {},
   "source": [
    "As for Scopus's \"Arts and Humanities\" journals which aren't labeled \"Philosophy,\" \"Arts,\" or \"Visual arts\" by OpenAlex, those provide a rough estimate for the number of journals in \"Language, communication, and culture.\" That is to say, these are just \"Humanities\" journals:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "id": "0aff6dcd",
   "metadata": {},
   "outputs": [],
   "source": [
    "art_indices = [idx for idx, subject in d.items() if subject == \"Art\" or subject == \"Visual arts\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "id": "aa86535f",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1438\n"
     ]
    }
   ],
   "source": [
    "print(len(art_indices))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "id": "36df1cc7",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2493\n",
      "% Language, communication, and culture journals, double checked: 5.9\n"
     ]
    }
   ],
   "source": [
    "lcc = alex[\n",
    "    (alex[\"1200\\nArts and Humanities\"].notnull()) &\n",
    "    (~alex.index.isin(art_indices)) & \n",
    "    (~alex.index.isin(phil_indices))\n",
    "].shape[0]\n",
    "print(lcc)\n",
    "print(f\"% Language, communication, and culture journals, double checked: {round(lcc / 41957 * 100, 1)}\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Py10 Temp",
   "language": "python",
   "name": "tmp"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}