{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 2. Languages of Journals using OJS " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This link navigates to a Google doc with examples of online journals using OJS to publish open access articles in **60 different languages**.
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Notebook objectives:\n", "1. [Obtain gcld3 language predictions for the abstracts of the most recent 100 articles published in a sample of 22,561 ISSN-verified journals actively using OJS.](#gcld3)\n", " * Google's Compact Language Detector v3 (https://github.com/google/cld3), a pretrained model for language classification\n", "

\n", "2. [Verify gcld3 language predictions for each journal using a variety of heuristics, including:](#verify)\n", " * Checking language predictions against journal top-level domains (e.g., \"es\" for Spanish and \".es\" for Spain)\n", " * Cross-checking language predictions for article abstracts, article titles, and journal titles\n", " * Manually checking frequent misclassifications, such as Japanese, Scottish Gaelic, and Afrikaans\n", "

\n", "3. [Visualize journals by their primary language of publishing:](#primary)\n", " * [Top 10 languages](#top10)\n", " * [All languages in the dataset](#all)\n", "

\n", "4. [Classify journals based on whether they publish in multiple languages.](#multi)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Import packages:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from collections import defaultdict\n", "from collections import Counter\n", "from lxml import html\n", "import matplotlib.pyplot as plt\n", "import matplotlib\n", "import seaborn as sns\n", "import pandas as pd\n", "import numpy as np\n", "import json\n", "import time\n", "import re\n", "import os" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Google's Compact Language Detector v3 (gcld3) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Initialize gcld3:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import gcld3\n", "classifier = gcld3.NNetLanguageIdentifier(min_num_bytes=0, max_num_bytes=10000)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Store a list of gcld3 language codes corresponding to the 60 publishing languages used by OJS journals (except Faroese and Balochi, which are unsupported by gcld3):" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "#Albanian is listed under its gcld3 language code 'sq' ('al' is the country code for Albania)\n", "known_langs = ['af', 'ar', 'bg', 'bg-Latn', 'bs', 'ca', 'cs', 'da', 'de', 'el', 'el-Latn', 'en', 'es', 'et',\n", "               'eu', 'fa', 'fi', 'fil', 'fr', 'gd', 'gl', 'hi', 'hi-Latn', 'hr', 'hu', 'hy', 'id', 'ig', 'is', 'it',\n", "               'ja', 'ja-Latn', 'ka', 'kk', 'ko', 'ku', 'lt', 'mk', 'ms', 'ne', 'nl', 'no', 'pl', 'pt', 'ro', 'ru',\n", "               'ru-Latn', 'si', 'sk', 'sl', 'sq', 'sr', 'sv', 'sw', 'ta', 'th', 'tr', 'uk', 'ur', 'uz', 'vi', 'zh', 'zh-Latn']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a function that:\n", "
\n", "1. Opens and streams a messy 28.5 GB file of scraped and printed OAI-PMH article metadata [title, description, subject, language, source...] for all articles published in **22,561 unique OJS contexts, or journals**;\n", "

\n", "2. Filters journals by ISSN and filters language predictions by inclusion in the list of known OJS languages (to avoid misclassification);\n", "

\n", "3. Passes the 'description' values (article abstracts) for the **most recent 100 articles published in each journal** to gcld3 to generate lists of predicted languages for each journal;\n", "

\n", "4. Returns a dict mapping journal issn to a list of gcld3-predicted language codes for article abstracts:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "def classify_abstracts(path_to_dump, classifier, issn_filter, lang_filter):\n", " \n", " metadata_pattern = '.+'\n", " issn2langs = defaultdict(list) #defaultdict of lists\n", " dcount = defaultdict(int)\n", " #Full processing\n", " \n", " with open(path_to_dump, 'r') as f:\n", " article_count = 0\n", " \n", " for line in f:\n", " content = re.search(metadata_pattern, line, re.MULTILINE | re.DOTALL)\n", " if content:\n", " tree = html.fromstring(content.group())\n", "\n", " for article in tree.xpath('//metadata'):\n", " article_count += 1\n", " \n", " for source in article.xpath('.//source'):\n", " source_copy = str(source.text)\n", " source_copy = re.sub('\\s', '', source_copy)\n", " if source_copy in issn_filter:\n", " issn = source_copy\n", " for description in article.xpath('.//description'):\n", " if dcount[issn] < 100 and description is not None: #if <100 abstracts have been classified\n", " pred_ = classifier.FindLanguage(text=description.text) #run gcld3\n", " if pred_.is_reliable and pred_.language in lang_filter: \n", " #if the language prediction is reliable and in a known OJS language\n", " issn2langs[issn].append(pred_.language)\n", " #append to a list of language predictions for the journal\n", " dcount[issn] += 1\n", " del pred_\n", " \n", " while tree.getprevious() is not None:\n", " del tree.getparent()[0]\n", " del content\n", " \n", " print(f\"Articles scanned: {article_count}\")\n", " print(f\"Journals classified: {len(issn2langs)}\")\n", " print(f\"Missing issns: {set(issn_filter) - set(list(issn2langs.keys()))}\")\n", " return issn2langs" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "path_to_dump = os.path.join('data', 'datadump.txt')\n", "path_to_beacon = os.path.join('data', 
'beacon_active.csv')" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "22809\n" ] } ], "source": [ "with open(path_to_beacon, 'r') as f:\n", " df = pd.read_csv(f)\n", "df = df[~df['issn_1'].duplicated()]\n", "issn_filter = [i for i in df['issn_1'].tolist() if isinstance(i, str)]\n", "print(len(issn_filter))" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Articles scanned: 7960979\n", "Journals classified: 22559\n", "Missing issns: {'2501-9430', '1047-6857', '2364-3714', '2525-8281', '1856-6073', '1678-9059', '2451-3962', '2358-1069', '2540-8445', '2721-5148', '2798-6241', '2655-2469', '2721-6020', '2734-9314', '2528-4967', '2734-9330', '1411-6340', '2775-5592', '2302-8432', '2355-1720', '2477-5029', '2503-3417', '1668-8708', '2615-6911', '2721-9976', '2598-0637', '1096-746X', '2451-1862', '2222-6737', '2775-1937', '2153-4012', '2600-5689', '1411-2280', '2722-3736', '2086-7840', '2183-0134', '1747-7387', '1693-4458', '2538-399X', '2501-2428', '0436-0265', '2716-3679', '2316-5324', '2655-948X', '2597-8985', '2621-783X', '2722-7960', '2620-5068', '1549-4497', '2598-9626', '2722-7111', '2723-1186', '2580-3123', '2460-352X', '2686-4908', '2622-5867', '1693-461X', '1754-4270', '2685-2799', '2614-4042', '2085-8744', '0124-1192', '2359-1382', '2622-6138', '2602-0254', '2668-9928', '1216-6804', '0103-4979', '0797-8952', '2215-9827', '2723-5319', '2339-1499', '1981-6979', '2656-1565', '1705-9100', '2337-5973', '2549-4317', '2291-8639', '2027-7636', '1987-037X', '1314-586X', '2745-6889', '2723-4088', '2442-9910', '2541-7207', '2501-5915', '2655-6065', '2549-2454', '2716-5043', '2338-0683', '2490-1199', '2715-2138', '2715-5889', '1997-3837', '0258-2724', '2447-7028', '2085-8205', '2238-8494', '2501-7136', '2334-1645', '2526-7744', '2722-0516', '2668-1056', '2375-7817', '2331-6950', '2722-2012', 
'0315-3681', '0216-7395', '2089-8118', '2734-5475', '2714-6278', '2734-9349', '2621-0622', '0798-0329', '2599-0543', '1978-0125', '2501-1111', '2708-7530', '1693-7619', '2654-5667', '2229-5674', '2735-9417', '2238-944X', '2525-2003', '1806-4280', '2675-4142', '2721-5016', '2502-471X', '2655-6324', '2732-3587', '1978-2403', '2723-7443', '2477-5258', '2620-5726', '2710-0898', '0216-7298', '2489-5512', '2501-9988', '2252-4797', '2350-0123', '2715-4971', '2579-9193', '2252-5262', '2715-1018', '2338-3720', '1081-1451', '1679-4605', '2094-1277', '0940-7855', '1982-6109', '2184-7193', '2615-2037', '2746-6434', '1412-0712', '2537-1754', '2294-9844', '2596-1837', '2668-9758', '2716-408X', '2722-5089', '1979-052X', '2501-1235', '2720-9903', '2406-8802', '2501-8590', '2460-8076', '2366-9217', '2461-0623', '2559-7914', '2698-5446', '1531-0167', '2580-6912', '2359-5965', '2764-1066', '2721-8511', '2685-6123', '2088-4605', '2548-7523', '2312-2528', '2734-9306', '2447-0899', '1980-5772', '2337-568X', '2302-934X', '2763-8669', '2595-9026', '2447-3472', '0126-074X', '2620-5505', '2711-4716', '2734-9357', '2460-3236', '2549-6778', '1412-226X', '2356-5225', '2684-9062', '2303-1409', '2614-719X', '2721-3315', '2622-9765', '2541-6030', '0125-9326', '2527-5445', '2540-9417', '2007-3380', '2699-5433', '2594-813X', '2179-8168', '1678-4944', '1411-2973', '2623-162X', '2174-7210', '2561-7141', '2597-5277', '2599-0551', '2716-0394', '2549-3485', '2615-8396', '2549-3361', '2528-1569', '2076-6327', '0208-5712', '2501-9120', '2686-2565', '1693-024X', '2797-5967', '2198-9397', '2089-6980', '2710-091X', '2406-8616', '2685-712x', '2601-971X', '2616-2504', '1669-726X', '2540-9808', '1412-4246', '2807-9256', '2408-350X', '2685-6425', '2616-6291', '2601-1972', '2614-5944', '2628-7129', '2685-161X', '1806-1230', '2807-887X', '2303-002X', '2685-5070', '0124-0625', '2252-7141'}\n", "CPU times: user 1h 22min 58s, sys: 55 s, total: 1h 23min 53s\n", "Wall time: 1h 23min 54s\n" ] } ], "source": [ "%time 
issn2langs = classify_abstracts(path_to_dump, classifier, issn_filter, lang_filter=known_langs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sanity check:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<class 'collections.defaultdict'>\n", "2715-2502\n", "['id', 'en', 'id', 'id', 'id', 'id', 'id', 'en', 'id', 'en', 'en', 'id', 'id', 'ms', 'id', 'en', 'id', 'en', 'id', 'id', 'en']\n" ] } ], "source": [ "with open(os.path.join('data', 'issn2langs.json'), 'w') as outfile:\n", "    json.dump(issn2langs, outfile)\n", "print(type(issn2langs))\n", "for k, v in issn2langs.items():\n", "    print(k) #ISSN for one journal\n", "    print(v) #list of gcld3 language classifications for the most recent 100 or fewer articles published in the journal\n", "    break" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Verifying gcld3-predicted language codes for each journal " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Data\n", "Create a DataFrame, `AA`, that joins gcld3-predicted language codes with data from the beacon. This DataFrame will be used to verify gcld3 language predictions for each journal." 
] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "#reload dict {issn: list of gcld3-predicted language codes}\n", "with open(os.path.join('data','issn2langs.json'), 'r') as infile:\n", " issn2langs = json.load(infile)\n", "#load dict {issn: text sample of concatenated titles and abstracts}\n", "with open(os.path.join('data','issn2payload.json'), 'r') as infile:\n", " issn2payload = json.load(infile)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "#{issn: primary language}\n", "l1 = {}\n", "for k, v in issn2langs.items():\n", " l1[k] = Counter(v).most_common(1)[0][0]\n", "#{issn: secondary language}\n", "l2 = {}\n", "for k, v in issn2langs.items():\n", " try:\n", " l2[k] = Counter(v).most_common(2)[1][0]\n", " except IndexError:\n", " l2[k] = None\n", " continue" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "# issn | primary language predicted by gcld3 | secondary language predicted by gcld3\n", "dfL = pd.DataFrame({'issn': l1.keys(),\n", " 'pred_1': l1.values(),\n", " 'pred_2': l2.values(),\n", " })\n", "# issn | text sample of concatenated titles and abstracts\n", "dfP = pd.DataFrame({'issn': issn2payload.keys(),\n", " 'text':issn2payload.values()})\n", "# issn | primary language | secondary language | text sample\n", "dfA = pd.merge(dfL, dfP, how='outer')" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "#load beacon data\n", "with open('data/beacon_active.csv', 'r') as infile:\n", " bA = pd.read_csv(infile)\n", "#select beacon columns useful for language verification \n", "bA = bA[['context_name', 'issn_1', 'issn_2', 'country_consolidated', 'journal_url']].copy()" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "#rename then merge into AA\n", "bA.rename(columns = {'issn_1':'issn',\n", " 'issn_2':'issn_alt',\n", " 
 'country_consolidated':'tld'}, inplace = True)\n", "AA = pd.merge(dfA, bA, on='issn')\n", "#AA.to_csv('data/AA.csv')" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['id', 'af', 'en', 'ms', 'es', 'ar', 'pt', 'th', 'ca', 'el', 'uk',\n", " 'it', 'fr', 'is', 'de', 'no', 'ja', 'ru', 'tr', 'sv', 'hi', 'pl',\n", " 'sr', 'sl', 'cs', 'da', 'vi', 'lt', 'hu', 'hr', 'mk', 'zh', 'ta',\n", " 'kk', 'sw', 'gd', 'sk', 'et', 'fa', 'bs', 'eu', 'ro', 'bg', 'fil',\n", " 'ka', 'hy', 'uz', 'nl', 'fi', 'ne', 'ig', nan], dtype=object)" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#deduplicate issns (copy to avoid modifying a view of the merged frame)\n", "AA = AA[~AA['issn'].duplicated()].copy()\n", "#lowercase top-level domains\n", "AA['tld'] = AA['tld'].str.lower()\n", "AA['pred_1'].unique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Helper functions:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Function for applying gcld3 to journal titles and article-level text samples:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "def tag_language(s):\n", "    #return the gcld3 language code for s, or None if the prediction is unreliable\n", "    l = classifier.FindLanguage(text=s)\n", "    if l.is_reliable:\n", "        return l.language\n", "    return None" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "#pass index=AA.index so values stay aligned with AA's rows after deduplication\n", "AA['title_language'] = pd.Series([tag_language(s) if isinstance(s, str) else None for s in AA['context_name'].tolist()], index=AA.index)\n", "AA['text_language'] = pd.Series([tag_language(s) if isinstance(s, str) else None for s in AA['text'].tolist()], index=AA.index)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Function for adding journal ISSNs and double-checked language predictions to a cleaned dict `clean_d`:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "def add2dict(d, issns, langs, corrections=False):\n", "    count = 0 #journals newly added to d\n", "    corrs = 0 #existing entries overwritten\n", "    for issn, lang in zip(issns, langs):\n", "        if issn in d:\n", "            #overwrite an existing entry only when corrections are requested\n", "            if corrections:\n", "                d[issn] = lang\n", "                corrs += 1\n", "        else:\n", "            d[issn] = lang\n", "            count += 1\n", 
"    print(f\"{count} journal(s) added;\\n{corrs} journals corrected;\\n{len(d)} journals cleaned and stored in total\")" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "clean_d = {}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Heuristic approach to language verification:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 1. Add a Faroese journal and a Balochi journal, both of which are misclassified in the data set:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2 journal(s) added;\n", "0 journals corrected;\n", "2 journals cleaned and stored in total\n" ] } ], "source": [ "add2dict(d=clean_d,\n", " issns=['2445-6144', '2710-4850'],\n", " langs=['Faroese', 'Balochi'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2. If a journal has a non-null `pred_1` value but a null `pred_2` value, it is probably publishing in language `pred_1` (because every classified abstract among its most recent 100 articles was tagged with a single language code, `pred_1`):" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "5791 journal(s) added;\n", "0 journals corrected;\n", "5793 journals cleaned and stored in total\n" ] } ], "source": [ "add2dict(d=clean_d,\n", " issns=AA[(AA['pred_1'].notnull()) & (AA['pred_2'].isna())].issn.tolist(),\n", " langs=AA[(AA['pred_1'].notnull()) & (AA['pred_2'].isna())].pred_1.tolist())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3. 
If gcld3 predicts the same primary language for a journal's title metadata (`title_language`), its article metadata (`text_language`), and the abstracts of its most recent 100 articles (`pred_1`), add the journal:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1172 journal(s) added;\n", "0 journals corrected;\n", "6965 journals cleaned and stored in total\n" ] } ], "source": [ "add2dict(d=clean_d,\n", " issns=AA[(AA['pred_1'] == AA['text_language']) & (AA['pred_1'] == AA['title_language'])].issn.tolist(),\n", " langs=AA[(AA['pred_1'] == AA['text_language']) & (AA['pred_1'] == AA['title_language'])].pred_1.tolist())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 4. If the journal's secondary language `pred_2` matches its top-level domain `tld` and the primary language predicted by gcld3 (`pred_1`) is not English, `pred_1` is probably an error, so add `pred_2`:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "116 journal(s) added;\n", "0 journals corrected;\n", "7081 journals cleaned and stored in total\n" ] } ], "source": [ "add2dict(d=clean_d,\n", " issns=AA[(AA['pred_2'] == AA['tld']) & (AA['pred_1'] != 'en')].issn.tolist(),\n", " langs=AA[(AA['pred_2'] == AA['tld']) & (AA['pred_1'] != 'en')].pred_2.tolist())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 5. 
Check if journals with an `af` Afrikaans primary language classification actually have an African top-level domain `tld`:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "#store list of African country top-level domains\n", "af_tlds = ['AO','BI','BJ','BF','BW','CF','CI','CM','CD','CG','KM','CV','ER','ET','GA','GH',\n", " 'GN','GM','GW','GQ','KE','LR','LS','MG','ML','MZ','MR','MU','MW','NA','NE','NG',\n", " 'RW','SD','SN','SL','SO','SS','ST','SZ','SC','TD','TG','TZ','UG','ZA','ZM','ZW']\n", "AA['is_af'] = AA['tld'].apply(lambda x: any([i.lower() for i in af_tlds if i.lower() == x]))\n", "#AA[((AA['pred_1'] == 'af') | (AA['pred_2'] == 'af')) & (AA['is_af'] == True)]" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "5 journal(s) added;\n", "0 journals corrected;\n", "7086 journals cleaned and stored in total\n" ] } ], "source": [ "add2dict(d=clean_d,\n", " issns=['1013-1116', '0041-4751', '0041-476X', '0254-3486', '2006-1390'],\n", " langs=['af', 'af', 'af', 'af', 'en'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 6. Assuming that all remaining `pred_1` == `af` are misclassifications, add `pred_2` instead:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "70 journal(s) added;\n", "0 journals corrected;\n", "7156 journals cleaned and stored in total\n" ] } ], "source": [ "add2dict(d=clean_d,\n", " issns=AA[(AA['pred_1'] == 'af')].issn.tolist(),\n", " langs=AA[(AA['pred_1'] == 'af')].pred_2.tolist())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 7. 
If journals with an `es` Spanish primary language classification also have a Latin American top-level domain `tld`, they are probably publishing in Spanish:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "False 20123\n", "True 2468\n", "Name: is_latam, dtype: int64" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "latam_tlds = ['AW','AR','AG','BS','BZ','BO',#'BR', not brazil! bc portuguese\n", " 'BB','CL','CO','CR','CU','CW','KY','DM','DO','EC','GD','GT','GY','HN','HT','JM','KN','LC',\n", " 'GP','MX','NI','PA','PE','PR','PY','SV','SR','AN','SX','TC','TT','UY','VC','VE','VG','VI']\n", "AA['is_latam'] = AA['tld'].apply(lambda x: any([i.lower() for i in latam_tlds if i.lower() == x]))\n", "AA['is_latam'].value_counts()" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1681 journal(s) added;\n", "0 journals corrected;\n", "8837 journals cleaned and stored in total\n" ] } ], "source": [ "add2dict(d=clean_d,\n", " issns=AA[(AA['pred_1'] == 'es') & (AA['is_latam'] == True)].issn.tolist(),\n", " langs=AA[(AA['pred_1'] == 'es') & (AA['is_latam'] == True)].pred_1.tolist())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 8. Japanese journals, jointly identified by language prediction `ja` and top-level domain `jp`:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2 journal(s) added;\n", "0 journals corrected;\n", "8839 journals cleaned and stored in total\n" ] } ], "source": [ "add2dict(d=clean_d,\n", " issns=AA[(AA['pred_1'] == 'ja') & (AA['tld'] == 'jp')].issn.tolist(),\n", " langs=AA[(AA['pred_1'] == 'ja') & (AA['tld'] == 'jp')].pred_1.tolist())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 9. 
Japanese again, with the assumption that all remaining `pred_1` == `ja` are misclassifications, so add `pred_2` instead:" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "63 journal(s) added;\n", "0 journals corrected;\n", "8902 journals cleaned and stored in total\n" ] } ], "source": [ "add2dict(d=clean_d,\n", " issns=AA[(AA['pred_1'] == 'ja')].issn.tolist(),\n", " langs=AA[(AA['pred_1'] == 'ja')].pred_2.tolist())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 10. Manually enter correct language codes for misclassified French examples:" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "#pd.set_option(\"display.max_rows\", None)\n", "#AA[(AA['pred_1'] == 'fr') & (AA['pred_2'].notnull())]" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "13 journal(s) added;\n", "0 journals corrected;\n", "8915 journals cleaned and stored in total\n" ] } ], "source": [ "add2dict(d=clean_d,\n", " issns=['0008-4123', '2341-0868', '2595-6752', '1026-2881', '2368-8076', '2665-7716', '0702-7818',\n", " '1496-7308', '0002-4805', '1544-4953', '1499-6677', '0705-3657', '2605-0285'],\n", " langs=['en', 'es', 'pt', 'en', 'en', 'en', 'en', 'en', 'en', 'en', 'en', 'en', 'en'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 11. 
Malay journals, jointly identified by primary language prediction `ms` and top-level domain `my`:" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "36 journal(s) added;\n", "0 journals corrected;\n", "8951 journals cleaned and stored in total\n" ] } ], "source": [ "add2dict(d=clean_d,\n", " issns=AA[(AA['pred_1'] == 'ms') & (AA['tld'] == 'my')].issn.tolist(),\n", " langs=AA[(AA['pred_1'] == 'ms') & (AA['tld'] == 'my')].pred_1.tolist())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 12. Brazilian journals, jointly identified by primary language prediction `pt` and top-level domain `br`:" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1762 journal(s) added;\n", "0 journals corrected;\n", "10713 journals cleaned and stored in total\n" ] } ], "source": [ "add2dict(d=clean_d,\n", " issns=AA[(AA['pred_1'] == 'pt') & (AA['tld'] == 'br')].issn.tolist(),\n", " langs=AA[(AA['pred_1'] == 'pt') & (AA['tld'] == 'br')].pred_1.tolist())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 13. German journals, jointly identified by primary language prediction `de` and top-level domain `de`:" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "69 journal(s) added;\n", "0 journals corrected;\n", "10782 journals cleaned and stored in total\n" ] } ], "source": [ "add2dict(d=clean_d,\n", " issns=AA[(AA['pred_1'] == 'de') & (AA['tld'] == 'de')].issn.tolist(),\n", " langs=AA[(AA['pred_1'] == 'de') & (AA['tld'] == 'de')].pred_1.tolist())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 14. 
Hindi journals, jointly identified by primary language prediction `hi` or `hi-Latn` and top-level domain `in`:" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2 journal(s) added;\n", "0 journals corrected;\n", "10784 journals cleaned and stored in total\n" ] } ], "source": [ "#parenthesize the language test: & binds more tightly than |\n", "add2dict(d=clean_d,\n", " issns=AA[((AA['pred_1'] == 'hi') | (AA['pred_1'] == 'hi-Latn')) & (AA['tld'] == 'in')].issn.tolist(),\n", " langs=AA[((AA['pred_1'] == 'hi') | (AA['pred_1'] == 'hi-Latn')) & (AA['tld'] == 'in')].pred_1.tolist())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 14. Manually check OJS contexts with a Philippine top-level domain (.ph):" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": [ "#AA[(AA['tld'] == 'ph') & (AA['pred_2'].notnull())]" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 journal(s) added;\n", "2 journals corrected;\n", "10784 journals cleaned and stored in total\n" ] } ], "source": [ "add2dict(clean_d,\n", " issns=['0012-2858', '2244-6001'],\n", " langs=['en', 'fil'],\n", " corrections=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 15. 
Manually check OJS contexts with a Danish top-level domain (.dk):" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "scrolled": true }, "outputs": [], "source": [ "#AA[(AA['tld'] == 'dk') & (AA['pred_2'].notnull())]" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "9 journal(s) added;\n", "0 journals corrected;\n", "10793 journals cleaned and stored in total\n" ] } ], "source": [ "add2dict(d=clean_d,\n", " issns=['2596-6200', '0909-0533', '2446-0591', '2446-3981', '2597-0704', \n", " '1603-8509', '2246-2589', '2244-9140', '1904-5565', '0029-1528'],\n", " langs=['da', 'da', 'da', 'en', 'da', 'da', 'da', 'da', 'da', 'da'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 16. Manually check musicological journals, because musical notation causes problems for gcld3:" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [], "source": [ "#AA[AA['context_name'].str.contains('musi[ck]', regex=True, case=False)]" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2 journal(s) added;\n", "0 journals corrected;\n", "10795 journals cleaned and stored in total\n" ] } ], "source": [ "add2dict(d=clean_d,\n", " issns=['0354-818X', '2312-2528'],\n", " langs=['sr', 'en'])" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 journal(s) added;\n", "1 journals corrected;\n", "10795 journals cleaned and stored in total\n" ] } ], "source": [ "add2dict(clean_d,\n", " issns=['0011-3735'], #Current Musicology, which gcld3 classified as Japanese and Gaelic\n", " langs=['en'], #https://currentmusicology.columbia.edu/\n", " corrections=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 17. 
Manually check all Scottish Gaelic `gd` contexts:" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [], "source": [ "#AA[(AA['pred_1'] == 'gd') | (AA['pred_2'] == 'gd')]" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "4 journal(s) added;\n", "0 journals corrected;\n", "10799 journals cleaned and stored in total\n" ] } ], "source": [ "add2dict(clean_d,\n", " issns=['1754-4270', #perhaps the only true Scottish Gaelic journal\n", " '0035-6867', '2675-1127', '0957-5286'],\n", " langs=['gd', 'it', 'pt', 'en'])" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 journal(s) added;\n", "3 journals corrected;\n", "10799 journals cleaned and stored in total\n" ] } ], "source": [ "add2dict(clean_d,\n", " issns=['1805-9511', '0252-9076', '2563-562X'],\n", " langs=['cs', 'es', 'en'],\n", " corrections=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 18. Manually check contexts with a Czech top-level domain `cz` :" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [], "source": [ "#AA[AA['tld'] == 'cz']" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 journal(s) added;\n", "9 journals corrected;\n", "10804 journals cleaned and stored in total\n" ] } ], "source": [ "add2dict(clean_d,\n", " issns=['0009-2770', '2336-2766', '2336-3630', '1805-9511', '2336-4378',\n", " '1802-3983', '1804-5383', '1804-6665', '2694-9288'],\n", " langs=['cs', 'cs', 'uk', 'cs', 'cs', 'cs', 'cs', 'cs', 'cs'],\n", " corrections=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 19. 
Manually check contexts with an Albanian top-level domain `al` :" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [], "source": [ "#pd.set_option(\"display.max_rows\", None)\n", "#AA[AA['tld']=='al']" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 journal(s) added;\n", "1 journals corrected;\n", "10804 journals cleaned and stored in total\n" ] } ], "source": [ "add2dict(clean_d,\n", " issns=['2523-6636'],\n", " langs=['sq'],\n", " corrections=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 20. Manually check contexts with a Pakistani top-level domain `pk` :" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [], "source": [ "#AA[AA['tld']=='pk']" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 journal(s) added;\n", "44 journals corrected;\n", "10815 journals cleaned and stored in total\n" ] } ], "source": [ "#The output below is an error -- 11 journals added, 33 journals corrected\n", "add2dict(clean_d,\n", " issns=['2663-6255', '0430-4055', '1813-775X', '2707-1200', '2411-6211', '1995-7904', '2520-5021', \n", " '1998-4472', '2708-8235', '2073-5146', '2707-6903', '2664-4959', '2708-8847', '1816-5389', \n", " '2709-8885', '2664-1461', '2305-1345', '2708-6577', '2618-1355', '1818-9296', '2664-0023', \n", " '2709-6076', '2709-7641', '2709-4022', '2709-4162', '2710-0227', '2710-0812', '2410-8065', \n", " '2521-408X', '2707-6288', '2617-9075', '2305-154X', '2710-2475', '2710-5180', '2707-7225', \n", " '2523-0093', '2521-8948', '2788-4627', '2709-7617', '2073-3674', '2413-7480', '2415-5500', \n", " '2519-6618', '2518-5330'],\n", " langs=['ur', 'ur', 'ur', 'ur', 'ur', 'ur', 'ur', 'ur', 'ur', 'ur', 'ur', 'ur', 'ur', 'ur', 'ur', \n", " 'ur', 'Balochi', 'ur', 'ur', 'ur', 'ur', 'ar', 'ur', 'ur', 'ur', 'ur', 'ur', 
'ur', 'ur', \n", " 'ur', 'ur', 'ur', 'ur', 'ar', 'ur', 'ur', 'ur', 'ur', 'ur', 'ur', 'ur', 'ar', 'ur', 'ur'],\n", " corrections=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 21. Manually check journals that were predicted as Arabic `ar` :" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [], "source": [ "#AA[(AA['pred_1'] == 'ar') | (AA['pred_2'] == 'ar')]" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 journal(s) added;\n", "25 journals corrected;\n", "10835 journals cleaned and stored in total\n" ] } ], "source": [ "#The output below is an error -- 20 journals added, 5 journals corrected\n", "add2dict(clean_d,\n", " issns=['1693-3257', '2700-8355', '2600-7398', '2410-1036', '2600-7398', '2223-859X', '1996-9546',\n", " '2071-9728', '1026-3748', '1026-3721', '1995-8005', '2460-5360', '2716-5515', '0552-265X',\n", " '1994-473X', '2663-7405', '2437-0789', '2522-3259', '2522-6460', '2421-9843', '2520-7431',\n", " '2664-4673', '2706-9524', '1658-7030', '1658-3116'],\n", " langs=['ar', 'ar', 'ar', 'ku', 'ar', 'ar', 'ar', 'ar', 'ar', 'ar', 'ar', 'id', 'ar', 'ar', 'ar',\n", " 'ar', 'ar', 'ar', 'ar', 'ar', 'ar', 'ar', 'ar', 'ar', 'ar'],\n", " corrections=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 22. 
Manually check contexts with a Belarusian top-level domain `by` :" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [], "source": [ "#AA[AA['tld']=='by']" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 journal(s) added;\n", "3 journals corrected;\n", "10837 journals cleaned and stored in total\n" ] } ], "source": [ "add2dict(clean_d,\n", " issns=['2222-8853', '2789-195X', '2415-7198'],\n", " langs=['by', 'en', 'uk'],\n", " corrections=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 23. If a journal's primary language prediction `pred_1` and top-level domain `tld` match, add the journal:" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "5298 journal(s) added;\n", "0 journals corrected;\n", "16135 journals cleaned and stored in total\n" ] } ], "source": [ "add2dict(d=clean_d,\n", " issns=AA[AA['pred_1'] == AA['tld']].issn.tolist(),\n", " langs=AA[AA['pred_1'] == AA['tld']].pred_1.tolist())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 24. If a journal has an Eastern European top-level domain `tld` and its secondary language prediction `pred_2` matches its `title_language`, it probably publishes in its secondary language (e.g., Ukrainian) but translates article metadata into English. 
So, add the secondary language on the assumption that `text_language` == `en` is an error due to translated article titles and abstracts:" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [], "source": [ "ee_tlds = ['AL','AD','AM','AT','BG','BA','BY','CY','CZ','EE','GE','GR','GL','HR',\n", " 'HU','LT','LV','MD','MK','ME','PL','RO','RU','RS','SK','SI','UA']\n", "AA['is_ee'] = AA['tld'].apply(lambda x: any([i.lower() for i in ee_tlds if i.lower() == x]))\n", "#AA[(AA['pred_2'] == AA['title_language']) & (AA['pred_2'].notnull()) & (AA['is_ee'] == True)]" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "127 journal(s) added;\n", "0 journals corrected;\n", "16262 journals cleaned and stored in total\n" ] } ], "source": [ "add2dict(d=clean_d,\n", " issns=AA[(AA['pred_2'] == AA['title_language']) & (AA['pred_2'].notnull()) & \\\n", " (AA['is_ee'] == True)].issn.tolist(),\n", " langs=AA[(AA['pred_2'] == AA['title_language']) & (AA['pred_2'].notnull()) & \\\n", " (AA['is_ee'] == True)].pred_2.tolist())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 25. 
Finally, add each remaining journal's primary language prediction `pred_1` to complete `clean_d`:" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "6299 journal(s) added;\n", "0 journals corrected;\n", "22561 journals cleaned and stored in total\n" ] } ], "source": [ "add2dict(d=clean_d,\n", " issns=AA[AA['pred_1'].notnull()].issn.tolist(),\n", " langs=AA[AA['pred_1'].notnull()].pred_1.tolist())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Visualizing the primary languages in which OJS users publish worldwide: " ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [], "source": [ "codes = {'af':'Afrikaans', 'ar':'Arabic', 'bg':'Bulgarian', 'bg-Latn':'Bulgarian', 'bs':'Bosnian', 'by':'Belarusian',\n", " 'ca':'Catalan', 'cs':'Czech', 'da':'Danish', 'de':'German', 'el':'Greek', 'el-Latn':'Greek', 'en':'English', \n", " 'es':'Spanish', 'et':'Estonian', 'eu':'Basque', 'fa':'Persian', 'fi':'Finnish', 'fil':'Filipino',\n", " 'fr':'French', 'ga':'Irish', 'gd':'Scottish Gaelic', 'gl':'Galician', 'hi':'Hindi','hi-Latn':'Hindi',\n", " 'hr':'Croatian', 'hu':'Hungarian', 'hy':'Armenian', 'id':'Indonesian', 'ig':'Igbo', 'is':'Icelandic',\n", " 'it':'Italian', 'iw':'Hebrew', 'ja':'Japanese', 'ja-Latn':'Japanese', 'ka':'Georgian', 'kk':'Kazakh',\n", " 'ko':'Korean', 'ku':'Kurdish', 'la':'Latin', 'lt':'Lithuanian', 'lv':'Latvian', 'mk':'Macedonian',\n", " 'ms':'Malay', 'my':'Burmese', 'ne':'Nepali', 'nl':'Dutch', 'no':'Norwegian', 'pl':'Polish',\n", " 'pt':'Portuguese', 'ro':'Romanian', 'ru':'Russian', 'ru-Latn':'Russian', 'sd':'Sindhi', 'si':'Sinhala',\n", " 'sk':'Slovak', 'sl':'Slovenian', 'sq':'Albanian', 'sr':'Serbian', 'sv':'Swedish', 'sw':'Swahili', \n", " 'ta':'Tamil', 'tg':'Tajik', 'th':'Thai', 'tr':'Turkish', 'uk':'Ukrainian', 'ur':'Urdu', 'uz':'Uzbek',\n", " 'vi':'Vietnamese', 'zh':'Chinese', 'zh-Latn':'Chinese'}" ] }, { "cell_type": 
"markdown", "metadata": {}, "source": [ "Get primary language counts for the entire sample of journals, to be visualized using matplotlib and seaborn:" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Total: 22561 journals\n" ] } ], "source": [ "issn2primary = pd.DataFrame({'issn': list(clean_d.keys()),\n", " 'gcld3_code': [c if len(c) == 2 else None for c in clean_d.values()],\n", " 'language': [codes[l] if codes.get(l) else l for l in list(clean_d.values())]})\n", "issn2primary = issn2primary[issn2primary['language'].notnull()]\n", "#Get a series of value counts for the language codes\n", "ls = issn2primary['language'].value_counts(sort=True, ascending=False)\n", "print(f\"Total: {ls.sum()} journals\")" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(22561, 6)\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " issn issn_alt context_name \\\n", "17064 1018-2888 2709-7951 Diagnóstico \n", "16672 2236-3785 NaN Revista Ciências em Saúde \n", "9466 2077-1371 2077-1460 Biosfera \n", "5560 2089-4686 2548-5970 2-TRIK: TUNAS-TUNAS RISET KESEHATAN \n", "9854 2705-0513 NaN 工程技术研究 \n", "\n", " journal_url gcld3_code \\\n", "17064 http://142.44.242.51/index.php/diagnostico es \n", "16672 http://186.225.220.186:7474/ojs/index.php/rcsf... pt \n", "9466 http://21bs.ru/index.php/bio en \n", "5560 http://2trik.jurnalelektronik.com/index.php/2trik id \n", "9854 http://2winpub.usp-pl.com/index.php/ETR zh \n", "\n", " language \n", "17064 Spanish \n", "16672 Portuguese \n", "9466 English \n", "5560 Indonesian \n", "9854 Chinese " ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ojsLangs = AA[['issn', 'issn_alt', 'context_name', 'journal_url']].merge(issn2primary, how='left', on='issn')\n", "ojsLangs = ojsLangs[ojsLangs['language'].notnull()].sort_values(by=['journal_url'])\n", "print(ojsLangs.shape)\n", "ojsLangs.head()" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [], "source": [ "ojsLangs.to_csv(os.path.join('data', 'OJS_languages_v3.csv'), index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Bar plot of the 10 primary languages in which OJS users publish their articles (*n*=21,874)
\n", "Each bar shows the number of journals whose primary publishing language is the specified language; the percentages are computed over the full sample of 22,561 journals." ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%matplotlib inline\n", "\n", "sns.set(font_scale=1.25, style='whitegrid')\n", "fig, ax = matplotlib.pyplot.subplots()\n", "\n", "ld = sns.barplot(x=ls.values[:10],\n", "                 y=ls.index[:10],\n", "                 orient='h',\n", "                 color='grey')\n", "\n", "ax.set(xlim=(0, 12000),\n", "       xlabel='Active journals using OJS',\n", "       ylabel='Language')#,\n", "       #title='Primary language employed by journals using OJS ($\\it{n}$ = 21,874)')\n", "\n", "sns.despine(bottom=True)\n", "\n", "\n", "matplotlib.pyplot.xticks([2000, 4000, 6000, 8000, 10000],\n", "                         ['2,000', '4,000', '6,000', '8,000', '10,000'])\n", "\n", "for p in ld.patches:\n", "    _x = p.get_x() + p.get_width()\n", "    _y = p.get_y() + p.get_height() - 0.15\n", "    percent = round(((p.get_width() / 22561) * 100), 1)\n", "    string = str(int(p.get_width()))\n", "    if len(string) == 5:\n", "        value = string[:2] + ',' + string[2:]\n", "    elif len(string) == 4:\n", "        value = string[0] + ',' + string[1:]\n", "    else:\n", "        value = string\n", "    value += ' ({})'.format(str(percent)+'%')\n", "    ld.text(_x + 150, _y, value, ha='left', weight='bold')\n", "\n", "fig.savefig(os.path.join('vis', 'OJS_primary_languages.png'), bbox_inches='tight')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Bar plot of all the languages in which OJS users published their articles (*n*=22,561): " ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "%matplotlib inline\n", "\n", "fig, ax = matplotlib.pyplot.subplots(figsize=(8,16))\n", "\n", "mult = sns.barplot(y=list(ls.sort_values().index),\n", "                   x=list(ls.sort_values().values),\n", "                   orient='h',\n", "                   color='grey')\n", "\n", "sns.despine(bottom=True)\n", "\n", "ax.set(xlim=(0, 12500),\n", "       xlabel='Active journals using OJS',\n", "       ylabel='Language',\n", "       title='Primary language employed by journals using OJS ($\\it{n}$ = 22,561)',\n", "       visible=True)\n", "\n", "matplotlib.pyplot.xticks([0, 2000, 4000, 6000, 8000, 10000, 12000],\n", "                         ['0', '2,000', '4,000', '6,000', '8,000', '10,000', '12,000'])\n", "\n", "for p in mult.patches:\n", "    _x = p.get_x() + p.get_width()\n", "    _y = p.get_y() + p.get_height() - 0.175\n", "    percent = round(((p.get_width() / 22561) * 100), 2)\n", "    string = str(int(p.get_width()))\n", "    if len(string) == 5:\n", "        value = string[:2] + ',' + string[2:]\n", "    elif len(string) == 4:\n", "        value = string[0] + ',' + string[1:]\n", "    else:\n", "        value = string\n", "    value += ' ({})'.format(str(percent)+'%')\n", "    mult.text(_x + 125, _y, value, ha='left', weight='bold')\n", "\n", "fig.savefig(os.path.join('vis', 'OJS_languages_v3.png'), bbox_inches='tight')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Bar plot of multilingualism among journals publishing with OJS (*n*=22,382)
\n", "Each bar represents the proportion of journals, grouped by the number of languages in which they published **5 or more articles**. The decision boundary of 5 was chosen to match the decision boundary for active journals (>=5 articles published per year)." ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [], "source": [ "def classify_journals_multi(issn2langs, decision_boundary):\n", "    multilingual = defaultdict(list)\n", "    for k, v in issn2langs.items():\n", "        for lang in Counter(v).items():\n", "            if lang[1] >= decision_boundary:\n", "                #If the number of article abstracts tagged as a given language (e.g., 'en') meets or exceeds the boundary\n", "                multilingual[k].append(lang[0]) #Append the language to a list for the journal\n", "\n", "    multilingual_counts = defaultdict(int)\n", "    array_lengths = []\n", "    for v in multilingual.values():\n", "        if v:\n", "            multiplier = len(v)\n", "            array_lengths.append(multiplier)\n", "            if multiplier >= 3:\n", "                multilingual_counts['Multi- (3+ languages)'] += 1\n", "            elif multiplier == 2:\n", "                multilingual_counts['Bi- (2 languages)'] += 1\n", "            elif multiplier == 1:\n", "                multilingual_counts['Mono- (1 language)'] += 1\n", "        else:\n", "            continue\n", "\n", "    total = 0\n", "    for v in multilingual_counts.values():\n", "        total += v\n", "    print('Total: {} journals'.format(total))\n", "    print('Average number of languages per journal: {}'.format(np.array(array_lengths).mean()))\n", "\n", "    return pd.Series(multilingual_counts).sort_values(ascending=False)" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Total: 22382 journals\n", "Average number of languages per journal: 1.6994013046197838\n" ] } ], "source": [ "multi5 = classify_journals_multi(issn2langs, decision_boundary=5)" ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "%matplotlib inline\n", "\n", "fig, ax = matplotlib.pyplot.subplots()\n", "\n", "mult = sns.barplot(y=multi5.index,\n", "                   x=multi5.values,\n", "                   orient='h',\n", "                   color='grey')\n", "\n", "ax.set(xlim=(0, 12000),\n", "       xlabel='Active journals using OJS',\n", "       ylabel='*-lingual journals')#,\n", "       #title='Number of languages employed by journals using OJS ($\\it{n}$ = 22,382)')\n", "\n", "sns.despine(bottom=True)\n", "\n", "matplotlib.pyplot.xticks([2000, 4000, 6000, 8000, 10000],\n", "                         ['2,000', '4,000', '6,000', '8,000', '10,000'])\n", "\n", "for p in mult.patches:\n", "    _x = p.get_x() + p.get_width()\n", "    _y = p.get_y() + p.get_height() - 0.35\n", "    percent = round(((p.get_width() / 22382) * 100), 1)\n", "    string = str(int(p.get_width()))\n", "    if len(string) == 5:\n", "        value = string[:2] + ',' + string[2:]\n", "    elif len(string) == 4:\n", "        value = string[0] + ',' + string[1:]\n", "    else:\n", "        value = string\n", "    value += ' ({})'.format(str(percent)+'%')\n", "    mult.text(_x + 150, _y, value, ha='left', weight='bold')\n", "\n", "fig.savefig(os.path.join('vis', 'OJS_multilingual5.png'), bbox_inches='tight')" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13" } }, "nbformat": 4, "nbformat_minor": 4 }