{ "cells": [ { "cell_type": "markdown", "id": "60f5a73f", "metadata": {}, "source": [ "# 3.1 Multilingualism among Journals using OJS\n", "\n", "### Notebook objectives:\n", "1. Determine, for each journal in the subset (n=22,561), the languages in which it published more than 5 articles\n", "2. Classify journals by whether they published more than 5 articles in more than one language\n", "3. Double-check error-prone Indonesian journal classifications" ] }, { "cell_type": "code", "execution_count": 1, "id": "3f737ef4", "metadata": {}, "outputs": [], "source": [ "from collections import Counter, defaultdict\n", "import pandas as pd\n", "import numpy as np\n", "import json\n", "import re\n", "import os" ] }, { "cell_type": "markdown", "id": "5d54feb4", "metadata": {}, "source": [ "Load the previously determined list of gcld3 language codes for each journal, keyed by ISSN:" ] }, { "cell_type": "code", "execution_count": 2, "id": "640add0e", "metadata": {}, "outputs": [], "source": [ "with open(os.path.join(\"data\", \"issn2langs.json\"), \"r\") as infile:\n", "    issn2langs = json.load(infile)" ] }, { "cell_type": "markdown", "id": "5fcccad1", "metadata": {}, "source": [ "Load a .csv with the previously determined primary language classification for each journal:" ] }, { "cell_type": "code", "execution_count": 3, "id": "dcb92bea", "metadata": {}, "outputs": [], "source": [ "with open(os.path.join(\"data\", \"OJS_languages_v3.csv\"), \"r\") as infile:\n", "    ojs = pd.read_csv(infile)" ] }, { "cell_type": "code", "execution_count": 4, "id": "1335d5da", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 22561 entries, 0 to 22560\n", "Data columns (total 6 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 issn 22561 non-null object\n", " 1 issn_alt 8314 non-null object\n", " 2 context_name 22561 non-null object\n", " 3 journal_url 22561 non-null object\n", " 4 gcld3_code 22555 
non-null object\n", " 5 language 22561 non-null object\n", "dtypes: object(6)\n", "memory usage: 1.0+ MB\n" ] } ], "source": [ "ojs.info()" ] }, { "cell_type": "code", "execution_count": 5, "id": "b986ef09", "metadata": {}, "outputs": [], "source": [ "issn2mono = dict(zip(ojs[\"issn\"].tolist(), ojs[\"gcld3_code\"].tolist()))" ] }, { "cell_type": "markdown", "id": "143bf879", "metadata": {}, "source": [ "An eventual visualization will include only the four main languages of OJS users (English, Indonesian, Spanish, and Portuguese) plus a placeholder category, \"Other\":" ] }, { "cell_type": "code", "execution_count": 6, "id": "ae3d31ef", "metadata": {}, "outputs": [], "source": [ "allowed_langs = [\"en\", \"id\", \"es\", \"pt\"] # plus \"xx\" for \"Other\"" ] }, { "cell_type": "markdown", "id": "a81f2a33", "metadata": {}, "source": [ "Loop over the ISSNs and produce the combination of language codes for each journal:" ] }, { "cell_type": "code", "execution_count": 7, "id": "dd73958f", "metadata": {}, "outputs": [], "source": [ "d = defaultdict(int)  # language combination -> number of journals\n", "count = 0\n", "id_check = []  # ISSNs whose combination needs a manual check\n", "\n", "for k, v in issn2langs.items():\n", "\n", "    langs = []\n", "    if issn2mono[k] in allowed_langs:\n", "        langs.append(issn2mono[k])  # stable code for each of the allowed languages\n", "    else:\n", "        langs.append(\"xx\")  # other language\n", "\n", "    c = list(Counter(v).items())\n", "    c = [tup for tup in c if tup[1] > 5]  # apply the \"more than 5 articles\" criterion\n", "\n", "    if c:\n", "        for tup in c:\n", "            if tup[0] in allowed_langs:  # filter for the four languages mentioned above\n", "                langs.append(tup[0])\n", "            elif tup[0] in [\"af\", \"ja\"]:  # Afrikaans and Japanese are common gcld3 errors\n", "                # ignore these languages because fewer than 10 journals actually publish in them\n", "                continue\n", "            else:\n", "                langs.append(\"xx\")  # other languages\n", "\n", "    langs = sorted(list(set(langs)))\n", "\n", "    langtup = tuple(langs)\n", "    d[langtup] += 1\n", "    count += 1\n", "\n", "    # flag unusual combinations involving Indonesian for a manual check\n", "    if langtup == ('id', 'pt'):\n", "        id_check.append(k)\n", "    if langtup == ('en', 'es', 'id', 'pt'):\n", "        id_check.append(k)\n", "    if langtup == ('en', 'es', 'id', 'xx'):\n", "        id_check.append(k)\n", "    if langtup == ('en', 'es', 'id'):\n", "        id_check.append(k)\n", "    if langtup == ('en', 'id', 'pt'):\n", "        id_check.append(k)" ] }, { "cell_type": "code", "execution_count": 8, "id": "7091f5a6", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "22559\n" ] } ], "source": [ "print(count)\n", "# Two additional journals that are not present in the data, publishing in Balochi and Faroese,\n", "# will be added manually to \"Other\" (\"xx\")." ] }, { "cell_type": "markdown", "id": "3acc84a6", "metadata": {}, "source": [ "Language combinations, some of which need to be double-checked:" ] }, { "cell_type": "code", "execution_count": 9, "id": "eb266ffa", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "('en',) 6646\n", "('xx',) 446\n", "('id',) 2605\n", "('pt',) 1134\n", "('es',) 852\n", "('en', 'id') 4417\n", "('id', 'xx') 121\n", "('en', 'es') 1762\n", "('en', 'xx') 2164\n", "('en', 'pt') 821\n", "('es', 'pt') 192\n", "('es', 'xx') 56\n", "('id', 'pt') 4\n", "('pt', 'xx') 36\n", "('es', 'pt', 'xx') 10\n", "('en', 'id', 'xx') 335\n", "('en', 'es', 'pt') 550\n", "('en', 'pt', 'xx') 41\n", "('en', 'es', 'xx') 218\n", "('en', 'es', 'id') 47\n", "('en', 'id', 'pt') 30\n", "('id', 'pt', 'xx') 1\n", "('en', 'es', 'pt', 'xx') 49\n", "('en', 'es', 'id', 'pt') 12\n", "('en', 'es', 'id', 'xx') 10\n" ] } ], "source": [ "for k in sorted(d, key=len, reverse=False):\n", "    print(k, d[k])" ] }, { "cell_type": "markdown", "id": "c0d21861", "metadata": {}, "source": [ "These ISSNs feature unusual combinations of Indonesian, Spanish, and Portuguese classifications. 
Each was manually checked by querying issn.org: https://portal.issn.org/" ] }, { "cell_type": "code", "execution_count": 10, "id": "ec1d29ca", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['2599-1353', '2253-900X', '2356-1955', '1510-5091', '2722-6689', '2528-2344', '2318-5422', '2526-6675', '2613-9812', '2580-4553', '2443-3187', '2721-4192', '2723-3367', '2747-0733', '2621-3559', '2086-9754', '2745-8563', '2548-3366', '1411-545X', '0100-1965', '2745-5955', '2447-6536', '1907-5995', '2448-8232', '2145-888X', '1858-0262', '2615-5850', '2477-250X', '2655-2515', '2656-1832', '2614-7904', '1576-3420', '1693-6191', '2745-7168', '2655-6812', '2709-4685', '2722-9017', '1018-5674', '2597-7989', '2579-8766', '0121-2923', '2526-110X', '2599-3224', '2086-8162', '2151-2612', '2716-0807', '1411-4143', '0430-5027', '2715-0658', '2715-4882', '2301-9263', '2777-0362', '1858-2400', '2621-4148', '1679-1010', '2660-4418', '2615-6881', '2477-3557', '2655-7533', '0187-0173', '2656-2022', '2221-755X', '2359-0033', '2614-512X', '0211-111X', '2723-9535', '2460-1780', '2216-0973', '2589-8019', '2166-7918', '2078-1938', '2599-0136', '2412-4338', '2656-1794', '2595-9980', '2683-2100', '2318-4507', '2215-7794', '2656-3371', '2747-1292', '2746-8100', '2654-4172', '2254-6235', '1679-6101', '2686-5807', '2722-8002', '1412-7229', '2722-3620', '2686-6277', '2085-1456', '2371-2376', '2715-5919', '2318-8065', '2776-0081', '2477-0515', '2599-0837', '2716-1765', '2086-8065', '0121-3628', '2007-7831', '2007-7637', '2502-731X', '2183-8976']\n" ] } ], "source": [ "print(id_check)" ] } ], "metadata": { "kernelspec": { "display_name": "NLU", "language": "python", "name": "nlu" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13" } }, "nbformat": 4, "nbformat_minor": 5 }