{
"cells": [
{
"cell_type": "markdown",
"id": "60f5a73f",
"metadata": {},
"source": [
"# 3.1 Multilingualism among Journals using OJS\n",
"\n",
"### Notebook objectives:\n",
"1. Determine for each journal in the subset (n=22,561) the languages in which they published more than 5 articles\n",
"2. Classify journals based on whether they published more than 5 articles in multiple languages\n",
"3. Double-check error-prone Indonesian journal classifications"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "3f737ef4",
"metadata": {},
"outputs": [],
"source": [
"from collections import Counter, defaultdict\n",
"import pandas as pd\n",
"import numpy as np\n",
"import json\n",
"import re\n",
"import os"
]
},
{
"cell_type": "markdown",
"id": "5d54feb4",
"metadata": {},
"source": [
"Load previously determined lists of gcld3 language codes for each journal, represented by ISSN:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "640add0e",
"metadata": {},
"outputs": [],
"source": [
"with open(os.path.join(\"data\", \"issn2langs.json\"), \"r\") as infile:\n",
" issn2langs = json.load(infile)"
]
},
{
"cell_type": "markdown",
"id": "5fcccad1",
"metadata": {},
"source": [
"Load a .csv with previously determined primary language classifications for each journal:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "dcb92bea",
"metadata": {},
"outputs": [],
"source": [
"with open(os.path.join(\"data\", \"OJS_languages_v3.csv\"), \"r\") as infile:\n",
" ojs = pd.read_csv(infile)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "1335d5da",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"RangeIndex: 22561 entries, 0 to 22560\n",
"Data columns (total 6 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 issn 22561 non-null object\n",
" 1 issn_alt 8314 non-null object\n",
" 2 context_name 22561 non-null object\n",
" 3 journal_url 22561 non-null object\n",
" 4 gcld3_code 22555 non-null object\n",
" 5 language 22561 non-null object\n",
"dtypes: object(6)\n",
"memory usage: 1.0+ MB\n"
]
}
],
"source": [
"ojs.info()"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "b986ef09",
"metadata": {},
"outputs": [],
"source": [
"issn2mono = dict(zip(ojs[\"issn\"].tolist(), ojs[\"gcld3_code\"].tolist()))"
]
},
{
"cell_type": "markdown",
"id": "143bf879",
"metadata": {},
"source": [
"An eventual visualization will only include the four main languages of OJS users. These are English, Indonesian, Spanish, Portuguese, and a placeholder category, \"Other\":"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "ae3d31ef",
"metadata": {},
"outputs": [],
"source": [
"allowed_langs = [\"en\", \"id\", \"es\", \"pt\"] # +Other, \"xx\""
]
},
{
"cell_type": "markdown",
"id": "a81f2a33",
"metadata": {},
"source": [
"Loop over the ISSNs and produce combinations of language codes for each journal:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "dd73958f",
"metadata": {},
"outputs": [],
"source": [
"d = defaultdict(int)\n",
"count = 0\n",
"id_check = []\n",
"\n",
"for idx, (k, v) in enumerate(list(issn2langs.items())):\n",
" \n",
" langs = []\n",
" if issn2mono[k] in allowed_langs:\n",
" langs.append(issn2mono[k]) #stable code for each of the allowed languages\n",
" else:\n",
" langs.append(\"xx\") #other language\n",
" \n",
" c = list(Counter(v).items())\n",
" c = [tup for tup in c if tup[1] > 5] #apply the \"at least 5 articles\" criterion\n",
" \n",
" if c:\n",
" for tup in c:\n",
" if tup[0] in allowed_langs: #filter for the four languages mentioned above\n",
" langs.append(tup[0])\n",
" elif tup[0] in [\"af\", \"ja\"]: #Afrikaans and Japanese are common gcld3 errors\n",
" #ignore these languages becuase <10 of the journals actually publish in these languages\n",
" continue\n",
" else:\n",
" langs.append(\"xx\") #other languages\n",
" \n",
" langs = sorted(list(set(langs)))\n",
" \n",
" langtup = tuple(langs)\n",
" d[langtup] += 1\n",
" count += 1\n",
" \n",
" #checking indonesian journals\n",
" if langtup == ('id', 'pt'):\n",
" id_check.append(k)\n",
" if langtup == ('en', 'es', 'id', 'pt'):\n",
" id_check.append(k)\n",
" if langtup == ('en', 'es', 'id', 'xx'):\n",
" id_check.append(k)\n",
" if langtup == ('en', 'es', 'id'):\n",
" id_check.append(k)\n",
" if langtup == ('en', 'id', 'pt'):\n",
" id_check.append(k)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "7091f5a6",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"22559\n"
]
}
],
"source": [
"print(count) \n",
"# An additional two journals which are not present in the data, publishing in Balochi and Faroese,\n",
"# will be manually added to \"Other,\" or \"xx\"."
]
},
{
"cell_type": "markdown",
"id": "3acc84a6",
"metadata": {},
"source": [
"Language combinations, some of which need to be double-checked:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "eb266ffa",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"('en',) 6646\n",
"('xx',) 446\n",
"('id',) 2605\n",
"('pt',) 1134\n",
"('es',) 852\n",
"('en', 'id') 4417\n",
"('id', 'xx') 121\n",
"('en', 'es') 1762\n",
"('en', 'xx') 2164\n",
"('en', 'pt') 821\n",
"('es', 'pt') 192\n",
"('es', 'xx') 56\n",
"('id', 'pt') 4\n",
"('pt', 'xx') 36\n",
"('es', 'pt', 'xx') 10\n",
"('en', 'id', 'xx') 335\n",
"('en', 'es', 'pt') 550\n",
"('en', 'pt', 'xx') 41\n",
"('en', 'es', 'xx') 218\n",
"('en', 'es', 'id') 47\n",
"('en', 'id', 'pt') 30\n",
"('id', 'pt', 'xx') 1\n",
"('en', 'es', 'pt', 'xx') 49\n",
"('en', 'es', 'id', 'pt') 12\n",
"('en', 'es', 'id', 'xx') 10\n"
]
}
],
"source": [
"for k in sorted(d, key=len, reverse=False):\n",
" print(k, d[k])"
]
},
{
"cell_type": "markdown",
"id": "c0d21861",
"metadata": {},
"source": [
"These ISSNs feature unusual combinations of Indonesian, Spanish, and Portuguese classifications. Each was manually checked by querying issn.org: https://portal.issn.org/"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "ec1d29ca",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['2599-1353', '2253-900X', '2356-1955', '1510-5091', '2722-6689', '2528-2344', '2318-5422', '2526-6675', '2613-9812', '2580-4553', '2443-3187', '2721-4192', '2723-3367', '2747-0733', '2621-3559', '2086-9754', '2745-8563', '2548-3366', '1411-545X', '0100-1965', '2745-5955', '2447-6536', '1907-5995', '2448-8232', '2145-888X', '1858-0262', '2615-5850', '2477-250X', '2655-2515', '2656-1832', '2614-7904', '1576-3420', '1693-6191', '2745-7168', '2655-6812', '2709-4685', '2722-9017', '1018-5674', '2597-7989', '2579-8766', '0121-2923', '2526-110X', '2599-3224', '2086-8162', '2151-2612', '2716-0807', '1411-4143', '0430-5027', '2715-0658', '2715-4882', '2301-9263', '2777-0362', '1858-2400', '2621-4148', '1679-1010', '2660-4418', '2615-6881', '2477-3557', '2655-7533', '0187-0173', '2656-2022', '2221-755X', '2359-0033', '2614-512X', '0211-111X', '2723-9535', '2460-1780', '2216-0973', '2589-8019', '2166-7918', '2078-1938', '2599-0136', '2412-4338', '2656-1794', '2595-9980', '2683-2100', '2318-4507', '2215-7794', '2656-3371', '2747-1292', '2746-8100', '2654-4172', '2254-6235', '1679-6101', '2686-5807', '2722-8002', '1412-7229', '2722-3620', '2686-6277', '2085-1456', '2371-2376', '2715-5919', '2318-8065', '2776-0081', '2477-0515', '2599-0837', '2716-1765', '2086-8065', '0121-3628', '2007-7831', '2007-7637', '2502-731X', '2183-8976']\n"
]
}
],
"source": [
"print(id_check)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "NLU",
"language": "python",
"name": "nlu"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}