# Finding non-English newspapers in Trove

There are a growing number of non-English newspapers digitised in Trove. However, if you're only searching using English keywords, you might never know that they're there. I thought it would be useful to generate a list of non-English newspapers, but it wasn't quite as straightforward as I thought.

## How not to do it...

My first thought was I could start by searching for digitised newspapers amongst the library records in Trove. My theory was that catalogue metadata would include language information. For example, you can search for newspapers using `format:Periodical/Newspaper` in the books and libraries category (or the `article` API zone). To find those that are digitised, you can add a search for 'trove.nla.gov.au'. Here's the [sort of results](https://trove.nla.gov.au/search/category/books?keyword=%22trove.nla.gov.au%22%20format%3APeriodical%2FNewspaper) you get. Unfortunately, you only get about 826 results and there are many more newspapers than that in Trove. It seems links to digitised newspapers are not consistently recorded.

My second approach was to get the list of digitised newspapers from the API, extract the ISSN, then use this to search for catalogue records. Here's the code snippet I used.

``` python
params = {
    'zone': 'article',
    'encoding': 'json',
    'l-format': 'Periodical/Newspaper',
    'reclevel': 'full',
    'key': TROVE_API_KEY
}
newspapers = get_newspapers()
for newspaper in newspapers:
    print(f'\n{newspaper["title"]}')
    issn = newspaper.get('issn')
    params['q'] = f'issn:{issn}'
    response = s.get('https://api.trove.nla.gov.au/v2/result', params=params)
    data = response.json()
    try:
        works = data['response']['zone'][0]['records']['work']
    except KeyError:
        print('Not found')
    else:
        for work in works:
            print(work.get('language'))
    if not response.from_cache:
        time.sleep(0.2)
```

The main problem here is that not all titles have ISSNs. You could try searching on the titles if there's no ISSN, but this would involve a fair bit of disambiguation. In any case, in running this I discovered that while there is some language information in the metadata, it's not consistently applied. So basically a metadata-only approach is not going to work. Sigh...
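As a small illustration of the missing-ISSN problem: in the snippet above, a title without an ISSN ends up searching for the literal string `issn:None`. Here's a minimal sketch of a guard that skips those titles instead. The `issn_query` helper and the sample records are hypothetical, made up for this example rather than taken from the Trove API.

``` python
def issn_query(newspaper):
    """Return an 'issn:' query string, or None when the record has no ISSN."""
    issn = newspaper.get('issn')
    if not issn:
        # No ISSN recorded, so there's nothing reliable to search on
        return None
    return f'issn:{issn}'


# Made-up records for illustration only
with_issn = {'title': 'Example Times', 'issn': '12345678'}
without_issn = {'title': 'Example Gazette'}

print(issn_query(with_issn))
print(issn_query(without_issn))
```

Titles that return `None` would still need some other way of finding their catalogue records, which is where the disambiguation problem comes in.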

## How I actually did it

If I couldn't get language details from metadata, then I had to try and extract it from the resource itself. I spent quite a bit of time looking around for Python packages that provided reliable language detection. The first one I tried regularly identified Mandarin as Korean (it turns out this was a known issue). Another one sent me into dependency hell. Finally I found [pycld3](https://pypi.org/project/pycld3/) which installed with `pip`, and *just worked*.

My plan was to get the list of newspapers via the API as before, then fire off an empty search for each one. I'd then loop through the results, running the language detector over the article text. I set the query parameters to retrieve the maximum number of results in one request – 100. That seemed like a reasonable sample. To try and provide a big enough amount of text for the language detector to work with, I set the number of words parameter to return articles with between 100 and 1000 words. So the query parameters I used were:

``` python
params = {
    'zone': 'newspaper',
    'encoding': 'json',
    'l-word': '100 - 1000 Words',
    'include': 'articletext',
    'key': TROVE_API_KEY,
    'q': ' ',
    'n': 100,
}
```
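The harvesting code further down limits each search to a single newspaper by adding the title's id as an `l-title` parameter. Here's a minimal sketch of that step that copies the shared parameters rather than changing them in place. The `params_for_title` helper is a hypothetical name, and the API key is just a placeholder.

``` python
def params_for_title(base_params, title_id):
    """Return a copy of the shared query parameters limited to one newspaper."""
    # Copy so the shared dictionary isn't modified between requests
    title_params = dict(base_params)
    # 'l-title' is the facet the harvesting code uses to select one title
    title_params['l-title'] = title_id
    return title_params


base_params = {
    'zone': 'newspaper',
    'encoding': 'json',
    'l-word': '100 - 1000 Words',
    'include': 'articletext',
    'key': 'YOUR_API_KEY',  # placeholder, not a real key
    'q': ' ',
    'n': 100,
}

title_params = params_for_title(base_params, '1583')
print(title_params['l-title'])
print('l-title' in base_params)
```

Copying the dictionary is just a design choice to avoid one title's settings leaking into the next request. Updating the dictionary in a loop, as the notebook does, works as well because the key is overwritten each time.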

Because some of the newspapers had short runs and the word count filter limits the results, I wasn't always getting 100 results per newspaper. To work around this I found the likely language for each article, counted the articles per language, and then divided each count by the number of articles with a reliable detection. This gave me the proportion of articles in each language, a number I could compare across newspapers to find the non-English titles.
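To make the arithmetic concrete, here's a minimal sketch of the aggregation step using `collections.Counter`. The `language_proportions` helper and the sample list of language codes are made up for illustration. In the real harvest the codes come from the language detector.

``` python
from collections import Counter


def language_proportions(langs):
    """Return {language code: share of articles} for a list of detected codes."""
    total = len(langs)
    if total == 0:
        # No reliable detections for this newspaper
        return {}
    return {lang: count / total for lang, count in Counter(langs).items()}


# Made-up sample: 50 articles with a reliable detection
sample = ['en'] * 40 + ['it'] * 9 + ['mt']
proportions = language_proportions(sample)

for lang, prop in sorted(proportions.items(), key=lambda item: -item[1]):
    print(lang, round(prop, 2))
```

Because the counts are divided by the number of detections rather than a fixed 100, a newspaper with only 50 usable articles can still be compared with one that has the full sample.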

In general this worked pretty well, and the result was a [list of 48 newspapers](non-english-newspapers.md) (also as a [Gist](https://gist.github.com/wragge/9aa385648cff5f0de0c7d4837896df97)) that have significant amounts of non-English content. However, I had to do a fair bit of fiddling to filter out dodgy results. All the details are included below.

## Problems / limitations

* It's no surprise that the results of the language detection are affected by the quality of the OCR. 
* In filtering out what seems to be the product of dodgy OCR, it's possible that I might be excluding some non-English content. 
* I'm only detecting the predominant language for each article, so there might be articles containing a mix of languages that are being missed. 
* I'm just taking the first 100 results from a blank search in each newspaper. Larger, or more randomised samples might produce different results.
* Some dodgy detection results remain in the list of newspapers, but the point of this exercise was to find non-English newspapers. If you wanted to accurately determine the quantity of non-English content, you'd have to do a lot more fine-grained analysis.
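Since OCR quality matters so much, the harvesting code further down does a little cleaning before running the detector: it strips markup and collapses runs of whitespace. Here's a stand-alone sketch of that step using the same two regular expressions. The `clean_ocr_text` name, the extra `strip()`, and the sample string are my own additions for illustration.

``` python
import re


def clean_ocr_text(text):
    """Strip HTML tags and collapse runs of whitespace in OCRd article text."""
    # Remove anything that looks like a tag, such as <p> or </span>
    text = re.sub(r'<[^<]+?>', '', text)
    # Replace two or more whitespace characters with a single space
    text = re.sub(r'\s\s+', ' ', text)
    # Not in the notebook code: also trim leading and trailing whitespace
    return text.strip()


raw = '<p>The  Mildura\n\n Irrigationist</p> <p>Vol. 1</p>'
print(clean_ocr_text(raw))
```

This only tidies the text. It can't repair badly recognised characters, so the limitations above still apply.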

## Import what we need

In [1]:
import requests
import time
import requests_cache
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from collections import Counter
import re
from langdetect import detect
from tqdm.auto import tqdm
import pandas as pd
import cld3
import pycountry
from language_tags import tags
import altair as alt
from pathlib import Path

s = requests_cache.CachedSession()
retries = Retry(total=5, backoff_factor=1, status_forcelist=[ 502, 503, 504 ])
s.mount('https://', HTTPAdapter(max_retries=retries))
s.mount('http://', HTTPAdapter(max_retries=retries))

In [2]:
TROVE_API_KEY = '[YOUR API KEY]'

## Harvest the data and run language detection on articles

In [3]:
def get_newspapers():
    '''
    Get a list of newspapers in Trove.
    '''
    response = s.get('https://api.trove.nla.gov.au/v2/newspaper/titles', params={'encoding': 'json', 'key': TROVE_API_KEY})
    data = response.json()
    return data['response']['records']['newspaper']

In [4]:
params = {
    'zone': 'newspaper',
    'encoding': 'json',
    #'l-category': 'Article',
    'l-word': '100 - 1000 Words',
    'include': 'articletext',
    'key': TROVE_API_KEY,
    'q': ' ',
    'n': 100,
}
newspaper_langs = []
newspapers = get_newspapers()
for newspaper in tqdm(newspapers):
    langs = []
    # print(f'\n{newspaper["title"]}')
    params['l-title'] = newspaper['id']
    response = s.get('https://api.trove.nla.gov.au/v2/result', params=params)
    data = response.json()
    n = data['response']['zone'][0]['records']['n']
    try:
        articles = data['response']['zone'][0]['records']['article']
    except KeyError:
        # print('Not found')
        pass
    else:
        # Detect language for each article in results
        for article in articles:
            if 'articleText' in article:
                # Clean up OCRd text by removing tags and extra whitespace
                text = article['articleText']
                text = re.sub(r'<[^<]+?>', '', text)
                text = re.sub(r'\s\s+', ' ', text)
                # Get the language
                ld = cld3.get_language(text)
                # If the language prediction is reliable, save it
                if ld.is_reliable:
                    langs.append(ld.language)
        # Find the count of each language detected in the sample of articles
        for lang, count in dict(Counter(langs)).items():
            # Calculate the language count as a proportion of the articles with a reliable language prediction
            prop = int(count) / len(langs)
            newspaper_langs.append({'id': newspaper['id'], 'title': newspaper['title'], 'language': lang, 'proportion': prop, 'number': n})
    if not response.from_cache:
        time.sleep(0.2)
            

Convert the results into a dataframe.

In [5]:
df = pd.DataFrame(newspaper_langs)
df.head()

| | id | title | language | proportion | number |
|---|---|---|---|---|---|
| 0 | 166 | Canberra Community News (ACT : 1925 - 1927) | en | 1.0 | 100 |
| 1 | 165 | Canberra Illustrated: A Quarterly Magazine (AC... | en | 1.0 | 29 |
| 2 | 69 | Federal Capital Pioneer (Canberra, ACT : 1924 ... | en | 1.0 | 100 |
| 3 | 871 | Good Neighbour (ACT : 1950 - 1969) | en | 1.0 | 100 |
| 4 | 665 | Student Notes/Canberra University College Stud... | en | 1.0 | 100 |


## Add full language names

The language detector returns BCP-47-style language codes. To translate these into something that's a bit easier for humans to understand, we can use the [language-tags](https://github.com/OnroerendErfgoed/language-tags) package.

In [50]:
def get_full_language(lc):
    '''
    Get full language names from codes
    '''
    lang = tags.description(lc)
    if lang:
        return lang[0]
    else:
        print(lc)
        return lc

df['language_full'] = df['language'].apply(get_full_language)

## Filtering the results

If we just look at the numbers of languages detected we might think that Australia's cultural diversity was much greater than we expected! But the likelihood that there were ten newspapers publishing articles in Igbo (the language of the Igbo people in south-eastern Nigeria) seems small. Obviously there are a considerable number of false positives here.

In [59]:
df['language_full'].value_counts()

English                  1565
Maltese                   279
Catalan                    53
Welsh                      35
Japanese                   31
Italian                    31
Somali                     24
Norwegian                  23
Danish                     17
German                     16
Samoan                     10
Igbo                       10
Portuguese                  9
French                      9
Chinese                     8
Estonian                    8
Scottish Gaelic             8
Luxembourgish               8
Vietnamese                  7
Western Frisian             7
Hawaiian                    7
Russian                     6
Modern Greek (1453-)        5
Swedish                     5
Filipino                    5
Afrikaans                   4
Javanese                    4
Indonesian                  4
Polish                      4
Hindi                       4
Bulgarian                   4
Corsican                    4
Dutch                       3
Malagasy  

Remember that for each language detected in a newspaper we calculated the proportion of articles in our results set in that language. So we can, for example, just look at newspapers where 100% of the articles are in a single language. This highlights a few non-English language newspapers, but obviously we're missing a lot of others.

In [70]:
df.loc[df['proportion'] == 1]['language_full'].value_counts()

English                 1112
Italian                    3
German                     3
Modern Greek (1453-)       1
Portuguese                 1
Estonian                   1
Name: language_full, dtype: int64

If we chart the proportions, we see them bunched up at either end of the scale. So there are lots of languages detected in only a small proportion of articles.

In [66]:
alt.Chart(df).mark_bar().encode(
    x=alt.X('proportion:Q', bin=True),
    y='count():Q'
)

If we zoom in on the proportions less than 0.1 (that's 10 articles in a sample of 100) we see that they're mostly at the bottom of the range, around 0.01 (or 1 article in 100). It seems likely that these are false positives.

In [72]:
alt.Chart(df.loc[df['proportion'] < 0.1]).mark_bar().encode(
    x=alt.X('proportion:Q', bin=True),
    y='count():Q'
)

Let's be fairly conservative and filter out languages that have a proportion (per newspaper) less than 0.05. This list seems a bit more in line with what we would expect, but there are still some surprises – 48 newspapers published articles in Maltese?

In [74]:
df.loc[df['proportion'] >= 0.05]['language_full'].value_counts()

English                  1559
Maltese                    48
Italian                    14
German                      9
Chinese                     8
Catalan                     6
Somali                      5
Modern Greek (1453-)        4
Japanese                    3
Portuguese                  3
Polish                      3
Western Frisian             2
Dutch                       2
French                      2
Spanish                     1
Ukrainian                   1
Malay (macrolanguage)       1
Welsh                       1
Indonesian                  1
Russian                     1
Danish                      1
Scottish Gaelic             1
Bosnian                     1
Estonian                    1
Vietnamese                  1
Macedonian                  1
Lithuanian                  1
Bulgarian                   1
Samoan                      1
Name: language_full, dtype: int64
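The same threshold filter can be written without pandas, which might make the logic easier to follow. This is a minimal sketch, assuming each result is a dictionary with `language` and `proportion` keys. The `filter_by_proportion` helper and the sample rows are made up for illustration.

``` python
from collections import Counter


def filter_by_proportion(rows, threshold=0.05):
    """Keep only the language rows at or above the proportion threshold."""
    return [row for row in rows if row['proportion'] >= threshold]


# Made-up rows: one per language detected in each newspaper
rows = [
    {'title': 'Example Gazette', 'language': 'en', 'proportion': 0.97},
    {'title': 'Example Gazette', 'language': 'mt', 'proportion': 0.03},
    {'title': 'Example Zeitung', 'language': 'de', 'proportion': 1.0},
]

kept = filter_by_proportion(rows)
# Rough equivalent of value_counts(): how many newspapers per language
language_counts = Counter(row['language'] for row in kept)
print(language_counts)
```

Here the Maltese row for the made-up *Example Gazette* falls below the threshold and drops out, which is the behaviour we want for stray false positives.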

If we focus in on the newspapers that supposedly have a significant proportion of articles in Maltese, we see some very strange results. I seriously doubt that 80% of the *Mildura Irrigationist* from 1892-3 is in Maltese. So what's going on?

In [76]:
df.loc[(df['proportion'] > 0.1) & (df['language_full'] == 'Maltese')]

| | id | title | language | proportion | number | language_full |
|---|---|---|---|---|---|---|
| 218 | 1596 | L'Italo-Australiano = The Italo-Australian (Su... | mt | 0.222222 | 100 | Maltese |
| 308 | 623 | Sunday News (Sydney, NSW : 1919) | mt | 0.219178 | 100 | Maltese |
| 400 | 224 | The Castlereagh (Gilgandra, NSW : 1905 - 1907) | mt | 0.105882 | 100 | Maltese |
| 568 | 500 | The Richmond River Express and Casino Kyogle A... | mt | 0.168675 | 100 | Maltese |
| 637 | 452 | The Sydney Wool and Stock Journal (NSW : 1899 ... | mt | 0.233766 | 100 | Maltese |
| 710 | 394 | Twofold Bay and Maneroo Observer (NSW : 1860) | mt | 0.139535 | 100 | Maltese |
| 719 | 810 | Upper Hunter Courier (Murrurundi, NSW : 1871) | mt | 0.142857 | 14 | Maltese |
| 834 | 1207 | The Coolangatta Chronicle (Qld. : 1926) | mt | 0.130435 | 26 | Maltese |
| 884 | 892 | Warwick Daily News (Qld. : 1919 -1954) | mt | 0.139241 | 100 | Maltese |
| 1028 | 34 | The Advertiser (Adelaide, SA : 1889 - 1931) | mt | 0.486111 | 100 | Maltese |


If you look at results for the *Mildura Irrigationist* [in Trove](https://trove.nla.gov.au/search/advanced/category/newspapers?l-advtitle=1583&l-advWord=100%20-%201000%20Words) you'll see that many of the page images are blurry, and as a result the OCR is very, very bad. Here's a sample:

> ill Tatr W lyltwililUmt aat aa«v aa MwOkaWtOPMlkMrf faiflftMMRltitlWBfMNM fmiMW^M^K IMIOHIpM^fQBMMI ft tWMmrwl tWWiltjfNMStW ffw aailwt«M wtMitiar«lH*a ifcmH af tlw ial«««l ion «M««f ffantoif wwtMaaM. tto tf h «frwringmhw torf M hr toaiy. Im*4. ar, fc> mmirf awlUW wefllaM aA. aaytMaa. l «Wa A tfc» tow waliw Macks b aaM, b wil fVfbH Ja ^IMntaam* Mm' ls tolliac. rt Tto aad nf ttoar UhKMimiw*a afM» ftjrwl ans W l OtfWOar jpaaofTwSi aJwwr la'aahS^*— attor aakwt mm rvfimMiMh* ttoai. day - Why. aa IH thrf t«fl almd yaa."iw. aal wwifciha m OiO all tto laM amnavaA, fawawNl I r aa4 f wa* tm enr a Mtcfc tto watrr tto wiaaal m a* a* day pfaMat. aa4 (h* ilj amintir* ilm tTtsjtvL.f**' ""j •fria—lhati* tow ««4M k." tlml t | r 4m» wtn .aa rUa* I h ha«« t ctoantaf InMM* aM*toclt ttopnaMaf II It la Mat rtgM, t jmi awl a 1 : af but d awtliqg a Mr. Jafc Matwa-(MMa M t «wl y gha yaar «toa anl yaar (ma as «fpai ta af <M>t«l. i pwwiaf Mtan (tot jw. twy MwUI «*a1 a«ry ftajr «ndl tar tlw aad annaH* a*«r aarf a««r aaria. tiaa

What happens when we feed this fragment of bad OCR to the language detector? Remarkably, the language detector is 96% sure that it's Maltese! To find out why this is the case, we'd probably have to dig into the way the language detection model was trained. But for our purposes it's enough to know that some of the languages detected seem to be the result of bad OCR.

In [79]:
ocr = '''ill Tatr W lyltwililUmt aat aa«v aa MwOkaWtOPMlkMrf faiflftMMRltitlWBfMNM fmiMW^M^K IMIOHIpM^fQBMMI ft tWMmrwl tWWiltjfNMStW ffw aailwt«M wtMitiar«lH*a ifcmH af tlw ial«««l ion «M««f ffantoif wwtMaaM. tto tf h «frwringmhw torf M hr toaiy. Im*4. ar, fc> mmirf awlUW wefllaM aA. aaytMaa. l «Wa A tfc» tow waliw Macks b aaM, b wil fVfbH Ja ^IMntaam* Mm' ls tolliac. rt Tto aad nf ttoar UhKMimiw*a afM» ftjrwl ans W l OtfWOar jpaaofTwSi aJwwr la'aahS^*— attor aakwt mm rvfimMiMh* ttoai. day - Why. aa IH thrf t«fl almd yaa."iw. aal wwifciha m OiO all tto laM amnavaA, fawawNl I r aa4 f wa* tm enr a Mtcfc tto watrr tto wiaaal m a* a* day pfaMat. aa4 (h* ilj amintir* ilm tTtsjtvL.f**' ""j •fria—lhati* tow ««4M k." tlml t | r 4m» wtn .aa rUa* I h ha«« t ctoantaf InMM* aM*toclt ttopnaMaf II It la Mat rtgM, t jmi awl a 1 : af but d awtliqg a Mr. Jafc Matwa-(MMa M t «wl y gha yaar «toa anl yaar (ma as «fpai ta af <M>t«l. i pwwiaf Mtan (tot jw. twy MwUI «*a1 a«ry ftajr «ndl tar tlw aad annaH* a*«r aarf a««r aaria. tiaa'''
cld3.get_language(ocr)

LanguagePrediction(language='mt', probability=0.960280179977417, is_reliable=True, proportion=1.0)

Of course there might actually be newspapers with articles in Maltese, so we don't want to filter them all out. So let's do some manual inspection of the newspapers that *seem* to have non-English content. First we'll filter our results to include only languages with proportions of 0.05 or more, and then drop out newspapers that seem to be only in English. We end up with 105 different titles.

In [89]:
# The filter on the groupby drops out newspapers that only have articles in English.
filtered = df.loc[df['proportion'] >= 0.05].groupby(by=['title', 'id']).filter(lambda x: (len(x) > 1) or (x['language'].iloc[0] != 'en'))
papers = filtered.groupby(by=['title', 'id'])
len(papers)

105
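The group filter above keeps a newspaper if, after applying the threshold, it has more than one language row or a single row that isn't English. Here's a plain-Python sketch of the same rule, assuming one row per language per newspaper. The `titles_with_non_english` helper and the sample rows are made up for illustration.

``` python
from collections import defaultdict


def titles_with_non_english(rows, threshold=0.05):
    """Return {(title, id): languages} for newspapers that aren't English-only."""
    langs_by_title = defaultdict(set)
    for row in rows:
        # Ignore languages below the proportion threshold
        if row['proportion'] >= threshold:
            langs_by_title[(row['title'], row['id'])].add(row['language'])
    # Drop newspapers whose only remaining language is English
    return {key: langs for key, langs in langs_by_title.items() if langs != {'en'}}


# Made-up rows for three newspapers
rows = [
    {'id': '1', 'title': 'Example Gazette', 'language': 'en', 'proportion': 0.98},
    {'id': '1', 'title': 'Example Gazette', 'language': 'mt', 'proportion': 0.02},
    {'id': '2', 'title': 'Example Zeitung', 'language': 'de', 'proportion': 1.0},
    {'id': '3', 'title': 'Example Bulletin', 'language': 'en', 'proportion': 0.84},
    {'id': '3', 'title': 'Example Bulletin', 'language': 'it', 'proportion': 0.16},
]

for (title, title_id), langs in titles_with_non_english(rows).items():
    print(title, title_id, sorted(langs))
```

With one row per language, "more than one row" and "not just English" amount to the same test, which is why the sketch can simply compare the set of languages with `{'en'}`.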

Let's list those 105 newspapers. From the list below, I think it's pretty easy to pick out the results that are likely to be the product of bad OCR.

In [86]:
for n, l in papers:
    if not l.loc[(~df['language'].isin(['en'])) & (df['proportion'] >= 0.05)].empty:
        print(f'\n{n[0]} ({n[1]})')
        display(l[['language_full', 'language', 'proportion']].loc[(l['proportion'] > 0.05)].sort_values(by='proportion', ascending=False))


A Voz de Timor (Dili, East Timor : 1970 - 1975) (1498)


Unnamed: 0,language_full,language,proportion
9,Portuguese,pt,1.0



Adelaide Chronicle and South Australian Literary Record (SA : 1840 - 1842) (986)


Unnamed: 0,language_full,language,proportion
894,English,en,0.929293
893,Catalan,ca,0.070707



Adelaide Independent and Cabinet of Amusement (SA : 1841) (1336)


Unnamed: 0,language_full,language,proportion
895,English,en,0.928571
897,Catalan,ca,0.061224



Adelaider Deutsche Zeitung (SA : 1851 - 1862) (277)


Unnamed: 0,language_full,language,proportion
904,German,de,1.0



Auburn and District News (NSW : 1929) (1320)


Unnamed: 0,language_full,language,proportion
40,English,en,0.947368
41,Vietnamese,vi,0.052632



Australische Zeitung (Adelaide, SA : 1875 - 1916) (1150)


Unnamed: 0,language_full,language,proportion
908,German,de,1.0



Bangkok Recorder (Thailand : 1865 - 1867) (1488)


Unnamed: 0,language_full,language,proportion
10,English,en,0.925532
11,Maltese,mt,0.053191



Berita Repoeblik (Djakarta, Indonesia : 1945 - 1946) (1283)


Unnamed: 0,language_full,language,proportion
14,Malay (macrolanguage),ms,0.891304
15,Indonesian,id,0.108696



Bulong Bulletin and Mining Register (WA : 1897 - 1898) (1400)


Unnamed: 0,language_full,language,proportion
1813,English,en,0.913043
1814,Maltese,mt,0.086957



Chinese Republic News (Sydney, NSW : 1914 - 1937) (1186)


Unnamed: 0,language_full,language,proportion
82,Chinese,zh,0.945652



Chinese Times (Melbourne, Vic. : 1902 - 1922) (705)


Unnamed: 0,language_full,language,proportion
1304,Chinese,zh,0.843373



Chronicle and North Coast Advertiser (Qld. : 1903 - 1922) (286)


Unnamed: 0,language_full,language,proportion
765,English,en,0.93617
766,Maltese,mt,0.06383



Chung Wah News (Perth, WA : 1981 - 1987) (1383)


Unnamed: 0,language_full,language,proportion
1831,English,en,0.637363
1830,Chinese,zh,0.263736



Colac Reformer (Vic. : 1914 - 1918) (763)


Unnamed: 0,language_full,language,proportion
1324,English,en,0.947917
1325,Maltese,mt,0.052083



Daily Post (Hobart, Tas. : 1908 - 1918) (860)


Unnamed: 0,language_full,language,proportion
1114,English,en,0.704545
1113,Japanese,ja,0.125



Der Australische Spiegel = The Australian Mirror (Perth, WA : 1952) (1385)


Unnamed: 0,language_full,language,proportion
1856,German,de,0.83
1857,English,en,0.17



Deutsch-Australische Post : Wochenschrift = German-Australian Post : Weekly (Sydney, NSW : 1893 - 1906) (1600)


Unnamed: 0,language_full,language,proportion
126,German,de,1.0



Deutsche Zeitung für Sud-Australien = German Times for South Australia (Tanunda, SA : 1851) (1577)


Unnamed: 0,language_full,language,proportion
922,German,de,0.9
921,English,en,0.1



Die Brucke = The Bridge (Sydney, NSW : 1934 - 1939) (1591)


Unnamed: 0,language_full,language,proportion
127,German,de,0.729167
128,English,en,0.270833



Die Deutsche Post für die Australischen Colonien = The German Australian Post (Adelaide, SA : 1848 - 1851) (1576)


Unnamed: 0,language_full,language,proportion
923,German,de,0.989691



Dutch Australian Weekly (Sydney, NSW : 1951 - 1993) (1044)


Unnamed: 0,language_full,language,proportion
132,Dutch,nl,0.882979
133,English,en,0.106383



Dutch Weekly (Sydney, NSW : 1993 - 2004) (1045)


Unnamed: 0,language_full,language,proportion
135,Dutch,nl,0.924731
136,English,en,0.053763



Echo : Polski Tygodnik Niezalezny (Perth, WA : 1950 - 1952) (1384)


Unnamed: 0,language_full,language,proportion
1862,Polish,pl,0.91
1863,English,en,0.09



Eco Italiano (Perth, WA : 1958 - 1959) (1387)


Unnamed: 0,language_full,language,proportion
1864,Italian,it,1.0



Emu Bay Times and North West and West Coast Advocate (Tas. : 1897 - 1899) (116)


Unnamed: 0,language_full,language,proportion
1130,English,en,0.929412
1131,Maltese,mt,0.070588



Evelyn Observer, and South and East Bourke Record (Vic. : 1882 - 1902) (145)


Unnamed: 0,language_full,language,proportion
1358,English,en,0.913978
1357,Maltese,mt,0.075269



Geelong Advertiser (Vic. : 1840 - 1845) (292)


Unnamed: 0,language_full,language,proportion
1379,English,en,0.904255
1378,Samoan,sm,0.074468



Geraldton Advocate and Johnstone River Guardian (Qld. : 1895 - 1896) (1103)


Unnamed: 0,language_full,language,proportion
774,English,en,0.910112
775,Maltese,mt,0.089888



Geraldton Express and Murchison Goldfields News (WA : 1894 - 1896) (1623)


Unnamed: 0,language_full,language,proportion
1875,English,en,0.661538
1879,Maltese,mt,0.076923
1876,Japanese,ja,0.061538



Guang yi hua bao = The Chinese Australian Herald (Sydney, NSW : 1894 - 1923) (704)


Unnamed: 0,language_full,language,proportion
170,Chinese,zh,0.80303
173,Western Frisian,fy,0.075758



Hamilton Spectator and Grange District Advertiser (South Melbourne, Vic. : 1860 - 1870) (927)


Unnamed: 0,language_full,language,proportion
1410,English,en,0.921348
1409,Maltese,mt,0.078652



Healesville Guardian (Vic. : 1893 - 1898) (140)


Unnamed: 0,language_full,language,proportion
1415,English,en,0.938144
1416,Maltese,mt,0.051546



Hellenic Echo (Perth, WA : 1967 - 1968) (1389)


Unnamed: 0,language_full,language,proportion
1917,Modern Greek (1453-),el,1.0



Il Canguro = The Kangaroo (Perth, WA : 1955 - 1957) (1378)


Unnamed: 0,language_full,language,proportion
1919,Italian,it,0.97



Il Giornale Italiano (Sydney, NSW : 1932 - 1940) (279)


Unnamed: 0,language_full,language,proportion
186,Italian,it,0.92
187,English,en,0.08



Il Risveglio = The Awakening (Sydney, NSW : 1944 - 1954) (1601)


Unnamed: 0,language_full,language,proportion
188,Italian,it,0.777778
189,English,en,0.222222



Inglewood Advertiser (Vic. : 1914 - 1918) (570)


Unnamed: 0,language_full,language,proportion
1435,English,en,0.936842
1436,Maltese,mt,0.063158



Italian Bulletin of Australia (Sydney, NSW : 1922 - 1928, 1935 - 1940) (1602)


Unnamed: 0,language_full,language,proportion
199,English,en,0.840426
200,Italian,it,0.159574



Italian Bulletin of Commerce (Sydney, NSW : 1929 - 1935) (1603)


Unnamed: 0,language_full,language,proportion
201,English,en,0.903226
202,Italian,it,0.096774



Italo-Australian (Sydney, NSW : 1927 - 1940) (1595)


Unnamed: 0,language_full,language,proportion
203,Italian,it,0.909091
204,English,en,0.090909



Japanese Perth Times (Subiaco, WA : 1989 - 1996) (1386)


Unnamed: 0,language_full,language,proportion
1924,Japanese,ja,0.93617
1925,English,en,0.06383



Katoomba Times (NSW : 1889 - 1894) (906)


Unnamed: 0,language_full,language,proportion
207,English,en,0.934066
209,Maltese,mt,0.054945



Kyabram Union (Vic. : 1886 - 1894) (196)


Unnamed: 0,language_full,language,proportion
1456,English,en,0.921348
1457,Maltese,mt,0.05618



L'Italo-Australiano = The Italo-Australian (Surry Hills, NSW : 1885) (1596)


Unnamed: 0,language_full,language,proportion
217,Italian,it,0.68254
218,Maltese,mt,0.222222



L'Italo-Australiano = The Italo-Australian (Sydney, NSW : 1905 - 1909) (1597)


Unnamed: 0,language_full,language,proportion
223,Italian,it,0.95



La Rondine (Perth, WA : 1969 - 1994) (1388)


Unnamed: 0,language_full,language,proportion
1942,Italian,it,0.928571
1943,English,en,0.071429



Laura Standard and Crystal Brook Courier (SA : 1917 - 1948) (926)


Unnamed: 0,language_full,language,proportion
940,English,en,0.931034
941,Maltese,mt,0.068966



Le Courrier Australien (Sydney, NSW : 1892 - 2011) (829)


Unnamed: 0,language_full,language,proportion
228,French,fr,0.816327
229,English,en,0.173469



Mediterranean Voice (Perth, WA : 1971 - 1972) (1390)


Unnamed: 0,language_full,language,proportion
1961,Modern Greek (1453-),el,0.375
1955,English,en,0.28125
1962,Portuguese,pt,0.104167
1956,French,fr,0.0625
1954,Spanish,es,0.052083



Meie Kodu = Our Home (Sydney, NSW : 1949 - 1956) (280)


Unnamed: 0,language_full,language,proportion
238,Estonian,et,1.0



Murchison Times and Cue-Big Bell-Reedy Advocate (WA : 1937 - 1942) (1543)


Unnamed: 0,language_full,language,proportion
1987,English,en,0.825
1988,Maltese,mt,0.1375



Mu̇sų Pastogė = Our Haven (Sydney, NSW : 1950 - 1954) (1594)


Unnamed: 0,language_full,language,proportion
250,Lithuanian,lt,0.95



Narandera Argus and Riverina Advertiser (NSW : 1893 - 1953) (431)


Unnamed: 0,language_full,language,proportion
254,English,en,0.940476
255,Maltese,mt,0.059524



Narromine News and Trangie Advocate (NSW : 1898 - 1955) (430)


Unnamed: 0,language_full,language,proportion
256,English,en,0.946809
257,Maltese,mt,0.053191



Nasza droga (Adelaide, SA : 1952 - 1954) (1323)


Unnamed: 0,language_full,language,proportion
947,Polish,pl,0.9
948,English,en,0.1



Norden (Melbourne, Vic. : 1914 - 1918) (797)


Unnamed: 0,language_full,language,proportion
1505,English,en,0.467391
1504,Danish,da,0.413043
1506,Maltese,mt,0.065217



North Melbourne Gazette (Vic. : 1894 - 1901) (384)


Unnamed: 0,language_full,language,proportion
1512,English,en,0.829268
1513,Maltese,mt,0.146341



Oceania (Sydney, NSW : 1913 - 1915) (1598)


Unnamed: 0,language_full,language,proportion
270,English,en,0.574468
269,Italian,it,0.425532



Referee (Sydney, NSW : 1886 - 1939) (499)


Unnamed: 0,language_full,language,proportion
284,English,en,0.924242
285,Maltese,mt,0.075758



Reporter and Illawarra Journal (Kiama, NSW : 1887 - 1894) (389)


Unnamed: 0,language_full,language,proportion
286,English,en,0.891566
288,Maltese,mt,0.084337



Ringwood and Croydon Chronicle (Vic. : 1914 - 1918) (329)


Unnamed: 0,language_full,language,proportion
1565,English,en,0.93617
1566,Maltese,mt,0.06383



Rockhampton Bulletin and Central Queensland Advertiser (Qld. : 1861 - 1871) (92)


Unnamed: 0,language_full,language,proportion
814,English,en,0.946237
815,Maltese,mt,0.053763



Sandringham Southern Cross (Vic. : 1914 - 1918) (318)


Unnamed: 0,language_full,language,proportion
1576,English,en,0.65
1577,Maltese,mt,0.3125



Seamen's Strike Bulletin (Melbourne, Vic. : 1919) (1043)


Unnamed: 0,language_full,language,proportion
1584,Polish,pl,0.4
1582,Western Frisian,fy,0.2
1583,Bosnian,bs,0.2
1585,Russian,ru-Latn,0.2



Southern Australian (Adelaide, SA : 1838 - 1844) (171)


Unnamed: 0,language_full,language,proportion
1012,English,en,0.904255
1011,Catalan,ca,0.074468



Southern Morning Herald (Goulburn, NSW : 1920 - 1923) (418)


Unnamed: 0,language_full,language,proportion
304,English,en,0.909091
306,Maltese,mt,0.077922



Stampa Italiana = The Italian Press (Perth, WA : 1931 - 1932) (1380)


Unnamed: 0,language_full,language,proportion
2026,Italian,it,0.97



Suedaustralische Zeitung (Adelaide, SA : 1850 - 1851) (314)


Unnamed: 0,language_full,language,proportion
1022,German,de,0.888889
1023,English,en,0.111111



Sunday News (Sydney, NSW : 1919) (623)


Unnamed: 0,language_full,language,proportion
309,English,en,0.739726
308,Maltese,mt,0.219178



Sunday Times Edizione Italiana (Perth, WA : 1958 - 1959) (1379)


Unnamed: 0,language_full,language,proportion
2031,Italian,it,1.0



Sydney Chronicle (NSW : 1846 - 1848) (94)


Unnamed: 0,language_full,language,proportion
313,English,en,0.923077
314,Maltese,mt,0.076923



Süd Australische Zeitung (Tanunda and Adelaide, SA : 1860 - 1874) (278)


Unnamed: 0,language_full,language,proportion
1020,German,de,0.989691



Tasmanian Evening Herald (Launceston, Tas. : 1878) (1265)


Unnamed: 0,language_full,language,proportion
1154,English,en,0.898876
1153,Maltese,mt,0.067416



The Advertiser (Adelaide, SA : 1889 - 1931) (34)


Unnamed: 0,language_full,language,proportion
1027,English,en,0.513889
1028,Maltese,mt,0.486111



The Argus (Melbourne, Vic. : 1848 - 1957) (13)


Unnamed: 0,language_full,language,proportion
1619,Maltese,mt,0.62963
1620,English,en,0.358025



The Castlereagh (Gilgandra, NSW : 1905 - 1907) (224)


Unnamed: 0,language_full,language,proportion
399,English,en,0.741176
401,Somali,so,0.152941
400,Maltese,mt,0.105882



The Chinese Advertiser (Ballarat, Vic. : 1856) (706)


Unnamed: 0,language_full,language,proportion
1646,Chinese,zh,0.5
1648,English,en,0.333333
1647,Scottish Gaelic,gd,0.166667



The Coolangatta Chronicle (Qld. : 1926) (1207)


Unnamed: 0,language_full,language,proportion
833,English,en,0.869565
834,Maltese,mt,0.130435



The English and Chinese Advertiser (Vic. : 1856 - 1858) (685)


Unnamed: 0,language_full,language,proportion
1664,English,en,0.894737
1665,Chinese,zh,0.052632
1666,Maltese,mt,0.052632



The Goldfields Observer (Kalgoorlie, WA : 1930 - 1939) (1626)


Unnamed: 0,language_full,language,proportion
2095,English,en,0.909091
2097,Maltese,mt,0.051948



The Gwydir Examiner and Moree General Advertiser (NSW : 1898 - 1899) (886)


Unnamed: 0,language_full,language,proportion
466,English,en,0.910112
467,Maltese,mt,0.078652



The Melbourne Advertiser (Vic. : 1838) (935)


Unnamed: 0,language_full,language,proportion
1696,English,en,0.666667
1697,Welsh,cy,0.333333



The Mildura Irrigationist (Vic. : 1892 - 1893) (1583)


Unnamed: 0,language_full,language,proportion
1715,Maltese,mt,0.795455
1714,English,en,0.113636
1716,Somali,so,0.090909



The Mildura Irrigationist and Murray River Agricultural Times (Vic. : 1888) (1581)


Unnamed: 0,language_full,language,proportion
1719,Maltese,mt,0.75
1718,Somali,so,0.132353
1717,English,en,0.117647



The Mildura Irrigationist and Murray River Cultural Advocate (Vic. : 1891 - 1892) (1582)


Unnamed: 0,language_full,language,proportion
1722,English,en,0.52381
1721,Maltese,mt,0.333333
1720,Somali,so,0.126984



The Millicent Times (SA : 1891 - 1905) (970)


Unnamed: 0,language_full,language,proportion
1048,English,en,0.94898
1049,Catalan,ca,0.05102



The News, Shoalhaven, Broughton Creek and Ulladulla Advertiser (NSW : 1875 - 1877) (1678)


Unnamed: 0,language_full,language,proportion
537,English,en,0.913978
538,Catalan,ca,0.086022



The Phillips River Times (Ravensthorpe, WA : 1908 - 1909) (1546)


Unnamed: 0,language_full,language,proportion
2163,English,en,0.9
2164,Maltese,mt,0.1



The Port Phillip Patriot and Morning Advertiser (Vic. : 1845 - 1848) (937)


Unnamed: 0,language_full,language,proportion
1729,English,en,0.894737
1728,Maltese,mt,0.084211



The Richmond River Express and Casino Kyogle Advertiser (NSW : 1904 - 1929) (500)


Unnamed: 0,language_full,language,proportion
570,English,en,0.73494
568,Maltese,mt,0.168675
569,Somali,so,0.072289



The Sydney Wool and Stock Journal (NSW : 1899 - 1917) (452)


Unnamed: 0,language_full,language,proportion
639,English,en,0.727273
637,Maltese,mt,0.233766



The Tasmanian (Launceston, Tas. : 1871 - 1879) (946)


Unnamed: 0,language_full,language,proportion
1216,English,en,0.917808
1217,Maltese,mt,0.082192



The Teetotaller and General Newspaper (Sydney, NSW : 1842 - 1843) (1036)


Unnamed: 0,language_full,language,proportion
642,English,en,0.95



The Voice of Freedom = Elefthera Phoni (Perth, WA : 1956 - 1957) (1381)


Unnamed: 0,language_full,language,proportion
2203,Modern Greek (1453-),el,0.97



To Ethnico Vema = Greek National Tribune (Arncliffe, NSW : 1931 - 1954) (1592)


Unnamed: 0,language_full,language,proportion
690,Modern Greek (1453-),el,0.989362



Tung Wah News (Sydney, NSW : 1898 - 1902) (1185)


Unnamed: 0,language_full,language,proportion
697,Chinese,zh,0.926316



Tung Wah Times (Sydney, NSW : 1901 - 1936) (1184)


Unnamed: 0,language_full,language,proportion
704,Chinese,zh,0.968085



Twofold Bay Telegraph (NSW : 1860) (479)


Unnamed: 0,language_full,language,proportion
715,English,en,0.945652
716,Maltese,mt,0.054348



Twofold Bay and Maneroo Observer (NSW : 1860) (394)


Unnamed: 0,language_full,language,proportion
709,English,en,0.825581
710,Maltese,mt,0.139535



Uniamoci (Sydney, NSW : 1903 - 1904) (1599)


Unnamed: 0,language_full,language,proportion
717,Italian,it,1.0



Upper Hunter Courier (Murrurundi, NSW : 1871) (810)


Unnamed: 0,language_full,language,proportion
718,English,en,0.857143
719,Maltese,mt,0.142857



Vesnik (Perth, WA : 1975 - 1994) (1382)


Unnamed: 0,language_full,language,proportion
2234,Macedonian,mk,0.410526
2233,English,en,0.357895
2235,Bulgarian,bg-Latn,0.221053



Vil'na Dumka = Free Thought (Sydney, NSW : 1949 - 1954) (1593)


Unnamed: 0,language_full,language,proportion
720,Ukrainian,uk,0.82
721,English,en,0.18



Warwick Daily News (Qld. : 1919 -1954) (892)


Unnamed: 0,language_full,language,proportion
883,English,en,0.835443
884,Maltese,mt,0.139241



Williamstown Trade Circular (Vic. : 1855 - 1856) (213)


Unnamed: 0,language_full,language,proportion
1792,English,en,0.875
1793,Portuguese,pt,0.125


I went through the titles above and compiled a list of title identifiers that seem to be producing dodgy results. We can use this to filter these newspapers out of our results.

In [32]:
# Titles where dodgy OCR causes false positives in language detection
# This was manually created after scanning results
dodgy = ['1036', '1043', '1103', '116', '1207', '1265', '13', '1320', '1336', '140', '1400', '145', '1488', '1543', '1546', '1581', '1582', '1583', '1623', '1626', '1678', '171', '196', '213', '224', '286', '292', '318', '329', '34', '384', '389', '394', '418', '430', '431', '452', '479', '499', '500', '570', '623', '763', '810', '860', '886', '892', '906', '92', '926', '927', '935', '937', '94', '946', '970', '986']

Here we'll add the dodgy title ids into our filter. It seems that we have 48 newspapers with significant amounts of non-English content.

In [90]:
# Drop the dodgy titles and any language that makes up less than 5% of a title's sample
significant = df.loc[(~df['id'].isin(dodgy)) & (df['proportion'] >= 0.05)]
# The filter removes titles that only have one language, which is English
filtered = significant.groupby(by=['title', 'id']).filter(lambda x: (len(x) > 1) or (x['language'].iloc[0] != 'en'))
papers = filtered.groupby(by=['title', 'id'])
len(papers)

48
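
To see what this filter is doing, here's a minimal sketch that runs the same steps on a made-up DataFrame. The titles, ids, and proportions are invented for illustration, and `toy`, `toy_dodgy`, and `has_non_english` are hypothetical names; only the column names follow the ones used above. It assumes the `id` values are strings, like the entries in `dodgy`. If they were integers, `isin()` wouldn't match them without a cast such as `.astype(str)`.

```python
import pandas as pd

# Invented sample data -- not real results from Trove
toy = pd.DataFrame(
    [
        ("English Only Gazette", "1", "en", 1.0),
        ("Noisy OCR Times", "2", "en", 0.9),
        ("Noisy OCR Times", "2", "mt", 0.1),
        ("Mostly German Zeitung", "3", "de", 0.97),
        ("Mostly German Zeitung", "3", "en", 0.03),
        ("Bilingual Weekly", "4", "en", 0.6),
        ("Bilingual Weekly", "4", "nl", 0.4),
    ],
    columns=["title", "id", "language", "proportion"],
)
toy_dodgy = ["2"]  # hypothetical list of ids with unreliable OCR


def has_non_english(group):
    """Keep a title if it has more than one language, or a single non-English one."""
    return len(group) > 1 or group["language"].iloc[0] != "en"


toy_filtered = (
    # Remove dodgy titles and languages below the 5% threshold
    toy.loc[(~toy["id"].isin(toy_dodgy)) & (toy["proportion"] >= 0.05)]
    .groupby(by=["title", "id"])
    .filter(has_non_english)
)
toy_titles = sorted(toy_filtered["title"].unique())
print(toy_titles)
```

In this toy data the English-only title and the dodgy title both drop out, leaving 'Bilingual Weekly' and 'Mostly German Zeitung'. Note that the German title's 3% of English falls below the threshold, so the title survives only because its one remaining language isn't English.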

Let's list them.

In [92]:
for n, l in papers:
    print(n[0])

A Voz de Timor (Dili, East Timor : 1970 - 1975)
Adelaider Deutsche Zeitung (SA : 1851 - 1862)
Australische Zeitung (Adelaide, SA : 1875 - 1916)
Berita Repoeblik (Djakarta, Indonesia : 1945 - 1946)
Chinese Republic News (Sydney, NSW : 1914 - 1937)
Chinese Times (Melbourne, Vic. : 1902 - 1922)
Chung Wah News (Perth, WA : 1981 - 1987)
Der Australische Spiegel = The Australian Mirror (Perth, WA : 1952)
Deutsch-Australische Post : Wochenschrift = German-Australian Post : Weekly (Sydney, NSW : 1893 - 1906)
Deutsche Zeitung für Sud-Australien = German Times for South Australia (Tanunda, SA : 1851)
Die Brucke = The Bridge (Sydney, NSW : 1934 - 1939)
Die Deutsche Post für die Australischen Colonien = The German Australian Post (Adelaide, SA : 1848 - 1851)
Dutch Australian Weekly (Sydney, NSW : 1951 - 1993)
Dutch Weekly (Sydney, NSW : 1993 - 2004)
Echo : Polski Tygodnik Niezalezny (Perth, WA : 1950 - 1952)
Eco Italiano (Perth, WA : 1958 - 1959)
Guang yi hua bao = The Chinese Australian Herald 

That's looking pretty good. Let's save the results as a Markdown file to make it easy to explore. We'll include links into Trove. Here's the [list of all 48 newspapers](non-english-newspapers.md) (also as a [Gist](https://gist.github.com/wragge/9aa385648cff5f0de0c7d4837896df97)).

In [97]:
# Write one Markdown section per newspaper, with a table of its detected languages
with open(Path('non-english-newspapers.md'), 'w') as md_file:
    for i, (n, l) in enumerate(papers, start=1):
        # n is the (title, id) group key; l holds that title's language rows
        md_file.write(f'\n### {i}. [{n[0]}](http://nla.gov.au/nla.news-title{n[1]})\n\n')
        md_file.write('| Language | Language code | Proportion of sample |\n')
        md_file.write('|---|---|---|\n')
        languages = l[['language_full', 'language', 'proportion']].loc[(l['proportion'] > 0.05)]
        for row in languages.sort_values(by='proportion', ascending=False).itertuples():
            md_file.write(f'| {row.language_full} | {row.language} | {row.proportion} |\n')

If you look at the Markdown file you'll see that there are still some dodgy results. For example, about 17% of the sample from the *Chinese Advertiser* is detected as 'Scottish Gaelic'. But the point of this exercise was to find non-English newspapers, rather than accurately detect the proportion of non-English content, so I think we can live with it for now.
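
If you did want to chase up these leftovers, one possible starting point is to count how many titles each detected language turns up in. This is only a sketch, assuming a DataFrame shaped like `filtered`; the rows below are invented, and `sample`, `title_counts`, and `singletons` are hypothetical names.

```python
import pandas as pd

# Invented rows shaped like `filtered` -- not real results from Trove
sample = pd.DataFrame(
    [
        ("Paper A", "10", "Chinese", "zh", 0.80),
        ("Paper A", "10", "Scottish Gaelic", "gd", 0.17),
        ("Paper B", "11", "Chinese", "zh", 0.95),
        ("Paper C", "12", "Italian", "it", 1.0),
    ],
    columns=["title", "id", "language_full", "language", "proportion"],
)

# Number of distinct titles in which each language is detected
title_counts = (
    sample.groupby("language_full")["id"].nunique().sort_values(ascending=False)
)
print(title_counts)

# Languages detected in only one title are candidates for checking by eye
singletons = sorted(title_counts[title_counts == 1].index)
print(singletons)
```

A language that appears in a single title isn't necessarily wrong (the Italian paper in this toy data is a perfectly plausible result), so this only narrows down what to check manually.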

----

Created by [Tim Sherratt](https://timsherratt.org/) for the [GLAM Workbench](https://glam-workbench.github.io/).