{ "cells": [ { "cell_type": "code", "execution_count": 5, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%html\n", "\n", "
\n", "" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "%load_ext tikzmagic" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Multilingual and Multicultural NLP" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "+ Multilingual NLP\n", "+ Transfer learning\n", "+ Multicultural NLP" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Linguistic diversity and inclusion\n", "\n", "Language distribution in the world is highly skewed:\n", "\n", "
\n", "*(figure from Joshi et al., 2020)*\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### NLP across the resource landscape\n", "\n", "Resource distribution is skewed as well:\n", "
\n", "*(figure from Joshi et al., 2020)*\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "\n", "\n", "Source: http://ai.stanford.edu/blog/weak-supervision/" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "## How to get more labeled training data in your language?\n", "\n", "* Translating the training set\n", "* Translating the test set\n", "* Language-agnostic representations\n", "* Cross-lingual transfer learning\n", "\n", "
\n", "*(figure from Horbach et al., 2020)*\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Annotation projection\n", "\n", "* When translating data (train or test), we need to **project the labels**.\n", "* Easy for text classification, but what about structured prediction, e.g., sequence labeling?\n", "\n", "
\n", "*(figure from Jain et al., 2019)*\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Common approach: projection based on word alignment.\n", "\n", "
\n", "*(figure from Jain et al., 2019)*\n", "
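The projection step can be sketched in a few lines; the labels and the word alignment below are made-up examples (the alignment would normally come from an automatic aligner such as fast_align), not data from Jain et al. (2019):

```python
# Toy sketch of annotation projection through word alignment.
# The alignment is written by hand here for illustration.

def project_labels(src_labels, alignment, tgt_len):
    """Copy token-level labels from source tokens to aligned target tokens."""
    tgt_labels = ["O"] * tgt_len  # unaligned target tokens default to "O"
    for src_i, tgt_i in alignment:
        tgt_labels[tgt_i] = src_labels[src_i]
    return tgt_labels

# English source with NER labels, projected onto a translation.
src_labels = ["B-PER", "O", "B-LOC"]   # e.g. "Obama visited Paris"
alignment = [(0, 0), (1, 1), (2, 2)]   # source index -> target index
print(project_labels(src_labels, alignment, 3))  # ['B-PER', 'O', 'B-LOC']
```

Target tokens left unaligned simply keep the default "O" label, which is one source of noise in projected annotations.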
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Language-agnostic representations\n", "\n", "Instead of words, use **delexicalized** features, such as (universal) parts of speech.\n", "\n", "
\n", "*(figure from Täckström et al., 2013)*\n", "
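A minimal sketch of delexicalization, assuming a tiny hand-made UPOS lexicon (real systems use a trained tagger):

```python
# Delexicalization: replace word forms with universal POS tags, so that a model
# trained on one language can be applied to another. The lexicon below is
# hand-made for illustration only.

UPOS = {"the": "DET", "dog": "NOUN", "barks": "VERB",
        "der": "DET", "hund": "NOUN", "bellt": "VERB"}

def delexicalize(tokens):
    return [UPOS.get(t.lower(), "X") for t in tokens]  # "X" = unknown

print(delexicalize(["The", "dog", "barks"]))   # ['DET', 'NOUN', 'VERB']
print(delexicalize(["Der", "Hund", "bellt"]))  # ['DET', 'NOUN', 'VERB']
```

The English and German sentences map to the same language-agnostic feature sequence, which is exactly what makes cross-lingual training possible.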
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Transfer learning\n", "\n", "Train on one task/domain/language, and use (or test) on another. Many different flavors:\n", "\n", "\n", "\n", "
\n", "*(figure from Ruder, 2019)*\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Multi-task learning\n", "\n", "The simplest approach trains one model on multiple objectives simultaneously:\n", "
\n", "*(figure from Caruana, 1997)*\n", "
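Hard parameter sharing can be illustrated with a single shared parameter trained on the sum of two per-task losses (a toy one-dimensional example, not Caruana's setup):

```python
# Toy illustration of multi-task learning: one shared parameter w is trained
# on the *sum* of two per-task squared losses with plain gradient descent.

def mtl_step(w, task1, task2, lr=0.1):
    grad = 0.0
    for x, y in task1 + task2:        # both tasks contribute to the same update
        grad += 2 * (w * x - y) * x   # d/dw of (w * x - y) ** 2
    return w - lr * grad

w = 0.0
task1 = [(1.0, 2.0)]  # alone, this task would drive w towards 2
task2 = [(1.0, 4.0)]  # alone, this task would drive w towards 4
for _ in range(100):
    w = mtl_step(w, task1, task2)
print(round(w, 2))  # 3.0: a compromise between the two objectives
```

The shared parameter settles at a compromise between the two tasks, which helps when the tasks are related and hurts when they are not.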
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Weight sharing can happen on different levels:\n", "\n", "
\n", "*(figure from Hashimoto et al., 2017)*\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "## Domain adaptation\n", "\n", "*Domain*: intuitively, a collection of texts from a certain coherent sort of discourse ([Plank, 2011](https://pure.rug.nl/ws/portalfiles/portal/10469718/09complete.pdf)).\n", "\n", "For example, product reviews for DVDs, electronics, and kitchen goods:\n", "
\n", "*(figure from Wright & Augenstein, 2020)*\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "### Domain adaptation methods\n", "\n", "
\n", "*(figure from Ramponi & Plank, 2020)*\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "## A more general view on generalization\n", "\n", "
\n", "*(figure from Hupkes et al., 2022)*\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "
\n", "*(figure from Hupkes et al., 2022)*\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Sequential transfer learning\n", "\n", "Train general-purpose representations on large datasets, then reuse them for specific tasks.\n", "\n", "Type-level representations ([word embeddings](dl-representations_simple.ipynb)):\n", "- word2vec ([Mikolov et al., 2013](https://arxiv.org/abs/1301.3781))\n", "- GloVe ([Pennington et al., 2014](https://www.aclweb.org/anthology/D14-1162))\n", "- ...\n", "\n", "Token-level representations ([contextualised representations](dl-representations_contextual.ipynb)):\n", "- ELMo ([Peters et al., 2018](https://www.aclweb.org/anthology/N18-1202))\n", "- BERT ([Devlin et al., 2019](https://www.aclweb.org/anthology/N19-1423))\n", "- ...\n", "\n", "Language models ([transformer language models](dl-representations_contextual_transformers.ipynb)):\n", "- GPT-2 ([Radford et al., 2019](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf))\n", "- T5 ([Raffel et al., 2019](https://arxiv.org/abs/1910.10683))\n", "- BART ([Lewis et al., 2020](https://aclanthology.org/2020.acl-main.703/))\n", "- GPT-3 ([Brown et al., 2020](https://arxiv.org/abs/2005.14165))\n", "- T0 ([Sanh et al., 2021](https://arxiv.org/abs/2110.08207))\n", "- FLAN ([Wei et al., 2021](https://arxiv.org/abs/2109.01652))\n", "- ..." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Paradigm 1: pre-train and fine-tune\n", "\n", "LM variants (such as ELMo and BERT) can be fine-tuned with extra task-specific parameters:\n", "\n", "
\n", "*(figure from Devlin et al., 2019)*\n", "
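A toy sketch of what "extra task-specific parameters" means: a linear classification head on top of a pre-trained sentence representation. Here `fake_encoder`, `W`, and `b` are all hypothetical stand-ins; in practice the representation would be e.g. BERT's [CLS] vector.

```python
# The pre-trained encoder is faked as a fixed function; W and b are the only
# new, task-specific parameters, with made-up values.

def fake_encoder(sentence):
    # Stand-in for a pre-trained encoder: a 2-dimensional "representation".
    return [float(len(sentence)), float(sentence.count(" ") + 1)]

def classify(rep, W, b):
    scores = [sum(w * r for w, r in zip(row, rep)) + bj
              for row, bj in zip(W, b)]
    return scores.index(max(scores))  # argmax over class scores

W = [[0.1, 0.0], [0.0, 1.0]]  # hypothetical fine-tuned head weights
b = [0.0, -2.0]
print(classify(fake_encoder("short"), W, b))  # 0
```

During fine-tuning, both the head and (usually) the encoder weights are updated; only the head is new.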
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Fine-tuning without extra parameters\n", "\n", "Encoder-decoder LMs (such as T5) can be directly fine-tuned on any task (in principle) posed **as text-to-text**.\n", "\n", "
\n", "*(figure from Raffel et al., 2019)*\n", "
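Casting tasks as text-to-text can be sketched as simple string formatting; the `sst2 sentence:` and `translate ... to ...:` prefixes follow the convention described by Raffel et al. (2019), while the examples themselves are made up:

```python
# Text-to-text casting: every task becomes "input string -> output string",
# so one encoder-decoder model can handle all of them.

def to_text_to_text(task, **f):
    if task == "translate":
        return f"translate {f['src']} to {f['tgt']}: {f['text']}", f["reference"]
    if task == "sentiment":
        return f"sst2 sentence: {f['text']}", f["label"]
    raise ValueError(f"unknown task: {task}")

inp, out = to_text_to_text("sentiment", text="great movie!", label="positive")
print(inp)  # sst2 sentence: great movie!
print(out)  # positive
```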
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Paradigm 2: prompting\n", "\n", "Large enough LMs (such as GPT-3) are capable of few-shot **in-context learning** without fine-tuning any parameters.\n", "\n", "
\n", "*(figure from Wei et al., 2021)*\n", "
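A common way to build a few-shot prompt (the "Input/Label" format is a convention for illustration, not taken from a specific paper) is to concatenate labeled demonstrations before the query; no model parameters are updated:

```python
# In-context learning: demonstrations go in front of the query, and the LM is
# expected to continue the pattern after the final "Label:".

def few_shot_prompt(demos, query):
    blocks = [f"Input: {x}\nLabel: {y}" for x, y in demos]
    blocks.append(f"Input: {query}\nLabel:")
    return "\n\n".join(blocks)

demos = [("I loved it", "positive"), ("Terrible film", "negative")]
print(few_shot_prompt(demos, "Best movie of the year"))
```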
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Generalizing across tasks\n", "\n", "GPT-3 and multitask LMs (such as T0 and FLAN) are also capable of **zero-shot** task generalization:\n", "\n", "
\n", "*(figure from Sanh et al., 2021)*\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Cross-lingual transfer learning" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Multilingual representations\n", "\n", "Trained on large, combined multilingual corpora *without cross-lingual supervision*.\n", "- mBERT ([Devlin et al., 2019](https://www.aclweb.org/anthology/N19-1423))\n", "- XLM, XLM-R ([Conneau et al., 2020](https://aclanthology.org/2020.acl-main.747/))\n", "- mBART ([Liu et al., 2020](https://aclanthology.org/2020.tacl-1.47/))\n", "- mT5 ([Xue et al., 2021](https://aclanthology.org/2021.naacl-main.41/))\n", "- XGLM ([Lin et al., 2021](https://arxiv.org/abs/2112.10668))\n", "- ...\n", "\n", "
\n", "*(figure from Conneau et al., 2020)*\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "# Massively multilingual models\n", "\n", "
\n", "*(figure from Xue et al., 2021)*\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Zero-shot cross-lingual transfer\n", "\n", "1. Pre-train multilingual representation\n", "2. Fine-tune on target task in English\n", "3. Predict on target language" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "mBERT is unreasonably effective at cross-lingual transfer!\n", "\n", "NER F1:\n", "
\n", "*(figure omitted)*\n", "\n", "POS accuracy:\n", "\n", "*(figure from Pires et al., 2019)*\n", "
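Entity-level F1, the NER metric reported above, can be computed over sets of (start, end, type) spans; the spans below are illustrative:

```python
# Entity-level F1: a predicted entity counts only if both its boundaries and
# its type match a gold span exactly.

def entity_f1(gold, pred):
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

gold = [(0, 1, "PER"), (4, 5, "LOC")]
pred = [(0, 1, "PER"), (2, 3, "ORG")]
print(entity_f1(gold, pred))  # 0.5
```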
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Why? (poll)\n", "\n", "See also [K et al., 2020](https://arxiv.org/pdf/1912.07840.pdf);\n", "[Wu and Dredze, 2019](https://www.aclweb.org/anthology/D19-1077.pdf)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "### Few-shot cross-lingual transfer\n", "\n", "1. Pre-train multilingual representation\n", "2. Fine-tune on target task in English\n", "3. **Fine-tune on a few examples in target language**\n", "4. Predict on target language" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "
\n", "*(figure from Lauscher et al., 2020)*\n", "
\n", "\n", "\n", "Few-shot is often much better than zero-shot: a few examples can go a long way." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "### Exercise: multilingual pre-training for MT\n", "Consider a machine translation system using a multilingual pre-trained encoder and decoder.\n", "\n", "\n", "+ It depends on the encoding and model.\n", "+ If the performance is good, the attention should incorporate also pragmatic differences. But it might be difficult in practice.\n", "+ Because it encodes the input into a numerical representation of the meaning that is language-independent." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Multicultural knowledge and factual alignment\n", "\n", "Language models don't just learn *languages* — they also encode **world knowledge** learned from the corpora they were trained on." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Probing factual knowledge: LAMA\n", "\n", "The **LAMA (Language Model Analysis)** benchmark ([Petroni et al., 2019](https://aclanthology.org/D19-1250/)) probes whether LMs recall simple factual triples such as:\n", "\n", "> (Paris, capital_of, ?) → “France”\n", "\n", "by converting knowledge base facts (e.g., from Wikidata) into *cloze prompts*:\n", "\n", "> \"Paris is the capital of [MASK].\"\n", "\n", "Performance is measured as the model’s accuracy in predicting the correct object token.\n", "\n", "
\n", "*(figure from Petroni et al., 2019)*\n", "
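The cloze conversion can be sketched as template filling; the templates below are simplified stand-ins, not the benchmark's exact relation templates:

```python
# Turning knowledge-base triples into cloze prompts for a masked LM.

TEMPLATES = {
    "capital_of": "{subj} is the capital of [MASK].",
    "born_in": "{subj} was born in [MASK].",
}

def to_cloze(subj, relation):
    return TEMPLATES[relation].format(subj=subj)

print(to_cloze("Paris", "capital_of"))  # Paris is the capital of [MASK].
```

The model is then scored on whether its top prediction for the [MASK] position matches the gold object.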
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "But what happens when this knowledge is **language- or culture-specific**?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### mLAMA: cross-lingual knowledge probing\n", "\n", "The **mLAMA** benchmark ([Kassner et al., 2021](https://aclanthology.org/2021.eacl-main.284/)) extends LAMA to **53 languages**, testing how well multilingual LMs like mBERT or XLM-R remember the *same* facts across languages.\n", "\n", "Key finding: \n", "> LMs often “know” the fact in English but fail to recall it in other languages — even when pre-trained multilingually.\n", "\n", "This reveals **unequal knowledge retention** and **cultural skew** toward English-centric corpora.\n", "\n", "
\n", "*(figure from Kassner et al., 2021)*\n", "
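Unequal recall across languages can be quantified with per-language top-1 accuracy over the same facts; the model predictions below are invented for illustration:

```python
# Per-language accuracy over a shared set of gold facts.

def accuracy_by_language(predictions, gold):
    return {lang: sum(p == g for p, g in zip(preds, gold)) / len(gold)
            for lang, preds in predictions.items()}

gold = ["France", "Germany", "Japan", "Kenya"]
predictions = {
    "en": ["France", "Germany", "Japan", "Kenya"],  # all facts recalled
    "sw": ["France", "Italy", "China", "Kenya"],    # only some recalled
}
print(accuracy_by_language(predictions, gold))  # {'en': 1.0, 'sw': 0.5}
```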
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Cross-lingual factual alignment and cultural bias\n", "\n", "Even if multilingual models can **translate**, they may not **align cultural knowledge** across languages.\n", "\n", "For instance:\n", "- The same country may have **different facts** (leaders, population stats) in local-language sources.\n", "- Cultural entities (e.g., “Thanksgiving”, “Diwali”) have **asymmetric representation**." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "For example, the distribution of mentioned entities is highly skewed...\n", "\n", "
\n", "*(figure from Faisal et al., 2022)*\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "# Evaluating cross-lingual transfer" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "### Multilingual evaluation benchmarks\n", "\n", "Collections of multilingual multi-task benchmarks:\n", "\n", "* XGLUE (Liang et al., 2020)\n", "* XTREME (Hu et al., 2020): includes **TyDiQA**\n", "* XTREME-R (Ruder et al., 2021) \n", "\n", "
\n", "*(figure from Hu et al., 2020)*\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "### Performance on XTREME\n", "\n", "
\n", "*(figure from Ruder et al., 2021)*\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "### Performance on TyDiQA-GoldP\n", "\n", "
\n", "*(figure from Ruder et al., 2021)*\n", "
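Exact match, the metric used below for cross-lingual semantic parsing, can be sketched as follows (whitespace normalization is an assumption made here, not necessarily the paper's exact preprocessing):

```python
# Exact match (EM): a prediction scores only if it is string-identical to the
# gold parse after simple whitespace normalization.

def exact_match(preds, golds):
    norm = lambda s: " ".join(s.split())
    hits = sum(norm(p) == norm(g) for p, g in zip(preds, golds))
    return 100.0 * hits / len(golds)

preds = ["answer(capital(germany))", "  answer(capital(france))"]
golds = ["answer(capital(germany))", "answer(capital(france))"]
print(exact_match(preds, golds))  # 100.0
```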
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "## Cross-lingual semantic parsing\n", "\n", "\n", "\n", "(from [Cui et al., 2022](https://aclanthology.org/2022.tacl-1.55/))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "### Results\n", "\n", "Evaluated `mT5-small` by Exact Match (%).\n", "\n", "| | En | He | Kn | Zh |\n", "| :--------------- | -- | -- | -- | -- |\n", "| Training on X, testing on X | 38.3 | 29.3 | 31.5 | 36.3 |\n", "| Training on English, testing on X | 38.3 | 0.2 | 0.3 | 0.2 |\n", "\n", "Zero-shot cross-lingual evaluation is much harder than monolingual evaluation." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Cross-cultural NLP\n", "\n", "Not everything can be translated. Native language (and culture) data is important!\n", "\n", "
\n", "*(figure from Hershcovich et al., 2022)*\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Summary: From Multilingual to Multicultural NLP\n", "\n", "| 🌐 **Aspect** | **Multilingual NLP** | **Multicultural NLP** |\n", "|---------------|----------------------|------------------------|\n", "| **Main goal** | Cross-lingual *transfer* and *coverage* | Cross-cultural *representation* and *fairness* |\n", "| **Focus** | How well models generalize to new **languages** | How well models capture diverse **knowledge, values, and norms** |\n", "| **Key methods** | Translation, annotation projection, multilingual pre-training | Culturally grounded data, value-aligned evaluation, adaptation across societies |\n", "| **Evaluation benchmarks** | XTREME, XGLUE, TyDiQA... | mLAMA, BLEnD, NormAd, WorldValuesBench... |\n", "| **Challenges** | Linguistic diversity and data imbalance | Cultural bias, representational inequality, value alignment |" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Further reading\n", "\n", "- [Ruder, 2017. An Overview of Multi-Task Learning in Deep Neural Networks](https://ruder.io/multi-task/)\n", "- [Ruder, 2019. The State of Transfer Learning in NLP](https://ruder.io/state-of-transfer-learning-in-nlp/)\n", "- [Ruder, 2016. A survey of cross-lingual word embedding models](https://ruder.io/cross-lingual-embeddings/)\n", "- [Søgaard, Anders; Vulic, Ivan; Ruder, Sebastian; Faruqui, Manaal. 2019. Cross-lingual word embeddings](https://www.morganclaypool.com/doi/abs/10.2200/S00920ED2V01Y201904HLT042)\n", "- [Wang, 2019. Cross-lingual transfer learning.](https://shanzhenren.github.io/csci-699-replnlp-2019fall/lectures/W6-L3-Cross_Lingual_Transfer.pdf)\n", "- [Pawar et al., 2025. 
Survey of Cultural Awareness in Language Models: Text and Beyond ](https://direct.mit.edu/coli/article/51/3/907/130804/Survey-of-Cultural-Awareness-in-Language-Models)" ] } ], "metadata": { "celltoolbar": "Slideshow", "hide_input": false, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.17" } }, "nbformat": 4, "nbformat_minor": 1 }