{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "X4cRE8IbIrIV" }, "source": [ "If you're opening this notebook on Colab, you will probably need to install 🤗 Transformers and 🤗 Datasets. Uncomment the following cell and run it." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 1000 }, "id": "MOsHUjgdIrIW", "outputId": "f84a093e-147f-470e-aad9-80fb51193c8e" }, "outputs": [], "source": [ "#! pip install datasets transformers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you're opening this notebook locally, make sure your environment has the latest version of those libraries installed.\n", "\n", "To be able to share your model with the community and generate results like the one shown in the picture below via the inference API, there are a few more steps to follow.\n", "\n", "First, you have to store your authentication token from the Hugging Face website (sign up [here](https://huggingface.co/join) if you haven't already!), then execute the following cell and input your username and password:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from huggingface_hub import notebook_login\n", "\n", "notebook_login()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then you need to install Git-LFS. 
Uncomment the following instruction:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# !apt install git-lfs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Make sure your version of Transformers is at least 4.11.0 since the functionality was introduced in that version:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import transformers\n", "\n", "print(transformers.__version__)" ] }, { "cell_type": "markdown", "metadata": { "id": "HFASsisvIrIb" }, "source": [ "You can find a script version of this notebook to fine-tune your model in a distributed fashion using multiple GPUs or TPUs [here](https://github.com/huggingface/transformers/tree/master/examples/question-answering)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We also quickly upload some telemetry. This tells us which examples and software versions are getting used so we know where to prioritize our maintenance efforts. We don't collect (or care about) any personally identifiable information, but if you'd prefer not to be counted, feel free to skip this step or delete this cell entirely." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from transformers.utils import send_example_telemetry\n", "\n", "send_example_telemetry(\"question_answering_notebook\", framework=\"pytorch\")" ] }, { "cell_type": "markdown", "metadata": { "id": "rEJBSTyZIrIb" }, "source": [ "# Fine-tuning a model on a question-answering task" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this notebook, we will see how to fine-tune one of the [🤗 Transformers](https://github.com/huggingface/transformers) models on a question answering task, which is the task of extracting the answer to a question from a given context. 
We will see how to easily load a dataset for these kinds of tasks and use the `Trainer` API to fine-tune a model on it.\n", "\n", "\n", "\n", "**Note:** This notebook fine-tunes models that answer questions by taking a substring of a context, not by generating new text." ] }, { "cell_type": "markdown", "metadata": { "id": "4RRkXuteIrIh" }, "source": [ "This notebook is built to run on any question answering task with the same format as SQuAD (version 1 or 2), with any model checkpoint from the [Model Hub](https://huggingface.co/models) as long as that model has a version with a question answering head and a fast tokenizer (check on [this table](https://huggingface.co/transformers/index.html#bigtable) if this is the case). It might just need some small adjustments if you decide to use a different dataset than the one used here. Depending on your model and the GPU you are using, you might need to adjust the batch size to avoid out-of-memory errors. Set those three parameters, then the rest of the notebook should run smoothly:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "zVvslsfMIrIh" }, "outputs": [], "source": [ "# This flag selects between SQuAD v1 and v2 (if you're using another dataset, it indicates if impossible\n", "# answers are allowed or not).\n", "squad_v2 = False\n", "model_checkpoint = \"distilbert-base-uncased\"\n", "batch_size = 16" ] }, { "cell_type": "markdown", "metadata": { "id": "whPRbBNbIrIl" }, "source": [ "## Loading the dataset" ] }, { "cell_type": "markdown", "metadata": { "id": "W7QYTpxXIrIl" }, "source": [ "We will use the [🤗 Datasets](https://github.com/huggingface/datasets) library to download the data and get the metric we need to use for evaluation (to compare our model to the benchmark). This can be easily done with the functions `load_dataset` and `load_metric`. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "IreSlFmlIrIm" }, "outputs": [], "source": [ "from datasets import load_dataset, load_metric" ] }, { "cell_type": "markdown", "metadata": { "id": "CKx2zKs5IrIq" }, "source": [ "For our example here, we'll use the [SQuAD dataset](https://rajpurkar.github.io/SQuAD-explorer/). The notebook should work with any question answering dataset provided by the 🤗 Datasets library. If you're using your own dataset defined from a JSON or CSV file (see the [Datasets documentation](https://huggingface.co/docs/datasets/loading_datasets.html#from-local-files) on how to load them), it might need some adjustments in the names of the columns used." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 270, "referenced_widgets": [ "69caab03d6264fef9fc5649bffff5e20", "3f74532faa86412293d90d3952f38c4a", "50615aa59c7247c4804ca5cbc7945bd7", "fe962391292a413ca55dc932c4279fa7", "299f4b4c07654e53a25f8192bd1d7bbd", "ad04ed1038154081bbb0c1444784dcc2", "7c667ad22b5740d5a6319f1b1e3a8097", "46c2b043c0f84806978784a45a4e203b", "80e2943be35f46eeb24c8ab13faa6578", "de5956b5008d4fdba807bae57509c393", "931db1f7a42f4b46b7ff8c2e1262b994", "6c1db72efff5476e842c1386fadbbdba", "ccd2f37647c547abb4c719b75a26f2de", "d30a66df5c0145e79693e09789d96b81", "5fa26fc336274073abbd1d550542ee33", "2b34de08115d49d285def9269a53f484", "d426be871b424affb455aeb7db5e822e", "160bf88485f44f5cb6eaeecba5e0901f", "745c0d47d672477b9bb0dae77b926364", "d22ab78269cd4ccfbcf70c707057c31b", "d298eb19eeff453cba51c2804629d3f4", "a7204ade36314c86907c562e0a2158b8", "e35d42b2d352498ca3fc8530393786b2", "75103f83538d44abada79b51a1cec09e", "f6253931d90543e9b5fd0bb2d615f73a", "051aa783ff9e47e28d1f9584043815f5", "0984b2a14115454bbb009df71c1cf36f", "8ab9dfce29854049912178941ef1b289", "c9de740e007141958545e269372780a4", "cbea68b25d6d4ba09b2ce0f27b1726d5", "5781fc45cf8d486cb06ed68853b2c644", 
"d2a92143a08a4951b55bab9bc0a6d0d3", "a14c3e40e5254d61ba146f6ec88eae25", "c4ffe6f624ce4e978a0d9b864544941a", "1aca01c1d8c940dfadd3e7144bb35718", "9fbbaae50e6743f2aa19342152398186", "fea27ca6c9504fc896181bc1ff5730e5", "940d00556cb849b3a689d56e274041c2", "5cdf9ed939fb42d4bf77301c80b8afca", "94b39ccfef0b4b08bf2fb61bb0a657c1", "9a55087c85b74ea08b3e952ac1d73cbe", "2361ab124daf47cc885ff61f2899b2af", "1a65887eb37747ddb75dc4a40f7285f2", "3c946e2260704e6c98593136bd32d921", "50d325cdb9844f62a9ecc98e768cb5af", "aa781f0cfe454e9da5b53b93e9baabd8", "6bb68d3887ef43809eb23feb467f9723", "7e29a8b952cf4f4ea42833c8bf55342f", "dd5997d01d8947e4b1c211433969b89b", "2ace4dc78e2f4f1492a181bcd63304e7", "bbee008c2791443d8610371d1f16b62b", "31b1c8a2e3334b72b45b083688c1a20c", "7fb7c36adc624f7dbbcb4a831c1e4f63", "0b7c8f1939074794b3d9221244b1344d", "a71908883b064e1fbdddb547a8c41743", "2f5223f26c8541fc87e91d2205c39995" ] }, "id": "s_AY1ATSIrIq", "outputId": "fd0578d1-8895-443d-b56f-5908de9f1b6b" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Reusing dataset squad (/home/sgugger/.cache/huggingface/datasets/squad/plain_text/1.0.0/4c81550d83a2ac7c7ce23783bd8ff36642800e6633c1f18417fb58c3ff50cdd7)\n" ] } ], "source": [ "datasets = load_dataset(\"squad_v2\" if squad_v2 else \"squad\")" ] }, { "cell_type": "markdown", "metadata": { "id": "RzfPtOMoIrIu" }, "source": [ "The `datasets` object itself is a [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict), which contains one key per split (here, the training and validation sets)." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "GWiVUF0jIrIv", "outputId": "35e3ea43-f397-4a54-c90c-f2cf8d36873e" }, "outputs": [ { "data": { "text/plain": [ "DatasetDict({\n", "    train: Dataset({\n", "        features: ['id', 'title', 'context', 'question', 'answers'],\n", "        num_rows: 87599\n", "    })\n", "    validation: Dataset({\n", "        features: ['id', 'title', 'context', 'question', 'answers'],\n", "        num_rows: 10570\n", "    })\n", "})" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "datasets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see the training and validation sets both have a column for the context, the question and the answers to those questions." ] }, { "cell_type": "markdown", "metadata": { "id": "u3EtYfeHIrIz" }, "source": [ "To access an actual element, you need to select a split first, then give an index:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "X6HrpprwIrIz", "outputId": "d7670bc0-42e4-4c09-8a6a-5c018ded7d95" }, "outputs": [ { "data": { "text/plain": [ "{'answers': {'answer_start': [515], 'text': ['Saint Bernadette Soubirous']},\n", " 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend \"Venite Ad Me Omnes\". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. 
At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',\n", " 'id': '5733be284776f41900661182',\n", " 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',\n", " 'title': 'University_of_Notre_Dame'}" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "datasets[\"train\"][0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see the answers are indicated by their start position in the text (here at character 515) and their full text, which is a substring of the context as we mentioned above." ] }, { "cell_type": "markdown", "metadata": { "id": "WHUmphG3IrI3" }, "source": [ "To get a sense of what the data looks like, the following function will show some examples picked randomly in the dataset (automatically decoding the labels in passing)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "i3j8APAoIrI3" }, "outputs": [], "source": [ "from datasets import ClassLabel, Sequence\n", "import random\n", "import pandas as pd\n", "from IPython.display import display, HTML\n", "\n", "def show_random_elements(dataset, num_examples=10):\n", "    assert num_examples <= len(dataset), \"Can't pick more elements than there are in the dataset.\"\n", "    picks = []\n", "    for _ in range(num_examples):\n", "        pick = random.randint(0, len(dataset)-1)\n", "        while pick in picks:\n", "            pick = random.randint(0, len(dataset)-1)\n", "        picks.append(pick)\n", "    \n", "    df = pd.DataFrame(dataset[picks])\n", "    for column, typ in dataset.features.items():\n", "        if isinstance(typ, ClassLabel):\n", "            df[column] = df[column].transform(lambda i: typ.names[i])\n", "        elif isinstance(typ, Sequence) and isinstance(typ.feature, ClassLabel):\n", "            df[column] = df[column].transform(lambda x: [typ.feature.names[i] for i in x])\n", "    display(HTML(df.to_html()))" ] }, { "cell_type": "code", "execution_count": 
null, "metadata": { "id": "SZy5tRB_IrI7", "outputId": "ba8f2124-e485-488f-8c0c-254f34f24f13", "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
|  | answers | context | id | question | title |
|---|---|---|---|---|---|
| 0 | {'answer_start': [595], 'text': ['1964']} | Paul VI opened the third period on 14 September 1964, telling the Council Fathers that he viewed the text about the Church as the most important document to come out from the Council. As the Council discussed the role of bishops in the papacy, Paul VI issued an explanatory note confirming the primacy of the papacy, a step which was viewed by some as meddling in the affairs of the Council American bishops pushed for a speedy resolution on religious freedom, but Paul VI insisted this to be approved together with related texts such as ecumenism. The Pope concluded the session on 21 November 1964, with the formal pronouncement of Mary as Mother of the Church. | 5726bc075951b619008f7c63 | In what year did Paul VI formally appoint Mary as mother of the Catholic church? | Pope_Paul_VI |
| 1 | {'answer_start': [453], 'text': ['asiento']} | At the concluding Treaty of Utrecht, Philip renounced his and his descendants' right to the French throne and Spain lost its empire in Europe. The British Empire was territorially enlarged: from France, Britain gained Newfoundland and Acadia, and from Spain, Gibraltar and Minorca. Gibraltar became a critical naval base and allowed Britain to control the Atlantic entry and exit point to the Mediterranean. Spain also ceded the rights to the lucrative asiento (permission to sell slaves in Spanish America) to Britain. | 57266a9bf1498d1400e8df18 | What was the Spanish term for permission to sell slaves in Spanish America? | British_Empire |
| 2 | {'answer_start': [0], 'text': ['The variances in nomenclature']} | The variances in nomenclature in the region spanned by the Alps makes classification of the mountains and subregions difficult, but a general classification is that of the Eastern Alps and Western Alps with the divide between the two occurring in eastern Switzerland according to geologist Stefan Schmid, near the Splügen Pass. | 56f876bda6d7ea1400e17689 | What makes the classification of the mountains and subregions difficult? | Alps |
| 3 | {'answer_start': [727], 'text': ['animal electricity']} | William Champion's brother, John, patented a process in 1758 for calcining zinc sulfide into an oxide usable in the retort process. Prior to this, only calamine could be used to produce zinc. In 1798, Johann Christian Ruberg improved on the smelting process by building the first horizontal retort smelter. Jean-Jacques Daniel Dony built a different kind of horizontal zinc smelter in Belgium, which processed even more zinc. Italian doctor Luigi Galvani discovered in 1780 that connecting the spinal cord of a freshly dissected frog to an iron rail attached by a brass hook caused the frog's leg to twitch. He incorrectly thought he had discovered an ability of nerves and muscles to create electricity and called the effect \"animal electricity\". The galvanic cell and the process of galvanization were both named for Luigi Galvani and these discoveries paved the way for electrical batteries, galvanization and cathodic protection. | 572b5939f75d5e190021fd95 | What did Galvani name the effect he created of causing the frogs legs to twitch? | Zinc |
| 4 | {'answer_start': [150], 'text': ['the Jurchens']} | The Qing dynasty (1644–1911) was founded after the fall of the Ming, the last Han Chinese dynasty, by the Manchus. The Manchus were formerly known as the Jurchens. When Beijing was captured by Li Zicheng's peasant rebels in 1644, the Chongzhen Emperor, the last Ming emperor, committed suicide. The Manchus then allied with former Ming general Wu Sangui and seized control of Beijing, which became the new capital of the Qing dynasty. The Mancus adopted the Confucian norms of traditional Chinese government in their rule of China proper. Schoppa, the editor of The Columbia Guide to Modern Chinese History argues, \"A date around 1780 as the beginning of modern China is thus closer to what we know today as historical 'reality'. It also allows us to have a better baseline to understand the precipitous decline of the Chinese polity in the nineteenth and twentieth centuries.\" | 572ee763cb0c0d14000f1666 | What were the Manchus originally known as? | Modern_history |
| 5 | {'answer_start': [145], 'text': ['song choice and performance']} | Beginning in the tenth season[citation needed], permanent mentors were brought in during the live shows to help guide the contestants with their song choice and performance. Jimmy Iovine was the mentor in the tenth through twelfth seasons, former judge Randy Jackson was the mentor for the thirteenth season and Scott Borchetta was the mentor for the fourteenth and fifteenth season. The mentors regularly bring in guest mentors to aid them, including Akon, Alicia Keys, Lady Gaga, and current judge Harry Connick, Jr.. | 56daf310e7c41114004b4b74 | What two things did the mentors help the contestants with? | American_Idol |
| 6 | {'answer_start': [191], 'text': ['Kamran Baghirov']} | Gorbachev refused to make any changes to the status of Nagorno Karabakh, which remained part of Azerbaijan. He instead sacked the Communist Party Leaders in both Republics – on May 21, 1988, Kamran Baghirov was replaced by Abdulrahman Vezirov as First Secretary of the Azerbaijan Communist Party. From July 23 to September 1988, a group of Azerbaijani intellectuals began working for a new organization called the Popular Front of Azerbaijan, loosely based on the Estonian Popular Front. On September 17, when gun battles broke out between the Armenians and Azerbaijanis near Stepanakert, two soldiers were killed and more than two dozen injured. This led to almost tit-for-tat ethnic polarization in Nagorno-Karabakh's two main towns: The Azerbaijani minority was expelled from Stepanakert, and the Armenian minority was expelled from Shusha. On November 17, 1988, in response to the exodus of tens of thousands of Azerbaijanis from Armenia, a series of mass demonstrations began in Baku's Lenin Square, lasting 18 days and attracting half a million demonstrators. On December 5, 1988, the Soviet militia finally moved in, cleared the square by force, and imposed a curfew that lasted ten months. | 57279998dd62a815002ea1a5 | Who was First Secretary prior to Vezirov? | Dissolution_of_the_Soviet_Union |
| 7 | {'answer_start': [0], 'text': ['World literature']} | World literature was enriched by the works of Edmund Spenser, John Milton, John Bunyan, John Donne, John Dryden, Daniel Defoe, William Wordsworth, Jonathan Swift, Johann Wolfgang Goethe, Friedrich Schiller, Samuel Taylor Coleridge, Edgar Allan Poe, Matthew Arnold, Conrad Ferdinand Meyer, Theodor Fontane, Washington Irving, Robert Browning, Emily Dickinson, Emily Brontë, Charles Dickens, Nathaniel Hawthorne, Thomas Stearns Eliot, John Galsworthy, Thomas Mann, William Faulkner, John Updike, and many others. | 57325d03b9d445190005eaac | Samuel Taylor is listed as enriching what? | Protestantism |
| 8 | {'answer_start': [85], 'text': ['1870']} | The first road connecting the city to the mainland at Pleasantville was completed in 1870 and charged a 30-cent toll. Albany Avenue was the first road to the mainland that was available without a toll. | 57060ece75f01819005e791b | The first road that connected Atlantic City to the mainland was completed in what year? | Atlantic_City,_New_Jersey |
| 9 | {'answer_start': [172], 'text': ['progressively eliminate child labour']} | In addition to setting the international law, the United Nations initiated International Program on the Elimination of Child Labour (IPEC) in 1992. This initiative aims to progressively eliminate child labour through strengthening national capacities to address some of the causes of child labour. Amongst the key initiative is the so-called time-bounded programme countries, where child labour is most prevalent and schooling opportunities lacking. The initiative seeks to achieve amongst other things, universal primary school availability. The IPEC has expanded to at least the following target countries: Bangladesh, Brazil, China, Egypt, India, Indonesia, Mexico, Nigeria, Pakistan, Democratic Republic of Congo, El Salvador, Nepal, Tanzania, Dominican Republic, Costa Rica, Philippines, Senegal, South Africa and Turkey. | 5727972a708984140094e1a8 | What is the aim of this? | Child_labour |
| Epoch | Training Loss | Validation Loss | Runtime | Samples Per Second |
|---|---|---|---|---|
| 1 | 1.220600 | 1.160322 | 39.574900 | 272.496000 |
| 2 | 0.945200 | 1.121690 | 39.706000 | 271.596000 |
| 3 | 0.773000 | 1.157358 | 39.734000 | 271.405000 |
"
],
"text/plain": [
"