{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "X4cRE8IbIrIV" }, "source": [ "If you're opening this notebook on Colab, you will probably need to install 🤗 Transformers and 🤗 Datasets. Uncomment the following cell and run it." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 1000 }, "id": "MOsHUjgdIrIW", "outputId": "f84a093e-147f-470e-aad9-80fb51193c8e" }, "outputs": [], "source": [ "#! pip install transformers datasets huggingface_hub" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you're opening this notebook locally, make sure your environment has the latest version of those libraries installed.\n", "\n", "To be able to share your model with the community and generate results like the one shown in the picture below via the inference API, there are a few more steps to follow.\n", "\n", "First you have to store your authentication token from the Hugging Face website (sign up [here](https://huggingface.co/join) if you haven't already!), then run the following cell and input your token." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from huggingface_hub import notebook_login\n", "\n", "notebook_login()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then you need to install Git-LFS and set up Git if you haven't already. 
Uncomment the following instructions and adapt them with your name and email:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# !apt install git-lfs\n", "# !git config --global user.email \"you@example.com\"\n", "# !git config --global user.name \"Your Name\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Make sure your version of Transformers is at least 4.16.0 since the functionality was introduced in that version:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "4.21.0.dev0\n" ] } ], "source": [ "import transformers\n", "\n", "print(transformers.__version__)" ] }, { "cell_type": "markdown", "metadata": { "id": "HFASsisvIrIb" }, "source": [ "You can find a script version of this notebook to fine-tune your model in a distributed fashion using multiple GPUs or TPUs [here](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/question-answering)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We also quickly upload some telemetry. This tells us which examples and software versions are getting used so we know where to prioritize our maintenance efforts. We don't collect (or care about) any personally identifiable information, but if you'd prefer not to be counted, feel free to skip this step or delete this cell entirely." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from transformers.utils import send_example_telemetry\n", "\n", "send_example_telemetry(\"question_answering_notebook\", framework=\"tensorflow\")" ] }, { "cell_type": "markdown", "metadata": { "id": "rEJBSTyZIrIb" }, "source": [ "# Fine-tuning a model on a question-answering task" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this notebook, we will see how to fine-tune one of the [🤗 Transformers](https://github.com/huggingface/transformers) models on a question answering task, which is the task of extracting the answer to a question from a given context. We will see how to easily load a dataset for these kinds of tasks and use Keras to fine-tune a model on it. Note that this model **does not generate new text!** Instead, it selects a span of the input passage as the answer.\n", "\n", "![Widget inference representing the QA task](images/question_answering.png)" ] }, { "cell_type": "markdown", "metadata": { "id": "4RRkXuteIrIh" }, "source": [ "This notebook is built to run on any question answering task with the same format as SQuAD (version 1 or 2), with any model checkpoint from the [Model Hub](https://huggingface.co/models) as long as that model has a version with a question answering head and a fast tokenizer (check [this table](https://huggingface.co/transformers/index.html#bigtable) to see whether this is the case). It might, however, need some small adjustments if you decide to use a different dataset than the one used here. Depending on your model and the GPU you are using, you might need to adjust the batch size to avoid out-of-memory errors. 
Set those three parameters, then the rest of the notebook should run smoothly:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "id": "zVvslsfMIrIh" }, "outputs": [], "source": [ "# This flag is the difference between SQUAD v1 or 2 (if you're using another dataset, it indicates if impossible\n", "# answers are allowed or not).\n", "squad_v2 = False\n", "model_checkpoint = \"distilbert-base-uncased\"\n", "batch_size = 16" ] }, { "cell_type": "markdown", "metadata": { "id": "whPRbBNbIrIl" }, "source": [ "## Loading the dataset" ] }, { "cell_type": "markdown", "metadata": { "id": "W7QYTpxXIrIl" }, "source": [ "We will use the [🤗 Datasets](https://github.com/huggingface/datasets) library to download the data and get the metric we need to use for evaluation (to compare our model to the benchmark). This can be easily done with the functions `load_dataset` and `load_metric`. " ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "id": "IreSlFmlIrIm" }, "outputs": [], "source": [ "from datasets import load_dataset, load_metric" ] }, { "cell_type": "markdown", "metadata": { "id": "CKx2zKs5IrIq" }, "source": [ "For our example here, we'll use the [SQUAD dataset](https://rajpurkar.github.io/SQuAD-explorer/). The notebook should work with any question answering dataset in the 🤗 Datasets library. If you're using your own dataset in a JSON or CSV file (see the [Datasets documentation](https://huggingface.co/docs/datasets/loading_datasets.html#from-local-files) on how to load them), it might need some adjustments to the column names." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 270, "referenced_widgets": [ "69caab03d6264fef9fc5649bffff5e20", "3f74532faa86412293d90d3952f38c4a", "50615aa59c7247c4804ca5cbc7945bd7", "fe962391292a413ca55dc932c4279fa7", "299f4b4c07654e53a25f8192bd1d7bbd", "ad04ed1038154081bbb0c1444784dcc2", "7c667ad22b5740d5a6319f1b1e3a8097", "46c2b043c0f84806978784a45a4e203b", "80e2943be35f46eeb24c8ab13faa6578", "de5956b5008d4fdba807bae57509c393", "931db1f7a42f4b46b7ff8c2e1262b994", "6c1db72efff5476e842c1386fadbbdba", "ccd2f37647c547abb4c719b75a26f2de", "d30a66df5c0145e79693e09789d96b81", "5fa26fc336274073abbd1d550542ee33", "2b34de08115d49d285def9269a53f484", "d426be871b424affb455aeb7db5e822e", "160bf88485f44f5cb6eaeecba5e0901f", "745c0d47d672477b9bb0dae77b926364", "d22ab78269cd4ccfbcf70c707057c31b", "d298eb19eeff453cba51c2804629d3f4", "a7204ade36314c86907c562e0a2158b8", "e35d42b2d352498ca3fc8530393786b2", "75103f83538d44abada79b51a1cec09e", "f6253931d90543e9b5fd0bb2d615f73a", "051aa783ff9e47e28d1f9584043815f5", "0984b2a14115454bbb009df71c1cf36f", "8ab9dfce29854049912178941ef1b289", "c9de740e007141958545e269372780a4", "cbea68b25d6d4ba09b2ce0f27b1726d5", "5781fc45cf8d486cb06ed68853b2c644", "d2a92143a08a4951b55bab9bc0a6d0d3", "a14c3e40e5254d61ba146f6ec88eae25", "c4ffe6f624ce4e978a0d9b864544941a", "1aca01c1d8c940dfadd3e7144bb35718", "9fbbaae50e6743f2aa19342152398186", "fea27ca6c9504fc896181bc1ff5730e5", "940d00556cb849b3a689d56e274041c2", "5cdf9ed939fb42d4bf77301c80b8afca", "94b39ccfef0b4b08bf2fb61bb0a657c1", "9a55087c85b74ea08b3e952ac1d73cbe", "2361ab124daf47cc885ff61f2899b2af", "1a65887eb37747ddb75dc4a40f7285f2", "3c946e2260704e6c98593136bd32d921", "50d325cdb9844f62a9ecc98e768cb5af", "aa781f0cfe454e9da5b53b93e9baabd8", "6bb68d3887ef43809eb23feb467f9723", "7e29a8b952cf4f4ea42833c8bf55342f", "dd5997d01d8947e4b1c211433969b89b", "2ace4dc78e2f4f1492a181bcd63304e7", "bbee008c2791443d8610371d1f16b62b", 
"31b1c8a2e3334b72b45b083688c1a20c", "7fb7c36adc624f7dbbcb4a831c1e4f63", "0b7c8f1939074794b3d9221244b1344d", "a71908883b064e1fbdddb547a8c41743", "2f5223f26c8541fc87e91d2205c39995" ] }, "id": "s_AY1ATSIrIq", "outputId": "fd0578d1-8895-443d-b56f-5908de9f1b6b" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Reusing dataset squad (/home/matt/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453)\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "354ff956c46a4157b2b39ec80be14272", "version_major": 2, "version_minor": 0 }, "text/plain": [ " 0%| | 0/2 [00:00<?, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "datasets = load_dataset(\"squad_v2\" if squad_v2 else \"squad\")" ] }, { "cell_type": "markdown", "metadata": { "id": "RzfPtOMoIrIu" }, "source": [ "The `datasets` object itself is a [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict), which contains one key per split: here, the training and validation sets." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "id": "GWiVUF0jIrIv", "outputId": "35e3ea43-f397-4a54-c90c-f2cf8d36873e" }, "outputs": [ { "data": { "text/plain": [ "DatasetDict({\n", " train: Dataset({\n", " features: ['id', 'title', 'context', 'question', 'answers'],\n", " num_rows: 87599\n", " })\n", " validation: Dataset({\n", " features: ['id', 'title', 'context', 'question', 'answers'],\n", " num_rows: 10570\n", " })\n", "})" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "datasets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that the training and validation sets both have a column for the context, the question and the answers to those questions." 
] }, { "cell_type": "markdown", "metadata": { "id": "u3EtYfeHIrIz" }, "source": [ "To access an actual element, you need to select a split first, then give an index:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "id": "X6HrpprwIrIz", "outputId": "d7670bc0-42e4-4c09-8a6a-5c018ded7d95" }, "outputs": [ { "data": { "text/plain": [ "{'id': '5733be284776f41900661182',\n", " 'title': 'University_of_Notre_Dame',\n", " 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend \"Venite Ad Me Omnes\". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',\n", " 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',\n", " 'answers': {'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]}}" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "datasets[\"train\"][0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see the answers are indicated by their start position in the text (here at character 515) and their full text, which is a substring of the context as we mentioned above." ] }, { "cell_type": "markdown", "metadata": { "id": "WHUmphG3IrI3" }, "source": [ "To get a sense of what the data looks like, the following function will show some examples picked randomly from the dataset and decoded back to strings." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": { "id": "i3j8APAoIrI3" }, "outputs": [], "source": [ "from datasets import ClassLabel, Sequence\n", "import random\n", "import pandas as pd\n", "from IPython.display import display, HTML\n", "\n", "\n", "def show_random_elements(dataset, num_examples=10):\n", " assert num_examples <= len(\n", " dataset\n", " ), \"Can't pick more elements than there are in the dataset.\"\n", " picks = []\n", " for _ in range(num_examples):\n", " pick = random.randint(0, len(dataset) - 1)\n", " while pick in picks:\n", " pick = random.randint(0, len(dataset) - 1)\n", " picks.append(pick)\n", "\n", " df = pd.DataFrame(dataset[picks])\n", " for column, typ in dataset.features.items():\n", " if isinstance(typ, ClassLabel):\n", " df[column] = df[column].transform(lambda i: typ.names[i])\n", " elif isinstance(typ, Sequence) and isinstance(typ.feature, ClassLabel):\n", " df[column] = df[column].transform(\n", " lambda x: [typ.feature.names[i] for i in x]\n", " )\n", " display(HTML(df.to_html()))" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "id": "SZy5tRB_IrI7", "outputId": "ba8f2124-e485-488f-8c0c-254f34f24f13", "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", " | id | \n", "title | \n", "context | \n", "question | \n", "answers | \n", "
---|---|---|---|---|---|
0 | \n", "56f71ecf711bf01900a449a5 | \n", "Treaty | \n", "When a state limits its treaty obligations through reservations, other states party to that treaty have the option to accept those reservations, object to them, or object and oppose them. If the state accepts them (or fails to act at all), both the reserving state and the accepting state are relieved of the reserved legal obligation as concerns their legal obligations to each other (accepting the reservation does not change the accepting state's legal obligations as concerns other parties to the treaty). If the state opposes, the parts of the treaty affected by the reservation drop out completely and no longer create any legal obligations on the reserving and accepting state, again only as concerns each other. Finally, if the state objects and opposes, there are no legal obligations under that treaty between those two state parties whatsoever. The objecting and opposing state essentially refuses to acknowledge the reserving state is a party to the treaty at all. | \n", "Who remains unaffected when a party's reservation is accepted by a second party? | \n", "{'text': ['other parties to the treaty'], 'answer_start': [480]} | \n", "
1 | \n", "5730ec8305b4da19006bcc47 | \n", "United_States_Air_Force | \n", "Due to the Budget sequestration in 2013, the USAF was forced to ground many of its squadrons. The Commander of Air Combat Command, General Mike Hostage indicated that the USAF must reduce its F-15 and F-16 fleets and eliminate platforms like the A-10 in order to focus on a fifth-generation jet fighter future. In response to squadron groundings and flight time reductions, many Air Force pilots have opted to resign from active duty and enter the Air Force Reserve and Air National Guard while pursuing careers in the commercial airlines where they can find flight hours on more modern aircraft. | \n", "Who was the Commander of Air Combat Command in 2013? | \n", "{'text': ['General Mike Hostage'], 'answer_start': [131]} | \n", "
2 | \n", "56fb7d7a8ddada1400cd6476 | \n", "Middle_Ages | \n", "Under the Capetian dynasty France slowly began to expand its authority over the nobility, growing out of the Île-de-France to exert control over more of the country in the 11th and 12th centuries. They faced a powerful rival in the Dukes of Normandy, who in 1066 under William the Conqueror (duke 1035–1087), conquered England (r. 1066–87) and created a cross-channel empire that lasted, in various forms, throughout the rest of the Middle Ages. Normans also settled in Sicily and southern Italy, when Robert Guiscard (d. 1085) landed there in 1059 and established a duchy that later became the Kingdom of Sicily. Under the Angevin dynasty of Henry II (r. 1154–89) and his son Richard I (r. 1189–99), the kings of England ruled over England and large areas of France,[W] brought to the family by Henry II's marriage to Eleanor of Aquitaine (d. 1204), heiress to much of southern France.[X] Richard's younger brother John (r. 1199–1216) lost Normandy and the rest of the northern French possessions in 1204 to the French King Philip II Augustus (r. 1180–1223). This led to dissension among the English nobility, while John's financial exactions to pay for his unsuccessful attempts to regain Normandy led in 1215 to Magna Carta, a charter that confirmed the rights and privileges of free men in England. Under Henry III (r. 1216–72), John's son, further concessions were made to the nobility, and royal power was diminished. The French monarchy continued to make gains against the nobility during the late 12th and 13th centuries, bringing more territories within the kingdom under their personal rule and centralising the royal administration. Under Louis IX (r. 1226–70), royal prestige rose to new heights as Louis served as a mediator for most of Europe.[Y] | \n", "During what period did William reign over England? | \n", "{'text': ['1066–87'], 'answer_start': [331]} | \n", "
3 | \n", "572e9a3fc246551400ce43c6 | \n", "Steven_Spielberg | \n", "Studio producers Richard D. Zanuck and David Brown offered Spielberg the director's chair for Jaws, a thriller-horror film based on the Peter Benchley novel about an enormous killer shark. Spielberg has often referred to the gruelling shoot as his professional crucible. Despite the film's ultimate, enormous success, it was nearly shut down due to delays and budget over-runs. But Spielberg persevered and finished the film. It was an enormous hit, winning three Academy Awards (for editing, original score and sound) and grossing more than $470 million worldwide at the box office. It also set the domestic record for box office gross, leading to what the press described as \"Jawsmania.\":248 Jaws made Spielberg a household name and one of America's youngest multi-millionaires, allowing him a great deal of autonomy for his future projects.:250 It was nominated for Best Picture and featured Spielberg's first of three collaborations with actor Richard Dreyfuss. | \n", "How many Academy Awards did the film \"Jaws\" win? | \n", "{'text': ['three'], 'answer_start': [458]} | \n", "
4 | \n", "56d4c4e72ccc5a1400d8321b | \n", "Beyoncé | \n", "On January 7, 2012, Beyoncé gave birth to her first child, a daughter, Blue Ivy Carter, at Lenox Hill Hospital in New York. Five months later, she performed for four nights at Revel Atlantic City's Ovation Hall to celebrate the resort's opening, her first performances since giving birth to Blue Ivy. | \n", "Where was Beyoncé's first public performance after giving birth? | \n", "{'text': ['Revel Atlantic City's Ovation Hall'], 'answer_start': [176]} | \n", "
5 | \n", "573212a80fdd8d15006c6758 | \n", "Party_leaders_of_the_United_States_House_of_Representatives | \n", "During this early period, it was more usual that neither major party grouping (Federalists and Democratic-Republicans) had an official leader. In 1813, for instance, a scholar recounts that the Federalist minority of 36 Members needed a committee of 13 \"to represent a party comprising a distinct minority\" and \"to coordinate the actions of men who were already partisans in the same cause.\" In 1828, a foreign observer of the House offered this perspective on the absence of formal party leadership on Capitol Hill: | \n", "In early 19th century, what were 2 common parties? | \n", "{'text': ['(Federalists and Democratic-Republicans)'], 'answer_start': [78]} | \n", "
6 | \n", "572fd29ea23a5019007fca49 | \n", "Bacteria | \n", "In ordinary circumstances, transduction, conjugation, and transformation involve transfer of DNA between individual bacteria of the same species, but occasionally transfer may occur between individuals of different bacterial species and this may have significant consequences, such as the transfer of antibiotic resistance. In such cases, gene acquisition from other bacteria or the environment is called horizontal gene transfer and may be common under natural conditions. Gene transfer is particularly important in antibiotic resistance as it allows the rapid transfer of resistance genes between different pathogens. | \n", "What is horizontal gene transfer? | \n", "{'text': ['gene acquisition from other bacteria or the environment'], 'answer_start': [339]} | \n", "
7 | \n", "5727e3c24b864d1900163f55 | \n", "Oklahoma | \n", "More than 12,000 miles (19,000 km) of roads make up the state's major highway skeleton, including state-operated highways, ten turnpikes or major toll roads, and the longest drivable stretch of Route 66 in the nation. In 2008, Interstate 44 in Oklahoma City was Oklahoma's busiest highway, with a daily traffic volume of 123,300 cars. In 2010, the state had the nation's third highest number of bridges classified as structurally deficient, with nearly 5,212 bridges in disrepair, including 235 National Highway System Bridges. | \n", "Oklahoma has the longest drivable stretch of what famous highway? | \n", "{'text': ['Route 66'], 'answer_start': [194]} | \n", "
8 | \n", "5733ecdb4776f41900661524 | \n", "Portugal | \n", "The President, who is elected to a five-year term, has an executive role: the current President is AnÃbal Cavaco Silva. The Assembly of the Republic is a single chamber parliament composed of 230 deputies elected for a four-year term. The Government is headed by the Prime Minister (currently António Costa) and includes Ministers and Secretaries of State. The Courts are organized into several levels, among the judicial, administrative and fiscal branches. The Supreme Courts are institutions of last resort/appeal. A thirteen-member Constitutional Court oversees the constitutionality of the laws. | \n", "How many members sit on the Constitutional Court? | \n", "{'text': ['thirteen'], 'answer_start': [520]} | \n", "
9 | \n", "5733796c4776f41900660b65 | \n", "Saint_Barth%C3%A9lemy | \n", "Marine mammals are many, such as the dolphins, porpoises and whales, which are seen here during the migration period from December till May. Turtles are a common sight along the coastline of the island. They are a protected species and in the endangered list. It is stated that it will take 15–50 years for this species to attain reproductive age. Though they live in the sea, the females come to the shore to lay eggs and are protected by private societies. Three species of turtles are particularly notable. These are: The leatherback sea turtles which have leather skin instead of a shell and are the largest of the type found here, some times measuring a much as 3 m (average is about 1.5 m) and weighing about 450 kg (jellyfish is their favourite diet); the hawksbill turtles, which have hawk-like beaks and found near reefs, generally about 90 cm in diameter and weigh about 60 kg and their diet consists of crabs and snails; and the green turtles, herbivores which have rounded heads, generally about 90 cm in diameter and live amidst tall sea grasses. | \n", "Where do green turtles live? | \n", "{'text': ['amidst tall sea grasses'], 'answer_start': [1035]} | \n", "