
# HuggingFace `nlp` library - Quick overview

Models come and go (linear models, LSTM, Transformers, ...) but two core elements have consistently been the beating heart of Natural Language Processing: Datasets & Metrics.

`nlp` is a lightweight and extensible library for easily sharing and loading datasets and evaluation metrics. It already provides access to ~100 datasets and ~10 evaluation metrics.

The library has several interesting features (besides easy access to datasets/metrics):

- Built-in interoperability with PyTorch, TensorFlow 2, Pandas and NumPy
- Small and fast library with a transparent and pythonic API
- Thrives on large datasets: `nlp` naturally frees you from RAM limits, since all datasets are memory-mapped on the drive by default.
- Smart caching with an intelligent `tf.data`-like cache: never wait for your data to be processed several times

`nlp` originated from a fork of the awesome TensorFlow Datasets, and the HuggingFace team wants to deeply thank the team behind this amazing library and user API. We have tried to keep a layer of compatibility with `tfds`, and datasets can be converted from one format to the other.

# Main datasets API

This notebook is a quick dive into the main user API for loading datasets in `nlp`.

In [11]:
# install nlp
!pip install nlp

# Make sure that we have a recent version of pyarrow (>= 0.16) in the session before we continue.
# Otherwise kill the current process so that Colab restarts the runtime and picks up the new version.
import os
import pyarrow

major, minor = (int(part) for part in pyarrow.__version__.split('.')[:2])
if major == 0 and minor < 16:
    os.kill(os.getpid(), 9)



In [0]:
import logging
logging.basicConfig(level=logging.INFO)



In [0]:
# Let's import the library
import nlp

INFO:nlp.utils.file_utils:PyTorch version 1.5.0+cu101 available.
INFO:nlp.utils.file_utils:TensorFlow version 2.2.0 available.


## Listing the currently available datasets and metrics

In [0]:
# Currently available datasets and metrics
datasets = nlp.list_datasets()
metrics = nlp.list_metrics()

print(f"🤩 Currently {len(datasets)} datasets are available on HuggingFace AWS bucket: \n" 
 + '\n'.join(dataset.id for dataset in datasets) + '\n')
print(f"🤩 Currently {len(metrics)} metrics are available on HuggingFace AWS bucket: \n" 
 + '\n'.join(metric.id for metric in metrics))

🤩 Currently 114 datasets are available on HuggingFace AWS bucket: 
aeslc
ai2_arc
anli
arcd
art
billsum
blimp
blog_authorship_corpus
boolq
break_data
c4
cfq
civil_comments
cmrc2018
cnn_dailymail
coarse_discourse
com_qa
commonsense_qa
coqa
cornell_movie_dialog
cos_e
cosmos_qa
crime_and_punish
csv
definite_pronoun_resolution
discofuse
drop
empathetic_dialogues
eraser_multi_rc
esnli
event2Mind
flores
fquad
gap
germeval_14
gigaword
glue
hansards
hellaswag
imdb
jeopardy
json
kor_nli
lc_quad
librispeech_lm
lm1b
math_dataset
math_qa
mlqa
movie_rationales
multi_news
multi_nli
multi_nli_mismatch
natural_questions
newsroom
openbookqa
opinosis
para_crawl
qa4mre
qangaroo
qanta
qasc
quarel
quartz
quoref
race
reclor
reddit
reddit_tifu
scan
scicite
scientific_papers
scifact
sciq
scitail
sentiment140
snli
social_i_qa
squad
squad_es
squad_it
squad_v1_pt
squad_v2
super_glue
ted_hrlr
ted_multi
tiny_shakespeare
trivia_qa
tydiqa
ubuntu_dialogs_corpus
webis/tl_dr
wiki40b
wiki_qa
wiki_split
wikihow
wikipedia


In [0]:
# You can read a few attributes of the datasets before loading them (they are python dataclasses)
from dataclasses import asdict

for key, value in asdict(datasets[6]).items():
 print('👉 ' + key + ': ' + str(value))

👉 id: blimp
👉 key: nlp/datasets/blimp/blimp.py
👉 lastModified: 2020-05-14T14:57:19.000Z
👉 description: BLiMP is a challenge set for evaluating what language models (LMs) know about
major grammatical phenomena in English. BLiMP consists of 67 sub-datasets, each
containing 1000 minimal pairs isolating specific contrasts in syntax,
morphology, or semantics. The data is automatically generated according to
expert-crafted grammars.
👉 citation: @article{warstadt2019blimp,
 title={BLiMP: A Benchmark of Linguistic Minimal Pairs for English},
 author={Warstadt, Alex and Parrish, Alicia and Liu, Haokun and Mohananey, Anhad and Peng, Wei, and Wang, Sheng-Fu and Bowman, Samuel R},
 journal={arXiv preprint arXiv:1912.00582},
 year={2019}
}
👉 size: 7307
👉 etag: "3659a5abbb1ca837439d94aa2217c5f2"
👉 siblings: [{'key': 'nlp/datasets/blimp/blimp.py', 'etag': '"3659a5abbb1ca837439d94aa2217c5f2"', 'lastModified': '2020-05-14T14:57:19.000Z', 'size': 7307, 'rfilename': 'blimp.py'}, {'key': 'nlp/datasets/bli

## An example with SQuAD

In [0]:
# Downloading and loading a dataset

dataset = nlp.load_dataset('squad', split='validation[:10%]')

INFO:filelock:Lock 139884110310704 acquired on /root/.cache/huggingface/datasets/09ec6948d9db29db9a2dcd08df97ac45bccfa6aa104ea62d73c97fa4aaa5cd6c.f373b0de1570ca81b50bb03bd371604f7979e35de2cfcf2a3b4521d0b3104d9b.py.lock
INFO:nlp.utils.file_utils:https://s3.amazonaws.com/datasets.huggingface.co/nlp/datasets/squad/squad.py not found in cache or force_download set to True, downloading to /root/.cache/huggingface/datasets/tmpd52q9bes


HBox(children=(FloatProgress(value=0.0, description='Downloading', max=4997.0, style=ProgressStyle(description…

INFO:nlp.utils.file_utils:storing https://s3.amazonaws.com/datasets.huggingface.co/nlp/datasets/squad/squad.py in cache at /root/.cache/huggingface/datasets/09ec6948d9db29db9a2dcd08df97ac45bccfa6aa104ea62d73c97fa4aaa5cd6c.f373b0de1570ca81b50bb03bd371604f7979e35de2cfcf2a3b4521d0b3104d9b.py
INFO:nlp.utils.file_utils:creating metadata file for /root/.cache/huggingface/datasets/09ec6948d9db29db9a2dcd08df97ac45bccfa6aa104ea62d73c97fa4aaa5cd6c.f373b0de1570ca81b50bb03bd371604f7979e35de2cfcf2a3b4521d0b3104d9b.py
INFO:filelock:Lock 139884110310704 released on /root/.cache/huggingface/datasets/09ec6948d9db29db9a2dcd08df97ac45bccfa6aa104ea62d73c97fa4aaa5cd6c.f373b0de1570ca81b50bb03bd371604f7979e35de2cfcf2a3b4521d0b3104d9b.py.lock
INFO:filelock:Lock 139886448054000 acquired on /root/.cache/huggingface/datasets/9ba53336b6bc977097b39b8527b06ec6ba3f60a44230f2a0a918735fcd8ad902.893fb39fe374e4c574667dd71a3017b7e2e1d196f3a34fb00b56bac805447f7c.lock
INFO:nlp.utils.file_utils:https://s3.amazonaws.com/data




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=2240.0, style=ProgressStyle(description…

INFO:nlp.utils.file_utils:storing https://s3.amazonaws.com/datasets.huggingface.co/nlp/datasets/squad/dataset_infos.json in cache at /root/.cache/huggingface/datasets/9ba53336b6bc977097b39b8527b06ec6ba3f60a44230f2a0a918735fcd8ad902.893fb39fe374e4c574667dd71a3017b7e2e1d196f3a34fb00b56bac805447f7c
INFO:nlp.utils.file_utils:creating metadata file for /root/.cache/huggingface/datasets/9ba53336b6bc977097b39b8527b06ec6ba3f60a44230f2a0a918735fcd8ad902.893fb39fe374e4c574667dd71a3017b7e2e1d196f3a34fb00b56bac805447f7c
INFO:filelock:Lock 139886448054000 released on /root/.cache/huggingface/datasets/9ba53336b6bc977097b39b8527b06ec6ba3f60a44230f2a0a918735fcd8ad902.893fb39fe374e4c574667dd71a3017b7e2e1d196f3a34fb00b56bac805447f7c.lock
INFO:nlp.load:Checking /root/.cache/huggingface/datasets/09ec6948d9db29db9a2dcd08df97ac45bccfa6aa104ea62d73c97fa4aaa5cd6c.f373b0de1570ca81b50bb03bd371604f7979e35de2cfcf2a3b4521d0b3104d9b.py for additional imports.
INFO:filelock:Lock 139886448054000 acquired on /root/.ca




INFO:nlp.builder:Dataset not on Hf google storage. Downloading and preparing it from source


Downloading and preparing dataset squad/plain_text (download: 33.51 MiB, generated: 85.75 MiB, total: 119.27 MiB) to /root/.cache/huggingface/datasets/squad/plain_text/1.0.0...


INFO:filelock:Lock 139884104848496 acquired on /root/.cache/huggingface/datasets/downloads/b8bb19735e1bb591510a01cc032f4c9f969bc0eeb081ae1b328cd306f3b24008.961f90ccac96b3e5df3c9ebb533f58da8f3ae596f5418c74cc814af15b348739.lock
INFO:nlp.utils.file_utils:https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json not found in cache or force_download set to True, downloading to /root/.cache/huggingface/datasets/downloads/tmp74l2ywcp


HBox(children=(FloatProgress(value=0.0, description='Downloading', max=8116577.0, style=ProgressStyle(descript…

INFO:nlp.utils.file_utils:storing https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json in cache at /root/.cache/huggingface/datasets/downloads/b8bb19735e1bb591510a01cc032f4c9f969bc0eeb081ae1b328cd306f3b24008.961f90ccac96b3e5df3c9ebb533f58da8f3ae596f5418c74cc814af15b348739
INFO:nlp.utils.file_utils:creating metadata file for /root/.cache/huggingface/datasets/downloads/b8bb19735e1bb591510a01cc032f4c9f969bc0eeb081ae1b328cd306f3b24008.961f90ccac96b3e5df3c9ebb533f58da8f3ae596f5418c74cc814af15b348739
INFO:filelock:Lock 139884104848496 released on /root/.cache/huggingface/datasets/downloads/b8bb19735e1bb591510a01cc032f4c9f969bc0eeb081ae1b328cd306f3b24008.961f90ccac96b3e5df3c9ebb533f58da8f3ae596f5418c74cc814af15b348739.lock





INFO:filelock:Lock 139884104848328 acquired on /root/.cache/huggingface/datasets/downloads/9d5462987ef5f814fe15a369c1724f6ec39a2018b3b6271a9d7d2598686ca2ff.01470d0bbaa4753fc1435055451f474b824c23e0dc139470f39b1f233bde8747.lock
INFO:nlp.utils.file_utils:https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json not found in cache or force_download set to True, downloading to /root/.cache/huggingface/datasets/downloads/tmpfhou_5to


HBox(children=(FloatProgress(value=0.0, description='Downloading', max=1054280.0, style=ProgressStyle(descript…

INFO:nlp.utils.file_utils:storing https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json in cache at /root/.cache/huggingface/datasets/downloads/9d5462987ef5f814fe15a369c1724f6ec39a2018b3b6271a9d7d2598686ca2ff.01470d0bbaa4753fc1435055451f474b824c23e0dc139470f39b1f233bde8747
INFO:nlp.utils.file_utils:creating metadata file for /root/.cache/huggingface/datasets/downloads/9d5462987ef5f814fe15a369c1724f6ec39a2018b3b6271a9d7d2598686ca2ff.01470d0bbaa4753fc1435055451f474b824c23e0dc139470f39b1f233bde8747
INFO:filelock:Lock 139884104848328 released on /root/.cache/huggingface/datasets/downloads/9d5462987ef5f814fe15a369c1724f6ec39a2018b3b6271a9d7d2598686ca2ff.01470d0bbaa4753fc1435055451f474b824c23e0dc139470f39b1f233bde8747.lock
INFO:nlp.utils.info_utils:All the checksums matched successfully.
INFO:nlp.builder:Generating split train





HBox(children=(FloatProgress(value=1.0, bar_style='info', max=1.0), HTML(value='')))

INFO:root:generating examples from = /root/.cache/huggingface/datasets/downloads/b8bb19735e1bb591510a01cc032f4c9f969bc0eeb081ae1b328cd306f3b24008.961f90ccac96b3e5df3c9ebb533f58da8f3ae596f5418c74cc814af15b348739
INFO:nlp.arrow_writer:Done writing 87599 examples in 79317110 bytes /root/.cache/huggingface/datasets/squad/plain_text/1.0.0.incomplete/squad-train.arrow.
INFO:nlp.builder:Generating split validation




HBox(children=(FloatProgress(value=1.0, bar_style='info', max=1.0), HTML(value='')))

INFO:root:generating examples from = /root/.cache/huggingface/datasets/downloads/9d5462987ef5f814fe15a369c1724f6ec39a2018b3b6271a9d7d2598686ca2ff.01470d0bbaa4753fc1435055451f474b824c23e0dc139470f39b1f233bde8747
INFO:nlp.arrow_writer:Done writing 10570 examples in 10472653 bytes /root/.cache/huggingface/datasets/squad/plain_text/1.0.0.incomplete/squad-validation.arrow.
INFO:nlp.utils.info_utils:All the splits matched successfully.
INFO:nlp.builder:Constructing Dataset for split validation[:10%], from /root/.cache/huggingface/datasets/squad/plain_text/1.0.0


Dataset squad downloaded and prepared to /root/.cache/huggingface/datasets/squad/plain_text/1.0.0. Subsequent calls will reuse this data.


This call to `nlp.load_dataset()` performs the following steps under the hood:

1. Download the **SQuAD python processing script** from the HuggingFace AWS bucket and import it into the library, if it's not already stored there. You can find the SQuAD processing script [here](https://github.com/huggingface/nlp/tree/master/datasets/squad/squad.py), for instance.

 Processing scripts are small python scripts which define the info (citation, description) and format of the dataset, and contain the URL of the original SQuAD JSON files along with the code to load examples from them.


2. Run the SQuAD python processing script, which will:
 - **Download the SQuAD dataset** from the original URL (see the script) if it's not already downloaded and cached.
 - **Process and cache** all of SQuAD in a structured Arrow table for each standard split, stored on the drive.

 Arrow tables are arbitrarily long tables, typed with types that can be mapped to numpy/pandas/python standard types, and can store nested objects. They can be directly accessed from the drive, loaded in RAM or even streamed over the web.
 

3. Return a **dataset built from the splits** requested by the user (default: all). In the above example we create a dataset with the first 10% of the validation split.
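
To make the `validation[:10%]` split string from step 3 concrete, here is a minimal sketch of how such a spec could be resolved into row boundaries. The helper `resolve_split_slice` is hypothetical (it is not part of `nlp`), it only handles the percent form used above, and rounding boundaries to the nearest row is an assumption.

```python
import re

def resolve_split_slice(spec, num_rows):
    """Turn a spec such as 'validation[:10%]' into (split_name, start, stop).

    Hypothetical helper for illustration only: it handles the
    'name[start%:stop%]' form and rounds boundaries to the nearest row.
    """
    match = re.fullmatch(r"(\w+)(?:\[(?:(\d+)%)?:(?:(\d+)%)?\])?", spec)
    if match is None:
        raise ValueError(f"Unsupported split spec: {spec!r}")
    name, start_pct, stop_pct = match.groups()
    start = round(num_rows * int(start_pct) / 100) if start_pct else 0
    stop = round(num_rows * int(stop_pct) / 100) if stop_pct else num_rows
    return name, start, stop

# The validation split written above contains 10570 examples.
print(resolve_split_slice("validation[:10%]", 10570))
```

With the 10570 validation examples reported in the log, the first 10% covers rows 0 to 1057, which matches the `num_rows: 1057` shown when the dataset is printed.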

In [0]:
# Information on the dataset (description, citation, size, splits, format...)
# are provided in `dataset.info` (as a python dataclass)
for key, value in asdict(dataset.info).items():
 print('👉 ' + key + ': ' + str(value))

👉 description: Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.

👉 citation: @article{2016arXiv160605250R,
 author = {{Rajpurkar}, Pranav and {Zhang}, Jian and {Lopyrev},
 Konstantin and {Liang}, Percy},
 title = "{SQuAD: 100,000+ Questions for Machine Comprehension of Text}",
 journal = {arXiv e-prints},
 year = 2016,
 eid = {arXiv:1606.05250},
 pages = {arXiv:1606.05250},
archivePrefix = {arXiv},
 eprint = {1606.05250},
}

👉 homepage: https://rajpurkar.github.io/SQuAD-explorer/
👉 license: 
👉 features: {'id': {'dtype': 'string', 'id': None, '_type': 'Value'}, 'title': {'dtype': 'string', 'id': None, '_type': 'Value'}, 'context': {'dtype': 'string', 'id': None, '_type': 'Value'}, 'question': {'dtype': 'string', 'id': None, '_type': 'Val

## Inspecting and using the dataset: elements, slices and columns

The returned `Dataset` object is a memory-mapped dataset that behaves similarly to a normal map-style dataset. It is backed by an Apache Arrow table, which enables many interesting features.

In [0]:
print(dataset)

Dataset(schema: {'id': 'string', 'title': 'string', 'context': 'string', 'question': 'string', 'answers': 'struct<text: list<item: string>, answer_start: list<item: int32>>'}, num_rows: 1057)


You can query its length and get items or slices as you normally would with a python sequence.

In [0]:
from pprint import pprint

print(f"👉Dataset len(dataset): {len(dataset)}")
print("\n👉First item 'dataset[0]':")
pprint(dataset[0])

👉Dataset len(dataset): 1057

👉First item 'dataset[0]':
{'answers': {'answer_start': [177, 177, 177],
 'text': ['Denver Broncos', 'Denver Broncos', 'Denver Broncos']},
 'context': 'Super Bowl 50 was an American football game to determine the '
 'champion of the National Football League (NFL) for the 2015 '
 'season. The American Football Conference (AFC) champion Denver '
 'Broncos defeated the National Football Conference (NFC) champion '
 'Carolina Panthers 24–10 to earn their third Super Bowl title. The '
 "game was played on February 7, 2016, at Levi's Stadium in the San "
 'Francisco Bay Area at Santa Clara, California. As this was the '
 '50th Super Bowl, the league emphasized the "golden anniversary" '
 'with various gold-themed initiatives, as well as temporarily '
 'suspending the tradition of naming each Super Bowl game with '
 'Roman numerals (under which the game would have been known as '
 '"Super Bowl L"), so that the logo could prominently feature the '
 'Arabic numerals 

In [0]:
# Or get slices with several examples:
print("\n👉Slice of the two items 'dataset[10:12]':")
pprint(dataset[10:12])


👉Slice of the two items 'dataset[10:12]':
OrderedDict([('id', ['56bea9923aeaaa14008c91bb', '56beace93aeaaa14008c91df']),
 ('title', ['Super_Bowl_50', 'Super_Bowl_50']),
 ('context',
 ['Super Bowl 50 was an American football game to determine the '
 'champion of the National Football League (NFL) for the 2015 '
 'season. The American Football Conference (AFC) champion Denver '
 'Broncos defeated the National Football Conference (NFC) '
 'champion Carolina Panthers 24–10 to earn their third Super '
 "Bowl title. The game was played on February 7, 2016, at Levi's "
 'Stadium in the San Francisco Bay Area at Santa Clara, '
 'California. As this was the 50th Super Bowl, the league '
 'emphasized the "golden anniversary" with various gold-themed '
 'initiatives, as well as temporarily suspending the tradition '
 'of naming each Super Bowl game with Roman numerals (under '
 'which the game would have been known as "Super Bowl L"), so '
 'that the logo could prominently feature the Arabic num

In [0]:
# You can get a full column of the dataset by indexing with its name as a string:
print(dataset['question'][:10])

['Which NFL team represented the AFC at Super Bowl 50?', 'Which NFL team represented the NFC at Super Bowl 50?', 'Where did Super Bowl 50 take place?', 'Which NFL team won Super Bowl 50?', 'What color was used to emphasize the 50th anniversary of the Super Bowl?', 'What was the theme of Super Bowl 50?', 'What day was the game played on?', 'What is the AFC short for?', 'What was the theme of Super Bowl 50?', 'What does AFC stand for?']


The `__getitem__` method returns a different format depending on the type of query:

- Items like `dataset[0]` are returned as a dict of elements.
- Slices like `dataset[10:20]` are returned as a dict of lists of elements.
- Columns like `dataset['question']` are returned as a list of elements.

This may seem surprising at first, but in our experiments it's actually a lot easier to use for data processing than returning the same format for each of these views on the dataset.

In particular, you can easily iterate along columns in slices, and also naturally permute consecutive indexings with identical results, as shown here by permuting column indexing with elements and slices:

In [0]:
print(dataset[0]['question'] == dataset['question'][0])
print(dataset[10:20]['context'] == dataset['context'][10:20])

True
True
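
The three return formats can be pictured with a toy column-oriented container. This is only a sketch in plain Python: `ToyColumnarDataset` is a made-up class, not how `nlp` implements `Dataset` on top of Arrow.

```python
class ToyColumnarDataset:
    """Toy column-oriented dataset (illustration only, not the nlp implementation)."""

    def __init__(self, columns):
        self._columns = columns  # dict: column name -> list of values

    def __len__(self):
        return len(next(iter(self._columns.values())))

    def __getitem__(self, key):
        if isinstance(key, str):
            # Column name -> list of elements
            return self._columns[key]
        if isinstance(key, (int, slice)):
            # int -> dict of elements, slice -> dict of lists of elements
            return {name: values[key] for name, values in self._columns.items()}
        raise TypeError(f"Unsupported key type: {type(key).__name__}")

toy = ToyColumnarDataset({
    "question": ["q0", "q1", "q2", "q3"],
    "context": ["c0", "c1", "c2", "c3"],
})

print(toy[0])               # a dict of elements
print(toy[1:3])             # a dict of lists
print(toy["question"])      # a list of elements
print(toy[0]["question"] == toy["question"][0])
print(toy[1:3]["context"] == toy["context"][1:3])
```

Because every column is stored as one list, indexing by row then column and by column then row reach the same values, which is why the two comparisons agree.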


### Datasets are internally typed and structured

The dataset is backed by one (or several) Apache Arrow tables, which are typed and allow for fast retrieval and access as well as arbitrary-size memory mapping.

This means, respectively, that the format of the dataset is clearly defined and that you can load datasets of arbitrary size without worrying about RAM limitations (basically the dataset takes no space in RAM: it's read directly from the drive when needed, with fast IO access).
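
As a rough illustration of the memory-mapping idea, the sketch below uses Python's standard `mmap` module rather than Arrow. The file layout (a flat run of little-endian int32 values) and the helper `read_values` are assumptions made for this example only.

```python
import mmap
import os
import struct
import tempfile

NUM_VALUES = 100_000

# Write a column of int32 values to a file on the drive.
path = os.path.join(tempfile.mkdtemp(), "toy_column.bin")
with open(path, "wb") as f:
    f.write(struct.pack(f"<{NUM_VALUES}i", *range(NUM_VALUES)))

def read_values(path, start, stop):
    """Read the int32 values in [start, stop) through a memory map.

    Only the bytes of the requested slice are copied into Python objects;
    the rest of the file stays on the drive.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            chunk = mapped[start * 4:stop * 4]
    return list(struct.unpack(f"<{stop - start}i", chunk))

print(read_values(path, 50_000, 50_003))
```

The operating system pages in only the part of the file that is touched, which is the property that lets a memory-mapped dataset be much larger than the available RAM.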

In [0]:
# You can inspect the dataset column names and types
print(dataset.column_names)
print(dataset.schema)

['id', 'title', 'context', 'question', 'answers']
id: string not null
title: string not null
context: string not null
question: string not null
answers: struct<text: list<item: string>, answer_start: list<item: int32>> not null
 child 0, text: list<item: string>
 child 0, item: string
 child 1, answer_start: list<item: int32>
 child 0, item: int32


### Additional misc properties

In [0]:
# Datasets also have a bunch of properties you can access
print("The number of bytes allocated on the drive is ", dataset.nbytes)
print("For comparison, here is the number of bytes allocated in memory which can be")
print("accessed with `nlp.total_allocated_bytes()`: ", nlp.total_allocated_bytes())
print("The number of rows", dataset.num_rows)
print("The number of columns", dataset.num_columns)
print("The shape (rows, columns)", dataset.shape)

The number of bytes allocated on the drive is 9855914
For comparison, here is the number of bytes allocated in memory which can be
accessed with `nlp.total_allocated_bytes()`: 0
The number of rows 1057
The number of columns 5
The shape (rows, columns) (1057, 5)


### Additional misc methods

In [0]:
# We can list the unique elements in a column. This is done by the backend (so fast!)
print(f"dataset.unique('title'): {dataset.unique('title')}")

# This will drop the column 'id' (the dataset is modified in place)
dataset.drop('id')
print(f"After dataset.drop('id'), remaining columns are {dataset.column_names}")

# This will flatten nested columns (in 'answers' in our case)
dataset.flatten()
print(f"After dataset.flatten(), column names are {dataset.column_names}")

# We can also "dictionary encode" a column if many of its elements are similar
# This will reduce its size by only storing the distinct elements (e.g. strings)
# It only affects the internal storage (no difference from a user point of view)
dataset.dictionary_encode_column('title')

dataset.unique('title'): ['Super_Bowl_50', 'Warsaw']
After dataset.drop('id'), remaining columns are ['title', 'context', 'question', 'answers']
After dataset.flatten(), column names are ['title', 'context', 'question', 'answers.text', 'answers.answer_start']
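
To see why dictionary encoding saves space, here is a minimal plain-Python sketch of the idea. The function `dictionary_encode` is hypothetical and only mirrors the concept; the actual encoding is done by the Arrow backend.

```python
def dictionary_encode(values):
    """Return (dictionary, indices) such that dictionary[i] rebuilds each value."""
    dictionary = []   # distinct values, in order of first appearance
    positions = {}    # value -> index in `dictionary`
    indices = []      # one small integer per original element
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices

titles = ["Super_Bowl_50", "Super_Bowl_50", "Warsaw", "Super_Bowl_50", "Warsaw"]
dictionary, indices = dictionary_encode(titles)
decoded = [dictionary[i] for i in indices]

print(dictionary)
print(indices)
print(decoded == titles)
```

Each distinct string is stored once and every row keeps only an integer index, so decoding gives back exactly the original column.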


## Cache

`nlp` datasets are backed by Apache Arrow cache files, which make it possible:
- to load arbitrarily large datasets by using [memory mapping](https://en.wikipedia.org/wiki/Memory-mapped_file) (as long as the datasets can fit on the drive)
- to use a fast backend to process the dataset efficiently
- to do smart caching by storing the results of operations on the drive and reusing them

Let's dive a bit into these parts now.

You can check the current cache files backing the dataset with the `.cache_files` property:

In [0]:
dataset.cache_files

({'filename': '/root/.cache/huggingface/datasets/squad/plain_text/1.0.0/squad-validation.arrow',
 'skip': 0,
 'take': 1057},)

You can clean up the cache files in the current dataset directory (only keeping the currently used one) with `.cleanup_cache_files()`.

Make sure that no other process is using any of the other cache files when you run this command.

In [0]:
dataset.cleanup_cache_files() # Returns the number of removed cache files

INFO:nlp.arrow_dataset:Listing files in /root/.cache/huggingface/datasets/squad/plain_text/1.0.0


0

## Modifying the dataset with `dataset.map`

The powerful `.map()` method, inspired by the `tf.data` map method, lets you apply a function to each example, either independently or in batches.

In [0]:
# `.map()` takes a callable accepting a dict as argument
# (same dict as returned by dataset[i])
# and iterates over the dataset, calling the function with each example.

# Let's print the length of each `context` string in our subset of the dataset
# (10% of the validation i.e. 1057 examples)

dataset.map(lambda example: print(len(example['context']), end=','))

775,775,

0it [00:00, ?it/s]

775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,179,179,179,179,179,179,179,179,179,179,179,168,168,168,168,168,168,168,168,168,168,168,168,168,168,168,168,168,638,638,638,638,638,638,638,638,638,638,638,638,638,638,638,638,638,638,638,638,638,326,326,326,326,326,326,326,326,326,326,326,326,326,326,326,326,326,326,326,326,326,326,326,704,704,704,704,704,704,704,704,704,704,704,704,704,704,704,704,704,704,917,917,917,917,917,917,917,917,917,917,917,917,917,917,917,917,917,917,917,917,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1166,1166,1166,1166,1166,1166,1166,1

637it [00:00, 6365.64it/s]

804,804,804,804,804,804,804,804,804,804,397,397,397,397,397,397,397,397,397,397,397,397,397,397,360,360,360,360,360,360,360,973,973,973,973,973,973,973,973,973,973,973,973,973,973,263,263,263,263,263,263,263,263,263,263,263,568,568,568,568,568,568,568,568,568,568,568,264,264,264,264,264,264,264,264,264,264,264,264,264,264,264,892,892,892,892,892,892,892,892,892,892,892,206,206,206,206,206,489,489,489,489,489,489,489,489,489,489,489,489,489,181,181,181,181,181,181,181,181,181,181,181,181,531,531,531,531,531,531,531,531,531,531,531,531,664,664,664,664,664,664,664,664,664,664,664,664,664,664,672,672,672,672,672,672,672,672,672,672,672,672,672,672,858,858,858,858,858,858,858,858,858,858,858,858,634,634,634,634,634,634,634,634,634,634,634,634,634,634,891,891,891,891,891,891,891,891,891,891,891,891,891,488,488,488,488,488,488,488,488,488,488,488,488,942,942,942,942,942,942,942,942,942,942,942,942,942,942,942,1162,1162,1162,1162,1162,1162,1162,1162,1162,1162,1162,1162,1162,1162,1162,1353,1353

899it [00:00, 4413.66it/s]

,1088,1088,1088,1088,1088,1619,1619,1619,1619,1619,939,939,939,939,939,865,865,865,865,865,711,711,711,711,711,831,831,831,831,831,501,501,501,501,501,676,676,676,676,676,854,854,854,854,854,784,784,784,784,784,641,641,641,641,641,544,544,544,544,544,918,918,918,918,918,763,763,763,763,763,906,906,906,906,906,632,632,632,632,632,869,869,869,869,869,1044,1044,1044,1044,1044,760,760,760,760,760,715,715,715,715,715,838,838,838,838,838,881,881,881,881,881,940,940,940,940,940,618,618,618,618,618,1205,1205,1205,534,534,534,534,534,757,757,757,757,757,1239,1239,1239,1239,1239,609,609,609,609,609,798,798,798,798,798,613,613,613,613,613,613,

1057it [00:00, 4215.63it/s]

613,613,613,613,




Dataset(schema: {'title': 'string', 'context': 'string', 'question': 'string', 'answers.text': 'list<item: string>', 'answers.answer_start': 'list<item: int32>'}, num_rows: 1057)

This is basically the same as doing:

```python
for example in dataset:
 function(example)
```

The above example had no effect on the dataset because the function we supplied to `.map()` didn't return a `dict` or an `abc.Mapping` that could be used to update the examples in the dataset.

In such a case, `.map()` will return the same dataset (`self`).

Now let's see how we can use a function that actually modifies the dataset.

### Modifying the dataset example by example

The main interest of `.map()` is to update and modify the content of the table while leveraging smart caching and the fast backend.

To use `.map()` to update elements in the table, you need to provide a function with the following signature: `function(example: dict) -> dict`.

In [0]:
# Let's add a prefix 'My cute title: ' to each of our titles

def add_prefix_to_title(example):
 example['title'] = 'My cute title: ' + example['title']
 return example

dataset = dataset.map(add_prefix_to_title)

print(dataset.unique('title'))

INFO:nlp.arrow_dataset:Caching processed dataset at /root/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-7fc546b401ec7a73d642e3460f4bcaa3.arrow
1057it [00:00, 13900.01it/s]
INFO:nlp.arrow_writer:Done writing 1057 examples in 905032 bytes /root/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-7fc546b401ec7a73d642e3460f4bcaa3.arrow.


['My cute title: Super_Bowl_50', 'My cute title: Warsaw']


This call to `.map()` computes and returns the updated table. It also stores the updated table in a cache file indexed by the current state and the mapped function.

A subsequent call to `.map()` (even in another python session) will reuse the cached file instead of recomputing the operation.

You can test this by running the previous cell again: you will see that the results are loaded directly from the cache and not recomputed.

The updated dataset returned by `.map()` is (again) directly memory-mapped from the drive and not allocated in RAM.
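
One way to picture a cache file "indexed by the current state and the mapped function" is to hash both together and use the digest in the file name. The sketch below is an assumption-laden illustration: `toy_cache_file_name` is hypothetical, hashing the function's bytecode and constants is a simplification, and the notebook does not show how `nlp` actually derives the hash. Only the `cache-<hash>.arrow` naming pattern is taken from the log above.

```python
import hashlib

def toy_cache_file_name(previous_state, function):
    """Derive a cache file name from the dataset state and the mapped function.

    Simplified on purpose: only the bytecode, constants and names of a plain
    Python function are hashed, which is not enough for closures or nested
    functions.
    """
    code = function.__code__
    payload = "|".join([
        previous_state,
        code.co_code.hex(),
        repr(code.co_consts),
        repr(code.co_names),
    ])
    digest = hashlib.md5(payload.encode("utf-8")).hexdigest()
    return f"cache-{digest}.arrow"

def add_prefix(example):
    return {"title": "My cute title: " + example["title"]}

def add_other_prefix(example):
    return {"title": "Another title: " + example["title"]}

state = "squad-validation.arrow"  # stand-in for the current dataset state
print(toy_cache_file_name(state, add_prefix) == toy_cache_file_name(state, add_prefix))
print(toy_cache_file_name(state, add_prefix) == toy_cache_file_name(state, add_other_prefix))
```

The same state and function always map to the same file name, so a repeated call can find and reuse the file, while changing either one produces a new name and triggers a fresh computation.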

The function you provide to `.map()` should accept an input with the format of an item of the dataset, i.e. `function(dataset[0])`, and return a python dict.

The columns and types of the output can differ from those of the input dict. In this case the new keys will be added as additional columns in the dataset.

Basically, each dataset example dict is updated with the dictionary returned by the function, like this: `example.update(function(example))`.

In [0]:
# Since the input example dict is updated with our function output dict,
# we can actually just return the updated 'title' field
dataset = dataset.map(lambda example: {'title': 'My cutest title: ' + example['title']})

print(dataset.unique('title'))

INFO:nlp.arrow_dataset:Caching processed dataset at /root/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-e254729a165001477fc910898551132f.arrow
1057it [00:00, 12758.48it/s]
INFO:nlp.arrow_writer:Done writing 1057 examples in 923001 bytes /root/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-e254729a165001477fc910898551132f.arrow.


['My cutest title: My cute title: Super_Bowl_50', 'My cutest title: My cute title: Warsaw']
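
The update semantics described above can be sketched with plain Python dicts. `toy_map` is a hypothetical stand-in that works on a list of dicts; it has none of the caching or Arrow storage of the real `.map()`.

```python
from collections.abc import Mapping

def toy_map(examples, function):
    """Apply `function` to each example and merge any mapping it returns."""
    updated = []
    for example in examples:
        result = function(dict(example))  # the function gets its own copy
        merged = dict(example)
        if isinstance(result, Mapping):
            merged.update(result)         # unknown keys become new columns
        updated.append(merged)
    return updated

rows = [{"title": "Warsaw", "question": "Q0"}]

# Returning only the changed field is enough: the other fields are kept.
renamed = toy_map(rows, lambda ex: {"title": "My cutest title: " + ex["title"]})
# Returning a new key adds a column.
extended = toy_map(rows, lambda ex: {"title_length": len(ex["title"])})
# Returning nothing leaves the examples unchanged.
unchanged = toy_map(rows, lambda ex: None)

print(renamed)
print(extended)
print(unchanged == rows)
```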


#### Removing columns
You can also remove columns when running `.map()` by passing a list of column names to the `remove_columns` argument.

In [0]:
# This will remove the 'title' column while doing the update (after having sent it to the mapped function, so you can still use it in your function!)
dataset = dataset.map(lambda example: {'new_title': 'Wouhahh: ' + example['title']},
 remove_columns=['title'])

print(dataset.column_names)
print(dataset.unique('new_title'))

INFO:nlp.arrow_dataset:Caching processed dataset at /root/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-319ffdab1a236b2101739c4b33dc26d8.arrow
1057it [00:00, 12976.87it/s]
INFO:nlp.arrow_writer:Done writing 1057 examples in 932514 bytes /root/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-319ffdab1a236b2101739c4b33dc26d8.arrow.


['context', 'question', 'answers.text', 'answers.answer_start', 'new_title']
['Wouhahh: My cutest title: My cute title: Super_Bowl_50', 'Wouhahh: My cutest title: My cute title: Warsaw']


#### Using examples indices
With `with_indices=True`, dataset indices (from `0` to `len(dataset) - 1`) will be supplied to the function, which must thus have the following signature: `function(example: dict, index: int) -> dict`.

In [0]:
# This will add the index in the dataset to the 'question' field
dataset = dataset.map(lambda example, idx: {'question': f'{idx}: ' + example['question']},
 with_indices=True)

print('\n'.join(dataset['question'][:5]))

INFO:nlp.arrow_dataset:Caching processed dataset at /root/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-0d7046ac832c326979b2f70469eac9fa.arrow
1057it [00:00, 13039.70it/s]
INFO:nlp.arrow_writer:Done writing 1057 examples in 937746 bytes /root/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-0d7046ac832c326979b2f70469eac9fa.arrow.


0: Which NFL team represented the AFC at Super Bowl 50?
1: Which NFL team represented the NFC at Super Bowl 50?
2: Where did Super Bowl 50 take place?
3: Which NFL team won Super Bowl 50?
4: What color was used to emphasize the 50th anniversary of the Super Bowl?
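
The `with_indices` and `remove_columns` options can be mimicked in a few lines of plain Python. The function `toy_map_with_options` below is hypothetical and operates on a list of dicts; it is meant only to show the order of operations suggested by the examples above (the function sees the full example, then the listed columns are dropped).

```python
def toy_map_with_options(examples, function, with_indices=False, remove_columns=()):
    """Toy per-example map supporting indices and column removal."""
    updated = []
    for idx, example in enumerate(examples):
        # The function still receives the columns that are about to be removed.
        if with_indices:
            result = function(dict(example), idx)
        else:
            result = function(dict(example))
        merged = {k: v for k, v in example.items() if k not in remove_columns}
        merged.update(result)
        updated.append(merged)
    return updated

rows = [
    {"title": "Warsaw", "question": "Q0"},
    {"title": "Warsaw", "question": "Q1"},
]

indexed = toy_map_with_options(
    rows,
    lambda ex, idx: {"question": f"{idx}: " + ex["question"]},
    with_indices=True,
)
retitled = toy_map_with_options(
    rows,
    lambda ex: {"new_title": "Wouhahh: " + ex["title"]},
    remove_columns=["title"],
)

print(indexed)
print(retitled)
```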


### Modifying the dataset with batched updates

`.map()` can also work with batches of examples (slices of the dataset).

This is particularly interesting if you have a function that can handle batches of inputs, like the tokenizers of HuggingFace `tokenizers`.

To work on batched inputs, set `batched=True` when calling `.map()` and supply a function with the following signature: `function(examples: Dict[List]) -> Dict[List]` or, if you use indices, `function(examples: Dict[List], indices: List[int]) -> Dict[List]`.

Basically, your function should accept an input with the format of a slice of the dataset: `function(dataset[:10])`.
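
Before bringing in a real tokenizer, the batched calling convention can be sketched in plain Python. `toy_batched_map` is a hypothetical helper working on a dict of lists; its `batch_size` parameter is chosen for this illustration and is not taken from the notebook.

```python
def toy_batched_map(columns, function, batch_size=2):
    """Call `function` on successive slices (dicts of lists) and stitch the results."""
    num_rows = len(next(iter(columns.values())))
    output = {}
    for start in range(0, num_rows, batch_size):
        # Same format as a slice of the dataset: a dict of lists.
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        result = {**batch, **function(batch)}
        for name, values in result.items():
            output.setdefault(name, []).extend(values)
    return output

columns = {"question": ["a", "bb", "ccc", "dddd", "eeeee"]}
batch_sizes = []

def add_length(batch):
    batch_sizes.append(len(batch["question"]))  # record how many rows we received
    return {"length": [len(q) for q in batch["question"]]}

result = toy_batched_map(columns, add_length, batch_size=2)

print(result)
print(batch_sizes)
```

The function is called once per slice instead of once per example, which is what makes batch-capable functions such as fast tokenizers pay off.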

In [0]:
!pip install transformers

Collecting transformers
 Downloading https://files.pythonhosted.org/packages/12/b5/ac41e3e95205ebf53439e4dd087c58e9fd371fd8e3724f2b9b4cdb8282e5/transformers-2.10.0-py3-none-any.whl (660kB)
 |████████████████████████████████| 665kB 3.5MB/s 
Collecting sentencepiece
 Downloading https://files.pythonhosted.org/packages/d4/a4/d0a884c4300004a78cca907a6ff9a5e9fe4f090f5d95ab341c53d28cbc58/sentencepiece-0.1.91-cp36-cp36m-manylinux1_x86_64.whl (1.1MB)
 |████████████████████████████████| 1.1MB 17.6MB/s 
Collecting sacremoses
 Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
 |████████████████████████████████| 890kB 25.9MB/s 
Collecting tokenizers==0.7.0
 Downloading https://files.pythonhosted.org/packages/14/e5/a26eb4716523808bb0a799fcfdceb6ebf77a18169d9591b2f46a9adb87d9/tokenizers-0.7.0-cp36-cp36m-manylinux1_x86_64.whl (3.8MB)
 |███████████████████████

In [0]:
# Let's import a fast tokenizer that can work on batched inputs
# (the 'Fast' tokenizers in HuggingFace)
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')

INFO:transformers.file_utils:PyTorch version 1.5.0+cu101 available.
INFO:transformers.file_utils:TensorFlow version 2.2.0 available.
INFO:filelock:Lock 139884348804680 acquired on /root/.cache/torch/transformers/5e8a2b4893d13790ed4150ca1906be5f7a03d6c4ddf62296c383f6db42814db2.e13dbb970cb325137104fb2e5f36fe865f27746c6b526f6352861b1980eb80b1.lock
INFO:transformers.file_utils:https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-vocab.txt not found in cache or force_download set to True, downloading to /root/.cache/torch/transformers/tmpbrrc_uwe


HBox(children=(FloatProgress(value=0.0, description='Downloading', max=213450.0, style=ProgressStyle(descripti…

INFO:transformers.file_utils:storing https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-vocab.txt in cache at /root/.cache/torch/transformers/5e8a2b4893d13790ed4150ca1906be5f7a03d6c4ddf62296c383f6db42814db2.e13dbb970cb325137104fb2e5f36fe865f27746c6b526f6352861b1980eb80b1
INFO:transformers.file_utils:creating metadata file for /root/.cache/torch/transformers/5e8a2b4893d13790ed4150ca1906be5f7a03d6c4ddf62296c383f6db42814db2.e13dbb970cb325137104fb2e5f36fe865f27746c6b526f6352861b1980eb80b1
INFO:filelock:Lock 139884348804680 released on /root/.cache/torch/transformers/5e8a2b4893d13790ed4150ca1906be5f7a03d6c4ddf62296c383f6db42814db2.e13dbb970cb325137104fb2e5f36fe865f27746c6b526f6352861b1980eb80b1.lock
INFO:transformers.tokenization_utils:loading file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-vocab.txt from cache at /root/.cache/torch/transformers/5e8a2b4893d13790ed4150ca1906be5f7a03d6c4ddf62296c383f6db42814db2.e13dbb970cb325137104fb2e5f36fe865f27746c




In [0]:
# Now let's batch tokenize our dataset 'context'
dataset = dataset.map(lambda example: tokenizer.batch_encode_plus(example['context']),
 batched=True)

print("dataset[0]", dataset[0])

INFO:nlp.arrow_dataset:Caching processed dataset at /root/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-4c8436e14fee9674f678b8735b43c65e.arrow
100%|██████████| 2/2 [00:00<00:00, 3.54it/s]
INFO:nlp.arrow_writer:Done writing 1057 examples in 4749270 bytes /root/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-4c8436e14fee9674f678b8735b43c65e.arrow.


dataset[0] {'context': 'Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi\'s Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.', 'question': '0: Which NFL team represented the AFC at Super Bowl 50?', 'answers.text': ['Denver Broncos', 'Denver Broncos', 'Denver Broncos'], 'answers.answer_start': [177, 177, 177], 'new_title': 

In [0]:
# we have added additional columns
print(dataset.column_names)

['context', 'question', 'answers.text', 'answers.answer_start', 'new_title', 'input_ids', 'token_type_ids', 'attention_mask']


In [0]:
# Let's show a more complex processing with the full preparation of the SQuAD dataset
# for training a model from Transformers
def convert_to_features(batch):
    # Tokenize contexts and questions (as pairs of inputs)
    # keep offset mappings for evaluation
    input_pairs = list(zip(batch['context'], batch['question']))
    encodings = tokenizer.batch_encode_plus(input_pairs,
                                            pad_to_max_length=True,
                                            return_offsets_mapping=True)

    # Compute start and end tokens for labels
    start_positions, end_positions = [], []
    for i, (text, start) in enumerate(zip(batch['answers.text'], batch['answers.answer_start'])):
        first_char = start[0]
        last_char = first_char + len(text[0]) - 1
        start_positions.append(encodings.char_to_token(i, first_char))
        end_positions.append(encodings.char_to_token(i, last_char))

    encodings.update({'start_positions': start_positions, 'end_positions': end_positions})
    return encodings

dataset = dataset.map(convert_to_features, batched=True)

INFO:nlp.arrow_dataset:Caching processed dataset at /root/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-3cceeef76f89add124dd3c1c12d2f776.arrow
100%|██████████| 2/2 [00:00<00:00, 2.50it/s]
INFO:nlp.arrow_writer:Done writing 1057 examples in 21643250 bytes /root/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-3cceeef76f89add124dd3c1c12d2f776.arrow.


In [0]:
# Now our dataset comprises the labels for the start and end positions
# as well as the offsets for converting tokens back
# into spans of the original string for evaluation
print("column_names", dataset.column_names)
print("start_positions", dataset[:5]['start_positions'])

column_names ['context', 'question', 'answers.text', 'answers.answer_start', 'new_title', 'input_ids', 'token_type_ids', 'attention_mask', 'offset_mapping', 'start_positions', 'end_positions']
start_positions [34, 45, 80, 34, 98]
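
To make the character-to-token step more concrete, here is a toy sketch of what a lookup like `char_to_token` does conceptually: given the character span covered by each token, find the token that contains a given character. The helper `toy_char_to_token`, the sample text and the hand-written offsets are all made up for illustration; a real fast tokenizer computes its own offsets (including special tokens) and may behave differently in edge cases.

```python
from typing import List, Optional, Tuple


def toy_char_to_token(offsets: List[Tuple[int, int]], char_index: int) -> Optional[int]:
    """Return the index of the token whose [start, end) span contains char_index."""
    for token_index, (start, end) in enumerate(offsets):
        if start <= char_index < end:
            return token_index
    return None  # e.g. the character is whitespace between two tokens


text = "Denver Broncos won"
# Hand-written character spans for the tokens "Denver", "Broncos", "won"
offsets = [(0, 6), (7, 14), (15, 18)]

answer = "Broncos"
first_char = text.index(answer)
last_char = first_char + len(answer) - 1  # index of the last character, inclusive

start_token = toy_char_to_token(offsets, first_char)
end_token = toy_char_to_token(offsets, last_char)
print(start_token, end_token)
```

Using the last character of the answer (rather than the exclusive end index) matters here: the exclusive end would point at the space after the answer, which belongs to no token.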


## Formatting outputs for numpy/torch/tensorflow

Now that we have tokenized our inputs, we probably want to use this dataset in a `torch.utils.data.DataLoader` or a `tf.data.Dataset`.

To be able to do this we need to tweak two things:

- format the indexing (`__getitem__`) to return numpy/pytorch/tensorflow tensors, instead of python objects, and probably
- format the indexing (`__getitem__`) to return only the subset of the columns that we need for our model inputs.

 We don't want the columns `id` or `title` as inputs to train our model, but we could still want to keep them in the dataset, for instance for the evaluation of the model.

This is handled by the `.set_format(type: Union[None, str], columns: Union[None, str, List[str]])` method, where:

- `type` defines the return type for our dataset `__getitem__` method and is one of `[None, 'numpy', 'pandas', 'torch', 'tensorflow']` (`None` means return python objects), and
- `columns` defines the columns returned by `__getitem__` and takes the name of a column in the dataset or a list of columns to return (`None` means return all columns).

In [0]:
columns_to_return = ['input_ids', 'token_type_ids', 'attention_mask',
 'start_positions', 'end_positions']

dataset.set_format(type='torch',
 columns=columns_to_return)

# Our dataset indexing output is now ready for being used in a pytorch dataloader
print('\n'.join([' '.join((n, str(type(t)), str(t.shape))) for n, t in dataset[:10].items()]))

INFO:nlp.arrow_dataset:Set __getitem__(key) output type to torch for ['input_ids', 'token_type_ids', 'attention_mask', 'start_positions', 'end_positions'] columns (when key is int or slice) and don't output other (un-formated) columns.


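input_ids <class 'torch.Tensor'> torch.Size([10, 451])
token_type_ids <class 'torch.Tensor'> torch.Size([10, 451])
attention_mask <class 'torch.Tensor'> torch.Size([10, 451])
start_positions <class 'torch.Tensor'> torch.Size([10])
end_positions <class 'torch.Tensor'> torch.Size([10])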


In [0]:
# Note that the columns are not removed from the dataset, just not returned when calling __getitem__
# Similarly the inner type of the dataset is not changed to torch.Tensor, the conversion and filtering is done on-the-fly when querying the dataset
print(dataset.column_names)

['context', 'question', 'answers.text', 'answers.answer_start', 'new_title', 'input_ids', 'token_type_ids', 'attention_mask', 'offset_mapping', 'start_positions', 'end_positions']
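
The on-the-fly behaviour described in the comments above can be pictured with a toy class. This is only an illustrative sketch: `ToyDataset` and its `convert` argument are hypothetical and are not how `nlp` is implemented. The point is that the stored columns never change; only what `__getitem__` returns does.

```python
class ToyDataset:
    """Toy stand-in (not the nlp implementation) illustrating on-the-fly formatting."""

    def __init__(self, data):
        self._data = data        # column name -> list of values, never modified
        self._columns = None     # None means "return all columns"
        self._convert = None     # None means "return plain python objects"

    @property
    def column_names(self):
        return list(self._data)

    def set_format(self, convert=None, columns=None):
        self._convert = convert
        self._columns = columns

    def reset_format(self):
        self.set_format()

    def __getitem__(self, key):
        names = self._columns if self._columns is not None else self.column_names
        convert = self._convert if self._convert is not None else (lambda value: value)
        # Filtering and conversion happen here, at query time
        return {name: convert(self._data[name][key]) for name in names}


toy = ToyDataset({"context": ["a", "b"], "input_ids": [[1, 2], [3, 4]]})

toy.set_format(convert=tuple, columns=["input_ids"])  # tuple stands in for a tensor type
formatted = toy[0]

toy.reset_format()
plain = toy[0]

print(formatted, plain, toy.column_names)
```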


In [0]:
# We can remove the formatting with `.reset_format()`
# or, identically, a call to `.set_format()` with no arguments
dataset.reset_format()

print('\n'.join([' '.join((n, str(type(t)))) for n, t in dataset[:10].items()]))

INFO:nlp.arrow_dataset:Set __getitem__(key) output type to python objects for no columns (when key is int or slice) and don't output other (un-formated) columns.


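context <class 'list'>
question <class 'list'>
answers.text <class 'list'>
answers.answer_start <class 'list'>
new_title <class 'list'>
input_ids <class 'list'>
token_type_ids <class 'list'>
attention_mask <class 'list'>
offset_mapping <class 'list'>
start_positions <class 'list'>
end_positions <class 'list'>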


In [0]:
# The current format can be checked with `.format`,
# which is a dict of the type and formatting options
dataset.format

{'columns': ['context',
 'question',
 'answers.text',
 'answers.answer_start',
 'new_title',
 'input_ids',
 'token_type_ids',
 'attention_mask',
 'offset_mapping',
 'start_positions',
 'end_positions'],
 'output_all_columns': False,
 'type': 'python'}

# Wrapping this all up (PyTorch)

Let's wrap this all up with the full code to load and prepare SQuAD for training a PyTorch model from the HuggingFace `transformers` library.



In [13]:
!pip install transformers



In [0]:
import nlp
import torch
from transformers import BertTokenizerFast

# Load our training dataset and tokenizer
dataset = nlp.load_dataset('squad')
tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')

def get_correct_alignement(context, answer):
    """ Some original examples in SQuAD have indices wrong by 1 or 2 characters. We test and fix this here. """
    gold_text = answer['text'][0]
    start_idx = answer['answer_start'][0]
    end_idx = start_idx + len(gold_text)
    if context[start_idx:end_idx] == gold_text:
        return start_idx, end_idx       # When the gold label position is good
    elif context[start_idx-1:end_idx-1] == gold_text:
        return start_idx-1, end_idx-1   # When the gold label is off by one character
    elif context[start_idx-2:end_idx-2] == gold_text:
        return start_idx-2, end_idx-2   # When the gold label is off by two characters
    else:
        raise ValueError()

# Tokenize our training dataset
def convert_to_features(example_batch):
    # Tokenize contexts and questions (as pairs of inputs)
    input_pairs = list(zip(example_batch['context'], example_batch['question']))
    encodings = tokenizer.batch_encode_plus(input_pairs, pad_to_max_length=True)

    # Compute start and end tokens for labels using the alignment methods of Transformers' fast tokenizers.
    start_positions, end_positions = [], []
    for i, (context, answer) in enumerate(zip(example_batch['context'], example_batch['answers'])):
        start_idx, end_idx = get_correct_alignement(context, answer)
        start_positions.append(encodings.char_to_token(i, start_idx))
        end_positions.append(encodings.char_to_token(i, end_idx-1))
    encodings.update({'start_positions': start_positions,
                      'end_positions': end_positions})
    return encodings

dataset['train'] = dataset['train'].map(convert_to_features, batched=True)

# Format our dataset to output torch.Tensor to train a pytorch model
columns = ['input_ids', 'token_type_ids', 'attention_mask', 'start_positions', 'end_positions']
dataset['train'].set_format(type='torch', columns=columns)

# Instantiate a PyTorch Dataloader around our dataset
dataloader = torch.utils.data.DataLoader(dataset['train'], batch_size=8)
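
To see what the alignment check does, here is a standalone copy of the helper run on a made-up context (not taken from SQuAD) where the recorded `answer_start` is exact, off by one, and off by two characters. In all three cases the returned span is the true location of the answer.

```python
def get_correct_alignement(context, answer):
    """Standalone copy of the helper, repeated here so the example runs on its own."""
    gold_text = answer['text'][0]
    start_idx = answer['answer_start'][0]
    end_idx = start_idx + len(gold_text)
    if context[start_idx:end_idx] == gold_text:
        return start_idx, end_idx
    elif context[start_idx-1:end_idx-1] == gold_text:
        return start_idx-1, end_idx-1
    elif context[start_idx-2:end_idx-2] == gold_text:
        return start_idx-2, end_idx-2
    else:
        raise ValueError()


# Made-up example: the answer really starts at character 4
context = "The Denver Broncos won."

exact = get_correct_alignement(context, {'text': ['Denver Broncos'], 'answer_start': [4]})
off_by_one = get_correct_alignement(context, {'text': ['Denver Broncos'], 'answer_start': [5]})
off_by_two = get_correct_alignement(context, {'text': ['Denver Broncos'], 'answer_start': [6]})

print(exact, off_by_one, off_by_two)
print(context[exact[0]:exact[1]])
```

The returned end index is exclusive, which is why the feature conversion looks up the token at `end_idx-1`.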

In [0]:
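# Let's load a pretrained Bert model and a simple optimizer
from transformers import BertForQuestionAnswering

# Note: 'distilbert-base-cased' is a DistilBERT checkpoint, so its weights do not match
# BertForQuestionAnswering (see the "not initialized" message in the log below);
# 'bert-base-cased' would match the tokenizer loaded above.
model = BertForQuestionAnswering.from_pretrained('distilbert-base-cased')
optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)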

INFO:filelock:Lock 139884094601256 acquired on /root/.cache/torch/transformers/774d52b0be7c2f621ac9e64708a8b80f22059f6d0e264e1bdc4f4d71c386c4ea.f44aaaab97e2ee0f8d9071a5cd694e19bf664237a92aea20ebe04ddb7097b494.lock
INFO:transformers.file_utils:https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-cased-config.json not found in cache or force_download set to True, downloading to /root/.cache/torch/transformers/tmp1uhk_b1k


HBox(children=(FloatProgress(value=0.0, description='Downloading', max=411.0, style=ProgressStyle(description_…

INFO:transformers.file_utils:storing https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-cased-config.json in cache at /root/.cache/torch/transformers/774d52b0be7c2f621ac9e64708a8b80f22059f6d0e264e1bdc4f4d71c386c4ea.f44aaaab97e2ee0f8d9071a5cd694e19bf664237a92aea20ebe04ddb7097b494
INFO:transformers.file_utils:creating metadata file for /root/.cache/torch/transformers/774d52b0be7c2f621ac9e64708a8b80f22059f6d0e264e1bdc4f4d71c386c4ea.f44aaaab97e2ee0f8d9071a5cd694e19bf664237a92aea20ebe04ddb7097b494
INFO:filelock:Lock 139884094601256 released on /root/.cache/torch/transformers/774d52b0be7c2f621ac9e64708a8b80f22059f6d0e264e1bdc4f4d71c386c4ea.f44aaaab97e2ee0f8d9071a5cd694e19bf664237a92aea20ebe04ddb7097b494.lock
INFO:transformers.configuration_utils:loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-cased-config.json from cache at /root/.cache/torch/transformers/774d52b0be7c2f621ac9e64708a8b80f22059f6d0e264e1bdc4f4d71c386c4ea.f44aaaab9




INFO:filelock:Lock 139884094601256 acquired on /root/.cache/torch/transformers/185eb053d63bc5c2d6994e4b2a8e5eb59f31af90db9c5fae5e38c32a986462cb.857b7d17ad0bfaa2eec50caf481575bab1073303fef16bd5f29bc5248b2b8c7d.lock
INFO:transformers.file_utils:https://cdn.huggingface.co/distilbert-base-cased-pytorch_model.bin not found in cache or force_download set to True, downloading to /root/.cache/torch/transformers/tmp8t3mu3iu


HBox(children=(FloatProgress(value=0.0, description='Downloading', max=263273408.0, style=ProgressStyle(descri…

INFO:transformers.file_utils:storing https://cdn.huggingface.co/distilbert-base-cased-pytorch_model.bin in cache at /root/.cache/torch/transformers/185eb053d63bc5c2d6994e4b2a8e5eb59f31af90db9c5fae5e38c32a986462cb.857b7d17ad0bfaa2eec50caf481575bab1073303fef16bd5f29bc5248b2b8c7d
INFO:transformers.file_utils:creating metadata file for /root/.cache/torch/transformers/185eb053d63bc5c2d6994e4b2a8e5eb59f31af90db9c5fae5e38c32a986462cb.857b7d17ad0bfaa2eec50caf481575bab1073303fef16bd5f29bc5248b2b8c7d
INFO:filelock:Lock 139884094601256 released on /root/.cache/torch/transformers/185eb053d63bc5c2d6994e4b2a8e5eb59f31af90db9c5fae5e38c32a986462cb.857b7d17ad0bfaa2eec50caf481575bab1073303fef16bd5f29bc5248b2b8c7d.lock
INFO:transformers.modeling_utils:loading weights file https://cdn.huggingface.co/distilbert-base-cased-pytorch_model.bin from cache at /root/.cache/torch/transformers/185eb053d63bc5c2d6994e4b2a8e5eb59f31af90db9c5fae5e38c32a986462cb.857b7d17ad0bfaa2eec50caf481575bab1073303fef16bd5f29bc5248b




INFO:transformers.modeling_utils:Weights of BertForQuestionAnswering not initialized from pretrained model: ['embeddings.word_embeddings.weight', 'embeddings.position_embeddings.weight', 'embeddings.token_type_embeddings.weight', 'embeddings.LayerNorm.weight', 'embeddings.LayerNorm.bias', 'encoder.layer.0.attention.self.query.weight', 'encoder.layer.0.attention.self.query.bias', 'encoder.layer.0.attention.self.key.weight', 'encoder.layer.0.attention.self.key.bias', 'encoder.layer.0.attention.self.value.weight', 'encoder.layer.0.attention.self.value.bias', 'encoder.layer.0.attention.output.dense.weight', 'encoder.layer.0.attention.output.dense.bias', 'encoder.layer.0.attention.output.LayerNorm.weight', 'encoder.layer.0.attention.output.LayerNorm.bias', 'encoder.layer.0.intermediate.dense.weight', 'encoder.layer.0.intermediate.dense.bias', 'encoder.layer.0.output.dense.weight', 'encoder.layer.0.output.dense.bias', 'encoder.layer.0.output.LayerNorm.weight', 'encoder.layer.0.output.LayerNo

In [0]:
# Now let's train our model

model.train()
for i, batch in enumerate(dataloader):
    outputs = model(**batch)
    loss = outputs[0]
    loss.backward()
    optimizer.step()
    model.zero_grad()
    print(f'Step {i} - loss: {loss:.3}')
    if i > 3:
        break

Step 0 - loss: 6.42
Step 1 - loss: 5.64
Step 2 - loss: 5.09
Step 3 - loss: 5.59
Step 4 - loss: 4.81


# Wrapping this all up (Tensorflow)

Let's wrap this all up with the full code to load and prepare SQuAD for training a TensorFlow model (this requires TensorFlow 2.2.0 or later).

In [15]:
import tensorflow as tf
import nlp
from transformers import BertTokenizerFast

# Load our training dataset and tokenizer
train_tf_dataset = nlp.load_dataset('squad', split="train")
tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')

# Tokenize our training dataset
# The only difference here is that start_positions and end_positions
# must be lists of single-element lists => [[23], [45] ...]
# instead of => [23, 45 ...]
# (get_correct_alignement is the helper defined in the PyTorch section above)
def convert_to_tf_features(example_batch):
    # Tokenize contexts and questions (as pairs of inputs)
    input_pairs = list(zip(example_batch['context'], example_batch['question']))
    encodings = tokenizer.batch_encode_plus(input_pairs, pad_to_max_length=True, max_length=tokenizer.max_len)

    # Compute start and end tokens for labels using the alignment methods of Transformers' fast tokenizers.
    start_positions, end_positions = [], []
    for i, (context, answer) in enumerate(zip(example_batch['context'], example_batch['answers'])):
        start_idx, end_idx = get_correct_alignement(context, answer)
        start_positions.append([encodings.char_to_token(i, start_idx)])
        end_positions.append([encodings.char_to_token(i, end_idx-1)])

    if start_positions and end_positions:
        encodings.update({'start_positions': start_positions,
                          'end_positions': end_positions})
    return encodings

train_tf_dataset = train_tf_dataset.map(convert_to_tf_features, batched=True)

# Keep only the examples whose start and end positions were both found (no None in either)
def remove_none_values(example):
    return None not in example["start_positions"] and None not in example["end_positions"]

train_tf_dataset = train_tf_dataset.filter(remove_none_values, load_from_cache_file=False)
columns = ['input_ids', 'token_type_ids', 'attention_mask', 'start_positions', 'end_positions']
train_tf_dataset.set_format(type='tensorflow', columns=columns)
features = {x: train_tf_dataset[x].to_tensor(default_value=0, shape=[None, tokenizer.max_len]) for x in columns[:3]}
labels = {"output_1": train_tf_dataset["start_positions"].to_tensor(default_value=0, shape=[None, 1])}
labels["output_2"] = train_tf_dataset["end_positions"].to_tensor(default_value=0, shape=[None, 1])
tfdataset = tf.data.Dataset.from_tensor_slices((features, labels)).batch(8)

100%|██████████| 88/88 [00:38<00:00, 2.30it/s]
100%|██████████| 88/88 [00:38<00:00, 2.26it/s]
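
The label layout used for TensorFlow can be illustrated without any library. The sketch below uses made-up token positions and a hypothetical filter name (`has_both_labels`); it only shows the shape of the data: each label is wrapped in its own single-element list, and rows where a position could not be found (`None`) are dropped before conversion to tensors.

```python
# Made-up token positions; None marks an answer whose token could not be located
raw_starts = [23, None, 45]
raw_ends = [25, None, 47]

# Wrap each label in its own list: 23 -> [23], so a column looks like [[23], [None], [45]]
examples = [
    {"start_positions": [start], "end_positions": [end]}
    for start, end in zip(raw_starts, raw_ends)
]


def has_both_labels(example):
    """Hypothetical filter: keep an example only when neither label contains None."""
    return None not in example["start_positions"] and None not in example["end_positions"]


kept = [example for example in examples if has_both_labels(example)]
print(kept)
```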


In [0]:
# Let's load a pretrained TF2 Bert model and a simple optimizer
from transformers import TFBertForQuestionAnswering

model = TFBertForQuestionAnswering.from_pretrained("bert-base-cased")
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(reduction=tf.keras.losses.Reduction.NONE, from_logits=True)
opt = tf.keras.optimizers.Adam(learning_rate=3e-5)
model.compile(optimizer=opt,
 loss={'output_1': loss_fn, 'output_2': loss_fn},
 loss_weights={'output_1': 1., 'output_2': 1.},
 metrics=['accuracy'])

In [17]:
# Now let's train our model

model.fit(tfdataset, epochs=1, steps_per_epoch=3)

# Metrics API

`nlp` also provides easy access and sharing of metrics.

This aspect of the library is still experimental and the API may still evolve more than the datasets API.

Like datasets, metrics are added as small scripts wrapping common metrics in a common API.

There are several reasons you may want to use metrics with `nlp`, in particular:

- metrics for specific datasets like GLUE or SQuAD are provided out-of-the-box in a simple, convenient and consistent way integrated with the dataset,
- metrics in `nlp` leverage the powerful backend to provide smart features out-of-the-box, like support for distributed evaluation in PyTorch.

## Using metrics

Using metrics is pretty simple. They have two main methods: `.compute(predictions, references)` to directly compute the metric, and `.add(prediction, reference)` or `.add_batch(predictions, references)` to only store some results if you want to do the evaluation in one go at the end.

Here is a quick gist of a standard use of metrics (the simplest usage):
```python
import nlp
bleu_metric = nlp.load_metric('bleu')

# If you only have a single iteration, you can easily compute the score like this
predictions = model(inputs)
score = bleu_metric.compute(predictions, references)

# If you have a loop, you can "add" your predictions and references at each iteration instead of having to save them yourself (the metric object stores them efficiently for you)
for batch in dataloader:
    model_inputs, targets = batch
    predictions = model(model_inputs)
    bleu_metric.add_batch(predictions, targets)
score = bleu_metric.compute()  # Compute the score from all the stored predictions/references
```
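
To illustrate the accumulate-then-compute pattern in a form that runs on its own, here is a toy metric with the same three methods. `ToyAccuracy` is a hypothetical stand-in written for this example, not the `nlp` implementation, and it computes plain accuracy rather than BLEU.

```python
class ToyAccuracy:
    """Toy stand-in for a metric object (not the nlp implementation)."""

    def __init__(self):
        self._predictions = []
        self._references = []

    def add(self, prediction, reference):
        # Store a single prediction/reference pair
        self._predictions.append(prediction)
        self._references.append(reference)

    def add_batch(self, predictions, references):
        # Store a whole batch of pairs
        self._predictions.extend(predictions)
        self._references.extend(references)

    def compute(self, predictions=None, references=None):
        # Direct use: score the given inputs; otherwise score everything stored so far
        if predictions is not None and references is not None:
            self.add_batch(predictions, references)
        correct = sum(p == r for p, r in zip(self._predictions, self._references))
        score = correct / len(self._predictions)
        self._predictions, self._references = [], []  # start fresh for the next evaluation
        return score


metric = ToyAccuracy()
fake_batches = [([1, 0], [1, 1]), ([0, 1], [0, 1])]  # (predictions, targets) pairs
for predictions, targets in fake_batches:
    metric.add_batch(predictions, targets)
score = metric.compute()
print(score)
```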

Here is a quick gist of a use in a distributed torch setup (should work for any python multi-process setup actually). It's pretty much identical to the second example above:
```python
import nlp
import torch
# You need to give the total number of parallel python processes (num_process) and the id of each process (process_id)
bleu_metric = nlp.load_metric('bleu', process_id=torch.distributed.get_rank(), num_process=torch.distributed.get_world_size())

for batch in dataloader:
    model_inputs, targets = batch
    predictions = model(model_inputs)
    bleu_metric.add_batch(predictions, targets)
score = bleu_metric.compute()  # Compute the score on the first node by default (can be set to compute on each node as well)
```
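
Conceptually, each process stores only its own shard of predictions, and the score is computed once over the union of all shards. The simulation below runs the "processes" one after another in a single interpreter. `ShardedToyMetric` and its shared `store` dict are hypothetical, and returning `None` on the other processes is an assumption of this sketch; it says nothing about how `nlp` actually exchanges data between processes.

```python
class ShardedToyMetric:
    """Toy simulation of per-process accumulation (not the nlp implementation)."""

    def __init__(self, store, process_id, num_process):
        self.store = store              # shared storage standing in for a common cache
        self.process_id = process_id
        self.num_process = num_process
        self.store.setdefault(process_id, [])

    def add_batch(self, predictions, references):
        # Each process writes only to its own shard
        self.store[self.process_id].extend(zip(predictions, references))

    def compute(self):
        # Only process 0 gathers every shard and computes the score
        if self.process_id != 0:
            return None
        pairs = [pair for rank in range(self.num_process) for pair in self.store.get(rank, [])]
        return sum(p == r for p, r in pairs) / len(pairs)


store = {}
shards = {0: ([1, 0], [1, 1]), 1: ([0, 1], [0, 1])}  # rank -> (predictions, targets)
metrics = [ShardedToyMetric(store, rank, num_process=2) for rank in range(2)]

for rank, (predictions, targets) in shards.items():
    metrics[rank].add_batch(predictions, targets)

scores = [metric.compute() for metric in metrics]
print(scores)
```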

Example with a NER metric: `seqeval`

In [0]:
ner_metric = nlp.load_metric('seqeval')
references = [['O', 'O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']]
predictions = [['O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']]
ner_metric.compute(predictions, references)

# Adding a new dataset or a new metric

There are two ways to add new datasets and metrics in `nlp`:

- datasets can be added with a Pull-Request adding a script in the `datasets` folder of the [`nlp` repository](https://github.com/huggingface/nlp)

=> once the PR is merged, the dataset can be instantiated by its folder name, e.g. `nlp.load_dataset('squad')`. If you want HuggingFace to host the data as well, you will need to ask the HuggingFace team to upload the data.

- datasets can also be added with a direct upload using the `nlp` CLI as a user or organization (like for models in `transformers`). In this case the dataset will be accessible under the given user/organization name, e.g. `nlp.load_dataset('thomwolf/squad')`, and you can upload the data yourself at the same time and in the same folder.

We will add a full tutorial on how to add and upload datasets soon.