# HuggingFace Datasets library demo

Quick summary:

- 50+ NLP datasets + super easy to add new ones (like Transformers models)
- Simple and fast API to download and pre-process the datasets
- Super easy to tokenize and process them in an efficient way
- All datasets are memory-mapped on drive (no RAM limitation)
- Smart caching on drive: process once, reuse every time

Soon: datasets streaming for huge datasets and 100+ datasets

In [1]:
import logging
logging.basicConfig(level=logging.INFO)

In [2]:
# Let's import the library
import nlp

INFO:nlp.utils.file_utils:PyTorch version 1.4.0 available.


Currently available 54 datasets (not tested yet for most of them):
- aeslc
- amazon_us_reviews
- big_patent
- billsum
- blimp
- c4
- cfq
- civil_comments
- cnn_dailymail
- cos_e
- definite_pronoun_resolution
- eraser_multi_rc
- esnli
- flores
- forest_fires
- gap
- german_credit_numeric
- gigaword
- glue
- higgs
- imdb
- iris
- librispeech_lm
- lm1b
- math_dataset
- movie_rationales
- multi_news
- multi_nli
- multi_nli_mismatch
- natural_questions
- newsroom
- opinosis
- para_crawl
- qa4mre
- reddit_tifu
- rock_you
- scan
- scicite
- scientific_papers
- snli
- squad
- super_glue
- ted_hrlr
- ted_multi
- tiny_shakespeare
- titanic
- trivia_qa
- wiki40b
- wikihow
- wikipedia
- wmt
- xnli
- xsum
- yelp_polarity

## An example with SQuAD

In [3]:
# Downloading and loading a dataset is a one-liner

dataset = nlp.load('squad', split='validation[:10%]')

INFO:nlp.load:Dataset script /Users/thomwolf/.cache/huggingface/datasets/ee43d2be6898ebb9c2afefda4455306911d308bcf924d21c975796832cc7c114.e7d8881147e5da61c98918c61832c7f1c88b33b51a082c464e70e119bb24983d already found in datasets directory at /Users/thomwolf/Documents/GitHub/datasets/src/nlp/datasets/686d79c021d7dcd78da4d67fe01fbe30dfecabcd4bd02d06aa9d51edab713144/squad.py, returning it. Use `force_reload=True` to override.
INFO:nlp.builder:No config specified, defaulting to first: squad/plain_text
INFO:nlp.builder:Overwrite dataset info from restored data version.
INFO:nlp.info:Loading Dataset info from /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0
INFO:nlp.builder:Reusing dataset squad (/Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0)
INFO:nlp.builder:Constructing Dataset for split validation[:10%], from /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0


This call to `nlp.load()` does the following steps under the hood:

1. Download and import in the library the **SQuAD python processing script** from our S3 if it's not already stored in the library. You can find the SQuAD processing script [here](https://s3.amazonaws.com/datasets.huggingface.co/nlp/squad/squad.py) for instance.

 Processing scripts are small Python scripts that define the info and format of the dataset, and contain the URL of the original SQuAD JSON files and the code to load examples from them.


2. Run the SQuAD python processing script which will:
 - **Download the SQuAD dataset** from the original URL (see the script) if it's not already downloaded and cached.
 - **Process and cache** all of SQuAD in a structured Arrow table for each standard split, stored on the drive.

 Arrow tables are arbitrarily long tables, typed with types that can be mapped to numpy/pandas/Python standard types, and they can store nested objects. They can be accessed directly from drive, loaded in RAM or even streamed over the web.
 

3. Return a **dataset built from the splits** requested by the user (default: all). In the above example we create a dataset with the first 10% of the validation split.
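
The split string `validation[:10%]` used above selects the first 10% of the rows of the validation split. As a rough illustration of that arithmetic only, here is a minimal sketch. The helper `resolve_percent_slice` is hypothetical and not part of `nlp`, and the library's own parsing and rounding rules may differ.

```python
import re

def resolve_percent_slice(spec: str, split_sizes: dict) -> tuple:
    """Turn a spec such as 'validation[:10%]' into (split_name, start_row, stop_row).

    Hypothetical helper for illustration; it only understands '[a%:b%]' slices.
    """
    pattern = r"(\w+)(?:\[(?:(\d+)%)?:(?:(\d+)%)?\])?"
    match = re.fullmatch(pattern, spec)
    if match is None:
        raise ValueError(f"Unsupported split spec: {spec!r}")
    name, start_pct, stop_pct = match.groups()
    num_rows = split_sizes[name]
    # Missing bounds default to the beginning / end of the split.
    start = num_rows * int(start_pct) // 100 if start_pct else 0
    stop = num_rows * int(stop_pct) // 100 if stop_pct else num_rows
    return name, start, stop

# Split sizes as reported by `dataset.info` in this notebook.
sizes = {"train": 87599, "validation": 10570}
print(resolve_percent_slice("validation[:10%]", sizes))  # ('validation', 0, 1057)
```

The 1057 rows computed here match the `num_rows` reported for the loaded subset.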

In [4]:
# General information on the dataset is provided in the `.info` property
print(dataset.info)

DatasetInfo(
 name='squad',
 version=1.0.0,
 description='Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.
',
 homepage='https://rajpurkar.github.io/SQuAD-explorer/',
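 features=struct<id: string, title: string, context: string, question: string, answers: struct<text: list<item: string>, answer_start: list<item: int32>>>,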
 total_num_examples=98169,
 splits={
 'train': 87599,
 'validation': 10570,
 },
 supervised_keys=None,
 citation="""@article{2016arXiv160605250R,
 author = {{Rajpurkar}, Pranav and {Zhang}, Jian and {Lopyrev},
 Konstantin and {Liang}, Percy},
 title = "{SQuAD: 100,000+ Questions for Machine Comprehension of Text}",
 journal = {arXiv e-prints},
 year = 2016,
 eid = {arXiv:1606.05250},
 pages = {arXiv:1606.05250},
 archivePrefix = {arXiv},
 eprint = {1606.05250},
 }""",
 license=None,
)



## Inspecting the dataset: elements, slices and columns

The returned `Dataset` object is a memory-mapped dataset that behaves similarly to a normal map-style dataset. It is backed by an Apache Arrow table, which enables many interesting features.

In [5]:
print(dataset)

Dataset(schema: {'id': 'string', 'title': 'string', 'context': 'string', 'question': 'string', 'answers': 'struct<text: list<item: string>, answer_start: list<item: int32>>'}, num_rows: 1057)
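
Memory mapping means the rows live in a file on drive and only the bytes you actually touch are paged in. The library relies on Apache Arrow for this. The sketch below instead uses the standard-library `mmap` module with a made-up fixed-width record layout, purely to illustrate the idea; it is not how `nlp` stores its data.

```python
import mmap
import os
import struct
import tempfile

RECORD = struct.Struct("<i")  # toy layout: one little-endian int32 per row

# Write 1,000 rows to a file on drive.
path = os.path.join(tempfile.mkdtemp(), "toy_column.bin")
with open(path, "wb") as f:
    for value in range(1000):
        f.write(RECORD.pack(value * 2))

def read_row(mapped, index):
    """Decode a single row straight from the mapped file."""
    return RECORD.unpack_from(mapped, index * RECORD.size)[0]

with open(path, "rb") as f:
    # Map the whole file read-only; nothing is copied into a Python list.
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    num_rows = len(mapped) // RECORD.size
    row_10 = read_row(mapped, 10)
    last_row = read_row(mapped, num_rows - 1)
    mapped.close()

print(num_rows, row_10, last_row)  # 1000 20 1998
```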


You can query its length and get items or slices as you normally would with a Python sequence.

In [6]:
from pprint import pprint

print(f"Dataset len(dataset): {len(dataset)}")
print("First item:")
pprint(dataset[0])
print("Slice of the first two items:")
pprint(dataset[:2])

Dataset len(dataset): 1057
First item:
{'answers': {'answer_start': [177, 177, 177],
 'text': ['Denver Broncos', 'Denver Broncos', 'Denver Broncos']},
 'context': 'Super Bowl 50 was an American football game to determine the '
 'champion of the National Football League (NFL) for the 2015 '
 'season. The American Football Conference (AFC) champion Denver '
 'Broncos defeated the National Football Conference (NFC) champion '
 'Carolina Panthers 24–10 to earn their third Super Bowl title. The '
 "game was played on February 7, 2016, at Levi's Stadium in the San "
 'Francisco Bay Area at Santa Clara, California. As this was the '
 '50th Super Bowl, the league emphasized the "golden anniversary" '
 'with various gold-themed initiatives, as well as temporarily '
 'suspending the tradition of naming each Super Bowl game with '
 'Roman numerals (under which the game would have been known as '
 '"Super Bowl L"), so that the logo could prominently feature the '
 'Arabic numerals 50.',
 'id': '56

You can get a full column of the dataset by indexing with its name as a string:

In [7]:
print(dataset['question'][:10])

['Which NFL team represented the AFC at Super Bowl 50?', 'Which NFL team represented the NFC at Super Bowl 50?', 'Where did Super Bowl 50 take place?', 'Which NFL team won Super Bowl 50?', 'What color was used to emphasize the 50th anniversary of the Super Bowl?', 'What was the theme of Super Bowl 50?', 'What day was the game played on?', 'What is the AFC short for?', 'What was the theme of Super Bowl 50?', 'What does AFC stand for?']


Items are returned as a dict of elements.

Slices are returned as a dict of lists of elements.

Columns are returned as a list.

You can thus apply slice, index and column indexing in any order with identical results:

In [8]:
print(dataset[0]['question'] == dataset['question'][0])
print(dataset[10:20]['context'] == dataset['context'][10:20])

True
True
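
The three access patterns above follow naturally from a column-oriented layout. As a minimal sketch, assuming a toy table that simply stores one Python list per column (the class `ToyColumnarTable` is hypothetical and much simpler than the Arrow-backed `Dataset`):

```python
class ToyColumnarTable:
    """A tiny column-oriented table: one Python list per column."""

    def __init__(self, columns):
        self.columns = columns

    def __len__(self):
        return len(next(iter(self.columns.values())))

    def __getitem__(self, key):
        if isinstance(key, str):
            # Column name -> list of values
            return self.columns[key]
        if isinstance(key, (int, slice)):
            # Row index -> dict of elements; slice -> dict of lists
            return {name: values[key] for name, values in self.columns.items()}
        raise TypeError(f"Unsupported key type: {type(key).__name__}")

table = ToyColumnarTable({
    "question": ["q0", "q1", "q2", "q3"],
    "context": ["c0", "c1", "c2", "c3"],
})

print(table[0])    # {'question': 'q0', 'context': 'c0'}
print(table[1:3])  # {'question': ['q1', 'q2'], 'context': ['c1', 'c2']}
print(table[0]["question"] == table["question"][0])     # True
print(table[1:3]["context"] == table["context"][1:3])   # True
```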


In [9]:
# The underlying table is typed (int/float/strings/lists/dict) and structured 
print(dataset.column_names)
print(dataset.schema)

['id', 'title', 'context', 'question', 'answers']
id: string
title: string
context: string
question: string
answers: struct<text: list<item: string>, answer_start: list<item: int32>>
 child 0, text: list<item: string>
 child 0, item: string
 child 1, answer_start: list<item: int32>
 child 0, item: int32


### Additional misc properties

In [10]:
# Datasets also have a bunch of properties you can access
print("The number of bytes allocated on the drive is ", dataset.nbytes)
print("For comparison, here is the number of bytes allocated in memory which can be")
print("accessed with `nlp.total_allocated_bytes()`: ", nlp.total_allocated_bytes())
print("The number of rows", dataset.num_rows)
print("The number of columns", dataset.num_columns)
print("The shape (rows, columns)", dataset.shape)

The number of bytes allocated on the drive is 10472672
For comparison, here is the number of bytes allocated in memory which can be
accessed with `nlp.total_allocated_bytes()`: 0
The number of rows 1057
The number of columns 5
The shape (rows, columns) (1057, 5)


### Additional misc methods

In [11]:
# We can list the unique elements in a column. This is done by the backend (so fast!)
print(dataset.unique('title'))

['Super_Bowl_50', 'Warsaw']


In [12]:
# This will drop the column 'id'
dataset.drop('id') # Remove column 'id'
print(dataset.column_names)

['title', 'context', 'question', 'answers']


In [13]:
# This will flatten the nested columns in 'answers'
dataset.flatten()
print(dataset.column_names)

['title', 'context', 'question', 'answers.text', 'answers.answer_start']
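
Flattening turns each nested field into its own top-level column named `parent.child`. A minimal pure-Python sketch of that renaming on toy data follows; the function `flatten_columns` is hypothetical, and the library performs this on the Arrow table rather than on Python lists.

```python
def flatten_columns(columns, separator="."):
    """Expand every column of dicts into one column per nested key."""
    flat = {}
    for name, values in columns.items():
        if values and all(isinstance(v, dict) for v in values):
            # Regroup the nested values by key, then flatten recursively.
            nested = {key: [v[key] for v in values] for key in values[0]}
            for sub_name, sub_values in flatten_columns(nested, separator).items():
                flat[f"{name}{separator}{sub_name}"] = sub_values
        else:
            flat[name] = values
    return flat

# Toy rows shaped like the SQuAD schema (values are made up).
columns = {
    "title": ["t0", "t1"],
    "answers": [
        {"text": ["first answer"], "answer_start": [3]},
        {"text": ["second answer"], "answer_start": [7]},
    ],
}
flat = flatten_columns(columns)
print(list(flat))  # ['title', 'answers.text', 'answers.answer_start']
```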


In [14]:
# We can also "dictionary encode" a column if many of its elements are similar
# This will reduce its size by only storing the distinct elements (e.g. string)
# It only affects the internal storage (no difference from a user point of view)
dataset.dictionary_encode_column('title')
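
Dictionary encoding stores each distinct value once and replaces the column by small integer indices into that list of distinct values. The sketch below shows the general technique in plain Python; the helper names are illustrative, and Arrow's actual in-memory representation is more compact than Python lists.

```python
def dictionary_encode(values):
    """Return (dictionary, indices) such that dictionary[indices[i]] == values[i]."""
    dictionary, positions, indices = [], {}, []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices

def dictionary_decode(dictionary, indices):
    """Rebuild the original column from the encoded form."""
    return [dictionary[i] for i in indices]

titles = ["Super_Bowl_50"] * 3 + ["Warsaw"] * 2
dictionary, indices = dictionary_encode(titles)
print(dictionary)  # ['Super_Bowl_50', 'Warsaw']
print(indices)     # [0, 0, 0, 1, 1]
print(dictionary_decode(dictionary, indices) == titles)  # True
```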

### Cache files

You can check the current cache files backing the dataset with the `.cache_files` property

In [15]:
dataset.cache_files

({'filename': '/Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0/squad-validation.arrow',
 'skip': 0,
 'take': 1057},)

You can clean up the cache files in the current dataset directory with the `.cleanup_cache_files()` method.

Be careful that no other process is using these cache files when running this command.

In [16]:
dataset.cleanup_cache_files() # Returns the number of removed cache files

INFO:nlp.arrow_dataset:Listing files in /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0
INFO:nlp.arrow_dataset:Removing /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-2b0c4368cd1b9d9ab7dd158754adb501.arrow
INFO:nlp.arrow_dataset:Removing /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-fef84cefe794447d6dc0b28596974c80.arrow
INFO:nlp.arrow_dataset:Removing /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-b9d042be98ac7ed20cb12b2e9d65d208.arrow
INFO:nlp.arrow_dataset:Removing /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-0d81cced63f868bf1a233bffb4c94b85.arrow
INFO:nlp.arrow_dataset:Removing /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-fdd554f8e6ee8230941052eceac92e0f.arrow
INFO:nlp.arrow_dataset:Removing /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-2d5d9f6d0f564bbd27c91aee95cfc0dc.arrow
INFO:nlp.arrow_dataset:Removi

7

## Modifying the dataset with `dataset.map`

There is a powerful method `.map()` that you can use to apply a function to each example, independently or in batches.

In [17]:
# `.map()` takes a callable accepting a dict as argument
# (same dict as returned by dataset[i])
# and iterates over the dataset, calling the function with each example.

# Let's print the length of each `context` string in our subset of the dataset
# (10% of the validation i.e. 1057 examples)

dataset.map(lambda example: print(len(example['context']), end=','))

1057it [00:00, 10624.60it/s]

775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,179,179,179,179,179,179,179,179,179,179,179,168,168,168,168,168,168,168,168,168,168,168,168,168,168,168,168,168,638,638,638,638,638,638,638,638,638,638,638,638,638,638,638,638,638,638,638,638,638,326,326,326,326,326,326,326,326,326,326,326,326,326,326,326,326,326,326,326,326,326,326,326,704,704,704,704,704,704,704,704,704,704,704,704,704,704,704,704,704,704,917,917,917,917,917,917,917,917,917,917,917,917,917,917,917,917,917,917,917,917,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1166,1166,1166,1166,1166,116




Dataset(schema: {'title': 'string', 'context': 'string', 'question': 'string', 'answers.text': 'list<item: string>', 'answers.answer_start': 'list<item: int32>'}, num_rows: 1057)

This is basically the same as doing

```python
for example in dataset:
    function(example)
```

The above example had no effect on the dataset because the function supplied to `.map()` didn't return a `dict` or an `abc.Mapping` that could be used to update the examples in the dataset. In that case `.map()` just returns the same dataset (`self`).

Now let's see how to use a function that can modify the dataset.

### Modifying the dataset example by example

The main interest of `.map()` is to update and modify the content of the table.

To use `.map()` to update elements in the table you should provide a function with the following signature: `function(example: dict) -> dict`.

In [18]:
# Let's add a prefix 'My cute title: ' to each of our titles

def add_prefix_to_title(example):
    example['title'] = 'My cute title: ' + example['title']
    return example

dataset = dataset.map(add_prefix_to_title)

print(dataset.unique('title'))

INFO:nlp.arrow_dataset:Caching processed dataset at /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-ed8b1249a765df5c159965379e685e44.arrow
1057it [00:00, 21208.28it/s]
INFO:nlp.arrow_writer:Done writing 1057 examples in 906626 bytes /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-ed8b1249a765df5c159965379e685e44.arrow.


['My cute title: Super_Bowl_50', 'My cute title: Warsaw']


This call to `.map()` computes and returns the updated table. It also stores the updated table in a cache file indexed by the current state and the mapped function. A subsequent call to `.map()` (even in another Python session) will reuse the cached file instead of recomputing the operation (this caching may not work in Jupyter notebooks yet).
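
The notebook does not show how the cache file name is derived from the current state and the function. One plausible way to build such a key, shown here only as an assumption-laden sketch (the function `cache_key` and the state strings are hypothetical, not the library's scheme), is to hash a fingerprint of the dataset state together with the function's bytecode and constants:

```python
import hashlib

def cache_key(state_fingerprint, function):
    """Hypothetical cache key: hash of the dataset state plus the function's code."""
    code = function.__code__
    digest = hashlib.sha256()
    digest.update(state_fingerprint.encode("utf-8"))
    digest.update(code.co_code)                          # compiled bytecode
    digest.update(repr(code.co_consts).encode("utf-8"))  # literals used in the body
    return digest.hexdigest()[:32]

add_one = lambda example: {"n": example["n"] + 1}
add_one_again = lambda example: {"n": example["n"] + 1}
add_two = lambda example: {"n": example["n"] + 2}

# Identical function on identical state -> same key, so the cache can be reused.
same = cache_key("state-0", add_one) == cache_key("state-0", add_one_again)
# A different function or a different starting state -> a different key.
different_function = cache_key("state-0", add_one) != cache_key("state-0", add_two)
different_state = cache_key("state-0", add_one) != cache_key("state-1", add_one)
print(same, different_function, different_state)
```

A scheme like this would also explain why chained `.map()` calls each get their own cache file: every call starts from a different state.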

The returned updated dataset is (again) directly memory mapped from drive and not allocated in RAM.

Your function should accept an input with the format of an item of the dataset, i.e. `function(dataset[0])`, and return a Python dict.

The columns and types of the output can differ from those of the input dict. In this case the new keys will be added as additional columns in the dataset.

Each example is updated with the output dictionary: `example.update(function(example))`.

In [19]:
# Since the input example is updated with our function output,
# we can actually just return the updated 'title' field
dataset = dataset.map(lambda example: {'title': 'My cutest title: ' + example['title']})

print(dataset.unique('title'))

INFO:nlp.arrow_dataset:Caching processed dataset at /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-17091682b8ed78e55221838b2595bbd5.arrow
1057it [00:00, 24103.23it/s]
INFO:nlp.arrow_writer:Done writing 1057 examples in 924595 bytes /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-17091682b8ed78e55221838b2595bbd5.arrow.


['My cutest title: My cute title: Super_Bowl_50', 'My cutest title: My cute title: Warsaw']
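
The update semantics described above can be sketched in a few lines of plain Python. The function `toy_map` below is a hypothetical stand-in that works on a list of dicts; the real `.map()` writes to an Arrow cache file instead. It also accepts an index argument, under the assumption that indices are simply passed as a second positional argument.

```python
from collections.abc import Mapping

def toy_map(rows, function, with_indices=False):
    """Apply `function` to each row and merge any returned mapping into the row."""
    new_rows = []
    changed = False
    for index, row in enumerate(rows):
        row = dict(row)  # work on a copy so the input rows are untouched
        output = function(row, index) if with_indices else function(row)
        if isinstance(output, Mapping):
            row.update(output)  # same idea as example.update(function(example))
            changed = True
        new_rows.append(row)
    # If the function never returned a mapping, hand back the same object.
    return new_rows if changed else rows

rows = [{"title": "Warsaw", "question": "q0"},
        {"title": "Warsaw", "question": "q1"}]

unchanged = toy_map(rows, lambda row: None)
retitled = toy_map(rows, lambda row: {"title": "My cute title: " + row["title"]})
numbered = toy_map(rows, lambda row, idx: {"question": f"{idx}: " + row["question"]},
                   with_indices=True)

print(unchanged is rows)        # True
print(retitled[0]["title"])     # My cute title: Warsaw
print(numbered[1]["question"])  # 1: q1
```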


#### Removing columns
You can also remove columns when running map with the `remove_columns=List[str]` argument.

In [20]:
# This will select the 'title' input to send to our function (as only field in the input)
# and replace it with the output of the method as a 'new_title' field
dataset = dataset.map(lambda example: {'new_title': 'Wouhahh: ' + example['title']},
 remove_columns=['title'])

print(dataset.column_names)
print(dataset.unique('new_title'))

INFO:nlp.arrow_dataset:Caching processed dataset at /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-9741cad18be490ab827b103119d5c732.arrow
1057it [00:00, 25135.67it/s]
INFO:nlp.arrow_writer:Done writing 1057 examples in 934108 bytes /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-9741cad18be490ab827b103119d5c732.arrow.


['context', 'question', 'answers.text', 'answers.answer_start', 'new_title']
['Wouhahh: My cutest title: My cute title: Super_Bowl_50', 'Wouhahh: My cutest title: My cute title: Warsaw']


#### Using examples indices
With `with_indices=True`, dataset indices (from `0` to `len(dataset) - 1`) will be supplied to the function, which must thus have the following signature: `function(example: dict, index: int) -> dict`

In [21]:
# This will add the index in the dataset to the 'question' field
dataset = dataset.map(lambda example, idx: {'question': f'{idx}: ' + example['question']},
 with_indices=True)

print('\n'.join(dataset['question'][:5]))

INFO:nlp.arrow_dataset:Caching processed dataset at /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-93827205b7769e301be275a794040d51.arrow
1057it [00:00, 24952.75it/s]
INFO:nlp.arrow_writer:Done writing 1057 examples in 939340 bytes /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-93827205b7769e301be275a794040d51.arrow.


0: Which NFL team represented the AFC at Super Bowl 50?
1: Which NFL team represented the NFC at Super Bowl 50?
2: Where did Super Bowl 50 take place?
3: Which NFL team won Super Bowl 50?
4: What color was used to emphasize the 50th anniversary of the Super Bowl?


### Modifying the dataset with batched updates

`.map()` can also work with batches of examples (slices of the dataset).

This is particularly interesting if you have a function that can handle batches of inputs, like the tokenizers from the HuggingFace `tokenizers` library.

To work on batched inputs, set `batched=True` when calling `.map()` and supply a function with the following signature: `function(examples: Dict[List]) -> Dict[List]` or, if you use indices, `function(examples: Dict[List], indices: List[int]) -> Dict[List]`.

Your function should accept an input with the format of a slice of the dataset: e.g. `function(dataset[:10])`.
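
To make the batched signature concrete, here is a minimal pure-Python sketch of a batched map over a dict of lists. The function `toy_batched_map` is hypothetical, and its default `batch_size` of 1000 is an assumption made for this sketch, not a documented value.

```python
def toy_batched_map(columns, function, batch_size=1000):
    """Call `function` on dict-of-lists slices and concatenate its dict-of-lists outputs."""
    num_rows = len(next(iter(columns.values())))
    result = {name: [] for name in columns}
    for start in range(0, num_rows, batch_size):
        # Each batch has the same format as a slice of the dataset.
        batch = {name: values[start:start + batch_size]
                 for name, values in columns.items()}
        output = dict(batch)
        output.update(function(batch))  # new keys become new columns
        for name, values in output.items():
            result.setdefault(name, []).extend(values)
    return result

columns = {"context": ["a b", "c d e", "f"]}
out = toy_batched_map(
    columns,
    lambda batch: {"num_words": [len(c.split()) for c in batch["context"]]},
    batch_size=2,
)
print(out)  # {'context': ['a b', 'c d e', 'f'], 'num_words': [2, 3, 1]}
```

The function is called once per batch rather than once per example, which is what lets a vectorized tokenizer amortize its per-call overhead.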

In [22]:
# Let's import a fast tokenizer that can work on batched inputs
# (the 'Fast' tokenizers in HuggingFace)
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')

INFO:transformers.file_utils:PyTorch version 1.4.0 available.
INFO:transformers.tokenization_utils:loading file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-vocab.txt from cache at /Users/thomwolf/.cache/torch/transformers/5e8a2b4893d13790ed4150ca1906be5f7a03d6c4ddf62296c383f6db42814db2.e13dbb970cb325137104fb2e5f36fe865f27746c6b526f6352861b1980eb80b1


In [23]:
# Now let's batch tokenize our dataset 'context'
dataset = dataset.map(lambda example: tokenizer.batch_encode_plus(example['context']),
 batched=True)

print("dataset[0]", dataset[0])

INFO:nlp.arrow_dataset:Caching processed dataset at /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-09ed6375515654521b963025766295d1.arrow
100%|██████████| 2/2 [00:00<00:00, 18.20it/s]
INFO:nlp.arrow_writer:Done writing 1057 examples in 4811564 bytes /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-09ed6375515654521b963025766295d1.arrow.


dataset[0] {'context': 'Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi\'s Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.', 'question': '0: Which NFL team represented the AFC at Super Bowl 50?', 'answers.text': ['Denver Broncos', 'Denver Broncos', 'Denver Broncos'], 'answers.answer_start': [177, 177, 177], 'new_title': 

In [24]:
# we have added additional columns
# we could have replaced the dataset with `remove_columns=True`
print(dataset.column_names)

['context', 'question', 'answers.text', 'answers.answer_start', 'new_title', 'input_ids', 'token_type_ids', 'attention_mask']


In [25]:
# Let's show a more complex processing with the full preparation of the SQuAD dataset
# for training a model from Transformers
def convert_to_features(batch):
    # Tokenize contexts and questions (as pairs of inputs)
    # keep offset mappings for evaluation
    input_pairs = list(zip(batch['context'], batch['question']))
    encodings = tokenizer.batch_encode_plus(input_pairs,
                                            pad_to_max_length=True,
                                            return_offsets_mapping=True)

    # Compute start and end tokens for labels
    start_positions, end_positions = [], []
    for i, (text, start) in enumerate(zip(batch['answers.text'], batch['answers.answer_start'])):
        first_char = start[0]
        last_char = first_char + len(text[0]) - 1
        start_positions.append(encodings.char_to_token(i, first_char))
        end_positions.append(encodings.char_to_token(i, last_char))

    encodings.update({'start_positions': start_positions, 'end_positions': end_positions})
    return encodings

dataset = dataset.map(convert_to_features, batched=True)

INFO:nlp.arrow_dataset:Caching processed dataset at /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-b01e56f9216a5c4e04189ae568585041.arrow
100%|██████████| 2/2 [00:00<00:00, 6.16it/s]
INFO:nlp.arrow_writer:Done writing 1057 examples in 21999734 bytes /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-b01e56f9216a5c4e04189ae568585041.arrow.


In [26]:
# Now our dataset comprises the labels for the start and end positions
# as well as the offsets for converting tokens back
# into spans of the original string for evaluation
print("column_names", dataset.column_names)
print("start_positions", dataset[:5]['start_positions'])

column_names ['context', 'question', 'answers.text', 'answers.answer_start', 'new_title', 'input_ids', 'token_type_ids', 'attention_mask', 'offset_mapping', 'start_positions', 'end_positions']
start_positions [34, 45, 80, 34, 98]
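
The label computation above converts character positions of the answer into token positions. The idea can be sketched with a toy whitespace tokenizer that records character offsets. Both functions below are illustrative stand-ins: the real `char_to_token` belongs to the tokenizer's encoding object and works on WordPiece tokens, so actual token indices will differ.

```python
import re

def tokenize_with_offsets(text):
    """Toy whitespace tokenizer returning (token, (start_char, end_char)) pairs."""
    return [(m.group(), m.span()) for m in re.finditer(r"\S+", text)]

def char_to_token(offsets, char_index):
    """Return the index of the token whose span contains `char_index`, else None."""
    for token_index, (start, end) in enumerate(offsets):
        if start <= char_index < end:
            return token_index
    return None

context = "The Denver Broncos won the game"
answer_text = "Denver Broncos"
answer_start = context.index(answer_text)

offsets = [span for _, span in tokenize_with_offsets(context)]
first_char = answer_start
last_char = first_char + len(answer_text) - 1  # inclusive last character
start_position = char_to_token(offsets, first_char)
end_position = char_to_token(offsets, last_char)
print(start_position, end_position)  # 1 2
```

Using the inclusive last character matters: the character just past the answer is a space and belongs to no token.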


## Formatting outputs for numpy/torch/tensorflow

Now that we have all our tokenized inputs, we would like to use this dataset in a `torch.utils.data.DataLoader` or a `tf.data.Dataset`.

To be able to do this we need to tweak two things:

- format the indexing (`__getitem__`) to return numpy/torch/tensorflow tensors, instead of python objects, and
- format the indexing (`__getitem__`) to return only the subset of the columns that we need for our model inputs.

 We don't want the columns `id` or `title` as inputs to train our model, but we may still want to keep them in the dataset, for instance for the evaluation of the model.
 
This is handled by the `.set_format(type: Union[None, str], columns: Union[None, str, List[str]])` method, where:

- `type` defines the return type for our dataset `__getitem__` method and is one of `[None, 'numpy', 'torch', 'tensorflow']` (`None` means return Python objects), and
- `columns` defines the columns returned by `__getitem__` and takes the name of a column in the dataset or a list of columns to return (`None` means return all columns).
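
The key point is that the format only changes what `__getitem__` returns, not what is stored. A minimal sketch of that behaviour follows, assuming a hypothetical `ToyFormattedTable` class whose `convert` callable stands in for the `type` string (here `tuple` plays the role of a tensor constructor):

```python
class ToyFormattedTable:
    """Columns stay stored; the format only changes what __getitem__ returns."""

    def __init__(self, columns):
        self.columns = columns
        self._format_columns = None  # None means "return every column"
        self._convert = None         # None means "return plain Python objects"

    def set_format(self, convert=None, columns=None):
        if isinstance(columns, str):
            columns = [columns]
        self._convert = convert
        self._format_columns = columns

    def reset_format(self):
        self.set_format()

    def __getitem__(self, key):
        names = self._format_columns or list(self.columns)
        convert = self._convert or (lambda values: values)
        return {name: convert(self.columns[name][key]) for name in names}

table = ToyFormattedTable({"input_ids": [[1, 2], [3, 4]], "title": ["a", "b"]})

table.set_format(convert=tuple, columns=["input_ids"])
formatted_item = table[0]
print(formatted_item)       # {'input_ids': (1, 2)}
print(list(table.columns))  # ['input_ids', 'title']

table.reset_format()
plain_item = table[0]
print(plain_item)           # {'input_ids': [1, 2], 'title': 'a'}
```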

In [27]:
columns_to_return = ['input_ids', 'token_type_ids', 'attention_mask',
 'start_positions', 'end_positions']

dataset.set_format(type='torch',
 columns=columns_to_return)

# Our dataset indexing output is now ready for being used in a pytorch dataloader
print('\n'.join([' '.join((n, str(type(t)), str(t.shape))) for n, t in dataset[:10].items()]))

INFO:nlp.arrow_dataset:Set __getitem__(key) output type to torch and filter ['input_ids', 'token_type_ids', 'attention_mask', 'start_positions', 'end_positions'] columns (when key is int or slice).


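input_ids <class 'torch.Tensor'> torch.Size([10, 451])
token_type_ids <class 'torch.Tensor'> torch.Size([10, 451])
attention_mask <class 'torch.Tensor'> torch.Size([10, 451])
start_positions <class 'torch.Tensor'> torch.Size([10])
end_positions <class 'torch.Tensor'> torch.Size([10])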


In [28]:
# Note that the columns are not removed from the dataset,
# just not returned when calling __getitem__
print(dataset.column_names)

['context', 'question', 'answers.text', 'answers.answer_start', 'new_title', 'input_ids', 'token_type_ids', 'attention_mask', 'offset_mapping', 'start_positions', 'end_positions']


In [29]:
# We can remove the formatting with `.reset_format()`
# or, identically, a call to `.set_format()` with no arguments
dataset.reset_format()

print('\n'.join([' '.join((n, str(type(t)))) for n, t in dataset[:10].items()]))

INFO:nlp.arrow_dataset:Set __getitem__(key) output type to python objects and filter no columns (when key is int or slice).


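context <class 'list'>
question <class 'list'>
answers.text <class 'list'>
answers.answer_start <class 'list'>
new_title <class 'list'>
input_ids <class 'list'>
token_type_ids <class 'list'>
attention_mask <class 'list'>
offset_mapping <class 'list'>
start_positions <class 'list'>
end_positions <class 'list'>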


In [30]:
# The current format can be checked with `.format`,
# which is a dict of the type and formatting
dataset.format

{'type': 'python',
 'columns': ['context',
 'question',
 'answers.text',
 'answers.answer_start',
 'new_title',
 'input_ids',
 'token_type_ids',
 'attention_mask',
 'offset_mapping',
 'start_positions',
 'end_positions']}

## Wrapping this all up

Let's wrap this all up with the full code to load and prepare SQuAD for training a PyTorch model.

In [31]:
import nlp
import torch 
from transformers import BertTokenizerFast

# Load our training dataset and tokenizer
dataset = nlp.load('squad')
tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')

# Tokenize our training dataset
def convert_to_features(example_batch):
    # Tokenize contexts and questions (as pairs of inputs)
    input_pairs = list(zip(example_batch['context'], example_batch['question']))
    encodings = tokenizer.batch_encode_plus(input_pairs, pad_to_max_length=True)

    # Compute start and end tokens for labels
    start_positions, end_positions = [], []
    for i, answer in enumerate(example_batch['answers']):
        first_char = answer['answer_start'][0]
        last_char = first_char + len(answer['text'][0]) - 1
        start_positions.append(encodings.char_to_token(i, first_char))
        end_positions.append(encodings.char_to_token(i, last_char))

    encodings.update({'start_positions': start_positions,
                      'end_positions': end_positions})
    return encodings

dataset['train'] = dataset['train'].map(convert_to_features, batched=True)

# Format our outputs to train a pytorch model
columns = ['input_ids', 'token_type_ids', 'attention_mask', 'start_positions', 'end_positions']
dataset['train'].set_format(type='torch', columns=columns)

# Instantiate a PyTorch DataLoader around our dataset
dataloader = torch.utils.data.DataLoader(dataset['train'], batch_size=8)

INFO:nlp.load:Dataset script /Users/thomwolf/.cache/huggingface/datasets/ee43d2be6898ebb9c2afefda4455306911d308bcf924d21c975796832cc7c114.e7d8881147e5da61c98918c61832c7f1c88b33b51a082c464e70e119bb24983d already found in datasets directory at /Users/thomwolf/Documents/GitHub/datasets/src/nlp/datasets/686d79c021d7dcd78da4d67fe01fbe30dfecabcd4bd02d06aa9d51edab713144/squad.py, returning it. Use `force_reload=True` to override.
INFO:nlp.builder:No config specified, defaulting to first: squad/plain_text
INFO:nlp.builder:Overwrite dataset info from restored data version.
INFO:nlp.info:Loading Dataset info from /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0
INFO:nlp.builder:Reusing dataset squad (/Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0)
INFO:nlp.builder:Constructing Dataset for split None, from /Users/thomwolf/.cache/huggingface/datasets/squad/plain_text/1.0.0
INFO:transformers.tokenization_utils:loading file https://s3.amazonaws.com/models.huggin

In [32]:
# Let's load a pretrained Bert model and a simple optimizer
from transformers import BertForQuestionAnswering

model = BertForQuestionAnswering.from_pretrained('bert-base-cased')
optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)

INFO:transformers.configuration_utils:loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-config.json from cache at /Users/thomwolf/.cache/torch/transformers/b945b69218e98b3e2c95acf911789741307dec43c698d35fad11c1ae28bda352.3d5adf10d3445c36ce131f4c6416aa62e9b58e1af56b97664773f4858a46286e
INFO:transformers.configuration_utils:Model config BertConfig {
 "_num_labels": 2,
 "architectures": [
 "BertForMaskedLM"
 ],
 "attention_probs_dropout_prob": 0.1,
 "bad_words_ids": null,
 "bos_token_id": null,
 "decoder_start_token_id": null,
 "do_sample": false,
 "early_stopping": false,
 "eos_token_id": null,
 "finetuning_task": null,
 "hidden_act": "gelu",
 "hidden_dropout_prob": 0.1,
 "hidden_size": 768,
 "id2label": {
 "0": "LABEL_0",
 "1": "LABEL_1"
 },
 "initializer_range": 0.02,
 "intermediate_size": 3072,
 "is_decoder": false,
 "is_encoder_decoder": false,
 "label2id": {
 "LABEL_0": 0,
 "LABEL_1": 1
 },
 "layer_norm_eps": 1e-12,
 "length_penalty": 1.0

In [33]:
# Now let's train our model

model.train()
for i, batch in enumerate(dataloader):
    outputs = model(**batch)
    loss = outputs[0]
    loss.backward()
    optimizer.step()
    model.zero_grad()
    print(f'Step {i} - loss: {loss:.3}')
    if i > 3:
        break

Step 0 - loss: 6.26


KeyboardInterrupt: 