# Building a classifier with DistilBERT
This notebook contains the code to create the movie_overview_classification model. The model accepts a movie's overview and predicts whether the movie will pass the Bechdel test. On the held-out test set it reaches an F1 score of only 0.77 (accuracy 0.74), but it will be implemented as part of a larger ensemble algorithm.

## Imports and Data

In [33]:
import pandas as pd
import warnings
from sklearn.model_selection import train_test_split
import numpy as np
warnings.filterwarnings("ignore")
import datasets
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification, DataCollatorWithPadding
from datasets import load_metric
 


In [2]:
import BechdelDataImporter as data
df = data.NoScripts()

## Text Cleaning
First, instantiate a tokenizer, data collator, and model:

In [3]:
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



- Drop rows whose overview is duplicated or missing
- Tokenize the overviews
- Change the labels from a 4-category rating to a pass-fail rating

In [4]:
# Drop overviews that are duplicated or missing
df = df.drop_duplicates(subset=['overview']).dropna(subset=['overview'])
# Tokenize each overview; every cell holds one tokenizer output made of PyTorch tensors
df['overview_tokenized'] = df['overview'].apply(lambda text: tokenizer(text, return_tensors="pt"))
# Only a rating of 3 counts as passing the test; every other rating becomes 0
df['label'] = (df['bechdel_rating'] == 3).astype(int)
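
As a minimal sketch of the relabeling step, the pass/fail mapping can be written as a single vectorized comparison. The three overviews and ratings below are invented for illustration and are not rows from the dataset; the sketch assumes, as the cell above does, that only a rating of 3 counts as a pass.

```python
import pandas as pd

# Toy data: hypothetical overviews and ratings, not taken from the real dataset
toy = pd.DataFrame({
    "overview": [
        "Two sisters run a bakery.",
        "A lone soldier crosses a desert.",
        "Two sisters run a bakery.",
    ],
    "bechdel_rating": [3, 1, 3],
})

# Drop repeated and missing overviews before labeling
toy = toy.drop_duplicates(subset=["overview"]).dropna(subset=["overview"])

# Only a rating of 3 counts as a pass; all lower ratings collapse to 0
toy["label"] = (toy["bechdel_rating"] == 3).astype(int)

print(toy["label"].tolist())
```

The duplicate overview is removed first, so the toy frame ends up with two rows: one labeled as a pass and one as a fail.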

Split off a test set:

In [6]:
X_train, X_test, y_train, y_test = train_test_split(df[['overview', 'overview_tokenized']], df['label'], test_size=0.2, random_state=42)

## Connecting to Hugging Face

In [13]:
from huggingface_hub import notebook_login
notebook_login()



## Defining accuracy metrics and some final processing steps

In [None]:
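def compute_metrics(eval_pred):
    # Load the accuracy and F1 metrics from the datasets library
    load_accuracy = load_metric("accuracy")
    load_f1 = load_metric("f1")

    # eval_pred holds the raw logits and the true labels for the evaluation set
    logits, labels = eval_pred
    # The predicted class is the index of the largest logit
    predictions = np.argmax(logits, axis=-1)
    accuracy = load_accuracy.compute(predictions=predictions, references=labels)["accuracy"]
    f1 = load_f1.compute(predictions=predictions, references=labels)["f1"]
    return {"accuracy": accuracy, "f1": f1}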
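
To make the two metrics concrete, here is a small NumPy-only sketch of what the accuracy and F1 values represent. The logits and labels are invented for illustration, and `accuracy_and_f1` is a hypothetical helper name, not part of `datasets` or `transformers`. It assumes label 1 is the positive class, which is the usual default for a binary F1 score.

```python
import numpy as np

def accuracy_and_f1(logits: np.ndarray, labels: np.ndarray) -> dict:
    """Hypothetical helper: binary accuracy and F1 with class 1 as positive."""
    # The predicted class is the index of the largest logit in each row
    predictions = np.argmax(logits, axis=-1)
    tp = int(np.sum((predictions == 1) & (labels == 1)))
    fp = int(np.sum((predictions == 1) & (labels == 0)))
    fn = int(np.sum((predictions == 0) & (labels == 1)))
    accuracy = float(np.mean(predictions == labels))
    # F1 = 2*TP / (2*TP + FP + FN); treated as 0 when the denominator is 0
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return {"accuracy": accuracy, "f1": f1}

# Invented logits for four examples (columns: score for class 0, score for class 1)
toy_logits = np.array([[0.2, 1.5], [1.0, -0.3], [0.1, 0.4], [2.0, 0.5]])
toy_labels = np.array([1, 0, 0, 1])

result = accuracy_and_f1(toy_logits, toy_labels)
print(result)
```

In this toy case there is one true positive, one false positive, and one false negative, so both accuracy and F1 come out to 0.5.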


In [14]:
def processing(X: pd.DataFrame, y: pd.Series) -> datasets.Dataset:
    # Work on a copy so the frames returned by train_test_split are not modified
    X = X.copy()
    # Each tokenizer output has batch size 1, so take row 0 of each tensor as a plain list
    X['input_ids'] = X['overview_tokenized'].apply(lambda enc: enc.input_ids.tolist()[0])
    X['attention_mask'] = X['overview_tokenized'].apply(lambda enc: enc.attention_mask.tolist()[0])

    # Attach the labels, drop the raw tokenizer output, and convert to a Dataset
    return datasets.Dataset.from_pandas(
        X.join(y).drop(columns=['overview_tokenized']).rename(columns={'overview': 'text'})
    )

train_df, test_df = processing(X_train, y_train), processing(X_test, y_test)

## Training the Model

In [15]:
from transformers import TrainingArguments, Trainer
 
repo_name = "movie_overview_classification"
 
training_args = TrainingArguments(
    output_dir=repo_name,
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    save_strategy="epoch",
    push_to_hub=True
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_df,
    eval_dataset=test_df,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)


In [16]:
trainer.train()



TrainOutput(global_step=1010, training_loss=0.5062790998137824, metrics={'train_runtime': 5485.4763, 'train_samples_per_second': 2.944, 'train_steps_per_second': 0.184, 'total_flos': 545236330318872.0, 'train_loss': 0.5062790998137824, 'epoch': 2.0})

In [17]:
trainer.evaluate()

{'eval_loss': 0.5222412347793579,
 'eval_accuracy': 0.7439326399207529,
 'eval_f1': 0.7701200533570476,
 'eval_runtime': 272.8653,
 'eval_samples_per_second': 7.399,
 'eval_steps_per_second': 0.465,
 'epoch': 2.0}

Push the model to the Hugging Face Hub:

In [18]:
trainer.push_to_hub()


CommitInfo(commit_url='https://huggingface.co/mocboch/movie_overview_classification/commit/059aaf4a08b21825ae434df60b2c7af682dc5f6f', commit_message='End of training', commit_description='', oid='059aaf4a08b21825ae434df60b2c7af682dc5f6f', pr_url=None, pr_revision=None, pr_num=None)

## Making Predictions

In [19]:
from transformers import pipeline
final_model = pipeline(model="mocboch/movie_overview_classification")



In [28]:
test_data = pd.DataFrame(test_df)
# Each pipeline call returns a one-element list holding the predicted label and its score
test_data['preds'] = test_data['text'].apply(lambda text: final_model(text))

In [32]:
test_data

| | text | input_ids | attention_mask | label | `__index_level_0__` | preds |
|---|---|---|---|---|---|---|
| 0 | A young and devoted morning television produce... | [101, 1037, 2402, 1998, 7422, 2851, 2547, 3135... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | 1 | 6240 | [{'label': 'LABEL_1', 'score': 0.8961271643638... |
| 1 | Don Birnam, a long-time alcoholic, has been so... | [101, 2123, 12170, 12789, 2213, 1010, 1037, 21... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | 0 | 574 | [{'label': 'LABEL_0', 'score': 0.7674303054809... |
| 2 | One peaceful day on Earth, two remnants of Fri... | [101, 2028, 9379, 2154, 2006, 3011, 1010, 2048... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | 0 | 8304 | [{'label': 'LABEL_0', 'score': 0.6576490402221... |
| 3 | Dominic Toretto and his crew battle the most s... | [101, 11282, 9538, 9284, 1998, 2010, 3626, 264... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | 1 | 9737 | [{'label': 'LABEL_0', 'score': 0.8074591755867... |
| 4 | The Martins family are optimistic dreamers, qu... | [101, 1996, 19953, 2155, 2024, 21931, 24726, 2... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | 1 | 10031 | [{'label': 'LABEL_1', 'score': 0.8910151124000... |
| ... | ... | ... | ... | ... | ... | ... |
| 2014 | Seven short films - each one focused on the pl... | [101, 2698, 2460, 3152, 1011, 2169, 2028, 4208... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | 1 | 4951 | [{'label': 'LABEL_1', 'score': 0.6277556419372... |
| 2015 | After an unprecedented series of natural disas... | [101, 2044, 2019, 15741, 2186, 1997, 3019, 186... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | 1 | 8786 | [{'label': 'LABEL_1', 'score': 0.6020154356956... |
| 2016 | Girl Lost tackles the issue of underage prosti... | [101, 2611, 2439, 10455, 1996, 3277, 1997, 210... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | 1 | 8693 | [{'label': 'LABEL_1', 'score': 0.9668087363243... |
| 2017 | Loosely based on the true-life tale of Ron Woo... | [101, 11853, 2241, 2006, 1996, 2995, 1011, 216... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | 1 | 7366 | [{'label': 'LABEL_1', 'score': 0.9210842251777... |
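
Each entry in `preds` is a one-element list of dictionaries, so comparing predictions with `label` needs a small parsing step. The sketch below assumes the default label names `LABEL_0` and `LABEL_1` that appear in the output above. The helper `to_class_id` is a hypothetical name, and the toy scores and labels are invented rather than taken from the results.

```python
def to_class_id(pipeline_output: list) -> int:
    """Hypothetical helper: turn [{'label': 'LABEL_1', 'score': ...}] into 1."""
    top = pipeline_output[0]
    # The class id is the number after the last underscore in the label name
    return int(top["label"].rsplit("_", 1)[1])

# Toy predictions in the same shape the pipeline returns (scores are invented)
toy_preds = [
    [{"label": "LABEL_1", "score": 0.90}],
    [{"label": "LABEL_0", "score": 0.77}],
    [{"label": "LABEL_0", "score": 0.81}],
]
toy_labels = [1, 0, 1]

class_ids = [to_class_id(p) for p in toy_preds]
n_correct = sum(p == y for p, y in zip(class_ids, toy_labels))
print(class_ids, n_correct / len(toy_labels))
```

On the real frame, the same helper could be applied with `test_data['preds'].map(to_class_id)` and the result compared against `test_data['label']`.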
