The transformers library offers a clean and well-documented interface for many popular transformer models. Not only does it make the source code easier to read and understand, it also provides a standardized way to interact with a model. You have seen in the previous post how to use a model such as DistilBERT for natural language processing tasks. In this post, you will learn how to fine-tune the model for your own purpose. This expands the use of the model from inference to training. Specifically, you will learn:
- How to prepare the dataset for training
- How to train a model using a helper library
Let's get started.

Fine-Tuning DistilBERT for Question Answering
Photo by Lea Fabienne. Some rights reserved.
Overview
This post is divided into three parts; they are:
- Fine-Tuning DistilBERT for Custom Q&A
- Dataset and Preprocessing
- Running the Training
Fine-Tuning DistilBERT for Custom Q&A
The simplest way to use a model in the transformers library is to create a pipeline, which hides many details about how to interact with it.
One reason you may not want to create a pipeline, but set up the model separately instead, is that you want to fine-tune the model on your own dataset. That is impossible with a pipeline because you need to examine the model's raw output with a loss function, which is usually hidden by the pipeline.
Usually, a pre-trained model is created using a general-purpose dataset. However, it may not work well for a specific domain, especially if the language in that domain differs significantly from general usage. This is where fine-tuning comes in.
The difficulty in fine-tuning is usually the availability of a good dataset, which is expensive and time-consuming to create. For illustration purposes, the following uses a general-purpose and publicly available dataset called SQuAD (Stanford Question Answering Dataset).
Thanks to the highly generalized and clean design of the transformers library, fine-tuning the model is straightforward. Below is an example of how to fine-tune the model on the SQuAD dataset:
from transformers import DistilBertTokenizerFast, DistilBertForQuestionAnswering, Trainer, TrainingArguments
from datasets import load_dataset

# Load the SQuAD dataset
dataset = load_dataset("squad")

# Load tokenizer and model
model_name = "distilbert-base-uncased"
tokenizer = DistilBertTokenizerFast.from_pretrained(model_name)
model = DistilBertForQuestionAnswering.from_pretrained(model_name)

# Tokenize the dataset
def preprocess_function(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=384,
        truncation="only_second",
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    for i, offsets in enumerate(offset_mapping):
        answer = answers[i]
        start_char = answer["answer_start"][0]
        end_char = start_char + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context
        context_start = sequence_ids.index(1)
        context_end = len(sequence_ids) - 1 - sequence_ids[::-1].index(1)

        # If the answer is not fully inside the context, label it (0, 0)
        if offsets[context_start][0] > end_char or offsets[context_end][1] < start_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise find the start and end token positions
            idx = context_start
            while idx <= context_end and offsets[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offsets[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs

# Apply preprocessing to the dataset
tokenized_datasets = dataset.map(preprocess_function, batched=True,
                                 remove_columns=dataset["train"].column_names)

# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    processing_class=tokenizer,
)

# Train the model and save the results
trainer.train()
model.save_pretrained("./fine-tuned-distilbert-squad")
tokenizer.save_pretrained("./fine-tuned-distilbert-squad")
This code is a bit complicated. Let's break it down step by step.
Dataset and Preprocessing
The SQuAD dataset is a popular dataset for question answering, and it is available on the Hugging Face hub. You can load it using the load_dataset() function from Hugging Face's datasets library.
from datasets import load_dataset
dataset = load_dataset("squad")
Every dataset is different. This particular dataset is dictionary-like with keys "title", "context", "question", and "answers". The "context" is a piece of moderately long text. The "question" is a question sentence. The "answers" is a dictionary with the keys "text" and "answer_start". The "text" maps to a short string that is the answer to the question. The "answer_start" maps to the start position of the answer in the context. The "title" can be ignored, as it gives the title of the article that the context is extracted from.
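You can confirm this structure by printing one record from the training split. A minimal sketch (the exact text printed depends on the record, but the keys are as described):

from datasets import load_dataset

dataset = load_dataset("squad")
sample = dataset["train"][0]
print(sample["question"])
print(sample["context"][:80])   # the context is a longer passage; print only the beginning
print(sample["answers"])        # a dict like {"text": [...], "answer_start": [...]}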
To use the dataset for training, you need to know how the model expects the input and what kind of output it produces. In the case of DistilBERT for question answering, the model's behavior is fixed by the implementation of the DistilBertForQuestionAnswering class, unless you decide to write your own model implementation. In this class, the model expects the input as a sequence of integer token IDs, and the output is two vectors of logits, one for the start position and one for the end position of the answer.
You can find the details of the input and output format of the model in the previous post, or in the DistilBertForQuestionAnswering class documentation.
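As a quick check of this input/output contract, you can run a single example through the model loaded above. This is only a sketch; the question and context strings are made up for illustration:

...
enc = tokenizer("Who wrote the book?", "The book was written by Alice.", return_tensors="pt")
out = model(**enc)
print(out.start_logits.shape, out.end_logits.shape)   # both are (1, sequence length)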
In order to use the dataset for training, you need to do some preprocessing to transform it into a format that matches the model's input and output. The dataset object loaded from the Hugging Face hub allows you to do this with the map() method, in which the transformation is implemented as a custom function, preprocess_function().
...
tokenized_datasets = dataset.map(preprocess_function, batched=True,
                                 remove_columns=dataset["train"].column_names)
Note that preprocess_function() must accept a batch from the dataset, because you used batched=True in the map() method.
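In other words, each value in examples is a list with one element per example in the batch. The tiny, made-up batch below (not taken from SQuAD) is a sketch of the shape of data the function receives and the keys it returns:

...
toy_batch = {
    "question": ["Who wrote the book?", "Where is the tower?"],
    "context": ["The book was written by Alice.", "The tower is in Paris."],
    "answers": [{"text": ["Alice"], "answer_start": [24]},
                {"text": ["Paris"], "answer_start": [16]}],
}
processed = preprocess_function(toy_batch)
print(list(processed.keys()))   # input_ids, attention_mask, start_positions, end_positions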
In preprocess_function(), the tokenizer is invoked with the questions from examples["question"] and the contexts from examples["context"]. Each question is stripped of extra spaces, and the context is truncated so that the combined input fits within the maximum length of 384 tokens. The use of the tokenizer in this function is different from what you have seen in the previous post:
...
inputs = tokenizer(
    questions,
    examples["context"],
    max_length=384,
    truncation="only_second",
    return_offsets_mapping=True,
    padding="max_length",
)
Firstly, the tokenizer is invoked with a batch of questions and contexts. Because padding="max_length" is used, every input is padded to the maximum length of 384 tokens. Secondly, with return_offsets_mapping=True, the tokenizer returns a dictionary with the keys "input_ids", "attention_mask", and "offset_mapping". The "input_ids" is the sequence of integer token IDs. The "attention_mask" is a binary mask that indicates which tokens are real (1) and which are padding (0). The "offset_mapping" is what is added by setting return_offsets_mapping=True; it is a list of tuples indicating the character positions (start and end offsets) of each token in the original text.
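A small, made-up example shows what the offset mapping looks like: special tokens such as [CLS] and [SEP] get the placeholder offset (0, 0), while every other tuple points into the question text or the context text, respectively. This is only an illustrative sketch:

...
enc = tokenizer("Who wrote the book?", "The book was written by Alice.",
                return_offsets_mapping=True)
tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"])
for token, offset in zip(tokens, enc["offset_mapping"]):
    print(token, offset)   # e.g., the context token "alice" maps to the span (24, 29)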
The input_ids from the tokenizer output concatenates the question and the context in the format of:
[CLS] question [SEP] context [SEP]
which is what the model expects. The answer in the dataset is a string, together with the character offset at which the answer can be found in the original context. This is different from what the model produces, namely, logits over token positions. Therefore, you used a for-loop in preprocess_function() to recreate the start and end token positions of the answer.
In this code, the tokenizer is invoked with extra arguments. Setting return_offsets_mapping=True makes the returned object contain offset_mapping, a list of tuples identifying the start and end character positions of each token in each input text.
First, the offset_mapping is popped from the object returned by the tokenizer, since it is not needed for the training itself. Then, for each answer, you identify the start and end character offsets in the context. You can verify this with code like the following:
...
start_char = answer["answer_start"][0]
end_char = start_char + len(answer["text"][0])
assert answer["text"][0] == context[start_char:end_char]
Even though you know the character offsets, the model operates on token positions.
Recall that the tokenizer concatenated the question and the context. Fortunately, the tokenizer provides the clue to identify the start and end of the context in its output. inputs.sequence_ids(i) is a Python list of integers or None corresponding to element i of the batch. The list holds None at positions where a special token sits, and an integer at positions holding a token from the actual input. In this use case, you invoked the tokenizer with the question first and the context second; therefore, the integer 0 corresponds to the question and 1 corresponds to the context.
Therefore, you can identify the start and end token positions of the context by checking where the integer 1 first and last appears in the sequence_ids list:
...
sequence_ids = inputs.sequence_ids(i)
context_start = sequence_ids.index(1)
context_end = len(sequence_ids) - 1 - sequence_ids[::-1].index(1)
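For a short, made-up question/context pair (padding omitted), the sequence_ids list looks like the sketch below, which shows why searching for the first and last 1 finds the context span:

...
enc = tokenizer("Who wrote the book?", "The book was written by Alice.")
print(enc.sequence_ids())   # e.g. [None, 0, 0, 0, 0, 0, None, 1, 1, 1, 1, 1, 1, 1, None]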
Given the start and end token positions of the context, you still need to check whether the answer is covered by the tokens. This is done by walking over the tokens one by one: two while-loops scan the offsets to find the tokens whose character spans cover the start and end positions of the answer. The token positions found are recorded in start_positions and end_positions. For answers that cannot be found (e.g., because the context was clipped by truncation), the positions are set to 0.
...
# If the answer is not fully inside the context, label it (0, 0)
if offsets[context_start][0] > end_char or offsets[context_end][1] < start_char:
    start_positions.append(0)
    end_positions.append(0)
else:
    # Otherwise find the start and end token positions
    idx = context_start
    while idx <= context_end and offsets[idx][0] <= start_char:
        idx += 1
    start_positions.append(idx - 1)

    idx = context_end
    while idx >= context_start and offsets[idx][1] >= end_char:
        idx -= 1
    end_positions.append(idx + 1)
At the end of preprocess_function(), the object inputs is returned. It is dictionary-like with keys input_ids, attention_mask, start_positions, and end_positions. You must not change the names of these keys, because the DistilBertForQuestionAnswering class expects them as arguments in its forward() method.
The DistilBERT model expects you to call it with the argument input_ids. If you call it with a padded batch, attention_mask is needed as well to tell which tokens in the input are padding. If you also pass the optional start and end positions, the cross-entropy loss will be computed too. This is how the transformers library is designed to help you call the model for inference and training with the same interface.
Running the Training
To run this code, you need to install the following packages:
pip install torch datasets transformers accelerate
While you can expect the requirements of torch, transformers, and datasets, the accelerate package is a dependency when you use the Trainer class from the transformers library.
You might expect training a complex model like DistilBERT to require a lot of code. Indeed, it is not easy, because you need to decide which optimizer to use, how many epochs to train, and hyperparameters such as the batch size, learning rate, weight decay, and so on. You also have to handle checkpointing so that you can resume the training in case of interruption.
That is why the Trainer class was introduced. You just need to set up the training arguments, then set up the Trainer with the dataset, and then run the training:
...
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    processing_class=tokenizer,
)
trainer.train()
The Trainer handles the checkpointing, the logging, and the evaluation in a single function call. You just need to save the fine-tuned model (together with the tokenizer, since they are loaded together) in the Hugging Face format once the training is complete:
...
model.save_pretrained("./fine-tuned-distilbert-squad")
tokenizer.save_pretrained("./fine-tuned-distilbert-squad")
That's all you need to do. Even if you did not specify using a GPU for the training, the Trainer will automatically discover the GPU on your system and use it to speed up the process. The code above, although not very long, is the complete code for fine-tuning DistilBERT on the SQuAD dataset.
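As a side note, the Trainer writes checkpoints under output_dir while it runs. If the training is interrupted, you can resume from the last checkpoint instead of starting over; a minimal sketch:

...
trainer.train(resume_from_checkpoint=True)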
If you run this code, you should expect output like the following:
Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|████████████████████████████████| 10570/10570 [00:01
{'loss': 2.9462, 'grad_norm': 13.834440231323242, 'learning_rate': 1.9391171993911722e-05, 'epoch': 0.09}
{'loss': 1.7333, 'grad_norm': 14.540811538696289, 'learning_rate': 1.8782343987823442e-05, 'epoch': 0.18}
{'loss': 1.5268, 'grad_norm': 15.629022598266602, 'learning_rate': 1.8173515981735163e-05, 'epoch': 0.27}
{'loss': 1.4487, 'grad_norm': 20.17080307006836, 'learning_rate': 1.756468797564688e-05, 'epoch': 0.37}
{'loss': 1.3957, 'grad_norm': 21.543432235717773, 'learning_rate': 1.69558599695586e-05, 'epoch': 0.46}
{'loss': 1.3816, 'grad_norm': 15.349509239196777, 'learning_rate': 1.634703196347032e-05, 'epoch': 0.55}
{'loss': 1.314, 'grad_norm': 14.986817359924316, 'learning_rate': 1.573820395738204e-05, 'epoch': 0.64}
{'loss': 1.2313, 'grad_norm': 15.443862915039062, 'learning_rate': 1.5129375951293761e-05, 'epoch': 0.73}
{'loss': 1.2613, 'grad_norm': 10.729198455810547, 'learning_rate': 1.4520547945205482e-05, 'epoch': 0.82}
{'loss': 1.1976, 'grad_norm': 18.681406021118164, 'learning_rate': 1.39117199391172e-05, 'epoch': 0.91}
{'eval_loss': 1.142066240310669, 'eval_runtime': 14.8679, 'eval_samples_per_second': 710.926, 'eval_steps_per_second': 44.458, 'epoch': 1.0}
{'loss': 1.1858, 'grad_norm': 15.170207023620605, 'learning_rate': 1.330289193302892e-05, 'epoch': 1.0}
{'loss': 0.962, 'grad_norm': 14.375147819519043, 'learning_rate': 1.2694063926940641e-05, 'epoch': 1.1}
{'loss': 0.9994, 'grad_norm': 13.867342948913574, 'learning_rate': 1.2085235920852361e-05, 'epoch': 1.19}
{'loss': 0.9912, 'grad_norm': 13.35099983215332, 'learning_rate': 1.147640791476408e-05, 'epoch': 1.28}
{'loss': 0.976, 'grad_norm': 18.943002700805664, 'learning_rate': 1.08675799086758e-05, 'epoch': 1.37}
{'loss': 0.9687, 'grad_norm': 12.70341968536377, 'learning_rate': 1.025875190258752e-05, 'epoch': 1.46}
{'loss': 0.949, 'grad_norm': 10.327693939208984, 'learning_rate': 9.64992389649924e-06, 'epoch': 1.55}
{'loss': 0.9482, 'grad_norm': 17.166929244995117, 'learning_rate': 9.04109589041096e-06, 'epoch': 1.64}
{'loss': 0.9248, 'grad_norm': 23.135452270507812, 'learning_rate': 8.432267884322679e-06, 'epoch': 1.74}
{'loss': 0.9289, 'grad_norm': 15.964847564697266, 'learning_rate': 7.823439878234399e-06, 'epoch': 1.83}
{'loss': 0.9605, 'grad_norm': 10.738043785095215, 'learning_rate': 7.214611872146119e-06, 'epoch': 1.92}
{'eval_loss': 1.0946319103240967, 'eval_runtime': 14.7779, 'eval_samples_per_second': 715.256, 'eval_steps_per_second': 44.729, 'epoch': 2.0}
{'loss': 0.9376, 'grad_norm': 22.791458129882812, 'learning_rate': 6.605783866057839e-06, 'epoch': 2.01}
{'loss': 0.7745, 'grad_norm': 15.398698806762695, 'learning_rate': 5.996955859969558e-06, 'epoch': 2.1}
{'loss': 0.7458, 'grad_norm': 17.4672908782959, 'learning_rate': 5.388127853881279e-06, 'epoch': 2.19}
{'loss': 0.7636, 'grad_norm': 13.833612442016602, 'learning_rate': 4.779299847792998e-06, 'epoch': 2.28}
{'loss': 0.7803, 'grad_norm': 11.179983139038086, 'learning_rate': 4.170471841704719e-06, 'epoch': 2.37}
{'loss': 0.7666, 'grad_norm': 9.601215362548828, 'learning_rate': 3.5616438356164386e-06, 'epoch': 2.47}
{'loss': 0.7784, 'grad_norm': 24.625328063964844, 'learning_rate': 2.9528158295281586e-06, 'epoch': 2.56}
{'loss': 0.7389, 'grad_norm': 13.041014671325684, 'learning_rate': 2.343987823439878e-06, 'epoch': 2.65}
{'loss': 0.7636, 'grad_norm': 12.822973251342773, 'learning_rate': 1.7351598173515982e-06, 'epoch': 2.74}
{'loss': 0.7625, 'grad_norm': 12.254212379455566, 'learning_rate': 1.1263318112633182e-06, 'epoch': 2.83}
{'loss': 0.727, 'grad_norm': 8.469372749328613, 'learning_rate': 5.17503805175038e-07, 'epoch': 2.92}
{'eval_loss': 1.1390912532806396, 'eval_runtime': 14.8303, 'eval_samples_per_second': 712.731, 'eval_steps_per_second': 44.571, 'epoch': 3.0}
{'train_runtime': 1106.5639, 'train_samples_per_second': 237.489, 'train_steps_per_second': 14.843, 'train_loss': 1.0775353780946775, 'epoch': 3.0}
100%|██████████████████████████████████████████████| 16425/16425 [18:26
This takes some time to run, even if you use a decent GPU. However, you are fine-tuning a pre-trained model on the new dataset. This is orders of magnitude faster and easier than training from scratch.
Once you have finished the training, you can load the model in your other projects by using the path:
from transformers import DistilBertTokenizerFast, DistilBertForQuestionAnswering
model_path = "./fine-tuned-distilbert-squad"
tokenizer = DistilBertTokenizerFast.from_pretrained(model_path)
model = DistilBertForQuestionAnswering.from_pretrained(model_path)
...
Please make sure that model_path is the correct path to the saved model files from your project.
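Once loaded, the easiest way to try out the fine-tuned model is through a pipeline. A quick sketch (the question and context below are made up):

from transformers import pipeline

qa = pipeline("question-answering", model=model, tokenizer=tokenizer)
result = qa(question="What is the capital of France?",
            context="Paris is the capital and largest city of France.")
print(result)   # a dict with keys such as "answer", "score", "start", and "end"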
Further Reading
Below are some links to the documentation of the classes and methods used in this post:
Summary
In this post, you have learned how to fine-tune DistilBERT for a custom question-answering task. Even though DistilBERT and question answering are used as the example, you can apply the same process to other models and tasks. In particular, you learned:
- How to prepare the dataset for training
- How to train or fine-tune the model using the Trainer interface from the transformers library