DistilBart is a typical encoder-decoder model for NLP tasks. In this tutorial, you will learn how such a model is constructed and how you can inspect its architecture so that you can compare it with other models. You will also learn how to use the pretrained DistilBart model to generate summaries and how to control the summaries' style.
After completing this tutorial, you will know:
- How DistilBart's encoder-decoder architecture processes text internally
- Methods for controlling summary style and content
- Techniques for evaluating and improving summary quality
Let's get started!

Understanding the DistilBart Model and ROUGE Metric
Photo by Svetlana Gumerova. Some rights reserved.
Overview
This post is in two parts; they are:
- Understanding the Encoder-Decoder Architecture
- Evaluating the Result of Summarization using ROUGE
Understanding the Encoder-Decoder Architecture
DistilBart is a "distilled" version of the BART model, a powerful sequence-to-sequence model for natural language generation, translation, and comprehension. The BART model uses a full transformer architecture with an encoder and a decoder.
You can find the architecture of transformer models in the paper Attention is all you need. At a high level, the illustration is as follows:

Transformer architecture
The key characteristic of the transformer architecture is that it is split into an encoder and a decoder. The encoder takes the input sequence and outputs a sequence of hidden states. The decoder takes the hidden states and outputs the final sequence. This is very effective for sequence-to-sequence tasks like summarization, in which the input must be fully consumed to extract the key information before the summary can be generated.
As explained in the previous post, you can use the pretrained DistilBart model to build a summarizer with just a few lines of code. In fact, you can see some of the design parameters of DistilBart's architecture by looking at the model config:
from transformers import AutoConfig, AutoModelForSeq2SeqLM

def explore_model_architecture():
    """Examine DistilBart's configuration and architecture."""
    model_name = "sshleifer/distilbart-cnn-12-6"

    # Load model configuration
    config = AutoConfig.from_pretrained(model_name)
    print("Model Architecture:")
    print(f"- Encoder layers: {config.encoder_layers}")
    print(f"- Decoder layers: {config.decoder_layers}")
    print(f"- Hidden size: {config.hidden_size}")
    print(f"- Attention heads: {config.encoder_attention_heads}")

    # Verify encoder-decoder structure
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    print("\nModel Components:")
    print(f"- Encoder: {type(model.model.encoder).__name__}")
    print(f"- Decoder: {type(model.model.decoder).__name__}")
    return model, config

# Example usage
model, config = explore_model_architecture()
The code above prints the size of the hidden state, the number of attention heads, and the number of encoder and decoder layers in the model:
Model Architecture:
- Encoder layers: 12
- Decoder layers: 6
- Hidden size: 1024
- Attention heads: 16

Model Components:
- Encoder: BartEncoder
- Decoder: BartDecoder
The model created this way is a PyTorch model. You can print the model if you want to see more:
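print(model)  # dump the full module tree of the PyTorch model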
Which should show you:
BartForConditionalGeneration(
  (model): BartModel(
    (shared): BartScaledWordEmbedding(50264, 1024, padding_idx=1)
    (encoder): BartEncoder(
      (embed_tokens): BartScaledWordEmbedding(50264, 1024, padding_idx=1)
      (embed_positions): BartLearnedPositionalEmbedding(1026, 1024)
      (layers): ModuleList(
        (0-11): 12 x BartEncoderLayer(
          (self_attn): BartSdpaAttention(
            (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (v_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (q_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (activation_fn): GELUActivation()
          (fc1): Linear(in_features=1024, out_features=4096, bias=True)
          (fc2): Linear(in_features=4096, out_features=1024, bias=True)
          (final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        )
      )
      (layernorm_embedding): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
    )
    (decoder): BartDecoder(
      (embed_tokens): BartScaledWordEmbedding(50264, 1024, padding_idx=1)
      (embed_positions): BartLearnedPositionalEmbedding(1026, 1024)
      (layers): ModuleList(
        (0-5): 6 x BartDecoderLayer(
          (self_attn): BartSdpaAttention(
            (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (v_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (q_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
          )
          (activation_fn): GELUActivation()
          (self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (encoder_attn): BartSdpaAttention(
            (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (v_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (q_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
          )
          (encoder_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (fc1): Linear(in_features=1024, out_features=4096, bias=True)
          (fc2): Linear(in_features=4096, out_features=1024, bias=True)
          (final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        )
      )
      (layernorm_embedding): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
    )
  )
  (lm_head): Linear(in_features=1024, out_features=50264, bias=False)
)
This may not be easy to read. But if you are familiar with the transformer architecture, you will notice that:
- The BartModel has an embedding model, an encoder model, and a decoder model. The same embedding model appears in both the encoder and the decoder.
- The size of the embedding model suggests that the vocabulary contains 50264 tokens. The output of the embedding model has a size of 1024 (the "hidden size"), which is the length of the embedding vector for each token.
- Both the encoder and decoder use the BartLearnedPositionalEmbedding model, which is presumably a learned positional encoding for the input sequence of each model.
- The encoder has 12 layers while the decoder has only 6. DistilBart is a "distilled" version of BART because BART has 12 decoder layers, which DistilBart reduces to 6.
- Each encoder layer has one self-attention block, two layer norms, and two feed-forward layers, and uses GELU as the activation function.
- Each decoder layer has one self-attention block, one cross-attention block over the encoder output, three layer norms, and two feed-forward layers, and uses GELU as the activation function.
- In both the encoder and decoder, the hidden size does not change through the layers, but the feed-forward layers expand to 4x the hidden size (4096) in the middle.
Most transformer models use a similar architecture with some variations. These are the high-level building blocks of the model, but you cannot see the exact algorithm used, for example, the order in which the building blocks are invoked on the input sequence. You can find such details only when you check the model implementation code.
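If you want to read that implementation, you can locate the source file directly from the class objects. Below is a minimal sketch, assuming the model object created in the earlier snippet:

import inspect

# print the path of the source file that defines the encoder class,
# where the actual forward() logic lives
print(inspect.getsourcefile(type(model.model.encoder)))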
Not all models have both an encoder and a decoder. However, this design is very common for sequence-to-sequence tasks. The output from the encoder model is called the "contextual representation" of the input sequence. It captures the essence of the input text. The decoder model uses the contextual representation to generate the final sequence.
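If you want to see the contextual representation concretely, you can run the encoder on its own and inspect the shape of its output. The snippet below is a minimal sketch; the example sentence is arbitrary:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "sshleifer/distilbart-cnn-12-6"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# tokenize a short input and run only the encoder part of the model
inputs = tokenizer("Transformers are effective for summarization.", return_tensors="pt")
encoder_outputs = model.model.encoder(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
)
# one 1024-dimensional vector per input token: (batch, seq_len, hidden_size)
print(encoder_outputs.last_hidden_state.shape)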
Evaluating the Result of Summarization using ROUGE
Now that you have seen how to use the pretrained DistilBart model to generate summaries, how do you know the quality of its output?
This is indeed a very difficult question. Everyone has their own opinion on what makes a good summary. However, some well-known metrics are used to evaluate various outputs of language models. One popular metric for evaluating the quality of summaries is ROUGE.
ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. It is a set of metrics used to evaluate the quality of text summarization and machine translation. Behind the scenes, the F1 score of the precision and recall of the generated summary is computed against the reference summary. It is simple to understand and easy to compute. As a recall-oriented metric, it focuses on the ability of the summary to recall the key phrases. The weakness of ROUGE is that it needs a reference summary, so the effectiveness of the evaluation depends on the quality of that reference.
Let's revisit how we can use DistilBart to generate summaries:
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

class Summarizer:
    def __init__(self, model_name="sshleifer/distilbart-cnn-12-6"):
        """Initialize the summarizer with model and tokenizer."""
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
        self.model.to(self.device)

    def summarize(self, text, context_weight=0.5, max_length=150, min_length=50,
                  num_beams=4, length_penalty=2.0, repetition_penalty=1.0,
                  do_sample=False, temperature=1.0, early_stopping=True):
        """Generate a summary with context awareness."""
        inputs = self.tokenizer(text,
                                return_tensors="pt",
                                padding=True,
                                truncation=True,
                                max_length=1024
                                ).to(self.device)
        # Generate summary using only the input tokens
        summary_ids = self.model.generate(
            inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            max_length=max_length,
            min_length=min_length,
            num_beams=num_beams,
            length_penalty=length_penalty,
            repetition_penalty=repetition_penalty,
            do_sample=do_sample,
            temperature=temperature,
            early_stopping=early_stopping,
        )
        # Decode and return the summary
        summary = self.tokenizer.decode(summary_ids[0], skip_special_tokens=True)
        return summary

# Let's run an example to see how it works
summarizer = Summarizer()
text = """
The development of artificial intelligence has revolutionized numerous industries.
Machine learning algorithms now power everything from recommendation systems to
autonomous vehicles. Deep learning, in particular, has shown remarkable success in
tasks like image recognition and natural language processing. However, these advances
also raise important ethical considerations about AI's impact on society, privacy,
and employment.
"""

summary = summarizer.summarize(text)
print(f"Summary:\n{summary}")
The Summarizer class loads the pretrained DistilBart model and tokenizer, and then uses the model to generate a summary of the input text. To generate the summary, several parameters are passed to the generate() method to control how the summary is produced. You can modify these parameters, but the default values are a good starting point.
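For example, continuing from the snippet above, you can override any of these defaults at call time; the values below are arbitrary and only for illustration:

# ask for a shorter summary with a wider beam search
short_summary = summarizer.summarize(text, max_length=80, min_length=30, num_beams=8)
print(short_summary)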
Now let's extend the Summarizer class to generate summaries with different styles by setting different parameters for the generate() method:
...

class StyleControlledSummarizer(Summarizer):
    def summarize_with_style(self, text, style="concise"):
        """Generate summaries with different styles.

        Args:
            text (str): Input text to summarize
            style (str): Summary style ('concise', 'detailed', 'technical', 'simple')

        Returns:
            str: Generated summary with specified style
        """
        style_params = {
            "concise": {
                "max_length": 80,
                "min_length": 30,
                "length_penalty": 3.0,
                "num_beams": 4,
                "early_stopping": True
            },
            "detailed": {
                "max_length": 200,
                "min_length": 100,
                "length_penalty": 1.0,
                "num_beams": 6,
                "early_stopping": False
            },
            "technical": {
                "max_length": 150,
                "min_length": 50,
                "length_penalty": 2.0,
                "num_beams": 5,
                "repetition_penalty": 1.5
            },
            "simple": {
                "max_length": 100,
                "min_length": 30,
                "length_penalty": 2.0,
                "num_beams": 3,
                "do_sample": True,
                "temperature": 0.7
            }
        }
        params = style_params[style]
        return self.summarize(text, **params)

# Let's run an example to see how it works
style_summarizer = StyleControlledSummarizer()
text = """
Quantum computing leverages the principles of quantum mechanics to perform computations.
Unlike classical computers that use bits, quantum computers use quantum bits or qubits.
These qubits can exist in multiple states simultaneously through superposition, potentially
allowing quantum computers to solve certain problems exponentially faster than classical
computers. However, maintaining quantum coherence and minimizing errors remains a critical
challenge in building practical quantum computers.
"""

styles = ["concise", "detailed", "technical", "simple"]
for style in styles:
    summary = style_summarizer.summarize_with_style(text, style=style)
    print(f"\n{style.capitalize()} Summary:")
    print(summary)
The StyleControlledSummarizer class defines four summary styles, named "concise", "detailed", "technical", and "simple". You can see that the parameters passed to the generate() method differ for each style. In particular, the "detailed" style allows a longer summary, the "technical" style uses a higher repetition penalty, and the "simple" style enables sampling with a lower temperature, which adds some variation to the output while keeping it focused.
Is that good? Let’s see what the ROUGE metric says:
...

from rouge_score import rouge_scorer

class SummaryEvaluator:
    def __init__(self):
        """Initialize with ROUGE metrics."""
        self.scorer = rouge_scorer.RougeScorer(
            ['rouge1', 'rouge2', 'rougeL'],
            use_stemmer=True
        )

    def evaluate_summary(self, reference, candidate):
        """Calculate ROUGE scores for a summary.

        Args:
            reference (str): Reference summary
            candidate (str): Generated summary

        Returns:
            dict: ROUGE scores for different metrics
        """
        scores = self.scorer.score(reference, candidate)

        print("Summary Quality Metrics:")
        print(f"ROUGE-1: {scores['rouge1'].fmeasure:.3f}")
        print(f"ROUGE-2: {scores['rouge2'].fmeasure:.3f}")
        print(f"ROUGE-L: {scores['rougeL'].fmeasure:.3f}")

        return scores

# Checking the metrics implementation
summarizer = StyleControlledSummarizer()
evaluator = SummaryEvaluator()
reference = "Quantum computing uses qubits for faster computation but faces coherence challenges."
for style in ["concise", "detailed", "technical", "simple"]:
    candidate = summarizer.summarize_with_style(text, style=style)
    scores = evaluator.evaluate_summary(reference, candidate)
You may see output like this:
Concise Summary:
Quantum computing leverages the principles of quantum mechanics to perform certain problems exponentially faster than classical computers . Unlike classical computers that use bits, quantum computers use quantum bits or qubits . These qubits can exist in multiple states simultaneously through superposition .
Summary Quality Metrics:
ROUGE-1: 0.235
ROUGE-2: 0.082
ROUGE-L: 0.157

Detailed Summary:
Quantum computing leverages the principles of quantum mechanics to perform quantum computations . Unlike classical computers that use bits, quantum computers use quantum bits or qubits . These qubits can exist in multiple states simultaneously through superposition, potentially allowing quantum computers to solve certain problems exponentially faster than classical computers . However, maintaining quantum coherence and minimizing errors remains a significant challenge in building practical quantum computers, according to the University of Cambridge, UK, researchers . Back to Mail Online home .Back to the page you came from .
Summary Quality Metrics:
ROUGE-1: 0.168
ROUGE-2: 0.043
ROUGE-L: 0.168

Technical Summary:
Quantum computing leverages the principles of quantum mechanics to perform certain problems exponentially faster than classical computers . Unlike classical computers that use bits, quantum computers use quantum bits or qubits . These qubits can exist in multiple states simultaneously through superposition . However, maintaining quantum coherence and minimizing errors remains a challenge .
Summary Quality Metrics:
ROUGE-1: 0.262
ROUGE-2: 0.068
ROUGE-L: 0.197

Simple Summary:
Quantum computing leverages the principles of quantum mechanics to perform quantum computing . Unlike classical computers that use bits, quantum computers use quantum bits or qubits . These qubits can exist in multiple states simultaneously through superposition .
Summary Quality Metrics:
ROUGE-1: 0.217
ROUGE-2: 0.091
ROUGE-L: 0.174
To run this code, you need to install the rouge_score package:
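pip install rouge_score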
Three metrics are used above. ROUGE-1 is based on unigrams, i.e., single words. ROUGE-2 is based on bigrams, i.e., pairs of words. ROUGE-L is based on the longest common subsequence. Each metric measures a different aspect of summary quality. The higher the metric, the better.
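To make these metrics concrete, here is a small, self-contained sketch that scores a single made-up reference/candidate pair and prints the precision, recall, and F1 for each metric:

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"

# score() returns a dict of Score tuples with precision, recall, and F1
for name, score in scorer.score(reference, candidate).items():
    print(f"{name}: P={score.precision:.3f} R={score.recall:.3f} F1={score.fmeasure:.3f}")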
As you can see from the style comparison earlier, a longer summary is not always better. It all depends on the "reference" you use when computing the ROUGE metrics.
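As a quick illustration of that dependence, the same candidate scored against two different (made-up) references gives noticeably different numbers:

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)
candidate = "Quantum computers use qubits to solve certain problems faster."
short_ref = "Qubits let quantum computers solve some problems faster."
long_ref = ("Quantum computing relies on qubits and superposition to outperform classical "
            "computers on certain problems, but coherence and error rates remain challenges.")

# the same candidate, two references, two different ROUGE-1 scores
print(scorer.score(short_ref, candidate)["rouge1"].fmeasure)
print(scorer.score(long_ref, candidate)["rouge1"].fmeasure)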
Putting it all together, below is the complete code:
import torch
from rouge_score import rouge_scorer
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

class Summarizer:
    def __init__(self, model_name="sshleifer/distilbart-cnn-12-6"):
        """Initialize the summarizer with model and tokenizer."""
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
        self.model.to(self.device)

    def summarize(self, text, context_weight=0.5, max_length=150, min_length=50,
                  num_beams=4, length_penalty=2.0, repetition_penalty=1.0,
                  do_sample=False, temperature=1.0, early_stopping=True):
        """Generate a summary with context awareness."""
        inputs = self.tokenizer(text,
                                return_tensors="pt",
                                padding=True,
                                truncation=True,
                                max_length=1024
                                ).to(self.device)
        # Generate summary using only the input tokens
        summary_ids = self.model.generate(
            inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            max_length=max_length,
            min_length=min_length,
            num_beams=num_beams,
            length_penalty=length_penalty,
            repetition_penalty=repetition_penalty,
            do_sample=do_sample,
            temperature=temperature,
            early_stopping=early_stopping,
        )
        # Decode and return the summary
        summary = self.tokenizer.decode(summary_ids[0], skip_special_tokens=True)
        return summary

class StyleControlledSummarizer(Summarizer):
    def summarize_with_style(self, text, style="concise"):
        """Generate summaries with different styles.

        Args:
            text (str): Input text to summarize
            style (str): Summary style ('concise', 'detailed', 'technical', 'simple')

        Returns:
            str: Generated summary with specified style
        """
        style_params = {
            "concise": {
                "max_length": 80,
                "min_length": 30,
                "length_penalty": 3.0,
                "num_beams": 4,
                "early_stopping": True
            },
            "detailed": {
                "max_length": 200,
                "min_length": 100,
                "length_penalty": 1.0,
                "num_beams": 6,
                "early_stopping": False
            },
            "technical": {
                "max_length": 150,
                "min_length": 50,
                "length_penalty": 2.0,
                "num_beams": 5,
                "repetition_penalty": 1.5
            },
            "simple": {
                "max_length": 100,
                "min_length": 30,
                "length_penalty": 2.0,
                "num_beams": 3,
                "do_sample": True,
                "temperature": 0.7
            }
        }
        params = style_params[style]
        return self.summarize(text, **params)

class SummaryEvaluator:
    def __init__(self):
        """Initialize with ROUGE metrics."""
        self.scorer = rouge_scorer.RougeScorer(
            ['rouge1', 'rouge2', 'rougeL'],
            use_stemmer=True
        )

    def evaluate_summary(self, reference, candidate):
        """Calculate ROUGE scores for a summary.

        Args:
            reference (str): Reference summary
            candidate (str): Generated summary

        Returns:
            dict: ROUGE scores for different metrics
        """
        scores = self.scorer.score(reference, candidate)

        print("Summary Quality Metrics:")
        print(f"ROUGE-1: {scores['rouge1'].fmeasure:.3f}")
        print(f"ROUGE-2: {scores['rouge2'].fmeasure:.3f}")
        print(f"ROUGE-L: {scores['rougeL'].fmeasure:.3f}")

        return scores

# Checking the metrics implementation
summarizer = StyleControlledSummarizer()
evaluator = SummaryEvaluator()
text = """
Quantum computing leverages the principles of quantum mechanics to perform computations.
Unlike classical computers that use bits, quantum computers use quantum bits or qubits.
These qubits can exist in multiple states simultaneously through superposition, potentially
allowing quantum computers to solve certain problems exponentially faster than classical
computers. However, maintaining quantum coherence and minimizing errors remains a critical
challenge in building practical quantum computers.
"""
reference = "Quantum computing uses qubits for faster computation but faces coherence challenges."
for style in ["concise", "detailed", "technical", "simple"]:
    summary = summarizer.summarize_with_style(text, style=style)
    print(f"\n{style.capitalize()} Summary:")
    print(summary)
    scores = evaluator.evaluate_summary(reference, summary)
Further Reading
Below are some resources that you may find useful:
- DistilBart Model
- ROUGE Metric
- Pre-trained Summarization Distillation by Sam Shleifer, Alexander M. Rush (arXiv:2010.13002)
- BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, Luke Zettlemoyer (arXiv:1910.13461)
- Attention is all you need by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin (arXiv:1706.03762)
- Chin-Yew Lin. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.
Summary
In this tutorial, you learned several advanced aspects of text summarization. In particular, you learned:
- How DistilBart's encoder-decoder architecture processes text
- Methods for controlling summary style
- Approaches to evaluating summary quality
These techniques let you build more sophisticated and effective text summarization systems tailored to specific needs and requirements.