Implementing Extractive Text Summarization with BERT - Issue #4
Across many business and practical use cases, we might want to automatically generate short, abbreviated versions of some long text. It turns out you can apply advances in modern (BERT-based) neural networks to solve this task! In this issue, we will start with an overview of extractive and abstractive summarization, and then look at how to implement extractive summarization with BERT.
Extractive vs Abstractive Summarization
In extractive summarization, the task is to extract subsets (sentences) from a document that are then assembled to form a summary. Abstractive summarization, on the other hand, might generate novel words, paraphrase the original text or rewrite it (substitution, deletion, reordering etc).
Abstractive summarization, while being a harder problem, benefits from advances in sophisticated transformer-based language models (e.g. BERT, GPT-2/3, RoBERTa, XLNet, ALBERT, T5, ELECTRA). We can treat abstractive summarization as a sequence-to-sequence translation task, where the task is to translate a long document into a shorter summary (see PEGASUS). However, because these models generate their summaries, there is a risk that they synthesize new text that changes the meaning of the original, is not factual, or is plainly incorrect.
For applications where this sort of correctness error is intolerable, extractive summarization is a potentially good fit, e.g. summarization of medical documents, legal documents etc.
Problem Framing: Extractive Summarization as Sentence Classification
Overall, we can treat extractive summarization as a recommendation problem, i.e. given a query, recommend a set of sentences that are relevant. The query here is the document, and relevance is a measure of whether a given sentence belongs in the document summary.
How we go about obtaining this measure of relevance might vary (the common dilemma for any recommendation systems problem). We can choose among several problem formulations, for example:
Classification/Regression: Given input(s), output a class or a relevance score for each sentence. Here, the input is a document and a sentence in that document, and the output is a class (belongs in the summary or not) or a likelihood score (the likelihood that the sentence belongs in the summary). This formulation is pairwise, i.e. at test time we need n passes through the model for n sentences to get n classes/scores, or we can compute them as a single batch.
Metric Learning: Learn a shared embedding space for both documents and sentences, such that the embeddings of a document and of the sentences that belong in its summary are close together in that space. At test time, we get a representation of the document and of each sentence, and then retrieve the sentences most similar to the document. This approach is particularly useful as we can leverage fast similarity search algorithms.
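To make the metric learning formulation concrete, here is a minimal Python sketch of the test-time step only: embed the document and each sentence, then keep the sentences closest to the document. Everything in it is an assumption for illustration. The embed function is a toy bag-of-words stand-in for a learned encoder (it is not BERT), and the names cosine and most_similar_sentences are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a learned encoder: lowercase bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar_sentences(sentences, k):
    """Return the k sentences most similar to the whole document."""
    doc_vec = embed(" ".join(sentences))
    scored = [(cosine(doc_vec, embed(s)), i) for i, s in enumerate(sentences)]
    # Highest similarity first; ties go to the earlier sentence.
    top = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:k]
    # Restore document order so the summary reads naturally.
    return [sentences[i] for i in sorted(i for _, i in top)]


sentences = [
    "bert encodes sentences",
    "bert encodes documents",
    "good vibes",
]
summary = most_similar_sentences(sentences, k=2)
print(summary)
```

In a real system the linear scan over sentences is the part that a fast similarity search index would replace, and the embeddings would come from a model trained so that summary sentences land near their document.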
In this work, we will explore a classification setup, which follows existing studies (e.g. Nallapati et al. 2017, who use RNNs for text encoding and classify each sentence). While this approach is pairwise (and compute-intensive with respect to the number of sentences), we can accept this limitation as most documents have a relatively small number of sentences.
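The pairwise classification setup can be sketched in a few lines of Python. This is only an illustration of the control flow, under stated assumptions: score_sentence is a hypothetical stand-in for a trained classifier (here a simple word-overlap ratio, where a real model would return a learned probability), and extractive_summary, top_k and threshold are names introduced for this example.

```python
def score_sentence(document, sentence):
    """Hypothetical stand-in for a trained classifier.

    Returns a score in [0, 1]: the fraction of the document's distinct
    words that also appear in the sentence.
    """
    doc_words = set(document.lower().split())
    sent_words = set(sentence.lower().split())
    if not doc_words:
        return 0.0
    return len(doc_words & sent_words) / len(doc_words)


def extractive_summary(sentences, top_k=None, threshold=0.5):
    """Score every (document, sentence) pair and keep the best sentences."""
    document = " ".join(sentences)
    # n sentences -> n scoring passes (a real model could batch these).
    scores = [score_sentence(document, s) for s in sentences]
    if top_k is not None:
        # Keep the top_k highest-scoring sentences, ties to the earlier one.
        ranked = sorted(range(len(sentences)), key=lambda i: (-scores[i], i))
        keep = sorted(ranked[:top_k])
    else:
        # Treat the score as a class decision: in the summary or not.
        keep = [i for i, score in enumerate(scores) if score >= threshold]
    return [sentences[i] for i in keep], scores


sentences = [
    "the model scores each sentence",
    "scores rank each sentence",
    "hello",
]
summary, scores = extractive_summary(sentences)
best, _ = extractive_summary(sentences, top_k=1)
print(summary, best)
```

Selected sentences are returned in their original document order, which is a design choice made here for readability rather than something the formulation requires.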
Sample Results (Summarizing TechCrunch)
To allow for some comparison, we generated extractive and abstractive summaries using articles scraped from the front page of TechCrunch! Extractive summarization is implemented using the small sentence BERT baseline described earlier. We also benchmark against abstractive summaries generated with a pre-trained t5-base sequence-to-sequence generator model from the HuggingFace transformers library.
View an interactive widget with more summary examples here - https://victordibia.com/blog/extractive-summarization/.
More Details?
Want more details, e.g. how to improve an extractive summarization model or how to implement it in PyTorch?
Finally ...
It's been a while! I hope you are all doing well! Good vibes!