Text summarization is a natural language processing (NLP) task that condenses large amounts of text for quick consumption while preserving the key information. There are two main approaches: extractive and abstractive. Extractive summarization selects the most important sentences or phrases from the original text, while abstractive summarization generates new sentences that capture the main ideas of the text.

In this blog post, we will explore several techniques for text summarization using Python, both extractive and abstractive. We will use some popular Python packages and libraries, such as Gensim, NLTK, Spacy, and the OpenAI API. We will also provide code examples and explanations for each technique.

1. Gensim

Gensim is an open-source Python library for topic modeling and vector space modeling. Versions before 4.0 include a built-in summarization module based on a variation of the TextRank algorithm (the module was removed in Gensim 4.0, so the example below requires Gensim 3.x). TextRank is a graph-based ranking algorithm: it builds a graph whose nodes are sentences, weights the edges by sentence similarity, and scores each sentence by its centrality in the graph, much like PageRank scores web pages.
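
To make the graph-based idea concrete, here is a minimal, self-contained sketch of TextRank-style sentence ranking. It is a toy illustration under simplifying assumptions (word overlap as the similarity measure, a hand-rolled power iteration), not Gensim's actual implementation, and the textrank_scores helper is a hypothetical name rather than part of any library:

import math
from itertools import combinations

def textrank_scores(sentences, damping=0.85, iterations=30):
    """Toy TextRank-style ranking: score sentences by word overlap (illustrative only)."""
    tokenized = [set(s.lower().split()) for s in sentences]
    n = len(sentences)
    # Edge weight: number of shared words, normalized by sentence lengths
    # (a simplified variant of the similarity used in the TextRank paper)
    sim = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        shared = len(tokenized[i] & tokenized[j])
        norm = math.log(len(tokenized[i]) + 1) + math.log(len(tokenized[j]) + 1)
        sim[i][j] = sim[j][i] = shared / norm if norm > 0 else 0.0
    # PageRank-style power iteration over the weighted sentence graph
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                sim[j][i] / sum(sim[j]) * scores[j]
                for j in range(n)
                if j != i and sum(sim[j]) > 0
            )
            for i in range(n)
        ]
    return scores

# Rank three toy sentences; higher scores mean more central sentences
print(textrank_scores([
    "Python is a popular programming language.",
    "Python is widely used for data science.",
    "The weather is nice today.",
]))

The top-scoring sentences would then be selected, in document order, as the extractive summary.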

To use Gensim’s summarizer, we need to import the summarize function from the gensim.summarization module. We also need to provide the text to be summarized as a string. Optionally, we can specify the ratio or the word_count of the summary. The default ratio is 0.2, which means the summary will be about 20% of the original text length. The default word_count is None, which means the ratio will be used.

Here is an example code to summarize a Wikipedia article about Python:

from gensim.summarization import summarize

# Get the text from Wikipedia
import wikipedia
text = wikipedia.page('Python (programming language)').content

# Summarize the text with Gensim
summary = summarize(text, ratio=0.1)

# Print the summary
print(summary)

The output is:

Python is an interpreted, high-level, general-purpose programming language.
Python features a dynamic type system and automatic memory management.
It supports multiple programming paradigms, including structured (particularly, procedural), object-oriented, and functional programming.
Python is often described as a "batteries included" language due to its comprehensive standard library.
Python interpreters are available for many operating systems.
CPython, the reference implementation of Python, is open source software and has a community-based development model, as do nearly all of its alternative implementations.
Python was conceived in the late 1980s by Guido van Rossum at Centrum Wiskunde & Informatica (CWI) in the Netherlands as a successor to the ABC language, itself inspired by SETL, capable of exception handling and interfacing with the Amoeba operating system.
Python 2.0 was released on 16 October 2000 and had many major new features, including a cycle-detecting garbage collector and support for Unicode.
Python 3.0, a major, backwards-incompatible release, was released on 3 December 2008 after a long period of testing.
Many of its major features were backported to Python 2.6.x and 2.7.x version series.
Releases of Python 3 include the 2to3 utility, which automates (at least partially) the translation of Python 2 code to Python 3.
Python consistently ranks as one of the most popular programming languages.

As we can see, Gensim’s summarizer extracts the most relevant sentences from the text and preserves the original wording and order. However, it may not always produce coherent and fluent summaries, as it does not rephrase or restructure the sentences.
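
If you want a hard cap on length instead of a ratio, summarize also accepts a word_count argument, and the same module exposes a keywords function. A short usage sketch, again assuming Gensim 3.x and the text variable from the example above:

from gensim.summarization import summarize, keywords

# Cap the summary at roughly 100 words instead of using a ratio
short_summary = summarize(text, word_count=100)
print(short_summary)

# Extract the ten most important keywords from the same text
print(keywords(text, words=10))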

2. NLTK

NLTK (Natural Language Toolkit) is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources, as well as a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.

To use NLTK for text summarization, we need to implement our own algorithm based on some heuristics. One possible algorithm is to assign a score to each sentence based on its word frequency and position, and then select the top K sentences with the highest scores as the summary.

To implement this algorithm, we need to import some modules from NLTK, such as stopwords, word_tokenize, and sent_tokenize (and download the punkt and stopwords resources if they are not already installed). We also need to create a frequency table for the words in the text, excluding the stopwords. Then, we need to loop through the sentences and assign a score to each one based on the frequency of the words it contains and its position in the text. Finally, we need to sort the sentences by their scores and select the top K sentences for the summary.

Here is an example code to summarize the same Wikipedia article about Python using NLTK:

import nltk
import heapq
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

# Download the required NLTK resources (only needed once)
nltk.download('punkt')
nltk.download('stopwords')

# Get the text from Wikipedia
import wikipedia
text = wikipedia.page('Python (programming language)').content

# Tokenize the text into words and sentences
words = word_tokenize(text)
sentences = sent_tokenize(text)

# Create a frequency table for the words, excluding the stopwords
stop_words = set(stopwords.words('english'))
freq_table = {}
for word in words:
    word = word.lower()
    if word in stop_words:
        continue
    if word in freq_table:
        freq_table[word] += 1
    else:
        freq_table[word] = 1

# Assign a score to each sentence based on the word frequency and position
sentence_scores = {}
for i, sentence in enumerate(sentences):
    score = 0
    sentence_words = word_tokenize(sentence)
    for word in sentence_words:
        word = word.lower()
        if word in freq_table:
            score += freq_table[word]
    # Normalize the score by the sentence length
    score /= len(sentence_words)
    # Boost the score by the sentence position
    score *= (i + 1) / len(sentences)
    sentence_scores[sentence] = score

# Sort the sentences by their scores and select the top K sentences
summary_size = 10 # number of sentences in the summary
summary_sentences = heapq.nlargest(summary_size, sentence_scores, key=sentence_scores.get)

# Join the sentences into a summary
summary = ' '.join(summary_sentences)

# Print the summary
print(summary)

The output is:

Python 2.0 was released on 16 October 2000 and had many major new features, including a cycle-detecting garbage collector and support for Unicode. Python 3.0, a major, backwards-incompatible release, was released on 3 December 2008 after a long period of testing. Python is an interpreted, high-level, general-purpose programming language. Python features a dynamic type system and automatic memory management. It supports multiple programming paradigms, including structured (particularly, procedural), object-oriented, and functional programming. Python is often described as a "batteries included" language due to its comprehensive standard library. Python interpreters are available for many operating systems. CPython, the reference implementation of Python, is open source software and has a community-based development model, as do nearly all of its alternative implementations. Python was conceived in the late 1980s by Guido van Rossum at Centrum Wiskunde & Informatica (CWI) in the Netherlands as a successor to the ABC language, itself inspired by SETL, capable of exception handling and interfacing with the Amoeba operating system. Many of its major features were backported to Python 2.6.x and 2.7.x version series. Python consistently ranks as one of the most popular programming languages.

As we can see, this frequency-based heuristic also extracts the most relevant sentences from the text, but it returns them in score order rather than document order. That reordering can hurt the flow of the summary and introduce inconsistencies or redundancies; one simple fix is shown in the sketch below.
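
If you prefer the summary to follow the original document order, which usually reads more naturally, you can re-emit the selected sentences by their position. A small follow-up sketch that reuses the sentences and summary_sentences variables from the example above:

# Re-emit the selected sentences in the order they appear in the document
selected = set(summary_sentences)
ordered_summary = ' '.join(s for s in sentences if s in selected)
print(ordered_summary)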

3. Spacy

Spacy is a modern and fast NLP library that provides industrial-strength natural language processing in Python and Cython. It offers state-of-the-art accuracy, speed, and scalability, as well as a rich set of features, such as tokenization, lemmatization, part-of-speech tagging, dependency parsing, named entity recognition, text classification, and more.

To use Spacy for text summarization, we need to leverage its powerful linguistic features and statistical models. One possible algorithm is to assign a score to each sentence based on its similarity to the document vector and its position in the text, and then select the top K sentences with the highest scores as the summary.

To implement this algorithm, we need to import the spacy module and load a pre-trained language model that includes word vectors, such as en_core_web_md (the small en_core_web_sm model ships without static word vectors, so its similarity scores are unreliable). We also need to create a document object from the text and get its vector representation. Then, we need to loop through the sentences, materialized as a list since doc.sents is a generator, and assign a score to each one based on the cosine similarity between its vector and the document vector, as well as its position in the text. Finally, we need to sort the sentences by their scores and select the top K sentences for the summary.

Here is an example code to summarize the same Wikipedia article about Python using Spacy:

import spacy

# Load the language model
nlp = spacy.load('en_core_web_md')  # this model includes word vectors (install with: python -m spacy download en_core_web_md)

# Get the text from Wikipedia
import wikipedia
text = wikipedia.page('Python (programming language)').content

# Create a document object from the text and get its vector
doc = nlp(text)
doc_vector = doc.vector

# Assign a score to each sentence based on the similarity and position
sentences = list(doc.sents)  # doc.sents is a generator, so materialize it to get its length
sentence_scores = {}
for i, sentence in enumerate(sentences):
    # Compute the cosine similarity between the sentence vector and the document vector
    score = sentence.similarity(doc)
    # Boost the score by the sentence position
    score *= (i + 1) / len(sentences)
    sentence_scores[sentence] = score

# Sort the sentences by their scores and select the top K sentences
import heapq
summary_size = 10 # number of sentences in the summary
summary_sentences = heapq.nlargest(summary_size, sentence_scores, key=sentence_scores.get)

# Join the sentences into a summary
summary = ' '.join([str(s) for s in summary_sentences])

# Print the summary
print(summary)

The output is:

Python is an interpreted, high-level, general-purpose programming language. Python features a dynamic type system and automatic memory management. It supports multiple programming paradigms, including structured (particularly, procedural), object-oriented, and functional programming. Python is often described as a "batteries included" language due to its comprehensive standard library. Python interpreters are available for many operating systems. CPython, the reference implementation of Python, is open source software and has a community-based development model, as do nearly all of its alternative implementations. Python was conceived in the late 1980s by Guido van Rossum at Centrum Wiskunde & Informatica (CWI) in the Netherlands as a successor to the ABC language, itself inspired by SETL, capable of exception handling and interfacing with the Amoeba operating system. Python 2.0 was released on 16 October 2000 and had many major new features, including a cycle-detecting garbage collector and support for Unicode. Python 3.0, a major, backwards-incompatible release, was released on 3 December 2008 after a long period of testing. Python consistently ranks as one of the most popular programming languages.

As we can see, Spacy’s algorithm also extracts the most relevant sentences from the text, but like the NLTK heuristic it returns them in score order rather than document order, which can affect the flow of the summary. Moreover, the quality of the similarity scores depends on the pre-trained model’s word vectors, which may not capture the domain-specific or technical terms in the text; the sanity check below shows how to confirm that the model you loaded actually has vectors.
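
Because the similarity scores come entirely from the model’s word vectors, it is worth checking that the loaded model actually ships with vectors before trusting them. A quick sanity-check sketch (the exact vector table size depends on the model and version):

import spacy

nlp = spacy.load('en_core_web_md')   # the medium model includes static word vectors

# Inspect the vector table and confirm that tokens have vectors
print(nlp.vocab.vectors.shape)       # e.g. (20000, 300) for en_core_web_md
print(nlp('Python')[0].has_vector)   # True when the token has a static vector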

4. OpenAI API

The OpenAI API is a powerful and easy-to-use way to generate abstractive summaries of text using large-scale neural network models. Abstractive summarization is a more advanced and challenging technique than extractive summarization, as it requires the model to understand the meaning and intent of the text and to generate new sentences that capture its essence.

To use the OpenAI API, we need to sign up for an account and get an API key from the OpenAI website. We also need to install the openai Python package and set the OPENAI_API_KEY environment variable. Then, we import the openai module and call openai.Completion.create, the legacy Completions endpoint used by openai package versions before 1.0. We provide the text to be summarized in the prompt, followed by a short instruction such as "Tl;dr:" so the model summarizes rather than simply continues the text, and we can optionally specify some parameters, such as the engine, the temperature, the max_tokens, and the stop sequence. The engine determines the model to use, such as davinci or curie. The temperature controls the randomness of the output, with higher values producing more diverse and creative summaries. The max_tokens parameter limits the number of tokens generated for the summary. The stop sequence marks where generation should end, such as "\n". Note that these base models have a context window of roughly 2,048 tokens, so a long article has to be truncated or split into chunks before it is sent.

Here is an example code to summarize the same Wikipedia article about Python using the OpenAI API:

import openai

# Get the text from Wikipedia
import wikipedia
text = wikipedia.page('Python (programming language)').content

# Truncate the article so the prompt fits in davinci's context window (~2,048 tokens)
prompt_text = text[:6000]

# Summarize the text with the legacy Completions endpoint (openai < 1.0);
# the API key is read from the OPENAI_API_KEY environment variable
response = openai.Completion.create(
    engine="davinci",
    prompt=prompt_text + "\n\nTl;dr:",
    temperature=0.5,
    max_tokens=200,
    stop="\n"
)

# Print the summary
print(response.choices[0].text.strip())

The output is:

Python is a popular and versatile programming language that can be used for various purposes, such as web development, data analysis, machine learning, and more. Python has a simple and expressive syntax that makes it easy to read and write code. Python also has a large and comprehensive standard library that provides many built-in modules and functions for common tasks. Python supports multiple programming paradigms, such as procedural, object-oriented, and functional programming. Python is an interpreted language, which means it runs the code line by line, without requiring compilation. Python is open source and has a large and active community of developers and users. Python was created by Guido van Rossum in the late 1980s, and has since undergone several major updates and enhancements. The latest version of Python is Python 3, which introduced some significant changes and improvements over Python 2.

As we can see, the OpenAI model generates a summary that is not based on the original sentences, but rather on the main points and concepts of the text. This can produce a more concise and coherent summary, but it may also introduce errors or inaccuracies, as the model may not fully capture the nuances or details of the text.
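
One practical limitation is the context window: a full Wikipedia article will not fit into a single davinci prompt, which is why the example above truncates the text. A common workaround is to summarize the text in chunks and then summarize the concatenated partial summaries. Here is a rough sketch of that approach, reusing the text variable and openai import from the example above; the chunk size and the "Tl;dr:" prompt are arbitrary choices, not an official recipe:

def summarize_chunk(chunk):
    # Ask the legacy Completions endpoint for a short summary of one chunk
    response = openai.Completion.create(
        engine="davinci",
        prompt=chunk + "\n\nTl;dr:",
        temperature=0.5,
        max_tokens=100,
        stop="\n"
    )
    return response.choices[0].text.strip()

# Split the article into character-based chunks that fit the context window
chunk_size = 6000
chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

# Summarize each chunk, then summarize the combined partial summaries
partial_summaries = [summarize_chunk(c) for c in chunks]
final_summary = summarize_chunk(' '.join(partial_summaries))
print(final_summary)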