TextRank
TextRank (Mihalecea and Tarau, 2004) is the idea of using graph ranking algorithms, like PageRank, as an unsupurvised way of extracting key units of text from a document. The interesting part is how they define graphs on the units of text; filtering text units (e.g. words, sentences) to obtain a set of vertices, and using proximity or similarity measures for defining a graph. They specifically look at the applications of keyword extraction and summarisation. This is a really useful and versatile idea; and although it requires application specific tuning, is a valuable tool in the unsupervised text toolkit.
The first application is extracting keywords from 500 abstracts of papers from Inspec (scientific and technical papers). The original dataset was from Hulth’s 2003 paper Improved Automatic Keyword Extraction Given More Linguistic Knowledge, and the dataset is on the web (e.g. here, here and here). It consists of small abstracts and human annotated key phrases. Their TextRank approach is to:
- filter the tokens of the abstract to just nouns and adjectives as the vertices
- form a co-occurance graph (unweighted, undirected) based on window of size 2
- using PageRank to rank the vertices
- filtering to the top 1/3 of vertices by rank
- combining any adjacent words in the filtered vertices to get key phrases
This works quite well for this dataset, although there’s a lot of choices in here that I suspect are specific to the dataset and task. The algorithm looks for dense clusters of keywords, and frequent keywords, and this does much better than just frequent keywords alone. With all this tuning (which unfortunately sounds like it was done on the test set), they get something that works better in terms of precision and F1 than Hulth’s supervised approach.
The second application is text summarisation, based on the Document Understanding Conference 2002 task (which you have to request access for. It consists of news articles and manually generated abstracts. Their approach is to extract key sentences by forming a weighted graph of the sentences based on textual similarity between the sentences, rank the sentences using weighted PageRank, and limit the summary by length. The particular similarity they use is the number of tokens in common, divided by the sum of log lengths of the sentences (perhaps a cousin of Jaccard Distance).
\[{\rm Similarity}(S_i, S_j) = \frac{\left \vert S_i \cap S_j \right \vert}{\log(\vert S_i \vert) + \log(\vert S_j \vert)}\]
In terms of ROUGE-1 it ranks about middle of the systems submitted; which is pretty impressive given the simplicity. In the example it seems to work well because the subject is in the title line and repeated in the most relevant sentences.
These particular TextRank algorithms are available in software; in R there is textrank which covers both tasks, and in Python textacy has keyword extraction and gensim has textrank summarisation.
The real innovation here isn’t the particular approaches used, but the idea of using existing graph ranking algorithms for text extraction problems. There are a large and growing number of ways to find similarity between text units (for example sentence embeddings and contextual embeddings), other behavioural sources of graph information from interactions with documents, and a large number of graph algorithms. These could be combined in a multitude of ways to solve problems where the information to be extracted is likely to be threaded throughout a document. Being unsupervised, and able to work on any scale from characters all the way to entire books, makes it a versatile tool.