Multi-Document Summarization and Semantic Relatedness

Licentiate Thesis

Olof Mogren

Extractive Multi-Document Summarization

Abstract: Automatic summarization is the process of presenting the contents of written documents in a short, comprehensive fashion. Many approaches have been proposed for this problem: some extract content from the input documents (extractive methods), while others generate the language in the summary from some representation of the document contents (abstractive methods).

This thesis is concerned with extractive summarization in the multi-document setting, and we define the problem as choosing the most informative sentences from the input documents while minimizing the redundancy in the summary. This definition calls for a way of measuring the similarity between sentences that captures as much of their meaning as possible. We present novel ways of measuring the similarity between sentences, based on neural word embeddings and sentiment analysis. We also show that combining multiple sentence similarity scores by multiplicative aggregation yields better extractive summaries.
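As a rough illustration of multiplicative aggregation, the sketch below multiplies the pairwise scores from several similarity measures element by element. The two measures used here (bag-of-words cosine and a length ratio) are hypothetical stand-ins chosen to keep the example self-contained; they are not the embedding-based or sentiment-based measures presented in the thesis, and the function names are assumptions.

```python
import math
from collections import Counter


def cosine_bow(a, b):
    """Cosine similarity between the bag-of-words counts of two sentences."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def length_similarity(a, b):
    """Hypothetical second measure: shorter length divided by longer length."""
    la, lb = len(a.split()), len(b.split())
    return min(la, lb) / max(la, lb) if max(la, lb) else 0.0


def aggregate(measures, sentences):
    """Build one similarity matrix as the element-wise product of all measures."""
    n = len(sentences)
    return [
        [math.prod(m(sentences[i], sentences[j]) for m in measures) for j in range(n)]
        for i in range(n)
    ]


sentences = ["the cat sat", "the cat sat", "dogs bark loudly at night"]
combined = aggregate([cosine_bow, length_similarity], sentences)
```

With a product, a pair of sentences scores high only when every measure agrees that they are similar; a single measure returning zero forces the combined score to zero.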

We also discuss the use of information extraction for improving the quality of automatic summarization by providing ways of assessing the salience of information elements, as well as helping with the fluency of the output and providing the temporal dimension. Furthermore, we present graph-based algorithms for clustering words by co-occurrence, and for summarizing short online user reviews by computing bicliques. The biclique approach provides a fast, simple algorithm for summarization in many e-commerce settings.

The thesis comprises five papers, listed below. Click on the links for more information and to download the full texts and related material.

Supervisors: Peter Damaschke and Devdatt Dubhashi
Defense: The thesis was successfully defended in public on November 20th, 2015 at 10:00 in room ML2, Hörsalsvägen 7B, Chalmers University of Technology.
Discussion leader: Tapani Raiko from Aalto University
Fulltext: Download the whole thesis in PDF format.

Paper I: Extractive Summarization using Continuous Vector Space Models

Comments: This is a workshop paper showing preliminary results on multi-document summarization with continuous vector space models for sentence representation. The experiments were performed on opinionated online user reviews.
My contributions: I implemented the submodular optimization algorithm for sentence selection and created the setup for the experimental evaluation.
2nd Workshop on Continuous Vector Space Models and their Compositionality (CVSC 2014), Gothenburg, Sweden
Mikael Kågebäck, Olof Mogren, Nina Tahmasebi, Devdatt Dubhashi
More info, PDF fulltext.

Paper II: Extractive Summarization by Aggregating Multiple Similarities

Comments: Many existing methods for extracting summaries rely on comparing the similarity of two sentences in some way. In this paper, we present new ways of measuring this similarity, based on sentiment analysis and continuous vector space representations, and show that combining these with similarity measures from existing methods helps to create better summaries. The finding is demonstrated with MULTSUM, a novel summarization method that uses ideas from kernel methods to combine sentence similarity measures. Submodular optimization is then used to produce summaries that take several different similarity measures into account. Our method improves over the state of the art on standard benchmark datasets, and the results are statistically significant; it is also fast and scales to large document collections.
My contributions: I am the main author of this work. I designed the study, performed the experiments, and wrote the manuscript.
RANLP 2015, Hissar, Bulgaria, September 6th-11th
Olof Mogren, Mikael Kågebäck, Devdatt Dubhashi
More info, PDF fulltext.
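
To illustrate how submodular optimization can pick sentences from a similarity matrix, here is a minimal greedy sketch. It assumes a facility-location objective (each sentence is credited with its best similarity to any selected sentence) and a budget of k sentences; the objective, the budget form, and the function name are assumptions made for illustration and are not taken from the MULTSUM paper.

```python
def greedy_select(sim, k):
    """Greedily pick up to k sentence indices that maximize
    sum_i max_{j in selected} sim[i][j] (a facility-location objective)."""
    n = len(sim)
    selected = []
    best_cover = [0.0] * n  # current best similarity of each sentence to the summary
    for _ in range(min(k, n)):
        best_gain, best_j = 0.0, None
        for j in range(n):
            if j in selected:
                continue
            # Marginal gain: how much adding j improves the coverage of every sentence.
            gain = sum(max(sim[i][j] - best_cover[i], 0.0) for i in range(n))
            if gain > best_gain:
                best_gain, best_j = gain, j
        if best_j is None:  # no remaining sentence adds anything
            break
        selected.append(best_j)
        best_cover = [max(best_cover[i], sim[i][best_j]) for i in range(n)]
    return selected


# Sentences 0 and 1 are near-duplicates; sentence 2 covers different content.
sim = [
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
summary = greedy_select(sim, 2)
```

Because the marginal gain of a near-duplicate sentence shrinks once its twin is in the summary, the greedy step naturally favors sentences that cover new content.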

Paper III: Visions and Open Challenges for a Knowledge-Based Culturomics

Comments: This is a white paper outlining some ideas and challenges within the field of culturomics.
My contributions: I wrote section 5, titled "Temporal Semantic Summarization", where I shared my views on possible research directions in generic multi-document summarization.
International Journal on Digital Libraries, February 2015
Nina Tahmasebi, Lars Borin, Gabriele Capannini, Devdatt Dubhashi, Peter Exner, Markus Forsberg, Gerhard Gossen, Fredrik D. Johansson, Richard Johansson, Mikael Kågebäck, Olof Mogren, Pierre Nugues, Thomas Risse
More info, PDF fulltext.

Paper IV: Editing Simple Graphs

Comments: Inspired by the word co-occurrence graph from Wikipedia documents, this paper presents a fixed-parameter tractable (FPT) approach to clustering the words.
My contributions: I contributed to the study and the analysis, and to the writing of the manuscript, including illustrations.
Journal of Graph Algorithms and Applications 18 (2014), 557-576
Peter Damaschke, Olof Mogren
More info, PDF fulltext.

Paper V: Summarizing Online User Reviews Using Bicliques

Comments: This paper presents an approach to summarizing online user reviews based on finding bicliques in the bipartite word-document graph.
My contributions: I contributed to the study, did a substantial part of the experimental work, and contributed to the writing of the manuscript.
SOFSEM 2016, LNCS 9587, pp. 569-579.
Azam Sheikh Muhammad, Peter Damaschke, Olof Mogren
PDF fulltext.
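
To make the biclique idea concrete: in the bipartite word-document graph, a set of words together with the set of reviews that contain all of them forms a biclique. The sketch below naively enumerates word pairs and the reviews that contain both words. It is a brute-force illustration under that assumption, not the algorithm from the paper, and the function name and the `min_docs` threshold are hypothetical.

```python
from itertools import combinations


def word_pair_bicliques(reviews, min_docs=2):
    """Return (word pair, review indices) bicliques with at least min_docs reviews."""
    docs = [set(r.lower().split()) for r in reviews]
    vocab = sorted(set().union(*docs))
    found = []
    for pair in combinations(vocab, 2):
        # Every review containing both words is adjacent to both word nodes.
        support = [i for i, d in enumerate(docs) if d.issuperset(pair)]
        if len(support) >= min_docs:
            found.append((pair, support))
    # Largest review sets first; ties keep alphabetical order of the pairs.
    found.sort(key=lambda b: -len(b[1]))
    return found


reviews = ["great battery life", "battery life is great", "poor screen"]
bicliques = word_pair_bicliques(reviews)
```

Word sets shared by many reviews point to opinions that recur across the collection, which is what makes them candidates for a summary.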

Olof Mogren, PhD, RISE Research Institutes of Sweden. Follow me on Bluesky.