
Representation learning for natural language

Relation embeddings computed using the proposed model in Paper IV.

Olof Mogren

The PhD thesis was successfully defended on March 23, 2018 at 10:00 at Chalmers University of Technology for the degree of Doctor of Philosophy.

Opponent: Prof. Dr. Hinrich Schütze, CIS Ludwig-Maximilians Universität, München

Video: Youtube

Abstract:
Artificial neural networks have obtained astonishing results in a diverse range of tasks. One of the reasons for their success is their ability to learn the whole task at once (end-to-end learning), including the representations for data. This thesis investigates representation learning for natural language through the study of a number of tasks, ranging from automatic multi-document summarization to named entity recognition and the transformation of words into morphological forms specified by analogies.

In the first two papers, we investigate whether automatic multi-document summarization can benefit from learned representations, and how learned representations are best incorporated in an extractive summarization system. We propose a novel summarization approach that represents sentences using word embeddings, and a strategy for aggregating multiple sentence similarity scores to compute summaries that take multiple aspects into account. The approach is evaluated quantitatively using the de facto evaluation system ROUGE, and obtains state-of-the-art results on standard benchmark datasets for generic multi-document summarization.

The rest of the thesis studies models trained end-to-end for some specific tasks, and investigates how to train the models to perform well, and to learn internal representations of data that explain the factors of variation in the data.

Specifically, we investigate whether character-based recurrent neural networks (RNNs) can learn the necessary representations for tasks such as named entity recognition (NER) and morphological analogies, and how the representations needed to solve these tasks are best learned. We devise a novel character-based recurrent neural network model that recognizes medical terms in health record data. The model is trained on openly available data, and evaluated using standard metrics on sensitive medical health record data in Swedish. We conclude that the model learns to solve the task and is able to generalize from the training data domain to the test domain.

We then present a novel recurrent neural model that transforms a query word into the morphological form demonstrated by another word. The model is trained and evaluated using word analogies and takes as input the raw character sequence of the words, with no explicit features needed. We conclude that character-based RNNs can successfully learn good representations internally and that the proposed model performs well on the analogy task, beating the baseline by a large margin. As the model learns to transform words, it learns internal representations that disentangle morphological relations using only cues from the training objective, which is to perform well on the word transformation task.

In other settings, such cues may not be available at training time, and we therefore present a regularizer that improves disentanglement in the learned representations by penalizing the correlation between activations in a layer. In the second part of the thesis, we have proposed models and associated training strategies that solve the tasks and simultaneously learn informative internal representations; in Paper V this is enforced by an explicit regularization signal, suitable for when such a signal is missing from the training data (e.g. in the case of autoencoders).

Popular science summary:
The advances in artificial intelligence have been astonishing in recent years, with new algorithms showing super-human performance on a wide range of tasks. An important reason for this development is the availability of large datasets and powerful computers, making it possible to train larger machine learning models with higher learning capacity. Artificial neural networks (ANNs) are machine learning models that have been of paramount importance to this development. ANNs are composed of layers of artificial neurons, each of which can only perform a simple computation, but when stacked together in deep architectures, they can be trained to approximate complicated non-linear functions. These models have achieved fantastic results in tasks on various data modalities such as audio, vision, and text. One reason for the success is the internal vector representations computed by the layers, each transforming its input into numerical feature vectors which are increasingly useful for the end task. A complete model is often trained at once (end-to-end learning), and the representations are optimized during training to solve the given task.

This thesis studies the representations computed using artificial neural networks that are trained on and applied to natural language. In Papers I and II, we apply learned representations for words to improve the performance of multi-document summarization. In Paper III, we study the use of deep neural sequence models working on the raw character stream as input, and how this class of models can be used to detect medical terms in text (such as drugs, symptoms, and body parts). The system is evaluated on medical health records in Swedish. In Paper IV, we propose a novel deep neural sequence model trained to transform words into inflected forms as demonstrated by analogies: "see" is to "sees" as "eat" is to what? The model outperforms previous rule-based approaches by a massive margin, and when inspecting the internal representations computed by this model, one can see that it learns to distinguish classes of transformations of word forms, without being explicitly told to do so. This is an effect of training the model to transform words while provided with the analogous word forms. In other cases, however, the training objective may not provide such cues for the learning algorithm. In Paper V, we study how to improve the way learned representations disentangle the underlying factors of variation in the data. This can be useful for unsupervised representation learning, such as using autoencoders for task-agnostic representations or when the final use case is unknown.

Supervisor: Richard Johansson
Co-supervisor: Devdatt Dubhashi
Fulltext: Download the whole thesis in PDF format. (2018-03-01)

Defense presentation slides: Download the defense presentation slides in PDF format. (2018-03-23)

Glossary for people without background in computer science or mathematics: English, Swedish.

Paper I: Extractive summarization using continuous vector space models

Comments: This is a workshop paper showing preliminary results on multi-document summarization with continuous vector space models for sentence representation. The experiments were performed on opinionated online user reviews.
My contributions: I implemented the submodular optimization algorithm for sentence selection and created the setup for the experimental evaluation.
2nd Workshop on Continuous Vector Space Models and their Compositionality (CVSC 2014), Gothenburg, Sweden
Mikael Kågebäck, Olof Mogren, Nina Tahmasebi, Devdatt Dubhashi
More info, PDF fulltext.

Paper II: Extractive summarization by aggregating multiple similarities

Comments: Many existing methods for extracting summaries rely on comparing the similarity of two sentences in some way. In this paper, we present new ways of measuring this similarity, based on sentiment analysis and continuous vector space representations, and show that combining these with similarity measures from existing methods helps to create better summaries. The finding is demonstrated with MULTSUM, a novel summarization method that uses ideas from kernel methods to combine sentence similarity measures. Submodular optimization is then used to produce summaries that take several different similarity measures into account. Our method improves over the state of the art on standard benchmark datasets, and the results are statistically significant; it is also fast and scales to large document collections.
My contributions: I am the main author of this work. I designed the study, performed the experiments, and wrote the manuscript.
RANLP 2015, Hissar, Bulgaria, September 6th-11th
Olof Mogren, Mikael Kågebäck, Devdatt Dubhashi
More info, PDF fulltext.
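
The aggregation idea described for Paper II can be illustrated with a small sketch. This is not the MULTSUM implementation: the element-wise product used to combine similarity matrices, the facility-location coverage objective, and all function names (`cosine`, `aggregate`, `coverage`, `greedy_select`) are assumptions chosen for illustration. The sketch only shows how several sentence similarity measures might be merged into one matrix and then used for greedy selection under a monotone submodular objective.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors, clipped to be non-negative."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return max(0.0, dot / (norm_u * norm_v))


def aggregate(sim_matrices):
    """Combine several n-by-n similarity matrices by element-wise product
    (an assumed aggregation rule, used here only for illustration)."""
    n = len(sim_matrices[0])
    combined = [[1.0] * n for _ in range(n)]
    for matrix in sim_matrices:
        for i in range(n):
            for j in range(n):
                combined[i][j] *= matrix[i][j]
    return combined


def coverage(selected, sim):
    """Facility-location objective: every sentence is credited with its
    similarity to the most similar selected sentence."""
    return sum(
        max((sim[i][j] for j in selected), default=0.0)
        for i in range(len(sim))
    )


def greedy_select(sim, budget):
    """Greedily pick up to `budget` sentence indices, each time adding the
    sentence that yields the highest coverage."""
    selected = []
    while len(selected) < budget:
        candidates = [j for j in range(len(sim)) if j not in selected]
        if not candidates:
            break
        best = max(candidates, key=lambda j: coverage(selected + [j], sim))
        selected.append(best)
    return selected
```

With non-negative similarities, the coverage function is monotone and submodular, which is why a simple greedy loop is a reasonable choice in this sketch. A similarity matrix built with `cosine` over sentence vectors could be passed to `aggregate` together with matrices from other measures before calling `greedy_select`.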

Paper III: Named entity recognition in Swedish health records with character-based deep bidirectional LSTMs

Comments: In this paper, we train a deep character-based recurrent neural network to recognize medical entities in Swedish patient health records.
My contributions: I designed the study, supervised the experimental work, and wrote the manuscript.
Fifth workshop on building and evaluating resources for biomedical text mining (BioTxtM 2016) at COLING 2016
Simon Almgren, Sean Pavlov, Olof Mogren
More info, PDF fulltext.

Paper IV: Character-based recurrent neural networks for morphological relational reasoning

Comments: In this paper, we propose a novel character-based neural architecture to perform morphological relational reasoning: a syntactic analogy task similar to the tasks used to evaluate the performance of word embedding models, but where these closed-vocabulary solutions fail, largely because of their limited vocabulary. Within the model, the internally computed representations were evaluated and visualized, showing that the model can learn to compute representations that capture and disentangle the necessary underlying factors of variation: the class of the morphological relation that was demonstrated in the analogy.
My contributions: I designed the study, performed the experiments, and wrote the majority of the manuscript.
Submitted draft. Early version published at the EMNLP Workshop on Subword and Character-level Models in NLP
Olof Mogren, Richard Johansson
PDF fulltext.
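
To make the analogy task of Paper IV concrete, the following sketch shows a naive suffix-substitution rule. It is neither the proposed neural model nor necessarily one of the baselines it was compared against; the function names `common_prefix_len` and `suffix_analogy` are hypothetical. The rule copies the suffix change seen in the demonstration pair onto the query word, and gives up when the query does not share the demonstrated suffix.

```python
def common_prefix_len(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def suffix_analogy(demo_source, demo_target, query):
    """Apply the suffix change demo_source -> demo_target to query.

    Returns None when the query does not end with the suffix that was
    replaced in the demonstration pair.
    """
    k = common_prefix_len(demo_source, demo_target)
    old_suffix = demo_source[k:]
    new_suffix = demo_target[k:]
    if not query.endswith(old_suffix):
        return None
    stem = query[: len(query) - len(old_suffix)]
    return stem + new_suffix
```

For the pair "see" and "sees" with the query "eat", the rule produces "eats". For an irregular demonstration such as "see" and "saw", it returns None for "eat", which hints at why a model that learns from raw character sequences is attractive for this task.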

Paper V: Disentangled activations in deep networks

Comments: In this paper, we propose a regularization technique that penalizes correlation between activations in a layer. This helps a model learn more interpretable and disentangled activations. Also, it helps with the design of deep learning models by estimating the required dimensionality, while being computationally inexpensive and simple to implement.
My contributions: I contributed to the design of the study, performed parts of the experiments, and wrote parts of the manuscript.
Submitted draft. Early version presented at the NIPS Workshop on Learning Disentangled Features
Mikael Kågebäck, Olof Mogren
PDF fulltext.
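
The idea of penalizing correlation between activations in a layer, as described for Paper V, can be sketched as follows. The exact form of the regularizer in the paper is not given on this page, so this sketch assumes one plausible variant: the mean absolute Pearson correlation over all pairs of distinct units, computed on a batch of activations. The function name `correlation_penalty` is hypothetical, and in practice such a term would presumably be added to the training loss with a weighting factor.

```python
import math


def correlation_penalty(activations):
    """Mean absolute Pearson correlation between distinct units.

    `activations` is a batch given as a list of rows, one row per example,
    each row holding the activations of the d units in a layer.
    Units with zero variance contribute nothing to the penalty.
    """
    n = len(activations)
    d = len(activations[0])
    if d < 2:
        return 0.0

    # Center each unit's activations over the batch.
    means = [sum(row[k] for row in activations) / n for k in range(d)]
    centered = [[row[k] - means[k] for k in range(d)] for row in activations]

    # Covariance matrix over the batch.
    cov = [
        [sum(r[i] * r[j] for r in centered) / n for j in range(d)]
        for i in range(d)
    ]

    total = 0.0
    for i in range(d):
        for j in range(d):
            if i == j:
                continue
            denom = math.sqrt(cov[i][i] * cov[j][j])
            if denom > 0.0:
                total += abs(cov[i][j] / denom)
    return total / (d * (d - 1))
```

The penalty is close to 1 when every unit is a linear function of every other unit, and 0 when all pairs of units are uncorrelated on the batch, so minimizing it pushes the layer towards decorrelated activations.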

Olof Mogren, PhD, RISE Research Institutes of Sweden. Follow me on Bluesky.