Department of Computer Science
- Item: Hierarchical text classification with transformer-based language models (Stellenbosch : Stellenbosch University, 2024-03) Du Toit, Jaco; Dunaiski, Marcel; Stellenbosch University. Faculty of Science. Dept. of Computer Science.

ENGLISH ABSTRACT: Hierarchical text classification (HTC) is a natural language processing (NLP) task with the objective of classifying text documents into a set of classes from a structured class hierarchy. For example, news articles can be classified into a hierarchical class set comprising broad categories such as “Politics” and “Sport” at higher levels, with associated finer-grained categories such as “Europe” and “Cycling” at lower levels. In recent years, many NLP approaches have been significantly improved through the use of transformer-based pre-trained language models (PLMs). PLMs are typically trained on large amounts of textual data through self-supervised tasks such that they acquire language understanding capabilities which can be used to solve various NLP tasks, including HTC.

In this thesis, we propose three new approaches for leveraging transformer-based PLMs to improve classification performance on HTC tasks. Our first approach formulates how hierarchy-aware prompts can be applied to discriminative language models, allowing HTC to scale to problems with very large hierarchical class structures. Our second approach uses label-wise attention mechanisms to obtain label-specific document representations which are used to fine-tune PLMs for HTC tasks. Furthermore, we propose a label-wise attention mechanism which splits the attention into the different levels of the class hierarchy and leverages the predictions of all ancestor levels during the prediction of classes at a particular level. The third approach combines features extracted from a PLM and a topic model to train a classifier which comprises convolutional layers followed by a label-wise attention mechanism.
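To illustrate the core idea behind the second approach, here is a minimal sketch of a label-wise attention mechanism: each label attends over the token embeddings produced by a PLM encoder to form its own document representation. This is written in pure Python for clarity; the dot-product scoring, variable names, and toy dimensions are illustrative assumptions, not the thesis implementation.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def label_wise_attention(tokens, label_queries):
    """For each label, compute an attention distribution over the token
    embeddings and return one label-specific document representation
    per label (a weighted average of the token embeddings)."""
    reps = []
    for q in label_queries:
        # One attention distribution per label (assumed dot-product scoring).
        weights = softmax([dot(q, t) for t in tokens])
        rep = [sum(w * t[i] for w, t in zip(weights, tokens))
               for i in range(len(tokens[0]))]
        reps.append(rep)
    return reps

# Toy example: 3 token embeddings and 2 label queries, embedding dim 2.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
labels = [[2.0, 0.0], [0.0, 2.0]]
reps = label_wise_attention(tokens, labels)
```

In a real model, the label queries would be learned jointly with the encoder, and each label-specific representation would feed a per-label classifier; in the hierarchical variant described above, separate attention mechanisms would be maintained per hierarchy level.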
We evaluate all three approaches comprehensively and show that our first two approaches achieve state-of-the-art performance on three HTC benchmark datasets. Our results show that using prompts and label-wise attention mechanisms to fine-tune PLMs is highly effective for classifying text documents into hierarchical class sets. Furthermore, we show that these techniques effectively leverage the language understanding capabilities of PLMs and incorporate the hierarchical class structure information to improve classification performance.

We also introduce three new HTC benchmark datasets comprising the titles and abstracts of research publications from the Web of Science publication database with associated categories. The first two datasets use journal-based and citation-based classification schemas respectively, while the third combines these classifications with the aim of removing documents and classes which do not have a clear overlap between the two schemas. We show that this results in a more consistent classification of the publications. Finally, we perform experiments on these three datasets with the best-performing approaches proposed in this thesis to provide a baseline for future research.