Masters Degrees (Statistics and Actuarial Science)
Browsing Masters Degrees (Statistics and Actuarial Science) by browse.metadata.advisor "Lamont, M. M. C."
Now showing 1 - 4 of 4
- Classification in high dimensional data using sparse techniques (Stellenbosch : Stellenbosch University, 2019-04) Stulumani, Agrippa; Lamont, M. M. C.; Stellenbosch University. Faculty of Economic and Management Sciences. Dept. of Statistics and Actuarial Science. ENGLISH SUMMARY: Traditional classification techniques fail in the analysis of high-dimensional data. In response, new classification techniques and accompanying theory have recently emerged. These techniques are natural extensions of linear discriminant analysis. The aim is to solve the statistical challenges that arise with high-dimensional data by utilising sparsity (Johnstone and Titterington, 2009). In this project, our focus is on the following techniques: penalized LDA-L1, penalized LDA-FL, sparse discriminant analysis, sparse mixture discriminant analysis and sparse partial least squares. We evaluated the performance of these techniques in simulation studies and on two microarray gene expression datasets by comparing the test error rates and the number of features selected. In the simulation studies, we found that performance varies depending on the simulation set-up and on the classification technique used. The two microarray gene expression datasets are considered for practical implementation of these techniques. The results from the microarray gene expression datasets showed that these classification techniques achieve satisfactory accuracy.
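The sparsity idea these methods share can be illustrated with a small sketch. The code below implements nearest shrunken centroids (Tibshirani et al.), a close relative of the penalized LDA techniques above rather than any of the thesis's exact methods: each class centroid's deviation from the overall mean is soft-thresholded, so that features shrunk to zero in every class no longer influence the class assignment. All function names and the threshold parameter are illustrative.

```python
import numpy as np

def shrunken_centroid_classifier(X, y, delta):
    """Fit per-class centroids and soft-threshold their deviation from
    the overall mean; delta controls how many features are zeroed out."""
    classes = np.unique(y)
    overall = X.mean(axis=0)
    centroids = {}
    for k in classes:
        dev = X[y == k].mean(axis=0) - overall
        # soft-thresholding: this is the step that induces sparsity
        dev = np.sign(dev) * np.maximum(np.abs(dev) - delta, 0.0)
        centroids[k] = overall + dev
    return centroids

def predict(centroids, X):
    """Assign each row of X to the class with the nearest (shrunken) centroid."""
    classes = list(centroids)
    d = np.stack([np.linalg.norm(X - centroids[k], axis=1) for k in classes])
    return np.array([classes[i] for i in d.argmin(axis=0)])
```

With a larger delta, more features are shrunk to zero and the classifier uses fewer variables, which is exactly the behaviour the high-dimensional techniques above exploit.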
- Exploratory and inferential multivariate statistical techniques for multidimensional count and binary data with applications in R (Stellenbosch : Stellenbosch University, 2011-12) Ntushelo, Nombasa Sheroline; Lamont, M. M. C.; Stellenbosch University. Faculty of Economic and Management Sciences. Dept. of Statistics and Actuarial Science. ENGLISH ABSTRACT: The analysis of multidimensional (multivariate) data sets is a very important area of research in applied statistics. Over the decades many techniques have been developed to deal with such data sets. The multivariate techniques that have been developed include inferential analysis, regression analysis, discriminant analysis, cluster analysis and many more exploratory methods. Most of these methods deal with cases where the data contain numerical variables. However, there are powerful methods in the literature that also deal with multidimensional binary and count data. The primary purpose of this thesis is to discuss the exploratory and inferential techniques that can be used for binary and count data. In Chapter 2 of this thesis we give the details of correspondence analysis and canonical correspondence analysis. These methods are used to analyse the data in contingency tables. Chapter 3 is devoted to cluster analysis. In this chapter we explain four well-known clustering methods and we also discuss the distance (dissimilarity) measures available in the literature for binary and count data. Chapter 4 contains an explanation of metric and non-metric multidimensional scaling. These methods can be used to represent binary or count data in a lower-dimensional Euclidean space. In Chapter 5 we give a method for inferential analysis called the analysis of distance. This method uses reasoning similar to the analysis of variance, but the inference is based on a pseudo F-statistic, with the p-value obtained using permutations of the data.
Chapter 6 contains real-world applications of the above methods on two data sets, the Biolog data and the Barents Fish data. The secondary purpose of the thesis is to demonstrate how the above techniques can be performed in the software package R. Several R packages and functions are discussed throughout this thesis. The usage of these functions is also demonstrated with appropriate examples. Attention is also given to the interpretation of the output and graphics. The thesis ends with some general conclusions and ideas for further research.
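The analysis of distance mentioned above lends itself to a compact sketch. The following Python/NumPy code (an illustration, not the thesis's R implementation; function names are ours) recovers total and within-group sums of squares from a matrix of squared pairwise distances and computes a permutation p-value for the pseudo F-statistic:

```python
import numpy as np

def pseudo_F(D2, labels):
    """Pseudo F-statistic from a squared-distance matrix D2: total and
    within-group sums of squares are recovered from pairwise distances
    alone, in the spirit of the analysis-of-variance decomposition."""
    n = len(labels)
    groups = np.unique(labels)
    ss_total = D2[np.triu_indices(n, 1)].sum() / n
    ss_within = 0.0
    for g in groups:
        idx = np.where(labels == g)[0]
        m = len(idx)
        ss_within += D2[np.ix_(idx, idx)][np.triu_indices(m, 1)].sum() / m
    a = len(groups)
    return ((ss_total - ss_within) / (a - 1)) / (ss_within / (n - a))

def permutation_p_value(D2, labels, n_perm=999, seed=1):
    """P-value as the proportion of label permutations whose pseudo-F
    is at least as large as the observed statistic."""
    rng = np.random.default_rng(seed)
    f_obs = pseudo_F(D2, labels)
    count = sum(pseudo_F(D2, rng.permutation(labels)) >= f_obs
                for _ in range(n_perm))
    return (count + 1) / (n_perm + 1)  # observed statistic included
```

Because only a distance matrix is needed, the same code applies to binary or count data once a suitable dissimilarity measure has been chosen.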
- Interpreting decision boundaries of deep neural networks (Stellenbosch : Stellenbosch University, 2019-12) Wessels, Zander; Lamont, M. M. C.; Reid, Stuart; Stellenbosch University. Faculty of Economic and Management Sciences. Dept. of Statistics and Actuarial Science. ENGLISH ABSTRACT: As deep learning methods become the front runners among machine learning techniques, the importance of interpreting and understanding these methods grows. Deep neural networks are known for their highly competitive prediction accuracies, but also, infamously, for their “black box” decision-making process. Tree-based models, at the other end of the spectrum, are highly interpretable, but lack predictive power on certain complex datasets. The proposed solution of this thesis is to combine these two methods, obtaining the predictive accuracy of the complex learner together with the explainability of the interpretable learner. The suggested method continues the work of the Google Brain Team in their paper Distilling a Neural Network Into a Soft Decision Tree (Frosst and Hinton, 2017). Frosst and Hinton (2017) argue that it is difficult to understand how a neural network comes to a particular decision because the learner relies on distributed hierarchical representations. If the knowledge gained by the deep learner were transferred to a model based on hierarchical decisions instead, interpretation would be much easier. Their proposed solution is to use a “deep neural network to train a soft decision tree that mimics the input-output function discovered by the neural network”. This thesis expands on that idea by using generative models (Goodfellow et al., 2016), in particular variational autoencoders (VAEs), to generate additional data from the training data distribution. This synthetic data can then be labelled by the complex learner we wish to approximate.
By artificially growing our training set, we can overcome the statistical inefficiencies of decision trees and improve model accuracy.
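The augment-and-relabel step described above can be sketched compactly. In the snippet below (Python/NumPy, illustrative only), Gaussian jitter around the observed data stands in for sampling from a fitted VAE, and the “teacher” is any complex classifier exposed as a prediction function; all names are our own:

```python
import numpy as np

def augment_and_relabel(X, teacher_predict, n_new, noise=0.1, seed=0):
    """Grow the training set with synthetic points labelled by the
    complex 'teacher' model. A fitted VAE would sample new points from
    the learned data distribution; here Gaussian jitter around existing
    observations stands in for the generator."""
    rng = np.random.default_rng(seed)
    base = X[rng.integers(0, len(X), n_new)]     # resample observed rows
    X_new = base + rng.normal(scale=noise, size=base.shape)
    y_new = teacher_predict(X_new)               # teacher supplies labels
    return X_new, y_new
```

An interpretable student (such as a soft decision tree) is then fitted to the union of the original and synthetic labelled data, approximating the teacher's input-output function.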
- Nearest hypersphere classification : a comparison with other classification techniques (Stellenbosch : Stellenbosch University, 2014-12) Van der Westhuizen, Cornelius Stephanus; Lamont, M. M. C.; Stellenbosch University. Faculty of Economic and Management Sciences. Dept. of Statistics and Actuarial Science. ENGLISH ABSTRACT: Classification is a widely used statistical procedure to classify objects into two or more classes according to some rule based on the input variables. Examples of such techniques are Linear and Quadratic Discriminant Analysis (LDA and QDA). However, classification with these methods becomes complicated when the number of input variables grows too large relative to the number of observations (n ≪ p), when the assumption of normality is no longer met, or when classes are not linearly separable. Vapnik et al. (1995) introduced the Support Vector Machine (SVM), a kernel-based technique which can perform classification in cases where LDA and QDA are not valid. SVM makes use of an optimal separating hyperplane and a kernel function to derive a classification rule. Another kernel-based technique was proposed by Tax and Duin (1999), where a hypersphere is used for domain description of a single class. The idea of a hypersphere for a single class extends easily to classification with multiple classes: objects are simply classified to the nearest hypersphere. Although the theory of hyperspheres is well developed, little research has gone into using hyperspheres for classification and into how their performance compares with other classification techniques. In this thesis we give an overview of Nearest Hypersphere Classification (NHC) and provide further insight into the performance of NHC compared to other classification techniques (LDA, QDA and SVM) under different simulation configurations.
We begin with a literature study, where the theory of the classification techniques LDA, QDA, SVM and NHC will be dealt with. In the discussion of each technique, applications in the statistical software R will also be provided. An extensive simulation study is carried out to compare the performance of LDA, QDA, SVM and NHC for the two-class case. Various data scenarios will be considered in the simulation study. This will give further insight in terms of which classification technique performs better under the different data scenarios. Finally, the thesis ends with the comparison of these techniques on real-world data.
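A minimal sketch of the NHC idea, assuming each class is summarised by its centroid and the radius enclosing its training points (Tax and Duin's support vector data description fits a tighter sphere via quadratic programming; function names here are illustrative):

```python
import numpy as np

def fit_hyperspheres(X, y):
    """One hypersphere per class: the class centroid, with the radius
    that encloses all training points of that class. (A crude stand-in
    for the SVDD sphere, kept simple so the sketch is self-contained.)"""
    spheres = {}
    for k in np.unique(y):
        Xk = X[y == k]
        c = Xk.mean(axis=0)
        r = np.linalg.norm(Xk - c, axis=1).max()
        spheres[k] = (c, r)
    return spheres

def nhc_predict(spheres, X):
    """Assign each point to the 'nearest' hypersphere, measured as the
    signed distance to the sphere's surface, ||x - c|| - r."""
    classes = list(spheres)
    d = np.stack([np.linalg.norm(X - c, axis=1) - r
                  for c, r in spheres.values()])
    return np.array([classes[i] for i in d.argmin(axis=0)])
```

Using the signed distance to the surface rather than the distance to the centre lets a large, diffuse class claim points that lie closer to a small, tight class's centre, which is what distinguishes NHC from nearest-centroid classification.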