Doctoral Degrees (Statistics and Actuarial Science)
Browsing by advisor: De Wet, Tertius
- Aspects of model development using regression quantiles and elemental regressions (Stellenbosch : Stellenbosch University, 2007-03). Ranganai, Edmore; De Wet, Tertius; Van Vuuren, J.O.; Stellenbosch University, Faculty of Economic and Management Sciences, Dept. of Statistics and Actuarial Science.
  ENGLISH ABSTRACT: It is well known that ordinary least squares (OLS) procedures are sensitive to deviations from the classical Gaussian assumptions (outliers) as well as to data aberrations in the design space. The two major data aberrations in the design space are collinearity and high leverage. Leverage points can also induce or hide collinearity in the design space; such leverage points are referred to as collinearity-influential points. As a consequence, over the years many diagnostic tools to detect these anomalies, as well as alternative procedures to counter them, were developed. To counter deviations from the classical Gaussian assumptions, many robust procedures have been proposed. One such class of procedures is the Koenker and Bassett (1978) regression quantiles (RQs), which are natural extensions of order statistics to the linear model. RQs can be found as solutions to linear programming problems (LPs). The basic optimal solutions to these LPs (which are RQs) correspond to elemental subset (ES) regressions, which consist of subsets of minimum size to estimate the necessary parameters of the model. On the one hand, some ESs correspond to RQs. On the other hand, the literature shows that many OLS statistics (estimators) are related to ES regression statistics (estimators). There is therefore an inherent relationship amongst the three sets of procedures. The relationship between the ES procedure and the RQ one has been noted almost “casually” in the literature, while the relationship between the ES procedure and the OLS one has been fairly widely explored. Using these existing relationships between the ES procedure and the OLS one, as well as new ones, collinearity, leverage and outlier problems in the RQ scenario were investigated. A lasso procedure was also proposed as a variable selection technique in the RQ scenario, and some tentative but promising results were given for it. Single-case diagnostics were considered, as well as their relationships to multiple-case ones. In particular, multiple cases of the minimum size needed to estimate the necessary parameters of the model were considered, corresponding to an RQ (ES). In this way regression diagnostics were developed for both ESs and RQs. The main problems that affect RQs adversely are collinearity and leverage, due to the nature of the computational procedures and the fact that the influence functions of RQs are unbounded in the design space but bounded in the response variable. As a consequence, RQs have a high affinity for leverage points and a high exclusion rate of outliers; the influence picture exhibited in the presence of both leverage points and outliers is the net result of these two antagonistic forces. Although RQs are bounded in the response variable (and therefore fairly robust to outliers), outlier diagnostics were also considered in order to obtain a more holistic picture. The investigation comprised analytic methods as well as simulation, and applications were made to artificial computer-generated data sets as well as standard data sets from the literature. These revealed that ES-based statistics can be used to address problems arising in the RQ scenario with some degree of success. However, due to the interdependence between the different aspects, viz. that between leverage and collinearity and that between leverage and outliers, “solutions” are often dependent on the particular situation. In spite of this complexity, the research did produce some fairly general guidelines that can be fruitfully used in practice.
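As the abstract notes, regression quantiles are obtained as basic solutions of linear programs, and those basic solutions correspond to elemental subset regressions. The sketch below shows the standard Koenker-Bassett LP formulation; it is not code from the thesis, and the simulated data and names are purely illustrative.

```python
import numpy as np
from scipy.optimize import linprog

def regression_quantile(X, y, tau):
    """Koenker-Bassett regression quantile via its standard LP formulation.

    Decision variables: beta = b_plus - b_minus (both >= 0) and the
    positive/negative residual parts u, v >= 0, with y = X beta + u - v.
    Objective: minimise tau * sum(u) + (1 - tau) * sum(v).
    """
    n, p = X.shape
    c = np.concatenate([np.zeros(2 * p), tau * np.ones(n), (1 - tau) * np.ones(n)])
    A_eq = np.hstack([X, -X, np.eye(n), -np.eye(n)])
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    assert res.success
    return res.x[:p] - res.x[p:2 * p]

# Illustrative use on simulated data (not a data set from the thesis).
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(100), rng.normal(size=100)])
y = X @ np.array([2.0, 1.5]) + rng.standard_t(df=3, size=100)
print(regression_quantile(X, y, tau=0.5))   # median regression coefficients
```

Because an optimal basic solution of this LP interpolates exactly p observations, the fitted RQ coincides with one of the elemental subset regressions, which is the link between the two procedures that the thesis exploits.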
- Classifying yield spread movements in sparse data through triplots (Stellenbosch : Stellenbosch University, 2020-03). Van der Merwe, Carel Johannes; De Wet, Tertius; Inghelbrecht, Koen; Vanmaele, Michele; Conradie, W. J. (Willem Johannes); Stellenbosch University, Faculty of Economic and Management Sciences, Dept. of Statistics and Actuarial Science.
  ENGLISH SUMMARY: In many developing countries, including South Africa, the data required to calculate the fair values of financial instruments are not always readily available. Additionally, in some instances, companies that do not have the necessary quantitative skills fail to employ the appropriate techniques and therefore do not incorporate the correct fair valuation. This problem is most notable with regard to unlisted debt instruments. There are two main inputs to the valuation of unlisted debt instruments, namely the risk-free curve and the yield spread; investigation into these two components forms the basis of this thesis. Firstly, an analysis is carried out to derive approximations of risk-free curves in areas where data are sparse. Thereafter it is investigated whether there is sufficient evidence of a significant change in the yield spreads of unlisted debt instruments. In order to determine these changes, however, a new method that allows for simultaneous visualisation and classification of data was developed, termed triplot classification with polybags. This new classification technique also has the ability to limit misclassification rates. In the first paper, a proxy for the extended zero curve, calculated from other observable inputs, is found through a simulation approach incorporating two new techniques, namely permuted integer multiple linear regression and aggregate standardised model scoring. It was found that a Nelson-Siegel fit, with a mixture of one-year forward rates as proxies for the long-term zero point and some discarding of initial data points, performs relatively well in the training and testing data sets. This new method allows for the approximation of risk-free curves where no long-term points are available, and further allows the determinants of the yield curve shape to be identified from other available data. The changes in these shape-determining parameters are used in the final paper as determinants of changes in yield spreads. In the second paper, a new classification technique is developed, which is then used in the final paper. Classification techniques do not easily allow for visual interpretation, nor do they usually allow the false negative and false positive error rates to be limited. For some areas of research and practical applications these shortcomings are important to address. In this paper, classification techniques are combined with biplots, allowing for simultaneous visual representation and classification of the data, resulting in the so-called triplot. By further incorporating polybags, the ability to limit misclassification-type errors is also introduced. A simulation study as well as an application is provided, showing that the method gives results similar to existing methods, but with added visualisation benefits. The paper focuses purely on developing a statistical technique that can be applied to any field; the application provided is on a medical data set, while in the final paper the technique is applied to changes in yield spreads. The third paper considers changes in yield spreads, analysed through various covariates, to determine whether significant decreases or increases would have been observed for unlisted debt instruments. The methodology does not determine the new spread itself, but gives evidence on whether the initial implied spread could be left unchanged or whether a new spread should be determined. These yield spread movements are classified using various share, interest rate, financial ratio and economic covariates in a visually interpretable manner. This also allows for a better understanding of how various factors drive changes in yield spreads. Finally, as a supplement to each paper, a web-based application was built allowing the reader to interact with all the data and properties of the methodologies discussed. The following links can be used to access these three applications:
  - Paper 1: https://carelvdmerwe.shinyapps.io/ProxyCurve/
  - Paper 2: https://carelvdmerwe.shinyapps.io/TriplotSimulation/
  - Paper 3: https://carelvdmerwe.shinyapps.io/SpreadsTriplot/
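The proxy-curve procedure of the first paper (permuted integer multiple linear regression and aggregate standardised model scoring) is not reproduced here. The sketch below only shows the standard Nelson-Siegel zero-curve form that the summary refers to, fitted by ordinary least squares for a fixed decay parameter; the maturities, rates and decay value are hypothetical.

```python
import numpy as np

def nelson_siegel(tau, beta0, beta1, beta2, lam):
    """Standard Nelson-Siegel zero-rate curve at maturities tau (in years)."""
    x = tau / lam
    loading1 = (1 - np.exp(-x)) / x
    loading2 = loading1 - np.exp(-x)
    return beta0 + beta1 * loading1 + beta2 * loading2

def fit_nelson_siegel(maturities, rates, lam=1.5):
    """For a fixed decay parameter lam the model is linear in the betas,
    so the betas can be estimated by ordinary least squares."""
    x = maturities / lam
    loading1 = (1 - np.exp(-x)) / x
    loading2 = loading1 - np.exp(-x)
    design = np.column_stack([np.ones_like(x), loading1, loading2])
    betas, *_ = np.linalg.lstsq(design, rates, rcond=None)
    return betas

# Hypothetical short-end zero rates (not thesis data): fit the curve where
# observations exist, then read off the sparse long end from the fitted form.
maturities = np.array([0.25, 0.5, 1, 2, 3, 5, 7, 10])
rates = np.array([0.065, 0.066, 0.068, 0.071, 0.073, 0.076, 0.078, 0.079])
b0, b1, b2 = fit_nelson_siegel(maturities, rates)
print(nelson_siegel(np.array([15.0, 20.0]), b0, b1, b2, lam=1.5))
```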
- Extreme quantile inference (Stellenbosch : Stellenbosch University, 2020-03). Buitendag, Sven; De Wet, Tertius; Beirlant, Jan; Stellenbosch University, Faculty of Economic and Management Sciences, Dept. of Statistics and Actuarial Science.
  ENGLISH SUMMARY: A novel approach to performing extreme quantile inference is proposed by applying ridge regression and the saddlepoint approximation to results in extreme value theory. To this end, ridge regression is applied to the log differences of the largest sample quantiles to obtain a bias-reduced estimator of the extreme value index, a parameter in extreme value theory that plays a central role in the estimation of extreme quantiles. The utility of the ridge regression estimators for the extreme value index is illustrated by means of simulation results and applications to daily wind speeds. A new pivotal quantity is then proposed, from which a set of novel asymptotic confidence intervals for extreme quantiles is obtained. The ridge regression estimator for the extreme value index is combined with the proposed pivotal quantity and the saddlepoint approximation to yield a set of confidence intervals that are accurate and narrow. The utility of these confidence intervals is illustrated by means of simulation results and applications to Belgian reinsurance data. Multivariate generalizations of sample quantiles are considered with the aim of developing multivariate risk measures, including maximum correlation risk measures and an estimator for the extreme value index. These multivariate sample quantiles are called center-outward quantiles, and are defined as an optimal transportation of the uniformly distributed points in the unit ball S^d to the observed sample points in R^d. A continuous extension of the center-outward quantile is proposed, which yields quantile contours that are nested. Furthermore, maximum correlation risk measures for multivariate samples are presented, as well as an estimator for the extreme value index for multivariate regularly varying samples. These results are applied to Danish fire insurance data and the stock returns of Google and Apple shares to illustrate their utility.
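The ridge-regression estimator and saddlepoint-based intervals proposed in this thesis are not reproduced here. As background, the sketch below shows the classical Hill estimator, the simple average of the log differences of the largest sample quantiles that the ridge-regression approach refines, together with the associated Weissman-type extreme quantile estimator. The simulated Pareto sample is illustrative, not the wind-speed or reinsurance data.

```python
import numpy as np

def hill_estimator(sample, k):
    """Classical Hill estimator of a positive extreme value index,
    based on the k largest order statistics."""
    x = np.sort(sample)[::-1]                     # descending order statistics
    log_spacings = np.log(x[:k]) - np.log(x[k])
    return log_spacings.mean()

def weissman_quantile(sample, k, p):
    """Weissman-type estimator of the extreme (1 - p) quantile."""
    n = len(sample)
    x = np.sort(sample)[::-1]
    gamma = hill_estimator(sample, k)
    return x[k] * (k / (n * p)) ** gamma

# Illustrative heavy-tailed data with true extreme value index 0.5.
rng = np.random.default_rng(7)
sample = rng.pareto(a=2.0, size=2000) + 1.0
print(hill_estimator(sample, k=200))
print(weissman_quantile(sample, k=200, p=0.001))
```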
- Improved estimation procedures for a positive extreme value index (Stellenbosch : University of Stellenbosch, 2010-12). Berning, Thomas Louw; De Wet, Tertius; University of Stellenbosch, Faculty of Economic and Management Sciences, Dept. of Statistics and Actuarial Science.
  ENGLISH ABSTRACT: In extreme value theory (EVT) the emphasis is on extreme (very small or very large) observations. The crucial parameter when making inferences about extreme quantiles is called the extreme value index (EVI). This thesis concentrates on the right tail of the underlying distribution only (extremely large observations), and specifically on situations where the EVI is assumed to be positive. A positive EVI indicates that the underlying distribution of the data has a heavy right tail, as is the case with, for example, insurance claims data. EVT has numerous areas of application, since there is a vast number of situations in which one would be interested in predicting extreme events accurately. Accurate prediction requires accurate estimation of the EVI, which has received ample attention in the literature from both a theoretical and a practical point of view. Countless estimators of the EVI exist in the literature, but the practitioner has little information on how these estimators compare. An extensive simulation study was designed and conducted to compare the performance of a wide range of estimators over a wide range of sample sizes and distributions. A new procedure for the estimation of a positive EVI was developed, based on fitting the perturbed Pareto distribution (PPD) to observations above a threshold, using Bayesian methodology. Attention was also given to the development of a threshold selection technique. One of the major contributions of this thesis is a measure which quantifies the stability (or rather instability) of estimates across a range of thresholds. This measure can be used to objectively obtain the range of thresholds over which the estimates are most stable, and it is this measure which is used for threshold selection for the proposed PPD estimator. A case study of five insurance claims data sets illustrates how data sets can be analyzed in practice. It is shown to what extent discretion can and should be applied, as well as how different estimators can be used in a complementary fashion to give more insight into the nature of the data and the extreme tail of the underlying distribution. The analysis is carried out from the raw data through to the construction of tables which can be used directly to gauge the risk of the insurance portfolio over a given time frame.
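The thesis's stability measure and its Bayesian perturbed Pareto fit are not reproduced here. The sketch below only illustrates the general idea of quantifying how stable tail-index estimates are across a range of thresholds, using the classical Hill estimator and a rolling standard deviation as a stand-in criterion; the data, window width and threshold grid are all hypothetical choices.

```python
import numpy as np

def hill_path(sample, k_values):
    """Hill estimates over a range of numbers of top order statistics k."""
    logs = np.log(np.sort(sample)[::-1])
    return np.array([np.mean(logs[:k] - logs[k]) for k in k_values])

def most_stable_window(estimates, k_values, width=20):
    """Illustrative stability criterion: the window of consecutive k values
    over which the estimates have the smallest standard deviation
    (a stand-in for the thesis's measure, not the measure itself)."""
    sds = [estimates[i:i + width].std() for i in range(len(estimates) - width + 1)]
    start = int(np.argmin(sds))
    return k_values[start], k_values[start + width - 1]

# Simulated heavy-tailed "claims" (not one of the five thesis data sets).
rng = np.random.default_rng(3)
claims = rng.pareto(a=1.5, size=5000) + 1.0
k_values = np.arange(20, 1000)
estimates = hill_path(claims, k_values)
print(most_stable_window(estimates, k_values))   # range of thresholds to use
```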
- Some statistical aspects of LULU smoothers (Stellenbosch : University of Stellenbosch, 2007-12). Jankowitz, Maria Dorothea; Conradie, W. J.; De Wet, Tertius; University of Stellenbosch, Faculty of Economic and Management Sciences, Dept. of Statistics and Actuarial Science.
  The smoothing of time series plays a very important role in various practical applications. Estimating the signal and removing the noise is the main goal of smoothing. Traditionally linear smoothers were used, but nonlinear smoothers have become more popular over the years. From the family of nonlinear smoothers, the class of median smoothers, based on order statistics, is the most popular. A new class of nonlinear smoothers, called LULU smoothers, was developed using the minimum and maximum selectors. These smoothers have very attractive mathematical properties. In this thesis their statistical properties are investigated and compared to those of the class of median smoothers. Smoothing, together with related concepts, is discussed in general. Thereafter, the class of median smoothers from the literature is discussed. The class of LULU smoothers is defined, their properties are explained and new contributions are made. The compound LULU smoother is introduced and its property of variation decomposition is discussed. The probability distributions of some LULU smoothers with independent data are derived. LULU smoothers and median smoothers are compared according to the properties of monotonicity, idempotency, co-idempotency, stability, edge preservation, output distributions and variation decomposition. A comparison is made of their respective abilities for signal recovery by means of simulations. The success of the smoothers in recovering the signal is measured by the integrated mean square error and by the regression coefficient calculated from the least squares regression of the smoothed sequence on the signal. Finally, LULU smoothers are applied in practice.
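A minimal sketch of the basic L_n and U_n operators built from the minimum and maximum selectors, which are the building blocks of the LULU smoothers discussed above; the compound LULU smoother and the distributional results of the thesis are not reproduced, and the example signal is illustrative.

```python
import numpy as np

def lulu_L(x, n):
    """L_n operator: L_n(x)_i = max over j in [i-n, i] of min(x_j, ..., x_{j+n}).
    It removes upward impulses of width at most n. Windows are truncated
    near the boundaries of the sequence."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    out = np.empty(N)
    for i in range(N):
        mins = [x[j:j + n + 1].min()
                for j in range(max(0, i - n), min(i, N - 1 - n) + 1)]
        out[i] = max(mins) if mins else x[i]
    return out

def lulu_U(x, n):
    """U_n operator: U_n(x)_i = min over j in [i, i+n] of max(x_{j-n}, ..., x_j).
    It removes downward impulses of width at most n."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    out = np.empty(N)
    for i in range(N):
        maxs = [x[j - n:j + 1].max()
                for j in range(max(i, n), min(i + n, N - 1) + 1)]
        out[i] = min(maxs) if maxs else x[i]
    return out

# A smooth signal with one impulsive outlier: the composition removes the
# spike, whereas a linear moving average would smear it into its neighbours.
t = np.linspace(0, 1, 50)
noisy = np.sin(2 * np.pi * t)
noisy[25] += 5.0
smoothed = lulu_U(lulu_L(noisy, n=2), n=2)
print(noisy[24:27].round(2), smoothed[24:27].round(2))
```

Composing the operators in either order removes both upward and downward impulses of width at most n, which is the behaviour underlying the edge-preservation and idempotency comparisons made in the thesis.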
- Statistical inference for inequality measures based on semi-parametric estimators (Stellenbosch : Stellenbosch University, 2011-12). Kpanzou, Tchilabalo Abozou; De Wet, Tertius; Neethling, Ariane; Stellenbosch University, Faculty of Economic and Management Sciences, Dept. of Statistics and Actuarial Science.
  ENGLISH ABSTRACT: Measures of inequality, also used as measures of concentration or diversity, are very popular in economics, especially for measuring the inequality in income or wealth within a population and between populations. However, they have applications in many other fields, e.g. ecology, linguistics, sociology, demography, epidemiology and information science. A large number of measures have been proposed to measure inequality; examples include the Gini index, the generalized entropy, the Atkinson and the quintile share ratio measures. Inequality measures are inherently dependent on the tails of the population (underlying distribution) and therefore their estimators are typically sensitive to data from these tails (nonrobust). For example, income distributions often exhibit a long tail to the right, leading to the frequent occurrence of large values in samples. Since the usual estimators are based on the empirical distribution function, they are usually nonrobust to such large values. Furthermore, heavy-tailed distributions often occur in real-life data sets, and remedial action therefore needs to be taken in such cases. The remedial action can be either a trimming of the extreme data or a modification of the (traditional) estimator to make it more robust to extreme observations. In this thesis we follow the second option, modifying the traditional empirical distribution function as estimator to make it more robust. Using results from extreme value theory, we develop more reliable distribution estimators in a semi-parametric setting. These new estimators of the distribution then form the basis for more robust estimators of the measures of inequality. These estimators are developed for the four most popular classes of measures, viz. Gini, generalized entropy, Atkinson and quintile share ratio. Properties of these estimators are studied, especially via simulation. Using limiting distribution theory and the bootstrap methodology, approximate confidence intervals are derived. Through various simulation studies, the proposed estimators are compared to the standard ones in terms of mean squared error, relative impact of contamination, confidence interval length and coverage probability. In these studies the semi-parametric methods show a clear improvement over the standard ones. The theoretical properties of the quintile share ratio have not been studied much; consequently, we also derive its influence function as well as the limiting normal distribution of its nonparametric estimator, results which have not previously been published. In order to illustrate the methods developed, we apply them to a number of real-life data sets and show how they can be used in practice for inference. In order to choose between the candidate parametric distributions, use is made of a measure of sample representativeness from the literature. These illustrations show that the proposed methods can be used to reach satisfactory conclusions in real-life problems.
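As context for the estimators being robustified, the sketch below shows the standard nonparametric estimators of two of the four measures mentioned, the Gini index and the quintile share ratio, computed from the empirical distribution. The contaminated lognormal sample is a hypothetical illustration of the nonrobustness the abstract describes, not one of the thesis data sets; the semi-parametric tail-adjusted versions are not reproduced.

```python
import numpy as np

def gini(sample):
    """Standard nonparametric Gini estimator based on the empirical
    distribution function (the type of estimator the thesis robustifies)."""
    x = np.sort(np.asarray(sample, dtype=float))
    n = len(x)
    ranks = np.arange(1, n + 1)
    return 2.0 * np.sum(ranks * x) / (n * x.sum()) - (n + 1) / n

def quintile_share_ratio(sample):
    """Quintile share ratio: total income of the richest 20% divided by
    the total income of the poorest 20% (nonparametric version)."""
    x = np.sort(np.asarray(sample, dtype=float))
    k = max(1, len(x) // 5)
    return x[-k:].sum() / x[:k].sum()

# Hypothetical "income" sample; one huge observation shows the sensitivity
# to the right tail discussed in the abstract.
rng = np.random.default_rng(11)
income = rng.lognormal(mean=10, sigma=0.8, size=1000)
print(gini(income), quintile_share_ratio(income))
income[0] = income.max() * 50
print(gini(income), quintile_share_ratio(income))
```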
- Statistical inference of the multiple regression analysis of complex survey data (Stellenbosch : Stellenbosch University, 2016-12). Luus, Retha; De Wet, Tertius; Neethling, Ariane; Stellenbosch University, Faculty of Economic and Management Sciences, Dept. of Statistics and Actuarial Science.
  ENGLISH SUMMARY: The quality of the inferences and results put forward from any statistical analysis depends directly on the correct method being used at the analysis stage. Most survey data analysed in practice originate from stratified multistage cluster samples, or complex samples. In developed countries the statistical analysis of complex sampling (CS) data, for example linear modelling by survey-weighted least squares (SWLS) regression, has received some attention over time. In developing countries such as South Africa and the rest of Africa, SWLS regression is often confused with weighted least squares (WLS) regression or, in some extreme cases, the CS design is ignored and an ordinary least squares (OLS) model is fitted to the data. This is in contrast to what is found in developed countries. Furthermore, especially in developing countries, inference concerning the linear modelling of a continuous response is not as well documented as the inference for a categorical response, specifically a dichotomous response. Hence, the decision was made to research the linear modelling of a continuous response under CS, with the objective of illustrating how the results can differ if the statistician ignores the complex design of the data or naively applies WLS, in comparison with the correct SWLS regression. The complex sampling design leads to observations having unequal inclusion probabilities, the inverse of which is known as the design weight of an observation. Once adjusted for unit non-response and differential non-response, the sampling weights can have large variability, which could have an adverse effect on estimation precision. Weight trimming is cautiously recommended as a remedy for this, but could also increase the bias of an estimator, which in turn affects the estimation precision. The effect of weight trimming on estimation precision is also investigated in this research. Two important parts of regression analysis are researched here, namely the evaluation of the fitted model and the inference concerning the model parameters. The model evaluation part includes the adjustment of well-known prediction error estimation methods, viz. leave-one-out cross-validation, bootstrap estimation and .632 bootstrap estimation, for application to CS data. It also considers a number of outlier detection diagnostics, such as the leverages and Cook's distance. The model parameter inference includes bootstrap variance estimation as well as the construction of bootstrap confidence intervals, viz. the percentile, bootstrap-t and BCa confidence intervals. Two simulation studies are conducted in this thesis. For the first simulation study a model was developed and then used to simulate a hierarchical population from which stratified two-stage cluster samples can be selected. The second simulation study makes use of stratified two-stage cluster samples drawn from real-world data, i.e. the Income and Expenditure Survey of 2005/2006 conducted by Statistics South Africa. Similar conclusions are drawn from both simulation studies: that applying an incorrect linear model to CS data can lead to wrong conclusions, that weight trimming, when conducted with care, further improves estimation precision, and that linear modelling based on resampling methods such as the bootstrap can outperform standard linear modelling methods, especially when applied to real-world data.
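A sketch of the point-estimation contrast the summary describes between OLS (ignoring the design) and survey-weighted least squares, in which each observation is weighted by its design weight. The strata, weights and data below are hypothetical, and the design-based standard errors (linearisation or replication methods such as the bootstrap) studied in the thesis are not reproduced.

```python
import numpy as np

def least_squares(X, y, w=None):
    """OLS when w is None; survey-weighted least squares (SWLS) point
    estimates beta = (X'WX)^{-1} X'Wy when w holds the design weights."""
    if w is None:
        w = np.ones(len(y))
    XtW = X.T * w
    return np.linalg.solve(XtW @ X, XtW @ y)

# Hypothetical two-stratum design with unequal inclusion probabilities:
# stratum 1 is oversampled, so its design weights are smaller.
rng = np.random.default_rng(5)
n1, n2 = 400, 100
x = np.concatenate([rng.normal(0, 1, n1), rng.normal(2, 1, n2)])
y = np.concatenate([1.0 + 0.5 * x[:n1], 1.0 + 2.0 * x[n1:]]) + rng.normal(0, 0.5, n1 + n2)
X = np.column_stack([np.ones_like(x), x])
weights = np.concatenate([np.full(n1, 10.0), np.full(n2, 90.0)])  # inverse inclusion probs

print("OLS  :", least_squares(X, y))            # ignores the design
print("SWLS :", least_squares(X, y, weights))   # weights observations by design weight
```

The SWLS point estimates coincide numerically with WLS estimates that use the same weights; the distinction the summary draws lies in how the weights are interpreted and, above all, in the variance estimation, which must respect the complex design.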
- Time series forecasting and model selection in singular spectrum analysis (Stellenbosch : Stellenbosch University, 2002-11). De Klerk, Jacques; De Wet, Tertius; Stellenbosch University, Faculty of Science, Dept. of Mathematical Sciences.
  ENGLISH ABSTRACT: Singular spectrum analysis (SSA) originated in the field of physics. The technique is non-parametric by nature and finds application, inter alia, in the atmospheric sciences, signal processing and, more recently, financial markets. It can handle a very broad class of time series, which may contain combinations of complex periodicities and polynomial or exponential trend. Forecasting techniques are reviewed in this study, and a new coordinate-free joint-horizon k-period-ahead forecasting formulation is derived. The study also considers model selection in SSA, from which it becomes apparent that forward validation results in more stable model selection. The roots of SSA are outlined and the distributional assumptions of signal series are considered ab initio. Pitfalls that arise in the multivariate statistical theory are identified. Different approaches to recurrent one-period-ahead forecasting are then reviewed. The forecasting approaches are all supplied in algorithmic form to ensure effortless adaptation to computer programs, and the theoretical considerations underlying the forecasting algorithms are also discussed. The new coordinate-free joint-horizon k-period-ahead forecasting formulation is further adapted to the multichannel SSA case. Different model selection techniques are then considered: the use of scree diagrams, phase-space portraits, the percentage variation explained by eigenvectors, and cross- and forward validation are treated in detail. The non-parametric nature of SSA essentially results in the use of non-parametric model selection techniques. Finally, the study also considers a commercially available software package and compares it with Fortran code that was developed as part of the study.
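A minimal sketch of basic single-channel SSA (embedding, singular value decomposition, grouping and diagonal averaging) together with the standard recurrent one-period-ahead forecast of the kind reviewed in the study; the thesis's coordinate-free joint-horizon formulation and the multichannel extension are not reproduced, and the series, window length and number of components below are illustrative choices.

```python
import numpy as np

def ssa_reconstruct(y, L, r):
    """Basic SSA: embed the series in an L x K trajectory matrix, take its SVD,
    keep the r leading components, and reconstruct by diagonal averaging."""
    N = len(y)
    K = N - L + 1
    traj = np.column_stack([y[i:i + L] for i in range(K)])   # Hankel trajectory matrix
    U, s, Vt = np.linalg.svd(traj, full_matrices=False)
    approx = (U[:, :r] * s[:r]) @ Vt[:r]                     # rank-r approximation
    recon = np.zeros(N)
    counts = np.zeros(N)
    for j in range(K):                                       # diagonal averaging
        recon[j:j + L] += approx[:, j]
        counts[j:j + L] += 1
    return recon / counts, U[:, :r]

def ssa_forecast_one_step(recon, U_r):
    """Standard recurrent one-period-ahead SSA forecast: the next value is a
    linear combination of the last L-1 reconstructed values, with coefficients
    built from the leading left singular vectors."""
    pi = U_r[-1, :]                        # last coordinates of the eigenvectors
    nu2 = np.sum(pi ** 2)                  # must be < 1 for the recurrence to exist
    R = (U_r[:-1, :] @ pi) / (1.0 - nu2)   # recurrence coefficients (length L-1)
    return R @ recon[-len(R):]

# Illustrative series: trend + periodicity + noise (not a data set from the study).
rng = np.random.default_rng(2)
t = np.arange(200)
y = 0.02 * t + np.sin(2 * np.pi * t / 12) + 0.2 * rng.normal(size=200)
recon, U_r = ssa_reconstruct(y, L=40, r=3)
print(ssa_forecast_one_step(recon, U_r))   # one-period-ahead forecast
```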