Department of Computer Science

Permanent URI for this community

https://scholar.sun.ac.za/handle/10019.1/96336

Browse

Now showing 1 - 6 of 6

Concept-based exploration of rich semi-structured data collections
(Stellenbosch : Stellenbosch University, 2017-03) Greene, Gillian J.; Fischer, Bernd; Stellenbosch University. Faculty of Science. Dept. of Mathematical Sciences (Computer Science)
ENGLISH ABSTRACT : Search has become one of the fundamental operations in computer science, allowing users to extract data and ultimately information from datasets. However, when users have no previous knowledge of a dataset, or have not clearly defined their search task and are therefore unable to formulate a direct query, their task becomes one of exploratory search or browsing rather than focused search or retrieval. While search and retrieval tools are widely provided, support for browsing of large and especially rich semi-structured datasets, is lacking. Semi-structured data is particularly difficult to explore and query because treating it as complete free-text causes a loss of important additional information which is encoded in the structured portions of the data while considering only the structured fields results in the loss of important free-text information. We therefore develop a framework to support exploration of semi-structured data, which is otherwise difficult to gather insights from, without requiring the user to have prior knowledge of the dataset or have formulated a specific query. Our approach uses a novel combination of tag clouds and concept lattices to facilitate data exploration, analysis and visualization by allowing the user to continuously update (i.e., add and remove) a set of keywords that the searched documents must contain. The document set is not directly provided as the result of a specific query, but aggregated information and properties of relevant documents are provided as a result. We apply our framework to data contained in software repositories, in two different ways for different goals to highlight the flexibility of the approach and the different tasks that can be supported using the same underlying dataset. We also instantiate our framework to support the exploration of a large collection of academic publication data. We evaluate the instantiations of our framework by conducting user and case studies, which indicate that our approach is usable and allows users to gather valuable information from semi-structured data archives.
Coverage directed algorithms for test suite construction from LR-automata
(Stellenbosch : Stellenbosch University, 2022-04) Rossouw, Christoffel Jacobus; Fischer, Bernd; Stellenbosch University. Faculty of Science. Dept. of Computer Science.
ENGLISH ABSTRACT: Bugs in software can have disastrous results in terms of both economic cost and human lives. Parsers can have bugs, like any other type of software, and must therefore be thoroughly tested in order to ensure that a parser recognizes its intended language accurately. However, parsers often need to recognize many different variations and combinations of grammar structures which can make it time consuming and difficult to construct test suites by hand. We therefore require automated methods of test suite construction for these systems. Currently, the majority of test suite construction algorithms focus on the grammar describing the language to be recognized by the parser. In this thesis we take a different approach. We consider the LR-automaton that recognizes the target language and use the context information encoded in the automaton. Specifically, we define a new class of algorithm and coverage criteria over a variant of the LR-automaton that we define, called an LR-graph. We define methods of constructing positive test suites, using paths over this LR-graph, as well as mutations on valid paths to construct negative test suites. We evaluate the performance of our new algorithms against other state-of-the-art algorithms. We do this by comparing coverage achieved over various systems, some smaller systems used in a university level compilers course and other larger, real-world systems. We find good performance of our algorithms over these systems, when compared to algorithms that produce test suites of equivalent size. Our evaluation has uncovered a problem in grammar-based testing algorithms that we call bias. Bias can lead to significant variation in coverage achieved over a system, which can in turn lead to a flawed comparison of two algorithms or unrealized performance when a test suite is used in practice. We therefore define bias and measure it for all grammar-based test suite construction algorithms we use in this thesis.
Design and evaluation of a formula cache for SMT-based bounded model checking tools
(Stellenbosch : Stellenbosch University, 2018-03) Breytenbach, Jean Anré; Fischer, Bernd; Stellenbosch University. Faculty of Science. Dept. of Mathematical Sciences (Computer Science)
ENGLISH ABSTRACT : Program verification is a computationally expensive and time-consuming process. Bounded model checking is a branch of program verification that produces FOL formulas to be checked for satisfiability by an SMT solver. These formulas encode state transitions to states where property violations will occur and the SMT solver attempts to find a list of variable assignments that would create a path to one of these states. Bounded model checking tools create these formulas by iteratively increasing an unwind bound k that dictates the number of transitions that can be present in a path, for each unwind bound k all possible paths of length k are generated. Any state containing a property violation that is generated during the unwind bound k − 1 should also be present during the unwind bound k with perhaps a longer path to reach it. This creates many of the same paths being checked during each subsequent iteration causing the SMT solver to potentially perform duplicate work. This thesis seeks to build and evaluate a formula cache in which to store parts of the formula for which the satisfiability is already known. During subsequent iterations the formula can be sliced by removing the parts that are found in the cache, providing smaller formulas for the SMT solver which should take less time to solve. Similar formula caches have already proven successful in the field of symbolic execution. Multiple techniques are described to modify the formulas to increase the likelihood of finding a match in the cache and these techniques are combined in a multitude of ways to generate a collection of caching strategies. These strategies are then evaluated against multiple data sets to find the best performing strategies and to identify the types of problems that benefit the most from caching. The results are then compared to the results of caching formulas during symbolic execution to gain insight as to how these different approaches effectively implement caching.
Scaling the ConceptCloud browser to very large semi-structured data sets: architecture and data completion
(Stellenbosch : Stellenbosch University, 2020-12) Berndt, Joshua; Fischer, Bernd; Britz, K.; Stellenbosch University. Faculty of Science. Dept. of Mathematical Sciences. Division Computer Science.
ENGLISH ABSTRACT: Semi-structured data sets such as product reviews or event log data are simultaneously becoming more widely used and ever larger. This thesis describes ConceptCloud, a exible, interactive browser for semi-structured datasets, with a focus on the improvements made to accommodate larger datasets, more intuitive data representation and the enrichment of the underlying data by way of data-imputation. ConceptCloud makes use of an intuitive tag cloud visualisation viewer in combination with an underlying concept lattice to provide a formal structure for navigation through datasets without prior knowledge of the structure of the data or compromising scalability. This scalability is achieved by the implementation of architectural changes to increase the system's resource efficiency. These changes are demonstrated by way of a case study on a dataset of wine reviews. Semi-structured data sets such as product reviews or event log data often contain a geolocation aspect: for example, the location of the winery for wine reviews, or the accident location for traffic data. In this thesis, I describe ConceptCloud extensions which allow for the rendering of specialised geolocation data while providing alternate navigation paths through the dataset. I show that using biclusters can make the navigation bidirectional, and demonstrate this approach on a crime data set making use of a geolocation specialised map viewer. Semi-structured data often contains implicit information which will be useful in driving data exploration if made explicit. I take advantage of domain ontologies to both allow implicit data in each input data set to be made explicit and verify and correct inconsistencies allowing for better data exploration. I demonstrate this approach with a continuation of the wine case study.
Test case generation for context free grammars
(Stellenbosch : Stellenbosch University, 2018-03) Esterhuizen, M. H.; Fischer, Bernd; Stellenbosch University. Faculty of Science. Dept. of Mathematical Sciences (Computer Science)
ENGLISH ABSTRACT : Software testing, despite decades of ongoing research, still forms a significant part of the development cycle. When the input domain of a software system must satisfy structural constraints, such as those specified by file formats or command line arguments, test input construction has the potential to be an extremely time consuming process. In this work we investigate how structured test inputs may be constructed through the use of formal grammars and how the input domain of a software system may be adequately represented by the constructed test inputs. We develop a framework which facilitates the automatic generation of test inputs and use it to implement a number of test input generation techniques to satisfy specific coverage criteria. We also introduce a method of test case case generation that exploits the structure of LR-automata to generate sets of passing and failing test cases. The investigations conducted in this work focusses on primarily two areas. Firstly, we investigate how the LR-automata based method of generating test inputs compares to the existing methods. Secondly we seek to verify that the LR-automata and existing methods of test input generation conform to their expected running times. The framework and algorithms are evaluated, in the context of two university level compiler courses as a method to assess parsers, generated from handwritten grammars, submitted by students and compared to the method of assessing submissions with test inputs constructed manually. We find that, in our sample set, assessment based on the positive test inputs generated by the LR-automata based method correlates most closely to assessment based on manually constructed test inputs. This indicates that the method would be most appropriate as a replacement for manually constructing test inputs. The results obtained from assessment based on negative test inputs generated by both, grammar an LR-automata, based methods is found to be unsatisfactory as there is no correlation between them and the results obtained through assessment with a set of manually constructed negative test inputs. An assessment of the performance of the algorithms is also given. We find that the running time of the existing methods of test input generation, based on context free grammars, varies linearly with respect to the size of the grammar and that the running time of the LR-automata based methods varies linearly with respect to the size of the constructed parsing automata.
Untitled
(Stellenbosch : Stellenbosch University, 2023-03) Raselimo, Moeketsi; Fischer, Bernd; Stellenbosch University. Faculty of Science. Dept. of Computer Science.
ENGLISH ABSTRACT: Context-free grammars (CFGs) or simply grammars, are widely used to describe and encode the structure of complex objects. Systems that process such structured objects are known as grammarware. CFGs are also used for software engineering tasks such as testing grammarware, specifically in test suite generation and fuzzing. However, correct and complete CFGs are rarely available, which limits the benefits of these automated techniques. While current grammar engineering approaches can demonstrate the presence of faults in CFGs, none of them can be used to confirm the locations of these faults, much less repair them. In this thesis, we address this problem and develop, implement and evaluate novel automated methods that find locations of faults in CFGs and repair them against a test suite specification. We describe and evaluate the first spectrum-based method aimed at finding faults CFGs. In its basic form, it takes as input a test suite and a modified parser for the grammar that can collect grammar spectra, i.e., the sets of grammar elements used in attempts to parse the individual test cases, and returns as output a ranked list of suspicious elements. We define grammar spectra suitable for localizing faults on the level of the grammar rules and at the rules’ individual symbols, respectively. We show how these types of spectra can be collected in widely used parsing tools. We also show how the spectra can be constructed directly from test cases derived from a grammar, and how these synthetic spectra can be used for localization in cases where standalone parsers implementing the grammars are not available. We qualitatively and quantitatively demonstrate the effectiveness of our fault localization approach. We first evaluate our method over a large number of medium-sized single fault CFGs, which we constructed by fault seeding from a common origin grammar. At the rule level, it ranks the rules containing the seeded faults within the top five rules in about 40%–70% of the cases, and pinpoints them (i.e., correctly identifies them as unique most suspicious rule) in about 10%–30% of the cases, with significantly better results for the synthetic spectra. On average, it ranks the faulty rules within about 25% of all rules. At the item level, our method remains remarkably effective despite the larger number of possible locations: it typically ranks the seeded faults within the top five positions in about 30%–60% of the cases, and pinpoints them in about 15%–40% of the cases. On average, it ranks the seeded faults within about 10%–20% of all positions. We further evaluate our method over CFGs that contain real faults. We show that an iterative approach can be used to localize and manually remove one by one multiple faults in grammars submitted by students enrolled in various compiler engineering courses; in most iterations, the topranked rule already contains an error, and no error is ranked outside the top five ranked rules. We finally apply our method to a large open-source SQLite grammar and show where the original version deviates from the language accepted by the actual SQLite system. We then describe the first method to automatically repair faults in CFGs: given a grammar that fails some tests in a given test suite, we iteratively and gradually transform the grammar until it passes all tests. Our core idea is to build on spectrum-based fault localization to identify promising repair sites (i.e., specific positions in rules), and to apply grammar patches at these sites whenever they satisfy explicitly formulated pre-conditions necessary to potentially improve the grammar. We implement and evaluate passive and active repair variants of our approach. In passive repair, we repair against the fixed input test suite as specification. The active repair variant takes a black-box parser for the unknown target language or oracle that can answer membership queries. The key extension of active repair is to incorporate some test suite enrichment, by generating additional tests from each repair candidate and using the oracle to confirm the outcome of each of these tests. We demonstrate the effectiveness of both repair variants using thirtythree student grammars that contain multiple faults. We show that both variants are effective in fixing real faults in these grammars. A like-for-like comparison of both variants shows that active repair produces more high quality repairs than passive repair.

Browse

Browsing Department of Computer Science by browse.metadata.advisor "Fischer, Bernd"

Results Per Page

Sort Options