Department of Information Science
Permanent URI for this community
Browse
Browsing Department of Information Science by browse.metadata.advisor "Cleophas, Ir. Loek"
Now showing 1 - 1 of 1
Results Per Page
Sort Options
- ItemAn investigation of dead-zone pattern matching algorithms(Stellenbosch : Stellenbosch University, 2016-02) Mauch, Melanie Barbara; Watson, Bruce W.; Cleophas, Ir. Loek; Stellenbosch University. Faculty of Arts and Social Sciences. Dept. of Information Science.ENGLISH ABSTRACT: Pattern matching allows us to search some text for a word or for a sequence of characters|a popular feature of computer programs such as text editors. Traditionally, three distinct families of pattern matching algorithms exist: the Boyer-Moore (BM) algorithm, the Knuth-Morris-Pratt (KMP) algorithm, and the Rabin-Karp (RK) algorithm. The basic algorithm in all these algorithmic families was developed in the 1970s and 1980s. However a new family of pattern matching algorithms, known as the Dead-Zone (DZ) family of algorithms, has recently been developed. In a previous study, it was theoretically proven that DZ is able to pattern match a text with fewer match attempts than the well- known Horspool algorithm, a derivative of the BM algorithm. The main aim of this study was to provide empirical evidence to determine whether DZ is faster in practice. A benchmark platform was developed to com- pare variants of the DZ algorithm to existing pattern matching algorithms. Initial experiments were performed with four C implementations of the DZ algorithm (two recursive and two iterative implementations). Subsequent to this, DZ variants that make use of di erent shift functions as well as two parallel variants of DZ (implemented with Pthreads and CUDA) were devel- oped. Additionally, the underlying skeleton of the DZ algorithm was tweaked to determine whether the DZ code was optimal. The benchmark results showed that the C implementation of the iterative DZ variants performed favourably. Both iterative algorithms beat traditional pat- tern matching algorithms when searching natural language and genome texts, particularly for short patterns. When di erent shift functions were used, the only time a DZ implementation performed better than an implementation of the traditional algorithm was for a pattern length of 65536 characters. Con- trary to our expectations, the parallel implementation of DZ did not always provide a speedup. In fact, the Pthreaded variants of DZ were slower than the non-threaded DZ implementations, although the CUDA DZ variants were consistently ve times faster than a CPU implementation of Horspool. By us- ing a cache-friendly DZ algorithm, which reduces cache misses by about 20%, the the original DZ can be improved by approximately 5% for relatively short patterns (up to 128 characters with a natural language text). Moreover, a cost of recursion and the impact of information sharing were observed for all DZ variants and have thus been identi ed as intrinsic DZ characteristics. Further research is recommended to determine whether the cache-friendly DZ algorithm should become the standard implementation of the DZ algorithm. In addition, we hope that the development of our benchmark platform has produced a technique that can be used by researchers in future studies to conduct benchmark tests