Abstract:
A vast majority of studies in machine learning focus on time-directed, or in other words sequential, processes. The objectives of these studies range from classification and prediction to clustering and segmentation. Since the dimension of these datasets can be very high as a consequence of the sequential process, the sequences must be mapped to a lower-dimensional representation for learning tasks. Probabilistic and data-adaptive representation approaches are prominent in the literature. This thesis provides a new data-adaptive representation method for categorical time series that allows any supervised learning algorithm to be applied. The proposed method, namely SW-RF (Sliding Window-Random Forest), requires two main steps to learn a representation for categorical time series. The initial representation is constructed with a sliding window algorithm using a predetermined window size. Then, a decision tree classifier is trained on this simple representation, and a numerical vector representation is obtained for each sequence from the frequency of its subsequences on the leaf nodes of the decision trees. Categorical sequences of varying length and with missing values are handled efficiently by the tree learners in SW-RF. The method performs efficiently even when the number of symbols in the sequence is high. The classification accuracy of SW-RF is compared with that of the k-mers and Hidden Markov Model representations, since these two are common representation methods in the literature. Experiments show that the proposed approach provides significantly better accuracy on both synthetic data and DNA promoter sequence data.
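As a rough illustration of the first step only, the sliding window pass over a categorical sequence can be sketched as below. This is a minimal sketch under stated assumptions, not the thesis implementation: the function names `sliding_windows` and `kmer_counts`, the example sequence, and the window size of 3 are all hypothetical. The second function counts raw subsequence frequencies, which corresponds to the k-mers baseline; SW-RF, as described above, would instead count how often a sequence's subsequences fall on each leaf node of the trained trees, a step not shown here.

```python
from collections import Counter


def sliding_windows(sequence, window_size):
    """Return every contiguous subsequence of length window_size.

    Hypothetical helper: sequences shorter than the window yield no windows.
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    return [
        tuple(sequence[i:i + window_size])
        for i in range(len(sequence) - window_size + 1)
    ]


def kmer_counts(sequence, k):
    """Count how often each length-k subsequence occurs (k-mers baseline)."""
    return Counter(sliding_windows(sequence, k))


# Assumed toy DNA-like sequence; the window size is chosen arbitrarily.
windows = sliding_windows("ACGTAC", 3)
counts = kmer_counts("ACGTAC", 3)
```

Each window would then become one training row for the tree learner, labeled with the class of the sequence it came from.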