Arşiv ve Dokümantasyon Merkezi
Dijital Arşivi

A tree based categorical variable encoding strategy in supervised learning tasks

Basit öğe kaydını göster

dc.contributor Graduate Program in Computational Science and Engineering.
dc.contributor.advisor Baydoğan, Mustafa Gökçe.
dc.contributor.author Gazioğlu, Mine.
dc.date.accessioned 2023-10-15T06:40:59Z
dc.date.available 2023-10-15T06:40:59Z
dc.date.issued 2022
dc.identifier.other CSE 2022 G38
dc.identifier.uri http://digitalarchive.boun.edu.tr/handle/123456789/19691
dc.description.abstract Categorical variables are present in most real-world datasets, often consisting of a high number of levels, referred to as high-cardinality categorical variables. Most machine learning algorithms do not have an innate mechanism to deal with categor ical variables, hence, their encoding is necessary. Categorical variable encoding is the general term for the conversion of nominal independent variables to a numerical format. Many encoding strategies exist, and they are discussed in this thesis. This the sis presents a novel encoding strategy, categorical split encoding, and also provides an analysis of existing encoding methods. Categorical split encoding uses primary and sur rogate split information as the vector representation for categorical variables, through a tree-based algorithm, this method outputs binary columns for each categorical variable making use of target information. Missing values are imputed by using surrogate infor mation, while clustering similar values together based on the path they take through the decision tree algorithm. Various existing encoding strategies are benchmarked for comparison with the proposed strategy. The performance of categorical split encod ing and other encoding methods is compared with three different machine learning algorithms (generalized linear models, random forest and xgboost) using datasets from regression, binary and multiclass classification settings. Datasets used are made pub licly available for replication purposes. As a result, categorical split encoding provides competitive results compared to existing encoding strategies in various datasets.
dc.publisher Thesis (M.S.) - Bogazici University. Institute for Graduate Studies in Science and Engineering, 2022.
dc.subject.lcsh Categories (Mathematics)
dc.subject.lcsh Supervised learning (Machine learning)
dc.title A tree based categorical variable encoding strategy in supervised learning tasks
dc.format.pages xiii, 59 leaves


Bu öğenin dosyaları

Bu öğe aşağıdaki koleksiyon(lar)da görünmektedir.

Basit öğe kaydını göster

Dijital Arşivde Ara


Göz at

Hesabım