Abstract:
Many traditional classification approaches focus on minimizing the misclassification rate. However, this metric is not suitable when the class distribution is imbalanced and misclassification costs are unknown. In such cases, the Area Under the Receiver Operating Characteristic Curve (AUC) is an effective metric, which also quantifies the ranking quality of a classifier. Although AUC can be optimized directly by employing mixed integer programming models, these models are challenging to solve due to their large number of binary variables. Alternative formulations, such as margin-maximization approaches that optimize surrogate objectives, have been proposed to solve this problem approximately. These methods extend the classical Support Vector Machine (SVM) formulation and aim to minimize the ranking error while penalizing the model coefficients with a cost parameter in the objective. In these approaches, the cost and kernel-related parameters (e.g., kernel type and degree) must be determined by parameter tuning, since test performance relies heavily on them. The primary aim of this study is to avoid the repetitive experiments required to tune the parameters of margin-maximization approaches. We propose a linear programming model and a column generation approach, namely Ranking-CG, which selects relevant features iteratively to reduce the number of features in the model. Additionally, kernel selection is avoided by using the Euclidean distances between points as features to learn non-linear relations. Ranking-CG is then slightly modified to obtain faster convergence by solving a non-linear subproblem at each iteration to find the vector (i.e., prototype) in the feature space that violates dual feasibility the most. Our experiments show that the modified approach, Ranking-CG Prototype, provides results competitive with margin-maximization approaches while using significantly fewer features.
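The abstract describes AUC as a measure of ranking quality: it equals the fraction of positive-negative pairs that the classifier scores in the correct order. The following is a minimal sketch of that pairwise interpretation, not the optimization model of the paper; the function name and toy data are illustrative only.

```python
import numpy as np

def pairwise_auc(scores, labels):
    """AUC estimated as the fraction of (positive, negative) pairs
    ranked correctly by the scores; ties count as 1/2."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    # Compare every positive score against every negative score.
    diff = pos[:, None] - neg[None, :]
    correct = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return correct / (len(pos) * len(neg))

# Toy example: three positives, two negatives.
scores = np.array([0.9, 0.8, 0.4, 0.7, 0.2])
labels = np.array([1, 1, 1, 0, 0])
print(pairwise_auc(scores, labels))  # 0.833...: five of six pairs ranked correctly
```

The abstract also mentions avoiding kernel selection by using Euclidean distances between points as features. A hedged sketch of that general idea follows; the function and shapes are assumptions for illustration and do not reproduce the paper's Ranking-CG formulation.

```python
import numpy as np

def distance_features(X, prototypes):
    """Represent each sample by its Euclidean distances to a set of
    reference points (prototypes), giving a kernel-free way to
    capture non-linear structure."""
    # Result has shape (n_samples, n_prototypes).
    return np.linalg.norm(X[:, None, :] - prototypes[None, :, :], axis=2)

# Toy usage: describe 4 samples by their distances to 2 reference points.
X = np.random.rand(4, 3)
Z = distance_features(X, X[:2])
print(Z.shape)  # (4, 2)
```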