DEAMS Research Paper Series 2024, 4
CONTENTS / SOMMARIO
Gioia Di Credico, Nicola Torelli
Cluster based oversampling for imbalanced classification
- Cluster based oversampling for imbalanced classification (2024)
  Di Credico, Gioia; Torelli, Nicola
  Oversampling is a widespread remedy for class imbalance in classification problems. Some oversampling techniques generate new cases in the minority class that are similar to the observed ones. ROSE (Random Over Sampling Examples) is an algorithm that generates new data in both the minority and the majority class using kernel density estimation and bootstrap resampling. In practical applications of ROSE, fine tuning of the smoothing parameter of the kernel density estimate is advisable, especially for the rare class. This is particularly true when the rare class is characterized by well separated subgroups. We propose a new strategy, ROSEclust, which pairs density-based clustering methods with ROSE to deal with both a strongly skewed class distribution and grouping within the rare class. Evidence from simulation studies and real data applications shows that the new approach solves some of the issues ROSE has with complex class data structures. The distribution of the synthetic data is closer to the original one, and the predictive performance of classification methods trained on the synthetic data is not compromised. The entire procedure is designed to be free from parameter tuning. Therefore, the ROSEclust strategy broadens the applicability of ROSE and automates the data balancing step, leaving more room for the modelling step.
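The idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation and makes several assumptions that the abstract does not state: a Gaussian kernel with a diagonal bandwidth set by a normal-reference rule of thumb, a binary 0/1 class label, and cluster labels for the rare class that are already available from some density-based clustering step (passed in as a plain array). The function names `normal_reference_bandwidth`, `smoothed_bootstrap` and `balanced_sample` are hypothetical. The sketch only contrasts a single bandwidth computed over the whole rare class with bandwidths computed within each cluster.

```python
import numpy as np

def normal_reference_bandwidth(X):
    """Per-variable bandwidth from a normal-reference rule (assumed here)."""
    n, d = X.shape
    scale = (4.0 / ((d + 2) * n)) ** (1.0 / (d + 4))
    return scale * X.std(axis=0, ddof=1)

def smoothed_bootstrap(X, size, rng, labels=None):
    """Resample rows of X and perturb each with Gaussian kernel noise.

    If `labels` is given, the bandwidth is computed separately within each
    group (each group needs at least two rows); otherwise one bandwidth is
    computed from all rows.
    """
    X = np.asarray(X, dtype=float)
    if labels is None:
        labels = np.zeros(len(X), dtype=int)
    labels = np.asarray(labels)
    H = np.empty_like(X)
    for g in np.unique(labels):
        mask = labels == g
        H[mask] = normal_reference_bandwidth(X[mask])
    idx = rng.integers(0, len(X), size=size)   # bootstrap step
    noise = rng.normal(size=(size, X.shape[1]))  # kernel step
    return X[idx] + noise * H[idx]

def balanced_sample(X, y, n_new, rng, minority_clusters=None):
    """Draw about half of the synthetic rows from each class (y in {0, 1},
    with 1 taken as the rare class). `minority_clusters` holds one cluster
    label per rare-class row and is assumed to come from an external
    clustering step."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    n_min = rng.binomial(n_new, 0.5)
    X_min = smoothed_bootstrap(X[y == 1], n_min, rng, minority_clusters)
    X_maj = smoothed_bootstrap(X[y == 0], n_new - n_min, rng)
    X_new = np.vstack([X_min, X_maj])
    y_new = np.concatenate([np.ones(n_min, dtype=int),
                            np.zeros(n_new - n_min, dtype=int)])
    return X_new, y_new
```

When the rare class consists of two well separated groups, the bandwidth computed over the whole class is inflated by the distance between the groups, so synthetic points spill into the empty region between them. Computing the bandwidth within each cluster keeps the synthetic points close to the observed groups, which is the behaviour the abstract attributes to pairing clustering with ROSE.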