Cluster based oversampling for imbalanced classification
Di Credico, Gioia
Torelli, Nicola
2024
Abstract
Oversampling is a widespread remedy for class imbalance in classification problems. Some oversampling techniques generate new minority-class cases that are similar to the observed ones. ROSE (Random Over Sampling Examples) is an algorithm that generates new data in both the minority and majority classes, using kernel density estimation and bootstrap resampling. In practical applications of ROSE, fine-tuning the smoothing parameter of the kernel density estimate is advisable, especially for the rare class. This is particularly true when well-separated subgroups characterize the rare class. We propose a new strategy, ROSEclust, which pairs density-based clustering methods with ROSE to deal with a strongly skewed class distribution and with grouping within the rare class. Evidence from simulation studies and real data applications shows that the new approach solves some of the issues ROSE has with complex class structures. The synthetic data distribution is closer to the original one, and the predictive performance of classification methods applied to the synthetic data is not compromised. The entire procedure is designed to be free from parameter tuning. The ROSEclust strategy therefore broadens the applicability of ROSE and automates the data balancing step, leaving more room for the modelling step.
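The smoothed-bootstrap idea behind ROSE, as summarized in the abstract, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `smoothed_bootstrap` and `rose_like_sample` are hypothetical, the Gaussian kernel with a per-feature normal-reference bandwidth is an assumption, and the toy data exist only for illustration. The clustering step of ROSEclust is not shown; one plausible reading of the abstract is that the same smoothing step would be applied within each cluster of the rare class rather than to the rare class as a whole.

```python
import numpy as np


def smoothed_bootstrap(X, n_new, rng):
    """Draw n_new synthetic rows from a Gaussian kernel density estimate of X."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    # Assumed bandwidth: normal-reference rule, one value per feature.
    # Needs at least two rows, otherwise the sample std is undefined.
    h = (4.0 / ((d + 2) * n)) ** (1.0 / (d + 4)) * X.std(axis=0, ddof=1)
    # Bootstrap step: pick observed rows with replacement.
    idx = rng.integers(0, n, size=n_new)
    # Smoothing step: perturb each picked row with Gaussian kernel noise.
    return X[idx] + rng.standard_normal((n_new, d)) * h


def rose_like_sample(X, y, n_total, rng):
    """Generate n_total synthetic cases, choosing each class with equal probability."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    # Number of synthetic cases per class, drawn with equal class probabilities.
    probs = np.full(len(classes), 1.0 / len(classes))
    counts = rng.multinomial(n_total, probs)
    parts = [smoothed_bootstrap(X[y == c], k, rng) for c, k in zip(classes, counts)]
    return np.vstack(parts), np.repeat(classes, counts)


# Toy imbalanced data: 95 majority cases and 5 minority cases in two dimensions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (95, 2)), rng.normal(3.0, 1.0, (5, 2))])
y = np.array([0] * 95 + [1] * 5)

X_new, y_new = rose_like_sample(X, y, 200, rng)
print(X_new.shape, np.bincount(y_new))
```

With a single bandwidth computed over the whole rare class, well-separated subgroups inflate the per-feature standard deviation, so synthetic points can land between the subgroups. This is the situation the abstract identifies as needing either careful tuning of the smoothing parameter or a clustering step.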
Source
Gioia Di Credico and Nicola Torelli, "Cluster based oversampling for imbalanced classification", Trieste, EUT Edizioni Università di Trieste, 2024
Languages
en
Rights
Attribution-NonCommercial-NoDerivatives 4.0 International