The majority classes’ reducing method of imbalanced datasets

Bibliographic details
Date: 2018
Authors: Kavrin, D. A., Subbotin, S. A.
Format: Article
Language: Russian
Published: Інститут проблем реєстрації інформації НАН України (Institute for Information Recording of the NAS of Ukraine), 2018
Online access: http://drsp.ipri.kiev.ua/article/view/142902
Journal title: Data Recording, Storage & Processing

Description
Abstract: To speed up the construction of diagnostic and recognition models, a smaller subsample that preserves the basic properties of the dataset must be extracted from the original sample. The work addresses the problem of sample selection from large imbalanced datasets for constructing diagnostic and pattern recognition models. The goal of the work is to create a method for automating sample selection from a large imbalanced dataset, based on the principles of undersampling. A method for automating sample selection from the original large imbalanced dataset is proposed. The method consists of two phases. The first phase reduces the size of the original dataset while maintaining its important topological properties by reducing the majority class. The second phase restores the quantitative balance of the classes by generating synthetic examples of the minority class. Thus, under class imbalance, the method restores the balance and reduces the training sample while maintaining important topological properties of the original dataset, which makes it possible to build a high-accuracy model within an acceptable operating time. Software implementing the proposed method has been developed and used in computational experiments on synthetic and real imbalanced datasets. The experiments confirmed the efficiency and operability of the proposed method and of the software that implements it. The developed method and software automate training sample selection under class imbalance for the synthesis of diagnostic and recognition models by precedents.
Prospects for further research lie in developing an implementation of the proposed method for multiprocessor systems operating in parallel modes, and in studying it experimentally on larger datasets from practical problems of different nature and dimension.
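
The abstract names two phases but does not give the algorithms behind them, so the following Python sketch is an illustration of the general idea only, not the authors' method. As assumptions, phase one is replaced by a greedy farthest-point selection (a simple way to keep the retained majority points spread over the region the class occupies), and phase two by SMOTE-style interpolation between neighbouring minority points. The function names `reduce_majority`, `oversample_minority` and `two_phase_sample`, and the parameters `keep`, `k` and `seed`, are hypothetical.

```python
import math
import random


def reduce_majority(majority, keep, rng):
    """Phase 1 stand-in (assumption): greedy farthest-point selection."""
    if keep >= len(majority):
        return list(majority)
    start = rng.randrange(len(majority))
    chosen = [start]
    # Distance from each point to its nearest already-chosen point.
    nearest = [math.dist(p, majority[start]) for p in majority]
    nearest[start] = -1.0  # mark as taken so it is never picked again
    while len(chosen) < keep:
        idx = max(range(len(majority)), key=nearest.__getitem__)
        chosen.append(idx)
        for i, p in enumerate(majority):
            nearest[i] = min(nearest[i], math.dist(p, majority[idx]))
        nearest[idx] = -1.0
    return [majority[i] for i in chosen]


def oversample_minority(minority, target, k, rng):
    """Phase 2 stand-in (assumption): SMOTE-style interpolation."""
    if len(minority) < 2:
        raise ValueError("need at least two minority points to interpolate")
    synthetic = []
    while len(minority) + len(synthetic) < target:
        base_i = rng.randrange(len(minority))
        base = minority[base_i]
        # Indices of the k nearest minority neighbours of the base point.
        others = sorted(
            (j for j in range(len(minority)) if j != base_i),
            key=lambda j: math.dist(base, minority[j]),
        )[:k]
        mate = minority[rng.choice(others)]
        t = rng.random()  # position on the segment between base and mate
        synthetic.append(tuple(b + t * (m - b) for b, m in zip(base, mate)))
    return list(minority) + synthetic


def two_phase_sample(majority, minority, keep, k=3, seed=0):
    """Shrink the majority class, then grow the minority class to match."""
    rng = random.Random(seed)
    reduced = reduce_majority(majority, keep, rng)
    grown = oversample_minority(minority, len(reduced), k, rng)
    return reduced, grown
```

Calling `two_phase_sample(majority, minority, keep=40)` on lists of coordinate tuples returns 40 retained majority points and a minority class padded to 40 points with synthetic examples. The farthest-point step costs time proportional to `keep` times the majority size, so for the large datasets the abstract targets it would need a faster substitute.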